All of lore.kernel.org
* Need help to recover root filesystem after a power supply issue
@ 2019-07-10  9:47 Andrey Zhunev
  2019-07-10 14:30 ` Chris Murphy
  0 siblings, 1 reply; 23+ messages in thread
From: Andrey Zhunev @ 2019-07-10  9:47 UTC (permalink / raw)
  To: linux-xfs

Hello All,

I am struggling to recover my system after a PSU failure.

One of the hard drives throws read errors, and it happens to be
my root drive...
My system is CentOS 7, and the root partition is part of LVM.

[root@mgmt ~]# lvscan
  ACTIVE            '/dev/centos/root' [<98.83 GiB] inherit
  ACTIVE            '/dev/centos/home' [<638.31 GiB] inherit
  ACTIVE            '/dev/centos/swap' [<7.52 GiB] inherit
[root@mgmt ~]#

[root@tftp ~]# file -s /dev/centos/root
/dev/centos/root: symbolic link to `../dm-3'
[root@tftp ~]# file -s /dev/centos/home
/dev/centos/home: symbolic link to `../dm-4'
[root@tftp ~]# file -s /dev/dm-3
/dev/dm-3: SGI XFS filesystem data (blksz 4096, inosz 256, v2 dirs)
[root@tftp ~]# file -s /dev/dm-4
/dev/dm-4: SGI XFS filesystem data (blksz 4096, inosz 256, v2 dirs)


[root@tftp ~]# xfs_repair /dev/centos/root
Phase 1 - find and verify superblock...
superblock read failed, offset 53057945600, size 131072, ag 2, rval -1

fatal error -- Input/output error
[root@tftp ~]#


smartctl shows some pending sectors on /dev/sda, and no reallocated
sectors (yet?).
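A quick way to watch just those two attributes is to filter the smartctl output. This is a sketch; the sample lines below are illustrative, not taken from this drive:

```shell
# Filter the sector-health attributes out of SMART attribute output.
# On a live system you would pipe `smartctl -A /dev/sda` instead of
# the hypothetical sample text used here.
sample='  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       68'
printf '%s\n' "$sample" | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector'
```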

Can someone please give me a hand to bring the root partition back to
life (ideally)? Or, at least, to recover a couple of critical
configuration files...


---
Best regards,
 Andrey                    

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Need help to recover root filesystem after a power supply issue
  2019-07-10  9:47 Need help to recover root filesystem after a power supply issue Andrey Zhunev
@ 2019-07-10 14:30 ` Chris Murphy
  2019-07-10 15:28   ` Andrey Zhunev
  0 siblings, 1 reply; 23+ messages in thread
From: Chris Murphy @ 2019-07-10 14:30 UTC (permalink / raw)
  To: Andrey Zhunev; +Cc: xfs list

On Wed, Jul 10, 2019 at 3:52 AM Andrey Zhunev <a-j@a-j.ru> wrote:
>
> [root@tftp ~]# xfs_repair /dev/centos/root
> Phase 1 - find and verify superblock...
> superblock read failed, offset 53057945600, size 131072, ag 2, rval -1
>
> fatal error -- Input/output error
> [root@tftp ~]#

# smartctl -l scterc /dev/

Point it at the physical device. If it's a consumer drive, it might
support a configurable SCT ERC. We also need to see the kernel
messages from the time of the I/O error. There's some chance that, if
a deep recovery read is possible, it'll recover the data. But I don't
see how this is related to a power supply failure.


-- 
Chris Murphy


* Re: Need help to recover root filesystem after a power supply issue
  2019-07-10 14:30 ` Chris Murphy
@ 2019-07-10 15:28   ` Andrey Zhunev
  2019-07-10 15:45     ` Chris Murphy
  0 siblings, 1 reply; 23+ messages in thread
From: Andrey Zhunev @ 2019-07-10 15:28 UTC (permalink / raw)
  To: Chris Murphy; +Cc: xfs list

Wednesday, July 10, 2019, 5:30:37 PM, you wrote:

> On Wed, Jul 10, 2019 at 3:52 AM Andrey Zhunev <a-j@a-j.ru> wrote:
>>
>> [root@tftp ~]# xfs_repair /dev/centos/root
>> Phase 1 - find and verify superblock...
>> superblock read failed, offset 53057945600, size 131072, ag 2, rval -1
>>
>> fatal error -- Input/output error
>> [root@tftp ~]#

> # smartctl -l scterc /dev/

> Point it at the physical device. If it's a consumer drive, it might
> support a configurable SCT ERC. We also need to see the kernel
> messages from the time of the I/O error. There's some chance that, if
> a deep recovery read is possible, it'll recover the data. But I don't
> see how this is related to a power supply failure.



Well, this machine is always online (24/7, with a UPS backup power).
Yesterday we found it switched OFF, without any signs of life. Trying
to switch it on, the PSU made a humming noise and the machine didn't
even try to start. So we replaced the PSU. After that, the machine
powered on - but refused to boot... Something tells me these two
failures are likely related...



# smartctl -l scterc /dev/sda
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-957.el7.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

#

This is a WD RED series drive, WD30EFRX.
Here are some more of the error messages from kernel log file:

Jul 10 11:59:03 mgmt kernel: ata1.00: exception Emask 0x0 SAct 0x100000 SErr 0x0 action 0x0
Jul 10 11:59:03 mgmt kernel: ata1.00: irq_stat 0x40000008
Jul 10 11:59:03 mgmt kernel: ata1.00: failed command: READ FPDMA QUEUED
Jul 10 11:59:03 mgmt kernel: ata1.00: cmd 60/08:a0:d8:c3:84/00:00:0a:00:00/40 tag 20 ncq 4096 in#012         res 41/40:00:d8:c3:84/00:00:0a:00:00/40 Emask 0x409 (media error) <F>
Jul 10 11:59:03 mgmt kernel: ata1.00: status: { DRDY ERR }
Jul 10 11:59:03 mgmt kernel: ata1.00: error: { UNC }
Jul 10 11:59:03 mgmt kernel: ata1.00: configured for UDMA/133
Jul 10 11:59:03 mgmt kernel: sd 0:0:0:0: [sda] tag#20 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 10 11:59:03 mgmt kernel: sd 0:0:0:0: [sda] tag#20 Sense Key : Medium Error [current] [descriptor]
Jul 10 11:59:03 mgmt kernel: sd 0:0:0:0: [sda] tag#20 Add. Sense: Unrecovered read error - auto reallocate failed
Jul 10 11:59:03 mgmt kernel: sd 0:0:0:0: [sda] tag#20 CDB: Read(16) 88 00 00 00 00 00 0a 84 c3 d8 00 00 00 08 00 00
Jul 10 11:59:03 mgmt kernel: blk_update_request: I/O error, dev sda, sector 176473048
Jul 10 11:59:03 mgmt kernel: Buffer I/O error on dev sda, logical block 22059131, async page read
Jul 10 11:59:03 mgmt kernel: ata1: EH complete
Jul 10 11:59:05 mgmt kernel: ata1.00: exception Emask 0x0 SAct 0x1000000 SErr 0x0 action 0x0
Jul 10 11:59:05 mgmt kernel: ata1.00: irq_stat 0x40000008
Jul 10 11:59:05 mgmt kernel: ata1.00: failed command: READ FPDMA QUEUED
Jul 10 11:59:05 mgmt kernel: ata1.00: cmd 60/08:c0:d8:c3:84/00:00:0a:00:00/40 tag 24 ncq 4096 in#012         res 41/40:00:d8:c3:84/00:00:0a:00:00/40 Emask 0x409 (media error) <F>
Jul 10 11:59:05 mgmt kernel: ata1.00: status: { DRDY ERR }
Jul 10 11:59:05 mgmt kernel: ata1.00: error: { UNC }
Jul 10 11:59:05 mgmt kernel: ata1.00: configured for UDMA/133
Jul 10 11:59:05 mgmt kernel: sd 0:0:0:0: [sda] tag#24 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 10 11:59:05 mgmt kernel: sd 0:0:0:0: [sda] tag#24 Sense Key : Medium Error [current] [descriptor]
Jul 10 11:59:05 mgmt kernel: sd 0:0:0:0: [sda] tag#24 Add. Sense: Unrecovered read error - auto reallocate failed
Jul 10 11:59:05 mgmt kernel: sd 0:0:0:0: [sda] tag#24 CDB: Read(16) 88 00 00 00 00 00 0a 84 c3 d8 00 00 00 08 00 00
Jul 10 11:59:05 mgmt kernel: blk_update_request: I/O error, dev sda, sector 176473048
Jul 10 11:59:05 mgmt kernel: Buffer I/O error on dev sda, logical block 22059131, async page read
Jul 10 11:59:05 mgmt kernel: ata1: EH complete
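As a consistency check on the log above (plain shell arithmetic, numbers taken from the messages): the 512-byte sector in the blk_update_request line and the 4096-byte "logical block" in the following line refer to the same spot on disk:

```shell
# Relate the two error lines: 512-byte LBA -> byte offset and 4 KiB block.
lba=176473048
echo "byte offset : $((lba * 512))"
echo "4 KiB block : $((lba / 8))"   # 22059131, as in the log
```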





---
Best regards,
 Andrey


* Re: Need help to recover root filesystem after a power supply issue
  2019-07-10 15:28   ` Andrey Zhunev
@ 2019-07-10 15:45     ` Chris Murphy
  2019-07-10 16:07       ` Andrey Zhunev
  0 siblings, 1 reply; 23+ messages in thread
From: Chris Murphy @ 2019-07-10 15:45 UTC (permalink / raw)
  To: Andrey Zhunev; +Cc: Chris Murphy, xfs list

On Wed, Jul 10, 2019 at 9:29 AM Andrey Zhunev <a-j@a-j.ru> wrote:
>
> Well, this machine is always online (24/7, with a UPS backup power).
> Yesterday we found it switched OFF, without any signs of life. Trying
> to switch it on, the PSU made a humming noise and the machine didn't
> even try to start. So we replaced the PSU. After that, the machine
> powered on - but refused to boot... Something tells me these two
> failures are likely related...

Most likely the drive is dying and the spin down from power failure
and subsequent spin up has increased the rate of degradation, and
that's why they seem related.

What do you get for:

# smartctl -x /dev/sda



>
>
>
> # smartctl -l scterc /dev/sda
> smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-957.el7.x86_64] (local build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
>
> SCT Error Recovery Control:
>            Read:     70 (7.0 seconds)
>           Write:     70 (7.0 seconds)

Good news. This can be raised by a lot, and maybe you'll recover the
bad sectors. You need to do two things. You might have to iterate some
of this, because I don't know what the max SCT ERC value is for this
make/model of drive. Consumer drives can have really high caps, upwards
of three minutes, which is ridiculous but off topic. I'd like to think
60 seconds would be enough and also below whatever cap the drive
firmware has. Also, I've had drive firmware crash when issuing
multiple SCT ERC changes - so if the drive starts doing new crazy
things, we won't know whether it's a firmware bug or, more likely,
the drive continuing to degrade.

I would shoot for a 90 second SCT ERC for reads, and hopefully that's
long enough and also isn't above the max value for this make/model.

# smartctl -l scterc,900,100

And next, raise the kernel's command timer into the stratosphere so
that it won't get mad and do a link reset if the drive takes a long
time to recover.

# echo 180 > /sys/block/sda/device/timeout

In this configuration, it's possible that every single read command
for a (marginally) bad sector will take 90 seconds. So if you have a
bunch of these, an fsck might take hours. So that's not necessarily
how I would do it. Best to look at the smartctl -x output first, to
get some idea how many bad sectors there might be.



>
> #
>
> This is a WD RED series drive, WD30EFRX.

Yeah, this is a NAS drive, and this low 70-decisecond value is meant
for RAID. It's a suboptimal value if you're using it as a boot drive.
But deal with that later, after recovery.
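For clarity, SCT ERC values are expressed in deciseconds, so the drive's reported 70 works out as:

```shell
# Convert the drive's SCT ERC value (deciseconds) to seconds.
erc=70
echo "$((erc / 10)).$((erc % 10)) seconds"   # 7.0 seconds
```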

>Jul 10 11:48:05 mgmt kernel: blk_update_request: I/O error, dev sda, sector 54439176

> Jul 10 11:59:03 mgmt kernel: blk_update_request: I/O error, dev sda, sector 176473048
> Jul 10 11:59:05 mgmt kernel: blk_update_request: I/O error, dev sda, sector 176473048

So at least two bad sectors, and they aren't anywhere near each other.
The smartctl -x output might give us an idea how bad the drive is.
Anyway, these drives have decent warranties, but the manufacturer will
want the drive returned. So if there's anything sensitive on it and
it's not encrypted, you'll want it still working long enough to wipe
it.



-- 
Chris Murphy


* Re: Need help to recover root filesystem after a power supply issue
  2019-07-10 15:45     ` Chris Murphy
@ 2019-07-10 16:07       ` Andrey Zhunev
  2019-07-10 16:46         ` Chris Murphy
  2019-07-10 16:51         ` Chris Murphy
  0 siblings, 2 replies; 23+ messages in thread
From: Andrey Zhunev @ 2019-07-10 16:07 UTC (permalink / raw)
  To: Chris Murphy; +Cc: xfs list


Wednesday, July 10, 2019, 6:45:28 PM, you wrote:

> On Wed, Jul 10, 2019 at 9:29 AM Andrey Zhunev <a-j@a-j.ru> wrote:
>>
>> Well, this machine is always online (24/7, with a UPS backup power).
>> Yesterday we found it switched OFF, without any signs of life. Trying
>> to switch it on, the PSU made a humming noise and the machine didn't
>> even try to start. So we replaced the PSU. After that, the machine
>> powered on - but refused to boot... Something tells me these two
>> failures are likely related...

> Most likely the drive is dying and the spin down from power failure
> and subsequent spin up has increased the rate of degradation, and
> that's why they seem related.

> What do you get for:

> # smartctl -x /dev/sda


The '-x' option gives a lot of output.
It's pasted here: https://pastebin.com/raw/yW3yDuSF


Well, if there's evidence the drive is really dying - so be it...
I just need to recover the data, if possible.
On the other hand, if the drive keeps working, I'll find some
unimportant files to store on it...


---
Best regards,
 Andrey


* Re: Need help to recover root filesystem after a power supply issue
  2019-07-10 16:07       ` Andrey Zhunev
@ 2019-07-10 16:46         ` Chris Murphy
  2019-07-10 16:47           ` Chris Murphy
  2019-07-10 16:51         ` Chris Murphy
  1 sibling, 1 reply; 23+ messages in thread
From: Chris Murphy @ 2019-07-10 16:46 UTC (permalink / raw)
  To: Andrey Zhunev; +Cc: Chris Murphy, xfs list

On Wed, Jul 10, 2019 at 10:08 AM Andrey Zhunev <a-j@a-j.ru> wrote:
>
>
> Wednesday, July 10, 2019, 6:45:28 PM, you wrote:
>
> > On Wed, Jul 10, 2019 at 9:29 AM Andrey Zhunev <a-j@a-j.ru> wrote:
> >>
> >> Well, this machine is always online (24/7, with a UPS backup power).
> >> Yesterday we found it switched OFF, without any signs of life. Trying
> >> to switch it on, the PSU made a humming noise and the machine didn't
> >> even try to start. So we replaced the PSU. After that, the machine
> >> powered on - but refused to boot... Something tells me these two
> >> failures are likely related...
>
> > Most likely the drive is dying and the spin down from power failure
> > and subsequent spin up has increased the rate of degradation, and
> > that's why they seem related.
>
> > What do you get for:
>
> > # smartctl -x /dev/sda
>
>
> The '-x' option gives a lot of output.
> It's pasted here: https://pastebin.com/raw/yW3yDuSF

197 Current_Pending_Sector  -O--CK   200   200   000    -    68


> Well, if there's evidence the drive is really dying - so be it...
> I just need to recover the data, if possible.
> On the other hand, if the drive keeps working, I'll find some
> unimportant files to store on it...

I think 68 pending sectors is excessive, and I'd plan to have the
drive replaced under warranty, or demote it to something you don't
care about. Chances are this is going to get worse. I don't know how
many reserve sectors drives have; I don't even have a guess. But I
have seen drives run out of reserve sectors, at which point you start
to see write failures, because LBAs can't be remapped from a bad
sector that fails writes to a good one. At that point, the drive is
untenable.

Anyway, it's a bit tedious to fix 68 sectors manually, so if you have
the time to just wait for it, try this:


# smartctl -l scterc,900,100
# echo 180 > /sys/block/sda/device/timeout

And now try to fsck.

If it fails with an I/O error very quickly, as in less than 90
seconds, that means the drive firmware has concluded deep recovery
won't matter and is giving up pretty much immediately. At that point,
those sectors are lost. You could overwrite those sectors one by one
with zeros, and maybe xfs_repair will have enough information to
reconstruct and repair things well enough to copy data off. But you'll
have to be suspicious of every file, as any one of them could have
been silently corrupted - either by bad ECC reconstruction in the
drive firmware or by the overwriting with zeros.
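A minimal sketch of that overwrite-with-zeros step, demonstrated on a throwaway image file rather than the real disk (the sector number is made up; on the real device this destroys that sector's contents, so substitute paths with extreme care):

```shell
# Zero one 512-byte "sector" in a scratch image and verify the result
# (bash, for the process substitution).
img=$(mktemp)
dd if=/dev/urandom of="$img" bs=512 count=16 status=none   # fake 16-sector disk
sector=5                                                   # hypothetical bad sector
dd if=/dev/zero of="$img" bs=512 seek="$sector" count=1 conv=notrunc status=none
# confirm the sector now reads back as all zeros
if cmp -s <(dd if="$img" bs=512 skip="$sector" count=1 status=none) \
          <(head -c 512 /dev/zero); then
    echo "sector $sector zeroed"
fi
rm -f "$img"
```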

I'd say there's a decent chance of recovery but it will be tedious.

If it seems like the system is hanging without errors, that's actually
a good sign: deep recovery is working. But like I said, it could take
hours. And in the end it might still find a totally unrecoverable
sector.


-- 
Chris Murphy


* Re: Need help to recover root filesystem after a power supply issue
  2019-07-10 16:46         ` Chris Murphy
@ 2019-07-10 16:47           ` Chris Murphy
  2019-07-10 17:16             ` Andrey Zhunev
  0 siblings, 1 reply; 23+ messages in thread
From: Chris Murphy @ 2019-07-10 16:47 UTC (permalink / raw)
  Cc: Andrey Zhunev, xfs list

On Wed, Jul 10, 2019 at 10:46 AM Chris Murphy <lists@colorremedies.com> wrote:
>
> # smartctl -l scterc,900,100
> # echo 180 > /sys/block/sda/device/timeout


The smartctl command above does need a drive specified...



-- 
Chris Murphy


* Re: Need help to recover root filesystem after a power supply issue
  2019-07-10 16:07       ` Andrey Zhunev
  2019-07-10 16:46         ` Chris Murphy
@ 2019-07-10 16:51         ` Chris Murphy
  1 sibling, 0 replies; 23+ messages in thread
From: Chris Murphy @ 2019-07-10 16:51 UTC (permalink / raw)
  To: Andrey Zhunev; +Cc: xfs list

On Wed, Jul 10, 2019 at 10:08 AM Andrey Zhunev <a-j@a-j.ru> wrote:
>
>
> Wednesday, July 10, 2019, 6:45:28 PM, you wrote:
>
> The '-x' option gives a lot of output.
> It's pasted here: https://pastebin.com/raw/yW3yDuSF

  9 Power_On_Hours          -O--CK   022   022   000    -    56941

56941 ÷ 8760 ≈ 6.5 years

Doubtful it's under warranty.
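(The arithmetic, for the record, in plain integer shell math:)

```shell
# Power-on hours to age: 8760 hours per year, 24 per day.
hours=56941
echo "$((hours / 8760)) years, $(( (hours % 8760) / 24 )) days"
```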


-- 
Chris Murphy


* Re: Need help to recover root filesystem after a power supply issue
  2019-07-10 16:47           ` Chris Murphy
@ 2019-07-10 17:16             ` Andrey Zhunev
  2019-07-10 18:03               ` Chris Murphy
  0 siblings, 1 reply; 23+ messages in thread
From: Andrey Zhunev @ 2019-07-10 17:16 UTC (permalink / raw)
  To: Chris Murphy; +Cc: xfs list

Wednesday, July 10, 2019, 7:47:55 PM, you wrote:

> On Wed, Jul 10, 2019 at 10:46 AM Chris Murphy <lists@colorremedies.com> wrote:
>>
>> # smartctl -l scterc,900,100
>> # echo 180 > /sys/block/sda/device/timeout


> smartctl command above does need a drive specified...

Indeed! :)

With the commands above, you're increasing the timeouts, and then fsck
will try to re-read the sectors, right?

As for the SMART status, the number of pending sectors was 0 before.
It started to grow after the PSU incident yesterday. Now, since I'm
running a ddrescue, all the sectors will be read (or at least an
attempt will be made). So the pending sector counter may increase
further.
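For reference, a typical GNU ddrescue invocation for this kind of copy looks something like the following; the destination paths here are hypothetical, and the image and map file must of course live on a different, healthy disk:

```shell
# Sketch: image the failing disk. -d uses direct disc access, -r3
# retries bad sectors three times on later passes; the map file lets
# an interrupted rescue resume where it left off.
ddrescue -d -r3 /dev/sda /mnt/rescue/sda.img /mnt/rescue/sda.map
```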

As I understand it, when a drive cannot READ a sector, the sector is
reported as pending. And it will stay like that until the sector is
either finally read or overwritten. When either of these happens, the
Pending Sector counter should decrease.
In theory, it can go back to 0 (although I've never followed this
closely enough to see a drive do that).

If a drive can't WRITE to a sector, it tries to reallocate it. If it
succeeds, the Reallocated Sectors counter is increased. If it fails to
reallocate - I guess there should be another kind of error or counter,
but I'm not sure which one.

When reallocated sectors appear, it's clearly a bad sign. If the
number of reallocated sectors grows, the drive should not be used.
But it's not that obvious for pending sectors...

Anyway, as you noted, the drive isn't new anymore:


>   9 Power_On_Hours          -O--CK   022   022   000    -    56941
>
> 56941 ÷ 8760 ≈ 6.5 years
>
> Doubtful it's under warranty.


Mmm... yeah... I guess it was one of the early WD30EFRX drives...
This model was launched about 7 years ago, if I'm not mistaken... :)


---
Best regards,
 Andrey


* Re: Need help to recover root filesystem after a power supply issue
  2019-07-10 17:16             ` Andrey Zhunev
@ 2019-07-10 18:03               ` Chris Murphy
  2019-07-10 18:35                 ` Carlos E. R.
  0 siblings, 1 reply; 23+ messages in thread
From: Chris Murphy @ 2019-07-10 18:03 UTC (permalink / raw)
  To: Andrey Zhunev; +Cc: Chris Murphy, xfs list

On Wed, Jul 10, 2019 at 11:16 AM Andrey Zhunev <a-j@a-j.ru> wrote:
>
> Wednesday, July 10, 2019, 7:47:55 PM, you wrote:
>
> > On Wed, Jul 10, 2019 at 10:46 AM Chris Murphy <lists@colorremedies.com> wrote:
> >>
> >> # smartctl -l scterc,900,100
> >> # echo 180 > /sys/block/sda/device/timeout
>
>
> > smartctl command above does need a drive specified...
>
> Indeed! :)
>
> With the commands above, you are increasing the timeout and then fsck
> will try to re-read the sectors, right?

More correctly, the drive firmware won't time out, and will try longer
to recover the data *if* the sectors are marginally bad. If the
sectors are flat-out bad, the firmware will still (almost) immediately
give up, and at that point nothing else can be done except zero the
bad sectors and hope fsck can reconstruct what's missing.

Thing is, 68 sectors have a low likelihood of impacting fs metadata,
because metadata is a smaller target than your actual data, or than
free space if there's a lot of it.


> As for the SMART status, the number of pending sectors was 0 before.
> It started to grow after the PSU incident yesterday. Now, since I'm
> running a ddrescue, all the sectors will be read (or at least an
> attempt will be made). So the pending sector counter may increase
> further.

It's a good and valid tactic to just use ddrescue, with the previously
mentioned modifications to SCT ERC and the kernel timeout, rather than
run fsck directly on a drive that's clearly dying.


> As I understand it, when a drive cannot READ a sector, the sector is
> reported as pending. And it will stay like that until the sector is
> either finally read or overwritten. When either of these happens, the
> Pending Sector counter should decrease.

Sounds about right.

> In theory, it can go back to 0 (although I've never followed this
> closely enough to see a drive do that).

It can and should go to zero once all the pending sectors are
overwritten with either good data or zeros. It's possible the write
succeeds to the same sector, in which case it's no longer pending and
not remapped. It's possible the write fails internally and the drive
firmware does a remap to make the write succeed, in which case it's
also no longer pending.

If a write fails (an externally reported write failure to the kernel),
then the pending sector count gets pinned at that point and will only
ever go up as the drive continues to get worse.


> If a drive can't WRITE to a sector, it tries to reallocate it. If it
> succeeds, the Reallocated Sectors counter is increased. If it fails to
> reallocate - I guess there should be another kind of error or counter,
> but I'm not sure which one.

You get essentially the same UNC type of error you've seen, except
it's a write error instead of a read error. That's widely considered
fatal, because a drive that can't write is just not usable for
anything (well, except read-only).

>
> When reallocated sectors appear, it's clearly a bad sign. If the
> number of reallocated sectors grows, the drive should not be used.
> But it's not that obvious for pending sectors...

They're both bad news; it's just a matter of degree. Yes, a
manufacturer probably takes the position that pending sectors and even
remapping are normal drive behavior. But realistically it's not
something anyone wants to have to deal with. It's useful for
curiosity. Use it for Btrfs testing :-D


-- 
Chris Murphy


* Re: Need help to recover root filesystem after a power supply issue
  2019-07-10 18:03               ` Chris Murphy
@ 2019-07-10 18:35                 ` Carlos E. R.
  2019-07-10 19:30                   ` Chris Murphy
  0 siblings, 1 reply; 23+ messages in thread
From: Carlos E. R. @ 2019-07-10 18:35 UTC (permalink / raw)
  Cc: xfs list



On 10/07/2019 20.03, Chris Murphy wrote:
> On Wed, Jul 10, 2019 at 11:16 AM Andrey Zhunev <a-j@a-j.ru> wrote:

...

>> When reallocated sectors appear, it's clearly a bad sign. If the
>> number of reallocated sectors grows, the drive should not be used.
>> But it's not that obvious for pending sectors...
> 
> They're both bad news; it's just a matter of degree. Yes, a
> manufacturer probably takes the position that pending sectors and even
> remapping are normal drive behavior. But realistically it's not
> something anyone wants to have to deal with. It's useful for
> curiosity. Use it for Btrfs testing :-D

I have used disks with some reallocated sectors for several years
after the "event", without a single failure afterwards. It's not
necessarily fatal. For me, the criterion is that the number does not
increase, and that it is not large.


-- 
Cheers / Saludos,

		Carlos E. R.
		(from 15.0 x86_64 at Telcontar)




* Re: Need help to recover root filesystem after a power supply issue
  2019-07-10 18:35                 ` Carlos E. R.
@ 2019-07-10 19:30                   ` Chris Murphy
  2019-07-10 23:43                     ` Andrey Zhunev
  0 siblings, 1 reply; 23+ messages in thread
From: Chris Murphy @ 2019-07-10 19:30 UTC (permalink / raw)
  To: Carlos E. R.; +Cc: xfs list

On Wed, Jul 10, 2019 at 12:35 PM Carlos E. R.
<robin.listas@telefonica.net> wrote:
>
> On 10/07/2019 20.03, Chris Murphy wrote:
> > On Wed, Jul 10, 2019 at 11:16 AM Andrey Zhunev <a-j@a-j.ru> wrote:
>
> ...
>
> >> When reallocated sectors appear, it's clearly a bad sign. If the
> >> number of reallocated sectors grows, the drive should not be used.
> >> But it's not that obvious for pending sectors...
> >
> > They're both bad news; it's just a matter of degree. Yes, a
> > manufacturer probably takes the position that pending sectors and even
> > remapping are normal drive behavior. But realistically it's not
> > something anyone wants to have to deal with. It's useful for
> > curiosity. Use it for Btrfs testing :-D
>
> I have used disks with some reallocated sectors for several years
> after the "event", without a single failure afterwards. It's not
> necessarily fatal. For me, the criterion is that the number does not
> increase, and that it is not large.

That's true, but it also takes mitigation effort beyond what most
people are willing or able to do. And there's no way to know in
advance; SMART just isn't a good predictor.

There may have been a brief period when some of these marginally bad
sectors could have been remapped automatically, but weren't, because
of the default short SCT ERC - these are intended to be NAS drives,
not boot/system drives.

And also, the default kernel command timeout of 30 seconds is
inappropriate for a single boot or system drive. It should be quite a
bit longer. 30 seconds makes sense only if the drive's SCT ERC is
shorter than that, and it's some kind of RAID setup.

Thus far no one's been willing to budge on setting better defaults.
Distros say the kernel should default to something safe. And kernel
developers pretty much say defaults like this one should never be
changed, and that it's a distro and use-case responsibility to change
them. The end result of that going nowhere is that users consistently
have a suboptimal experience, especially in the desktop/laptop use
case.


-- 
Chris Murphy


* Re: Need help to recover root filesystem after a power supply issue
  2019-07-10 19:30                   ` Chris Murphy
@ 2019-07-10 23:43                     ` Andrey Zhunev
  2019-07-11  2:47                       ` Carlos E. R.
  0 siblings, 1 reply; 23+ messages in thread
From: Andrey Zhunev @ 2019-07-10 23:43 UTC (permalink / raw)
  To: xfs list


Ok, the ddrescue finished copying whatever it was able to recover.
There were many unreadable sectors near the end of the drive.
In total, there were over 170 pending sectors reported by SMART.

I then ran the following commands:

# smartctl -l scterc,900,100 /dev/sda
# echo 180 > /sys/block/sda/device/timeout

But this didn't help at all. The unreadable sectors still remained
unreadable.

So I wiped them with hdparm:
# hdparm --yes-i-know-what-i-am-doing --write-sector <sector_number> /dev/sda

I then re-read all these sectors, and they were all read correctly.
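(A sketch of that readback, using hdparm itself; the sector number shown is just the one from the earlier kernel log:)

```shell
# Read a single sector directly through the drive; hdparm prints the
# sector contents as hex if the read succeeds.
hdparm --read-sector 176473048 /dev/sda
```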

The number of pending sectors reported by SMART dropped down to 7.
Interestingly, there are still NO reallocated sectors reported.


Now, xfs_repair reported the following:

# xfs_repair /dev/centos/root
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.


But I was still unable to mount the filesystem:

# mount /dev/centos/root /tmp/root/
mount: mount /dev/mapper/centos-root on /tmp/root failed: Structure needs cleaning


So I went ahead with '-L', and after some time xfs_repair completed
the repair.


I can now mount my / partition successfully!
Most of the data seems to be there (I haven't checked all of it yet).

The first thing I checked was the original kernel log. And yes, Chris
was right! There are a few warnings reporting read errors on /dev/sda,
obviously logged before the PSU failed. So the PSU might have been
just a coincidence...

Thanks a lot to everybody for your great help and support!!!



BTW, interestingly enough, those 7 pending sectors "disappeared" after
a power cycle. Maybe they were in some internal area of the HDD, not
accessible to users?
The SMART on this drive looks pretty clean now:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       867
  3 Spin_Up_Time            0x0027   181   179   021    Pre-fail  Always       -       5916
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       175
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   022   022   000    Old_age   Always       -       56949
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       175
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       112
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       62
194 Temperature_Celsius     0x0022   117   091   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0




---
Best regards,
 Andrey


* Re: Need help to recover root filesystem after a power supply issue
  2019-07-10 23:43                     ` Andrey Zhunev
@ 2019-07-11  2:47                       ` Carlos E. R.
  2019-07-11  7:10                         ` Andrey Zhunev
  0 siblings, 1 reply; 23+ messages in thread
From: Carlos E. R. @ 2019-07-11  2:47 UTC (permalink / raw)
  To: xfs list



On 11/07/2019 01.43, Andrey Zhunev wrote:
> 
> Ok, the ddrescue finished copying whatever it was able to recover.
> There were many unreadable sectors near the end of the drive.
> In total, there were over 170 pending sectors reported by SMART.
> 
> I then ran the following commands:
> 
> # smartctl -l scterc,900,100 /dev/sda
> # echo 180 > /sys/block/sda/device/timeout
> 
> But this didn't help at all. The unreadable sectors still remained
> unreadable.
> 
> So I wiped them with hdparm:
> # hdparm --yes-i-know-what-i-am-doing --write-sector <sector_number> /dev/sda

This has always eluded me. How did you know the sector numbers?

At this point, I typically take the brutal approach of overwriting the
entire partition (or disk) with zeroes using dd, which works as a
destructive write test ;-)

Before that, I of course first create an image with ddrescue.

> 
> I then re-read all these sectors, and they were all read correctly.
> 
> The number of pending sectors reported by SMART dropped down to 7.
> Interestingly, there are still NO reallocated sectors reported.

I suspect that the figure SMART reports only starts to rise after some
unknown number of sectors has been remapped, so when the numbers
actually appear there, it is serious.


-- 
Cheers / Saludos,

		Carlos E. R.
		(from 15.0 x86_64 at Telcontar)


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Need help to recover root filesystem after a power supply issue
  2019-07-11  2:47                       ` Carlos E. R.
@ 2019-07-11  7:10                         ` Andrey Zhunev
  2019-07-11 10:23                           ` Carlos E. R.
  0 siblings, 1 reply; 23+ messages in thread
From: Andrey Zhunev @ 2019-07-11  7:10 UTC (permalink / raw)
  To: Carlos E. R.; +Cc: xfs list


Thursday, July 11, 2019, 5:47:36 AM, you wrote:

> On 11/07/2019 01.43, Andrey Zhunev wrote:
>> 
>> Ok, the ddrescue finished copying whatever it was able to recover.
>> There were many unreadable sectors near the end of the drive.
>> In total, there were over 170 pending sectors reported by SMART.
>> 
>> I then ran the following commands:
>> 
>> # smartctl -l scterc,900,100 /dev/sda
>> # echo 180 > /sys/block/sda/device/timeout
>> 
>> But this didn't help at all. The unreadable sectors still remained
>> unreadable.
>> 
>> So I wiped them with hdparm:
>> # hdparm --yes-i-know-what-i-am-doing --write-sector <sector_number> /dev/sda

> This has always eluded me. How did you know the sector numbers?


When you use ddrescue (or any other tool) to try to read the data
and a read error occurs, an error message is added to your kernel
log. You can find the sector number there:

Jul 10 11:56:01 mgmt kernel: blk_update_request: I/O error, dev sda, sector 157804112

You can then try to re-read that specific sector with:

# hdparm --read-sector 157804112 /dev/sda

If that one still returns an error, you can be sure it is safe to wipe.
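To collect all the failing sector numbers in one go, a small helper like
this could work (hypothetical, not something I actually ran; it assumes
the kernel log lines look like the one above and the drive is sda):

```shell
# Pull the failing sector numbers out of kernel-log lines like the one
# above, de-duplicated and sorted, ready to feed to hdparm one by one.
bad_sectors() {
    grep -o 'I/O error, dev sda, sector [0-9]*' |
        awk '{print $NF}' |
        sort -un
}

# Example; the input would normally come from `dmesg` or the journal:
printf '%s\n' \
  'Jul 10 11:56:01 mgmt kernel: blk_update_request: I/O error, dev sda, sector 157804112' \
  'Jul 10 11:56:02 mgmt kernel: blk_update_request: I/O error, dev sda, sector 157804112' \
  'Jul 10 11:56:05 mgmt kernel: blk_update_request: I/O error, dev sda, sector 157804120' |
  bad_sectors
# prints: 157804112
#         157804120
# Each surviving sector can then be probed with:
#   hdparm --read-sector <sector> /dev/sda
```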


> At this point, I typically take the brutal approach of overwriting the
> entire partition (or disk) with zeroes using dd, which works as a
> destructive write test ;-)

> Previous to that, I attempt to create an image with ddrescue, of course.

>> 
>> I then re-read all these sectors, and they were all read correctly.
>> 
>> The number of pending sectors reported by SMART dropped down to 7.
>> Interestingly, there are still NO reallocated sectors reported.

> I suspect that the figure SMART reports only starts to rise after some
> unknown number of sectors has been remapped, so when the numbers
> actually appear there, it is serious.

Hmmm, this is an interesting thought!
Everybody lies... :)




---
Best regards,
 Andrey

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Need help to recover root filesystem after a power supply issue
  2019-07-11  7:10                         ` Andrey Zhunev
@ 2019-07-11 10:23                           ` Carlos E. R.
  0 siblings, 0 replies; 23+ messages in thread
From: Carlos E. R. @ 2019-07-11 10:23 UTC (permalink / raw)
  To: xfs list


[-- Attachment #1.1: Type: text/plain, Size: 1216 bytes --]

On 11/07/2019 09.10, Andrey Zhunev wrote:
> 
> Thursday, July 11, 2019, 5:47:36 AM, you wrote:

...

>>> So I wiped them with hdparm:
>>> # hdparm --yes-i-know-what-i-am-doing --write-sector <sector_number> /dev/sda
> 
>> This has always eluded me. How did you know the sector numbers?
> 
> 
> When you use ddrescue (or any other tool) to try to read the data
> and a read error occurs, an error message is added to your kernel
> log. You can find the sector number there:

Ah, ok, yes, I see. Thanks :-)

...

>>> I then re-read all these sectors, and they were all read correctly.
>>>
>>> The number of pending sectors reported by SMART dropped down to 7.
>>> Interestingly, there are still NO reallocated sectors reported.
> 
>> I suspect that the figure SMART reports only starts to rise after some
>> unknown number of sectors has been remapped, so when the numbers
>> actually appear there, it is serious.
> 
> Hmmm, this is an interesting thought!
> Everybody lies... :)

It is either that, or the number has a multiplier, and starts counting
at one hundred, two hundred... I don't know.


-- 
Cheers / Saludos,

		Carlos E. R.
		(from 15.0 x86_64 at Telcontar)


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Need help to recover root filesystem after a power supply issue
  2019-07-10 15:02       ` Andrey Zhunev
  2019-07-10 15:23         ` Eric Sandeen
@ 2019-07-10 18:21         ` Carlos E. R.
  1 sibling, 0 replies; 23+ messages in thread
From: Carlos E. R. @ 2019-07-10 18:21 UTC (permalink / raw)
  To: Andrey Zhunev, Linux-XFS mailing list


[-- Attachment #1.1: Type: text/plain, Size: 2301 bytes --]

On 10/07/2019 17.02, Andrey Zhunev wrote:

> Ooops, I forgot to paste the error message from dmesg.
> Here it is:
> 
> Jul 10 11:48:05 mgmt kernel: ata1.00: exception Emask 0x0 SAct 0x180000 SErr 0x0 action 0x0
> Jul 10 11:48:05 mgmt kernel: ata1.00: irq_stat 0x40000008
> Jul 10 11:48:05 mgmt kernel: ata1.00: failed command: READ FPDMA QUEUED
> Jul 10 11:48:05 mgmt kernel: ata1.00: cmd 60/00:98:28:ac:3e/01:00:03:00:00/40 tag 19 ncq 131072 in#012         res 41/40:00:08:ad:3e/00:00:03:00:00/40 Emask 0x409 (media error) <F>
> Jul 10 11:48:05 mgmt kernel: ata1.00: status: { DRDY ERR }
> Jul 10 11:48:05 mgmt kernel: ata1.00: error: { UNC }
> Jul 10 11:48:05 mgmt kernel: ata1.00: configured for UDMA/133
> Jul 10 11:48:05 mgmt kernel: sd 0:0:0:0: [sda] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Jul 10 11:48:05 mgmt kernel: sd 0:0:0:0: [sda] tag#19 Sense Key : Medium Error [current] [descriptor]
> Jul 10 11:48:05 mgmt kernel: sd 0:0:0:0: [sda] tag#19 Add. Sense: Unrecovered read error - auto reallocate failed
> Jul 10 11:48:05 mgmt kernel: sd 0:0:0:0: [sda] tag#19 CDB: Read(16) 88 00 00 00 00 00 03 3e ac 28 00 00 01 00 00 00
> Jul 10 11:48:05 mgmt kernel: blk_update_request: I/O error, dev sda, sector 54439176
> Jul 10 11:48:05 mgmt kernel: ata1: EH complete
> 
> There are several of these.
> At the moment ddrescue reports 22 read errors (with 35% of the data
> copied to a new storage). If I remember correctly, the LVM with my
> root partition is at the end of the drive. This means more errors will
> likely come... :( 
> 
> The way I interpret the dmesg message, that's just a read error.

"auto reallocate failed" is important. It might indicate that the
reallocation area is full :-?

> I'm
> not sure, but maybe a complete wipe of the drive will even overwrite /
> clear these unreadable sectors.
> Well, that's something to be checked after the copy process finishes.

Run the SMART long test after you have made a copy, and watch especially
for the Current_Pending_Sector, Offline_Uncorrectable, and
Reallocated_Sector_Ct values. Then overwrite the entire disk with zeroes
and repeat the test. If the bad-sector count increases, discard the disk.


-- 
Cheers / Saludos,

		Carlos E. R.
		(from 15.0 x86_64 at Telcontar)


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Need help to recover root filesystem after a power supply issue
  2019-07-10 15:02       ` Andrey Zhunev
@ 2019-07-10 15:23         ` Eric Sandeen
  2019-07-10 18:21         ` Carlos E. R.
  1 sibling, 0 replies; 23+ messages in thread
From: Eric Sandeen @ 2019-07-10 15:23 UTC (permalink / raw)
  To: Andrey Zhunev, linux-xfs


On 7/10/19 10:02 AM, Andrey Zhunev wrote:
> Wednesday, July 10, 2019, 5:23:41 PM, you wrote:
...

 
>> As I said, look at dmesg to see what failed on the original drive read
>> attempt.
> 
>> ddrescue will fill unreadable sectors with 0, and then of course that
>> can be read from the image file.
> 
> 
> Ooops, I forgot to paste the error message from dmesg.
> Here it is:
> 
> Jul 10 11:48:05 mgmt kernel: ata1.00: exception Emask 0x0 SAct 0x180000 SErr 0x0 action 0x0
> Jul 10 11:48:05 mgmt kernel: ata1.00: irq_stat 0x40000008
> Jul 10 11:48:05 mgmt kernel: ata1.00: failed command: READ FPDMA QUEUED
> Jul 10 11:48:05 mgmt kernel: ata1.00: cmd 60/00:98:28:ac:3e/01:00:03:00:00/40 tag 19 ncq 131072 in#012         res 41/40:00:08:ad:3e/00:00:03:00:00/40 Emask 0x409 (media error) <F>
> Jul 10 11:48:05 mgmt kernel: ata1.00: status: { DRDY ERR }
> Jul 10 11:48:05 mgmt kernel: ata1.00: error: { UNC }
> Jul 10 11:48:05 mgmt kernel: ata1.00: configured for UDMA/133
> Jul 10 11:48:05 mgmt kernel: sd 0:0:0:0: [sda] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Jul 10 11:48:05 mgmt kernel: sd 0:0:0:0: [sda] tag#19 Sense Key : Medium Error [current] [descriptor]
> Jul 10 11:48:05 mgmt kernel: sd 0:0:0:0: [sda] tag#19 Add. Sense: Unrecovered read error - auto reallocate failed
> Jul 10 11:48:05 mgmt kernel: sd 0:0:0:0: [sda] tag#19 CDB: Read(16) 88 00 00 00 00 00 03 3e ac 28 00 00 01 00 00 00
> Jul 10 11:48:05 mgmt kernel: blk_update_request: I/O error, dev sda, sector 54439176
> Jul 10 11:48:05 mgmt kernel: ata1: EH complete
> 
> There are several of these.
> At the moment ddrescue reports 22 read errors (with 35% of the data
> copied to a new storage). If I remember correctly, the LVM with my
> root partition is at the end of the drive. This means more errors will
> likely come... :( 
> 
> The way I interpret the dmesg message, that's just a read error. I'm
> not sure, but maybe a complete wipe of the drive will even overwrite /
> clear these unreadable sectors.
> Well, that's something to be checked after the copy process finishes.

Yep, so it's a hardware error. ddrescue will fill unreadable sectors with zeros,
and then you can see whether xfs_repair can cope with what is left.

Overwriting the sectors may "fix" them, but personally I would never trust
that drive after this.  ;)

-Eric

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Need help to recover root filesystem after a power supply issue
  2019-07-10 14:23     ` Eric Sandeen
@ 2019-07-10 15:02       ` Andrey Zhunev
  2019-07-10 15:23         ` Eric Sandeen
  2019-07-10 18:21         ` Carlos E. R.
  0 siblings, 2 replies; 23+ messages in thread
From: Andrey Zhunev @ 2019-07-10 15:02 UTC (permalink / raw)
  To: Eric Sandeen, linux-xfs

Wednesday, July 10, 2019, 5:23:41 PM, you wrote:



> On 7/10/19 8:58 AM, Andrey Zhunev wrote:
>> Wednesday, July 10, 2019, 4:26:14 PM, you wrote:
>> 
>>> On 7/10/19 4:56 AM, Andrey Zhunev wrote:
>>>> Hello All,
>>>>
>>>> I am struggling to recover my system after a PSU failure, and I was
>>>> suggested to ask here for support.
>>>>
>>>> One of the hard drives throws some read errors, and that happens to be
>>>> my root drive...
>>>> My system is CentOS 7, and the root partition is a part of LVM.
>>>>
>>>> [root@mgmt ~]# lvscan
>>>>   ACTIVE            '/dev/centos/root' [<98.83 GiB] inherit
>>>>   ACTIVE            '/dev/centos/home' [<638.31 GiB] inherit
>>>>   ACTIVE            '/dev/centos/swap' [<7.52 GiB] inherit
>>>> [root@mgmt ~]#
>>>>
>>>> [root@tftp ~]# file -s /dev/centos/root
>>>> /dev/centos/root: symbolic link to `../dm-3'
>>>> [root@tftp ~]# file -s /dev/centos/home
>>>> /dev/centos/home: symbolic link to `../dm-4'
>>>> [root@tftp ~]# file -s /dev/dm-3
>>>> /dev/dm-3: SGI XFS filesystem data (blksz 4096, inosz 256, v2 dirs)
>>>> [root@tftp ~]# file -s /dev/dm-4
>>>> /dev/dm-4: SGI XFS filesystem data (blksz 4096, inosz 256, v2 dirs)
>>>>
>>>>
>>>> [root@tftp ~]# xfs_repair /dev/centos/root
>>>> Phase 1 - find and verify superblock...
>>>> superblock read failed, offset 53057945600, size 131072, ag 2, rval -1
>>>>
>>>> fatal error -- Input/output error
>> 
>>> look at dmesg, see what the kernel says about the read failure.
>> 
>>> You might be able to use https://www.gnu.org/software/ddrescue/ 
>>> to read as many sectors off the device into an image file as possible,
>>> and that image might be enough to work with for recovery.  That would be
>>> my first approach:
>> 
>>> 1) use ddrescue to create an image file of the device
>>> 2) make a copy of that image file
>>> 3) run xfs_repair -n on the copy to see what it would do
>>> 4) if that looks reasonable run xfs_repair on the copy
>>> 5) mount the copy and see what you get
>> 
>>> But if your drive simply cannot be read at all, this is not a filesystem
>>> problem, it is a hardware problem. If this is critical data you may wish
>>> to hire a data recovery service.
>> 
>>> -Eric
>> 
>> 
>> Hi Eric,
>> 
>> Thanks for your message!
>> I already started to copy the failing drive with ddrescue. This is a
>> large drive, so it takes some time to complete...
>> 
>> When I tried to run xfs_repair on the original (failing) drive,
>> xfs_repair was unable to read the superblock and then just quit
>> with an 'I/O error'.
>> Do you think it might behave differently on a copied image?

> As I said, look at dmesg to see what failed on the original drive read
> attempt.

> ddrescue will fill unreadable sectors with 0, and then of course that
> can be read from the image file.


Ooops, I forgot to paste the error message from dmesg.
Here it is:

Jul 10 11:48:05 mgmt kernel: ata1.00: exception Emask 0x0 SAct 0x180000 SErr 0x0 action 0x0
Jul 10 11:48:05 mgmt kernel: ata1.00: irq_stat 0x40000008
Jul 10 11:48:05 mgmt kernel: ata1.00: failed command: READ FPDMA QUEUED
Jul 10 11:48:05 mgmt kernel: ata1.00: cmd 60/00:98:28:ac:3e/01:00:03:00:00/40 tag 19 ncq 131072 in#012         res 41/40:00:08:ad:3e/00:00:03:00:00/40 Emask 0x409 (media error) <F>
Jul 10 11:48:05 mgmt kernel: ata1.00: status: { DRDY ERR }
Jul 10 11:48:05 mgmt kernel: ata1.00: error: { UNC }
Jul 10 11:48:05 mgmt kernel: ata1.00: configured for UDMA/133
Jul 10 11:48:05 mgmt kernel: sd 0:0:0:0: [sda] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 10 11:48:05 mgmt kernel: sd 0:0:0:0: [sda] tag#19 Sense Key : Medium Error [current] [descriptor]
Jul 10 11:48:05 mgmt kernel: sd 0:0:0:0: [sda] tag#19 Add. Sense: Unrecovered read error - auto reallocate failed
Jul 10 11:48:05 mgmt kernel: sd 0:0:0:0: [sda] tag#19 CDB: Read(16) 88 00 00 00 00 00 03 3e ac 28 00 00 01 00 00 00
Jul 10 11:48:05 mgmt kernel: blk_update_request: I/O error, dev sda, sector 54439176
Jul 10 11:48:05 mgmt kernel: ata1: EH complete

There are several of these.
At the moment ddrescue reports 22 read errors (with 35% of the data
copied to a new storage). If I remember correctly, the LVM with my
root partition is at the end of the drive. This means more errors will
likely come... :( 

The way I interpret the dmesg message, that's just a read error. I'm
not sure, but maybe a complete wipe of the drive will even overwrite /
clear these unreadable sectors.
Well, that's something to be checked after the copy process finishes.


---
Best regards,
 Andrey

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Need help to recover root filesystem after a power supply issue
  2019-07-10 13:58   ` Andrey Zhunev
@ 2019-07-10 14:23     ` Eric Sandeen
  2019-07-10 15:02       ` Andrey Zhunev
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Sandeen @ 2019-07-10 14:23 UTC (permalink / raw)
  To: Andrey Zhunev, linux-xfs



On 7/10/19 8:58 AM, Andrey Zhunev wrote:
> Wednesday, July 10, 2019, 4:26:14 PM, you wrote:
> 
>> On 7/10/19 4:56 AM, Andrey Zhunev wrote:
>>> Hello All,
>>>
>>> I am struggling to recover my system after a PSU failure, and I was
>>> suggested to ask here for support.
>>>
>>> One of the hard drives throws some read errors, and that happens to be
>>> my root drive...
>>> My system is CentOS 7, and the root partition is a part of LVM.
>>>
>>> [root@mgmt ~]# lvscan
>>>   ACTIVE            '/dev/centos/root' [<98.83 GiB] inherit
>>>   ACTIVE            '/dev/centos/home' [<638.31 GiB] inherit
>>>   ACTIVE            '/dev/centos/swap' [<7.52 GiB] inherit
>>> [root@mgmt ~]#
>>>
>>> [root@tftp ~]# file -s /dev/centos/root
>>> /dev/centos/root: symbolic link to `../dm-3'
>>> [root@tftp ~]# file -s /dev/centos/home
>>> /dev/centos/home: symbolic link to `../dm-4'
>>> [root@tftp ~]# file -s /dev/dm-3
>>> /dev/dm-3: SGI XFS filesystem data (blksz 4096, inosz 256, v2 dirs)
>>> [root@tftp ~]# file -s /dev/dm-4
>>> /dev/dm-4: SGI XFS filesystem data (blksz 4096, inosz 256, v2 dirs)
>>>
>>>
>>> [root@tftp ~]# xfs_repair /dev/centos/root
>>> Phase 1 - find and verify superblock...
>>> superblock read failed, offset 53057945600, size 131072, ag 2, rval -1
>>>
>>> fatal error -- Input/output error
> 
>> look at dmesg, see what the kernel says about the read failure.
> 
>> You might be able to use https://www.gnu.org/software/ddrescue/ 
>> to read as many sectors off the device into an image file as possible,
>> and that image might be enough to work with for recovery.  That would be
>> my first approach:
> 
>> 1) use ddrescue to create an image file of the device
>> 2) make a copy of that image file
>> 3) run xfs_repair -n on the copy to see what it would do
>> 4) if that looks reasonable run xfs_repair on the copy
>> 5) mount the copy and see what you get
> 
>> But if your drive simply cannot be read at all, this is not a filesystem
>> problem, it is a hardware problem. If this is critical data you may wish
>> to hire a data recovery service.
> 
>> -Eric
> 
> 
> Hi Eric,
> 
> Thanks for your message!
> I already started to copy the failing drive with ddrescue. This is a
> large drive, so it takes some time to complete...
> 
> When I tried to run xfs_repair on the original (failing) drive,
> xfs_repair was unable to read the superblock and then just quit
> with an 'I/O error'.
> Do you think it might behave differently on a copied image?

As I said, look at dmesg to see what failed on the original drive read
attempt.

ddrescue will fill unreadable sectors with 0, and then of course that
can be read from the image file.

-Eric

> I will definitely give it a try once the ddrescue finishes.
> 
> 
> P.S. The data on this drive is not critical enough to hire a
> professional data recovery service. Still, there are some files I
> would really like to restore (mostly settings and configuration
> files; nothing large, but important)... This would save me weeks of
> reconfiguring to get the system back to its original state...
> Backups, always make backups... yeah, I know... :(
> 
> 
>  ---
>  Best regards,
>   Andrey
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Need help to recover root filesystem after a power supply issue
  2019-07-10 13:26 ` Eric Sandeen
@ 2019-07-10 13:58   ` Andrey Zhunev
  2019-07-10 14:23     ` Eric Sandeen
  0 siblings, 1 reply; 23+ messages in thread
From: Andrey Zhunev @ 2019-07-10 13:58 UTC (permalink / raw)
  To: Eric Sandeen, linux-xfs

Wednesday, July 10, 2019, 4:26:14 PM, you wrote:

> On 7/10/19 4:56 AM, Andrey Zhunev wrote:
>> Hello All,
>> 
>> I am struggling to recover my system after a PSU failure, and I was
>> suggested to ask here for support.
>> 
>> One of the hard drives throws some read errors, and that happens to be
>> my root drive...
>> My system is CentOS 7, and the root partition is a part of LVM.
>> 
>> [root@mgmt ~]# lvscan
>>   ACTIVE            '/dev/centos/root' [<98.83 GiB] inherit
>>   ACTIVE            '/dev/centos/home' [<638.31 GiB] inherit
>>   ACTIVE            '/dev/centos/swap' [<7.52 GiB] inherit
>> [root@mgmt ~]#
>> 
>> [root@tftp ~]# file -s /dev/centos/root
>> /dev/centos/root: symbolic link to `../dm-3'
>> [root@tftp ~]# file -s /dev/centos/home
>> /dev/centos/home: symbolic link to `../dm-4'
>> [root@tftp ~]# file -s /dev/dm-3
>> /dev/dm-3: SGI XFS filesystem data (blksz 4096, inosz 256, v2 dirs)
>> [root@tftp ~]# file -s /dev/dm-4
>> /dev/dm-4: SGI XFS filesystem data (blksz 4096, inosz 256, v2 dirs)
>> 
>> 
>> [root@tftp ~]# xfs_repair /dev/centos/root
>> Phase 1 - find and verify superblock...
>> superblock read failed, offset 53057945600, size 131072, ag 2, rval -1
>> 
>> fatal error -- Input/output error

> look at dmesg, see what the kernel says about the read failure.

> You might be able to use https://www.gnu.org/software/ddrescue/ 
> to read as many sectors off the device into an image file as possible,
> and that image might be enough to work with for recovery.  That would be
> my first approach:

> 1) use ddrescue to create an image file of the device
> 2) make a copy of that image file
> 3) run xfs_repair -n on the copy to see what it would do
> 4) if that looks reasonable run xfs_repair on the copy
> 5) mount the copy and see what you get

> But if your drive simply cannot be read at all, this is not a filesystem
> problem, it is a hardware problem. If this is critical data you may wish
> to hire a data recovery service.

> -Eric


Hi Eric,

Thanks for your message!
I already started to copy the failing drive with ddrescue. This is a
large drive, so it takes some time to complete...

When I tried to run xfs_repair on the original (failing) drive,
xfs_repair was unable to read the superblock and then just quit
with an 'I/O error'.
Do you think it might behave differently on a copied image?

I will definitely give it a try once the ddrescue finishes.


P.S. The data on this drive is not critical enough to hire a
professional data recovery service. Still, there are some files I
would really like to restore (mostly settings and configuration
files; nothing large, but important)... This would save me weeks of
reconfiguring to get the system back to its original state...
Backups, always make backups... yeah, I know... :(


---
Best regards,
 Andrey

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Need help to recover root filesystem after a power supply issue
  2019-07-10  9:56 Andrey Zhunev
@ 2019-07-10 13:26 ` Eric Sandeen
  2019-07-10 13:58   ` Andrey Zhunev
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Sandeen @ 2019-07-10 13:26 UTC (permalink / raw)
  To: Andrey Zhunev, linux-xfs

On 7/10/19 4:56 AM, Andrey Zhunev wrote:
> Hello All,
> 
> I am struggling to recover my system after a PSU failure, and I was
> suggested to ask here for support.
> 
> One of the hard drives throws some read errors, and that happen to be
> my root drive...
> My system is CentOS 7, and the root partition is a part of LVM.
> 
> [root@mgmt ~]# lvscan
>   ACTIVE            '/dev/centos/root' [<98.83 GiB] inherit
>   ACTIVE            '/dev/centos/home' [<638.31 GiB] inherit
>   ACTIVE            '/dev/centos/swap' [<7.52 GiB] inherit
> [root@mgmt ~]#
> 
> [root@tftp ~]# file -s /dev/centos/root
> /dev/centos/root: symbolic link to `../dm-3'
> [root@tftp ~]# file -s /dev/centos/home
> /dev/centos/home: symbolic link to `../dm-4'
> [root@tftp ~]# file -s /dev/dm-3
> /dev/dm-3: SGI XFS filesystem data (blksz 4096, inosz 256, v2 dirs)
> [root@tftp ~]# file -s /dev/dm-4
> /dev/dm-4: SGI XFS filesystem data (blksz 4096, inosz 256, v2 dirs)
> 
> 
> [root@tftp ~]# xfs_repair /dev/centos/root
> Phase 1 - find and verify superblock...
> superblock read failed, offset 53057945600, size 131072, ag 2, rval -1
> 
> fatal error -- Input/output error

look at dmesg, see what the kernel says about the read failure.

You might be able to use https://www.gnu.org/software/ddrescue/ 
to read as many sectors off the device into an image file as possible,
and that image might be enough to work with for recovery.  That would be
my first approach:

1) use ddrescue to create an image file of the device
2) make a copy of that image file
3) run xfs_repair -n on the copy to see what it would do
4) if that looks reasonable run xfs_repair on the copy
5) mount the copy and see what you get
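Those five steps might look roughly like this. A sketch only: device
and paths are placeholders, and it assumes you image the LV itself
(e.g. /dev/centos/root) so the image is a bare XFS filesystem; a
whole-disk image would first need losetup/kpartx to reach the LV.
DRY_RUN defaults to on, so the commands are printed rather than run:

```shell
# Sketch of the five steps above. src, img, and the mount point are
# placeholders. DRY_RUN=1 (the default) only prints each command;
# set DRY_RUN= to actually execute them.
src=/dev/centos/root          # imaging the LV yields a bare XFS image
img=/mnt/big/root.img         # put this on a disk with enough free space
DRY_RUN=${DRY_RUN-1}

run() { echo "+ $*"; [ -n "$DRY_RUN" ] || "$@"; }

run ddrescue -d "$src" "$img" "$img.map"   # 1) image the device (-d: direct I/O)
run cp "$img" "$img.work"                  # 2) keep the pristine image aside
run xfs_repair -n "$img.work"              # 3) dry run: see what it would do
run xfs_repair "$img.work"                 # 4) real repair, on the copy only
run mkdir -p /mnt/rescued
run mount -o loop,ro "$img.work" /mnt/rescued   # 5) mount and inspect
```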

But if your drive simply cannot be read at all, this is not a filesystem
problem, it is a hardware problem. If this is critical data you may wish
to hire a data recovery service.

-Eric


> [root@tftp ~]#
> 
> 
> smartctl shows some pending sectors on /dev/sda, and no reallocated
> sectors (yet?).
> 
> Can someone please give me a hand to bring root partition back to life
> (ideally)? Or, at least, recover a couple of critical configuration
> files...
> 
> 
> ---
> Best regards,
>  Andrey
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Need help to recover root filesystem after a power supply issue
@ 2019-07-10  9:56 Andrey Zhunev
  2019-07-10 13:26 ` Eric Sandeen
  0 siblings, 1 reply; 23+ messages in thread
From: Andrey Zhunev @ 2019-07-10  9:56 UTC (permalink / raw)
  To: linux-xfs

Hello All,

I am struggling to recover my system after a PSU failure, and I was
suggested to ask here for support.

One of the hard drives throws some read errors, and that happens to be
my root drive...
My system is CentOS 7, and the root partition is a part of LVM.

[root@mgmt ~]# lvscan
  ACTIVE            '/dev/centos/root' [<98.83 GiB] inherit
  ACTIVE            '/dev/centos/home' [<638.31 GiB] inherit
  ACTIVE            '/dev/centos/swap' [<7.52 GiB] inherit
[root@mgmt ~]#

[root@tftp ~]# file -s /dev/centos/root
/dev/centos/root: symbolic link to `../dm-3'
[root@tftp ~]# file -s /dev/centos/home
/dev/centos/home: symbolic link to `../dm-4'
[root@tftp ~]# file -s /dev/dm-3
/dev/dm-3: SGI XFS filesystem data (blksz 4096, inosz 256, v2 dirs)
[root@tftp ~]# file -s /dev/dm-4
/dev/dm-4: SGI XFS filesystem data (blksz 4096, inosz 256, v2 dirs)


[root@tftp ~]# xfs_repair /dev/centos/root
Phase 1 - find and verify superblock...
superblock read failed, offset 53057945600, size 131072, ag 2, rval -1

fatal error -- Input/output error
[root@tftp ~]#


smartctl shows some pending sectors on /dev/sda, and no reallocated
sectors (yet?).

Can someone please give me a hand to bring root partition back to life
(ideally)? Or, at least, recover a couple of critical configuration
files...


---
Best regards,
 Andrey

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2019-07-11 10:23 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-10  9:47 Need help to recover root filesystem after a power supply issue Andrey Zhunev
2019-07-10 14:30 ` Chris Murphy
2019-07-10 15:28   ` Andrey Zhunev
2019-07-10 15:45     ` Chris Murphy
2019-07-10 16:07       ` Andrey Zhunev
2019-07-10 16:46         ` Chris Murphy
2019-07-10 16:47           ` Chris Murphy
2019-07-10 17:16             ` Andrey Zhunev
2019-07-10 18:03               ` Chris Murphy
2019-07-10 18:35                 ` Carlos E. R.
2019-07-10 19:30                   ` Chris Murphy
2019-07-10 23:43                     ` Andrey Zhunev
2019-07-11  2:47                       ` Carlos E. R.
2019-07-11  7:10                         ` Andrey Zhunev
2019-07-11 10:23                           ` Carlos E. R.
2019-07-10 16:51         ` Chris Murphy
2019-07-10  9:56 Andrey Zhunev
2019-07-10 13:26 ` Eric Sandeen
2019-07-10 13:58   ` Andrey Zhunev
2019-07-10 14:23     ` Eric Sandeen
2019-07-10 15:02       ` Andrey Zhunev
2019-07-10 15:23         ` Eric Sandeen
2019-07-10 18:21         ` Carlos E. R.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.