* is it safe to xfs_repair this volume? do i have a different first step?
@ 2019-02-07 13:25 David T-G
  2019-02-07 14:52 ` Brian Foster
  2019-02-08 18:40 ` Chris Murphy
  0 siblings, 2 replies; 6+ messages in thread
From: David T-G @ 2019-02-07 13:25 UTC (permalink / raw)
  To: Linux-XFS list

Good morning!

I have a four-disk RAID5 volume with an ~11T filesystem that suddenly
won't mount 

  diskfarm:root:4:~> mount -v /mnt/4Traid5md/
  mount: mount /dev/md0p1 on /mnt/4Traid5md failed: Bad message

after a power outage :-(  Because of the GPT errors I see

  diskfarm:root:4:~> fdisk -l /dev/md0
  The backup GPT table is corrupt, but the primary appears OK, so that will be used.
  Disk /dev/md0: 10.9 TiB, 12001551581184 bytes, 23440530432 sectors
  Units: sectors of 1 * 512 = 512 bytes
  Sector size (logical/physical): 512 bytes / 4096 bytes
  I/O size (minimum/optimal): 524288 bytes / 1572864 bytes
  Disklabel type: gpt
  Disk identifier: 8D29E2FB-1A26-4C46-B284-99FA7163B89D

  Device     Start         End     Sectors  Size Type
  /dev/md0p1  2048 23440530398 23440528351 10.9T Linux filesystem

  diskfarm:root:4:~> parted /dev/md0 print
  Error: end of file while reading /dev/md0
  Retry/Ignore/Cancel? ignore
  Error: The backup GPT table is corrupt, but the primary appears OK, so that will be used.
  OK/Cancel? ok
  Model: Linux Software RAID Array (md)
  Disk /dev/md0: 12.0TB
  Sector size (logical/physical): 512B/4096B
  Partition Table: gpt
  Disk Flags:

  Number  Start   End     Size    File system  Name              Flags
   1      1049kB  12.0TB  12.0TB  xfs          Linux filesystem

when poking, I at first thought that this was a RAID issue, but all of
the md reports look good and apparently the GPT table issue is common, so
I'll leave all of that out unless someone asks for it.

dmesg reports some XFS problems

  diskfarm:root:5:~> dmesg | egrep 'md[:/0]'
  [  117.999012] md/raid:md127: device sdg2 operational as raid disk 1
  [  117.999014] md/raid:md127: device sdh2 operational as raid disk 2
  [  117.999015] md/raid:md127: device sdd2 operational as raid disk 0
  [  117.999246] md/raid:md127: raid level 5 active with 3 out of 3 devices, algorithm 2
  [  120.820661] md/raid:md0: not clean -- starting background reconstruction
  [  120.821279] md/raid:md0: device sdf1 operational as raid disk 2
  [  120.821282] md/raid:md0: device sda1 operational as raid disk 3
  [  120.821283] md/raid:md0: device sdb1 operational as raid disk 0
  [  120.821284] md/raid:md0: device sde1 operational as raid disk 1
  [  120.822028] md/raid:md0: raid level 5 active with 4 out of 4 devices, algorithm 2
  [  120.822063] md0: detected capacity change from 0 to 12001551581184
  [  120.888841]  md0: p1
  [  202.230961] XFS (md0p1): Mounting V4 Filesystem
  [  203.182567] XFS (md0p1): Torn write (CRC failure) detected at log block 0x3397e8. Truncating head block from 0x3399e8.
  [  203.367581] XFS (md0p1): failed to locate log tail
  [  203.367587] XFS (md0p1): log mount/recovery failed: error -74
  [  203.367712] XFS (md0p1): log mount failed
  [  285.893728] XFS (md0p1): Mounting V4 Filesystem
  [  286.057829] XFS (md0p1): Torn write (CRC failure) detected at log block 0x3397e8. Truncating head block from 0x3399e8.
  [  286.203436] XFS (md0p1): failed to locate log tail
  [  286.203440] XFS (md0p1): log mount/recovery failed: error -74
  [  286.203497] XFS (md0p1): log mount failed

but doesn't tell me a whole lot -- or at least not a whole lot that makes
enough sense to me :-)  I tried an xfs_repair dry run and here

  diskfarm:root:4:~> xfs_repair -n /dev/disk/by-label/4Traid5md 2>&1 | egrep -v 'agno = '
  Phase 1 - find and verify superblock...
          - reporting progress in intervals of 15 minutes
  Phase 2 - using internal log
          - zero log...
          - scan filesystem freespace and inode maps...
  sb_fdblocks 471930978, counted 471939170
          - 09:18:47: scanning filesystem freespace - 48 of 48 allocation groups done
          - found root inode chunk
  Phase 3 - for each AG...
          - scan (but don't clear) agi unlinked lists...
          - 09:18:47: scanning agi unlinked lists - 48 of 48 allocation groups done
          - process known inodes and perform inode discovery...
          - 09:24:17: process known inodes and inode discovery - 4466560 of 4466560 inodes done
          - process newly discovered inodes...
          - 09:24:17: process newly discovered inodes - 48 of 48 allocation groups done
  Phase 4 - check for duplicate blocks...
          - setting up duplicate extent list...
          - 09:24:17: setting up duplicate extent list - 48 of 48 allocation groups done
          - check for inodes claiming duplicate blocks...
          - 09:29:44: check for inodes claiming duplicate blocks - 4466560 of 4466560 inodes done
  No modify flag set, skipping phase 5
  Phase 6 - check inode connectivity...
          - traversing filesystem ...
          - traversal finished ...
          - moving disconnected inodes to lost+found ...
  Phase 7 - verify link counts...
          - 09:34:02: verify and correct link counts - 48 of 48 allocation groups done
  No modify flag set, skipping filesystem flush and exiting.

is the trimmed output that can fit on one screen.  Since I don't have a
second copy of all of this data, I'm a bit nervous about pulling the
trigger to write changes and want to make sure that I take the right
steps!  How should I proceed?

I'm not subscribed to this list, so please do cc/bcc me on your replies.
I didn't see any other lists and did see some discussion here, so I hope
that I'm in the right place, but please feel free also to point me in
another direction if that's better.


TIA & HAND

:-D
-- 
David T-G
See http://justpickone.org/davidtg/email/
See http://justpickone.org/davidtg/tofu.txt


* Re: is it safe to xfs_repair this volume? do i have a different first step?
  2019-02-07 13:25 is it safe to xfs_repair this volume? do i have a different first step? David T-G
@ 2019-02-07 14:52 ` Brian Foster
  2019-02-08  2:25   ` David T-G
  2019-02-08 18:40 ` Chris Murphy
  1 sibling, 1 reply; 6+ messages in thread
From: Brian Foster @ 2019-02-07 14:52 UTC (permalink / raw)
  To: David T-G; +Cc: Linux-XFS list

On Thu, Feb 07, 2019 at 08:25:34AM -0500, David T-G wrote:
> Good morning!
> 
> I have a four-disk RAID5 volume with an ~11T filesystem that suddenly
> won't mount 
> 
>   diskfarm:root:4:~> mount -v /mnt/4Traid5md/
>   mount: mount /dev/md0p1 on /mnt/4Traid5md failed: Bad message
> 
> after a power outage :-(  Because of the GPT errors I see
> 
>   diskfarm:root:4:~> fdisk -l /dev/md0
>   The backup GPT table is corrupt, but the primary appears OK, so that will be used.
>   Disk /dev/md0: 10.9 TiB, 12001551581184 bytes, 23440530432 sectors
>   Units: sectors of 1 * 512 = 512 bytes
>   Sector size (logical/physical): 512 bytes / 4096 bytes
>   I/O size (minimum/optimal): 524288 bytes / 1572864 bytes
>   Disklabel type: gpt
>   Disk identifier: 8D29E2FB-1A26-4C46-B284-99FA7163B89D
> 
>   Device     Start         End     Sectors  Size Type
>   /dev/md0p1  2048 23440530398 23440528351 10.9T Linux filesystem
> 
>   diskfarm:root:4:~> parted /dev/md0 print
>   Error: end of file while reading /dev/md0
>   Retry/Ignore/Cancel? ignore
>   Error: The backup GPT table is corrupt, but the primary appears OK, so that will be used.
>   OK/Cancel? ok
>   Model: Linux Software RAID Array (md)
>   Disk /dev/md0: 12.0TB
>   Sector size (logical/physical): 512B/4096B
>   Partition Table: gpt
>   Disk Flags:
> 
>   Number  Start   End     Size    File system  Name              Flags
>    1      1049kB  12.0TB  12.0TB  xfs          Linux filesystem
> 
> when poking, I at first thought that this was a RAID issue, but all of
> the md reports look good and apparently the GPT table issue is common, so
> I'll leave all of that out unless someone asks for it.
> 

I'd be curious if the MD metadata format contends with GPT metadata. Is
the above something you've ever tried before running into this problem
and thus can confirm whether it preexisted the mount problem or not?

If not, I'd suggest some more investigation into this before you make
any future partition or raid changes to this storage. I thought there
were different MD formats to accommodate precisely this sort of
incompatibility, but I don't know for sure. linux-raid is probably more
of a help here.
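(If you want a quick look yourself before going to linux-raid, something
like the following -- an untested sketch, member names taken from your
dmesg output -- would show where the md superblock and data area sit
inside each partition, which is the first thing I'd compare against the
GPT regions:

  for d in /dev/sda1 /dev/sdb1 /dev/sde1 /dev/sdf1 ; do
    echo "== $d"
    mdadm --examine $d | egrep 'Super Offset|Data Offset|Avail Dev Size'
  done

None of that writes anything.)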

> dmesg reports some XFS problems
> 
>   diskfarm:root:5:~> dmesg | egrep 'md[:/0]'
>   [  117.999012] md/raid:md127: device sdg2 operational as raid disk 1
>   [  117.999014] md/raid:md127: device sdh2 operational as raid disk 2
>   [  117.999015] md/raid:md127: device sdd2 operational as raid disk 0
>   [  117.999246] md/raid:md127: raid level 5 active with 3 out of 3 devices, algorithm 2
>   [  120.820661] md/raid:md0: not clean -- starting background reconstruction
>   [  120.821279] md/raid:md0: device sdf1 operational as raid disk 2
>   [  120.821282] md/raid:md0: device sda1 operational as raid disk 3
>   [  120.821283] md/raid:md0: device sdb1 operational as raid disk 0
>   [  120.821284] md/raid:md0: device sde1 operational as raid disk 1
>   [  120.822028] md/raid:md0: raid level 5 active with 4 out of 4 devices, algorithm 2
>   [  120.822063] md0: detected capacity change from 0 to 12001551581184
>   [  120.888841]  md0: p1
>   [  202.230961] XFS (md0p1): Mounting V4 Filesystem
>   [  203.182567] XFS (md0p1): Torn write (CRC failure) detected at log block 0x3397e8. Truncating head block from 0x3399e8.
>   [  203.367581] XFS (md0p1): failed to locate log tail
>   [  203.367587] XFS (md0p1): log mount/recovery failed: error -74
>   [  203.367712] XFS (md0p1): log mount failed
>   [  285.893728] XFS (md0p1): Mounting V4 Filesystem
>   [  286.057829] XFS (md0p1): Torn write (CRC failure) detected at log block 0x3397e8. Truncating head block from 0x3399e8.
>   [  286.203436] XFS (md0p1): failed to locate log tail
>   [  286.203440] XFS (md0p1): log mount/recovery failed: error -74
>   [  286.203497] XFS (md0p1): log mount failed
> 
> but doesn't tell me a whole lot -- or at least not a whole lot that makes
> enough sense to me :-)  I tried an xfs_repair dry run and here
> 

Hmm. So part of the on-disk log is invalid. We attempt to deal with this
problem by truncating off the rest of the log after the point of the
corruption, but this apparently removes too much to perform a recovery.
I'd guess that the torn write is due to interleaving log writes across
raid devices or something, but we can't really tell from just this.

>   diskfarm:root:4:~> xfs_repair -n /dev/disk/by-label/4Traid5md 2>&1 | egrep -v 'agno = '
>   Phase 1 - find and verify superblock...
>           - reporting progress in intervals of 15 minutes
>   Phase 2 - using internal log
>           - zero log...
>           - scan filesystem freespace and inode maps...
>   sb_fdblocks 471930978, counted 471939170

The above said, the corruption here looks extremely minor. You basically
have an accounting mismatch between what the superblock says is
available for free space and what xfs_repair actually found via its
scans and not much else going on.

>           - 09:18:47: scanning filesystem freespace - 48 of 48 allocation groups done
>           - found root inode chunk
>   Phase 3 - for each AG...
>           - scan (but don't clear) agi unlinked lists...
>           - 09:18:47: scanning agi unlinked lists - 48 of 48 allocation groups done
>           - process known inodes and perform inode discovery...
>           - 09:24:17: process known inodes and inode discovery - 4466560 of 4466560 inodes done
>           - process newly discovered inodes...
>           - 09:24:17: process newly discovered inodes - 48 of 48 allocation groups done
>   Phase 4 - check for duplicate blocks...
>           - setting up duplicate extent list...
>           - 09:24:17: setting up duplicate extent list - 48 of 48 allocation groups done
>           - check for inodes claiming duplicate blocks...
>           - 09:29:44: check for inodes claiming duplicate blocks - 4466560 of 4466560 inodes done
>   No modify flag set, skipping phase 5
>   Phase 6 - check inode connectivity...
>           - traversing filesystem ...
>           - traversal finished ...
>           - moving disconnected inodes to lost+found ...
>   Phase 7 - verify link counts...
>           - 09:34:02: verify and correct link counts - 48 of 48 allocation groups done
>   No modify flag set, skipping filesystem flush and exiting.
> 
> is the trimmed output that can fit on one screen.  Since I don't have a
> second copy of all of this data, I'm a bit nervous about pulling the
> trigger to write changes and want to make sure that I take the right
> steps!  How should I proceed?
> 

What do you mean by trimmed output? Was there more output from
xfs_repair that is not shown here?

In general, if you're concerned about what xfs_repair might do to a
particular filesystem you can always do a normal xfs_repair run against
a metadump of the filesystem before the original copy. Collect a
metadump of the fs:

xfs_metadump -go <dev> <outputmdimg>

Note that the metadump collects everything except file data so it will
require a decent amount of space depending on how much metadata
populates your fs vs. data.

Then restore the metadump to a sparse file (on some other
filesystem/storage):

xfs_mdrestore -g <mdfile> <sparsefiletarget>

Then you can mount/xfs_repair the restored sparse image, see what
xfs_repair does, mount the before/after img, etc. Note again that file
data is absent from the restored metadata image so don't expect to be
able to look at file content in the metadump image.
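Roughly, the whole loop looks something like this (untested sketch,
paths are just placeholders):

  xfs_metadump -go /dev/md0p1 /other/fs/md0p1.metadump   # no file data; -o keeps names unobfuscated
  xfs_mdrestore -g /other/fs/md0p1.metadump /other/fs/md0p1.img
  xfs_repair -n -f /other/fs/md0p1.img    # dry run against the image
  xfs_repair -f /other/fs/md0p1.img       # real repair, image only
  mount -o loop,ro /other/fs/md0p1.img /mnt/tmp   # then poke around the namespace

The -f flags just tell xfs_repair the target is a regular file rather
than a device.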

Brian

> I'm not subscribed to this list, so please do cc/bcc me on your replies.
> I didn't see any other lists and did see some discussion here, so I hope
> that I'm in the right place, but please feel free also to point me in
> another direction if that's better.
> 
> 
> TIA & HAND
> 
> :-D
> -- 
> David T-G
> See http://justpickone.org/davidtg/email/
> See http://justpickone.org/davidtg/tofu.txt
> 


* Re: is it safe to xfs_repair this volume? do i have a different first step?
  2019-02-07 14:52 ` Brian Foster
@ 2019-02-08  2:25   ` David T-G
  2019-02-08 13:00     ` Brian Foster
  2019-02-08 19:45     ` Chris Murphy
  0 siblings, 2 replies; 6+ messages in thread
From: David T-G @ 2019-02-08  2:25 UTC (permalink / raw)
  To: Linux-XFS list

Brian, et al --

...and then Brian Foster said...
% 
% On Thu, Feb 07, 2019 at 08:25:34AM -0500, David T-G wrote:
% > 
% > I have a four-disk RAID5 volume with an ~11T filesystem that suddenly
% > won't mount 
...
% > when poking, I at first thought that this was a RAID issue, but all of
% > the md reports look good and apparently the GPT table issue is common, so
% > I'll leave all of that out unless someone asks for it.
% 
% I'd be curious if the MD metadata format contends with GPT metadata. Is
% the above something you've ever tried before running into this problem
% and thus can confirm whether it preexisted the mount problem or not?

There's a lot I don't know, so it's quite possible that it doesn't line
up.  Here's what mdadm tells me:

  diskfarm:root:6:~> mdadm --detail /dev/md0
  /dev/md0:
          Version : 1.2
    Creation Time : Mon Feb  6 00:56:35 2017
       Raid Level : raid5
       Array Size : 11720265216 (11177.32 GiB 12001.55 GB)
    Used Dev Size : 3906755072 (3725.77 GiB 4000.52 GB)
     Raid Devices : 4
    Total Devices : 4
      Persistence : Superblock is persistent

      Update Time : Fri Jan 25 03:32:18 2019
            State : clean
   Active Devices : 4
  Working Devices : 4
   Failed Devices : 0
    Spare Devices : 0

           Layout : left-symmetric
       Chunk Size : 512K

             Name : diskfarm:0  (local to host diskfarm)
             UUID : ca7008ef:90693dae:6c231ad7:08b3f92d
           Events : 48211

      Number   Major   Minor   RaidDevice State
         0       8       17        0      active sync   /dev/sdb1
         1       8       65        1      active sync   /dev/sde1
         3       8       81        2      active sync   /dev/sdf1
         4       8        1        3      active sync   /dev/sda1
  diskfarm:root:6:~>
  diskfarm:root:6:~> for D in a1 b1 e1 f1 ; do mdadm --examine /dev/sd$D | egrep "$D|Role|State|Checksum|Events" ; done
  /dev/sda1:
            State : clean
      Device UUID : f05a143b:50c9b024:36714b9a:44b6a159
         Checksum : 4561f58b - correct
           Events : 48211
     Device Role : Active device 3
     Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
  /dev/sdb1:
            State : clean
         Checksum : 4654df78 - correct
           Events : 48211
     Device Role : Active device 0
     Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
  /dev/sde1:
            State : clean
         Checksum : c4ec7cb6 - correct
           Events : 48211
     Device Role : Active device 1
     Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
  /dev/sdf1:
            State : clean
         Checksum : 349cf800 - correct
           Events : 48211
     Device Role : Active device 2
     Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)

Does that set off any alarms for you?


% 
% If not, I'd suggest some more investigation into this before you make
% any future partition or raid changes to this storage. I thought there
% were different MD formats to accommodate precisely this sort of
% incompatibility, but I don't know for sure. linux-raid is probably more
% of a help here.

Thanks :-)  I have no plans to partition, but I will eventually want to
grow it, so I'll definitely have to check on that.


% 
% > dmesg reports some XFS problems
% > 
% >   diskfarm:root:5:~> dmesg | egrep 'md[:/0]'
...
% >   [  202.230961] XFS (md0p1): Mounting V4 Filesystem
% >   [  203.182567] XFS (md0p1): Torn write (CRC failure) detected at log block 0x3397e8. Truncating head block from 0x3399e8.
% >   [  203.367581] XFS (md0p1): failed to locate log tail
% >   [  203.367587] XFS (md0p1): log mount/recovery failed: error -74
% >   [  203.367712] XFS (md0p1): log mount failed
...
% 
% Hmm. So part of the on-disk log is invalid. We attempt to deal with this
...
% I'd guess that the torn write is due to interleaving log writes across
% raid devices or something, but we can't really tell from just this.

The filesystem *shouldn't* see that there are distinct devices under
there, since that's handled by the md driver, but there's STILL a lot
that I don't know :-)


% 
% >   diskfarm:root:4:~> xfs_repair -n /dev/disk/by-label/4Traid5md 2>&1 | egrep -v 'agno = '
...
% >           - scan filesystem freespace and inode maps...
% >   sb_fdblocks 471930978, counted 471939170
% 
% The above said, the corruption here looks extremely minor. You basically
...
% scans and not much else going on.

That sounds hopeful! :-)


% 
% >           - 09:18:47: scanning filesystem freespace - 48 of 48 allocation groups done
...
% >   Phase 7 - verify link counts...
% >           - 09:34:02: verify and correct link counts - 48 of 48 allocation groups done
% >   No modify flag set, skipping filesystem flush and exiting.
% > 
% > is the trimmed output that can fit on one screen.  Since I don't have a
...
% 
% What do you mean by trimmed output? Was there more output from
% xfs_repair that is not shown here?

Yes.  Note the

  | egrep -v 'agno = '

on the command line above.  The full output

  diskfarm:root:4:~> xfs_repair -n /dev/disk/by-label/4Traid5md >/tmp/xfs_repair.out 2>&1
  diskfarm:root:4:~> wc -l /tmp/xfs_repair.out
  124 /tmp/xfs_repair.out

was quite long.  Shall I attach that file or post a link?


% 
% In general, if you're concerned about what xfs_repair might do to a
% particular filesystem you can always do a normal xfs_repair run against
% a metadump of the filesystem before the original copy. Collect a
% metadump of the fs:
% 
% xfs_metadump -go <dev> <outputmdimg>

Hey, cool!  I like that :-)  It generated a sizeable output file

  diskfarm:root:8:~> xfs_metadump /dev/disk/by-label/4Traid5md /mnt/750Graid5md/tmp/4Traid5md.xfs_d.out >/mnt/750Graid5md/tmp/4Traid5md.xfs_d.out.stdout-stderr 2>&1
  diskfarm:root:8:~> ls -goh /mnt/750Graid5md/tmp/4Traid5md.xfs_d.out
  -rw-r--r-- 1 3.5G Feb  7 17:57 /mnt/750Graid5md/tmp/4Traid5md.xfs_d.out
  diskfarm:root:8:~> wc -l /mnt/750Graid5md/tmp/4Traid5md.xfs_d.out.stdout-stderr
  239 /mnt/750Graid5md/tmp/4Traid5md.xfs_d.out.stdout-stderr

as well as quite a few errors.  Here

  diskfarm:root:8:~> head /mnt/750Graid5md/tmp/4Traid5md.xfs_d.out.stdout-stderr
  xfs_metadump: error - read only 0 of 4096 bytes
  xfs_metadump: error - read only 0 of 4096 bytes
  xfs_metadump: cannot init perag data (5). Continuing anyway.
  xfs_metadump: error - read only 0 of 4096 bytes
  xfs_metadump: cannot read dir2 block 39/132863 (2617378559)
  xfs_metadump: error - read only 0 of 4096 bytes
  xfs_metadump: cannot read dir2 block 41/11461784 (2762925208)
  xfs_metadump: error - read only 0 of 4096 bytes
  xfs_metadump: cannot read dir2 block 41/4237562 (2755700986)
  xfs_metadump: error - read only 0 of 4096 bytes

  diskfarm:root:8:~> tail /mnt/750Graid5md/tmp/4Traid5md.xfs_d.out.stdout-stderr
  xfs_metadump: error - read only 0 of 4096 bytes
  xfs_metadump: cannot read superblock for ag 47
  xfs_metadump: error - read only 0 of 4096 bytes
  xfs_metadump: cannot read agf block for ag 47
  xfs_metadump: error - read only 0 of 4096 bytes
  xfs_metadump: cannot read agi block for ag 47
  xfs_metadump: error - read only 0 of 4096 bytes
  xfs_metadump: cannot read agfl block for ag 47
  xfs_metadump: Filesystem log is dirty; image will contain unobfuscated metadata in log.
  cache_purge: shake on cache 0x4ee1c0 left 117 nodes!?

is a glance at the contents.  Should I post/paste the full copy?


% 
% Note that the metadump collects everything except file data so it will
% require a decent amount of space depending on how much metadata
% populates your fs vs. data.
% 
% Then restore the metadump to a sparse file (on some other
% filesystem/storage):
% 
% xfs_mdrestore -g <mdfile> <sparsefiletarget>

I tried this 

  diskfarm:root:11:~> dd if=/dev/zero bs=1 count=0 seek=4G of=/mnt/750Graid5md/tmp/4Traid5md.xfs_d.iso
  0+0 records in
  0+0 records out
  0 bytes copied, 6.7252e-05 s, 0.0 kB/s
  diskfarm:root:11:~> ls -goh /mnt/750Graid5md/tmp/4Traid5md.xfs_d.iso
  -rw-r--r-- 1 4.0G Feb  7 21:15 /mnt/750Graid5md/tmp/4Traid5md.xfs_d.iso
  diskfarm:root:11:~> xfs_mdrestore /mnt/750Graid5md/tmp/4Traid5md.xfs_d.out /mnt/750Graid5md/tmp/4Traid5md.xfs_d.iso
  xfs_mdrestore: cannot set filesystem image size: File too large

and got an error :-(  Should a 4G file be large enough for a 3.5G
metadata dump?


% 
% Then you can mount/xfs_repair the restored sparse image, see what
% xfs_repair does, mount the before/after img, etc. Note again that file
% data is absent from the restored metadata image so don't expect to be
% able to look at file content in the metadump image.

Right.  That sounds like a great middle step, though.  Thanks!


% 
% Brian


HAND

:-D
-- 
David T-G
See http://justpickone.org/davidtg/email/
See http://justpickone.org/davidtg/tofu.txt


* Re: is it safe to xfs_repair this volume? do i have a different first step?
  2019-02-08  2:25   ` David T-G
@ 2019-02-08 13:00     ` Brian Foster
  2019-02-08 19:45     ` Chris Murphy
  1 sibling, 0 replies; 6+ messages in thread
From: Brian Foster @ 2019-02-08 13:00 UTC (permalink / raw)
  To: David T-G; +Cc: Linux-XFS list

On Thu, Feb 07, 2019 at 09:25:13PM -0500, David T-G wrote:
> Brian, et al --
> 
> ...and then Brian Foster said...
> % 
> % On Thu, Feb 07, 2019 at 08:25:34AM -0500, David T-G wrote:
> % > 
> % > I have a four-disk RAID5 volume with an ~11T filesystem that suddenly
> % > won't mount 
> ...
> % > when poking, I at first thought that this was a RAID issue, but all of
> % > the md reports look good and apparently the GPT table issue is common, so
> % > I'll leave all of that out unless someone asks for it.
> % 
> % I'd be curious if the MD metadata format contends with GPT metadata. Is
> % the above something you've ever tried before running into this problem
> % and thus can confirm whether it preexisted the mount problem or not?
> 
> There's a lot I don't know, so it's quite possible that it doesn't line
> up.  Here's what mdadm tells me:
> 
>   diskfarm:root:6:~> mdadm --detail /dev/md0
>   /dev/md0:
>           Version : 1.2
>     Creation Time : Mon Feb  6 00:56:35 2017
>        Raid Level : raid5
>        Array Size : 11720265216 (11177.32 GiB 12001.55 GB)
>     Used Dev Size : 3906755072 (3725.77 GiB 4000.52 GB)
>      Raid Devices : 4
>     Total Devices : 4
>       Persistence : Superblock is persistent
> 
>       Update Time : Fri Jan 25 03:32:18 2019
>             State : clean
>    Active Devices : 4
>   Working Devices : 4
>    Failed Devices : 0
>     Spare Devices : 0
> 
>            Layout : left-symmetric
>        Chunk Size : 512K
> 
>              Name : diskfarm:0  (local to host diskfarm)
>              UUID : ca7008ef:90693dae:6c231ad7:08b3f92d
>            Events : 48211
> 
>       Number   Major   Minor   RaidDevice State
>          0       8       17        0      active sync   /dev/sdb1
>          1       8       65        1      active sync   /dev/sde1
>          3       8       81        2      active sync   /dev/sdf1
>          4       8        1        3      active sync   /dev/sda1
>   diskfarm:root:6:~>
>   diskfarm:root:6:~> for D in a1 b1 e1 f1 ; do mdadm --examine /dev/sd$D | egrep "$D|Role|State|Checksum|Events" ; done
>   /dev/sda1:
>             State : clean
>       Device UUID : f05a143b:50c9b024:36714b9a:44b6a159
>          Checksum : 4561f58b - correct
>            Events : 48211
>      Device Role : Active device 3
>      Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
>   /dev/sdb1:
>             State : clean
>          Checksum : 4654df78 - correct
>            Events : 48211
>      Device Role : Active device 0
>      Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
>   /dev/sde1:
>             State : clean
>          Checksum : c4ec7cb6 - correct
>            Events : 48211
>      Device Role : Active device 1
>      Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
>   /dev/sdf1:
>             State : clean
>          Checksum : 349cf800 - correct
>            Events : 48211
>      Device Role : Active device 2
>      Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
> 
> Does that set off any alarms for you?
>

It looks normal to me, but I'm not an MD person. I also don't think an
MD format / GPT format conflict is something that mdadm will show. It
may not appear until/unless you change the geometry on one side or the
other. Again, I'd strongly suggest validating your configuration with
linux-raid before making any such changes.
 
> 
> % 
> % If not, I'd suggest some more investigation into this before you make
> % any future partition or raid changes to this storage. I thought there
> % were different MD formats to accommodate precisely this sort of
> % incompatibility, but I don't know for sure. linux-raid is probably more
> % of a help here.
> 
> Thanks :-)  I have no plans to partition, but I will eventually want to
> grow it, so I'll definitely have to check on that.
> 
> 
> % 
> % > dmesg reports some XFS problems
> % > 
> % >   diskfarm:root:5:~> dmesg | egrep 'md[:/0]'
> ...
> % >   [  202.230961] XFS (md0p1): Mounting V4 Filesystem
> % >   [  203.182567] XFS (md0p1): Torn write (CRC failure) detected at log block 0x3397e8. Truncating head block from 0x3399e8.
> % >   [  203.367581] XFS (md0p1): failed to locate log tail
> % >   [  203.367587] XFS (md0p1): log mount/recovery failed: error -74
> % >   [  203.367712] XFS (md0p1): log mount failed
> ...
> % 
> % Hmm. So part of the on-disk log is invalid. We attempt to deal with this
> ...
> % I'd guess that the torn write is due to interleaving log writes across
> % raid devices or something, but we can't really tell from just this.
> 
> The filesystem *shouldn't* see that there are distinct devices under
> there, since that's handled by the md driver, but there's STILL a lot
> that I don't know :-)
> 

It doesn't see multiple devices, but it does see a contiguous range of
filesystem blocks (such as the fs log) that happens to map to multiple
physical devices in the underlying storage layer.

> 
> % 
> % >   diskfarm:root:4:~> xfs_repair -n /dev/disk/by-label/4Traid5md 2>&1 | egrep -v 'agno = '
> ...
> % >           - scan filesystem freespace and inode maps...
> % >   sb_fdblocks 471930978, counted 471939170
> % 
> % The above said, the corruption here looks extremely minor. You basically
> ...
> % scans and not much else going on.
> 
> That sounds hopeful! :-)
> 
> 
> % 
> % >           - 09:18:47: scanning filesystem freespace - 48 of 48 allocation groups done
> ...
> % >   Phase 7 - verify link counts...
> % >           - 09:34:02: verify and correct link counts - 48 of 48 allocation groups done
> % >   No modify flag set, skipping filesystem flush and exiting.
> % > 
> % > is the trimmed output that can fit on one screen.  Since I don't have a
> ...
> % 
> % What do you mean by trimmed output? Was there more output from
> % xfs_repair that is not shown here?
> 
> Yes.  Note the
> 
>   | egrep -v 'agno = '
> 
> on the command line above.  The full output
> 
>   diskfarm:root:4:~> xfs_repair -n /dev/disk/by-label/4Traid5md >/tmp/xfs_repair.out 2>&1
>   diskfarm:root:4:~> wc -l /tmp/xfs_repair.out
>   124 /tmp/xfs_repair.out
> 
> was quite long.  Shall I attach that file or post a link?
> 

Please post the full repair output.

> 
> % 
> % In general, if you're concerned about what xfs_repair might do to a
> % particular filesystem you can always do a normal xfs_repair run against
> % a metadump of the filesystem before the original copy. Collect a
> % metadump of the fs:
> % 
> % xfs_metadump -go <dev> <outputmdimg>
> 
> Hey, cool!  I like that :-)  It generated a sizeable output file
> 
>   diskfarm:root:8:~> xfs_metadump /dev/disk/by-label/4Traid5md /mnt/750Graid5md/tmp/4Traid5md.xfs_d.out >/mnt/750Graid5md/tmp/4Traid5md.xfs_d.out.stdout-stderr 2>&1
>   diskfarm:root:8:~> ls -goh /mnt/750Graid5md/tmp/4Traid5md.xfs_d.out
>   -rw-r--r-- 1 3.5G Feb  7 17:57 /mnt/750Graid5md/tmp/4Traid5md.xfs_d.out
>   diskfarm:root:8:~> wc -l /mnt/750Graid5md/tmp/4Traid5md.xfs_d.out.stdout-stderr
>   239 /mnt/750Graid5md/tmp/4Traid5md.xfs_d.out.stdout-stderr
> 
> as well as quite a few errors.  Here
> 
>   diskfarm:root:8:~> head /mnt/750Graid5md/tmp/4Traid5md.xfs_d.out.stdout-stderr
>   xfs_metadump: error - read only 0 of 4096 bytes
>   xfs_metadump: error - read only 0 of 4096 bytes
>   xfs_metadump: cannot init perag data (5). Continuing anyway.
>   xfs_metadump: error - read only 0 of 4096 bytes
>   xfs_metadump: cannot read dir2 block 39/132863 (2617378559)
>   xfs_metadump: error - read only 0 of 4096 bytes
>   xfs_metadump: cannot read dir2 block 41/11461784 (2762925208)
>   xfs_metadump: error - read only 0 of 4096 bytes
>   xfs_metadump: cannot read dir2 block 41/4237562 (2755700986)
>   xfs_metadump: error - read only 0 of 4096 bytes
> 
>   diskfarm:root:8:~> tail /mnt/750Graid5md/tmp/4Traid5md.xfs_d.out.stdout-stderr
>   xfs_metadump: error - read only 0 of 4096 bytes
>   xfs_metadump: cannot read superblock for ag 47
>   xfs_metadump: error - read only 0 of 4096 bytes
>   xfs_metadump: cannot read agf block for ag 47
>   xfs_metadump: error - read only 0 of 4096 bytes
>   xfs_metadump: cannot read agi block for ag 47
>   xfs_metadump: error - read only 0 of 4096 bytes
>   xfs_metadump: cannot read agfl block for ag 47
>   xfs_metadump: Filesystem log is dirty; image will contain unobfuscated metadata in log.
>   cache_purge: shake on cache 0x4ee1c0 left 117 nodes!?
> 
> is a glance at the contents.  Should I post/paste the full copy?
> 

It couldn't hurt. Perhaps this suggests there are other issues beyond
what was shown in the original repair output.

> 
> % 
> % Note that the metadump collects everything except file data so it will
> % require a decent amount of space depending on how much metadata
> % populates your fs vs. data.
> % 
> % Then restore the metadump to a sparse file (on some other
> % filesystem/storage):
> % 
> % xfs_mdrestore -g <mdfile> <sparsefiletarget>
> 
> I tried this 
> 
>   diskfarm:root:11:~> dd if=/dev/zero bs=1 count=0 seek=4G of=/mnt/750Graid5md/tmp/4Traid5md.xfs_d.iso
>   0+0 records in
>   0+0 records out
>   0 bytes copied, 6.7252e-05 s, 0.0 kB/s
>   diskfarm:root:11:~> ls -goh /mnt/750Graid5md/tmp/4Traid5md.xfs_d.iso
>   -rw-r--r-- 1 4.0G Feb  7 21:15 /mnt/750Graid5md/tmp/4Traid5md.xfs_d.iso
>   diskfarm:root:11:~> xfs_mdrestore /mnt/750Graid5md/tmp/4Traid5md.xfs_d.out /mnt/750Graid5md/tmp/4Traid5md.xfs_d.iso
>   xfs_mdrestore: cannot set filesystem image size: File too large
> 
> and got an error :-(  Should a 4G file be large enough for a 3.5G
> metadata dump?
> 

The restored image file is too large for the underlying filesystem to
support. Note that the output file's size will match the size of the
original fs, even though the image may only consume 3.5G worth of actual
space. What is the underlying fs? You may need to find somewhere you can
restore this file onto another XFS filesystem.
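As a sketch (placeholder paths, untested), you can check what you're
dealing with and whether a candidate target can even hold a ~12T sparse
file before re-running the restore:

  stat -f /mnt/750Graid5md                    # what fs is the current target?
  truncate -s 12T /mnt/somexfs/probe && rm /mnt/somexfs/probe   # can the new target hold a 12T sparse file?
  xfs_mdrestore -g /mnt/750Graid5md/tmp/4Traid5md.xfs_d.out /mnt/somexfs/4Traid5md.img
  ls -ls /mnt/somexfs/4Traid5md.img           # apparent size ~12T, allocated blocks only a few G

where /mnt/somexfs is wherever you end up putting it.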

Brian

> 
> % 
> % Then you can mount/xfs_repair the restored sparse image, see what
> % xfs_repair does, mount the before/after img, etc. Note again that file
> % data is absent from the restored metadata image so don't expect to be
> % able to look at file content in the metadump image.
> 
> Right.  That sounds like a great middle step, though.  Thanks!
> 
> 
> % 
> % Brian
> 
> 
> HAND
> 
> :-D
> -- 
> David T-G
> See http://justpickone.org/davidtg/email/
> See http://justpickone.org/davidtg/tofu.txt
> 


* Re: is it safe to xfs_repair this volume? do i have a different first step?
  2019-02-07 13:25 is it safe to xfs_repair this volume? do i have a different first step? David T-G
  2019-02-07 14:52 ` Brian Foster
@ 2019-02-08 18:40 ` Chris Murphy
  1 sibling, 0 replies; 6+ messages in thread
From: Chris Murphy @ 2019-02-08 18:40 UTC (permalink / raw)
  To: David T-G; +Cc: Linux-XFS list

On Thu, Feb 7, 2019 at 6:30 AM David T-G <davidtg@justpickone.org> wrote:
>
>   diskfarm:root:4:~> parted /dev/md0 print
>   Error: end of file while reading /dev/md0
>   Retry/Ignore/Cancel? ignore
>   Error: The backup GPT table is corrupt, but the primary appears OK, so that will be used.
[snip]
> when poking, I at first thought that this was a RAID issue, but all of
> the md reports look good and apparently the GPT table issue is common, so
> I'll leave all of that out unless someone asks for it.

A corrupt backup GPT is a huge red flag that there's user confusion,
which has then led to the storage stack itself becoming confused.

GPT partitioning an array, in particular with just one partition, seems
unnecessarily complicated and thus pointless, so I'm suspicious that
/dev/md0 is not in fact partitioned - that this GPT may very well belong
to the first member device of the array, not to the array itself. And
the reason the backup looks "corrupt" is that parted and fdisk are
looking at the end of /dev/md0 rather than at the end of the device this
GPT actually belongs to.

So I suspect GPT and XFS have stepped on each other, possibly more than
once each, which is why both show corruption while the mdadm metadata
doesn't. It's also possible that one or more signatures in this storage
stack are stale - never properly wiped - and are now haunting it.

I wouldn't make any writes until you've double-checked what the layout
is supposed to be. First check whether the individual member drives are
GPT partitioned, and whether their primary and backup tables are valid
(not corrupt); if there's corruption, don't fix it yet. Right now you
just need to focus on what all of the on-disk metadata says is true, and
then you'll be able to discover which metadata is wrong and contributing
to all this confusion.
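Something along these lines (read-only, rough sketch, drive letters
taken from your mdadm output) will show what every layer claims without
changing anything:

  for d in sda sdb sde sdf ; do
    echo "== /dev/$d"
    gdisk -l /dev/$d            # prints the drive's GPT, warns if either copy is damaged
    wipefs /dev/$d /dev/${d}1   # lists every signature (gpt, mdraid, xfs, ...) and its offset
  done
  gdisk -l /dev/md0             # and the array's own GPT

With no options wipefs only reports, it doesn't wipe.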


-- 
Chris Murphy


* Re: is it safe to xfs_repair this volume? do i have a different first step?
  2019-02-08  2:25   ` David T-G
  2019-02-08 13:00     ` Brian Foster
@ 2019-02-08 19:45     ` Chris Murphy
  1 sibling, 0 replies; 6+ messages in thread
From: Chris Murphy @ 2019-02-08 19:45 UTC (permalink / raw)
  To: David T-G; +Cc: Linux-XFS list

On Thu, Feb 7, 2019 at 7:25 PM David T-G <davidtg@justpickone.org> wrote:
>
>   diskfarm:root:6:~> mdadm --detail /dev/md0
>   /dev/md0:
>           Version : 1.2

Version 1.2 metadata sits at a 4K offset from the start of the member
device. The member devices in your case:

>       Number   Major   Minor   RaidDevice State
>          0       8       17        0      active sync   /dev/sdb1
>          1       8       65        1      active sync   /dev/sde1
>          3       8       81        2      active sync   /dev/sdf1
>          4       8        1        3      active sync   /dev/sda1

That means those member devices are partitioned. The primary GPT will be
in the first 34 512-byte sectors, and the backup GPT in the last 34
512-byte sectors, of each physical drive. The mdadm v1.2 superblock is
located 4K from the start of the partition designated as a member of the
array. And mdadm will only write within that partition, which means each
member device's backup GPT should be immune from being written to by md
and XFS.

Since there's a 512KiB chunk size, and the array is clearly also
partitioned, the array's primary GPT is on one member device soon after
the mdadm superblock, and the array's backup GPT is on a different
member device immediately before that device's own backup GPT. I can't
think of a reason for a conflict off the top of my head, and yet there's
a conflict somewhere, since you have independent corruptions: XFS and GPT.

Just - whatever you do, don't fix anything yet. Here's an idea for
setting up an overlay so you can test your repairs by writing changes
elsewhere, without touching the original drives:
https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file
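Very roughly, the wiki approach per member device looks like this
(sketch only, sizes and names are placeholders -- read the page before
doing anything):

  truncate -s 10G /tmp/overlay-sdb1          # sparse COW file to absorb writes
  loop=$(losetup -f --show /tmp/overlay-sdb1)
  dmsetup create overlay-sdb1 --table "0 $(blockdev --getsz /dev/sdb1) snapshot /dev/sdb1 $loop P 8"

Repeat for each member, assemble a test array from the /dev/mapper/
overlay devices instead of the real partitions, and do all your
experimenting there.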

I suggest getting advice on the linux-raid list before proceeding, to
find out why it appears XFS and the array's backup GPT are being stepped
on. They'll want to see the partitioning for every device (both primary
and backup if they aren't identical, i.e. one is corrupt), the full
superblock for each device, and the GPT for the array, plus what version
of mdadm was used to create the array. They'll also want smartctl -x for
each drive, 'smartctl -l scterc' from each drive, and what the kernel
command timer is set to for each drive:
# cat /sys/block/sdX/device/timeout
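Something like this collects all of it in one pass (drive letters
assumed from your mdadm output):

  for d in sda sdb sde sdf ; do
    echo "===== /dev/$d"
    smartctl -x /dev/$d
    smartctl -l scterc /dev/$d
    cat /sys/block/$d/device/timeout
  done > raid-drive-info.txt 2>&1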

I imagine you're going to get asked why you bothered partitioning each
drive with one partition, and then partitioning the array too, also with
one partition. That's overly complicated and serves no purpose. Next
time, make each whole drive an mdadm member, and then format the array
directly.

People lose their data all the time due to user error, so I can't
recommend enough that you sanity check what you've done and what you
intend to do, on each applicable list, using linux-raid for the mdadm
stuff. And for god's sake, if you care at all about this data, you need
at least one backup copy.

-- 
Chris Murphy

