* Weird xfs_repair error
@ 2017-07-06 13:30 Emmanuel Florac
  2017-07-06 13:48 ` Brian Foster
  2017-07-06 23:28 ` Dave Chinner
  0 siblings, 2 replies; 15+ messages in thread
From: Emmanuel Florac @ 2017-07-06 13:30 UTC (permalink / raw)
  To: 'linux-xfs@vger.kernel.org'

After a RAID controller went bananas, I encountered an XFS corruption
on a filesystem. Weirdly, the corruption seems to be mostly located in
lost+found.

(I'm currently working on a metadump'd image of course, not the real
thing; there are 90TB of data to be hopefully salvaged in there).

"ls /mnt/rescue/lost+found" gave this:

XFS (loop0): metadata I/O error: block 0x22b03f490
("xfs_trans_read_buf_map") error 117 numblks 16 
XFS (loop0): xfs_imap_to_bp: xfs_trans_read_buf() returned error 117.
XFS (loop0): Corruption detected. Unmount and run xfs_repair 
XFS (loop0): Corruption detected. Unmount and run xfs_repair

I've run xfs_repair 4.9 on the xfs_mdrestored image. It dumps an insane
lot of errors (the output log is 65MB)  and ends with this very strange
message:

disconnected inode 26417467, moving to lost+found
disconnected inode 26417468, moving to lost+found
disconnected inode 26417469, moving to lost+found
disconnected inode 26417470, moving to lost+found

fatal error -- name create failed in lost+found (117), filesystem may
be out of space

Even stranger, after mounting back the image, there is no lost+found
anywhere to be found! However the filesystem has lots of free space and
free inodes, how come?

df -i
Filesystem                    Inodes   IUsed      IFree IUse% Mounted on
rootfs                             0       0          0     - /
/dev/root                          0       0          0     - /
tmpfs                        2058692     990    2057702    1% /run
tmpfs                        2058692       6    2058686    1% /run/lock
tmpfs                        2058692    1623    2057069    1% /dev
tmpfs                        2058692       3    2058689    1% /run/shm
guitare:/mnt/raid/partage   33554432  305069   33249363    1% /mnt/qnap1
/dev/loop0                4914413568 5199932 4909213636    1% /mnt/rescue

df
/dev/loop0                122858252288 88827890868 34030361420  73% /mnt/rescue
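As a sanity check on the df figures above, the short sketch below converts the block counts to TiB and recomputes the usage percentages. It assumes df's default 1 KiB block size and that df rounds percentages up; the helper names are illustrative only.

```python
import math

KIB_PER_TIB = 2 ** 30  # 1 TiB = 2^30 KiB

def kib_to_tib(kib):
    """Convert a count of 1 KiB blocks (assumed df default) to TiB."""
    return kib / KIB_PER_TIB

def use_percent(used, free):
    """Usage percentage rounded up (assumed df behaviour)."""
    return math.ceil(100 * used / (used + free))

# Figures copied from the /dev/loop0 lines of the df and df -i output.
blocks = {"total": 122858252288, "used": 88827890868, "avail": 34030361420}
inodes = {"total": 4914413568, "used": 5199932, "free": 4909213636}

for key in ("total", "used", "avail"):
    print(f"{key:>5}: {kib_to_tib(blocks[key]):.1f} TiB")
print("block use%:", use_percent(blocks["used"], blocks["avail"]))
print("inode use%:", use_percent(inodes["used"], inodes["free"]))
```

By these numbers roughly a quarter of the blocks and nearly all of the inodes are free, so a genuine out-of-space condition looks unlikely.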

I'll give a shot to a newer version of xfs_repair just in case...

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------


* Re: Weird xfs_repair error
  2017-07-06 13:30 Weird xfs_repair error Emmanuel Florac
@ 2017-07-06 13:48 ` Brian Foster
  2017-07-06 14:49   ` Emmanuel Florac
  2017-07-06 23:28 ` Dave Chinner
  1 sibling, 1 reply; 15+ messages in thread
From: Brian Foster @ 2017-07-06 13:48 UTC (permalink / raw)
  To: Emmanuel Florac; +Cc: 'linux-xfs@vger.kernel.org'

On Thu, Jul 06, 2017 at 03:30:20PM +0200, Emmanuel Florac wrote:
> 
> After a RAID controller went bananas, I encountered an XFS corruption
> on a filesystem. Weirdly, the corruption seems to be mostly located in
> lost+found.
> 
> (I'm currently working on a metadump'd image of course, not the real
> thing; there are 90TB of data to be hopefully salvaged in there).
> 
> "ls /mnt/rescue/lost+found" gave this:
> 
> XFS (loop0): metadata I/O error: block 0x22b03f490
> ("xfs_trans_read_buf_map") error 117 numblks 16 
> XFS (loop0): xfs_imap_to_bp: xfs_trans_read_buf() returned error 117.
> XFS (loop0): Corruption detected. Unmount and run xfs_repair 
> XFS (loop0): Corruption detected. Unmount and run xfs_repair
> 
> I've run xfs_repair 4.9 on the xfs_mdrestored image. It dumps an insane
> lot of errors (the output log is 65MB)  and ends with this very strange
> message:
> 
> disconnected inode 26417467, moving to lost+found
> disconnected inode 26417468, moving to lost+found
> disconnected inode 26417469, moving to lost+found
> disconnected inode 26417470, moving to lost+found
> 
> fatal error -- name create failed in lost+found (117), filesystem may
> be out of space
> 
> Even stranger, after mounting back the image, there is no lost+found
> anywhere to be found! However the filesystem has lots of free space and
> free inodes, how come?
> 

Did you originally run xfs_repair using the -n option? I'd guess not if
it ultimately failed making a modification, but if so, something to be
aware of is that it skips warning about a dirty log and potentially can
report much more corruption than after a log recovery occurs. It might
be worth running after an attempted log recovery.

Otherwise, I'd be curious about the state of the fs after the above
error. Does 'xfs_repair -n' continue to report errors?

Also the above suggests that lost+found existed (in a corrupted state)
prior to the initial repair attempt, yes? If so, it might be interesting
to identify the inode # of lost+found to follow what xfs_repair does to
that inode during the initial run (e.g., if lost+found is corrupted and
is attempted to be used before it is fixed up or something of that
nature).

Brian

> df -i
> Filesystem                    Inodes   IUsed      IFree IUse% Mounted on
> rootfs                             0       0          0     - /
> /dev/root                          0       0          0     - /
> tmpfs                        2058692     990    2057702    1% /run
> tmpfs                        2058692       6    2058686    1% /run/lock
> tmpfs                        2058692    1623    2057069    1% /dev
> tmpfs                        2058692       3    2058689    1% /run/shm
> guitare:/mnt/raid/partage   33554432  305069   33249363    1% /mnt/qnap1
> /dev/loop0                4914413568 5199932 4909213636    1% /mnt/rescue
> 
> df
> /dev/loop0                122858252288 88827890868 34030361420  73% /mnt/rescue
> 
> I'll give a shot to a newer version of xfs_repair just in case...
> 
> -- 
> ------------------------------------------------------------------------
> Emmanuel Florac     |   Direction technique
>                     |   Intellique
>                     |	<eflorac@intellique.com>
>                     |   +33 1 78 94 84 02
> ------------------------------------------------------------------------




* Re: Weird xfs_repair error
  2017-07-06 13:48 ` Brian Foster
@ 2017-07-06 14:49   ` Emmanuel Florac
  0 siblings, 0 replies; 15+ messages in thread
From: Emmanuel Florac @ 2017-07-06 14:49 UTC (permalink / raw)
  To: Brian Foster; +Cc: 'linux-xfs@vger.kernel.org'

On Thu, 6 Jul 2017 09:48:05 -0400
Brian Foster <bfoster@redhat.com> wrote:

> > 
> > I've run xfs_repair 4.9 on the xfs_mdrestored image. It dumps an
> > insane lot of errors (the output log is 65MB)  and ends with this
> > very strange message:
> > 
> > disconnected inode 26417467, moving to lost+found
> > disconnected inode 26417468, moving to lost+found
> > disconnected inode 26417469, moving to lost+found
> > disconnected inode 26417470, moving to lost+found
> > 
> > fatal error -- name create failed in lost+found (117), filesystem
> > may be out of space
> > 
> > Even stranger, after mounting back the image, there is no lost+found
> > anywhere to be found! However the filesystem has lots of free space
> > and free inodes, how come?
> >   
> 
> Did you originally run xfs_repair using the -n option? I'd guess not
> if it ultimately failed making a modification, but if so, something
> to be aware of is that it skips warning about a dirty log and
> potentially can report much more corruption than after a log recovery
> occurs. It might be worth running after an attempted log recovery.

I've mounted the FS first to clean up the log. I've also tried making a
bigger image, in case the hosting file was too small. No dice.
 
> Otherwise, I'd be curious about the state of the fs after the above
> error. Does 'xfs_repair -n' continue to report errors?

You're onto something here. In fact each time I re-run xfs_repair, it
still spits out many errors and ends with the same line as I mentioned
previously.
However each run of xfs_repair generates fewer errors. The first log
was 65MB, the second 7.5MB, the third 3.8MB. I'll try running it again
and again to see how it ends...
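The "run it again and again" approach can be sketched as a small loop. This is only an illustration under stated assumptions: it relies on the documented behaviour that 'xfs_repair -n' exits with status 1 when corruption is detected and 0 otherwise, and the function name, the injected `run` callable and the pass limit are hypothetical. A real runner could be `lambda argv: subprocess.run(argv).returncode`; the one below is simulated so the sketch runs without touching any filesystem.

```python
def repair_until_clean(run, device, max_passes=5):
    """Alternate modifying and no-modify passes until the check is clean.

    `run` takes an argv list and returns the command's exit status.
    Returns the number of repair passes used, or None if the filesystem
    still reports corruption after max_passes.
    """
    for n in range(1, max_passes + 1):
        run(["xfs_repair", device])                 # modifying pass
        if run(["xfs_repair", "-n", device]) == 0:  # 0: no corruption detected
            return n
    return None

# Simulated runner: the check pass reports corruption twice, then clean.
check_results = iter([1, 1, 0])

def simulated_run(argv):
    return next(check_results) if "-n" in argv else 0

print(repair_until_clean(simulated_run, "/dev/loop0"))
```

Injecting the runner keeps the loop testable and makes it easy to point at a restored metadump image rather than the real device.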

> Also the above suggests that lost+found existed (in a corrupted state)
> prior to the initial repair attempt, yes? If so, it might be
> interesting to identify the inode # of lost+found to follow what
> xfs_repair does to that inode during the initial run (e.g., if
> lost+found is corrupted and is attempted to be used before it is
> fixed up or something of that nature).

OK I'll try to restore the dump again to check "lost+found". Maybe I
could remove it before running the repair, but that's unlikely...

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------


* Re: Weird xfs_repair error
  2017-07-06 13:30 Weird xfs_repair error Emmanuel Florac
  2017-07-06 13:48 ` Brian Foster
@ 2017-07-06 23:28 ` Dave Chinner
  2017-07-07 11:36   ` Emmanuel Florac
  2017-07-07 11:50   ` Emmanuel Florac
  1 sibling, 2 replies; 15+ messages in thread
From: Dave Chinner @ 2017-07-06 23:28 UTC (permalink / raw)
  To: Emmanuel Florac; +Cc: 'linux-xfs@vger.kernel.org'

On Thu, Jul 06, 2017 at 03:30:20PM +0200, Emmanuel Florac wrote:
> 
> After a RAID controller went bananas, I encountered an XFS corruption
> on a filesystem. Weirdly, the corruption seems to be mostly located in
> lost+found.
> 
> (I'm currently working on a metadump'd image of course, not the real
> thing; there are 90TB of data to be hopefully salvaged in there).
> 
> "ls /mnt/rescue/lost+found" gave this:
> 
> XFS (loop0): metadata I/O error: block 0x22b03f490
> ("xfs_trans_read_buf_map") error 117 numblks 16 
> XFS (loop0): xfs_imap_to_bp: xfs_trans_read_buf() returned error 117.
> XFS (loop0): Corruption detected. Unmount and run xfs_repair 
> XFS (loop0): Corruption detected. Unmount and run xfs_repair
> 
> I've run xfs_repair 4.9 on the xfs_mdrestored image. It dumps an insane
> lot of errors (the output log is 65MB)  and ends with this very strange
> message:
> 
> disconnected inode 26417467, moving to lost+found
> disconnected inode 26417468, moving to lost+found
> disconnected inode 26417469, moving to lost+found
> disconnected inode 26417470, moving to lost+found
> 
> fatal error -- name create failed in lost+found (117), filesystem may
> be out of space

Error 117. That's EFSCORRUPTED, not ENOSPC.  IOWs, lost+found was
corrupted as it was being modified by xfs_repair.
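For readers decoding these numbers themselves, here is a minimal lookup sketch. The errno values are the Linux ones, hardcoded so the example does not depend on the platform it runs on; XFS defines EFSCORRUPTED as an alias for EUCLEAN and EFSBADCRC as an alias for EBADMSG. The table and function names are illustrative only.

```python
# Linux errno values relevant here: number -> (symbolic name, strerror text).
LINUX_ERRNO = {
    28: ("ENOSPC", "No space left on device"),
    74: ("EBADMSG", "Bad message"),
    117: ("EUCLEAN", "Structure needs cleaning"),
}

# XFS gives two of these codes filesystem-specific names.
XFS_ALIAS = {"EUCLEAN": "EFSCORRUPTED", "EBADMSG": "EFSBADCRC"}

def decode(code):
    """Return a one-line description of an error number seen in XFS output."""
    name, text = LINUX_ERRNO[code]
    alias = XFS_ALIAS.get(name)
    label = f"{name} (XFS: {alias})" if alias else name
    return f"{code}: {label}, {text}"

for code in (117, 28):
    print(decode(code))
```

The "filesystem may be out of space" text in the fatal error is a fixed hint; the number in parentheses is what identifies the actual failure.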

> Even stranger, after mounting back the image, there is no lost+found
> anywhere to be found! However the filesystem has lots of free space and
> free inodes, how come?

Because lost+found was corrupted.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Weird xfs_repair error
  2017-07-06 23:28 ` Dave Chinner
@ 2017-07-07 11:36   ` Emmanuel Florac
  2017-07-07 11:50   ` Emmanuel Florac
  1 sibling, 0 replies; 15+ messages in thread
From: Emmanuel Florac @ 2017-07-07 11:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: 'linux-xfs@vger.kernel.org'

On Fri, 7 Jul 2017 09:28:03 +1000
Dave Chinner <david@fromorbit.com> wrote:

> > 
> > fatal error -- name create failed in lost+found (117), filesystem
> > may be out of space  
> 
> Error 117. That's EFSCORRUPTED, not ENOSPC.  IOWs, lost+found was
> corrupted as it was being modified by xfs_repair.
> 
> > Even stranger, after mounting back the image, there is no lost+found
> > anywhere to be found! However the filesystem has lots of free space
> > and free inodes, how come?  
> 
> Because lost+found was corrupted.
> 
I've tried again with xfs_repair 4.11, with no better result. At the end
I've got a "lost+found" directory; however, if I run xfs_repair again
and again it still ends with the same error... I'll try moving the old
lost+found out of the way.

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------


* Re: Weird xfs_repair error
  2017-07-06 23:28 ` Dave Chinner
  2017-07-07 11:36   ` Emmanuel Florac
@ 2017-07-07 11:50   ` Emmanuel Florac
  2017-07-07 15:36     ` Darrick J. Wong
  1 sibling, 1 reply; 15+ messages in thread
From: Emmanuel Florac @ 2017-07-07 11:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: 'linux-xfs@vger.kernel.org'

On Fri, 7 Jul 2017 09:28:03 +1000
Dave Chinner <david@fromorbit.com> wrote:

> > disconnected inode 26417468, moving to lost+found
> > disconnected inode 26417469, moving to lost+found
> > disconnected inode 26417470, moving to lost+found
> > 
> > fatal error -- name create failed in lost+found (117), filesystem
> > may be out of space  
> 
> Error 117. That's EFSCORRUPTED, not ENOSPC.  IOWs, lost+found was
> corrupted as it was being modified by xfs_repair.
> 
> > Even stranger, after mounting back the image, there is no lost+found
> > anywhere to be found! However the filesystem has lots of free space
> > and free inodes, how come?  
> 
> Because lost+found was corrupted.

Actually with xfs_repair 4.11 it's a different error in the end:

disconnected inode 180399699751, moving to lost+found
disconnected inode 180399699752, moving to lost+found

fatal error -- name create failed in lost+found (28), filesystem may be
out of space


-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------


* Re: Weird xfs_repair error
  2017-07-07 11:50   ` Emmanuel Florac
@ 2017-07-07 15:36     ` Darrick J. Wong
  2017-07-10 17:29       ` Emmanuel Florac
  2017-07-11 13:23       ` Emmanuel Florac
  0 siblings, 2 replies; 15+ messages in thread
From: Darrick J. Wong @ 2017-07-07 15:36 UTC (permalink / raw)
  To: Emmanuel Florac; +Cc: Dave Chinner, 'linux-xfs@vger.kernel.org'

On Fri, Jul 07, 2017 at 01:50:09PM +0200, Emmanuel Florac wrote:
> On Fri, 7 Jul 2017 09:28:03 +1000
> Dave Chinner <david@fromorbit.com> wrote:
> 
> > > disconnected inode 26417468, moving to lost+found
> > > disconnected inode 26417469, moving to lost+found
> > > disconnected inode 26417470, moving to lost+found
> > > 
> > > fatal error -- name create failed in lost+found (117), filesystem
> > > may be out of space  
> > 
> > Error 117. That's EFSCORRUPTED, not ENOSPC.  IOWs, lost+found was
> > corrupted as it was being modified by xfs_repair.
> > 
> > > Even stranger, after mounting back the image, there is no lost+found
> > > anywhere to be found! However the filesystem has lots of free space
> > > and free inodes, how come?  
> > 
> > Because lost+found was corrupted.
> 
> Actually with xfs_repair 4.11 it's a different error in the end:
> 
> disconnected inode 180399699751, moving to lost+found
> disconnected inode 180399699752, moving to lost+found
> 
> fatal error -- name create failed in lost+found (28), filesystem may be
> out of space

Would be helpful to have a metadump of this goobered-up lost+found fs...

--D

> 
> 
> -- 
> ------------------------------------------------------------------------
> Emmanuel Florac     |   Direction technique
>                     |   Intellique
>                     |	<eflorac@intellique.com>
>                     |   +33 1 78 94 84 02
> ------------------------------------------------------------------------




* Re: Weird xfs_repair error
  2017-07-07 15:36     ` Darrick J. Wong
@ 2017-07-10 17:29       ` Emmanuel Florac
  2017-07-11 13:23       ` Emmanuel Florac
  1 sibling, 0 replies; 15+ messages in thread
From: Emmanuel Florac @ 2017-07-10 17:29 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, 'linux-xfs@vger.kernel.org'

On Fri, 7 Jul 2017 08:36:33 -0700
"Darrick J. Wong" <darrick.wong@oracle.com> wrote:

> > disconnected inode 180399699751, moving to lost+found
> > disconnected inode 180399699752, moving to lost+found
> > 
> > fatal error -- name create failed in lost+found (28), filesystem
> > may be out of space  
> 
> Would be helpful to have a metadump of this goobered-up lost+found
> fs...
> 

Well, annoyingly this FS is really big, and the metadump is 5.3GB. It
will be several days before I make it available somewhere online...

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------


* Re: Weird xfs_repair error
  2017-07-07 15:36     ` Darrick J. Wong
  2017-07-10 17:29       ` Emmanuel Florac
@ 2017-07-11 13:23       ` Emmanuel Florac
  2017-07-17 17:11         ` Brian Foster
  1 sibling, 1 reply; 15+ messages in thread
From: Emmanuel Florac @ 2017-07-11 13:23 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, 'linux-xfs@vger.kernel.org'

On Fri, 7 Jul 2017 08:36:33 -0700
"Darrick J. Wong" <darrick.wong@oracle.com> wrote:

> > fatal error -- name create failed in lost+found (28), filesystem
> > may be out of space  
> 
> Would be helpful to have a metadump of this goobered-up lost+found
> fs...
> 

The metadump is here for anyone who would like to have a look:

http://update2.intellique.com/pub/bign.metadump.xz

The filesystem is about 115 TiB.

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------


* Re: Weird xfs_repair error
  2017-07-11 13:23       ` Emmanuel Florac
@ 2017-07-17 17:11         ` Brian Foster
  2017-07-24 14:27           ` Emmanuel Florac
  0 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2017-07-17 17:11 UTC (permalink / raw)
  To: Emmanuel Florac
  Cc: Darrick J. Wong, Dave Chinner, 'linux-xfs@vger.kernel.org'

On Tue, Jul 11, 2017 at 03:23:52PM +0200, Emmanuel Florac wrote:
> On Fri, 7 Jul 2017 08:36:33 -0700
> "Darrick J. Wong" <darrick.wong@oracle.com> wrote:
> 
> > > fatal error -- name create failed in lost+found (28), filesystem
> > > may be out of space  
> > 
> > Would be helpful to have a metadump of this goobered-up lost+found
> > fs...
> > 
> 
> The metadump is here for anyone who would like to have a look:
> 
> http://update2.intellique.com/pub/bign.metadump.xz
> 
> The filesystem is about 115 TiB.
> 

Thanks for posting this. The first thing to note is that this filesystem
is severely corrupted. Nonetheless, I've been playing around with trying
to get the latest for-next xfs_repair to run through this fs (via gdb)
and have definitely hit a few issues:

- xfs_sb_verify() was changed to use bp->b_maps[0].bm_bn rather than
  bp->b_bn in libxfs commit 85428dd23f ("xfs: fix superblock inprogress
  check"). b_maps isn't allocated if the buffer was initialized with
  libxfs_initbuf() (rather than libxfs_initbuf_map()). This causes a
  sigsegv here, though only if I disable -O2 optimization for some
  reason that I haven't dug into yet.
- libxfs commit 0268fdc3fe ("xfs: remove xfs_trans_get_block_res")
  replaced the use of xfs_trans_get_block_res() in
  xfs_bmbt_alloc_block() which causes the -ENOSPC error. The previous
  function was hardcoded to return 1 such that this would never occur.
- The recently added directory sf format verifier (xfs_iformat_fork() ->
  xfs_dir2_sf_verify()) seems to cause a premature repair failure in at
  least one case.
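To make the second item concrete, here is a toy Python model, not the libxfs code. It assumes that the allocation path refuses with -ENOSPC when the transaction reports zero reserved blocks, and that xfs_repair's transaction carries no block reservation; the dictionary-based transaction and the helper names are invented for illustration.

```python
ENOSPC = 28

def stub_get_block_res(tp):
    # Old userspace helper as described above: hardcoded to return 1.
    return 1

def real_block_res(tp):
    # After the change, the transaction's actual reservation is consulted.
    return tp["blk_res"]

def bmbt_alloc_block(tp, block_res):
    # Assumed guard: with no reserved blocks the allocation is refused.
    if block_res(tp) == 0:
        return -ENOSPC
    return 0

repair_tp = {"blk_res": 0}  # hypothetical transaction with no block reservation

print(bmbt_alloc_block(repair_tp, stub_get_block_res))  # old behaviour: succeeds
print(bmbt_alloc_block(repair_tp, real_block_res))      # new behaviour: -ENOSPC
```

Under these assumptions the filesystem need not be short of space at all; the error would come from the reservation check alone.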

I was able to eventually get repair to complete with some quick hacks to
bypass those issues. I did have to run repair two or three times to get
the fs to a clean state. The fs mounts and otherwise appears clean to
xfs_repair, but it's not clear to me how usable the resulting fs really
is (repair is for fs consistency after all, not necessarily data
recovery). Note that lost+found appears to be loaded with 18T of data
across almost 2 million inodes. :/

Brian

> -- 
> ------------------------------------------------------------------------
> Emmanuel Florac     |   Direction technique
>                     |   Intellique
>                     |	<eflorac@intellique.com>
>                     |   +33 1 78 94 84 02
> ------------------------------------------------------------------------




* Re: Weird xfs_repair error
  2017-07-17 17:11         ` Brian Foster
@ 2017-07-24 14:27           ` Emmanuel Florac
  2017-07-24 14:51             ` Brian Foster
  0 siblings, 1 reply; 15+ messages in thread
From: Emmanuel Florac @ 2017-07-24 14:27 UTC (permalink / raw)
  To: Brian Foster
  Cc: Darrick J. Wong, Dave Chinner, 'linux-xfs@vger.kernel.org'

On Mon, 17 Jul 2017 13:11:29 -0400
Brian Foster <bfoster@redhat.com> wrote:

> On Tue, Jul 11, 2017 at 03:23:52PM +0200, Emmanuel Florac wrote:
> > On Fri, 7 Jul 2017 08:36:33 -0700
> > "Darrick J. Wong" <darrick.wong@oracle.com> wrote:
> >   
> > > > fatal error -- name create failed in lost+found (28), filesystem
> > > > may be out of space    
> > > 
> > > Would be helpful to have a metadump of this goobered-up lost+found
> > > fs...
> > >   
> > 
> > The metadump is here for anyone who would like to have a look:
> > 
> > http://update2.intellique.com/pub/bign.metadump.xz
> > 
> > The filesystem is about 115 TiB.
> >   
> 
> Thanks for posting this. The first thing to note is that this
> filesystem is severely corrupted.

This I have determined myself through the fact that many runs of
xfs_repair (and different versions of it, v4.7, 4.9, 4.11...) can't get
it into a stable (i.e. that won't crash while trying to access it)
state.

> Nonetheless, I've been playing
> around with trying to get the latest for-next xfs_repair to run
> through this fs (via gdb) and have definitely hit a few issues:
> 
> - xfs_sb_verify() was changed to use bp->b_maps[0].bm_bn rather than
>   bp->b_bn in libxfs commit 85428dd23f ("xfs: fix superblock inprogress
>   check"). b_maps isn't allocated if the buffer was initialized with
>   libxfs_initbuf() (rather than libxfs_initbuf_map()). This causes a
>   sigsegv here, though only if I disable -O2 optimization for some
>   reason that I haven't dug into yet.
> - libxfs commit 0268fdc3fe ("xfs: remove xfs_trans_get_block_res")
>   replaced the use of xfs_trans_get_block_res() in
>   xfs_bmbt_alloc_block() which causes the -ENOSPC error. The previous
>   function was hardcoded to return 1 such that this would never occur.
> - The recently added directory sf format verifier (xfs_iformat_fork() ->
>   xfs_dir2_sf_verify()) seems to cause a premature repair failure in at
>   least one case.
> 
> I was able to eventually get repair to complete with some quick hacks
> to bypass those issues. I did have to run repair two or three times
> to get the fs to a clean state. The fs mounts and otherwise appears
> clean to xfs_repair, but it's not clear to me how usable the
> resulting fs really is (repair is for fs consistency after all, not
> necessarily data recovery). Note that lost+found appears to be loaded
> with 18T of data across almost 2 million inodes. :/

Thank you for your efforts, the loaded lost+found matches my own
results, however some of the files there have been present for possibly
years. In fact this filesystem has crashed several times in the past
years but always went back online at some point, until... now.

So what could I do, at least to be able to mount it and copy everything
elsewhere before mkfs'ing it all again? Do you have an xfs_repair
binary at hand that I could use, or should I dig into the latest
source?

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------


* Re: Weird xfs_repair error
  2017-07-24 14:27           ` Emmanuel Florac
@ 2017-07-24 14:51             ` Brian Foster
  2017-07-25 16:44               ` Emmanuel Florac
  2017-07-25 17:16               ` Emmanuel Florac
  0 siblings, 2 replies; 15+ messages in thread
From: Brian Foster @ 2017-07-24 14:51 UTC (permalink / raw)
  To: Emmanuel Florac
  Cc: Darrick J. Wong, Dave Chinner, 'linux-xfs@vger.kernel.org'

On Mon, Jul 24, 2017 at 04:27:28PM +0200, Emmanuel Florac wrote:
> On Mon, 17 Jul 2017 13:11:29 -0400
> Brian Foster <bfoster@redhat.com> wrote:
> 
> > On Tue, Jul 11, 2017 at 03:23:52PM +0200, Emmanuel Florac wrote:
> > > On Fri, 7 Jul 2017 08:36:33 -0700
> > > "Darrick J. Wong" <darrick.wong@oracle.com> wrote:
> > >   
> > > > > fatal error -- name create failed in lost+found (28), filesystem
> > > > > may be out of space    
> > > > 
> > > > Would be helpful to have a metadump of this goobered-up lost+found
> > > > fs...
> > > >   
> > > 
> > > The metadump is here for anyone who would like to have a look:
> > > 
> > > http://update2.intellique.com/pub/bign.metadump.xz
> > > 
> > > The filesystem is about 115 TiB.
> > >   
> > 
> > Thanks for posting this. The first thing to note is that this
> > filesystem is severely corrupted.
> 
> This I have determined myself through the fact that many runs of
> xfs_repair (and different versions of it, v4.7, 4.9, 4.11...) can't get
> it into a stable (i.e. that won't crash while trying to access it)
> state.
> 
> > Nonetheless, I've been playing
> > around with trying to get the latest for-next xfs_repair to run
> > through this fs (via gdb) and have definitely hit a few issues:
> > 
> > - xfs_sb_verify() was changed to use bp->b_maps[0].bm_bn rather than
> >   bp->b_bn in libxfs commit 85428dd23f ("xfs: fix superblock inprogress
> >   check"). b_maps isn't allocated if the buffer was initialized with
> >   libxfs_initbuf() (rather than libxfs_initbuf_map()). This causes a
> >   sigsegv here, though only if I disable -O2 optimization for some
> >   reason that I haven't dug into yet.
> > - libxfs commit 0268fdc3fe ("xfs: remove xfs_trans_get_block_res")
> >   replaced the use of xfs_trans_get_block_res() in
> >   xfs_bmbt_alloc_block() which causes the -ENOSPC error. The previous
> >   function was hardcoded to return 1 such that this would never occur.
> > - The recently added directory sf format verifier (xfs_iformat_fork() ->
> >   xfs_dir2_sf_verify()) seems to cause a premature repair failure in at
> >   least one case.
> > 
> > I was able to eventually get repair to complete with some quick hacks
> > to bypass those issues. I did have to run repair two or three times
> > to get the fs to a clean state. The fs mounts and otherwise appears
> > clean to xfs_repair, but it's not clear to me how usable the
> > resulting fs really is (repair is for fs consistency after all, not
> > necessarily data recovery). Note that lost+found appears to be loaded
> > with 18T of data across almost 2 million inodes. :/
> 
> Thank you for your efforts, the loaded lost+found matches my own
> results, however some of the files there have been present for possibly
> years. In fact this filesystem has crashed several times in the past
> years but always went back online at some point, until... now.
> 
> So what could I do, at least to be able to mount it and copy everything
> elsewhere before mkfs'ing it all again? Do you have an xfs_repair
> binary at hand that I could use, or should I dig into the latest
> source?
> 

There are several fixes in-flight for the issues uncovered by this
metadump. I think you'll want to include the following 3 patches to
xfsprogs:

http://marc.info/?l=linux-xfs&m=150047977108174&w=2
http://marc.info/?l=linux-xfs&m=150040481220074&w=2
http://marc.info/?l=linux-xfs&m=150040481820076&w=2

Note that the last 2 patches are probably going to be reworked into a
different implementation. The idea here is ultimately to avoid running
the verifier in a case where it disrupts xfs_repair, so using this
intermediate patch series should be good enough to build a custom binary
that allows xfs_repair to eventually piece the fs back together. You
could alternatively just hack xfs_dir2_sf_verify() to return 0.

Note that I would highly recommend to test whatever you build against
your metadump before the original fs.
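One way to organise such a dry run is sketched below: a helper that only builds the command lines, so nothing is executed by accident. The file names and the path to the custom binary are hypothetical, and the metadump is assumed to be already decompressed. The commands use xfs_mdrestore to restore the dump into an image file, and xfs_repair with -f (target is a regular file) and -n (no-modify check).

```python
import shlex

def metadump_test_plan(metadump, image, repair_bin="xfs_repair"):
    """Return the argv lists for testing a repair binary on a restored dump."""
    return [
        ["xfs_mdrestore", metadump, image],  # restore metadata into an image file
        [repair_bin, "-n", "-f", image],     # no-modify pass: report only
        [repair_bin, "-f", image],           # real repair, on the image only
        [repair_bin, "-n", "-f", image],     # verify the result is clean
    ]

# Hypothetical paths, for illustration only.
for argv in metadump_test_plan("bign.metadump", "bign.img", "./repair/xfs_repair"):
    print(shlex.join(argv))
```

Printing the plan first makes it easy to review each step before running anything against the real 115 TiB device.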

Brian

> -- 
> ------------------------------------------------------------------------
> Emmanuel Florac     |   Direction technique
>                     |   Intellique
>                     |	<eflorac@intellique.com>
>                     |   +33 1 78 94 84 02
> ------------------------------------------------------------------------




* Re: Weird xfs_repair error
  2017-07-24 14:51             ` Brian Foster
@ 2017-07-25 16:44               ` Emmanuel Florac
  2017-07-25 17:16               ` Emmanuel Florac
  1 sibling, 0 replies; 15+ messages in thread
From: Emmanuel Florac @ 2017-07-25 16:44 UTC (permalink / raw)
  To: Brian Foster
  Cc: Darrick J. Wong, Dave Chinner, 'linux-xfs@vger.kernel.org'

On Mon, 24 Jul 2017 10:51:25 -0400
Brian Foster <bfoster@redhat.com> wrote:

> There are several fixes in-flight for the issues uncovered by this
> metadump. I think you'll want to include the following 3 patches to
> xfsprogs:
> 
> http://marc.info/?l=linux-xfs&m=150047977108174&w=2
> http://marc.info/?l=linux-xfs&m=150040481220074&w=2
> http://marc.info/?l=linux-xfs&m=150040481820076&w=2
> 
> Note that the last 2 patches are probably going to be reworked into a
> different implementation. The idea here is ultimately to avoid running
> the verifier in a case where it disrupts xfs_repair, so using this
> intermediate patch series should be good enough to build a custom
> binary that allows xfs_repair to eventually piece the fs back
> together. You could alternatively just hack xfs_dir2_sf_verify() to
> return 0.
> 
> Note that I would highly recommend testing whatever you build against
> your metadump before the original fs.
> 

You bet... I would even try salvaging files from the unrepaired fs if
possible, but it's probably not workable.

For info, I tried the new xfs_repair 4.12, and it fails reliably like
this (after gazillions of metadata errors, etc.):

bad hash table for directory inode 4295385906 (no data entry): rebuilding
rebuilding directory inode 4295385906
7f42b14f8780: Badness in key lookup (length)
bp=(bno 0x0, len 4096 bytes) key=(bno 0x0, len 512 bytes)
7f42b14f8780: Badness in key lookup (length)
bp=(bno 0x0, len 4096 bytes) key=(bno 0x0, len 512 bytes)
Invalid inode number 0x0
xfs_dir_ino_validate: XFS_ERROR_REPORT

fatal error -- couldn't map inode 4295400241, err = 117

At least it's always the same error (previous versions ended with
varying log sizes and errors).

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]


* Re: Weird xfs_repair error
  2017-07-24 14:51             ` Brian Foster
  2017-07-25 16:44               ` Emmanuel Florac
@ 2017-07-25 17:16               ` Emmanuel Florac
  2017-07-25 19:22                 ` Brian Foster
  1 sibling, 1 reply; 15+ messages in thread
From: Emmanuel Florac @ 2017-07-25 17:16 UTC (permalink / raw)
  To: Brian Foster
  Cc: Darrick J. Wong, Dave Chinner, 'linux-xfs@vger.kernel.org'

[-- Attachment #1: Type: text/plain, Size: 954 bytes --]

On Mon, 24 Jul 2017 10:51:25 -0400
Brian Foster <bfoster@redhat.com> wrote:

> There are several fixes in-flight for the issues uncovered by this
> metadump. I think you'll want to include the following 3 patches to
> xfsprogs:
> 
> http://marc.info/?l=linux-xfs&m=150047977108174&w=2
> http://marc.info/?l=linux-xfs&m=150040481220074&w=2
> http://marc.info/?l=linux-xfs&m=150040481820076&w=2
> 

BTW, which branch do these apply to?

I also ran xfs_repair 4.12 on another image that I had previously tried
4.7 and 4.9 on, and... it seems to have repaired it. I'm trying to
understand how that happened...

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]


* Re: Weird xfs_repair error
  2017-07-25 17:16               ` Emmanuel Florac
@ 2017-07-25 19:22                 ` Brian Foster
  0 siblings, 0 replies; 15+ messages in thread
From: Brian Foster @ 2017-07-25 19:22 UTC (permalink / raw)
  To: Emmanuel Florac
  Cc: Darrick J. Wong, Dave Chinner, 'linux-xfs@vger.kernel.org'

On Tue, Jul 25, 2017 at 07:16:04PM +0200, Emmanuel Florac wrote:
> On Mon, 24 Jul 2017 10:51:25 -0400
> Brian Foster <bfoster@redhat.com> wrote:
> 
> > There are several fixes in-flight for the issues uncovered by this
> > metadump. I think you'll want to include the following 3 patches to
> > xfsprogs:
> > 
> > http://marc.info/?l=linux-xfs&m=150047977108174&w=2
> > http://marc.info/?l=linux-xfs&m=150040481220074&w=2
> > http://marc.info/?l=linux-xfs&m=150040481820076&w=2
> > 
> 
> BTW, which branch do these apply to?
> 
> I also ran xfs_repair 4.12 on another image that I had previously tried
> 4.7 and 4.9 on, and... it seems to have repaired it. I'm trying to
> understand how that happened...

I believe these should apply to for-next. FWIW, the error in your
previous mail looks like the one that is fixed by the latter two patches
above.

Brian

> 
> -- 
> ------------------------------------------------------------------------
> Emmanuel Florac     |   Direction technique
>                     |   Intellique
>                     |	<eflorac@intellique.com>
>                     |   +33 1 78 94 84 02
> ------------------------------------------------------------------------




end of thread, other threads:[~2017-07-25 19:22 UTC | newest]

Thread overview: 15+ messages
2017-07-06 13:30 Weird xfs_repair error Emmanuel Florac
2017-07-06 13:48 ` Brian Foster
2017-07-06 14:49   ` Emmanuel Florac
2017-07-06 23:28 ` Dave Chinner
2017-07-07 11:36   ` Emmanuel Florac
2017-07-07 11:50   ` Emmanuel Florac
2017-07-07 15:36     ` Darrick J. Wong
2017-07-10 17:29       ` Emmanuel Florac
2017-07-11 13:23       ` Emmanuel Florac
2017-07-17 17:11         ` Brian Foster
2017-07-24 14:27           ` Emmanuel Florac
2017-07-24 14:51             ` Brian Foster
2017-07-25 16:44               ` Emmanuel Florac
2017-07-25 17:16               ` Emmanuel Florac
2017-07-25 19:22                 ` Brian Foster
