xfs_repair 3.1.4/3.1.5: fatal error -- couldn't malloc dir2 buffer data

* xfs_repair 3.1.4/3.1.5: fatal error -- couldn't malloc dir2 buffer data
@ 2011-08-06 12:17 Marc Lehmann
  2011-08-06 14:12 ` Dave Chinner
  2011-08-06 17:27 ` Roger Willcocks
  0 siblings, 2 replies; 8+ messages in thread
From: Marc Lehmann @ 2011-08-06 12:17 UTC (permalink / raw)
  To: xfs

When trying to xfs_repair a 15GB image file, I get this:

Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...

fatal error -- couldn't malloc dir2 buffer data

this is 3.1.5 - 3.1.4 simply segfaults. using ltrace shows this as
last call to malloc:

   malloc(18446744073708732928)                                  = NULL

I think that's a bit unreasonable of xfs_repair :)

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

* Re: xfs_repair 3.1.4/3.1.5: fatal error -- couldn't malloc dir2 buffer data
  2011-08-06 12:17 xfs_repair 3.1.4/3.1.5: fatal error -- couldn't malloc dir2 buffer data Marc Lehmann
@ 2011-08-06 14:12 ` Dave Chinner
  2011-08-06 17:54   ` Marc Lehmann
  2011-08-06 17:27 ` Roger Willcocks
  1 sibling, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2011-08-06 14:12 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: xfs

On Sat, Aug 06, 2011 at 02:17:28PM +0200, Marc Lehmann wrote:
> When trying to xfs_repair a 15GB image file, I get this:
> 
> Phase 3 - for each AG...
>         - scan and clear agi unlinked lists...
>         - process known inodes and perform inode discovery...
> 
> fatal error -- couldn't malloc dir2 buffer data
> 
> this is 3.1.5 - 3.1.4 simply segfaults. using ltrace shows this as
> last call to malloc:
> 
>    malloc(18446744073708732928)                                  = NULL
> 
> I think that's a bit unreasonable of xfs_repair :)

Can you share a metadump of the image in question?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: xfs_repair 3.1.4/3.1.5: fatal error -- couldn't malloc dir2 buffer data
  2011-08-06 12:17 xfs_repair 3.1.4/3.1.5: fatal error -- couldn't malloc dir2 buffer data Marc Lehmann
  2011-08-06 14:12 ` Dave Chinner
@ 2011-08-06 17:27 ` Roger Willcocks
  1 sibling, 0 replies; 8+ messages in thread
From: Roger Willcocks @ 2011-08-06 17:27 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: xfs


On 6 Aug 2011, at 13:17, Marc Lehmann wrote:

> 18446744073708732928

0xFFFFFFFFFFF38200 (-0xC7E00)
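
The conversion above can be checked mechanically. The following is a minimal sketch, not taken from xfs_repair itself: it reinterprets the size that ltrace reported as a two's-complement signed 64-bit integer, which suggests the allocation length was computed as a small negative number before being passed to malloc. The helper name `as_signed64` is hypothetical.

```python
# Reinterpret the size passed to malloc() as a signed 64-bit integer.
MALLOC_ARG = 18446744073708732928  # value from the ltrace output


def as_signed64(value: int) -> int:
    """Return the two's-complement signed reading of a 64-bit unsigned value."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


signed = as_signed64(MALLOC_ARG)
print(hex(MALLOC_ARG))      # the unsigned value in hex
print(signed, hex(signed))  # the same bits read as a signed quantity
```

A negative length of a few hundred kilobytes is consistent with a size derived from corrupt on-disk fields, though the thread does not confirm where the value comes from.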



* Re: xfs_repair 3.1.4/3.1.5: fatal error -- couldn't malloc dir2 buffer data
  2011-08-06 14:12 ` Dave Chinner
@ 2011-08-06 17:54   ` Marc Lehmann
  2011-08-06 23:39     ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Marc Lehmann @ 2011-08-06 17:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Sun, Aug 07, 2011 at 12:12:41AM +1000, Dave Chinner <david@fromorbit.com> wrote:
> > this is 3.1.5 - 3.1.4 simply segfaults. using ltrace shows this as
> > last call to malloc:
> > 
> >    malloc(18446744073708732928)                                  = NULL
> > 
> > I think that's a bit unreasonable of xfs_repair :)
> 
> Can you share a metadump of the image in question?

I can, but unfortunately, it's fixed itself in the meantime:

I wanted to make a copy of the image, and mounted it read-write. I stat'ed
all files inside (which worked) and then rsynced all files out.

Then I unmounted it and re-ran xfs_repair
(http://ue.tst.eu/3cbc07150eb6b69c63361937c6c3044f.txt) which got much
farther, but failed with the same error.

Then I re-ran xfs_repair one last time, which ran through without any "error"
messages.

An xfs_metadump -o is here (gzipped):
http://data.plan9.de/smoker-chroot.bin.gz

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

* Re: xfs_repair 3.1.4/3.1.5: fatal error -- couldn't malloc dir2 buffer data
  2011-08-06 17:54   ` Marc Lehmann
@ 2011-08-06 23:39     ` Dave Chinner
  2011-08-08  0:29       ` Dave Chinner
  2011-08-08 17:49       ` Marc Lehmann
  0 siblings, 2 replies; 8+ messages in thread
From: Dave Chinner @ 2011-08-06 23:39 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: xfs

On Sat, Aug 06, 2011 at 07:54:28PM +0200, Marc Lehmann wrote:
> On Sun, Aug 07, 2011 at 12:12:41AM +1000, Dave Chinner <david@fromorbit.com> wrote:
> > > this is 3.1.5 - 3.1.4 simply segfaults. using ltrace shows this as
> > > last call to malloc:
> > > 
> > >    malloc(18446744073708732928)                                  = NULL
> > > 
> > > I think that's a bit unreasonable of xfs_repair :)
> > 
> > Can you share a metadump of the image in question?
> 
> I can, but unfortunately, it's fixed itself in the meantime:
> 
> I wanted to make a copy of the image, and mounted it read-write. I stat'ed
> all files inside (which worked) and then rsynced all files out.
> 
> Then I unmounted it and re-ran xfs_repair
> (http://ue.tst.eu/3cbc07150eb6b69c63361937c6c3044f.txt) which got much
> farther, but failed with the same error.

Looks like corrupt directory blocks are causing it.

> Then I re-ran xfs_repair one last time, which ran through without any "error"
> messages.
> 
> An xfs_metadump -o is here (gzipped):
> http://data.plan9.de/smoker-chroot.bin.gz

I'll have a look at it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: xfs_repair 3.1.4/3.1.5: fatal error -- couldn't malloc dir2 buffer data
  2011-08-06 23:39     ` Dave Chinner
@ 2011-08-08  0:29       ` Dave Chinner
  2011-08-08 17:49       ` Marc Lehmann
  1 sibling, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2011-08-08  0:29 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: xfs

On Sun, Aug 07, 2011 at 09:39:13AM +1000, Dave Chinner wrote:
> On Sat, Aug 06, 2011 at 07:54:28PM +0200, Marc Lehmann wrote:
> > On Sun, Aug 07, 2011 at 12:12:41AM +1000, Dave Chinner <david@fromorbit.com> wrote:
> > > > this is 3.1.5 - 3.1.4 simply segfaults. using ltrace shows this as
> > > > last call to malloc:
> > > > 
> > > >    malloc(18446744073708732928)                                  = NULL
> > > > 
> > > > I think that's a bit unreasonable of xfs_repair :)
> > > 
> > > Can you share a metadump of the image in question?
> > 
> > I can, but unfortunately, it's fixed itself in the meantime:
> > 
> > I wanted to make a copy of the image, and mounted it read-write. I stat'ed
> > all files inside (which worked) and then rsynced all files out.
> > 
> > Then I unmounted it and re-ran xfs_repair
> > (http://ue.tst.eu/3cbc07150eb6b69c63361937c6c3044f.txt) which got much
> > farther, but failed with the same error.
> 
> Looks like corrupt directory blocks are causing it.
> 
> > Then I re-ran xfs_repair one last time, which ran through without any "error"
> > messages.
> > 
> > An xfs_metadump -o is here (gzipped):
> > http://data.plan9.de/smoker-chroot.bin.gz
> 
> I'll have a look at it.

$ sudo xfs_repair -V  -f /vm-images/busted.img 
xfs_repair version 3.1.5
$ sudo xfs_repair  -f /vm-images/busted.img 
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - agno = 15
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 11
        - agno = 1
        - agno = 3
        - agno = 2
        - agno = 4
        - agno = 10
        - agno = 7
        - agno = 5
        - agno = 6
        - agno = 8
        - agno = 9
        - agno = 12
        - agno = 13
        - agno = 15
        - agno = 14
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done
$

Yup, there's nothing wrong with the filesystem in the image you
posted.

I need an image of the broken filesystem to be able to find the bug
in xfs_repair. Next time it breaks, can you post the image of the
broken fs? i.e. run xfs_repair -n first to see if it will fail
without trying to repair the corruption it encounters, then take a
metadump before really trying to repair the problem...
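
The order of operations requested above can be sketched as a small command plan. This is only an illustration under stated assumptions: the function name `diagnosis_plan` and the dump file name are hypothetical, the `-n` and `-f` flags are the ones used with xfs_repair elsewhere in this thread, and `-o` is the metadump option mentioned earlier. Depending on the xfsprogs version, xfs_metadump may need an additional option when the source is a regular file, so check the man page before running anything. The sketch only builds and prints the commands; it does not execute them.

```python
from typing import List


def diagnosis_plan(image: str, dump: str) -> List[List[str]]:
    """Return the commands to run, in order, before any real repair attempt."""
    return [
        # 1. Dry run: -n reports problems without modifying the filesystem,
        #    -f says the target is a regular file rather than a block device.
        ["xfs_repair", "-n", "-f", image],
        # 2. Capture the metadata while the corruption is still present.
        #    -o is the option mentioned earlier in this thread.
        ["xfs_metadump", "-o", image, dump],
    ]


# "busted.metadump" is a hypothetical output name.
for argv in diagnosis_plan("busted.img", "busted.metadump"):
    print(" ".join(argv))
```

Only after both steps have completed would a real xfs_repair run (without `-n`) be attempted, so that the broken state is preserved for debugging.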

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: xfs_repair 3.1.4/3.1.5: fatal error -- couldn't malloc dir2 buffer data
  2011-08-06 23:39     ` Dave Chinner
  2011-08-08  0:29       ` Dave Chinner
@ 2011-08-08 17:49       ` Marc Lehmann
  2011-08-08 23:45         ` Dave Chinner
  1 sibling, 1 reply; 8+ messages in thread
From: Marc Lehmann @ 2011-08-08 17:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Sun, Aug 07, 2011 at 09:39:13AM +1000, Dave Chinner <david@fromorbit.com> wrote:
> > Then I unmounted it and re-ran xfs_repair
> > (http://ue.tst.eu/3cbc07150eb6b69c63361937c6c3044f.txt) which got much
> > farther, but failed with the same error.
> 
> Looks like corrupt directory blocks are causing it.
> 
> > Then I re-ran xfs_repair one last time, which ran through without any "error"
> > messages.
> > 
> > An xfs_metadump -o is here (gzipped):
> > http://data.plan9.de/smoker-chroot.bin.gz
> 
> I'll have a look at it.

I had another lockup, no xfs_fsr involved this time.

After rebooting, xfs_repair on the filesystem I mkfs'ed yesterday had the
same problem, here is the metadump:

   http://data.plan9.de/metadump-smoker-new.gz
   
(if it's not accessible right now then this is because that's the server
that locked up, it should be up and running in an hour again).

And here is the output of xfs_repair:

   Phase 1 - find and verify superblock...
   Phase 2 - using internal log
           - zero log...
           - scan filesystem freespace and inode maps...
           - found root inode chunk
   Phase 3 - for each AG...
           - scan and clear agi unlinked lists...
           - process known inodes and perform inode discovery...
           - agno = 0
           - agno = 1
           - agno = 2
           - agno = 3
           - agno = 4
           - agno = 5
           - agno = 6
           - agno = 7

   fatal error -- couldn't malloc dir2 buffer data

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

* Re: xfs_repair 3.1.4/3.1.5: fatal error -- couldn't malloc dir2 buffer data
  2011-08-08 17:49       ` Marc Lehmann
@ 2011-08-08 23:45         ` Dave Chinner
  0 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2011-08-08 23:45 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: xfs

On Mon, Aug 08, 2011 at 07:49:11PM +0200, Marc Lehmann wrote:
> On Sun, Aug 07, 2011 at 09:39:13AM +1000, Dave Chinner <david@fromorbit.com> wrote:
> > > Then I unmounted it and re-ran xfs_repair
> > > (http://ue.tst.eu/3cbc07150eb6b69c63361937c6c3044f.txt) which got much
> > > farther, but failed with the same error.
> > 
> > Looks like corrupt directory blocks are causing it.
> > 
> > > Then I re-ran xfs_repair one last time, which ran through without any "error"
> > > messages.
> > > 
> > > An xfs_metadump -o is here (gzipped):
> > > http://data.plan9.de/smoker-chroot.bin.gz
> > 
> > I'll have a look at it.
> 
> I had another lockup, no xfs_fsr involved this time.
> 
> After rebooting, xfs_repair on the filesystem I mkfs'ed yesterday had the
> same problem, here is the metadump:
> 
>    http://data.plan9.de/metadump-smoker-new.gz
>    
> (if it's not accessible right now then this is because that's the server
> that locked up, it should be up and running in an hour again).
> 
> And here is the output of xfs_repair:
> 
>    Phase 1 - find and verify superblock...
>    Phase 2 - using internal log
>            - zero log...
>            - scan filesystem freespace and inode maps...
>            - found root inode chunk
>    Phase 3 - for each AG...
>            - scan and clear agi unlinked lists...
>            - process known inodes and perform inode discovery...
>            - agno = 0
>            - agno = 1
>            - agno = 2
>            - agno = 3
>            - agno = 4
>            - agno = 5
>            - agno = 6
>            - agno = 7
> 
>    fatal error -- couldn't malloc dir2 buffer data

Ok, I can reproduce that.

From a quick look over breakfast, xfs_repair from the current git
tree results in this:

$ ~/src/build/xfsprogs-dev/repair/xfs_repair -nvd -f busted.img 
Phase 1 - find and verify superblock...
        - block cache size set to 2311200 entries
Phase 2 - using internal log
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
xfs_repair: read failed: Bad address
can't read block 0 for directory inode 29420386
no . entry for directory 29420386
no .. entry for directory 29420386
problem with directory contents in inode 29420386
would have cleared inode 29420386
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - agno = 15
xfs_repair: read failed: Bad address
can't read block 0 for directory inode 63252826
no . entry for directory 63252826
no .. entry for directory 63252826
problem with directory contents in inode 63252826
would have cleared inode 63252826
bad directory block magic # 0 in block 0 for directory inode 63254628
corrupt block 0 in directory inode 63254628
        would junk block
no . entry for directory 63254628
no .. entry for directory 63254628
problem with directory contents in inode 63254628
would have cleared inode 63254628
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 1
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 9
        - agno = 7
        - agno = 11
        - agno = 10
        - agno = 8
        - agno = 13
        - agno = 14
        - agno = 12
        - agno = 15
Segmentation fault
$

So it gets a lot further, and indicates somewhat how the directory
structure is corrupted - bad block pointers. Interestingly:

$ sudo xfs_db -r -f busted.img 
xfs_db> inode 63252826
xfs_db> p
core.magic = 0x494e
core.mode = 040755
core.version = 2
core.format = 2 (extents)
core.nlinkv2 = 2
....
core.size = 4096
core.nblocks = 8

That does not add up. A single block directory should be in block format,
which has a single 1 block extent.

core.extsize = 0
core.nextents = 3

That's quite clearly not the case:

....
u.bmx[0-2] = [startoff,startblock,blockcount,extentflag] 0:[0,32457376,3,0] 1:[3,32457504,3,0] 2:[6,32457631,2,0]
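
The mismatch between the inode size and its extent map can be totalled mechanically. The sketch below parses the `u.bmx` line printed by xfs_db and compares the number of mapped blocks with the number of blocks implied by `core.size`. The 4096-byte block size is an assumption (the thread does not state it), and `parse_extents` is a hypothetical helper, not part of xfsprogs.

```python
import re

BMX_LINE = ("u.bmx[0-2] = [startoff,startblock,blockcount,extentflag] "
            "0:[0,32457376,3,0] 1:[3,32457504,3,0] 2:[6,32457631,2,0]")
CORE_SIZE = 4096     # core.size from the inode dump
CORE_NBLOCKS = 8     # core.nblocks from the inode dump
BLOCK_SIZE = 4096    # assumption: filesystem block size in bytes


def parse_extents(line):
    """Return (startoff, startblock, blockcount, flag) tuples from a bmx line."""
    pattern = r"\d+:\[(\d+),(\d+),(\d+),(\d+)\]"
    return [tuple(map(int, m)) for m in re.findall(pattern, line)]


extents = parse_extents(BMX_LINE)
mapped_blocks = sum(count for _, _, count, _ in extents)
blocks_implied_by_size = -(-CORE_SIZE // BLOCK_SIZE)  # ceiling division

print("blocks mapped by extents:", mapped_blocks)
print("blocks implied by size:  ", blocks_implied_by_size)
print("consistent:", mapped_blocks == blocks_implied_by_size)
```

Under that block-size assumption the extent total agrees with `core.nblocks` but not with `core.size`, which is the inconsistency being pointed out here.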

It's apparently got 8 blocks in the directory data space. Looking at
the first block:

$ sudo xfs_db -r -f busted.img 
xfs_db> fsb 32457376
xfs_db> p
000: 58443242 07900770 003003b0 00000000 00000000 03c5295a 012e5fc4 c27e0010
      X D 2 B - that's definitely a block format directory block.

020: 00000000 02047b33 022e2e02 66240020 ffff03b0 03c53047 0a303030 5f6c6f61
040: 642e7499 cf610030 00000000 03c53055 0b303031 5f626173 69632e74 131f0030
060: 00000000 03c53064 0c303035 5f737472 6963742e 741f0030 00000000 03c53067
080: 0c303130 5f646173 6865732e 748f0030 00000000 03c5306c 0c313033 5f75635f
0a0: 6275672e 74130030 00000000 03c5306d 0d303034 5f6e6f67 65746f70 2e740030
0c0: 00000000 03c53535 0d72656c 65617365 2d656f6c 2e740030 00000000 03c53536
0e0: 0e313031 5f617267 765f6275 672e749c 1e9be359 f7500030 00000000 03c53537
100: 0f313037 5f756e69 6f6e5f62 75672e74 a70fa089 e1230030 00000000 03c5354d
120: 0f313039 5f68656c 705f666c 61672e74 fd4ec683 52490030 00000000 03c53550
140: 10313038 5f757361 67655f61 7474722e 74731cd1 13e00030 00000000 03c53551
160: 11313032 5f626173 69635f62 61736963 2e744212 c5260030 00000000 03c53553
180: 11313035 5f75635f 6275675f 6d6f7265 2e744b0e c48a0030 00000000 03c53559
1a0: 1172656c 65617365 2d6e6f2d 74616273 2e740476 735f0030 00000000 03c5355a
1c0: 12303131 5f70726f 63657373 5f617267 762e743a 5b100030 00000000 03c5355b
1e0: 12313037 5f6e6f5f 6175746f 5f68656c 702e74af 32bc0030 00000000 03c5355c
xfs_db> type dir2
xfs_db> p
Segmentation fault

But clearly there's something bad in it. More digging needed.
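
For readers following the hex dump by hand, the first 32 bytes can be decoded as follows. This is a sketch under an assumed layout, namely the dir2 block header as I understand it: a 4-byte magic followed by three big-endian (offset, length) "bestfree" pairs, then entries consisting of an 8-byte inode number, a 1-byte name length, the name, padding, and a 2-byte tag holding the entry's own offset. None of that layout is spelled out in the thread, so treat the field names as assumptions.

```python
import struct

# First 32 bytes of the block, copied from the xfs_db dump at offset 000.
WORDS = ("58443242 07900770 003003b0 00000000 "
         "00000000 03c5295a 012e5fc4 c27e0010")
raw = bytes.fromhex(WORDS.replace(" ", ""))

# Header: magic plus three (offset, length) free-space pairs, big-endian.
magic, *free = struct.unpack(">4s6H", raw[:16])
bestfree = list(zip(free[0::2], free[1::2]))

# First entry: inode number, name length, name, then the tag in the
# last two bytes of the 16-byte entry.
inumber, namelen = struct.unpack(">QB", raw[16:25])
name = raw[25:25 + namelen]
tag = struct.unpack(">H", raw[30:32])[0]

print(magic, bestfree)
print(inumber, name, hex(tag))
```

The first entry decodes to the directory's own inode number with the name ".", which matches the inode examined above. The header and the first entry therefore look sane, so whatever trips up xfs_db is presumably further into the block.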

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com
