* xfs_repair segfaults
@ 2013-02-28 15:22 Ole Tange
  2013-02-28 18:48 ` Eric Sandeen
                   ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Ole Tange @ 2013-02-28 15:22 UTC (permalink / raw)
  To: xfs

I forced a RAID online. I have done that before and xfs_repair
normally removes the last hour of data or so, but saves everything
else.

Today that did not work:

/usr/local/src/xfsprogs-3.1.10/repair# ./xfs_repair -n /dev/md5p1
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - scan filesystem freespace and inode maps...
flfirst 232 in agf 91 too large (max = 128)
Segmentation fault (core dumped)

Core put in: http://dna.ku.dk/~tange/tmp/xfs_repair.core.bz2

I tried using the git-version, too, but could not get that to compile.

# uname -a
Linux franklin 3.2.0-0.bpo.4-amd64 #1 SMP Debian 3.2.35-2~bpo60+1
x86_64 GNU/Linux

# ./xfs_repair -V
xfs_repair version 3.1.10

# cat /proc/cpuinfo |grep MH | wc
     64     256    1280

# cat /proc/partitions |grep md5
   9        5 125024550912 md5
 259        0 107521114112 md5p1
 259        1 17503434752 md5p2

# cat /proc/mdstat
Personalities : [raid0] [raid6] [raid5] [raid4]
md5 : active raid0 md1[0] md4[3] md3[2] md2[1]
      125024550912 blocks super 1.2 512k chunks

md1 : active raid6 sdd[1] sdi[9] sdq[13] sdau[7] sdt[10] sdg[5] sdf[4] sde[2]
      31256138752 blocks super 1.2 level 6, 128k chunk, algorithm 2
[10/8] [_UU_UUUUUU]
      bitmap: 2/2 pages [8KB], 1048576KB chunk

md4 : active raid6 sdo[13] sdu[9] sdad[8] sdh[7] sdc[6] sds[11]
sdap[3] sdao[2] sdk[1]
      31256138752 blocks super 1.2 level 6, 128k chunk, algorithm 2
[10/8] [_UUUU_UUUU]
      [>....................]  recovery =  2.1% (84781876/3907017344)
finish=2196.4min speed=29003K/sec
      bitmap: 2/2 pages [8KB], 1048576KB chunk

md2 : active raid6 sdac[0] sdal[9] sdak[8] sdaj[7] sdai[6] sdah[5]
sdag[4] sdaf[3] sdae[2] sdr[10]
      31256138752 blocks super 1.2 level 6, 128k chunk, algorithm 2
[10/10] [UUUUUUUUUU]
      bitmap: 0/2 pages [0KB], 1048576KB chunk

md3 : active raid6 sdaq[0] sdab[9] sdaa[8] sdb[7] sdy[6] sdx[5] sdw[4]
sdv[3] sdz[10] sdj[1]
      31256138752 blocks super 1.2 level 6, 128k chunk, algorithm 2
[10/10] [UUUUUUUUUU]
      bitmap: 0/2 pages [0KB], 1048576KB chunk

unused devices: <none>

# smartctl -a /dev/sdau|grep Model
Device Model:     Hitachi HDS724040ALE640

# hdparm -W /dev/sdau
/dev/sdau:
 write-caching =  0 (off)

# dmesg
[ 3745.914280] xfs_repair[25300]: segfault at 7f5d9282b000 ip
000000000042d068 sp 00007f5da3183dd0 error 4 in
xfs_repair[400000+7f000]


/Ole

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: xfs_repair segfaults
  2013-02-28 15:22 xfs_repair segfaults Ole Tange
@ 2013-02-28 18:48 ` Eric Sandeen
  2013-03-01  9:37   ` Ole Tange
  2013-03-01 11:17 ` Dave Chinner
  2013-03-01 22:14 ` Eric Sandeen
  2 siblings, 1 reply; 39+ messages in thread
From: Eric Sandeen @ 2013-02-28 18:48 UTC (permalink / raw)
  To: Ole Tange; +Cc: xfs

On 2/28/13 9:22 AM, Ole Tange wrote:
> I forced a RAID online. I have done that before and xfs_repair
> normally removes the last hour of data or so, but saves everything
> else.
> 
> Today that did not work:
> 
> /usr/local/src/xfsprogs-3.1.10/repair# ./xfs_repair -n /dev/md5p1
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
>         - scan filesystem freespace and inode maps...
> flfirst 232 in agf 91 too large (max = 128)
> Segmentation fault (core dumped)
> 
> Core put in: http://dna.ku.dk/~tange/tmp/xfs_repair.core.bz2

We'd need a binary w/ debug symbols to go along with it.

an xfs_metadump might recreate the problem too.

> I tried using the git-version, too, but could not get that to compile.

How'd it fail, can you report that in a different thread?

thanks,
-eric

> # uname -a
> Linux franklin 3.2.0-0.bpo.4-amd64 #1 SMP Debian 3.2.35-2~bpo60+1
> x86_64 GNU/Linux
> 
> # ./xfs_repair -V
> xfs_repair version 3.1.10
> 
> # cat /proc/cpuinfo |grep MH | wc
>      64     256    1280
> 
> # cat /proc/partitions |grep md5
>    9        5 125024550912 md5
>  259        0 107521114112 md5p1
>  259        1 17503434752 md5p2
> 
> # cat /proc/mdstat
> Personalities : [raid0] [raid6] [raid5] [raid4]
> md5 : active raid0 md1[0] md4[3] md3[2] md2[1]
>       125024550912 blocks super 1.2 512k chunks
> 
> md1 : active raid6 sdd[1] sdi[9] sdq[13] sdau[7] sdt[10] sdg[5] sdf[4] sde[2]
>       31256138752 blocks super 1.2 level 6, 128k chunk, algorithm 2
> [10/8] [_UU_UUUUUU]
>       bitmap: 2/2 pages [8KB], 1048576KB chunk
> 
> md4 : active raid6 sdo[13] sdu[9] sdad[8] sdh[7] sdc[6] sds[11]
> sdap[3] sdao[2] sdk[1]
>       31256138752 blocks super 1.2 level 6, 128k chunk, algorithm 2
> [10/8] [_UUUU_UUUU]
>       [>....................]  recovery =  2.1% (84781876/3907017344)
> finish=2196.4min speed=29003K/sec
>       bitmap: 2/2 pages [8KB], 1048576KB chunk
> 
> md2 : active raid6 sdac[0] sdal[9] sdak[8] sdaj[7] sdai[6] sdah[5]
> sdag[4] sdaf[3] sdae[2] sdr[10]
>       31256138752 blocks super 1.2 level 6, 128k chunk, algorithm 2
> [10/10] [UUUUUUUUUU]
>       bitmap: 0/2 pages [0KB], 1048576KB chunk
> 
> md3 : active raid6 sdaq[0] sdab[9] sdaa[8] sdb[7] sdy[6] sdx[5] sdw[4]
> sdv[3] sdz[10] sdj[1]
>       31256138752 blocks super 1.2 level 6, 128k chunk, algorithm 2
> [10/10] [UUUUUUUUUU]
>       bitmap: 0/2 pages [0KB], 1048576KB chunk
> 
> unused devices: <none>
> 
> # smartctl -a /dev/sdau|grep Model
> Device Model:     Hitachi HDS724040ALE640
> 
> # hdparm -W /dev/sdau
> /dev/sdau:
>  write-caching =  0 (off)
> 
> # dmesg
> [ 3745.914280] xfs_repair[25300]: segfault at 7f5d9282b000 ip
> 000000000042d068 sp 00007f5da3183dd0 error 4 in
> xfs_repair[400000+7f000]
> 
> 
> /Ole
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: xfs_repair segfaults
  2013-02-28 18:48 ` Eric Sandeen
@ 2013-03-01  9:37   ` Ole Tange
  2013-03-01 16:46     ` Eric Sandeen
  0 siblings, 1 reply; 39+ messages in thread
From: Ole Tange @ 2013-03-01  9:37 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs

On Thu, Feb 28, 2013 at 7:48 PM, Eric Sandeen <sandeen@sandeen.net> wrote:
> On 2/28/13 9:22 AM, Ole Tange wrote:

>> /usr/local/src/xfsprogs-3.1.10/repair# ./xfs_repair -n /dev/md5p1
[...]
>> Segmentation fault (core dumped)
>>
>> Core put in: http://dna.ku.dk/~tange/tmp/xfs_repair.core.bz2
>
> We'd need a binary w/ debug symbols to go along with it.

http://dna.ku.dk/~tange/tmp/xfs_repair

> an xfs_metadump might recreate the problem too.

# sudo ./xfs_metadump.sh -g /dev/md5p1 - | pbzip2 >
/home/tange/public_html/tmp/metadump.bz2
xfs_metadump: cannot init perag data (117)
Copying log

http://dna.ku.dk/~tange/tmp/metadump.bz2

Please consider providing an example in the man page for xfs_metadump e.g:

  xfs_metadump.sh -g /dev/sda2 meta.dump


/Ole

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: xfs_repair segfaults
  2013-02-28 15:22 xfs_repair segfaults Ole Tange
  2013-02-28 18:48 ` Eric Sandeen
@ 2013-03-01 11:17 ` Dave Chinner
  2013-03-01 12:24   ` Ole Tange
  2013-03-01 22:14 ` Eric Sandeen
  2 siblings, 1 reply; 39+ messages in thread
From: Dave Chinner @ 2013-03-01 11:17 UTC (permalink / raw)
  To: Ole Tange; +Cc: xfs

On Thu, Feb 28, 2013 at 04:22:08PM +0100, Ole Tange wrote:
> I forced a RAID online. I have done that before and xfs_repair
> normally removes the last hour of data or so, but saves everything
> else.

Why did you need to force it online?

> Today that did not work:
> 
> /usr/local/src/xfsprogs-3.1.10/repair# ./xfs_repair -n /dev/md5p1
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
>         - scan filesystem freespace and inode maps...
> flfirst 232 in agf 91 too large (max = 128)

Can you run:

# xfs_db -c "agf 91" -c p /dev/md5p1

And post the output?

> # cat /proc/partitions |grep md5
>    9        5 125024550912 md5
>  259        0 107521114112 md5p1
>  259        1 17503434752 md5p2

Ouch.

> # cat /proc/mdstat
> Personalities : [raid0] [raid6] [raid5] [raid4]
> md5 : active raid0 md1[0] md4[3] md3[2] md2[1]
>       125024550912 blocks super 1.2 512k chunks
> 
> md1 : active raid6 sdd[1] sdi[9] sdq[13] sdau[7] sdt[10] sdg[5] sdf[4] sde[2]
>       31256138752 blocks super 1.2 level 6, 128k chunk, algorithm 2
> [10/8] [_UU_UUUUUU]
>       bitmap: 2/2 pages [8KB], 1048576KB chunk

There are 2 failed devices in this RAID6 lun - i.e. no redundancy -
and no rebuild in progress. Is this related to why you had to force
the RAID online?

> md4 : active raid6 sdo[13] sdu[9] sdad[8] sdh[7] sdc[6] sds[11]
> sdap[3] sdao[2] sdk[1]
>       31256138752 blocks super 1.2 level 6, 128k chunk, algorithm 2
> [10/8] [_UUUU_UUUU]
>       [>....................]  recovery =  2.1% (84781876/3907017344)
> finish=2196.4min speed=29003K/sec
>       bitmap: 2/2 pages [8KB], 1048576KB chunk

and 2 failed devices here, too, with a rebuild underway that will
take the best part of 2 days to complete...

So, before even trying to diagnose the xfs_repair problem, can you
tell us what actually went wrong with your md devices?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: xfs_repair segfaults
  2013-03-01 11:17 ` Dave Chinner
@ 2013-03-01 12:24   ` Ole Tange
  2013-03-01 20:53     ` Dave Chinner
  0 siblings, 1 reply; 39+ messages in thread
From: Ole Tange @ 2013-03-01 12:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Mar 1, 2013 at 12:17 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Feb 28, 2013 at 04:22:08PM +0100, Ole Tange wrote:
:
>> I forced a RAID online. I have done that before and xfs_repair
>> normally removes the last hour of data or so, but saves everything
>> else.
>
> Why did you need to force it online?

More than 2 harddisks went offline. We have seen that before and it is
not due to bad harddisks. It may be due to driver/timings/controller.

The alternative to forcing it online would be to read back a backup.
Since we are talking 100 TB of data, reading back the backup can take a
week and will set us back to the last backup (which is more than a day
old). So it is preferable to force the last failing harddisk online
even though that causes us to lose a few hours of work.

>> Today that did not work:
>>
>> /usr/local/src/xfsprogs-3.1.10/repair# ./xfs_repair -n /dev/md5p1
>> Phase 1 - find and verify superblock...
>> Phase 2 - using internal log
>>         - scan filesystem freespace and inode maps...
>> flfirst 232 in agf 91 too large (max = 128)
>
> Can you run:
>
> # xfs_db -c "agf 91" -c p /dev/md5p1
>
> And post the output?

# xfs_db -c "agf 91" -c p /dev/md5p1
xfs_db: cannot init perag data (117)
magicnum = 0x58414746
versionnum = 1
seqno = 91
length = 268435200
bnoroot = 295199
cntroot = 13451007
bnolevel = 2
cntlevel = 2
flfirst = 232
fllast = 32
flcount = 191
freeblks = 184285136
longest = 84709383
btreeblks = 24

The partition has earlier been mounted with -o inode64.

/Ole

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: xfs_repair segfaults
  2013-03-01  9:37   ` Ole Tange
@ 2013-03-01 16:46     ` Eric Sandeen
  2013-03-04  9:00       ` Ole Tange
  0 siblings, 1 reply; 39+ messages in thread
From: Eric Sandeen @ 2013-03-01 16:46 UTC (permalink / raw)
  To: Ole Tange; +Cc: xfs

On 3/1/13 3:37 AM, Ole Tange wrote:
> On Thu, Feb 28, 2013 at 7:48 PM, Eric Sandeen <sandeen@sandeen.net> wrote:
>> On 2/28/13 9:22 AM, Ole Tange wrote:
> 
>>> /usr/local/src/xfsprogs-3.1.10/repair# ./xfs_repair -n /dev/md5p1
> [...]
>>> Segmentation fault (core dumped)
>>>
>>> Core put in: http://dna.ku.dk/~tange/tmp/xfs_repair.core.bz2
>>
>> We'd need a binary w/ debug symbols to go along with it.
> 
> http://dna.ku.dk/~tange/tmp/xfs_repair
> 
>> an xfs_metadump might recreate the problem too.
> 
> # sudo ./xfs_metadump.sh -g /dev/md5p1 - | pbzip2 >
> /home/tange/public_html/tmp/metadump.bz2
> xfs_metadump: cannot init perag data (117)
> Copying log

I'll take a look.  It may be that the error renders it
invalid, but we'll see.

> http://dna.ku.dk/~tange/tmp/metadump.bz2
> 
> Please consider providing an example in the man page for xfs_metadump e.g:
> 
>   xfs_metadump.sh -g /dev/sda2 meta.dump

From the manpage,

SYNOPSIS
       xfs_metadump [ -efgow ] [ -l logdev ] source target

The source argument must be the pathname of
the device or file containing the XFS filesystem

and

the target argument specifies the destination file name. 

is not enough?

Thanks,
-Eric

> 
> /Ole
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: xfs_repair segfaults
  2013-03-01 12:24   ` Ole Tange
@ 2013-03-01 20:53     ` Dave Chinner
  2013-03-04  9:03       ` Ole Tange
  0 siblings, 1 reply; 39+ messages in thread
From: Dave Chinner @ 2013-03-01 20:53 UTC (permalink / raw)
  To: Ole Tange; +Cc: xfs

On Fri, Mar 01, 2013 at 01:24:36PM +0100, Ole Tange wrote:
> On Fri, Mar 1, 2013 at 12:17 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Feb 28, 2013 at 04:22:08PM +0100, Ole Tange wrote:
> :
> >> I forced a RAID online. I have done that before and xfs_repair
> >> normally removes the last hour of data or so, but saves everything
> >> else.
> >
> > Why did you need to force it online?
> 
> More than 2 harddisks went offline. We have seen that before and it is
> not due to bad harddisks. It may be due to driver/timings/controller.

I thought that might be the case. What filesystem errors occurred
when the drives went offline?

> >> /usr/local/src/xfsprogs-3.1.10/repair# ./xfs_repair -n /dev/md5p1
> >> Phase 1 - find and verify superblock...
> >> Phase 2 - using internal log
> >>         - scan filesystem freespace and inode maps...
> >> flfirst 232 in agf 91 too large (max = 128)
> >
> > Can you run:
> >
> > # xfs_db -c "agf 91" -c p /dev/md5p1
> >
> > And post the output?
> 
> # xfs_db -c "agf 91" -c p /dev/md5p1
> xfs_db: cannot init perag data (117)

Interesting. It's detecting corrupt AG headers.

> magicnum = 0x58414746
> versionnum = 1
> seqno = 91
> length = 268435200
> bnoroot = 295199
> cntroot = 13451007
> bnolevel = 2
> cntlevel = 2
> flfirst = 232
> fllast = 32
> flcount = 191

That implies that the free list is actually 232+191-32 = 391
entries long. That doesn't add up any way I look at it. Both the
flfirst and flcount fields look wrong here, which rules out a simple
bit error as the problem. I can't see how these values would have
been written by XFS, as they are out of range for a 512-byte sector
AGFL:

        be32_add_cpu(&agf->agf_flfirst, 1);
        xfs_trans_brelse(tp, agflbp);
        if (be32_to_cpu(agf->agf_flfirst) == XFS_AGFL_SIZE(mp))
                agf->agf_flfirst = 0;

So I suspect that something more than just disks going offline
went wrong here, as I've never seen this sort of corruption before...
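
For reference, the free list entries themselves can be dumped with
xfs_db's agfl command (same device path as above); that should print the
bno array this AG thinks is on its free list:

# xfs_db -c "agfl 91" -c p /dev/md5p1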

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: xfs_repair segfaults
  2013-02-28 15:22 xfs_repair segfaults Ole Tange
  2013-02-28 18:48 ` Eric Sandeen
  2013-03-01 11:17 ` Dave Chinner
@ 2013-03-01 22:14 ` Eric Sandeen
  2013-03-01 22:31   ` Dave Chinner
  2 siblings, 1 reply; 39+ messages in thread
From: Eric Sandeen @ 2013-03-01 22:14 UTC (permalink / raw)
  To: Ole Tange; +Cc: xfs

On 2/28/13 9:22 AM, Ole Tange wrote:
> I forced a RAID online. I have done that before and xfs_repair
> normally removes the last hour of data or so, but saves everything
> else.
> 
> Today that did not work:
> 
> /usr/local/src/xfsprogs-3.1.10/repair# ./xfs_repair -n /dev/md5p1
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
>         - scan filesystem freespace and inode maps...
> flfirst 232 in agf 91 too large (max = 128)
> Segmentation fault (core dumped)

FWIW, the fs in question seems to need a log replay, so 
xfs_repair -n would find it in a worse state...
I had forgotten that xfs_repair -n won't complain about
a dirty log.  Seems like it should.

But, the log is corrupt enough that it won't replay:

XFS (loop0): Mounting Filesystem
XFS (loop0): Starting recovery (logdev: internal)
ffff88036e7cd800: 58 41 47 46 00 00 00 01 00 00 00 5b 0f ff ff 00  XAGF.......[....
XFS (loop0): Internal error xfs_alloc_read_agf at line 2146 of file fs/xfs/xfs_alloc.c.  Caller 0xffffffffa033d009

so really this'll require xfs_repair -L

xfs_repair -L doesn't segfault though, FWIW.
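
Spelled out as a sketch (device path as in this thread, /mnt/tmp is just a
placeholder mount point, and none of this until the underlying RAID is
stable again): first try a mount/umount to replay the log, then repair:

# mount /dev/md5p1 /mnt/tmp && umount /mnt/tmp
# xfs_repair /dev/md5p1

and only if that mount fails, fall back to zeroing the log:

# xfs_repair -L /dev/md5p1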

I'll try to look into the -n segfault in any case.

-Eric

> Core put in: http://dna.ku.dk/~tange/tmp/xfs_repair.core.bz2
> 
> I tried using the git-version, too, but could not get that to compile.
> 
> # uname -a
> Linux franklin 3.2.0-0.bpo.4-amd64 #1 SMP Debian 3.2.35-2~bpo60+1
> x86_64 GNU/Linux
> 
> # ./xfs_repair -V
> xfs_repair version 3.1.10
> 
> # cat /proc/cpuinfo |grep MH | wc
>      64     256    1280
> 
> # cat /proc/partitions |grep md5
>    9        5 125024550912 md5
>  259        0 107521114112 md5p1
>  259        1 17503434752 md5p2
> 
> # cat /proc/mdstat
> Personalities : [raid0] [raid6] [raid5] [raid4]
> md5 : active raid0 md1[0] md4[3] md3[2] md2[1]
>       125024550912 blocks super 1.2 512k chunks
> 
> md1 : active raid6 sdd[1] sdi[9] sdq[13] sdau[7] sdt[10] sdg[5] sdf[4] sde[2]
>       31256138752 blocks super 1.2 level 6, 128k chunk, algorithm 2
> [10/8] [_UU_UUUUUU]
>       bitmap: 2/2 pages [8KB], 1048576KB chunk
> 
> md4 : active raid6 sdo[13] sdu[9] sdad[8] sdh[7] sdc[6] sds[11]
> sdap[3] sdao[2] sdk[1]
>       31256138752 blocks super 1.2 level 6, 128k chunk, algorithm 2
> [10/8] [_UUUU_UUUU]
>       [>....................]  recovery =  2.1% (84781876/3907017344)
> finish=2196.4min speed=29003K/sec
>       bitmap: 2/2 pages [8KB], 1048576KB chunk
> 
> md2 : active raid6 sdac[0] sdal[9] sdak[8] sdaj[7] sdai[6] sdah[5]
> sdag[4] sdaf[3] sdae[2] sdr[10]
>       31256138752 blocks super 1.2 level 6, 128k chunk, algorithm 2
> [10/10] [UUUUUUUUUU]
>       bitmap: 0/2 pages [0KB], 1048576KB chunk
> 
> md3 : active raid6 sdaq[0] sdab[9] sdaa[8] sdb[7] sdy[6] sdx[5] sdw[4]
> sdv[3] sdz[10] sdj[1]
>       31256138752 blocks super 1.2 level 6, 128k chunk, algorithm 2
> [10/10] [UUUUUUUUUU]
>       bitmap: 0/2 pages [0KB], 1048576KB chunk
> 
> unused devices: <none>
> 
> # smartctl -a /dev/sdau|grep Model
> Device Model:     Hitachi HDS724040ALE640
> 
> # hdparm -W /dev/sdau
> /dev/sdau:
>  write-caching =  0 (off)
> 
> # dmesg
> [ 3745.914280] xfs_repair[25300]: segfault at 7f5d9282b000 ip
> 000000000042d068 sp 00007f5da3183dd0 error 4 in
> xfs_repair[400000+7f000]
> 
> 
> /Ole
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: xfs_repair segfaults
  2013-03-01 22:14 ` Eric Sandeen
@ 2013-03-01 22:31   ` Dave Chinner
  2013-03-01 22:32     ` Eric Sandeen
  0 siblings, 1 reply; 39+ messages in thread
From: Dave Chinner @ 2013-03-01 22:31 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs, Ole Tange

On Fri, Mar 01, 2013 at 04:14:23PM -0600, Eric Sandeen wrote:
> On 2/28/13 9:22 AM, Ole Tange wrote:
> > I forced a RAID online. I have done that before and xfs_repair
> > normally removes the last hour of data or so, but saves everything
> > else.
> > 
> > Today that did not work:
> > 
> > /usr/local/src/xfsprogs-3.1.10/repair# ./xfs_repair -n /dev/md5p1
> > Phase 1 - find and verify superblock...
> > Phase 2 - using internal log
> >         - scan filesystem freespace and inode maps...
> > flfirst 232 in agf 91 too large (max = 128)
> > Segmentation fault (core dumped)
> 
> FWIW, the fs in question seems to need a log replay, so 
> xfs_repair -n would find it in a worse state...
> I had forgotten that xfs_repair -n won't complain about
> a dirty log.  Seems like it should.
> 
> But, the log is corrupt enough that it won't replay:
> 
> XFS (loop0): Mounting Filesystem
> XFS (loop0): Starting recovery (logdev: internal)
> ffff88036e7cd800: 58 41 47 46 00 00 00 01 00 00 00 5b 0f ff ff 00  XAGF.......[....
                                                     ^^
It's detecting AGF 91 is corrupt....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: xfs_repair segfaults
  2013-03-01 22:31   ` Dave Chinner
@ 2013-03-01 22:32     ` Eric Sandeen
  2013-03-01 23:55       ` Eric Sandeen
  2013-03-04 12:47       ` Ole Tange
  0 siblings, 2 replies; 39+ messages in thread
From: Eric Sandeen @ 2013-03-01 22:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, Ole Tange

On 3/1/13 4:31 PM, Dave Chinner wrote:
> On Fri, Mar 01, 2013 at 04:14:23PM -0600, Eric Sandeen wrote:
>> On 2/28/13 9:22 AM, Ole Tange wrote:
>>> I forced a RAID online. I have done that before and xfs_repair
>>> normally removes the last hour of data or so, but saves everything
>>> else.
>>>
>>> Today that did not work:
>>>
>>> /usr/local/src/xfsprogs-3.1.10/repair# ./xfs_repair -n /dev/md5p1
>>> Phase 1 - find and verify superblock...
>>> Phase 2 - using internal log
>>>         - scan filesystem freespace and inode maps...
>>> flfirst 232 in agf 91 too large (max = 128)
>>> Segmentation fault (core dumped)
>>
>> FWIW, the fs in question seems to need a log replay, so 
>> xfs_repair -n would find it in a worse state...
>> I had forgotten that xfs_repair -n won't complain about
>> a dirty log.  Seems like it should.
>>
>> But, the log is corrupt enough that it won't replay:
>>
>> XFS (loop0): Mounting Filesystem
>> XFS (loop0): Starting recovery (logdev: internal)
>> ffff88036e7cd800: 58 41 47 46 00 00 00 01 00 00 00 5b 0f ff ff 00  XAGF.......[....
>                                                      ^^
> It's detecting AGF 91 is corrupt....

Yep and that's what lights up when repair -L runs too.

Ole, you can xfs_mdrestore your metadump image and run test repairs on the result,
if you want a more realistic "dry run" of what repair would do.

-Eric

> Cheers,
> 
> Dave.
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: xfs_repair segfaults
  2013-03-01 22:32     ` Eric Sandeen
@ 2013-03-01 23:55       ` Eric Sandeen
  2013-03-04 12:47       ` Ole Tange
  1 sibling, 0 replies; 39+ messages in thread
From: Eric Sandeen @ 2013-03-01 23:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Ole Tange, xfs

On 3/1/13 4:32 PM, Eric Sandeen wrote:
> Ole, you can xfs_mdrestore your metadump image and run test repairs on the result,

If you like, test it after applying the patch I just sent.

(or wait 'til it's reviewed) :)

Anyway, running in non "-n" mode will avoid the segfault you
reported.

-Eric

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: xfs_repair segfaults
  2013-03-01 16:46     ` Eric Sandeen
@ 2013-03-04  9:00       ` Ole Tange
  2013-03-04 15:20         ` Eric Sandeen
  0 siblings, 1 reply; 39+ messages in thread
From: Ole Tange @ 2013-03-04  9:00 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs

On Fri, Mar 1, 2013 at 5:46 PM, Eric Sandeen <sandeen@sandeen.net> wrote:
> On 3/1/13 3:37 AM, Ole Tange wrote:

>> Please consider providing an example in the man page for xfs_metadump e.g:
>>
>>   xfs_metadump.sh -g /dev/sda2 meta.dump
>
> From the manpage,
>
> SYNOPSIS
>        xfs_metadump [ -efgow ] [ -l logdev ] source target
>
> The source argument must be the pathname of
> the device or file containing the XFS filesystem
>
> and
>
> the target argument specifies the destination file name.
>
> is not enough?

I have never run xfs_metadump before and I am in a state of worry that
my filesystem is toast. I would therefore like to be re-assured that
what I am doing is correct. I did that by reading and re-reading the
manual to make sure I had understood it correctly. By providing me
with an example of the right way to do it in the man page, I will feel
more confident that what I am about to do is correct, and I could
probably save time by not having to re-read the manual.

So I am not saying the information is not there; what I am saying is
that you could, in a simple way, make the information easier to
grasp.


/Ole

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: xfs_repair segfaults
  2013-03-01 20:53     ` Dave Chinner
@ 2013-03-04  9:03       ` Ole Tange
  2013-03-04 23:23         ` Dave Chinner
  0 siblings, 1 reply; 39+ messages in thread
From: Ole Tange @ 2013-03-04  9:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Mar 1, 2013 at 9:53 PM, Dave Chinner <david@fromorbit.com> wrote:
:
> What filesystem errors occurred
> when the srives went offline?

See http://dna.ku.dk/~tange/tmp/syslog.3

Feb 26 00:46:52 franklin kernel: [556238.429259] XFS (md5p1): metadata
I/O error: block 0x459b8 ("xfs_buf_iodone_callbacks") error 5 buf
count 4096


/Ole

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: xfs_repair segfaults
  2013-03-01 22:32     ` Eric Sandeen
  2013-03-01 23:55       ` Eric Sandeen
@ 2013-03-04 12:47       ` Ole Tange
  2013-03-04 15:17         ` Eric Sandeen
  1 sibling, 1 reply; 39+ messages in thread
From: Ole Tange @ 2013-03-04 12:47 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs

On Fri, Mar 1, 2013 at 11:32 PM, Eric Sandeen <sandeen@sandeen.net> wrote:

> Ole, you can xfs_mdrestore your metadump image and run test repairs on the result,
> if you want a more realistic "dry run" of what repair would do.

I have never run xfs_mdrestore before.

From the man page:

       xfs_mdrestore should not be used to restore metadata onto an
existing filesystem unless you are completely certain the  target  can
 be destroyed.

It is unclear to me if you are suggesting me to do:

  xfs_mdrestore the-already-created-dump /dev/md5p1

followed by xfs_repair. Or if you want me to restore the metadata on
another 100 TB partition (I do not have that available).

Maybe you have a trick so that it can be restored on some smaller
block device, so I do not need the 100 TB partition, but I will still
be able to see how many files are being removed? If you have such a
trick, consider including it in the manual.

Also I would love if xfs_repair had an option to copy the changed
sectors to a file, so it would be easy to revert. E.g:

  xfs_repair --backup backup.file /dev/sda1

and if the repair did not work out, then you could revert using:

  xfs_repair --revert backup.file /dev/sda1

and be back at your starting state.


/Ole

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: xfs_repair segfaults
  2013-03-04 12:47       ` Ole Tange
@ 2013-03-04 15:17         ` Eric Sandeen
  2013-03-04 23:11           ` Dave Chinner
  0 siblings, 1 reply; 39+ messages in thread
From: Eric Sandeen @ 2013-03-04 15:17 UTC (permalink / raw)
  To: Ole Tange; +Cc: xfs

On 3/4/13 6:47 AM, Ole Tange wrote:
> On Fri, Mar 1, 2013 at 11:32 PM, Eric Sandeen <sandeen@sandeen.net> wrote:
> 
>> Ole, you can xfs_mdrestore your metadump image and run test repairs on the result,
>> if you want a more realistic "dry run" of what repair would do.
> 
> I have never run xfs_mdrestore before.
> 
> From the man page:
> 
>        xfs_mdrestore should not be used to restore metadata onto an
> existing filesystem unless you are completely certain the  target  can
>  be destroyed.
> 
> It is unclear to me if you are suggesting me to do:
> 
>   xfs_mdrestore the-already-created-dump /dev/md5p1

no. definitely not. :)

> followed by xfs_repair. Or if you want me to restore the metadata on
> another 100 TB partition (I do not have that available).

Nope - to a sparse file, on a filesystem which can hold a file with
100T offsets - like xfs.

> Maybe you have a trick so that it can be restored on some smaller
> block device, so I do not need the 100 TB partition, but I will still
> be able to see how many files are being removed? If you have such a
> trick, consider including it in the manual.

Probably worth doing, or putting in the xfs faq.

Basically, if you do:

# xfs_metadump -o /dev/whatever metadump.file
# xfs_mdrestore metadump.file xfsfile.img

or

# xfs_metadump -o /dev/whatever - | xfs_mdrestore - xfsfile.img

then you can xfs_repair xfsfile.img, without the -n, see what happens,
mount the image as loopback, see what's changed, etc, to see what
xfs_repair really would do with your actual filesystem.

(although, if things are so badly corrupted that metadump gets confused,
it's not as good a test).

The metadump only contains *metadata* so if you read any file on
the mounted loopback image, you just get 0s back.
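
A sketch of that inspection step (/mnt/test is just a placeholder mount
point, and -L may be needed on the image as well if it carries the dirty
log):

# xfs_repair xfsfile.img
# mount -o loop,ro xfsfile.img /mnt/test
# find /mnt/test | less
# umount /mnt/test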

> Also I would love if xfs_repair had an option to copy the changed
> sectors to a file, so it would be easy to revert. E.g:
> 
>   xfs_repair --backup backup.file /dev/sda1
> 
> and if the repair did not work out, then you could revert using:
> 
>   xfs_repair --revert backup.file /dev/sda1
> 
> and be back at your starting state.

That would be nice.  But as soon as you have mounted the result
and made any change at all to the fs, reverting in this manner wouldn't be
safe anymore.

-Eric

> 
> /Ole
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: xfs_repair segfaults
  2013-03-04  9:00       ` Ole Tange
@ 2013-03-04 15:20         ` Eric Sandeen
  2013-03-08 10:21           ` Ole Tange
  0 siblings, 1 reply; 39+ messages in thread
From: Eric Sandeen @ 2013-03-04 15:20 UTC (permalink / raw)
  To: Ole Tange; +Cc: xfs

On 3/4/13 3:00 AM, Ole Tange wrote:
> On Fri, Mar 1, 2013 at 5:46 PM, Eric Sandeen <sandeen@sandeen.net> wrote:
>> On 3/1/13 3:37 AM, Ole Tange wrote:
> 
>>> Please consider providing an example in the man page for xfs_metadump e.g:
>>>
>>>   xfs_metadump.sh -g /dev/sda2 meta.dump
>>
>> From the manpage,
>>
>> SYNOPSIS
>>        xfs_metadump [ -efgow ] [ -l logdev ] source target
>>
>> The source argument must be the pathname of
>> the device or file containing the XFS filesystem
>>
>> and
>>
>> the target argument specifies the destination file name.
>>
>> is not enough?
> 
> I have never run xfs_metadump before and I am in a state of worry that
> my filesystem is toast. I would therefore like to be re-assured that
> what I am doing is correct. I did that by reading and re-reading the
> manual to make sure I had understood it correctly. By providing me
> with an example of the right way to do it in the man page, I will feel
> more confident that what I am about to do it correct and I could
> probably save time by not having to re-read the manual.
> 
> So I am not saying the information is not there, what I am saying is
> that you in a simple way could make it easier to grasp the
> information.

Fair enough, maybe a concrete example is warranted.

I suggested the metadump for 2 reasons:

1) to get an image we could look at, to analyze the reason for the segfault, and
2) so you could run a "real" non-"n" xfs_repair on a metadata image as a more realistic dry run

xfs_metadump only *reads* your filesystem, so there is nothing dangerous.

But I understand your paranoia and worry.  :)

-Eric

> 
> /Ole
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: xfs_repair segfaults
  2013-03-04 15:17         ` Eric Sandeen
@ 2013-03-04 23:11           ` Dave Chinner
  0 siblings, 0 replies; 39+ messages in thread
From: Dave Chinner @ 2013-03-04 23:11 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs, Ole Tange

On Mon, Mar 04, 2013 at 09:17:51AM -0600, Eric Sandeen wrote:
> On 3/4/13 6:47 AM, Ole Tange wrote:
> > On Fri, Mar 1, 2013 at 11:32 PM, Eric Sandeen <sandeen@sandeen.net> wrote:
> > 
> >> Ole, you can xfs_mdrestore your metadump image and run test repairs on the result,
> >> if you want a more realistic "dry run" of what repair would do.
> > 
> > I have never run xfs_mdrestore before.
> > 
> > From the man page:
> > 
> >        xfs_mdrestore should not be used to restore metadata onto an
> > existing filesystem unless you are completely certain the  target  can
> >  be destroyed.
> > 
> > It is unclear to me if you are suggesting me to do:
> > 
> >   xfs_mdrestore the-already-created-dump /dev/md5p1
> 
> no. definitely not. :)
> 
> > followed by xfs_repair. Or if you want me to restore the metadata on
> > another 100 TB partition (I do not have that available).
> 
> Nope - to a sparse file, on a filesystem which can hold a file with
> 100T offsets - like xfs.
> 
> > Maybe you have a trick so that it can be restored on some smaller
> > block device, so I do not need the 100 TB partition, but I will still
> > be able to see how many files are being removed? If you have such a
> > trick, consider including it in the manual.
> 
> Probably worth doing, or putting in the xfs faq.

Examples of how to do these sorts of operations should go into the
xfs users guide here:

http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide//tmp/en-US/html/index.html

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: xfs_repair segfaults
  2013-03-04  9:03       ` Ole Tange
@ 2013-03-04 23:23         ` Dave Chinner
  2013-03-08 10:09           ` Ole Tange
  0 siblings, 1 reply; 39+ messages in thread
From: Dave Chinner @ 2013-03-04 23:23 UTC (permalink / raw)
  To: Ole Tange; +Cc: xfs

On Mon, Mar 04, 2013 at 10:03:29AM +0100, Ole Tange wrote:
> On Fri, Mar 1, 2013 at 9:53 PM, Dave Chinner <david@fromorbit.com> wrote:
> :
> > What filesystem errors occurred
> > when the srives went offline?
> 
> See http://dna.ku.dk/~tange/tmp/syslog.3

Your log is full of this:

mpt2sas1: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)

What's that mean?

> 
> Feb 26 00:46:52 franklin kernel: [556238.429259] XFS (md5p1): metadata
> I/O error: block 0x459b8 ("xfs_buf_iodone_callbacks") error 5 buf
> count 4096

So, the first IO errors appear at 23:00 on /dev/sdb, and the
controller does a full reset and reprobe. Looks like a port failure
of some kind. Notable:

mpt2sas1: LSISAS2008: FWVersion(07.00.00.00), ChipRevision(0x03), BiosVersion(07.11.10.00)

From a quick google, that firmware looks out of date (current
LSISAS2008 firmwares are numbered 10 or 11, and bios versions are at
7.21).

So, /dev/md1 reported a failure (/dev/sdb) around 23:01:16, started a
rebuild. Looks like it swapped in /dev/sdd and started a rebuild.

/dev/md4 had a failure (/dev/sds) around 00:19, no rebuild started.
Down to 8 disks in /dev/md4, no rebuild in progress, no redundancy
available.

/dev/md1 had another failure (/dev/sdj) around 00:46, this time on a
SYNCHRONISE CACHE command (i.e. log write). This IO failure caused
the shutdown to occur. And this is the result:

[556219.292225] end_request: I/O error, dev sdj, sector 10
[556219.292275] md: super_written gets error=-5, uptodate=0
[556219.292283] md/raid:md1: Disk failure on sdj, disabling device.
[556219.292286] md/raid:md1: Operation continuing on 7 devices.

At this point, /dev/md1 is reporting 7 working disks and has had an
EIO on its superblock write, which means it's probably in an
inconsistent state. Further, it's only got 8 disks associated with
it and as a rebuild is in progress it means that data loss has
occurred with this failure. There's your problem.

Essentially, you need to fix your hardware before you do anything
else. Get it all back fully online and fix whatever the problems are
that are causing IO errors, then you can worry about recovering the
filesystem and your data. Until the hardware is stable and not
throwing errors, recovery is going to be unreliable (if not
impossible).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: xfs_repair segfaults
  2013-03-04 23:23         ` Dave Chinner
@ 2013-03-08 10:09           ` Ole Tange
  0 siblings, 0 replies; 39+ messages in thread
From: Ole Tange @ 2013-03-08 10:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Mar 5, 2013 at 12:23 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Mar 04, 2013 at 10:03:29AM +0100, Ole Tange wrote:
>> On Fri, Mar 1, 2013 at 9:53 PM, Dave Chinner <david@fromorbit.com> wrote:
>> :
>> > What filesystem errors occurred
>> > when the srives went offline?
>>
>> See http://dna.ku.dk/~tange/tmp/syslog.3
>
> You log is full of this:
>
> mpt2sas1: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
>
> What's that mean?

We do not know, but it is something we are continually trying to find
out. We have 5 other systems using the same setup and they experience
the same.

1 of these 5 systems drops disks off the RAID, but the rest work fine.
In other words: we do not experience data corruption - only disks
dropping off the RAID. That leads me to believe it is some kind of
timeout error.

>> Feb 26 00:46:52 franklin kernel: [556238.429259] XFS (md5p1): metadata
>> I/O error: block 0x459b8 ("xfs_buf_iodone_callbacks") error 5 buf
>> count 4096
>
> So, the first IO errors appear at 23:00 on /dev/sdb, and the
> controller does a full reset and reprobe. Look slike a port failure
> of some kind. Notable:
>
> mpt2sas1: LSISAS2008: FWVersion(07.00.00.00), ChipRevision(0x03), BiosVersion(07.11.10.00)
>
> From a quick google, that firmware looks out of date (current
> LSISAS2008 firmwares are numbered 10 or 11, and bios versions are at
> 7.21).

We have tried updating the firmware using LSI's own tool. That fails, as
the LSI tool says the firmware is not signed correctly.

> /dev/md4 had a failure (/dev/sds) around 00:19, no rebuild started.

The rebuild of md4 is now complete.

> /dev/md1 had another failure (/dev/sdj) around 00:46, this time on a
> SYNCHRONISE CACHE command (i.e. log write). This IO failure caused
> the shutdown to occur. And this is the result:
>
> [556219.292225] end_request: I/O error, dev sdj, sector 10
> [556219.292275] md: super_written gets error=-5, uptodate=0
> [556219.292283] md/raid:md1: Disk failure on sdj, disabling device.
> [556219.292286] md/raid:md1: Operation continuing on 7 devices.
>
> At this point, /dev/md1 is reporting 7 working disks and has had an
> EIO on it's superblock write, which means it's probably in an
> inconsistent state. Further, it's only got 8 disks associated with
> it and as a rebuild is in progress it means that data loss has
> occurred with this failure. There's your problem.

Yep. What I would like to see from xfs_repair is that it salvages the
part that is not affected - which ought to be the major part of the
100 TB.

> Essentially, you need to fix your hardware before you do anything
> else. Get it all back fully online and fix whatever the problems are
> that are causing IO errors, then you can worry about recovering the
> filesystem and your data. Until the hardware is stable and not
> throwing errors, recovery is going to be unreliable (if not
> impossible).

As that has been an ongoing effort it is unlikely to be solved within
a short timeframe.


/Ole

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: xfs_repair segfaults
  2013-03-04 15:20         ` Eric Sandeen
@ 2013-03-08 10:21           ` Ole Tange
  2013-03-08 20:32             ` Eric Sandeen
  2013-03-12 11:37             ` Ole Tange
  0 siblings, 2 replies; 39+ messages in thread
From: Ole Tange @ 2013-03-08 10:21 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs

On Mon, Mar 4, 2013 at 4:20 PM, Eric Sandeen <sandeen@sandeen.net> wrote:

> 2) so you could run a "real" non-"n" xfs_repair on a metadata image as a more realistic dry run

I have now done an 'xfs_repair' using the code in git. It failed, and I
then did 'xfs_repair -L', which succeeded.

Am I correct that I should now be able to mount the sparse disk-image
file and see all the filenames? In that case I am quite worried. I get
filenames like:

/mnt/disk/??5?z+hEOgl/?7?Psr1?aIH<?ip:??/>S??+??z=ozK/8_0/???d)
5JCG?eiBd?EVsNF'A?v?m?f;Fi6v)d>/?M%?A??J?)B<soGlc??QuY!e-<,6G?
X[Df?Wm^[?f 4|

My guess is some superblock is corrupt and that it should instead try
a backup superblock. It might be useful if xfs_repair could do this
automatically based on the rule of thumb that more than 90% of
filenames/dirnames match:

[- _.,=A-Za-z0-9':]* [([{]* [- _.,=A-Za-z0-9':]* []})]* [- _.,=A-Za-z0-9':]*

If it finds a superblock resulting in more than 10% not matching the
above, it should probably ignore that superblock (unless the file names
are using non-Latin characters - such as Japanese).
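
Done by hand against a mounted (restored) image, the kind of check I mean
could be sketched as (/mnt/disk as above; this counts non-matching names):

# find /mnt/disk -printf '%f\n' | grep -cv "^[- _.,=A-Za-z0-9':]*$"

and compared against the total from 'find /mnt/disk | wc -l'.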


/Ole

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: xfs_repair segfaults
  2013-03-08 10:21           ` Ole Tange
@ 2013-03-08 20:32             ` Eric Sandeen
  2013-03-12 10:41               ` Ole Tange
  2013-03-12 11:37             ` Ole Tange
  1 sibling, 1 reply; 39+ messages in thread
From: Eric Sandeen @ 2013-03-08 20:32 UTC (permalink / raw)
  To: Ole Tange; +Cc: xfs

On 3/8/13 4:21 AM, Ole Tange wrote:
> On Mon, Mar 4, 2013 at 4:20 PM, Eric Sandeen <sandeen@sandeen.net> wrote:
> 
>> 2) so you could run a "real" non-"n" xfs_repair on a metadata image as a more realistic dry run
> 
> I have now done a 'xfs_repair' using the code in GIT. It failed, and I
> then did 'xfs_repair -L' which succeeded.
> 
> Am I correct that I should now be able to mount the sparse disk-image
> file and see all the filenames? In that case I am quite worried. I get
> filenames like:
> 
> /mnt/disk/??5?z+hEOgl/?7?Psr1?aIH<?ip:??/>S??+??z=ozK/8_0/???d)
> 5JCG?eiBd?EVsNF'A?v?m?f;Fi6v)d>/?M%?A??J?)B<soGlc??QuY!e-<,6G?
> X[Df?Wm^[?f 4|

By default, xfs_metadump scrambles filenames, so nothing to worry
about (it's for privacy reasons).  If you use the "-o" option it'll keep
it in the clear.

-Eric

> My guess is some superblock is corrupt and that it should instead try
> a backup superblock. It might be useful if xfs_repair could do this
> automatically based on the rule of thumb that more than 90% of
> filenames/dirnames match:
> 
> [- _.,=A-Za-z0-9':]* [([{]* [- _.,=A-Za-z0-9':]* []})]* [- _.,=A-Za-z0-9':]*
> 
> If it finds a superblock resulting in more then 10% not matching the
> above it should probably ignore that superblock (unless the file names
> are using non-latin characters - such as Japanese).
> 
> 
> /Ole
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: xfs_repair segfaults
  2013-03-08 20:32             ` Eric Sandeen
@ 2013-03-12 10:41               ` Ole Tange
  2013-03-12 14:40                 ` Eric Sandeen
  0 siblings, 1 reply; 39+ messages in thread
From: Ole Tange @ 2013-03-12 10:41 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs

On Fri, Mar 8, 2013 at 9:32 PM, Eric Sandeen <sandeen@sandeen.net> wrote:
> On 3/8/13 4:21 AM, Ole Tange wrote:
>> On Mon, Mar 4, 2013 at 4:20 PM, Eric Sandeen <sandeen@sandeen.net> wrote:
>>
>>> 2) so you could run a "real" non-"n" xfs_repair on a metadata image as a more realistic dry run
:
>> I get
>> filenames like:
>>
>> /mnt/disk/??5?z+hEOgl/?7?Psr1?aIH<?ip:??/>S??+??z=ozK/8_0/???d)
>> 5JCG?eiBd?EVsNF'A?v?m?f;Fi6v)d>/?M%?A??J?)B<soGlc??QuY!e-<,6G?
>> X[Df?Wm^[?f 4|
>
> By default, xfs_metadump scrambles filenames, so nothing to worry
> about (it's for privacy reasons).  If you use the "-o" option it'll keep
> it in the clear.

Ahh. To me that does not conform to the Principle of Least Astonishment -
especially since some of the filenames were not obfuscated.

I would have been less surprised if the files were named:

    Use_-o_for_real_file_names_XXXXXXXX
    Use_-o_for_real_dir_names_XXXXXXXX

where XXXXXXXX is just a unique number.


/Ole

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: xfs_repair segfaults
  2013-03-08 10:21           ` Ole Tange
  2013-03-08 20:32             ` Eric Sandeen
@ 2013-03-12 11:37             ` Ole Tange
  2013-03-12 14:47               ` Eric Sandeen
  1 sibling, 1 reply; 39+ messages in thread
From: Ole Tange @ 2013-03-12 11:37 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs

(Forgot CC-to list)

On Fri, Mar 8, 2013 at 11:21 AM, Ole Tange <tange@binf.ku.dk> wrote:
> On Mon, Mar 4, 2013 at 4:20 PM, Eric Sandeen <sandeen@sandeen.net> wrote:
>
>> 2) so you could run a "real" non-"n" xfs_repair on a metadata image as a more realistic dry run

So I made a new metadata image using xfs_metadump.sh from git:

    ./xfs_metadump.sh -o /dev/md123p1 franklin.xfs.metadump
    pbzip2 franklin.xfs.metadump

Then I restored it:

    pbzip2 -dc < franklin.xfs.noobfuscate.metadump.bz2 |
      time ~/work/xfsprogs/mdrestore/xfs_mdrestore - franklin.img

Then I ran the git version of xfs_repair (first with -n, then with no option,
then with -L):

$ ~/work/xfsprogs/repair/xfs_repair -n franklin.img
# Load of output. Completes OK.

$ ~/work/xfsprogs/repair/xfs_repair  franklin.img
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

$ ~/work/xfsprogs/repair/xfs_repair -L franklin.img
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
xfs_repair: scan.c:1080: scan_freelist: Assertion `0' failed.
Aborted (core dumped)

Then I restored the metadata:

    pbzip2 -dc < franklin.xfs.noobfuscate.metadump.bz2 |
      time ~/work/xfsprogs/mdrestore/xfs_mdrestore - franklin.img

Then I ran xfs_repair version 3.1.7. This gave a lot of output but
completed without core dumping.

The resulting filesystem was mountable and contained at least some of
the filenames I expected.

I believe either there is a new bug in the git version, or it simply
discovers a bug that 3.1.7 does not.

metadata, xfs_repair binary, core:
http://dna.ku.dk/~tange/xfs/


/Ole

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: xfs_repair segfaults
  2013-03-12 10:41               ` Ole Tange
@ 2013-03-12 14:40                 ` Eric Sandeen
  0 siblings, 0 replies; 39+ messages in thread
From: Eric Sandeen @ 2013-03-12 14:40 UTC (permalink / raw)
  To: Ole Tange; +Cc: xfs

On 3/12/13 5:41 AM, Ole Tange wrote:
> On Fri, Mar 8, 2013 at 9:32 PM, Eric Sandeen <sandeen@sandeen.net> wrote:
>> On 3/8/13 4:21 AM, Ole Tange wrote:
>>> On Mon, Mar 4, 2013 at 4:20 PM, Eric Sandeen <sandeen@sandeen.net> wrote:
>>>
>>>> 2) so you could run a "real" non-"n" xfs_repair on a metadata image as a more realistic dry run
> :
>>> I get
>>> filenames like:
>>>
>>> /mnt/disk/??5?z+hEOgl/?7?Psr1?aIH<?ip:??/>S??+??z=ozK/8_0/???d)
>>> 5JCG?eiBd?EVsNF'A?v?m?f;Fi6v)d>/?M%?A??J?)B<soGlc??QuY!e-<,6G?
>>> X[Df?Wm^[?f 4|
>>
>> By default, xfs_metadump scrambles filenames, so nothing to worry
>> about (it's for privacy reasons).  If you use the "-o" option it'll keep
>> it in the clear.
> 
> Ahh. To me that does not conform to Principle of Least Astonishment -
> especially since some of the filenames were not obfuscated.
> 
> I would have been less surprised if the files were named:
> 
>     Use_-o_for_real_file_names_XXXXXXXX
>     Use_-o_for_real_dir_names_XXXXXXXX
> 
> where XXXXXXXX is just a unique number.

That would completely change the actual on-disk metadata format,
though, which would defeat the primary purpose of metadump.

As it is, it preserves name lengths and name hashes, so that
what is produced is an accurate representation of the original
filesystem's metadata for analysis.

This is described in the manpage, though I sympathize that
it's a bit alarming the first time you see it in the dump
output, if you weren't aware.

-Eric
 
> 
> /Ole
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: xfs_repair segfaults
  2013-03-12 11:37             ` Ole Tange
@ 2013-03-12 14:47               ` Eric Sandeen
  0 siblings, 0 replies; 39+ messages in thread
From: Eric Sandeen @ 2013-03-12 14:47 UTC (permalink / raw)
  To: Ole Tange; +Cc: xfs

On 3/12/13 6:37 AM, Ole Tange wrote:

> $ ~/work/xfsprogs/repair/xfs_repair -L franklin.img
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
>         - zero log...
> ALERT: The filesystem has valuable metadata changes in a log which is being
> destroyed because the -L option was used.
>         - scan filesystem freespace and inode maps...
> xfs_repair: scan.c:1080: scan_freelist: Assertion `0' failed.
> Aborted (core dumped)

Oh, man.  I need to have my hacker card revoked.  Or maybe focus
on one filesystem at a time so I don't keep doing dumb things, like
adding an unconditional ASSERT in non-"-n"-mode.  Holy cow, I don't
know what's up with me lately.  :/

Anyway, just modify these 2 lines in repair/scan.c to remove the ASSERT
around line 1080.

I'll send a proper patch as well.



diff --git a/repair/scan.c b/repair/scan.c
index 6a62dff..76bb7f1 100644
--- a/repair/scan.c
+++ b/repair/scan.c
@@ -1076,8 +1076,7 @@ scan_freelist(
 				  "freelist scan\n"), i);
 			return;
 		}
-	} else /* should have been fixed in verify_set_agf() */
-		ASSERT(0);
+	}
 
 	count = 0;
 	for (;;) {




_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: Xfs_repair segfaults.
       [not found]                       ` <CADNx=KuQjMNHUk6t0+hBZ5DN6s=RXqrPEjeoSxpBta47CJoDgQ@mail.gmail.com>
@ 2013-05-10 11:00                         ` Filippo Stenico
  0 siblings, 0 replies; 39+ messages in thread
From: Filippo Stenico @ 2013-05-10 11:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eric Sandeen, xfs



Below is the backtrace, with the last lines of verbose output at the end.

I am not a skilled software developer, so this has little meaning to me...
only that I expected a bit more output from this backtrace.

root@ws1000:~# gdb xfs_repair core
GNU gdb (GDB) 7.4.1-debian
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html
>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /sbin/xfs_repair...done.
[New LWP 5187]

warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `xfs_repair -vv -L -P -m 1750 /dev/mapper/vg0-lv0'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fa8a1e05bcc in _wordcopy_bwd_dest_aligned (dstp=140734808924136,
    srcp=98541856, len=2305843009212646584) at wordcopy.c:395
395    wordcopy.c: No such file or directory.
(gdb) bt
#0  0x00007fa8a1e05bcc in _wordcopy_bwd_dest_aligned (dstp=140734808924136,
    srcp=98541856, len=2305843009212646584) at wordcopy.c:395
#1  0x0000000000000000 in ?? ()
(gdb) q
root@ws1000:~# uname -r
3.2.0-4-amd64
root@ws1000:~# tail xfs_repair_gdb/xfs_repair-vv-L-P-m1750.out
correcting nextents for inode 28722522
bad data fork in inode 28722522
cleared inode 28722522
inode 28722524 - bad extent starting block number 3387062910, offset 0
correcting nextents for inode 28722524
bad data fork in inode 28722524
cleared inode 28722524
entry
"
qs " in shortform directory 28722525 references invalid inode 0
size of last entry overflows space left in in shortform dir 28722525,
resetting to -2
entry contains offset out of order in shortform dir 28722525

....

Cheers,
F

P.S.
sorry for the double mail, Eric.
On Fri, May 10, 2013 at 12:39 AM, Dave Chinner <david@fromorbit.com> wrote:

> On Thu, May 09, 2013 at 07:22:32PM +0200, Filippo Stenico wrote:
>> > Hello,
>> > ran xfs_repair -vv -L -P -m1750 and segfault at expected point.
>> >
>> > I got the core-dump with dbg symbols, along with repair output and
>> strace
>> > output.
>> >
>> > What should I do with it? Provide as it is?
>>
>> Run it in gdb to get a stack trace of where it died.
>>
>> $ gdb xfs_repair corefile
>> ...
>> > bt
>>
>> and post the output, along with the repair output from the run that
>> crashed...
>>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> david@fromorbit.com
>>
>
>
>
> --
> F




-- 
F


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: Xfs_repair segfaults.
  2013-05-09 17:22                   ` Filippo Stenico
@ 2013-05-09 22:39                     ` Dave Chinner
       [not found]                       ` <CADNx=KuQjMNHUk6t0+hBZ5DN6s=RXqrPEjeoSxpBta47CJoDgQ@mail.gmail.com>
  0 siblings, 1 reply; 39+ messages in thread
From: Dave Chinner @ 2013-05-09 22:39 UTC (permalink / raw)
  To: Filippo Stenico; +Cc: Eric Sandeen, xfs

On Thu, May 09, 2013 at 07:22:32PM +0200, Filippo Stenico wrote:
> Hello,
> ran xfs_repair -vv -L -P -m1750 and segfault at expected point.
> 
> I got the core-dump with dbg symbols, along with repair output and strace
> output.
> 
> What should I do with it? Provide as it is?

Run it in gdb to get a stack trace of where it died.

$ gdb xfs_repair corefile
...
> bt

and post the output, along with the repair output from the run that
crashed...
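
(When a plain bt only yields a frame or two, a slightly more thorough
session sometimes recovers extra context; these are all stock gdb
commands, nothing xfs-specific:

$ gdb xfs_repair corefile
(gdb) bt full
(gdb) thread apply all bt
(gdb) info registers
(gdb) frame 0
(gdb) info locals

None of these modify the core file or the binary.)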

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xfs_repair segfaults.
  2013-05-09 15:11                 ` Filippo Stenico
  2013-05-09 17:22                   ` Filippo Stenico
@ 2013-05-09 22:37                   ` Dave Chinner
  1 sibling, 0 replies; 39+ messages in thread
From: Dave Chinner @ 2013-05-09 22:37 UTC (permalink / raw)
  To: Filippo Stenico; +Cc: Eric Sandeen, xfs

On Thu, May 09, 2013 at 05:11:13PM +0200, Filippo Stenico wrote:
> On Thu, May 9, 2013 at 1:39 AM, Dave Chinner <david@fromorbit.com> wrote:
> 
> > On Wed, May 08, 2013 at 07:30:05PM +0200, Filippo Stenico wrote:
> > > Hello,
> > > -m option seems not to handle the excessive memory consumption I ran
> > into.
> > > I actually ran xfs_repair -vv -m1750 and  looking into kern.log it  seems
> > > that xfs_repair invoked oom killer, but was killed itself ( !! )
> >
> > That's exactly what the oom killer is supposed to do.
> >
> > Yeah, some sacrifice needed.
> 
> 
> > > This is last try to reproduce segfault:
> > > xfs_repair -vv -P -m1750
> >
> > I know your filesystem is around 7TB in size, but how much RAM do
> > you have? It's not unusual for xfs_repair to require many GB of
>> memory to run successfully on filesystems of this size...
> >
> > it is around 11 TB, 7.2 used.
> I have 4 G ram, but xfs_repair -vv -m1 says I need 1558
> root@ws1000:~# xfs_repair -vv -P -m 1 /dev/mapper/vg0-lv0
> Phase 1 - find and verify superblock...
>         - max_mem = 1024, icount = 29711040, imem = 116058, dblock =
> 2927886336, dmem = 1429632
> Required memory for repair is greater that the maximum specified
> with the -m option. Please increase it to at least 1558.

That's the minimum it requires to run in prefetch mode, not the
maximum it will use.
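
(As a rough sanity check on those phase-1 numbers, and assuming imem and
dmem are reported in KiB like max_mem is, the tracking tables alone come
to about

    116058 KiB + 1429632 KiB = 1545690 KiB, i.e. roughly 1510 MiB,

which roughly lines up with the "at least 1558" floor repair printed;
the later phases on a badly damaged filesystem can need considerably
more than that.)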

4GB RAM on a badly corrupted 7TB filesystem is almost certainly not
enough memory to track all the broken bits that need tracking. Add
20-30GB of swap space and see how it goes...
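
(A minimal sketch of adding temporary swap for a run like this; the size
and path are only examples, and the swap file should live on a
filesystem other than the one being repaired:

dd if=/dev/zero of=/mnt/other/swapfile bs=1M count=30720
chmod 600 /mnt/other/swapfile
mkswap /mnt/other/swapfile
swapon /mnt/other/swapfile

dd is used rather than fallocate so the file has no holes; once repair
has finished, swapoff the file and delete it again.)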

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xfs_repair segfaults.
  2013-05-09 15:11                 ` Filippo Stenico
@ 2013-05-09 17:22                   ` Filippo Stenico
  2013-05-09 22:39                     ` Dave Chinner
  2013-05-09 22:37                   ` Dave Chinner
  1 sibling, 1 reply; 39+ messages in thread
From: Filippo Stenico @ 2013-05-09 17:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eric Sandeen, xfs


[-- Attachment #1.1: Type: text/plain, Size: 1866 bytes --]

Hello,
I ran xfs_repair -vv -L -P -m1750 and it segfaulted at the expected point.

I got the core dump with debugging symbols, along with the repair output and strace
output.

What should I do with it? Provide it as it is?
The archive is about 70M; for now I have put it on a public share:
https://apps.memopal.com/e/S656XDVV

Hope this can help.


On Thu, May 9, 2013 at 5:11 PM, Filippo Stenico
<filippo.stenico@gmail.com>wrote:

>
>
> On Thu, May 9, 2013 at 1:39 AM, Dave Chinner <david@fromorbit.com> wrote:
>
>> On Wed, May 08, 2013 at 07:30:05PM +0200, Filippo Stenico wrote:
>> > Hello,
>> > -m option seems not to handle the excessive memory consumption I ran
>> into.
>> > I actually ran xfs_repair -vv -m1750 and  looking into kern.log it
>>  seems
>> > that xfs_repair invoked oom killer, but was killed itself ( !! )
>>
>> That's exactly what the oom killer is supposed to do.
>>
>> Yeah, some sacrifice needed.
>
>
>> > This is last try to reproduce segfault:
>> > xfs_repair -vv -P -m1750
>>
>> I know your filesystem is around 7TB in size, but how much RAM do
>> you have? It's not unusual for xfs_repair to require many GB of
>> memory to run successfully on filesystems of this size...
>>
>> it is around 11 TB, 7.2 used.
> I have 4 G ram, but xfs_repair -vv -m1 says I need 1558
> root@ws1000:~# xfs_repair -vv -P -m 1 /dev/mapper/vg0-lv0
> Phase 1 - find and verify superblock...
>         - max_mem = 1024, icount = 29711040, imem = 116058, dblock =
> 2927886336, dmem = 1429632
> Required memory for repair is greater that the maximum specified
> with the -m option. Please increase it to at least 1558.
>
> so I figured 1750 megs would be enough
>
> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> david@fromorbit.com
>>
>> _______________________________________________
>> xfs mailing list
>> xfs@oss.sgi.com
>> http://oss.sgi.com/mailman/listinfo/xfs
>>
>
>
>
> --
> F




-- 
F

[-- Attachment #1.2: Type: text/html, Size: 3238 bytes --]

[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xfs_repair segfaults.
  2013-05-08 23:39               ` Dave Chinner
@ 2013-05-09 15:11                 ` Filippo Stenico
  2013-05-09 17:22                   ` Filippo Stenico
  2013-05-09 22:37                   ` Dave Chinner
  0 siblings, 2 replies; 39+ messages in thread
From: Filippo Stenico @ 2013-05-09 15:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eric Sandeen, xfs


[-- Attachment #1.1: Type: text/plain, Size: 1391 bytes --]

On Thu, May 9, 2013 at 1:39 AM, Dave Chinner <david@fromorbit.com> wrote:

> On Wed, May 08, 2013 at 07:30:05PM +0200, Filippo Stenico wrote:
> > Hello,
> > -m option seems not to handle the excessive memory consumption I ran
> into.
> > I actually ran xfs_repair -vv -m1750 and  looking into kern.log it  seems
> > that xfs_repair invoked oom killer, but was killed itself ( !! )
>
> That's exactly what the oom killer is supposed to do.
>
> Yeah, some sacrifice needed.


> > This is last try to reproduce segfault:
> > xfs_repair -vv -P -m1750
>
> I know your filesystem is around 7TB in size, but how much RAM do
> you have? It's not unusual for xfs_repair to require many GB of
> memory to run successfully on filesystems of this size...
>
> it is around 11 TB, with 7.2 TB used.
I have 4 GB of RAM, but xfs_repair -vv -m1 says I need 1558
root@ws1000:~# xfs_repair -vv -P -m 1 /dev/mapper/vg0-lv0
Phase 1 - find and verify superblock...
        - max_mem = 1024, icount = 29711040, imem = 116058, dblock =
2927886336, dmem = 1429632
Required memory for repair is greater that the maximum specified
with the -m option. Please increase it to at least 1558.

so I figured 1750 megs would be enough

Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
>



-- 
F

[-- Attachment #1.2: Type: text/html, Size: 2342 bytes --]

[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xfs_repair segfaults.
  2013-05-08 17:30             ` Filippo Stenico
  2013-05-08 17:42               ` Filippo Stenico
@ 2013-05-08 23:39               ` Dave Chinner
  2013-05-09 15:11                 ` Filippo Stenico
  1 sibling, 1 reply; 39+ messages in thread
From: Dave Chinner @ 2013-05-08 23:39 UTC (permalink / raw)
  To: Filippo Stenico; +Cc: Eric Sandeen, xfs

On Wed, May 08, 2013 at 07:30:05PM +0200, Filippo Stenico wrote:
> Hello,
> -m option seems not to handle the excessive memory consumption I ran into.
> I actually ran xfs_repair -vv -m1750 and  looking into kern.log it  seems
> that xfs_repair invoked oom killer, but was killed itself ( !! )

That's exactly what the oom killer is supposed to do.

> This is last try to reproduce segfault:
> xfs_repair -vv -P -m1750

I know your filesystem is around 7TB in size, but how much RAM do
you have? It's not unusual for xfs_repair to require many GB of
memory to run successfully on filesystems of this size...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xfs_repair segfaults.
  2013-05-08 17:30             ` Filippo Stenico
@ 2013-05-08 17:42               ` Filippo Stenico
  2013-05-08 23:39               ` Dave Chinner
  1 sibling, 0 replies; 39+ messages in thread
From: Filippo Stenico @ 2013-05-08 17:42 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs


[-- Attachment #1.1: Type: text/plain, Size: 4749 bytes --]

As I was writing, it happened.

I got a segfault at the same place as reported the first time.

So, uhm, what do you need to see?

I was putting together:
- machine info (cat /proc/meminfo, cat /proc/cpuinfo, uname -r)
- kernel log
- core.dump with debugging symbols
- output of xfs_repair
- output of strace $(xfs_repair)

I will put all the logs in place and send them tomorrow, as ... I forgot to raise
the core size limit for my shell ... and today I am toasted, so I had better go
home and sleep. (A sketch of capturing all of this in one go follows.)
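
(One possible way to capture all of that in a single run; the core
pattern, output paths and -m value are only examples:

ulimit -c unlimited
echo /var/tmp/core.%e.%p > /proc/sys/kernel/core_pattern
( uname -a; cat /proc/cpuinfo /proc/meminfo ) > machine-info.txt
strace -f -o xfs_repair.strace \
    xfs_repair -vv -L -P -m 1750 /dev/mapper/vg0-lv0 2>&1 | tee xfs_repair.out

The core file, machine-info.txt, xfs_repair.strace and xfs_repair.out
can then be bundled up together.)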

Tell me if you need anything else.

Regards.

On Wed, May 8, 2013 at 7:30 PM, Filippo Stenico
<filippo.stenico@gmail.com>wrote:

> Hello,
> -m option seems not to handle the excessive memory consumption I ran into.
> I actually ran xfs_repair -vv -m1750 and  looking into kern.log it  seems
> that xfs_repair invoked oom killer, but was killed itself ( !! )
>
> This is last try to reproduce segfault:
> xfs_repair -vv -P -m1750
>
> On Tue, May 7, 2013 at 8:20 PM, Filippo Stenico <filippo.stenico@gmail.com
> > wrote:
>
>> xfs_repair -L -vv -P /dev/mapper/vg0-lv0 does the same kernel panic as my
>> first report. No use to double info on this.
>> I'll try xfs_repair -L -vv -P -m 2000 to keep memory consumption at a
>> limit.
>>
>>
>>
>> On Tue, May 7, 2013 at 3:36 PM, Filippo Stenico <
>> filippo.stenico@gmail.com> wrote:
>>
>>>
>>>
>>> On Tue, May 7, 2013 at 3:20 PM, Eric Sandeen <sandeen@sandeen.net>wrote:
>>>
>>>> On 5/7/13 4:27 AM, Filippo Stenico wrote:
>>>> > Hello,
>>>> > this is a start-over to try hard to recover some more data out of my
>>>> raid5 - lvm - xfs toasted volume.
>>>> > My goal is either to try the best to get some more data out of the
>>>> volume, and see if I can reproduce the segfault.
>>>> > I compiled xfsprogs 3.1.9 from deb-source. I ran a xfs_metarestore to
>>>> put original metadata on the cloned raid volume i had zeroed the log on
>>>> before via xfs_repair -L (i figured none of the actual data was modified
>>>> before as I am just working on metadata.. right?).
>>>> > Then I ran a mount, checked a dir that I knew it was corrupted,
>>>> unmount and try an xfs_repair (commands.txt attached for details)
>>>> > I went home to sleep, but at morning I found out that kernel paniced
>>>> due "out of memory and no killable process".
>>>> > I ran repair without -P... Should I try now disabling inode prefetch?
>>>> > Attached are also output of "free" and "top" at time of panic, as
>>>> well as the output of xfs_repair and strace attached to it. Dont think gdb
>>>> symbols would help here....
>>>> >
>>>>
>>>> >
>>>>
>>>> Ho hum, well, no segfault this time, just an out of memory error?
>>>>
>>> That's right....
>>>
>>>> No real way to know where it went from the available data I think.
>>>>
>>>> A few things:
>>>>
>>>> > root@ws1000:~# mount /dev/mapper/vg0-lv0 /raid0/data/
>>>> > mount: Structure needs cleaning
>>>>
>>>> mount failed?  Now's the time to look at dmesg to see why.
>>>>
>>>
>>> From attached logs it seems to be:
>>>>
>>>> > XFS internal error xlog_valid_rec_header(1) at line 3466 of file
>>>> [...2.6.32...]/fs/xfs/xfs_log_recover.c
>>>> > XFS: log mount/recovery failed: error 117
>>>>
>>>> > root@ws1000:~# mount
>>>>
>>>> <no raid0 mounted>
>>>>
>>>> > root@ws1000:~# mount /dev/mapper/vg0-lv0 /raid0/data/
>>>> > root@ws1000:~# mount | grep raid0
>>>> > /dev/mapper/vg0-lv0 on /raid0/data type xfs
>>>> (rw,relatime,attr2,noquota)
>>>>
>>>> Uh, now it worked, with no other steps in between?  That's a little odd.
>>>>
>>> Looks odd to me too. But i just copied the commands issued as they where
>>> on my console... so yes, nothing in between.
>>>
>>>> It found a clean log this time:
>>>>
>>>> > XFS mounting filesystem dm-1
>>>> > Ending clean XFS mount for filesystem: dm-1
>>>>
>>>> which is unexpected.
>>>>
>>>> So the memory consumption might be a bug but there's not enough info to
>>>> go on here.
>>>>
>>>> > PS. Let me know if you wish reports like this one on list.
>>>>
>>>> worth reporting, but I'm not sure what we can do with it.
>>>> Your storage is in pretty bad shape, and xfs_repair can't make
>>>> something out
>>>> of nothing.
>>>>
>>>> -Eric
>>>>
>>>
>>> I still got back around 6TB out of 7.2 TB of total data stored, so this
>>> tells xfs is reliable even when major faults occur...
>>>
>>> Thanks anyways, I am trying with a "-L" repair, at this step I expect
>>> another fail (due out of memory or something, as it happened last time)
>>> then I will try with "xfs_repair -L -vv -P" and I expect to see that
>>> segfault again.
>>>
>>> Will report next steps, maybe something interesting for you will pop
>>> up... for me is not a waste of time, since this last try is worth being
>>> made.
>>>
>>> --
>>> F
>>
>>
>>
>>
>> --
>> F
>
>
>
>
> --
> F




-- 
F

[-- Attachment #1.2: Type: text/html, Size: 6614 bytes --]

[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xfs_repair segfaults.
  2013-05-07 18:20           ` Filippo Stenico
@ 2013-05-08 17:30             ` Filippo Stenico
  2013-05-08 17:42               ` Filippo Stenico
  2013-05-08 23:39               ` Dave Chinner
  0 siblings, 2 replies; 39+ messages in thread
From: Filippo Stenico @ 2013-05-08 17:30 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs


[-- Attachment #1.1: Type: text/plain, Size: 3983 bytes --]

Hello,
The -m option does not seem to rein in the excessive memory consumption I ran into.
I actually ran xfs_repair -vv -m1750 and, looking into kern.log, it seems
that xfs_repair invoked the OOM killer but was killed itself (!!).

This is the last try to reproduce the segfault:
xfs_repair -vv -P -m1750
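
(A hedged idea for keeping a runaway repair from taking the whole box
down with it, on top of -m: cap its address space from the invoking
shell. The limit below is an arbitrary example; ulimit -v takes KiB:

ulimit -v $(( 3 * 1024 * 1024 ))
xfs_repair -vv -P -m 1750 /dev/mapper/vg0-lv0

With the cap in place a failed allocation should fail inside repair
itself instead of waking the OOM killer.)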

On Tue, May 7, 2013 at 8:20 PM, Filippo Stenico
<filippo.stenico@gmail.com>wrote:

> xfs_repair -L -vv -P /dev/mapper/vg0-lv0 does the same kernel panic as my
> first report. No use to double info on this.
> I'll try xfs_repair -L -vv -P -m 2000 to keep memory consumption at a limit.
>
>
>
> On Tue, May 7, 2013 at 3:36 PM, Filippo Stenico <filippo.stenico@gmail.com
> > wrote:
>
>>
>>
>> On Tue, May 7, 2013 at 3:20 PM, Eric Sandeen <sandeen@sandeen.net> wrote:
>>
>>> On 5/7/13 4:27 AM, Filippo Stenico wrote:
>>> > Hello,
>>> > this is a start-over to try hard to recover some more data out of my
>>> raid5 - lvm - xfs toasted volume.
>>> > My goal is either to try the best to get some more data out of the
>>> volume, and see if I can reproduce the segfault.
>>> > I compiled xfsprogs 3.1.9 from deb-source. I ran a xfs_metarestore to
>>> put original metadata on the cloned raid volume i had zeroed the log on
>>> before via xfs_repair -L (i figured none of the actual data was modified
>>> before as I am just working on metadata.. right?).
>>> > Then I ran a mount, checked a dir that I knew it was corrupted,
>>> unmount and try an xfs_repair (commands.txt attached for details)
>>> > I went home to sleep, but at morning I found out that kernel paniced
>>> due "out of memory and no killable process".
>>> > I ran repair without -P... Should I try now disabling inode prefetch?
>>> > Attached are also output of "free" and "top" at time of panic, as well
>>> as the output of xfs_repair and strace attached to it. Dont think gdb
>>> symbols would help here....
>>> >
>>>
>>> >
>>>
>>> Ho hum, well, no segfault this time, just an out of memory error?
>>>
>> That's right....
>>
>>> No real way to know where it went from the available data I think.
>>>
>>> A few things:
>>>
>>> > root@ws1000:~# mount /dev/mapper/vg0-lv0 /raid0/data/
>>> > mount: Structure needs cleaning
>>>
>>> mount failed?  Now's the time to look at dmesg to see why.
>>>
>>
>> From attached logs it seems to be:
>>>
>>> > XFS internal error xlog_valid_rec_header(1) at line 3466 of file
>>> [...2.6.32...]/fs/xfs/xfs_log_recover.c
>>> > XFS: log mount/recovery failed: error 117
>>>
>>> > root@ws1000:~# mount
>>>
>>> <no raid0 mounted>
>>>
>>> > root@ws1000:~# mount /dev/mapper/vg0-lv0 /raid0/data/
>>> > root@ws1000:~# mount | grep raid0
>>> > /dev/mapper/vg0-lv0 on /raid0/data type xfs (rw,relatime,attr2,noquota)
>>>
>>> Uh, now it worked, with no other steps in between?  That's a little odd.
>>>
>> Looks odd to me too. But i just copied the commands issued as they where
>> on my console... so yes, nothing in between.
>>
>>> It found a clean log this time:
>>>
>>> > XFS mounting filesystem dm-1
>>> > Ending clean XFS mount for filesystem: dm-1
>>>
>>> which is unexpected.
>>>
>>> So the memory consumption might be a bug but there's not enough info to
>>> go on here.
>>>
>>> > PS. Let me know if you wish reports like this one on list.
>>>
>>> worth reporting, but I'm not sure what we can do with it.
>>> Your storage is in pretty bad shape, and xfs_repair can't make something
>>> out
>>> of nothing.
>>>
>>> -Eric
>>>
>>
>> I still got back around 6TB out of 7.2 TB of total data stored, so this
>> tells xfs is reliable even when major faults occur...
>>
>> Thanks anyways, I am trying with a "-L" repair, at this step I expect
>> another fail (due out of memory or something, as it happened last time)
>> then I will try with "xfs_repair -L -vv -P" and I expect to see that
>> segfault again.
>>
>> Will report next steps, maybe something interesting for you will pop
>> up... for me is not a waste of time, since this last try is worth being
>> made.
>>
>> --
>> F
>
>
>
>
> --
> F




-- 
F

[-- Attachment #1.2: Type: text/html, Size: 5575 bytes --]

[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xfs_repair segfaults.
  2013-05-07 13:36         ` Filippo Stenico
@ 2013-05-07 18:20           ` Filippo Stenico
  2013-05-08 17:30             ` Filippo Stenico
  0 siblings, 1 reply; 39+ messages in thread
From: Filippo Stenico @ 2013-05-07 18:20 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs


[-- Attachment #1.1: Type: text/plain, Size: 3483 bytes --]

xfs_repair -L -vv -P /dev/mapper/vg0-lv0 causes the same kernel panic as in my
first report. No point duplicating the info on this.
I'll try xfs_repair -L -vv -P -m 2000 to keep memory consumption within a limit.


On Tue, May 7, 2013 at 3:36 PM, Filippo Stenico
<filippo.stenico@gmail.com>wrote:

>
>
> On Tue, May 7, 2013 at 3:20 PM, Eric Sandeen <sandeen@sandeen.net> wrote:
>
>> On 5/7/13 4:27 AM, Filippo Stenico wrote:
>> > Hello,
>> > this is a start-over to try hard to recover some more data out of my
>> raid5 - lvm - xfs toasted volume.
>> > My goal is either to try the best to get some more data out of the
>> volume, and see if I can reproduce the segfault.
>> > I compiled xfsprogs 3.1.9 from deb-source. I ran a xfs_metarestore to
>> put original metadata on the cloned raid volume i had zeroed the log on
>> before via xfs_repair -L (i figured none of the actual data was modified
>> before as I am just working on metadata.. right?).
>> > Then I ran a mount, checked a dir that I knew it was corrupted, unmount
>> and try an xfs_repair (commands.txt attached for details)
>> > I went home to sleep, but at morning I found out that kernel paniced
>> due "out of memory and no killable process".
>> > I ran repair without -P... Should I try now disabling inode prefetch?
>> > Attached are also output of "free" and "top" at time of panic, as well
>> as the output of xfs_repair and strace attached to it. Dont think gdb
>> symbols would help here....
>> >
>>
>> >
>>
>> Ho hum, well, no segfault this time, just an out of memory error?
>>
> That's right....
>
>> No real way to know where it went from the available data I think.
>>
>> A few things:
>>
>> > root@ws1000:~# mount /dev/mapper/vg0-lv0 /raid0/data/
>> > mount: Structure needs cleaning
>>
>> mount failed?  Now's the time to look at dmesg to see why.
>>
>
> From attached logs it seems to be:
>>
>> > XFS internal error xlog_valid_rec_header(1) at line 3466 of file
>> [...2.6.32...]/fs/xfs/xfs_log_recover.c
>> > XFS: log mount/recovery failed: error 117
>>
>> > root@ws1000:~# mount
>>
>> <no raid0 mounted>
>>
>> > root@ws1000:~# mount /dev/mapper/vg0-lv0 /raid0/data/
>> > root@ws1000:~# mount | grep raid0
>> > /dev/mapper/vg0-lv0 on /raid0/data type xfs (rw,relatime,attr2,noquota)
>>
>> Uh, now it worked, with no other steps in between?  That's a little odd.
>>
> Looks odd to me too. But i just copied the commands issued as they where
> on my console... so yes, nothing in between.
>
>> It found a clean log this time:
>>
>> > XFS mounting filesystem dm-1
>> > Ending clean XFS mount for filesystem: dm-1
>>
>> which is unexpected.
>>
>> So the memory consumption might be a bug but there's not enough info to
>> go on here.
>>
>> > PS. Let me know if you wish reports like this one on list.
>>
>> worth reporting, but I'm not sure what we can do with it.
>> Your storage is in pretty bad shape, and xfs_repair can't make something
>> out
>> of nothing.
>>
>> -Eric
>>
>
> I still got back around 6TB out of 7.2 TB of total data stored, so this
> tells xfs is reliable even when major faults occur...
>
> Thanks anyways, I am trying with a "-L" repair, at this step I expect
> another fail (due out of memory or something, as it happened last time)
> then I will try with "xfs_repair -L -vv -P" and I expect to see that
> segfault again.
>
> Will report next steps, maybe something interesting for you will pop up...
> for me is not a waste of time, since this last try is worth being made.
>
> --
> F




-- 
F

[-- Attachment #1.2: Type: text/html, Size: 4827 bytes --]

[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xfs_repair segfaults.
  2013-05-07 13:20       ` Eric Sandeen
@ 2013-05-07 13:36         ` Filippo Stenico
  2013-05-07 18:20           ` Filippo Stenico
  0 siblings, 1 reply; 39+ messages in thread
From: Filippo Stenico @ 2013-05-07 13:36 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs


[-- Attachment #1.1: Type: text/plain, Size: 3086 bytes --]

On Tue, May 7, 2013 at 3:20 PM, Eric Sandeen <sandeen@sandeen.net> wrote:

> On 5/7/13 4:27 AM, Filippo Stenico wrote:
> > Hello,
> > this is a start-over to try hard to recover some more data out of my
> raid5 - lvm - xfs toasted volume.
> > My goal is either to try the best to get some more data out of the
> volume, and see if I can reproduce the segfault.
> > I compiled xfsprogs 3.1.9 from deb-source. I ran a xfs_metarestore to
> put original metadata on the cloned raid volume i had zeroed the log on
> before via xfs_repair -L (i figured none of the actual data was modified
> before as I am just working on metadata.. right?).
> > Then I ran a mount, checked a dir that I knew it was corrupted, unmount
> and try an xfs_repair (commands.txt attached for details)
> > I went home to sleep, but at morning I found out that kernel paniced due
> "out of memory and no killable process".
> > I ran repair without -P... Should I try now disabling inode prefetch?
> > Attached are also output of "free" and "top" at time of panic, as well
> as the output of xfs_repair and strace attached to it. Dont think gdb
> symbols would help here....
> >
>
> >
>
> Ho hum, well, no segfault this time, just an out of memory error?
>
That's right....

> No real way to know where it went from the available data I think.
>
> A few things:
>
> > root@ws1000:~# mount /dev/mapper/vg0-lv0 /raid0/data/
> > mount: Structure needs cleaning
>
> mount failed?  Now's the time to look at dmesg to see why.
>

From attached logs it seems to be:
>
> > XFS internal error xlog_valid_rec_header(1) at line 3466 of file
> [...2.6.32...]/fs/xfs/xfs_log_recover.c
> > XFS: log mount/recovery failed: error 117
>
> > root@ws1000:~# mount
>
> <no raid0 mounted>
>
> > root@ws1000:~# mount /dev/mapper/vg0-lv0 /raid0/data/
> > root@ws1000:~# mount | grep raid0
> > /dev/mapper/vg0-lv0 on /raid0/data type xfs (rw,relatime,attr2,noquota)
>
> Uh, now it worked, with no other steps in between?  That's a little odd.
>
Looks odd to me too. But I just copied the commands exactly as they were issued on
my console... so yes, nothing in between.

> It found a clean log this time:
>
> > XFS mounting filesystem dm-1
> > Ending clean XFS mount for filesystem: dm-1
>
> which is unexpected.
>
> So the memory consumption might be a bug but there's not enough info to go
> on here.
>
> > PS. Let me know if you wish reports like this one on list.
>
> worth reporting, but I'm not sure what we can do with it.
> Your storage is in pretty bad shape, and xfs_repair can't make something
> out
> of nothing.
>
> -Eric
>

I still got back around 6 TB out of the 7.2 TB of total data stored, so this
shows that xfs is reliable even when major faults occur...

Thanks anyway. I am trying a "-L" repair; at this step I expect
another failure (due to running out of memory or similar, as happened last time),
then I will try "xfs_repair -L -vv -P" and I expect to see that
segfault again.

I will report the next steps; maybe something interesting for you will pop up...
For me it is not a waste of time, since this last attempt is worth making.

-- 
F

[-- Attachment #1.2: Type: text/html, Size: 4163 bytes --]

[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xfs_repair segfaults.
       [not found]     ` <CADNx=Kv0bt3fNGW8Y24GziW9MOO-+b7fBGub4AYP70b5gAegxw@mail.gmail.com>
@ 2013-05-07 13:20       ` Eric Sandeen
  2013-05-07 13:36         ` Filippo Stenico
  0 siblings, 1 reply; 39+ messages in thread
From: Eric Sandeen @ 2013-05-07 13:20 UTC (permalink / raw)
  To: Filippo Stenico; +Cc: xfs

On 5/7/13 4:27 AM, Filippo Stenico wrote:
> Hello,
> this is a start-over to try hard to recover some more data out of my raid5 - lvm - xfs toasted volume.
> My goal is either to try the best to get some more data out of the volume, and see if I can reproduce the segfault.
> I compiled xfsprogs 3.1.9 from deb-source. I ran xfs_mdrestore to put the original metadata on the cloned raid volume I had zeroed the log on before via xfs_repair -L (I figured none of the actual data was modified before, as I am just working on metadata.. right?).
> Then I ran a mount, checked a dir that I knew was corrupted, unmounted and tried an xfs_repair (commands.txt attached for details).
> I went home to sleep, but in the morning I found out that the kernel had panicked due to "out of memory and no killable process".
> I ran repair without -P... Should I now try disabling inode prefetch?
> Attached are also the output of "free" and "top" at the time of the panic, as well as the output of xfs_repair and of strace attached to it. I don't think gdb symbols would help here....
> 

> 

Ho hum, well, no segfault this time, just an out of memory error?

No real way to know where it went from the available data I think.

A few things:

> root@ws1000:~# mount /dev/mapper/vg0-lv0 /raid0/data/
> mount: Structure needs cleaning

mount failed?  Now's the time to look at dmesg to see why.

From attached logs it seems to be:

> XFS internal error xlog_valid_rec_header(1) at line 3466 of file [...2.6.32...]/fs/xfs/xfs_log_recover.c
> XFS: log mount/recovery failed: error 117

> root@ws1000:~# mount

<no raid0 mounted>

> root@ws1000:~# mount /dev/mapper/vg0-lv0 /raid0/data/
> root@ws1000:~# mount | grep raid0
> /dev/mapper/vg0-lv0 on /raid0/data type xfs (rw,relatime,attr2,noquota)

Uh, now it worked, with no other steps in between?  That's a little odd.
It found a clean log this time:

> XFS mounting filesystem dm-1
> Ending clean XFS mount for filesystem: dm-1

which is unexpected.

So the memory consumption might be a bug but there's not enough info to go on here.

> PS. Let me know if you wish reports like this one on list.

worth reporting, but I'm not sure what we can do with it.
Your storage is in pretty bad shape, and xfs_repair can't make something out
of nothing.

-Eric

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xfs_repair segfaults.
  2013-05-06 14:34 ` Eric Sandeen
@ 2013-05-06 15:00   ` Filippo Stenico
       [not found]     ` <CADNx=Kv0bt3fNGW8Y24GziW9MOO-+b7fBGub4AYP70b5gAegxw@mail.gmail.com>
  0 siblings, 1 reply; 39+ messages in thread
From: Filippo Stenico @ 2013-05-06 15:00 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs


[-- Attachment #1.1: Type: text/plain, Size: 1984 bytes --]

On Mon, May 6, 2013 at 4:34 PM, Eric Sandeen <sandeen@sandeen.net> wrote:

> On 5/6/13 7:06 AM, Filippo Stenico wrote:
> > Hello, I've had an issue on a raid5 volume (2 disks failing at same
> > time, and buggy NAS firmware trying hard to sync then stop for I/O
> > error then retryes to sync leading (!?) to lost raid and an xfs data
> > corruption.
>
> Sooo bad storage hardware, first off \o/  :(
>
>
>
> I've recreated raid, mounted the xfs filesystem, copied about 6TB of
> > data out of 8TB, the rest of data being unreachable with "structrure
> > needs cleaning" error. Thus I tried to unmount/mount and still got
> > same error, then dumped metadata and tried various xfs_repair with
> > different options but it always reaches a same point where it always
> > segfaults.
>
> OK, what version of xfsprogs?
> If not latest usptream, please try that next.
>
> It was the one included in Debian squeeze, I believe v3.1.4.
I figured the segfault could be fixed in newer releases, so I added testing and
unstable sources and tried with those (3.1.7 and 3.1.9), but got the same segfault.


> If upstream still segfaults, you could provide a core and associated
> binary debug info, or (maybe) an xfs_metadump for analysis.  It sounds
> like this fs is in pretty bad shape though, so even the xfs_metadump
> might fail and/or not gather enough information.
>
> As suggested on IRC chat, I built from the xfsprogs-3.1.9 source. I will
restore the "original" metadata and repeat the same steps I took the first time, so
that you can get a detailed report.
Of course I will include as much info as I can (metadump, gdb backtrace on an
eventual segfault).
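
(For reference, a hedged sketch of getting a debug-friendly build out of
the Debian sources; whether this particular packaging honours the
noopt/nostrip options is not certain, so treat it as an illustration
rather than a recipe:

apt-get build-dep xfsprogs
apt-get source xfsprogs
cd xfsprogs-3.1.9
DEB_BUILD_OPTIONS="nostrip noopt" dpkg-buildpackage -us -uc -b

The unstripped repair/xfs_repair binary can then be run directly under
gdb.)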

Thanks for now,

-Eric
>
> > If someone is interested of investigating what's going on here and
> > helping me recover more data I would be happy to send in in more
> > details....
> >
> > Cheers
> >
> > -- F
> >
> >
> > _______________________________________________ xfs mailing list
> > xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs
> >
>
>

-- 
F

[-- Attachment #1.2: Type: text/html, Size: 3039 bytes --]

[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Xfs_repair segfaults.
  2013-05-06 12:06 Xfs_repair segfaults Filippo Stenico
@ 2013-05-06 14:34 ` Eric Sandeen
  2013-05-06 15:00   ` Filippo Stenico
  0 siblings, 1 reply; 39+ messages in thread
From: Eric Sandeen @ 2013-05-06 14:34 UTC (permalink / raw)
  To: Filippo Stenico; +Cc: xfs

On 5/6/13 7:06 AM, Filippo Stenico wrote:
> Hello, I've had an issue on a raid5 volume (2 disks failing at same
> time, and buggy NAS firmware trying hard to sync then stop for I/O
> error then retries to sync leading (!?) to lost raid and an xfs data
> corruption.

Sooo bad storage hardware, first off \o/  :(

> I've recreated raid, mounted the xfs filesystem, copied about 6TB of
> data out of 8TB, the rest of data being unreachable with "structure
> needs cleaning" error. Thus I tried to unmount/mount and still got
> same error, then dumped metadata and tried various xfs_repair with
> different options but it always reaches a same point where it always
> segfaults.

OK, what version of xfsprogs?
If not the latest upstream, please try that next.

If upstream still segfaults, you could provide a core and associated
binary debug info, or (maybe) an xfs_metadump for analysis.  It sounds
like this fs is in pretty bad shape though, so even the xfs_metadump
might fail and/or not gather enough information.
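
(A minimal sketch of the metadump route, assuming the filesystem sits on
/dev/mapper/vg0-lv0 as elsewhere in this thread; file names are only
examples:

xfs_metadump -g /dev/mapper/vg0-lv0 lv0.metadump
bzip2 -9 lv0.metadump
# on the analysis side, rebuild a sparse image and run repair read-only:
bunzip2 lv0.metadump.bz2
xfs_mdrestore lv0.metadump lv0.img
xfs_repair -n -f lv0.img

xfs_metadump copies metadata only and obfuscates file names by default,
and none of this touches the original device.)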

-Eric

> If someone is interested in investigating what's going on here and
> helping me recover more data I would be happy to send in more
> details....
> 
> Cheers
> 
> -- F
> 
> 
> _______________________________________________ xfs mailing list 
> xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Xfs_repair segfaults.
@ 2013-05-06 12:06 Filippo Stenico
  2013-05-06 14:34 ` Eric Sandeen
  0 siblings, 1 reply; 39+ messages in thread
From: Filippo Stenico @ 2013-05-06 12:06 UTC (permalink / raw)
  To: xfs


[-- Attachment #1.1: Type: text/plain, Size: 729 bytes --]

Hello,
I've had an issue on a raid5 volume: 2 disks failed at the same time, and the
buggy NAS firmware tried hard to sync, then stopped on an I/O error, then retried
the sync, leading (!?) to a lost raid and xfs data corruption.

I've recreated the raid, mounted the xfs filesystem, and copied about 6TB of data
out of 8TB, the rest of the data being unreachable with a "structure needs
cleaning" error.
Thus I tried to unmount/mount and still got the same error, then dumped the
metadata and tried various xfs_repair runs with different options, but it always
reaches the same point and segfaults there.

If someone is interested in investigating what's going on here and helping
me recover more data, I would be happy to send in more details....

Cheers

-- 
F

[-- Attachment #1.2: Type: text/html, Size: 807 bytes --]

[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2013-05-10 11:01 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-02-28 15:22 xfs_repair segfaults Ole Tange
2013-02-28 18:48 ` Eric Sandeen
2013-03-01  9:37   ` Ole Tange
2013-03-01 16:46     ` Eric Sandeen
2013-03-04  9:00       ` Ole Tange
2013-03-04 15:20         ` Eric Sandeen
2013-03-08 10:21           ` Ole Tange
2013-03-08 20:32             ` Eric Sandeen
2013-03-12 10:41               ` Ole Tange
2013-03-12 14:40                 ` Eric Sandeen
2013-03-12 11:37             ` Ole Tange
2013-03-12 14:47               ` Eric Sandeen
2013-03-01 11:17 ` Dave Chinner
2013-03-01 12:24   ` Ole Tange
2013-03-01 20:53     ` Dave Chinner
2013-03-04  9:03       ` Ole Tange
2013-03-04 23:23         ` Dave Chinner
2013-03-08 10:09           ` Ole Tange
2013-03-01 22:14 ` Eric Sandeen
2013-03-01 22:31   ` Dave Chinner
2013-03-01 22:32     ` Eric Sandeen
2013-03-01 23:55       ` Eric Sandeen
2013-03-04 12:47       ` Ole Tange
2013-03-04 15:17         ` Eric Sandeen
2013-03-04 23:11           ` Dave Chinner
2013-05-06 12:06 Xfs_repair segfaults Filippo Stenico
2013-05-06 14:34 ` Eric Sandeen
2013-05-06 15:00   ` Filippo Stenico
     [not found]     ` <CADNx=Kv0bt3fNGW8Y24GziW9MOO-+b7fBGub4AYP70b5gAegxw@mail.gmail.com>
2013-05-07 13:20       ` Eric Sandeen
2013-05-07 13:36         ` Filippo Stenico
2013-05-07 18:20           ` Filippo Stenico
2013-05-08 17:30             ` Filippo Stenico
2013-05-08 17:42               ` Filippo Stenico
2013-05-08 23:39               ` Dave Chinner
2013-05-09 15:11                 ` Filippo Stenico
2013-05-09 17:22                   ` Filippo Stenico
2013-05-09 22:39                     ` Dave Chinner
     [not found]                       ` <CADNx=KuQjMNHUk6t0+hBZ5DN6s=RXqrPEjeoSxpBta47CJoDgQ@mail.gmail.com>
2013-05-10 11:00                         ` Filippo Stenico
2013-05-09 22:37                   ` Dave Chinner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.