* xfs_repair stops on "traversing filesystem..."
@ 2009-07-09 14:13 Tomek Kruszona
  2009-07-09 14:54 ` Eric Sandeen
  2009-07-10  5:28 ` Eric Sandeen
  0 siblings, 2 replies; 11+ messages in thread
From: Tomek Kruszona @ 2009-07-09 14:13 UTC (permalink / raw)
  To: xfs

Hello!

I have a problem with an XFS filesystem on one of my machines. I'm
trying to run xfs_repair, which never gave me any trouble before, but it
stops at:

Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...

CPU usage climbs to 100%. I left it running overnight, hoping it would
finish the job by morning, but the situation hasn't changed...

The system is Debian Lenny with current updates, a custom 2.6.30.1
kernel, and xfsprogs 2.9.8. The filesystem sits on an LVM2 logical volume.

I upgraded xfsprogs to version 3.0.2 and the problem persisted, so I
reverted to the 2.9.8 package from Debian Lenny. Switching back to
Debian's default 2.6.26 kernel doesn't help either.

I can mount this filesystem and operate on it.

The data on this system is not that crucial, because it's a
backup/testing machine, but it would be great to keep it, since
re-synchronizing 14TB of data will take some time.

Output from xfs_info:
# xfs_info /mnt/storage/
meta-data=/dev/mapper/p02bvg-p02blv isize=256    agcount=32, agsize=268435455 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=8410889216, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0

Any ideas on how to get xfs_repair working again?

Best regards,
Tomasz Kruszona

* Re: xfs_repair stops on "traversing filesystem..."
  2009-07-09 14:13 xfs_repair stops on "traversing filesystem..." Tomek Kruszona
@ 2009-07-09 14:54 ` Eric Sandeen
  2009-07-09 15:03   ` Tomek Kruszona
  2009-07-10  5:28 ` Eric Sandeen
  1 sibling, 1 reply; 11+ messages in thread
From: Eric Sandeen @ 2009-07-09 14:54 UTC (permalink / raw)
  To: Tomek Kruszona; +Cc: xfs

Tomek Kruszona wrote:
> Hello!
> 
> I have a problem with an XFS filesystem on one of my machines. I'm
> trying to run xfs_repair, which never gave me any trouble before, but it
> stops at:
> 
> Phase 6 - check inode connectivity...
>         - resetting contents of realtime bitmap and summary inodes
>         - traversing filesystem ...
> 
> CPU usage climbs to 100%. I left it running overnight, hoping it would
> finish the job by morning, but the situation hasn't changed...

...

> Any ideas on how to get xfs_repair working again?

You might try running with -P, though I doubt that's the issue.

If that doesn't help, you could provide an xfs_metadump image (in public
or to me in private) and I'll take a look to see what's going on.
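
If you want to go that route, something like this should produce one
(run with the fs unmounted; the image holds metadata only, no file data,
and the output name here is just an example):

# xfs_metadump /dev/mapper/p02bvg-p02blv storage.metadump
# bzip2 storage.metadump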

-Eric


* Re: xfs_repair stops on "traversing filesystem..."
  2009-07-09 14:54 ` Eric Sandeen
@ 2009-07-09 15:03   ` Tomek Kruszona
  0 siblings, 0 replies; 11+ messages in thread
From: Tomek Kruszona @ 2009-07-09 15:03 UTC (permalink / raw)
  To: xfs

Eric Sandeen wrote:
> You might try running with -P, though I doubt that's the issue.
> 
> If that doesn't help, you could provide an xfs_metadump image (in public
> or to me in private) and I'll take a look to see what's going on.

Thank you! I'll just let xfs_repair -P do the job and let you know if
it helps. Otherwise, I'll send you a metadump image in private.

Best regards,
Tomasz Kruszona


* Re: xfs_repair stops on "traversing filesystem..."
  2009-07-09 14:13 xfs_repair stops on "traversing filesystem..." Tomek Kruszona
  2009-07-09 14:54 ` Eric Sandeen
@ 2009-07-10  5:28 ` Eric Sandeen
  2009-07-10  7:27   ` Tomek Kruszona
  1 sibling, 1 reply; 11+ messages in thread
From: Eric Sandeen @ 2009-07-10  5:28 UTC (permalink / raw)
  To: Tomek Kruszona; +Cc: xfs

Tomek Kruszona wrote:
> Hello!
> 
> I have a problem with an XFS filesystem on one of my machines. I'm
> trying to run xfs_repair, which never gave me any trouble before, but it
> stops at:
> 
> Phase 6 - check inode connectivity...
>         - resetting contents of realtime bitmap and summary inodes
>         - traversing filesystem ...
> 
> CPU usage climbs to 100%. I left it running overnight, hoping it would
> finish the job by morning, but the situation hasn't changed...
> 
> The system is Debian Lenny with current updates, a custom 2.6.30.1
> kernel, and xfsprogs 2.9.8. The filesystem sits on an LVM2 logical volume.
> 
> I upgraded xfsprogs to version 3.0.2 and the problem persisted, so I
> reverted to the 2.9.8 package from Debian Lenny. Switching back to
> Debian's default 2.6.26 kernel doesn't help either.
> 
> I can mount this filesystem and operate on it.
> 
> The data on this system is not that crucial, because it's a
> backup/testing machine, but it would be great to keep it, since
> re-synchronizing 14TB of data will take some time.
> 
> Output from xfs_info:
> # xfs_info /mnt/storage/
> meta-data=/dev/mapper/p02bvg-p02blv isize=256    agcount=32, agsize=268435455 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=8410889216, imaxpct=5
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096
> log      =internal               bsize=4096   blocks=32768, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=0
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> Any ideas on how to get xfs_repair working again?

No fix for you yet, but it's in cache_node_get(), in the for(;;) loop,
and it looks like cache_node_allocate() fails to get a new node and we
keep spinning.  I need to look some more at what's going on....
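
(If anyone wants to catch this themselves: attaching a debugger to the
spinning process shows where it's stuck - something along these lines,
assuming debug symbols are installed:

# gdb -p $(pidof xfs_repair)
(gdb) bt

with the top of the backtrace sitting in cache_node_get().)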

-Eric


* Re: xfs_repair stops on "traversing filesystem..."
  2009-07-10  5:28 ` Eric Sandeen
@ 2009-07-10  7:27   ` Tomek Kruszona
  2009-07-10 14:35     ` Eric Sandeen
  2009-07-10 20:17     ` Eric Sandeen
  0 siblings, 2 replies; 11+ messages in thread
From: Tomek Kruszona @ 2009-07-10  7:27 UTC (permalink / raw)
  To: xfs

Eric Sandeen wrote:
> No fix for you yet, but it's in cache_node_get(), in the for(;;) loop,
> and it looks like cache_node_allocate() fails to get a new node and we
> keep spinning.  I need to look some more at what's going on....

Hello!

Is this behavior specific to this particular broken filesystem, or is
it a bug in the functions you mentioned? I'm just curious :)

Best regards,
Tomasz Kruszona


* Re: xfs_repair stops on "traversing filesystem..."
  2009-07-10  7:27   ` Tomek Kruszona
@ 2009-07-10 14:35     ` Eric Sandeen
  2009-07-10 20:17     ` Eric Sandeen
  1 sibling, 0 replies; 11+ messages in thread
From: Eric Sandeen @ 2009-07-10 14:35 UTC (permalink / raw)
  To: Tomek Kruszona; +Cc: xfs

Tomek Kruszona wrote:
> Eric Sandeen wrote:
>> No fix for you yet, but it's in cache_node_get(), in the for(;;) loop,
>> and it looks like cache_node_allocate() fails to get a new node and we
>> keep spinning.  I need to look some more at what's going on....
> 
> Hello!
> 
> Is this behavior specific to this particular broken filesystem, or is
> it a bug in the functions you mentioned? I'm just curious :)


I don't know yet :)

-Eric


* Re: xfs_repair stops on "traversing filesystem..."
  2009-07-10  7:27   ` Tomek Kruszona
  2009-07-10 14:35     ` Eric Sandeen
@ 2009-07-10 20:17     ` Eric Sandeen
  2009-07-10 21:02       ` Tomek Kruszona
  1 sibling, 1 reply; 11+ messages in thread
From: Eric Sandeen @ 2009-07-10 20:17 UTC (permalink / raw)
  To: Tomek Kruszona; +Cc: xfs

Tomek Kruszona wrote:
> Eric Sandeen wrote:
>> No fix for you yet, but it's in cache_node_get(), in the for(;;) loop,
>> and it looks like cache_node_allocate() fails to get a new node and we
>> keep spinning.  I need to look some more at what's going on....
> 
> Hello!
> 
> Is this behavior specific to this particular broken filesystem, or is
> it a bug in the functions you mentioned? I'm just curious :)

This looks like some of the caching that xfs_repair does is mis-sized,
and it gets stuck when it's unable to find a slot for a new node to
cache.  IMHO that's still a bug that I'd like to work out.  If it gets
stuck this way, it'd probably be better to exit, and suggest a larger
hash size.

But anyway, I forced a bigger hash size:

xfs_repair -P -o bhash=1024 <blah>

and it did complete.  1024 is probably over the top, but it worked for
me on a 4G machine w/ some swap.
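
(If it's the inode cache that's tight instead, that can be bumped the
same way - both knobs exist - e.g. something like:

xfs_repair -P -o bhash=1024 -o ihash=1024 <blah>

though bhash alone was enough here.)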

I'd strongly suggest doing a non-obfuscated xfs_metadump, do
xfs_mdrestore of that to some temp.img, run xfs_repair <blah> on that
temp.img, mount it, and see what you're left with; that way you'll know
what you're getting into w/ repair.
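
Roughly, with the names here just examples (the restored image is sparse
and holds metadata only, so file contents won't actually be there):

# xfs_metadump -o /dev/mapper/p02bvg-p02blv fs.metadump
# xfs_mdrestore fs.metadump temp.img
# xfs_repair -P -o bhash=1024 temp.img
# mount -o loop temp.img /mnt/test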

I ended up w/ about 5000 files in lost+found just FWIW...

Out of curiosity, do you know how the fs was damaged?

-Eric


* Re: xfs_repair stops on "traversing filesystem..."
  2009-07-10 20:17     ` Eric Sandeen
@ 2009-07-10 21:02       ` Tomek Kruszona
  2009-07-10 21:15         ` Eric Sandeen
  0 siblings, 1 reply; 11+ messages in thread
From: Tomek Kruszona @ 2009-07-10 21:02 UTC (permalink / raw)
  To: xfs

Eric Sandeen wrote:
> This looks like some of the caching that xfs_repair does is mis-sized,
> and it gets stuck when it's unable to find a slot for a new node to
> cache.  IMHO that's still a bug that I'd like to work out.  If it gets
> stuck this way, it'd probably be better to exit, and suggest a larger
> hash size.
> 
> But anyway, I forced a bigger hash size:
> 
> xfs_repair -P -o bhash=1024 <blah>
> 
> and it did complete.  1024 is probably over the top, but it worked for
> me on a 4G machine w/ some swap.
:D

Is it safe to use xfs_repair without these options now that the FS has
been repaired? Or should I use them every time I have a similar problem?

> I'd strongly suggest doing a non-obfuscated xfs_metadump, do
> xfs_mdrestore of that to some temp.img, run xfs_repair <blah> on that
> temp.img, mount it, and see what you're left with; that way you'll know
> what you're getting into w/ repair.
> I ended up w/ about 5000 files in lost+found just FWIW...
It doesn't matter. This filesystem holds a lot of small files - image
sequences used for video compositing. It's a backup machine, so anything
that's gone from the filesystem will be copied back from the original
machine. No stress :)

I'm running xfs_repair on the image now - it's in Phase 4, and so far
the list of files looks very similar to the one I saw during xfs_repair
without the options you suggested.

> Out of curiosity, do you know how the fs was damaged?
I'm not sure; I see a few possibilities. I played with the write cache
options on the RAID controller while the FS was mounted and running -
maybe something went wrong then... A second possible reason is a power
loss we had recently, which took this machine down :/
The last one is that I have had some problems with XFS filesystems on
LVM2: in kernels <2.6.30, barriers are automatically disabled when the
underlying device is a dm device. Since I'm using RAID controllers, I
should have had the write cache disabled. After upgrading to 2.6.30, the
message about disabled barriers disappeared and it was safe to enable
the write cache again.
Somewhere in the meantime I wanted to check that everything was OK with
the filesystem, and that's when the problem started - I couldn't finish
xfs_repair. The power loss was, IIRC, after my troubles with xfs_repair,
so the filesystem wasn't totally clean when the power failed. Maybe
that's the reason for this mess ;)
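
(The disabled-barriers message I mean is the one XFS prints at mount
time on the older kernels - something along the lines of:

Filesystem "dm-0": Disabling barriers, not supported by the underlying device

with the device name varying, of course.)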

Best regards
Tomasz Kruszona


* Re: xfs_repair stops on "traversing filesystem..."
  2009-07-10 21:02       ` Tomek Kruszona
@ 2009-07-10 21:15         ` Eric Sandeen
  2009-07-10 23:44           ` Tomek Kruszona
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Sandeen @ 2009-07-10 21:15 UTC (permalink / raw)
  To: Tomek Kruszona; +Cc: xfs

Tomek Kruszona wrote:
> Eric Sandeen wrote:
>> This looks like some of the caching that xfs_repair does is mis-sized,
>> and it gets stuck when it's unable to find a slot for a new node to
>> cache.  IMHO that's still a bug that I'd like to work out.  If it gets
>> stuck this way, it'd probably be better to exit, and suggest a larger
>> hash size.
>>
>> But anyway, I forced a bigger hash size:
>>
>> xfs_repair -P -o bhash=1024 <blah>
>>
>> and it did complete.  1024 is probably over the top, but it worked for
>> me on a 4G machine w/ some swap.
> :D
> 
> Is it safe to use xfs_repair without these options now that the FS has
> been repaired? Or should I use them every time I have a similar problem?


These are all good questions ;)  TBH I'm kind of digging through repair
in earnest for the first time.  I'm not certain why it got into this
state, whether there is some underlying bug, perhaps leaving things
wrongly referenced, or just a plain ol' mis-sizing of the caches.

I have a patch now that ends like this; if all else fails, at least it
won't spin forever, and it gives a hint of what to try.

-Eric

...

Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
unknown magic number 0 for block 8388608 in directory inode 40541
rebuilding directory inode 40541
unknown magic number 0 for block 8388608 in directory inode 48934
rebuilding directory inode 48934
unknown magic number 0 for block 8388608 in directory inode 56139
rebuilding directory inode 56139
unknown magic number 0 for block 8388608 in directory inode 63785
rebuilding directory inode 63785
Unable to free any items in cache for new node; exiting.
Try increasing the bhash and/or ihash size beyond 64
cache: 0x190ed4d0
Max supported entries = 512
Max utilized entries = 512
Active entries = 512
Hash table size = 64
Hits = 130779
Misses = 271155
Hit ratio = 32.54
MRU 0 entries =      0 (  0%)
MRU 1 entries =      0 (  0%)
MRU 2 entries =      0 (  0%)
MRU 3 entries =      0 (  0%)
MRU 4 entries =      0 (  0%)
MRU 5 entries =      0 (  0%)
MRU 6 entries =      0 (  0%)
MRU 7 entries =      0 (  0%)
MRU 8 entries =      0 (  0%)
MRU 9 entries =      0 (  0%)
MRU 10 entries =      0 (  0%)
MRU 11 entries =      0 (  0%)
MRU 12 entries =      0 (  0%)
MRU 13 entries =      0 (  0%)
MRU 14 entries =      0 (  0%)
MRU 15 entries =      0 (  0%)
Hash buckets with   2 entries      2 (  0%)
Hash buckets with   3 entries      2 (  1%)
Hash buckets with   4 entries      4 (  3%)
Hash buckets with   5 entries      1 (  0%)
Hash buckets with   6 entries      8 (  9%)
Hash buckets with   7 entries      9 ( 12%)
Hash buckets with   8 entries      9 ( 14%)
Hash buckets with   9 entries     10 ( 17%)
Hash buckets with  10 entries      9 ( 17%)
Hash buckets with  11 entries      6 ( 12%)
Hash buckets with  12 entries      1 (  2%)
Hash buckets with  13 entries      2 (  5%)
Hash buckets with  14 entries      1 (  2%)


* Re: xfs_repair stops on "traversing filesystem..."
  2009-07-10 21:15         ` Eric Sandeen
@ 2009-07-10 23:44           ` Tomek Kruszona
  2009-07-11  0:36             ` Eric Sandeen
  0 siblings, 1 reply; 11+ messages in thread
From: Tomek Kruszona @ 2009-07-10 23:44 UTC (permalink / raw)
  To: xfs

Eric Sandeen wrote:
> These are all good questions ;)  TBH I'm kind of digging through repair
> in earnest for the first time.  I'm not certain why it got into this
> state, whether there is some underlying bug, perhaps leaving things
> wrongly referenced, or just a plain ol' mis-sizing of the caches.
> 
> I have a patch now that ends like this; if all else fails, at least it
> won't spin forever, and it gives a hint of what to try.
> 
> -Eric
> 
> ...
> 
> Phase 6 - check inode connectivity...
>         - resetting contents of realtime bitmap and summary inodes
>         - traversing filesystem ...
> unknown magic number 0 for block 8388608 in directory inode 40541
> rebuilding directory inode 40541
> unknown magic number 0 for block 8388608 in directory inode 48934
> rebuilding directory inode 48934
> unknown magic number 0 for block 8388608 in directory inode 56139
> rebuilding directory inode 56139
> unknown magic number 0 for block 8388608 in directory inode 63785
> rebuilding directory inode 63785
> Unable to free any items in cache for new node; exiting.
> Try increasing the bhash and/or ihash size beyond 64
> cache: 0x190ed4d0
> Max supported entries = 512
> Max utilized entries = 512
> Active entries = 512
> Hash table size = 64
> Hits = 130779
> Misses = 271155
> Hit ratio = 32.54
[snip]

I ran some tests, and it seems this filesystem needs bhash=1024 for
xfs_repair to finish... With the default options it still hangs on
"traversing filesystem..." Is it possible to restore normal behavior in
any way other than reformatting? Moreover, I spotted something strange:
16GB of data had been moved to lost+found. I tried to clean it up with:
# rm -rf lost+found

but suddenly I got this:

# ls -l /mnt/storage/
ls: cannot access /mnt/storage/lost+found: No such file or directory
total 0
drwxr-xr-x 4 root    root 33 Mar  4 17:21 l_mirror
?????????? ? ?       ?     ?            ? lost+found

I had to run xfs_repair (bhash=1024) once again, and then lost+found
disappeared...

So I started to wonder: does this have any effect on the data stored on
this filesystem? I'm afraid the files on this FS may have become
inconsistent :/

Best regards,
Tomasz Kruszona


* Re: xfs_repair stops on "traversing filesystem..."
  2009-07-10 23:44           ` Tomek Kruszona
@ 2009-07-11  0:36             ` Eric Sandeen
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Sandeen @ 2009-07-11  0:36 UTC (permalink / raw)
  To: Tomek Kruszona; +Cc: xfs

Tomek Kruszona wrote:

...

> I ran some tests, and it seems this filesystem needs bhash=1024 for
> xfs_repair to finish... With the default options it still hangs on
> "traversing filesystem..." Is it possible to restore normal behavior in
> any way other than reformatting?

It has nothing to do w/ the format; it's just internal to xfs_repair
while it's running - the way it caches blocks it has recently used.
> Moreover, I spotted something strange: 16GB of data had been moved to
> lost+found. I tried to

Yeah ...

> clean it up with:
> # rm -rf lost+found
> 
> but suddenly I got this:
> 
> # ls -l /mnt/storage/
> ls: cannot access /mnt/storage/lost+found: No such file or directory
> total 0
> drwxr-xr-x 4 root    root 33 Mar  4 17:21 l_mirror
> ?????????? ? ?       ?     ?            ? lost+found

huh.  let me try on the repaired metadata image...

[root@bear-05 bad-repair]# mount -o loop badfs.img mnt/
[root@bear-05 bad-repair]# du -hc mnt/lost+found/
26M	mnt/lost+found/1602304
5.0M	mnt/lost+found/2558868
18G	mnt/lost+found/
18G	total
[root@bear-05 bad-repair]# rm -rf mnt/lost+found/*
[root@bear-05 bad-repair]# ls -l mnt/
total 0
drwxr-xr-x 4 root root 33 Mar  4 10:21 ??5?t??4
drwxr-xr-x 0 root root  6 Jul 10 19:48 lost+found
drwxr-xr-x 2 1000 root  6 Jun 30 08:45 tests

Did you do something else to the fs in between?

> I had to run xfs_repair (bhash=1024) once again and then l+f disappeared...
> 
> So I started to wonder: does this have any effect on the data stored on
> this filesystem? I'm afraid the files on this FS may have become
> inconsistent :/

Well, I don't know what has gone wrong with your filesystem (or with the
storage beneath it, or whatever) - but it is certainly possible that
with corrupted metadata there is corrupted data as well - it all depends
on the root cause ...

-Eric

> Best regards,
> Tomasz Kruszona

