* parent transid troubles
@ 2011-04-19 19:08 Gregory L Shomo
  2011-04-19 19:34 ` Chris Mason
  0 siblings, 1 reply; 10+ messages in thread
From: Gregory L Shomo @ 2011-04-19 19:08 UTC (permalink / raw)
  To: linux-btrfs

Hello list-

Under heavy load (i/o), one of our fileservers lost two drives
in a raid6 configuration. After the drives were synchronized,
we can no longer mount the multiple-device btrfs filesystem
due to (at least) parent transid verification failures.

btrfsck built from git commit 1b444cd2e6ab8dcafdd47dbaeaae369dd1517c17
runs for a while and then aborts on 'failed to find block number'.
Sample output includes :

  parent transid verify failed on 22569952096256 wanted 176066 found 176064
  parent transid verify failed on 22569952096256 wanted 176066 found 176064
  parent transid verify failed on 20403515183104 wanted 176066 found 174710
  parent transid verify failed on 20403515183104 wanted 176066 found 174710
  parent transid verify failed on 1265784008704 wanted 176066 found 175341
  !-- snip
  bad block 1099696562176
  leaf parent key incorrect 1117248647168
  !-- snip
  Extent back ref already exists for 1130294538240 parent 0 root 2
  Extent back ref already exists for 1130295001088 parent 0 root 2
  !-- snip
  fs uuid d8464857-db87-412e-9d57-ece6c2054f40
  chunk uuid 52a652a3-650d-4dd7-aaa2-6f096a714bbf
          item 0 key (20407857930240 EXTENT_ITEM 4096) itemoff 3944 itemsize 51
                  extent refs 1 gen 165193 flags 2
                  tree block key (257306671 1 0) level 0
                  tree block backref root 5
          item 1 key (20407857934336 EXTENT_ITEM 4096) itemoff 3893 itemsize 51
                  extent refs 1 gen 165308 flags 2
                  tree block key (257585950 1 0) level 0
                  tree block backref root 5
  !-- snip
  failed to find block number 20407858008064
  Aborted

Output is exactly the same when run against both devices, even
when using 'btrfsck -s1'.

Should we light a candle and say 'goodbye' to the data or is
there some hope that btrfsck will be able to help us mount the
filesystem ?

Is there any additional information that is useful to the developers ?

The system is based on fedora-14, btrfs-progs-0.19-12.fc14.x86_64,
and the arcmsr module (built from the 1.20.0X.15-100729 sources).

- greg


* Re: parent transid troubles
  2011-04-19 19:08 parent transid troubles Gregory L Shomo
@ 2011-04-19 19:34 ` Chris Mason
  2011-04-20 12:56   ` Gregory L Shomo
  0 siblings, 1 reply; 10+ messages in thread
From: Chris Mason @ 2011-04-19 19:34 UTC (permalink / raw)
  To: Gregory L Shomo; +Cc: linux-btrfs

Excerpts from Gregory L Shomo's message of 2011-04-19 15:08:13 -0400:
> Hello list-
> 
> Under heavy load (i/o), one of our fileservers lost two drives
> in a raid6 configuration. After the drives were synchronized,
> we can no longer mount the multiple-device btrfs filesystem
> due to (at least) parent transid verification failures.
> 
> btrfsck built from git commit 1b444cd2e6ab8dcafdd47dbaeaae369dd1517c17
> runs for a while and then aborts on 'failed to find block number'.
> Sample output includes :

Looks like the rebuild gave you older copies of some of the blocks.
btrfsck will exit out pretty early when it sees problems, but I'd say
most of your FS is there.
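
One way to see how stale the returned blocks are is to compare the "wanted" and "found" values in each message. The following is a minimal Python sketch, not part of btrfs-progs. The function name `stale_blocks` and the "generations behind" measure are assumptions made for illustration; the sample lines are copied from the report above.

```python
import re

# Matches messages such as:
#   parent transid verify failed on 22569952096256 wanted 176066 found 176064
TRANSID_RE = re.compile(
    r"parent transid verify failed on (\d+) wanted (\d+) found (\d+)"
)


def stale_blocks(text):
    """Return {block_number: wanted - found} for each distinct block.

    'wanted - found' is only an illustrative measure of how many
    transactions behind the block on disk appears to be.
    """
    # Mail clients may wrap long lines, so collapse all whitespace first.
    flat = " ".join(text.split())
    result = {}
    for block, wanted, found in TRANSID_RE.findall(flat):
        result[int(block)] = int(wanted) - int(found)
    return result


sample = """
  parent transid verify failed on 22569952096256 wanted 176066 found
  176064
  parent transid verify failed on 22569952096256 wanted 176066 found 176064
  parent transid verify failed on 20403515183104 wanted 176066 found 174710
  parent transid verify failed on 1265784008704 wanted 176066 found 175341
  !-- snip
"""

for block, behind in stale_blocks(sample).items():
    print(f"block {block}: {behind} generation(s) behind")
```

In this sample the gaps range from 2 to 1356 generations, which would be consistent with some blocks coming from much older copies than others.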

Can you please do a btrfs-debug-tree /dev/xxx > out, I'd like to see how
far we get.

What errors do you get when trying to mount the FS?

-chris



* Re: parent transid troubles
  2011-04-19 19:34 ` Chris Mason
@ 2011-04-20 12:56   ` Gregory L Shomo
  2011-04-20 13:06     ` Chris Mason
  0 siblings, 1 reply; 10+ messages in thread
From: Gregory L Shomo @ 2011-04-20 12:56 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

Chris Mason <chris.mason@oracle.com> writes:

> Looks like the rebuild gave you older copies of some of the blocks.
> btrfsck will exit out pretty early when it sees problems, but I'd say
> most of your FS is there.
>
> Can you please do a btrfs-debug-tree /dev/xxx > out, I'd like to see how
> far we get.
>
> What errors do you get when trying to mount the FS?
>
> -chris

I'm not sure how far we will get, but btrfs-debug-tree
has been running for over 12h now and the screenlog is
at 80 GB. This may not be surprising, as the filesystem
is large (60 TB) and has millions of files.

From the logs at boottime, we have

  btrfs: failed to read the system array on sdd1
  btrfs: open_ctree failed

Should we wait for the btrfs-debug-tree to finish
before executing another mount command ?

- greg




* Re: parent transid troubles
  2011-04-20 12:56   ` Gregory L Shomo
@ 2011-04-20 13:06     ` Chris Mason
  2011-04-20 13:20       ` Gregory L Shomo
  0 siblings, 1 reply; 10+ messages in thread
From: Chris Mason @ 2011-04-20 13:06 UTC (permalink / raw)
  To: Gregory L Shomo; +Cc: linux-btrfs

Excerpts from Gregory L Shomo's message of 2011-04-20 08:56:02 -0400:
> I'm not sure how far we will get, but btrfs-debug-tree
> has been running for over 12h now and the screenlog is
> at 80 GB. This may not be surprising, as the filesystem
> is large (60 TB) and has millions of files.
> 
> From the logs at boottime, we have
> 
>   btrfs: failed to read the system array on sdd1
>   btrfs: open_ctree failed
> 
> Should we wait for the btrfs-debug-tree to finish
> before executing another mount command ?

For btrfs-debug-tree to run this long, big parts of your FS must be
valid.  Also, btrfs-debug-tree must have been able to read the sys
array (which mount was complaining about).

How easily can you try a newer kernel?  We need to make sure to do only
read-only operations (mount -o ro), but we may be able to pull out a
bunch of files.

-chris


* Re: parent transid troubles
  2011-04-20 13:06     ` Chris Mason
@ 2011-04-20 13:20       ` Gregory L Shomo
  2011-04-20 14:04         ` Chris Mason
  0 siblings, 1 reply; 10+ messages in thread
From: Gregory L Shomo @ 2011-04-20 13:20 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

Chris Mason <chris.mason@oracle.com> writes:

> For btrfs-debug-tree to run this long, big parts of your FS must be
> valid.  Also, btrfs-debug-tree must have been able to read the sys
> array (which mount was complaining about).
>
> How easily can you try a newer kernel?  We need to make sure to do only
> read-only operations (mount -o ro), but we may be able to pull out a
> bunch of files.
>
> -chris


Sure, we're up for that. Should we rebuild the kernel, or just 
the btrfs module ? If the kernel, is linux-2.6.38.3 a good
choice, or should we build 2.6.39-rc4 ? If we only need to
rebuild the btrfs module, should we use Monday's commit to 
btrfs-unstable ? 

- greg 


* Re: parent transid troubles
  2011-04-20 13:20       ` Gregory L Shomo
@ 2011-04-20 14:04         ` Chris Mason
  2011-04-20 20:53           ` Gregory L Shomo
  0 siblings, 1 reply; 10+ messages in thread
From: Chris Mason @ 2011-04-20 14:04 UTC (permalink / raw)
  To: Gregory L Shomo; +Cc: linux-btrfs

Excerpts from Gregory L Shomo's message of 2011-04-20 09:20:20 -0400:
> Sure, we're up for that. Should we rebuild the kernel, or just 
> the btrfs module ? If the kernel, is linux-2.6.38.3 a good
> choice, or should we build 2.6.39-rc4 ? If we only need to
> rebuild the btrfs module, should we use Monday's commit to 
> btrfs-unstable ? 

The best choice right now is 2.6.38 plus the master branch of the btrfs
unstable tree.  There are a lot of fixes for dealing with busted blocks
thanks to Josef and Fujitsu.

It may still have trouble, please make sure to mount -o ro.

-chris


* Re: parent transid troubles
  2011-04-20 14:04         ` Chris Mason
@ 2011-04-20 20:53           ` Gregory L Shomo
  2011-04-20 20:54             ` Chris Mason
  0 siblings, 1 reply; 10+ messages in thread
From: Gregory L Shomo @ 2011-04-20 20:53 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

Chris Mason <chris.mason@oracle.com> writes:

> The best choice right now is 2.6.38 plus the master branch of the btrfs
> unstable tree.  There are a lot of fixes for dealing with busted blocks
> thanks to Josef and Fujitsu.
>
> It may still have trouble, please make sure to mount -o ro.
>
> -chris

OK, we've re-compiled linux-2.6.38 patched up to btrfs-unstable
commit f65647c29b14f5a32ff6f3237b0ef3b375ed5a79 and can now mount 
the filesystem. 

Mounting the filesystem read-only from /dev/sdd1 fails, but
succeeds from /dev/sdc1... after about 4855 parent transid 
verification failures. 

  kernel: [  293.827069] Btrfs loaded
  kernel: [  293.828014] device fsid 2e4187db574846d8-404f05c2e6ec579d devid 2 transid 176065 /dev/sdd1
  kernel: [  293.828781] btrfs: failed to read the system array on sdd1
  kernel: [  293.835956] btrfs: open_ctree failed 

  kernel: [  305.296345] device fsid 2e4187db574846d8-404f05c2e6ec579d devid 1 transid 176066 /dev/sdc1
  kernel: [  305.476360] parent transid verify failed on 20403515125760 wanted 176066 found 174710
  kernel: [  305.476608] parent transid verify failed on 20403515125760 wanted 176066 found 174710
  !-- snip
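
The two "device fsid" lines above report different transids for the two devices (176065 on sdd1, 176066 on sdc1). As a rough illustration only, the Python sketch below pulls those values out of the kernel log and picks the device with the highest transid. The function name `newest_device` is hypothetical, and whether the higher transid is the reason the mount from sdc1 succeeds is an assumption, not something established in this thread.

```python
import re

# Matches kernel lines such as:
#   device fsid <fsid> devid 1 transid 176066 /dev/sdc1
DEVICE_RE = re.compile(r"device fsid (\S+) devid (\d+) transid (\d+) (\S+)")


def newest_device(log_lines):
    """Return (transid, path, devid) for the device with the highest transid.

    Returns None if no matching line is found.
    """
    devices = []
    for line in log_lines:
        match = DEVICE_RE.search(line)
        if match:
            _fsid, devid, transid, path = match.groups()
            devices.append((int(transid), path, int(devid)))
    return max(devices) if devices else None


log = [
    "kernel: [  293.828014] device fsid 2e4187db574846d8-404f05c2e6ec579d devid 2 transid 176065 /dev/sdd1",
    "kernel: [  305.296345] device fsid 2e4187db574846d8-404f05c2e6ec579d devid 1 transid 176066 /dev/sdc1",
]

print(newest_device(log))
```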

Is there any chance we can resolve some of the parent transid 
verification failures ? What should our next steps be ? 

Thank you very much for all your help. 

- greg 





* Re: parent transid troubles
  2011-04-20 20:53           ` Gregory L Shomo
@ 2011-04-20 20:54             ` Chris Mason
  2011-05-04 18:04               ` Gregory L Shomo
  2011-05-25 18:03               ` Gregory L Shomo
  0 siblings, 2 replies; 10+ messages in thread
From: Chris Mason @ 2011-04-20 20:54 UTC (permalink / raw)
  To: Gregory L Shomo; +Cc: linux-btrfs

Excerpts from Gregory L Shomo's message of 2011-04-20 16:53:29 -0400:
> OK, we've re-compiled linux-2.6.38 patched up to btrfs-unstable
> commit f65647c29b14f5a32ff6f3237b0ef3b375ed5a79 and can now mount 
> the filesystem. 
> 
> Mounting the filesystem read-only from /dev/sdd1 fails, but
> succeeds from /dev/sdc1... after about 4855 parent transid 
> verification failures. 
> 
>   kernel: [  293.827069] Btrfs loaded
>   kernel: [  293.828014] device fsid 2e4187db574846d8-404f05c2e6ec579d devid 2 transid 176065 /dev/sdd1
>   kernel: [  293.828781] btrfs: failed to read the system array on sdd1
>   kernel: [  293.835956] btrfs: open_ctree failed 
> 
>   kernel: [  305.296345] device fsid 2e4187db574846d8-404f05c2e6ec579d devid 1 transid 176066 /dev/sdc1
>   kernel: [  305.476360] parent transid verify failed on 20403515125760 wanted 176066 found 174710
>   kernel: [  305.476608] parent transid verify failed on 20403515125760 wanted 176066 found 174710
>   !-- snip
> 
> Is there any chance we can resolve some of the parent transid 
> verification failures ? What should our next steps be ? 
> 
> Thank you very much for all your help. 

The failures won't get resolved easily.  Many of them will be duplicates
because of the way we do readahead.
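
To get a feel for how many of the messages are duplicates, the kernel log can be grouped by block number. This is a small Python sketch under the assumption that the log lines have the format quoted above; the function name `summarize` is made up for this example.

```python
import re
from collections import Counter

TRANSID_RE = re.compile(
    r"parent transid verify failed on (\d+) wanted (\d+) found (\d+)"
)


def summarize(log_lines):
    """Count messages per distinct (block, wanted, found) triple."""
    counts = Counter()
    for line in log_lines:
        match = TRANSID_RE.search(line)
        if match:
            counts[tuple(int(g) for g in match.groups())] += 1
    return counts


log = [
    "kernel: [  305.476360] parent transid verify failed on 20403515125760 wanted 176066 found 174710",
    "kernel: [  305.476608] parent transid verify failed on 20403515125760 wanted 176066 found 174710",
]

counts = summarize(log)
total = sum(counts.values())
unique = len(counts)
print(f"{total} messages, {unique} distinct block(s)")
```

Run against the full log, the same grouping would show how many distinct blocks lie behind the roughly 4855 reported failures.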

Step one is to copy off the data that you can.  dmesg -n 1 will help
prevent performance problems from message floods.
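
Copying from a damaged filesystem tends to hit read errors partway through, so it helps if the copy keeps going and records what it could not read. The Python sketch below shows one way to do that; it is not what was used in this thread, the function name `salvage_copy` is hypothetical, and the demonstration runs on a temporary directory rather than a real mount. It assumes the source is mounted read-only (mount -o ro) as advised.

```python
import os
import shutil
import tempfile


def salvage_copy(src_root, dst_root):
    """Copy every readable file under src_root into dst_root.

    Returns a list of (path, error) pairs for anything that failed,
    instead of stopping at the first I/O error.
    """
    failures = []

    def on_walk_error(exc):
        # Called by os.walk when a directory cannot be listed.
        failures.append((exc.filename, exc))

    for dirpath, _dirnames, filenames in os.walk(src_root, onerror=on_walk_error):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            try:
                shutil.copy2(src, dst, follow_symlinks=False)
            except OSError as exc:
                failures.append((src, exc))
    return failures


# Demonstration on a throwaway directory tree (paths are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "mnt")
    dst = os.path.join(tmp, "rescue")
    os.makedirs(os.path.join(src, "data"))
    with open(os.path.join(src, "data", "a.txt"), "w") as fh:
        fh.write("hello")
    failures = salvage_copy(src, dst)
    with open(os.path.join(dst, "data", "a.txt")) as fh:
        copied_text = fh.read()

print(f"{len(failures)} file(s) could not be copied")
```

The returned list gives a record of which files still need to be restored from elsewhere or re-computed.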

-chris


* Re: parent transid troubles
  2011-04-20 20:54             ` Chris Mason
@ 2011-05-04 18:04               ` Gregory L Shomo
  2011-05-25 18:03               ` Gregory L Shomo
  1 sibling, 0 replies; 10+ messages in thread
From: Gregory L Shomo @ 2011-05-04 18:04 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

Chris Mason <chris.mason@oracle.com> writes:

>> Mounting the filesystem read-only from /dev/sdd1 fails, but
>> succeeds from /dev/sdc1... after about 4855 parent transid 
>> verification failures. 
>> 
>>   kernel: [  293.827069] Btrfs loaded
>>   kernel: [  293.828014] device fsid 2e4187db574846d8-404f05c2e6ec579d devid 2 transid 176065 /dev/sdd1
>>   kernel: [  293.828781] btrfs: failed to read the system array on sdd1
>>   kernel: [  293.835956] btrfs: open_ctree failed 
>> 
>>   kernel: [  305.296345] device fsid 2e4187db574846d8-404f05c2e6ec579d devid 1 transid 176066 /dev/sdc1
>>   kernel: [  305.476360] parent transid verify failed on 20403515125760 wanted 176066 found 174710
>>   kernel: [  305.476608] parent transid verify failed on 20403515125760 wanted 176066 found 174710
>>   !-- snip
>> 
>> Is there any chance we can resolve some of the parent transid 
>> verification failures ? What should our next steps be ? 
>> 
>> Thank you very much for all your help. 
>
> The failures won't get resolved easily.  Many of them will be duplicates
> because of the way we do readahead.
>
> Step one is to copy off the data that you can.  dmesg -n 1 will help
> prevent performance problems from message floods.
>
> -chris

So we've copied off all the data, what's the next step ? 

Losing all files that were open for writing at the time 
of the failure is no problem, as those data sets will have
to be re-computed anywise. Does that work in our favour
to resolve this issue ? 

- greg 




* Re: parent transid troubles
  2011-04-20 20:54             ` Chris Mason
  2011-05-04 18:04               ` Gregory L Shomo
@ 2011-05-25 18:03               ` Gregory L Shomo
  1 sibling, 0 replies; 10+ messages in thread
From: Gregory L Shomo @ 2011-05-25 18:03 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

Chris Mason <chris.mason@oracle.com> writes:

>> OK, we've re-compiled linux-2.6.38 patched up to btrfs-unstable
>> commit f65647c29b14f5a32ff6f3237b0ef3b375ed5a79 and can now mount 
>> the filesystem. 
>> 
>> Mounting the filesystem read-only from /dev/sdd1 fails, but
>> succeeds from /dev/sdc1... after about 4855 parent transid 
>> verification failures. 
>> 
>>   kernel: [  293.827069] Btrfs loaded
>>   kernel: [  293.828014] device fsid 2e4187db574846d8-404f05c2e6ec579d devid 2 transid 176065 /dev/sdd1
>>   kernel: [  293.828781] btrfs: failed to read the system array on sdd1
>>   kernel: [  293.835956] btrfs: open_ctree failed 
>> 
>>   kernel: [  305.296345] device fsid 2e4187db574846d8-404f05c2e6ec579d devid 1 transid 176066 /dev/sdc1
>>   kernel: [  305.476360] parent transid verify failed on 20403515125760 wanted 176066 found 174710
>>   kernel: [  305.476608] parent transid verify failed on 20403515125760 wanted 176066 found 174710
>>   !-- snip
>> 
>> Is there any chance we can resolve some of the parent transid 
>> verification failures ? What should our next steps be ? 
>> 
>> Thank you very much for all your help. 
>
> The failures won't get resolved easily.  Many of them will be duplicates
> because of the way we do readahead.
>
> Step one is to copy off the data that you can.  dmesg -n 1 will help
> prevent performance problems from message floods.
>
> -chris

Now that we've copied off what we need, shall we just
take the plunge and mount read-write ? Is there some
way we can clear out the parent transid failures so that
we will not have to look at them in the future ?

- greg 



