* btrfs check inconsistency with raid1, part 1
@ 2015-12-14  4:16 Chris Murphy
  2015-12-14  5:48 ` Qu Wenruo
  0 siblings, 1 reply; 18+ messages in thread
From: Chris Murphy @ 2015-12-14  4:16 UTC (permalink / raw)
  To: Btrfs BTRFS

Part 1 = What to do about it? This post.
Part 2 = How I got here? I'm still working on the write-up, so it's
not yet posted.

Summary:

2 dev (spinning rust) raid1 for data and metadata.
kernel 4.2.6, btrfs-progs 4.2.2

btrfs check with devid 1 and 2 present produces thousands of scary
messages, e.g.
checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000

btrfs check with devid 1 or devid 2 on its own (the other missing)
produces no such scary messages at all, but instead messages like:
failed to load free space cache for block group 357585387520

a. This inconsistency is unexpected.
b. The 'btrfs check' with combined devices gives no insight into the
seriousness of the "checksum verify failed" messages, or what the
solution is.
c. Combined or separate+degraded, read-only mounts succeed with no
errors in user space or dmesg; only normal mount messages appear. With
both devs ro mounted, I was able to completely btrfs send/receive the
two most recent ro snapshots, comprising 100% (minus stale historical)
of the data on the drive, with zero errors reported.
d. No read-write mount attempt has happened since "the incident",
which will be detailed in part 2.


Details:


The full devid1&2 btrfs check is long and not very interesting, so
I've put that here:
https://drive.google.com/open?id=0B_2Asp8DGjJ9Vjd0VlNYb09LVFU

btrfs-show-super shows some differences; values below are denoted as
devid1/devid2. Where there's no split, the value is the same for both
devids.


generation        4924/4923
root            714189258752/714188554240
sys_array_size        129
chunk_root_generation    4918
root_level        1
chunk_root        715141414912
chunk_root_level    1
log_root        0
log_root_transid    0
log_root_level        0
total_bytes        1500312748032
bytes_used        537228206080
sectorsize        4096
nodesize        16384
[snip]
cache_generation    4924/4923
uuid_tree_generation    4924/4923
[snip]
dev_item.total_bytes    750156374016
dev_item.bytes_used    541199433728
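
For reference, the comparison above is just btrfs-show-super run
against each device, roughly like this (assuming /dev/sdb = devid 1
and /dev/sdc = devid 2 here; the device nodes may differ on your
system):

# btrfs-show-super /dev/sdb > super_devid1.txt
# btrfs-show-super /dev/sdc > super_devid2.txt
# diff -u super_devid1.txt super_devid2.txt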

Perhaps useful: at the time of "the incident" this volume was rw
mounted, but was being used by only a single process: btrfs send. So
it was being used as a source. No writes, other than btrfs's own
generation increments, were happening.

So in theory this should perhaps be the simplest case of "what do I
do now?", and it even makes me wonder whether a normal rw mount should
just fix this up: either btrfs uses generation 4924 and automatically
writes the changes between 4923 and 4924 to devid 2 so they are back
in sync, or it automatically discards generation 4924 from devid 1, so
both devices are in sync.

The workload, the circumstances of "the incident", the general purpose
of Btrfs, and the likelihood that a typical user would never even have
become aware of "the incident" until much later than I did, all make
me strongly feel that Btrfs should be able to completely recover from
this with just a rw mount, with the out-of-sync generations eventually
autocorrecting. But I don't know that. And I get essentially no advice
from the btrfs check results.

So. What's the theory in this case? And then does it differ from reality?


-- 
Chris Murphy


* Re: btrfs check inconsistency with raid1, part 1
  2015-12-14  4:16 btrfs check inconsistency with raid1, part 1 Chris Murphy
@ 2015-12-14  5:48 ` Qu Wenruo
  2015-12-14  7:24   ` Chris Murphy
  0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2015-12-14  5:48 UTC (permalink / raw)
  To: Chris Murphy, Btrfs BTRFS



Chris Murphy wrote on 2015/12/13 21:16 -0700:
> Part 1= What to do about it? This post.
> Part 2 = How I got here? I'm still working on the write up, so it's
> not yet posted.
>
> Summary:
>
> 2 dev (spinning rust) raid1 for data and metadata.
> kernel 4.2.6, btrfs-progs 4.2.2
>
> btrfs check with devid 1 and 2 present produces thousands of scary
> messages, e.g.
> checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000

I checked the full output.
The interesting part is that the calculated result is always E4E3BDB6,
and the wanted value is always all 0.

I assume E4E3BDB6 is the crc32 of all-zero data.


If there were a full disk dump, it would be much easier to find where
the problem is.
But I'm afraid that won't be possible.

At least, 'btrfs-debug-tree -t 2' should help locate what's wrong with
the bytenrs in the warnings.


The good news is that the fs seems to be OK, without major problems.
Apart from the csum errors, btrfsck doesn't give any other
error/warning.
>
> btrfs check with devid 1 or devid2 separate (the other is missing)
> produces no such scary messages at all, but instead messages e.g.
> failed to load free space cache for block group 357585387520
>
> a. This inconsistency is unexpected.
> b. the 'btrfs check' with combined devices gives no insight to the
> seriousness of "checksum verify failed" messages, or what the solution
> is.

I guess btrfsck assembled the devices wrongly, but that's just my
personal guess.
And since I can't reproduce it in my test environment, it won't be
easy to find the root cause.

> c. combined or separate+degraded, read-only mounts succeed with no
> errors in user space or dmesg; only normal mount messages happen. With
> both devs ro mounted, I was able to completely btrfs send/receive the
> most recent two ro snapshots comprising 100% (minus stale historical)
> data on the drive, with zero errors reported.
> d. no read-write mount attempt has happened since "the incident" which
> will be detailed in part 2.
>
>
> Details:
>
>
> The full devid1&2 btrfs check is long and not very interesting, so
> I've put that here:
> https://drive.google.com/open?id=0B_2Asp8DGjJ9Vjd0VlNYb09LVFU
>
> btrfs-show-super shows some differences, values denoted as
> devid1/devid2. If there's no split, those values are the same for both
> devids.
>
>
> generation        4924/4923
> root            714189258752/714188554240
> sys_array_size        129
> chunk_root_generation    4918
> root_level        1
> chunk_root        715141414912
> chunk_root_level    1
> log_root        0
> log_root_transid    0
> log_root_level        0
> total_bytes        1500312748032
> bytes_used        537228206080
> sectorsize        4096
> nodesize        16384
> [snip]
> cache_generation    4924/4923
> uuid_tree_generation    4924/4923
> [snip]
> dev_item.total_bytes    750156374016
> dev_item.bytes_used    541199433728
>
> Perhaps useful, is at the time of "the incident" this volume was rw
> mounted, but was being used by a single process only: btrfs send. So
> it was used as a source. No writes, other than btrfs's own generation
> increment, were happening.
>
> So in theory, this should perhaps be the simplest case of "what do I
> do now?" and even makes me wonder if a normal rw mount should just fix
> this up: either btrfs uses generation 4924 and updates all changes
> from 4923 and 4924 automatically to devid2 so they are now in sync, or
> it automatically discards generation 4924 from devid1, so both devices
> are in sync.
>
> The workload, circumstances of "the incident", the general purpose of
> btrfs, and the likelihood a typical user would never have even become
> aware of "the incident" until much later than I did, makes me strongly
> feel like Btrfs should be able to completely recover from this, with
> just a rw mount and eventually the missync'd generations will
> autocorrect. But I don't know that. And I get essentially no advice
> from btrfs check results.
>
> So. What's the theory in this case? And then does it differ from reality?

Personally speaking, it may be a false alert from btrfsck.
So in this case, I can't provide much help.

If you're brave enough, mount it rw to see what will happen (although
it may well mount just fine).

Thanks,
Qu




* Re: btrfs check inconsistency with raid1, part 1
  2015-12-14  5:48 ` Qu Wenruo
@ 2015-12-14  7:24   ` Chris Murphy
  2015-12-14  8:04     ` Qu Wenruo
  2015-12-14 11:51     ` Duncan
  0 siblings, 2 replies; 18+ messages in thread
From: Chris Murphy @ 2015-12-14  7:24 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

Thanks for the reply.


On Sun, Dec 13, 2015 at 10:48 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>
>
> Chris Murphy wrote on 2015/12/13 21:16 -0700:
>> btrfs check with devid 1 and 2 present produces thousands of scary
>> messages, e.g.
>> checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000
>
>
> Checked the full output.
> The interesting part is, the calculated result is always E4E3BDB6, and
> wanted is always all 0.
>
> I assume E4E3BDB6 is crc32 of all 0 data.
>
>
> If there is a full disk dump, it will be much easier to find where the
> problem is.
> But I'm a afraid it won't be possible.

What is a full disk dump? I can try to see if it's possible. The main
thing, though, is that it's only worth doing if it can make Btrfs
overall better, because I don't need this volume repaired; there's no
data loss (backups!), so this volume's purpose now is study.


> At least, 'btrfs-debug-tree -t 2' should help to locate what's wrong with
> the bytenr in the warning.

Both devs attached (not mounted).

[root@f23a ~]# btrfs-debug-tree -t 2 /dev/sdb > btrfsdebugtreet2_verb.txt
checksum verify failed on 714189570048 found E4E3BDB6 wanted 00000000
checksum verify failed on 714189570048 found E4E3BDB6 wanted 00000000
checksum verify failed on 714189471744 found E4E3BDB6 wanted 00000000
checksum verify failed on 714189471744 found E4E3BDB6 wanted 00000000
checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000
checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000
checksum verify failed on 714189750272 found E4E3BDB6 wanted 00000000
checksum verify failed on 714189750272 found E4E3BDB6 wanted 00000000

https://drive.google.com/open?id=0B_2Asp8DGjJ9NUdmdXZFQ1Myek0


>
>
> The good news is, the fs seems to be OK without major problem.
> As except the csum error, btrfsck doesn't give other error/warning.

Yes, I think so. The main issue here seems to be the scary warnings
and the uncertainty about what the user should do next, if anything at
all.

> I guess btrfsck did the wrong device assemble, but that's just my personal
> guess.
> And since I can't reproduce in my test environment, it won't be easy to find
> the root cause.

It might be reproducible. More on that in the next email. Easy to get
you remote access if useful.


>> So. What's the theory in this case? And then does it differ from reality?
>
>
> Personally speaking, it may be a false alert from btrfsck.
> So in this case, I can't provide much help.
>
> If you're brave enough, mount it rw to see what will happen(although it may
> mount just OK).

I'm brave enough. I'll give it a try tomorrow unless there's another
request for more info before then.


-- 
Chris Murphy


* Re: btrfs check inconsistency with raid1, part 1
  2015-12-14  7:24   ` Chris Murphy
@ 2015-12-14  8:04     ` Qu Wenruo
  2015-12-14 17:59       ` Chris Murphy
  2015-12-14 11:51     ` Duncan
  1 sibling, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2015-12-14  8:04 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS



Chris Murphy wrote on 2015/12/14 00:24 -0700:
> Thanks for the reply.
>
>
> On Sun, Dec 13, 2015 at 10:48 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>>
>>
>> Chris Murphy wrote on 2015/12/13 21:16 -0700:
>>> btrfs check with devid 1 and 2 present produces thousands of scary
>>> messages, e.g.
>>> checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000
>>
>>
>> Checked the full output.
>> The interesting part is, the calculated result is always E4E3BDB6, and
>> wanted is always all 0.
>>
>> I assume E4E3BDB6 is crc32 of all 0 data.
>>
>>
>> If there is a full disk dump, it will be much easier to find where the
>> problem is.
>> But I'm a afraid it won't be possible.
>
> What is a full disk dump? I can try to see if it's possible.

Just a dd dump.

dd if=<disk1> of=disk1.img bs=1M

> Main
> thing though is only if it can make Btrfs overall better, because I
> don't need this volume repaired, there's no data loss (backups!) so
> this volume's purpose now is for study.

But please also consider your privacy before doing this.

And the more important thing is the size...

Considering how large your -t 2 dump already is, I won't ever try to
do the full dump: even with enough spare space to contain the image,
it won't be easy to find a place to upload it.

>
>
>> At least, 'btrfs-debug-tree -t 2' should help to locate what's wrong with
>> the bytenr in the warning.
>
> Both devs attached (not mounted).
>
> [root@f23a ~]# btrfs-debug-tree -t 2 /dev/sdb > btrfsdebugtreet2_verb.txt
> checksum verify failed on 714189570048 found E4E3BDB6 wanted 00000000
> checksum verify failed on 714189570048 found E4E3BDB6 wanted 00000000
> checksum verify failed on 714189471744 found E4E3BDB6 wanted 00000000
> checksum verify failed on 714189471744 found E4E3BDB6 wanted 00000000
> checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000
> checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000
> checksum verify failed on 714189750272 found E4E3BDB6 wanted 00000000
> checksum verify failed on 714189750272 found E4E3BDB6 wanted 00000000
>
> https://drive.google.com/open?id=0B_2Asp8DGjJ9NUdmdXZFQ1Myek0
>

Got the result, and things are very interesting.

It seems all these tree blocks (searched by bytenr) share the same
crc32 by coincidence.
Otherwise we wouldn't be able to read them all (and their contents all
seem valid).


I hope I can get some raw block dumps of those bytenrs.
Here is the procedure:
$ btrfs-map-logical -l <LOGICAL> -n 16384 -c 2 <DEVICE1or2>
mirror 1 logical <LOGICAL> physical XXXXXXXX device <DEVICE1>
mirror 2 logical <LOGICAL> physical YYYYYYYY device <DEVICE2>

$ dd if=<DEVICE1> of=dev1_<LOGICAL>.img bs=1 count=16384 skip=XXXXXXX
$ dd if=<DEVICE2> of=dev2_<LOGICAL>.img bs=1 count=16384 skip=YYYYYYY

In your output there are 12 different bytenrs, but the most
interesting ones are *714189357056* and *714189471744*.
They are extent tree blocks. If they were really broken, btrfsck
should complain about it.

The others are mostly csum tree blocks, less interesting.

And unlike the super-large disk dump, these are very small: exactly
16K each, 64K in total.
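
If you want to look at those two extent tree blocks directly,
something like this should also work (the device node here is just an
example):

$ btrfs-debug-tree -b 714189357056 /dev/sdb
$ btrfs-debug-tree -b 714189471744 /dev/sdb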

>
>>
>>
>> The good news is, the fs seems to be OK without major problem.
>> As except the csum error, btrfsck doesn't give other error/warning.
>
> Yes, I think so. Main issue here seems to be the scary warnings and
> uncertainty what the user should do next, if anything at all.
>
>> I guess btrfsck did the wrong device assemble, but that's just my personal
>> guess.
>> And since I can't reproduce in my test environment, it won't be easy to find
>> the root cause.
>
> It might be reproducible. More on that in the next email. Easy to get
> you remote access if useful.
>
>
>>> So. What's the theory in this case? And then does it differ from reality?
>>
>>
>> Personally speaking, it may be a false alert from btrfsck.
>> So in this case, I can't provide much help.
>>
>> If you're brave enough, mount it rw to see what will happen(although it may
>> mount just OK).
>
> I'm brave enough. I'll give it a try tomorrow unless there's another
> request for more info before then.
>
>
Great!

Thanks,
Qu




* Re: btrfs check inconsistency with raid1, part 1
  2015-12-14  7:24   ` Chris Murphy
  2015-12-14  8:04     ` Qu Wenruo
@ 2015-12-14 11:51     ` Duncan
  1 sibling, 0 replies; 18+ messages in thread
From: Duncan @ 2015-12-14 11:51 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Mon, 14 Dec 2015 00:24:21 -0700 as excerpted:

>> Personally speaking, it may be a false alert from btrfsck.
>> So in this case, I can't provide much help.
>>
>> If you're brave enough, mount it rw to see what will happen(although it
>> may mount just OK).
> 
> I'm brave enough. I'll give it a try tomorrow unless there's another
> request for more info before then.

Given the off-by-one generations and my own btrfs raid1 experience,
I'm guessing the likely result is either a good mount and no problems,
or a good initial mount but a lockup once you actually do too much
with the filesystem (like reading the affected blocks).

Looks like a normal generation-out-of-sync condition, common with forced 
unsynced/not-remounted-ro shutdowns.  If so, btrfs should redirect reads 
to the updated current generation device, but you'll need to do a scrub 
to get everything 100% back in sync.

The catch I found, at least when I still had the then-failing ssd (not
yet failed, it was just finding more and more sectors that needed to
be redirected to spares) in my raid1, along with an on-boot service
that read a rather large dir into cache, was that after enough errors
from the failing device, instead of continuing to redirect reads to
the good device, btrfs just gave up, which resulted in a system crash
here.

But when there weren't that many errors on the failing device, or when
I intercepted the boot process and mounted everything without running
the normal post-mount services (systemd emergency target instead of my
usual default multi-user), the service that cached that dir never got
a chance to run and all those errors weren't triggered. I could still
mount normally, and from there I could run scrub, which took care of
the problem without triggering the usual too-many-errors crash. After
the scrub I could invoke normal multi-user mode, start all services
including the caching service, and go about my usual business.

So if I'm correct, mount normally and scrub, and you should be fine,
though you may have to abort a normal boot if it accesses too many bad
files, in order to be able to finish the scrub before a crash.
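
In concrete terms that's roughly the following (device node and mount
point here are only placeholders, substitute your own):

# mount /dev/sdb /mnt/verb
# btrfs scrub start -Bd /mnt/verb
# btrfs scrub status /mnt/verb

(-B keeps scrub in the foreground, -d prints per-device statistics.)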

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: btrfs check inconsistency with raid1, part 1
  2015-12-14  8:04     ` Qu Wenruo
@ 2015-12-14 17:59       ` Chris Murphy
  2015-12-20 22:32         ` Chris Murphy
       [not found]         ` <CAJCQCtSEx_wYPkfazik0bcpQwXxJCA=O5f0o6RbxON4jjB4q7A@mail.gmail.com>
  0 siblings, 2 replies; 18+ messages in thread
From: Chris Murphy @ 2015-12-14 17:59 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Chris Murphy, Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 2766 bytes --]

On Mon, Dec 14, 2015 at 1:04 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>
>
> Chris Murphy wrote on 2015/12/14 00:24 -0700:
>> What is a full disk dump? I can try to see if it's possible.
>
>
> Just a dd dump.

OK, yeah. That's 750GB per drive.

> it won't be an easy
> thing to find a place to upload them.

Right. I have no ideas. I'll give you the rest of what you asked for,
and won't do the rw mount yet in case you need more.


> Got the result, and things is very interesting.
>
> It seems all these tree blocks (search by the bytenr) shares the same crc32
> by coincidence.
> Or we won't be able to read them all (and their contents all seems valid).
>
>
> I hope if I can have some raw blocks dump of that bytenr.
> Here is the procedure:
> $ btrfs-map-logical -l <LOGICAL> -n 16384 -c 2 <DEVICE1or2>
> mirror 1 logical <LOGICAL> physical XXXXXXXX device <DEVICE1>
> mirror 2 logical <LOGICAL> physical YYYYYYYY device <DEVICE2>

Option -n is invalid; I'll use option -b instead.

## btrfs fi show reports this mapping, which seems to be the opposite
of btrfs-map-logical's (although that uses the term mirror rather than
devid). So I will use devid and ignore the mirror number.
/dev/sdb = devid 1
/dev/sdc = devid 2


# btrfs-map-logical -l 714189357056 -b 16384 -c 2 /dev/sdb
checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000
checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000
checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000
checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000
mirror 1 logical 714189357056 physical 356605018112 device /dev/sdc
mirror 2 logical 714189357056 physical 3380658176 device /dev/sdb



# btrfs-map-logical -l 714189471744 -b 16384 -c 2 /dev/sdb
checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000
checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000
checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000
checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000
mirror 1 logical 714189471744 physical 356605132800 device /dev/sdc
mirror 2 logical 714189471744 physical 3380772864 device /dev/sdb


>
> $ dd if=<DEVICE1> of=dev1_<LOGICAL>.img bs=1 count=16384 skip=XXXXXXX
> $ dd if=<DEVICE2> of=dev2_<LOGICAL>.img bs=1 count=16384 skip=YYYYYYY
>
> In your output, there are 12 different bytenr, but the most interesting ones
> are *714189357056* and *714189471744*.


dd if=/dev/sdb of=dev1_714189357056.img bs=1 count=16384 skip=3380658176
dd if=/dev/sdc of=dev2_714189357056.img bs=1 count=16384 skip=356605018112

dd if=/dev/sdb of=dev1_714189471744.img bs=1 count=16384 skip=3380772864
dd if=/dev/sdc of=dev2_714189471744.img bs=1 count=16384 skip=356605132800

Files are attached to this email.


-- 
Chris Murphy

[-- Attachment #2: dev2_714189471744.img --]
[-- Type: application/x-raw-disk-image, Size: 16384 bytes --]

[-- Attachment #3: dev2_714189357056.img --]
[-- Type: application/x-raw-disk-image, Size: 16384 bytes --]

[-- Attachment #4: dev1_714189471744.img --]
[-- Type: application/x-raw-disk-image, Size: 16384 bytes --]

[-- Attachment #5: dev1_714189357056.img --]
[-- Type: application/x-raw-disk-image, Size: 16384 bytes --]


* Re: btrfs check inconsistency with raid1, part 1
  2015-12-14 17:59       ` Chris Murphy
@ 2015-12-20 22:32         ` Chris Murphy
       [not found]         ` <CAJCQCtSEx_wYPkfazik0bcpQwXxJCA=O5f0o6RbxON4jjB4q7A@mail.gmail.com>
  1 sibling, 0 replies; 18+ messages in thread
From: Chris Murphy @ 2015-12-20 22:32 UTC (permalink / raw)
  To: Btrfs BTRFS

On Mon, Dec 14, 2015 at 10:59 AM, Chris Murphy <lists@colorremedies.com> wrote:
> On Mon, Dec 14, 2015 at 1:04 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>>
>>
>> Chris Murphy wrote on 2015/12/14 00:24 -0700:
>>> What is a full disk dump? I can try to see if it's possible.
>>
>>
>> Just a dd dump.
>
> OK, yeah. That's 750GB per drive.
>
>> it won't be an easy
>> thing to find a place to upload them.
>
> Right. I have no ideas. I'll give you the rest of what you asked for,
> and won't do the rw mount yet in case you need more.
>
>
>> Got the result, and things is very interesting.
>>
>> It seems all these tree blocks (search by the bytenr) shares the same crc32
>> by coincidence.
>> Or we won't be able to read them all (and their contents all seems valid).
>>
>>
>> I hope if I can have some raw blocks dump of that bytenr.
>> Here is the procedure:
>> $ btrfs-map-logical -l <LOGICAL> -n 16384 -c 2 <DEVICE1or2>
>> mirror 1 logical <LOGICAL> physical XXXXXXXX device <DEVICE1>
>> mirror 2 logical <LOGICAL> physical YYYYYYYY device <DEVICE2>
>
> Option -n is invalid, I'll use option -b.
>
> ##btrfs fi show has this mapping, seems opposite from
> btrfs-map-logical (although it uses the term mirror rather than
> devid). So I will use devid and ignore mirror number.
> /dev/sdb = devid1
> /dev/sdc = devid2
>
>
> # btrfs-map-logical -l 714189357056 -b 16384 -c 2 /dev/sdb
> checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000
> checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000
> checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000
> checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000
> mirror 1 logical 714189357056 physical 356605018112 device /dev/sdc
> mirror 2 logical 714189357056 physical 3380658176 device /dev/sdb
>
>
>
> # btrfs-map-logical -l 714189471744 -b 16384 -c 2 /dev/sdb
> checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000
> checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000
> checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000
> checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000
> mirror 1 logical 714189471744 physical 356605132800 device /dev/sdc
> mirror 2 logical 714189471744 physical 3380772864 device /dev/sdb
>
>
>>
>> $ dd if=<DEVICE1> of=dev1_<LOGICAL>.img bs=1 count=16384 skip=XXXXXXX
>> $ dd if=<DEVICE2> of=dev2_<LOGICAL>.img bs=1 count=16384 skip=YYYYYYY
>>
>> In your output, there are 12 different bytenr, but the most interesting ones
>> are *714189357056* and *714189471744*.
>
>
> dd if=/dev/sdb of=dev1_714189357056.img bs=1 count=16384 skip=3380658176
> dd if=/dev/sdc of=dev2_714189357056.img bs=1 count=16384 skip=356605018112
>
> dd if=/dev/sdb of=dev1_714189471744.img bs=1 count=16384 skip=3380772864
> dd if=/dev/sdc of=dev2_714189471744.img bs=1 count=16384 skip=356605132800
>
> Files are attached to this email.
>

Hi Qu, any insight with these attachments?

I will likely try a normal rw mount once 4.4.0rc6 is done and built in
Fedora's koji (24-48 hours). If that goes OK, I'll try some reads and
see whether that triggers any problems; if there are none, I'll do
some writes and see if the two device generations end up back in sync.
If there continue to be no complaints, I'll do a scrub and we'll see
whether that notices anything, fixes things, or what.
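
Roughly, the checks I have in mind after each step (device nodes and
mount point are whatever they end up being on that boot):

$ dmesg -w                        # watch for kernel complaints
$ btrfs device stats /mnt/verb    # per-device write/corruption/generation counters
$ btrfs-show-super /dev/sdb | grep '^generation'
$ btrfs-show-super /dev/sdc | grep '^generation'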

I think the cause is related to bus power combined with buggy USB 3
LPM firmware (these enclosures are cheap, maybe $6). I've found some
threads about this being a problem, but it's not expected to cause any
corruption. So the fact that Btrfs picks up on some problems might
prove that (somewhat) incorrect.

http://permalink.gmane.org/gmane.linux.usb.general/105502
http://www.spinics.net/lists/linux-usb/msg108949.html

I have the exact same enclosure mentioned in the 2nd link (which is
the last email in the thread, with no real resolution). The usb reset
messages never happen when the same enclosure+drive is attached to a
1.5A USB connector on the NUC. It only happens (with two of the same
model enclosures with different drive makes/models) on the standard
USB connectors on the Intel NUC. But I have a hard time believing a
laptop drive needs more than 900mA continuously, rather than just at
spin-up time.



-- 
Chris Murphy


* Re: btrfs check inconsistency with raid1, part 1
       [not found]           ` <5677592F.5000202@cn.fujitsu.com>
@ 2015-12-21  2:12             ` Chris Murphy
  2015-12-21  2:23               ` Qu Wenruo
  0 siblings, 1 reply; 18+ messages in thread
From: Chris Murphy @ 2015-12-21  2:12 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 911 bytes --]

On Sun, Dec 20, 2015 at 6:43 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>
>
> Chris Murphy wrote on 2015/12/20 15:31 -0700:

>> I think the cause is related to bus power with buggy USB 3 LPM
>> firmware (these enclosures are cheap maybe $6). I've found some
>> threads about this being a problem, but it's not expected to cause any
>> corruptions. So, the fact Btrfs picks up one some problems might prove
>> that (somewhat) incorrect.
>
>
> Seems possible. Maybe some metadata just failed to reach disk.
> BTW, did I asked for a btrfs-show-super output?

Nope. I will attach them to this email below, for both devices.

> If that's the case, superblock on device 2 maybe older than superblock on
> device 1.

Yes, looks like devid 1 is at transid 4924 and devid 2 at transid
4923. And it's devid 2 that had the device reset and write errors when
it vanished and reappeared as a different block device.





-- 
Chris Murphy

[-- Attachment #2: btrfsshowsuper_devid1.txt --]
[-- Type: text/plain, Size: 9711 bytes --]

[liveuser@localhost ~]$ sudo btrfs-show-super -af /dev/sdc
superblock: bytenr=65536, device=/dev/sdc
---------------------------------------------------------
csum			0x93333bd8 [match]
bytenr			65536
flags			0x1
			( WRITTEN )
magic			_BHRfS_M [match]
fsid			197606b2-9f4a-4742-8824-7fc93285c29c
label			verb
generation		4924
root			714189258752
sys_array_size		129
chunk_root_generation	4918
root_level		1
chunk_root		715141414912
chunk_root_level	1
log_root		0
log_root_transid	0
log_root_level		0
total_bytes		1500312748032
bytes_used		537228206080
sectorsize		4096
nodesize		16384
leafsize		16384
stripesize		4096
root_dir		6
num_devices		2
compat_flags		0x0
compat_ro_flags		0x0
incompat_flags		0x161
			( MIXED_BACKREF |
			  BIG_METADATA |
			  EXTENDED_IREF |
			  SKINNY_METADATA )
csum_type		0
csum_size		4
cache_generation	4924
uuid_tree_generation	4924
dev_item.uuid		94c62352-2568-4abe-8a58-828d1766719c
dev_item.fsid		197606b2-9f4a-4742-8824-7fc93285c29c [match]
dev_item.type		0
dev_item.total_bytes	750156374016
dev_item.bytes_used	541199433728
dev_item.io_align	4096
dev_item.io_width	4096
dev_item.sector_size	4096
dev_item.devid		1
dev_item.dev_group	0
dev_item.seek_speed	0
dev_item.bandwidth	0
dev_item.generation	0
sys_chunk_array[2048]:
	item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 715141414912)
		chunk length 33554432 owner 2 stripe_len 65536
		type SYSTEM|RAID1 num_stripes 2
			stripe 0 devid 2 offset 357557075968
			dev uuid: f98143e4-24a2-4a2a-8dbf-2871c75f7b78
			stripe 1 devid 1 offset 2185232384
			dev uuid: 94c62352-2568-4abe-8a58-828d1766719c
backup_roots[4]:
	backup 0:
		backup_tree_root:	714616012800	gen: 4921	level: 1
		backup_chunk_root:	715141414912	gen: 4918	level: 1
		backup_extent_root:	714190635008	gen: 4921	level: 2
		backup_fs_root:		714186096640	gen: 4921	level: 0
		backup_dev_root:	715082776576	gen: 4921	level: 1
		backup_csum_root:	714186326016	gen: 4921	level: 2
		backup_total_bytes:	1500312748032
		backup_bytes_used:	537228206080
		backup_num_devices:	2

	backup 1:
		backup_tree_root:	714186997760	gen: 4922	level: 1
		backup_chunk_root:	715141414912	gen: 4918	level: 1
		backup_extent_root:	714187014144	gen: 4922	level: 2
		backup_fs_root:		714186096640	gen: 4921	level: 0
		backup_dev_root:	715082776576	gen: 4921	level: 1
		backup_csum_root:	714187505664	gen: 4922	level: 2
		backup_total_bytes:	1500312748032
		backup_bytes_used:	537228206080
		backup_num_devices:	2

	backup 2:
		backup_tree_root:	714188554240	gen: 4923	level: 1
		backup_chunk_root:	715141414912	gen: 4918	level: 1
		backup_extent_root:	714188505088	gen: 4923	level: 2
		backup_fs_root:		714188488704	gen: 4923	level: 0
		backup_dev_root:	715082776576	gen: 4921	level: 1
		backup_csum_root:	714188668928	gen: 4923	level: 2
		backup_total_bytes:	1500312748032
		backup_bytes_used:	537228206080
		backup_num_devices:	2

	backup 3:
		backup_tree_root:	714189258752	gen: 4924	level: 1
		backup_chunk_root:	715141414912	gen: 4918	level: 1
		backup_extent_root:	714189324288	gen: 4924	level: 2
		backup_fs_root:		714188488704	gen: 4923	level: 0
		backup_dev_root:	715082776576	gen: 4921	level: 1
		backup_csum_root:	714189422592	gen: 4924	level: 2
		backup_total_bytes:	1500312748032
		backup_bytes_used:	537228206080
		backup_num_devices:	2


superblock: bytenr=67108864, device=/dev/sdc
---------------------------------------------------------
csum			0x33521316 [match]
bytenr			67108864
flags			0x1
			( WRITTEN )
magic			_BHRfS_M [match]
fsid			197606b2-9f4a-4742-8824-7fc93285c29c
label			verb
generation		4924
root			714189258752
sys_array_size		129
chunk_root_generation	4918
root_level		1
chunk_root		715141414912
chunk_root_level	1
log_root		0
log_root_transid	0
log_root_level		0
total_bytes		1500312748032
bytes_used		537228206080
sectorsize		4096
nodesize		16384
leafsize		16384
stripesize		4096
root_dir		6
num_devices		2
compat_flags		0x0
compat_ro_flags		0x0
incompat_flags		0x161
			( MIXED_BACKREF |
			  BIG_METADATA |
			  EXTENDED_IREF |
			  SKINNY_METADATA )
csum_type		0
csum_size		4
cache_generation	4924
uuid_tree_generation	4924
dev_item.uuid		94c62352-2568-4abe-8a58-828d1766719c
dev_item.fsid		197606b2-9f4a-4742-8824-7fc93285c29c [match]
dev_item.type		0
dev_item.total_bytes	750156374016
dev_item.bytes_used	541199433728
dev_item.io_align	4096
dev_item.io_width	4096
dev_item.sector_size	4096
dev_item.devid		1
dev_item.dev_group	0
dev_item.seek_speed	0
dev_item.bandwidth	0
dev_item.generation	0
sys_chunk_array[2048]:
	item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 715141414912)
		chunk length 33554432 owner 2 stripe_len 65536
		type SYSTEM|RAID1 num_stripes 2
			stripe 0 devid 2 offset 357557075968
			dev uuid: f98143e4-24a2-4a2a-8dbf-2871c75f7b78
			stripe 1 devid 1 offset 2185232384
			dev uuid: 94c62352-2568-4abe-8a58-828d1766719c
backup_roots[4]:
	backup 0:
		backup_tree_root:	714616012800	gen: 4921	level: 1
		backup_chunk_root:	715141414912	gen: 4918	level: 1
		backup_extent_root:	714190635008	gen: 4921	level: 2
		backup_fs_root:		714186096640	gen: 4921	level: 0
		backup_dev_root:	715082776576	gen: 4921	level: 1
		backup_csum_root:	714186326016	gen: 4921	level: 2
		backup_total_bytes:	1500312748032
		backup_bytes_used:	537228206080
		backup_num_devices:	2

	backup 1:
		backup_tree_root:	714186997760	gen: 4922	level: 1
		backup_chunk_root:	715141414912	gen: 4918	level: 1
		backup_extent_root:	714187014144	gen: 4922	level: 2
		backup_fs_root:		714186096640	gen: 4921	level: 0
		backup_dev_root:	715082776576	gen: 4921	level: 1
		backup_csum_root:	714187505664	gen: 4922	level: 2
		backup_total_bytes:	1500312748032
		backup_bytes_used:	537228206080
		backup_num_devices:	2

	backup 2:
		backup_tree_root:	714188554240	gen: 4923	level: 1
		backup_chunk_root:	715141414912	gen: 4918	level: 1
		backup_extent_root:	714188505088	gen: 4923	level: 2
		backup_fs_root:		714188488704	gen: 4923	level: 0
		backup_dev_root:	715082776576	gen: 4921	level: 1
		backup_csum_root:	714188668928	gen: 4923	level: 2
		backup_total_bytes:	1500312748032
		backup_bytes_used:	537228206080
		backup_num_devices:	2

	backup 3:
		backup_tree_root:	714189258752	gen: 4924	level: 1
		backup_chunk_root:	715141414912	gen: 4918	level: 1
		backup_extent_root:	714189324288	gen: 4924	level: 2
		backup_fs_root:		714188488704	gen: 4923	level: 0
		backup_dev_root:	715082776576	gen: 4921	level: 1
		backup_csum_root:	714189422592	gen: 4924	level: 2
		backup_total_bytes:	1500312748032
		backup_bytes_used:	537228206080
		backup_num_devices:	2


superblock: bytenr=274877906944, device=/dev/sdc
---------------------------------------------------------
csum			0xced54527 [match]
bytenr			274877906944
flags			0x1
			( WRITTEN )
magic			_BHRfS_M [match]
fsid			197606b2-9f4a-4742-8824-7fc93285c29c
label			verb
generation		4924
root			714189258752
sys_array_size		129
chunk_root_generation	4918
root_level		1
chunk_root		715141414912
chunk_root_level	1
log_root		0
log_root_transid	0
log_root_level		0
total_bytes		1500312748032
bytes_used		537228206080
sectorsize		4096
nodesize		16384
leafsize		16384
stripesize		4096
root_dir		6
num_devices		2
compat_flags		0x0
compat_ro_flags		0x0
incompat_flags		0x161
			( MIXED_BACKREF |
			  BIG_METADATA |
			  EXTENDED_IREF |
			  SKINNY_METADATA )
csum_type		0
csum_size		4
cache_generation	4924
uuid_tree_generation	4924
dev_item.uuid		94c62352-2568-4abe-8a58-828d1766719c
dev_item.fsid		197606b2-9f4a-4742-8824-7fc93285c29c [match]
dev_item.type		0
dev_item.total_bytes	750156374016
dev_item.bytes_used	541199433728
dev_item.io_align	4096
dev_item.io_width	4096
dev_item.sector_size	4096
dev_item.devid		1
dev_item.dev_group	0
dev_item.seek_speed	0
dev_item.bandwidth	0
dev_item.generation	0
sys_chunk_array[2048]:
	item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 715141414912)
		chunk length 33554432 owner 2 stripe_len 65536
		type SYSTEM|RAID1 num_stripes 2
			stripe 0 devid 2 offset 357557075968
			dev uuid: f98143e4-24a2-4a2a-8dbf-2871c75f7b78
			stripe 1 devid 1 offset 2185232384
			dev uuid: 94c62352-2568-4abe-8a58-828d1766719c
backup_roots[4]:
	backup 0:
		backup_tree_root:	714616012800	gen: 4921	level: 1
		backup_chunk_root:	715141414912	gen: 4918	level: 1
		backup_extent_root:	714190635008	gen: 4921	level: 2
		backup_fs_root:		714186096640	gen: 4921	level: 0
		backup_dev_root:	715082776576	gen: 4921	level: 1
		backup_csum_root:	714186326016	gen: 4921	level: 2
		backup_total_bytes:	1500312748032
		backup_bytes_used:	537228206080
		backup_num_devices:	2

	backup 1:
		backup_tree_root:	714186997760	gen: 4922	level: 1
		backup_chunk_root:	715141414912	gen: 4918	level: 1
		backup_extent_root:	714187014144	gen: 4922	level: 2
		backup_fs_root:		714186096640	gen: 4921	level: 0
		backup_dev_root:	715082776576	gen: 4921	level: 1
		backup_csum_root:	714187505664	gen: 4922	level: 2
		backup_total_bytes:	1500312748032
		backup_bytes_used:	537228206080
		backup_num_devices:	2

	backup 2:
		backup_tree_root:	714188554240	gen: 4923	level: 1
		backup_chunk_root:	715141414912	gen: 4918	level: 1
		backup_extent_root:	714188505088	gen: 4923	level: 2
		backup_fs_root:		714188488704	gen: 4923	level: 0
		backup_dev_root:	715082776576	gen: 4921	level: 1
		backup_csum_root:	714188668928	gen: 4923	level: 2
		backup_total_bytes:	1500312748032
		backup_bytes_used:	537228206080
		backup_num_devices:	2

	backup 3:
		backup_tree_root:	714189258752	gen: 4924	level: 1
		backup_chunk_root:	715141414912	gen: 4918	level: 1
		backup_extent_root:	714189324288	gen: 4924	level: 2
		backup_fs_root:		714188488704	gen: 4923	level: 0
		backup_dev_root:	715082776576	gen: 4921	level: 1
		backup_csum_root:	714189422592	gen: 4924	level: 2
		backup_total_bytes:	1500312748032
		backup_bytes_used:	537228206080
		backup_num_devices:	2


[-- Attachment #3: btrfsshowsuper_devid2.txt --]
[-- Type: text/plain, Size: 9703 bytes --]

[chris@f23m ~]$ sudo btrfs-show-super -af /dev/sdb
superblock: bytenr=65536, device=/dev/sdb
---------------------------------------------------------
csum			0x3364e6b8 [match]
bytenr			65536
flags			0x1
			( WRITTEN )
magic			_BHRfS_M [match]
fsid			197606b2-9f4a-4742-8824-7fc93285c29c
label			verb
generation		4923
root			714188554240
sys_array_size		129
chunk_root_generation	4918
root_level		1
chunk_root		715141414912
chunk_root_level	1
log_root		0
log_root_transid	0
log_root_level		0
total_bytes		1500312748032
bytes_used		537228206080
sectorsize		4096
nodesize		16384
leafsize		16384
stripesize		4096
root_dir		6
num_devices		2
compat_flags		0x0
compat_ro_flags		0x0
incompat_flags		0x161
			( MIXED_BACKREF |
			  BIG_METADATA |
			  EXTENDED_IREF |
			  SKINNY_METADATA )
csum_type		0
csum_size		4
cache_generation	4923
uuid_tree_generation	4923
dev_item.uuid		f98143e4-24a2-4a2a-8dbf-2871c75f7b78
dev_item.fsid		197606b2-9f4a-4742-8824-7fc93285c29c [match]
dev_item.type		0
dev_item.total_bytes	750156374016
dev_item.bytes_used	541199433728
dev_item.io_align	4096
dev_item.io_width	4096
dev_item.sector_size	4096
dev_item.devid		2
dev_item.dev_group	0
dev_item.seek_speed	0
dev_item.bandwidth	0
dev_item.generation	0
sys_chunk_array[2048]:
	item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 715141414912)
		chunk length 33554432 owner 2 stripe_len 65536
		type SYSTEM|RAID1 num_stripes 2
			stripe 0 devid 2 offset 357557075968
			dev uuid: f98143e4-24a2-4a2a-8dbf-2871c75f7b78
			stripe 1 devid 1 offset 2185232384
			dev uuid: 94c62352-2568-4abe-8a58-828d1766719c
backup_roots[4]:
	backup 0:
		backup_tree_root:	714616012800	gen: 4921	level: 1
		backup_chunk_root:	715141414912	gen: 4918	level: 1
		backup_extent_root:	714190635008	gen: 4921	level: 2
		backup_fs_root:		714186096640	gen: 4921	level: 0
		backup_dev_root:	715082776576	gen: 4921	level: 1
		backup_csum_root:	714186326016	gen: 4921	level: 2
		backup_total_bytes:	1500312748032
		backup_bytes_used:	537228206080
		backup_num_devices:	2

	backup 1:
		backup_tree_root:	714186997760	gen: 4922	level: 1
		backup_chunk_root:	715141414912	gen: 4918	level: 1
		backup_extent_root:	714187014144	gen: 4922	level: 2
		backup_fs_root:		714186096640	gen: 4921	level: 0
		backup_dev_root:	715082776576	gen: 4921	level: 1
		backup_csum_root:	714187505664	gen: 4922	level: 2
		backup_total_bytes:	1500312748032
		backup_bytes_used:	537228206080
		backup_num_devices:	2

	backup 2:
		backup_tree_root:	714188554240	gen: 4923	level: 1
		backup_chunk_root:	715141414912	gen: 4918	level: 1
		backup_extent_root:	714188505088	gen: 4923	level: 2
		backup_fs_root:		714188488704	gen: 4923	level: 0
		backup_dev_root:	715082776576	gen: 4921	level: 1
		backup_csum_root:	714188668928	gen: 4923	level: 2
		backup_total_bytes:	1500312748032
		backup_bytes_used:	537228206080
		backup_num_devices:	2

	backup 3:
		backup_tree_root:	809898442752	gen: 4920	level: 1
		backup_chunk_root:	715141414912	gen: 4918	level: 1
		backup_extent_root:	809898459136	gen: 4920	level: 2
		backup_fs_root:		810253713408	gen: 4805	level: 0
		backup_dev_root:	809896886272	gen: 4918	level: 1
		backup_csum_root:	809898557440	gen: 4920	level: 2
		backup_total_bytes:	1500312748032
		backup_bytes_used:	537228206080
		backup_num_devices:	2


superblock: bytenr=67108864, device=/dev/sdb
---------------------------------------------------------
csum			0x9305ce76 [match]
bytenr			67108864
flags			0x1
			( WRITTEN )
magic			_BHRfS_M [match]
fsid			197606b2-9f4a-4742-8824-7fc93285c29c
label			verb
generation		4923
root			714188554240
sys_array_size		129
chunk_root_generation	4918
root_level		1
chunk_root		715141414912
chunk_root_level	1
log_root		0
log_root_transid	0
log_root_level		0
total_bytes		1500312748032
bytes_used		537228206080
sectorsize		4096
nodesize		16384
leafsize		16384
stripesize		4096
root_dir		6
num_devices		2
compat_flags		0x0
compat_ro_flags		0x0
incompat_flags		0x161
			( MIXED_BACKREF |
			  BIG_METADATA |
			  EXTENDED_IREF |
			  SKINNY_METADATA )
csum_type		0
csum_size		4
cache_generation	4923
uuid_tree_generation	4923
dev_item.uuid		f98143e4-24a2-4a2a-8dbf-2871c75f7b78
dev_item.fsid		197606b2-9f4a-4742-8824-7fc93285c29c [match]
dev_item.type		0
dev_item.total_bytes	750156374016
dev_item.bytes_used	541199433728
dev_item.io_align	4096
dev_item.io_width	4096
dev_item.sector_size	4096
dev_item.devid		2
dev_item.dev_group	0
dev_item.seek_speed	0
dev_item.bandwidth	0
dev_item.generation	0
sys_chunk_array[2048]:
	item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 715141414912)
		chunk length 33554432 owner 2 stripe_len 65536
		type SYSTEM|RAID1 num_stripes 2
			stripe 0 devid 2 offset 357557075968
			dev uuid: f98143e4-24a2-4a2a-8dbf-2871c75f7b78
			stripe 1 devid 1 offset 2185232384
			dev uuid: 94c62352-2568-4abe-8a58-828d1766719c
backup_roots[4]:
	backup 0:
		backup_tree_root:	714616012800	gen: 4921	level: 1
		backup_chunk_root:	715141414912	gen: 4918	level: 1
		backup_extent_root:	714190635008	gen: 4921	level: 2
		backup_fs_root:		714186096640	gen: 4921	level: 0
		backup_dev_root:	715082776576	gen: 4921	level: 1
		backup_csum_root:	714186326016	gen: 4921	level: 2
		backup_total_bytes:	1500312748032
		backup_bytes_used:	537228206080
		backup_num_devices:	2

	backup 1:
		backup_tree_root:	714186997760	gen: 4922	level: 1
		backup_chunk_root:	715141414912	gen: 4918	level: 1
		backup_extent_root:	714187014144	gen: 4922	level: 2
		backup_fs_root:		714186096640	gen: 4921	level: 0
		backup_dev_root:	715082776576	gen: 4921	level: 1
		backup_csum_root:	714187505664	gen: 4922	level: 2
		backup_total_bytes:	1500312748032
		backup_bytes_used:	537228206080
		backup_num_devices:	2

	backup 2:
		backup_tree_root:	714188554240	gen: 4923	level: 1
		backup_chunk_root:	715141414912	gen: 4918	level: 1
		backup_extent_root:	714188505088	gen: 4923	level: 2
		backup_fs_root:		714188488704	gen: 4923	level: 0
		backup_dev_root:	715082776576	gen: 4921	level: 1
		backup_csum_root:	714188668928	gen: 4923	level: 2
		backup_total_bytes:	1500312748032
		backup_bytes_used:	537228206080
		backup_num_devices:	2

	backup 3:
		backup_tree_root:	809898442752	gen: 4920	level: 1
		backup_chunk_root:	715141414912	gen: 4918	level: 1
		backup_extent_root:	809898459136	gen: 4920	level: 2
		backup_fs_root:		810253713408	gen: 4805	level: 0
		backup_dev_root:	809896886272	gen: 4918	level: 1
		backup_csum_root:	809898557440	gen: 4920	level: 2
		backup_total_bytes:	1500312748032
		backup_bytes_used:	537228206080
		backup_num_devices:	2


superblock: bytenr=274877906944, device=/dev/sdb
---------------------------------------------------------
csum			0x6e829847 [match]
bytenr			274877906944
flags			0x1
			( WRITTEN )
magic			_BHRfS_M [match]
fsid			197606b2-9f4a-4742-8824-7fc93285c29c
label			verb
generation		4923
root			714188554240
sys_array_size		129
chunk_root_generation	4918
root_level		1
chunk_root		715141414912
chunk_root_level	1
log_root		0
log_root_transid	0
log_root_level		0
total_bytes		1500312748032
bytes_used		537228206080
sectorsize		4096
nodesize		16384
leafsize		16384
stripesize		4096
root_dir		6
num_devices		2
compat_flags		0x0
compat_ro_flags		0x0
incompat_flags		0x161
			( MIXED_BACKREF |
			  BIG_METADATA |
			  EXTENDED_IREF |
			  SKINNY_METADATA )
csum_type		0
csum_size		4
cache_generation	4923
uuid_tree_generation	4923
dev_item.uuid		f98143e4-24a2-4a2a-8dbf-2871c75f7b78
dev_item.fsid		197606b2-9f4a-4742-8824-7fc93285c29c [match]
dev_item.type		0
dev_item.total_bytes	750156374016
dev_item.bytes_used	541199433728
dev_item.io_align	4096
dev_item.io_width	4096
dev_item.sector_size	4096
dev_item.devid		2
dev_item.dev_group	0
dev_item.seek_speed	0
dev_item.bandwidth	0
dev_item.generation	0
sys_chunk_array[2048]:
	item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 715141414912)
		chunk length 33554432 owner 2 stripe_len 65536
		type SYSTEM|RAID1 num_stripes 2
			stripe 0 devid 2 offset 357557075968
			dev uuid: f98143e4-24a2-4a2a-8dbf-2871c75f7b78
			stripe 1 devid 1 offset 2185232384
			dev uuid: 94c62352-2568-4abe-8a58-828d1766719c
backup_roots[4]:
	backup 0:
		backup_tree_root:	714616012800	gen: 4921	level: 1
		backup_chunk_root:	715141414912	gen: 4918	level: 1
		backup_extent_root:	714190635008	gen: 4921	level: 2
		backup_fs_root:		714186096640	gen: 4921	level: 0
		backup_dev_root:	715082776576	gen: 4921	level: 1
		backup_csum_root:	714186326016	gen: 4921	level: 2
		backup_total_bytes:	1500312748032
		backup_bytes_used:	537228206080
		backup_num_devices:	2

	backup 1:
		backup_tree_root:	714186997760	gen: 4922	level: 1
		backup_chunk_root:	715141414912	gen: 4918	level: 1
		backup_extent_root:	714187014144	gen: 4922	level: 2
		backup_fs_root:		714186096640	gen: 4921	level: 0
		backup_dev_root:	715082776576	gen: 4921	level: 1
		backup_csum_root:	714187505664	gen: 4922	level: 2
		backup_total_bytes:	1500312748032
		backup_bytes_used:	537228206080
		backup_num_devices:	2

	backup 2:
		backup_tree_root:	714188554240	gen: 4923	level: 1
		backup_chunk_root:	715141414912	gen: 4918	level: 1
		backup_extent_root:	714188505088	gen: 4923	level: 2
		backup_fs_root:		714188488704	gen: 4923	level: 0
		backup_dev_root:	715082776576	gen: 4921	level: 1
		backup_csum_root:	714188668928	gen: 4923	level: 2
		backup_total_bytes:	1500312748032
		backup_bytes_used:	537228206080
		backup_num_devices:	2

	backup 3:
		backup_tree_root:	809898442752	gen: 4920	level: 1
		backup_chunk_root:	715141414912	gen: 4918	level: 1
		backup_extent_root:	809898459136	gen: 4920	level: 2
		backup_fs_root:		810253713408	gen: 4805	level: 0
		backup_dev_root:	809896886272	gen: 4918	level: 1
		backup_csum_root:	809898557440	gen: 4920	level: 2
		backup_total_bytes:	1500312748032
		backup_bytes_used:	537228206080
		backup_num_devices:	2



* Re: btrfs check inconsistency with raid1, part 1
  2015-12-21  2:12             ` Chris Murphy
@ 2015-12-21  2:23               ` Qu Wenruo
  2015-12-21  2:46                 ` Chris Murphy
  2015-12-22  1:05                 ` Kai Krakow
  0 siblings, 2 replies; 18+ messages in thread
From: Qu Wenruo @ 2015-12-21  2:23 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS



Chris Murphy wrote on 2015/12/20 19:12 -0700:
> On Sun, Dec 20, 2015 at 6:43 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>>
>>
>> Chris Murphy wrote on 2015/12/20 15:31 -0700:
>
>>> I think the cause is related to bus power with buggy USB 3 LPM
>>> firmware (these enclosures are cheap maybe $6). I've found some
>>> threads about this being a problem, but it's not expected to cause any
>>> corruptions. So, the fact Btrfs picks up one some problems might prove
>>> that (somewhat) incorrect.
>>
>>
>> Seems possible. Maybe some metadata just failed to reach disk.
>> BTW, did I asked for a btrfs-show-super output?
>
> Nope. I will attach to this email below for both devices.
>
>> If that's the case, superblock on device 2 maybe older than superblock on
>> device 1.
>
> Yes, looks iike devid 1 transid 4924, and devid 2 transid 4923. And
> it's devid 2 that had device reset and write errors when it vanished
> and reappeared as a different block device.
>

Now the whole problem is explained.

You should be good to mount it rw, as RAID1 will handle it all.
Then you can use scrub on dev2 to fix all the generation mismatches.

Although I would prefer to wipe dev2, mount dev1 as degraded, and
replace the missing dev2 with a good device/USB port.
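
If you go that route, it is roughly (device nodes and mount point are
only examples, and "2" is the devid of the stale device):

$ wipefs -a /dev/sdc                           # wipe the stale devid 2
$ mount -o degraded /dev/sdb /mnt/verb         # mount devid 1 alone, rw
$ btrfs replace start -B 2 /dev/sdc /mnt/verb  # rebuild onto the wiped (or a new) device
$ btrfs replace status /mnt/verb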

Thanks,
Qu




* Re: btrfs check inconsistency with raid1, part 1
  2015-12-21  2:23               ` Qu Wenruo
@ 2015-12-21  2:46                 ` Chris Murphy
  2015-12-22  1:05                 ` Kai Krakow
  1 sibling, 0 replies; 18+ messages in thread
From: Chris Murphy @ 2015-12-21  2:46 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Chris Murphy, Btrfs BTRFS

On Sun, Dec 20, 2015 at 7:23 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>
>
> Chris Murphy wrote on 2015/12/20 19:12 -0700:
>>
>> On Sun, Dec 20, 2015 at 6:43 PM, Qu Wenruo <quwenruo@cn.fujitsu.com>
>> wrote:
>>>
>>>
>>>
>>> Chris Murphy wrote on 2015/12/20 15:31 -0700:
>>
>>
>>>> I think the cause is related to bus power with buggy USB 3 LPM
>>>> firmware (these enclosures are cheap maybe $6). I've found some
>>>> threads about this being a problem, but it's not expected to cause any
>>>> corruptions. So, the fact Btrfs picks up one some problems might prove
>>>> that (somewhat) incorrect.
>>>
>>>
>>>
>>> Seems possible. Maybe some metadata just failed to reach disk.
>>> BTW, did I asked for a btrfs-show-super output?
>>
>>
>> Nope. I will attach to this email below for both devices.
>>
>>> If that's the case, superblock on device 2 maybe older than superblock on
>>> device 1.
>>
>>
>> Yes, looks iike devid 1 transid 4924, and devid 2 transid 4923. And
>> it's devid 2 that had device reset and write errors when it vanished
>> and reappeared as a different block device.
>>
>
> Now all the problem is explained.
>
> You should be good to mount it rw, as RAID1 will handle all the problem.
> Then you can either use scrub on dev2 to fix all the generation mismatch.
>
> Although I prefer to wipe dev2 and mount dev1 as degraded, and replace the
> missing dev2 with a good device/usb port.

Yeah.

The best info I have right now is that this particular make/model of
USB 3.0 enclosure is common and sometimes has this reset-and-vanish
problem, but only with certain controllers. In my case all four of the
same kind of enclosure do this, but only with 900mA ports. There's
never a problem with 1.5A ports. I think it's just a slightly
out-of-spec product. But usb-storage kernel developers said the
warnings shouldn't result in corruption. Another user with the same
enclosure reported the problem only happens on Linux, not Windows, on
the same host hardware. So it could also be some Linux SCSI layer
error handling that's not working around a pre-existing issue when the
device is flaky.

Thanks!


-- 
Chris Murphy


* Re: btrfs check inconsistency with raid1, part 1
  2015-12-21  2:23               ` Qu Wenruo
  2015-12-21  2:46                 ` Chris Murphy
@ 2015-12-22  1:05                 ` Kai Krakow
  2015-12-22  1:22                   ` Qu Wenruo
  1 sibling, 1 reply; 18+ messages in thread
From: Kai Krakow @ 2015-12-22  1:05 UTC (permalink / raw)
  To: linux-btrfs

On Mon, 21 Dec 2015 10:23:31 +0800,
Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:

> 
> 
> Chris Murphy wrote on 2015/12/20 19:12 -0700:
> > On Sun, Dec 20, 2015 at 6:43 PM, Qu Wenruo
> > <quwenruo@cn.fujitsu.com> wrote:
> >>
> >>
> >> Chris Murphy wrote on 2015/12/20 15:31 -0700:
> >
> >>> I think the cause is related to bus power with buggy USB 3 LPM
> >>> firmware (these enclosures are cheap maybe $6). I've found some
> >>> threads about this being a problem, but it's not expected to
> >>> cause any corruptions. So, the fact Btrfs picks up one some
> >>> problems might prove that (somewhat) incorrect.
> >>
> >>
> >> Seems possible. Maybe some metadata just failed to reach disk.
> >> BTW, did I asked for a btrfs-show-super output?
> >
> > Nope. I will attach to this email below for both devices.
> >
> >> If that's the case, superblock on device 2 maybe older than
> >> superblock on device 1.
> >
> > Yes, looks iike devid 1 transid 4924, and devid 2 transid 4923. And
> > it's devid 2 that had device reset and write errors when it vanished
> > and reappeared as a different block device.
> >
> 
> Now all the problem is explained.
> 
> You should be good to mount it rw, as RAID1 will handle all the
> problem.

How should RAID1 handle this if both copies have valid checksums (as I
would assume here unless shown otherwise)? This is an even bigger
problem with block-based RAID1, which has no checksums at all.
Luckily, btrfs works differently here.

> Then you can either use scrub on dev2 to fix all the
> generation mismatch.

I'd like to better understand why this could fix the problem...

> Although I prefer to wipe dev2 and mount dev1 as degraded, and
> replace the missing dev2 with a good device/usb port.

Given the assumption above I'd do that, too (but check that the
"original" has no block errors before discarding the mirror).


-- 
Regards,
Kai

Replies to list-only preferred.



* Re: btrfs check inconsistency with raid1, part 1
  2015-12-22  1:05                 ` Kai Krakow
@ 2015-12-22  1:22                   ` Qu Wenruo
  2015-12-22  1:48                     ` Kai Krakow
  0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2015-12-22  1:22 UTC (permalink / raw)
  To: Kai Krakow, linux-btrfs



Kai Krakow wrote on 2015/12/22 02:05 +0100:
> On Mon, 21 Dec 2015 10:23:31 +0800,
> Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>
>>
>>
>> Chris Murphy wrote on 2015/12/20 19:12 -0700:
>>> On Sun, Dec 20, 2015 at 6:43 PM, Qu Wenruo
>>> <quwenruo@cn.fujitsu.com> wrote:
>>>>
>>>>
>>>> Chris Murphy wrote on 2015/12/20 15:31 -0700:
>>>
>>>>> I think the cause is related to bus power with buggy USB 3 LPM
>>>>> firmware (these enclosures are cheap maybe $6). I've found some
>>>>> threads about this being a problem, but it's not expected to
>>>>> cause any corruptions. So, the fact Btrfs picks up one some
>>>>> problems might prove that (somewhat) incorrect.
>>>>
>>>>
>>>> Seems possible. Maybe some metadata just failed to reach disk.
>>>> BTW, did I asked for a btrfs-show-super output?
>>>
>>> Nope. I will attach to this email below for both devices.
>>>
>>>> If that's the case, superblock on device 2 maybe older than
>>>> superblock on device 1.
>>>
>>> Yes, looks iike devid 1 transid 4924, and devid 2 transid 4923. And
>>> it's devid 2 that had device reset and write errors when it vanished
>>> and reappeared as a different block device.
>>>
>>
>> Now all the problem is explained.
>>
>> You should be good to mount it rw, as RAID1 will handle all the
>> problem.
>
> How should RAID1 handle this if both copies have valid checksums (as I
> would assume here unless shown otherwise)? This is an even bigger
> problem with block based RAID1 which does not have checksums at all.
> Luckily, btrfs works different here.

No, these two devices don't have the same generation, which means they
point to *different* bytenrs.

Like the following:

Super of Dev1:
gen: X + 1
root bytenr: A (Btrfs logical)
Logical A is mapped to A1 on dev1 and A2 on dev2.

Super of Dev2:
gen: X
root bytenr: B
We don't need to bother with bytenr B here, though.

Due to the power bug, A2 and the newer super were never written to
dev2.

So you should see the problem now.
A1 on dev1 contains a *valid* tree block, but A2 on dev2 doesn't
(empty data only).

And your assumption that "both have valid copies" is wrong.

Check all 4 attachments in the previous mail.
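
You can see it directly in the dumps; for example (filenames as in the
attachments):

$ hexdump -C dev1_714189357056.img | head
$ hexdump -C dev2_714189357056.img | head

One of the two should show a normal tree block header; the copy that
was never written is nothing but zeros (hexdump collapses it into a
single '*' line).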

>
>> Then you can either use scrub on dev2 to fix all the
>> generation mismatch.
>
> I better understand why this could fix a problem...

Why not?

The tree block/data copy on dev1 is valid, but the copy on dev2 is
empty (never written), so btrfs detects the csum error, and scrub will
rewrite it from the good copy.

After the rewrite, the copies on dev1 and dev2 will match, and the
problem is fixed.
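
(A minimal command sketch, assuming the filesystem is mounted read-write
with both members present; the mount point is a placeholder:)

  btrfs scrub start -Bd /mnt/pool   # -B waits for completion, -d prints per-device stats
  btrfs scrub status /mnt/pool      # shows how many errors were found and corrected
  btrfs device stats /mnt/pool      # cumulative per-device error counters (reset with -z if desired)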

Thanks,
Qu

>
>> Although I prefer to wipe dev2 and mount dev1 as degraded, and
>> replace the missing dev2 with a good device/usb port.
>
> Given the assumption above I'd do that, too (but check if the
> "original" has no block errors before discarding the mirror).
>





* Re: btrfs check inconsistency with raid1, part 1
  2015-12-22  1:22                   ` Qu Wenruo
@ 2015-12-22  1:48                     ` Kai Krakow
  2015-12-22  2:15                       ` Qu Wenruo
  2015-12-22 10:23                       ` Duncan
  0 siblings, 2 replies; 18+ messages in thread
From: Kai Krakow @ 2015-12-22  1:48 UTC (permalink / raw)
  To: linux-btrfs

Am Tue, 22 Dec 2015 09:22:20 +0800
schrieb Qu Wenruo <quwenruo@cn.fujitsu.com>:

> 
> 
> Kai Krakow wrote on 2015/12/22 02:05 +0100:
> > Am Mon, 21 Dec 2015 10:23:31 +0800
> > schrieb Qu Wenruo <quwenruo@cn.fujitsu.com>:
> >
> >>
> >>
> >> Chris Murphy wrote on 2015/12/20 19:12 -0700:
> >>> On Sun, Dec 20, 2015 at 6:43 PM, Qu Wenruo
> >>> <quwenruo@cn.fujitsu.com> wrote:
> >>>>
> >>>>
> >>>> Chris Murphy wrote on 2015/12/20 15:31 -0700:
> >>>
> >>>>> I think the cause is related to bus power with buggy USB 3 LPM
> >>>>> firmware (these enclosures are cheap maybe $6). I've found some
> >>>>> threads about this being a problem, but it's not expected to
> >>>>> cause any corruptions. So, the fact Btrfs picks up on some
> >>>>> problems might prove that (somewhat) incorrect.
> >>>>
> >>>>
> >>>> Seems possible. Maybe some metadata just failed to reach disk.
> >>>> BTW, did I ask for a btrfs-show-super output?
> >>>
> >>> Nope. I will attach to this email below for both devices.
> >>>
> >>>> If that's the case, superblock on device 2 may be older than
> >>>> superblock on device 1.
> >>>
> >>> Yes, looks like devid 1 transid 4924, and devid 2 transid 4923.
> >>> And it's devid 2 that had device reset and write errors when it
> >>> vanished and reappeared as a different block device.
> >>>
> >>
> >> Now all the problem is explained.
> >>
> >> You should be good to mount it rw, as RAID1 will handle all the
> >> problem.
> >
> > How should RAID1 handle this if both copies have valid checksums
> > (as I would assume here unless shown otherwise)? This is an even
> > bigger problem with block based RAID1 which does not have checksums
> > at all. Luckily, btrfs works different here.
> 
> No, these two devices don't have the same generation, which means
> they point to *different* bytenr.
> 
> Like the following:
> 
> Super of Dev1:
> gen: X + 1
> root bytenr: A (Btrfs logical)
> logical A is mapped to A1 on dev1 and A2 on dev2.
> 
> Super of Dev2:
> gen: X
> root bytenr: B
> Here we don't need to bother bytenr B though.
> 
> Due to the power bug, A2 and super of dev2 is not written to dev2.
> 
> So you should see the problem now.
> A1 on dev1 contains *valid* tree block, but A2 on dev2 doesn't(empty 
> data only).
> 
> And your assumption on "both have valid copies" is wrong.
> 
> Check all the 4 attachment in previous mail.

I only saw those attachments at a second glance. Sorry.

Primarily I just wanted to note that RAID1 per se doesn't mean anything
more than: we have two readable copies but we don't know which one is
correct. As in: let the admin think twice about it before blindly
following a guide.

This is why I pointed out btrfs csums, which make this a little better
and which in turn have further consequences for the tree block, as you
describe.

In contrast to block-level RAID, btrfs usually knows which block is
correct and which is not.

I just wondered if btrfs allows for the case where both stripes could
have valid checksums despite btrfs-RAID - just because a failure
occurred right on the spot.

Is this possible? What happens then? If yes, it would mean not to
blindly trust the RAID without doing your homework.

> >> Then you can either use scrub on dev2 to fix all the
> >> generation mismatch.
> >
> > I better understand why this could fix a problem...
> 
> Why not?
> 
> Tree block/data copy on dev1 is valid, but tree block/data copy on
> dev2 is empty(not written), so btrfs detects the csum error, and
> scrub will try to rewrite it.
> 
> After rewrite, both copy on dev1 and dev2 with match and fix the
> problem.

Exactly. ;-) Didn't say anything against it.


-- 
Regards,
Kai

Replies to list-only preferred.



* Re: btrfs check inconsistency with raid1, part 1
  2015-12-22  1:48                     ` Kai Krakow
@ 2015-12-22  2:15                       ` Qu Wenruo
  2015-12-22  4:21                         ` Chris Murphy
  2015-12-22 10:23                       ` Duncan
  1 sibling, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2015-12-22  2:15 UTC (permalink / raw)
  To: Kai Krakow, linux-btrfs



Kai Krakow wrote on 2015/12/22 02:48 +0100:
> Am Tue, 22 Dec 2015 09:22:20 +0800
> schrieb Qu Wenruo <quwenruo@cn.fujitsu.com>:
>
>>
>>
>> Kai Krakow wrote on 2015/12/22 02:05 +0100:
>>> Am Mon, 21 Dec 2015 10:23:31 +0800
>>> schrieb Qu Wenruo <quwenruo@cn.fujitsu.com>:
>>>
>>>>
>>>>
>>>> Chris Murphy wrote on 2015/12/20 19:12 -0700:
>>>>> On Sun, Dec 20, 2015 at 6:43 PM, Qu Wenruo
>>>>> <quwenruo@cn.fujitsu.com> wrote:
>>>>>>
>>>>>>
>>>>>> Chris Murphy wrote on 2015/12/20 15:31 -0700:
>>>>>
>>>>>>> I think the cause is related to bus power with buggy USB 3 LPM
>>>>>>> firmware (these enclosures are cheap maybe $6). I've found some
>>>>>>> threads about this being a problem, but it's not expected to
>>>>>>> cause any corruptions. So, the fact Btrfs picks up on some
>>>>>>> problems might prove that (somewhat) incorrect.
>>>>>>
>>>>>>
>>>>>> Seems possible. Maybe some metadata just failed to reach disk.
>>>>>> BTW, did I ask for a btrfs-show-super output?
>>>>>
>>>>> Nope. I will attach to this email below for both devices.
>>>>>
>>>>>> If that's the case, superblock on device 2 may be older than
>>>>>> superblock on device 1.
>>>>>
>>>>> Yes, looks like devid 1 transid 4924, and devid 2 transid 4923.
>>>>> And it's devid 2 that had device reset and write errors when it
>>>>> vanished and reappeared as a different block device.
>>>>>
>>>>
>>>> Now all the problem is explained.
>>>>
>>>> You should be good to mount it rw, as RAID1 will handle all the
>>>> problem.
>>>
>>> How should RAID1 handle this if both copies have valid checksums
>>> (as I would assume here unless shown otherwise)? This is an even
>>> bigger problem with block based RAID1 which does not have checksums
>>> at all. Luckily, btrfs works different here.
>>
>> No, these two devices don't have the same generation, which means
>> they point to *different* bytenr.
>>
>> Like the following:
>>
>> Super of Dev1:
>> gen: X + 1
>> root bytenr: A (Btrfs logical)
>> logical A is mapped to A1 on dev1 and A2 on dev2.
>>
>> Super of Dev2:
>> gen: X
>> root bytenr: B
>> Here we don't need to bother bytenr B though.
>>
>> Due to the power bug, A2 and super of dev2 is not written to dev2.
>>
>> So you should see the problem now.
>> A1 on dev1 contains *valid* tree block, but A2 on dev2 doesn't(empty
>> data only).
>>
>> And your assumption on "both have valid copies" is wrong.
>>
>> Check all the 4 attachment in previous mail.
>
> I did only see those attachments at a second glance. Sry.
>
> Primarily I just wanted to note that RAID1 per-se doesn't mean anything
> more than: we have two readable copies but we don't know which one is
> correct. As in: let the admin think twice about it before blindly
> following a guide.
>
> This is why I pointed out btrfs csums which make this a little better
> which in turn has further consequences as you describe (for the
> treeblock).
>
> In contrast to block-level RAID btrfs usually has the knowledge which
> block is correct and which is not.
>
> I just wondered if btrfs allows for the case where both stripes could
> have valid checksums despite of btrfs-RAID - just because a failure
> occurred right on the spot.
>
> Is this possible? What happens then? If yes, it would mean not to
> blindly trust the RAID without doing the homeworks.

Very interesting question.
Btrfs goes a little beyond what you'd expect from block-based RAID1,
though.

1) Yes, it is possible.

2) Btrfs still detects it as a transid error and won't trust the
    metadata (kernel behavior).
    And since it's raid1, it will try the next copy and go on.

    The trick here is that btrfs metadata records not only the bytenr of
    each child tree block, but also the transid (generation) of that tree
    block.

    So even if such a case happens, the transid won't match, which causes
    btrfs to detect the error.
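
(To see this on disk -- illustrative only, the bytenr and device node
below are made up; take a real tree block address from btrfs-show-super
or an earlier dump first:)

  # dump a single tree block; its header includes the generation it was written in
  btrfs-debug-tree -b 123456789 /dev/sdb
  # the parent node (or the superblock, for the root) stores the expected
  # generation next to the block pointer, so a stale or unwritten copy
  # fails this cross-check on read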

Thanks,
Qu
>
>>>> Then you can either use scrub on dev2 to fix all the
>>>> generation mismatch.
>>>
>>> I better understand why this could fix a problem...
>>
>> Why not?
>>
>> Tree block/data copy on dev1 is valid, but tree block/data copy on
>> dev2 is empty(not written), so btrfs detects the csum error, and
>> scrub will try to rewrite it.
>>
>> After rewrite, both copy on dev1 and dev2 with match and fix the
>> problem.
>
> Exactly. ;-) Didn't say anything against it.
>
>




* Re: btrfs check inconsistency with raid1, part 1
  2015-12-22  2:15                       ` Qu Wenruo
@ 2015-12-22  4:21                         ` Chris Murphy
  0 siblings, 0 replies; 18+ messages in thread
From: Chris Murphy @ 2015-12-22  4:21 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Kai Krakow, Btrfs BTRFS

Latest update.

4.4.0-0.rc6.git0.1.fc24.x86_64
btrfs-progs v4.3.1

Mounted the volume normally with both devices available, no mount
options, so it is a rw mount. And it mounts with only the normal
kernel messages:
[ 9458.290778] BTRFS info (device sdc): disk space caching is enabled
[ 9458.290788] BTRFS: has skinny extents

I left the volume alone for 20 minutes. After that time,
btrfs-show-super still shows different generation numbers for the two
devids.

I did an ls -l at the top level of the fs. And btrfs-show-super now
shows the same generation numbers and backup_roots information for
both devids.

Next, I read the most recently modified files; they all read OK, no
kernel messages, no missing files.

Last, I umounted the volume and did a btrfs check, and it comes up
completely clean, no errors.
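
(Roughly the sequence described above, with placeholder device and mount
paths rather than the ones actually used:)

  mount /dev/sdb /mnt/pool                      # both members present, default rw mount
  btrfs-show-super /dev/sdb | grep generation   # generations still differed right after mounting
  ls -l /mnt/pool                               # after this, both supers showed the same generation
  umount /mnt/pool
  btrfs check /dev/sdb                          # with both devices present: reported clean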

No scrub done yet, no (user space) writes done yet. But going back to
the original btrfs check with all the errors, it really doesn't give a
user/admin of the volume any useful information about what the problem is.
After the fact, it's relatively clear that devid 1 has generation 4924,
and devid 2 has generation 4923, and that's what the btrfs check
complaints are about: just a generation mismatch and the associated
missing metadata on one device.

By all measures it's checking out and behaving completely healthy and
OK. So I'm going to play with some fire, and treat it normally for a
few days: including making snapshots and writing files. I'll do a
scrub in a few days and report back.



Chris Murphy


* Re: btrfs check inconsistency with raid1, part 1
  2015-12-22  1:48                     ` Kai Krakow
  2015-12-22  2:15                       ` Qu Wenruo
@ 2015-12-22 10:23                       ` Duncan
  2015-12-22 15:44                         ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 18+ messages in thread
From: Duncan @ 2015-12-22 10:23 UTC (permalink / raw)
  To: linux-btrfs

Kai Krakow posted on Tue, 22 Dec 2015 02:48:04 +0100 as excerpted:

> I just wondered if btrfs allows for the case where both stripes could
> have valid checksums despite of btrfs-RAID - just because a failure
> occurred right on the spot.
> 
> Is this possible? What happens then? If yes, it would mean not to
> blindly trust the RAID without doing the homeworks.

The one case where btrfs could get things wrong that I know of is as I 
discovered in my initial pre-btrfs-raid1-deployment testing...

1) Create a two-device btrfs raid1 (data and metadata) and put some
data on it, including a test file with some content to be modified later.
Sync and unmount normally.

2) Remove one of the two devices.

3) Mount the remaining device degraded-writable (it shouldn't allow 
mounting without degraded) and modify that test file.  Sync and unmount.

4) Switch devices and repeat, modifying that test file in some other 
incompatible way.  Sync and unmount.

Up to this point, everything should be fine, except that you now have two 
incompatible versions of the test file, potentially with the same 
separate-but-equal generation numbers after the separate degraded-
writable mount, modify, unmount cycles.

5) Plug both devices in and mount normally.  Unless this has changed 
since my tests, btrfs will neither complain in dmesg nor otherwise 
provide any hint that anything is wrong.  If you read the file, it'll 
give you one of the versions, still not complaining or providing any hint 
that something's wrong.  Again unmount, without writing anything to the 
test file this time.

6) Try separately mounting each device individually again (without the 
other one available so degraded, can be writable or read-only this time) 
and check the file.  Each incompatible copy should remain in place on its 
respective device.  Reading the one copy (randomly chosen or more 
precisely, chosen based on PID even/odd, as that's what the btrfs raid1 
read-scheduler uses to decide which copy to read) didn't change the other 
one -- btrfs remained oblivious to the incompatible versions.  Again 
unmount.

7) Plug both devices in and mount the combined filesystem writable once 
again.  Scrub.

Back when I did my testing, I stopped at step 6 as I didn't understand 
that scrub was what I should use to resolve the problem.  However, based 
on quite a bit of later experience due to keeping a failing device (more 
and more sectors replaced with spares, turns out at least the SSD I was 
working with had way more spares than I would have expected, and even 
after several months when I finally gave up and replaced it, I was only 
down to about 85% of spares left, 15% used) around in raid1 mode for 
a while, this should *NORMALLY* not be a problem.  As long as the 
generations differ, btrfs scrub can sort things out and catch up the 
"behind" device, resolving all differences to the latest generation copy.

8) But if both generations happen to be the same, because both were
mounted separately and written so they diverged, yet ended up at the
same generation when recombined...

From all I know and from everything others told me when I asked at the 
time, which copy you get then is entirely unpredictable, and worse yet, 
you might get btrfs acting on divergent metadata when writing to the 
other device.


The caution, therefore, is to do your best not to ever let the two copies 
be both mounted degraded-writable, separately.  If only one copy is 
written to, then its generation will be higher than the other one, and 
scrub should have no problem resolving things.  Even if both copies are 
separately written to incompatibly, in most real-world cases one's going 
to have more generations written than the other and scrub should reliably 
and predictably resolve differences in favor of that one.  The problem 
only appears if they actually happen to have the same generation number, 
relatively unlikely except under controlled test conditions, but that has 
the potential to be a *BIG* problem should it actually occur.

So if for some reason you MUST mount both copies degraded-writable 
separately, the following are your options:

a) don't ever recombine them, doing a device replace missing with a third 
device instead (or a convert to single/dup); use one of the options below 
if you do need to recombine, or...

b) manually verify (using btrfs-show-super or the like) that the supers 
on each don't have the same generation before attempting a recombine, 
or...

c) wipe the one device and treat it as a new device add (a rough command 
sketch follows after this list), so btrfs can't get mixed up with 
differing versions at the same generation number, or...

d) simply take your chances and hope that the generation numbers don't 
match.

(D should in practice be "good enough" if one was only mounted writable a 
very short time, while the other was written to over a rather longer 
period, such that it almost certainly had far more intervening commits 
and thus generations than the other.)
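
(A rough sketch of option c, assuming devid 2 on /dev/sdc is the stale
member and /dev/sdb is the one to keep; device names, devid and mount
point are placeholders, so verify them with btrfs filesystem show and
btrfs-show-super before running anything:)

  wipefs -a /dev/sdc                            # forget the stale member entirely
  mount -o degraded /dev/sdb /mnt/pool          # mount the surviving member alone
  btrfs replace start -B 2 /dev/sdc /mnt/pool   # rebuild missing devid 2 onto the wiped device
  # or: btrfs device add /dev/sdc /mnt/pool && btrfs device delete missing /mnt/pool
  # if anything was written while degraded, follow up with a balance using
  # -dconvert=raid1,soft -mconvert=raid1,soft to convert any single chunks back to raid1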

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: btrfs check inconsistency with raid1, part 1
  2015-12-22 10:23                       ` Duncan
@ 2015-12-22 15:44                         ` Austin S. Hemmelgarn
  2015-12-29 21:33                           ` Chris Murphy
  0 siblings, 1 reply; 18+ messages in thread
From: Austin S. Hemmelgarn @ 2015-12-22 15:44 UTC (permalink / raw)
  To: Duncan, linux-btrfs

On 2015-12-22 05:23, Duncan wrote:
> Kai Krakow posted on Tue, 22 Dec 2015 02:48:04 +0100 as excerpted:
>
>> I just wondered if btrfs allows for the case where both stripes could
>> have valid checksums despite of btrfs-RAID - just because a failure
>> occurred right on the spot.
>>
>> Is this possible? What happens then? If yes, it would mean not to
>> blindly trust the RAID without doing the homeworks.
>
> The one case where btrfs could get things wrong that I know of is as I
> discovered in my initial pre-btrfs-raid1-deployment testing...
I've had exactly one case where I got _really_ unlucky and had a bunch 
of media errors on a BTRFS raid1 setup that happened to result in 
something similar to this.  Things happened such that one copy of a 
block (we'll call this one copy 1) had correct data, and the other 
(we'll call this one copy 2) had incorrect data, except that one copy of 
the metadata had the correct checksum for copy 2, and the other metadata 
copy had a correct checksum for copy 1, but, due to a hash collision, 
the checksum for the metadata block was correct for both copies.  As a 
result of this, I ended up getting a read error about 25% of the time 
(which then forced a re-read of the data), the correct data about 37.5% 
of the time, and incorrect data the remaining 37.5% of the time.  I 
actually ran the numbers on how likely this was to happen (more than a 
dozen errors on different disks in blocks that happened to reference 
each other, and a hash collision involving a 4 byte difference between 
two 16k blocks of data), and it's a statistical impossibility (it's more 
likely that one of Amazon's or Google's data centers goes offline due to 
hardware failure than it is that this will happen again).  Obviously it 
did happen, but I would say it's such an unrealistic edge case that you 
probably don't need to worry about it (although I learned _a lot_ about 
the internals of BTRFS in trying to figure out what was going on).
>
[...snip...]
>
>  From all I know and from everything others told me when I asked at the
> time, which copy you get then is entirely unpredictable, and worse yet,
> you might get btrfs acting on divergent metadata when writing to the
> other device.
>
This is indeed the case.  Because of how BTRFS verifies checksums, 
there's a roughly 50% chance that the first read attempt will result in 
picking a mismatched checksum and data block, which will trigger a 
re-read that has an independent 50% chance of again picking a mismatch, 
resulting in a 25% chance that any read that actually goes to the device 
returns a read error.  The remaining 75% of the time, you'll get either 
one block or the other.  These numbers of course get skewed by the VFS 
cache.  In my case above, the file that was affected was one that is 
almost never in cache when it gets accessed, so I saw numbers relatively 
close to what you would get without the cache.
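
(Just to spell out the arithmetic behind those percentages -- nothing
btrfs-specific, a shell one-liner:)

  awk 'BEGIN { miss = 0.5 * 0.5;   # both independent 50% picks hit a mismatched pair
       printf "read error: %.0f%%, some data returned: %.0f%%\n", miss*100, (1-miss)*100 }'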


* Re: btrfs check inconsistency with raid1, part 1
  2015-12-22 15:44                         ` Austin S. Hemmelgarn
@ 2015-12-29 21:33                           ` Chris Murphy
  0 siblings, 0 replies; 18+ messages in thread
From: Chris Murphy @ 2015-12-29 21:33 UTC (permalink / raw)
  Cc: Btrfs BTRFS

Latest update on this thread. btrfs check (4.3.1) reports no problems.
Volume mounts with kernel 4.2.8 with no errors. And I just did a scrub
and there were no errors, not even any fix up messages. And dev stats
are all zero.

So... it appears it was a minor enough problem, and still consistent
enough, that it fixed itself. Granted, there was no writing occurring
at the time, just heavy reading, or perhaps this would be a different
story.


Chris Murphy

