* BTRFS Data at Rest File Corruption
@ 2016-05-11 18:36 Richard Lochner
  2016-05-11 19:01 ` Roman Mamedov
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Richard Lochner @ 2016-05-11 18:36 UTC (permalink / raw)
  To: linux-btrfs

Hello,

I have encountered a data corruption error with BTRFS which may or may
not be of interest to your developers.

The problem is that an unmodified file on a RAID-1 volume that had
been scrubbed successfully is now corrupt.  The details follow.

The volume was formatted as btrfs with raid1 data and raid1 metadata
on two new 4TB hard drives (WD Data Center Re WD4000FYYZ).

A large binary file was copied to the volume (~76 GB) on December 27,
2015.  Soon after copying the file, a btrfs scrub was run. There were
no errors.  Multiple scrubs have also been run over the past several
months.

Recently, a scrub returned an unrecoverable error on that file.
Again, the file has not been modified since it was originally copied
and has the time stamp from December.  Furthermore, SMART tests (long)
for both drives do not indicate any errors (Current_Pending_Sector or
otherwise).

I should note that the system does not have ECC memory.

It would be interesting to me to know if:

a) The primary and secondary data blocks match (I suspect they do), and
b) The primary and secondary checksums for the block match (I suspect
they do as well).

Unfortunately, I do not have the skills to do such a verification.

If you have any thoughts or suggestions, I would be most interested.
I was hoping that I could trust the integrity of "data at rest" in a
RAID-1 setting under BTRFS, but this appears not to be the case.

Thank you,

R. Lochner

#uname -a
Linux vmh001.clone1.com 4.4.6-300.fc23.x86_64 #1 SMP Wed Mar 16
22:10:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

# btrfs --version
btrfs-progs v4.4.1

# btrfs fi show
Label: 'raid_pool'  uuid: d397ff55-e5c8-4d31-966e-d65694997451
    Total devices 2 FS bytes used 2.32TiB
    devid    1 size 3.00TiB used 2.32TiB path /dev/sdb1
    devid    2 size 3.00TiB used 2.32TiB path /dev/sdc1

# btrfs fi df /mnt
Data, RAID1: total=2.32TiB, used=2.31TiB
System, RAID1: total=40.00MiB, used=384.00KiB
Metadata, RAID1: total=7.00GiB, used=5.42GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Dmesg:

[2027323.705035] BTRFS warning (device sdc1): checksum error at
logical 3037444042752 on dev /dev/sdc1, sector 4988750584, root 259,
inode 1437377, offset 75754369024, length 4096, links 1 (path:
Rick/sda4.img)
[2027323.705056] BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 0,
rd 13, flush 0, corrupt 3, gen 0
[2027323.718869] BTRFS error (device sdc1): unable to fixup (regular)
error at logical 3037444042752 on dev /dev/sdc1

ls:

#ls -l /mnt/backup/Rick/sda4.img
-rw-r--r--. 1 root root 75959197696 Dec 27 10:36 /mnt/backup/Rick/sda4.img

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BTRFS Data at Rest File Corruption
  2016-05-11 18:36 BTRFS Data at Rest File Corruption Richard Lochner
@ 2016-05-11 19:01 ` Roman Mamedov
  2016-05-11 19:26 ` Austin S. Hemmelgarn
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Roman Mamedov @ 2016-05-11 19:01 UTC (permalink / raw)
  To: Richard Lochner; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1177 bytes --]

On Wed, 11 May 2016 13:36:23 -0500
Richard Lochner <lochner@clone1.com> wrote:

> Recently, a scrub returned an unrecoverable error on that file.
> Again, the file has not been modified since it was originally copied
> and has the time stamp from December.  Furthermore, SMART tests (long)
> for both drives do not indicate any errors (Current_Pending_Sector or
> otherwise).
> 
> I should note that the system does not have ECC memory.

> [2027323.705035] BTRFS warning (device sdc1): checksum error at
> logical 3037444042752 on dev /dev/sdc1, sector 4988750584, root 259,
> inode 1437377, offset 75754369024, length 4096, links 1 (path:
> Rick/sda4.img)
> [2027323.705056] BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 0,
> rd 13, flush 0, corrupt 3, gen 0
> [2027323.718869] BTRFS error (device sdc1): unable to fixup (regular)
> error at logical 3037444042752 on dev /dev/sdc1

I wonder, did you try rebooting the system after getting this? And if you get
the same error also after a reboot, check if the sector/offset numbers are the
same. That way you could at least rule out any kind of transient (RAM?) errors.

-- 
With respect,
Roman

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BTRFS Data at Rest File Corruption
  2016-05-11 18:36 BTRFS Data at Rest File Corruption Richard Lochner
  2016-05-11 19:01 ` Roman Mamedov
@ 2016-05-11 19:26 ` Austin S. Hemmelgarn
  2016-05-12 17:49   ` Richard A. Lochner
  2016-05-13 16:28   ` Goffredo Baroncelli
  2016-05-12  6:49 ` Chris Murphy
       [not found] ` <CAAuLxcaQ1Uo+pff9AtD74UwUvo5yYKBuNLwKzjVMWV1kt2DcRQ@mail.gmail.com>
  3 siblings, 2 replies; 22+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-11 19:26 UTC (permalink / raw)
  To: Richard Lochner, Btrfs BTRFS

On 2016-05-11 14:36, Richard Lochner wrote:
> Hello,
>
> I have encountered a data corruption error with BTRFS which may or may
> not be of interest to your developers.
>
> The problem is that an unmodified file on a RAID-1 volume that had
> been scrubbed successfully is now corrupt.  The details follow.
>
> The volume was formatted as btrfs with raid1 data and raid1 metadata
> on two new 4T hard drives (WD Data Center Re WD4000FYYZ) .
>
> A large binary file was copied to the volume (~76 GB) on December 27,
> 2015.  Soon after copying the file, a btrfs scrub was run. There were
> no errors.  Multiple scrubs have also been run over the past several
> months.
>
> Recently, a scrub returned an unrecoverable error on that file.
> Again, the file has not been modified since it was originally copied
> and has the time stamp from December.  Furthermore, SMART tests (long)
> for both drives do not indicate any errors (Current_Pending_Sector or
> otherwise).
>
> I should note that the system does not have ECC memory.
>
> It would be interesting to me to know if:
>
> a) The primary and secondary data blocks match (I suspect they do), and
> b) The primary and secondary checksums for the block match (I suspect
> they do as well)
Do you mean whether they're both incorrect?  Because the only case in 
which scrub should return an uncorrectable error is when neither block 
appears correct.

In general, based on what you've said, there are four possibilities:
1. Both of your disks happened to have an undetectable error at 
equivalent locations.  While not likely, this is still possible.  It's 
important to note that while hard disks have internal ECC, ECC doesn't 
inherently catch everything, so it's fully possible (although really 
rare) to have a sector go bad and the disk not notice.
2. Some other part of your hardware has issues.  What I would check, in 
order are:
	1. Internal cables (you would probably be surprised how many times I've 
seen people have disk issues that were really caused by a bad data cable)
	2. RAM
	3. PSU (if you don't have a spare and don't have a multimeter or power 
supply tester, move this one to the bottom of the list)
	4. CPU
	5. Storage controller
	6. Motherboard
    If you want advice on testing anything, let me know.
3. It's caused by a transient error, and may or may not be fixable. 
Computers have internal EMI shielding (or have metal cases) for a 
reason, but this still doesn't protect from everything (cosmic 
background radiation exists even in shielded enclosures).
4. You've found a bug in BTRFS or the kernel itself.  I seriously doubt 
this, as your setup appears to be pretty much as trivial as possible 
for a BTRFS raid1 filesystem, and you don't appear to be doing anything 
other than storing data (in fact, if you actually found a bug in BTRFS 
in such well tested code under such a trivial use case, you deserve a 
commendation).

The first thing I would do is make sure that the scrub fails 
consistently.  I've had cases on systems which had been on for multiple 
months where a scrub failed, I rebooted, and then the scrub succeeded. 
If you still get the error after a reboot, check whether everything other 
than the error counts is the same; if it isn't, then it's probably an 
issue with your hardware (although probably not the disk).
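
For example, a minimal sketch of that check (assuming btrfs-progs v4.4 
command syntax):

# btrfs scrub start -Bd /mnt   # -B: run in the foreground, -d: per-device statistics
# btrfs dev stats /mnt         # persistent per-device counters (write/read/flush/corruption/generation)

Comparing the 'btrfs dev stats' output from before and after the reboot 
makes it easy to see whether anything other than the corruption counter 
moved.
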
>
> Unfortunately, I do not have the skills to do such a verification.
>
> If you have any thoughts or suggestions, I would be most interested.
> I was hoping that I could trust the integrity of "data at rest" in a
> RAID-1 setting under BTRFS, but this appears not to be the case.
It probably isn't BTRFS.  This is one of the most tested code paths in 
BTRFS (the only ones more tested are single device), and you don't 
appear to be using anything else between BTRFS and the disks, so there's 
not much that can go wrong.  Keep in mind that unlike other filesystems 
on top of hardware or software RAID, BTRFS actually notices that things 
are wrong and has some idea which things are wrong (although it can't 
tell the difference between a corrupted checksum and a corrupted block 
of data).
>
> Thank you,
>
> R. Lochner
>
> #uname -a
> Linux vmh001.clone1.com 4.4.6-300.fc23.x86_64 #1 SMP Wed Mar 16
> 22:10:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>
> # btrfs --version
> btrfs-progs v4.4.1
>
> # btrfs fi show
> Label: 'raid_pool'  uuid: d397ff55-e5c8-4d31-966e-d65694997451
>     Total devices 2 FS bytes used 2.32TiB
>     devid    1 size 3.00TiB used 2.32TiB path /dev/sdb1
>     devid    2 size 3.00TiB used 2.32TiB path /dev/sdc1
>
> # btrfs fi df /mnt
> Data, RAID1: total=2.32TiB, used=2.31TiB
> System, RAID1: total=40.00MiB, used=384.00KiB
> Metadata, RAID1: total=7.00GiB, used=5.42GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> Dmesg:
>
> [2027323.705035] BTRFS warning (device sdc1): checksum error at
> logical 3037444042752 on dev /dev/sdc1, sector 4988750584, root 259,
> inode 1437377, offset 75754369024, length 4096, links 1 (path:
> Rick/sda4.img)
> [2027323.705056] BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 0,
> rd 13, flush 0, corrupt 3, gen 0
> [2027323.718869] BTRFS error (device sdc1): unable to fixup (regular)
> error at logical 3037444042752 on dev /dev/sdc1
>
> ls:
>
> #ls -l /mnt/backup/Rick/sda4.img
> -rw-r--r--. 1 root root 75959197696 Dec 27 10:36 /mnt/backup/Rick/sda4.img


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BTRFS Data at Rest File Corruption
  2016-05-11 18:36 BTRFS Data at Rest File Corruption Richard Lochner
  2016-05-11 19:01 ` Roman Mamedov
  2016-05-11 19:26 ` Austin S. Hemmelgarn
@ 2016-05-12  6:49 ` Chris Murphy
       [not found] ` <CAAuLxcaQ1Uo+pff9AtD74UwUvo5yYKBuNLwKzjVMWV1kt2DcRQ@mail.gmail.com>
  3 siblings, 0 replies; 22+ messages in thread
From: Chris Murphy @ 2016-05-12  6:49 UTC (permalink / raw)
  To: Richard Lochner; +Cc: Btrfs BTRFS

What are the mount options for this filesystem?

Maybe filter the journalctl output or /var/log/messages with grep/egrep
for btrfs (case-insensitive), and also for libata/scsi messages.
Anything from previous days might reveal some clue.
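
Something along these lines, for instance (just a sketch; standard
journalctl/grep options, adjust the pattern as needed):

# journalctl -k -b -1 | grep -iE 'btrfs|ata[0-9]|scsi'   # kernel messages from the previous boot
# grep -iE 'btrfs|ata[0-9]|scsi' /var/log/messages       # or the syslog file, if one is kept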

I've got multiple Btrfs raid1's and several times I've had
*correctable* errors. So your expectation is proper.


Chris Murphy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BTRFS Data at Rest File Corruption
  2016-05-11 19:26 ` Austin S. Hemmelgarn
@ 2016-05-12 17:49   ` Richard A. Lochner
  2016-05-12 18:29     ` Austin S. Hemmelgarn
  2016-05-13  1:41     ` Chris Murphy
  2016-05-13 16:28   ` Goffredo Baroncelli
  1 sibling, 2 replies; 22+ messages in thread
From: Richard A. Lochner @ 2016-05-12 17:49 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Btrfs BTRFS

Austin,

I rebooted the computer and reran the scrub to no avail.  The error is
consistent.

The reason I brought this question to the mailing list is because it
seemed like a situation that might be of interest to the developers.
Perhaps there might be a way to "defend" against this type of
corruption.

I suspected, and I still suspect that the error occurred upon a
metadata update that corrupted the checksum for the file, probably due
to silent memory corruption.  If the checksum was silently corrupted,
it would be simply written to both drives causing this type of error.

With that in mind, I proved (see below) that the data blocks match on
both mirrors.  This I expected since the data blocks should not have
been touched as the file has not been written.

This is the sequence of events as I see them that I think might be of
interest to the developers.

1. A block containing a checksum for the file was read into memory.
The block read would have been checksummed, so the checksum for the
file must have been good at that moment.

2. The checksum block was then altered in memory (perhaps to add or
change a value).

3. A new checksum would then have been calculated for the checksum
block.

4. The checksum block would have been written to both mirrors.

Presumably, in the case that I am experiencing, an undetected memory
error must have occurred after step 1 and before step 3 was completed.

I wonder if there is a way to correct or detect that situation.  

As I stated previously, the machine on which this occurred does not
have ECC memory, however, I would not think that the majority of users
running btrfs do either.  If it has happened to me, it likely has
happened to others.
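
(As an aside, short of taking the machine down for memtest86+, a userspace
pass with memtester is one way to at least exercise part of the RAM; a
sketch, assuming the memtester package is installed:)

# memtester 4096 3   # lock and test 4096 MB of RAM for 3 passes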

Rick Lochner

btrfs dmesg(s):

[16510.334020] BTRFS warning (device sdb1): checksum error at logical
3037444042752 on dev /dev/sdb1, sector 4988789496, root 259, inode
1437377, offset 75754369024, length 4096, links 1 (path: Rick/sda4.img)
[16510.334043] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0, rd
0, flush 0, corrupt 5, gen 0
[16510.345662] BTRFS error (device sdb1): unable to fixup (regular)
error at logical 3037444042752 on dev /dev/sdb1

[17606.978439] BTRFS warning (device sdb1): checksum error at logical
3037444042752 on dev /dev/sdc1, sector 4988750584, root 259, inode
1437377, offset 75754369024, length 4096, links 1 (path: Rick/sda4.img)
[17606.978460] BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd
13, flush 0, corrupt 4, gen 0
[17606.989497] BTRFS error (device sdb1): unable to fixup (regular)
error at logical 3037444042752 on dev /dev/sdc1

How I compared the data blocks:

#btrfs-map-logical -l 3037444042752  /dev/sdc1
mirror 1 logical 3037444042752 physical 2554240299008 device /dev/sdc1
mirror 1 logical 3037444046848 physical 2554240303104 device /dev/sdc1
mirror 2 logical 3037444042752 physical 2554260221952 device /dev/sdb1
mirror 2 logical 3037444046848 physical 2554260226048 device /dev/sdb1

#dd if=/dev/sdc1 bs=1 skip=2554240299008 count=4096 of=c1
4096+0 records in
4096+0 records out
4096 bytes (4.1 kB) copied, 0.0292201 s, 140 kB/s

#dd if=/dev/sdc1 bs=1 skip=2554240303104 count=4096 of=c2
4096+0 records in
4096+0 records out
4096 bytes (4.1 kB) copied, 0.0142381 s, 288 kB/s

#dd if=/dev/sdb1 bs=1 skip=2554260221952 count=4096 of=b1
4096+0 records in
4096+0 records out
4096 bytes (4.1 kB) copied, 0.0293211 s, 140 kB/s

#dd if=/dev/sdb1 bs=1 skip=2554260226048 count=4096 of=b2
4096+0 records in
4096+0 records out
4096 bytes (4.1 kB) copied, 0.0151947 s, 270 kB/s

#diff b1 c1
#diff b2 c2
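
(Equivalently, and much faster than bs=1, each block can be read as a
single 4KiB chunk; the skip values below are just the physical offsets
above divided by 4096, and the output names are only illustrative:)

#dd if=/dev/sdc1 bs=4096 skip=623593823 count=1 of=c1.blk 2>/dev/null
#dd if=/dev/sdb1 bs=4096 skip=623598687 count=1 of=b1.blk 2>/dev/null
#cmp c1.blk b1.blk && echo "mirror copies match"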

On Wed, 2016-05-11 at 15:26 -0400, Austin S. Hemmelgarn wrote:
> On 2016-05-11 14:36, Richard Lochner wrote:
> > 
> > Hello,
> > 
> > I have encountered a data corruption error with BTRFS which may or
> > may
> > not be of interest to your developers.
> > 
> > The problem is that an unmodified file on a RAID-1 volume that had
> > been scrubbed successfully is now corrupt.  The details follow.
> > 
> > The volume was formatted as btrfs with raid1 data and raid1
> > metadata
> > on two new 4T hard drives (WD Data Center Re WD4000FYYZ) .
> > 
> > A large binary file was copied to the volume (~76 GB) on December
> > 27,
> > 2015.  Soon after copying the file, a btrfs scrub was run. There
> > were
> > no errors.  Multiple scrubs have also been run over the past
> > several
> > months.
> > 
> > Recently, a scrub returned an unrecoverable error on that file.
> > Again, the file has not been modified since it was originally
> > copied
> > and has the time stamp from December.  Furthermore, SMART tests
> > (long)
> > for both drives do not indicate any errors (Current_Pending_Sector
> > or
> > otherwise).
> > 
> > I should note that the system does not have ECC memory.
> > 
> > It would be interesting to me to know if:
> > 
> > a) The primary and secondary data blocks match (I suspect they do),
> > and
> > b) The primary and secondary checksums for the block match (I
> > suspect
> > they do as well)
> Do you mean whether they're both incorrect?  Because the only case in 
> which scrub should return an uncorrectable error is when neither block 
> appears correct.
> 
> In general, based on what you've said, there are four possibilities:
> 1. Both of your disks happened to have an undetectable error at 
> equivalent locations.  While not likely, this is still
> possible.  It's 
> important to note that while hard disks have internal ECC, ECC
> doesn't 
> inherently catch everything, so it's fully possible (although really 
> rare) to have a sector go bad and the disk not notice.
> 2. Some other part of your hardware has issues.  What I would check,
> in 
> order are:
> 	1. Internal cables (you would probably be surprised how many
> times I've 
> seen people have disk issues that were really caused by a bad data
> cable)
> 	2. RAM
> 	3. PSU (if you don't have a spare and don't have a multimeter
> or power 
> supply tester, move this one to the bottom of the list)
> 	4. CPU
> 	5. Storage controller
> 	6. Motherboard
>     If you want advice on testing anything, let me know.
> 3. It's caused by a transient error, and may or may not be fixable. 
> Computers have internal EMI shielding (or have metal cases) for a 
> reason, but this still doesn't protect from everything (cosmic 
> background radiation exists even in shielded enclosures).
> 4. You've found a bug in BTRFS or the kernel itself.  I seriously
> doubt 
> this, as your setup appears to be pretty much as trivial as
> possible 
> for a BTRFS raid1 filesystem, and you don't appear to be doing
> anything 
> other than storing data (in fact, if you actually found a bug in
> BTRFS 
> in such well tested code under such a trivial use case, you deserve
> a 
> commendation).
> 
> The first thing I would do is make sure that the scrub fails 
> consistently.  I've had cases on systems which had been on for
> multiple 
> months where a scrub failed, I rebooted, and then the scrub
> succeeded. 
> If you still get the error after a reboot, check if everything other 
> than the error counts is the same, if it isn't, then it's probably
> an 
> issue with your hardware (although probably not the disk).
> > 
> > 
> > Unfortunately, I do not have the skills to do such a verification.
> > 
> > If you have any thoughts or suggestions, I would be most
> > interested.
> > I was hoping that I could trust the integrity of "data at rest" in
> > a
> > RAID-1 setting under BTRFS, but this appears not to be the case.
> It probably isn't BTRFS.  This is one of the most tested code paths
> in 
> BTRFS (the only ones more tested are single device), and you don't 
> appear to be using anything else between BTRFS and the disks, so
> there's 
> not much that can go wrong.  Keep in mind that unlike other
> filesystems 
> on top of hardware or software RAID, BTRFS actually notices that
> things 
> are wrong and has some idea which things are wrong (although it
> can't 
> tell the difference between a corrupted checksum and a corrupted
> block 
> of data).
> > 
> > 
> > Thank you,
> > 
> > R. Lochner
> > 
> > #uname -a
> > Linux vmh001.clone1.com 4.4.6-300.fc23.x86_64 #1 SMP Wed Mar 16
> > 22:10:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
> > 
> > # btrfs --version
> > btrfs-progs v4.4.1
> > 
> > # btrfs fi show
> > Label: 'raid_pool'  uuid: d397ff55-e5c8-4d31-966e-d65694997451
> >     Total devices 2 FS bytes used 2.32TiB
> >     devid    1 size 3.00TiB used 2.32TiB path /dev/sdb1
> >     devid    2 size 3.00TiB used 2.32TiB path /dev/sdc1
> > 
> > # btrfs fi df /mnt
> > Data, RAID1: total=2.32TiB, used=2.31TiB
> > System, RAID1: total=40.00MiB, used=384.00KiB
> > Metadata, RAID1: total=7.00GiB, used=5.42GiB
> > GlobalReserve, single: total=512.00MiB, used=0.00B
> > 
> > Dmesg:
> > 
> > [2027323.705035] BTRFS warning (device sdc1): checksum error at
> > logical 3037444042752 on dev /dev/sdc1, sector 4988750584, root
> > 259,
> > inode 1437377, offset 75754369024, length 4096, links 1 (path:
> > Rick/sda4.img)
> > [2027323.705056] BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr
> > 0,
> > rd 13, flush 0, corrupt 3, gen 0
> > [2027323.718869] BTRFS error (device sdc1): unable to fixup
> > (regular)
> > error at logical 3037444042752 on dev /dev/sdc1
> > 
> > ls:
> > 
> > #ls -l /mnt/backup/Rick/sda4.img
> > -rw-r--r--. 1 root root 75959197696 Dec 27 10:36
> > /mnt/backup/Rick/sda4.img

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BTRFS Data at Rest File Corruption
       [not found] ` <CAAuLxcaQ1Uo+pff9AtD74UwUvo5yYKBuNLwKzjVMWV1kt2DcRQ@mail.gmail.com>
@ 2016-05-12 18:26   ` Richard A. Lochner
  0 siblings, 0 replies; 22+ messages in thread
From: Richard A. Lochner @ 2016-05-12 18:26 UTC (permalink / raw)
  To: andrew.j.wade; +Cc: linux-btrfs

Andrew,

I agree with your supposition about the metadata and corrupted RAM.  

I verified that the data blocks on both devices are equal (see my reply
to Austin for the commands I used).  I believe that they correctly prove
that the blocks are, in fact, equal.

I am not sure I have the skills to "walk the checksum tree manually" as
you described.  I would also like to verify that the checksum blocks
agree as I expect they do, but I may have to "bone up" on my tree
walking skills first.
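
(For reference, the walk described in the quoted message below boils down
to roughly the following; a sketch only, where <bytenr> stands for the
checksum tree root block number reported by the first command:)

# btrfs-debug-tree -r /dev/sdc1            # note the bytenr of the checksum tree root
# btrfs-debug-tree -b <bytenr> /dev/sdc1   # repeat with child bytenrs, looking for the
                                           # largest key at or below 3037444042752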

Thanks for your help.

Rick Lochner

On Wed, 2016-05-11 at 21:16 -0400, Andrew Wade wrote:
> 
> I would expect the "data at rest" to be good too. But perhaps
> something happened to the metadata (checksum). If the checksum was
> corrupted in RAM it could be written back to the disks due to updates
> elsewhere in the metadata node.
> If this is what happened I would expect the metadata node containing
> the checksum to have a recent generation number.
> I'm not actually a BTRFS developer myself, but you might be able to
> find the generation by using btrfs-debug-tree from btrfs-tools.
> btrfs-debug-tree -r /dev/sdc1 will give you the block number of the
> checksum tree root, which you can then feed into btrfs-debug-tree -b
> #### /dev/sdc1 and walk the tree manually. You're looking for the
> largest key before 3037444042752. 
> For dumping the data and metadata blocks I think btrfs-map-logical is
> what you need, though to be honest I've never used this tool myself.
> Even if the file data is still good I don't know of a simple way to
> tell BTRFS to ignore the checksums for a file. It is possible to
> regenerate the checksum tree for the entire filesystem, but I
> personally wouldn't do that unless you really need the file.
> regards,
> Andrew
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BTRFS Data at Rest File Corruption
  2016-05-12 17:49   ` Richard A. Lochner
@ 2016-05-12 18:29     ` Austin S. Hemmelgarn
  2016-05-12 21:53       ` Goffredo Baroncelli
  2016-05-12 23:15       ` Richard A. Lochner
  2016-05-13  1:41     ` Chris Murphy
  1 sibling, 2 replies; 22+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-12 18:29 UTC (permalink / raw)
  To: Richard A. Lochner, Btrfs BTRFS

On 2016-05-12 13:49, Richard A. Lochner wrote:
> Austin,
>
> I rebooted the computer and reran the scrub to no avail.  The error is
> consistent.
>
> The reason I brought this question to the mailing list is because it
> seemed like a situation that might be of interest to the developers.
>  Perhaps, there might be a way to "defend" against this type of
> corruption.
>
> I suspected, and I still suspect that the error occurred upon a
> metadata update that corrupted the checksum for the file, probably due
> to silent memory corruption.  If the checksum was silently corrupted,
> it would be simply written to both drives causing this type of error.
That does seem to be the most likely cause, and sadly, is not something 
any filesystem can protect reliably against on any commodity hardware.
>
> With that in mind, I proved (see below) that the data blocks match on
> both mirrors.  This I expected since the data blocks should not have
> > been touched as the file has not been written.
>
> This is the sequence of events as I see them that I think might be of
> interest to the developers.
>
> 1. A block containing a checksum for the file was read into memory.
> The block read would have been checksummed, so the checksum for the
> file must have been good at that moment.
It's worth noting that BTRFS doesn't verify all the checksums in a 
metadata block when it loads that metadata block; only the ones for the 
reads that triggered the metadata block being loaded will get verified.
>
> 2. The checksum block was then altered in memory (perhaps to add or
> change a value).
>
> 3. A new checksum would then have been calculated for the checksum
> block.
>
> 4. The checksum block would have been written to both mirrors.
>
> Presumably, in the case that I am experiencing, an undetected memory
> error must have occurred after 1 and before step 3 was completed.
>
> I wonder if there is a way to correct or detect that situation.
The closest we could get is to provide an option to handle this in 
scrub, preferably with a big scary warning on it, as this same situation 
can easily be caused by someone modifying the disks themselves (we can't 
reasonably protect against that, but we shouldn't make it trivial for 
people to inject arbitrary data that way either).
>
> As I stated previously, the machine on which this occurred does not
> have ECC memory, however, I would not think that the majority of users
> running btrfs do either.  If it has happened to me, it likely has
> happened to others.
>
> Rick Lochner
>
> btrfs dmesg(s):
>
> [16510.334020] BTRFS warning (device sdb1): checksum error at logical
> 3037444042752 on dev /dev/sdb1, sector 4988789496, root 259, inode
> 1437377, offset 75754369024, length 4096, links 1 (path: Rick/sda4.img)
> [16510.334043] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0, rd
> 0, flush 0, corrupt 5, gen 0
> [16510.345662] BTRFS error (device sdb1): unable to fixup (regular)
> error at logical 3037444042752 on dev /dev/sdb1
>
> [17606.978439] BTRFS warning (device sdb1): checksum error at logical
> 3037444042752 on dev /dev/sdc1, sector 4988750584, root 259, inode
> 1437377, offset 75754369024, length 4096, links 1 (path: Rick/sda4.img)
> [17606.978460] BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd
> 13, flush 0, corrupt 4, gen 0
> [17606.989497] BTRFS error (device sdb1): unable to fixup (regular)
> error at logical 3037444042752 on dev /dev/sdc1
>
> How I compared the data blocks:
>
> #btrfs-map-logical -l 3037444042752  /dev/sdc1
> mirror 1 logical 3037444042752 physical 2554240299008 device /dev/sdc1
> mirror 1 logical 3037444046848 physical 2554240303104 device /dev/sdc1
> mirror 2 logical 3037444042752 physical 2554260221952 device /dev/sdb1
> mirror 2 logical 3037444046848 physical 2554260226048 device /dev/sdb1
>
> #dd if=/dev/sdc1 bs=1 skip=2554240299008 count=4096 of=c1
> 4096+0 records in
> 4096+0 records out
> 4096 bytes (4.1 kB) copied, 0.0292201 s, 140 kB/s
>
> #dd if=/dev/sdc1 bs=1 skip=2554240303104 count=4096 of=c2
> 4096+0 records in
> 4096+0 records out
> 4096 bytes (4.1 kB) copied, 0.0142381 s, 288 kB/s
>
> #dd if=/dev/sdb1 bs=1 skip=2554260221952 count=4096 of=b1
> 4096+0 records in
> 4096+0 records out
> 4096 bytes (4.1 kB) copied, 0.0293211 s, 140 kB/s
>
> #dd if=/dev/sdb1 bs=1 skip=2554260226048 count=4096 of=b2
> 4096+0 records in
> 4096+0 records out
> 4096 bytes (4.1 kB) copied, 0.0151947 s, 270 kB/s
>
> #diff b1 c1
> #diff b2 c2
Excellent thinking here.

Now, if you can find some external method to verify that that block is 
in fact correct, you can just write it back into the file itself at the 
correct offset, and fix the issue.
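
For instance, once the 4KiB at file offset 75754369024 has been confirmed
good from another source, something along these lines would put it back (a
sketch; "good_block" is a hypothetical file holding the verified 4096
bytes, and the seek value is simply that offset divided by the 4096-byte
block size):

#dd if=good_block of=/mnt/backup/Rick/sda4.img bs=4096 seek=18494719 count=1 conv=notrunc

Since the write goes through the filesystem, the rewritten extent also
gets a freshly computed checksum.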


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BTRFS Data at Rest File Corruption
  2016-05-12 18:29     ` Austin S. Hemmelgarn
@ 2016-05-12 21:53       ` Goffredo Baroncelli
  2016-05-12 23:15       ` Richard A. Lochner
  1 sibling, 0 replies; 22+ messages in thread
From: Goffredo Baroncelli @ 2016-05-12 21:53 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Richard A. Lochner, Btrfs BTRFS

On 2016-05-12 20:29, Austin S. Hemmelgarn wrote:
>> I wonder if there is a way to correct or detect that situation.
> The closest we could get is to provide an option to handle this in
> scrub, preferably with a big scary warning on it as this same
> situation can be easily cause by someone modifying the disks
> themselves (we can't reasonably protect against that, but we
> shouldn't make it trivial for people to inject arbitrary data that
> way either).

"btrfs check" has the option "--init-csum-tree"...

Anyway, there should be an option to recalculate the checksum for a single file. BTRFS is good at highlighting that a file is corrupted, but it should still offer the possibility of reading it anyway: in some cases it is better to have a corrupted file (knowing that it is corrupted) than to lose the whole file.
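
(Until something like that exists, one way to salvage such a file is to
copy it with dd's error handling, which skips the unreadable 4KiB block
and pads it with zeros; a sketch, with an illustrative destination path:)

# dd if=/mnt/backup/Rick/sda4.img of=/tmp/sda4.salvaged bs=4096 conv=noerror,sync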

BR
G.Baroncelli
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BTRFS Data at Rest File Corruption
  2016-05-12 18:29     ` Austin S. Hemmelgarn
  2016-05-12 21:53       ` Goffredo Baroncelli
@ 2016-05-12 23:15       ` Richard A. Lochner
  1 sibling, 0 replies; 22+ messages in thread
From: Richard A. Lochner @ 2016-05-12 23:15 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Btrfs BTRFS

Austin,

Ah, the idea of rewriting the "bad" data block is very interesting. I
had not thought of that.  Interestingly, the corrupted file is a raw
backup image of a btrfs file system partition. I can mount it as a loop
device.  I suppose I could rewrite that data block, mount it and run a
scrub on that mounted loop device to find out if it is truly fixed.

I should also mention that this data is not critical to me.  I only
brought this issue up because I thought it might be of interest.  

I can think of ways to protect against most manifestations of this type
of error (since metadata is checksummed in btrfs), but I cannot argue
that it would be worth the development effort, increased code
complexity or the additional cpu cycles required to implement such a
"defensive" algorithm for an "edge case" like this.  Even with a
defensive algorithm, these errors could still occur, but I believe you
could shrink the time window in which they could occur enough to
significantly reduce their probability.

That said, I happen to have experienced this particular error twice
(over a period of about 7 months) with btrfs on this system.  I do
believe that both were due to memory errors and I plan to upgrade soon
to a Haswell system with ECC memory because of this. 

However, I wonder if my "commodity hardware" is that unique?

In any event, thank you very much for your time and insight.

Rick Lochner


On Thu, 2016-05-12 at 14:29 -0400, Austin S. Hemmelgarn wrote:
> On 2016-05-12 13:49, Richard A. Lochner wrote:
> > 
> > Austin,
> > 
> > I rebooted the computer and reran the scrub to no avail.  The error
> > is
> > consistent.
> > 
> > The reason I brought this question to the mailing list is because
> > it
> > seemed like a situation that might be of interest to the
> > developers.
> >  Perhaps, there might be a way to "defend" against this type of
> > corruption.
> > 
> > I suspected, and I still suspect that the error occurred upon a
> > metadata update that corrupted the checksum for the file, probably
> > due
> > to silent memory corruption.  If the checksum was silently
> > corrupted,
> > it would be simply written to both drives causing this type of
> > error.
> That does seem to be the most likely cause, and sadly, is not
> something 
> any filesystem can protect reliably against on any commodity
> hardware.
> > 
> > 
> > With that in mind, I proved (see below) that the data blocks match
> > on
> > both mirrors.  This I expected since the data blocks should not
> > have
> > been touched as the file has not been written.
> > 
> > This is the sequence of events as I see them that I think might be
> > of
> > interest to the developers.
> > 
> > 1. A block containing a checksum for the file was read into memory.
> > The block read would have been checksummed, so the checksum for the
> > file must have been good at that moment.
> It's worth noting that BTRFS doesn't verify all the checksums in a 
> metadata block when it loads that metadata block, only the ones for
> the 
> reads that triggered the metadata block being loaded will get
> verified.
> > 
> > 
> > 2. The checksum block was then altered in memory (perhaps to add or
> > change a value).
> > 
> > 3. A new checksum would then have been calculated for the checksum
> > block.
> > 
> > 4. The checksum block would have been written to both mirrors.
> > 
> > Presumably, in the case that I am experiencing, an undetected
> > memory
> > error must have occurred after 1 and before step 3 was completed.
> > 
> > I wonder if there is a way to correct or detect that situation.
> The closest we could get is to provide an option to handle this in 
> scrub, preferably with a big scary warning on it as this same
> situation 
> can easily be caused by someone modifying the disks themselves (we
> can't 
> reasonably protect against that, but we shouldn't make it trivial
> for 
> people to inject arbitrary data that way either).
> > 
> > 
> > As I stated previously, the machine on which this occurred does not
> > have ECC memory, however, I would not think that the majority of
> > users
> > running btrfs do either.  If it has happened to me, it likely has
> > happened to others.
> > 
> > Rick Lochner
> > 
> > btrfs dmesg(s):
> > 
> > [16510.334020] BTRFS warning (device sdb1): checksum error at
> > logical
> > 3037444042752 on dev /dev/sdb1, sector 4988789496, root 259, inode
> > 1437377, offset 75754369024, length 4096, links 1 (path:
> > Rick/sda4.img)
> > [16510.334043] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr
> > 0, rd
> > 0, flush 0, corrupt 5, gen 0
> > [16510.345662] BTRFS error (device sdb1): unable to fixup (regular)
> > error at logical 3037444042752 on dev /dev/sdb1
> > 
> > [17606.978439] BTRFS warning (device sdb1): checksum error at
> > logical
> > 3037444042752 on dev /dev/sdc1, sector 4988750584, root 259, inode
> > 1437377, offset 75754369024, length 4096, links 1 (path:
> > Rick/sda4.img)
> > [17606.978460] BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr
> > 0, rd
> > 13, flush 0, corrupt 4, gen 0
> > [17606.989497] BTRFS error (device sdb1): unable to fixup (regular)
> > error at logical 3037444042752 on dev /dev/sdc1
> > 
> > How I compared the data blocks:
> > 
> > #btrfs-map-logical -l 3037444042752  /dev/sdc1
> > mirror 1 logical 3037444042752 physical 2554240299008 device
> > /dev/sdc1
> > mirror 1 logical 3037444046848 physical 2554240303104 device
> > /dev/sdc1
> > mirror 2 logical 3037444042752 physical 2554260221952 device
> > /dev/sdb1
> > mirror 2 logical 3037444046848 physical 2554260226048 device
> > /dev/sdb1
> > 
> > #dd if=/dev/sdc1 bs=1 skip=2554240299008 count=4096 of=c1
> > 4096+0 records in
> > 4096+0 records out
> > 4096 bytes (4.1 kB) copied, 0.0292201 s, 140 kB/s
> > 
> > #dd if=/dev/sdc1 bs=1 skip=2554240303104 count=4096 of=c2
> > 4096+0 records in
> > 4096+0 records out
> > 4096 bytes (4.1 kB) copied, 0.0142381 s, 288 kB/s
> > 
> > #dd if=/dev/sdb1 bs=1 skip=2554260221952 count=4096 of=b1
> > 4096+0 records in
> > 4096+0 records out
> > 4096 bytes (4.1 kB) copied, 0.0293211 s, 140 kB/s
> > 
> > #dd if=/dev/sdb1 bs=1 skip=2554260226048 count=4096 of=b2
> > 4096+0 records in
> > 4096+0 records out
> > 4096 bytes (4.1 kB) copied, 0.0151947 s, 270 kB/s
> > 
> > #diff b1 c1
> > #diff b2 c2
> Excellent thinking here.
> 
> Now, if you can find some external method to verify that that block
> is 
> in fact correct, you can just write it back into the file itself at
> the 
> correct offset, and fix the issue.
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BTRFS Data at Rest File Corruption
  2016-05-12 17:49   ` Richard A. Lochner
  2016-05-12 18:29     ` Austin S. Hemmelgarn
@ 2016-05-13  1:41     ` Chris Murphy
  2016-05-13  4:49       ` Richard A. Lochner
  1 sibling, 1 reply; 22+ messages in thread
From: Chris Murphy @ 2016-05-13  1:41 UTC (permalink / raw)
  To: Richard A. Lochner; +Cc: Austin S. Hemmelgarn, Btrfs BTRFS

On Thu, May 12, 2016 at 11:49 AM, Richard A. Lochner <lochner@clone1.com> wrote:

> I suspected, and I still suspect that the error occurred upon a
> metadata update that corrupted the checksum for the file, probably due
> to silent memory corruption.  If the checksum was silently corrupted,
> it would be simply written to both drives causing this type of error.

Metadata is checksummed independently of data. So if the data isn't
updated, its checksum doesn't change; only the metadata checksum is
changed.

>
> btrfs dmesg(s):
>
> [16510.334020] BTRFS warning (device sdb1): checksum error at logical
> 3037444042752 on dev /dev/sdb1, sector 4988789496, root 259, inode
> 1437377, offset 75754369024, length 4096, links 1 (path: Rick/sda4.img)
> [16510.334043] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0, rd
> 0, flush 0, corrupt 5, gen 0
> [16510.345662] BTRFS error (device sdb1): unable to fixup (regular)
> error at logical 3037444042752 on dev /dev/sdb1
>
> [17606.978439] BTRFS warning (device sdb1): checksum error at logical
> 3037444042752 on dev /dev/sdc1, sector 4988750584, root 259, inode
> 1437377, offset 75754369024, length 4096, links 1 (path: Rick/sda4.img)
> [17606.978460] BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd
> 13, flush 0, corrupt 4, gen 0
> [17606.989497] BTRFS error (device sdb1): unable to fixup (regular)
> error at logical 3037444042752 on dev /dev/sdc1

This is confusing. Are these the same boot? The later time has a lower
corrupt count. Can you just 'dd if=sda4.img of=/dev/null' and report
all (new) messages in dmesg? It seems to me the problem should show up
at pretty much the same monotonic time on both devices.

Also what do you get for these for each device:

smartctl -l scterc /dev/sdX
cat /sys/block/sdX/device/timeout


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BTRFS Data at Rest File Corruption
  2016-05-13  1:41     ` Chris Murphy
@ 2016-05-13  4:49       ` Richard A. Lochner
  2016-05-13 17:46         ` Chris Murphy
  0 siblings, 1 reply; 22+ messages in thread
From: Richard A. Lochner @ 2016-05-13  4:49 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Austin S. Hemmelgarn, Btrfs BTRFS

Chris,

See notes inline.

On Thu, 2016-05-12 at 19:41 -0600, Chris Murphy wrote:
> On Thu, May 12, 2016 at 11:49 AM, Richard A. Lochner <lochner@clone1.
> com> wrote:
> 
> > 
> > I suspected, and I still suspect that the error occurred upon a
> > metadata update that corrupted the checksum for the file, probably
> > due
> > to silent memory corruption.  If the checksum was silently
> > corrupted,
> > it would be simply written to both drives causing this type of
> > error.
> Metadata is checksummed independently of data. So if the data isn't
> updated, its checksum doesn't change, only metadata checksum is
> changed.
> > 
> > 
> > btrfs dmesg(s):
> > 
> > [16510.334020] BTRFS warning (device sdb1): checksum error at
> > logical
> > 3037444042752 on dev /dev/sdb1, sector 4988789496, root 259, inode
> > 1437377, offset 75754369024, length 4096, links 1 (path:
> > Rick/sda4.img)
> > [16510.334043] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr
> > 0, rd
> > 0, flush 0, corrupt 5, gen 0
> > [16510.345662] BTRFS error (device sdb1): unable to fixup (regular)
> > error at logical 3037444042752 on dev /dev/sdb1
> > 
> > [17606.978439] BTRFS warning (device sdb1): checksum error at
> > logical
> > 3037444042752 on dev /dev/sdc1, sector 4988750584, root 259, inode
> > 1437377, offset 75754369024, length 4096, links 1 (path:
> > Rick/sda4.img)
> > [17606.978460] BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr
> > 0, rd
> > 13, flush 0, corrupt 4, gen 0
> > [17606.989497] BTRFS error (device sdb1): unable to fixup (regular)
> > error at logical 3037444042752 on dev /dev/sdc1
> This is confusing. Are these the same boot? The later time has a
> lower
> corrupt count. Can you just 'dd if=sda4.img of=/dev/null' and report
> all (new) messages in dmesg? It seems to me there should be pretty
> much all the same monotonic-time for the problem with both devices.

My apologies, they were from different boots.  After the dd, I get
these:

[109479.550836] BTRFS warning (device sdb1): csum failed ino 1437377
off 75754369024 csum 1689728329 expected csum 2165338402
[109479.596626] BTRFS warning (device sdb1): csum failed ino 1437377
off 75754369024 csum 1689728329 expected csum 2165338402
[109479.601969] BTRFS warning (device sdb1): csum failed ino 1437377
off 75754369024 csum 1689728329 expected csum 2165338402
[109479.602189] BTRFS warning (device sdb1): csum failed ino 1437377
off 75754369024 csum 1689728329 expected csum 2165338402
[109479.602323] BTRFS warning (device sdb1): csum failed ino 1437377
off 75754369024 csum 1689728329 expected csum 2165338402
> 
> Also what do you get for these for each device:
> 
> smartctl -l scterc /dev/sdX
> cat /sys/block/sdX/device/timeout
> 
# smartctl -l scterc  /dev/sdb
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.4.8-300.fc23.x86_64]
(local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools
.org

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

# smartctl -l scterc  /dev/sdc
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.4.8-300.fc23.x86_64]
(local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools
.org

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

# cat /sys/block/sdb/device/timeout
30
# cat /sys/block/sdc/device/timeout
30
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BTRFS Data at Rest File Corruption
  2016-05-11 19:26 ` Austin S. Hemmelgarn
  2016-05-12 17:49   ` Richard A. Lochner
@ 2016-05-13 16:28   ` Goffredo Baroncelli
  2016-05-13 16:54     ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 22+ messages in thread
From: Goffredo Baroncelli @ 2016-05-13 16:28 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Richard Lochner, Btrfs BTRFS

On 2016-05-11 21:26, Austin S. Hemmelgarn wrote:
> (although it can't tell the difference between a corrupted checksum and a corrupted block of data).

I don't think so. The data checksums are stored in metadata blocks, and like any metadata block, these have their own checksums. So btrfs knows whether the checksum is correct or not, independently of whether the data is correct or not. Of course, if the checksum is wrong, btrfs can't tell whether the data is correct.

The only exception should be inline data: in this case the data is stored in the metadata block itself, and that block is protected by only one checksum.

I know that I am pedantic :_)  but after reading your comment I looked at the btrfs data structures to refresh my memory, so I wanted to share this information.

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BTRFS Data at Rest File Corruption
  2016-05-13 16:28   ` Goffredo Baroncelli
@ 2016-05-13 16:54     ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 22+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-13 16:54 UTC (permalink / raw)
  To: kreijack, Richard Lochner, Btrfs BTRFS

On 2016-05-13 12:28, Goffredo Baroncelli wrote:
> On 2016-05-11 21:26, Austin S. Hemmelgarn wrote:
>> (although it can't tell the difference between a corrupted checksum and a corrupted block of data).
>
> I don't think so. The data checksums are stored in metadata blocks, and as metadata block, these have their checksums. So btrfs know if the checksum is correct or none, despite the fact that the data is correct or none. Of course if the checksum is wrong, btrfs can't tell if the data is correct.
>
> The only exception should be the inline data: in this case the data is stored in the metadata block, and this block is protected by only one checksum.
>
> I know that I am pedantic :_)  but after reading your comment I looked at the btrfs data structure to refresh my memory, so I want to share these information.
It is fully possible for the block of data to be good and the checksum 
to be bad, it's just ridiculously unlikely.  I've actually had this 
happen before at least twice (I have really bad luck when it comes to 
disk corruption).


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BTRFS Data at Rest File Corruption
  2016-05-13  4:49       ` Richard A. Lochner
@ 2016-05-13 17:46         ` Chris Murphy
  2016-05-15 18:43           ` Richard A. Lochner
  0 siblings, 1 reply; 22+ messages in thread
From: Chris Murphy @ 2016-05-13 17:46 UTC (permalink / raw)
  To: Richard A. Lochner; +Cc: Chris Murphy, Austin S. Hemmelgarn, Btrfs BTRFS

On Thu, May 12, 2016 at 10:49 PM, Richard A. Lochner <lochner@clone1.com> wrote:

> My apologies, they were from different boots.  After the dd, I get
> these:
>
> [109479.550836] BTRFS warning (device sdb1): csum failed ino 1437377
> off 75754369024 csum 1689728329 expected csum 2165338402
> [109479.596626] BTRFS warning (device sdb1): csum failed ino 1437377
> off 75754369024 csum 1689728329 expected csum 2165338402
> [109479.601969] BTRFS warning (device sdb1): csum failed ino 1437377
> off 75754369024 csum 1689728329 expected csum 2165338402
> [109479.602189] BTRFS warning (device sdb1): csum failed ino 1437377
> off 75754369024 csum 1689728329 expected csum 2165338402
> [109479.602323] BTRFS warning (device sdb1): csum failed ino 1437377
> off 75754369024 csum 1689728329 expected csum 2165338402

That's it? Only errors from sdb1? And this time no attempt to fix it?

Normally, when the data checksums stored in metadata fail to match the
data checksums newly computed as the blocks are read, there's an attempt
to read the mismatching blocks from another stripe.
I don't see that this is being attempted.


>>
>> Also what do you get for these for each device:
>>
>> smartctl -l scterc /dev/sdX
>> cat /sys/block/sdX/device/timeout
>>
> # smartctl -l scterc  /dev/sdb
> sartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.4.8-300.fc23.x86_64]
> (local build)
> Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools
> .org
>
> SCT Error Recovery Control:
>            Read:     70 (7.0 seconds)
>           Write:     70 (7.0 seconds)
>
> # smartctl -l scterc  /dev/sdc
> smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.4.8-300.fc23.x86_64]
> (local build)
> Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools
> .org
>
> SCT Error Recovery Control:
>            Read:     70 (7.0 seconds)
>           Write:     70 (7.0 seconds)
>
> # cat /sys/block/sdb/device/timeout
> 30
> # cat /sys/block/sdc/device/timeout
> 30
>>

That's appropriate. So at least any failures have a chance of being
fixed before the command timer does a reset on the bus.
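
(For completeness, on drives that lack working ERC the usual adjustment is
the other way around; a sketch, not needed here since these drives already
report a 7.0 second limit:)

# smartctl -l scterc,70,70 /dev/sdX          # try to enable a 7 second error recovery limit
# echo 180 > /sys/block/sdX/device/timeout   # otherwise raise the kernel command timeout instead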


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BTRFS Data at Rest File Corruption
  2016-05-13 17:46         ` Chris Murphy
@ 2016-05-15 18:43           ` Richard A. Lochner
  2016-05-16  6:07             ` Chris Murphy
  0 siblings, 1 reply; 22+ messages in thread
From: Richard A. Lochner @ 2016-05-15 18:43 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Austin S. Hemmelgarn, Btrfs BTRFS

Chris,

I have some interesting news.  In the process of trying to prepare some
clean logs for you, a new error showed up in my scrub.  It is another
very large file (500+ GB) that has been "at rest" for at least 5 months
(it has a timestamp of 1/4/15, but was actually copied around
December).  In this case, I do have the original file on a freenas zfs
volume.  

The file has one bad data block (4096 bytes).  The data in the bad
block matches on both btrfs mirrors, and it matches the data on zfs.

This proves to me that the error is in the metadata.

There is clearly something wrong with my hardware that is allowing metadata
corruption to happen, albeit relatively infrequently (3 times in 6
months).

In any event, the commands I ran and the associated log entries follow.

Rick Lochner

# mount /dev/sdb1 /mnt

# ls -l /mnt/backup/Rick/sda4.img
-rw-r--r--. 1 root root 75959197696 Dec 27 10:36
/mnt/backup/Rick/sda4.img

# ls -l /mnt/backup/freenas/Backups/Rick/crw2k3s1_share.img
-rwxrwxr-x. 1 1556 1999 536870912000 Jan  4  2015
/mnt/backup/freenas/Backups/Rick/crw2k3s1_share.img

# dd if=/mnt/backup/Rick/sda4.img of=/dev/null
dd: error reading ‘/mnt/backup/Rick/sda4.img’: Input/output error
147957752+0 records in
147957752+0 records out
75754369024 bytes (76 GB) copied, 610.88 s, 124 MB/s

# btrfs scrub start /mnt

# btrfs scrub status /mnt
scrub status for d397ff55-e5c8-4d31-966e-d65694997451
	scrub started at Sun May 15 06:07:37 2016 and finished after
04:49:54
	total bytes scrubbed: 4.64TiB with 3 errors
	error details: csum=3
	corrected errors: 0, uncorrectable errors: 3, unverified
errors: 0

# btrfs fi sh /mnt
Label: 'raid_pool'  uuid: d397ff55-e5c8-4d31-966e-d65694997451
	Total devices 2 FS bytes used 2.32TiB
	devid    1 size 3.00TiB used 2.32TiB path /dev/sdb1
	devid    2 size 3.00TiB used 2.32TiB path /dev/sdc1

The log contained the following messages (the first four from the
mount, the next five from the dd, the last three from the scrub):

[ 2451.050107] BTRFS info (device sdc1): disk space caching is enabled
[ 2451.050112] BTRFS: has skinny extents
[ 2451.157276] BTRFS info (device sdc1): bdev /dev/sdc1 errs: wr 0, rd
13, flush 0, corrupt 4, gen 0
[ 2451.157284] BTRFS info (device sdc1): bdev /dev/sdb1 errs: wr 0, rd
0, flush 0, corrupt 5, gen 0

[ 3118.415249] BTRFS warning (device sdc1): csum failed ino 1437377 off
75754369024 csum 1689728329 expected csum 2165338402
[ 3118.481373] BTRFS warning (device sdc1): csum failed ino 1437377 off
75754369024 csum 1689728329 expected csum 2165338402
[ 3118.490322] BTRFS warning (device sdc1): csum failed ino 1437377 off
75754369024 csum 1689728329 expected csum 2165338402
[ 3118.497292] BTRFS warning (device sdc1): csum failed ino 1437377 off
75754369024 csum 1689728329 expected csum 2165338402
[ 3118.497465] BTRFS warning (device sdc1): csum failed ino 1437377 off
75754369024 csum 1689728329 expected csum 2165338402

[11353.723860] BTRFS warning (device sdc1): checksum error at logical
1279007596544 on dev /dev/sdc1, sector 2498022800, root 259, inode
3715, offset 271776030720, length 4096, links 1 (path:
freenas/Backups/Rick/crw2k3s1_share.img)
[11353.723884] BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 0, rd
13, flush 0, corrupt 5, gen 0
[11353.734409] BTRFS error (device sdc1): unable to fixup (regular)
error at logical 1279007596544 on dev /dev/sdc1

[19446.539490] BTRFS warning (device sdc1): checksum error at logical
3037444042752 on dev /dev/sdb1, sector 4988789496, root 259, inode
1437377, offset 75754369024, length 4096, links 1 (path: Rick/sda4.img)
[19446.539503] BTRFS error (device sdc1): bdev /dev/sdb1 errs: wr 0, rd
0, flush 0, corrupt 6, gen 0
[19446.544776] BTRFS error (device sdc1): unable to fixup (regular)
error at logical 3037444042752 on dev /dev/sdb1

[20570.969126] BTRFS warning (device sdc1): checksum error at logical
3037444042752 on dev /dev/sdc1, sector 4988750584, root 259, inode
1437377, offset 75754369024, length 4096, links 1 (path: Rick/sda4.img)
[20570.969147] BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 0, rd
13, flush 0, corrupt 6, gen 0
[20570.983318] BTRFS error (device sdc1): unable to fixup (regular)
error at logical 3037444042752 on dev /dev/sdc1

On Fri, 2016-05-13 at 11:46 -0600, Chris Murphy wrote:
> On Thu, May 12, 2016 at 10:49 PM, Richard A. Lochner <lochner@clone1.
> com> wrote:
> 
> > 
> > My apologies, they were from different boots.  After the dd, I get
> > these:
> > 
> > [109479.550836] BTRFS warning (device sdb1): csum failed ino
> > 1437377
> > off 75754369024 csum 1689728329 expected csum 2165338402
> > [109479.596626] BTRFS warning (device sdb1): csum failed ino
> > 1437377
> > off 75754369024 csum 1689728329 expected csum 2165338402
> > [109479.601969] BTRFS warning (device sdb1): csum failed ino
> > 1437377
> > off 75754369024 csum 1689728329 expected csum 2165338402
> > [109479.602189] BTRFS warning (device sdb1): csum failed ino
> > 1437377
> > off 75754369024 csum 1689728329 expected csum 2165338402
> > [109479.602323] BTRFS warning (device sdb1): csum failed ino
> > 1437377
> > off 75754369024 csum 1689728329 expected csum 2165338402
> That's it? Only errors from sdb1? And this time no attempt to fix it?
> 
> Normally when there is failure to match data checksums stored in
> metadata to the newly computed data checksums as the blocks are read
> there's an attempt to read the mismatching blocks from another
> stripe.
> I don't see that this is being attempted.
> 
> 
> > 
> > > 
> > > 
> > > Also what do you get for these for each device:
> > > 
> > > smartctl -l scterc /dev/sdX
> > > cat /sys/block/sdX/device/timeout
> > > 
> > # smartctl -l scterc  /dev/sdb
> > sartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.4.8-300.fc23.x86_64]
> > (local build)
> > Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmont
> > ools
> > .org
> > 
> > SCT Error Recovery Control:
> >            Read:     70 (7.0 seconds)
> >           Write:     70 (7.0 seconds)
> > 
> > # smartctl -l scterc  /dev/sdc
> > smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.4.8-300.fc23.x86_64]
> > (local build)
> > Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmont
> > ools
> > .org
> > 
> > SCT Error Recovery Control:
> >            Read:     70 (7.0 seconds)
> >           Write:     70 (7.0 seconds)
> > 
> > # cat /sys/block/sdb/device/timeout
> > 30
> > # cat /sys/block/sdc/device/timeout
> > 30
> > > 
> > > 
> That's appropriate. So at least any failures have a chance of being
> fixed before the command timer does a reset on the bus.
> 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BTRFS Data at Rest File Corruption
  2016-05-15 18:43           ` Richard A. Lochner
@ 2016-05-16  6:07             ` Chris Murphy
  2016-05-16 11:33               ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 22+ messages in thread
From: Chris Murphy @ 2016-05-16  6:07 UTC (permalink / raw)
  To: Richard A. Lochner; +Cc: Chris Murphy, Austin S. Hemmelgarn, Btrfs BTRFS

Current hypothesis
 "I suspected, and I still suspect that the error occurred upon a
metadata update that corrupted the checksum for the file, probably due
to silent memory corruption.  If the checksum was silently corrupted,
it would be simply written to both drives causing this type of error."

A metadata update alone will not change the data checksums.

But let's ignore that. If there's a corrupt extent csum in a node that
itself has a valid csum, this is functionally identical to e.g.
nerfing 100 bytes of a file's extent data (both copies, identically).
The fs doesn't know the difference. All it knows is the node csum is
valid, therefore the data extent csum is valid, and that's why it
assumes the data is wrong and hence you get an I/O error. And I can
reproduce most of your results by nerfing file data.

The entire dmesg for scrub looks like this:


May 15 23:29:46 f23s.localdomain kernel: BTRFS warning (device dm-6):
checksum error at logical 5566889984 on dev /dev/dm-6, sector 8540160,
root 5, inode 258, offset 0, length 4096, links 1 (path:
openSUSE-Tumbleweed-NET-x86_64-Current.iso)
May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
bdev /dev/dm-6 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
unable to fixup (regular) error at logical 5566889984 on dev /dev/dm-6
May 15 23:29:46 f23s.localdomain kernel: BTRFS warning (device dm-6):
checksum error at logical 5566889984 on dev /dev/mapper/VG-b1, sector
8579072, root 5, inode 258, offset 0, length 4096, links 1 (path:
openSUSE-Tumbleweed-NET-x86_64-Current.iso)
May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
bdev /dev/mapper/VG-b1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
unable to fixup (regular) error at logical 5566889984 on dev
/dev/mapper/VG-b1

And the entire dmesg for running sha256sum on the file is

May 15 23:33:41 f23s.localdomain kernel: __readpage_endio_check: 22
callbacks suppressed
May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6):
csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6):
csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6):
csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6):
csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6):
csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141


And I do get an i/o error for sha256sum and no hash is computed.

But there are two important differences:
1. I have two unable to fixup messages, one for each device, at the
exact same time.
2. I altered both copies of extent data.

It's a mystery to me how your file data has not changed, yet somehow
the extent csum was changed while the node csum was still recomputed
correctly. That's a bit odd.
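
(For completeness, a minimal sketch of comparing the two on-disk copies
by hand, using the device names and sector numbers from the scrub
warnings above; the "sector" values are 512-byte units and the length is
4096 bytes, so eight sectors. Substitute the values from your own dmesg:)

# read the reported 4 KiB block from each mirror and compare hashes
dd if=/dev/dm-6 bs=512 skip=8540160 count=8 2>/dev/null | sha256sum
dd if=/dev/mapper/VG-b1 bs=512 skip=8579072 count=8 2>/dev/null | sha256sum
# identical hashes mean both copies hold the same (possibly wrong) data,
# consistent with a csum that went bad before it was ever written out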




Chris Murphy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BTRFS Data at Rest File Corruption
  2016-05-16  6:07             ` Chris Murphy
@ 2016-05-16 11:33               ` Austin S. Hemmelgarn
  2016-05-16 21:20                 ` Richard A. Lochner
  2016-05-16 22:43                 ` Chris Murphy
  0 siblings, 2 replies; 22+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-16 11:33 UTC (permalink / raw)
  To: Chris Murphy, Richard A. Lochner; +Cc: Btrfs BTRFS

On 2016-05-16 02:07, Chris Murphy wrote:
> Current hypothesis
>  "I suspected, and I still suspect that the error occurred upon a
> metadata update that corrupted the checksum for the file, probably due
> to silent memory corruption.  If the checksum was silently corrupted,
> it would be simply written to both drives causing this type of error."
>
> A metadata update alone will not change the data checksums.
>
> But let's ignore that. If there's corrupt extent csum in a node that
> itself has a valid csum, this is functionally identical to e.g.
> nerfing 100 bytes of a file's extent data (both copies, identically).
> The fs doesn't know the difference. All it knows is the node csum is
> valid, therefore the data extent csum is valid, and that's why it
> assumes the data is wrong and hence you get an I/O error. And I can
> reproduce most of your results by nerfing file data.
>
> The entire dmesg for scrub looks like this:
>
>
> May 15 23:29:46 f23s.localdomain kernel: BTRFS warning (device dm-6):
> checksum error at logical 5566889984 on dev /dev/dm-6, sector 8540160,
> root 5, inode 258, offset 0, length 4096, links 1 (path:
> openSUSE-Tumbleweed-NET-x86_64-Current.iso)
> May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
> bdev /dev/dm-6 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
> unable to fixup (regular) error at logical 5566889984 on dev /dev/dm-6
> May 15 23:29:46 f23s.localdomain kernel: BTRFS warning (device dm-6):
> checksum error at logical 5566889984 on dev /dev/mapper/VG-b1, sector
> 8579072, root 5, inode 258, offset 0, length 4096, links 1 (path:
> openSUSE-Tumbleweed-NET-x86_64-Current.iso)
> May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
> bdev /dev/mapper/VG-b1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
> unable to fixup (regular) error at logical 5566889984 on dev
> /dev/mapper/VG-b1
>
> And the entire dmesg for running sha256sum on the file is
>
> May 15 23:33:41 f23s.localdomain kernel: __readpage_endio_check: 22
> callbacks suppressed
> May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6):
> csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
> May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6):
> csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
> May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6):
> csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
> May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6):
> csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
> May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6):
> csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
>
>
> And I do get an i/o error for sha256sum and no hash is computed.
>
> But there's two important differences:
> 1. I have two unable to fixup messages, one for each device, at the
> exact same time.
> 2. I altered both copies of extent data.
>
> It's a mystery to me how your file data has not changed, but somehow
> the extent csum was changed but also the node csum was recomputed
> correctly. That's a bit odd.
I would think this would be perfectly possible if some other file that 
had a checksum in that node changed, thus forcing the node's checksum to 
be updated.  Theoretical sequence of events:
1. Some file which has a checksum in node A gets written to.
2. Node A is loaded into memory to update the checksum.
3. The new checksum for the changed extent in the file gets updated in 
the in-memory copy of node A.
4. Node A has its own checksum recomputed based on the new data, and
then gets saved to disk.
If something happened after 2 but before 4 that caused one of the other
checksums to go bad, then the checksum computed in 4 will have been
calculated over the corrupted data.
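
(If that sequence is what happened, the metadata should still look
internally consistent, because the node carrying the bad data csum got a
freshly computed, valid node csum. A minimal sketch of checking that,
assuming the filesystem is unmounted; the device name is a placeholder
and option availability depends on the btrfs-progs version:)

# metadata-only check, read-only by default: with this failure mode the
# trees should come back clean
btrfs check /dev/sdX1
# asking check to also verify file data against the stored csums should
# then flag the same extent that scrub complains about
btrfs check --check-data-csum /dev/sdX1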


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BTRFS Data at Rest File Corruption
  2016-05-16 11:33               ` Austin S. Hemmelgarn
@ 2016-05-16 21:20                 ` Richard A. Lochner
  2016-05-16 22:43                 ` Chris Murphy
  1 sibling, 0 replies; 22+ messages in thread
From: Richard A. Lochner @ 2016-05-16 21:20 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Chris Murphy; +Cc: Btrfs BTRFS

Chris/Austin,

Thank you both for your help.

The sequence of events described by Austin is the only sequence that
seems plausible, given what I have seen in the data (other
than an outright bug, which I think extremely unlikely).

I will be moving these drives soon to a new system with ECC memory.  I
will definitely let you both know if I encounter this problem again
after that.  I do not expect to.

If I was really adventurous, I would modify the code to attempt to
detect this and run the patched version on my system to see if it is
possible to detect (and maybe even correct) it as it happens.
 Unfortunately, that does not appear to be a trivial exercise.
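
(In the meantime, a minimal sketch of stress-testing the existing
non-ECC RAM from userspace; memtester and the size/pass values below are
suggestions, not something from this thread:)

# lock and pattern-test ~2 GiB of RAM for 3 passes (run as root so the
# memory can be mlock()ed); an offline memtest86+ run covers the rest
memtester 2G 3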

Rick Lochner

On Mon, 2016-05-16 at 07:33 -0400, Austin S. Hemmelgarn wrote:
> On 2016-05-16 02:07, Chris Murphy wrote:
> > 
> > Current hypothesis
> >  "I suspected, and I still suspect that the error occurred upon a
> > metadata update that corrupted the checksum for the file, probably
> > due
> > to silent memory corruption.  If the checksum was silently
> > corrupted,
> > it would be simply written to both drives causing this type of
> > error."
> > 
> > A metadata update alone will not change the data checksums.
> > 
> > But let's ignore that. If there's corrupt extent csum in a node
> > that
> > itself has a valid csum, this is functionally identical to e.g.
> > nerfing 100 bytes of a file's extent data (both copies,
> > identically).
> > The fs doesn't know the difference. All it knows is the node csum
> > is
> > valid, therefore the data extent csum is valid, and that's why it
> > assumes the data is wrong and hence you get an I/O error. And I can
> > reproduce most of your results by nerfing file data.
> > 
> > The entire dmesg for scrub looks like this:
> > 
> > 
> > May 15 23:29:46 f23s.localdomain kernel: BTRFS warning (device dm-
> > 6):
> > checksum error at logical 5566889984 on dev /dev/dm-6, sector
> > 8540160,
> > root 5, inode 258, offset 0, length 4096, links 1 (path:
> > openSUSE-Tumbleweed-NET-x86_64-Current.iso)
> > May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
> > bdev /dev/dm-6 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> > May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
> > unable to fixup (regular) error at logical 5566889984 on dev
> > /dev/dm-6
> > May 15 23:29:46 f23s.localdomain kernel: BTRFS warning (device dm-
> > 6):
> > checksum error at logical 5566889984 on dev /dev/mapper/VG-b1,
> > sector
> > 8579072, root 5, inode 258, offset 0, length 4096, links 1 (path:
> > openSUSE-Tumbleweed-NET-x86_64-Current.iso)
> > May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
> > bdev /dev/mapper/VG-b1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> > May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
> > unable to fixup (regular) error at logical 5566889984 on dev
> > /dev/mapper/VG-b1
> > 
> > And the entire dmesg for running sha256sum on the file is
> > 
> > May 15 23:33:41 f23s.localdomain kernel: __readpage_endio_check: 22
> > callbacks suppressed
> > May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-
> > 6):
> > csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
> > May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-
> > 6):
> > csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
> > May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-
> > 6):
> > csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
> > May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-
> > 6):
> > csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
> > May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-
> > 6):
> > csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
> > 
> > 
> > And I do get an i/o error for sha256sum and no hash is computed.
> > 
> > But there's two important differences:
> > 1. I have two unable to fixup messages, one for each device, at the
> > exact same time.
> > 2. I altered both copies of extent data.
> > 
> > It's a mystery to me how your file data has not changed, but
> > somehow
> > the extent csum was changed but also the node csum was recomputed
> > correctly. That's a bit odd.
> I would think this would be perfectly possible if some other file
> that 
> had a checksum in that node changed, thus forcing the node's checksum
> to 
> be updated.  Theoretical sequence of events:
> 1. Some file which has a checksum in node A gets written to.
> 2. Node A is loaded into memory to update the checksum.
> 3. The new checksum for the changed extent in the file gets updated
> in 
> the in-memory copy of node A.
> 4. Node A has it's own checksum recomputed based on the new data,
> and 
> then gets saved to disk.
> If something happened after 2 but before 4 that caused one of the
> other 
> checksums to go bad, then the checksum computed in 4 will have been
> with 
> the corrupted data.
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BTRFS Data at Rest File Corruption
  2016-05-16 11:33               ` Austin S. Hemmelgarn
  2016-05-16 21:20                 ` Richard A. Lochner
@ 2016-05-16 22:43                 ` Chris Murphy
  2016-05-16 23:44                   ` Richard A. Lochner
  1 sibling, 1 reply; 22+ messages in thread
From: Chris Murphy @ 2016-05-16 22:43 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Richard A. Lochner, Btrfs BTRFS

On Mon, May 16, 2016 at 5:33 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:

>
> I would think this would be perfectly possible if some other file that had a
> checksum in that node changed, thus forcing the node's checksum to be
> updated.  Theoretical sequence of events:
> 1. Some file which has a checksum in node A gets written to.
> 2. Node A is loaded into memory to update the checksum.
> 3. The new checksum for the changed extent in the file gets updated in the
> in-memory copy of node A.
> 4. Node A has it's own checksum recomputed based on the new data, and then
> gets saved to disk.
> If something happened after 2 but before 4 that caused one of the other
> checksums to go bad, then the checksum computed in 4 will have been with the
> corrupted data.
>

I'm pretty sure Qu had a suggestion that would mitigate this sort of
problem, where there'd be a CRC32C checksum for each data extent (?)
something like that anyway. There's enough room to stuff in more than
just a checksum per 4096 byte block. That way there's three checks,
and thus there's a way to break a tie.

But this has now happened to Richard twice. What are the chances of
this manifesting exactly the same way a second time? If the chance of
corruption is equal, I'd think the much, much larger footprint for
in-memory corruption is the data itself. Problem is, if the corruption
happens before the checksum is computed, the checksum would say the
data is valid. So the only way to test this would be passing all files
from this volume and a reference volume through a hash function and
comparing hashes, e.g. with rsync's -c option.
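
(A minimal sketch of that comparison, assuming a reference copy exists
somewhere else; both paths are placeholders:)

# dry-run, checksum-based comparison: list files whose content differs
# between the suspect volume and the reference, without changing either
rsync -rcn --itemize-changes /mnt/suspect/ /mnt/reference/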

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BTRFS Data at Rest File Corruption
  2016-05-16 22:43                 ` Chris Murphy
@ 2016-05-16 23:44                   ` Richard A. Lochner
  2016-05-17  3:42                     ` Chris Murphy
  0 siblings, 1 reply; 22+ messages in thread
From: Richard A. Lochner @ 2016-05-16 23:44 UTC (permalink / raw)
  To: Chris Murphy, Austin S. Hemmelgarn; +Cc: Btrfs BTRFS

Chris,

It has actually happened to me three times that I know of in ~7mos.,
but your point about the "larger footprint" for data corruption is a
good one.  No doubt I have silently experienced that too.  And, as you
suggest, there is no way to prevent those errors.  If the memory to be
written to disk gets corrupted before its checksum is calculated, the
data will be silently corrupted, period.

Clearly, I won't rely on this machine to produce any data directly that
I would consider important at this point.

One odd thing to me is that if this is really due to undetected memory
errors, I'd think this system would crash fairly often due to detected
"parity errors."  This system rarely crashes.  It often runs for
several months without an indication of problems.  
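
(Without ECC, most single-bit errors are invisible to the kernel, so the
absence of crashes does not say much. A minimal sketch of where a
hardware-detected memory or machine-check event would show up, assuming
a systemd journal as on Fedora:)

# silence here does not rule out undetected bit flips
journalctl -k | grep -iE 'mce|edac|hardware error'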

Rick Lochner


On Mon, 2016-05-16 at 16:43 -0600, Chris Murphy wrote:
> On Mon, May 16, 2016 at 5:33 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
> 
> > 
> > 
> > I would think this would be perfectly possible if some other file
> > that had a
> > checksum in that node changed, thus forcing the node's checksum to
> > be
> > updated.  Theoretical sequence of events:
> > 1. Some file which has a checksum in node A gets written to.
> > 2. Node A is loaded into memory to update the checksum.
> > 3. The new checksum for the changed extent in the file gets updated
> > in the
> > in-memory copy of node A.
> > 4. Node A has it's own checksum recomputed based on the new data,
> > and then
> > gets saved to disk.
> > If something happened after 2 but before 4 that caused one of the
> > other
> > checksums to go bad, then the checksum computed in 4 will have been
> > with the
> > corrupted data.
> > 
> I'm pretty sure Qu had a suggestion that would mitigate this sort of
> problem, where there'd be a CRC32C checksum for each data extent (?)
> something like that anyway. There's enough room to stuff in more than
> just a checksum per 4096 byte block. That way there's three checks,
> and thus there's a way to break a tie.
> 
> But this has now happened to Richard twice. What are the chances of
> this manifesting exactly the same way a second time? If the chance of
> corruption is equal, I'd think the much much larger footprint for
> in-memory corruption is data itself. Problem is, if the corruption
> happens before the checksum is computed, the checksum would say the
> data is valid. So the only way to test this would be passing all file
> from this volume and a reference volume through a hash function and
> comparing hashes, e.g. rsync -c option.
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BTRFS Data at Rest File Corruption
  2016-05-16 23:44                   ` Richard A. Lochner
@ 2016-05-17  3:42                     ` Chris Murphy
  2016-05-17 11:26                       ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 22+ messages in thread
From: Chris Murphy @ 2016-05-17  3:42 UTC (permalink / raw)
  To: Richard A. Lochner; +Cc: Chris Murphy, Austin S. Hemmelgarn, Btrfs BTRFS

On Mon, May 16, 2016 at 5:44 PM, Richard A. Lochner <lochner@clone1.com> wrote:
> Chris,
>
> It has actually happened to me three times that I know of in ~7mos.,
> but your point about the "larger footprint" for data corruption is a
> good one.  No doubt I have silently experienced that too.

I dunno, three is a lot: the exact same corruption only in
memory, then written out into two copies with valid node checksums, and
yet no other problems with a node item, or uuid, or xattr, or
any number of other item or object types, all of which get checksummed.
I suppose if the file system contains large files, the % of metadata
that's csums could be the 2nd largest footprint. But still.

Three times in 7 months, if it's really the same vector, is just short
of almost reproducible. Ha. It seems like if you merely balanced this
file system a few times, you'd eventually stumble on this. And if
that's true, then it's time for debug options, to see if it can be
caught in action, and whether there's a hardware or software
explanation for it.
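
(A minimal sketch of that kind of reproduction attempt; the mount point
is a placeholder, and a full balance of a multi-TiB pool takes a long
time:)

# rewrite every chunk through memory, then re-verify all checksums,
# a few times over (newer btrfs-progs may want --full-balance)
for i in 1 2 3; do
    btrfs balance start /mnt
    btrfs scrub start -Bd /mnt
done
btrfs device stats /mnt    # cumulative per-device error counters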


> And, as you
> suggest, there is no way to prevent those errors.  If the memory to be
> written to disk gets corrupted before its checksum is calculated, the
> data will be silently corrupted, period.

Well, no way in the present design, maybe.



>
> Clearly, I won't rely on this machine to produce any data directly that
> I would consider important at this point.
>
> One odd thing to me is that if this is really due to undetected memory
> errors, I'd think this system would crash fairly often due to detected
> "parity errors."  This system rarely crashes.  It often runs for
> several months without an indication of problems.

I think you'd have other problems. Only data csums are being corrupt
after they're read in, but before the node csum is computed? Three
times?  Pretty wonky.




-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: BTRFS Data at Rest File Corruption
  2016-05-17  3:42                     ` Chris Murphy
@ 2016-05-17 11:26                       ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 22+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-17 11:26 UTC (permalink / raw)
  To: Chris Murphy, Richard A. Lochner; +Cc: Btrfs BTRFS

On 2016-05-16 23:42, Chris Murphy wrote:
> On Mon, May 16, 2016 at 5:44 PM, Richard A. Lochner <lochner@clone1.com> wrote:
>> Chris,
>>
>> It has actually happened to me three times that I know of in ~7mos.,
>> but your point about the "larger footprint" for data corruption is a
>> good one.  No doubt I have silently experienced that too.
>
> I dunno three is a lot to have the exact same corruption only in
> memory then written out into two copies with valid node checksums; and
> yet not have other problems, like a node item, or uuid, or xattr or
> any number of other item or object types all of which get checksummed.
> I suppose if the file system contains large files, the % of metadata
> that's csums could be the 2nd largest footprint. But still.
Assuming that the workload on the volume is mostly backup images like 
the file that originally sparked this discussion, then inodes, xattrs, 
and even UUIDs would be nowhere near as common as metadata blocks just 
containing checksums.  The fact that this hasn't hit any metadata 
checksums is unusual, but not impossible.
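
(Rough numbers behind that: with the default crc32c and 4 KiB sectors,
btrfs stores 4 bytes of data checksum per 4 KiB block, so for a volume
full of large files the csum items dwarf everything else in metadata. A
quick sketch:)

# bytes of data checksums per TiB of file data
echo $(( (1024**4 / 4096) * 4 ))    # 1073741824, i.e. about 1 GiB per TiB
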
>
> Three times in 7 months, if it's really the same vector, is just short
> of almost reproducible. Ha. It seems like if you merely balanced this
> file system a few times, you'd eventually stumble on this. And if
> that's true, then it's time for debug options and see if it can be
> caught in action, and whether there's a hardware or software
> explanation for it.
>
>
>> And, as you
>> suggest, there is no way to prevent those errors.  If the memory to be
>> written to disk gets corrupted before its checksum is calculated, the
>> data will be silently corrupted, period.
>
> Well, no way in the present design, maybe.
If the RAM is bad, there is no way we can completely protect user data, 
period.  We can try to mitigate certain situations, but we cannot 
protect against all forms of memory corruption.
>
>>
>> Clearly, I won't rely on this machine to produce any data directly that
>> I would consider important at this point.
>>
>> One odd thing to me is that if this is really due to undetected memory
>> errors, I'd think this system would crash fairly often due to detected
>> "parity errors."  This system rarely crashes.  It often runs for
>> several months without an indication of problems.
>
> I think you'd have other problems. Only data csums are being corrupt
> after they're read in, but before the node csum is computed? Three
> times?  Pretty wonky.
Running regularly for several months without ECC RAM may be part of the 
issue.  Minute electrical instabilities build up over time, as do 
instabilities caused by background radiation, and beyond a certain point 
(which is based on more factors than are practical to compute), you end 
up almost certain to have at least a single bit error.

On that note, I'd actually be curious to see how far off the checksum is 
(how many bits aren't correct).  Given that there are no other visible 
issues with the system, I'd expect it to only be one or at most two bits 
that are incorrect.
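
(One way to answer that from the warnings already posted in this thread:
XOR the computed and expected csum values and count the differing bits.
A small bash sketch using the two numbers from the earlier sdb1
warnings; note it measures the distance between the checksums, not
between the data blocks. A flip in the stored csum itself would give a
distance of 1, while a flip in the data changes the crc32c in an
essentially unrelated way:)

# how many bits differ between the computed and the expected csum?
x=$(( 1689728329 ^ 2165338402 ))
bits=0
while (( x )); do (( bits += x & 1, x >>= 1 )); done
echo "$bits differing bits"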


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2016-05-17 11:26 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-11 18:36 BTRFS Data at Rest File Corruption Richard Lochner
2016-05-11 19:01 ` Roman Mamedov
2016-05-11 19:26 ` Austin S. Hemmelgarn
2016-05-12 17:49   ` Richard A. Lochner
2016-05-12 18:29     ` Austin S. Hemmelgarn
2016-05-12 21:53       ` Goffredo Baroncelli
2016-05-12 23:15       ` Richard A. Lochner
2016-05-13  1:41     ` Chris Murphy
2016-05-13  4:49       ` Richard A. Lochner
2016-05-13 17:46         ` Chris Murphy
2016-05-15 18:43           ` Richard A. Lochner
2016-05-16  6:07             ` Chris Murphy
2016-05-16 11:33               ` Austin S. Hemmelgarn
2016-05-16 21:20                 ` Richard A. Lochner
2016-05-16 22:43                 ` Chris Murphy
2016-05-16 23:44                   ` Richard A. Lochner
2016-05-17  3:42                     ` Chris Murphy
2016-05-17 11:26                       ` Austin S. Hemmelgarn
2016-05-13 16:28   ` Goffredo Baroncelli
2016-05-13 16:54     ` Austin S. Hemmelgarn
2016-05-12  6:49 ` Chris Murphy
     [not found] ` <CAAuLxcaQ1Uo+pff9AtD74UwUvo5yYKBuNLwKzjVMWV1kt2DcRQ@mail.gmail.com>
2016-05-12 18:26   ` Richard A. Lochner
