Linux-BTRFS Archive on lore.kernel.org
From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>,
	Christoph Anton Mitterer <calestyo@scientia.net>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
Date: Fri, 8 Mar 2019 07:20:04 -0500
Message-ID: <b701dc28-a703-7e7c-2ced-aad07e70850a@gmail.com> (raw)
In-Reply-To: <20190307200712.GG23918@hungrycats.org>

On 2019-03-07 15:07, Zygo Blaxell wrote:
> On Mon, Mar 04, 2019 at 04:34:39PM +0100, Christoph Anton Mitterer wrote:
>> Hey.
>>
>>
>> Thanks for your elaborate explanations :-)
>>
>>
>> On Fri, 2019-02-15 at 00:40 -0500, Zygo Blaxell wrote:
>>> The problem occurs only on reads.  Data that is written to disk will
>>> be OK, and can be read correctly by a fixed kernel.
>>>
>>> A kernel without the fix will give corrupt data on reads with no
>>> indication of corruption other than the changes to the data itself.
>>>
>>> Applications that copy data may read corrupted data and write it back
>>> to the filesystem.  This will make the corruption permanent in the
>>> copied data.
>>
>> So that basically means even a cp (without refcopy) or a btrfs
>> send/receive could already cause permanent silent data corruption.
>> Of course, only if the conditions you've described below are met.
>>
>>
>>> Given the age of the bug
>>
>> Since when was it in the kernel?
> 
> Since at least 2015.  Note that if you are looking for an end date for
> "clean" data, you may be disappointed.
> 
> In 2016 there were two kernel bugs that silently corrupted reads of
> compressed data.  In 2015 there were...4?  5?  Before 2015 the problems
> are worse, also damaging on-disk compressed data and crashing the kernel.
> The bugs that were present in 2014 were present since compression was
> introduced in 2008.
> 
> With this last fix, as far as I know, we have a kernel that can read
> compressed data without corruption for the first time--at least for a
> subset of use cases that doesn't include direct IO.  Of course I thought
> the same thing in 2017, too, but I have since proven myself wrong.
> 
> When btrfs gets to the point where it doesn't fail backup verification for
> some contiguous years, then I'll be satisfied btrfs (or any filesystem)
> is properly debugged.  I'll still run backup verification then, of
> course--hardware breaks all the time, and broken hardware can corrupt
> any data it touches.  Verification failures point to broken hardware
> much more often than btrfs data corruption bugs.
> 
>>> Even if compression is enabled, the file data must be compressed for
>>> the bug to corrupt it.
>>
>> Is there a simple way to find files (i.e. pathnames) that were actually
>> compressed?
> 
> Run compsize (sometimes the package is named btrfs-compsize) and see if
> there are any lines referring to zlib, zstd, or lzo in the output.
> If it's all "total" and "none" then there's no compression in that file.
> 
> filefrag -v reports non-inline compressed data extents with the "encoded"
> flag, so
> 
> 	if filefrag -v "$file" | grep -qw encoded; then
> 		echo "$file" is compressed, do something here
> 	fi
> 
> might also be a solution (assuming your filename doesn't include the
> string 'encoded').
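If the filename caveat matters, one possible workaround is to look only
at the extent rows of the filefrag output.  This is just a sketch: it
assumes extent rows start with the extent index followed by a colon
(which is how the filefrag -v output I'm familiar with looks), and the
function names are made up for illustration.  A file whose name itself
looks like an extent row would still fool it.

```shell
# has_encoded_extent: read `filefrag -v` output on stdin and succeed if
# any extent row (a line starting with an extent index and a colon)
# mentions the "encoded" flag.  The header and summary lines, which
# contain the file name, do not start that way and are ignored.
has_encoded_extent() {
    grep -Eq '^[[:space:]]*[0-9]+:.*encoded'
}

# is_compressed FILE: hypothetical wrapper that runs filefrag on FILE.
is_compressed() {
    filefrag -v "$1" | has_encoded_extent
}

# Possible usage:
#   if is_compressed "$file"; then
#       echo "$file is compressed"
#   fi
```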
> 
>>> 	- you never punch holes in files
>>
>> Is there any "standard application" (like cp, tar, etc.) that would do
>> this?
> 
> Legacy POSIX doesn't have the hole-punching concept, so legacy
> tools won't do it; however, people add features to GNU tools all the
> time, so it's hard to be 100% sure without downloading the code and
> reading/auditing/scanning it.  I'm 99% sure cp and tar are OK.
They are.  The only thing they do with sparse files is create new
ones from scratch using the standard seek-then-write method.  The same
is true of the vast majority of applications.  What most people
would actually have to worry about largely comes down to:

* VM software.  Some hypervisors, such as QEMU, can be configured to
translate discard commands issued against the emulated block devices into
fallocate calls that punch holes in the VM disk image file (QEMU can
also be configured to do the same for block writes of null bytes),
though I know of none that do this by default.
* Database software.  Hole punching was originally introduced for this
kind of workload, so it's an obvious potential source of this issue.
* FUSE filesystem drivers.  Most of those that support the fallocate
flag required to punch holes pass it down directly.  Some also use
it themselves.
* Userspace distributed storage systems, such as Ceph or Gluster.
The same arguments apply as for FUSE filesystem drivers.
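
To make the distinction between the two cases concrete, here is a rough
shell sketch.  The function names are hypothetical, fallocate here is
the util-linux command-line tool, and punching a hole only works on
filesystems that support it.

```shell
# make_sparse FILE GAP: the classic seek-then-write method.  Write
# "head" at offset 0, seek to byte offset GAP, and write "tail" there.
# The skipped range is never written, so the filesystem is free to
# leave it as a hole.  This only creates a new sparse file.
make_sparse() {
    printf 'head' > "$1"
    printf 'tail' | dd of="$1" bs=1 seek="$2" conv=notrunc 2>/dev/null
}

# punch_hole FILE OFFSET LENGTH: deallocate a byte range inside an
# EXISTING file without changing its size.  This is the kind of
# operation the software listed above may perform.
punch_hole() {
    fallocate --punch-hole --offset "$2" --length "$3" "$1"
}

# Possible usage:
#   make_sparse /tmp/sparse.img 1048576
#   punch_hole  /tmp/sparse.img 0 4096
```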
> 
>> What do you mean by clone? refcopy? Would btrfs snapshots or btrfs
>> send/receive be affected?
> 
> clone is part of some file operation syscalls (e.g. clone_file_range,
> dedupe_range) which make two different files, or two different offsets in
> the same file, refer to the same physical extent.  This is the basis of
> deduplication (replacing separate copies with references to a single
> copy) and also of punching holes (a single reference is split into
> two references to the original extent with a hole object inserted in
> the middle).
> 
> "reflink copy" is a synonym for "cp --reflink", which is clone_file_range
> using 0 as the start of range and EOF as the end.  The term 'reflink'
> is sometimes used to refer to any extent shared between files that is
> not the result of a snapshot.  reflink is to extents what a hardlink is
> to inodes, if you ignore some details.
> 
> To trigger the bug you need to clone the same compressed source range
> to two nearly adjacent locations in the destination file (i.e. two or
> more ranges in the source overlap).  cp --reflink never overlaps ranges,
> so it can't create the extent pattern that triggers this bug *by itself*.
> 
> If the source file already has extent references arranged in a way
> that triggers the bug, then the copy made with cp --reflink will copy
> the arrangement to the new file (i.e. if you upgrade the kernel, you
> can correctly read both copies, and if you don't upgrade the kernel,
> both copies will appear to be corrupted, probably the same way).
> 
> I would expect btrfs receive may be affected, but I did not find any
> code in receive that would be affected.  There are a number of different
> ways to make a file with a hole in it, and btrfs receive could use a
> different one not affected by this bug.  I don't use send/receive myself,
> so I don't have historical corruption data to guess from.
> 
>> Or is there anything in btrfs itself which does any of the two per
>> default or on a typical system (i.e. I didn't use dedupe).
> 
> 'btrfs' (the command-line utility) doesn't do these operations as far
> as I can tell.  The kernel only does these when requested by applications.
The receive command will issue clone operations if the sent subvolume
requires them to reproduce the correct block layout, so there is a
'regular' BTRFS operation that can, in theory, set things up such that
the required patterns are more likely to occur.
> 
>> Also, did the bug only affect data, or could metadata also be
>> affected... basically should such filesystems be re-created since they
>> may also hold corruptions in the meta-data like trees and so on?
> 
> Metadata is not affected by this bug.  The bug only corrupts btrfs data
> (specifically, the contents of files) in memory, not on disk.
> 
>> My scenario looks about the following, and given your explanations, I'd
>> assume I should probably be safe:
>>
>> - my normal laptop doesn't use compress, so it's safe anyway
>>
>> - my cp has an alias to always have --reflink=auto
>>
>> - two 8TB data archive disks, each with two backup disks to which the
>>    data of the two master disks is btrfs sent/received,... which were
>>    all mounted with compress
>>
>>
>> - typically I either cp or mv data from the laptop to these disks,
>>    => should then be safe as the laptop fs didn't use compress,...
>>
>> - or I directly create the files on the data disks (which use compress)
>>    by means of wget, scp or similar from other sources
>>    => should be safe, too, as they probably don't do dedupe/hole
>>       punching by default
>>
>> - or I cp/mv from the camera SD cards, which use some *FAT
>>    => so again I'd expect that to be fine
>>
>> - on vacation I had the case that I put large amount of picture/videos
>>    from SD cards to some btrfs-with-compress mobile HDDs, and back home
>>    from these HDDs to my actual data HDDs.
>>    => here I do have the read / re-write pattern, so data could have
>>       been corrupted if it was compressed + deduped/hole-punched
>>       I'd guess that's anyway not the case (JPEGs/MPEGs don't compress
>>       well)... and AFAIU there would be no deduping/hole-punching
>>       involved here
> 
> dedupe doesn't happen by itself on btrfs.  You have to run dedupe
> userspace software (e.g. duperemove, bees, dduper, rmlint, jdupes, bedup,
> etc...) or build a kernel with dedupe patches.
> 
>> - on my main data disks, I do snapshots... and these snapshots I
>>    send/receive to the other (also compress-mounted) btrfs disks.
>>    => could these operations involve deduping/hole-punching and thus the
>>       corruption?
> 
> Snapshots won't interact with the bug--they are not affected by it
> and will not trigger it.  Send could transmit incorrect data (if it
> uses the kernel's readpages path internally, I don't know if it does).
> Receive seems not to be affected (though it will not detect incorrect
> data from send).
> 
>> Another thing:
>> I always store SHA512 hashsums of files as an XATTR of them (like
>> "directly after" creating such files).
>> I assume there would be no deduping/hole-punching involved till then,
>> so the sums should be from correct data, right?
> 
> There's no assurance of that with this method.  It's highly likely that
> the hashes match the input data, because the file will usually be cached
> in host RAM from when it was written, so the bug has no opportunity to
> appear.  It's not impossible for other system activity to evict those
> cached pages between the copy and hash, so the hash function might reread
> the data from disk again and thus be exposed to the bug.
> 
> Contrast with a copy tool which integrates the SHA512 function, so
> the SHA hash and the copy consume their data from the same RAM buffers.
> This reduces the risk of undetected error but still does not eliminate it.
> A DRAM access failure could corrupt either the data or SHA hash but not
> both, so the hash will fail verification later, but you won't know if
> the hash is incorrect or the data.
> 
> If the source filesystem is not btrfs (and therefore cannot have this
> btrfs bug), you can calculate the SHA512 from the source filesystem and
> copy that to the xattr on the btrfs filesystem.  That reduces the risk
> pool for data errors to the host RAM and CPU, the source filesystem,
> and the storage stack below the source filesystem (i.e.  the generic
> set of problems that can occur on any system at any time and corrupt
> data during copy and hash operations).
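A minimal sketch of the "hash and copy from the same stream" idea, using
tee so that the bytes read from the source are both written to the
destination and fed to sha512sum.  The function name, the example paths
and the xattr name are assumptions for illustration, and error handling
is left out.

```shell
# copy_with_hash SRC DST: copy SRC to DST and print the SHA512 (hex) of
# the bytes that were read from SRC.  The destination is never re-read,
# so the hash does not depend on reading back from the target filesystem.
copy_with_hash() {
    tee "$2" < "$1" | sha512sum | cut -d' ' -f1
}

# Possible usage (hypothetical paths and xattr name):
#   hash=$(copy_with_hash /mnt/sdcard/IMG_0001.JPG /mnt/archive/IMG_0001.JPG)
#   setfattr -n user.sha512 -v "$hash" /mnt/archive/IMG_0001.JPG
```

This does not protect against RAM or CPU errors during the copy; it only
takes the destination filesystem's read path out of the picture.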
> 
>> But when I e.g. copy data from SD, to mobile btrfs-HDD and then to the
>> final archive HDD... corruption could in principle occur when copying
>> from mobile HDD to archive HDD.
>> In that case, would a diff between the two show me the corruption? I
>> guess not because the diff would likely get the same corruption on
>> read?
> 
> Upgrade your kernel before doing any verification activity; otherwise
> you'll just get false results.
> 
> If you try to replace the data before upgrading the kernel, you're more
> likely to introduce new corruption where corruption did not exist before,
> or convert transient corruption events into permanent data corruption.
> You might even miss corrupted data because the bug tends to corrupt data
> in a consistent way.
> 
> Once you have a kernel with the fix applied, diff will show any corruption
> in file copies, though 'cmp -l' might be much faster than diff on large
> binary files.  Use just 'cmp' if you only want to know if any difference
> exists but don't need detailed information, or 'cmp -s' in a shell script.
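For comparing a whole directory of copies, the 'cmp -s' approach could
be wrapped roughly like this.  It's a sketch with a made-up function
name, it assumes file names contain no newlines, and (as noted above)
the result only means something on a kernel that has the fix.

```shell
# verify_tree SRC_DIR DST_DIR: compare every regular file under SRC_DIR
# with the file at the same relative path under DST_DIR.  Prints one
# "MISMATCH: path" line per differing or missing file and returns
# nonzero if there was at least one.
verify_tree() {
    ( cd "$1" && find . -type f ) | (
        rc=0
        while IFS= read -r rel; do
            if ! cmp -s "$1/$rel" "$2/$rel"; then
                printf 'MISMATCH: %s\n' "$rel"
                rc=1
            fi
        done
        exit "$rc"
    )
}

# Possible usage (hypothetical mount points):
#   verify_tree /mnt/mobile/photos /mnt/archive/photos
```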
> 
>> [...]
>> I assume normal mv of refcopy (i.e. cp --reflink=auto) would not punch
>> holes and thus be not affected?
>>
>> Further, I'd assume XATTRs couldn't be affected?
> 
> XATTRs aren't compressed file data, so they aren't affected by this bug
> which only affects compressed file data.
> 
>> So what remains unanswered is send/receive:
>>
>>> btrfs send and receive may be affected, but I don't use them so I
>>> don't have any experience of the bug related to these tools.  It seems
>>> from reading the btrfs receive code that it lacks any code capable of
>>> punching a hole, but I'm only doing a quick search for words like
>>> "punch", not a detailed code analysis.
>>
>> Is there some other developer who possibly knows whether send/receive
>> would have been vulnerable to the issue?
>>
>>
>> But since I use send/receive anyway in just one direction from the
>> master to the backup disks... only the latter could be affected.
> 
> I presume from this line of questioning that you are not in the habit
> of verifying the SHA512 hashes on your data every few weeks or months.
> If you had that step in your scheduled backup routine, then you would
> already be aware of data corruption bugs that affect you--or you'd
> already be reasonably confident that this bug has no impact on your setup.
> 
> If you had asked questions like "is this bug the reason why I've been
> seeing random SHA hash verification failures for several years?" then
> you should worry about this bug; otherwise, it probably didn't affect you.
> 
>> Thanks,
>> Chris.
>>
>>



