From: Filipe Manana <fdmanana@gmail.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: Christoph Anton Mitterer <calestyo@scientia.net>,
	linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition" still works on 4.20.7
Date: Fri, 8 Mar 2019 10:37:14 +0000
Message-ID: <CAL3q7H599_kDPDFHMRqSwNLXKtANC7aaqBe+Hm6JjpR37uS1Vg@mail.gmail.com>
In-Reply-To: <20190307200712.GG23918@hungrycats.org>

On Thu, Mar 7, 2019 at 8:14 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Mon, Mar 04, 2019 at 04:34:39PM +0100, Christoph Anton Mitterer wrote:
> > Hey.
> >
> >
> > Thanks for your elaborate explanations :-)
> >
> >
> > On Fri, 2019-02-15 at 00:40 -0500, Zygo Blaxell wrote:
> > > The problem occurs only on reads.  Data that is written to disk will
> > > be OK, and can be read correctly by a fixed kernel.
> > >
> > > A kernel without the fix will give corrupt data on reads with no
> > > indication of corruption other than the changes to the data itself.
> > >
> > > Applications that copy data may read corrupted data and write it back
> > > to the filesystem.  This will make the corruption permanent in the
> > > copied data.
> >
> > So that basically means even a cp (without refcopy) or a btrfs
> > send/receive could already cause permanent silent data corruption.
> > Of course, only if the conditions you've described below are met.
> >
> >
> > > Given the age of the bug
> >
> > Since when was it in the kernel?
>
> Since at least 2015.  Note that if you are looking for an end date for
> "clean" data, you may be disappointed.

It's been around since compression was introduced (October 2008).
The readahead path was buggy for the case where the same compressed extent
is shared consecutively. I fixed two bugs there back in 2015, but missed the
case where a hole causes the compressed extent to be shared with a non-zero
start offset, which is the case that was fixed recently.

>
> In 2016 there were two kernel bugs that silently corrupted reads of
> compressed data.  In 2015 there were...4?  5?  Before 2015 the problems
> were worse, also damaging on-disk compressed data and crashing the kernel.
> The bugs that were present in 2014 were present since compression was
> introduced in 2008.
>
> With this last fix, as far as I know, we have a kernel that can read
> compressed data without corruption for the first time--at least for a
> subset of use cases that doesn't include direct IO.  Of course I thought
> the same thing in 2017, too, but I have since proven myself wrong.
>
> When btrfs gets to the point where it doesn't fail backup verification for
> some consecutive years, then I'll be satisfied btrfs (or any filesystem)
> is properly debugged.  I'll still run backup verification then, of
> course--hardware breaks all the time, and broken hardware can corrupt
> any data it touches.  Verification failures point to broken hardware
> much more often than btrfs data corruption bugs.
>
> > > Even
> > > if
> > > compression is enabled, the file data must be compressed for the bug
> > > to
> > > corrupt it.
> >
> > Is there a simple way to find files (i.e. pathnames) that were actually
> > compressed?
>
> Run compsize (sometimes the package is named btrfs-compsize) and see if
> there are any lines referring to zlib, zstd, or lzo in the output.
> If it's all "total" and "none" then there's no compression in that file.
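>
> For example (the output shown is illustrative, not from a real run):
>
>         $ compsize /path/to/file
>         Type       Perc     Disk Usage   Uncompressed Referenced
>         TOTAL       52%      4.1M          8.0M          8.0M
>         zlib        52%      4.1M          8.0M          8.0M
>
> Any zlib, zstd, or lzo row means at least part of the file is stored
> compressed.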
>
> filefrag -v reports non-inline compressed data extents with the "encoded"
> flag, so
>
>         if filefrag -v "$file" | grep -qw encoded; then
>                 echo "$file" is compressed, do something here
>         fi
>
> might also be a solution (assuming your filename doesn't include the
> string 'encoded').
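>
> A sketch that avoids the filename caveat (untested; it drops the header
> and trailing summary lines, which are the only lines that contain the
> filename):
>
>         if filefrag -v "$file" | sed -e '1,3d' -e '$d' | grep -qw encoded; then
>                 echo "$file" has compressed extents
>         fi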
>
> > >     - you never punch holes in files
> >
> > Is there any "standard application" (like cp, tar, etc.) that would do
> > this?
>
> Legacy POSIX doesn't have the hole-punching concept, so legacy
> tools won't do it; however, people add features to GNU tools all the
> time, so it's hard to be 100% sure without downloading the code and
> reading/auditing/scanning it.  I'm 99% sure cp and tar are OK.
>
> > What do you mean by clone? refcopy? Would btrfs snapshots or btrfs
> > send/receive be affected?
>
> clone is part of some file operation syscalls (e.g. clone_file_range,
> dedupe_range) which make two different files, or two different offsets in
> the same file, refer to the same physical extent.  This is the basis of
> deduplication (replacing separate copies with references to a single
> copy) and also of punching holes (a single reference is split into
> two references to the original extent with a hole object inserted in
> the middle).
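>
> Both operations are reachable from ordinary tools, e.g. (offsets here
> are invented for illustration):
>
>         xfs_io -c "reflink src 0 0 128k" dst                  # clone src's extent into dst
>         fallocate --punch-hole --offset 64k --length 4k dst   # split a reference, insert a hole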
>
> "reflink copy" is a synonym for "cp --reflink", which is clone_file_range
> using 0 as the start of range and EOF as the end.  The term 'reflink'
> is sometimes used to refer to any extent shared between files that is
> not the result of a snapshot.  reflink is to extents what a hardlink is
> to inodes, if you ignore some details.
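>
> For example:
>
>         cp --reflink=always bigfile bigfile.clone   # shares extents, copies no file data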
>
> To trigger the bug you need to clone the same compressed source range
> to two nearly adjacent locations in the destination file (i.e. two or
> more ranges in the source overlap).  cp --reflink never overlaps ranges,
> so it can't create the extent pattern that triggers this bug *by itself*.
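>
> A rough sketch of the triggering pattern (simplified, not the full
> reproducer from the original post; sizes and the filename are invented,
> and the filesystem must be mounted with compress):
>
>         xfs_io -f -c "pwrite -S 0x61 0 128k" -c "fsync" f   # easily compressible data
>         xfs_io -c "reflink f 0 128k 64k" f                  # clone the source range once...
>         xfs_io -c "reflink f 0 256k 64k" f                  # ...and again nearby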
>
> If the source file already has extent references arranged in a way
> that triggers the bug, then the copy made with cp --reflink will copy
> the arrangement to the new file (i.e. if you upgrade the kernel, you
> can correctly read both copies, and if you don't upgrade the kernel,
> both copies will appear to be corrupted, probably the same way).
>
> I would have expected btrfs receive to be affected, but I did not find any
> code in receive that would be affected.  There are a number of different
> ways to make a file with a hole in it, and btrfs receive could use a
> different one not affected by this bug.  I don't use send/receive myself,
> so I don't have historical corruption data to guess from.
>
> > Or is there anything in btrfs itself which does any of the two per
> > default or on a typical system (i.e. I didn't use dedupe).
>
> 'btrfs' (the command-line utility) doesn't do these operations as far
> as I can tell.  The kernel only does these when requested by applications.
>
> > Also, did the bug only affect data, or could metadata also be
> > affected... basically should such filesystems be re-created since they
> > may also hold corruptions in the meta-data like trees and so on?
>
> Metadata is not affected by this bug.  The bug only corrupts btrfs data
> (specifically, the contents of files) in memory, not on disk.
>
> > My scenario looks about the following, and given your explanations, I'd
> > assume I should probably be safe:
> >
> > - my normal laptop doesn't use compress, so it's safe anyway
> >
> > - my cp has an alias to always have --reflink=auto
> >
> > - two 8TB data archive disks, each with two backup disks to which the
> >   data of the two master disks is btrfs sent/received,... which were
> >   all mounted with compress
> >
> >
> > - typically I either cp or mv data from the laptop to these disks,
> >   => should then be safe as the laptop fs didn't use compress,...
> >
> > - or I directly create the files on the data disks (which use compress)
> >   by means of wget, scp or similar from other sources
> >   => should be safe, too, as they probably don't do dedupe/hole
> >      punching by default
> >
> > - or I cp/mv data from the camera SD cards, which use some *FAT
> >   => so again I'd expect that to be fine
> >
> > - on vacation I had the case that I put large amount of picture/videos
> >   from SD cards to some btrfs-with-compress mobile HDDs, and back home
> >   from these HDDs to my actual data HDDs.
> >   => here I do have the read / re-write pattern, so data could have
> >      been corrupted if it was compressed + deduped/hole-punched
> >      I'd guess that's not the case anyway (JPEGs/MPEGs don't compress
> >      well)... and AFAIU there would be no deduping/hole-punching
> >      involved here
>
> dedupe doesn't happen by itself on btrfs.  You have to run dedupe
> userspace software (e.g. duperemove, bees, dduper, rmlint, jdupes, bedup,
> etc...) or build a kernel with dedupe patches.
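>
> A typical duperemove invocation looks like this (flags per its man
> page; the path is a placeholder):
>
>         duperemove -dr /mnt/data    # -d actually submits dedupes, -r recurses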
>
> > - on my main data disks, I do snapshots... and these snapshots I
> >   send/receive to the other (also compress-mounted) btrfs disks.
> >   => could these operations involve deduping/hole-punching and thus the
> >      corruption?
>
> Snapshots won't interact with the bug--they are not affected by it
> and will not trigger it.  Send could transmit incorrect data (if it
> uses the kernel's readpages path internally; I don't know whether it does).
> Receive seems not to be affected (though it will not detect incorrect
> data from send).
>
> > Another thing:
> > I always store SHA512 hashsums of files as an XATTR of them (like
> > "directly after" creating such files).
> > I assume there would be no deduping/hole-punching involved till then,
> > so the sums should be from correct data, right?
>
> There's no assurance of that with this method.  It's highly likely that
> the hashes match the input data, because the file will usually be cached
> in host RAM from when it was written, so the bug has no opportunity to
> appear.  It's not impossible for other system activity to evict those
> cached pages between the copy and the hash, so the hash function might
> reread the data from disk and thus be exposed to the bug.
>
> Contrast with a copy tool which integrates the SHA512 function, so
> the SHA hash and the copy consume their data from the same RAM buffers.
> This reduces the risk of undetected error but still does not eliminate it.
> A DRAM access failure could corrupt either the data or the SHA hash (but
> not both), so the hash will fail verification later, though you won't
> know whether the hash or the data is incorrect.
>
> If the source filesystem is not btrfs (and therefore cannot have this
> btrfs bug), you can calculate the SHA512 from the source filesystem and
> copy that to the xattr on the btrfs filesystem.  That reduces the risk
> pool for data errors to the host RAM and CPU, the source filesystem,
> and the storage stack below the source filesystem (i.e.  the generic
> set of problems that can occur on any system at any time and corrupt
> data during copy and hash operations).
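>
> A minimal sketch of that approach (paths and the xattr name are
> invented; setfattr comes from the attr package):
>
>         src=/mnt/source/file; dst=/mnt/btrfs/file
>         cp "$src" "$dst"
>         setfattr -n user.sha512 -v "$(sha512sum < "$src" | cut -d' ' -f1)" "$dst"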
>
> > But when I e.g. copy data from SD, to mobile btrfs-HDD and then to the
> > final archive HDD... corruption could in principle occur when copying
> > from mobile HDD to archive HDD.
> > In that case, would a diff between the two show me the corruption? I
> > guess not because the diff would likely get the same corruption on
> > read?
>
> Upgrade your kernel before doing any verification activity; otherwise
> you'll just get false results.
>
> If you try to replace the data before upgrading the kernel, you're more
> likely to introduce new corruption where corruption did not exist before,
> or convert transient corruption events into permanent data corruption.
> You might even miss corrupted data because the bug tends to corrupt data
> in a consistent way.
>
> Once you have a kernel with the fix applied, diff will show any corruption
> in file copies, though 'cmp -l' might be much faster than diff on large
> binary files.  Use just 'cmp' if you only want to know if any difference
> exists but don't need detailed information, or 'cmp -s' in a shell script.
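>
> e.g. (filenames invented):
>
>         if ! cmp -s original.copy backup.copy; then
>                 cmp -l original.copy backup.copy | head   # first differing bytes
>         fi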
>
> >[...]
> > I assume normal mv of refcopy (i.e. cp --reflink=auto) would not punch
> > holes and thus be not affected?
> >
> > Further, I'd assume XATTRs couldn't be affected?
>
> XATTRs aren't compressed file data, so they aren't affected by this bug,
> which only corrupts compressed file data.
>
> > So what remains unanswered is send/receive:
> >
> > > btrfs send and receive may be affected, but I don't use them so I
> > > don't
> > > have any experience of the bug related to these tools.  It seems from
> > > reading the btrfs receive code that it lacks any code capable of
> > > punching
> > > a hole, but I'm only doing a quick search for words like "punch", not
> > > a detailed code analysis.
> >
> > Is there some other developer who possibly knows whether send/receive
> > would have been vulnerable to the issue?
> >
> >
> > But since I use send/receive anyway in just one direction from the
> > master to the backup disks... only the later could be affected.
>
> I presume from this line of questioning that you are not in the habit
> of verifying the SHA512 hashes on your data every few weeks or months.
> If you had that step in your scheduled backup routine, then you would
> already be aware of data corruption bugs that affect you--or you'd
> already be reasonably confident that this bug has no impact on your setup.
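>
> A sketch of such a scheduled pass, assuming the hashes live in a
> user.sha512 xattr as you describe (untested):
>
>         find /mnt/data -type f -print0 | while IFS= read -r -d '' f; do
>                 stored=$(getfattr --only-values -n user.sha512 "$f" 2>/dev/null) || continue
>                 [ "$stored" = "$(sha512sum < "$f" | cut -d' ' -f1)" ] || echo "MISMATCH: $f"
>         done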
>
> If you had asked questions like "is this bug the reason why I've been
> seeing random SHA hash verification failures for several years?" then
> you should worry about this bug; otherwise, it probably didn't affect you.
>
> > Thanks,
> > Chris.
> >
> >



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”
