On Thu, Feb 14, 2019 at 01:21:29PM +0100, Christoph Anton Mitterer wrote:
> On Thu, 2019-02-14 at 01:22 +0000, Filipe Manana wrote:
> > The following one liner fixes it:
> > https://friendpaste.com/22t4OdktHQTl0aMGxcWLj3
>
> Great to see that fixed... is there any advise that can be given for
> users/admins?
>
> Like whether and how any occurred corruptions can be detected (right
> now, people may still have backups)?

The problem occurs only on reads.  Data that is written to disk will be
OK, and can be read correctly by a fixed kernel.  A kernel without the
fix will return corrupt data on reads, with no indication of corruption
other than the changes to the data itself.

Applications that copy data may read corrupted data and write it back
to the filesystem.  This makes the corruption permanent in the copied
data.  Given the age of the bug, backups that can be corrupted by this
bug probably already are.  Verify files against internal CRC/hashes
where possible.  The original files are likely to be OK, since the bug
does not affect writes.

If your situation has the risk factors listed below, it may be
worthwhile to create a fresh set of non-incremental backups after
applying the kernel fix.

> Or under which exact circumstances did the corruption happen? And under
> which was one safe?

Compression is required to trigger the bug, so you are safe if you (or
the applications you run) never enabled filesystem compression.  Even
with compression enabled, the file data must actually be compressed for
the bug to corrupt it.  Incompressible data extents will never be
affected by this bug.

If you do use compression, you are still safe if:

	- you never punch holes in files

	- you never dedupe or clone files

If you do use compression and also do the other things, the probability
of corruption by this particular bug is non-zero.  Whether you get
corruption, and how often, depends on the technical details of what
you're doing.
To get corruption you have to have one data extent that is split in two
parts by punching a hole, or an extent that is cloned/deduped in two
parts to adjacent logical offsets in the same file.  Both of these
methods create the on-disk pattern which triggers the bug.

Files that consist entirely of unique data will not be affected by
dedupe, so they will not trigger the bug that way.  Files that consist
partially of unique data may or may not be affected, depending on the
dedupe tool, data alignment, etc.

> E.g. only on specific compression algos (I've been using -o compress
> (which should be zlib) for quite a while but never found any

All decompression algorithms are affected.  The bug is in the generic
btrfs decompression handling, so it is not limited to any single
algorithm.  Compression (i.e. writing) is not affected--whatever data
is written to disk can be read back correctly with a fixed kernel.

> compression),... or only when specific file operations were done (I did
> e.g. cp with refcopy, but I think none of the standard tools does hole-
> punching)?

That depends on whether you consider fallocate or qemu to be standard
tools.  Hole punching has been a feature of several Linux filesystems
for some years now, so we can expect it to be more widely adopted over
time.  You'd have to do an audit to be sure none of the tools you use
are punching holes.

"Ordinary" sparse files (made by seeking forward while writing, as done
by older Unix utilities including cp, tar, rsync, cpio, binutils) do
not trigger this bug.  An ordinary sparse file has two distinct data
extents from two different writes, separated by a hole which has never
contained file data.  A punched hole splits an existing single data
extent into two pieces, with a newly created hole between them that
replaces previously existing file data.  These actions create different
extent reference patterns, and only the hole-punching one is affected
by the bug.
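For the curious, the hole-punching half of that pattern is easy to
reproduce with util-linux's fallocate(1).  This sketch runs on tmpfs
just to show the operation itself; on an actual btrfs mounted with
-o compress, this is what splits one compressed extent into two
references:

```shell
# Create a file whose contents compress well, i.e. a compressible
# extent when written to a filesystem mounted with -o compress.
f=$(mktemp -p /dev/shm)
head -c 1048576 /dev/zero | tr '\0' 'a' > "$f"

# Punch a 4K hole inside the extent.  On btrfs this leaves two file
# references to parts of the same (compressed) extent -- the pattern
# that triggers the read corruption on unfixed kernels.
fallocate --punch-hole --offset 65536 --length 4096 "$f"

# The file keeps its logical size; the punched range reads back as zeros.
stat -c 'size=%s blocks=%b' "$f"
```

The clone/dedupe half needs a reflink-capable filesystem, so it is not
shown here.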
Files that contain no blocks full of zeros will not be affected by
fallocate -d-style hole punching (it searches for existing zeros and
punches holes over them--no zeros, no holes).  If the hole punching
intentionally introduces zeros where zeros did not exist before
(e.g. qemu discard operations on raw image files) then it may trigger
the bug.

btrfs send and receive may be affected, but I don't use them, so I
don't have any experience of the bug related to these tools.  From
reading the btrfs receive code, it seems to lack any code capable of
punching a hole; but that's only a quick search for words like "punch",
not a detailed code analysis.

bees continues to be an awesome tool for discovering btrfs kernel bugs.
It compresses, dedupes, *and* punches holes.

>
> Cheers,
> Chris.
>
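P.S.  The fallocate -d behaviour described above (holes are only dug
where zero blocks already exist, without changing the file's logical
contents) can be seen on any filesystem that supports hole punching;
tmpfs is used here for illustration:

```shell
# A file with a 4K run of zeros in the middle of non-zero data.
f=$(mktemp -p /dev/shm)
{ head -c 4096 /dev/zero | tr '\0' 'a'
  head -c 4096 /dev/zero
  head -c 4096 /dev/zero | tr '\0' 'a'; } > "$f"

before=$(stat -c %b "$f")
# Dig holes: only the already-zero block is replaced by a hole; the
# logical contents of the file are unchanged.
fallocate --dig-holes "$f"
after=$(stat -c %b "$f")
echo "blocks: $before -> $after"
```

The allocated block count drops while the file size stays the same --
which is exactly why a file with no zero blocks is left untouched by
this mode.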