From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: "Zygo Blaxell" <ce3g8jdj@umail.furryterror.org>,
	"Niccolò Belli" <darkbasic@linuxsystems.it>
Cc: linux-btrfs@vger.kernel.org,
	Clemens Eisserer <linuxhippy@gmail.com>,
	Patrik Lundquist <patrik.lundquist@gmail.com>,
	Chris Murphy <lists@colorremedies.com>,
	Qu Wenruo <quwenruo@cn.fujitsu.com>,
	Omar Sandoval <osandov@osandov.com>
Subject: Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
Date: Mon, 9 May 2016 14:21:57 -0400
Message-ID: <4aa3dda7-70d6-5dcf-2fa7-4f2b509e4a1e@gmail.com>
In-Reply-To: <20160509162940.GC15597@hungrycats.org>

On 2016-05-09 12:29, Zygo Blaxell wrote:
> On Mon, May 09, 2016 at 04:53:13PM +0200, Niccolò Belli wrote:
>> While trying to find a common denominator for my issue I did lots of backups
>> of /dev/mapper/cryptroot and I restored them into /dev/mapper/cryptroot
>> dozens of times (triggering a 150GB+ random data write every time), without
>> any issue (after restoring the backup I always check the partition with btrfs
>> check). So disk doesn't seem to be the culprit.
>
> Did you also check the data matches the backup?  btrfs check will only
> look at the metadata, which is 0.1% of what you've copied.  From what
> you've written, there should be a lot of errors in the data too.  If you
> have incorrect data but btrfs scrub finds no incorrect checksums, then
> your storage layer is probably fine and we have to look at CPU, host RAM,
> and software as possible culprits.
This is a good point.
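
For illustration only, here is a minimal sketch of that kind of data
comparison, assuming the backup is a raw image of the device and that
the device is not mounted while it is read.  The function name and the
block size are my own choices, not anything btrfs provides.

```python
def mismatched_blocks(path_a, path_b, block_size=4096):
    """Return the byte offsets of blocks that differ between two files.

    Works on regular files and on block devices (run as root for the
    latter).  A length difference shows up as mismatches at the tail.
    """
    bad_offsets = []
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a = a.read(block_size)
            block_b = b.read(block_size)
            if not block_a and not block_b:
                break  # both inputs exhausted
            if block_a != block_b:
                bad_offsets.append(offset)
            offset += block_size
    return bad_offsets
```

Called as mismatched_blocks("/dev/mapper/cryptroot", "/path/to/backup.img")
(the backup path is hypothetical), an empty list would mean the restored
data matches the backup byte for byte.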
>
> The logs you've posted so far indicate that bad metadata (e.g. negative
> item lengths, nonsense transids in metadata references but sane transids
> in the referred pages) is getting into otherwise valid and well-formed
> btrfs metadata pages.  Since these pages are protected by checksums,
> the corruption can't be originating in the storage layer--if it was, the
> pages should be rejected as they are read from disk, before btrfs even
> looks at them, and the insane transid should be the "found" one not the
> "expected" one.  That suggests there is either RAM corruption happening
> _after_ the data is read from disk (i.e. while the pages are cached in
> RAM), or a severe software bug in the kernel you're running.
>
> Try different kernel versions (e.g. 4.4.9 or 4.1.23) in case whoever
> maintains your kernel had a bad day and merged a patch they should
> not have.
>
> Try a minimal configuration with as few drivers as possible loaded,
> especially GPU drivers and anything from the staging subdirectory--when
> these drivers have bugs, they ruin everything.
>
> Try memtest86+ which has a few more/different tests than memtest86.
> I have encountered RAM modules that pass memtest86 but fail memtest86+
> and vice versa.
>
> Try memtester, a memory tester that runs as a Linux process, so it can
> detect corruption caused when device drivers spray data randomly into RAM,
> or when the CPU thermal controls are influenced by Linux (an overheating
> CPU-to-RAM bridge can really ruin your day, and some of the dumber laptop
> designs rely on the OS for thermal management).
>
> Try running more than one memory testing process, in case there is a bug
> in your hardware that affects interactions between multiple cores (memtest
> is single-threaded).  You can run memtest86 inside a kvm (e.g. kvm
> -m 3072 -kernel /boot/memtest86.bin) to detect these kinds of issues.
>
> Kernel compiles are a bad way to test RAM.  I've successfully built
> kernels on hosts with known RAM failures.  The kernels don't always work
> properly, but it's quite rare to see a build fail outright.
My original suggestion that prompted that part of the comment was to run
a bunch of concurrent kernel builds, each with as many jobs as there are
CPUs (so on a quad-core system, run a dozen or so concurrently with
make -j4).  I only use kernel builds myself because the kernel is a big
project with essentially zero build dependencies; if I had the patience
and space (and a LiveCD with the right tools and packages installed),
I'd probably be using something like LibreOffice or Chromium instead.
I don't use this as my sole test (I also use multiple other tools), but
I find that it does a particularly good job of exercising things that
memtest doesn't.  I also don't just make sure the builds succeed, but
check that the compiled kernel images all match, because if there's bad
RAM, the resultant images will often differ in some way (I had forgotten
to mention this bit).
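
A rough sketch of the "run many builds at once, then compare the
results" idea follows.  It is not the exact procedure I use; the helper
names are made up for illustration, and the commands and output paths
are whatever you choose to pass in.

```python
import hashlib
import subprocess
from collections import defaultdict


def run_concurrently(commands):
    """Start every command at once, wait for all, return exit codes in order."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]


def group_by_digest(paths, chunk_size=1 << 20):
    """Map each SHA-256 digest to the list of files that produced it."""
    groups = defaultdict(list)
    for path in paths:
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            # Read in chunks so large images need not fit in memory.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        groups[digest.hexdigest()].append(path)
    return dict(groups)
```

If every build exits with 0 and group_by_digest() returns a single
group, the images are identical.  More than one group means at least one
build produced different output.  Note that kernel builds normally embed
a timestamp, user, and host name, so I believe you would need to pin
those (for example via KBUILD_BUILD_TIMESTAMP, KBUILD_BUILD_USER, and
KBUILD_BUILD_HOST) before identical images can be expected at all.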

This practice evolved out of the fact that the only bad RAM I've ever
dealt with fell into one of two categories.  Either it completely failed
to POST (which can have all kinds of interesting symptoms if it's just
one module: some motherboards refuse to boot, some report the error,
and others just disable the module and act like nothing happened), or
it passed all the memory testing tools I threw at it (memtest86,
memtest86+, memtester, concurrent memtest86 invocations from Xen
domains, inventive acrobatics with tmpfs and FIO, etc.) but failed
under heavy concurrent random access.  That kind of access can be
reliably produced by running a bunch of big software builds at the same
time with the CPU insanely over-committed.  I could probably produce a
similar workload with tmpfs and FIO, but it's a lot quicker and easier
to remember how to do a kernel build than it is to remember the complex
incantations needed to get FIO to do anything interesting.
