From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: "Niccolò Belli" <darkbasic@linuxsystems.it>
Cc: linux-btrfs@vger.kernel.org,
	Clemens Eisserer <linuxhippy@gmail.com>,
	Patrik Lundquist <patrik.lundquist@gmail.com>,
	Chris Murphy <lists@colorremedies.com>,
	Qu Wenruo <quwenruo@cn.fujitsu.com>,
	Omar Sandoval <osandov@osandov.com>,
	Zygo Blaxell <ce3g8jdj@umail.furryterror.org>,
	1i5t5.duncan@cox.net
Subject: Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
Date: Fri, 13 May 2016 07:35:01 -0400
Message-ID: <994b4fa5-c7ef-27e1-2fc2-386ab62a16c0@gmail.com>
In-Reply-To: <ef2e08cf-587c-4569-9600-9ad7bc45aab2@linuxsystems.it>

On 2016-05-13 07:07, Niccolò Belli wrote:
> On Thursday, 12 May 2016 17:43:38 CEST, Austin S. Hemmelgarn wrote:
>> That's probably a good indication of the CPU and the MB being OK, but
>> not necessarily the RAM.  There's two other possible options for
>> testing the RAM that haven't been mentioned yet though (which I hadn't
>> thought of myself until now):
>> 1. If you have access to Windows, try the Windows Memory Diagnostic.
>> This runs yet another slightly different set of tests from memtest86
>> and memtest86+, so it may catch issues they don't.  You can start this
>> directly on an EFI system by loading /EFI/Microsoft/Boot/MEMTEST.EFI
>> from the EFI system partition.
>> 2. This is a Dell system.  If you still have the utility partition
>> which Dell ships all their pre-provisioned systems with, that should
>> have a hardware diagnostics tool.  I doubt that this will find
>> anything (it's part of their QA procedure AFAICT), but it's probably
>> worth trying, as the memory testing in that uses yet another slightly
>> different implementation of the typical tests.  You can usually find
>> this in the boot interrupt menu accessed by hitting F12 before the
>> boot-loader loads.
>
> I tried the Dell System Test, including the enhanced optional ram tests
> and it was fine. I also tried the Microsoft one, which passed. BUT if I
> select the advanced test in the Microsoft One it always stops at 21% of
> first test. The test menus are still working, but fans get quiet and it
> keeps writing "test running... 21%" forever. I tried it many times and
> it always got stuck at 21%, so I suspect a test suite bug instead of a
> ram failure.
I've actually seen this before on other systems (a different completion
percentage on each one, but otherwise the same behavior).  All of them
turned out to have a bad CPU or MB, although the ones with CPU issues
were fine after BIOS updates that included newer microcode.
>
> I also noticed some other interesting behaviours: while I was running
> the usual scrub+check (both were fine) from the livecd I noticed this in
> dmesg:
> [  261.301159] BTRFS info (device dm-0): bdev /dev/mapper/cryptroot
> errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
> Corrupt? But both scrub and check were fine... I double checked scrub
> and check and they were still fine.
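It's worth noting that these are running counts of errors since the last
time the stats were reset, and they only get reset manually.  If you
haven't reset the stats, then this isn't all that surprising: the 4
corruption errors were most likely recorded during your earlier problems,
not by the scrub and check you just ran.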
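To make the "running counts" point concrete, here is a minimal sketch in
Python.  It assumes the kernel log line has exactly the format quoted
above (joined onto one line); the helper names parse_dev_stats and
new_errors are made up for illustration and are not part of any btrfs
tool.  The idea is simply that a single reading of a cumulative counter
tells you nothing about when the errors happened, so you have to compare
two readings.

```python
import re

# The log line quoted above, joined onto a single line.
LOG_LINE = ("[  261.301159] BTRFS info (device dm-0): bdev /dev/mapper/cryptroot "
            "errs: wr 0, rd 0, flush 0, corrupt 4, gen 0")

# Matches the per-device error summary; assumes the format shown in LOG_LINE.
STATS_RE = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_dev_stats(line):
    """Return (device, {counter: value}), or None if the line does not match."""
    m = STATS_RE.search(line)
    if m is None:
        return None
    counters = {k: int(v) for k, v in m.groupdict().items() if k != "dev"}
    return m.group("dev"), counters


def new_errors(before, after):
    """Counters are cumulative, so only the difference between two
    readings shows errors that occurred in between."""
    return {name: after[name] - before[name] for name in after}


if __name__ == "__main__":
    dev, stats = parse_dev_stats(LOG_LINE)
    print(dev, stats)
    # Same reading before and after a clean scrub: nothing new was recorded.
    print(new_errors(stats, stats))
```

The same counters can be read with 'btrfs device stats <mountpoint>', and
'btrfs device stats -z <mountpoint>' prints them and then resets them to
zero, which gives you a clean baseline for the next scrub.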
>
> This is what happened another time:
> https://drive.google.com/open?id=0Bwe9Wtc-5xF1dGtPaWhTZ0w5aUU
> I was making a backup of my partition USING DD from the livecd. It
> wasn't even mounted if I recall correctly!
The fact that you're getting an OOPS involving core kernel threads
(kswapd) is a pretty good indication that either there's a bug elsewhere
in the kernel, or that something is wrong with your hardware.  It's
really difficult to be certain which if you don't have a reliable test
case though.
>
> On Thursday, 12 May 2016 18:48:17 CEST, Zygo Blaxell wrote:
>> That's what a RAM corruption problem looks like when you run btrfs scrub.
>> Maybe the RAM itself is OK, but *something* is scribbling on it.
>>
>> Does the Arch live usb use the same kernel as your normal system?
>
> Yes, except for the point release (the system is slightly ahead of the
> liveusb).
>
> On Thursday, 12 May 2016 18:48:17 CEST, Zygo Blaxell wrote:
>> Did you try an older (or newer) kernel?  I've been running 4.5.x on a few
>> canary systems, but so far none of them have survived more than a day.
>
> No (except for point releases from 4.5.0 to 4.5.4), but I will try 4.4.
FWIW, I've been running 4.5 with almost no issues on my laptop since it
came out.  The few issues I have had are not unique to 4.5, and are all
ultimately firmware issues (Lenovo has been getting _really_ bad
recently about having broken ACPI and EFI implementations...).  Of
course, I'm also running Gentoo, so everything is built locally, but I
doubt that has much impact on stability.
>
> On Thursday, 12 May 2016 18:48:17 CEST, Zygo Blaxell wrote:
>> It's possible there's a problem that affects only very specific chipsets
>> You seem to have eliminated RAM in isolation, but there could be a
>> problem
>> in the kernel that affects only your chipset.
>
> Funny considering it is sold as a Linux laptop. Unfortunately they only
> tested it with the ancient Ubuntu 14.04.
Sadly, this is pretty typical for anything sold as a 'Linux' system that
isn't a server.  Even for the servers sold as such, it's not unusual for
them to only be tested with old versions of CentOS.

Now, I hadn't thought of this before, but it's a Dell system, so you're
trapping out to SMBIOS for everything under the sun.  If the firmware
doesn't pass a correct memory map (or correct ACPI tables) to the OS
during boot, then there may be some sections of RAM that both Linux and
the firmware think they can use.  That could definitely result in
symptoms like bad RAM while still consistently passing memory tests,
because the memory testers don't make BIOS calls after they have the
system info they need.
