* btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
@ 2016-05-04 23:21 Niccolò Belli
  2016-05-05  1:07 ` Chris Murphy
  2016-05-05  4:12 ` Qu Wenruo
  0 siblings, 2 replies; 25+ messages in thread
From: Niccolò Belli @ 2016-05-04 23:21 UTC
  To: linux-btrfs

I really need your help: this is the second time in a few days that btrfs 
has eaten my data, and I can't use my laptop until I find the culprit.

This was the mail I sent a couple of days ago: 
https://www.spinics.net/lists/linux-btrfs/msg54754.html
I previously thought the culprit was a bug in kernel 4.6-rc, but I was 
wrong.

Then I reinstalled the whole system (Arch Linux) from scratch, and after 
just two days I lost some of my data again. Once again btrfs check 
--repair got stuck in an infinite loop and I can't repair my fs. The system 
has always been shut down properly, except for one time when I had to 
force a power-off just after boot because there was no signal on the 
screen.

First the obvious things:

- memory is ok 
(https://drive.google.com/open?id=0Bwe9Wtc-5xF1VnJ0SE9fT1FZMTg)
- disk is ok 
(https://drive.google.com/open?id=0Bwe9Wtc-5xF1NGRhd2daVDRJVGc)
- tlp has SATA_LINKPWR_ON_BAT=max_performance 
(https://drive.google.com/open?id=0Bwe9Wtc-5xF1dFAwUE5ETVpNWGM)
- rootfs mount options: 
rw,noatime,compress=lzo,ssd,discard,space_cache,autodefrag,subvolid=257,subvol=/@
- Command line: BOOT_IMAGE=/@/boot/vmlinuz-linux 
root=UUID=4fc2278e-f6e8-4a21-8876-cabbf885bb2e rw rootflags=subvol=@ 
cryptdevice=/dev/disk/by-uuid/c7c8f501-507c-4bd2-a80a-8c7360651f02:cryptroot:allow-discards 
quiet
- scrub didn't find any error:
$ sudo btrfs scrub status /
scrub status for 4fc2278e-f6e8-4a21-8876-cabbf885bb2e
        scrub started at Thu May  5 00:57:30 2016 and finished after 
00:00:45
        total bytes scrubbed: 22.26GiB with 0 errors

I have the whole rootfs encrypted, including boot. I followed these steps: 
https://wiki.archlinux.org/index.php/Dm-crypt/Encrypting_an_entire_system#Btrfs_subvolumes_with_swap

Disk is a SAMSUNG SSD PM851 M.2 2280 256GB (Firmware Version: EXT25D0Q).
Laptop is a Dell XPS 13 9343 QHD+.
Distro is Arch Linux, kernel version is 4.5.1. btrfs-progs is 4.5.2.

Two days after the previous data loss I finished reinstalling my 
distro from scratch, then I decided to do a full backup from a snapshot 
using tar. This is what I got while trying to back up my data:

tar: usr/share/kig/icons/hicolor/32x32/actions/test.png: Read error at byte 0, while reading 810 bytes: Input/output error
tar: usr/share/kig/icons/hicolor/32x32/actions/circlebpd.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/pointOnLine.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/bezierN.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/convexhull.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/centerofcurvature.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/en.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/circlebps.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/directrix.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/beziercurves.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/segment_midpoint.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/distance.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/circlebcl.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/conicb5p.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/kig_polygon.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/conicasymptotes.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/pointxy.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/attacher.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/coniclineintersection.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/vectorsum.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/rbezier4.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/ellipsebffp.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/angle.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/kig_text.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/vectordifference.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/segmentaxis.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/radicalline.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/polygonsides.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/projection.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/inversion.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/bezier4.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/equilateralhyperbolab4p.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/areaCircle.png: Cannot stat: Stale file handle
tar: var/lib/samba/private/msg.sock/666: socket ignored
tar: Exiting with failure status due to previous errors


[ 3057.008185] BTRFS error (device dm-0): parent transid verify failed on 
528089088 wanted 3458764513820541211 found 283
[ 3057.008195] BTRFS error (device dm-0): error loading props for ino 
183988 (root 505): -5
[ 3057.008417] BTRFS error (device dm-0): parent transid verify failed on 
528089088 wanted 3458764513820541211 found 283
[ 3057.008631] BTRFS error (device dm-0): parent transid verify failed on 
528089088 wanted 3458764513820541211 found 283
[ 3057.009165] BTRFS error (device dm-0): parent transid verify failed on 
528089088 wanted 3458764513820541211 found 283
[ 3057.009389] BTRFS error (device dm-0): parent transid verify failed on 
528089088 wanted 3458764513820541211 found 283
[ 3057.009734] BTRFS error (device dm-0): parent transid verify failed on 
528089088 wanted 3458764513820541211 found 283
[ 3057.009960] BTRFS error (device dm-0): parent transid verify failed on 
528089088 wanted 3458764513820541211 found 283
[ 3057.010664] BTRFS error (device dm-0): parent transid verify failed on 
528089088 wanted 3458764513820541211 found 283
[ 3057.010888] BTRFS error (device dm-0): parent transid verify failed on 
528089088 wanted 3458764513820541211 found 283
[ 3057.011201] BTRFS error (device dm-0): parent transid verify failed on 
528089088 wanted 3458764513820541211 found 283
[ 3331.795474] verify_parent_transid: 57 callbacks suppressed
[ 3331.795480] BTRFS error (device dm-0): parent transid verify failed on 
528089088 wanted 3458764513820541211 found 283
[ 3331.795776] BTRFS error (device dm-0): parent transid verify failed on 
528089088 wanted 3458764513820541211 found 283

I made a copy of /dev/mapper/cryptroot with dd on an external drive and 
ran btrfs check on it (btrfs-progs 4.5.2): 
https://drive.google.com/open?id=0Bwe9Wtc-5xF1SjJacXpMMU5mems (37MB)

Then I tried to run btrfs check --repair on it, but once again it got stuck 
in an infinite loop like this one 
(https://www.spinics.net/lists/linux-btrfs/msg54146.html). After an hour 
of looping and several hundred MB of logs I had to kill it. Here is 
the log, truncated to 30MB: 
https://drive.google.com/open?id=0Bwe9Wtc-5xF1SmRuVUlfeGRES3M

They are probably not needed but here is snapper -c @ list: 
https://drive.google.com/open?id=0Bwe9Wtc-5xF1N0llOFpfVXVwNVk
and btrfs subvolume list -p /: 
https://drive.google.com/open?id=0Bwe9Wtc-5xF1andCdWZzeV9VbDg

This is the link to the whole gdrive directory with all the logs: 
https://drive.google.com/open?id=0Bwe9Wtc-5xF1UFltcXhtRmt4YjA

I really don't know what the problem may be. Maybe discard? I would hate 
to switch back to ext4 and lose snapshots, transactions, 
compression, incremental send/receive backups etc.
I would really love to be able to fix it, but I don't have 
the slightest idea what the problem is. Hopefully someone here will be 
smarter than me and find it; otherwise I will have to switch to 
ext4 because I need my laptop to work.

Thanks,
Niccolò


* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
  2016-05-04 23:21 btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair Niccolò Belli
@ 2016-05-05  1:07 ` Chris Murphy
  2016-05-05 10:36   ` Niccolò Belli
  2016-05-05  4:12 ` Qu Wenruo
  1 sibling, 1 reply; 25+ messages in thread
From: Chris Murphy @ 2016-05-05  1:07 UTC
  To: Niccolò Belli; +Cc: Btrfs BTRFS

On Wed, May 4, 2016 at 5:21 PM, Niccolò Belli <darkbasic@linuxsystems.it> wrote:

> rw,noatime,compress=lzo,ssd,discard,space_cache,autodefrag,subvolid=257,subvol=/@

I suggest using defaults for starters. The only thing in that list
that needs to be there is either subvolid or subvol, not both. Add in
the non-default options once you've proven the defaults are working,
and add them one at a time.
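
The one-at-a-time approach described above can be sketched as a small helper that prints the option strings to try, in order. This is only an illustration: the function name `staged_mount_options` is hypothetical, and the split between "baseline" and "extras" is an assumption (it simply treats everything except `rw` and `subvol=/@` from the reported option list as an extra to be re-added).

```python
def staged_mount_options(baseline, extras):
    """Return mount option strings, starting from the baseline and
    adding exactly one extra option per stage."""
    stages = [",".join(baseline)]
    current = list(baseline)
    for opt in extras:
        current.append(opt)
        stages.append(",".join(current))
    return stages


# Assumed split of the option list reported in this thread.
baseline = ["rw", "subvol=/@"]
extras = ["noatime", "compress=lzo", "ssd", "discard",
          "space_cache", "autodefrag"]

for stage, opts in enumerate(staged_mount_options(baseline, extras)):
    print(f"stage {stage}: {opts}")
```

Each stage would be run for long enough to trust it before moving to the next, so that a failure can be tied to the most recently added option.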



> I have the whole rootfs encrypted, including boot. I followed these steps:
> https://wiki.archlinux.org/index.php/Dm-crypt/Encrypting_an_entire_system#Btrfs_subvolumes_with_swap
>
> Disk is a SAMSUNG SSD PM851 M.2 2280 256GB (Firmware Version: EXT25D0Q).

The firmware is old if I understand the naming scheme used by Dell. It
says EXT49D0Q is current.

http://www.dell.com/support/home/al/en/aldhs1/Drivers/DriversDetails?driverId=0NXHH

If you need to update, you may be best off doing a whole device trim,
which is easiest done with mkfs.btrfs pointed at the whole device. I
wouldn't trust any data on the drive after a firmware update so I'd
start over entirely from scratch, new partition map, new everything.
So the way to do this is:

mkfs.btrfs /dev/sda
wipefs -a /dev/sda

That way the btrfs magic is removed, and now you can partition it,
set up dm-crypt, etc. I advise using all defaults for everything for
now, otherwise it's anyone's guess what you're running into.


Off topic, but at least gmail users see your posts go to spam because
your domain is configured to disallow relaying. Most mail services
ignore this request by the domain but google honors it so no amount of
training will make your email not spam. This is what's in your emails
that's causing the problem:

       dmarc=fail (p=QUARANTINE dis=NONE) header.from=linuxsystems.it

http://webmasters.stackexchange.com/questions/76765/sent-emails-pass-spf-and-dkim-but-fail-dmarc-when-received-by-gmail
http://www.pcworld.com/article/2141120/yahoo-email-antispoofing-policy-breaks-mailing-lists.html



-- 
Chris Murphy


* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
  2016-05-04 23:21 btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair Niccolò Belli
  2016-05-05  1:07 ` Chris Murphy
@ 2016-05-05  4:12 ` Qu Wenruo
  1 sibling, 0 replies; 25+ messages in thread
From: Qu Wenruo @ 2016-05-05  4:12 UTC
  To: Niccolò Belli, linux-btrfs



Niccolò Belli wrote on 2016/05/05 01:21 +0200:
> I really need your help, because it's the second time btrfs ate my data
> in a couple of days and I can't use my laptop if I don't find the culprit.
>
> This was the mail I sent a couple of days ago:
> https://www.spinics.net/lists/linux-btrfs/msg54754.html

Output in that mail shows obvious tree block corruption:
checksum verify failed on 245498111 found C7652CC3 wanted 00000000
checksum verify failed on 245498111 found C7652CC3 wanted 00000000
checksum verify failed on 245498111 found C7652CC3 wanted 00000000
checksum verify failed on 245498111 found C7652CC3 wanted 00000000
bytenr mismatch, want=245498111, have=8454382400481263616

That's the root cause of the many errors that follow.
I assume it may be the same cause this time.
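
One way to look at the two byte numbers in the "bytenr mismatch" line is to print them in hex and check their alignment. This is a sketch under an assumption: it uses the common 4096-byte btrfs sector size, which is not stated anywhere in this thread. Block addresses are expected to be sector-aligned. The helper name `describe_bytenr` is made up for this example.

```python
SECTOR_SIZE = 4096  # assumption: common btrfs sector size, not stated in the thread


def describe_bytenr(bytenr, sector_size=SECTOR_SIZE):
    """Return (hex string, offset within its sector) for a byte number."""
    return f"{bytenr:#018x}", bytenr % sector_size


for label, value in (("want", 245498111), ("have", 8454382400481263616)):
    hex_value, offset = describe_bytenr(value)
    status = "aligned" if offset == 0 else f"misaligned by {offset} bytes"
    print(f"{label}: {hex_value} ({status})")
```

Under that assumption the requested address itself is not sector-aligned, which would point at a damaged pointer rather than only a damaged target block.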

> I previously thought the culprit was a bug in kernel 4.6-rc, but I was
> wrong.
>
> Then I reinstalled the whole system (Arch Linux) from scratch, and after
> just two days I lost some of my data, again. Once again btrfs check
> --repair got stuck in an infinite loop and I can't repair my fs. The
> system has always been shutdown properly, except for a single time when
> I had to forcedly power it off just after the boot because I didn't see
> any signal on the screen.
>
> First the obvious things:
>
> - memory is ok
> (https://drive.google.com/open?id=0Bwe9Wtc-5xF1VnJ0SE9fT1FZMTg)
> - disk is ok
> (https://drive.google.com/open?id=0Bwe9Wtc-5xF1NGRhd2daVDRJVGc)
> - tlp has SATA_LINKPWR_ON_BAT=max_performance
> (https://drive.google.com/open?id=0Bwe9Wtc-5xF1dFAwUE5ETVpNWGM)
> - rootfs mount options:
> rw,noatime,compress=lzo,ssd,discard,space_cache,autodefrag,subvolid=257,subvol=/@
>
> - Command line: BOOT_IMAGE=/@/boot/vmlinuz-linux
> root=UUID=4fc2278e-f6e8-4a21-8876-cabbf885bb2e rw rootflags=subvol=@
> cryptdevice=/dev/disk/by-uuid/c7c8f501-507c-4bd2-a80a-8c7360651f02:cryptroot:allow-discards
> quiet
> - scrub didn't find any error:
> $ sudo btrfs scrub status /
> scrub status for 4fc2278e-f6e8-4a21-8876-cabbf885bb2e
>        scrub started at Thu May  5 00:57:30 2016 and finished after
> 00:00:45
>        total bytes scrubbed: 22.26GiB with 0 errors
>
> I have the whole rootfs encrypted, including boot. I followed these
> steps:
> https://wiki.archlinux.org/index.php/Dm-crypt/Encrypting_an_entire_system#Btrfs_subvolumes_with_swap
>

Would it be OK for you to test your btrfs on a plain SSD, without 
encryption?

I know this suggestion is quite rude, but it would hugely reduce the 
number of layers we need to investigate.

And just as Chris Murphy said, reducing the mount options is also a pretty 
good starting point for debugging.

>
> Disk is a SAMSUNG SSD PM851 M.2 2280 256GB (Firmware Version: EXT25D0Q).
> Laptop is a Dell XPS 13 9343 QHD+.
> Distro is Arch Linux, kernel version is 4.5.1. btrfs-progs is 4.5.2.
>
> After two days from the previous data loss I finished reinstalling my
> distro from scratch, then I decided to do a full backup from a snapshot
> using tar. This is what I got while trying to backup my data:
>
> tar: usr/share/kig/icons/hicolor/32x32/actions/test.png: Read error at byte 0, while reading 810 bytes: Input/output error
> tar: usr/share/kig/icons/hicolor/32x32/actions/circlebpd.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/pointOnLine.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/bezierN.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/convexhull.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/centerofcurvature.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/en.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/circlebps.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/directrix.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/beziercurves.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/segment_midpoint.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/distance.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/circlebcl.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/conicb5p.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/kig_polygon.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/conicasymptotes.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/pointxy.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/attacher.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/coniclineintersection.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/vectorsum.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/rbezier4.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/ellipsebffp.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/angle.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/kig_text.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/vectordifference.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/segmentaxis.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/radicalline.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/polygonsides.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/projection.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/inversion.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/bezier4.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/equilateralhyperbolab4p.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/areaCircle.png: Cannot stat: Stale file handle
> tar: var/lib/samba/private/msg.sock/666: socket ignored
> tar: Exiting with failure status due to previous errors
>
>
> [ 3057.008185] BTRFS error (device dm-0): parent transid verify failed
> on 528089088 wanted 3458764513820541211 found 283

The tree blocks are again heavily damaged.
The wanted transid is extremely large, definitely not sane.

So the parent node is already corrupted,
although the child's transid, 283, seems quite valid.
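
To make the "not sane" observation concrete, the two transids from the kernel message can be compared bit by bit. This is just arithmetic on the numbers in the log, not a diagnosis; the helper name `differing_bits` is invented for this sketch.

```python
def differing_bits(a, b):
    """Return the bit positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [pos for pos in range(diff.bit_length()) if (diff >> pos) & 1]


wanted = 3458764513820541211  # transid the parent node expects
found = 283                   # transid stored in the child block

print(f"wanted = {wanted:#018x}")
print(f"found  = {found:#018x}")
print("differing bit positions:", differing_bits(wanted, found))
```

For these values the low bits of `wanted` are identical to `found` (0x11b), and the two numbers differ only in bits 60 and 61. That pattern does not by itself say which layer caused the damage.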


> [ 3057.008195] BTRFS error (device dm-0): error loading props for ino
> 183988 (root 505): -5
> [ 3057.008417] BTRFS error (device dm-0): parent transid verify failed
> on 528089088 wanted 3458764513820541211 found 283
> [ 3057.008631] BTRFS error (device dm-0): parent transid verify failed
> on 528089088 wanted 3458764513820541211 found 283
> [ 3057.009165] BTRFS error (device dm-0): parent transid verify failed
> on 528089088 wanted 3458764513820541211 found 283
> [ 3057.009389] BTRFS error (device dm-0): parent transid verify failed
> on 528089088 wanted 3458764513820541211 found 283
> [ 3057.009734] BTRFS error (device dm-0): parent transid verify failed
> on 528089088 wanted 3458764513820541211 found 283
> [ 3057.009960] BTRFS error (device dm-0): parent transid verify failed
> on 528089088 wanted 3458764513820541211 found 283
> [ 3057.010664] BTRFS error (device dm-0): parent transid verify failed
> on 528089088 wanted 3458764513820541211 found 283
> [ 3057.010888] BTRFS error (device dm-0): parent transid verify failed
> on 528089088 wanted 3458764513820541211 found 283
> [ 3057.011201] BTRFS error (device dm-0): parent transid verify failed
> on 528089088 wanted 3458764513820541211 found 283
> [ 3331.795474] verify_parent_transid: 57 callbacks suppressed
> [ 3331.795480] BTRFS error (device dm-0): parent transid verify failed
> on 528089088 wanted 3458764513820541211 found 283
> [ 3331.795776] BTRFS error (device dm-0): parent transid verify failed
> on 528089088 wanted 3458764513820541211 found 283
>
> I made a copy of /dev/mapper/cryptroot with dd on an external drive and
> I run btrfs check on it (btrfs-progs 4.5.2):
> https://drive.google.com/open?id=0Bwe9Wtc-5xF1SjJacXpMMU5mems (37MB)

I checked it, but the output seems to be truncated?

Thanks,
Qu

>
> Then I tried to run btrfs check --repair on it but once again it got
> stuck in an infinite loop like this one
> (https://www.spinics.net/lists/linux-btrfs/msg54146.html) and after an
> hour of looping and several hundreds of MBs of logs I had to kill it.
> Here is the log, truncated to 30MB:
> https://drive.google.com/open?id=0Bwe9Wtc-5xF1SmRuVUlfeGRES3M
>
> They are probably not needed but here is snapper -c @ list:
> https://drive.google.com/open?id=0Bwe9Wtc-5xF1N0llOFpfVXVwNVk
> and btrfs subvolume list -p /:
> https://drive.google.com/open?id=0Bwe9Wtc-5xF1andCdWZzeV9VbDg
>
> This is the link to the whole gdrive directory with all the logs:
> https://drive.google.com/open?id=0Bwe9Wtc-5xF1UFltcXhtRmt4YjA
>
> I really don't know what may be the problem, maybe discard? I can't
> think about switching back to ext4 and losing snapshots, transactions,
> compression, incremental send/receive backups etc.
> I would really love being able to do something to fix it, but I don't
> have the slightest idea about what's the problem. Hopefully someone here
> will be smarter than me and find the problem, otherwise I will have to
> switch to ext4 because I need my laptop to work.
>
> Thanks,
> Niccolò
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>




* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
  2016-05-05  1:07 ` Chris Murphy
@ 2016-05-05 10:36   ` Niccolò Belli
  2016-05-05 17:48     ` Omar Sandoval
  0 siblings, 1 reply; 25+ messages in thread
From: Niccolò Belli @ 2016-05-05 10:36 UTC
  To: Btrfs BTRFS; +Cc: Chris Murphy, Qu Wenruo

On Thursday, 5 May 2016 03:07:37 CEST, Chris Murphy wrote:
> I suggest using defaults for starters. The only thing in that list
> that needs to be there is either subvolid or subvol, not both. Add in
> the non-default options once you've proven the defaults are working,
> and add them one at a time.

Yes, I read your previous suggestion and I have already dropped subvolid, but 
since the problem had already happened I left it in the mail for completeness.
Anyway, the source of the duplicate subvolid/subvol options is genfstab, and 
that's probably what a beginner is going to use when installing a distro: 
https://wiki.archlinux.org/index.php/beginners'_guide#fstab

>> Disk is a SAMSUNG SSD PM851 M.2 2280 256GB (Firmware Version: EXT25D0Q).
>
> The firmware is old if I understand the naming scheme used by Dell. It
> says EXT49D0Q is current.
>
> http://www.dell.com/support/home/al/en/aldhs1/Drivers/DriversDetails?driverId=0NXHH

According to this 
(http://forum.notebookreview.com/threads/2015-xps-13-ssd-fw-problem-with-m-2-samsung-pm851.770501/) 
the firmware you linked is for the mSATA version of the drive, not the M.2 
one. EXT25D0Q seems to be the very latest one for my drive.

> I advise using all defaults for everything for
> now, otherwise it's anyone's guess what you're running into.

On Thursday, 5 May 2016 06:12:28 CEST, Qu Wenruo wrote:
> Would it be OK for you to test your btrfs on a plain ssd, 
> without encryption?
> And just as Chris Murphy said, reducing mount option is also a 
> pretty good debugging start point.

Ok, I will remove dm-crypt, discard, compress=lzo and autodefrag and see what 
happens.

>> I made a copy of /dev/mapper/cryptroot with dd on an external drive and
>> I run btrfs check on it (btrfs-progs 4.5.2):
>> https://drive.google.com/open?id=0Bwe9Wtc-5xF1SjJacXpMMU5mems (37MB)
>
> Checked, but seems the output is truncated?

No, I didn't truncate the btrfs check output because it wasn't endless. I 
just truncated the repair output.

I also have something new to report. Do you remember when I said that my 
screen was black and I had to force a power-off? Something 
similar happened today, and since I had enabled the magic SysRq keys in the 
meantime, I was able to recover this from the logs:

mag 05 11:55:51 arch-laptop kdeinit5[960]: Registering 
"org.kde.StatusNotifierItem-1060-1/StatusNotifierItem" to system tray
mag 05 11:55:51 arch-laptop obexd[1098]: OBEX daemon 5.39
mag 05 11:55:51 arch-laptop dbus-daemon[920]: Successfully activated 
service 'org.bluez.obex'
mag 05 11:55:51 arch-laptop systemd[898]: Started Bluetooth OBEX service.
mag 05 11:55:51 arch-laptop korgac[1044]: log_kidentitymanagement: 
IdentityManager: There was no default identity. Marking first one as 
default.
mag 05 11:55:51 arch-laptop kernel: BUG: unable to handle kernel paging 
request at 0000000000017d11
mag 05 11:55:51 arch-laptop kernel: IP: [<ffffffff81194f9f>] 
anon_vma_interval_tree_insert+0x3f/0x90
mag 05 11:55:51 arch-laptop kernel: PGD 0 
mag 05 11:55:51 arch-laptop kernel: Oops: 0000 [#1] PREEMPT SMP 
mag 05 11:55:51 arch-laptop kernel: Modules linked in: rfcomm(+) visor bnep 
uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core 
videodev media btusb btrtl btbcm btintel cdc_ether bluetooth usbnet r8152 
crc16 mii joydev mousedev nvr
mag 05 11:55:51 arch-laptop kernel:  mei_me syscopyarea sysfillrect snd 
sysimgblt fb_sys_fops i2c_algo_bit shpchp soundcore mei wmi thermal fan 
intel_hid sparse_keymap int3403_thermal video processor_thermal_device 
dw_dmac snd_soc_sst_acpi snd_soc_sst_m
mag 05 11:55:51 arch-laptop kernel:  lrw gf128mul glue_helper ablk_helper 
cryptd ahci libahci libata scsi_mod xhci_pci rtsx_pci
mag 05 11:55:51 arch-laptop kernel: Bluetooth: RFCOMM TTY layer initialized
mag 05 11:55:51 arch-laptop kernel: Bluetooth: RFCOMM socket layer 
initialized
mag 05 11:55:51 arch-laptop kernel: Bluetooth: RFCOMM ver 1.11
mag 05 11:55:51 arch-laptop kernel:  xhci_hcd
mag 05 11:55:51 arch-laptop kernel:  i8042 serio sdhci_acpi sdhci led_class 
mmc_core pl2303 mos7720 usbserial parport hid_generic usbhid hid usbcore 
usb_common
mag 05 11:55:51 arch-laptop kernel: CPU: 0 PID: 351 Comm: systemd-udevd Not 
tainted 4.5.1-1-ARCH #1
mag 05 11:55:51 arch-laptop kernel: Hardware name: Dell Inc. XPS 13 
9343/0F5KF3, BIOS A07 11/11/2015
mag 05 11:55:51 arch-laptop kernel: task: ffff88021347d580 ti: 
ffff880211f8c000 task.ti: ffff880211f8c000
mag 05 11:55:51 arch-laptop kernel: RIP: 0010:[<ffffffff81194f9f>]  
[<ffffffff81194f9f>] anon_vma_interval_tree_insert+0x3f/0x90
mag 05 11:55:51 arch-laptop kernel: RSP: 0018:ffff880211f8fd68  EFLAGS: 
00010206
mag 05 11:55:51 arch-laptop kernel: RAX: ffff8800da2f4820 RBX: 
ffff8800bb59ce40 RCX: ffff8800da2f4830
mag 05 11:55:51 arch-laptop kernel: RDX: ffff8800da2f4828 RSI: 
ffff8800374404a0 RDI: ffff8800c58dfa40
mag 05 11:55:51 arch-laptop kernel: RBP: ffff880211f8fdb8 R08: 
0000000000017c79 R09: 00000007f55e2059
mag 05 11:55:51 arch-laptop kernel: R10: 00000007f55e2053 R11: 
ffff8800c58dfa40 R12: ffff880037440460
mag 05 11:55:51 arch-laptop kernel: R13: ffff8800d9e27100 R14: 
ffff8800c58dfa40 R15: ffff880037440460
mag 05 11:55:51 arch-laptop kernel: FS:  00007f55e20537c0(0000) 
GS:ffff88021e400000(0000) knlGS:0000000000000000
mag 05 11:55:51 arch-laptop kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
0000000080050033
mag 05 11:55:51 arch-laptop kernel: CR2: 0000000000017d11 CR3: 
0000000211cd5000 CR4: 00000000003406f0
mag 05 11:55:51 arch-laptop kernel: DR0: 0000000000000000 DR1: 
0000000000000000 DR2: 0000000000000000
mag 05 11:55:51 arch-laptop kernel: DR3: 0000000000000000 DR6: 
00000000fffe0ff0 DR7: 0000000000000400
mag 05 11:55:51 arch-laptop kernel: Stack:
mag 05 11:55:51 arch-laptop kernel:  ffffffff811a90c8 0000000000000246 
ffff880212d00900 ffff8800bb59ceb8
mag 05 11:55:51 arch-laptop kernel:  ffff880212d00978 ffff8800bb59ce40 
ffff880212d00900 0000000000000007
mag 05 11:55:51 arch-laptop kernel:  00007f55e2053a90 ffff8800d991e1c0 
ffff880211f8fdf0 ffffffff811a9232
mag 05 11:55:51 arch-laptop kernel: Call Trace:
mag 05 11:55:51 arch-laptop kernel:  [<ffffffff811a90c8>] ? 
anon_vma_clone+0xc8/0x200
mag 05 11:55:51 arch-laptop kernel:  [<ffffffff811a9232>] 
anon_vma_fork+0x32/0x140
mag 05 11:55:51 arch-laptop kernel:  [<ffffffff8107742d>] 
copy_process.part.8+0xcdd/0x1890
mag 05 11:55:51 arch-laptop kernel:  [<ffffffff8107819f>] 
_do_fork+0xcf/0x3c0
mag 05 11:55:51 arch-laptop kernel:  [<ffffffff81078539>] 
SyS_clone+0x19/0x20
mag 05 11:55:51 arch-laptop kernel:  [<ffffffff815ad6ae>] 
entry_SYSCALL_64_fastpath+0x12/0x6d
mag 05 11:55:51 arch-laptop kernel: Code: 01 4c 8b 91 98 00 00 00 31 c9 48 
c1 e8 0c 4d 8d 4c 02 ff eb 24 4c 3b 48 18 76 04 4c 89 48 18 4c 8b 40 e0 48 
8d 48 10 48 8d 50 08 <4d> 3b 90 98 00 00 00 48 0f 42 d1 48 89 c1 48 8b 02 
48 85 c0 75 
mag 05 11:55:51 arch-laptop kernel: RIP  [<ffffffff81194f9f>] 
anon_vma_interval_tree_insert+0x3f/0x90
mag 05 11:55:52 arch-laptop kernel:  RSP <ffff880211f8fd68>
mag 05 11:55:52 arch-laptop kernel: CR2: 0000000000017d11
mag 05 11:55:52 arch-laptop kernel: ---[ end trace 6a392d6afbffe7f5 ]---
[...]
mag 05 11:55:52 arch-laptop dbus[584]: [system] Activating via systemd: 
service name='org.freedesktop.ColorManager' unit='colord.service'
mag 05 11:55:52 arch-laptop kernel: BTRFS critical (device dm-0): unable to 
find logical 2330894282579755008 len 4096
mag 05 11:55:52 arch-laptop kernel: ------------[ cut here ]------------
mag 05 11:55:52 arch-laptop kernel: kernel BUG at fs/btrfs/inode.c:1828!
mag 05 11:55:52 arch-laptop kernel: invalid opcode: 0000 [#2] PREEMPT SMP 
mag 05 11:55:52 arch-laptop kernel: Modules linked in: rfcomm visor bnep 
uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core 
videodev media btusb btrtl btbcm btintel cdc_ether bluetooth usbnet r8152 
crc16 mii joydev mousedev nvram 
mag 05 11:55:52 arch-laptop kernel:  mei_me syscopyarea sysfillrect snd 
sysimgblt fb_sys_fops i2c_algo_bit shpchp soundcore mei wmi thermal fan 
intel_hid sparse_keymap int3403_thermal video processor_thermal_device 
dw_dmac snd_soc_sst_acpi snd_soc_sst_m
mag 05 11:55:52 arch-laptop kernel:  lrw gf128mul glue_helper ablk_helper 
cryptd ahci libahci libata scsi_mod xhci_pci rtsx_pci xhci_hcd i8042 serio 
sdhci_acpi sdhci led_class mmc_core pl2303 mos7720 usbserial parport 
hid_generic usbhid hid usbcore usb_
mag 05 11:55:52 arch-laptop kernel: CPU: 3 PID: 1028 Comm: plasmashell 
Tainted: G      D         4.5.1-1-ARCH #1
mag 05 11:55:52 arch-laptop kernel: Hardware name: Dell Inc. XPS 13 
9343/0F5KF3, BIOS A07 11/11/2015
mag 05 11:55:52 arch-laptop kernel: task: ffff8800d9e2aac0 ti: 
ffff8801f5900000 task.ti: ffff8801f5900000
mag 05 11:55:52 arch-laptop kernel: RIP: 0010:[<ffffffffa02ddabb>]  
[<ffffffffa02ddabb>] btrfs_merge_bio_hook+0x8b/0xa0 [btrfs]
mag 05 11:55:52 arch-laptop kernel: RSP: 0018:ffff8801f5903938  EFLAGS: 
00010282
mag 05 11:55:52 arch-laptop kernel: RAX: 00000000ffffffea RBX: 
0000000000001000 RCX: 0000000000000051
mag 05 11:55:52 arch-laptop kernel: RDX: 0000000000000000 RSI: 
ffff88021e58db38 RDI: 0000000000000000
mag 05 11:55:52 arch-laptop kernel: RBP: ffff8801f5903958 R08: 
0000000000070aad R09: 0000000000000368
mag 05 11:55:52 arch-laptop kernel: R10: 00102c80000d13e8 R11: 
0000000000000368 R12: 0000000000001000
mag 05 11:55:52 arch-laptop kernel: R13: ffff8801e205ee28 R14: 
0000000000000000 R15: ffffea000788d580
mag 05 11:55:52 arch-laptop kernel: FS:  00007fe8e688a800(0000) 
GS:ffff88021e580000(0000) knlGS:0000000000000000
mag 05 11:55:52 arch-laptop kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
0000000080050033
mag 05 11:55:52 arch-laptop kernel: CR2: 00007fe8d14b5cbc CR3: 
00000000bf57f000 CR4: 00000000003406e0
mag 05 11:55:52 arch-laptop kernel: DR0: 0000000000000000 DR1: 
0000000000000000 DR2: 0000000000000000
mag 05 11:55:52 arch-laptop kernel: DR3: 0000000000000000 DR6: 
00000000fffe0ff0 DR7: 0000000000000400
mag 05 11:55:52 arch-laptop kernel: Stack:
mag 05 11:55:52 arch-laptop kernel:  0000000000001000 0000000095d6c394 
0000000000001000 ffff8801f5903bc0
mag 05 11:55:52 arch-laptop kernel:  ffff8801f59039b0 ffffffffa02fbd03 
0000000000000000 00102c80000d13e8
mag 05 11:55:52 arch-laptop kernel:  0000002000000000 ffff8800da874040 
0000000000000000 ffffea000788d580
mag 05 11:55:52 arch-laptop kernel: Call Trace:
mag 05 11:55:52 arch-laptop kernel:  [<ffffffffa02fbd03>] 
submit_extent_page+0xc3/0x230 [btrfs]
mag 05 11:55:52 arch-laptop kernel:  [<ffffffffa02fd02a>] 
__do_readpage+0x3aa/0x990 [btrfs]
mag 05 11:55:52 arch-laptop kernel:  [<ffffffffa02fb450>] ? 
btrfs_create_repair_bio+0x100/0x100 [btrfs]
mag 05 11:55:52 arch-laptop kernel:  [<ffffffffa02d0cf0>] ? 
free_root_pointers+0x70/0x70 [btrfs]
mag 05 11:55:52 arch-laptop kernel:  [<ffffffffa02fd6f6>] 
__extent_read_full_page+0xe6/0x100 [btrfs]
mag 05 11:55:52 arch-laptop kernel:  [<ffffffffa02d0cf0>] ? 
free_root_pointers+0x70/0x70 [btrfs]
mag 05 11:55:52 arch-laptop kernel:  [<ffffffffa02ff489>] 
read_extent_buffer_pages+0x179/0x330 [btrfs]
mag 05 11:55:52 arch-laptop kernel:  [<ffffffffa02d0cf0>] ? 
free_root_pointers+0x70/0x70 [btrfs]
mag 05 11:55:52 arch-laptop kernel:  [<ffffffffa02d26fc>] 
btree_read_extent_buffer_pages.constprop.19+0xac/0x110 [btrfs]
mag 05 11:55:52 arch-laptop kernel:  [<ffffffffa02d2cfd>] 
read_tree_block+0x3d/0x70 [btrfs]
mag 05 11:55:52 arch-laptop kernel:  [<ffffffffa02b1b49>] 
read_block_for_search.isra.14+0x139/0x330 [btrfs]
mag 05 11:55:52 arch-laptop kernel:  [<ffffffffa02b72e5>] 
btrfs_next_old_leaf+0x245/0x420 [btrfs]
mag 05 11:55:52 arch-laptop kernel:  [<ffffffffa02b74d0>] 
btrfs_next_leaf+0x10/0x20 [btrfs]
mag 05 11:55:52 arch-laptop kernel:  [<ffffffffa02dc564>] 
btrfs_real_readdir+0x144/0x5f0 [btrfs]
mag 05 11:55:52 arch-laptop kernel:  [<ffffffff81200492>] 
iterate_dir+0x92/0x120
mag 05 11:55:52 arch-laptop kernel:  [<ffffffff81200939>] 
SyS_getdents+0x99/0x110
mag 05 11:55:52 arch-laptop kernel:  [<ffffffff812005f0>] ? 
fillonedir+0xd0/0xd0
mag 05 11:55:52 arch-laptop kernel:  [<ffffffff815ad6ae>] 
entry_SYSCALL_64_fastpath+0x12/0x6d
mag 05 11:55:52 arch-laptop kernel: Code: 8b 80 38 fe ff ff 4c 89 65 e0 48 
8b 80 f0 01 00 00 48 89 c7 e8 77 ac 02 00 85 c0 78 0e 31 c0 4c 01 e3 48 3b 
5d e0 0f 97 c0 eb 9a <0f> 0b e8 5e b1 d9 e0 0f 1f 40 00 66 2e 0f 1f 84 00 
00 00 00 00 
mag 05 11:55:52 arch-laptop kernel: RIP  [<ffffffffa02ddabb>] 
btrfs_merge_bio_hook+0x8b/0xa0 [btrfs]
mag 05 11:55:52 arch-laptop kernel:  RSP <ffff8801f5903938>
mag 05 11:55:52 arch-laptop kernel: ---[ end trace 6a392d6afbffe7f6 ]---

On Thursday 5 May 2016 03:07:37 CEST, Chris Murphy wrote:
> Off topic, but at least gmail users see your posts go to spam
>        dmarc=fail (p=QUARANTINE dis=NONE) header.from=linuxsystems.it

Thanks for reporting, I changed my dmarc DNS entry from quarantine to none. 
I previously used reject and I hoped that quarantine was enough of a middle 
ground to survive spam filters, but it seems I will have to get rid of 
dmarc altogether.

Thanks,
Niccolò

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
  2016-05-05 10:36   ` Niccolò Belli
@ 2016-05-05 17:48     ` Omar Sandoval
  2016-05-06 11:38       ` Niccolò Belli
  0 siblings, 1 reply; 25+ messages in thread
From: Omar Sandoval @ 2016-05-05 17:48 UTC (permalink / raw)
  To: Niccolò Belli; +Cc: Btrfs BTRFS, Chris Murphy, Qu Wenruo

On Thu, May 05, 2016 at 12:36:52PM +0200, Niccolò Belli wrote:
> On Thursday 5 May 2016 03:07:37 CEST, Chris Murphy wrote:
> > I suggest using defaults for starters. The only thing in that list
> > that needs to be there is either subvolid or subvol, not both. Add in
> > the non-default options once you've proven the defaults are working,
> > and add them one at a time.
> 
> Yes I read your previous suggestion and I already dropped subvolid, but
> since the problem already happened I left it in the mail for completeness.
> Anyway the culprit here is genfstab and that's probably what a beginner is
> going to use when installing a distro:
> https://wiki.archlinux.org/index.php/beginners'_guide#fstab
> 

The redundant subvolid doesn't hurt, the kernel will just check that it
matches the passed subvol (see [1]). genfstab probably just pulls the
options out of /proc/mounts or /proc/self/mountinfo, and since we show
both, that's how it gets in fstab. If it was actually a problem, there
would be a clear message in dmesg.

1: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bb289b7be62db84b9630ce00367444c810cada2c
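To see where the two options come from, one can read the same mount table that genfstab probably consults. The following is only a minimal sketch, assuming the usual /proc/self/mountinfo layout; the helper name show_subvol_opts is made up for illustration and is not part of any tool.

```shell
# Hypothetical helper: print the subvol= and subvolid= options of each btrfs
# mount listed in a mountinfo-style file (default: /proc/self/mountinfo).
show_subvol_opts() {
    file="${1:-/proc/self/mountinfo}"
    # mountinfo layout: the mount point is field 5; after the "-" separator
    # come the filesystem type, the source device and the super options.
    awk '{
        for (i = 1; i <= NF; i++) if ($i == "-") break
        if ($(i + 1) != "btrfs") next
        n = split($(i + 3), opts, ",")
        out = ""
        for (j = 1; j <= n; j++)
            if (opts[j] ~ /^subvol(id)?=/) out = out (out ? "," : "") opts[j]
        print $5, out
    }' "$file"
}

# Usage (on a running system): show_subvol_opts
```

If both options show up for the root mount, that would be consistent with genfstab simply copying what the kernel reports.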

-- 
Omar

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
  2016-05-05 17:48     ` Omar Sandoval
@ 2016-05-06 11:38       ` Niccolò Belli
  2016-05-07 15:45         ` Niccolò Belli
  0 siblings, 1 reply; 25+ messages in thread
From: Niccolò Belli @ 2016-05-06 11:38 UTC (permalink / raw)
  To: Btrfs BTRFS; +Cc: Chris Murphy, Qu Wenruo, Omar Sandoval

I formatted the partition and copied the content of my previous rootfs to 
it. There is no dmcrypt now and mount options are defaults, except for 
noatime. After a single boot I got the very same problem as before (fs 
corrupted and an infinite loop when doing btrfs check --repair).

I wanted to replicate the results, so I tried once again, and since then I 
have only experienced minor corruption, correctly resolved by repair. But during 
a pacman upgrade, which triggered snapper pre-post snapshots, the system 
hung and I found this in the logs:

mag 06 10:31:15 arch-laptop plasmashell[873]: requesting unexisting screen 
2
mag 06 10:31:18 arch-laptop dbus[418]: [system] Activating service 
name='org.opensuse.Snapper' (using servicehelper)
mag 06 10:31:18 arch-laptop dbus[418]: [system] Successfully activated 
service 'org.opensuse.Snapper'
mag 06 10:31:20 arch-laptop kernel: ------------[ cut here ]------------
mag 06 10:31:20 arch-laptop kernel: kernel BUG at fs/btrfs/ctree.h:2693!

Still no major corruption found since my second attempt.

Niccolò

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs ate my data in just two days, after a fresh install. ram  and disk are ok. it still mounts, but I cannot repair
  2016-05-06 11:38       ` Niccolò Belli
@ 2016-05-07 15:45         ` Niccolò Belli
  2016-05-07 15:58           ` Clemens Eisserer
  2016-05-07 23:35           ` Chris Murphy
  0 siblings, 2 replies; 25+ messages in thread
From: Niccolò Belli @ 2016-05-07 15:45 UTC (permalink / raw)
  To: Btrfs BTRFS; +Cc: Chris Murphy, Qu Wenruo, Omar Sandoval

btrfs + dmcrypt + compress=lzo + autodefrag = corruption at first boot
So discard is not the culprit. Will try to remove compress=lzo and 
autodefrag and see if it still happens.

[  748.224346] BTRFS error (device dm-0): memmove bogus src_offset 5431 
move len 4294962894 len 16384
[  748.226206] ------------[ cut here ]------------
[  748.227831] kernel BUG at fs/btrfs/extent_io.c:5723!
[  748.229498] invalid opcode: 0000 [#1] PREEMPT SMP
[  748.231161] Modules linked in: ext4 mbcache jbd2 nls_iso8859_1 
nls_cp437 vfat fat snd_hda_codec_hdmi dell_laptop dcdbas dell_wmi 
iTCO_wdt iTCO_vendor_support intel_rapl x86_pkg_temp_thermal 
intel_powerclamp coretemp kvm_intel arc4 kvm irqbypass psmouse serio_raw 
pcspkr elan_i2c snd_soc_ssm4567 snd_soc_rt286 snd_soc_rl6347a 
snd_soc_core i2c_hid iwlmvm snd_compress snd_pcm_dmaengine ac97_bus 
mac80211 uvcvideo videobuf2_vmalloc btusb videobuf2_memops cdc_ether 
btrtl usbnet iwlwifi btbcm videobuf2_v4l2 btintel intel_pch_thermal 
videobuf2_core i2c_i801 videodev r8152 rtsx_pci_ms cfg80211 bluetooth 
visor media mii memstick joydev evdev mousedev input_leds rfkill mac_hid 
crc16 i915 fan thermal wmi dw_dmac int3403_thermal video dw_dmac_core 
drm_kms_helper snd_soc_sst_acpi i2c_designware_platform 
snd_soc_sst_match
[  748.237203]  snd_hda_intel 8250_dw i2c_designware_core gpio_lynxpoint 
spi_pxa2xx_platform drm int3402_thermal snd_hda_codec battery tpm_crb 
intel_hid snd_hda_core sparse_keymap fjes snd_hwdep int3400_thermal 
acpi_thermal_rel tpm_tis snd_pcm intel_gtt tpm acpi_als syscopyarea 
sysfillrect snd_timer sysimgblt fb_sys_fops mei_me i2c_algo_bit 
processor_thermal_device kfifo_buf processor snd industrialio acpi_pad 
ac int340x_thermal_zone mei intel_soc_dts_iosf button lpc_ich soundcore 
shpchp sch_fq_codel ip_tables x_tables btrfs xor raid6_pq 
jitterentropy_rng sha256_ssse3 sha256_generic hmac drbg ansi_cprng 
algif_skcipher af_alg uas usb_storage dm_crypt dm_mod sd_mod 
rtsx_pci_sdmmc atkbd libps2 crct10dif_pclmul crc32_pclmul crc32c_intel 
ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper
[  748.244176]  ablk_helper cryptd ahci libahci libata scsi_mod xhci_pci 
rtsx_pci xhci_hcd i8042 serio sdhci_acpi sdhci led_class mmc_core pl2303 
mos7720 usbserial parport hid_generic usbhid hid usbcore usb_common
[  748.246662] CPU: 0 PID: 2316 Comm: pacman Not tainted 4.5.1-1-ARCH #1
[  748.249123] Hardware name: Dell Inc. XPS 13 9343/0F5KF3, BIOS A07 
11/11/2015
[  748.251576] task: ffff8800d9d98e40 ti: ffff8800cec10000 task.ti: 
ffff8800cec10000
[  748.254064] RIP: 0010:[<ffffffffa0300bac>]  [<ffffffffa0300bac>] 
memmove_extent_buffer+0x10c/0x110 [btrfs]
[  748.256600] RSP: 0018:ffff8800cec13c18  EFLAGS: 00010246
[  748.259120] RAX: 0000000000000000 RBX: ffff88020c01ba40 RCX: 
0000000000000056
[  748.261631] RDX: 0000000000000000 RSI: ffff88021e40db38 RDI: 
ffff88021e40db38
[  748.264166] RBP: ffff8800cec13c48 R08: 0000000000000000 R09: 
000000000000033b
[  748.266716] R10: 0000000000000000 R11: 000000000000033b R12: 
00000000ffffeece
[  748.269267] R13: 0000000100000405 R14: 00000001000004c9 R15: 
ffff88020c01ba40
[  748.271818] FS:  00007f14d4271740(0000) GS:ffff88021e400000(0000) 
knlGS:0000000000000000
[  748.274392] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  748.276987] CR2: 0000000001630008 CR3: 00000000cffc8000 CR4: 
00000000003406f0
[  748.279603] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[  748.282220] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
[  748.284815] Stack:
[  748.287422]  00000000e3438cd2 ffff88020c01ba40 00000000000000c4 
000000000000002a
[  748.290082]  000000000000006b 00000000000003a0 ffff8800cec13ce8 
ffffffffa02b612c
[  748.292754]  ffffffffa02b433d ffff8800da9ca820 0000002800000000 
ffff8800daa78bd0
[  748.295441] Call Trace:
[  748.298104]  [<ffffffffa02b612c>] btrfs_del_items+0x33c/0x4a0 [btrfs]
[  748.300827]  [<ffffffffa02b433d>] ? btrfs_search_slot+0x90d/0x990 
[btrfs]
[  748.303564]  [<ffffffffa02f3d9c>] ? btrfs_get_token_8+0x6c/0x130 
[btrfs]
[  748.306311]  [<ffffffffa02e5ca9>] 
btrfs_truncate_inode_items+0x649/0xd20 [btrfs]
[  748.309071]  [<ffffffffa0330b5e>] ? 
btrfs_delayed_inode_release_metadata.isra.1+0x4e/0xf0 [btrfs]
[  748.311860]  [<ffffffffa02e7315>] btrfs_evict_inode+0x485/0x5d0 
[btrfs]
[  748.314627]  [<ffffffff81207e55>] evict+0xc5/0x190
[  748.317412]  [<ffffffff81208689>] iput+0x1d9/0x260
[  748.320199]  [<ffffffff811fd689>] do_unlinkat+0x199/0x2d0
[  748.322988]  [<ffffffff811fdf66>] SyS_unlink+0x16/0x20
[  748.325781]  [<ffffffff815ad6ae>] entry_SYSCALL_64_fastpath+0x12/0x6d
[  748.328584] Code: 41 5e 41 5f 5d c3 48 8b 7f 18 48 89 f2 48 c7 c6 40 
44 36 a0 e8 06 90 fa ff 0f 0b 48 8b 7f 18 48 c7 c6 08 44 36 a0 e8 f4 8f 
fa ff <0f> 0b 66 90 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 48 89 fb
[  748.331558] RIP  [<ffffffffa0300bac>] 
memmove_extent_buffer+0x10c/0x110 [btrfs]
[  748.334473]  RSP <ffff8800cec13c18>
[  748.356077] ---[ end trace 9bfb28800ab52273 ]---
[  748.359042] note: pacman[2316] exited with preempt_count 2
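
A side note on the numbers in the first BTRFS error line above: the "move len" of 4294962894 is far larger than the 16384-byte buffer, and it is what a small negative value looks like when it is stored in an unsigned 32-bit field. The following is only a sketch of that arithmetic, assuming a 32-bit wraparound is what happened; the helper name to_signed32 is made up.

```shell
# Hypothetical helper: reinterpret an unsigned 32-bit value as signed.
to_signed32() {
    v=$1
    if [ "$v" -ge 2147483648 ]; then
        # Values with the top bit set wrap around to negative numbers.
        echo $(( v - 4294967296 ))
    else
        echo "$v"
    fi
}

to_signed32 4294962894   # prints -4402
```

The same value appears in the register dump as R12 = 00000000ffffeece, since 0xffffeece is 4294962894 in decimal.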

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
  2016-05-07 15:45         ` Niccolò Belli
@ 2016-05-07 15:58           ` Clemens Eisserer
  2016-05-07 16:11             ` Niccolò Belli
  2016-05-07 23:35           ` Chris Murphy
  1 sibling, 1 reply; 25+ messages in thread
From: Clemens Eisserer @ 2016-05-07 15:58 UTC (permalink / raw)
  To: Niccolò Belli, linux-btrfs

Hi Niccolo,

> btrfs + dmcrypt + compress=lzo + autodefrag = corruption at first boot

Just out of curiosity - couldn't it be a hardware issue? I use almost the
same setup (compress-force=lzo instead of compress=lzo) on my
laptop for 2-3 years and haven't experienced any issues since
~kernel-3.14 or so.

Br, Clemens Eisserer


2016-05-07 17:45 GMT+02:00 Niccolò Belli <darkbasic@linuxsystems.it>:
> btrfs + dmcrypt + compress=lzo + autodefrag = corruption at first boot
> So discard is not the culprit. Will try to remove compress=lzo and
> autodefrag and see if it still happens.
>
> [  748.224346] BTRFS error (device dm-0): memmove bogus src_offset 5431 move
> len 4294962894 len 16384
> [  748.226206] ------------[ cut here ]------------
> [  748.227831] kernel BUG at fs/btrfs/extent_io.c:5723!
> [  748.229498] invalid opcode: 0000 [#1] PREEMPT SMP
> [  748.231161] Modules linked in: ext4 mbcache jbd2 nls_iso8859_1 nls_cp437
> vfat fat snd_hda_codec_hdmi dell_laptop dcdbas dell_wmi iTCO_wdt
> iTCO_vendor_support intel_rapl x86_pkg_temp_thermal intel_powerclamp
> coretemp kvm_intel arc4 kvm irqbypass psmouse serio_raw pcspkr elan_i2c
> snd_soc_ssm4567 snd_soc_rt286 snd_soc_rl6347a snd_soc_core i2c_hid iwlmvm
> snd_compress snd_pcm_dmaengine ac97_bus mac80211 uvcvideo videobuf2_vmalloc
> btusb videobuf2_memops cdc_ether btrtl usbnet iwlwifi btbcm videobuf2_v4l2
> btintel intel_pch_thermal videobuf2_core i2c_i801 videodev r8152 rtsx_pci_ms
> cfg80211 bluetooth visor media mii memstick joydev evdev mousedev input_leds
> rfkill mac_hid crc16 i915 fan thermal wmi dw_dmac int3403_thermal video
> dw_dmac_core drm_kms_helper snd_soc_sst_acpi i2c_designware_platform
> snd_soc_sst_match
> [  748.237203]  snd_hda_intel 8250_dw i2c_designware_core gpio_lynxpoint
> spi_pxa2xx_platform drm int3402_thermal snd_hda_codec battery tpm_crb
> intel_hid snd_hda_core sparse_keymap fjes snd_hwdep int3400_thermal
> acpi_thermal_rel tpm_tis snd_pcm intel_gtt tpm acpi_als syscopyarea
> sysfillrect snd_timer sysimgblt fb_sys_fops mei_me i2c_algo_bit
> processor_thermal_device kfifo_buf processor snd industrialio acpi_pad ac
> int340x_thermal_zone mei intel_soc_dts_iosf button lpc_ich soundcore shpchp
> sch_fq_codel ip_tables x_tables btrfs xor raid6_pq jitterentropy_rng
> sha256_ssse3 sha256_generic hmac drbg ansi_cprng algif_skcipher af_alg uas
> usb_storage dm_crypt dm_mod sd_mod rtsx_pci_sdmmc atkbd libps2
> crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel
> aes_x86_64 lrw gf128mul glue_helper
> [  748.244176]  ablk_helper cryptd ahci libahci libata scsi_mod xhci_pci
> rtsx_pci xhci_hcd i8042 serio sdhci_acpi sdhci led_class mmc_core pl2303
> mos7720 usbserial parport hid_generic usbhid hid usbcore usb_common
> [  748.246662] CPU: 0 PID: 2316 Comm: pacman Not tainted 4.5.1-1-ARCH #1
> [  748.249123] Hardware name: Dell Inc. XPS 13 9343/0F5KF3, BIOS A07
> 11/11/2015
> [  748.251576] task: ffff8800d9d98e40 ti: ffff8800cec10000 task.ti:
> ffff8800cec10000
> [  748.254064] RIP: 0010:[<ffffffffa0300bac>]  [<ffffffffa0300bac>]
> memmove_extent_buffer+0x10c/0x110 [btrfs]
> [  748.256600] RSP: 0018:ffff8800cec13c18  EFLAGS: 00010246
> [  748.259120] RAX: 0000000000000000 RBX: ffff88020c01ba40 RCX:
> 0000000000000056
> [  748.261631] RDX: 0000000000000000 RSI: ffff88021e40db38 RDI:
> ffff88021e40db38
> [  748.264166] RBP: ffff8800cec13c48 R08: 0000000000000000 R09:
> 000000000000033b
> [  748.266716] R10: 0000000000000000 R11: 000000000000033b R12:
> 00000000ffffeece
> [  748.269267] R13: 0000000100000405 R14: 00000001000004c9 R15:
> ffff88020c01ba40
> [  748.271818] FS:  00007f14d4271740(0000) GS:ffff88021e400000(0000)
> knlGS:0000000000000000
> [  748.274392] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  748.276987] CR2: 0000000001630008 CR3: 00000000cffc8000 CR4:
> 00000000003406f0
> [  748.279603] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [  748.282220] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [  748.284815] Stack:
> [  748.287422]  00000000e3438cd2 ffff88020c01ba40 00000000000000c4
> 000000000000002a
> [  748.290082]  000000000000006b 00000000000003a0 ffff8800cec13ce8
> ffffffffa02b612c
> [  748.292754]  ffffffffa02b433d ffff8800da9ca820 0000002800000000
> ffff8800daa78bd0
> [  748.295441] Call Trace:
> [  748.298104]  [<ffffffffa02b612c>] btrfs_del_items+0x33c/0x4a0 [btrfs]
> [  748.300827]  [<ffffffffa02b433d>] ? btrfs_search_slot+0x90d/0x990 [btrfs]
> [  748.303564]  [<ffffffffa02f3d9c>] ? btrfs_get_token_8+0x6c/0x130 [btrfs]
> [  748.306311]  [<ffffffffa02e5ca9>] btrfs_truncate_inode_items+0x649/0xd20
> [btrfs]
> [  748.309071]  [<ffffffffa0330b5e>] ?
> btrfs_delayed_inode_release_metadata.isra.1+0x4e/0xf0 [btrfs]
> [  748.311860]  [<ffffffffa02e7315>] btrfs_evict_inode+0x485/0x5d0 [btrfs]
> [  748.314627]  [<ffffffff81207e55>] evict+0xc5/0x190
> [  748.317412]  [<ffffffff81208689>] iput+0x1d9/0x260
> [  748.320199]  [<ffffffff811fd689>] do_unlinkat+0x199/0x2d0
> [  748.322988]  [<ffffffff811fdf66>] SyS_unlink+0x16/0x20
> [  748.325781]  [<ffffffff815ad6ae>] entry_SYSCALL_64_fastpath+0x12/0x6d
> [  748.328584] Code: 41 5e 41 5f 5d c3 48 8b 7f 18 48 89 f2 48 c7 c6 40 44
> 36 a0 e8 06 90 fa ff 0f 0b 48 8b 7f 18 48 c7 c6 08 44 36 a0 e8 f4 8f fa ff
> <0f> 0b 66 90 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 48 89 fb
> [  748.331558] RIP  [<ffffffffa0300bac>] memmove_extent_buffer+0x10c/0x110
> [btrfs]
> [  748.334473]  RSP <ffff8800cec13c18>
> [  748.356077] ---[ end trace 9bfb28800ab52273 ]---
> [  748.359042] note: pacman[2316] exited with preempt_count 2

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs ate my data in just two days, after a fresh install. ram  and disk are ok. it still mounts, but I cannot repair
  2016-05-07 15:58           ` Clemens Eisserer
@ 2016-05-07 16:11             ` Niccolò Belli
  2016-05-08 18:27               ` Patrik Lundquist
  2016-05-09 11:52               ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 25+ messages in thread
From: Niccolò Belli @ 2016-05-07 16:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Clemens Eisserer

On 2016-05-07 17:58 Clemens Eisserer wrote:
> Hi Niccolo,
> 
>> btrfs + dmcrypt + compress=lzo + autodefrag = corruption at first boot
> 
> Just out of curiosity - couldn't it be a hardware issue? I use almost the
> same setup (compress-force=lzo instead of compress=lzo) on my
> laptop for 2-3 years and haven't experienced any issues since
> ~kernel-3.14 or so.
> 
> Br, Clemens Eisserer

Hi,
Which kind of hardware issue? I did a full memtest86 check, a full 
smartmontools extended check and even a badblocks -wsv.
If this is really a hardware issue that we can identify I would be more 
than happy because Dell will replace my laptop and this nightmare will 
finally be over. I'm open to suggestions.

Niccolò

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
  2016-05-07 15:45         ` Niccolò Belli
  2016-05-07 15:58           ` Clemens Eisserer
@ 2016-05-07 23:35           ` Chris Murphy
  1 sibling, 0 replies; 25+ messages in thread
From: Chris Murphy @ 2016-05-07 23:35 UTC (permalink / raw)
  To: Niccolò Belli; +Cc: Btrfs BTRFS, Chris Murphy, Qu Wenruo, Omar Sandoval

On Sat, May 7, 2016 at 9:45 AM, Niccolò Belli <darkbasic@linuxsystems.it> wrote:
> btrfs + dmcrypt + compress=lzo + autodefrag = corruption at first boot
> So discard is not the culprit. Will try to remove compress=lzo and
> autodefrag and see if it still happens.

You're making the troubleshooting unnecessarily difficult by
continuing to use non-default options. *shrug*

Every single layer you add complicates the setup and troubleshooting.
Of course all of it should work together, many people do. But you're
the one having the problem so in order to demonstrate whether this is
a software bug or hardware problem, you need to test it with the most
basic setup possible --> btrfs on plain partitions and default mount
options.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
  2016-05-07 16:11             ` Niccolò Belli
@ 2016-05-08 18:27               ` Patrik Lundquist
  2016-05-09 11:52               ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 25+ messages in thread
From: Patrik Lundquist @ 2016-05-08 18:27 UTC (permalink / raw)
  To: Niccolò Belli; +Cc: linux-btrfs

On 7 May 2016 at 18:11, Niccolò Belli <darkbasic@linuxsystems.it> wrote:

> Which kind of hardware issue? I did a full memtest86 check, a full smartmontools extended check and even a badblocks -wsv.
> If this is really a hardware issue that we can identify I would be more than happy because Dell will replace my laptop and this nightmare will finally be over. I'm open to suggestions.


Well, your hardware differs from a lot of successful installations.
Are you using any power management tweaks?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
  2016-05-07 16:11             ` Niccolò Belli
  2016-05-08 18:27               ` Patrik Lundquist
@ 2016-05-09 11:52               ` Austin S. Hemmelgarn
  2016-05-09 14:53                 ` Niccolò Belli
  1 sibling, 1 reply; 25+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-09 11:52 UTC (permalink / raw)
  To: Niccolò Belli, linux-btrfs; +Cc: Clemens Eisserer

On 2016-05-07 12:11, Niccolò Belli wrote:
> On 2016-05-07 17:58 Clemens Eisserer wrote:
>> Hi Niccolo,
>>
>>> btrfs + dmcrypt + compress=lzo + autodefrag = corruption at first boot
>>
>> Just out of curiosity - couldn't it be a hardware issue? I use almost the
>> same setup (compress-force=lzo instead of compress=lzo) on my
>> laptop for 2-3 years and haven't experienced any issues since
>> ~kernel-3.14 or so.
>>
>> Br, Clemens Eisserer
>
> Hi,
> Which kind of hardware issue? I did a full memtest86 check, a full
> smartmontools extended check and even a badblocks -wsv.
> If this is really a hardware issue that we can identify I would be more
> than happy because Dell will replace my laptop and this nightmare will
> finally be over. I'm open to suggestions.
First, some general advice:
1. It is fully possible to have bad RAM that still passes memtest86 
consistently, and in fact, most of the time this will be the case (if 
you're seeing anything other than the bit-fade test in memtest86 fail, 
then your system probably won't boot fully).  Memtest doesn't replicate 
typical usage patterns very well.  My usual testing for RAM involves not 
just memtest, but also booting into a LiveCD (usually SystemRescueCD), 
pulling down a copy of the kernel source, and then running as many 
concurrent kernel builds as cores, each with as many make jobs as cores 
(so if you've got a quad core CPU (or a dual core with hyperthreading), 
it would be running 4 builds with -j4 passed to make).  GCC seems to 
have memory usage patterns that reliably trigger memory errors that 
aren't caught by memtest, so this generally gives good results. 
Secondarily, if it's a big system and I am not pressed for time, I do a 
quick Gentoo install with Xen, and then spin up twice as many Xen VM's 
as cores and run memtest in those concurrently (this seems to catch 
things a bit more reliably than just a plain memtest).
2. On a similar note, badblocks doesn't replicate filesystem-like access 
patterns; it just runs sequentially through the entire disk.  This isn't 
as likely to give bad results, but it's still important to know.  In 
particular, try running it over a dmcrypt volume a couple of times 
(preferably with a different key each time, pulling keys from 
/dev/urandom works well for this), as that will result in writing 
different data.  For what it's worth, when I'm doing initial testing of 
new disks, I always use ddrescue to copy /dev/zero over the whole disk, 
then do it twice through dmcrypt with different keys, copying from the 
disk to /dev/null after each pass.  This gives random data on disk as a 
starting point (which is good if you're going to use dmcrypt), and 
usually triggers reallocation of any bad sectors as early as possible. 
If I have time and access to an existing system I can connect the disk 
to, I often do testing with fio as well.
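
The concurrent-build RAM test from point 1 can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name print_build_cmds and the /tmp/kbuild-N output directories are made up, defconfig is just one possible configuration, and the function only prints the commands so they can be reviewed before anything is run.

```shell
# Hypothetical helper: print (not run) the commands for N concurrent kernel
# builds, each with -jN, using a separate output directory per build (O=).
print_build_cmds() {
    src=$1                      # path to an unpacked, clean kernel source tree
    n=${2:-$(nproc)}            # number of builds, and of make jobs per build
    i=1
    while [ "$i" -le "$n" ]; do
        echo "mkdir -p /tmp/kbuild-$i && make -C $src O=/tmp/kbuild-$i defconfig && make -C $src O=/tmp/kbuild-$i -j$n"
        i=$((i + 1))
    done
}

# Example: show the four command lines for a quad-core machine.
#   print_build_cmds "$HOME/linux" 4
```

One way to run the printed lines in parallel is to pipe them into xargs, for example "print_build_cmds "$HOME/linux" 4 | xargs -P 4 -I CMD sh -c CMD" (-P is a GNU/BSD extension). A build that fails with an internal compiler error or a segfault on an otherwise known-good tree would be the hint that memory is suspect.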

Now, to slightly more specific advice:
1. If you have an eSATA port, try plugging your hard disk in there and 
see if things work.  If that works but having the hard drive plugged in 
internally doesn't, then the issue is probably either that specific SATA 
port (in which case your chip-set is bad and you should get a new 
system), or the SATA connector itself (or the wiring, but that's not as 
likely when it's traces on a PCB).  Normally I'd suggest just swapping 
cables and SATA ports, but that's not really possible with a laptop.
2. If you have access to a reasonably large flash drive, or to a USB to 
SATA adapter, try that as well, if it works on that but not internally 
(or on an eSATA port), you've probably got a bad SATA controller, and 
should get a new system.
3. Try things without dmcrypt.  Adding extra layers makes it harder to 
determine what is actually wrong.  If it works without dmcrypt, try 
using different parameters for the encryption (different ciphers is what 
I would try first).  If it works reliably without dmcrypt, then it's 
either a bug in dmcrypt (which I don't think is very likely), or it's 
bad interaction between dmcrypt and BTRFS.  If it works with some 
encryption parameters but not others, then that will help narrow down 
where the issue is.
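
The dmcrypt-based disk test from the general advice above (point 2) could look roughly like the sketch below. It is an assumption-laden illustration, not the exact procedure described: plain dd is swapped in for ddrescue, the helper name print_crypt_pass and the mapping name disktest are made up, and the cipher and key size are just one reasonable choice. The function only prints the commands; actually running them destroys all data on the device.

```shell
# Hypothetical helper: print (not run) one write/read-back pass over a disk
# through a throwaway plain dm-crypt mapping keyed from /dev/urandom.
# WARNING: running the printed commands destroys all data on the device.
print_crypt_pass() {
    dev=$1                       # whole disk to test, e.g. /dev/sdX
    name=${2:-disktest}          # made-up mapping name
    # Open a plain mapping with a random, never-stored key.
    echo "cryptsetup open --type plain --cipher aes-xts-plain64 --key-size 256 --key-file /dev/urandom $dev $name"
    # Zeros written through the mapping land on disk as random-looking data.
    echo "dd if=/dev/zero of=/dev/mapper/$name bs=1M"
    echo "cryptsetup close $name"
    # Read the raw disk back so that unreadable sectors show up as errors.
    echo "dd if=$dev of=/dev/null bs=1M"
}

# Example: print_crypt_pass /dev/sdX
```

Running the printed pass twice gives two different keys and therefore two different on-disk patterns. The write step is expected to end with a "No space left on device" message once the whole mapping has been filled.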

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
  2016-05-09 11:52               ` Austin S. Hemmelgarn
@ 2016-05-09 14:53                 ` Niccolò Belli
  2016-05-09 16:29                   ` Zygo Blaxell
                                     ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Niccolò Belli @ 2016-05-09 14:53 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Clemens Eisserer, Austin S. Hemmelgarn, Patrik Lundquist,
	Chris Murphy, Qu Wenruo, Omar Sandoval

On Sunday 8 May 2016 20:27:55 CEST, Patrik Lundquist wrote:
> Are you using any power management tweaks?

Yes, as stated in my very first post I use TLP with 
SATA_LINKPWR_ON_BAT=max_performance, but I managed to reproduce the bug 
even without TLP. Also, in the past week I've always been on AC.

On Monday 9 May 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
> Memtest doesn't replicate typical usage patterns very well.  My 
> usual testing for RAM involves not just memtest, but also 
> booting into a LiveCD (usually SystemRescueCD), pulling down a 
> copy of the kernel source, and then running as many concurrent 
> kernel builds as cores, each with as many make jobs as cores (so 
> if you've got a quad core CPU (or a dual core with 
> hyperthreading), it would be running 4 builds with -j4 passed to 
> make).  GCC seems to have memory usage patterns that reliably 
> trigger memory errors that aren't caught by memtest, so this 
> generally gives good results.

Building a kernel with 4 concurrent jobs is not an issue for my system; in 
fact I compile a lot and I have never had any issue.

On Monday 9 May 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
> On a similar note, badblocks doesn't replicate filesystem-like 
> access patterns; it just runs sequentially through the entire 
> disk.  This isn't as likely to give bad results, but it's still 
> important to know.  In particular, try running it over a dmcrypt 
> volume a couple of times (preferably with a different key each 
> time, pulling keys from /dev/urandom works well for this), as 
> that will result in writing different data.  For what it's 
> worth, when I'm doing initial testing of new disks, I always use 
> ddrescue to copy /dev/zero over the whole disk, then do it twice 
> through dmcrypt with different keys, copying from the disk to 
> /dev/null after each pass.  This gives random data on disk as a 
> starting point (which is good if you're going to use dmcrypt), 
> and usually triggers reallocation of any bad sectors as early as 
> possible.

While trying to find a common denominator for my issue I did lots of 
backups of /dev/mapper/cryptroot and I restored them into 
/dev/mapper/cryptroot dozens of times (triggering a 150GB+ random data 
write every time), without any issue (after restoring the backup I always 
check the partition with btrfs check). So the disk doesn't seem to be the 
culprit.

On Monday 9 May 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
> 1. If you have an eSATA port, try plugging your hard disk in 
> there and see if things work.  If that works but having the hard 
> drive plugged in internally doesn't, then the issue is probably 
> either that specific SATA port (in which case your chip-set is 
> bad and you should get a new system), or the SATA connector 
> itself (or the wiring, but that's not as likely when it's traces 
> on a PCB).  Normally I'd suggest just swapping cables and SATA 
> ports, but that's not really possible with a laptop.
> 2. If you have access to a reasonably large flash drive, or to 
> a USB to SATA adapter, try that as well, if it works on that but 
> not internally (or on an eSATA port), you've probably got a bad 
> SATA controller, and should get a new system.

My laptop doesn't have an eSATA port and my only big enough external drive 
is currently used for daily backups, since I am afraid of losing data.

On Monday 9 May 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
> 3. Try things without dmcrypt.  Adding extra layers makes it 
> harder to determine what is actually wrong.  If it works without 
> dmcrypt, try using different parameters for the encryption 
> (different ciphers is what I would try first).  If it works 
> reliably without dmcrypt, then it's either a bug in dmcrypt 
> (which I don't think is very likely), or it's bad interaction 
> between dmcrypt and BTRFS.  If it works with some encryption 
> parameters but not others, then that will help narrow down where 
> the issue is.

On Sunday 8 May 2016 01:35:16 CEST, Chris Murphy wrote:
> You're making the troubleshooting unnecessarily difficult by
> continuing to use non-default options. *shrug*
>
> Every single layer you add complicates the setup and troubleshooting.
> Of course all of it should work together, many people do. But you're
> the one having the problem so in order to demonstrate whether this is
> a software bug or hardware problem, you need to test it with the most
> basic setup possible --> btrfs on plain partitions and default mount
> options.

I will try to recap because you obviously missed my previous e-mail: I 
managed to reproduce the irrecoverable corruption bug even with default 
options and no dmcrypt at all. It was somewhat harder to reproduce with 
default options, so I started to play with different combinations to find 
out whether anything increased the chances of getting corruption. I have 
the feeling that "autodefrag" increases the chances of corruption, but 
I'm not 100% sure about it. Anyway, triggering a reinstall of all packages 
with "pacaur -S $(pacman -Qe)" gives a high chance of irrecoverable 
corruption. That command simply extracts the tarballs from the cache and 
overwrites the already installed files. It doesn't write lots of data (after 
reinstallation my system is still quite small, just a few GBs) but it seems 
to be enough to displease the filesystem.

To avoid losing my data, every time I power on or reboot my laptop I first 
boot into an external drive and run btrfs check on /dev/mapper/cryptroot. 
If it's still sane I back up /dev/mapper/cryptroot to an external SSD with 
dd, otherwise I restore the previous copy from the SSD into 
/dev/mapper/cryptroot.
I cannot keep up such an annoying workflow for long, so I really hope 
someone will manage to track the bug down soon.
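
The routine above could be scripted roughly as follows. This is only a 
sketch: the function name check_then_sync and the CHECK_CMD override are 
made up for illustration, it assumes the exit status of btrfs check 
reflects what it found, and it must be run from the external system while 
the filesystem is not mounted.

```shell
#!/bin/sh
# Sketch of the check-then-backup-or-restore routine described above.
# Hypothetical names: check_then_sync, CHECK_CMD.
# Run only while the filesystem on the device is NOT mounted.

# Command used to verify the filesystem (read-only by default).
CHECK_CMD=${CHECK_CMD:-"btrfs check"}

# check_then_sync DEVICE IMAGE
#   If DEVICE passes the check, copy DEVICE -> IMAGE (refresh the backup).
#   Otherwise copy IMAGE -> DEVICE (restore the last known-good backup).
check_then_sync() {
    dev=$1
    img=$2
    # CHECK_CMD is left unquoted on purpose so "btrfs check" splits
    # into a command and its argument.
    if $CHECK_CMD "$dev" >/dev/null 2>&1; then
        echo "check passed: backing up $dev to $img"
        dd if="$dev" of="$img" bs=4194304 2>/dev/null   # 4 MiB blocks
    else
        echo "check failed: restoring $dev from $img"
        dd if="$img" of="$dev" bs=4194304 2>/dev/null
    fi
    sync   # flush everything to the target before powering off
}

# Example (the image path is a placeholder):
#   check_then_sync /dev/mapper/cryptroot /mnt/ssd/cryptroot.img
```

Keeping more than one generation of the image would guard against the 
case where the check passes but the contents are already damaged.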

Thanks for your help, I really appreciate it.
Niccolò


* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
  2016-05-09 14:53                 ` Niccolò Belli
@ 2016-05-09 16:29                   ` Zygo Blaxell
  2016-05-09 18:21                     ` Austin S. Hemmelgarn
  2016-05-12 14:35                     ` Niccolò Belli
  2016-05-09 19:23                   ` Lionel Bouton
  2016-05-09 21:30                   ` Chris Murphy
  2 siblings, 2 replies; 25+ messages in thread
From: Zygo Blaxell @ 2016-05-09 16:29 UTC (permalink / raw)
  To: Niccolò Belli
  Cc: linux-btrfs, Clemens Eisserer, Austin S. Hemmelgarn,
	Patrik Lundquist, Chris Murphy, Qu Wenruo, Omar Sandoval

On Mon, May 09, 2016 at 04:53:13PM +0200, Niccolò Belli wrote:
> While trying to find a common denominator for my issue I did lots of backups
> of /dev/mapper/cryptroot and I restored them into /dev/mapper/cryptroot
> dozens of times (triggering a 150GB+ random data write every time), without
> any issue (after restoring the backup I always check the partition with btrfs
> check). So disk doesn't seem to be the culprit.

Did you also check the data matches the backup?  btrfs check will only
look at the metadata, which is 0.1% of what you've copied.  From what
you've written, there should be a lot of errors in the data too.  If you
have incorrect data but btrfs scrub finds no incorrect checksums, then
your storage layer is probably fine and we have to look at CPU, host RAM,
and software as possible culprits.
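
One way to do that comparison is a byte-for-byte compare of the backup 
image against the device, run from the rescue system with nothing mounted. 
A minimal sketch; the function name same_bytes and the paths in the usage 
line are placeholders, not anything from this thread:

```shell
# same_bytes A B
#   Succeeds only if A and B are byte-for-byte identical.
#   cmp stops at the first difference and reports its offset.
same_bytes() {
    if cmp -- "$1" "$2"; then
        echo "identical: $1 $2"
    else
        echo "MISMATCH: $1 $2" >&2
        return 1
    fi
}

# Example (hypothetical paths):
#   same_bytes /mnt/ssd/cryptroot.img /dev/mapper/cryptroot
```

If the compare is repeated and the reported offset moves between runs, 
that would point at the read path or RAM rather than at what is stored on 
the disk.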

The logs you've posted so far indicate that bad metadata (e.g. negative
item lengths, nonsense transids in metadata references but sane transids
in the referred pages) is getting into otherwise valid and well-formed
btrfs metadata pages.  Since these pages are protected by checksums,
the corruption can't be originating in the storage layer--if it was, the
pages should be rejected as they are read from disk, before btrfs even
looks at them, and the insane transid should be the "found" one not the
"expected" one.  That suggests there is either RAM corruption happening
_after_ the data is read from disk (i.e. while the pages are cached in
RAM), or a severe software bug in the kernel you're running.

Try different kernel versions (e.g. 4.4.9 or 4.1.23) in case whoever
maintains your kernel had a bad day and merged a patch they should
not have.

Try a minimal configuration with as few drivers as possible loaded,
especially GPU drivers and anything from the staging subdirectory--when
these drivers have bugs, they ruin everything.

Try memtest86+ which has a few more/different tests than memtest86.
I have encountered RAM modules that pass memtest86 but fail memtest86+
and vice versa.

Try memtester, a memory tester that runs as a Linux process, so it can
detect corruption caused when device drivers spray data randomly into RAM,
or when the CPU thermal controls are influenced by Linux (an overheating
CPU-to-RAM bridge can really ruin your day, and some of the dumber laptop
designs rely on the OS for thermal management).

Try running more than one memory testing process, in case there is a bug
in your hardware that affects interactions between multiple cores (memtest
is single-threaded).  You can run memtest86 inside a kvm (e.g. kvm
-m 3072 -kernel /boot/memtest86.bin) to detect these kinds of issues.

Kernel compiles are a bad way to test RAM.  I've successfully built
kernels on hosts with known RAM failures.  The kernels don't always work
properly, but it's quite rare to see a build fail outright.

> [...]I have the feeling that "autodefrag" enhances the
> chances to get corruption, but I'm not 100% sure about it. Anyway,
> triggering a whole packages reinstall with "pacaur -S $(pacman -Qe)", giving
> high chances to get irrecoverable corruption. When running such command it
> simply extracts the tarballs from the cache and overwrites the already
> installed files. It doesn't write lots of data (after reinstallation my
> system is still quite small, just a few GBs) but it seems to be enough to
> displease the filesystem.

pacman probably does a lot of fsync() which will do a lot of metadata
tree updates.  autodefrag triples the I/O load for fragmented files and
most of that extra load is metadata tree writes.  Both will make the
symptoms of your problem worse.




* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
  2016-05-09 16:29                   ` Zygo Blaxell
@ 2016-05-09 18:21                     ` Austin S. Hemmelgarn
  2016-05-09 19:18                       ` Duncan
  2016-05-12 14:35                     ` Niccolò Belli
  1 sibling, 1 reply; 25+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-09 18:21 UTC (permalink / raw)
  To: Zygo Blaxell, Niccolò Belli
  Cc: linux-btrfs, Clemens Eisserer, Patrik Lundquist, Chris Murphy,
	Qu Wenruo, Omar Sandoval

On 2016-05-09 12:29, Zygo Blaxell wrote:
> On Mon, May 09, 2016 at 04:53:13PM +0200, Niccolò Belli wrote:
>> While trying to find a common denominator for my issue I did lots of backups
>> of /dev/mapper/cryptroot and I restored them into /dev/mapper/cryptroot
>> dozens of times (triggering a 150GB+ random data write every time), without
>> any issue (after restoring the backup I always check the partition with btrfs
>> check). So disk doesn't seem to be the culprit.
>
> Did you also check the data matches the backup?  btrfs check will only
> look at the metadata, which is 0.1% of what you've copied.  From what
> you've written, there should be a lot of errors in the data too.  If you
> have incorrect data but btrfs scrub finds no incorrect checksums, then
> your storage layer is probably fine and we have to look at CPU, host RAM,
> and software as possible culprits.
This is a good point.
>
> The logs you've posted so far indicate that bad metadata (e.g. negative
> item lengths, nonsense transids in metadata references but sane transids
> in the referred pages) is getting into otherwise valid and well-formed
> btrfs metadata pages.  Since these pages are protected by checksums,
> the corruption can't be originating in the storage layer--if it was, the
> pages should be rejected as they are read from disk, before btrfs even
> looks at them, and the insane transid should be the "found" one not the
> "expected" one.  That suggests there is either RAM corruption happening
> _after_ the data is read from disk (i.e. while the pages are cached in
> RAM), or a severe software bug in the kernel you're running.
>
> Try different kernel versions (e.g. 4.4.9 or 4.1.23) in case whoever
> maintains your kernel had a bad day and merged a patch they should
> not have.
>
> Try a minimal configuration with as few drivers as possible loaded,
> especially GPU drivers and anything from the staging subdirectory--when
> these drivers have bugs, they ruin everything.
>
> Try memtest86+ which has a few more/different tests than memtest86.
> I have encountered RAM modules that pass memtest86 but fail memtest86+
> and vice versa.
>
> Try memtester, a memory tester that runs as a Linux process, so it can
> detect corruption caused when device drivers spray data randomly into RAM,
> or when the CPU thermal controls are influenced by Linux (an overheating
> CPU-to-RAM bridge can really ruin your day, and some of the dumber laptop
> designs rely on the OS for thermal management).
>
> Try running more than one memory testing process, in case there is a bug
> in your hardware that affects interactions between multiple cores (memtest
> is single-threaded).  You can run memtest86 inside a kvm (e.g. kvm
> -m 3072 -kernel /boot/memtest86.bin) to detect these kinds of issues.
>
> Kernel compiles are a bad way to test RAM.  I've successfully built
> kernels on hosts with known RAM failures.  The kernels don't always work
> properly, but it's quite rare to see a build fail outright.
My original suggestion that prompted that part of the comment was to run 
a bunch of concurrent kernel builds (I only use kernel builds myself 
because it's a big project with essentially zero build dependencies, if 
I had the patience and space (and a LiveCD with the right tools and 
packages installed), I'd probably be using something like LibreOffice or 
Chromium instead), each run with as many jobs as CPUs (so on a 
quad-core system, run a dozen or so concurrently with make -j4).  I 
don't use this as my sole test (I also use multiple other tools), but I 
find that this does a particularly good job of exercising things that 
memtest doesn't, and I don't just make sure the builds succeed, but 
also that the compiled kernel images all match, because if there's bad 
RAM, the resultant images will often be different in some way (and I had 
forgotten to mention this bit).
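
For the image comparison, something along these lines would do.  It is 
only a sketch: the function name all_identical and the build directory 
layout are assumptions, and since kbuild embeds a build timestamp, user 
and host by default, those have to be pinned (KBUILD_BUILD_TIMESTAMP, 
KBUILD_BUILD_USER, KBUILD_BUILD_HOST) before healthy builds can match at 
all.  Whether they then come out bit-identical may still depend on the 
kernel version and config, so it is worth confirming on a known-good 
machine first.

```shell
# all_identical FILE...
#   Succeeds if every listed file has the same SHA-256 digest.
all_identical() {
    # Count the distinct digests; exactly one means all files match.
    sha256sum -- "$@" |
        awk '!seen[$1]++ { n++ } END { exit (n == 1) ? 0 : 1 }'
}

# Example, assuming build trees build1..build12 were made with
# O=buildN and the KBUILD_BUILD_* variables fixed to the same values:
#   all_identical build*/vmlinux || echo "images differ"
```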

This practice evolved out of the fact that the only bad RAM I've ever 
dealt with either completely failed to POST (which can have all kinds of 
interesting symptoms if it's just one module: some MBs refuse to boot, 
some report the error, others just disable the module and act like 
nothing happened), or passed all the memory testing tools I threw at it 
(memtest86, memtest86+, memtester, concurrent memtest86 invocations from 
Xen domains, inventive acrobatics with tmpfs and FIO, etc), but failed 
under heavy concurrent random access, which can be reliably produced by 
running a bunch of big software builds at the same time with the CPU 
insanely over-committed.  I could probably produce a similar workload 
with tmpfs and FIO, but it's a lot quicker and easier to remember how to 
do a kernel build than it is to remember the complex incantations needed 
to get FIO to do anything interesting.


* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
  2016-05-09 18:21                     ` Austin S. Hemmelgarn
@ 2016-05-09 19:18                       ` Duncan
  0 siblings, 0 replies; 25+ messages in thread
From: Duncan @ 2016-05-09 19:18 UTC (permalink / raw)
  To: linux-btrfs

Austin S. Hemmelgarn posted on Mon, 09 May 2016 14:21:57 -0400 as
excerpted:

> This practice evolved out of the fact that the only bad RAM I've ever
> dealt with either completely failed to POST (which can have all kinds of
> interesting symptoms if it's just one module, some MB's refuse to boot,
> some report the error, others just disable the module and act like
> nothing happened), or passed all the memory testing tools I threw at it
> (memtest86, memtest86+, memtester, concurrent memtest86 invocations from
> Xen domains, inventive acrobatics with tmpfs and FIO, etc), but failed
> under heavy concurrent random access, which can be reliably produced by
> running a bunch of big software builds at the same time with the CPU
> insanely over-committed.

My (likely much more limited) experience matches yours.

Tho FWIW, in my case I did find that one of the more common memory 
failure indicators was bz2-ed tarball decompression, where the tarball 
would fail its decompression checksum safety checks.  However, that most 
reliably happened in the context of a heavily loaded system doing other 
package builds in parallel to the package tarball extraction that failed.

In my case, I even had ECC RAM, but it was apparently just slightly out 
of spec for its labeled and internally configured memory speeds (PC3200 
DDR1 at the time), at least on my hardware.  Once I got a BIOS update 
that let me, I slightly downclocked the memory (to PC3000, IIRC), and it 
was absolutely solid, no more errors, even with tightened up wait-state 
timings.  Later I upgraded RAM, and the new RAM worked just fine at the 
same PC3200 speeds that were a problem for the older RAM.

The problem was apparently that while the RAM cells that memtest checks 
were fine, it was testing in an otherwise calm environment (not much 
choice since you can only boot to the test directly and can't do anything 
else at the same time), without all the other stuff going on in the 
hectic environment of a multi-package parallel build, that apparently 
happened to occasionally trigger the edge-case that would corrupt things.

And FWIW, I still have major respect for how well reiserfs behaved under 
those conditions.  No filesystem can be expected to be 100% reliable when 
it's getting corrupted data due to bad memory, but reiserfs held up 
remarkably well, far better than btrfs did under similar conditions (but 
then with the PCI and SATA bus) a few years later, forcing me back to 
reiserfs for a time, which again, continued to work like a champ, even 
under hardware conditions that were absolutely unworkable with btrfs.  I 
had a heat-related (AC went out, in Phoenix, in the summer, 40+ C 
outside, 50+C inside, who knows what the disks were!?) head crash on a 
disk too, where the partitions that were mounted and likely had the head 
flying over them were damaged beyond (easy) recovery, but other 
partitions on the same disk were absolutely fine, and I actually 
continued to run off them for a few months after cooling everything back 
down.  That sort of experience is the reason I still use reiserfs on 
spinning rust, including my second and third level backups, even while 
I'm running btrfs on the ssds for the working system and primary backup.  
It's also the reason I continue to use a partitioned system with multiple 
independent filesystems (btrfs raid1 on a pair of ssds for most of the 
working btrfs and primary backups, individual ssd btrfs in dup mode for 
/boot, and its backup on the other ssd), instead of putting my data eggs 
all in the same filesystem basket with subvolumes, where if the 
filesystem goes out all the subvolumes go with it!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
  2016-05-09 14:53                 ` Niccolò Belli
  2016-05-09 16:29                   ` Zygo Blaxell
@ 2016-05-09 19:23                   ` Lionel Bouton
  2016-05-09 21:30                   ` Chris Murphy
  2 siblings, 0 replies; 25+ messages in thread
From: Lionel Bouton @ 2016-05-09 19:23 UTC (permalink / raw)
  To: Niccolò Belli, linux-btrfs
  Cc: Clemens Eisserer, Austin S. Hemmelgarn, Patrik Lundquist,
	Chris Murphy, Qu Wenruo, Omar Sandoval

Hi,

On 09/05/2016 16:53, Niccolò Belli wrote:
> On Sunday 8 May 2016 20:27:55 CEST, Patrik Lundquist wrote:
>> Are you using any power management tweaks?
>
> Yes, as stated in my very first post I use TLP with
> SATA_LINKPWR_ON_BAT=max_performance, but I managed to reproduce the
> bug even without TLP. Also in the past week I've always been on AC.
>
> On Monday 9 May 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
>> Memtest doesn't replicate typical usage patterns very well.  My usual
>> testing for RAM involves not just memtest, but also booting into a
>> LiveCD (usually SystemRescueCD), pulling down a copy of the kernel
>> source, and then running as many concurrent kernel builds as cores,
>> each with as many make jobs as cores (so if you've got a quad core
>> CPU (or a dual core with hyperthreading), it would be running 4
>> builds with -j4 passed to make).  GCC seems to have memory usage
>> patterns that reliably trigger memory errors that aren't caught by
>> memtest, so this generally gives good results.
>
> Building kernel with 4 concurrent threads is not an issue for my
> system, in fact I do compile a lot and I never had any issue.

Note : I once had a server which would pass memtest86 and repeated
kernel compilations maxing out the CPU threads but couldn't at the same
time reliably compile a kernel and copy large amounts of data.
I think I lost my little automated test suite (I should definitely look
for it again or code it from scratch) but what I did on new servers
since that time was :

1/ create a file larger than the system's RAM (this makes sure you will
read and write all data from disk and not only caches and might catch
controller hardware problems too) with dd if=/dev/urandom (several
gigabytes of random data exercise many different patterns, far more than
what memtest86 would test), compute its md5 checksum
2/ launch a subprocess repeatedly compiling the kernel with more jobs
than available CPU threads and stopping as soon as the make exit code
was != 0.
3/ launch another subprocess repeatedly copying the random file to
another location and exiting when the md5 checksum didn't match the source.

Let it run as a burn-in test for as long as you can afford (from
experience after 24 hours if it's still running the probability that the
test will find a problem becomes negligible).
If one of the subprocesses stops by itself, your hardware is not stable.
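
Recoded from scratch, the steps above might look roughly like this. It is 
a sketch, not the original test suite: the function names, paths, job 
count and file size are all placeholders.

```shell
# copy_until_mismatch SRC DST [MAX_ROUNDS]
#   Repeatedly copy SRC to DST and compare md5 sums; return 1 on the
#   first mismatch. MAX_ROUNDS=0 (the default) means loop forever.
copy_until_mismatch() {
    src=$1
    dst=$2
    max=${3:-0}
    want=$(md5sum < "$src")
    round=0
    while [ "$max" -eq 0 ] || [ "$round" -lt "$max" ]; do
        cp -- "$src" "$dst" || return 1
        if [ "$(md5sum < "$dst")" != "$want" ]; then
            echo "checksum mismatch in round $round" >&2
            return 1
        fi
        round=$((round + 1))
    done
}

# build_until_failure SRCDIR JOBS
#   Rebuild the kernel tree in SRCDIR until make exits non-zero.
build_until_failure() {
    while make -C "$1" -j"$2" >/dev/null 2>&1; do
        make -C "$1" clean >/dev/null 2>&1
    done
    echo "kernel build failed" >&2
    return 1
}

# Example run (the file should be larger than RAM so that the copies
# really go through the disk and not only the page cache):
#   dd if=/dev/urandom of=/var/tmp/random.bin bs=1M count=20480
#   build_until_failure /usr/src/linux 8 &
#   copy_until_mismatch /var/tmp/random.bin /var/tmp/random.copy &
```

Both loops run in the background; checking with "jobs" now and then shows 
whether either of them has stopped.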

This actually caught a few unstable systems before they could go into
production for me.

Lionel


* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
  2016-05-09 14:53                 ` Niccolò Belli
  2016-05-09 16:29                   ` Zygo Blaxell
  2016-05-09 19:23                   ` Lionel Bouton
@ 2016-05-09 21:30                   ` Chris Murphy
  2 siblings, 0 replies; 25+ messages in thread
From: Chris Murphy @ 2016-05-09 21:30 UTC (permalink / raw)
  To: Niccolò Belli
  Cc: Btrfs BTRFS, Clemens Eisserer, Austin S. Hemmelgarn,
	Patrik Lundquist, Chris Murphy, Qu Wenruo, Omar Sandoval

On Mon, May 9, 2016 at 8:53 AM, Niccolò Belli <darkbasic@linuxsystems.it> wrote:

> I cannot manage to survive such annoying workflow for long, so I really hope
> someone will manage to track the bug down soon.

I suggest perseverance :) despite how tedious this is. Btrfs is more
aware of its state than other file systems, so if you give up and go
to ext4 it's entirely possible corruption is still happening but you
won't know it until there's a lot more damage. At the least, if you
have to give up, I'd suggest XFS, and make sure you're using xfsprogs
no older than 3.2.3, which will make a V5 file system that uses
metadata checksumming by default.


-- 
Chris Murphy


* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
  2016-05-09 16:29                   ` Zygo Blaxell
  2016-05-09 18:21                     ` Austin S. Hemmelgarn
@ 2016-05-12 14:35                     ` Niccolò Belli
  2016-05-12 15:43                       ` Austin S. Hemmelgarn
  2016-05-12 16:48                       ` Zygo Blaxell
  1 sibling, 2 replies; 25+ messages in thread
From: Niccolò Belli @ 2016-05-12 14:35 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Clemens Eisserer, Austin S. Hemmelgarn, Patrik Lundquist,
	Chris Murphy, Qu Wenruo, Omar Sandoval, Zygo Blaxell, ahferroin7,
	1i5t5.duncan

On Monday 9 May 2016 18:29:41 CEST, Zygo Blaxell wrote:
> Did you also check the data matches the backup?  btrfs check will only
> look at the metadata, which is 0.1% of what you've copied.  From what
> you've written, there should be a lot of errors in the data too.  If you
> have incorrect data but btrfs scrub finds no incorrect checksums, then
> your storage layer is probably fine and we have to look at CPU, host RAM,
> and software as possible culprits.
>
> The logs you've posted so far indicate that bad metadata (e.g. negative
> item lengths, nonsense transids in metadata references but sane transids
> in the referred pages) is getting into otherwise valid and well-formed
> btrfs metadata pages.  Since these pages are protected by checksums,
> the corruption can't be originating in the storage layer--if it was, the
> pages should be rejected as they are read from disk, before btrfs even
> looks at them, and the insane transid should be the "found" one not the
> "expected" one.  That suggests there is either RAM corruption happening
> _after_ the data is read from disk (i.e. while the pages are cached in
> RAM), or a severe software bug in the kernel you're running.

When doing the btrfs check I also always do a btrfs scrub and it never 
found any error. Once it didn't manage to finish the scrub because of:
BTRFS critical (device dm-0): corrupt leaf, slot offset bad: 
block=670597120,root=1, slot=6
and btrfs scrub status reported "was aborted after 00:00:10".

Talking about scrub, I created a systemd timer to run scrub hourly and I 
noticed 2 *uncorrectable* errors suddenly appeared on my system. So I 
immediately re-ran the scrub just to confirm it and then I rebooted into 
the Arch live usb and ran btrfs check: the metadata was perfect. So I 
ran btrfs scrub from the live usb and there were no errors at all! I 
rebooted into my system and ran scrub once again and the uncorrectable 
errors were really gone! It happened two times in the past few days.

> Try different kernel versions (e.g. 4.4.9 or 4.1.23) in case whoever
> maintains your kernel had a bad day and merged a patch they should
> not have.

Almost no patches get applied by the Arch kernel team: 
https://git.archlinux.org/svntogit/packages.git/tree/trunk?h=packages/linux
At the moment the only one is a harmless 
"change-default-console-loglevel.patch".

> Try a minimal configuration with as few drivers as possible loaded,
> especially GPU drivers and anything from the staging subdirectory--when
> these drivers have bugs, they ruin everything.

The Arch kernel team is quite conservative regarding staging/experimental 
features; I remember they rejected some config patches I submitted because 
of this.
Anyway, I will try to blacklist as many kernel modules as I can. Maybe 
blacklisting the GPU driver is too much, because if I can't actually use my 
laptop it will be much more difficult to reproduce the issue.

> Try memtest86+ which has a few more/different tests than memtest86.
> I have encountered RAM modules that pass memtest86 but fail memtest86+
> and vice versa.
>
> Try memtester, a memory tester that runs as a Linux process, so it can
> detect corruption caused when device drivers spray data randomly into RAM,
> or when the CPU thermal controls are influenced by Linux (an overheating
> CPU-to-RAM bridge can really ruin your day, and some of the dumber laptop
> designs rely on the OS for thermal management).
>
> Try running more than one memory testing process, in case there is a bug
> in your hardware that affects interactions between multiple cores (memtest
> is single-threaded).  You can run memtest86 inside a kvm (e.g. kvm
> -m 3072 -kernel /boot/memtest86.bin) to detect these kinds of issues.
>
> Kernel compiles are a bad way to test RAM.  I've successfully built
> kernels on hosts with known RAM failures.  The kernels don't always work
> properly, but it's quite rare to see a build fail outright.

I didn't use memtest86+ because of the lack of EFI support, but I just 
tried the shiny new memtest86 7.0 beta with improved tests for 12+ hours 
without issues.
I also ran "memtester 4G" and "systester-cli -gausslg 64M -threads 4 
-turns 100000" together for 12 hours without any issue, so I think both my 
RAM and CPU are ok.

I can think of only two possible culprits now (correct me if I'm wrong):
1) A btrfs bug
2) Another module messing things up

I can do nothing about btrfs bugs, so I will try to hunt down the second 
option. This is the list of modules I'm running:

lsmod | awk '$4 == ""' | awk '{print $1}' | sort

8250_dw
ac
acpi_als
acpi_pad
aesni_intel
ahci
algif_skcipher
ansi_cprng
arc4
atkbd
battery
bnep
btrfs
btusb
cdc_ether
cmac
coretemp
crc32c_intel
crc32_pclmul
crct10dif_pclmul
dell_laptop
dell_wmi
dm_crypt
drbg
ecb
elan_i2c
evdev
ext4
fan
fjes
ghash_clmulni_intel
gpio_lynxpoint
hid_generic
hid_multitouch
hmac
i2c_designware_platform
i2c_hid
i2c_i801
i915
input_leds
int3400_thermal
int3402_thermal
int3403_thermal
intel_hid
intel_pch_thermal
intel_powerclamp
intel_rapl
ip_tables
iTCO_wdt
iwlmvm
jitterentropy_rng
joydev
kvm_intel
lpc_ich
mac_hid
mei_me
mos7720
mousedev
msr
nls_cp437
nls_iso8859_1
nvram
pcspkr
pl2303
processor
processor_thermal_device
psmouse
r8152
rfcomm
rtsx_pci_ms
rtsx_pci_sdmmc
sch_fq_codel
sdhci_acpi
sd_mod
serio_raw
sha256_ssse3
shpchp
snd_hda_codec_hdmi
snd_hda_intel
snd_soc_ssm4567
snd_soc_sst_acpi
snd_soc_sst_broadwell
spi_pxa2xx_platform
thermal
tpm_crb
tpm_tis
uas
usbhid
uvcvideo
vfat
visor
x86_pkg_temp_thermal
xhci_pci

I will try to blacklist as many as I can while still keeping a somewhat 
usable system and see if I can reproduce it. If I am no longer able to 
reproduce it, then the hunt will begin. It will not be a fun one, as I 
already experienced with hid-multitouch, which gave me random kernel hangs 
at boot ONLY if loaded early into the initramfs: 
https://bugzilla.kernel.org/show_bug.cgi?id=105251
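
For the blacklisting itself, a small helper can turn a list of module 
names into modprobe.d lines. A sketch only: the function name 
make_blacklist, the file name and the modules picked in the example are 
illustrations, not a recommendation of what to disable.

```shell
# make_blacklist MODULE...
#   Print one modprobe.d "blacklist" line per module name given.
make_blacklist() {
    for mod in "$@"; do
        printf 'blacklist %s\n' "$mod"
    done
}

# Example (run as root; file name and module choice are placeholders):
#   make_blacklist uvcvideo btusb joydev > /etc/modprobe.d/bisect.conf
```

Note that "blacklist" only stops a module from being autoloaded through 
its aliases; it can still be pulled in as a dependency or loaded by hand, 
and modules already included in the initramfs probably need the initramfs 
to be regenerated.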

Another option would be running it over with my car, hoping that thanks to 
my comprehensive insurance policy Dell will give me the next model (the 
Skylake one) as a replacement (and hoping that it will not suffer from the 
same issue as the Broadwell one).

Thanks,
Niccolò


* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
  2016-05-12 14:35                     ` Niccolò Belli
@ 2016-05-12 15:43                       ` Austin S. Hemmelgarn
  2016-05-13 11:07                         ` Niccolò Belli
  2016-05-12 16:48                       ` Zygo Blaxell
  1 sibling, 1 reply; 25+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-12 15:43 UTC (permalink / raw)
  To: Niccolò Belli, linux-btrfs
  Cc: Clemens Eisserer, Patrik Lundquist, Chris Murphy, Qu Wenruo,
	Omar Sandoval, Zygo Blaxell, 1i5t5.duncan

On 2016-05-12 10:35, Niccolò Belli wrote:
> On Monday 9 May 2016 18:29:41 CEST, Zygo Blaxell wrote:
>> Did you also check the data matches the backup?  btrfs check will only
>> look at the metadata, which is 0.1% of what you've copied.  From what
>> you've written, there should be a lot of errors in the data too.  If you
>> have incorrect data but btrfs scrub finds no incorrect checksums, then
>> your storage layer is probably fine and we have to look at CPU, host RAM,
>> and software as possible culprits.
>>
>> The logs you've posted so far indicate that bad metadata (e.g. negative
>> item lengths, nonsense transids in metadata references but sane transids
>> in the referred pages) is getting into otherwise valid and well-formed
>> btrfs metadata pages.  Since these pages are protected by checksums,
>> the corruption can't be originating in the storage layer--if it was, the
>> pages should be rejected as they are read from disk, before btrfs even
>> looks at them, and the insane transid should be the "found" one not the
>> "expected" one.  That suggests there is either RAM corruption happening
>> _after_ the data is read from disk (i.e. while the pages are cached in
>> RAM), or a severe software bug in the kernel you're running.
>
> When doing the btrfs check I also always do a btrfs scrub and it never
> found any error. Once it didn't manage to finish the scrub because of:
> BTRFS critical (device dm-0): corrupt leaf, slot offset bad:
> block=670597120,root=1, slot=6
> and btrfs scrub status reported "was aborted after 00:00:10".
>
> Talking about scrub I created a systemd timer to run scrub hourly and I
> noticed 2 *uncorrectable* errors suddenly appeared on my system. So I
> immediately re-ran the scrub just to confirm it and then I rebooted into
> the Arch live usb and ran btrfs check: the metadata was perfect. So
> I ran btrfs scrub from the live usb and there were no errors at all!
> I rebooted into my system and ran scrub once again and the
> uncorrectable errors were really gone! It happened two times in the
> past few days.
This would indicate to me that you've either got bad RAM (most likely), 
or some other hardware component is not working correctly.  It's not 
unusual for hardware issues to be intermittent.
>
>> Try different kernel versions (e.g. 4.4.9 or 4.1.23) in case whoever
>> maintains your kernel had a bad day and merged a patch they should
>> not have.
>
> Almost no patches get applied by the Arch kernel team:
> https://git.archlinux.org/svntogit/packages.git/tree/trunk?h=packages/linux
> At the moment the only one is a harmless
> "change-default-console-loglevel.patch".
>
>> Try a minimal configuration with as few drivers as possible loaded,
>> especially GPU drivers and anything from the staging subdirectory--when
>> these drivers have bugs, they ruin everything.
>
> Arch kernel team is quite conservative regarding staging/experimental
> features, I remember they rejected some config patches I submitted
> because of this.
> Anyway I will try to blacklist as many kernel modules as I can. Maybe
> blacklisting GPU is too much because if I can't actually use my laptop
> it will be much more difficult to reproduce the issue.
Disable the GPU driver, but make sure you have the VGA_CONSOLE config 
enabled, and you should be fine (you'll just get an 80x25 text-mode 
console instead of a high-resolution one).
>
>> Try memtest86+ which has a few more/different tests than memtest86.
>> I have encountered RAM modules that pass memtest86 but fail memtest86+
>> and vice versa.
>>
>> Try memtester, a memory tester that runs as a Linux process, so it can
>> detect corruption caused when device drivers spray data randomly into
>> RAM,
>> or when the CPU thermal controls are influenced by Linux (an overheating
>> CPU-to-RAM bridge can really ruin your day, and some of the dumber laptop
>> designs rely on the OS for thermal management).
>>
>> Try running more than one memory testing process, in case there is a bug
>> in your hardware that affects interactions between multiple cores
>> (memtest
>> is single-threaded).  You can run memtest86 inside a kvm (e.g. kvm
>> -m 3072 -kernel /boot/memtest86.bin) to detect these kinds of issues.
>>
>> Kernel compiles are a bad way to test RAM.  I've successfully built
>> kernels on hosts with known RAM failures.  The kernels don't always work
>> properly, but it's quite rare to see a build fail outright.
>
> I didn't use memtest86+ because of the lack of EFI support, but I just
> tried the shiny new memtest86 7.0 beta with improved tests for 12+ hours
> without issues.
> Also I ran "memtester 4G" and "systester-cli -gausslg 64M -threads 4
> -turns 100000" together for 12 hours without any issue so I think both
> my ram and cpu are ok.
That's probably a good indication of the CPU and the MB being OK, but 
not necessarily the RAM.  There's two other possible options for testing 
the RAM that haven't been mentioned yet though (which I hadn't thought 
of myself until now):
1. If you have access to Windows, try the Windows Memory Diagnostic. 
This runs yet another slightly different set of tests from memtest86 and 
memtest86+, so it may catch issues they don't.  You can start this 
directly on an EFI system by loading /EFI/Microsoft/Boot/MEMTEST.EFI 
from the EFI system partition.
2. This is a Dell system.  If you still have the utility partition which 
Dell ships all their pre-provisioned systems with, that should have a 
hardware diagnostics tool.  I doubt that this will find anything (it's 
part of their QA procedure AFAICT), but it's probably worth trying, as 
the memory testing in that uses yet another slightly different 
implementation of the typical tests.  You can usually find this in the 
boot interrupt menu accessed by hitting F12 before the boot-loader loads.
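
As a rough sketch of the multi-process memory testing suggested earlier in
the thread (several testers at once, so more than one core is hitting RAM),
the helper below prints one memtester command per process. The name
plan_memtester and the sizes are assumptions for illustration only;
memtester itself takes a size and a loop count, and is normally run as root
so it can lock its memory.

```shell
# Print one memtester command per process, splitting a total amount of RAM
# (in MiB) evenly between them. This is a dry run: it only prints commands.
# plan_memtester is a hypothetical helper name, not part of memtester.
plan_memtester() {
    total_mib=$1   # total RAM to test, in MiB (leave headroom for the OS)
    procs=$2       # number of parallel memtester processes
    loops=$3       # passes per process
    per=$((total_mib / procs))
    i=0
    while [ "$i" -lt "$procs" ]; do
        echo "memtester ${per}M ${loops} &"
        i=$((i + 1))
    done
    echo "wait"
}

# Example: 3072 MiB across 4 processes, 2 passes each.
plan_memtester 3072 4 2
```

The printed commands can be reviewed first and then piped to a root shell.
Keeping it a dry run avoids accidentally locking most of the RAM on a
machine that is already unstable.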
>
> I can think only about two possible culprits now (correct me if I'm wrong):
> 1) A btrfs bug
> 2) Another module screwing things around
It could still be the disk (not likely, but possible) or the storage 
controller.  If you have a spare disk, I'd suggest trying with that 
(assuming of course it doesn't void your warranty).
>
> I can do nothing about btrfs bugs so I will try to hunt the second
> option. This is the list of modules I'm running:
>
> lsmod | awk '$4 == ""' | awk '{print $1}' | sort
>
> 8250_dw
> ac
> acpi_als
> acpi_pad
> aesni_intel
> ahci
> algif_skcipher
> ansi_cprng
> arc4
> atkbd
> battery
> bnep
> btrfs
> btusb
> cdc_ether
> cmac
> coretemp
> crc32c_intel
> crc32_pclmul
> crct10dif_pclmul
> dell_laptop
> dell_wmi
> dm_crypt
> drbg
> ecb
> elan_i2c
> evdev
> ext4
> fan
> fjes
> ghash_clmulni_intel
> gpio_lynxpoint
> hid_generic
> hid_multitouch
> hmac
> i2c_designware_platform
> i2c_hid
> i2c_i801
> i915
> input_leds
> int3400_thermal
> int3402_thermal
> int3403_thermal
> intel_hid
> intel_pch_thermal
> intel_powerclamp
> intel_rapl
> ip_tables
> iTCO_wdt
> iwlmvm
> jitterentropy_rng
> joydev
> kvm_intel
> lpc_ich
> mac_hid
> mei_me
> mos7720
> mousedev
> msr
> nls_cp437
> nls_iso8859_1
> nvram
> pcspkr
> pl2303
> processor
> processor_thermal_device
> psmouse
> r8152
> rfcomm
> rtsx_pci_ms
> rtsx_pci_sdmmc
> sch_fq_codel
> sdhci_acpi
> sd_mod
> serio_raw
> sha256_ssse3
> shpchp
> snd_hda_codec_hdmi
> snd_hda_intel
> snd_soc_ssm4567
> snd_soc_sst_acpi
> snd_soc_sst_broadwell
> spi_pxa2xx_platform
> thermal
> tpm_crb
> tpm_tis
> uas
> usbhid
> uvcvideo
> vfat
> visor
> x86_pkg_temp_thermal
> xhci_pci
>
> I will try to blacklist as many as I can while still keeping a somewhat
> usable system and see if I can reproduce it. If I am not able to
> reproduce it anymore then the hunt will begin. It will not be a fun
> one, as I already experienced with hid-multitouch, which gave me random
> kernel hangs at boot ONLY if loaded early into the initramfs:
> https://bugzilla.kernel.org/show_bug.cgi?id=105251
Based on what you've got listed for modules, I'd expect the absolute 
minimum for a usable test system to be:
  ac
  acpi_als (you can probably remove this, it's for the ambient light sensor)
  acpi_pad
  ahci
  atkbd
  battery
  btrfs
  coretemp
  dell_laptop
  dell_wmi
  elan_i2c
  evdev
  ext4
  fan
  gpio_lynxpoint
  hid_generic
  hid_multitouch
  i2c_i801
  i915 (this is your GPU module, you should still have a usable text 
console if this isn't loaded)
  int3400_thermal
  int3402_thermal
  int3403_thermal
  intel_hid
  intel_pch_thermal
  intel_powerclamp
  intel_rapl
  ip_tables (if you have no firewall configured, you can safely 
blacklist this)
  iwlmvm (you might try removing this, but you will have no wifi without it)
  lpc_ich
  mousedev
  nvram (you might be able to remove this, I don't remember if the dell 
modules depend on it or not)
  processor
  processor_thermal_device
  psmouse
  r8152 (you can try removing this too, but you will have no ethernet 
without it)
  sch_fq_codel
  serio_raw
  spi_pxa2xx_platform
  thermal
  usbhid
  vfat (if you avoid mounting your EFI system partition, you can 
probably pull this out)
  x86_pkg_temp_thermal
  xhci_pci
Note that this assumes you aren't testing on dmcrypt.  Make absolutely 
certain though that you don't remove any of the *thermal modules, the 
fan module, or the dell modules; not having those may result in 
hardware damage.
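
One way to turn a keep list like the one above into a blacklist file could
look like the sketch below. The helper name gen_blacklist and the file
layout (one module name per line) are assumptions for illustration. Note
that a modprobe.d "blacklist" line only stops automatic, alias-based
loading; a module that is loaded explicitly or pulled in as a dependency
will still load.

```shell
# Emit modprobe.d "blacklist" lines for every loaded module that is not on
# a keep list. gen_blacklist is a hypothetical helper; both arguments are
# plain text files with one module name per line.
gen_blacklist() {
    loaded=$1   # e.g. output of: lsmod | awk 'NR > 1 {print $1}'
    keep=$2     # modules that must stay (thermal, fan, dell, ...)
    # -v invert, -x whole-line match, -F fixed strings, -f patterns from file
    grep -vxFf "$keep" "$loaded" | sort | sed 's/^/blacklist /'
}

# Small self-contained demonstration with made-up lists.
tmp=$(mktemp -d)
printf '%s\n' btrfs fan uvcvideo thermal joydev > "$tmp/loaded"
printf '%s\n' btrfs fan thermal > "$tmp/keep"
gen_blacklist "$tmp/loaded" "$tmp/keep"
rm -r "$tmp"
```

The output would be redirected into a file under /etc/modprobe.d/ (any
name ending in .conf). If some of the modules are also included in the
initramfs, it probably has to be regenerated for the blacklist to take
effect at early boot.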

* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
  2016-05-12 14:35                     ` Niccolò Belli
  2016-05-12 15:43                       ` Austin S. Hemmelgarn
@ 2016-05-12 16:48                       ` Zygo Blaxell
  1 sibling, 0 replies; 25+ messages in thread
From: Zygo Blaxell @ 2016-05-12 16:48 UTC (permalink / raw)
  To: Niccolò Belli
  Cc: linux-btrfs, Clemens Eisserer, Austin S. Hemmelgarn,
	Patrik Lundquist, Chris Murphy, Qu Wenruo, Omar Sandoval,
	1i5t5.duncan

On Thu, May 12, 2016 at 04:35:24PM +0200, Niccolò Belli wrote:
> When doing the btrfs check I also always do a btrfs scrub and it never found
> any error. Once it didn't manage to finish the scrub because of:
> BTRFS critical (device dm-0): corrupt leaf, slot offset bad:
> block=670597120,root=1, slot=6
> and btrfs scrub status reported "was aborted after 00:00:10".
> 
> Talking about scrub I created a systemd timer to run scrub hourly and I
> noticed 2 *uncorrectable* errors suddenly appeared on my system. So I
> immediately re-ran the scrub just to confirm it and then I rebooted into the
> Arch live usb and ran btrfs check: the metadata were perfect. So I ran
> btrfs scrub from the live usb and there were no errors at all! I rebooted
> into my system and ran scrub once again and the uncorrectable errors
> were really gone! It happened two times in the past few days.

That's what a RAM corruption problem looks like when you run btrfs scrub.
Maybe the RAM itself is OK, but *something* is scribbling on it.
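
A cheap way to look for this kind of scribbling, outside of btrfs, might be
to hash the same unchanged file several times and see whether the digest
ever differs. This is only a sketch under that assumption, not something
proposed in the thread; count_distinct_hashes is a hypothetical helper name.

```shell
# Hash the same file several times and report how many distinct digests
# were seen. More than one digest for unchanged data means the corruption
# happened somewhere between the disk and the hashing process.
# count_distinct_hashes is a hypothetical helper name.
count_distinct_hashes() {
    file=$1
    rounds=$2
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # Optional, root only: drop the page cache so each round re-reads
        # from disk instead of from cached pages.
        # sync; echo 3 > /proc/sys/vm/drop_caches
        sha256sum < "$file"
        i=$((i + 1))
    done | sort -u | wc -l | tr -d ' '
}

# Demonstration on a 1 MiB scratch file; a healthy machine should print 1.
tmp=$(mktemp)
head -c 1048576 /dev/urandom > "$tmp"
count_distinct_hashes "$tmp" 3
rm -f "$tmp"
```

On the affected laptop this would be more telling with a file of several
GiB and many rounds, run while the system is under the same load that
precedes the scrub errors.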

Does the Arch live usb use the same kernel as your normal system?

> Almost no patches get applied by the Arch kernel team:
> https://git.archlinux.org/svntogit/packages.git/tree/trunk?h=packages/linux
> At the moment the only one is a harmless
> "change-default-console-loglevel.patch".

Did you try an older (or newer) kernel?  I've been running 4.5.x on a few
canary systems, but so far none of them have survived more than a day.
Contrast with 4.1.x and 4.4.x, which run for months between reboots
for me.  Maybe there's a regression in 4.5.x, maybe I did something
wrong in my config or build, or maybe I just have too few data points
to draw any conclusions, but my data so far is telling me to stay on
4.4.x until something changes (i.e. wait for a 4.5.x stable update or
skip directly to 4.6.x).  :-/

It's always worth trying this if only to eliminate regression as a
possible root cause early.  In practice, every mainline kernel release
has a regression that affects at least one combination of config options
and hardware.  btrfs is stable enough now that you can be running one
or two releases behind to avoid a problem elsewhere in the kernel.

> Another option will be crashing it with my car's wheels hoping that because
> of my comprehensive insurance policy Dell will give me the next model (the
> Skylake one) as a replacement (hoping that it will not suffer from the same
> issue of the Broadwell one).

The first rule of Insurance Fraud Club:  don't talk about Insurance
Fraud Club.  ;)

It's possible there's a problem that affects only very specific chipsets.
You seem to have eliminated RAM in isolation, but there could be a problem
in the kernel that affects only your chipset.


* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
  2016-05-12 15:43                       ` Austin S. Hemmelgarn
@ 2016-05-13 11:07                         ` Niccolò Belli
  2016-05-13 11:35                           ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 25+ messages in thread
From: Niccolò Belli @ 2016-05-13 11:07 UTC (permalink / raw)
  To: Austin S. Hemmelgarn
  Cc: linux-btrfs, Clemens Eisserer, Patrik Lundquist, Chris Murphy,
	Qu Wenruo, Omar Sandoval, Zygo Blaxell, 1i5t5.duncan

On Thursday 12 May 2016 17:43:38 CEST, Austin S. Hemmelgarn wrote:
> That's probably a good indication of the CPU and the MB being 
> OK, but not necessarily the RAM.  There's two other possible 
> options for testing the RAM that haven't been mentioned yet 
> though (which I hadn't thought of myself until now):
> 1. If you have access to Windows, try the Windows Memory 
> Diagnostic. This runs yet another slightly different set of 
> tests from memtest86 and memtest86+, so it may catch issues they 
> don't.  You can start this directly on an EFI system by loading 
> /EFI/Microsoft/Boot/MEMTEST.EFI from the EFI system partition.
> 2. This is a Dell system.  If you still have the utility 
> partition which Dell ships all their pre-provisioned systems 
> with, that should have a hardware diagnostics tool.  I doubt 
> that this will find anything (it's part of their QA procedure 
> AFAICT), but it's probably worth trying, as the memory testing 
> in that uses yet another slightly different implementation of 
> the typical tests.  You can usually find this in the boot 
> interrupt menu accessed by hitting F12 before the boot-loader 
> loads.

I tried the Dell System Test, including the enhanced optional ram tests and 
it was fine. I also tried the Microsoft one, which passed. BUT if I select 
the advanced test in the Microsoft one it always stops at 21% of the first 
test. The test menus are still working, but fans get quiet and it keeps 
writing "test running... 21%" forever. I tried it many times and it always 
got stuck at 21%, so I suspect a test suite bug instead of a ram failure.

I also noticed some other interesting behaviours: while I was running the 
usual scrub+check (both were fine) from the livecd I noticed this in dmesg:
[  261.301159] BTRFS info (device dm-0): bdev /dev/mapper/cryptroot errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
Corrupt? But both scrub and check were fine... I double checked scrub and 
check and they were still fine.

This is what happened another time: 
https://drive.google.com/open?id=0Bwe9Wtc-5xF1dGtPaWhTZ0w5aUU
I was making a backup of my partition USING DD from the livecd. It wasn't 
even mounted if I recall correctly!

On Thursday 12 May 2016 18:48:17 CEST, Zygo Blaxell wrote:
> That's what a RAM corruption problem looks like when you run btrfs scrub.
> Maybe the RAM itself is OK, but *something* is scribbling on it.
>
> Does the Arch live usb use the same kernel as your normal system?

Yes, except for the point release (the system is slightly ahead of the 
liveusb).

On Thursday 12 May 2016 18:48:17 CEST, Zygo Blaxell wrote:
> Did you try an older (or newer) kernel?  I've been running 4.5.x on a few
> canary systems, but so far none of them have survived more than a day.

No (except for point releases from 4.5.0 to 4.5.4), but I will try 4.4.

On Thursday 12 May 2016 18:48:17 CEST, Zygo Blaxell wrote:
> It's possible there's a problem that affects only very specific chipsets.
> You seem to have eliminated RAM in isolation, but there could be a problem
> in the kernel that affects only your chipset.

Funny considering it is sold as a Linux laptop. Unfortunately they only 
tested it with the ancient Ubuntu 14.04.

Niccolò

* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
  2016-05-13 11:07                         ` Niccolò Belli
@ 2016-05-13 11:35                           ` Austin S. Hemmelgarn
  2016-05-13 12:10                             ` Niccolò Belli
  0 siblings, 1 reply; 25+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-13 11:35 UTC (permalink / raw)
  To: Niccolò Belli
  Cc: linux-btrfs, Clemens Eisserer, Patrik Lundquist, Chris Murphy,
	Qu Wenruo, Omar Sandoval, Zygo Blaxell, 1i5t5.duncan

On 2016-05-13 07:07, Niccolò Belli wrote:
> On Thursday 12 May 2016 17:43:38 CEST, Austin S. Hemmelgarn wrote:
>> That's probably a good indication of the CPU and the MB being OK, but
>> not necessarily the RAM.  There's two other possible options for
>> testing the RAM that haven't been mentioned yet though (which I hadn't
>> thought of myself until now):
>> 1. If you have access to Windows, try the Windows Memory Diagnostic.
>> This runs yet another slightly different set of tests from memtest86
>> and memtest86+, so it may catch issues they don't.  You can start this
>> directly on an EFI system by loading /EFI/Microsoft/Boot/MEMTEST.EFI
>> from the EFI system partition.
>> 2. This is a Dell system.  If you still have the utility partition
>> which Dell ships all their pre-provisioned systems with, that should
>> have a hardware diagnostics tool.  I doubt that this will find
>> anything (it's part of their QA procedure AFAICT), but it's probably
>> worth trying, as the memory testing in that uses yet another slightly
>> different implementation of the typical tests.  You can usually find
>> this in the boot interrupt menu accessed by hitting F12 before the
>> boot-loader loads.
>
> I tried the Dell System Test, including the enhanced optional ram tests
> and it was fine. I also tried the Microsoft one, which passed. BUT if I
> select the advanced test in the Microsoft one it always stops at 21% of
> the first test. The test menus are still working, but fans get quiet and it
> keeps writing "test running... 21%" forever. I tried it many times and
> it always got stuck at 21%, so I suspect a test suite bug instead of a
> ram failure.
I've actually seen this before on other systems (different completion 
percentage on each system, but otherwise the same).  All of them ended up 
having a bad CPU or MB, although the ones with CPU issues were fine 
after BIOS updates which included newer microcode.
>
> I also noticed some other interesting behaviours: while I was running
> the usual scrub+check (both were fine) from the livecd I noticed this in
> dmesg:
> [  261.301159] BTRFS info (device dm-0): bdev /dev/mapper/cryptroot
> errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
> Corrupt? But both scrub and check were fine... I double checked scrub
> and check and they were still fine.
It's worth noting that these are running counts of errors since the last 
time the stats were reset (and they only get reset manually).  If you 
haven't reset the stats, then this isn't all that surprising.
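
To make that concrete: the same counters can be read and reset with
"btrfs device stats", so that any later non-zero value is known to be a new
event. The sketch below only parses the counter out of a log line like the
one quoted above; corrupt_count is a hypothetical helper name, and the
commands that need a real btrfs mount are left as comments.

```shell
# Pull the "corrupt" counter out of a BTRFS "bdev ... errs:" kernel log
# line read from stdin. corrupt_count is a hypothetical helper name.
corrupt_count() {
    sed -n 's/.*errs: wr [0-9]*, rd [0-9]*, flush [0-9]*, corrupt \([0-9]*\), gen [0-9]*.*/\1/p'
}

# The line reported in this thread:
echo 'BTRFS info (device dm-0): bdev /dev/mapper/cryptroot errs: wr 0, rd 0, flush 0, corrupt 4, gen 0' | corrupt_count

# On a live system (needs root, not run here):
#   btrfs device stats /       # show the persistent per-device counters
#   btrfs device stats -z /    # show them, then reset them to zero
```

After a reset, a count that climbs again while scrub and check stay clean
would point at something transient rather than at old, already-known
errors.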
>
> This is what happened another time:
> https://drive.google.com/open?id=0Bwe9Wtc-5xF1dGtPaWhTZ0w5aUU
> I was making a backup of my partition USING DD from the livecd. It
> wasn't even mounted if I recall correctly!
The fact that you're getting an OOPS involving core kernel threads 
(kswapd) is a pretty good indication that either there's a bug elsewhere 
in the kernel, or that something is wrong with your hardware.  It's 
really difficult to be certain if you don't have a reliable test case 
though.
>
> On Thursday 12 May 2016 18:48:17 CEST, Zygo Blaxell wrote:
>> That's what a RAM corruption problem looks like when you run btrfs scrub.
>> Maybe the RAM itself is OK, but *something* is scribbling on it.
>>
>> Does the Arch live usb use the same kernel as your normal system?
>
> Yes, except for the point release (the system is slightly ahead of the
> liveusb).
>
> On Thursday 12 May 2016 18:48:17 CEST, Zygo Blaxell wrote:
>> Did you try an older (or newer) kernel?  I've been running 4.5.x on a few
>> canary systems, but so far none of them have survived more than a day.
>
> No (except for point releases from 4.5.0 to 4.5.4), but I will try 4.4.
FWIW, I've been running 4.5 with almost no issues on my laptop since it 
came out (the few issues I have had are not unique to 4.5, and are all 
ultimately firmware issues (Lenovo has been getting _really_ bad 
recently about having broken ACPI and EFI implementations...)).  Of 
course, I'm also running Gentoo, so everything is built locally, but I 
doubt that that has much impact on stability.
>
> On Thursday 12 May 2016 18:48:17 CEST, Zygo Blaxell wrote:
>> It's possible there's a problem that affects only very specific chipsets.
>> You seem to have eliminated RAM in isolation, but there could be a
>> problem
>> in the kernel that affects only your chipset.
>
> Funny considering it is sold as a Linux laptop. Unfortunately they only
> tested it with the ancient Ubuntu 14.04.
Sadly, this is pretty typical for anything sold as a 'Linux' system that 
isn't a server.  Even for the servers sold as such, it's not unusual for 
it to only be tested with old versions of CentOS.

Now, I hadn't thought of this before, but it's a Dell system, so you're 
trapping out to SMBIOS for everything under the sun, and if they don't 
pass a correct memory map (or correct ACPI tables) to the OS during 
boot, then there may be some sections of RAM that both Linux and the 
firmware think they can use, which could definitely result in symptoms 
like bad RAM while still consistently passing memory tests (because they 
don't make BIOS calls after they have the system info they need).
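
One way to start checking that theory could be to look at the memory map
the firmware hands to the kernel, which is logged at boot as BIOS-e820
lines, and compare it across boots or firmware versions. The sketch below
sums the ranges marked usable; usable_bytes is a hypothetical helper name
and the sample lines are made up for illustration.

```shell
# Sum the sizes of all "usable" ranges from BIOS-e820 lines on stdin.
# usable_bytes is a hypothetical helper name.
usable_bytes() {
    # Keep only usable ranges and reduce each line to "START END" (hex).
    sed -n 's/.*e820: \[mem \(0x[0-9a-f]*\)-\(0x[0-9a-f]*\)] usable$/\1 \2/p' |
    {
        total=0
        while read -r start end; do
            # Ranges are inclusive, hence the +1.
            total=$((total + $end - $start + 1))
        done
        echo "$total"
    }
}

# Made-up sample in the format printed by the kernel at boot.
usable_bytes <<'EOF'
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000001fffff] usable
EOF
# prints 1702912

# On the real system: dmesg | grep -i e820 | usable_bytes
```

A total that changes between boots, or usable ranges that overlap areas
listed as reserved in /proc/iomem, would be worth reporting. A stable map
does not rule the theory out, though, since the firmware could still touch
memory it declared usable.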

* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
  2016-05-13 11:35                           ` Austin S. Hemmelgarn
@ 2016-05-13 12:10                             ` Niccolò Belli
  2016-05-13 21:54                               ` Chris Murphy
  0 siblings, 1 reply; 25+ messages in thread
From: Niccolò Belli @ 2016-05-13 12:10 UTC (permalink / raw)
  To: Austin S. Hemmelgarn
  Cc: linux-btrfs, Clemens Eisserer, Patrik Lundquist, Chris Murphy,
	Qu Wenruo, Omar Sandoval, Zygo Blaxell, 1i5t5.duncan

On Friday 13 May 2016 13:35:01 CEST, Austin S. Hemmelgarn wrote:
> The fact that you're getting an OOPS involving core kernel 
> threads (kswapd) is a pretty good indication that either there's 
> a bug elsewhere in the kernel, or that something is wrong with 
> your hardware.  it's really difficult to be certain if you don't 
> have a reliable test case though.

Talking about reliable test cases, I forgot to say that I definitely found 
an interesting one. It doesn't lead to OOPS but perhaps something even more 
interesting. While running countless stress tests I tried running some 
games to stress the system in different ways. I chose openmw (an open 
source engine for Morrowind) and I played it for a while on my second 
external monitor (while I watched some monitoring tools on my first 
monitor). I noticed that after playing a while I *always* lose internet 
connection (I use a USB3 Gigabit Ethernet adapter). This isn't the only 
thing which happens: even if the game keeps running flawlessly and the 
system *seems* to work fine (I can drag windows, open the terminal...) lots 
of commands simply stall (for example mounting a partition, unmounting it, 
rebooting...). I can reliably reproduce it, it ALWAYS happens.
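
When commands stall like this, they are usually sitting in uninterruptible
sleep (state D), and listing those tasks is a quick first check. A minimal
sketch, assuming procps ps; blocked_tasks is a hypothetical helper name and
the sample process list is made up.

```shell
# Filter "STATE PID COMMAND" lines on stdin down to tasks in
# uninterruptible sleep (state D). blocked_tasks is a hypothetical helper.
blocked_tasks() {
    awk '$1 ~ /^D/ {print $2, $3}'
}

# Made-up sample in the format of: ps -eo state,pid,comm --no-headers
printf '%s\n' 'S 1 systemd' 'D 4242 umount' 'R 4300 ps' 'D 4310 mount' | blocked_tasks

# On the affected system (not run here):
#   ps -eo state,pid,comm --no-headers | blocked_tasks
#   echo w > /proc/sysrq-trigger   # as root: log stacks of blocked tasks
```

The stack traces that the sysrq "w" request writes to the kernel log show
which kernel function each blocked task is waiting in, which is the part
worth posting to the list.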

Niccolò

* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
  2016-05-13 12:10                             ` Niccolò Belli
@ 2016-05-13 21:54                               ` Chris Murphy
  0 siblings, 0 replies; 25+ messages in thread
From: Chris Murphy @ 2016-05-13 21:54 UTC (permalink / raw)
  To: Niccolò Belli
  Cc: Austin S. Hemmelgarn, Btrfs BTRFS, Clemens Eisserer,
	Patrik Lundquist, Chris Murphy, Qu Wenruo, Omar Sandoval,
	Zygo Blaxell, Duncan

On Fri, May 13, 2016 at 6:10 AM, Niccolò Belli
<darkbasic@linuxsystems.it> wrote:
> On Friday 13 May 2016 13:35:01 CEST, Austin S. Hemmelgarn wrote:
>>
>> The fact that you're getting an OOPS involving core kernel threads
>> (kswapd) is a pretty good indication that either there's a bug elsewhere in
>> the kernel, or that something is wrong with your hardware.  it's really
>> difficult to be certain if you don't have a reliable test case though.
>
>
> Talking about reliable test cases, I forgot to say that I definitely found
> an interesting one. It doesn't lead to OOPS but perhaps something even more
> interesting. While running countless stress tests I tried running some games
> to stress the system in different ways. I chose openmw (an open source
> engine for Morrowind) and I played it for a while on my second external
> monitor (while I watched some monitoring tools on my first monitor). I
> noticed that after playing a while I *always* lose internet connection (I
> use a USB3 Gigabit Ethernet adapter). This isn't the only thing which
> happens: even if the game keeps running flawlessly and the system *seems* to
> work fine (I can drag windows, open the terminal...) lots of commands simply
> stall (for example mounting a partition, unmounting it, rebooting...). I can
> reliably reproduce it, it ALWAYS happens.

Well, there are a bunch of kernel debug options. If your kernel was
built with CONFIG_SLUB=y and CONFIG_SLUB_DEBUG=y, you can boot with the
slub_debug parameter (bare, or with flag letters such as slub_debug=FZPU)
to enable it, and maybe there'll be something more revealing about the
problems you're having. More aggressive is CONFIG_DEBUG_PAGEALLOC=y, but
it'll slow things down quite noticeably.

And then there are some Btrfs debug options that are set at compile time
and enabled with mount options. But I think the problem you're having
isn't specific to Btrfs or someone else would have run into it.
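
Whether the running kernel was built with those options can be checked
before rebooting. The sketch below assumes the kernel exposes its config as
/proc/config.gz (or that a plain config file is available); check_kconfig
and read_kconfig are hypothetical helper names. As far as I know, the
kernel's SLUB documentation describes the boot parameter as a bare
slub_debug, or slub_debug= followed by flag letters such as F, Z, P and U.

```shell
# Print a kernel config file, transparently handling gzip compression.
read_kconfig() {
    case "$1" in
        *.gz) gzip -cd "$1" ;;
        *)    cat "$1" ;;
    esac
}

# Report whether each given option is set to =y in the config file.
# check_kconfig is a hypothetical helper name.
check_kconfig() {
    cfg=$1
    shift
    for opt in "$@"; do
        if read_kconfig "$cfg" | grep -q "^${opt}=y$"; then
            echo "$opt=y"
        else
            echo "$opt not set"
        fi
    done
}

# Demonstration on a made-up config fragment.
tmp=$(mktemp)
printf '%s\n' 'CONFIG_SLUB=y' 'CONFIG_SLUB_DEBUG=y' \
    '# CONFIG_DEBUG_PAGEALLOC is not set' > "$tmp"
check_kconfig "$tmp" CONFIG_SLUB CONFIG_SLUB_DEBUG CONFIG_DEBUG_PAGEALLOC
rm -f "$tmp"

# On the real system:
#   check_kconfig /proc/config.gz CONFIG_SLUB CONFIG_SLUB_DEBUG CONFIG_DEBUG_PAGEALLOC
```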




-- 
Chris Murphy


Thread overview: 25+ messages
2016-05-04 23:21 btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair Niccolò Belli
2016-05-05  1:07 ` Chris Murphy
2016-05-05 10:36   ` Niccolò Belli
2016-05-05 17:48     ` Omar Sandoval
2016-05-06 11:38       ` Niccolò Belli
2016-05-07 15:45         ` Niccolò Belli
2016-05-07 15:58           ` Clemens Eisserer
2016-05-07 16:11             ` Niccolò Belli
2016-05-08 18:27               ` Patrik Lundquist
2016-05-09 11:52               ` Austin S. Hemmelgarn
2016-05-09 14:53                 ` Niccolò Belli
2016-05-09 16:29                   ` Zygo Blaxell
2016-05-09 18:21                     ` Austin S. Hemmelgarn
2016-05-09 19:18                       ` Duncan
2016-05-12 14:35                     ` Niccolò Belli
2016-05-12 15:43                       ` Austin S. Hemmelgarn
2016-05-13 11:07                         ` Niccolò Belli
2016-05-13 11:35                           ` Austin S. Hemmelgarn
2016-05-13 12:10                             ` Niccolò Belli
2016-05-13 21:54                               ` Chris Murphy
2016-05-12 16:48                       ` Zygo Blaxell
2016-05-09 19:23                   ` Lionel Bouton
2016-05-09 21:30                   ` Chris Murphy
2016-05-07 23:35           ` Chris Murphy
2016-05-05  4:12 ` Qu Wenruo
