* btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
From: Niccolò Belli @ 2016-05-04 23:21 UTC
To: linux-btrfs
I really need your help, because it's the second time btrfs ate my data in
a couple of days and I can't use my laptop if I don't find the culprit.
This was the mail I sent a couple of days ago:
https://www.spinics.net/lists/linux-btrfs/msg54754.html
I previously thought the culprit was a bug in kernel 4.6-rc, but I was
wrong.
Then I reinstalled the whole system (Arch Linux) from scratch, and after
just two days I lost some of my data, again. Once again btrfs check
--repair got stuck in an infinite loop and I can't repair my fs. The system
has always been shut down properly, except for a single time when I had to
forcibly power it off just after boot because there was no signal on the
screen.
First the obvious things:
- memory is ok
(https://drive.google.com/open?id=0Bwe9Wtc-5xF1VnJ0SE9fT1FZMTg)
- disk is ok
(https://drive.google.com/open?id=0Bwe9Wtc-5xF1NGRhd2daVDRJVGc)
- tlp has SATA_LINKPWR_ON_BAT=max_performance
(https://drive.google.com/open?id=0Bwe9Wtc-5xF1dFAwUE5ETVpNWGM)
- rootfs mount options:
rw,noatime,compress=lzo,ssd,discard,space_cache,autodefrag,subvolid=257,subvol=/@
- Command line: BOOT_IMAGE=/@/boot/vmlinuz-linux
root=UUID=4fc2278e-f6e8-4a21-8876-cabbf885bb2e rw rootflags=subvol=@
cryptdevice=/dev/disk/by-uuid/c7c8f501-507c-4bd2-a80a-8c7360651f02:cryptroot:allow-discards
quiet
- scrub didn't find any error:
$ sudo btrfs scrub status /
scrub status for 4fc2278e-f6e8-4a21-8876-cabbf885bb2e
scrub started at Thu May 5 00:57:30 2016 and finished after
00:00:45
total bytes scrubbed: 22.26GiB with 0 errors
I have the whole rootfs encrypted, including boot. I followed these steps:
https://wiki.archlinux.org/index.php/Dm-crypt/Encrypting_an_entire_system#Btrfs_subvolumes_with_swap
Disk is a SAMSUNG SSD PM851 M.2 2280 256GB (Firmware Version: EXT25D0Q).
Laptop is a Dell XPS 13 9343 QHD+.
Distro is Arch Linux, kernel version is 4.5.1. btrfs-progs is 4.5.2.
Two days after the previous data loss I finished reinstalling my distro
from scratch, then decided to do a full backup from a snapshot using tar.
This is what I got while trying to back up my data:
tar: usr/share/kig/icons/hicolor/32x32/actions/test.png: Read error at byte 0, while reading 810 bytes: Input/output error
tar: usr/share/kig/icons/hicolor/32x32/actions/circlebpd.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/pointOnLine.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/bezierN.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/convexhull.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/centerofcurvature.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/en.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/circlebps.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/directrix.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/beziercurves.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/segment_midpoint.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/distance.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/circlebcl.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/conicb5p.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/kig_polygon.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/conicasymptotes.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/pointxy.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/attacher.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/coniclineintersection.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/vectorsum.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/rbezier4.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/ellipsebffp.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/angle.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/kig_text.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/vectordifference.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/segmentaxis.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/radicalline.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/polygonsides.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/projection.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/inversion.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/bezier4.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/equilateralhyperbolab4p.png: Cannot stat: Stale file handle
tar: usr/share/kig/icons/hicolor/32x32/actions/areaCircle.png: Cannot stat: Stale file handle
tar: var/lib/samba/private/msg.sock/666: socket ignored
tar: Exiting with failure status due to previous errors
[ 3057.008185] BTRFS error (device dm-0): parent transid verify failed on
528089088 wanted 3458764513820541211 found 283
[ 3057.008195] BTRFS error (device dm-0): error loading props for ino
183988 (root 505): -5
[ 3057.008417] BTRFS error (device dm-0): parent transid verify failed on
528089088 wanted 3458764513820541211 found 283
[ 3057.008631] BTRFS error (device dm-0): parent transid verify failed on
528089088 wanted 3458764513820541211 found 283
[ 3057.009165] BTRFS error (device dm-0): parent transid verify failed on
528089088 wanted 3458764513820541211 found 283
[ 3057.009389] BTRFS error (device dm-0): parent transid verify failed on
528089088 wanted 3458764513820541211 found 283
[ 3057.009734] BTRFS error (device dm-0): parent transid verify failed on
528089088 wanted 3458764513820541211 found 283
[ 3057.009960] BTRFS error (device dm-0): parent transid verify failed on
528089088 wanted 3458764513820541211 found 283
[ 3057.010664] BTRFS error (device dm-0): parent transid verify failed on
528089088 wanted 3458764513820541211 found 283
[ 3057.010888] BTRFS error (device dm-0): parent transid verify failed on
528089088 wanted 3458764513820541211 found 283
[ 3057.011201] BTRFS error (device dm-0): parent transid verify failed on
528089088 wanted 3458764513820541211 found 283
[ 3331.795474] verify_parent_transid: 57 callbacks suppressed
[ 3331.795480] BTRFS error (device dm-0): parent transid verify failed on
528089088 wanted 3458764513820541211 found 283
[ 3331.795776] BTRFS error (device dm-0): parent transid verify failed on
528089088 wanted 3458764513820541211 found 283
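The repeated kernel messages above can be condensed to see how many distinct tree blocks are involved. What follows is a minimal Python sketch, assuming the log has been saved as plain text; the `summarize` helper and the two-entry `sample` are illustrative and not part of the original mail.

```python
import re
from collections import Counter

# Matches the btrfs message even after mail line-wrapping has been undone.
PATTERN = re.compile(
    r"parent transid verify failed on (\d+) wanted (\d+) found (\d+)"
)

def summarize(log_text: str) -> Counter:
    """Count distinct (bytenr, wanted, found) triples in a kernel log."""
    # Mail clients wrap long lines, so collapse all whitespace before matching.
    flat = " ".join(log_text.split())
    return Counter(PATTERN.findall(flat))

sample = """\
[ 3057.008185] BTRFS error (device dm-0): parent transid verify failed on
528089088 wanted 3458764513820541211 found 283
[ 3057.008417] BTRFS error (device dm-0): parent transid verify failed on
528089088 wanted 3458764513820541211 found 283
"""

for (bytenr, wanted, found), count in summarize(sample).items():
    print(f"block {bytenr}: wanted {wanted}, found {found} ({count} times)")
```

In the excerpt above, every message names the same block, 528089088, with the same wanted and found values.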
I made a copy of /dev/mapper/cryptroot with dd on an external drive and
ran btrfs check on it (btrfs-progs 4.5.2):
https://drive.google.com/open?id=0Bwe9Wtc-5xF1SjJacXpMMU5mems (37MB)
Then I tried to run btrfs check --repair on it, but once again it got stuck
in an infinite loop like this one
(https://www.spinics.net/lists/linux-btrfs/msg54146.html), and after an hour
of looping and several hundred MB of logs I had to kill it. Here is the
log, truncated to 30MB:
https://drive.google.com/open?id=0Bwe9Wtc-5xF1SmRuVUlfeGRES3M
They are probably not needed but here is snapper -c @ list:
https://drive.google.com/open?id=0Bwe9Wtc-5xF1N0llOFpfVXVwNVk
and btrfs subvolume list -p /:
https://drive.google.com/open?id=0Bwe9Wtc-5xF1andCdWZzeV9VbDg
This is the link to the whole gdrive directory with all the logs:
https://drive.google.com/open?id=0Bwe9Wtc-5xF1UFltcXhtRmt4YjA
I really don't know what the problem may be; maybe discard? I would hate to
switch back to ext4 and lose snapshots, transactions, compression,
incremental send/receive backups, etc.
I would really love to be able to do something to fix it, but I don't have
the slightest idea what the problem is. Hopefully someone here will be
smarter than me and find it, otherwise I will have to switch to ext4
because I need my laptop to work.
Thanks,
Niccolò
* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
From: Chris Murphy @ 2016-05-05 1:07 UTC
To: Niccolò Belli; +Cc: Btrfs BTRFS
On Wed, May 4, 2016 at 5:21 PM, Niccolò Belli <darkbasic@linuxsystems.it> wrote:
> rw,noatime,compress=lzo,ssd,discard,space_cache,autodefrag,subvolid=257,subvol=/@
I suggest using defaults for starters. The only thing in that list
that needs to be there is either subvolid or subvol, not both. Add in
the non-default options once you've proven the defaults are working,
and add them one at a time.
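The "one at a time" approach can be sketched as a list of option strings to try in order. This is only an illustration in Python: the `option_steps` helper is hypothetical, and the choice of which options count as the baseline (`rw` and `subvol`) and which to drop (`subvolid`) is an assumption based on the advice above, not something btrfs defines.

```python
def option_steps(current: str, keep: set, drop: set) -> list:
    """Build mount-option strings that add one non-baseline option per step."""
    options = [o for o in current.split(",") if o.split("=")[0] not in drop]
    base = [o for o in options if o.split("=")[0] in keep]
    extras = [o for o in options if o.split("=")[0] not in keep]
    # Step 0 is the baseline; each later step re-adds one more option.
    return [",".join(base + extras[:i]) for i in range(len(extras) + 1)]

# Mount options as reported in the original mail.
current = ("rw,noatime,compress=lzo,ssd,discard,space_cache,"
           "autodefrag,subvolid=257,subvol=/@")

for step in option_steps(current, keep={"rw", "subvol"}, drop={"subvolid"}):
    print(step)
```

The sketch only builds the strings; each one would still have to be tried by hand (for example in fstab) and run long enough to see whether the corruption returns.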
> I have the whole rootfs encrypted, including boot. I followed these steps:
> https://wiki.archlinux.org/index.php/Dm-crypt/Encrypting_an_entire_system#Btrfs_subvolumes_with_swap
>
> Disk is a SAMSUNG SSD PM851 M.2 2280 256GB (Firmware Version: EXT25D0Q).
The firmware is old if I understand the naming scheme used by Dell. It
says EXT49D0Q is current.
http://www.dell.com/support/home/al/en/aldhs1/Drivers/DriversDetails?driverId=0NXHH
If you need to update, you may be best off doing a whole device trim,
which is easiest done with mkfs.btrfs pointed at the whole device. I
wouldn't trust any data on the drive after a firmware update so I'd
start over entirely from scratch, new partition map, new everything.
So the way to do this is:
mkfs.btrfs /dev/sda
wipefs -a /dev/sda
That way the btrfs magic is removed, and now you can partition it,
set up dm-crypt, etc. I advise using all defaults for everything for
now, otherwise it's anyone's guess what you're running into.
Off topic, but at least gmail users see your posts go to spam because
your domain is configured to disallow relaying. Most mail services
ignore this request by the domain but google honors it so no amount of
training will make your email not spam. This is what's in your emails
that's causing the problem:
dmarc=fail (p=QUARANTINE dis=NONE) header.from=linuxsystems.it
http://webmasters.stackexchange.com/questions/76765/sent-emails-pass-spf-and-dkim-but-fail-dmarc-when-received-by-gmail
http://www.pcworld.com/article/2141120/yahoo-email-antispoofing-policy-breaks-mailing-lists.html
--
Chris Murphy
* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
From: Qu Wenruo @ 2016-05-05 4:12 UTC
To: Niccolò Belli, linux-btrfs
Niccolò Belli wrote on 2016/05/05 01:21 +0200:
> I really need your help, because it's the second time btrfs ate my data
> in a couple of days and I can't use my laptop if I don't find the culprit.
>
> This was the mail I sent a couple of days ago:
> https://www.spinics.net/lists/linux-btrfs/msg54754.html
Output in that mail shows obvious tree block corruption:
checksum verify failed on 245498111 found C7652CC3 wanted 00000000
checksum verify failed on 245498111 found C7652CC3 wanted 00000000
checksum verify failed on 245498111 found C7652CC3 wanted 00000000
checksum verify failed on 245498111 found C7652CC3 wanted 00000000
bytenr mismatch, want=245498111, have=8454382400481263616
That's the root cause of the tons of errors that follow.
I assume it may be the same cause this time.
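One way to see why these numbers look wrong is to check the wanted block address against the sector size. This is a minimal Python sketch, assuming the common 4096-byte sector size (the thread does not state it); the `misalignment` helper is illustrative only.

```python
SECTOR_SIZE = 4096  # assumed sector size; not stated in the thread

def misalignment(bytenr: int, sector_size: int = SECTOR_SIZE) -> int:
    """Return how many bytes bytenr lies past the previous sector boundary."""
    return bytenr % sector_size

wanted = 245498111            # "want=" value from the btrfs check output
found = 8454382400481263616   # "have=" value from the same line

print(hex(wanted), misalignment(wanted))  # 0xea200ff 255
print(hex(found))                         # 0x7554000000000000
```

Under that assumption the wanted address is 255 bytes past a sector boundary, which a tree block address should not be, and the value actually read is far larger than any address a 256GB drive would plausibly use.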
> I previously thought the culprit was a bug in kernel 4.6-rc, but I was
> wrong.
>
> Then I reinstalled the whole system (Arch Linux) from scratch, and after
> just two days I lost some of my data, again. Once again btrfs check
> --repair got stuck in an infinite loop and I can't repair my fs. The
> system has always been shut down properly, except for a single time when
> I had to forcibly power it off just after boot because there was no
> signal on the screen.
>
> First the obvious things:
>
> - memory is ok
> (https://drive.google.com/open?id=0Bwe9Wtc-5xF1VnJ0SE9fT1FZMTg)
> - disk is ok
> (https://drive.google.com/open?id=0Bwe9Wtc-5xF1NGRhd2daVDRJVGc)
> - tlp has SATA_LINKPWR_ON_BAT=max_performance
> (https://drive.google.com/open?id=0Bwe9Wtc-5xF1dFAwUE5ETVpNWGM)
> - rootfs mount options:
> rw,noatime,compress=lzo,ssd,discard,space_cache,autodefrag,subvolid=257,subvol=/@
>
> - Command line: BOOT_IMAGE=/@/boot/vmlinuz-linux
> root=UUID=4fc2278e-f6e8-4a21-8876-cabbf885bb2e rw rootflags=subvol=@
> cryptdevice=/dev/disk/by-uuid/c7c8f501-507c-4bd2-a80a-8c7360651f02:cryptroot:allow-discards
> quiet
> - scrub didn't find any error:
> $ sudo btrfs scrub status /
> scrub status for 4fc2278e-f6e8-4a21-8876-cabbf885bb2e
> scrub started at Thu May 5 00:57:30 2016 and finished after
> 00:00:45
> total bytes scrubbed: 22.26GiB with 0 errors
>
> I have the whole rootfs encrypted, including boot. I followed these
> steps:
> https://wiki.archlinux.org/index.php/Dm-crypt/Encrypting_an_entire_system#Btrfs_subvolumes_with_swap
>
Would it be OK for you to test your btrfs on a plain SSD, without
encryption?
I know this suggestion is quite rude, but it would hugely reduce the
number of layers we need to investigate.
And just as Chris Murphy said, reducing the mount options is also a
pretty good starting point for debugging.
>
> Disk is a SAMSUNG SSD PM851 M.2 2280 256GB (Firmware Version: EXT25D0Q).
> Laptop is a Dell XPS 13 9343 QHD+.
> Distro is Arch Linux, kernel version is 4.5.1. btrfs-progs is 4.5.2.
>
> Two days after the previous data loss I finished reinstalling my
> distro from scratch, then decided to do a full backup from a snapshot
> using tar. This is what I got while trying to back up my data:
>
> tar: usr/share/kig/icons/hicolor/32x32/actions/test.png: Read error at byte 0, while reading 810 bytes: Input/output error
> tar: usr/share/kig/icons/hicolor/32x32/actions/circlebpd.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/pointOnLine.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/bezierN.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/convexhull.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/centerofcurvature.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/en.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/circlebps.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/directrix.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/beziercurves.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/segment_midpoint.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/distance.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/circlebcl.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/conicb5p.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/kig_polygon.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/conicasymptotes.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/pointxy.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/attacher.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/coniclineintersection.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/vectorsum.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/rbezier4.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/ellipsebffp.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/angle.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/kig_text.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/vectordifference.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/segmentaxis.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/radicalline.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/polygonsides.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/projection.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/inversion.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/bezier4.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/equilateralhyperbolab4p.png: Cannot stat: Stale file handle
> tar: usr/share/kig/icons/hicolor/32x32/actions/areaCircle.png: Cannot stat: Stale file handle
> tar: var/lib/samba/private/msg.sock/666: socket ignored
> tar: Exiting with failure status due to previous errors
>
>
> [ 3057.008185] BTRFS error (device dm-0): parent transid verify failed
> on 528089088 wanted 3458764513820541211 found 283
The tree blocks are again heavily damaged.
The wanted transid is extremely large, definitely not sane, so the parent
node is already corrupted, although the child transid, 283, seems quite
valid.
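The two transid values in that log line can be compared bit by bit. This is a small Python sketch using only the numbers from the kernel message; it is an observation about the values, not a diagnosis of what corrupted them.

```python
wanted = 3458764513820541211  # "wanted" transid from the kernel log
found = 283                   # "found" transid from the same line

diff = wanted ^ found         # bits that differ between the two values
flipped = [bit for bit in range(64) if (diff >> bit) & 1]

print(hex(wanted))  # 0x300000000000011b
print(hex(found))   # 0x11b
print(hex(diff))    # 0x3000000000000000
print(flipped)      # [60, 61]
```

So the wanted value is exactly 283 with bits 60 and 61 additionally set, which fits the reading that the parent's copy of the transid is the damaged one.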
> [ 3057.008195] BTRFS error (device dm-0): error loading props for ino
> 183988 (root 505): -5
> [ 3057.008417] BTRFS error (device dm-0): parent transid verify failed
> on 528089088 wanted 3458764513820541211 found 283
> [ 3057.008631] BTRFS error (device dm-0): parent transid verify failed
> on 528089088 wanted 3458764513820541211 found 283
> [ 3057.009165] BTRFS error (device dm-0): parent transid verify failed
> on 528089088 wanted 3458764513820541211 found 283
> [ 3057.009389] BTRFS error (device dm-0): parent transid verify failed
> on 528089088 wanted 3458764513820541211 found 283
> [ 3057.009734] BTRFS error (device dm-0): parent transid verify failed
> on 528089088 wanted 3458764513820541211 found 283
> [ 3057.009960] BTRFS error (device dm-0): parent transid verify failed
> on 528089088 wanted 3458764513820541211 found 283
> [ 3057.010664] BTRFS error (device dm-0): parent transid verify failed
> on 528089088 wanted 3458764513820541211 found 283
> [ 3057.010888] BTRFS error (device dm-0): parent transid verify failed
> on 528089088 wanted 3458764513820541211 found 283
> [ 3057.011201] BTRFS error (device dm-0): parent transid verify failed
> on 528089088 wanted 3458764513820541211 found 283
> [ 3331.795474] verify_parent_transid: 57 callbacks suppressed
> [ 3331.795480] BTRFS error (device dm-0): parent transid verify failed
> on 528089088 wanted 3458764513820541211 found 283
> [ 3331.795776] BTRFS error (device dm-0): parent transid verify failed
> on 528089088 wanted 3458764513820541211 found 283
>
> I made a copy of /dev/mapper/cryptroot with dd on an external drive and
> ran btrfs check on it (btrfs-progs 4.5.2):
> https://drive.google.com/open?id=0Bwe9Wtc-5xF1SjJacXpMMU5mems (37MB)
Checked, but it seems the output is truncated?
Thanks,
Qu
>
> Then I tried to run btrfs check --repair on it, but once again it got
> stuck in an infinite loop like this one
> (https://www.spinics.net/lists/linux-btrfs/msg54146.html), and after an
> hour of looping and several hundred MB of logs I had to kill it.
> Here is the log, truncated to 30MB:
> https://drive.google.com/open?id=0Bwe9Wtc-5xF1SmRuVUlfeGRES3M
>
> They are probably not needed but here is snapper -c @ list:
> https://drive.google.com/open?id=0Bwe9Wtc-5xF1N0llOFpfVXVwNVk
> and btrfs subvolume list -p /:
> https://drive.google.com/open?id=0Bwe9Wtc-5xF1andCdWZzeV9VbDg
>
> This is the link to the whole gdrive directory with all the logs:
> https://drive.google.com/open?id=0Bwe9Wtc-5xF1UFltcXhtRmt4YjA
>
> I really don't know what the problem may be; maybe discard? I would hate
> to switch back to ext4 and lose snapshots, transactions, compression,
> incremental send/receive backups, etc.
> I would really love to be able to do something to fix it, but I don't
> have the slightest idea what the problem is. Hopefully someone here
> will be smarter than me and find it, otherwise I will have to
> switch to ext4 because I need my laptop to work.
>
> Thanks,
> Niccolò
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
From: Niccolò Belli @ 2016-05-05 10:36 UTC
To: Btrfs BTRFS; +Cc: Chris Murphy, Qu Wenruo
On Thursday, 5 May 2016 03:07:37 CEST, Chris Murphy wrote:
> I suggest using defaults for starters. The only thing in that list
> that needs to be there is either subvolid or subvol, not both. Add in
> the non-default options once you've proven the defaults are working,
> and add them one at a time.
Yes, I read your previous suggestion and have already dropped subvolid, but
since the problem had already happened I left it in the mail for completeness.
Anyway, the culprit here is genfstab, and that's probably what a beginner is
going to use when installing a distro:
https://wiki.archlinux.org/index.php/beginners'_guide#fstab
>> Disk is a SAMSUNG SSD PM851 M.2 2280 256GB (Firmware Version: EXT25D0Q).
>
> The firmware is old if I understand the naming scheme used by Dell. It
> says EXT49D0Q is current.
>
> http://www.dell.com/support/home/al/en/aldhs1/Drivers/DriversDetails?driverId=0NXHH
According to this
(http://forum.notebookreview.com/threads/2015-xps-13-ssd-fw-problem-with-m-2-samsung-pm851.770501/)
the firmware you linked is for the mSATA version of the drive, not the M.2
one. EXT25D0Q seems to be the very latest one for my drive.
> I advise using all defaults for everything for
> now, otherwise it's anyone's guess what you're running into.
On Thursday, 5 May 2016 06:12:28 CEST, Qu Wenruo wrote:
> Would it be OK for you to test your btrfs on a plain ssd,
> without encryption?
> And just as Chris Murphy said, reducing the mount options is also a
> pretty good starting point for debugging.
Ok, I will remove dm-crypt, discard, compress=lzo and autodefrag and see
what happens.
>> I made a copy of /dev/mapper/cryptroot with dd on an external drive and
>> ran btrfs check on it (btrfs-progs 4.5.2):
>> https://drive.google.com/open?id=0Bwe9Wtc-5xF1SjJacXpMMU5mems (37MB)
>
> Checked, but it seems the output is truncated?
No, I didn't truncate the btrfs check output because it wasn't endless. I
only truncated the repair output.
I also have something new to report. Do you remember when I said that my
screen was black and so I had to forcibly power off the system? Something
similar happened today, and since in the meantime I had enabled the magic
SysRq keys I was able to recover this from the logs (timestamps are in the
Italian locale; "mag" is May):
mag 05 11:55:51 arch-laptop kdeinit5[960]: Registering
"org.kde.StatusNotifierItem-1060-1/StatusNotifierItem" to system tray
mag 05 11:55:51 arch-laptop obexd[1098]: OBEX daemon 5.39
mag 05 11:55:51 arch-laptop dbus-daemon[920]: Successfully activated
service 'org.bluez.obex'
mag 05 11:55:51 arch-laptop systemd[898]: Started Bluetooth OBEX service.
mag 05 11:55:51 arch-laptop korgac[1044]: log_kidentitymanagement:
IdentityManager: There was no default identity. Marking first one as
default.
mag 05 11:55:51 arch-laptop kernel: BUG: unable to handle kernel paging
request at 0000000000017d11
mag 05 11:55:51 arch-laptop kernel: IP: [<ffffffff81194f9f>]
anon_vma_interval_tree_insert+0x3f/0x90
mag 05 11:55:51 arch-laptop kernel: PGD 0
mag 05 11:55:51 arch-laptop kernel: Oops: 0000 [#1] PREEMPT SMP
mag 05 11:55:51 arch-laptop kernel: Modules linked in: rfcomm(+) visor bnep
uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core
videodev media btusb btrtl btbcm btintel cdc_ether bluetooth usbnet r8152
crc16 mii joydev mousedev nvr
mag 05 11:55:51 arch-laptop kernel: mei_me syscopyarea sysfillrect snd
sysimgblt fb_sys_fops i2c_algo_bit shpchp soundcore mei wmi thermal fan
intel_hid sparse_keymap int3403_thermal video processor_thermal_device
dw_dmac snd_soc_sst_acpi snd_soc_sst_m
mag 05 11:55:51 arch-laptop kernel: lrw gf128mul glue_helper ablk_helper
cryptd ahci libahci libata scsi_mod xhci_pci rtsx_pci
mag 05 11:55:51 arch-laptop kernel: Bluetooth: RFCOMM TTY layer initialized
mag 05 11:55:51 arch-laptop kernel: Bluetooth: RFCOMM socket layer
initialized
mag 05 11:55:51 arch-laptop kernel: Bluetooth: RFCOMM ver 1.11
mag 05 11:55:51 arch-laptop kernel: xhci_hcd
mag 05 11:55:51 arch-laptop kernel: i8042 serio sdhci_acpi sdhci led_class
mmc_core pl2303 mos7720 usbserial parport hid_generic usbhid hid usbcore
usb_common
mag 05 11:55:51 arch-laptop kernel: CPU: 0 PID: 351 Comm: systemd-udevd Not
tainted 4.5.1-1-ARCH #1
mag 05 11:55:51 arch-laptop kernel: Hardware name: Dell Inc. XPS 13
9343/0F5KF3, BIOS A07 11/11/2015
mag 05 11:55:51 arch-laptop kernel: task: ffff88021347d580 ti:
ffff880211f8c000 task.ti: ffff880211f8c000
mag 05 11:55:51 arch-laptop kernel: RIP: 0010:[<ffffffff81194f9f>]
[<ffffffff81194f9f>] anon_vma_interval_tree_insert+0x3f/0x90
mag 05 11:55:51 arch-laptop kernel: RSP: 0018:ffff880211f8fd68 EFLAGS:
00010206
mag 05 11:55:51 arch-laptop kernel: RAX: ffff8800da2f4820 RBX:
ffff8800bb59ce40 RCX: ffff8800da2f4830
mag 05 11:55:51 arch-laptop kernel: RDX: ffff8800da2f4828 RSI:
ffff8800374404a0 RDI: ffff8800c58dfa40
mag 05 11:55:51 arch-laptop kernel: RBP: ffff880211f8fdb8 R08:
0000000000017c79 R09: 00000007f55e2059
mag 05 11:55:51 arch-laptop kernel: R10: 00000007f55e2053 R11:
ffff8800c58dfa40 R12: ffff880037440460
mag 05 11:55:51 arch-laptop kernel: R13: ffff8800d9e27100 R14:
ffff8800c58dfa40 R15: ffff880037440460
mag 05 11:55:51 arch-laptop kernel: FS: 00007f55e20537c0(0000)
GS:ffff88021e400000(0000) knlGS:0000000000000000
mag 05 11:55:51 arch-laptop kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
mag 05 11:55:51 arch-laptop kernel: CR2: 0000000000017d11 CR3:
0000000211cd5000 CR4: 00000000003406f0
mag 05 11:55:51 arch-laptop kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
mag 05 11:55:51 arch-laptop kernel: DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400
mag 05 11:55:51 arch-laptop kernel: Stack:
mag 05 11:55:51 arch-laptop kernel: ffffffff811a90c8 0000000000000246
ffff880212d00900 ffff8800bb59ceb8
mag 05 11:55:51 arch-laptop kernel: ffff880212d00978 ffff8800bb59ce40
ffff880212d00900 0000000000000007
mag 05 11:55:51 arch-laptop kernel: 00007f55e2053a90 ffff8800d991e1c0
ffff880211f8fdf0 ffffffff811a9232
mag 05 11:55:51 arch-laptop kernel: Call Trace:
mag 05 11:55:51 arch-laptop kernel: [<ffffffff811a90c8>] ?
anon_vma_clone+0xc8/0x200
mag 05 11:55:51 arch-laptop kernel: [<ffffffff811a9232>]
anon_vma_fork+0x32/0x140
mag 05 11:55:51 arch-laptop kernel: [<ffffffff8107742d>]
copy_process.part.8+0xcdd/0x1890
mag 05 11:55:51 arch-laptop kernel: [<ffffffff8107819f>]
_do_fork+0xcf/0x3c0
mag 05 11:55:51 arch-laptop kernel: [<ffffffff81078539>]
SyS_clone+0x19/0x20
mag 05 11:55:51 arch-laptop kernel: [<ffffffff815ad6ae>]
entry_SYSCALL_64_fastpath+0x12/0x6d
mag 05 11:55:51 arch-laptop kernel: Code: 01 4c 8b 91 98 00 00 00 31 c9 48
c1 e8 0c 4d 8d 4c 02 ff eb 24 4c 3b 48 18 76 04 4c 89 48 18 4c 8b 40 e0 48
8d 48 10 48 8d 50 08 <4d> 3b 90 98 00 00 00 48 0f 42 d1 48 89 c1 48 8b 02
48 85 c0 75
mag 05 11:55:51 arch-laptop kernel: RIP [<ffffffff81194f9f>]
anon_vma_interval_tree_insert+0x3f/0x90
mag 05 11:55:52 arch-laptop kernel: RSP <ffff880211f8fd68>
mag 05 11:55:52 arch-laptop kernel: CR2: 0000000000017d11
mag 05 11:55:52 arch-laptop kernel: ---[ end trace 6a392d6afbffe7f5 ]---
[...]
mag 05 11:55:52 arch-laptop dbus[584]: [system] Activating via systemd:
service name='org.freedesktop.ColorManager' unit='colord.service'
mag 05 11:55:52 arch-laptop kernel: BTRFS critical (device dm-0): unable to
find logical 2330894282579755008 len 4096
mag 05 11:55:52 arch-laptop kernel: ------------[ cut here ]------------
mag 05 11:55:52 arch-laptop kernel: kernel BUG at fs/btrfs/inode.c:1828!
mag 05 11:55:52 arch-laptop kernel: invalid opcode: 0000 [#2] PREEMPT SMP
mag 05 11:55:52 arch-laptop kernel: Modules linked in: rfcomm visor bnep
uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core
videodev media btusb btrtl btbcm btintel cdc_ether bluetooth usbnet r8152
crc16 mii joydev mousedev nvram
mag 05 11:55:52 arch-laptop kernel: mei_me syscopyarea sysfillrect snd
sysimgblt fb_sys_fops i2c_algo_bit shpchp soundcore mei wmi thermal fan
intel_hid sparse_keymap int3403_thermal video processor_thermal_device
dw_dmac snd_soc_sst_acpi snd_soc_sst_m
mag 05 11:55:52 arch-laptop kernel: lrw gf128mul glue_helper ablk_helper
cryptd ahci libahci libata scsi_mod xhci_pci rtsx_pci xhci_hcd i8042 serio
sdhci_acpi sdhci led_class mmc_core pl2303 mos7720 usbserial parport
hid_generic usbhid hid usbcore usb_
mag 05 11:55:52 arch-laptop kernel: CPU: 3 PID: 1028 Comm: plasmashell
Tainted: G D 4.5.1-1-ARCH #1
mag 05 11:55:52 arch-laptop kernel: Hardware name: Dell Inc. XPS 13
9343/0F5KF3, BIOS A07 11/11/2015
mag 05 11:55:52 arch-laptop kernel: task: ffff8800d9e2aac0 ti:
ffff8801f5900000 task.ti: ffff8801f5900000
mag 05 11:55:52 arch-laptop kernel: RIP: 0010:[<ffffffffa02ddabb>]
[<ffffffffa02ddabb>] btrfs_merge_bio_hook+0x8b/0xa0 [btrfs]
mag 05 11:55:52 arch-laptop kernel: RSP: 0018:ffff8801f5903938 EFLAGS:
00010282
mag 05 11:55:52 arch-laptop kernel: RAX: 00000000ffffffea RBX:
0000000000001000 RCX: 0000000000000051
mag 05 11:55:52 arch-laptop kernel: RDX: 0000000000000000 RSI:
ffff88021e58db38 RDI: 0000000000000000
mag 05 11:55:52 arch-laptop kernel: RBP: ffff8801f5903958 R08:
0000000000070aad R09: 0000000000000368
mag 05 11:55:52 arch-laptop kernel: R10: 00102c80000d13e8 R11:
0000000000000368 R12: 0000000000001000
mag 05 11:55:52 arch-laptop kernel: R13: ffff8801e205ee28 R14:
0000000000000000 R15: ffffea000788d580
mag 05 11:55:52 arch-laptop kernel: FS: 00007fe8e688a800(0000)
GS:ffff88021e580000(0000) knlGS:0000000000000000
mag 05 11:55:52 arch-laptop kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
mag 05 11:55:52 arch-laptop kernel: CR2: 00007fe8d14b5cbc CR3:
00000000bf57f000 CR4: 00000000003406e0
mag 05 11:55:52 arch-laptop kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
mag 05 11:55:52 arch-laptop kernel: DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400
mag 05 11:55:52 arch-laptop kernel: Stack:
mag 05 11:55:52 arch-laptop kernel: 0000000000001000 0000000095d6c394
0000000000001000 ffff8801f5903bc0
mag 05 11:55:52 arch-laptop kernel: ffff8801f59039b0 ffffffffa02fbd03
0000000000000000 00102c80000d13e8
mag 05 11:55:52 arch-laptop kernel: 0000002000000000 ffff8800da874040
0000000000000000 ffffea000788d580
mag 05 11:55:52 arch-laptop kernel: Call Trace:
mag 05 11:55:52 arch-laptop kernel: [<ffffffffa02fbd03>]
submit_extent_page+0xc3/0x230 [btrfs]
mag 05 11:55:52 arch-laptop kernel: [<ffffffffa02fd02a>]
__do_readpage+0x3aa/0x990 [btrfs]
mag 05 11:55:52 arch-laptop kernel: [<ffffffffa02fb450>] ?
btrfs_create_repair_bio+0x100/0x100 [btrfs]
mag 05 11:55:52 arch-laptop kernel: [<ffffffffa02d0cf0>] ?
free_root_pointers+0x70/0x70 [btrfs]
mag 05 11:55:52 arch-laptop kernel: [<ffffffffa02fd6f6>]
__extent_read_full_page+0xe6/0x100 [btrfs]
mag 05 11:55:52 arch-laptop kernel: [<ffffffffa02d0cf0>] ?
free_root_pointers+0x70/0x70 [btrfs]
mag 05 11:55:52 arch-laptop kernel: [<ffffffffa02ff489>]
read_extent_buffer_pages+0x179/0x330 [btrfs]
mag 05 11:55:52 arch-laptop kernel: [<ffffffffa02d0cf0>] ?
free_root_pointers+0x70/0x70 [btrfs]
mag 05 11:55:52 arch-laptop kernel: [<ffffffffa02d26fc>]
btree_read_extent_buffer_pages.constprop.19+0xac/0x110 [btrfs]
mag 05 11:55:52 arch-laptop kernel: [<ffffffffa02d2cfd>]
read_tree_block+0x3d/0x70 [btrfs]
mag 05 11:55:52 arch-laptop kernel: [<ffffffffa02b1b49>]
read_block_for_search.isra.14+0x139/0x330 [btrfs]
mag 05 11:55:52 arch-laptop kernel: [<ffffffffa02b72e5>]
btrfs_next_old_leaf+0x245/0x420 [btrfs]
mag 05 11:55:52 arch-laptop kernel: [<ffffffffa02b74d0>]
btrfs_next_leaf+0x10/0x20 [btrfs]
mag 05 11:55:52 arch-laptop kernel: [<ffffffffa02dc564>]
btrfs_real_readdir+0x144/0x5f0 [btrfs]
mag 05 11:55:52 arch-laptop kernel: [<ffffffff81200492>]
iterate_dir+0x92/0x120
mag 05 11:55:52 arch-laptop kernel: [<ffffffff81200939>]
SyS_getdents+0x99/0x110
mag 05 11:55:52 arch-laptop kernel: [<ffffffff812005f0>] ?
fillonedir+0xd0/0xd0
mag 05 11:55:52 arch-laptop kernel: [<ffffffff815ad6ae>]
entry_SYSCALL_64_fastpath+0x12/0x6d
mag 05 11:55:52 arch-laptop kernel: Code: 8b 80 38 fe ff ff 4c 89 65 e0 48
8b 80 f0 01 00 00 48 89 c7 e8 77 ac 02 00 85 c0 78 0e 31 c0 4c 01 e3 48 3b
5d e0 0f 97 c0 eb 9a <0f> 0b e8 5e b1 d9 e0 0f 1f 40 00 66 2e 0f 1f 84 00
00 00 00 00
mag 05 11:55:52 arch-laptop kernel: RIP [<ffffffffa02ddabb>]
btrfs_merge_bio_hook+0x8b/0xa0 [btrfs]
mag 05 11:55:52 arch-laptop kernel: RSP <ffff8801f5903938>
mag 05 11:55:52 arch-laptop kernel: ---[ end trace 6a392d6afbffe7f6 ]---
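For anyone decoding the "unable to find logical" line above: the address 2330894282579755008 is implausibly large for a laptop disk. A minimal sketch of the arithmetic, assuming nothing about btrfs internals (split_logical is a hypothetical helper, not part of btrfs-progs), that splits the value into its top 16 bits and low 48 bits:

```shell
#!/bin/sh
# split_logical is a hypothetical helper, not part of btrfs-progs.
# It prints the top 16 bits and the low 48 bits of a 64-bit value in hex.
split_logical() {
    printf 'high16=0x%x low48=0x%x\n' \
        "$(( ($1 >> 48) & 0xffff ))" \
        "$(( $1 & 0xffffffffffff ))"
}

# Value copied from the "unable to find logical" line above.
split_logical 2330894282579755008
```

This prints high16=0x2059 low48=0x1a27c000. The low part alone (about 418 MiB) would be an unremarkable address, so it may be that only the upper bits were damaged; that is an observation about the number, not a diagnosis.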
On Thursday 5 May 2016 03:07:37 CEST, Chris Murphy wrote:
> Off topic, but at least gmail users see your posts go to spam
> dmarc=fail (p=QUARANTINE dis=NONE) header.from=linuxsystems.it
Thanks for reporting, I changed my dmarc DNS entry from quarantine to none.
I previously used reject and I hoped that quarantine was enough of a middle
ground to survive spam filters, but it seems I will have to get rid of
dmarc altogether.
Thanks,
Niccolò
* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
2016-05-05 10:36 ` Niccolò Belli
@ 2016-05-05 17:48 ` Omar Sandoval
2016-05-06 11:38 ` Niccolò Belli
0 siblings, 1 reply; 25+ messages in thread
From: Omar Sandoval @ 2016-05-05 17:48 UTC (permalink / raw)
To: Niccolò Belli; +Cc: Btrfs BTRFS, Chris Murphy, Qu Wenruo
On Thu, May 05, 2016 at 12:36:52PM +0200, Niccolò Belli wrote:
> On Thursday 5 May 2016 03:07:37 CEST, Chris Murphy wrote:
> > I suggest using defaults for starters. The only thing in that list
> > that needs to be there is either subvolid or subvol, not both. Add in
> > the non-default options once you've proven the defaults are working,
> > and add them one at a time.
>
> Yes I read your previous suggestion and I already dropped subvolid, but
> since the problem already happened I left it in the mail for completeness.
> Anyway the culprit here is genfstab and that's probably what a beginner is
> going to use when installing a distro:
> https://wiki.archlinux.org/index.php/beginners'_guide#fstab
>
The redundant subvolid doesn't hurt; the kernel will just check that it
matches the passed subvol (see [1]). genfstab probably just pulls the
options out of /proc/mounts or /proc/self/mountinfo, and since we show
both, that's how it gets in fstab. If it were actually a problem, there
would be a clear message in dmesg.
1: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bb289b7be62db84b9630ce00367444c810cada2c
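To look at the options as the kernel reports them, the per-filesystem options can be pulled out of a mountinfo line. A minimal sketch, assuming the standard /proc/self/mountinfo layout (mount point in field 5, filesystem options three fields after the lone "-" separator); super_opts is a hypothetical helper name and the sample line is made up for illustration:

```shell
#!/bin/sh
# super_opts is a hypothetical helper: it reads mountinfo-formatted text on
# stdin and prints the filesystem-specific options for the given mount point.
super_opts() {
    awk -v mp="$1" '
        $5 == mp {
            # Optional fields end at a lone "-"; after it come fstype,
            # source, and the per-superblock options.
            for (i = 7; i <= NF; i++)
                if ($i == "-") { print $(i + 3); break }
        }'
}

# Made-up sample line; on a live system use: super_opts / < /proc/self/mountinfo
sample='25 0 0:21 /@ / rw,noatime shared:1 - btrfs /dev/mapper/cryptroot rw,compress=lzo,subvolid=257,subvol=/@'
printf '%s\n' "$sample" | super_opts /
```

For the sample line this prints the option string with both subvolid=257 and subvol=/@ in it, which is the pair that ends up in a generated fstab.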
--
Omar
* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
2016-05-05 17:48 ` Omar Sandoval
@ 2016-05-06 11:38 ` Niccolò Belli
2016-05-07 15:45 ` Niccolò Belli
0 siblings, 1 reply; 25+ messages in thread
From: Niccolò Belli @ 2016-05-06 11:38 UTC (permalink / raw)
To: Btrfs BTRFS; +Cc: Chris Murphy, Qu Wenruo, Omar Sandoval
I formatted the partition and copied the content of my previous rootfs to
it. There is no dmcrypt now and mount options are defaults, except for
noatime. After a single boot I got the very same problem as before (fs
corrupted and an infinite loop when doing btrfs check --repair).
I wanted to replicate the results, so I tried once again, and since then I
have only experienced minor corruption, correctly resolved by repair. But
during a pacman upgrade, which triggered snapper pre-post snapshots, the
system hung and I found this in the logs:
mag 06 10:31:15 arch-laptop plasmashell[873]: requesting unexisting screen
2
mag 06 10:31:18 arch-laptop dbus[418]: [system] Activating service
name='org.opensuse.Snapper' (using servicehelper)
mag 06 10:31:18 arch-laptop dbus[418]: [system] Successfully activated
service 'org.opensuse.Snapper'
mag 06 10:31:20 arch-laptop kernel: ------------[ cut here ]------------
mag 06 10:31:20 arch-laptop kernel: kernel BUG at fs/btrfs/ctree.h:2693!
Still no major corruption found since my second attempt.
Niccolò
* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
2016-05-06 11:38 ` Niccolò Belli
@ 2016-05-07 15:45 ` Niccolò Belli
2016-05-07 15:58 ` Clemens Eisserer
2016-05-07 23:35 ` Chris Murphy
0 siblings, 2 replies; 25+ messages in thread
From: Niccolò Belli @ 2016-05-07 15:45 UTC (permalink / raw)
To: Btrfs BTRFS; +Cc: Chris Murphy, Qu Wenruo, Omar Sandoval
btrfs + dmcrypt + compress=lzo + autodefrag = corruption at first boot
So discard is not the culprit. Will try to remove compress=lzo and
autodefrag and see if it still happens.
[ 748.224346] BTRFS error (device dm-0): memmove bogus src_offset 5431
move len 4294962894 len 16384
[ 748.226206] ------------[ cut here ]------------
[ 748.227831] kernel BUG at fs/btrfs/extent_io.c:5723!
[ 748.229498] invalid opcode: 0000 [#1] PREEMPT SMP
[ 748.231161] Modules linked in: ext4 mbcache jbd2 nls_iso8859_1
nls_cp437 vfat fat snd_hda_codec_hdmi dell_laptop dcdbas dell_wmi
iTCO_wdt iTCO_vendor_support intel_rapl x86_pkg_temp_thermal
intel_powerclamp coretemp kvm_intel arc4 kvm irqbypass psmouse serio_raw
pcspkr elan_i2c snd_soc_ssm4567 snd_soc_rt286 snd_soc_rl6347a
snd_soc_core i2c_hid iwlmvm snd_compress snd_pcm_dmaengine ac97_bus
mac80211 uvcvideo videobuf2_vmalloc btusb videobuf2_memops cdc_ether
btrtl usbnet iwlwifi btbcm videobuf2_v4l2 btintel intel_pch_thermal
videobuf2_core i2c_i801 videodev r8152 rtsx_pci_ms cfg80211 bluetooth
visor media mii memstick joydev evdev mousedev input_leds rfkill mac_hid
crc16 i915 fan thermal wmi dw_dmac int3403_thermal video dw_dmac_core
drm_kms_helper snd_soc_sst_acpi i2c_designware_platform
snd_soc_sst_match
[ 748.237203] snd_hda_intel 8250_dw i2c_designware_core gpio_lynxpoint
spi_pxa2xx_platform drm int3402_thermal snd_hda_codec battery tpm_crb
intel_hid snd_hda_core sparse_keymap fjes snd_hwdep int3400_thermal
acpi_thermal_rel tpm_tis snd_pcm intel_gtt tpm acpi_als syscopyarea
sysfillrect snd_timer sysimgblt fb_sys_fops mei_me i2c_algo_bit
processor_thermal_device kfifo_buf processor snd industrialio acpi_pad
ac int340x_thermal_zone mei intel_soc_dts_iosf button lpc_ich soundcore
shpchp sch_fq_codel ip_tables x_tables btrfs xor raid6_pq
jitterentropy_rng sha256_ssse3 sha256_generic hmac drbg ansi_cprng
algif_skcipher af_alg uas usb_storage dm_crypt dm_mod sd_mod
rtsx_pci_sdmmc atkbd libps2 crct10dif_pclmul crc32_pclmul crc32c_intel
ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper
[ 748.244176] ablk_helper cryptd ahci libahci libata scsi_mod xhci_pci
rtsx_pci xhci_hcd i8042 serio sdhci_acpi sdhci led_class mmc_core pl2303
mos7720 usbserial parport hid_generic usbhid hid usbcore usb_common
[ 748.246662] CPU: 0 PID: 2316 Comm: pacman Not tainted 4.5.1-1-ARCH #1
[ 748.249123] Hardware name: Dell Inc. XPS 13 9343/0F5KF3, BIOS A07
11/11/2015
[ 748.251576] task: ffff8800d9d98e40 ti: ffff8800cec10000 task.ti:
ffff8800cec10000
[ 748.254064] RIP: 0010:[<ffffffffa0300bac>] [<ffffffffa0300bac>]
memmove_extent_buffer+0x10c/0x110 [btrfs]
[ 748.256600] RSP: 0018:ffff8800cec13c18 EFLAGS: 00010246
[ 748.259120] RAX: 0000000000000000 RBX: ffff88020c01ba40 RCX:
0000000000000056
[ 748.261631] RDX: 0000000000000000 RSI: ffff88021e40db38 RDI:
ffff88021e40db38
[ 748.264166] RBP: ffff8800cec13c48 R08: 0000000000000000 R09:
000000000000033b
[ 748.266716] R10: 0000000000000000 R11: 000000000000033b R12:
00000000ffffeece
[ 748.269267] R13: 0000000100000405 R14: 00000001000004c9 R15:
ffff88020c01ba40
[ 748.271818] FS: 00007f14d4271740(0000) GS:ffff88021e400000(0000)
knlGS:0000000000000000
[ 748.274392] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 748.276987] CR2: 0000000001630008 CR3: 00000000cffc8000 CR4:
00000000003406f0
[ 748.279603] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 748.282220] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 748.284815] Stack:
[ 748.287422] 00000000e3438cd2 ffff88020c01ba40 00000000000000c4
000000000000002a
[ 748.290082] 000000000000006b 00000000000003a0 ffff8800cec13ce8
ffffffffa02b612c
[ 748.292754] ffffffffa02b433d ffff8800da9ca820 0000002800000000
ffff8800daa78bd0
[ 748.295441] Call Trace:
[ 748.298104] [<ffffffffa02b612c>] btrfs_del_items+0x33c/0x4a0 [btrfs]
[ 748.300827] [<ffffffffa02b433d>] ? btrfs_search_slot+0x90d/0x990
[btrfs]
[ 748.303564] [<ffffffffa02f3d9c>] ? btrfs_get_token_8+0x6c/0x130
[btrfs]
[ 748.306311] [<ffffffffa02e5ca9>]
btrfs_truncate_inode_items+0x649/0xd20 [btrfs]
[ 748.309071] [<ffffffffa0330b5e>] ?
btrfs_delayed_inode_release_metadata.isra.1+0x4e/0xf0 [btrfs]
[ 748.311860] [<ffffffffa02e7315>] btrfs_evict_inode+0x485/0x5d0
[btrfs]
[ 748.314627] [<ffffffff81207e55>] evict+0xc5/0x190
[ 748.317412] [<ffffffff81208689>] iput+0x1d9/0x260
[ 748.320199] [<ffffffff811fd689>] do_unlinkat+0x199/0x2d0
[ 748.322988] [<ffffffff811fdf66>] SyS_unlink+0x16/0x20
[ 748.325781] [<ffffffff815ad6ae>] entry_SYSCALL_64_fastpath+0x12/0x6d
[ 748.328584] Code: 41 5e 41 5f 5d c3 48 8b 7f 18 48 89 f2 48 c7 c6 40
44 36 a0 e8 06 90 fa ff 0f 0b 48 8b 7f 18 48 c7 c6 08 44 36 a0 e8 f4 8f
fa ff <0f> 0b 66 90 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 48 89 fb
[ 748.331558] RIP [<ffffffffa0300bac>]
memmove_extent_buffer+0x10c/0x110 [btrfs]
[ 748.334473] RSP <ffff8800cec13c18>
[ 748.356077] ---[ end trace 9bfb28800ab52273 ]---
[ 748.359042] note: pacman[2316] exited with preempt_count 2
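The "move len 4294962894" in the first line of this trace is just below 2^32, which is what a small negative 32-bit number looks like when printed as unsigned; R12 in the register dump holds the same value, 0xffffeece. A minimal sketch of the arithmetic (as_signed32 is a hypothetical helper, and reading the value as a wrapped negative length is an interpretation, not something the log states):

```shell
#!/bin/sh
# as_signed32 is a hypothetical helper: reinterpret an unsigned 32-bit
# value as a signed two's-complement number.
as_signed32() {
    if [ "$1" -ge 2147483648 ]; then
        echo "$(( $1 - 4294967296 ))"
    else
        echo "$1"
    fi
}

as_signed32 4294962894            # move len from the BTRFS error line
as_signed32 "$(( 0xffffeece ))"   # R12 from the register dump
```

Both calls print -4402, so the length handed to memmove_extent_buffer looks like a subtraction that went below zero.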
* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
2016-05-07 15:45 ` Niccolò Belli
@ 2016-05-07 15:58 ` Clemens Eisserer
2016-05-07 16:11 ` Niccolò Belli
2016-05-07 23:35 ` Chris Murphy
1 sibling, 1 reply; 25+ messages in thread
From: Clemens Eisserer @ 2016-05-07 15:58 UTC (permalink / raw)
To: Niccolò Belli, linux-btrfs
Hi Niccolo,
> btrfs + dmcrypt + compress=lzo + autodefrag = corruption at first boot
Just out of curiosity: couldn't it be a hardware issue? I have used almost
the same setup (compress-force=lzo instead of compress=lzo) on my
laptop for 2-3 years and haven't experienced any issues since
~kernel-3.14 or so.
Br, Clemens Eisserer
* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
2016-05-07 15:58 ` Clemens Eisserer
@ 2016-05-07 16:11 ` Niccolò Belli
2016-05-08 18:27 ` Patrik Lundquist
2016-05-09 11:52 ` Austin S. Hemmelgarn
0 siblings, 2 replies; 25+ messages in thread
From: Niccolò Belli @ 2016-05-07 16:11 UTC (permalink / raw)
To: linux-btrfs; +Cc: Clemens Eisserer
On 2016-05-07 17:58, Clemens Eisserer wrote:
> Hi Niccolo,
>
>> btrfs + dmcrypt + compress=lzo + autodefrag = corruption at first boot
>
> Just out of curiosity: couldn't it be a hardware issue? I have used almost
> the same setup (compress-force=lzo instead of compress=lzo) on my
> laptop for 2-3 years and haven't experienced any issues since
> ~kernel-3.14 or so.
>
> Br, Clemens Eisserer
Hi,
Which kind of hardware issue? I did a full memtest86 check, a full
smartmontools extended check and even a badblocks -wsv.
If this is really a hardware issue that we can identify I would be more
than happy, because Dell will replace my laptop and this nightmare will
finally be over. I'm open to suggestions.
Niccolò
* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
2016-05-07 15:45 ` Niccolò Belli
2016-05-07 15:58 ` Clemens Eisserer
@ 2016-05-07 23:35 ` Chris Murphy
1 sibling, 0 replies; 25+ messages in thread
From: Chris Murphy @ 2016-05-07 23:35 UTC (permalink / raw)
To: Niccolò Belli; +Cc: Btrfs BTRFS, Chris Murphy, Qu Wenruo, Omar Sandoval
On Sat, May 7, 2016 at 9:45 AM, Niccolò Belli <darkbasic@linuxsystems.it> wrote:
> btrfs + dmcrypt + compress=lzo + autodefrag = corruption at first boot
> So discard is not the culprit. Will try to remove compress=lzo and
> autodefrag and see if it still happens.
You're making the troubleshooting unnecessarily difficult by
continuing to use non-default options. *shrug*
Every single layer you add complicates the setup and troubleshooting.
Of course all of it should work together, and for many people it does. But you're
the one having the problem so in order to demonstrate whether this is
a software bug or hardware problem, you need to test it with the most
basic setup possible --> btrfs on plain partitions and default mount
options.
--
Chris Murphy
* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
2016-05-07 16:11 ` Niccolò Belli
@ 2016-05-08 18:27 ` Patrik Lundquist
2016-05-09 11:52 ` Austin S. Hemmelgarn
1 sibling, 0 replies; 25+ messages in thread
From: Patrik Lundquist @ 2016-05-08 18:27 UTC (permalink / raw)
To: Niccolò Belli; +Cc: linux-btrfs
On 7 May 2016 at 18:11, Niccolò Belli <darkbasic@linuxsystems.it> wrote:
> Which kind of hardware issue? I did a full memtest86 check, a full smartmontools extended check and even a badblocks -wsv.
> If this is really a hardware issue that we can identify I would be more than happy, because Dell will replace my laptop and this nightmare will finally be over. I'm open to suggestions.
Well, your hardware differs from a lot of successful installations.
Are you using any power management tweaks?
* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
2016-05-07 16:11 ` Niccolò Belli
2016-05-08 18:27 ` Patrik Lundquist
@ 2016-05-09 11:52 ` Austin S. Hemmelgarn
2016-05-09 14:53 ` Niccolò Belli
1 sibling, 1 reply; 25+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-09 11:52 UTC (permalink / raw)
To: Niccolò Belli, linux-btrfs; +Cc: Clemens Eisserer
On 2016-05-07 12:11, Niccolò Belli wrote:
> On 2016-05-07 17:58, Clemens Eisserer wrote:
>> Hi Niccolo,
>>
>>> btrfs + dmcrypt + compress=lzo + autodefrag = corruption at first boot
>>
>> Just out of curiosity: couldn't it be a hardware issue? I have used almost
>> the same setup (compress-force=lzo instead of compress=lzo) on my
>> laptop for 2-3 years and haven't experienced any issues since
>> ~kernel-3.14 or so.
>>
>> Br, Clemens Eisserer
>
> Hi,
> Which kind of hardware issue? I did a full memtest86 check, a full
> smartmontools extended check and even a badblocks -wsv.
> If this is really a hardware issue that we can identify I would be more
> than happy, because Dell will replace my laptop and this nightmare will
> finally be over. I'm open to suggestions.
First, some general advice:
1. It is fully possible to have bad RAM that still passes memtest86
consistently, and in fact, most of the time this will be the case (if
you're seeing anything other than the bit-fade test in memtest86 fail,
then your system probably won't boot fully). Memtest doesn't replicate
typical usage patterns very well. My usual testing for RAM involves not
just memtest, but also booting into a LiveCD (usually SystemRescueCD),
pulling down a copy of the kernel source, and then running as many
concurrent kernel builds as cores, each with as many make jobs as cores
(so if you've got a quad core CPU (or a dual core with hyperthreading),
it would be running 4 builds with -j4 passed to make). GCC seems to
have memory usage patterns that reliably trigger memory errors that
aren't caught by memtest, so this generally gives good results.
Secondarily, if it's a big system and I am not pressed for time, I do a
quick Gentoo install with Xen, and then spin up twice as many Xen VMs
as cores and run memtest in those concurrently (this seems to catch
things a bit more reliably than just a plain memtest).
2. On a similar note, badblocks doesn't replicate filesystem-like access
patterns, it just runs sequentially through the entire disk. This isn't
as likely to give bad results, but it's still important to know. In
particular, try running it over a dmcrypt volume a couple of times
(preferably with a different key each time, pulling keys from
/dev/urandom works well for this), as that will result in writing
different data. For what it's worth, when I'm doing initial testing of
new disks, I always use ddrescue to copy /dev/zero over the whole disk,
then do it twice through dmcrypt with different keys, copying from the
disk to /dev/null after each pass. This gives random data on disk as a
starting point (which is good if you're going to use dmcrypt), and
usually triggers reallocation of any bad sectors as early as possible.
If I have time and access to an existing system I can connect the disk
to, I often do testing with fio as well.
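The RAM stress test from point 1 can be sketched as a small script. This is only an outline under stated assumptions: KSRC is a hypothetical path to a clean, unpacked kernel tree, the build directories under /tmp are arbitrary, and by default the script only prints the commands (set DRY_RUN=0 to actually run them).

```shell
#!/bin/sh
# Sketch: start as many kernel builds as there are cores, each with -j<cores>.
# KSRC and the /tmp/kbuild-N output directories are placeholders.
KSRC=${KSRC:-/usr/src/linux}
DRY_RUN=${DRY_RUN:-1}

# Print the command line for one out-of-tree build.
build_cmd() {   # build_cmd SRC OUTDIR JOBS
    printf 'make -C %s O=%s defconfig && make -C %s O=%s -j%s\n' \
        "$1" "$2" "$1" "$2" "$3"
}

cores=$(getconf _NPROCESSORS_ONLN)
i=1
while [ "$i" -le "$cores" ]; do
    cmd=$(build_cmd "$KSRC" "/tmp/kbuild-$i" "$cores")
    if [ "$DRY_RUN" = 1 ]; then
        echo "$cmd"
    else
        mkdir -p "/tmp/kbuild-$i"
        # Each build runs in the background with its own log file.
        sh -c "$cmd" > "/tmp/kbuild-$i.log" 2>&1 &
    fi
    i=$((i + 1))
done
wait    # returns immediately in dry-run mode
```

On a 4-thread CPU this gives 4 builds with -j4 each. Unexplained compiler crashes in the logs on known-good source would be the thing to look for.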
Now, to slightly more specific advice:
1. If you have an eSATA port, try plugging your hard disk in there and
see if things work. If that works but having the hard drive plugged in
internally doesn't, then the issue is probably either that specific SATA
port (in which case your chip-set is bad and you should get a new
system), or the SATA connector itself (or the wiring, but that's not as
likely when it's traces on a PCB). Normally I'd suggest just swapping
cables and SATA ports, but that's not really possible with a laptop.
2. If you have access to a reasonably large flash drive, or to a USB to
SATA adapter, try that as well. If it works on that but not internally
(or on an eSATA port), you've probably got a bad SATA controller, and
should get a new system.
3. Try things without dmcrypt. Adding extra layers makes it harder to
determine what is actually wrong. If it works without dmcrypt, try
using different parameters for the encryption (different ciphers is what
I would try first). If it works reliably without dmcrypt, then it's
either a bug in dmcrypt (which I don't think is very likely), or it's
bad interaction between dmcrypt and BTRFS. If it works with some
encryption parameters but not others, then that will help narrow down
where the issue is.
* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
2016-05-09 11:52 ` Austin S. Hemmelgarn
@ 2016-05-09 14:53 ` Niccolò Belli
2016-05-09 16:29 ` Zygo Blaxell
` (2 more replies)
0 siblings, 3 replies; 25+ messages in thread
From: Niccolò Belli @ 2016-05-09 14:53 UTC (permalink / raw)
To: linux-btrfs
Cc: Clemens Eisserer, Austin S. Hemmelgarn, Patrik Lundquist,
Chris Murphy, Qu Wenruo, Omar Sandoval
On Sunday 8 May 2016 20:27:55 CEST, Patrik Lundquist wrote:
> Are you using any power management tweaks?
Yes, as stated in my very first post I use TLP with
SATA_LINKPWR_ON_BAT=max_performance, but I managed to reproduce the bug
even without TLP. Also, for the past week I've always been on AC.
On Monday 9 May 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
> Memtest doesn't replicate typical usage patterns very well. My
> usual testing for RAM involves not just memtest, but also
> booting into a LiveCD (usually SystemRescueCD), pulling down a
> copy of the kernel source, and then running as many concurrent
> kernel builds as cores, each with as many make jobs as cores (so
> if you've got a quad core CPU (or a dual core with
> hyperthreading), it would be running 4 builds with -j4 passed to
> make). GCC seems to have memory usage patterns that reliably
> trigger memory errors that aren't caught by memtest, so this
> generally gives good results.
Building the kernel with 4 concurrent threads is not an issue for my system;
in fact I compile a lot and I have never had any issue.
On Monday 9 May 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
> On a similar note, badblocks doesn't replicate filesystem-like
> access patterns, it just runs sequentially through the entire
> disk. This isn't as likely to give bad results, but it's still
> important to know. In particular, try running it over a dmcrypt
> volume a couple of times (preferably with a different key each
> time, pulling keys from /dev/urandom works well for this), as
> that will result in writing different data. For what it's
> worth, when I'm doing initial testing of new disks, I always use
> ddrescue to copy /dev/zero over the whole disk, then do it twice
> through dmcrypt with different keys, copying from the disk to
> /dev/null after each pass. This gives random data on disk as a
> starting point (which is good if you're going to use dmcrypt),
> and usually triggers reallocation of any bad sectors as early as
> possible.
While trying to find a common denominator for my issue I did lots of
backups of /dev/mapper/cryptroot and I restored them into
/dev/mapper/cryptroot dozens of times (triggering a 150GB+ random data
write every time), without any issue (after restoring the backup I always
check the partition with btrfs check). So the disk doesn't seem to be the
culprit.
On Monday 9 May 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
> 1. If you have an eSATA port, try plugging your hard disk in
> there and see if things work. If that works but having the hard
> drive plugged in internally doesn't, then the issue is probably
> either that specific SATA port (in which case your chip-set is
> bad and you should get a new system), or the SATA connector
> itself (or the wiring, but that's not as likely when it's traces
> on a PCB). Normally I'd suggest just swapping cables and SATA
> ports, but that's not really possible with a laptop.
> 2. If you have access to a reasonably large flash drive, or to
> a USB to SATA adapter, try that as well, if it works on that but
> not internally (or on an eSATA port), you've probably got a bad
> SATA controller, and should get a new system.
My laptop doesn't have an eSATA port and my only big enough external drive
is currently used for daily backups, since I fear data loss.
On Monday 9 May 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
> 3. Try things without dmcrypt. Adding extra layers makes it
> harder to determine what is actually wrong. If it works without
> dmcrypt, try using different parameters for the encryption
> (different ciphers is what I would try first). If it works
> reliably without dmcrypt, then it's either a bug in dmcrypt
> (which I don't think is very likely), or it's bad interaction
> between dmcrypt and BTRFS. If it works with some encryption
> parameters but not others, then that will help narrow down where
> the issue is.
On Sunday 8 May 2016 01:35:16 CEST, Chris Murphy wrote:
> You're making the troubleshooting unnecessarily difficult by
> continuing to use non-default options. *shrug*
>
> Every single layer you add complicates the setup and troubleshooting.
> Of course all of it should work together, and for many people it does. But you're
> the one having the problem so in order to demonstrate whether this is
> a software bug or hardware problem, you need to test it with the most
> basic setup possible --> btrfs on plain partitions and default mount
> options.
I will try to recap because you obviously missed my previous e-mail: I
managed to replicate the irrecoverable corruption bug even with default
options and no dmcrypt at all. It was somewhat harder to replicate with
default options, so I started to play with different combinations to find
out whether something increases the chances of getting corruption. I have
the feeling that "autodefrag" increases the chances of corruption, but I'm
not 100% sure about it. In any case, triggering a reinstall of all packages
with "pacaur -S $(pacman -Qe)" gives a high chance of irrecoverable
corruption. That command simply extracts the tarballs from the cache and
overwrites the already installed files. It doesn't write much data (after
reinstallation my system is still quite small, just a few GBs), but it seems
to be enough to upset the filesystem.
To avoid losing my data, every time I power on or reboot my laptop I first
boot from an external drive and run btrfs check on /dev/mapper/cryptroot. If
it's still sane I back up /dev/mapper/cryptroot to an external SSD with
dd; otherwise I restore the previous copy from the SSD onto
/dev/mapper/cryptroot.
I can't keep up such an annoying workflow for long, so I really
hope someone will manage to track the bug down soon.
Thanks for your help, I really appreciate it.
Niccolò
* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
2016-05-09 14:53 ` Niccolò Belli
@ 2016-05-09 16:29 ` Zygo Blaxell
2016-05-09 18:21 ` Austin S. Hemmelgarn
2016-05-12 14:35 ` Niccolò Belli
2016-05-09 19:23 ` Lionel Bouton
2016-05-09 21:30 ` Chris Murphy
2 siblings, 2 replies; 25+ messages in thread
From: Zygo Blaxell @ 2016-05-09 16:29 UTC (permalink / raw)
To: Niccolò Belli
Cc: linux-btrfs, Clemens Eisserer, Austin S. Hemmelgarn,
Patrik Lundquist, Chris Murphy, Qu Wenruo, Omar Sandoval
On Mon, May 09, 2016 at 04:53:13PM +0200, Niccolò Belli wrote:
> While trying to find a common denominator for my issue I did lots of backups
> of /dev/mapper/cryptroot and I restored them into /dev/mapper/cryptroot
> dozens of times (triggering a 150GB+ random data write every time), without
> any issue (after restoring the backup I always check the partition with btrfs
> check). So disk doesn't seem to be the culprit.
Did you also check the data matches the backup? btrfs check will only
look at the metadata, which is 0.1% of what you've copied. From what
you've written, there should be a lot of errors in the data too. If you
have incorrect data but btrfs scrub finds no incorrect checksums, then
your storage layer is probably fine and we have to look at CPU, host RAM,
and software as possible culprits.
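Neither btrfs check nor scrub compares the restored device against the image
it was restored from. A minimal sketch of such a comparison follows, assuming
both sides can be opened as plain files (an image file and a block device both
qualify). The function name, chunk size, and paths are illustrative and do
not come from this thread.

```python
CHUNK = 1024 * 1024  # compare 1 MiB at a time (arbitrary choice)


def first_mismatches(path_a, path_b, limit=10, chunk=CHUNK):
    """Return the byte offsets of up to `limit` chunks that differ.

    A size difference shows up as a mismatch in the last chunk read.
    """
    bad = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while len(bad) < limit:
            a = fa.read(chunk)
            b = fb.read(chunk)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                bad.append(offset)
            offset += chunk
    return bad
```

Called as first_mismatches("/path/to/backup.img", "/dev/mapper/cryptroot")
(hypothetical paths, run as root from a rescue system with the filesystem
unmounted), an empty list would mean the two copies are byte-identical. The
standard cmp utility does the same job from the shell.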
The logs you've posted so far indicate that bad metadata (e.g. negative
item lengths, nonsense transids in metadata references but sane transids
in the referred pages) is getting into otherwise valid and well-formed
btrfs metadata pages. Since these pages are protected by checksums,
the corruption can't be originating in the storage layer--if it was, the
pages should be rejected as they are read from disk, before btrfs even
looks at them, and the insane transid should be the "found" one not the
"expected" one. That suggests there is either RAM corruption happening
_after_ the data is read from disk (i.e. while the pages are cached in
RAM), or a severe software bug in the kernel you're running.
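The argument above can be illustrated with a toy model. This is only a sketch
under loud assumptions: a "page" here is a 4-byte checksum followed by a
payload, and zlib.crc32 (plain CRC-32) stands in for the crc32c that btrfs
actually uses. None of the names below are btrfs APIs.

```python
import zlib


def write_page(payload: bytes) -> bytes:
    """Checksum the payload as it is in RAM right now and build a page."""
    return zlib.crc32(payload).to_bytes(4, "little") + payload


def read_page(page: bytes) -> bytes:
    """Verify the stored checksum; reject the page if it does not match."""
    stored = int.from_bytes(page[:4], "little")
    payload = page[4:]
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch: page rejected at read time")
    return payload


def flip_bit(data: bytes, index: int) -> bytes:
    """Return a copy of `data` with the lowest bit of byte `index` flipped."""
    out = bytearray(data)
    out[index] ^= 0x01
    return bytes(out)
```

If the damage happens on the storage side, read_page(flip_bit(page, 8)) raises
and the bad page never reaches the filesystem code. If the damage happens to
the cached payload in RAM, the next write_page() computes a fresh checksum
over the already-corrupted bytes, so the page later reads back as valid. That
second case matches the symptom described here: well-formed, correctly
checksummed pages with nonsense inside.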
Try different kernel versions (e.g. 4.4.9 or 4.1.23) in case whoever
maintains your kernel had a bad day and merged a patch they should
not have.
Try a minimal configuration with as few drivers as possible loaded,
especially GPU drivers and anything from the staging subdirectory--when
these drivers have bugs, they ruin everything.
Try memtest86+ which has a few more/different tests than memtest86.
I have encountered RAM modules that pass memtest86 but fail memtest86+
and vice versa.
Try memtester, a memory tester that runs as a Linux process, so it can
detect corruption caused when device drivers spray data randomly into RAM,
or when the CPU thermal controls are influenced by Linux (an overheating
CPU-to-RAM bridge can really ruin your day, and some of the dumber laptop
designs rely on the OS for thermal management).
Try running more than one memory testing process, in case there is a bug
in your hardware that affects interactions between multiple cores (memtest
is single-threaded). You can run memtest86 inside a kvm (e.g. kvm
-m 3072 -kernel /boot/memtest86.bin) to detect these kinds of issues.
Kernel compiles are a bad way to test RAM. I've successfully built
kernels on hosts with known RAM failures. The kernels don't always work
properly, but it's quite rare to see a build fail outright.
> [...]I have the feeling that "autodefrag" increases the
> chances of corruption, but I'm not 100% sure about it. In any case,
> triggering a reinstall of all packages with "pacaur -S $(pacman -Qe)" gives
> a high chance of irrecoverable corruption. That command
> simply extracts the tarballs from the cache and overwrites the already
> installed files. It doesn't write much data (after reinstallation my
> system is still quite small, just a few GBs), but it seems to be enough to
> upset the filesystem.
pacman probably does a lot of fsync() which will do a lot of metadata
tree updates. autodefrag triples the I/O load for fragmented files and
most of that extra load is metadata tree writes. Both will make the
symptoms of your problem worse.
* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
2016-05-09 16:29 ` Zygo Blaxell
@ 2016-05-09 18:21 ` Austin S. Hemmelgarn
2016-05-09 19:18 ` Duncan
2016-05-12 14:35 ` Niccolò Belli
1 sibling, 1 reply; 25+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-09 18:21 UTC (permalink / raw)
To: Zygo Blaxell, Niccolò Belli
Cc: linux-btrfs, Clemens Eisserer, Patrik Lundquist, Chris Murphy,
Qu Wenruo, Omar Sandoval
On 2016-05-09 12:29, Zygo Blaxell wrote:
> On Mon, May 09, 2016 at 04:53:13PM +0200, Niccolò Belli wrote:
>> While trying to find a common denominator for my issue I did lots of backups
>> of /dev/mapper/cryptroot and I restored them into /dev/mapper/cryptroot
>> dozens of times (triggering a 150GB+ random data write every time), without
>> any issue (after restoring the backup I always check the partition with btrfs
>> check). So disk doesn't seem to be the culprit.
>
> Did you also check the data matches the backup? btrfs check will only
> look at the metadata, which is 0.1% of what you've copied. From what
> you've written, there should be a lot of errors in the data too. If you
> have incorrect data but btrfs scrub finds no incorrect checksums, then
> your storage layer is probably fine and we have to look at CPU, host RAM,
> and software as possible culprits.
This is a good point.
>
> The logs you've posted so far indicate that bad metadata (e.g. negative
> item lengths, nonsense transids in metadata references but sane transids
> in the referred pages) is getting into otherwise valid and well-formed
> btrfs metadata pages. Since these pages are protected by checksums,
> the corruption can't be originating in the storage layer--if it was, the
> pages should be rejected as they are read from disk, before btrfs even
> looks at them, and the insane transid should be the "found" one not the
> "expected" one. That suggests there is either RAM corruption happening
> _after_ the data is read from disk (i.e. while the pages are cached in
> RAM), or a severe software bug in the kernel you're running.
>
> Try different kernel versions (e.g. 4.4.9 or 4.1.23) in case whoever
> maintains your kernel had a bad day and merged a patch they should
> not have.
>
> Try a minimal configuration with as few drivers as possible loaded,
> especially GPU drivers and anything from the staging subdirectory--when
> these drivers have bugs, they ruin everything.
>
> Try memtest86+ which has a few more/different tests than memtest86.
> I have encountered RAM modules that pass memtest86 but fail memtest86+
> and vice versa.
>
> Try memtester, a memory tester that runs as a Linux process, so it can
> detect corruption caused when device drivers spray data randomly into RAM,
> or when the CPU thermal controls are influenced by Linux (an overheating
> CPU-to-RAM bridge can really ruin your day, and some of the dumber laptop
> designs rely on the OS for thermal management).
>
> Try running more than one memory testing process, in case there is a bug
> in your hardware that affects interactions between multiple cores (memtest
> is single-threaded). You can run memtest86 inside a kvm (e.g. kvm
> -m 3072 -kernel /boot/memtest86.bin) to detect these kinds of issues.
>
> Kernel compiles are a bad way to test RAM. I've successfully built
> kernels on hosts with known RAM failures. The kernels don't always work
> properly, but it's quite rare to see a build fail outright.
My original suggestion that prompted that part of the comment was to run
a bunch of concurrent kernel builds, each with as many jobs as CPUs (so on
a quad-core system, run a dozen or so concurrently with make -j4). I only
use kernel builds because the kernel is a big project with essentially zero
build dependencies; if I had the patience and space (and a LiveCD with the
right tools and packages installed), I'd probably use something like
LibreOffice or Chromium instead. I don't use this as my sole test (I also
use several other tools), but I find that it does a particularly good job
of exercising things that memtest doesn't. I also don't just make sure the
builds succeed, but check that the compiled kernel images all match,
because if there's bad RAM the resulting images will often differ in some
way (I had forgotten to mention this bit).
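The "do all the images match" check could be scripted roughly as follows.
This is a sketch, not the procedure actually used here: the function names
are made up, and it assumes the builds were set up to be reproducible (for
example with a fixed KBUILD_BUILD_TIMESTAMP), since otherwise the images
differ even on healthy hardware.

```python
import hashlib
from collections import defaultdict


def sha256_of(path, chunk=1024 * 1024):
    """Hash a file in chunks so large images do not need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def group_by_digest(paths):
    """Map each distinct digest to the list of files that produced it."""
    groups = defaultdict(list)
    for p in paths:
        groups[sha256_of(p)].append(p)
    return dict(groups)
```

With one image path per concurrent build passed in, a result with more than
one key means at least one build produced different output and the hardware
deserves suspicion.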
This practice evolved out of the fact that the only bad RAM I've ever
dealt with either completely failed to POST (which can have all kinds of
interesting symptoms if it's just one module: some motherboards refuse to boot,
some report the error, others just disable the module and act like
nothing happened), or passed all the memory testing tools I threw at it
(memtest86, memtest86+, memtester, concurrent memtest86 invocations from
Xen domains, inventive acrobatics with tmpfs and FIO, etc), but failed
under heavy concurrent random access, which can be reliably produced by
running a bunch of big software builds at the same time with the CPU
insanely over-committed. I could probably produce a similar workload
with tmpfs and FIO, but it's a lot quicker and easier to remember how to
do a kernel build than it is to remember the complex incantations needed
to get FIO to do anything interesting.
* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
2016-05-09 18:21 ` Austin S. Hemmelgarn
@ 2016-05-09 19:18 ` Duncan
0 siblings, 0 replies; 25+ messages in thread
From: Duncan @ 2016-05-09 19:18 UTC (permalink / raw)
To: linux-btrfs
Austin S. Hemmelgarn posted on Mon, 09 May 2016 14:21:57 -0400 as
excerpted:
> This practice evolved out of the fact that the only bad RAM I've ever
> dealt with either completely failed to POST (which can have all kinds of
> interesting symptoms if it's just one module: some motherboards refuse to boot,
> some report the error, others just disable the module and act like
> nothing happened), or passed all the memory testing tools I threw at it
> (memtest86, memtest86+, memtester, concurrent memtest86 invocations from
> Xen domains, inventive acrobatics with tmpfs and FIO, etc), but failed
> under heavy concurrent random access, which can be reliably produced by
> running a bunch of big software builds at the same time with the CPU
> insanely over-committed.
My (likely much more limited) experience matches yours.
Though FWIW, in my case I did find that one of the more common memory
failure indicators was bz2-ed tarball decompression, where the tarball
would fail its decompression checksum safety checks. However, that most
reliably happened in the context of a heavily loaded system doing other
package builds in parallel to the package tarball extraction that failed.
In my case, I even had ECC RAM, but it was apparently just slightly out
of spec for its labeled and internally configured memory speeds (PC3200
DDR1 at the time), at least on my hardware. Once I got a BIOS update
that let me, I slightly downclocked the memory (to PC3000, IIRC), and it
was absolutely solid, no more errors, even with tightened up wait-state
timings. Later I upgraded RAM, and the new RAM worked just fine at the
same PC3200 speeds that were a problem for the older RAM.
The problem was apparently that while the RAM cells that memtest checks
were fine, it was testing in an otherwise calm environment (not much
choice since you can only boot to the test directly and can't do anything
else at the same time), without all the other stuff going on in the
hectic environment of a multi-package parallel build, that apparently
happened to occasionally trigger the edge-case that would corrupt things.
And FWIW, I still have major respect for how well reiserfs behaved under
those conditions. No filesystem can be expected to be 100% reliable when
it's getting corrupted data due to bad memory, but reiserfs held up
remarkably well, far better than btrfs did under similar conditions (but
then with the PCI and SATA bus) a few years later, forcing me back to
reiserfs for a time, which again, continued to work like a champ, even
under hardware conditions that were absolutely unworkable with btrfs. I
had a heat-related (AC went out, in Phoenix, in the summer, 40+ C
outside, 50+C inside, who knows what the disks were!?) head crash on a
disk too, where the partitions that were mounted and likely had the head
flying over them were damaged beyond (easy) recovery, but other
partitions on the same disk were absolutely fine, and I actually
continued to run off them for a few months after cooling everything back
down. That sort of experience is the reason I still use reiserfs on
spinning rust, including my second and third level backups, even while
I'm running btrfs on the ssds for the working system and primary backup.
It's also the reason I continue to use a partitioned system with multiple
independent filesystems (btrfs raid1 on a pair of ssds for most of the
working btrfs and primary backups, individual ssd btrfs in dup mode for
/boot, and its backup on the other ssd), instead of putting my data eggs
all in the same filesystem basket with subvolumes, where if the
filesystem goes out all the subvolumes go with it!
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
2016-05-09 14:53 ` Niccolò Belli
2016-05-09 16:29 ` Zygo Blaxell
@ 2016-05-09 19:23 ` Lionel Bouton
2016-05-09 21:30 ` Chris Murphy
2 siblings, 0 replies; 25+ messages in thread
From: Lionel Bouton @ 2016-05-09 19:23 UTC (permalink / raw)
To: Niccolò Belli, linux-btrfs
Cc: Clemens Eisserer, Austin S. Hemmelgarn, Patrik Lundquist,
Chris Murphy, Qu Wenruo, Omar Sandoval
Hi,
On 09/05/2016 16:53, Niccolò Belli wrote:
> On Sunday 8 May 2016 20:27:55 CEST, Patrik Lundquist wrote:
>> Are you using any power management tweaks?
>
> Yes, as stated in my very first post I use TLP with
> SATA_LINKPWR_ON_BAT=max_performance, but I managed to reproduce the
> bug even without TLP. Also in the past week I've always been on AC.
>
> On Monday 9 May 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
>> Memtest doesn't replicate typical usage patterns very well. My usual
>> testing for RAM involves not just memtest, but also booting into a
>> LiveCD (usually SystemRescueCD), pulling down a copy of the kernel
>> source, and then running as many concurrent kernel builds as cores,
>> each with as many make jobs as cores (so if you've got a quad core
>> CPU (or a dual core with hyperthreading), it would be running 4
>> builds with -j4 passed to make). GCC seems to have memory usage
>> patterns that reliably trigger memory errors that aren't caught by
>> memtest, so this generally gives good results.
>
> Building kernel with 4 concurrent threads is not an issue for my
> system, in fact I do compile a lot and I never had any issue.
Note: I once had a server which would pass memtest86 and repeated
kernel compilations maxing out the CPU threads but couldn't at the same
time reliably compile a kernel and copy large amounts of data.
I think I lost my little automated test suite (I should definitely look
for it again or code it from scratch) but what I did on new servers
since that time was:
1/ create a file larger than the system's RAM (this makes sure you will
read and write all data from disk and not only caches and might catch
controller hardware problems too) with dd if=/dev/urandom (several
gigabytes of random data exercise many different patterns, far more than
what memtest86 would test), compute its md5 checksum
2/ launch a subprocess repeatedly compiling the kernel with more jobs
than available CPU threads and stopping as soon as the make exit code
was != 0.
3/ launch another subprocess repeatedly copying the random file to
another location and exiting when the md5 checksum didn't match the source.
Let it run as a burn-in test for as long as you can afford (from
experience after 24 hours if it's still running the probability that the
test will find a problem becomes negligible).
If either subprocess stops by itself, your hardware is not stable.
This actually caught a few unstable systems before they could go into
production for me.
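Steps 1/ and 3/ above could be sketched like this. It is a reconstruction
under assumptions, not the lost test suite: the function names, chunk size,
and round count are illustrative, and the kernel-compile loop of step 2/ has
to run alongside it as a separate process.

```python
import hashlib
import os
import shutil


def md5_of(path, chunk=1024 * 1024):
    """Read the file back in chunks and return its MD5 hex digest."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def make_random_file(path, size, chunk=1024 * 1024):
    """Step 1/: write `size` bytes of random data, then checksum the file."""
    remaining = size
    with open(path, "wb") as f:
        while remaining > 0:
            block = os.urandom(min(chunk, remaining))
            f.write(block)
            remaining -= len(block)
    return md5_of(path)


def copy_and_verify(src, dst, expected_md5, rounds):
    """Step 3/: copy repeatedly; return the first failing round, else None."""
    for i in range(rounds):
        shutil.copyfile(src, dst)
        if md5_of(dst) != expected_md5:
            return i
    return None
```

For a real burn-in, `size` should exceed the machine's RAM as described above,
and `rounds` should be large enough to keep the loop busy for the whole test
window. A return value other than None means a copy came back different from
the source.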
Lionel
* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
2016-05-09 14:53 ` Niccolò Belli
2016-05-09 16:29 ` Zygo Blaxell
2016-05-09 19:23 ` Lionel Bouton
@ 2016-05-09 21:30 ` Chris Murphy
2 siblings, 0 replies; 25+ messages in thread
From: Chris Murphy @ 2016-05-09 21:30 UTC (permalink / raw)
To: Niccolò Belli
Cc: Btrfs BTRFS, Clemens Eisserer, Austin S. Hemmelgarn,
Patrik Lundquist, Chris Murphy, Qu Wenruo, Omar Sandoval
On Mon, May 9, 2016 at 8:53 AM, Niccolò Belli <darkbasic@linuxsystems.it> wrote:
> I can't keep up such an annoying workflow for long, so I really hope
> someone will manage to track the bug down soon.
I suggest perseverance :) despite how tedious this is. Btrfs is more
aware of its state than other file systems, so if you give up and go
to ext4 it's entirely possible corruption is still happening but you
won't know it until there's a lot more damage. At the least, if you
have to give up, I'd suggest XFS; make sure you're using xfsprogs 3.2.3
or newer, which will make a V5 file system that uses metadata
checksumming by default.
--
Chris Murphy
* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
2016-05-09 16:29 ` Zygo Blaxell
2016-05-09 18:21 ` Austin S. Hemmelgarn
@ 2016-05-12 14:35 ` Niccolò Belli
2016-05-12 15:43 ` Austin S. Hemmelgarn
2016-05-12 16:48 ` Zygo Blaxell
1 sibling, 2 replies; 25+ messages in thread
From: Niccolò Belli @ 2016-05-12 14:35 UTC (permalink / raw)
To: linux-btrfs
Cc: Clemens Eisserer, Austin S. Hemmelgarn, Patrik Lundquist,
Chris Murphy, Qu Wenruo, Omar Sandoval, Zygo Blaxell, ahferroin7,
1i5t5.duncan
On Monday 9 May 2016 18:29:41 CEST, Zygo Blaxell wrote:
> Did you also check the data matches the backup? btrfs check will only
> look at the metadata, which is 0.1% of what you've copied. From what
> you've written, there should be a lot of errors in the data too. If you
> have incorrect data but btrfs scrub finds no incorrect checksums, then
> your storage layer is probably fine and we have to look at CPU, host RAM,
> and software as possible culprits.
>
> The logs you've posted so far indicate that bad metadata (e.g. negative
> item lengths, nonsense transids in metadata references but sane transids
> in the referred pages) is getting into otherwise valid and well-formed
> btrfs metadata pages. Since these pages are protected by checksums,
> the corruption can't be originating in the storage layer--if it was, the
> pages should be rejected as they are read from disk, before btrfs even
> looks at them, and the insane transid should be the "found" one not the
> "expected" one. That suggests there is either RAM corruption happening
> _after_ the data is read from disk (i.e. while the pages are cached in
> RAM), or a severe software bug in the kernel you're running.
When doing the btrfs check I also always do a btrfs scrub and it never
found any error. Once it didn't manage to finish the scrub because of:
BTRFS critical (device dm-0): corrupt leaf, slot offset bad:
block=670597120,root=1, slot=6
and btrfs scrub status reported "was aborted after 00:00:10".
Speaking of scrub, I created a systemd timer to run scrub hourly and I
noticed that 2 *uncorrectable* errors suddenly appeared on my system. So I
immediately re-ran the scrub just to confirm it, then I rebooted into
the Arch live USB and ran btrfs check: the metadata was perfect. So I
ran btrfs scrub from the live USB and there were no errors at all! I
rebooted into my system and ran scrub once again, and the uncorrectable
errors were really gone! This has happened twice in the past few days.
> Try different kernel versions (e.g. 4.4.9 or 4.1.23) in case whoever
> maintains your kernel had a bad day and merged a patch they should
> not have.
Almost no patches get applied by the Arch kernel team:
https://git.archlinux.org/svntogit/packages.git/tree/trunk?h=packages/linux
At the moment the only one is a harmless
"change-default-console-loglevel.patch".
> Try a minimal configuration with as few drivers as possible loaded,
> especially GPU drivers and anything from the staging subdirectory--when
> these drivers have bugs, they ruin everything.
The Arch kernel team is quite conservative regarding staging/experimental
features; I remember they rejected some config patches I submitted because
of this.
Anyway I will try to blacklist as many kernel modules as I can. Maybe
blacklisting GPU is too much because if I can't actually use my laptop it
will be much more difficult to reproduce the issue.
> Try memtest86+ which has a few more/different tests than memtest86.
> I have encountered RAM modules that pass memtest86 but fail memtest86+
> and vice versa.
>
> Try memtester, a memory tester that runs as a Linux process, so it can
> detect corruption caused when device drivers spray data randomly into RAM,
> or when the CPU thermal controls are influenced by Linux (an overheating
> CPU-to-RAM bridge can really ruin your day, and some of the dumber laptop
> designs rely on the OS for thermal management).
>
> Try running more than one memory testing process, in case there is a bug
> in your hardware that affects interactions between multiple cores (memtest
> is single-threaded). You can run memtest86 inside a kvm (e.g. kvm
> -m 3072 -kernel /boot/memtest86.bin) to detect these kinds of issues.
>
> Kernel compiles are a bad way to test RAM. I've successfully built
> kernels on hosts with known RAM failures. The kernels don't always work
> properly, but it's quite rare to see a build fail outright.
I didn't use memtest86+ because of the lack of EFI support, but I just
tried the shiny new memtest86 7.0 beta with improved tests for 12+ hours
without issues.
I also ran "memtester 4G" and "systester-cli -gausslg 64M -threads 4
-turns 100000" together for 12 hours without any issue, so I think both my
RAM and CPU are OK.
I can think of only two possible culprits now (correct me if I'm wrong):
1) A btrfs bug
2) Another module messing things up
I can do nothing about btrfs bugs, so I will try to hunt down the second option.
This is the list of modules I'm running:
lsmod | awk '$4 == ""' | awk '{print $1}' | sort
8250_dw
ac
acpi_als
acpi_pad
aesni_intel
ahci
algif_skcipher
ansi_cprng
arc4
atkbd
battery
bnep
btrfs
btusb
cdc_ether
cmac
coretemp
crc32c_intel
crc32_pclmul
crct10dif_pclmul
dell_laptop
dell_wmi
dm_crypt
drbg
ecb
elan_i2c
evdev
ext4
fan
fjes
ghash_clmulni_intel
gpio_lynxpoint
hid_generic
hid_multitouch
hmac
i2c_designware_platform
i2c_hid
i2c_i801
i915
input_leds
int3400_thermal
int3402_thermal
int3403_thermal
intel_hid
intel_pch_thermal
intel_powerclamp
intel_rapl
ip_tables
iTCO_wdt
iwlmvm
jitterentropy_rng
joydev
kvm_intel
lpc_ich
mac_hid
mei_me
mos7720
mousedev
msr
nls_cp437
nls_iso8859_1
nvram
pcspkr
pl2303
processor
processor_thermal_device
psmouse
r8152
rfcomm
rtsx_pci_ms
rtsx_pci_sdmmc
sch_fq_codel
sdhci_acpi
sd_mod
serio_raw
sha256_ssse3
shpchp
snd_hda_codec_hdmi
snd_hda_intel
snd_soc_ssm4567
snd_soc_sst_acpi
snd_soc_sst_broadwell
spi_pxa2xx_platform
thermal
tpm_crb
tpm_tis
uas
usbhid
uvcvideo
vfat
visor
x86_pkg_temp_thermal
xhci_pci
I will try to blacklist as many as I can while still keeping a somewhat
usable system and see if I can reproduce it. If I can no longer
reproduce it, then the hunt will begin. It will not be a fun one,
as I already experienced with hid-multitouch, which gave me random kernel
hangs at boot ONLY if loaded early into the initramfs:
https://bugzilla.kernel.org/show_bug.cgi?id=105251
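If blacklisting does make the corruption go away, the hunt could be organized
as a bisection over the module list. The following is only a sketch with
strong assumptions: exactly one module is responsible, it misbehaves on its
own, and the hypothetical `reproduces` callback (a stand-in for "boot with
only these modules and stress the filesystem") gives a reliable answer. With
an intermittent bug that last assumption is the weak point, and module
dependencies are ignored entirely.

```python
def bisect_culprit(modules, reproduces):
    """Narrow a module list down to a single suspect by halving.

    `reproduces(loaded)` must return True if the bug appears when only the
    modules in `loaded` are loaded. Assumes a single, independent culprit
    that is present in `modules`.
    """
    candidates = list(modules)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if reproduces(half):
            candidates = half  # culprit is in the first half
        else:
            candidates = candidates[len(candidates) // 2:]  # must be in the rest
    return candidates[0] if candidates else None
```

For roughly 90 modules this needs about seven rounds, but each round means
running long enough to be confident that a clean result is really clean.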
Another option would be crushing it under my car's wheels, hoping that thanks
to my comprehensive insurance policy Dell will give me the next model (the
Skylake one) as a replacement (and hoping that it will not suffer from the
same issue as the Broadwell one).
Thanks,
Niccolò
* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
2016-05-12 14:35 ` Niccolò Belli
@ 2016-05-12 15:43 ` Austin S. Hemmelgarn
2016-05-13 11:07 ` Niccolò Belli
2016-05-12 16:48 ` Zygo Blaxell
1 sibling, 1 reply; 25+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-12 15:43 UTC (permalink / raw)
To: Niccolò Belli, linux-btrfs
Cc: Clemens Eisserer, Patrik Lundquist, Chris Murphy, Qu Wenruo,
Omar Sandoval, Zygo Blaxell, 1i5t5.duncan
On 2016-05-12 10:35, Niccolò Belli wrote:
> On Monday 9 May 2016 18:29:41 CEST, Zygo Blaxell wrote:
>> Did you also check the data matches the backup? btrfs check will only
>> look at the metadata, which is 0.1% of what you've copied. From what
>> you've written, there should be a lot of errors in the data too. If you
>> have incorrect data but btrfs scrub finds no incorrect checksums, then
>> your storage layer is probably fine and we have to look at CPU, host RAM,
>> and software as possible culprits.
>>
>> The logs you've posted so far indicate that bad metadata (e.g. negative
>> item lengths, nonsense transids in metadata references but sane transids
>> in the referred pages) is getting into otherwise valid and well-formed
>> btrfs metadata pages. Since these pages are protected by checksums,
>> the corruption can't be originating in the storage layer--if it was, the
>> pages should be rejected as they are read from disk, before btrfs even
>> looks at them, and the insane transid should be the "found" one not the
>> "expected" one. That suggests there is either RAM corruption happening
>> _after_ the data is read from disk (i.e. while the pages are cached in
>> RAM), or a severe software bug in the kernel you're running.
>
> When doing the btrfs check I also always do a btrfs scrub and it never
> found any error. Once it didn't manage to finish the scrub because of:
> BTRFS critical (device dm-0): corrupt leaf, slot offset bad:
> block=670597120,root=1, slot=6
> and btrfs scrub status reported "was aborted after 00:00:10".
>
> Speaking of scrub, I created a systemd timer to run scrub hourly and I
> noticed that 2 *uncorrectable* errors suddenly appeared on my system. So I
> immediately re-ran the scrub just to confirm it, then I rebooted into
> the Arch live USB and ran btrfs check: the metadata was perfect. So
> I ran btrfs scrub from the live USB and there were no errors at all!
> I rebooted into my system and ran scrub once again, and the
> uncorrectable errors were really gone! This has happened twice in the
> past few days.
This would indicate to me that you've either got bad RAM (most likely),
or some other hardware component is not working correctly. It's not
unusual for hardware issues to be intermittent.
>
>> Try different kernel versions (e.g. 4.4.9 or 4.1.23) in case whoever
>> maintains your kernel had a bad day and merged a patch they should
>> not have.
>
> Almost no patches get applied by the Arch kernel team:
> https://git.archlinux.org/svntogit/packages.git/tree/trunk?h=packages/linux
> At the moment the only one is a harmless
> "change-default-console-loglevel.patch".
>
>> Try a minimal configuration with as few drivers as possible loaded,
>> especially GPU drivers and anything from the staging subdirectory--when
>> these drivers have bugs, they ruin everything.
>
> Arch kernel team is quite conservative regarding staging/experimental
> features, I remember they rejected some config patches I submitted
> because of this.
> Anyway I will try to blacklist as many kernel modules as I can. Maybe
> blacklisting GPU is too much because if I can't actually use my laptop
> it will be much more difficult to reproduce the issue.
Disable the GPU driver, but make sure you have the VGA_CONSOLE config
enabled, and you should be fine (you'll just get an 80x25 text-mode
console instead of a high-resolution one).
>
>> Try memtest86+ which has a few more/different tests than memtest86.
>> I have encountered RAM modules that pass memtest86 but fail memtest86+
>> and vice versa.
>>
>> Try memtester, a memory tester that runs as a Linux process, so it can
>> detect corruption caused when device drivers spray data randomly into RAM,
>> or when the CPU thermal controls are influenced by Linux (an overheating
>> CPU-to-RAM bridge can really ruin your day, and some of the dumber laptop
>> designs rely on the OS for thermal management).
>>
>> Try running more than one memory testing process, in case there is a bug
>> in your hardware that affects interactions between multiple cores (memtest
>> is single-threaded). You can run memtest86 inside a kvm (e.g. kvm
>> -m 3072 -kernel /boot/memtest86.bin) to detect these kinds of issues.
>>
>> Kernel compiles are a bad way to test RAM. I've successfully built
>> kernels on hosts with known RAM failures. The kernels don't always work
>> properly, but it's quite rare to see a build fail outright.
>
> I didn't use memtest86+ because of the lack of EFI support, but I just
> tried the shiny new memtest86 7.0 beta with improved tests for 12+ hours
> without issues.
> I also ran "memtester 4G" and "systester-cli -gausslg 64M -threads 4
> -turns 100000" together for 12 hours without any issue, so I think both
> my RAM and CPU are OK.
That's probably a good indication of the CPU and the MB being OK, but
not necessarily the RAM. There are two other possible options for testing
the RAM that haven't been mentioned yet though (which I hadn't thought
of myself until now):
1. If you have access to Windows, try the Windows Memory Diagnostic.
This runs yet another slightly different set of tests from memtest86 and
memtest86+, so it may catch issues they don't. You can start this
directly on an EFI system by loading /EFI/Microsoft/Boot/MEMTEST.EFI
from the EFI system partition.
2. This is a Dell system. If you still have the utility partition which
Dell ships all their pre-provisioned systems with, that should have a
hardware diagnostics tool. I doubt that this will find anything (it's
part of their QA procedure AFAICT), but it's probably worth trying, as
the memory testing in that uses yet another slightly different
implementation of the typical tests. You can usually find this in the
boot interrupt menu accessed by hitting F12 before the boot-loader loads.
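Whichever tester is used, the earlier suggestion of running more than one memory-testing process at once can be sketched roughly as below. This is only an illustration: memtester_cmds is a hypothetical helper name (not part of memtester), and the 4096 MiB / 4 workers split is an assumed example that should be sized to leave room for the rest of the system.

```shell
#!/bin/sh
# Sketch: split a memory budget across several memtester processes.
# memtester_cmds is a hypothetical helper, not something memtester ships.

# Print one "memtester <size>M <loops>" command line per worker.
# $1 = total MiB to test, $2 = number of workers, $3 = loops per worker
memtester_cmds() {
    total_mib=$1
    workers=$2
    loops=$3
    per_worker=$((total_mib / workers))   # integer division; any remainder is not tested
    i=0
    while [ "$i" -lt "$workers" ]; do
        echo "memtester ${per_worker}M ${loops}"
        i=$((i + 1))
    done
}

# Example (run as root so memtester can lock its pages in RAM):
#   memtester_cmds 4096 4 1 | xargs -P 4 -I{} sh -c '{}'
```

xargs -P runs the generated commands concurrently and returns once all of them have finished, so a failure in any worker shows up in that worker's own output.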
>
> I can think only about two possible culprits now (correct me if I'm wrong):
> 1) A btrfs bug
> 2) Another module screwing things around
It could still be the disk (not likely, but possible) or the storage
controller. If you have a spare disk, I'd suggest trying with that
(assuming of course it doesn't void your warranty).
>
> I can do nothing about btrfs bugs so I will try to hunt the second
> option. This is the list of modules I'm running:
>
> lsmod | awk '$4 == ""' | awk '{print $1}' | sort
>
> 8250_dw
> ac
> acpi_als
> acpi_pad
> aesni_intel
> ahci
> algif_skcipher
> ansi_cprng
> arc4
> atkbd
> battery
> bnep
> btrfs
> btusb
> cdc_ether
> cmac
> coretemp
> crc32c_intel
> crc32_pclmul
> crct10dif_pclmul
> dell_laptop
> dell_wmi
> dm_crypt
> drbg
> ecb
> elan_i2c
> evdev
> ext4
> fan
> fjes
> ghash_clmulni_intel
> gpio_lynxpoint
> hid_generic
> hid_multitouch
> hmac
> i2c_designware_platform
> i2c_hid
> i2c_i801
> i915
> input_leds
> int3400_thermal
> int3402_thermal
> int3403_thermal
> intel_hid
> intel_pch_thermal
> intel_powerclamp
> intel_rapl
> ip_tables
> iTCO_wdt
> iwlmvm
> jitterentropy_rng
> joydev
> kvm_intel
> lpc_ich
> mac_hid
> mei_me
> mos7720
> mousedev
> msr
> nls_cp437
> nls_iso8859_1
> nvram
> pcspkr
> pl2303
> processor
> processor_thermal_device
> psmouse
> r8152
> rfcomm
> rtsx_pci_ms
> rtsx_pci_sdmmc
> sch_fq_codel
> sdhci_acpi
> sd_mod
> serio_raw
> sha256_ssse3
> shpchp
> snd_hda_codec_hdmi
> snd_hda_intel
> snd_soc_ssm4567
> snd_soc_sst_acpi
> snd_soc_sst_broadwell
> spi_pxa2xx_platform
> thermal
> tpm_crb
> tpm_tis
> uas
> usbhid
> uvcvideo
> vfat
> visor
> x86_pkg_temp_thermal
> xhci_pci
>
> I will try to blacklist as many as I can while still keeping a somewhat
> usable system and see if I can reproduce it. If I am no longer able to
> reproduce it, then the hunt will begin. It will not be a fun
> one, as I already experienced with hid-multitouch, which gave me random
> kernel hangs at boot ONLY if loaded early into the initramfs:
> https://bugzilla.kernel.org/show_bug.cgi?id=105251
Based on what you've got listed for modules, I'd expect the absolute
minimum for a usable test system to be:
ac
acpi_als (you can probably remove this, it's for the ambient light sensor)
acpi_pad
ahci
atkbd
battery
btrfs
coretemp
dell_laptop
dell_wmi
elan_i2c
evdev
ext4
fan
gpio_lynxpoint
hid_generic
hid_multitouch
i2c_i801
i915 (this is your GPU module, you should still have a usable text
console if this isn't loaded)
int3400_thermal
int3402_thermal
int3403_thermal
intel_hid
intel_pch_thermal
intel_powerclamp
intel_rapl
ip_tables (if you have no firewall configured, you can safely
blacklist this)
iwlmvm (you might try removing this, but you will have no wifi without it)
lpc_ich
mousedev
nvram (you might be able to remove this, I don't remember if the dell
modules depend on it or not)
processor
processor_thermal_device
psmouse
r8152 (you can try removing this too, but you will have no ethernet
without it)
sch_fq_codel
serio_raw
spi_pxa2xx_platform
thermal
usbhid
vfat (if you avoid mounting your EFI system partition, you can
probably pull this out)
x86_pkg_temp_thermal
xhci_pci
Note that this assumes you aren't testing on dmcrypt. Make absolutely
certain though that you don't remove any of the *thermal modules, the
fan module, or the dell modules; not having those may result in
hardware damage.
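One way to turn a "keep" list like the one above into a modprobe.d blacklist is sketched here. It is a rough illustration only: blacklist_lines, loaded.txt, keep.txt and test-blacklist.conf are hypothetical names, and the keep file is assumed to hold bare module names, one per line, without the parenthetical notes.

```shell
#!/bin/sh
# Sketch: emit "blacklist <module>" for every loaded module not on the keep list.
# blacklist_lines is a hypothetical helper name.

# $1 = file with loaded module names, $2 = file with module names to keep
blacklist_lines() {
    # -v: invert, -x: whole-line match, -F: fixed strings, -f: patterns from file
    grep -vxFf "$2" "$1" | sort -u | sed 's/^/blacklist /'
}

# Example, using the same "no users" filter as the lsmod command in this thread:
#   lsmod | awk 'NR > 1 && $4 == "" {print $1}' > loaded.txt
#   blacklist_lines loaded.txt keep.txt > /etc/modprobe.d/test-blacklist.conf
```

A blacklist entry in modprobe.d only stops automatic, alias-based loading. A module that is pulled in as a dependency, loaded explicitly, or included in the initramfs can still end up loaded, so it is worth re-checking lsmod after the reboot.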
* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
2016-05-12 14:35 ` Niccolò Belli
2016-05-12 15:43 ` Austin S. Hemmelgarn
@ 2016-05-12 16:48 ` Zygo Blaxell
1 sibling, 0 replies; 25+ messages in thread
From: Zygo Blaxell @ 2016-05-12 16:48 UTC (permalink / raw)
To: Niccolò Belli
Cc: linux-btrfs, Clemens Eisserer, Austin S. Hemmelgarn,
Patrik Lundquist, Chris Murphy, Qu Wenruo, Omar Sandoval,
1i5t5.duncan
On Thu, May 12, 2016 at 04:35:24PM +0200, Niccolò Belli wrote:
> When doing the btrfs check I also always do a btrfs scrub and it never found
> any error. Once it didn't manage to finish the scrub because of:
> BTRFS critical (device dm-0): corrupt leaf, slot offset bad:
> block=670597120,root=1, slot=6
> and btrfs scrub status reported "was aborted after 00:00:10".
>
> Talking about scrub, I created a systemd timer to run scrub hourly and I
> noticed 2 *uncorrectable* errors suddenly appeared on my system. So I
> immediately re-ran the scrub just to confirm it and then I rebooted into the
> Arch live usb and ran btrfs check: the metadata were perfect. So I ran
> btrfs scrub from the live usb and there were no errors at all! I rebooted
> into my system and ran scrub once again and the uncorrectable errors
> were really gone! It happened two times in the past few days.
That's what a RAM corruption problem looks like when you run btrfs scrub.
Maybe the RAM itself is OK, but *something* is scribbling on it.
Does the Arch live usb use the same kernel as your normal system?
> Almost no patches get applied by the Arch kernel team:
> https://git.archlinux.org/svntogit/packages.git/tree/trunk?h=packages/linux
> At the moment the only one is a harmless
> "change-default-console-loglevel.patch".
Did you try an older (or newer) kernel? I've been running 4.5.x on a few
canary systems, but so far none of them have survived more than a day.
Contrast with 4.1.x and 4.4.x, which run for months between reboots
for me. Maybe there's a regression in 4.5.x, maybe I did something
wrong in my config or build, or maybe I just have too few data points
to draw any conclusions, but my data so far is telling me to stay on
4.4.x until something changes (i.e. wait for a 4.5.x stable update or
skip directly to 4.6.x). :-/
It's always worth trying this if only to eliminate regression as a
possible root cause early. In practice, every mainline kernel release
has a regression that affects at least one combination of config options
and hardware. btrfs is stable enough now that you can be running one
or two releases behind to avoid a problem elsewhere in the kernel.
> Another option will be crashing it with my car's wheels hoping that because
> of my comprehensive insurance policy Dell will give me the next model (the
> Skylake one) as a replacement (hoping that it will not suffer from the same
> issue of the Broadwell one).
The first rule of Insurance Fraud Club: don't talk about Insurance
Fraud Club. ;)
It's possible there's a problem that affects only very specific chipsets.
You seem to have eliminated RAM in isolation, but there could be a problem
in the kernel that affects only your chipset.
* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
2016-05-12 15:43 ` Austin S. Hemmelgarn
@ 2016-05-13 11:07 ` Niccolò Belli
2016-05-13 11:35 ` Austin S. Hemmelgarn
0 siblings, 1 reply; 25+ messages in thread
From: Niccolò Belli @ 2016-05-13 11:07 UTC (permalink / raw)
To: Austin S. Hemmelgarn
Cc: linux-btrfs, Clemens Eisserer, Patrik Lundquist, Chris Murphy,
Qu Wenruo, Omar Sandoval, Zygo Blaxell, 1i5t5.duncan
On Thursday 12 May 2016 17:43:38 CEST, Austin S. Hemmelgarn wrote:
> That's probably a good indication of the CPU and the MB being
> OK, but not necessarily the RAM. There's two other possible
> options for testing the RAM that haven't been mentioned yet
> though (which I hadn't thought of myself until now):
> 1. If you have access to Windows, try the Windows Memory
> Diagnostic. This runs yet another slightly different set of
> tests from memtest86 and memtest86+, so it may catch issues they
> don't. You can start this directly on an EFI system by loading
> /EFI/Microsoft/Boot/MEMTEST.EFI from the EFI system partition.
> 2. This is a Dell system. If you still have the utility
> partition which Dell ships all their pre-provisioned systems
> with, that should have a hardware diagnostics tool. I doubt
> that this will find anything (it's part of their QA procedure
> AFAICT), but it's probably worth trying, as the memory testing
> in that uses yet another slightly different implementation of
> the typical tests. You can usually find this in the boot
> interrupt menu accessed by hitting F12 before the boot-loader
> loads.
I tried the Dell System Test, including the enhanced optional RAM tests, and
it was fine. I also tried the Microsoft one, which passed. BUT if I select
the advanced test in the Microsoft one it always stops at 21% of the first
test. The test menus are still working, but the fans get quiet and it keeps
writing "test running... 21%" forever. I tried it many times and it always
got stuck at 21%, so I suspect a test suite bug instead of a RAM failure.
I also noticed some other interesting behaviours: while I was running the
usual scrub+check (both were fine) from the livecd I noticed this in dmesg:
[ 261.301159] BTRFS info (device dm-0): bdev /dev/mapper/cryptroot errs:
wr 0, rd 0, flush 0, corrupt 4, gen 0
Corrupt? But both scrub and check were fine... I double checked scrub and
check and they were still fine.
This is what happened another time:
https://drive.google.com/open?id=0Bwe9Wtc-5xF1dGtPaWhTZ0w5aUU
I was making a backup of my partition USING DD from the livecd. It wasn't
even mounted if I recall correctly!
On Thursday 12 May 2016 18:48:17 CEST, Zygo Blaxell wrote:
> That's what a RAM corruption problem looks like when you run btrfs scrub.
> Maybe the RAM itself is OK, but *something* is scribbling on it.
>
> Does the Arch live usb use the same kernel as your normal system?
Yes, except for the point release (the system is slightly ahead of the
liveusb).
On Thursday 12 May 2016 18:48:17 CEST, Zygo Blaxell wrote:
> Did you try an older (or newer) kernel? I've been running 4.5.x on a few
> canary systems, but so far none of them have survived more than a day.
No (except for point releases from 4.5.0 to 4.5.4), but I will try 4.4.
On Thursday 12 May 2016 18:48:17 CEST, Zygo Blaxell wrote:
> It's possible there's a problem that affects only very specific chipsets.
> You seem to have eliminated RAM in isolation, but there could be a problem
> in the kernel that affects only your chipset.
Funny considering it is sold as a Linux laptop. Unfortunately they only
tested it with the ancient Ubuntu 14.04.
Niccolò
* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
2016-05-13 11:07 ` Niccolò Belli
@ 2016-05-13 11:35 ` Austin S. Hemmelgarn
2016-05-13 12:10 ` Niccolò Belli
0 siblings, 1 reply; 25+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-13 11:35 UTC (permalink / raw)
To: Niccolò Belli
Cc: linux-btrfs, Clemens Eisserer, Patrik Lundquist, Chris Murphy,
Qu Wenruo, Omar Sandoval, Zygo Blaxell, 1i5t5.duncan
On 2016-05-13 07:07, Niccolò Belli wrote:
> On Thursday 12 May 2016 17:43:38 CEST, Austin S. Hemmelgarn wrote:
>> That's probably a good indication of the CPU and the MB being OK, but
>> not necessarily the RAM. There's two other possible options for
>> testing the RAM that haven't been mentioned yet though (which I hadn't
>> thought of myself until now):
>> 1. If you have access to Windows, try the Windows Memory Diagnostic.
>> This runs yet another slightly different set of tests from memtest86
>> and memtest86+, so it may catch issues they don't. You can start this
>> directly on an EFI system by loading /EFI/Microsoft/Boot/MEMTEST.EFI
>> from the EFI system partition.
>> 2. This is a Dell system. If you still have the utility partition
>> which Dell ships all their pre-provisioned systems with, that should
>> have a hardware diagnostics tool. I doubt that this will find
>> anything (it's part of their QA procedure AFAICT), but it's probably
>> worth trying, as the memory testing in that uses yet another slightly
>> different implementation of the typical tests. You can usually find
>> this in the boot interrupt menu accessed by hitting F12 before the
>> boot-loader loads.
>
> I tried the Dell System Test, including the enhanced optional RAM tests,
> and it was fine. I also tried the Microsoft one, which passed. BUT if I
> select the advanced test in the Microsoft one it always stops at 21% of
> the first test. The test menus are still working, but the fans get quiet
> and it keeps writing "test running... 21%" forever. I tried it many times
> and it always got stuck at 21%, so I suspect a test suite bug instead of a
> RAM failure.
I've actually seen this before on other systems (different completion
percentage on each system, but otherwise the same), all of them ended up
actually having a bad CPU or MB, although the ones with CPU issues were
fine after BIOS updates which included newer microcode.
>
> I also noticed some other interesting behaviours: while I was running
> the usual scrub+check (both were fine) from the livecd I noticed this in
> dmesg:
> [ 261.301159] BTRFS info (device dm-0): bdev /dev/mapper/cryptroot
> errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
> Corrupt? But both scrub and check were fine... I double checked scrub
> and check and they were still fine.
It's worth noting that these are running counts of errors since the last
time the stats were reset (and they only get reset manually). If you
haven't reset the stats, then this isn't all that surprising.
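As a rough illustration of the point about running counters, the snippet below filters `btrfs device stats` output down to the counters that are above zero. It assumes the usual one-counter-per-line layout of `[device].counter value`; nonzero_stats is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: show only the btrfs device error counters that are greater than zero.
# nonzero_stats is a hypothetical helper; it reads `btrfs device stats` output on stdin.
nonzero_stats() {
    # Field 1 is "[device].counter", field 2 is the running count.
    awk '$2 + 0 > 0 {print $1, $2}'
}

# Example:
#   btrfs device stats / | nonzero_stats
# Print the counters and reset them to zero, so later errors stand out:
#   btrfs device stats -z /
```

Resetting with -z after noting the current values makes it easier to tell whether a later "corrupt 4" in dmesg is old history or a new event.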
>
> This is what happened another time:
> https://drive.google.com/open?id=0Bwe9Wtc-5xF1dGtPaWhTZ0w5aUU
> I was making a backup of my partition USING DD from the livecd. It
> wasn't even mounted if I recall correctly!
The fact that you're getting an OOPS involving core kernel threads
(kswapd) is a pretty good indication that either there's a bug elsewhere
in the kernel, or that something is wrong with your hardware. It's
really difficult to be certain if you don't have a reliable test case
though.
>
> On Thursday 12 May 2016 18:48:17 CEST, Zygo Blaxell wrote:
>> That's what a RAM corruption problem looks like when you run btrfs scrub.
>> Maybe the RAM itself is OK, but *something* is scribbling on it.
>>
>> Does the Arch live usb use the same kernel as your normal system?
>
> Yes, except for the point release (the system is slightly ahead of the
> liveusb).
>
> On Thursday 12 May 2016 18:48:17 CEST, Zygo Blaxell wrote:
>> Did you try an older (or newer) kernel? I've been running 4.5.x on a few
>> canary systems, but so far none of them have survived more than a day.
>
> No (except for point releases from 4.5.0 to 4.5.4), but I will try 4.4.
FWIW, I've been running 4.5 with almost no issues on my laptop since it
came out (the few issues I have had are not unique to 4.5, and are all
ultimately firmware issues (Lenovo has been getting _really_ bad
recently about having broken ACPI and EFI implementations...)). Of
course, I'm also running Gentoo, so everything is built locally, but I
doubt that that has much impact on stability.
>
> On Thursday 12 May 2016 18:48:17 CEST, Zygo Blaxell wrote:
>> It's possible there's a problem that affects only very specific chipsets.
>> You seem to have eliminated RAM in isolation, but there could be a
>> problem
>> in the kernel that affects only your chipset.
>
> Funny considering it is sold as a Linux laptop. Unfortunately they only
> tested it with the ancient Ubuntu 14.04.
Sadly, this is pretty typical for anything sold as a 'Linux' system that
isn't a server. Even for the servers sold as such, it's not unusual for
them to only be tested with old versions of CentOS.
Now, I hadn't thought of this before, but it's a Dell system, so you're
trapping out to SMBIOS for everything under the sun, and if they don't
pass a correct memory map (or correct ACPI tables) to the OS during
boot, then there may be some sections of RAM that both Linux and the
firmware think they can use, which could definitely result in symptoms
like bad RAM while still consistently passing memory tests (because they
don't make BIOS calls after they have the system info they need).
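To at least see what memory map the firmware handed to the kernel, something along these lines may help. It is a sketch under assumptions: it relies on the kernel exposing /sys/firmware/memmap (which depends on the kernel configuration), the files there are likely readable by root only, and print_memmap is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print "start end type" for each firmware memory map entry.
# print_memmap is a hypothetical helper; $1 defaults to /sys/firmware/memmap.
print_memmap() {
    base=${1:-/sys/firmware/memmap}
    for entry in "$base"/*; do
        # Skip anything that is not a readable numbered entry directory.
        [ -r "$entry/start" ] || continue
        printf '%s %s %s\n' \
            "$(cat "$entry/start")" "$(cat "$entry/end")" "$(cat "$entry/type")"
    done
}

# Example (as root); entries come out in glob order, not address order:
#   print_memmap > memmap-$(uname -r).txt
```

Saving the output before and after a BIOS update, or under different kernels, gives something concrete to diff. It only shows what the firmware reported, so it cannot prove by itself that the firmware stays out of RAM it marked as usable.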
* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
2016-05-13 11:35 ` Austin S. Hemmelgarn
@ 2016-05-13 12:10 ` Niccolò Belli
2016-05-13 21:54 ` Chris Murphy
0 siblings, 1 reply; 25+ messages in thread
From: Niccolò Belli @ 2016-05-13 12:10 UTC (permalink / raw)
To: Austin S. Hemmelgarn
Cc: linux-btrfs, Clemens Eisserer, Patrik Lundquist, Chris Murphy,
Qu Wenruo, Omar Sandoval, Zygo Blaxell, 1i5t5.duncan
On Friday 13 May 2016 13:35:01 CEST, Austin S. Hemmelgarn wrote:
> The fact that you're getting an OOPS involving core kernel
> threads (kswapd) is a pretty good indication that either there's
> a bug elsewhere in the kernel, or that something is wrong with
> your hardware. It's really difficult to be certain if you don't
> have a reliable test case though.
Talking about reliable test cases, I forgot to say that I definitely found
an interesting one. It doesn't lead to an OOPS but perhaps to something even
more interesting. While running countless stress tests I tried running some
games to stress the system in different ways. I chose openmw (an open
source engine for Morrowind) and I played it for a while on my second
external monitor (while I watched some monitoring tools on my first
monitor). I noticed that after playing a while I *always* lose my internet
connection (I use a USB3 Gigabit Ethernet adapter). This isn't the only
thing which happens: even if the game keeps running flawlessly and the
system *seems* to work fine (I can drag windows, open the terminal...) lots
of commands simply stall (for example mounting a partition, unmounting it,
rebooting...). I can reliably reproduce it, it ALWAYS happens.
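When commands stall like this, a quick way to see which processes are stuck is to list the tasks in uninterruptible sleep (state D). This is a minimal sketch, assuming the procps ps with its `-o` output format; blocked_tasks is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: keep only tasks whose state starts with D (uninterruptible sleep).
# blocked_tasks is a hypothetical helper; it expects ps output with a header
# line and the state in the second column.
blocked_tasks() {
    awk 'NR > 1 && $2 ~ /^D/'
}

# Example: show pid, state, kernel wait channel and command of stuck tasks:
#   ps -eo pid,stat,wchan:32,comm | blocked_tasks
# With sysrq enabled, the kernel can also dump blocked tasks to its log:
#   echo w > /proc/sysrq-trigger
```

The wait channel column hints at where in the kernel each stalled command is sleeping, which is the kind of detail that makes a report on a reproducible hang much more useful.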
Niccolò
* Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
2016-05-13 12:10 ` Niccolò Belli
@ 2016-05-13 21:54 ` Chris Murphy
0 siblings, 0 replies; 25+ messages in thread
From: Chris Murphy @ 2016-05-13 21:54 UTC (permalink / raw)
To: Niccolò Belli
Cc: Austin S. Hemmelgarn, Btrfs BTRFS, Clemens Eisserer,
Patrik Lundquist, Chris Murphy, Qu Wenruo, Omar Sandoval,
Zygo Blaxell, Duncan
On Fri, May 13, 2016 at 6:10 AM, Niccolò Belli
<darkbasic@linuxsystems.it> wrote:
> On Friday 13 May 2016 13:35:01 CEST, Austin S. Hemmelgarn wrote:
>>
>> The fact that you're getting an OOPS involving core kernel threads
>> (kswapd) is a pretty good indication that either there's a bug elsewhere in
>> the kernel, or that something is wrong with your hardware. It's really
>> difficult to be certain if you don't have a reliable test case though.
>
>
> Talking about reliable test cases, I forgot to say that I definitely found
> an interesting one. It doesn't lead to an OOPS but perhaps to something even
> more interesting. While running countless stress tests I tried running some
> games to stress the system in different ways. I chose openmw (an open source
> engine for Morrowind) and I played it for a while on my second external
> monitor (while I watched some monitoring tools on my first monitor). I
> noticed that after playing a while I *always* lose my internet connection (I
> use a USB3 Gigabit Ethernet adapter). This isn't the only thing which
> happens: even if the game keeps running flawlessly and the system *seems* to
> work fine (I can drag windows, open the terminal...) lots of commands simply
> stall (for example mounting a partition, unmounting it, rebooting...). I can
> reliably reproduce it, it ALWAYS happens.
Well there are a bunch of kernel debug options. If your kernel was built
with CONFIG_SLUB=y and CONFIG_SLUB_DEBUG=y, you can boot with the
slub_debug boot parameter to enable it (bare slub_debug turns on all the
checks; it otherwise takes flag letters such as slub_debug=FZPU rather
than a number), and maybe there'll be something more revealing about the
problems you're having. More aggressive is CONFIG_DEBUG_PAGEALLOC=y, but
it'll slow things down quite noticeably.
And then there are some Btrfs debug options set at compile time, and
some that are enabled with mount options. But I think the problem you're
having isn't specific to Btrfs or someone else would have run into it.
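Before rebooting with debug parameters, it is worth checking whether the running kernel was built with the options at all. The following is a minimal sketch; config_has is a hypothetical helper name, and it assumes the kernel config is available either as /proc/config.gz or as a plain text file such as /boot/config-$(uname -r), which varies by distribution.

```shell
#!/bin/sh
# Sketch: test whether a kernel config file has a given option set to =y.
# config_has is a hypothetical helper. $1 = config file (plain or .gz), $2 = option name
config_has() {
    case $1 in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | grep -qx "$2=y"
}

# Example:
#   for opt in CONFIG_SLUB CONFIG_SLUB_DEBUG CONFIG_DEBUG_PAGEALLOC; do
#       config_has /proc/config.gz "$opt" && echo "$opt=y" || echo "$opt not set to y"
#   done
```

If an option is missing, the corresponding boot parameter will not enable anything, and a rebuilt kernel would be needed.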
--
Chris Murphy
Thread overview: 25+ messages
2016-05-04 23:21 btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair Niccolò Belli
2016-05-05 1:07 ` Chris Murphy
2016-05-05 10:36 ` Niccolò Belli
2016-05-05 17:48 ` Omar Sandoval
2016-05-06 11:38 ` Niccolò Belli
2016-05-07 15:45 ` Niccolò Belli
2016-05-07 15:58 ` Clemens Eisserer
2016-05-07 16:11 ` Niccolò Belli
2016-05-08 18:27 ` Patrik Lundquist
2016-05-09 11:52 ` Austin S. Hemmelgarn
2016-05-09 14:53 ` Niccolò Belli
2016-05-09 16:29 ` Zygo Blaxell
2016-05-09 18:21 ` Austin S. Hemmelgarn
2016-05-09 19:18 ` Duncan
2016-05-12 14:35 ` Niccolò Belli
2016-05-12 15:43 ` Austin S. Hemmelgarn
2016-05-13 11:07 ` Niccolò Belli
2016-05-13 11:35 ` Austin S. Hemmelgarn
2016-05-13 12:10 ` Niccolò Belli
2016-05-13 21:54 ` Chris Murphy
2016-05-12 16:48 ` Zygo Blaxell
2016-05-09 19:23 ` Lionel Bouton
2016-05-09 21:30 ` Chris Murphy
2016-05-07 23:35 ` Chris Murphy
2016-05-05 4:12 ` Qu Wenruo