system locked up with btrfs-transaction consuming 100% CPU

From: Dave T @ 2016-08-09 18:07 UTC
  To: linux-btrfs

My system locked up with btrfs-transaction consuming 100% CPU and NMI
watchdog reporting BUG: soft lockup with btrfs-transaction:314.

This comes 2 days after a serious event involving BTRFS where my
system would not mount the root fs. (I gave details in an email to the
list two days ago and copied again below.)

Here are full details of today's "bug" (or whatever it was).

When I left work last night I left my system running and I locked the
session. The only things open were KDE Plasma, some terminal windows
and some plain text documents in Kate editor. No real work was running
on the local machine.

This morning I came to work and noticed that my computer was slightly
warm and the fans were running at higher than normal RPM.

I logged in and opened top in an existing terminal. I saw that
btrfs-transaction was consuming 100% of a CPU core and kworker was
consuming 100% of another CPU core.

I tried to run a command (to view logs) in another terminal window,
but the system became unresponsive. I was able to switch to another
virtual console, but it was very slow. I took photos with my phone.
See link below for two images (top and virtual console):

http://imgur.com/a/fT1RV

These photos show what I reported above:
* btrfs-transaction consuming 100% CPU
* NMI watchdog reporting BUG: soft lockup with btrfs-transaction:314

I hard reset my system, expecting the worst, but it rebooted normally.
journalctl -xb -p3 showed no entries.

Obviously I have a serious problem. However, I have no clue about what
the problem might be (except that it seemingly involves btrfs). What
other information can I provide?
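
In case it helps, here is a minimal sketch of how the details usually
requested for btrfs reports could be collected in one go. The command
list and the "/" mount point are my assumptions, not something taken
from the events above; the helper name collect() is just illustrative.

```python
import shutil
import subprocess

# Commands commonly requested in btrfs bug reports. The mount point "/" is an
# assumption; change it to the affected filesystem.
DIAGNOSTIC_COMMANDS = [
    ["uname", "-a"],
    ["btrfs", "--version"],
    ["btrfs", "filesystem", "show"],
    ["btrfs", "filesystem", "df", "/"],
    ["dmesg"],
]


def collect(commands=DIAGNOSTIC_COMMANDS, timeout=30):
    """Run each command and return {command string: output or a short note}."""
    report = {}
    for argv in commands:
        key = " ".join(argv)
        # Skip tools that are not installed instead of raising.
        if shutil.which(argv[0]) is None:
            report[key] = "(not installed)"
            continue
        try:
            done = subprocess.run(
                argv,
                capture_output=True,
                text=True,
                errors="replace",
                timeout=timeout,
            )
            report[key] = done.stdout + done.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            report[key] = f"(failed: {exc})"
    return report
```

Calling collect() (as root, so that dmesg and btrfs can read what they
need) and printing each entry gives text that can be pasted into a reply.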

On Sun, Aug 7, 2016 at 6:44 PM, Dave <davestechshop@gmail.com> wrote:
> I am running Arch Linux on a system with full disk encryption and the
> storage is a Samsung 950 Pro NVMe drive (512 GB). The computer is a
> couple months old. No bad behavior until now. (I'm only using 21 GB of
> the 512 GB of space on the disk.)
>
>     btrfs-progs v4.5.1
>
> Today I was using my system normally and browsing the web. Firefox
> stopped responding suddenly and for no apparent reason. Then (KDE)
> Plasma stopped responding. I could not log out of KDE.
>
> I killed my user session (pkill -u me), then I tried to run startx. At
> that point I noticed my root filesystem was read-only.
>
> As a first step, I rebooted. That didn't help anything. I tried
> rebooting several more times -- no change.
>
> The root filesystem (btrfs) would not mount. (See error below.) I
> booted into a LiveUSB environment and ran this command:
>
>     cryptsetup open --type luks /dev/xxx cryptroot
>
> It opens. Then I ran:
>
>     mount -t btrfs -o noatime,nodiratime,ssd,compress=lzo,defaults,space_cache,subvolid=257 /dev/mapper/cryptroot /mnt
>
> The error message is shown here:
>
>     [ 2300.967048] BTRFS info (device dm-0): use ssd allocation scheme
>     [ 2300.967058] BTRFS info (device dm-0): use lzo compression
>     [ 2300.967066] BTRFS info (device dm-0): disk space caching is enabled
>     [ 2300.967069] BTRFS: has skinny extents
>     [ 2300.995393] BTRFS: error (device dm-0) in btrfs_replay_log:2413: errno=-22 unknown (Failed to recover log tree)
>     [ 2300.997617] BTRFS info (device dm-0): delayed_refs has NO entry
>     [ 2300.997673] BTRFS error (device dm-0): cleaner transaction attach returned -30
>     [ 2301.035405] BTRFS: open_ctree failed
>
> It is exactly the same error I saw when trying to boot normally as
> mentioned above.
>
> Based on these two links:
>
>> https://btrfs.wiki.kernel.org/index.php/Problem_FAQ
>> https://btrfs.wiki.kernel.org/index.php/Btrfs-zero-log
>
> I decided to take a chance on running this command:
>
>     btrfs rescue zero-log
>
> That worked and I can mount the filesystem.
>
> I ran btrfs check --repair. Here is the output:
>
>     root@broken / # umount /mnt
>     root@broken / # btrfs check --repair /dev/mapper/cryptroot
>     enabling repair mode
>     Checking filesystem on /dev/mapper/cryptroot
>     checking extents
>     bad metadata [292414476288, 292414492672) crossing stripe boundary
>     bad metadata [292414541824, 292414558208) crossing stripe boundary
>     bad metadata [292414672896, 292414689280) crossing stripe boundary
>     bad metadata [292414869504, 292414885888) crossing stripe boundary
>     bad metadata [292415000576, 292415016960) crossing stripe boundary
>     bad metadata [292415066112, 292415082496) crossing stripe boundary
>     bad metadata [292415131648, 292415148032) crossing stripe boundary
>     bad metadata [292415262720, 292415279104) crossing stripe boundary
>     bad metadata [292415328256, 292415344640) crossing stripe boundary
>     bad metadata [292415393792, 292415410176) crossing stripe boundary
>     repaired damaged extent references
>     Fixed 0 roots.
>     checking free space cache
>     cache and super generation don't match, space cache will be invalidated
>     checking fs roots
>     checking csums
>     checking root refs
>     checking quota groups
>     Ignoring qgroup relation key 258
>     Ignoring qgroup relation key 263
>     Ignoring qgroup relation key 71776119061217538
>     Ignoring qgroup relation key 71776119061217543
>     Counts for qgroup id: 257 are different
>     our:            referenced 10412273664 referenced compressed 10412273664
>     disk:           referenced 10411311104 referenced compressed 10411311104
>     diff:           referenced 962560 referenced compressed 962560
>     our:            exclusive 10412273664 exclusive compressed 10412273664
>     disk:           exclusive 10412273664 exclusive compressed 10412273664
>     found 21570773057 bytes used err is 0
>     total csum bytes: 19563456
>     total tree bytes: 403767296
>     total fs tree bytes: 349667328
>     total extent tree bytes: 27328512
>     btree space waste bytes: 66313360
>     file data blocks allocated: 39882014720
>     referenced 28043988992
>     extent buffer leak: start 20987904 len 16384
>     extent buffer leak: start 292688068608 len 16384
>     extent buffer leak: start 60915712 len 16384
>     extent buffer leak: start 29569581056 len 16384
>     extent buffer leak: start 29569597440 len 16384
>     extent buffer leak: start 292412063744 len 16384
>     extent buffer leak: start 292405870592 len 16384
>     extent buffer leak: start 292405936128 len 16384
>     extent buffer leak: start 292413964288 len 16384
>
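
As a sanity check on the "crossing stripe boundary" lines above, here is
a small sketch that tests each reported [start, end) range against a
stripe length. The 64 KiB value is my assumption (I believe it matches
BTRFS_STRIPE_LEN in the btrfs sources); crosses_stripe() is just an
illustrative helper, not anything from btrfs-progs.

```python
# Assumption: 64 KiB stripe length (believed to match BTRFS_STRIPE_LEN).
STRIPE_LEN = 64 * 1024

# Half-open [start, end) byte ranges copied from the "bad metadata" lines.
BAD_METADATA = [
    (292414476288, 292414492672),
    (292414541824, 292414558208),
    (292414672896, 292414689280),
    (292414869504, 292414885888),
    (292415000576, 292415016960),
    (292415066112, 292415082496),
    (292415131648, 292415148032),
    (292415262720, 292415279104),
    (292415328256, 292415344640),
    (292415393792, 292415410176),
]


def crosses_stripe(start, end, stripe_len=STRIPE_LEN):
    """Return True if [start, end) spans more than one stripe."""
    return start // stripe_len != (end - 1) // stripe_len


for start, end in BAD_METADATA:
    # Columns: start, length, offset of start within its stripe, crosses?
    print(start, end - start, start % STRIPE_LEN, crosses_stripe(start, end))
```

Under that assumption every listed range is 16384 bytes long, starts
53248 bytes into a stripe, and therefore runs past the next 64 KiB
boundary, which at least agrees with what btrfs check printed.
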
> Then I checked dmesg and saw this error information:
>
>     [ 4925.562422] BTRFS info (device dm-0): use ssd allocation scheme
>     [ 4925.562432] BTRFS info (device dm-0): use lzo compression
>     [ 4925.562440] BTRFS info (device dm-0): disk space caching is enabled
>     [ 4925.562444] BTRFS: has skinny extents
>     [ 4925.578705] BTRFS error (device dm-0): qgroup generation mismatch, marked as inconsistent
>     [ 4925.584033] BTRFS: checking UUID tree
>
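
Since the qgroup data is flagged as inconsistent, here is a quick sketch
that recomputes the difference between the "our" and "disk" counts that
btrfs check printed for qgroup 257 above. The 4 KiB block size is my
assumption (the usual default); the numbers are copied from that output.

```python
BLOCK_SIZE = 4096  # assumption: 4 KiB blocks, the usual default

# Values copied from the "Counts for qgroup id: 257" lines.
ours = {"referenced": 10412273664, "exclusive": 10412273664}
disk = {"referenced": 10411311104, "exclusive": 10412273664}


def qgroup_diff(ours, disk):
    """Return per-field byte differences between the two sets of counts."""
    return {field: ours[field] - disk[field] for field in ours}


diff = qgroup_diff(ours, disk)
for field, delta in diff.items():
    print(f"{field}: {delta} bytes ({delta / BLOCK_SIZE:g} blocks)")
```

The referenced count differs by 962560 bytes, matching the "diff" line,
and the exclusive count does not differ at all.
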
> What should I do next? I'm a simple user.
>
> I already ran memtest86+ overnight using 8 CPU cores in parallel (so
> it was a very thorough memory test). There were 0 RAM errors.
>
> I previously used btrfs since 2012 with no issues. I am concerned
> about the present issue because I do not understand the cause.
