* system locked up with btrfs-transaction consuming 100% CPU
@ 2016-08-09 18:07 Dave T
  2016-08-09 21:32 ` Duncan
  2016-08-09 22:54 ` Dave T
  0 siblings, 2 replies; 5+ messages in thread
From: Dave T @ 2016-08-09 18:07 UTC (permalink / raw)
  To: linux-btrfs

My system locked up with btrfs-transaction consuming 100% CPU and NMI
watchdog reporting BUG: soft lockup with btrfs-transaction:314.

This comes 2 days after a serious event involving BTRFS where my
system would not mount the root fs. (I gave details in an email to the
list two days ago and copied again below.)

Here are full details of today's "bug" (or whatever it was).

When I left work last night I left my system running and I locked the
session. The only things open were KDE Plasma, some terminal windows
and some plain text documents in Kate editor. No real work was running
on the local machine.

This morning I came to work and noticed that my computer was slightly
warm and the fans were running at higher than normal RPM.

I logged in and opened top in an existing terminal. I saw that
btrfs-transaction was consuming 100% of a CPU core and kworker was
consuming 100% of another CPU core.

I tried to run a command (to view logs) in another terminal window,
but the system became unresponsive. I was able to switch to another
virtual console, but it was very slow. I took photos with my phone.
See link below for two images (top and virtual console):

http://imgur.com/a/fT1RV

These photos show what I reported above:
* btrfs-transaction consuming 100% CPU
* NMI watchdog reporting BUG: soft lockup with btrfs-transaction:314

I hard reset my system, expecting the worst, but it rebooted normally.
journalctl -xb -p3 showed no entries.

Obviously I have a serious problem. However, I have no clue about what
the problem might be (except that it seemingly involves btrfs). What
other information can I provide?

On Sun, Aug 7, 2016 at 6:44 PM, Dave <davestechshop@gmail.com> wrote:
> I am running Arch Linux on a system with full disk encryption and the
> storage is a Samsung 950 Pro NVMe drive (512 GB). The computer is a
> couple months old. No bad behavior until now. (I'm only using 21 GB of
> the 512 GB of space on the disk.)
>
>     btrfs-progs v4.5.1
>
> Today I was using my system normally and browsing the web. Firefox
> stopped responding suddenly and for no apparent reason. Then (KDE)
> Plasma stopped responding. I could not log out of KDE.
>
> I killed my user session (pkill -u me), then I tried to startx. At
> that point I noticed my root filesystem was read-only.
>
> As a first step, I rebooted. That didn't help anything. I tried
> rebooting several more times -- no change.
>
> The root filesystem (btrfs) would not mount. (See error below.) I
> booted into a LiveUSB environment and ran this command:
>
>     cryptsetup open --type luks /dev/xxx cryptroot
>
> It opens. Then I ran:
>
>     mount -t btrfs -o noatime,nodiratime,ssd,compress=lzo,defaults,space_cache,subvolid=257 /dev/mapper/cryptroot /mnt
>
> The error message is shown here:
>
>     [ 2300.967048] BTRFS info (device dm-0): use ssd allocation scheme
>     [ 2300.967058] BTRFS info (device dm-0): use lzo compression
>     [ 2300.967066] BTRFS info (device dm-0): disk space caching is enabled
>     [ 2300.967069] BTRFS: has skinny extents
>     [ 2300.995393] BTRFS: error (device dm-0) in btrfs_replay_log:2413: errno=-22 unknown (Failed to recover log tree)
>     [ 2300.997617] BTRFS info (device dm-0): delayed_refs has NO entry
>     [ 2300.997673] BTRFS error (device dm-0): cleaner transaction attach returned -30
>     [ 2301.035405] BTRFS: open_ctree failed
>
> It is exactly the same error I saw when trying to boot normally as
> mentioned above.
>
> Based on these two links:
>
>> https://btrfs.wiki.kernel.org/index.php/Problem_FAQ
>> https://btrfs.wiki.kernel.org/index.php/Btrfs-zero-log
>
> I decided to take a chance on running this command:
>
>     btrfs rescue zero-log
>
> That worked and I can mount the filesystem.
>
> I ran btrfs check --repair. Here is the output:
>
>     root@broken / # umount /mnt
>     root@broken / # btrfs check --repair /dev/mapper/cryptroot
>     enabling repair mode
>     Checking filesystem on /dev/mapper/cryptroot
>     checking extents
>     bad metadata [292414476288, 292414492672) crossing stripe boundary
>     bad metadata [292414541824, 292414558208) crossing stripe boundary
>     bad metadata [292414672896, 292414689280) crossing stripe boundary
>     bad metadata [292414869504, 292414885888) crossing stripe boundary
>     bad metadata [292415000576, 292415016960) crossing stripe boundary
>     bad metadata [292415066112, 292415082496) crossing stripe boundary
>     bad metadata [292415131648, 292415148032) crossing stripe boundary
>     bad metadata [292415262720, 292415279104) crossing stripe boundary
>     bad metadata [292415328256, 292415344640) crossing stripe boundary
>     bad metadata [292415393792, 292415410176) crossing stripe boundary
>     repaired damaged extent references
>     Fixed 0 roots.
>     checking free space cache
>     cache and super generation don't match, space cache will be invalidated
>     checking fs roots
>     checking csums
>     checking root refs
>     checking quota groups
>     Ignoring qgroup relation key 258
>     Ignoring qgroup relation key 263
>     Ignoring qgroup relation key 71776119061217538
>     Ignoring qgroup relation key 71776119061217543
>     Counts for qgroup id: 257 are different
>     our:            referenced 10412273664 referenced compressed 10412273664
>     disk:           referenced 10411311104 referenced compressed 10411311104
>     diff:           referenced 962560 referenced compressed 962560
>     our:            exclusive 10412273664 exclusive compressed 10412273664
>     disk:           exclusive 10412273664 exclusive compressed 10412273664
>     found 21570773057 bytes used err is 0
>     total csum bytes: 19563456
>     total tree bytes: 403767296
>     total fs tree bytes: 349667328
>     total extent tree bytes: 27328512
>     btree space waste bytes: 66313360
>     file data blocks allocated: 39882014720
>     referenced 28043988992
>     extent buffer leak: start 20987904 len 16384
>     extent buffer leak: start 292688068608 len 16384
>     extent buffer leak: start 60915712 len 16384
>     extent buffer leak: start 29569581056 len 16384
>     extent buffer leak: start 29569597440 len 16384
>     extent buffer leak: start 292412063744 len 16384
>     extent buffer leak: start 292405870592 len 16384
>     extent buffer leak: start 292405936128 len 16384
>     extent buffer leak: start 292413964288 len 16384
>
> Then I checked dmesg and saw this error information:
>
>     [ 4925.562422] BTRFS info (device dm-0): use ssd allocation scheme
>     [ 4925.562432] BTRFS info (device dm-0): use lzo compression
>     [ 4925.562440] BTRFS info (device dm-0): disk space caching is enabled
>     [ 4925.562444] BTRFS: has skinny extents
>     [ 4925.578705] BTRFS error (device dm-0): qgroup generation mismatch, marked as inconsistent
>     [ 4925.584033] BTRFS: checking UUID tree
>
> What should I do next? I'm a simple user.
>
> I already ran memtest86+ overnight using 8 CPU cores in parallel (so
> it was a very thorough memory test). There were 0 RAM errors.
>
> I previously used btrfs since 2012 with no issues. I am concerned
> about the present issue because I do not understand the cause.
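
For reference, the "crossing stripe boundary" lines in the check output
above can be reproduced with simple arithmetic. This is only a sketch: it
assumes a 64 KiB stripe length and a check measured from logical address 0.
Both are assumptions about how btrfs check decides this, not something the
output itself states, and the function name is made up for illustration.

```python
STRIPE_LEN = 64 * 1024  # assumed btrfs stripe length (64 KiB)


def crosses_stripe(start, end, stripe_len=STRIPE_LEN):
    """Return True if the half-open range [start, end) spans two stripes."""
    return start // stripe_len != (end - 1) // stripe_len


# First range reported in the check output above.
start, end = 292414476288, 292414492672

print(end - start)                 # 16384: one 16 KiB metadata block
print(start % STRIPE_LEN)          # 53248: the block starts 52 KiB into its stripe
print(crosses_stripe(start, end))  # True: 52 KiB + 16 KiB runs past 64 KiB
```

Under that assumption every range in the list starts 52 KiB into a stripe,
which is why each of them is flagged. Whether the flag points at real damage
or only at an overly strict check is a separate question that this
arithmetic cannot answer.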


* Re: system locked up with btrfs-transaction consuming 100% CPU
  2016-08-09 18:07 system locked up with btrfs-transaction consuming 100% CPU Dave T
@ 2016-08-09 21:32 ` Duncan
  2016-08-09 22:20   ` Dave T
  2016-08-09 22:54 ` Dave T
  1 sibling, 1 reply; 5+ messages in thread
From: Duncan @ 2016-08-09 21:32 UTC (permalink / raw)
  To: linux-btrfs

Dave T posted on Tue, 09 Aug 2016 14:07:46 -0400 as excerpted:

> I hard reset my system, expecting the worst, but it rebooted normally.
> journalctl -xb -p3 showed no entries.

I don't have any suggestions for your primary problem, tho I do have a 
comment down below, but I do have a suggestion regarding your "hard 
reset".

Consider doing some reading on "magic sysrequest", aka sysrq aka srq.

$KERNDIR/Documentation/sysrq.txt , and there's lots of googlable articles 
about it as well.

Basically, when you'd otherwise do a hard reset, try a series of triple-
key chords, alt-sysrq-<otherkey> first.  (Sysrq is printscreen, if alt 
isn't pressed with it, so alt-sysrq-thirdkey.)

The longer form of the emergency sequence is reisub -- you can read what 
the r-e-i keys do in the documentation -- but from my own experience, I 
find when the system's in bad enough shape I need to do an emergency 
reboot, these keys don't do much for me, while the last three, sub, often 
(but not always) do, and they're much easier to remember, so...

Alt-sysrq-s alt-sysrq-u alt-sysrq-b

s=Sync.  If the kernel is still alive and believes it's still stable 
enough to write to permanent storage without risking writing somewhere it 
shouldn't, this will force all write-cached "dirty" data to be written 
out.

You can safely do an alt-srq-s at any time, and continue working, as it 
forces cached writes to be written out, but doesn't otherwise interfere 
with the running system.  As such, alt-srq-s is a useful sequence to use 
right before you do anything you suspect /might/ crash the system, like 
starting X with a new graphics driver.

u=remoUnt-read-only.  Again, if the kernel is alive and stable, this will 
remount all filesystems read-only, allowing them to safely clean up in 
the process.  The action carries down to sub-filesystem layers like 
dmcrypt as well.

Note that this is an emergency remount-read-only, so it's a bit more 
forceful regarding open files that would block an ordinary remount-
readonly.  As such, consider the system unusable after doing an alt-srq-
u, and shutdown or reboot immediately.

b=reBoot.  This forces the kernel to do an immediate reboot, without 
syncing or remounting, etc.  Thus the s-u- first, to sync and remount.
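
Where the keyboard chord is not available (over ssh, for example), the same
s-u-b sequence can be sent by writing single characters to
/proc/sysrq-trigger. The following is a minimal sketch rather than a tested
tool: the function name, the pause length and the dry_run switch are
assumptions made for illustration. Writing to the real trigger file needs
root, and the final "b" reboots the machine immediately.

```python
import time

SYSRQ_TRIGGER = "/proc/sysrq-trigger"  # kernel interface; writing needs root


def emergency_sequence(keys="sub", trigger_path=SYSRQ_TRIGGER,
                       pause=2.0, dry_run=True):
    """Send each SysRq key in order and return the keys that were handled.

    With dry_run=True (the default) nothing is written, so the function
    can be tried safely. Hypothetical helper, not part of any existing tool.
    """
    sent = []
    for key in keys:
        if not dry_run:
            # One character per write: the kernel acts on each write.
            with open(trigger_path, "w") as trigger:
                trigger.write(key)
            # Give the sync / remount time to finish before the next step.
            time.sleep(pause)
        sent.append(key)
    return sent


if __name__ == "__main__":
    print(emergency_sequence())  # dry run only: ['s', 'u', 'b']
```

The pause between steps mirrors what one does by hand: wait for the disk
activity from the sync and the remount to settle before asking for the
reboot.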


Besides being a bit safer than a hard reset, since when it works it 
allows the system to sync and clean up the filesystems before the reboot, 
this also serves as a crude but effective method of finding out just how 
severely the system was locked up.  If the sync and remount steps light 
up your storage I/O activity LED, you know the kernel considered itself 
in pretty good shape, even if userspace was lost and there was no display 
at all.  If there's no response to them but the reboot step works, you 
know the kernel was still alive enough to respond, but either there 
wasn't anything dirty to write out, or more likely, the kernel believed 
itself to be corrupted, and thus didn't trust its ability to write to 
permanent storage without risking scribbling on other parts of the device 
(other files, perhaps even other partitions).  And of course if none of 
them work and you /do/ have to do a hard reboot, then you know the kernel 
itself was dead, at least to the point it could no longer respond at all 
to magic srq.


As to the comment... I'm running plasma/kde5 on gentoo, here, but I'm 
running upstream-kde's live-git version, available via the gentoo/kde 
overlay.  Some weeks ago, for a period, something wasn't working, and 
every time I left the system alone long enough to lock the screen and 
power-down the monitors, when I came back the system would be crashed.  
With a bit of experimentation, I discovered that it would stay running as 
long as I didn't let the monitors power off automatically (I could power 
them down manually, tho), so for a while, I was running xset -dpms after 
every X/plasma restart (I start X/plasma using startx from a text login 
and don't use a *DM), to keep plasma from powering down the graphics 
adapter, tho it could and did still run the screenlocker.

Since then, they fixed whatever it was and I can let the power-downs 
happen normally.  I don't believe the bug made it to a release, tho 
because I'm following live-git I'm not tracking the releases closely and 
could be mistaken.

You mentioned arch, which IIRC is pretty close to upstream's release 
cycle, so it's just possible that if this /did/ hit a release, and you're 
running a new enough kde/plasma, the problem you're seeing may be related 
to what I was experiencing.  Tho I doubt it since as I said it was only a 
short period, and I don't think the defective code made it into a release.

FWIW, tho, I'm running Radeon Turks graphics (hd6670, IIRC) with triple 
monitor and the native freedomware kernel/mesa/xorg driver, not fglrx or 
whatever the proprietary thing is called.  If you're running Radeon, with 
the freedomware driver, especially if also running multi-monitor and the 
absolute latest plasma, you might try either downgrading a version to see 
if the problem goes away, or doing the xset -dpms thing I was doing, 
temporarily.  It's just possible it'll help since your problem seems 
similarly to be triggering when you're away from the machine, but your 
problem does seem a bit different than mine (mine was a consistent 
crash), and I don't believe mine made release code anyway, so it's likely 
the similarity is just coincidence.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: system locked up with btrfs-transaction consuming 100% CPU
  2016-08-09 21:32 ` Duncan
@ 2016-08-09 22:20   ` Dave T
  2016-08-10 11:39     ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 5+ messages in thread
From: Dave T @ 2016-08-09 22:20 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

Thank you for the info, Duncan.

I will use Alt-sysrq-s alt-sysrq-u alt-sysrq-b. This is the best
description / recommendation I've read on the subject. I had read
about these special key sequences before but I could never remember
them and I didn't fully understand what they did. Now you have given
me the understanding as well as an easy-to-remember method. I'll use
it.

I launch KDE the same way you do (no DM). I also run a triple monitor
setup, but I am using an nvidia GTX 1070 (and proprietary drivers),
for the time being.

My system does not have any issues when the monitors go to sleep. That
happens many times a day as I have a short timeout set.

I am very concerned about this primary problem (or problems) and I
hope I can find some understanding of what is going on. BTRFS has
worked well for me since 2012. While that's fantastic, it also means I
haven't had to troubleshoot it in the past. Now (because of 4 years of
problem-free operation) I'm using it on a critical production system.
I have backups, but I cannot allow these problems to go unresolved.



* Re: system locked up with btrfs-transaction consuming 100% CPU
  2016-08-09 18:07 system locked up with btrfs-transaction consuming 100% CPU Dave T
  2016-08-09 21:32 ` Duncan
@ 2016-08-09 22:54 ` Dave T
  1 sibling, 0 replies; 5+ messages in thread
From: Dave T @ 2016-08-09 22:54 UTC (permalink / raw)
  To: linux-btrfs

The original problem from 2 days ago just happened again. I ran btrfs
rescue zero-log (again) and the root filesystem mounted but it was
read-only on first boot. I rebooted again and everything seems normal.
But clearly there is a problem that needs to be resolved.

Problem:
The root file system becomes read-only during normal usage. See copied
message below for more information about the error.

I'm happy to provide more info upon request. I appreciate any help. Is
this btrfs? A bad disk? Something else?

Linux x99 4.6.4-1-ARCH #1 SMP PREEMPT Mon Jul 11 19:12:32 CEST 2016
x86_64 GNU/Linux

btrfs-progs v4.6.1
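
The two numeric codes in the mount log from the earlier report (errno=-22
and "returned -30") are ordinary Linux errno values. A small sketch for
decoding them follows; the helper name is made up for illustration.

```python
import errno
import os


def describe_errno(code):
    """Map a kernel-style negative errno to its symbolic name and message."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name}: {os.strerror(number)}"


# On Linux these are EINVAL (invalid argument) and EROFS (read-only file system).
for code in (-22, -30):
    print(code, describe_errno(code))
```

The -30 is at least consistent with the filesystem already having been
forced read-only by the time the cleaner ran; it does not by itself say why
the log replay failed.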




* Re: system locked up with btrfs-transaction consuming 100% CPU
  2016-08-09 22:20   ` Dave T
@ 2016-08-10 11:39     ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 5+ messages in thread
From: Austin S. Hemmelgarn @ 2016-08-10 11:39 UTC (permalink / raw)
  To: Dave T, linux-btrfs

On 2016-08-09 18:20, Dave T wrote:
> Thank you for the info, Duncan.
>
> I will use Alt-sysrq-s alt-sysrq-u alt-sysrq-b. This is the best
> description / recommendation I've read on the subject. I had read
> about these special key sequences before but I could never remember
> them and I didn't fully understand what they did. Now you have given
> me the understanding as well as an easy-to-remember method. I'll use
> it.
The other two which you may find potentially useful are alt-sysrq-o, 
which shuts down the system (it's like 'b' too though, so you should 
still sync and remount before using it), and alt-sysrq-c,  which will 
immediately trigger a kernel panic (and thus force a crash dump if you 
have them set up).

As for the other three:
'r' takes the keyboard out of raw mode (it is the "unraw" key); this is 
generally only needed if you've been using an old version of X or something 
like svgalib or directfb and it crashed and you can't get the keyboard to 
work on the terminal again.  I normally don't use this simply because it 
isn't needed if you're running in text mode or have a new enough version of X.
'e' and 'i' respectively send SIGTERM and SIGKILL to all userspace 
processes except init.  These are generally recommended because most 
things will clean up properly if you send them SIGTERM, and the few 
stragglers that don't catch that (or get stuck during their cleanup) 
will get killed by SIGKILL regardless, and if there are still processes 
writing to a filesystem, syncing may not flush everything out to disk.

It's also worth pointing out that many RPM based distributions (at least 
RHEL, CentOS, and Fedora, and I think SLES and openSUSE as well) disable 
some or all of the SysRq combinations (they technically are a security 
issue, but if someone has console access to your system, you probably 
have much bigger issues than sysrq to deal with).
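
Which keys a given system permits can be read from /proc/sys/kernel/sysrq:
0 disables everything, 1 enables everything, and larger values are a
bitmask. A rough sketch of decoding it follows. The bit values are the ones
listed in the kernel's SysRq documentation as far as I know, and the helper
names are hypothetical.

```python
# Bit values as listed in the kernel's SysRq documentation (assumed current).
SYSRQ_BITS = {
    "r": 4,    # keyboard control (unraw)
    "s": 16,   # sync
    "u": 32,   # remount read-only
    "e": 64,   # signal processes (SIGTERM)
    "i": 64,   # signal processes (SIGKILL)
    "b": 128,  # reboot
    "o": 128,  # power off
}


def allowed_keys(value):
    """Return the sorted SysRq keys that a kernel.sysrq value permits."""
    if value == 1:  # 1 means "everything enabled", not a bitmask
        return sorted(SYSRQ_BITS)
    return sorted(key for key, bit in SYSRQ_BITS.items() if value & bit)


def read_sysrq_setting(path="/proc/sys/kernel/sysrq"):
    """Read the current setting from procfs (Linux only)."""
    with open(path) as handle:
        return int(handle.read().strip())


if __name__ == "__main__":
    print(allowed_keys(16))   # ['s']
    print(allowed_keys(176))  # ['b', 'o', 's', 'u']
```

A value of 176 (16 + 32 + 128) is therefore enough for the s-u-b sequence.
The setting only restricts the keyboard chords; root can still write to
/proc/sysrq-trigger whatever the value is.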
>
> I launch KDE the same way you do (no DM). I also run a triple monitor
> setup, but I am using an nvidia GTX 1070 (and proprietary drivers),
> for the time being.
This is potentially going to sound like an odd suggestion, but have you
tried running with the proprietary drivers blacklisted? NVIDIA's
drivers are generally good citizens, but with any proprietary driver
involved, there's considerably less certainty that everything else in
the kernel is working like it should. I don't personally have much
experience with the NVIDIA proprietary drivers: I have a system with a
Quadro K620, but it gets better overall performance with the in-kernel
open source drivers, or even when I just use it as a framebuffer and
push the rendering to the CPU, than it does with the official NVIDIA
drivers, so I just don't use them. I have, however, had issues similar
to what you are seeing with other kernel subsystems when using the
proprietary AMD drivers on other systems.
>
> My system does not have any issues when the monitors go to sleep. That
> happens many times a day as I have a short timeout set.
>
> I am very concerned about this primary problem (or problems) and I
> hope I can find some understanding of what is going on. BTRFS has
> worked well for me since 2012. While that's fantastic, it also means I
> haven't had to troubleshoot it in the past. Now (because of 4 years of
> problem-free operation) I'm using it on a critical production system.
> I have backups, but I cannot allow these problems to go unresolved.
>
> On Tue, Aug 9, 2016 at 5:32 PM, Duncan <1i5t5.duncan@cox.net> wrote:
>> Dave T posted on Tue, 09 Aug 2016 14:07:46 -0400 as excerpted:
>>
>>> I hard reset my system, expecting the worst, but it rebooted normally.
>>> journalctl -xb -p3 showed no entries.
>>
>> I don't have any suggestions for your primary problem, tho I do have a
>> comment down below, but I do have a suggestion regarding your "hard
>> reset".
>>
>> Consider doing some reading on "magic sysrequest", aka sysrq aka srq.
>>
>> $KERNDIR/Documentation/sysrq.txt , and there's lots of googlable articles
>> about it as well.
>>
>> Basically, when you'd otherwise do a hard reset, try a series of triple-
>> key chords, alt-sysrq-<otherkey> first.  (Sysrq is printscreen, if alt
>> isn't pressed with it, so alt-sysrq-thirdkey.)
>>
>> The longer form of the emergency sequence is reisub -- you can read what
>> the r-e-i keys do in the documentation -- but from my own experience, I
>> find when the system's in bad enough shape I need to do an emergency
>> reboot, these keys don't do much for me, while the last three, sub, often
>> (but not always) do, and they're much easier to remember, so...
>>
>> Alt-sysrq-s alt-sysrq-u alt-sysrq-b
>>
>> s=Sync.  If the kernel is still alive and believes it's still stable
>> enough to write to permanent storage without risking writing somewhere it
>> shouldn't, this will force all write-cached "dirty" data to be written
>> out.
>>
>> You can safely do an alt-srq-s at any time, and continue working, as it
>> forces cached writes to be written out, but doesn't otherwise interfere
>> with the running system.  As such, alt-srq-s is a useful sequence to use
>> right before you do anything you suspect /might/ crash the system, like
>> starting X with a new graphics driver.
>>
>> u=remoUnt-read-only.  Again, if the kernel is alive and stable, this will
>> remount all filesystems read-only, allowing them to safely clean up in
>> the process.  The action carries down to sub-filesystem layers like
>> dmcrypt as well.
>>
>> Note that this is an emergency remount-read-only, so it's a bit more
>> forceful regarding open files that would block an ordinary remount-
>> readonly.  As such, consider the system unusable after doing an alt-srq-
>> u, and shutdown or reboot immediately.
>>
>> b=reBoot.  This forces the kernel to do an immediate reboot, without
>> syncing or remounting, etc.  Thus the s-u- first, to sync and remount.
>>
>>
>> Besides being a bit safer than a hard reset, since when it works it
>> allows the system to sync and cleanup the filesystems before the reboot,
>> this also serves as a crude but effective method of finding out just how
>> severely the system was locked up.  If the sync and remount steps light
>> up your storage I/O activity LED, you know the kernel considered itself
>> in pretty good shape, even if userspace was lost and there was no display
>> at all.  If there's no response to them but the reboot step works, you
>> know the kernel was still alive enough to respond, but either there
>> wasn't anything dirty to write out, or more likely, the kernel believed
>> itself to be corrupted, and thus didn't trust its ability to write to
>> permanent storage without risking scribbling on other parts of the device
>> (other files, perhaps even other partitions).  And of course if none of
>> them work and you /do/ have to do a hard reboot, then you know the kernel
>> itself was dead, at least to the point it could no longer respond at all
>> to magic srq.
>>
>>
>> As to the comment... I'm running plasma/kde5 on gentoo, here, but I'm
>> running upstream-kde's live-git version, available via the gentoo/kde
>> overlay.  Some weeks ago, for a period, something wasn't working, and
>> every time I left the system alone long enough to lock the screen and
>> power-down the monitors, when I came back the system would be crashed.
>> With a bit of experimentation, I discovered that it would stay running as
>> long as I didn't let the monitors power off automatically (I could power
>> them down manually, tho), so for awhile, I was running xset -dpms after
>> every X/plasma restart (I start X/plasma using startx from a text login
>> and don't use a *DM), to keep plasma from powering down the graphics
>> adapter, tho it could and did still run the screenlocker.
>>
>> Since then, they fixed whatever it was and I can let the power-downs
>> happen normally.  I don't believe the bug made it to a release, tho
>> because I'm following live-git I'm not tracking the releases closely and
>> could be mistaken.
>>
>> You mentioned arch, which IIRC is pretty close to upstream's release
>> cycle, so it's just possible that if this /did/ hit a release, and you're
>> running a new enough kde/plasma, the problem you're seeing may be related
>> to what I was experiencing.  Tho I doubt it since as I said it was only a
>> short period, and I don't think the defective code made it into a release.
>>
>> FWIW, tho, I'm running Radeon Turks graphics (hd6670, IIRC) with triple
>> monitor and the native freedomware kernel/mesa/xorg driver, not fglrx or
>> whatever the proprietary thing is called.  If you're running Radeon, with
>> the freedomware driver, especially if also running multi-monitor and the
>> absolute latest plasma, you might try either downgrading a version to see
>> if the problem goes away, or doing the xset -dpms thing I was doing,
>> temporarily.  It's just possible it'll help since your problem seems
>> similarly to be triggering when you're away from the machine, but your
>> problem does seem a bit different than mine (mine was a consistent
>> crash), and I don't believe mine made release code anyway, so it's likely
>> the similarity is just coincidence.


end of thread, other threads:[~2016-08-10 18:49 UTC | newest]

Thread overview: 5+ messages
2016-08-09 18:07 system locked up with btrfs-transaction consuming 100% CPU Dave T
2016-08-09 21:32 ` Duncan
2016-08-09 22:20   ` Dave T
2016-08-10 11:39     ` Austin S. Hemmelgarn
2016-08-09 22:54 ` Dave T
