* system locked up with btrfs-transaction consuming 100% CPU
@ 2016-08-09 18:07 Dave T
2016-08-09 21:32 ` Duncan
2016-08-09 22:54 ` Dave T
0 siblings, 2 replies; 5+ messages in thread
From: Dave T @ 2016-08-09 18:07 UTC (permalink / raw)
To: linux-btrfs
My system locked up with btrfs-transaction consuming 100% CPU and NMI
watchdog reporting BUG: soft lockup with btrfs-transaction:314.
This comes 2 days after a serious event involving BTRFS where my
system would not mount the root fs. (I gave details in an email to the
list two days ago and copied again below.)
Here are full details of today's "bug" (or whatever it was).
When I left work last night I left my system running and I locked the
session. The only things open were KDE Plasma, some terminal windows
and some plain text documents in Kate editor. No real work was running
on the local machine.
This morning I came to work and noticed that my computer was slightly
warm and the fans were running at higher than normal RPM.
I logged in and opened top in an existing terminal. I saw that
btrfs-transaction was consuming 100% of a CPU core and kworker was
consuming 100% of another CPU core.
I tried to run a command (to view logs) in another terminal window,
but the system became unresponsive. I was able to switch to another
virtual console, but it was very slow. I took photos with my phone.
See link below for two images (top and virtual console):
http://imgur.com/a/fT1RV
These photos show what I reported above:
* btrfs-transaction consuming 100% CPU
* NMI watchdog reporting BUG: soft lockup with btrfs-transaction:314
I hard reset my system, expecting the worst, but it rebooted normally.
journalctl -xb -p3 showed no entries.
Obviously I have a serious problem. However, I have no clue about what
the problem might be (except that it seemingly involves btrfs). What
other information can I provide?
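One caveat on the journalctl check above: -b on its own selects the
current boot, so after a hard reset it cannot show messages from the boot
that locked up. A minimal sketch of what could be gathered instead,
assuming a persistent journal (otherwise the previous boot's log is gone)
and a filesystem mounted at /. The function names and the DRY_RUN switch
are made up for illustration.

```shell
#!/bin/sh
# List the commands that collect basic information for a btrfs report.
# "journalctl -k -b -1" = kernel messages from the previous boot.
btrfs_report_cmds() {
    cat <<'EOF'
uname -a
btrfs --version
btrfs filesystem show
btrfs filesystem usage /
journalctl -k -b -1
EOF
}

# Print the commands (default) or, with DRY_RUN=0, run them one by one.
collect_btrfs_report() {
    btrfs_report_cmds | while IFS= read -r cmd; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            printf 'would run: %s\n' "$cmd"
        else
            printf '### %s\n' "$cmd"
            sh -c "$cmd" 2>&1
        fi
    done
}
```

Running "DRY_RUN=0 collect_btrfs_report > report.txt" as root would
produce a single file that can be attached to a list post.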
On Sun, Aug 7, 2016 at 6:44 PM, Dave <davestechshop@gmail.com> wrote:
> I am running Arch Linux on a system with full disk encryption and the
> storage is a Samsung 950 Pro NVMe drive (512 GB). The computer is a
> couple months old. No bad behavior until now. (I'm only using 21 GB of
> the 512 GB of space on the disk.)
>
> btrfs-progs v4.5.1
>
> Today I was using my system normally and browsing the web. Firefox
> stopped responding suddenly and for no apparent reason. Then (KDE)
> Plasma stopped responding. I could not log out of KDE.
>
> I killed my user session (pkill -u me), then I tried to startx. At
> that point I noticed my root filesystem was read-only.
>
> As a first step, I rebooted. That didn't help anything. I tried
> rebooting several more times -- no change.
>
> The root filesystem (btrfs) would not mount. (See error below.) I
> booted into a LiveUSB environment and ran this command:
>
> cryptsetup open --type luks /dev/xxx cryptroot
>
> It opens. Then I ran:
>
> mount -t btrfs -o
> noatime,nodiratime,ssd,compress=lzo,defaults,space_cache,subvolid=257
> /dev/mapper/cryptroot /mnt
>
> The error message is shown here:
>
> [ 2300.967048] BTRFS info (device dm-0): use ssd allocation scheme
> [ 2300.967058] BTRFS info (device dm-0): use lzo compression
> [ 2300.967066] BTRFS info (device dm-0): disk space caching is enabled
> [ 2300.967069] BTRFS: has skinny extents
> [ 2300.995393] BTRFS: error (device dm-0) in
> btrfs_replay_log:2413: errno=-22 unknown (Failed to recover log tree)
> [ 2300.997617] BTRFS info (device dm-0): delayed_refs has NO entry
> [ 2300.997673] BTRFS error (device dm-0): cleaner transaction
> attach returned -30
> [ 2301.035405] BTRFS: open_ctree failed
>
> It is exactly the same error I saw when trying to boot normally as
> mentioned above.
>
> Based on these two links:
>
>> https://btrfs.wiki.kernel.org/index.php/Problem_FAQ
>> https://btrfs.wiki.kernel.org/index.php/Btrfs-zero-log
>
> I decided to take a chance on running this command:
>
> btrfs rescue zero-log
>
> That worked and I can mount the filesystem.
>
> I ran btrfs check --repair. Here is the output:
>
> root@broken / # umount /mnt
> root@broken / # btrfs check --repair /dev/mapper/cryptroot
> enabling repair mode
> Checking filesystem on /dev/mapper/cryptroot
> checking extents
> bad metadata [292414476288, 292414492672) crossing stripe boundary
> bad metadata [292414541824, 292414558208) crossing stripe boundary
> bad metadata [292414672896, 292414689280) crossing stripe boundary
> bad metadata [292414869504, 292414885888) crossing stripe boundary
> bad metadata [292415000576, 292415016960) crossing stripe boundary
> bad metadata [292415066112, 292415082496) crossing stripe boundary
> bad metadata [292415131648, 292415148032) crossing stripe boundary
> bad metadata [292415262720, 292415279104) crossing stripe boundary
> bad metadata [292415328256, 292415344640) crossing stripe boundary
> bad metadata [292415393792, 292415410176) crossing stripe boundary
> repaired damaged extent references
> Fixed 0 roots.
> checking free space cache
> cache and super generation don't match, space cache will be invalidated
> checking fs roots
> checking csums
> checking root refs
> checking quota groups
> Ignoring qgroup relation key 258
> Ignoring qgroup relation key 263
> Ignoring qgroup relation key 71776119061217538
> Ignoring qgroup relation key 71776119061217543
> Counts for qgroup id: 257 are different
> our: referenced 10412273664 referenced compressed 10412273664
> disk: referenced 10411311104 referenced compressed 10411311104
> diff: referenced 962560 referenced compressed 962560
> our: exclusive 10412273664 exclusive compressed 10412273664
> disk: exclusive 10412273664 exclusive compressed 10412273664
> found 21570773057 bytes used err is 0
> total csum bytes: 19563456
> total tree bytes: 403767296
> total fs tree bytes: 349667328
> total extent tree bytes: 27328512
> btree space waste bytes: 66313360
> file data blocks allocated: 39882014720
> referenced 28043988992
> extent buffer leak: start 20987904 len 16384
> extent buffer leak: start 292688068608 len 16384
> extent buffer leak: start 60915712 len 16384
> extent buffer leak: start 29569581056 len 16384
> extent buffer leak: start 29569597440 len 16384
> extent buffer leak: start 292412063744 len 16384
> extent buffer leak: start 292405870592 len 16384
> extent buffer leak: start 292405936128 len 16384
> extent buffer leak: start 292413964288 len 16384
>
> Then I checked dmesg and saw this error information:
>
> [ 4925.562422] BTRFS info (device dm-0): use ssd allocation scheme
> [ 4925.562432] BTRFS info (device dm-0): use lzo compression
> [ 4925.562440] BTRFS info (device dm-0): disk space caching is enabled
> [ 4925.562444] BTRFS: has skinny extents
> [ 4925.578705] BTRFS error (device dm-0): qgroup generation
> mismatch, marked as inconsistent
> [ 4925.584033] BTRFS: checking UUID tree
>
> What should I do next? I'm a simple user.
>
> I already ran memtest86+ overnight using 8 CPU cores in parallel (so
> it was a very thorough memory test). There were 0 RAM errors.
>
> I previously used btrfs since 2012 with no issues. I am concerned
> about the present issue because I do not understand the cause.
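For readers puzzling over the "bad metadata [...) crossing stripe
boundary" lines in the check output above: each range is one 16384-byte
metadata block, and as far as I understand the check, the message is
printed when such a block does not fit inside a single 64 KiB stripe. A
small sketch of that arithmetic, assuming the 64 KiB stripe length from
the btrfs sources (BTRFS_STRIPE_LEN). The function name is made up, and
this says nothing about whether the warning indicates real damage.

```shell
#!/bin/sh
# Assumption: 64 KiB stripe length (BTRFS_STRIPE_LEN in the btrfs sources).
STRIPE_LEN=65536

# crosses_stripe START END -> prints "yes" if the half-open byte range
# [START, END) spans more than one stripe, "no" otherwise.
crosses_stripe() {
    first=$(( $1 / STRIPE_LEN ))
    last=$(( ($2 - 1) / STRIPE_LEN ))
    if [ "$first" -ne "$last" ]; then echo yes; else echo no; fi
}

# First range from the check output: 16384 bytes long, starting
# 53248 bytes (52 KiB) into its stripe, so it runs past the 64 KiB mark.
crosses_stripe 292414476288 292414492672
```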
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: system locked up with btrfs-transaction consuming 100% CPU
2016-08-09 18:07 system locked up with btrfs-transaction consuming 100% CPU Dave T
@ 2016-08-09 21:32 ` Duncan
2016-08-09 22:20 ` Dave T
2016-08-09 22:54 ` Dave T
1 sibling, 1 reply; 5+ messages in thread
From: Duncan @ 2016-08-09 21:32 UTC (permalink / raw)
To: linux-btrfs
Dave T posted on Tue, 09 Aug 2016 14:07:46 -0400 as excerpted:
> I hard reset my system, expecting the worst, but it rebooted normally.
> journalctl -xb -p3 showed no entries.
I don't have any suggestions for your primary problem, tho I do have a
comment down below, but I do have a suggestion regarding your "hard
reset".
Consider doing some reading on the "magic SysRq" key, aka sysrq aka srq.
See $KERNDIR/Documentation/sysrq.txt, and there are lots of googlable
articles about it as well.
Basically, when you'd otherwise do a hard reset, first try a series of
three-key chords, alt-sysrq-<otherkey>. (SysRq shares a key with
PrintScreen and only acts as SysRq while alt is held, so the chord is
alt-sysrq-thirdkey.)
The longer form of the emergency sequence is reisub -- you can read what
the r-e-i keys do in the documentation -- but from my own experience, I
find when the system's in bad enough shape I need to do an emergency
reboot, these keys don't do much for me, while the last three, sub, often
(but not always) do, and they're much easier to remember, so...
Alt-sysrq-s alt-sysrq-u alt-sysrq-b
s=Sync. If the kernel is still alive and believes it's still stable
enough to write to permanent storage without risking writing somewhere it
shouldn't, this will force all write-cached "dirty" data to be written
out.
You can safely do an alt-srq-s at any time, and continue working, as it
forces cached writes to be written out, but doesn't otherwise interfere
with the running system. As such, alt-srq-s is a useful sequence to use
right before you do anything you suspect /might/ crash the system, like
starting X with a new graphics driver.
u=remoUnt-read-only. Again, if the kernel is alive and stable, this will
remount all filesystems read-only, allowing them to safely clean up in
the process. The action carries down to sub-filesystem layers like
dmcrypt as well.
Note that this is an emergency remount-read-only, so it's a bit more
forceful regarding open files that would block an ordinary remount-
readonly. As such, consider the system unusable after doing an alt-srq-
u, and shutdown or reboot immediately.
b=reBoot. This forces the kernel to do an immediate reboot, without
syncing or remounting, etc. Thus the s-u- first, to sync and remount.
Besides being a bit safer than a hard reset, since when it works it
allows the system to sync and cleanup the filesystems before the reboot,
this also serves as a crude but effective method of finding out just how
severely the system was locked up. If the sync and remount steps light
up your storage I/O activity LED, you know the kernel considered itself
in pretty good shape, even if userspace was lost and there was no display
at all. If there's no response to them but the reboot step works, you
know the kernel was still alive enough to respond, but either there
wasn't anything dirty to write out, or more likely, the kernel believed
itself to be corrupted, and thus didn't trust its ability to write to
permanent storage without risking scribbling on other parts of the device
(other files, perhaps even other partitions). And of course if none of
them work and you /do/ have to do a hard reboot, then you know the kernel
itself was dead, at least to the point it could no longer respond at all
to magic srq.
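The same requests can also be issued without the keyboard, by writing the
key letter to /proc/sysrq-trigger as root (handy over SSH when the
console is wedged but the kernel still answers). A hedged sketch: the
function name and the DRY_RUN switch are illustrative, and with DRY_RUN=0
the last step reboots the machine immediately.

```shell
#!/bin/sh
# Emergency sync / remount read-only / reboot via /proc/sysrq-trigger.
# Defaults to a dry run that only prints what it would do.
sysrq_sub() {
    for key in s u b; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            printf 'would write %s to /proc/sysrq-trigger\n' "$key"
        else
            printf '%s' "$key" > /proc/sysrq-trigger
            sleep 2   # give sync/remount a moment before the next step
        fi
    done
}
```

Calling sysrq_sub with no settings prints the three steps; only
"DRY_RUN=0 sysrq_sub" (as root) actually performs them.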
As to the comment... I'm running plasma/kde5 on gentoo, here, but I'm
running upstream-kde's live-git version, available via the gentoo/kde
overlay. Some weeks ago, for a period, something wasn't working, and
every time I left the system alone long enough to lock the screen and
power-down the monitors, when I came back the system would be crashed.
With a bit of experimentation, I discovered that it would stay running as
long as I didn't let the monitors power off automatically (I could power
them down manually, tho), so for awhile, I was running xset -dpms after
every X/plasma restart (I start X/plasma using startx from a text login
and don't use a *DM), to keep plasma from powering down the graphics
adapter, tho it could and did still run the screenlocker.
Since then, they fixed whatever it was and I can let the power-downs
happen normally. I don't believe the bug made it to a release, tho
because I'm following live-git I'm not tracking the releases closely and
could be mistaken.
You mentioned arch, which IIRC is pretty close to upstream's release
cycle, so it's just possible that if this /did/ hit a release, and you're
running a new enough kde/plasma, the problem you're seeing may be related
to what I was experiencing. Tho I doubt it since as I said it was only a
short period, and I don't think the defective code made it into a release.
FWIW, tho, I'm running Radeon Turks graphics (hd6670, IIRC) with triple
monitor and the native freedomware kernel/mesa/xorg driver, not fglrx or
whatever the proprietary thing is called. If you're running Radeon, with
the freedomware driver, especially if also running multi-monitor and the
absolute latest plasma, you might try either downgrading a version to see
if the problem goes away, or doing the xset -dpms thing I was doing,
temporarily. It's just possible it'll help since your problem seems
similarly to be triggering when you're away from the machine, but your
problem does seem a bit different than mine (mine was a consistent
crash), and I don't believe mine made release code anyway, so it's likely
the similarity is just coincidence.
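For anyone wanting to try the workaround described above: the xset option
that disables monitor power management is spelled -dpms (DPMS = Display
Power Management Signaling), and +dpms turns it back on. A minimal
sketch, assuming an X session started via startx with xset installed.
The helper name is made up for illustration.

```shell
#!/bin/sh
# Build the xset command line for turning DPMS off or on.
# dpms_cmd off -> "xset -dpms"; dpms_cmd on -> "xset +dpms"
dpms_cmd() {
    case "$1" in
        off) echo "xset -dpms" ;;
        on)  echo "xset +dpms" ;;
        *)   echo "usage: dpms_cmd on|off" >&2; return 1 ;;
    esac
}

# Typical use from ~/.xinitrc or a terminal inside the X session:
#   $(dpms_cmd off)    # monitors stay powered
#   xset q             # the DPMS section shows the current state
```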
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: system locked up with btrfs-transaction consuming 100% CPU
2016-08-09 21:32 ` Duncan
@ 2016-08-09 22:20 ` Dave T
2016-08-10 11:39 ` Austin S. Hemmelgarn
0 siblings, 1 reply; 5+ messages in thread
From: Dave T @ 2016-08-09 22:20 UTC (permalink / raw)
To: Duncan; +Cc: linux-btrfs
Thank you for the info, Duncan.
I will use Alt-sysrq-s alt-sysrq-u alt-sysrq-b. This is the best
description / recommendation I've read on the subject. I had read
about these special key sequences before but I could never remember
them and I didn't fully understand what they did. Now you have given
me the understanding as well as an easy-to-remember method. I'll use
it.
I launch KDE the same way you do (no DM). I also run a triple monitor
setup, but I am using an nvidia GTX 1070 (and proprietary drivers),
for the time being.
My system does not have any issues when the monitors go to sleep. That
happens many times a day as I have a short timeout set.
I am very concerned about this primary problem (or problems) and I
hope I can find some understanding of what is going on. BTRFS has
worked well for me since 2012. While that's fantastic, it also means I
haven't had to troubleshoot it in the past. Now (because of 4 years of
problem-free operation) I'm using it on a critical production system.
I have backups, but I cannot allow these problems to go unresolved.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: system locked up with btrfs-transaction consuming 100% CPU
2016-08-09 18:07 system locked up with btrfs-transaction consuming 100% CPU Dave T
2016-08-09 21:32 ` Duncan
@ 2016-08-09 22:54 ` Dave T
1 sibling, 0 replies; 5+ messages in thread
From: Dave T @ 2016-08-09 22:54 UTC (permalink / raw)
To: linux-btrfs
The original problem from 2 days ago just happened again. I ran btrfs
rescue zero-log (again) and the root filesystem mounted but it was
read-only on first boot. I rebooted again and everything seems normal.
But clearly there is a problem that needs to be resolved.
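Since btrfs rescue zero-log discards the log tree (and with it whatever
was fsynced since the last transaction commit), and btrfs check --repair
rewrites metadata, a more cautious order is to look before changing
anything. A sketch under assumptions: the device path is the
/dev/mapper/cryptroot from the earlier message, the commands are run from
a live environment with the filesystem unmounted, and the helper name is
illustrative. It only lists the steps and leaves --repair out on purpose,
as a last resort.

```shell
#!/bin/sh
# Print a cautious recovery order for a btrfs filesystem whose log
# replay fails. Nothing is executed; the steps are only listed.
#   1. read-only check: inspects, changes nothing
#   2. zero-log: only if the mount error is the log replay failure
#   3. read-only mount: confirm the data is reachable, refresh backups
recovery_steps() {
    dev=${1:-/dev/mapper/cryptroot}
    printf '%s\n' \
        "btrfs check --readonly $dev" \
        "btrfs rescue zero-log $dev" \
        "mount -o ro $dev /mnt"
}
```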
Problem:
The root file system becomes read-only during normal usage. See copied
message below for more information about the error.
I'm happy to provide more info upon request. I appreciate any help. Is
this btrfs? A bad disk? Something else?
Linux x99 4.6.4-1-ARCH #1 SMP PREEMPT Mon Jul 11 19:12:32 CEST 2016
x86_64 GNU/Linux
btrfs-progs v4.6.1
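To help separate "btrfs bug" from "bad disk", a few non-destructive
health checks could be run and their output posted. A sketch, assuming
the filesystem is mounted at /, the NVMe controller is /dev/nvme0 (an
assumption; check lsblk), and smartmontools is installed with NVMe
support. The helper name is made up for illustration.

```shell
#!/bin/sh
# List health checks for the filesystem and the underlying drive.
#   btrfs device stats : per-device read/write/corruption error counters
#   btrfs scrub -B -d  : verify checksums in the foreground, per-device stats
#   smartctl -a        : the drive's own health and error log
health_check_cmds() {
    mnt=${1:-/}
    nvme=${2:-/dev/nvme0}
    printf '%s\n' \
        "btrfs device stats $mnt" \
        "btrfs scrub start -B -d $mnt" \
        "smartctl -a $nvme"
}
```

Non-zero counters from device stats or checksum errors from scrub would
point toward the storage stack; clean results would point back toward
the filesystem code or something else in the kernel.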
On Tue, Aug 9, 2016 at 2:07 PM, Dave T <davestechshop@gmail.com> wrote:
> My system locked up with btrfs-transaction consuming 100% CPU and NMI
> watchdog reporting BUG: soft lockup with btrfs-transaction:314.
>
> This comes 2 days after a serious event involving BTRFS where my
> system would not mount the root fs. (I gave details in an email to the
> list two days ago and copied again below.)
>
> Here are full details of today's "bug" (or whatever it was).
>
> When I left work last night I left my system running and I locked the
> session. The only things open were KDE Plasma, some terminal windows
> and some plain text documents in Kate editor. No real work was running
> on the local machine.
>
> This morning I came to work and noticed that my computer was slightly
> warm and the fans were running at higher than normal RPM.
>
> I logged in and opened top in an existing terminal. I saw that
> btrfs-transaction was consuming 100% of a CPU core and kworker was
> consuming 100% of another CPU core.
>
> I tried to run a command (to view logs) in another terminal window,
> but the system became unresponsive. I was able to switch to another
> virtual console, but it was very slow. I took photos with my phone.
> See link below for two images (top and virtual console):
>
> http://imgur.com/a/fT1RV
>
> These photos show what I reported above:
> * btrfs-transaction consuming 100% CPU
> * NMI watchdog reporting BUG: soft lockup with btrfs-transaction:314
>
> I hard reset my system, expecting the worst, but it rebooted normally.
> journalctl -xb -p3 showed no entries.
>
> Obviously I have a serious problem. However, I have no clue about what
> the problem might be (except that it seemingly involves btrfs). What
> other information can I provide?
>
> On Sun, Aug 7, 2016 at 6:44 PM, Dave <davestechshop@gmail.com> wrote:
>> I am running Arch Linux on a system with full disk encryption and the
>> storage is a Samsung 950 Pro NVMe drive (512 GB). The computer is a
>> couple months old. No bad behavior until now. (I'm only using 21 GB of
>> the 512 GB of space on the disk.)
>>
>> btrfs-progs v4.5.1
>>
>> Today I was using my system normally and browsing the web. Firefox
>> stopped responding suddenly and for no apparent reason. Then (KDE)
>> Plasma stopped responding. I could not log out of KDE.
>>
>> I killed my user session (pkill -u me), then I tried to startx. At
>> that point I noticed my root filesystem was read-only.
>>
>> As a first step, I rebooted. That didn't help anything. I tried
>> rebooting several more times -- no change.
>>
>> The root filesystem (btrfs) would not mount. (See error below.) I
>> booted into a LiveUSB environment and ran this command:
>>
>> cryptsetup open --type luks /dev/xxx cryptroot
>>
>> It opens. Then I ran:
>>
>> mount -t btrfs -o
>> noatime,nodiratime,ssd,compress=lzo,defaults,space_cache,subvolid=257
>> /dev/mapper/cryptroot /mnt
>>
>> The error message is shown here:
>>
>> [ 2300.967048] BTRFS info (device dm-0): use ssd allocation scheme
>> [ 2300.967058] BTRFS info (device dm-0): use lzo compression
>> [ 2300.967066] BTRFS info (device dm-0): disk space caching is enabled
>> [ 2300.967069] BTRFS: has skinny extents
>> [ 2300.995393] BTRFS: error (device dm-0) in
>> btrfs_replay_log:2413: errno=-22 unknown (Failed to recover log tree)
>> [ 2300.997617] BTRFS info (device dm-0): delayed_refs has NO entry
>> [ 2300.997673] BTRFS error (device dm-0): cleaner transaction
>> attach returned -30
>> [ 2301.035405] BTRFS: open_ctree failed
>>
>> It is exactly the same error I saw when trying to boot normally as
>> mentioned above.
>>
>> Based on these two links:
>>
>>> https://btrfs.wiki.kernel.org/index.php/Problem_FAQ
>>> https://btrfs.wiki.kernel.org/index.php/Btrfs-zero-log
>>
>> I decided to take a chance on running this command:
>>
>> btrfs rescue zero-log
>>
>> That worked and I can mount the filesystem.
>>
>> I ran btrfs check --repair. Here is the output:
>>
>> root@broken / # umount /mnt
>> root@broken / # btrfs check --repair /dev/mapper/cryptroot
>> enabling repair mode
>> Checking filesystem on /dev/mapper/cryptroot
>> checking extents
>> bad metadata [292414476288, 292414492672) crossing stripe boundary
>> bad metadata [292414541824, 292414558208) crossing stripe boundary
>> bad metadata [292414672896, 292414689280) crossing stripe boundary
>> bad metadata [292414869504, 292414885888) crossing stripe boundary
>> bad metadata [292415000576, 292415016960) crossing stripe boundary
>> bad metadata [292415066112, 292415082496) crossing stripe boundary
>> bad metadata [292415131648, 292415148032) crossing stripe boundary
>> bad metadata [292415262720, 292415279104) crossing stripe boundary
>> bad metadata [292415328256, 292415344640) crossing stripe boundary
>> bad metadata [292415393792, 292415410176) crossing stripe boundary
>> repaired damaged extent references
>> Fixed 0 roots.
>> checking free space cache
>> cache and super generation don't match, space cache will be invalidated
>> checking fs roots
>> checking csums
>> checking root refs
>> checking quota groups
>> Ignoring qgroup relation key 258
>> Ignoring qgroup relation key 263
>> Ignoring qgroup relation key 71776119061217538
>> Ignoring qgroup relation key 71776119061217543
>> Counts for qgroup id: 257 are different
>> our: referenced 10412273664 referenced compressed 10412273664
>> disk: referenced 10411311104 referenced compressed 10411311104
>> diff: referenced 962560 referenced compressed 962560
>> our: exclusive 10412273664 exclusive compressed 10412273664
>> disk: exclusive 10412273664 exclusive compressed 10412273664
>> found 21570773057 bytes used err is 0
>> total csum bytes: 19563456
>> total tree bytes: 403767296
>> total fs tree bytes: 349667328
>> total extent tree bytes: 27328512
>> btree space waste bytes: 66313360
>> file data blocks allocated: 39882014720
>> referenced 28043988992
>> extent buffer leak: start 20987904 len 16384
>> extent buffer leak: start 292688068608 len 16384
>> extent buffer leak: start 60915712 len 16384
>> extent buffer leak: start 29569581056 len 16384
>> extent buffer leak: start 29569597440 len 16384
>> extent buffer leak: start 292412063744 len 16384
>> extent buffer leak: start 292405870592 len 16384
>> extent buffer leak: start 292405936128 len 16384
>> extent buffer leak: start 292413964288 len 16384
>>
>> Then I checked dmesg and saw this error information:
>>
>> [ 4925.562422] BTRFS info (device dm-0): use ssd allocation scheme
>> [ 4925.562432] BTRFS info (device dm-0): use lzo compression
>> [ 4925.562440] BTRFS info (device dm-0): disk space caching is enabled
>> [ 4925.562444] BTRFS: has skinny extents
>> [ 4925.578705] BTRFS error (device dm-0): qgroup generation
>> mismatch, marked as inconsistent
>> [ 4925.584033] BTRFS: checking UUID tree
>>
>> What should I do next? I'm a simple user.
>>
>> I already ran memtest86+ overnight using 8 CPU cores in parallel (so
>> it was a very thorough memory test). There were 0 RAM errors.
>>
>> I previously used btrfs since 2012 with no issues. I am concerned
>> about the present issue because I do not understand the cause.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: system locked up with btrfs-transaction consuming 100% CPU
2016-08-09 22:20 ` Dave T
@ 2016-08-10 11:39 ` Austin S. Hemmelgarn
0 siblings, 0 replies; 5+ messages in thread
From: Austin S. Hemmelgarn @ 2016-08-10 11:39 UTC (permalink / raw)
To: Dave T, linux-btrfs
On 2016-08-09 18:20, Dave T wrote:
> Thank you for the info, Duncan.
>
> I will use Alt-sysrq-s alt-sysrq-u alt-sysrq-b. This is the best
> description / recommendation I've read on the subject. I had read
> about these special key sequences before but I could never remember
> them and I didn't fully understand what they did. Now you have given
> me the understanding as well as an easy-to-remember method. I'll use
> it.
The other two which you may find useful are alt-sysrq-o, which powers
off the system (like 'b', it does so immediately, so you should still
sync and remount before using it), and alt-sysrq-c, which will
immediately trigger a kernel panic (and thus force a crash dump if you
have them set up).
As for the other three:
'r' takes the keyboard out of raw mode (the kernel docs call this
"unraw"). This is generally only needed if you've been using an old
version of X or something like svgalib or directfb, it crashed, and you
can't get the keyboard to work on the terminal again. I normally don't
use this simply because it isn't needed if you're running in text mode
or have a new enough version of X.
'e' and 'i' respectively send SIGTERM and SIGKILL to all userspace
processes except init. These are generally recommended before syncing
because, if processes are still writing to a filesystem, the sync may
not flush everything out to disk. Most things will clean up properly
when sent SIGTERM, and the few stragglers that don't catch it (or get
stuck during their cleanup) will get killed by SIGKILL regardless.
It's also worth pointing out that many RPM based distributions (at least
RHEL, CentOS, and Fedora, and I think SLES and openSUSE as well) disable
some or all of the SysRq combinations (they technically are a security
issue, but if someone has console access to your system, you probably
have much bigger issues than sysrq to deal with).
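Whether a given key combination is honoured can be read from
/proc/sys/kernel/sysrq: 0 disables everything, 1 enables everything, and
larger values are a bitmask. A sketch decoding the bits relevant to s, u
and b, using the values from the kernel's sysrq documentation (16 = sync,
32 = remount read-only, 128 = reboot/poweroff). The function name is
illustrative.

```shell
#!/bin/sh
# sysrq_allows MASK KEY -> prints "yes" or "no".
# MASK is the content of /proc/sys/kernel/sysrq, KEY is one of s, u, b.
sysrq_allows() {
    mask=$1
    case "$2" in
        s) bit=16 ;;    # sync
        u) bit=32 ;;    # remount read-only
        b) bit=128 ;;   # reboot / poweroff
        *) echo "unknown key: $2" >&2; return 2 ;;
    esac
    # 1 means "everything allowed"; otherwise test the individual bit.
    if [ "$mask" -eq 1 ] || [ $(( mask & bit )) -ne 0 ]; then
        echo yes
    else
        echo no
    fi
}

# Example for the running system:
#   for k in s u b; do sysrq_allows "$(cat /proc/sys/kernel/sysrq)" "$k"; done
```

If the answer is "no", running "sysctl kernel.sysrq=1" as root enables
all functions until the next reboot.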
>
> I launch KDE the same way you do (no DM). I also run a triple monitor
> setup, but I am using an nvidia GTX 1070 (and proprietary drivers),
> for the time being.
This is potentially going to sound like an odd suggestion, but have you
tried running with the proprietary drivers blacklisted? NVIDIA's
drivers are generally good citizens, but with any proprietary driver
involved, there's considerably less certainty that everything else in
the kernel is working like it should. I don't personally have much
experience with the NVIDIA proprietary drivers (I have a system with a
Quadro K620, but it actually gets better overall performance when I use
the in-kernel open source drivers or even when I just use it as a
framebuffer and push the rendering to the CPU than it does with the
official NVIDIA drivers, so I just don't use them), but I have had
issues similar to what you are seeing with other kernel subsystems when
using the proprietary AMD drivers on other systems.
>
> My system does not have any issues when the monitors go to sleep. That
> happens many times a day as I have a short timeout set.
>
> I am very concerned about this primary problem (or problems) and I
> hope I can find some understanding of what is going on. BTRFS has
> worked well for me since 2012. While that's fantastic, it also means I
> haven't had to troubleshoot it in the past. Now (because of 4 years of
> problem-free operation) I'm using it on a critical production system.
> I have backups, but I cannot allow these problems to go unresolved.
>
> On Tue, Aug 9, 2016 at 5:32 PM, Duncan <1i5t5.duncan@cox.net> wrote:
>> Dave T posted on Tue, 09 Aug 2016 14:07:46 -0400 as excerpted:
>>
>>> I hard reset my system, expecting the worst, but it rebooted normally.
>>> journalctl -xb -p3 showed no entries.
>>
>> I don't have any suggestions for your primary problem, tho I do have a
>> comment down below, but I do have a suggestion regarding your "hard
>> reset".
>>
>> Consider doing some reading on "magic sysrequest", aka sysrq aka srq.
>>
>> $KERNDIR/Documentation/sysrq.txt , and there's lots of googlable articles
>> about it as well.
>>
>> Basically, when you'd otherwise do a hard reset, try a series of triple-
>> key chords, alt-sysrq-<otherkey> first. (Sysrq is printscreen, if alt
>> isn't pressed with it, so alt-sysrq-thirdkey.)
>>
>> The longer form of the emergency sequence is reisub -- you can read what
>> the r-e-i keys do in the documentation -- but from my own experience, I
>> find when the system's in bad enough shape I need to do an emergency
>> reboot, these keys don't do much for me, while the last three, sub, often
>> (but not always) do, and they're much easier to remember, so...
>>
>> Alt-sysrq-s alt-sysrq-u alt-sysrq-b
>>
>> s=Sync. If the kernel is still alive and believes it's still stable
>> enough to write to permanent storage without risking writing somewhere it
>> shouldn't, this will force all write-cached "dirty" data to be written
>> out.
>>
>> You can safely do an alt-srq-s at any time, and continue working, as it
>> forces cached writes to be written out, but doesn't otherwise interfere
>> with the running system. As such, alt-srq-s is a useful sequence to use
>> right before you do anything you suspect /might/ crash the system, like
>> starting X with a new graphics driver.
>>
>> u=remoUnt-read-only. Again, if the kernel is alive and stable, this will
>> remount all filesystems read-only, allowing them to safely clean up in
>> the process. The action carries down to sub-filesystem layers like
>> dmcrypt as well.
>>
>> Note that this is an emergency remount-read-only, so it's a bit more
>> forceful regarding open files that would block an ordinary remount-
>> readonly. As such, consider the system unusable after doing an alt-srq-
>> u, and shutdown or reboot immediately.
>>
>> b=reBoot. This forces the kernel to do an immediate reboot, without
>> syncing or remounting, etc. Thus the s-u- first, to sync and remount.
>>
>>
>> Besides being a bit safer than a hard reset, since when it works it
>> allows the system to sync and cleanup the filesystems before the reboot,
>> this also serves as a crude but effective method of finding out just how
>> severely the system was locked up. If the sync and remount steps light
>> up your storage I/O activity LED, you know the kernel considered itself
>> in pretty good shape, even if userspace was lost and there was no display
>> at all. If there's no response to them but the reboot step works, you
>> know the kernel was still alive enough to respond, but either there
>> wasn't anything dirty to write out, or more likely, the kernel believed
>> itself to be corrupted, and thus didn't trust its ability to write to
>> permanent storage without risking scribbling on other parts of the device
>> (other files, perhaps even other partitions). And of course if none of
>> them work and you /do/ have to do a hard reboot, then you know the kernel
>> itself was dead, at least to the point it could no longer respond at all
>> to magic srq.
>>
>>
>> As to the comment... I'm running plasma/kde5 on gentoo, here, but I'm
>> running upstream-kde's live-git version, available via the gentoo/kde
>> overlay. Some weeks ago, for a period, something wasn't working, and
>> every time I left the system alone long enough to lock the screen and
>> power-down the monitors, when I came back the system would be crashed.
>> With a bit of experimentation, I discovered that it would stay running as
>> long as I didn't let the monitors power off automatically (I could power
>> them down manually, tho), so for a while, I was running xset -dpms after
>> every X/plasma restart (I start X/plasma using startx from a text login
>> and don't use a *DM), to keep plasma from powering down the graphics
>> adapter, tho it could and did still run the screenlocker.
>>
>> Since then, they fixed whatever it was and I can let the power-downs
>> happen normally. I don't believe the bug made it to a release, tho
>> because I'm following live-git I'm not tracking the releases closely and
>> could be mistaken.
>>
>> You mentioned arch, which IIRC is pretty close to upstream's release
>> cycle, so it's just possible that if this /did/ hit a release, and you're
>> running a new enough kde/plasma, the problem you're seeing may be related
>> to what I was experiencing. Tho I doubt it since as I said it was only a
>> short period, and I don't think the defective code made it into a release.
>>
>> FWIW, tho, I'm running Radeon Turks graphics (hd6670, IIRC) with triple
>> monitor and the native freedomware kernel/mesa/xorg driver, not fglrx or
>> whatever the proprietary thing is called. If you're running Radeon, with
>> the freedomware driver, especially if also running multi-monitor and the
>> absolute latest plasma, you might try either downgrading a version to see
>> if the problem goes away, or doing the xset -dpms thing I was doing,
>> temporarily. It's just possible it'll help since your problem seems
>> similarly to be triggering when you're away from the machine, but your
>> problem does seem a bit different than mine (mine was a consistent
>> crash), and I don't believe mine made release code anyway, so it's likely
>> the similarity is just coincidence.
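
For cases where the keyboard chords are not available (a remote session,
for instance), the same s-u-b sequence described in the quoted message
can be sent by writing the command letters to /proc/sysrq-trigger as
root. Below is a rough Python sketch under those assumptions. The
function name send_sysrq_sequence and its delay parameter are
hypothetical, and trigger_path is a parameter only so the function can
be pointed at a scratch file for a dry run.

```python
import time


def send_sysrq_sequence(keys="sub", trigger_path="/proc/sysrq-trigger", delay=2.0):
    """Write each SysRq command letter to the trigger file, one per write.

    s = sync, u = emergency remount read-only, b = immediate reboot.
    Needs root when pointed at the real /proc/sysrq-trigger, and with the
    default keys it WILL reboot the machine.
    """
    sent = []
    for index, key in enumerate(keys):
        # The kernel acts on the letter as soon as it is written.
        with open(trigger_path, "w") as trigger:
            trigger.write(key)
        sent.append(key)
        if index < len(keys) - 1:
            # Give sync / remount some time before the next step.
            time.sleep(delay)
    return sent
```

Calling send_sysrq_sequence("s") alone is the scripted equivalent of a
lone alt-sysrq-s, which is the only one of the three that is safe to use
on a system you intend to keep running.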
end of thread, other threads:[~2016-08-10 18:49 UTC | newest]
Thread overview: 5+ messages
2016-08-09 18:07 system locked up with btrfs-transaction consuming 100% CPU Dave T
2016-08-09 21:32 ` Duncan
2016-08-09 22:20 ` Dave T
2016-08-10 11:39 ` Austin S. Hemmelgarn
2016-08-09 22:54 ` Dave T