* [4.4.1] btrfs-transacti frequent high CPU usage despite little fragmentation
@ 2016-03-16 9:45 Ole Langbehn
2016-03-17 10:51 ` Duncan
0 siblings, 1 reply; 5+ messages in thread
From: Ole Langbehn @ 2016-03-16 9:45 UTC (permalink / raw)
To: linux-btrfs
Hi,
on my box, frequently, mostly while using firefox, any process doing
disk IO freezes while btrfs-transacti has a spike in CPU usage for more
than a minute.
I know about btrfs' fragmentation issue, but have a couple of questions:
* While btrfs-transacti is spiking, can I trace which files are the
culprit somehow?
* On my setup, with measured fragmentation, are the CPU spike durations
and freezes normal?
* Can I alleviate the situation by anything except defragmentation?
Any insight is appreciated.
Details:
I have a 1TB SSD with a large btrfs partition:
# btrfs filesystem usage /
Overall:
Device size: 915.32GiB
Device allocated: 915.02GiB
Device unallocated: 306.00MiB
Device missing: 0.00B
Used: 152.90GiB
Free (estimated): 751.96GiB (min: 751.96GiB)
Data ratio: 1.00
Metadata ratio: 1.00
Global reserve: 512.00MiB (used: 0.00B)
Data,single: Size:901.01GiB, Used:149.35GiB
/dev/sda2 901.01GiB
Metadata,single: Size:14.01GiB, Used:3.55GiB
/dev/sda2 14.01GiB
System,single: Size:4.00MiB, Used:128.00KiB
/dev/sda2 4.00MiB
Unallocated:
/dev/sda2 306.00MiB
I've done the obvious and defragmented files. Some files went from 10k+
extents down to still more than 100. But the problem persisted or came
back very quickly. Just now I re-ran defragmentation with the following
results (only showing files with more than 100 extents before
defragmentation):
extents before / extents after / anonymized path
103 / 1 /home/foo/.mozilla/firefox/foo.default/formhistory.sqlite:
133 / 1
/home/foo/.thunderbird/foo.default/ImapMail/imap.example.org/ml-btrfs:
155 / 1 /var/log/messages:
158 / 30 /home/foo/.thunderbird/foo.default/ImapMail/mail.example.org/INBOX:
160 / 32 /home/foo/.thunderbird/foo.default/calendar-data/cache.sqlite:
255 / 255 /var/lib/docker/devicemapper/devicemapper/data:
550 / 1 /home/foo/.cache/chromium/Default/Cache/data_1:
627 / 1 /home/foo/.cache/chromium/Default/Cache/data_2:
1738 / 25 /home/foo/.cache/chromium/Default/Cache/data_3:
1764 / 77 /home/foo/.mozilla/firefox/foo.default/places.sqlite:
4414 / 284 /home/foo/.digikam/thumbnails-digikam.db:
6576 / 3 /home/foo/.digikam/digikam4.db:
So fragmentation came back quickly, and the firefox places.sqlite file
could explain why the system freezes while browsing.
BTW: I did a VACUUM on the sqlite db and afterwards it had 1 extent.
Expected, just saying that vacuuming seems to be a good measure for
defragmenting sqlite databases.
I am using snapper and have about 40 snapshots going back for some
months. Those are read only. Could that have any effect?
Cheers,
Ole
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [4.4.1] btrfs-transacti frequent high CPU usage despite little fragmentation
2016-03-16 9:45 [4.4.1] btrfs-transacti frequent high CPU usage despite little fragmentation Ole Langbehn
@ 2016-03-17 10:51 ` Duncan
2016-03-18 9:33 ` Ole Langbehn
0 siblings, 1 reply; 5+ messages in thread
From: Duncan @ 2016-03-17 10:51 UTC (permalink / raw)
To: linux-btrfs
Ole Langbehn posted on Wed, 16 Mar 2016 10:45:28 +0100 as excerpted:
> Hi,
>
> on my box, frequently, mostly while using firefox, any process doing
> disk IO freezes while btrfs-transacti has a spike in CPU usage for more
> than a minute.
>
> I know about btrfs' fragmentation issue, but have a couple of questions:
>
> * While btrfs-transacti is spiking, can I trace which files are the
> culprit somehow?
> * On my setup, with measured fragmentation, are the CPU spike durations
> and freezes normal?
> * Can I alleviate the situation by anything except defragmentation?
>
> Any insight is appreciated.
>
> Details:
>
> I have a 1TB SSD with a large btrfs partition:
>
> # btrfs filesystem usage /
> Overall:
> Device size: 915.32GiB
> Device allocated: 915.02GiB
> Device unallocated: 306.00MiB
> Device missing: 0.00B
> Used: 152.90GiB
> Free (estimated): 751.96GiB (min: 751.96GiB)
> Data ratio: 1.00
> Metadata ratio: 1.00
> Global reserve: 512.00MiB (used: 0.00B)
>
> Data,single: Size:901.01GiB, Used:149.35GiB
> /dev/sda2 901.01GiB
>
> Metadata,single: Size:14.01GiB, Used:3.55GiB
> /dev/sda2 14.01GiB
>
> System,single: Size:4.00MiB, Used:128.00KiB
> /dev/sda2 4.00MiB
>
> Unallocated:
> /dev/sda2 306.00MiB
>
>
> I've done the obvious and defragmented files. Some files went from 10k+
> extents down to still more than 100. But the problem persisted or came
> back very quickly. Just now I re-ran defragmentation with the following
> results (only showing files with more than 100 extents before
> defragmentation):
>
> extents before / extents after / anonymized path
> 103 / 1 /home/foo/.mozilla/firefox/foo.default/formhistory.sqlite:
> 133 / 1
> /home/foo/.thunderbird/foo.default/ImapMail/imap.example.org/ml-btrfs:
> 155 / 1 /var/log/messages:
> 158 / 30
> /home/foo/.thunderbird/foo.default/ImapMail/mail.example.org/INBOX:
> 160 / 32 /home/foo/.thunderbird/foo.default/calendar-data/cache.sqlite:
> 255 / 255 /var/lib/docker/devicemapper/devicemapper/data:
> 550 / 1 /home/foo/.cache/chromium/Default/Cache/data_1:
> 627 / 1 /home/foo/.cache/chromium/Default/Cache/data_2:
> 1738 / 25 /home/foo/.cache/chromium/Default/Cache/data_3:
> 1764 / 77 /home/foo/.mozilla/firefox/foo.default/places.sqlite:
> 4414 / 284 /home/foo/.digikam/thumbnails-digikam.db:
> 6576 / 3 /home/foo/.digikam/digikam4.db:
>
> So fragmentation came back quickly, and the firefox places.sqlite file
> could explain why the system freezes while browsing.
Have you tried the autodefrag mount option, then defragging? That should
help keep rewritten files from fragmenting so heavily, at least. On
spinning rust it doesn't play so well with large (half-gig plus)
databases or VM images, but on ssds it should scale rather larger; on
fast SSDs I'd not expect problems until 1-2 GiB, possibly higher.
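If you want to try it, a sketch of enabling autodefrag (the fstab line
mirrors the /dev/sda2 layout shown above; adjust to your setup):

```
# /etc/fstab -- add autodefrag to the root filesystem's mount options
# (example line based on the /dev/sda2 layout quoted above):
/dev/sda2  /  btrfs  defaults,autodefrag  0  0
```

It can also be applied live with `mount -o remount,autodefrag /`, and
turned back off the same way if it causes trouble.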
For large dbs or VM images, too large for autodefrag to handle well, the
nocow attribute is the usual suggestion, but I'll skip the details on
that for now, as you may not need it with autodefrag on an ssd, unless
your database and VM files are several gig apiece.
> BTW: I did a VACUUM on the sqlite db and afterwards it had 1 extent.
> Expected, just saying that vacuuming seems to be a good measure for
> defragmenting sqlite databases.
I know the concept, but out of curiosity, what tool do you use for
that? I imagine my firefox sqlite dbs could use some vacuuming as well,
but don't have the foggiest idea how to go about it.
> I am using snapper and have about 40 snapshots going back for some
> months. Those are read only. Could that have any effect?
They could have some, but I don't expect it'd be much, not with only 40.
Other than autodefrag, and/or nocow on specific files (but research the
latter before you do it, there's some interaction with snapshots you need
to be aware of, and you can't just apply it to existing files and expect
it to work right), there's a couple other things that may help.
Of *most* importance, you really *really* need to do something about that
data chunk imbalance, and to a lesser extent that metadata chunk
imbalance, because your unallocated space is well under a gig (306 MiB),
with all that extra space, hundreds of gigs of it, locked up in unused or
only partially used chunks.
The subject says 4.4.1, but it's unclear whether that's your kernel
version or your btrfs-progs userspace version. If that's your userspace
version and you're running an old kernel, strongly consider upgrading to
the LTS kernel 4.1 or 4.4 series if possible, or at least the LTS series
before that, 3.18. Those or the latest couple current kernel series, 4.5
and 4.4, and 4.3 for the moment as 4.5 is /just/ out, are the recommended
and best supported versions.
I say this because before 3.17, the btrfs kernelspace could allocate its
own chunks, but didn't know how to free them, so one had to run balance
fairly frequently to free up all the empty chunks, and it looks like you
might have a bunch of empty chunks around.
With 3.17, the kernel learned how to delete entirely empty chunks, and
running a balance to clear them isn't necessary these days. But the
kernel still only knows how to delete entirely empty chunks, and it's
still possible over time, particularly with snapshots locking in place
file extents that might be keeping otherwise empty chunks from being
fully emptied and thus cleared by the kernel, for large imbalances to
occur.
Either way, large imbalances are what you have ATM. Copied from your post
as quoted above:
> Data,single: Size:901.01GiB, Used:149.35GiB
> /dev/sda2 901.01GiB
>
> Metadata,single: Size:14.01GiB, Used:3.55GiB
> /dev/sda2 14.01GiB
So 901 GiB of data chunks but under 150 GiB of it actually used. That's
nearly 750 GiB of free space tied up in empty or only partially filled
data chunks.
14 GiB of metadata chunks, but under 4 GiB reported used. That's about
10 GiB of metadata chunks that should be freeable (tho the half GiB of
global reserve comes from that metadata too but doesn't count as used, so
actual usage is a bit over 4 GiB and you may only free 9.5 GiB or so).
Try this:
btrfs balance start -dusage=0 -musage=0
That should go pretty fast whether it works or not, but it might not
work, if you don't actually have any entirely empty chunks. If you do,
it'll free them.
If that added some gigs to your unallocated total, good, as you're likely
to have difficulty balancing data chunks anyway, without that, because
data chunks are normally a gig or more in size and a new one has to be
allocated in order to rewrite the content of others to try to release
the unused space in the data chunks.
If it didn't do anything, as is likely if you're running a new kernel, it
means you didn't have any zero-usage chunks, which a new kernel /should/
clean up but might not in some cases.
Then start with metadata, and up the usage numbers, which are percentages,
like this:
btrfs balance start -musage=5
Then, if it works, up the number to 10, 20, etc. By the time you get to 50
or 70, you should have cleared several of those 9.5 or so potential gigs
and can stop. /Hopefully/ it'll let you do that with just the 300 MiB
free you have, if the 0-usage balance didn't help free several gigs. But
on that large a filesystem, the normally 256 MiB metadata chunks may be a
GiB, in which case you'd still run into trouble.
Once you have several gigs in unallocated, then try the same thing with
data:
btrfs balance start -dusage=5
And again, increase it in increments of 5 or 10% at a time, to 50 or
70%. With luck, you'll get most of that potential 750 GiB back into
unallocated.
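Those incremental runs can be scripted; a dry-run sketch that just
prints the suggested invocations in order (drop the echo, and run as
root against the mounted filesystem, to do it for real, stopping once
totals drop reasonably close to usage):

```shell
# Dry run: print the balance commands, stepping the usage filter up in
# increments; remove 'echo' to actually execute them (needs root and a
# mounted btrfs at /).
for pct in 5 10 20 30 50 70; do
    echo btrfs balance start -dusage="$pct" /
done
```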
When you're done, total data should be much closer to the 150-ish gigs
it's reporting as used, with most of that near 750 gigs spread from the
current 900+ total moved to unallocated, and total metadata much closer
to the about 4 gigs used, with 9 gigs or so of that spread moved to
unallocated.
If the 0-usage thing doesn't give you anything and you can't balance even
-musage=1, or don't get any space returned until you get high enough
to get an error, or if the metadata balance doesn't free enough space to
unallocated to let the balance -dusage= work, then things get a bit more
serious. In that case, you can try one of two things, either delete your
oldest snapshots to try and free up 100% of a few chunks so -dusage=0
will free them, or temporarily btrfs device add a second device of a few
gigs, a thumb drive can work, to give the balance somewhere to put the
new chunk it needs to write in order to free up old ones. Once you
have enough space free on the original device, you can btrfs device
delete the temporary one, to move all the chunks on it back to the main
device and delete it from the filesystem.
Second thing, consider tweaking your trim/discard policy, since you're on
ssd. It could well be erase block management that's hitting you, if you
haven't been doing regular trims or if the associated btrfs mount option
(discard) is set incorrectly for your device.
See the btrfs (5) manpage (not btrfs (8)!) or the wiki for the discard
mount option description, but the deal is that while most semi-recent ssds
handle trim/discard, only fairly recently was it made a command-queued
operation, and not even all recent ssds support it as command-queued.
Without that, a trim kills the command-queue and thus can dramatically
hurt performance. Which is why it's not the btrfs ssd default and why
it's not generally recommended for use with ssds, tho where the command
is queued it should be a good thing.
But without trim/discard of /some/ sort, your ssd will slow down over
time, when it no longer has a ready pool of unused erase blocks at hand
to put new and wear-level-transferred blocks into. Now mkfs.btrfs does
do a trim as part of the filesystem creation process, but after that...
After that, barring an ssd that command-queues the trim command so you
can add it to your mount options without affecting performance there, you
can run the fstrim command from time to time. Fstrim finds the unused
space in the filesystem and issues trim commands for it, thus zeroing it
out and telling the ssd firmware it can safely use those blocks for wear-
leveling and the like.
The recommendation is to put fstrim in a cron or systemd timer job,
executing it weekly or similar, preferably at a time when all those
unqueued trims won't affect your normal work.
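A sketch of the systemd timer variant (unit file names and paths are
examples; util-linux also ships a ready-made fstrim.timer on many
distros):

```
# /etc/systemd/system/fstrim.service
[Unit]
Description=Trim unused blocks on /

[Service]
Type=oneshot
ExecStart=/sbin/fstrim -v /

# /etc/systemd/system/fstrim.timer
[Unit]
Description=Weekly fstrim

[Timer]
OnCalendar=weekly
Persistent=true

[Install]
WantedBy=timers.target
```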
Meanwhile, note that if you run fstrim manually, it outputs all the empty
space it's trimming, but running it repeatedly will show the same
space every time, since it doesn't know what's already trimmed. That's
not a problem for the ssd, but it can confuse users who might think the
trim isn't working, since it trims the same thing every time.
So if you have trim in your mount options, try taking it out and see if
that helps. But if you're not doing it there, be sure to set up an fstrim
cron or systemd timer job to do it weekly or so.
Another strategy that some people use is to partition up most of the ssd,
but leave 20% or so of it unpartitioned, or partitioned but without a
filesystem if you prefer, thus giving the firmware that extra room to
play with. Once you have all those extra data and metadata chunks
removed, you can shrink the filesystem, then the partition it's on, and
let the ssd firmware have the now unpartitioned space. Only thing is I
don't know a tool to actually trim the now free space, and am not sure
whether btrfs resize does it or not, so you might have to quickly create
a new partition and filesystem in the space again but leave the
filesystem empty, then fstrim it (or just make the filesystem btrfs,
since mkfs.btrfs automatically does a trim if it detects an ssd where it
can) to let the firmware have it.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: [4.4.1] btrfs-transacti frequent high CPU usage despite little fragmentation
2016-03-17 10:51 ` Duncan
@ 2016-03-18 9:33 ` Ole Langbehn
2016-03-18 23:06 ` Duncan
0 siblings, 1 reply; 5+ messages in thread
From: Ole Langbehn @ 2016-03-18 9:33 UTC (permalink / raw)
To: linux-btrfs
Duncan,
thanks for your extensive answer.
On 17.03.2016 11:51, Duncan wrote:
> Ole Langbehn posted on Wed, 16 Mar 2016 10:45:28 +0100 as excerpted:
>
> Have you tried the autodefrag mount option, then defragging? That should
> help keep rewritten files from fragmenting so heavily, at least. On
> spinning rust it doesn't play so well with large (half-gig plus)
> databases or VM images, but on ssds it should scale rather larger; on
> fast SSDs I'd not expect problems until 1-2 GiB, possibly higher.
Since I do have some big VM images, I never tried autodefrag.
> For large dbs or VM images, too large for autodefrag to handle well, the
> nocow attribute is the usual suggestion, but I'll skip the details on
> that for now, as you may not need it with autodefrag on an ssd, unless
> your database and VM files are several gig apiece.
Since posting the original post, I experimented with setting the firefox
places.sqlite to nodatacow (on a new file). 1 extent since, seems to work.
>> BTW: I did a VACUUM on the sqlite db and afterwards it had 1 extent.
>> Expected, just saying that vacuuming seems to be a good measure for
>> defragmenting sqlite databases.
>
> I know the concept, but out of curiosity, what tool do you use for
> that? I imagine my firefox sqlite dbs could use some vacuuming as well,
> but don't have the foggiest idea how to go about it.
simple call of the command line interface, like with any other SQL DB:
# sqlite3 /path/to/db.sqlite "VACUUM;"
> Of *most* importance, you really *really* need to do something about that
> data chunk imbalance, and to a lesser extent that metadata chunk
> imbalance, because your unallocated space is well under a gig (306 MiB),
> with all that extra space, hundreds of gigs of it, locked up in unused or
> only partially used chunks.
I'm curious - why is that a bad thing?
> The subject says 4.4.1, but it's unclear whether that's your kernel
> version or your btrfs-progs userspace version. If that's your userspace
> version and you're running an old kernel, strongly consider upgrading to
> the LTS kernel 4.1 or 4.4 series if possible, or at least the LTS series
> before that, 3.18. Those or the latest couple current kernel series, 4.5
> and 4.4, and 4.3 for the moment as 4.5 is /just/ out, are the recommended
> and best supported versions.
# uname -r
4.4.1-gentoo
# btrfs --version
btrfs-progs v4.4.1
So, both 4.4.1 ;), but I meant userspace.
> Try this:
>
> btrfs balance start -dusage=0 -musage=0
Did this although I'm reasonably up to date kernel-wise. I am very sure
that the filesystem has never seen <3.18. Took some minutes, ended up with
# btrfs filesystem usage /
Overall:
Device size: 915.32GiB
Device allocated: 681.32GiB
Device unallocated: 234.00GiB
Device missing: 0.00B
Used: 153.80GiB
Free (estimated): 751.08GiB (min: 751.08GiB)
Data ratio: 1.00
Metadata ratio: 1.00
Global reserve: 512.00MiB (used: 0.00B)
Data,single: Size:667.31GiB, Used:150.22GiB
/dev/sda2 667.31GiB
Metadata,single: Size:14.01GiB, Used:3.58GiB
/dev/sda2 14.01GiB
System,single: Size:4.00MiB, Used:112.00KiB
/dev/sda2 4.00MiB
Unallocated:
/dev/sda2 234.00GiB
-> Helped with data, not with metadata.
> Then start with metadata, and up the usage numbers which are percentages,
> like this:
>
> btrfs balance start -musage=5
>
> Then if it works up the number to 10, 20, etc.
upped it to 70, relocated a total of 13 out of 685 chunks:
Metadata,single: Size:5.00GiB, Used:3.58GiB
/dev/sda2 5.00GiB
> Once you have several gigs in unallocated, then try the same thing with
> data:
>
> btrfs balance start -dusage=5
>
> And again, increase it in increments of 5 or 10% at a time, to 50 or
> 70%.
did
# btrfs balance start -dusage=70
straight away, took ages, regularly froze processes for minutes, after
about 8h status is:
# btrfs balance status /
Balance on '/' is paused
192 out of about 595 chunks balanced (194 considered), 68% left
# btrfs filesystem usage /
Overall:
Device size: 915.32GiB
Device allocated: 482.04GiB
Device unallocated: 433.28GiB
Device missing: 0.00B
Used: 154.36GiB
Free (estimated): 759.48GiB (min: 759.48GiB)
Data ratio: 1.00
Metadata ratio: 1.00
Global reserve: 512.00MiB (used: 0.00B)
Data,single: Size:477.01GiB, Used:150.80GiB
/dev/sda2 477.01GiB
Metadata,single: Size:5.00GiB, Used:3.56GiB
/dev/sda2 5.00GiB
System,single: Size:32.00MiB, Used:96.00KiB
/dev/sda2 32.00MiB
Unallocated:
/dev/sda2 433.28GiB
-> Looking good. Will proceed when I don't need the box to actually be
responsive.
> Second thing, consider tweaking your trim/discard policy [...]
>
> The recommendation is to put fstrim in a cron or systemd timer job,
> executing it weekly or similar, preferably at a time when all those
> unqueued trims won't affect your normal work.
I have it in cron.weekly, since the creation of the filesystem:
fstrim -v / >> $LOG
Cheers,
Ole
* Re: [4.4.1] btrfs-transacti frequent high CPU usage despite little fragmentation
2016-03-18 9:33 ` Ole Langbehn
@ 2016-03-18 23:06 ` Duncan
2016-03-19 20:31 ` Ole Langbehn
0 siblings, 1 reply; 5+ messages in thread
From: Duncan @ 2016-03-18 23:06 UTC (permalink / raw)
To: linux-btrfs
Ole Langbehn posted on Fri, 18 Mar 2016 10:33:46 +0100 as excerpted:
> Duncan,
>
> thanks for your extensive answer.
>
> On 17.03.2016 11:51, Duncan wrote:
>> Ole Langbehn posted on Wed, 16 Mar 2016 10:45:28 +0100 as excerpted:
>>
>> Have you tried the autodefrag mount option, then defragging? That
>> should help keep rewritten files from fragmenting so heavily, at least.
>> On spinning rust it doesn't play so well with large (half-gig plus)
>> databases or VM images, but on ssds it should scale rather larger; on
>> fast SSDs I'd not expect problems until 1-2 GiB, possibly higher.
>
> Since I do have some big VM images, I never tried autodefrag.
OK. Tho as you're on ssd you might consider /trying/ it. The big
problem with autodefrag and big VMs and DBs is that as the filesize gets
larger, it becomes more difficult for autodefrag to keep up with the
incoming stream of modifications, but ssds tend to be fast enough that
they can keep up for far longer, and it may be that you won't see a
noticeable issue. If you do, you can always turn the mount option back
off.
Also, nocow should mean autodefrag doesn't affect the file anyway, as it
won't be fragmenting due to the nocow. So if you have your really large
VMs and DBs set nocow, it's quite likely, particularly on ssd, that you
can set autodefrag and not see the performance problems with those large
files that's the reason it's normally not recommended for the large db/vm
use-case.
And like I said you can always turn it back off if necessary.
>> For large dbs or VM images, too large for autodefrag to handle well,
>> the nocow attribute is the usual suggestion, but I'll skip the details
>> on that for now, as you may not need it with autodefrag on an ssd,
>> unless your database and VM files are several gig apiece.
>
> Since posting the original post, I experimented with setting the firefox
> places.sqlite to nodatacow (on a new file). 1 extent since, seems to
> work.
Seems you are reasonably familiar with the nocow attribute drill, so I'll
just cover one remaining base, in case you missed it.
Nocow interacts with snapshots. Basically, snapshots turn nocow into
cow1, because they lock the existing version in place due to the
snapshot. First changes to a block after a snapshot, then, must be cow,
tho further changes to it after that remain nocow to the new in-place
location.
So nocow isn't fully nocow with snapshots, and fragmentation will slow
down, but not be eliminated. People doing regularly scheduled
snapshotting therefore often need to do less frequent but also regularly
scheduled (perhaps weekly or monthly, for multiple snapshots per day)
defrag of their nocow files.
Tho be aware that for performance reasons, defrag isn't snapshot aware
and will break reflinks to existing snapshots, thereby increasing
filesystem usage. The total effect on usage of course depends on how
much updating the nocow files get as well as snapshotting and defrag
frequency.
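For completeness, the usual nocow drill itself: the +C attribute only
takes effect on new/empty files, so the file has to be recreated. A
minimal sketch, shown here on a throwaway temp file since the real
thing must live on btrfs (the chattr attempt is allowed to fail on
other filesystems; on a real system DB would be something like
places.sqlite on the btrfs mount):

```shell
set -e
# Demonstration file standing in for a real database on btrfs:
DB=$(mktemp)
printf 'db contents\n' > "$DB"

mv "$DB" "$DB.old"                    # set the old copy aside
touch "$DB"                           # recreate it empty...
chattr +C "$DB" 2>/dev/null || true   # ...mark it nocow (sticks on btrfs only)
cat "$DB.old" > "$DB"                 # copy the data into the nocow file
rm -f "$DB.old"

cat "$DB"                             # -> db contents
```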
>>> BTW: I did a VACUUM on the sqlite db and afterwards it had 1 extent.
>>> Expected, just saying that vacuuming seems to be a good measure for
>>> defragmenting sqlite databases.
>>
>> I know the concept, but out of curiosity, what tool do you use for
>> that? I imagine my firefox sqlite dbs could use some vacuuming as
>> well, but don't have the foggiest idea how to go about it.
>
> simple call of the command line interface, like with any other SQL DB:
>
> # sqlite3 /path/to/db.sqlite "VACUUM;"
Cool. As far as I knew, sqlite was library only, no executable to invoke
in that manner. Shows how little I knew about sqlite. =:^) Thanks.
>> Of *most* importance, you really *really* need to do something about
>> that data chunk imbalance, and to a lesser extent that metadata chunk
>> imbalance, because your unallocated space is well under a gig (306
>> MiB), with all that extra space, hundreds of gigs of it, locked up in
>> unused or only partially used chunks.
>
> I'm curious - why is that a bad thing?
Btrfs allocates space in two stages, first to chunks of data or metadata
type (there's also system type but that's pretty much fixed size so once
the filesystem is created, no further system chunks are normally needed,
unless it's created as a single device filesystem and then a whole slew
of additional devices are added, or if the filesystem is massively resized
on the same device, of course), then from within those chunks to files
from data, and to metadata nodes from metadata, as necessary.
What can happen then, and what used to happen frequently before 3.17 (tho
much less frequently, it can still happen now), is that over time and with
use, the filesystem will allocate all available space as one type,
typically data chunks, and then run out of space in the other type of
chunk, typically metadata, and have no unallocated space from which to
allocate more. So you'll have lots of space left, but it'll be all tied
up in only partially used chunks of the one type and you'll be out of
space in the other type.
And by the time you actually start getting ENOSPC errors as a result of
the situation, there's often too little space left to create even the one
additional chunk necessary for a balance to write the data from other
chunks into, in order to combine some of the less used chunks into
fewer chunks at 100% usage (but for the last one, of course).
And you were already in a tight spot in that regard and may well have had
errors if you had simply tried an unfiltered balance, because data chunks
are typically 1 GiB in size (and can be up to 10 GiB in some circumstances
on large enough filesystems, tho I think the really large sizes require
multi-device), and you were down to 300-ish MiB of unallocated space, not
enough to create a new 1 GiB data chunk.
And considering the filesystem's near terabyte scale, to be down to under
a GiB of unallocated space is even more startling, particularly on newer
kernels where empty chunks are normally reclaimed automatically (tho as
the usage=0 balances reclaimed some space for you, obviously not all of
them had been reclaimed in your case).
That was what was alarming to me, and it /may/ have had something to do
with the high cpu and low speeds, tho indications were that you still had
enough space in both data and metadata that it shouldn't have been too
bad just yet. But it was potentially heading that way, if you didn't do
something, which is why I stressed it as I did. Getting out of such
situations once you're tightly jammed can be quite difficult and
inconvenient, tho you were lucky enough not to be that tightly jammed
just yet, only headed that way.
>> The subject says 4.4.1, but it's unclear whether that's your kernel
>> version or your btrfs-progs userspace version.
> # uname -r
> 4.4.1-gentoo
>
> # btrfs --version
> btrfs-progs v4.4.1
>
> So, both 4.4.1 ;)
=:^)
>> Try this:
>>
>> btrfs balance start -dusage=0 -musage=0
>
> Did this although I'm reasonably up to date kernel-wise. I am very sure
> that the filesystem has never seen <3.18. Took some minutes, ended up
> with
>
> # btrfs filesystem usage /
> Overall:
> Device size: 915.32GiB
> Device allocated: 681.32GiB
> Device unallocated: 234.00GiB
> Device missing: 0.00B
> Used: 153.80GiB
> Free (estimated): 751.08GiB (min: 751.08GiB)
> Data ratio: 1.00
> Metadata ratio: 1.00
> Global reserve: 512.00MiB (used: 0.00B)
>
> Data,single: Size:667.31GiB, Used:150.22GiB
> /dev/sda2 667.31GiB
>
> Metadata,single: Size:14.01GiB, Used:3.58GiB
> /dev/sda2 14.01GiB
>
> System,single: Size:4.00MiB, Used:112.00KiB
> /dev/sda2 4.00MiB
>
> Unallocated:
> /dev/sda2 234.00GiB
>
>
> -> Helped with data, not with metadata.
Yes, and most importantly, you're already out of the tight jam you were
headed into, now with a comfortable several hundred gigs of unallocated
space. =:^)
With that, the remaining specific hoops weren't strictly necessary.
In particular, I was afraid that wouldn't clear any chunks at all and
you'd still have under a GiB free, still too small to properly balance
data chunks, thus the suggestion to start with metadata and hoping it
worked.
>> Then start with metadata, and up the usage numbers which are
>> percentages,
>> like this:
>>
>> btrfs balance start -musage=5
>>
>> Then if it works up the number to 10, 20, etc.
>
> upped it to 70, relocated a total of 13 out of 685 chunks:
>
> Metadata,single: Size:5.00GiB, Used:3.58GiB
> /dev/sda2 5.00GiB
So you cleared a few more gigs to unallocated, as metadata total was 14
GiB, now it's 5 GiB, much more in line with used (especially given the
fact that your half a GiB of global reserve comes from metadata but
doesn't count as used in the above figure, so you're effectively a bit
over 4 GiB used, meaning you may not be able to free more even if
balancing all metadata chunks with just -m, no usage filter).
>> Once you have several gigs in unallocated, then try the same thing with
>> data:
>>
>> btrfs balance start -dusage=5
>>
>> And again, increase it in increments of 5 or 10% at a time, to 50 or
>> 70%.
>
> did
>
> # btrfs balance start -dusage=70
>
> straight away, took ages, regularly froze processes for minutes, after
> about 8h status is:
>
> # btrfs balance status /
> Balance on '/' is paused
> 192 out of about 595 chunks balanced (194 considered), 68% left
> # btrfs filesystem usage /
> Overall:
> Device size: 915.32GiB
> Device allocated: 482.04GiB
> Device unallocated: 433.28GiB
> Device missing: 0.00B
> Used: 154.36GiB
> Free (estimated): 759.48GiB (min: 759.48GiB)
> Data ratio: 1.00
> Metadata ratio: 1.00
> Global reserve: 512.00MiB (used: 0.00B)
>
> Data,single: Size:477.01GiB, Used:150.80GiB
> /dev/sda2 477.01GiB
>
> Metadata,single: Size:5.00GiB, Used:3.56GiB
> /dev/sda2 5.00GiB
>
> System,single: Size:32.00MiB, Used:96.00KiB
> /dev/sda2 32.00MiB
>
> Unallocated:
> /dev/sda2 433.28GiB
>
> -> Looking good. Will proceed when I don't need the box to actually be
> responsive.
The thing with the usage= filter is this. Balancing empty (usage=0)
chunks simply deletes them so is nearly instantaneous, and obviously
reclaims 100% of the space because they were empty, so huge bang for the
buck. Balancing nearly empty chunks is still quite fast since there's
very little data to rewrite and compact into the new chunks, and
obviously, at usage=10, lets you compact 10 or more only 10% used or less
chunks into one new chunk, so as long as you have a lot of them, you
still get really good bang for the buck.
But as usage increases, you're writing more and more data, for less and
less bang for the buck. At half full, 50% usage, balance is only
combining two chunks into one, while writing the same amount of data to
recover only one chunk of the two, as it was writing to recover 9 chunks
out of 10, at 10% usage.
So when the filesystem still has a lot of room, a lot of people stop at
say usage=25, where they're still recovering 3/4 of the chunks, or
usage=33, where they're recovering 2/3. As the filesystem fills up, they
may need to do usage=50, recovering only 1/2 of the chunks rewritten, and
eventually, usage=67 or 70, writing three chunks into two, and thus
recovering only one chunk's worth of space for every three written, 1/3.
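That bang-for-the-buck arithmetic can be sketched as a toy calculation
(plain shell, not a btrfs command; integer division floors, matching
the idealized ratios above):

```shell
# Toy model of balance's diminishing returns: compacting n chunks that
# are each pct% full rewrites their data into about n*pct/100 chunks
# and reclaims the rest.
reclaimed() { echo $(( $1 - $1 * $2 / 100 )); }

reclaimed 10 10   # -> 9 : ten 10%-full chunks compact into one, freeing nine
reclaimed 10 50   # -> 5 : at 50% you only get half back
reclaimed 3 70    # -> 1 : three ~70%-full chunks rewrite into two, freeing one
```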
It's rarely useful to go above that, unless you're /really/ pressed for
space, and then it's simpler to just do a balance without that filter and
balance all chunks, tho you can still use -d or -m to only do data or
metadata chunks, if desired.
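To put rough numbers on that bang-for-buck curve, here's a quick shell
sketch (the thresholds are just the examples from this thread, not
recommendations):

```shell
# For a filtered balance with -dusage=N, every rewritten chunk is at most
# N% full, so roughly (100 - N)% of each rewritten chunk's worth of space
# is reclaimed when its contents are compacted into new chunks.
for u in 10 25 33 50 67; do
    printf 'usage=%s: reclaims ~%s%% of each rewritten chunk\n' "$u" "$((100 - u))"
done
```

At usage=10 you get back ~90% of every chunk you rewrite; by usage=67
you're down to ~33%, which is why going higher rarely pays off.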
That's why I suggested you bump the usage up in increments, with my
intention, tho I guess I didn't clearly state it, being that you'd stop
once total dropped reasonably close to used, for data, say 300 GiB total,
150 used, or if you were lucky, 200 GiB total, 150 used.
With luck that would have happened at say -dusage=40 or -dusage=50, while
your bang for the buck was still reclaiming at least half of the chunks
in the rewrite, and -dusage=70 would have never been needed.
That's why you found it taking so long.
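As a sketch, the incremental approach described above might look like
this (run as root on the mounted filesystem; the thresholds and the
stopping point are illustrative, not prescriptive):

```shell
# Cheapest passes first: usage=0 just deletes empty data chunks.
btrfs balance start -dusage=0 /
btrfs balance start -dusage=10 /
btrfs balance start -dusage=25 /

# Check allocated vs. used; stop bumping the filter once the Data
# "Size" figure is reasonably close to "Used".
btrfs filesystem usage /

# Only if allocated is still far above used:
btrfs balance start -dusage=50 /
```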
Meanwhile, discussion in another thread reminded me of another factor,
quotas.
For a long time quotas were simply broken in btrfs as the code was buggy
and various corner-cases resulted in negative (!!) reported usage and the
like. With kernel 4.4, known corner-case bugs are in general fixed and
the numbers should finally be correct, but there's still a LOT of quota
overhead for balance, etc. They're discussing right now whether part or
all of that can be eliminated, but for the time being, anyway, active
btrfs quotas incur a /massive/ balance overhead. So if you use quotas
and are going to be doing more than a trivial balance, it's worth
turning them off temporarily for the balance if you can, then rescanning
after the balance when you turn them back on, assuming you actually need
them and didn't simply have quotas on because you could. Of course,
depending on how you're using quotas, turning them off for the balance
might not be an option. But if you can, it avoids the effectively
repeated rescans during the balance, and while the rescan when you turn
them back on will take some time, it should take far less than the time
lost to those repeated rescans if quotas stay enabled throughout.
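The temporary quota toggle around a heavy balance would look something
like this (again as root on the mounted filesystem; the balance filter
is illustrative):

```shell
# Turn quotas off for the duration of the heavy balance.
btrfs quota disable /

btrfs balance start -dusage=25 /   # the actual balance work, quota-free

# Re-enable quotas and rescan; -w blocks until the rescan finishes.
btrfs quota enable /
btrfs quota rescan -w /
```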
I believe btrfs check is similarly afflicted with massive quota overhead,
tho I'm not sure if it's /as/ bad for check.
I've never had quotas enabled here at all, however, as I don't really
need them and the negatives are still too high, even if they're actually
working now, and I entirely forgot about them when I was recommending the
above to help get your chunk usage vs. total back under control.
So if you're using quotas, consider turning them off at least temporarily
when you do reschedule those balances. In fact, you may wish to leave
them off if you don't really need them, at least until they figure out
how to reduce the overhead they currently trigger in balance and check.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: [4.4.1] btrfs-transacti frequent high CPU usage despite little fragmentation
2016-03-18 23:06 ` Duncan
@ 2016-03-19 20:31 ` Ole Langbehn
0 siblings, 0 replies; 5+ messages in thread
From: Ole Langbehn @ 2016-03-19 20:31 UTC (permalink / raw)
To: linux-btrfs
Duncan,
thanks again for your effort, I highly appreciate it.
On 19.03.2016 00:06, Duncan wrote:
> autodefrag
Got it, thanks.
> Nocow interacts with snapshots.
Thanks for presenting that in that much detail.
> What can happen then, and used to happen frequently before 3.17 (tho
> much less frequently, it can still happen now), is that over time and with
> use, the filesystem will allocate all available space as one type,
> typically data chunks, and then run out of space in the other type of
> chunk, typically metadata, and have no unallocated space from which to
> allocate more. So you'll have lots of space left, but it'll be all tied
> up in only partially used chunks of the one type and you'll be out of
> space in the other type.
>
> And by the time you actually start getting ENOSPC errors as a result of
> the situation, there's often too little space left to create even the one
> additional chunk necessary for a balance to write the data from other
> chunks into, in order to combine some of the less used chunks into
> fewer chunks at 100% usage (but for the last one, of course).
>
> And you were already in a tight spot in that regard and may well have had
> errors if you had simply tried an unfiltered balance, because data chunks
> are typically 1 GiB in size (and can be up to 10 GiB in some circumstances
> on large enough filesystems, tho I think the really large sizes require
> multi-device), and you were down to 300-ish MiB of unallocated space, not
> enough to create a new 1 GiB data chunk.
>
> And considering the filesystem's near terabyte scale, to be down to under
> a GiB of unallocated space is even more startling, particularly on newer
> kernels where empty chunks are normally reclaimed automatically (tho as
> the usage=0 balances reclaimed some space for you, obviously not all of
> them had been reclaimed in your case).
As I said before, this fs has (with 99.9% probability) never seen
kernels <3.18. I'm curious why it came to the point of only having
300MiB unallocated, or what could potentially lead to this.
> Meanwhile, discussion in another thread reminded me of another factor,
> quotas.
Sure enough, I had quotas enabled without any direct need for them ;).
I've been using
https://github.com/agronick/btrfs-size/
which uses quotas in order to display human readable snapshot sizes.
As a wrap-up to the chunk allocation issue (the balance has finished):
# btrfs filesystem usage /
Overall:
Device size: 915.32GiB
Device allocated: 169.04GiB
Device unallocated: 746.28GiB
Device missing: 0.00B
Used: 155.51GiB
Free (estimated): 758.33GiB (min: 758.33GiB)
Data ratio: 1.00
Metadata ratio: 1.00
Global reserve: 512.00MiB (used: 0.00B)
Data,single: Size:164.01GiB, Used:151.95GiB
/dev/sda2 164.01GiB
Metadata,single: Size:5.00GiB, Used:3.55GiB
/dev/sda2 5.00GiB
System,single: Size:32.00MiB, Used:48.00KiB
/dev/sda2 32.00MiB
Unallocated:
/dev/sda2 746.28GiB
Cheers,
Ole
Thread overview: 5+ messages
2016-03-16 9:45 [4.4.1] btrfs-transacti frequent high CPU usage despite little fragmentation Ole Langbehn
2016-03-17 10:51 ` Duncan
2016-03-18 9:33 ` Ole Langbehn
2016-03-18 23:06 ` Duncan
2016-03-19 20:31 ` Ole Langbehn