* [4.4.1] btrfs-transacti frequent high CPU usage despite little fragmentation
@ 2016-03-16  9:45 Ole Langbehn
  2016-03-17 10:51 ` Duncan
  0 siblings, 1 reply; 5+ messages in thread
From: Ole Langbehn @ 2016-03-16  9:45 UTC
  To: linux-btrfs

Hi,

On my box, frequently, and mostly while I'm using firefox, any process
doing disk I/O freezes while btrfs-transacti spikes in CPU usage for
more than a minute.

I know about btrfs' fragmentation issue, but have a couple of questions:

* While btrfs-transacti is spiking, can I trace which files are the
culprit somehow?
* On my setup, with measured fragmentation, are the CPU spike durations
and freezes normal?
* Can I alleviate the situation by anything except defragmentation?

Any insight is appreciated.

Details:

I have a 1TB SSD with a large btrfs partition:

# btrfs filesystem usage /
Overall:
    Device size:                 915.32GiB
    Device allocated:            915.02GiB
    Device unallocated:          306.00MiB
    Device missing:                  0.00B
    Used:                        152.90GiB
    Free (estimated):            751.96GiB      (min: 751.96GiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,single: Size:901.01GiB, Used:149.35GiB
   /dev/sda2     901.01GiB

Metadata,single: Size:14.01GiB, Used:3.55GiB
   /dev/sda2      14.01GiB

System,single: Size:4.00MiB, Used:128.00KiB
   /dev/sda2       4.00MiB

Unallocated:
   /dev/sda2     306.00MiB


I've done the obvious and defragmented files. Some files went from 10k+
extents down to still more than 100. But the problem persisted, or came
back very quickly. Just now I re-ran defragmentation, with the following
results (only showing files with more than 100 extents before
defragmentation):

extents before / extents after / anonymized path
103 / 1 /home/foo/.mozilla/firefox/foo.default/formhistory.sqlite:
133 / 1 /home/foo/.thunderbird/foo.default/ImapMail/imap.example.org/ml-btrfs:
155 / 1 /var/log/messages:
158 / 30 /home/foo/.thunderbird/foo.default/ImapMail/mail.example.org/INBOX:
160 / 32 /home/foo/.thunderbird/foo.default/calendar-data/cache.sqlite:
255 / 255 /var/lib/docker/devicemapper/devicemapper/data:
550 / 1 /home/foo/.cache/chromium/Default/Cache/data_1:
627 / 1 /home/foo/.cache/chromium/Default/Cache/data_2:
1738 / 25 /home/foo/.cache/chromium/Default/Cache/data_3:
1764 / 77 /home/foo/.mozilla/firefox/foo.default/places.sqlite:
4414 / 284 /home/foo/.digikam/thumbnails-digikam.db:
6576 / 3 /home/foo/.digikam/digikam4.db:
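
(For reference, extent counts like those above can be gathered per file
with filefrag before and after defragmenting; a rough sketch, using one
of the paths above:)

# filefrag /home/foo/.mozilla/firefox/foo.default/places.sqlite    # reports "... N extents found"
# btrfs filesystem defragment -v /home/foo/.mozilla/firefox/foo.default/places.sqlite
# filefrag /home/foo/.mozilla/firefox/foo.default/places.sqlite    # re-check the extent count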

So fragmentation came back quickly, and the firefox places.sqlite file
could explain why the system freezes while browsing.

BTW: I did a VACUUM on the sqlite db, and afterwards it had 1 extent.
That's expected; I'm just saying that vacuuming seems to be a good way
to defragment sqlite databases.

I am using snapper and have about 40 snapshots going back for some
months. Those are read only. Could that have any effect?

Cheers,

Ole




* Re: [4.4.1] btrfs-transacti frequent high CPU usage despite little fragmentation
  2016-03-16  9:45 [4.4.1] btrfs-transacti frequent high CPU usage despite little fragmentation Ole Langbehn
@ 2016-03-17 10:51 ` Duncan
  2016-03-18  9:33   ` Ole Langbehn
  0 siblings, 1 reply; 5+ messages in thread
From: Duncan @ 2016-03-17 10:51 UTC
  To: linux-btrfs

Ole Langbehn posted on Wed, 16 Mar 2016 10:45:28 +0100 as excerpted:

> Hi,
> 
> on my box, frequently, mostly while using firefox, any process doing
> disk IO freezes while btrfs-transacti has a spike in CPU usage for more
> than a minute.
> 
> I know about btrfs' fragmentation issue, but have a couple of questions:
> 
> * While btrfs-transacti is spiking, can I trace which files are the
> culprit somehow?
> * On my setup, with measured fragmentation, are the CPU spike durations
> and freezes normal?
> * Can I alleviate the situation by anything except defragmentation?
> 
> Any insight is appreciated.
> 
> Details:
> 
> I have a 1TB SSD with a large btrfs partition:
> 
> # btrfs filesystem usage /
> Overall:
>     Device size:                 915.32GiB
>     Device allocated:            915.02GiB
>     Device unallocated:          306.00MiB
>     Device missing:                  0.00B
>     Used:                        152.90GiB
>     Free (estimated):            751.96GiB      (min: 751.96GiB)
>     Data ratio:                       1.00
>     Metadata ratio:                   1.00
>     Global reserve:              512.00MiB      (used: 0.00B)
> 
> Data,single: Size:901.01GiB, Used:149.35GiB
>    /dev/sda2     901.01GiB
> 
> Metadata,single: Size:14.01GiB, Used:3.55GiB
>    /dev/sda2      14.01GiB
> 
> System,single: Size:4.00MiB, Used:128.00KiB
>    /dev/sda2       4.00MiB
> 
> Unallocated:
>    /dev/sda2     306.00MiB
> 
> 
> I've done the obvious and defragmented files. Some files were
> defragmented from 10k+ to still more than 100 extents. But the problem
> persisted or came back very quickly. Just now i re-ran defragmentation
> with the following results (only showing files with more than 100
> extents before fragmentation):
> 
> extents before / extents after / anonymized path
> 103 / 1 /home/foo/.mozilla/firefox/foo.default/formhistory.sqlite:
> 133 / 1
> /home/foo/.thunderbird/foo.default/ImapMail/imap.example.org/ml-btrfs:
> 155 / 1 /var/log/messages:
> 158 / 30
> /home/foo/.thunderbird/foo.default/ImapMail/mail.example.org/INBOX:
> 160 / 32 /home/foo/.thunderbird/foo.default/calendar-data/cache.sqlite:
> 255 / 255 /var/lib/docker/devicemapper/devicemapper/data:
> 550 / 1 /home/foo/.cache/chromium/Default/Cache/data_1:
> 627 / 1 /home/foo/.cache/chromium/Default/Cache/data_2:
> 1738 / 25 /home/foo/.cache/chromium/Default/Cache/data_3:
> 1764 / 77 /home/foo/.mozilla/firefox/foo.default/places.sqlite:
> 4414 / 284 /home/foo/.digikam/thumbnails-digikam.db:
> 6576 / 3 /home/foo/.digikam/digikam4.db:
> 
> So fragmentation came back quickly, and the firefox places.sqlite file
> could explain why the system freezes while browsing.

Have you tried the autodefrag mount option, then defragging?  That should 
help keep rewritten files from fragmenting so heavily, at least.  On 
spinning rust it doesn't play so well with large (half-gig plus) 
databases or VM images, but on ssds it should scale to rather larger files; on 
fast SSDs I'd not expect problems until 1-2 GiB, possibly higher.
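
For example, something like this (a sketch, not tested on your setup;
merge the option into whatever you already have in fstab):

# mount -o remount,autodefrag /

and, to make it persistent, an fstab line along the lines of:

/dev/sda2  /  btrfs  defaults,autodefrag  0 0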

For large dbs or VM images, too large for autodefrag to handle well, the 
nocow attribute is the usual suggestion, but I'll skip the details on 
that for now, as you may not need it with autodefrag on an ssd, unless 
your database and VM files are several gig apiece.

> BTW: I did a VACUUM on the sqlite db and afterwards it had 1 extent.
> Expected, just saying that vacuuming seems to be a good measure for
> defragmenting sqlite databases.

I know the concept, but out of curiosity, what tool do you use for 
that?  I imagine my firefox sqlite dbs could use some vacuuming as well, 
but don't have the foggiest idea how to go about it.

> I am using snapper and have about 40 snapshots going back for some
> months. Those are read only. Could that have any effect?

They could have some, but I don't expect it'd be much, not with only 40.


Other than autodefrag, and/or nocow on specific files (but research the 
latter before you do it; there's some interaction with snapshots you need 
to be aware of, and you can't just apply it to existing files and expect 
it to work right), there are a couple of other things that may help.


Of *most* importance, you really *really* need to do something about that 
data chunk imbalance, and to a lesser extent that metadata chunk 
imbalance, because your unallocated space is well under a gig (306 MiB), 
with all that extra space, hundreds of gigs of it, locked up in unused or 
only partially used chunks.

The subject says 4.4.1, but it's unclear whether that's your kernel 
version or your btrfs-progs userspace version.  If that's your userspace 
version and you're running an old kernel, strongly consider upgrading to 
the LTS kernel 4.1 or 4.4 series if possible, or at least the LTS series 
before that, 3.18.  Those, or the latest couple of current kernel series 
(4.5 and 4.4, plus 4.3 for the moment, as 4.5 is /just/ out), are the 
recommended and best supported versions.

I say this because before 3.17, the btrfs kernelspace could allocate its 
own chunks, but didn't know how to free them, so one had to run balance 
fairly frequently to free up all the empty chunks, and it looks like you 
might have a bunch of empty chunks around.

With 3.17, the kernel learned how to delete entirely empty chunks, and 
running a balance to clear them isn't necessary these days.  But the 
kernel still only knows how to delete entirely empty chunks, and it's 
still possible over time, particularly with snapshots locking in place 
file extents that might be keeping otherwise empty chunks from being 
fully emptied and thus cleared by the kernel, for large imbalances to 
occur.

Either way, large imbalances are what you have ATM.  Copied from your post 
as quoted above:

> Data,single: Size:901.01GiB, Used:149.35GiB
>    /dev/sda2     901.01GiB
> 
> Metadata,single: Size:14.01GiB, Used:3.55GiB
>    /dev/sda2      14.01GiB

So 901 GiB of data chunks but under 150 GiB of it actually used.  That's 
nearly 750 GiB of free space tied up in empty or only partially filled 
data chunks.

14 GiB of metadata chunks, but under 4 GiB reported used.  That's about 
10 GiB of metadata chunks that should be freeable (tho the half GiB of 
global reserve comes from that metadata too but doesn't count as used, so 
usage is actually a bit over 4 GiB, so you may only free 9.5 GiB or so).

Try this:

btrfs balance start -dusage=0 -musage=0 /

That should go pretty fast whether it works or not, but it might not 
work, if you don't actually have any entirely empty chunks.  If you do, 
it'll free them.

If that added some gigs to your unallocated total, good, because without 
it you're likely to have difficulty balancing data chunks: data chunks 
are normally a gig or more in size, and a new one has to be allocated in 
order to rewrite the content of others and release their unused space.

If it didn't do anything, as is likely if you're running a new kernel, it 
means you didn't have any zero-usage chunks, which a new kernel /should/ 
clean up but might not in some cases.

Then start with metadata, and up the usage numbers (which are 
percentages), like this:

btrfs balance start -musage=5 /

Then, if it works, up the number to 10, 20, etc.  By the time you get to 
50 or 70, you should have cleared several of those 9.5 or so potential 
gigs and can stop.  /Hopefully/ it'll let you do that with just the 300 
MiB free you have, if the 0-usage balance didn't free up several gigs.  
But on a filesystem that large, the normally 256 MiB metadata chunks may 
be a GiB each, in which case you'd still run into trouble.
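
If you'd rather script the bumps, something like this works as a sketch 
(watch the output and stop once metadata total is reasonably close to used):

# for u in 5 10 20 30 40 50 70; do btrfs balance start -musage=$u /; done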

Once you have several gigs in unallocated, then try the same thing with 
data:

btrfs balance start -dusage=5 /

And again, increase it in increments of 5 or 10% at a time, to 50 or 
70%.  With luck, you'll get most of that potential 750 GiB back into 
unallocated.

When you're done, total data should be much closer to the 150-ish gigs 
it's reporting as used, with most of that near 750 gigs spread from the 
current 900+ total moved to unallocated, and total metadata much closer 
to the about 4 gigs used, with 9 gigs or so of that spread moved to 
unallocated.

If the 0-usage thing doesn't give you anything and you can't balance even 
-musage=1, or you don't get any space returned until you go high enough 
to get an error, or if the metadata balance doesn't free enough space to 
unallocated to let the balance -dusage= work, then things get a bit more 
serious.  In that case, you can try one of two things: either delete your 
oldest snapshots to try and free up 100% of a few chunks so -dusage=0 
will free them, or temporarily btrfs device add a second device of a few 
gigs (a thumb drive can work), to give the balance somewhere to put the 
new chunk it needs to write in order to free up old ones.  Once you have 
enough space free on the original device, you can btrfs device delete the 
temporary one, which moves all the chunks on it back to the main device 
and removes it from the filesystem.
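
In command form, that last escape hatch looks roughly like this (a 
sketch; /dev/sdX1 is whatever the temporary device shows up as):

# btrfs device add /dev/sdX1 /
# btrfs balance start -dusage=10 /
# btrfs device delete /dev/sdX1 /    # migrates its chunks back before removing it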


Second thing, consider tweaking your trim/discard policy, since you're on 
ssd.  It could well be erase block management that's hitting you, if you 
haven't been doing regular trims or if the associated btrfs mount option 
(discard) is set incorrectly for your device.

See the btrfs (5) manpage (not btrfs (8)!) or the wiki for the discard 
mount option description, but the deal is that while most semi-recent ssds 
handle trim/discard, only fairly recently was it made a command-queued 
operation, and not even all recent ssds support it as command-queued.  
Without that, a trim kills the command-queue and thus can dramatically 
hurt performance.  Which is why discard isn't the btrfs ssd default and 
isn't generally recommended as a mount option, tho where the command is 
queued it should be a good thing.

But without trim/discard of /some/ sort, your ssd will slow down over 
time, when it no longer has a ready pool of unused erase blocks at hand 
to put new and wear-level-transferred blocks into.  Now mkfs.btrfs does 
do a trim as part of the filesystem creation process, but after that...

After that, barring an ssd that command-queues the trim command so you 
can add it to your mount options without affecting performance there, you 
can run the fstrim command from time to time.  Fstrim finds the unused 
space in the filesystem and issues trim commands for it, thus zeroing it 
out and telling the ssd firmware it can safely use those blocks for wear-
leveling and the like.

The recommendation is to put fstrim in a cron or systemd timer job, 
executing it weekly or similar, preferably at a time when all those 
unqueued trims won't affect your normal work.
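
For example, either of these (sketches; exact paths and unit names vary 
by distro):

# cat /etc/cron.weekly/fstrim
#!/bin/sh
fstrim -v / >> /var/log/fstrim.log 2>&1

or, where your distro ships util-linux's systemd units:

# systemctl enable --now fstrim.timer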

Meanwhile, note that if you run fstrim manually, it outputs all the empty 
space it's trimming, but that running it repeatedly will show the same 
space every time, since it doesn't know what's already trimmed.  That's 
not a problem for the ssd, but it can confuse users who might think the 
trim isn't working, since it trims the same thing every time.

So if you have trim in your mount options, try taking it out and see if 
that helps.  But if you're not doing it there, be sure to set up an fstrim 
cron or systemd timer job to do it weekly or so.

Another strategy that some people use is to partition up most of the ssd, 
but leave 20% or so of it unpartitioned, or partitioned but without a 
filesystem if you prefer, thus giving the firmware that extra room to 
play with.  Once you have all those extra data and metadata chunks 
removed, you can shrink the filesystem, then the partition it's on, and 
let the ssd firmware have the now unpartitioned space.  The only thing is 
that I don't know of a tool to actually trim that now-free space, and I'm 
not sure whether btrfs resize does it or not, so you might have to 
quickly create a new partition and filesystem in the space again but 
leave the filesystem empty, then fstrim it (or just make that filesystem 
btrfs, since mkfs.btrfs automatically does a trim if it detects an ssd 
where it can), to let the firmware have it.
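
(If you do go that route, the filesystem-shrink step itself is just 
something like the following, with an illustrative number; shrink 
/dev/sda2 by the same amount afterwards with your partitioning tool, and 
have backups first:)

# btrfs filesystem resize -180g /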

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: [4.4.1] btrfs-transacti frequent high CPU usage despite little fragmentation
  2016-03-17 10:51 ` Duncan
@ 2016-03-18  9:33   ` Ole Langbehn
  2016-03-18 23:06     ` Duncan
  0 siblings, 1 reply; 5+ messages in thread
From: Ole Langbehn @ 2016-03-18  9:33 UTC
  To: linux-btrfs



Duncan,

thanks for your extensive answer.

On 17.03.2016 11:51, Duncan wrote:
> Ole Langbehn posted on Wed, 16 Mar 2016 10:45:28 +0100 as excerpted:
> 
> Have you tried the autodefrag mount option, then defragging?  That should 
> help keep rewritten files from fragmenting so heavily, at least.  On 
> spinning rust it doesn't play so well with large (half-gig plus) 
> databases or VM images, but on ssds it should scale rather larger; on 
> fast SSDs I'd not expect problems until 1-2 GiB, possibly higher.

Since I do have some big VM images, I never tried autodefrag.

> For large dbs or VM images, too large for autodefrag to handle well, the 
> nocow attribute is the usual suggestion, but I'll skip the details on 
> that for now, as you may not need it with autodefrag on an ssd, unless 
> your database and VM files are several gig apiece.

Since my original post, I've experimented with setting the firefox
places.sqlite to nodatacow (on a new file). It has stayed at 1 extent
since, so that seems to work.
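
(The usual way to do that, roughly sketched: create the file empty, set
+C, then copy the data in, all with firefox closed:)

# cd /home/foo/.mozilla/firefox/foo.default
# touch places.sqlite.new
# chattr +C places.sqlite.new          # +C (nodatacow) only sticks on an empty file
# cat places.sqlite > places.sqlite.new
# mv places.sqlite.new places.sqlite
# lsattr places.sqlite                 # should now show the 'C' attribute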

>> BTW: I did a VACUUM on the sqlite db and afterwards it had 1 extent.
>> Expected, just saying that vacuuming seems to be a good measure for
>> defragmenting sqlite databases.
> 
> I know the concept, but out of curiosity, what tool do you use for 
> that?  I imagine my firefox sqlite dbs could use some vacuuming as well, 
> but don't have the foggiest idea how to go about it.

A simple call to the command line interface, like with any other SQL DB:

# sqlite3 /path/to/db.sqlite "VACUUM;"
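
or, as a sketch for all of a profile's databases in one go (with firefox
closed):

# for db in /home/foo/.mozilla/firefox/*.default/*.sqlite; do sqlite3 "$db" "VACUUM;"; done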

> Of *most* importance, you really *really* need to do something about that 
> data chunk imbalance, and to a lesser extent that metadata chunk 
> imbalance, because your unallocated space is well under a gig (306 MiB), 
> with all that extra space, hundreds of gigs of it, locked up in unused or 
> only partially used chunks.

I'm curious - why is that a bad thing?

> The subject says 4.4.1, but it's unclear whether that's your kernel 
> version or your btrfs-progs userspace version.  If that's your userspace 
> version and you're running an old kernel, strongly consider upgrading to 
> the LTS kernel 4.1 or 4.4 series if possible, or at least the LTS series 
> before that, 3.18.  Those or the latest couple current kernel series, 4.5 
> and 4.4, and 4.3 for the moment as 4.5 is /just/ out, are the recommended 
> and best supported versions.

# uname -r
4.4.1-gentoo

# btrfs --version
btrfs-progs v4.4.1

So, both 4.4.1 ;), but I meant userspace.

> Try this:
> 
> btrfs balance start -dusage=0 -musage=0.

Did this, although I'm reasonably up to date kernel-wise; I am very sure
that the filesystem has never seen a kernel <3.18. It took some minutes
and ended up with:

# btrfs filesystem usage /
Overall:
    Device size:                 915.32GiB
    Device allocated:            681.32GiB
    Device unallocated:          234.00GiB
    Device missing:                  0.00B
    Used:                        153.80GiB
    Free (estimated):            751.08GiB      (min: 751.08GiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,single: Size:667.31GiB, Used:150.22GiB
   /dev/sda2     667.31GiB

Metadata,single: Size:14.01GiB, Used:3.58GiB
   /dev/sda2      14.01GiB

System,single: Size:4.00MiB, Used:112.00KiB
   /dev/sda2       4.00MiB

Unallocated:
   /dev/sda2     234.00GiB


-> Helped with data, not with metadata.

> Then start with metadata, and up the usage numbers which are percentages, 
> like this:
> 
> btrfs balance start -musage=5.
> 
> Then if it works up the number to 10, 20, etc.

Upped it to 70; it relocated a total of 13 out of 685 chunks:

Metadata,single: Size:5.00GiB, Used:3.58GiB
   /dev/sda2       5.00GiB

> Once you have several gigs in unallocated, then try the same thing with 
> data:
> 
> btrfs balance start -dusage=5 /
> 
> And again, increase it in increments of 5 or 10% at a time, to 50 or 
> 70%.

I did

# btrfs balance start -dusage=70 /

straight away. It took ages and regularly froze processes for minutes;
after about 8h the status is:

# btrfs balance status /
Balance on '/' is paused
192 out of about 595 chunks balanced (194 considered),  68% left
# btrfs filesystem usage /
Overall:
    Device size:                 915.32GiB
    Device allocated:            482.04GiB
    Device unallocated:          433.28GiB
    Device missing:                  0.00B
    Used:                        154.36GiB
    Free (estimated):            759.48GiB      (min: 759.48GiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,single: Size:477.01GiB, Used:150.80GiB
   /dev/sda2     477.01GiB

Metadata,single: Size:5.00GiB, Used:3.56GiB
   /dev/sda2       5.00GiB

System,single: Size:32.00MiB, Used:96.00KiB
   /dev/sda2      32.00MiB

Unallocated:
   /dev/sda2     433.28GiB

-> Looking good. Will proceed when I don't need the box to actually be
responsive.

> Second thing, consider tweaking your trim/discard policy [...]
> 
> The recommendation is to put fstrim in a cron or systemd timer job, 
> executing it weekly or similar, preferably at a time when all those 
> unqueued trims won't affect your normal work.

I've had it in cron.weekly since the creation of the filesystem:

fstrim -v / >> $LOG

Cheers,

Ole






* Re: [4.4.1] btrfs-transacti frequent high CPU usage despite little fragmentation
  2016-03-18  9:33   ` Ole Langbehn
@ 2016-03-18 23:06     ` Duncan
  2016-03-19 20:31       ` Ole Langbehn
  0 siblings, 1 reply; 5+ messages in thread
From: Duncan @ 2016-03-18 23:06 UTC
  To: linux-btrfs

Ole Langbehn posted on Fri, 18 Mar 2016 10:33:46 +0100 as excerpted:

> Duncan,
> 
> thanks for your extensive answer.
> 
> On 17.03.2016 11:51, Duncan wrote:
>> Ole Langbehn posted on Wed, 16 Mar 2016 10:45:28 +0100 as excerpted:
>> 
>> Have you tried the autodefrag mount option, then defragging?  That
>> should help keep rewritten files from fragmenting so heavily, at least.
>>  On spinning rust it doesn't play so well with large (half-gig plus)
>> databases or VM images, but on ssds it should scale rather larger; on
>> fast SSDs I'd not expect problems until 1-2 GiB, possibly higher.
> 
> Since I do have some big VM images, I never tried autodefrag.

OK.  Tho as you're on ssd you might consider /trying/ it.  The big 
problem with autodefrag and big VMs and DBs is that as the filesize gets 
larger, it becomes more difficult for autodefrag to keep up with the 
incoming stream of modifications, but ssds tend to be fast enough that 
they can keep up for far longer, and it may be that you won't see a 
noticeable issue.  If you do, you can always turn the mount option back 
off.

Also, nocow should mean autodefrag doesn't affect the file anyway, as it 
won't be fragmenting due to the nocow.  So if you have your really large 
VMs and DBs set nocow, it's quite likely, particularly on ssd, that you 
can set autodefrag and not see the performance problems with those large 
files that are the reason it's normally not recommended for the large 
db/vm use-case.

And like I said you can always turn it back off if necessary.

>> For large dbs or VM images, too large for autodefrag to handle well,
>> the nocow attribute is the usual suggestion, but I'll skip the details
>> on that for now, as you may not need it with autodefrag on an ssd,
>> unless your database and VM files are several gig apiece.
> 
> Since posting the original post, I experimented with setting the firefox
> places.sqlite to nodatacow (on a new file). 1 extent since, seems to
> work.

Seems you are reasonably familiar with the nocow attribute drill, so I'll 
just cover one remaining base, in case you missed it.

Nocow interacts with snapshots.  Basically, snapshots turn nocow into 
cow1 (cow the first time), because they lock the existing version in 
place.  The first change to a block after a snapshot, then, must be cow, 
tho further changes to it after that remain nocow at the new in-place 
location.

So nocow isn't fully nocow with snapshots; fragmentation will slow 
down, but not be eliminated.  People doing regularly scheduled 
snapshotting therefore often need a less frequent but also regularly 
scheduled (perhaps weekly or monthly, for multiple snapshots per day) 
defrag of their nocow files.
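
Such a job can be as simple as this sketch (point it at your actual 
nocow files or directories):

# btrfs filesystem defragment -v /home/foo/.mozilla/firefox/foo.default/places.sqlite
# btrfs filesystem defragment -rv /path/to/vm-images/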

Tho be aware that for performance reasons, defrag isn't snapshot aware 
and will break reflinks to existing snapshots, thereby increasing 
filesystem usage.  The total effect on usage of course depends on how 
much updating the nocow files get as well as snapshotting and defrag 
frequency.

>>> BTW: I did a VACUUM on the sqlite db and afterwards it had 1 extent.
>>> Expected, just saying that vacuuming seems to be a good measure for
>>> defragmenting sqlite databases.
>> 
>> I know the concept, but out of curiosity, what tool do you use for
>> that?  I imagine my firefox sqlite dbs could use some vacuuming as
>> well, but don't have the foggiest idea how to go about it.
> 
> simple call of the command line interface, like with any other SQL DB:
> 
> # sqlite3 /path/to/db.sqlite "VACUUM;"

Cool.  As far as I knew, sqlite was library only, no executable to invoke 
in that manner.  Shows how little I knew about sqlite. =:^)  Thanks.

>> Of *most* importance, you really *really* need to do something about
>> that data chunk imbalance, and to a lesser extent that metadata chunk
>> imbalance, because your unallocated space is well under a gig (306
>> MiB), with all that extra space, hundreds of gigs of it, locked up in
>> unused or only partially used chunks.
> 
> I'm curious - why is that a bad thing?

Btrfs allocates space in two stages: first to chunks of data or metadata 
type (there's also a system type, but that's pretty much fixed in size, 
so once the filesystem is created no further system chunks are normally 
needed, unless it's created as a single-device filesystem and a whole 
slew of additional devices are added later, or the filesystem is 
massively resized on the same device, of course), then, from within 
those chunks, to files from data chunks and to metadata nodes from 
metadata chunks, as necessary.

What can happen then (and used to happen frequently before 3.17; it's 
much less frequent now, but it can still happen) is that over time and 
with use, the filesystem allocates all available space to one chunk 
type, typically data, and then runs out of space in the other type, 
typically metadata, with no unallocated space left from which to 
allocate more.  So you'll have lots of space left, but it'll all be tied 
up in only partially used chunks of the one type, and you'll be out of 
space in the other type.

And by the time you actually start getting ENOSPC errors as a result of 
the situation, there's often too little space left to create even the one 
additional chunk necessary for a balance to write the data from other 
chunks into, in order to combine some of the less used chunks into 
fewer chunks at 100% usage (but for the last one, of course).

And you were already in a tight spot in that regard and may well have had 
errors if you had simply tried an unfiltered balance, because data chunks 
are typically 1 GiB in size (and can be up to 10 GiB in some circumstances 
on large enough filesystems, tho I think the really large sizes require 
multi-device), and you were down to 300-ish MiB of unallocated space, not 
enough to create a new 1 GiB data chunk.

And considering the filesystem's near terabyte scale, to be down to under 
a GiB of unallocated space is even more startling, particularly on newer 
kernels where empty chunks are normally reclaimed automatically (tho as 
the usage=0 balances reclaimed some space for you, obviously not all of 
them had been reclaimed in your case).

That was what was alarming to me, and it /may/ have had something to do 
with the high cpu and low speeds, tho indications were that you still had 
enough space in both data and metadata that it shouldn't have been too 
bad just yet.  But it was potentially heading that way, if you didn't do 
something, which is why I stressed it as I did.  Getting out of such 
situations once you're tightly jammed can be quite difficult and 
inconvenient, tho you were lucky enough not to be that tightly jammed 
just yet, only headed that way.

>> The subject says 4.4.1, but it's unclear whether that's your kernel
>> version or your btrfs-progs userspace version.

> # uname -r
> 4.4.1-gentoo
> 
> # btrfs --version
> btrfs-progs v4.4.1
> 
> So, both 4.4.1 ;)

=:^)

>> Try this:
>> 
>> btrfs balance start -dusage=0 -musage=0.
> 
> Did this although I'm reasonably up to date kernel-wise. I am very sure
> that the filesystem has never seen <3.18. Took some minutes, ended up
> with
> 
> # btrfs filesystem usage /
> Overall:
>     Device size:                 915.32GiB
>     Device allocated:            681.32GiB
>     Device unallocated:          234.00GiB
>     Device missing:                  0.00B
>     Used:                        153.80GiB
>     Free (estimated):            751.08GiB      (min: 751.08GiB)
>     Data ratio:                       1.00
>     Metadata ratio:                   1.00
>     Global reserve:              512.00MiB      (used: 0.00B)
> 
> Data,single: Size:667.31GiB, Used:150.22GiB
>    /dev/sda2     667.31GiB
> 
> Metadata,single: Size:14.01GiB, Used:3.58GiB
>    /dev/sda2      14.01GiB
> 
> System,single: Size:4.00MiB, Used:112.00KiB
>    /dev/sda2       4.00MiB
> 
> Unallocated:
>    /dev/sda2     234.00GiB
> 
> 
> -> Helped with data, not with metadata.

Yes, and most importantly, you're already out of the tight jam you were 
headed into, now with a comfortable several hundred gigs of unallocated 
space. =:^)

With that, not all the specific hoops were necessary for the further 
steps.  In particular, I was afraid the 0-usage balance wouldn't clear 
any chunks at all and you'd still have under a GiB free, still too small 
to properly balance data chunks, hence the suggestion to start with 
metadata and hope it worked.

>> Then start with metadata, and up the usage numbers which are
>> percentages,
>> like this:
>> 
>> btrfs balance start -musage=5.
>> 
>> Then if it works up the number to 10, 20, etc.
> 
> upped it up to 70, relocated a total of 13 out of 685 chunks:
> 
> Metadata,single: Size:5.00GiB, Used:3.58GiB
>    /dev/sda2       5.00GiB

So you cleared a few more gigs to unallocated: metadata total was 14 GiB 
and is now 5 GiB, much more in line with used (especially given the 
fact that your half a GiB of global reserve comes from metadata but 
doesn't count as used in the above figure, so you're effectively a bit 
over 4 GiB used, meaning you may not be able to free more even if 
balancing all metadata chunks with just -m, no usage filter).


>> Once you have several gigs in unallocated, then try the same thing with
>> data:
>> 
>> btrfs balance start -musage=5
>> 
>> And again, increase it in increments of 5 or 10% at a time, to 50 or
>> 70%.
> 
> did
> 
> # btrfs balance start -dusage=70
> 
> straight away, took ages, regularly froze processes for minutes, after
> about 8h status is:
> 
> # btrfs balance status /
> Balance on '/' is paused
> 192 out of about 595 chunks balanced (194 considered),  68% left
> # btrfs filesystem usage /
> Overall:
>     Device size:                 915.32GiB
>     Device allocated:            482.04GiB
>     Device unallocated:          433.28GiB
>     Device missing:                  0.00B
>     Used:                        154.36GiB
>     Free (estimated):            759.48GiB      (min: 759.48GiB)
>     Data ratio:                       1.00
>     Metadata ratio:                   1.00
>     Global reserve:              512.00MiB      (used: 0.00B)
> 
> Data,single: Size:477.01GiB, Used:150.80GiB
>    /dev/sda2     477.01GiB
> 
> Metadata,single: Size:5.00GiB, Used:3.56GiB
>    /dev/sda2       5.00GiB
> 
> System,single: Size:32.00MiB, Used:96.00KiB
>    /dev/sda2      32.00MiB
> 
> Unallocated:
>    /dev/sda2     433.28GiB
> 
> -> Looking good. Will proceed when I don't need the box to actually be
> responsive.

The thing with the usage= filter is this.  Balancing empty (usage=0) 
chunks simply deletes them so is nearly instantaneous, and obviously 
reclaims 100% of the space because they were empty, so huge bang for the 
buck.  Balancing nearly empty chunks is still quite fast since there's 
very little data to rewrite and compact into the new chunks, and 
obviously, at usage=10, lets you compact 10 or more only 10% used or less 
chunks into one new chunk, so as long as you have a lot of them, you 
still get really good bang for the buck.

But as usage increases, you're writing more and more data for less and 
less bang for the buck.  At half full (50% usage), balance is only 
combining two chunks into one, so writing one chunk's worth of data 
recovers only one chunk, whereas at 10% usage the same amount of writing 
recovered nine chunks out of ten.

So when the filesystem still has a lot of room, a lot of people stop at 
say usage=25, where they're still recovering 3/4 of the chunks, or 
usage=33, where they're recovering 2/3.  As the filesystem fills up, they 
may need to do usage=50, recovering only 1/2 of the chunks rewritten, and 
eventually, usage=67 or 70, writing three chunks into two, and thus 
recovering only one chunk's worth of space for every three written, 1/3.
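
Put as rough arithmetic (an approximation that ignores chunk-size 
variation): rewriting N chunks that are each about X% full packs their 
contents into roughly N*X/100 new chunks, freeing about N*(1 - X/100) of 
them, so 9 of 10 at 10% usage, 1 of 2 at 50%, and only about 3 of 10 at 
70%.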

It's rarely useful to go above that, unless you're /really/ pressed for 
space, and then it's simpler to just do a balance without that filter and 
balance all chunks, tho you can still use -d or -m to only do data or 
metadata chunks, if desired.

That's why I suggested you bump the usage up in increments; my 
intention, tho I guess I didn't clearly state it, was that you'd stop 
once total dropped reasonably close to used: for data, say 300 GiB total 
with 150 used, or, if you were lucky, 200 GiB total with 150 used.

With luck that would have happened at say -dusage=40 or -dusage=50, while 
your bang for the buck was still reclaiming at least half of the chunks 
in the rewrite, and -dusage=70 would have never been needed.

That's why you found it taking so long.

Meanwhile, discussion in another thread reminded me of another factor, 
quotas.

For a long time quotas were simply broken in btrfs, as the code was buggy 
and various corner-cases resulted in negative (!!) reported usage and the 
like.  With kernel 4.4, the known corner-case bugs are in general fixed 
and the numbers should finally be correct, but there's still a LOT of 
quota overhead for balance, etc.  They're discussing right now whether 
part or all of that can be eliminated, but for the time being, active 
btrfs quotas incur a /massive/ balance overhead.  So if you use quotas 
and are going to be doing more than trivial balances, it's worth turning 
them off temporarily for the balance if you can, then rescanning after 
the balance when you turn them back on, assuming you do need them and 
didn't simply have quotas on because you could.  Of course, depending on 
how you are using quotas, turning them off for the balance might not be 
an option, but if you can, it avoids effectively repeated rescans during 
the balance, and while the rescan when you turn them back on will take 
some time, it should take far less than the time lost to repeated 
rescans with quotas enabled during the balance.
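
In practice, if you can live without quotas for the duration, that's 
just (a sketch):

# btrfs quota disable /
# btrfs balance start -dusage=70 /
# btrfs quota enable /
# btrfs quota rescan -w /    # -w waits for the rescan to finish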

I believe btrfs check is similarly afflicted with massive quota overhead, 
tho I'm not sure if it's /as/ bad for check.

I've never had quotas enabled here at all, however, as I don't really 
need them and the negatives are still too high, even if they're actually 
working now, and I entirely forgot about them when I was recommending the 
above to help get your chunk usage vs. total back under control.

So if you're using quotas, consider turning them off at least temporarily 
when you do reschedule those balances.  In fact, you may wish to leave 
them off if you don't really need them, at least until they figure out 
how to reduce the overhead they currently trigger in balance and check.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: [4.4.1] btrfs-transacti frequent high CPU usage despite little fragmentation
  2016-03-18 23:06     ` Duncan
@ 2016-03-19 20:31       ` Ole Langbehn
  0 siblings, 0 replies; 5+ messages in thread
From: Ole Langbehn @ 2016-03-19 20:31 UTC
  To: linux-btrfs



Duncan,

thanks again for your effort; I highly appreciate it.

On 19.03.2016 00:06, Duncan wrote:
> autodefrag

Got it, thanks.

> Nocow interacts with snapshots.  

Thanks for presenting that in that much detail.

> What can happen then, and used to happen frequently before 3.17, tho much 
> less frequently but it can still happen now, is that over time and with 
> use, the filesystem will allocate all available space as one type, 
> typically data chunks, and then run out of space in the other type of 
> chunk, typically metadata, and have no unallocated space from which to 
> allocate more.   So you'll have lots of space left, but it'll be all tied 
> up in only partially used chunks of the one type and you'll be out of 
> space in the other type.
> 
> And by the time you actually start getting ENOSPC errors as a result of 
> the situation, there's often too little space left to create even the one 
> additional chunk necessary for a balance to write the data from other 
>> chunks into, in order to combine some of the less used chunks into 
> fewer chunks at 100% usage (but for the last one, of course).
> 
> And you were already in a tight spot in that regard and may well have had 
> errors if you had simply tried an unfiltered balance, because data chunks 
> are typically 1 GiB in size (and can be upto 10 GiB in some circumstances 
> on large enough filesystems, tho I think the really large sizes require 
> multi-device), and you were down to 300-ish MiB of unallocated space, not 
> enough to create a new 1 GiB data chunk.
>
> And considering the filesystem's near terabyte scale, to be down to under 
> a GiB of unallocated space is even more startling, particularly on newer 
> kernels where empty chunks are normally reclaimed automatically (tho as 
> the usage=0 balances reclaimed some space for you, obviously not all of 
> them had been reclaimed in your case).

As I said before, this fs has (with 99.9% probability) never seen
kernels <3.18. I'm curious why it came to the point of only having
300 MiB unallocated, or what could potentially lead to this.

> Meanwhile, discussion in another thread reminded me of another factor, 
> quotas.

Sure enough, I had quotas enabled without any direct need for them ;).
I've been using

https://github.com/agronick/btrfs-size/

which uses quotas in order to display human-readable snapshot sizes.

As a wrap-up to the chunk allocation issue (the balance has finished):

# btrfs filesystem usage /
Overall:
    Device size:                 915.32GiB
    Device allocated:            169.04GiB
    Device unallocated:          746.28GiB
    Device missing:                  0.00B
    Used:                        155.51GiB
    Free (estimated):            758.33GiB      (min: 758.33GiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,single: Size:164.01GiB, Used:151.95GiB
   /dev/sda2     164.01GiB

Metadata,single: Size:5.00GiB, Used:3.55GiB
   /dev/sda2       5.00GiB

System,single: Size:32.00MiB, Used:48.00KiB
   /dev/sda2      32.00MiB

Unallocated:
   /dev/sda2     746.28GiB

Cheers,

Ole



