* btrfs filesystem keeps allocating new chunks for no apparent reason
@ 2016-05-06 21:28 Hans van Kranenburg
  2016-05-30 11:07 ` Hans van Kranenburg
  2016-05-31  1:36 ` Qu Wenruo
  0 siblings, 2 replies; 33+ messages in thread
From: Hans van Kranenburg @ 2016-05-06 21:28 UTC (permalink / raw)
  To: linux-btrfs

Hi,

I've got a mostly inactive btrfs filesystem inside a virtual machine 
somewhere that shows interesting behaviour: while no interesting disk 
activity is going on, btrfs keeps allocating new chunks, a GiB at a time.

A picture, telling more than 1000 words:
https://syrinx.knorrie.org/~knorrie/btrfs/keep/btrfs_usage_ichiban.png
(when the amount of allocated/unused goes down, I did a btrfs balance)

Linux ichiban 4.5.0-0.bpo.1-amd64 #1 SMP Debian 4.5.1-1~bpo8+1 
(2016-04-20) x86_64 GNU/Linux

# btrfs fi show /
Label: none  uuid: 9881fc30-8f69-4069-a8c8-c057b842b0c4
	Total devices 1 FS bytes used 6.17GiB
	devid    1 size 20.00GiB used 16.54GiB path /dev/xvda

# btrfs fi df /
Data, single: total=15.01GiB, used=5.16GiB
System, single: total=32.00MiB, used=16.00KiB
Metadata, single: total=1.50GiB, used=1.01GiB
GlobalReserve, single: total=144.00MiB, used=0.00B

I'm a bit puzzled, since I haven't seen this happening on other 
filesystems that use 4.4 or 4.5 kernels.

If I dump the allocated chunks and their % usage, it's clear that the 
last 6 newly added ones have a usage of only a few percent.

dev item devid 1 total bytes 21474836480 bytes used 17758683136
chunk vaddr 12582912 type 1 stripe 0 devid 1 offset 12582912 length 
8388608 used 4276224 used_pct 50
chunk vaddr 1103101952 type 1 stripe 0 devid 1 offset 2185232384 length 
1073741824 used 433127424 used_pct 40
chunk vaddr 3250585600 type 1 stripe 0 devid 1 offset 4332716032 length 
1073741824 used 764391424 used_pct 71
chunk vaddr 9271508992 type 1 stripe 0 devid 1 offset 12079595520 length 
1073741824 used 270704640 used_pct 25
chunk vaddr 12492734464 type 1 stripe 0 devid 1 offset 13153337344 
length 1073741824 used 866574336 used_pct 80
chunk vaddr 13566476288 type 1 stripe 0 devid 1 offset 11005853696 
length 1073741824 used 1028059136 used_pct 95
chunk vaddr 14640218112 type 1 stripe 0 devid 1 offset 3258974208 length 
1073741824 used 762466304 used_pct 71
chunk vaddr 26250051584 type 1 stripe 0 devid 1 offset 19595788288 
length 1073741824 used 114982912 used_pct 10
chunk vaddr 31618760704 type 1 stripe 0 devid 1 offset 15300820992 
length 1073741824 used 488902656 used_pct 45
chunk vaddr 32692502528 type 4 stripe 0 devid 1 offset 5406457856 length 
268435456 used 209272832 used_pct 77
chunk vaddr 32960937984 type 4 stripe 0 devid 1 offset 5943328768 length 
268435456 used 251199488 used_pct 93
chunk vaddr 33229373440 type 4 stripe 0 devid 1 offset 7419723776 length 
268435456 used 248709120 used_pct 92
chunk vaddr 33497808896 type 4 stripe 0 devid 1 offset 8896118784 length 
268435456 used 247791616 used_pct 92
chunk vaddr 33766244352 type 4 stripe 0 devid 1 offset 8627683328 length 
268435456 used 93061120 used_pct 34
chunk vaddr 34303115264 type 2 stripe 0 devid 1 offset 6748635136 length 
33554432 used 16384 used_pct 0
chunk vaddr 34336669696 type 1 stripe 0 devid 1 offset 16374562816 
length 1073741824 used 105054208 used_pct 9
chunk vaddr 35410411520 type 1 stripe 0 devid 1 offset 20971520 length 
1073741824 used 10899456 used_pct 1
chunk vaddr 36484153344 type 1 stripe 0 devid 1 offset 1094713344 length 
1073741824 used 441778176 used_pct 41
chunk vaddr 37557895168 type 4 stripe 0 devid 1 offset 5674893312 length 
268435456 used 33439744 used_pct 12
chunk vaddr 37826330624 type 1 stripe 0 devid 1 offset 9164554240 length 
1073741824 used 32096256 used_pct 2
chunk vaddr 38900072448 type 1 stripe 0 devid 1 offset 14227079168 
length 1073741824 used 40140800 used_pct 3
chunk vaddr 39973814272 type 1 stripe 0 devid 1 offset 17448304640 
length 1073741824 used 58093568 used_pct 5
chunk vaddr 41047556096 type 1 stripe 0 devid 1 offset 18522046464 
length 1073741824 used 119701504 used_pct 11

The only things this host does are
  1) being a webserver for a small internal debian packages repository
  2) running low-volume mailman with a few lists, no archive-gzipping 
mega cronjobs or anything enabled.
  3) some little legacy php thingies

An interesting fact is that most of the 1GiB increases happen at the same 
time as the cron.daily runs. However, there are only a few standard things 
in there: an occasional package upgrade by unattended-upgrade, or some 
logrotate. The total contents of /var/log/ together are only 66MB... 
Graphs show less than about 100 MB of reads/writes in total around 
this time.

As you can see in the graph the amount of used space is even decreasing, 
because I cleaned up a bunch of old packages in the repository, and 
still, btrfs keeps allocating new data chunks like a hungry beast.

Why would this happen?

Hans van Kranenburg

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2016-05-06 21:28 btrfs filesystem keeps allocating new chunks for no apparent reason Hans van Kranenburg
@ 2016-05-30 11:07 ` Hans van Kranenburg
  2016-05-30 19:55   ` Duncan
  2016-05-31  1:36 ` Qu Wenruo
  1 sibling, 1 reply; 33+ messages in thread
From: Hans van Kranenburg @ 2016-05-30 11:07 UTC (permalink / raw)
  To: linux-btrfs

Hi,

Since this didn't get any follow-up, and since I'm bold enough to bump it 
one more time... :)

I really don't understand the behaviour I described. Does it ring a bell 
with anyone? This system is still allocating new 1GB data chunks every 1 
or 2 days without using them at all, and I have to use balance every 
week to get them away again.

Hans

On 05/06/2016 11:28 PM, Hans van Kranenburg wrote:
> Hi,
>
> I've got a mostly inactive btrfs filesystem inside a virtual machine
> somewhere that shows interesting behaviour: while no interesting disk
> activity is going on, btrfs keeps allocating new chunks, a GiB at a time.
>
> A picture, telling more than 1000 words:
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/btrfs_usage_ichiban.png
> (when the amount of allocated/unused goes down, I did a btrfs balance)
>
> Linux ichiban 4.5.0-0.bpo.1-amd64 #1 SMP Debian 4.5.1-1~bpo8+1
> (2016-04-20) x86_64 GNU/Linux
>
> # btrfs fi show /
> Label: none  uuid: 9881fc30-8f69-4069-a8c8-c057b842b0c4
>      Total devices 1 FS bytes used 6.17GiB
>      devid    1 size 20.00GiB used 16.54GiB path /dev/xvda
>
> # btrfs fi df /
> Data, single: total=15.01GiB, used=5.16GiB
> System, single: total=32.00MiB, used=16.00KiB
> Metadata, single: total=1.50GiB, used=1.01GiB
> GlobalReserve, single: total=144.00MiB, used=0.00B
>
> I'm a bit puzzled, since I haven't seen this happening on other
> filesystems that use 4.4 or 4.5 kernels.
>
> If I dump the allocated chunks and their % usage, it's clear that the
> last 6 new added ones have a usage of only a few percent.
>
> dev item devid 1 total bytes 21474836480 bytes used 17758683136
> chunk vaddr 12582912 type 1 stripe 0 devid 1 offset 12582912 length
> 8388608 used 4276224 used_pct 50
> chunk vaddr 1103101952 type 1 stripe 0 devid 1 offset 2185232384 length
> 1073741824 used 433127424 used_pct 40
> chunk vaddr 3250585600 type 1 stripe 0 devid 1 offset 4332716032 length
> 1073741824 used 764391424 used_pct 71
> chunk vaddr 9271508992 type 1 stripe 0 devid 1 offset 12079595520 length
> 1073741824 used 270704640 used_pct 25
> chunk vaddr 12492734464 type 1 stripe 0 devid 1 offset 13153337344
> length 1073741824 used 866574336 used_pct 80
> chunk vaddr 13566476288 type 1 stripe 0 devid 1 offset 11005853696
> length 1073741824 used 1028059136 used_pct 95
> chunk vaddr 14640218112 type 1 stripe 0 devid 1 offset 3258974208 length
> 1073741824 used 762466304 used_pct 71
> chunk vaddr 26250051584 type 1 stripe 0 devid 1 offset 19595788288
> length 1073741824 used 114982912 used_pct 10
> chunk vaddr 31618760704 type 1 stripe 0 devid 1 offset 15300820992
> length 1073741824 used 488902656 used_pct 45
> chunk vaddr 32692502528 type 4 stripe 0 devid 1 offset 5406457856 length
> 268435456 used 209272832 used_pct 77
> chunk vaddr 32960937984 type 4 stripe 0 devid 1 offset 5943328768 length
> 268435456 used 251199488 used_pct 93
> chunk vaddr 33229373440 type 4 stripe 0 devid 1 offset 7419723776 length
> 268435456 used 248709120 used_pct 92
> chunk vaddr 33497808896 type 4 stripe 0 devid 1 offset 8896118784 length
> 268435456 used 247791616 used_pct 92
> chunk vaddr 33766244352 type 4 stripe 0 devid 1 offset 8627683328 length
> 268435456 used 93061120 used_pct 34
> chunk vaddr 34303115264 type 2 stripe 0 devid 1 offset 6748635136 length
> 33554432 used 16384 used_pct 0
> chunk vaddr 34336669696 type 1 stripe 0 devid 1 offset 16374562816
> length 1073741824 used 105054208 used_pct 9
> chunk vaddr 35410411520 type 1 stripe 0 devid 1 offset 20971520 length
> 1073741824 used 10899456 used_pct 1
> chunk vaddr 36484153344 type 1 stripe 0 devid 1 offset 1094713344 length
> 1073741824 used 441778176 used_pct 41
> chunk vaddr 37557895168 type 4 stripe 0 devid 1 offset 5674893312 length
> 268435456 used 33439744 used_pct 12
> chunk vaddr 37826330624 type 1 stripe 0 devid 1 offset 9164554240 length
> 1073741824 used 32096256 used_pct 2
> chunk vaddr 38900072448 type 1 stripe 0 devid 1 offset 14227079168
> length 1073741824 used 40140800 used_pct 3
> chunk vaddr 39973814272 type 1 stripe 0 devid 1 offset 17448304640
> length 1073741824 used 58093568 used_pct 5
> chunk vaddr 41047556096 type 1 stripe 0 devid 1 offset 18522046464
> length 1073741824 used 119701504 used_pct 11
>
> The only things this host does is
>   1) being a webserver for a small internal debian packages repository
>   2) running low-volume mailman with a few lists, no archive-gzipping
> mega cronjobs or anything enabled.
>   3) some little legacy php thingies
>
> Interesting fact is that most of the 1GiB increases happen at the same
> time as cron.daily runs. However, there's only a few standard things in
> there. An occasional package upgrade by unattended-upgrade, or some
> logrotate. The total contents of /var/log/ together is only 66MB...
> Graphs show only less than about 100 MB reads/writes in total around
> this time.
>
> As you can see in the graph the amount of used space is even decreasing,
> because I cleaned up a bunch of old packages in the repository, and
> still, btrfs keeps allocating new data chunks like a hungry beast.
>
> Why would this happen?
>
> Hans van Kranenburg
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Hans van Kranenburg - System / Network Engineer
Mendix | Driving Digital Innovation | www.mendix.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2016-05-30 11:07 ` Hans van Kranenburg
@ 2016-05-30 19:55   ` Duncan
  2016-05-30 21:18     ` Hans van Kranenburg
  0 siblings, 1 reply; 33+ messages in thread
From: Duncan @ 2016-05-30 19:55 UTC (permalink / raw)
  To: linux-btrfs

Hans van Kranenburg posted on Mon, 30 May 2016 13:07:26 +0200 as
excerpted:

[Please don't post "upside down".  Reply in context under the quoted 
point, here the whole post, you're replying to.  It makes further replies 
in context far easier. =:^)  I've pasted your update at the bottom here.]

> On 05/06/2016 11:28 PM, Hans van Kranenburg wrote:
>>
>> I've got a mostly inactive btrfs filesystem inside a virtual machine
>> somewhere that shows interesting behaviour: while no interesting disk
>> activity is going on, btrfs keeps allocating new chunks, a GiB at a
>> time.
>>
>> A picture, telling more than 1000 words:
>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/btrfs_usage_ichiban.png
>> (when the amount of allocated/unused goes down, I did a btrfs balance)

Agreed, that shows something strange going on.

>> Linux ichiban 4.5.0-0.bpo.1-amd64 #1 SMP Debian 4.5.1-1~bpo8+1
>> (2016-04-20) x86_64 GNU/Linux

So the kernel is/was current...

>> # btrfs fi show /
>> Label: none  uuid: 9881fc30-8f69-4069-a8c8-c057b842b0c4
>>      Total devices 1 FS bytes used 6.17GiB
>>      devid    1 size 20.00GiB used 16.54GiB path /dev/xvda
>>
>> # btrfs fi df /
>> Data, single: total=15.01GiB, used=5.16GiB
>> System, single: total=32.00MiB, used=16.00KiB
>> Metadata, single: total=1.50GiB, used=1.01GiB
>> GlobalReserve, single: total=144.00MiB, used=0.00B
>>
>> I'm a bit puzzled, since I haven't seen this happening on other
>> filesystems that use 4.4 or 4.5 kernels.

Nor have I, either reported (save for you) or personally.

>> If I dump the allocated chunks and their % usage, it's clear that the
>> last 6 new added ones have a usage of only a few percent.

Snip the dump, but curious as a user (not a dev) what command you used.  
Presumably one of the debug commands which I'm not particularly familiar 
with, but I wasn't aware it was even possible.

>> The only things this host does is
>>   1) being a webserver for a small internal debian packages repository
>>   2) running low-volume mailman with a few lists, no archive-gzipping
>> mega cronjobs or anything enabled.
>>   3) some little legacy php thingies
>>
>> Interesting fact is that most of the 1GiB increases happen at the same
>> time as cron.daily runs. However, there's only a few standard things in
>> there. An occasional package upgrade by unattended-upgrade, or some
>> logrotate. The total contents of /var/log/ together is only 66MB...
>> Graphs show only less than about 100 MB reads/writes in total around
>> this time.

The cron.daily timing is interesting.  I'll come back to that below.

>> As you can see in the graph the amount of used space is even
>> decreasing, because I cleaned up a bunch of old packages in the
>> repository, and still, btrfs keeps allocating new data chunks like a
>> hungry beast.
>>
>> Why would this happen?

> since it got any followup and since I'm bold enough to bump it one more
> time... :)
> 
> I really don't understand the behaviour I described. Does it ring a bell
> with anyone? This system is still allocating new 1GB data chunks every 1
> or 2 days without using them at all, and I have to use balance every
> week to get them away again.

Honestly I can only guess, and it's a new guess I didn't think of the 
first time around, thus my lack of response the first time around.  But 
lacking anyone else replying with better theories, given that I do have a 
guess, I might as well put it out there.

Is it possible something in that daily cron allocates/writes a large but 
likely sparse file, perhaps a gig or more, probably fsyncing to lock the 
large size in place, and then truncates it to its actual size, which might 
be only a few kilobytes?

That sort of behavior could at least in theory trigger the behavior you 
describe, tho not being a dev and not being a Linux filesystem behavior 
expert by any means, I'm admittedly fuzzy on exactly what details might 
translate that theory into the reality you're seeing.
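
If you want to test that theory directly, a toy reproduction could look 
something like the following Python sketch (the path and sizes are made 
up, and fallocate is just one plausible interpretation of the pattern): 
create a large file, fsync so the size is committed, then truncate it 
down to a few KiB, and compare btrfs fi df before and after.

#!/usr/bin/env python3
# Toy reproduction of the large-then-truncated file theory (hypothetical
# path and sizes). Compare 'btrfs fi df' before and after running it.
import os

path = '/path/on/btrfs/large-then-tiny.tmp'   # hypothetical test location
fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
try:
    os.posix_fallocate(fd, 0, 1 << 30)   # reserve ~1 GiB up front
    os.fsync(fd)                          # lock the large size in place
    os.ftruncate(fd, 16 * 1024)           # then shrink it to 16 KiB
    os.fsync(fd)
finally:
    os.close(fd)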


In any event, my usual "brute force" approach to such mysteries is to 
bisect the problem space down until I know where the issue is.

First, try rescheduling your cron.daily run to a different time, and see 
if the behavior follows it, thus specifically tying it to something in 
that run.

Second, try either running all tasks it runs manually, checking which one 
triggers the problem, or if you have too many tasks for that to be 
convenient, split them into cron.daily1 and cron.daily2, scheduled at 
different times, bisecting the problem by seeing which one the behavior 
follows.

Repeat as needed until you've discovered the culprit, then examine 
exactly what it's doing to the filesystem.

And please report your results.  Besides satisfying my own personal 
curiosity, there's a fair chance someone else will have the same issue at 
some point and either post their own question, or discover this thread 
via google or whatever.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2016-05-30 19:55   ` Duncan
@ 2016-05-30 21:18     ` Hans van Kranenburg
  2016-05-30 21:55       ` Duncan
  0 siblings, 1 reply; 33+ messages in thread
From: Hans van Kranenburg @ 2016-05-30 21:18 UTC (permalink / raw)
  To: Duncan, linux-btrfs

On 05/30/2016 09:55 PM, Duncan wrote:
> Hans van Kranenburg posted on Mon, 30 May 2016 13:07:26 +0200 as
> excerpted:
>
> [Please don't post "upside down".  Reply in context under the quoted
> point, here the whole post, you're replying to.  It makes further replies
> in context far easier. =:^)  I've pasted your update at the bottom here.]

Sure, thanks.

>> On 05/06/2016 11:28 PM, Hans van Kranenburg wrote:
>>>
>>> I've got a mostly inactive btrfs filesystem inside a virtual machine
>>> somewhere that shows interesting behaviour: while no interesting disk
>>> activity is going on, btrfs keeps allocating new chunks, a GiB at a
>>> time.
>>>
>>> A picture, telling more than 1000 words:
>>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/btrfs_usage_ichiban.png
>>> (when the amount of allocated/unused goes down, I did a btrfs balance)
>
> Agreed, that shows something strange going on.
>
>>> Linux ichiban 4.5.0-0.bpo.1-amd64 #1 SMP Debian 4.5.1-1~bpo8+1
>>> (2016-04-20) x86_64 GNU/Linux
>
> So the kernel is/was current...

Running a slightly newer one now:

Linux ichiban 4.5.0-0.bpo.2-amd64 #1 SMP Debian 4.5.4-1~bpo8+1 
(2016-05-13) x86_64

>>> # btrfs fi show /
>>> Label: none  uuid: 9881fc30-8f69-4069-a8c8-c057b842b0c4
>>>       Total devices 1 FS bytes used 6.17GiB
>>>       devid    1 size 20.00GiB used 16.54GiB path /dev/xvda
>>>
>>> # btrfs fi df /
>>> Data, single: total=15.01GiB, used=5.16GiB
>>> System, single: total=32.00MiB, used=16.00KiB
>>> Metadata, single: total=1.50GiB, used=1.01GiB
>>> GlobalReserve, single: total=144.00MiB, used=0.00B
>>>
>>> I'm a bit puzzled, since I haven't seen this happening on other
>>> filesystems that use 4.4 or 4.5 kernels.
>
> Nor have I, either reported (save for you) or personally.
>
>>> If I dump the allocated chunks and their % usage, it's clear that the
>>> last 6 new added ones have a usage of only a few percent.
>
> Snip the dump, but curious as a user (not a dev) what command you used.
> Presumably one of the debug commands which I'm not particularly familiar
> with, but I wasn't aware it was even possible.

It's the output of a little programming exercise calling the search 
ioctl from python. https://github.com/knorrie/btrfs-heatmap

While using balance I got interested in knowing where balance gets the 
information about how much % of a chunk is used. I want to see that list 
in advance, so I can see which -dusage value would be the most effective. 
My munin graphs show the stacked total value, which does not give you an 
idea of how badly the unused space is fragmented over the already 
allocated chunks.
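
For reference, a minimal self-contained sketch of that kind of chunk 
listing (not the actual show_usage.py; struct layouts follow the kernel 
btrfs UAPI headers, and it needs root on the mounted filesystem) looks 
roughly like this:

#!/usr/bin/env python3
# Sketch: list BLOCK_GROUP_ITEMs from the extent tree via the TREE_SEARCH
# ioctl and print per-chunk usage. Flags are printed raw; with single
# profiles 1 = data, 2 = system, 4 = metadata.
import fcntl
import os
import struct
import sys

BTRFS_IOC_TREE_SEARCH = 0xD0009411    # _IOWR(0x94, 17, 4096-byte search args)
EXTENT_TREE_OBJECTID = 2
BLOCK_GROUP_ITEM_KEY = 192
KEY_FMT = '=7Q4L4Q'                   # struct btrfs_ioctl_search_key (104 bytes)
HDR_FMT = '=3Q2L'                     # struct btrfs_ioctl_search_header (32 bytes)
BUF_SIZE = 4096 - struct.calcsize(KEY_FMT)

def block_groups(fd):
    """Yield (vaddr, length, used, flags) for every block group (chunk)."""
    min_objectid = 0
    while True:
        key = struct.pack(KEY_FMT, EXTENT_TREE_OBJECTID,
                          min_objectid, 2**64 - 1,   # objectid = chunk vaddr
                          0, 2**64 - 1,              # offset = chunk length
                          0, 2**64 - 1,              # any transid
                          BLOCK_GROUP_ITEM_KEY, BLOCK_GROUP_ITEM_KEY,
                          4096, 0, 0, 0, 0, 0)       # nr_items, unused fields
        args = bytearray(key + bytes(BUF_SIZE))
        fcntl.ioctl(fd, BTRFS_IOC_TREE_SEARCH, args)
        nr_items = struct.unpack_from('=L', args, 7 * 8 + 2 * 4)[0]
        if nr_items == 0:
            return
        pos = struct.calcsize(KEY_FMT)
        for _ in range(nr_items):
            _, objectid, offset, type_, item_len = \
                struct.unpack_from(HDR_FMT, args, pos)
            pos += struct.calcsize(HDR_FMT)
            if type_ == BLOCK_GROUP_ITEM_KEY:
                # struct btrfs_block_group_item: used, chunk_objectid, flags
                used, _, flags = struct.unpack_from('=3Q', args, pos)
                yield objectid, offset, used, flags
            pos += item_len
            # A robust version would continue from the full last key
            # (objectid, type, offset) instead of just bumping the objectid.
            min_objectid = objectid + 1

if __name__ == '__main__':
    fd = os.open(sys.argv[1] if len(sys.argv) > 1 else '/', os.O_RDONLY)
    for vaddr, length, used, flags in block_groups(fd):
        print('chunk vaddr %d type %d length %d used %d used_pct %d'
              % (vaddr, flags, length, used, 100 * used // length))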

So, with some help from Hugo on IRC to get started, I ended up with this 
PoC, which can create nice movies of your data moving around over the 
physical space of the filesystem over time, like this one:

https://syrinx.knorrie.org/~knorrie/btrfs/heatmap.gif

Seeing the chunk allocator work its way around the two devices, choosing 
the one with the most free space, and reusing the gaps left by balance 
is super interesting. :-]

>>> The only things this host does is
>>>    1) being a webserver for a small internal debian packages repository
>>>    2) running low-volume mailman with a few lists, no archive-gzipping
>>> mega cronjobs or anything enabled.
>>>    3) some little legacy php thingies
>>>
>>> Interesting fact is that most of the 1GiB increases happen at the same
>>> time as cron.daily runs. However, there's only a few standard things in
>>> there. An occasional package upgrade by unattended-upgrade, or some
>>> logrotate. The total contents of /var/log/ together is only 66MB...
>>> Graphs show only less than about 100 MB reads/writes in total around
>>> this time.
>
> The cron.daily timing is interesting.  I'll come back to that below.

Well, it obviously has a very large sign saying "LOOK HERE" directly 
next to it, yes.

>>> As you can see in the graph the amount of used space is even
>>> decreasing, because I cleaned up a bunch of old packages in the
>>> repository, and still, btrfs keeps allocating new data chunks like a
>>> hungry beast.
>>>
>>> Why would this happen?
>
>> since it got any followup and since I'm bold enough to bump it one more
>> time... :)
>>
>> I really don't understand the behaviour I described. Does it ring a bell
>> with anyone? This system is still allocating new 1GB data chunks every 1
>> or 2 days without using them at all, and I have to use balance every
>> week to get them away again.
>
> Honestly I can only guess, and it's a new guess I didn't think of the
> first time around, thus my lack of response the first time around.  But
> lacking anyone else replying with better theories, given that I do have a
> guess, I might as well put it out there.
>
> Is it possible something in that daily cron allocates/writes a large but
> likely spare file, perhaps a gig or more, probably fsyncing to lock the
> large size in place, then truncates it to actual size, which might be
> only a few kilobytes?
>
> That's sort of behavior could at least in theory trigger the behavior you
> describe, tho not being a dev and not being a Linux filesystem behavior
> expert by any means, I'm admittedly fuzzy on exactly what details might
> translate that theory into the reality you're seeing.

Interesting thought.

> In any event, my usual "brute force" approach to such mysteries is to
> bisect the problem space down until I know where the issue is.
>
> First, try rescheduling your cron.daily run to a different time, and see
> if the behavior follows it, thus specifically tying it to something in
> that run.

Yes.

> Second, try either running all tasks it runs manually, checking which one
> triggers the problem, or if you have too many tasks for that to be
> convenient, split them into cron.daily1 and cron.daily2, scheduled at
> different times, bisecting the problem by seeing which one the behavior
> follows.

The list is super super standard:

/etc/cron.daily 0-$ ll
total 52
-rwxr-xr-x 1 root root   744 Jan 25 19:24 000-delay
-rwxr-xr-x 1 root root   625 Nov 28  2015 apache2
-rwxr-xr-x 1 root root 15000 Jun 10  2015 apt
-rwxr-xr-x 1 root root   355 Oct 17  2014 bsdmainutils
-rwxr-xr-x 1 root root  1597 Apr 10  2015 dpkg
-rwxr-xr-x 1 root root   589 Oct 14  2015 etckeeper
-rwxr-xr-x 1 root root    89 Nov  8  2014 logrotate
-rwxr-xr-x 1 root root  1293 Dec 31  2014 man-db
-rwxr-xr-x 1 root root  1110 Oct 28  2015 ntp
-rwxr-xr-x 1 root root   249 Nov 20  2014 passwd

> Repeat as needed until you've discovered the culprit, then examine
> exactly what it's doing to the filesystem.

I'll try a few things.

> And please report your results.  Besides satisfying my own personal
> curiosity, there's a fair chance someone else will have the same issue at
> some point and either post their own question, or discover this thread
> via google or whatever.

We'll see. To be continued. Thanks for the feedback already!

-- 
Hans van Kranenburg - System / Network Engineer
T +31 (0)10 2760434 | hans.van.kranenburg@mendix.com | www.mendix.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2016-05-30 21:18     ` Hans van Kranenburg
@ 2016-05-30 21:55       ` Duncan
  0 siblings, 0 replies; 33+ messages in thread
From: Duncan @ 2016-05-30 21:55 UTC (permalink / raw)
  To: linux-btrfs

Hans van Kranenburg posted on Mon, 30 May 2016 23:18:20 +0200 as
excerpted:

>> Snip the dump, but curious as a user (not a dev) what command you used.
>> Presumably one of the debug commands which I'm not particularly
>> familiar with, but I wasn't aware it was even possible.
> 
> It's the output of a little programming exercise calling the search
> ioctl from python. https://github.com/knorrie/btrfs-heatmap
> 
> While using balance I got interested in knowing where balance got the
> information from to find how much % a chunk is used. I want to see that
> list in advance, so I can see what -dusage the most effective would be.
> My munin graphs show the stacked total value, which does not give you an
> idea about how badly the unused space is fragmented over already
> allocated chunks.
> 
> So, with some help of Hugo on IRC to get started, I ended up with this
> PoC, which can create nice movies of your data moving around over the
> physical space of the filesystem over time, like this one:
> 
> https://syrinx.knorrie.org/~knorrie/btrfs/heatmap.gif
> 
> Seeing the chunk allocator work its way around the two devices, choosing
> the one with the most free space, and reusing the gaps left by balance
> is super interesting. :-]

Very cool indeed.  Reminds me of the nice eye candy dynamic graphics 
that MS defrag had back in 9x times.  (I've no idea what they have now as 
I've been off the platform for a decade and a half now.)

I may have to play with it a bit, when I have more time (I'm moving in a 
couple days...).

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2016-05-06 21:28 btrfs filesystem keeps allocating new chunks for no apparent reason Hans van Kranenburg
  2016-05-30 11:07 ` Hans van Kranenburg
@ 2016-05-31  1:36 ` Qu Wenruo
  2016-06-08 23:10   ` Hans van Kranenburg
  2017-04-07 21:25   ` Hans van Kranenburg
  1 sibling, 2 replies; 33+ messages in thread
From: Qu Wenruo @ 2016-05-31  1:36 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs



Hans van Kranenburg wrote on 2016/05/06 23:28 +0200:
> Hi,
>
> I've got a mostly inactive btrfs filesystem inside a virtual machine
> somewhere that shows interesting behaviour: while no interesting disk
> activity is going on, btrfs keeps allocating new chunks, a GiB at a time.
>
> A picture, telling more than 1000 words:
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/btrfs_usage_ichiban.png
> (when the amount of allocated/unused goes down, I did a btrfs balance)

Nice picture.
Really better than 1000 words.

AFAIK, the problem may be caused by fragments.

I even saw some early prototypes in the code to allow btrfs to do 
allocations of smaller extents than required.
(E.g. the caller needs a 2M extent, but btrfs returns two 1M extents.)

But it's still a prototype and it seems no one is really working on it now.

So when btrfs is writing new data, for example about 16M of data, it 
will need to allocate a 16M contiguous extent, and if it can't find a 
large enough free space, it will create a new data chunk.

Besides the already awesome chunk-level usage picture, I hope there is 
info about extent-level allocation to confirm my assumption.

You could dump it by calling "btrfs-debug-tree -t 2 <device>".
It's normally recommended to do that unmounted, but it's still possible 
to call it on a mounted filesystem, although the result is not 100% 
accurate then.
(Then I'd better find a good way to draw a picture of 
allocated/unallocated space and how fragmented the chunks are.)

Thanks,
Qu
>
> Linux ichiban 4.5.0-0.bpo.1-amd64 #1 SMP Debian 4.5.1-1~bpo8+1
> (2016-04-20) x86_64 GNU/Linux
>
> # btrfs fi show /
> Label: none  uuid: 9881fc30-8f69-4069-a8c8-c057b842b0c4
>     Total devices 1 FS bytes used 6.17GiB
>     devid    1 size 20.00GiB used 16.54GiB path /dev/xvda
>
> # btrfs fi df /
> Data, single: total=15.01GiB, used=5.16GiB
> System, single: total=32.00MiB, used=16.00KiB
> Metadata, single: total=1.50GiB, used=1.01GiB
> GlobalReserve, single: total=144.00MiB, used=0.00B
>
> I'm a bit puzzled, since I haven't seen this happening on other
> filesystems that use 4.4 or 4.5 kernels.
>
> If I dump the allocated chunks and their % usage, it's clear that the
> last 6 new added ones have a usage of only a few percent.
>
> dev item devid 1 total bytes 21474836480 bytes used 17758683136
> chunk vaddr 12582912 type 1 stripe 0 devid 1 offset 12582912 length
> 8388608 used 4276224 used_pct 50
> chunk vaddr 1103101952 type 1 stripe 0 devid 1 offset 2185232384 length
> 1073741824 used 433127424 used_pct 40
> chunk vaddr 3250585600 type 1 stripe 0 devid 1 offset 4332716032 length
> 1073741824 used 764391424 used_pct 71
> chunk vaddr 9271508992 type 1 stripe 0 devid 1 offset 12079595520 length
> 1073741824 used 270704640 used_pct 25
> chunk vaddr 12492734464 type 1 stripe 0 devid 1 offset 13153337344
> length 1073741824 used 866574336 used_pct 80
> chunk vaddr 13566476288 type 1 stripe 0 devid 1 offset 11005853696
> length 1073741824 used 1028059136 used_pct 95
> chunk vaddr 14640218112 type 1 stripe 0 devid 1 offset 3258974208 length
> 1073741824 used 762466304 used_pct 71
> chunk vaddr 26250051584 type 1 stripe 0 devid 1 offset 19595788288
> length 1073741824 used 114982912 used_pct 10
> chunk vaddr 31618760704 type 1 stripe 0 devid 1 offset 15300820992
> length 1073741824 used 488902656 used_pct 45
> chunk vaddr 32692502528 type 4 stripe 0 devid 1 offset 5406457856 length
> 268435456 used 209272832 used_pct 77
> chunk vaddr 32960937984 type 4 stripe 0 devid 1 offset 5943328768 length
> 268435456 used 251199488 used_pct 93
> chunk vaddr 33229373440 type 4 stripe 0 devid 1 offset 7419723776 length
> 268435456 used 248709120 used_pct 92
> chunk vaddr 33497808896 type 4 stripe 0 devid 1 offset 8896118784 length
> 268435456 used 247791616 used_pct 92
> chunk vaddr 33766244352 type 4 stripe 0 devid 1 offset 8627683328 length
> 268435456 used 93061120 used_pct 34
> chunk vaddr 34303115264 type 2 stripe 0 devid 1 offset 6748635136 length
> 33554432 used 16384 used_pct 0
> chunk vaddr 34336669696 type 1 stripe 0 devid 1 offset 16374562816
> length 1073741824 used 105054208 used_pct 9
> chunk vaddr 35410411520 type 1 stripe 0 devid 1 offset 20971520 length
> 1073741824 used 10899456 used_pct 1
> chunk vaddr 36484153344 type 1 stripe 0 devid 1 offset 1094713344 length
> 1073741824 used 441778176 used_pct 41
> chunk vaddr 37557895168 type 4 stripe 0 devid 1 offset 5674893312 length
> 268435456 used 33439744 used_pct 12
> chunk vaddr 37826330624 type 1 stripe 0 devid 1 offset 9164554240 length
> 1073741824 used 32096256 used_pct 2
> chunk vaddr 38900072448 type 1 stripe 0 devid 1 offset 14227079168
> length 1073741824 used 40140800 used_pct 3
> chunk vaddr 39973814272 type 1 stripe 0 devid 1 offset 17448304640
> length 1073741824 used 58093568 used_pct 5
> chunk vaddr 41047556096 type 1 stripe 0 devid 1 offset 18522046464
> length 1073741824 used 119701504 used_pct 11
>
> The only things this host does is
>  1) being a webserver for a small internal debian packages repository
>  2) running low-volume mailman with a few lists, no archive-gzipping
> mega cronjobs or anything enabled.
>  3) some little legacy php thingies
>
> Interesting fact is that most of the 1GiB increases happen at the same
> time as cron.daily runs. However, there's only a few standard things in
> there. An occasional package upgrade by unattended-upgrade, or some
> logrotate. The total contents of /var/log/ together is only 66MB...
> Graphs show only less than about 100 MB reads/writes in total around
> this time.
>
> As you can see in the graph the amount of used space is even decreasing,
> because I cleaned up a bunch of old packages in the repository, and
> still, btrfs keeps allocating new data chunks like a hungry beast.
>
> Why would this happen?
>
> Hans van Kranenburg
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2016-05-31  1:36 ` Qu Wenruo
@ 2016-06-08 23:10   ` Hans van Kranenburg
  2016-06-09  8:52     ` Marc Haber
                       ` (2 more replies)
  2017-04-07 21:25   ` Hans van Kranenburg
  1 sibling, 3 replies; 33+ messages in thread
From: Hans van Kranenburg @ 2016-06-08 23:10 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Hi list,

On 05/31/2016 03:36 AM, Qu Wenruo wrote:
>
>
> Hans van Kranenburg wrote on 2016/05/06 23:28 +0200:
>> Hi,
>>
>> I've got a mostly inactive btrfs filesystem inside a virtual machine
>> somewhere that shows interesting behaviour: while no interesting disk
>> activity is going on, btrfs keeps allocating new chunks, a GiB at a time.
>>
>> A picture, telling more than 1000 words:
>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/btrfs_usage_ichiban.png
>> (when the amount of allocated/unused goes down, I did a btrfs balance)
>
> Nice picture.
> Really better than 1000 words.
>
> AFAIK, the problem may be caused by fragments.
>
> And even I saw some early prototypes inside the codes to allow btrfs do
> allocation smaller extent than required.
> (E.g. caller needs 2M extent, but btrfs returns 2 1M extents)
>
> But it's still prototype and seems no one is really working on it now.
>
> So when btrfs is writing new data, for example, to write about 16M data,
> it will need to allocate a 16M continuous extent, and if it can't find
> large enough space to allocate, then create a new data chunk.
>
> Despite the already awesome chunk level usage picture, I hope there is
> info about extent level allocation to confirm my assumption.
>
> You could dump it by calling "btrfs-debug-tree -t 2 <device>".
> It's normally recommended to do it unmounted, but it's still possible to
> call it mounted, although not 100% perfect though.
> (Then I'd better find a good way to draw a picture of
> allocate/unallocate space and how fragments the chunks are)

So, I finally found some spare time to continue investigating. In the 
meantime, the filesystem has happily been allocating new chunks every 
few days, filling them with data to way below 10% before starting a new one.

The chunk allocation primarily seems to happen during cron.daily. But, 
manually executing all the cronjobs that are in there, even multiple 
times, does not result in newly allocated chunks. Yay. :(

After the previous post, I put a little script in between every two jobs 
in /etc/cron.daily that prints the output of btrfs fi df to syslog and 
sleeps for 10 minutes so I can easily find out afterwards during which 
one it happened.
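
A hypothetical stand-in for that little script (the real one wasn't 
posted) could be as simple as:

#!/usr/bin/env python3
# Log the current 'btrfs fi df /' output to syslog, then sleep 10 minutes,
# so a jump in allocated space can be attributed to the job that ran just
# before it.
import subprocess
import syslog
import time

df = subprocess.run(['btrfs', 'fi', 'df', '/'],
                    capture_output=True, text=True, check=True).stdout
for line in df.splitlines():
    syslog.syslog(syslog.LOG_INFO, line)
time.sleep(600)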

Bingo! The "apt" cron.daily, which refreshes package lists and triggers 
unattended-upgrades.

Jun 7 04:01:46 ichiban root: Data, single: total=12.00GiB, used=5.65GiB
[...]
2016-06-07 04:01:56,552 INFO Starting unattended upgrades script
[...]
Jun 7 04:12:10 ichiban root: Data, single: total=13.00GiB, used=5.64GiB

And, this thing is clever enough to do things only once a day, even if 
you execute it multiple times... (Hehehe...)

Ok, let's try doing some apt-get update then.

Today, the latest added chunks look like this:

# ./show_usage.py /
[...]
chunk vaddr 63495471104 type 1 stripe 0 devid 1 offset 9164554240 length 
1073741824 used 115499008 used_pct 10
chunk vaddr 64569212928 type 1 stripe 0 devid 1 offset 12079595520 
length 1073741824 used 36585472 used_pct 3
chunk vaddr 65642954752 type 1 stripe 0 devid 1 offset 14227079168 
length 1073741824 used 17510400 used_pct 1
chunk vaddr 66716696576 type 4 stripe 0 devid 1 offset 3275751424 length 
268435456 used 72663040 used_pct 27
chunk vaddr 66985132032 type 1 stripe 0 devid 1 offset 15300820992 
length 1073741824 used 86986752 used_pct 8
chunk vaddr 68058873856 type 1 stripe 0 devid 1 offset 16374562816 
length 1073741824 used 21188608 used_pct 1
chunk vaddr 69132615680 type 1 stripe 0 devid 1 offset 17448304640 
length 1073741824 used 64032768 used_pct 5
chunk vaddr 70206357504 type 1 stripe 0 devid 1 offset 18522046464 
length 1073741824 used 71712768 used_pct 6

Now I apt-get update...

before: Data, single: total=13.00GiB, used=5.64GiB
during: Data, single: total=13.00GiB, used=5.59GiB
after : Data, single: total=14.00GiB, used=5.64GiB

# ./show_usage.py /
[...]
chunk vaddr 63495471104 type 1 stripe 0 devid 1 offset 9164554240 length 
1073741824 used 119279616 used_pct 11
chunk vaddr 64569212928 type 1 stripe 0 devid 1 offset 12079595520 
length 1073741824 used 36585472 used_pct 3
chunk vaddr 65642954752 type 1 stripe 0 devid 1 offset 14227079168 
length 1073741824 used 17510400 used_pct 1
chunk vaddr 66716696576 type 4 stripe 0 devid 1 offset 3275751424 length 
268435456 used 73170944 used_pct 27
chunk vaddr 66985132032 type 1 stripe 0 devid 1 offset 15300820992 
length 1073741824 used 82251776 used_pct 7
chunk vaddr 68058873856 type 1 stripe 0 devid 1 offset 16374562816 
length 1073741824 used 21188608 used_pct 1
chunk vaddr 69132615680 type 1 stripe 0 devid 1 offset 17448304640 
length 1073741824 used 6041600 used_pct 0
chunk vaddr 70206357504 type 1 stripe 0 devid 1 offset 18522046464 
length 1073741824 used 46178304 used_pct 4
chunk vaddr 71280099328 type 1 stripe 0 devid 1 offset 19595788288 
length 1073741824 used 84770816 used_pct 7

Interesting. There's a new one at 71280099328, 7% filled, and the usage 
of the 4 previous ones went down a bit.

Now I want to know the distribution of data inside these chunks, to 
find out how fragmented it might be, so I spent some time this evening 
playing a bit more with the search ioctl, listing all extents and free 
space inside a chunk:

https://github.com/knorrie/btrfs-heatmap/blob/master/chunk-contents.py

Currently the output looks like this:

# ./chunk-contents.py 70206357504 .
chunk vaddr 70206357504 length 1073741824
0x1058a00000 0x105a0cafff  23900160 2.23%
0x105a0cb000 0x105a0cbfff      4096 0.00% extent
0x105a0cc000 0x105a12ffff    409600 0.04%
0x105a130000 0x105a130fff      4096 0.00% extent
0x105a131000 0x105a21dfff    970752 0.09%
0x105a21e000 0x105a220fff     12288 0.00% extent
0x105a221000 0x105a222fff      8192 0.00% extent
0x105a223000 0x105a224fff      8192 0.00% extent
0x105a225000 0x105a225fff      4096 0.00% extent
0x105a226000 0x105a226fff      4096 0.00% extent
0x105a227000 0x105a227fff      4096 0.00% extent
0x105a228000 0x105a2c3fff    638976 0.06%
0x105a2c4000 0x105a2c5fff      8192 0.00% extent
0x105a2c6000 0x105a317fff    335872 0.03%
0x105a318000 0x105a31efff     28672 0.00% extent
0x105a31f000 0x105a3affff    593920 0.06%
0x105a3b0000 0x105a3b2fff     12288 0.00% extent
0x105a3b3000 0x105a3b6fff     16384 0.00%
0x105a3b7000 0x105a3bbfff     20480 0.00% extent
0x105a3bc000 0x105a3e2fff    159744 0.01%
0x105a3e3000 0x105a3e3fff      4096 0.00% extent
0x105a3e4000 0x105a3e4fff      4096 0.00% extent
0x105a3e5000 0x105a468fff    540672 0.05%
0x105a469000 0x105a46cfff     16384 0.00% extent
0x105a46d000 0x105a493fff    159744 0.01%
0x105a494000 0x105a495fff      8192 0.00% extent
0x105a496000 0x105a49afff     20480 0.00%
[...]
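
For reference, a rough sketch of how such a listing can be produced (not 
the actual chunk-contents.py, and without the percentage column): walk 
the EXTENT_ITEMs of the extent tree between the chunk's vaddr and 
vaddr + length via the TREE_SEARCH ioctl, and print the gaps between 
them as free space. Same plumbing as the earlier sketch, needs root:

#!/usr/bin/env python3
# Sketch: list extents and free gaps inside one chunk, given its vaddr and
# length (and optionally a path on the mounted filesystem).
import fcntl
import os
import struct
import sys

BTRFS_IOC_TREE_SEARCH = 0xD0009411
EXTENT_TREE_OBJECTID = 2
EXTENT_ITEM_KEY = 168
KEY_FMT = '=7Q4L4Q'
HDR_FMT = '=3Q2L'
BUF_SIZE = 4096 - struct.calcsize(KEY_FMT)

def extents(fd, vaddr, length):
    """Yield (start, length) for every EXTENT_ITEM inside the chunk."""
    min_objectid = vaddr
    while True:
        key = struct.pack(KEY_FMT, EXTENT_TREE_OBJECTID,
                          min_objectid, vaddr + length - 1,
                          0, 2**64 - 1, 0, 2**64 - 1,
                          EXTENT_ITEM_KEY, EXTENT_ITEM_KEY,
                          4096, 0, 0, 0, 0, 0)
        args = bytearray(key + bytes(BUF_SIZE))
        fcntl.ioctl(fd, BTRFS_IOC_TREE_SEARCH, args)
        nr_items = struct.unpack_from('=L', args, 7 * 8 + 2 * 4)[0]
        if nr_items == 0:
            return
        pos = struct.calcsize(KEY_FMT)
        for _ in range(nr_items):
            _, objectid, offset, type_, item_len = \
                struct.unpack_from(HDR_FMT, args, pos)
            pos += struct.calcsize(HDR_FMT) + item_len
            if type_ == EXTENT_ITEM_KEY:
                yield objectid, offset    # extent start vaddr, extent length
            min_objectid = objectid + 1

if __name__ == '__main__':
    chunk_vaddr, chunk_length = int(sys.argv[1]), int(sys.argv[2])
    fd = os.open(sys.argv[3] if len(sys.argv) > 3 else '/', os.O_RDONLY)
    cur = chunk_vaddr
    for start, ext_len in extents(fd, chunk_vaddr, chunk_length):
        if start > cur:
            print('0x%x 0x%x %10d' % (cur, start - 1, start - cur))
        print('0x%x 0x%x %10d extent' % (start, start + ext_len - 1, ext_len))
        cur = start + ext_len
    end = chunk_vaddr + chunk_length
    if cur < end:
        print('0x%x 0x%x %10d' % (cur, end - 1, end - cur))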

After running apt-get update a few extra times, only the last (new) 
chunk keeps changing a bit, and stabilizes around 10% usage:

chunk vaddr 71280099328 type 1 stripe 0 devid 1 offset 19595788288 
length 1073741824 used 112271360 used_pct 10

chunk vaddr 71280099328 length 1073741824
0x1098a00000 0x109e00dfff  90234880 8.40%
0x109e00e000 0x109e00efff      4096 0.00% extent
0x109e00f000 0x109e00ffff      4096 0.00% extent
0x109e010000 0x109e010fff      4096 0.00%
0x109e011000 0x109e011fff      4096 0.00% extent
0x109e012000 0x109e342fff   3346432 0.31%
0x109e343000 0x109e344fff      8192 0.00% extent
0x109e345000 0x109e47cfff   1277952 0.12%
0x109e47d000 0x109e47efff      8192 0.00% extent
0x109e47f000 0x109e480fff      8192 0.00%
0x109e481000 0x109e482fff      8192 0.00% extent
0x109e483000 0x109e484fff      8192 0.00% extent
0x109e485000 0x109e48afff     24576 0.00% extent
0x109e48b000 0x109e48cfff      8192 0.00%
0x109e48d000 0x109e48efff      8192 0.00% extent
0x109e48f000 0x109e490fff      8192 0.00%
0x109e491000 0x109e492fff      8192 0.00% extent
0x109e493000 0x109e493fff      4096 0.00% extent
0x109e494000 0x109eb00fff   6737920 0.63%
0x109eb01000 0x109eb10fff     65536 0.01% extent
0x109eb11000 0x109ebc0fff    720896 0.07%
0x109ebc1000 0x109ec00fff    262144 0.02% extent
0x109ec01000 0x109ecc4fff    802816 0.07%

Full output at 
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2016-06-08-extents.txt

Free space is extremely fragmented. The last one, which just got filled 
a bit by apt-get update, looks better, with a few blocks of free space 
of up to 25%, but the previous ones are a mess.

So, instead of being the cause, apt-get update triggering a new chunk 
allocation might just as well be the result of the existing ones already 
being filled up with too many fragments.

The next question is what files these extents belong to. To find out, I 
need to open up the extent items I get back and follow a backreference 
to an inode object. Might do that tomorrow, fun.

To be honest, I suspect /var/log and/or the file storage of mailman to 
be the cause of the fragmentation, since there's logging from postfix, 
mailman and nginx going on all day long at a slow but steady pace. 
While we use btrfs for a number of use cases at work now, we normally 
don't use it for the root filesystem. And the cases where it is used as 
the root filesystem don't do much logging or mail.

And no, autodefrag is not in the mount options currently. Would that be 
helpful in this case?

-- 
Hans van Kranenburg - System / Network Engineer
T +31 (0)10 2760434 | hans.van.kranenburg@mendix.com | www.mendix.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2016-06-08 23:10   ` Hans van Kranenburg
@ 2016-06-09  8:52     ` Marc Haber
  2016-06-09 10:37       ` Hans van Kranenburg
  2016-06-09 15:41     ` Duncan
  2016-06-09 18:07     ` Chris Murphy
  2 siblings, 1 reply; 33+ messages in thread
From: Marc Haber @ 2016-06-09  8:52 UTC (permalink / raw)
  To: linux-btrfs

On Thu, Jun 09, 2016 at 01:10:46AM +0200, Hans van Kranenburg wrote:
> So, instead of being the cause, apt-get update causing a new chunk to be
> allocated might as well be the result of existing ones already filled up
> with too many fragments.
> 
> The next question is what files these extents belong to. To find out, I need
> to open up the extent items I get back and follow a backreference to an
> inode object. Might do that tomorrow, fun.

Does your apt use pdiffs to update the packages lists? If yes, I'd try
turning it off just for the fun of it and to see whether this changes
btrfs' allocation behavior. I have never looked at apt's pdiff stuff
in detail, but I guess that it creates many tiny temporary files.

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2016-06-09  8:52     ` Marc Haber
@ 2016-06-09 10:37       ` Hans van Kranenburg
  0 siblings, 0 replies; 33+ messages in thread
From: Hans van Kranenburg @ 2016-06-09 10:37 UTC (permalink / raw)
  To: Marc Haber, linux-btrfs

On 06/09/2016 10:52 AM, Marc Haber wrote:
> On Thu, Jun 09, 2016 at 01:10:46AM +0200, Hans van Kranenburg wrote:
>> So, instead of being the cause, apt-get update causing a new chunk to be
>> allocated might as well be the result of existing ones already filled up
>> with too many fragments.
>>
>> The next question is what files these extents belong to. To find out, I need
>> to open up the extent items I get back and follow a backreference to an
>> inode object. Might do that tomorrow, fun.
>
> Does your apt use pdiffs to update the packages lists? If yes, I'd try
> turning it off just for the fun of it and to see whether this changes
> btrfs' allocation behavior. I have never looked at apt's pdiff stuff
> in detail, but I guess that it creates many tiny temporary files.

No, it does not:

Acquire::Pdiffs "false";

-- 
Hans van Kranenburg - System / Network Engineer
Mendix | Driving Digital Innovation | www.mendix.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2016-06-08 23:10   ` Hans van Kranenburg
  2016-06-09  8:52     ` Marc Haber
@ 2016-06-09 15:41     ` Duncan
  2016-06-10 17:07       ` Henk Slager
  2016-06-09 18:07     ` Chris Murphy
  2 siblings, 1 reply; 33+ messages in thread
From: Duncan @ 2016-06-09 15:41 UTC (permalink / raw)
  To: linux-btrfs

Hans van Kranenburg posted on Thu, 09 Jun 2016 01:10:46 +0200 as
excerpted:

> The next question is what files these extents belong to. To find out, I
> need to open up the extent items I get back and follow a backreference
> to an inode object. Might do that tomorrow, fun.
> 
> To be honest, I suspect /var/log and/or the file storage of mailman to
> be the cause of the fragmentation, since there's logging from postfix,
> mailman and nginx going on all day long in a slow but steady tempo.
> While using btrfs for a number of use cases at work now, we normally
> don't use it for the root filesystem. And the cases where it's used as
> root filesystem don't do much logging or mail.

FWIW, that's one reason I have a dedicated partition (and filesystem) for 
logs, here.  (The other reason is that should something go runaway log-
spewing, I get a warning much sooner when my log filesystem fills up, not 
much later, with much worse implications, when the main filesystem fills 
up!)

> And no, autodefrag is not in the mount options currently. Would that be
> helpful in this case?

It should be helpful, yes.  Be aware that autodefrag works best with 
smaller (sub-half-gig) files, however, and that it used to cause 
performance issues with larger database and VM files, in particular.  
There used to be a warning on the wiki about that, that was recently 
removed, so apparently it's not the issue that it was, but you might wish 
to monitor any databases or VMs with gig-plus files to see if it's going 
to be a performance issue, once you turn on autodefrag.

The other issue with autodefrag is that if it hasn't been on and things 
are heavily fragmented, it can at first drive down performance as it 
rewrites all these heavily fragmented files, until it catches up and is 
mostly dealing only with the normal refragmentation load.  Of course the 
best way around that is to run autodefrag from the first time you mount 
the filesystem and start writing to it, so it never gets overly 
fragmented in the first place.  For a currently in-use and highly 
fragmented filesystem, you have two choices: either back up and do a 
fresh mkfs.btrfs so you can start with a clean filesystem and autodefrag 
from the beginning, or do a manual defrag.

However, be aware that if you have snapshots locking down the old extents 
in their fragmented form, a manual defrag will copy the data to new 
extents without releasing the old ones as they're locked in place by the 
snapshots, thus using additional space.  Worse, if the filesystem is 
already heavily fragmented and snapshots are locking most of those 
fragments in place, defrag likely won't help a lot, because the free 
space as well will be heavily fragmented.   So starting off with a clean 
and new filesystem and using autodefrag from the beginning really is your 
best bet.


-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2016-06-08 23:10   ` Hans van Kranenburg
  2016-06-09  8:52     ` Marc Haber
  2016-06-09 15:41     ` Duncan
@ 2016-06-09 18:07     ` Chris Murphy
  2 siblings, 0 replies; 33+ messages in thread
From: Chris Murphy @ 2016-06-09 18:07 UTC (permalink / raw)
  To: Hans van Kranenburg; +Cc: Qu Wenruo, Btrfs BTRFS

On Wed, Jun 8, 2016 at 5:10 PM, Hans van Kranenburg
<hans.van.kranenburg@mendix.com> wrote:
> Hi list,
>
>
> On 05/31/2016 03:36 AM, Qu Wenruo wrote:
>>
>>
>>
>> Hans van Kranenburg wrote on 2016/05/06 23:28 +0200:
>>>
>>> Hi,
>>>
>>> I've got a mostly inactive btrfs filesystem inside a virtual machine
>>> somewhere that shows interesting behaviour: while no interesting disk
>>> activity is going on, btrfs keeps allocating new chunks, a GiB at a time.
>>>
>>> A picture, telling more than 1000 words:
>>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/btrfs_usage_ichiban.png
>>> (when the amount of allocated/unused goes down, I did a btrfs balance)
>>
>>
>> Nice picture.
>> Really better than 1000 words.
>>
>> AFAIK, the problem may be caused by fragments.
>>
>> And even I saw some early prototypes inside the codes to allow btrfs do
>> allocation smaller extent than required.
>> (E.g. caller needs 2M extent, but btrfs returns 2 1M extents)
>>
>> But it's still prototype and seems no one is really working on it now.
>>
>> So when btrfs is writing new data, for example, to write about 16M data,
>> it will need to allocate a 16M continuous extent, and if it can't find
>> large enough space to allocate, then create a new data chunk.
>>
>> Despite the already awesome chunk level usage picture, I hope there is
>> info about extent level allocation to confirm my assumption.
>>
>> You could dump it by calling "btrfs-debug-tree -t 2 <device>".
>> It's normally recommended to do it unmounted, but it's still possible to
>> call it mounted, although not 100% perfect though.
>> (Then I'd better find a good way to draw a picture of
>> allocate/unallocate space and how fragments the chunks are)
>
>
> So, I finally found some spare time to continue investigating. In the
> meantime, the filesystem has happily been allocating new chunks every few
> days, filling them up way below 10% with data before starting a new one.
>
> The chunk allocation primarily seems to happen during cron.daily. But,
> manually executing all the cronjobs that are in there, even multiple times,
> does not result in newly allocated chunks. Yay. :(
>
> After the previous post, I put a little script in between every two jobs in
> /etc/cron.daily that prints the output of btrfs fi df to syslog and sleeps
> for 10 minutes so I can easily find out afterwards during which one it
> happened.
>
> Bingo! The "apt" cron.daily, which refreshes package lists and triggers
> unattended-upgrades.
>
> Jun 7 04:01:46 ichiban root: Data, single: total=12.00GiB, used=5.65GiB
> [...]
> 2016-06-07 04:01:56,552 INFO Starting unattended upgrades script
> [...]
> Jun 7 04:12:10 ichiban root: Data, single: total=13.00GiB, used=5.64GiB
>
> And, this thing is clever enough to do things once a day, even if you would
> execute it multiple times... (Hehehe...)
>
> Ok, let's try doing some apt-get update then.
>
> Today, the latest added chunks look like this:
>
> # ./show_usage.py /
> [...]
> chunk vaddr 63495471104 type 1 stripe 0 devid 1 offset 9164554240 length
> 1073741824 used 115499008 used_pct 10
> chunk vaddr 64569212928 type 1 stripe 0 devid 1 offset 12079595520 length
> 1073741824 used 36585472 used_pct 3
> chunk vaddr 65642954752 type 1 stripe 0 devid 1 offset 14227079168 length
> 1073741824 used 17510400 used_pct 1
> chunk vaddr 66716696576 type 4 stripe 0 devid 1 offset 3275751424 length
> 268435456 used 72663040 used_pct 27
> chunk vaddr 66985132032 type 1 stripe 0 devid 1 offset 15300820992 length
> 1073741824 used 86986752 used_pct 8
> chunk vaddr 68058873856 type 1 stripe 0 devid 1 offset 16374562816 length
> 1073741824 used 21188608 used_pct 1
> chunk vaddr 69132615680 type 1 stripe 0 devid 1 offset 17448304640 length
> 1073741824 used 64032768 used_pct 5
> chunk vaddr 70206357504 type 1 stripe 0 devid 1 offset 18522046464 length
> 1073741824 used 71712768 used_pct 6
>
> Now I apt-get update...
>
> before: Data, single: total=13.00GiB, used=5.64GiB
> during: Data, single: total=13.00GiB, used=5.59GiB
> after : Data, single: total=14.00GiB, used=5.64GiB
>
> # ./show_usage.py /
> [...]
> chunk vaddr 63495471104 type 1 stripe 0 devid 1 offset 9164554240 length
> 1073741824 used 119279616 used_pct 11
> chunk vaddr 64569212928 type 1 stripe 0 devid 1 offset 12079595520 length
> 1073741824 used 36585472 used_pct 3
> chunk vaddr 65642954752 type 1 stripe 0 devid 1 offset 14227079168 length
> 1073741824 used 17510400 used_pct 1
> chunk vaddr 66716696576 type 4 stripe 0 devid 1 offset 3275751424 length
> 268435456 used 73170944 used_pct 27
> chunk vaddr 66985132032 type 1 stripe 0 devid 1 offset 15300820992 length
> 1073741824 used 82251776 used_pct 7
> chunk vaddr 68058873856 type 1 stripe 0 devid 1 offset 16374562816 length
> 1073741824 used 21188608 used_pct 1
> chunk vaddr 69132615680 type 1 stripe 0 devid 1 offset 17448304640 length
> 1073741824 used 6041600 used_pct 0
> chunk vaddr 70206357504 type 1 stripe 0 devid 1 offset 18522046464 length
> 1073741824 used 46178304 used_pct 4
> chunk vaddr 71280099328 type 1 stripe 0 devid 1 offset 19595788288 length
> 1073741824 used 84770816 used_pct 7
>
> Interesting. There's a new one at 71280099328, 7% filled, and the usage of
> the 4 previous ones went down a bit.
>
> Now I want to know what the distribution of data inside these chunks, to
> find out how fragmented it might be, so I spent some time this evening to
> play a bit more with the search ioctl, and list all extents and free space
> inside a chunk:
>
> https://github.com/knorrie/btrfs-heatmap/blob/master/chunk-contents.py
>
> Currently the output looks like this:
>
> # ./chunk-contents.py 70206357504 .
> chunk vaddr 70206357504 length 1073741824
> 0x1058a00000 0x105a0cafff  23900160 2.23%
> 0x105a0cb000 0x105a0cbfff      4096 0.00% extent
> 0x105a0cc000 0x105a12ffff    409600 0.04%
> 0x105a130000 0x105a130fff      4096 0.00% extent
> 0x105a131000 0x105a21dfff    970752 0.09%
> 0x105a21e000 0x105a220fff     12288 0.00% extent
> 0x105a221000 0x105a222fff      8192 0.00% extent
> 0x105a223000 0x105a224fff      8192 0.00% extent
> 0x105a225000 0x105a225fff      4096 0.00% extent
> 0x105a226000 0x105a226fff      4096 0.00% extent
> 0x105a227000 0x105a227fff      4096 0.00% extent
> 0x105a228000 0x105a2c3fff    638976 0.06%
> 0x105a2c4000 0x105a2c5fff      8192 0.00% extent
> 0x105a2c6000 0x105a317fff    335872 0.03%
> 0x105a318000 0x105a31efff     28672 0.00% extent
> 0x105a31f000 0x105a3affff    593920 0.06%
> 0x105a3b0000 0x105a3b2fff     12288 0.00% extent
> 0x105a3b3000 0x105a3b6fff     16384 0.00%
> 0x105a3b7000 0x105a3bbfff     20480 0.00% extent
> 0x105a3bc000 0x105a3e2fff    159744 0.01%
> 0x105a3e3000 0x105a3e3fff      4096 0.00% extent
> 0x105a3e4000 0x105a3e4fff      4096 0.00% extent
> 0x105a3e5000 0x105a468fff    540672 0.05%
> 0x105a469000 0x105a46cfff     16384 0.00% extent
> 0x105a46d000 0x105a493fff    159744 0.01%
> 0x105a494000 0x105a495fff      8192 0.00% extent
> 0x105a496000 0x105a49afff     20480 0.00%
> [...]
>
> After running apt-get update a few extra times, only the last (new) chunk
> keeps changing a bit, and stabilizes around 10% usage:
>
> chunk vaddr 71280099328 type 1 stripe 0 devid 1 offset 19595788288 length
> 1073741824 used 112271360 used_pct 10
>
> chunk vaddr 71280099328 length 1073741824
> 0x1098a00000 0x109e00dfff  90234880 8.40%
> 0x109e00e000 0x109e00efff      4096 0.00% extent
> 0x109e00f000 0x109e00ffff      4096 0.00% extent
> 0x109e010000 0x109e010fff      4096 0.00%
> 0x109e011000 0x109e011fff      4096 0.00% extent
> 0x109e012000 0x109e342fff   3346432 0.31%
> 0x109e343000 0x109e344fff      8192 0.00% extent
> 0x109e345000 0x109e47cfff   1277952 0.12%
> 0x109e47d000 0x109e47efff      8192 0.00% extent
> 0x109e47f000 0x109e480fff      8192 0.00%
> 0x109e481000 0x109e482fff      8192 0.00% extent
> 0x109e483000 0x109e484fff      8192 0.00% extent
> 0x109e485000 0x109e48afff     24576 0.00% extent
> 0x109e48b000 0x109e48cfff      8192 0.00%
> 0x109e48d000 0x109e48efff      8192 0.00% extent
> 0x109e48f000 0x109e490fff      8192 0.00%
> 0x109e491000 0x109e492fff      8192 0.00% extent
> 0x109e493000 0x109e493fff      4096 0.00% extent
> 0x109e494000 0x109eb00fff   6737920 0.63%
> 0x109eb01000 0x109eb10fff     65536 0.01% extent
> 0x109eb11000 0x109ebc0fff    720896 0.07%
> 0x109ebc1000 0x109ec00fff    262144 0.02% extent
> 0x109ec01000 0x109ecc4fff    802816 0.07%
>
> Full output at
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2016-06-08-extents.txt
>
> Free space is extremely fragmented. The last one, which just got filled a
> bit using apt-get update looks better, with a few blocks up to 25% of free
> space, but the previous ones are a mess.
>
> So, instead of being the cause, apt-get update causing a new chunk to be
> allocated might as well be the result of existing ones already filled up
> with too many fragments.
>
> The next question is what files these extents belong to. To find out, I need
> to open up the extent items I get back and follow a backreference to an
> inode object. Might do that tomorrow, fun.
>
> To be honest, I suspect /var/log and/or the file storage of mailman to be
> the cause of the fragmentation, since there's logging from postfix, mailman
> and nginx going on all day long in a slow but steady tempo. While using
> btrfs for a number of use cases at work now, we normally don't use it for
> the root filesystem. And the cases where it's used as root filesystem don't
> do much logging or mail.
>
> And no, autodefrag is not in the mount options currently. Would that be
> helpful in this case?


Pure speculation follows...
If existing chunks are heavily fragmented, it might be a good idea,
especially on spinning rust, to just allocate new chunks to
contiguously write to, with the idea that as chunks age, they'll
become less fragmented due to file deletions. Maybe the seemingly
premature and unnecessary new chunk allocations are (metaphorically)
equivalent to the Linux cache trying to use all available free memory
before it starts dropping cache contents? OK, so to test this, what
happens if you just let the thing go, i.e. it fully allocates all the
available space? Once that happens, do you get ENOSPC or other brick
wall behaviors? Or is it just a matter of performance taking a nasty
hit?

I think the brick wall behavior is a bigger bug than the performance
tuning one, and probably easier to identify and fix, especially the
fewer non-default mount options there are. The fs should be reliable,
even if it's slow.

Next would be to find out what sort of tuning reduces this problem.
It could be relatively simple; maybe just autodefrag does the trick. But
if the files are really small, yet just not small enough to fit inline
with metadata, a better tuning might be to use a bigger nodesize. But I
have no idea what the file sizes are. Yet another one would be using the
ssd mount option even if (?) this is on an HDD. The description of the
problem sounds like what ssd_spread might do, but it could be worth
trying that one.

And still another, which I put last only because it wouldn't be nearly
so automatic, would be to use snapshots to pin the extents so they don't
get freed. That way piles of small free-space fragments don't appear.
You'd really want to know what the read/write pattern is, especially the
delete pattern that results in the fragmented freeing of extents. If
it's the mail server causing this, I don't know that this is a great
solution, because additions and deletions happen constantly. So this
idea might only soften the problem rather than fix it? I think if
these are really small files, say 20K, then perhaps a 64KiB nodesize
improves inlining them. And if that does help, you can in effect stretch
that further by also using the compress option.

So actually, yeah, a little more data about what's causing the
fragmentation would be useful: probably what's being deleted, and how
big those extents are?




-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2016-06-09 15:41     ` Duncan
@ 2016-06-10 17:07       ` Henk Slager
  2016-06-11 15:23         ` Hans van Kranenburg
  0 siblings, 1 reply; 33+ messages in thread
From: Henk Slager @ 2016-06-10 17:07 UTC (permalink / raw)
  To: linux-btrfs

On Thu, Jun 9, 2016 at 5:41 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> Hans van Kranenburg posted on Thu, 09 Jun 2016 01:10:46 +0200 as
> excerpted:
>
>> The next question is what files these extents belong to. To find out, I
>> need to open up the extent items I get back and follow a backreference
>> to an inode object. Might do that tomorrow, fun.
>>
>> To be honest, I suspect /var/log and/or the file storage of mailman to
>> be the cause of the fragmentation, since there's logging from postfix,
>> mailman and nginx going on all day long in a slow but steady tempo.
>> While using btrfs for a number of use cases at work now, we normally
>> don't use it for the root filesystem. And the cases where it's used as
>> root filesystem don't do much logging or mail.
>
> FWIW, that's one reason I have a dedicated partition (and filesystem) for
> logs, here.  (The other reason is that should something go runaway log-
> spewing, I get a warning much sooner when my log filesystem fills up, not
> much later, with much worse implications, when the main filesystem fills
> up!)
>
>> And no, autodefrag is not in the mount options currently. Would that be
>> helpful in this case?
>
> It should be helpful, yes.  Be aware that autodefrag works best with
> smaller (sub-half-gig) files, however, and that it used to cause
> performance issues with larger database and VM files, in particular.

I don't know why you relate filesize and autodefrag. Maybe because you
say '... used to cause ...'.

autodefrag detects random writes and then tries to defrag a certain
range. Its scope size is 256K as far as I see from the code and over
time you see VM images that are on a btrfs fs (CoW, hourly ro
snapshots) having a lot of 256K (or a bit less) sized extents
according to what filefrag reports. I once wanted to try and change
the 256K to 1M or even 4M, but I haven't gotten around to that.
A 32G VM image would consist of 131072 extents for 256K, 32768 extents
for 1M, 8192 extents for 4M.
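
For reference, the arithmetic behind those extent counts, in case someone
wants to plug in other image sizes or target extent sizes (just a tiny
helper, nothing btrfs-specific):

size = 32 * 1024**3                          # 32 GiB image
for target in (256 * 1024, 1024**2, 4 * 1024**2):
    print(target // 1024, 'KiB target ->', size // target, 'extents')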

> There used to be a warning on the wiki about that, that was recently
> removed, so apparently it's not the issue that it was, but you might wish
> to monitor any databases or VMs with gig-plus files to see if it's going
> to be a performance issue, once you turn on autodefrag.

For very active databases, I don't know what the effects are, with or
without autodefrag (either on SSD and/or HDD).
At least on HDD-only, i.e. no persistent SSD caching, and without
autodefrag, VMs will result in unacceptable performance soon.

> The other issue with autodefrag is that if it hasn't been on and things
> are heavily fragmented, it can at first drive down performance as it
> rewrites all these heavily fragmented files, until it catches up and is
> mostly dealing only with the normal refragmentation load.

I assume you mean that one only gets a performance drop if you
actually do new writes to the fragmented files after turning autodefrag
on. It shouldn't start defragging by itself AFAIK.

> Of course the
> best way around that is to run autodefrag from the first time you mount
> the filesystem and start writing to it, so it never gets overly
> fragmented in the first place.  For a currently in-use and highly
> fragmented filesystem, you have two choices, either backup and do a fresh
> mkfs.btrfs so you can start with a clean filesystem and autodefrag from
> the beginning, or doing manual defrag.
>
> However, be aware that if you have snapshots locking down the old extents
> in their fragmented form, a manual defrag will copy the data to new
> extents without releasing the old ones as they're locked in place by the
> snapshots, thus using additional space.  Worse, if the filesystem is
> already heavily fragmented and snapshots are locking most of those
> fragments in place, defrag likely won't help a lot, because the free
> space as well will be heavily fragmented.   So starting off with a clean
> and new filesystem and using autodefrag from the beginning really is your
> best bet.

If it is about a multi-TB fs, I think the most important thing is to
have enough unfragmented free space available, ideally at the beginning
of the device if it is a plain HDD. Maybe a balance -ddrange=1M..<20% of
device> can do that; I haven't tried.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2016-06-10 17:07       ` Henk Slager
@ 2016-06-11 15:23         ` Hans van Kranenburg
  0 siblings, 0 replies; 33+ messages in thread
From: Hans van Kranenburg @ 2016-06-11 15:23 UTC (permalink / raw)
  To: Henk Slager, linux-btrfs

On 06/10/2016 07:07 PM, Henk Slager wrote:
> On Thu, Jun 9, 2016 at 5:41 PM, Duncan <1i5t5.duncan@cox.net> wrote:
>> Hans van Kranenburg posted on Thu, 09 Jun 2016 01:10:46 +0200 as
>> excerpted:
>>
>>> The next question is what files these extents belong to. To find out, I
>>> need to open up the extent items I get back and follow a backreference
>>> to an inode object. Might do that tomorrow, fun.
>>>
>>> To be honest, I suspect /var/log and/or the file storage of mailman to
>>> be the cause of the fragmentation, since there's logging from postfix,
>>> mailman and nginx going on all day long in a slow but steady tempo.
>>> While using btrfs for a number of use cases at work now, we normally
>>> don't use it for the root filesystem. And the cases where it's used as
>>> root filesystem don't do much logging or mail.
>>
>> FWIW, that's one reason I have a dedicated partition (and filesystem) for
>> logs, here.  (The other reason is that should something go runaway log-
>> spewing, I get a warning much sooner when my log filesystem fills up, not
>> much later, with much worse implications, when the main filesystem fills
>> up!)

Well, there it is:
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2016-06-11-extents_ichiban_77621886976.txt

Playing around a bit with the search ioctl:
https://github.com/knorrie/btrfs-heatmap/blob/master/chunk-contents.py

This is clearly primarily logging and mailman mbox files. All kinds of 
small extents, and a huge amount of fragmented free space in between.
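
For anyone who wants to poke at this without reading the whole script, a
stripped-down sketch of driving the tree search ioctl from plain Python to
walk the EXTENT_ITEMs of a single data chunk follows. The struct layouts
and the ioctl number are from my reading of the kernel's btrfs headers, so
treat them as assumptions and double-check against your own headers; it
also needs root.

import fcntl
import os
import struct

BTRFS_IOC_TREE_SEARCH = 0xD0009411    # _IOWR(0x94, 17, 4096-byte search args)
EXTENT_TREE_OBJECTID = 2
EXTENT_ITEM_KEY = 168                 # data extents; skinny metadata uses 169
KEY_FMT = '<QQQQQQQLLLLQQQQ'          # btrfs_ioctl_search_key, 104 bytes
HDR_FMT = '<QQQLL'                    # btrfs_ioctl_search_header, 32 bytes
BUFSIZE = 4096 - struct.calcsize(KEY_FMT)

def extents_in_chunk(fd, vaddr, length):
    """Yield (logical start, size) for each EXTENT_ITEM in one data chunk."""
    min_objectid = vaddr
    while True:
        key = struct.pack(KEY_FMT, EXTENT_TREE_OBJECTID,
                          min_objectid, vaddr + length - 1,  # objectid range
                          0, 2**64 - 1,                      # offset range
                          0, 2**64 - 1,                      # transid range
                          EXTENT_ITEM_KEY, EXTENT_ITEM_KEY,  # type range
                          4096, 0, 0, 0, 0, 0)               # nr_items + unused
        args = bytearray(key + b'\0' * BUFSIZE)
        fcntl.ioctl(fd, BTRFS_IOC_TREE_SEARCH, args, True)
        nr_items = struct.unpack_from('<L', args, 7 * 8 + 2 * 4)[0]
        if nr_items == 0:
            return
        pos = struct.calcsize(KEY_FMT)
        for _ in range(nr_items):
            _, objectid, offset, typ, item_len = struct.unpack_from(HDR_FMT, args, pos)
            pos += struct.calcsize(HDR_FMT) + item_len
            if typ == EXTENT_ITEM_KEY:
                yield objectid, offset            # key offset = extent size here
            min_objectid = objectid + 1           # continue after the last item

# Needs root; any open fd on the filesystem will do.
fd = os.open('/', os.O_RDONLY)
for start, size in extents_in_chunk(fd, 70206357504, 1073741824):
    print(start, size)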

>>> And no, autodefrag is not in the mount options currently. Would that be
>>> helpful in this case?
>>
>> It should be helpful, yes.  Be aware that autodefrag works best with
>> smaller (sub-half-gig) files, however, and that it used to cause
>> performance issues with larger database and VM files, in particular.
>
> I don't know why you relate filesize and autodefrag. Maybe because you
> say '... used to cause ...'.

Log files grow to a few tens of MBs and logrotate will copy the contents 
into gzipped files (defragging everything as a side effect) every once 
in a while, so the only concern is the current logfiles.

> autodefrag detects random writes and then tries to defrag a certain
> range. Its scope size is 256K as far as I see from the code and over
> time you see VM images that are on a btrfs fs (CoW, hourly ro
> snapshots) having a lot of 256K (or a bit less) sized extents
> according to what filefrag reports. I once wanted to try and change
> the 256K to 1M or even 4M, but I haven't gotten around to that.
> A 32G VM image would consist of 131072 extents for 256K, 32768 extents
> for 1M, 8192 extents for 4M.

Aha.

>> There used to be a warning on the wiki about that, that was recently
>> removed, so apparently it's not the issue that it was, but you might wish
>> to monitor any databases or VMs with gig-plus files to see if it's going
>> to be a performance issue, once you turn on autodefrag.
>
> For very active databases, I don't know what the effects are, with or
> without autodefrag (either on SSD and/or HDD).
> At least on HDD-only, i.e. no persistent SSD caching, and without
> autodefrag, VMs will result in unacceptable performance soon.
>
>> The other issue with autodefrag is that if it hasn't been on and things
>> are heavily fragmented, it can at first drive down performance as it
>> rewrites all these heavily fragmented files, until it catches up and is
>> mostly dealing only with the normal refragmentation load.
>
> I assume you mean that one only gets a performance drop if you
> actually do new writes to the fragmented files after turning autodefrag
> on. It shouldn't start defragging by itself AFAIK.

As far as I understand, it only considers new writes yes.

So I can manually defrag the mbox files (which get data appended slowly 
all the time) and turn on autodefrag, which will also take care of the 
log files, and after the next logrotate, all old fragmented extents will 
be freed.

>> Of course the
>> best way around that is to run autodefrag from the first time you mount
>> the filesystem and start writing to it, so it never gets overly
>> fragmented in the first place.  For a currently in-use and highly
>> fragmented filesystem, you have two choices, either backup and do a fresh
>> mkfs.btrfs so you can start with a clean filesystem and autodefrag from
>> the beginning, or doing manual defrag.
>>
>> However, be aware that if you have snapshots locking down the old extents
>> in their fragmented form, a manual defrag will copy the data to new
>> extents without releasing the old ones as they're locked in place by the
>> snapshots, thus using additional space.  Worse, if the filesystem is
>> already heavily fragmented and snapshots are locking most of those
>> fragments in place, defrag likely won't help a lot, because the free
>> space as well will be heavily fragmented.   So starting off with a clean
>> and new filesystem and using autodefrag from the beginning really is your
>> best bet.

No snapshots here.

> If it is about a multi-TB fs, I think the most important thing is to
> have enough unfragmented free space available, ideally at the beginning
> of the device if it is a plain HDD. Maybe a balance -ddrange=1M..<20% of
> device> can do that; I haven't tried.

I'm going to enable autodefrag now, and defrag the existing mbox files, 
and then do some balance to compact the used space.

A question remains of course... Even when slowly appending data to e.g. 
a log file... what causes all the free space in between the newly 
written data extents...?! 300kB?! 4MB?!

78081548288 78081875967    327680 0.03% free space

78081875968 78081896447     20480 0.00% extent item
	extent refs 1 gen 155003 flags DATA
	extent data backref root 257 objectid 901223 names ['access.log.1']

78081896448 78081904639      8192 0.00% extent item
	extent refs 1 gen 155003 flags DATA
	extent data backref root 257 objectid 901223 names ['access.log.1']

78081904640 78082236415    331776 0.03% free space

78082236416 78082256895     20480 0.00% extent item
	extent refs 1 gen 155004 flags DATA
	extent data backref root 257 objectid 901223 names ['access.log.1']

78082256896 78082596863    339968 0.03% free space

78082596864 78082621439     24576 0.00% extent item
	extent refs 1 gen 155005 flags DATA
	extent data backref root 257 objectid 901223 names ['access.log.1']

78082621440 78087327743   4706304 0.44% free space

78087327744 78087335935      8192 0.00% extent item


To be continued...

-- 
Hans van Kranenburg - System / Network Engineer
T +31 (0)10 2760434 | hans.van.kranenburg@mendix.com | www.mendix.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2016-05-31  1:36 ` Qu Wenruo
  2016-06-08 23:10   ` Hans van Kranenburg
@ 2017-04-07 21:25   ` Hans van Kranenburg
  2017-04-07 23:56     ` Peter Grandi
                       ` (2 more replies)
  1 sibling, 3 replies; 33+ messages in thread
From: Hans van Kranenburg @ 2017-04-07 21:25 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: Josef Bacik

Ok, I'm going to revive a year old mail thread here with interesting new
info:

On 05/31/2016 03:36 AM, Qu Wenruo wrote:
> 
> 
> Hans van Kranenburg wrote on 2016/05/06 23:28 +0200:
>> Hi,
>>
>> I've got a mostly inactive btrfs filesystem inside a virtual machine
>> somewhere that shows interesting behaviour: while no interesting disk
>> activity is going on, btrfs keeps allocating new chunks, a GiB at a time.
>>
>> A picture, telling more than 1000 words:
>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/btrfs_usage_ichiban.png
>> (when the amount of allocated/unused goes down, I did a btrfs balance)

That picture is still there, for the idea.

> Nice picture.
> Really better than 1000 words.
> 
> AFAIK, the problem may be caused by fragments.

Free space fragmentation is a key thing here indeed.

The major two things involved here are 1) the extent allocator, which
causes the free space fragmentation 2) the extent allocator, which
doesn't handle the fragmentation it just caused really well.

Let's start with the pictures, instead of too many words. The following
two videos are made from png images of the 4 block groups with the
highest vaddr. Every 15 minutes a picture is created, and then they're
added together:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-19-noautodefrag-ichiban.mp4

And, with autodefrag enabled, which was the first thing I tried as a change:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-13-autodefrag-ichiban.mp4

So, this is why putting your /var/log, /var/lib/mailman and /var/spool
on btrfs is a terrible idea.

Because the allocator keeps walking forward, every file that is created
and then removed leaves a blank spot behind.

Autodefrag makes the situation only a little bit better, changing the
resulting pattern from a sky full of stars into a snowstorm. The result
of taking a few small writes and rewriting them again is that again the
small parts of free space are left behind.

Just a random idea... for this write pattern, always putting new writes
in the first available free spot at the beginning of the block group
would make a total difference, since the little 4/8KiB parts would be
filled up again all the time, preventing the shotgun blast from spreading
all over.
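
To make that idea a bit more concrete, here's a deliberately
over-simplified toy model (nothing like the real allocator code, and with
completely made-up sizes and write pattern): one allocator that only walks
forward and grabs a fresh chunk when it runs off the end, versus one that
always takes the first free spot. The interesting output is how many
chunks each variant ends up touching for the same amount of live data.

import random

def find_run(used, lo, n):
    """Return the first index >= lo of n consecutive free blocks, or None."""
    for i in range(lo, len(used) - n + 1):
        if all(not used[j] for j in range(i, i + n)):
            return i
    return None

def simulate(first_fit, iterations=5000, chunk_blocks=256):
    """Toy chunk of 4 KiB blocks; returns (chunks allocated, blocks in use)."""
    used = [False] * chunk_blocks
    live, cursor, chunks = [], 0, 1
    rng = random.Random(42)
    for _ in range(iterations):
        nblocks = rng.choice([1, 1, 1, 2, 4])        # mostly tiny writes
        start = find_run(used, 0 if first_fit else cursor, nblocks)
        if start is None:
            chunks += 1                              # "allocate" a fresh chunk
            used.extend([False] * chunk_blocks)
            start = find_run(used, len(used) - chunk_blocks, nblocks)
        for j in range(start, start + nblocks):
            used[j] = True
        live.append((start, nblocks))
        cursor = start + nblocks
        while len(live) > 60:                        # deletes leave holes behind
            s, n = live.pop(rng.randrange(len(live)))
            for j in range(s, s + n):
                used[j] = False
    return chunks, sum(used)

print('walk-forward:', simulate(first_fit=False))   # many chunks, few blocks used
print('first-fit:   ', simulate(first_fit=True))    # stays inside one chunk

With the forward-only variant the chunk count keeps growing even though
the amount of live data stays flat, which is pretty much the picture in
the graphs above.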

> And even I saw some early prototypes inside the codes to allow btrfs do
> allocation smaller extent than required.
> (E.g. caller needs 2M extent, but btrfs returns 2 1M extents)
> 
> But it's still prototype and seems no one is really working on it now.
> 
> So when btrfs is writing new data, for example, to write about 16M data,
> it will need to allocate a 16M continuous extent, and if it can't find
> large enough space to allocate, then create a new data chunk.
> 
> [...]

That's the cluster idea right? Combining free space fragments into a
bigger piece of space to fill with writes?

The fun thing is that this might work, but because of the pattern we end
up with, a large write apparently fails (the files downloaded when doing
apt-get update by daily cron) which causes a new chunk allocation. This
is clearly visible in the videos. Directly after that, the new chunk
gets filled with the same pattern, because the extent allocator now
continues there and next day same thing happens again etc...

And voila, there's the answer to my original question.

Now, another surprise:

From the exact moment I did mount -o remount,nossd on this filesystem,
the problem vanished.

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-07-ichiban-munin-nossd.png

I don't have a new video yet, but I'll set up a cron tonight and post it
later.

I'm going to send another mail specifically about the nossd/ssd
behaviour and other things I found out last week, but that'll probably
be tomorrow.

-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2017-04-07 21:25   ` Hans van Kranenburg
@ 2017-04-07 23:56     ` Peter Grandi
  2017-04-08  7:09     ` Duncan
  2017-04-08 11:16     ` Hans van Kranenburg
  2 siblings, 0 replies; 33+ messages in thread
From: Peter Grandi @ 2017-04-07 23:56 UTC (permalink / raw)
  To: Linux fs Btrfs

[ ... ]
>>> I've got a mostly inactive btrfs filesystem inside a virtual
>>> machine somewhere that shows interesting behaviour: while no
>>> interesting disk activity is going on, btrfs keeps
>>> allocating new chunks, a GiB at a time.
[ ... ]
> Because the allocator keeps walking forward, every file that is
> created and then removed leaves a blank spot behind.

That is a typical "log-structured" filesystem behaviour; I am not
really surprised that Btrfs is doing something like that, being
COW. NILFS2 works like that and it requires a compactor (which
does the equivalent of 'balance' and 'defrag'). It is all about
tradeoffs.

With Btrfs I figured out that fairly frequent 'balance' is
really quite important, even with low percent values like
"usage=50", and usually even 'usage=90' does not take a long
time (while the default takes often a long time, I suspect
needlessly).
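
In case it helps anyone scripting this: a usage-filtered balance from cron
can be as small as the sketch below (the mount point is a placeholder);
-dusage=50 only rewrites data block groups that are at most 50% used,
which is what keeps the run short compared to a full balance.

import subprocess

# Placeholder mount point; run from cron at whatever interval suits you.
subprocess.run(['btrfs', 'balance', 'start', '-dusage=50', '/srv'], check=True)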

>> From the exact moment I did mount -o remount,nossd on this
>> filesystem, the problem vanished.

Haha. Indeed. So it switches from "COW" to more like "log
structured" with the 'ssd' option. F2FS can switch like that
too, with some tunables IIRC. Except that modern flash SSDs
already do the "log structured" bit internally, so doing it in
Btrfs does not really help that much.

>> And even I saw some early prototypes inside the codes to
>> allow btrfs do allocation smaller extent than required.
>> (E.g. caller needs 2M extent, but btrfs returns 2 1M extents)

I am surprised that this is not already there, but it is a
terrible fix to a big mistake. The big mistake, which nearly all
filesystem designers make, is to assume that contiguous allocation
must be done by writing contiguous large blocks or extents.

This big mistake was behind the stupid idea of the BSD FFS to
raise the block size from 512B to 4096B plus 512B "tails", and
endless stupid proposals to raise page and block sizes that get
done all the time, and is behind the stupid idea of doing
"delayed allocation", so large extents can be written in one go.

The ancient, tried and obvious idea is to preallocate space
ahead of it being written, so that a file's physical size may be
larger than its logical length; by how much depends on some
adaptive logic, or on hinting from the application (if the file
size is known in advance it can preallocate the whole file).
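
A minimal sketch of that preallocation hint on Linux, assuming glibc:
FALLOC_FL_KEEP_SIZE reserves blocks past EOF so the physical size can
exceed the logical length, which is exactly the trick described above.
Whether this fully avoids fragmentation on a CoW filesystem like Btrfs is
a separate question; the path and window size here are made up.

import ctypes
import os

FALLOC_FL_KEEP_SIZE = 0x01
libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                           ctypes.c_longlong, ctypes.c_longlong]

def preallocate_ahead(fd, window=8 * 1024 * 1024):
    """Reserve `window` bytes past the current end of file, without
    changing the file's logical length."""
    end = os.fstat(fd).st_size
    if libc.fallocate(fd, FALLOC_FL_KEEP_SIZE, end, window) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

# Hypothetical slowly-growing logfile:
fd = os.open("/var/log/example.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o640)
preallocate_ahead(fd)        # reserve once; repeat when the window is used up
os.write(fd, b"one slow log line at a time\n")
os.close(fd)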

> [ ... ] So, this is why putting your /var/log, /var/lib/mailman and
> /var/spool on btrfs is a terrible idea. [ ... ]

That is just the old "writing a file slowly" issue, and many if
not most filesystems have this issue:

  http://www.sabi.co.uk/blog/15-one.html?150203#150203

and as that post shows it was already reported for Btrfs here:

  http://kreijack.blogspot.co.uk/2014/06/btrfs-and-systemd-journal.html

> [ ... ] The fun thing is that this might work, but because of
> the pattern we end up with, a large write apparently fails
> (the files downloaded when doing apt-get update by daily cron)
> which causes a new chunk allocation. This is clearly visible
> in the videos. Directly after that, the new chunk gets filled
> with the same pattern, because the extent allocator now
> continues there and next day same thing happens again etc... [
> ... ]

The general problem is that filesystems have a very difficult
job, especially on rotating media, and cannot avoid large,
important degenerate corner cases by using any adaptive logic.

Only predictive logic can avoid them, and since psychic code is
not possible yet, "predictive" means hints from applications and
users, and application developers and users are usually not
going to give them, or give them wrong.

Consider the "slow writing" corner case, common to logging or
downloads, that you mention: the filesystem logic cannot do well
in the general case because it cannot predict how large the
final file will be, or what the rate of writing will be.

However, if the applications or users hint the total final size,
or at least a suitable allocation size, things are going to be
good. But it is already difficult to expect applications to give
absolutely necessary 'fsync's, so explicit file size or access
pattern hints are a bit of an illusion. It is the ancient
'O_PONIES' issue in one of its many forms.

Fortunately it possible and even easy to do much better
*synthetic* hinting than most library and kernels do today:

  http://www.sabi.co.uk/blog/anno05-4th.html?051012d#051012d
  http://www.sabi.co.uk/blog/anno05-4th.html?051011b#051011b
  http://www.sabi.co.uk/blog/anno05-4th.html?051011#051011
  http://www.sabi.co.uk/blog/anno05-4th.html?051010#051010

But that has not happened because it is no developer's itch to
fix. I was instead partially impressed that recently the
'vm_cluster' implementation was "fixed", after only one or two
decades from being first reported:

  http://sabi.co.uk/blog/anno05-3rd.html?050923#050923
  https://lwn.net/Articles/716296/
  https://lkml.org/lkml/2001/1/30/160

And still the author(s) of the fix don't seem to be persuaded by
many decades of research on paging that show that read-ahead on
fault is in the general case a stupid idea (at least for what
are called in Linux "anonymous" pages).

I have found over time that reports and discussions like this
are mostly pointless: some decades ago I pointed out to L McVoy
when he was a developer at Sun that tools like 'cp' and 'tar'
have *totally predictable* access patterns and (almost always)
file sizes are know in advance, so they could trivially do
access patterns hinting and preallocation. Yet it is decades
later and most such tools don't. "because can't be bothered to
read papers", "because boring", "because not my itch".
For another example, someone has finally started looking into
writeback errors (https://lwn.net/Articles/718734/), which usually
don't get reported; but then how many developers check the
return code of 'close'(2)?
I wonder sometimes how M Stonebraker feels about having written
in "Operating System Support for Database Management" entirely
obvious things and it having been steadfastly ignored since 1981
by most kernel and filesystem authors.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2017-04-07 21:25   ` Hans van Kranenburg
  2017-04-07 23:56     ` Peter Grandi
@ 2017-04-08  7:09     ` Duncan
  2017-04-08 11:16     ` Hans van Kranenburg
  2 siblings, 0 replies; 33+ messages in thread
From: Duncan @ 2017-04-08  7:09 UTC (permalink / raw)
  To: linux-btrfs

Hans van Kranenburg posted on Fri, 07 Apr 2017 23:25:29 +0200 as
excerpted:

> So, this is why putting your /var/log, /var/lib/mailman and /var/spool
> on btrfs is a terrible idea.
> 
> Because the allocator keeps walking forward, every file that is created
> and then removed leaves a blank spot behind.
> 
> Autodefrag makes the situation only a little bit better, changing the
> resulting pattern from a sky full of stars into a snowstorm. The result
> of taking a few small writes and rewriting them again is that again the
> small parts of free space are left behind.

> [... B]ecause of the pattern we end
> up with, a large write apparently fails (the files downloaded when doing
> apt-get update by daily cron) which causes a new chunk allocation. This
> is clearly visible in the videos. Directly after that, the new chunk
> gets filled with the same pattern, because the extent allocator now
> continues there and next day same thing happens again etc...

> Now, another surprise:
> 
> From the exact moment I did mount -o remount,nossd on this filesystem,
> the problem vanished.

That large write in the middle of small writes pattern might be why I've 
not seen the problem on my btrfs', on ssd, here.

Remember, I'm the guy who keeps advocating multiple independent small 
btrfs on partitioned-up larger devices, with the splits between 
independent btrfs' based on tasks.

So I have a quite tiny sub-GiB independent log btrfs handling those slow 
incremental writes to generally smaller files, a separate / with the main 
system on it that's mounted read-only unless I'm actively updating it, a 
separate home with my reasonably small size but written at-once non-media 
user files, a separate media partition/fs with my much larger but very 
seldom rewritten media files, and a separate update partition/fs with the 
local cache of the distro tree and overlays, sources (since it's gentoo), 
built binpkg cache, etc, with small to medium-large files that are 
comparatively frequently replaced.

So the relatively small slow-written and frequently rotated log files are 
isolated to their own partition/fs, undisturbed by the much larger update-
writes to the updates and / partitions/fs, isolating them from the update-
trigger that triggers the chunk allocations on your larger single general 
purpose filesystem/image, amongst all those fragmenting slow logfile 
writes.

Very interesting and informative thread, BTW.  I'm learning quite a bit. 
=:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2017-04-07 21:25   ` Hans van Kranenburg
  2017-04-07 23:56     ` Peter Grandi
  2017-04-08  7:09     ` Duncan
@ 2017-04-08 11:16     ` Hans van Kranenburg
  2017-04-08 11:35       ` Hans van Kranenburg
  2017-04-09 23:23       ` Hans van Kranenburg
  2 siblings, 2 replies; 33+ messages in thread
From: Hans van Kranenburg @ 2017-04-08 11:16 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: Josef Bacik

On 04/07/2017 11:25 PM, Hans van Kranenburg wrote:
> Ok, I'm going to revive a year old mail thread here with interesting new
> info:
> 
> [...]
> 
> Now, another surprise:
> 
> From the exact moment I did mount -o remount,nossd on this filesystem,
> the problem vanished.
> 
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-07-ichiban-munin-nossd.png
> 
> I don't have a new video yet, but I'll set up a cron tonight and post it
> later.
> 
> I'm going to send another mail specifically about the nossd/ssd
> behaviour and other things I found out last week, but that'll probably
> be tomorrow.

Well, there it is:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4

Amazing... :) I'll update the file later with extra frames.

-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2017-04-08 11:16     ` Hans van Kranenburg
@ 2017-04-08 11:35       ` Hans van Kranenburg
  2017-04-09 23:23       ` Hans van Kranenburg
  1 sibling, 0 replies; 33+ messages in thread
From: Hans van Kranenburg @ 2017-04-08 11:35 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: Josef Bacik

On 04/08/2017 01:16 PM, Hans van Kranenburg wrote:
> On 04/07/2017 11:25 PM, Hans van Kranenburg wrote:
>> Ok, I'm going to revive a year old mail thread here with interesting new
>> info:
>>
>> [...]
>>
>> Now, another surprise:
>>
>> From the exact moment I did mount -o remount,nossd on this filesystem,
>> the problem vanished.
>>
>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-07-ichiban-munin-nossd.png
>>
>> I don't have a new video yet, but I'll set up a cron tonight and post it
>> later.
>>
>> I'm going to send another mail specifically about the nossd/ssd
>> behaviour and other things I found out last week, but that'll probably
>> be tomorrow.
> 
> Well, there it is:
> 
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4
> 
> Amazing... :) I'll update the file later with extra frames.

By the way,

1. For the log files in /var/log... logrotate behaves as a defrag tool
of course. The small free space gaps left behind when scraping the
current log file together and rewriting it as 1 big gzipped file can be
reused throughout the next day or whatever interval by the slow writes
again.

2. For the /var/spool/postfix... small files come and go, and that's
fine now.

3. For the mailman mbox files, which get appended all the time... They
can either stay where they are, having some more extents scattered
around, or, an entry in the monthly cron to point defrag at the files of
last month (which will never change again) will solve that efficiently.
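
A sketch of what that monthly cron entry could look like; the archive
directory and the filename pattern below are placeholders, not mailman's
actual layout, so adjust them to wherever the finished monthly files
really live.

import datetime
import glob
import subprocess

ARCHIVE_DIR = '/var/lib/mailman/archives'      # placeholder path
last_month = (datetime.date.today().replace(day=1)
              - datetime.timedelta(days=1)).strftime('%Y-%B')

for path in glob.glob(ARCHIVE_DIR + '/**/*' + last_month + '*', recursive=True):
    # 'btrfs filesystem defragment' rewrites the file into (fewer) new extents
    subprocess.run(['btrfs', 'filesystem', 'defragment', path], check=False)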

All of that doesn't sound like abnormal things to do when punishing the
filesystem with a 'slow small write' workload.

I'm happy to be able to keep this thing on btrfs. When moving all the
mailman stuff over from a previous VM, I first made it ext4 again, then
immediately ended up with no inodes left (of course!) while copying the
mailman archive, and then thought .. arg .. mkfs.btrfs, yay, unlimited
inodes! :) I was almost at the point of converting it back to ext4 after
all because of the exploding unused free space problems, but now that's
prevented just in time. :D

Moo,
-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2017-04-08 11:16     ` Hans van Kranenburg
  2017-04-08 11:35       ` Hans van Kranenburg
@ 2017-04-09 23:23       ` Hans van Kranenburg
  2017-04-10 12:39         ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 33+ messages in thread
From: Hans van Kranenburg @ 2017-04-09 23:23 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: Josef Bacik

On 04/08/2017 01:16 PM, Hans van Kranenburg wrote:
> On 04/07/2017 11:25 PM, Hans van Kranenburg wrote:
>> Ok, I'm going to revive a year old mail thread here with interesting new
>> info:
>>
>> [...]
>>
>> Now, another surprise:
>>
>> From the exact moment I did mount -o remount,nossd on this filesystem,
>> the problem vanished.
>>
>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-07-ichiban-munin-nossd.png
>>
>> I don't have a new video yet, but I'll set up a cron tonight and post it
>> later.
>>
>> I'm going to send another mail specifically about the nossd/ssd
>> behaviour and other things I found out last week, but that'll probably
>> be tomorrow.
> 
> Well, there it is:
> 
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4
> 
> Amazing... :) I'll update the file later with extra frames.

Added all new pngs up until now to the video, same link to the mp4.

Looks great! It just keeps reusing the same spots of space all the time.

When looking at this, I can understand that this is an unwanted write
pattern on a low-end ssd that was available for sale in 2008.

But, how does this apply to an SSD you can buy in 2017?

-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2017-04-09 23:23       ` Hans van Kranenburg
@ 2017-04-10 12:39         ` Austin S. Hemmelgarn
  2017-04-10 12:45           ` Kai Krakow
  0 siblings, 1 reply; 33+ messages in thread
From: Austin S. Hemmelgarn @ 2017-04-10 12:39 UTC (permalink / raw)
  To: Hans van Kranenburg, Qu Wenruo, linux-btrfs; +Cc: Josef Bacik

On 2017-04-09 19:23, Hans van Kranenburg wrote:
> On 04/08/2017 01:16 PM, Hans van Kranenburg wrote:
>> On 04/07/2017 11:25 PM, Hans van Kranenburg wrote:
>>> Ok, I'm going to revive a year old mail thread here with interesting new
>>> info:
>>>
>>> [...]
>>>
>>> Now, another surprise:
>>>
>>> From the exact moment I did mount -o remount,nossd on this filesystem,
>>> the problem vanished.
>>>
>>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-07-ichiban-munin-nossd.png
>>>
>>> I don't have a new video yet, but I'll set up a cron tonight and post it
>>> later.
>>>
>>> I'm going to send another mail specifically about the nossd/ssd
>>> behaviour and other things I found out last week, but that'll probably
>>> be tomorrow.
>>
>> Well, there it is:
>>
>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4
>>
>> Amazing... :) I'll update the file later with extra frames.
>
> Added all new pngs up until now to the video, same link to the mp4.
>
> Looks great! It just keeps reusing the same spots of space all the time.
>
> When looking at this, I can understand that this is an unwanted write
> pattern on a low-end ssd that was available for sale in 2008.
>
> But, how does this apply to an SSD you can buy in 2017?
>
Depends on what brand and how cheap you go.  For a decent brand (Intel, 
Samsung, Crucial) and a reasonably good SSD (I'm partial to the Crucial 
MX series), this really doesn't hurt as much as it used to.

I've got a couple of Crucial MX300's (released middle of last year IIRC) 
which see roughly 200kB/s of writes constantly 24/7 (average write IOPS 
is about 15-20, so most of the writes are around 16kB), and after about 
6 months of this none of their wear-out indicators have changed since I 
first checked them when I installed them.  They've been running BTRFS 
with LZO compression, the SSD allocator, atime disabled, and mtime 
updates deferred (lazytime mount option) the whole time, so it may be a 
slightly different use case than the OP from this thread.

Given this though, combined with the fact that Crucial SSD's are decent 
(they're not quite on par with Samsung EVO's or the good Intel SSD's, 
but they're still pretty good for the price), I'd be willing to say that 
they're not anywhere near as workload sensitive as they used to be.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2017-04-10 12:39         ` Austin S. Hemmelgarn
@ 2017-04-10 12:45           ` Kai Krakow
  2017-04-10 12:51             ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 33+ messages in thread
From: Kai Krakow @ 2017-04-10 12:45 UTC (permalink / raw)
  To: linux-btrfs

Am Mon, 10 Apr 2017 08:39:23 -0400
schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:

> They've been running BTRFS 
> with LZO compression, the SSD allocator, atime disabled, and mtime 
> updates deferred (lazytime mount option) the whole time, so it may be
> a slightly different use case than the OP from this thread.

Does btrfs really support lazytime now?

-- 
Regards,
Kai

Replies to list-only preferred.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2017-04-10 12:45           ` Kai Krakow
@ 2017-04-10 12:51             ` Austin S. Hemmelgarn
  2017-04-10 16:53               ` Kai Krakow
       [not found]               ` <20170410184444.08ced097@jupiter.sol.local>
  0 siblings, 2 replies; 33+ messages in thread
From: Austin S. Hemmelgarn @ 2017-04-10 12:51 UTC (permalink / raw)
  To: Kai Krakow, linux-btrfs

On 2017-04-10 08:45, Kai Krakow wrote:
> Am Mon, 10 Apr 2017 08:39:23 -0400
> schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>
>> They've been running BTRFS
>> with LZO compression, the SSD allocator, atime disabled, and mtime
>> updates deferred (lazytime mount option) the whole time, so it may be
>> a slightly different use case than the OP from this thread.
>
> Does btrfs really support lazytime now?
>
It appears to, I do see fewer writes with it than without it.  At the 
very least, if it doesn't, then nothing complains about it.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2017-04-10 12:51             ` Austin S. Hemmelgarn
@ 2017-04-10 16:53               ` Kai Krakow
       [not found]               ` <20170410184444.08ced097@jupiter.sol.local>
  1 sibling, 0 replies; 33+ messages in thread
From: Kai Krakow @ 2017-04-10 16:53 UTC (permalink / raw)
  To: linux-btrfs

Am Mon, 10 Apr 2017 08:51:38 -0400
schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:

> On 2017-04-10 08:45, Kai Krakow wrote:
> > Am Mon, 10 Apr 2017 08:39:23 -0400
> > schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> >  
> >> They've been running BTRFS
> >> with LZO compression, the SSD allocator, atime disabled, and mtime
> >> updates deferred (lazytime mount option) the whole time, so it may
> >> be a slightly different use case than the OP from this thread.  
> >
> > Does btrfs really support lazytime now?
> >  
> It appears to, I do see fewer writes with it than without it.  At the 
> very least, if it doesn't, then nothing complains about it.

Did you put it in /etc/fstab only for the rootfs? If yes, it probably
has no effect. You would need to give it as rootflags on the kernel
cmdline.


-- 
Regards,
Kai

Replies to list-only preferred.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
       [not found]               ` <20170410184444.08ced097@jupiter.sol.local>
@ 2017-04-10 16:54                 ` Kai Krakow
  2017-04-10 17:13                   ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 33+ messages in thread
From: Kai Krakow @ 2017-04-10 16:54 UTC (permalink / raw)
  To: linux-btrfs

Am Mon, 10 Apr 2017 18:44:44 +0200
schrieb Kai Krakow <hurikhan77@gmail.com>:

> Am Mon, 10 Apr 2017 08:51:38 -0400
> schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> 
> > On 2017-04-10 08:45, Kai Krakow wrote:  
> > > Am Mon, 10 Apr 2017 08:39:23 -0400
> > > schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> > >    
>  [...]  
> > >
> > > Does btrfs really support lazytime now?
> > >    
> > It appears to, I do see fewer writes with it than without it.  At
> > the very least, if it doesn't, then nothing complains about it.  
> 
> Did you put it in /etc/fstab only for the rootfs? If yes, it probably
> has no effect. You would need to give it as rootflags on the kernel
> cmdline.

I did a "fgrep lazytime /usr/src/linux -ir" and it reveals only ext4
and f2fs know the flag. Kernel 4.10.

So probably you're seeing a placebo effect. If you put lazytime for
rootfs just only into fstab, it won't have an effect because on initial
mount this file cannot be opened (for obvious reasons), and on remount,
btrfs seems to happily accept lazytime but it has no effect. It won't
show up in /proc/mounts. Try using it in rootflags kernel cmdline and
you should see that the kernel won't accept the flag lazytime.

-- 
Regards,
Kai

Replies to list-only preferred.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2017-04-10 16:54                 ` Kai Krakow
@ 2017-04-10 17:13                   ` Austin S. Hemmelgarn
  2017-04-10 18:18                     ` Kai Krakow
  0 siblings, 1 reply; 33+ messages in thread
From: Austin S. Hemmelgarn @ 2017-04-10 17:13 UTC (permalink / raw)
  To: linux-btrfs

On 2017-04-10 12:54, Kai Krakow wrote:
> Am Mon, 10 Apr 2017 18:44:44 +0200
> schrieb Kai Krakow <hurikhan77@gmail.com>:
>
>> Am Mon, 10 Apr 2017 08:51:38 -0400
>> schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>
>>> On 2017-04-10 08:45, Kai Krakow wrote:
>>>> Am Mon, 10 Apr 2017 08:39:23 -0400
>>>> schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>
>>  [...]
>>>>
>>>> Does btrfs really support lazytime now?
>>>>
>>> It appears to, I do see fewer writes with it than without it.  At
>>> the very least, if it doesn't, then nothing complains about it.
>>
>> Did you put it in /etc/fstab only for the rootfs? If yes, it probably
>> has no effect. You would need to give it as rootflags on the kernel
>> cmdline.
>
> I did a "fgrep lazytime /usr/src/linux -ir" and it reveals only ext4
> and f2fs know the flag. Kernel 4.10.
>
> So probably you're seeing a placebo effect. If you put lazytime for
> rootfs just only into fstab, it won't have an effect because on initial
> mount this file cannot be opened (for obvious reasons), and on remount,
> btrfs seems to happily accept lazytime but it has no effect. It won't
> show up in /proc/mounts. Try using it in rootflags kernel cmdline and
> you should see that the kernel won't accept the flag lazytime.
>
The command-line also rejects a number of perfectly legitimate arguments 
that BTRFS does understand too though, so that's not much of a test. 
I've just finished some quick testing though, and it looks like you're 
right, BTRFS does not support this, which means I now need to figure out 
what the hell was causing the IOPS counters in collectd to change in 
rough correlation  with remounting (especially since it appears to 
happen mostly independent of the options being changed).

This is somewhat disappointing though, as supporting this would probably 
help with the write-amplification issues inherent in COW filesystems.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2017-04-10 17:13                   ` Austin S. Hemmelgarn
@ 2017-04-10 18:18                     ` Kai Krakow
  2017-04-10 19:43                       ` Austin S. Hemmelgarn
  2017-04-10 23:45                       ` Janos Toth F.
  0 siblings, 2 replies; 33+ messages in thread
From: Kai Krakow @ 2017-04-10 18:18 UTC (permalink / raw)
  To: linux-btrfs

Am Mon, 10 Apr 2017 13:13:39 -0400
schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:

> On 2017-04-10 12:54, Kai Krakow wrote:
> > Am Mon, 10 Apr 2017 18:44:44 +0200
> > schrieb Kai Krakow <hurikhan77@gmail.com>:
> >  
> >> Am Mon, 10 Apr 2017 08:51:38 -0400
> >> schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> >>  
>  [...]  
>  [...]  
> >>  [...]  
>  [...]  
>  [...]  
> >>
> >> Did you put it in /etc/fstab only for the rootfs? If yes, it
> >> probably has no effect. You would need to give it as rootflags on
> >> the kernel cmdline.  
> >
> > I did a "fgrep lazytime /usr/src/linux -ir" and it reveals only ext4
> > and f2fs know the flag. Kernel 4.10.
> >
> > So probably you're seeing a placebo effect. If you put lazytime for
> > rootfs just only into fstab, it won't have an effect because on
> > initial mount this file cannot be opened (for obvious reasons), and
> > on remount, btrfs seems to happily accept lazytime but it has no
> > effect. It won't show up in /proc/mounts. Try using it in rootflags
> > kernel cmdline and you should see that the kernel won't accept the
> > flag lazytime. 
> The command-line also rejects a number of perfectly legitimate
> arguments that BTRFS does understand too though, so that's not much
> of a test.

Which are those? I didn't encounter any...

> I've just finished some quick testing though, and it looks
> like you're right, BTRFS does not support this, which means I now
> need to figure out what the hell was causing the IOPS counters in
> collectd to change in rough correlation  with remounting (especially
> since it appears to happen mostly independent of the options being
> changed).

I think that noatime (which I remember you also used?), lazytime, and
relatime are mutually exclusive: they all handle the inode updates.
Maybe that is the effect you see?

> This is somewhat disappointing though, as supporting this would
> probably help with the write-amplification issues inherent in COW
> filesystems. --

Well, relatime is mostly the same thus not perfectly resembling the
POSIX standard. I think the only software that relies on atime is
mutt...

-- 
Regards,
Kai

Replies to list-only preferred.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2017-04-10 18:18                     ` Kai Krakow
@ 2017-04-10 19:43                       ` Austin S. Hemmelgarn
  2017-04-10 22:21                         ` Adam Borowski
  2017-04-11  4:01                         ` Kai Krakow
  2017-04-10 23:45                       ` Janos Toth F.
  1 sibling, 2 replies; 33+ messages in thread
From: Austin S. Hemmelgarn @ 2017-04-10 19:43 UTC (permalink / raw)
  To: linux-btrfs

On 2017-04-10 14:18, Kai Krakow wrote:
> Am Mon, 10 Apr 2017 13:13:39 -0400
> schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>
>> On 2017-04-10 12:54, Kai Krakow wrote:
>>> Am Mon, 10 Apr 2017 18:44:44 +0200
>>> schrieb Kai Krakow <hurikhan77@gmail.com>:
>>>
>>>> Am Mon, 10 Apr 2017 08:51:38 -0400
>>>> schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>>>>
>>  [...]
>>  [...]
>>>>  [...]
>>  [...]
>>  [...]
>>>>
>>>> Did you put it in /etc/fstab only for the rootfs? If yes, it
>>>> probably has no effect. You would need to give it as rootflags on
>>>> the kernel cmdline.
>>>
>>> I did a "fgrep lazytime /usr/src/linux -ir" and it reveals only ext4
>>> and f2fs know the flag. Kernel 4.10.
>>>
>>> So probably you're seeing a placebo effect. If you put lazytime for
>>> rootfs just only into fstab, it won't have an effect because on
>>> initial mount this file cannot be opened (for obvious reasons), and
>>> on remount, btrfs seems to happily accept lazytime but it has no
>>> effect. It won't show up in /proc/mounts. Try using it in rootflags
>>> kernel cmdline and you should see that the kernel won't accept the
>>> flag lazytime.
>> The command-line also rejects a number of perfectly legitimate
>> arguments that BTRFS does understand too though, so that's not much
>> of a test.
>
> Which are those? I didn't encounter any...
I'm not sure there are any anymore, but I know that a handful (mostly 
really uncommon ones) used to (and BTRFS is not alone in this respect, 
some of the more esoteric ext4 options aren't accepted on the kernel 
command-line either).  I know at a minimum at some point in the past 
alloc-start, check_int, and inode_cache did not work from the kernel 
command-line.
>
>> I've just finished some quick testing though, and it looks
>> like you're right, BTRFS does not support this, which means I now
>> need to figure out what the hell was causing the IOPS counters in
>> collectd to change in rough correlation  with remounting (especially
>> since it appears to happen mostly independent of the options being
>> changed).
>
> I think that noatime (which I remember you also used?), lazytime, and
> relatime are mutually exclusive: they all handle the inode updates.
> Maybe that is the effect you see?
They're not exactly exclusive.  The lazytime option will prevent changes 
to the mtime or atime fields in a file from forcing inode write-out for 
up to 24 hours (if the inode would be written out for some other reason 
(such as a file-size change or the inode being evicted from the cache), 
then the timestamps will be too), but it does not change the value of 
the timestamps.  So if you have lazytime enabled and use touch to update 
the mtime on an otherwise idle file, the mtime will still be correct as 
far as userspace is concerned, as long as you don't crash before the 
update hits the disk (but userspace will only see the discrepancy 
_after_ the crash).

By comparison, relatime causes the atime not to be updated at all if it's 
changed in the last 24 hours, and noatime completely prevents atime 
updates.  In both cases, the atime isn't correct at all in userspace as 
far as POSIX is concerned.

So, you have the following combinations:
* strictatime, nolazytime: Both atime and mtime updates happen, and are 
flushed to disk (almost) immediately.
* relatime, nolazytime (the upstream default): atime updates happen only 
if the atime hasn't changed in 24 hours, mtime updates happen as normal, 
and both types of update are flushed to disk (almost) immediately.
* noatime, nolazytime (the default on some specific kernels (this is 
easy to patch, so a lot of people who already carry custom patches and 
don't use mutt patch it)): atime updates never happen, mtime updates 
happen as normal and are flushed to disk (almost) immediately.
* strictatime, lazytime: Both atime and mtime updates happen, but the 
actual update may not hit the disk for up to 24 hours (this will let 
mutt work correctly as long as your system shuts down cleanly, but still 
improve performance noticeably on at least ext4).
* relatime, lazytime: atime updates happen only if the atime hasn't 
changed in 24 hours, mtime updates happen as normal, and both may not 
hit the disk for up to 24 hours.
* noatime, lazytime (what I'm trying to run): atime updates never 
happen, mtime updates happen as normal, but may not hit the disk for up 
to 24 hours.

In essence, lazytime only impacts inode writeback (deferring it under 
special circumstances), while {no,rel,strict}atime impacts the actual 
value of the time-stamps.
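
A quick way to see which of those knobs a given mount is actually giving
you is to compare the userspace-visible timestamps around a read; note
that this only shows what {no,rel,strict}atime do to the values, it cannot
show lazytime's deferred write-out, which is invisible to stat(). The path
is just an example file on the mount under test.

import os

path = '/var/log/syslog'            # any existing file on the mount under test
before = os.stat(path)
with open(path, 'rb') as f:
    f.read(1)                       # a read is what may trigger an atime update
after = os.stat(path)
print('atime changed:', before.st_atime_ns != after.st_atime_ns)
print('mtime changed:', before.st_mtime_ns != after.st_mtime_ns)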
>
>> This is somewhat disappointing though, as supporting this would
>> probably help with the write-amplification issues inherent in COW
>> filesystems. --
>
> Well, relatime is mostly the same thus not perfectly resembling the
> POSIX standard. I think the only software that relies on atime is
> mutt...
This very much depends on what you're doing.  If you have a WORM 
workload, then yeah, it's pretty much the same.  If however you have 
something like a database workload where a specific set of files get 
internally rewritten regularly, then it actually has a measurable impact.

As a very specific example, I run collectd on my systems using RRD files 
as data storage.  An RRD file is essentially a really fancy circular 
buffer, so it remains fixed size but gets a _lot_ of internal rewrites 
(by the way, if anyone wants to test fragmentation behavior on BTRFS, 
RRD files are a great way to do it).  Because of how I have things set 
up, each file gets a batch of data points every 1-2 minutes.  This in 
turn means that the mtime is updating every 1-2 minutes for each of the 
1000+ RRD files.  In this case, writing out the timestamps results in an 
overhead of roughly 256 bytes per file, which is about 0.1% based on the 
average file size of roughly 169k.  If I use noatime on this filesystem, 
then it has near zero impact because the average number of times per 
hour that these files are read is near zero.  Turning on lazytime, 
however, results in mtime updates getting deferred until the hourly 
forced fs sync for this filesystem hits (this is something I'm doing, not 
the OS), which reduces the overhead by a factor of roughly 45 (the 
average number of writes per file per hour) to about 0.00003%, which is 
a pretty serious difference.
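
For anyone who wants to reproduce that kind of fragmentation without
setting up collectd, a rough stand-in workload (not what rrdtool literally
does; the path and sizes are made up) is just a fixed-size file getting
lots of synced in-place rewrites, checked afterwards with filefrag.

import os
import random

PATH = '/mnt/btrfs-test/rrd-like.bin'   # placeholder path on the fs under test
FILE_SIZE = 169 * 1024                  # roughly the average RRD size above
WRITE_SIZE = 4096

with open(PATH, 'wb') as f:             # create the fixed-size "circular buffer"
    f.write(b'\0' * FILE_SIZE)

rng = random.Random(0)
with open(PATH, 'r+b') as f:
    for _ in range(1000):               # many small internal rewrites
        f.seek(rng.randrange(FILE_SIZE // WRITE_SIZE) * WRITE_SIZE)
        f.write(os.urandom(WRITE_SIZE))
        f.flush()
        os.fsync(f.fileno())            # make each update a separate on-disk write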

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2017-04-10 19:43                       ` Austin S. Hemmelgarn
@ 2017-04-10 22:21                         ` Adam Borowski
  2017-04-11  4:01                         ` Kai Krakow
  1 sibling, 0 replies; 33+ messages in thread
From: Adam Borowski @ 2017-04-10 22:21 UTC (permalink / raw)
  To: linux-btrfs

On Mon, Apr 10, 2017 at 03:43:57PM -0400, Austin S. Hemmelgarn wrote:
> On 2017-04-10 14:18, Kai Krakow wrote:

> * strictatime, lazytime: Both atime and mtime updates happen, but the
> actual update may not hit the disk for up to 24 hours (this will let mutt
> work correctly as long as your system shuts down cleanly, but still improve
> performance noticeably on at least ext4).

> > Well, relatime is mostly the same thus not perfectly resembling the
> > POSIX standard. I think the only software that relies on atime is
> > mutt...

Well, about that mutt thing...  Neomutt actually, but that's the codebase
Debian uses:

https://github.com/neomutt/neomutt/commit/816095bfdb72caafd8845e8fb28cbc8c6afc114f

-- 
⢀⣴⠾⠻⢶⣦⠀ Meow!
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Collisions shmolisions, let's see them find a collision or second
⠈⠳⣄⠀⠀⠀⠀ preimage for double rot13!

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2017-04-10 18:18                     ` Kai Krakow
  2017-04-10 19:43                       ` Austin S. Hemmelgarn
@ 2017-04-10 23:45                       ` Janos Toth F.
  2017-04-11  3:56                         ` Kai Krakow
  1 sibling, 1 reply; 33+ messages in thread
From: Janos Toth F. @ 2017-04-10 23:45 UTC (permalink / raw)
  To: Btrfs BTRFS

>> The command-line also rejects a number of perfectly legitimate
>> arguments that BTRFS does understand too though, so that's not much
>> of a test.
>
> Which are those? I didn't encounter any...

I think this bug still stands unresolved (for 3+ years, probably
because most people use init-rd/fs without ever considering to omit it
in case they don't really need it at all):
Bug 61601 - rootflags=noatime causes kernel panic when booting without initrd.
The last time I tried it applied to Btrfs as well:
https://bugzilla.kernel.org/show_bug.cgi?id=61601#c18

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2017-04-10 23:45                       ` Janos Toth F.
@ 2017-04-11  3:56                         ` Kai Krakow
  0 siblings, 0 replies; 33+ messages in thread
From: Kai Krakow @ 2017-04-11  3:56 UTC (permalink / raw)
  To: linux-btrfs

Am Tue, 11 Apr 2017 01:45:32 +0200
schrieb "Janos Toth F." <toth.f.janos@gmail.com>:

> >> The command-line also rejects a number of perfectly legitimate
> >> arguments that BTRFS does understand too though, so that's not much
> >> of a test.  
> >
> > Which are those? I didn't encounter any...  
> 
> I think this bug still stands unresolved (for 3+ years, probably
> because most people use init-rd/fs without ever considering to omit it
> in case they don't really need it at all):
> Bug 61601 - rootflags=noatime causes kernel panic when booting
> without initrd. The last time I tried it applied to Btrfs as well:
> https://bugzilla.kernel.org/show_bug.cgi?id=61601#c18

Ah okay, so the difference is with the mount handler. I can only use
initrd here because I have a multi-device btrfs on top of bcache as rootfs.

-- 
Regards,
Kai

Replies to list-only preferred.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2017-04-10 19:43                       ` Austin S. Hemmelgarn
  2017-04-10 22:21                         ` Adam Borowski
@ 2017-04-11  4:01                         ` Kai Krakow
  2017-04-11  9:55                           ` Adam Borowski
  1 sibling, 1 reply; 33+ messages in thread
From: Kai Krakow @ 2017-04-11  4:01 UTC (permalink / raw)
  To: linux-btrfs

Am Mon, 10 Apr 2017 15:43:57 -0400
schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:

> On 2017-04-10 14:18, Kai Krakow wrote:
> > Am Mon, 10 Apr 2017 13:13:39 -0400
> > schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> >  
> >> On 2017-04-10 12:54, Kai Krakow wrote:  
>  [...]  
>  [...]  
> >>  [...]
> >>  [...]  
>  [...]  
> >>  [...]
> >>  [...]  
>  [...]  
>  [...]  
> >> The command-line also rejects a number of perfectly legitimate
> >> arguments that BTRFS does understand too though, so that's not much
> >> of a test.  
> >
> > Which are those? I didn't encounter any...  
> I'm not sure there are any anymore, but I know that a handful (mostly
> really uncommon ones) used to be rejected (and BTRFS is not alone in
> this respect; some of the more esoteric ext4 options aren't accepted on
> the kernel command-line either).  I know that at a minimum, at some
> point in the past, alloc-start, check_int, and inode_cache did not work
> from the kernel command-line.

The post from Janos explains why: The difference is with the mount
handler, depending on whether you use initrd or not.

> >> I've just finished some quick testing though, and it looks
> >> like you're right, BTRFS does not support this, which means I now
> >> need to figure out what the hell was causing the IOPS counters in
> >> collectd to change in rough correlation  with remounting
> >> (especially since it appears to happen mostly independent of the
> >> options being changed).  
> >
> > I think that noatime (which I remember you also used?), lazytime,
> > and relatime are mutually exclusive: they all handle the inode
> > updates. Maybe that is the effect you see?  
> They're not exactly exclusive.  The lazytime option will prevent
> changes to the mtime or atime fields in a file from forcing inode
> write-out for up to 24 hours (if the inode would be written out for
> some other reason (such as a file-size change or the inode being
> evicted from the cache), then the timestamps will be too), but it
> does not change the value of the timestamps.  So if you have lazytime
> enabled and use touch to update the mtime on an otherwise idle file,
> the mtime will still be correct as far as userspace is concerned, as
> long as you don't crash before the update hits the disk (but
> userspace will only see the discrepancy _after_ the crash).

Yes, I know all this. But I don't see why you still want noatime or
relatime if you use lazytime, except for super-optimizing. Lazytime
gives you POSIX conformity for a problem that the other options only
tried to solve.
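
For completeness: at the VFS level the two flags are independent and can 
be combined in a single remount.  A minimal sketch, assuming a 
hypothetical mount point and libc headers that already define 
MS_LAZYTIME (Linux >= 4.0):

/* Hypothetical sketch: remount a filesystem with both noatime and
 * lazytime set.  Needs CAP_SYS_ADMIN; the path is an example only. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    unsigned long flags = MS_REMOUNT | MS_NOATIME | MS_LAZYTIME;

    /* source and fstype are ignored for MS_REMOUNT */
    if (mount(NULL, "/mnt/data", NULL, flags, NULL) != 0) {
        perror("mount");
        return 1;
    }
    return 0;
}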

> > Well, relatime is mostly the same thus not perfectly resembling the
> > POSIX standard. I think the only software that relies on atime is
> > mutt...  
> This very much depends on what you're doing.  If you have a WORM 
> workload, then yeah, it's pretty much the same.  If however you have 
> something like a database workload where a specific set of files get 
> internally rewritten regularly, then it actually has a measurable
> impact.

I think "impact" is a whole different story. I'm on your side here.


-- 
Regards,
Kai

Replies to list-only preferred.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2017-04-11  4:01                         ` Kai Krakow
@ 2017-04-11  9:55                           ` Adam Borowski
  2017-04-11 11:16                             ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 33+ messages in thread
From: Adam Borowski @ 2017-04-11  9:55 UTC (permalink / raw)
  To: linux-btrfs

On Tue, Apr 11, 2017 at 06:01:19AM +0200, Kai Krakow wrote:
> Yes, I know all this. But I don't see why you still want noatime or
> relatime if you use lazytime, except for super-optimizing. Lazytime
> gives you POSIX conformity for a problem that the other options only
> tried to solve.

(Besides lazytime also working on mtime, and, technically, ctime.)

First: atime, in any form, murders snapshots.  On any filesystem that has
them, not just btrfs -- I've tested zfs and LVM snapshots, there's also
qcow2/vdi and so on.  On all of them, every single read-everything operation
costs you 5% disk space.  For a _read_ operation!

I've tested a /usr-y mix of files, for consistency with the guy who mentioned
this problem first.  Your mileage will vary depending on whether you store
100GB disk images or a news spool.

Read-everything is quite rare, but most systems have at least one
stat-everything cronjob.  That touches only diratime, but that's still
1-in-11 inodes (remarkably consistent: I've checked a few machines with
drastically different purposes, and somehow the min was 10, max 12).

And no, marking snapshots as ro doesn't help: reading the live version still
breaks CoW.


Second: atime murders media with limited write endurance.  Modern SSDs can
cope well, but I for one work a lot with SD and eMMC.  Every single SoC
image I've seen uses noatime for this reason.


Third: relatime/lazytime don't eliminate the performance cost.  They fix
only frequently read files -- if you have a big filesystem where you read a
lot but individual files tend to be read rarely, relatime is as bad as
strictatime, and lazytime is actually worse.  Both will do an unnecessary write
of all inodes.


Fourth: why?  Besides being POSIXLY_CORRECT, what do you actually gain from
atime?  I can think only of:
* new mail notification with mbox.  Just patch the mail reader to manually
  call futimens(..., {UTIME_NOW,UTIME_OMIT}); it has no extra cost on
  !noatime mounts (see the sketch after this list).  I've personally done
  so for mutt, the updated version will ship in Debian stretch; you can
  patch other mail readers, although they tend to be rarely used in
  conjunction with shell access (and thus have no need for atime at all).
* Debian's popcon's "vote" field.  Use "inst", and there's no gain from
  popcon for you personally.
* some intrusion detection forensics (broken by open(..., O_NOATIME))
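
A minimal sketch of the two calls mentioned in the list above; the 
mailbox path is hypothetical, and O_NOATIME needs the caller to own the 
file (or CAP_FOWNER):

/* Hypothetical sketch:
 * - open(..., O_NOATIME) reads without updating atime at all (which is
 *   also what breaks atime-based intrusion forensics);
 * - futimens() with {UTIME_NOW, UTIME_OMIT} bumps only the atime, so an
 *   mbox reader keeps "new mail" detection working on noatime mounts. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *mbox = "/var/mail/example";   /* example path only */

    int fd = open(mbox, O_RDONLY | O_NOATIME);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* ... read the mailbox here ... */

    /* Set atime to "now", leave mtime untouched. */
    struct timespec times[2] = {
        { .tv_nsec = UTIME_NOW  },   /* [0] = atime */
        { .tv_nsec = UTIME_OMIT },   /* [1] = mtime */
    };
    if (futimens(fd, times) != 0)
        perror("futimens");

    close(fd);
    return 0;
}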


Conclusion: death to atime!
-- 
⢀⣴⠾⠻⢶⣦⠀ Meow!
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Collisions shmolisions, let's see them find a collision or second
⠈⠳⣄⠀⠀⠀⠀ preimage for double rot13!

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: btrfs filesystem keeps allocating new chunks for no apparent reason
  2017-04-11  9:55                           ` Adam Borowski
@ 2017-04-11 11:16                             ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 33+ messages in thread
From: Austin S. Hemmelgarn @ 2017-04-11 11:16 UTC (permalink / raw)
  To: linux-btrfs

On 2017-04-11 05:55, Adam Borowski wrote:
> On Tue, Apr 11, 2017 at 06:01:19AM +0200, Kai Krakow wrote:
>> Yes, I know all this. But I don't see why you still want noatime or
>> relatime if you use lazytime, except for super-optimizing. Lazytime
>> gives you POSIX conformity for a problem that the other options only
>> tried to solve.
>
> (Besides lazytime also working on mtime, and, technically, ctime.)
Nope, by definition it can't work on ctime, because a ctime update means 
something else changed in the inode, which in turn will cause it to be 
flushed to disk normally.  Lazytime only defers the flush as long as 
nothing else in the inode is dirty, so it won't help much on stuff like 
traditional log files: their size changes regularly, which updates the 
inode, which then causes it to get flushed.
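
To make that concrete, a minimal sketch (the file path is hypothetical) 
contrasting a timestamp-only update, which lazytime can hold in memory, 
with a size change, which dirties the inode and takes the timestamps to 
disk with it:

/* Hypothetical sketch of the two cases described above.  Under lazytime,
 * the futimens() call only dirties the in-memory timestamps and may stay
 * unwritten for up to ~24h; the ftruncate() changes i_size, so the inode
 * (timestamps included) goes out with the next regular writeback. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/lazy/testfile", O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Case 1: timestamp-only update (deferrable under lazytime). */
    struct timespec now[2] = {
        { .tv_nsec = UTIME_NOW },   /* atime */
        { .tv_nsec = UTIME_NOW },   /* mtime */
    };
    futimens(fd, now);

    /* Case 2: size change, forces the whole inode out with the next
     * ordinary writeback. */
    ftruncate(fd, 4096);

    close(fd);
    return 0;
}
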
>
> First: atime, in any form, murders snapshots.  On any filesystem that has
> them, not just btrfs -- I've tested zfs and LVM snapshots, there's also
> qcow2/vdi and so on.  On all of them, every single read-everything operation
> costs you 5% disk space.  For a _read_ operation!
>
> I've tested /usr-y mix of files, for consistency with the guy who mentioned
> this problem first.  Your mileage will vary depending on whether you store
> 100GB disk images or a news spool.
>
> Read-everything is quite rare, but most systems have at least one
> stat-everything cronjob.  That touches only diratime, but that's still
> 1-in-11 inodes (remarkably consistent: I've checked a few machines with
> drastically different purposes, and somehow the min was 10, max 12).
>
> And no, marking snapshots as ro doesn't help: reading the live version still
> breaks CoW.
>
>
> Second: atime murders media with limited write endurance.  Modern SSD can
> cope well, but I for one work a lot with SD and eMMC.  Every single SoC
> image I've seen uses noatime for this reason.
Even on SSDs it's still an issue, especially with something like ext4, 
which uses inode tables (updating one inode will usually require an RMW 
of an erase block regardless, but using inode tables means that this 
happens _all the time_).
>
>
> Third: relatime/lazytime don't eliminate the performance cost.  They fix
> only frequently read files -- if you have a big filesystem where you read a
> lot but individual files tend to be read rarely, relatime is as bad as
> strictatime, and lazytime actually worse.  Both will do an unnecessary write
> of all inodes.
>
>
> Four: why?  Beside being POSIXLY_CORRECT, what do you actually gain from
> atime?  I can think only of:
> * new mail notification with mbox.  Just patch the mail reader to manually
>   futimens(..., {UTIME_NOW,UTIME_OMIT}), it has no extra cost on !noatime
>   mounts.  I've personally did so for mutt, the updated version will ship
>   in Debian stretch; you can patch other mail readers although they tend
>   to be rarely used in conjunction with shell access (and thus they have
>   no need for atime at all).
> * Debian's popcon's "vote" field.  Use "inst", and there's no gain from
>   popcon for you personally.
> * some intrusion detection forensics (broken by open(..., O_NOATIME))
On top of all that:
Five:
Handling of atime slows down stat and a handful of other things.  If you 
take a source tree the size of the Linux kernel, write a patch that 
changes every file (even just one character), and then go to commit it 
in Git (or SVN, or Bazaar, or Mercurial), you'll see a pretty serious 
difference in the time it takes to commit because almost all VCS 
software calls stat() on the entire tree.  relatime won't help much here 
because the check to determine whether or not to update the atime still 
has to happen (in fact, it will hurt slightly; strictatime eliminates 
that check).
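
A rough sketch of that stat-everything pass, for anyone who wants to 
time it against differently mounted copies of the same tree (the path 
argument is just an example):

/* Hypothetical sketch: walk a tree and stat() every entry, roughly what
 * VCS status/commit operations do.  Compare the runtime on strictatime,
 * relatime and noatime mounts of the same tree. */
#define _XOPEN_SOURCE 700
#include <ftw.h>
#include <stdio.h>
#include <time.h>

static long nentries;

static int visit(const char *path, const struct stat *sb,
                 int typeflag, struct FTW *ftwbuf)
{
    (void)path; (void)sb; (void)typeflag; (void)ftwbuf;
    nentries++;                /* nftw() has already stat()ed the entry */
    return 0;
}

int main(int argc, char **argv)
{
    const char *root = argc > 1 ? argv[1] : ".";
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (nftw(root, visit, 64, FTW_PHYS) != 0) {
        perror("nftw");
        return 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("stat()ed %ld entries in %.3f s\n", nentries, secs);
    return 0;
}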

Six:
It doesn't behave how most users would inherently expect, partly because 
there are ways to bypass it even if the FS is mounted with strictatime.
>
>
> Conclusion: death to atime!
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2017-04-11 11:16 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-06 21:28 btrfs filesystem keeps allocating new chunks for no apparent reason Hans van Kranenburg
2016-05-30 11:07 ` Hans van Kranenburg
2016-05-30 19:55   ` Duncan
2016-05-30 21:18     ` Hans van Kranenburg
2016-05-30 21:55       ` Duncan
2016-05-31  1:36 ` Qu Wenruo
2016-06-08 23:10   ` Hans van Kranenburg
2016-06-09  8:52     ` Marc Haber
2016-06-09 10:37       ` Hans van Kranenburg
2016-06-09 15:41     ` Duncan
2016-06-10 17:07       ` Henk Slager
2016-06-11 15:23         ` Hans van Kranenburg
2016-06-09 18:07     ` Chris Murphy
2017-04-07 21:25   ` Hans van Kranenburg
2017-04-07 23:56     ` Peter Grandi
2017-04-08  7:09     ` Duncan
2017-04-08 11:16     ` Hans van Kranenburg
2017-04-08 11:35       ` Hans van Kranenburg
2017-04-09 23:23       ` Hans van Kranenburg
2017-04-10 12:39         ` Austin S. Hemmelgarn
2017-04-10 12:45           ` Kai Krakow
2017-04-10 12:51             ` Austin S. Hemmelgarn
2017-04-10 16:53               ` Kai Krakow
     [not found]               ` <20170410184444.08ced097@jupiter.sol.local>
2017-04-10 16:54                 ` Kai Krakow
2017-04-10 17:13                   ` Austin S. Hemmelgarn
2017-04-10 18:18                     ` Kai Krakow
2017-04-10 19:43                       ` Austin S. Hemmelgarn
2017-04-10 22:21                         ` Adam Borowski
2017-04-11  4:01                         ` Kai Krakow
2017-04-11  9:55                           ` Adam Borowski
2017-04-11 11:16                             ` Austin S. Hemmelgarn
2017-04-10 23:45                       ` Janos Toth F.
2017-04-11  3:56                         ` Kai Krakow
