* R: Re: Slow startup of systemd-journal on BTRFS
@ 2014-06-12 11:13 Goffredo Baroncelli <kreijack@libero.it>
  2014-06-12 12:37 ` Duncan
  0 siblings, 1 reply; 45+ messages in thread
From: Goffredo Baroncelli <kreijack@libero.it> @ 2014-06-12 11:13 UTC (permalink / raw)
  To: david; +Cc: systemd Mailing List, linux-btrfs

>From: david@fromorbit.com
>Date: 12/06/2014 3.21
>To: <kreijack@inwind.it>
>Cc: "systemd Mailing List"<systemd-devel@lists.freedesktop.org>, "linux-btrfs"
<linux-btrfs@vger.kernel.org>
>Subject: Re: Slow startup of systemd-journal on BTRFS
>
>On Wed, Jun 11, 2014 at 11:28:54PM +0200, Goffredo Baroncelli wrote:
>> Hi all,
>> 
>> I would like to share my experience with slowness of systemd when used
>> on BTRFS.
>> 
>> My boot time was very high (about ~50 seconds); most of the time was due to
>> NetworkManager, which took about 30-40 seconds to start (this data came from
>> "systemd-analyze plot").
>> 
>> I made several attempts to address this issue. I also noticed that sometimes
>> the problem disappeared, but I was never able to understand why.
>> 
>> However this link
>> 
>> 	https://bugzilla.redhat.com/show_bug.cgi?id=1006386
>> 
>> suggested to me that the problem could be due to a bad interaction between
>> systemd and btrfs. NetworkManager was innocent.
>
>systemd has a very stupid journal write pattern. It checks if there
>is space in the file for the write, and if not it fallocates the
>small amount of space it needs (it does *4 byte* fallocate calls!)
>and then does the write to it.  All this does is fragment the crap
>out of the log files because the filesystems cannot optimise the
>allocation patterns.

I checked the code, and it seems to me that the fallocate() calls are
done in FILE_SIZE_INCREASE units (currently 8 MB).
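
For illustration, the growth pattern being discussed can be sketched with
ordinary shell tools (this is a sketch, not systemd's actual code;
fallocate(1) from util-linux stands in for the fallocate() call, and the
8 MB step mirrors FILE_SIZE_INCREASE):

```shell
#!/bin/sh
# Sketch of the append pattern under discussion (not systemd's code):
# grow the file by one 8 MiB fallocate step, then do small writes into it.
set -e
F=$(mktemp)
fallocate -l $((8 * 1024 * 1024)) "$F"     # one FILE_SIZE_INCREASE step
printf 'a log entry\n' |
    dd of="$F" conv=notrunc status=none    # small write into preallocated space
stat -c %s "$F"                            # size stays at the fallocated 8 MiB
rm -f "$F"
```

The small writes land inside already-allocated space, so the file size stays
at 8388608 bytes until the next growth step.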

>
>Yup, it fragments journal files on XFS, too.
>
>http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
>
>IIRC, the systemd developers consider this a filesystem problem and
>so refused to change the systemd code to be nice to the filesystem
>allocators, even though they don't actually need to use fallocate...

Once I am able to get a decent setup running, I would like to experiment
with changing some parameters, such as:
- removing fallocate entirely (or only at the beginning?)
- increasing the fallocate allocation unit
- changing the log file size and rotation time
- periodically defragmenting
[...]
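
For the size/rotation experiments, the relevant journald.conf settings
(these options exist in journald; the values below are arbitrary examples,
not recommendations) would look something like:

```
# /etc/systemd/journald.conf (excerpt; example values only)
[Journal]
SystemMaxUse=200M       # cap total journal disk usage under /var/log/journal
SystemMaxFileSize=16M   # rotate an individual journal file at this size
MaxFileSec=1week        # also rotate by age
```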
>
>Cheers,
>
>Dave.
>
>-- 
>Dave Chinner
>david@fromorbit.com
>--
>To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: R: Re: Slow startup of systemd-journal on BTRFS
  2014-06-12 11:13 R: Re: Slow startup of systemd-journal on BTRFS Goffredo Baroncelli <kreijack@libero.it>
@ 2014-06-12 12:37 ` Duncan
  2014-06-12 23:24   ` Dave Chinner
  0 siblings, 1 reply; 45+ messages in thread
From: Duncan @ 2014-06-12 12:37 UTC (permalink / raw)
  To: linux-btrfs; +Cc: systemd-devel

Goffredo Baroncelli <kreijack@libero.it> posted on Thu, 12 Jun 2014
13:13:26 +0200 as excerpted:

>>systemd has a very stupid journal write pattern. It checks if there is
>>space in the file for the write, and if not it fallocates the small
>>amount of space it needs (it does *4 byte* fallocate calls!) and then
>>does the write to it.  All this does is fragment the crap out of the log
>>files because the filesystems cannot optimise the allocation patterns.
> 
> I checked the code, and to me it seems that the fallocate() are done in
> FILE_SIZE_INCREASE unit (actually 8MB).

FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think 
actually pretty much equally bad without NOCOW set on the file.

Why?  Because btrfs data blocks are 4 KiB.  With COW, the effect for 
either 4 byte or 8 MiB file allocations is going to end up being the 
same, forcing (repeated until full) rewrite of each 4 KiB block into its 
own extent.

Turning off the fallocate should allow btrfs to at least consolidate a 
bit, tho to the extent that multiple 4 KiB blocks cannot be written, 
repeated fsync will still cause issues.

80-100 MiB logs (size mentioned in another reply) should be reasonably 
well handled by btrfs autodefrag, however, if it's turned on.  I'd be 
worried if sizes were > 256 MiB and certainly as sizes approached a GiB, 
but it should handle 80-100 MiB just fine.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: R: Re: Slow startup of systemd-journal on BTRFS
  2014-06-12 12:37 ` Duncan
@ 2014-06-12 23:24   ` Dave Chinner
  2014-06-13 22:19     ` Goffredo Baroncelli
  0 siblings, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2014-06-12 23:24 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs, systemd-devel

On Thu, Jun 12, 2014 at 12:37:13PM +0000, Duncan wrote:
> Goffredo Baroncelli <kreijack@libero.it> posted on Thu, 12 Jun 2014
> 13:13:26 +0200 as excerpted:
> 
> >>systemd has a very stupid journal write pattern. It checks if there is
> >>space in the file for the write, and if not it fallocates the small
> >>amount of space it needs (it does *4 byte* fallocate calls!) and then
> >>does the write to it.  All this does is fragment the crap out of the log
> >>files because the filesystems cannot optimise the allocation patterns.
> > 
> > I checked the code, and to me it seems that the fallocate() are done in
> > FILE_SIZE_INCREASE unit (actually 8MB).
> 
> FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think 
> actually pretty much equally bad without NOCOW set on the file.

So maybe it's been fixed in systemd since the last time I looked.
Yup:

http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal-file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58

The reason it was changed? To "save a syscall per append", not to
prevent fragmentation of the file, which was the problem everyone
was complaining about...

> Why?  Because btrfs data blocks are 4 KiB.  With COW, the effect for 
> either 4 byte or 8 MiB file allocations is going to end up being the 
> same, forcing (repeated until full) rewrite of each 4 KiB block into its 
> own extent.

And that's now a btrfs problem.... :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: R: Re: Slow startup of systemd-journal on BTRFS
  2014-06-12 23:24   ` Dave Chinner
@ 2014-06-13 22:19     ` Goffredo Baroncelli
  2014-06-14  2:53       ` Duncan
  0 siblings, 1 reply; 45+ messages in thread
From: Goffredo Baroncelli @ 2014-06-13 22:19 UTC (permalink / raw)
  To: Dave Chinner, Duncan; +Cc: linux-btrfs, systemd-devel

Hi Dave

On 06/13/2014 01:24 AM, Dave Chinner wrote:
> On Thu, Jun 12, 2014 at 12:37:13PM +0000, Duncan wrote:
>> Goffredo Baroncelli <kreijack@libero.it> posted on Thu, 12 Jun 2014
>> 13:13:26 +0200 as excerpted:
>>
>>>> systemd has a very stupid journal write pattern. It checks if there is
>>>> space in the file for the write, and if not it fallocates the small
>>>> amount of space it needs (it does *4 byte* fallocate calls!) and then
>>>> does the write to it.  All this does is fragment the crap out of the log
>>>> files because the filesystems cannot optimise the allocation patterns.
>>>
>>> I checked the code, and to me it seems that the fallocate() are done in
>>> FILE_SIZE_INCREASE unit (actually 8MB).
>>
>> FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think 
>> actually pretty much equally bad without NOCOW set on the file.
> 
> So maybe it's been fixed in systemd since the last time I looked.
> Yup:
> 
> http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal-file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58
> 
> The reason it was changed? To "save a syscall per append", not to
> prevent fragmentation of the file, which was the problem everyone
> was complaining about...

Thanks for pointing that out. However, I am performing my tests on Fedora 20 with systemd-208, which seems to have this change.
> 
>> Why?  Because btrfs data blocks are 4 KiB.  With COW, the effect for 
>> either 4 byte or 8 MiB file allocations is going to end up being the 
>> same, forcing (repeated until full) rewrite of each 4 KiB block into its 
>> own extent.


I am reaching the conclusion that fallocate is not the problem. The fallocate increases the file size by about 8 MB, which is enough for a fair amount of logging, so it is not called very often.

I have to investigate further what happens when the logs are copied from /run to /var/log/journal: this is when journald seems to slow everything down.

I have prepared a PC which reboots continuously; I am collecting the time required to finish the boot vs. the fragmentation of the system.journal file vs. the number of boots. The results are dramatic: after 20 reboots, the boot time increases by 20-30 seconds. Defragmenting system.journal brings the boot time back to the original value, but after another 20 reboots, booting again takes 20-30 seconds more...

It is a slow PC, but I saw the same behavior on a more modern PC (an i5 with 8 GB of RAM).

Both PCs have mechanical hard disks...

> 
> And that's now a btrfs problem.... :/

Are you sure ?

ghigo@venice:/var/log$ sudo filefrag messages
messages: 29 extents found

ghigo@venice:/var/log$ sudo filefrag journal/*/system.journal
journal/41d686199835445395ac629d576dfcb9/system.journal: 1378 extents found

So the old rsyslog creates files with far fewer fragments. BTRFS (and, it seems, also XFS) certainly exposes this problem more than other filesystems do, but systemd also seems to create a lot of extents.

BR
G.Baroncelli



> 
> Cheers,
> 
> Dave.
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it)
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: Slow startup of systemd-journal on BTRFS
  2014-06-13 22:19     ` Goffredo Baroncelli
@ 2014-06-14  2:53       ` Duncan
  2014-06-14  7:52         ` Goffredo Baroncelli
                           ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Duncan @ 2014-06-14  2:53 UTC (permalink / raw)
  To: linux-btrfs; +Cc: systemd-devel

Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as
excerpted:

> On 06/13/2014 01:24 AM, Dave Chinner wrote:
>> On Thu, Jun 12, 2014 at 12:37:13PM +0000, Duncan wrote:
>>>
>>> FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think
>>> actually pretty much equally bad without NOCOW set on the file.
>> 
>> So maybe it's been fixed in systemd since the last time I looked.
>> Yup:
>> 
>> http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal-
file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58
>> 
>> The reason it was changed? To "save a syscall per append", not to
>> prevent fragmentation of the file, which was the problem everyone was
>> complaining about...
> 
> thanks for pointing that. However I am performing my tests on a fedora
> 20 with systemd-208, which seems have this change
>> 
>>> Why?  Because btrfs data blocks are 4 KiB.  With COW, the effect for
>>> either 4 byte or 8 MiB file allocations is going to end up being the
>>> same, forcing (repeated until full) rewrite of each 4 KiB block into
>>> its own extent.
> 
> I am reaching the conclusion that fallocate is not the problem. The
> fallocate increase the filesize of about 8MB, which is enough for some
> logging. So it is not called very often.

But... 

If a file isn't (properly[1]) set NOCOW (and the btrfs isn't mounted with 
nodatacow), then an fallocate of 8 MiB will increase the file size by 8 
MiB and write that out.  So far so good as at that point the 8 MiB should 
be a single extent.  But then, data gets written into 4 KiB blocks of 
that 8 MiB one at a time, and because btrfs is COW, the new data in the 
block must be written to a new location.

Which effectively means that by the time the 8 MiB is filled, each 4 KiB 
block has been rewritten to a new location and is now an extent unto 
itself.  So now that 8 MiB is composed of 2048 new extents, each one a 
single 4 KiB block in size.

=:^(
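
The arithmetic behind that figure, using the 4 KiB block size and 8 MiB
allocation unit from the discussion:

```shell
# Worst case: every 4 KiB block of an 8 MiB fallocated region is COWed
# into its own extent.
echo $(( (8 * 1024 * 1024) / (4 * 1024) ))    # prints 2048
```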

Tho as I already stated, for file sizes upto 128 MiB or so anyway[2], the 
btrfs autodefrag mount option should at least catch that and rewrite 
(again), this time sequentially.

> I have to investigate more what happens when the log are copied from
> /run to /var/log/journal: this is when journald seems to slow all.

That's an interesting point.

At least in theory, during normal operation journald will write to 
/var/log/journal, but there's a point during boot at which it flushes the 
information accumulated during boot from the volatile /run location to 
the non-volatile /var/log location.  /That/ write, at least, should be 
sequential, since there will be > 4 KiB of journal accumulated that needs 
to be transferred at once.  However, if it's being handled by the forced 
pre-write fallocate described above, then that's not going to be the 
case, as it'll then be a rewrite of already fallocated file blocks and 
thus will get COWed exactly as I described above.

=:^(


> I am prepared a PC which reboot continuously; I am collecting the time
> required to finish the boot vs the fragmentation of the system.journal
> file vs the number of boot. The results are dramatic: after 20 reboot,
> the boot time increase of 20-30 seconds. Doing a defrag of
> system.journal reduces the boot time to the original one, but after
> another 20 reboot, the boot time still requires 20-30 seconds more....
> 
> It is a slow PC, but I saw the same behavior also on a more modern pc
> (i5 with 8GB).
> 
> For both PC the HD is a mechanical one...

The problem's duplicable.  That's the first step toward a fix. =:^)

>> And that's now a btrfs problem.... :/
> 
> Are you sure ?

As they say, "Whoosh!"

At least here, I interpreted that remark as primarily sarcastic 
commentary on the systemd devs' apparent attitude, which can be 
(controversially) summarized as: "Systemd doesn't have problems because 
it's perfect.  Therefore, any problems you have with systemd must instead 
be with other components which systemd depends on."

IOW, it's a btrfs problem now in practice, not because it is so in a 
technical sense, but because systemd defines it as such and is unlikely 
to budge, so the only way to achieve progress is for btrfs to deal with 
it. 

An arguably fairer and more impartial assessment of this particular 
situation suggests that neither side is entirely at fault: btrfs, like all 
COW-based filesystems, has the existing-file-rewrite case as a major 
technical challenge that it must deal with /somehow/, while systemd, in 
choosing to use fallocate, is specifically putting itself in that 
existing-file-rewrite class.

But that doesn't matter if one side refuses to budge, because then the 
other side must do so regardless of where the fault was, if there is to 
be any progress at all.

Meanwhile, I've predicted before and do so here again, that as btrfs 
moves toward mainstream and starts supplanting ext* as the assumed Linux 
default filesystem, some of these problems will simply "go away", because 
at that point, various apps are no longer optimized for the assumed 
default filesystem, and they'll either be patched at some level (distro 
level if not upstream) to work better on the new default filesystem, or 
will be replaced by something that does.  If neither upstream nor the distro 
does that patching, then at some point people are going to find that said 
distro performs worse than other distros that do.

Another alternative is that distros will start setting /var/log/journal 
NOCOW in their setup scripts by default when it's btrfs, thus avoiding 
the problem.  (Altho if they do automated snapshotting they'll also have 
to set it as its own subvolume, to avoid the first-write-after-snapshot-
is-COW problem.)  Well, that, and/or set autodefrag in the default mount 
options.

Meanwhile, there's some focus on making btrfs behave better with such 
rewrite-pattern files, but while I think the problem can be made /some/ 
better, hopefully enough that the defaults bother far fewer people in far 
fewer cases, I expect it'll always be a bit of a sore spot because that's 
just how the technology works, and as such, setting NOCOW for such files 
and/or using autodefrag will continue to be recommended for an optimized 
setup.

---
[1] "Properly" set NOCOW:  Btrfs doesn't guarantee the effectiveness of 
setting NOCOW (chattr +C) unless the attribute is set while the file is 
still zero size, effectively, at file creation.  The easiest way to do 
that is to set NOCOW on the subdir that will contain the file, such that 
when the file is created it inherits the NOCOW attribute automatically.
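
As a concrete (hypothetical) session on a btrfs filesystem, following the 
rule above:

```
$ mkdir /var/log/journal.nocow
$ chattr +C /var/log/journal.nocow     # set NOCOW while the directory is empty
$ lsattr -d /var/log/journal.nocow     # the 'C' attribute should now be listed
$ cp -a /var/log/journal/. /var/log/journal.nocow/
                                       # note: files must be newly *created* in
                                       # the directory (copied, not moved) to
                                       # inherit the NOCOW attribute
```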

[2] File sizes upto 128 MiB ... and possibly upto 1 GiB.  Under 128 MiB 
should be fine, over 1 GiB is known to cause issues, between the two is a 
gray area that depends on the speed of the hardware and the incoming 
write-stream.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Slow startup of systemd-journal on BTRFS
  2014-06-14  2:53       ` Duncan
@ 2014-06-14  7:52         ` Goffredo Baroncelli
  2014-06-15  5:43           ` Duncan
  2014-06-15 22:13           ` Lennart Poettering
  2014-06-14 10:59         ` Kai Krakow
  2014-06-15 21:31         ` Martin Steigerwald
  2 siblings, 2 replies; 45+ messages in thread
From: Goffredo Baroncelli @ 2014-06-14  7:52 UTC (permalink / raw)
  To: Duncan, linux-btrfs; +Cc: systemd-devel

On 06/14/2014 04:53 AM, Duncan wrote:
> Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as
> excerpted:
> 
>> On 06/13/2014 01:24 AM, Dave Chinner wrote:
>>> On Thu, Jun 12, 2014 at 12:37:13PM +0000, Duncan wrote:
>>>>
>>>> FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think
>>>> actually pretty much equally bad without NOCOW set on the file.
>>>
>>> So maybe it's been fixed in systemd since the last time I looked.
>>> Yup:
>>>
>>> http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal-
> file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58
>>>
>>> The reason it was changed? To "save a syscall per append", not to
>>> prevent fragmentation of the file, which was the problem everyone was
>>> complaining about...
>>
>> thanks for pointing that. However I am performing my tests on a fedora
>> 20 with systemd-208, which seems have this change
>>>
>>>> Why?  Because btrfs data blocks are 4 KiB.  With COW, the effect for
>>>> either 4 byte or 8 MiB file allocations is going to end up being the
>>>> same, forcing (repeated until full) rewrite of each 4 KiB block into
>>>> its own extent.
>>
>> I am reaching the conclusion that fallocate is not the problem. The
>> fallocate increase the filesize of about 8MB, which is enough for some
>> logging. So it is not called very often.
> 
> But... 
> 
> If a file isn't (properly[1]) set NOCOW (and the btrfs isn't mounted with 
> nodatacow), then an fallocate of 8 MiB will increase the file size by 8 
> MiB and write that out.  So far so good as at that point the 8 MiB should 
> be a single extent.  But then, data gets written into 4 KiB blocks of 
> that 8 MiB one at a time, and because btrfs is COW, the new data in the 
> block must be written to a new location.
> 
> Which effectively means that by the time the 8 MiB is filled, each 4 KiB 
> block has been rewritten to a new location and is now an extent unto 
> itself.  So now that 8 MiB is composed of 2048 new extents, each one a 
> single 4 KiB block in size.

Several people have pointed to fallocate as the problem, but I don't understand why:
1) 8 MB is quite a large value, so fallocate is called (at worst) once during boot - and often never, because the logs are smaller than 8 MB.
2) It is true that with fallocate btrfs "rewrites" almost every 4 KiB page twice. But the first time is one "big" 8 MB write, while the second write would happen in any case. What I mean is that even without the fallocate, journald would still make the same small writes.

To be honest, I struggle to see the gain of using fallocate on a COW filesystem... maybe I don't understand the fallocate() call very well.

> 
> =:^(
> 
> Tho as I already stated, for file sizes upto 128 MiB or so anyway[2], the 
> btrfs autodefrag mount option should at least catch that and rewrite 
> (again), this time sequentially.
> 
>> I have to investigate more what happens when the log are copied from
>> /run to /var/log/journal: this is when journald seems to slow all.
> 
> That's an interesting point.
> 
> At least in theory, during normal operation journald will write to 
> /var/log/journal, but there's a point during boot at which it flushes the 
> information accumulated during boot from the volatile /run location to 
> the non-volatile /var/log location.  /That/ write, at least, should be 
> sequential, since there will be > 4 KiB of journal accumulated that needs 
> to be transferred at once.  However, if it's being handled by the forced 
> pre-write fallocate described above, then that's not going to be the 
> case, as it'll then be a rewrite of already fallocated file blocks and 
> thus will get COWed exactly as I described above.
> 
> =:^(
> 
> 
>> I am prepared a PC which reboot continuously; I am collecting the time
>> required to finish the boot vs the fragmentation of the system.journal
>> file vs the number of boot. The results are dramatic: after 20 reboot,
>> the boot time increase of 20-30 seconds. Doing a defrag of
>> system.journal reduces the boot time to the original one, but after
>> another 20 reboot, the boot time still requires 20-30 seconds more....
>>
>> It is a slow PC, but I saw the same behavior also on a more modern pc
>> (i5 with 8GB).
>>
>> For both PC the HD is a mechanical one...
> 
> The problem's duplicable.  That's the first step toward a fix. =:^)
I hope so.
> 
>>> And that's now a btrfs problem.... :/
>>
>> Are you sure ?
> 
> As they say, "Whoosh!"
> 
[...] 
> Another alternative is that distros will start setting /var/log/journal 
> NOCOW in their setup scripts by default when it's btrfs, thus avoiding 
> the problem.  (Altho if they do automated snapshotting they'll also have 
> to set it as its own subvolume, to avoid the first-write-after-snapshot-
> is-COW problem.)  Well, that, and/or set autodefrag in the default mount 
> options.

Be aware that this also disables checksumming, which is very useful in a RAID configuration.

> 
> Meanwhile, there's some focus on making btrfs behave better with such 
> rewrite-pattern files, but while I think the problem can be made /some/ 
> better, hopefully enough that the defaults bother far fewer people in far 
> fewer cases, I expect it'll always be a bit of a sore spot because that's 
> just how the technology works, and as such, setting NOCOW for such files 
> and/or using autodefrag will continue to be recommended for an optimized 
> setup.
> 
> ---
> [1] "Properly" set NOCOW:  Btrfs doesn't guarantee the effectiveness of 
> setting NOCOW (chattr +C) unless the attribute is set while the file is 
> still zero size, effectively, at file creation.  The easiest way to do 
> that is to set NOCOW on the subdir that will contain the file, such that 
> when the file is created it inherits the NOCOW attribute automatically.
> 
> [2] File sizes upto 128 MiB ... and possibly upto 1 GiB.  Under 128 MiB 
> should be fine, over 1 GiB is known to cause issues, between the two is a 
> gray area that depends on the speed of the hardware and the incoming 
> write-stream.
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it)
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: Slow startup of systemd-journal on BTRFS
  2014-06-14  2:53       ` Duncan
  2014-06-14  7:52         ` Goffredo Baroncelli
@ 2014-06-14 10:59         ` Kai Krakow
  2014-06-15  5:02           ` Duncan
                             ` (2 more replies)
  2014-06-15 21:31         ` Martin Steigerwald
  2 siblings, 3 replies; 45+ messages in thread
From: Kai Krakow @ 2014-06-14 10:59 UTC (permalink / raw)
  To: linux-btrfs; +Cc: systemd-devel

Duncan <1i5t5.duncan@cox.net> schrieb:

> As they say, "Whoosh!"
> 
> At least here, I interpreted that remark as primarily sarcastic
> commentary on the systemd devs' apparent attitude, which can be
> (controversially) summarized as: "Systemd doesn't have problems because
> it's perfect.  Therefore, any problems you have with systemd must instead
> be with other components which systemd depends on."

Come on, sorry, but this is FUD. Really... ;-)

> IOW, it's a btrfs problem now in practice, not because it is so in a
> technical sense, but because systemd defines it as such and is unlikely
> to budge, so the only way to achieve progress is for btrfs to deal with
> it.

I think that systemd is even one of the early supporters of btrfs, because it 
will defragment readahead files on boot from btrfs. I'd suggest the problem 
is to be found in the different semantics of COW filesystems. And if 
someone loudly complains to the systemd developers about how bad they are at 
doing their stuff - hmm, well, I would be disappointed/offended, too, as a 
programmer, because a lot of very well done work has been put into systemd, 
and I'd start ignoring such people. In Germany we have a saying for this: "Wie 
man in den Wald hineinruft, so schallt es heraus." [1] (Roughly: as you shout 
into the forest, so it echoes back.) They are doing many things right that had 
not been brought to modern systems in the last twenty years (or so) of the 
legacy init systems.

So let's start with my journals, on btrfs:

$ sudo filefrag *
system@0004fad12dae7676-98627a3d7df4e35e.journal~: 2 extents found
system@0004fae8ea4b84a4-3a2dc4a93c5f7dc9.journal~: 2 extents found
system@806cd49faa074a49b6cde5ff6fca8adc-000000000008e4cc-0004f82580cdcb45.journal: 
5 extents found
system@806cd49faa074a49b6cde5ff6fca8adc-0000000000097959-0004f89c2e8aff87.journal: 
5 extents found                                                                                                                  
system@806cd49faa074a49b6cde5ff6fca8adc-00000000000a166d-0004f98d7e04157c.journal: 
5 extents found                                                                                                                  
system@806cd49faa074a49b6cde5ff6fca8adc-00000000000aad59-0004fa379b9a1fdf.journal: 
5 extents found
system@ec16f60db38f43619f8337153a1cc024-0000000000000001-0004fae8e5057259.journal: 
5 extents found
system@ec16f60db38f43619f8337153a1cc024-00000000000092b1-0004fb59b1d034ad.journal: 
5 extents found
system.journal: 9 extents found
user-500@e4209c6628ed4a65954678b8011ad73f-0000000000085b7a-0004f77d25ebba04.journal: 
2 extents found
user-500@e4209c6628ed4a65954678b8011ad73f-000000000008e7fb-0004f83c7bf18294.journal: 
2 extents found
user-500@e4209c6628ed4a65954678b8011ad73f-0000000000097fe4-0004f8ae69c198ca.journal: 
2 extents found
user-500@e4209c6628ed4a65954678b8011ad73f-00000000000a1a7e-0004f9966e9c69d8.journal: 
2 extents found
user-500.journal: 2 extents found

I don't think these are too bad values, eh?

Well, how did I accomplish that?

First, I've set the journal directories NOCOW. Of course, systemd should do 
this by default. I'm not sure whether this is a packaging or a systemd code 
issue, tho. But I think the systemd devs agree that on COW filesystems the 
journal directories should be set NOCOW. After all, the journal is a 
transactional database - it does not need COW protection at all costs. And I 
think it has its own checksumming protection, so why should systemd bother 
with that? A lot of other software has the same semantic problems with btrfs, 
too (e.g. MySQL), yet nobody shouts about the "inabilities" of those 
programmers. So why for systemd? Just because it's intrusive by nature, being 
a radically and newly designed init system, and thus requires some learning 
by its users/admins/packagers? Really? Come on... As an admin and/or packager 
you have to keep up with current technologies and developments anyway. It's 
only important to hide the details from the users.

Back to the extent counts: what I did next was implement a defrag job 
that regularly defrags the journal (actually, the complete log directory, as 
other log files suffer from the same problem):

$ cat /usr/local/sbin/defrag-logs.rb 
#!/bin/sh
exec btrfs filesystem defragment -czlib -r /var/log

It can be easily converted into a timer job with systemd. This is left as an 
exercise to the reader.
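
For reference, one way to do that conversion (unit names and the daily 
schedule here are just examples):

```
# /etc/systemd/system/defrag-logs.service
[Unit]
Description=Defragment /var/log

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/defrag-logs.rb

# /etc/systemd/system/defrag-logs.timer
[Timer]
OnCalendar=daily

[Install]
WantedBy=timers.target
```

Then enable it with "systemctl enable defrag-logs.timer" followed by 
"systemctl start defrag-logs.timer".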

BTW: that job isn't actually being executed on my system at the moment, which 
makes the numbers above pretty impressive... However, autodefrag is turned 
on, which may play into the mix; I'm not sure. I stopped automatically 
running those defrag jobs a while ago (I have a few more).
 
> An arguably fairer and more impartial assessment of this particular
> situations suggests that neither btrfs, which as a COW-based filesystem,
> like all COW-based filesystems has the existing-file-rewrite as a major
> technical challenge that it must deal with /somehow/, nor systemd, which
> in choosing to use fallocate is specifically putting itself in that
> existing-file-rewrite class, are entirely at fault.

This challenge affects not only systemd but also a lot of other packages 
which do not play nicely with btrfs semantics. But - as you correctly write - 
you cannot point your finger at just one party. The filesystem and user space 
have to come together to evaluate and fix the problems on both sides.
> But that doesn't matter if one side refuses to budge, because then the
> other side must do so regardless of where the fault was, if there is to
> be any progress at all.

> Meanwhile, I've predicted before and do so here again, that as btrfs
> moves toward mainstream and starts supplanting ext* as the assumed Linux
> default filesystem, some of these problems will simply "go away", because
> at that point, various apps are no longer optimized for the assumed
> default filesystem, and they'll either be patched at some level (distro
> level if not upstream) to work better on the new default filesystem, or
> will be replaced by something that does.  And neither upstream nor distro
> level does that patching, then at some point, people are going to find
> that said distro performs worse than other distros that do that patching.

> Another alternative is that distros will start setting /var/log/journal
> NOCOW in their setup scripts by default when it's btrfs, thus avoiding
> the problem.  (Altho if they do automated snapshotting they'll also have
> to set it as its own subvolume, to avoid the first-write-after-snapshot-
> is-COW problem.)  Well, that, and/or set autodefrag in the default mount
> options.

> Meanwhile, there's some focus on making btrfs behave better with such
> rewrite-pattern files, but while I think the problem can be made /some/
> better, hopefully enough that the defaults bother far fewer people in far
> fewer cases, I expect it'll always be a bit of a sore spot because that's
> just how the technology works, and as such, setting NOCOW for such files
> and/or using autodefrag will continue to be recommended for an optimized
> setup.




* Re: Slow startup of systemd-journal on BTRFS
  2014-06-14 10:59         ` Kai Krakow
@ 2014-06-15  5:02           ` Duncan
  2014-06-15 11:18             ` Kai Krakow
  2014-06-15 21:45           ` Martin Steigerwald
  2014-06-15 22:43           ` [systemd-devel] " Lennart Poettering
  2 siblings, 1 reply; 45+ messages in thread
From: Duncan @ 2014-06-15  5:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: systemd-devel

Kai Krakow posted on Sat, 14 Jun 2014 12:59:31 +0200 as excerpted:

> Duncan <1i5t5.duncan@cox.net> schrieb:
> 
>> As they say, "Whoosh!"
>> 
>> At least here, I interpreted that remark as primarily sarcastic
>> commentary on the systemd devs' apparent attitude, which can be
>> (controversially) summarized as: "Systemd doesn't have problems because
>> it's perfect.  Therefore, any problems you have with systemd must
>> instead be with other components which systemd depends on."
> 
> Come on, sorry, but this is fud. Really... ;-)

I should make clear that I did recently switch to systemd myself -- by 
choice, as I'm on gentoo and it defaults to openrc -- so I'm obviously not 
entirely anti-systemd.  And I _did_ say "(controversially)", which means 
I do recognize that there are two sides to the story.

That said, I've certainly seen what happens when non-systemd devs are on 
the receiving end of things -- including kernel devs, see the recent 
hubbub over the debug kernel commandline option and the somewhat longer 
ago firmware loading issue, among others.

But sarcasm implies neither absolute truth (or it'd simply be speaking 
truth, not sarcasm) nor absolute untruth (because without a kernel of 
truth it'd be merely stupid, not sarcastic).

And certainly that's how I read the comment.

But in any case, I read it as not to be taken literally, which is what 
the person to whom I was directly replying appeared to be doing.  I was 
simply warning him off reading it too literally since, at least from 
here, it seemed more like sarcasm.

Of course if DaveC wishes to clarify one way or another he certainly 
can... tho I'd not expect it at this point since if it is sarcasm as I 
believe, that's kind of having to explain the joke...

> I think that systemd is even one of the early supporters of btrfs
> because it will defragment readahead files on boot from btrfs. I'd
> suggest the problem is to be found in the different semantics with COW
> filesystems.

Which is actually what I was saying.  In reality it's an interaction 
between the nature of COW filesystems, where fallocate tends to be a 
problem, and an application using fallocate because of its benefits on 
overwrite-in-place filesystems, which happen to be the norm at this 
point.  So neither one is to blame, it's simply a bad interaction that 
ultimately needs to be made better on one side or the other.

But btrfs is still relatively young and COW-based filesystems not that 
widespread yet, so that the problem hasn't been worked out to be handled 
automatically, just yet, isn't a big surprise.  Tho I think it'll come.

Meanwhile, as you point out below and as I've repeatedly said in this 
thread already myself, NOCOW and/or autodefrag are tools available to an 
admin faced with the problem, that together actually solve it reasonably 
well.  All an admin has to do is make use of the tools already available. 
=:^)

> So let's start with my journals, on btrfs:
> 
> $ sudo filefrag *
> system@0004fad12dae7676-98627a3d7df4e35e.journal~: 2 extents found
> system@0004fae8ea4b84a4-3a2dc4a93c5f7dc9.journal~: 2 extents found
> system@806cd49faa074a49b6cde5ff6fca8adc-000000000008e4cc-0004f82580cdcb45.journal: 5 extents found
> system@806cd49faa074a49b6cde5ff6fca8adc-0000000000097959-0004f89c2e8aff87.journal: 5 extents found
> system@806cd49faa074a49b6cde5ff6fca8adc-00000000000a166d-0004f98d7e04157c.journal: 5 extents found
> system@806cd49faa074a49b6cde5ff6fca8adc-00000000000aad59-0004fa379b9a1fdf.journal: 5 extents found
> system@ec16f60db38f43619f8337153a1cc024-0000000000000001-0004fae8e5057259.journal: 5 extents found
> system@ec16f60db38f43619f8337153a1cc024-00000000000092b1-0004fb59b1d034ad.journal: 5 extents found
> system.journal: 9 extents found
> user-500@e4209c6628ed4a65954678b8011ad73f-0000000000085b7a-0004f77d25ebba04.journal: 2 extents found
> user-500@e4209c6628ed4a65954678b8011ad73f-000000000008e7fb-0004f83c7bf18294.journal: 2 extents found
> user-500@e4209c6628ed4a65954678b8011ad73f-0000000000097fe4-0004f8ae69c198ca.journal: 2 extents found
> user-500@e4209c6628ed4a65954678b8011ad73f-00000000000a1a7e-0004f9966e9c69d8.journal: 2 extents found
> user-500.journal: 2 extents found
> 
> I don't think these are too bad values, eh?
> 
> Well, how did I accomplish that?
> 
> First, I've set the journal directories nocow. Of course, systemd should
> do this by default. I'm not sure if this is a packaging or systemd code
> issue,
> tho. But I think the systemd devs agree that for COW filesystems, the
> journal directories should be set nocow. After all, the journal is a
> transactional database - it does not need cow protection at all costs.
> And I think they have their own checksumming protection. So, why let
> systemd bother with that? A lot of other software has the same semantic
> problems with btrfs, too (ex. MySQL) where nobody shouts at the
> "inabilities" of the programmers. So why for systemd? Just because it's
> intrusive by its nature for being a radically and newly designed init
> system and thus requires some learning by its users/admins/packagers?
> Really? Come on... As admin and/or packager you have to stay with
> current technologies and developments anyways. It's only important to
> hide the details from the users.
> 
> Back to the extents counts: What I did next was implementing a defrag
> job that regularly defrags the journal (actually, the complete log
> directory as other log files suffer the same problem):
> 
> $ cat /usr/local/sbin/defrag-logs.rb
> #!/bin/sh
> exec btrfs filesystem defragment -czlib -r /var/log
> 
> It can be easily converted into a timer job with systemd. This is left
> as an exercise to the reader.
> 
> BTW: Actually, that job isn't currently executed on my system which
> makes the numbers above pretty impressive... However, autodefrag is
> turned on which may play into the mix. I'm not sure. I stopped
> automatically running those defrag jobs a while ago (I have a few more).

Thanks for the timer hint, BTW.  I actually already created an hourly and 
a daily timer job here (turns out that's all I needed, no weekly/monthly/
whatever needed so I didn't create those) as I switched over to systemd 
and got rid of crond, and I'll definitely keep the defrag-journals timer 
idea up my sleeve in case I decide to set journald back to keeping non-
volatile journals as well, plus as a helpful hint for others.  Tho I 
won't be using it myself currently as the volatile journals only while 
handing off to syslog-ng for the longer term logs is working very well 
for me ATM, a good sysadmin likes to have a set of tricks such as that 
ready in case they're needed, and I'm certainly no exception! =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Slow startup of systemd-journal on BTRFS
  2014-06-14  7:52         ` Goffredo Baroncelli
@ 2014-06-15  5:43           ` Duncan
  2014-06-15 22:39             ` [systemd-devel] " Lennart Poettering
  2014-06-15 22:13           ` Lennart Poettering
  1 sibling, 1 reply; 45+ messages in thread
From: Duncan @ 2014-06-15  5:43 UTC (permalink / raw)
  To: linux-btrfs; +Cc: systemd-devel

Goffredo Baroncelli posted on Sat, 14 Jun 2014 09:52:39 +0200 as
excerpted:

> On 06/14/2014 04:53 AM, Duncan wrote:
>> Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as
>> excerpted:
>> 
>>> thanks for pointing that. However I am performing my tests on a fedora
>>> 20 with systemd-208, which seems have this change
>>>
>>> I am reaching the conclusion that fallocate is not the problem. The
>>> fallocate increase the filesize of about 8MB, which is enough for some
>>> logging. So it is not called very often.

Right.

>> But...

Exactly, _but_...

>> [A]n fallocate of 8 MiB will increase the file size
>> by 8 MiB and write that out.  So far so good as at that point the 8 MiB
>> should be a single extent.  But then, data gets written into 4 KiB
>> blocks of that 8 MiB one at a time, and because btrfs is COW, the new
>> data in the block must be written to a new location.
>> 
>> Which effectively means that by the time the 8 MiB is filled, each 4
>> KiB block has been rewritten to a new location and is now an extent
>> unto itself.  So now that 8 MiB is composed of 2048 new extents, each
>> one a single 4 KiB block in size.
> 
> Several people pointed fallocate as the problem. But I don't understand
> the reason.
> 1) 8MB is a quite huge value, so fallocate is called (at worst) 1 time
> during the boot. Often never because the log are less than 8MB.
> 2) it is true that btrfs "rewrite" almost 2 times each 4kb page with
> fallocate. But the first time is a "big" write of 8MB; instead the
> second write would happen in any case. What I mean is that without the
> fallocate in any case journald would make small write.
> 
> To be honest, I struggle to see the gain of having a fallocate on a COW
> filesystem... maybe I don't understand the fallocate() call very well.

The base problem isn't fallocate per se, tho it is the trigger in this 
case.  The base problem is that for COW-based filesystems, *ANY* 
rewriting of existing file content results in fragmentation.

It just so happens that the only reason there's existing file content to 
be rewritten (as opposed to simply appending) in this case, is because of 
the fallocate.  The rewrite of existing file content is the problem, but 
the existing file content is only there in this case because of the 
fallocate.

Taking a step back...

On a non-COW filesystem, allocating 8 MiB ahead and writing into it 
rewrites into the already allocated location, thus guaranteeing extents 
of 8 MiB each, since once the space is allocated it's simply rewritten in-
place.  Thus, on a non-COW filesystem, pre-allocating in something larger 
than single filesystem blocks when an app knows the data is eventually 
going to be written in to fill that space anyway is a GOOD thing, which 
is why systemd is doing it.

But on a COW-based filesystem fallocate is the exact opposite, a BAD 
thing, because the fallocate forces the file to be allocated at that 
size, effectively filled with nulls/blanks.  Then the actual logging 
comes along and rewrites those nulls/blanks with actual data, but it's 
now a rewrite, and on a COW (copy-on-write) based filesystem the 
rewritten block is copied elsewhere; it does NOT overwrite the existing 
null/blank block, and "elsewhere" by definition means detached from the 
previous blocks, thus in an extent all by itself.

Once the full 2048 original blocks composing that 8 MiB are filled in 
with actual data, because they were rewrites of null/blank blocks that 
fallocate had already forced to be allocated, that's now 2048 separate 
extents, 2048 separate file fragments.  Without the forced fallocate, 
the writes would have all been appends, and there would have been at 
least /some/ chance of some of those 2048 separate blocks being written 
close enough together in time that they would have been written out as a 
single extent.  So while the 8 MiB might not have been a single extent, 
it might have been perhaps 512 or 1024 extents instead of the 2048 it 
ended up being, because fallocate meant that each block was a rewrite 
into an existing file, not a new append-write at the end of an existing 
file.
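The rewrite-after-fallocate pattern described above can be sketched in a
few lines.  This is a hypothetical mock of journald's behaviour, NOT the
actual systemd code; the 8 MiB unit matches the FILE_SIZE_INCREASE value
mentioned earlier in the thread, and the 4 KiB block size is the usual
filesystem block size:

```python
# Hypothetical mock of journald's write pattern, NOT the real code.
import os
import tempfile

CHUNK = 8 * 1024 * 1024   # preallocation unit (FILE_SIZE_INCREASE-sized)
BLOCK = 4096              # one small log append

fd, path = tempfile.mkstemp()
try:
    # Step 1: reserve the space up front, as journald does via fallocate().
    # This alone extends the file to 8 MiB.
    os.posix_fallocate(fd, 0, CHUNK)

    # Step 2: fill it in small pieces.  On an overwrite-in-place
    # filesystem these writes land inside the reserved extent; on btrfs
    # each block is COWed to a new location, so the file can end up with
    # one extent per rewritten block.
    for i in range(4):                     # just a few blocks, for brevity
        os.pwrite(fd, b"x" * BLOCK, i * BLOCK)
        os.fsync(fd)

    size = os.fstat(fd).st_size
finally:
    os.close(fd)
    os.unlink(path)

print(size)   # the fallocate alone set the file size to 8 MiB
```

On an actual btrfs mount, running filefrag on such a file after it fills
up should show the per-block extents; on ext4 or XFS it should stay a
single extent.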

> [...]
>> Another alternative is that distros will start setting /var/log/journal
>> NOCOW in their setup scripts by default when it's btrfs, thus avoiding
>> the problem.  (Altho if they do automated snapshotting they'll also
>> have to set it as its own subvolume, to avoid the
>> first-write-after-snapshot-
>> is-COW problem.)  Well, that, and/or set autodefrag in the default
>> mount options.
> 
> Pay attention, that this remove also the checksum, which are very useful
> in a RAID configuration.

Well, it can be.  But this is only log data, not executable or the like 
data, and (as Kai K points out) journald has its own checksumming method 
in any case.

Besides which, you still haven't explained why you can't either set the 
autodefrag mount option and be done with it, or run a systemd-timer-
triggered or cron-triggered defrag script to defrag them automatically at 
hourly or daily or whatever intervals.  Those don't disable btrfs 
checksumming, but /should/ solve the problem.

(Tho if you're btrfs snapshotting the journals defrag has its own 
implications, but they should be relatively limited in scope compared to 
the fragmentation issues we're dealing with here.)
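For reference, the admin-side workarounds discussed above boil down to a 
couple of commands.  This is an illustrative sketch, not a distro-blessed 
recipe; note that chattr +C only takes effect on newly created files, so 
the directory must be set NOCOW while it is still empty:

```sh
# Option 1: NOCOW on the journal directory (disables data COW *and*
# btrfs data checksums for files created there afterwards).
mkdir -p /var/log/journal
chattr +C /var/log/journal

# Option 2: keep COW and checksums, and defragment on a schedule instead.
btrfs filesystem defragment -r /var/log/journal

# Or filesystem-wide, via the autodefrag mount option in /etc/fstab:
#   UUID=...  /  btrfs  defaults,autodefrag  0 0
```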

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Slow startup of systemd-journal on BTRFS
  2014-06-15  5:02           ` Duncan
@ 2014-06-15 11:18             ` Kai Krakow
  0 siblings, 0 replies; 45+ messages in thread
From: Kai Krakow @ 2014-06-15 11:18 UTC (permalink / raw)
  To: linux-btrfs; +Cc: systemd-devel

Duncan <1i5t5.duncan@cox.net> schrieb:

>> Back to the extents counts: What I did next was implementing a defrag
>> job that regularly defrags the journal (actually, the complete log
>> directory as other log files suffer the same problem):
>> 
>> $ cat /usr/local/sbin/defrag-logs.rb
>> #!/bin/sh
>> exec btrfs filesystem defragment -czlib -r /var/log
>> 
>> It can be easily converted into a timer job with systemd. This is left
>> as an exercise to the reader.
>> 
>> BTW: Actually, that job isn't currently executed on my system which
>> makes the numbers above pretty impressive... However, autodefrag is
>> turned on which may play into the mix. I'm not sure. I stopped
>> automatically running those defrag jobs a while ago (I have a few more).
> 
> Thanks for the timer hint, BTW.  I actually already created an hourly and
> a daily timer job here (turns out that's all I needed, no weekly/monthly/
> whatever needed so I didn't create those) as I switched over to systemd
> and got rid of crond, and I'll definitely keep the defrag-journals timer
> idea up my sleeve in case I decide to set journald back to keeping non-
> volatile journals as well, plus as a helpful hint for others.  Tho I
> won't be using it myself currently as the volatile journals only while
> handing off to syslog-ng for the longer term logs is working very well
> for me ATM, a good sysadmin likes to have a set of tricks such as that
> ready in case they're needed, and I'm certainly no exception! =:^)

I did not yet get rid of cron. The systemd devs once noted that timers are 
not a cron replacement - tho I'm sure this was meant for running user jobs, 
not system jobs. The idea back then was to use systemd user sessions 
spawned by timers, and the devs stated that such usage differs from how 
cron spawns user jobs; one should stick to cron for that, because systemd 
user sessions serve a different purpose.

I already created timer targets for daily, hourly, monthly so I could just 
symlink service units there. What's needed is some sort of systemd generator 
to auto-create services from /etc/cron.{hourly,daily,monthly,weekly} and 
auto-install them in the matching targets which are:

$ for i in $(find -type f -name "timer-*");do echo "# $i:";cat $i;echo;done
# ./timer-weekly.target:
[Unit]
Description=Weekly Timer Target
StopWhenUnneeded=yes

# ./timer-daily.target:
[Unit]
Description=Daily Timer Target
StopWhenUnneeded=yes

# ./timer-hourly.target:
[Unit]
Description=Hourly Timer Target
StopWhenUnneeded=yes

# ./timer-daily.timer:
[Unit]
Description=Daily Timer

[Timer]
OnBootSec=10min
OnUnitActiveSec=1d
Unit=timer-daily.target
AccuracySec=12h

[Install]
WantedBy=basic.target

# ./timer-hourly.timer:
[Unit]
Description=Hourly Timer

[Timer]
OnBootSec=5min
OnUnitActiveSec=1h
Unit=timer-hourly.target
AccuracySec=30min

[Install]
WantedBy=basic.target

# ./timer-weekly.timer:
[Unit]
Description=Weekly Timer

[Timer]
OnBootSec=15min
OnUnitActiveSec=1w
Unit=timer-weekly.target
AccuracySec=12h

[Install]
WantedBy=basic.target


Then it's a matter of creating services which are "WantedBy timer-
weekly.target" or whatever is appropriate and they should execute after 
being installed. Maybe I'm trying to copy what Gentoo is doing for 
/etc/local.d with systemd (there's a generator in the gentoo-systemd-
integration ebuild [1] for files in that directory). However, when going that 
route, cron should no longer be installed or at least be configured to not 
run these jobs on its own. But then, the above mentioned method of running 
user jobs would be needed - and that's not recommended (and I didn't get it 
to work either).
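For what it's worth, hooking a service into those targets is just a 
matter of an [Install] section.  A minimal illustrative unit (the file 
name is hypothetical, and ExecStart reuses the defrag script quoted 
earlier in the thread):

```ini
# /etc/systemd/system/defrag-logs.service  (illustrative name)
[Unit]
Description=Defragment /var/log

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/defrag-logs.rb

[Install]
WantedBy=timer-daily.target
```

After systemctl enable defrag-logs.service, the symlink lands in 
timer-daily.target.wants/, so timer-daily.timer pulls the service in once 
a day when it starts the target.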


[1]: /usr/lib/systemd/system-generators/gentoo-local-generator

-- 
Replies to list only preferred.



* Re: Slow startup of systemd-journal on BTRFS
  2014-06-14  2:53       ` Duncan
  2014-06-14  7:52         ` Goffredo Baroncelli
  2014-06-14 10:59         ` Kai Krakow
@ 2014-06-15 21:31         ` Martin Steigerwald
  2014-06-15 21:37           ` Hugo Mills
  2014-06-17  8:22           ` Duncan
  2 siblings, 2 replies; 45+ messages in thread
From: Martin Steigerwald @ 2014-06-15 21:31 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs, systemd-devel

Am Samstag, 14. Juni 2014, 02:53:20 schrieb Duncan:
> > I am reaching the conclusion that fallocate is not the problem. The
> > fallocate increase the filesize of about 8MB, which is enough for some
> > logging. So it is not called very often.
> 
> But... 
> 
> If a file isn't (properly[1]) set NOCOW (and the btrfs isn't mounted with 
> nodatacow), then an fallocate of 8 MiB will increase the file size by 8 
> MiB and write that out.  So far so good as at that point the 8 MiB should 
> be a single extent.  But then, data gets written into 4 KiB blocks of 
> that 8 MiB one at a time, and because btrfs is COW, the new data in the 
> block must be written to a new location.
> 
> Which effectively means that by the time the 8 MiB is filled, each 4 KiB 
> block has been rewritten to a new location and is now an extent unto 
> itself.  So now that 8 MiB is composed of 2048 new extents, each one a 
> single 4 KiB block in size.

I always thought that the whole point of fallocate is that it *doesn´t* write 
out anything, but just reserves the space. Thus I don´t see how COW can have 
any adverse effect here.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


* Re: Slow startup of systemd-journal on BTRFS
  2014-06-15 21:31         ` Martin Steigerwald
@ 2014-06-15 21:37           ` Hugo Mills
  2014-06-17  8:22           ` Duncan
  1 sibling, 0 replies; 45+ messages in thread
From: Hugo Mills @ 2014-06-15 21:37 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Duncan, linux-btrfs, systemd-devel


On Sun, Jun 15, 2014 at 11:31:07PM +0200, Martin Steigerwald wrote:
> Am Samstag, 14. Juni 2014, 02:53:20 schrieb Duncan:
> > > I am reaching the conclusion that fallocate is not the problem. The
> > > fallocate increase the filesize of about 8MB, which is enough for some
> > > logging. So it is not called very often.
> > 
> > But... 
> > 
> > If a file isn't (properly[1]) set NOCOW (and the btrfs isn't mounted with 
> > nodatacow), then an fallocate of 8 MiB will increase the file size by 8 
> > MiB and write that out.  So far so good as at that point the 8 MiB should 
> > be a single extent.  But then, data gets written into 4 KiB blocks of 
> > that 8 MiB one at a time, and because btrfs is COW, the new data in the 
> > block must be written to a new location.
> > 
> > Which effectively means that by the time the 8 MiB is filled, each 4 KiB 
> > block has been rewritten to a new location and is now an extent unto 
> > itself.  So now that 8 MiB is composed of 2048 new extents, each one a 
> > single 4 KiB block in size.
> 
> I always thought that the whole point of fallocate is that it *doesn´t* write 
> out anything, but just reserves the space. Thus I don´t see how COW can have 
> any adverse effect here.

   Exactly. fallocate, as I understand it, says, "I'm going to write
[this much] data at some point soon; you may want to allocate that
space in a contiguous manner right now to make the process more
efficient". The space is not a formal part of the file data and so
doesn't need a CoW operation when it's first written to.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
     --- If the first-ever performance is the première,  is the ---     
                  last-ever performance the derrière?                   



* Re: Slow startup of systemd-journal on BTRFS
  2014-06-14 10:59         ` Kai Krakow
  2014-06-15  5:02           ` Duncan
@ 2014-06-15 21:45           ` Martin Steigerwald
  2014-06-15 21:51             ` Hugo Mills
  2014-06-15 22:43           ` [systemd-devel] " Lennart Poettering
  2 siblings, 1 reply; 45+ messages in thread
From: Martin Steigerwald @ 2014-06-15 21:45 UTC (permalink / raw)
  To: Kai Krakow; +Cc: linux-btrfs, systemd-devel

Am Samstag, 14. Juni 2014, 12:59:31 schrieb Kai Krakow:
> Well, how did I accomplish that?

Setting nocow and defragmenting regularly?

Quite a complex setup for a casual Linux user.

Any solution should be automatic. I´d suggest a combination of sane
application behaviour and measures within the filesystem.

> First, I've set the journal directories nocow. Of course, systemd should do 
> this by default. I'm not sure if this is a packaging or systemd code issue,
> tho. But I think the systemd devs agree that for COW filesystems, the
> journal directories should be set nocow. After all, the journal is a
> transactional database - it does not need cow protection at all costs. And
> I think they have their own checksumming protection. So, why let systemd
> bother with that? A lot of other software has the same semantic problems
> with btrfs, too (ex. MySQL) where nobody shouts at the "inabilities" of the
> programmers. So why for systemd? Just because it's intrusive by its nature
> for being a radically and newly designed init system and thus requires some
> learning by its users/admins/packagers? Really? Come on... As admin and/or
> packager you have to stay with current technologies and developments
> anyways. It's only important to hide the details from the users.

But nocow also disables the possibility to snapshot these files, AFAIK. Akonadi
does this for its database directory as well on new installs. But I am not
completely happy with that approach.

And well I would say that the following differing results are caused by
specific application behavior – my bet would be rsyslog sequentially
appending to the file, while systemd uses the journal files more like a
database, and databases fragment heavily on BTRFS due to COW:


merkaba:/var/log/journal/1354039e4d4bb8de4f97ac8400000004> filefrag *
system@0004e2025580f3c5-c625739d3033b738.journal~: 1 extent found
system@0004f94fbee5f4d1-1cfb9bed12a79bde.journal~: 2771 extents found
system@0004f98240efba02-b465b39a7ed0bdbe.journal~: 2534 extents found
system@0004f9ff739ad927-4650a2ca62bf9378.journal~: 4951 extents found
system@0004fa17feb65603-a7597828f9823e38.journal~: 1244 extents found
system@0004fa3f0e96b653-d6f9d5795c9ef869.journal~: 1419 extents found
system@0004fa4c448b8a95-c95a3b7950fd704d.journal~: 1511 extents found
system@0004fa5dddb554e0-33c319ebb5f8100f.journal~: 1729 extents found
system@0004fad81852c750-20a36082c6006c8a.journal~: 10257 extents found
system@0004fad970b56567-44bf4a94314792fc.journal~: 932 extents found
system@0004fb128307b981-7f2104da8b2c9fb2.journal~: 6757 extents found
system@0004fb1c3eef86c3-8adbea51a1c98729.journal~: 1498 extents found
system@0004fb1c419fb301-7303f7fd9165ed26.journal~: 19 extents found
system@0004fb44feafbafd-b7433e90b1d3d718.journal~: 2265 extents found
system@0004fb6ddf63e4d3-b40e8f4701670bff.journal~: 1894 extents found
system@0004fb6e412c3f7e-3890be9c2119a7bb.journal~: 1038 extents found
system@54803afb1b1d42b387822c56e61bc168-000000000002b364-0004e1c4aabbeefa.journal: 2 extents found
system@54803afb1b1d42b387822c56e61bc168-000000000002b365-0004e1c4ab5b0246.journal: 1 extent found
system@bd18e6867b824ba7b572de31531218a9-0000000000000001-0004e202555a0493.journal: 3232 extents found
system.journal: 3855 extents found
user-1000@0004e202588086cf-3d4130a580b63101.journal~: 1 extent found
user-1000@0004f9ff74415375-6b5dd1c3d76b09ce.journal~: 1046 extents found
user-1000@0004fa17ff12ef0c-34297fcc8c06dd4b.journal~: 96 extents found
user-1000@0004fa3f0eee8e41-dfb1f54bd31e4967.journal~: 84 extents found
user-1000@0004fa4c475a0a63-d8badb620094bce8.journal~: 173 extents found
user-1000@0004fa5de0c650e0-522069bac82c754e.journal~: 319 extents found
user-1000@0004fad818a7a980-593b8f3971b2e697.journal~: 2465 extents found
user-1000@0004fad97160c3ad-552b27f891e7a24e.journal~: 106 extents found
user-1000@0004fb1283616e7e-1fbca0bef31bd92b.journal~: 283 extents found
user-1000@0004fb6de033b269-018b4cbc2b1f319b.journal~: 874 extents found
user-1000@bd18e6867b824ba7b572de31531218a9-00000000000007c7-0004e2025881a1ee.journal: 293 extents found
user-1000.journal: 4663 extents found
user-120.journal: 5 extents found
user-2012@0004fa4c02142cad-f97563ed0105bfb3.journal~: 749 extents found
user-2012@0004fa4d8255b2df-43248028d422ca78.journal~: 29 extents found
user-2012@0004fa5de40372db-d1f3c6428ddeec22.journal~: 122 extents found
user-2012@0004fad81b8cd9d8-ed2861a9fa1b163c.journal~: 575 extents found
user-2012@0004fad980139d4e-94ad07f4a8fae3cc.journal~: 25 extents found
user-2012@0004fb160f2d5334-99462eb429f4cb7b.journal~: 416 extents found
user-2012@54803afb1b1d42b387822c56e61bc168-0000000000011c75-0004ddb2be06d876.journal: 2 extents found
user-2012.journal: 453 extents found
user-65534@0004fa4c62bf4a71-6b4c53dfc06dd588.journal~: 46 extents found
user-65534.journal: 91 extents found
merkaba:/var/log/journal/1354039e4d4bb8de4f97ac8400000004> cd ..
merkaba:/var/log/journal> cd ..




merkaba:/var/log/journal/1354039e4d4bb8de4f97ac8400000004> ls -lh
insgesamt 495M
-rw-r-----+ 1 root root            6,8M Jul 21  2013 system@0004e2025580f3c5-c625739d3033b738.journal~
-rw-r-----+ 1 root systemd-journal  14M Mai 14 00:40 system@0004f94fbee5f4d1-1cfb9bed12a79bde.journal~
-rw-r-----+ 1 root systemd-journal  14M Mai 16 12:55 system@0004f98240efba02-b465b39a7ed0bdbe.journal~
-rw-r-----+ 1 root systemd-journal  26M Mai 22 18:17 system@0004f9ff739ad927-4650a2ca62bf9378.journal~
-rw-r-----+ 1 root systemd-journal 6,3M Mai 23 23:34 system@0004fa17feb65603-a7597828f9823e38.journal~
-rw-r-----+ 1 root systemd-journal 6,6M Mai 25 22:10 system@0004fa3f0e96b653-d6f9d5795c9ef869.journal~
-rw-r-----+ 1 root systemd-journal 7,7M Mai 26 13:56 system@0004fa4c448b8a95-c95a3b7950fd704d.journal~
-rw-r-----+ 1 root systemd-journal 9,7M Mai 27 10:56 system@0004fa5dddb554e0-33c319ebb5f8100f.journal~
-rw-r-----+ 1 root systemd-journal  50M Jun  2 12:45 system@0004fad81852c750-20a36082c6006c8a.journal~
-rw-r-----+ 1 root systemd-journal 5,2M Jun  2 14:21 system@0004fad970b56567-44bf4a94314792fc.journal~
-rw-r-----+ 1 root systemd-journal  34M Jun  5 10:27 system@0004fb128307b981-7f2104da8b2c9fb2.journal~
-rw-r-----+ 1 root systemd-journal 8,4M Jun  5 22:03 system@0004fb1c3eef86c3-8adbea51a1c98729.journal~
-rw-r-----+ 1 root systemd-journal 3,7M Jun  5 22:04 system@0004fb1c419fb301-7303f7fd9165ed26.journal~
-rw-r-----+ 1 root systemd-journal  12M Jun  7 22:40 system@0004fb44feafbafd-b7433e90b1d3d718.journal~
-rw-r-----+ 1 root systemd-journal  11M Jun  9 23:27 system@0004fb6ddf63e4d3-b40e8f4701670bff.journal~
-rw-r-----+ 1 root systemd-journal 6,3M Jun  9 23:54 system@0004fb6e412c3f7e-3890be9c2119a7bb.journal~
-rw-r-----+ 1 root adm             128M Jul 18  2013 system@54803afb1b1d42b387822c56e61bc168-000000000002b364-0004e1c4aabbeefa.journal
-rw-r-----+ 1 root root            7,4M Jul 20  2013 system@54803afb1b1d42b387822c56e61bc168-000000000002b365-0004e1c4ab5b0246.journal
-rw-r-----+ 1 root systemd-journal  23M Mai 11 10:21 system@bd18e6867b824ba7b572de31531218a9-0000000000000001-0004e202555a0493.journal
-rw-r-----+ 1 root systemd-journal  19M Jun 15 23:37 system.journal
-rw-r-----+ 1 root root            3,6M Jul 21  2013 user-1000@0004e202588086cf-3d4130a580b63101.journal~
-rw-r-----+ 1 root systemd-journal 4,8M Mai 22 18:17 user-1000@0004f9ff74415375-6b5dd1c3d76b09ce.journal~
-rw-r-----+ 1 root systemd-journal 3,6M Mai 23 23:34 user-1000@0004fa17ff12ef0c-34297fcc8c06dd4b.journal~
-rw-r-----+ 1 root systemd-journal 3,6M Mai 25 22:10 user-1000@0004fa3f0eee8e41-dfb1f54bd31e4967.journal~
-rw-r-----+ 1 root systemd-journal 3,7M Mai 26 13:57 user-1000@0004fa4c475a0a63-d8badb620094bce8.journal~
-rw-r-----+ 1 root systemd-journal 3,7M Mai 27 10:56 user-1000@0004fa5de0c650e0-522069bac82c754e.journal~
-rw-r-----+ 1 root systemd-journal  15M Jun  2 12:45 user-1000@0004fad818a7a980-593b8f3971b2e697.journal~
-rw-r-----+ 1 root systemd-journal 3,6M Jun  2 14:22 user-1000@0004fad97160c3ad-552b27f891e7a24e.journal~
-rw-r-----+ 1 root systemd-journal 3,7M Jun  5 10:27 user-1000@0004fb1283616e7e-1fbca0bef31bd92b.journal~
-rw-r-----+ 1 root systemd-journal 5,2M Jun  9 23:27 user-1000@0004fb6de033b269-018b4cbc2b1f319b.journal~
-rw-r-----+ 1 root systemd-journal 3,8M Mai 11 10:21 user-1000@bd18e6867b824ba7b572de31531218a9-00000000000007c7-0004e2025881a1ee.journal
-rw-r-----+ 1 root systemd-journal  35M Jun 15 23:38 user-1000.journal
-rw-r-----+ 1 root systemd-journal 3,6M Apr 28 09:52 user-120.journal
-rw-r-----+ 1 root systemd-journal 4,0M Mai 26 13:37 user-2012@0004fa4c02142cad-f97563ed0105bfb3.journal~
-rw-r-----+ 1 root systemd-journal 3,6M Mai 26 15:25 user-2012@0004fa4d8255b2df-43248028d422ca78.journal~
-rw-r-----+ 1 root systemd-journal 3,7M Mai 27 10:57 user-2012@0004fa5de40372db-d1f3c6428ddeec22.journal~
-rw-r-----+ 1 root systemd-journal 4,3M Jun  2 12:46 user-2012@0004fad81b8cd9d8-ed2861a9fa1b163c.journal~
-rw-r-----+ 1 root systemd-journal 3,6M Jun  2 14:26 user-2012@0004fad980139d4e-94ad07f4a8fae3cc.journal~
-rw-r-----+ 1 root systemd-journal 3,8M Jun  5 14:41 user-2012@0004fb160f2d5334-99462eb429f4cb7b.journal~
-rw-r-----+ 1 root adm             200K Mai 11 10:21 user-2012@54803afb1b1d42b387822c56e61bc168-0000000000011c75-0004ddb2be06d876.journal
-rw-r-----+ 1 root systemd-journal 3,8M Jun 11 21:14 user-2012.journal
-rw-r-----+ 1 root systemd-journal 3,6M Mai 26 14:04 user-65534@0004fa4c62bf4a71-6b4c53dfc06dd588.journal~
-rw-r-----+ 1 root systemd-journal 3,7M Jun  9 23:26 user-65534.journal






merkaba:/var/log> filefrag syslog*
syslog: 361 extents found
syslog.1: 202 extents found
syslog.2.gz: 1 extent found
[well sure, cause repacked]
syslog.3.gz: 1 extent found
syslog.4.gz: 1 extent found
syslog.5.gz: 1 extent found
syslog.6.gz: 1 extent found

merkaba:/var/log> ls -lh syslog*
-rw-r----- 1 root adm 4,2M Jun 15 23:39 syslog
-rw-r----- 1 root adm 2,1M Jun 11 16:07 syslog.1



So we have ten times as many extents on some systemd journal files as on
the rsyslog ones.


This is BTRFS RAID 1 on SSD with compress=lzo, so the 361 extents of
syslog may be due to the size limit of extents on compressed BTRFS
filesystems.

Anyway, since it is flash, I never bothered about the fragmentation.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Slow startup of systemd-journal on BTRFS
  2014-06-15 21:45           ` Martin Steigerwald
@ 2014-06-15 21:51             ` Hugo Mills
  0 siblings, 0 replies; 45+ messages in thread
From: Hugo Mills @ 2014-06-15 21:51 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Kai Krakow, linux-btrfs, systemd-devel

[-- Attachment #1: Type: text/plain, Size: 12016 bytes --]

On Sun, Jun 15, 2014 at 11:45:37PM +0200, Martin Steigerwald wrote:
> Am Samstag, 14. Juni 2014, 12:59:31 schrieb Kai Krakow:
> > Well, how did I accomplish that?
> 
> Setting no cow and defragmenting regularly?
> 
> Quite a complex setup for a casual Linux user.
> 
> Any solution should be automatic. I'd suggest a combination of sane
> application behaviour and measures within the filesystem.
> 
> > First, I've set the journal directories nocow. Of course, systemd should do 
> > this by default. I'm not sure if this is a packaging or systemd code issue,
> > though. But I think the systemd devs agree that for CoW filesystems, the
> > journal directories should be set nocow. After all, the journal is a
> > transactional database - it does not need cow protection at all costs. And
> > I think they have their own checksumming protection. So, why let systemd
> > bother with that? A lot of other software has the same semantic problems
> > with btrfs, too (e.g. MySQL), where nobody shouts at the "inabilities" of the
> > programmers. So why for systemd? Just because it's intrusive by its nature
> > for being a radically and newly designed init system and thus requires some
> > learning by its users/admins/packagers? Really? Come on... As admin and/or
> > packager you have to stay with current technologies and developments
> > anyways. It's only important to hide the details from the users.
> 
> But nocow also disables the possibility to snapshot these files, AFAIK.

   No, it's the other way around: snapshots break the nodatacow
behaviour. With nodatacow set, taking a snapshot will allow exactly
one CoW operation on the file data (preserving the original extent for
the snapshot to use), and then revert to nodatacow on the
newly-written extent. So fragmentation will still occur, after every
snapshot. Ideally, we'd use proper CoW rather than the RoW (redirect
on write) that we actually use, so that the _original_ is maintained
without fragmentation, and the copy fragments, but the performance for
that sucks (it uses twice the write bandwidth).

   Hugo.

> Akonadi
> does this for its database directory as well for new install. But I am not
> completely happy with that approach.
> 
> And well I would say that the following differing results are caused by
> specific application behavior – my bet would be rsyslog sequentially
> appending to the file while systemd uses the journal files more like a
> database, and databases fragment heavily on BTRFS due to COW:
> 
> 
> merkaba:/var/log/journal/1354039e4d4bb8de4f97ac8400000004> filefrag *
> system@0004e2025580f3c5-c625739d3033b738.journal~: 1 extent found
> system@0004f94fbee5f4d1-1cfb9bed12a79bde.journal~: 2771 extents found
> system@0004f98240efba02-b465b39a7ed0bdbe.journal~: 2534 extents found
> system@0004f9ff739ad927-4650a2ca62bf9378.journal~: 4951 extents found
> system@0004fa17feb65603-a7597828f9823e38.journal~: 1244 extents found
> system@0004fa3f0e96b653-d6f9d5795c9ef869.journal~: 1419 extents found
> system@0004fa4c448b8a95-c95a3b7950fd704d.journal~: 1511 extents found
> system@0004fa5dddb554e0-33c319ebb5f8100f.journal~: 1729 extents found
> system@0004fad81852c750-20a36082c6006c8a.journal~: 10257 extents found
> system@0004fad970b56567-44bf4a94314792fc.journal~: 932 extents found
> system@0004fb128307b981-7f2104da8b2c9fb2.journal~: 6757 extents found
> system@0004fb1c3eef86c3-8adbea51a1c98729.journal~: 1498 extents found
> system@0004fb1c419fb301-7303f7fd9165ed26.journal~: 19 extents found
> system@0004fb44feafbafd-b7433e90b1d3d718.journal~: 2265 extents found
> system@0004fb6ddf63e4d3-b40e8f4701670bff.journal~: 1894 extents found
> system@0004fb6e412c3f7e-3890be9c2119a7bb.journal~: 1038 extents found
> system@54803afb1b1d42b387822c56e61bc168-000000000002b364-0004e1c4aabbeefa.journal: 2 extents found
> system@54803afb1b1d42b387822c56e61bc168-000000000002b365-0004e1c4ab5b0246.journal: 1 extent found
> system@bd18e6867b824ba7b572de31531218a9-0000000000000001-0004e202555a0493.journal: 3232 extents found
> system.journal: 3855 extents found
> user-1000@0004e202588086cf-3d4130a580b63101.journal~: 1 extent found
> user-1000@0004f9ff74415375-6b5dd1c3d76b09ce.journal~: 1046 extents found
> user-1000@0004fa17ff12ef0c-34297fcc8c06dd4b.journal~: 96 extents found
> user-1000@0004fa3f0eee8e41-dfb1f54bd31e4967.journal~: 84 extents found
> user-1000@0004fa4c475a0a63-d8badb620094bce8.journal~: 173 extents found
> user-1000@0004fa5de0c650e0-522069bac82c754e.journal~: 319 extents found
> user-1000@0004fad818a7a980-593b8f3971b2e697.journal~: 2465 extents found
> user-1000@0004fad97160c3ad-552b27f891e7a24e.journal~: 106 extents found
> user-1000@0004fb1283616e7e-1fbca0bef31bd92b.journal~: 283 extents found
> user-1000@0004fb6de033b269-018b4cbc2b1f319b.journal~: 874 extents found
> user-1000@bd18e6867b824ba7b572de31531218a9-00000000000007c7-0004e2025881a1ee.journal: 293 extents found
> user-1000.journal: 4663 extents found
> user-120.journal: 5 extents found
> user-2012@0004fa4c02142cad-f97563ed0105bfb3.journal~: 749 extents found
> user-2012@0004fa4d8255b2df-43248028d422ca78.journal~: 29 extents found
> user-2012@0004fa5de40372db-d1f3c6428ddeec22.journal~: 122 extents found
> user-2012@0004fad81b8cd9d8-ed2861a9fa1b163c.journal~: 575 extents found
> user-2012@0004fad980139d4e-94ad07f4a8fae3cc.journal~: 25 extents found
> user-2012@0004fb160f2d5334-99462eb429f4cb7b.journal~: 416 extents found
> user-2012@54803afb1b1d42b387822c56e61bc168-0000000000011c75-0004ddb2be06d876.journal: 2 extents found
> user-2012.journal: 453 extents found
> user-65534@0004fa4c62bf4a71-6b4c53dfc06dd588.journal~: 46 extents found
> user-65534.journal: 91 extents found
> merkaba:/var/log/journal/1354039e4d4bb8de4f97ac8400000004> cd ..
> merkaba:/var/log/journal> cd ..
> 
> 
> 
> 
> merkaba:/var/log/journal/1354039e4d4bb8de4f97ac8400000004> ls -lh
> total 495M
> -rw-r-----+ 1 root root            6,8M Jul 21  2013 system@0004e2025580f3c5-c625739d3033b738.journal~
> -rw-r-----+ 1 root systemd-journal  14M Mai 14 00:40 system@0004f94fbee5f4d1-1cfb9bed12a79bde.journal~
> -rw-r-----+ 1 root systemd-journal  14M Mai 16 12:55 system@0004f98240efba02-b465b39a7ed0bdbe.journal~
> -rw-r-----+ 1 root systemd-journal  26M Mai 22 18:17 system@0004f9ff739ad927-4650a2ca62bf9378.journal~
> -rw-r-----+ 1 root systemd-journal 6,3M Mai 23 23:34 system@0004fa17feb65603-a7597828f9823e38.journal~
> -rw-r-----+ 1 root systemd-journal 6,6M Mai 25 22:10 system@0004fa3f0e96b653-d6f9d5795c9ef869.journal~
> -rw-r-----+ 1 root systemd-journal 7,7M Mai 26 13:56 system@0004fa4c448b8a95-c95a3b7950fd704d.journal~
> -rw-r-----+ 1 root systemd-journal 9,7M Mai 27 10:56 system@0004fa5dddb554e0-33c319ebb5f8100f.journal~
> -rw-r-----+ 1 root systemd-journal  50M Jun  2 12:45 system@0004fad81852c750-20a36082c6006c8a.journal~
> -rw-r-----+ 1 root systemd-journal 5,2M Jun  2 14:21 system@0004fad970b56567-44bf4a94314792fc.journal~
> -rw-r-----+ 1 root systemd-journal  34M Jun  5 10:27 system@0004fb128307b981-7f2104da8b2c9fb2.journal~
> -rw-r-----+ 1 root systemd-journal 8,4M Jun  5 22:03 system@0004fb1c3eef86c3-8adbea51a1c98729.journal~
> -rw-r-----+ 1 root systemd-journal 3,7M Jun  5 22:04 system@0004fb1c419fb301-7303f7fd9165ed26.journal~
> -rw-r-----+ 1 root systemd-journal  12M Jun  7 22:40 system@0004fb44feafbafd-b7433e90b1d3d718.journal~
> -rw-r-----+ 1 root systemd-journal  11M Jun  9 23:27 system@0004fb6ddf63e4d3-b40e8f4701670bff.journal~
> -rw-r-----+ 1 root systemd-journal 6,3M Jun  9 23:54 system@0004fb6e412c3f7e-3890be9c2119a7bb.journal~
> -rw-r-----+ 1 root adm             128M Jul 18  2013 system@54803afb1b1d42b387822c56e61bc168-000000000002b364-0004e1c4aabbeefa.journal
> -rw-r-----+ 1 root root            7,4M Jul 20  2013 system@54803afb1b1d42b387822c56e61bc168-000000000002b365-0004e1c4ab5b0246.journal
> -rw-r-----+ 1 root systemd-journal  23M Mai 11 10:21 system@bd18e6867b824ba7b572de31531218a9-0000000000000001-0004e202555a0493.journal
> -rw-r-----+ 1 root systemd-journal  19M Jun 15 23:37 system.journal
> -rw-r-----+ 1 root root            3,6M Jul 21  2013 user-1000@0004e202588086cf-3d4130a580b63101.journal~
> -rw-r-----+ 1 root systemd-journal 4,8M Mai 22 18:17 user-1000@0004f9ff74415375-6b5dd1c3d76b09ce.journal~
> -rw-r-----+ 1 root systemd-journal 3,6M Mai 23 23:34 user-1000@0004fa17ff12ef0c-34297fcc8c06dd4b.journal~
> -rw-r-----+ 1 root systemd-journal 3,6M Mai 25 22:10 user-1000@0004fa3f0eee8e41-dfb1f54bd31e4967.journal~
> -rw-r-----+ 1 root systemd-journal 3,7M Mai 26 13:57 user-1000@0004fa4c475a0a63-d8badb620094bce8.journal~
> -rw-r-----+ 1 root systemd-journal 3,7M Mai 27 10:56 user-1000@0004fa5de0c650e0-522069bac82c754e.journal~
> -rw-r-----+ 1 root systemd-journal  15M Jun  2 12:45 user-1000@0004fad818a7a980-593b8f3971b2e697.journal~
> -rw-r-----+ 1 root systemd-journal 3,6M Jun  2 14:22 user-1000@0004fad97160c3ad-552b27f891e7a24e.journal~
> -rw-r-----+ 1 root systemd-journal 3,7M Jun  5 10:27 user-1000@0004fb1283616e7e-1fbca0bef31bd92b.journal~
> -rw-r-----+ 1 root systemd-journal 5,2M Jun  9 23:27 user-1000@0004fb6de033b269-018b4cbc2b1f319b.journal~
> -rw-r-----+ 1 root systemd-journal 3,8M Mai 11 10:21 user-1000@bd18e6867b824ba7b572de31531218a9-00000000000007c7-0004e2025881a1ee.journal
> -rw-r-----+ 1 root systemd-journal  35M Jun 15 23:38 user-1000.journal
> -rw-r-----+ 1 root systemd-journal 3,6M Apr 28 09:52 user-120.journal
> -rw-r-----+ 1 root systemd-journal 4,0M Mai 26 13:37 user-2012@0004fa4c02142cad-f97563ed0105bfb3.journal~
> -rw-r-----+ 1 root systemd-journal 3,6M Mai 26 15:25 user-2012@0004fa4d8255b2df-43248028d422ca78.journal~
> -rw-r-----+ 1 root systemd-journal 3,7M Mai 27 10:57 user-2012@0004fa5de40372db-d1f3c6428ddeec22.journal~
> -rw-r-----+ 1 root systemd-journal 4,3M Jun  2 12:46 user-2012@0004fad81b8cd9d8-ed2861a9fa1b163c.journal~
> -rw-r-----+ 1 root systemd-journal 3,6M Jun  2 14:26 user-2012@0004fad980139d4e-94ad07f4a8fae3cc.journal~
> -rw-r-----+ 1 root systemd-journal 3,8M Jun  5 14:41 user-2012@0004fb160f2d5334-99462eb429f4cb7b.journal~
> -rw-r-----+ 1 root adm             200K Mai 11 10:21 user-2012@54803afb1b1d42b387822c56e61bc168-0000000000011c75-0004ddb2be06d876.journal
> -rw-r-----+ 1 root systemd-journal 3,8M Jun 11 21:14 user-2012.journal
> -rw-r-----+ 1 root systemd-journal 3,6M Mai 26 14:04 user-65534@0004fa4c62bf4a71-6b4c53dfc06dd588.journal~
> -rw-r-----+ 1 root systemd-journal 3,7M Jun  9 23:26 user-65534.journal
> 
> 
> 
> 
> 
> 
> merkaba:/var/log> filefrag syslog*
> syslog: 361 extents found
> syslog.1: 202 extents found
> syslog.2.gz: 1 extent found
> [well sure, cause repacked]
> syslog.3.gz: 1 extent found
> syslog.4.gz: 1 extent found
> syslog.5.gz: 1 extent found
> syslog.6.gz: 1 extent found
> 
> merkaba:/var/log> ls -lh syslog*
> -rw-r----- 1 root adm 4,2M Jun 15 23:39 syslog
> -rw-r----- 1 root adm 2,1M Jun 11 16:07 syslog.1
> 
> 
> 
> So we have ten times the extents on some systemd journal files than on
> rsyslog.
> 
> 
> With BTRFS RAID 1 on SSD with compress=lzo, so the 361 extents of syslog
> may be due to the size limit of extents on compressed BTRFS filesystems.
> 
> Anyway, since it is flash, I never bothered about the fragmentation.
> 
> Ciao,
> -- 
> Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
> GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
     --- If the first-ever performance is the première,  is the ---     
                  last-ever performance the derrière?                   

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]


* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-14  7:52         ` Goffredo Baroncelli
  2014-06-15  5:43           ` Duncan
@ 2014-06-15 22:13           ` Lennart Poettering
  2014-06-16  0:17             ` Russell Coker
                               ` (2 more replies)
  1 sibling, 3 replies; 45+ messages in thread
From: Lennart Poettering @ 2014-06-15 22:13 UTC (permalink / raw)
  To: kreijack; +Cc: Duncan, linux-btrfs, systemd-devel

On Sat, 14.06.14 09:52, Goffredo Baroncelli (kreijack@libero.it) wrote:

> > Which effectively means that by the time the 8 MiB is filled, each 4 KiB 
> > block has been rewritten to a new location and is now an extent unto 
> > itself.  So now that 8 MiB is composed of 2048 new extents, each one a 
> > single 4 KiB block in size.
> 
> Several people pointed fallocate as the problem. But I don't
> understand the reason.

BTW, the reason we use fallocate() in journald is not about trying to
optimize anything. It's only used for one reason: to avoid SIGBUS on
disk/quota full, since we actually write everything to the files using
mmap(). I mean, writing things with mmap() is always problematic, and
handling write errors is awfully difficult, but at least two of the most
common reasons for failure we'd like to protect against in advance, under
the assumption that disk/quota full will be reported immediately by the
fallocate(), and the mmap writes later on will then necessarily succeed.

I am not really following why this trips up btrfs, though. I am
not sure I understand why this breaks btrfs COW behaviour. I mean,
fallocate() isn't necessarily supposed to write anything really, it's
mostly about allocating disk space in advance. I would claim that
journald's usage of it is very much within the entire reason why it
exists...

Anyway, happy to change these things around if necessary, but first I'd
like to have a very good explanation why fallocate() wouldn't be the
right thing to invoke here, and a suggestion what we should do instead
to cover this usecase...

Lennart

-- 
Lennart Poettering, Red Hat


* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-15  5:43           ` Duncan
@ 2014-06-15 22:39             ` Lennart Poettering
  0 siblings, 0 replies; 45+ messages in thread
From: Lennart Poettering @ 2014-06-15 22:39 UTC (permalink / raw)
  To: Duncan; +Cc: systemd-devel, linux-btrfs

On Sun, 15.06.14 05:43, Duncan (1i5t5.duncan@cox.net) wrote:

> The base problem isn't fallocate per se, though it's the trigger in 
> this case.  The base problem is that for COW-based filesystems, *ANY* 
> rewriting of existing file content results in fragmentation.
> 
> It just so happens that the only reason there's existing file content to 
> be rewritten (as opposed to simply appending) in this case, is because of 
> the fallocate.  The rewrite of existing file content is the problem, but 
> the existing file content is only there in this case because of the 
> fallocate.
> 
> Taking a step back...
> 
> On a non-COW filesystem, allocating 8 MiB ahead and writing into it 
> rewrites into the already allocated location, thus guaranteeing extents 
> of 8 MiB each, since once the space is allocated it's simply rewritten in-
> place.  Thus, on a non-COW filesystem, pre-allocating in something larger 
> than single filesystem blocks when an app knows the data is eventually 
> going to be written in to fill that space anyway is a GOOD thing, which 
> is why systemd is doing it.

Nope, that's not why we do it. We do it to avoid SIGBUS on disk full...

> But on a COW-based filesystem fallocate is the exact opposite, a BAD 
> thing, because an fallocate forces the file to be written out at that 
> size, effectively filled with nulls/blanks.  Then the actual logging 
> comes along and rewrites those nulls/blanks with actual data, but it's 
> now a rewrite, which on a COW (copy-on-write) based filesystem means the 
> rewritten block is copied elsewhere; it does NOT overwrite the existing 
> null/blank block, and "elsewhere" by definition means detached from the 
> previous blocks, thus in an extent all by itself.

Well, quite frankly I am not entirely sure why fallocate() would be of
any use like that on COW file systems, if this is really how it is
implemented... I mean, as I understood fallocate() -- and as the man
page suggests -- it is something for reserving space on disk, not for
writing out anything. This is why journald is invoking it, to reserve
the space, so that later write accesses to it will not require any
reservation anymore, and hence are unlikely to fail.

Lennart

-- 
Lennart Poettering, Red Hat


* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-14 10:59         ` Kai Krakow
  2014-06-15  5:02           ` Duncan
  2014-06-15 21:45           ` Martin Steigerwald
@ 2014-06-15 22:43           ` Lennart Poettering
  2 siblings, 0 replies; 45+ messages in thread
From: Lennart Poettering @ 2014-06-15 22:43 UTC (permalink / raw)
  To: Kai Krakow; +Cc: systemd-devel, linux-btrfs

On Sat, 14.06.14 12:59, Kai Krakow (hurikhan77@gmail.com) wrote:

> 
> Duncan <1i5t5.duncan@cox.net> schrieb:
> 
> > As they say, "Whoosh!"
> > 
> > At least here, I interpreted that remark as primarily sarcastic
> > commentary on the systemd devs' apparent attitude, which can be
> > (controversially) summarized as: "Systemd doesn't have problems because
> > it's perfect.  Therefore, any problems you have with systemd must instead
> > be with other components which systemd depends on."
> 
> Come on, sorry, but this is fud. Really... ;-)

Interestingly, I never commented on anything in this area, and neither
did anybody else from the systemd side, AFAICS. The entire btrfs defrag
thing I wasn't aware of before this thread started on the systemd ML a
few days ago. I am not sure where you take your ideas about our
"attitude" from. God, with behaviour like that you just make us ignore
you, Duncan.

Lennart

-- 
Lennart Poettering, Red Hat


* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-15 22:13           ` Lennart Poettering
@ 2014-06-16  0:17             ` Russell Coker
  2014-06-16  1:06               ` John Williams
  2014-06-16 10:14               ` Lennart Poettering
  2014-06-16 16:32             ` Goffredo Baroncelli
  2014-06-19  1:13             ` Dave Chinner
  2 siblings, 2 replies; 45+ messages in thread
From: Russell Coker @ 2014-06-16  0:17 UTC (permalink / raw)
  To: Lennart Poettering; +Cc: kreijack, Duncan, linux-btrfs, systemd-devel

On Mon, 16 Jun 2014 00:13:07 Lennart Poettering wrote:
> On Sat, 14.06.14 09:52, Goffredo Baroncelli (kreijack@libero.it) wrote:
> > > Which effectively means that by the time the 8 MiB is filled, each 4 KiB
> > > block has been rewritten to a new location and is now an extent unto
> > > itself.  So now that 8 MiB is composed of 2048 new extents, each one a
> > > single 4 KiB block in size.
> > 
> > Several people pointed fallocate as the problem. But I don't
> > understand the reason.
> 
> BTW, the reason we use fallocate() in journald is not about trying to
> optimize anything. It's only used for one reason: to avoid SIGBUS on
> disk/quota full, since we actually write everything to the files using
> mmap(). I mean, writing things with mmap() is always problematic, and
> handling write errors is awfully difficult, but at least two of the most
> common reasons for failure we'd like protect against in advance, under
> the assumption that disk/quota full will be reported immediately by the
> fallocate(), and the mmap writes later on will then necessarily succeed.

I just did some tests using fallocate(1).  I did the tests both with and 
without the -n option, which appeared to make no difference.

I started by allocating a 24G file on a 106G filesystem that had 30G free 
according to df.  The first time that took almost 2 minutes of system CPU time 
on a Q8400 CPU.

I then made a snapshot of the subvol and then used dd with the conv=notrunc 
option to overwrite it.  The amount of reported disk space decreased in line 
with the progress of dd.  So in the case of snapshots the space will be USED 
(not just reserved) when you call fallocate and there is no guarantee that 
space will be available when you write to it.

My systems have cron jobs to make read-only snapshots of all subvols.  On 
these systems you have no guarantee that mmap will succeed - apart from the 
fact that the variety of problems BTRFS has in the case of running out of disk 
space makes me more careful to avoid that on BTRFS than on other filesystems.

> I am not really following though why this trips up btrfs though. I am
> not sure I understand why this breaks btrfs COW behaviour. I mean,
> fallocate() isn't necessarily supposed to write anything really, it's
> mostly about allocating disk space in advance. I would claim that
> journald's usage of it is very much within the entire reason why it
> exists...

I don't believe that fallocate() makes any difference to fragmentation on 
BTRFS.  Blocks will be allocated when writes occur so regardless of an 
fallocate() call the usage pattern in systemd-journald will cause 
fragmentation.

> Anyway, happy to change these things around if necesary, but first I'd
> like to have a very good explanation why fallocate() wouldn't be the
> right thing to invoke here, and a suggestion what we should do instead
> to cover this usecase...

Systemd could request that the files in question be defragmented.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/



* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16  0:17             ` Russell Coker
@ 2014-06-16  1:06               ` John Williams
  2014-06-16  2:19                 ` Russell Coker
  2014-06-16 10:14               ` Lennart Poettering
  1 sibling, 1 reply; 45+ messages in thread
From: John Williams @ 2014-06-16  1:06 UTC (permalink / raw)
  To: russell; +Cc: Lennart Poettering, kreijack, Duncan, Btrfs BTRFS, systemd-devel

On Sun, Jun 15, 2014 at 5:17 PM, Russell Coker <russell@coker.com.au> wrote:
> I just did some tests using fallocate(1).  I did the tests both with and
> without the -n option which appeared to make no difference.
>
> I started by allocating a 24G file on a 106G filesystem that had 30G free
> according to df.  The first time that took almost 2 minutes of system CPU time
> on a Q8400 CPU.

Why does it take 2 minutes? On XFS or ext4, fallocate is almost
instantaneous, even for multi-Terabyte allocations.

According to the fallocate man page, preallocation should be quick and
require no IO:

"      fallocate  is  used  to manipulate the allocated disk space for a file,
       either to deallocate or preallocate it. For filesystems  which  support
       the  fallocate system call, preallocation is done quickly by allocating
       blocks and marking them as uninitialized, requiring no IO to  the  data
       blocks.  This  is  much  faster than creating a file by filling it with
       zeros."


* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16  1:06               ` John Williams
@ 2014-06-16  2:19                 ` Russell Coker
  0 siblings, 0 replies; 45+ messages in thread
From: Russell Coker @ 2014-06-16  2:19 UTC (permalink / raw)
  To: John Williams
  Cc: Lennart Poettering, kreijack, Duncan, Btrfs BTRFS, systemd-devel

On Sun, 15 Jun 2014 18:06:34 John Williams wrote:
> Why does it take 2 minutes? On XFS or ext4, fallocate is almost
> instantaneous, even for multi-Terabyte allocations.
> 
> According the fallocate man page, preallocation should be quick and
> require no IO:
> 
> "      fallocate  is  used  to manipulate the allocated disk space for a
> file, either to deallocate or preallocate it. For filesystems  which 
> support the  fallocate system call, preallocation is done quickly by
> allocating blocks and marking them as uninitialized, requiring no IO to 
> the  data blocks.  This  is  much  faster than creating a file by filling
> it with zeros."

No IO to data blocks but there is IO to metadata.

But I think that BTRFS may need some optimisation for such things.  While 
fallocate() on 24G is probably a very unusual case, it will probably matter to 
some people (I can imagine scientific computing needing it) and it's likely 
that much smaller fallocate() calls also take longer than desired.

The issue was system CPU time: extending the file in that test was proceeding 
at a speed of about 200MB/s for allocated space, while the system was writing 
something less than 2MB/s to the device (sometimes it went for 10+ seconds 
without writing any data).  The SSD in question can sustain about 200MB/s of 
data written so in that case the BTRFS speed for allocating disk space was 
about equal to the speed it should be able to write real data.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/



* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16  0:17             ` Russell Coker
  2014-06-16  1:06               ` John Williams
@ 2014-06-16 10:14               ` Lennart Poettering
  2014-06-16 10:35                 ` Russell Coker
                                   ` (2 more replies)
  1 sibling, 3 replies; 45+ messages in thread
From: Lennart Poettering @ 2014-06-16 10:14 UTC (permalink / raw)
  To: Russell Coker; +Cc: kreijack, Duncan, linux-btrfs, systemd-devel

On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote:

> > I am not really following though why this trips up btrfs though. I am
> > not sure I understand why this breaks btrfs COW behaviour. I mean,
> > fallocate() isn't necessarily supposed to write anything really, it's
> > mostly about allocating disk space in advance. I would claim that
> > journald's usage of it is very much within the entire reason why it
> > exists...
> 
> I don't believe that fallocate() makes any difference to fragmentation on 
> BTRFS.  Blocks will be allocated when writes occur so regardless of an 
> fallocate() call the usage pattern in systemd-journald will cause 
> fragmentation.

journald's write pattern looks something like this: append something to
the end, make sure it is written, then update a few offsets stored at
the beginning of the file to point to the newly appended data. This is
of course not easy to handle for COW file systems. But then again, it's
probably not too different from access patterns of other database or
database-like engines...

Lennart

-- 
Lennart Poettering, Red Hat


* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16 10:14               ` Lennart Poettering
@ 2014-06-16 10:35                 ` Russell Coker
  2014-06-16 11:16                   ` Austin S Hemmelgarn
  2014-06-16 11:56                 ` Andrey Borzenkov
  2014-06-16 16:05                 ` Josef Bacik
  2 siblings, 1 reply; 45+ messages in thread
From: Russell Coker @ 2014-06-16 10:35 UTC (permalink / raw)
  To: Lennart Poettering; +Cc: kreijack, Duncan, linux-btrfs, systemd-devel

On Mon, 16 Jun 2014 12:14:49 Lennart Poettering wrote:
> On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote:
> > > I am not really following though why this trips up btrfs though. I am
> > > not sure I understand why this breaks btrfs COW behaviour. I mean,
> > > fallocate() isn't necessarily supposed to write anything really, it's
> > > mostly about allocating disk space in advance. I would claim that
> > > journald's usage of it is very much within the entire reason why it
> > > exists...
> > 
> > I don't believe that fallocate() makes any difference to fragmentation on
> > BTRFS.  Blocks will be allocated when writes occur so regardless of an
> > fallocate() call the usage pattern in systemd-journald will cause
> > fragmentation.
> 
> journald's write pattern looks something like this: append something to
> the end, make sure it is written, then update a few offsets stored at
> the beginning of the file to point to the newly appended data. This is
> of course not easy to handle for COW file systems. But then again, it's
> probably not too different from access patterns of other database or
> database-like engines...

Not being too different from the access patterns of other databases means 
having all the same problems as other databases...  Oracle is now selling ZFS 
servers specifically designed for running the Oracle database, but that's with 
"hybrid storage" flash (ZIL and L2ARC on SSD).  While BTRFS doesn't support 
features equivalent to ZIL and L2ARC, it's easy to run a separate filesystem 
on SSD for things that need performance (few if any current BTRFS users would 
have a database too big to entirely fit on a SSD).

The problem we are dealing with is "database-like" access patterns on systems 
that are not designed as database servers.

Would it be possible to get an interface for defragmenting files that's not 
specific to BTRFS?  If we had a standard way of doing this then systemd-
journald could request a defragment of the file at appropriate times.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/



* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16 10:35                 ` Russell Coker
@ 2014-06-16 11:16                   ` Austin S Hemmelgarn
  0 siblings, 0 replies; 45+ messages in thread
From: Austin S Hemmelgarn @ 2014-06-16 11:16 UTC (permalink / raw)
  To: russell, Lennart Poettering; +Cc: kreijack, Duncan, linux-btrfs, systemd-devel

[-- Attachment #1: Type: text/plain, Size: 2434 bytes --]

On 2014-06-16 06:35, Russell Coker wrote:
> On Mon, 16 Jun 2014 12:14:49 Lennart Poettering wrote:
>> On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote:
>>>> I am not really following though why this trips up btrfs though. I am
>>>> not sure I understand why this breaks btrfs COW behaviour. I mean,
>>>> fallocate() isn't necessarily supposed to write anything really, it's
>>>> mostly about allocating disk space in advance. I would claim that
>>>> journald's usage of it is very much within the entire reason why it
>>>> exists...
>>>
>>> I don't believe that fallocate() makes any difference to fragmentation on
>>> BTRFS.  Blocks will be allocated when writes occur so regardless of an
>>> fallocate() call the usage pattern in systemd-journald will cause
>>> fragmentation.
>>
>> journald's write pattern looks something like this: append something to
>> the end, make sure it is written, then update a few offsets stored at
>> the beginning of the file to point to the newly appended data. This is
>> of course not easy to handle for COW file systems. But then again, it's
>> probably not too different from access patterns of other database or
>> database-like engines...
> 
> Not being too different from the access patterns of other databases means 
> having all the same problems as other databases...  Oracle is now selling ZFS 
> servers specifically designed for running the Oracle database, but that's with 
> "hybrid storage" "flash" (ZIL and L2ARC on SSD).  While BTRFS doesn't support 
> features equivalent to ZIL and L2ARC, it's easy to run a separate filesystem 
> on SSD for things that need performance (few if any current BTRFS users would 
> have a database too big to entirely fit on a SSD).
> 
> The problem we are dealing with is "database-like" access patterns on systems 
> that are not designed as database servers.
> 
> Would it be possible to get an interface for defragmenting files that's not 
> specific to BTRFS?  If we had a standard way of doing this then systemd-
> journald could request a defragment of the file at appropriate times.
> 
While this is a wonderful idea, what about all the extra I/O this will
cause (and all the extra wear on SSDs)?  While I understand wanting
this to be faster, you should also consider that defragmenting the file
on a regular basis is going to trash performance for other
applications.


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 2967 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16 10:14               ` Lennart Poettering
  2014-06-16 10:35                 ` Russell Coker
@ 2014-06-16 11:56                 ` Andrey Borzenkov
  2014-06-16 16:05                 ` Josef Bacik
  2 siblings, 0 replies; 45+ messages in thread
From: Andrey Borzenkov @ 2014-06-16 11:56 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Russell Coker, SystemD Devel, kreijack, Duncan, linux-btrfs

On Mon, Jun 16, 2014 at 2:14 PM, Lennart Poettering
<lennart@poettering.net> wrote:
> On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote:
>
>> > I am not really following though why this trips up btrfs though. I am
>> > not sure I understand why this breaks btrfs COW behaviour. I mean,
>> > fallocate() isn't necessarily supposed to write anything really, it's
>> > mostly about allocating disk space in advance. I would claim that
>> > journald's usage of it is very much within the entire reason why it
>> > exists...
>>
>> I don't believe that fallocate() makes any difference to fragmentation on
>> BTRFS.  Blocks will be allocated when writes occur so regardless of an
>> fallocate() call the usage pattern in systemd-journald will cause
>> fragmentation.
>
> journald's write pattern looks something like this: append something to
> the end, make sure it is written, then update a few offsets stored at
> the beginning of the file to point to the newly appended data. This is
> of course not easy to handle for COW file systems. But then again, it's
> probably not too different from access patterns of other database or
> database-like engines...
>

... which traditionally experienced severe sequential read performance
degradation in this case. As I understand it, this is exactly what
happens: readahead attempts to preload the files, which gives us heavy
random read access.

The only real remedy was to defragment the files. That should work
relatively well for the journal, where files are mostly "write once", at
the expense of additional read/write activity.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16 10:14               ` Lennart Poettering
  2014-06-16 10:35                 ` Russell Coker
  2014-06-16 11:56                 ` Andrey Borzenkov
@ 2014-06-16 16:05                 ` Josef Bacik
  2014-06-16 19:52                   ` Martin
  2014-06-17 21:12                   ` Lennart Poettering
  2 siblings, 2 replies; 45+ messages in thread
From: Josef Bacik @ 2014-06-16 16:05 UTC (permalink / raw)
  To: Lennart Poettering, Russell Coker
  Cc: kreijack, Duncan, linux-btrfs, systemd-devel



On 06/16/2014 03:14 AM, Lennart Poettering wrote:
> On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote:
>
>>> I am not really following though why this trips up btrfs though. I am
>>> not sure I understand why this breaks btrfs COW behaviour. I mean,
>>> fallocate() isn't necessarily supposed to write anything really, it's
>>> mostly about allocating disk space in advance. I would claim that
>>> journald's usage of it is very much within the entire reason why it
>>> exists...
>>
>> I don't believe that fallocate() makes any difference to fragmentation on
>> BTRFS.  Blocks will be allocated when writes occur so regardless of an
>> fallocate() call the usage pattern in systemd-journald will cause
>> fragmentation.
>
> journald's write pattern looks something like this: append something to
> the end, make sure it is written, then update a few offsets stored at
> the beginning of the file to point to the newly appended data. This is
> of course not easy to handle for COW file systems. But then again, it's
> probably not too different from access patterns of other database or
> database-like engines...

Was waiting for you to show up before I said anything since most systemd 
related emails always devolve into how evil you are rather than what is 
actually happening.

So you are doing all the right things from what I can tell, I'm just a 
little confused about when you guys run fsync.  From what I can tell 
it's only when you open the journal file and when you switch it to 
"offline."  I didn't look too much past this point so I don't know how 
often these things happen.  Are you taking an individual message, 
writing it, updating the head of the file and then fsync'ing?  Or are 
you getting a good bit of dirty log data and fsyncing occasionally?

What would cause btrfs problems is if you fallocate(), write a small 
chunk, fsync, write a small chunk again, fsync again, etc.  Fallocate 
saves you the first write around, but if the next write is within the 
same block as the previous write we'll end up triggering COW and enter 
fragmented territory.  If this is what journald is doing then that 
would be good to know; if not, I'd like to know what is happening, 
since we shouldn't be fragmenting this badly.
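For illustration, the pattern described above can be reproduced with a short sketch (the function name, message sizes, and the 8 MB preallocation unit are my own choices for the experiment, not journald's actual code):

```python
import os

def journald_like_writes(path, n_msgs=100, msg_len=200, prealloc=8 * 1024 * 1024):
    """Preallocate, then append small records, fsync'ing after each one."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # Reserve space up front, as journald does.
        os.posix_fallocate(fd, 0, prealloc)
        offset = 0
        for _ in range(n_msgs):
            # Each record lands inside the already-allocated range...
            os.pwrite(fd, b"x" * msg_len, offset)
            offset += msg_len
            # ...but the per-record fsync rewrites the same block over
            # and over, which on a COW filesystem means a new extent.
            os.fsync(fd)
        return offset
    finally:
        os.close(fd)
```

Running this on btrfs and checking the file with `filefrag` should show the extent count growing with the number of fsync'ed records.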

Like I said, what you guys are doing is fine; if btrfs falls on its face 
then it's not your fault.  I'd just like an exact idea of when you guys 
are fsync'ing so I can replicate it in a smaller way.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-15 22:13           ` Lennart Poettering
  2014-06-16  0:17             ` Russell Coker
@ 2014-06-16 16:32             ` Goffredo Baroncelli
  2014-06-16 18:47               ` Goffredo Baroncelli
  2014-06-19  1:13             ` Dave Chinner
  2 siblings, 1 reply; 45+ messages in thread
From: Goffredo Baroncelli @ 2014-06-16 16:32 UTC (permalink / raw)
  To: Lennart Poettering; +Cc: Duncan, linux-btrfs, systemd-devel

Hi Lennart,

On 06/16/2014 12:13 AM, Lennart Poettering wrote:
> I am not really following though why this trips up btrfs though. I am
> not sure I understand why this breaks btrfs COW behaviour. I mean,
> fallocate() isn't necessarily supposed to write anything really, it's
> mostly about allocating disk space in advance. I would claim that
> journald's usage of it is very much within the entire reason why it
> exists...

I performed several tests, trying different setups [1]. One of these was replacing the posix_fallocate() with a truncate, to check where the problem lies. The conclusion was that *posix_fallocate() is NOT the problem*.

In another reply you stated that systemd-journald appends some data at the end of the file, then updates some data in the middle. I think this is why the file becomes fragmented so quickly.


[1] Let me revise the English, then I will post the results.


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16 16:32             ` Goffredo Baroncelli
@ 2014-06-16 18:47               ` Goffredo Baroncelli
  0 siblings, 0 replies; 45+ messages in thread
From: Goffredo Baroncelli @ 2014-06-16 18:47 UTC (permalink / raw)
  Cc: Lennart Poettering, Duncan, linux-btrfs, systemd-devel, Kai Krakow

Hi all,

In this blog [1] I collected all the results of the tests which I performed in order to investigate this performance problem between systemd and btrfs. I had to put the results in a blog because there are several images. Below is a brief summary.

I took an old machine (a P4 2.5GHz with 512MB of RAM) with a fresh installation of Fedora 20, and I measured the boot time over several reboots (up to 70).  I tested the following scenarios:

1) standard (without defragmenting any file, without readahead)
2) defragment the journal file at the end of the boot
3) defragment the journal file before the flushing
4) mark as NOCOW the journald log file
5) enable the systemd-readahead
6) remove the fsync(2) call from journald
7) remove the posix_fallocate(3) call from journald
8) do a defrag when posix_fallocate(3) is called
9) do a defrag when the journal log file is opened

Test #1 highlights the problem. It shows that the boot time may require up to 50 seconds. Over the reboots, the number of extents of the system.journal file increases to as many as 6000. Defragmenting the system.journal file decreases the boot time by ~20 seconds. My conclusion is that on BTRFS the fragmentation of this file increases the boot time.

Tests #6 and #7 suggested that the fsync(2) and posix_fallocate(3) calls aren't the root cause of the problem. Even without them, the system.journal file still fragments.

Test #4 suggested that marking the system.journal file NOCOW reduces its fragmentation and thus the boot time.

Tests #2, #3, and #9 suggested that performing a periodic defrag significantly reduces the fragmentation of system.journal and thus the boot time.

Test #5 revealed that the readahead capability of systemd was not effective: the system.journal file seemed unaffected by it (though other *.journal files were). Further investigation is required.

BR
G.Baroncelli



[1] http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html

On 06/16/2014 06:32 PM, Goffredo Baroncelli wrote:
> Hi Lennart,
> 
> On 06/16/2014 12:13 AM, Lennart Poettering wrote:
>> I am not really following though why this trips up btrfs though. I am
>> not sure I understand why this breaks btrfs COW behaviour. I mean,
>> fallocate() isn't necessarily supposed to write anything really, it's
>> mostly about allocating disk space in advance. I would claim that
>> journald's usage of it is very much within the entire reason why it
>> exists...
> 
> I performed several tests, trying different setups [1]. One of these was replacing the posix_fallocate() with a truncate, to check where the problem lies. The conclusion was that *posix_fallocate() is NOT the problem*.
> 
> In another reply you stated that systemd-journald appends some data at the end of the file, then updates some data in the middle. I think this is why the file becomes fragmented so quickly.
> 
> 
> [1] Let me revise the English, then I will post the results.
> 
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16 16:05                 ` Josef Bacik
@ 2014-06-16 19:52                   ` Martin
  2014-06-16 20:20                     ` Josef Bacik
                                       ` (2 more replies)
  2014-06-17 21:12                   ` Lennart Poettering
  1 sibling, 3 replies; 45+ messages in thread
From: Martin @ 2014-06-16 19:52 UTC (permalink / raw)
  To: linux-btrfs; +Cc: systemd-devel

On 16/06/14 17:05, Josef Bacik wrote:
> 
> On 06/16/2014 03:14 AM, Lennart Poettering wrote:
>> On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote:
>>
>>>> I am not really following though why this trips up btrfs though. I am
>>>> not sure I understand why this breaks btrfs COW behaviour. I mean,

>>> I don't believe that fallocate() makes any difference to
>>> fragmentation on
>>> BTRFS.  Blocks will be allocated when writes occur so regardless of an
>>> fallocate() call the usage pattern in systemd-journald will cause
>>> fragmentation.
>>
>> journald's write pattern looks something like this: append something to
>> the end, make sure it is written, then update a few offsets stored at
>> the beginning of the file to point to the newly appended data. This is
>> of course not easy to handle for COW file systems. But then again, it's
>> probably not too different from access patterns of other database or
>> database-like engines...

Even though this appears to be a problem case for btrfs/COW, is there a
more favourable write/access sequence that is easily implemented and
favourable for both ext4-like fs /and/ COW fs?

Database-like writing is known to be 'difficult' for filesystems: can a
data log be a simpler case?


> Was waiting for you to show up before I said anything since most systemd
> related emails always devolve into how evil you are rather than what is
> actually happening.

Ouch! Hope you two know each other!! :-P :-)


[...]
> since we shouldn't be fragmenting this badly.
> 
> Like I said, what you guys are doing is fine; if btrfs falls on its face
> then it's not your fault.  I'd just like an exact idea of when you guys
> are fsync'ing so I can replicate it in a smaller way.  Thanks,

Good if COW can be so resilient. I have about 2GB of data logging
files and I must defrag those as part of my backups to stop the system
fragmenting to a stop (I use "cp -a" to defrag the files to a new area
and restart the data logging software on that).


Random thoughts:

Would using a second small file just for the mmap-ed pointers help avoid
repeated rewriting of random offsets in the log file causing excessive
fragmentation?

Align the data writes to 16kByte or 64kByte boundaries/chunks?

Are mmap-ed files a similar problem to using a swap file and so should
the same "btrfs file swap" code be used for both?


Not looked over the code so all random guesses...

Regards,
Martin





^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16 19:52                   ` Martin
@ 2014-06-16 20:20                     ` Josef Bacik
  2014-06-17  0:15                     ` Austin S Hemmelgarn
  2014-06-17  1:13                     ` cwillu
  2 siblings, 0 replies; 45+ messages in thread
From: Josef Bacik @ 2014-06-16 20:20 UTC (permalink / raw)
  To: Martin, linux-btrfs; +Cc: systemd-devel



On 06/16/2014 12:52 PM, Martin wrote:
> On 16/06/14 17:05, Josef Bacik wrote:
>>
>> On 06/16/2014 03:14 AM, Lennart Poettering wrote:
>>> On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote:
>>>
>>>>> I am not really following though why this trips up btrfs though. I am
>>>>> not sure I understand why this breaks btrfs COW behaviour. I mean,
>
>>>> I don't believe that fallocate() makes any difference to
>>>> fragmentation on
>>>> BTRFS.  Blocks will be allocated when writes occur so regardless of an
>>>> fallocate() call the usage pattern in systemd-journald will cause
>>>> fragmentation.
>>>
>>> journald's write pattern looks something like this: append something to
>>> the end, make sure it is written, then update a few offsets stored at
>>> the beginning of the file to point to the newly appended data. This is
>>> of course not easy to handle for COW file systems. But then again, it's
>>> probably not too different from access patterns of other database or
>>> database-like engines...
>
> Even though this appears to be a problem case for btrfs/COW, is there a
> more favourable write/access sequence possible that is easily
> implemented that is favourable for both ext4-like fs /and/ COW fs?
>
> Database-like writing is known to be 'difficult' for filesystems: can a
> data log be a simpler case?
>
>
>> Was waiting for you to show up before I said anything since most systemd
>> related emails always devolve into how evil you are rather than what is
>> actually happening.
>
> Ouch! Hope you two know each other!! :-P :-)
>

Yup, I <3 Lennart, I'd rather deal with him directly than wade through 
all the FUD that flies around when systemd is brought up.

>
> [...]
>> since we shouldn't be fragmenting this badly.
>>
>> Like I said, what you guys are doing is fine; if btrfs falls on its face
>> then it's not your fault.  I'd just like an exact idea of when you guys
>> are fsync'ing so I can replicate it in a smaller way.  Thanks,
>
> Good if COW can be so resilient. I have about 2GBytes of data logging
> files and I must defrag those as part of my backups to stop the system
> fragmenting to a stop (I use "cp -a" to defrag the files to a new area
> and restart the data software logger on that).
>
>
> Random thoughts:
>
> Would using a second small file just for the mmap-ed pointers help avoid
> repeated rewriting of random offsets in the log file causing excessive
> fragmentation?
>

Depends on when you fsync.  The problem isn't dirtying so much as writing.

> Align the data writes to 16kByte or 64kByte boundaries/chunks?
>

Yes, that would help the most; if journald would only fsync every 
blocksize worth of writes we'd suck less.
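That batching idea can be sketched roughly as follows (the class name, block size, and counters are my own; journald's real sync policy also involves timers):

```python
import os

class BlockBatchedWriter:
    """Append-only writer that fsyncs only once per full block of new data."""

    def __init__(self, path, blocksize=4096):
        self.fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_APPEND, 0o600)
        self.blocksize = blocksize
        self.unsynced = 0       # bytes written since the last fsync
        self.fsync_count = 0

    def append(self, data):
        os.write(self.fd, data)
        self.unsynced += len(data)
        if self.unsynced >= self.blocksize:
            # At most one fsync per blocksize worth of writes, so
            # consecutive syncs never rewrite the same partial block.
            os.fsync(self.fd)
            self.fsync_count += 1
            self.unsynced = 0

    def close(self):
        if self.unsynced:
            os.fsync(self.fd)   # flush the tail on shutdown
        os.close(self.fd)
```

The trade-off, as elsewhere in this thread, is that up to one block's worth of log data can be lost on a crash.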

> Are mmap-ed files a similar problem to using a swap file and so should
> the same "btrfs file swap" code be used for both?
>

Not sure what this special swap file code is you speak of.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16 19:52                   ` Martin
  2014-06-16 20:20                     ` Josef Bacik
@ 2014-06-17  0:15                     ` Austin S Hemmelgarn
  2014-06-17  1:13                     ` cwillu
  2 siblings, 0 replies; 45+ messages in thread
From: Austin S Hemmelgarn @ 2014-06-17  0:15 UTC (permalink / raw)
  To: Martin, linux-btrfs; +Cc: systemd-devel

[-- Attachment #1: Type: text/plain, Size: 3469 bytes --]

On 06/16/2014 03:52 PM, Martin wrote:
> On 16/06/14 17:05, Josef Bacik wrote:
>>
>> On 06/16/2014 03:14 AM, Lennart Poettering wrote:
>>> On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote:
>>>
>>>>> I am not really following though why this trips up btrfs though. I am
>>>>> not sure I understand why this breaks btrfs COW behaviour. I mean,
> 
>>>> I don't believe that fallocate() makes any difference to
>>>> fragmentation on
>>>> BTRFS.  Blocks will be allocated when writes occur so regardless of an
>>>> fallocate() call the usage pattern in systemd-journald will cause
>>>> fragmentation.
>>>
>>> journald's write pattern looks something like this: append something to
>>> the end, make sure it is written, then update a few offsets stored at
>>> the beginning of the file to point to the newly appended data. This is
>>> of course not easy to handle for COW file systems. But then again, it's
>>> probably not too different from access patterns of other database or
>>> database-like engines...
> 
> Even though this appears to be a problem case for btrfs/COW, is there a
> more favourable write/access sequence possible that is easily
> implemented that is favourable for both ext4-like fs /and/ COW fs?
> 
> Database-like writing is known to be 'difficult' for filesystems: can a
> data log be a simpler case?
> 
> 
>> Was waiting for you to show up before I said anything since most systemd
>> related emails always devolve into how evil you are rather than what is
>> actually happening.
> 
> Ouch! Hope you two know each other!! :-P :-)
> 
> 
> [...]
>> since we shouldn't be fragmenting this badly.
>>
>> Like I said, what you guys are doing is fine; if btrfs falls on its face
>> then it's not your fault.  I'd just like an exact idea of when you guys
>> are fsync'ing so I can replicate it in a smaller way.  Thanks,
> 
> Good if COW can be so resilient. I have about 2GBytes of data logging
> files and I must defrag those as part of my backups to stop the system
> fragmenting to a stop (I use "cp -a" to defrag the files to a new area
> and restart the data software logger on that).
> 
> 
> Random thoughts:
> 
> Would using a second small file just for the mmap-ed pointers help avoid
> repeated rewriting of random offsets in the log file causing excessive
> fragmentation?
> 
> Align the data writes to 16kByte or 64kByte boundaries/chunks?
> 
> Are mmap-ed files a similar problem to using a swap file and so should
> the same "btrfs file swap" code be used for both?
> 
> 
> Not looked over the code so all random guesses...
> 
> Regards,
> Martin
> 
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
Just a thought, partly inspired by the mention of the swap code, has
anyone tried making the file NOCOW and pre-allocating to the max journal
size?  A similar approach has seemed to help on my systems with generic
log files (I keep debug level logs from almost everything, so I end up
with very active log files with ridiculous numbers of fragments if I
don't pre-allocate and mark them NOCOW).  I don't know for certain how
BTRFS handles appends to NOCOW files, but I would be willing to bet that
it ends up with a new fragment for each filesystem block worth of space
allocated.
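For reference, marking a file NOCOW programmatically (the equivalent of `chattr +C`) looks roughly like the sketch below. The ioctl numbers are the 64-bit Linux values and are my own annotations, not anything from this thread; note the flag only takes effect if the file is still empty, and the call simply fails on non-btrfs filesystems:

```python
import fcntl
import os
import struct

FS_IOC_GETFLAGS = 0x80086601   # _IOR('f', 1, long), 64-bit Linux value
FS_IOC_SETFLAGS = 0x40086602   # _IOW('f', 2, long), 64-bit Linux value
FS_NOCOW_FL = 0x00800000       # per-file NOCOW flag (what chattr +C sets)

def mark_nocow(path):
    """Try to set NOCOW on a still-empty file; return True if it worked."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Read the current inode flags, OR in NOCOW, write them back.
        buf = bytearray(struct.pack("l", 0))
        fcntl.ioctl(fd, FS_IOC_GETFLAGS, buf)
        flags = struct.unpack("l", bytes(buf))[0] | FS_NOCOW_FL
        fcntl.ioctl(fd, FS_IOC_SETFLAGS, struct.pack("l", flags))
        return True
    except OSError:
        return False   # not btrfs, or the filesystem rejected the flag
    finally:
        os.close(fd)
```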


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 2967 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16 19:52                   ` Martin
  2014-06-16 20:20                     ` Josef Bacik
  2014-06-17  0:15                     ` Austin S Hemmelgarn
@ 2014-06-17  1:13                     ` cwillu
  2014-06-17 12:24                       ` Martin
                                         ` (2 more replies)
  2 siblings, 3 replies; 45+ messages in thread
From: cwillu @ 2014-06-17  1:13 UTC (permalink / raw)
  To: Martin; +Cc: linux-btrfs, systemd-devel, clm

It's not a mmap problem, it's a small writes with an msync or fsync
after each one problem.

For the case of sequential writes (via write or mmap), padding writes
to page boundaries would help, if the wasted space isn't an issue.
Another approach, again assuming all other writes are appends, would
be to periodically (but frequently enough that the pages are still in
cache) read a chunk of the file and write it back in-place, with or
without an fsync. On the other hand, if you can afford to lose some
logs on a crash, not fsyncing/msyncing after each write will also
eliminate the fragmentation.

(Worth pointing out that none of that is conjecture, I just spent 30
minutes testing those cases while composing this ;p)

Josef has mentioned in irc that a piece of Chris' raid5/6 work will
also fix this when it lands.
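The padding approach mentioned above amounts to rounding each record up to a page boundary so that consecutive fsync'ed appends never share a block. A minimal sketch (4096 is an assumption here; real code should query the page size via sysconf):

```python
PAGE_SIZE = 4096  # assumption; use os.sysconf("SC_PAGE_SIZE") in real code

def padded_length(nbytes, align=PAGE_SIZE):
    """Round a record length up to the next alignment boundary."""
    return -(-nbytes // align) * align  # ceiling division

def pad_record(data, align=PAGE_SIZE):
    """Zero-pad an append so the next write starts on a fresh page."""
    return data + b"\0" * (padded_length(len(data), align) - len(data))
```

The cost is the wasted space noted above: every record, however small, occupies at least one page.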

On Mon, Jun 16, 2014 at 1:52 PM, Martin <m_btrfs@ml1.co.uk> wrote:
> On 16/06/14 17:05, Josef Bacik wrote:
>>
>> On 06/16/2014 03:14 AM, Lennart Poettering wrote:
>>> On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote:
>>>
>>>>> I am not really following though why this trips up btrfs though. I am
>>>>> not sure I understand why this breaks btrfs COW behaviour. I mean,
>
>>>> I don't believe that fallocate() makes any difference to
>>>> fragmentation on
>>>> BTRFS.  Blocks will be allocated when writes occur so regardless of an
>>>> fallocate() call the usage pattern in systemd-journald will cause
>>>> fragmentation.
>>>
>>> journald's write pattern looks something like this: append something to
>>> the end, make sure it is written, then update a few offsets stored at
>>> the beginning of the file to point to the newly appended data. This is
>>> of course not easy to handle for COW file systems. But then again, it's
>>> probably not too different from access patterns of other database or
>>> database-like engines...
>
> Even though this appears to be a problem case for btrfs/COW, is there a
> more favourable write/access sequence possible that is easily
> implemented that is favourable for both ext4-like fs /and/ COW fs?
>
> Database-like writing is known to be 'difficult' for filesystems: can a
> data log be a simpler case?
>
>
>> Was waiting for you to show up before I said anything since most systemd
>> related emails always devolve into how evil you are rather than what is
>> actually happening.
>
> Ouch! Hope you two know each other!! :-P :-)
>
>
> [...]
>> since we shouldn't be fragmenting this badly.
>>
>> Like I said, what you guys are doing is fine; if btrfs falls on its face
>> then it's not your fault.  I'd just like an exact idea of when you guys
>> are fsync'ing so I can replicate it in a smaller way.  Thanks,
>
> Good if COW can be so resilient. I have about 2GBytes of data logging
> files and I must defrag those as part of my backups to stop the system
> fragmenting to a stop (I use "cp -a" to defrag the files to a new area
> and restart the data software logger on that).
>
>
> Random thoughts:
>
> Would using a second small file just for the mmap-ed pointers help avoid
> repeated rewriting of random offsets in the log file causing excessive
> fragmentation?
>
> Align the data writes to 16kByte or 64kByte boundaries/chunks?
>
> Are mmap-ed files a similar problem to using a swap file and so should
> the same "btrfs file swap" code be used for both?
>
>
> Not looked over the code so all random guesses...
>
> Regards,
> Martin
>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Slow startup of systemd-journal on BTRFS
  2014-06-15 21:31         ` Martin Steigerwald
  2014-06-15 21:37           ` Hugo Mills
@ 2014-06-17  8:22           ` Duncan
  1 sibling, 0 replies; 45+ messages in thread
From: Duncan @ 2014-06-17  8:22 UTC (permalink / raw)
  To: linux-btrfs; +Cc: systemd-devel

Martin Steigerwald posted on Sun, 15 Jun 2014 23:31:07 +0200 as excerpted:

> I always thought that the whole point of fallocate is that it *doesn't*
> write out anything, but just reserves the space. Thus I don't see how
> COW can have any adverse effect here.

Tying up loose ends...  I was wrong on fallocate being the trigger, tho I 
did call correctly on fragmentation as the result, with NOCOW and defrag 
as possible workarounds.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-17  1:13                     ` cwillu
@ 2014-06-17 12:24                       ` Martin
  2014-06-17 17:56                       ` Chris Murphy
  2014-06-17 18:46                       ` Filipe Brandenburger
  2 siblings, 0 replies; 45+ messages in thread
From: Martin @ 2014-06-17 12:24 UTC (permalink / raw)
  To: linux-btrfs; +Cc: systemd-devel

On 17/06/14 02:13, cwillu wrote:
> It's not a mmap problem, it's a small writes with an msync or fsync
> after each one problem.

And for logging, that is exactly what is wanted: to see why whatever
crashed...

Except...

Whilst logging, hold off on the msync/fsync unless the next log message
to be written is 'critical'?

With that, the mundane logging gets appended just as for any normal file
write. Only the more critical log messages suffer the extra overhead and
fragmentation of an immediate msync/fsync.
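A sketch of that policy (the class name and threshold are my own; the syslog module is used only for its standard priority constants, where a smaller number means more severe):

```python
import os
import syslog  # just for the standard LOG_* priority constants

class PriorityGatedLog:
    """Append every message; fsync only for messages at LOG_ERR or worse."""

    def __init__(self, path, threshold=syslog.LOG_ERR):
        self.fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_APPEND, 0o600)
        self.threshold = threshold
        self.fsync_count = 0

    def log(self, priority, msg):
        os.write(self.fd, msg.encode() + b"\n")
        if priority <= self.threshold:  # lower number = more severe
            os.fsync(self.fd)           # only critical messages pay the cost
            self.fsync_count += 1

    def close(self):
        os.fsync(self.fd)  # don't lose the mundane tail on clean shutdown
        os.close(self.fd)
```

Mundane messages then fragment no more than any normal appended file; only the rare critical ones force an immediate sync.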


> For the case of sequential writes (via write or mmap), padding writes
> to page boundaries would help, if the wasted space isn't an issue.
> Another approach, again assuming all other writes are appends, would
> be to periodically (but frequently enough that the pages are still in
> cache) read a chunk of the file and write it back in-place, with or
> without an fsync. On the other hand, if you can afford to lose some
> logs on a crash, not fsyncing/msyncing after each write will also
> eliminate the fragmentation.
> 
> (Worth pointing out that none of that is conjecture, I just spent 30
> minutes testing those cases while composing this ;p)
> 
> Josef has mentioned in irc that a piece of Chris' raid5/6 work will
> also fix this when it lands.

Interesting...

The source problem is how COW fragments files under expected normal use...
Is all this unavoidable unless we rethink the semantics?


Regards,
Martin



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-17  1:13                     ` cwillu
  2014-06-17 12:24                       ` Martin
@ 2014-06-17 17:56                       ` Chris Murphy
  2014-06-17 18:46                       ` Filipe Brandenburger
  2 siblings, 0 replies; 45+ messages in thread
From: Chris Murphy @ 2014-06-17 17:56 UTC (permalink / raw)
  To: cwillu; +Cc: linux-btrfs, systemd-devel


On Jun 16, 2014, at 7:13 PM, cwillu <cwillu@cwillu.com> wrote:

> It's not a mmap problem, it's a small writes with an msync or fsync
> after each one problem.
> 
> For the case of sequential writes (via write or mmap), padding writes
> to page boundaries would help, if the wasted space isn't an issue.
> Another approach, again assuming all other writes are appends, would
> be to periodically (but frequently enough that the pages are still in
> cache) read a chunk of the file and write it back in-place, with or
> without an fsync. On the other hand, if you can afford to lose some
> logs on a crash, not fsyncing/msyncing after each write will also
> eliminate the fragmentation.

Normally I'd be willing to give up ~30 seconds of journal to not have fragmented journals. But then if I use systemd.log_level=debug I'd like that to trigger more frequent flushing to make sure as little of the journaling is lost as possible. Does that make sense?


Chris Murphy

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-17  1:13                     ` cwillu
  2014-06-17 12:24                       ` Martin
  2014-06-17 17:56                       ` Chris Murphy
@ 2014-06-17 18:46                       ` Filipe Brandenburger
  2014-06-17 19:42                         ` Goffredo Baroncelli
  2 siblings, 1 reply; 45+ messages in thread
From: Filipe Brandenburger @ 2014-06-17 18:46 UTC (permalink / raw)
  To: linux-btrfs, systemd-devel

On Mon, Jun 16, 2014 at 6:13 PM, cwillu <cwillu@cwillu.com> wrote:
> For the case of sequential writes (via write or mmap), padding writes
> to page boundaries would help, if the wasted space isn't an issue.
> Another approach, again assuming all other writes are appends, would
> be to periodically (but frequently enough that the pages are still in
> cache) read a chunk of the file and write it back in-place, with or
> without an fsync. On the other hand, if you can afford to lose some
> logs on a crash, not fsyncing/msyncing after each write will also
> eliminate the fragmentation.

I was wondering if something could be done in btrfs to improve
performance under this workload... Something like a "defrag on demand"
for a case where mostly appends are happening.

When there are small appends with fsync/msync, they become new
fragments (as expected), but once the writes go past a block boundary,
btrfs could defragment the previous block in the background, since it's
not really expected to change again.

That could potentially achieve performance close to chattr +C without
the drawbacks of disabling copy-on-write.

Cheers,
Filipe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-17 18:46                       ` Filipe Brandenburger
@ 2014-06-17 19:42                         ` Goffredo Baroncelli
  0 siblings, 0 replies; 45+ messages in thread
From: Goffredo Baroncelli @ 2014-06-17 19:42 UTC (permalink / raw)
  To: linux-btrfs

On 06/17/2014 08:46 PM, Filipe Brandenburger wrote:
> On Mon, Jun 16, 2014 at 6:13 PM, cwillu <cwillu@cwillu.com> wrote:
>> For the case of sequential writes (via write or mmap), padding writes
>> to page boundaries would help, if the wasted space isn't an issue.
>> Another approach, again assuming all other writes are appends, would
>> be to periodically (but frequently enough that the pages are still in
>> cache) read a chunk of the file and write it back in-place, with or
>> without an fsync. On the other hand, if you can afford to lose some
>> logs on a crash, not fsyncing/msyncing after each write will also
>> eliminate the fragmentation.
> 
> I was wondering if something could be done in btrfs to improve
> performance under this workload... Something like a "defrag on demand"
> for a case where mostly appends are happening.

Instead of inventing a strategy smarter than the (already smart) filesystem, wouldn't it be simpler to do an explicit defrag?

In any case, this "smart strategy" would be filesystem specific, so it would be simpler (and less error prone) to do an explicit defrag.

I tried this strategy with systemd-journald and got good results (issuing a BTRFS_IOC_DEFRAG ioctl when the journal is opened).

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it)
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16 16:05                 ` Josef Bacik
  2014-06-16 19:52                   ` Martin
@ 2014-06-17 21:12                   ` Lennart Poettering
  1 sibling, 0 replies; 45+ messages in thread
From: Lennart Poettering @ 2014-06-17 21:12 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Russell Coker, kreijack, Duncan, linux-btrfs, systemd-devel

On Mon, 16.06.14 09:05, Josef Bacik (jbacik@fb.com) wrote:

> So you are doing all the right things from what I can tell, I'm just
> a little confused about when you guys run fsync.  From what I can
> tell it's only when you open the journal file and when you switch it
> to "offline."  I didn't look too much past this point so I don't
> know how often these things happen.  Are you taking an individual
> message, writing it, updating the head of the file and then
> fsync'ing?  Or are you getting a good bit of dirty log data and
> fsyncing occasionally?

The latter. Basically when opening a file for writing we mark it in the
header as "online", then fsync() it. When we close a file we fsync() it,
then change the header to "offline", and fsync() again. Also, 5 min after
each write we will also put things to offline, until the next write,
when we will put things to online again. Finally, if something is logged
at priorities EMERG, ALERT or CRIT we will sync immediately (which
actually should never happen in real-life, unless something is really
broken -- a simple way to check if anything like this got written is
"journalctl -p crit").

Also, we rotate and start a new file every now and then, when we hit a
size limit, but that happens very seldom.

Putting this together: we should normally fsync() only very
infrequently. 

> What would cause btrfs problems is if you fallocate(), write a small
> chunk, fsync, write a small chunk again, fsync again etc.  Fallocate
> saves you the first write around, but if the next write is within
> the same block as the previous write we'll end up triggering cow and
> enter fragmented territory.  If this is what is what journald is
> doing then that would be good to know, if not I'd like to know what
> is happening since we shouldn't be fragmenting this badly.

Hmm, the only way I see that happening is if a lot of stuff is
logged at these super-high log levels mentioned above. But then again,
that never really should happen in real-life.

Could anyone who's experiencing the slowdowns have a look at the
journalctl output mentioned above? Do you see more than a few lines
printed like that?

Lennart

-- 
Lennart Poettering, Red Hat

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-15 22:13           ` Lennart Poettering
  2014-06-16  0:17             ` Russell Coker
  2014-06-16 16:32             ` Goffredo Baroncelli
@ 2014-06-19  1:13             ` Dave Chinner
  2 siblings, 0 replies; 45+ messages in thread
From: Dave Chinner @ 2014-06-19  1:13 UTC (permalink / raw)
  To: Lennart Poettering; +Cc: kreijack, Duncan, linux-btrfs, systemd-devel

On Mon, Jun 16, 2014 at 12:13:07AM +0200, Lennart Poettering wrote:
> On Sat, 14.06.14 09:52, Goffredo Baroncelli (kreijack@libero.it) wrote:
> 
> > > Which effectively means that by the time the 8 MiB is filled, each 4 KiB 
> > > block has been rewritten to a new location and is now an extent unto 
> > > itself.  So now that 8 MiB is composed of 2048 new extents, each one a 
> > > single 4 KiB block in size.
> > 
> > Several people pointed at fallocate as the problem. But I don't
> > understand the reason.
> 
> BTW, the reason we use fallocate() in journald is not about trying to
> optimize anything. It's only used for one reason: to avoid SIGBUS on
> disk/quota full, since we actually write everything to the files using
> mmap().

FWIW, fallocate() doesn't absolutely guarantee you that. When at
ENOSPC, a write into that reserved range can still require
un-reserved metadata blocks to be allocated. e.g. splitting a
"reserved" data extent into two extents (used and reserved) requires
an extra btree record, which can cause a split, which can require
allocation. This tends to be pretty damn rare, though, and some
filesystems have reserved block pools specifically for handling this
sort of ENOSPC corner case. Hence, in practice the filesystems
never actually fail with ENOSPC in ranges that have been
fallocate()d.

> I mean, writing things with mmap() is always problematic, and
> handling write errors is awfully difficult, but at least two of the most
> common reasons for failure we'd like protect against in advance, under
> the assumption that disk/quota full will be reported immediately by the
> fallocate(), and the mmap writes later on will then necessarily succeed.
> 
> I am not really following though why this trips up btrfs though. I am
> not sure I understand why this breaks btrfs COW behaviour. I mean,
> fallocate() isn't necessarily supposed to write anything really, it's
> mostly about allocating disk space in advance. I would claim that
> journald's usage of it is very much within the entire reason why it
> exists...
> 
> Anyway, happy to change these things around if necesary, but first I'd
> like to have a very good explanation why fallocate() wouldn't be the
> right thing to invoke here, and a suggestion what we should do instead
> to cover this usecase...

fallocate() of 8MB should be more than sufficient for non-COW
filesystems - 1MB would be enough to prevent performance degradation
due to fragmentation in most cases. The current problems seem to be
with the way btrfs does rewrites, not the use of fallocate() in
systemd.

Thanks for explanation, Lennart.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Slow startup of systemd-journal on BTRFS
  2014-06-12  1:18 ` Russell Coker
@ 2014-06-12  4:39   ` Duncan
  0 siblings, 0 replies; 45+ messages in thread
From: Duncan @ 2014-06-12  4:39 UTC (permalink / raw)
  To: linux-btrfs; +Cc: systemd-devel

Russell Coker posted on Thu, 12 Jun 2014 11:18:37 +1000 as excerpted:

> On Wed, 11 Jun 2014 23:28:54 Goffredo Baroncelli wrote:
>>         https://bugzilla.redhat.com/show_bug.cgi?id=1006386
>> 
>> suggested to me that the problem could be due to a bad interaction between
>> systemd and btrfs; NetworkManager was innocent.  It seems that
>> systemd-journal creates very highly fragmented files when it stores its
>> log, and BTRFS is known to behave slowly when a file is highly
>> fragmented. This caused a slow startup of systemd-journal, which in
>> turn blocked the services that depend on the logging system.
> 
> On my BTRFS/systemd systems I edit /etc/systemd/journald.conf and put
> "SystemMaxUse=50M".  That doesn't solve the fragmentation problem but
> reduces it enough that it doesn't bother me.

FWIW, as a relatively new switcher to systemd, that is, after switching 
to btrfs only a year or so ago...  Two comments:

1) Having seen a few reports of journald's journal fragmentation on this 
list, I was worried about those journals here as well.

My solution to both this problem and to an unrelated frustration with 
journald[1] was to:

a) confine journald to only a volatile (memory-only) log, first on a 
temporary basis while I was only experimenting with and setting up systemd 
(using the kernel command-line's init= to point at systemd while /sbin/
init still pointed at sysv's init for openrc), then later permanently, 
once I got enough of my systemd and journald config setup to actually 
switch to it.

b) configure my former syslogger (syslog-ng, in my case) to continue in 
that role under systemd, with journald relaying to it for non-volatile 
logging.

Here's the /etc/journald.conf changes I ended up with to accomplish (a), 
see the journald.conf(5) manpage for the documentation, as well as the 
below explanation:

Storage=volatile
RuntimeMaxUse=448M
RuntimeKeepFree=48M
RuntimeMaxFileSize=64M

Storage=volatile is the important one.  As the manpage notes, that means 
journald stores files under /run/log/journal only, where /run is normally 
setup by systemd as a tmpfs mount, so these files are tmpfs and thus 
memory-only.

The other three must be read in the context of a 512 MiB /run on tmpfs
[2].  From that and the information in the journald.conf manpage, it 
should be possible to see that my setup is (runtime* settings apply to 
the volatile files under /run):

An individual journal filesize (MaxFileSize) of 64 MiB, with seven such 
files in rotation (the default if MaxFileSize is unset is eight), 
totaling 448 MiB (MaxUse, the default being 10% of the filesystem, too 
small here since the journals are basically the only thing taking 
space).  On a 512 MiB filesystem, that will leave 64 MiB for other uses 
(pretty much all 0-byte lock and pidfiles, IIRC I was running something 
like a 2 MiB /run before systemd without issue).

It's worth noting that UNLIKE MaxUse, which will trigger journal file 
rotation when hit, hitting the KeepFree forces journald to stop 
journaling entirely -- *NOT* just to stop writing them here, but to stop 
forwarding to syslog (syslog-ng here) as well.  I FOUND THIS OUT THE HARD 
WAY!  Thus, in order to keep journald still functional, make sure 
journald runs into the MaxUse limit before it runs into KeepFree.  The 
KeepFree default is 15% of the filesystem, just under 77 MiB on a 512 MiB 
filesystem which is why I found this out the hard way with settings that 
would otherwise keep only 64 MiB free.  The 48 MiB setting I chose leaves 
16 MiB of room for other files before journald shuts down journaling, 
which should be plenty, since under normal circumstances the other files 
should all be 0-byte lock and pidfiles.  Just in case, however, there's 
still 48 MiB of room for other files after journald shuts down, before 
the filesystem itself fills up.

Configuring the syslogger to work with journald is "left as an exercise 
for the reader", as they say, since for all I know the OP is using 
something other than the syslog-ng I'm familiar with anyway.  But if 
hints for syslog-ng are needed too, let me know. =:^)


2) Someone else mentioned btrfs' autodefrag mount-option.  Given #1 above 
I've obviously not had a lot of experience with journald logs and 
autodefrag, but based on all I know about btrfs fragmentation behavior as 
well as journald journal file behavior from this list, as long as 
journald's non-volatile files are kept significantly under 1 GiB and 
preferably under half a GiB each, it shouldn't be a problem, with a 
/possible/ exception if you get something run-away-journaling multiple 
messages a second for a reasonably long period, such that the I/O can't 
keep up with both the journaling and autodefrag.

If you do choose to keep a persistent journal with autodefrag, then, I'd 
recommend journald.conf settings that keep individual journal files to 
perhaps 128 MiB each.  (System* settings apply to the non-volatile files 
under /var, in /var/log/journal/.)

SystemMaxFileSize=128M

AFAIK, that wouldn't affect the total journal size and thus the number of 
journal files, which would remain 10% of the filesystem size by default.

Alternatively, given the default 8-file rotation if MaxFileSize is unset, 
you could limit the total journal size to 1 GiB, for the same 128 MiB 
individual file size.

SystemMaxUse=1G

Of course if you want/need more control set both and/or other settings as 
I did for my volatile-only configuration above.

---
[1] Unrelated journald frustration:  Being a syslog-ng user I've been 
accustomed to being able to pre-filter incoming messages *BEFORE* they 
get written to the files.  This ability has been the important bit of my 
run-away-log coping strategy when I have no direct way to reconfigure the 
source to reduce the rate at which it's spitting out "noise" that's 
otherwise overwhelming my logs.

Unfortunately, while it seems journald has all /sorts/ of file-grep-style 
filtering tools to focus in like a laser on what you want to see AFTER 
the journal is written, I found absolutely NO documentation on setting up 
PREWRITE journal filters (except log-level, and there's global rate-
limiting as well, but I didn't want to use those, I wanted to filter 
specific "noise" messages), which means runaway journaling simply runs 
away, and if I use the size restriction stuff to turn it down, I quickly 
lose the important stuff I want to keep around when the files size-rotate 
due to that runaway!

Thus my solution, keep journald storage volatile only, relatively small 
but still big enough I can use the great systemctl status integration to 
get the last few journal entries from each service, then feed it to 
syslog-ng to pre-write filter the noise out before actual write to 
permanent storage. =:^)

[2] 512 MiB /run tmpfs.  This is on a 16 GiB RAM system, so default tmpfs 
size would be 8 GiB.  But I have several such tmpfs including a big /tmp 
that I use for scratch space when I'm building stuff (gentoo's 
PORTAGE_TMPDIR, and for fully in-memory DVD ISOs too), and I don't have 
swap configured at all, so keeping a reasonable lid on things by 
limiting /run and its major journal-file space user to half a GiB seems 
prudent.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Slow startup of systemd-journal on BTRFS
  2014-06-12  1:37   ` Dave Chinner
@ 2014-06-12  2:32     ` Chris Murphy
  0 siblings, 0 replies; 45+ messages in thread
From: Chris Murphy @ 2014-06-12  2:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: kreijack, systemd Mailing List, linux-btrfs

On Jun 11, 2014, at 7:21 PM, Dave Chinner <david@fromorbit.com> wrote:

> On Wed, Jun 11, 2014 at 11:28:54PM +0200, Goffredo Baroncelli wrote:
>> Hi all,
>> 
>> I would like to share my experience with a slowness of systemd when used on BTRFS.
>> 
>> My boot time was very high (~50 seconds); most of the time was due to NetworkManager, which took about 30-40 seconds to start (this data came from "systemd-analyze plot").
>> 
>> I made several attempts to address this issue. I also noticed that sometimes the problem disappeared, but I was never able to understand why.
>> 
>> However this link
>> 
>> 	https://bugzilla.redhat.com/show_bug.cgi?id=1006386
>> 
>> suggested to me that the problem could be due to a bad interaction between systemd and btrfs; NetworkManager was innocent. 
> 
> systemd has a very stupid journal write pattern. It checks if there
> is space in the file for the write, and if not it fallocates the
> small amount of space it needs (it does *4 byte* fallocate calls!)
> and then does the write to it.  All this does is fragment the crap
> out of the log files because the filesystems cannot optimise the
> allocation patterns.
> 
> Yup, it fragments journal files on XFS, too.
> 
> http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
> 
> IIRC, the systemd developers consider this a filesystem problem and
> so refused to change the systemd code to be nice to the filesystem
> allocators, even though they don't actually need to use fallocate...
> 
> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com

On Jun 11, 2014, at 7:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> BTW, the systemd list is subscriber only, so they aren't going to
> see anything that we comment on from a cross-post to the btrfs list.


Unless a subscriber finds something really interesting, quotes it, and cross posts it.

Chris Murphy

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Slow startup of systemd-journal on BTRFS
  2014-06-12  1:21 ` Dave Chinner
@ 2014-06-12  1:37   ` Dave Chinner
  2014-06-12  2:32     ` Chris Murphy
  0 siblings, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2014-06-12  1:37 UTC (permalink / raw)
  To: kreijack; +Cc: systemd Mailing List, linux-btrfs

On Thu, Jun 12, 2014 at 11:21:04AM +1000, Dave Chinner wrote:
> On Wed, Jun 11, 2014 at 11:28:54PM +0200, Goffredo Baroncelli wrote:
> > Hi all,
> > 
> > I would like to share my experience with a slowness of systemd when used on BTRFS.
> > 
> > My boot time was very high (~50 seconds); most of the time was due to NetworkManager, which took about 30-40 seconds to start (this data came from "systemd-analyze plot").
> > 
> > I made several attempts to address this issue. I also noticed that sometimes the problem disappeared, but I was never able to understand why.
> > 
> > However this link
> > 
> > 	https://bugzilla.redhat.com/show_bug.cgi?id=1006386
> > 
> > suggested to me that the problem could be due to a bad interaction between systemd and btrfs; NetworkManager was innocent. 
> 
> systemd has a very stupid journal write pattern. It checks if there
> is space in the file for the write, and if not it fallocates the
> small amount of space it needs (it does *4 byte* fallocate calls!)
> and then does the write to it.  All this does is fragment the crap
> out of the log files because the filesystems cannot optimise the
> allocation patterns.
> 
> Yup, it fragments journal files on XFS, too.
> 
> http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
> 
> IIRC, the systemd developers consider this a filesystem problem and
> so refused to change the systemd code to be nice to the filesystem
> allocators, even though they don't actually need to use fallocate...

BTW, the systemd list is subscriber only, so they aren't going to
see anything that we comment on from a cross-post to the btrfs list.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Slow startup of systemd-journal on BTRFS
  2014-06-11 21:28 Goffredo Baroncelli
  2014-06-12  0:40 ` Chris Murphy
  2014-06-12  1:18 ` Russell Coker
@ 2014-06-12  1:21 ` Dave Chinner
  2014-06-12  1:37   ` Dave Chinner
  2 siblings, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2014-06-12  1:21 UTC (permalink / raw)
  To: kreijack; +Cc: systemd Mailing List, linux-btrfs

On Wed, Jun 11, 2014 at 11:28:54PM +0200, Goffredo Baroncelli wrote:
> Hi all,
> 
> I would like to share my experience with a slowness of systemd when used on BTRFS.
> 
> My boot time was very high (~50 seconds); most of the time was due to NetworkManager, which took about 30-40 seconds to start (this data came from "systemd-analyze plot").
> 
> I made several attempts to address this issue. I also noticed that sometimes the problem disappeared, but I was never able to understand why.
> 
> However this link
> 
> 	https://bugzilla.redhat.com/show_bug.cgi?id=1006386
> 
> suggested to me that the problem could be due to a bad interaction between systemd and btrfs; NetworkManager was innocent. 

systemd has a very stupid journal write pattern. It checks if there
is space in the file for the write, and if not it fallocates the
small amount of space it needs (it does *4 byte* fallocate calls!)
and then does the write to it.  All this does is fragment the crap
out of the log files because the filesystems cannot optimise the
allocation patterns.

Yup, it fragments journal files on XFS, too.

http://oss.sgi.com/archives/xfs/2014-03/msg00322.html

IIRC, the systemd developers consider this a filesystem problem and
so refused to change the systemd code to be nice to the filesystem
allocators, even though they don't actually need to use fallocate...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Slow startup of systemd-journal on BTRFS
  2014-06-11 21:28 Goffredo Baroncelli
  2014-06-12  0:40 ` Chris Murphy
@ 2014-06-12  1:18 ` Russell Coker
  2014-06-12  4:39   ` Duncan
  2014-06-12  1:21 ` Dave Chinner
  2 siblings, 1 reply; 45+ messages in thread
From: Russell Coker @ 2014-06-12  1:18 UTC (permalink / raw)
  To: kreijack; +Cc: systemd Mailing List, linux-btrfs

On Wed, 11 Jun 2014 23:28:54 Goffredo Baroncelli wrote:
>         https://bugzilla.redhat.com/show_bug.cgi?id=1006386
> 
> suggested to me that the problem could be due to a bad interaction between
> systemd and btrfs; NetworkManager was innocent.  It seems that
> systemd-journal creates very highly fragmented files when it stores its
> log, and BTRFS is known to behave slowly when a file is highly
> fragmented. This caused a slow startup of systemd-journal, which in
> turn blocked the services that depend on the logging system.

On my BTRFS/systemd systems I edit /etc/systemd/journald.conf and put 
"SystemMaxUse=50M".  That doesn't solve the fragmentation problem but reduces 
it enough that it doesn't bother me.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Slow startup of systemd-journal on BTRFS
  2014-06-11 21:28 Goffredo Baroncelli
@ 2014-06-12  0:40 ` Chris Murphy
  2014-06-12  1:18 ` Russell Coker
  2014-06-12  1:21 ` Dave Chinner
  2 siblings, 0 replies; 45+ messages in thread
From: Chris Murphy @ 2014-06-12  0:40 UTC (permalink / raw)
  To: kreijack, Goffredo Baroncelli; +Cc: systemd Mailing List, linux-btrfs


On Jun 11, 2014, at 3:28 PM, Goffredo Baroncelli <kreijack@libero.it> wrote:
> 
> If someone can suggest to me how to FRAGMENT a log file, I can try to collect more scientific data.

So long as you're not using compression, filefrag will show you the fragments of systemd-journald journals. I can vouch for the behavior you're experiencing without xattr +C or autodefrag; beyond that, it also causes much slowness when reading journal contents. For example, if I want to search all boots for a particular error message to see how far back it started, this takes quite a bit longer than on other file systems. So far I'm not experiencing this problem with autodefrag, nor any other negative side effects, but my understanding is that this code is still in flux.

Since the journals have their own checksumming I'm not overly concerned about setting xattr +C.

Chris Murphy

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Slow startup of systemd-journal on BTRFS
@ 2014-06-11 21:28 Goffredo Baroncelli
  2014-06-12  0:40 ` Chris Murphy
                   ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Goffredo Baroncelli @ 2014-06-11 21:28 UTC (permalink / raw)
  To: systemd Mailing List, linux-btrfs

Hi all,

I would like to share my experience with a slowness of systemd when used on BTRFS.

My boot time was very high (~50 seconds); most of the time was due to NetworkManager, which took about 30-40 seconds to start (this data came from "systemd-analyze plot").

I made several attempts to address this issue. I also noticed that sometimes the problem disappeared, but I was never able to understand why.

However this link

	https://bugzilla.redhat.com/show_bug.cgi?id=1006386

suggested to me that the problem could be due to a bad interaction between systemd and btrfs; NetworkManager was innocent. 
It seems that systemd-journal creates very highly fragmented files when it stores its log, and BTRFS is known to behave slowly when a file is highly fragmented.
This caused a slow startup of systemd-journal, which in turn blocked the services that depend on the logging system.

In fact, after I de-fragmented the files under /var/log/journal [*], my boot time decreased by about 20 seconds (from 50s to 30s).

Unfortunately I don't have any data to show. The next time I will try to collect more information. But I am quite sure that when the logs are highly fragmented, systemd-journal becomes very slow on BTRFS.

I don't know whether the problem is more on the systemd side or the btrfs side. What I know is that both projects will likely be important in the near future, and they must work well together.

I know that I can "chattr +C" to avoid COW for some files, but I don't want to also lose the checksum protection. 

If someone can suggest to me how to FRAGMENT a log file, I can try to collect more scientific data.


BR
G.Baroncelli

[*] 
# btrfs fi defrag /var/log/journal/*/*



-- 
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it)
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2014-06-19  1:13 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-12 11:13 R: Re: Slow startup of systemd-journal on BTRFS Goffredo Baroncelli <kreijack@libero.it>
2014-06-12 12:37 ` Duncan
2014-06-12 23:24   ` Dave Chinner
2014-06-13 22:19     ` Goffredo Baroncelli
2014-06-14  2:53       ` Duncan
2014-06-14  7:52         ` Goffredo Baroncelli
2014-06-15  5:43           ` Duncan
2014-06-15 22:39             ` [systemd-devel] " Lennart Poettering
2014-06-15 22:13           ` Lennart Poettering
2014-06-16  0:17             ` Russell Coker
2014-06-16  1:06               ` John Williams
2014-06-16  2:19                 ` Russell Coker
2014-06-16 10:14               ` Lennart Poettering
2014-06-16 10:35                 ` Russell Coker
2014-06-16 11:16                   ` Austin S Hemmelgarn
2014-06-16 11:56                 ` Andrey Borzenkov
2014-06-16 16:05                 ` Josef Bacik
2014-06-16 19:52                   ` Martin
2014-06-16 20:20                     ` Josef Bacik
2014-06-17  0:15                     ` Austin S Hemmelgarn
2014-06-17  1:13                     ` cwillu
2014-06-17 12:24                       ` Martin
2014-06-17 17:56                       ` Chris Murphy
2014-06-17 18:46                       ` Filipe Brandenburger
2014-06-17 19:42                         ` Goffredo Baroncelli
2014-06-17 21:12                   ` Lennart Poettering
2014-06-16 16:32             ` Goffredo Baroncelli
2014-06-16 18:47               ` Goffredo Baroncelli
2014-06-19  1:13             ` Dave Chinner
2014-06-14 10:59         ` Kai Krakow
2014-06-15  5:02           ` Duncan
2014-06-15 11:18             ` Kai Krakow
2014-06-15 21:45           ` Martin Steigerwald
2014-06-15 21:51             ` Hugo Mills
2014-06-15 22:43           ` [systemd-devel] " Lennart Poettering
2014-06-15 21:31         ` Martin Steigerwald
2014-06-15 21:37           ` Hugo Mills
2014-06-17  8:22           ` Duncan
  -- strict thread matches above, loose matches on Subject: below --
2014-06-11 21:28 Goffredo Baroncelli
2014-06-12  0:40 ` Chris Murphy
2014-06-12  1:18 ` Russell Coker
2014-06-12  4:39   ` Duncan
2014-06-12  1:21 ` Dave Chinner
2014-06-12  1:37   ` Dave Chinner
2014-06-12  2:32     ` Chris Murphy
