* Slow startup of systemd-journal on BTRFS
@ 2014-06-11 21:28 Goffredo Baroncelli
  2014-06-12  0:40 ` Chris Murphy
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Goffredo Baroncelli @ 2014-06-11 21:28 UTC (permalink / raw)
  To: systemd Mailing List, linux-btrfs

Hi all,

I would like to share my experience with a slowness of systemd when used on BTRFS.

My boot time was very high (about 50 seconds); most of it was due to NetworkManager, which took about 30-40 seconds to start (this data came from "systemd-analyze plot").

I made several attempts to address this issue. I also noticed that sometimes the problem disappeared, but I was never able to understand why.

However this link

	https://bugzilla.redhat.com/show_bug.cgi?id=1006386

suggested to me that the problem could be due to a bad interaction between systemd and btrfs. NetworkManager was innocent. 
It seems that systemd-journald creates very highly fragmented files when it stores its log. And BTRFS is known to behave slowly when a file is highly fragmented.
This caused a slow startup of systemd-journald, which in turn blocked the services that depend on the logging system.

In fact, after I defragmented the files under /var/log/journal [*], my boot time decreased by about 20 seconds (from 50s to 30s).

Unfortunately I don't have any data to show. Next time I will try to collect more information. But I am quite sure that when the logs are highly fragmented, systemd-journald becomes very slow on BTRFS.

I don't know whether the problem is more on the systemd side or the btrfs side. What I do know is that both projects will likely be important in the near future, and both must work well together.

I know that I can use "chattr +C" to avoid COW for some files; but I don't want to lose the checksum protection either.
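
For reference, applying it would look something like this (a sketch: on
btrfs, +C only takes effect for newly created files, so the flag goes on
the directory and the existing journals have to be recreated under it):

# systemctl stop systemd-journald
# chattr +C /var/log/journal
# mv /var/log/journal/$(cat /etc/machine-id) /var/log/journal/old
# systemctl start systemd-journald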

If someone can suggest to me how to FRAGMENT a log file, I can try to collect more scientific data.


BR
G.Baroncelli

[*] 
# btrfs fi defrag /var/log/journal/*/*



-- 
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it)
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: Slow startup of systemd-journal on BTRFS
  2014-06-11 21:28 Slow startup of systemd-journal on BTRFS Goffredo Baroncelli
@ 2014-06-12  0:40 ` Chris Murphy
  2014-06-12  1:18 ` Russell Coker
  2014-06-12  1:21 ` Dave Chinner
  2 siblings, 0 replies; 21+ messages in thread
From: Chris Murphy @ 2014-06-12  0:40 UTC (permalink / raw)
  To: kreijack, Goffredo Baroncelli; +Cc: systemd Mailing List, linux-btrfs


On Jun 11, 2014, at 3:28 PM, Goffredo Baroncelli <kreijack@libero.it> wrote:
> 
> If someone is able to suggest me how FRAGMENT the log file, I can try to collect more scientific data.

So long as you're not using compression, filefrag will show you fragments of systemd-journald journals. I can vouch for the behavior you experience without xattr +C or autodefrag, but further, it also causes much slowness when reading journal contents. Like if I want to search all boots for a particular error message to see how far back it started, this takes quite a bit longer than on other file systems. So far I'm not experiencing this problem with autodefrag or any other negative side effects, but my understanding is this code is still in flux.
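
For example, something like this (assuming the default persistent journal location):

$ sudo filefrag /var/log/journal/*/system.journal

A badly fragmented journal shows up as thousands of extents on a file of only a few tens of megabytes.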

Since the journals have their own checksumming I'm not overly concerned about setting xattr +C.

Chris Murphy


* Re: Slow startup of systemd-journal on BTRFS
  2014-06-11 21:28 Slow startup of systemd-journal on BTRFS Goffredo Baroncelli
  2014-06-12  0:40 ` Chris Murphy
@ 2014-06-12  1:18 ` Russell Coker
  2014-06-12  4:39   ` Duncan
  2014-06-12  1:21 ` Dave Chinner
  2 siblings, 1 reply; 21+ messages in thread
From: Russell Coker @ 2014-06-12  1:18 UTC (permalink / raw)
  To: kreijack; +Cc: systemd Mailing List, linux-btrfs

On Wed, 11 Jun 2014 23:28:54 Goffredo Baroncelli wrote:
>         https://bugzilla.redhat.com/show_bug.cgi?id=1006386
> 
> suggested to me that the problem could be due to a bad interaction
> between systemd and btrfs. NetworkManager was innocent.  It seems that
> systemd-journald creates very highly fragmented files when it stores its
> log. And BTRFS is known to behave slowly when a file is highly
> fragmented. This caused a slow startup of systemd-journald, which in
> turn blocked the services that depend on the logging system.

On my BTRFS/systemd systems I edit /etc/systemd/journald.conf and put 
"SystemMaxUse=50M".  That doesn't solve the fragmentation problem but reduces 
it enough that it doesn't bother me.
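
That is, in /etc/systemd/journald.conf:

[Journal]
SystemMaxUse=50M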

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/



* Re: Slow startup of systemd-journal on BTRFS
  2014-06-11 21:28 Slow startup of systemd-journal on BTRFS Goffredo Baroncelli
  2014-06-12  0:40 ` Chris Murphy
  2014-06-12  1:18 ` Russell Coker
@ 2014-06-12  1:21 ` Dave Chinner
  2014-06-12  1:37   ` Dave Chinner
  2 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2014-06-12  1:21 UTC (permalink / raw)
  To: kreijack; +Cc: systemd Mailing List, linux-btrfs

On Wed, Jun 11, 2014 at 11:28:54PM +0200, Goffredo Baroncelli wrote:
> Hi all,
> 
> I would like to share my experience with a slowness of systemd when used on BTRFS.
> 
> My boot time was very high (about 50 seconds); most of it was due to NetworkManager, which took about 30-40 seconds to start (this data came from "systemd-analyze plot").
> 
> I made several attempts to address this issue. I also noticed that sometimes the problem disappeared, but I was never able to understand why.
> 
> However this link
> 
> 	https://bugzilla.redhat.com/show_bug.cgi?id=1006386
> 
> suggested to me that the problem could be due to a bad interaction between systemd and btrfs. NetworkManager was innocent. 

systemd has a very stupid journal write pattern. It checks if there
is space in the file for the write, and if not it fallocates the
small amount of space it needs (it does *4 byte* fallocate calls!)
and then does the write to it.  All this does is fragment the crap
out of the log files because the filesystems cannot optimise the
allocation patterns.
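
You can watch it happen on a live system, e.g. (a sketch, assuming strace
is installed and journald is running):

# strace -e trace=fallocate -p $(pidof systemd-journald)

Each small fallocate() extending a journal file will show up in the trace
as the log grows.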

Yup, it fragments journal files on XFS, too.

http://oss.sgi.com/archives/xfs/2014-03/msg00322.html

IIRC, the systemd developers consider this a filesystem problem and
so refused to change the systemd code to be nice to the filesystem
allocators, even though they don't actually need to use fallocate...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: Slow startup of systemd-journal on BTRFS
  2014-06-12  1:21 ` Dave Chinner
@ 2014-06-12  1:37   ` Dave Chinner
  2014-06-12  2:32     ` Chris Murphy
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2014-06-12  1:37 UTC (permalink / raw)
  To: kreijack; +Cc: systemd Mailing List, linux-btrfs

On Thu, Jun 12, 2014 at 11:21:04AM +1000, Dave Chinner wrote:
> On Wed, Jun 11, 2014 at 11:28:54PM +0200, Goffredo Baroncelli wrote:
> > Hi all,
> > 
> > I would like to share my experience with a slowness of systemd when used on BTRFS.
> > 
> > My boot time was very high (about 50 seconds); most of it was due to NetworkManager, which took about 30-40 seconds to start (this data came from "systemd-analyze plot").
> > 
> > I made several attempts to address this issue. I also noticed that sometimes the problem disappeared, but I was never able to understand why.
> > 
> > However this link
> > 
> > 	https://bugzilla.redhat.com/show_bug.cgi?id=1006386
> > 
> > suggested to me that the problem could be due to a bad interaction between systemd and btrfs. NetworkManager was innocent. 
> 
> systemd has a very stupid journal write pattern. It checks if there
> is space in the file for the write, and if not it fallocates the
> small amount of space it needs (it does *4 byte* fallocate calls!)
> and then does the write to it.  All this does is fragment the crap
> out of the log files because the filesystems cannot optimise the
> allocation patterns.
> 
> Yup, it fragments journal files on XFS, too.
> 
> http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
> 
> IIRC, the systemd developers consider this a filesystem problem and
> so refused to change the systemd code to be nice to the filesystem
> allocators, even though they don't actually need to use fallocate...

BTW, the systemd list is subscriber only, so they aren't going to
see anything that we comment on from a cross-post to the btrfs list.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Slow startup of systemd-journal on BTRFS
  2014-06-12  1:37   ` Dave Chinner
@ 2014-06-12  2:32     ` Chris Murphy
  2014-06-15 22:34       ` [systemd-devel] " Lennart Poettering
  0 siblings, 1 reply; 21+ messages in thread
From: Chris Murphy @ 2014-06-12  2:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: kreijack, systemd Mailing List, linux-btrfs

On Jun 11, 2014, at 7:21 PM, Dave Chinner <david@fromorbit.com> wrote:

> On Wed, Jun 11, 2014 at 11:28:54PM +0200, Goffredo Baroncelli wrote:
>> Hi all,
>> 
>> I would like to share my experience with a slowness of systemd when used on BTRFS.
>> 
>> My boot time was very high (about 50 seconds); most of it was due to NetworkManager, which took about 30-40 seconds to start (this data came from "systemd-analyze plot").
>> 
>> I made several attempts to address this issue. I also noticed that sometimes the problem disappeared, but I was never able to understand why.
>> 
>> However this link
>> 
>> 	https://bugzilla.redhat.com/show_bug.cgi?id=1006386
>> 
>> suggested to me that the problem could be due to a bad interaction between systemd and btrfs. NetworkManager was innocent. 
> 
> systemd has a very stupid journal write pattern. It checks if there
> is space in the file for the write, and if not it fallocates the
> small amount of space it needs (it does *4 byte* fallocate calls!)
> and then does the write to it.  All this does is fragment the crap
> out of the log files because the filesystems cannot optimise the
> allocation patterns.
> 
> Yup, it fragments journal files on XFS, too.
> 
> http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
> 
> IIRC, the systemd developers consider this a filesystem problem and
> so refused to change the systemd code to be nice to the filesystem
> allocators, even though they don't actually need to use fallocate...
> 
> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com

On Jun 11, 2014, at 7:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> BTW, the systemd list is subscriber only, so they aren't going to
> see anything that we comment on from a cross-post to the btrfs list.


Unless a subscriber finds something really interesting, quotes it, and cross posts it.

Chris Murphy


* Re: Slow startup of systemd-journal on BTRFS
  2014-06-12  1:18 ` Russell Coker
@ 2014-06-12  4:39   ` Duncan
  0 siblings, 0 replies; 21+ messages in thread
From: Duncan @ 2014-06-12  4:39 UTC (permalink / raw)
  To: linux-btrfs; +Cc: systemd-devel

Russell Coker posted on Thu, 12 Jun 2014 11:18:37 +1000 as excerpted:

> On Wed, 11 Jun 2014 23:28:54 Goffredo Baroncelli wrote:
>>         https://bugzilla.redhat.com/show_bug.cgi?id=1006386
>> 
>> suggested to me that the problem could be due to a bad interaction
>> between systemd and btrfs. NetworkManager was innocent.  It seems that
>> systemd-journald creates very highly fragmented files when it stores
>> its log. And BTRFS is known to behave slowly when a file is highly
>> fragmented. This caused a slow startup of systemd-journald, which in
>> turn blocked the services that depend on the logging system.
> 
> On my BTRFS/systemd systems I edit /etc/systemd/journald.conf and put
> "SystemMaxUse=50M".  That doesn't solve the fragmentation problem but
> reduces it enough that it doesn't bother me.

FWIW, as a relatively new switcher to systemd (and that only after 
switching to btrfs a year or so ago)...  Two comments:

1) Having seen a few reports of journald's journal fragmentation on this 
list, I was worried about those journals here as well.

My solution to both this problem and to an unrelated frustration with 
journald[1] was to:

a) confine journald to only a volatile (memory-only) log, first on a 
temporary basis while I was only experimenting with and setting up systemd 
(using the kernel command-line's init= to point at systemd while /sbin/
init still pointed at sysv's init for openrc), then later permanently, 
once I got enough of my systemd and journald config setup to actually 
switch to it.

b) configure my former syslogger (syslog-ng, in my case) to continue in 
that role under systemd, with journald relaying to it for non-volatile 
logging.

Here are the /etc/systemd/journald.conf changes I ended up with to 
accomplish (a); see the journald.conf(5) manpage for the documentation, 
as well as the explanation below:

Storage=volatile
RuntimeMaxUse=448M
RuntimeKeepFree=48M
RuntimeMaxFileSize=64M

Storage=volatile is the important one.  As the manpage notes, that means 
journald stores files under /run/log/journal only, where /run is normally 
set up by systemd as a tmpfs mount, so these files are tmpfs and thus 
memory-only.

The other three must be read in the context of a 512 MiB /run on tmpfs
[2].  From that and the information in the journald.conf manpage, it 
should be possible to see that my setup is (runtime* settings apply to 
the volatile files under /run):

An individual journal filesize (MaxFileSize) of 64 MiB, with seven such 
files in rotation (the default if MaxFileSize is unset is eight), 
totaling 448 MiB (MaxUse, the default being 10% of the filesystem, too 
small here since the journals are basically the only thing taking 
space).  On a 512 MiB filesystem, that will leave 64 MiB for other uses 
(pretty much all 0-byte lock and pidfiles, IIRC I was running something 
like a 2 MiB /run before systemd without issue).

It's worth noting that UNLIKE MaxUse, which will trigger journal file 
rotation when hit, hitting the KeepFree limit forces journald to stop 
journaling entirely -- *NOT* just to stop writing them here, but to stop 
forwarding to syslog (syslog-ng here) as well.  I FOUND THIS OUT THE HARD 
WAY!  Thus, in order to keep journald still functional, make sure 
journald runs into the MaxUse limit before it runs into KeepFree.  The 
KeepFree default is 15% of the filesystem, just under 77 MiB on a 512 MiB 
filesystem which is why I found this out the hard way with settings that 
would otherwise keep only 64 MiB free.  The 48 MiB setting I chose leaves 
16 MiB of room for other files before journald shuts down journaling, 
which should be plenty, since under normal circumstances the other files 
should all be 0-byte lock and pidfiles.  Just in case, however, there's 
still 48 MiB of room for other files after journald shuts down, before 
the filesystem itself fills up.

Configuring the syslogger to work with journald is "left as an exercise 
for the reader", as they say, since for all I know the OP is using 
something other than the syslog-ng I'm familiar with anyway.  But if 
hints for syslog-ng are needed too, let me know. =:^)
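
To make (b) concrete anyway: the journald half is just one more 
journald.conf line, while the syslog-ng half below is only a sketch of 
the usual system() source, not necessarily my exact config:

ForwardToSyslog=yes

and in syslog-ng.conf:

source s_src { system(); internal(); };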


2) Someone else mentioned btrfs' autodefrag mount-option.  Given #1 above 
I've obviously not had a lot of experience with journald logs and 
autodefrag, but based on all I know about btrfs fragmentation behavior as 
well as journald journal file behavior from this list, as long as 
journald's non-volatile files are kept significantly under 1 GiB and 
preferably under half a GiB each, it shouldn't be a problem, with a 
/possible/ exception if you get something run-away-journaling multiple 
messages a second for a reasonably long period, such that the I/O can't 
keep up with both the journaling and autodefrag.

If you do choose to keep a persistent journal with autodefrag, then, I'd 
recommend journald.conf settings that keep individual journal files to 
perhaps 128 MiB each.  (System* settings apply to the non-volatile files 
under /var, in /var/log/journal/.)

SystemMaxFileSize=128M

AFAIK, that wouldn't affect the total journal size and thus the number of 
journal files, which would remain 10% of the filesystem size by default.

Alternatively, given the default 8-file rotation if MaxFileSize is unset, 
you could limit the total journal size to 1 GiB, for the same 128 MiB 
individual file size.

SystemMaxUse=1G

Of course if you want/need more control set both and/or other settings as 
I did for my volatile-only configuration above.
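
Taken together, a minimal persistent-journal equivalent of my volatile 
config above would then be:

SystemMaxFileSize=128M
SystemMaxUse=1G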

---
[1] Unrelated journald frustration:  Being a syslog-ng user I've been 
accustomed to being able to pre-filter incoming messages *BEFORE* they 
get written to the files.  This ability has been the important bit of my 
run-away-log coping strategy when I have no direct way to reconfigure the 
source to reduce the rate at which it's spitting out "noise" that's 
otherwise overwhelming my logs.

Unfortunately, while it seems journald has all /sorts/ of file-grep-style 
filtering tools to focus in like a laser on what you want to see AFTER 
the journal is written, I found absolutely NO documentation on setting up 
PREWRITE journal filters (except log-level, and there's global rate-
limiting as well, but I didn't want to use those, I wanted to filter 
specific "noise" messages), which means runaway journaling simply runs 
away, and if I use the size restriction stuff to turn it down, I quickly 
lose the important stuff I want to keep around when the files size-rotate 
due to that runaway!

Thus my solution: keep journald storage volatile only, relatively small 
but still big enough that I can use the great systemctl status integration 
to get the last few journal entries from each service, then feed it to 
syslog-ng to pre-write-filter the noise out before the actual write to 
permanent storage. =:^)

[2] 512 MiB /run tmpfs.  This is on a 16 GiB RAM system, so default tmpfs 
size would be 8 GiB.  But I have several such tmpfs including a big /tmp 
that I use for scratch space when I'm building stuff (gentoo's 
PORTAGE_TMPDIR, and for fully in-memory DVD ISOs too), and I don't have 
swap configured at all, so keeping a reasonable lid on things by 
limiting /run and its major journal-file space user to half a GiB seems 
prudent.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-12  2:32     ` Chris Murphy
@ 2014-06-15 22:34       ` Lennart Poettering
  2014-06-16  4:01         ` Chris Murphy
  0 siblings, 1 reply; 21+ messages in thread
From: Lennart Poettering @ 2014-06-15 22:34 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Dave Chinner, kreijack, systemd Mailing List, linux-btrfs

On Wed, 11.06.14 20:32, Chris Murphy (lists@colorremedies.com) wrote:

> > systemd has a very stupid journal write pattern. It checks if there
> > is space in the file for the write, and if not it fallocates the
> > small amount of space it needs (it does *4 byte* fallocate calls!)

Not really the case. 

http://cgit.freedesktop.org/systemd/systemd/tree/src/journal/journal-file.c#n354

We allocate 8 MB at minimum.

> > and then does the write to it.  All this does is fragment the crap
> > out of the log files because the filesystems cannot optimise the
> > allocation patterns.

Well, it would be good if you'd tell me what to do instead...

I am invoking fallocate() in advance, because we write those files with
mmap() and that of course would normally trigger SIGBUS already for the
most boring of reasons, such as disk full/quota full or so. Hence,
before we do anything like that, we invoke fallocate() to ensure that
the space is actually available... As far as I can see, that's pretty
much in line with what fallocate() is supposed to be useful for; the man
page says this explicitly:

     "...After a successful call to posix_fallocate(), subsequent writes
      to bytes in the specified range are guaranteed not to fail because
      of lack of disk space."

Happy to be informed that the man page is wrong. 

I am also happy to change our code, if it really is the wrong thing to
do. Note however that I generally favour correctness and relying on
documented behaviour, instead of nebulous optimizations whose effects
might change with different file systems or kernel versions...
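
Reduced to a sketch, the pattern in question is simply this (Python 3
here for brevity; journald does the equivalent in C, see the
journal-file.c link above):

import mmap, os

CHUNK = 8 * 1024 * 1024                  # the 8 MB minimum mentioned above

fd = os.open('/tmp/journal-sketch', os.O_RDWR | os.O_CREAT, 0o644)
os.posix_fallocate(fd, 0, CHUNK)         # reserve the space up front; later
                                         # writes in this range cannot ENOSPC
m = mmap.mmap(fd, CHUNK)                 # entries are appended via the mapping
m[0:5] = b'entry'                        # without the fallocate, this page
                                         # fault could SIGBUS on a full disk
m.flush()
m.close()
os.close(fd)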

> > Yup, it fragments journal files on XFS, too.
> > 
> > http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
> > 
> > IIRC, the systemd developers consider this a filesystem problem and
> > so refused to change the systemd code to be nice to the filesystem
> > allocators, even though they don't actually need to use fallocate...

What? No need to be a dick. Nobody ever pinged me about this. And yeah, I
think I have a very good reason to use fallocate(). The only reason, in
fact, that the man page explicitly mentions.

Lennart

-- 
Lennart Poettering, Red Hat


* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-15 22:34       ` [systemd-devel] " Lennart Poettering
@ 2014-06-16  4:01         ` Chris Murphy
  2014-06-16  4:38           ` cwillu
  0 siblings, 1 reply; 21+ messages in thread
From: Chris Murphy @ 2014-06-16  4:01 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Dave Chinner, kreijack, systemd Mailing List, linux-btrfs


On Jun 15, 2014, at 4:34 PM, Lennart Poettering <lennart@poettering.net> wrote:

> On Wed, 11.06.14 20:32, Chris Murphy (lists@colorremedies.com) wrote:
> 
>>> systemd has a very stupid journal write pattern. It checks if there
>>> is space in the file for the write, and if not it fallocates the
>>> small amount of space it needs (it does *4 byte* fallocate calls!)
> 
> Not really the case. 
> 
> http://cgit.freedesktop.org/systemd/systemd/tree/src/journal/journal-file.c#n354
> 
> We allocate 8 MB at minimum.
> 
>>> and then does the write to it.  All this does is fragment the crap
>>> out of the log files because the filesystems cannot optimise the
>>> allocation patterns.
> 
> Well, it would be good if you'd tell me what to do instead...
> 
> I am invoking fallocate() in advance, because we write those files with
> mmap() and that of course would normally trigger SIGBUS already for the
> most boring of reasons, such as disk full/quota full or so. Hence,
> before we do anything like that, we invoke fallocate() to ensure that
> the space is actually available... As far as I can see, that's pretty
> much in line with what fallocate() is supposed to be useful for; the man
> page says this explicitly:
> 
>     "...After a successful call to posix_fallocate(), subsequent writes
>      to bytes in the specified range are guaranteed not to fail because
>      of lack of disk space."
> 
> Happy to be informed that the man page is wrong. 
> 
> I am also happy to change our code, if it really is the wrong thing to
> do. Note however that I generally favour correctness and relying on
> documented behaviour, instead of nebulous optimizations whose effects
> might change with different file systems or kernel versions...
> 
>>> Yup, it fragments journal files on XFS, too.
>>> 
>>> http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
>>> 
>>> IIRC, the systemd developers consider this a filesystem problem and
>>> so refused to change the systemd code to be nice to the filesystem
>>> allocators, even though they don't actually need to use fallocate...
> 
> What? No need to be a dick. Nobody ever pinged me about this. And yeah, I
> think I have a very good reason to use fallocate(). The only reason, in
> fact, that the man page explicitly mentions.
> 
> Lennart

For what it's worth, I did not write what is attributed to me above.  I was quoting Dave Chinner, and I've confirmed the original attribution correctly made it onto the systemd-devel@ list.

I don't know whether some people on this distribution list are even subscribed to systemd-devel@ so those subsequent responses aren't likely being posted to systemd-devel@ but rather to linux-btrfs@.


Chris Murphy


* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16  4:01         ` Chris Murphy
@ 2014-06-16  4:38           ` cwillu
  0 siblings, 0 replies; 21+ messages in thread
From: cwillu @ 2014-06-16  4:38 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Lennart Poettering, Dave Chinner, kreijack, systemd Mailing List,
	linux-btrfs

Fallocate is a red herring except insofar as it's a hint that btrfs
isn't making much use of: you see the same behaviour with small writes
to an mmap'ed file that's msync'ed after each write, and likewise with
plain old appending small writes with an fsync after each write, with
or without fallocating the file first.  Looking at the fiemap output
while doing either of those, you'll see a new 4k extent being made,
and then the physical location of that extent will increment until the
writes move on to the next 4k extent.
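
(No standalone fiemap tool handy? "filefrag -v /tmp/test" from e2fsprogs
prints essentially the same logical/physical extent table.)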

cwillu@cwillu-home:~/work/btrfs/e2fs$ touch /tmp/test

>>> import mmap
>>> size = 0x800000   # 8 MiB, matching the fiemap length below
>>> f = open('/tmp/test', 'r+')
>>> m = mmap.mmap(f.fileno(), size)   # assumes /tmp/test is already size bytes long
>>> for x in xrange(size):
...   m[x] = " "
...   m.flush(x / 4096 * 4096, 4096)   # msync(self->data + offset,
size, MS_SYNC)) {

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 800000
fs_ioc_fiemap 3223348747d
File /tmp/test has 3 extents:
#	Logical          Physical         Length           Flags
0:	0000000000000000 0000000b3d9c0000 0000000000001000 0000
1:	0000000000001000 000000069f012000 00000000003ff000 0000
2:	0000000000400000 0000000b419d1000 0000000000400000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b3daf3000 0000000000001000 0000
1:	0000000000001000 000000069f012000 00000000003ff000 0000
2:	0000000000400000 0000000b419d1000 0000000000400000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b3dc38000 0000000000001000 0000
1:	0000000000001000 000000069f012000 00000000003ff000 0000
2:	0000000000400000 0000000b419d1000 0000000000400000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b3dc9f000 0000000000001000 0000
1:	0000000000001000 0000000b3d2b7000 0000000000001000 0000
2:	0000000000002000 000000069f013000 00000000003fe000 0000
3:	0000000000400000 0000000b419d1000 0000000000400000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b3dc9f000 0000000000001000 0000
1:	0000000000001000 0000000b3d424000 0000000000001000 0000
2:	0000000000002000 000000069f013000 00000000003fe000 0000
3:	0000000000400000 0000000b419d1000 0000000000400000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b3dc9f000 0000000000001000 0000
1:	0000000000001000 0000000b3d563000 0000000000001000 0000
2:	0000000000002000 000000069f013000 00000000003fe000 0000
3:	0000000000400000 0000000b419d1000 0000000000400000 0001

========

cwillu@cwillu-home:~/work/btrfs/e2fs$ rm /tmp/test
cwillu@cwillu-home:~/work/btrfs/e2fs$ touch /tmp/test

>>> import mmap
>>> size = 0x800000   # 8 MiB, as above
>>> f = open('/tmp/test', 'r+')
>>> f.truncate(size)
>>> m = mmap.mmap(f.fileno(), size)
>>> for x in xrange(size):
...   m[x] = " "
...   m.flush(x / 4096 * 4096, 4096)   # msync(self->data + offset,
size, MS_SYNC)) {

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 800000
fs_ioc_fiemap 3223348747d
File /tmp/test has 1 extents:
#	Logical          Physical         Length           Flags
0:	0000000000000000 0000000b47f11000 0000000000001000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b48006000 0000000000001000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b48183000 0000000000001000 0000
1:	0000000000001000 0000000b48255000 0000000000001000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b48183000 0000000000001000 0000
1:	0000000000001000 0000000b48353000 0000000000001000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b48183000 0000000000001000 0000
1:	0000000000001000 0000000b493ed000 0000000000001000 0000
2:	0000000000002000 0000000b4a68f000 0000000000001000 0000
3:	0000000000003000 0000000b4b36f000 0000000000001000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b48183000 0000000000001000 0000
1:	0000000000001000 0000000b493ed000 0000000000001000 0000
2:	0000000000002000 0000000b4a68f000 0000000000001000 0000
3:	0000000000003000 0000000b4b4cf000 0000000000001000 0001

========

cwillu@cwillu-home:~/work/btrfs/e2fs$ rm /tmp/test
cwillu@cwillu-home:~/work/btrfs/e2fs$ touch /tmp/test

>>> import os
>>> size = 0x800000   # 8 MiB, as above
>>> f = open('/tmp/test', 'r+')
>>> for x in xrange(size):
...   f.write(' ')
...   f.flush()
...   os.fdatasync(f.fileno())

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 395
fs_ioc_fiemap 3223348747d
File /tmp/test has 1 extents:
#	Logical          Physical         Length           Flags
0:	0000000000000000 0000000000000000 0000000000001000 0301

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 5b5
0:	0000000000000000 0000000000000000 0000000000001000 0301

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 1d61
0:	0000000000000000 0000000b4d2bc000 0000000000001000 0000
1:	0000000000001000 0000000b4e1c5000 0000000000001000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 1e8c
0:	0000000000000000 0000000b4d2bc000 0000000000001000 0000
1:	0000000000001000 0000000b4e334000 0000000000001000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 207c
0:	0000000000000000 0000000b4d2bc000 0000000000001000 0000
1:	0000000000001000 0000000b4e4a9000 0000000000001000 0000
2:	0000000000002000 0000000b4e528000 0000000000001000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 21be
0:	0000000000000000 0000000b4d2bc000 0000000000001000 0000
1:	0000000000001000 0000000b4e4a9000 0000000000001000 0000
2:	0000000000002000 0000000b4e66c000 0000000000001000 0001


* Re: Slow startup of systemd-journal on BTRFS
  2014-06-15 21:31         ` Martin Steigerwald
  2014-06-15 21:37           ` Hugo Mills
@ 2014-06-17  8:22           ` Duncan
  1 sibling, 0 replies; 21+ messages in thread
From: Duncan @ 2014-06-17  8:22 UTC (permalink / raw)
  To: linux-btrfs; +Cc: systemd-devel

Martin Steigerwald posted on Sun, 15 Jun 2014 23:31:07 +0200 as excerpted:

> I always thought that the whole point of fallocate is that it *doesn't*
> write out anything, but just reserves the space. Thus I don't see how
> COW can have any adverse effect here.

Tying up loose ends...  I was wrong on fallocate being the trigger, tho I 
did call correctly on fragmentation as the result, with NOCOW and defrag 
as possible workarounds.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Slow startup of systemd-journal on BTRFS
  2014-06-15 21:45           ` Martin Steigerwald
@ 2014-06-15 21:51             ` Hugo Mills
  0 siblings, 0 replies; 21+ messages in thread
From: Hugo Mills @ 2014-06-15 21:51 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Kai Krakow, linux-btrfs, systemd-devel

[-- Attachment #1: Type: text/plain, Size: 12016 bytes --]

On Sun, Jun 15, 2014 at 11:45:37PM +0200, Martin Steigerwald wrote:
> On Saturday, 14 June 2014 12:59:31, Kai Krakow wrote:
> > Well, how did I accomplish that?
> 
> Setting no cow and defragmenting regularly?
> 
> Quite a complex setup for a casual Linux user.
> 
> Any solution should be automatic. I'd suggest a combination of sane
> application behaviour and measures within the filesystem.
> 
> > First, I've set the journal directories nocow. Of course, systemd should do 
> > this by default. I'm not sure if this is a packaging or systemd code issue,
> > tho. But I think the systemd devs agree that for a cow fs, the
> > journal directories should be set nocow. After all, the journal is a
> > transactional database - it does not need cow protection at all costs. And
> > I think they have their own checksumming protection. So, why let systemd
> > bother with that? A lot of other software has the same semantic problems
> > with btrfs, too (e.g. MySQL), where nobody shouts about the "inabilities" of
> > the programmers. So why for systemd? Just because it's intrusive by its
> > nature for being a radically and newly designed init system and thus
> > requires some learning by its users/admins/packagers? Really? Come on...
> > As admin and/or packager you have to stay with current technologies and
> > developments anyway. It's only important to hide the details from the users.
> 
> But nocow also disables the possibility to snapshot these files, AFAIK.

   No, it's the other way around: snapshots break the nodatacow
behaviour. With nodatacow set, taking a snapshot will allow exactly
one CoW operation on the file data (preserving the original extent for
the snapshot to use), and then revert to nodatacow on the
newly-written extent. So fragmentation will still occur, after every
snapshot. Ideally, we'd use proper CoW rather than the RoW (redirect
on write) that we actually use, so that the _original_ is maintained
without fragmentation, and the copy fragments, but the performance for
that sucks (it uses twice the write bandwidth).
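
   A quick way to see that single CoW for yourself (a sketch; assumes
the current directory is on a btrfs subvolume):

# touch f; chattr +C f                  # +C works here because f is empty
# dd if=/dev/zero of=f bs=1M count=8 conv=notrunc
# btrfs subvolume snapshot . snap
# dd if=/dev/urandom of=f bs=4k count=1 conv=notrunc,fsync
# filefrag f                            # the overwritten 4k block now sits
                                        # in its own, relocated extent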

   Hugo.

> Akonadi
> does this for its database directory as well for new installs. But I am not
> completely happy with that approach.
> 
> And I would say that the following differing results are caused by
> specific application behavior – my bet would be rsyslog sequentially
> appending to the file while systemd uses the journal files more like a
> database, and databases fragment heavily on BTRFS due to COW:
> 
> 
> merkaba:/var/log/journal/1354039e4d4bb8de4f97ac8400000004> filefrag *
> system@0004e2025580f3c5-c625739d3033b738.journal~: 1 extent found
> system@0004f94fbee5f4d1-1cfb9bed12a79bde.journal~: 2771 extents found
> system@0004f98240efba02-b465b39a7ed0bdbe.journal~: 2534 extents found
> system@0004f9ff739ad927-4650a2ca62bf9378.journal~: 4951 extents found
> system@0004fa17feb65603-a7597828f9823e38.journal~: 1244 extents found
> system@0004fa3f0e96b653-d6f9d5795c9ef869.journal~: 1419 extents found
> system@0004fa4c448b8a95-c95a3b7950fd704d.journal~: 1511 extents found
> system@0004fa5dddb554e0-33c319ebb5f8100f.journal~: 1729 extents found
> system@0004fad81852c750-20a36082c6006c8a.journal~: 10257 extents found
> system@0004fad970b56567-44bf4a94314792fc.journal~: 932 extents found
> system@0004fb128307b981-7f2104da8b2c9fb2.journal~: 6757 extents found
> system@0004fb1c3eef86c3-8adbea51a1c98729.journal~: 1498 extents found
> system@0004fb1c419fb301-7303f7fd9165ed26.journal~: 19 extents found
> system@0004fb44feafbafd-b7433e90b1d3d718.journal~: 2265 extents found
> system@0004fb6ddf63e4d3-b40e8f4701670bff.journal~: 1894 extents found
> system@0004fb6e412c3f7e-3890be9c2119a7bb.journal~: 1038 extents found
> system@54803afb1b1d42b387822c56e61bc168-000000000002b364-0004e1c4aabbeefa.journal: 2 extents found
> system@54803afb1b1d42b387822c56e61bc168-000000000002b365-0004e1c4ab5b0246.journal: 1 extent found
> system@bd18e6867b824ba7b572de31531218a9-0000000000000001-0004e202555a0493.journal: 3232 extents found
> system.journal: 3855 extents found
> user-1000@0004e202588086cf-3d4130a580b63101.journal~: 1 extent found
> user-1000@0004f9ff74415375-6b5dd1c3d76b09ce.journal~: 1046 extents found
> user-1000@0004fa17ff12ef0c-34297fcc8c06dd4b.journal~: 96 extents found
> user-1000@0004fa3f0eee8e41-dfb1f54bd31e4967.journal~: 84 extents found
> user-1000@0004fa4c475a0a63-d8badb620094bce8.journal~: 173 extents found
> user-1000@0004fa5de0c650e0-522069bac82c754e.journal~: 319 extents found
> user-1000@0004fad818a7a980-593b8f3971b2e697.journal~: 2465 extents found
> user-1000@0004fad97160c3ad-552b27f891e7a24e.journal~: 106 extents found
> user-1000@0004fb1283616e7e-1fbca0bef31bd92b.journal~: 283 extents found
> user-1000@0004fb6de033b269-018b4cbc2b1f319b.journal~: 874 extents found
> user-1000@bd18e6867b824ba7b572de31531218a9-00000000000007c7-0004e2025881a1ee.journal: 293 extents found
> user-1000.journal: 4663 extents found
> user-120.journal: 5 extents found
> user-2012@0004fa4c02142cad-f97563ed0105bfb3.journal~: 749 extents found
> user-2012@0004fa4d8255b2df-43248028d422ca78.journal~: 29 extents found
> user-2012@0004fa5de40372db-d1f3c6428ddeec22.journal~: 122 extents found
> user-2012@0004fad81b8cd9d8-ed2861a9fa1b163c.journal~: 575 extents found
> user-2012@0004fad980139d4e-94ad07f4a8fae3cc.journal~: 25 extents found
> user-2012@0004fb160f2d5334-99462eb429f4cb7b.journal~: 416 extents found
> user-2012@54803afb1b1d42b387822c56e61bc168-0000000000011c75-0004ddb2be06d876.journal: 2 extents found
> user-2012.journal: 453 extents found
> user-65534@0004fa4c62bf4a71-6b4c53dfc06dd588.journal~: 46 extents found
> user-65534.journal: 91 extents found
> merkaba:/var/log/journal/1354039e4d4bb8de4f97ac8400000004> cd ..
> merkaba:/var/log/journal> cd ..
> 
> 
> 
> 
> merkaba:/var/log/journal/1354039e4d4bb8de4f97ac8400000004> ls -lh
> total 495M
> -rw-r-----+ 1 root root            6,8M Jul 21  2013 system@0004e2025580f3c5-c625739d3033b738.journal~
> -rw-r-----+ 1 root systemd-journal  14M Mai 14 00:40 system@0004f94fbee5f4d1-1cfb9bed12a79bde.journal~
> -rw-r-----+ 1 root systemd-journal  14M Mai 16 12:55 system@0004f98240efba02-b465b39a7ed0bdbe.journal~
> -rw-r-----+ 1 root systemd-journal  26M Mai 22 18:17 system@0004f9ff739ad927-4650a2ca62bf9378.journal~
> -rw-r-----+ 1 root systemd-journal 6,3M Mai 23 23:34 system@0004fa17feb65603-a7597828f9823e38.journal~
> -rw-r-----+ 1 root systemd-journal 6,6M Mai 25 22:10 system@0004fa3f0e96b653-d6f9d5795c9ef869.journal~
> -rw-r-----+ 1 root systemd-journal 7,7M Mai 26 13:56 system@0004fa4c448b8a95-c95a3b7950fd704d.journal~
> -rw-r-----+ 1 root systemd-journal 9,7M Mai 27 10:56 system@0004fa5dddb554e0-33c319ebb5f8100f.journal~
> -rw-r-----+ 1 root systemd-journal  50M Jun  2 12:45 system@0004fad81852c750-20a36082c6006c8a.journal~
> -rw-r-----+ 1 root systemd-journal 5,2M Jun  2 14:21 system@0004fad970b56567-44bf4a94314792fc.journal~
> -rw-r-----+ 1 root systemd-journal  34M Jun  5 10:27 system@0004fb128307b981-7f2104da8b2c9fb2.journal~
> -rw-r-----+ 1 root systemd-journal 8,4M Jun  5 22:03 system@0004fb1c3eef86c3-8adbea51a1c98729.journal~
> -rw-r-----+ 1 root systemd-journal 3,7M Jun  5 22:04 system@0004fb1c419fb301-7303f7fd9165ed26.journal~
> -rw-r-----+ 1 root systemd-journal  12M Jun  7 22:40 system@0004fb44feafbafd-b7433e90b1d3d718.journal~
> -rw-r-----+ 1 root systemd-journal  11M Jun  9 23:27 system@0004fb6ddf63e4d3-b40e8f4701670bff.journal~
> -rw-r-----+ 1 root systemd-journal 6,3M Jun  9 23:54 system@0004fb6e412c3f7e-3890be9c2119a7bb.journal~
> -rw-r-----+ 1 root adm             128M Jul 18  2013 system@54803afb1b1d42b387822c56e61bc168-000000000002b364-0004e1c4aabbeefa.journal
> -rw-r-----+ 1 root root            7,4M Jul 20  2013 system@54803afb1b1d42b387822c56e61bc168-000000000002b365-0004e1c4ab5b0246.journal
> -rw-r-----+ 1 root systemd-journal  23M Mai 11 10:21 system@bd18e6867b824ba7b572de31531218a9-0000000000000001-0004e202555a0493.journal
> -rw-r-----+ 1 root systemd-journal  19M Jun 15 23:37 system.journal
> -rw-r-----+ 1 root root            3,6M Jul 21  2013 user-1000@0004e202588086cf-3d4130a580b63101.journal~
> -rw-r-----+ 1 root systemd-journal 4,8M Mai 22 18:17 user-1000@0004f9ff74415375-6b5dd1c3d76b09ce.journal~
> -rw-r-----+ 1 root systemd-journal 3,6M Mai 23 23:34 user-1000@0004fa17ff12ef0c-34297fcc8c06dd4b.journal~
> -rw-r-----+ 1 root systemd-journal 3,6M Mai 25 22:10 user-1000@0004fa3f0eee8e41-dfb1f54bd31e4967.journal~
> -rw-r-----+ 1 root systemd-journal 3,7M Mai 26 13:57 user-1000@0004fa4c475a0a63-d8badb620094bce8.journal~
> -rw-r-----+ 1 root systemd-journal 3,7M Mai 27 10:56 user-1000@0004fa5de0c650e0-522069bac82c754e.journal~
> -rw-r-----+ 1 root systemd-journal  15M Jun  2 12:45 user-1000@0004fad818a7a980-593b8f3971b2e697.journal~
> -rw-r-----+ 1 root systemd-journal 3,6M Jun  2 14:22 user-1000@0004fad97160c3ad-552b27f891e7a24e.journal~
> -rw-r-----+ 1 root systemd-journal 3,7M Jun  5 10:27 user-1000@0004fb1283616e7e-1fbca0bef31bd92b.journal~
> -rw-r-----+ 1 root systemd-journal 5,2M Jun  9 23:27 user-1000@0004fb6de033b269-018b4cbc2b1f319b.journal~
> -rw-r-----+ 1 root systemd-journal 3,8M Mai 11 10:21 user-1000@bd18e6867b824ba7b572de31531218a9-00000000000007c7-0004e2025881a1ee.journal
> -rw-r-----+ 1 root systemd-journal  35M Jun 15 23:38 user-1000.journal
> -rw-r-----+ 1 root systemd-journal 3,6M Apr 28 09:52 user-120.journal
> -rw-r-----+ 1 root systemd-journal 4,0M Mai 26 13:37 user-2012@0004fa4c02142cad-f97563ed0105bfb3.journal~
> -rw-r-----+ 1 root systemd-journal 3,6M Mai 26 15:25 user-2012@0004fa4d8255b2df-43248028d422ca78.journal~
> -rw-r-----+ 1 root systemd-journal 3,7M Mai 27 10:57 user-2012@0004fa5de40372db-d1f3c6428ddeec22.journal~
> -rw-r-----+ 1 root systemd-journal 4,3M Jun  2 12:46 user-2012@0004fad81b8cd9d8-ed2861a9fa1b163c.journal~
> -rw-r-----+ 1 root systemd-journal 3,6M Jun  2 14:26 user-2012@0004fad980139d4e-94ad07f4a8fae3cc.journal~
> -rw-r-----+ 1 root systemd-journal 3,8M Jun  5 14:41 user-2012@0004fb160f2d5334-99462eb429f4cb7b.journal~
> -rw-r-----+ 1 root adm             200K Mai 11 10:21 user-2012@54803afb1b1d42b387822c56e61bc168-0000000000011c75-0004ddb2be06d876.journal
> -rw-r-----+ 1 root systemd-journal 3,8M Jun 11 21:14 user-2012.journal
> -rw-r-----+ 1 root systemd-journal 3,6M Mai 26 14:04 user-65534@0004fa4c62bf4a71-6b4c53dfc06dd588.journal~
> -rw-r-----+ 1 root systemd-journal 3,7M Jun  9 23:26 user-65534.journal
> 
> 
> 
> 
> 
> 
> merkaba:/var/log> filefrag syslog*
> syslog: 361 extents found
> syslog.1: 202 extents found
> syslog.2.gz: 1 extent found
> [well sure, cause repacked]
> syslog.3.gz: 1 extent found
> syslog.4.gz: 1 extent found
> syslog.5.gz: 1 extent found
> syslog.6.gz: 1 extent found
> 
> merkaba:/var/log> ls -lh syslog*
> -rw-r----- 1 root adm 4,2M Jun 15 23:39 syslog
> -rw-r----- 1 root adm 2,1M Jun 11 16:07 syslog.1
> 
> 
> 
> So we have ten times as many extents on some systemd journal files as on
> rsyslog's.
> 
> 
> This is BTRFS RAID 1 on SSD with compress=lzo, so the 361 extents of syslog
> may be due to the size limit of extents on compressed BTRFS filesystems.
> 
> Anyway, since it is flash, I never bothered about the fragmentation.
> 
> Ciao,
> -- 
> Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
> GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
     --- If the first-ever performance is the première,  is the ---     
                  last-ever performance the derrière?                   

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]


* Re: Slow startup of systemd-journal on BTRFS
  2014-06-14 10:59         ` Kai Krakow
  2014-06-15  5:02           ` Duncan
@ 2014-06-15 21:45           ` Martin Steigerwald
  2014-06-15 21:51             ` Hugo Mills
  1 sibling, 1 reply; 21+ messages in thread
From: Martin Steigerwald @ 2014-06-15 21:45 UTC (permalink / raw)
  To: Kai Krakow; +Cc: linux-btrfs, systemd-devel

On Saturday, 14 June 2014 12:59:31, Kai Krakow wrote:
> Well, how did I accomplish that?

Setting no cow and defragmenting regularly?

Quite a complex setup for a casual Linux user.

Any solution should be automatic. I'd suggest a combination of sane
application behaviour and measures within the filesystem.

> First, I've set the journal directories nocow. Of course, systemd should do 
> this by default. I'm not sure if this is a packaging or systemd code issue,
> tho. But I think the systemd devs agree that for a cow fs, the
> journal directories should be set nocow. After all, the journal is a
> transactional database - it does not need cow protection at all costs. And
> I think they have their own checksumming protection. So, why let systemd
> bother with that? A lot of other software has the same semantic problems
> with btrfs, too (e.g. MySQL), where nobody shouts about the "inabilities" of
> the programmers. So why for systemd? Just because it's intrusive by its
> nature for being a radically and newly designed init system and thus
> requires some learning by its users/admins/packagers? Really? Come on...
> As admin and/or packager you have to stay with current technologies and
> developments anyway. It's only important to hide the details from the users.

But nocow also disables the possibility to snapshot these files, AFAIK. Akonadi
does this for its database directory as well for new installs. But I am not
completely happy with that approach.

And I would say that the following differing results are caused by
specific application behavior – my bet would be rsyslog sequentially
appending to the file while systemd uses the journal files more like a
database, and databases fragment heavily on BTRFS due to COW:


merkaba:/var/log/journal/1354039e4d4bb8de4f97ac8400000004> filefrag *
system@0004e2025580f3c5-c625739d3033b738.journal~: 1 extent found
system@0004f94fbee5f4d1-1cfb9bed12a79bde.journal~: 2771 extents found
system@0004f98240efba02-b465b39a7ed0bdbe.journal~: 2534 extents found
system@0004f9ff739ad927-4650a2ca62bf9378.journal~: 4951 extents found
system@0004fa17feb65603-a7597828f9823e38.journal~: 1244 extents found
system@0004fa3f0e96b653-d6f9d5795c9ef869.journal~: 1419 extents found
system@0004fa4c448b8a95-c95a3b7950fd704d.journal~: 1511 extents found
system@0004fa5dddb554e0-33c319ebb5f8100f.journal~: 1729 extents found
system@0004fad81852c750-20a36082c6006c8a.journal~: 10257 extents found
system@0004fad970b56567-44bf4a94314792fc.journal~: 932 extents found
system@0004fb128307b981-7f2104da8b2c9fb2.journal~: 6757 extents found
system@0004fb1c3eef86c3-8adbea51a1c98729.journal~: 1498 extents found
system@0004fb1c419fb301-7303f7fd9165ed26.journal~: 19 extents found
system@0004fb44feafbafd-b7433e90b1d3d718.journal~: 2265 extents found
system@0004fb6ddf63e4d3-b40e8f4701670bff.journal~: 1894 extents found
system@0004fb6e412c3f7e-3890be9c2119a7bb.journal~: 1038 extents found
system@54803afb1b1d42b387822c56e61bc168-000000000002b364-0004e1c4aabbeefa.journal: 2 extents found
system@54803afb1b1d42b387822c56e61bc168-000000000002b365-0004e1c4ab5b0246.journal: 1 extent found
system@bd18e6867b824ba7b572de31531218a9-0000000000000001-0004e202555a0493.journal: 3232 extents found
system.journal: 3855 extents found
user-1000@0004e202588086cf-3d4130a580b63101.journal~: 1 extent found
user-1000@0004f9ff74415375-6b5dd1c3d76b09ce.journal~: 1046 extents found
user-1000@0004fa17ff12ef0c-34297fcc8c06dd4b.journal~: 96 extents found
user-1000@0004fa3f0eee8e41-dfb1f54bd31e4967.journal~: 84 extents found
user-1000@0004fa4c475a0a63-d8badb620094bce8.journal~: 173 extents found
user-1000@0004fa5de0c650e0-522069bac82c754e.journal~: 319 extents found
user-1000@0004fad818a7a980-593b8f3971b2e697.journal~: 2465 extents found
user-1000@0004fad97160c3ad-552b27f891e7a24e.journal~: 106 extents found
user-1000@0004fb1283616e7e-1fbca0bef31bd92b.journal~: 283 extents found
user-1000@0004fb6de033b269-018b4cbc2b1f319b.journal~: 874 extents found
user-1000@bd18e6867b824ba7b572de31531218a9-00000000000007c7-0004e2025881a1ee.journal: 293 extents found
user-1000.journal: 4663 extents found
user-120.journal: 5 extents found
user-2012@0004fa4c02142cad-f97563ed0105bfb3.journal~: 749 extents found
user-2012@0004fa4d8255b2df-43248028d422ca78.journal~: 29 extents found
user-2012@0004fa5de40372db-d1f3c6428ddeec22.journal~: 122 extents found
user-2012@0004fad81b8cd9d8-ed2861a9fa1b163c.journal~: 575 extents found
user-2012@0004fad980139d4e-94ad07f4a8fae3cc.journal~: 25 extents found
user-2012@0004fb160f2d5334-99462eb429f4cb7b.journal~: 416 extents found
user-2012@54803afb1b1d42b387822c56e61bc168-0000000000011c75-0004ddb2be06d876.journal: 2 extents found
user-2012.journal: 453 extents found
user-65534@0004fa4c62bf4a71-6b4c53dfc06dd588.journal~: 46 extents found
user-65534.journal: 91 extents found
merkaba:/var/log/journal/1354039e4d4bb8de4f97ac8400000004> cd ..
merkaba:/var/log/journal> cd ..




merkaba:/var/log/journal/1354039e4d4bb8de4f97ac8400000004> ls -lh
total 495M
-rw-r-----+ 1 root root            6,8M Jul 21  2013 system@0004e2025580f3c5-c625739d3033b738.journal~
-rw-r-----+ 1 root systemd-journal  14M Mai 14 00:40 system@0004f94fbee5f4d1-1cfb9bed12a79bde.journal~
-rw-r-----+ 1 root systemd-journal  14M Mai 16 12:55 system@0004f98240efba02-b465b39a7ed0bdbe.journal~
-rw-r-----+ 1 root systemd-journal  26M Mai 22 18:17 system@0004f9ff739ad927-4650a2ca62bf9378.journal~
-rw-r-----+ 1 root systemd-journal 6,3M Mai 23 23:34 system@0004fa17feb65603-a7597828f9823e38.journal~
-rw-r-----+ 1 root systemd-journal 6,6M Mai 25 22:10 system@0004fa3f0e96b653-d6f9d5795c9ef869.journal~
-rw-r-----+ 1 root systemd-journal 7,7M Mai 26 13:56 system@0004fa4c448b8a95-c95a3b7950fd704d.journal~
-rw-r-----+ 1 root systemd-journal 9,7M Mai 27 10:56 system@0004fa5dddb554e0-33c319ebb5f8100f.journal~
-rw-r-----+ 1 root systemd-journal  50M Jun  2 12:45 system@0004fad81852c750-20a36082c6006c8a.journal~
-rw-r-----+ 1 root systemd-journal 5,2M Jun  2 14:21 system@0004fad970b56567-44bf4a94314792fc.journal~
-rw-r-----+ 1 root systemd-journal  34M Jun  5 10:27 system@0004fb128307b981-7f2104da8b2c9fb2.journal~
-rw-r-----+ 1 root systemd-journal 8,4M Jun  5 22:03 system@0004fb1c3eef86c3-8adbea51a1c98729.journal~
-rw-r-----+ 1 root systemd-journal 3,7M Jun  5 22:04 system@0004fb1c419fb301-7303f7fd9165ed26.journal~
-rw-r-----+ 1 root systemd-journal  12M Jun  7 22:40 system@0004fb44feafbafd-b7433e90b1d3d718.journal~
-rw-r-----+ 1 root systemd-journal  11M Jun  9 23:27 system@0004fb6ddf63e4d3-b40e8f4701670bff.journal~
-rw-r-----+ 1 root systemd-journal 6,3M Jun  9 23:54 system@0004fb6e412c3f7e-3890be9c2119a7bb.journal~
-rw-r-----+ 1 root adm             128M Jul 18  2013 system@54803afb1b1d42b387822c56e61bc168-000000000002b364-0004e1c4aabbeefa.journal
-rw-r-----+ 1 root root            7,4M Jul 20  2013 system@54803afb1b1d42b387822c56e61bc168-000000000002b365-0004e1c4ab5b0246.journal
-rw-r-----+ 1 root systemd-journal  23M Mai 11 10:21 system@bd18e6867b824ba7b572de31531218a9-0000000000000001-0004e202555a0493.journal
-rw-r-----+ 1 root systemd-journal  19M Jun 15 23:37 system.journal
-rw-r-----+ 1 root root            3,6M Jul 21  2013 user-1000@0004e202588086cf-3d4130a580b63101.journal~
-rw-r-----+ 1 root systemd-journal 4,8M Mai 22 18:17 user-1000@0004f9ff74415375-6b5dd1c3d76b09ce.journal~
-rw-r-----+ 1 root systemd-journal 3,6M Mai 23 23:34 user-1000@0004fa17ff12ef0c-34297fcc8c06dd4b.journal~
-rw-r-----+ 1 root systemd-journal 3,6M Mai 25 22:10 user-1000@0004fa3f0eee8e41-dfb1f54bd31e4967.journal~
-rw-r-----+ 1 root systemd-journal 3,7M Mai 26 13:57 user-1000@0004fa4c475a0a63-d8badb620094bce8.journal~
-rw-r-----+ 1 root systemd-journal 3,7M Mai 27 10:56 user-1000@0004fa5de0c650e0-522069bac82c754e.journal~
-rw-r-----+ 1 root systemd-journal  15M Jun  2 12:45 user-1000@0004fad818a7a980-593b8f3971b2e697.journal~
-rw-r-----+ 1 root systemd-journal 3,6M Jun  2 14:22 user-1000@0004fad97160c3ad-552b27f891e7a24e.journal~
-rw-r-----+ 1 root systemd-journal 3,7M Jun  5 10:27 user-1000@0004fb1283616e7e-1fbca0bef31bd92b.journal~
-rw-r-----+ 1 root systemd-journal 5,2M Jun  9 23:27 user-1000@0004fb6de033b269-018b4cbc2b1f319b.journal~
-rw-r-----+ 1 root systemd-journal 3,8M Mai 11 10:21 user-1000@bd18e6867b824ba7b572de31531218a9-00000000000007c7-0004e2025881a1ee.journal
-rw-r-----+ 1 root systemd-journal  35M Jun 15 23:38 user-1000.journal
-rw-r-----+ 1 root systemd-journal 3,6M Apr 28 09:52 user-120.journal
-rw-r-----+ 1 root systemd-journal 4,0M Mai 26 13:37 user-2012@0004fa4c02142cad-f97563ed0105bfb3.journal~
-rw-r-----+ 1 root systemd-journal 3,6M Mai 26 15:25 user-2012@0004fa4d8255b2df-43248028d422ca78.journal~
-rw-r-----+ 1 root systemd-journal 3,7M Mai 27 10:57 user-2012@0004fa5de40372db-d1f3c6428ddeec22.journal~
-rw-r-----+ 1 root systemd-journal 4,3M Jun  2 12:46 user-2012@0004fad81b8cd9d8-ed2861a9fa1b163c.journal~
-rw-r-----+ 1 root systemd-journal 3,6M Jun  2 14:26 user-2012@0004fad980139d4e-94ad07f4a8fae3cc.journal~
-rw-r-----+ 1 root systemd-journal 3,8M Jun  5 14:41 user-2012@0004fb160f2d5334-99462eb429f4cb7b.journal~
-rw-r-----+ 1 root adm             200K Mai 11 10:21 user-2012@54803afb1b1d42b387822c56e61bc168-0000000000011c75-0004ddb2be06d876.journal
-rw-r-----+ 1 root systemd-journal 3,8M Jun 11 21:14 user-2012.journal
-rw-r-----+ 1 root systemd-journal 3,6M Mai 26 14:04 user-65534@0004fa4c62bf4a71-6b4c53dfc06dd588.journal~
-rw-r-----+ 1 root systemd-journal 3,7M Jun  9 23:26 user-65534.journal






merkaba:/var/log> filefrag syslog*
syslog: 361 extents found
syslog.1: 202 extents found
syslog.2.gz: 1 extent found
[well sure, cause repacked]
syslog.3.gz: 1 extent found
syslog.4.gz: 1 extent found
syslog.5.gz: 1 extent found
syslog.6.gz: 1 extent found

merkaba:/var/log> ls -lh syslog*
-rw-r----- 1 root adm 4,2M Jun 15 23:39 syslog
-rw-r----- 1 root adm 2,1M Jun 11 16:07 syslog.1



So we have ten times as many extents on some systemd journal files as on
rsyslog's.


This is BTRFS RAID 1 on SSD with compress=lzo, so the 361 extents of syslog
may be due to the size limit of extents on compressed BTRFS filesystems.

Anyway, since it is flash, I never bothered about the fragmentation.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


* Re: Slow startup of systemd-journal on BTRFS
  2014-06-15 21:31         ` Martin Steigerwald
@ 2014-06-15 21:37           ` Hugo Mills
  2014-06-17  8:22           ` Duncan
  1 sibling, 0 replies; 21+ messages in thread
From: Hugo Mills @ 2014-06-15 21:37 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Duncan, linux-btrfs, systemd-devel

[-- Attachment #1: Type: text/plain, Size: 1898 bytes --]

On Sun, Jun 15, 2014 at 11:31:07PM +0200, Martin Steigerwald wrote:
> On Saturday, 14 June 2014 02:53:20, Duncan wrote:
> > > I am reaching the conclusion that fallocate is not the problem. The
> > > fallocate increases the file size by about 8MB, which is enough for some
> > > logging. So it is not called very often.
> > 
> > But... 
> > 
> > If a file isn't (properly[1]) set NOCOW (and the btrfs isn't mounted with 
> > nodatacow), then an fallocate of 8 MiB will increase the file size by 8 
> > MiB and write that out.  So far so good as at that point the 8 MiB should 
> > be a single extent.  But then, data gets written into 4 KiB blocks of 
> > that 8 MiB one at a time, and because btrfs is COW, the new data in the 
> > block must be written to a new location.
> > 
> > Which effectively means that by the time the 8 MiB is filled, each 4 KiB 
> > block has been rewritten to a new location and is now an extent unto 
> > itself.  So now that 8 MiB is composed of 2048 new extents, each one a 
> > single 4 KiB block in size.
> 
> I always thought that the whole point of fallocate is that it *doesn't* write 
> out anything, but just reserves the space. Thus I don't see how COW can have 
> any adverse effect here.

   Exactly. fallocate, as I understand it, says, "I'm going to write
[this much] data at some point soon; you may want to allocate that
space in a contiguous manner right now to make the process more
efficient". The space is not a formal part of the file data and so
doesn't need a CoW operation when it's first written to.
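
   (This "reserve but don't write" behaviour is easy to poke at with the
fallocate(1) tool from util-linux; a rough sketch, exact output varies by
filesystem:

$ fallocate -l 8M testfile    # reserve 8 MiB without writing any data
$ ls -lh testfile             # size is reported as the full 8.0M
$ filefrag -v testfile        # look for the "unwritten" flag on the extent

Whether the first real write into that space then forces a CoW copy is,
of course, exactly the point under debate here.)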

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
     --- If the first-ever performance is the première,  is the ---     
                  last-ever performance the derrière?                   

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Slow startup of systemd-journal on BTRFS
  2014-06-14  2:53       ` Duncan
  2014-06-14  7:52         ` Goffredo Baroncelli
  2014-06-14 10:59         ` Kai Krakow
@ 2014-06-15 21:31         ` Martin Steigerwald
  2014-06-15 21:37           ` Hugo Mills
  2014-06-17  8:22           ` Duncan
  2 siblings, 2 replies; 21+ messages in thread
From: Martin Steigerwald @ 2014-06-15 21:31 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs, systemd-devel

On Saturday, 14 June 2014 at 02:53:20, Duncan wrote:
> > I am reaching the conclusion that fallocate is not the problem. The
> > fallocate increases the file size by about 8 MB, which is enough for
> > some logging, so it is not called very often.
> 
> But... 
> 
> If a file isn't (properly[1]) set NOCOW (and the btrfs isn't mounted with 
> nodatacow), then an fallocate of 8 MiB will increase the file size by 8 
> MiB and write that out.  So far so good as at that point the 8 MiB should 
> be a single extent.  But then, data gets written into 4 KiB blocks of 
> that 8 MiB one at a time, and because btrfs is COW, the new data in the 
> block must be written to a new location.
> 
> Which effectively means that by the time the 8 MiB is filled, each 4 KiB 
> block has been rewritten to a new location and is now an extent unto 
> itself.  So now that 8 MiB is composed of 2048 new extents, each one a 
> single 4 KiB block in size.

I always thought that the whole point of fallocate is that it *doesn't* write 
out anything, but just reserves the space. Thus I don't see how COW can have 
any adverse effect here.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Slow startup of systemd-journal on BTRFS
  2014-06-15  5:02           ` Duncan
@ 2014-06-15 11:18             ` Kai Krakow
  0 siblings, 0 replies; 21+ messages in thread
From: Kai Krakow @ 2014-06-15 11:18 UTC (permalink / raw)
  To: linux-btrfs; +Cc: systemd-devel

Duncan <1i5t5.duncan@cox.net> wrote:

>> Back to the extents counts: What I did next was implementing a defrag
>> job that regularly defrags the journal (actually, the complete log
>> directory as other log files suffer the same problem):
>> 
>> $ cat /usr/local/sbin/defrag-logs.rb
>> #!/bin/sh
>> exec btrfs filesystem defragment -czlib -r /var/log
>> 
>> It can be easily converted into a timer job with systemd. This is left
>> as an exercise to the reader.
>> 
>> BTW: Actually, that job isn't currently executed on my system which
>> makes the numbers above pretty impressive... However, autodefrag is
>> turned on which may play into the mix. I'm not sure. I stopped
>> automatically running those defrag jobs a while ago (I have a few more).
> 
> Thanks for the timer hint, BTW.  I actually already created an hourly and
> a daily timer job here (turns out that's all I needed, no weekly/monthly/
> whatever, so I didn't create those) as I switched over to systemd and got
> rid of crond, and I'll definitely keep the defrag-journals timer idea up
> my sleeve in case I decide to set journald back to keeping non-volatile
> journals as well, plus as a helpful hint for others.  Tho I won't be
> using it myself currently, as volatile journals only, handing off to
> syslog-ng for the longer-term logs, is working very well for me ATM;
> still, a good sysadmin likes to have a set of tricks such as that ready
> in case they're needed, and I'm certainly no exception! =:^)

I did not yet get rid of cron. The systemd devs once noted that timers are 
not a cron replacement - tho I'm sure this was meant for running user jobs, 
not system jobs. The idea back then was to use systemd user sessions 
spawned by timers, and the devs stated that such usage is different from 
how cron spawns user jobs; one should just stick to cron for that, because 
the purpose of systemd user sessions is different.

I already created timer targets for daily, hourly, and monthly, so I could 
just symlink service units there. What's needed is some sort of systemd 
generator to auto-create services from /etc/cron.{hourly,daily,monthly,weekly} 
and auto-install them in the matching targets (a sketch follows after the 
listing below), which are:

$ for i in $(find -type f -name "timer-*");do echo "# $i:";cat $i;echo;done
# ./timer-weekly.target:
[Unit]
Description=Weekly Timer Target
StopWhenUnneeded=yes

# ./timer-daily.target:
[Unit]
Description=Daily Timer Target
StopWhenUnneeded=yes

# ./timer-hourly.target:
[Unit]
Description=Hourly Timer Target
StopWhenUnneeded=yes

# ./timer-daily.timer:
[Unit]
Description=Daily Timer

[Timer]
OnBootSec=10min
OnUnitActiveSec=1d
Unit=timer-daily.target
AccuracySec=12h

[Install]
WantedBy=basic.target

# ./timer-hourly.timer:
[Unit]
Description=Hourly Timer

[Timer]
OnBootSec=5min
OnUnitActiveSec=1h
Unit=timer-hourly.target
AccuracySec=30min

[Install]
WantedBy=basic.target

# ./timer-weekly.timer:
[Unit]
Description=Weekly Timer

[Timer]
OnBootSec=15min
OnUnitActiveSec=1w
Unit=timer-weekly.target
AccuracySec=12h

[Install]
WantedBy=basic.target


Then it's a matter of creating services which are "WantedBy timer-
weekly.target" or whatever is appropriate, and they should execute after 
being installed. Maybe I'm trying to copy what Gentoo is doing for 
/etc/local.d with systemd (there's a generator in the gentoo-systemd-
integration ebuild [1] for files in that directory). However, when going 
that route, cron should no longer be installed, or at least be configured 
to not run these jobs on its own. But then, the above-mentioned method of 
running user jobs would be needed - and that's not recommended (and I 
didn't get it to work either).
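
To illustrate (an untested sketch, unit names made up): a service hooked 
into one of the timer targets above is as simple as

# /etc/systemd/system/defrag-logs.service
[Unit]
Description=Defragment the log directory

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/defrag-logs.rb

[Install]
WantedBy=timer-daily.target

and the generator mentioned above could be a small shell script in 
/usr/lib/systemd/system-generators that writes such units into the 
directory systemd passes as its first argument, roughly:

#!/bin/sh
# sketch of a cron.daily generator; a real one needs error handling and
# should cover hourly/weekly/monthly as well
dir="$1"
mkdir -p "$dir/timer-daily.target.wants"
for f in /etc/cron.daily/*; do
    [ -x "$f" ] || continue
    unit="cron-daily-$(basename "$f").service"
    {
        echo "[Unit]"
        echo "Description=cron.daily job $(basename "$f")"
        echo
        echo "[Service]"
        echo "Type=oneshot"
        echo "ExecStart=$f"
    } > "$dir/$unit"
    ln -sf "../$unit" "$dir/timer-daily.target.wants/$unit"
done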


[1]: /usr/lib/systemd/system-generators/gentoo-local-generator

-- 
Replies to list only preferred.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Slow startup of systemd-journal on BTRFS
  2014-06-14  7:52         ` Goffredo Baroncelli
@ 2014-06-15  5:43           ` Duncan
  0 siblings, 0 replies; 21+ messages in thread
From: Duncan @ 2014-06-15  5:43 UTC (permalink / raw)
  To: linux-btrfs; +Cc: systemd-devel

Goffredo Baroncelli posted on Sat, 14 Jun 2014 09:52:39 +0200 as
excerpted:

> On 06/14/2014 04:53 AM, Duncan wrote:
>> Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as
>> excerpted:
>> 
>>> thanks for pointing that out. However I am performing my tests on a
>>> Fedora 20 with systemd-208, which seems to have this change.
>>>
>>> I am reaching the conclusion that fallocate is not the problem. The
>>> fallocate increases the file size by about 8 MB, which is enough for
>>> some logging, so it is not called very often.

Right.

>> But...

Exactly, _but_...

>> [A]n fallocate of 8 MiB will increase the file size
>> by 8 MiB and write that out.  So far so good as at that point the 8 MiB
>> should be a single extent.  But then, data gets written into 4 KiB
>> blocks of that 8 MiB one at a time, and because btrfs is COW, the new
>> data in the block must be written to a new location.
>> 
>> Which effectively means that by the time the 8 MiB is filled, each 4
>> KiB block has been rewritten to a new location and is now an extent
>> unto itself.  So now that 8 MiB is composed of 2048 new extents, each
>> one a single 4 KiB block in size.
> 
> Several people have pointed to fallocate as the problem. But I don't
> understand the reason.
> 1) 8 MB is quite a large value, so fallocate is called (at worst) one time
> during the boot. Often never, because the logs are less than 8 MB.
> 2) it is true that btrfs "rewrites" each 4 KB page almost two times with
> fallocate. But the first time is one "big" write of 8 MB; the
> second write would happen in any case. What I mean is that without the
> fallocate journald would make small writes in any case.
> 
> To be honest, I struggle to see the gain of having a fallocate on a COW
> filesystem... maybe I don't understand the fallocate()
> call very well.

The base problem isn't fallocate per se, tho it's the trigger in 
this case.  The base problem is that for COW-based filesystems, *ANY* 
rewriting of existing file content results in fragmentation.

It just so happens that the only reason there's existing file content to 
be rewritten (as opposed to simply appending) in this case, is because of 
the fallocate.  The rewrite of existing file content is the problem, but 
the existing file content is only there in this case because of the 
fallocate.

Taking a step back...

On a non-COW filesystem, allocating 8 MiB ahead and writing into it 
rewrites into the already allocated location, thus guaranteeing extents 
of 8 MiB each, since once the space is allocated it's simply rewritten in-
place.  Thus, on a non-COW filesystem, pre-allocating in something larger 
than single filesystem blocks when an app knows the data is eventually 
going to be written in to fill that space anyway is a GOOD thing, which 
is why systemd is doing it.

But on a COW-based filesystem fallocate is the exact opposite, a BAD 
thing, because an fallocate forces the file to be written out at that 
size, effectively filled with nulls/blanks.  Then the actual logging 
comes along and rewrites those nulls/blanks with actual data, but it's 
now a rewrite, which on a COW, copy-on-write, based filesystem, the 
rewritten block is copied elsewhere, it does NOT overwrite the existing 
null/blank block, and "elsewhere" by definition means detached from the 
previous blocks, thus in an extent all by itself.

Once the full 2048 original blocks composing that 8 MiB are filled in 
with actual data, because they were rewrites from null/blank blocks that 
fallocate had already forced to be allocated, that's now 2048 separate 
extents, 2048 separate file fragments, where without the forced fallocate, 
the writes would have all been appends, and there would have been at 
least /some/ chance of some of those 2048 separate blocks being written 
at close enough to the same time that they would have been written 
together as a single extent.  So while the 8 MiB might not have been a 
single extent as opposed to 2048 separate extents, it might have been 
perhaps 512 or 1024 extents, instead of the 2048 that it ended up being 
because fallocate meant that each block was a rewrite into an existing 
file, not a new append-write at the end of an existing file.

> [...]
>> Another alternative is that distros will start setting /var/log/journal
>> NOCOW in their setup scripts by default when it's btrfs, thus avoiding
>> the problem.  (Altho if they do automated snapshotting they'll also
>> have to set it as its own subvolume, to avoid the
>> first-write-after-snapshot-
>> is-COW problem.)  Well, that, and/or set autodefrag in the default
>> mount options.
> 
> Note that this also removes the checksums, which are very useful
> in a RAID configuration.

Well, it can be.  But this is only log data, not executables or the like, 
and (as Kai K points out) journald has its own checksumming method 
in any case.
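
(FWIW, journalctl can check those internal checksums itself:

$ journalctl --verify

... so losing the btrfs-level checksums on the journal files is less of a 
loss than it might first appear.)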

Besides which, you still haven't explained why you can't either set the 
autodefrag mount option and be done with it, or run a systemd-timer-
triggered or cron-triggered defrag script to defrag them automatically at 
hourly or daily or whatever intervals.  Those don't disable btrfs 
checksumming, but /should/ solve the problem.

(Tho if you're btrfs snapshotting the journals defrag has its own 
implications, but they should be relatively limited in scope compared to 
the fragmentation issues we're dealing with here.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Slow startup of systemd-journal on BTRFS
  2014-06-14 10:59         ` Kai Krakow
@ 2014-06-15  5:02           ` Duncan
  2014-06-15 11:18             ` Kai Krakow
  2014-06-15 21:45           ` Martin Steigerwald
  1 sibling, 1 reply; 21+ messages in thread
From: Duncan @ 2014-06-15  5:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: systemd-devel

Kai Krakow posted on Sat, 14 Jun 2014 12:59:31 +0200 as excerpted:

Duncan <1i5t5.duncan@cox.net> wrote:
> 
>> As they say, "Whoosh!"
>> 
>> At least here, I interpreted that remark as primarily sarcastic
>> commentary on the systemd devs' apparent attitude, which can be
>> (controversially) summarized as: "Systemd doesn't have problems because
>> it's perfect.  Therefore, any problems you have with systemd must
>> instead be with other components which systemd depends on."
> 
> Come on, sorry, but this is FUD. Really... ;-)

I should make clear that I did recently switch to systemd myself -- by 
choice, as I'm on Gentoo and it defaults to OpenRC, so I'm obviously not 
entirely anti-systemd.  And I _did_ say "(controversially)", which means 
I do recognize that there are two sides to the story.

That said, I've certainly seen what happens when non-systemd devs are on 
the receiving end of things -- including kernel devs, see the recent 
hubbub over the debug kernel commandline option and the somewhat longer 
ago firmware loading issue, among others.

But sarcasm implies neither absolute truth (or it'd be speaking truth, not 
sarcasm) nor absolute untruth (without a kernel of truth there it'd be 
simply stupid, not sarcastic).

And certainly that's how I read the comment.

But in any case, I read it as not to be taken literally, which is what it 
appeared to me the person to whom I was directly replying was doing.  I 
was simply warning him off of reading it too literally, as at least from 
here, it seemed more like sarcasm.

Of course if DaveC wishes to clarify one way or another he certainly 
can... tho I'd not expect it at this point since if it is sarcasm as I 
believe, that's kind of having to explain the joke...

> I think that systemd is even one of the early supporters of btrfs
> because it will defragment readahead files on boot from btrfs. I'd
> suggest the problem is to be found in the different semantics with COW
> filesystems.

Which is actually what I was saying.  In reality it's an interaction 
between the nature of COW filesystems, where fallocate tends to be a 
problem, and an application using fallocate because of its benefits on 
overwrite-in-place filesystems, which happen to be the norm at this 
point.  So neither one is to blame, it's simply a bad interaction that 
ultimately needs to be made better on one side or the other.

But btrfs is still relatively young and COW-based filesystems not that 
widespread yet, so that the problem hasn't been worked out to be handled 
automatically, just yet, isn't a big surprise.  Tho I think it'll come.

Meanwhile, as you point out below and as I've repeatedly said in this 
thread already myself, NOCOW and/or autodefrag are tools available to an 
admin faced with the problem, that together actually solve it reasonably 
well.  All an admin has to do is make use of the tools already available. 
=:^)

> So let's start with my journals, on btrfs:
> 
> $ sudo filefrag *
> system@0004fad12dae7676-98627a3d7df4e35e.journal~: 2 extents found
> system@0004fae8ea4b84a4-3a2dc4a93c5f7dc9.journal~: 2 extents found
> system@806cd49faa074a49b6cde5ff6fca8adc-000000000008e4cc-0004f82580cdcb45.journal: 5 extents found
> system@806cd49faa074a49b6cde5ff6fca8adc-0000000000097959-0004f89c2e8aff87.journal: 5 extents found
> system@806cd49faa074a49b6cde5ff6fca8adc-00000000000a166d-0004f98d7e04157c.journal: 5 extents found
> system@806cd49faa074a49b6cde5ff6fca8adc-00000000000aad59-0004fa379b9a1fdf.journal: 5 extents found
> system@ec16f60db38f43619f8337153a1cc024-0000000000000001-0004fae8e5057259.journal: 5 extents found
> system@ec16f60db38f43619f8337153a1cc024-00000000000092b1-0004fb59b1d034ad.journal: 5 extents found
> system.journal: 9 extents found
> user-500@e4209c6628ed4a65954678b8011ad73f-0000000000085b7a-0004f77d25ebba04.journal: 2 extents found
> user-500@e4209c6628ed4a65954678b8011ad73f-000000000008e7fb-0004f83c7bf18294.journal: 2 extents found
> user-500@e4209c6628ed4a65954678b8011ad73f-0000000000097fe4-0004f8ae69c198ca.journal: 2 extents found
> user-500@e4209c6628ed4a65954678b8011ad73f-00000000000a1a7e-0004f9966e9c69d8.journal: 2 extents found
> user-500.journal: 2 extents found
> 
> I don't think these are too bad values, eh?
> 
> Well, how did I accomplish that?
> 
> First, I've set the journal directories nocow. Of course, systemd should
> do this by default. I'm not sure if this is a packaging or systemd code
> issue, tho. But I think the systemd devs agree that on COW filesystems
> the journal directories should be set nocow. After all, the journal is a
> transactional database - it does not need COW protection at all costs.
> And I think they have their own checksumming protection. So, why let
> systemd bother with that? A lot of other software has the same semantic
> problems with btrfs, too (e.g. MySQL), where nobody shouts at the
> "inabilities" of the programmers. So why for systemd? Just because it's
> intrusive by its nature, being a radically new init system design, and
> thus requires some learning by its users/admins/packagers? Really? Come
> on... As admin and/or packager you have to keep up with current
> technologies and developments anyway. It's only important to hide the
> details from the users.
> 
> Back to the extents counts: What I did next was implementing a defrag
> job that regularly defrags the journal (actually, the complete log
> directory as other log files suffer the same problem):
> 
> $ cat /usr/local/sbin/defrag-logs.rb
> #!/bin/sh
> exec btrfs filesystem defragment -czlib -r /var/log
> 
> It can be easily converted into a timer job with systemd. This is left
> as an exercise to the reader.
> 
> BTW: Actually, that job isn't currently executed on my system which
> makes the numbers above pretty impressive... However, autodefrag is
> turned on which may play into the mix. I'm not sure. I stopped
> automatically running those defrag jobs a while ago (I have a few more).

Thanks for the timer hint, BTW.  I actually already created an hourly and 
a daily timer job here (turns out that's all I needed, no weekly/monthly/
whatever, so I didn't create those) as I switched over to systemd and got 
rid of crond, and I'll definitely keep the defrag-journals timer idea up 
my sleeve in case I decide to set journald back to keeping non-volatile 
journals as well, plus as a helpful hint for others.  Tho I won't be 
using it myself currently, as volatile journals only, handing off to 
syslog-ng for the longer-term logs, is working very well for me ATM; 
still, a good sysadmin likes to have a set of tricks such as that ready 
in case they're needed, and I'm certainly no exception! =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Slow startup of systemd-journal on BTRFS
  2014-06-14  2:53       ` Duncan
  2014-06-14  7:52         ` Goffredo Baroncelli
@ 2014-06-14 10:59         ` Kai Krakow
  2014-06-15  5:02           ` Duncan
  2014-06-15 21:45           ` Martin Steigerwald
  2014-06-15 21:31         ` Martin Steigerwald
  2 siblings, 2 replies; 21+ messages in thread
From: Kai Krakow @ 2014-06-14 10:59 UTC (permalink / raw)
  To: linux-btrfs; +Cc: systemd-devel

Duncan <1i5t5.duncan@cox.net> wrote:

> As they say, "Whoosh!"
> 
> At least here, I interpreted that remark as primarily sarcastic
> commentary on the systemd devs' apparent attitude, which can be
> (controversially) summarized as: "Systemd doesn't have problems because
> it's perfect.  Therefore, any problems you have with systemd must instead
> be with other components which systemd depends on."

Come on, sorry, but this is FUD. Really... ;-)

> IOW, it's a btrfs problem now in practice, not because it is so in a
> technical sense, but because systemd defines it as such and is unlikely
> to budge, so the only way to achieve progress is for btrfs to deal with
> it.

I think that systemd is even one of the early supporters of btrfs because it 
will defragment readahead files on boot from btrfs. I'd suggest the problem 
is to be found in the different semantics with COW filesystems. And if 
someone loudly complains to the systemd developers about how bad they are at 
doing their stuff - hmm, well, I would be disappointed/offended, too, as a 
programmer, because a lot of very good work has been put into systemd, and 
I'd start ignoring such people. In Germany we have a saying for this: "Wie 
man in den Wald hineinruft, so schallt es heraus." (the forest echoes back 
whatever you shout into it) [1] They are doing many things right that the 
legacy init systems have not adapted to modern systems in the last twenty 
years (or so).

So let's start with my journals, on btrfs:

$ sudo filefrag *
system@0004fad12dae7676-98627a3d7df4e35e.journal~: 2 extents found
system@0004fae8ea4b84a4-3a2dc4a93c5f7dc9.journal~: 2 extents found
system@806cd49faa074a49b6cde5ff6fca8adc-000000000008e4cc-0004f82580cdcb45.journal: 5 extents found
system@806cd49faa074a49b6cde5ff6fca8adc-0000000000097959-0004f89c2e8aff87.journal: 5 extents found
system@806cd49faa074a49b6cde5ff6fca8adc-00000000000a166d-0004f98d7e04157c.journal: 5 extents found
system@806cd49faa074a49b6cde5ff6fca8adc-00000000000aad59-0004fa379b9a1fdf.journal: 5 extents found
system@ec16f60db38f43619f8337153a1cc024-0000000000000001-0004fae8e5057259.journal: 5 extents found
system@ec16f60db38f43619f8337153a1cc024-00000000000092b1-0004fb59b1d034ad.journal: 5 extents found
system.journal: 9 extents found
user-500@e4209c6628ed4a65954678b8011ad73f-0000000000085b7a-0004f77d25ebba04.journal: 2 extents found
user-500@e4209c6628ed4a65954678b8011ad73f-000000000008e7fb-0004f83c7bf18294.journal: 2 extents found
user-500@e4209c6628ed4a65954678b8011ad73f-0000000000097fe4-0004f8ae69c198ca.journal: 2 extents found
user-500@e4209c6628ed4a65954678b8011ad73f-00000000000a1a7e-0004f9966e9c69d8.journal: 2 extents found
user-500.journal: 2 extents found

I don't think these are too bad values, eh?

Well, how did I accomplish that?

First, I've set the journal directories nocow. Of course, systemd should do 
this by default. I'm not sure if this is a packaging or systemd code issue, 
tho. But I think the systemd devs agree that on COW filesystems the journal 
directories should be set nocow. After all, the journal is a transactional 
database - it does not need COW protection at all costs. And I think they 
have their own checksumming protection. So, why let systemd bother with 
that? A lot of other software has the same semantic problems with btrfs, too 
(e.g. MySQL), where nobody shouts at the "inabilities" of the programmers. 
So why for systemd? Just because it's intrusive by its nature, being a 
radically new init system design, and thus requires some learning by its 
users/admins/packagers? Really? Come on... As admin and/or packager you 
have to keep up with current technologies and developments anyway. It's 
only important to hide the details from the users.

Back to the extents counts: What I did next was implementing a defrag job 
that regularly defrags the journal (actually, the complete log directory as 
other log files suffer the same problem):

$ cat /usr/local/sbin/defrag-logs.rb 
#!/bin/sh
exec btrfs filesystem defragment -czlib -r /var/log

It can be easily converted into a timer job with systemd. This is left as an 
exercise to the reader.
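
For the impatient, the matching timer could look about like this (an 
untested sketch; adjust the schedule to taste):

# /etc/systemd/system/defrag-logs.timer
[Unit]
Description=Defragment /var/log regularly

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

... plus a oneshot defrag-logs.service whose ExecStart points at the 
script above, then "systemctl enable defrag-logs.timer".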

BTW: Actually, that job isn't currently executed on my system which makes 
the numbers above pretty impressive... However, autodefrag is turned on 
which may play into the mix. I'm not sure. I stopped automatically running 
those defrag jobs a while ago (I have a few more).
 
> An arguably fairer and more impartial assessment of this particular
> situation suggests that neither btrfs, which as a COW-based filesystem,
> like all COW-based filesystems has the existing-file-rewrite as a major
> technical challenge that it must deal with /somehow/, nor systemd, which
> in choosing to use fallocate is specifically putting itself in that
> existing-file-rewrite class, is entirely at fault.

This challenge is not only affecting systemd but also a lot of other 
packages which do not play nice with btrfs semantics. But - as you correctly 
write - you cannot point your finger at just one party. FS and user space 
have to come together to evaluate and fix the problems on both sides.

> But that doesn't matter if one side refuses to budge, because then the
> other side must do so regardless of where the fault was, if there is to
> be any progress at all.

> Meanwhile, I've predicted before and do so here again, that as btrfs
> moves toward mainstream and starts supplanting ext* as the assumed Linux
> default filesystem, some of these problems will simply "go away", because
> at that point, various apps will no longer be optimized for the previously
> assumed default filesystem, and they'll either be patched at some level
> (distro level if not upstream) to work better on the new default
> filesystem, or will be replaced by something that does.  And if neither
> upstream nor distro level does that patching, then at some point, people
> are going to find that said distro performs worse than other distros that
> do that patching.

> Another alternative is that distros will start setting /var/log/journal
> NOCOW in their setup scripts by default when it's btrfs, thus avoiding
> the problem.  (Altho if they do automated snapshotting they'll also have
> to set it as its own subvolume, to avoid the first-write-after-snapshot-
> is-COW problem.)  Well, that, and/or set autodefrag in the default mount
> options.

> Meanwhile, there's some focus on making btrfs behave better with such
> rewrite-pattern files, but while I think the problem can be made /some/
> better, hopefully enough that the defaults bother far fewer people in far
> fewer cases, I expect it'll always be a bit of a sore spot because that's
> just how the technology works, and as such, setting NOCOW for such files
> and/or using autodefrag will continue to be recommended for an optimized
> setup.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Slow startup of systemd-journal on BTRFS
  2014-06-14  2:53       ` Duncan
@ 2014-06-14  7:52         ` Goffredo Baroncelli
  2014-06-15  5:43           ` Duncan
  2014-06-14 10:59         ` Kai Krakow
  2014-06-15 21:31         ` Martin Steigerwald
  2 siblings, 1 reply; 21+ messages in thread
From: Goffredo Baroncelli @ 2014-06-14  7:52 UTC (permalink / raw)
  To: Duncan, linux-btrfs; +Cc: systemd-devel

On 06/14/2014 04:53 AM, Duncan wrote:
> Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as
> excerpted:
> 
>> On 06/13/2014 01:24 AM, Dave Chinner wrote:
>>> On Thu, Jun 12, 2014 at 12:37:13PM +0000, Duncan wrote:
>>>>
>>>> FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think
>>>> actually pretty much equally bad without NOCOW set on the file.
>>>
>>> So maybe it's been fixed in systemd since the last time I looked.
>>> Yup:
>>>
>>> http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal-file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58
>>>
>>> The reason it was changed? To "save a syscall per append", not to
>>> prevent fragmentation of the file, which was the problem everyone was
>>> complaining about...
>>
>> thanks for pointing that out. However I am performing my tests on a
>> Fedora 20 with systemd-208, which seems to have this change.
>>>
>>>> Why?  Because btrfs data blocks are 4 KiB.  With COW, the effect for
>>>> either 4 byte or 8 MiB file allocations is going to end up being the
>>>> same, forcing (repeated until full) rewrite of each 4 KiB block into
>>>> its own extent.
>>
>> I am reaching the conclusion that fallocate is not the problem. The
>> fallocate increases the file size by about 8 MB, which is enough for
>> some logging, so it is not called very often.
> 
> But... 
> 
> If a file isn't (properly[1]) set NOCOW (and the btrfs isn't mounted with 
> nodatacow), then an fallocate of 8 MiB will increase the file size by 8 
> MiB and write that out.  So far so good as at that point the 8 MiB should 
> be a single extent.  But then, data gets written into 4 KiB blocks of 
> that 8 MiB one at a time, and because btrfs is COW, the new data in the 
> block must be written to a new location.
> 
> Which effectively means that by the time the 8 MiB is filled, each 4 KiB 
> block has been rewritten to a new location and is now an extent unto 
> itself.  So now that 8 MiB is composed of 2048 new extents, each one a 
> single 4 KiB block in size.

Several people have pointed to fallocate as the problem. But I don't understand the reason.
1) 8 MB is quite a large value, so fallocate is called (at worst) one time during the boot. Often never, because the logs are less than 8 MB.
2) it is true that btrfs "rewrites" each 4 KB page almost two times with fallocate. But the first time is one "big" write of 8 MB; the second write would happen in any case. What I mean is that without the fallocate journald would make small writes in any case.

To be honest, I struggle to see the gain of having a fallocate on a COW filesystem... maybe I don't understand the fallocate() call very well.

> 
> =:^(
> 
> Tho as I already stated, for file sizes upto 128 MiB or so anyway[2], the 
> btrfs autodefrag mount option should at least catch that and rewrite 
> (again), this time sequentially.
> 
>> I have to investigate more what happens when the logs are copied from
>> /run to /var/log/journal: this is when journald seems to slow everything down.
> 
> That's an interesting point.
> 
> At least in theory, during normal operation journald will write to 
> /var/log/journal, but there's a point during boot at which it flushes the 
> information accumulated during boot from the volatile /run location to 
> the non-volatile /var/log location.  /That/ write, at least, should be 
> sequential, since there will be > 4 KiB of journal accumulated that needs 
> to be transferred at once.  However, if it's being handled by the forced 
> pre-write fallocate described above, then that's not going to be the 
> case, as it'll then be a rewrite of already fallocated file blocks and 
> thus will get COWed exactly as I described above.
> 
> =:^(
> 
> 
>> I have prepared a PC which reboots continuously; I am collecting the time
>> required to finish the boot vs. the fragmentation of the system.journal
>> file vs. the number of boots. The results are dramatic: after 20 reboots,
>> the boot time increases by 20-30 seconds. Doing a defrag of
>> system.journal reduces the boot time to the original one, but after
>> another 20 reboots, the boot time again takes 20-30 seconds more....
>>
>> It is a slow PC, but I saw the same behavior also on a more modern PC
>> (i5 with 8 GB).
>>
>> For both PCs the HD is a mechanical one...
> 
> The problem's duplicable.  That's the first step toward a fix. =:^)
I hope so.
> 
>>> And that's now a btrfs problem.... :/
>>
>> Are you sure?
> 
> As they say, "Whoosh!"
> 
[...] 
> Another alternative is that distros will start setting /var/log/journal 
> NOCOW in their setup scripts by default when it's btrfs, thus avoiding 
> the problem.  (Altho if they do automated snapshotting they'll also have 
> to set it as its own subvolume, to avoid the first-write-after-snapshot-
> is-COW problem.)  Well, that, and/or set autodefrag in the default mount 
> options.

Note that this also removes the checksums, which are very useful in a RAID configuration.

> 
> Meanwhile, there's some focus on making btrfs behave better with such 
> rewrite-pattern files, but while I think the problem can be made /some/ 
> better, hopefully enough that the defaults bother far fewer people in far 
> fewer cases, I expect it'll always be a bit of a sore spot because that's 
> just how the technology works, and as such, setting NOCOW for such files 
> and/or using autodefrag will continue to be recommended for an optimized 
> setup.
> 
> ---
> [1] "Properly" set NOCOW:  Btrfs doesn't guarantee the effectiveness of 
> setting NOCOW (chattr +C) unless the attribute is set while the file is 
> still zero size, effectively, at file creation.  The easiest way to do 
> that is to set NOCOW on the subdir that will contain the file, such that 
> when the file is created it inherits the NOCOW attribute automatically.
> 
> [2] File sizes upto 128 MiB ... and possibly upto 1 GiB.  Under 128 MiB 
> should be fine, over 1 GiB is known to cause issues, between the two is a 
> gray area that depends on the speed of the hardware and the incoming 
> write-stream.
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Slow startup of systemd-journal on BTRFS
  2014-06-13 22:19     ` Goffredo Baroncelli
@ 2014-06-14  2:53       ` Duncan
  2014-06-14  7:52         ` Goffredo Baroncelli
                           ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Duncan @ 2014-06-14  2:53 UTC (permalink / raw)
  To: linux-btrfs; +Cc: systemd-devel

Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as
excerpted:

> On 06/13/2014 01:24 AM, Dave Chinner wrote:
>> On Thu, Jun 12, 2014 at 12:37:13PM +0000, Duncan wrote:
>>>
>>> FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think
>>> actually pretty much equally bad without NOCOW set on the file.
>> 
>> So maybe it's been fixed in systemd since the last time I looked.
>> Yup:
>> 
>> http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal-file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58
>> 
>> The reason it was changed? To "save a syscall per append", not to
>> prevent fragmentation of the file, which was the problem everyone was
>> complaining about...
> 
> thanks for pointing that out. However I am performing my tests on a
> Fedora 20 with systemd-208, which seems to have this change.
>> 
>>> Why?  Because btrfs data blocks are 4 KiB.  With COW, the effect for
>>> either 4 byte or 8 MiB file allocations is going to end up being the
>>> same, forcing (repeated until full) rewrite of each 4 KiB block into
>>> its own extent.
> 
> I am reaching the conclusion that fallocate is not the problem. The
> fallocate increases the file size by about 8 MB, which is enough for
> some logging, so it is not called very often.

But... 

If a file isn't (properly[1]) set NOCOW (and the btrfs isn't mounted with 
nodatacow), then an fallocate of 8 MiB will increase the file size by 8 
MiB and write that out.  So far so good as at that point the 8 MiB should 
be a single extent.  But then, data gets written into 4 KiB blocks of 
that 8 MiB one at a time, and because btrfs is COW, the new data in the 
block must be written to a new location.

Which effectively means that by the time the 8 MiB is filled, each 4 KiB 
block has been rewritten to a new location and is now an extent unto 
itself.  So now that 8 MiB is composed of 2048 new extents, each one a 
single 4 KiB block in size.

=:^(
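
(That's easy to demonstrate by hand on a scratch btrfs, BTW; a rough 
sketch, actual numbers will vary with flushing behavior:

$ fallocate -l 8M testfile
$ filefrag testfile       # expect a single 8 MiB extent
$ for i in $(seq 0 2047); do dd if=/dev/zero of=testfile bs=4k count=1 \
      seek=$i conv=notrunc 2>/dev/null; sync; done
$ filefrag testfile       # expect the extent count to have exploded

The per-block sync mimics a logger forcing each block out on its own; 
without it the writes coalesce and the effect is much smaller.)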

Tho as I already stated, for file sizes upto 128 MiB or so anyway[2], the 
btrfs autodefrag mount option should at least catch that and rewrite 
(again), this time sequentially.

> I have to investigate more what happens when the logs are copied from
> /run to /var/log/journal: this is when journald seems to slow everything down.

That's an interesting point.

At least in theory, during normal operation journald will write to 
/var/log/journal, but there's a point during boot at which it flushes the 
information accumulated during boot from the volatile /run location to 
the non-volatile /var/log location.  /That/ write, at least, should be 
sequential, since there will be > 4 KiB of journal accumulated that needs 
to be transferred at once.  However, if it's being handled by the forced 
pre-write fallocate described above, then that's not going to be the 
case, as it'll then be a rewrite of already fallocated file blocks and 
thus will get COWed exactly as I described above.

=:^(


> I have prepared a PC which reboots continuously; I am collecting the time
> required to finish the boot vs. the fragmentation of the system.journal
> file vs. the number of boots. The results are dramatic: after 20 reboots,
> the boot time increases by 20-30 seconds. Doing a defrag of
> system.journal reduces the boot time to the original one, but after
> another 20 reboots, the boot time again takes 20-30 seconds more....
> 
> It is a slow PC, but I saw the same behavior also on a more modern PC
> (i5 with 8 GB).
> 
> For both PCs the HD is a mechanical one...

The problem's duplicable.  That's the first step toward a fix. =:^)
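
(For anyone else who wants to reproduce it, the numbers are cheap to 
collect on every boot; a minimal sketch, run from a oneshot unit or rc 
script of your choice:

#!/bin/sh
# append boot time and journal fragmentation to a running log
{ date; systemd-analyze time; \
  filefrag /var/log/journal/*/system.journal; } >> /var/tmp/boot-frag.log

That should make the correlation between extent count and boot time 
directly visible.)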

>> And that's now a btrfs problem.... :/
> 
> Are you sure?

As they say, "Whoosh!"

At least here, I interpreted that remark as primarily sarcastic 
commentary on the systemd devs' apparent attitude, which can be 
(controversially) summarized as: "Systemd doesn't have problems because 
it's perfect.  Therefore, any problems you have with systemd must instead 
be with other components which systemd depends on."

IOW, it's a btrfs problem now in practice, not because it is so in a 
technical sense, but because systemd defines it as such and is unlikely 
to budge, so the only way to achieve progress is for btrfs to deal with 
it. 

An arguably fairer and more impartial assessment of this particular 
situation suggests that neither btrfs, which as a COW-based filesystem, 
like all COW-based filesystems has the existing-file-rewrite as a major 
technical challenge that it must deal with /somehow/, nor systemd, which 
in choosing to use fallocate is specifically putting itself in that 
existing-file-rewrite class, is entirely at fault.

But that doesn't matter if one side refuses to budge, because then the 
other side must do so regardless of where the fault was, if there is to 
be any progress at all.

Meanwhile, I've predicted before and do so here again, that as btrfs 
moves toward mainstream and starts supplanting ext* as the assumed Linux 
default filesystem, some of these problems will simply "go away", because 
at that point, various apps will no longer be optimized for the previously 
assumed default filesystem, and they'll either be patched at some level 
(distro level if not upstream) to work better on the new default 
filesystem, or will be replaced by something that does.  And if neither 
upstream nor distro level does that patching, then at some point, people 
are going to find that said distro performs worse than other distros that 
do that patching.

Another alternative is that distros will start setting /var/log/journal 
NOCOW in their setup scripts by default when it's btrfs, thus avoiding 
the problem.  (Altho if they do automated snapshotting they'll also have 
to set it as its own subvolume, to avoid the first-write-after-snapshot-
is-COW problem.)  Well, that, and/or set autodefrag in the default mount 
options.
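
(In fstab terms that's just one more mount option; an illustrative line 
only, substitute your own device and options:

UUID=<your-fs-uuid>  /  btrfs  defaults,autodefrag  0 0

... or try it live with "mount -o remount,autodefrag /".)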

Meanwhile, there's some focus on making btrfs behave better with such 
rewrite-pattern files, but while I think the problem can be made /some/ 
better, hopefully enough that the defaults bother far fewer people in far 
fewer cases, I expect it'll always be a bit of a sore spot because that's 
just how the technology works, and as such, setting NOCOW for such files 
and/or using autodefrag will continue to be recommended for an optimized 
setup.

---
[1] "Properly" set NOCOW:  Btrfs doesn't guarantee the effectiveness of 
setting NOCOW (chattr +C) unless the attribute is set while the file is 
still zero size, effectively, at file creation.  The easiest way to do 
that is to set NOCOW on the subdir that will contain the file, such that 
when the file is created it inherits the NOCOW attribute automatically.
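
For example (a sketch; this must happen before journald creates any files 
in there):

# mkdir -p /var/log/journal
# chattr +C /var/log/journal
# lsattr -d /var/log/journal    # should now list the 'C' attribute

Files journald creates there afterward then inherit NOCOW automatically.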

[2] File sizes upto 128 MiB ... and possibly upto 1 GiB.  Under 128 MiB 
should be fine, over 1 GiB is known to cause issues, between the two is a 
gray area that depends on the speed of the hardware and the incoming 
write-stream.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2014-06-17  8:22 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-11 21:28 Slow startup of systemd-journal on BTRFS Goffredo Baroncelli
2014-06-12  0:40 ` Chris Murphy
2014-06-12  1:18 ` Russell Coker
2014-06-12  4:39   ` Duncan
2014-06-12  1:21 ` Dave Chinner
2014-06-12  1:37   ` Dave Chinner
2014-06-12  2:32     ` Chris Murphy
2014-06-15 22:34       ` [systemd-devel] " Lennart Poettering
2014-06-16  4:01         ` Chris Murphy
2014-06-16  4:38           ` cwillu
2014-06-12 11:13 R: " Goffredo Baroncelli <kreijack@libero.it>
2014-06-12 12:37 ` Duncan
2014-06-12 23:24   ` Dave Chinner
2014-06-13 22:19     ` Goffredo Baroncelli
2014-06-14  2:53       ` Duncan
2014-06-14  7:52         ` Goffredo Baroncelli
2014-06-15  5:43           ` Duncan
2014-06-14 10:59         ` Kai Krakow
2014-06-15  5:02           ` Duncan
2014-06-15 11:18             ` Kai Krakow
2014-06-15 21:45           ` Martin Steigerwald
2014-06-15 21:51             ` Hugo Mills
2014-06-15 21:31         ` Martin Steigerwald
2014-06-15 21:37           ` Hugo Mills
2014-06-17  8:22           ` Duncan
