* Slow startup of systemd-journal on BTRFS
@ 2014-06-11 21:28 Goffredo Baroncelli
  2014-06-12  0:40 ` Chris Murphy
                   ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Goffredo Baroncelli @ 2014-06-11 21:28 UTC (permalink / raw)
  To: systemd Mailing List, linux-btrfs

Hi all,

I would like to share my experience regarding a slowness of systemd when used on BTRFS.

My boot time was very high (~50 seconds); most of it was due to NetworkManager, which took about 30-40 seconds to start (this data came from "systemd-analyze plot").

I made several attempts to address this issue. I also noticed that sometimes the problem disappeared, but I was never able to understand why.

However this link

	https://bugzilla.redhat.com/show_bug.cgi?id=1006386

suggested to me that the problem could be due to a bad interaction between systemd and btrfs. NetworkManager was innocent.
It seems that systemd-journal creates highly fragmented files when it stores its log, and BTRFS is known to behave slowly when a file is highly fragmented.
This caused a slow startup of systemd-journal, which in turn blocked the services that depend on the logging system.

In fact, after I defragmented the files under /var/log/journal [*], my boot time decreased by about 20 seconds (from 50s to 30s).

Unfortunately I don't have any data to show. Next time I will try to collect more information. But I am quite sure that when the logs are highly fragmented, systemd-journal becomes very slow on BTRFS.

I don't know if the problem is more on the systemd side or the btrfs side. What I know is that both projects will likely be important in the near future, and they must work well together.

I know that I can use "chattr +C" to avoid COW for some files, but I don't want to also lose the checksum protection.

If someone can suggest to me how to FRAGMENT the log file, I can try to collect more scientific data.


BR
G.Baroncelli

[*] 
# btrfs fi defrag /var/log/journal/*/*



-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: Slow startup of systemd-journal on BTRFS
  2014-06-11 21:28 Slow startup of systemd-journal on BTRFS Goffredo Baroncelli
@ 2014-06-12  0:40 ` Chris Murphy
  2014-06-12  1:18 ` Russell Coker
  2014-06-12  1:21 ` Dave Chinner
  2 siblings, 0 replies; 33+ messages in thread
From: Chris Murphy @ 2014-06-12  0:40 UTC (permalink / raw)
  To: kreijack, Goffredo Baroncelli; +Cc: systemd Mailing List, linux-btrfs


On Jun 11, 2014, at 3:28 PM, Goffredo Baroncelli <kreijack@libero.it> wrote:
> 
> If someone can suggest to me how to FRAGMENT the log file, I can try to collect more scientific data.

So long as you're not using compression, filefrag will show you the fragments of systemd-journald journals. I can vouch for the behavior you're experiencing without xattr +C or autodefrag; further, it also causes much slowness when reading journal contents. Like if I want to search all boots for a particular error message to see how far back it started, this takes quite a bit longer than on other file systems. So far I'm not experiencing this problem with autodefrag, or any other negative side effects, but my understanding is this code is still in flux.
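
For example, something like this (untested; journal paths vary by
machine ID) prints the per-file extent counts:

# filefrag /var/log/journal/*/*.journal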

Since the journals have their own checksumming I'm not overly concerned about setting xattr +C.

Chris Murphy


* Re: Slow startup of systemd-journal on BTRFS
  2014-06-11 21:28 Slow startup of systemd-journal on BTRFS Goffredo Baroncelli
  2014-06-12  0:40 ` Chris Murphy
@ 2014-06-12  1:18 ` Russell Coker
  2014-06-12  4:39   ` Duncan
  2014-06-12  1:21 ` Dave Chinner
  2 siblings, 1 reply; 33+ messages in thread
From: Russell Coker @ 2014-06-12  1:18 UTC (permalink / raw)
  To: kreijack; +Cc: systemd Mailing List, linux-btrfs

On Wed, 11 Jun 2014 23:28:54 Goffredo Baroncelli wrote:
>         https://bugzilla.redhat.com/show_bug.cgi?id=1006386
> 
> suggested to me that the problem could be due to a bad interaction between
> systemd and btrfs. NetworkManager was innocent.  It seems that
> systemd-journal creates highly fragmented files when it stores its log,
> and BTRFS is known to behave slowly when a file is highly fragmented.
> This caused a slow startup of systemd-journal, which in turn blocked the
> services that depend on the logging system.

On my BTRFS/systemd systems I edit /etc/systemd/journald.conf and put 
"SystemMaxUse=50M".  That doesn't solve the fragmentation problem but reduces 
it enough that it doesn't bother me.
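
For reference, the relevant fragment is just (the setting belongs in
the [Journal] section):

[Journal]
SystemMaxUse=50M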

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/



* Re: Slow startup of systemd-journal on BTRFS
  2014-06-11 21:28 Slow startup of systemd-journal on BTRFS Goffredo Baroncelli
  2014-06-12  0:40 ` Chris Murphy
  2014-06-12  1:18 ` Russell Coker
@ 2014-06-12  1:21 ` Dave Chinner
  2014-06-12  1:37   ` Dave Chinner
  2 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2014-06-12  1:21 UTC (permalink / raw)
  To: kreijack; +Cc: systemd Mailing List, linux-btrfs

On Wed, Jun 11, 2014 at 11:28:54PM +0200, Goffredo Baroncelli wrote:
> Hi all,
> 
> I would like to share my experience regarding a slowness of systemd when used on BTRFS.
> 
> My boot time was very high (~50 seconds); most of it was due to NetworkManager, which took about 30-40 seconds to start (this data came from "systemd-analyze plot").
> 
> I made several attempts to address this issue. I also noticed that sometimes the problem disappeared, but I was never able to understand why.
> 
> However this link
> 
> 	https://bugzilla.redhat.com/show_bug.cgi?id=1006386
> 
> suggested to me that the problem could be due to a bad interaction between systemd and btrfs. NetworkManager was innocent.

systemd has a very stupid journal write pattern. It checks if there
is space in the file for the write, and if not it fallocates the
small amount of space it needs (it does *4 byte* fallocate calls!)
and then does the write to it.  All this does is fragment the crap
out of the log files because the filesystems cannot optimise the
allocation patterns.
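
In pseudo-C, the pattern being described is roughly this (a sketch for
illustration, not journald's actual code; names are made up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Append one record by reserving only the bytes it needs, then
 * writing and syncing them.  Each tiny reservation can become its
 * own extent, which is what fragments the file. */
static int append_entry(int fd, off_t *end, const char *msg)
{
        size_t len = strlen(msg);

        if (fallocate(fd, 0, *end, len) < 0)
                return -1;
        if (pwrite(fd, msg, len, *end) != (ssize_t)len)
                return -1;
        *end += len;
        return fsync(fd);
}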

Yup, it fragments journal files on XFS, too.

http://oss.sgi.com/archives/xfs/2014-03/msg00322.html

IIRC, the systemd developers consider this a filesystem problem and
so refused to change the systemd code to be nice to the filesystem
allocators, even though they don't actually need to use fallocate...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: Slow startup of systemd-journal on BTRFS
  2014-06-12  1:21 ` Dave Chinner
@ 2014-06-12  1:37   ` Dave Chinner
  2014-06-12  2:32     ` Chris Murphy
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2014-06-12  1:37 UTC (permalink / raw)
  To: kreijack; +Cc: systemd Mailing List, linux-btrfs

On Thu, Jun 12, 2014 at 11:21:04AM +1000, Dave Chinner wrote:
> On Wed, Jun 11, 2014 at 11:28:54PM +0200, Goffredo Baroncelli wrote:
> > Hi all,
> > 
> > I would like to share my experience regarding a slowness of systemd when used on BTRFS.
> > 
> > My boot time was very high (~50 seconds); most of it was due to NetworkManager, which took about 30-40 seconds to start (this data came from "systemd-analyze plot").
> > 
> > I made several attempts to address this issue. I also noticed that sometimes the problem disappeared, but I was never able to understand why.
> > 
> > However this link
> > 
> > 	https://bugzilla.redhat.com/show_bug.cgi?id=1006386
> > 
> > suggested to me that the problem could be due to a bad interaction between systemd and btrfs. NetworkManager was innocent.
> 
> systemd has a very stupid journal write pattern. It checks if there
> is space in the file for the write, and if not it fallocates the
> small amount of space it needs (it does *4 byte* fallocate calls!)
> and then does the write to it.  All this does is fragment the crap
> out of the log files because the filesystems cannot optimise the
> allocation patterns.
> 
> Yup, it fragments journal files on XFS, too.
> 
> http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
> 
> IIRC, the systemd developers consider this a filesystem problem and
> so refused to change the systemd code to be nice to the filesystem
> allocators, even though they don't actually need to use fallocate...

BTW, the systemd list is subscriber only, so they aren't going to
see anything that we comment on from a cross-post to the btrfs list.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Slow startup of systemd-journal on BTRFS
  2014-06-12  1:37   ` Dave Chinner
@ 2014-06-12  2:32     ` Chris Murphy
  2014-06-15 22:34       ` [systemd-devel] " Lennart Poettering
  0 siblings, 1 reply; 33+ messages in thread
From: Chris Murphy @ 2014-06-12  2:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: kreijack, systemd Mailing List, linux-btrfs

On Jun 11, 2014, at 7:21 PM, Dave Chinner <david@fromorbit.com> wrote:

> On Wed, Jun 11, 2014 at 11:28:54PM +0200, Goffredo Baroncelli wrote:
>> Hi all,
>> 
>> I would like to share my experience regarding a slowness of systemd when used on BTRFS.
>> 
>> My boot time was very high (~50 seconds); most of it was due to NetworkManager, which took about 30-40 seconds to start (this data came from "systemd-analyze plot").
>> 
>> I made several attempts to address this issue. I also noticed that sometimes the problem disappeared, but I was never able to understand why.
>> 
>> However this link
>> 
>> 	https://bugzilla.redhat.com/show_bug.cgi?id=1006386
>> 
>> suggested to me that the problem could be due to a bad interaction between systemd and btrfs. NetworkManager was innocent.
> 
> systemd has a very stupid journal write pattern. It checks if there
> is space in the file for the write, and if not it fallocates the
> small amount of space it needs (it does *4 byte* fallocate calls!)
> and then does the write to it.  All this does is fragment the crap
> out of the log files because the filesystems cannot optimise the
> allocation patterns.
> 
> Yup, it fragments journal files on XFS, too.
> 
> http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
> 
> IIRC, the systemd developers consider this a filesystem problem and
> so refused to change the systemd code to be nice to the filesystem
> allocators, even though they don't actually need to use fallocate...
> 
> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com

On Jun 11, 2014, at 7:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> BTW, the systemd list is subscriber only, so they aren't going to
> see anything that we comment on from a cross-post to the btrfs list.


Unless a subscriber finds something really interesting, quotes it, and cross posts it.

Chris Murphy


* Re: Slow startup of systemd-journal on BTRFS
  2014-06-12  1:18 ` Russell Coker
@ 2014-06-12  4:39   ` Duncan
  0 siblings, 0 replies; 33+ messages in thread
From: Duncan @ 2014-06-12  4:39 UTC (permalink / raw)
  To: linux-btrfs; +Cc: systemd-devel

Russell Coker posted on Thu, 12 Jun 2014 11:18:37 +1000 as excerpted:

> On Wed, 11 Jun 2014 23:28:54 Goffredo Baroncelli wrote:
>>         https://bugzilla.redhat.com/show_bug.cgi?id=1006386
>> 
>> suggested to me that the problem could be due to a bad interaction between
>> systemd and btrfs. NetworkManager was innocent.  It seems that
>> systemd-journal creates highly fragmented files when it stores its log,
>> and BTRFS is known to behave slowly when a file is highly fragmented.
>> This caused a slow startup of systemd-journal, which in turn blocked the
>> services that depend on the logging system.
> 
> On my BTRFS/systemd systems I edit /etc/systemd/journald.conf and put
> "SystemMaxUse=50M".  That doesn't solve the fragmentation problem but
> reduces it enough that it doesn't bother me.

FWIW, as a relatively new switcher to systemd, that is, after switching 
to btrfs only a year or so ago...  Two comments:

1) Having seen a few reports of journald's journal fragmentation on this 
list, I was worried about those journals here as well.

My solution to both this problem and to an unrelated frustration with 
journald[1] was to:

a) confine journald to only a volatile (memory-only) log, first on a 
temporary basis while I was only experimenting with and setting up systemd 
(using the kernel command-line's init= to point at systemd while /sbin/
init still pointed at sysv's init for openrc), then later permanently, 
once I got enough of my systemd and journald config setup to actually 
switch to it.

b) configure my former syslogger (syslog-ng, in my case) to continue in 
that role under systemd, with journald relaying to it for non-volatile 
logging.

Here are the /etc/systemd/journald.conf changes I ended up with to 
accomplish (a); see the journald.conf(5) manpage for the documentation, 
as well as the explanation below:

Storage=volatile
RuntimeMaxUse=448M
RuntimeKeepFree=48M
RuntimeMaxFileSize=64M

Storage=volatile is the important one.  As the manpage notes, that means 
journald stores files under /run/log/journal only, where /run is normally 
setup by systemd as a tmpfs mount, so these files are tmpfs and thus 
memory-only.

The other three must be read in the context of a 512 MiB /run on tmpfs
[2].  From that and the information in the journald.conf manpage, it 
should be possible to see that my setup is (runtime* settings apply to 
the volatile files under /run):

An individual journal filesize (MaxFileSize) of 64 MiB, with seven such 
files in rotation (the default if MaxFileSize is unset is eight), 
totaling 448 MiB (MaxUse, the default being 10% of the filesystem, too 
small here since the journals are basically the only thing taking 
space).  On a 512 MiB filesystem, that will leave 64 MiB for other uses 
(pretty much all 0-byte lock and pidfiles, IIRC I was running something 
like a 2 MiB /run before systemd without issue).

It's worth noting that UNLIKE MaxUse, which will trigger journal file 
rotation when hit, hitting the KeepFree forces journald to stop 
journaling entirely -- *NOT* just to stop writing them here, but to stop 
forwarding to syslog (syslog-ng here) as well.  I FOUND THIS OUT THE HARD 
WAY!  Thus, in order to keep journald still functional, make sure 
journald runs into the MaxUse limit before it runs into KeepFree.  The 
KeepFree default is 15% of the filesystem, just under 77 MiB on a 512 MiB 
filesystem which is why I found this out the hard way with settings that 
would otherwise keep only 64 MiB free.  The 48 MiB setting I chose leaves 
16 MiB of room for other files before journald shuts down journaling, 
which should be plenty, since under normal circumstances the other files 
should all be 0-byte lock and pidfiles.  Just in case, however, there's 
still 48 MiB of room for other files after journald shuts down, before 
the filesystem itself fills up.

Configuring the syslogger to work with journald is "left as an exercise 
for the reader", as they say, since for all I know the OP is using 
something other than the syslog-ng I'm familiar with anyway.  But if 
hints for syslog-ng are needed too, let me know. =:^)


2) Someone else mentioned btrfs' autodefrag mount-option.  Given #1 above 
I've obviously not had a lot of experience with journald logs and 
autodefrag, but based on all I know about btrfs fragmentation behavior as 
well as journald journal file behavior from this list, as long as 
journald's non-volatile files are kept significantly under 1 GiB and 
preferably under half a GiB each, it shouldn't be a problem, with a 
/possible/ exception if you get something run-away-journaling multiple 
messages a second for a reasonably long period, such that the I/O can't 
keep up with both the journaling and autodefrag.

If you do choose to keep a persistent journal with autodefrag, then, I'd 
recommend journald.conf settings that keep individual journal files to 
perhaps 128 MiB each.  (System* settings apply to the non-volatile files 
under /var, in /var/log/journal/.)

SystemMaxFileSize=128M

AFAIK, that wouldn't affect the total journal size and thus the number of 
journal files, which would remain 10% of the filesystem size by default.

Alternatively, given the default 8-file rotation if MaxFileSize is unset, 
you could limit the total journal size to 1 GiB, for the same 128 MiB 
individual file size.

SystemMaxUse=1G

Of course if you want/need more control set both and/or other settings as 
I did for my volatile-only configuration above.

---
[1] Unrelated journald frustration:  Being a syslog-ng user I've been 
accustomed to being able to pre-filter incoming messages *BEFORE* they 
get written to the files.  This ability has been the important bit of my 
run-away-log coping strategy when I have no direct way to reconfigure the 
source to reduce the rate at which it's spitting out "noise" that's 
otherwise overwhelming my logs.

Unfortunately, while it seems journald has all /sorts/ of file-grep-style 
filtering tools to focus in like a laser on what you want to see AFTER 
the journal is written, I found absolutely NO documentation on setting up 
PREWRITE journal filters (except log-level, and there's global rate-
limiting as well, but I didn't want to use those, I wanted to filter 
specific "noise" messages), which means runaway journaling simply runs 
away, and if I use the size restriction stuff to turn it down, I quickly 
lose the important stuff I want to keep around when the files size-rotate 
due to that runaway!

Thus my solution, keep journald storage volatile only, relatively small 
but still big enough I can use the great systemctl status integration to 
get the last few journal entries from each service, then feed it to 
syslog-ng to pre-write filter the noise out before actual write to 
permanent storage. =:^)

[2] 512 MiB /run tmpfs.  This is on a 16 GiB RAM system, so default tmpfs 
size would be 8 GiB.  But I have several such tmpfs including a big /tmp 
that I use for scratch space when I'm building stuff (gentoo's 
PORTAGE_TMPDIR, and for fully in-memory DVD ISOs too), and I don't have 
swap configured at all, so keeping a reasonable lid on things by 
limiting /run and its major journal-file space user to half a GiB seems 
prudent.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-12  2:32     ` Chris Murphy
@ 2014-06-15 22:34       ` Lennart Poettering
  2014-06-16  4:01         ` Chris Murphy
  0 siblings, 1 reply; 33+ messages in thread
From: Lennart Poettering @ 2014-06-15 22:34 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Dave Chinner, kreijack, systemd Mailing List, linux-btrfs

On Wed, 11.06.14 20:32, Chris Murphy (lists@colorremedies.com) wrote:

> > systemd has a very stupid journal write pattern. It checks if there
> > is space in the file for the write, and if not it fallocates the
> > small amount of space it needs (it does *4 byte* fallocate calls!)

Not really the case. 

http://cgit.freedesktop.org/systemd/systemd/tree/src/journal/journal-file.c#n354

We allocate 8 MB at minimum.

> > and then does the write to it.  All this does is fragment the crap
> > out of the log files because the filesystems cannot optimise the
> > allocation patterns.

Well, it would be good if you'd tell me what to do instead...

I am invoking fallocate() in advance because we write those files with
mmap(), and that of course would normally trigger SIGBUS already for the
most boring of reasons, such as disk full/quota full or so. Hence,
before we do anything like that, we invoke fallocate() to ensure that
the space is actually available... As far as I can see, that is pretty
much in line with what fallocate() is supposed to be useful for; the man
page says this explicitly:

     "...After a successful call to posix_fallocate(), subsequent writes
      to bytes in the specified range are guaranteed not to fail because
      of lack of disk space."

Happy to be informed that the man page is wrong. 

I am also happy to change our code, if it really is the wrong thing to
do. Note however that I generally favour correctness and relying on
documented behaviour, instead of nebulous optimizations whose effects
might change with different file systems or kernel versions...
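
In other words, the scheme is roughly this (a minimal sketch of the
idea, not the actual journal-file.c code):

#include <fcntl.h>
#include <sys/mman.h>

/* Reserve the whole region up front so that later writes through the
 * mapping cannot SIGBUS because disk or quota filled up; the caller
 * checks for MAP_FAILED as usual. */
static void *map_region(int fd, size_t size)
{
        if (posix_fallocate(fd, 0, size) != 0)
                return NULL;
        return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}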

> > Yup, it fragments journal files on XFS, too.
> > 
> > http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
> > 
> > IIRC, the systemd developers consider this a filesystem problem and
> > so refused to change the systemd code to be nice to the filesystem
> > allocators, even though they don't actually need to use fallocate...

What? No need to be a dick. Nobody ever pinged me about this. And yeah, I
think I have a very good reason to use fallocate(): in fact, the only
reason the man page explicitly mentions.

Lennart

-- 
Lennart Poettering, Red Hat


* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-15 22:34       ` [systemd-devel] " Lennart Poettering
@ 2014-06-16  4:01         ` Chris Murphy
  2014-06-16  4:38           ` cwillu
  0 siblings, 1 reply; 33+ messages in thread
From: Chris Murphy @ 2014-06-16  4:01 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Dave Chinner, kreijack, systemd Mailing List, linux-btrfs


On Jun 15, 2014, at 4:34 PM, Lennart Poettering <lennart@poettering.net> wrote:

> On Wed, 11.06.14 20:32, Chris Murphy (lists@colorremedies.com) wrote:
> 
>>> systemd has a very stupid journal write pattern. It checks if there
>>> is space in the file for the write, and if not it fallocates the
>>> small amount of space it needs (it does *4 byte* fallocate calls!)
> 
> Not really the case. 
> 
> http://cgit.freedesktop.org/systemd/systemd/tree/src/journal/journal-file.c#n354
> 
> We allocate 8 MB at minimum.
> 
>>> and then does the write to it.  All this does is fragment the crap
>>> out of the log files because the filesystems cannot optimise the
>>> allocation patterns.
> 
> Well, it would be good if you'd tell me what to do instead...
> 
> I am invoking fallocate() in advance because we write those files with
> mmap(), and that of course would normally trigger SIGBUS already for the
> most boring of reasons, such as disk full/quota full or so. Hence,
> before we do anything like that, we invoke fallocate() to ensure that
> the space is actually available... As far as I can see, that is pretty
> much in line with what fallocate() is supposed to be useful for; the man
> page says this explicitly:
> 
>     "...After a successful call to posix_fallocate(), subsequent writes
>      to bytes in the specified range are guaranteed not to fail because
>      of lack of disk space."
> 
> Happy to be informed that the man page is wrong. 
> 
> I am also happy to change our code, if it really is the wrong thing to
> do. Note however that I generally favour correctness and relying on
> documented behaviour, instead of nebulous optimizations whose effects
> might change with different file systems or kernel versions...
> 
>>> Yup, it fragments journal files on XFS, too.
>>> 
>>> http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
>>> 
>>> IIRC, the systemd developers consider this a filesystem problem and
>>> so refused to change the systemd code to be nice to the filesystem
>>> allocators, even though they don't actually need to use fallocate...
> 
> What? No need to be a dick. Nobody ever pinged me about this. And yeah, I
> think I have a very good reason to use fallocate(): in fact, the only
> reason the man page explicitly mentions.
> 
> Lennart

For what it's worth, I did not write what is attributed to me above.  I was quoting Dave Chinner, and I've confirmed the original attribution correctly made it onto the systemd-devel@ list.

I don't know whether some people on this distribution list are even subscribed to systemd-devel@, so those subsequent responses aren't likely being posted to systemd-devel@ but rather to linux-btrfs@.


Chris Murphy


* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16  4:01         ` Chris Murphy
@ 2014-06-16  4:38           ` cwillu
  0 siblings, 0 replies; 33+ messages in thread
From: cwillu @ 2014-06-16  4:38 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Lennart Poettering, Dave Chinner, kreijack, systemd Mailing List,
	linux-btrfs

Fallocate is a red herring except insofar as it's a hint that btrfs
isn't making much use of: you see the same behaviour with small writes
to an mmap'ed file that's msync'ed after each write, and likewise with
plain old appending small writes with an fsync after each write, with
or without fallocating the file first.  Looking at the fiemap output
while doing either of those, you'll see a new 4k extent being made,
and then the physical location of that extent will increment until the
writes move on to the next 4k extent.

cwillu@cwillu-home:~/work/btrfs/e2fs$ touch /tmp/test

>>> import mmap
>>> size = 0x800000   # assumed: 8 MiB, matching the fiemap lengths below
>>> f=open('/tmp/test', 'r+')
>>> m=mmap.mmap(f.fileno(), size)
>>> for x in xrange(size):
...   m[x] = " "
...   m.flush(x / 4096 * 4096, 4096)   # msync(self->data + offset, size, MS_SYNC)

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 800000
fs_ioc_fiemap 3223348747d
File /tmp/test has 3 extents:
#	Logical          Physical         Length           Flags
0:	0000000000000000 0000000b3d9c0000 0000000000001000 0000
1:	0000000000001000 000000069f012000 00000000003ff000 0000
2:	0000000000400000 0000000b419d1000 0000000000400000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b3daf3000 0000000000001000 0000
1:	0000000000001000 000000069f012000 00000000003ff000 0000
2:	0000000000400000 0000000b419d1000 0000000000400000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b3dc38000 0000000000001000 0000
1:	0000000000001000 000000069f012000 00000000003ff000 0000
2:	0000000000400000 0000000b419d1000 0000000000400000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b3dc9f000 0000000000001000 0000
1:	0000000000001000 0000000b3d2b7000 0000000000001000 0000
2:	0000000000002000 000000069f013000 00000000003fe000 0000
3:	0000000000400000 0000000b419d1000 0000000000400000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b3dc9f000 0000000000001000 0000
1:	0000000000001000 0000000b3d424000 0000000000001000 0000
2:	0000000000002000 000000069f013000 00000000003fe000 0000
3:	0000000000400000 0000000b419d1000 0000000000400000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b3dc9f000 0000000000001000 0000
1:	0000000000001000 0000000b3d563000 0000000000001000 0000
2:	0000000000002000 000000069f013000 00000000003fe000 0000
3:	0000000000400000 0000000b419d1000 0000000000400000 0001

========

cwillu@cwillu-home:~/work/btrfs/e2fs$ rm /tmp/test
cwillu@cwillu-home:~/work/btrfs/e2fs$ touch /tmp/test

>>> f=open('/tmp/test', 'r+')
>>> f.truncate(size)
>>> m=mmap.mmap(f.fileno(), size)
>>> for x in xrange(size):
...   m[x] = " "
...   m.flush(x / 4096 * 4096, 4096)   # msync(self->data + offset, size, MS_SYNC)

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 800000
fs_ioc_fiemap 3223348747d
File /tmp/test has 1 extents:
#	Logical          Physical         Length           Flags
0:	0000000000000000 0000000b47f11000 0000000000001000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b48006000 0000000000001000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b48183000 0000000000001000 0000
1:	0000000000001000 0000000b48255000 0000000000001000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b48183000 0000000000001000 0000
1:	0000000000001000 0000000b48353000 0000000000001000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b48183000 0000000000001000 0000
1:	0000000000001000 0000000b493ed000 0000000000001000 0000
2:	0000000000002000 0000000b4a68f000 0000000000001000 0000
3:	0000000000003000 0000000b4b36f000 0000000000001000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b48183000 0000000000001000 0000
1:	0000000000001000 0000000b493ed000 0000000000001000 0000
2:	0000000000002000 0000000b4a68f000 0000000000001000 0000
3:	0000000000003000 0000000b4b4cf000 0000000000001000 0001

========

cwillu@cwillu-home:~/work/btrfs/e2fs$ rm /tmp/test
cwillu@cwillu-home:~/work/btrfs/e2fs$ touch /tmp/test

>>> import os
>>> f=open('/tmp/test', 'r+')
>>> for x in xrange(size):
...   f.write(' ')
...   f.flush()
...   os.fdatasync(f.fileno())

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 395
fs_ioc_fiemap 3223348747d
File /tmp/test has 1 extents:
#	Logical          Physical         Length           Flags
0:	0000000000000000 0000000000000000 0000000000001000 0301

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 5b5
0:	0000000000000000 0000000000000000 0000000000001000 0301

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 1d61
0:	0000000000000000 0000000b4d2bc000 0000000000001000 0000
1:	0000000000001000 0000000b4e1c5000 0000000000001000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 1e8c
0:	0000000000000000 0000000b4d2bc000 0000000000001000 0000
1:	0000000000001000 0000000b4e334000 0000000000001000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 207c
0:	0000000000000000 0000000b4d2bc000 0000000000001000 0000
1:	0000000000001000 0000000b4e4a9000 0000000000001000 0000
2:	0000000000002000 0000000b4e528000 0000000000001000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 21be
0:	0000000000000000 0000000b4d2bc000 0000000000001000 0000
1:	0000000000001000 0000000b4e4a9000 0000000000001000 0000
2:	0000000000002000 0000000b4e66c000 0000000000001000 0001


* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-15 22:13           ` Lennart Poettering
  2014-06-16  0:17             ` Russell Coker
  2014-06-16 16:32             ` Goffredo Baroncelli
@ 2014-06-19  1:13             ` Dave Chinner
  2 siblings, 0 replies; 33+ messages in thread
From: Dave Chinner @ 2014-06-19  1:13 UTC (permalink / raw)
  To: Lennart Poettering; +Cc: kreijack, Duncan, linux-btrfs, systemd-devel

On Mon, Jun 16, 2014 at 12:13:07AM +0200, Lennart Poettering wrote:
> On Sat, 14.06.14 09:52, Goffredo Baroncelli (kreijack@libero.it) wrote:
> 
> > > Which effectively means that by the time the 8 MiB is filled, each 4 KiB 
> > > block has been rewritten to a new location and is now an extent unto 
> > > itself.  So now that 8 MiB is composed of 2048 new extents, each one a 
> > > single 4 KiB block in size.
> > 
> > Several people pointed to fallocate as the problem, but I don't
> > understand the reason.
> 
> BTW, the reason we use fallocate() in journald is not about trying to
> optimize anything. It's only used for one reason: to avoid SIGBUS on
> disk/quota full, since we actually write everything to the files using
> mmap().

FWIW, fallocate() doesn't absolutely guarantee you that. When at
ENOSPC, a write into that reserved range can still require
un-reserved metadata blocks to be allocated. e.g. splitting a
"reserved" data extent into two extents (used and reserved) requires
an extra btree record, which can cause a split, which can require
allocation. This tends to be pretty damn rare, though, and some
filesystems have reserved block pools specifically for handling this
sort of ENOSPC corner case. Hence, in practice the filesystems
never actually fail with ENOSPC in ranges that have been
fallocate()d.

> I mean, writing things with mmap() is always problematic, and
> handling write errors is awfully difficult, but at least two of the most
> common reasons for failure we'd like to protect against in advance, under
> the assumption that disk/quota full will be reported immediately by the
> fallocate(), and the mmap writes later on will then necessarily succeed.
> 
> I am not really following why this trips up btrfs, though. I am
> not sure I understand why this breaks btrfs COW behaviour. I mean,
> fallocate() isn't necessarily supposed to write anything really; it's
> mostly about allocating disk space in advance. I would claim that
> journald's usage of it is very much within the entire reason why it
> exists...
> 
> Anyway, happy to change these things around if necessary, but first I'd
> like to have a very good explanation why fallocate() wouldn't be the
> right thing to invoke here, and a suggestion of what we should do instead
> to cover this use case...

fallocate() of 8MB should be more than sufficient for non-COW
filesystems - 1MB would be enough to prevent performance degradation
due to fragmentation in most cases. The current problems seem to be
with the way btrfs does rewrites, not the use of fallocate() in
systemd.

Thanks for the explanation, Lennart.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16 16:05                 ` Josef Bacik
  2014-06-16 19:52                   ` Martin
@ 2014-06-17 21:12                   ` Lennart Poettering
  1 sibling, 0 replies; 33+ messages in thread
From: Lennart Poettering @ 2014-06-17 21:12 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Russell Coker, kreijack, Duncan, linux-btrfs, systemd-devel

On Mon, 16.06.14 09:05, Josef Bacik (jbacik@fb.com) wrote:

> So you are doing all the right things from what I can tell, I'm just
> a little confused about when you guys run fsync.  From what I can
> tell it's only when you open the journal file and when you switch it
> to "offline."  I didn't look too much past this point so I don't
> know how often these things happen.  Are you taking an individual
> message, writing it, updating the head of the file and then
> fsync'ing?  Or are you getting a good bit of dirty log data and
> fsyncing occasionally?

The latter. Basically when opening a file for writing we mark it in the
header as "online", then fsync() it. When we close a file we fsync() it,
then change the header to "offline", and sync() again. Also, 5 minutes after
each write we will put things to offline, until the next write,
when we will put things to online again. Finally, if something is logged
at priorities EMERG, ALERT or CRIT we will sync immediately (which
actually should never happen in real-life, unless something is really
broken -- a simple way to check if anything like this got written is
"journalctl -p crit").

Also, we rotate and start a new file every now and then, when we hit a
size limit, but that is usually very seldom.

Putting this together: we should normally fsync() only very
infrequently. 

> What would cause btrfs problems is if you fallocate(), write a small
> chunk, fsync, write a small chunk again, fsync again etc.  Fallocate
> saves you the first write around, but if the next write is within
> the same block as the previous write we'll end up triggering cow and
> enter fragmented territory.  If this is what journald is
> doing then that would be good to know; if not, I'd like to know what
> is happening, since we shouldn't be fragmenting this badly.

Hmm, the only way I see that happening is if a lot of stuff is
logged at these super-high log levels mentioned above. But then again,
that never really should happen in real-life.

Could anyone who's experiencing the slowdowns have a look at the
journalctl output mentioned above? Do you have more than a few lines
printed like that?

Lennart

-- 
Lennart Poettering, Red Hat


* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-17 18:46                       ` Filipe Brandenburger
@ 2014-06-17 19:42                         ` Goffredo Baroncelli
  0 siblings, 0 replies; 33+ messages in thread
From: Goffredo Baroncelli @ 2014-06-17 19:42 UTC (permalink / raw)
  To: linux-btrfs

On 06/17/2014 08:46 PM, Filipe Brandenburger wrote:
> On Mon, Jun 16, 2014 at 6:13 PM, cwillu <cwillu@cwillu.com> wrote:
>> For the case of sequential writes (via write or mmap), padding writes
>> to page boundaries would help, if the wasted space isn't an issue.
>> Another approach, again assuming all other writes are appends, would
>> be to periodically (but frequently enough that the pages are still in
>> cache) read a chunk of the file and write it back in-place, with or
>> without an fsync. On the other hand, if you can afford to lose some
>> logs on a crash, not fsyncing/msyncing after each write will also
>> eliminate the fragmentation.
> 
> I was wondering if something could be done in btrfs to improve
> performance under this workload... Something like a "defrag on demand"
> for a case where mostly appends are happening.

Instead of inventing a strategy smarter than the (already smart) filesystem, wouldn't it be simpler to do an explicit defrag?

In any case, such a "smart strategy" would be filesystem specific, so it would be simpler (and less error prone) to do an explicit defrag.

I tried this strategy with systemd-journald, getting good results (doing an ioctl BTRFS_IOC_DEFRAG when the journal is opened).
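
In code it is just a couple of lines; a sketch along the lines of what
I tried (error handling omitted):

#include <sys/ioctl.h>
#include <linux/btrfs.h>

/* Ask btrfs to defragment the already-open journal file; with a NULL
 * argument BTRFS_IOC_DEFRAG applies to the whole file behind fd, and
 * it fails harmlessly on non-btrfs filesystems. */
static void defrag_journal_fd(int fd)
{
        ioctl(fd, BTRFS_IOC_DEFRAG, NULL);
}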

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-17  1:13                     ` cwillu
  2014-06-17 12:24                       ` Martin
  2014-06-17 17:56                       ` Chris Murphy
@ 2014-06-17 18:46                       ` Filipe Brandenburger
  2014-06-17 19:42                         ` Goffredo Baroncelli
  2 siblings, 1 reply; 33+ messages in thread
From: Filipe Brandenburger @ 2014-06-17 18:46 UTC (permalink / raw)
  To: linux-btrfs, systemd-devel

On Mon, Jun 16, 2014 at 6:13 PM, cwillu <cwillu@cwillu.com> wrote:
> For the case of sequential writes (via write or mmap), padding writes
> to page boundaries would help, if the wasted space isn't an issue.
> Another approach, again assuming all other writes are appends, would
> be to periodically (but frequently enough that the pages are still in
> cache) read a chunk of the file and write it back in-place, with or
> without an fsync. On the other hand, if you can afford to lose some
> logs on a crash, not fsyncing/msyncing after each write will also
> eliminate the fragmentation.

I was wondering if something could be done in btrfs to improve
performance under this workload... Something like a "defrag on demand"
for a case where mostly appends are happening.

When there are small appends with fsync/msync, they become new
fragments (as expected), but once the writes go past a block boundary,
btrfs could defragment the previous block in the background, since it's
not really expected to change again.

That could potentially achieve performance close to chattr +C without
the drawbacks of disabling copy-on-write.

Cheers,
Filipe


* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-17  1:13                     ` cwillu
  2014-06-17 12:24                       ` Martin
@ 2014-06-17 17:56                       ` Chris Murphy
  2014-06-17 18:46                       ` Filipe Brandenburger
  2 siblings, 0 replies; 33+ messages in thread
From: Chris Murphy @ 2014-06-17 17:56 UTC (permalink / raw)
  To: cwillu; +Cc: linux-btrfs, systemd-devel


On Jun 16, 2014, at 7:13 PM, cwillu <cwillu@cwillu.com> wrote:

> It's not a mmap problem, it's a small writes with an msync or fsync
> after each one problem.
> 
> For the case of sequential writes (via write or mmap), padding writes
> to page boundaries would help, if the wasted space isn't an issue.
> Another approach, again assuming all other writes are appends, would
> be to periodically (but frequently enough that the pages are still in
> cache) read a chunk of the file and write it back in-place, with or
> without an fsync. On the other hand, if you can afford to lose some
> logs on a crash, not fsyncing/msyncing after each write will also
> eliminate the fragmentation.

Normally I'd be willing to give up ~30 seconds of journal to not have fragmented journals. But then if I use systemd.log_level=debug I'd like that to trigger more frequent flushing to make sure as little of the journaling is lost as possible. Does that make sense?


Chris Murphy


* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-17  1:13                     ` cwillu
@ 2014-06-17 12:24                       ` Martin
  2014-06-17 17:56                       ` Chris Murphy
  2014-06-17 18:46                       ` Filipe Brandenburger
  2 siblings, 0 replies; 33+ messages in thread
From: Martin @ 2014-06-17 12:24 UTC (permalink / raw)
  To: linux-btrfs; +Cc: systemd-devel

On 17/06/14 02:13, cwillu wrote:
> It's not a mmap problem, it's a small writes with an msync or fsync
> after each one problem.

And for logging, that is exactly what is wanted, so you can see why
whatever crashed...

Except...

Whilst logging, hold off on the msync/fsync unless the next log message
to be written is 'critical'?

With that, the mundane logging gets appended just as for any normal file
write. Only the more critical log messages suffer the extra overhead and
fragmentation of an immediate msync/fsync.


> For the case of sequential writes (via write or mmap), padding writes
> to page boundaries would help, if the wasted space isn't an issue.
> Another approach, again assuming all other writes are appends, would
> be to periodically (but frequently enough that the pages are still in
> cache) read a chunk of the file and write it back in-place, with or
> without an fsync. On the other hand, if you can afford to lose some
> logs on a crash, not fsyncing/msyncing after each write will also
> eliminate the fragmentation.
> 
> (Worth pointing out that none of that is conjecture, I just spent 30
> minutes testing those cases while composing this ;p)
> 
> Josef has mentioned in irc that a piece of Chris' raid5/6 work will
> also fix this when it lands.

Interesting...

The source problem is how COW fragments files under expected normal use...
Is all this unavoidable unless we rethink the semantics?


Regards,
Martin




* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16 19:52                   ` Martin
  2014-06-16 20:20                     ` Josef Bacik
  2014-06-17  0:15                     ` Austin S Hemmelgarn
@ 2014-06-17  1:13                     ` cwillu
  2014-06-17 12:24                       ` Martin
                                         ` (2 more replies)
  2 siblings, 3 replies; 33+ messages in thread
From: cwillu @ 2014-06-17  1:13 UTC (permalink / raw)
  To: Martin; +Cc: linux-btrfs, systemd-devel, clm

It's not a mmap problem, it's a small writes with an msync or fsync
after each one problem.

For the case of sequential writes (via write or mmap), padding writes
to page boundaries would help, if the wasted space isn't an issue.
Another approach, again assuming all other writes are appends, would
be to periodically (but frequently enough that the pages are still in
cache) read a chunk of the file and write it back in-place, with or
without an fsync. On the other hand, if you can afford to lose some
logs on a crash, not fsyncing/msyncing after each write will also
eliminate the fragmentation.

(Worth pointing out that none of that is conjecture, I just spent 30
minutes testing those cases while composing this ;p)
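
For the record, the padding variant looks roughly like this (an
untested sketch, assuming 4 KiB pages; names are made up):

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SZ 4096

/* Round each append up to a page boundary so the next append starts
 * a fresh block instead of rewriting (and re-COWing) the partially
 * filled tail of the previous one. */
static ssize_t append_padded(int fd, off_t off, const void *buf, size_t len)
{
        size_t padded = (len + PAGE_SZ - 1) & ~(size_t)(PAGE_SZ - 1);
        char *tmp = calloc(1, padded);
        ssize_t n;

        if (tmp == NULL)
                return -1;
        memcpy(tmp, buf, len);
        n = pwrite(fd, tmp, padded, off);
        free(tmp);
        return n;
}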

Josef has mentioned in irc that a piece of Chris' raid5/6 work will
also fix this when it lands.

On Mon, Jun 16, 2014 at 1:52 PM, Martin <m_btrfs@ml1.co.uk> wrote:
> On 16/06/14 17:05, Josef Bacik wrote:
>>
>> On 06/16/2014 03:14 AM, Lennart Poettering wrote:
>>> On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote:
>>>
>>>>> I am not really following why this trips up btrfs, though. I am
>>>>> not sure I understand why this breaks btrfs COW behaviour. I mean,
>
>>>> I don't believe that fallocate() makes any difference to
>>>> fragmentation on
>>>> BTRFS.  Blocks will be allocated when writes occur so regardless of an
>>>> fallocate() call the usage pattern in systemd-journald will cause
>>>> fragmentation.
>>>
>>> journald's write pattern looks something like this: append something to
>>> the end, make sure it is written, then update a few offsets stored at
>>> the beginning of the file to point to the newly appended data. This is
>>> of course not easy to handle for COW file systems. But then again, it's
>>> probably not too different from access patterns of other database or
>>> database-like engines...
>
> Even though this appears to be a problem case for btrfs/COW, is there a
> more favourable write/access sequence, easily implemented, that works
> well for both ext4-like fs /and/ COW fs?
>
> Database-like writing is known to be 'difficult' for filesystems: can a
> data log be a simpler case?
>
>
>> Was waiting for you to show up before I said anything since most systemd
>> related emails always devolve into how evil you are rather than what is
>> actually happening.
>
> Ouch! Hope you two know each other!! :-P :-)
>
>
> [...]
>> since we shouldn't be fragmenting this badly.
>>
>> Like I said what you guys are doing is fine, if btrfs falls on its face
>> then it's not your fault.  I'd just like an exact idea of when you guys
>> are fsync'ing so I can replicate in a smaller way.  Thanks,
>
> Good if COW can be so resilient. I have about 2GBytes of data logging
> files and I must defrag those as part of my backups to stop the system
> fragmenting to a stop (I use "cp -a" to defrag the files to a new area
> and restart the data software logger on that).
>
>
> Random thoughts:
>
> Would using a second small file just for the mmap-ed pointers help avoid
> repeated rewriting of random offsets in the log file causing excessive
> fragmentation?
>
> Align the data writes to 16kByte or 64kByte boundaries/chunks?
>
> Are mmap-ed files a similar problem to using a swap file and so should
> the same "btrfs file swap" code be used for both?
>
>
> Not looked over the code so all random guesses...
>
> Regards,
> Martin
>
>
>
>


* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16 19:52                   ` Martin
  2014-06-16 20:20                     ` Josef Bacik
@ 2014-06-17  0:15                     ` Austin S Hemmelgarn
  2014-06-17  1:13                     ` cwillu
  2 siblings, 0 replies; 33+ messages in thread
From: Austin S Hemmelgarn @ 2014-06-17  0:15 UTC (permalink / raw)
  To: Martin, linux-btrfs; +Cc: systemd-devel


On 06/16/2014 03:52 PM, Martin wrote:
> On 16/06/14 17:05, Josef Bacik wrote:
>>
>> On 06/16/2014 03:14 AM, Lennart Poettering wrote:
>>> On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote:
>>>
>>>>> I am not really following why this trips up btrfs, though. I am
>>>>> not sure I understand why this breaks btrfs COW behaviour. I mean,
> 
>>>> I don't believe that fallocate() makes any difference to
>>>> fragmentation on
>>>> BTRFS.  Blocks will be allocated when writes occur so regardless of an
>>>> fallocate() call the usage pattern in systemd-journald will cause
>>>> fragmentation.
>>>
>>> journald's write pattern looks something like this: append something to
>>> the end, make sure it is written, then update a few offsets stored at
>>> the beginning of the file to point to the newly appended data. This is
>>> of course not easy to handle for COW file systems. But then again, it's
>>> probably not too different from access patterns of other database or
>>> database-like engines...
> 
> Even though this appears to be a problem case for btrfs/COW, is there a
> more favourable write/access sequence, easily implemented, that works
> well for both ext4-like fs /and/ COW fs?
> 
> Database-like writing is known to be 'difficult' for filesystems: can a
> data log be a simpler case?
> 
> 
>> Was waiting for you to show up before I said anything since most systemd
>> related emails always devolve into how evil you are rather than what is
>> actually happening.
> 
> Ouch! Hope you two know each other!! :-P :-)
> 
> 
> [...]
>> since we shouldn't be fragmenting this badly.
>>
>> Like I said what you guys are doing is fine, if btrfs falls on its face
>> then it's not your fault.  I'd just like an exact idea of when you guys
>> are fsync'ing so I can replicate in a smaller way.  Thanks,
> 
> Good if COW can be so resilient. I have about 2GBytes of data logging
> files and I must defrag those as part of my backups to stop the system
> fragmenting to a stop (I use "cp -a" to defrag the files to a new area
> and restart the data software logger on that).
> 
> 
> Random thoughts:
> 
> Would using a second small file just for the mmap-ed pointers help avoid
> repeated rewriting of random offsets in the log file causing excessive
> fragmentation?
> 
> Align the data writes to 16kByte or 64kByte boundaries/chunks?
> 
> Are mmap-ed files a similar problem to using a swap file and so should
> the same "btrfs file swap" code be used for both?
> 
> 
> Not looked over the code so all random guesses...
> 
> Regards,
> Martin
> 
> 
> 
> 
> 
Just a thought, partly inspired by the mention of the swap code, has
anyone tried making the file NOCOW and pre-allocating to the max journal
size?  A similar approach has seemed to help on my systems with generic
log files (I keep debug level logs from almost everything, so I end up
with very active log files with ridiculous numbers of fragments if I
don't pre-allocate and mark them NOCOW).  I don't know for certain how
BTRFS handles appends to NOCOW files, but I would be willing to bet that
it ends up with a new fragment for each filesystem block worth of space
allocated.
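
A sketch of that approach (untested; note the NOCOW flag only takes
effect while the file is still empty):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Create the file, mark it NOCOW (the programmatic equivalent of
 * chattr +C), then pre-allocate the maximum journal size in one go. */
static int create_nocow_prealloc(const char *path, off_t max_size)
{
        int flags = 0;
        int fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0640);

        if (fd < 0)
                return -1;
        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0) {
                flags |= FS_NOCOW_FL;
                ioctl(fd, FS_IOC_SETFLAGS, &flags);
        }
        if (fallocate(fd, 0, 0, max_size) < 0) {
                close(fd);
                return -1;
        }
        return fd;
}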




* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16 19:52                   ` Martin
@ 2014-06-16 20:20                     ` Josef Bacik
  2014-06-17  0:15                     ` Austin S Hemmelgarn
  2014-06-17  1:13                     ` cwillu
  2 siblings, 0 replies; 33+ messages in thread
From: Josef Bacik @ 2014-06-16 20:20 UTC (permalink / raw)
  To: Martin, linux-btrfs; +Cc: systemd-devel



On 06/16/2014 12:52 PM, Martin wrote:
> On 16/06/14 17:05, Josef Bacik wrote:
>>
>> On 06/16/2014 03:14 AM, Lennart Poettering wrote:
>>> On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote:
>>>
>>>>> I am not really following why this trips up btrfs, though. I am
>>>>> not sure I understand why this breaks btrfs COW behaviour. I mean,
>
>>>> I don't believe that fallocate() makes any difference to
>>>> fragmentation on
>>>> BTRFS.  Blocks will be allocated when writes occur so regardless of an
>>>> fallocate() call the usage pattern in systemd-journald will cause
>>>> fragmentation.
>>>
>>> journald's write pattern looks something like this: append something to
>>> the end, make sure it is written, then update a few offsets stored at
>>> the beginning of the file to point to the newly appended data. This is
>>> of course not easy to handle for COW file systems. But then again, it's
>>> probably not too different from access patterns of other database or
>>> database-like engines...
>
> Even though this appears to be a problem case for btrfs/COW, is there a
> more favourable write/access sequence, easily implemented, that works
> well for both ext4-like fs /and/ COW fs?
>
> Database-like writing is known to be 'difficult' for filesystems: can a
> data log be a simpler case?
>
>
>> Was waiting for you to show up before I said anything since most systemd
>> related emails always devolve into how evil you are rather than what is
>> actually happening.
>
> Ouch! Hope you two know each other!! :-P :-)
>

Yup, I <3 Lennart, I'd rather deal with him directly than wade through 
all the FUD that flies around when systemd is brought up.

>
> [...]
>> since we shouldn't be fragmenting this badly.
>>
>> Like I said what you guys are doing is fine, if btrfs falls on its face
>> then it's not your fault.  I'd just like an exact idea of when you guys
>> are fsync'ing so I can replicate in a smaller way.  Thanks,
>
> Good if COW can be so resilient. I have about 2GBytes of data logging
> files and I must defrag those as part of my backups to stop the system
> fragmenting to a stop (I use "cp -a" to defrag the files to a new area
> and restart the data software logger on that).
>
>
> Random thoughts:
>
> Would using a second small file just for the mmap-ed pointers help avoid
> repeated rewriting of random offsets in the log file causing excessive
> fragmentation?
>

Depends on when you fsync.  The problem isn't dirty'ing so much as writing.

> Align the data writes to 16kByte or 64kByte boundaries/chunks?
>

Yes, that would help the most; if journald would try to only fsync every 
blocksize worth of writes, we'd suck less.

> Are mmap-ed files a similar problem to using a swap file and so should
> the same "btrfs file swap" code be used for both?
>

Not sure what this special swap file code is you speak of.  Thanks,

Josef


* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16 16:05                 ` Josef Bacik
@ 2014-06-16 19:52                   ` Martin
  2014-06-16 20:20                     ` Josef Bacik
                                       ` (2 more replies)
  2014-06-17 21:12                   ` Lennart Poettering
  1 sibling, 3 replies; 33+ messages in thread
From: Martin @ 2014-06-16 19:52 UTC (permalink / raw)
  To: linux-btrfs; +Cc: systemd-devel

On 16/06/14 17:05, Josef Bacik wrote:
> 
> On 06/16/2014 03:14 AM, Lennart Poettering wrote:
>> On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote:
>>
>>>> I am not really following why this trips up btrfs, though. I am
>>>> not sure I understand why this breaks btrfs COW behaviour. I mean,

>>> I don't believe that fallocate() makes any difference to
>>> fragmentation on
>>> BTRFS.  Blocks will be allocated when writes occur so regardless of an
>>> fallocate() call the usage pattern in systemd-journald will cause
>>> fragmentation.
>>
>> journald's write pattern looks something like this: append something to
>> the end, make sure it is written, then update a few offsets stored at
>> the beginning of the file to point to the newly appended data. This is
>> of course not easy to handle for COW file systems. But then again, it's
>> probably not too different from access patterns of other database or
>> database-like engines...

Even though this appears to be a problem case for btrfs/COW, is there a
write/access sequence that is easily implemented and favourable for
both ext4-like fs /and/ COW fs?

Database-like writing is known to be 'difficult' for filesystems: could
a data log be a simpler case?


> Was waiting for you to show up before I said anything since most systemd
> related emails always devolve into how evil you are rather than what is
> actually happening.

Ouch! Hope you two know each other!! :-P :-)


[...]
> since we shouldn't be fragmenting this badly.
> 
> Like I said, what you guys are doing is fine; if btrfs falls on its
> face then it's not your fault.  I'd just like an exact idea of when
> you guys are fsync'ing so I can replicate it in a smaller way.  Thanks,

Good if COW can be so resilient. I have about 2GBytes of data logging
files and I must defrag those as part of my backups to stop the system
grinding to a halt from fragmentation (I use "cp -a" to copy the files
to a new area and restart the data logging software on that).


Random thoughts:

Would using a second small file just for the mmap-ed pointers help avoid
repeated rewriting of random offsets in the log file causing excessive
fragmentation?

Align the data writes to 16kByte or 64kByte boundaries/chunks?

Are mmap-ed files a similar problem to using a swap file and so should
the same "btrfs file swap" code be used for both?


Not looked over the code so all random guesses...

Regards,
Martin





^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16 16:32             ` Goffredo Baroncelli
@ 2014-06-16 18:47               ` Goffredo Baroncelli
  0 siblings, 0 replies; 33+ messages in thread
From: Goffredo Baroncelli @ 2014-06-16 18:47 UTC (permalink / raw)
  Cc: Lennart Poettering, Duncan, linux-btrfs, systemd-devel, Kai Krakow

Hi all,

in this blog post [1] I collected the results of the tests I performed to investigate this performance problem between systemd and btrfs. I had to put the results in a blog post because there are several images. Below is a brief summary.

I took an old machine (a P4 2.5GHz with 512MB of RAM) with a fresh installation of Fedora 20, and I measured the boot time over several reboots (up to 70).  I tested the following scenarios:

1) standard (without defragmenting any file, without readahead)
2) defragment the journal file at the end of the boot
3) defragment the journal file before flushing
4) mark the journald log file as NOCOW
5) enable systemd-readahead
6) remove the fsync(2) call from journald
7) remove the posix_fallocate(3) call from journald
8) do a defrag when posix_fallocate(3) is called
9) do a defrag when the journal log file is opened

Test #1 highlights the problem: the boot time may reach up to 50 seconds. Across the reboots the number of extents of the system.journal file increases up to 6000. Defragmenting the system.journal file decreases the boot time by ~20 seconds. My conclusion is that on BTRFS the fragmentation of this file increases the boot time.
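(The extent counts can be measured via the FIEMAP ioctl, the same
interface filefrag(8) uses; a minimal counter sketch, error handling
trimmed:)

/* Sketch: count a file's extents via FIEMAP;
 * fm_extent_count = 0 asks the kernel to only count them. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
        struct fiemap fm;
        int fd;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
                return 1;
        memset(&fm, 0, sizeof(fm));
        fm.fm_length = ~0ULL;           /* map the whole file */
        if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0)
                return 1;
        printf("%s: %u extents\n", argv[1], fm.fm_mapped_extents);
        close(fd);
        return 0;
}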

Tests #6 and #7 suggest that the fsync(2) and posix_fallocate(3) calls aren't the root cause of the problem. Even without these calls the system.journal file still fragments.

Test #4 suggests that marking the system.journal file NOCOW reduces its fragmentation and therefore the boot time.
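(For reference, the NOCOW marking of test #4 can also be done
programmatically, but only while the file is still empty; a minimal
sketch of the equivalent of "chattr +C":)

/* Sketch: mark a freshly created, still empty file NOCOW,
 * the programmatic equivalent of "chattr +C". */
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_NOCOW_FL */

static int mark_nocow(int fd)
{
        unsigned int flags;

        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
                return -1;
        flags |= FS_NOCOW_FL;   /* only effective while the file is empty */
        return ioctl(fd, FS_IOC_SETFLAGS, &flags);
}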

Tests #2, #3, and #9 suggest that performing a periodic defrag significantly reduces the fragmentation of system.journal and therefore the boot time.

Test #5 revealed that systemd's readahead capability was not effective: it seems that the system.journal file was unaffected (though other *.journal files were). Further investigation is required.

BR
G.Baroncelli



[1] http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html

On 06/16/2014 06:32 PM, Goffredo Baroncelli wrote:
> Hi Lennart,
> 
> On 06/16/2014 12:13 AM, Lennart Poettering wrote:
>> I am not really following why this trips up btrfs though. I am
>> not sure I understand why this breaks btrfs COW behaviour. I mean,
>> fallocate() isn't necessarily supposed to write anything really, it's
>> mostly about allocating disk space in advance. I would claim that
>> journald's usage of it is very much within the entire reason why it
>> exists...
> 
> I performed several tests, trying different setups [1]. One of these was replacing the posix_fallocate() call with a truncate, to check where the problem is. The conclusion was that *posix_fallocate() is NOT the problem*.
> 
> In another reply you stated that systemd-journald appends some data at the end of the file, then updates some data in the middle. I think this is why the file quickly becomes fragmented.
> 
> 
> [1] Let me revise the English, then I will post the results.
> 
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-15 22:13           ` Lennart Poettering
  2014-06-16  0:17             ` Russell Coker
@ 2014-06-16 16:32             ` Goffredo Baroncelli
  2014-06-16 18:47               ` Goffredo Baroncelli
  2014-06-19  1:13             ` Dave Chinner
  2 siblings, 1 reply; 33+ messages in thread
From: Goffredo Baroncelli @ 2014-06-16 16:32 UTC (permalink / raw)
  To: Lennart Poettering; +Cc: Duncan, linux-btrfs, systemd-devel

Hi Lennart,

On 06/16/2014 12:13 AM, Lennart Poettering wrote:
> I am not really following why this trips up btrfs though. I am
> not sure I understand why this breaks btrfs COW behaviour. I mean,
> fallocate() isn't necessarily supposed to write anything really, it's
> mostly about allocating disk space in advance. I would claim that
> journald's usage of it is very much within the entire reason why it
> exists...

I performed several tests, trying different setups [1]. One of these was replacing the posix_fallocate() call with a truncate, to check where the problem is. The conclusion was that *posix_fallocate() is NOT the problem*.

In another reply you stated that systemd-journald appends some data at the end of the file, then updates some data in the middle. I think this is why the file quickly becomes fragmented.


[1] Let me revise the English, then I will post the results.


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16 10:14               ` Lennart Poettering
  2014-06-16 10:35                 ` Russell Coker
  2014-06-16 11:56                 ` Andrey Borzenkov
@ 2014-06-16 16:05                 ` Josef Bacik
  2014-06-16 19:52                   ` Martin
  2014-06-17 21:12                   ` Lennart Poettering
  2 siblings, 2 replies; 33+ messages in thread
From: Josef Bacik @ 2014-06-16 16:05 UTC (permalink / raw)
  To: Lennart Poettering, Russell Coker
  Cc: kreijack, Duncan, linux-btrfs, systemd-devel



On 06/16/2014 03:14 AM, Lennart Poettering wrote:
> On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote:
>
>>> I am not really following why this trips up btrfs though. I am
>>> not sure I understand why this breaks btrfs COW behaviour. I mean,
>>> fallocate() isn't necessarily supposed to write anything really, it's
>>> mostly about allocating disk space in advance. I would claim that
>>> journald's usage of it is very much within the entire reason why it
>>> exists...
>>
>> I don't believe that fallocate() makes any difference to fragmentation on
>> BTRFS.  Blocks will be allocated when writes occur so regardless of an
>> fallocate() call the usage pattern in systemd-journald will cause
>> fragmentation.
>
> journald's write pattern looks something like this: append something to
> the end, make sure it is written, then update a few offsets stored at
> the beginning of the file to point to the newly appended data. This is
> of course not easy to handle for COW file systems. But then again, it's
> probably not too different from access patterns of other database or
> database-like engines...

Was waiting for you to show up before I said anything since most systemd 
related emails always devolve into how evil you are rather than what is 
actually happening.

So you are doing all the right things from what I can tell; I'm just a 
little confused about when you guys run fsync.  From what I can tell 
it's only when you open the journal file and when you switch it to 
"offline."  I didn't look too much past this point so I don't know how 
often these things happen.  Are you taking an individual message, 
writing it, updating the head of the file and then fsync'ing?  Or are 
you getting a good bit of dirty log data and fsync'ing occasionally?

What would cause btrfs problems is if you fallocate(), write a small 
chunk, fsync, write a small chunk again, fsync again, etc.  Fallocate 
saves you the first write around, but if the next write is within the 
same block as the previous write we'll end up triggering COW and enter 
fragmented territory.  If this is what journald is doing then that 
would be good to know; if not, I'd like to know what is happening 
since we shouldn't be fragmenting this badly.
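To make the suspected pattern concrete, a reproducer along these lines
might look like this (the path and sizes are arbitrary assumptions, not
journald's actual parameters):

/* Sketch of the pattern suspected to fragment btrfs: fallocate a
 * region, then do many small writes, each followed by fsync, so
 * consecutive writes keep landing in an already-written block. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define FILE_SIZE (8 * 1024 * 1024)     /* 8 MiB, like a journal file */
#define CHUNK 512                       /* small, sub-block appends */

int main(void)
{
        char buf[CHUNK];
        int fd = open("/mnt/btrfs/frag-test", O_CREAT | O_RDWR, 0644);

        if (fd < 0)
                return 1;
        memset(buf, 'x', sizeof(buf));
        if (posix_fallocate(fd, 0, FILE_SIZE) != 0)
                return 1;
        for (off_t off = 0; off < FILE_SIZE; off += CHUNK) {
                if (pwrite(fd, buf, CHUNK, off) != (ssize_t) CHUNK)
                        return 1;
                fsync(fd);      /* each sync can force a new COW of the block */
        }
        close(fd);
        return 0;
}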

Like I said, what you guys are doing is fine; if btrfs falls on its 
face then it's not your fault.  I'd just like an exact idea of when 
you guys are fsync'ing so I can replicate it in a smaller way.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16 10:14               ` Lennart Poettering
  2014-06-16 10:35                 ` Russell Coker
@ 2014-06-16 11:56                 ` Andrey Borzenkov
  2014-06-16 16:05                 ` Josef Bacik
  2 siblings, 0 replies; 33+ messages in thread
From: Andrey Borzenkov @ 2014-06-16 11:56 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Russell Coker, SystemD Devel, kreijack, Duncan, linux-btrfs

On Mon, Jun 16, 2014 at 2:14 PM, Lennart Poettering
<lennart@poettering.net> wrote:
> On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote:
>
>> > I am not really following why this trips up btrfs though. I am
>> > not sure I understand why this breaks btrfs COW behaviour. I mean,
>> > fallocate() isn't necessarily supposed to write anything really, it's
>> > mostly about allocating disk space in advance. I would claim that
>> > journald's usage of it is very much within the entire reason why it
>> > exists...
>>
>> I don't believe that fallocate() makes any difference to fragmentation on
>> BTRFS.  Blocks will be allocated when writes occur so regardless of an
>> fallocate() call the usage pattern in systemd-journald will cause
>> fragmentation.
>
> journald's write pattern looks something like this: append something to
> the end, make sure it is written, then update a few offsets stored at
> the beginning of the file to point to the newly appended data. This is
> of course not easy to handle for COW file systems. But then again, it's
> probably not too different from access patterns of other database or
> database-like engines...
>

... which traditionally experienced severe sequential read performance
degradation in this case. As I understand it, this is exactly what happens
- readahead attempts to preload the files, which gives us heavy random
read access.

The only real remedy was to defragment the files. It should work
relatively well for the journal, where files are mostly "write once", at
the expense of additional read/write activity.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16 10:35                 ` Russell Coker
@ 2014-06-16 11:16                   ` Austin S Hemmelgarn
  0 siblings, 0 replies; 33+ messages in thread
From: Austin S Hemmelgarn @ 2014-06-16 11:16 UTC (permalink / raw)
  To: russell, Lennart Poettering; +Cc: kreijack, Duncan, linux-btrfs, systemd-devel

On 2014-06-16 06:35, Russell Coker wrote:
> On Mon, 16 Jun 2014 12:14:49 Lennart Poettering wrote:
>> On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote:
>>>> I am not really following why this trips up btrfs though. I am
>>>> not sure I understand why this breaks btrfs COW behaviour. I mean,
>>>> fallocate() isn't necessarily supposed to write anything really, it's
>>>> mostly about allocating disk space in advance. I would claim that
>>>> journald's usage of it is very much within the entire reason why it
>>>> exists...
>>>
>>> I don't believe that fallocate() makes any difference to fragmentation on
>>> BTRFS.  Blocks will be allocated when writes occur so regardless of an
>>> fallocate() call the usage pattern in systemd-journald will cause
>>> fragmentation.
>>
>> journald's write pattern looks something like this: append something to
>> the end, make sure it is written, then update a few offsets stored at
>> the beginning of the file to point to the newly appended data. This is
>> of course not easy to handle for COW file systems. But then again, it's
>> probably not too different from access patterns of other database or
>> database-like engines...
> 
> Not being too different from the access patterns of other databases means 
> having all the same problems as other databases...  Oracle is now selling ZFS 
> servers specifically designed for running the Oracle database, but that's with 
> "hybrid storage" flash (ZIL and L2ARC on SSD).  While BTRFS doesn't support 
> features equivalent to ZIL and L2ARC, it's easy to run a separate filesystem 
> on SSD for things that need performance (few if any current BTRFS users would 
> have a database too big to fit entirely on an SSD).
> 
> The problem we are dealing with is "database-like" access patterns on systems 
> that are not designed as database servers.
> 
> Would it be possible to get an interface for defragmenting files that's not 
> specific to BTRFS?  If we had a standard way of doing this then systemd-
> journald could request a defragment of the file at appropriate times.
> 
While this is a wonderful idea, what about all the extra I/O this will
cause (and all the extra wear on SSDs)?  While I understand wanting
this to be faster, you should also consider that defragmenting the
file on a regular basis is going to trash performance for other
applications.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16 10:14               ` Lennart Poettering
@ 2014-06-16 10:35                 ` Russell Coker
  2014-06-16 11:16                   ` Austin S Hemmelgarn
  2014-06-16 11:56                 ` Andrey Borzenkov
  2014-06-16 16:05                 ` Josef Bacik
  2 siblings, 1 reply; 33+ messages in thread
From: Russell Coker @ 2014-06-16 10:35 UTC (permalink / raw)
  To: Lennart Poettering; +Cc: kreijack, Duncan, linux-btrfs, systemd-devel

On Mon, 16 Jun 2014 12:14:49 Lennart Poettering wrote:
> On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote:
> > > I am not really following why this trips up btrfs though. I am
> > > not sure I understand why this breaks btrfs COW behaviour. I mean,
> > > fallocate() isn't necessarily supposed to write anything really, it's
> > > mostly about allocating disk space in advance. I would claim that
> > > journald's usage of it is very much within the entire reason why it
> > > exists...
> > 
> > I don't believe that fallocate() makes any difference to fragmentation on
> > BTRFS.  Blocks will be allocated when writes occur so regardless of an
> > fallocate() call the usage pattern in systemd-journald will cause
> > fragmentation.
> 
> journald's write pattern looks something like this: append something to
> the end, make sure it is written, then update a few offsets stored at
> the beginning of the file to point to the newly appended data. This is
> of course not easy to handle for COW file systems. But then again, it's
> probably not too different from access patterns of other database or
> database-like engines...

Not being too different from the access patterns of other databases means 
having all the same problems as other databases...  Oracle is now selling ZFS 
servers specifically designed for running the Oracle database, but that's with 
"hybrid storage" flash (ZIL and L2ARC on SSD).  While BTRFS doesn't support 
features equivalent to ZIL and L2ARC, it's easy to run a separate filesystem 
on SSD for things that need performance (few if any current BTRFS users would 
have a database too big to fit entirely on an SSD).

The problem we are dealing with is "database-like" access patterns on systems 
that are not designed as database servers.

Would it be possible to get an interface for defragmenting files that's not 
specific to BTRFS?  If we had a standard way of doing this then systemd-
journald could request a defragment of the file at appropriate times.
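For what it's worth, the btrfs-specific call that such a generic
interface would have to wrap looks roughly like this (a sketch, assuming
BTRFS_IOC_DEFRAG from linux/btrfs.h; error handling trimmed):

/* Sketch: ask btrfs to defragment one file; this is the
 * filesystem-specific ioctl a generic interface would abstract. */
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>        /* assumed to provide BTRFS_IOC_DEFRAG */

static int defrag_file(const char *path)
{
        int r, fd = open(path, O_RDWR);

        if (fd < 0)
                return -1;
        r = ioctl(fd, BTRFS_IOC_DEFRAG, NULL);  /* NULL: whole file */
        close(fd);
        return r;
}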

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16  0:17             ` Russell Coker
  2014-06-16  1:06               ` John Williams
@ 2014-06-16 10:14               ` Lennart Poettering
  2014-06-16 10:35                 ` Russell Coker
                                   ` (2 more replies)
  1 sibling, 3 replies; 33+ messages in thread
From: Lennart Poettering @ 2014-06-16 10:14 UTC (permalink / raw)
  To: Russell Coker; +Cc: kreijack, Duncan, linux-btrfs, systemd-devel

On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote:

> > I am not really following why this trips up btrfs though. I am
> > not sure I understand why this breaks btrfs COW behaviour. I mean,
> > fallocate() isn't necessarily supposed to write anything really, it's
> > mostly about allocating disk space in advance. I would claim that
> > journald's usage of it is very much within the entire reason why it
> > exists...
> 
> I don't believe that fallocate() makes any difference to fragmentation on 
> BTRFS.  Blocks will be allocated when writes occur so regardless of an 
> fallocate() call the usage pattern in systemd-journald will cause 
> fragmentation.

journald's write pattern looks something like this: append something to
the end, make sure it is written, then update a few offsets stored at
the beginning of the file to point to the newly appended data. This is
of course not easy to handle for COW file systems. But then again, it's
probably not too different from access patterns of other database or
database-like engines...
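As a rough illustration of that pattern (the names and the header layout
here are invented, this is not the actual journald code):

/* Illustration only: append a record, make sure it is on disk, then
 * point a header field at it. */
#include <stdint.h>
#include <unistd.h>

#define HDR_TAIL_OFFSET 8       /* imaginary "tail offset" header field */

static int append_record(int fd, const void *rec, size_t len, uint64_t tail)
{
        uint64_t new_tail = tail + len;

        if (pwrite(fd, rec, len, (off_t) tail) != (ssize_t) len)
                return -1;              /* 1. append at the end */
        if (fsync(fd) < 0)
                return -1;              /* 2. make sure it is written */
        if (pwrite(fd, &new_tail, sizeof(new_tail), HDR_TAIL_OFFSET)
            != (ssize_t) sizeof(new_tail))
                return -1;              /* 3. update the front offsets */
        return 0;
}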

Lennart

-- 
Lennart Poettering, Red Hat

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16  1:06               ` John Williams
@ 2014-06-16  2:19                 ` Russell Coker
  0 siblings, 0 replies; 33+ messages in thread
From: Russell Coker @ 2014-06-16  2:19 UTC (permalink / raw)
  To: John Williams
  Cc: Lennart Poettering, kreijack, Duncan, Btrfs BTRFS, systemd-devel

On Sun, 15 Jun 2014 18:06:34 John Williams wrote:
> Why does it take 2 minutes? On XFS or ext4, fallocate is almost
> instantaneous, even for multi-Terabyte allocations.
> 
> According to the fallocate man page, preallocation should be quick and
> require no IO:
> 
> "      fallocate  is  used  to manipulate the allocated disk space for a
> file, either to deallocate or preallocate it. For filesystems  which 
> support the  fallocate system call, preallocation is done quickly by
> allocating blocks and marking them as uninitialized, requiring no IO to 
> the  data blocks.  This  is  much  faster than creating a file by filling
> it with zeros."

No IO to the data blocks, but there is IO to metadata.

But I think that BTRFS may need some optimisation for such things.  While 
fallocate() on 24G is probably a very unusual case, it will probably matter to 
some people (I can imagine scientific computing needing it), and it's likely 
that much smaller fallocate() calls also take longer than desired.

The issue was system CPU time: extending the file in that test proceeded 
at a speed of about 200MB/s of allocated space, while the system was writing 
something less than 2MB/s to the device (sometimes it went for 10+ seconds 
without writing any data).  The SSD in question can sustain about 200MB/s of 
writes, so in that case the BTRFS speed for allocating disk space was about 
equal to the speed at which it should be able to write real data.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16  0:17             ` Russell Coker
@ 2014-06-16  1:06               ` John Williams
  2014-06-16  2:19                 ` Russell Coker
  2014-06-16 10:14               ` Lennart Poettering
  1 sibling, 1 reply; 33+ messages in thread
From: John Williams @ 2014-06-16  1:06 UTC (permalink / raw)
  To: russell; +Cc: Lennart Poettering, kreijack, Duncan, Btrfs BTRFS, systemd-devel

On Sun, Jun 15, 2014 at 5:17 PM, Russell Coker <russell@coker.com.au> wrote:
> I just did some tests using fallocate(1).  I did the tests both with and
> without the -n option which appeared to make no difference.
>
> I started by allocating a 24G file on a 106G filesystem that had 30G free
> according to df.  The first time that took almost 2 minutes of system CPU time
> on a Q8400 CPU.

Why does it take 2 minutes? On XFS or ext4, fallocate is almost
instantaneous, even for multi-Terabyte allocations.

According to the fallocate man page, preallocation should be quick and
require no IO:

"      fallocate  is  used  to manipulate the allocated disk space for a file,
       either to deallocate or preallocate it. For filesystems  which  support
       the  fallocate system call, preallocation is done quickly by allocating
       blocks and marking them as uninitialized, requiring no IO to  the  data
       blocks.  This  is  much  faster than creating a file by filling it with
       zeros."

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-15 22:13           ` Lennart Poettering
@ 2014-06-16  0:17             ` Russell Coker
  2014-06-16  1:06               ` John Williams
  2014-06-16 10:14               ` Lennart Poettering
  2014-06-16 16:32             ` Goffredo Baroncelli
  2014-06-19  1:13             ` Dave Chinner
  2 siblings, 2 replies; 33+ messages in thread
From: Russell Coker @ 2014-06-16  0:17 UTC (permalink / raw)
  To: Lennart Poettering; +Cc: kreijack, Duncan, linux-btrfs, systemd-devel

On Mon, 16 Jun 2014 00:13:07 Lennart Poettering wrote:
> On Sat, 14.06.14 09:52, Goffredo Baroncelli (kreijack@libero.it) wrote:
> > > Which effectively means that by the time the 8 MiB is filled, each 4 KiB
> > > block has been rewritten to a new location and is now an extent unto
> > > itself.  So now that 8 MiB is composed of 2048 new extents, each one a
> > > single 4 KiB block in size.
> > 
>> Several people pointed to fallocate as the problem. But I don't
> > understand the reason.
> 
> BTW, the reason we use fallocate() in journald is not about trying to
> optimize anything. It's only used for one reason: to avoid SIGBUS on
> disk/quota full, since we actually write everything to the files using
> mmap(). I mean, writing things with mmap() is always problematic, and
> handling write errors is awfully difficult, but at least two of the most
> common reasons for failure we'd like to protect against in advance, under
> the assumption that disk/quota full will be reported immediately by the
> fallocate(), and the mmap writes later on will then necessarily succeed.

I just did some tests using fallocate(1).  I did the tests both with and 
without the -n option which appeared to make no difference.

I started by allocating a 24G file on a 106G filesystem that had 30G free 
according to df.  The first time that took almost 2 minutes of system CPU time 
on a Q8400 CPU.

I then made a snapshot of the subvol and then used dd with the conv=notrunc 
option to overwrite it.  The amount of reported disk space decreased in line 
with the progress of dd.  So in the case of snapshots the space will be USED 
(not just reserved) when you call fallocate and there is no guarantee that 
space will be available when you write to it.

My systems have cron jobs to make read-only snapshots of all subvols.  On 
these systems you have no guarantee that the mmap writes will succeed - quite 
apart from the fact that the variety of problems BTRFS has when running out 
of disk space makes me more careful to avoid that on BTRFS than on other 
filesystems.

> I am not really following why this trips up btrfs though. I am
> not sure I understand why this breaks btrfs COW behaviour. I mean,
> fallocate() isn't necessarily supposed to write anything really, it's
> mostly about allocating disk space in advance. I would claim that
> journald's usage of it is very much within the entire reason why it
> exists...

I don't believe that fallocate() makes any difference to fragmentation on 
BTRFS.  Blocks will be allocated when writes occur so regardless of an 
fallocate() call the usage pattern in systemd-journald will cause 
fragmentation.

> Anyway, happy to change these things around if necessary, but first I'd
> like to have a very good explanation why fallocate() wouldn't be the
> right thing to invoke here, and a suggestion what we should do instead
> to cover this usecase...

Systemd could request that the files in question be defragmented.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-14 10:59         ` Kai Krakow
@ 2014-06-15 22:43           ` Lennart Poettering
  0 siblings, 0 replies; 33+ messages in thread
From: Lennart Poettering @ 2014-06-15 22:43 UTC (permalink / raw)
  To: Kai Krakow; +Cc: systemd-devel, linux-btrfs

On Sat, 14.06.14 12:59, Kai Krakow (hurikhan77@gmail.com) wrote:

> 
> Duncan <1i5t5.duncan@cox.net> schrieb:
> 
> > As they say, "Whoosh!"
> > 
> > At least here, I interpreted that remark as primarily sarcastic
> > commentary on the systemd devs' apparent attitude, which can be
> > (controversially) summarized as: "Systemd doesn't have problems because
> > it's perfect.  Therefore, any problems you have with systemd must instead
> > be with other components which systemd depends on."
> 
> Come on, sorry, but this is FUD. Really... ;-)

Interestingly, I never commented on anything in this area, and neither
did anybody else from the systemd side afaics. I wasn't aware of the
entire btrfs defrag thing before this thread started on the systemd ML a
few days ago. I am not sure where you get your ideas about our
"attitude" from. God, with behaviour like that you just make us ignore
you, Duncan.

Lennart

-- 
Lennart Poettering, Red Hat

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-15  5:43           ` Duncan
@ 2014-06-15 22:39             ` Lennart Poettering
  0 siblings, 0 replies; 33+ messages in thread
From: Lennart Poettering @ 2014-06-15 22:39 UTC (permalink / raw)
  To: Duncan; +Cc: systemd-devel, linux-btrfs

On Sun, 15.06.14 05:43, Duncan (1i5t5.duncan@cox.net) wrote:

> The base problem isn't fallocate per se, rather, tho it's the trigger in 
> this case.  The base problem is that for COW-based filesystems, *ANY* 
> rewriting of existing file content results in fragmentation.
> 
> It just so happens that the only reason there's existing file content to 
> be rewritten (as opposed to simply appending) in this case, is because of 
> the fallocate.  The rewrite of existing file content is the problem, but 
> the existing file content is only there in this case because of the 
> fallocate.
> 
> Taking a step back...
> 
> On a non-COW filesystem, allocating 8 MiB ahead and writing into it 
> rewrites into the already allocated location, thus guaranteeing extents 
> of 8 MiB each, since once the space is allocated it's simply rewritten in-
> place.  Thus, on a non-COW filesystem, pre-allocating in something larger 
> than single filesystem blocks when an app knows the data is eventually 
> going to be written in to fill that space anyway is a GOOD thing, which 
> is why systemd is doing it.

Nope, that's not why we do it. We do it to avoid SIGBUS on disk full...

> But on a COW-based filesystem fallocate is the exact opposite, a BAD 
> thing, because an fallocate forces the file to be written out at that 
> size, effectively filled with nulls/blanks.  Then the actual logging 
> comes along and rewrites those nulls/blanks with actual data, but it's 
> now a rewrite, which on a COW, copy-on-write, based filesystem, the 
> rewritten block is copied elsewhere, it does NOT overwrite the existing 
> null/blank block, and "elsewhere" by definition means detached from the 
> previous blocks, thus in an extent all by itself.

Well, quite frankly I am not entirely sure why fallocate() would be of
any use like that on COW file systems, if this is really how it is
implemented... I mean, as I understood fallocate() -- and as the man
page suggests -- it is something for reserving space on disk, not for
writing out anything. This is why journald is invoking it, to reserve
the space, so that later write accesses to it will not require any
reservation anymore, and hence are unlikely to fail.

Lennart

-- 
Lennart Poettering, Red Hat

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-14  7:52         ` Goffredo Baroncelli
  2014-06-15  5:43           ` Duncan
@ 2014-06-15 22:13           ` Lennart Poettering
  2014-06-16  0:17             ` Russell Coker
                               ` (2 more replies)
  1 sibling, 3 replies; 33+ messages in thread
From: Lennart Poettering @ 2014-06-15 22:13 UTC (permalink / raw)
  To: kreijack; +Cc: Duncan, linux-btrfs, systemd-devel

On Sat, 14.06.14 09:52, Goffredo Baroncelli (kreijack@libero.it) wrote:

> > Which effectively means that by the time the 8 MiB is filled, each 4 KiB 
> > block has been rewritten to a new location and is now an extent unto 
> > itself.  So now that 8 MiB is composed of 2048 new extents, each one a 
> > single 4 KiB block in size.
> 
> Several people pointed to fallocate as the problem. But I don't
> understand the reason.

BTW, the reason we use fallocate() in journald is not about trying to
optimize anything. It's only used for one reason: to avoid SIGBUS on
disk/quota full, since we actually write everything to the files using
mmap(). I mean, writing things with mmap() is always problematic, and
handling write errors is awfully difficult, but at least two of the most
common reasons for failure we'd like to protect against in advance, under
the assumption that disk/quota full will be reported immediately by the
fallocate(), and the mmap writes later on will then necessarily succeed.
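Schematically, the idea is something like this (a sketch of the
mechanism under my assumptions, not the journald source; the window
size is arbitrary and the caller still checks mmap() against
MAP_FAILED):

/* Sketch: reserve the file space up front so that later stores through
 * the mmap'ed window should not SIGBUS on disk/quota full. */
#include <fcntl.h>
#include <sys/mman.h>

#define WINDOW_SIZE (8 * 1024 * 1024)

static void *map_log_window(int fd)
{
        /* disk/quota full is reported here, in advance... */
        if (posix_fallocate(fd, 0, WINDOW_SIZE) != 0)
                return NULL;
        /* ...so plain memory writes into the mapping can then succeed */
        return mmap(NULL, WINDOW_SIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
}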

I am not really following why this trips up btrfs though. I am
not sure I understand why this breaks btrfs COW behaviour. I mean,
fallocate() isn't necessarily supposed to write anything really, it's
mostly about allocating disk space in advance. I would claim that
journald's usage of it is very much within the entire reason why it
exists...

Anyway, happy to change these things around if necessary, but first I'd
like to have a very good explanation why fallocate() wouldn't be the
right thing to invoke here, and a suggestion what we should do instead
to cover this usecase...

Lennart

-- 
Lennart Poettering, Red Hat

^ permalink raw reply	[flat|nested] 33+ messages in thread
