* Slow startup of systemd-journal on BTRFS
@ 2014-06-11 21:28 Goffredo Baroncelli
  2014-06-12  0:40 ` Chris Murphy
  ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Goffredo Baroncelli @ 2014-06-11 21:28 UTC (permalink / raw)
To: systemd Mailing List, linux-btrfs

Hi all,

I would like to share my experience with a slowness of systemd when used on BTRFS.

My boot time was very high (about 50 seconds); most of it was due to NetworkManager, which took about 30-40 seconds to start (this data came from "systemd-analyze plot").

I made several attempts to address this issue. I also noticed that sometimes the problem disappeared, but I was never able to understand why.

However this link

  https://bugzilla.redhat.com/show_bug.cgi?id=1006386

suggested to me that the problem could be due to a bad interaction between systemd and btrfs; NetworkManager was innocent. It seems that systemd-journal creates very highly fragmented files when it stores its logs, and BTRFS is known to behave slowly when a file is highly fragmented. This caused a slow startup of systemd-journal, which in turn blocked the services that depend on the logging system.

In fact, after I defragmented the files under /var/log/journal [*], my boot time decreased by about 20 seconds (from 50s to 30s).

Unfortunately I don't have any data to show. The next time I will try to collect more information. But I am quite sure that when the logs are highly fragmented, systemd-journal becomes very slow on BTRFS.

I don't know if the problem is more on the systemd side or the btrfs side. What I know is that both projects will likely be important in the near future, and they must work well together.

I know that I can use "chattr +C" to avoid COW for some files, but I don't want to lose the checksum protection either.

If someone can suggest how to FRAGMENT a log file, I can try to collect more scientific data.
BR
G.Baroncelli

[*] # btrfs fi defrag /var/log/journal/*/*

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: Slow startup of systemd-journal on BTRFS
  2014-06-11 21:28 Slow startup of systemd-journal on BTRFS Goffredo Baroncelli
@ 2014-06-12  0:40 ` Chris Murphy
  2014-06-12  1:18 ` Russell Coker
  2014-06-12  1:21 ` Dave Chinner
  2 siblings, 0 replies; 33+ messages in thread
From: Chris Murphy @ 2014-06-12  0:40 UTC (permalink / raw)
To: kreijack, Goffredo Baroncelli; +Cc: systemd Mailing List, linux-btrfs

On Jun 11, 2014, at 3:28 PM, Goffredo Baroncelli <kreijack@libero.it> wrote:
>
> If someone is able to suggest me how FRAGMENT the log file, I can try to collect more scientific data.

So long as you're not using compression, filefrag will show you the fragments of systemd-journald journals.

I can vouch for the behavior you experience without xattr +C or autodefrag, but it also causes much slowness when reading journal contents. Like if I want to search all boots for a particular error message to see how far back it started, this takes quite a bit longer than on other file systems.

So far I'm not experiencing this problem with autodefrag, or any other negative side effects, but my understanding is this code is still in flux.

Since the journals have their own checksumming, I'm not overly concerned about setting xattr +C.

Chris Murphy

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: Slow startup of systemd-journal on BTRFS
  2014-06-11 21:28 Slow startup of systemd-journal on BTRFS Goffredo Baroncelli
  2014-06-12  0:40 ` Chris Murphy
@ 2014-06-12  1:18 ` Russell Coker
  2014-06-12  4:39   ` Duncan
  2014-06-12  1:21 ` Dave Chinner
  2 siblings, 1 reply; 33+ messages in thread
From: Russell Coker @ 2014-06-12  1:18 UTC (permalink / raw)
To: kreijack; +Cc: systemd Mailing List, linux-btrfs

On Wed, 11 Jun 2014 23:28:54 Goffredo Baroncelli wrote:
> https://bugzilla.redhat.com/show_bug.cgi?id=1006386
>
> suggested me that the problem could be due to a bad interaction between
> systemd and btrfs. NetworkManager was innocent. It seems that
> systemd-journal create a very hight fragmented files when it stores its
> log. And BTRFS it is know to behave slowly when a file is highly
> fragmented. This had caused a slow startup of systemd-journal, which in
> turn had blocked the services which depend by the loggin system.

On my BTRFS/systemd systems I edit /etc/systemd/journald.conf and put "SystemMaxUse=50M". That doesn't solve the fragmentation problem, but it reduces it enough that it doesn't bother me.

--
My Main Blog       http://etbe.coker.com.au/
My Documents Blog  http://doc.coker.com.au/

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: Slow startup of systemd-journal on BTRFS
  2014-06-12  1:18 ` Russell Coker
@ 2014-06-12  4:39   ` Duncan
  0 siblings, 0 replies; 33+ messages in thread
From: Duncan @ 2014-06-12  4:39 UTC (permalink / raw)
To: linux-btrfs; +Cc: systemd-devel

Russell Coker posted on Thu, 12 Jun 2014 11:18:37 +1000 as excerpted:

> On Wed, 11 Jun 2014 23:28:54 Goffredo Baroncelli wrote:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1006386
>>
>> suggested me that the problem could be due to a bad interaction between
>> systemd and btrfs. NetworkManager was innocent. It seems that
>> systemd-journal create a very hight fragmented files when it stores its
>> log. And BTRFS it is know to behave slowly when a file is highly
>> fragmented. This had caused a slow startup of systemd-journal, which in
>> turn had blocked the services which depend by the loggin system.
>
> On my BTRFS/systemd systems I edit /etc/systemd/journald.conf and put
> "SystemMaxUse=50M". That doesn't solve the fragmentation problem but
> reduces it enough that it doesn't bother me.

FWIW, as a relatively new switcher to systemd, that is, after switching to btrfs only a year or so ago... Two comments:

1) Having seen a few reports of journald's journal fragmentation on this list, I was worried about those journals here as well. My solution to both this problem and to an unrelated frustration with journald [1] was to:

a) Confine journald to only a volatile (memory-only) log, first on a temporary basis while I was only experimenting with and setting up systemd (using the kernel command line's init= to point at systemd while /sbin/init still pointed at sysv's init for openrc), then later permanently, once I got enough of my systemd and journald config set up to actually switch to it.

b) Configure my former syslogger (syslog-ng, in my case) to continue in that role under systemd, with journald relaying to it for non-volatile logging.
Here are the /etc/systemd/journald.conf changes I ended up with to accomplish (a); see the journald.conf(5) manpage for the documentation, as well as the explanation below:

Storage=volatile
RuntimeMaxUse=448M
RuntimeKeepFree=48M
RuntimeMaxFileSize=64M

Storage=volatile is the important one. As the manpage notes, that means journald stores files under /run/log/journal only, where /run is normally set up by systemd as a tmpfs mount, so these files are tmpfs and thus memory-only.

The other three must be read in the context of a 512 MiB /run on tmpfs [2]. From that and the information in the journald.conf manpage, it should be possible to see what my setup is (Runtime* settings apply to the volatile files under /run): an individual journal file size (MaxFileSize) of 64 MiB, with seven such files in rotation (the default if MaxFileSize is unset is eight), totaling 448 MiB (MaxUse; the default is 10% of the filesystem, too small here since the journals are basically the only thing taking space). On a 512 MiB filesystem, that leaves 64 MiB for other uses (pretty much all 0-byte lock and pidfiles; IIRC I was running something like a 2 MiB /run before systemd without issue).

It's worth noting that UNLIKE MaxUse, which triggers journal file rotation when hit, hitting KeepFree forces journald to stop journaling entirely -- *NOT* just to stop writing the files here, but to stop forwarding to syslog (syslog-ng here) as well. I FOUND THIS OUT THE HARD WAY! Thus, in order to keep journald still functional, make sure journald runs into the MaxUse limit before it runs into KeepFree. The KeepFree default is 15% of the filesystem, just under 77 MiB on a 512 MiB filesystem, which is why I found this out the hard way with settings that would otherwise keep only 64 MiB free. The 48 MiB setting I chose leaves 16 MiB of room for other files before journald shuts down journaling, which should be plenty, since under normal circumstances the other files should all be 0-byte lock and pidfiles.
Just in case, however, there's still 48 MiB of room for other files after journald shuts down, before the filesystem itself fills up.

Configuring the syslogger to work with journald is "left as an exercise for the reader", as they say, since for all I know the OP is using something other than the syslog-ng I'm familiar with anyway. But if hints for syslog-ng are needed too, let me know. =:^)

2) Someone else mentioned btrfs' autodefrag mount option. Given #1 above I've obviously not had a lot of experience with journald logs and autodefrag, but based on all I know about btrfs fragmentation behavior as well as journald journal file behavior from this list, as long as journald's non-volatile files are kept significantly under 1 GiB, and preferably under half a GiB each, it shouldn't be a problem, with a /possible/ exception if you get something run-away-journaling multiple messages a second for a reasonably long period, such that the I/O can't keep up with both the journaling and the autodefrag.

If you do choose to keep a persistent journal with autodefrag, then, I'd recommend journald.conf settings that keep individual journal files to perhaps 128 MiB each (System* settings apply to the non-volatile files under /var, in /var/log/journal/):

SystemMaxFileSize=128M

AFAIK, that wouldn't affect the total journal size and thus the number of journal files, which would remain 10% of the filesystem size by default. Alternatively, given the default 8-file rotation if MaxFileSize is unset, you could limit the total journal size to 1 GiB, for the same 128 MiB individual file size:

SystemMaxUse=1G

Of course if you want/need more control, set both and/or other settings, as I did for my volatile-only configuration above.

---
[1] Unrelated journald frustration: Being a syslog-ng user I've been accustomed to being able to pre-filter incoming messages *BEFORE* they get written to the files.
This ability has been the important bit of my runaway-log coping strategy when I have no direct way to reconfigure the source to reduce the rate at which it's spitting out "noise" that's otherwise overwhelming my logs. Unfortunately, while it seems journald has all /sorts/ of file-grep-style filtering tools to focus in like a laser on what you want to see AFTER the journal is written, I found absolutely NO documentation on setting up PREWRITE journal filters (except log-level, and there's global rate-limiting as well, but I didn't want to use those; I wanted to filter specific "noise" messages), which means runaway journaling simply runs away, and if I use the size-restriction stuff to turn it down, I quickly lose the important stuff I want to keep around when the files size-rotate due to that runaway!

Thus my solution: keep journald storage volatile only, relatively small but still big enough that I can use the great systemctl status integration to get the last few journal entries from each service, then feed it to syslog-ng to pre-write filter the noise out before the actual write to permanent storage. =:^)

[2] 512 MiB /run tmpfs. This is on a 16 GiB RAM system, so the default tmpfs size would be 8 GiB. But I have several such tmpfs, including a big /tmp that I use for scratch space when I'm building stuff (gentoo's PORTAGE_TMPDIR, and for fully in-memory DVD ISOs too), and I don't have swap configured at all, so keeping a reasonable lid on things by limiting /run and its major journal-file space user to half a GiB seems prudent.

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 33+ messages in thread
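For readers who do want the syslog-ng side, a pre-write filter of the kind described above might look like the fragment below. This is only a sketch: the source/destination names (s_src, d_messages) and the match pattern are placeholders, not taken from any actual setup in this thread.

```
# Hypothetical syslog-ng fragment: drop a known "noise" message before
# it is ever written to disk.  s_src and d_messages are placeholder
# names for an already-configured source and destination.
filter f_drop_noise {
    not match("some noisy repeated message" value("MESSAGE"));
};

log {
    source(s_src);
    filter(f_drop_noise);
    destination(d_messages);
};
```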
* Re: Slow startup of systemd-journal on BTRFS
  2014-06-11 21:28 Slow startup of systemd-journal on BTRFS Goffredo Baroncelli
  2014-06-12  0:40 ` Chris Murphy
  2014-06-12  1:18 ` Russell Coker
@ 2014-06-12  1:21 ` Dave Chinner
  2014-06-12  1:37   ` Dave Chinner
  2 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2014-06-12  1:21 UTC (permalink / raw)
To: kreijack; +Cc: systemd Mailing List, linux-btrfs

On Wed, Jun 11, 2014 at 11:28:54PM +0200, Goffredo Baroncelli wrote:
> Hi all,
>
> I would like to share a my experience about a slowness of systemd when used on BTRFS.
>
> My boot time was very high (about ~50 seconds); most of time it was due to NetworkManager which took about 30-40 seconds to start (this data came from "systemd-analyze plot").
>
> I make several attempts to address this issue. Also I noticed that sometime this problem disappeared; but I was never able to understand why.
>
> However this link
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1006386
>
> suggested me that the problem could be due to a bad interaction between systemd and btrfs. NetworkManager was innocent.

systemd has a very stupid journal write pattern. It checks if there is space in the file for the write, and if not it fallocates the small amount of space it needs (it does *4 byte* fallocate calls!) and then does the write to it. All this does is fragment the crap out of the log files because the filesystems cannot optimise the allocation patterns.

Yup, it fragments journal files on XFS, too.

http://oss.sgi.com/archives/xfs/2014-03/msg00322.html

IIRC, the systemd developers consider this a filesystem problem and so refused to change the systemd code to be nice to the filesystem allocators, even though they don't actually need to use fallocate...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread
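The write pattern Dave describes can be imitated in a few lines. This is a sketch for experimentation only, not systemd's actual code: it grows a file with a tiny fallocate() just before each append, then writes and fsyncs, which on a COW filesystem can leave every synced block in its own extent.

```python
import os
import tempfile

# Sketch (not systemd's code) of the pattern described above: a tiny
# fallocate() for just the bytes the next write needs, then the write,
# then an fsync -- repeated for every record.
fd, path = tempfile.mkstemp()
record = b"x" * 64                 # stand-in for one small journal entry
offset = 0
for _ in range(256):
    os.posix_fallocate(fd, offset, len(record))  # allocate only a few bytes
    os.pwrite(fd, record, offset)
    os.fsync(fd)                   # force each little write to disk
    offset += len(record)

final_size = os.fstat(fd).st_size
os.close(fd)
os.remove(path)
print(final_size)                  # 16384 (256 * 64)
```

On btrfs, running this and then inspecting the file with filefrag (as suggested elsewhere in the thread) is one way to see how many extents such a pattern produces.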
* Re: Slow startup of systemd-journal on BTRFS
  2014-06-12  1:21 ` Dave Chinner
@ 2014-06-12  1:37   ` Dave Chinner
  2014-06-12  2:32     ` Chris Murphy
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2014-06-12  1:37 UTC (permalink / raw)
To: kreijack; +Cc: systemd Mailing List, linux-btrfs

On Thu, Jun 12, 2014 at 11:21:04AM +1000, Dave Chinner wrote:
> On Wed, Jun 11, 2014 at 11:28:54PM +0200, Goffredo Baroncelli wrote:
> > Hi all,
> >
> > I would like to share a my experience about a slowness of systemd when used on BTRFS.
> >
> > My boot time was very high (about ~50 seconds); most of time it was due to NetworkManager which took about 30-40 seconds to start (this data came from "systemd-analyze plot").
> >
> > I make several attempts to address this issue. Also I noticed that sometime this problem disappeared; but I was never able to understand why.
> >
> > However this link
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=1006386
> >
> > suggested me that the problem could be due to a bad interaction between systemd and btrfs. NetworkManager was innocent.
>
> systemd has a very stupid journal write pattern. It checks if there
> is space in the file for the write, and if not it fallocates the
> small amount of space it needs (it does *4 byte* fallocate calls!)
> and then does the write to it. All this does is fragment the crap
> out of the log files because the filesystems cannot optimise the
> allocation patterns.
>
> Yup, it fragments journal files on XFS, too.
>
> http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
>
> IIRC, the systemd developers consider this a filesystem problem and
> so refused to change the systemd code to be nice to the filesystem
> allocators, even though they don't actually need to use fallocate...

BTW, the systemd list is subscriber-only, so they aren't going to see anything that we comment on from a cross-post to the btrfs list.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: Slow startup of systemd-journal on BTRFS
  2014-06-12  1:37 ` Dave Chinner
@ 2014-06-12  2:32   ` Chris Murphy
  2014-06-15 22:34     ` [systemd-devel] " Lennart Poettering
  0 siblings, 1 reply; 33+ messages in thread
From: Chris Murphy @ 2014-06-12  2:32 UTC (permalink / raw)
To: Dave Chinner; +Cc: kreijack, systemd Mailing List, linux-btrfs

On Jun 11, 2014, at 7:21 PM, Dave Chinner <david@fromorbit.com> wrote:

> On Wed, Jun 11, 2014 at 11:28:54PM +0200, Goffredo Baroncelli wrote:
>> Hi all,
>>
>> I would like to share a my experience about a slowness of systemd when used on BTRFS.
>>
>> My boot time was very high (about ~50 seconds); most of time it was due to NetworkManager which took about 30-40 seconds to start (this data came from "systemd-analyze plot").
>>
>> I make several attempts to address this issue. Also I noticed that sometime this problem disappeared; but I was never able to understand why.
>>
>> However this link
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=1006386
>>
>> suggested me that the problem could be due to a bad interaction between systemd and btrfs. NetworkManager was innocent.
>
> systemd has a very stupid journal write pattern. It checks if there
> is space in the file for the write, and if not it fallocates the
> small amount of space it needs (it does *4 byte* fallocate calls!)
> and then does the write to it. All this does is fragment the crap
> out of the log files because the filesystems cannot optimise the
> allocation patterns.
>
> Yup, it fragments journal files on XFS, too.
>
> http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
>
> IIRC, the systemd developers consider this a filesystem problem and
> so refused to change the systemd code to be nice to the filesystem
> allocators, even though they don't actually need to use fallocate...
>
> Cheers,
>
> Dave.
>
> --
> Dave Chinner
> david@fromorbit.com

On Jun 11, 2014, at 7:37 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> BTW, the systemd list is subscriber only, so thay aren't going to
> see anything that we comment on from a cross-post to the btrfs list.

Unless a subscriber finds something really interesting, quotes it, and cross-posts it.

Chris Murphy

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-12  2:32 ` Chris Murphy
@ 2014-06-15 22:34   ` Lennart Poettering
  2014-06-16  4:01     ` Chris Murphy
  0 siblings, 1 reply; 33+ messages in thread
From: Lennart Poettering @ 2014-06-15 22:34 UTC (permalink / raw)
To: Chris Murphy; +Cc: Dave Chinner, kreijack, systemd Mailing List, linux-btrfs

On Wed, 11.06.14 20:32, Chris Murphy (lists@colorremedies.com) wrote:

> > systemd has a very stupid journal write pattern. It checks if there
> > is space in the file for the write, and if not it fallocates the
> > small amount of space it needs (it does *4 byte* fallocate calls!)

Not really the case:

http://cgit.freedesktop.org/systemd/systemd/tree/src/journal/journal-file.c#n354

We allocate 8 MB at minimum.

> > and then does the write to it. All this does is fragment the crap
> > out of the log files because the filesystems cannot optimise the
> > allocation patterns.

Well, it would be good if you'd tell me what to do instead...

I am invoking fallocate() in advance because we write those files with mmap(), and that would of course normally trigger SIGBUS already for the most boring of reasons, such as disk full/quota full or so. Hence, before we do anything like that, we invoke fallocate() to ensure that the space is actually available... As far as I can see, that's pretty much in line with what fallocate() is supposed to be useful for; the man page says this explicitly:

    "...After a successful call to posix_fallocate(), subsequent writes
    to bytes in the specified range are guaranteed not to fail because
    of lack of disk space."

Happy to be informed that the man page is wrong.

I am also happy to change our code, if it really is the wrong thing to do. Note however that I generally favour correctness and relying on documented behaviour, instead of nebulous optimizations whose effects might change with different file systems or kernel versions...

> > Yup, it fragments journal files on XFS, too.
> >
> > http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
> >
> > IIRC, the systemd developers consider this a filesystem problem and
> > so refused to change the systemd code to be nice to the filesystem
> > allocators, even though they don't actually need to use fallocate...

What? No need to be a dick. Nobody ever pinged me about this. And yeah, I think I have a very good reason to use fallocate(). The only reason, in fact, that the man page explicitly mentions.

Lennart

--
Lennart Poettering, Red Hat

^ permalink raw reply	[flat|nested] 33+ messages in thread
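The fallocate-before-mmap rationale above can be sketched in a few lines. This is an illustration only, not systemd's code: reserve the space first so that writes through the mapping cannot hit SIGBUS merely because the file is short or the disk fills up mid-write.

```python
import mmap
import os
import tempfile

# Sketch of the pattern described above: posix_fallocate() first, so
# mmap()ed writes into the reserved range cannot fail for lack of space.
SIZE = 8 * 1024 * 1024            # the 8 MB minimum allocation mentioned above

fd, path = tempfile.mkstemp()
os.posix_fallocate(fd, 0, SIZE)   # extends the file and reserves blocks

m = mmap.mmap(fd, SIZE)           # would raise if the file were still empty
m[0:8] = b"LPKSHHRH"              # systemd journal files start with this magic
m.flush()

header = bytes(m[0:8])
m.close()
os.close(fd)
os.remove(path)
```

Without the fallocate() step, mapping SIZE bytes of a zero-length file fails outright, and writing past the allocated end of a sparse mapping is exactly where SIGBUS would appear.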
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-15 22:34 ` [systemd-devel] " Lennart Poettering
@ 2014-06-16  4:01   ` Chris Murphy
  2014-06-16  4:38     ` cwillu
  0 siblings, 1 reply; 33+ messages in thread
From: Chris Murphy @ 2014-06-16  4:01 UTC (permalink / raw)
To: Lennart Poettering
Cc: Dave Chinner, kreijack, systemd Mailing List, linux-btrfs

On Jun 15, 2014, at 4:34 PM, Lennart Poettering <lennart@poettering.net> wrote:

> On Wed, 11.06.14 20:32, Chris Murphy (lists@colorremedies.com) wrote:
>
>>> systemd has a very stupid journal write pattern. It checks if there
>>> is space in the file for the write, and if not it fallocates the
>>> small amount of space it needs (it does *4 byte* fallocate calls!)
>
> Not really the case.
>
> http://cgit.freedesktop.org/systemd/systemd/tree/src/journal/journal-file.c#n354
>
> We allocate 8mb at minimum.
>
>>> and then does the write to it. All this does is fragment the crap
>>> out of the log files because the filesystems cannot optimise the
>>> allocation patterns.
>
> Well, it would be good if you'd tell me what to do instead...
>
> I am invoking fallocate() in advance, because we write those files with
> mmap() and that of course would normally triggered SIGBUS already on the
> most boring of reasons, such as disk full/quota full or so. Hence,
> before we do anything like that, we invoke fallocate() to ensure that
> the space is actually available... As far as I can see, that pretty much
> in line with what fallocate() is supposed to be useful for, the man page
> says this explicitly:
>
> "...After a successful call to posix_fallocate(), subsequent writes
> to bytes in the specified range are guaranteed not to fail because
> of lack of disk space."
>
> Happy to be informed that the man page is wrong.
>
> I am also happy to change our code, if it really is the wrong thing to
> do. Note however that I generally favour correctness and relying on
> documented behaviour, instead of nebulous optimizations whose effects
> might change with different file systems or kernel versions...
>
>>> Yup, it fragments journal files on XFS, too.
>>>
>>> http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
>>>
>>> IIRC, the systemd developers consider this a filesystem problem and
>>> so refused to change the systemd code to be nice to the filesystem
>>> allocators, even though they don't actually need to use fallocate...
>
> What? No need to be dick. Nobody ever pinged me about this. And yeah, I
> think I have a very good reason to use fallocate(). The only reason in
> fact the man page explicitly mentions.
>
> Lennart

For what it's worth, I did not write what is attributed to me above. I was quoting Dave Chinner, and I've confirmed the original attribution correctly made it onto the systemd-devel@ list.

I don't know whether some people on this distribution list are even subscribed to systemd-devel@, so those subsequent responses likely aren't being posted to systemd-devel@ but rather to linux-btrfs@.

Chris Murphy

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
  2014-06-16  4:01 ` Chris Murphy
@ 2014-06-16  4:38   ` cwillu
  0 siblings, 0 replies; 33+ messages in thread
From: cwillu @ 2014-06-16  4:38 UTC (permalink / raw)
To: Chris Murphy
Cc: Lennart Poettering, Dave Chinner, kreijack, systemd Mailing List, linux-btrfs

Fallocate is a red herring, except insofar as it's a hint that btrfs isn't making much use of: you see the same behaviour with small writes to an mmap'ed file that's msync'ed after each write, and likewise with plain old appending small writes with an fsync after each write, with or without fallocating the file first.

Looking at the fiemap output while doing either of those, you'll see a new 4k extent being made, and then the physical location of that extent will increment until the writes move on to the next 4k extent.

cwillu@cwillu-home:~/work/btrfs/e2fs$ touch /tmp/test

>>> import mmap, os   # imports not shown in the original transcript
>>> size = 0x800000   # not shown in the original; inferred from the fiemap length below
>>> f = open('/tmp/test', 'r+')
>>> m = mmap.mmap(f.fileno(), size)
>>> for x in xrange(size):
...     m[x] = " "
...     m.flush(x / 4096 * 4096, 4096)   # msync(self->data + offset, size, MS_SYNC)
...

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 800000
fs_ioc_fiemap 3223348747d
File /tmp/test has 3 extents:
#	Logical          Physical         Length           Flags
0:	0000000000000000 0000000b3d9c0000 0000000000001000 0000
1:	0000000000001000 000000069f012000 00000000003ff000 0000
2:	0000000000400000 0000000b419d1000 0000000000400000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b3daf3000 0000000000001000 0000
1:	0000000000001000 000000069f012000 00000000003ff000 0000
2:	0000000000400000 0000000b419d1000 0000000000400000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b3dc38000 0000000000001000 0000
1:	0000000000001000 000000069f012000 00000000003ff000 0000
2:	0000000000400000 0000000b419d1000 0000000000400000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b3dc9f000 0000000000001000 0000
1:	0000000000001000 0000000b3d2b7000 0000000000001000 0000
2:	0000000000002000 000000069f013000 00000000003fe000 0000
3:	0000000000400000 0000000b419d1000 0000000000400000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b3dc9f000 0000000000001000 0000
1:	0000000000001000 0000000b3d424000 0000000000001000 0000
2:	0000000000002000 000000069f013000 00000000003fe000 0000
3:	0000000000400000 0000000b419d1000 0000000000400000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b3dc9f000 0000000000001000 0000
1:	0000000000001000 0000000b3d563000 0000000000001000 0000
2:	0000000000002000 000000069f013000 00000000003fe000 0000
3:	0000000000400000 0000000b419d1000 0000000000400000 0001

========
cwillu@cwillu-home:~/work/btrfs/e2fs$ rm /tmp/test
cwillu@cwillu-home:~/work/btrfs/e2fs$ touch /tmp/test

>>> f = open('/tmp/test', 'r+')
>>> f.truncate(size)
>>> m = mmap.mmap(f.fileno(), size)
>>> for x in xrange(size):
...     m[x] = " "
...     m.flush(x / 4096 * 4096, 4096)   # msync(self->data + offset, size, MS_SYNC)
...

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 800000
fs_ioc_fiemap 3223348747d
File /tmp/test has 1 extents:
#	Logical          Physical         Length           Flags
0:	0000000000000000 0000000b47f11000 0000000000001000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b48006000 0000000000001000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b48183000 0000000000001000 0000
1:	0000000000001000 0000000b48255000 0000000000001000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b48183000 0000000000001000 0000
1:	0000000000001000 0000000b48353000 0000000000001000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b48183000 0000000000001000 0000
1:	0000000000001000 0000000b493ed000 0000000000001000 0000
2:	0000000000002000 0000000b4a68f000 0000000000001000 0000
3:	0000000000003000 0000000b4b36f000 0000000000001000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0:	0000000000000000 0000000b48183000 0000000000001000 0000
1:	0000000000001000 0000000b493ed000 0000000000001000 0000
2:	0000000000002000 0000000b4a68f000 0000000000001000 0000
3:	0000000000003000 0000000b4b4cf000 0000000000001000 0001

========

cwillu@cwillu-home:~/work/btrfs/e2fs$ rm /tmp/test
cwillu@cwillu-home:~/work/btrfs/e2fs$ touch /tmp/test

>>> f = open('/tmp/test', 'r+')
>>> for x in xrange(size):
...     f.write(' ')
...     f.flush()
...     os.fdatasync(f.fileno())
...

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 395
fs_ioc_fiemap 3223348747d
File /tmp/test has 1 extents:
#	Logical          Physical         Length           Flags
0:	0000000000000000 0000000000000000 0000000000001000 0301

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 5b5
0:	0000000000000000 0000000000000000 0000000000001000 0301

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 1d61
0:	0000000000000000 0000000b4d2bc000 0000000000001000 0000
1:	0000000000001000 0000000b4e1c5000 0000000000001000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 1e8c
0:	0000000000000000 0000000b4d2bc000 0000000000001000 0000
1:	0000000000001000 0000000b4e334000 0000000000001000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 207c
0:	0000000000000000 0000000b4d2bc000 0000000000001000 0000
1:	0000000000001000 0000000b4e4a9000 0000000000001000 0000
2:	0000000000002000 0000000b4e528000 0000000000001000 0001

cwillu@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 21be
0:	0000000000000000 0000000b4d2bc000 0000000000001000 0000
1:	0000000000001000 0000000b4e4a9000 0000000000001000 0000
2:	0000000000002000 0000000b4e66c000 0000000000001000 0001

^ permalink raw reply	[flat|nested] 33+ messages in thread
* R: Re: Slow startup of systemd-journal on BTRFS @ 2014-06-12 11:13 Goffredo Baroncelli <kreijack@libero.it> 2014-06-12 12:37 ` Duncan 0 siblings, 1 reply; 33+ messages in thread From: Goffredo Baroncelli <kreijack@libero.it> @ 2014-06-12 11:13 UTC (permalink / raw) To: david; +Cc: systemd Mailing List, linux-btrfs >Da: david@fromorbit.com >Data: 12/06/2014 3.21 >A: <kreijack@inwind.it> >Cc: "systemd Mailing List"<systemd-devel@lists.freedesktop.org>, "linux-btrfs" <linux-btrfs@vger.kernel.org> >Ogg: Re: Slow startup of systemd-journal on BTRFS > >On Wed, Jun 11, 2014 at 11:28:54PM +0200, Goffredo Baroncelli wrote: >> Hi all, >> >> I would like to share a my experience about a slowness of systemd when used on BTRFS. >> >> My boot time was very high (about ~50 seconds); most of time it was due to NetworkManager which took about 30-40 seconds to start (this data came from "systemd-analyze plot"). >> >> I make several attempts to address this issue. Also I noticed that sometime this problem disappeared; but I was never able to understand why. >> >> However this link >> >> https://bugzilla.redhat.com/show_bug.cgi?id=1006386 >> >> suggested me that the problem could be due to a bad interaction between systemd and btrfs. NetworkManager was innocent. > >systemd has a very stupid journal write pattern. It checks if there >is space in the file for the write, and if not it fallocates the >small amount of space it needs (it does *4 byte* fallocate calls!) >and then does the write to it. All this does is fragment the crap >out of the log files because the filesystems cannot optimise the >allocation patterns. I checked the code, and to me it seems that the fallocate() are done in FILE_SIZE_INCREASE unit (actually 8MB). > >Yup, it fragments journal files on XFS, too. 
>
>http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
>
>IIRC, the systemd developers consider this a filesystem problem and
>so refused to change the systemd code to be nice to the filesystem
>allocators, even though they don't actually need to use fallocate...

Once I manage to get a decent test setup running, I would like to experiment with changing some parameters, such as:
- removing fallocate entirely (or only at the beginning?)
- increasing the fallocate allocation unit
- changing the log file size and rotation time
- defragmenting periodically
[...]

>
>Cheers,
>
>Dave.
>
>--
>Dave Chinner
>david@fromorbit.com
>--
>To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: R: Re: Slow startup of systemd-journal on BTRFS 2014-06-12 11:13 R: " Goffredo Baroncelli <kreijack@libero.it> @ 2014-06-12 12:37 ` Duncan 2014-06-12 23:24 ` Dave Chinner 0 siblings, 1 reply; 33+ messages in thread From: Duncan @ 2014-06-12 12:37 UTC (permalink / raw) To: linux-btrfs; +Cc: systemd-devel Goffredo Baroncelli <kreijack@libero.it> posted on Thu, 12 Jun 2014 13:13:26 +0200 as excerpted: >>systemd has a very stupid journal write pattern. It checks if there is >>space in the file for the write, and if not it fallocates the small >>amount of space it needs (it does *4 byte* fallocate calls!) and then >>does the write to it. All this does is fragment the crap out of the log >>files because the filesystems cannot optimise the allocation patterns. > > I checked the code, and to me it seems that the fallocate() are done in > FILE_SIZE_INCREASE unit (actually 8MB). FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think actually pretty much equally bad without NOCOW set on the file. Why? Because btrfs data blocks are 4 KiB. With COW, the effect for either 4 byte or 8 MiB file allocations is going to end up being the same, forcing (repeated until full) rewrite of each 4 KiB block into its own extent. Turning off the fallocate should allow btrfs to at least consolidate a bit, tho to the extent that multiple 4 KiB blocks cannot be written, repeated fsync will still cause issues. 80-100 MiB logs (size mentioned in another reply) should be reasonably well handled by btrfs autodefrag, however, if it's turned on. I'd be worried if sizes were > 256 MiB and certainly as sizes approached a GiB, but it should handle 80-100 MiB just fine. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: R: Re: Slow startup of systemd-journal on BTRFS 2014-06-12 12:37 ` Duncan @ 2014-06-12 23:24 ` Dave Chinner 2014-06-13 22:19 ` Goffredo Baroncelli 0 siblings, 1 reply; 33+ messages in thread From: Dave Chinner @ 2014-06-12 23:24 UTC (permalink / raw) To: Duncan; +Cc: linux-btrfs, systemd-devel On Thu, Jun 12, 2014 at 12:37:13PM +0000, Duncan wrote: > Goffredo Baroncelli <kreijack@libero.it> posted on Thu, 12 Jun 2014 > 13:13:26 +0200 as excerpted: > > >>systemd has a very stupid journal write pattern. It checks if there is > >>space in the file for the write, and if not it fallocates the small > >>amount of space it needs (it does *4 byte* fallocate calls!) and then > >>does the write to it. All this does is fragment the crap out of the log > >>files because the filesystems cannot optimise the allocation patterns. > > > > I checked the code, and to me it seems that the fallocate() are done in > > FILE_SIZE_INCREASE unit (actually 8MB). > > FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think > actually pretty much equally bad without NOCOW set on the file. So maybe it's been fixed in systemd since the last time I looked. Yup: http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal-file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58 The reason it was changed? To "save a syscall per append", not to prevent fragmentation of the file, which was the problem everyone was complaining about... > Why? Because btrfs data blocks are 4 KiB. With COW, the effect for > either 4 byte or 8 MiB file allocations is going to end up being the > same, forcing (repeated until full) rewrite of each 4 KiB block into its > own extent. And that's now a btrfs problem.... :/ Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: R: Re: Slow startup of systemd-journal on BTRFS 2014-06-12 23:24 ` Dave Chinner @ 2014-06-13 22:19 ` Goffredo Baroncelli 2014-06-14 2:53 ` Duncan 0 siblings, 1 reply; 33+ messages in thread
From: Goffredo Baroncelli @ 2014-06-13 22:19 UTC (permalink / raw)
To: Dave Chinner, Duncan; +Cc: linux-btrfs, systemd-devel

Hi Dave,

On 06/13/2014 01:24 AM, Dave Chinner wrote:
> On Thu, Jun 12, 2014 at 12:37:13PM +0000, Duncan wrote:
>> Goffredo Baroncelli <kreijack@libero.it> posted on Thu, 12 Jun 2014
>> 13:13:26 +0200 as excerpted:
>>
>>>> systemd has a very stupid journal write pattern. It checks if there is
>>>> space in the file for the write, and if not it fallocates the small
>>>> amount of space it needs (it does *4 byte* fallocate calls!) and then
>>>> does the write to it. All this does is fragment the crap out of the log
>>>> files because the filesystems cannot optimise the allocation patterns.
>>>
>>> I checked the code, and to me it seems that the fallocate() are done in
>>> FILE_SIZE_INCREASE unit (actually 8MB).
>>
>> FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think
>> actually pretty much equally bad without NOCOW set on the file.
>
> So maybe it's been fixed in systemd since the last time I looked.
> Yup:
>
> http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal-file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58
>
> The reason it was changed? To "save a syscall per append", not to
> prevent fragmentation of the file, which was the problem everyone
> was complaining about...

Thanks for pointing that out. However, I am performing my tests on Fedora 20 with systemd-208, which seems to have this change already.

>
>> Why? Because btrfs data blocks are 4 KiB. With COW, the effect for
>> either 4 byte or 8 MiB file allocations is going to end up being the
>> same, forcing (repeated until full) rewrite of each 4 KiB block into its
>> own extent.

I am reaching the conclusion that fallocate is not the problem.
The fallocate increases the file size by about 8 MB, which is enough for a fair amount of logging, so it is not called very often.

I have to investigate more what happens when the logs are copied from /run to /var/log/journal: this is when journald seems to slow everything down.

I have prepared a PC which reboots continuously; I am collecting the time required to finish the boot vs. the fragmentation of the system.journal file vs. the number of boots. The results are dramatic: after 20 reboots, the boot time increases by 20-30 seconds. Defragmenting system.journal brings the boot time back to the original value, but after another 20 reboots the boot time again takes 20-30 seconds more...

It is a slow PC, but I saw the same behavior on a more modern PC (i5 with 8 GB). For both PCs the HD is a mechanical one...

>
> And that's now a btrfs problem.... :/

Are you sure?

ghigo@venice:/var/log$ sudo filefrag messages
messages: 29 extents found
ghigo@venice:/var/log$ sudo filefrag journal/*/system.journal
journal/41d686199835445395ac629d576dfcb9/system.journal: 1378 extents found

So the old rsyslog creates files with far fewer fragments. BTRFS (and, it seems, XFS too) certainly highlights this problem more than other filesystems, but systemd also seems to create a lot of extents.

BR
G.Baroncelli

>
> Cheers,
>
> Dave.
>

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Slow startup of systemd-journal on BTRFS 2014-06-13 22:19 ` Goffredo Baroncelli @ 2014-06-14 2:53 ` Duncan 2014-06-14 7:52 ` Goffredo Baroncelli 2014-06-14 10:59 ` Kai Krakow 0 siblings, 2 replies; 33+ messages in thread From: Duncan @ 2014-06-14 2:53 UTC (permalink / raw) To: linux-btrfs; +Cc: systemd-devel Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as excerpted: > On 06/13/2014 01:24 AM, Dave Chinner wrote: >> On Thu, Jun 12, 2014 at 12:37:13PM +0000, Duncan wrote: >>> >>> FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think >>> actually pretty much equally bad without NOCOW set on the file. >> >> So maybe it's been fixed in systemd since the last time I looked. >> Yup: >> >> http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal- file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58 >> >> The reason it was changed? To "save a syscall per append", not to >> prevent fragmentation of the file, which was the problem everyone was >> complaining about... > > thanks for pointing that. However I am performing my tests on a fedora > 20 with systemd-208, which seems have this change >> >>> Why? Because btrfs data blocks are 4 KiB. With COW, the effect for >>> either 4 byte or 8 MiB file allocations is going to end up being the >>> same, forcing (repeated until full) rewrite of each 4 KiB block into >>> its own extent. > > I am reaching the conclusion that fallocate is not the problem. The > fallocate increase the filesize of about 8MB, which is enough for some > logging. So it is not called very often. But... If a file isn't (properly[1]) set NOCOW (and the btrfs isn't mounted with nodatacow), then an fallocate of 8 MiB will increase the file size by 8 MiB and write that out. So far so good as at that point the 8 MiB should be a single extent. But then, data gets written into 4 KiB blocks of that 8 MiB one at a time, and because btrfs is COW, the new data in the block must be written to a new location. 
Which effectively means that by the time the 8 MiB is filled, each 4 KiB block has been rewritten to a new location and is now an extent unto itself. So now that 8 MiB is composed of 2048 new extents, each one a single 4 KiB block in size. =:^( Tho as I already stated, for file sizes upto 128 MiB or so anyway[2], the btrfs autodefrag mount option should at least catch that and rewrite (again), this time sequentially. > I have to investigate more what happens when the log are copied from > /run to /var/log/journal: this is when journald seems to slow all. That's an interesting point. At least in theory, during normal operation journald will write to /var/log/journal, but there's a point during boot at which it flushes the information accumulated during boot from the volatile /run location to the non-volatile /var/log location. /That/ write, at least, should be sequential, since there will be > 4 KiB of journal accumulated that needs to be transferred at once. However, if it's being handled by the forced pre-write fallocate described above, then that's not going to be the case, as it'll then be a rewrite of already fallocated file blocks and thus will get COWed exactly as I described above. =:^( > I am prepared a PC which reboot continuously; I am collecting the time > required to finish the boot vs the fragmentation of the system.journal > file vs the number of boot. The results are dramatic: after 20 reboot, > the boot time increase of 20-30 seconds. Doing a defrag of > system.journal reduces the boot time to the original one, but after > another 20 reboot, the boot time still requires 20-30 seconds more.... > > It is a slow PC, but I saw the same behavior also on a more modern pc > (i5 with 8GB). > > For both PC the HD is a mechanical one... The problem's duplicable. That's the first step toward a fix. =:^) >> And that's now a btrfs problem.... :/ > > Are you sure ? As they say, "Whoosh!" 
At least here, I interpreted that remark as primarily sarcastic commentary on the systemd devs' apparent attitude, which can be (controversially) summarized as: "Systemd doesn't have problems because it's perfect. Therefore, any problems you have with systemd must instead be with other components which systemd depends on."

IOW, it's a btrfs problem now in practice, not because it is so in a technical sense, but because systemd defines it as such and is unlikely to budge, so the only way to achieve progress is for btrfs to deal with it.

An arguably fairer and more impartial assessment of this particular situation suggests that neither btrfs, which as a COW-based filesystem, like all COW-based filesystems, has the existing-file-rewrite as a major technical challenge that it must deal with /somehow/, nor systemd, which in choosing to use fallocate is specifically putting itself in that existing-file-rewrite class, is entirely at fault. But that doesn't matter if one side refuses to budge, because then the other side must do so regardless of where the fault was, if there is to be any progress at all.

Meanwhile, I've predicted before and do so here again, that as btrfs moves toward mainstream and starts supplanting ext* as the assumed Linux default filesystem, some of these problems will simply "go away", because at that point, various apps are no longer optimized for the assumed default filesystem, and they'll either be patched at some level (distro level if not upstream) to work better on the new default filesystem, or will be replaced by something that does. And if neither upstream nor distro level does that patching, then at some point, people are going to find that said distro performs worse than other distros that do that patching.

Another alternative is that distros will start setting /var/log/journal NOCOW in their setup scripts by default when it's btrfs, thus avoiding the problem.
(Altho if they do automated snapshotting they'll also have to set it as its own subvolume, to avoid the first-write-after-snapshot- is-COW problem.) Well, that, and/or set autodefrag in the default mount options. Meanwhile, there's some focus on making btrfs behave better with such rewrite-pattern files, but while I think the problem can be made /some/ better, hopefully enough that the defaults bother far fewer people in far fewer cases, I expect it'll always be a bit of a sore spot because that's just how the technology works, and as such, setting NOCOW for such files and/or using autodefrag will continue to be recommended for an optimized setup. --- [1] "Properly" set NOCOW: Btrfs doesn't guarantee the effectiveness of setting NOCOW (chattr +C) unless the attribute is set while the file is still zero size, effectively, at file creation. The easiest way to do that is to set NOCOW on the subdir that will contain the file, such that when the file is created it inherits the NOCOW attribute automatically. [2] File sizes upto 128 MiB ... and possibly upto 1 GiB. Under 128 MiB should be fine, over 1 GiB is known to cause issues, between the two is a gray area that depends on the speed of the hardware and the incoming write-stream. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Slow startup of systemd-journal on BTRFS 2014-06-14 2:53 ` Duncan @ 2014-06-14 7:52 ` Goffredo Baroncelli 2014-06-15 5:43 ` Duncan 2014-06-15 22:13 ` Lennart Poettering 2014-06-14 10:59 ` Kai Krakow 1 sibling, 2 replies; 33+ messages in thread From: Goffredo Baroncelli @ 2014-06-14 7:52 UTC (permalink / raw) To: Duncan, linux-btrfs; +Cc: systemd-devel On 06/14/2014 04:53 AM, Duncan wrote: > Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as > excerpted: > >> On 06/13/2014 01:24 AM, Dave Chinner wrote: >>> On Thu, Jun 12, 2014 at 12:37:13PM +0000, Duncan wrote: >>>> >>>> FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think >>>> actually pretty much equally bad without NOCOW set on the file. >>> >>> So maybe it's been fixed in systemd since the last time I looked. >>> Yup: >>> >>> http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal- > file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58 >>> >>> The reason it was changed? To "save a syscall per append", not to >>> prevent fragmentation of the file, which was the problem everyone was >>> complaining about... >> >> thanks for pointing that. However I am performing my tests on a fedora >> 20 with systemd-208, which seems have this change >>> >>>> Why? Because btrfs data blocks are 4 KiB. With COW, the effect for >>>> either 4 byte or 8 MiB file allocations is going to end up being the >>>> same, forcing (repeated until full) rewrite of each 4 KiB block into >>>> its own extent. >> >> I am reaching the conclusion that fallocate is not the problem. The >> fallocate increase the filesize of about 8MB, which is enough for some >> logging. So it is not called very often. > > But... > > If a file isn't (properly[1]) set NOCOW (and the btrfs isn't mounted with > nodatacow), then an fallocate of 8 MiB will increase the file size by 8 > MiB and write that out. So far so good as at that point the 8 MiB should > be a single extent. 
> But then, data gets written into 4 KiB blocks of
> that 8 MiB one at a time, and because btrfs is COW, the new data in the
> block must be written to a new location.
>
> Which effectively means that by the time the 8 MiB is filled, each 4 KiB
> block has been rewritten to a new location and is now an extent unto
> itself. So now that 8 MiB is composed of 2048 new extents, each one a
> single 4 KiB block in size.

Several people have pointed to fallocate as the problem, but I don't understand the reason:
1) 8 MB is quite a large value, so fallocate is called (at worst) once during the boot, and often never, because the logs are less than 8 MB.
2) It is true that with fallocate btrfs rewrites almost every 4 KB page twice. But the first time is one "big" 8 MB write; the second write would happen in any case. What I mean is that even without the fallocate, journald would still make small writes.

To be honest, I struggle to see the gain of a fallocate on a COW filesystem... maybe I don't understand the fallocate() call very well.

>
> =:^(
>
> Tho as I already stated, for file sizes upto 128 MiB or so anyway[2], the
> btrfs autodefrag mount option should at least catch that and rewrite
> (again), this time sequentially.
>
>> I have to investigate more what happens when the log are copied from
>> /run to /var/log/journal: this is when journald seems to slow all.
>
> That's an interesting point.
>
> At least in theory, during normal operation journald will write to
> /var/log/journal, but there's a point during boot at which it flushes the
> information accumulated during boot from the volatile /run location to
> the non-volatile /var/log location. /That/ write, at least, should be
> sequential, since there will be > 4 KiB of journal accumulated that needs
> to be transferred at once.
> However, if it's being handled by the forced
> pre-write fallocate described above, then that's not going to be the
> case, as it'll then be a rewrite of already fallocated file blocks and
> thus will get COWed exactly as I described above.
>
> =:^(
>
>> I am prepared a PC which reboot continuously; I am collecting the time
>> required to finish the boot vs the fragmentation of the system.journal
>> file vs the number of boot. The results are dramatic: after 20 reboot,
>> the boot time increase of 20-30 seconds. Doing a defrag of
>> system.journal reduces the boot time to the original one, but after
>> another 20 reboot, the boot time still requires 20-30 seconds more....
>>
>> It is a slow PC, but I saw the same behavior also on a more modern pc
>> (i5 with 8GB).
>>
>> For both PC the HD is a mechanical one...
>
> The problem's duplicable. That's the first step toward a fix. =:^)

I hope so.

>
>>> And that's now a btrfs problem.... :/
>>
>> Are you sure ?
>
> As they say, "Whoosh!"
> [...]
> Another alternative is that distros will start setting /var/log/journal
> NOCOW in their setup scripts by default when it's btrfs, thus avoiding
> the problem. (Altho if they do automated snapshotting they'll also have
> to set it as its own subvolume, to avoid the first-write-after-snapshot-
> is-COW problem.) Well, that, and/or set autodefrag in the default mount
> options.

Note that this also removes the checksums, which are very useful in a RAID configuration.

>
> Meanwhile, there's some focus on making btrfs behave better with such
> rewrite-pattern files, but while I think the problem can be made /some/
> better, hopefully enough that the defaults bother far fewer people in far
> fewer cases, I expect it'll always be a bit of a sore spot because that's
> just how the technology works, and as such, setting NOCOW for such files
> and/or using autodefrag will continue to be recommended for an optimized
> setup.
> > --- > [1] "Properly" set NOCOW: Btrfs doesn't guarantee the effectiveness of > setting NOCOW (chattr +C) unless the attribute is set while the file is > still zero size, effectively, at file creation. The easiest way to do > that is to set NOCOW on the subdir that will contain the file, such that > when the file is created it inherits the NOCOW attribute automatically. > > [2] File sizes upto 128 MiB ... and possibly upto 1 GiB. Under 128 MiB > should be fine, over 1 GiB is known to cause issues, between the two is a > gray area that depends on the speed of the hardware and the incoming > write-stream. > -- gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Slow startup of systemd-journal on BTRFS 2014-06-14 7:52 ` Goffredo Baroncelli @ 2014-06-15 5:43 ` Duncan 2014-06-15 22:39 ` [systemd-devel] " Lennart Poettering 2014-06-15 22:13 ` Lennart Poettering 1 sibling, 1 reply; 33+ messages in thread From: Duncan @ 2014-06-15 5:43 UTC (permalink / raw) To: linux-btrfs; +Cc: systemd-devel Goffredo Baroncelli posted on Sat, 14 Jun 2014 09:52:39 +0200 as excerpted: > On 06/14/2014 04:53 AM, Duncan wrote: >> Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as >> excerpted: >> >>> thanks for pointing that. However I am performing my tests on a fedora >>> 20 with systemd-208, which seems have this change >>> >>> I am reaching the conclusion that fallocate is not the problem. The >>> fallocate increase the filesize of about 8MB, which is enough for some >>> logging. So it is not called very often. Right. >> But... Exactly, _but_... >> [A]n fallocate of 8 MiB will increase the file size >> by 8 MiB and write that out. So far so good as at that point the 8 MiB >> should be a single extent. But then, data gets written into 4 KiB >> blocks of that 8 MiB one at a time, and because btrfs is COW, the new >> data in the block must be written to a new location. >> >> Which effectively means that by the time the 8 MiB is filled, each 4 >> KiB block has been rewritten to a new location and is now an extent >> unto itself. So now that 8 MiB is composed of 2048 new extents, each >> one a single 4 KiB block in size. > > Several people pointed fallocate as the problem. But I don't understand > the reason. > 1) 8MB is a quite huge value, so fallocate is called (at worst) 1 time > during the boot. Often never because the log are less than 8MB. > 2) it is true that btrfs "rewrite" almost 2 times each 4kb page with > fallocate. But the first time is a "big" write of 8MB; instead the > second write would happen in any case. What I mean is that without the > fallocate in any case journald would make small write. 
> > To be honest, I fatigue to see the gain of having a fallocate on a COW > filesystem... may be that I don't understand very well the fallocate() > call. The base problem isn't fallocate per se, rather, tho it's the trigger in this case. The base problem is that for COW-based filesystems, *ANY* rewriting of existing file content results in fragmentation. It just so happens that the only reason there's existing file content to be rewritten (as opposed to simply appending) in this case, is because of the fallocate. The rewrite of existing file content is the problem, but the existing file content is only there in this case because of the fallocate. Taking a step back... On a non-COW filesystem, allocating 8 MiB ahead and writing into it rewrites into the already allocated location, thus guaranteeing extents of 8 MiB each, since once the space is allocated it's simply rewritten in- place. Thus, on a non-COW filesystem, pre-allocating in something larger than single filesystem blocks when an app knows the data is eventually going to be written in to fill that space anyway is a GOOD thing, which is why systemd is doing it. But on a COW-based filesystem fallocate is the exact opposite, a BAD thing, because an fallocate forces the file to be written out at that size, effectively filled with nulls/blanks. Then the actual logging comes along and rewrites those nulls/blanks with actual data, but it's now a rewrite, which on a COW, copy-on-write, based filesystem, the rewritten block is copied elsewhere, it does NOT overwrite the existing null/blank block, and "elsewhere" by definition means detached from the previous blocks, thus in an extent all by itself. 
Once the full 2048 original blocks composing that 8 MiB are filled in with actual data, because they were rewrites from null/blank blocks that fallocate had already forced to be allocated, that's now 2048 separate extents, 2048 separate file fragments, where without the forced fallocate, the writes would have all been appends, and there would have been at least /some/ chance of some of those 2048 separate blocks being written at close enough to the same time that they would have been written together as a single extent. So while the 8 MiB might not have been a single extent as opposed to 2048 separate extents, it might have been perhaps 512 or 1024 extents, instead of the 2048 that it ended up being because fallocate meant that each block was a rewrite into an existing file, not a new append-write at the end of an existing file. > [...] >> Another alternative is that distros will start setting /var/log/journal >> NOCOW in their setup scripts by default when it's btrfs, thus avoiding >> the problem. (Altho if they do automated snapshotting they'll also >> have to set it as its own subvolume, to avoid the >> first-write-after-snapshot- >> is-COW problem.) Well, that, and/or set autodefrag in the default >> mount options. > > Pay attention, that this remove also the checksum, which are very useful > in a RAID configuration. Well, it can be. But this is only log data, not executable or the like data, and (as Kai K points out) journald has its own checksumming method in any case. Besides which, you still haven't explained why you can't either set the autodefrag mount option and be done with it, or run a systemd-timer- triggered or cron-triggered defrag script to defrag them automatically at hourly or daily or whatever intervals. Those don't disable btrfs checksumming, but /should/ solve the problem. 
(Tho if you're btrfs snapshotting the journals defrag has its own implications, but they should be relatively limited in scope compared to the fragmentation issues we're dealing with here.) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS 2014-06-15 5:43 ` Duncan @ 2014-06-15 22:39 ` Lennart Poettering 0 siblings, 0 replies; 33+ messages in thread From: Lennart Poettering @ 2014-06-15 22:39 UTC (permalink / raw) To: Duncan; +Cc: systemd-devel, linux-btrfs On Sun, 15.06.14 05:43, Duncan (1i5t5.duncan@cox.net) wrote: > The base problem isn't fallocate per se, rather, tho it's the trigger in > this case. The base problem is that for COW-based filesystems, *ANY* > rewriting of existing file content results in fragmentation. > > It just so happens that the only reason there's existing file content to > be rewritten (as opposed to simply appending) in this case, is because of > the fallocate. The rewrite of existing file content is the problem, but > the existing file content is only there in this case because of the > fallocate. > > Taking a step back... > > On a non-COW filesystem, allocating 8 MiB ahead and writing into it > rewrites into the already allocated location, thus guaranteeing extents > of 8 MiB each, since once the space is allocated it's simply rewritten in- > place. Thus, on a non-COW filesystem, pre-allocating in something larger > than single filesystem blocks when an app knows the data is eventually > going to be written in to fill that space anyway is a GOOD thing, which > is why systemd is doing it. Nope, that's not why we do it. We do it to avoid SIGBUS on disk full... > But on a COW-based filesystem fallocate is the exact opposite, a BAD > thing, because an fallocate forces the file to be written out at that > size, effectively filled with nulls/blanks. Then the actual logging > comes along and rewrites those nulls/blanks with actual data, but it's > now a rewrite, which on a COW, copy-on-write, based filesystem, the > rewritten block is copied elsewhere, it does NOT overwrite the existing > null/blank block, and "elsewhere" by definition means detached from the > previous blocks, thus in an extent all by itself. 
Well, quite frankly I am not entirely sure why fallocate() would be of any use like that on COW file systems, if this is really how it is implemented...

I mean, as I understood fallocate() -- and as the man page suggests -- it is something for reserving space on disk, not for writing out anything. This is why journald is invoking it: to reserve the space, so that later write accesses to it will not require any reservation anymore, and hence are unlikely to fail.

Lennart

--
Lennart Poettering, Red Hat

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS 2014-06-14 7:52 ` Goffredo Baroncelli 2014-06-15 5:43 ` Duncan @ 2014-06-15 22:13 ` Lennart Poettering 2014-06-16 0:17 ` Russell Coker ` (2 more replies) 1 sibling, 3 replies; 33+ messages in thread
From: Lennart Poettering @ 2014-06-15 22:13 UTC (permalink / raw)
To: kreijack; +Cc: Duncan, linux-btrfs, systemd-devel

On Sat, 14.06.14 09:52, Goffredo Baroncelli (kreijack@libero.it) wrote:

>> Which effectively means that by the time the 8 MiB is filled, each 4 KiB
>> block has been rewritten to a new location and is now an extent unto
>> itself. So now that 8 MiB is composed of 2048 new extents, each one a
>> single 4 KiB block in size.
>
> Several people pointed fallocate as the problem. But I don't
> understand the reason.

BTW, the reason we use fallocate() in journald is not about trying to optimize anything. It's only used for one reason: to avoid SIGBUS on disk/quota full, since we actually write everything to the files using mmap(). I mean, writing things with mmap() is always problematic, and handling write errors is awfully difficult, but at least two of the most common reasons for failure we'd like to protect against in advance, under the assumption that disk/quota full will be reported immediately by the fallocate(), and the mmap writes later on will then necessarily succeed.

I am not really following why this trips up btrfs, though. I am not sure I understand why this breaks btrfs COW behaviour. I mean, fallocate() isn't necessarily supposed to write anything really, it's mostly about allocating disk space in advance. I would claim that journald's usage of it is very much within the entire reason why it exists...

Anyway, happy to change these things around if necessary, but first I'd like to have a very good explanation why fallocate() wouldn't be the right thing to invoke here, and a suggestion what we should do instead to cover this usecase...
Lennart -- Lennart Poettering, Red Hat ^ permalink raw reply [flat|nested] 33+ messages in thread
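The scheme Lennart describes (reserve with fallocate(), then do all writes through mmap()) can be sketched roughly like this. This is illustrative Python, not journald's actual C code; the path and chunk size are invented:

```python
import mmap
import os

CHUNK = 4 * 4096  # assumed preallocation size; journald's real value differs

def open_journal(path: str, size: int = CHUNK) -> mmap.mmap:
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o640)
    # Reserve the space up front: disk/quota full is reported here as an
    # OSError, instead of as a SIGBUS on a later page fault into the mapping.
    os.posix_fallocate(fd, 0, size)
    mm = mmap.mmap(fd, size)
    os.close(fd)  # the mapping keeps its own reference to the file
    return mm

mm = open_journal("/tmp/journal-demo.bin")
mm[0:5] = b"entry"  # plain memory write; the space is already reserved
mm.flush()
```

The point is that after the successful posix_fallocate() the later stores into `mm` should never need a fresh block allocation, which is exactly the assumption the thread goes on to question for COW filesystems.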
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS 2014-06-15 22:13 ` Lennart Poettering @ 2014-06-16 0:17 ` Russell Coker 2014-06-16 1:06 ` John Williams 2014-06-16 10:14 ` Lennart Poettering 2014-06-16 16:32 ` Goffredo Baroncelli 2014-06-19 1:13 ` Dave Chinner 2 siblings, 2 replies; 33+ messages in thread From: Russell Coker @ 2014-06-16 0:17 UTC (permalink / raw) To: Lennart Poettering; +Cc: kreijack, Duncan, linux-btrfs, systemd-devel On Mon, 16 Jun 2014 00:13:07 Lennart Poettering wrote: > On Sat, 14.06.14 09:52, Goffredo Baroncelli (kreijack@libero.it) wrote: > > > Which effectively means that by the time the 8 MiB is filled, each 4 KiB > > > block has been rewritten to a new location and is now an extent unto > > > itself. So now that 8 MiB is composed of 2048 new extents, each one a > > > single 4 KiB block in size. > > > > Several people pointed fallocate as the problem. But I don't > > understand the reason. > > BTW, the reason we use fallocate() in journald is not about trying to > optimize anything. It's only used for one reason: to avoid SIGBUS on > disk/quota full, since we actually write everything to the files using > mmap(). I mean, writing things with mmap() is always problematic, and > handling write errors is awfully difficult, but at least two of the most > common reasons for failure we'd like protect against in advance, under > the assumption that disk/quota full will be reported immediately by the > fallocate(), and the mmap writes later on will then necessarily succeed. I just did some tests using fallocate(1). I did the tests both with and without the -n option which appeared to make no difference. I started by allocating a 24G file on a 106G filesystem that had 30G free according to df. The first time that took almost 2 minutes of system CPU time on a Q8400 CPU. I then made a snapshot of the subvol and then used dd with the conv=notrunc option to overwrite it. 
The amount of reported disk space decreased in line with the progress of dd. So in the case of snapshots the space will be USED (not just reserved) when you call fallocate and there is no guarantee that space will be available when you write to it. My systems have cron jobs to make read-only snapshots of all subvols. On these systems you have no guarantee that mmap will succeed - apart from the fact that the variety of problems BTRFS has in the case of running out of disk space makes me more careful to avoid that on BTRFS than on other filesystems. > I am not really following though why this trips up btrfs though. I am > not sure I understand why this breaks btrfs COW behaviour. I mean, > fallocate() isn't necessarily supposed to write anything really, it's > mostly about allocating disk space in advance. I would claim that > journald's usage of it is very much within the entire reason why it > exists... I don't believe that fallocate() makes any difference to fragmentation on BTRFS. Blocks will be allocated when writes occur so regardless of an fallocate() call the usage pattern in systemd-journald will cause fragmentation. > Anyway, happy to change these things around if necesary, but first I'd > like to have a very good explanation why fallocate() wouldn't be the > right thing to invoke here, and a suggestion what we should do instead > to cover this usecase... Systemd could request that the files in question be defragmented. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/ ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS 2014-06-16 0:17 ` Russell Coker @ 2014-06-16 1:06 ` John Williams 2014-06-16 2:19 ` Russell Coker 2014-06-16 10:14 ` Lennart Poettering 1 sibling, 1 reply; 33+ messages in thread From: John Williams @ 2014-06-16 1:06 UTC (permalink / raw) To: russell; +Cc: Lennart Poettering, kreijack, Duncan, Btrfs BTRFS, systemd-devel On Sun, Jun 15, 2014 at 5:17 PM, Russell Coker <russell@coker.com.au> wrote: > I just did some tests using fallocate(1). I did the tests both with and > without the -n option which appeared to make no difference. > > I started by allocating a 24G file on a 106G filesystem that had 30G free > according to df. The first time that took almost 2 minutes of system CPU time > on a Q8400 CPU. Why does it take 2 minutes? On XFS or ext4, fallocate is almost instantaneous, even for multi-terabyte allocations. According to the fallocate man page, preallocation should be quick and require no IO: " fallocate is used to manipulate the allocated disk space for a file, either to deallocate or preallocate it. For filesystems which support the fallocate system call, preallocation is done quickly by allocating blocks and marking them as uninitialized, requiring no IO to the data blocks. This is much faster than creating a file by filling it with zeros." ^ permalink raw reply [flat|nested] 33+ messages in thread
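A scaled-down version of Russell's experiment is easy to run anywhere (sizes shrunk from his 24G so it finishes quickly even on slow filesystems; the path is made up). On ext4/XFS the preallocation is near-instant because it only touches metadata, never the data blocks:

```python
import os
import time

path = "/tmp/prealloc-demo.img"
fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o640)

t0 = time.monotonic()
os.posix_fallocate(fd, 0, 64 * 1024 * 1024)  # reserve 64 MiB up front
elapsed = time.monotonic() - t0
os.close(fd)

# Logical size reflects the reservation even though no data was written.
print(f"preallocated {os.path.getsize(path)} bytes in {elapsed:.3f}s")
```

On btrfs, Russell's observation suggests the same call scales much worse with allocation size, which is what the following reply digs into.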
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS 2014-06-16 1:06 ` John Williams @ 2014-06-16 2:19 ` Russell Coker 0 siblings, 0 replies; 33+ messages in thread From: Russell Coker @ 2014-06-16 2:19 UTC (permalink / raw) To: John Williams Cc: Lennart Poettering, kreijack, Duncan, Btrfs BTRFS, systemd-devel On Sun, 15 Jun 2014 18:06:34 John Williams wrote: > Why does it take 2 minutes? On XFS or ext4, fallocate is almost > instantaneous, even for multi-Terabyte allocations. > > According the fallocate man page, preallocation should be quick and > require no IO: > > " fallocate is used to manipulate the allocated disk space for a > file, either to deallocate or preallocate it. For filesystems which > support the fallocate system call, preallocation is done quickly by > allocating blocks and marking them as uninitialized, requiring no IO to > the data blocks. This is much faster than creating a file by filling > it with zeros." No IO to data blocks, but there is IO to metadata. I think that BTRFS may need some optimisation for such things. While fallocate() on 24G is probably a very unusual case, it will probably matter to some people (I can imagine scientific computing needing it), and it's likely that much smaller fallocate() calls also take longer than desired. The issue was system CPU time: extending the file in that test was proceeding at a speed of about 200MB/s of allocated space, while the system was writing something less than 2MB/s to the device (sometimes it went for 10+ seconds without writing any data). The SSD in question can sustain about 200MB/s of data written, so in that case the BTRFS speed for allocating disk space was about equal to the speed at which it should be able to write real data. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/ ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS 2014-06-16 0:17 ` Russell Coker 2014-06-16 1:06 ` John Williams @ 2014-06-16 10:14 ` Lennart Poettering 2014-06-16 10:35 ` Russell Coker ` (2 more replies) 1 sibling, 3 replies; 33+ messages in thread From: Lennart Poettering @ 2014-06-16 10:14 UTC (permalink / raw) To: Russell Coker; +Cc: kreijack, Duncan, linux-btrfs, systemd-devel On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote: > > I am not really following though why this trips up btrfs though. I am > > not sure I understand why this breaks btrfs COW behaviour. I mean, > > fallocate() isn't necessarily supposed to write anything really, it's > > mostly about allocating disk space in advance. I would claim that > > journald's usage of it is very much within the entire reason why it > > exists... > > I don't believe that fallocate() makes any difference to fragmentation on > BTRFS. Blocks will be allocated when writes occur so regardless of an > fallocate() call the usage pattern in systemd-journald will cause > fragmentation. journald's write pattern looks something like this: append something to the end, make sure it is written, then update a few offsets stored at the beginning of the file to point to the newly appended data. This is of course not easy to handle for COW file systems. But then again, it's probably not too different from access patterns of other database or database-like engines... Lennart -- Lennart Poettering, Red Hat ^ permalink raw reply [flat|nested] 33+ messages in thread
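That write pattern can be sketched as follows. This is a minimal illustration with a hypothetical 8-byte header holding only a tail offset; the real journal header is more elaborate:

```python
import os
import struct

def append_entry(fd: int, payload: bytes) -> None:
    # Append at the tail and make it durable first...
    tail = os.lseek(fd, 0, os.SEEK_END)
    os.write(fd, payload)
    os.fsync(fd)
    # ...then update the offset stored at the beginning of the file to
    # point past the newly appended data. Both ends of the file get
    # dirtied on every entry, which is the part that is hard on a COW
    # filesystem: each fsync'd header update rewrites that block anew.
    os.pwrite(fd, struct.pack("<Q", tail + len(payload)), 0)
    os.fsync(fd)

fd = os.open("/tmp/journal-pattern.bin", os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o640)
os.write(fd, struct.pack("<Q", 8))  # header: tail offset, starting after itself
for msg in (b"entry-1", b"entry-2", b"entry-3"):
    append_entry(fd, msg)
```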
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS 2014-06-16 10:14 ` Lennart Poettering @ 2014-06-16 10:35 ` Russell Coker 2014-06-16 11:16 ` Austin S Hemmelgarn 2014-06-16 11:56 ` Andrey Borzenkov 2014-06-16 16:05 ` Josef Bacik 2 siblings, 1 reply; 33+ messages in thread From: Russell Coker @ 2014-06-16 10:35 UTC (permalink / raw) To: Lennart Poettering; +Cc: kreijack, Duncan, linux-btrfs, systemd-devel On Mon, 16 Jun 2014 12:14:49 Lennart Poettering wrote: > On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote: > > > I am not really following though why this trips up btrfs though. I am > > > not sure I understand why this breaks btrfs COW behaviour. I mean, > > > fallocate() isn't necessarily supposed to write anything really, it's > > > mostly about allocating disk space in advance. I would claim that > > > journald's usage of it is very much within the entire reason why it > > > exists... > > > > I don't believe that fallocate() makes any difference to fragmentation on > > BTRFS. Blocks will be allocated when writes occur so regardless of an > > fallocate() call the usage pattern in systemd-journald will cause > > fragmentation. > > journald's write pattern looks something like this: append something to > the end, make sure it is written, then update a few offsets stored at > the beginning of the file to point to the newly appended data. This is > of course not easy to handle for COW file systems. But then again, it's > probably not too different from access patterns of other database or > database-like engines... Not being too different from the access patterns of other databases means having all the same problems as other databases... Oracle is now selling ZFS servers specifically designed for running the Oracle database, but that's with "hybrid storage" "flash" (ZIL and L2ARC on SSD). 
While BTRFS doesn't support features equivalent to ZIL and L2ARC it's easy to run a separate filesystem on SSD for things that need performance (few if any current BTRFS users would have a database too big to entirely fit on a SSD). The problem we are dealing with is "database-like" access patterns on systems that are not designed as database servers. Would it be possible to get an interface for defragmenting files that's not specific to BTRFS? If we had a standard way of doing this then systemd-journald could request a defragment of the file at appropriate times. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/ ^ permalink raw reply [flat|nested] 33+ messages in thread
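No such filesystem-agnostic interface exists today; what exists is the btrfs-specific ioctl behind `btrfs fi defrag`. A sketch of invoking it directly follows. The ioctl number is reconstructed from the kernel's btrfs headers (BTRFS_IOCTL_MAGIC 0x94, request 2, a 4096-byte `struct btrfs_ioctl_vol_args`) and should be treated as an assumption; on non-btrfs filesystems the call simply reports that defrag is unsupported:

```python
import errno
import fcntl
import os

def _IOW(magic: int, nr: int, size: int) -> int:
    # Mirror of the kernel's _IOW() macro from asm-generic/ioctl.h
    return (1 << 30) | (size << 16) | (magic << 8) | nr

# BTRFS_IOC_DEFRAG = _IOW(BTRFS_IOCTL_MAGIC, 2, struct btrfs_ioctl_vol_args)
BTRFS_IOC_DEFRAG = _IOW(0x94, 2, 4096)

def try_defrag(path: str) -> bool:
    """Ask btrfs to defragment one file; False if the fs can't do it."""
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.ioctl(fd, BTRFS_IOC_DEFRAG, 0)  # NULL arg: defrag whole file
        return True
    except OSError as e:
        if e.errno in (errno.ENOTTY, errno.EOPNOTSUPP, errno.EINVAL):
            return False  # not btrfs, or defrag not supported here
        raise
    finally:
        os.close(fd)
```

A portable interface would let callers like journald issue this hint without knowing which filesystem is underneath.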
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS 2014-06-16 10:35 ` Russell Coker @ 2014-06-16 11:16 ` Austin S Hemmelgarn 0 siblings, 0 replies; 33+ messages in thread From: Austin S Hemmelgarn @ 2014-06-16 11:16 UTC (permalink / raw) To: russell, Lennart Poettering; +Cc: kreijack, Duncan, linux-btrfs, systemd-devel [-- Attachment #1: Type: text/plain, Size: 2434 bytes --] On 2014-06-16 06:35, Russell Coker wrote: > On Mon, 16 Jun 2014 12:14:49 Lennart Poettering wrote: >> On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote: >>>> I am not really following though why this trips up btrfs though. I am >>>> not sure I understand why this breaks btrfs COW behaviour. I mean, >>>> fallocate() isn't necessarily supposed to write anything really, it's >>>> mostly about allocating disk space in advance. I would claim that >>>> journald's usage of it is very much within the entire reason why it >>>> exists... >>> >>> I don't believe that fallocate() makes any difference to fragmentation on >>> BTRFS. Blocks will be allocated when writes occur so regardless of an >>> fallocate() call the usage pattern in systemd-journald will cause >>> fragmentation. >> >> journald's write pattern looks something like this: append something to >> the end, make sure it is written, then update a few offsets stored at >> the beginning of the file to point to the newly appended data. This is >> of course not easy to handle for COW file systems. But then again, it's >> probably not too different from access patterns of other database or >> database-like engines... > > Not being too different from the access patterns of other databases means > having all the same problems as other databases... Oracle is now selling ZFS > servers specifically designed for running the Oracle database, but that's with > "hybrid storage" "flash" (ZIL and L2ARC on SSD). 
While BTRFS doesn't support > features equivalent for ZIL and L2ARC it's easy to run a separate filesystem > on SSD for things that need performance (few if any current BTRFS users would > have a database too big to entirely fit on a SSD). > > The problem we are dealing with is "database-like" access patterns on systems > that are not designed as database servers. > > Would it be possible to get an interface for defragmenting files that's not > specific to BTRFS? If we had a standard way of doing this then systemd- > journald could request a defragment of the file at appropriate times. > While this is a wonderful idea, what about all the extra I/O this will cause (and all the extra wear on SSD's)? While I understand wanting this to be faster, you should also consider the fact that defragmenting the file on a regular basis is going to trash performance for other applications. [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 2967 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS 2014-06-16 10:14 ` Lennart Poettering 2014-06-16 10:35 ` Russell Coker @ 2014-06-16 11:56 ` Andrey Borzenkov 2014-06-16 16:05 ` Josef Bacik 2 siblings, 0 replies; 33+ messages in thread From: Andrey Borzenkov @ 2014-06-16 11:56 UTC (permalink / raw) To: Lennart Poettering Cc: Russell Coker, SystemD Devel, kreijack, Duncan, linux-btrfs On Mon, Jun 16, 2014 at 2:14 PM, Lennart Poettering <lennart@poettering.net> wrote: > On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote: > >> > I am not really following though why this trips up btrfs though. I am >> > not sure I understand why this breaks btrfs COW behaviour. I mean, >> > fallocate() isn't necessarily supposed to write anything really, it's >> > mostly about allocating disk space in advance. I would claim that >> > journald's usage of it is very much within the entire reason why it >> > exists... >> >> I don't believe that fallocate() makes any difference to fragmentation on >> BTRFS. Blocks will be allocated when writes occur so regardless of an >> fallocate() call the usage pattern in systemd-journald will cause >> fragmentation. > > journald's write pattern looks something like this: append something to > the end, make sure it is written, then update a few offsets stored at > the beginning of the file to point to the newly appended data. This is > of course not easy to handle for COW file systems. But then again, it's > probably not too different from access patterns of other database or > database-like engines... > ... which traditionally experienced severe sequential read performance degradation in this case. As I understand it, this is exactly what happens here: readahead attempts to preload the files, which gives us heavy random read access. The only real remedy was to defragment the files. It should work relatively well for the journal, where files are mostly "write once", at the expense of additional read/write activity.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS 2014-06-16 10:14 ` Lennart Poettering 2014-06-16 10:35 ` Russell Coker 2014-06-16 11:56 ` Andrey Borzenkov @ 2014-06-16 16:05 ` Josef Bacik 2014-06-16 19:52 ` Martin 2014-06-17 21:12 ` Lennart Poettering 2 siblings, 2 replies; 33+ messages in thread From: Josef Bacik @ 2014-06-16 16:05 UTC (permalink / raw) To: Lennart Poettering, Russell Coker Cc: kreijack, Duncan, linux-btrfs, systemd-devel On 06/16/2014 03:14 AM, Lennart Poettering wrote: > On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote: > >>> I am not really following though why this trips up btrfs though. I am >>> not sure I understand why this breaks btrfs COW behaviour. I mean, >>> fallocate() isn't necessarily supposed to write anything really, it's >>> mostly about allocating disk space in advance. I would claim that >>> journald's usage of it is very much within the entire reason why it >>> exists... >> >> I don't believe that fallocate() makes any difference to fragmentation on >> BTRFS. Blocks will be allocated when writes occur so regardless of an >> fallocate() call the usage pattern in systemd-journald will cause >> fragmentation. > > journald's write pattern looks something like this: append something to > the end, make sure it is written, then update a few offsets stored at > the beginning of the file to point to the newly appended data. This is > of course not easy to handle for COW file systems. But then again, it's > probably not too different from access patterns of other database or > database-like engines... Was waiting for you to show up before I said anything since most systemd related emails always devolve into how evil you are rather than what is actually happening. So you are doing all the right things from what I can tell, I'm just a little confused about when you guys run fsync. From what I can tell it's only when you open the journal file and when you switch it to "offline." 
I didn't look too much past this point so I don't know how often these things happen. Are you taking an individual message, writing it, updating the head of the file and then fsync'ing? Or are you getting a good bit of dirty log data and fsyncing occasionally? What would cause btrfs problems is if you fallocate(), write a small chunk, fsync, write a small chunk again, fsync again, etc. Fallocate saves you the first write around, but if the next write is within the same block as the previous write we'll end up triggering COW and entering fragmented territory. If this is what journald is doing then that would be good to know; if not, I'd like to know what is happening since we shouldn't be fragmenting this badly. Like I said, what you guys are doing is fine; if btrfs falls on its face then it's not your fault. I'd just like an exact idea of when you guys are fsync'ing so I can replicate in a smaller way. Thanks, Josef ^ permalink raw reply [flat|nested] 33+ messages in thread
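The suspected pathological pattern Josef describes (fallocate once, then many sub-block writes, each followed by its own fsync) can be reproduced in miniature like this. The path and sizes are invented; on btrfs, each fsync that lands in an already-written block COWs that block to a new location, which is where the thousands of tiny extents come from:

```python
import os

path = "/tmp/cow-stress.bin"
fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o640)
os.posix_fallocate(fd, 0, 1 << 20)  # reserve 1 MiB up front, journald-style

pos = 0
for _ in range(64):
    chunk = b"x" * 100   # far smaller than a 4 KiB filesystem block
    os.pwrite(fd, chunk, pos)
    os.fsync(fd)         # forcing each tiny write out is the damaging part
    pos += len(chunk)
os.close(fd)
```

Run against a btrfs mount, inspecting the file with `filefrag` before and after would show the extent count this pattern produces.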
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS 2014-06-16 16:05 ` Josef Bacik @ 2014-06-16 19:52 ` Martin 2014-06-16 20:20 ` Josef Bacik ` (2 more replies) 2014-06-17 21:12 ` Lennart Poettering 1 sibling, 3 replies; 33+ messages in thread From: Martin @ 2014-06-16 19:52 UTC (permalink / raw) To: linux-btrfs; +Cc: systemd-devel On 16/06/14 17:05, Josef Bacik wrote: > > On 06/16/2014 03:14 AM, Lennart Poettering wrote: >> On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote: >> >>>> I am not really following though why this trips up btrfs though. I am >>>> not sure I understand why this breaks btrfs COW behaviour. I mean, >>> I don't believe that fallocate() makes any difference to >>> fragmentation on >>> BTRFS. Blocks will be allocated when writes occur so regardless of an >>> fallocate() call the usage pattern in systemd-journald will cause >>> fragmentation. >> >> journald's write pattern looks something like this: append something to >> the end, make sure it is written, then update a few offsets stored at >> the beginning of the file to point to the newly appended data. This is >> of course not easy to handle for COW file systems. But then again, it's >> probably not too different from access patterns of other database or >> database-like engines... Even though this appears to be a problem case for btrfs/COW, is there a more favourable write/access sequence possible that is easily implemented and works for both ext4-like fs /and/ COW fs? Database-like writing is known 'difficult' for filesystems: Can a data log be a simpler case? > Was waiting for you to show up before I said anything since most systemd > related emails always devolve into how evil you are rather than what is > actually happening. Ouch! Hope you two know each other!! :-P :-) [...] > since we shouldn't be fragmenting this badly. > > Like I said what you guys are doing is fine, if btrfs falls on it's face > then its not your fault.
I'd just like an exact idea of when you guys > are fsync'ing so I can replicate in a smaller way. Thanks, Good if COW can be so resilient. I have about 2GBytes of data logging files and I must defrag those as part of my backups to stop the system fragmenting to a stop (I use "cp -a" to defrag the files to a new area and restart the data software logger on that). Random thoughts: Would using a second small file just for the mmap-ed pointers help avoid repeated rewriting of random offsets in the log file causing excessive fragmentation? Align the data writes to 16kByte or 64kByte boundaries/chunks? Are mmap-ed files a similar problem to using a swap file and so should the same "btrfs file swap" code be used for both? Not looked over the code so all random guesses... Regards, Martin ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS 2014-06-16 19:52 ` Martin @ 2014-06-16 20:20 ` Josef Bacik 2014-06-17 0:15 ` Austin S Hemmelgarn 2014-06-17 1:13 ` cwillu 2 siblings, 0 replies; 33+ messages in thread From: Josef Bacik @ 2014-06-16 20:20 UTC (permalink / raw) To: Martin, linux-btrfs; +Cc: systemd-devel On 06/16/2014 12:52 PM, Martin wrote: > On 16/06/14 17:05, Josef Bacik wrote: >> >> On 06/16/2014 03:14 AM, Lennart Poettering wrote: >>> On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote: >>> >>>>> I am not really following though why this trips up btrfs though. I am >>>>> not sure I understand why this breaks btrfs COW behaviour. I mean, > >>>> I don't believe that fallocate() makes any difference to >>>> fragmentation on >>>> BTRFS. Blocks will be allocated when writes occur so regardless of an >>>> fallocate() call the usage pattern in systemd-journald will cause >>>> fragmentation. >>> >>> journald's write pattern looks something like this: append something to >>> the end, make sure it is written, then update a few offsets stored at >>> the beginning of the file to point to the newly appended data. This is >>> of course not easy to handle for COW file systems. But then again, it's >>> probably not too different from access patterns of other database or >>> database-like engines... > > Even though this appears to be a problem case for btrfs/COW, is there a > more favourable write/access sequence possible that is easily > implemented that is favourable for both ext4-like fs /and/ COW fs? > > Database-like writing is known 'difficult' for filesystems: Can a data > log can be a simpler case? > > >> Was waiting for you to show up before I said anything since most systemd >> related emails always devolve into how evil you are rather than what is >> actually happening. > > Ouch! Hope you two know each other!! 
:-P :-) > Yup, I <3 Lennart, I'd rather deal with him directly than wade through all the FUD that flies around when systemd is brought up. > > [...] >> since we shouldn't be fragmenting this badly. >> >> Like I said what you guys are doing is fine, if btrfs falls on it's face >> then its not your fault. I'd just like an exact idea of when you guys >> are fsync'ing so I can replicate in a smaller way. Thanks, > > Good if COW can be so resilient. I have about 2GBytes of data logging > files and I must defrag those as part of my backups to stop the system > fragmenting to a stop (I use "cp -a" to defrag the files to a new area > and restart the data software logger on that). > > > Random thoughts: > > Would using a second small file just for the mmap-ed pointers help avoid > repeated rewriting of random offsets in the log file causing excessive > fragmentation? > Depends on when you fsync. The problem isn't dirtying so much as writing. > Align the data writes to 16kByte or 64kByte boundaries/chunks? > Yes, that would help the most: if journald would try to only fsync every blocksize worth of writes we'd suck less. > Are mmap-ed files a similar problem to using a swap file and so should > the same "btrfs file swap" code be used for both? > Not sure what this special swap file code is you speak of. Thanks, Josef ^ permalink raw reply [flat|nested] 33+ messages in thread
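Josef's "fsync every blocksize worth of writes" suggestion might look roughly like this. This is a hypothetical sketch, not journald code; BLOCK is an assumed 4 KiB filesystem block size:

```python
import os

BLOCK = 4096  # assumed filesystem block size

class BatchedLog:
    """Buffer appends and only write+fsync once a full block has
    accumulated, so each on-disk block is written (and, on btrfs,
    COWed) once instead of once per message."""

    def __init__(self, path: str):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o640)
        self.buf = bytearray()
        self.syncs = 0

    def append(self, msg: bytes) -> None:
        self.buf += msg
        while len(self.buf) >= BLOCK:
            os.write(self.fd, self.buf[:BLOCK])
            os.fsync(self.fd)
            self.syncs += 1
            del self.buf[:BLOCK]

    def close(self) -> None:
        if self.buf:  # flush the partial tail block on shutdown
            os.write(self.fd, self.buf)
            os.fsync(self.fd)
            self.syncs += 1
        os.close(self.fd)

log = BatchedLog("/tmp/batched-log.bin")
for _ in range(100):
    log.append(b"m" * 100)  # 10000 bytes total: two full blocks + a tail
log.close()
```

The trade-off is durability: messages sitting in the buffer are lost on a crash, which is exactly the tension the rest of the thread circles around.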
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS 2014-06-16 19:52 ` Martin 2014-06-16 20:20 ` Josef Bacik @ 2014-06-17 0:15 ` Austin S Hemmelgarn 2014-06-17 1:13 ` cwillu 2 siblings, 0 replies; 33+ messages in thread From: Austin S Hemmelgarn @ 2014-06-17 0:15 UTC (permalink / raw) To: Martin, linux-btrfs; +Cc: systemd-devel [-- Attachment #1: Type: text/plain, Size: 3469 bytes --] On 06/16/2014 03:52 PM, Martin wrote: > On 16/06/14 17:05, Josef Bacik wrote: >> >> On 06/16/2014 03:14 AM, Lennart Poettering wrote: >>> On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote: >>> >>>>> I am not really following though why this trips up btrfs though. I am >>>>> not sure I understand why this breaks btrfs COW behaviour. I mean, > >>>> I don't believe that fallocate() makes any difference to >>>> fragmentation on >>>> BTRFS. Blocks will be allocated when writes occur so regardless of an >>>> fallocate() call the usage pattern in systemd-journald will cause >>>> fragmentation. >>> >>> journald's write pattern looks something like this: append something to >>> the end, make sure it is written, then update a few offsets stored at >>> the beginning of the file to point to the newly appended data. This is >>> of course not easy to handle for COW file systems. But then again, it's >>> probably not too different from access patterns of other database or >>> database-like engines... > > Even though this appears to be a problem case for btrfs/COW, is there a > more favourable write/access sequence possible that is easily > implemented that is favourable for both ext4-like fs /and/ COW fs? > > Database-like writing is known 'difficult' for filesystems: Can a data > log can be a simpler case? > > >> Was waiting for you to show up before I said anything since most systemd >> related emails always devolve into how evil you are rather than what is >> actually happening. > > Ouch! Hope you two know each other!! :-P :-) > > > [...] 
>> since we shouldn't be fragmenting this badly. >> >> Like I said what you guys are doing is fine, if btrfs falls on it's face >> then its not your fault. I'd just like an exact idea of when you guys >> are fsync'ing so I can replicate in a smaller way. Thanks, > > Good if COW can be so resilient. I have about 2GBytes of data logging > files and I must defrag those as part of my backups to stop the system > fragmenting to a stop (I use "cp -a" to defrag the files to a new area > and restart the data software logger on that). > > > Random thoughts: > > Would using a second small file just for the mmap-ed pointers help avoid > repeated rewriting of random offsets in the log file causing excessive > fragmentation? > > Align the data writes to 16kByte or 64kByte boundaries/chunks? > > Are mmap-ed files a similar problem to using a swap file and so should > the same "btrfs file swap" code be used for both? > > > Not looked over the code so all random guesses... > > Regards, > Martin > > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Just a thought, partly inspired by the mention of the swap code, has anyone tried making the file NOCOW and pre-allocating to the max journal size? A similar approach has seemed to help on my systems with generic log files (I keep debug level logs from almost everything, so I end up with very active log files with ridiculous numbers of fragments if I don't pre-allocate and mark them NOCOW). I don't know for certain how BTRFS handles appends to NOCOW files, but I would be willing to bet that it ends up with a new fragment for each filesystem block worth of space allocated. [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 2967 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS 2014-06-16 19:52 ` Martin 2014-06-16 20:20 ` Josef Bacik 2014-06-17 0:15 ` Austin S Hemmelgarn @ 2014-06-17 1:13 ` cwillu 2014-06-17 12:24 ` Martin ` (2 more replies) 2 siblings, 3 replies; 33+ messages in thread From: cwillu @ 2014-06-17 1:13 UTC (permalink / raw) To: Martin; +Cc: linux-btrfs, systemd-devel, clm It's not a mmap problem, it's a small writes with an msync or fsync after each one problem. For the case of sequential writes (via write or mmap), padding writes to page boundaries would help, if the wasted space isn't an issue. Another approach, again assuming all other writes are appends, would be to periodically (but frequently enough that the pages are still in cache) read a chunk of the file and write it back in-place, with or without an fsync. On the other hand, if you can afford to lose some logs on a crash, not fsyncing/msyncing after each write will also eliminate the fragmentation. (Worth pointing out that none of that is conjecture, I just spent 30 minutes testing those cases while composing this ;p) Josef has mentioned in irc that a piece of Chris' raid5/6 work will also fix this when it lands. On Mon, Jun 16, 2014 at 1:52 PM, Martin <m_btrfs@ml1.co.uk> wrote: > On 16/06/14 17:05, Josef Bacik wrote: >> >> On 06/16/2014 03:14 AM, Lennart Poettering wrote: >>> On Mon, 16.06.14 10:17, Russell Coker (russell@coker.com.au) wrote: >>> >>>>> I am not really following though why this trips up btrfs though. I am >>>>> not sure I understand why this breaks btrfs COW behaviour. I mean, > >>>> I don't believe that fallocate() makes any difference to >>>> fragmentation on >>>> BTRFS. Blocks will be allocated when writes occur so regardless of an >>>> fallocate() call the usage pattern in systemd-journald will cause >>>> fragmentation. 
>>> >>> journald's write pattern looks something like this: append something to >>> the end, make sure it is written, then update a few offsets stored at >>> the beginning of the file to point to the newly appended data. This is >>> of course not easy to handle for COW file systems. But then again, it's >>> probably not too different from access patterns of other database or >>> database-like engines... > > Even though this appears to be a problem case for btrfs/COW, is there a > more favourable write/access sequence possible that is easily > implemented that is favourable for both ext4-like fs /and/ COW fs? > > Database-like writing is known 'difficult' for filesystems: Can a data > log can be a simpler case? > > >> Was waiting for you to show up before I said anything since most systemd >> related emails always devolve into how evil you are rather than what is >> actually happening. > > Ouch! Hope you two know each other!! :-P :-) > > > [...] >> since we shouldn't be fragmenting this badly. >> >> Like I said what you guys are doing is fine, if btrfs falls on it's face >> then its not your fault. I'd just like an exact idea of when you guys >> are fsync'ing so I can replicate in a smaller way. Thanks, > > Good if COW can be so resilient. I have about 2GBytes of data logging > files and I must defrag those as part of my backups to stop the system > fragmenting to a stop (I use "cp -a" to defrag the files to a new area > and restart the data software logger on that). > > > Random thoughts: > > Would using a second small file just for the mmap-ed pointers help avoid > repeated rewriting of random offsets in the log file causing excessive > fragmentation? > > Align the data writes to 16kByte or 64kByte boundaries/chunks? > > Are mmap-ed files a similar problem to using a swap file and so should > the same "btrfs file swap" code be used for both? > > > Not looked over the code so all random guesses... 
> > Regards, > Martin > > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 33+ messages in thread
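[Editor's note] The append-then-update-the-header pattern quoted above, combined with the small-writes-plus-fsync diagnosis, can be sketched as follows. This is a minimal illustration of the access pattern only, not journald's actual code; the 8-byte little-endian "tail offset" header is an assumption made for the example:

```python
import os
import struct

HEADER_SIZE = 8  # one little-endian u64 "tail offset" (illustrative layout)

def journal_open(path):
    """Create the file with a small header that gets rewritten on every
    append -- the piece that forces btrfs to COW the first block over and
    over."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o640)
    if os.fstat(fd).st_size < HEADER_SIZE:
        os.pwrite(fd, struct.pack("<Q", HEADER_SIZE), 0)
        os.fsync(fd)
    return fd

def journal_append(fd, payload):
    """Append at EOF, fsync, then update the header offset and fsync
    again -- the small-write-with-fsync-after-each-one pattern blamed
    above for the fragmentation.  Returns the record's offset."""
    end = os.lseek(fd, 0, os.SEEK_END)
    os.write(fd, payload)
    os.fsync(fd)                                   # 1st sync: the appended data
    os.pwrite(fd, struct.pack("<Q", end + len(payload)), 0)
    os.fsync(fd)                                   # 2nd sync: the rewritten header
    return end
```

Each record thus costs two synced writes to two distant blocks, which on a COW filesystem tends to leave a new extent behind every time.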
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS 2014-06-17 1:13 ` cwillu @ 2014-06-17 12:24 ` Martin 2014-06-17 17:56 ` Chris Murphy 2014-06-17 18:46 ` Filipe Brandenburger 2 siblings, 0 replies; 33+ messages in thread From: Martin @ 2014-06-17 12:24 UTC (permalink / raw) To: linux-btrfs; +Cc: systemd-devel On 17/06/14 02:13, cwillu wrote: > It's not a mmap problem, it's a small writes with an msync or fsync > after each one problem. And for logging, that is exactly what is wanted to see why whatever crashed... Except... Whilst logging, hold off on the msync/fsync unless the next log message to be written is 'critical'? With that, the mundane logging gets appended just as for any normal file write. Only the more critical log messages suffer the extra overhead and fragmentation of an immediate msync/fsync. > For the case of sequential writes (via write or mmap), padding writes > to page boundaries would help, if the wasted space isn't an issue. > Another approach, again assuming all other writes are appends, would > be to periodically (but frequently enough that the pages are still in > cache) read a chunk of the file and write it back in-place, with or > without an fsync. On the other hand, if you can afford to lose some > logs on a crash, not fsyncing/msyncing after each write will also > eliminate the fragmentation. > > (Worth pointing out that none of that is conjecture, I just spent 30 > minutes testing those cases while composing this ;p) > > Josef has mentioned in irc that a piece of Chris' raid5/6 work will > also fix this when it lands. Interesting... The source problem is how the COW fragments under expected normal use... Is all this unavoidable unless we rethink the semantics? Regards, Martin ^ permalink raw reply [flat|nested] 33+ messages in thread
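[Editor's note] Martin's hold-off-on-fsync suggestion can be sketched like this (a hypothetical helper; the priority numbering follows syslog conventions, everything else is assumed for illustration):

```python
import os

LOG_CRIT = 2  # syslog priorities: 0=emerg, 1=alert, 2=crit, ... 7=debug

def append_log(fd, priority, line):
    """Append a log line, but only force it to stable storage when the
    message is 'critical'.  Mundane messages ride along with the next
    critical sync or with normal kernel writeback, trading a little
    crash-durability for far fewer small synced writes."""
    os.write(fd, line)
    if priority <= LOG_CRIT:
        os.fsync(fd)
```

This mirrors the trade-off discussed in the thread: only the critical messages pay the fragmentation cost of an immediate sync.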
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS 2014-06-17 1:13 ` cwillu 2014-06-17 12:24 ` Martin @ 2014-06-17 17:56 ` Chris Murphy 2014-06-17 18:46 ` Filipe Brandenburger 2 siblings, 0 replies; 33+ messages in thread From: Chris Murphy @ 2014-06-17 17:56 UTC (permalink / raw) To: cwillu; +Cc: linux-btrfs, systemd-devel On Jun 16, 2014, at 7:13 PM, cwillu <cwillu@cwillu.com> wrote: > It's not a mmap problem, it's a small writes with an msync or fsync > after each one problem. > > For the case of sequential writes (via write or mmap), padding writes > to page boundaries would help, if the wasted space isn't an issue. > Another approach, again assuming all other writes are appends, would > be to periodically (but frequently enough that the pages are still in > cache) read a chunk of the file and write it back in-place, with or > without an fsync. On the other hand, if you can afford to lose some > logs on a crash, not fsyncing/msyncing after each write will also > eliminate the fragmentation. Normally I'd be willing to give up ~30 seconds of journal to not have fragmented journals. But then if I use systemd.log_level=debug I'd like that to trigger more frequent flushing to make sure as little of the journaling is lost as possible. Does that make sense? Chris Murphy ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS 2014-06-17 1:13 ` cwillu 2014-06-17 12:24 ` Martin 2014-06-17 17:56 ` Chris Murphy @ 2014-06-17 18:46 ` Filipe Brandenburger 2014-06-17 19:42 ` Goffredo Baroncelli 2 siblings, 1 reply; 33+ messages in thread From: Filipe Brandenburger @ 2014-06-17 18:46 UTC (permalink / raw) To: linux-btrfs, systemd-devel On Mon, Jun 16, 2014 at 6:13 PM, cwillu <cwillu@cwillu.com> wrote: > For the case of sequential writes (via write or mmap), padding writes > to page boundaries would help, if the wasted space isn't an issue. > Another approach, again assuming all other writes are appends, would > be to periodically (but frequently enough that the pages are still in > cache) read a chunk of the file and write it back in-place, with or > without an fsync. On the other hand, if you can afford to lose some > logs on a crash, not fsyncing/msyncing after each write will also > eliminate the fragmentation. I was wondering if something could be done in btrfs to improve performance under this workload... Something like a "defrag on demand" for a case where mostly appends are happening. When there are small appends with fsync/msync, they become new fragments (as expected), but once the writes go past a block boundary, btrfs could defragment the previous block in background, since it's not really expected to change again. That could potentially achieve performance close to chattr +C without the drawbacks of disabling copy-on-write. Cheers, Filipe ^ permalink raw reply [flat|nested] 33+ messages in thread
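[Editor's note] Filipe's defrag-on-demand idea, modelled in user space, might look roughly like this. The 4 KiB block size and the callback interface are assumptions; a real implementation would live inside btrfs itself:

```python
BLOCK = 4096  # btrfs data block size assumed for this sketch

class AppendTracker:
    """Track an append-only offset and report which byte ranges have been
    'sealed' (fully written past a block boundary), i.e. are candidates
    for the background defrag proposed above.  The defrag itself is left
    to a callback, since it is filesystem-specific."""

    def __init__(self, defrag_cb):
        self.offset = 0
        self.defrag_cb = defrag_cb

    def append(self, nbytes):
        old_block = self.offset // BLOCK
        self.offset += nbytes
        new_block = self.offset // BLOCK
        if new_block > old_block:
            # Under the append-mostly assumption, the range
            # [old_block*BLOCK, new_block*BLOCK) will not change again,
            # so it is safe to defragment in the background.
            self.defrag_cb(old_block * BLOCK, new_block * BLOCK)
```

Only fully written blocks are handed to the defragmenter, so the still-active tail block (the one repeatedly COWed by small synced writes) is left alone.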
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS 2014-06-17 18:46 ` Filipe Brandenburger @ 2014-06-17 19:42 ` Goffredo Baroncelli 0 siblings, 0 replies; 33+ messages in thread From: Goffredo Baroncelli @ 2014-06-17 19:42 UTC (permalink / raw) To: linux-btrfs On 06/17/2014 08:46 PM, Filipe Brandenburger wrote: > On Mon, Jun 16, 2014 at 6:13 PM, cwillu <cwillu@cwillu.com> wrote: >> For the case of sequential writes (via write or mmap), padding writes >> to page boundaries would help, if the wasted space isn't an issue. >> Another approach, again assuming all other writes are appends, would >> be to periodically (but frequently enough that the pages are still in >> cache) read a chunk of the file and write it back in-place, with or >> without an fsync. On the other hand, if you can afford to lose some >> logs on a crash, not fsyncing/msyncing after each write will also >> eliminate the fragmentation. > > I was wondering if something could be done in btrfs to improve > performance under this workload... Something like a "defrag on demand" > for a case where mostly appends are happening. Instead of inventing a strategy smarter than the (already smart) filesystem, wouldn't it be simpler to do an explicit defrag? In any case this "smart strategy" is filesystem specific, so it would be simpler (and less error prone) to do an explicit defrag. I tried this strategy with systemd-journald, getting good results (doing an ioctl BTRFS_IOC_DEFRAG during the journal opening). BR G.Baroncelli -- gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 33+ messages in thread
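[Editor's note] The experiment Goffredo describes — issuing BTRFS_IOC_DEFRAG on the file when the journal is opened — can be approximated from user space like this. The ioctl constant is derived from `_IOW(0x94, 2, struct btrfs_ioctl_vol_args)` in the kernel headers; treat it as an assumption and verify it against your own headers. The function itself is a sketch, not journald's code:

```python
import fcntl
import os

# _IOW(0x94, 2, struct btrfs_ioctl_vol_args) with a 4096-byte struct:
# (1 << 30) | (4096 << 16) | (0x94 << 8) | 2.  With a NULL/zero argument
# the ioctl defragments the file the descriptor refers to.
BTRFS_IOC_DEFRAG = 0x50009402

def defrag_on_open(path):
    """Open a journal file and ask btrfs to defragment it up front.
    Returns (fd, True) if the ioctl succeeded, (fd, False) on any error
    (e.g. the file does not live on btrfs, where the kernel returns
    ENOTTY)."""
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.ioctl(fd, BTRFS_IOC_DEFRAG, 0)
        return fd, True
    except OSError:
        return fd, False
```

The failure path matters: on a non-btrfs filesystem the call must degrade gracefully rather than abort the open.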
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS 2014-06-16 16:05 ` Josef Bacik 2014-06-16 19:52 ` Martin @ 2014-06-17 21:12 ` Lennart Poettering 1 sibling, 0 replies; 33+ messages in thread From: Lennart Poettering @ 2014-06-17 21:12 UTC (permalink / raw) To: Josef Bacik; +Cc: Russell Coker, kreijack, Duncan, linux-btrfs, systemd-devel On Mon, 16.06.14 09:05, Josef Bacik (jbacik@fb.com) wrote: > So you are doing all the right things from what I can tell, I'm just > a little confused about when you guys run fsync. From what I can > tell it's only when you open the journal file and when you switch it > to "offline." I didn't look too much past this point so I don't > know how often these things happen. Are you taking an individual > message, writing it, updating the head of the file and then > fsync'ing? Or are you getting a good bit of dirty log data and > fsyncing occasionally? The latter. Basically when opening a file for writing we mark it in the header as "online", then fsync() it. When we close a file we fsync() it, then change the header to "offline", and sync() again. Also, 5min after each write we will also put things to offline, until the next write, when we will put things to online again. Finally, if something is logged at priorities EMERG, ALERT or CRIT we will sync immediately (which actually should never happen in real-life, unless something is really broken -- a simple way to check if anything like this got written is "journalctl -p crit"). Also, we rotate and start a new file every now and then, when we hit a size limit, but that is usually very seldom. Putting this together: we should normally fsync() only very infrequently. > What would cause btrfs problems is if you fallocate(), write a small > chunk, fsync, write a small chunk again, fsync again etc. Fallocate > saves you the first write around, but if the next write is within > the same block as the previous write we'll end up triggering cow and > enter fragmented territory. 
If this is what journald is > doing then that would be good to know, if not I'd like to know what > is happening since we shouldn't be fragmenting this badly. Hmm, the only way I see that that would happen is if a lot of stuff is logged at these super-high log levels mentioned above. But then again, that never really should happen in real-life. Could anyone who's experiencing the slowdowns have a look at the journalctl output mentioned above? Do you have more than a few lines printed like that? Lennart -- Lennart Poettering, Red Hat ^ permalink raw reply [flat|nested] 33+ messages in thread
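[Editor's note] Lennart's description of journald's sync scheme can be condensed into a sketch. The single state byte, layout, and names are illustrative assumptions; real journal files use a much richer header:

```python
import os
import struct

STATE_OFFLINE, STATE_ONLINE = 0, 1
LOG_CRIT = 2  # syslog: 0=emerg, 1=alert, 2=crit

class JournalFile:
    """Sketch of the open/close sequence described above: mark the header
    'online' and fsync on open; on close, fsync the data, flip the header
    to 'offline', and sync again; sync writes immediately only for
    messages at crit priority or above."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o640)
        self._set_state(STATE_ONLINE)       # mark online...
        os.fsync(self.fd)                   # ...then fsync

    def _set_state(self, state):
        os.pwrite(self.fd, struct.pack("<B", state), 0)

    def append(self, priority, record):
        os.lseek(self.fd, 0, os.SEEK_END)
        os.write(self.fd, record)
        if priority <= LOG_CRIT:            # EMERG/ALERT/CRIT only
            os.fsync(self.fd)

    def close(self):
        os.fsync(self.fd)                   # flush outstanding data
        self._set_state(STATE_OFFLINE)      # then flip to offline
        os.fsync(self.fd)                   # and sync the header change
        os.close(self.fd)
```

As Lennart notes, under this scheme ordinary traffic triggers no per-record fsync at all; only the open/close transitions and rare critical messages do.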
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS 2014-06-15 22:13 ` Lennart Poettering 2014-06-16 0:17 ` Russell Coker @ 2014-06-16 16:32 ` Goffredo Baroncelli 2014-06-16 18:47 ` Goffredo Baroncelli 2014-06-19 1:13 ` Dave Chinner 2 siblings, 1 reply; 33+ messages in thread From: Goffredo Baroncelli @ 2014-06-16 16:32 UTC (permalink / raw) To: Lennart Poettering; +Cc: Duncan, linux-btrfs, systemd-devel Hi Lennart, On 06/16/2014 12:13 AM, Lennart Poettering wrote: > I am not really following though why this trips up btrfs though. I am > not sure I understand why this breaks btrfs COW behaviour. I mean, > fallocate() isn't necessarily supposed to write anything really, it's > mostly about allocating disk space in advance. I would claim that > journald's usage of it is very much within the entire reason why it > exists... I performed several tests, trying different setups [1]. One of these was replacing the posix_fallocate() with a truncate, to check where the problem was. The conclusion was that *posix_fallocate() is NOT the problem*. In another reply you stated that systemd-journald appends some data at the end of the file, then updates some data in the middle. I think this is the reason why the file quickly becomes fragmented. [1] Let me revise the English, then I will post the results. -- gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS 2014-06-16 16:32 ` Goffredo Baroncelli @ 2014-06-16 18:47 ` Goffredo Baroncelli 0 siblings, 0 replies; 33+ messages in thread From: Goffredo Baroncelli @ 2014-06-16 18:47 UTC (permalink / raw) Cc: Lennart Poettering, Duncan, linux-btrfs, systemd-devel, Kai Krakow Hi all, in this blog post [1] I collected all the results of the tests which I performed in order to investigate this performance problem between systemd and btrfs. I had to put these results in a blog, because there are several images. Below is a brief summary. I took an old machine (a P4 2.5GHz with 512MB of RAM) with a fresh installation of Fedora 20, and I measured the boot time during several reboots (up to 70). I tested the following scenarios: 1) standard (without defragmenting any file, without readahead) 2) defragment the journal file at the end of the boot 3) defragment the journal file before the flushing 4) mark as NOCOW the journald log file 5) enable the systemd-readahead 6) remove the fsync(2) call from journald 7) remove the posix_fallocate(3) call from journald 8) do a defrag when posix_fallocate(3) is called 9) do a defrag when the journal log file is opened Test #1 highlights the problem. It shows that the boot time may reach up to 50 seconds. During the reboots the number of extents of the file system.journal increases up to 6000. De-fragmenting the system.journal file decreases the boot time by ~20 seconds. My conclusion is that in BTRFS the fragmentation of this file increases the boot time. Tests #6 and #7 suggested that the fsync(2) and posix_fallocate(3) calls aren't the root cause of the problem. Even without these the system.journal file still fragments. Test #4 suggested that marking the system.journal file NOCOW reduces its fragmentation and so the boot time. 
Tests #2, #3 and #9 suggested that performing a periodic defrag significantly reduces the fragmentation of system.journal and so the boot time. Test #5 revealed that the readahead capability of systemd was not effective, because it seems that the system.journal file was unaffected (but other *.journal files were). Further investigation is required. BR G.Baroncelli [1] http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html On 06/16/2014 06:32 PM, Goffredo Baroncelli wrote: > Hi Lennart, > > On 06/16/2014 12:13 AM, Lennart Poettering wrote: >> I am not really following though why this trips up btrfs though. I am >> not sure I understand why this breaks btrfs COW behaviour. I mean, >> fallocate() isn't necessarily supposed to write anything really, it's >> mostly about allocating disk space in advance. I would claim that >> journald's usage of it is very much within the entire reason why it >> exists... > > I performed several tests, trying different setups [1]. One of these was replacing the posix_fallocate() with a truncate, to check where the problem was. The conclusion was that *posix_fallocate() is NOT the problem*. > > In another reply you stated that systemd-journald appends some data at the end of the file, then updates some data in the middle. I think this is the reason why the file quickly becomes fragmented. > > > [1] Let me revise the English, then I will post the results. > > -- gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS 2014-06-15 22:13 ` Lennart Poettering 2014-06-16 0:17 ` Russell Coker 2014-06-16 16:32 ` Goffredo Baroncelli @ 2014-06-19 1:13 ` Dave Chinner 2 siblings, 0 replies; 33+ messages in thread From: Dave Chinner @ 2014-06-19 1:13 UTC (permalink / raw) To: Lennart Poettering; +Cc: kreijack, Duncan, linux-btrfs, systemd-devel On Mon, Jun 16, 2014 at 12:13:07AM +0200, Lennart Poettering wrote: > On Sat, 14.06.14 09:52, Goffredo Baroncelli (kreijack@libero.it) wrote: > > > > Which effectively means that by the time the 8 MiB is filled, each 4 KiB > > > block has been rewritten to a new location and is now an extent unto > > > itself. So now that 8 MiB is composed of 2048 new extents, each one a > > > single 4 KiB block in size. > > > > Several people pointed fallocate as the problem. But I don't > > understand the reason. > > BTW, the reason we use fallocate() in journald is not about trying to > optimize anything. It's only used for one reason: to avoid SIGBUS on > disk/quota full, since we actually write everything to the files using > mmap(). FWIW, fallocate() doesn't absolutely guarantee you that. When at ENOSPC, a write into that reserved range can still require un-reserved metadata blocks to be allocated. e.g. splitting a "reserved" data extent into two extents (used and reserved) requires an extra btree record, which can cause a split, which can require allocation. This tends to be pretty damn rare, though, and some filesystems have reserved block pools specifically for handling this sort of ENOSPC corner case. Hence, in practice the filesystems never actually fail with ENOSPC in ranges that have been fallocate()d. 
> I mean, writing things with mmap() is always problematic, and > handling write errors is awfully difficult, but at least two of the most > common reasons for failure we'd like to protect against in advance, under > the assumption that disk/quota full will be reported immediately by the > fallocate(), and the mmap writes later on will then necessarily succeed. > > I am not really following though why this trips up btrfs though. I am > not sure I understand why this breaks btrfs COW behaviour. I mean, > fallocate() isn't necessarily supposed to write anything really, it's > mostly about allocating disk space in advance. I would claim that > journald's usage of it is very much within the entire reason why it > exists... > > Anyway, happy to change these things around if necessary, but first I'd > like to have a very good explanation why fallocate() wouldn't be the > right thing to invoke here, and a suggestion what we should do instead > to cover this usecase... fallocate() of 8MB should be more than sufficient for non-COW filesystems - 1MB would be enough to prevent performance degradation due to fragmentation in most cases. The current problems seem to be with the way btrfs does rewrites, not the use of fallocate() in systemd. Thanks for the explanation, Lennart. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 33+ messages in thread
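[Editor's note] journald's stated rationale for fallocate() — have disk-full surface as ENOSPC at allocation time rather than as SIGBUS on a later mmap write — can be sketched as follows. This is a minimal illustration under stated assumptions; the 8 MiB default mirrors the number discussed above:

```python
import mmap
import os

def map_journal(path, size=8 * 1024 * 1024):
    """Reserve the space up front, then mmap it.  A full disk or quota
    shows up here as OSError(ENOSPC) from posix_fallocate(), instead of
    as SIGBUS later when a dirty mmap page cannot be backed by a real
    block (modulo the metadata corner cases Dave describes above)."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o640)
    try:
        os.posix_fallocate(fd, 0, size)   # may raise OSError(ENOSPC)
        return mmap.mmap(fd, size)
    finally:
        os.close(fd)                      # the mapping holds its own reference
```

Note posix_fallocate() also extends the file size to `size`, so the whole mapping is immediately writable.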
* Re: Slow startup of systemd-journal on BTRFS 2014-06-14 2:53 ` Duncan 2014-06-14 7:52 ` Goffredo Baroncelli @ 2014-06-14 10:59 ` Kai Krakow 2014-06-15 22:43 ` [systemd-devel] " Lennart Poettering 1 sibling, 1 reply; 33+ messages in thread From: Kai Krakow @ 2014-06-14 10:59 UTC (permalink / raw) To: linux-btrfs; +Cc: systemd-devel Duncan <1i5t5.duncan@cox.net> schrieb: > As they say, "Whoosh!" > > At least here, I interpreted that remark as primarily sarcastic > commentary on the systemd devs' apparent attitude, which can be > (controversially) summarized as: "Systemd doesn't have problems because > it's perfect. Therefore, any problems you have with systemd must instead > be with other components which systemd depends on." Come on, sorry, but this is fud. Really... ;-) > IOW, it's a btrfs problem now in practice, not because it is so in a > technical sense, but because systemd defines it as such and is unlikely > to budge, so the only way to achieve progress is for btrfs to deal with > it. I think that systemd is even one of the early supporters of btrfs because it will defragment readahead files on boot from btrfs. I'd suggest the problem is to be found in the different semantics with COW filesystems. And if someone loudly complains to the systemd developers how bad they are at doing their stuff - hmm, well, I would be disappointed/offended, too, as a programmer, because much very well done work has been put into systemd and I'd start ignoring such people. In Germany we have a saying for this: "Wie man in den Wald hineinruft, so schallt es heraus." (roughly: the forest echoes back whatever you shout into it) [1] They are doing many things right that have not been adapted to modern systems in the last twenty years (or so) with the legacy init systems. 
So let's start with my journals, on btrfs: $ sudo filefrag * system@0004fad12dae7676-98627a3d7df4e35e.journal~: 2 extents found system@0004fae8ea4b84a4-3a2dc4a93c5f7dc9.journal~: 2 extents found system@806cd49faa074a49b6cde5ff6fca8adc-000000000008e4cc-0004f82580cdcb45.journal: 5 extents found system@806cd49faa074a49b6cde5ff6fca8adc-0000000000097959-0004f89c2e8aff87.journal: 5 extents found system@806cd49faa074a49b6cde5ff6fca8adc-00000000000a166d-0004f98d7e04157c.journal: 5 extents found system@806cd49faa074a49b6cde5ff6fca8adc-00000000000aad59-0004fa379b9a1fdf.journal: 5 extents found system@ec16f60db38f43619f8337153a1cc024-0000000000000001-0004fae8e5057259.journal: 5 extents found system@ec16f60db38f43619f8337153a1cc024-00000000000092b1-0004fb59b1d034ad.journal: 5 extents found system.journal: 9 extents found user-500@e4209c6628ed4a65954678b8011ad73f-0000000000085b7a-0004f77d25ebba04.journal: 2 extents found user-500@e4209c6628ed4a65954678b8011ad73f-000000000008e7fb-0004f83c7bf18294.journal: 2 extents found user-500@e4209c6628ed4a65954678b8011ad73f-0000000000097fe4-0004f8ae69c198ca.journal: 2 extents found user-500@e4209c6628ed4a65954678b8011ad73f-00000000000a1a7e-0004f9966e9c69d8.journal: 2 extents found user-500.journal: 2 extents found I don't think these are too bad values, eh? Well, how did I accomplish that? First, I've set the journal directories nocow. Of course, systemd should do this by default. I'm not sure if this is a packaging or systemd code issue, tho. But I think the systemd devs agree that for COW fs, the journal directories should be set nocow. After all, the journal is a transactional database - it does not need COW protection at all costs. And I think they have their own checksumming protection. So, why let systemd bother with that? A lot of other software has the same semantic problems with btrfs, too (e.g. MySQL), where nobody shouts at the "inabilities" of the programmers. So why for systemd? 
Just because it's intrusive by its nature for being a radically and newly designed init system and thus requires some learning by its users/admins/packagers? Really? Come on... As admin and/or packager you have to stay with current technologies and developments anyways. It's only important to hide the details from the users. Back to the extents counts: What I did next was implementing a defrag job that regularly defrags the journal (actually, the complete log directory, as other log files suffer the same problem): $ cat /usr/local/sbin/defrag-logs.rb #!/bin/sh exec btrfs filesystem defragment -czlib -r /var/log It can be easily converted into a timer job with systemd. This is left as an exercise for the reader. BTW: Actually, that job isn't currently executed on my system, which makes the numbers above pretty impressive... However, autodefrag is turned on, which may play into the mix. I'm not sure. I stopped automatically running those defrag jobs a while ago (I have a few more). > An arguably fairer and more impartial assessment of this particular > situation suggests that neither btrfs, which as a COW-based filesystem, > like all COW-based filesystems has the existing-file-rewrite as a major > technical challenge that it must deal with /somehow/, nor systemd, which > in choosing to use fallocate is specifically putting itself in that > existing-file-rewrite class, are entirely at fault. This challenge is not only affecting systemd but also a lot of other packages which do not play nice with btrfs semantics. But - as you correctly write - you cannot point your finger at just one party. FS and user space have to come together to evaluate and fix the problems on both sides. 
> Meanwhile, I've predicted before and do so here again, that as btrfs > moves toward mainstream and starts supplanting ext* as the assumed Linux > default filesystem, some of these problems will simply "go away", because > at that point, various apps are no longer optimized for the assumed > default filesystem, and they'll either be patched at some level (distro > level if not upstream) to work better on the new default filesystem, or > will be replaced by something that does. And neither upstream nor distro > level does that patching, then at some point, people are going to find > that said distro performs worse than other distros that do that patching. > Another alternative is that distros will start setting /var/log/journal > NOCOW in their setup scripts by default when it's btrfs, thus avoiding > the problem. (Altho if they do automated snapshotting they'll also have > to set it as its own subvolume, to avoid the first-write-after-snapshot- > is-COW problem.) Well, that, and/or set autodefrag in the default mount > options. > Meanwhile, there's some focus on making btrfs behave better with such > rewrite-pattern files, but while I think the problem can be made /some/ > better, hopefully enough that the defaults bother far fewer people in far > fewer cases, I expect it'll always be a bit of a sore spot because that's > just how the technology works, and as such, setting NOCOW for such files > and/or using autodefrag will continue to be recommended for an optimized > setup. ^ permalink raw reply [flat|nested] 33+ messages in thread
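[Editor's note] Kai leaves the systemd timer conversion of his defrag job as an exercise; a sketch of the two unit files might look like this (unit names, the script path, and the daily schedule are illustrative assumptions, not from the thread):

```ini
# /etc/systemd/system/defrag-logs.service
[Unit]
Description=Defragment /var/log on btrfs

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/defrag-logs.rb

# /etc/systemd/system/defrag-logs.timer
[Unit]
Description=Run the /var/log defrag job daily

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```

Enabled with `systemctl enable --now defrag-logs.timer`, the service then runs once per day without needing a cron entry.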
* Re: [systemd-devel] Slow startup of systemd-journal on BTRFS 2014-06-14 10:59 ` Kai Krakow @ 2014-06-15 22:43 ` Lennart Poettering 0 siblings, 0 replies; 33+ messages in thread From: Lennart Poettering @ 2014-06-15 22:43 UTC (permalink / raw) To: Kai Krakow; +Cc: systemd-devel, linux-btrfs On Sat, 14.06.14 12:59, Kai Krakow (hurikhan77@gmail.com) wrote: > > Duncan <1i5t5.duncan@cox.net> schrieb: > > > As they say, "Whoosh!" > > > > At least here, I interpreted that remark as primarily sarcastic > > commentary on the systemd devs' apparent attitude, which can be > > (controversially) summarized as: "Systemd doesn't have problems because > > it's perfect. Therefore, any problems you have with systemd must instead > > be with other components which systemd depends on." > > Come on, sorry, but this is fud. Really... ;-) Interestingly, I never commented on anything in this area, and neither did anybody else from the systemd side afaics. The entire btrfs defrag thing I wasn't aware of before this thread started on the systemd ML a few days ago. I am not sure where you take your ideas about our "attitude" from. God, with behaviour like that you just make us ignore you, Duncan. Lennart -- Lennart Poettering, Red Hat ^ permalink raw reply [flat|nested] 33+ messages in thread
end of thread, other threads:[~2014-06-19 1:13 UTC | newest] Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2014-06-11 21:28 Slow startup of systemd-journal on BTRFS Goffredo Baroncelli 2014-06-12 0:40 ` Chris Murphy 2014-06-12 1:18 ` Russell Coker 2014-06-12 4:39 ` Duncan 2014-06-12 1:21 ` Dave Chinner 2014-06-12 1:37 ` Dave Chinner 2014-06-12 2:32 ` Chris Murphy 2014-06-15 22:34 ` [systemd-devel] " Lennart Poettering 2014-06-16 4:01 ` Chris Murphy 2014-06-16 4:38 ` cwillu 2014-06-12 11:13 R: " Goffredo Baroncelli <kreijack@libero.it> 2014-06-12 12:37 ` Duncan 2014-06-12 23:24 ` Dave Chinner 2014-06-13 22:19 ` Goffredo Baroncelli 2014-06-14 2:53 ` Duncan 2014-06-14 7:52 ` Goffredo Baroncelli 2014-06-15 5:43 ` Duncan 2014-06-15 22:39 ` [systemd-devel] " Lennart Poettering 2014-06-15 22:13 ` Lennart Poettering 2014-06-16 0:17 ` Russell Coker 2014-06-16 1:06 ` John Williams 2014-06-16 2:19 ` Russell Coker 2014-06-16 10:14 ` Lennart Poettering 2014-06-16 10:35 ` Russell Coker 2014-06-16 11:16 ` Austin S Hemmelgarn 2014-06-16 11:56 ` Andrey Borzenkov 2014-06-16 16:05 ` Josef Bacik 2014-06-16 19:52 ` Martin 2014-06-16 20:20 ` Josef Bacik 2014-06-17 0:15 ` Austin S Hemmelgarn 2014-06-17 1:13 ` cwillu 2014-06-17 12:24 ` Martin 2014-06-17 17:56 ` Chris Murphy 2014-06-17 18:46 ` Filipe Brandenburger 2014-06-17 19:42 ` Goffredo Baroncelli 2014-06-17 21:12 ` Lennart Poettering 2014-06-16 16:32 ` Goffredo Baroncelli 2014-06-16 18:47 ` Goffredo Baroncelli 2014-06-19 1:13 ` Dave Chinner 2014-06-14 10:59 ` Kai Krakow 2014-06-15 22:43 ` [systemd-devel] " Lennart Poettering