To: linux-btrfs@vger.kernel.org
Cc: systemd-devel@lists.freedesktop.org
From: Duncan <1i5t5.duncan@cox.net>
Subject: Re: Slow startup of systemd-journal on BTRFS
Date: Sat, 14 Jun 2014 02:53:20 +0000 (UTC)

Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as excerpted:

> On 06/13/2014 01:24 AM, Dave Chinner wrote:
>> On Thu, Jun 12, 2014 at 12:37:13PM +0000, Duncan wrote:
>>>
>>> FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think
>>> actually pretty much equally bad without NOCOW set on the file.
>>
>> So maybe it's been fixed in systemd since the last time I looked.
>> Yup:
>>
>> http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal-file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58
>>
>> The reason it was changed? To "save a syscall per append", not to
>> prevent fragmentation of the file, which was the problem everyone was
>> complaining about...
>
> thanks for pointing that.
> However I am performing my tests on a fedora
> 20 with systemd-208, which seems have this change
>>
>>> Why? Because btrfs data blocks are 4 KiB. With COW, the effect for
>>> either 4 byte or 8 MiB file allocations is going to end up being the
>>> same, forcing (repeated until full) rewrite of each 4 KiB block into
>>> its own extent.
>
> I am reaching the conclusion that fallocate is not the problem. The
> fallocate increase the filesize of about 8MB, which is enough for some
> logging. So it is not called very often.

But...

If a file isn't (properly[1]) set NOCOW (and the btrfs isn't mounted
with nodatacow), then an fallocate of 8 MiB will increase the file size
by 8 MiB and write that out.  So far so good, as at that point the
8 MiB should be a single extent.  But then, data gets written into
4 KiB blocks of that 8 MiB one at a time, and because btrfs is COW, the
new data in the block must be written to a new location.

Which effectively means that by the time the 8 MiB is filled, each
4 KiB block has been rewritten to a new location and is now an extent
unto itself.  So now that 8 MiB is composed of 2048 new extents
(8 MiB / 4 KiB), each one a single 4 KiB block in size.

=:^(

Tho as I already stated, for file sizes up to 128 MiB or so anyway[2],
the btrfs autodefrag mount option should at least catch that and
rewrite (again), this time sequentially.

> I have to investigate more what happens when the log are copied from
> /run to /var/log/journal: this is when journald seems to slow all.

That's an interesting point.

At least in theory, during normal operation journald will write to
/var/log/journal, but there's a point during boot at which it flushes
the information accumulated during boot from the volatile /run location
to the non-volatile /var/log location.  /That/ write, at least, should
be sequential, since there will be more than 4 KiB of journal
accumulated that needs to be transferred at once.
However, if it's being handled by the forced pre-write fallocate
described above, then that's not going to be the case, as it'll then be
a rewrite of already fallocated file blocks and thus will get COWed
exactly as I described above.

=:^(

> I am prepared a PC which reboot continuously; I am collecting the time
> required to finish the boot vs the fragmentation of the system.journal
> file vs the number of boot. The results are dramatic: after 20 reboot,
> the boot time increase of 20-30 seconds. Doing a defrag of
> system.journal reduces the boot time to the original one, but after
> another 20 reboot, the boot time still requires 20-30 seconds more....
>
> It is a slow PC, but I saw the same behavior also on a more modern pc
> (i5 with 8GB).
>
> For both PC the HD is a mechanical one...

The problem's duplicable.  That's the first step toward a fix. =:^)

>> And that's now a btrfs problem.... :/
>
> Are you sure ?

As they say, "Whoosh!"

At least here, I interpreted that remark as primarily sarcastic
commentary on the systemd devs' apparent attitude, which can be
(controversially) summarized as: "Systemd doesn't have problems because
it's perfect.  Therefore, any problems you have with systemd must
instead be with other components which systemd depends on."

IOW, it's a btrfs problem now in practice, not because it is so in a
technical sense, but because systemd defines it as such and is unlikely
to budge, so the only way to achieve progress is for btrfs to deal with
it.

An arguably fairer and more impartial assessment of this particular
situation suggests that neither btrfs, which as a COW-based filesystem,
like all COW-based filesystems, has the existing-file-rewrite as a
major technical challenge that it must deal with /somehow/, nor
systemd, which in choosing to use fallocate is specifically putting
itself in that existing-file-rewrite class, is entirely at fault.
But that doesn't matter if one side refuses to budge, because then the
other side must do so regardless of where the fault was, if there is to
be any progress at all.

Meanwhile, I've predicted before and do so here again, that as btrfs
moves toward mainstream and starts supplanting ext* as the assumed
Linux default filesystem, some of these problems will simply "go away",
because at that point, various apps are no longer optimized for the
assumed default filesystem, and they'll either be patched at some level
(distro level if not upstream) to work better on the new default
filesystem, or will be replaced by something that does.  And if neither
upstream nor distro level does that patching, then at some point,
people are going to find that said distro performs worse than other
distros that do that patching.

Another alternative is that distros will start setting /var/log/journal
NOCOW in their setup scripts by default when it's btrfs, thus avoiding
the problem.  (Altho if they do automated snapshotting they'll also
have to set it as its own subvolume, to avoid the
first-write-after-snapshot-is-COW problem.)  Well, that, and/or set
autodefrag in the default mount options.

Meanwhile, there's some focus on making btrfs behave better with such
rewrite-pattern files, but while I think the problem can be made /some/
better, hopefully enough that the defaults bother far fewer people in
far fewer cases, I expect it'll always be a bit of a sore spot because
that's just how the technology works, and as such, setting NOCOW for
such files and/or using autodefrag will continue to be recommended for
an optimized setup.

---
[1] "Properly" set NOCOW: Btrfs doesn't guarantee the effectiveness of
setting NOCOW (chattr +C) unless the attribute is set while the file is
still zero size, effectively, at file creation.  The easiest way to do
that is to set NOCOW on the subdir that will contain the file, such
that when the file is created it inherits the NOCOW attribute
automatically.
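
As a rough illustration of footnote [1], here is a minimal Python sketch
that builds the chattr/lsattr invocations for marking the journal
directory NOCOW.  It assumes the journal lives in /var/log/journal on a
btrfs mount and that the chattr and lsattr tools are installed; the
helper names nocow_commands and apply_nocow are made up for this
example and come from no existing tool.  Per the footnote, only files
created after the attribute is set can be expected to inherit it, so
this does nothing for journal files that already exist.

```python
import subprocess
from pathlib import Path


def nocow_commands(journal_dir):
    """Build the commands that mark a directory NOCOW and show its flags.

    Nothing is executed here; the caller decides whether to run them.
    (Hypothetical helper, for illustration only.)
    """
    path = str(Path(journal_dir))
    return [
        ["chattr", "+C", path],  # set NOCOW on the directory itself
        ["lsattr", "-d", path],  # list the directory's own flags, not its contents
    ]


def apply_nocow(journal_dir, run=subprocess.run):
    """Run the commands above; needs root and a btrfs mount to succeed."""
    for cmd in nocow_commands(journal_dir):
        run(cmd, check=True)


if __name__ == "__main__":
    # Print rather than execute, so the sketch is safe to run anywhere.
    for cmd in nocow_commands("/var/log/journal"):
        print(" ".join(cmd))
```

Run as-is it only prints the two commands; calling
apply_nocow("/var/log/journal") as root would actually execute them.
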
[2] File sizes up to 128 MiB ... and possibly up to 1 GiB.  Under
128 MiB should be fine, over 1 GiB is known to cause issues, between
the two is a gray area that depends on the speed of the hardware and
the incoming write-stream.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman