btrfs, journald logs, fragmentation, and fallocate

* btrfs, journald logs, fragmentation, and fallocate
@ 2017-04-28 16:16 Chris Murphy
  2017-04-28 17:05 ` Goffredo Baroncelli
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Chris Murphy @ 2017-04-28 16:16 UTC (permalink / raw)
  To: Btrfs BTRFS

Old news is that systemd-journald journals end up pretty heavily
fragmented on Btrfs due to COW. While journald uses chattr +C on
journal files now, COW still happens if the subvolume the journal is
in gets snapshot. e.g. a week old system.journal has 19000+ extents.

The news is I started a systemd thread.

This is the start:
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038724.html

Where it gets interesting, two messages by Andrei Borzenkov: He
evaluates existing code and does some tests on ext4 and XFS.
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038724.html
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038728.html

And then the question.
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038735.html

Given what journald is doing, is what Btrfs is doing expected? Is
there something it could do better to be more like ext4 and XFS in the
same situation? Or is it out of scope for Btrfs?

It appears to me (see below URLs pointing to example journals) that
journald fallocated in 8MiB increments but then ends up doing 4KiB
writes; there's a lot of these unused (unwritten) 8MiB extents that
appear in both filefrag and btrfs-debug -f outputs.

The +C idea just rearranges the deck chairs, it's not solving the
underlying problem except in the case where the containing subvolume
is never snapshot. And in the COW case, I'm seeing about 30 metadata
nodes being written out for what amounts to less than a 4KiB journal
append. Each time.

And that makes me wonder whether metadata fragmentation is happening
as a result. But in any case, there's a lot of metadata being written
for each journal update compared to what's being added to the journal
file.

And then that makes me wonder if a better optimization on Btrfs would
be having each write be a separate file. The small updates would have
data inline. Which is worse, a single file with 20000 fragments; or
40000 separate journal files? *shrug* At least those individual files
would be subject to compression with +c; whereas right now the open
endedness of the active journal has not a single compressed extent.
Only once rotated do they get compressed (via defragmentation which
journald does only on Btrfs). Journals contain highly compressible
data.

Anyway, two example journals. The parent directory has chattr +c, both
journals inherited it. The first URL is filefrag -v, the 2nd is
btrfs-debug -f; for each journal.

This is a rotated journal. Upon rotation on Btrfs, journald
defragments the file which ends up compressing it when chattr +c.
https://da.gd/4NKyq
https://da.gd/zEeYW

This is an active system.journal. No compressed extents (the writes I
think are too small).
https://da.gd/cBjX
https://da.gd/YXuI

Extra credit if you've followed this far... The rotated log has piles
of unwritten items in it that are making it fairly inefficient even
with compression. Just using cat to write its contents to a new file,
compression goes from a 1.27 ratio, to 5.70. Here are the results
after catting that file:
https://da.gd/rE8KT
https://da.gd/PD5qI

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 16+ messages in thread