* btrfs, journald logs, fragmentation, and fallocate
@ 2017-04-28 16:16 Chris Murphy
  2017-04-28 17:05 ` Goffredo Baroncelli
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Chris Murphy @ 2017-04-28 16:16 UTC (permalink / raw)
  To: Btrfs BTRFS

Old news is that systemd-journald journals end up pretty heavily
fragmented on Btrfs due to COW. While journald uses chattr +C on
journal files now, COW still happens if the subvolume the journal is
in gets snapshotted; e.g. a week-old system.journal has 19,000+
extents.
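
For anyone who wants to check their own system, something like this
shows both the +C flag and the extent count (paths assume persistent
journals in the usual location):

  # lsattr /var/log/journal/*/system.journal
  # filefrag /var/log/journal/*/system.journal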

The news is I started a systemd thread.

This is the start:
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038724.html

Where it gets interesting is two messages by Andrei Borzenkov: he
evaluates the existing code and does some tests on ext4 and XFS.
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038724.html
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038728.html

And then the question:
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038735.html

Given what journald is doing, is what Btrfs is doing expected? Is
there something it could do better to be more like ext4 and XFS in the
same situation? Or is it out of scope for Btrfs?

It appears to me (see the URLs below pointing to example journals)
that journald fallocates in 8MiB increments but then ends up doing
4KiB writes; there are a lot of these unused (unwritten) 8MiB extents
that appear in both the filefrag and btrfs-debug -f outputs.
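
A crude way to mimic that pattern outside journald (an untested
sketch; the sizes match what I'm observing, not necessarily what
journald does internally):

  fallocate -l 8M testfile
  for i in $(seq 0 99); do
      dd if=/dev/zero of=testfile bs=4K count=1 seek=$i \
         conv=notrunc,fsync 2>/dev/null
  done
  filefrag -v testfile    # look for extents flagged 'unwritten'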

The +C idea just rearranges the deck chairs; it doesn't solve the
underlying problem except in the case where the containing subvolume
is never snapshotted. And in the COW case, I'm seeing about 30
metadata nodes being written out for what amounts to less than a 4KiB
journal append. Each time.

And that makes me wonder whether metadata fragmentation is happening
as a result. But in any case, there's a lot of metadata being written
for each journal update compared to what's being added to the journal
file.

And then that makes me wonder if a better optimization on Btrfs would
be having each write be a separate file. The small updates would have
their data inline. Which is worse: a single file with 20,000
fragments, or 40,000 separate journal files? *shrug* At least those
individual files would be subject to compression with +c, whereas
right now the open-endedness of the active journal means it has not a
single compressed extent. Only once rotated do they get compressed
(via the defragmentation that journald does only on Btrfs). Journals
contain highly compressible data.



Anyway, two example journals. The parent directory has chattr +c;
both journals inherited it. For each journal, the first URL is the
filefrag -v output and the second is the btrfs-debug -f output.

This is a rotated journal. Upon rotation on Btrfs, journald
defragments the file, which ends up compressing it when chattr +c is
set.
https://da.gd/4NKyq
https://da.gd/zEeYW

This is an active system.journal. No compressed extents (the writes,
I think, are too small).
https://da.gd/cBjX
https://da.gd/YXuI


Extra credit if you've followed this far... The rotated log has piles
of unwritten items in it that make it fairly inefficient even with
compression. Just by using cat to write its contents to a new file,
the compression ratio goes from 1.27 to 5.70. Here are the results
after catting that file:
https://da.gd/rE8KT
https://da.gd/PD5qI
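
(The rewrite itself was nothing fancier than roughly this, with the
real rotated journal name in place of the placeholders:

  cd /var/log/journal/<machine-id>
  cat system@<rotated>.journal > rewritten.journal  # inherits +c from the dir
  filefrag -v rewritten.journal

and the ratios above come from comparing compressed vs. uncompressed
byte totals in the btrfs-debug -f listings.)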



-- 
Chris Murphy


* Re: btrfs, journald logs, fragmentation, and fallocate
  2017-04-28 16:16 btrfs, journald logs, fragmentation, and fallocate Chris Murphy
@ 2017-04-28 17:05 ` Goffredo Baroncelli
  2017-04-28 17:41   ` Chris Murphy
                     ` (2 more replies)
  2017-04-28 17:46 ` Peter Grandi
  2017-04-28 17:53 ` Peter Grandi
  2 siblings, 3 replies; 16+ messages in thread
From: Goffredo Baroncelli @ 2017-04-28 17:05 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On 2017-04-28 18:16, Chris Murphy wrote:
> Old news is that systemd-journald journals end up pretty heavily
> fragmented on Btrfs due to COW. While journald uses chattr +C on
> journal files now, COW still happens if the subvolume the journal is
> in gets snapshotted; e.g. a week-old system.journal has 19,000+
> extents.
> 
> The news is I started a systemd thread.
> 
> This is the start:
> https://lists.freedesktop.org/archives/systemd-devel/2017-April/038724.html
> 
> Where it gets interesting is two messages by Andrei Borzenkov: he
> evaluates the existing code and does some tests on ext4 and XFS.
> https://lists.freedesktop.org/archives/systemd-devel/2017-April/038724.html
> https://lists.freedesktop.org/archives/systemd-devel/2017-April/038728.html
> 
> And then the question:
> https://lists.freedesktop.org/archives/systemd-devel/2017-April/038735.html
> 
> Given what journald is doing, is what Btrfs is doing expected? Is
> there something it could do better to be more like ext4 and XFS in the
> same situation? Or is it out of scope for Btrfs?

In the past I faced the same problems; I collected some data here: http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html
Unfortunately the journald file layout is very bad: first the data is written (appended), then the index fields are updated. These indexes live just after the last write, so fragmentation is unavoidable.

After some thinking I adopted a different strategy: I use journald as a collector, then forward all the logs to rsyslogd, which uses an append-only format. Journald never writes on the root filesystem, only in tmp.
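
The journald side of this needs only one line, roughly (see
journald.conf(5); with Storage=volatile the journal lives only under
/run/log/journal and never touches persistent storage):

  # /etc/systemd/journald.conf
  [Journal]
  Storage=volatile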

The thing became interesting when I discovered that searching in an rsyslog file is faster than journalctl (on rotational media). Unfortunately I don't have any data to support this.
However, if someone is interested I can share more details.

BR
G.Baroncelli



-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: btrfs, journald logs, fragmentation, and fallocate
  2017-04-28 17:05 ` Goffredo Baroncelli
@ 2017-04-28 17:41   ` Chris Murphy
  2017-04-28 18:54     ` Goffredo Baroncelli
  2017-04-28 20:14     ` Adam Borowski
  2017-04-29  1:46   ` Paul Jones
  2017-04-29  4:16   ` Duncan
  2 siblings, 2 replies; 16+ messages in thread
From: Chris Murphy @ 2017-04-28 17:41 UTC (permalink / raw)
  To: Goffredo Baroncelli; +Cc: Chris Murphy, Btrfs BTRFS

On Fri, Apr 28, 2017 at 11:05 AM, Goffredo Baroncelli
<kreijack@inwind.it> wrote:

> In the past I faced the same problems; I collected some data here: http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html
> Unfortunately the journald file layout is very bad: first the data is written (appended), then the index fields are updated. These indexes live just after the last write, so fragmentation is unavoidable.
>
> After some thinking I adopted a different strategy: I use journald as a collector, then forward all the logs to rsyslogd, which uses an append-only format. Journald never writes on the root filesystem, only in tmp.

The gotcha though is there's a pile of data in the journal that would
never make it to rsyslogd. If you use journalctl -o verbose you can
see some of this. There's a bunch of extra metadata in the journal.
And then also filtering based on that metadata is useful rather than
being limited to grep on a syslog file. Which, you know, it's fine for
many use cases. I guess I'm just interested in whether there's an
enhancement that can be done to make journals more compatible with
Btrfs or vice versa. It's not a huge problem anyway.
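
For example, the kind of field-based filtering I mean (the unit name
here is just illustrative):

  $ journalctl -o verbose _SYSTEMD_UNIT=NetworkManager.service

grep on a flat syslog file can't do that without re-parsing every
line.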


>
> The thing became interesting when I discovered that searching in an rsyslog file is faster than journalctl (on rotational media). Unfortunately I don't have any data to support this.


Yes on drives all of these scattered extents cause a lot of head
seeking. And I also suspect it's a lot of metadata spread out
everywhere too, to account for all of these extents. That's why they
moved to chattr +C to make them nocow. An idea I had on systemd list
was to automatically make the journal directory a Btrfs subvolume,
similar to how systemd already creates a /var/lib/machines subvolume
for nspawn containers. This prevents the journals from being caught up
in a snapshot of the parent subvolume that typically contains the
journals (root fs). There's no practical use I can think of for
snapshotting logs. You'd really want the logs to always be linear,
contiguous, and never get rolled back. Even if something in the system
does get rolled back, you'd want the logs to show that and continue
on, rather than being rolled back themselves.

So the super simple option would be continue with +C on journals, and
then a separate subvolume to prevent COW from ever happening
inadvertently.
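
Done by hand on an existing system, that would be roughly this sketch
(with journald stopped first):

  # mv /var/log/journal /var/log/journal.old
  # btrfs subvolume create /var/log/journal
  # chattr +C /var/log/journal
  # cp -a /var/log/journal.old/. /var/log/journal/
  # rm -rf /var/log/journal.old

The copy step matters because +C only reliably applies to files
created in the directory after the flag is set, not to existing
files.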

The same behavior happens with NTFS in qcow2 files. They quickly end
up with 100,000+ extents unless set nocow. It's like the worst case
scenario.

-- 
Chris Murphy


* Re: btrfs, journald logs, fragmentation, and fallocate
  2017-04-28 16:16 btrfs, journald logs, fragmentation, and fallocate Chris Murphy
  2017-04-28 17:05 ` Goffredo Baroncelli
@ 2017-04-28 17:46 ` Peter Grandi
  2017-04-28 19:43   ` Chris Murphy
  2017-04-28 17:53 ` Peter Grandi
  2 siblings, 1 reply; 16+ messages in thread
From: Peter Grandi @ 2017-04-28 17:46 UTC (permalink / raw)
  To: Linux fs Btrfs

> Old news is that systemd-journald journals end up pretty
> heavily fragmented on Btrfs due to COW.

This has been discussed in detail here before, indeed, but also
here: http://www.sabi.co.uk/blog/15-one.html?150203#150203

> While journald uses chattr +C on journal files now, COW still
> happens if the subvolume the journal is in gets snapshotted; e.g.
> a week-old system.journal has 19,000+ extents. [ ... ] It
> appears to me (see the URLs below pointing to example journals)
> that journald fallocates in 8MiB increments but then ends up
> doing 4KiB writes; [ ... ]

So there are three layers of silliness here:

* Writing large files slowly to a COW filesystem and
  snapshotting it frequently.
* A filesystem that does delayed allocation instead of
  allocate-ahead, and does not have psychic code.
* Working around that by using no-COW and preallocation
  with a fixed size regardless of snapshot frequency.

The primary problem here is that there is no way to have slow
small writes and frequent snapshots without generating small
extents: if a file is written at a rate of 1MiB/hour and gets
snapshot every hour the extent size will not be larger than 1MiB
*obviously*.

Filesystem-level snapshots are not designed to snapshot slowly
growing files, but to snapshot changing collections of
files. There are harsh tradeoffs involved. Application-level
snapshots (also known as log rotations :->) are needed for
special cases and finer-grained policies.

The secondary problem is that a fixed preallocation of 8MiB is
good only if in between snapshots the file grows by a little
less than 8MiB or by substantially more.


* Re: btrfs, journald logs, fragmentation, and fallocate
  2017-04-28 16:16 btrfs, journald logs, fragmentation, and fallocate Chris Murphy
  2017-04-28 17:05 ` Goffredo Baroncelli
  2017-04-28 17:46 ` Peter Grandi
@ 2017-04-28 17:53 ` Peter Grandi
  2017-04-28 19:55   ` Chris Murphy
  2 siblings, 1 reply; 16+ messages in thread
From: Peter Grandi @ 2017-04-28 17:53 UTC (permalink / raw)
  To: Linux fs Btrfs

> [ ... ] And that makes me wonder whether metadata
> fragmentation is happening as a result. But in any case,
> there's a lot of metadata being written for each journal
> update compared to what's being added to the journal file. [
> ... ]

That's the "wandering trees" problem in COW filesystems, and
manifestations of it in Btrfs have also been reported before.
If there is a workload that triggers a lot of "wandering trees"
updates, then a filesystem that has "wandering trees" perhaps
should not be used :-).

> [ ... ] worse: a single file with 20,000 fragments, or 40,000
> separate journal files? *shrug* [ ... ]

Well, depends, but probably the single file: it is more likely
that the 20,000 fragments will actually be contiguous, and that
there will be less metadata IO than for 40,000 separate journal
files.

The deeper "strategic" issue is that storage systems and
filesystems in particular have very anisotropic performance
envelopes, and mismatches between the envelopes of application
and filesystem can be very expensive:
  http://www.sabi.co.uk/blog/15-two.html?151023#151023


* Re: btrfs, journald logs, fragmentation, and fallocate
  2017-04-28 17:41   ` Chris Murphy
@ 2017-04-28 18:54     ` Goffredo Baroncelli
  2017-04-28 19:39       ` Peter Grandi
  2017-04-28 20:14     ` Adam Borowski
  1 sibling, 1 reply; 16+ messages in thread
From: Goffredo Baroncelli @ 2017-04-28 18:54 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On 2017-04-28 19:41, Chris Murphy wrote:
> On Fri, Apr 28, 2017 at 11:05 AM, Goffredo Baroncelli
> <kreijack@inwind.it> wrote:
> 
>> In the past I faced the same problems; I collected some data here: http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html
>> Unfortunately the journald file layout is very bad: first the data is written (appended), then the index fields are updated. These indexes live just after the last write, so fragmentation is unavoidable.
>>
>> After some thinking I adopted a different strategy: I use journald as a collector, then forward all the logs to rsyslogd, which uses an append-only format. Journald never writes on the root filesystem, only in tmp.
> 
> The gotcha though is there's a pile of data in the journal that would
> never make it to rsyslogd. If you use journalctl -o verbose you can
> see some of this. 

You can send *all the info* to rsyslogd via imjournal

http://www.rsyslog.com/doc/v8-stable/configuration/modules/imjournal.html
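
The relevant rsyslog configuration is small; roughly this sketch
(from memory, so check the parameter names against the doc above; the
template part is what emits the @cee JSON):

  module(load="imjournal" StateFile="imjournal.state")
  template(name="cee" type="string"
           string="%TIMESTAMP:::date-rfc3339% %HOSTNAME% %syslogtag%@cee: %$!all-json%\n")
  action(type="omfile" file="/var/log/cee.log" template="cee")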

In my setup all the data is stored in JSON format in the /var/log/cee.log file:


$ head  /var/log/cee.log
2017-04-28T18:41:41.931273+02:00 venice liblogging-stdlog: @cee: { "PRIORITY": "6", "_BOOT_ID": "a86d74bab91f44dc974c76aceb97141f", "_MACHINE_ID": "e84907d099904117b355a99c98378dca", "_HOSTNAME": "venice.bhome", "_SYSTEMD_SLICE": "system.slice", "_UID": "0", "_GID": "0", "_CAP_EFFECTIVE": "3fffffffff", "_TRANSPORT": "syslog", "SYSLOG_FACILITY": "23", "SYSLOG_IDENTIFIER": "liblogging-stdlog", "MESSAGE": " [origin software=\"rsyslogd\" swVersion=\"8.24.0\" x-pid=\"737\" x-info=\"http:\/\/www.rsyslog.com\"] rsyslogd was HUPed", "_PID": "737", "_COMM": "rsyslogd", "_EXE": "\/usr\/sbin\/rsyslogd", "_CMDLINE": "\/usr\/sbin\/rsyslogd -n", "_SYSTEMD_CGROUP": "\/system.slice\/rsyslog.service", "_SYSTEMD_UNIT": "rsyslog.service", "_SYSTEMD_INVOCATION_ID": "18b9a8b27f9143728adef972db7b394c", "_SOURCE_REALTIME_TIMESTAMP": "1493397701931255", "msg": "[origin software=\"rsyslogd\" swVersion=\"8.24.0\" x-pid=\"737\" x-info=\"http:\/\/www.rsyslog.com\"] rsyslogd was HUPed" }
2017-04-28T18:41:42.058549+02:00 venice liblogging-stdlog: @cee: { "PRIORITY": "6", "_BOOT_ID": "a86d74bab91f44dc974c76aceb97141f", "_MACHINE_ID": "e84907d099904117b355a99c98378dca", "_HOSTNAME": "venice.bhome", "_SYSTEMD_SLICE": "system.slice", "_UID": "0", "_GID": "0", "_CAP_EFFECTIVE": "3fffffffff", "_TRANSPORT": "syslog", "SYSLOG_FACILITY": "23", "SYSLOG_IDENTIFIER": "liblogging-stdlog", "MESSAGE": " [origin software=\"rsyslogd\" swVersion=\"8.24.0\" x-pid=\"737\" x-info=\"http:\/\/www.rsyslog.com\"] rsyslogd was HUPed", "_PID": "737", "_COMM": "rsyslogd", "_EXE": "\/usr\/sbin\/rsyslogd", "_CMDLINE": "\/usr\/sbin\/rsyslogd -n", "_SYSTEMD_CGROUP": "\/system.slice\/rsyslog.service", "_SYSTEMD_UNIT": "rsyslog.service", "_SYSTEMD_INVOCATION_ID": "18b9a8b27f9143728adef972db7b394c", "_SOURCE_REALTIME_TIMESTAMP": "1493397702058441", "msg": "[origin software=\"rsyslogd\" swVersion=\"8.24.0\" x-pid=\"737\" x-info=\"http:\/\/www.rsyslog.com\"] rsyslogd was HUPed" }
[....]

All the info is stored with the same keys/values that journald uses.

I developed a utility (called clp) which allows querying the log by key, filtering by boot number, by date, and so on.

For example, to show all the log entries related to rsyslog:

$ clp log -t full-details _SYSTEMD_CGROUP=/system.slice/rsyslog.service 

2017-04-21 19:12:29.579748 MESSAGE= [origin software="rsyslogd" swVersion="8.24.0" x-pid="804" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
                           PRIORITY=6
                           SYSLOG_FACILITY=23
                           SYSLOG_IDENTIFIER=liblogging-stdlog
                           _BOOT_ID=d77198380c9344248e01166fbd8d60df
                           _CAP_EFFECTIVE=3fffffffff
                           _CMDLINE=/usr/sbin/rsyslogd -n
                           _COMM=rsyslogd
                           _EXE=/usr/sbin/rsyslogd
                           _GID=0
                           _HOSTNAME=venice.bhome
                           _LOGFILEINITLINE=2017-04-21T19:12:29.579768+02:00 venice liblogging-stdlog: 
                           _LOGFILELINENUMBER=1
                           _LOGFILENAME=/var/log/cee.log.7.gz
                           _LOGFILETIMESTAMP=1492794749579768
                           _MACHINE_ID=e84907d099904117b355a99c98378dca
                           _PID=804
                           _SOURCE_REALTIME_TIMESTAMP=1492794749579748
                           _SYSTEMD_CGROUP=/system.slice/rsyslog.service
                           _SYSTEMD_INVOCATION_ID=8f9cb6c871be4158a3ccb374f4323027
                           _SYSTEMD_SLICE=system.slice
                           _SYSTEMD_UNIT=rsyslog.service
                           _TRANSPORT=syslog
                           _UID=0
                           msg=[origin software="rsyslogd" swVersion="8.24.0" x-pid="804" x-info="http://www.rsyslog.com"] rsyslogd was HUPed

2017-04-21 19:12:29.669637 MESSAGE= [origin software="rsyslogd" swVersion="8.24.0" x-pid="804" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
                           PRIORITY=6
                           SYSLOG_FACILITY=23
                           SYSLOG_IDENTIFIER=liblogging-stdlog
                           _BOOT_ID=d77198380c9344248e01166fbd8d60df
                           _CAP_EFFECTIVE=3fffffffff
                           _CMDLINE=/usr/sbin/r
[...]


> There's a bunch of extra metadata in the journal.
> And then also filtering based on that metadata is useful rather than
> being limited to grep on a syslog file. Which, you know, it's fine for
> many use cases. 

With imjournal you don't lose any metadata or data. Unfortunately there is no ready-made utility to query the log; in my case I developed my own.

> I guess I'm just interested in whether there's an
> enhancement that can be done to make journals more compatible with
> Btrfs or vice versa. It's not a huge problem anyway.

I am still inclined to think that an append-only text file format would be better than the pseudo-database which journald uses. However, I have to collect some data to be sure that the performance would not be worse. To do that I need some megabytes of logs, which I don't have because I don't use the journald file format. Would someone share their logs with me :-) ?

> 
> 
>>
>> The thing became interesting when I discovered that searching in an rsyslog file is faster than journalctl (on rotational media). Unfortunately I don't have any data to support this.
> 
> 
> Yes on drives all of these scattered extents cause a lot of head
> seeking. And I also suspect it's a lot of metadata spread out
> everywhere too, to account for all of these extents. That's why they
> moved to chattr +C to make them nocow. An idea I had on systemd list
> was to automatically make the journal directory a Btrfs subvolume,
> similar to how systemd already creates a /var/lib/machines subvolume
> for nspawn containers. This prevents the journals from being caught up
> in a snapshot of the parent subvolume that typically contains the
> journals (root fs). There's no practical use I can think of for
> snapshotting logs. You'd really want the logs to always be linear,
> contiguous, and never get rolled back. Even if something in the system
> does get rolled back, you'd want the logs to show that and continue
> on, rather than being rolled back themselves.
> 
> So the super simple option would be continue with +C on journals, and
> then a separate subvolume to prevent COW from ever happening
> inadvertently.
> 
> The same behavior happens with NTFS in qcow2 files. They quickly end
> up with 100,000+ extents unless set nocow. It's like the worst case
> scenario.
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: btrfs, journald logs, fragmentation, and fallocate
  2017-04-28 18:54     ` Goffredo Baroncelli
@ 2017-04-28 19:39       ` Peter Grandi
  2017-04-28 19:59         ` Chris Murphy
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Grandi @ 2017-04-28 19:39 UTC (permalink / raw)
  To: Linux fs Btrfs

>> The gotcha though is there's a pile of data in the journal
>> that would never make it to rsyslogd. If you use journalctl
>> -o verbose you can see some of this.

> You can send *all the info* to rsyslogd via imjournal
> http://www.rsyslog.com/doc/v8-stable/configuration/modules/imjournal.html
> In my setup all the data is stored in JSON format in the
> /var/log/cee.log file:
> $ head /var/log/cee.log
> 2017-04-28T18:41:41.931273+02:00 venice liblogging-stdlog:
> @cee: { "PRIORITY": "6", "_BOOT_ID":
> "a86d74bab91f44dc974c76aceb97141f", "_MACHINE_ID": [ ... ]

Ahhhhhh the horror the horror, I will never be able to unsee
that. The UNIX way of doing things is truly dead.

>> The same behavior happens with NTFS in qcow2 files. They
>> quickly end up with 100,000+ extents unless set nocow.
>> It's like the worst case scenario.

In a particularly demented setup I had to decatastrophize with
great pain a Zimbra QCOW2 disk image (XFS on NFS on XFS on
RAID6) containing an ever-growing Maildir email archive that
ended up with over a million widely scattered microextents:

  http://www.sabi.co.uk/blog/1101Jan.html?110116#110116


* Re: btrfs, journald logs, fragmentation, and fallocate
  2017-04-28 17:46 ` Peter Grandi
@ 2017-04-28 19:43   ` Chris Murphy
  0 siblings, 0 replies; 16+ messages in thread
From: Chris Murphy @ 2017-04-28 19:43 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux fs Btrfs

On Fri, Apr 28, 2017 at 11:46 AM, Peter Grandi <pg@btrfs.list.sabi.co.uk> wrote:

> So there are three layers of silliness here:
>
> * Writing large files slowly to a COW filesystem and
>   snapshotting it frequently.
> * A filesystem that does delayed allocation instead of
>   allocate-ahead, and does not have psychic code.
> * Working around that by using no-COW and preallocation
>   with a fixed size regardless of snapshot frequency.
>
> The primary problem here is that there is no way to have slow
> small writes and frequent snapshots without generating small
> extents: if a file is written at a rate of 1MiB/hour and gets
> snapshot every hour the extent size will not be larger than 1MiB
> *obviously*.

Sure.

But in my example there is no snapshotting, and +C is inhibited (i.e.
I set up /etc/tmpfiles.d/journal-nocow.conf, which stops systemd's
new behavior of setting +C on journals). That's resulting in a
19,000+ fragment journal file. In fact snapshotting does not make it
worse, though. If the file is nocow, then yes, snapshotting makes it
worse than plain nocow, but no worse than cow.
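
Mechanically that's just masking the snippet systemd ships; a sketch
(the exact shipped line may differ by systemd version):

  # cat /usr/lib/tmpfiles.d/journal-nocow.conf
  h /var/log/journal - - - - +C
  # touch /etc/tmpfiles.d/journal-nocow.conf  # empty file in /etc masks it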

What I'm trying to get at is that default Btrfs behavior and
(previous) default journald behavior have a misalignment resulting in
a lot of fragmentation. Is there a better way around this than merely
setting journals to nocow *and* making sure they stay nocow by
preventing snapshotting? If there's nothing better to be done, then
I'll just re-recommend to the systemd folks that the directory
containing journals should be made a subvolume to isolate it from
inadvertent snapshotting. If people want to snapshot it anyway,
there's nothing we can do about that.



> Filesystem-level snapshots are not designed to snapshot slowly
> growing files, but to snapshot changing collections of
> files. There are harsh tradeoffs involved. Application-level
> snapshots (also known as log rotations :->) are needed for
> special cases and finer-grained policies.
>
> The secondary problem is that a fixed preallocation of 8MiB is
> good only if in between snapshots the file grows by a little
> less than 8MiB or by substantially more.

Just to be clear, none of my own examples involve journals being
snapshot. There are no shared extents for any of those files.

-- 
Chris Murphy


* Re: btrfs, journald logs, fragmentation, and fallocate
  2017-04-28 17:53 ` Peter Grandi
@ 2017-04-28 19:55   ` Chris Murphy
  2017-04-28 23:04     ` Peter Grandi
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Murphy @ 2017-04-28 19:55 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux fs Btrfs

On Fri, Apr 28, 2017 at 11:53 AM, Peter Grandi <pg@btrfs.list.sabi.co.uk> wrote:

> Well, depends, but probably the single file: it is more likely
> that the 20,000 fragments will actually be contiguous, and that
> there will be less metadata IO than for 40,000 separate journal
> files.

You can see from the examples I posted that these extents are all
over the place; they're not contiguous at all. 4K here, 4K there, 4K
over there, back to 4K here next to this one, 4K over there...12K
over there, 500K unwritten, 4K over there. This seems not so
consequential on SSD; at least if it impacts performance, it's not so
bad that I care. On a hard drive, it's totally noticeable. And that's
why journald went with chattr +C by default a few versions ago when
on Btrfs. And it does help *if* the parent is never snapshotted,
which on a snapshotting file system can't really be guaranteed.
Inadvertent snapshotting could be inhibited by putting the journals
in their own subvolume, though.

Anyway, it's difficult to consider Btrfs a general-purpose file
system if general-purpose workloads, like journal files, cause a
problem like wandering trees. Hence the subject of what to do about
it, which may mean both short-term and long-term answers. I can't
speak for the systemd developers, but if there's a different way to
write to the journals that'd be better for Btrfs and no worse for
ext4 and XFS, it might be considered.


-- 
Chris Murphy


* Re: btrfs, journald logs, fragmentation, and fallocate
  2017-04-28 19:39       ` Peter Grandi
@ 2017-04-28 19:59         ` Chris Murphy
  0 siblings, 0 replies; 16+ messages in thread
From: Chris Murphy @ 2017-04-28 19:59 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux fs Btrfs

On Fri, Apr 28, 2017 at 1:39 PM, Peter Grandi <pg@btrfs.list.sabi.co.uk> wrote:


> In a particularly demented setup I had to decatastrophize with
> great pain a Zimbra QCOW2 disk image (XFS on NFS on XFS on
> RAID6) containing an ever-growing Maildir email archive that
> ended up with over a million widely scattered microextents:
>
>   http://www.sabi.co.uk/blog/1101Jan.html?110116#110116

The related Btrfs thread "File system corruption, btrfsck abort"
involves 5 concurrently used VMs, with guests using ext4, NTFS, HFS+,
Btrfs, and LVM, all pointing to qcow2 files on Btrfs for backing. And
it's resulting in problems...


-- 
Chris Murphy


* Re: btrfs, journald logs, fragmentation, and fallocate
  2017-04-28 17:41   ` Chris Murphy
  2017-04-28 18:54     ` Goffredo Baroncelli
@ 2017-04-28 20:14     ` Adam Borowski
  2017-04-29 10:46       ` Peter Grandi
  1 sibling, 1 reply; 16+ messages in thread
From: Adam Borowski @ 2017-04-28 20:14 UTC (permalink / raw)
  To: Chris Murphy, linux-btrfs

On Fri, Apr 28, 2017 at 11:41:00AM -0600, Chris Murphy wrote:
> The same behavior happens with NTFS in qcow2 files. They quickly end
> up with 100,000+ extents unless set nocow. It's like the worst case
> scenario.

You should never use qcow2 on btrfs, especially if snapshots are involved.
They both do roughly the same thing, and layering fragmentation upon
fragmentation ɪꜱ ɴᴏᴛ ᴘʀᴇᴛᴛʏ.  Layering syncs is bad, too.

Instead, you can use raw files (preferably sparse unless there's both nocow
and no snapshots).  Btrfs does natively everything you'd gain from qcow2,
and does it better: you can delete the master of a cloned image, deduplicate
them, deduplicate two unrelated images; you can turn on compression, etc.

Once you pay the btrfs performance penalty, you may as well actually use its
features, which make qcow2 redundant and harmful.
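
For example, a sketch of the raw-file route (fallocate for the nocow
case; use truncate -s instead if you want a sparse image):

  # touch vm.img && chattr +C vm.img           # nocow must be set while empty
  # fallocate -l 40G vm.img                    # preallocated, stays contiguous
  # cp --reflink=always vm.img vm-clone.img    # btrfs-native cloning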


Meow!
-- 
Don't be racist.  White, amber or black, all beers should be judged based
solely on their merits.  Heck, even if occasionally a cider applies for a
beer's job, why not?
On the other hand, corpo lager is not a race.


* Re: btrfs, journald logs, fragmentation, and fallocate
  2017-04-28 19:55   ` Chris Murphy
@ 2017-04-28 23:04     ` Peter Grandi
  2017-04-29 10:30       ` Peter Grandi
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Grandi @ 2017-04-28 23:04 UTC (permalink / raw)
  To: Linux fs Btrfs


> [ ... ] these extents are all over the place; they're not
> contiguous at all. 4K here, 4K there, 4K over there, back to
> 4K here next to this one, 4K over there...12K over there, 500K
> unwritten, 4K over there. This seems not so consequential on
> SSD; [ ... ]

Indeed there were recent reports that the 'ssd' mount option
causes that, IIRC by Hans van Kranenburg (around 2017-04-17),
who also noticed issues with the wandering trees in certain
situations (around 2017-04-08).


* RE: btrfs, journald logs, fragmentation, and fallocate
  2017-04-28 17:05 ` Goffredo Baroncelli
  2017-04-28 17:41   ` Chris Murphy
@ 2017-04-29  1:46   ` Paul Jones
  2017-04-29  4:16   ` Duncan
  2 siblings, 0 replies; 16+ messages in thread
From: Paul Jones @ 2017-04-29  1:46 UTC (permalink / raw)
  To: kreijack; +Cc: linux-btrfs


> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org [mailto:linux-btrfs-
> owner@vger.kernel.org] On Behalf Of Goffredo Baroncelli
> Sent: Saturday, 29 April 2017 3:05 AM
> To: Chris Murphy <lists@colorremedies.com>
> Cc: Btrfs BTRFS <linux-btrfs@vger.kernel.org>
> Subject: Re: btrfs, journald logs, fragmentation, and fallocate
> 
> 
> In the past I faced the same problems; I collected some data here:
> http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html
> Unfortunately the journald file layout is very bad: first the data is
> written (appended), then the index fields are updated. These indexes
> live just after the last write, so fragmentation is unavoidable.

Perhaps a better idea for COW filesystems is to store the index in a separate file, and/or rewrite the last 1MB block (or part thereof) of the data file every time data is appended? That way the data file will use 1MB extents and hopefully avoid ridiculous amounts of metadata.


Paul.


* Re: btrfs, journald logs, fragmentation, and fallocate
  2017-04-28 17:05 ` Goffredo Baroncelli
  2017-04-28 17:41   ` Chris Murphy
  2017-04-29  1:46   ` Paul Jones
@ 2017-04-29  4:16   ` Duncan
  2 siblings, 0 replies; 16+ messages in thread
From: Duncan @ 2017-04-29  4:16 UTC (permalink / raw)
  To: linux-btrfs

Goffredo Baroncelli posted on Fri, 28 Apr 2017 19:05:21 +0200 as
excerpted:

> After some thinking I adopted a different strategy: I use journald as
> a collector, then forward all the logs to rsyslogd, which uses an
> append-only format. Journald never writes on the root filesystem,
> only in tmp.

Great minds think alike. =:^)

Only here it's syslog-ng that does the permanent writes.

I just couldn't see journald's crazy (for btrfs) write pattern going to 
permanent storage.

And AFAIK, journald has no pre-write filtering mechanism at all, only
post-write display-time filtering, so even "log-spam" that I don't
want/need logged gets written to it. Meanwhile, if I see something
spamming continuously (I run git kernels and kde, and do get such
spammers occasionally) I set up a syslog-ng spam filter to kill it,
so it never actually gets written to permanent storage at all.
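
Such a filter is only a couple of lines of syslog-ng config, roughly
(a sketch; source name as in the stock config, and the patterns are
placeholders for whatever happens to be spamming):

  filter f_spam { program("plasmashell") and message("QXcbConnection"); };
  log { source(s_src); filter(f_spam); flags(final); };

A log path with a filter but no destination, plus flags(final),
simply swallows whatever matches.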

But the tmpfs journals and btrfs traditional logs give me the best of
both worlds: per-boot journals with all the extra metadata, the last
ten journal entries when I do systemctl status on a unit, etc., and a
nicely filtered and ordered multi-boot log that I can use traditional
text-based log-administration tools on.

The only part of it I'm not happy with is that journald apparently can't 
keep separate user and system journals when set to temporary only -- 
everything goes to the system journal.  Which eventually means that much 
of the stdout/stderr debugging spew that kde-based apps like to spew out 
ends up in the system journal and (would be in the) log.  But that's a 
journald "documented bug-feature", and I can and do syslog-ng filter it 
before it actually hits the written system log (or console log display).

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: btrfs, journald logs, fragmentation, and fallocate
  2017-04-28 23:04     ` Peter Grandi
@ 2017-04-29 10:30       ` Peter Grandi
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Grandi @ 2017-04-29 10:30 UTC (permalink / raw)
  To: Linux fs Btrfs

>> [ ... ] these extents are all over the place; they're not
>> contiguous at all. 4K here, 4K there, 4K over there, back to
>> 4K here next to this one, 4K over there...12K over there, 500K
>> unwritten, 4K over there. This seems not so consequential on
>> SSD; [ ... ]

> Indeed there were recent reports that the 'ssd' mount option
> causes that, IIRC by Hans van Kranenburg [ ... ]

The report included news that "sometimes" the 'ssd' option is
automatically switched on at mount even on hard disks. I had
promised to put a summary of the issue on the Btrfs wiki, but
I regret that I haven't yet done that.


* Re: btrfs, journald logs, fragmentation, and fallocate
  2017-04-28 20:14     ` Adam Borowski
@ 2017-04-29 10:46       ` Peter Grandi
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Grandi @ 2017-04-29 10:46 UTC (permalink / raw)
  To: Linux fs Btrfs

> [ ... ] Instead, you can use raw files (preferably sparse unless
> there's both nocow and no snapshots). Btrfs does natively everything
> you'd gain from qcow2, and does it better: you can delete the master
> of a cloned image, deduplicate them, deduplicate two unrelated images;
> you can turn on compression, etc.

Uhmmmmm, I understand this argument in the general case (not
specifically as to QCOW2 images), and it has some merit, but it is
"controversial", as there are two counterarguments:

* Application-specific file formats can better match application-
  specific requirements.
* Putting advanced functionality into the filesystem code makes it
  more complex and less robust, and Btrfs is a bit of a major example
  of the consequences. I count compression and deduplication among
  the things that I reckon make a filesystem too complex.

As to snapshots, I draw a distinction between filetree snapshots and
file snapshots: the first clones a tree as of the snapshot moment,
and it is a system-management feature; the second provides per-file
update rollback. One sort of implies the other, but using the
per-file rollback *systematically*, that is, as a feature an
application can rely on, seems a bit dangerous to me.

> Once you pay the btrfs performance penalty,

Uhmmmmmmmmmmm, Btrfs has a small or negative performance penalty as a
general-purpose filesystem, and many (more or less well-conceived)
tests show it performs up there with the best. The only two real
costs I attribute to it are the huge CPU cost of doing checksumming
all the time, but that's unavoidable if one wants checksumming, and
the fact that checksumming usually requires metadata duplication,
that is, at least the 'dup' profile for metadata, and that is indeed
a bit expensive.

> you may as well actually use its features,

The features that I think Btrfs gives that are worth using are
checksumming, metadata duplication, and filetree snapshots.

> which make qcow2 redundant and harmful.

My impression is that in almost all cases QCOW2 is harmful, because it
trades more IOPS and complexity for less disk space, and disk space is
cheap and IOPS and complexity are expensive, but of course a lot of
people know better :-). My preferred VM setup is a small essentially
read-only non-QCOW2 image for '/' and everything else mounted via NFSv4,
from the VM host itself or a NAS server, but again lots of people know
better and use multi-terabyte-sized QCOW2 images :-).

