* price to pay for nocow file bit?
@ 2015-01-07 17:43 Lennart Poettering
  2015-01-07 20:10 ` Josef Bacik
  2015-01-08 15:56 ` Zygo Blaxell
  0 siblings, 2 replies; 22+ messages in thread
From: Lennart Poettering @ 2015-01-07 17:43 UTC (permalink / raw)
  To: linux-btrfs

Heya!

Currently, systemd-journald's disk access patterns (appending to the
end of files, then updating a few pointers in the front) result in
awfully fragmented journal files on btrfs, which has a pretty
negative effect on performance when accessing them.

Now, to improve things a bit, I yesterday made a change to journald,
to issue the btrfs defrag ioctl when a journal file is rotated,
i.e. when we know that no further writes will ever be done on the
file. 

However, I wonder now if I should go one step further even, and use
the equivalent of "chattr +C" (i.e. nocow) on all journal files. I am
wondering what price I would precisely have to pay for
that. Judging by this earlier thread:

        http://www.spinics.net/lists/linux-btrfs/msg33134.html

it's mostly about data integrity, which is something I can live with,
given the conservative write patterns of journald, and the fact that
we do our own checksumming and careful data validation. I mean, if
btrfs in this mode provides no worse data integrity semantics than
ext4 I am fully fine with losing this feature for these files.
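
For concreteness, the programmatic equivalent of "chattr +C" (and what
journald would presumably do) is flipping FS_NOCOW_FL via the
FS_IOC_GETFLAGS/FS_IOC_SETFLAGS ioctls from <linux/fs.h>. A rough
sketch, with error handling trimmed and the file name made up:

    /* Not the real journald code, just a sketch: create a file and set
     * the nocow bit on it.  The flag only sticks on empty files, so it
     * has to be set before the first write. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    int main(void) {
        unsigned int flags = 0;
        int fd = open("test.journal", O_RDWR|O_CREAT|O_EXCL, 0640);

        if (fd < 0 || ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
            perror("open/FS_IOC_GETFLAGS");
            return 1;
        }
        flags |= FS_NOCOW_FL;                    /* the chattr +C bit */
        if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0)
            perror("FS_IOC_SETFLAGS");
        return 0;
    }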

Hence I am mostly interested in what else is lost if this flag is
turned on by default for all journal files journald creates: 

Does this have any effect on functionality? As I understood snapshots
still work fine for files marked like that, and so do
reflinks. Any drawback functionality-wise? Apparently file compression
support is lost if the bit is set? (which I can live with too, journal
files are internally compressed anyway)

What about performance? Do any operations get substantially slower by
setting this bit? For example, what happens if I take a snapshot of
files with this bit set and then modify the file, does this result in
a full (and hence slow) copy of the file on that occasion? 

I am trying to understand the pros and cons of turning this bit on,
before I can make this change. So far I see one big pro, but I wonder
if there's any major con I should think about?

Thanks,

Lennart


* Re: price to pay for nocow file bit?
  2015-01-07 17:43 price to pay for nocow file bit? Lennart Poettering
@ 2015-01-07 20:10 ` Josef Bacik
  2015-01-07 21:05   ` Goffredo Baroncelli
                     ` (3 more replies)
  2015-01-08 15:56 ` Zygo Blaxell
  1 sibling, 4 replies; 22+ messages in thread
From: Josef Bacik @ 2015-01-07 20:10 UTC (permalink / raw)
  To: Lennart Poettering, linux-btrfs

On 01/07/2015 12:43 PM, Lennart Poettering wrote:
> Heya!
>
> Currently, systemd-journald's disk access patterns (appending to the
> end of files, then updating a few pointers in the front) result in
> awfully fragmented journal files on btrfs, which has a pretty
> negative effect on performance when accessing them.
>

I've been wondering if mount -o autodefrag would deal with this problem 
but I haven't had the chance to look into it.

> Now, to improve things a bit, I yesterday made a change to journald,
> to issue the btrfs defrag ioctl when a journal file is rotated,
> i.e. when we know that no further writes will ever be done on the
> file.
>
> However, I wonder now if I should go one step further even, and use
> the equivalent of "chattr +C" (i.e. nocow) on all journal files. I am
> wondering what price I would precisely have to pay for
> that. Judging by this earlier thread:
>
>          http://www.spinics.net/lists/linux-btrfs/msg33134.html
>
> it's mostly about data integrity, which is something I can live with,
> given the conservative write patterns of journald, and the fact that
> we do our own checksumming and careful data validation. I mean, if
> btrfs in this mode provides no worse data integrity semantics than
> ext4 I am fully fine with losing this feature for these files.
>

Yup, it's no worse than ext4.

> Hence I am mostly interested in what else is lost if this flag is
> turned on by default for all journal files journald creates:
>
> Does this have any effect on functionality? As I understood snapshots
> still work fine for files marked like that, and so do
> reflinks. Any drawback functionality-wise? Apparently file compression
> support is lost if the bit is set? (which I can live with too, journal
> files are internally compressed anyway)
>

Yeah, no compression, no checksums.  If you do a reflink then you'll COW 
once, and the newly COWed copy will be nocow, so it'll be fine.  Same goes for 
snapshots.  So you'll likely incur some fragmentation, but less than 
before; I'd measure it to see if it's that big of a deal.

> What about performance? Do any operations get substantially slower by
> setting this bit? For example, what happens if I take a snapshot of
> files with this bit set and then modify the file, does this result in
> a full (and hence slow) copy of the file on that occasion?
>

Performance is the same.

> I am trying to understand the pros and cons of turning this bit on,
> before I can make this change. So far I see one big pro, but I wonder
> if there's any major con I should think about?
>

Nope there's no real con other than you don't get csums, but that 
doesn't really matter for you.  Thanks,

Josef


* Re: price to pay for nocow file bit?
  2015-01-07 20:10 ` Josef Bacik
@ 2015-01-07 21:05   ` Goffredo Baroncelli
  2015-01-07 22:06     ` Josef Bacik
  2015-01-08  6:30   ` Duncan
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Goffredo Baroncelli @ 2015-01-07 21:05 UTC (permalink / raw)
  To: Josef Bacik, Lennart Poettering, linux-btrfs

> 
>> I am trying to understand the pros and cons of turning this bit
>> on, before I can make this change. So far I see one big pro, but I
>> wonder if there's any major con I should think about?
>> 
> 
> Nope there's no real con other than you don't get csums, but that
> doesn't really matter for you.  Thanks,

In a btrfs RAID setup, in case of a corrupted sector, is BTRFS able to 
rebuild the sector?
I suppose not; if so, this has to be added to the cons, I think.

From my tests [1][2] I was unable to see a big difference between doing a defrag 
and setting chattr +C on the log directory. Did you get other results? If so, I am 
interested to know more.

BR
G.Baroncelli



[1] http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html
[2] http://lists.freedesktop.org/archives/systemd-devel/2014-June/020141.html


* Re: price to pay for nocow file bit?
  2015-01-07 21:05   ` Goffredo Baroncelli
@ 2015-01-07 22:06     ` Josef Bacik
  0 siblings, 0 replies; 22+ messages in thread
From: Josef Bacik @ 2015-01-07 22:06 UTC (permalink / raw)
  To: kreijack, Lennart Poettering, linux-btrfs

On 01/07/2015 04:05 PM, Goffredo Baroncelli wrote:
>>
>>> I am trying to understand the pros and cons of turning this bit
>>> on, before I can make this change. So far I see one big pro, but I
>>> wonder if there's any major con I should think about?
>>>
>>
>> Nope there's no real con other than you don't get csums, but that
>> doesn't really matter for you.  Thanks,
>
> In a btrfs RAID setup, in case of a corrupted sector, is BTRFS able to
> rebuild the sector?
> I suppose not; if so, this has to be added to the cons, I think.
>

It won't know it's corrupted, but it can rebuild if, say, you yank a drive 
and add a new one.  RAID5/RAID6 would catch corruption, of course.  Thanks,

Josef


* Re: price to pay for nocow file bit?
  2015-01-07 20:10 ` Josef Bacik
  2015-01-07 21:05   ` Goffredo Baroncelli
@ 2015-01-08  6:30   ` Duncan
  2015-01-10 12:00     ` Martin Steigerwald
  2015-01-08  8:24   ` Chris Murphy
  2015-01-08 13:30   ` Lennart Poettering
  3 siblings, 1 reply; 22+ messages in thread
From: Duncan @ 2015-01-08  6:30 UTC (permalink / raw)
  To: linux-btrfs

Josef Bacik posted on Wed, 07 Jan 2015 15:10:06 -0500 as excerpted:

>> Does this have any effect on functionality? As I understood snapshots
>> still work fine for files marked like that, and so do reflinks. Any
>> drawback functionality-wise? Apparently file compression support is
>> lost if the bit is set? (which I can live with too, journal files are
>> internally compressed anyway)
>>
>>
> Yeah, no compression, no checksums.  If you do a reflink then you'll COW
> once, and the newly COWed copy will be nocow, so it'll be fine.  Same goes for
> snapshots.  So you'll likely incur some fragmentation, but less than
> before; I'd measure it to see if it's that big of a deal.
> 
>> What about performance? Do any operations get substantially slower by
>> setting this bit? For example, what happens if I take a snapshot of
>> files with this bit set and then modify the file, does this result in a
>> full (and hence slow) copy of the file on that occasion?
>>
>>
> Performance is the same.

The on-snapshot "cow1" of an otherwise-nocow file is per-block (4096 bytes 
AFAIK), so there's still some fragmentation, but it accumulates more slowly.

The "perfect storm" situation is people doing automated per-minute 
snapshots or similar (some people go to extremes with snapper or the 
like...), in which case setting nocow often doesn't help a whole lot, 
depending on how active the file-writing is, of course.

But for something like append-plus-pointer-update-pattern log files with 
something like per-day snapshotting, nocow should at least in theory help 
quite a bit, since the write-frequency and thus the prevented cows should 
be MUCH higher than the daily snapshot and thus the forced-block-cow1s.

-

FWIW, I'm systemd on btrfs here, but I use syslog-ng for my non-volatile 
logs and have Storage=volatile in journald.conf, using journald only for 
current-session, where unit status including last-10-messages makes 
troubleshooting /so/ much easier. =:^)  Once past current-session, text 
logs are more useful to me, which is where syslog-ng comes in.  Each to 
its strength, and keeping the journals from wearing the SSDs[1] is a very 
nice bonus. =:^)

---
[1] I can and do filter what syslog-ng writes, but couldn't find a way to 
filter journald's writes, only queries/reads.  That alone saves writes: 
repeated noise that I filter out with syslog before it's ever written, 
journald would still be writing if I let it write non-volatile.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: price to pay for nocow file bit?
  2015-01-07 20:10 ` Josef Bacik
  2015-01-07 21:05   ` Goffredo Baroncelli
  2015-01-08  6:30   ` Duncan
@ 2015-01-08  8:24   ` Chris Murphy
  2015-01-08  8:35     ` Koen Kooi
  2015-01-08 13:30   ` Lennart Poettering
  3 siblings, 1 reply; 22+ messages in thread
From: Chris Murphy @ 2015-01-08  8:24 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Lennart Poettering, Btrfs BTRFS

On Wed, Jan 7, 2015 at 1:10 PM, Josef Bacik <jbacik@fb.com> wrote:
> On 01/07/2015 12:43 PM, Lennart Poettering wrote:
>>
>> Heya!
>>
>> Currently, systemd-journald's disk access patterns (appending to the
>> end of files, then updating a few pointers in the front) result in
>> awfully fragmented journal files on btrfs, which has a pretty
>> negative effect on performance when accessing them.
>>
>
> I've been wondering if mount -o autodefrag would deal with this problem but
> I haven't had the chance to look into it.

I've been using autodefrag and haven't run into journal corruptions
that I can attribute to btrfs since the last one was fixed over a year
ago. Chris Mason has suggested a preference for autodefrag for
this use case rather than chattr +C. But I don't know the time frame
for autodefrag by default; it's come up a couple of times, but it's not
the default yet.

I've found that with autodefrag, journals are less than 200 fragments,
and average between 50 and 150. Without it, this spirals into the
thousands quite quickly. Searches don't seem slower when journal files
are made of a few extents vs. ~100, but beyond several hundred, let
alone several thousand, it becomes noticeable.

A somewhat minor negative of +C: in case of RAID 1 or higher and
silent data corruption, there will be no Btrfs detection, due to the
lack of checksums, and therefore no correction. In the case where a
drive reports a read error, it is corrected, same as with md or lvm
raid1+.


-- 
Chris Murphy


* Re: price to pay for nocow file bit?
  2015-01-08  8:24   ` Chris Murphy
@ 2015-01-08  8:35     ` Koen Kooi
  0 siblings, 0 replies; 22+ messages in thread
From: Koen Kooi @ 2015-01-08  8:35 UTC (permalink / raw)
  To: linux-btrfs


Chris Murphy wrote on 08-01-15 at 09:24:
> On Wed, Jan 7, 2015 at 1:10 PM, Josef Bacik <jbacik@fb.com> wrote:
>> On 01/07/2015 12:43 PM, Lennart Poettering wrote:
>>> 
>>> Heya!
>>> 
>>> Currently, systemd-journald's disk access patterns (appending to the 
>>> end of files, then updating a few pointers in the front) result in 
>>> awfully fragmented journal files on btrfs, which has a pretty 
>>> negative effect on performance when accessing them.
>>> 
>> 
>> I've been wondering if mount -o autodefrag would deal with this problem
>> but I haven't had the chance to look into it.
> 
> I've been using autodefrag and haven't run into journal corruptions that
> I can attribute to btrfs since the last one was fixed over a year ago.
> Chris Mason has suggested a preference for autodefrag for this use
> case rather than chattr +C. But I don't know the time frame for autodefrag
> by default; it's come up a couple of times, but it's not the default yet.

Same here, no issues with using autodefrag and journals.

regards,

Koen




* Re: price to pay for nocow file bit?
  2015-01-07 20:10 ` Josef Bacik
                     ` (2 preceding siblings ...)
  2015-01-08  8:24   ` Chris Murphy
@ 2015-01-08 13:30   ` Lennart Poettering
  2015-01-08 18:24     ` Konstantinos Skarlatos
                       ` (2 more replies)
  3 siblings, 3 replies; 22+ messages in thread
From: Lennart Poettering @ 2015-01-08 13:30 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Wed, 07.01.15 15:10, Josef Bacik (jbacik@fb.com) wrote:

> On 01/07/2015 12:43 PM, Lennart Poettering wrote:
> >Heya!
> >
> >Currently, systemd-journald's disk access patterns (appending to the
> >end of files, then updating a few pointers in the front) result in
> >awfully fragmented journal files on btrfs, which has a pretty
> >negative effect on performance when accessing them.
> 
> I've been wondering if mount -o autodefrag would deal with this problem but
> I haven't had the chance to look into it.

Hmm, I am kinda interested in a solution that I can just implement in
systemd/journald now and that will then just make things work for
people suffering from the problem. I mean, I can hardly make systemd
patch the mount options of btrfs just because I place a journal file
on some fs...

Is "autodefrag" supposed to become a default one day?

Anyway, given the pros and cons I have now changed journald to set the
nocow bit on newly created journal files. When files are rotated (and
we hence know we will never ever write to them again) we try to unset
the bit again, and a defrag ioctl is invoked right after. btrfs
currently silently ignores that we unset the bit, and leaves it set,
but I figure I should try to unset it anyway, in case it learns to
honor that one day. After all, after rotating the files there's no
reason to treat the files specially anymore...
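
In code, the rotate-time sequence is roughly the following (a
simplified sketch, not the actual journald source; BTRFS_IOC_DEFRAG
comes from <linux/btrfs.h>, and the function name is invented):

    /* Simplified sketch of the rotation step: try to drop the nocow
     * bit (btrfs currently ignores this on non-empty files, but it is
     * harmless to try), then queue a defrag of the file. */
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/btrfs.h>

    static void journal_file_rotated(int fd) {
        unsigned int flags;

        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) >= 0) {
            flags &= ~FS_NOCOW_FL;
            (void) ioctl(fd, FS_IOC_SETFLAGS, &flags);
        }
        /* returns once the request is queued, not when the I/O is done */
        (void) ioctl(fd, BTRFS_IOC_DEFRAG, NULL);
    }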

I'll keep an eye on this, and see if I still get user complaints about
it. Should autodefrag become default eventually we can get rid of this
code in journald again.

One question regarding the btrfs defrag ioctl: playing around with it,
it appears to be asynchronous: the defrag request is simply queued and
the ioctl returns immediately. Which is great for my use case. However,
I was wondering if it was always async like this? I googled a bit, and
found reports that defrag might take a while, but I am not sure if
those reports were about the ioctl taking so long, or the effect of
defrag actually hitting the disk... 

Lennart

-- 
Lennart Poettering, Red Hat


* Re: price to pay for nocow file bit?
  2015-01-07 17:43 price to pay for nocow file bit? Lennart Poettering
  2015-01-07 20:10 ` Josef Bacik
@ 2015-01-08 15:56 ` Zygo Blaxell
  2015-01-08 16:53   ` Lennart Poettering
  1 sibling, 1 reply; 22+ messages in thread
From: Zygo Blaxell @ 2015-01-08 15:56 UTC (permalink / raw)
  To: Lennart Poettering; +Cc: linux-btrfs


On Wed, Jan 07, 2015 at 06:43:15PM +0100, Lennart Poettering wrote:
> Heya!
> 
> Currently, systemd-journald's disk access patterns (appending to the
> end of files, then updating a few pointers in the front) result in
> awfully fragmented journal files on btrfs, which has a pretty
> negative effect on performance when accessing them.
> 
> Now, to improve things a bit, I yesterday made a change to journald,
> to issue the btrfs defrag ioctl when a journal file is rotated,
> i.e. when we know that no further writes will ever be done on the
> file. 
> 
> However, I wonder now if I should go one step further even, and use
> the equivalent of "chattr +C" (i.e. nocow) on all journal files. I am
> wondering what price I would precisely have to pay for
> that. Judging by this earlier thread:
> 
>         http://www.spinics.net/lists/linux-btrfs/msg33134.html
> 
> it's mostly about data integrity, which is something I can live with,
> given the conservative write patterns of journald, and the fact that
> we do our own checksumming and careful data validation. I mean, if
> btrfs in this mode provides no worse data integrity semantics than
> ext4 I am fully fine with losing this feature for these files.

This sounds to me like a job for fallocate with FALLOC_FL_KEEP_SIZE.
This would work on ext4, xfs, and others, and provide the same benefit
(or even better) without filesystem-specific code.  journald would
preallocate a contiguous chunk past the end of the file for appends, and
on btrfs the first write to each block will not be COWed or compressed
(I'm hand-waving away some details here related to small writes, file
tails, and inline storage, but the end result is the same).  If there's a
configured target size for journals then allocate that amount; otherwise,
double the allocated size each time the visible file size reaches a power
of two so that the number of fragments is logarithmic over file size.
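
A minimal sketch of that doubling scheme, assuming no configured target
size (names and the starting chunk size are invented; the caller would
track how much has been preallocated so far):

    /* Keep a contiguous preallocation past EOF, doubling it whenever
     * the visible size catches up.  FALLOC_FL_KEEP_SIZE means st_size
     * stays unchanged, so readers never see the preallocated tail. */
    #define _GNU_SOURCE
    #include <fcntl.h>

    #define INITIAL_CHUNK (4 * 1024 * 1024)

    static int ensure_allocated(int fd, off_t *allocated, off_t needed) {
        off_t target = *allocated > 0 ? *allocated : INITIAL_CHUNK;

        while (target < needed)
            target *= 2;               /* fragments grow as log(size) */
        if (target == *allocated)
            return 0;                  /* enough preallocated already */
        if (fallocate(fd, FALLOC_FL_KEEP_SIZE, *allocated,
                      target - *allocated) < 0)
            return -1;
        *allocated = target;
        return 0;
    }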

This should get you what you want without all the dangerous messing around
with data integrity controls and defragmentation.  Defragmentation has a
number of negative side-effects of its own:  it searches for free space
aggressively and holds locks that can block writes for a long time (I've
learned the hard way that this can be over 20 minutes for a 1GB file, long
enough to trigger hardware watchdog resets).  There are some other good
reasons to never defragment, but they don't arise in journald's use cases.

I, for one, use btrfs scrub to detect data corruption that occurs during
early stages of disk failure.  I'd object strongly to applications
randomly turning off data integrity features without being explicitly
configured to do so, especially those that do most of the writing.
It would create areas of the disk that are blind spots when testing for
storage corruption errors, and in journald's case those blind spots would
be among the most significant sources of data about storage corruption.

I don't really care if applications can survive corrupted data--as the
owner of the storage, I need to be aware that storage-level corruption is
happening.  I don't want to have to test different areas of the filesystem
with a dozen different application-specific tools.  That particular
insanity is one of the reasons why I now use btrfs and not ext4.

> Hence I am mostly interested in what else is lost if this flag is
> turned on by default for all journal files journald creates: 
> 
> Does this have any effect on functionality? As I understood snapshots
> still work fine for files marked like that, and so do
> reflinks. Any drawback functionality-wise? Apparently file compression
> support is lost if the bit is set? (which I can live with too, journal
> files are internally compressed anyway)
> 
> What about performance? Do any operations get substantially slower by
> setting this bit? For example, what happens if I take a snapshot of
> files with this bit set and then modify the file, does this result in
> a full (and hence slow) copy of the file on that occasion? 
> 
> I am trying to understand the pros and cons of turning this bit on,
> before I can make this change. So far I see one big pro, but I wonder
> if there's any major con I should think about?
> 
> Thanks,
> 
> Lennart



* Re: price to pay for nocow file bit?
  2015-01-08 15:56 ` Zygo Blaxell
@ 2015-01-08 16:53   ` Lennart Poettering
  2015-01-08 18:36     ` Zygo Blaxell
                       ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Lennart Poettering @ 2015-01-08 16:53 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On Thu, 08.01.15 10:56, Zygo Blaxell (ce3g8jdj@umail.furryterror.org) wrote:

> On Wed, Jan 07, 2015 at 06:43:15PM +0100, Lennart Poettering wrote:
> > Heya!
> > 
> > Currently, systemd-journald's disk access patterns (appending to the
> > end of files, then updating a few pointers in the front) result in
> > awfully fragmented journal files on btrfs, which has a pretty
> > negative effect on performance when accessing them.
> > 
> > Now, to improve things a bit, I yesterday made a change to journald,
> > to issue the btrfs defrag ioctl when a journal file is rotated,
> > i.e. when we know that no further writes will ever be done on the
> > file. 
> > 
> > However, I wonder now if I should go one step further even, and use
> > the equivalent of "chattr +C" (i.e. nocow) on all journal files. I am
> > wondering what price I would precisely have to pay for
> > that. Judging by this earlier thread:
> > 
> >         http://www.spinics.net/lists/linux-btrfs/msg33134.html
> > 
> > it's mostly about data integrity, which is something I can live with,
> > given the conservative write patterns of journald, and the fact that
> > we do our own checksumming and careful data validation. I mean, if
> > btrfs in this mode provides no worse data integrity semantics than
> > ext4 I am fully fine with losing this feature for these files.
> 
> This sounds to me like a job for fallocate with FALLOC_FL_KEEP_SIZE.

We already use fallocate(), but this is not enough on cow file
systems. With fallocate() you can certainly reduce fragmentation when
appending things to a file. But on a COW file system this will help
little if we change things in the beginning of the file, since COW
means that it will then make a copy of those blocks and alter the
copy, but leave the original version unmodified. And if we do that all
the time the files get heavily fragmented, even though all the blocks
we modify have been fallocate()d initially...

> This would work on ext4, xfs, and others, and provide the same benefit
> (or even better) without filesystem-specific code.  journald would
> preallocate a contiguous chunk past the end of the file for appends,
> and

That's precisely what we do. But journald's write pattern is not
purely appending to files, it's "append something to the end, then
link it up in the beginning". And for the "append" part we are
fine with fallocate(). It's the "link up" part that completely fucks
up fragmentation so far.
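
Schematically, the pattern looks something like this (a much-simplified
sketch; the struct fields are invented for illustration, the real
journal header is more complex):

    /* Append an entry at the tail, then update the header at offset 0.
     * On btrfs it is that second pwrite() that gets COWed to a new
     * location every time. */
    #include <stdint.h>
    #include <unistd.h>

    struct header { uint64_t n_entries, tail; };  /* tail starts past the header */

    static int append_and_link(int fd, struct header *h,
                               const void *entry, size_t len) {
        if (pwrite(fd, entry, len, h->tail) < 0)
            return -1;                 /* the append part: no problem */
        h->n_entries++;
        h->tail += len;
        /* the "link up" part in front: COWed on every single update */
        return pwrite(fd, h, sizeof(*h), 0) < 0 ? -1 : 0;
    }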

Lennart

-- 
Lennart Poettering, Red Hat


* Re: price to pay for nocow file bit?
  2015-01-08 13:30   ` Lennart Poettering
@ 2015-01-08 18:24     ` Konstantinos Skarlatos
  2015-01-08 18:48       ` Goffredo Baroncelli
  2015-01-09 15:52     ` David Sterba
  2015-01-11 20:39     ` Chris Murphy
  2 siblings, 1 reply; 22+ messages in thread
From: Konstantinos Skarlatos @ 2015-01-08 18:24 UTC (permalink / raw)
  To: Lennart Poettering, Josef Bacik; +Cc: linux-btrfs

On 8/1/2015 3:30 PM, Lennart Poettering wrote:
> On Wed, 07.01.15 15:10, Josef Bacik (jbacik@fb.com) wrote:
>
>> On 01/07/2015 12:43 PM, Lennart Poettering wrote:
>>> Heya!
>>>
>>> Currently, systemd-journald's disk access patterns (appending to the
>>> end of files, then updating a few pointers in the front) result in
>>> awfully fragmented journal files on btrfs, which has a pretty
>>> negative effect on performance when accessing them.
>> I've been wondering if mount -o autodefrag would deal with this problem but
>> I haven't had the chance to look into it.
> Hmm, I am kinda interested in a solution that I can just implement in
> systemd/journald now and that will then just make things work for
> people suffering from the problem. I mean, I can hardly make systemd
> patch the mount options of btrfs just because I place a journal file
> on some fs...
>
> Is "autodefrag" supposed to become a default one day?
>
> Anyway, given the pros and cons I have now changed journald to set the
> nocow bit on newly created journal files. When files are rotated (and
> we hence know we will never ever write to them again) we try to unset
> the bit again, and a defrag ioctl is invoked right after. btrfs
> currently silently ignores that we unset the bit, and leaves it set,
> but I figure I should try to unset it anyway, in case it learns to
> honor that one day. After all, after rotating the files there's no
> reason to treat the files specially anymore...
Can this behaviour be optional? I don't mind some fragmentation if I can 
keep having checksums and the ability for RAID 1 to repair those files.

> I'll keep an eye on this, and see if I still get user complaints about
> it. Should autodefrag become default eventually we can get rid of this
> code in journald again.
>
> One question regarding the btrfs defrag ioctl: playing around with it,
> it appears to be asynchronous: the defrag request is simply queued and
> the ioctl returns immediately. Which is great for my use case. However,
> I was wondering if it was always async like this? I googled a bit, and
> found reports that defrag might take a while, but I am not sure if
> those reports were about the ioctl taking so long, or the effect of
> defrag actually hitting the disk...
>
> Lennart
>



* Re: price to pay for nocow file bit?
  2015-01-08 16:53   ` Lennart Poettering
@ 2015-01-08 18:36     ` Zygo Blaxell
  2015-01-09 15:41       ` David Sterba
  2015-01-08 20:42     ` Roger Binns
  2015-01-15 19:06     ` Chris Mason
  2 siblings, 1 reply; 22+ messages in thread
From: Zygo Blaxell @ 2015-01-08 18:36 UTC (permalink / raw)
  To: Lennart Poettering; +Cc: linux-btrfs


On Thu, Jan 08, 2015 at 05:53:21PM +0100, Lennart Poettering wrote:
> On Thu, 08.01.15 10:56, Zygo Blaxell (ce3g8jdj@umail.furryterror.org) wrote:
> 
> > On Wed, Jan 07, 2015 at 06:43:15PM +0100, Lennart Poettering wrote:
> > > Heya!
> > > 
> > > Currently, systemd-journald's disk access patterns (appending to the
> > > end of files, then updating a few pointers in the front) result in
> > > awfully fragmented journal files on btrfs, which has a pretty
> > > negative effect on performance when accessing them.
> > > 
> > > Now, to improve things a bit, I yesterday made a change to journald,
> > > to issue the btrfs defrag ioctl when a journal file is rotated,
> > > i.e. when we know that no further writes will ever be done on the
> > > file. 
> > > 
> > > However, I wonder now if I should go one step further even, and use
> > > the equivalent of "chattr +C" (i.e. nocow) on all journal files. I am
> > > wondering what price I would precisely have to pay for
> > > that. Judging by this earlier thread:
> > > 
> > >         http://www.spinics.net/lists/linux-btrfs/msg33134.html
> > > 
> > > it's mostly about data integrity, which is something I can live with,
> > > given the conservative write patterns of journald, and the fact that
> > > we do our own checksumming and careful data validation. I mean, if
> > > btrfs in this mode provides no worse data integrity semantics than
> > > ext4 I am fully fine with losing this feature for these files.
> > 
> > This sounds to me like a job for fallocate with FALLOC_FL_KEEP_SIZE.
> 
> We already use fallocate(), but this is not enough on cow file
> systems. With fallocate() you can certainly reduce fragmentation when
> appending things to a file. But on a COW file system this will help
> little if we change things in the beginning of the file, since COW
> means that it will then make a copy of those blocks and alter the
> copy, but leave the original version unmodified. And if we do that all
> the time the files get heavily fragmented, even though all the blocks
> we modify have been fallocate()d initially...

Hmmm...it seems the handwaving about tail-packing that I was previously
ignoring is important after all.

A few quick tests with filefrag show that btrfs isn't doing full
tail-packing, only small file allocation (i.e. files smaller than 4096
bytes get stored inline, and nothing else does, not even sparse files
with a single 1-byte extent at offset != 0).  Thus the inline storage
avoids fragmentation only to the minimum extent possible.

Short appends to the end of the file effectively become modifications
of the last block of the file.  That triggers CoW on the append, and if
we're doing lots of tiny writes the file becomes extremely fragmented
(exactly the worst case of one fragment per block).  A mix of big and
small appends seems to use fallocated space for those writes that cover
complete blocks, which is arguably worse than not fallocating at all.

So fallocate will not help until btrfs learns to do tail-packing, or
some other way to avoid this problem.
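
This is easy to reproduce with a small test (a sketch, file name
invented); run filefrag on the result and compare against writing the
same data in one large write:

    /* Many tiny synced appends into fallocate()d space.  Each fsync()
     * forces the partially filled last block out again, and each
     * rewrite of that block is COWed to a new location. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
        char buf[100] = { 0 };
        int fd = open("testfile", O_RDWR|O_CREAT|O_TRUNC, 0644);

        if (fd < 0)
            return 1;
        fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 16 * 1024 * 1024);
        for (int i = 0; i < 10000; i++) {
            write(fd, buf, sizeof(buf));   /* tiny append */
            fsync(fd);                     /* force the COW out */
        }
        close(fd);
        return 0;
    }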

> > This would work on ext4, xfs, and others, and provide the same benefit
> > (or even better) without filesystem-specific code.  journald would
> > preallocate a contiguous chunk past the end of the file for appends,
> > and
> 
> That's precisely what we do. But journald's write pattern is not
> purely appending to files, it's "append something to the end, then
> link it up in the beginning". And for the "append" part we are
> fine with fallocate(). It's the "link up" part that completely fucks
> up fragmentation so far.

Wrong theory but same result.  The writes at the beginning just keep
replacing a single extent over and over, which has a worst-case effect
of adding a single fragment to the beginning of a file that would not
otherwise be fragmented.  The appends are causing fragmentation all
by themselves.  :-P

> Lennart
> 
> -- 
> Lennart Poettering, Red Hat



* Re: price to pay for nocow file bit?
  2015-01-08 18:24     ` Konstantinos Skarlatos
@ 2015-01-08 18:48       ` Goffredo Baroncelli
  0 siblings, 0 replies; 22+ messages in thread
From: Goffredo Baroncelli @ 2015-01-08 18:48 UTC (permalink / raw)
  To: Konstantinos Skarlatos, Lennart Poettering, Josef Bacik; +Cc: linux-btrfs

On 2015-01-08 19:24, Konstantinos Skarlatos wrote:
>> Anyway, given the pros and cons I have now changed journald to set
>> the nocow bit on newly created journal files. When files are
>> rotated (and we hence know we will never ever write to them again)
>> we try to unset the bit again, and a defrag ioctl is invoked right
>> after. btrfs currently silently ignores that we unset the bit, and
>> leaves it set, but I figure I should try to unset it anyway, in
>> case it learns to honor that one day. After all, after rotating
>> the files there's no reason to treat the files specially anymore...

> Can this behaviour be optional? I don't mind some fragmentation if I
> can keep having checksums and the ability for RAID 1 to repair those
> files.

I agree with Konstantinos's request: please make this behavior optional.

BR
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: price to pay for nocow file bit?
  2015-01-08 16:53   ` Lennart Poettering
  2015-01-08 18:36     ` Zygo Blaxell
@ 2015-01-08 20:42     ` Roger Binns
  2015-01-15 19:06     ` Chris Mason
  2 siblings, 0 replies; 22+ messages in thread
From: Roger Binns @ 2015-01-08 20:42 UTC (permalink / raw)
  To: linux-btrfs


On 01/08/2015 08:53 AM, Lennart Poettering wrote:
> this will help little if we change things in the beginning of the
> file,

Have you considered changing the format so that those pointers are
stored at the end of the file, letting data always be append only?

While it is traditional to have things at the beginning as headers,
there are formats like zip where metadata is stored at the end instead,
providing other benefits.

Roger




* Re: price to pay for nocow file bit?
  2015-01-08 18:36     ` Zygo Blaxell
@ 2015-01-09 15:41       ` David Sterba
  2015-01-09 16:14         ` Zygo Blaxell
  0 siblings, 1 reply; 22+ messages in thread
From: David Sterba @ 2015-01-09 15:41 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Lennart Poettering, linux-btrfs

On Thu, Jan 08, 2015 at 01:36:21PM -0500, Zygo Blaxell wrote:
> Hmmm...it seems the handwaving about tail-packing that I was previously
> ignoring is important after all.
> 
> A few quick tests with filefrag show that btrfs isn't doing full
> tail-packing, only small file allocation (i.e. files smaller than 4096
> bytes get stored inline, and nothing else does, not even sparse files
> with a single 1-byte extent at offset != 0).  Thus the inline storage
> avoids fragmentation only to the minimum extent possible.

That's right, btrfs does not do the reiserfs-style tail packing, and
IMHO will never do that. This brings a lot more code complexity than it's
worth in the end.

> Short appends to the end of the file effectively become modifications
> of the last block of the file.  That triggers CoW on the append, and if
> we're doing lots of tiny writes the file becomes extremely fragmented
> (exactly the worst case of one fragment per block).  A mix of big and
> small appends seems to use fallocated space for those writes that cover
> complete blocks, which is arguably worse than not fallocating at all.
> 
> So fallocate will not help until btrfs learns to do tail-packing, or
> some other way to avoid this problem.
> 
> > > This would work on ext4, xfs, and others, and provide the same benefit
> > > (or even better) without filesystem-specific code.  journald would
> > > preallocate a contiguous chunk past the end of the file for appends,
> > > and
> > 
> > That's precisely what we do. But journald's write pattern is not
> > purely appending to files, it's "append something to the end, then
> > link it up in the beginning". And for the "append" part we are
> > fine with fallocate(). It's the "link up" part that completely fucks
> > up fragmentation so far.
> 
> Wrong theory but same result.  The writes at the beginning just keep
> replacing a single extent over and over, which has a worst-case effect
> of adding a single fragment to the beginning of a file that would not
> otherwise be fragmented.  The appends are causing fragmentation all
> by themselves.  :-P

OTOH, the appending write and the header rewrite happen at roughly the same
time so the actual block allocations may end up close to each other as
well. But yes, one cannot rely on that.


* Re: price to pay for nocow file bit?
  2015-01-08 13:30   ` Lennart Poettering
  2015-01-08 18:24     ` Konstantinos Skarlatos
@ 2015-01-09 15:52     ` David Sterba
  2015-01-10 10:30       ` Martin Steigerwald
  2015-01-11 20:39     ` Chris Murphy
  2 siblings, 1 reply; 22+ messages in thread
From: David Sterba @ 2015-01-09 15:52 UTC (permalink / raw)
  To: Lennart Poettering; +Cc: Josef Bacik, linux-btrfs

On Thu, Jan 08, 2015 at 02:30:36PM +0100, Lennart Poettering wrote:
> On Wed, 07.01.15 15:10, Josef Bacik (jbacik@fb.com) wrote:
> > On 01/07/2015 12:43 PM, Lennart Poettering wrote:
> > >Currently, systemd-journald's disk access patterns (appending to the
> > >end of files, then updating a few pointers in the front) result in
> > >awfully fragmented journal files on btrfs, which has a pretty
> > >negative effect on performance when accessing them.
> > 
> > I've been wondering if mount -o autodefrag would deal with this problem but
> > I haven't had the chance to look into it.
> 
> Hmm, I am kinda interested in a solution that I can just implement in
> systemd/journald now and that will then just make things work for
> people suffering from the problem. I mean, I can hardly make systemd
> patch the mount options of btrfs just because I place a journal file
> on some fs...
> 
> Is "autodefrag" supposed to become a default one day?

Maybe. The option brings a performance hit because reading a block
that's out of sequential order with its neighbors will also require
reading the neighbors. Then the group (like 8 blocks) will be written
sequentially to a new location.

It's an increase in read latency in the fragmented case and more stress
on the block allocator. Practically it's not that bad for general use,
e.g. a root partition, but for now it's still the user's decision
whether to use it or not.

> Anyway, given the pros and cons I have now changed journald to set the
> nocow bit on newly created journal files. When files are rotated (and
> we hence know we will never ever write to them again) we try to unset
> the bit again, and a defrag ioctl is invoked right after. btrfs
> currently silently ignores that we unset the bit, and leaves it set,
> but I figure I should try to unset it anyway, in case it learns to
> honor that one day. After all, after rotating the files there's no
> reason to treat the files specially anymore...
> 
> I'll keep an eye on this, and see if I still get user complaints about
> it. Should autodefrag become default eventually we can get rid of this
> code in journald again.
> 
> One question regarding the btrfs defrag ioctl: playing around with it,
> it appears to be asynchronous: the defrag request is simply queued and
> the ioctl returns immediately. Which is great for my use case. However,
> I was wondering if it was always async like this? I googled a bit, and
> found reports that defrag might take a while, but I am not sure if
> those reports were about the ioctl taking so long, or the effect of
> defrag actually hitting the disk... 

Defrag can be both sync and async; that's what the option -f is for:
schedule the file's blocks for writeback and flush them, then go to the
next file. This avoids the hit in async mode when tons of data can get
redirtied at once.


* Re: price to pay for nocow file bit?
  2015-01-09 15:41       ` David Sterba
@ 2015-01-09 16:14         ` Zygo Blaxell
  0 siblings, 0 replies; 22+ messages in thread
From: Zygo Blaxell @ 2015-01-09 16:14 UTC (permalink / raw)
  To: dsterba, Lennart Poettering, linux-btrfs


On Fri, Jan 09, 2015 at 04:41:03PM +0100, David Sterba wrote:
> On Thu, Jan 08, 2015 at 01:36:21PM -0500, Zygo Blaxell wrote:
> > Hmmm...it seems the handwaving about tail-packing that I was previously
> > ignoring is important after all.
> > 
> > A few quick tests with filefrag show that btrfs isn't doing full
> > tail-packing, only small file allocation (i.e. files smaller than 4096
> > bytes get stored inline, and nothing else does, not even sparse files
> > with a single 1-byte extent at offset != 0).  Thus the inline storage
> > avoids fragmentation only to the minimum extent possible.
> 
> That's right, btrfs does not do the reiserfs-style tail packing, and
> IMHO will never do that. This brings a lot more code complexity than it's
> worth in the end.

If the file has been fallocated past EOF, it may make sense to do the
extra work of maintaining a tail fragment in metadata until it's bigger
than a block, and therefore large enough to write to the fallocated
extent.  At least in that case the application has explicitly asked
the filesystem for more optimization than in the general append case.
Otherwise, what are fallocations past EOF for?

If the application appends 4K blocks all the time everything is fine,
but that requirement might not work for journald, and doesn't work for
rsyslog, mboxes, and many other long-running small-write use cases
that append in non-block-sized units.

On the other hand...it could be easier to handle such cases with a special
case of autodefrag--one that focuses on appends, so it can be enabled
by default earlier than the other problematic autodefrag use cases.
It may even be faster to defragment in small batches (coalescing a few
hundred blocks at a time near the end of file) than to do tail-packing
on every append, especially if metadata blocks have much more overhead
than data blocks (e.g. dup metadata with single data on spinning rust).
The fallocate would be wasted in this case, but the number of fragments
in the final file would be reasonably sane.

> > > > This would work on ext4, xfs, and others, and provide the same benefit
> > > > (or even better) without filesystem-specific code.  journald would
> > > > preallocate a contiguous chunk past the end of the file for appends,
> > > > and
> > > 
> > > That's precisely what we do. But journald's write pattern is not
> > > purely appending to files, it's "append something to the end, then
> > > link it up in the beginning". And for the "append" part we are
> > > fine with fallocate(). It's the "link up" part that completely fucks
> > > up fragmentation so far.
> > 
> > Wrong theory but same result.  The writes at the beginning just keep
> > replacing a single extent over and over, which has a worst-case effect
> > of adding a single fragment to the beginning of a file that would not
> > otherwise be fragmented.  The appends are causing fragmentation all
> > by themselves.  :-P
> 
> OTOH, the appending write and the header rewrite happen at roughly the same
> time so the actual block allocations may end up close to each other as
> well. But yes, one cannot rely on that.

The header rewrite is close to the last append, but that's not really
useful.  There will be one header near one appending write, but there are
also thousands of other appending writes separated in time and space on
the disk, even after fallocate preallocated contiguous space for the file.




* Re: price to pay for nocow file bit?
  2015-01-09 15:52     ` David Sterba
@ 2015-01-10 10:30       ` Martin Steigerwald
  0 siblings, 0 replies; 22+ messages in thread
From: Martin Steigerwald @ 2015-01-10 10:30 UTC (permalink / raw)
  To: dsterba, Lennart Poettering, Josef Bacik, linux-btrfs

On Friday, 9 January 2015 at 16:52:59, David Sterba wrote:
> On Thu, Jan 08, 2015 at 02:30:36PM +0100, Lennart Poettering wrote:
> > On Wed, 07.01.15 15:10, Josef Bacik (jbacik@fb.com) wrote:
> > > On 01/07/2015 12:43 PM, Lennart Poettering wrote:
> > > >Currently, systemd-journald's disk access patterns (appending to the
> > > >end of files, then updating a few pointers in the front) result in
> > > >awfully fragmented journal files on btrfs, which has a pretty
> > > >negative effect on performance when accessing them.
> > >
> > > I've been wondering if mount -o autodefrag would deal with this problem
> > > but
> > > I haven't had the chance to look into it.
> >
> > Hmm, I am kinda interested in a solution that I can just implement in
> > systemd/journald now and that will then just make things work for
> > people suffering from the problem. I mean, I can hardly make systemd
> > patch the mount options of btrfs just because I place a journal file
> > on some fs...
> >
> > Is "autodefrag" supposed to become a default one day?
> 
> Maybe. The option brings a performance hit because reading a block
> that's out of sequential order with its neighbors will also require
> reading the neighbors. Then the group (like 8 blocks) will be written
> sequentially to a new location.
> 
> It's an increase in read latency in the fragmented case and more stress
> on the block allocator. Practically it's not that bad for general use,
> e.g. a root partition, but for now it's still the user's decision
> whether to use it or not.

I am concerned about flash-based storage, which probably doesn't need it, 
and about the additional writes it causes.

And about free space fragmentation due to regular defragmenting. I read on the 
XFS mailing list more than once not to run xfs_fsr, the XFS online defrag tool, 
regularly from a cron job, as it can make free space fragmentation worse.

And given the issues BTRFS still has with free space handling (see the thread 
I started about it and kernel bug report 90401), I am wary of anything 
that could add more free space fragmentation by default, especially when 
it's not needed, like on an SSD.

I have

merkaba:/home/martin/.local/share/akonadi/db_data/akonadi> filefrag parttable.ibd
parttable.ibd: 8039 extents found

And I have had this at up to 40000 extents already. I did try manually 
defragmenting it with various options to see whether there was any effect:

None.

Same with desktop search database of KDE.

On my dual-SSD BTRFS RAID 1 setup the number of extents simply does not seem 
to matter at all, except for journalctl, where I saw some noticeable delays on 
initially calling it. But right now even there it's just about one second, 
which on the one hand is little, but on the other hand is a lot, given it's an 
SSD RAID 1.

But heck, the fragmentation of some of those files in there is abysmal 
considering the small size of the files:

merkaba:/var/log/journal/1354039e4d4bb8de4f97ac8400000004> filefrag *                         
system@00050bbcaeb23ff2-c7230ef5d29df634.journal~: 2030 extents found
system@00050be4b7106b25-a4ab21cd18c0424c.journal~: 1859 extents found
system@00050bf84d2efb2c-1e4e85dacaf1252c.journal~: 1803 extents found
system@2f7df24c6b70488fa9724b00ab6e6043-0000000000000001-00050bf84d2ae7be.journal: 
1076 extents found
system@2f7df24c6b70488fa9724b00ab6e6043-00000000001b22f7-00050bfb82b379f8.journal: 
84 extents found
system@2f7df24c6b70488fa9724b00ab6e6043-00000000001b22fb-00050bfb8657c8b0.journal: 
1036 extents found
system@2f7df24c6b70488fa9724b00ab6e6043-00000000001b2693-00050c0d8075ea4b.journal: 
1478 extents found
system@2f7df24c6b70488fa9724b00ab6e6043-00000000001b4136-00050c3782b1c527.journal: 
2 extents found
system@2f7df24c6b70488fa9724b00ab6e6043-00000000001b4137-00050c378666837a.journal: 
142 extents found
system@2f7df24c6b70488fa9724b00ab6e6043-00000000001b414c-00050c37c7883228.journal: 
574 extents found
system@5ee315765b1a4c6d9ed2fe833dec7094-0000000000010fdd-00050b56fa20f846.journal: 
2309 extents found
system.journal: 783 extents found
user-1000@cc345f87cb404df6a9588b0b1c707007-0000000000011061-00050b56fa223006.journal: 
340 extents found
user-1000@cc345f87cb404df6a9588b0b1c707007-00000000001ad624-00050ba77c734a3b.journal: 
564 extents found
user-1000@cc345f87cb404df6a9588b0b1c707007-00000000001b297c-00050c0d8077447c.journal: 
105 extents found
user-1000.journal: 133 extents found
user-120.journal: 5 extents found
user-2012.journal: 2 extents found
user-65534.journal: 222 extents found

merkaba:/var/log/journal/1354039e4d4bb8de4f97ac8400000004> du -sh * | cut -c1-72
16M     system@00050bbcaeb23ff2-c7230ef5d29df634.journal~
16M     system@00050be4b7106b25-a4ab21cd18c0424c.journal~
16M     system@00050bf84d2efb2c-1e4e85dacaf1252c.journal~
8,0M    system@2f7df24c6b70488fa9724b00ab6e6043-0000000000000001-00050bf84d
8,0M    system@2f7df24c6b70488fa9724b00ab6e6043-00000000001b22f7-00050bfb82
8,0M    system@2f7df24c6b70488fa9724b00ab6e6043-00000000001b22fb-00050bfb86
8,0M    system@2f7df24c6b70488fa9724b00ab6e6043-00000000001b2693-00050c0d80
8,0M    system@2f7df24c6b70488fa9724b00ab6e6043-00000000001b4136-00050c3782
8,0M    system@2f7df24c6b70488fa9724b00ab6e6043-00000000001b4137-00050c3786
8,0M    system@2f7df24c6b70488fa9724b00ab6e6043-00000000001b414c-00050c37c7
16M     system@5ee315765b1a4c6d9ed2fe833dec7094-0000000000010fdd-00050b56fa2
8,0M    system.journal
8,0M    user-1000@cc345f87cb404df6a9588b0b1c707007-0000000000011061-00050b5
8,0M    user-1000@cc345f87cb404df6a9588b0b1c707007-00000000001ad624-00050ba
8,0M    user-1000@cc345f87cb404df6a9588b0b1c707007-00000000001b297c-00050c0
8,0M    user-1000.journal
3,6M    user-120.journal
8,0M    user-2012.journal
8,0M    user-65534.journal

Especially when I compare that to rsyslog:

merkaba:/var/log> filefrag messages syslog kern.log
messages: 24 extents found
syslog: 3 extents found
kern.log: 31 extents found
merkaba:/var/log> filefrag messages.1 syslog.1 kern.log.1
messages.1: 67 extents found
syslog.1: 20 extents found
kern.log.1: 78 extents found


When I see this, I wonder whether it would make sense to use two files:

1. One for the sequential appending case

2. Another one for the pointers, which could even be rewritten from scratch 
each time.


On the other hand one can claim:

Non-copy-on-write filesystems cope well with that kind of random I/O workload 
inside a file, so BTRFS will have to cope well with it as well.

Then again, the way systemd writes logfiles obviously didn't take the 
copy-on-write nature of the BTRFS filesystem into account.

But MySQL, PostgreSQL and others do not do this either.

So to never break userspace, BTRFS would have to adapt. On the other hand I 
think it may be easier to adapt the applications, and I wonder how a database 
specifically designed for copy-on-write semantics would perform.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


* Re: price to pay for nocow file bit?
  2015-01-08  6:30   ` Duncan
@ 2015-01-10 12:00     ` Martin Steigerwald
  2015-01-10 12:23       ` Martin Steigerwald
  0 siblings, 1 reply; 22+ messages in thread
From: Martin Steigerwald @ 2015-01-10 12:00 UTC (permalink / raw)
  To: linux-btrfs

On Thursday, 8 January 2015 at 06:30:59, Duncan wrote:
> FWIW, I'm systemd on btrfs here, but I use syslog-ng for my non-volatile 
> logs and have Storage=volatile in journald.conf, using journald only for 
> current-session, where unit status including last-10-messages makes 
> troubleshooting /so/ much easier. =:^)  Once past current-session, text 
> logs are more useful to me, which is where syslog-ng comes in.  Each to 
> its strength, and keeping the journals from wearing the SSDs[1] is a very 
> nice bonus. =:^)

Nice, I'll try this as well.

Because while journalctl provides some nice ways to query the logs, even by field 
or time and whatnot, frankly, on my laptop, I don't care.

I have seen this setting before, but I thought, well, logs would be good to 
keep. But for the SSD-based laptop I will try volatile storage now. I will see 
whether I miss a longer history, but I had already reduced it to a 14-day 
maximum retention time anyway, because systemd used 1.1 GiB of my root 
partition for logs while rsyslog + logrotate used much less[1]. And I have not 
yet seen an immediate benefit for me here on this laptop that would justify using 
up that many resources just for logging. So for me it's a useless waste of 
resources currently. (This may be different on a server or anywhere where 
logfiles matter more, but then, when I consider some of our server VMs with 
just a 4 to 5 GiB VMDK file, journald on Debian with default settings could easily 
fill the remaining space on some of them. Which I would consider a regression.)

[1] systemd: journal is quite big compared to rsyslog output
https://bugs.debian.org/773538

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


* Re: price to pay for nocow file bit?
  2015-01-10 12:00     ` Martin Steigerwald
@ 2015-01-10 12:23       ` Martin Steigerwald
  0 siblings, 0 replies; 22+ messages in thread
From: Martin Steigerwald @ 2015-01-10 12:23 UTC (permalink / raw)
  To: linux-btrfs

On Saturday, 10 January 2015 at 13:00:23, you wrote:
> I have seen this setting before, but I thought, well, logs would be good to 
> keep. But for the SSD-based laptop I will try volatile storage now. I will
> see whether I miss a longer history, but I had already reduced it to a
> 14-day maximum retention time anyway, because systemd used 1.1 GiB of my
> root partition for logs while rsyslog + logrotate used much less[1]. And I
> have not yet seen an immediate benefit for me here on this laptop that would
> justify using up that many resources just for logging. So for me it's a
> useless waste of resources currently. (This may be different on a server or
> anywhere where logfiles matter more, but then, when I consider some of our
> server VMs with just a 4 to 5 GiB VMDK file, journald on Debian with default
> settings could easily fill the remaining space on some of them. Which I
> would consider a regression.)

Okay, scratch that.

journald is adaptive to the remaining space on the disk AFAIK.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


* Re: price to pay for nocow file bit?
  2015-01-08 13:30   ` Lennart Poettering
  2015-01-08 18:24     ` Konstantinos Skarlatos
  2015-01-09 15:52     ` David Sterba
@ 2015-01-11 20:39     ` Chris Murphy
  2 siblings, 0 replies; 22+ messages in thread
From: Chris Murphy @ 2015-01-11 20:39 UTC (permalink / raw)
  To: Lennart Poettering; +Cc: Josef Bacik, Btrfs BTRFS

On Thu, Jan 8, 2015 at 6:30 AM, Lennart Poettering
<lennart@poettering.net> wrote:
> On Wed, 07.01.15 15:10, Josef Bacik (jbacik@fb.com) wrote:
>
>> On 01/07/2015 12:43 PM, Lennart Poettering wrote:
>> >Heya!
>> >
>> >Currently, systemd-journald's disk access patterns (appending to the
>> >end of files, then updating a few pointers in the front) result in
>> >awfully fragmented journal files on btrfs, which has a pretty
>> >negative effect on performance when accessing them.
>>
>> I've been wondering if mount -o autodefrag would deal with this problem but
>> I haven't had the chance to look into it.
>
> Hmm, I am kinda interested in a solution that I can just implement in
> systemd/journald now and that will then just make things work for
> people suffering from the problem. I mean, I can hardly make systemd
> patch the mount options of btrfs just because I place a journal file
> on some fs...
>
> Is "autodefrag" supposed to become a default one day?
>
> Anyway, given the pros and cons I have now changed journald to set the
> nocow bit on newly created journal files. When files are rotated (and
> we hence know we will never ever write to them again) we try to unset
> the bit again, and a defrag ioctl is invoked right after. btrfs
> currently silently ignores that we unset the bit, and leaves it set,
> but I figure I should try to unset it anyway, in case it learns to
> honor that one day. After all, after rotating the files there's no
> reason to treat the files specially anymore...

I don't think it makes sense to unset nocow on a non-zero-byte file
any more than it makes sense to set it. The functional equivalent
that'd need to be done is:

touch system@<blah>.journal~
chattr -C system@<blah>.journal~
cp system@<blah>.journal system@<blah>.journal~

The copy won't have nocow set.

I suggest just leaving it alone. +C the /var/log/journal/ directory
before the machine-name directories are created, and then everything in
there automatically inherits +C upon creation. No need to unset or
defrag; in particular on SSDs I think it's sorta pointless excess
writing. A set-it-and-forget-it policy.


-- 
Chris Murphy


* Re: price to pay for nocow file bit?
  2015-01-08 16:53   ` Lennart Poettering
  2015-01-08 18:36     ` Zygo Blaxell
  2015-01-08 20:42     ` Roger Binns
@ 2015-01-15 19:06     ` Chris Mason
  2 siblings, 0 replies; 22+ messages in thread
From: Chris Mason @ 2015-01-15 19:06 UTC (permalink / raw)
  To: Lennart Poettering; +Cc: Zygo Blaxell, linux-btrfs



On Thu, Jan 8, 2015 at 11:53 AM, Lennart Poettering 
<lennart@poettering.net> wrote:
> On Thu, 08.01.15 10:56, Zygo Blaxell (ce3g8jdj@umail.furryterror.org) 
> wrote:
> 
>>  On Wed, Jan 07, 2015 at 06:43:15PM +0100, Lennart Poettering wrote:
>>  > Heya!
>>  >
>>  > Currently, systemd-journald's disk access patterns (appending to 
>> the
>>  > end of files, then updating a few pointers in the front) result in
>>  > awfully fragmented journal files on btrfs, which has a pretty
>>  > negative effect on performance when accessing them.
>>  >
>>  > Now, to improve things a bit, I yesterday made a change to 
>> journald,
>>  > to issue the btrfs defrag ioctl when a journal file is rotated,
>>  > i.e. when we know that no further writes will ever be done on the
>>  > file.
>>  >
>>  > However, I wonder now if I should go one step further even, and 
>> use
>>  > the equivalent of "chattr +C" (i.e. nocow) on all journal files. 
>> I am
>>  > wondering what price I would precisely have to pay for
>>  > that. Judging by this earlier thread:
>>  >
>>  >         http://www.spinics.net/lists/linux-btrfs/msg33134.html
>>  >
>>  > it's mostly about data integrity, which is something I can live 
>> with,
>>  > given the conservative write patterns of journald, and the fact 
>> that
>>  > we do our own checksumming and careful data validation. I mean, if
>>  > btrfs in this mode provides no worse data integrity semantics than
>>  > ext4 I am fully fine with losing this feature for these files.
>> 
>>  This sounds to me like a job for fallocate with FALLOC_FL_KEEP_SIZE.
> 
> We already use fallocate(), but this is not enough on cow file
> systems. With fallocate() you can certainly reduce fragmentation when
> appending things to a file. But on a COW file system this will help
> little if we change things in the beginning of the file, since COW
> means that it will then make a copy of those blocks and alter the
> copy, but leave the original version unmodified. And if we do that all
> the time the files get heavily fragmented, even though all the blocks
> we modify have been fallocate()d initially...
> 
>>  This would work on ext4, xfs, and others, and provide the same 
>> benefit
>>  (or even better) without filesystem-specific code.  journald would
>>  preallocate a contiguous chunk past the end of the file for appends,
>>  and
> 
> That's precisely what we do. But journald's write pattern is not
> purely appending to files, it's "append something to the end, then
> link it up in the beginning". And for the "append" part we are
> fine with fallocate(). It's the "link up" part that completely fucks
> up fragmentation so far.

I think a per-file autodefrag flag would help a lot here.  We've made 
some improvements for autodefrag and slowly growing log files because 
we noticed that compression ratios on slowly growing files really 
weren't very good.  The problem was we'd never have more than a single 
block to compress, so the compression code would give up and write the 
raw data.

compression + autodefrag on the other hand would take 64-128K and recow 
it down, giving very good results.

The second problem we hit was with stable page writes.  If bdflush 
decides to write the last block in the file, it's really a wasted IO 
unless the block is fully filled.  We've been experimenting with a 
patch to leave the last block out of writepages unless it's an 
fsync/O_SYNC.

I'll code up the per-file autodefrag, we've hit a few use cases that 
make sense.

-chris






Thread overview: 22+ messages
2015-01-07 17:43 price to pay for nocow file bit? Lennart Poettering
2015-01-07 20:10 ` Josef Bacik
2015-01-07 21:05   ` Goffredo Baroncelli
2015-01-07 22:06     ` Josef Bacik
2015-01-08  6:30   ` Duncan
2015-01-10 12:00     ` Martin Steigerwald
2015-01-10 12:23       ` Martin Steigerwald
2015-01-08  8:24   ` Chris Murphy
2015-01-08  8:35     ` Koen Kooi
2015-01-08 13:30   ` Lennart Poettering
2015-01-08 18:24     ` Konstantinos Skarlatos
2015-01-08 18:48       ` Goffredo Baroncelli
2015-01-09 15:52     ` David Sterba
2015-01-10 10:30       ` Martin Steigerwald
2015-01-11 20:39     ` Chris Murphy
2015-01-08 15:56 ` Zygo Blaxell
2015-01-08 16:53   ` Lennart Poettering
2015-01-08 18:36     ` Zygo Blaxell
2015-01-09 15:41       ` David Sterba
2015-01-09 16:14         ` Zygo Blaxell
2015-01-08 20:42     ` Roger Binns
2015-01-15 19:06     ` Chris Mason
