All of lore.kernel.org
* Big disk space usage difference, even after defrag, on identical data
@ 2015-04-11 19:59 Gian-Carlo Pascutto
  2015-04-13  4:04 ` Zygo Blaxell
  2015-04-13  5:06 ` Duncan
  0 siblings, 2 replies; 8+ messages in thread
From: Gian-Carlo Pascutto @ 2015-04-11 19:59 UTC (permalink / raw)
  To: linux-btrfs

Linux mozwell 3.19.0-trunk-amd64 #1 SMP Debian 3.19.1-1~exp1
(2015-03-08) x86_64 GNU/Linux
btrfs-progs v3.19.1

I have a btrfs volume that's been in use for a week or 2. It has about
~560G of uncompressible data (video files, tar.xz, git repos, ...) and
~200G of data that compresses 2:1 with LZO (PostgreSQL db).

It's split into 2 subvolumes:
ID 257 gen 6550 top level 5 path @db
ID 258 gen 6590 top level 5 path @large

and mounted like this:
/dev/sdc /srv/db btrfs rw,noatime,compress=lzo,space_cache 0 0
/dev/sdc /srv/large btrfs rw,noatime,compress=lzo,space_cache 0 0

du -skh /srv
768G    /srv

df -h
/dev/sdc        1.4T  754G  641G  55% /srv/db
/dev/sdc        1.4T  754G  641G  55% /srv/large

btrfs fi df /srv/large
Data, single: total=808.01GiB, used=749.36GiB
System, DUP: total=8.00MiB, used=112.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=3.50GiB, used=1.87GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B

So that's a bit bigger than perhaps expected (~750G instead of
~660G+metadata). I thought it might've been related to compress bailing
out too easily, but I've done a
btrfs fi defragment -r -v -clzo /srv/db /srv/large
and this doesn't change anything.

I recently copied this data to a new, bigger disk, and the result looks
worrying:

mount options:
/dev/sdd /mnt/large btrfs rw,noatime,compress=lzo,space_cache 0 0
/dev/sdd /mnt/db btrfs rw,noatime,compress=lzo,space_cache 0 0

btrfs fi df
Data, single: total=684.00GiB, used=683.00GiB
System, DUP: total=8.00MiB, used=96.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=3.50GiB, used=2.04GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B

df
/dev/sdd        3.7T  688G  3.0T  19% /mnt/large
/dev/sdd        3.7T  688G  3.0T  19% /mnt/db

du
767G    /mnt

That's a 66G difference for the same data with the same compress option.
The used size here is much more in line with what I'd have expected
given the nature of the data.

I would think that compression differences or things like fragmentation
or bookending for modified files shouldn't affect this, because the
first filesystem has been defragmented/recompressed and didn't shrink.

So what can explain this? Where did the 66G go?

-- 
GCP

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Big disk space usage difference, even after defrag, on identical data
  2015-04-11 19:59 Big disk space usage difference, even after defrag, on identical data Gian-Carlo Pascutto
@ 2015-04-13  4:04 ` Zygo Blaxell
  2015-04-13  8:07   ` Duncan
  2015-04-13 11:32   ` Gian-Carlo Pascutto
  2015-04-13  5:06 ` Duncan
  1 sibling, 2 replies; 8+ messages in thread
From: Zygo Blaxell @ 2015-04-13  4:04 UTC (permalink / raw)
  To: Gian-Carlo Pascutto; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 6578 bytes --]

On Sat, Apr 11, 2015 at 09:59:50PM +0200, Gian-Carlo Pascutto wrote:
> Linux mozwell 3.19.0-trunk-amd64 #1 SMP Debian 3.19.1-1~exp1
> (2015-03-08) x86_64 GNU/Linux
> btrfs-progs v3.19.1
> 
> I have a btrfs volume that's been in use for a week or 2. It has about
> ~560G of uncompressible data (video files, tar.xz, git repos, ...) and
> ~200G of data that compresses 2:1 with LZO (PostgreSQL db).
> 
> It's split into 2 subvolumes:
> ID 257 gen 6550 top level 5 path @db
> ID 258 gen 6590 top level 5 path @large
> 
> and mounted like this:
> /dev/sdc /srv/db btrfs rw,noatime,compress=lzo,space_cache 0 0
> /dev/sdc /srv/large btrfs rw,noatime,compress=lzo,space_cache 0 0
> 
> du -skh /srv
> 768G    /srv
> 
> df -h
> /dev/sdc        1.4T  754G  641G  55% /srv/db
> /dev/sdc        1.4T  754G  641G  55% /srv/large
> 
> btrfs fi df /srv/large
> Data, single: total=808.01GiB, used=749.36GiB
> System, DUP: total=8.00MiB, used=112.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, DUP: total=3.50GiB, used=1.87GiB
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> So that's a bit bigger than perhaps expected (~750G instead of
> ~660G+metadata). I thought it might've been related to compress bailing
> out too easily, but I've done a
> btrfs fi defragment -r -v -clzo /srv/db /srv/large
> and this doesn't change anything.
> 
> I recently copied this data to a new, bigger disk, and the result looks
> worrying:
> 
> mount options:
> /dev/sdd /mnt/large btrfs rw,noatime,compress=lzo,space_cache 0 0
> /dev/sdd /mnt/db btrfs rw,noatime,compress=lzo,space_cache 0 0
> 
> btrfs fi df
> Data, single: total=684.00GiB, used=683.00GiB
> System, DUP: total=8.00MiB, used=96.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, DUP: total=3.50GiB, used=2.04GiB
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> df
> /dev/sdd        3.7T  688G  3.0T  19% /mnt/large
> /dev/sdd        3.7T  688G  3.0T  19% /mnt/db
> 
> du
> 767G    /mnt
> 
> That's a 66G difference for the same data with the same compress option.
> The used size here is much more in line with what I'd have expected
> given the nature of the data.
> 
> I would think that compression differences or things like fragmentation
> or bookending for modified files shouldn't affect this, because the
> first filesystem has been defragmented/recompressed and didn't shrink.
> 
> So what can explain this? Where did the 66G go?

There are a few places:  the kernel may have decided your files are not
compressible and disabled compression on them (some older kernels did
this with great enthusiasm); your files might have preallocated space
from the fallocate system call (which disables compression and allocates
contiguous space, so defrag will not touch it).   'filefrag -v' can
tell you if this is happening to your files.
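For reference, a quick way to check for both conditions (the path here is just an example of a PostgreSQL data file; the exact flag wording can vary between e2fsprogs versions):

```
# Show per-extent layout and flags for a file (example path):
filefrag -v /srv/db/base/16384/16385

# In the "flags" column:
#   "encoded"   - the extent is stored compressed
#   "unwritten" - the extent was preallocated (fallocate); never compressed
```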

In practice database files take about double the amount of space they
appear to because of extent shingling.

Suppose we have a defragmented file with one extent "A" like this:

        0 MB AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 1MB

Now we overwrite about half of the blocks:

        0 MB BBBBBBBBBBBBBBBBAAAAAAAAAAAAAAAA 1MB

btrfs tracks references to the entire extent, so what is on disk now is this:

        0 MB aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAA 1MB original extent
        0 MB BBBBBBBBBBBBBBBB                 1MB new extent

The "a" are blocks from the original extent that are not visible in
the file, but remain present on disk.  In other words, this 1MB file is
now taking up 1.5MB of space.

This continues as long as any blocks of partially overwritten extents
are visible in any file (including snapshots, dedup, and clones), with
the worst case being something like this:

        0 MB BBBBBBBBBBBBBCCCCCCCCCCCCCCDDDDA 1MB

which could be like this on disk:

        0 MB aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaA 1MB first extent
        0 MB BBBBBBBBBBBBBbbb                 1MB second extent
        0 MB              CCCCCCCCCCCCCCcccc  1MB third extent
        0 MB                            DDDD  1MB fourth extent

This 1MB file takes up a little over 2MB of disk space, and there are
parts of extents A, B, and C which persist on disk but are no longer
part of any file's content.

In this case, if we wrote the last 4K of the file, we would free 1MB
of disk space by doing so:

             (extent A now deleted)
        0 MB BBBBBBBBBBBBBbbb                 1MB second extent
        0 MB              CCCCCCCCCCCCCCcccc  1MB third extent
        0 MB                            DDDD  1MB fourth extent
        0 MB                                E 1MB fifth extent

Similarly to free the "B" extent we have to overwrite all the visible
blocks, i.e. from 0 to the beginning of the "C" extent, before the
last visible block from "B" is destroyed and the entire "B" extent can
be freed.

The worst case is pretty bad:  with the worst possible overwrite pattern,
a file can occupy the square of its size on disk divided by the block
size (4K) divided by two.  That's a little under 128MB for a 1MB file,
or 128TB for a 1GB file.  Above 1GB, the scaling is linear instead of
quadratic because the extent size limit (1G) has been reached and
single-extent files are no longer possible (so a worst-case 2GB file
takes only 256TB of space instead of 512TB).
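The arithmetic above can be checked with a toy model of extent accounting (this is a simplified sketch invented for illustration, not real btrfs code; `simulate` and the write pattern are hypothetical):

```python
# Toy model of btrfs extent pinning: each write creates a new extent, and an
# old extent stays allocated as long as at least one of its blocks is still
# visible in the file.

def simulate(writes, file_blocks):
    """writes: (offset, length) block ranges applied in order.
    Returns total blocks still allocated on 'disk'."""
    visible = [None] * file_blocks     # which extent each file block shows
    extents = []                       # extent id -> length in blocks
    for off, length in writes:
        eid = len(extents)
        extents.append(length)
        for b in range(off, off + length):
            visible[b] = eid
    live = {e for e in visible if e is not None}
    return sum(extents[e] for e in live)

BLOCKS = 256                           # a 1 MiB file in 4 KiB blocks
# Worst case: overwrite an ever-shorter prefix, so every old extent stays
# pinned by its last not-yet-overwritten block.
worst = [(0, k) for k in range(BLOCKS, 0, -1)]
print(simulate(worst, BLOCKS))                        # 32896 = 256*257/2 blocks
print(simulate(worst, BLOCKS) * 4096 // 2**20, "MiB for a 1 MiB file")  # 128 MiB
```

This reproduces the quadratic bound: a 1 MiB file (256 blocks) can pin roughly 256*257/2 blocks, a little over 128 MiB.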

Defragmenting the files helps free space temporarily; however, space usage
will quickly grow again until it returns to the steady state around 2x
the file size.

A database ends up maxing out at about a factor of two space usage
because it tends to write short uniform-sized bursts of pages randomly,
so we get a pattern a bit like bricks in a wall:

        0 MB AA BB CC DD EE FF GG HH II JJ KK 1 MB half the extents
        0 MB  LL MM NN OO PP QQ RR SS TT UU V 1 MB the other half

        0 MB ALLBMMCNNDOOEPPFQQGRRHSSITTJUUKV 1 MB what the file looks like

Fixing this is non-trivial (it may require an incompatible disk format
change).  Until this is fixed, the most space-efficient approach seems to
be to force compression (so the maximum extent is 128K instead of 1GB)
and never defragment database files ever.
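Applied to the poster's setup, the fstab line would become something like this (compress-force is a standard btrfs mount option; the device and mountpoint are copied from the original mail):

```
/dev/sdc /srv/db btrfs rw,noatime,compress-force=lzo,space_cache 0 0
```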

> -- 
> GCP
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Big disk space usage difference, even after defrag, on identical data
  2015-04-11 19:59 Big disk space usage difference, even after defrag, on identical data Gian-Carlo Pascutto
  2015-04-13  4:04 ` Zygo Blaxell
@ 2015-04-13  5:06 ` Duncan
  2015-04-13 14:06   ` Gian-Carlo Pascutto
  1 sibling, 1 reply; 8+ messages in thread
From: Duncan @ 2015-04-13  5:06 UTC (permalink / raw)
  To: linux-btrfs

Gian-Carlo Pascutto posted on Sat, 11 Apr 2015 21:59:50 +0200 as
excerpted:

> That's a 66G difference for the same data with the same compress option.
> The used size here is much more in line with what I'd have expected
> given the nature of the data.
> 
> I would think that compression differences or things like fragmentation
> or bookending for modified files shouldn't affect this, because the
> first filesystem has been defragmented/recompressed and didn't shrink.
> 
> So what can explain this? Where did the 66G go?

Out of curiosity, does a balance on the actively used btrfs help?

You mentioned defrag -v -r -clzo, but didn't use the -f (flush) or -t 
(minimum size file) options.  Does adding -f -t1 help?

You aren't doing btrfs snapshots of either subvolume, are you?

I'm not sure this is related to the answer to your question, since you 
did defrag, but it might be, and it's good to know when dealing with 
database files on btrfs in any case.

Btrfs is in general a copy-on-write (COW) based filesystem.  Random 
rewrite pattern files, database and VM image files being prime examples, 
typically HEAVILY fragment on COW filesystems, since any rewrite forces a 
copy of the rewritten data block elsewhere.  The often rather large 
original extents get holes, but remain pinned by whatever data in them 
hasn't yet been rewritten.  This is analogous to 
the way databases often rewrite records but leave holes behind that 
aren't immediately cleaned up, only it's occurring at the filesystem 
extent level.  Only after all the data in an extent has been rewritten, 
can the extent itself be unpinned and returned to the free space pool.

Defrag should force the rewrite of entire files and take care of this, 
but obviously it's not returning to "clean" state.  I forgot what the 
default minimum file size is if -t isn't set, maybe 128 MiB?  But a -t1 
will force it to defrag even small files, and I recall at least one 
thread here where the poster said it made all the difference for him, so 
try that.  And the -f should force a filesystem sync afterward, so you 
know the numbers from any report you run afterward match the final state.

Meanwhile, you may consider using the nocow attribute on those database 
files.  It will disable compression on them, but rewrites should then 
occur in-place, so you don't get the fragmentation and extent usage holes 
and duplication that you'd have otherwise.  It'll also disable btrfs 
checksumming, but mature databases already have their own error detection 
and correction system, since they don't normally run on filesystems that 
provide that sort of service like btrfs does.  While initial usage will 
be higher due to the lack of compression, as you've discovered, over 
time, on an actively updated database, compression isn't all that 
effective anyway.  And while usage may be a bit higher at least 
originally, it should be stable, but for expanding the actual size of the 
database, anyway.

But there are a couple of caveats to nocow.  First, in order to be 
properly effective, it needs to be set on a file while it's still empty.  
The most effective way to do this is to set nocow on the empty parent 
directory, then copy the nocow-target files into it so they inherit the 
nocow attribute as they are created, before they actually have any data.
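A sketch of that procedure (the paths are examples only; this needs a btrfs filesystem, and the database must be stopped while the files are copied):

```
mkdir /srv/db/pgdata.nocow
chattr +C /srv/db/pgdata.nocow    # files created inside inherit nocow
cp -a /srv/db/pgdata/. /srv/db/pgdata.nocow/
lsattr -d /srv/db/pgdata.nocow    # should show the 'C' attribute
```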

The second pertains to btrfs snapshots.  Snapshots lock the existing file 
in place, effectively making an otherwise nocow file cow1 -- the first 
write to an existing file block will cow it, but after that, further 
writes to the same block will rewrite in-place... until the next 
snapshot, of course.  So try to minimize the number of snapshots done to 
nocow files, and if you do snapshot them, defrag them once in awhile as 
well.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Big disk space usage difference, even after defrag, on identical data
  2015-04-13  4:04 ` Zygo Blaxell
@ 2015-04-13  8:07   ` Duncan
  2015-04-13 11:32   ` Gian-Carlo Pascutto
  1 sibling, 0 replies; 8+ messages in thread
From: Duncan @ 2015-04-13  8:07 UTC (permalink / raw)
  To: linux-btrfs

Zygo Blaxell posted on Mon, 13 Apr 2015 00:04:36 -0400 as excerpted:

> A database ends up maxing out at about a factor of two space usage
> because it tends to write short uniform-sized bursts of pages randomly,
> so we get a pattern a bit like bricks in a wall:
> 
>         0 MB AA BB CC DD EE FF GG HH II JJ KK 1 MB half the extents
>         0 MB  LL MM NN OO PP QQ RR SS TT UU V 1 MB the other half
> 
>         0 MB ALLBMMCNNDOOEPPFQQGRRHSSITTJUUKV 1 MB what the file looks like
> 
> Fixing this is non-trivial (it may require an incompatible disk format
> change).  Until this is fixed, the most space-efficient approach seems
> to be to force compression (so the maximum extent is 128K instead of
> 1GB) and never defragment database files ever.

... Or set the database file nocow at creation, and don't snapshot it, so 
overwrites are always in-place.  (Btrfs compression and checksumming get 
turned off with nocow, but as we've seen, compression isn't all that 
effective on random-rewrite-pattern files anyway, and databases generally 
have their own data integrity handling, so neither one is a huge loss, 
and the in-place rewrite makes for better performance and a more 
predictable steady-state.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Big disk space usage difference, even after defrag, on identical data
  2015-04-13  4:04 ` Zygo Blaxell
  2015-04-13  8:07   ` Duncan
@ 2015-04-13 11:32   ` Gian-Carlo Pascutto
  1 sibling, 0 replies; 8+ messages in thread
From: Gian-Carlo Pascutto @ 2015-04-13 11:32 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zygo Blaxell

On 13-04-15 06:04, Zygo Blaxell wrote:

>> I would think that compression differences or things like
>> fragmentation or bookending for modified files shouldn't affect
>> this, because the first filesystem has been
>> defragmented/recompressed and didn't shrink.
>> 
>> So what can explain this? Where did the 66G go?
> 
> There are a few places:  the kernel may have decided your files are
> not compressible and disabled compression on them (some older kernels
> did this with great enthusiasm);

As stated in the previous mail, this is 3.19.1. Moreover, the data is
either uniformly compressible or not at all. Lastly, note that the
*exact same* mount options are being used on *the exact same kernel*
with *the exact same data*. Getting a different compressibility decision
given the same inputs would point to a bug.

> your files might have preallocated space from the fallocate system
> call (which disables compression and allocates contiguous space, so
> defrag will not touch it).

So defrag -clzo or -czlib won't actually re-compress mostly-contiguous
files? That's evil. I have no idea whether PostgreSQL allocates files
that way, though.

> 'filefrag -v' can tell you if this is happening to your files.

Not sure how to interpret that. Without "-v", I see most of the (DB)
data has 2-5 extents per Gigabyte. A few have 8192 extents per Gigabyte.

Comparing to the copy that takes 66G less, there every (compressible)
file has about 8192 extents per Gigabyte, and the others 5 or 6.

So you may be right that some DB files are "wedged" in a format that
btrfs can't compress. I forced the files to be rewritten (VACUUM FULL)
and that "fixed" the problem.

> In practice database files take about double the amount of space
> they appear to because of extent shingling.

This is what I called "bookending" in the original mail, I didn't know
the correct name, but I understand doing updates can result in N^2/2 or
thereabouts disk space usage, however:

> Defragmenting the files helps free space temporarily; however, space
> usage will quickly grow again until it returns to the steady state
> around 2x the file size.

As stated in the original mail, the filesystem was *freshly
defragmented* so that can't have been the cause.

> Until this is fixed, the most space-efficient approach seems to be to
> force compression (so the maximum extent is 128K instead of 1GB)

Would that fix the problem with fallocated() files?

-- 
GCP


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Big disk space usage difference, even after defrag, on identical data
  2015-04-13  5:06 ` Duncan
@ 2015-04-13 14:06   ` Gian-Carlo Pascutto
  2015-04-13 21:45     ` Zygo Blaxell
  2015-04-14  3:18     ` Duncan
  0 siblings, 2 replies; 8+ messages in thread
From: Gian-Carlo Pascutto @ 2015-04-13 14:06 UTC (permalink / raw)
  To: linux-btrfs

On 13-04-15 07:06, Duncan wrote:

>> So what can explain this? Where did the 66G go?
> 
> Out of curiosity, does a balance on the actively used btrfs help?
> 
> You mentioned defrag -v -r -clzo, but didn't use the -f (flush) or -t 
> (minimum size file) options.  Does adding -f -t1 help?

Unfortunately I can no longer try this, see the other reply why. But the
problem turned out to be some 1G-sized files, written using 3-5 extents,
that for whatever reason defrag was not touching.

> You aren't doing btrfs snapshots of either subvolume, are you?

No :-) I should've mentioned that.

> Defrag should force the rewrite of entire files and take care of this, 
> but obviously it's not returning to "clean" state.  I forgot what the 
> default minimum file size is if -t isn't set, maybe 128 MiB?  But a -t1 
> will force it to defrag even small files, and I recall at least one 
> thread here where the poster said it made all the difference for him, so 
> try that.  And the -f should force a filesystem sync afterward, so you 
> know the numbers from any report you run afterward match the final state.

Reading the corresponding manual, the -t explanation says that "any
extent bigger than this size will be considered already defragged". So I
guess setting -t1 might've fixed the problem too...but after checking
the source, I'm not so sure.

I didn't find the -t default in the manpages - after browsing through
the source, the default is in the kernel:
https://github.com/torvalds/linux/blob/4f671fe2f9523a1ea206f63fe60a7c7b3a56d5c7/fs/btrfs/ioctl.c#L1268
(Not sure what units those are.)

I wonder if this is relevant:
https://github.com/torvalds/linux/blob/4f671fe2f9523a1ea206f63fe60a7c7b3a56d5c7/fs/btrfs/ioctl.c#L2572

This seems to reset the -t flag if compress (-c) is set? This looks a
bit fishy?

> Meanwhile, you may consider using the nocow attribute on those database 
> files.  It will disable compression on them,

I'm using btrfs specifically to get compression, so this isn't an option.

> While initial usage will  be higher due to the lack of compression,
> as you've discovered, over time, on an actively updated database,
> compression isn't all that effective anyway.

I don't see why. If you're referring to the additional overhead of
continuously compressing and decompressing everything - yes, of course.
But in my case I have a mostly-append workload to a huge amount of
fairly compressible data that's on magnetic storage, so compression is a
win in disk space and perhaps even in performance.

I'm well aware of the many caveats in using btrfs for databases -
they're well documented and although I much appreciate your extended
explanation, it wasn't new to me.

It turns out that if your dataset isn't update heavy (so it doesn't
fragment to begin with), or has to be queried via indexed access (i.e.
mostly via random seeks), the fragmentation doesn't matter much anyway.
Conversely, btrfs appears to have better sync performance with multiple
threads, and allows one to disable part of the partial-page-write
protection logic in the database (full_page_writes=off for PostgreSQL),
because btrfs is already doing the COW to ensure those can't actually
happen [1].
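The corresponding postgresql.conf fragment would be along these lines (full_page_writes is a standard PostgreSQL setting; turning it off is only safe on a filesystem that never overwrites pages in place, as the poster argues btrfs does):

```
# postgresql.conf -- relies on btrfs COW to prevent torn page writes
full_page_writes = off
```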

The net result is a *boost* from about 40 tps (ext4) to 55 tps (btrfs),
which certainly is contrary to popular wisdom. Maybe btrfs would fall
off eventually as fragmentation sets in gradually, but given that
there's an offline defragmentation tool that can run in the background,
I don't care.

[1] I wouldn't be too surprised if database COW, which consists of
journal-writing a copy of the data out of band, then rewriting it again
in the original place, is actually functionally equivalent to disabling
COW in the database and running btrfs + defrag. Obviously you shouldn't
keep COW enabled in btrfs *AND* the DB, requiring all data to be copied
around at least 3 times...which I'm afraid almost everyone does because
it's the default...

-- 
GCP

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Big disk space usage difference, even after defrag, on identical data
  2015-04-13 14:06   ` Gian-Carlo Pascutto
@ 2015-04-13 21:45     ` Zygo Blaxell
  2015-04-14  3:18     ` Duncan
  1 sibling, 0 replies; 8+ messages in thread
From: Zygo Blaxell @ 2015-04-13 21:45 UTC (permalink / raw)
  To: Gian-Carlo Pascutto; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4670 bytes --]

On Mon, Apr 13, 2015 at 04:06:39PM +0200, Gian-Carlo Pascutto wrote:
> On 13-04-15 07:06, Duncan wrote:
> 
> >> So what can explain this? Where did the 66G go?
> > 
> > Out of curiosity, does a balance on the actively used btrfs help?
> > 
> > You mentioned defrag -v -r -clzo, but didn't use the -f (flush) or -t 
> > (minimum size file) options.  Does adding -f -t1 help?
> 
> Unfortunately I can no longer try this, see the other reply why. But the
> problem turned out to be some 1G-sized files, written using 3-5 extents,
> that for whatever reason defrag was not touching.

There are several corner cases that defrag won't touch by default.
It's designed to be conservative and favor speed over size.

Also when the kernel decides you're not getting enough compression, it
seems to disable compression on the file _forever_ even if future writes
are compressible again.  mount -o compress-force works around that.

> > You aren't doing btrfs snapshots of either subvolume, are you?
> 
> No :-) I should've mentioned that.

read-only snapshots:  yet another thing defrag won't touch.

> > While initial usage will  be higher due to the lack of compression,
> > as you've discovered, over time, on an actively updated database,
> > compression isn't all that effective anyway.
> 
> I don't see why. If you're referring to the additional overhead of
> continuously compressing and decompressing everything - yes, of course.
> But in my case I have a mostly-append workload to a huge amount of
> fairly compressible data that's on magnetic storage, so compression is a
> win in disk space and perhaps even in performance.

Short writes won't compress--not just well, but at all--because btrfs
won't look at adjacent already-written blocks.  If you write a file
at less than 4K/minute, there will be no compression, as each new extent
(or replacement extent for overwritten data) is already minimum-sized.

If you write in bursts of 128K or more, consecutively, then you can
get compression benefit.

There has been talk of teaching autodefrag to roll up the last few dozen
extents of files that grow slowly so they can be compressed.

> It turns out that if your dataset isn't update heavy (so it doesn't
> fragment to begin with), or has to be queried via indexed access (i.e.
> mostly via random seeks), the fragmentation doesn't matter much anyway.
> Conversely, btrfs appears to have better sync performance with multiple
> threads, and allows one to disable part of the partial-page-write
> protection logic in the database (full_page_writes=off for PostgreSQL),
> because btrfs is already doing the COW to ensure those can't actually
> happen [1].
> 
> The net result is a *boost* from about 40 tps (ext4) to 55 tps (btrfs),
> which certainly is contrary to popular wisdom. Maybe btrfs would fall
> off eventually as fragmentation sets in gradually, but given that
> there's an offline defragmentation tool that can run in the background,
> I don't care.

I've found the performance of PostgreSQL to be wildly variable on btrfs.
It may be OK at first, but watch it for a week or two to admire the
full four-orders-of-magnitude swing (100 tps to 0.01 tps).  :-O

> [1] I wouldn't be too surprised if database COW, which consists of
> journal-writing a copy of the data out of band, then rewriting it again
> in the original place, is actually functionally equivalent to disabling
> COW in the database and running btrfs + defrag. Obviously you shouldn't
> keep COW enabled in btrfs *AND* the DB, requiring all data to be copied
> around at least 3 times...which I'm afraid almost everyone does because
> it's the default...

Journalling writes all the data twice:  once to the journal, once to
update the origin page after the journal (though PostgreSQL will omit
some of those duplicate writes in cases where there is no origin page
to overwrite).

COW writes all the new and updated data only once.

In the event of a crash, if the log tree is not recoverable (and it's
a rich source of btrfs bugs, so it's often not), you lose everything
that happened to the database in the last 30 seconds.  If you were
already using async commit in PostgreSQL anyway then that's not much
of a concern (and not having to call fsync 100 times a second _really_
helps performance!), but if you really need sync commit then btrfs is
not the filesystem for you.

> -- 
> GCP
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Big disk space usage difference, even after defrag, on identical data
  2015-04-13 14:06   ` Gian-Carlo Pascutto
  2015-04-13 21:45     ` Zygo Blaxell
@ 2015-04-14  3:18     ` Duncan
  1 sibling, 0 replies; 8+ messages in thread
From: Duncan @ 2015-04-14  3:18 UTC (permalink / raw)
  To: linux-btrfs

Gian-Carlo Pascutto posted on Mon, 13 Apr 2015 16:06:39 +0200 as
excerpted:

>> Defrag should force the rewrite of entire files and take care of this,
>> but obviously it's not returning to "clean" state.  I forgot what the
>> default minimum file size is if -t isn't set, maybe 128 MiB?  But a -t1
>> will force it to defrag even small files, and I recall at least one
>> thread here where the poster said it made all the difference for him,
>> so try that.  And the -f should force a filesystem sync afterward, so
>> you know the numbers from any report you run afterward match the final
>> state.
> 
> Reading the corresponding manual, the -t explanation says that "any
> extent bigger than this size will be considered already defragged". So I
> guess setting -t1 might've fixed the problem too...but after checking
> the source, I'm not so sure.

Oops!  You are correct.  There was an on-list discussion of that before 
that I had forgotten.  The "make sure everything gets defragged" magic 
setting is -t 1G or higher, *not* the -t 1 I was trying to tell you 
previously (which will end up skipping everything, instead of defragging 
everything).

Thanks for spotting the inconsistency and calling me on it! =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2015-04-14  3:18 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-11 19:59 Big disk space usage difference, even after defrag, on identical data Gian-Carlo Pascutto
2015-04-13  4:04 ` Zygo Blaxell
2015-04-13  8:07   ` Duncan
2015-04-13 11:32   ` Gian-Carlo Pascutto
2015-04-13  5:06 ` Duncan
2015-04-13 14:06   ` Gian-Carlo Pascutto
2015-04-13 21:45     ` Zygo Blaxell
2015-04-14  3:18     ` Duncan
