From: Edward Shishkin <email@example.com>
To: Daniel J Blueman <firstname.lastname@example.org>
Cc: Mat <email@example.com>, LKML <firstname.lastname@example.org>,
Chris Mason <email@example.com>,
Ric Wheeler <firstname.lastname@example.org>,
Andrew Morton <email@example.com>,
Linus Torvalds <firstname.lastname@example.org>,
The development of BTRFS <email@example.com>
Subject: Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Date: Fri, 18 Jun 2010 18:50:45 +0200
Message-ID: <4C1BA3E5.firstname.lastname@example.org>
Daniel J Blueman wrote:
> On Fri, Jun 18, 2010 at 1:32 PM, Edward Shishkin
> <email@example.com> wrote:
>> Mat wrote:
>>> On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin <firstname.lastname@example.org> wrote:
>>>> Hello everyone.
>>>> I was asked to review/evaluate Btrfs for use in enterprise
>>>> systems, and below are my first impressions (linux-2.6.33).
>>>> The first test I ran was filling an empty 659M (/dev/sdb2)
>>>> btrfs partition (mounted to /mnt) with 2K files:
>>>> # for i in $(seq 1000000); \
>>>> do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
>>>> (terminated after getting "No space left on device" reports).
>>>> # ls /mnt | wc -l
>>>> 59480
>>>> So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17,
>>>> and the first obvious question is "hey, where are the other 83% of my
>>>> disk space???" I looked at the btrfs storage tree (fs_tree) and was
>>>> shocked by the situation on the leaf level. Appendix B shows
>>>> 5 adjacent btrfs leaves, which have the same parent.
>>>> For example, look at the leaf 29425664: "items 1 free space 3892"
>>>> (of 4096!!). Note that this "free" space (3892) is _dead_: any
>>>> attempt to write to the file system will result in "No space left
>>>> on device".
>>>> Internal fragmentation (see Appendix A) of those 5 leaves is
>>>> (1572+3892+1901+3666+1675)/(4096*5) = 0.62. This is even worse than
>>>> ext4 and xfs: the latter in this example will show fragmentation
>>>> near zero with blocksize <= 2K. Even with 4K blocksize they will
>>>> show better utilization 0.50 (against 0.38 in btrfs)!
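As a rough check of the arithmetic quoted above, here is a minimal Python sketch. It only recomputes the figures given in the report (leaf size, free space per leaf, file count, partition size); nothing is measured independently, and the function names are illustrative, not taken from btrfs.

```python
# Figures copied from the quoted report; nothing here is measured.
LEAF_SIZE = 4096                              # bytes per leaf
FREE_SPACE = [1572, 3892, 1901, 3666, 1675]   # free bytes in the 5 leaves

def internal_fragmentation(free_bytes, leaf_size):
    """Fraction of the total leaf space that is left unused."""
    return sum(free_bytes) / (leaf_size * len(free_bytes))

def disk_utilization(n_files, file_size, partition_bytes):
    """Fraction of the partition occupied by file data."""
    return n_files * file_size / partition_bytes

frag = internal_fragmentation(FREE_SPACE, LEAF_SIZE)
util = disk_utilization(59480, 2048, 659 * 1024 * 1024)

print(f"leaf fragmentation: {frag:.4f}")
print(f"leaf utilization:   {1 - frag:.4f}")
print(f"disk utilization:   {util:.4f}")
```

The leaf figures come out at about 0.62 and 0.38, matching the report. The disk utilization comes out at roughly 0.176, which the report gives as 0.17.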
>>>> I have a small question for btrfs developers: Why do you folks put
>>>> "inline extents", xattr, etc. items of variable size into the B-tree
>>>> in spite of the fact that a B-tree is a data structure NOT for
>>>> variable-sized records? This disadvantage of B-trees was widely discussed.
>>>> For example, maestro D. Knuth warned about this issue a long time
>>>> ago (see Appendix C).
>>>> It is a well-known fact that internal fragmentation of classic Bayer's
>>>> B-trees is bounded by the value 0.50 (see Appendix C). However, this
>>>> holds only if your tree contains records of the _same_ length
>>>> (for example, extent pointers). Once you put into your B-tree records
>>>> of variable length (restricted only by leaf size, like btrfs "inline
>>>> extents"), your tree LOSES this bound. Moreover, even worse:
>>>> it is clear that in this case utilization of the B-tree can approach
>>>> zero(!). That said, for every small E and for every amount of data N we
>>>> can construct a consistent B-tree, which contains data N and has
>>>> utilization worse than E. I.e. from the standpoint of utilization
>>>> such trees can be completely degenerated.
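The "for every E and every N" claim above can be illustrated with a small Python sketch. It is a toy model under loud assumptions: leaves are plain lists of record sizes, item headers are ignored, and nothing forces neighbouring leaves to merge. The helper names (`one_record_per_leaf`, `layout_below`) are hypothetical, and the sketch says nothing about whether btrfs actually reaches such a state.

```python
LEAF_SIZE = 4096  # bytes per leaf, as in the report above

def utilization(leaves, leaf_size=LEAF_SIZE):
    """Used fraction of a list of leaves; each leaf is a list of record sizes."""
    used = sum(sum(leaf) for leaf in leaves)
    return used / (leaf_size * len(leaves))

def one_record_per_leaf(total_bytes, record_size):
    """Hypothetical layout: every leaf holds exactly one record."""
    count = max(1, total_bytes // record_size)
    return [[record_size] for _ in range(count)]

def layout_below(epsilon, total_bytes, leaf_size=LEAF_SIZE):
    """Pick a record size so the one-record-per-leaf layout stays under epsilon."""
    record_size = int(epsilon * leaf_size) - 1
    if record_size < 1:
        raise ValueError("epsilon too small for this leaf size")
    return one_record_per_leaf(total_bytes, record_size)

results = {}
for eps in (0.5, 0.1, 0.01):
    leaves = layout_below(eps, total_bytes=1_000_000)
    results[eps] = utilization(leaves)
    print(f"E={eps}: {len(leaves)} leaves, utilization {results[eps]:.4f}")
```

In this model the keys are still ordered and every record fits in its leaf, so the layout is "consistent" in the weak sense used above, yet utilization is simply record size divided by leaf size. With fixed-size records and the classic half-full rule, such a layout would not be allowed.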
>>>> That said, the very important property of B-trees, which guarantees
>>>> non-zero utilization, has been lost, and I don't see in the Btrfs code any
>>>> substitute for this property. In other words, where is a formal
>>>> guarantee that all disk space of our users won't be eaten by internal
>>>> fragmentation? I consider such a guarantee a *necessary* condition
>>>> for putting a file system into production.
> Wow...a small part of me says 'well said', on the basis that your
> assertions are true, but I do think there needs to be more
> constructivity in such critique; it is almost impossible to be a great
> engineer and a great academic at once in a time-pressured environment.
Sure it is impossible. I believe in division of labour:
academics write algorithms, and we (engineers) encode them.
I have noticed that events in Btrfs are developing along a scenario not
predicted by the paper of academic Ohad Rodeh (in spite of the announcement
that Btrfs is based on this paper). This is why I have started to grumble.
> If you can produce some specifics and suggestions with code references,
> I'm sure we'll get some good discussion with potential to improve from
> where we are.