Linux-BTRFS Archive on lore.kernel.org
 help / color / Atom feed
From: Edward Shishkin <edward.shishkin@gmail.com>
To: Daniel J Blueman <daniel.blueman@gmail.com>
Cc: Mat <jackdachef@gmail.com>, LKML <linux-kernel@vger.kernel.org>,
	linux-fsdevel@vger.kernel.org,
	Chris Mason <chris.mason@oracle.com>,
	Ric Wheeler <rwheeler@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	The development of BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Btrfs: broken file system design (was Unbound(?) internal 	fragmentation in Btrfs)
Date: Fri, 18 Jun 2010 18:50:45 +0200
Message-ID: <4C1BA3E5.7020400@gmail.com> (raw)
In-Reply-To: <AANLkTilQm8VGAvc1XNW4EaHd1FLd6dXIXwJ9-yT-joQ8@mail.gmail.com>

Daniel J Blueman wrote:
> On Fri, Jun 18, 2010 at 1:32 PM, Edward Shishkin
> <edward.shishkin@gmail.com> wrote:
>   
>> Mat wrote:
>>     
>>> On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin <edward@redhat.com> wrote:
>>>       
>>>> Hello everyone.
>>>>
>>>> I was asked to review/evaluate Btrfs for using in enterprise
>>>> systems and the below are my first impressions (linux-2.6.33).
>>>>
>>>> The first test I have made was filling an empty 659M (/dev/sdb2)
>>>> btrfs partition (mounted to /mnt) with 2K files:
>>>>
>>>> # for i in $(seq 1000000); \
>>>> do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
>>>> (terminated after getting "No space left on device" reports).
>>>>
>>>> # ls /mnt | wc -l
>>>> 59480
>>>>
>>>> So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17,
>>>> and the first obvious question is "hey, where are other 83% of my
>>>> disk space???" I looked at the btrfs storage tree (fs_tree) and was
>>>> shocked with the situation on the leaf level. The Appendix B shows
>>>> 5 adjacent btrfs leafs, which have the same parent.
>>>>
>>>> For example, look at the leaf 29425664: "items 1 free space 3892"
>>>> (of 4096!!). Note, that this "free" space (3892) is _dead_: any
>>>> attempts to write to the file system will result in "No space left
>>>> on device".
>>>>
>>>> Internal fragmentation (see Appendix A) of those 5 leafs is
>>>> (1572+3892+1901+3666+1675)/4096*5 = 0.62. This is even worse then
>>>> ext4 and xfs: The last ones in this example will show fragmentation
>>>> near zero with blocksize <= 2K. Even with 4K blocksize they will
>>>> show better utilization 0.50 (against 0.38 in btrfs)!
>>>>
>>>> I have a small question for btrfs developers: Why do you folks put
>>>> "inline extents", xattr, etc items of variable size to the B-tree
>>>> in spite of the fact that B-tree is a data structure NOT for variable
>>>> sized records? This disadvantage of B-trees was widely discussed.
>>>> For example, maestro D. Knuth warned about this issue long time
>>>> ago (see Appendix C).
>>>>
>>>> It is a well known fact that internal fragmentation of classic Bayer's
>>>> B-trees is restricted by the value 0.50 (see Appendix C). However it
>>>> takes place only if your tree contains records of the _same_ length
>>>> (for example, extent pointers). Once you put to your B-tree records
>>>> of variable length (restricted only by leaf size, like btrfs "inline
>>>> extents"), your tree LOSES this boundary. Moreover, even worse:
>>>> it is clear, that in this case utilization of B-tree scales as zero(!).
>>>> That said, for every small E and for every amount of data N we
>>>> can construct a consistent B-tree, which contains data N and has
>>>> utilization worse then E. I.e. from the standpoint of utilization
>>>> such trees can be completely degenerated.
>>>>
>>>> That said, the very important property of B-trees, which guarantees
>>>> non-zero utilization, has been lost, and I don't see in Btrfs code any
>>>> substitution for this property. In other words, where is a formal
>>>> guarantee that all disk space of our users won't be eaten by internal
>>>> fragmentation? I consider such guarantee as a *necessary* condition
>>>> for putting a file system to production.
>>>>         
>
> Wow...a small part of me says 'well said', on the basis that your
> assertions are true, but I do think there needs to be more
> constructivity in such critique; it is almost impossible to be a great
> engineer and a great academic at once in a time-pressured environment.
>   

Sure it is impossible. I believe in division of labour:
academics writes algorithms, and we (engineers) encode them.

I have noticed that events in Btrfs develop by scenario not predicted
by the paper of academic Ohad Rodeh (in spite of the announce that
Btrfs is based on this paper). This is why I have started to grumble..

Thanks.

> If you can produce some specific and suggestions with code references,
> I'm sure we'll get some good discussion with potential to improve from
> where we are.
>
> Thanks,
>   Daniel
>   

  reply index

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-06-03 14:58 Unbound(?) internal fragmentation in Btrfs Edward Shishkin
     [not found] ` <AANLkTilKw2onQkdNlZjg7WVnPu2dsNpDSvoxrO_FA2z_@mail.gmail.com>
2010-06-18  8:03   ` Christian Stroetmann
2010-06-18 13:32   ` Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs) Edward Shishkin
2010-06-18 13:45     ` Daniel J Blueman
2010-06-18 16:50       ` Edward Shishkin [this message]
2010-06-23 23:40         ` Jamie Lokier
2010-06-24  3:43           ` Daniel Taylor
2010-06-24  4:51             ` Mike Fedyk
2010-06-24 22:06               ` Daniel Taylor
2010-06-25  9:15                 ` Btrfs: broken file system design Andi Kleen
2010-06-25 18:58                 ` Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs) Ric Wheeler
2010-06-26  5:18                   ` Michael Tokarev
2010-06-26 11:55                     ` Ric Wheeler
     [not found]                     ` <57784.2001:5c0:82dc::2.1277555665.squirrel@www.tofubar.com>
2010-06-26 13:47                       ` Ric Wheeler
2010-06-24  9:50             ` David Woodhouse
2010-06-18 18:15       ` Christian Stroetmann
2010-06-18 13:47     ` Chris Mason
2010-06-18 15:05       ` Edward Shishkin
     [not found]       ` <4C1B8B4A.9060308@gmail.com>
2010-06-18 15:10         ` Chris Mason
2010-06-18 16:22           ` Edward Shishkin
     [not found]           ` <4C1B9D4F.6010008@gmail.com>
2010-06-18 18:10             ` Chris Mason
2010-06-18 15:21       ` Christian Stroetmann
2010-06-18 15:22         ` Chris Mason
2010-06-18 15:56     ` Jamie Lokier
2010-06-18 19:25       ` Christian Stroetmann
2010-06-18 19:29       ` Edward Shishkin
2010-06-18 19:35         ` Chris Mason
2010-06-18 22:04           ` Balancing leaves when walking from top to down (was Btrfs:...) Edward Shishkin
     [not found]           ` <4C1BED56.9010300@redhat.com>
2010-06-18 22:16             ` Ric Wheeler
2010-06-19  0:03               ` Edward Shishkin
2010-06-21 13:15             ` Chris Mason
     [not found]               ` <20100621180013.GD17979@think>
2010-06-22 14:12                 ` Edward Shishkin
2010-06-22 14:20                   ` Chris Mason
2010-06-23 13:46                     ` Edward Shishkin
     [not found]                     ` <4C221049.501@gmail.com>
2010-06-23 23:37                       ` Jamie Lokier
2010-06-24 13:06                         ` Chris Mason
2010-06-30 20:05                           ` Edward Shishkin
     [not found]                           ` <4C2BA381.7040808@redhat.com>
2010-06-30 21:12                             ` Chris Mason
2010-07-09  4:16                 ` Chris Samuel
2010-07-09 20:30                   ` Chris Mason
2010-06-23 23:57         ` Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs) Jamie Lokier

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4C1BA3E5.7020400@gmail.com \
    --to=edward.shishkin@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=chris.mason@oracle.com \
    --cc=daniel.blueman@gmail.com \
    --cc=jackdachef@gmail.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=rwheeler@redhat.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-BTRFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-btrfs/0 linux-btrfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-btrfs linux-btrfs/ https://lore.kernel.org/linux-btrfs \
		linux-btrfs@vger.kernel.org
	public-inbox-index linux-btrfs

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-btrfs


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git