All of lore.kernel.org
 help / color / mirror / Atom feed
* Unbound(?) internal fragmentation in Btrfs
@ 2010-06-03 14:58 Edward Shishkin
  2010-06-17 23:29 ` Mat
  0 siblings, 1 reply; 62+ messages in thread
From: Edward Shishkin @ 2010-06-03 14:58 UTC (permalink / raw)
  To: The development of BTRFS, LKML, linux-fsdevel
  Cc: Chris Mason, Ric Wheeler, Andrew Morton

Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 1000000); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting "No space left on device" reports).

# ls /mnt | wc -l
59480

So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is "hey, where are other 83% of my
disk space???" I looked at the btrfs storage tree (fs_tree) and was
shocked with the situation on the leaf level. The Appendix B shows
5 adjacent btrfs leafs, which have the same parent.

For example, look at the leaf 29425664: "items 1 free space 3892"
(of 4096!!). Note, that this "free" space (3892) is _dead_: any
attempts to write to the file system will result in "No space left
on device".

Internal fragmentation (see Appendix A) of those 5 leafs is
(1572+3892+1901+3666+1675)/4096*5 = 0.62. This is even worse then
ext4 and xfs: The last ones in this example will show fragmentation
near zero with blocksize <= 2K. Even with 4K blocksize they will
show better utilization 0.50 (against 0.38 in btrfs)!

I have a small question for btrfs developers: Why do you folks put
"inline extents", xattr, etc items of variable size to the B-tree
in spite of the fact that B-tree is a data structure NOT for variable
sized records? This disadvantage of B-trees was widely discussed.
For example, maestro D. Knuth warned about this issue long time
ago (see Appendix C).

It is a well known fact that internal fragmentation of classic Bayer's
B-trees is restricted by the value 0.50 (see Appendix C). However it
takes place only if your tree contains records of the _same_ length
(for example, extent pointers). Once you put to your B-tree records
of variable length (restricted only by leaf size, like btrfs "inline
extents"), your tree LOSES this boundary. Moreover, even worse:
it is clear, that in this case utilization of B-tree scales as zero(!).
That said, for every small E and for every amount of data N we
can construct a consistent B-tree, which contains data N and has
utilization worse then E. I.e. from the standpoint of utilization
such trees can be completely degenerated.

That said, the very important property of B-trees, which guarantees
non-zero utilization, has been lost, and I don't see in Btrfs code any
substitution for this property. In other words, where is a formal
guarantee that all disk space of our users won't be eaten by internal
fragmentation? I consider such guarantee as a *necessary* condition
for putting a file system to production.

Any technical comments are welcome.

Thanks,
Edward.


Appendix A.
-----------
Glossary

1. Utilization of data and(or) metadata storage.

The fraction A/B, where
A is total size in bytes of stored data and(or) metadata.
B = N * S, where
N is number of blocks occupied by stored data and(or) metadata.
S is block size in bytes.

2. Internal fragmentation of data and(or) metadata storage.

difference (1 - U), where U is utilization.


Appendix B.
-----------
a "period" in the dump of the fs_tree (btrfs-debug-tree /dev/sdb2)

...

leaf 29982720 items 4 free space 1572 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
        item 0 key (319 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
                location key (0 UNKNOWN 0) type 8
                namelen 16 datalen 32 name: security.selinux
        item 1 key (319 EXTENT_DATA 0) itemoff 1848 itemsize 2069
                inline extent data size 2048 ram 2048 compress 0
        item 2 key (320 INODE_ITEM 0) itemoff 1688 itemsize 160
                inode generation 8 size 2048 block group 29360128 mode 
100644 links 1
        item 3 key (320 INODE_REF 256) itemoff 1672 itemsize 16
                inode ref index 65 namelen 6 name: file64
leaf 29425664 items 1 free space 3892 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
        item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
                location key (0 UNKNOWN 0) type 8
                namelen 16 datalen 32 name: security.selinux
leaf 29990912 items 1 free space 1901 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
        item 0 key (320 EXTENT_DATA 0) itemoff 1926 itemsize 2069
                inline extent data size 2048 ram 2048 compress 0
leaf 29986816 items 3 free space 3666 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
        item 0 key (321 INODE_ITEM 0) itemoff 3835 itemsize 160
                inode generation 8 size 2048 block group 29360128 mode 
100644 links 1
        item 1 key (321 INODE_REF 256) itemoff 3819 itemsize 16
                inode ref index 66 namelen 6 name: file65
        item 2 key (321 XATTR_ITEM 3817753667) itemoff 3741 itemsize 78
                location key (0 UNKNOWN 0) type 8
                namelen 16 datalen 32 name: security.selinux
leaf 29995008 items 3 free space 1675 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
        item 0 key (321 EXTENT_DATA 0) itemoff 1926 itemsize 2069
                inline extent data size 2048 ram 2048 compress 0
        item 1 key (322 INODE_ITEM 0) itemoff 1766 itemsize 160
                inode generation 8 size 2048 block group 29360128 mode 
100644 links 1
        item 2 key (322 INODE_REF 256) itemoff 1750 itemsize 16
                inode ref index 67 namelen 6 name: file66
...

Appendix C.
-----------

D.E. Knuth, The Art of Computer Programming, vol. 3 (Sorting and Searching),
Addison-Wesley, Reading, MA, 1973.

-- 
Edward O. Shishkin
Principal Software Engineer
Red Hat Czech

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2010-07-09 20:30 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-06-03 14:58 Unbound(?) internal fragmentation in Btrfs Edward Shishkin
2010-06-17 23:29 ` Mat
2010-06-18  8:03   ` Christian Stroetmann
2010-06-18 13:32   ` Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs) Edward Shishkin
2010-06-18 13:45     ` Daniel J Blueman
2010-06-18 13:45       ` Daniel J Blueman
2010-06-18 16:50       ` Edward Shishkin
2010-06-23 23:40         ` Jamie Lokier
2010-06-24  3:43           ` Daniel Taylor
2010-06-24  4:51             ` Mike Fedyk
2010-06-24  4:51               ` Mike Fedyk
2010-06-24  4:51               ` Mike Fedyk
2010-06-24 22:06               ` Daniel Taylor
2010-06-24 22:06                 ` Daniel Taylor
2010-06-24 22:06                 ` Daniel Taylor
2010-06-25  9:15                 ` Btrfs: broken file system design Andi Kleen
2010-06-25 18:58                 ` Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs) Ric Wheeler
2010-06-26  5:18                   ` Michael Tokarev
2010-06-26 11:55                     ` Ric Wheeler
     [not found]                     ` <57784.2001:5c0:82dc::2.1277555665.squirrel@www.tofubar.com>
2010-06-26 13:47                       ` Ric Wheeler
2010-06-24  9:50             ` David Woodhouse
2010-06-24  9:50               ` David Woodhouse
2010-06-18 18:15       ` Christian Stroetmann
2010-06-18 13:47     ` Chris Mason
2010-06-18 15:05       ` Edward Shishkin
2010-06-18 15:05       ` Edward Shishkin
2010-06-18 15:05         ` Edward Shishkin
2010-06-18 15:10         ` Chris Mason
2010-06-18 16:22           ` Edward Shishkin
2010-06-18 16:22           ` Edward Shishkin
2010-06-18 16:22             ` Edward Shishkin
2010-06-18 18:10             ` Chris Mason
2010-06-18 15:21       ` Christian Stroetmann
2010-06-18 15:22         ` Chris Mason
2010-06-18 15:56     ` Jamie Lokier
2010-06-18 19:25       ` Christian Stroetmann
2010-06-18 19:29       ` Edward Shishkin
2010-06-18 19:35         ` Chris Mason
2010-06-18 22:04           ` Balancing leaves when walking from top to down (was Btrfs:...) Edward Shishkin
2010-06-18 22:04           ` Edward Shishkin
2010-06-18 22:04           ` Edward Shishkin
2010-06-18 22:16             ` Ric Wheeler
2010-06-19  0:03               ` Edward Shishkin
2010-06-21 13:15             ` Chris Mason
2010-06-21 18:00               ` Chris Mason
2010-06-22 14:12                 ` Edward Shishkin
2010-06-22 14:12                 ` Edward Shishkin
2010-06-22 14:12                   ` Edward Shishkin
2010-06-22 14:20                   ` Chris Mason
2010-06-23 13:46                     ` Edward Shishkin
2010-06-23 13:46                       ` Edward Shishkin
2010-06-23 23:37                       ` Jamie Lokier
2010-06-24 13:06                         ` Chris Mason
2010-06-30 20:05                           ` Edward Shishkin
2010-06-30 20:05                           ` Edward Shishkin
2010-06-30 20:05                             ` Edward Shishkin
2010-06-30 21:12                             ` Chris Mason
2010-06-30 20:05                           ` Edward Shishkin
2010-06-23 13:46                     ` Edward Shishkin
2010-07-09  4:16                 ` Chris Samuel
2010-07-09 20:30                   ` Chris Mason
2010-06-23 23:57         ` Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs) Jamie Lokier

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.