* [rfc] Near-constant time directory index for Ext2
@ 2001-02-20 15:04 Daniel Phillips
  2001-02-20 20:03 ` Linus Torvalds
                   ` (3 more replies)
  0 siblings, 4 replies; 69+ messages in thread
From: Daniel Phillips @ 2001-02-20 15:04 UTC (permalink / raw)
  To: Linux-kernel; +Cc: tytso, Andreas Dilger, hch, ext2-devel

Earlier this month a runaway installation script decided to mail all its
problems to root.  After a couple of hours the script aborted, having
created 65535 entries in Postfix's maildrop directory.  Removing those
files took an awfully long time.  The problem is that Ext2 does each
directory access using a simple, linear search through the entire
directory file, resulting in n**2 behaviour to create/delete n files. 
It's about time we fixed that.

Last fall in Miami, Ted Ts'o mentioned some ideas he was playing with
for an Ext2 directory index, including the following points:

  - Fixed-size hash keys instead of names in the index
  - Leaf blocks are normal ext2 directory blocks
  - Leaf blocks are sequential, so readdir doesn't have to be changed

Having thought about it on and off since then, I came up with the
following additional design elements:

  - Logical addressing
    The cost of logical addressing of disk blocks is scarcely higher
    than physical addressing, and logical access through the page cache
    is actually faster than physical addressing because you don't have
    to traverse a tree: you can go straight to the logical block you are
    interested in and only traverse the tree if it's not there.  The
    payoff in terms of not breaking Ext2's existing allocation strategy
    is huge, not to mention friendliness to tools such as e2fsck and
    e2resize.  Finally, logical addressing means Tux2 will support
    this feature without modification. :-)

  - 8 bytes is sufficient for an index entry
    This gives a branching factor of 512 for 4K filesystem blocks
    resulting in log512(n) access time, performance that is almost
    indistinguishable from constant-time.  The 8 bytes allows for a 32
    bit hash key (31 bits actually used, see below) and a 4 byte
    logical block number, both sufficient for handling billions of
    directory entries.

  - Uniform-depth tree
    Usually, some form of balanced tree is used for a directory index. 
    I found that a simple, uniform-depth tree provides equivalent
    performance with far simpler code.  Just two tree levels can handle
    millions of directory entries, and for all practical purposes,
    such a tree is never out of balance.

So to give this a name, it's a "uniform-depth hash tree", or htree for
short.  (It's not a btree.)  Such a structure inherits properties of
both trees and hash tables.  From the hash table side, the htree
inherits the advantage of compact, fixed-size keys, which give a high
branching factor and enable the use of binary search in interior index
nodes.

It also inherits a big disadvantage of hash tables: key collisions. 
Though rare, collisions give rise to a number of corner cases that
are not particularly easy to deal with.  (see below)

Index Structure
---------------

The root of the index tree is in the 0th block of the file.  Space is
reserved for a second level of the index tree in blocks 1 through 511
(for 4K filesystem blocks).  Directory leaf blocks are appended
starting at block 512, thus the tail of the directory file looks like a
normal Ext2 directory and can be processed directly by ext2_readdir. 
For directories with less than about 90K files there is a hole running
from block 1 to block 511, so an empty directory has just two blocks in
it, though its size appears to be about 2 Meg in a directory listing.

So a directory file looks like:

	0: Root index block
	1: Index block/0
	2: Index block/0
	...
	511: Index block/0
	512: Dirent block
	513: Dirent block
	...

Each index block consists of 512 index entries of the form:

	hash, block

where hash is a 32 bit hash with a collision flag in its least
significant bit, and block is the logical block number of an index or
leaf block, depending on the tree level.

The hash value of the 0th index entry isn't needed because it can
always be obtained from the level above, so it is used to record the
count of index entries in an index block.  This gives a nice round
branching factor of 512, the evenness being a nicety that mainly
satisfies my need to seek regularity, rather than winning any real
performance.  (On the other hand, the largeness of the branching factor
matters a great deal.)
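
To make the layout concrete, here is a minimal sketch of the 8-byte entry
and the count-in-entry-0 convention.  The struct mirrors dx_entry in the
patch below; the count helper is just an illustration, not code from the
patch:

    struct dx_entry {           /* 8 bytes, as in the patch */
        u32 hash;               /* 31-bit hash, low bit = collision flag */
        u32 block;              /* logical block of child index or leaf */
    };

    /* Entry 0 never needs a hash of its own, so its hash field holds
       the number of live entries in this index block. */
    static inline unsigned dx_entry_count (struct dx_entry *entries)
    {
        return entries[0].hash;
    }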

The root index block has the same format as the other index blocks,
with its first 8 bytes reserved for a small header:

	1 byte header length (default: 8)
	1 byte index type (default: 0)
	1 byte hash version (default: 0)
	1 byte tree depth (default: 1)

The treatment of the header differs slightly in the attached patch.  In
particular, only a single level of the index tree (the root) is
implemented here.  This turns out to be sufficient to handle more than
90,000 entries, so it is enough for today.  When a second level is
added to the tree, capacity will increase to somewhere around 50
million entries, and there is nothing preventing the use of n levels,
should there ever be a reason.  It's doubtful that a third level
will ever be required, but if it is, the design provides for it.
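
For illustration, the 8-byte header described above could be declared as
follows.  This is a sketch of the intended on-disk layout only; the patch
below actually uses a simpler dx_root_info holding just an entry count and
a reserved word:

    struct dx_root_header {     /* hypothetical, per the description above */
        u8 header_length;       /* default: 8 */
        u8 index_type;          /* default: 0 */
        u8 hash_version;        /* default: 0 */
        u8 tree_depth;          /* default: 1 */
        u8 reserved[4];         /* pad to 8 bytes, i.e. one index entry */
    };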

Lookup Algorithm
----------------

Lookup is straightforward:

  - Compute a hash of the name
  - Read the index root
  - Use binary search (linear in the current code) to find the
    first index or leaf block that could contain the target hash
    (in tree order)
  - Repeat the above until the lowest tree level is reached
  - Read the leaf directory entry block and do a normal Ext2
    directory block search in it.
  - If the name is found, return its directory entry and buffer
  - Otherwise, if the collision bit of the next directory entry is
    set, continue searching in the successor block

Normally, two logical blocks of the file will need to be accessed, and
one or two metadata index blocks.  The effect of the metadata index
blocks can largely be ignored in terms of disk access time since these
blocks are unlikely to be evicted from cache.  There is some small CPU
cost that can be addressed by moving the whole directory into the page
cache.
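
Condensed into code, the lookup path looks roughly like this.  It is a
sketch boiled down from ext2_find_entry in the patch below, not a drop-in
replacement for it; error handling is omitted:

    struct buffer_head *htree_lookup_sketch (struct inode *dir,
        const char *name, int namelen, struct ext2_dir_entry_2 **res)
    {
        u32 hash = dx_hash (name, namelen);
        struct dx_frame frame;
        struct buffer_head *bh = NULL;
        int err;

        if (dx_probe (dir, hash, &frame))       /* walk the index */
            return NULL;
        for (;;) {
            bh = ext2_bread (dir, frame.at->block, 0, &err);
            if (bh && (*res = ext2_find_de (bh, name, namelen, &err, dir, 0)))
                break;                          /* found: caller gets bh and *res */
            if (bh)
                brelse (bh);
            bh = NULL;
            /* collision flag set on the next index entry?  keep going */
            if (++frame.at - frame.entries == frame.count)
                break;
            if ((frame.at->hash & ~1) != hash)
                break;
        }
        brelse (frame.bh);
        return bh;
    }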

Insert Algorithm
----------------

Insertion of new entries into the directory is considerably more
complex than lookup, due to the need to split leaf blocks when they
become full, and to satisfy the conditions that allow hash key
collisions to be handled reliably and efficiently.  I'll just summarize
here:

  - Probe the index as for lookup
  - If the target leaf block is full, split it and note the block
    that will receive the new entry
  - Insert the new entry in the leaf block using the normal Ext2
    directory entry insertion code.

The details of splitting and hash collision handling are somewhat
messy, but I will be happy to dwell on them at length if anyone is
interested.
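
As a shape-of-the-code sketch (the real logic is inlined in ext2_add_entry
in the patch; leaf_is_full, split_leaf and add_dirent_to_block here are
hypothetical stand-ins, not functions that exist anywhere):

    int htree_insert_sketch (struct inode *dir, const char *name,
        int namelen, struct inode *inode)
    {
        u32 hash = dx_hash (name, namelen);
        struct dx_frame frame;
        struct buffer_head *bh;
        int err;

        if (dx_probe (dir, hash, &frame))       /* 1: probe as for lookup */
            return -EIO;
        bh = ext2_bread (dir, frame.at->block, 0, &err);
        if (!bh)
            return err;
        if (leaf_is_full (bh, namelen))         /* 2: split if no room */
            bh = split_leaf (dir, &frame, bh, hash);
        brelse (frame.bh);
        return add_dirent_to_block (dir, bh, name, namelen, inode); /* 3 */
    }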

Splitting
---------

In brief, when a leaf node fills up and we want to put a new entry into
it, the leaf has to be split and its share of the hash space has to
be partitioned.  The most straightforward way to do this is to sort the
entries by hash value and split somewhere in the middle of the sorted
list.  This operation is O(n log n) in the number of entries in the leaf
and is not a great cost so long as an efficient sorter is used.  I used
Combsort for this, although Quicksort would have been just as good in this
case since average case performance is more important than worst case. 

An alternative approach would be just to guess a median value for the
hash key, and the partition could be done in linear time, but the
resulting poorer partitioning of hash key space outweighs the small
advantage of the linear partition algorithm.  In any event, the number
of entries needing sorting is bounded by the number that fit in a leaf.
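
In code, the map-and-sort step is essentially what ext2_add_entry in the
patch below does; here is the condensed fragment (bh holds the full leaf,
and the COMBSORT macro declares its own loop variables i and j):

    struct dx_map_entry map[MAX_DX_MAP];
    int count, split;
    u32 hash2;

    /* One (hash, offset) pair per directory entry in the full leaf */
    count = dx_make_map ((struct ext2_dir_entry_2 *) bh->b_data, blocksize, map);
    COMBSORT(count, i, j, map[i].hash < map[j].hash, exchange(map[i], map[j]));
    split = count / 2;          /* entries [split, count) move to the new block */
    hash2 = map[split].hash;    /* lowest hash that lands in the new block */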

Key Collisions
--------------

Some complexity is introduced by the need to handle sequences of hash
key collisions.  It is desirable to avoid splitting such sequences
between blocks, so the split point of a block is adjusted with this in
mind.  But if the block fills up with identically-hashed entries, the
sequence may still have to be split.  This situation is flagged by
placing a 1 in the low bit of the index entry that points at the
successor block, which is naturally interpreted by the index probe as an
intermediate value without any special coding.  Thus, handling the
collision problem imposes no real processing overhead, just some extra
code and a slight reduction in the hash key space.  The hash key space
remains sufficient for any conceivable number of directory entries, up
into the billions.
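
Condensed and slightly simplified from ext2_add_entry in the patch, the
split-point adjustment and the flagging of the successor block look like
this (at is the index entry that will point at the new block):

    /* Don't split a run of identical hashes if we can avoid it */
    if (hash <= map[split].hash)
        while (split && map[split].hash == map[split-1].hash)
            split--;
    else
        while (split < count && map[split].hash == map[split-1].hash)
            split++;
    hash2 = map[split].hash;
    continued = (hash == hash2);    /* a run still straddles the two blocks */
    /* Real hash values have the low bit clear, so hash + 1 is the same
       hash with the collision flag set on the successor's index entry. */
    at->hash = continued ? hash + 1 : hash2;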

Hash Function
-------------

The exact properties of the hash function critically affect the
performance of this indexing strategy, as I learned by trying a number
of poor hash functions, at times intentionally.  A poor hash function
will result in many collisions or poor partitioning of the hash space. 
To illustrate why the latter is a problem, consider what happens when a
block is split such that it covers just a few distinct hash values. 
The probability of later index entries hashing into that same, small
hash space is very small, so once such a block is split it tends to
stay half full forever, an effect I observed in practice.

After some experimentation I came up with a hash function that gives
reasonably good dispersal of hash keys across the entire 31 bit key
space.  This improved the average fullness of leaf blocks considerably,
getting much closer to the theoretical average of 3/4 full.

But the current hash function is just a place holder, waiting for
a better version based on some solid theory.  I currently favor the
idea of using crc32 as the default hash function, but I welcome
suggestions.

Inevitably, no matter how good a hash function I come up with, somebody
will come up with a better one later.  For this reason the design
allows for additional hash functions to be added, with backward
compatibility.  This is accomplished simply, by including a hash
function number in the index root.  If a new, improved hash function is
added, all the previous versions remain available, and previously
created indexes remain readable.
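
A hedged sketch of what that dispatch could look like; nothing like this
exists in the patch yet (dx_hash is the only implemented function, and
dx_hash_crc32 here is imaginary):

    unsigned dx_hash_versioned (int version, const char *name, int namelen)
    {
        switch (version) {      /* hash version byte from the index root */
        case 0:
            return dx_hash (name, namelen);         /* original function */
        case 1:
            return dx_hash_crc32 (name, namelen);   /* hypothetical upgrade */
        default:
            return dx_hash (name, namelen);         /* unknown: real code should refuse */
        }
    }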

Of course, the best strategy is to have a good hash function right from
the beginning.  The initial, quick hack has produced results that
certainly have not been disappointing.

Performance
-----------

OK, if you have read this far then this is no doubt the part you've
been waiting for.  In short, the performance improvement over normal
Ext2 has been stunning.  With very small directories performance is
similar to standard Ext2, but as directory size increases standard
Ext2 quickly blows up quadratically, while htree-enhanced Ext2
continues to scale linearly.

Uli Luckas ran benchmarks for file creation in various sizes of
directories ranging from 10,000 to 90,000 files.  The results are
pleasing: total file creation time stays very close to linear, versus
quadratic increase with normal Ext2.

Time to create:

		Indexed		Normal
		=======		======
10000 Files:	0m1.350s	0m23.670s
20000 Files:	0m2.720s	1m20.470s
30000 Files:	0m4.330s	3m9.320s
40000 Files:	0m5.890s	5m48.750s
50000 Files:	0m7.040s	9m31.270s
60000 Files:	0m8.610s	13m52.250s
70000 Files:	0m9.980s	19m24.070s
80000 Files:	0m12.060s	25m36.730s
90000 Files:	0m13.400s	33m18.550s

A graph is posted at:

   http://www.innominate.org/~phillips/htree/performance.png

All of these tests are CPU-bound, which may come as a surprise.  The
directories fit easily in cache, and the limiting factors for standard
Ext2 are the lookup of directory blocks in the buffer cache and the
low-level scan of directory entries.  In the case of htree
indexing there are a number of costs to be considered, all of them
pretty well bounded.  Notwithstanding, there are a few obvious
optimizations to be done:

  - Use binary search instead of linear search in the interior index
    nodes.

  - If there is only one leaf block in a directory, bypass the index
    probe, go straight to the block.

  - Map the directory into the page cache instead of the buffer cache.

Each of these optimizations will produce a noticeable improvement in
performance, but naturally it will never be anything like the big jump
going from O(N**2) to O(N*log512(N)) ~= O(N).  In time the optimizations
will be applied and we can expect to see another doubling or so in
performance.
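
For instance, the binary search in the interior nodes amounts to a few
lines replacing the linear scan in dx_probe; a sketch, following the same
convention as the patch (entry 0's hash field holds the count, so the
search starts at 1):

    static struct dx_entry *dx_find_entry_sketch (struct dx_entry *entries,
        unsigned count, u32 hash)
    {
        unsigned lo = 1, hi = count;
        while (lo < hi) {               /* answer index stays in [lo-1, hi) */
            unsigned mid = (lo + hi) / 2;
            if (entries[mid].hash > hash)
                hi = mid;
            else
                lo = mid + 1;
        }
        return entries + lo - 1;        /* last entry with hash <= target */
    }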

There will be a very slight performance hit when the directory gets big
enough to need a second level.  Because of caching this will be very
small.  Traversing the directory's metadata index blocks will be a
bigger cost, and once again, this cost can be reduced by moving the
directory blocks into the page cache.

Typically, we will traverse 3 blocks to read or write a directory
entry, and that number increases to 4-5 with really huge directories. 
But this is really nothing compared to normal Ext2, which traverses
several hundred blocks in the same situation.

Current Implementation
----------------------

The current implementation has only a single level of the htree (the
root) and is sufficient to handle a little more than 90,000 files. 
This is good enough for benchmarking.  There has not been a lot of
stability testing yet and indeed there are a number of unhandled error
conditions in the code, and possibly some buffer leaks as well.

This patch is for kernel 2.4.1, but it should be entirely applicable
to the 2.2 series as well.  There it should find a friend: Stephen
Tweedie's Ext3 journalling extension.

To-do List
----------

There is still a fair amount of work remaining before this patch is
ready for regular use.  Here is the to-do list as of today:

  - finalize the file format
  - endianness
  - improve the hash function
  - INCOMPAT flag handling
  - second tree level
  - bullet proofing
  - testing under load

Additionally, some (small) changes will be required in ext2utils.  The
ETA for completion of the items on the to-do list is... pretty soon.

Credits
-------

Thanks to Ted Ts'o for providing the inspiration and essential design
elements.  Many thanks to Uli Luckas for spending a large number of
hours drinking beer^H^H^H^H^H^H^H^H^H^H walking through the code with
me, suggesting a number of design improvements and understanding and
fixing at least one part of the code which remains, quite frankly,
beyond me. :-)

Applying and Running the patch
------------------------------

The patch adds a symbol to ext2_fs.h, CONFIG_EXT2_INDEX, which
controls whether the htree index feature is enabled or not - it
defaults to on.

  - Use a test machine, not your workstation :-)
  - cd to the 2.4.1 source root
  - patch -p0 <this.email
  - build and install - should have no effect on normal operation
  - mount /dev/hdxxx /test -t ext2 -o index

All new directories in the mounted partition will be created indexed.

Here is the patch:

--- ../2.4.1.uml.clean/fs/ext2/dir.c	Sat Dec  9 02:35:54 2000
+++ ./fs/ext2/dir.c	Tue Feb 20 04:21:25 2001
@@ -67,22 +67,24 @@
 {
 	int error = 0;
 	unsigned long offset, blk;
-	int i, num, stored;
-	struct buffer_head * bh, * tmp, * bha[16];
-	struct ext2_dir_entry_2 * de;
-	struct super_block * sb;
-	int err;
+	int i, num, stored = 0, err;
+	struct buffer_head *bh = NULL, *tmp, *bha[16];
+	struct ext2_dir_entry_2 *de;
 	struct inode *inode = filp->f_dentry->d_inode;
+	struct super_block *sb = inode->i_sb;
+	unsigned blockshift = EXT2_BLOCK_SIZE_BITS(sb);
+#ifdef CONFIG_EXT2_INDEX
+	int dir_base = is_dx(inode)? dx_dir_base(sb): 0;
+#else
+	int dir_base = 0;
+#endif
 
-	sb = inode->i_sb;
-
-	stored = 0;
-	bh = NULL;
 	offset = filp->f_pos & (sb->s_blocksize - 1);
 
-	while (!error && !stored && filp->f_pos < inode->i_size) {
-		blk = (filp->f_pos) >> EXT2_BLOCK_SIZE_BITS(sb);
-		bh = ext2_bread (inode, blk, 0, &err);
+	while (!error && !stored && filp->f_pos < inode->i_size - (dir_base << blockshift))
+	{
+		blk = (filp->f_pos) >> blockshift;
+		bh = ext2_bread (inode, dir_base + blk, 0, &err);
 		if (!bh) {
 			ext2_error (sb, "ext2_readdir",
 				    "directory #%lu contains a hole at offset %lu",
@@ -95,9 +97,9 @@
 		 * Do the readahead
 		 */
 		if (!offset) {
-			for (i = 16 >> (EXT2_BLOCK_SIZE_BITS(sb) - 9), num = 0;
-			     i > 0; i--) {
-				tmp = ext2_getblk (inode, ++blk, 0, &err);
+			for (i = 16 >> (blockshift - 9), num = 0; i > 0; i--)
+			{
+				tmp = ext2_getblk (inode, dir_base + ++blk, 0, &err);
 				if (tmp && !buffer_uptodate(tmp) && !buffer_locked(tmp))
 					bha[num++] = tmp;
 				else
@@ -140,8 +142,7 @@
 			de = (struct ext2_dir_entry_2 *) (bh->b_data + offset);
 			if (!ext2_check_dir_entry ("ext2_readdir", inode, de,
 						   bh, offset)) {
-				/* On error, skip the f_pos to the
-                                   next block. */
+				/* On error, skip the f_pos to the next block. */
 				filp->f_pos = (filp->f_pos | (sb->s_blocksize - 1))
 					      + 1;
 				brelse (bh);
--- ../2.4.1.uml.clean/fs/ext2/namei.c	Sat Dec  9 02:35:54 2000
+++ ./fs/ext2/namei.c	Tue Feb 20 16:00:53 2001
@@ -18,13 +18,13 @@
  *  	for B-tree directories by Theodore Ts'o (tytso@mit.edu), 1998
  */
 
+#define CONFIG_EXT2_INDEX
+
 #include <linux/fs.h>
 #include <linux/ext2_fs.h>
 #include <linux/locks.h>
 #include <linux/quotaops.h>
 
-
-
 /*
  * define how far ahead to read directories while searching them.
  */
@@ -33,6 +33,250 @@
 #define NAMEI_RA_SIZE        (NAMEI_RA_CHUNKS * NAMEI_RA_BLOCKS)
 #define NAMEI_RA_INDEX(c,b)  (((c) * NAMEI_RA_BLOCKS) + (b))
 
+#ifdef CONFIG_EXT2_INDEX
+#define dxtrace(command)
+#define dxtrace_on(command) command
+#define dxtrace_off(command)
+
+/*
+ * Order n log(n) sort utility with n log(n) worst case
+ */
+
+#ifndef COMBSORT
+#define COMBSORT(size, i, j, COMPARE, EXCHANGE) { \
+	unsigned gap = size, more, i; \
+	do { \
+		if (gap > 1) gap = gap*10/13; \
+		if (gap - 9 < 2) gap = 11; \
+		for (i = size - 1, more = gap > 1; i >= gap; i--) { \
+			int j = i - gap; \
+			if (COMPARE) { EXCHANGE; more = 1; } } \
+	} while (more); }
+#endif
+
+#ifndef exchange
+#define exchange(x, y) do { typeof(x) z = x; x = y; y = z; } while (0)
+#endif
+
+/*
+ * Structure of the directory root block
+ */
+
+struct dx_root
+{
+	struct dx_root_info
+	{
+		u32 total_entries;
+		u32 reserved_zero;
+	}
+	info;
+	struct dx_entry
+	{
+		u32 hash;
+		u32 block;
+	} 
+	entries[0];
+};
+
+/*
+ * Bookkeeping for index traversal
+ */
+
+struct dx_frame
+{
+	struct buffer_head *bh;
+	struct dx_entry *entries;
+	struct dx_entry *at;
+	struct dx_root_info *info;
+	unsigned count;
+	unsigned limit;
+};
+
+/*
+ * Sort map for splitting leaf
+ */
+
+struct dx_map_entry
+{
+	u32 hash;
+	u32 offs;
+};
+
+#define MAX_DX_MAP (PAGE_SIZE/EXT2_DIR_REC_LEN(1) + 1)
+/* Assumes file blocksize <= PAGE_SIZE */
+
+#if 1
+unsigned dx_hash (const char *name, int namelen)
+{
+	u32 hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
+	if (!namelen) return 0;
+	while (namelen--)
+	{
+		u32 hash = hash1 + (hash0 ^ (*name++ * 71523));
+		if (hash < 0) hash -= 0x7fffffff;
+		hash1 = hash0;
+		hash0 = hash;
+	}
+	return ((hash0 & -1) << 1);
+}
+#else
+/*
+ * A simple hash // need hash function upgrade support
+ */
+
+int dx_hash (const char *name, int namelen)
+{
+	u32 hash = 0;
+	if (!namelen) BUG();
+	while (namelen--) hash = *(name++) + (hash << 6);
+	return hash << 1;
+}
+#endif
+
+/*
+ * Probe to find a directory leaf block to search
+ */
+
+int dx_probe (struct inode *dir, u32 hash, struct dx_frame *dxframe)
+{
+	int count, search, err;
+	struct buffer_head *bh;
+	struct dx_entry *at, *at0;
+
+	dxtrace(printk("Look up %u.", hash));
+	if (!(bh = ext2_bread (dir, 0, 0, &err)))
+	{
+		dxframe->bh = NULL;
+		return -1;
+	}
+
+	/* First hash field holds count of entries */
+	at = at0 = ((struct dx_root *) (bh->b_data))->entries;
+	if (!(count = *(u32 *) at)) BUG();
+	search = count - 1; // should use binary search
+
+	while (search--)
+	{
+		dxtrace(printk("."));
+		if ((++at)->hash > hash)
+		{
+			at--;
+			break;
+		}
+	}
+	dxtrace(printk(" in %u:%u\n", at - at0, at->block));
+	dxframe->info = (struct dx_root_info *) bh->b_data;
+	dxframe->bh = bh;
+	dxframe->entries = at0;
+	dxframe->at = at;
+	dxframe->count = count;
+	dxframe->limit = (bh->b_size - sizeof(struct dx_root_info)) / sizeof(struct dx_entry);
+	return 0;
+}
+
+/*
+ * Prior to split, finds record offset, computes hash of each dir block record
+ */
+
+static int dx_make_map (struct ext2_dir_entry_2 *de, int size, struct dx_map_entry map[])
+{
+	int count = 0;
+	char *base = (char *) de;
+	while ((char *) de < base + size)
+	{
+		map[count].hash = dx_hash (de->name, de->name_len);
+		map[count].offs = (u32) ((char *) de - base);
+		de = (struct ext2_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
+		count++;
+	}
+	return count;
+}
+
+/*
+ * For dir block splitting and compacting
+ */
+
+struct ext2_dir_entry_2 *dx_copy (
+	char *from, char *to, unsigned size, // should pass from, to as de's (uli)
+	struct dx_map_entry map[], int start, int count)
+{
+	struct ext2_dir_entry_2 *de = NULL;
+	char *top = to + size;
+	unsigned rec_len = 0;
+	if (!count) BUG();
+	while (count--)
+	{
+		de = (struct ext2_dir_entry_2 *) (from + map[start++].offs);
+		rec_len = EXT2_DIR_REC_LEN(de->name_len);
+		if (to + rec_len > top) BUG();
+		memcpy (to, de, rec_len);
+		((struct ext2_dir_entry_2 *) to)->rec_len = rec_len;
+		to += rec_len;
+	}
+	return (struct ext2_dir_entry_2 *) (to - rec_len);
+}
+
+void dx_adjust (struct ext2_dir_entry_2 *de, char *limit)
+{
+	de->rec_len = limit - (char *) de; // need to clear top?
+}
+
+/*
+ * Debug
+ */
+
+void dx_show_index (struct dx_frame *dxframe)
+{
+	struct dx_entry *entries = dxframe->entries;
+	int i = 0;
+	printk("Index: ");
+	for (;i < *(u32 *) entries; i++)
+	{
+		printk("%u@%u ", entries[i].hash, entries[i].block);
+	}
+	printk("\n");
+}
+
+void dx_show_leaf (struct ext2_dir_entry_2 *de, int size)
+{
+	int count = 0;
+	char *base = (char *) de;
+	printk("dirblock: ");
+	while ((char *) de < base + size)
+	{
+		{ int n = de->name_len; char *s = de->name; while (n--) printk("%c", *s++); }
+		printk(":%u.%u ", dx_hash (de->name, de->name_len), (u32) ((char *) de - base));
+		de = (struct ext2_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
+		count++;
+	}
+	printk("(%i)\n", count);
+}
+
+void dx_show_buckets (struct inode *dir)
+{
+	struct super_block *sb = dir->i_sb;
+	int blockshift = EXT2_BLOCK_SIZE_BITS (sb), blocksize = 1 << blockshift;
+	int count, i, err;
+	struct dx_entry *at;
+	struct buffer_head *bh, *dbh;
+	if (!(dbh = ext2_bread (dir, 0, 0, &err))) return;
+	at = ((struct dx_root *) (dbh->b_data))->entries;
+	count = *(u32 *) at;
+	printk("%i indexed blocks...\n", count);
+	for (i = 0; i < count; i++, at++)
+	{
+		u32 hash = i? at->hash: 0;
+		u32 range = i == count - 1? ~at->hash: ((at + 1)->hash - hash);
+		printk("%i:%u hash %u/%u", i, at->block, hash, range);
+		if (!(bh = ext2_bread (dir, at->block, 0, &err))) continue;
+		dx_show_leaf ((struct ext2_dir_entry_2 *) bh->b_data, blocksize);
+		brelse (bh);
+	}
+	brelse(dbh);
+	printk("\n");
+}
+#endif
+
 /*
  * NOTE! unlike strncmp, ext2_match returns 1 for success, 0 for failure.
  *
@@ -49,36 +293,94 @@
 	return !memcmp(name, de->name, len);
 }
 
+struct ext2_dir_entry_2 *ext2_find_de (struct buffer_head *bh,
+	const char *const name, int namelen,
+	int *err, struct inode *dir, u32 offset)
+	/* dir, offset used only in error report */
+{
+	struct ext2_dir_entry_2 *de = (struct ext2_dir_entry_2 *) bh->b_data;
+	char *top = (char *) de + bh->b_size;
+	while ((char *) de < top) {
+		/* this code may be executed quadratically often */
+		/* do minimal checking `by hand' */
+		int de_len;
+		if ((char *) de + namelen <= top && ext2_match (namelen, name, de)) // is the compare to top really needed??
+		{
+			/* found a match - just to be sure, do a full check */
+			if (!ext2_check_dir_entry("ext2_find_entry", dir, de, bh, offset))
+				goto error;
+			*err = 0;
+			return de;
+		}
+		de_len = le16_to_cpu(de->rec_len);
+		/* prevent looping on a bad block */
+		if (de_len <= 0)
+			goto error;
+		de = (struct ext2_dir_entry_2 *) ((char *) de + de_len);
+		offset += de_len;
+	}
+	*err = 0;
+	return NULL;
+error:
+	*err = 1;
+	return NULL;
+}
+
 /*
- *	ext2_find_entry()
- *
- * finds an entry in the specified directory with the wanted name. It
- * returns the cache buffer in which the entry was found, and the entry
- * itself (as a parameter - res_dir). It does NOT read the inode of the
- * entry - you'll have to do that yourself if you want to.
- */
-static struct buffer_head * ext2_find_entry (struct inode * dir,
-					     const char * const name, int namelen,
-					     struct ext2_dir_entry_2 ** res_dir)
-{
-	struct super_block * sb;
-	struct buffer_head * bh_use[NAMEI_RA_SIZE];
-	struct buffer_head * bh_read[NAMEI_RA_SIZE];
+ * Find an entry in the specified directory with the wanted name.  Return 
+ * the buffer the entry was found in, and set the entry through a pointer.
+ */
+static struct buffer_head *ext2_find_entry (
+	struct inode *dir, 
+	const char *name, int namelen,
+	struct ext2_dir_entry_2 **res_dir)
+{
+	struct super_block *sb = dir->i_sb;
+	struct buffer_head *bh_use[NAMEI_RA_SIZE];
+	struct buffer_head *bh_read[NAMEI_RA_SIZE];
 	unsigned long offset;
 	int block, toread, i, err;
+	int blockshift = EXT2_BLOCK_SIZE_BITS (sb);
 
 	*res_dir = NULL;
-	sb = dir->i_sb;
+	if (namelen > EXT2_NAME_LEN) return NULL;
+#ifdef CONFIG_EXT2_INDEX
+	if (is_dx(dir))
+	{
+		u32 hash = dx_hash (name, namelen);
+		struct ext2_dir_entry_2 *de;
+		struct dx_frame dxframe;
+		struct buffer_head *bh;
+		int err = dx_probe (dir, hash, &dxframe); // don't ignore the error!!
+
+		while (1)
+		{
+			bh = ext2_bread (dir, dxframe.at->block, 0, &err); // don't ignore the error!!
+			de = ext2_find_de (bh, name, namelen, &err, dir, 666); // don't ignore the error!!
+			if (de)
+			{
+				dxtrace(printk("Found %s in %i:%i\n", name, 
+					dxframe.at - dxframe.entries, dxframe.at->block));
+				brelse(dxframe.bh);
+				*res_dir = de;
+				return bh;
+			}
 
-	if (namelen > EXT2_NAME_LEN)
+			brelse(bh);
+			/* Same hash continues in next block?  Search further. */
+			if (++(dxframe.at) - dxframe.entries == dxframe.count) break;
+			if ((dxframe.at->hash & -2) != hash) break;
+			dxtrace(printk("Try next, block %i\n", dxframe.at->block));
+		}
+		brelse(dxframe.bh);
 		return NULL;
-
+	}
+#endif
 	memset (bh_use, 0, sizeof (bh_use));
 	toread = 0;
 	for (block = 0; block < NAMEI_RA_SIZE; ++block) {
 		struct buffer_head * bh;
-
-		if ((block << EXT2_BLOCK_SIZE_BITS (sb)) >= dir->i_size)
+			if ((block << blockshift) >= dir->i_size)
 			break;
 		bh = ext2_getblk (dir, block, 0, &err);
 		bh_use[block] = bh;
@@ -86,75 +388,54 @@
 			bh_read[toread++] = bh;
 	}
 
-	for (block = 0, offset = 0; offset < dir->i_size; block++) {
+	for (block = 0, offset = 0; offset < dir->i_size; offset += sb->s_blocksize, block++)
+	{
 		struct buffer_head * bh;
-		struct ext2_dir_entry_2 * de;
-		char * dlimit;
-
-		if ((block % NAMEI_RA_BLOCKS) == 0 && toread) {
+		struct ext2_dir_entry_2 *de;
+		if ((block % NAMEI_RA_BLOCKS) == 0 && toread)
+		{
 			ll_rw_block (READ, toread, bh_read);
 			toread = 0;
 		}
 		bh = bh_use[block % NAMEI_RA_SIZE];
-		if (!bh) {
+		if (!bh)
+		{
 #if 0
 			ext2_error (sb, "ext2_find_entry",
 				    "directory #%lu contains a hole at offset %lu",
 				    dir->i_ino, offset);
 #endif
-			offset += sb->s_blocksize;
 			continue;
 		}
+
 		wait_on_buffer (bh);
-		if (!buffer_uptodate(bh)) {
-			/*
-			 * read error: all bets are off
-			 */
+
+		/* handle read error */
+		if (!buffer_uptodate(bh))
 			break;
-		}
 
-		de = (struct ext2_dir_entry_2 *) bh->b_data;
-		dlimit = bh->b_data + sb->s_blocksize;
-		while ((char *) de < dlimit) {
-			/* this code is executed quadratically often */
-			/* do minimal checking `by hand' */
-			int de_len;
-
-			if ((char *) de + namelen <= dlimit &&
-			    ext2_match (namelen, name, de)) {
-				/* found a match -
-				   just to be sure, do a full check */
-				if (!ext2_check_dir_entry("ext2_find_entry",
-							  dir, de, bh, offset))
-					goto failure;
-				for (i = 0; i < NAMEI_RA_SIZE; ++i) {
-					if (bh_use[i] != bh)
-						brelse (bh_use[i]);
-				}
-				*res_dir = de;
-				return bh;
-			}
-			/* prevent looping on a bad block */
-			de_len = le16_to_cpu(de->rec_len);
-			if (de_len <= 0)
-				goto failure;
-			offset += de_len;
-			de = (struct ext2_dir_entry_2 *)
-				((char *) de + de_len);
+		de = ext2_find_de (bh, name, namelen, &err, dir, offset);
+		if (de)
+		{
+			for (i = 0; i < NAMEI_RA_SIZE; ++i)
+				if (bh_use[i] != bh)
+					brelse (bh_use[i]);
+			*res_dir = de;
+			return bh;
 		}
-
+		if (err)
+			goto fail;
 		brelse (bh);
-		if (((block + NAMEI_RA_SIZE) << EXT2_BLOCK_SIZE_BITS (sb)) >=
-		    dir->i_size)
-			bh = NULL;
-		else
+		if (((block + NAMEI_RA_SIZE) << blockshift) < dir->i_size)
 			bh = ext2_getblk (dir, block + NAMEI_RA_SIZE, 0, &err);
+		else
+			bh = NULL;
+
 		bh_use[block % NAMEI_RA_SIZE] = bh;
 		if (bh && !buffer_uptodate(bh))
 			bh_read[toread++] = bh;
 	}
-
-failure:
+fail:
 	for (i = 0; i < NAMEI_RA_SIZE; ++i)
 		brelse (bh_use[i]);
 	return NULL;
@@ -171,7 +452,8 @@
 
 	bh = ext2_find_entry (dir, dentry->d_name.name, dentry->d_name.len, &de);
 	inode = NULL;
-	if (bh) {
+	if (bh)
+	{
 		unsigned long ino = le32_to_cpu(de->inode);
 		brelse (bh);
 		inode = iget(dir->i_sb, ino);
@@ -202,37 +484,151 @@
 }
 
 /*
- *	ext2_add_entry()
- *
  * adds a file entry to the specified directory.
  */
+
 int ext2_add_entry (struct inode * dir, const char * name, int namelen,
 		    struct inode *inode)
 {
 	unsigned long offset;
-	unsigned short rec_len;
+	unsigned short rec_len = EXT2_DIR_REC_LEN(namelen);
 	struct buffer_head * bh;
-	struct ext2_dir_entry_2 * de, * de1;
-	struct super_block * sb;
-	int	retval;
-
-	sb = dir->i_sb;
+	struct ext2_dir_entry_2 * de, * de2;
+	struct super_block * sb = dir->i_sb;
+	unsigned blockshift = EXT2_BLOCK_SIZE_BITS(sb);
+	unsigned blocksize = 1 << blockshift;
+	int err;
+#ifdef CONFIG_EXT2_INDEX
+	struct dx_frame dxframe;
+	u32 hash;
+#endif
 
-	if (!namelen)
-		return -EINVAL;
-	bh = ext2_bread (dir, 0, 0, &retval);
-	if (!bh)
-		return retval;
-	rec_len = EXT2_DIR_REC_LEN(namelen);
+	if (!namelen) return -EINVAL;
+#ifdef CONFIG_EXT2_INDEX
+	if (is_dx(dir))
+	{
+		hash = dx_hash (name, namelen);
+		dx_probe (dir, hash, &dxframe); // don't ignore the error!!
+		if (!dxframe.bh) return EINVAL;
+		if (!(bh = ext2_bread (dir, dxframe.at->block, 0, &err))) return err;
+	}
+	else
+#endif
+	{
+		if (!(bh = ext2_bread (dir, 0, 0, &err))) return err;
+	}
 	offset = 0;
 	de = (struct ext2_dir_entry_2 *) bh->b_data;
-	while (1) {
-		if ((char *)de >= sb->s_blocksize + bh->b_data) {
+	while (1) 
+	{
+		if ((char *) de >= bh->b_data + blocksize)
+		{
+#ifdef CONFIG_EXT2_INDEX
+		if (is_dx(dir))
+		{
+			u32 block2 = dir->i_size >> blockshift;
+			struct dx_entry *entries = dxframe.entries, *at = dxframe.at;
+			struct buffer_head *bh2;
+			int count, split;
+			int continued; /* true if identical hashes split between two blocks */
+			u32 hash2;
+			dxtrace_off(printk("entry count %i, limit %i\n", dxframe.count, dxframe.limit));
+
+			if (dxframe.count == dxframe.limit)
+			{
+				brelse(bh);
+				brelse (dxframe.bh);
+				return -ENOENT;
+			}
+
+			if (!(bh2 = ext2_getblk (dir, block2, 1, &err)))
+			{
+				brelse(bh);
+				brelse (dxframe.bh);
+				return err;
+			}
+
+			{
+				char *b1 = bh->b_data, *b2, *b3;
+				struct dx_map_entry map[MAX_DX_MAP];
+				count = dx_make_map ((struct ext2_dir_entry_2 *) b1, blocksize, map);
+				split = count/2; // need to adjust to actual middle
+				COMBSORT(count, i, j, map[i].hash < map[j].hash, exchange(map[i], map[j]));
+
+				/* Don't split between duplicate hashes */
+				if (hash <= map[split].hash)
+					while (split && map[split].hash == map[split-1].hash)
+						split--;
+				else
+					while (split < count && map[split].hash == map[split-1].hash)
+						split++;
+				hash2 = map[split].hash;
+				continued = hash == hash2; // this happens to be valid for now
+				dxtrace(printk("Split block %i at %u, %i/%i\n", dxframe.at->block, hash2, split, count-split));
+
+				b2 = bh2->b_data;
+				dir->i_size = dir->i_size += blocksize;
+
+				if (!split || split == count)
+				{
+					// just create an empty dirblock for now
+					de2 = (struct ext2_dir_entry_2 *) b2;
+					de2->inode = 0;
+					de2->rec_len = le16_to_cpu(blocksize);
+				} else {
+					/* Fancy dance to stay within two buffers */
+					de2 = dx_copy (b1, b2, blocksize, map, split, count - split);
+					b3 = (char *) de2 + de2->rec_len;
+					de = dx_copy (b1, b3, blocksize - (b3 - b2), map, 0, split);
+					memcpy(b1, b3, (char *) de + de->rec_len - b3);
+					de = (struct ext2_dir_entry_2 *) ((char *) de - b3 + b1);
+					dx_adjust (de, b1 + blocksize);
+					dx_adjust (de2, b2 + blocksize);
+				}
+
+				dxtrace(dx_show_leaf ((struct ext2_dir_entry_2 *) b1, blocksize));
+				dxtrace(dx_show_leaf ((struct ext2_dir_entry_2 *) b2, blocksize));
+
+				/* Which block gets the new entry? */
+				dxtrace(printk("Insert %s/%u ", name, hash));
+				if (hash >= hash2 || !split || split == count)
+				{
+					dxtrace(printk("above"));
+					exchange(bh, bh2);
+					de = de2;
+				}
+				dxtrace(printk("\n"));
+			}
+
+			memmove (at + 1, at, (char *) (entries + dxframe.count) - (char *) (at));
+			if (continued && (!split || split == count))
+			{
+				/* assuming we put new identical hash into lower entry's block */
+				(at+1)->hash = hash + 1;
+				if (at != dxframe.entries) at->hash = hash;
+				at->block = block2;
+			} else {
+				at++;
+				at->block = block2;
+				at->hash = hash2;
+			}
+			dxframe.count = entries[0].hash++; /* first hash field is entry count */
+
+			/* Clean up and continue with scan for available space */
+			/* New dirent will be added at de in bh */
+			if (!continued) mark_buffer_dirty (bh2);
+			mark_buffer_dirty (dxframe.bh);
+			brelse (dxframe.bh);
+			brelse (bh2);
+			dxframe.bh = NULL; // oops if come here again
+			dxtrace(dx_show_index (&dxframe));
+		} else {
+#endif
 			brelse (bh);
 			bh = NULL;
-			bh = ext2_bread (dir, offset >> EXT2_BLOCK_SIZE_BITS(sb), 1, &retval);
+			bh = ext2_bread (dir, offset >> EXT2_BLOCK_SIZE_BITS(sb), 1, &err);
 			if (!bh)
-				return retval;
+				return err;
 			if (dir->i_size <= offset) {
 				if (dir->i_size == 0) {
 					return -ENOENT;
@@ -244,7 +640,6 @@
 				de->inode = 0;
 				de->rec_len = le16_to_cpu(sb->s_blocksize);
 				dir->i_size = offset + sb->s_blocksize;
-				dir->u.ext2_i.i_flags &= ~EXT2_BTREE_FL;
 				mark_inode_dirty(dir);
 			} else {
 
@@ -252,6 +647,9 @@
 
 				de = (struct ext2_dir_entry_2 *) bh->b_data;
 			}
+#ifdef CONFIG_EXT2_INDEX
+		}
+#endif
 		}
 		if (!ext2_check_dir_entry ("ext2_add_entry", dir, de, bh,
 					   offset)) {
@@ -266,12 +664,12 @@
 		    (le16_to_cpu(de->rec_len) >= EXT2_DIR_REC_LEN(de->name_len) + rec_len)) {
 			offset += le16_to_cpu(de->rec_len);
 			if (le32_to_cpu(de->inode)) {
-				de1 = (struct ext2_dir_entry_2 *) ((char *) de +
+				de2 = (struct ext2_dir_entry_2 *) ((char *) de +
 					EXT2_DIR_REC_LEN(de->name_len));
-				de1->rec_len = cpu_to_le16(le16_to_cpu(de->rec_len) -
+				de2->rec_len = cpu_to_le16(le16_to_cpu(de->rec_len) -
 					EXT2_DIR_REC_LEN(de->name_len));
 				de->rec_len = cpu_to_le16(EXT2_DIR_REC_LEN(de->name_len));
-				de = de1;
+				de = de2;
 			}
 			de->file_type = EXT2_FT_UNKNOWN;
 			if (inode) {
@@ -293,7 +691,6 @@
 			 * and/or different from the directory change time.
 			 */
 			dir->i_mtime = dir->i_ctime = CURRENT_TIME;
-			dir->u.ext2_i.i_flags &= ~EXT2_BTREE_FL;
 			mark_inode_dirty(dir);
 			dir->i_version = ++event;
 			mark_buffer_dirty_inode(bh, dir);
@@ -380,6 +777,7 @@
 		return err;
 	}
 	d_instantiate(dentry, inode);
+//	dx_show_buckets (dir);
 	return 0;
 }
 
@@ -408,12 +806,19 @@
 	return err;
 }
 
-static int ext2_mkdir(struct inode * dir, struct dentry * dentry, int mode)
+static int ext2_mkdir (struct inode *dir, struct dentry *dentry, int mode)
 {
-	struct inode * inode;
-	struct buffer_head * dir_block;
-	struct ext2_dir_entry_2 * de;
+	struct super_block *sb = dir->i_sb;
+	struct inode *inode;
+	struct buffer_head *bh;
+	struct ext2_dir_entry_2 *de;
 	int err;
+#ifdef CONFIG_EXT2_INDEX
+	int make_dx = test_opt (sb, DXTREE);
+	int dir_blk = make_dx? dx_dir_base(sb): 0;
+#else
+	int dir_blk = 0;
+#endif
 
 	if (dir->i_nlink >= EXT2_LINK_MAX)
 		return -EMLINK;
@@ -425,40 +830,61 @@
 
 	inode->i_op = &ext2_dir_inode_operations;
 	inode->i_fop = &ext2_dir_operations;
-	inode->i_size = inode->i_sb->s_blocksize;
+	inode->i_size = sb->s_blocksize;
 	inode->i_blocks = 0;	
-	dir_block = ext2_bread (inode, 0, 1, &err);
-	if (!dir_block) {
+	bh = ext2_bread (inode, dir_blk, 1, &err);
+	if (!bh)
+	{
 		inode->i_nlink--; /* is this nlink == 0? */
 		mark_inode_dirty(inode);
 		iput (inode);
 		return err;
 	}
-	de = (struct ext2_dir_entry_2 *) dir_block->b_data;
+	de = (struct ext2_dir_entry_2 *) bh->b_data;
+#ifdef CONFIG_EXT2_INDEX
 	de->inode = cpu_to_le32(inode->i_ino);
 	de->name_len = 1;
 	de->rec_len = cpu_to_le16(EXT2_DIR_REC_LEN(de->name_len));
 	strcpy (de->name, ".");
-	ext2_set_de_type(dir->i_sb, de, S_IFDIR);
+	ext2_set_de_type(sb, de, S_IFDIR);
 	de = (struct ext2_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
 	de->inode = cpu_to_le32(dir->i_ino);
-	de->rec_len = cpu_to_le16(inode->i_sb->s_blocksize - EXT2_DIR_REC_LEN(1));
+	de->rec_len = cpu_to_le16(sb->s_blocksize - EXT2_DIR_REC_LEN(1));
 	de->name_len = 2;
 	strcpy (de->name, "..");
-	ext2_set_de_type(dir->i_sb, de, S_IFDIR);
+	ext2_set_de_type (sb, de, S_IFDIR);
+#else
+	de->rec_len = cpu_to_le16(sb->s_blocksize);
+#endif
 	inode->i_nlink = 2;
-	mark_buffer_dirty_inode(dir_block, dir);
-	brelse (dir_block);
+	mark_buffer_dirty_inode(bh, dir);
+	brelse (bh);
 	inode->i_mode = S_IFDIR | mode;
 	if (dir->i_mode & S_ISGID)
 		inode->i_mode |= S_ISGID;
 	mark_inode_dirty(inode);
-	err = ext2_add_entry (dir, dentry->d_name.name, dentry->d_name.len, 
-			     inode);
+	err = ext2_add_entry (dir, dentry->d_name.name, dentry->d_name.len, inode);
 	if (err)
 		goto out_no_entry;
 	dir->i_nlink++;
-	dir->u.ext2_i.i_flags &= ~EXT2_BTREE_FL;
+#ifdef CONFIG_EXT2_INDEX
+	if (make_dx)
+	{
+		struct buffer_head *bh = ext2_bread (inode, 0, 1, &err);
+		if (bh)
+		{
+			struct dx_entry *entries = ((struct dx_root *) bh->b_data)->entries;
+			dxtrace_on(printk("Making dx indexed directory\n"));
+			inode->i_size = (dx_dir_base(sb) + 1) << sb->s_blocksize_bits;
+			entries[0].block = dx_dir_base(sb);
+			entries[0].hash = 1; /* first hash field is entry count */
+			mark_buffer_dirty(bh);
+			brelse(bh);
+			inode->u.ext2_i.i_flags |= EXT2_INDEX_FL;
+
+		}
+	}
+#endif
 	mark_inode_dirty(dir);
 	d_instantiate(dentry, inode);
 	return 0;
@@ -473,23 +899,27 @@
 /*
  * routine to check that the specified directory is empty (for rmdir)
  */
-static int empty_dir (struct inode * inode)
+static int ext2_is_empty_dir (struct inode *inode)
 {
 	unsigned long offset;
 	struct buffer_head * bh;
 	struct ext2_dir_entry_2 * de, * de1;
-	struct super_block * sb;
+	struct super_block * sb = inode->i_sb;
 	int err;
-
-	sb = inode->i_sb;
+#ifdef CONFIG_EXT2_INDEX
+	int start = is_dx(inode)? dx_dir_base(sb): 0;
+#else
+	int start = 0;
+#endif
 	if (inode->i_size < EXT2_DIR_REC_LEN(1) + EXT2_DIR_REC_LEN(2) ||
-	    !(bh = ext2_bread (inode, 0, 0, &err))) {
+	    !(bh = ext2_bread (inode, start, 0, &err))) {
 	    	ext2_warning (inode->i_sb, "empty_dir",
 			      "bad directory (dir #%lu) - no data block",
 			      inode->i_ino);
 		return 1;
 	}
 	de = (struct ext2_dir_entry_2 *) bh->b_data;
+#ifdef CONFIG_EXT2_INDEX
 	de1 = (struct ext2_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
 	if (le32_to_cpu(de->inode) != inode->i_ino || !le32_to_cpu(de1->inode) || 
 	    strcmp (".", de->name) || strcmp ("..", de1->name)) {
@@ -501,6 +931,7 @@
 	}
 	offset = le16_to_cpu(de->rec_len) + le16_to_cpu(de1->rec_len);
 	de = (struct ext2_dir_entry_2 *) ((char *) de1 + le16_to_cpu(de1->rec_len));
+#endif
 	while (offset < inode->i_size ) {
 		if (!bh || (void *) de >= (void *) (bh->b_data + sb->s_blocksize)) {
 			brelse (bh);
@@ -552,7 +983,7 @@
 		goto end_rmdir;
 
 	retval = -ENOTEMPTY;
-	if (!empty_dir (inode))
+	if (!ext2_is_empty_dir (inode))
 		goto end_rmdir;
 
 	retval = ext2_delete_entry(dir, de, bh);
@@ -568,7 +999,6 @@
 	mark_inode_dirty(inode);
 	dir->i_nlink--;
 	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
-	dir->u.ext2_i.i_flags &= ~EXT2_BTREE_FL;
 	mark_inode_dirty(dir);
 
 end_rmdir:
@@ -605,7 +1035,6 @@
 	if (retval)
 		goto end_unlink;
 	dir->i_ctime = dir->i_mtime = CURRENT_TIME;
-	dir->u.ext2_i.i_flags &= ~EXT2_BTREE_FL;
 	mark_inode_dirty(dir);
 	inode->i_nlink--;
 	mark_inode_dirty(inode);
@@ -729,7 +1158,7 @@
 	if (S_ISDIR(old_inode->i_mode)) {
 		if (new_inode) {
 			retval = -ENOTEMPTY;
-			if (!empty_dir (new_inode))
+			if (!ext2_is_empty_dir (new_inode))
 				goto end_rename;
 		}
 		retval = -EIO;
@@ -782,7 +1211,6 @@
 		mark_inode_dirty(new_inode);
 	}
 	old_dir->i_ctime = old_dir->i_mtime = CURRENT_TIME;
-	old_dir->u.ext2_i.i_flags &= ~EXT2_BTREE_FL;
 	mark_inode_dirty(old_dir);
 	if (dir_bh) {
 		PARENT_INO(dir_bh->b_data) = le32_to_cpu(new_dir->i_ino);
@@ -794,7 +1222,6 @@
 			mark_inode_dirty(new_inode);
 		} else {
 			new_dir->i_nlink++;
-			new_dir->u.ext2_i.i_flags &= ~EXT2_BTREE_FL;
 			mark_inode_dirty(new_dir);
 		}
 	}
--- ../2.4.1.uml.clean/fs/ext2/super.c	Fri Dec 29 23:36:44 2000
+++ ./fs/ext2/super.c	Tue Feb 20 04:56:43 2001
@@ -188,6 +188,12 @@
 				printk("EXT2 Check option not supported\n");
 #endif
 		}
+		else if (!strcmp (this_char, "index"))
+#ifdef CONFIG_EXT2_INDEX
+			set_opt (*mount_options, DXTREE);
+#else
+			printk("EXT2 Index option not supported\n");
+#endif
 		else if (!strcmp (this_char, "debug"))
 			set_opt (*mount_options, DEBUG);
 		else if (!strcmp (this_char, "errors")) {
--- ../2.4.1.uml.clean/include/linux/ext2_fs.h	Tue Jan 30 08:24:55 2001
+++ ./include/linux/ext2_fs.h	Tue Feb 20 15:52:54 2001
@@ -40,6 +40,12 @@
 #define EXT2FS_VERSION		"0.5b"
 
 /*
+ * Hash Tree Directory indexing
+ * (c) Daniel Phillips, 2001
+ */
+#undef CONFIG_EXT2_INDEX
+
+/*
  * Debug code
  */
 #ifdef EXT2FS_DEBUG
@@ -53,7 +59,7 @@
 #endif
 
 /*
- * Special inodes numbers
+ * Special inode numbers
  */
 #define	EXT2_BAD_INO		 1	/* Bad blocks inode */
 #define EXT2_ROOT_INO		 2	/* Root inode */
@@ -197,7 +203,7 @@
 #define EXT2_NOCOMP_FL			0x00000400 /* Don't compress */
 #define EXT2_ECOMPR_FL			0x00000800 /* Compression error */
 /* End compression flags --- maybe not all used */	
-#define EXT2_BTREE_FL			0x00001000 /* btree format dir */
+#define EXT2_INDEX_FL			0x00001000 /* btree format dir */
 #define EXT2_RESERVED_FL		0x80000000 /* reserved for ext2 lib */
 
 #define EXT2_FL_USER_VISIBLE		0x00001FFF /* User visible flags */
@@ -314,6 +320,7 @@
 #define EXT2_MOUNT_ERRORS_PANIC		0x0040	/* Panic on errors */
 #define EXT2_MOUNT_MINIX_DF		0x0080	/* Mimics the Minix statfs */
 #define EXT2_MOUNT_NO_UID32		0x0200  /* Disable 32-bit UIDs */
+#define EXT2_MOUNT_DXTREE		0x0400  /* Enable dx trees */
 
 #define clear_opt(o, opt)		o &= ~EXT2_MOUNT_##opt
 #define set_opt(o, opt)			o |= EXT2_MOUNT_##opt
@@ -518,6 +525,16 @@
 #define EXT2_DIR_ROUND 			(EXT2_DIR_PAD - 1)
 #define EXT2_DIR_REC_LEN(name_len)	(((name_len) + 8 + EXT2_DIR_ROUND) & \
 					 ~EXT2_DIR_ROUND)
+
+/*
+ * Hash Tree Directory indexing
+ * (c) Daniel Phillips, 2001
+ */
+#ifdef CONFIG_EXT2_INDEX
+#define is_dx(dir) (dir->u.ext2_i.i_flags & EXT2_INDEX_FL)
+#define dx_entries_per_block(sb) (EXT2_BLOCK_SIZE(sb) >> 3)
+#define dx_dir_base(sb) (dx_entries_per_block(sb) - 1 + 1)
+#endif
 
 #ifdef __KERNEL__
 /*


-- 
Daniel


* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-20 15:04 [rfc] Near-constant time directory index for Ext2 Daniel Phillips
@ 2001-02-20 20:03 ` Linus Torvalds
  2001-02-20 21:08   ` Jeremy Jackson
  2001-02-20 21:41   ` Daniel Phillips
  2001-02-21 17:21 ` Davide Libenzi
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 69+ messages in thread
From: Linus Torvalds @ 2001-02-20 20:03 UTC (permalink / raw)
  To: linux-kernel

In article <01022020011905.18944@gimli>,
Daniel Phillips  <phillips@innominate.de> wrote:
>Earlier this month a runaway installation script decided to mail all its
>problems to root.  After a couple of hours the script aborted, having
>created 65535 entries in Postfix's maildrop directory.  Removing those
>files took an awfully long time.  The problem is that Ext2 does each
>directory access using a simple, linear search though the entire
>directory file, resulting in n**2 behaviour to create/delete n files. 
>It's about time we fixed that.

Interesting.

However, if you're playing with the directory structure, please consider
getting rid of the "struct buffer_head"-centricity, and using the page
cache instead.  The page cache has much nicer caching semantics, and
looking up data in the page cache is much faster because it never needs
to do the "virtual->physical" translation. 

Talk to Al Viro about this - he's already posted patches to move the
regular ext2 directory tree into the page cache, and they weren't
applied to 2.4.x only because there was no great feeling of "we _must_
do this for correctness".

I see that you already considered this issue, but I wanted to bring it
up again simply because something like this certainly looks like a
potential candidate for 2.5.x, but I will _refuse_ to add code that
increases our reliance on "struct buffer_head" as a caching entity.  So
I'd rather see the page cache conversion happen sooner rather than
later... 

Also, just out of interest: if you've already been worrying about
hashes, what's the verdict on just using the native dentry hash value
directly? It has other constraints (_really_ low latency and absolutely
performance critical to calculate for the common case, which is not
needing a real lookup at all), but maybe it is good enough? And if not,
and you have done some statistics on it, I'd love to hear about it ;)

			Linus


* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-20 20:03 ` Linus Torvalds
@ 2001-02-20 21:08   ` Jeremy Jackson
  2001-02-20 21:20     ` Mike Dresser
  2001-02-20 21:41   ` Daniel Phillips
  1 sibling, 1 reply; 69+ messages in thread
From: Jeremy Jackson @ 2001-02-20 21:08 UTC (permalink / raw)
  Cc: linux-kernel

> In article <01022020011905.18944@gimli>,
> Daniel Phillips  <phillips@innominate.de> wrote:
> >Earlier this month a runaway installation script decided to mail all its
> >problems to root.  After a couple of hours the script aborted, having
> >created 65535 entries in Postfix's maildrop directory.  Removing those
> >files took an awfully long time.  The problem is that Ext2 does each
> >directory access using a simple, linear search though the entire
> >directory file, resulting in n**2 behaviour to create/delete n files.
> >It's about time we fixed that.

In the case of your script I'm not sure this will help, but:
I've seen /home directories organised like /home/a/adamsonj,
/home/a/arthurtone, /home/b/barrettj, etc.
this way (crude) indexing only costs areas where it's needed,
without kernel modification. (app does it)  What other places would we
need indexing *in* the filesystem?



* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-20 21:08   ` Jeremy Jackson
@ 2001-02-20 21:20     ` Mike Dresser
  2001-02-20 22:36       ` Jeremy Jackson
  2001-02-20 22:58       ` Jonathan Morton
  0 siblings, 2 replies; 69+ messages in thread
From: Mike Dresser @ 2001-02-20 21:20 UTC (permalink / raw)
  To: Jeremy Jackson; +Cc: linux-kernel

the way i'm reading this, the problem is there's 65535 files in the directory
/where/postfix/lives.  rm * or what have you, is going to take forever and
ever, and bog the machine down while its doing it.  My understanding is you
could do the rm *, and instead of it reading the tree over and over for every
file that has to be deleted, it just jumps one or two blocks to the file that's
being deleted, instead of thousands of files to be scanned for each file
deleted.

Jeremy Jackson wrote:

> > In article <01022020011905.18944@gimli>,
> > Daniel Phillips  <phillips@innominate.de> wrote:
> > >Earlier this month a runaway installation script decided to mail all its
> > >problems to root.  After a couple of hours the script aborted, having
> > >created 65535 entries in Postfix's maildrop directory.  Removing those
> > >files took an awfully long time.  The problem is that Ext2 does each
> > >directory access using a simple, linear search though the entire
> > >directory file, resulting in n**2 behaviour to create/delete n files.
> > >It's about time we fixed that.
>
> In the case of your script I'm not sure this will help, but:
> I've seen /home directories organised like /home/a/adamsonj,
> /home/a/arthurtone, /home/b/barrettj, etc.
> this way (crude) indexing only costs areas where it's needed,
> without kernel modification. (app does it)  What other placed would we
> need indexing *in* the filesystem?
>



* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-20 20:03 ` Linus Torvalds
  2001-02-20 21:08   ` Jeremy Jackson
@ 2001-02-20 21:41   ` Daniel Phillips
  2001-02-21  0:22     ` Linus Torvalds
  1 sibling, 1 reply; 69+ messages in thread
From: Daniel Phillips @ 2001-02-20 21:41 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel

On Tue, 20 Feb 2001, Linus Torvalds wrote:
> In article <01022020011905.18944@gimli>,
> Daniel Phillips  <phillips@innominate.de> wrote:
> >Earlier this month a runaway installation script decided to mail all its
> >problems to root.  After a couple of hours the script aborted, having
> >created 65535 entries in Postfix's maildrop directory.  Removing those
> >files took an awfully long time.  The problem is that Ext2 does each
> >directory access using a simple, linear search though the entire
> >directory file, resulting in n**2 behaviour to create/delete n files. 
> >It's about time we fixed that.
> 
> Interesting.
> 
> However, if you're playing with the directory structure, please consider
> getting rid of the "struct buffer_head"-centricity, and using the page
> cache instead.  The page cache has much nicer caching semantics, and
> looking up data in the page cache is much faster because it never needs
> to do the "virtual->physical" translation. 

Oh yes, I was planning on it.  I started with the buffers version
for two main reasons: 1) it's simple and solid and 2) it
provides the basis for a backport to 2.2 - after the 2.4/2.5 version is
complete of course.

> Talk to Al Viro about this - he's already posted patches to move the
> regular ext2 directory tree into the page cache, and they weren't
> applied to 2.4.x only because there was no great feeling of "we _must_
> do this for correctness".
> 
> I see that you already considered this issue, but I wanted to bring it
> up again simply because something like this certainly looks like a
> potential candidate for 2.5.x, but I will _refuse_ to add code that
> increases our reliance of "struct buffer_head" as a caching entity.  So
> I'd rather see the page cache conversion happen sooner rather than
> later... 

You are preaching to the converted.

> Also, just out of interest: if you've already been worrying about
> hashes, what's the verdict on just using the native dentry hash value
> directly? It has other constraints (_really_ low latency and absolutely
> performance critical to calculate for the common case, which is not
> needing a real lookup at all), but maybe it is good enough? And if not,
> and you have done some statistics on it, I'd love to hear about it ;)

You mean full_name_hash?  I will un-static it and try it.  I should have
some statistics tomorrow.  I have a couple of simple metrics for
measuring the effectiveness of the hash function: the uniformity of
the hash space splitting (which in turn affects the average fullness
of directory leaves) and speed.
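
For what it's worth, the uniformity check is nothing fancy - something like
this little userspace toy (purely illustrative, not part of the patch; the
constants are borrowed from dx_hash, slightly simplified), fed a list of
file names, one per line:

    #include <stdio.h>
    #include <string.h>

    #define BUCKETS 256

    static unsigned toy_hash (const char *name, int namelen)
    {
        unsigned hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
        while (namelen--) {
            unsigned hash = hash1 + (hash0 ^ (*name++ * 71523));
            hash1 = hash0;
            hash0 = hash;
        }
        return hash0 << 1;
    }

    int main (void)
    {
        static unsigned count[BUCKETS];
        char name[256];
        unsigned i, n = 0, min = ~0u, max = 0;

        while (fgets (name, sizeof(name), stdin)) {
            int len = strlen (name);
            if (len && name[len - 1] == '\n')
                len--;
            count[toy_hash (name, len) >> 24]++;    /* top 8 of 32 bits */
            n++;
        }
        for (i = 0; i < BUCKETS; i++) {
            if (count[i] < min) min = count[i];
            if (count[i] > max) max = count[i];
        }
        printf ("%u names: bucket min %u max %u mean %u\n",
            n, min, max, n / BUCKETS);
        return 0;
    }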

Let the hash races begin.

-- 
Daniel


* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-20 21:20     ` Mike Dresser
@ 2001-02-20 22:36       ` Jeremy Jackson
  2001-02-20 23:08         ` Daniel Phillips
  2001-02-20 22:58       ` Jonathan Morton
  1 sibling, 1 reply; 69+ messages in thread
From: Jeremy Jackson @ 2001-02-20 22:36 UTC (permalink / raw)
  To: Mike Dresser; +Cc: linux-kernel

Mike Dresser wrote:

> the way i'm reading this, the problem is there's 65535 files in the directory
> /where/postfix/lives.  rm * or what have you, is going to take forever and
> ever, and bog the machine down while its doing it.  My understanding is you
> could do the rm *, and instead of it reading the tree over and over for every
> file that has to be deleted, it just jumps one or two blocks to the file that's
> being deleted, instead of thousands of files to be scanned for each file
> deleted.
>

I thought about it again, and the performance problem with "rm *" is that
the shell reads and sorts the directory, passes each file as a separate
argument to rm, which then causes the kernel to lookup each file
from a random directory block (random because of previous sort),
modify that directory block, then read another... after a few seconds
the modified blocks start to be written back to disk while new ones
are looked up... disk seek contention.  and this becomes hard on the
dir. block cache (wherever this is) since from source each dir entry
is just over 256 bytes (?) 65535 files would require 16MB to
cache dir entries.  Plus it has to read in all the inodes, modify,
then write, taking up xxMB more.  You're probably swapping
out,  with swap partition on same disk, the disk may explode.

If it were truly doing a linear scan, it might be faster.  Two
successive mods to same dir block would be merged
onto same write.

Perhaps rm -rf . would be faster?  Let rm do glob expansion,
without the sort.  Care to recreate those 65535 files and try it?

or use ls with the nosort flag, pipe through xargs, then to rm...
again lose sorting but don't delete directory or subdirs.

>
> Jeremy Jackson wrote:
>
> > > In article <01022020011905.18944@gimli>,
> > > Daniel Phillips  <phillips@innominate.de> wrote:
> > > >Earlier this month a runaway installation script decided to mail all its
> > > >problems to root.  After a couple of hours the script aborted, having
> > > >created 65535 entries in Postfix's maildrop directory.  Removing those
> > > >files took an awfully long time.  The problem is that Ext2 does each
> > > >directory access using a simple, linear search though the entire
> > > >directory file, resulting in n**2 behaviour to create/delete n files.
> > > >It's about time we fixed that.
> >



* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-20 21:20     ` Mike Dresser
  2001-02-20 22:36       ` Jeremy Jackson
@ 2001-02-20 22:58       ` Jonathan Morton
  1 sibling, 0 replies; 69+ messages in thread
From: Jonathan Morton @ 2001-02-20 22:58 UTC (permalink / raw)
  To: Jeremy Jackson, Mike Dresser; +Cc: linux-kernel

>Perhaps rm -rf . would be faster?  Let rm do glob expansion,
>without the sort.  Care to recreate those 65535 files and try it?

Perhaps, but I think that form is still fairly slow.  It takes an
"uncomfortable" amount of time to remove a complex directory structure
using, eg. "rm -rf /usr/src/linux-obsolete" or "rm -rf
downloads/XFree86-old-and-buggy".  I'm not sure, but I would guess it's not
as much quicker than removing each file individually as you might think.

If I had more time on my hands, I'd run some quick benchmarks on some of my
systems.

--------------------------------------------------------------
from:     Jonathan "Chromatix" Morton
mail:     chromi@cyberspace.org  (not for attachments)
big-mail: chromatix@penguinpowered.com
uni-mail: j.d.morton@lancaster.ac.uk

The key to knowledge is not to rely on people to teach you it.

Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/

-----BEGIN GEEK CODE BLOCK-----
Version 3.12
GCS$/E/S dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$ V? PS
PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r- y+
-----END GEEK CODE BLOCK-----



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-20 22:36       ` Jeremy Jackson
@ 2001-02-20 23:08         ` Daniel Phillips
  2001-02-21  1:04           ` Bernd Eckenfels
  0 siblings, 1 reply; 69+ messages in thread
From: Daniel Phillips @ 2001-02-20 23:08 UTC (permalink / raw)
  To: Jeremy Jackson, Mike Dresser; +Cc: linux-kernel

On Tue, 20 Feb 2001, Jeremy Jackson wrote:
> Mike Dresser wrote:
> 
> > the way i'm reading this, the problem is there's 65535 files in the directory
> > /where/postfix/lives.  rm * or what have you, is going to take forever and
> > ever, and bog the machine down while its doing it.  My understanding is you
> > could do the rm *, and instead of it reading the tree over and over for every
> > file that has to be deleted, it just jumps one or two blocks to the file that's
> > being deleted, instead of thousands of files to be scanned for each file
> > deleted.
> 
> I thought about it again, and the proformance problem with "rm *" is that
> the shell reads and sorts the directory, passes each file as a separate
> argument to rm, which then causes the kernel to lookup each file
> from a random directory block (random because of previous sort),
> modify that directory block, then read another... after a few seconds
> the modified blocks start to be written back to disk while new ones
> are looked up... disk seek contention.  and this becomes hard on the
> dir. block cache (wherever this is) since from source each dir entry
> is just over 256 bytes (?) 65535 files would require 16MB to
> cache dir entries.  Plus it has to read in all the inodes, modify,
> then write, taking up xxMB more.  You're probably swapping
> out,  with swap partition on same disk, the disk may explode.
> 
> If it were truly doing a linear scan, it might be faster.  Two
> successive mods to same dir block would be merged
> onto same write.
> 
> Perhaps rm -rf . would be faster?  Let rm do glob expansion,
> without the sort.  Care to recreate those 65535 files and try it?
> 
> or use ls with the nosort flag pipe through xargs then to rm...
> again loose sorting but don't delete directory or subdirs.

Indeed, rm -rf is faster.  It does a readdir to get all the directory
entries in internal order, then calls unlink to remove them, one at a
time.  This removes each entry from the front of the file, shortening
the time that has to be spent scanning forward in the file to find the
target entry.  Manfred Spraul observed that this could be sped up
by caching the file position, and sent me a patch to do that.  It
did speed things up - by about 20%.

But actually, rm is not the problem, it's open and create.  To do a
create you have to make sure the file doesn't already exist, and
without an index you have to scan on average half the directory file.
Open requires a similar scan.  Here we are talking about using an index
to speed that up quadratically when operating on N files.  That is the
real gravy.
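
To put rough numbers on that (just a back-of-the-envelope sketch, not a
measurement): without an index, creating file number i scans about i/2
existing entries on average, so creating N files costs roughly

    1/2 + 2/2 + 3/2 + ... + N/2  =  N(N+1)/4  =  O(N**2)

comparisons in total, while with a near-constant time lookup per create
the total drops to roughly O(N).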

-- 
Daniel

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-20 21:41   ` Daniel Phillips
@ 2001-02-21  0:22     ` Linus Torvalds
  2001-02-21  0:30       ` Alan Cox
                         ` (2 more replies)
  0 siblings, 3 replies; 69+ messages in thread
From: Linus Torvalds @ 2001-02-21  0:22 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel



On Tue, 20 Feb 2001, Daniel Phillips wrote:
> 
> You mean full_name_hash?  I will un-static it and try it.  I should have
> some statistics tomorrow.  I have a couple of simple metrics for
> measuring the effectiveness of the hash function: the uniformity of
> the hash space splitting (which in turn affects the average fullness
> of directory leaves) and speed.

I was more thinking about just using "dentry->d_name->hash" directly, and
not worrying about how that hash was computed. Yes, for ext2 it will have
the same value as "full_name_hash" - the difference really being that
d_hash has already been precomputed for you anyway.

> Let the hash races begin.

Note that dentry->d_name->hash is really quick (no extra computation), but
I'm not claiming that it has anything like a CRC quality. And it's
probably a bad idea to use it, because in theory at least the VFS layer
might decide to switch the hash function around. I'm more interested in
hearing whether it's a good hash, and maybe we could improve the VFS hash
enough that there's no reason to use anything else..

		Linus


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21  0:22     ` Linus Torvalds
@ 2001-02-21  0:30       ` Alan Cox
  2001-02-21  2:35         ` Ed Tomlinson
  2001-02-21  1:01       ` Andreas Dilger
  2001-02-22  2:28       ` Daniel Phillips
  2 siblings, 1 reply; 69+ messages in thread
From: Alan Cox @ 2001-02-21  0:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Daniel Phillips, linux-kernel

> probably a bad idea to use it, because in theory at least the VFS layer
> might decide to switch the hash function around. I'm more interested in
> hearing whether it's a good hash, and maybe we could improve the VFS hash
> enough that there's no reason to use anything else..

Reiserfs seems to have done a lot of work on this and be using tea, which is
also nice as tea is non trivial to abuse as a user to create pessimal file
searches intentionally


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21  0:22     ` Linus Torvalds
  2001-02-21  0:30       ` Alan Cox
@ 2001-02-21  1:01       ` Andreas Dilger
  2001-02-22  2:28       ` Daniel Phillips
  2 siblings, 0 replies; 69+ messages in thread
From: Andreas Dilger @ 2001-02-21  1:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Daniel Phillips, linux-kernel

Linus writes:
> On Tue, 20 Feb 2001, Daniel Phillips wrote:
> > You mean full_name_hash?  I will un-static it and try it.  I should have
> > some statistics tomorrow.
> 
> I was more thinking about just using "dentry->d_name->hash" directly, and
> not worrying about how that hash was computed. Yes, for ext2 it will have
> the same value as "full_name_hash" - the difference really being that
> d_hash has already been precomputed for you anyway.

I _thought_ that's what you meant, but then I was also thinking that the
dentry hash was on the full path name and not just the filename?  This
wouldn't be any good for use in the directory index, in case the directory
is renamed.  If this is _not_ the case, then it is a definite candidate.

> Note that dentry->d_name->hash is really quick (no extra computation), but
> I'm not claiming that it has anything like a CRC quality. And it's
> probably a bad idea to use it, because in theory at least the VFS layer
> might decide to switch the hash function around.

I was thinking about this as well.  Since the setup Daniel has allows us
to store a hash version, we could run the hash function on a fixed string
at SB init time to give us a hash "version" number.  If the hash function
changes, we will get a new hash "version".  We could inline each new dentry
hash function into the ext2 code (so we can unpack the directories), or,
as a cop-out, if any directory has a hash version not equal to the current
one, we re-hash all the entries in that directory.
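
A minimal sketch of what I mean (names invented for illustration, not
actual ext2 or VFS code):

    #define DX_HASH_PROBE "ext2-dir-index-hash-probe"

    typedef unsigned int (*name_hash_fn)(const unsigned char *, unsigned int);

    /* Hash a fixed probe string; a different result on a later mount
     * means the kernel's hash function has changed since the index
     * was built. */
    static unsigned int dx_hash_version(name_hash_fn hash)
    {
            return hash((const unsigned char *)DX_HASH_PROBE,
                        sizeof(DX_HASH_PROBE) - 1);
    }

    /* At superblock init (sketch):
     *
     *      if (sbi->s_dx_hash_version != dx_hash_version(current_hash))
     *              re-hash the directory (or fall back to the old hash);
     */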

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-20 23:08         ` Daniel Phillips
@ 2001-02-21  1:04           ` Bernd Eckenfels
  2001-02-21 16:38             ` Daniel Phillips
  0 siblings, 1 reply; 69+ messages in thread
From: Bernd Eckenfels @ 2001-02-21  1:04 UTC (permalink / raw)
  To: linux-kernel

In article <01022100361408.18944@gimli> you wrote:
> But actually, rm is not problem, it's open and create.  To do a
> create you have to make sure the file doesn't already exist, and
> without an index you have to scan on average half the directory file. 

Unless you use a filesystem which is better for that, like Reiser-FS. Of
course an even better solution is to distribute those files in hashed subdirs.

Greetings
Bernd

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21  0:30       ` Alan Cox
@ 2001-02-21  2:35         ` Ed Tomlinson
  2001-02-21 23:13           ` Linus Torvalds
  0 siblings, 1 reply; 69+ messages in thread
From: Ed Tomlinson @ 2001-02-21  2:35 UTC (permalink / raw)
  To: linux-kernel

Alan Cox wrote:

>> probably a bad idea to use it, because in theory at least the VFS layer
>> might decide to switch the hash function around. I'm more interested in
>> hearing whether it's a good hash, and maybe we could improve the VFS hash
>> enough that there's no reason to use anything else..
> 
> Reiserfs seems to have done a lot of work on this and be using tea, which is
> also nice as tea is non trivial to abuse as a user to create pessimal file
> searches intentionally

The default in reiserfs is now the R5 hash, but you are right that lots of efforts went 
into finding this hash.  This includes testing various hashes on real directory 
structures to see which one worked best.  R5 won.

Ed Tomlinson

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21  1:04           ` Bernd Eckenfels
@ 2001-02-21 16:38             ` Daniel Phillips
  0 siblings, 0 replies; 69+ messages in thread
From: Daniel Phillips @ 2001-02-21 16:38 UTC (permalink / raw)
  To: Bernd Eckenfels, linux-kernel

On Wed, 21 Feb 2001, Bernd Eckenfels wrote:
> In article <01022100361408.18944@gimli> you wrote:
> > But actually, rm is not problem, it's open and create.  To do a
> > create you have to make sure the file doesn't already exist, and
> > without an index you have to scan on average half the directory file. 
> 
> Unless you use a File System which is better for that, like Reiser-FS. Of
> course a even better solution is to distribute those files in hashed subdirs.

Ahem.  Please read the first post in the thread. ;-)

-- 
Daniel

^ permalink raw reply	[flat|nested] 69+ messages in thread

* RE: [rfc] Near-constant time directory index for Ext2
  2001-02-20 15:04 [rfc] Near-constant time directory index for Ext2 Daniel Phillips
  2001-02-20 20:03 ` Linus Torvalds
@ 2001-02-21 17:21 ` Davide Libenzi
  2001-02-21 21:08   ` Martin Mares
  2001-02-22  6:23 ` [Ext2-devel] " tytso
  2001-02-22 18:38 ` Kai Henningsen
  3 siblings, 1 reply; 69+ messages in thread
From: Davide Libenzi @ 2001-02-21 17:21 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: ext2-devel, hch, Andreas Dilger, tytso, Linux-kernel


On 20-Feb-2001 Daniel Phillips wrote:
> Earlier this month a runaway installation script decided to mail all its
> problems to root.  After a couple of hours the script aborted, having
> created 65535 entries in Postfix's maildrop directory.  Removing those
> files took an awfully long time.  The problem is that Ext2 does each
> directory access using a simple, linear search though the entire
> directory file, resulting in n**2 behaviour to create/delete n files. 
> It's about time we fixed that.
> 
> Last fall in Miami, Ted Ts'o mentioned some ideas he was playing with
> for an Ext2 directory index, including the following points:
> 
>   - Fixed-size hash keys instead of names in the index
>   - Leaf blocks are normal ext2 directory blocks
>   - Leaf blocks are sequental, so readdir doesn't have to be changed

Have you tried using skip lists?
In '93 I coded a skiplist-based directory access for Minix and it gave very
interesting performance.
Skiplists have linked-list-like performance when linearly scanned, and overall
good performance for insertion/seek/delete.
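
For the curious, the rough shape of such a structure (an illustrative
sketch only, not the Minix code mentioned above):

    #define SKIP_MAX_LEVEL 16

    struct skip_node {
            unsigned int key;                  /* e.g. a name hash      */
            unsigned int dirent_off;           /* offset of real dirent */
            struct skip_node *forward[SKIP_MAX_LEVEL];
    };

    struct skip_list {
            int level;                         /* highest level in use  */
            struct skip_node head;             /* sentinel node         */
    };

    /* Level 0 is an ordinary sorted linked list, so a plain linear scan
     * still works; a keyed lookup starts at the top level and drops
     * down, taking expected O(log n) steps. */
    static struct skip_node *skip_lookup(struct skip_list *sl, unsigned int key)
    {
            struct skip_node *x = &sl->head;
            int i;

            for (i = sl->level - 1; i >= 0; i--)
                    while (x->forward[i] && x->forward[i]->key < key)
                            x = x->forward[i];

            x = x->forward[0];
            return (x && x->key == key) ? x : 0;
    }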




- Davide


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21 17:21 ` Davide Libenzi
@ 2001-02-21 21:08   ` Martin Mares
  2001-02-21 21:29     ` Davide Libenzi
  0 siblings, 1 reply; 69+ messages in thread
From: Martin Mares @ 2001-02-21 21:08 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Daniel Phillips, ext2-devel, hch, Andreas Dilger, tytso, Linux-kernel

Hello!

> Have You tried to use skiplists ?
> In 93 I've coded a skiplist based directory access for Minix and it gave very
> interesting performances.
> Skiplists have a link-list like performance when linear scanned, and overall
> good performance in insertion/seek/delete.

Skip list search/insert/delete is O(log N) in average as skip lists are just a
dynamic version of interval bisection. Good hashing is O(1).

				Have a nice fortnight
-- 
Martin `MJ' Mares <mj@ucw.cz> <mj@suse.cz> http://atrey.karlin.mff.cuni.cz/~mj/
Entropy isn't what it used to be.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21 21:08   ` Martin Mares
@ 2001-02-21 21:29     ` Davide Libenzi
  2001-02-21 21:32       ` Martin Mares
  0 siblings, 1 reply; 69+ messages in thread
From: Davide Libenzi @ 2001-02-21 21:29 UTC (permalink / raw)
  To: Martin Mares
  Cc: Linux-kernel, tytso, Andreas Dilger, hch, ext2-devel, Daniel Phillips


On 21-Feb-2001 Martin Mares wrote:
> Hello!
> 
>> Have You tried to use skiplists ?
>> In 93 I've coded a skiplist based directory access for Minix and it gave
>> very
>> interesting performances.
>> Skiplists have a link-list like performance when linear scanned, and overall
>> good performance in insertion/seek/delete.
> 
> Skip list search/insert/delete is O(log N) in average as skip lists are just
> a
> dynamic version of interval bisection. Good hashing is O(1).

To have O(1) you have to have the number of hash entries > number of files and a
really good hashing function.



> 
>                               Have a nice fortnight

To be sincere, here is pretty daylight :)



- Davide


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21 21:29     ` Davide Libenzi
@ 2001-02-21 21:32       ` Martin Mares
  2001-02-21 21:59         ` Davide Libenzi
  2001-02-21 22:14         ` H. Peter Anvin
  0 siblings, 2 replies; 69+ messages in thread
From: Martin Mares @ 2001-02-21 21:32 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Linux-kernel, tytso, Andreas Dilger, hch, ext2-devel, Daniel Phillips

Hello!

> To have O(1) you've to have the number of hash entries > number of files and a
> really good hasing function.

No, if you enlarge the hash table twice (and re-hash everything) every time the
table fills up, the load factor of the table keeps small and everything is O(1)
amortized, of course if you have a good hashing function. If you are really
smart and re-hash incrementally, you can get O(1) worst case complexity, but
the multiplicative constant is large.
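
In miniature, the scheme looks like this (a generic user-space sketch,
not kernel code; allocation failures ignored):

    #include <stdlib.h>

    struct entry {
            unsigned int  key;
            struct entry *next;
    };

    struct table {
            struct entry **slot;
            unsigned int   size;    /* always a power of two */
            unsigned int   count;
    };

    /* O(current n), but doubling means the rehash costs over n inserts
     * sum to O(n) -- O(1) amortized per insert. */
    static void rehash(struct table *t)
    {
            unsigned int newsize = t->size * 2, i;
            struct entry **ns = calloc(newsize, sizeof(*ns));

            for (i = 0; i < t->size; i++) {
                    struct entry *e = t->slot[i], *next;

                    for (; e; e = next) {
                            next = e->next;
                            e->next = ns[e->key & (newsize - 1)];
                            ns[e->key & (newsize - 1)] = e;
                    }
            }
            free(t->slot);
            t->slot = ns;
            t->size = newsize;
    }

    static void insert(struct table *t, struct entry *e)
    {
            if (t->count + 1 > t->size)     /* keep load factor <= 1 */
                    rehash(t);
            e->next = t->slot[e->key & (t->size - 1)];
            t->slot[e->key & (t->size - 1)] = e;
            t->count++;
    }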

> To be sincere, here is pretty daylight :)

:)
								Martin

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21 21:32       ` Martin Mares
@ 2001-02-21 21:59         ` Davide Libenzi
  2001-02-21 22:26           ` Martin Mares
  2001-02-21 22:14         ` H. Peter Anvin
  1 sibling, 1 reply; 69+ messages in thread
From: Davide Libenzi @ 2001-02-21 21:59 UTC (permalink / raw)
  To: Martin Mares
  Cc: Daniel Phillips, ext2-devel, hch, Andreas Dilger, tytso, Linux-kernel


On 21-Feb-2001 Martin Mares wrote:
> Hello!
> 
>> To have O(1) you've to have the number of hash entries > number of files and
>> a
>> really good hasing function.
> 
> No, if you enlarge the hash table twice (and re-hash everything) every time
> the
> table fills up, the load factor of the table keeps small and everything is
> O(1)
> amortized, of course if you have a good hashing function. If you are really
> smart and re-hash incrementally, you can get O(1) worst case complexity, but
> the multiplicative constant is large.

My personal preference goes to skip lists because they don't have fixed (or
growing) tables to handle.  You simply have a stub of data together with the
FS data in each direntry.
And the performance ( O(log2(n)) ) is the same for whatever number of entries.




- Davide


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21 21:32       ` Martin Mares
  2001-02-21 21:59         ` Davide Libenzi
@ 2001-02-21 22:14         ` H. Peter Anvin
  2001-02-21 22:32           ` Martin Mares
  1 sibling, 1 reply; 69+ messages in thread
From: H. Peter Anvin @ 2001-02-21 22:14 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <20010221223238.A17903@atrey.karlin.mff.cuni.cz>
By author:    Martin Mares <mj@suse.cz>
In newsgroup: linux.dev.kernel
>
> Hello!
> 
> > To have O(1) you've to have the number of hash entries > number of files and a
> > really good hasing function.
> 
> No, if you enlarge the hash table twice (and re-hash everything) every time the
> table fills up, the load factor of the table keeps small and everything is O(1)
> amortized, of course if you have a good hashing function. If you are really
> smart and re-hash incrementally, you can get O(1) worst case complexity, but
> the multiplicative constant is large.
> 

Not true.  The rehashing is O(n) and it has to be performed O(log n)
times during insertion.  Therefore, insertion is O(log n).

	-hpa
-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21 21:59         ` Davide Libenzi
@ 2001-02-21 22:26           ` Martin Mares
  2001-02-21 22:43             ` Davide Libenzi
  0 siblings, 1 reply; 69+ messages in thread
From: Martin Mares @ 2001-02-21 22:26 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Daniel Phillips, ext2-devel, hch, Andreas Dilger, tytso, Linux-kernel

Hello!

> My personal preference goes to skiplist coz it doesn't have fixed ( or growing
> ) tables to handle. You've simply a stub of data togheter with FS data in each
> direntry.

Another problem with skip lists is that they require variable sized nodes,
so you either need to keep free chunk lists and lose some space in deleted
nodes kept in these lists, or you choose to shift remaining nodes which is
slow and complicated as you need to keep the inter-node links right. With
hashing, you can separate the control part of the structure and the actual
data and shift data while leaving most of the control part intact.

> And performance ( O(log2(n)) ) are the same for whatever number of entries.

I don't understand this complexity estimate -- it cannot be the same for
whatever number of entries as the complexity function depends on the number
of entries.

				Have a nice fortnight
-- 
Martin `MJ' Mares <mj@ucw.cz> <mj@suse.cz> http://atrey.karlin.mff.cuni.cz/~mj/
P.C.M.C.I.A. stands for `People Can't Memorize Computer Industry Acronyms'

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21 22:14         ` H. Peter Anvin
@ 2001-02-21 22:32           ` Martin Mares
  2001-02-21 22:38             ` H. Peter Anvin
  0 siblings, 1 reply; 69+ messages in thread
From: Martin Mares @ 2001-02-21 22:32 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel

Hello!

> Not true.  The rehashing is O(n) and it has to be performed O(log n)
> times during insertion.  Therefore, insertion is O(log n).

Rehashing is O(n), but the "n" is the _current_ number of items, not the
maximum one after all the insertions.

Let's assume you start with a single-entry hash table. You rehash for the
first time after inserting the first item (giving hash table of size 2),
then after the second item (=> size 4), then after the fourth item (=> size 8)
and so on. I.e., when you insert n items, the total cost of rehashing summed
over all the insertions is at most 1 + 2 + 4 + 8 + 16 + ... + 2^k (where
k=floor(log2(n))) <= 2^(k+1) = O(n). That is O(1) operations per item inserted.

				Have a nice fortnight
-- 
Martin `MJ' Mares <mj@ucw.cz> <mj@suse.cz> http://atrey.karlin.mff.cuni.cz/~mj/
MIPS: Meaningless Indicator of Processor Speed.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21 22:32           ` Martin Mares
@ 2001-02-21 22:38             ` H. Peter Anvin
  2001-02-21 22:50               ` Martin Mares
  0 siblings, 1 reply; 69+ messages in thread
From: H. Peter Anvin @ 2001-02-21 22:38 UTC (permalink / raw)
  To: Martin Mares; +Cc: H. Peter Anvin, linux-kernel

Martin Mares wrote:
> 
> Hello!
> 
> > Not true.  The rehashing is O(n) and it has to be performed O(log n)
> > times during insertion.  Therefore, insertion is O(log n).
> 
> Rehashing is O(n), but the "n" is the _current_ number of items, not the
> maximum one after all the insertions.
> 
> Let's assume you start with a single-entry hash table. You rehash for the
> first time after inserting the first item (giving hash table of size 2),
> then after the second item (=> size 4), then after the fourth item (=> size 8)
> and so on. I.e., when you insert n items, the total cost of rehashing summed
> over all the insertions is at most 1 + 2 + 4 + 8 + 16 + ... + 2^k (where
> k=floor(log2(n))) <= 2^k+1 = O(n). That is O(1) operations per item inserted.
> 

You're right.  However, for each hash table operation to be O(1) the size
of the hash table must be >> n.

I suggested at one point to use B-trees with a hash value as the key. 
B-trees are extremely efficient when used on a small constant-size key.

	-hpa

-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21 22:26           ` Martin Mares
@ 2001-02-21 22:43             ` Davide Libenzi
  0 siblings, 0 replies; 69+ messages in thread
From: Davide Libenzi @ 2001-02-21 22:43 UTC (permalink / raw)
  To: Martin Mares
  Cc: Linux-kernel, tytso, Andreas Dilger, hch, ext2-devel, Daniel Phillips


On 21-Feb-2001 Martin Mares wrote:
> Hello!
> 
>> My personal preference goes to skiplist coz it doesn't have fixed ( or
>> growing
>> ) tables to handle. You've simply a stub of data togheter with FS data in
>> each
>> direntry.
> 
> Another problem with skip lists is that they require variable sized nodes,
> so you either need to keep free chunk lists and lose some space in deleted
> nodes kept in these lists, or you choose to shift remaining nodes which is
> slow and complicated as you need to keep the inter-node links right. With
> hashing, you can separate the control part of the structure and the actual
> data and shift data while leaving most of the control part intact.

An entry in the skip list table is a u32 direntry offset, and you don't have to
keep free entries; the height of the node simply changes depending on the number
of entries.


>> And performance ( O(log2(n)) ) are the same for whatever number of entries.
> 
> I don't understand this complexity estimate -- it cannot be the same for
> whatever number of entries as the complexity function depends on the number
> of entries.

n == number of entries

By constant I mean the formula, not the result.



- Davide


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21 22:38             ` H. Peter Anvin
@ 2001-02-21 22:50               ` Martin Mares
  2001-02-21 22:54                 ` H. Peter Anvin
  2001-02-22 19:04                 ` Kai Henningsen
  0 siblings, 2 replies; 69+ messages in thread
From: Martin Mares @ 2001-02-21 22:50 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel

Hello!

> You're right.  However, for each hash table operation to be O(1) the size
> of the hash table must be >> n.

If we are talking about average case complexity (which is the only possibility
with fixed hash function and arbitrary input keys), it suffices to have
hash table size >= c*n for some constant c which gives O(1/c) cost of
all operations.
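
(Spelling the load-factor argument out: with m >= c*n slots and a
reasonably uniform hash, the load factor is n/m <= 1/c, so a lookup
inspects O(1 + 1/c) entries on average -- constant for any fixed c.)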
 
> I suggested at one point to use B-trees with a hash value as the key. 
> B-trees are extremely efficient when used on a small constant-size key.

Although from asymptotic complexity standpoint hashing is much better
than B-trees, I'm not sure at all what will give the best performance for
reasonable directory sizes. Maybe the B-trees are really the better
alternative as they are updated dynamically and the costs of successive
operations are similar as opposed to hashing which is occassionally very
slow due to rehashing unless you try to rehash on-line, but I don't
know any algorithm for on-line rehashing with both inserts and deletes
which wouldn't be awfully complex and slow (speaking of multiplicative
constants, of course -- it's still O(1) per operation, but "the big Oh
is really big there").

				Have a nice fortnight
-- 
Martin `MJ' Mares <mj@ucw.cz> <mj@suse.cz> http://atrey.karlin.mff.cuni.cz/~mj/
"#define QUESTION ((bb) || !(bb))"  -- Shakespeare

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21 22:50               ` Martin Mares
@ 2001-02-21 22:54                 ` H. Peter Anvin
  2001-02-21 23:07                   ` Martin Mares
  2001-02-22 19:04                 ` Kai Henningsen
  1 sibling, 1 reply; 69+ messages in thread
From: H. Peter Anvin @ 2001-02-21 22:54 UTC (permalink / raw)
  To: Martin Mares; +Cc: linux-kernel

Martin Mares wrote:
> 
> Hello!
> 
> > You're right.  However, for each hash table operation to be O(1) the size
> > of the hash table must be >> n.
> 
> If we are talking about average case complexity (which is the only possibility
> with fixed hash function and arbitrary input keys), it suffices to have
> hash table size >= c*n for some constant c which gives O(1/c) cost of
> all operations.
> 

True.  Note too, though, that on a filesystem (which we are, after all,
talking about), if you assume a large linear space you have to create a
file, which means you need to multiply the cost of all random-access
operations with O(log n).

> > I suggested at one point to use B-trees with a hash value as the key.
> > B-trees are extremely efficient when used on a small constant-size key.
> 
> Although from asymptotic complexity standpoint hashing is much better
> than B-trees, I'm not sure at all what will give the best performance for
> reasonable directory sizes. Maybe the B-trees are really the better
> alternative as they are updated dynamically and the costs of successive
> operations are similar as opposed to hashing which is occassionally very
> slow due to rehashing unless you try to rehash on-line, but I don't
> know any algorithm for on-line rehashing with both inserts and deletes
> which wouldn't be awfully complex and slow (speaking of multiplicative
> constants, of course -- it's still O(1) per operation, but "the big Oh
> is really big there").

Well, once you multiply with O(log n) for the file indirection (which
B-trees don't need, since they inherently handle blocking and thus can
use block pointers directly) then the asymptotic complexity is the same
as well, and I think the B-trees are the overall winner.

	-hpa

-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21 22:54                 ` H. Peter Anvin
@ 2001-02-21 23:07                   ` Martin Mares
  2001-02-21 23:15                     ` H. Peter Anvin
  2001-02-21 23:26                     ` Jamie Lokier
  0 siblings, 2 replies; 69+ messages in thread
From: Martin Mares @ 2001-02-21 23:07 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel

Hello!

> True.  Note too, though, that on a filesystem (which we are, after all,
> talking about), if you assume a large linear space you have to create a
> file, which means you need to multiply the cost of all random-access
> operations with O(log n).

One could avoid this, but it would mean designing the whole filesystem in a
completely different way -- merge all directories to a single gigantic
hash table and use (directory ID,file name) as a key, but we were originally
talking about extending ext2, so such massive changes are out of the question
and your log n access argument is right.

				Have a nice fortnight
-- 
Martin `MJ' Mares <mj@ucw.cz> <mj@suse.cz> http://atrey.karlin.mff.cuni.cz/~mj/
COBOL -- Completely Outdated, Badly Overused Language

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21  2:35         ` Ed Tomlinson
@ 2001-02-21 23:13           ` Linus Torvalds
  2001-02-21 23:34             ` Davide Libenzi
                               ` (2 more replies)
  0 siblings, 3 replies; 69+ messages in thread
From: Linus Torvalds @ 2001-02-21 23:13 UTC (permalink / raw)
  To: linux-kernel

In article <20010221023515.6DF8E18C99@oscar.casa.dyndns.org>,
Ed Tomlinson  <tomlins@cam.org> wrote:
>
>The default in reiserfs is now the R5 hash, but you are right that lots of efforts went 
>into finding this hash.  This includes testing various hashes on real directory 
>structures to see which one worked best.  R5 won.

That's interesting.  The R5 hash is easily also the only one of the
reiser hashes that might be useable for the generic VFS hashing.  It's
not so different in spirit from the current one, and if you've done the
work to test it, it's bound to be a lot better.

(The current VFS name hash is probably _really_ stupid - I think it's
still my original one, and nobody probably ever even tried to run it
through any testing.  For example, I bet that using a shift factor of 4
is really bad, because it evenly divides a byte, which together with the
xor means that you can really easily generate trivial bad cases). 
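
For reference, the flavour of hash being talked about looks roughly
like this (an illustrative sketch of the shift-and-xor style, not the
exact VFS code):

    static unsigned int shift4_xor_hash(const unsigned char *name,
                                        unsigned int len)
    {
            unsigned int hash = 0;

            /* A shift of 4 evenly divides the 8-bit characters, so with
             * the xor it is easy to construct distinct names whose
             * contributions line up and cancel. */
            while (len--)
                    hash = (hash << 4) ^ (hash >> 4) ^ *name++;
            return hash;
    }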

What did you use for a test-case? Real-life directory contents? Did you
do any worst-case analysis too?

		Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21 23:07                   ` Martin Mares
@ 2001-02-21 23:15                     ` H. Peter Anvin
  2001-02-21 23:42                       ` Daniel Phillips
                                         ` (3 more replies)
  2001-02-21 23:26                     ` Jamie Lokier
  1 sibling, 4 replies; 69+ messages in thread
From: H. Peter Anvin @ 2001-02-21 23:15 UTC (permalink / raw)
  To: Martin Mares; +Cc: linux-kernel

Martin Mares wrote:
> 
> Hello!
> 
> > True.  Note too, though, that on a filesystem (which we are, after all,
> > talking about), if you assume a large linear space you have to create a
> > file, which means you need to multiply the cost of all random-access
> > operations with O(log n).
> 
> One could avoid this, but it would mean designing the whole filesystem in a
> completely different way -- merge all directories to a single gigantic
> hash table and use (directory ID,file name) as a key, but we were originally
> talking about extending ext2, so such massive changes are out of question
> and your log n access argument is right.
> 

It would still be tricky since you have to have actual files in the
filesystem as well.

	-hpa

-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21 23:07                   ` Martin Mares
  2001-02-21 23:15                     ` H. Peter Anvin
@ 2001-02-21 23:26                     ` Jamie Lokier
  1 sibling, 0 replies; 69+ messages in thread
From: Jamie Lokier @ 2001-02-21 23:26 UTC (permalink / raw)
  To: Martin Mares; +Cc: H. Peter Anvin, linux-kernel

Martin Mares wrote:
> Hello!
> 
> > True.  Note too, though, that on a filesystem (which we are, after all,
> > talking about), if you assume a large linear space you have to create a
> > file, which means you need to multiply the cost of all random-access
> > operations with O(log n).
> 
> One could avoid this, but it would mean designing the whole filesystem in a
> completely different way -- merge all directories to a single gigantic
> hash table and use (directory ID,file name) as a key, but we were originally
> talking about extending ext2, so such massive changes are out of question
> and your log n access argument is right.

A gigantic hash table has serious problems with non-locality of
reference.  Basically any regular access pattern you started with is
destroyed.  This is a problem with pageable RAM, let alone disks with
millisecond seek times.

-- Jamie

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21 23:13           ` Linus Torvalds
@ 2001-02-21 23:34             ` Davide Libenzi
  2001-02-21 23:59               ` Linus Torvalds
  2001-02-21 23:57             ` H. Peter Anvin
  2001-02-22  0:35             ` Ed Tomlinson
  2 siblings, 1 reply; 69+ messages in thread
From: Davide Libenzi @ 2001-02-21 23:34 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel


On 21-Feb-2001 Linus Torvalds wrote:
> In article <20010221023515.6DF8E18C99@oscar.casa.dyndns.org>,
> Ed Tomlinson  <tomlins@cam.org> wrote:
>>
>>The default in reiserfs is now the R5 hash, but you are right that lots of
>>efforts went 
>>into finding this hash.  This includes testing various hashes on real
>>directory 
>>structures to see which one worked best.  R5 won.
> 
> That's interesting.  The R5 hash is easily also the only one of the
> reiser hashes that might be useable for the generic VFS hashing.  It's
> not so different in spirit from the current one, and if you've done the
> work to test it, it's bound to be a lot better.
> 
> (The current VFS name hash is probably _really_ stupid - I think it's
> still my original one, and nobody probably ever even tried to run it
> through any testing.  For example, I bet that using a shift factor of 4
> is really bad, because it evenly divides a byte, which together with the
> xor means that you can really easily generate trivial bad cases). 
> 
> What did you use for a test-case? Real-life directory contents? Did you
> do any worst-case analysis too?

Yep, 4 is not good as a shifting factor.  Prime numbers are the better choice
for this stuff.
The issue in getting a good distribution is not only to have a good hashing
function, but also to give this function uncorrelated data.
A good hashing function for a domain A may not be so good for a domain B.




- Davide


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21 23:15                     ` H. Peter Anvin
@ 2001-02-21 23:42                       ` Daniel Phillips
  2001-02-21 23:52                         ` Davide Libenzi
       [not found]                       ` <3A945081.E6EB78F4@innominate.de>
                                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 69+ messages in thread
From: Daniel Phillips @ 2001-02-21 23:42 UTC (permalink / raw)
  To: H. Peter Anvin, linux-kernel, Martin Mares, Davide Libenzi

"H. Peter Anvin" wrote:
> 
> Martin Mares wrote:
> >
> > > True.  Note too, though, that on a filesystem (which we are, after all,
> > > talking about), if you assume a large linear space you have to create a
> > > file, which means you need to multiply the cost of all random-access
> > > operations with O(log n).
> >
> > One could avoid this, but it would mean designing the whole filesystem in a
> > completely different way -- merge all directories to a single gigantic
> > hash table and use (directory ID,file name) as a key, but we were originally
> > talking about extending ext2, so such massive changes are out of question
> > and your log n access argument is right.
> 
> It would still be tricky since you have to have actual files in the
> filesystem as well.

Have you looked at the structure and algorithms I'm using?  I would not
call this a hash table, nor is it a btree.  It's a 'hash-keyed
uniform-depth tree'.  It never needs to be rehashed (though it might be
worthwhile compacting it at some point).  It also never needs to be
rebalanced - it's only two levels deep for up to 50 million files.

This thing deserves a name of its own.  I call it an 'htree'.  The
performance should speak for itself - 150 usec/create across 90,000
files and still a few optimizations to go.

Random access runs at similar speeds too, it's not just taking advantage
of a long sequence of insertions into the same directory.

BTW, the discussion in this thread has been very interesting, it just
isn't entirely relevant to my patch :-)

--
Daniel

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
       [not found]                       ` <3A945081.E6EB78F4@innominate.de>
@ 2001-02-21 23:48                         ` H. Peter Anvin
  2001-02-22  1:22                           ` Daniel Phillips
  0 siblings, 1 reply; 69+ messages in thread
From: H. Peter Anvin @ 2001-02-21 23:48 UTC (permalink / raw)
  To: Daniel Phillips, Linux Kernel Mailing List

Daniel Phillips wrote:
> 
> Have you looked at the structure and algorithms I'm using?  I would not
> call this a hash table, nor is it a btree.  It's a 'hash-keyed
> uniform-depth tree'.  It never needs to be rehashed (though it might be
> worthwhile compacting it at some point).  It also never needs to be
> rebalanced - it's only two levels deep for up to 50 million files.
> 

I'm curious how you do that.  It seems each level would have to be 64K
large in order to do that, with a minimum disk space consumption of 128K
for a directory.  That seems extremely painful *except* in the case of
hysterically large directories, which tend to be the exception even on
filesystems where they occur.

I think I'd rather take the extra complexity and rebalancing cost of a
B-tree.

> This thing deserves a name of its own.  I call it an 'htree'.  The
> performance should speak for itself - 150 usec/create across 90,000
> files and still a few optmizations to go.
> 
> Random access runs at similar speeds too, it's not just taking advantage
> of a long sequence of insertions into the same directory.
> 
> BTW, the discussion in this thread has been very interesting, it just
> isn't entirely relevant to my patch :-)

	-hpa

-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21 23:42                       ` Daniel Phillips
@ 2001-02-21 23:52                         ` Davide Libenzi
  0 siblings, 0 replies; 69+ messages in thread
From: Davide Libenzi @ 2001-02-21 23:52 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Martin Mares, linux-kernel, H. Peter Anvin


On 21-Feb-2001 Daniel Phillips wrote:
> "H. Peter Anvin" wrote:
>> 
>> Martin Mares wrote:
>> >
>> > > True.  Note too, though, that on a filesystem (which we are, after all,
>> > > talking about), if you assume a large linear space you have to create a
>> > > file, which means you need to multiply the cost of all random-access
>> > > operations with O(log n).
>> >
>> > One could avoid this, but it would mean designing the whole filesystem in
>> > a
>> > completely different way -- merge all directories to a single gigantic
>> > hash table and use (directory ID,file name) as a key, but we were
>> > originally
>> > talking about extending ext2, so such massive changes are out of question
>> > and your log n access argument is right.
>> 
>> It would still be tricky since you have to have actual files in the
>> filesystem as well.
> 
> Have you looked at the structure and algorithms I'm using?  I would not
> call this a hash table, nor is it a btree.  It's a 'hash-keyed
> uniform-depth tree'.  It never needs to be rehashed (though it might be
> worthwhile compacting it at some point).  It also never needs to be
> rebalanced - it's only two levels deep for up to 50 million files.
> 
> This thing deserves a name of its own.  I call it an 'htree'.  The
> performance should speak for itself - 150 usec/create across 90,000
> files and still a few optmizations to go.
> 
> Random access runs at similar speeds too, it's not just taking advantage
> of a long sequence of insertions into the same directory.
> 
> BTW, the discussion in this thread has been very interesting, it just
> isn't entirely relevant to my patch :-)

Daniel,

I'm in no way saying that your algo is not good.
I use something very similar to it in my mail server ( XMail ) to index mail
queue files, using a two-level-deep fs splitting.
Mine was only a hint to try different types of directory indexing.



- Davide


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21 23:13           ` Linus Torvalds
  2001-02-21 23:34             ` Davide Libenzi
@ 2001-02-21 23:57             ` H. Peter Anvin
  2001-02-22  0:35             ` Ed Tomlinson
  2 siblings, 0 replies; 69+ messages in thread
From: H. Peter Anvin @ 2001-02-21 23:57 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <971i36$180$1@penguin.transmeta.com>
By author:    torvalds@transmeta.com (Linus Torvalds)
In newsgroup: linux.dev.kernel
> 
> (The current VFS name hash is probably _really_ stupid - I think it's
> still my original one, and nobody probably ever even tried to run it
> through any testing.  For example, I bet that using a shift factor of 4
> is really bad, because it evenly divides a byte, which together with the
> xor means that you can really easily generate trivial bad cases). 
> 

Actually, the VFS name hash I think is derived from the "Dragon Book"
hash (via autofs), so it's not like it's completely untested.

	-hpa
-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21 23:34             ` Davide Libenzi
@ 2001-02-21 23:59               ` Linus Torvalds
  0 siblings, 0 replies; 69+ messages in thread
From: Linus Torvalds @ 2001-02-21 23:59 UTC (permalink / raw)
  To: linux-kernel

In article <XFMail.20010221153438.davidel@xmailserver.org>,
Davide Libenzi  <davidel@xmailserver.org> wrote:
>
>Yep, 4 is not good as a shifting factor. Prime number are the better choice for
>this stuff.

Oh, absolutely.

It looks like the hash function was done rather early on in the dcache
lifetime (one of the first things), back when nobody cared about whether
it was really good or not because there were many much more complicated
questions like "how the h*ll will this all ever work" ;)

And at no point did anybody ever go back and verify whether the hash
function made much sense or not.

We had another boo-boo with the actual _folding_ of the "full" hash
value into the actual hash chain pointer that is done when the name
cache is actually looked up, which was even more embarrassing: even if
the hash ended up being ok, we would remove most of the valid bits from
it because it would under certain circumstances (512MB of RAM on x86)
basically xor itself with itself. 

That took quite a while to find too - the code still worked fine, it
just had a horrible distribution on machines with half a gig of memory.

>The issue to have a good distribution is not only to have a good hashing
>function, but also to give this function not correlated data.
>Good hashing function for a Domain A may not be so good for a Domain B.

This is not something we can do all that much about.  The data we get is
generated by the user, and can basically be a random string of
characters.  HOWEVER, there are certainly tons of _usual_ data, and
while there's no way to select the data we can at least try to make sure
that the distribution is good for the normal case (ie regular ASCII
filenames, not forgetting the fact that many people use more interesting
encodings)

		Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21 23:13           ` Linus Torvalds
  2001-02-21 23:34             ` Davide Libenzi
  2001-02-21 23:57             ` H. Peter Anvin
@ 2001-02-22  0:35             ` Ed Tomlinson
  2 siblings, 0 replies; 69+ messages in thread
From: Ed Tomlinson @ 2001-02-22  0:35 UTC (permalink / raw)
  To: linux-kernel

Linus Torvalds <torvalds@transmeta.com> wrote:
>
> Ed Tomlinson  <tomlins@cam.org> wrote:
> >The default in reiserfs is now the R5 hash, but you are right that lots of
> > efforts went into finding this hash.  This includes testing various
> > hashes on real directory structures to see which one worked best.  R5
> > won.
>
> That's interesting.  The R5 hash is easily also the only one of the
> reiser hashes that might be useable for the generic VFS hashing.  It's
> not so different in spirit from the current one, and if you've done the
> work to test it, it's bound to be a lot better.

It was not me personally.   I just remembered the thread (from June 2000) on
the reiserfs list...  I have summarized the results for you below.

For the program see: http://www.jedi.claranet.fr/hash_torture.tar.gz

Ed 

PS.  I am still seeing hangs (with 2.4.2pre2; then I switched to ac7 or so and
have had hangs with all the pre and ac kernels tried, and that is most of them).
ac20 plus the latest reiserfs fixes has stayed up 8 hours so far - it can take
two or three days to trigger the hang though.  When it hangs it is really dead:
a UPS connected via a serial port cannot shut it down, pings to the box fail,
Alt+SysRq is dead, and the software watchdog does not trigger a reboot.
Ideas?

> (The current VFS name hash is probably _really_ stupid - I think it's
> still my original one, and nobody probably ever even tried to run it
> through any testing.  For example, I bet that using a shift factor of 4
> is really bad, because it evenly divides a byte, which together with the
> xor means that you can really easily generate trivial bad cases).
>
> What did you use for a test-case? Real-life directory contents? Did you
> do any worst-case analysis too?
>
>                Linus


Some test results from June 2000, with Hans's summary first.
---------------------------------------------------------------
(reiserfs) Re: r5 hash
From: Hans Reiser <hans@reiser.to>
To: "Yury Yu. Rupasov" <yura@yura.polnet.botik.ru>
Cc: Jedi/Sector One <j@4u.net>, Petru Paler <ppetru@coltronix.com>, 
"reiserfs@devlinux.com" <reiserfs@devlinux.com>, Yury Shevchuk 
<sizif@botik.ru>


Ok, based on this benchmark let's put rupasov5 in, and warn users who choose
the currently used rupasov1 hash that rupasov5 has obsoleted it.  Do this in
both 3.6 and 3.5, and fix the delimiting key check bug in 3.5 REISERFS_CHECK
at the same time.  Cut the patch, start testing, and see if you can release by
Monday.  Make rupasov5 the default.  sizif, review the documentation he
creates for users.

Jedi, if you disagree with the benchmarks let me know.  You might try
concatenating two filenames together instead of adding a digit to them, or
running find on a really large FS, to improve these tests.  Thanks for helping
us with analyzing the different hash methods available Jedi.

Hans

---------------------------------------------------------------
(reiserfs) Re: r5 hash
From: "Yury Yu. Rupasov" <yura@yura.polnet.botik.ru>
To: Hans Reiser <hans@reiser.to>
Cc: Jedi/Sector One <j@4u.net>, Petru Paler <ppetru@coltronix.com>, 
"reiserfs@devlinux.com" <reiserfs@devlinux.com>, Yury Shevchuk 
<sizif@botik.ru>


Hans Reiser wrote:
> 
> What is the speed of the real filenames, not just the number of collisions.
> 



Ok, here are the results for real names:
# find / -type d -exec ls {} \; | sort | uniq > allfiles.txt

# wc -l allfiles.txt
161101 allfiles.txt

Collisions for 161 101 names:

tea_hash  : 784 total,  2 dangerous
jedi_hash2: 957 total,  2 dangerous 
r5_hash   :1191 total,  2 dangerous 
r7_hash   :8439 total, 18 dangerous


The speed for 161 101 real names :

create 161101 files of 10 bytes with names from allfiles.txt

# time create d1 allfiles.txt
# time cp d1 d2 -r
# time rm d1 -r

              create      copy        remove 
             --------------------------------
tea_hash   : 1m27.223s   5m43.069s  2m33.449s
jedi_hash2 : 1m26.062s   5m40.872s  2m32.795s
r5_hash    : 1m16.729s   4m14.967s  1m53.037s
r7_hash    : 1m10.665s   3m34.950s  1m39.756s


As you can see the results differ, but not too much. :)
The situation changes dramatically if we test 1 million files.

The same test, but with numbers from 0 to 6 appended to the end of each
name from allfiles.txt (1 127 707 files):
 
              create      copy        remove 
             --------------------------------
tea_hash   : 81m44.449s  
jedi_hash2 : 79m46.419s
r5_hash    : 15m56.037s
r7_hash    : 15m30.680s

Dual Celeron 500, 128 MB RAM, 8 GB scsi HDD
Reiserfs-3.5.21, Linux-2.2.15

Thanks,
Yura.
---------------------------------------------------------------
(reiserfs) Torture results
From: Jedi/Sector One <j@4u.net>
To: reiserfs@devlinux.com


  Here are the results of the hash torture on a Celeron 300.
  Once again, you can subtract 1 from the dangerous collisions numbers.
  Xuan, can you provide a test for the case the Rupasov hash was designed
for?
  Anyway, I don't really see why large directories should have similar
file names rather than keywords.

  Best regards,
-- 
         Frank DENIS aka Jedi/Sector One aka DJ Chrysalis <j@4u.net>
                 -> Software : http://www.jedi.claranet.fr <-
      If Bill Gates had a dime for every time a Windows box crashed...
                  ...oh, wait a minute -- he already does.


********************** /usr/dict/words test **********************

Trying with   45402 words


-------------[Benchmarking tea hash]-------------

Collisions : 45
Dangerous :       1      ffff980
Timing :

real     0m0.145s
user     0m0.120s
sys      0m0.010s

-------------[Benchmarking rupasov hash]-------------

Collisions : 553
Dangerous :       1      ffffe00
Timing :

real     0m0.297s
user     0m0.260s
sys      0m0.020s

-------------[Benchmarking r5 hash]-------------

Collisions : 185
Dangerous :       1      ffae000
Timing :

real     0m0.124s
user     0m0.080s
sys      0m0.030s

-------------[Benchmarking r7 hash]-------------

Collisions : 2528
Dangerous :       1      fffd400
Timing :

real     0m0.121s
user     0m0.100s
sys      0m0.000s

-------------[Benchmarking jedi hash]-------------

Collisions : 54
Dangerous :       1      fff9780
Timing :

real     0m0.122s
user     0m0.100s
sys      0m0.010s

-------------[Benchmarking jedi2 hash]-------------

Collisions : 93
Dangerous :       1      fff9780
Timing :

real     0m0.122s
user     0m0.090s
sys      0m0.020s

-------------[Benchmarking lookup2 hash]-------------

Collisions : 63
Dangerous :       1      ffff480
Timing :

real     0m0.123s
user     0m0.100s
sys      0m0.000s

********************** Squid names test **********************

Trying with  458752 squid cache entries

-------------[Benchmarking tea hash]-------------

Collisions : 6237
Dangerous :       1      fffff80
Timing :

real     0m1.138s
user     0m1.090s
sys      0m0.030s

-------------[Benchmarking rupasov hash]-------------

Collisions : 377520
Dangerous :       1      e32700
Timing :

real     0m2.588s
user     0m2.550s
sys      0m0.020s

-------------[Benchmarking r5 hash]-------------

Collisions : 309991
Dangerous :       1      55406b80
Timing :

real     0m0.940s
user     0m0.880s
sys      0m0.040s

-------------[Benchmarking r7 hash]-------------

Collisions : 449006
Dangerous :       2      22b16580
Timing :

real     0m0.928s
user     0m0.840s
sys      0m0.070s

-------------[Benchmarking jedi hash]-------------

Collisions : 2771
Dangerous :       1      fffef80
Timing :

real     0m0.928s
user     0m0.860s
sys      0m0.050s

-------------[Benchmarking jedi2 hash]-------------

Collisions : 0
Dangerous :       1      ffff80
Timing :

real     0m0.879s
user     0m0.810s
sys      0m0.050s

-------------[Benchmarking lookup2 hash]-------------

Collisions : 6203
Dangerous :       1      fffdc00
Timing :

real     0m0.930s
user     0m0.840s
sys      0m0.080s

********************** Real names test **********************

Trying with   89830 files

-------------[Benchmarking tea hash]-------------

Collisions : 237
Dangerous :       1      fff5580
Timing :

real     0m0.276s
user     0m0.250s
sys      0m0.000s

-------------[Benchmarking rupasov hash]-------------

Collisions : 6288
Dangerous :       1      ffee080
Timing :

real     0m0.582s
user     0m0.560s
sys      0m0.010s

-------------[Benchmarking r5 hash]-------------

Collisions : 3920
Dangerous :       1      fff4600
Timing :

real     0m0.230s
user     0m0.190s
sys      0m0.020s

-------------[Benchmarking r7 hash]-------------

Collisions : 11801
Dangerous :       1      fff580
Timing :

real     0m0.225s
user     0m0.180s
sys      0m0.030s

-------------[Benchmarking jedi hash]-------------

Collisions : 269
Dangerous :       1      fff9f80
Timing :

real     0m0.226s
user     0m0.200s
sys      0m0.010s

-------------[Benchmarking jedi2 hash]-------------

Collisions : 415
Dangerous :       1      fff9f80
Timing :

real     0m0.225s
user     0m0.200s
sys      0m0.010s

-------------[Benchmarking lookup2 hash]-------------

Collisions : 223
Dangerous :       1      ffff480
Timing :

real     0m0.230s
user     0m0.210s
sys      0m0.000s

----------------------------------------------------------------------------------------

(reiserfs) hash torture results
From: Petru Paler <ppetru@coltronix.com>
To: reiserfs@devlinux.com


Machine: AMD Athlon/650MHz, 128Mb RAM, Quantum Fireball lct15 IDE hdd
(UDMA/66 but that doesn't matter). Kernel 2.4.0-test1-ac10.

The results are interesting, but more interesting would be to see how fast
reiserfs actually is with each of these hashes.

Script output:

********************** /usr/dict/words test **********************

Trying with   45402 words


-------------[Benchmarking tea hash]-------------

Collisions : 45
Dangerous :       1      ffff980
Timing :
0.00user 0.01system 0:00.08elapsed 11%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (83major+13minor)pagefaults 0swaps

-------------[Benchmarking rupasov hash]-------------

Collisions : 553
Dangerous :       1      ffffe00
Timing :
0.00user 0.00system 0:00.18elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (83major+13minor)pagefaults 0swaps

-------------[Benchmarking r5 hash]-------------

Collisions : 185
Dangerous :       1      ffae000
Timing :
0.00user 0.00system 0:00.08elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (83major+13minor)pagefaults 0swaps

-------------[Benchmarking r7 hash]-------------

Collisions : 2528
Dangerous :       1      fffd400
Timing :
0.00user 0.01system 0:00.07elapsed 12%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (83major+13minor)pagefaults 0swaps

-------------[Benchmarking jedi hash]-------------

Collisions : 54
Dangerous :       1      fff9780
Timing :
0.00user 0.00system 0:00.08elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (83major+13minor)pagefaults 0swaps

-------------[Benchmarking jedi2 hash]-------------

Collisions : 93
Dangerous :       1      fff9780
Timing :
0.00user 0.00system 0:00.07elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (83major+13minor)pagefaults 0swaps

-------------[Benchmarking lookup2 hash]-------------

Collisions : 63
Dangerous :       1      ffff480
Timing :
0.00user 0.00system 0:00.07elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (83major+13minor)pagefaults 0swaps

********************** Squid names test **********************

Trying with  262144 squid cache entries

-------------[Benchmarking tea hash]-------------

Collisions : 2019
Dangerous :       1      ffff880
Timing :
0.00user 0.01system 0:00.47elapsed 2%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (83major+13minor)pagefaults 0swaps

-------------[Benchmarking rupasov hash]-------------

Collisions : 210912
Dangerous :       1      a88f00
Timing :
0.00user 0.02system 0:01.03elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (83major+13minor)pagefaults 0swaps

-------------[Benchmarking r5 hash]-------------

Collisions : 171912
Dangerous :       1      54ca7680
Timing :
0.00user 0.03system 0:00.41elapsed 7%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (83major+13minor)pagefaults 0swaps

-------------[Benchmarking r7 hash]-------------

Collisions : 256171
Dangerous :       6      22aa0600
Timing :
0.00user 0.03system 0:00.41elapsed 7%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (83major+13minor)pagefaults 0swaps

-------------[Benchmarking jedi hash]-------------

Collisions : 589
Dangerous :       1      fffda00
Timing :
0.00user 0.02system 0:00.42elapsed 4%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (83major+13minor)pagefaults 0swaps

-------------[Benchmarking jedi2 hash]-------------

Collisions : 0
Dangerous :       1      ffff80
Timing :
0.00user 0.00system 0:00.40elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (83major+13minor)pagefaults 0swaps

-------------[Benchmarking lookup2 hash]-------------

Collisions : 2041
Dangerous :       1      fffdc00
Timing :
0.00user 0.01system 0:00.40elapsed 2%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (83major+13minor)pagefaults 0swaps

********************** Real names test **********************

find: /proc/31112/fd/4: No such file or directory
Trying with   94836 files

-------------[Benchmarking tea hash]-------------

Collisions : 235
Dangerous :       1      fff5e80
Timing :
0.00user 0.00system 0:00.20elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (83major+13minor)pagefaults 0swaps

-------------[Benchmarking rupasov hash]-------------

Collisions : 2016
Dangerous :       1      fffab80
Timing :
0.01user 0.00system 0:00.46elapsed 2%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (83major+13minor)pagefaults 0swaps

-------------[Benchmarking r5 hash]-------------

Collisions : 495
Dangerous :       1      fff8780
Timing :
0.00user 0.00system 0:00.17elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (83major+13minor)pagefaults 0swaps

-------------[Benchmarking r7 hash]-------------

Collisions : 8162
Dangerous :       1      fff580
Timing :
0.00user 0.02system 0:00.17elapsed 11%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (83major+13minor)pagefaults 0swaps

-------------[Benchmarking jedi hash]-------------

Collisions : 331
Dangerous :       1      ffe400
Timing :
0.00user 0.00system 0:00.17elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (83major+13minor)pagefaults 0swaps

-------------[Benchmarking jedi2 hash]-------------

Collisions : 341
Dangerous :       1      ffe400
Timing :
0.00user 0.00system 0:00.17elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (83major+13minor)pagefaults 0swaps

-------------[Benchmarking lookup2 hash]-------------

Collisions : 298
Dangerous :       1      fffb700
Timing :
0.00user 0.00system 0:00.17elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (83major+13minor)pagefaults 0swaps

-Petru


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21 23:48                         ` H. Peter Anvin
@ 2001-02-22  1:22                           ` Daniel Phillips
  2001-02-22  1:42                             ` H. Peter Anvin
  2001-02-22  2:03                             ` Andreas Dilger
  0 siblings, 2 replies; 69+ messages in thread
From: Daniel Phillips @ 2001-02-22  1:22 UTC (permalink / raw)
  To: H. Peter Anvin, Linux-Kernel

"H. Peter Anvin" wrote:
> 
> Daniel Phillips wrote:
> >
> > Have you looked at the structure and algorithms I'm using?  I would not
> > call this a hash table, nor is it a btree.  It's a 'hash-keyed
> > uniform-depth tree'.  It never needs to be rehashed (though it might be
> > worthwhile compacting it at some point).  It also never needs to be
> > rebalanced - it's only two levels deep for up to 50 million files.
> 
> I'm curious how you do that.  It seems each level would have to be 64K
> large in order to do that, with a minimum disk space consumption of 128K
> for a directory.  That seems extremely painful *except* in the case of
> hysterically large directories, which tend to be the exception even on
> filesystems where they occur.

Easy: with an average dirent reclen of 16 bytes, each directory leaf block
can hold up to 256 entries.  Each index block indexes 512 directory
blocks and the root indexes 511 index blocks.  Assuming the leaves are
on average 75% full this gives:

	(4096 / 16) * 512 * 511 * .75 = 50,233,344

In practice I'm getting a little more than 90,000 entries indexed by a
*single* index block (the root) so I'm not just making this up.
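
To replay this arithmetic for other block sizes and tree depths, here is a
small stand-alone calculator.  It is only a sketch built from the figures
quoted in this thread (8-byte index entries, 16-byte average dirents, 75%
leaf fill, and a root that indexes one block fewer than a full index block),
not code from the patch:

/* Back-of-the-envelope htree capacity, using the thread's figures. */
#include <stdio.h>

static unsigned long htree_capacity(unsigned blocksize, int levels)
{
	unsigned long fanout = blocksize / 8;	/* 8-byte index entries per block */
	unsigned long leaves = fanout - 1;	/* root holds one entry less */
	int i;

	for (i = 1; i < levels; i++)
		leaves *= fanout;		/* each extra level multiplies by the fanout */

	/* dirents per leaf at 75% fill, times the number of leaf blocks */
	return (unsigned long)((blocksize / 16) * 0.75) * leaves;
}

int main(void)
{
	printf("4k, 2 levels: %lu\n", htree_capacity(4096, 2));	/* 50,233,344 */
	printf("1k, 1 level:  %lu\n", htree_capacity(1024, 1));	/* 6,096 */
	printf("1k, 2 levels: %lu\n", htree_capacity(1024, 2));	/* 780,288 */
	return 0;
}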

> I think I'd rather take the extra complexity and rebalancing cost of a
> B-tree.

Do you still think so?

--
Daniel

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-22  1:22                           ` Daniel Phillips
@ 2001-02-22  1:42                             ` H. Peter Anvin
  2001-02-22  2:03                             ` Andreas Dilger
  1 sibling, 0 replies; 69+ messages in thread
From: H. Peter Anvin @ 2001-02-22  1:42 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Linux-Kernel

Daniel Phillips wrote:
> 
> "H. Peter Anvin" wrote:
> >
> > Daniel Phillips wrote:
> > >
> > > Have you looked at the structure and algorithms I'm using?  I would not
> > > call this a hash table, nor is it a btree.  It's a 'hash-keyed
> > > uniform-depth tree'.  It never needs to be rehashed (though it might be
> > > worthwhile compacting it at some point).  It also never needs to be
> > > rebalanced - it's only two levels deep for up to 50 million files.
> >
> > I'm curious how you do that.  It seems each level would have to be 64K
> > large in order to do that, with a minimum disk space consumption of 128K
> > for a directory.  That seems extremely painful *except* in the case of
> > hysterically large directories, which tend to be the exception even on
> > filesystems where they occur.
> 
> Easy: with an average dirent reclen of 16 bytes, each directory leaf block
> can hold up to 256 entries.  Each index block indexes 512 directory
> blocks and the root indexes 511 index blocks.  Assuming the leaves are
> on average 75% full this gives:
> 
>         (4096 / 16) * 512 * 511 * .75 = 50,233,344
> 

That's a three-level tree, not a two-level tree.

> In practice I'm getting a little more than 90,000 entries indexed by a
> *single* index block (the root) so I'm not just making this up.
> 
> > I think I'd rather take the extra complexity and rebalancing cost of a
> > B-tree.
> 
> Do you still think so?

I think so.

	-hpa

-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-22  1:22                           ` Daniel Phillips
  2001-02-22  1:42                             ` H. Peter Anvin
@ 2001-02-22  2:03                             ` Andreas Dilger
  2001-02-22  2:41                               ` H. Peter Anvin
  2001-02-22  3:08                               ` Daniel Phillips
  1 sibling, 2 replies; 69+ messages in thread
From: Andreas Dilger @ 2001-02-22  2:03 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: H. Peter Anvin, Linux-Kernel

Daniel Phillips writes:
> Easy: with an average dirent reclen of 16 bytes, each directory leaf block
> can hold up to 256 entries.  Each index block indexes 512 directory
> blocks and the root indexes 511 index blocks.  Assuming the leaves are
> on average 75% full this gives:
> 
> 	(4096 / 16) * 512 * 511 * .75 = 50,233,344
> 
> In practice I'm getting a little more than 90,000 entries indexed by a
> *single* index block (the root) so I'm not just making this up.

I was just doing the math for 1k ext2 filesystems, and the numbers aren't
nearly as nice.  We get:

	(1024 / 16) * 127 * .75 = 6096		# 1 level
	(1024 / 16) * 128 * 127 * .75 = 780288	# 2 levels

Basically (IMHO) we will not really get any noticeable benefit with 1 level
index blocks for a 1k filesystem - my estimates at least are that the break
even point is about 5k files.  We _should_ be OK with 780k files in a single
directory for a while.  Looks like we will need 2-level indexes sooner than
you would think though.  Note that tests on my workstation showed an average
filename length of 10 characters (excluding MP3s at 78 characters), so this
would give 20-byte (or 88-byte) dirents for ext3, reducing the file count
to 4857 and 621792 (or 78183 and 40029696 for 4k filesystems) at 75% full.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21  0:22     ` Linus Torvalds
  2001-02-21  0:30       ` Alan Cox
  2001-02-21  1:01       ` Andreas Dilger
@ 2001-02-22  2:28       ` Daniel Phillips
  2001-02-22  3:30         ` Linus Torvalds
  2 siblings, 1 reply; 69+ messages in thread
From: Daniel Phillips @ 2001-02-22  2:28 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel

Linus Torvalds wrote:
> 
> On Tue, 20 Feb 2001, Daniel Phillips wrote:
> >
> > You mean full_name_hash?  I will un-static it and try it.  I should have
> > some statistics tomorrow.  I have a couple of simple metrics for
> > measuring the effectiveness of the hash function: the uniformity of
> > the hash space splitting (which in turn affects the average fullness
> > of directory leaves) and speed.
> 
> I was more thinking about just using "dentry->d_name->hash" directly, and
> not worrying about how that hash was computed. Yes, for ext2 it will have
> the same value as "full_name_hash" - the difference really being that
> d_hash has already been precomputed for you anyway.
> 
> > Let the hash races begin.
> 
> Note that dentry->d_name->hash is really quick (no extra computation), but
> I'm not claiming that it has anything like a CRC quality. And it's
> probably a bad idea to use it, because in theory at least the VFS layer
> might decide to switch the hash function around. I'm more interested in
> hearing whether it's a good hash, and maybe we could improve the VFS hash
> enough that there's no reason to use anything else..

In the first heat of hash races - creating 20,000 files in one directory
- dentry::hash lost out to my original hack::dx_hash, causing a high
percentage of leaf blocks to remain exactly half full and slowing down
the whole thing by about 5%.  (This was under uml - I haven't tried it
native yet but I expect the results to be similar.)

	  Contender			Result
	  =========			======
	dentry::hash		Average fullness = 2352 (57%)
	hack::dx_hash		Average fullness = 2758 (67%)

This suggests that dentry::hash is producing distinctly non-dispersed
results and needs to be subjected to further scrutiny.  I'll run the
next heat of hash races tomorrow, probably with R5, and CRC32 too if I
have time.

--
Daniel

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-22  2:03                             ` Andreas Dilger
@ 2001-02-22  2:41                               ` H. Peter Anvin
  2001-02-22  3:43                                 ` Daniel Phillips
  2001-02-22  3:08                               ` Daniel Phillips
  1 sibling, 1 reply; 69+ messages in thread
From: H. Peter Anvin @ 2001-02-22  2:41 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Daniel Phillips, Linux-Kernel

Andreas Dilger wrote:
> 
> Basically (IMHO) we will not really get any noticeable benefit with 1 level
> index blocks for a 1k filesystem - my estimates at least are that the break
> even point is about 5k files.  We _should_ be OK with 780k files in a single
> directory for a while.
>

I've had a news server with 2000000 files in one directory.  Such a
filesystem is likely to use small blocks, too, because each file is
generally small.

This is an important connection: filesystems which have lots and lots of
small files will have large directories and small block sizes.

	-hpa

-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-22  2:03                             ` Andreas Dilger
  2001-02-22  2:41                               ` H. Peter Anvin
@ 2001-02-22  3:08                               ` Daniel Phillips
  2001-02-22  8:06                                 ` [rfc] [LONG] " Andreas Dilger
  1 sibling, 1 reply; 69+ messages in thread
From: Daniel Phillips @ 2001-02-22  3:08 UTC (permalink / raw)
  To: Andreas Dilger, Linux-Kernel

Andreas Dilger wrote:
> 
> Daniel Phillips writes:
> > Easy: with an average dirent reclen of 16 bytes, each directory leaf block
> > can hold up to 256 entries.  Each index block indexes 512 directory
> > blocks and the root indexes 511 index blocks.  Assuming the leaves are
> > on average 75% full this gives:
> >
> >       (4096 / 16) * 512 * 511 * .75 = 50,233,344
> >
> > In practice I'm getting a little more than 90,000 entries indexed by a
> > *single* index block (the root) so I'm not just making this up.
> 
> I was just doing the math for 1k ext2 filesystems, and the numbers aren't
> nearly as nice.  We get:
> 
>         (1024 / 16) * 127 * .75 = 6096          # 1 level
>         (1024 / 16) * 128 * 127 * .75 = 780288  # 2 levels
> 
> Basically (IMHO) we will not really get any noticeable benefit with 1 level
> index blocks for a 1k filesystem - my estimates at least are that the break
> even point is about 5k files.  We _should_ be OK with 780k files in a single
> directory for a while.  Looks like we will need 2-level indexes sooner than
> you would think though.  Note that tests on my workstation showed an average
> filename length of 10 characters (excluding MP3s at 78 characters), so this
> would give 20-byte (or 88-byte) dirents for ext3, reducing the file count
> to 4857 and 621792 (or 78183 and 40029696 for 4k filesystems) at 75% full.

But you are getting over 3/4 million files in one directory on a 1K
blocksize system, and you really shouldn't be using 1K blocks on a
filesystem under that big a load.  Is it just to reduce tail block
fragmentation?  That's what tail merging is for - it does a much better
job than shrinking the block size.

But if you are *determined* to use 1K blocks and have more than 1/2
million files in one directory then I suppose a 3rd level is what you
need.  The uniform-depth tree still works just fine and still doesn't
need to be rebalanced - it's never out of balance.

--
Daniel

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-22  2:28       ` Daniel Phillips
@ 2001-02-22  3:30         ` Linus Torvalds
  2001-02-22 16:33           ` Chris Mason
  2001-02-22 22:30           ` Daniel Phillips
  0 siblings, 2 replies; 69+ messages in thread
From: Linus Torvalds @ 2001-02-22  3:30 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel



On Thu, 22 Feb 2001, Daniel Phillips wrote:
> 
> In the first heat of hash races - creating 20,000 files in one directory
> - dentry::hash lost out to my original hack::dx_hash, causing a high
> percentage of leaf blocks to remain exactly half full and slowing down
> the whole thing by about 5%.  (This was under uml - I haven't tried it
> native yet but I expect the results to be similar.)
> 
> 	  Contender			Result
> 	  =========			======
> 	dentry::hash		Average fullness = 2352 (57%)
> 	hack::dx_hash		Average fullness = 2758 (67%)
> 
> This suggests that dentry::hash is producing distinctly non-dispersed
> results and needs to be subjected to further scrutiny.  I'll run the
> next heat of hash races tomorrow, probably with R5, and CRC32 too if I
> have time.

I'd love to hear the results from R5, as that seems to be the reiserfs
favourite, and I'm trying it out in 2.4.2 because it was so easy to plug
in..

		Linus


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-22  2:41                               ` H. Peter Anvin
@ 2001-02-22  3:43                                 ` Daniel Phillips
  2001-02-22  4:02                                   ` Linus Torvalds
                                                     ` (2 more replies)
  0 siblings, 3 replies; 69+ messages in thread
From: Daniel Phillips @ 2001-02-22  3:43 UTC (permalink / raw)
  To: H. Peter Anvin, linux-kernel

"H. Peter Anvin" wrote:
> 
> Andreas Dilger wrote:
> >
> > Basically (IMHO) we will not really get any noticeable benefit with 1 level
> > index blocks for a 1k filesystem - my estimates at least are that the break
> > even point is about 5k files.  We _should_ be OK with 780k files in a single
> > directory for a while.
> >
> 
> I've had a news server with 2000000 files in one directory.  Such a
> filesystem is likely to use small blocks, too, because each file is
> generally small.
> 
> This is an important connection: filesystems which have lots and lots of
> small files will have large directories and small block sizes.

I mentioned this earlier but it's worth repeating: the desire to use a
small block size is purely an artifact of the fact that ext2 has no
handling for tail block fragmentation.  That's a temporary situation -
once we've dealt with it your 2,000,000 file directory will be happier
with 4K filesystem blocks.  There will be a lot fewer metadata index
blocks in your directory file, for one thing.  Another practical matter
is that 4K filesystem blocks map directly to 4K PAGE_SIZE and are as a
result friendlier to the page cache and memory manager.

--
Daniel

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-22  3:43                                 ` Daniel Phillips
@ 2001-02-22  4:02                                   ` Linus Torvalds
  2001-02-22  5:19                                     ` Linus Torvalds
  2001-02-22  4:02                                   ` H. Peter Anvin
  2001-02-22  4:03                                   ` H. Peter Anvin
  2 siblings, 1 reply; 69+ messages in thread
From: Linus Torvalds @ 2001-02-22  4:02 UTC (permalink / raw)
  To: linux-kernel

In article <3A948ACB.7B55BEAE@innominate.de>,
Daniel Phillips  <phillips@innominate.de> wrote:
>
>I mentioned this earlier but it's worth repeating: the desire to use a
>small block size is purely an artifact of the fact that ext2 has no
>handling for tail block fragmentation.  That's a temporary situation -
>once we've dealt with it your 2,000,000 file directory will be happier
>with 4K filesystem blocks.

I'd rather see a whole new filesystem than have ext2 do tail-block
fragmentation. 

Once you do tail fragments, you might as well do the whole filesystem
over and have it do fancier stuff than just handling sub-blocking. 

Another way of saying this: if you go to the complexity of no longer
being a purely block-based filesystem, please go the whole way. Make the
thing be extent-based, and get away from the notion that you have to
allocate blocks one at a time. Make the blocksize something nice and
big, not just 4kB or 8kB or something.

And don't call it ext2. 

		Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-22  3:43                                 ` Daniel Phillips
  2001-02-22  4:02                                   ` Linus Torvalds
@ 2001-02-22  4:02                                   ` H. Peter Anvin
  2001-02-22  7:03                                     ` Andreas Dilger
  2001-02-22  4:03                                   ` H. Peter Anvin
  2 siblings, 1 reply; 69+ messages in thread
From: H. Peter Anvin @ 2001-02-22  4:02 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel

Daniel Phillips wrote:
> 
> "H. Peter Anvin" wrote:
> >
> > Andreas Dilger wrote:
> > >
> > > Basically (IMHO) we will not really get any noticeable benefit with 1 level
> > > index blocks for a 1k filesystem - my estimates at least are that the break
> > > even point is about 5k files.  We _should_ be OK with 780k files in a single
> > > directory for a while.
> > >
> >
> > I've had a news server with 2000000 files in one directory.  Such a
> > filesystem is likely to use small blocks, too, because each file is
> > generally small.
> >
> > This is an important connection: filesystems which have lots and lots of
> > small files will have large directories and small block sizes.
> 
> I mentioned this earlier but it's worth repeating: the desire to use a
> small block size is purely an artifact of the fact that ext2 has no
> handling for tail block fragmentation.  That's a temporary situation -
> once we've dealt with it your 2,000,000 file directory will be happier
> with 4K filesystem blocks.  There will be a lot fewer metadata index
> blocks in your directory file, for one thing.  Another practical matter
> is that 4K filesystem blocks map directly to 4K PAGE_SIZE and are as a
> result friendlier to the page cache and memory manager.
> 

Well, that's something I really don't expect to see anymore -- this
"purely temporary situation" is now already 7 years old at least.

	-hpa

-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-22  3:43                                 ` Daniel Phillips
  2001-02-22  4:02                                   ` Linus Torvalds
  2001-02-22  4:02                                   ` H. Peter Anvin
@ 2001-02-22  4:03                                   ` H. Peter Anvin
  2001-02-22 10:35                                     ` Alan Cox
  2 siblings, 1 reply; 69+ messages in thread
From: H. Peter Anvin @ 2001-02-22  4:03 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel

Daniel Phillips wrote:
> 
> There will be a lot fewer metadata index
> blocks in your directory file, for one thing.
> 

Oh yes, another thing: a B-tree directory structure does not need
metadata index blocks.

	-hpa

-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-22  4:02                                   ` Linus Torvalds
@ 2001-02-22  5:19                                     ` Linus Torvalds
  2001-02-22 11:31                                       ` Ingo Oeser
  0 siblings, 1 reply; 69+ messages in thread
From: Linus Torvalds @ 2001-02-22  5:19 UTC (permalink / raw)
  To: linux-kernel

In article <97230a$16k$1@penguin.transmeta.com>,
Linus Torvalds <torvalds@transmeta.com> wrote:
>
>Another way of saying this: if you go to the complexity of no longer
>being a purely block-based filesystem, please go the whole way. Make the
>thing be extent-based, and get away from the notion that you have to
>allocate blocks one at a time. Make the blocksize something nice and
>big, not just 4kB or 8kB or something.

Btw, this is also going to be a VM and performance issue some time in
the future.  There are already CPUs that would _love_ to have 64kB
pages etc, and as such a filesystem that doesn't play with the old silly
"everything is a block" rules would be much appreciated by the kind of
people who have multi-gigabyte files and want to read in big chunks at a
time. 

So either you have a simple block-based filesystem (current ext2, no
extents, no crapola), or you decide to do it over.  Don't do some
half-way thing, please. 

		Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2
  2001-02-20 15:04 [rfc] Near-constant time directory index for Ext2 Daniel Phillips
  2001-02-20 20:03 ` Linus Torvalds
  2001-02-21 17:21 ` Davide Libenzi
@ 2001-02-22  6:23 ` tytso
  2001-02-22  7:24   ` Daniel Phillips
  2001-02-22 13:20   ` tytso
  2001-02-22 18:38 ` Kai Henningsen
  3 siblings, 2 replies; 69+ messages in thread
From: tytso @ 2001-02-22  6:23 UTC (permalink / raw)
  To: phillips; +Cc: Linux-kernel, adilger, hch, ext2-devel

Daniel,

Nice work!

A couple of comments.  If you make the beginning of each index block
look like an empty directory block (i.e., the first 8 bytes look like
this):

	32 bits: ino == 0
	16 bits: rec_len == blocksize
	16 bits: name_len = 0

... then you will have full backwards compatibility, both for reading
*and* writing.  When reading, old kernels will simply ignore the index
blocks, since it looks like it has an unpopulated directory entry.  And
if the kernel attempts to write into the directory, it will clear the
BTREE_FL flag, in which case new kernels won't treat the directory as a
tree anymore.  (Running a smart e2fsck which knows about directory trees
will be able to restore the tree structure).
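
For concreteness, here is a sketch of what laying down that fake entry could
look like.  The field widths follow the 32/16/16-bit layout above, but the
struct and function names are only illustrative, and the little-endian byte
order of on-disk ext2 is glossed over:

#include <stdint.h>
#include <string.h>

/* An 8-byte entry that old ext2 code reads as one big empty dirent. */
struct fake_dirent {
	uint32_t inode;		/* 0 means "unused entry" */
	uint16_t rec_len;	/* blocksize: the entry spans the whole block */
	uint16_t name_len;	/* 0: no name */
};

/* Stamp an index block so pre-index kernels simply skip over it. */
void init_index_block(void *block, uint16_t blocksize)
{
	struct fake_dirent fake = { 0, blocksize, 0 };

	memset(block, 0, blocksize);
	memcpy(block, &fake, sizeof fake);
	/* the index header and hash/block pairs would follow at offset 8 */
}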

Is it worth it?  Well, it means you lose an index entry from each
directory block, thus reducing your fanout at each node of the tree by a
worst case of 0.7% (1k blocksize) and 0.2% if you're using 4k
blocksizes.

						- Ted


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-22  4:02                                   ` H. Peter Anvin
@ 2001-02-22  7:03                                     ` Andreas Dilger
  0 siblings, 0 replies; 69+ messages in thread
From: Andreas Dilger @ 2001-02-22  7:03 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Daniel Phillips, linux-kernel

HPA writes:
> Daniel Phillips wrote:
> > I mentioned this earlier but it's worth repeating: the desire to use a
> > small block size is purely an artifact of the fact that ext2 has no
> > handling for tail block fragmentation.  That's a temporary situation -
> > once we've dealt with it your 2,000,000 file directory will be happier
> > with 4K filesystem blocks.  There will be a lot fewer metadata index
> > blocks in your directory file, for one thing.  Another practical matter
> > is that 4K filesystem blocks map directly to 4K PAGE_SIZE and are as a
> > result friendlier to the page cache and memory manager.
> > 
> 
> Well, that's something I really don't expect to see anymore -- this
> "purely temporary situation" is now already 7 years old at least.

Peter, you're barking up the wrong tree - Daniel has had an ext2 tail
merging patch around for 6 months or more...  However, from the sounds
of it, Linus may not want such a thing in ext2 (at least not until he
is convinced otherwise).  It will be interesting to compare ext2 +
ongoing patches vs. new filesystems like reiserfs, XFS, JFS --  not only
speed, but reliability as well.  XFS and JFS have previous implementations
to work with (although the JFS code is not the AIX JFS code), but reiserfs
has a long way to go, just from the standpoint of being run on millions
of machines, and being looked at by thousands of programmers.

I think people will be surprised at how ext2 + patches will continue to
improve.  One of the reasons (despite Linus' misgivings, IMHO) is that
ext2 is continually being improved by small measures, has lots of eyes
on the code, and it offers a stable base for each improvement - which
means each improvement is stable and reliable much quicker than if you
were to code a new filesystem from scratch for each new feature.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21 23:15                     ` H. Peter Anvin
  2001-02-21 23:42                       ` Daniel Phillips
       [not found]                       ` <3A945081.E6EB78F4@innominate.de>
@ 2001-02-22  7:20                       ` Bill Wendling
  2001-02-22  8:34                       ` Rogier Wolff
  3 siblings, 0 replies; 69+ messages in thread
From: Bill Wendling @ 2001-02-22  7:20 UTC (permalink / raw)
  To: Linux Kernel Mailing List

Also sprach H. Peter Anvin:
} Martin Mares wrote:
} > 
} > Hello!
} > 
} > > True.  Note too, though, that on a filesystem (which we are, after all,
} > > talking about), if you assume a large linear space you have to create a
} > > file, which means you need to multiply the cost of all random-access
} > > operations with O(log n).
} > 
} > One could avoid this, but it would mean designing the whole filesystem in a
} > completely different way -- merge all directories to a single gigantic
} > hash table and use (directory ID,file name) as a key, but we were originally
} > talking about extending ext2, so such massive changes are out of question
} > and your log n access argument is right.
} > 
} 
} It would still be tricky since you have to have actual files in the
} filesystem as well.
} 
But that's just a user space issue, isn't it.

(Just kidding :-)

-- 
|| Bill Wendling			wendling@ganymede.isdn.uiuc.edu

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2
  2001-02-22  6:23 ` [Ext2-devel] " tytso
@ 2001-02-22  7:24   ` Daniel Phillips
  2001-02-22 13:20   ` tytso
  1 sibling, 0 replies; 69+ messages in thread
From: Daniel Phillips @ 2001-02-22  7:24 UTC (permalink / raw)
  To: tytso, phillips; +Cc: Linux-kernel, adilger, hch, ext2-devel

On Thu, 22 Feb 2001, tytso@valinux.com wrote:
> A couple of comments.  If you make the beginning of each index block
> look like an empty directory block (i.e., the first 8 bytes look like
> this):
> 
> 	32 bits: ino == 0
> 	16 bits: rec_len == blocksize
> 	16 bits: name_len = 0
> 
> ... then you will have full backwards compatibility, both for reading
> *and* writing.  When reading, old kernels will simply ignore the index
> blocks, since it looks like it has an unpopulated directory entry.  And
> if the kernel attempts to write into the directory, it will clear the
> BTREE_FL flag, in which case new kernels won't treat the directory as a
> tree anymore.  (Running a smart e2fsck which knows about directory trees
> will be able to restore the tree structure).

:-)  That's really nice, now I see what you were thinking about with
all those bit clears.

> Is it worth it?  Well, it means you lose an index entry from each
> directory block, thus reducing your fanout at each node of the tree by a
> worst case of 0.7% (1k blocksize) and 0.2% if you're using 4k
> blocksizes.

I'll leave that up to somebody else - we now have two alternatives, the
100%, no-compromise INCOMPAT solution, and the slightly-bruised but
still largely intact forward compatible solution.  I'll maintain both
solutions in the code for now so it's just as easy to choose either in the end.

-- 
Daniel

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] [LONG] Near-constant time directory index for Ext2
  2001-02-22  3:08                               ` Daniel Phillips
@ 2001-02-22  8:06                                 ` Andreas Dilger
  0 siblings, 0 replies; 69+ messages in thread
From: Andreas Dilger @ 2001-02-22  8:06 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Linux kernel development list, Christoph Hellwig,
	Theodore Y. Ts'o, Ext2 development mailing list

Daniel Phillips writes:
> Andreas Dilger wrote:
> > I was just doing the math for 1k ext2 filesystems, and the numbers aren't
> > nearly as nice.  We get:
> > 
> >         (1024 / 16) * 127 * .75 = 6096          # 1 level
> >         (1024 / 16) * 128 * 127 * .75 = 780288  # 2 levels
> 
> But if you are *determined* to use 1K blocks and have more than 1/2
> million files in one directory then I suppose a 3rd level is what you
> need.  The uniform-depth tree still works just fine and still doesn't
> need to be rebalanced - it's never out of balance.

I would rather simply go to some chained block scheme at that point.
ext2 is already fairly fast at linear searching, so if we index a HUGE
directory we are still linearly searching only 1/2^16 of the directory
(at worst for 1k blocks, 1/2^20 for 4k blocks).

I just had a clever idea - on a single-level index you put the header
and index data in block 0, and put the directory data in the first
indirect block (11 sparse blocks, instead of 511).  If you need to go
to a second-level index, you can simply shift the indirect data block to
be a double-indirect block, and start the level-2 index in the first
indirect block.  If we ever need a third-level index, you basically do
the same thing - move the double-indirect blocks to triple-indirect,
and put the level-3 index in the double-indirect block.  It will always
fit, because the index branching level is 1/2 of the indirect block
branching level because the index has the extra 4-byte hash values.

Andreas:
>> One thing I was thinking was that you could put "." and ".." in the first
>> block (like usual), and then put the index data after that.  This way
>> "." and ".." still exist and e2fsck and the kernel code doesn't complain,
>> except about the sparse directory blocks.

Daniel:
>The kernel code - ext2 fs that is - doesn't complain at the moment
>because I removed the complaint, and everything seems to be fine.  All
>references to "." and ".." are now intercepted and never reach the
>filesystem level.  If they did then I'd just fix ext2_is_empty_dir to
>tolerate those entries being somewhere other than the first block. 
>But, reading ahead, I see you are talking about forward compatibility...

One of the (many) benefits of ext2 is that it has tried to maintain
compatibility as much as possible, if possible.  In this case, I
don't see that there is an overwhelming reason to NOT keep compatibility,
and I think Ted agrees:

Ted Ts'o writes:
> E2fsck uses '..' to be able to climb up the directory tree when it needs
> to print a pathname given only a directory inode.  So yes, removing it 
> will cause e2fsck to not work as well.  '.' is not as useful, but it's
> useful as a sanity check.  

> Of course, if we completely give up on compatibility, we don't actually
> need to have special directory entries for '.' and '..' complete with
> their names; we can just store the inode numbers for both in a 32bit
> field along with the indexes.  (And a magic number for sanity checking;
> magic numbers are good things....)

Having real dirents for "." and ".." only costs 16 more bytes (2 index
leaves), compared to only keeping the inode numbers.

Andreas:
> > So, we would have (for the root entry, let's say):
(in directory block 0)

> > ext2_dir_entry_2{ EXT2_ROOT_INO, 12, 1, EXT2_FT_DIR, ".\0\0\0"}
> > ext2_dir_entry_2{ EXT2_ROOT_INO, <blocksize> - 12, 2, EXT2_FT_DIR, "..\0\0"}
> > <index magic (maybe)>
> > <index header>
> > <index data>
> > 
> > For the index ext2 kernel code, it would notice the EXT2_INDEX_FL and
> > access the data after the end of the ".." dir entry, and this would also
> > give you read-only "compatibility" of sorts with older kernels (modulo
> > calling ext2_error() for all of the sparse blocks before the start of the
> > actual directory data blocks).  You lose 24 bytes of data in the first
> > block, but gain some compatibility.  For second-level index blocks, if you
> > want to keep compatibility you lose 8 bytes each block if you start with:
> > 
> > ext2_dir_entry_2 { 0, <blocksize>, 0, EXT2_FT_DIR, "" }
> > <index magic (maybe)>
> > <second level index data>

Daniel:
> I really think INCOMPAT is the way to go and if you must mount it with
> an old kernel, do a fsck.  Old fsck manages to put things back into a
> form it can understand without too much difficulty, though you do have
> to answer quite a few questions.  The exact answers you give don't seem
> to be that important.

You don't always have the luxury to go back to an old kernel (for whatever
reason), and if an incompat flag is set the kernel will refuse to mount
your old filesystem.  If this is your root, you can't even run fsck.  Yes,
I know - have a rescue disk/partition - but _sometimes_ you are just stuck
and it would be good to not get into that situation in the first place.

Andreas:
> > Will there be a lower limit at which you create indexed directories?

Daniel:
> Yes, I hashed that out today with Al Viro on #kernelnewbies.  The
> breakeven happens at 3 directory blocks.

Andreas:
> > I guess the tradeoff is if you have to index all of the existing entries
> > in a non-indexed directory.  However, you need to handle this anyways if
> > you are updating an existing non-indexed directory from an old filesystem.

Daniel:
> If I do the optimization just for the first directory block then it's
> very nearly free - just one extra read of the first directory block,
> and it's almost certainly in cache anyway because it was just read to
> see if the entry already exists.

But you still need to handle the case for an arbitrary-sized non-indexed
directory, if you want to be able to upgrade an existing ext2 filesystem.
Since you need this, you may as well only turn on indexing when you are
actually getting a speed benefit, because doing anything else still
wastes space.  It may even be that indexing a large existing directory
and _then_ doing the lookup is still faster than doing the lookup on the
original un-indexed directory...

Ted writes:
> A couple of comments.  If you make the beginning of each index block
> look like an empty directory block:
> 
> 	32 bits: ino == 0
> 	16 bits: rec_len == blocksize
> 	16 bits: name_len = 0

This is what I also suggested for second-level index blocks above.
However, for a single-level index, blocks 1-511 (1-127 on a 1k filesystem)
will be sparse, because they will be unused - we don't want to have 511
(or 127) real empty dir blocks just for compatibility on a single-level
index.  The ext2 dir code handles the case of a sparse directory block
with an ext2_error() and continues.  By default ext2_error() is just
a printk, and on the only system I have seen where it is otherwise
(Debian), it is remount-ro for root only.

> ... then you will have full backwards compatibility, both for reading
> *and* writing.  When reading, old kernels will simply ignore the index
> blocks, since it looks like it has an unpopulated directory entry.  And
> if the kernel attempts to write into the directory, it will clear the
> BTREE_FL flag, in which case new kernels won't treat the directory as a
> tree anymore.

Yes, I had something like this on the tip of my brain as well.  When you
boot with a non-index ext2 kernel, it will naturally find free space in
the first block, immediately after "." and ".." (with the setup above).
Not only will it clear BTREE_FL, it will also overwrite the index magic
(if we have one) so we definitely know that the index is not valid.
Since the index head is only using 4 of the 8 bytes needed for alignment,
we could stick in a 4 byte magic before or after the index header, and
still be assured that it will be overwritten by a new dirent.

Full COMPAT support would be a win, IMHO.  You could leave it to e2fsck
to do reindexing, or the next time a file is added to (or even removed
from) a candidate directory it could do the reindexing, which it needs
to be able to do for compatibility with old filesystems.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21 23:15                     ` H. Peter Anvin
                                         ` (2 preceding siblings ...)
  2001-02-22  7:20                       ` [rfc] " Bill Wendling
@ 2001-02-22  8:34                       ` Rogier Wolff
  3 siblings, 0 replies; 69+ messages in thread
From: Rogier Wolff @ 2001-02-22  8:34 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Martin Mares, linux-kernel

H. Peter Anvin wrote:
> Martin Mares wrote:
> > 
> > Hello!
> > 
> > > True.  Note too, though, that on a filesystem (which we are, after all,
> > > talking about), if you assume a large linear space you have to create a
> > > file, which means you need to multiply the cost of all random-access
> > > operations with O(log n).
> > 
> > One could avoid this, but it would mean designing the whole filesystem in a
> > completely different way -- merge all directories to a single gigantic
> > hash table and use (directory ID,file name) as a key,

Novell, NTFS, HFS all do this. 

				Roger. 

-- 
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots. 
* There are also old, bald pilots. 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-22  4:03                                   ` H. Peter Anvin
@ 2001-02-22 10:35                                     ` Alan Cox
  2001-02-23  0:59                                       ` Felix von Leitner
  0 siblings, 1 reply; 69+ messages in thread
From: Alan Cox @ 2001-02-22 10:35 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Daniel Phillips, linux-kernel

> Daniel Phillips wrote:
> > 
> > There will be a lot fewer metadata index
> > blocks in your directory file, for one thing.
> > 
> 
> Oh yes, another thing: a B-tree directory structure does not need
> metadata index blocks.

Before people get excited about complex tree directory indexes, remember to 
solve the other 95% before implementation - recovering from lost blocks,
corruption and the like

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-22  5:19                                     ` Linus Torvalds
@ 2001-02-22 11:31                                       ` Ingo Oeser
  2001-02-22 18:20                                         ` Linus Torvalds
  0 siblings, 1 reply; 69+ messages in thread
From: Ingo Oeser @ 2001-02-22 11:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Hi Linus,
Hi LKML people,

On Wed, Feb 21, 2001 at 09:19:45PM -0800, Linus Torvalds wrote:
> In article <97230a$16k$1@penguin.transmeta.com>,
> Linus Torvalds <torvalds@transmeta.com> wrote:
> >allocate blocks one at a time. Make the blocksize something nice and
> >big, not just 4kB or 8kB or something.
> 
> Btw, this is also going to be a VM and performance issue some time in
> the future.  There are already CPUs that would _love_ to have 64kB
> pages etc, and as such a filesystem that doesn't play with the old silly
> "everything is a block" rules would be much appreciated by the kind of
> people who have multi-gigabyte files and want to read in big chunks at a
> time. 
 
For this we need a block remapper layer that can map any
blocksize n to any blocksize m with only the following constraints:

   - n and m are powers of 2
   - n is a multiple of m

Both should use the page cache (of size p) of course, so it
becomes 2 layers, if n > p.

   -  translating a buffer of n into some pages
   -  translating a page into buffers of m (current buffercache)

We could limit the translation to 5 powers of 2 above and 5 powers of 2
below PAGE_CACHE_SIZE so that we can maintain a validity bitmap
(2^5 = 32 bits) for each layer if access is too expensive[1].
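
As a concrete illustration of that bitmap (the names here are invented, and
this is nothing more than a sketch of the idea): limiting the size ratio to
2^5 means a large block never contains more than 32 sub-blocks, so one
32-bit word per block is enough to record which sub-blocks are valid:

#include <stdint.h>

/* One large block split into at most 32 smaller sub-blocks; bit i of
 * "valid" says whether sub-block i holds up-to-date data, so a partial
 * update need not read-modify-write the whole large block. */
struct remap_state {
	uint32_t valid;
};

static inline void remap_mark_valid(struct remap_state *s, unsigned int i)
{
	s->valid |= UINT32_C(1) << i;
}

static inline int remap_is_valid(const struct remap_state *s, unsigned int i)
{
	return (s->valid >> i) & 1u;
}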

Some subsystems could certainly benefit from it.

   -  loop device (with all the crypto stuff)
   -  LVM
   -  FSes that support block sizes != PAGE_CACHE_SIZE
   -  Devices with blocksize != 512 (they don't have to care
      about being special anymore). There are even some rumors
      about very perverse blocksizes of 1M and the like.

Since these remapped buffers will look like merged requests, I
see even no problems with the elevator any more.

The question is, where we implement this infrastructure, esp. if
we consider the last user (devices with blocksize != 512).

This has to be answered by the main architects of Linux before
anyone could start.

> So either you have a simple block-based filesystem (current ext2, no
> extents, no crapola), or you decide to do it over.  Don't do some
> half-way thing, please. 

Daniel (and others) uses ext2 as a playground, because it is
implemented, tested and not that hard to understand and verify.

Hope they will switch to a design of their own later, once they have
sufficiently played around with^W^W^Wtested their ideas.

Regards

Ingo Oeser

[1] In buffer cache we use read-modify-write for partial pages,
   which hurts performance for them and is annoying for media
   with limited write cycles like flash and CD-RW[2].

[2] Yes I know about packet writing mode ;-)
-- 
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
         <<<<<<<<<<<<       come and join the fun       >>>>>>>>>>>>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2
  2001-02-22  6:23 ` [Ext2-devel] " tytso
  2001-02-22  7:24   ` Daniel Phillips
@ 2001-02-22 13:20   ` tytso
  2001-02-22 18:16     ` Andreas Dilger
  1 sibling, 1 reply; 69+ messages in thread
From: tytso @ 2001-02-22 13:20 UTC (permalink / raw)
  To: phillips; +Cc: phillips, Linux-kernel, adilger, hch, ext2-devel

   From: Daniel Phillips <phillips@innominate.de>
   Date: Thu, 22 Feb 2001 08:24:08 +0100
   Content-Type: text/plain

   > Is it worth it?  Well, it means you lose an index entry from each
   > directory block, thus reducing your fanout at each node of the tree by a
   > worst case of 0.7% (1k blocksize) and 0.2% if you're using 4k
   > blocksizes.

   I'll leave that up to somebody else - we now have two alternatives, the
   100%, no-compromise INCOMPAT solution, and the slightly-bruised but
   still largely intact forward compatible solution.  I'll maintain both
   solutions in the code for now so it's just as easy to choose either in the end.

Well, the $64,000 question is exactly how much performance does it cost?
My guess is that it will be barely measurable, but only benchmarks will
answer that question.

							- Ted

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-22  3:30         ` Linus Torvalds
@ 2001-02-22 16:33           ` Chris Mason
  2001-02-22 22:30           ` Daniel Phillips
  1 sibling, 0 replies; 69+ messages in thread
From: Chris Mason @ 2001-02-22 16:33 UTC (permalink / raw)
  To: Linus Torvalds, Daniel Phillips; +Cc: linux-kernel



On Wednesday, February 21, 2001 07:30:47 PM -0800 Linus Torvalds
<torvalds@transmeta.com> wrote:
> On Thu, 22 Feb 2001, Daniel Phillips wrote:
>> 
> 
> I'd love to hear the results from R5, as that seems to be the reiserfs
> favourite, and I'm trying it out in 2.4.2 because it was so easy to plug
> in..

Quick details, since I don't think I've seen them on l-k yet.  r5 was
chosen because it is more tuned to the reiserfs disk format.  The location
of a directory item on disk is determined by the hash of the name, and r5
is designed to put similar names close to each other on disk.
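
For reference, here is the r5 hash itself, as commonly quoted from the
reiserfs sources (fs/reiserfs/hashes.c).  It is reproduced only to show the
locality property being discussed and is not part of the ext2 index patch;
names that differ only in their last character or two end up with nearby
hash values, which is the on-disk clustering described above.

#include <stdio.h>
#include <stdint.h>

static uint32_t r5_hash(const char *name)
{
	uint32_t a = 0;

	/* fold each character into the running value with a small multiplier */
	while (*name) {
		a += *name << 4;
		a += *name >> 4;
		a *= 11;
		name++;
	}
	return a;
}

int main(void)
{
	/* names sharing a prefix land on nearby hash values */
	printf("%u %u %u\n", (unsigned)r5_hash("00006"),
	       (unsigned)r5_hash("00007"), (unsigned)r5_hash("zz42"));
	return 0;
}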

The benchmark that shows this best is creating X number of files in a
single dir (named 0001, 0002, 0003 etc).  r5 greatly increases the chances
the directory item for 00006 will be right next to the item for 00007.  If
the application accesses these files in the same order they were created,
this has benefits at other times than just creation.  The benchmarks Ed
posted give a general idea for other naming patterns, but this one is best
case:

Time to create 100,000 files (10 bytes each) with r5 hash: 48s
Time to create 100,000 files (10 bytes each) with tea: 3m58s

The percentage increase just gets bigger as you create more and more files.
That doesn't mean this is a real world case, but it is what the hash was
designed for.  

-chris


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2
  2001-02-22 13:20   ` tytso
@ 2001-02-22 18:16     ` Andreas Dilger
  2001-02-22 23:04       ` Daniel Phillips
                         ` (2 more replies)
  0 siblings, 3 replies; 69+ messages in thread
From: Andreas Dilger @ 2001-02-22 18:16 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: phillips, Linux-kernel, adilger, hch, ext2-devel, Al Viro

Daniel writes:
> All references to "." and ".." are now intercepted and never reach the
> filesystem level.

Ted writes:
>    From: Daniel Phillips <phillips@innominate.de>
> 
>    I'll leave that up to somebody else - we now have two alternatives, the
>    100%, no-compromise INCOMPAT solution, and the slightly-bruised but
>    still largely intact forward compatible solution.  I'll maintain both
>    solutions in the code for now so it's just as easy to choose either in the end.
> 
> Well, the $64,000 question is exactly how much performance does it cost?
> My guess is that it will be barely measurable, but only benchmarks will
> answer that question.

One important question as to the disk format is whether the "." and ".."
interception by VFS is a new phenomenon in 2.4 or if this also happened
in 2.2?  If it is new in 2.4, then having these entries on disk will be important
for 2.2 compatibility, and you don't want to have different on-disk formats
between 2.2 and 2.4.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-22 11:31                                       ` Ingo Oeser
@ 2001-02-22 18:20                                         ` Linus Torvalds
  0 siblings, 0 replies; 69+ messages in thread
From: Linus Torvalds @ 2001-02-22 18:20 UTC (permalink / raw)
  To: Ingo Oeser; +Cc: linux-kernel



On Thu, 22 Feb 2001, Ingo Oeser wrote:
> 
> On Wed, Feb 21, 2001 at 09:19:45PM -0800, Linus Torvalds wrote:
> > In article <97230a$16k$1@penguin.transmeta.com>,
> > Linus Torvalds <torvalds@transmeta.com> wrote:
> > >allocate blocks one at a time. Make the blocksize something nice and
> > >big, not just 4kB or 8kB or something.
> > 
> > Btw, this is also going to be a VM and performance issue some time in
> > the future.  There are already CPUs that would _love_ to have 64kB
> > pages etc, and as such a filesystem that doesn't play with the old silly
> > "everything is a block" rules would be much appreciated by the kind of
> > people who have multi-gigabyte files and want to read in big chunks at a
> > time. 
>  
> For this we need a block remapper layer that can map any
> blocksize n to any blocksize m with only the following constraints:

No, nothing like that at all..

What you can _trivially_ do is to basically act to the VFS and VM layer as
if you're a 1kB block filesystem (or something), and then when you get
called to do a "bmap()" (which only happens for kernel installing and
LILO, not under normal load), you just return the "offset" into a larger
block.

The VFS and MM layers do not care what the _real_ underlying blocksize of
the filesystem is. They will just do "readpage()" and "write()" calls, and
you can implement those any way you want to - never showing that you are
chunking out page-sized pieces from a bigger allocation block.
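
In outline (with invented names - this is only a schematic of the idea, not
the real 2.4 interfaces), such a filesystem could answer per-block mapping
queries like this:

/* Schematic only: a filesystem that allocates in big internal clusters
 * can still answer per-block bmap()-style queries by returning an offset
 * into the cluster.  All names and the cluster table below are made up
 * for illustration. */
#include <stdio.h>

#define FS_BLOCK_SIZE		1024			/* blocksize advertised to the VFS */
#define FS_CLUSTER_SIZE		65536			/* internal allocation unit */
#define BLOCKS_PER_CLUSTER	(FS_CLUSTER_SIZE / FS_BLOCK_SIZE)

/* Hypothetical per-file cluster map: cluster i of the file lives at
 * disk cluster cluster_map[i]. */
static const unsigned long cluster_map[] = { 7, 42, 13 };

/* Map a logical 1kB block number to a 1kB disk block number. */
static unsigned long fs_bmap(unsigned long logical_block)
{
	unsigned long cluster = logical_block / BLOCKS_PER_CLUSTER;
	unsigned long offset  = logical_block % BLOCKS_PER_CLUSTER;

	return cluster_map[cluster] * BLOCKS_PER_CLUSTER + offset;
}

int main(void)
{
	/* logical block 70 is block 6 of the file's second 64kB cluster */
	printf("logical 70 -> disk block %lu\n", fs_bmap(70));
	return 0;
}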

It's not all that hard. You just have to think a bit differently: don't
think of it as a block-based filesystem that has to have fixed blocks. The
VFS and MM layer don't care. They just want to access it.

> Daniel (and others) uses ext2 as a playground, because it is
> implemented, tested and not that hard to understand and verify.

I realize that. But I get _very_ nervous when people talk about adding
stuff to ext2, because there are a lot of people who do not want to ever
even by mistake run code that is "new" on their filesystem.

Note that I had the same issue with ext3 - for the longest time, Stephen
Tweedie wanted to just extend ext2, and make it an invisible upgrade where
the filesystem would just magically become journalled when the user asked
for it. It _sounds_ appealing, but it doesn't take into account (a)
inevitable bugs and (b) the fact that Reiserfs actually got a head start
at least partly because it didn't need to worry about backwards
compatibility at all (there were other reasons too).

Basically, if there is one thing I've learnt over ten years of Linux, it's
that it is absolutely _evil_ to add more "linkages" or dependencies than
you absolutely have to. It is _much_ easier to create a new filesystem,
and slowly phase out old code that is no longer used. It's been done
several times (both with filesystems and with drivers), and every time
we've had a "new X, phase out old X" kind of situation it has been very
smooth.

In comparison, if you have "new features in X, which also handles the old
cases of X" situation, you not only bind yourself to backwards
compatibility, but you also cause yourself to be unable to ever phase out
the old code. Which means that eventually the whole system is a piece of
crap, full of old garbage that nobody needs to use, but that is part of
the new stuff that everybody _does_ use.

			Linus


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-20 15:04 [rfc] Near-constant time directory index for Ext2 Daniel Phillips
                   ` (2 preceding siblings ...)
  2001-02-22  6:23 ` [Ext2-devel] " tytso
@ 2001-02-22 18:38 ` Kai Henningsen
  3 siblings, 0 replies; 69+ messages in thread
From: Kai Henningsen @ 2001-02-22 18:38 UTC (permalink / raw)
  To: Linux-kernel

phillips@innominate.de (Daniel Phillips)  wrote on 20.02.01 in <01022020011905.18944@gimli>:

> But the current hash function is just a place holder, waiting for
> an better version based on some solid theory.  I currently favor the
> idea of using crc32 as the default hash function, but I welcome
> suggestions.

I once liked those things, too - but I've learned better since.

Quoting _Handbook_of_Algorithms_and_Data_Structures_ (Gonnet/Baeza-Yates,  
ISBN 0-201-41607-7, Addison-Wesley):

--- snip ---

3.3.1 Practical hashing functions

[...]

A universal class of hashing functions is a class with the property that  
given any input, the average performance of all the functions is good.  
[...] For example, h(k) = (a * k + b) mod m with integers a != 0 and b is  
a universal class of hash functions.
[...]
Keys which are strings or sequences of words (including those which are of  
variable length) are best treated by considering them as a number base b.  
Let the string s be composed of k characters s1s2...sk. Then

        h(s) = ( sum(i=0..k-1) B^i*s(k-i) ) mod m

To obtain a more efficient version of this function we can compute

        h(s) = ( ( sum(i=0..k-1) B^i*s(k-i) ) mod 2^w ) mod m

where w is the number of bits in a computer word, and the mod 2^w  
operation is done by the hardware. For this function the value B = 131 is  
recommended, as B^i has a maximum cycle mod 2^k for 8<=k<=64.

Hashing function for strings

        int hashfunction(s)
        char *s;

        { int i;
          for(i=0; *s; s++) i = 131*i + *s;
          return(i % m);
        }

--- snip ---

I've actually used that function for a hash containing something like a  
million phone numbers as keys, and there were *very* few collisions.  
Similarly for another hash containing megabytes of RFC 822 message-ids.
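
In ANSI C the same function reads as below; the name hash_string and
making m an explicit parameter are my rendering, not the book's:

	#include <stdio.h>

	/* ANSI C rendering of the book's string hash: B = 131, and the
	 * "mod 2^w" step happens implicitly via unsigned overflow. */
	static unsigned int hash_string(const char *s, unsigned int m)
	{
		unsigned int h = 0;

		for (; *s; s++)
			h = 131 * h + (unsigned char)*s;
		return h % m;
	}

	int main(void)
	{
		/* bucket a few sample names into a table of 1024 slots */
		const char *names[] = { "maildrop", "postfix", "linux-2.4.2.tar.gz" };
		unsigned int i;

		for (i = 0; i < sizeof(names) / sizeof(names[0]); i++)
			printf("%-20s -> %u\n", names[i], hash_string(names[i], 1024));
		return 0;
	}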

MfG Kai

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-21 22:50               ` Martin Mares
  2001-02-21 22:54                 ` H. Peter Anvin
@ 2001-02-22 19:04                 ` Kai Henningsen
  1 sibling, 0 replies; 69+ messages in thread
From: Kai Henningsen @ 2001-02-22 19:04 UTC (permalink / raw)
  To: linux-kernel

mj@suse.cz (Martin Mares)  wrote on 22.02.01 in <20010222000755.A29061@atrey.karlin.mff.cuni.cz>:

> One could avoid this, but it would mean designing the whole filesystem in a
> completely different way -- merge all directories to a single gigantic
> hash table and use (directory ID,file name) as a key, but we were originally
> talking about extending ext2, so such massive changes are out of the question
> and your log n access argument is right.

s/hash table/btree/ and you have just described the Macintosh HFS file  
system. (Incidentally, it stores file extent indices in a similar manner,  
with key = (file id, fork, offset).)
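
Roughly, in C terms, the two key shapes being described look something
like this (field names and widths are illustrative, not the real on-disk
HFS records):

	/* Illustrative sketch only -- not the actual HFS catalog/extents records. */
	struct catalog_key {			/* one global B-tree for all directories */
		unsigned long	parent_dir_id;	/* "directory ID" part of the key */
		char		name[32];	/* file name (fixed size for the sketch) */
	};

	struct extents_key {			/* maps (file id, fork, offset) to extents */
		unsigned long	file_id;
		unsigned char	fork;		/* 0 = data fork, 1 = resource fork */
		unsigned long	offset;		/* file offset where the extent record starts */
	};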


MfG Kai

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-22  3:30         ` Linus Torvalds
  2001-02-22 16:33           ` Chris Mason
@ 2001-02-22 22:30           ` Daniel Phillips
  1 sibling, 0 replies; 69+ messages in thread
From: Daniel Phillips @ 2001-02-22 22:30 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel

Linus Torvalds wrote:
> 
> On Thu, 22 Feb 2001, Daniel Phillips wrote:
> >
> > In the first heat of hash races - creating 20,000 files in one directory
> > - dentry::hash lost out to my original hack::dx_hash, causing a high
> > percentage of leaf blocks to remain exactly half full and slowing down
> > the whole thing by about 5%.  (This was under uml - I haven't tried it
> > native yet but I expect the results to be similar.)
> >
> >         Contender                     Result
> >         =========                     ======
> >       dentry::hash            Average fullness = 2352 (57%)
> >       hack::dx_hash           Average fullness = 2758 (67%)
> >
> > This suggests that dentry::hash is producing distinctly non-dispersed
> > results and needs to be subjected to further scrutiny.  I'll run the
> > next heat of hash races tomorrow, probably with R5, and CRC32 too if I
> > have time.
> 
> I'd love to hear the results from R5, as that seems to be the reiserfs
> favourite, and I'm trying it out in 2.4.2 because it was so easy to plug
> in..

In this round there were two new contenders:

	- ReiserFS's R5
	- Bob Jenkins' hash

Eirik Fuller pointed me to the latter, the subject of a very interesting
article in Dr. Dobbs, available online here: 

	http://burtleburtle.net/bob/hash/doobs.html

As before, the runs are for 20,000 creates and I report only the
fullness, because I'm still running these under UML.  Suffice to say
that the total running time is roughly related to the average fullness
with a variance of about 15% from the best to the worst.  Eventually I
will rerun the entire series of tests natively and provide more detailed
statistics.  Here are the results from the second heat of hash races:

	     Contender			Result
	     =========			======
	dentry::hash		Average fullness = 2352 (57%)
	daniel::hack_hash	Average fullness = 2758 (67%)
	bob::hash		Average fullness = 2539 (61%)
	reiserfs::r5		Average fullness = 2064 (50%)

Just looking at R5 I knew it wasn't going to do well in this application
because it's similar to a number of hash functions I tried with the same
idea in mind: to place similar names together in the same leaf block. 
That turned out to be not very important compared to achieving a
relatively high fullness of leaf blocks.  The problem with R5 when used
with my htree is that it doesn't give very uniform dispersal.  But
according to Chris Mason (see his post) it does work very well for
ReiserFS.  This provides a little more evidence that my htree scheme is
quite different from other approaches.

	/* ReiserFS's r5; note that the len argument is ignored -- the loop
	 * stops at the terminating NUL instead. */
	u32 r5_hash (const char *msg, int len)
	{
		u32 a = 0;
		while (*msg) {
			a += *msg << 4;
			a += *msg >> 4;
			a *= 11;
			msg++;
		}
		return a;
	}

I expected more from bob::hash since it's very carefully thought out
in terms of dispersal and avoidance of 'funnelling' (the property that
determines the probability of collision), but it still fell short of
hack_hash's performance.  Oh well.  Tomorrow I'll try CRC32.

The bottom line: dx_hack_hash is still the reigning champion.  OK, come
out and take a bow:

	unsigned dx_hack_hash (const char *name, int len)
	{
		u32 hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
		while (len--)
		{
			u32 hash = hash1 + (hash0 ^ (*name++ * 71523));
			/* hash is unsigned, so test the top bit rather than
			   "< 0" to keep the result within 31 bits */
			if (hash & 0x80000000) hash -= 0x7fffffff;
			hash1 = hash0;
			hash0 = hash;
		}
		return hash0;
	}
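
For anyone who wants to eyeball dispersal from userspace, here is a
rough sketch -- note this only buckets hash values of generated names,
it is not the leaf-fullness measurement above, and the name pattern is
just an example:

	#include <stdio.h>
	#include <string.h>

	typedef unsigned int u32;

	/* dx_hack_hash from above (with the sign-bit test) */
	static unsigned dx_hack_hash(const char *name, int len)
	{
		u32 hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
		while (len--) {
			u32 hash = hash1 + (hash0 ^ (*name++ * 71523));
			if (hash & 0x80000000) hash -= 0x7fffffff;
			hash1 = hash0;
			hash0 = hash;
		}
		return hash0;
	}

	int main(void)
	{
		enum { FILES = 20000, BUCKETS = 512 };
		static unsigned count[BUCKETS];
		unsigned min = FILES, max = 0;
		char name[32];
		int i;

		for (i = 0; i < FILES; i++) {
			sprintf(name, "file%06d", i);	/* made-up name pattern */
			count[dx_hack_hash(name, strlen(name)) % BUCKETS]++;
		}
		for (i = 0; i < BUCKETS; i++) {
			if (count[i] < min) min = count[i];
			if (count[i] > max) max = count[i];
		}
		printf("per-bucket counts: min %u max %u (ideal %u)\n",
		       min, max, FILES / BUCKETS);
		return 0;
	}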

--
Daniel

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2
  2001-02-22 18:16     ` Andreas Dilger
@ 2001-02-22 23:04       ` Daniel Phillips
  2001-02-22 23:40       ` tytso
  2001-02-23 20:11       ` tytso
  2 siblings, 0 replies; 69+ messages in thread
From: Daniel Phillips @ 2001-02-22 23:04 UTC (permalink / raw)
  To: Andreas Dilger, tytso, Linux-kernel, ext2-devel

Andreas Dilger wrote:
> Daniel writes:
> > All references to "." and ".." are now intercepted and never reach the
> > filesystem level.
> 
> Ted writes:
> >    From: Daniel Phillips <phillips@innominate.de>
> >
> >    I'll leave that up to somebody else - we now have two alternatives, the
> >    100%, no-compromise INCOMPAT solution, and the slightly-bruised but
> >    still largely intact forward compatible solution.  I'll maintain both
> >    solutions in the code for now so it's just as easy to choose either in the end.
> >
> > Well, the $64,000 question is exactly how much performance does it cost?
> > My guess is that it will be barely measurable, but only benchmarks will
> > answer that question.
> 
> One important question as to the disk format is whether the "." and ".."
> interception by VFS is a new phenomenon in 2.4 or if this also happened
> in 2.2?  If so, then having these entries on disk will be important
> for 2.2 compatibility, and you don't want to have different on-disk formats
> between 2.2 and 2.4.

The answer is 'yes', it's been in since at least the beginning of 2.2:

 
http://innominate.org/cgi-bin/lksr/linux/fs/namei.c?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=v2.2

Search for '.'.

By the way, our whole linux cvsweb tree is here:

	http://lksr.org/ 

with all versions of linux back to linux-0.97.pl5, with a makefile that
starts out with:

	#
	# Makefile for linux.
	# If you don't have '-mstring-insns' in your gcc (and nobody but me has :-)
	# remove them from the CFLAGS defines.
	#

Getting back on topic, this makes the idea of getting rid of the actual
on-disk "." and ".." entries a little less scary, though I am keeping in
mind the fact that having those entries on disk could in some extreme
circumstance help fsck recover a corrupted directory tree a little
better and more automatically.

I resolve not to take a position on this subject, and I will carry
forward both a 'squeaky clean' backward-compatible version that sets an
INCOMPAT flag, and a 'slightly tarnished' but very clever version that
is both forward and backward-compatible, along the lines suggested by
Ted.  Both flavors have the desirable property that old versions of
fsck with no knowledge of the new index structure can remove the indices
automatically, with fsck -y.

--
Daniel

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2
  2001-02-22 18:16     ` Andreas Dilger
  2001-02-22 23:04       ` Daniel Phillips
@ 2001-02-22 23:40       ` tytso
  2001-02-23 20:11       ` tytso
  2 siblings, 0 replies; 69+ messages in thread
From: tytso @ 2001-02-22 23:40 UTC (permalink / raw)
  To: adilger; +Cc: phillips, Linux-kernel, adilger, hch, ext2-devel, viro

   From: Andreas Dilger <adilger@turbolinux.com>
   Date: Thu, 22 Feb 2001 11:16:32 -0700 (MST)

   One important question as to the disk format is whether the "." and ".."
   interception by VFS is a new phenomenon in 2.4 or if this also happened
   in 2.2?  If so, then having these entries on disk will be important
   for 2.2 compatibility, and you don't want to have different on-disk formats
   between 2.2 and 2.4.

Well, you need to have the '.' and '..' there if you go for full
backwards compatibility.  That's clear.

If you don't care about backwards compatibility, it's important that
there be a way to find the parent directory, but there doesn't have to
be explicit '.' and '..'  entries.

So if Daniel is going to try implementing it both ways then that's one
place where the #ifdef's might get a bit more complicated.  After it's
done, we should do some benchmarks comparing it both ways; if the
difference is negligible, I'd argue for simply always providing
backwards compatibility.  One of the key advantages of ext2/ext3 is its
backwards compatibility, and so if it's not too costly to preserve it
(as I suspect will be the case), we should try to do so.

						- Ted

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [rfc] Near-constant time directory index for Ext2
  2001-02-22 10:35                                     ` Alan Cox
@ 2001-02-23  0:59                                       ` Felix von Leitner
  0 siblings, 0 replies; 69+ messages in thread
From: Felix von Leitner @ 2001-02-23  0:59 UTC (permalink / raw)
  To: linux-kernel

Thus spake Alan Cox (alan@lxorguk.ukuu.org.uk):
> > > There will be a lot fewer metadata index
> > > blocks in your directory file, for one thing.
> > Oh yes, another thing: a B-tree directory structure does not need
> > metadata index blocks.
> Before people get excited about complex tree directory indexes, remember to 
> solve the other 95% before implementation - recovering from lost blocks,
> corruption and the like

And don't forget the trouble with NFS handles after the tree was rebalanced.

Trees are nice only theoretically.  In practice, the benefits are
outweighed by the nastiness in the form of fsck and NFS and bigger code
(normally: more complex -> less reliable).

Felix

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2
  2001-02-22 18:16     ` Andreas Dilger
  2001-02-22 23:04       ` Daniel Phillips
  2001-02-22 23:40       ` tytso
@ 2001-02-23 20:11       ` tytso
  2001-02-24  0:32         ` Andreas Dilger
  2 siblings, 1 reply; 69+ messages in thread
From: tytso @ 2001-02-23 20:11 UTC (permalink / raw)
  To: phillips; +Cc: adilger, Linux-kernel, ext2-devel

   From: Daniel Phillips <phillips@innominate.de>
   Date: Fri, 23 Feb 2001 00:04:02 +0100

   I resolve not to take a position on this subject, and I will carry
   forward both a 'squeaky clean' backward-compatible version that sets an
   INCOMPAT flag, and a 'slightly tarnished' but very clever version that
   is both forward and backward-compatible, along the lines suggested by
   Ted.  Both flavors have the desirable property that old versions of
   fsck with no knowledge of the new index structure can remove the indices
   automatically, with fsck -y.

Note that in the long run, the fully compatible version should probably
have a COMPAT feature flag set so that you're forced to use a new enough
version of e2fsck.  Otherwise an old e2fsck may end up not noticing
corruptions in an index block which might cause a new kernel to have
serious heartburn.

						- Ted

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2
  2001-02-23 20:11       ` tytso
@ 2001-02-24  0:32         ` Andreas Dilger
  0 siblings, 0 replies; 69+ messages in thread
From: Andreas Dilger @ 2001-02-24  0:32 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: phillips, Linux kernel development list, Ext2 development mailing list

Ted writes:
> Note that in the long run, the fully compatible version should probably
> have a COMPAT feature flag set so that you're forced to use a new enough
> version of e2fsck.  Otherwise an old e2fsck may end up not noticing
> corruptions in an index block which might cause a new kernel to have
> serious heartburn.

Actually, having a COMPAT flag also helps in other ways:

1) Turning indexing on and off is no longer a mount option (as it currently
   is) or something done automatically, so it will quell Linus' fears about
   the principle of least surprise (i.e. not converting a filesystem without
   user action).
   A superblock COMPAT flag is more in keeping with other ext2 features.

2) Running a new e2fsck on a COMPAT_INDEX filesystem could create the
   index for existing "large" directories that don't have the BTREE/INDEX
   flag set, so the kernel only ever has to deal with incremental indexing
   after the first block.  The kernel would just do linear access on
   existing multi-block directories until e2fsck is run.

3) Clearing the COMPAT flag would make e2fsck remove the indexes, if the
   user so desires.  I think this would be the behaviour of existing
   e2fsck anyways.
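
In code terms, the distinction these three points rely on is roughly the
following (the DIR_INDEX flag name and bit value are placeholders, not a
merged ext2 feature):

	/* Sketch of the COMPAT vs INCOMPAT idea; flag name and value are
	 * placeholders, not part of ext2 today. */
	#define FEATURE_COMPAT_DIR_INDEX	0x0020

	struct sb_features {
		unsigned int	compat;		/* unknown bits: old code can safely ignore    */
		unsigned int	incompat;	/* unknown bits: old code must refuse to mount  */
		unsigned int	ro_compat;	/* unknown bits: old code mounts read-only only */
	};

	static int use_dir_index(const struct sb_features *sb)
	{
		/* An old kernel or e2fsck that doesn't know this bit still
		 * handles the filesystem; it just treats directories linearly. */
		return (sb->compat & FEATURE_COMPAT_DIR_INDEX) != 0;
	}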

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert

^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2001-02-24  0:33 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-02-20 15:04 [rfc] Near-constant time directory index for Ext2 Daniel Phillips
2001-02-20 20:03 ` Linus Torvalds
2001-02-20 21:08   ` Jeremy Jackson
2001-02-20 21:20     ` Mike Dresser
2001-02-20 22:36       ` Jeremy Jackson
2001-02-20 23:08         ` Daniel Phillips
2001-02-21  1:04           ` Bernd Eckenfels
2001-02-21 16:38             ` Daniel Phillips
2001-02-20 22:58       ` Jonathan Morton
2001-02-20 21:41   ` Daniel Phillips
2001-02-21  0:22     ` Linus Torvalds
2001-02-21  0:30       ` Alan Cox
2001-02-21  2:35         ` Ed Tomlinson
2001-02-21 23:13           ` Linus Torvalds
2001-02-21 23:34             ` Davide Libenzi
2001-02-21 23:59               ` Linus Torvalds
2001-02-21 23:57             ` H. Peter Anvin
2001-02-22  0:35             ` Ed Tomlinson
2001-02-21  1:01       ` Andreas Dilger
2001-02-22  2:28       ` Daniel Phillips
2001-02-22  3:30         ` Linus Torvalds
2001-02-22 16:33           ` Chris Mason
2001-02-22 22:30           ` Daniel Phillips
2001-02-21 17:21 ` Davide Libenzi
2001-02-21 21:08   ` Martin Mares
2001-02-21 21:29     ` Davide Libenzi
2001-02-21 21:32       ` Martin Mares
2001-02-21 21:59         ` Davide Libenzi
2001-02-21 22:26           ` Martin Mares
2001-02-21 22:43             ` Davide Libenzi
2001-02-21 22:14         ` H. Peter Anvin
2001-02-21 22:32           ` Martin Mares
2001-02-21 22:38             ` H. Peter Anvin
2001-02-21 22:50               ` Martin Mares
2001-02-21 22:54                 ` H. Peter Anvin
2001-02-21 23:07                   ` Martin Mares
2001-02-21 23:15                     ` H. Peter Anvin
2001-02-21 23:42                       ` Daniel Phillips
2001-02-21 23:52                         ` Davide Libenzi
     [not found]                       ` <3A945081.E6EB78F4@innominate.de>
2001-02-21 23:48                         ` H. Peter Anvin
2001-02-22  1:22                           ` Daniel Phillips
2001-02-22  1:42                             ` H. Peter Anvin
2001-02-22  2:03                             ` Andreas Dilger
2001-02-22  2:41                               ` H. Peter Anvin
2001-02-22  3:43                                 ` Daniel Phillips
2001-02-22  4:02                                   ` Linus Torvalds
2001-02-22  5:19                                     ` Linus Torvalds
2001-02-22 11:31                                       ` Ingo Oeser
2001-02-22 18:20                                         ` Linus Torvalds
2001-02-22  4:02                                   ` H. Peter Anvin
2001-02-22  7:03                                     ` Andreas Dilger
2001-02-22  4:03                                   ` H. Peter Anvin
2001-02-22 10:35                                     ` Alan Cox
2001-02-23  0:59                                       ` Felix von Leitner
2001-02-22  3:08                               ` Daniel Phillips
2001-02-22  8:06                                 ` [rfc] [LONG] " Andreas Dilger
2001-02-22  7:20                       ` [rfc] " Bill Wendling
2001-02-22  8:34                       ` Rogier Wolff
2001-02-21 23:26                     ` Jamie Lokier
2001-02-22 19:04                 ` Kai Henningsen
2001-02-22  6:23 ` [Ext2-devel] " tytso
2001-02-22  7:24   ` Daniel Phillips
2001-02-22 13:20   ` tytso
2001-02-22 18:16     ` Andreas Dilger
2001-02-22 23:04       ` Daniel Phillips
2001-02-22 23:40       ` tytso
2001-02-23 20:11       ` tytso
2001-02-24  0:32         ` Andreas Dilger
2001-02-22 18:38 ` Kai Henningsen
