linux-kernel.vger.kernel.org archive mirror
* [Bug 417] New: htree much slower than regular ext3
@ 2003-02-27 17:31 Martin J. Bligh
  2003-02-28  2:55 ` Daniel Phillips
                   ` (3 more replies)
  0 siblings, 4 replies; 38+ messages in thread
From: Martin J. Bligh @ 2003-02-27 17:31 UTC (permalink / raw)
  To: linux-kernel

http://bugme.osdl.org/show_bug.cgi?id=417

           Summary: htree much slower than regular ext3
    Kernel Version: 2.5.63-bk3
            Status: NEW
          Severity: normal
             Owner: akpm@digeo.com
         Submitter: bwindle-kbt@fint.org


Distribution: Debian Testing

Hardware Environment: x86, 256mb RAM, PIIX4, Maxtor 91728D8 ATA DISK drive

Software Environment: 
Linux razor 2.5.63bk3 #25 Thu Feb 27 10:13:35 EST 2003 i686 Pentium II 
(Klamath) GenuineIntel GNU/Linux

Gnu C                  2.95.4
Gnu make               3.79.1
util-linux             2.11n
mount                  2.11n
module-init-tools      implemented
e2fsprogs              1.30-WIP
reiserfsprogs          3.6.3
Linux C Library        2.2.5
Dynamic linker (ldd)   2.2.5
Procps                 2.0.7
Net-tools              1.60
Console-tools          0.2.3
Sh-utils               4.5.2

Problem Description:
I created a directory ("test") with 32000 (ok, 31998) directories in it,
and  put a file called 'foo' in each of them (for i in `ls`; do cd $i &&
touch bar  && cd .. ; done). Then I took that ext3 partition, unmounted it,
did a 'tune2fs -O  dir_index', then 'fsck -fD', and remounted. I then did a
'time du -hs' on the  test directory, and here are the results.

ext3+htree:     
bwindle@razor:/giant/inodes$ time du -hs
126M    .

real    7m21.756s
user    0m2.021s
sys     0m22.190s

I then unmounted, tune2fs -O ^dir_index, e2fsck -fD /dev/hdb1, remounted,
and  did another du -hs on the test directory. It took 1 minute, 48 seconds.

bwindle@razor:/giant/test$ time du -hs
126M    .

real    1m48.760s
user    0m1.986s
sys     0m21.563s


I thought htree was supposed to speed up access with large numbers of 
directories?




* Re: [Bug 417] New: htree much slower than regular ext3
  2003-02-28  2:55 ` Daniel Phillips
@ 2003-02-27 21:00   ` Andreas Dilger
  2003-02-28  4:12     ` Daniel Phillips
  2003-03-13 21:04     ` [Ext2-devel] " Stephen C. Tweedie
  0 siblings, 2 replies; 38+ messages in thread
From: Andreas Dilger @ 2003-02-27 21:00 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Martin J. Bligh, linux-kernel, ext2-devel, Theodore Ts'o, chrisl

On Feb 28, 2003  03:55 +0100, Daniel Phillips wrote:
> On Thursday 27 February 2003 18:31, Martin J. Bligh wrote:
> > Problem Description:
> > I created a directory ("test") with 32000 (ok, 31998) directories in it,
> > and  put a file called 'foo' in each of them (for i in `ls`; do cd $i &&
> > touch bar  && cd .. ; done). Then I took that ext3 partition, unmounted it,
> > did a 'tune2fs -O  dir_index', then 'fsck -fD', and remounted. I then did a
> > 'time du -hs' on the  test directory, and here are the results.
> >
> > ext3+htree:
> > bwindle@razor:/giant/inodes$ time du -hs
> > 126M    .
> >
> > real    7m21.756s
> > user    0m2.021s
> > sys     0m22.190s
> >
> > I then unmounted, tune2fs -O ^dir_index, e2fsck -fD /dev/hdb1, remounted,
> > and  did another du -hs on the test directory. It took 1 minute, 48
> > seconds.
> >
> > bwindle@razor:/giant/test$ time du -hs
> > 126M    .
> >
> > real    1m48.760s
> > user    0m1.986s
> > sys     0m21.563s
> >
> >
> > I thought htree was supposed to speed up access with large numbers of
> > directories?
> 
> The du just does getdents and lstats in physical storage order, so there's no 
> possible benefit from indexing in this case, and unindexed ext3 avoids long
> search times by caching the position at which it last found an entry.  That 
> answers the question "why doesn't it speed up", however, "why did it slow way 
> down" is harder.
> 
> The single-file leaves of your directory tree don't carry any index (it's not 
> worth it with only one file) and therefore use the same code path as unindexed 
> ext3, so there's no culprit there.  I'm looking suspiciously at 
> ext3_dx_readdir, which is apparently absorbing about 11 ms per returned 
> entry. To put this in perspective, I'm used to seeing individual directory 
> operation times well below 100 us, so even if each dirent cost as much as a 
> full lookup, you'd see less than 3 seconds overhead for your 30,000 
> directories.
> 
> 11 ms sounds like two seeks for each returned dirent, which sounds like a bug.

I think you are pretty dead on there.  The difference is that with unindexed
entries, the directory entry and the inode are in the same order, while with
indexed directories they are essentially in random order to each other.  That
means that each directory lookup causes a seek to get the inode data, instead of
reading the inode table in allocation order (which is sequential reads on disk).

In the past both would have been equally slow, but the orlov allocator in
2.5 causes a number of directories to be created in the same group before
moving on to the next group, so you have nice batches of sequential reads.

I've got a patch which should help here, although it was originally written
to speed up the "create" case instead of the "lookup" case.  In the lookup
case, it will do a pre-read of a number of inode table blocks, since the cost
of doing a 64kB read and doing a 4kB read is basically the same - the cost
of the seek.

The patch was written for a 2.4.18 RH kernel, but I think the areas touched
(ext3_get_inode_loc and ext3_new_inode) are relatively unchanged, so it
should be fairly straightforward to fix the rejects by hand for 2.5.

For the create case, the patch speeds up lots-of-file creation rates a lot
(50%), regardless of whether we are using htree or not.  That is because we
do not read in empty inode table blocks from disk (which will be slow
behind a bunch of writes going on), and instead we have a pure write
workload (with the exception of reading in the inode bitmaps occasionally).

Cheers, Andreas
======================= ext3-noread.diff ====================================
diff -ru lustre-head/fs/ext3/ialloc.c lustre/fs/ext3/ialloc.c
--- lustre-head/fs/ext3/ialloc.c	Mon Dec 23 10:02:58 2002
+++ lustre/fs/ext3/ialloc.c	Mon Dec 23 09:46:20 2002
@@ -289,6 +289,37 @@
 }
 
 /*
+ * @block_group: block group of inode
+ * @offset: relative offset of inode within @block_group
+ *
+ * Check whether any of the inodes in this disk block are in use.
+ *
+ * Caller must be holding superblock lock (group/bitmap read lock in future).
+ */
+int ext3_itable_block_used(struct super_block *sb, unsigned int block_group,
+			   int offset)
+{
+	int bitmap_nr = load_inode_bitmap(sb, block_group);
+	int inodes_per_block;
+	unsigned long inum, iend;
+	struct buffer_head *ibitmap;
+
+	if (bitmap_nr < 0)
+		return 1;
+
+	inodes_per_block = sb->s_blocksize / EXT3_SB(sb)->s_inode_size;
+	inum = offset & ~(inodes_per_block - 1);
+	iend = inum + inodes_per_block;
+	ibitmap = EXT3_SB(sb)->s_inode_bitmap[bitmap_nr];
+	for (; inum < iend; inum++) {
+		if (inum != offset && ext3_test_bit(inum, ibitmap->b_data))
+			return 1;
+	}
+
+	return 0;
+}
+
+/*
  * There are two policies for allocating an inode.  If the new inode is
  * a directory, then a forward search is made for a block group with both
  * free space and a low directory-to-inode ratio; if that fails, then of
@@ -312,6 +343,7 @@
 	struct ext3_group_desc * gdp;
 	struct ext3_group_desc * tmp;
 	struct ext3_super_block * es;
+	struct ext3_iloc iloc;
 	int err = 0;
 
 	/* Cannot create files in a deleted directory */
@@ -514,9 +547,18 @@
 	inode->i_generation = sbi->s_next_generation++;
 
 	ei->i_state = EXT3_STATE_NEW;
-	err = ext3_mark_inode_dirty(handle, inode);
+	err = ext3_get_inode_loc_new(inode, &iloc, 1);
 	if (err) goto fail;
-	
+	BUFFER_TRACE(iloc.bh, "get_write_access");
+	err = ext3_journal_get_write_access(handle, iloc.bh);
+	if (err) {
+		brelse(iloc.bh);
+		iloc.bh = NULL;
+		goto fail;
+	}
+	err = ext3_mark_iloc_dirty(handle, inode, &iloc);
+	if (err) goto fail;
+
 	unlock_super (sb);
 	if(DQUOT_ALLOC_INODE(inode)) {
 		DQUOT_DROP(inode);
diff -ru lustre-head/fs/ext3/inode.c lustre/fs/ext3/inode.c
--- lustre-head/fs/ext3/inode.c	Mon Dec 23 10:02:58 2002
+++ lustre/fs/ext3/inode.c	Mon Dec 23 09:50:25 2002
@@ -2011,16 +1994,25 @@
 	ext3_journal_stop(handle, inode);
 }
 
-/* 
- * ext3_get_inode_loc returns with an extra refcount against the
- * inode's underlying buffer_head on success. 
- */
+extern int ext3_itable_block_used(struct super_block *sb,
+				  unsigned int block_group,
+				  int offset);
+
+#define NUM_INODE_PREREAD 16
 
-int ext3_get_inode_loc (struct inode *inode, struct ext3_iloc *iloc)
+/*
+ * ext3_get_inode_loc returns with an extra refcount against the inode's
+ * underlying buffer_head on success.  If this is for a new inode allocation
+ * (new is non-zero) then we may be able to optimize away the read if there
+ * are no other in-use inodes in this inode table block.  If we need to do
+ * a read, then read in a whole chunk of blocks to avoid blocking again soon
+ * if we are doing lots of creates/updates.
+ */
+int ext3_get_inode_loc_new(struct inode *inode, struct ext3_iloc *iloc, int new)
 {
 	struct super_block *sb = inode->i_sb;
 	struct ext3_sb_info *sbi = EXT3_SB(sb);
-	struct buffer_head *bh = 0;
+	struct buffer_head *bh[NUM_INODE_PREREAD];
 	unsigned long block;
 	unsigned long block_group;
 	unsigned long group_desc;
@@ -2042,38 +2034,86 @@
 	}
 	group_desc = block_group >> sbi->s_desc_per_block_bits;
 	desc = block_group & (sbi->s_desc_per_block - 1);
-	bh = sbi->s_group_desc[group_desc];
-	if (!bh) {
+	if (!sbi->s_group_desc[group_desc]) {
 		ext3_error(sb, __FUNCTION__, "Descriptor not loaded");
 		goto bad_inode;
 	}
 
-	gdp = (struct ext3_group_desc *) bh->b_data;
+	gdp = (struct ext3_group_desc *)(sbi->s_group_desc[group_desc]->b_data);
+
 	/*
 	 * Figure out the offset within the block group inode table
 	 */
-	offset = ((inode->i_ino - 1) % sbi->s_inodes_per_group) *
-		sbi->s_inode_size;
+	offset = ((inode->i_ino - 1) % sbi->s_inodes_per_group);
+
 	block = le32_to_cpu(gdp[desc].bg_inode_table) +
-		(offset >> EXT3_BLOCK_SIZE_BITS(sb));
-	if (!(bh = sb_bread(sb, block))) {
-		ext3_error (sb, __FUNCTION__,
-			    "unable to read inode block - "
-			    "inode=%lu, block=%lu", inode->i_ino, block);
-		goto bad_inode;
+		(offset * sbi->s_inode_size >> EXT3_BLOCK_SIZE_BITS(sb));
+
+	bh[0] = sb_getblk(sb, block);
+	if (buffer_uptodate(bh[0]))
+		goto done;
+
+	/* If we don't really need to read this block, and it isn't already
+	 * in memory, then we just zero it out.  Otherwise, we keep the
+	 * current block contents (deleted inode data) for posterity.
+	 */
+	if (new && !ext3_itable_block_used(sb, block_group, offset)) {
+		lock_buffer(bh[0]);
+		memset(bh[0]->b_data, 0, bh[0]->b_size);
+		mark_buffer_uptodate(bh[0], 1);
+		unlock_buffer(bh[0]);
+	} else {
+		unsigned long block_end, itable_end;
+		int count = 1;
+
+		itable_end = le32_to_cpu(gdp[desc].bg_inode_table) +
+				sbi->s_itb_per_group;
+		block_end = block + NUM_INODE_PREREAD;
+		if (block_end > itable_end)
+			block_end = itable_end;
+
+		for (; block < block_end; block++) {
+			bh[count] = sb_getblk(sb, block);
+			if (count && (buffer_uptodate(bh[count]) ||
+				      buffer_locked(bh[count]))) {
+				__brelse(bh[count]);
+			} else
+				count++;
+		}
+
+		ll_rw_block(READ, count, bh);
+
+		/* Release all but the block we actually need (bh[0]) */
+		while (--count > 0)
+			__brelse(bh[count]);
+
+		wait_on_buffer(bh[0]);
+		if (!buffer_uptodate(bh[0])) {
+			ext3_error(sb, __FUNCTION__,
+				   "unable to read inode block - "
+				   "inode=%lu, block=%lu", inode->i_ino,
+				   bh[0]->b_blocknr);
+			goto bad_inode;
+		}
 	}
-	offset &= (EXT3_BLOCK_SIZE(sb) - 1);
+ done:
+	offset = (offset * sbi->s_inode_size) & (EXT3_BLOCK_SIZE(sb) - 1);
 
-	iloc->bh = bh;
-	iloc->raw_inode = (struct ext3_inode *) (bh->b_data + offset);
+	iloc->bh = bh[0];
+	iloc->raw_inode = (struct ext3_inode *)(bh[0]->b_data + offset);
 	iloc->block_group = block_group;
-	
+
 	return 0;
-	
+
  bad_inode:
 	return -EIO;
 }
 
+int ext3_get_inode_loc(struct inode *inode, struct ext3_iloc *iloc)
+{
+	return ext3_get_inode_loc_new(inode, iloc, 0);
+}
+
 void ext3_read_inode(struct inode * inode)
 {
 	struct ext3_iloc iloc;
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/



* Re: [Bug 417] New: htree much slower than regular ext3
  2003-02-28  4:12     ` Daniel Phillips
@ 2003-02-27 21:33       ` Martin J. Bligh
  0 siblings, 0 replies; 38+ messages in thread
From: Martin J. Bligh @ 2003-02-27 21:33 UTC (permalink / raw)
  To: Daniel Phillips, Andreas Dilger, bwindle-kbt
  Cc: linux-kernel, ext2-devel, Theodore Ts'o, chrisl

>> > 11 ms sounds like two seeks for each returned dirent, which sounds
>> > like a bug.
>> 
>> I think you are pretty dead on there.  The difference is that with
>> unindexed entries, the directory entry and the inode are in the same
>> order, while with indexed directories they are essentially in random
>> order to each other.  That means that each directory lookup causes a
>> seek to get the directory inode data instead of doing allocation order
>> (which is sequential reads on disk).
>> 
>> In the past both would have been slow equally, but the orlov allocator in
>> 2.5 causes a number of directories to be created in the same group before
>> moving on to the next group, so you have nice batches of sequential
>> reads.
> 
> I think you're close to the truth there, but out-of-order inode table
> access  would only introduce one seek per inode table block, and the
> cache should  take care of the rest.  Martin's numbers suggest the cache
> isn't caching at  all.
> 
> Martin, does iostat show enormously more reads for the Htree case?

Wasn't me ... I just forwarded the bug data from bugzilla ;-)
The filer in this case was bwindle-kbt@fint.org, cc'ed.

M.



* Re: [Bug 417] New: htree much slower than regular ext3
  2003-02-27 17:31 [Bug 417] New: htree much slower than regular ext3 Martin J. Bligh
@ 2003-02-28  2:55 ` Daniel Phillips
  2003-02-27 21:00   ` Andreas Dilger
  2003-03-07 15:46 ` Alex Tomas
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 38+ messages in thread
From: Daniel Phillips @ 2003-02-28  2:55 UTC (permalink / raw)
  To: Martin J. Bligh, linux-kernel; +Cc: ext2-devel, Theodore Ts'o, chrisl

On Thursday 27 February 2003 18:31, Martin J. Bligh wrote:
> Problem Description:
> I created a directory ("test") with 32000 (ok, 31998) directories in it,
> and  put a file called 'foo' in each of them (for i in `ls`; do cd $i &&
> touch bar  && cd .. ; done). Then I took that ext3 partition, unmounted it,
> did a 'tune2fs -O  dir_index', then 'fsck -fD', and remounted. I then did a
> 'time du -hs' on the  test directory, and here are the results.
>
> ext3+htree:
> bwindle@razor:/giant/inodes$ time du -hs
> 126M    .
>
> real    7m21.756s
> user    0m2.021s
> sys     0m22.190s
>
> I then unmounted, tune2fs -O ^dir_index, e2fsck -fD /dev/hdb1, remounted,
> and  did another du -hs on the test directory. It took 1 minute, 48
> seconds.
>
> bwindle@razor:/giant/test$ time du -hs
> 126M    .
>
> real    1m48.760s
> user    0m1.986s
> sys     0m21.563s
>
>
> I thought htree was supposed to speed up access with large numbers of
> directories?

The du just does getdents and lstats in physical storage order, so there's no 
possible benefit from indexing in this case, and unindexed ext3 avoids long
search times by caching the position at which it last found an entry.  That 
answers the question "why doesn't it speed up", however, "why did it slow way 
down" is harder.

The single-file leaves of your directory tree don't carry any index (it's not 
worth it with only one file) and therefore use the same code path as unindexed 
ext3, so there's no culprit there.  I'm looking suspiciously at 
ext3_dx_readdir, which is apparently absorbing about 11 ms per returned 
entry. To put this in perspective, I'm used to seeing individual directory 
operation times well below 100 us, so even if each dirent cost as much as a 
full lookup, you'd see less than 3 seconds overhead for your 30,000 
directories.

11 ms sounds like two seeks for each returned dirent, which sounds like a bug.

Regards,

Daniel


* Re: [Bug 417] New: htree much slower than regular ext3
  2003-02-27 21:00   ` Andreas Dilger
@ 2003-02-28  4:12     ` Daniel Phillips
  2003-02-27 21:33       ` Martin J. Bligh
  2003-03-13 21:04     ` [Ext2-devel] " Stephen C. Tweedie
  1 sibling, 1 reply; 38+ messages in thread
From: Daniel Phillips @ 2003-02-28  4:12 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Martin J. Bligh, linux-kernel, ext2-devel, Theodore Ts'o, chrisl

On Thursday 27 February 2003 22:00, Andreas Dilger wrote:
> > 11 ms sounds like two seeks for each returned dirent, which sounds like a
> > bug.
>
> I think you are pretty dead on there.  The difference is that with
> unindexed entries, the directory entry and the inode are in the same order,
> while with indexed directories they are essentially in random order to each
> other.  That means that each directory lookup causes a seek to get the
> directory inode data instead of doing allocation order (which is sequential
> reads on disk).
>
> In the past both would have been slow equally, but the orlov allocator in
> 2.5 causes a number of directories to be created in the same group before
> moving on to the next group, so you have nice batches of sequential reads.

I think you're close to the truth there, but out-of-order inode table access 
would only introduce one seek per inode table block, and the cache should 
take care of the rest.  Martin's numbers suggest the cache isn't caching at 
all.

Martin, does iostat show enormously more reads for the Htree case?

Regards,

Daniel


* Re: [Bug 417] New: htree much slower than regular ext3
  2003-02-27 17:31 [Bug 417] New: htree much slower than regular ext3 Martin J. Bligh
  2003-02-28  2:55 ` Daniel Phillips
@ 2003-03-07 15:46 ` Alex Tomas
  2003-03-08 17:38   ` Daniel Phillips
  2003-03-11 21:57   ` Bill Davidsen
       [not found] ` <20030307214833.00a37e35.akpm@digeo.com>
       [not found] ` <20030309184755.ACC80FCA8C@mx12.arcor-online.net>
  3 siblings, 2 replies; 38+ messages in thread
From: Alex Tomas @ 2003-03-07 15:46 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: linux-kernel, ext2-devel, Theodore Ts'o, Daniel Phillips,
	Andrew Morton, Alex Tomas


Hi!

The problem is that getdents(2) returns inodes out of order and
du causes many head seeks. I tried to solve the problem with the patch
I included below. The idea of the patch is pretty simple: just try
to sort dentries by inode number in readdir(). It works because
inodes sit at fixed places in ext2/ext3. Please look at the results
I just got:

                  real        user         sys

ext3+htree:  7m22.236s    0m0.292s    0m2.204s
ext3-htree:  3m6.539s     0m0.481s    0m2.278s
ext3+sd-05:  3m45.453s    0m0.493s    0m2.265s
ext3+sd-10:  3m35.770s    0m0.514s    0m2.076s
ext3+sd-15:  3m24.235s    0m0.558s    0m2.075s
ext3+sd-20:  3m6.802s     0m0.486s    0m2.214s
ext3+sd-25:  2m55.062s    0m0.490s    0m2.105s
ext3+sd-30:  2m39.658s    0m0.492s    0m2.138s


'sd-N' stands for sortdir, where N blocks are prefetched before
sorting.

Of course, sortdir isn't production-ready and I haven't even tried
to test it hard; it is just a probe of the idea.

After applying the patch, use 'sortdir' or 'sortdir=N' as a mount
option.

with best regards, Alex

>>>>> Martin J Bligh (MJB) writes:

 MJB> Problem Description: I created a directory ("test") with 32000
 MJB> (ok, 31998) directories in it, and put a file called 'foo' in
 MJB> each of them (for i in `ls`; do cd $i && touch bar && cd .. ;
 MJB> done). Then I took that ext3 partition, unmounted it, did a
 MJB> 'tune2fs -O dir_index', then 'fsck -fD', and remounted. I then
 MJB> did a 'time du -hs' on the test directory, and here are the
 MJB> results.

 MJB> ext3+htree: bwindle@razor:/giant/inodes$ time du -hs 126M .

 MJB> real 7m21.756s user 0m2.021s sys 0m22.190s

 MJB> I then unmounted, tune2fs -O ^dir_index, e2fsck -fD /dev/hdb1,
 MJB> remounted, and did another du -hs on the test directory. It took
 MJB> 1 minute, 48 seconds.

 MJB> bwindle@razor:/giant/test$ time du -hs 126M .

 MJB> real 1m48.760s user 0m1.986s sys 0m21.563s


 MJB> I thought htree was supposed to speed up access with large
 MJB> numbers of directories?



diff -uNr linux/fs/ext3/dir.c edited/fs/ext3/dir.c
--- linux/fs/ext3/dir.c	Thu Mar  6 14:57:50 2003
+++ edited/fs/ext3/dir.c	Fri Mar  7 18:35:32 2003
@@ -33,8 +33,12 @@
 static int ext3_readdir(struct file *, void *, filldir_t);
 static int ext3_dx_readdir(struct file * filp,
 			   void * dirent, filldir_t filldir);
+static int ext3_readdir_sort (struct file * filp,
+                         void * dirent, filldir_t filldir);
 static int ext3_release_dir (struct inode * inode,
 				struct file * filp);
+int ext3_htree_store_dirent_sorted(struct file *dir_file,
+                             struct ext3_dir_entry_2 *dirent);
 
 struct file_operations ext3_dir_operations = {
 	.read		= generic_read_dir,
@@ -104,9 +108,17 @@
 	sb = inode->i_sb;
 
 	if (is_dx(inode)) {
+		if (test_opt(inode->i_sb, SORTDIR)) {
+			err = ext3_readdir_sort(filp, dirent, filldir);
+			unlock_kernel();
+			return err;
+		}
+		
 		err = ext3_dx_readdir(filp, dirent, filldir);
-		if (err != ERR_BAD_DX_DIR)
+		if (err != ERR_BAD_DX_DIR) {
+			unlock_kernel();
 			return err;
+		}
 		/*
 		 * We don't set the inode dirty flag since it's not
 		 * critical that it get flushed back to the disk.
@@ -249,6 +261,8 @@
 	__u32		inode;
 	__u8		name_len;
 	__u8		file_type;
+	loff_t		offs;
+	struct list_head list;
 	char		name[0];
 };
 
@@ -311,6 +325,9 @@
 	p->curr_hash = pos2maj_hash(pos);
 	p->curr_minor_hash = pos2min_hash(pos);
 	p->next_hash = 0;
+	INIT_LIST_HEAD(&p->list);
+	p->list_nr = 0;
+
 	return p;
 }
 
@@ -493,10 +510,193 @@
 
 static int ext3_release_dir (struct inode * inode, struct file * filp)
 {
-       if (is_dx(inode) && filp->private_data)
+       if (is_dx(inode) && filp->private_data) 
 		ext3_htree_free_dir_info(filp->private_data);
 
 	return 0;
 }
 
+int ext3_htree_fill_pool(struct file *dir_file,  struct ext3_dir_entry_2 *dirent)
+{                       
+        struct rb_node **p, *parent = NULL;
+        struct fname * fname, *new_fn, *prev_fname = NULL;
+        struct dir_private_info *info;
+        int len;        
+                
+        info = (struct dir_private_info *) dir_file->private_data;
+        p = &info->root.rb_node; 
+                       
+        /* Create and allocate the fname structure */ 
+        len = sizeof(struct fname) + dirent->name_len + 1;
+        new_fn = kmalloc(len, GFP_KERNEL);         
+        if (!new_fn)            
+                return -ENOMEM;
+        memset(new_fn, 0, len); 
+        new_fn->hash = new_fn->inode = le32_to_cpu(dirent->inode); 
+        new_fn->name_len = dirent->name_len;
+        new_fn->file_type = dirent->file_type;
+        memcpy(new_fn->name, dirent->name, dirent->name_len);
+        new_fn->name[dirent->name_len] = 0;
+        new_fn->offs = dir_file->f_pos;
+                                
+        while (*p) {
+                parent = *p;
+                fname = rb_entry(parent, struct fname, rb_hash);
+                
+                if (new_fn->hash < fname->hash)
+                        p = &(*p)->rb_left;
+                else {          
+                        prev_fname = fname;
+                        p = &(*p)->rb_right;
+                }               
+        }                               
+
+        if (prev_fname)
+                list_add(&new_fn->list, &prev_fname->list);
+        else
+                list_add(&new_fn->list, &info->list);
+
+        rb_link_node(&new_fn->rb_hash, parent, p);
+        rb_insert_color(&new_fn->rb_hash, &info->root);
+        info->list_nr++;
+
+        return 0;
+}
+
+static int ext3_fill_from_sort_pool (struct file * filp, void * dirent, filldir_t filldir)
+{
+	struct super_block * sb = filp->f_dentry->d_inode->i_sb;
+	struct dir_private_info *info = filp->private_data;
+	struct list_head * cur, * tmp;
+        struct fname *fname;
+	int error;
+
+	list_for_each_safe(cur, tmp, &info->list) {
+		fname = list_entry(cur, struct fname, list);
+
+                error = filldir(dirent, fname->name,
+				fname->name_len, fname->offs,
+				fname->inode,
+				get_dtype(sb, fname->file_type));
+		if (error)
+			return 0;
+
+		list_del(&fname->list);
+		info->list_nr--;
+	}
+	
+	if (info->list_nr == 0)
+		free_rb_tree_fname(&info->root);
+
+	return 0;
+}
+
+static int ext3_readdir_sort (struct file * filp,
+                         void * dirent, filldir_t filldir)
+{
+        unsigned long offset, blk;
+        int i, stored, error = 0, blocks = 0, err;
+        struct buffer_head * bh;
+        struct ext3_dir_entry_2 * de;
+        struct inode *inode = filp->f_dentry->d_inode;
+	struct super_block * sb = inode->i_sb;
+	struct dir_private_info *info = filp->private_data;
+
+	if (!info) {
+		info = create_dir_info(filp->f_pos);
+		if (!info)
+			return -ENOMEM;
+		filp->private_data = info;
+	}
+
+	if (info->list_nr > 0) {
+		ext3_fill_from_sort_pool(filp, dirent, filldir);
+		return 0;
+	}
+	
+        stored = 0;
+        bh = NULL;
+        offset = filp->f_pos & (sb->s_blocksize - 1);
+
+        while (!error && (blocks < EXT3_SB(sb)->sortdir_prefetch) &&
+			(filp->f_pos < inode->i_size)) {
+                blk = (filp->f_pos) >> EXT3_BLOCK_SIZE_BITS(sb);
+                bh = ext3_bread (0, inode, blk, 0, &err);
+                if (!bh) {
+                        ext3_error (sb, "ext3_readdir",
+                                "directory #%lu contains a hole at offset %lu",
+                                inode->i_ino, (unsigned long)filp->f_pos);
+                        filp->f_pos += sb->s_blocksize - offset;
+                        continue;
+                }
+
+revalidate:
+                /* If the dir block has changed since the last call to
+                 * readdir(2), then we might be pointing to an invalid
+                 * dirent right now.  Scan from the start of the block
+                 * to make sure. */
+                if (filp->f_version != inode->i_version) {
+                        for (i = 0; i < sb->s_blocksize && i < offset; ) {
+                                de = (struct ext3_dir_entry_2 *)
+                                        (bh->b_data + i);
+                                /* It's too expensive to do a full
+                                 * dirent test each time round this
+                                 * loop, but we do have to test at
+                                 * least that it is non-zero.  A
+                                 * failure will be detected in the
+                                 * dirent test below. */
+                                if (le16_to_cpu(de->rec_len) <
+                                                EXT3_DIR_REC_LEN(1))
+                                        break;
+                                i += le16_to_cpu(de->rec_len);
+                        }
+                        offset = i;
+                        filp->f_pos = (filp->f_pos & ~(sb->s_blocksize - 1))
+                                | offset;
+                        filp->f_version = inode->i_version;
+                }
+
+                while (!error && filp->f_pos < inode->i_size
+                       && offset < sb->s_blocksize) {
+                        de = (struct ext3_dir_entry_2 *) (bh->b_data + offset);
+                        if (!ext3_check_dir_entry ("ext3_readdir", inode, de,
+                                                   bh, offset)) {
+                                /* On error, skip the f_pos to the
+                                 * next block. */
+                                filp->f_pos = (filp->f_pos |
+                                                (sb->s_blocksize - 1)) + 1;
+                                brelse (bh);
+                                return stored;
+                        }
+                        offset += le16_to_cpu(de->rec_len);
+                        if (le32_to_cpu(de->inode)) {
+                                /* We might block in the next section
+                                 * if the data destination is
+                                 * currently swapped out.  So, use a
+                                 * version stamp to detect whether or
+                                 * not the directory has been modified
+                                 * during the copy operation.
+                                 */
+                                unsigned long version = filp->f_version;
+
+				error = ext3_htree_fill_pool(filp, de);
+                                if (error)
+                                        break;
+                                if (version != filp->f_version)
+                                        goto revalidate;
+                                stored ++;
+                        }
+                        filp->f_pos += le16_to_cpu(de->rec_len);
+                }
+                offset = 0;
+                brelse (bh);
+		blocks++;
+        }
+	
+	ext3_fill_from_sort_pool(filp, dirent, filldir);
+	
+        UPDATE_ATIME(inode);
+        return 0;
+}
+
 #endif
diff -uNr linux/fs/ext3/super.c edited/fs/ext3/super.c
--- linux/fs/ext3/super.c	Mon Feb 24 17:47:45 2003
+++ edited/fs/ext3/super.c	Fri Mar  7 17:47:27 2003
@@ -584,6 +584,19 @@
 			continue;
 		if ((value = strchr (this_char, '=')) != NULL)
 			*value++ = 0;
+		if (!strcmp (this_char, "sortdir")) {
+			if (value && *value) {
+				int blocks = simple_strtoul(value, &value, 0);
+				printk("EXT3: has value: %d\n", blocks);
+				if (blocks > 0)
+					sbi->sortdir_prefetch = blocks;
+			}
+			if (sbi->sortdir_prefetch == 0)
+				sbi->sortdir_prefetch = 10; /* default value */
+			set_opt(sbi->s_mount_opt, SORTDIR);
+			printk("EXT3-fs: sortdir (%d blocks prefetch)\n",
+					sbi->sortdir_prefetch);
+		} else
 #ifdef CONFIG_EXT3_FS_XATTR
 		if (!strcmp (this_char, "user_xattr"))
 			set_opt (sbi->s_mount_opt, XATTR_USER);
diff -uNr linux/include/linux/ext3_fs.h edited/include/linux/ext3_fs.h
--- linux/include/linux/ext3_fs.h	Mon Feb 24 17:47:33 2003
+++ edited/include/linux/ext3_fs.h	Fri Mar  7 18:32:53 2003
@@ -330,6 +330,7 @@
 #define EXT3_MOUNT_NO_UID32		0x2000  /* Disable 32-bit UIDs */
 #define EXT3_MOUNT_XATTR_USER		0x4000	/* Extended user attributes */
 #define EXT3_MOUNT_POSIX_ACL		0x8000	/* POSIX Access Control Lists */
+#define EXT3_MOUNT_SORTDIR		0x10000	/* sort indexed dirs in readdir() */
 
 /* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */
 #ifndef _LINUX_EXT2_FS_H
@@ -656,6 +657,8 @@
 	__u32		curr_hash;
 	__u32		curr_minor_hash;
 	__u32		next_hash;
+	struct list_head list;
+	int		list_nr;
 };
 
 /*
diff -uNr linux/include/linux/ext3_fs_sb.h edited/include/linux/ext3_fs_sb.h
--- linux/include/linux/ext3_fs_sb.h	Mon Nov 11 06:28:30 2002
+++ edited/include/linux/ext3_fs_sb.h	Fri Mar  7 17:30:46 2003
@@ -63,6 +63,7 @@
 	struct timer_list turn_ro_timer;	/* For turning read-only (crash simulation) */
 	wait_queue_head_t ro_wait_queue;	/* For people waiting for the fs to go read-only */
 #endif
+	int sortdir_prefetch;			/* how many blocks to read to fill pool --AT */
 };
 
 #endif	/* _LINUX_EXT3_FS_SB */



* Re: [Bug 417] New: htree much slower than regular ext3
  2003-03-08 17:38   ` Daniel Phillips
@ 2003-03-07 23:27     ` Theodore Ts'o
  2003-03-09 19:26       ` Alex Tomas
  2003-03-09  7:08     ` Alex Tomas
  1 sibling, 1 reply; 38+ messages in thread
From: Theodore Ts'o @ 2003-03-07 23:27 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Alex Tomas, Martin J. Bligh, linux-kernel, ext2-devel, Andrew Morton

On Sat, Mar 08, 2003 at 06:38:50PM +0100, Daniel Phillips wrote:
> > The problem is that getdents(2) returns inodes out of order and
> > du causes many head seeks. I tried to solve the problem with the patch
> > I included below. The idea of the patch is pretty simple: just try
> > to sort dentries by inode number in readdir(). 

Yup, that's the problem and the fix; the only issue is the one
which Daniel pointed out:

> The problem I see with your approach is that the traversal is no
> longer in hash order, so a leaf split in the middle of a directory
> traversal could result in a lot of duplicate dirents.  I'm not sure
> there's a way around that.

The fix in userspace would be for du and find (the two programs most
likely to care about this sort of thing) to pull into memory a large
chunk of the directory at a time, sort the entries by inode number, and
then stat them in inode order.
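
A minimal sketch of that userspace workaround (illustrative code, not taken
from du or find; the helper names are made up):

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

struct entry { ino_t ino; char name[256]; };

static int cmp_ino(const void *a, const void *b)
{
	const struct entry *x = a, *y = b;
	return (x->ino > y->ino) - (x->ino < y->ino);
}

/* Read the directory (or a large chunk of it) into memory first, sort
 * the entries by inode number, and only then stat them, so the inode
 * table blocks are visited roughly sequentially instead of in hash
 * order. */
static void scan_in_inode_order(const char *dirpath)
{
	DIR *dir = opendir(dirpath);
	struct dirent *de;
	struct entry *v = NULL, *nv;
	size_t n = 0, cap = 0, i;
	char path[4096];
	struct stat st;

	if (!dir)
		return;
	while ((de = readdir(dir)) != NULL) {
		if (n == cap) {
			cap = cap ? 2 * cap : 1024;
			nv = realloc(v, cap * sizeof(*v));
			if (!nv)
				goto out;
			v = nv;
		}
		v[n].ino = de->d_ino;
		snprintf(v[n].name, sizeof(v[n].name), "%s", de->d_name);
		n++;
	}
	qsort(v, n, sizeof(*v), cmp_ino);
	for (i = 0; i < n; i++) {
		snprintf(path, sizeof(path), "%s/%s", dirpath, v[i].name);
		if (lstat(path, &st) == 0)
			printf("%s %lu\n", v[i].name, (unsigned long)st.st_blocks);
	}
out:
	free(v);
	closedir(dir);
}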

The only other way around it would be to maintain an entirely
separate B-tree on disk just for the benefit of readdir(), which could
be in inode sort order.  Readdir() could then traverse the tree in
inode sort order, thus avoiding the performance penalty.  (The b-tree
has to be on disk, since for large directories we can't afford to keep
it in memory.)

> We've apparently got a simpler problem though: inode numbers aren't
> allocated densely enough in the inode table blocks.  This is the
> only thing I can think of that could cause the amount of thrashing
> we're seeing.  Spreading inode number allocations out all over a
> volume is ok, but sparse allocation within each table block is not
> ok because it multiplies the pressure on the cache.  We want a
> 30,000 entry directory to touch 1,000 - 2,000 inode table blocks,
> not the worst-case 30,000.  As above, fixing this comes down to
> tweaking the ext3_new_inode allocation goal.

Nope.  That's not it.  Inode numbers are allocated sequentially in the
inode table blocks.  You can't get any denser than that.  Create a
bunch of inodes in a directory like this:

mke2fs -j -F -b 1024 -g 1024 -N 1200000 test.img
mount -t ext3 -o loop test.img /mnt
pushd /mnt
mkdir test test2 test3
cd test
time seq -f "%06.0f" 1 100  | xargs touch
cd ..
cd test2
time seq -f "%06.0f" 1 10000  | xargs touch
cd ..
cd test3
time seq -f "%06.0f" 1 100000  | xargs touch
cd ..
popd

Then look at the result using "ls -li".  You will see that the inode
numbers are all sequentially allocated, when you look at the inodes in
creation order (which in this particular case is in filename sort
order ).  The problem is that when you look at them in hash sort
order, the inode numbers are scattered all over the place.

> Another approach is to have the inode number correspond approximately to the 
> hash value.  That's not as hard as it sounds, it simply means that the goal 
> for ext3_new_inode should be influenced by the hash value.  (But watch out 
> for interaction with the new Orlov allocator.)  

I'm not at all convinced that splattering the inodes all over the volume
is in fact a good thing, because it means splattering the block
allocation for files within the same directory all over the volume.
Arguably, read access to files in a directory is much more common than
the "du" or "find" case.  So if we need to make trade-off in one
direction or another, I'd much rather go with striving to maximize
block locality than making a readdir+stat run go fast.  This is
especially true since there *is* a user space workaround that would
speed up things for du and find.

						- Ted




* Re: [Bug 417] New: htree much slower than regular ext3
  2003-03-07 15:46 ` Alex Tomas
@ 2003-03-08 17:38   ` Daniel Phillips
  2003-03-07 23:27     ` Theodore Ts'o
  2003-03-09  7:08     ` Alex Tomas
  2003-03-11 21:57   ` Bill Davidsen
  1 sibling, 2 replies; 38+ messages in thread
From: Daniel Phillips @ 2003-03-08 17:38 UTC (permalink / raw)
  To: Alex Tomas, Martin J. Bligh
  Cc: linux-kernel, ext2-devel, Theodore Ts'o, Andrew Morton, Alex Tomas

On Fri 07 Mar 03 16:46, Alex Tomas wrote:
> Hi!
>
> The problem is that getdents(2) returns inodes out of order and
> du causes many head seeks. I tried to solve the problem with the patch
> I included below. The idea of the patch is pretty simple: just try
> to sort dentries by inode number in readdir(). It works because
> inodes sit at fixed places in ext2/ext3. Please look at the results
> I just got:
>
>                   real        user         sys
>
> ext3+htree:  7m22.236s    0m0.292s    0m2.204s
> ext3-htree:  3m6.539s     0m0.481s    0m2.278s
> [...]
> ext3+sd-30:  2m39.658s    0m0.492s    0m2.138s

Nice.  I posted a similar demonstration some time ago (could it *really* be 
two years?) but I sorted the dirents in user space, which meant it was just a 
demonstration, not a practical solution.

The problem I see with your approach is that the traversal is no longer in 
hash order, so a leaf split in the middle of a directory traversal could 
result in a lot of duplicate dirents.  I'm not sure there's a way around that.

Another approach is to have the inode number correspond approximately to the 
hash value.  That's not as hard as it sounds, it simply means that the goal 
for ext3_new_inode should be influenced by the hash value.  (But watch out 
for interaction with the new Orlov allocator.)  It also depends on some
a priori estimate of the size of a directory so that increasing hash values 
can be distributed more or less uniformly and monotonically across some 
corresponding range of inode numbers. This isn't too hard either, just round 
up the current size of the directory to a power of two and use that as the 
size estimate.  The size estimate would change from time to time over the 
life of the directory, but there are only log N different sizes and that's 
roughly how heavy the load on the inode cache would be during a directory 
traversal.  Finally, a nice property of this approach is that it stays stable 
over many creates and deletes.
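
As a rough illustration of that goal calculation (a hypothetical helper, not
from any posted patch): round the current entry count up to a power of two
and map the 32-bit hash proportionally onto an inode range of that size.

/*
 * Illustrative only: pick an inode allocation goal so that increasing
 * dirent hash values map monotonically onto increasing inode numbers.
 * base_ino is the first inode of the range used for this directory,
 * nr_entries the current entry count (the size estimate is that count
 * rounded up to a power of two).
 */
static unsigned long hash_to_inode_goal(unsigned long base_ino,
					unsigned long nr_entries,
					unsigned int hash)
{
	unsigned long window = 1;

	while (window < nr_entries)	/* round up to a power of two */
		window <<= 1;

	/* distribute increasing hash values monotonically across the window */
	return base_ino + (unsigned long)
		(((unsigned long long)hash * window) >> 32);
}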

We've apparently got a simpler problem though: inode numbers aren't allocated 
densely enough in the inode table blocks.  This is the only thing I can think 
of that could cause the amount of thrashing we're seeing.  Spreading inode 
number allocations out all over a volume is ok, but sparse allocation within 
each table block is not ok because it multiplies the pressure on the cache.  
We want a 30,000 entry directory to touch 1,000 - 2,000 inode table blocks, 
not the worst-case 30,000.  As above, fixing this comes down to tweaking the 
ext3_new_inode allocation goal.

Regards,

Daniel


* Re: [Ext2-devel] Re: [Bug 417] New: htree much slower than regular ext3
  2003-03-09 22:54     ` [Ext2-devel] " Daniel Phillips
@ 2003-03-08 23:19       ` Andrew Morton
  0 siblings, 0 replies; 38+ messages in thread
From: Andrew Morton @ 2003-03-08 23:19 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: adilger, tytso, bzzz, mbligh, linux-kernel, ext2-devel

Daniel Phillips <phillips@arcor.de> wrote:
>
> Yes, I think that's exactly what's happening.  There are some questions 
> remaining, such as why doesn't it happen to akpm.

Oh but it does.  The 30,000 mkdirs test takes 76 seconds on 1k blocks, 16
seconds on 4k blocks.

With 1k blocks, the journal is 1/4 the size, and there are 4x as many
blockgroups.

So not only does the fs have to checkpoint 4x as often, it has to seek all
over the disk to do it.

If you create a 1k blocksize fs with a 100MB journal, the test takes just
nine seconds.

But note that after these nine seconds, we were left with 50MB of dirty
buffercache.  When pdflush later comes along to write that out it takes >30
seconds, because of the additional fragmentation from the tiny blockgroups.
So all we've done is to pipeline the pain.

I suspect the Orlov allocator isn't doing the right thing here - a full
dumpe2fs of the disk shows that the directories are all over the place. 
Nobody has really looked at fine-tuning Orlov.

btw, an `rm -rf' of the dir which holds 30,000 dirs takes 50 seconds system
time on a 2.7GHz CPU.  What's up with that?

c01945bc str2hashbuf                                9790  61.1875
c0187064 ext3_htree_store_dirent                   10472  28.7692
c0154920 __getblk                                  11319 202.1250
c018d11c dx_probe                                  15934  24.2896
c018d4d0 ext3_htree_fill_tree                      15984  33.8644
c0188d3c ext3_get_branch                           18238  81.4196
c0136388 find_get_page                             18617 202.3587
c015477c bh_lru_install                            20493  94.8750
c0194290 TEA_transform                             21463 173.0887
c01536b0 __find_get_block_slow                     24069  62.6797
c0154620 __brelse                                  36139 752.8958
c01893dc ext3_get_block_handle                     43815  49.7898
c0154854 __find_get_block                          71292 349.4706



* Re: [Bug 417] New: htree much slower than regular ext3
  2003-03-08 17:38   ` Daniel Phillips
  2003-03-07 23:27     ` Theodore Ts'o
@ 2003-03-09  7:08     ` Alex Tomas
  2003-03-10 17:58       ` Daniel Phillips
  2003-03-10 21:25       ` Theodore Ts'o
  1 sibling, 2 replies; 38+ messages in thread
From: Alex Tomas @ 2003-03-09  7:08 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Alex Tomas, Martin J. Bligh, linux-kernel, ext2-devel,
	Theodore Ts'o, Andrew Morton

>>>>> Daniel Phillips (DP) writes:

 DP> On Fri 07 Mar 03 16:46, Alex Tomas wrote:
 DP> The problem I see with your approach is that the traversal is no
 DP> longer in hash order, so a leaf split in the middle of a
 DP> directory traversal could result in a lot of duplicate dirents.
 DP> I'm not sure there's a way around that.

1) As far as I understand, duplicates are possible even in classic ext2
   w/o sortdir/index. See the diagram:

                    Process 1                  Process 2

                    getdents(2) returns
                    dentry1 (file1 -> Inode1)
                    dentry2 (file2 -> Inode2)

context switch -->
                                               unlink(file1), empty dentry1
                                               creat(file3), Inode3, use dentry1
                                               creat(file1), Inode1, use dentry3

context switch -->

                    getdents(2) returns
                    dentry3(file1 -> Inode1)


Am I right?


2) Why not use hash order for traversal, like ext3_dx_readdir() does?
   Upon reading several dentries within some hash set, readdir() sorts them
   in inode order and returns them to the user.


with best regards, Alex



* Re: [Bug 417] New: htree much slower than regular ext3
  2003-03-07 23:27     ` Theodore Ts'o
@ 2003-03-09 19:26       ` Alex Tomas
  0 siblings, 0 replies; 38+ messages in thread
From: Alex Tomas @ 2003-03-09 19:26 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Daniel Phillips, Alex Tomas, Martin J. Bligh, linux-kernel,
	ext2-devel, Andrew Morton

>>>>> Theodore Ts'o (TT) writes:

 TT> The fix in userspace would be for du and find (the two programs
 TT> most likely to care about this sort of thing) to pull into memory
 TT> a large chunks of the directory at a time, sort them by inode,
 TT> and then stat them in inode order.

In fact, this solution only helps filesystems with fixed inode
placement; for example, it won't work for reiserfs.



* Re: [Ext2-devel] Re: [Bug 417] New: htree much slower than regular ext3
       [not found]   ` <20030308010424.Z1373@schatzie.adilger.int>
@ 2003-03-09 22:54     ` Daniel Phillips
  2003-03-08 23:19       ` Andrew Morton
  0 siblings, 1 reply; 38+ messages in thread
From: Daniel Phillips @ 2003-03-09 22:54 UTC (permalink / raw)
  To: Andreas Dilger, Andrew Morton
  Cc: tytso, bzzz, mbligh, linux-kernel, ext2-devel

On Sat 08 Mar 03 09:04, Andreas Dilger wrote:
> I was testing this in UML-over-loop in 2.4, and the difference in speed
> for doing file creates vs. directory creates is dramatic.  For file
> creates I was running 3s per 10000 files, and for directory creates I
> was running 12s per 10000 files.

And  on a 10K scsi disk I'm running 35s per 10,000 directories, which is way, 
way slower than it ought to be.  There are two analysis tools we're hurting 
for badly here:

   - We need to see the physical allocation maps for directories, preferably
     in a running kernel.  I think the best way to do this is a little 
     map-dumper hooked into ext3/dir.c and exported through /proc.

  - We need block-access traces in a nicer form than printks (or nobody
    will ever use them).  IOW, we need LTT or something very much like
    it.

> Depending on the size of the journal vs. how many block/inode bitmaps and
> directory blocks are dirtied, you will likely wrap the journal before you
> return to the first block group, so you might write 20kB * 32000 for the
> directory creates instead of 8kB for the file creates.  You also have a
> lot of seeking to each block group to write out the directory data, instead
> of nearly sequential IO for the inode create case.

Yes, I think that's exactly what's happening.  There are some questions 
remaining, such as why doesn't it happen to akpm.  Another question: why does 
it happen to the directory creates, where the only thing being accessed 
randomly is the directory itself - the inode table is supposedly being 
allocated/dirtied sequentially.

Regards,

Daniel


* Re: [Bug 417] New: htree much slower than regular ext3
       [not found] ` <20030307214833.00a37e35.akpm@digeo.com>
       [not found]   ` <20030308010424.Z1373@schatzie.adilger.int>
@ 2003-03-09 23:10   ` Daniel Phillips
  1 sibling, 0 replies; 38+ messages in thread
From: Daniel Phillips @ 2003-03-09 23:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: tytso, bzzz, mbligh, linux-kernel, ext2-devel

On Sat 08 Mar 03 06:48, Andrew Morton wrote:
> Daniel Phillips <phillips@arcor.de> wrote:
> > OK, according to akpm, data=journal is broken at the moment.  Changing
> > the host filesystem to data=ordered allowed me to complete the test.
> >
> > Interesting results.  Creating the initial 30,000 directories ought to
> > take just a few seconds with the index, but instead it takes 5 minutes
> > and 9 seconds, while consuming only 5 seconds of cpu time.
>
> It takes 14 seconds here against a real disk, and 9 seconds against loop.

The best I've seen is 77 seconds to make the directories, on a fast scsi disk 
with 4K blocksize.  That's a 6 1/2 times difference, what could account for 
it?

> The difference: I used 4k blocksize.  1k blocksize should not be that bad:
> something is broken.
>
> Suggest you test against a real disk for now.  Using loop may skew things.

OK, I tested 1K and 4K blocksize on scsi, ide and ide-hosted loopback, the 
results are summarized here:

Make 30,000 directories:
	scsi 10K	ide 5400	loopback
1K	1m44.243s	4m59.757s	4m8.362s
4K	1m16.659s	4m49.827s	1m43.399s

Make one file in each directory:
	scsi 10K	ide 5400	loopback
1K	2m50.483s	7m28.657s	4m8.793s
4K	2m34.461s	5m45.615s	3m0.603s

Stat the tree (du):
	scsi 10K	ide 5400	loopback
1K	0m43.245s	1m52.230s	0m1.343s
4K	0m1.038s	0m2.056s	0m1.016s

The first thing to notice:  Alex Tomas's original complaint about Htree and 
stat doesn't manifest at all here.  Mind you, I did not repeat his exact 
sequence of tune2fs's etc, so it's possible there could be a utils bug or 
something.  I'll check into that later.  Meanwhile, these stats show other 
breakage that needs to be investigated.

The directory creation time is several times higher than it should be, and the
same is true for the file creates.  With the index, the two sets of operations should
take about the same time.  Is this a journal problem?  My journal is about 8 
meg.  I can imagine the journal wrapping during the file creation, because of 
touching all the existing inode table blocks, but this should not happen 
during the directory creation.

The stats also show a huge slowdown for 1K blocksize on non-loopback, as you 
noticed earlier.   Hmm.  First: why on the raw device and not on loopback?  
Second: why does the device affect this at all?  This operation is r/o.  Hmm, 
except for atimes.  I reran the scsi tests with noatime+nodiratime and the 
stat times for 1K and 4K became nearly equal, about .75 seconds.  So there we 
have it: several tens of seconds are spent journalling dirty inode table 
blocks due to atime, but only when blocksize is 1K.  Big fat clue.

Stat the tree (du), noatime+nodiratime:
	scsi 10K
1K	0m0.780s
4K	0m0.751s

The cpu numbers are in the detailed results below, and they show that Htree 
is doing its job: all the cpu numbers are tiny.  We've kicked around the idea 
that Htree is causing inode table thrashing, but I think that's the wrong 
tree to bark up.  Journal effects are looking more and more like the real 
problem, possibly in combination with IO scheduling.  The fact that Htree 
dirties the inode table out of order is just exposing some other underlying 
problem.

A quick look at slabinfo before and after unmounting the test volume shows 
that all the inodes we expect to be cached, actually are:

cat /proc/slabinfo | grep ext3_inode
ext3_inode_cache  121469 121473    448 13497 13497    1 :  120   60

sudo umount /mnt; cat /proc/slabinfo | grep ext3_inode
ext3_inode_cache   61506  61632    448 6836 6848    1 :  120   60

So there were about 60,000 inodes cached for that volume, exactly as 
expected.  So much for the theory that we're thrashing the inode table cache.
What's actually happening, I strongly suspect, is a lot of dirty blocks are 
getting pushed through the journal a little prematurely, without allowing any 
opportunity for coalescing.  At create time, it must be the directory blocks, 
not the inode table blocks, that cause the extra writeout, since inodes are 
being allocated (mostly) sequentially.

By the way, the script is a perfect trigger for the scheduler interactivity 
bug, leaving me mouseless for 4 seconds at a time.  I guess I'll apply 
Linus's patch and see if that goes away.

One other thing I noticed is that, when the script tries to umount 
immediately after the stat, I get "umount: /mnt: device is busy"; however, 
after a few seconds it works fine.  This strikes me as a bug.

======================
1K blocksize, 10K rpm scsi
======================

Make directories...

real    1m44.243s
user    0m0.142s
sys     0m3.465s

Make one file in each directory...

real    2m50.483s
user    0m23.617s
sys     0m59.465s

Stat the tree...
30M     /mnt/test

real    0m43.245s
user    0m0.245s
sys     0m0.915s

======================
1K blocksize, 5400 rpm ide
======================

Make directories...

real    4m59.757s
user    0m0.138s
sys     0m5.207s

Make one file in each directory...

real    7m28.657s
user    0m24.351s
sys     0m58.941s

Stat the tree...
30M     /mnt/test

real    1m52.230s
user    0m0.261s
sys     0m1.013s

===================
1K blocksize, loopback
===================

Make directories...

real    4m8.362s
user    0m0.144s
sys     0m7.448s

Make one file in each directory...

real    4m8.793s
user    0m25.262s
sys     1m20.376s

Stat the tree...
30M     /mnt/test

real    0m1.343s
user    0m0.265s
sys     0m0.720s

======================
4K blocksize, 10K rpm scsi
======================

Make directories...

real    1m16.659s
user    0m0.152s
sys     0m4.106s

Make one file in each directory...

real    2m34.461s
user    0m23.555s
sys     0m58.533s

Stat the tree...
118M    /mnt/test

real    0m1.038s
user    0m0.242s
sys     0m0.664s

======================
4K blocksize, 5400 rpm ide
======================

Make directories...

real    4m49.827s
user    0m0.153s
sys     0m16.666s

Make one file in each directory...

real    5m45.615s
user    0m23.836s
sys     1m6.908s

Stat the tree...
118M    /mnt/test

real    0m2.056s
user    0m0.432s
sys     0m0.622s

===================
4K blocksize, loopback
===================

Make directories...

real    1m43.399s
user    0m0.149s
sys     0m10.806s

Make one file in each directory...

real    3m0.603s
user    0m25.568s
sys     1m34.851s

Stat the tree...
118M    /mnt/test

real    0m1.016s
user    0m0.318s
sys     0m0.588s

========
The script
========

#volume=test.img
#volume=/dev/hda5
volume=/dev/sdb5
mke2fs -j -F -b 4096 -g 1024 $volume
mount -t ext3 -o loop $volume /mnt
cd /mnt
mkdir test 
cd test

echo
echo Make directories...
time seq -f "%06.0f" 1 30000 | xargs mkdir

echo
echo Make one file in each directory...
time for i in `ls`; do cd $i && touch foo && cd .. ; done

echo
echo Stat the tree...
time du -hs /mnt/test

echo umount /mnt


* Re: [Bug 417] New: htree much slower than regular ext3
  2003-03-09  7:08     ` Alex Tomas
@ 2003-03-10 17:58       ` Daniel Phillips
  2003-03-10 21:25       ` Theodore Ts'o
  1 sibling, 0 replies; 38+ messages in thread
From: Daniel Phillips @ 2003-03-10 17:58 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Alex Tomas, Martin J. Bligh, linux-kernel, ext2-devel,
	Theodore Ts'o, Andrew Morton

On Sun 09 Mar 03 08:08, Alex Tomas wrote:
> >>>>> Daniel Phillips (DP) writes:
>
>  DP> On Fri 07 Mar 03 16:46, Alex Tomas wrote:
>  DP> The problem I see with your approach is that the traversal is no
>  DP> longer in hash order, so a leaf split in the middle of a
>  DP> directory traversal could result in a lot of duplicate dirents.
>  DP> I'm not sure there's a way around that.
>
> 1) As far as I understand, duplicates are possible even in classic ext2
>    w/o sortdir/index. See the diagram:
>
>                     Process 1                  Process 2
>
>                     getdents(2) returns
>                     dentry1 (file1 -> Inode1)
>                     dentry2 (file2 -> Inode2)
>
> context switch -->
>                                                unlink(file1), empty dentry1
>                                                creat(file3), Inode3, use dentry1
>                                                creat(file1), Inode1, use dentry3
>
> context switch -->
>
>                     getdents(2) returns
>                     dentry3(file1 -> Inode1)
>
>
> Am I right?
>
>
> 2) Why not use hash order for traversal, like ext3_dx_readdir() does?
>    Upon reading several dentries within some hash set, readdir() sorts them
>    in inode order and returns them to the user.
>
>
> with best regards, Alex

You're right, but you still don't win the stuffed poodle.

I put forth the same argument at last year's kernel workshop and was 
overruled on the grounds that, let me see:

  - Duplicates as a result of unlinks are one thing, duplicates as a
    result of creates are another, worse thing.

  - Creating one duplicate as a result of one unlink is one thing,
    creating lots of duplicates with one operation is another, worse
    thing.

  - The rules are written to allow this particular duplicate behavior
    in UFS (design inherited by Ext2/3) because at the time, UFS was
    the only game in town.

The phrase "broken by design" comes to mind.  Ted and Stephen would no doubt 
be happy to elaborate.  Anyway, there's another reason we need to return 
results in hash order, and that is telldir/seekdir, which expects a stable 
enumeration of a directory with no duplicates, even with concurrent 
operations going on, and even if the server is rebooted in the middle of a 
directory traversal.  Not just broken by design, but smashed to pieces by 
design.  And we still have to make it work, for some definition of "work".

Anyway,


* [RFC] Improved inode number allocation for HTree
       [not found]   ` <m3u1ecl5h8.fsf@lexa.home.net>
@ 2003-03-10 20:45     ` Daniel Phillips
       [not found]       ` <3E6D1D25.5000004@namesys.com>
  2003-03-10 20:48     ` [RFC] Improved inode number allocation for HTree Daniel Phillips
  1 sibling, 1 reply; 38+ messages in thread
From: Daniel Phillips @ 2003-03-10 20:45 UTC (permalink / raw)
  Cc: Theodore Ts'o, Andreas Dilger, Christopher Li, Alex Tomas

Summary of the problem:

HTree's random relationship between directory entry order and inode order has 
been observed to cause increased cache pressure, affecting performance under 
certain conditions.  Most of this pressure is due to inode table usage, which 
is normally several times larger than the directory itself.  Even when the 
inode table usage is much less than available cache, it may still exceed the 
size of the journal file (8 MB by default).  Though the journal only becomes 
involved when blocks are modified, unfortunately, because of atime updates, 
this includes all directory operations.  We could suggest to users that they 
should disable atime updating if they care about performance, but we ought to 
be able to do better than that.

To reduce this cache pressure, what we need to do is make the directory entry 
order correspond more closely to the inode number order.  It doesn't have to 
correspond perfectly in order to show substantial improvement, as Tomas 
demonstrated with his recent getdents sorting patch.

Analysis:

To understand why sorting relatively small sections of the directory entries 
reduces the cache pressure, we can think about the lifetime of an inode table 
block over the course of a directory traversal.  This traversal lifetime 
extends from when it is first accessed (always a write as noted above) to 
when it is last accessed, or the directory traversal finishes.  Each inode 
table block contains multiple inodes, so if those inodes are accessed 
randomly, the traversal lifetime of every inode table block will tend to be 
from nearly the beginning of the traversal to nearly the end, and so the load 
on the cache will be nearly the same as the number of inode table blocks.  If 
this load is greater than available cache, various blocks will have to be 
written out and reloaded later for further processing.  We can call the 
period between loading and evicting an inode table block, its "cache period".

The number of times a block must be written/reloaded depends not only on the 
number of inode table blocks covered by the directory and the availability of 
cache memory, but on the probable number of inodes within the block that 
will be written during each of the block's cache periods.  By sorting the 
directory chunkwise, the probable number of inodes processed in one cache 
period increases, reducing the number of cache periods, in turn reducing disk 
seeking and transfer overhead.

To apply the above model to the journal, just substitute "journal period" for 
"cache period", where journal period is defined as the time from when a dirty 
block first enters the journal to when it is written out to disk. 

We can see that local sorting of directory chunks has hardly any effect on 
the traversal lifetime of a block; it only reduces the number of cache 
periods.  The size of the chunk of directory that we sort locally presumably 
stays constant, so the probable number of inodes on each inode table block 
processed in each cache period approaches one as the directory size 
increases.  At some point, there is no longer any benefit from sorting, and we 
arrive at that point at rather realistic directory sizes.

We can always do the job perfectly by sorting the whole directory.  This 
would require changes in every getdents-using application that cares about 
performance, to either make the getdents window as big as the directory or to 
do the sort in user space.  A spinoff benefit of doing the full sort is that 
it becomes trivial to detect and eliminate duplicate entries.  Higher memory 
load is a drawback, as is the work required to update user space programs.  
There will also be noticeable latency introduced by the sort, which didn't 
appear before.  If sorting is to be done in user space then we impose the 
requirement on all filesystems - even those that do not use inode numbers 
internally - that they supply an inode key for sorting, and that the supplied 
sorting key be meaningful in terms of cache locality.

Proposed Solution:

What if we could maintain the inodes in the inode table in the same order as 
the directory entries, permanently, so that no sorting is required?  If we 
could do this, the traversal lifetime of each inode table block would be the 
same as the number of inodes it contains, and the cache load due to the inode 
table would be exactly one block.  That is, the directory traversal would 
process each inode table block completely before moving on to the next one. 
Though the current Ext3 filesystem structure does not support this directly, 
we can approximate the effect quite closely within the existing framework.

Due to Ted's recent work, we are already traversing the directory in hash 
order, as opposed to physical allocation order, because of the particular 
approach adopted to address the telldir/seekdir problem.  This is relatively 
inexpensive.  So now, what we would like to do is make the inode number of 
each directory entry increase monotonically, corresponding to monotonically 
increasing hash values.  We also want to allocate those inode numbers densely 
in the inode table, to minimize seeking.  If we know beforehand how many 
entries the directory will have, this is easy: when we create an entry, we 
multiply its hash value by the total number of entries divided by the total 
hash range (2**31), add in a base target, and use that as the goal to 
allocate a new inode.  Due to the randomness of the hash value, we will 
sometimes be forced to allocate an inode number out of order, but hopefully 
there won't be too many of those, and they won't be very far out of order.  
We may also end up with some gaps in the inode table.  In fact, we may 
intentionally leave some slack in order to reduce the number of out-of-order 
allocations.  These problems are not serious.
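
To make the arithmetic concrete, here is a sketch of the goal calculation
(purely illustrative, not actual ext3 code; the function name, the 31-bit
hash width and the argument names are all assumptions based on the
description above):

/* Scale a 31-bit name hash onto an inode-number goal inside a region of
 * "expected_entries" inodes starting at "base".  Out-of-order and gap
 * handling are ignored here. */
static unsigned long dx_inode_goal(unsigned long base, unsigned long hash,
				   unsigned long expected_entries)
{
	/* hash lies in [0, 2^31); multiply by expected_entries / 2^31 */
	return base + (unsigned long)
		(((unsigned long long)hash * expected_entries) >> 31);
}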

The big problem is that we don't know the number of directory entries ahead 
of time.  To get around this, we can use the current size of the directory, 
rounded up to a power of two as the size estimate.  Each time the directory
increases to a new binary size we establish a new region in the inode table, 
which will map to the to-be-created half of directory entries, roughly in 
order.
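
And the size estimate itself, in the same illustrative spirit
(dx_size_estimate is a made-up name):

/* Round the current entry count up to a power of two; this becomes the
 * "expected_entries" used to scale the hash above. */
static unsigned long dx_size_estimate(unsigned long nr_entries)
{
	unsigned long est = 1;

	while (est < nr_entries)
		est <<= 1;
	return est;
}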

Now, consider what happens to a group of directory entries created when the 
directory was so small that it mapped to a single inode table block.  After a 
series of leaf splits, these former neighbors will end up far apart in the 
hash order, but their inode numbers will still be ordered monotonically over 
a traversal.  The traversal lifetime of that group of entries will be nearly 
the entire directory traversal, but there is only one such block.

When the directory became larger, some newly created entries were mapped to 
two new inode table blocks, the traversal lifetimes of which are roughly half 
the entire traversal.  There are only two such blocks, and their lifetimes do 
not overlap, so these two blocks only generate one block worth of cache load.

So it goes, with each new set of entries of size 2**i generating only one 
additional block of cache load.  The final cache load is O(log2(N)), a 
pleasingly small number.

A very nice property of this inode allocation strategy is that the ordering 
will tend to be reinforced as a directory ages, as opposed to slowly decaying 
towards randomness as happens with linear directories.

So this approach is attractive, especially as it can be implemented in a 
small number of lines of code.  The main outstanding issue I have not 
addressed is how to arrange things so that the inode allocation patterns of 
neighbouring directories don't interfere with each other too much.  I expect 
that each time the directory grows to a new power of two size, we need to 
scan the inode allocation map for a relatively unpopulated region of that 
size and store the region's base in the directory header.  Would we need to 
store the entire history of allocation bases?  Probably not.  This aspect 
requires further consideration.

Regards,

Daniel


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [RFC] Improved inode number allocation for HTree
       [not found]   ` <m3u1ecl5h8.fsf@lexa.home.net>
  2003-03-10 20:45     ` [RFC] Improved inode number allocation for HTree Daniel Phillips
@ 2003-03-10 20:48     ` Daniel Phillips
  2003-03-10 21:04       ` John Bradford
  1 sibling, 1 reply; 38+ messages in thread
From: Daniel Phillips @ 2003-03-10 20:48 UTC (permalink / raw)
  To: ext2-devel, linux-kernel
  Cc: Theodore Ts'o, Andreas Dilger, Christopher Li, Alex Tomas

Summary of the problem:

HTree's random relationship between directory entry order and inode order has 
been observed to cause increased cache pressure, affecting performance under 
certain conditions.  Most of this pressure is due to inode table usage, which 
is normally several times larger than the directory itself.  Even when the 
inode table usage is much less than available cache, it may still exceed the 
size of the journal file (8 MB by default).  Though the journal only becomes 
involved when blocks are modified, unfortunately, because of atime updates, 
this includes all directory operations.  We could suggest to users that they 
should disable atime updating if they care about performance, but we ought to 
be able to do better than that.

To reduce this cache pressure, what we need to do is make the directory entry 
order correspond more closely to the inode number order.  It doesn't have to 
correspond perfectly in order to show substantial improvement, as Tomas 
demonstrated with his recent getdents sorting patch.

Analysis:

To understand why sorting relatively small sections of the directory entries 
reduces the cache pressure, we can think about the lifetime of an inode table 
block over the course of a directory traversal.  This traversal lifetime 
extends from when it is first accessed (always a write as noted above) to 
when it is last accessed, or the directory traversal finishes.  Each inode 
table block contains multiple inodes, so if those inodes are accessed 
randomly, the traversal lifetime of every inode table block will tend to be 
from nearly the beginning of the traversal to nearly the end, and so the load 
on the cache will be nearly the same as the number of inode table blocks.  If 
this load is greater than available cache, various blocks will have to be 
written out and reloaded later for further processing.  We can call the 
period between loading and evicting an inode table block, its "cache period".

The number of times a block must be written/reloaded depends not only on the 
number of inode table blocks covered by the directory and the availability of 
cache memory, but on the probable number of inodes within the block that 
will be written during each of the block's cache periods.  By sorting the 
directory chunkwise, the probable number of inodes processed in one cache 
period increases, reducing the number of cache periods, in turn reducing disk 
seeking and transfer overhead.

To apply the above model to the journal, just substitute "journal period" for 
"cache period", where journal period is defined as the time from when a dirty 
block first enters the journal to when it is written out to disk. 

We can see that local sorting of directory chunks has hardly any effect on 
the traversal lifetime of a block; it only reduces the number of cache 
periods.  The size of the chunk of directory that we sort locally presumably 
stays constant, so the probable number of inodes on each inode table block 
processed in each cache period approaches one as the directory size 
increases.  At some point, there is no longer any benefit from sorting, and we 
arrive at that point at rather realistic directory sizes.

We can always do the job perfectly by sorting the whole directory.  This 
would require changes in every getdents-using application that cares about 
performance, to either make the getdents window as big as the directory or to 
do the sort in user space.  A spinoff benefit of doing the full sort is that 
it becomes trivial to detect and eliminate duplicate entries.  Higher memory 
load is a drawback, as is the work required to update user space programs.  
There will also be noticeable latency introduced by the sort, which didn't 
appear before.  If sorting is to be done in user space then we impose the 
requirement on all filesystems - even those that do not use inode numbers 
internally - that they supply an inode key for sorting, and that the supplied 
sorting key be meaningful in terms of cache locality.

Proposed Solution:

What if we could maintain the inodes in the inode table in the same order as 
the directory entries, permanently, so that no sorting is required?  If we 
could do this, the traversal lifetime of each inode table block would be the 
same as the number of inodes it contains, and the cache load due to the inode 
table would be exactly one block.  That is, the directory traversal would 
process each inode table block completely before moving on to the next one. 
Though the current Ext3 filesystem structure does not support this directly, 
we can approximate the effect quite closely within the existing framework.

Due to Ted's recent work, we are already traversing the directory in hash 
order, as opposed to physical allocation order, because of the particular 
approach adopted to address the telldir/seekdir problem.  This is relatively 
inexpensive.  So now, what we would like to do is make the inode number of 
each directory entry increase monotonically, corresponding to monotonically 
increasing hash values.  We also want to allocate those inode numbers densely 
in the inode table, to minimize seeking.  If we know beforehand how many 
entries the directory will have, this is easy: when we create an entry, we 
multiply its hash value by the total number of entries divided by the total 
hash range (2**31), add in a base target, and use that as the goal to 
allocate a new inode.  Due to the randomness of the hash value, we will 
sometimes be forced to allocate an inode number out of order, but hopefully 
there won't be too many of those, and they won't be very far out of order.  
We may also end up with some gaps in the inode table.  In fact, we may 
intentionally leave some slack in order to reduce the number of out-of-order 
allocations.  These problems are not serious.

The big problem is that we don't know the number of directory entries ahead 
of time.  To get around this, we can use the current size of the directory, 
rounded up to a power of two as the size estimate.  Each time the directory
increases to a new binary size we establish a new region in the inode table, 
which will map to the to-be-created half of directory entries, roughly in 
order.

Now, consider what happens to a group of directory entries created when the 
directory was so small that it mapped to a single inode table block.  After a 
series of leaf splits, these former neighbors will end up far apart in the 
hash order, but their inode numbers will still be ordered monotonically over 
a traversal.  The traversal lifetime of that group of entries will be nearly 
the entire directory traversal, but there is only one such block.

When the directory became larger, some newly created entries were mapped to 
two new inode table blocks, the traversal lifetimes of which are roughly half 
the entire traversal.  There are only two such blocks, and their lifetimes do 
not overlap, so these two blocks only generate one block worth of cache load.

So it goes, with each new set of entries of size 2**i generating only one 
additional block of cache load.  The final cache load is O(log2(N)), a 
pleasingly small number.

A very nice property of this inode allocation strategy is that the ordering 
will tend to be reinforced as a directory ages, as opposed to slowly decaying 
towards randomness as happens with linear directories.

So this approach is attractive, especially as it can be implemented in a 
small number of lines of code.  The main outstanding issue I have not 
addressed is how to arrange things so that the inode allocation patterns of 
neighbouring directories don't interfere with each other too much.  I expect 
that each time the directory grows to a new power of two size, we need to 
scan the inode allocation map for a relatively unpopulated region of that 
size and store the region's base in the directory header.  Would we need to 
store the entire history of allocation bases?  Probably not.  This aspect 
requires further consideration.

Regards,

Daniel


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] Improved inode number allocation for HTree
  2003-03-10 20:48     ` [RFC] Improved inode number allocation for HTree Daniel Phillips
@ 2003-03-10 21:04       ` John Bradford
  2003-03-10 21:28         ` Andreas Schwab
  2003-03-10 21:33         ` [RFC] Improved inode number allocation for HTree Daniel Phillips
  0 siblings, 2 replies; 38+ messages in thread
From: John Bradford @ 2003-03-10 21:04 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: ext2-devel, linux-kernel, tytso, adilger, chrisl, bzzz

> Though the journal only becomes involved when blocks are modified,
> unfortunately, because of atime updates, this includes all directory
> operations.  We could suggest to users that they should disable
> atime updating if they care about performance, but we ought to be
> able to do better than that.

On a separate note, since atime updates are not usually very important
anyway, why not have an option to cache atime updates for a long time,
or until a write occurs anyway?  Holding a large number of
atime updates in a write cache is generally not going to be a major
issue - the worst case, if a partition isn't cleanly unmounted, is that
some atimes will be wrong.

John.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Bug 417] New: htree much slower than regular ext3
  2003-03-09  7:08     ` Alex Tomas
  2003-03-10 17:58       ` Daniel Phillips
@ 2003-03-10 21:25       ` Theodore Ts'o
  1 sibling, 0 replies; 38+ messages in thread
From: Theodore Ts'o @ 2003-03-10 21:25 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Daniel Phillips, Martin J. Bligh, linux-kernel, ext2-devel,
	Andrew Morton

On Sun, Mar 09, 2003 at 10:08:34AM +0300, Alex Tomas wrote:
> 1) As far as I understand, duplicates are possible even in classic ext2
>    w/o sortdir/index. See the diagram:
> 
>                     Process 1                  Process 2
> 
>                     getdents(2) returns
>                     dentry1 (file1 -> Inode1)
>                     dentry2 (file2 -> Inode2)
> 
> context switch -->
>                                                unlink(file1), empty dentry1
>                                                creat(file3), Inode3, use dentry1
>                                                creat(file1), Inode1, use dentry3
> 
> context switch -->
> 
>                     getdents(2) returns
>                     dentry3(file1 -> Inode1)
> 
> 
> Am I right?

Yup.  POSIX allows this.  If you're in the middle of readdir() when a
file (or files) is created or deleted, then it is undefined how
readdir() will treat the filename(s) involved.  However, you're *not*
allowed to duplicate or omit filenames that were not touched by a file
creation or deletion.

> 2) Why not use hash order for traversal, like ext3_dx_readdir() does?
>    Upon reading several dentries within some hash set, readdir() sorts them
>    in inode order and returns them to the user.

Yeah, glibc could do this, although it would make telldir/seekdir more
than a little complicated.  (There's an awful lot of pain and agony
caused to filesystem developers by the existence of
telldir/seekdir, which inherently assumes a flat directory structure,
and so it makes it hard to do tricks like this.)

At some level this is the same solution as "do this in userspace",
except instead of needing to convince the maintainers of du and find,
we'd need to convince Ulrich.  It is *not* clear that it would
necessarily be easier to get this fix into the glibc.  In fact, since
du and find probably don't use seekdir/telldir, it's probably easier
to get the change into du and find.
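
For reference, the userspace change amounts to roughly the following (a
sketch only, not the actual du or find code; it assumes a POSIX-2008
scandir(3), and the function names are made up):

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Order directory entries by inode number. */
static int by_ino(const struct dirent **a, const struct dirent **b)
{
	return ((*a)->d_ino > (*b)->d_ino) - ((*a)->d_ino < (*b)->d_ino);
}

/* Read the whole directory, then stat() the entries in inode order so
 * the inode table is visited roughly sequentially. */
static void stat_in_inode_order(const char *path)
{
	struct dirent **list;
	char name[4096];
	struct stat st;
	int i, n;

	n = scandir(path, &list, NULL, by_ino);
	for (i = 0; i < n; i++) {
		snprintf(name, sizeof(name), "%s/%s", path, list[i]->d_name);
		if (stat(name, &st) == 0)
			; /* ... accumulate sizes, as du would ... */
		free(list[i]);
	}
	if (n >= 0)
		free(list);
}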

						- Ted

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] Improved inode number allocation for HTree
  2003-03-10 21:04       ` John Bradford
@ 2003-03-10 21:28         ` Andreas Schwab
  2003-03-10 21:50           ` Filesystem write priorities, (Was: Re: [RFC] Improved inode number allocation for HTree) John Bradford
  2003-03-10 21:33         ` [RFC] Improved inode number allocation for HTree Daniel Phillips
  1 sibling, 1 reply; 38+ messages in thread
From: Andreas Schwab @ 2003-03-10 21:28 UTC (permalink / raw)
  To: John Bradford
  Cc: Daniel Phillips, ext2-devel, linux-kernel, tytso, adilger, chrisl, bzzz

John Bradford <john@grabjohn.com> writes:

|> > Though the journal only becomes involved when blocks are modified,
|> > unfortunately, because of atime updates, this includes all directory
|> > operations.  We could suggest to users that they should disable
|> > atime updating if they care about performance, but we ought to be
|> > able to do better than that.
|> 
|> On a separate note, since atime updates are not usually very important
|> anyway, why not have an option to cache atime updates for a long time,
|> or until either a write occurs anyway.

mount -o noatime

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux AG, Deutschherrnstr. 15-19, D-90429 Nürnberg
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] Improved inode number allocation for HTree
  2003-03-10 21:04       ` John Bradford
  2003-03-10 21:28         ` Andreas Schwab
@ 2003-03-10 21:33         ` Daniel Phillips
  2003-03-10 21:47           ` [Ext2-devel] " Bryan O'Sullivan
  1 sibling, 1 reply; 38+ messages in thread
From: Daniel Phillips @ 2003-03-10 21:33 UTC (permalink / raw)
  To: John Bradford; +Cc: ext2-devel, linux-kernel, tytso, adilger, chrisl, bzzz

On Mon 10 Mar 03 22:04, John Bradford wrote:
> > Though the journal only becomes involved when blocks are modified,
> > unfortunately, because of atime updates, this includes all directory
> > operations.  We could suggest to users that they should disable
> > atime updating if they care about performance, but we ought to be
> > able to do better than that.
>
> On a separate note, since atime updates are not usually very important
> anyway, why not have an option to cache atime updates for a long time,
> or until either a write occurs anyway.  Holding a large number of
> atime updates in a write cache is generally not going to be a major
> issue - the worst case if a partition isn't cleanly unmounted is that
> some atimes will be wrong.

It sounds practical.  Why stop there?  Since Ted is seriously considering 
making a batch of incompatible extensions to the on-disk format anyway, how 
about adding an atime table to each block group, four bytes per inode?  Even 
without lazy updating, it would cut down the dirty blocks generated by r/o 
operations a lot.  If actual atime is the latest of the atime table value and 
the inode atime value[1], then inode write operations won't generate extra 
traffic.  You will only get new traffic when somebody wants the real atime.  
I'd put this under the category of "things to add to Ted's long list of fun 
new ideas for Ext3/4".

[1] How to handle wrapping is left as an exercise for the interested reader.
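
A minimal sketch of the idea, purely hypothetical (this is not a proposed
on-disk format; the structure, constant and function names are invented):

#include <linux/types.h>

#define INODES_PER_GROUP_EXAMPLE 16384	/* made-up figure */

/* One 32-bit atime per inode, held per block group. */
struct group_atime_table {
	__u32 atime[INODES_PER_GROUP_EXAMPLE];
};

/* Effective atime is whichever value is newer; wrap handling is
 * ignored, per the footnote. */
static __u32 effective_atime(__u32 table_atime, __u32 inode_atime)
{
	return table_atime > inode_atime ? table_atime : inode_atime;
}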

Regards,

Daniel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Ext2-devel] Re: [RFC] Improved inode number allocation for HTree
  2003-03-10 21:33         ` [RFC] Improved inode number allocation for HTree Daniel Phillips
@ 2003-03-10 21:47           ` Bryan O'Sullivan
  2003-03-10 22:02             ` Matthew Wilcox
  0 siblings, 1 reply; 38+ messages in thread
From: Bryan O'Sullivan @ 2003-03-10 21:47 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: John Bradford, ext2-devel, linux-kernel, tytso, adilger, chrisl, bzzz

On Mon, 2003-03-10 at 13:33, Daniel Phillips wrote:

> It sounds practical.  Why stop there?

Why start?  Who actually uses atime for anything at all, other than the
tiny number of shops that care about moving untouched files to tertiary
storage?

Surely if you want to heap someone else's plate with work, you should
offer a reason why :-)

	<b


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Filesystem write priorities, (Was: Re: [RFC] Improved inode number allocation for HTree)
  2003-03-10 21:28         ` Andreas Schwab
@ 2003-03-10 21:50           ` John Bradford
  2003-03-14 21:55             ` [Ext2-devel] " Stephen C. Tweedie
  0 siblings, 1 reply; 38+ messages in thread
From: John Bradford @ 2003-03-10 21:50 UTC (permalink / raw)
  To: Andreas Schwab
  Cc: phillips, ext2-devel, linux-kernel, tytso, adilger, chrisl, bzzz

> |> > Though the journal only becomes involved when blocks are modified,
> |> > unfortunately, because of atime updates, this includes all directory
> |> > operations.  We could suggest to users that they should disable
> |> > atime updating if they care about performance, but we ought to be
> |> > able to do better than that.
> |> 
> |> On a separate note, since atime updates are not usually very important
> |> anyway, why not have an option to cache atime updates for a long time,
> |> or until either a write occurs anyway.
> 
> mount -o noatime

Well, yes, I use noatime by default, but I was thinking that there
might be users who wanted to gain most of the performance that using
noatime would give, who still wanted access times available generally, but
who were not bothered about losing them on an unclean shutdown.

In fact, taking this one step further, we could assign directories
priorities, analogous to nice values for processes, and make the
laziness of writes dependent on that.

So, for example, on a webserver, you could assign databases a high
priority for writes, but your webserver log a low priority.  If the
box crashed, the data most likely to be lost would be the webserver
log, rather than the database.

Another benefit would be that you could spin down disks, and not have
them spin up just to write data to the webserver log - that could be
committed to the disk just once an hour, but a database write could
cause it to spin up immediately.

John.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Ext2-devel] Re: [RFC] Improved inode number allocation for HTree
  2003-03-10 21:47           ` [Ext2-devel] " Bryan O'Sullivan
@ 2003-03-10 22:02             ` Matthew Wilcox
  2003-03-11  8:47               ` Jakob Oestergaard
  2003-03-14 21:57               ` Stephen C. Tweedie
  0 siblings, 2 replies; 38+ messages in thread
From: Matthew Wilcox @ 2003-03-10 22:02 UTC (permalink / raw)
  To: Bryan O'Sullivan
  Cc: Daniel Phillips, John Bradford, ext2-devel, linux-kernel, tytso,
	adilger, chrisl, bzzz

On Mon, Mar 10, 2003 at 01:47:14PM -0800, Bryan O'Sullivan wrote:
> Why start?  Who actually uses atime for anything at all, other than the
> tiny number of shops that care about moving untouched files to tertiary
> storage?
> 
> Surely if you want to heap someone else's plate with work, you should
> offer a reason why :-)

"You have new mail" vs "You have mail".

-- 
"It's not Hollywood.  War is real, war is primarily not about defeat or
victory, it is about death.  I've seen thousands and thousands of dead bodies.
Do you think I want to have an academic debate on this subject?" -- Robert Fisk

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Ext2-devel] Re: [RFC] Improved inode number allocation for HTree
  2003-03-10 22:02             ` Matthew Wilcox
@ 2003-03-11  8:47               ` Jakob Oestergaard
  2003-03-11 11:27                 ` John Bradford
  2003-03-14 21:57               ` Stephen C. Tweedie
  1 sibling, 1 reply; 38+ messages in thread
From: Jakob Oestergaard @ 2003-03-11  8:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Bryan O'Sullivan, Daniel Phillips, John Bradford, ext2-devel,
	linux-kernel, tytso, adilger, chrisl, bzzz

On Mon, Mar 10, 2003 at 10:02:54PM +0000, Matthew Wilcox wrote:
> On Mon, Mar 10, 2003 at 01:47:14PM -0800, Bryan O'Sullivan wrote:
> > Why start?  Who actually uses atime for anything at all, other than the
> > tiny number of shops that care about moving untouched files to tertiary
> > storage?
> > 
> > Surely if you want to heap someone else's plate with work, you should
> > offer a reason why :-)
> 
> "You have new mail" vs "You have mail".

And having a clean /tmp using tmpwatch.  Everything in my /tmp not
accessed for a few weeks gets deleted.  I move cruft to /tmp every
single day, and I have not cleaned it up for years. It just stays tidy
all by itself.  Of course one has to remember this when leaving a few
database files there before taking a vacation  ;)

-- 
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] Improved inode number allocation for HTree
       [not found]         ` <20030311031216.8A31CEFD5F@mx12.arcor-online.net>
@ 2003-03-11 10:45           ` Hans Reiser
  2003-03-11 13:00             ` Helge Hafting
  0 siblings, 1 reply; 38+ messages in thread
From: Hans Reiser @ 2003-03-11 10:45 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Theodore Ts'o, Andreas Dilger, Christopher Li, Alex Tomas,
	Linux Kernel Mailing List

Let's make noatime the default for VFS.

Daniel Phillips wrote:

>Hi Hans,
>
>On Tue 11 Mar 03 00:17, Hans Reiser wrote:
>  
>
>>What do you think of creating a new telldir/seekdir which uses filenames
>>instead of ints to convey position?
>>    
>>
>
>What do you do if that file gets deleted in the middle of a traversal?
>
So what?  It still tells you where to restart.  Strictly speaking, you 
would want to allow the filesystem to choose what it wants to return to 
indicate position.  For htree and reiserfs this would be a filename or 
its hash, for ext2 without htree this would be a byte offset.

>
>If I were able to design Unix over again, I'd state that if you don't lock a 
>directory before traversing it then it's your own fault if somebody changes 
>it under you, and I would have provided an interface to inform you about your 
>bad luck.  Strictly wishful thinking.  (There, it feels better now.)
>
We are designing Linux.  You know, Microsoft and Steve Jobs continue to 
design.  We should too.

Needless change should be avoided.  Failure to change when something is 
known to be broken leads to being inferior to those who do change.

Let's design something that does it right.  Or else SCO will be right in 
saying that Linux is just a Unix ripoff by people who couldn't do 
anything unless Unix had been written to tell them how to do it right.

-- 
Hans



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Ext2-devel] Re: [RFC] Improved inode number allocation for HTree
  2003-03-11  8:47               ` Jakob Oestergaard
@ 2003-03-11 11:27                 ` John Bradford
  0 siblings, 0 replies; 38+ messages in thread
From: John Bradford @ 2003-03-11 11:27 UTC (permalink / raw)
  To: Jakob Oestergaard
  Cc: willy, bos, phillips, ext2-devel, linux-kernel, tytso, adilger,
	chrisl, bzzz

> > > Why start?  Who actually uses atime for anything at all, other than the
> > > tiny number of shops that care about moving untouched files to tertiary
> > > storage?

What about the situation where the primary storage is a device which
has a limited number of write-cycles.  That's just the sort of
application where you might be archiving data to secondary or tertiary
storage, and it would be a big advantage to save on writes.  If a file
is read every ten minutes, we could just update the atime once an
hour.  That would save five writes an hour.

John.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] Improved inode number allocation for HTree
  2003-03-11 10:45           ` Hans Reiser
@ 2003-03-11 13:00             ` Helge Hafting
  2003-03-11 13:41               ` Daniel Phillips
  0 siblings, 1 reply; 38+ messages in thread
From: Helge Hafting @ 2003-03-11 13:00 UTC (permalink / raw)
  To: Hans Reiser; +Cc: Daniel Phillips, Linux Kernel Mailing List

Hans Reiser wrote:
> Let's make noatime the default for VFS.
> 
> Daniel Phillips wrote:
[...]
>> If I were able to design Unix over again, I'd state that if you don't 
>> lock a directory before traversing it then it's your own fault if 
>> somebody changes it under you, and I would have provided an interface 
>> to inform you about your bad luck.  Strictly wishful thinking.  
>> (There, it feels better now.)

I'm happy nobody _can_ lock a directory like that.  Think of it - unable
to create or delete files while some slow-moving program is traversing
the directory?  Ouch.  Plenty of options for DOS attacks too.
And how to do "rm *.bak" if rm locks the dir for traversal?

Helge Hafting


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] Improved inode number allocation for HTree
  2003-03-11 13:00             ` Helge Hafting
@ 2003-03-11 13:41               ` Daniel Phillips
  2003-03-11 17:16                 ` Andreas Dilger
                                   ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Daniel Phillips @ 2003-03-11 13:41 UTC (permalink / raw)
  To: Helge Hafting, Hans Reiser; +Cc: Linux Kernel Mailing List

On Tue 11 Mar 03 14:00, Helge Hafting wrote:
> Hans Reiser wrote:
> > Let's make noatime the default for VFS.
> >
> > Daniel Phillips wrote:
>
> [...]
>
> >> If I were able to design Unix over again, I'd state that if you don't
> >> lock a directory before traversing it then it's your own fault if
> >> somebody changes it under you, and I would have provided an interface
> >> to inform you about your bad luck.  Strictly wishful thinking.
> >> (There, it feels better now.)
>
> I'm happy nobody _can_ lock a directory like that.  Think of it - unable
> to create or delete files while some slow-moving program is traversing
> the directory?  Ouch.  Plenty of options for DOS attacks too.
> And how to do "rm *.bak" if rm locks the dir for traversal?

<wishful thinking>
Now that you mention it, just locking out create and rename during directory 
traversal would eliminate the pain.  Delete is easy enough to handle during 
traversal.  For a B-Tree, coalescing could simply be deferred until the 
traversal is finished, so reading the directory in physical storage order 
would be fine.  Way, way cleaner than what we have to do now.
</wishful thinking>

Regards,

Daniel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] Improved inode number allocation for HTree
  2003-03-11 13:41               ` Daniel Phillips
@ 2003-03-11 17:16                 ` Andreas Dilger
  2003-03-11 19:39                 ` Helge Hafting
  2003-03-11 21:25                 ` atomic kernel operations are very tricky to export to user space (was [RFC] Improved inode number allocation for HTree ) Hans Reiser
  2 siblings, 0 replies; 38+ messages in thread
From: Andreas Dilger @ 2003-03-11 17:16 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Helge Hafting, Hans Reiser, Linux Kernel Mailing List

On Mar 11, 2003  14:41 +0100, Daniel Phillips wrote:
> On Tue 11 Mar 03 14:00, Helge Hafting wrote:
> > I'm happy nobody _can_ lock a directory like that.  Think of it - unable
> > to create or delete files while some slow-moving program is traversing
> > the directory?  Ouch.  Plenty of options for DOS attacks too.
> > And how to do "rm *.bak" if rm locks the dir for traversal?
> 
> <wishful thinking>
> Now that you mention it, just locking out create and rename during directory 
> traversal would eliminate the pain.  Delete is easy enough to handle during 
> traversal.  For a B-Tree, coalescing could simply be deferred until the 
> traversal is finished, so reading the directory in physical storage order 
> would be fine.  Way, way cleaner than what we have to do now.
> </wishful thinking>

The other problem, of course, is that this would essentially mean that a
user-space application has a lock that prevents other processes (even the
kernel) from doing anything.  At best we have to worry about an application
going sour, and at worst it is a DOS hole as Helge says.

How about something like:

	death:
		for each directory
			while readdir(directory, large_buffer)
				if (directory)
					run death(directory)
			readdir(directory, small_buffer)

		sleep forever

The first getdents walks recursively down to the leaves, and then does a
single getdents with enough space to hold a single dirent, locking the
directory, but then never progressing.  It also does the single-dirent
getdents on the way up the tree.  Now, everything is locked out of the
entire directory tree.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] Improved inode number allocation for HTree
  2003-03-11 13:41               ` Daniel Phillips
  2003-03-11 17:16                 ` Andreas Dilger
@ 2003-03-11 19:39                 ` Helge Hafting
  2003-03-11 20:19                   ` Daniel Phillips
  2003-03-11 21:25                 ` atomic kernel operations are very tricky to export to user space (was [RFC] Improved inode number allocation for HTree ) Hans Reiser
  2 siblings, 1 reply; 38+ messages in thread
From: Helge Hafting @ 2003-03-11 19:39 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel

On Tue, Mar 11, 2003 at 02:41:06PM +0100, Daniel Phillips wrote:
> On Tue 11 Mar 03 14:00, Helge Hafting wrote:
> > I'm happy nobody _can_ lock a directory like that.  Think of it - unable
> > to create or delete files while some slow-moving program is traversing
> > the directory?  Ouch.  Plenty of options for DOS attacks too.
> > And how to do "rm *.bak" if rm locks the dir for traversal?
> 
> <wishful thinking>
> Now that you mention it, just locking out create and rename during directory 
> traversal would eliminate the pain.  Delete is easy enough to handle during 
> traversal.  For a B-Tree, coalescing could simply be deferred until the 
> traversal is finished, so reading the directory in physical storage order 
> would be fine.  Way, way cleaner than what we have to do now.
> </wishful thinking>

Ok, so "rm" works.  Then you have things like "mv *.c /usr/src" to worry
about.  Lock for traversal, get stuck unable to work on the files.

A "dream" solution for this might involve something like:
Directory is traversed in some well defined order (defined by the fs)
Removing a file (or moving it out of the directory) leaves
a "empty" directory entry that isn't reclaimed until no more
traversals is in progress in that directory.

New files can be created, and will be created further out than
any traversal in progress so nothing will be missed.

This approach doesn't lock anything out, but I guess someone evil still
could cause trouble by keeping a traversal going forever, creating one dummy
file and deleting one whenever it makes progress.  The directory would
get big, filled up with placeholders until some ulimit kicks in.

Helge Hafting

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC] Improved inode number allocation for HTree
  2003-03-11 19:39                 ` Helge Hafting
@ 2003-03-11 20:19                   ` Daniel Phillips
  0 siblings, 0 replies; 38+ messages in thread
From: Daniel Phillips @ 2003-03-11 20:19 UTC (permalink / raw)
  To: Helge Hafting; +Cc: linux-kernel

On Tue 11 Mar 03 20:39, Helge Hafting wrote:
> On Tue, Mar 11, 2003 at 02:41:06PM +0100, Daniel Phillips wrote:
> > <wishful thinking>
> > Now that you mention it, just locking out create and rename during
> > directory traversal would eliminate the pain.  Delete is easy enough to
> > handle during traversal.  For a B-Tree, coalescing could simply be
> > deferred until the traversal is finished, so reading the directory in
> > physical storage order would be fine.  Way, way cleaner than what we have
> > to do now.
> > </wishful thinking>
>
> Ok, so "rm" works.  Then you have things like "mv *.c /usr/src" to worry
> about.

That's equivalent to ls; the shell expands it before doing any moving.  You 
can construct something more interesting with ls | xargs <something nasty>
into the same directory.  Since the user is trying to shoot their foot off, 
let the lock be recursive, and let them do that.

> ...someone evil still
> could cause trouble by keeping a traversal going forever, creating one
> dummy file and deleting one whenever it makes progress.  The directory
> would get big, filled up with placeholders until some ulimit kicks in.
>
> Helge Hafting

It's not clear that's any more evil than things they can do already, eg,

   seq 1 1000000 | xargs -l1 cp -a /usr

Welllll, this is all idle speculation anyway.  Nobody's going to fix the 
flaw this week, if it's even possible (I suspect it is).

Regards,

Daniel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* atomic kernel operations are very tricky to export to user space (was  [RFC] Improved inode number allocation for HTree )
  2003-03-11 13:41               ` Daniel Phillips
  2003-03-11 17:16                 ` Andreas Dilger
  2003-03-11 19:39                 ` Helge Hafting
@ 2003-03-11 21:25                 ` Hans Reiser
  2003-03-11 23:49                   ` Jamie Lokier
  2 siblings, 1 reply; 38+ messages in thread
From: Hans Reiser @ 2003-03-11 21:25 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Helge Hafting, Linux Kernel Mailing List

Daniel Phillips wrote:

>On Tue 11 Mar 03 14:00, Helge Hafting wrote:
>  
>
>>Hans Reiser wrote:
>>    
>>
>>>Let's make noatime the default for VFS.
>>>
>>>Daniel Phillips wrote:
>>>      
>>>
>>[...]
>>
>>    
>>
>>>>If I were able to design Unix over again, I'd state that if you don't
>>>>lock a directory before traversing it then it's your own fault if
>>>>somebody changes it under you, and I would have provided an interface
>>>>to inform you about your bad luck.  Strictly wishful thinking.
>>>>(There, it feels better now.)
>>>>        
>>>>
>>I'm happy nobody _can_ lock a directory like that.  Think of it - unable
>>to create or delete files while some slow-moving program is traversing
>>the directory?  Ouch.  Plenty of options for DOS attacks too.
>>And how to do "rm *.bak" if rm locks the dir for traversal?
>>    
>>
>
><wishful thinking>
>Now that you mention it, just locking out create and rename during directory 
>traversal would eliminate the pain.  Delete is easy enough to handle during 
>traversal.  For a B-Tree, coalescing could simply be deferred until the 
>traversal is finished, so reading the directory in physical storage order 
>would be fine.  Way, way cleaner than what we have to do now.
></wishful thinking>
>
>Regards,
>
>Daniel
>-
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at  http://www.tux.org/lkml/
>
>
>  
>
You would want a directory listing command that is a single system 
call, and then that could be made isolated, without risk of userspace 
failing to unlock.  Then you have to worry about very large directories 
straining things in user space, but you can probably have the user 
specify the maximum size directory his process has space for, etc.

In general, DOS issues are what makes it difficult to introduce 
transactions into the kernel.  We are grappling with that now in 
reiser4.  It seems that the formula is to create atomic plugins, and 
then it is the system administrator's responsibility to verify that he 
can live with whatever DOS vulnerabilities it has before he installs it 
in his kernel (or our responsibility if it is sent to us for inclusion 
in the main kernel).  Allowing arbitrary filesystem operations to be 
combined into one atomic transaction seems problematic for either user 
space or the kernel, depending on what you do.

In general, allowing user space to lock things means that you trust user 
space  to unlock.  This creates all sorts of trust troubles, and if you 
force the unlock after some timeout, then the user space application 
becomes vulnerable to DOS from other processes causing it to exceed the 
timeout.

Ideas on this are welcome.

-- 
Hans



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Bug 417] New: htree much slower than regular ext3
  2003-03-07 15:46 ` Alex Tomas
  2003-03-08 17:38   ` Daniel Phillips
@ 2003-03-11 21:57   ` Bill Davidsen
  1 sibling, 0 replies; 38+ messages in thread
From: Bill Davidsen @ 2003-03-11 21:57 UTC (permalink / raw)
  To: Alex Tomas; +Cc: linux-kernel, ext2-devel

On 7 Mar 2003, Alex Tomas wrote:

> The problem is that getdents(2) returns inodes out of order and
> du causes many head seeks. I tried to solve the problem with the patch
> I included. The idea of the patch is pretty simple: just try
> to sort dentries by inode number in readdir(). It works because
> inodes sit at fixed places in ext2/ext3. Please look at the results
> I just got:

Any change in the order of dentries returned is likely to run into problems
when seeking in a directory. Given that readdir() is usually followed by
either zero or many stat()s, perhaps when the first stat() comes along you
could pre-read the inodes in optimal order and cache them. However you
tuned the size of your sort, the same size should work exactly as well for the
pre-read.

This seems consistent with readahead and AS disk io.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: atomic kernel operations are very tricky to export to user space (was  [RFC] Improved inode number allocation for HTree )
  2003-03-11 21:25                 ` atomic kernel operations are very tricky to export to user space (was [RFC] Improved inode number allocation for HTree ) Hans Reiser
@ 2003-03-11 23:49                   ` Jamie Lokier
  0 siblings, 0 replies; 38+ messages in thread
From: Jamie Lokier @ 2003-03-11 23:49 UTC (permalink / raw)
  To: Hans Reiser; +Cc: Daniel Phillips, Helge Hafting, Linux Kernel Mailing List

Hans Reiser wrote:
> Allowing arbitrary filesystem operations to be 
> combined into one atomic transaction seems problematic for either user 
> space or the kernel, depending on what you do.
> 
> In general, allowing user space to lock things means that you trust user 
> space  to unlock.  This creates all sorts of trust troubles, and if you 
> force the unlock after some timeout, then the user space application 
> becomes vulnerable to DOS from other processes causing it to exceed the 
> timeout.
>
> Ideas on this are welcome.

You can allow user space to begin a transaction, do some operations
and end a transaction, possibly returning an "abort" result which
means userspace should assume the transaction did not commit any
results and/or whatever was read in the transaction was not reliable.

On the face of it this leaves userspace susceptible to DOS or indeed
fairness/livelock problems.  For example if another program is always
changing a directory entry, how can you read that whole directory
in a transaction?

Fairness/livelock problems are hard to avoid with any kinds of lock.
Even the kernel's internal locks have these problems in corner cases
(for example, remember when gettimeofday()'s clock access had to be
converted from using a spinlock to a sequence lock - and that still
doesn't _guarantee_ there is no problem in principle, it just reduces
the probability in all reasonable scenarios).

However, some remedies can be applied to filesystem transactions.  If
an operation would cause some other task's transaction to eventually
return an abort code, consider sleeping for a short duration.
Randomise that duration.  If the other transaction(s) have been
aborting repeatedly, consider lengthening the sleep duration and/or
specifically waiting for the other transaction to complete, to boost
the other task(s) likilihood of transaction success.  Randomise this
decision too.  If you know something about the type of other
transactions (such as it is trying to implement a read-write lock by
doing atomic operations on bytes in a file), consider exactly what
policy you hope to offer (writer preference?  reader preference?
something in between?)

By which point it has remarkable similarities to the problems of
fairness in the task scheduler, and fairness/livelock in locks.
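
A userspace sketch of that backoff policy, with an entirely hypothetical
transaction interface standing in for whatever the filesystem would
actually expose (begin_txn/end_txn and all the tuning constants are made
up):

#include <stdlib.h>
#include <unistd.h>

#define TXN_ABORTED 1

/* Stand-ins for the hypothetical interface; here they merely simulate
 * occasional aborts so the retry loop can be exercised. */
static int begin_txn(void)  { return 0; }
static int end_txn(int txn) { (void)txn; return (rand() % 4) ? 0 : TXN_ABORTED; }

/* Retry an aborted transaction with a randomised, growing delay, so a
 * task that keeps aborting eventually lets the other side finish. */
static int transact_with_backoff(void)
{
	unsigned int delay_us = 100;
	int attempt, txn;

	for (attempt = 0; attempt < 10; attempt++) {
		txn = begin_txn();
		/* ... the filesystem operations go here ... */
		if (end_txn(txn) != TXN_ABORTED)
			return 0;
		usleep(delay_us + (unsigned int)(rand() % (int)delay_us));
		if (delay_us < 100000)
			delay_us *= 2;
	}
	return -1;	/* give up after repeated aborts */
}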

-- Jamie

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Ext2-devel] Re: [Bug 417] New: htree much slower than regular ext3
  2003-02-27 21:00   ` Andreas Dilger
  2003-02-28  4:12     ` Daniel Phillips
@ 2003-03-13 21:04     ` Stephen C. Tweedie
  1 sibling, 0 replies; 38+ messages in thread
From: Stephen C. Tweedie @ 2003-03-13 21:04 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Daniel Phillips, Martin J. Bligh, linux-kernel, ext2-devel,
	Theodore Ts'o, chrisl, Stephen Tweedie

Hi,

On Thu, 2003-02-27 at 21:00, Andreas Dilger wrote:

> I've got a patch which should help here, although it was originally written
> to speed up the "create" case instead of the "lookup" case.  In the lookup
> case, it will do a pre-read of a number of inode table blocks, since the cost
> of doing a 64kB read and doing a 4kB read is basically the same - the cost
> of the seek.

No it's not --- you're evicting 16 times as much other
potentially-useful data from the cache for each lookup.  You'll improve
the "du" or "ls -l" case by prefetching, but you may well slow down the
overall system performance when you're just doing random accesses (eg.
managing large spools.)

It would be interesting to think about how we can spot the cases where
the prefetch is likely to be beneficial, for example by observing
"stat"s coming in in strict hash order.

--Stephen


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Ext2-devel] Filesystem write priorities, (Was: Re: [RFC] Improved inode number allocation for HTree)
  2003-03-10 21:50           ` Filesystem write priorities, (Was: Re: [RFC] Improved inode number allocation for HTree) John Bradford
@ 2003-03-14 21:55             ` Stephen C. Tweedie
  0 siblings, 0 replies; 38+ messages in thread
From: Stephen C. Tweedie @ 2003-03-14 21:55 UTC (permalink / raw)
  To: John Bradford
  Cc: Andreas Schwab, phillips, ext2-devel, linux-kernel,
	Theodore Ts'o, Andreas Dilger, chrisl, bzzz, Stephen Tweedie

Hi,

On Mon, 2003-03-10 at 21:50, John Bradford wrote:

> Well, yes, I use noatime by default, but I was thinking that there
> might be users who wanted to gain most of the performance that using
> noatime would, who still wanted access times available generally, but
> who were not bothered about loosing them on an unclean shutdown.

I've got new inode-flushing code which somewhat does that already.  The
code in 

http://people.redhat.com/sct/patches/ext3-2.4/dev-20030314/40-iflush-sct/

allows us to defer inode writes, primarily to eliminate redundant
copy-to-buffer-cache operations in mark_inode_dirty; but it has the
side-effect of allowing us to defer atime updates quite safely.

I'm currently integrating all the latest bits and pieces of orlov and
htree work into that set of patches to provide a stable base, and the
next order of business is to get ACL and O_DIRECT diffs in for 2.4
too.   But beyond that, I need to get down to some serious performance
work with ext3, and the inode flush code is a major part of that.

--Stephen


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Ext2-devel] Re: [RFC] Improved inode number allocation for HTree
  2003-03-10 22:02             ` Matthew Wilcox
  2003-03-11  8:47               ` Jakob Oestergaard
@ 2003-03-14 21:57               ` Stephen C. Tweedie
  2003-03-15  8:39                 ` jw schultz
  1 sibling, 1 reply; 38+ messages in thread
From: Stephen C. Tweedie @ 2003-03-14 21:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Bryan O'Sullivan, Daniel Phillips, John Bradford, ext2-devel,
	linux-kernel, Theodore Ts'o, Andreas Dilger, chrisl, bzzz,
	Stephen Tweedie

Hi,

On Mon, 2003-03-10 at 22:02, Matthew Wilcox wrote:
> On Mon, Mar 10, 2003 at 01:47:14PM -0800, Bryan O'Sullivan wrote:
> > Why start?  Who actually uses atime for anything at all, other than the
> > tiny number of shops that care about moving untouched files to tertiary
> > storage?
> > 
> > Surely if you want to heap someone else's plate with work, you should
> > offer a reason why :-)
> 
> "You have new mail" vs "You have mail".

"nodiratime" can still help there.

--Stephen


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Ext2-devel] Re: [RFC] Improved inode number allocation for HTree
  2003-03-14 21:57               ` Stephen C. Tweedie
@ 2003-03-15  8:39                 ` jw schultz
  0 siblings, 0 replies; 38+ messages in thread
From: jw schultz @ 2003-03-15  8:39 UTC (permalink / raw)
  To: linux-kernel

On Fri, Mar 14, 2003 at 09:57:12PM +0000, Stephen C. Tweedie wrote:
> Hi,
> 
> On Mon, 2003-03-10 at 22:02, Matthew Wilcox wrote:
> > On Mon, Mar 10, 2003 at 01:47:14PM -0800, Bryan O'Sullivan wrote:
> > > Why start?  Who actually uses atime for anything at all, other than the
> > > tiny number of shops that care about moving untouched files to tertiary
> > > storage?
> > > 
> > > Surely if you want to heap someone else's plate with work, you should
> > > offer a reason why :-)
> > 
> > "You have new mail" vs "You have mail".
> 
> "nodiratime" can still help there.

As may noatime.  noatime only prevents the automatic update
of atime on read.  It doesn't (at least in my experience)
prevent utime(2) from updating the atime field.

Using noatime works quite well with at least mutt, which
explicitly uses utime(2) to update atime.  I cannot be
certain, but mutt may actually work better with noatime
because then accesses by other tools (*biff & co) won't break the
mailer's notion of new mail (mtime > atime).
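
An explicit atime bump via utime(2) looks roughly like this (a sketch only;
mutt's real logic is more involved, and the function name is made up):

#include <sys/stat.h>
#include <time.h>
#include <utime.h>

/* Set only the atime of a mailbox, leaving mtime alone; this works even
 * on a filesystem mounted with -o noatime. */
static int bump_atime(const char *path)
{
	struct stat st;
	struct utimbuf ut;

	if (stat(path, &st) != 0)
		return -1;
	ut.modtime = st.st_mtime;	/* keep mtime as it was */
	ut.actime  = time(NULL);	/* atime := now */
	return utime(path, &ut);
}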

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw@pegasys.ws

		Remember Cernan and Schmitt

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2003-03-15  8:28 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-02-27 17:31 [Bug 417] New: htree much slower than regular ext3 Martin J. Bligh
2003-02-28  2:55 ` Daniel Phillips
2003-02-27 21:00   ` Andreas Dilger
2003-02-28  4:12     ` Daniel Phillips
2003-02-27 21:33       ` Martin J. Bligh
2003-03-13 21:04     ` [Ext2-devel] " Stephen C. Tweedie
2003-03-07 15:46 ` Alex Tomas
2003-03-08 17:38   ` Daniel Phillips
2003-03-07 23:27     ` Theodore Ts'o
2003-03-09 19:26       ` Alex Tomas
2003-03-09  7:08     ` Alex Tomas
2003-03-10 17:58       ` Daniel Phillips
2003-03-10 21:25       ` Theodore Ts'o
2003-03-11 21:57   ` Bill Davidsen
     [not found] ` <20030307214833.00a37e35.akpm@digeo.com>
     [not found]   ` <20030308010424.Z1373@schatzie.adilger.int>
2003-03-09 22:54     ` [Ext2-devel] " Daniel Phillips
2003-03-08 23:19       ` Andrew Morton
2003-03-09 23:10   ` Daniel Phillips
     [not found] ` <20030309184755.ACC80FCA8C@mx12.arcor-online.net>
     [not found]   ` <m3u1ecl5h8.fsf@lexa.home.net>
2003-03-10 20:45     ` [RFC] Improved inode number allocation for HTree Daniel Phillips
     [not found]       ` <3E6D1D25.5000004@namesys.com>
     [not found]         ` <20030311031216.8A31CEFD5F@mx12.arcor-online.net>
2003-03-11 10:45           ` Hans Reiser
2003-03-11 13:00             ` Helge Hafting
2003-03-11 13:41               ` Daniel Phillips
2003-03-11 17:16                 ` Andreas Dilger
2003-03-11 19:39                 ` Helge Hafting
2003-03-11 20:19                   ` Daniel Phillips
2003-03-11 21:25                 ` atomic kernel operations are very tricky to export to user space (was [RFC] Improved inode number allocation for HTree ) Hans Reiser
2003-03-11 23:49                   ` Jamie Lokier
2003-03-10 20:48     ` [RFC] Improved inode number allocation for HTree Daniel Phillips
2003-03-10 21:04       ` John Bradford
2003-03-10 21:28         ` Andreas Schwab
2003-03-10 21:50           ` Filesystem write priorities, (Was: Re: [RFC] Improved inode number allocation for HTree) John Bradford
2003-03-14 21:55             ` [Ext2-devel] " Stephen C. Tweedie
2003-03-10 21:33         ` [RFC] Improved inode number allocation for HTree Daniel Phillips
2003-03-10 21:47           ` [Ext2-devel] " Bryan O'Sullivan
2003-03-10 22:02             ` Matthew Wilcox
2003-03-11  8:47               ` Jakob Oestergaard
2003-03-11 11:27                 ` John Bradford
2003-03-14 21:57               ` Stephen C. Tweedie
2003-03-15  8:39                 ` jw schultz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).