* [Bug 417] New: htree much slower than regular ext3
@ 2003-02-27 17:31 Martin J. Bligh
2003-02-28 2:55 ` Daniel Phillips
` (3 more replies)
0 siblings, 4 replies; 38+ messages in thread
From: Martin J. Bligh @ 2003-02-27 17:31 UTC (permalink / raw)
To: linux-kernel
http://bugme.osdl.org/show_bug.cgi?id=417
Summary: htree much slower than regular ext3
Kernel Version: 2.5.63-bk3
Status: NEW
Severity: normal
Owner: akpm@digeo.com
Submitter: bwindle-kbt@fint.org
Distribution: Debian Testing
Hardware Environment: x86, 256mb RAM, PIIX4, Maxtor 91728D8 ATA DISK drive
Software Environment:
Linux razor 2.5.63bk3 #25 Thu Feb 27 10:13:35 EST 2003 i686 Pentium II
(Klamath) GenuineIntel GNU/Linux
Gnu C 2.95.4
Gnu make 3.79.1
util-linux 2.11n
mount 2.11n
module-init-tools implemented
e2fsprogs 1.30-WIP
reiserfsprogs 3.6.3
Linux C Library 2.2.5
Dynamic linker (ldd) 2.2.5
Procps 2.0.7
Net-tools 1.60
Console-tools 0.2.3
Sh-utils 4.5.2
Problem Description:
I created a directory ("test") with 32000 (ok, 31998) directories in it,
and put a file called 'foo' in each of them (for i in `ls`; do cd $i &&
touch bar && cd .. ; done). Then I took that ext3 partition, unmounted it,
did a 'tune2fs -O dir_index', then 'fsck -fD', and remounted. I then did a
'time du -hs' on the test directory, and here are the results.
ext3+htree:
bwindle@razor:/giant/inodes$ time du -hs
126M .
real 7m21.756s
user 0m2.021s
sys 0m22.190s
I then unmounted, tune2fs -O ^dir_index, e2fsck -fD /dev/hdb1, remounted,
and did another du -hs on the test directory. It took 1 minute, 48 seconds.
bwindle@razor:/giant/test$ time du -hs
126M .
real 1m48.760s
user 0m1.986s
sys 0m21.563s
I thought htree was supposed to speed up access with large numbers of
directories?
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bug 417] New: htree much slower than regular ext3
2003-02-28 2:55 ` Daniel Phillips
@ 2003-02-27 21:00 ` Andreas Dilger
2003-02-28 4:12 ` Daniel Phillips
2003-03-13 21:04 ` [Ext2-devel] " Stephen C. Tweedie
0 siblings, 2 replies; 38+ messages in thread
From: Andreas Dilger @ 2003-02-27 21:00 UTC (permalink / raw)
To: Daniel Phillips
Cc: Martin J. Bligh, linux-kernel, ext2-devel, Theodore Ts'o, chrisl
On Feb 28, 2003 03:55 +0100, Daniel Phillips wrote:
> On Thursday 27 February 2003 18:31, Martin J. Bligh wrote:
> > Problem Description:
> > I created a directory ("test") with 32000 (ok, 31998) directories in it,
> > and put a file called 'foo' in each of them (for i in `ls`; do cd $i &&
> > touch bar && cd .. ; done). Then I took that ext3 partition, unmounted it,
> > did a 'tune2fs -O dir_index', then 'fsck -fD', and remounted. I then did a
> > 'time du -hs' on the test directory, and here are the results.
> >
> > ext3+htree:
> > bwindle@razor:/giant/inodes$ time du -hs
> > 126M .
> >
> > real 7m21.756s
> > user 0m2.021s
> > sys 0m22.190s
> >
> > I then unmounted, tune2fs -O ^dir_index, e2fsck -fD /dev/hdb1, remounted,
> > and did another du -hs on the test directory. It took 1 minute, 48
> > seconds.
> >
> > bwindle@razor:/giant/test$ time du -hs
> > 126M .
> >
> > real 1m48.760s
> > user 0m1.986s
> > sys 0m21.563s
> >
> >
> > I thought htree was supposed to speed up access with large numbers of
> > directories?
>
> The du just does getdents and lstats in physical storage order, so there's no
> possible benefit from indexing in this case, and unindexed ext3 avoids long
> search times by caching the position at which it last found an entry. That
> answers the question "why doesn't it speed up"; "why did it slow way
> down" is harder.
>
> The single-file leaves of your directory tree don't carry any index (it's not
> worth it with only one file) and therefore use the same code path as unindexed
> ext3, so there's no culprit there. I'm looking suspiciously at
> ext3_dx_readdir, which is apparently absorbing about 11 ms per returned
> entry. To put this in perspective, I'm used to seeing individual directory
> operation times well below 100 us, so even if each dirent cost as much as a
> full lookup, you'd see less than 3 seconds overhead for your 30,000
> directories.
>
> 11 ms sounds like two seeks for each returned dirent, which sounds like a bug.
I think you are pretty dead on there. The difference is that with unindexed
entries, the directory entry and the inode are in the same order, while with
indexed directories they are essentially in random order relative to each
other. That means that each directory lookup causes a seek to get the
directory inode data, instead of reading in allocation order (which means
sequential reads on disk).
In the past both would have been equally slow, but the Orlov allocator in
2.5 causes a number of directories to be created in the same group before
moving on to the next group, so you have nice batches of sequential reads.
I've got a patch which should help here, although it was originally written
to speed up the "create" case instead of the "lookup" case. In the lookup
case, it will do a pre-read of a number of inode table blocks, since the cost
of doing a 64kB read and doing a 4kB read is basically the same - the cost
of the seek.
The patch was written for a 2.4.18 RH kernel, but I think the areas touched
(ext3_get_inode_loc and ext3_new_inode) are relatively unchanged, so it
should be fairly straightforward to fix the rejects by hand for 2.5.
For the create case, the patch speeds up lots-of-file creation rates a lot
(50%), regardless of whether we are using htree or not. That is because we
do not read in empty inode table blocks from disk (a read which is slow
when it queues behind a bunch of writes), and instead we have a pure write
workload (with the exception of reading in the inode bitmaps occasionally).
Cheers, Andreas
======================= ext3-noread.diff ====================================
diff -ru lustre-head/fs/ext3/ialloc.c lustre/fs/ext3/ialloc.c
--- lustre-head/fs/ext3/ialloc.c Mon Dec 23 10:02:58 2002
+++ lustre/fs/ext3/ialloc.c Mon Dec 23 09:46:20 2002
@@ -289,6 +289,37 @@
}
/*
+ * @block_group: block group of inode
+ * @offset: relative offset of inode within @block_group
+ *
+ * Check whether any of the inodes in this disk block are in use.
+ *
+ * Caller must be holding superblock lock (group/bitmap read lock in future).
+ */
+int ext3_itable_block_used(struct super_block *sb, unsigned int block_group,
+ int offset)
+{
+ int bitmap_nr = load_inode_bitmap(sb, block_group);
+ int inodes_per_block;
+ unsigned long inum, iend;
+ struct buffer_head *ibitmap;
+
+ if (bitmap_nr < 0)
+ return 1;
+
+ inodes_per_block = sb->s_blocksize / EXT3_SB(sb)->s_inode_size;
+ inum = offset & ~(inodes_per_block - 1);
+ iend = inum + inodes_per_block;
+ ibitmap = EXT3_SB(sb)->s_inode_bitmap[bitmap_nr];
+ for (; inum < iend; inum++) {
+ if (inum != offset && ext3_test_bit(inum, ibitmap->b_data))
+ return 1;
+ }
+
+ return 0;
+}
+
+/*
* There are two policies for allocating an inode. If the new inode is
* a directory, then a forward search is made for a block group with both
* free space and a low directory-to-inode ratio; if that fails, then of
@@ -312,6 +343,7 @@
struct ext3_group_desc * gdp;
struct ext3_group_desc * tmp;
struct ext3_super_block * es;
+ struct ext3_iloc iloc;
int err = 0;
/* Cannot create files in a deleted directory */
@@ -514,9 +547,18 @@
inode->i_generation = sbi->s_next_generation++;
ei->i_state = EXT3_STATE_NEW;
- err = ext3_mark_inode_dirty(handle, inode);
+ err = ext3_get_inode_loc_new(inode, &iloc, 1);
if (err) goto fail;
-
+ BUFFER_TRACE(iloc.bh, "get_write_access");
+ err = ext3_journal_get_write_access(handle, iloc.bh);
+ if (err) {
+ brelse(iloc.bh);
+ iloc.bh = NULL;
+ goto fail;
+ }
+ err = ext3_mark_iloc_dirty(handle, inode, &iloc);
+ if (err) goto fail;
+
unlock_super (sb);
if(DQUOT_ALLOC_INODE(inode)) {
DQUOT_DROP(inode);
diff -ru lustre-head/fs/ext3/inode.c lustre/fs/ext3/inode.c
--- lustre-head/fs/ext3/inode.c Mon Dec 23 10:02:58 2002
+++ lustre/fs/ext3/inode.c Mon Dec 23 09:50:25 2002
@@ -2011,16 +1994,25 @@
ext3_journal_stop(handle, inode);
}
-/*
- * ext3_get_inode_loc returns with an extra refcount against the
- * inode's underlying buffer_head on success.
- */
+extern int ext3_itable_block_used(struct super_block *sb,
+ unsigned int block_group,
+ int offset);
+
+#define NUM_INODE_PREREAD 16
-int ext3_get_inode_loc (struct inode *inode, struct ext3_iloc *iloc)
+/*
+ * ext3_get_inode_loc returns with an extra refcount against the inode's
+ * underlying buffer_head on success. If this is for a new inode allocation
+ * (new is non-zero) then we may be able to optimize away the read if there
+ * are no other in-use inodes in this inode table block. If we need to do
+ * a read, then read in a whole chunk of blocks to avoid blocking again soon
+ * if we are doing lots of creates/updates.
+ */
+int ext3_get_inode_loc_new(struct inode *inode, struct ext3_iloc *iloc, int new)
{
struct super_block *sb = inode->i_sb;
struct ext3_sb_info *sbi = EXT3_SB(sb);
- struct buffer_head *bh = 0;
+ struct buffer_head *bh[NUM_INODE_PREREAD];
unsigned long block;
unsigned long block_group;
unsigned long group_desc;
@@ -2042,38 +2034,86 @@
}
group_desc = block_group >> sbi->s_desc_per_block_bits;
desc = block_group & (sbi->s_desc_per_block - 1);
- bh = sbi->s_group_desc[group_desc];
- if (!bh) {
+ if (!sbi->s_group_desc[group_desc]) {
ext3_error(sb, __FUNCTION__, "Descriptor not loaded");
goto bad_inode;
}
- gdp = (struct ext3_group_desc *) bh->b_data;
+ gdp = (struct ext3_group_desc *)(sbi->s_group_desc[group_desc]->b_data);
+
/*
* Figure out the offset within the block group inode table
*/
- offset = ((inode->i_ino - 1) % sbi->s_inodes_per_group) *
- sbi->s_inode_size;
+ offset = ((inode->i_ino - 1) % sbi->s_inodes_per_group);
+
block = le32_to_cpu(gdp[desc].bg_inode_table) +
- (offset >> EXT3_BLOCK_SIZE_BITS(sb));
- if (!(bh = sb_bread(sb, block))) {
- ext3_error (sb, __FUNCTION__,
- "unable to read inode block - "
- "inode=%lu, block=%lu", inode->i_ino, block);
- goto bad_inode;
+ (offset * sbi->s_inode_size >> EXT3_BLOCK_SIZE_BITS(sb));
+
+ bh[0] = sb_getblk(sb, block);
+ if (buffer_uptodate(bh[0]))
+ goto done;
+
+ /* If we don't really need to read this block, and it isn't already
+ * in memory, then we just zero it out. Otherwise, we keep the
+ * current block contents (deleted inode data) for posterity.
+ */
+ if (new && !ext3_itable_block_used(sb, block_group, offset)) {
+ lock_buffer(bh[0]);
+ memset(bh[0]->b_data, 0, bh[0]->b_size);
+ mark_buffer_uptodate(bh[0], 1);
+ unlock_buffer(bh[0]);
+ } else {
+ unsigned long block_end, itable_end;
+ int count = 1;
+
+ itable_end = le32_to_cpu(gdp[desc].bg_inode_table) +
+ sbi->s_itb_per_group;
+ block_end = block + NUM_INODE_PREREAD;
+ if (block_end > itable_end)
+ block_end = itable_end;
+
+ for (; block < block_end; block++) {
+ bh[count] = sb_getblk(sb, block);
+ if (count && (buffer_uptodate(bh[count]) ||
+ buffer_locked(bh[count]))) {
+ __brelse(bh[count]);
+ } else
+ count++;
+ }
+
+ ll_rw_block(READ, count, bh);
+
+ /* Release all but the block we actually need (bh[0]) */
+ while (--count > 0)
+ __brelse(bh[count]);
+
+ wait_on_buffer(bh[0]);
+ if (!buffer_uptodate(bh[0])) {
+ ext3_error(sb, __FUNCTION__,
+ "unable to read inode block - "
+ "inode=%lu, block=%lu", inode->i_ino,
+ bh[0]->b_blocknr);
+ goto bad_inode;
+ }
}
- offset &= (EXT3_BLOCK_SIZE(sb) - 1);
+ done:
+ offset = (offset * sbi->s_inode_size) & (EXT3_BLOCK_SIZE(sb) - 1);
- iloc->bh = bh;
- iloc->raw_inode = (struct ext3_inode *) (bh->b_data + offset);
+ iloc->bh = bh[0];
+ iloc->raw_inode = (struct ext3_inode *)(bh[0]->b_data + offset);
iloc->block_group = block_group;
-
+
return 0;
-
+
bad_inode:
return -EIO;
}
+int ext3_get_inode_loc(struct inode *inode, struct ext3_iloc *iloc)
+{
+ return ext3_get_inode_loc_new(inode, iloc, 0);
+}
+
void ext3_read_inode(struct inode * inode)
{
struct ext3_iloc iloc;
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/
* Re: [Bug 417] New: htree much slower than regular ext3
2003-02-28 4:12 ` Daniel Phillips
@ 2003-02-27 21:33 ` Martin J. Bligh
0 siblings, 0 replies; 38+ messages in thread
From: Martin J. Bligh @ 2003-02-27 21:33 UTC (permalink / raw)
To: Daniel Phillips, Andreas Dilger, bwindle-kbt
Cc: linux-kernel, ext2-devel, Theodore Ts'o, chrisl
>> > 11 ms sounds like two seeks for each returned dirent, which sounds
>> > like a bug.
>>
>> I think you are pretty dead on there. The difference is that with
>> unindexed entries, the directory entry and the inode are in the same
>> order, while with indexed directories they are essentially in random
>> order to each other. That means that each directory lookup causes a
>> seek to get the directory inode data instead of doing allocation order
>> (which is sequential reads on disk).
>>
>> In the past both would have been equally slow, but the Orlov allocator in
>> 2.5 causes a number of directories to be created in the same group before
>> moving on to the next group, so you have nice batches of sequential
>> reads.
>
> I think you're close to the truth there, but out-of-order inode table
> access would only introduce one seek per inode table block, and the
> cache should take care of the rest. Martin's numbers suggest the cache
> isn't caching at all.
>
> Martin, does iostat show enormously more reads for the Htree case?
Wasn't me ... I just forwarded the bug data from bugzilla ;-)
The filer in this case was ... bwindle-kbt@fint.org,
now cc'ed.
M.
* Re: [Bug 417] New: htree much slower than regular ext3
2003-02-27 17:31 [Bug 417] New: htree much slower than regular ext3 Martin J. Bligh
@ 2003-02-28 2:55 ` Daniel Phillips
2003-02-27 21:00 ` Andreas Dilger
2003-03-07 15:46 ` Alex Tomas
` (2 subsequent siblings)
3 siblings, 1 reply; 38+ messages in thread
From: Daniel Phillips @ 2003-02-28 2:55 UTC (permalink / raw)
To: Martin J. Bligh, linux-kernel; +Cc: ext2-devel, Theodore Ts'o, chrisl
On Thursday 27 February 2003 18:31, Martin J. Bligh wrote:
> Problem Description:
> I created a directory ("test") with 32000 (ok, 31998) directories in it,
> and put a file called 'foo' in each of them (for i in `ls`; do cd $i &&
> touch bar && cd .. ; done). Then I took that ext3 partition, unmounted it,
> did a 'tune2fs -O dir_index', then 'fsck -fD', and remounted. I then did a
> 'time du -hs' on the test directory, and here are the results.
>
> ext3+htree:
> bwindle@razor:/giant/inodes$ time du -hs
> 126M .
>
> real 7m21.756s
> user 0m2.021s
> sys 0m22.190s
>
> I then unmounted, tune2fs -O ^dir_index, e2fsck -fD /dev/hdb1, remounted,
> and did another du -hs on the test directory. It took 1 minute, 48
> seconds.
>
> bwindle@razor:/giant/test$ time du -hs
> 126M .
>
> real 1m48.760s
> user 0m1.986s
> sys 0m21.563s
>
>
> I thought htree was supposed to speed up access with large numbers of
> directories?
The du just does getdents and lstats in physical storage order, so there's no
possible benefit from indexing in this case, and unindexed ext3 avoids long
search times by caching the position at which it last found an entry. That
answers the question "why doesn't it speed up"; "why did it slow way
down" is harder.
The single-file leaves of your directory tree don't carry any index (it's not
worth it with only one file) and therefore use the same code path as unindexed
ext3, so there's no culprit there. I'm looking suspiciously at
ext3_dx_readdir, which is apparently absorbing about 11 ms per returned
entry. To put this in perspective, I'm used to seeing individual directory
operation times well below 100 us, so even if each dirent cost as much as a
full lookup, you'd see less than 3 seconds overhead for your 30,000
directories.
11 ms sounds like two seeks for each returned dirent, which sounds like a bug.
Regards,
Daniel
* Re: [Bug 417] New: htree much slower than regular ext3
2003-02-27 21:00 ` Andreas Dilger
@ 2003-02-28 4:12 ` Daniel Phillips
2003-02-27 21:33 ` Martin J. Bligh
2003-03-13 21:04 ` [Ext2-devel] " Stephen C. Tweedie
1 sibling, 1 reply; 38+ messages in thread
From: Daniel Phillips @ 2003-02-28 4:12 UTC (permalink / raw)
To: Andreas Dilger
Cc: Martin J. Bligh, linux-kernel, ext2-devel, Theodore Ts'o, chrisl
On Thursday 27 February 2003 22:00, Andreas Dilger wrote:
> > 11 ms sounds like two seeks for each returned dirent, which sounds like a
> > bug.
>
> I think you are pretty dead on there. The difference is that with
> unindexed entries, the directory entry and the inode are in the same order,
> while with indexed directories they are essentially in random order to each
> other. That means that each directory lookup causes a seek to get the
> directory inode data instead of doing allocation order (which is sequential
> reads on disk).
>
> In the past both would have been equally slow, but the Orlov allocator in
> 2.5 causes a number of directories to be created in the same group before
> moving on to the next group, so you have nice batches of sequential reads.
I think you're close to the truth there, but out-of-order inode table access
would only introduce one seek per inode table block, and the cache should
take care of the rest. Martin's numbers suggest the cache isn't caching at
all.
Martin, does iostat show enormously more reads for the Htree case?
Regards,
Daniel
* Re: [Bug 417] New: htree much slower than regular ext3
2003-02-27 17:31 [Bug 417] New: htree much slower than regular ext3 Martin J. Bligh
2003-02-28 2:55 ` Daniel Phillips
@ 2003-03-07 15:46 ` Alex Tomas
2003-03-08 17:38 ` Daniel Phillips
2003-03-11 21:57 ` Bill Davidsen
[not found] ` <20030307214833.00a37e35.akpm@digeo.com>
[not found] ` <20030309184755.ACC80FCA8C@mx12.arcor-online.net>
3 siblings, 2 replies; 38+ messages in thread
From: Alex Tomas @ 2003-03-07 15:46 UTC (permalink / raw)
To: Martin J. Bligh
Cc: linux-kernel, ext2-devel, Theodore Ts'o, Daniel Phillips,
Andrew Morton, Alex Tomas
Hi!
The problem is that getdents(2) returns inodes out of order and
du causes many head seeks. I tried to solve the problem with the
patch included below. The idea of the patch is pretty simple: just try
to sort dentries by inode number in readdir(). It works because
inodes sit at fixed places in ext2/ext3. Please look at the results
I just got:
real user sys
ext3+htree: 7m22.236s 0m0.292s 0m2.204s
ext3-htree: 3m6.539s 0m0.481s 0m2.278s
ext3+sd-05: 3m45.453s 0m0.493s 0m2.265s
ext3+sd-10: 3m35.770s 0m0.514s 0m2.076s
ext3+sd-15: 3m24.235s 0m0.558s 0m2.075s
ext3+sd-20: 3m6.802s 0m0.486s 0m2.214s
ext3+sd-25: 2m55.062s 0m0.490s 0m2.105s
ext3+sd-30: 2m39.658s 0m0.492s 0m2.138s
'sd-N' stands for sortdir with N blocks prefetched before
sorting.
Of course, sortdir isn't production-ready and I haven't tried
to test it hard; it is just to probe the idea.
After applying the patch, use 'sortdir' or 'sortdir=N' as a mount
option.
with best regards, Alex
>>>>> Martin J Bligh (MJB) writes:
MJB> Problem Description: I created a directory ("test") with 32000
MJB> (ok, 31998) directories in it, and put a file called 'foo' in
MJB> each of them (for i in `ls`; do cd $i && touch bar && cd .. ;
MJB> done). Then I took that ext3 partition, unmounted it, did a
MJB> 'tune2fs -O dir_index', then 'fsck -fD', and remounted. I then
MJB> did a 'time du -hs' on the test directory, and here are the
MJB> results.
MJB> ext3+htree: bwindle@razor:/giant/inodes$ time du -hs 126M .
MJB> real 7m21.756s user 0m2.021s sys 0m22.190s
MJB> I then unmounted, tune2fs -O ^dir_index, e2fsck -fD /dev/hdb1,
MJB> remounted, and did another du -hs on the test directory. It took
MJB> 1 minute, 48 seconds.
MJB> bwindle@razor:/giant/test$ time du -hs 126M .
MJB> real 1m48.760s user 0m1.986s sys 0m21.563s
MJB> I thought htree was supposed to speed up access with large
MJB> numbers of directories?
diff -uNr linux/fs/ext3/dir.c edited/fs/ext3/dir.c
--- linux/fs/ext3/dir.c Thu Mar 6 14:57:50 2003
+++ edited/fs/ext3/dir.c Fri Mar 7 18:35:32 2003
@@ -33,8 +33,12 @@
static int ext3_readdir(struct file *, void *, filldir_t);
static int ext3_dx_readdir(struct file * filp,
void * dirent, filldir_t filldir);
+static int ext3_readdir_sort (struct file * filp,
+ void * dirent, filldir_t filldir);
static int ext3_release_dir (struct inode * inode,
struct file * filp);
+int ext3_htree_fill_pool(struct file *dir_file,
+ struct ext3_dir_entry_2 *dirent);
struct file_operations ext3_dir_operations = {
.read = generic_read_dir,
@@ -104,9 +108,17 @@
sb = inode->i_sb;
if (is_dx(inode)) {
+ if (test_opt(inode->i_sb, SORTDIR)) {
+ err = ext3_readdir_sort(filp, dirent, filldir);
+ unlock_kernel();
+ return err;
+ }
+
err = ext3_dx_readdir(filp, dirent, filldir);
- if (err != ERR_BAD_DX_DIR)
+ if (err != ERR_BAD_DX_DIR) {
+ unlock_kernel();
return err;
+ }
/*
* We don't set the inode dirty flag since it's not
* critical that it get flushed back to the disk.
@@ -249,6 +261,8 @@
__u32 inode;
__u8 name_len;
__u8 file_type;
+ loff_t offs;
+ struct list_head list;
char name[0];
};
@@ -311,6 +325,9 @@
p->curr_hash = pos2maj_hash(pos);
p->curr_minor_hash = pos2min_hash(pos);
p->next_hash = 0;
+ INIT_LIST_HEAD(&p->list);
+ p->list_nr = 0;
+
return p;
}
@@ -493,10 +510,193 @@
static int ext3_release_dir (struct inode * inode, struct file * filp)
{
- if (is_dx(inode) && filp->private_data)
+ if (is_dx(inode) && filp->private_data)
ext3_htree_free_dir_info(filp->private_data);
return 0;
}
+int ext3_htree_fill_pool(struct file *dir_file, struct ext3_dir_entry_2 *dirent)
+{
+ struct rb_node **p, *parent = NULL;
+ struct fname * fname, *new_fn, *prev_fname = NULL;
+ struct dir_private_info *info;
+ int len;
+
+ info = (struct dir_private_info *) dir_file->private_data;
+ p = &info->root.rb_node;
+
+ /* Create and allocate the fname structure */
+ len = sizeof(struct fname) + dirent->name_len + 1;
+ new_fn = kmalloc(len, GFP_KERNEL);
+ if (!new_fn)
+ return -ENOMEM;
+ memset(new_fn, 0, len);
+ new_fn->hash = new_fn->inode = le32_to_cpu(dirent->inode);
+ new_fn->name_len = dirent->name_len;
+ new_fn->file_type = dirent->file_type;
+ memcpy(new_fn->name, dirent->name, dirent->name_len);
+ new_fn->name[dirent->name_len] = 0;
+ new_fn->offs = dir_file->f_pos;
+
+ while (*p) {
+ parent = *p;
+ fname = rb_entry(parent, struct fname, rb_hash);
+
+ if (new_fn->hash < fname->hash)
+ p = &(*p)->rb_left;
+ else {
+ prev_fname = fname;
+ p = &(*p)->rb_right;
+ }
+ }
+
+ if (prev_fname)
+ list_add(&new_fn->list, &prev_fname->list);
+ else
+ list_add(&new_fn->list, &info->list);
+
+ rb_link_node(&new_fn->rb_hash, parent, p);
+ rb_insert_color(&new_fn->rb_hash, &info->root);
+ info->list_nr++;
+
+ return 0;
+}
+
+static int ext3_fill_from_sort_pool (struct file * filp, void * dirent, filldir_t filldir)
+{
+ struct super_block * sb = filp->f_dentry->d_inode->i_sb;
+ struct dir_private_info *info = filp->private_data;
+ struct list_head * cur;
+ struct fname *fname;
+ int error;
+
+ list_for_each(cur, &info->list) {
+ fname = list_entry(cur, struct fname, list);
+
+ error = filldir(dirent, fname->name,
+ fname->name_len, fname->offs,
+ fname->inode,
+ get_dtype(sb, fname->file_type));
+ if (error)
+ return 0;
+
+ list_del(&fname->list);
+ info->list_nr--;
+ }
+
+ if (info->list_nr == 0)
+ free_rb_tree_fname(&info->root);
+
+ return 0;
+}
+
+static int ext3_readdir_sort (struct file * filp,
+ void * dirent, filldir_t filldir)
+{
+ unsigned long offset, blk;
+ int i, stored, error = 0, blocks = 0, err;
+ struct buffer_head * bh;
+ struct ext3_dir_entry_2 * de;
+ struct inode *inode = filp->f_dentry->d_inode;
+ struct super_block * sb = inode->i_sb;
+ struct dir_private_info *info = filp->private_data;
+
+ if (!info) {
+ info = create_dir_info(filp->f_pos);
+ if (!info)
+ return -ENOMEM;
+ filp->private_data = info;
+ }
+
+ if (info->list_nr > 0) {
+ ext3_fill_from_sort_pool(filp, dirent, filldir);
+ return 0;
+ }
+
+ stored = 0;
+ bh = NULL;
+ offset = filp->f_pos & (sb->s_blocksize - 1);
+
+ while (!error && (blocks < EXT3_SB(sb)->sortdir_prefetch) &&
+ (filp->f_pos < inode->i_size)) {
+ blk = (filp->f_pos) >> EXT3_BLOCK_SIZE_BITS(sb);
+ bh = ext3_bread (0, inode, blk, 0, &err);
+ if (!bh) {
+ ext3_error (sb, "ext3_readdir",
+ "directory #%lu contains a hole at offset %lu",
+ inode->i_ino, (unsigned long)filp->f_pos);
+ filp->f_pos += sb->s_blocksize - offset;
+ continue;
+ }
+
+revalidate:
+ /* If the dir block has changed since the last call to
+ * readdir(2), then we might be pointing to an invalid
+ * dirent right now. Scan from the start of the block
+ * to make sure. */
+ if (filp->f_version != inode->i_version) {
+ for (i = 0; i < sb->s_blocksize && i < offset; ) {
+ de = (struct ext3_dir_entry_2 *)
+ (bh->b_data + i);
+ /* It's too expensive to do a full
+ * dirent test each time round this
+ * loop, but we do have to test at
+ * least that it is non-zero. A
+ * failure will be detected in the
+ * dirent test below. */
+ if (le16_to_cpu(de->rec_len) <
+ EXT3_DIR_REC_LEN(1))
+ break;
+ i += le16_to_cpu(de->rec_len);
+ }
+ offset = i;
+ filp->f_pos = (filp->f_pos & ~(sb->s_blocksize - 1))
+ | offset;
+ filp->f_version = inode->i_version;
+ }
+
+ while (!error && filp->f_pos < inode->i_size
+ && offset < sb->s_blocksize) {
+ de = (struct ext3_dir_entry_2 *) (bh->b_data + offset);
+ if (!ext3_check_dir_entry ("ext3_readdir", inode, de,
+ bh, offset)) {
+ /* On error, skip the f_pos to the
+ * next block. */
+ filp->f_pos = (filp->f_pos |
+ (sb->s_blocksize - 1)) + 1;
+ brelse (bh);
+ return stored;
+ }
+ offset += le16_to_cpu(de->rec_len);
+ if (le32_to_cpu(de->inode)) {
+ /* We might block in the next section
+ * if the data destination is
+ * currently swapped out. So, use a
+ * version stamp to detect whether or
+ * not the directory has been modified
+ * during the copy operation.
+ */
+ unsigned long version = filp->f_version;
+
+ error = ext3_htree_fill_pool(filp, de);
+ if (error)
+ break;
+ if (version != filp->f_version)
+ goto revalidate;
+ stored ++;
+ }
+ filp->f_pos += le16_to_cpu(de->rec_len);
+ }
+ offset = 0;
+ brelse (bh);
+ blocks++;
+ }
+
+ ext3_fill_from_sort_pool(filp, dirent, filldir);
+
+ UPDATE_ATIME(inode);
+ return 0;
+}
+
#endif
diff -uNr linux/fs/ext3/super.c edited/fs/ext3/super.c
--- linux/fs/ext3/super.c Mon Feb 24 17:47:45 2003
+++ edited/fs/ext3/super.c Fri Mar 7 17:47:27 2003
@@ -584,6 +584,19 @@
continue;
if ((value = strchr (this_char, '=')) != NULL)
*value++ = 0;
+ if (!strcmp (this_char, "sortdir")) {
+ if (value && *value) {
+ int blocks = simple_strtoul(value, &value, 0);
+ printk("EXT3: has value: %d\n", blocks);
+ if (blocks > 0)
+ sbi->sortdir_prefetch = blocks;
+ }
+ if (sbi->sortdir_prefetch == 0)
+ sbi->sortdir_prefetch = 10; /* default value */
+ set_opt(sbi->s_mount_opt, SORTDIR);
+ printk("EXT3-fs: sortdir (%d blocks prefetch)\n",
+ sbi->sortdir_prefetch);
+ } else
#ifdef CONFIG_EXT3_FS_XATTR
if (!strcmp (this_char, "user_xattr"))
set_opt (sbi->s_mount_opt, XATTR_USER);
diff -uNr linux/include/linux/ext3_fs.h edited/include/linux/ext3_fs.h
--- linux/include/linux/ext3_fs.h Mon Feb 24 17:47:33 2003
+++ edited/include/linux/ext3_fs.h Fri Mar 7 18:32:53 2003
@@ -330,6 +330,7 @@
#define EXT3_MOUNT_NO_UID32 0x2000 /* Disable 32-bit UIDs */
#define EXT3_MOUNT_XATTR_USER 0x4000 /* Extended user attributes */
#define EXT3_MOUNT_POSIX_ACL 0x8000 /* POSIX Access Control Lists */
+#define EXT3_MOUNT_SORTDIR 0x10000 /* sort indexed dirs in readdir() */
/* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */
#ifndef _LINUX_EXT2_FS_H
@@ -656,6 +657,8 @@
__u32 curr_hash;
__u32 curr_minor_hash;
__u32 next_hash;
+ struct list_head list;
+ int list_nr;
};
/*
diff -uNr linux/include/linux/ext3_fs_sb.h edited/include/linux/ext3_fs_sb.h
--- linux/include/linux/ext3_fs_sb.h Mon Nov 11 06:28:30 2002
+++ edited/include/linux/ext3_fs_sb.h Fri Mar 7 17:30:46 2003
@@ -63,6 +63,7 @@
struct timer_list turn_ro_timer; /* For turning read-only (crash simulation) */
wait_queue_head_t ro_wait_queue; /* For people waiting for the fs to go read-only */
#endif
+ int sortdir_prefetch; /* how many blocks to read to fill pool --AT */
};
#endif /* _LINUX_EXT3_FS_SB */
* Re: [Bug 417] New: htree much slower than regular ext3
2003-03-08 17:38 ` Daniel Phillips
@ 2003-03-07 23:27 ` Theodore Ts'o
2003-03-09 19:26 ` Alex Tomas
2003-03-09 7:08 ` Alex Tomas
1 sibling, 1 reply; 38+ messages in thread
From: Theodore Ts'o @ 2003-03-07 23:27 UTC (permalink / raw)
To: Daniel Phillips
Cc: Alex Tomas, Martin J. Bligh, linux-kernel, ext2-devel, Andrew Morton
On Sat, Mar 08, 2003 at 06:38:50PM +0100, Daniel Phillips wrote:
> > The problem is that getdents(2) returns inodes out of order and
> > du causes many head seeks. I tried to solve the problem with the
> > patch I included. The idea of the patch is pretty simple: just try
> > to sort dentries by inode number in readdir().
Yup, that's the problem and the fix; the only issue is the one
which Daniel pointed out:
> The problem I see with your approach is that the traversal is no
> longer in hash order, so a leaf split in the middle of a directory
> traversal could result in a lot of duplicate dirents. I'm not sure
> there's a way around that.
The fix in userspace would be for du and find (the two programs most
likely to care about this sort of thing) to pull into memory large
chunks of the directory at a time, sort them by inode, and then stat
them in inode order.
The only other way around it would be to maintain an entirely
separate B-tree on disk just for the benefit of readdir(), which could
be in inode sort order. Readdir() could then traverse the tree in
inode sort order, thus solving the performance penalty. (The b-tree
has to be on disk, since for large directories, we can't afford to
keep it in memory.)
> We've apparently got a simpler problem though: inode numbers aren't
> allocated densely enough in the inode table blocks. This is the
> only thing I can think of that could cause the amount of thrashing
> we're seeing. Spreading inode number allocations out all over a
> volume is ok, but sparse allocation within each table block is not
> ok because it multiplies the pressure on the cache. We want a
> 30,000 entry directory to touch 1,000 - 2,000 inode table blocks,
> not the worst-case 30,000. As above, fixing this comes down to
> tweaking the ext3_new_inode allocation goal.
Nope. That's not it. Inode numbers are allocated sequentially in the
inode table blocks. You can't get any more dense than that. Create a
bunch of inodes in a directory like this:
mke2fs -j -F -b 1024 -g 1024 -N 1200000 test.img
mount -t ext3 -o loop test.img /mnt
pushd /mnt
mkdir test test2 test3
cd test
time seq -f "%06.0f" 1 100 | xargs touch
cd ..
cd test2
time seq -f "%06.0f" 1 10000 | xargs touch
cd ..
cd test3
time seq -f "%06.0f" 1 100000 | xargs touch
cd ..
popd
Then look at the result using "ls -li". You will see that the inode
numbers are all sequentially allocated, when you look at the inodes in
creation order (which in this particular case is in filename sort
order). The problem is that when you look at them in hash sort
order, the inode numbers are scattered all over the place.
> Another approach is to have the inode number correspond approximately to the
> hash value. That's not as hard as it sounds, it simply means that the goal
> for ext3_new_inode should be influenced by the hash value. (But watch out
> for interaction with the new Orlov allocator.)
I'm not at all convinced that splattering the inodes all over the volume
is in fact a good thing, because it means splattering the block
allocation for files within the same directory all over the volume.
Arguably, read access to files in a directory is much more common than
the "du" or "find" case. So if we need to make trade-off in one
direction or another, I'd much rather go with striving to maximize
block locality than making a readdir+stat run go fast. This is
especially true since there *is* a user space workaround that would
speed up things for du and find.
- Ted
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bug 417] New: htree much slower than regular ext3
2003-03-07 15:46 ` Alex Tomas
@ 2003-03-08 17:38 ` Daniel Phillips
2003-03-07 23:27 ` Theodore Ts'o
2003-03-09 7:08 ` Alex Tomas
2003-03-11 21:57 ` Bill Davidsen
1 sibling, 2 replies; 38+ messages in thread
From: Daniel Phillips @ 2003-03-08 17:38 UTC (permalink / raw)
To: Alex Tomas, Martin J. Bligh
Cc: linux-kernel, ext2-devel, Theodore Ts'o, Andrew Morton, Alex Tomas
On Fri 07 Mar 03 16:46, Alex Tomas wrote:
> Hi!
>
> The problem is that getdents(2) returns inodes out of order and
> du causes many head seeks. I tried to solve the problem with the
> patch included below. The idea of the patch is pretty simple: just
> try to sort dentries by inode number in readdir(). It works because
> inodes sit at fixed places in ext2/ext3. Please look at the results
> I just got:
>
> real user sys
>
> ext3+htree: 7m22.236s 0m0.292s 0m2.204s
> ext3-htree: 3m6.539s 0m0.481s 0m2.278s
> [...]
> ext3+sd-30: 2m39.658s 0m0.492s 0m2.138s
Nice. I posted a similar demonstration some time ago (could it *really* be
two years?) but I sorted the dirents in user space, which meant it was just a
demonstration, not a practical solution.
The problem I see with your approach is that the traversal is no longer in
hash order, so a leaf split in the middle of a directory traversal could
result in a lot of duplicate dirents. I'm not sure there's a way around that.
Another approach is to have the inode number correspond approximately to the
hash value. That's not as hard as it sounds, it simply means that the goal
for ext3_new_inode should be influenced by the hash value. (But watch out
for interaction with the new Orlov allocator.) It also depends on some
a priori estimate of the size of a directory so that increasing hash values
can be distributed more or less uniformly and monotonically across some
corresponding range of inode numbers. This isn't too hard either, just round
up the current size of the directory to a power of two and use that as the
size estimate. The size estimate would change from time to time over the
life of the directory, but there are only log N different sizes and that's
roughly how heavy the load on the inode cache would be during a directory
traversal. Finally, a nice property of this approach is that it stays stable
over many creates and deletes.
We've apparently got a simpler problem though: inode numbers aren't allocated
densely enough in the inode table blocks. This is the only thing I can think
of that could cause the amount of thrashing we're seeing. Spreading inode
number allocations out all over a volume is ok, but sparse allocation within
each table block is not ok because it multiplies the pressure on the cache.
We want a 30,000 entry directory to touch 1,000 - 2,000 inode table blocks,
not the worst-case 30,000. As above, fixing this comes down to tweaking the
ext3_new_inode allocation goal.
Regards,
Daniel
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Ext2-devel] Re: [Bug 417] New: htree much slower than regular ext3
2003-03-09 22:54 ` [Ext2-devel] " Daniel Phillips
@ 2003-03-08 23:19 ` Andrew Morton
0 siblings, 0 replies; 38+ messages in thread
From: Andrew Morton @ 2003-03-08 23:19 UTC (permalink / raw)
To: Daniel Phillips; +Cc: adilger, tytso, bzzz, mbligh, linux-kernel, ext2-devel
Daniel Phillips <phillips@arcor.de> wrote:
>
> Yes, I think that's exactly what's happening. There are some questions
> remaining, such as why doesn't it happen to akpm.
Oh but it does. The 30,000 mkdirs test takes 76 seconds on 1k blocks, 16
seconds on 4k blocks.
With 1k blocks, the journal is 1/4 the size, and there are 4x as many
blockgroups.
So not only does the fs have to checkpoint 4x as often, it has to seek all
over the disk to do it.
If you create a 1k blocksize fs with a 100MB journal, the test takes just
nine seconds.
But note that after these nine seconds, we were left with 50MB of dirty
buffercache. When pdflush later comes along to write that out it takes >30
seconds, because of the additional fragmentation from the tiny blockgroups.
So all we've done is to pipeline the pain.
I suspect the Orlov allocator isn't doing the right thing here - a full
dumpe2fs of the disk shows that the directories are all over the place.
Nobody has really looked at fine-tuning Orlov.
btw, an `rm -rf' of the dir which holds 30,000 dirs takes 50 seconds system
time on a 2.7GHz CPU. What's up with that?
c01945bc str2hashbuf 9790 61.1875
c0187064 ext3_htree_store_dirent 10472 28.7692
c0154920 __getblk 11319 202.1250
c018d11c dx_probe 15934 24.2896
c018d4d0 ext3_htree_fill_tree 15984 33.8644
c0188d3c ext3_get_branch 18238 81.4196
c0136388 find_get_page 18617 202.3587
c015477c bh_lru_install 20493 94.8750
c0194290 TEA_transform 21463 173.0887
c01536b0 __find_get_block_slow 24069 62.6797
c0154620 __brelse 36139 752.8958
c01893dc ext3_get_block_handle 43815 49.7898
c0154854 __find_get_block 71292 349.4706
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bug 417] New: htree much slower than regular ext3
2003-03-08 17:38 ` Daniel Phillips
2003-03-07 23:27 ` Theodore Ts'o
@ 2003-03-09 7:08 ` Alex Tomas
2003-03-10 17:58 ` Daniel Phillips
2003-03-10 21:25 ` Theodore Ts'o
1 sibling, 2 replies; 38+ messages in thread
From: Alex Tomas @ 2003-03-09 7:08 UTC (permalink / raw)
To: Daniel Phillips
Cc: Alex Tomas, Martin J. Bligh, linux-kernel, ext2-devel,
Theodore Ts'o, Andrew Morton
>>>>> Daniel Phillips (DP) writes:
DP> On Fri 07 Mar 03 16:46, Alex Tomas wrote:
DP> The problem I see with your approach is that the traversal is no
DP> longer in hash order, so a leaf split in the middle of a
DP> directory traversal could result in a lot of duplicate dirents.
DP> I'm not sure there's a way around that.
1) As far as I understand, duplicates are possible even in classic ext2
w/o sortdir/index. See the diagram:
Process 1 Process 2
getdents(2) returns
dentry1 (file1 -> Inode1)
dentry2 (file2 -> Inode2)
context switch -->
unlink(file1), empty dentry1
creat(file3), Inode3, use dentry1
creat(file1), Inode1, use dentry3
context switch -->
getdents(2) returns
dentry3(file1 -> Inode1)
Am I right?
2) Why not use hash order for traversal, like ext3_dx_readdir() does?
Upon reading several dentries within some hash set, readdir() sorts them
in inode order and returns them to the user.
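Alex's second suggestion amounts to a windowed sort: keep the hash-order traversal, but sort each batch of dentries by inode number before returning it. A minimal sketch of that idea (illustrative only; the struct and function names are made up, this is not the posted patch):

```c
/* Windowed sort: within one readdir() batch gathered in hash order,
 * return entries sorted by inode number.  Illustrative sketch only. */
#include <stdlib.h>

struct entry {
	unsigned long ino;
	char name[256];
};

/* qsort comparator: ascending inode number. */
static int by_ino(const void *a, const void *b)
{
	unsigned long x = ((const struct entry *)a)->ino;
	unsigned long y = ((const struct entry *)b)->ino;
	return (x > y) - (x < y);
}

/* Sort one batch in place; the caller then emits it to userspace. */
void sort_batch(struct entry *batch, size_t n)
{
	qsort(batch, n, sizeof *batch, by_ino);
}
```

The traversal position is still a hash value, so telldir/seekdir semantics are unchanged; only the order within each window differs.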
with best regards, Alex
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bug 417] New: htree much slower than regular ext3
2003-03-07 23:27 ` Theodore Ts'o
@ 2003-03-09 19:26 ` Alex Tomas
0 siblings, 0 replies; 38+ messages in thread
From: Alex Tomas @ 2003-03-09 19:26 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Daniel Phillips, Alex Tomas, Martin J. Bligh, linux-kernel,
ext2-devel, Andrew Morton
>>>>> Theodore Ts'o (TT) writes:
TT> The fix in userspace would be for du and find (the two programs
TT> most likely to care about this sort of thing) to pull into memory
TT> a large chunks of the directory at a time, sort them by inode,
TT> and then stat them in inode order.
In fact, this solution only helps filesystems with fixed inode
placement; for example, it won't work for reiserfs.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Ext2-devel] Re: [Bug 417] New: htree much slower than regular ext3
[not found] ` <20030308010424.Z1373@schatzie.adilger.int>
@ 2003-03-09 22:54 ` Daniel Phillips
2003-03-08 23:19 ` Andrew Morton
0 siblings, 1 reply; 38+ messages in thread
From: Daniel Phillips @ 2003-03-09 22:54 UTC (permalink / raw)
To: Andreas Dilger, Andrew Morton
Cc: tytso, bzzz, mbligh, linux-kernel, ext2-devel
On Sat 08 Mar 03 09:04, Andreas Dilger wrote:
> I was testing this in UML-over-loop in 2.4, and the difference in speed
> for doing file creates vs. directory creates is dramatic. For file
> creates I was running 3s per 10000 files, and for directory creates I
> was running 12s per 10000 files.
And on a 10K scsi disk I'm running 35s per 10,000 directories, which is way,
way slower than it ought to be. There are two analysis tools we're hurting
for badly here:
- We need to see the physical allocation maps for directories, preferably
in a running kernel. I think the best way to do this is a little
map-dumper hooked into ext3/dir.c and exported through /proc.
- We need block-access traces in a nicer form than printks (or nobody
will ever use them). IOW, we need LTT or something very much like
it.
> Depending on the size of the journal vs. how many block/inode bitmaps and
> directory blocks are dirtied, you will likely wrap the journal before you
> return to the first block group, so you might write 20kB * 32000 for the
> directory creates instead of 8kB for the file creates. You also have a
> lot of seeking to each block group to write out the directory data, instead
> of nearly sequential IO for the inode create case.
Yes, I think that's exactly what's happening. There are some questions
remaining, such as why doesn't it happen to akpm. Another question: why does
it happen to the directory creates, where the only thing being accessed
randomly is the directory itself - the inode table is supposedly being
allocated/dirtied sequentially.
Regards,
Daniel
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bug 417] New: htree much slower than regular ext3
[not found] ` <20030307214833.00a37e35.akpm@digeo.com>
[not found] ` <20030308010424.Z1373@schatzie.adilger.int>
@ 2003-03-09 23:10 ` Daniel Phillips
1 sibling, 0 replies; 38+ messages in thread
From: Daniel Phillips @ 2003-03-09 23:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: tytso, bzzz, mbligh, linux-kernel, ext2-devel
On Sat 08 Mar 03 06:48, Andrew Morton wrote:
> Daniel Phillips <phillips@arcor.de> wrote:
> > OK, according to akpm, data=journal is broken at the moment. Changing
> > the host filesystem to data=ordered allowed me to complete the test.
> >
> > Interesting results. Creating the initial 30,000 directories ought to
> > take just a few seconds with the index, but instead it takes 5 minutes
> > and 9 seconds, while consuming only 5 seconds of cpu time.
>
It takes 14 seconds here against a real disk, and 9 seconds against loop.
The best I've seen is 77 seconds to make the directories, on a fast scsi disk
with 4K blocksize. That's a 6 1/2 times difference; what could account
for it?
> The difference: I used 4k blocksize. 1k blocksize should not be that bad:
> something is broken.
>
> Suggest you test against a real disk for now. Using loop may skew things.
OK, I tested 1K and 4K blocksize on scsi, ide and ide-hosted loopback, the
results are summarized here:
Make 30,000 directories:
scsi 10K ide 5400 loopback
1K 1m44.243s 4m59.757s 4m8.362s
4K 1m16.659s 4m49.827s 1m43.399s
Make one file in each directory:
scsi 10K ide 5400 loopback
1K 2m50.483s 7m28.657s 4m8.793s
4K 2m34.461s 5m45.615s 3m0.603s
Stat the tree (du):
scsi 10K ide 5400 loopback
1K 0m43.245s 1m52.230s 0m1.343s
4K 0m1.038s 0m2.056s 0m1.016s
The first thing to notice: Alex Tomas's original complaint about Htree and
stat doesn't manifest at all here. Mind you, I did not repeat his exact
sequence of tune2fs's etc, so it's possible there could be a utils bug or
something. I'll check into that later. Meanwhile, these stats show other
breakage that needs to be investigated.
The directory creation time is several times higher than it should be, the
same for the file creates. With the index, the two sets of operations should
take about the same time. Is this a journal problem? My journal is about 8
meg. I can imagine the journal wrapping during the file creation, because of
touching all the existing inode table blocks, but this should not happen
during the directory creation.
The stats also show a huge slowdown for 1K blocksize on non-loopback, as you
noticed earlier. Hmm. First: why on the raw device and not on loopback?
Second: why does the device affect this at all? This operation is r/o. Hmm,
except for atimes. I reran the scsi tests with noatime+nodiratime and the
stat times for 1K and 4K became nearly equal, about .75 seconds. So there we
have it: several tens of seconds are spent journalling dirty inode table
blocks due to atime, but only when blocksize is 1K. Big fat clue.
Stat the tree (du), noatime+nodiratime:
scsi 10K
1K 0m0.780s
4K 0m0.751s
The cpu numbers are in the detailed results below, and they show that Htree
is doing its job, all the cpu numbers are tiny. We've kicked around the idea
that Htree is causing inode table thrashing, but I think that's the wrong
tree to bark up. Journal effects are looking more and more like the real
problem, possibly in combination with IO scheduling. The fact that Htree
dirties the inode table out of order is just exposing some other underlying
problem.
A quick look at slabinfo before and after unmounting the test volume shows
that all the inodes we expect to be cached, actually are:
cat /proc/slabinfo | grep ext3_inode
ext3_inode_cache 121469 121473 448 13497 13497 1 : 120 60
sudo umount /mnt; cat /proc/slabinfo | grep ext3_inode
ext3_inode_cache 61506 61632 448 6836 6848 1 : 120 60
So there were about 60,000 inodes cached for that volume, exactly as
expected. So much for the theory that we're thrashing the inode table cache.
What's actually happening, I strongly suspect, is a lot of dirty blocks are
getting pushed through the journal a little prematurely, without allowing any
opportunity for coalescing. At create time, it must be the directory blocks,
not the inode table blocks, that cause the extra writeout, since inodes are
being allocated (mostly) sequentially.
By the way, the script is a perfect trigger for the scheduler interactivity
bug, leaving me mouseless for 4 seconds at a time. I guess I'll apply
Linus's patch and see if that goes away.
One other thing I noticed is that, when the script tries to umount
immediately after the stat, I get "umount: /mnt: device is busy", however,
after a few seconds it works fine. This strikes me as a bug.
======================
1K blocksize, 10K rpm scsi
======================
Make directories...
real 1m44.243s
user 0m0.142s
sys 0m3.465s
Make one file in each directory...
real 2m50.483s
user 0m23.617s
sys 0m59.465s
Stat the tree...
30M /mnt/test
real 0m43.245s
user 0m0.245s
sys 0m0.915s
======================
1K blocksize, 5400 rpm ide
======================
Make directories...
real 4m59.757s
user 0m0.138s
sys 0m5.207s
Make one file in each directory...
real 7m28.657s
user 0m24.351s
sys 0m58.941s
Stat the tree...
30M /mnt/test
real 1m52.230s
user 0m0.261s
sys 0m1.013s
===================
1K blocksize, loopback
===================
Make directories...
real 4m8.362s
user 0m0.144s
sys 0m7.448s
Make one file in each directory...
real 4m8.793s
user 0m25.262s
sys 1m20.376s
Stat the tree...
30M /mnt/test
real 0m1.343s
user 0m0.265s
sys 0m0.720s
======================
4K blocksize, 10K rpm scsi
======================
Make directories...
real 1m16.659s
user 0m0.152s
sys 0m4.106s
Make one file in each directory...
real 2m34.461s
user 0m23.555s
sys 0m58.533s
Stat the tree...
118M /mnt/test
real 0m1.038s
user 0m0.242s
sys 0m0.664s
======================
4K blocksize, 5400 rpm ide
======================
Make directories...
real 4m49.827s
user 0m0.153s
sys 0m16.666s
Make one file in each directory...
real 5m45.615s
user 0m23.836s
sys 1m6.908s
Stat the tree...
118M /mnt/test
real 0m2.056s
user 0m0.432s
sys 0m0.622s
===================
4K blocksize, loopback
===================
Make directories...
real 1m43.399s
user 0m0.149s
sys 0m10.806s
Make one file in each directory...
real 3m0.603s
user 0m25.568s
sys 1m34.851s
Stat the tree...
118M /mnt/test
real 0m1.016s
user 0m0.318s
sys 0m0.588s
========
The script
========
#volume=test.img
#volume=/dev/hda5
volume=/dev/sdb5
mke2fs -j -F -b 4096 -g 1024 $volume
mount -t ext3 -o loop $volume /mnt
cd /mnt
mkdir test
cd test
echo
echo Make directories...
time seq -f "%06.0f" 1 30000 | xargs mkdir
echo
echo Make one file in each directory...
time for i in `ls`; do cd $i && touch foo && cd .. ; done
echo
echo Stat the tree...
time du -hs /mnt/test
echo
umount /mnt
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Bug 417] New: htree much slower than regular ext3
2003-03-09 7:08 ` Alex Tomas
@ 2003-03-10 17:58 ` Daniel Phillips
2003-03-10 21:25 ` Theodore Ts'o
1 sibling, 0 replies; 38+ messages in thread
From: Daniel Phillips @ 2003-03-10 17:58 UTC (permalink / raw)
To: Alex Tomas
Cc: Alex Tomas, Martin J. Bligh, linux-kernel, ext2-devel,
Theodore Ts'o, Andrew Morton
On Sun 09 Mar 03 08:08, Alex Tomas wrote:
> >>>>> Daniel Phillips (DP) writes:
>
> DP> On Fri 07 Mar 03 16:46, Alex Tomas wrote:
> DP> The problem I see with your approach is that the traversal is no
> DP> longer in hash order, so a leaf split in the middle of a
> DP> directory traversal could result in a lot of duplicate dirents.
> DP> I'm not sure there's a way around that.
>
> 1) As far as I understand, duplicates are possible even in classic ext2
> w/o sortdir/index. See the diagram:
>
> Process 1 Process 2
>
> getdents(2) returns
> dentry1 (file1 -> Inode1)
> dentry2 (file2 -> Inode2)
>
> context switch -->
> unlink(file1), empty dentry1
> creat(file3), Inode3, use
> dentry1 creat(file1), Inode1, use dentry3
>
> context switch -->
>
> getdents(2) returns
> dentry3(file1 -> Inode1)
>
>
> Am I right?
>
>
> 2) Why not use hash order for traversal, like ext3_dx_readdir() does?
> Upon reading several dentries within some hash set, readdir() sorts them
> in inode order and returns them to the user.
>
>
> with best regards, Alex
You're right, but you still don't win the stuffed poodle.
I put forth the same argument at last year's kernel workshop and was
overruled on the grounds that, let me see:
- Duplicates as a result of unlinks are one thing, duplicates as a
result of creates are another, worse thing.
- Creating one duplicate as a result of one unlink is one thing,
creating lots of duplicates with one operation is another, worse
thing.
- The rules are written to allow this particular duplicate behavior
in UFS (design inherited by Ext2/3) because at the time, UFS was
the only game in town.
The phrase "broken by design" comes to mind. Ted and Stephen would no doubt
be happy to elaborate. Anyway, there's another reason we need to return
results in hash order, and that is telldir/seekdir, which expects a stable
enumeration of a directory with no duplicates, even with concurrent
operations going on, and even if the server is rebooted in the middle of a
directory traversal. Not just broken by design, but smashed to pieces by
design. And we still have to make it work, for some definition of "work".
Anyway,
^ permalink raw reply [flat|nested] 38+ messages in thread
* [RFC] Improved inode number allocation for HTree
[not found] ` <m3u1ecl5h8.fsf@lexa.home.net>
@ 2003-03-10 20:45 ` Daniel Phillips
[not found] ` <3E6D1D25.5000004@namesys.com>
2003-03-10 20:48 ` [RFC] Improved inode number allocation for HTree Daniel Phillips
1 sibling, 1 reply; 38+ messages in thread
From: Daniel Phillips @ 2003-03-10 20:45 UTC (permalink / raw)
Cc: Theodore Ts'o, Andreas Dilger, Christopher Li, Alex Tomas
Summary of the problem:
HTree's random relationship between directory entry order and inode order has
been observed to cause increased cache pressure, affecting performance under
certain conditions. Most of this pressure is due to inode table usage, which
is normally several times larger than the directory itself. Even when the
inode table usage is much less than available cache, it may still exceed the
size of the journal file (8 MB by default). Though the journal only becomes
involved when blocks are modified, unfortunately, because of atime updates,
this includes all directory operations. We could suggest to users that they
should disable atime updating if they care about performance, but we ought to
be able to do better than that.
To reduce this cache pressure, what we need to do is make the directory entry
order correspond more closely to the inode number order. It doesn't have to
correspond perfectly in order to show substantial improvement, as Tomas
demonstrated with his recent getdents sorting patch.
Analysis:
To understand why sorting relatively small sections of the directory entries
reduces the cache pressure, we can think about the lifetime of an inode table
block over the course of a directory traversal. This traversal lifetime
extends from when it is first accessed (always a write as noted above) to
when it is last accessed, or the directory traversal finishes. Each inode
table block contains multiple inodes, so if those inodes are accessed
randomly, the traversal lifetime of every inode table block will tend to be
from nearly the beginning of the traversal to nearly the end, and so the load
on the cache will be nearly the same as the number of inode table blocks. If
this load is greater than available cache, various blocks will have to be
written out and reloaded later for further processing. We can call the
period between loading and evicting an inode table block, its "cache period".
The number of times a block must be written/reloaded depends not only on the
number of inode table blocks covered by the directory and the availability of
cache memory, but on the probable number of the inodes within the block that
will be written during each of the block's cache period. By sorting the
directory chunkwise, the probable number of inodes processed in one cache
period increases, reducing the number of cache periods, in turn reducing disk
seeking and transfer overhead.
To apply the above model to the journal, just substitute "journal period" for
"cache period", where journal period is defined as the time from when a dirty
block first enters the journal to when it is written out to disk.
We can see that local sorting of directory chunks has hardly any effect on
the traversal lifetime of a block; it only reduces the number of cache
periods. The size of the chunk of directory that we sort locally presumably
stays constant, so the probable number of inodes on each inode table block
processed in each cache period approaches one as the directory size
increases. At some point, there is no longer any benefit from sorting, and we
arrive at that point at rather realistic directory sizes.
We can always do the job perfectly by sorting the whole directory. This
would require changes in every getdents-using application that cares about
performance, to either make the getdents window as big as the directory or to
do the sort in user space. A spinoff benefit of doing the full sort is that
it becomes trivial to detect and eliminate duplicate entries. Higher memory
load is a drawback, as is the work required to update user space programs.
There will also be noticeable latency introduced by the sort, which didn't
appear before. If sorting is to be done in user space then we impose the
requirement on all filesystems - even those that do not use inode numbers
internally - that they supply an inode key for sorting, and that the supplied
sorting key be meaningful in terms of cache locality.
Proposed Solution:
What if we could maintain the inodes in the inode table in the same order as
the directory entries, permanently, so that no sorting is required? If we
could do this, the traversal lifetime of each inode table block would be the
same as the number of inodes it contains, and the cache load due to the inode
table would be exactly one block. That is, the directory traversal would
process each inode table block completely before moving on to the next one.
Though the current Ext3 filesystem structure does not support this directly,
we can approximate the effect quite closely within the existing framework.
Due to Ted's recent work, we are already traversing the directory in hash
order, as opposed to physical allocation order, because of the particular
approach adopted to address the telldir/seekdir problem. This is relatively
inexpensive. So now, what we would like to do is make the inode number of
each directory entry increase monotonically, corresponding to monotonically
increasing hash values. We also want to allocate those inode numbers densely
in the inode table, to minimize seeking. If we know beforehand how many
entries the directory will have, this is easy: when we create an entry, we
multiply its hash value by the total number of entries divided by the total
hash range (2**31), add in a base target, and use that as the goal to
allocate a new inode. Due to the randomness of the hash value, we will
sometimes be forced to allocate an inode number out of order, but hopefully
there won't be too many of those, and they won't be very far out of order.
We may also end up with some gaps in the inode table. In fact, we may
intentionally leave some slack in order to reduce the number of out-of-order
allocations. These problems are not serious.
The big problem is that we don't know the number of directory entries ahead
of time. To get around this, we can use the current size of the directory,
rounded up to a power of two as the size estimate. Each time the directory
increases to a new binary size we establish a new region in the inode table,
which will map to the to-be-created half of directory entries, roughly in
order.
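The goal computation described above fits in a few lines of C (a sketch of the proposal only; the function and names are hypothetical, and the real change would live in ext3_new_inode's goal logic):

```c
/* Map a dirent hash to an inode allocation goal, per the RFC:
 * scale the 31-bit hash by the directory size estimate (current
 * entry count rounded up to a power of two) and add a base inode.
 * Illustrative sketch; inode_goal is a hypothetical name. */

/* Round up to the next power of two (result >= 1). */
static unsigned long roundup_pow2(unsigned long v)
{
	unsigned long p = 1;
	while (p < v)
		p <<= 1;
	return p;
}

/* base: first inode of the region reserved for this directory;
 * hash: 31-bit dirent hash; nr_entries: current directory size. */
unsigned long inode_goal(unsigned long base, unsigned long hash,
			 unsigned long nr_entries)
{
	unsigned long est = roundup_pow2(nr_entries ? nr_entries : 1);
	/* hash / 2^31, scaled to est slots; 64-bit to avoid overflow */
	return base + (unsigned long)(((unsigned long long)hash * est) >> 31);
}
```

The goal is monotonic in the hash and confined to [base, base + est), so monotonically increasing hash values map, roughly in order, onto a dense range of inode numbers.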
Now, consider what happens to a group of directory entries created when the
directory was so small that it mapped to a single inode table block. After a
series of leaf splits, these former neighbors will end up far apart in the
hash order, but their inode numbers will still be ordered monotonically over
a traversal. The traversal lifetime of that group of entries will be nearly
the entire directory traversal, but there is only one such block.
When the directory became larger, some newly created entries were mapped to
two new inode table blocks, the traversal lifetimes of which are roughly half
the entire traversal. There are only two such blocks, and their lifetimes do
not overlap, so these two blocks only generate one block worth of cache load.
So it goes, with each new set of entries of size 2**i generating only one
additional block of cache load. The final cache load is O(log2 N), a
pleasingly small number.
A very nice property of this inode allocation strategy is that the ordering
will tend to be reinforced as a directory ages, as opposed to slowly decaying
towards randomness as happens with linear directories.
So this approach is attractive, especially as it can be implemented in a
small number of lines of code. The main outstanding issue I have not
addressed is how to arrange things so that the inode allocation patterns of
neighbouring directories don't interfere with each other too much. I expect
that each time the directory grows to a new power of two size, we need to
scan the inode allocation map for a relatively unpopulated region of that
size and store the region's base in the directory header. Would we need to
store the entire history of allocation bases? Probably not. This aspect
requires further consideration.
Regards,
Daniel
^ permalink raw reply [flat|nested] 38+ messages in thread
* [RFC] Improved inode number allocation for HTree
[not found] ` <m3u1ecl5h8.fsf@lexa.home.net>
2003-03-10 20:45 ` [RFC] Improved inode number allocation for HTree Daniel Phillips
@ 2003-03-10 20:48 ` Daniel Phillips
2003-03-10 21:04 ` John Bradford
1 sibling, 1 reply; 38+ messages in thread
From: Daniel Phillips @ 2003-03-10 20:48 UTC (permalink / raw)
To: ext2-devel, linux-kernel
Cc: Theodore Ts'o, Andreas Dilger, Christopher Li, Alex Tomas
Summary of the problem:
HTree's random relationship between directory entry order and inode order has
been observed to cause increased cache pressure, affecting performance under
certain conditions. Most of this pressure is due to inode table usage, which
is normally several times larger than the directory itself. Even when the
inode table usage is much less than available cache, it may still exceed the
size of the journal file (8 MB by default). Though the journal only becomes
involved when blocks are modified, unfortunately, because of atime updates,
this includes all directory operations. We could suggest to users that they
should disable atime updating if they care about performance, but we ought to
be able to do better than that.
To reduce this cache pressure, what we need to do is make the directory entry
order correspond more closely to the inode number order. It doesn't have to
correspond perfectly in order to show substantial improvement, as Tomas
demonstrated with his recent getdents sorting patch.
Analysis:
To understand why sorting relatively small sections of the directory entries
reduces the cache pressure, we can think about the lifetime of an inode table
block over the course of a directory traversal. This traversal lifetime
extends from when it is first accessed (always a write as noted above) to
when it is last accessed, or the directory traversal finishes. Each inode
table block contains multiple inodes, so if those inodes are accessed
randomly, the traversal lifetime of every inode table block will tend to be
from nearly the beginning of the traversal to nearly the end, and so the load
on the cache will be nearly the same as the number of inode table blocks. If
this load is greater than available cache, various blocks will have to be
written out and reloaded later for further processing. We can call the
period between loading and evicting an inode table block, its "cache period".
The number of times a block must be written/reloaded depends not only on the
number of inode table blocks covered by the directory and the availablity of
cache memory, but on the probable number of the inodes within the block that
will be written during each of the block's cache period. By sorting the
directory chunkwise, the probable number of inodes processed in one cache
period increases, reducing the number of cache periods, in turn reducing disk
seeking and transfer overhead.
To apply the above model to the journal, just substitute "journal period" for
"cache period", where journal period is defined as the time from when a dirty
block first enters the journal to when it is written out to disk.
We can see that local sorting or directory chunks has hardly any effect on
the traversal lifetime of a block; it only reduces the number of cache
periods. The size of the chunk of directory that we sort locally presumably
stays constant, so the probable number of inodes on each inode table block
processed in each cache period approaches one as the directory size
increases. At some point, there is no longer any benefit from sorting, andwe
arrive at that point at rather realistic directory sizes.
We can always do the job perfectly by sorting the whole directory. This
would require changes in every getdents-using application that cares about
performance, to either make the getdents window as big as the directory or to
do the sort in user space. A spinoff benefit of doing the full sort is that
it becomes trivial to detect and eliminate duplicate entries. Higher memory
load is a drawback, as is the work required to update user space programs.
There will also be noticable latency introduced by the sort, which didn't
appear before. If sorting is to be done in user space then we impose the
requirement on all filesystems - even those that do not use inode numbers
internally - that they supply an inode key for sorting, and that the supplied
sorting key be meaningful in terms of cache locality.
Proposed Solution:
What if we could maintain the inodes in the inode table in the same order as
the directory entries, permanently, so that no sorting is required? If we
could do this, the traversal lifetime of each inode table block would be the
same as the number of inodes it contains, and the cache load due to the inode
table would be exactly one block. That is, the directory traversal would
process each inode table block completely before moving on to the next one.
Though the current Ext3 filesystem structure does not support this directly,
we can approximate the effect quite closely within the existing framework.
Due to Ted's recent work, we are already traversing the directory in hash
order, as opposed to physical allocation order, because of the particular
approach adopted to address the telldir/seekdir problem. This is relatively
inexpensive. So now, what we would like to do is make the inode number of
each directory entry increase monotonically, corresponding to monotonically
increasing hash values. We also want to allocate those inode numbers densely
in the inode table, to minimize seeking. If we know beforehand how many
entries the directory will have, this is easy: when we create an entry, we
multiply its hash value by the total number of entries divided by the total
hash range (2**31), add in a base target, and use that as the goal to
allocate a new inode. Due to the randomness of the hash value, we will
sometimes be forced to allocate an inode number out of order, but hopefully
there won't be too many of those, and they won't be very far out of order.
We may also end up with some gaps in the inode table. In fact, we may
intentionally leave some slack in order to reduce the number of out-of-order
allocations. These problems are not serious.
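The goal computation described above can be sketched in a few lines of C. This is only a sketch of the idea, not the actual ext3 allocator; the function and parameter names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the proposed inode goal: scale a 31-bit
 * dirent hash by the expected entry count over the full hash range
 * (2**31), then offset by the region's base inode number, so that
 * monotonically increasing hashes map to monotonically increasing,
 * densely packed inode numbers. */
static uint32_t inode_goal(uint32_t hash, uint32_t expected_entries,
                           uint32_t base)
{
	/* 64-bit product avoids overflow; >> 31 divides by the hash range */
	return base + (uint32_t)(((uint64_t)hash * expected_entries) >> 31);
}
```

A hash halfway through the range then lands halfway through the reserved region, preserving the directory's hash-order traversal in the inode table.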
The big problem is that we don't know the number of directory entries ahead
of time. To get around this, we can use the current size of the directory,
rounded up to a power of two as the size estimate. Each time the directory
increases to a new binary size we establish a new region in the inode table,
which will map to the to-be-created half of directory entries, roughly in
order.
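The size estimate itself is trivial; a minimal sketch (hypothetical helper, not kernel code):

```c
#include <stdint.h>

/* Use the directory's current entry count, rounded up to a power of
 * two, as the estimate of its final size.  The estimate doubles each
 * time the directory outgrows it, at which point a new inode table
 * region would be established for the new half of the entries. */
static uint32_t size_estimate(uint32_t nr_entries)
{
	uint32_t est = 1;

	while (est < nr_entries)
		est <<= 1;
	return est;
}
```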
Now, consider what happens to a group of directory entries created when the
directory was so small that it mapped to a single inode table block. After a
series of leaf splits, these former neighbors will end up far apart in the
hash order, but their inode numbers will still be ordered monotonically over
a traversal. The traversal lifetime of that group of entries will be nearly
the entire directory traversal, but there is only one such block.
When the directory became larger, some newly created entries were mapped to
two new inode table blocks, the traversal lifetimes of which are roughly half
the entire traversal. There are only two such blocks, and their lifetimes do
not overlap, so these two blocks only generate one block worth of cache load.
So it goes, with each new set of entries of size 2**i generating only one
additional block of cache load. The final cache load is O(log2(N)), a
pleasingly small number.
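The log2(N) claim can be checked with a toy count, one inode table block for the initial generation plus one non-overlapping block per doubling (an illustration of the argument, not real code):

```c
#include <stdint.h>

/* Toy count of the steady cache load argued above: the first
 * generation of entries occupies one inode table block, and each
 * subsequent doubling of the directory contributes one more block
 * whose traversal lifetime does not overlap the others, giving
 * roughly log2(N) + 1 blocks in total. */
static unsigned cache_load_blocks(uint32_t n_entries)
{
	unsigned blocks = 1;		/* the original single block */
	uint32_t size;

	for (size = 1; size < n_entries; size <<= 1)
		blocks++;		/* one per 2**i generation */
	return blocks;
}
```

For the bug report's 32000-entry directory this comes to 16 blocks of cache load rather than thousands.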
A very nice property of this inode allocation strategy is that the ordering
will tend to be reinforced as a directory ages, as opposed to slowly decaying
towards randomness as happens with linear directories.
So this approach is attractive, especially as it can be implemented in a
small number of lines of code. The main outstanding issue I have not
addressed is how to arrange things so that the inode allocation patterns of
neighbouring directories don't interfere with each other too much. I expect
that each time the directory grows to a new power of two size, we need to
scan the inode allocation map for a relatively unpopulated region of that
size and store the region's base in the directory header. Would we need to
store the entire history of allocation bases? Probably not. This aspect
requires further consideration.
Regards,
Daniel
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC] Improved inode number allocation for HTree
2003-03-10 20:48 ` [RFC] Improved inode number allocation for HTree Daniel Phillips
@ 2003-03-10 21:04 ` John Bradford
2003-03-10 21:28 ` Andreas Schwab
2003-03-10 21:33 ` [RFC] Improved inode number allocation for HTree Daniel Phillips
0 siblings, 2 replies; 38+ messages in thread
From: John Bradford @ 2003-03-10 21:04 UTC (permalink / raw)
To: Daniel Phillips; +Cc: ext2-devel, linux-kernel, tytso, adilger, chrisl, bzzz
> Though the journal only becomes involved when blocks are modified,
> unfortunately, because of atime updates, this includes all directory
> operations. We could suggest to users that they should disable
> atime updating if they care about performance, but we ought to be
> able to do better than that.
On a separate note, since atime updates are not usually very important
anyway, why not have an option to cache atime updates for a long time,
or until a write occurs anyway. Holding a large number of
atime updates in a write cache is generally not going to be a major
issue - the worst case if a partition isn't cleanly unmounted is that
some atimes will be wrong.
John.
* Re: [Bug 417] New: htree much slower than regular ext3
2003-03-09 7:08 ` Alex Tomas
2003-03-10 17:58 ` Daniel Phillips
@ 2003-03-10 21:25 ` Theodore Ts'o
1 sibling, 0 replies; 38+ messages in thread
From: Theodore Ts'o @ 2003-03-10 21:25 UTC (permalink / raw)
To: Alex Tomas
Cc: Daniel Phillips, Martin J. Bligh, linux-kernel, ext2-devel,
Andrew Morton
On Sun, Mar 09, 2003 at 10:08:34AM +0300, Alex Tomas wrote:
> 1) As far as I understand, duplicates are possible even in classic ext2
> w/o sortdir/index. See the diagram:
>
> Process 1 Process 2
>
> getdents(2) returns
> dentry1 (file1 -> Inode1)
> dentry2 (file2 -> Inode2)
>
> context switch -->
> unlink(file1), empty dentry1
> creat(file3), Inode3, use dentry1
> creat(file1), Inode1, use dentry3
>
> context switch -->
>
> getdents(2) returns
> dentry3(file1 -> Inode1)
>
>
> Am I right?
Yup. POSIX allows this. If you're in the middle of readdir() when a
file (or files) is created or deleted, then it is undefined how
readdir() will treat the filename(s) involved. However, you're *not*
allowed to duplicate or omit filenames that were not touched by a file
creation or deletion.
> 2) Why do not use hash order for traversal like ext3_dx_readdir() does?
> Upon reading several dentries within some hash set readdir() sorts them
> in inode order and returns to an user.
Yeah, glibc could do this, although it would make telldir/seekdir more
than a little complicated. (There's an awful lot of pain and agony
caused to filesystem developers by the existence of
telldir/seekdir, which inherently assumes a flat directory structure,
and so it makes it hard to do tricks like this.)
At some level this is the same solution as "do this in userspace",
except instead of needing to convince the maintainers of du and find,
we'd need to convince Ulrich. It is *not* clear that it would
necessarily be easier to get this fix into the glibc. In fact, since
du and find probably don't use seekdir/telldir, it's probably easier
to get the change into du and find.
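As an illustration of how small that change to du or find could be, a scandir(3) comparator that sorts by inode number would make the subsequent stat() calls visit the inode table roughly sequentially. This is a sketch of the idea only; du and find have their own directory-walking code:

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

/* Order directory entries by inode number so that per-entry stat()
 * calls touch inode table blocks in ascending order. */
static int by_ino(const struct dirent **a, const struct dirent **b)
{
	if ((*a)->d_ino < (*b)->d_ino)
		return -1;
	return (*a)->d_ino > (*b)->d_ino;
}

/* Hypothetical example driver: print one directory's entries in
 * inode order.  Returns 0 on success, -1 on error. */
int list_in_inode_order(const char *path)
{
	struct dirent **names;
	int i, n = scandir(path, &names, NULL, by_ino);

	if (n < 0)
		return -1;
	for (i = 0; i < n; i++) {
		printf("%lu\t%s\n", (unsigned long)names[i]->d_ino,
		       names[i]->d_name);
		free(names[i]);
	}
	free(names);
	return 0;
}
```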
- Ted
* Re: [RFC] Improved inode number allocation for HTree
2003-03-10 21:04 ` John Bradford
@ 2003-03-10 21:28 ` Andreas Schwab
2003-03-10 21:50 ` Filesystem write priorities, (Was: Re: [RFC] Improved inode number allocation for HTree) John Bradford
2003-03-10 21:33 ` [RFC] Improved inode number allocation for HTree Daniel Phillips
1 sibling, 1 reply; 38+ messages in thread
From: Andreas Schwab @ 2003-03-10 21:28 UTC (permalink / raw)
To: John Bradford
Cc: Daniel Phillips, ext2-devel, linux-kernel, tytso, adilger, chrisl, bzzz
John Bradford <john@grabjohn.com> writes:
|> > Though the journal only becomes involved when blocks are modified,
|> > unfortunately, because of atime updates, this includes all directory
|> > operations. We could suggest to users that they should disable
|> > atime updating if they care about performance, but we ought to be
|> > able to do better than that.
|>
|> On a separate note, since atime updates are not usually very important
|> anyway, why not have an option to cache atime updates for a long time,
|> or until a write occurs anyway.
mount -o noatime
Andreas.
--
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux AG, Deutschherrnstr. 15-19, D-90429 Nürnberg
Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."
* Re: [RFC] Improved inode number allocation for HTree
2003-03-10 21:04 ` John Bradford
2003-03-10 21:28 ` Andreas Schwab
@ 2003-03-10 21:33 ` Daniel Phillips
2003-03-10 21:47 ` [Ext2-devel] " Bryan O'Sullivan
1 sibling, 1 reply; 38+ messages in thread
From: Daniel Phillips @ 2003-03-10 21:33 UTC (permalink / raw)
To: John Bradford; +Cc: ext2-devel, linux-kernel, tytso, adilger, chrisl, bzzz
On Mon 10 Mar 03 22:04, John Bradford wrote:
> > Though the journal only becomes involved when blocks are modified,
> > unfortunately, because of atime updates, this includes all directory
> > operations. We could suggest to users that they should disable
> > atime updating if they care about performance, but we ought to be
> > able to do better than that.
>
> On a separate note, since atime updates are not usually very important
> anyway, why not have an option to cache atime updates for a long time,
> or until a write occurs anyway. Holding a large number of
> atime updates in a write cache is generally not going to be a major
> issue - the worst case if a partition isn't cleanly unmounted is that
> some atimes will be wrong.
It sounds practical. Why stop there? Since Ted is seriously considering
making a batch of incompatible extensions to the on-disk format anyway, how
about adding an atime table to each block group, four bytes per inode. Even
without lazy updating, it would cut down the dirty blocks generated by r/o
operations a lot. If actual atime is the latest of the atime table value and
the inode atime value[1], then inode write operations won't generate extra
traffic. You will only get new traffic when somebody wants the real atime.
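As a sketch of the "latest of the two values" rule (names hypothetical; nothing like this exists in ext3):

```c
#include <stdint.h>

/* The effective atime is the later of the per-group atime table slot
 * and the inode's own copy.  Ordinary inode writes update only the
 * inode copy and never dirty the table, so the table alone absorbs
 * the dirty blocks generated by read-only traversals.  Wrapping of
 * the 4-byte values is ignored here, per the footnote. */
static uint32_t effective_atime(uint32_t table_atime, uint32_t inode_atime)
{
	return table_atime > inode_atime ? table_atime : inode_atime;
}
```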
I'd put this under the category of "things to add to Ted's long list of fun
new ideas for Ext3/4".
[1] How to handle wrapping is left as an exercise for the interested reader.
Regards,
Daniel
* Re: [Ext2-devel] Re: [RFC] Improved inode number allocation for HTree
2003-03-10 21:33 ` [RFC] Improved inode number allocation for HTree Daniel Phillips
@ 2003-03-10 21:47 ` Bryan O'Sullivan
2003-03-10 22:02 ` Matthew Wilcox
0 siblings, 1 reply; 38+ messages in thread
From: Bryan O'Sullivan @ 2003-03-10 21:47 UTC (permalink / raw)
To: Daniel Phillips
Cc: John Bradford, ext2-devel, linux-kernel, tytso, adilger, chrisl, bzzz
On Mon, 2003-03-10 at 13:33, Daniel Phillips wrote:
> It sounds practical. Why stop there?
Why start? Who actually uses atime for anything at all, other than the
tiny number of shops that care about moving untouched files to tertiary
storage?
Surely if you want to heap someone else's plate with work, you should
offer a reason why :-)
<b
* Filesystem write priorities, (Was: Re: [RFC] Improved inode number allocation for HTree)
2003-03-10 21:28 ` Andreas Schwab
@ 2003-03-10 21:50 ` John Bradford
2003-03-14 21:55 ` [Ext2-devel] " Stephen C. Tweedie
0 siblings, 1 reply; 38+ messages in thread
From: John Bradford @ 2003-03-10 21:50 UTC (permalink / raw)
To: Andreas Schwab
Cc: phillips, ext2-devel, linux-kernel, tytso, adilger, chrisl, bzzz
> |> > Though the journal only becomes involved when blocks are modified,
> |> > unfortunately, because of atime updates, this includes all directory
> |> > operations. We could suggest to users that they should disable
> |> > atime updating if they care about performance, but we ought to be
> |> > able to do better than that.
> |>
> |> On a separate note, since atime updates are not usually very important
> |> anyway, why not have an option to cache atime updates for a long time,
> |> or until a write occurs anyway.
>
> mount -o noatime
Well, yes, I use noatime by default, but I was thinking that there
might be users who wanted to gain most of the performance that using
noatime would give, who still wanted access times available generally, but
who were not bothered about losing them on an unclean shutdown.
In fact, taking this one step further, we could assign directories
priorities, analogous to nice values for processes, and make the
laziness of writes dependent on that.
So, for example, on a webserver, you could assign databases a high
priority for writes, but your webserver log a low priority. If the
box crashed, the data most likely to be lost would be the webserver
log, rather than the database.
Another benefit would be that you could spin down disks, and not have
them spin up just to write data to the webserver log - that could be
committed to the disk just once an hour, but a database write could
cause it to spin up immediately.
John.
* Re: [Ext2-devel] Re: [RFC] Improved inode number allocation for HTree
2003-03-10 21:47 ` [Ext2-devel] " Bryan O'Sullivan
@ 2003-03-10 22:02 ` Matthew Wilcox
2003-03-11 8:47 ` Jakob Oestergaard
2003-03-14 21:57 ` Stephen C. Tweedie
0 siblings, 2 replies; 38+ messages in thread
From: Matthew Wilcox @ 2003-03-10 22:02 UTC (permalink / raw)
To: Bryan O'Sullivan
Cc: Daniel Phillips, John Bradford, ext2-devel, linux-kernel, tytso,
adilger, chrisl, bzzz
On Mon, Mar 10, 2003 at 01:47:14PM -0800, Bryan O'Sullivan wrote:
> Why start? Who actually uses atime for anything at all, other than the
> tiny number of shops that care about moving untouched files to tertiary
> storage?
>
> Surely if you want to heap someone else's plate with work, you should
> offer a reason why :-)
"You have new mail" vs "You have mail".
--
"It's not Hollywood. War is real, war is primarily not about defeat or
victory, it is about death. I've seen thousands and thousands of dead bodies.
Do you think I want to have an academic debate on this subject?" -- Robert Fisk
* Re: [Ext2-devel] Re: [RFC] Improved inode number allocation for HTree
2003-03-10 22:02 ` Matthew Wilcox
@ 2003-03-11 8:47 ` Jakob Oestergaard
2003-03-11 11:27 ` John Bradford
2003-03-14 21:57 ` Stephen C. Tweedie
1 sibling, 1 reply; 38+ messages in thread
From: Jakob Oestergaard @ 2003-03-11 8:47 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Bryan O'Sullivan, Daniel Phillips, John Bradford, ext2-devel,
linux-kernel, tytso, adilger, chrisl, bzzz
On Mon, Mar 10, 2003 at 10:02:54PM +0000, Matthew Wilcox wrote:
> On Mon, Mar 10, 2003 at 01:47:14PM -0800, Bryan O'Sullivan wrote:
> > Why start? Who actually uses atime for anything at all, other than the
> > tiny number of shops that care about moving untouched files to tertiary
> > storage?
> >
> > Surely if you want to heap someone else's plate with work, you should
> > offer a reason why :-)
>
> "You have new mail" vs "You have mail".
And having a clean /tmp using tmpwatch. Everything in my /tmp not
accessed for a few weeks gets deleted. I move cruft to /tmp every
single day, and I have not cleaned it up for years. It just stays tidy
all by itself. Of course one has to remember this when leaving a few
database files there before taking a vacation ;)
--
................................................................
: jakob@unthought.net : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:
* Re: [RFC] Improved inode number allocation for HTree
[not found] ` <20030311031216.8A31CEFD5F@mx12.arcor-online.net>
@ 2003-03-11 10:45 ` Hans Reiser
2003-03-11 13:00 ` Helge Hafting
0 siblings, 1 reply; 38+ messages in thread
From: Hans Reiser @ 2003-03-11 10:45 UTC (permalink / raw)
To: Daniel Phillips
Cc: Theodore Ts'o, Andreas Dilger, Christopher Li, Alex Tomas,
Linux Kernel Mailing List
Let's make noatime the default for VFS.
Daniel Phillips wrote:
>Hi Hans,
>
>On Tue 11 Mar 03 00:17, Hans Reiser wrote:
>
>
>>What do you think of creating a new telldir/seekdir which uses filenames
>>instead of ints to convey position?
>>
>>
>
>What do you do if that file gets deleted in the middle of a traversal?
>
So what? It still tells you where to restart. Strictly speaking, you
would want to allow the filesystem to choose what it wants to return to
indicate position. For htree and reiserfs this would be a filename or
its hash, for ext2 without htree this would be a byte offset.
>
>If I were able to design Unix over again, I'd state that if you don't lock a
>directory before traversing it then it's your own fault if somebody changes
>it under you, and I would have provided an interface to inform you about your
>bad luck. Strictly wishful thinking. (There, it feels better now.)
>
We are designing Linux. You know, Microsoft and Steve Jobs continue to
design. We should too.
Needless change should be avoided. Failure to change when something is
known to be broken leads to being inferior to those who do change.
Let's design something that does it right. Or else SCO will be right in
saying that Linux is just a Unix ripoff by people who couldn't do
anything unless Unix had been written to tell them how to do it right.
--
Hans
* Re: [Ext2-devel] Re: [RFC] Improved inode number allocation for HTree
2003-03-11 8:47 ` Jakob Oestergaard
@ 2003-03-11 11:27 ` John Bradford
0 siblings, 0 replies; 38+ messages in thread
From: John Bradford @ 2003-03-11 11:27 UTC (permalink / raw)
To: Jakob Oestergaard
Cc: willy, bos, phillips, ext2-devel, linux-kernel, tytso, adilger,
chrisl, bzzz
> > > Why start? Who actually uses atime for anything at all, other than the
> > > tiny number of shops that care about moving untouched files to tertiary
> > > storage?
What about the situation where the primary storage is a device which
has a limited number of write-cycles? That's just the sort of
application where you might be archiving data to secondary or tertiary
storage, and it would be a big advantage to save on writes. If a file
is read every ten minutes, we could just update the atime once an
hour. That would save five writes an hour.
John.
* Re: [RFC] Improved inode number allocation for HTree
2003-03-11 10:45 ` Hans Reiser
@ 2003-03-11 13:00 ` Helge Hafting
2003-03-11 13:41 ` Daniel Phillips
0 siblings, 1 reply; 38+ messages in thread
From: Helge Hafting @ 2003-03-11 13:00 UTC (permalink / raw)
To: Hans Reiser; +Cc: Daniel Phillips, Linux Kernel Mailing List
Hans Reiser wrote:
> Let's make noatime the default for VFS.
>
> Daniel Phillips wrote:
[...]
>> If I were able to design Unix over again, I'd state that if you don't
>> lock a directory before traversing it then it's your own fault if
>> somebody changes it under you, and I would have provided an interface
>> to inform you about your bad luck. Strictly wishful thinking.
>> (There, it feels better now.)
I'm happy nobody _can_ lock a directory like that. Think of it - unable
to create or delete files while some slow-moving program is traversing
the directory? Ouch. Plenty of options for DOS attacks too.
And how to do "rm *.bak" if rm locks the dir for traversal?
Helge Hafting
* Re: [RFC] Improved inode number allocation for HTree
2003-03-11 13:00 ` Helge Hafting
@ 2003-03-11 13:41 ` Daniel Phillips
2003-03-11 17:16 ` Andreas Dilger
` (2 more replies)
0 siblings, 3 replies; 38+ messages in thread
From: Daniel Phillips @ 2003-03-11 13:41 UTC (permalink / raw)
To: Helge Hafting, Hans Reiser; +Cc: Linux Kernel Mailing List
On Tue 11 Mar 03 14:00, Helge Hafting wrote:
> Hans Reiser wrote:
> > Let's make noatime the default for VFS.
> >
> > Daniel Phillips wrote:
>
> [...]
>
> >> If I were able to design Unix over again, I'd state that if you don't
> >> lock a directory before traversing it then it's your own fault if
> >> somebody changes it under you, and I would have provided an interface
> >> to inform you about your bad luck. Strictly wishful thinking.
> >> (There, it feels better now.)
>
> I'm happy nobody _can_ lock a directory like that. Think of it - unable
> to create or delete files while some slow-moving program is traversing
> the directory? Ouch. Plenty of options for DOS attacks too.
> And how to do "rm *.bak" if rm locks the dir for traversal?
<wishful thinking>
Now that you mention it, just locking out create and rename during directory
traversal would eliminate the pain. Delete is easy enough to handle during
traversal. For a B-Tree, coalescing could simply be deferred until the
traversal is finished, so reading the directory in physical storage order
would be fine. Way, way cleaner than what we have to do now.
</wishful thinking>
Regards,
Daniel
* Re: [RFC] Improved inode number allocation for HTree
2003-03-11 13:41 ` Daniel Phillips
@ 2003-03-11 17:16 ` Andreas Dilger
2003-03-11 19:39 ` Helge Hafting
2003-03-11 21:25 ` atomic kernel operations are very tricky to export to user space (was [RFC] Improved inode number allocation for HTree ) Hans Reiser
2 siblings, 0 replies; 38+ messages in thread
From: Andreas Dilger @ 2003-03-11 17:16 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Helge Hafting, Hans Reiser, Linux Kernel Mailing List
On Mar 11, 2003 14:41 +0100, Daniel Phillips wrote:
> On Tue 11 Mar 03 14:00, Helge Hafting wrote:
> > I'm happy nobody _can_ lock a directory like that. Think of it - unable
> > to create or delete files while some slow-moving program is traversing
> > the directory? Ouch. Plenty of options for DOS attacks too.
> > And how to do "rm *.bak" if rm locks the dir for traversal?
>
> <wishful thinking>
> Now that you mention it, just locking out create and rename during directory
> traversal would eliminate the pain. Delete is easy enough to handle during
> traversal. For a B-Tree, coalescing could simply be deferred until the
> traversal is finished, so reading the directory in physical storage order
> would be fine. Way, way cleaner than what we have to do now.
> </wishful thinking>
The other problem, of course, is that this would essentially mean that a
user-space application has a lock that prevents other processes (even the
kernel) from doing anything. At best we have to worry about an application
going sour, and at worst it is a DOS hole as Helge says.
How about something like:
death:
    for each directory
        while readdir(directory, large_buffer)
            if (directory)
                run death(directory)
        readdir(directory, small_buffer)
    sleep forever
The first getdents walks recursively down to the leaves, and then does a
single getdents with enough space to hold a single dirent, locking the
directory, but then never progressing. It also does the single-dirent
getdents on the way up the tree. Now, everything is locked out of the
entire directory tree.
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/
* Re: [RFC] Improved inode number allocation for HTree
2003-03-11 13:41 ` Daniel Phillips
2003-03-11 17:16 ` Andreas Dilger
@ 2003-03-11 19:39 ` Helge Hafting
2003-03-11 20:19 ` Daniel Phillips
2003-03-11 21:25 ` atomic kernel operations are very tricky to export to user space (was [RFC] Improved inode number allocation for HTree ) Hans Reiser
2 siblings, 1 reply; 38+ messages in thread
From: Helge Hafting @ 2003-03-11 19:39 UTC (permalink / raw)
To: Daniel Phillips; +Cc: linux-kernel
On Tue, Mar 11, 2003 at 02:41:06PM +0100, Daniel Phillips wrote:
> On Tue 11 Mar 03 14:00, Helge Hafting wrote:
> > I'm happy nobody _can_ lock a directory like that. Think of it - unable
> > to create or delete files while some slow-moving program is traversing
> > the directory? Ouch. Plenty of options for DOS attacks too.
> > And how to do "rm *.bak" if rm locks the dir for traversal?
>
> <wishful thinking>
> Now that you mention it, just locking out create and rename during directory
> traversal would eliminate the pain. Delete is easy enough to handle during
> traversal. For a B-Tree, coalescing could simply be deferred until the
> traversal is finished, so reading the directory in physical storage order
> would be fine. Way, way cleaner than what we have to do now.
> </wishful thinking>
Ok, so "rm" works. Then you have things like "mv *.c /usr/src" to worry
about. Lock for traversal, get stuck unable to work on the files.
A "dream" solution for this might involve something like:
Directory is traversed in some well defined order (defined by the fs)
Removing a file (or moving it out of the directory) leaves
an "empty" directory entry that isn't reclaimed until no more
traversals are in progress in that directory.
New files can be created, and will be created further out than
any traversal in progress so nothing will be missed.
This approach doesn't lock anything out, but I guess someone evil still
could cause trouble by keeping a traversal going forever, creating one dummy
file and deleting one whenever it makes progress. The directory would
get big, filled up with placeholders until some ulimit kicks in.
Helge Hafting
* Re: [RFC] Improved inode number allocation for HTree
2003-03-11 19:39 ` Helge Hafting
@ 2003-03-11 20:19 ` Daniel Phillips
0 siblings, 0 replies; 38+ messages in thread
From: Daniel Phillips @ 2003-03-11 20:19 UTC (permalink / raw)
To: Helge Hafting; +Cc: linux-kernel
On Tue 11 Mar 03 20:39, Helge Hafting wrote:
> On Tue, Mar 11, 2003 at 02:41:06PM +0100, Daniel Phillips wrote:
> > <wishful thinking>
> > Now that you mention it, just locking out create and rename during
> > directory traversal would eliminate the pain. Delete is easy enough to
> > handle during traversal. For a B-Tree, coalescing could simply be
> > deferred until the traversal is finished, so reading the directory in
> > physical storage order would be fine. Way, way cleaner than what we have
> > to do now.
> > </wishful thinking>
>
> Ok, so "rm" works. Then you have things like "mv *.c /usr/src" to worry
> about.
That's equivalent to ls; the shell expands it before doing any moving. You
can construct something more interesting with ls | xargs <something nasty>
into the same directory. Since the user is trying to shoot their foot off,
let the lock be recursive, and let them do that.
> ...someone evil still
> could cause trouble by keeping a traversal going forever, creating one
> dummy file and deleting one whenever it makes progress. The directory
> would get big, filled up with placeholders until some ulimit kicks in.
>
> Helge Hafting
It's not clear that's any more evil than things they can do already, eg,
seq 1 1000000 | xargs -l1 cp -a /usr
Welllll, this is all idle speculation anyway. Nobody's going to fix the
flaw this week, if it's even possible (I suspect it is).
Regards,
Daniel
* atomic kernel operations are very tricky to export to user space (was [RFC] Improved inode number allocation for HTree )
2003-03-11 13:41 ` Daniel Phillips
2003-03-11 17:16 ` Andreas Dilger
2003-03-11 19:39 ` Helge Hafting
@ 2003-03-11 21:25 ` Hans Reiser
2003-03-11 23:49 ` Jamie Lokier
2 siblings, 1 reply; 38+ messages in thread
From: Hans Reiser @ 2003-03-11 21:25 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Helge Hafting, Linux Kernel Mailing List
Daniel Phillips wrote:
>On Tue 11 Mar 03 14:00, Helge Hafting wrote:
>
>
>>Hans Reiser wrote:
>>
>>
>>>Let's make noatime the default for VFS.
>>>
>>>Daniel Phillips wrote:
>>>
>>>
>>[...]
>>
>>
>>
>>>>If I were able to design Unix over again, I'd state that if you don't
>>>>lock a directory before traversing it then it's your own fault if
>>>>somebody changes it under you, and I would have provided an interface
>>>>to inform you about your bad luck. Strictly wishful thinking.
>>>>(There, it feels better now.)
>>>>
>>>>
>>I'm happy nobody _can_ lock a directory like that. Think of it - unable
>>to create or delete files while some slow-moving program is traversing
>>the directory? Ouch. Plenty of options for DOS attacks too.
>>And how to do "rm *.bak" if rm locks the dir for traversal?
>>
>>
>
><wishful thinking>
>Now that you mention it, just locking out create and rename during directory
>traversal would eliminate the pain. Delete is easy enough to handle during
>traversal. For a B-Tree, coalescing could simply be deferred until the
>traversal is finished, so reading the directory in physical storage order
>would be fine. Way, way cleaner than what we have to do now.
></wishful thinking>
>
>Regards,
>
>Daniel
You would want a directory listing command that is a single system
call, and then that could be made isolated, without risk of userspace
failing to unlock. Then you have to worry about very large directories
straining things in user space, but you can probably have the user
specify the maximum size directory his process has space for, etc.
In general, DOS issues are what makes it difficult to introduce
transactions into the kernel. We are grappling with that now in
reiser4. It seems that the formula is to create atomic plugins, and
then it is the system administrator's responsibility to verify that he
can live with whatever DOS vulnerabilities it has before he installs it
in his kernel (or our responsibility if it is sent to us for inclusion
in the main kernel). Allowing arbitrary filesystem operations to be
combined into one atomic transaction seems problematic for either user
space or the kernel, depending on what you do.
In general, allowing user space to lock things means that you trust user
space to unlock. This creates all sorts of trust troubles, and if you
force the unlock after some timeout, then the user space application
becomes vulnerable to DOS from other processes causing it to exceed the
timeout.
Ideas on this are welcome.
--
Hans
* Re: [Bug 417] New: htree much slower than regular ext3
2003-03-07 15:46 ` Alex Tomas
2003-03-08 17:38 ` Daniel Phillips
@ 2003-03-11 21:57 ` Bill Davidsen
1 sibling, 0 replies; 38+ messages in thread
From: Bill Davidsen @ 2003-03-11 21:57 UTC (permalink / raw)
To: Alex Tomas; +Cc: linux-kernel, ext2-devel
On 7 Mar 2003, Alex Tomas wrote:
> The problem is that getdents(2) returns inodes out of order and
> du causes many head seeks. I tried to solve the problem by patch
> I included in. The idea of the patch is pretty simple: just try
> to sort dentries by inode number in readdir(). It works because
> inodes sits at fixed places in ext2/ext3. Please, look at results
> I just got:
Any change in the order of dentries returned is likely to run into problems
when seeking in a directory. Given that readdir() is usually followed by
either zero or many stat()s, perhaps when the first stat() comes along you
could pre-read the inodes in optimal order and cache them. However you
tune the size of your sort, it should work exactly as well as the size of the
pre-read.
This seems consistent with readahead and AS disk io.
--
bill davidsen <davidsen@tmr.com>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
* Re: atomic kernel operations are very tricky to export to user space (was [RFC] Improved inode number allocation for HTree )
2003-03-11 21:25 ` atomic kernel operations are very tricky to export to user space (was [RFC] Improved inode number allocation for HTree ) Hans Reiser
@ 2003-03-11 23:49 ` Jamie Lokier
0 siblings, 0 replies; 38+ messages in thread
From: Jamie Lokier @ 2003-03-11 23:49 UTC (permalink / raw)
To: Hans Reiser; +Cc: Daniel Phillips, Helge Hafting, Linux Kernel Mailing List
Hans Reiser wrote:
> Allowing arbitrary filesystem operations to be
> combined into one atomic transaction seems problematic for either user
> space or the kernel, depending on what you do.
>
> In general, allowing user space to lock things means that you trust user
> space to unlock. This creates all sorts of trust troubles, and if you
> force the unlock after some timeout, then the user space application
> becomes vulnerable to DOS from other processes causing it to exceed the
> timeout.
>
> Ideas on this are welcome.
You can allow user space to begin a transaction, do some operations
and end a transaction, possibly returning an "abort" result which
means userspace should assume the transaction did not commit any
results and/or whatever was read in the transaction was not reliable.
On the face of it, this leaves userspace susceptible to DOS or indeed
fairness/livelock problems. For example, if another program is always
changing a directory entry, how can you read that whole directory
in a transaction?
Fairness/livelock problems are hard to avoid with any kinds of lock.
Even the kernel's internal locks have these problems in corner cases
(for example, remember when gettimeofday()'s clock access had to be
converted from using a spinlock to a sequence lock - and that still
doesn't _guarantee_ there is no problem in principle, it just reduces
the probability in all reasonable scenarios).
However, some remedies can be applied to filesystem transactions. If
an operation would cause some other task's transaction to eventually
return an abort code, consider sleeping for a short duration.
Randomise that duration. If the other transaction(s) have been
aborting repeatedly, consider lengthening the sleep duration and/or
specifically waiting for the other transaction to complete, to boost
the other task(s)' likelihood of transaction success. Randomise this
decision too. If you know something about the type of the other
transactions (such as that they are trying to implement a read-write
lock by doing atomic operations on bytes in a file), consider exactly
what policy you hope to offer (writer preference? reader preference?
something in between?).
By which point it has remarkable similarities to the problems of
fairness in the task scheduler, and fairness/livelock in locks.
-- Jamie
* Re: [Ext2-devel] Re: [Bug 417] New: htree much slower than regular ext3
2003-02-27 21:00 ` Andreas Dilger
2003-02-28 4:12 ` Daniel Phillips
@ 2003-03-13 21:04 ` Stephen C. Tweedie
1 sibling, 0 replies; 38+ messages in thread
From: Stephen C. Tweedie @ 2003-03-13 21:04 UTC (permalink / raw)
To: Andreas Dilger
Cc: Daniel Phillips, Martin J. Bligh, linux-kernel, ext2-devel,
Theodore Ts'o, chrisl, Stephen Tweedie
Hi,
On Thu, 2003-02-27 at 21:00, Andreas Dilger wrote:
> I've got a patch which should help here, although it was originally written
> to speed up the "create" case instead of the "lookup" case. In the lookup
> case, it will do a pre-read of a number of inode table blocks, since the cost
> of doing a 64kB read and doing a 4kB read is basically the same - the cost
> of the seek.
No it's not --- you're evicting 16 times as much other
potentially-useful data from the cache for each lookup. You'll improve
the "du" or "ls -l" case by prefetching, but you may well slow down the
overall system performance when you're just doing random accesses (e.g.
managing large spools).
It would be interesting to think about how we can spot the cases where
the prefetch is likely to be beneficial, for example by observing
"stat"s coming in in strict hash order.
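A rough sketch of the detector Stephen suggests, assuming a small per-directory state fed the hash of each incoming lookup — the struct, function name, and threshold are all illustrative, not existing kernel code:

```c
/* Heuristic sketch: only enable inode-table prefetch once several
 * lookups arrive in strictly ascending hash order (a readdir-driven
 * scan like "du" or "ls -l"), and fall back to no prefetch the
 * moment an out-of-order lookup suggests random access. */
#include <stdint.h>

#define PREFETCH_THRESHOLD 4   /* ascending lookups before prefetching */

struct scan_detect {
    uint32_t last_hash;        /* hash seen on the previous lookup */
    int      run;              /* length of the current ascending run */
};

/* feed each lookup's hash; returns 1 when prefetching looks worthwhile */
int scan_detect_feed(struct scan_detect *sd, uint32_t hash)
{
    if (hash > sd->last_hash)
        sd->run++;
    else
        sd->run = 0;           /* out of order: looks like random access */
    sd->last_hash = hash;
    return sd->run >= PREFETCH_THRESHOLD;
}
```

The point of the threshold is exactly the trade-off above: a 64kB pre-read only pays for itself when most of the block will actually be used, so the heuristic should demand evidence of a sequential scan before evicting 16 blocks' worth of cache.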
--Stephen
* Re: [Ext2-devel] Filesystem write priorities, (Was: Re: [RFC] Improved inode number allocation for HTree)
2003-03-10 21:50 ` Filesystem write priorities, (Was: Re: [RFC] Improved inode number allocation for HTree) John Bradford
@ 2003-03-14 21:55 ` Stephen C. Tweedie
0 siblings, 0 replies; 38+ messages in thread
From: Stephen C. Tweedie @ 2003-03-14 21:55 UTC (permalink / raw)
To: John Bradford
Cc: Andreas Schwab, phillips, ext2-devel, linux-kernel,
Theodore Ts'o, Andreas Dilger, chrisl, bzzz, Stephen Tweedie
Hi,
On Mon, 2003-03-10 at 21:50, John Bradford wrote:
> Well, yes, I use noatime by default, but I was thinking that there
> might be users who wanted to gain most of the performance that using
> noatime would, who still wanted access times available generally, but
who were not bothered about losing them on an unclean shutdown.
I've got new inode-flushing code which somewhat does that already. The
code in
http://people.redhat.com/sct/patches/ext3-2.4/dev-20030314/40-iflush-sct/
allows us to defer inode writes, primarily to eliminate redundant
copy-to-buffer-cache operations in mark_inode_dirty; but it has the
side-effect of allowing us to defer atime updates quite safely.
I'm currently integrating all the latest bits and pieces of orlov and
htree work into that set of patches to provide a stable base, and the
next order of business is to get ACL and O_DIRECT diffs in for 2.4
too. But beyond that, I need to get down to some serious performance
work with ext3, and the inode flush code is a major part of that.
--Stephen
* Re: [Ext2-devel] Re: [RFC] Improved inode number allocation for HTree
2003-03-10 22:02 ` Matthew Wilcox
2003-03-11 8:47 ` Jakob Oestergaard
@ 2003-03-14 21:57 ` Stephen C. Tweedie
2003-03-15 8:39 ` jw schultz
1 sibling, 1 reply; 38+ messages in thread
From: Stephen C. Tweedie @ 2003-03-14 21:57 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Bryan O'Sullivan, Daniel Phillips, John Bradford, ext2-devel,
linux-kernel, Theodore Ts'o, Andreas Dilger, chrisl, bzzz,
Stephen Tweedie
Hi,
On Mon, 2003-03-10 at 22:02, Matthew Wilcox wrote:
> On Mon, Mar 10, 2003 at 01:47:14PM -0800, Bryan O'Sullivan wrote:
> > Why start? Who actually uses atime for anything at all, other than the
> > tiny number of shops that care about moving untouched files to tertiary
> > storage?
> >
> > Surely if you want to heap someone else's plate with work, you should
> > offer a reason why :-)
>
> "You have new mail" vs "You have mail".
"nodiratime" can still help there.
--Stephen
* Re: [Ext2-devel] Re: [RFC] Improved inode number allocation for HTree
2003-03-14 21:57 ` Stephen C. Tweedie
@ 2003-03-15 8:39 ` jw schultz
0 siblings, 0 replies; 38+ messages in thread
From: jw schultz @ 2003-03-15 8:39 UTC (permalink / raw)
To: linux-kernel
On Fri, Mar 14, 2003 at 09:57:12PM +0000, Stephen C. Tweedie wrote:
> Hi,
>
> On Mon, 2003-03-10 at 22:02, Matthew Wilcox wrote:
> > On Mon, Mar 10, 2003 at 01:47:14PM -0800, Bryan O'Sullivan wrote:
> > > Why start? Who actually uses atime for anything at all, other than the
> > > tiny number of shops that care about moving untouched files to tertiary
> > > storage?
> > >
> > > Surely if you want to heap someone else's plate with work, you should
> > > offer a reason why :-)
> >
> > "You have new mail" vs "You have mail".
>
> "nodiratime" can still help there.
As may noatime. noatime only prevents the automatic update
of atime on read. It doesn't (at least in my experience)
prevent utime(2) from updating the atime field.
Using noatime works quite well, at least with mutt, which
explicitly uses utime(2) to update atime. I cannot be
certain, but mutt may actually work better with noatime,
because then accesses by other tools (*biff & co) won't break the
mailer's notion of new mail (mtime > atime).
--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: jw@pegasys.ws
Remember Cernan and Schmitt
Thread overview: 38+ messages
2003-02-27 17:31 [Bug 417] New: htree much slower than regular ext3 Martin J. Bligh
2003-02-28 2:55 ` Daniel Phillips
2003-02-27 21:00 ` Andreas Dilger
2003-02-28 4:12 ` Daniel Phillips
2003-02-27 21:33 ` Martin J. Bligh
2003-03-13 21:04 ` [Ext2-devel] " Stephen C. Tweedie
2003-03-07 15:46 ` Alex Tomas
2003-03-08 17:38 ` Daniel Phillips
2003-03-07 23:27 ` Theodore Ts'o
2003-03-09 19:26 ` Alex Tomas
2003-03-09 7:08 ` Alex Tomas
2003-03-10 17:58 ` Daniel Phillips
2003-03-10 21:25 ` Theodore Ts'o
2003-03-11 21:57 ` Bill Davidsen
[not found] ` <20030307214833.00a37e35.akpm@digeo.com>
[not found] ` <20030308010424.Z1373@schatzie.adilger.int>
2003-03-09 22:54 ` [Ext2-devel] " Daniel Phillips
2003-03-08 23:19 ` Andrew Morton
2003-03-09 23:10 ` Daniel Phillips
[not found] ` <20030309184755.ACC80FCA8C@mx12.arcor-online.net>
[not found] ` <m3u1ecl5h8.fsf@lexa.home.net>
2003-03-10 20:45 ` [RFC] Improved inode number allocation for HTree Daniel Phillips
[not found] ` <3E6D1D25.5000004@namesys.com>
[not found] ` <20030311031216.8A31CEFD5F@mx12.arcor-online.net>
2003-03-11 10:45 ` Hans Reiser
2003-03-11 13:00 ` Helge Hafting
2003-03-11 13:41 ` Daniel Phillips
2003-03-11 17:16 ` Andreas Dilger
2003-03-11 19:39 ` Helge Hafting
2003-03-11 20:19 ` Daniel Phillips
2003-03-11 21:25 ` atomic kernel operations are very tricky to export to user space (was [RFC] Improved inode number allocation for HTree ) Hans Reiser
2003-03-11 23:49 ` Jamie Lokier
2003-03-10 20:48 ` [RFC] Improved inode number allocation for HTree Daniel Phillips
2003-03-10 21:04 ` John Bradford
2003-03-10 21:28 ` Andreas Schwab
2003-03-10 21:50 ` Filesystem write priorities, (Was: Re: [RFC] Improved inode number allocation for HTree) John Bradford
2003-03-14 21:55 ` [Ext2-devel] " Stephen C. Tweedie
2003-03-10 21:33 ` [RFC] Improved inode number allocation for HTree Daniel Phillips
2003-03-10 21:47 ` [Ext2-devel] " Bryan O'Sullivan
2003-03-10 22:02 ` Matthew Wilcox
2003-03-11 8:47 ` Jakob Oestergaard
2003-03-11 11:27 ` John Bradford
2003-03-14 21:57 ` Stephen C. Tweedie
2003-03-15 8:39 ` jw schultz