* [Bug 417] New: htree much slower than regular ext3 @ 2003-02-27 17:31 Martin J. Bligh 2003-02-28 2:55 ` Daniel Phillips ` (3 more replies) 0 siblings, 4 replies; 39+ messages in thread From: Martin J. Bligh @ 2003-02-27 17:31 UTC (permalink / raw) To: linux-kernel http://bugme.osdl.org/show_bug.cgi?id=417 Summary: htree much slower than regular ext3 Kernel Version: 2.5.63-bk3 Status: NEW Severity: normal Owner: akpm@digeo.com Submitter: bwindle-kbt@fint.org Distribution: Debian Testing Hardware Environment: x86, 256mb RAM, PIIX4, Maxtor 91728D8 ATA DISK drive Software Environment: Linux razor 2.5.63bk3 #25 Thu Feb 27 10:13:35 EST 2003 i686 Pentium II (Klamath) GenuineIntel GNU/Linux Gnu C 2.95.4 Gnu make 3.79.1 util-linux 2.11n mount 2.11n module-init-tools implemented e2fsprogs 1.30-WIP reiserfsprogs 3.6.3 Linux C Library 2.2.5 Dynamic linker (ldd) 2.2.5 Procps 2.0.7 Net-tools 1.60 Console-tools 0.2.3 Sh-utils 4.5.2 Problem Description: I created a directory ("test") with 32000 (ok, 31998) directories in it, and put a file called 'bar' in each of them (for i in `ls`; do cd $i && touch bar && cd .. ; done). Then I took that ext3 partition, umounted it, did a 'tune2fs -O dir_index', then 'fsck -fD', and remounted. I then did a 'time du -hs' on the test directory, and here are the results. ext3+htree: bwindle@razor:/giant/inodes$ time du -hs 126M . real 7m21.756s user 0m2.021s sys 0m22.190s I then unmounted, tune2fs -O ^dir_index, e2fsck -fD /dev/hdb1, remounted, and did another du -hs on the test directory. It took 1 minute, 48 seconds. bwindle@razor:/giant/test$ time du -hs 126M . real 1m48.760s user 0m1.986s sys 0m21.563s I thought htree was supposed to speed up access with large numbers of directories? ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Bug 417] New: htree much slower than regular ext3 2003-02-27 17:31 [Bug 417] New: htree much slower than regular ext3 Martin J. Bligh @ 2003-02-28 2:55 ` Daniel Phillips 2003-02-27 21:00 ` Andreas Dilger 2003-03-07 15:46 ` Alex Tomas ` (2 subsequent siblings) 3 siblings, 1 reply; 39+ messages in thread From: Daniel Phillips @ 2003-02-28 2:55 UTC (permalink / raw) To: Martin J. Bligh, linux-kernel; +Cc: ext2-devel, Theodore Ts'o, chrisl On Thursday 27 February 2003 18:31, Martin J. Bligh wrote: > Problem Description: > I created a directory ("test") with 32000 (ok, 31998) directories in it, > and put a file called 'foo' in each of them (for i in `ls`; do cd $i && > touch bar && cd .. ; done). Then I took that ext3 partion, umounted it, > did a 'tune2fs -O dir_index', then 'fsck -fD', and remounted. I then did a > 'time du -hs' on the test directory, and here are the results. > > ext3+htree: > bwindle@razor:/giant/inodes$ time du -hs > 126M . > > real 7m21.756s > user 0m2.021s > sys 0m22.190s > > I then unmounted, tune2fs -O ^dir_index, e2fsck -fD /dev/hdb1, remounted, > and did another du -hs on the test directory. It took 1 minute, 48 > seconds. > > bwindle@razor:/giant/test$ time du -hs > 126M . > > real 1m48.760s > user 0m1.986s > sys 0m21.563s > > > I thought htree was supposed to speed up access with large numbers of > directories? The du just does getdents and lstats in physical storage order, so there's no possible benefit from indexing in this case, and unindexed ext3 avoids long search times by caching the position at which it last found an entry. That answers the question "why doesn't it speed up"; however, "why did it slow way down" is harder. The single-file leaves of your directory tree don't carry any index (it's not worth it with only one file) and therefore use the same code path as unindexed ext3, so there's no culprit there. I'm looking suspiciously at ext3_dx_readdir, which is apparently absorbing about 11 ms per returned entry.
To put this in perspective, I'm used to seeing individual directory operation times well below 100 us, so even if each dirent cost as much as a full lookup, you'd see less than 3 seconds overhead for your 30,000 directories. 11 ms sounds like two seeks for each returned dirent, which sounds like a bug. Regards, Daniel ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Bug 417] New: htree much slower than regular ext3 2003-02-28 2:55 ` Daniel Phillips @ 2003-02-27 21:00 ` Andreas Dilger 2003-02-28 4:12 ` Daniel Phillips 2003-03-13 21:04 ` [Ext2-devel] " Stephen C. Tweedie 0 siblings, 2 replies; 39+ messages in thread From: Andreas Dilger @ 2003-02-27 21:00 UTC (permalink / raw) To: Daniel Phillips Cc: Martin J. Bligh, linux-kernel, ext2-devel, Theodore Ts'o, chrisl On Feb 28, 2003 03:55 +0100, Daniel Phillips wrote: > On Thursday 27 February 2003 18:31, Martin J. Bligh wrote: > > Problem Description: > > I created a directory ("test") with 32000 (ok, 31998) directories in it, > > and put a file called 'foo' in each of them (for i in `ls`; do cd $i && > > touch bar && cd .. ; done). Then I took that ext3 partion, umounted it, > > did a 'tune2fs -O dir_index', then 'fsck -fD', and remounted. I then did a > > 'time du -hs' on the test directory, and here are the results. > > > > ext3+htree: > > bwindle@razor:/giant/inodes$ time du -hs > > 126M . > > > > real 7m21.756s > > user 0m2.021s > > sys 0m22.190s > > > > I then unmounted, tune2fs -O ^dir_index, e2fsck -fD /dev/hdb1, remounted, > > and did another du -hs on the test directory. It took 1 minute, 48 > > seconds. > > > > bwindle@razor:/giant/test$ time du -hs > > 126M . > > > > real 1m48.760s > > user 0m1.986s > > sys 0m21.563s > > > > > > I thought htree was supposed to speed up access with large numbers of > > directories? > > The du just does getdents and lstats in physical storage order, so there's no > possible benefit from indexing in this case, and unindexed ext3 avoids long > search times by caching the position at which it last found an entry. That > answers the question "why doesn't it speed up", however, "why did it slow way > down" is harder. > > The single-file leaves of your directory tree don't carry any index (it's not > worth it with only one file) and therfore use the same code path as unindexed > ext3, so there's no culprit there. 
I'm looking suspiciously at > ext3_dx_readdir, which is apparently absorbing about 11 ms per returned > entry. To put this in perspective, I'm used to seeing individual directory > operation times well below 100 us, so even if each dirent cost as much as a > full lookup, you'd see less than 3 seconds overhead for your 30,000 > directories. > > 11 ms sounds like two seeks for each returned dirent, which sounds like a bug. I think you are pretty dead on there. The difference is that with unindexed entries, the directory entry and the inode are in the same order, while with indexed directories they are essentially in random order to each other. That means that each directory lookup causes a seek to get the directory inode data instead of doing allocation order (which is sequential reads on disk). In the past both would have been equally slow, but the orlov allocator in 2.5 causes a number of directories to be created in the same group before moving on to the next group, so you have nice batches of sequential reads. I've got a patch which should help here, although it was originally written to speed up the "create" case instead of the "lookup" case. In the lookup case, it will do a pre-read of a number of inode table blocks, since the cost of doing a 64kB read and doing a 4kB read is basically the same - the cost of the seek. The patch was written for a 2.4.18 RH kernel, but I think the areas touched (ext3_get_inode_loc and ext3_new_inode) are relatively unchanged, so it should be fairly straightforward to fix the rejects by hand for 2.5. For the create case, the patch speeds up lots-of-file creation rates a lot (50%), regardless of whether we are using htree or not. That is because we do not read in empty inode table blocks from disk (which will be slow behind a bunch of writes going on), and instead we have a pure write workload (with the exception of reading in the inode bitmaps occasionally).
Cheers, Andreas ======================= ext3-noread.diff ==================================== diff -ru lustre-head/fs/ext3/ialloc.c lustre/fs/ext3/ialloc.c --- lustre-head/fs/ext3/ialloc.c Mon Dec 23 10:02:58 2002 +++ lustre/fs/ext3/ialloc.c Mon Dec 23 09:46:20 2002 @@ -289,6 +289,37 @@ } /* + * @block_group: block group of inode + * @offset: relative offset of inode within @block_group + * + * Check whether any of the inodes in this disk block are in use. + * + * Caller must be holding superblock lock (group/bitmap read lock in future). + */ +int ext3_itable_block_used(struct super_block *sb, unsigned int block_group, + int offset) +{ + int bitmap_nr = load_inode_bitmap(sb, block_group); + int inodes_per_block; + unsigned long inum, iend; + struct buffer_head *ibitmap; + + if (bitmap_nr < 0) + return 1; + + inodes_per_block = sb->s_blocksize / EXT3_SB(sb)->s_inode_size; + inum = offset & ~(inodes_per_block - 1); + iend = inum + inodes_per_block; + ibitmap = EXT3_SB(sb)->s_inode_bitmap[bitmap_nr]; + for (; inum < iend; inum++) { + if (inum != offset && ext3_test_bit(inum, ibitmap->b_data)) + return 1; + } + + return 0; +} + +/* * There are two policies for allocating an inode. 
If the new inode is * a directory, then a forward search is made for a block group with both * free space and a low directory-to-inode ratio; if that fails, then of @@ -312,6 +343,7 @@ struct ext3_group_desc * gdp; struct ext3_group_desc * tmp; struct ext3_super_block * es; + struct ext3_iloc iloc; int err = 0; /* Cannot create files in a deleted directory */ @@ -514,9 +547,18 @@ inode->i_generation = sbi->s_next_generation++; ei->i_state = EXT3_STATE_NEW; - err = ext3_mark_inode_dirty(handle, inode); + err = ext3_get_inode_loc_new(inode, &iloc, 1); if (err) goto fail; - + BUFFER_TRACE(iloc->bh, "get_write_access"); + err = ext3_journal_get_write_access(handle, iloc.bh); + if (err) { + brelse(iloc.bh); + iloc.bh = NULL; + goto fail; + } + err = ext3_mark_iloc_dirty(handle, inode, &iloc); + if (err) goto fail; + unlock_super (sb); if(DQUOT_ALLOC_INODE(inode)) { DQUOT_DROP(inode); diff -ru lustre-head/fs/ext3/inode.c lustre/fs/ext3/inode.c --- lustre-head/fs/ext3/inode.c Mon Dec 23 10:02:58 2002 +++ lustre/fs/ext3/inode.c Mon Dec 23 09:50:25 2002 @@ -2011,16 +1994,25 @@ ext3_journal_stop(handle, inode); } -/* - * ext3_get_inode_loc returns with an extra refcount against the - * inode's underlying buffer_head on success. - */ +extern int ext3_itable_block_used(struct super_block *sb, + unsigned int block_group, + int offset); + +#define NUM_INODE_PREREAD 16 -int ext3_get_inode_loc (struct inode *inode, struct ext3_iloc *iloc) +/* + * ext3_get_inode_loc returns with an extra refcount against the inode's + * underlying buffer_head on success. If this is for a new inode allocation + * (new is non-zero) then we may be able to optimize away the read if there + * are no other in-use inodes in this inode table block. If we need to do + * a read, then read in a whole chunk of blocks to avoid blocking again soon + * if we are doing lots of creates/updates. 
+ */ +int ext3_get_inode_loc_new(struct inode *inode, struct ext3_iloc *iloc, int new) { struct super_block *sb = inode->i_sb; struct ext3_sb_info *sbi = EXT3_SB(sb); - struct buffer_head *bh = 0; + struct buffer_head *bh[NUM_INODE_PREREAD]; unsigned long block; unsigned long block_group; unsigned long group_desc; @@ -2042,38 +2034,86 @@ } group_desc = block_group >> sbi->s_desc_per_block_bits; desc = block_group & (sbi->s_desc_per_block - 1); - bh = sbi->s_group_desc[group_desc]; - if (!bh) { + if (!sbi->s_group_desc[group_desc]) { ext3_error(sb, __FUNCTION__, "Descriptor not loaded"); goto bad_inode; } - gdp = (struct ext3_group_desc *) bh->b_data; + gdp = (struct ext3_group_desc *)(sbi->s_group_desc[group_desc]->b_data); + /* * Figure out the offset within the block group inode table */ - offset = ((inode->i_ino - 1) % sbi->s_inodes_per_group) * - sbi->s_inode_size; + offset = ((inode->i_ino - 1) % sbi->s_inodes_per_group); + block = le32_to_cpu(gdp[desc].bg_inode_table) + - (offset >> EXT3_BLOCK_SIZE_BITS(sb)); - if (!(bh = sb_bread(sb, block))) { - ext3_error (sb, __FUNCTION__, - "unable to read inode block - " - "inode=%lu, block=%lu", inode->i_ino, block); - goto bad_inode; + (offset * sbi->s_inode_size >> EXT3_BLOCK_SIZE_BITS(sb)); + + bh[0] = sb_getblk(sb, block); + if (buffer_uptodate(bh[0])) + goto done; + + /* If we don't really need to read this block, and it isn't already + * in memory, then we just zero it out. Otherwise, we keep the + * current block contents (deleted inode data) for posterity. 
+ */ + if (new && !ext3_itable_block_used(sb, block_group, offset)) { + lock_buffer(bh[0]); + memset(bh[0]->b_data, 0, bh[0]->b_size); + mark_buffer_uptodate(bh[0], 1); + unlock_buffer(bh[0]); + } else { + unsigned long block_end, itable_end; + int count = 1; + + itable_end = le32_to_cpu(gdp[desc].bg_inode_table) + + sbi->s_itb_per_group; + block_end = block + NUM_INODE_PREREAD; + if (block_end > itable_end) + block_end = itable_end; + + for (; block < block_end; block++) { + bh[count] = sb_getblk(sb, block); + if (count && (buffer_uptodate(bh[count]) || + buffer_locked(bh[count]))) { + __brelse(bh[count]); + } else + count++; + } + + ll_rw_block(READ, count, bh); + + /* Release all but the block we actually need (bh[0]) */ + while (--count > 0) + __brelse(bh[count]); + + wait_on_buffer(bh[0]); + if (!buffer_uptodate(bh[0])) { + ext3_error(sb, __FUNCTION__, + "unable to read inode block - " + "inode=%lu, block=%lu", inode->i_ino, + bh[0]->b_blocknr); + goto bad_inode; + } } - offset &= (EXT3_BLOCK_SIZE(sb) - 1); + done: + offset = (offset * sbi->s_inode_size) & (EXT3_BLOCK_SIZE(sb) - 1); - iloc->bh = bh; - iloc->raw_inode = (struct ext3_inode *) (bh->b_data + offset); + iloc->bh = bh[0]; + iloc->raw_inode = (struct ext3_inode *)(bh[0]->b_data + offset); iloc->block_group = block_group; - + return 0; - + bad_inode: return -EIO; } +int ext3_get_inode_loc(struct inode *inode, struct ext3_iloc *iloc) +{ + return ext3_get_inode_loc_new(inode, iloc, 0); +} + void ext3_read_inode(struct inode * inode) { struct ext3_iloc iloc; -- Andreas Dilger http://sourceforge.net/projects/ext2resize/ http://www-mddsp.enel.ucalgary.ca/People/adilger/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Bug 417] New: htree much slower than regular ext3 2003-02-27 21:00 ` Andreas Dilger @ 2003-02-28 4:12 ` Daniel Phillips 2003-02-27 21:33 ` Martin J. Bligh 2003-03-13 21:04 ` [Ext2-devel] " Stephen C. Tweedie 1 sibling, 1 reply; 39+ messages in thread From: Daniel Phillips @ 2003-02-28 4:12 UTC (permalink / raw) To: Andreas Dilger Cc: Martin J. Bligh, linux-kernel, ext2-devel, Theodore Ts'o, chrisl On Thursday 27 February 2003 22:00, Andreas Dilger wrote: > > 11 ms sounds like two seeks for each returned dirent, which sounds like a > > bug. > > I think you are pretty dead on there. The difference is that with > unindexed entries, the directory entry and the inode are in the same order, > while with indexed directories they are essentially in random order to each > other. That means that each directory lookup causes a seek to get the > directory inode data instead of doing allocation order (which is sequential > reads on disk). > > In the past both would have been slow equally, but the orlov allocator in > 2.5 causes a number of directories to be created in the same group before > moving on to the next group, so you have nice batches of sequential reads. I think you're close to the truth there, but out-of-order inode table access would only introduce one seek per inode table block, and the cache should take care of the rest. Martin's numbers suggest the cache isn't caching at all. Martin, does iostat show enormously more reads for the Htree case? Regards, Daniel ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Bug 417] New: htree much slower than regular ext3 2003-02-28 4:12 ` Daniel Phillips @ 2003-02-27 21:33 ` Martin J. Bligh 0 siblings, 0 replies; 39+ messages in thread From: Martin J. Bligh @ 2003-02-27 21:33 UTC (permalink / raw) To: Daniel Phillips, Andreas Dilger, bwindle-kbt Cc: linux-kernel, ext2-devel, Theodore Ts'o, chrisl >> > 11 ms sounds like two seeks for each returned dirent, which sounds >> > like a bug. >> >> I think you are pretty dead on there. The difference is that with >> unindexed entries, the directory entry and the inode are in the same >> order, while with indexed directories they are essentially in random >> order to each other. That means that each directory lookup causes a >> seek to get the directory inode data instead of doing allocation order >> (which is sequential reads on disk). >> >> In the past both would have been slow equally, but the orlov allocator in >> 2.5 causes a number of directories to be created in the same group before >> moving on to the next group, so you have nice batches of sequential >> reads. > > I think you're close to the truth there, but out-of-order inode table > access would only introduce one seek per inode table block, and the > cache should take care of the rest. Martin's numbers suggest the cache > isn't caching at all. > > Martin, does iostat show enormously more reads for the Htree case? Wasn't me ... I just forward the bug data from bugzilla ;-) Filer in this case was ... bwindle-kbt@fint.org cc'ed. M. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Ext2-devel] Re: [Bug 417] New: htree much slower than regular ext3 2003-02-27 21:00 ` Andreas Dilger 2003-02-28 4:12 ` Daniel Phillips @ 2003-03-13 21:04 ` Stephen C. Tweedie 1 sibling, 0 replies; 39+ messages in thread From: Stephen C. Tweedie @ 2003-03-13 21:04 UTC (permalink / raw) To: Andreas Dilger Cc: Daniel Phillips, Martin J. Bligh, linux-kernel, ext2-devel, Theodore Ts'o, chrisl, Stephen Tweedie Hi, On Thu, 2003-02-27 at 21:00, Andreas Dilger wrote: > I've got a patch which should help here, although it was originally written > to speed up the "create" case instead of the "lookup" case. In the lookup > case, it will do a pre-read of a number of inode table blocks, since the cost > of doing a 64kB read and doing a 4kB read is basically the same - the cost > of the seek. No it's not --- you're evicting 16 times as much other potentially-useful data from the cache for each lookup. You'll improve the "du" or "ls -l" case by prefetching, but you may well slow down the overall system performance when you're just doing random accesses (eg. managing large spools.) It would be interesting to think about how we can spot the cases where the prefetch is likely to be beneficial, for example by observing "stat"s coming in in strict hash order. --Stephen ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Bug 417] New: htree much slower than regular ext3 2003-02-27 17:31 [Bug 417] New: htree much slower than regular ext3 Martin J. Bligh 2003-02-28 2:55 ` Daniel Phillips @ 2003-03-07 15:46 ` Alex Tomas 2003-03-08 17:38 ` Daniel Phillips 2003-03-11 21:57 ` Bill Davidsen [not found] ` <20030307214833.00a37e35.akpm@digeo.com> [not found] ` <20030309184755.ACC80FCA8C@mx12.arcor-online.net> 3 siblings, 2 replies; 39+ messages in thread From: Alex Tomas @ 2003-03-07 15:46 UTC (permalink / raw) To: Martin J. Bligh Cc: linux-kernel, ext2-devel, Theodore Ts'o, Daniel Phillips, Andrew Morton, Alex Tomas Hi! The problem is that getdents(2) returns inodes out of order and du causes many head seeks. I tried to solve the problem with the patch included below. The idea of the patch is pretty simple: just try to sort dentries by inode number in readdir(). It works because inodes sit at fixed places in ext2/ext3. Please look at the results I just got: real user sys ext3+htree: 7m22.236s 0m0.292s 0m2.204s ext3-htree: 3m6.539s 0m0.481s 0m2.278s ext3+sd-05: 3m45.453s 0m0.493s 0m2.265s ext3+sd-10: 3m35.770s 0m0.514s 0m2.076s ext3+sd-15: 3m24.235s 0m0.558s 0m2.075s ext3+sd-20: 3m6.802s 0m0.486s 0m2.214s ext3+sd-25: 2m55.062s 0m0.490s 0m2.105s ext3+sd-30: 2m39.658s 0m0.492s 0m2.138s 'sd-N' stands for sortdir; N blocks will be prefetched before sorting. Of course, sortdir isn't production-ready and I haven't even tried to test it hard; it's just to probe the idea. After applying the patch, use 'sortdir' or 'sortdir=N' as a mount option. with best regards, Alex >>>>> Martin J Bligh (MJB) writes: MJB> Problem Description: I created a directory ("test") with 32000 MJB> (ok, 31998) directories in it, and put a file called 'foo' in MJB> each of them (for i in `ls`; do cd $i && touch bar && cd .. ; MJB> done). Then I took that ext3 partion, umounted it, did a MJB> 'tune2fs -O dir_index', then 'fsck -fD', and remounted. I then MJB> did a 'time du -hs' on the test directory, and here are the MJB> results.
MJB> ext3+htree: bwindle@razor:/giant/inodes$ time du -hs 126M . MJB> real 7m21.756s user 0m2.021s sys 0m22.190s MJB> I then unmounted, tune2fs -O ^dir_index, e2fsck -fD /dev/hdb1, MJB> remounted, and did another du -hs on the test directory. It took MJB> 1 minute, 48 seconds. MJB> bwindle@razor:/giant/test$ time du -hs 126M . MJB> real 1m48.760s user 0m1.986s sys 0m21.563s MJB> I thought htree was supposed to speed up access with large MJB> numbers of directories? diff -uNr linux/fs/ext3/dir.c edited/fs/ext3/dir.c --- linux/fs/ext3/dir.c Thu Mar 6 14:57:50 2003 +++ edited/fs/ext3/dir.c Fri Mar 7 18:35:32 2003 @@ -33,8 +33,12 @@ static int ext3_readdir(struct file *, void *, filldir_t); static int ext3_dx_readdir(struct file * filp, void * dirent, filldir_t filldir); +static int ext3_readdir_sort (struct file * filp, + void * dirent, filldir_t filldir); static int ext3_release_dir (struct inode * inode, struct file * filp); +int ext3_htree_store_dirent_sorted(struct file *dir_file, + struct ext3_dir_entry_2 *dirent); struct file_operations ext3_dir_operations = { .read = generic_read_dir, @@ -104,9 +108,17 @@ sb = inode->i_sb; if (is_dx(inode)) { + if (test_opt(inode->i_sb, SORTDIR)) { + err = ext3_readdir_sort(filp, dirent, filldir); + unlock_kernel(); + return err; + } + err = ext3_dx_readdir(filp, dirent, filldir); - if (err != ERR_BAD_DX_DIR) + if (err != ERR_BAD_DX_DIR) { + unlock_kernel(); return err; + } /* * We don't set the inode dirty flag since it's not * critical that it get flushed back to the disk. 
@@ -249,6 +261,8 @@ __u32 inode; __u8 name_len; __u8 file_type; + loff_t offs; + struct list_head list; char name[0]; }; @@ -311,6 +325,9 @@ p->curr_hash = pos2maj_hash(pos); p->curr_minor_hash = pos2min_hash(pos); p->next_hash = 0; + INIT_LIST_HEAD(&p->list); + p->list_nr = 0; + return p; } @@ -493,10 +510,193 @@ static int ext3_release_dir (struct inode * inode, struct file * filp) { - if (is_dx(inode) && filp->private_data) + if (is_dx(inode) && filp->private_data) ext3_htree_free_dir_info(filp->private_data); return 0; } +int ext3_htree_fill_pool(struct file *dir_file, struct ext3_dir_entry_2 *dirent) +{ + struct rb_node **p, *parent = NULL; + struct fname * fname, *new_fn, *prev_fname = NULL; + struct dir_private_info *info; + int len; + + info = (struct dir_private_info *) dir_file->private_data; + p = &info->root.rb_node; + + /* Create and allocate the fname structure */ + len = sizeof(struct fname) + dirent->name_len + 1; + new_fn = kmalloc(len, GFP_KERNEL); + if (!new_fn) + return -ENOMEM; + memset(new_fn, 0, len); + new_fn->hash = new_fn->inode = le32_to_cpu(dirent->inode); + new_fn->name_len = dirent->name_len; + new_fn->file_type = dirent->file_type; + memcpy(new_fn->name, dirent->name, dirent->name_len); + new_fn->name[dirent->name_len] = 0; + new_fn->offs = dir_file->f_pos; + + while (*p) { + parent = *p; + fname = rb_entry(parent, struct fname, rb_hash); + + if (new_fn->hash < fname->hash) + p = &(*p)->rb_left; + else { + prev_fname = fname; + p = &(*p)->rb_right; + } + } + + if (prev_fname) + list_add(&new_fn->list, &prev_fname->list); + else + list_add(&new_fn->list, &info->list); + + rb_link_node(&new_fn->rb_hash, parent, p); + rb_insert_color(&new_fn->rb_hash, &info->root); + info->list_nr++; + + return 0; +} + +static int ext3_fill_from_sort_pool (struct file * filp, void * dirent, filldir_t filldir) +{ + struct super_block * sb = filp->f_dentry->d_inode->i_sb; + struct dir_private_info *info = filp->private_data; + struct list_head * cur; + 
struct fname *fname; + int error; + + list_for_each(cur, &info->list) { + fname = list_entry(cur, struct fname, list); + + error = filldir(dirent, fname->name, + fname->name_len, fname->offs, + fname->inode, + get_dtype(sb, fname->file_type)); + if (error) + return 0; + + list_del(&fname->list); + info->list_nr--; + } + + if (info->list_nr == 0) + free_rb_tree_fname(&info->root); + + return 0; +} + +static int ext3_readdir_sort (struct file * filp, + void * dirent, filldir_t filldir) +{ + unsigned long offset, blk; + int i, stored, error = 0, blocks = 0, err; + struct buffer_head * bh; + struct ext3_dir_entry_2 * de; + struct inode *inode = filp->f_dentry->d_inode; + struct super_block * sb = inode->i_sb; + struct dir_private_info *info = filp->private_data; + + if (!info) { + info = create_dir_info(filp->f_pos); + if (!info) + return -ENOMEM; + filp->private_data = info; + } + + if (info->list_nr > 0) { + ext3_fill_from_sort_pool(filp, dirent, filldir); + return 0; + } + + stored = 0; + bh = NULL; + offset = filp->f_pos & (sb->s_blocksize - 1); + + while (!error && (blocks < EXT3_SB(sb)->sortdir_prefetch) && + (filp->f_pos < inode->i_size)) { + blk = (filp->f_pos) >> EXT3_BLOCK_SIZE_BITS(sb); + bh = ext3_bread (0, inode, blk, 0, &err); + if (!bh) { + ext3_error (sb, "ext3_readdir", + "directory #%lu contains a hole at offset %lu", + inode->i_ino, (unsigned long)filp->f_pos); + filp->f_pos += sb->s_blocksize - offset; + continue; + } + +revalidate: + /* If the dir block has changed since the last call to + * readdir(2), then we might be pointing to an invalid + * dirent right now. Scan from the start of the block + * to make sure. */ + if (filp->f_version != inode->i_version) { + for (i = 0; i < sb->s_blocksize && i < offset; ) { + de = (struct ext3_dir_entry_2 *) + (bh->b_data + i); + /* It's too expensive to do a full + * dirent test each time round this + * loop, but we do have to test at + * least that it is non-zero. 
A + * failure will be detected in the + * dirent test below. */ + if (le16_to_cpu(de->rec_len) < + EXT3_DIR_REC_LEN(1)) + break; + i += le16_to_cpu(de->rec_len); + } + offset = i; + filp->f_pos = (filp->f_pos & ~(sb->s_blocksize - 1)) + | offset; + filp->f_version = inode->i_version; + } + + while (!error && filp->f_pos < inode->i_size + && offset < sb->s_blocksize) { + de = (struct ext3_dir_entry_2 *) (bh->b_data + offset); + if (!ext3_check_dir_entry ("ext3_readdir", inode, de, + bh, offset)) { + /* On error, skip the f_pos to the + * next block. */ + filp->f_pos = (filp->f_pos | + (sb->s_blocksize - 1)) + 1; + brelse (bh); + return stored; + } + offset += le16_to_cpu(de->rec_len); + if (le32_to_cpu(de->inode)) { + /* We might block in the next section + * if the data destination is + * currently swapped out. So, use a + * version stamp to detect whether or + * not the directory has been modified + * during the copy operation. + */ + unsigned long version = filp->f_version; + + error = ext3_htree_fill_pool(filp, de); + if (error) + break; + if (version != filp->f_version) + goto revalidate; + stored ++; + } + filp->f_pos += le16_to_cpu(de->rec_len); + } + offset = 0; + brelse (bh); + blocks++; + } + + ext3_fill_from_sort_pool(filp, dirent, filldir); + + UPDATE_ATIME(inode); + return 0; +} + #endif diff -uNr linux/fs/ext3/super.c edited/fs/ext3/super.c --- linux/fs/ext3/super.c Mon Feb 24 17:47:45 2003 +++ edited/fs/ext3/super.c Fri Mar 7 17:47:27 2003 @@ -584,6 +584,19 @@ continue; if ((value = strchr (this_char, '=')) != NULL) *value++ = 0; + if (!strcmp (this_char, "sortdir")) { + if (value && *value) { + int blocks = simple_strtoul(value, &value, 0); + printk("EXT3: has value: %d\n", blocks); + if (blocks > 0) + sbi->sortdir_prefetch = blocks; + } + if (sbi->sortdir_prefetch == 0) + sbi->sortdir_prefetch = 10; /* default value */ + set_opt(sbi->s_mount_opt, SORTDIR); + printk("EXT3-fs: sortdir (%d blocks prefetch)\n", + sbi->sortdir_prefetch); + } else #ifdef 
CONFIG_EXT3_FS_XATTR if (!strcmp (this_char, "user_xattr")) set_opt (sbi->s_mount_opt, XATTR_USER); diff -uNr linux/include/linux/ext3_fs.h edited/include/linux/ext3_fs.h --- linux/include/linux/ext3_fs.h Mon Feb 24 17:47:33 2003 +++ edited/include/linux/ext3_fs.h Fri Mar 7 18:32:53 2003 @@ -330,6 +330,7 @@ #define EXT3_MOUNT_NO_UID32 0x2000 /* Disable 32-bit UIDs */ #define EXT3_MOUNT_XATTR_USER 0x4000 /* Extended user attributes */ #define EXT3_MOUNT_POSIX_ACL 0x8000 /* POSIX Access Control Lists */ +#define EXT3_MOUNT_SORTDIR 0x10000 /* sort indexed dirs in readdir() */ /* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */ #ifndef _LINUX_EXT2_FS_H @@ -656,6 +657,8 @@ __u32 curr_hash; __u32 curr_minor_hash; __u32 next_hash; + struct list_head list; + int list_nr; }; /* diff -uNr linux/include/linux/ext3_fs_sb.h edited/include/linux/ext3_fs_sb.h --- linux/include/linux/ext3_fs_sb.h Mon Nov 11 06:28:30 2002 +++ edited/include/linux/ext3_fs_sb.h Fri Mar 7 17:30:46 2003 @@ -63,6 +63,7 @@ struct timer_list turn_ro_timer; /* For turning read-only (crash simulation) */ wait_queue_head_t ro_wait_queue; /* For people waiting for the fs to go read-only */ #endif + int sortdir_prefetch; /* how many blocks to read to fill pool --AT */ }; #endif /* _LINUX_EXT3_FS_SB */ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Bug 417] New: htree much slower than regular ext3 2003-03-07 15:46 ` Alex Tomas @ 2003-03-08 17:38 ` Daniel Phillips 2003-03-07 23:27 ` Theodore Ts'o 2003-03-09 7:08 ` Alex Tomas 2003-03-11 21:57 ` Bill Davidsen 1 sibling, 2 replies; 39+ messages in thread From: Daniel Phillips @ 2003-03-08 17:38 UTC (permalink / raw) To: Alex Tomas, Martin J. Bligh Cc: linux-kernel, ext2-devel, Theodore Ts'o, Andrew Morton, Alex Tomas On Fri 07 Mar 03 16:46, Alex Tomas wrote: > Hi! > > The problem is that getdents(2) returns inodes out of order and > du causes many head seeks. I tried to solve the problem by patch > I included in. The idea of the patch is pretty simple: just try > to sort dentries by inode number in readdir(). It works because > inodes sits at fixed places in ext2/ext3. Please, look at results > I just got: > > real user sys > > ext3+htree: 7m22.236s 0m0.292s 0m2.204s > ext3-htree: 3m6.539s 0m0.481s 0m2.278s > [...] > ext3+sd-30: 2m39.658s 0m0.492s 0m2.138s Nice. I posted a similar demonstration some time ago (could it *really* be two years?) but I sorted the dirents in user space, which meant it was just a demonstration, not a practical solution. The problem I see with your approach is that the traversal is no longer in hash order, so a leaf split in the middle of a directory traversal could result in a lot of duplicate dirents. I'm not sure there's a way around that. Another approach is to have the inode number correspond approximately to the hash value. That's not as hard as it sounds, it simply means that the goal for ext3_new_inode should be influenced by the hash value. (But watch out for interaction with the new Orlov allocator.) It also depends on some a priori estimate of the size of a directory so that increasing hash values can be distributed more or less uniformly and monotonically across some corresponding range of inode numbers. 
This isn't too hard either, just round up the current size of the directory to a power of two and use that as the size estimate. The size estimate would change from time to time over the life of the directory, but there are only log N different sizes and that's roughly how heavy the load on the inode cache would be during a directory traversal. Finally, a nice property of this approach is that it stays stable over many creates and deletes. We've apparently got a simpler problem though: inode numbers aren't allocated densely enough in the inode table blocks. This is the only thing I can think of that could cause the amount of thrashing we're seeing. Spreading inode number allocations out all over a volume is ok, but sparse allocation within each table block is not ok because it multiplies the pressure on the cache. We want a 30,000 entry directory to touch 1,000 - 2,000 inode table blocks, not the worst-case 30,000. As above, fixing this comes down to tweaking the ext3_new_inode allocation goal. Regards, Daniel ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Bug 417] New: htree much slower than regular ext3 2003-03-08 17:38 ` Daniel Phillips @ 2003-03-07 23:27 ` Theodore Ts'o 2003-03-09 19:26 ` Alex Tomas 2003-03-09 7:08 ` Alex Tomas 1 sibling, 1 reply; 39+ messages in thread From: Theodore Ts'o @ 2003-03-07 23:27 UTC (permalink / raw) To: Daniel Phillips Cc: Alex Tomas, Martin J. Bligh, linux-kernel, ext2-devel, Andrew Morton On Sat, Mar 08, 2003 at 06:38:50PM +0100, Daniel Phillips wrote: > > The problem is that getdents(2) returns inodes out of order and > > du causes many head seeks. I tried to solve the problem by patch > > I included in. The idea of the patch is pretty simple: just try > > to sort dentries by inode number in readdir(). Yup, that's the problem and the fix; the only issue is the one which Daniel pointed out: > The problem I see with your approach is that the traversal is no > longer in hash order, so a leaf split in the middle of a directory > traversal could result in a lot of duplicate dirents. I'm not sure > there's a way around that. The fix in userspace would be for du and find (the two programs most likely to care about this sort of thing) to pull into memory a large chunk of the directory at a time, sort the entries by inode, and then stat them in inode order. The only other way around it would be to maintain an entirely separate B-tree on disk just for the benefit of readdir(), kept in inode sort order. Readdir() could then traverse that tree in inode sort order, avoiding the performance penalty. (The B-tree has to be on disk, since for large directories we can't afford to keep it in memory.) > We've apparently got a simpler problem though: inode numbers aren't > allocated densely enough in the inode table blocks. This is the > only thing I can think of that could cause the amount of thrashing > we're seeing. 
Spreading inode number allocations out all over a > volume is ok, but sparse allocation within each table block is not > ok because it multiplies the pressure on the cache. We want a > 30,000 entry directory to touch 1,000 - 2,000 inode table blocks, > not the worst-case 30,000. As above, fixing this comes down to > tweaking the ext3_new_inode allocation goal. Nope. That's not it. Inode numbers are allocated sequentially in the inode table blocks. You can't get any more dense than that. Create a bunch of inodes in a directory like this:

mke2fs -j -F -b 1024 -g 1024 -N 1200000 test.img
mount -t ext3 -o loop test.img /mnt
pushd /mnt
mkdir test test2 test3
cd test
time seq -f "%06.0f" 1 100 | xargs touch
cd ..
cd test2
time seq -f "%06.0f" 1 10000 | xargs touch
cd ..
cd test3
time seq -f "%06.0f" 1 100000 | xargs touch
cd ..
popd

Then look at the result using "ls -li". You will see that the inode numbers are all sequentially allocated when you look at the inodes in creation order (which in this particular case is filename sort order). The problem is that when you look at them in hash sort order, the inode numbers are scattered all over the place. > Another approach is to have the inode number correspond approximately to the > hash value. That's not as hard as it sounds, it simply means that the goal > for ext3_new_inode should be influenced by the hash value. (But watch out > for interaction with the new Orlov allocator.) I'm not at all convinced that splattering the inodes all over the volume is in fact a good thing, because it means splattering the block allocation for files within the same directory all over the volume. Arguably, read access to files in a directory is much more common than the "du" or "find" case. So if we need to make a trade-off in one direction or another, I'd much rather go with striving to maximize block locality than with making a readdir+stat run go fast. 
This is especially true since there *is* a user space workaround that would speed up things for du and find. - Ted ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Bug 417] New: htree much slower than regular ext3 2003-03-07 23:27 ` Theodore Ts'o @ 2003-03-09 19:26 ` Alex Tomas 0 siblings, 0 replies; 39+ messages in thread From: Alex Tomas @ 2003-03-09 19:26 UTC (permalink / raw) To: Theodore Ts'o Cc: Daniel Phillips, Alex Tomas, Martin J. Bligh, linux-kernel, ext2-devel, Andrew Morton >>>>> Theodore Ts'o (TT) writes: TT> The fix in userspace would be for du and find (the two programs TT> most likely to care about this sort of thing) to pull into memory TT> a large chunks of the directory at a time, sort them by inode, TT> and then stat them in inode order. In fact, this solution only solves the problem for filesystems with fixed inode placement; for example, it won't work for reiserfs. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Bug 417] New: htree much slower than regular ext3 2003-03-08 17:38 ` Daniel Phillips 2003-03-07 23:27 ` Theodore Ts'o @ 2003-03-09 7:08 ` Alex Tomas 2003-03-10 17:58 ` Daniel Phillips 2003-03-10 21:25 ` Theodore Ts'o 1 sibling, 2 replies; 39+ messages in thread From: Alex Tomas @ 2003-03-09 7:08 UTC (permalink / raw) To: Daniel Phillips Cc: Alex Tomas, Martin J. Bligh, linux-kernel, ext2-devel, Theodore Ts'o, Andrew Morton >>>>> Daniel Phillips (DP) writes: DP> On Fri 07 Mar 03 16:46, Alex Tomas wrote: DP> The problem I see with your approach is that the traversal is no DP> longer in hash order, so a leaf split in the middle of a DP> directory traversal could result in a lot of duplicate dirents. DP> I'm not sure there's a way around that.

1) As far as I understand, duplicates are possible even in classic ext2 w/o sortdir/index. See the diagram:

Process 1                            Process 2

getdents(2) returns
  dentry1 (file1 -> Inode1)
  dentry2 (file2 -> Inode2)

context switch -->
                                     unlink(file1), empty dentry1
                                     creat(file3), Inode3, use dentry1
                                     creat(file1), Inode1, use dentry3

context switch -->

getdents(2) returns
  dentry3 (file1 -> Inode1)

Am I right?

2) Why not use hash order for traversal, the way ext3_dx_readdir() does? Upon reading several dentries within some hash range, readdir() would sort them in inode order and return them to the user.

with best regards, Alex ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Bug 417] New: htree much slower than regular ext3 2003-03-09 7:08 ` Alex Tomas @ 2003-03-10 17:58 ` Daniel Phillips 2003-03-10 21:25 ` Theodore Ts'o 1 sibling, 0 replies; 39+ messages in thread From: Daniel Phillips @ 2003-03-10 17:58 UTC (permalink / raw) To: Alex Tomas Cc: Alex Tomas, Martin J. Bligh, linux-kernel, ext2-devel, Theodore Ts'o, Andrew Morton On Sun 09 Mar 03 08:08, Alex Tomas wrote:

> >>>>> Daniel Phillips (DP) writes:
>
> DP> On Fri 07 Mar 03 16:46, Alex Tomas wrote:
> DP> The problem I see with your approach is that the traversal is no
> DP> longer in hash order, so a leaf split in the middle of a
> DP> directory traversal could result in a lot of duplicate dirents.
> DP> I'm not sure there's a way around that.
>
> 1) As far as I understand, duplicates are possible even in classic ext2
> w/o sortdir/index. See the diagram:
>
> Process 1                          Process 2
>
> getdents(2) returns
>   dentry1 (file1 -> Inode1)
>   dentry2 (file2 -> Inode2)
>
> context switch -->
>                                    unlink(file1), empty dentry1
>                                    creat(file3), Inode3, use dentry1
>                                    creat(file1), Inode1, use dentry3
>
> context switch -->
>
> getdents(2) returns
>   dentry3 (file1 -> Inode1)
>
> Am I right?
>
> 2) Why do not use hash order for traversal like ext3_dx_readdir() does?
> Upon reading several dentries within some hash set readdir() sorts them
> in inode order and returns to an user.
>
> with best regards, Alex

You're right, but you still don't win the stuffed poodle. I put forth the same argument at last year's kernel workshop and was overruled on the grounds that, let me see:

 - Duplicates as a result of unlinks are one thing, duplicates as a result of creates are another, worse thing.

 - Creating one duplicate as a result of one unlink is one thing, creating lots of duplicates with one operation is another, worse thing.

 - The rules are written to allow this particular duplicate behavior in UFS (design inherited by Ext2/3) because at the time, UFS was the only game in town. 
The phrase "broken by design" comes to mind. Ted and Stephen would no doubt be happy to elaborate. Anyway, there's another reason we need to return results in hash order, and that is telldir/seekdir, which expects a stable enumeration of a directory with no duplicates, even with concurrent operations going on, and even if the server is rebooted in the middle of a directory traversal. Not just broken by design, but smashed to pieces by design. And we still have to make it work, for some definition of "work". ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Bug 417] New: htree much slower than regular ext3 2003-03-09 7:08 ` Alex Tomas 2003-03-10 17:58 ` Daniel Phillips @ 2003-03-10 21:25 ` Theodore Ts'o 1 sibling, 0 replies; 39+ messages in thread From: Theodore Ts'o @ 2003-03-10 21:25 UTC (permalink / raw) To: Alex Tomas Cc: Daniel Phillips, Martin J. Bligh, linux-kernel, ext2-devel, Andrew Morton On Sun, Mar 09, 2003 at 10:08:34AM +0300, Alex Tomas wrote:

> 1) As far as I understand, duplicates are possible even in classic ext2
> w/o sortdir/index. See the diagram:
>
> Process 1                          Process 2
>
> getdents(2) returns
>   dentry1 (file1 -> Inode1)
>   dentry2 (file2 -> Inode2)
>
> context switch -->
>                                    unlink(file1), empty dentry1
>                                    creat(file3), Inode3, use dentry1
>                                    creat(file1), Inode1, use dentry3
>
> context switch -->
>
> getdents(2) returns
>   dentry3 (file1 -> Inode1)
>
> Am I right?

Yup. POSIX allows this. If you're in the middle of readdir() when a file (or files) is created or deleted, then it is undefined how readdir() will treat the filename(s) involved. However, you're *not* allowed to duplicate or omit filenames that were not touched by a file creation or deletion.

> 2) Why do not use hash order for traversal like ext3_dx_readdir() does?
> Upon reading several dentries within some hash set readdir() sorts them
> in inode order and returns to an user.

Yeah, glibc could do this, although it would make telldir/seekdir more than a little complicated. (There's an awful lot of pain and agony caused to filesystem developers by the existence of telldir/seekdir, which inherently assumes a flat directory structure, and so makes it hard to do tricks like this.) At some level this is the same solution as "do this in userspace", except instead of needing to convince the maintainers of du and find, we'd need to convince Ulrich. It is *not* clear that it would necessarily be easier to get this fix into glibc. 
In fact, since du and find probably don't use seekdir/telldir, it's probably easier to get the change into du and find. - Ted ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Bug 417] New: htree much slower than regular ext3 2003-03-07 15:46 ` Alex Tomas 2003-03-08 17:38 ` Daniel Phillips @ 2003-03-11 21:57 ` Bill Davidsen 1 sibling, 0 replies; 39+ messages in thread From: Bill Davidsen @ 2003-03-11 21:57 UTC (permalink / raw) To: Alex Tomas; +Cc: linux-kernel, ext2-devel On 7 Mar 2003, Alex Tomas wrote: > The problem is that getdents(2) returns inodes out of order and > du causes many head seeks. I tried to solve the problem by patch > I included in. The idea of the patch is pretty simple: just try > to sort dentries by inode number in readdir(). It works because > inodes sits at fixed places in ext2/ext3. Please, look at results > I just got: Any change in the order of dentries returned is likely to run into problems when seeking in a directory. Given that readdir() is usually followed by either zero or many stat()s, perhaps when the first stat() comes along you could pre-read the inodes in optimal order and cache them. However you tune the size of your sort, it should work exactly as well as a pre-read of the same size. This seems consistent with readahead and AS disk I/O. -- bill davidsen <davidsen@tmr.com> CTO, TMR Associates, Inc Doing interesting things with little computers since 1979. ^ permalink raw reply [flat|nested] 39+ messages in thread
[parent not found: <20030307214833.00a37e35.akpm@digeo.com>]
[parent not found: <20030308010424.Z1373@schatzie.adilger.int>]
* Re: [Ext2-devel] Re: [Bug 417] New: htree much slower than regular ext3 [not found] ` <20030308010424.Z1373@schatzie.adilger.int> @ 2003-03-09 22:54 ` Daniel Phillips 2003-03-08 23:19 ` Andrew Morton 0 siblings, 1 reply; 39+ messages in thread From: Daniel Phillips @ 2003-03-09 22:54 UTC (permalink / raw) To: Andreas Dilger, Andrew Morton Cc: tytso, bzzz, mbligh, linux-kernel, ext2-devel On Sat 08 Mar 03 09:04, Andreas Dilger wrote: > I was testing this in UML-over-loop in 2.4, and the difference in speed > for doing file creates vs. directory creates is dramatic. For file > creates I was running 3s per 10000 files, and for directory creates I > was running 12s per 10000 files. And on a 10K scsi disk I'm running 35s per 10,000 directories, which is way, way slower than it ought to be. There are two analysis tools we're hurting for badly here:

 - We need to see the physical allocation maps for directories, preferably in a running kernel. I think the best way to do this is a little map-dumper hooked into ext3/dir.c and exported through /proc.

 - We need block-access traces in a nicer form than printks (or nobody will ever use them). IOW, we need LTT or something very much like it.

> Depending on the size of the journal vs. how many block/inode bitmaps and > directory blocks are dirtied, you will likely wrap the journal before you > return to the first block group, so you might write 20kB * 32000 for the > directory creates instead of 8kB for the file creates. You also have a > lot of seeking to each block group to write out the directory data, instead > of nearly sequential IO for the inode create case. Yes, I think that's exactly what's happening. There are some questions remaining, such as why doesn't it happen to akpm. Another question: why does it happen to the directory creates, where the only thing being accessed randomly is the directory itself - the inode table is supposedly being allocated/dirtied sequentially. 
Regards, Daniel ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Ext2-devel] Re: [Bug 417] New: htree much slower than regular ext3 2003-03-09 22:54 ` [Ext2-devel] " Daniel Phillips @ 2003-03-08 23:19 ` Andrew Morton 0 siblings, 0 replies; 39+ messages in thread From: Andrew Morton @ 2003-03-08 23:19 UTC (permalink / raw) To: Daniel Phillips; +Cc: adilger, tytso, bzzz, mbligh, linux-kernel, ext2-devel Daniel Phillips <phillips@arcor.de> wrote: > > Yes, I think that's exactly what's happening. There are some questions > remaining, such as why doesn't it happen to akpm. Oh but it does. The 30,000 mkdirs test takes 76 seconds on 1k blocks, 16 seconds on 4k blocks. With 1k blocks, the journal is 1/4 the size, and there are 4x as many blockgroups. So not only does the fs have to checkpoint 4x as often, it has to seek all over the disk to do it. If you create a 1k blocksize fs with a 100MB journal, the test takes just nine seconds. But note that after these nine seconds, we were left with 50MB of dirty buffercache. When pdflush later comes along to write that out it takes >30 seconds, because of the additional fragmentation from the tiny blockgroups. So all we've done is to pipeline the pain. I suspect the Orlov allocator isn't doing the right thing here - a full dumpe2fs of the disk shows that the directories are all over the place. Nobody has really looked at fine-tuning Orlov. btw, an `rm -rf' of the dir which holds 30,000 dirs takes 50 seconds system time on a 2.7GHz CPU. What's up with that? 
c01945bc str2hashbuf                 9790   61.1875
c0187064 ext3_htree_store_dirent    10472   28.7692
c0154920 __getblk                   11319  202.1250
c018d11c dx_probe                   15934   24.2896
c018d4d0 ext3_htree_fill_tree       15984   33.8644
c0188d3c ext3_get_branch            18238   81.4196
c0136388 find_get_page              18617  202.3587
c015477c bh_lru_install             20493   94.8750
c0194290 TEA_transform              21463  173.0887
c01536b0 __find_get_block_slow      24069   62.6797
c0154620 __brelse                   36139  752.8958
c01893dc ext3_get_block_handle      43815   49.7898
c0154854 __find_get_block           71292  349.4706

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Bug 417] New: htree much slower than regular ext3 [not found] ` <20030307214833.00a37e35.akpm@digeo.com> [not found] ` <20030308010424.Z1373@schatzie.adilger.int> @ 2003-03-09 23:10 ` Daniel Phillips 1 sibling, 0 replies; 39+ messages in thread From: Daniel Phillips @ 2003-03-09 23:10 UTC (permalink / raw) To: Andrew Morton; +Cc: tytso, bzzz, mbligh, linux-kernel, ext2-devel On Sat 08 Mar 03 06:48, Andrew Morton wrote: > Daniel Phillips <phillips@arcor.de> wrote: > > OK, according to akpm, data=journal is broken at the moment. Changing > > the host filesystem to data=ordered allowed me to complete the test. > > > > Interesting results. Creating the initial 30,000 directories ought to > > take just a few seconds with the index, but instead it takes 5 minutes > > and 9 seconds, while consuming only 5 seconds of cpu time. > > It takes 14 seconds here against a real disk, and 9 seconds against loop. The best I've seen is 77 seconds to make the directories, on a fast scsi disk with 4K blocksize. That's a 6 1/2 times difference; what could account for it? > The difference: I used 4k blocksize. 1k blocksize should not be that bad: > something is broken. > > Suggest you test against a real disk for now. Using loop may skew things. OK, I tested 1K and 4K blocksize on scsi, ide and ide-hosted loopback; the results are summarized here:

Make 30,000 directories:

        scsi 10K     ide 5400     loopback
1K      1m44.243s    4m59.757s    4m8.362s
4K      1m16.659s    4m49.827s    1m43.399s

Make one file in each directory:

        scsi 10K     ide 5400     loopback
1K      2m50.483s    7m28.657s    4m8.793s
4K      2m34.461s    5m45.615s    3m0.603s

Stat the tree (du):

        scsi 10K     ide 5400     loopback
1K      0m43.245s    1m52.230s    0m1.343s
4K      0m1.038s     0m2.056s     0m1.016s

The first thing to notice: Alex Tomas's original complaint about Htree and stat doesn't manifest at all here. Mind you, I did not repeat his exact sequence of tune2fs's etc, so it's possible there could be a utils bug or something. I'll check into that later. 
Meanwhile, these stats show other breakage that needs to be investigated. The directory creation time is several times higher than it should be, and the same for the file creates. With the index, the two sets of operations should take about the same time. Is this a journal problem? My journal is about 8 meg. I can imagine the journal wrapping during the file creation, because of touching all the existing inode table blocks, but this should not happen during the directory creation. The stats also show a huge slowdown for 1K blocksize on non-loopback, as you noticed earlier. Hmm. First: why on the raw device and not on loopback? Second: why does the device affect this at all? This operation is r/o. Hmm, except for atimes. I reran the scsi tests with noatime+nodiratime and the stat times for 1K and 4K became nearly equal, about .75 seconds. So there we have it: several tens of seconds are spent journalling dirty inode table blocks due to atime, but only when blocksize is 1K. Big fat clue.

Stat the tree (du), noatime+nodiratime:

        scsi 10K
1K      0m0.780s
4K      0m0.751s

The cpu numbers are in the detailed results below, and they show that Htree is doing its job: all the cpu numbers are tiny. We've kicked around the idea that Htree is causing inode table thrashing, but I think that's the wrong tree to bark up. Journal effects are looking more and more like the real problem, possibly in combination with IO scheduling. The fact that Htree dirties the inode table out of order is just exposing some other underlying problem. A quick look at slabinfo before and after unmounting the test volume shows that all the inodes we expect to be cached, actually are:

cat /proc/slabinfo | grep ext3_inode
ext3_inode_cache   121469 121473    448 13497 13497    1 :  120   60

sudo umount /mnt; cat /proc/slabinfo | grep ext3_inode
ext3_inode_cache    61506  61632    448  6836  6848    1 :  120   60

So there were about 60,000 inodes cached for that volume, exactly as expected. 
So much for the theory that we're thrashing the inode table cache. What's actually happening, I strongly suspect, is a lot of dirty blocks are getting pushed through the journal a little prematurely, without allowing any opportunity for coalescing. At create time, it must be the directory blocks, not the inode table blocks, that cause the extra writeout, since inodes are being allocated (mostly) sequentially. By the way, the script is a perfect trigger for the scheduler interactivity bug, leaving me mouseless for 4 seconds at a time. I guess I'll apply Linus's patch and see if that goes away. One other thing I noticed is that, when the script tries to umount immediately after the stat, I get "umount: /mnt: device is busy", however, after a few seconds it works fine. This strikes me as a bug.

======================
1K blocksize, 10K rpm scsi
======================

Make directories...

real    1m44.243s
user    0m0.142s
sys     0m3.465s

Make one file in each directory...

real    2m50.483s
user    0m23.617s
sys     0m59.465s

Stat the tree...
30M     /mnt/test

real    0m43.245s
user    0m0.245s
sys     0m0.915s

======================
1K blocksize, 5400 rpm ide
======================

Make directories...

real    4m59.757s
user    0m0.138s
sys     0m5.207s

Make one file in each directory...

real    7m28.657s
user    0m24.351s
sys     0m58.941s

Stat the tree...
30M     /mnt/test

real    1m52.230s
user    0m0.261s
sys     0m1.013s

===================
1K blocksize, loopback
===================

Make directories...

real    4m8.362s
user    0m0.144s
sys     0m7.448s

Make one file in each directory...

real    4m8.793s
user    0m25.262s
sys     1m20.376s

Stat the tree...
30M     /mnt/test

real    0m1.343s
user    0m0.265s
sys     0m0.720s

======================
4K blocksize, 10K rpm scsi
======================

Make directories...

real    1m16.659s
user    0m0.152s
sys     0m4.106s

Make one file in each directory...

real    2m34.461s
user    0m23.555s
sys     0m58.533s

Stat the tree...
118M    /mnt/test

real    0m1.038s
user    0m0.242s
sys     0m0.664s

======================
4K blocksize, 5400 rpm ide
======================

Make directories...

real    4m49.827s
user    0m0.153s
sys     0m16.666s

Make one file in each directory...

real    5m45.615s
user    0m23.836s
sys     1m6.908s

Stat the tree...
118M    /mnt/test

real    0m2.056s
user    0m0.432s
sys     0m0.622s

===================
4K blocksize, loopback
===================

Make directories...

real    1m43.399s
user    0m0.149s
sys     0m10.806s

Make one file in each directory...

real    3m0.603s
user    0m25.568s
sys     1m34.851s

Stat the tree...
118M    /mnt/test

real    0m1.016s
user    0m0.318s
sys     0m0.588s

========
The script
========

#volume=test.img
#volume=/dev/hda5
volume=/dev/sdb5
mke2fs -j -F -b 4096 -g 1024 $volume
mount -t ext3 -o loop $volume /mnt
cd /mnt
mkdir test
cd test
echo
echo Make directories...
time seq -f "%06.0f" 1 30000 | xargs mkdir
echo
echo Make one file in each directory...
time for i in `ls`; do cd $i && touch foo && cd .. ; done
echo
echo Stat the tree...
time du -hs /mnt/test
echo
umount /mnt

^ permalink raw reply [flat|nested] 39+ messages in thread
[parent not found: <20030309184755.ACC80FCA8C@mx12.arcor-online.net>]
[parent not found: <m3u1ecl5h8.fsf@lexa.home.net>]
* [RFC] Improved inode number allocation for HTree [not found] ` <m3u1ecl5h8.fsf@lexa.home.net> @ 2003-03-10 20:45 ` Daniel Phillips [not found] ` <3E6D1D25.5000004@namesys.com> 2003-03-10 20:48 ` [RFC] Improved inode number allocation for HTree Daniel Phillips 1 sibling, 1 reply; 39+ messages in thread From: Daniel Phillips @ 2003-03-10 20:45 UTC (permalink / raw) Cc: Theodore Ts'o, Andreas Dilger, Christopher Li, Alex Tomas Summary of the problem: HTree's random relationship between directory entry order and inode order has been observed to cause increased cache pressure, affecting performance under certain conditions. Most of this pressure is due to inode table usage, which is normally several times larger than the directory itself. Even when the inode table usage is much less than available cache, it may still exceed the size of the journal file (8 MB by default). Though the journal only becomes involved when blocks are modified, unfortunately, because of atime updates, this includes all directory operations. We could suggest to users that they should disable atime updating if they care about performance, but we ought to be able to do better than that. To reduce this cache pressure, what we need to do is make the directory entry order correspond more closely to the inode number order. It doesn't have to correspond perfectly in order to show substantial improvement, as Tomas demonstrated with his recent getdents sorting patch. Analysis: To understand why sorting relatively small sections of the directory entries reduces the cache pressure, we can think about the lifetime of an inode table block over the course of a directory traversal. This traversal lifetime extends from when it is first accessed (always a write as noted above) to when it is last accessed, or the directory traversal finishes. 
Each inode table block contains multiple inodes, so if those inodes are accessed randomly, the traversal lifetime of every inode table block will tend to be from nearly the beginning of the traversal to nearly the end, and so the load on the cache will be nearly the same as the number of inode table blocks. If this load is greater than available cache, various blocks will have to be written out and reloaded later for further processing. We can call the period between loading and evicting an inode table block its "cache period". The number of times a block must be written/reloaded depends not only on the number of inode table blocks covered by the directory and the availability of cache memory, but on the probable number of inodes within the block that will be written during each of the block's cache periods. By sorting the directory chunkwise, the probable number of inodes processed in one cache period increases, reducing the number of cache periods, in turn reducing disk seeking and transfer overhead. To apply the above model to the journal, just substitute "journal period" for "cache period", where journal period is defined as the time from when a dirty block first enters the journal to when it is written out to disk. We can see that local sorting of directory chunks has hardly any effect on the traversal lifetime of a block; it only reduces the number of cache periods. The size of the chunk of directory that we sort locally presumably stays constant, so the probable number of inodes on each inode table block processed in each cache period approaches one as the directory size increases. At some point, there is no longer any benefit from sorting, and we arrive at that point at rather realistic directory sizes. We can always do the job perfectly by sorting the whole directory. This would require changes in every getdents-using application that cares about performance, to either make the getdents window as big as the directory or to do the sort in user space. 
A spinoff benefit of doing the full sort is that it becomes trivial to detect and eliminate duplicate entries. Higher memory load is a drawback, as is the work required to update user space programs. There will also be noticeable latency introduced by the sort, which didn't appear before. If sorting is to be done in user space then we impose the requirement on all filesystems - even those that do not use inode numbers internally - that they supply an inode key for sorting, and that the supplied sorting key be meaningful in terms of cache locality. Proposed Solution: What if we could maintain the inodes in the inode table in the same order as the directory entries, permanently, so that no sorting is required? If we could do this, the traversal lifetime of each inode table block would be the same as the number of inodes it contains, and the cache load due to the inode table would be exactly one block. That is, the directory traversal would process each inode table block completely before moving on to the next one. Though the current Ext3 filesystem structure does not support this directly, we can approximate the effect quite closely within the existing framework. Due to Ted's recent work, we are already traversing the directory in hash order, as opposed to physical allocation order, because of the particular approach adopted to address the telldir/seekdir problem. This is relatively inexpensive. So now, what we would like to do is make the inode number of each directory entry increase monotonically, corresponding to monotonically increasing hash values. We also want to allocate those inode numbers densely in the inode table, to minimize seeking. If we know beforehand how many entries the directory will have, this is easy: when we create an entry, we multiply its hash value by the total number of entries divided by the total hash range (2**31), add in a base target, and use that as the goal to allocate a new inode. 
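[The goal computation just described reduces to a couple of lines. This is a hedged sketch with invented names, not the actual ext3_new_inode interface; it assumes hashes are uniform in [0, 2**31) and that the size estimate has already been chosen.]

```c
#include <stdint.h>

/* Map a 31-bit name hash to an inode-number goal within a region of
 * the inode table: goal = base + (hash / 2^31) * size_estimate.
 * Sketch only; the real ext3_new_inode goal logic would differ. */
static uint32_t inode_goal(uint32_t base, uint32_t hash,
                           uint32_t size_estimate)
{
    /* 64-bit intermediate avoids overflow for any 31-bit hash */
    return base + (uint32_t)(((uint64_t)hash * size_estimate) >> 31);
}
```

For a directory estimated at 32768 entries, a hash in the middle of the range (2**30) would aim at inode base + 16384, so monotonically increasing hashes map to monotonically increasing inode goals.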
Due to the randomness of the hash value, we will sometimes be forced to allocate an inode number out of order, but hopefully there won't be too many of those, and they won't be very far out of order. We may also end up with some gaps in the inode table. In fact, we may intentionally leave some slack in order to reduce the number of out-of-order allocations. These problems are not serious. The big problem is that we don't know the number of directory entries ahead of time. To get around this, we can use the current size of the directory, rounded up to a power of two, as the size estimate. Each time the directory increases to a new binary size we establish a new region in the inode table, which will map to the to-be-created half of directory entries, roughly in order. Now, consider what happens to a group of directory entries created when the directory was so small that it mapped to a single inode table block. After a series of leaf splits, these former neighbors will end up far apart in the hash order, but their inode numbers will still be ordered monotonically over a traversal. The traversal lifetime of that group of entries will be nearly the entire directory traversal, but there is only one such block. When the directory became larger, some newly created entries were mapped to two new inode table blocks, the traversal lifetimes of which are roughly half the entire traversal. There are only two such blocks, and their lifetimes do not overlap, so these two blocks only generate one block worth of cache load. So it goes, with each new set of entries of size 2**i generating only one additional block of cache load. The final cache load is O(log2(N)), a pleasingly small number. A very nice property of this inode allocation strategy is that the ordering will tend to be reinforced as a directory ages, as opposed to slowly decaying towards randomness as happens with linear directories. 
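[The log2(N) cache-load claim can be checked with a toy calculation: each power-of-two generation of entries contributes roughly one inode table block of steady cache load. Illustrative model only, not kernel code.]

```c
/* Count the power-of-two generations in an N-entry directory; under
 * the doubling scheme above, each generation adds about one inode
 * table block of steady cache load during a traversal.  The result
 * is floor(log2) of the original entry count.  (Toy model.) */
static unsigned cache_load_blocks(unsigned long n_entries)
{
    unsigned generations = 0;

    while (n_entries > 1) {
        n_entries >>= 1;
        generations++;
    }
    return generations;
}
```

A 30,000-entry directory would thus cost about 14 blocks of steady cache load, versus the worst case of thousands for random inode order.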
So this approach is attractive, especially as it can be implemented in a small number of lines of code. The main outstanding issue I have not addressed is how to arrange things so that the inode allocation patterns of neighbouring directories don't interfere with each other too much. I expect that each time the directory grows to a new power of two size, we need to scan the inode allocation map for a relatively unpopulated region of that size and store the region's base in the directory header. Would we need to store the entire history of allocation bases? Probably not. This aspect requires further consideration. Regards, Daniel ^ permalink raw reply [flat|nested] 39+ messages in thread
[parent not found: <3E6D1D25.5000004@namesys.com>]
[parent not found: <20030311031216.8A31CEFD5F@mx12.arcor-online.net>]
* Re: [RFC] Improved inode number allocation for HTree [not found] ` <20030311031216.8A31CEFD5F@mx12.arcor-online.net> @ 2003-03-11 10:45 ` Hans Reiser 2003-03-11 13:00 ` Helge Hafting 0 siblings, 1 reply; 39+ messages in thread From: Hans Reiser @ 2003-03-11 10:45 UTC (permalink / raw) To: Daniel Phillips Cc: Theodore Ts'o, Andreas Dilger, Christopher Li, Alex Tomas, Linux Kernel Mailing List Let's make noatime the default for VFS. Daniel Phillips wrote: >Hi Hans, > >On Tue 11 Mar 03 00:17, Hans Reiser wrote: > > >>What do you think of creating a new telldir/seekdir which uses filenames >>instead of ints to convey position? >> >> > >What do you do if that file gets deleted in the middle of a traversal? > So what? It still tells you where to restart. Strictly speaking, you would want to allow the filesystem to choose what it wants to return to indicate position. For htree and reiserfs this would be a filename or its hash, for ext2 without htree this would be a byte offset. > >If I were able to design Unix over again, I'd state that if you don't lock a >directory before traversing it then it's your own fault if somebody changes >it under you, and I would have provided an interface to inform you about your >bad luck. Strictly wishful thinking. (There, it feels better now.) > We are designing Linux. You know, Microsoft and Steve Jobs continue to design. We should too. Needless change should be avoided. Failure to change when something is known to be broken leads to being inferior to those who do change. Let's design something that does it right. Or else SCO will be right in saying that Linux is just a Unix ripoff by people who couldn't do anything unless Unix had been written to tell them how to do it right. -- Hans ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC] Improved inode number allocation for HTree 2003-03-11 10:45 ` Hans Reiser @ 2003-03-11 13:00 ` Helge Hafting 2003-03-11 13:41 ` Daniel Phillips 0 siblings, 1 reply; 39+ messages in thread From: Helge Hafting @ 2003-03-11 13:00 UTC (permalink / raw) To: Hans Reiser; +Cc: Daniel Phillips, Linux Kernel Mailing List Hans Reiser wrote: > Let's make noatime the default for VFS. > > Daniel Phillips wrote: [...] >> If I were able to design Unix over again, I'd state that if you don't >> lock a directory before traversing it then it's your own fault if >> somebody changes it under you, and I would have provided an interface >> to inform you about your bad luck. Strictly wishful thinking. >> (There, it feels better now.) I'm happy nobody _can_ lock a directory like that. Think of it - unable to create or delete files while some slow-moving program is traversing the directory? Ouch. Plenty of options for DOS attacks too. And how to do "rm *.bak" if rm locks the dir for traversal? Helge Hafting ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC] Improved inode number allocation for HTree 2003-03-11 13:00 ` Helge Hafting @ 2003-03-11 13:41 ` Daniel Phillips 2003-03-11 17:16 ` Andreas Dilger ` (2 more replies) 0 siblings, 3 replies; 39+ messages in thread From: Daniel Phillips @ 2003-03-11 13:41 UTC (permalink / raw) To: Helge Hafting, Hans Reiser; +Cc: Linux Kernel Mailing List On Tue 11 Mar 03 14:00, Helge Hafting wrote: > Hans Reiser wrote: > > Let's make noatime the default for VFS. > > > > Daniel Phillips wrote: > > [...] > > >> If I were able to design Unix over again, I'd state that if you don't > >> lock a directory before traversing it then it's your own fault if > >> somebody changes it under you, and I would have provided an interface > >> to inform you about your bad luck. Strictly wishful thinking. > >> (There, it feels better now.) > > I'm happy nobody _can_ lock a directory like that. Think of it - unable > to create or delete files while some slow-moving program is traversing > the directory? Ouch. Plenty of options for DOS attacks too. > And how to do "rm *.bak" if rm locks the dir for traversal? <wishful thinking> Now that you mention it, just locking out create and rename during directory traversal would eliminate the pain. Delete is easy enough to handle during traversal. For a B-Tree, coalescing could simply be deferred until the traversal is finished, so reading the directory in physical storage order would be fine. Way, way cleaner than what we have to do now. </wishful thinking> Regards, Daniel ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC] Improved inode number allocation for HTree 2003-03-11 13:41 ` Daniel Phillips @ 2003-03-11 17:16 ` Andreas Dilger 2003-03-11 19:39 ` Helge Hafting 2003-03-11 21:25 ` atomic kernel operations are very tricky to export to user space (was [RFC] Improved inode number allocation for HTree ) Hans Reiser 2 siblings, 0 replies; 39+ messages in thread From: Andreas Dilger @ 2003-03-11 17:16 UTC (permalink / raw) To: Daniel Phillips; +Cc: Helge Hafting, Hans Reiser, Linux Kernel Mailing List On Mar 11, 2003 14:41 +0100, Daniel Phillips wrote: > On Tue 11 Mar 03 14:00, Helge Hafting wrote: > > I'm happy nobody _can_ lock a directory like that. Think of it - unable > > to create or delete files while some slow-moving program is traversing > > the directory? Ouch. Plenty of options for DOS attacks too. > > And how to do "rm *.bak" if rm locks the dir for traversal? > > <wishful thinking> > Now that you mention it, just locking out create and rename during directory > traversal would eliminate the pain. Delete is easy enough to handle during > traversal. For a B-Tree, coalescing could simply be deferred until the > traversal is finished, so reading the directory in physical storage order > would be fine. Way, way cleaner than what we have to do now. > </wishful thinking> The other problem, of course, is that this would essentially mean that a user-space application has a lock that prevents other processes (even the kernel) from doing anything. At best we have to worry about an application going sour, and at worst it is a DOS hole as Helge says. How about something like:

death:
    for each directory
        while readdir(directory, large_buffer)
            if (directory) run death(directory)
        readdir(directory, small_buffer)
    sleep forever

The first getdents walks recursively down to the leaves, and then does a single getdents with enough space to hold a single dirent, locking the directory, but then never progressing. It also does the single-dirent getdents on the way up the tree.
Now, everything is locked out of the entire directory tree. Cheers, Andreas -- Andreas Dilger http://sourceforge.net/projects/ext2resize/ http://www-mddsp.enel.ucalgary.ca/People/adilger/ ^ permalink raw reply [flat|nested] 39+ messages in thread
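Andreas's "death" walker can be modelled in a few lines; the Dir structure and per-directory traversal lock below are invented stand-ins for the hypothetical locking readdir, purely to illustrate how one pending getdents per level would pin the entire tree:

```python
class Dir:
    def __init__(self, name, subdirs=()):
        self.name = name
        self.subdirs = list(subdirs)
        self.locked = False   # traversal lock held by a pending getdents

def death(d, held):
    # The large-buffer getdents pass: descend into every subdirectory...
    for sub in d.subdirs:
        death(sub, held)
    # ...then the never-completed single-dirent getdents on the way back
    # up leaves this directory's traversal lock held forever.
    d.locked = True
    held.append(d)

root = Dir("/", [Dir("a", [Dir("a1")]), Dir("b")])
held = []
death(root, held)
# Every directory in the tree is now locked against create/rename.
assert len(held) == 4 and all(d.locked for d in held)
```

The point of the model is only that the number of locks held grows with the size of the tree, and none of them is ever released.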
* Re: [RFC] Improved inode number allocation for HTree 2003-03-11 13:41 ` Daniel Phillips 2003-03-11 17:16 ` Andreas Dilger @ 2003-03-11 19:39 ` Helge Hafting 2003-03-11 20:19 ` Daniel Phillips 2003-03-11 21:25 ` atomic kernel operations are very tricky to export to user space (was [RFC] Improved inode number allocation for HTree ) Hans Reiser 2 siblings, 1 reply; 39+ messages in thread From: Helge Hafting @ 2003-03-11 19:39 UTC (permalink / raw) To: Daniel Phillips; +Cc: linux-kernel On Tue, Mar 11, 2003 at 02:41:06PM +0100, Daniel Phillips wrote: > On Tue 11 Mar 03 14:00, Helge Hafting wrote: > > I'm happy nobody _can_ lock a directory like that. Think of it - unable > > to create or delete files while some slow-moving program is traversing > > the directory? Ouch. Plenty of options for DOS attacks too. > > And how to do "rm *.bak" if rm locks the dir for traversal? > > <wishful thinking> > Now that you mention it, just locking out create and rename during directory > traversal would eliminate the pain. Delete is easy enough to handle during > traversal. For a B-Tree, coalescing could simply be deferred until the > traversal is finished, so reading the directory in physical storage order > would be fine. Way, way cleaner than what we have to do now. > </wishful thinking> Ok, so "rm" works. Then you have things like "mv *.c /usr/src" to worry about. Lock for traversal, get stuck unable to work on the files. A "dream" solution for this might involve something like: the directory is traversed in some well-defined order (defined by the fs); removing a file (or moving it out of the directory) leaves an "empty" directory entry that isn't reclaimed until no more traversals are in progress in that directory; new files can be created, and will be created further out than any traversal in progress, so nothing will be missed.
This approach doesn't lock anything out, but I guess someone evil could still cause trouble by keeping a traversal going forever, creating one dummy file and deleting one whenever it makes progress. The directory would get big, filled up with placeholders, until some ulimit kicks in. Helge Hafting ^ permalink raw reply [flat|nested] 39+ messages in thread
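Helge's "dream" scheme can be made concrete with a toy directory model; all names and mechanics below are invented for illustration (tombstones standing in for the "empty" entries, a reader count standing in for traversals in progress):

```python
class Directory:
    def __init__(self):
        self.entries = []     # [name, live] pairs in fs-defined order
        self.readers = 0      # traversals currently in progress

    def begin_traversal(self):
        self.readers += 1

    def end_traversal(self):
        self.readers -= 1
        if self.readers == 0:
            # no cursor can point at a placeholder any more: reclaim
            self.entries = [e for e in self.entries if e[1]]

    def create(self, name):
        # appended past every cursor, so no traversal misses it
        self.entries.append([name, True])

    def delete(self, name):
        for e in self.entries:
            if e[0] == name:
                e[1] = False  # tombstone; slot kept for active cursors
        if self.readers == 0:
            self.entries = [e for e in self.entries if e[1]]

d = Directory()
for n in ("a", "b", "c"):
    d.create(n)
d.begin_traversal()
d.delete("b")
assert [e[0] for e in d.entries] == ["a", "b", "c"]   # placeholder kept
d.end_traversal()
assert [e[0] for e in d.entries] == ["a", "c"]        # now reclaimed
```

This also exhibits the attack Helge describes: as long as readers stays nonzero, delete only ever adds tombstones, so the directory grows without bound until some limit intervenes.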
* Re: [RFC] Improved inode number allocation for HTree 2003-03-11 19:39 ` Helge Hafting @ 2003-03-11 20:19 ` Daniel Phillips 0 siblings, 0 replies; 39+ messages in thread From: Daniel Phillips @ 2003-03-11 20:19 UTC (permalink / raw) To: Helge Hafting; +Cc: linux-kernel On Tue 11 Mar 03 20:39, Helge Hafting wrote: > On Tue, Mar 11, 2003 at 02:41:06PM +0100, Daniel Phillips wrote: > > <wishful thinking> > > Now that you mention it, just locking out create and rename during > > directory traversal would eliminate the pain. Delete is easy enough to > > handle during traversal. For a B-Tree, coalescing could simply be > > deferred until the traversal is finished, so reading the directory in > > physical storage order would be fine. Way, way cleaner than what we have > > to do now. > > </wishful thinking> > > Ok, so "rm" works. Then you have things like "mv *.c /usr/src" to worry > about. That's equivalent to ls, the shell expands it before doing any moving. You can construct something more interesting with ls | xargs <something nasty> into the same directory. Since the user is trying to shoot their foot off, let the lock be recursive, and let them do that. > ...someone evil still > could cause trouble by keeping a traversal going forever, creating one > dummy file and deleting one whenever it makes progress. The directory > would get big, filled up with placeholders until some ulimit kicks in. > > Helge Hafting It's not clear that's any more evil than things they can do already, eg, seq 1 1000000 | xargs -l1 cp -a /usr Welllll, this is all idle speculation anyway. Nobody's going to fix the flaw this week, if it's even possible (I suspect it is). Regards, Daniel ^ permalink raw reply [flat|nested] 39+ messages in thread
* atomic kernel operations are very tricky to export to user space (was [RFC] Improved inode number allocation for HTree ) 2003-03-11 13:41 ` Daniel Phillips 2003-03-11 17:16 ` Andreas Dilger 2003-03-11 19:39 ` Helge Hafting @ 2003-03-11 21:25 ` Hans Reiser 2003-03-11 23:49 ` Jamie Lokier 2 siblings, 1 reply; 39+ messages in thread From: Hans Reiser @ 2003-03-11 21:25 UTC (permalink / raw) To: Daniel Phillips; +Cc: Helge Hafting, Linux Kernel Mailing List Daniel Phillips wrote: >On Tue 11 Mar 03 14:00, Helge Hafting wrote: > > >>Hans Reiser wrote: >> >> >>>Let's make noatime the default for VFS. >>> >>>Daniel Phillips wrote: >>> >>> >>[...] >> >> >> >>>>If I were able to design Unix over again, I'd state that if you don't >>>>lock a directory before traversing it then it's your own fault if >>>>somebody changes it under you, and I would have provided an interface >>>>to inform you about your bad luck. Strictly wishful thinking. >>>>(There, it feels better now.) >>>> >>>> >>I'm happy nobody _can_ lock a directory like that. Think of it - unable >>to create or delete files while some slow-moving program is traversing >>the directory? Ouch. Plenty of options for DOS attacks too. >>And how to do "rm *.bak" if rm locks the dir for traversal? >> >> > ><wishful thinking> >Now that you mention it, just locking out create and rename during directory >traversal would eliminate the pain. Delete is easy enough to handle during >traversal. For a B-Tree, coalescing could simply be deferred until the >traversal is finished, so reading the directory in physical storage order >would be fine. Way, way cleaner than what we have to do now. 
></wishful thinking> > >Regards, > >Daniel You would want a directory listing command that is a single system call, and then that could be made isolated, without risk of userspace failing to unlock. Then you have to worry about very large directories straining things in user space, but you can probably have the user specify the maximum size directory his process has space for, etc. In general, DOS issues are what makes it difficult to introduce transactions into the kernel. We are grappling with that now in reiser4. It seems that the formula is to create atomic plugins, and then it is the system administrator's responsibility to verify that he can live with whatever DOS vulnerabilities it has before he installs it in his kernel (or our responsibility if it is sent to us for inclusion in the main kernel). Allowing arbitrary filesystem operations to be combined into one atomic transaction seems problematic for either user space or the kernel, depending on what you do. In general, allowing user space to lock things means that you trust user space to unlock. This creates all sorts of trust troubles, and if you force the unlock after some timeout, then the user space application becomes vulnerable to DOS from other processes causing it to exceed the timeout. Ideas on this are welcome. -- Hans ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: atomic kernel operations are very tricky to export to user space (was [RFC] Improved inode number allocation for HTree ) 2003-03-11 21:25 ` atomic kernel operations are very tricky to export to user space (was [RFC] Improved inode number allocation for HTree ) Hans Reiser @ 2003-03-11 23:49 ` Jamie Lokier 0 siblings, 0 replies; 39+ messages in thread From: Jamie Lokier @ 2003-03-11 23:49 UTC (permalink / raw) To: Hans Reiser; +Cc: Daniel Phillips, Helge Hafting, Linux Kernel Mailing List Hans Reiser wrote: > Allowing arbitrary filesystem operations to be > combined into one atomic transaction seems problematic for either user > space or the kernel, depending on what you do. > > In general, allowing user space to lock things means that you trust user > space to unlock. This creates all sorts of trust troubles, and if you > force the unlock after some timeout, then the user space application > becomes vulnerable to DOS from other processes causing it to exceed the > timeout. > > Ideas on this are welcome. You can allow user space to begin a transaction, do some operations and end a transaction, possibly returning an "abort" result which means userspace should assume the transaction did not commit any results and/or whatever was read in the transaction was not reliable. On the face of it this leaves userspace susceptible to DOS or indeed fairness/livelock problems. For example if another program is always changing a directory entry, how can you read that whole directory in a transaction? Fairness/livelock problems are hard to avoid with any kinds of lock. Even the kernel's internal locks have these problems in corner cases (for example, remember when gettimeofday()'s clock access had to be converted from using a spinlock to a sequence lock - and that still doesn't _guarantee_ there is no problem in principle, it just reduces the probability in all reasonable scenarios). However, some remedies can be applied to filesystem transactions. 
If an operation would cause some other task's transaction to eventually return an abort code, consider sleeping for a short duration. Randomise that duration. If the other transaction(s) have been aborting repeatedly, consider lengthening the sleep duration and/or specifically waiting for the other transaction to complete, to boost the other task(s)' likelihood of transaction success. Randomise this decision too. If you know something about the type of other transactions (such as it is trying to implement a read-write lock by doing atomic operations on bytes in a file), consider exactly what policy you hope to offer (writer preference? reader preference? something in between?) By which point it has remarkable similarities to the problems of fairness in the task scheduler, and fairness/livelock in locks. -- Jamie ^ permalink raw reply [flat|nested] 39+ messages in thread
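Jamie's remedies amount to randomised, growing backoff between retries of an aborting transaction. A minimal sketch, assuming a hypothetical attempt() callback that returns True on commit and False on abort:

```python
import random
import time

def run_transaction(attempt, max_tries=10, base_delay=0.0001, rng=None):
    """Retry an aborting transaction with randomised, growing backoff."""
    rng = rng or random.Random()
    window = base_delay
    for tries in range(1, max_tries + 1):
        if attempt():
            return tries                      # committed
        # Aborted: sleep a random fraction of the current window, then
        # widen the window, so repeat losers wait longer on average and
        # competing writers decorrelate instead of livelocking.
        time.sleep(rng.random() * window)
        window *= 2
    raise RuntimeError("transaction kept aborting")

aborts = {"left": 2}
def flaky():
    if aborts["left"]:
        aborts["left"] -= 1
        return False
    return True

assert run_transaction(flaky) == 3
```

The randomisation, not the exact growth factor, is what matters: as Jamie notes, it reduces the probability of pathological interleavings rather than eliminating them in principle.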
* [RFC] Improved inode number allocation for HTree [not found] ` <m3u1ecl5h8.fsf@lexa.home.net> 2003-03-10 20:45 ` [RFC] Improved inode number allocation for HTree Daniel Phillips @ 2003-03-10 20:48 ` Daniel Phillips 2003-03-10 21:04 ` John Bradford 1 sibling, 1 reply; 39+ messages in thread From: Daniel Phillips @ 2003-03-10 20:48 UTC (permalink / raw) To: ext2-devel, linux-kernel Cc: Theodore Ts'o, Andreas Dilger, Christopher Li, Alex Tomas Summary of the problem: HTree's random relationship between directory entry order and inode order has been observed to cause increased cache pressure, affecting performance under certain conditions. Most of this pressure is due to inode table usage, which is normally several times larger than the directory itself. Even when the inode table usage is much less than available cache, it may still exceed the size of the journal file (8 MB by default). Though the journal only becomes involved when blocks are modified, unfortunately, because of atime updates, this includes all directory operations. We could suggest to users that they should disable atime updating if they care about performance, but we ought to be able to do better than that. To reduce this cache pressure, what we need to do is make the directory entry order correspond more closely to the inode number order. It doesn't have to correspond perfectly in order to show substantial improvement, as Tomas demonstrated with his recent getdents sorting patch. Analysis: To understand why sorting relatively small sections of the directory entries reduces the cache pressure, we can think about the lifetime of an inode table block over the course of a directory traversal. This traversal lifetime extends from when it is first accessed (always a write as noted above) to when it is last accessed, or the directory traversal finishes. 
Each inode table block contains multiple inodes, so if those inodes are accessed randomly, the traversal lifetime of every inode table block will tend to be from nearly the beginning of the traversal to nearly the end, and so the load on the cache will be nearly the same as the number of inode table blocks. If this load is greater than available cache, various blocks will have to be written out and reloaded later for further processing. We can call the period between loading and evicting an inode table block its "cache period". The number of times a block must be written/reloaded depends not only on the number of inode table blocks covered by the directory and the availability of cache memory, but on the probable number of the inodes within the block that will be written during each of the block's cache periods. By sorting the directory chunkwise, the probable number of inodes processed in one cache period increases, reducing the number of cache periods, in turn reducing disk seeking and transfer overhead. To apply the above model to the journal, just substitute "journal period" for "cache period", where journal period is defined as the time from when a dirty block first enters the journal to when it is written out to disk. We can see that local sorting of directory chunks has hardly any effect on the traversal lifetime of a block; it only reduces the number of cache periods. The size of the chunk of directory that we sort locally presumably stays constant, so the probable number of inodes on each inode table block processed in each cache period approaches one as the directory size increases. At some point, there is no longer any benefit from sorting, and we arrive at that point at rather realistic directory sizes. We can always do the job perfectly by sorting the whole directory. This would require changes in every getdents-using application that cares about performance, to either make the getdents window as big as the directory or to do the sort in user space. 
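The cache-period model above can be checked with a small experiment: push one inode reference per directory entry through an LRU cache of inode-table blocks and count block loads. All parameters here (32 inodes per block, 8 cached blocks, 256-entry sort chunks) are invented for illustration:

```python
from collections import OrderedDict
import random

def block_loads(inode_order, inodes_per_block=32, cache_blocks=8):
    """Count how many times an inode-table block must be (re)loaded."""
    cache, loads = OrderedDict(), 0
    for ino in inode_order:
        blk = ino // inodes_per_block
        if blk in cache:
            cache.move_to_end(blk)            # LRU touch
        else:
            loads += 1
            cache[blk] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)     # evict least-recently-used
    return loads

inodes = list(range(4096))                    # 128 inode-table blocks
rng = random.Random(1)
shuffled = inodes[:]
rng.shuffle(shuffled)
random_loads = block_loads(shuffled)
# Sorting modest chunks (as in the getdents sorting patch) recovers
# some locality, but each chunk still sweeps most of the table:
chunked = [i for c in range(0, 4096, 256)
           for i in sorted(shuffled[c:c + 256])]
chunk_loads = block_loads(chunked)
assert block_loads(inodes) == 128   # perfect order: one load per block
assert chunk_loads < random_loads   # chunk sorting helps...
assert chunk_loads > 128            # ...but is far from perfect order
```

The gap between chunk_loads and 128 is the diminishing return described above: as the directory grows with the chunk size held fixed, each chunk covers ever more blocks and the benefit of local sorting fades.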
A spinoff benefit of doing the full sort is that it becomes trivial to detect and eliminate duplicate entries. Higher memory load is a drawback, as is the work required to update user space programs. There will also be noticeable latency introduced by the sort, which didn't appear before. If sorting is to be done in user space then we impose the requirement on all filesystems - even those that do not use inode numbers internally - that they supply an inode key for sorting, and that the supplied sorting key be meaningful in terms of cache locality. Proposed Solution: What if we could maintain the inodes in the inode table in the same order as the directory entries, permanently, so that no sorting is required? If we could do this, the traversal lifetime of each inode table block would be the same as the number of inodes it contains, and the cache load due to the inode table would be exactly one block. That is, the directory traversal would process each inode table block completely before moving on to the next one. Though the current Ext3 filesystem structure does not support this directly, we can approximate the effect quite closely within the existing framework. Due to Ted's recent work, we are already traversing the directory in hash order, as opposed to physical allocation order, because of the particular approach adopted to address the telldir/seekdir problem. This is relatively inexpensive. So now, what we would like to do is make the inode number of each directory entry increase monotonically, corresponding to monotonically increasing hash values. We also want to allocate those inode numbers densely in the inode table, to minimize seeking. If we know beforehand how many entries the directory will have, this is easy: when we create an entry, we multiply its hash value by the total number of entries divided by the total hash range (2**31), add in a base target, and use that as the goal to allocate a new inode. 
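For comparison, the user-space full sort mentioned above is straightforward to sketch today with Python's os.scandir, which exposes each entry's inode number without a stat; the getdents-window and duplicate-elimination details from the text are omitted:

```python
import os

def inode_ordered(path):
    # Read the whole directory, then sort by inode number, so that a
    # subsequent stat pass touches each inode-table block once instead
    # of revisiting blocks in random hash order.
    entries = list(os.scandir(path))
    entries.sort(key=lambda e: e.inode())
    return [(e.inode(), e.name) for e in entries]
```

This pays the memory and latency costs the text describes: the whole listing must be held and sorted before the first stat can be issued.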
Due to the randomness of the hash value, we will sometimes be forced to allocate an inode number out of order, but hopefully there won't be too many of those, and they won't be very far out of order. We may also end up with some gaps in the inode table. In fact, we may intentionally leave some slack in order to reduce the number of out-of-order allocations. These problems are not serious. The big problem is that we don't know the number of directory entries ahead of time. To get around this, we can use the current size of the directory, rounded up to a power of two, as the size estimate. Each time the directory increases to a new binary size we establish a new region in the inode table, which will map to the to-be-created half of directory entries, roughly in order. Now, consider what happens to a group of directory entries created when the directory was so small that it mapped to a single inode table block. After a series of leaf splits, these former neighbors will end up far apart in the hash order, but their inode numbers will still be ordered monotonically over a traversal. The traversal lifetime of that group of entries will be nearly the entire directory traversal, but there is only one such block. When the directory became larger, some newly created entries were mapped to two new inode table blocks, the traversal lifetimes of which are roughly half the entire traversal. There are only two such blocks, and their lifetimes do not overlap, so these two blocks only generate one block worth of cache load. So it goes, with each new set of entries of size 2**i generating only one additional block of cache load. The final cache load is O(log2(N)), a pleasingly small number. A very nice property of this inode allocation strategy is that the ordering will tend to be reinforced as a directory ages, as opposed to slowly decaying towards randomness as happens with linear directories. 
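The allocation goal described above - the 31-bit hash scaled by the power-of-two size estimate, plus a per-directory base - can be sketched as follows. The function names, the slack factor, and the linear next-free fallback are illustrative, not the ext3 allocator:

```python
def round_up_pow2(n):
    p = 1
    while p < n:
        p *= 2
    return p

def goal_inode(name_hash, entry_count, region_base, slack=2):
    # Scale the 31-bit hash by the directory's size estimate (current
    # entry count rounded up to a power of two, with some slack to cut
    # down out-of-order allocations), then add the region base.
    size_estimate = round_up_pow2(max(entry_count, 1)) * slack
    return region_base + (name_hash * size_estimate >> 31)

def allocate(name_hash, entry_count, region_base, bitmap):
    ino = goal_inode(name_hash, entry_count, region_base)
    while bitmap.get(ino):    # goal taken: spill to the next free slot
        ino += 1
    bitmap[ino] = True
    return ino
```

With this mapping, entries created in increasing hash order receive increasing inode numbers, which is exactly the property a hash-order traversal wants; collisions spill forward, producing the small, nearby out-of-order allocations the text anticipates.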
So this approach is attractive, especially as it can be implemented in a small number of lines of code. The main outstanding issue I have not addressed is how to arrange things so that the inode allocation patterns of neighbouring directories don't interfere with each other too much. I expect that each time the directory grows to a new power of two size, we need to scan the inode allocation map for a relatively unpopulated region of that size and store the region's base in the directory header. Would we need to store the entire history of allocation bases? Probably not. This aspect requires further consideration. Regards, Daniel ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC] Improved inode number allocation for HTree 2003-03-10 20:48 ` [RFC] Improved inode number allocation for HTree Daniel Phillips @ 2003-03-10 21:04 ` John Bradford 2003-03-10 21:28 ` Andreas Schwab 2003-03-10 21:33 ` [RFC] Improved inode number allocation for HTree Daniel Phillips 0 siblings, 2 replies; 39+ messages in thread From: John Bradford @ 2003-03-10 21:04 UTC (permalink / raw) To: Daniel Phillips; +Cc: ext2-devel, linux-kernel, tytso, adilger, chrisl, bzzz > Though the journal only becomes involved when blocks are modified, > unfortunately, because of atime updates, this includes all directory > operations. We could suggest to users that they should disable > atime updating if they care about performance, but we ought to be > able to do better than that. On a separate note, since atime updates are not usually very important anyway, why not have an option to cache atime updates for a long time, or until a write occurs anyway? Holding a large number of atime updates in a write cache is generally not going to be a major issue - the worst case if a partition isn't cleanly unmounted is that some atimes will be wrong. John. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC] Improved inode number allocation for HTree 2003-03-10 21:04 ` John Bradford @ 2003-03-10 21:28 ` Andreas Schwab 2003-03-10 21:50 ` Filesystem write priorities, (Was: Re: [RFC] Improved inode number allocation for HTree) John Bradford 2003-03-10 21:33 ` [RFC] Improved inode number allocation for HTree Daniel Phillips 1 sibling, 1 reply; 39+ messages in thread From: Andreas Schwab @ 2003-03-10 21:28 UTC (permalink / raw) To: John Bradford Cc: Daniel Phillips, ext2-devel, linux-kernel, tytso, adilger, chrisl, bzzz John Bradford <john@grabjohn.com> writes: |> > Though the journal only becomes involved when blocks are modified, |> > unfortunately, because of atime updates, this includes all directory |> > operations. We could suggest to users that they should disable |> > atime updating if they care about performance, but we ought to be |> > able to do better than that. |> |> On a separate note, since atime updates are not usually very important |> anyway, why not have an option to cache atime updates for a long time, |> or until either a write occurs anyway. mount -o noatime Andreas. -- Andreas Schwab, SuSE Labs, schwab@suse.de SuSE Linux AG, Deutschherrnstr. 15-19, D-90429 Nürnberg Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 39+ messages in thread
* Filesystem write priorities, (Was: Re: [RFC] Improved inode number allocation for HTree) 2003-03-10 21:28 ` Andreas Schwab @ 2003-03-10 21:50 ` John Bradford 2003-03-14 21:55 ` [Ext2-devel] " Stephen C. Tweedie 0 siblings, 1 reply; 39+ messages in thread From: John Bradford @ 2003-03-10 21:50 UTC (permalink / raw) To: Andreas Schwab Cc: phillips, ext2-devel, linux-kernel, tytso, adilger, chrisl, bzzz > |> > Though the journal only becomes involved when blocks are modified, > |> > unfortunately, because of atime updates, this includes all directory > |> > operations. We could suggest to users that they should disable > |> > atime updating if they care about performance, but we ought to be > |> > able to do better than that. > |> > |> On a separate note, since atime updates are not usually very important > |> anyway, why not have an option to cache atime updates for a long time, > |> or until either a write occurs anyway. > > mount -o noatime Well, yes, I use noatime by default, but I was thinking that there might be users who wanted to gain most of the performance that using noatime would, who still wanted access times available generally, but who were not bothered about losing them on an unclean shutdown. In fact, taking this one step further, we could assign directories priorities, analogous to nice values for processes, and make the laziness of writes dependent on that. So, for example, on a webserver, you could assign databases a high priority for writes, but your webserver log a low priority. If the box crashed, the data most likely to be lost would be the webserver log, rather than the database. Another benefit would be that you could spin down disks, and not have them spin up just to write data to the webserver log - that could be committed to the disk just once an hour, but a database write could cause it to spin up immediately. John. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Ext2-devel] Filesystem write priorities, (Was: Re: [RFC] Improved inode number allocation for HTree) 2003-03-10 21:50 ` Filesystem write priorities, (Was: Re: [RFC] Improved inode number allocation for HTree) John Bradford @ 2003-03-14 21:55 ` Stephen C. Tweedie 0 siblings, 0 replies; 39+ messages in thread From: Stephen C. Tweedie @ 2003-03-14 21:55 UTC (permalink / raw) To: John Bradford Cc: Andreas Schwab, phillips, ext2-devel, linux-kernel, Theodore Ts'o, Andreas Dilger, chrisl, bzzz, Stephen Tweedie Hi, On Mon, 2003-03-10 at 21:50, John Bradford wrote: > Well, yes, I use noatime by default, but I was thinking that there > might be users who wanted to gain most of the performance that using > noatime would, who still wanted access times available generally, but > who were not bothered about loosing them on an unclean shutdown. I've got new inode-flushing code which somewhat does that already. The code in http://people.redhat.com/sct/patches/ext3-2.4/dev-20030314/40-iflush-sct/ allows us to defer inode writes, primarily to eliminate redundant copy-to-buffer-cache operations in mark_inode_dirty; but it has the side-effect of allowing us to defer atime updates quite safely. I'm currently integrating all the latest bits and pieces of orlov and htree work into that set of patches to provide a stable base, and the next order of business is to get ACL and O_DIRECT diffs in for 2.4 too. But beyond that, I need to get down to some serious performance work with ext3, and the inode flush code is a major part of that. --Stephen ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC] Improved inode number allocation for HTree 2003-03-10 21:04 ` John Bradford 2003-03-10 21:28 ` Andreas Schwab @ 2003-03-10 21:33 ` Daniel Phillips 2003-03-10 21:47 ` [Ext2-devel] " Bryan O'Sullivan 1 sibling, 1 reply; 39+ messages in thread From: Daniel Phillips @ 2003-03-10 21:33 UTC (permalink / raw) To: John Bradford; +Cc: ext2-devel, linux-kernel, tytso, adilger, chrisl, bzzz On Mon 10 Mar 03 22:04, John Bradford wrote: > > Though the journal only becomes involved when blocks are modified, > > unfortunately, because of atime updates, this includes all directory > > operations. We could suggest to users that they should disable > > atime updating if they care about performance, but we ought to be > > able to do better than that. > > On a separate note, since atime updates are not usually very important > anyway, why not have an option to cache atime updates for a long time, > or until either a write occurs anyway. Holding a large number of > atime updates in a write cache is generally not going to be a major > issue - the worst case if a partition isn't cleanly unmounted is that > some atimes will be wrong. It sounds practical. Why stop there? Since Ted is seriously considering making a batch of incompatible extensions to the on-disk format anyway, how about adding an atime table to each block group, four bytes per inode. Even without lazy updating, it would cut down the dirty blocks generated by r/o operations a lot. If actual atime is the latest of the atime table value and the inode atime value[1], then inode write operations won't generate extra traffic. You will only get new traffic when somebody wants the real atime. I'd put this under the category of "things to add to Ted's long list of fun new ideas for Ext3/4". [1] How to handle wrapping is left as an exercise for the interested reader. Regards, Daniel ^ permalink raw reply [flat|nested] 39+ messages in thread
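Daniel's per-group atime table can be modelled in a few lines; the structures are illustrative only. Reads dirty just the compact table, inode writes fold the atime in for free, and the effective atime is the later of the two copies:

```python
class Group:
    """Toy block group: inodes plus a compact 4-bytes-per-inode atime table."""
    def __init__(self, ninodes):
        self.inode_atime = [0] * ninodes   # stored in the inode itself
        self.table_atime = [0] * ninodes   # the per-group atime table

    def read_access(self, ino, now):
        # A read-only access dirties only the small atime-table block,
        # not the inode table block.
        self.table_atime[ino] = now

    def inode_write(self, ino, now):
        # The inode is being written anyway, so updating its atime copy
        # generates no extra traffic.
        self.inode_atime[ino] = now

    def atime(self, ino):
        # Actual atime is the latest of the two values (wrap handling,
        # as the footnote says, is left to the interested reader).
        return max(self.inode_atime[ino], self.table_atime[ino])
```

One atime-table block covers hundreds of inodes, which is why a read-heavy traversal dirties far fewer blocks than updating each inode in place.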
* Re: [Ext2-devel] Re: [RFC] Improved inode number allocation for HTree 2003-03-10 21:33 ` [RFC] Improved inode number allocation for HTree Daniel Phillips @ 2003-03-10 21:47 ` Bryan O'Sullivan 2003-03-10 22:02 ` Matthew Wilcox 0 siblings, 1 reply; 39+ messages in thread From: Bryan O'Sullivan @ 2003-03-10 21:47 UTC (permalink / raw) To: Daniel Phillips Cc: John Bradford, ext2-devel, linux-kernel, tytso, adilger, chrisl, bzzz On Mon, 2003-03-10 at 13:33, Daniel Phillips wrote: > It sounds practical. Why stop there? Why start? Who actually uses atime for anything at all, other than the tiny number of shops that care about moving untouched files to tertiary storage? Surely if you want to heap someone else's plate with work, you should offer a reason why :-) <b ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Ext2-devel] Re: [RFC] Improved inode number allocation for HTree 2003-03-10 21:47 ` [Ext2-devel] " Bryan O'Sullivan @ 2003-03-10 22:02 ` Matthew Wilcox 2003-03-11 8:47 ` Jakob Oestergaard 2003-03-14 21:57 ` Stephen C. Tweedie 0 siblings, 2 replies; 39+ messages in thread From: Matthew Wilcox @ 2003-03-10 22:02 UTC (permalink / raw) To: Bryan O'Sullivan Cc: Daniel Phillips, John Bradford, ext2-devel, linux-kernel, tytso, adilger, chrisl, bzzz On Mon, Mar 10, 2003 at 01:47:14PM -0800, Bryan O'Sullivan wrote: > Why start? Who actually uses atime for anything at all, other than the > tiny number of shops that care about moving untouched files to tertiary > storage? > > Surely if you want to heap someone else's plate with work, you should > offer a reason why :-) "You have new mail" vs "You have mail". -- "It's not Hollywood. War is real, war is primarily not about defeat or victory, it is about death. I've seen thousands and thousands of dead bodies. Do you think I want to have an academic debate on this subject?" -- Robert Fisk ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Ext2-devel] Re: [RFC] Improved inode number allocation for HTree 2003-03-10 22:02 ` Matthew Wilcox @ 2003-03-11 8:47 ` Jakob Oestergaard 2003-03-11 11:27 ` John Bradford 2003-03-14 21:57 ` Stephen C. Tweedie 1 sibling, 1 reply; 39+ messages in thread From: Jakob Oestergaard @ 2003-03-11 8:47 UTC (permalink / raw) To: Matthew Wilcox Cc: Bryan O'Sullivan, Daniel Phillips, John Bradford, ext2-devel, linux-kernel, tytso, adilger, chrisl, bzzz On Mon, Mar 10, 2003 at 10:02:54PM +0000, Matthew Wilcox wrote: > On Mon, Mar 10, 2003 at 01:47:14PM -0800, Bryan O'Sullivan wrote: > > Why start? Who actually uses atime for anything at all, other than the > > tiny number of shops that care about moving untouched files to tertiary > > storage? > > > > Surely if you want to heap someone else's plate with work, you should > > offer a reason why :-) > > "You have new mail" vs "You have mail". And having a clean /tmp using tmpwatch. Everything in my /tmp not accessed for a few weeks gets deleted. I move cruft to /tmp every single day, and I have not cleaned it up for years. It just stays tidy all by itself. Of course one has to remember this when leaving a few database files there before taking a vacation ;) -- ................................................................ : jakob@unthought.net : And I see the elder races, : :.........................: putrid forms of man : : Jakob Østergaard : See him rise and claim the earth, : : OZ9ABN : his downfall is at hand. : :.........................:............{Konkhra}...............: ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Ext2-devel] Re: [RFC] Improved inode number allocation for HTree 2003-03-11 8:47 ` Jakob Oestergaard @ 2003-03-11 11:27 ` John Bradford 0 siblings, 0 replies; 39+ messages in thread From: John Bradford @ 2003-03-11 11:27 UTC (permalink / raw) To: Jakob Oestergaard Cc: willy, bos, phillips, ext2-devel, linux-kernel, tytso, adilger, chrisl, bzzz > > > Why start? Who actually uses atime for anything at all, other than the > > > tiny number of shops that care about moving untouched files to tertiary > > > storage? What about the situation where the primary storage is a device which has a limited number of write-cycles? That's just the sort of application where you might be archiving data to secondary or tertiary storage, and it would be a big advantage to save on writes. If a file is read every ten minutes, we could just update the atime once an hour. That would save five writes an hour. John. ^ permalink raw reply [flat|nested] 39+ messages in thread
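John's once-an-hour batching amounts to a simple staleness check before persisting an atime. A minimal sketch, with invented names and a hard-coded threshold purely for illustration:

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical batching rule: persist a new atime only when it has
 * drifted at least an hour past the value already on disk.  A file
 * read every ten minutes then costs one inode write per hour instead
 * of six -- the "five writes an hour" saved in the example above. */
#define ATIME_BATCH_SECS 3600

static bool atime_write_needed(time_t on_disk_atime, time_t new_atime)
{
    return new_atime - on_disk_atime >= ATIME_BATCH_SECS;
}
```

The worst case, as noted earlier in the thread, is that a crash loses up to an hour of atime precision, which is exactly the trade-off being proposed.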
* Re: [Ext2-devel] Re: [RFC] Improved inode number allocation for HTree 2003-03-10 22:02 ` Matthew Wilcox 2003-03-11 8:47 ` Jakob Oestergaard @ 2003-03-14 21:57 ` Stephen C. Tweedie 2003-03-15 8:39 ` jw schultz 1 sibling, 1 reply; 39+ messages in thread From: Stephen C. Tweedie @ 2003-03-14 21:57 UTC (permalink / raw) To: Matthew Wilcox Cc: Bryan O'Sullivan, Daniel Phillips, John Bradford, ext2-devel, linux-kernel, Theodore Ts'o, Andreas Dilger, chrisl, bzzz, Stephen Tweedie Hi, On Mon, 2003-03-10 at 22:02, Matthew Wilcox wrote: > On Mon, Mar 10, 2003 at 01:47:14PM -0800, Bryan O'Sullivan wrote: > > Why start? Who actually uses atime for anything at all, other than the > > tiny number of shops that care about moving untouched files to tertiary > > storage? > > > > Surely if you want to heap someone else's plate with work, you should > > offer a reason why :-) > > "You have new mail" vs "You have mail". "nodiratime" can still help there. --Stephen ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [Ext2-devel] Re: [RFC] Improved inode number allocation for HTree 2003-03-14 21:57 ` Stephen C. Tweedie @ 2003-03-15 8:39 ` jw schultz 0 siblings, 0 replies; 39+ messages in thread From: jw schultz @ 2003-03-15 8:39 UTC (permalink / raw) To: linux-kernel On Fri, Mar 14, 2003 at 09:57:12PM +0000, Stephen C. Tweedie wrote: > Hi, > > On Mon, 2003-03-10 at 22:02, Matthew Wilcox wrote: > > On Mon, Mar 10, 2003 at 01:47:14PM -0800, Bryan O'Sullivan wrote: > > > Why start? Who actually uses atime for anything at all, other than the > > > tiny number of shops that care about moving untouched files to tertiary > > > storage? > > > > > > Surely if you want to heap someone else's plate with work, you should > > > offer a reason why :-) > > > > "You have new mail" vs "You have mail". > > "nodiratime" can still help there. As may noatime. noatime only prevents the automatic update of atime on read. It doesn't (at least in my experience) prevent utime(2) from updating the atime field. Using noatime works quite well, at least with mutt, which explicitly uses utime(2) to update atime. I cannot be certain, but mutt may actually work better with noatime because then accesses by other tools (*biff & co) won't break the mailer's notion of new mail (mtime > atime). -- ________________________________________________________________ J.W. Schultz Pegasystems Technologies email address: jw@pegasys.ws Remember Cernan and Schmitt ^ permalink raw reply [flat|nested] 39+ messages in thread
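The distinction jw draws — noatime suppresses only the implicit read-side atime update, while an explicit utime(2) call still writes atime — can be demonstrated with a short userspace sketch. The file path and helper name here are arbitrary examples:

```c
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <utime.h>

/* Create a file, set its atime explicitly via utime(2) -- the way mutt
 * does -- and read it back with stat(2).  This explicit update happens
 * even on a filesystem mounted noatime; noatime only skips the
 * automatic atime update performed on read.  Returns the resulting
 * atime, or -1 on error. */
static long set_atime_explicitly(const char *path, time_t when)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fclose(f);

    struct utimbuf times = { .actime = when, .modtime = when };
    if (utime(path, &times) != 0)
        return -1;

    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    return (long)st.st_atime;
}
```

This is also why mutt's "new mail" test (mtime > atime) keeps working under noatime: the mailer controls atime itself rather than relying on the kernel's read-time bookkeeping.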
* Re: [RFC] Improved inode number allocation for HTree [not found] ` <20030311194014$49a6@gated-at.bofh.it> @ 2003-03-11 20:14 ` Pascal Schmidt 0 siblings, 0 replies; 39+ messages in thread From: Pascal Schmidt @ 2003-03-11 20:14 UTC (permalink / raw) To: Helge Hafting; +Cc: linux-kernel On Tue, 11 Mar 2003 20:40:14 +0100, you wrote in linux.kernel: > Ok, so "rm" works. Then you have things like "mv *.c /usr/src" to worry > about. Lock for traversal, get stuck unable to work on the files. In both cases, the shell does the traversal and passes a complete list of files to rm or mv... so rm and mv themselves don't need to do any directory traversal. -- Ciao, Pascal ^ permalink raw reply [flat|nested] 39+ messages in thread
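Pascal's point — that the shell, not rm or mv, performs the traversal and hands the tool a complete argument list — is easy to see with echo. The directory and file names below are examples:

```shell
# Create a scratch directory with two .c files (example names).
dir=$(mktemp -d)
cd "$dir"
touch a.c b.c

# The shell expands *.c before the command runs; echo shows the
# argument list mv would actually receive -- already a complete,
# sorted list of files, with no directory traversal done by mv itself.
echo mv *.c /usr/src
```

Running this prints `mv a.c b.c /usr/src`: by the time mv starts, the glob is gone and only explicit paths remain, so mv never needs to read the directory.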
end of thread, other threads:[~2003-03-15 8:28 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-02-27 17:31 [Bug 417] New: htree much slower than regular ext3 Martin J. Bligh
2003-02-28  2:55 ` Daniel Phillips
2003-02-27 21:00 ` Andreas Dilger
2003-02-28  4:12 ` Daniel Phillips
2003-02-27 21:33 ` Martin J. Bligh
2003-03-13 21:04 ` [Ext2-devel] " Stephen C. Tweedie
2003-03-07 15:46 ` Alex Tomas
2003-03-08 17:38 ` Daniel Phillips
2003-03-07 23:27 ` Theodore Ts'o
2003-03-09 19:26 ` Alex Tomas
2003-03-09  7:08 ` Alex Tomas
2003-03-10 17:58 ` Daniel Phillips
2003-03-10 21:25 ` Theodore Ts'o
2003-03-11 21:57 ` Bill Davidsen
[not found] ` <20030307214833.00a37e35.akpm@digeo.com>
[not found] ` <20030308010424.Z1373@schatzie.adilger.int>
2003-03-09 22:54 ` [Ext2-devel] " Daniel Phillips
2003-03-08 23:19 ` Andrew Morton
2003-03-09 23:10 ` Daniel Phillips
[not found] ` <20030309184755.ACC80FCA8C@mx12.arcor-online.net>
[not found] ` <m3u1ecl5h8.fsf@lexa.home.net>
2003-03-10 20:45 ` [RFC] Improved inode number allocation for HTree Daniel Phillips
[not found] ` <3E6D1D25.5000004@namesys.com>
[not found] ` <20030311031216.8A31CEFD5F@mx12.arcor-online.net>
2003-03-11 10:45 ` Hans Reiser
2003-03-11 13:00 ` Helge Hafting
2003-03-11 13:41 ` Daniel Phillips
2003-03-11 17:16 ` Andreas Dilger
2003-03-11 19:39 ` Helge Hafting
2003-03-11 20:19 ` Daniel Phillips
2003-03-11 21:25 ` atomic kernel operations are very tricky to export to user space (was [RFC] Improved inode number allocation for HTree ) Hans Reiser
2003-03-11 23:49 ` Jamie Lokier
2003-03-10 20:48 ` [RFC] Improved inode number allocation for HTree Daniel Phillips
2003-03-10 21:04 ` John Bradford
2003-03-10 21:28 ` Andreas Schwab
2003-03-10 21:50 ` Filesystem write priorities, (Was: Re: [RFC] Improved inode number allocation for HTree) John Bradford
2003-03-14 21:55 ` [Ext2-devel] " Stephen C. Tweedie
2003-03-10 21:33 ` [RFC] Improved inode number allocation for HTree Daniel Phillips
2003-03-10 21:47 ` [Ext2-devel] " Bryan O'Sullivan
2003-03-10 22:02 ` Matthew Wilcox
2003-03-11  8:47 ` Jakob Oestergaard
2003-03-11 11:27 ` John Bradford
2003-03-14 21:57 ` Stephen C. Tweedie
2003-03-15  8:39 ` jw schultz
[not found] <20030311194014$426e@gated-at.bofh.it>
[not found] ` <20030311194014$1a3c@gated-at.bofh.it>
[not found] ` <20030311194014$78c3@gated-at.bofh.it>
[not found] ` <20030311194014$5811@gated-at.bofh.it>
[not found] ` <20030311194014$49a6@gated-at.bofh.it>
2003-03-11 20:14 ` Pascal Schmidt
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).