* disk throughput
From: Andrew Morton @ 2001-11-05  2:13 UTC
To: lkml, ext2-devel

I've been taking a look at one particular workload - the creation
and use of many smallish files.  ie: the things we do every day
with kernel trees.

There are a number of things which cause Linux to perform quite badly
with this workload.  I've fixed most of them and the speedups are
quite dramatic.  The changes include:

- Reorganise the sync_inode() paths so we don't do
  read-a-block, write-a-block, read-a-block, ...

- Asynchronous preread of an ext2 inode's block when we
  create the inode, so:

  a) the reads will cluster into larger requests and
  b) during inode writeout we don't keep stalling on
     reads, preventing write clustering.

- Move ext2's bitmap loading code outside lock_super(), so other
  threads can still get in and write to the fs while the reader
  sleeps, thus increasing write request merging.  This benefits
  multithreaded workloads (ie: dbench) and needs more thought.

The above changes are basically a search-and-destroy mission against
the things which are preventing effective writeout request merging.
Once they're in place we also need to:

- Alter the request queue size and elvtune settings.

The time to create 100,000 4k files (10 per directory) has fallen
from 3:09 (3 min 9 sec) down to 0:30.  A six-fold speedup.

All well and good, and still a WIP.  But by far the most dramatic
speedups come from disabling ext2's policy of placing new directories
in a different blockgroup from the parent:

--- linux-2.4.14-pre8/fs/ext2/ialloc.c	Tue Oct  9 21:31:40 2001
+++ linux-akpm/fs/ext2/ialloc.c	Sun Nov  4 17:40:43 2001
@@ -286,7 +286,7 @@ struct inode * ext2_new_inode (const str
 repeat:
 	gdp = NULL;
 	i=0;
-	if (S_ISDIR(mode)) {
+	if (0 && S_ISDIR(mode)) {
 		avefreei = le32_to_cpu(es->s_free_inodes_count) /
 			sb->u.ext2_sb.s_groups_count;
 		/*

I am not yet convinced that this next bit is necessary.

Numbers.  The machine has 768 megs; the disk is IDE with a two-meg
cache.  The workload consists of untarring, tarring, diffing and
removing kernel trees.  The filesystem is 21 gigs and has 176 block
groups.  After each test which wrote data, a `sync' was performed and
included in the timing, under the assumption that all the data will be
written back by kupdate in a few seconds anyway; running `sync' allows
measurement of the cost of that.  The filesystem was unmounted between
each test - all tests are with cold caches.

                                        stock    patched
  untar one kernel tree, sync:          0:31     0:14
  diff two trees:                       3:04     1:12
  tar kernel tree to /dev/null:         0:15     0:03
  remove 2 kernel trees, sync:          0:30     0:10

A significant thing here is the improvement in read performance as
well as writes.  All of the other speedup changes only affect writes.

We are paying an extremely heavy price for placing directories in
different block groups from their parent.  Why do we do this, and
is it worth the cost?
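For readers unfamiliar with ext2's on-disk layout, the block-group
figures in this thread can be sanity-checked with a little arithmetic.
The sketch below is mine, not part of the original mail; it assumes
4 KiB blocks, in which case one bitmap block covers 8 * 4096 blocks and
each block group spans 128 MiB.  Note the quoted 176 groups corresponds
to roughly 22 GiB at that group size.

```python
# Back-of-the-envelope check of ext2 block-group sizing (illustrative;
# assumes 4 KiB blocks -- an assumption, not stated in the mail).

BLOCK_SIZE = 4096                          # bytes per ext2 block (assumed)
BLOCKS_PER_GROUP = 8 * BLOCK_SIZE          # one bitmap block, 8 bits per byte
GROUP_BYTES = BLOCKS_PER_GROUP * BLOCK_SIZE  # bytes covered by one group

def groups_for(fs_bytes):
    """Number of block groups needed to cover a filesystem of fs_bytes."""
    return (fs_bytes + GROUP_BYTES - 1) // GROUP_BYTES

GiB = 1 << 30
print(GROUP_BYTES // (1 << 20))   # 128 (MiB per block group)
print(groups_for(22 * GiB))       # 176 -- matches the "21 gig" fs above
print(groups_for(32 * GiB))       # 256 -- the figure cited later in the thread
```

With 4 KiB blocks the group size is fixed at 128 MiB, so the group
count - and hence how far ext2 can scatter a directory tree - grows
linearly with device size, which is the core of Andrew's complaint.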
* Re: disk throughput
From: Mohammad A. Haque @ 2001-11-05  3:20 UTC
To: Andrew Morton; +Cc: lkml, ext2-devel

On Sunday, November 4, 2001, at 09:13 PM, Andrew Morton wrote:

> The time to create 100,000 4k files (10 per directory) has fallen
> from 3:09 (3 min 9 sec) down to 0:30.  A six-fold speedup.

Nice.

How long before you release a patch?  I have a couple of tasks we
execute at work I'd like to throw at it.

--
=====================================================================
Mohammad A. Haque                    http://www.haque.net/
                                     mhaque@haque.net
"Alcohol and calculus don't mix.     Developer/Project Lead
 Don't drink and derive." --Unknown  http://wm.themes.org/
                                     batmanppc@themes.org
=====================================================================
* Re: disk throughput
From: Andrew Morton @ 2001-11-05  3:31 UTC
To: Mohammad A. Haque; +Cc: lkml, ext2-devel

"Mohammad A. Haque" wrote:
>
> On Sunday, November 4, 2001, at 09:13 PM, Andrew Morton wrote:
>
> > The time to create 100,000 4k files (10 per directory) has fallen
> > from 3:09 (3 min 9 sec) down to 0:30.  A six-fold speedup.
>
> Nice.
>
> How long before you release a patch?  I have a couple of tasks we
> execute at work I'd like to throw at it.

Try the one-liner against ialloc.c.  You'll need to rebuild each
filesystem to remove the inter-file fragmentation which ext2 put in
there, though.

Fortunately I made all the partitions on my laptop the same size, and
there is a spare one.  So I'm busily doing a `cp -a' of each
filesystem onto a newly created one, then mke2fs -j'ing the old one,
then moving on to the next one.  It's boring.

All the other changes make basically no difference once the ialloc.c
change is made.  With that workload.  It's all a matter of utilisation
of device-level readahead.
* Re: [Ext2-devel] disk throughput
From: Mike Fedyk @ 2001-11-05  3:32 UTC
To: Andrew Morton; +Cc: lkml, ext2-devel

On Sun, Nov 04, 2001 at 06:13:19PM -0800, Andrew Morton wrote:
> I've been taking a look at one particular workload - the creation
> and use of many smallish files.  ie: the things we do every day
> with kernel trees.
>
> There are a number of things which cause Linux to perform quite badly
> with this workload.  I've fixed most of them and the speedups are
> quite dramatic.  The changes include:
>
> - Reorganise the sync_inode() paths so we don't do
>   read-a-block, write-a-block, read-a-block, ...

Why can't the elevator do that?  I'm all for better performance, but
if the elevator can do it, then it will work for other filesystems
too.

> - Asynchronous preread of an ext2 inode's block when we
>   create the inode, so:
>
>   a) the reads will cluster into larger requests and
>   b) during inode writeout we don't keep stalling on
>      reads, preventing write clustering.

This may be what is inhibiting the elevator...

> The above changes are basically a search-and-destroy mission against
> the things which are preventing effective writeout request merging.
> Once they're in place we also need to:
>
> - Alter the request queue size and elvtune settings.

What settings are you suggesting?  The 2.4 elevator queue size is an
order of magnitude larger than 2.2...

> The time to create 100,000 4k files (10 per directory) has fallen
> from 3:09 (3 min 9 sec) down to 0:30.  A six-fold speedup.

Nice!

> All well and good, and still a WIP.  But by far the most dramatic
> speedups come from disabling ext2's policy of placing new directories
> in a different blockgroup from the parent:
[snip]
> A significant thing here is the improvement in read performance as
> well as writes.  All of the other speedup changes only affect writes.
>
> We are paying an extremely heavy price for placing directories in
> different block groups from their parent.  Why do we do this, and
> is it worth the cost?

My God!  I'm no kernel hacker, but I would think the first thing you
would want to do is keep similar data (in this case similar because of
proximity in the dir tree) as close as possible to reduce seeking...

Is there any chance that this will go into ext3 too?

Mike
* Re: [Ext2-devel] disk throughput
From: Andrew Morton @ 2001-11-05  3:45 UTC
To: Mike Fedyk; +Cc: lkml, ext2-devel

Mike Fedyk wrote:
>
> On Sun, Nov 04, 2001 at 06:13:19PM -0800, Andrew Morton wrote:
> > - Reorganise the sync_inode() paths so we don't do
> >   read-a-block, write-a-block, read-a-block, ...
>
> Why can't the elevator do that?  I'm all for better performance, but
> if the elevator can do it, then it will work for other filesystems
> too.

If a process is feeding writes into the request layer and then stops
to do a read, we've blown it.  There's nothing for the elevator to
elevate.

> > - Asynchronous preread of an ext2 inode's block when we
> >   create the inode, so:
> >
> >   a) the reads will cluster into larger requests and
> >   b) during inode writeout we don't keep stalling on
> >      reads, preventing write clustering.
>
> This may be what is inhibiting the elevator...

Yes.

> > - Alter the request queue size and elvtune settings.
>
> What settings are you suggesting?  The 2.4 elevator queue size is an
> order of magnitude larger than 2.2...

The default number of requests is 128.  This is in fact quite ample AS
LONG AS the filesystem is feeding decent amounts of reasonably
localised stuff into the request layer, and isn't stopping for reads
all the time.  ext2 and the VFS are not.  But I suspect that with the
ialloc.c change, disk readahead is covering up for it.

The meaning of the parameter to elvtune is a complete mystery, and the
code is uncommented crud (tautology).  So I just used -r20000 -w20000.

This was based on observing the request queue dynamics.  We frequently
fail to merge requests which really should be merged, regardless of
latency.  Bumping the elvtune settings fixed it all.  But once the fs
starts writing data out contiguously it's all academic.

> > The time to create 100,000 4k files (10 per directory) has fallen
> > from 3:09 (3 min 9 sec) down to 0:30.  A six-fold speedup.
>
> Nice!

Well.  I got to choose the benchmark.

> > We are paying an extremely heavy price for placing directories in
> > different block groups from their parent.  Why do we do this, and
> > is it worth the cost?
>
> My God!  I'm no kernel hacker, but I would think the first thing you
> would want to do is keep similar data (in this case similar because
> of proximity in the dir tree) as close as possible to reduce
> seeking...

Sure.  ext2 goes to great lengths to avoid intra-file fragmentation,
and then goes and adds its own inter-file fragmentation.

It's worse on larger partitions, because they have more of the 128-meg
block groups.

> Is there any chance that this will go into ext3 too?

If it goes in ext2, yes.  Depends on what the ext2 gods say - there
may be some deep design issue I'm missing here.
* Re: [Ext2-devel] disk throughput
From: Mike Fedyk @ 2001-11-05  4:39 UTC
To: Andrew Morton; +Cc: lkml, ext2-devel

On Sun, Nov 04, 2001 at 07:45:21PM -0800, Andrew Morton wrote:
> Mike Fedyk wrote:
> > What settings are you suggesting?  The 2.4 elevator queue size is
> > an order of magnitude larger than 2.2...
>
> The default number of requests is 128.  This is in fact quite ample
> AS LONG AS the filesystem is feeding decent amounts of reasonably
> localised stuff into the request layer, and isn't stopping for reads
> all the time.  ext2 and the VFS are not.  But I suspect that with
> the ialloc.c change, disk readahead is covering up for it.

Hmm...

> The meaning of the parameter to elvtune is a complete mystery, and
> the code is uncommented crud (tautology).  So I just used -r20000
> -w20000.

I saw somewhere that Andrea Arcangeli wrote it...  Maybe he can help?

> This was based on observing the request queue dynamics.  We
> frequently fail to merge requests which really should be merged,
> regardless of latency.  Bumping the elvtune settings fixed it all.
> But once the fs starts writing data out contiguously it's all
> academic.

I have had much improved interactive performance with -r 333 -w 1000,
or even -r 100 -w 300...  Setting it down to -r 0 -w 0 caused several
processes (in a -j5 kernel compile) to start waiting for disk...

> > > The time to create 100,000 4k files (10 per directory) has
> > > fallen from 3:09 (3 min 9 sec) down to 0:30.  A six-fold
> > > speedup.
> >
> > Nice!
>
> Well.  I got to choose the benchmark.

Yep, but I'm sure the diffing-two-trees test will get your patch
noticed...  ;)

How do the numbers look for ext2 mounted -o sync?

> > My God!  I'm no kernel hacker, but I would think the first thing
> > you would want to do is keep similar data (in this case similar
> > because of proximity in the dir tree) as close as possible to
> > reduce seeking...
>
> Sure.  ext2 goes to great lengths to avoid intra-file fragmentation,
> and then goes and adds its own inter-file fragmentation.
>
> It's worse on larger partitions, because they have more of the
> 128-meg block groups.

Yep.  Do you think that more (and thus smaller) block groups would
help for the larger partitions?

> > Is there any chance that this will go into ext3 too?
>
> If it goes in ext2, yes.

Great!

> Depends on what the ext2 gods say - there may be some deep design
> issue I'm missing here.

Yes, let's see what they say. :)

Mike
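The elvtune settings traded back and forth above can be applied
directly from a root shell on a 2.4 kernel.  The invocations below are
illustrative only - the device name is an example, and the numbers are
just the ones quoted in this thread:

```shell
# Hedged examples of the elvtune settings discussed above.
# /dev/hda is a placeholder device; requires root and a 2.4 kernel
# with elvtune (shipped in util-linux).

# Show the current read/write latency settings:
elvtune /dev/hda

# Mike's interactivity-oriented settings:
elvtune -r 333 -w 1000 /dev/hda

# Andrew's merge-aggressively settings:
elvtune -r 20000 -w 20000 /dev/hda
```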
* Re: [Ext2-devel] disk throughput
From: Jens Axboe @ 2001-11-05  7:06 UTC
To: Andrew Morton; +Cc: Mike Fedyk, lkml, ext2-devel

On Sun, Nov 04 2001, Andrew Morton wrote:
> The meaning of the parameter to elvtune is a complete mystery, and
> the code is uncommented crud (tautology).  So I just used -r20000
> -w20000.

It's the number of sectors that are allowed to pass a request on the
queue, because of merges or inserts before that particular request.
So you probably want lower than that, and you want READ latency to be
smaller than WRITE latency too.  The default I set is 8192/16384 iirc,
so go lower than this -- -r512 -w1024 or even lower, just to check the
results.

> This was based on observing the request queue dynamics.  We
> frequently fail to merge requests which really should be merged,
> regardless of latency.  Bumping the elvtune settings fixed it all.
> But once the fs starts writing data out contiguously it's all
> academic.

Interesting; the 2.5 design prevents this since it doesn't account
merges as a penalty (like a seek).  I can do something like that for
2.4 too, I'll do a patch for you to test.

--
Jens Axboe
* Re: [Ext2-devel] disk throughput
From: Andrew Morton @ 2001-11-05  7:14 UTC
To: Jens Axboe; +Cc: Mike Fedyk, lkml, ext2-devel

Jens Axboe wrote:
>
> It's the number of sectors that are allowed to pass a request on the
> queue, because of merges or inserts before that particular request.
> So you probably want lower than that, and you want READ latency to
> be smaller than WRITE latency too.  The default I set is 8192/16384
> iirc, so go lower than this -- -r512 -w1024 or even lower, just to
> check the results.

Right, thanks.  With the ialloc.c one-liner I didn't touch elvtune.
Defaults seem fine.

It should be the number of requests which are allowed to pass a
request, not the number of sectors!

Well, you know what I mean.  Make it:

	1 + nr_sectors_in_request / 1000

> Interesting; the 2.5 design prevents this since it doesn't account
> merges as a penalty (like a seek).  I can do something like that for
> 2.4 too, I'll do a patch for you to test.

OK.
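The two accounting policies being argued over can be compared
concretely.  The sketch below is mine, not kernel code: it contrasts
the 2.4 behaviour (a passing request ages a waiting request by its
full sector count) with Andrew's proposed per-request cost of
1 + nr_sectors / 1000.

```python
# Illustrative comparison of the two elevator aging policies discussed
# above (a sketch, not the 2.4 kernel's code): how quickly a waiting
# request's latency budget is consumed as other requests pass it.

def sector_cost(nr_sectors):
    """2.4 behaviour: every sector that passes ages the request by 1."""
    return nr_sectors

def request_cost(nr_sectors):
    """Andrew's proposal: count passing requests, roughly, so a large
    contiguous request is barely more expensive than a small one."""
    return 1 + nr_sectors // 1000

# A stream of requests passing a waiting one: mostly large merged writes.
passing = [8, 128, 1024, 2048, 255]

print(sum(sector_cost(s) for s in passing))    # 3463 -- budget gone fast
print(sum(request_cost(s) for s in passing))   # 8 -- merges barely penalised
```

Under the sector-based rule, a few big merged writes exhaust a
latency budget almost immediately, forcing the premature queue
insertions Andrew observed; the per-request rule charges roughly one
"seek" per passing request, which is what the queue actually costs.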
* Re: [Ext2-devel] disk throughput
From: Jens Axboe @ 2001-11-05  7:26 UTC
To: Andrew Morton; +Cc: Mike Fedyk, lkml, ext2-devel

On Sun, Nov 04 2001, Andrew Morton wrote:
> It should be the number of requests which are allowed to pass a
> request, not the number of sectors!
>
> Well, you know what I mean.  Make it:
>
> 	1 + nr_sectors_in_request / 1000

That has been tried; performance and latency weren't good.  But yes,
that is what we are really looking to account: the number of seeks.
Approximately.

--
Jens Axboe
* Re: [Ext2-devel] disk throughput
From: Mike Fedyk @ 2001-11-05  7:14 UTC
To: Jens Axboe; +Cc: Andrew Morton, lkml, ext2-devel

On Mon, Nov 05, 2001 at 08:06:35AM +0100, Jens Axboe wrote:
> It's the number of sectors that are allowed to pass a request on the
> queue, because of merges or inserts before that particular request.
> So you probably want lower than that, and you want READ latency to
> be smaller than WRITE latency too.  The default I set is 8192/16384
> iirc, so go lower than this -- -r512 -w1024 or even lower, just to
> check the results.

Does the elevator do better with powers of two?

> Interesting; the 2.5 design prevents this since it doesn't account
> merges as a penalty (like a seek).  I can do something like that for
> 2.4 too, I'll do a patch for you to test.

I'd be very interested in this patch.  Can you post it publicly?

Mike
* Re: [Ext2-devel] disk throughput
From: Jens Axboe @ 2001-11-05  7:18 UTC
To: Andrew Morton, lkml, ext2-devel

On Sun, Nov 04 2001, Mike Fedyk wrote:
> Does the elevator do better with powers of two?

No, that doesn't matter.

> I'd be very interested in this patch.  Can you post it publicly?

Just posted :-)

--
Jens Axboe
* Re: [Ext2-devel] disk throughput
From: Jens Axboe @ 2001-11-05  7:18 UTC
To: Andrew Morton; +Cc: Mike Fedyk, lkml, ext2-devel

On Mon, Nov 05 2001, Jens Axboe wrote:
> Interesting; the 2.5 design prevents this since it doesn't account
> merges as a penalty (like a seek).  I can do something like that for
> 2.4 too, I'll do a patch for you to test.

Ok here it is.  It's not as efficient as the 2.5 version since it
proceeds to scan the entire queue for a merge, but it should suffice
for your testing.

--
Jens Axboe

[-- Attachment: elv-merge-account-1 --]

--- drivers/block/ll_rw_blk.c~	Mon Nov  5 08:15:25 2001
+++ drivers/block/ll_rw_blk.c	Mon Nov  5 08:15:51 2001
@@ -696,7 +696,9 @@
 		case ELEVATOR_BACK_MERGE:
 			if (!q->back_merge_fn(q, req, bh, max_segments))
 				break;
+#if 0
 			elevator->elevator_merge_cleanup_fn(q, req, count);
+#endif
 			req->bhtail->b_reqnext = bh;
 			req->bhtail = bh;
 			req->nr_sectors = req->hard_nr_sectors += count;
@@ -708,7 +710,9 @@
 		case ELEVATOR_FRONT_MERGE:
 			if (!q->front_merge_fn(q, req, bh, max_segments))
 				break;
+#if 0
 			elevator->elevator_merge_cleanup_fn(q, req, count);
+#endif
 			bh->b_reqnext = req->bh;
 			req->bh = bh;
 			req->buffer = bh->b_data;
--- drivers/block/elevator.c~	Mon Nov  5 08:13:19 2001
+++ drivers/block/elevator.c	Mon Nov  5 08:18:04 2001
@@ -81,8 +81,9 @@
 		    int max_sectors)
 {
 	struct list_head *entry = &q->queue_head;
-	unsigned int count = bh->b_size >> 9, ret = ELEVATOR_NO_MERGE;
+	unsigned int count = bh->b_size >> 9, ret = ELEVATOR_NO_MERGE, queue;
 
+	queue = 1;
 	while ((entry = entry->prev) != head) {
 		struct request *__rq = blkdev_entry_to_request(entry);
 
@@ -90,13 +91,13 @@
 		 * simply "aging" of requests in queue
 		 */
 		if (__rq->elevator_sequence-- <= 0)
-			break;
+			queue = 0;
 
 		if (__rq->waiting)
 			continue;
 		if (__rq->rq_dev != bh->b_rdev)
 			continue;
-		if (!*req && bh_rq_in_between(bh, __rq, &q->queue_head))
+		if (queue && !*req && bh_rq_in_between(bh, __rq, &q->queue_head))
 			*req = __rq;
 		if (__rq->cmd != rw)
 			continue;
@@ -110,7 +111,6 @@
 			break;
 		} else if (__rq->sector - count == bh->b_rsector) {
 			ret = ELEVATOR_FRONT_MERGE;
-			__rq->elevator_sequence -= count;
 			*req = __rq;
 			break;
 		}
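The effect of the patch above can be illustrated with a toy model
(mine, deliberately simplified - this is not the 2.4 kernel's data
structure): a queued request whose latency budget has run out still
accepts contiguous back-merges, and the merge itself no longer
consumes the budget; only out-of-order insertions are blocked.

```python
# Toy model of the elevator scan modified by the patch above
# (illustrative only; names and structure are simplified, not kernel
# code).  Each queued request is (start_sector, nr_sectors, latency):
# latency is the remaining budget.  Scanning past a request ages it by
# one, but a contiguous merge is free and allowed even at latency <= 0.

def find_back_merge(queue, bh_sector, bh_count):
    """Scan from the queue tail for a request whose end is contiguous
    with the new sectors; merge in place and return its index, else
    return None."""
    for i in range(len(queue) - 1, -1, -1):
        start, count, latency = queue[i]
        queue[i] = (start, count, latency - 1)   # aging: one pass costs 1
        if start + count == bh_sector:           # contiguous: merge is free
            queue[i] = (start, count + bh_count, latency - 1)
            return i
    return None

q = [(0, 8, 5), (100, 8, 0)]        # second request's budget is exhausted
idx = find_back_merge(q, 108, 8)    # still merges into the expired request
print(idx, q[idx])                  # 1 (100, 16, -1)
```

The unpatched behaviour corresponds to `break`-ing out of the scan as
soon as an expired request is found, which is exactly what prevented
the merges Andrew was observing.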
* Re: [Ext2-devel] disk throughput
From: Mike Fedyk @ 2001-11-05  9:14 UTC
To: Jens Axboe; +Cc: Andrew Morton, lkml, ext2-devel

On Mon, Nov 05, 2001 at 08:18:36AM +0100, Jens Axboe wrote:
> Ok here it is.  It's not as efficient as the 2.5 version since it
> proceeds to scan the entire queue for a merge, but it should suffice
> for your testing.

Does the elevator still favor blocks that are on the outside of the
platter over all others?  If so, I think this should be removed in
favor of a timeout, just like the other seek requests...

I've been able to put a swap partition on the outside of my drive and
get utterly abysmal performance, while on another similar system, with
swap on the inside of the drive, performance was much better during a
swap storm...

Mike
* Re: [Ext2-devel] disk throughput
From: Jens Axboe @ 2001-11-05  9:20 UTC
To: Andrew Morton, lkml, ext2-devel

On Mon, Nov 05 2001, Mike Fedyk wrote:
> Does the elevator still favor blocks that are on the outside of the
> platter over all others?  If so, I think this should be removed in
> favor of a timeout, just like the other seek requests...

It doesn't quite favor outside LBAs (lower numbers); it's a
combination of sector and device.  It's hard to do this right in 2.4
because request sectors are not absolute but a combination of
partition + offset.  2.5 will have this fixed: generic_make_request
remaps buffer heads (well, actually bios, but same deal) before
submitting, so the I/O scheduler can be a bit smarter.

> I've been able to put a swap partition on the outside of my drive
> and get utterly abysmal performance, while on another similar
> system, with swap on the inside of the drive, performance was much
> better during a swap storm...

That sounds very odd; swap should be much faster on the outside.

--
Jens Axboe
* Re: [Ext2-devel] disk throughput
From: Albert D. Cahalan @ 2001-11-05  5:54 UTC
To: Mike Fedyk; +Cc: Andrew Morton, lkml, ext2-devel

Mike Fedyk writes:
> On Sun, Nov 04, 2001 at 06:13:19PM -0800, Andrew Morton wrote:
>
>> All well and good, and still a WIP.  But by far the most dramatic
>> speedups come from disabling ext2's policy of placing new
>> directories in a different blockgroup from the parent:
> [snip]
>> A significant thing here is the improvement in read performance as
>> well as writes.  All of the other speedup changes only affect
>> writes.
>>
>> We are paying an extremely heavy price for placing directories in
>> different block groups from their parent.  Why do we do this, and
>> is it worth the cost?
>
> My God!  I'm no kernel hacker, but I would think the first thing
> you would want to do is keep similar data (in this case similar
> because of proximity in the dir tree) as close as possible to
> reduce seeking...

By putting directories far apart, you leave room for regular files
(added at some future time) to be packed close together.

I'm sure your benchmark doesn't fill directories with files by adding
a few files every day over a period of many months.  Think about
projects under your home directory, though.
* Re: [Ext2-devel] disk throughput
From: Andrew Morton @ 2001-11-05  8:04 UTC
To: Albert D. Cahalan; +Cc: Mike Fedyk, lkml, ext2-devel

"Albert D. Cahalan" wrote:
>
> By putting directories far apart, you leave room for regular files
> (added at some future time) to be packed close together.

OK, that's one possible reason.  Not sure I buy it, though.  If the
files are created a few days after their parent directory, then the
chance of their data or metadata being within device readahead scope
of any of the parent dir's blocks seems small?

> I'm sure your benchmark doesn't fill directories with files by
> adding a few files every day over a period of many months.  Think
> about projects under your home directory, though.

OK.  I'm not really aware of a suitable benchmark for that.  postmark
only works in a single directory.  mongo hardly touches the disk at
all (with 3/4 of a gig of RAM, anyway).

My original make-100,000-4k-files test creates the files in a tree -
each node has 10 leaves, for a total of 11,110 directories and
100,000 files.  It originally did it in-order, so:

	mkdir(00)
	mkdir(00/00)
	mkdir(00/00/00)
	mkdir(00/00/00/00)
	creat(00/00/00/00/00)
	creat(00/00/00/00/01)
	...
	mkdir(00/00/00/01)
	etc.

So I changed it to create the 11,110 directories first, and then to go
back and create the 100,000 files.  This ensures that the files' data
are not contiguous with their parent directory.

With the ialloc.c change, plus the other changes I mentioned, the time
to create all these directories and files and then run /bin/sync fell
from 1:53 to 0:28.  Fourfold.

But the time to read the data back was remarkable.  This is purely due
to the ialloc.c change.  The command was

	time (find . -type f | xargs cat > /dev/null)

	stock 2.4.14-pre8:       6:32
	with ialloc.c change:    0:30

And this was on an 8 gig fs.  On a 32 gig fs I'd expect to see a
fifteen-fold difference due to the additional block groups.

Can you suggest a better test?
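The tree shape Andrew describes is easy to reconstruct.  The sketch
below is my reconstruction, not his actual test program: it enumerates
the same 10-way, four-level directory tree and checks the quoted
counts, with directories listed in a first pass and files in a second.

```python
# Sketch of the 100,000-file benchmark layout described above (my
# reconstruction, not Andrew's test program): a 10-way tree, four
# directory levels deep, with 10 files in each leaf directory.
# Directories come first, files second, so file data cannot end up
# contiguous with its parent directory.
import itertools
import os
import tempfile

FANOUT, DEPTH = 10, 4

def dir_paths():
    """Yield every directory path: 00, 00/00, ... down to depth 4."""
    for depth in range(1, DEPTH + 1):
        for parts in itertools.product(range(FANOUT), repeat=depth):
            yield os.path.join(*("%02d" % p for p in parts))

dirs = list(dir_paths())
leaves = [d for d in dirs if d.count(os.sep) == DEPTH - 1]
files = [os.path.join(d, "%02d" % i) for d in leaves for i in range(FANOUT)]

print(len(dirs))    # 11110 directories: 10 + 100 + 1000 + 10000
print(len(files))   # 100000 files

# Pass 1 would mkdir everything, pass 2 would creat the 4k files.
# Demo at small scale only -- the real run writes 400 MB:
root = tempfile.mkdtemp()
for d in dirs[:FANOUT]:            # top level only, to stay cheap
    os.makedirs(os.path.join(root, d), exist_ok=True)
```

Running the two passes for real (and timing a `sync` after each, as in
the original test) reproduces the create-then-read-back workload that
exposed the directory-placement cost.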
* Re: [Ext2-devel] disk throughput 2001-11-05 8:04 ` Andrew Morton @ 2001-11-05 12:28 ` Matthias Andree 2001-11-05 14:23 ` Alexander Viro 2001-11-05 20:16 ` Andreas Dilger 2 siblings, 0 replies; 78+ messages in thread From: Matthias Andree @ 2001-11-05 12:28 UTC (permalink / raw) To: lkml On Mon, 05 Nov 2001, Andrew Morton wrote: > OK. I'm not really aware of a suitable benchmark for that. > postmark only works in a single directory. mongo hardly > touches the disk at all, (with 3/4 of a gig of RAM, anyway). You can always pass the kernel "mem=64M". -- Matthias Andree "They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." Benjamin Franklin ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-05 8:04 ` Andrew Morton 2001-11-05 12:28 ` Matthias Andree @ 2001-11-05 14:23 ` Alexander Viro 2001-11-05 22:22 ` Andrew Morton 2001-11-05 20:16 ` Andreas Dilger 2 siblings, 1 reply; 78+ messages in thread
From: Alexander Viro @ 2001-11-05 14:23 UTC (permalink / raw)
To: Andrew Morton; +Cc: Albert D. Cahalan, Mike Fedyk, lkml, ext2-devel

On Mon, 5 Nov 2001, Andrew Morton wrote:

> OK, that's one possible reason. Not sure I buy it though. If
> the files are created a few days after their parent directory
> then the chance of their data or metadata being within device
> readhead scope of any of the parent dir's blocks seems small?

The algorithm for inode allocation was written by Kirk back in '84. You can find some analysis in the original paper (A Fast File System for UNIX).

BTW, what you want is not "readahead scope of parent dir block". You want the inodes of files in a given directory close to each other. That matters a lot when you do stat() on directory contents, etc. Moreover, since we attempt to keep data blocks close to inodes, we want to keep use of cylinder groups more or less even.

^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-05 14:23 ` Alexander Viro @ 2001-11-05 22:22 ` Andrew Morton 2001-11-05 22:41 ` Andreas Dilger ` (3 more replies) 0 siblings, 4 replies; 78+ messages in thread From: Andrew Morton @ 2001-11-05 22:22 UTC (permalink / raw) To: Alexander Viro; +Cc: Albert D. Cahalan, Mike Fedyk, lkml, ext2-devel Alexander Viro wrote: > > On Mon, 5 Nov 2001, Andrew Morton wrote: > > > OK, that's one possible reason. Not sure I buy it though. If > > the files are created a few days after their parent directory > > then the chance of their data or metadata being within device > > readhead scope of any of the parent dir's blocks seems small? > > Algorithm for inode allocation had been written by Kirk back in > '84. You can find some analisys in the original paper (A Fast > Filesystem for UNIX). '84. > BTW, what you want is not "readahead scope of parent dir block". > You want inodes of files in given directory close to each other. > That matters a lot when you do stat() on directory contents, > etc. Moreover, since we attempt to keep data blocks close to > inodes, we want to keep use of cylinder groups more or less > even. For some workloads we want the subdirectories close to the parent as well. Failing to do so is horridly wrong. What has changed since Kirk's design? - The relative cost of seeks has increased. Device caching and readahead and increased bandwidth make it more expensive to seek away. - The seek distance-versus-cost equation has changed. Take a look at a graph of seek distance versus time. Once you've decided to seek ten percent of the distance across the disk, a 90% seek only takes twice as long. - The size of devices has grown more quickly than ext2 blocksizes, so instead of having four or eight block groups as we did in the old days, we now have hundreds. 256 block groups on a 32 gig fs. Sprinkling a directory tree across 256 block groups hurts a lot. I don't think I buy the fragmentation argument, really. 
I suspect that doing first-fit within the bounds of a block group will have a long-term effect similar to performing first-fit on the entire fs. But I don't know. This is just all bullshit handwaving speculation. We need tests. Numbers. Does anyone have source to a filesystem aging simulation? The Smith/Seltzer code seems to be off the air. I just fed patch-2.4.14-ac8 into my sucky cvs import scripts and it ran in about a fifth of the time. This is a serious matter, worth spending time on. - ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-05 22:22 ` Andrew Morton @ 2001-11-05 22:41 ` Andreas Dilger 2001-11-05 22:53 ` Andrew Morton 2001-11-05 23:14 ` Dan Hollis 2001-11-06 10:52 ` Daniel Phillips ` (2 subsequent siblings) 3 siblings, 2 replies; 78+ messages in thread From: Andreas Dilger @ 2001-11-05 22:41 UTC (permalink / raw) To: Andrew Morton Cc: Alexander Viro, Albert D. Cahalan, Mike Fedyk, lkml, ext2-devel On Nov 05, 2001 14:22 -0800, Andrew Morton wrote: > But I don't know. This is just all bullshit handwaving speculation. > We need tests. Numbers. Does anyone have source to a filesystem > aging simulation? The Smith/Seltzer code seems to be off the air. There is a guy doing fragmentation testing for reiserfs. It turns out that (in his tests) reiserfs can get 10x slower as the filesystem fills up because of intra-file fragmentation. I don't know enough about reiserfs block/file allocation policy to know how this compares to ext2 at all. See http://www.informatik.uni-frankfurt.de/~loizides/reiserfs/agetest.html Cheers, Andreas -- Andreas Dilger http://sourceforge.net/projects/ext2resize/ http://www-mddsp.enel.ucalgary.ca/People/adilger/ ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-05 22:41 ` Andreas Dilger @ 2001-11-05 22:53 ` Andrew Morton 2001-11-08 15:28 ` Constantin Loizides 2001-11-05 23:14 ` Dan Hollis 1 sibling, 1 reply; 78+ messages in thread From: Andrew Morton @ 2001-11-05 22:53 UTC (permalink / raw) To: Andreas Dilger; +Cc: lkml, ext2-devel, constantin.loizides Andreas Dilger wrote: > > On Nov 05, 2001 14:22 -0800, Andrew Morton wrote: > > But I don't know. This is just all bullshit handwaving speculation. > > We need tests. Numbers. Does anyone have source to a filesystem > > aging simulation? The Smith/Seltzer code seems to be off the air. > > There is a guy doing fragmentation testing for reiserfs. It turns > out that (in his tests) reiserfs can get 10x slower as the filesystem > fills up because of intra-file fragmentation. I don't know enough > about reiserfs block/file allocation policy to know how this compares > to ext2 at all. > > See http://www.informatik.uni-frankfurt.de/~loizides/reiserfs/agetest.html > Wow. That looks nice. Unfortunately the tarballs are missing several files. read.cxx, agesystem.cxx, etc. Plus a number of dud links. Constantin, could you please check that the agesystem3 tarball builds OK, maybe put up a new one? Thanks. - ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-05 22:53 ` Andrew Morton @ 2001-11-08 15:28 ` Constantin Loizides 0 siblings, 0 replies; 78+ messages in thread
From: Constantin Loizides @ 2001-11-08 15:28 UTC (permalink / raw)
To: Andrew Morton; +Cc: kernel-list

> Wow. That looks nice. Unfortunately the tarballs are
> missing several files. read.cxx, agesystem.cxx, etc. Plus
> a number of dud links.

I am sorry, I made a new makefile recently... I should have tested it...

> Constantin, could you please check that the agesystem3 tarball
> builds OK, maybe put up a new one? Thanks.

New tars are on the web... If you plan to do some tests using agesystem3, e.g. the agetest.tar.gz package, don't hesitate to ask me first so that you don't waste your time, as my documentation is not very self-explanatory!

Constantin

^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-05 22:41 ` Andreas Dilger 2001-11-05 22:53 ` Andrew Morton @ 2001-11-05 23:14 ` Dan Hollis 1 sibling, 0 replies; 78+ messages in thread From: Dan Hollis @ 2001-11-05 23:14 UTC (permalink / raw) To: Andreas Dilger Cc: Andrew Morton, Albert D. Cahalan, Mike Fedyk, lkml, ext2-devel On Mon, 5 Nov 2001, Andreas Dilger wrote: > There is a guy doing fragmentation testing for reiserfs. It turns > out that (in his tests) reiserfs can get 10x slower as the filesystem > fills up because of intra-file fragmentation. I don't know enough > about reiserfs block/file allocation policy to know how this compares > to ext2 at all. How does xfs fare in comparison? -Dan -- [-] Omae no subete no kichi wa ore no mono da. [-] ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-05 22:22 ` Andrew Morton 2001-11-05 22:41 ` Andreas Dilger @ 2001-11-06 10:52 ` Daniel Phillips 2001-11-06 16:17 ` Jeremy Fitzhardinge 2001-11-06 21:45 ` Stephen Tweedie 3 siblings, 0 replies; 78+ messages in thread From: Daniel Phillips @ 2001-11-06 10:52 UTC (permalink / raw) To: Andrew Morton, Alexander Viro Cc: Albert D. Cahalan, Mike Fedyk, lkml, ext2-devel On November 5, 2001 11:22 pm, Andrew Morton wrote: > - The seek distance-versus-cost equation has changed. Take a look > at a graph of seek distance versus time. Once you've decided to > seek ten percent of the distance across the disk, a 90% seek only > takes twice as long. Do you have such a graph handy? -- Daniel ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-05 22:22 ` Andrew Morton 2001-11-05 22:41 ` Andreas Dilger 2001-11-06 10:52 ` Daniel Phillips @ 2001-11-06 16:17 ` Jeremy Fitzhardinge 2001-11-08 15:24 ` Constantin Loizides 2001-11-08 16:46 ` Jeremy Fitzhardinge 2001-11-06 21:45 ` Stephen Tweedie 3 siblings, 2 replies; 78+ messages in thread
From: Jeremy Fitzhardinge @ 2001-11-06 16:17 UTC (permalink / raw)
To: Daniel Phillips
Cc: Andrew Morton, Alexander Viro, Albert D. Cahalan, Mike Fedyk, lkml, Ext2 devel

[-- Attachment #1: Type: text/plain, Size: 1162 bytes --]

On Tue, 2001-11-06 at 02:52, Daniel Phillips wrote:
> On November 5, 2001 11:22 pm, Andrew Morton wrote:
> > - The seek distance-versus-cost equation has changed. Take a look
> > at a graph of seek distance versus time. Once you've decided to
> > seek ten percent of the distance across the disk, a 90% seek only
> > takes twice as long.
>
> Do you have such a graph handy?

This is one I made a while ago while doing some measurements; it's also the one Andrew was referring to. The drive is a very ordinary WD 40G 7200RPM drive. You can clearly see the rotation time in the width of the band, and the non-linear short-seek optimizations (the drive seems to do something like dump a lot more current into the seek coil for short seeks, probably knowing that it can't burn things out if it's only short).

There's an interesting detail you can't really see from this graph. Very short seeks (<< 1% disk size) are about 1ms. That's not seeks within a track; that's track-track seeks. Since rotational latency is up to 8.3ms on this drive, it's often cheaper to seek to the next track if the data is available there rather than wait for a rotation.

J

[-- Attachment #2: seek-buf.eps.gz --] [-- Type: application/x-gzip, Size: 26186 bytes --]

^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-06 16:17 ` Jeremy Fitzhardinge @ 2001-11-08 15:24 ` Constantin Loizides 2001-11-08 16:46 ` Jeremy Fitzhardinge 1 sibling, 0 replies; 78+ messages in thread
From: Constantin Loizides @ 2001-11-08 15:24 UTC (permalink / raw)
To: Jeremy Fitzhardinge; +Cc: kernel-list

> This is one I made a while ago while doing some measurements; its also
[snip]

That's an interesting plot. I would like to do one for my disk!

How did you do it? How do you find out about the seek distance??? Do you create one big file and then start seeking around accessing different blocks of it?

Please, let me know, thanks,

Constantin

^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-06 16:17 ` Jeremy Fitzhardinge 2001-11-08 15:24 ` Constantin Loizides @ 2001-11-08 16:46 ` Jeremy Fitzhardinge 2001-11-09 6:08 ` Andrew Morton 2001-11-09 8:49 ` Jeremy Fitzhardinge 1 sibling, 2 replies; 78+ messages in thread
From: Jeremy Fitzhardinge @ 2001-11-08 16:46 UTC (permalink / raw)
To: Constantin Loizides; +Cc: Linux Kernel List

On Thu, 2001-11-08 at 07:24, Constantin Loizides wrote:
>
> > This is one I made a while ago while doing some measurements; its also
> [snip]
>
> That's an interesting plot. I would like to do one for my disk!
>
> How did you do it?
> How do you find out about the seek distance???
> Do you create one big file and then start
> seeking around accessing different blocks of it?

It's pretty simple. I open /dev/hda (the whole device), and read random blocks, timing how long it takes for the data to come back since the last one. I set up a few hundred/thousand buckets, and accumulate the measured time into the bucket for that seek distance. So:

	fd = open("/dev/hda")
	llseek(fd, last = rand());
	read(fd)
	while (!finished) {
		blk = rand();
		llseek(fd, blk);
		read(fd);
		bucket[abs(last-blk)] = time;
		last = blk;
	}

I found the random seeking was the only way I could get reasonable numbers out of the drive; any of the ordered patterns of the form "OK, lets measure N block seeks" seemed to be predicted by the drive and gave unreasonably fast results.

This technique measures seek+rotational latency for reads. Seek-for-reading is generally faster than seek-for-writing, because they start reading as soon as the head is approximately in the right place, and start using the data as soon as the ECCs start checking out OK. You certainly can't start writing until you've done the full head-settle time (another 1-2ms).

I'll see what state my measurement code is in (technically and legally) and see if I can release it. Anyway I'm happy to answer questions.

J

^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-08 16:46 ` Jeremy Fitzhardinge @ 2001-11-09 6:08 ` Andrew Morton 2001-11-09 8:49 ` Jeremy Fitzhardinge 1 sibling, 0 replies; 78+ messages in thread
From: Andrew Morton @ 2001-11-09 6:08 UTC (permalink / raw)
To: Jeremy Fitzhardinge; +Cc: Constantin Loizides, Linux Kernel List

Jeremy Fitzhardinge wrote:
>
> On Thu, 2001-11-08 at 07:24, Constantin Loizides wrote:
> >
> > > This is one I made a while ago while doing some measurements; its also
> > [snip]
> >
> > That's an interesting plot. I would like to do one for my disk!
> >
> > How did you do it?
> > How do you find out about the seek distance???
> > Do you create one big file and then start
> > seeking around accessing different blocks of it?
>
> Its pretty simple. I open /dev/hda (the whole device), and read random
> blocks, timing how long it takes for the data to come back since the
> last one. I set up a few hundred/thousand buckets, and accumulate the
> measured time into the bucket for that seek distance. So:
>
> fd=open("/dev/hda")
> llseek(fd, last = rand());
> read(fd)
> while(!finished) {
> blk = rand();
> llseek(fd, blk);
> read(fd);
> bucket[abs(last-blk)] = time;
> last = blk;
> }

How do you avoid artifacts from Linux caching with this test?

> I found the random seeking was the only way I could get reasonable
> numbers out of the drive; any of the ordered patterns of the form "OK,
> lets measure N block seeks" seemed to be predicted by the drive and gave
> unreasonably fast results.
>

But we *want* unreasonably fast results! We need to understand this device-level prediction, then decide if it's sufficiently widespread for us to go and exploit the hell out of it.

Have you any ideas as to what's happening?

- ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-08 16:46 ` Jeremy Fitzhardinge 2001-11-09 6:08 ` Andrew Morton @ 2001-11-09 8:49 ` Jeremy Fitzhardinge 1 sibling, 0 replies; 78+ messages in thread From: Jeremy Fitzhardinge @ 2001-11-09 8:49 UTC (permalink / raw) To: Andrew Morton; +Cc: Constantin Loizides, Linux Kernel List On Thu, 2001-11-08 at 22:08, Andrew Morton wrote: > How do you avoid aftifacts from Linux caching with this test? Randomly seeking around a 40G disk isn't going to give the caching much to work with (that was on a 256Mbyte machine). Also, I tried it with raw devices, but it made no difference to the measurements. > > I found the random seeking was the only way I could get reasonable > > numbers out of the drive; any of the ordered patterns of the form "OK, > > lets measure N block seeks" seemed to be predicted by the drive and gave > > unreasonably fast results. > > > > But we *want* unreasonably fast results! We need to understand > this device-level prediction, then decide if it's sufficiently > widespread for us to go and exploit the hell out of it. > > Have you any ideas as to what's happening? Not really. I wanted results which at least confirm the simple model of how the disk works. When I did deterministic seeks, it was something like: for(n = 0; n < disksize/2; n += about_a_cylinder) { seek(n); read(); seek(disksize-n); read(); } I suspect that the drive was partitioning its cache and using lots of readahead to satisfy the reads at either end of the seek. I was trying to work around that by making the n increment large enough, but it didn't seem to work. Maybe the drive was being extra special clever, or maybe I miscalculated the cylinder size. Or maybe I just happened to be hitting a rotational sweet-spot. For this kind of test, it would be interesting to know what differences there are between read and write. In principle the drive could start writing a track anywhere it wants so long as it has enough to write the whole thing. 
That would automatically make it get the track-track skew right... J ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-05 22:22 ` Andrew Morton ` (2 preceding siblings ...) 2001-11-06 16:17 ` Jeremy Fitzhardinge @ 2001-11-06 21:45 ` Stephen Tweedie 3 siblings, 0 replies; 78+ messages in thread
From: Stephen Tweedie @ 2001-11-06 21:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Alexander Viro, Albert D. Cahalan, Mike Fedyk, lkml, ext2-devel

Hi,

On Mon, Nov 05, 2001 at 02:22:41PM -0800, Andrew Morton wrote:

> For some workloads we want the subdirectories close to the
> parent as well. Failing to do so is horridly wrong.

If you apply that recursively, then _all_ the directories in a filesystem end up in the same place. ext2 has traditionally been extremely resistant to fragmentation degradation over time, and the spreading out of the directory tree over the filesystem is part of that.

> What has changed since Kirk's design?
>
> - The relative cost of seeks has increased. Device caching
> and readahead and increased bandwidth make it more expensive
> to seek away.

I'm not convinced about that. Modern disks are so fast at streaming that _any_ non-sequential access is a major cost. Track-to-track seeks are typically well under the average rotational cost. It's not seeking to a distant location that's particularly expensive: any seek is, whether to the same track or not.

> I don't think I buy the fragmentation argument, really.

Recent experiments showed that reiserfs, which starts off allocating files quite close together, was significantly faster than ext2 on mostly-empty filesystems but got hugely slower as you approached 90% full or more. I don't buy the argument that you can ignore fragmentation. There must be a balance between short-term performance when allocating files and long-term performance when ensuring you've got enough free space inside a directory tree to cope with new files. Even kernel builds may show this up.
If you try to keep a directory tree compact, then you may get excellent performance when unpacking the kernel tarball. But once you've applied a few large patch sets and set the kernel build going, your new files, .c.orig patch backups, and .o files will have nowhere nearby to get allocated in. Cheers, Stephen ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-05 8:04 ` Andrew Morton 2001-11-05 12:28 ` Matthias Andree 2001-11-05 14:23 ` Alexander Viro @ 2001-11-05 20:16 ` Andreas Dilger 2001-11-05 20:28 ` m 2 siblings, 1 reply; 78+ messages in thread
From: Andreas Dilger @ 2001-11-05 20:16 UTC (permalink / raw)
To: Andrew Morton; +Cc: Albert D. Cahalan, Mike Fedyk, lkml, ext2-devel

On Nov 05, 2001 00:04 -0800, Andrew Morton wrote:
> My original make-100,000-4k-files test creates the files
> in a tree - each node has 10 leafs. For a total of 11,110
> directories and 100,000 files. It originally did it
> in-order, so:
>
> mkdir(00)
> mkdir(00/00)
> mkdir(00/00/00)
> mkdir(00/00/00/00)
> creat(00/00/00/00/00)
> creat(00/00/00/00/01)
> ...
> mkdir(00/00/00/01)
>
> etc.
>
> So I changed it to create the 11,110 directories, and then
> to go back and create the 100,000 files. This will ensure that the
> file's data are not contiguous with their parent directory.
>
> With the ialloc.c change, plus the other changes I mentioned
> the time to create all these directories and files and then run
> /bin/sync fell from 1:53 to 0:28. Fourfold.
>
> And this was on an 8 gig fs. On a 32 gig fs I'd expect to see
> a fifteen-fold difference due to the additional block groups.
>
> Can you suggest a better test?

Well, just to emphasize the "block group" issues, you could try testing with a 1kB or 2kB block filesystem. This will give you 16x or 4x as many groups as a 4kB block filesystem, respectively.

A more "valid" test, IMHO, would be "untar the kernel, (flush buffers), build kernel" on both the original, and your "all in one group" inode allocation heuristics. It should be especially noticeable on a 1kB filesystem. What this will show (I think) is that while untar/read with your method will be fast (all inodes/files contiguous on disk) once you start trying to write to that filesystem, you will have more fragmentation/seeking for the writes.
It may be that with large-memory systems you will cache so much you don't see a difference, hence the (flush buffers) part, which is probably umount, mount. An even better test would be untar kernel, patch up a few major versions, then try to compile. The old heuristic would probably be OK, as there is space in each group for files to grow, while your heuristic would move files into groups other than their parent because there is no space.

In the end, though, while the old heuristic has a good theory, it _may_ be that in practice, you are _always_ seeking to get data from different groups, rather than _theoretically_ seeking because of fragmented files. I don't know what the answer is - probably depends on finding "valid" benchmarks (cough).

Cheers, Andreas
--
Andreas Dilger http://sourceforge.net/projects/ext2resize/ http://www-mddsp.enel.ucalgary.ca/People/adilger/

^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-05 20:16 ` Andreas Dilger @ 2001-11-05 20:28 ` m 2001-11-05 21:39 ` Andrew Morton 2001-11-06 21:48 ` Stephen Tweedie 0 siblings, 2 replies; 78+ messages in thread
From: m @ 2001-11-05 20:28 UTC (permalink / raw)
To: Andreas Dilger
Cc: Andrew Morton, Albert D. Cahalan, Mike Fedyk, lkml, ext2-devel

Andreas Dilger <adilger@turbolabs.com> writes:
> On Nov 05, 2001 00:04 -0800, Andrew Morton wrote:
[..]
> > With the ialloc.c change, plus the other changes I mentioned
> > the time to create all these directories and files and then run
> > /bin/sync fell from 1:53 to 0:28. Fourfold.
>
> In the end, though, while the old heuristic has a good theory, it _may_
> be that in practise, you are _always_ seeking to get data from different
> groups, rather than _theoretically_ seeking because of fragmented files.
> I don't know what the answer is - probably depends on finding "valid"
> benchmarks (cough).

Another heuristic to try may be to only use a different blockgroup when the mkdir()s are separated in time. i.e. rather than doing

	if ( 0 && ..

use something like

	if ((last_time + 100) < jiffies && ...
	last_time = jiffies;

which would in theory use the old behaviour for sporadic mkdirs and the new behaviour for things like 'untar' et al.

M.

^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-05 20:28 ` m @ 2001-11-05 21:39 ` Andrew Morton 2001-11-05 22:59 ` Linus Torvalds 2001-11-06 21:48 ` Stephen Tweedie 1 sibling, 1 reply; 78+ messages in thread From: Andrew Morton @ 2001-11-05 21:39 UTC (permalink / raw) To: m; +Cc: Andreas Dilger, Albert D. Cahalan, Mike Fedyk, lkml, ext2-devel m@mo.optusnet.com.au wrote: > > Andreas Dilger <adilger@turbolabs.com> writes: > > On Nov 05, 2001 00:04 -0800, Andrew Morton wrote: > [..] > > > With the ialloc.c change, plus the other changes I mentioned > > > the time to create all these directories and files and then run > > > /bin/sync fell from 1:53 to 0:28. Fourfold. > > > > In the end, though, while the old heuristic has a good theory, it _may_ > > be that in practise, you are _always_ seeking to get data from different > > groups, rather than _theoretically_ seeking because of fragmented files. > > I don't know what the answer is - probably depends on finding "valid" > > benchmarks (cough). > > Another heuristic to try make be to only use a different blockgroup > for when the mkdir()s are seperate in time. i.e. rather than > doing > if ( 0 && .. > use something like > if ((last_time + 100) < jiffes && ... > last_time = jiffies; > which would in theory use the old behaviour for sparodic mkdirs > and the new behaviour for things like 'untar' et al. > I agree - that's a pretty sane heuristic. It would allow us to preserve the existing semantics for the slowly-accreting case. If they're still valid. ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-05 21:39 ` Andrew Morton @ 2001-11-05 22:59 ` Linus Torvalds 2001-11-05 23:36 ` Alexander Viro 0 siblings, 1 reply; 78+ messages in thread From: Linus Torvalds @ 2001-11-05 22:59 UTC (permalink / raw) To: linux-kernel In article <3BE70717.54F3084A@zip.com.au>, Andrew Morton <akpm@zip.com.au> wrote: >> if ( 0 && .. >> use something like >> if ((last_time + 100) < jiffes && ... >> last_time = jiffies; >> which would in theory use the old behaviour for sparodic mkdirs >> and the new behaviour for things like 'untar' et al. >> > >I agree - that's a pretty sane heuristic. > >It would allow us to preserve the existing semantics for the >slowly-accreting case. If they're still valid. I don't particularly like behaviour that changes over time, so I would much rather just state clearly that the current inode allocation strategy is obviously complete crap. Proof: simple real-world benchmarks, along with some trivial thinking about seek latencies. In particular, the way it works now, it will on purpose try to spread out inodes over the whole disk. Every new directory will be allocated in the group that has the most free inodes, which obviously on average means that you try to fill up all groups equally. Which makes _no_ sense. There is no advantage to trying to spread things out, only clear disadvantages. Instead of trying to make this time-dependent, let's just realize that spreading things out is stupid. It might make more sense to say - if we create a new directory - AND the group we're currently in is getting filled up - AND there is another group that is clearly emptier, - THEN do we switch groups. None of this stupid "if this group has more inodes than the average" crap. But I'd rather have the "if (0 && .." than something that depends on time. 
But if people want to have something that is in the _spirit_ of the old code, then make it something like the above, which only switches groups if there is a need and a clear win from it. Linus ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-05 22:59 ` Linus Torvalds @ 2001-11-05 23:36 ` Alexander Viro 2001-11-05 23:50 ` Linus Torvalds 0 siblings, 1 reply; 78+ messages in thread
From: Alexander Viro @ 2001-11-05 23:36 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel

On Mon, 5 Nov 2001, Linus Torvalds wrote:

> I don't particularly like behaviour that changes over time, so I would
> much rather just state clearly that the current inode allocation
> strategy is obviously complete crap. Proof: simple real-world
> benchmarks, along with some trivial thinking about seek latencies.
>
> In particular, the way it works now, it will on purpose try to spread
> out inodes over the whole disk. Every new directory will be allocated in
> the group that has the most free inodes, which obviously on average
> means that you try to fill up all groups equally.
> Which makes _no_ sense. There is no advantage to trying to spread things
> out, only clear disadvantages.

Wrong. Trivial example: create skeleton homedirs for 50 new users. You _really_ don't want all of them in one cylinder group. Because they will be slowly filling up with files, while the directory structure is very likely to stay more or less stable. You want the preferred group for a file's inode to be the same as its parent directory's. _And_ you want data close to the inode if we can afford that.

Worse yet, for data allocation we use a quadratic hash. Which works nicely _unless_ the starting point for all of them sits in the same group. See where it's going?

The real issue is the ratio of frequencies of directory and file creation. The "time-dependent" part is ugly, but the thing it tries to address is very, very real. Allocation policy for a tree created at once is different from allocation policy for normal use. Ideally we would need to predict how many (and how large) files will go into a directory. We can't - we have no time machines. But the heuristic you've mentioned is clearly broken.
It will end up with mostly empty trees squeezed into a single cylinder group, and when they start to get populated that will be pure hell. And yes, it's a more than realistic scenario. Your strategy would make sense if all directories were created by untarring a large archive. Which may be fairly accurate for your boxen (or mine, for that matter - most of the time), but it's not universal.

Benchmarks that try to stress that code tend to be something like cvs co, tar x, yodda, yodda. _All_ of them deal only with the fast-growth pattern. And yes, the FFS inode allocator sucks for that scenario - no arguments here. Unfortunately, the variant you propose will suck for the slow-growth one, and that is going to hurt a lot. The fact that Linux became a huge directory tree means that we tend to deal with the fast-growth scenario quite often. Not everyone works on the kernel, though ;-)

^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-05 23:36 ` Alexander Viro @ 2001-11-05 23:50 ` Linus Torvalds 2001-11-06 0:03 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 78+ messages in thread
From: Linus Torvalds @ 2001-11-05 23:50 UTC (permalink / raw)
To: Alexander Viro; +Cc: linux-kernel

On Mon, 5 Nov 2001, Alexander Viro wrote:
>
> Ideally we would need to predict how many (and how large) files
> will go into directory. We can't - we have no time machines. But
> heuristics you've mentioned is clearly broken. It will end up with
> mostly empty trees squeezed into a single cylinder group and when
> they start to get populated that will be pure hell.

Why? Squeezing into a single cylinder group is _good_. Fragmentation is bad. Where's the hell? You move over to the next cylinder group when you have filled one 90% ("percentage picked randomly") and there are others that are clearly less used.

> Benchmarks that try to stress that code tend to be something like
> cvs co, tar x, yodda, yodda. _All_ of them deal only with "fast-growth"
> pattern. And yes, FFS inode allocator sucks for that scenario - no
> arguments here. Unfortunately, the variant you propose will suck for
> slow-growth one and that is going to hurt a lot.

I don't see any arguments for _why_ you claim it would hurt a lot.

Fast-growth is what we should care about from a performance standpoint - slow growth can be sanely handled with slower-paced maintenance issues (including approaches like "defragment the disk once a month"). By definition, "slow growth" is amenable to defragmentation. And by definition, fast growth isn't.

And if we're talking about speedups on the order of _five times_ for something as simple as a "cvs co", I don't see where your "pure hell" comes in. A five-time slowdown on real work _is_ pure hell. You've not shown a credible argument that the slow-growth behaviour would ever result in a five-time slowdown for _anything_.
Linus ^ permalink raw reply [flat|nested] 78+ messages in thread
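The switching policy Linus describes, keep allocating in the current cylinder group until it is about 90% full and then spill into the next, can be sketched roughly as follows. This is an illustrative model only, not the actual ext2 allocator; the function name and the per-group free-block array are invented:

```c
#include <assert.h>

/* Illustrative sketch (not actual ext2 code): stay in the current
 * block group until less than 10% of it is free, then move on to the
 * next group.  free_blocks[] holds per-group free counts; group_size
 * is the number of blocks per group. */
static int pick_group(const unsigned *free_blocks, unsigned group_size,
                      unsigned ngroups, unsigned current)
{
    unsigned g = current;
    unsigned i;

    for (i = 0; i < ngroups; i++) {
        if (free_blocks[g] * 10 > group_size)   /* more than 10% free: use it */
            return (int)g;
        g = (g + 1) % ngroups;                  /* ~90% full: try the next one */
    }
    return -1;                                  /* every group is nearly full */
}
```

The point of the model is that the group only changes when the current one genuinely fills up, so related files stay clustered instead of being spread preemptively.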
* Re: [Ext2-devel] disk throughput 2001-11-05 23:50 ` Linus Torvalds @ 2001-11-06 0:03 ` Linus Torvalds 2001-11-06 1:33 ` Alexander Viro 2001-11-06 1:28 ` Alexander Viro 2001-11-08 12:51 ` Pavel Machek 2 siblings, 1 reply; 78+ messages in thread From: Linus Torvalds @ 2001-11-06 0:03 UTC (permalink / raw) To: Alexander Viro; +Cc: linux-kernel On Mon, 5 Nov 2001, Linus Torvalds wrote: > > A five-time slowdown on real work _is_ pure hell. You've not shown a > credible argument that the slow-growth behaviour would ever result in a > five-time slowdown for _anything_. There might also be heuristics that explicitly _notice_ slow growth, not necessarily as a function of time, but as a function of the tree structure itself. For example, spreading out (and the inherent assumption of "slow growth") might make sense for the root directory, and possibly for a level below that. It almost certainly does _not_ make sense for a directory created four levels down. Linus ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-06 0:03 ` Linus Torvalds @ 2001-11-06 1:33 ` Alexander Viro 2001-11-06 2:10 ` Linus Torvalds 2001-11-09 22:35 ` Riley Williams 0 siblings, 2 replies; 78+ messages in thread From: Alexander Viro @ 2001-11-06 1:33 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel On Mon, 5 Nov 2001, Linus Torvalds wrote: > For example, spreading out (and the inherent assumption of "slow growth") > might make sense for the root directory, and possibly for a level below > that. It almost certainly does _not_ make sense for a directory created > four levels down. Depends on the naming scheme used by admin, for one thing. Linus, I seriously think that getting the real-life traces to play with would be a Good Thing(tm) - at least that would allow to test how well would heuristics of that kind do. ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-06 1:33 ` Alexander Viro @ 2001-11-06 2:10 ` Linus Torvalds 2001-11-06 3:02 ` Alexander Viro 2001-11-06 3:49 ` Alexander Viro 2001-11-09 22:35 ` Riley Williams 1 sibling, 2 replies; 78+ messages in thread From: Linus Torvalds @ 2001-11-06 2:10 UTC (permalink / raw) To: Alexander Viro; +Cc: linux-kernel On Mon, 5 Nov 2001, Alexander Viro wrote: > > Depends on the naming scheme used by admin, for one thing. Linus, I seriously > think that getting the real-life traces to play with would be a Good Thing(tm) > - at least that would allow to test how well would heuristics of that kind do. Well, we have some real-life traces. Those particular real-life traces show that the "switch cylinder groups on mkdir" heuristic sucks. And you have to realize that _whatever_ we do, it will always be a heuristic. We don't know what the right behaviour is without being able to predict the future. Agreed? Ok, so let's just assume that we had no heuristic in place at all, and we were looking at improving it for slow-growth situations. I think you'd agree that decreasing the performance of a CVS checkout five-fold would be considered _totally_ unacceptable for a new heuristic, no? So the current heuristic provably sucks. We have cold hard numbers, and quite frankly, Al, there is very very little point in arguing against numbers. It's silly. "Gimme an S, gimme a U, gimme a C, gimme a K - S-U-C-K". The current one sucks. So the only question on the table is not "do we stay with the current algorithm", because I dare you, the answer to that question is very probably a clear "NO". So there's no point in even trying to find out _why_, in 1984, it was considered a good heuristic. The question is: "what can we do to improve it?". Not "what arguments can we come up with to make excuses for a sucky algorithm that clearly does the wrong thing for real-life loads". 
One such improvement has already been put on the table: remove the algorithm, and make it purely greedy. We know that works. And yes, we realize that it has downsides too. Which is why some kind of hybrid is probably called for. Come up with your own sixth sense. We know the current one is really bad for real-life loads. I actually bet that the greedy algorithm would work wonderfully well, if we just had run-time balancing. I bet that the main argument for "spreading things out" is that it minimizes the risk of overflow in any particular group, never mind performance. And if you're stuck with the decision you make forever, minimizing risk is a good approach. And maybe the fundamental problem is exactly that: because we're stuck with our decision forever, people felt that they couldn't afford to risk doing what was very obviously the right thing. So I still claim that we should look for short-time profit, and then try to fix up the problems longer term. With, if required, some kind of rebalancing. Linus ^ permalink raw reply [flat|nested] 78+ messages in thread
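The "purely greedy" variant on the table (Andrew's one-line patch in the head of the thread simply disables the directory-spreading branch) amounts to: a new inode goes into its parent's group whenever possible. A minimal model of that policy, with invented names and a crude fallback for the full-group case:

```c
#include <assert.h>

/* Minimal model of the greedy policy: inherit the parent's block group
 * if it still has free inodes, otherwise fall back to the group with
 * the most free inodes.  Invented names, not the kernel's. */
static unsigned greedy_group(unsigned parent_group,
                             const unsigned *free_inodes, unsigned ngroups)
{
    unsigned g, best = parent_group;

    if (free_inodes[parent_group] > 0)
        return parent_group;            /* greedy: stay with the parent */

    for (g = 0; g < ngroups; g++)       /* parent group full: pick the */
        if (free_inodes[g] > free_inodes[best])
            best = g;                   /* emptiest group instead */
    return best;
}
```

The known downside discussed in the thread is exactly the fallback path: once everything piles into one group, later growth has nowhere local to go, which is what the proposed rebalancing would have to fix.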
* Re: [Ext2-devel] disk throughput 2001-11-06 2:10 ` Linus Torvalds @ 2001-11-06 3:02 ` Alexander Viro 2001-11-06 8:39 ` Alan Cox 2001-11-06 3:49 ` Alexander Viro 1 sibling, 1 reply; 78+ messages in thread From: Alexander Viro @ 2001-11-06 3:02 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel On Mon, 5 Nov 2001, Linus Torvalds wrote: > And you have to realize that _whatever_ we do, it will always be a > heuristic. We don't know what the right behaviour is without being able to > predict the future. Agreed? No arguments. While we are at it, let's just state once and forever: FFS allocator sucks for fast-growth case Everyone agrees with that, including me, you _and_ Kirk. > The question is: "what can we do to improve it?". Not "what arguments can > we come up with to make excuses for a sucky algorithm that clearly does > the wrong thing for real-life loads". Obviously. > One such improvement has already been put on the table: remove the > algorithm, and make it purely greedy. > > We know that works. And yes, we realize that it has downsides too. Which > is why some kind of hybrid is probably called for. Come up with your own Exactly. > And maybe the fundamental problem is exactly that: because we're stuck > with our decision forever, people felt that they couldn't afford to risk > doing what was very obviously the right thing. > > So I still claim that we should look for short-time profit, and then try > to fix up the problems longer term. With, if required, some kind of > rebalancing. Whatever heuristics we use, it _must_ catch the fast-growth scenario. No arguments on that. The question being, what will minimize the problems for other cases. On-line defrag can be actually fairly nasty - that had been tried and it ends up with a hell of a lot of tricky details. Especially with page cache in the game. And "umount once a month" is not serious - think of a _large_ disk on a department NFS server. 
So I'd rather look for decent heuristics before going for "let's defrag it once in a while" kind of solution. I definitely want to start with looking through relevant work - both for the data and for information on _failed_ attempts to solve the problem. /me goes to dig through that stuff ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-06 3:02 ` Alexander Viro @ 2001-11-06 8:39 ` Alan Cox 2001-11-06 8:37 ` Alexander Viro 0 siblings, 1 reply; 78+ messages in thread From: Alan Cox @ 2001-11-06 8:39 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, linux-kernel > > So I still claim that we should look for short-time profit, and then try > > to fix up the problems longer term. With, if required, some kind of > > rebalancing. > > Whatever heuristics we use, it _must_ catch fast-growth scenario. No > arguments on that. The question being, what will minimize the problems > for other cases. Surely the answer if you want short term write speed and long term balancing is to use ext3 not ext2 and simply ensure that the relevant stuff goes to the journal (which will be nicely ordered) first. That will give you some buffering at least. Alan ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-06 8:39 ` Alan Cox @ 2001-11-06 8:37 ` Alexander Viro 2001-11-06 8:48 ` Andrew Morton 0 siblings, 1 reply; 78+ messages in thread From: Alexander Viro @ 2001-11-06 8:37 UTC (permalink / raw) To: Alan Cox; +Cc: Linus Torvalds, linux-kernel On Tue, 6 Nov 2001, Alan Cox wrote: > Surely the answer if you want short term write speed and long term balancing > is to use ext3 not ext2 and simply ensure that the relevant stuff goes to > the journal (which will be nicely ordered) first. That will give you some > buffering at least. Alan, the problem is present in ext3 as well as in all other FFS derivatives (well, FreeBSD had tried to deal with that stuff this Spring). ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-06 8:37 ` Alexander Viro @ 2001-11-06 8:48 ` Andrew Morton 0 siblings, 0 replies; 78+ messages in thread From: Andrew Morton @ 2001-11-06 8:48 UTC (permalink / raw) To: Alexander Viro; +Cc: Alan Cox, Linus Torvalds, linux-kernel Alexander Viro wrote: > > On Tue, 6 Nov 2001, Alan Cox wrote: > > > Surely the answer if you want short term write speed and long term balancing > > is to use ext3 not ext2 and simply ensure that the relevant stuff goes to > > the journal (which will be nicely ordered) first. That will give you some > > buffering at least. > > Alan, the problem is present in ext3 as well as in all other FFS derivatives > (well, FreeBSD had tried to deal that stuff this Spring). > Yep. Once we're seek-bound on metadata and data, the occasional seek-and-squirt into the journal won't make much difference, and the other write patterns will be the same. Interestingly, current ext3 can do a 600 meg write in fifty seconds, whereas ext2 takes seventy. This will be related to the fact that ext3 just pumps all the buffers into submit_bh(), whereas ext2 fiddles around with all the write_locked_buffers() stuff. I think. Or the intermingling of indirects with data is tripping ext2 up. The additional seeking is audible. - ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-06 2:10 ` Linus Torvalds 2001-11-06 3:02 ` Alexander Viro @ 2001-11-06 3:49 ` Alexander Viro 2001-11-06 4:01 ` Linus Torvalds 1 sibling, 1 reply; 78+ messages in thread From: Alexander Viro @ 2001-11-06 3:49 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel On Mon, 5 Nov 2001, Linus Torvalds wrote: > One such improvement has already been put on the table: remove the > algorithm, and make it purely greedy. OK, some digging had brought another one: a) if it's first-level directory - get it the fsck out of root's cylinder group. b) if we just keep creating directories in a cylinder group and do not create any files there - stop, it's no good (i.e. there's a limit on number of back-to-back directory creations in the same group). c) try putting it into the parent's CG, but reserve some number of inodes and data blocks in it. If we can't - tough, get the fsck out of there. From the first reading of the code (aside of general yuck wrt style, but that's actually more about the older code in there) it seems to be reasonable. It should solve the problem with the fast-growth scenario and it puts some effort to avoid nastiness with the slow-growth one. ^ permalink raw reply [flat|nested] 78+ messages in thread
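Al's three rules condense into a single predicate deciding whether a new directory should leave its parent's cylinder group. The sketch below is an invented illustration of those rules, not code from any implementation; MAX_DIR_RUN and MIN_RESERVE are placeholder thresholds that a real allocator would tune:

```c
#include <assert.h>

/* Sketch of rules (a)-(c) above; MAX_DIR_RUN and MIN_RESERVE are
 * invented placeholder thresholds. */
#define MAX_DIR_RUN  4   /* (b): cap on back-to-back mkdirs per group */
#define MIN_RESERVE  8   /* (c): inodes/blocks to reserve in parent's CG */

struct cg_state {
    unsigned free_inodes;
    unsigned free_blocks;
    unsigned dir_run;    /* consecutive dirs created without any files */
};

/* Returns nonzero when the new directory should go to another group. */
static int spread_dir(const struct cg_state *g, int parent_is_root)
{
    if (parent_is_root)                  /* (a) get out of root's CG */
        return 1;
    if (g->dir_run >= MAX_DIR_RUN)       /* (b) nothing but mkdirs here */
        return 1;
    if (g->free_inodes < MIN_RESERVE ||  /* (c) can't reserve space: */
        g->free_blocks < MIN_RESERVE)    /*     get out of there     */
        return 1;
    return 0;                            /* stay in the parent's CG */
}
```

Rule (b) is what distinguishes a fast-growth burst of mkdirs from normal use, and rule (c) keeps a directory from landing in a group with no room left for its own children.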
* Re: [Ext2-devel] disk throughput 2001-11-06 3:49 ` Alexander Viro @ 2001-11-06 4:01 ` Linus Torvalds 2001-11-06 4:21 ` Alexander Viro 0 siblings, 1 reply; 78+ messages in thread From: Linus Torvalds @ 2001-11-06 4:01 UTC (permalink / raw) To: Alexander Viro; +Cc: linux-kernel On Mon, 5 Nov 2001, Alexander Viro wrote: > > OK, some digging had brought another one: > > a) if it's first-level directory - get it the fsck out of root's cylinder > group. Hey, now that you've read it in a paper you like it, but when I suggest it in email you shoot it down? <Whiny mode on> I thought you loved me, Al. <Whiny mode off> > b) if we just keep creating directories in a cylinder group and do not > create any files there - stop, it's no good (i.e. there's a limit on > number of back-to-back directory creations in the same group). The current code actually has some vestiges that _seem_ to be trying to do something like this: see the commented-out if (tmp && (le16_to_cpu(tmp->bg_used_dirs_count) << 8) < le16_to_cpu(tmp->bg_free_inodes_count)) { which _seems_ to want to play games with "number of directories allocated vs nr of free inodes". But it's commented out with "I am not yet convinced that this next bit is necessary". I don't know if the code has ever been active, or whether it showed other problems. > c) try putting it into the parent's CG, but reserve some number of inodes > and data blocks in it. If we can't - tough, get the fsck out of there. Hmm.. Maybe this is actually closer to what we try to do above.. Linus ^ permalink raw reply [flat|nested] 78+ messages in thread
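The commented-out test amounts to a directory-density throttle: a group qualifies only while the directories already allocated in it are outnumbered 256-to-1 by free inodes. As a standalone predicate (the function is an illustration, only the relation comes from the quoted code):

```c
#include <assert.h>

/* The relation the commented-out ialloc.c test expresses: accept a
 * group only while used_dirs * 256 < free_inodes, i.e. throttle how
 * densely directories pack into any one block group. */
static int group_accepts_dir(unsigned used_dirs, unsigned free_inodes)
{
    return (used_dirs << 8) < free_inodes;
}
```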
* Re: [Ext2-devel] disk throughput 2001-11-06 4:01 ` Linus Torvalds @ 2001-11-06 4:21 ` Alexander Viro 2001-11-06 5:01 ` Linus Torvalds 0 siblings, 1 reply; 78+ messages in thread From: Alexander Viro @ 2001-11-06 4:21 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel On Mon, 5 Nov 2001, Linus Torvalds wrote: > > On Mon, 5 Nov 2001, Alexander Viro wrote: > > > > OK, some digging had brought another one: > > > > a) if it's first-level directory - get it the fsck out of root's cylinder > > group. > > Hey, now that you've read it in a paper you like it, but when I suggest it > in email you shoot it down? > > <Whiny mode on> I thought you loved me, Al. <Whiny mode off> Oh, come on. (a) is obvious, but obviously not enough ;-) > > b) if we just keep creating directories in a cylinder group and do not > > create any files there - stop, it's no good (i.e. there's a limit on > > number of back-to-back directory creations in the same group). > > The current code actually has some vestiges that _seem_ to be trying to do > something like this: see the commented-out > > if (tmp && le16_to_cpu(tmp->bg_used_dirs_count) << 8) < > le16_to_cpu(tmp->bg_free_inodes_count)) { > > which _seems_ to want to play games with "number of directories allocated > vs nr of free inodes". > > But it's commented out with "I am not yet convinced that this next bit is > necessary". I don't know if the code has ever been active, or whether it > showed other problems. > > > c) try putting it into the parent's CG, but reserve some number of inodes > > and data blocks in it. If we can't - tough, get the fsck out of there. > > Hmm.. Maybe this is actually closer to what we try to do above.. Yes, but block reservation also makes sense (otherwise we can end up putting a directory into parent's CG only to have all children going there _and_ getting far from their data). Which might be the problem with original code, BTW. 
OK, anyway - I've got a bunch of cleanups for ialloc.c (equivalent transformations, split into small steps and decently tested - I've used them for almost a year). That stuff moves choice of CG for directories and non-directories into separate functions, so no matter which variant we end up doing I think that it's worth doing first - things will be cleaner after that. ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-06 4:21 ` Alexander Viro @ 2001-11-06 5:01 ` Linus Torvalds 2001-11-06 5:31 ` Andrew Morton 0 siblings, 1 reply; 78+ messages in thread From: Linus Torvalds @ 2001-11-06 5:01 UTC (permalink / raw) To: Alexander Viro; +Cc: linux-kernel On Mon, 5 Nov 2001, Alexander Viro wrote: > > Oh, come on. (a) is obvious, but obviously not enough ;-) I agree, but I think it's a fairly good starting point to build up some heuristics. Also note that traditional UNIX implementations will have a hard time doing anything more fine-grained than "is the parent the root directory or not". Information about grandparents etc has been lost long before we get to "ialloc()", so I bet you'll find experimental data only on that one-level thing. But it is probably not a binary decision in real life. In real life, you're ok switching cylinder groups for /home vs /usr, but you're also ok switching groups for /home/torvalds vs /home/viro. And it's still reasonably likely that it makes sense to switch groups in /home/torvalds between 'src' and 'public_html'. But the deeper down in the directory hierarchy you get, the less likely it is that it makes sense. And yes, it's just a heuristic. But it's one where, thanks to the dcache, we could reasonably easily do more than just a black-and-white "root vs non-root" thing. > Yes, but block reservation also makes sense (otherwise we can end up > putting a directory into parent's CG only to have all children > going there _and_ getting far from their data). Which might be the > problem with original code, BTW. Yes, agreed. Disk space allocations should definitely be part of the heuristics, and that's obviously both inode and blocks. I do believe that there are probably more "high-level" heuristics that can be useful, though. Looking at many common big trees, the ownership issue is one obvious clue. 
Sadly the trees obviously end up being _created_ without owner information, and the chown/chgrp happens later, but it might still be useable for some clues. Linus ^ permalink raw reply [flat|nested] 78+ messages in thread
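Linus's point above is that the dcache makes tree depth cheaply available, unlike in traditional UNIX implementations. A toy form of the depth heuristic, with the cutoff entirely invented for illustration:

```c
#include <assert.h>

/* Toy form of the depth heuristic: spread directories across groups
 * near the top of the tree, cluster them with the parent further down.
 * The cutoff of 2 is an invented placeholder, not a tuned value. */
static int want_spread(unsigned depth)   /* depth 0 == root directory */
{
    return depth <= 2;
}
```

In a real allocator the cutoff would more likely be a sliding probability than a hard threshold, since (as noted above) /home/torvalds/src vs public_html may still deserve separate groups.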
* Re: [Ext2-devel] disk throughput 2001-11-06 5:01 ` Linus Torvalds @ 2001-11-06 5:31 ` Andrew Morton 2001-11-06 5:48 ` Linus Torvalds 2001-11-06 7:10 ` Kai Henningsen 0 siblings, 2 replies; 78+ messages in thread From: Andrew Morton @ 2001-11-06 5:31 UTC (permalink / raw) To: Linus Torvalds; +Cc: Alexander Viro, linux-kernel Linus Torvalds wrote: > > I do believe that there are probably more "high-level" heuristics that can > be useful, though. Looking at many common big trees, the ownership issue is > one obvious clue. Sadly the trees obviously end up being _created_ without > owner information, and the chown/chgrp happens later, but it might still > be useable for some clues. > I didn't understand your objection to the heuristic "was the parent directory created within the past 30 seconds?". If the parent and child were created at the same time, chances are that they'll be accessed at the same time? And there's always the `chattr' cop-out, to alter the allocation policy at a chosen point in the tree by administrative act. Any change in ext2 allocation policy at this point in time really, really worries me. If it screws up we'll have 10,000,000 linux boxes running like dead dogs in a year. So if we _do_ make a change, I'd suggest that it be opt-in. Call me a wimp. - ^ permalink raw reply [flat|nested] 78+ messages in thread
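Andrew's proposed test is tiny in code terms. A sketch, with the 30-second window taken from his mail and the function shape invented:

```c
#include <assert.h>
#include <time.h>

/* Sketch of the age test: treat the mkdir as part of a fast-growth
 * burst (so keep the new directory in its parent's group) when the
 * parent directory itself was created within the last 30 seconds. */
static int fast_growth(time_t parent_ctime, time_t now)
{
    return (now - parent_ctime) <= 30;
}
```

This is exactly the non-data-dependence Linus objects to in the next message: the same untar produces a different layout depending on how fast it ran.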
* Re: [Ext2-devel] disk throughput 2001-11-06 5:31 ` Andrew Morton @ 2001-11-06 5:48 ` Linus Torvalds 2001-11-06 7:34 ` Mike Castle 2001-11-06 7:10 ` Kai Henningsen 1 sibling, 1 reply; 78+ messages in thread From: Linus Torvalds @ 2001-11-06 5:48 UTC (permalink / raw) To: Andrew Morton; +Cc: Alexander Viro, linux-kernel On Mon, 5 Nov 2001, Andrew Morton wrote: > > I didn't understand your objection to the heuristic "was the > parent directory created within the past 30 seconds?". If the > parent and child were created at the same time, chances are that > they'll be accessed at the same time? the thing I don't like about it is the non-data-dependence, ie the layout of the disk will actually depend on how long it took you to write the tree. I'm not saying it's a bad heuristic - it's probably a fine (and certainly simple) one. But the thought that when the NFS server has problems, a straight "cp -a" of the same tree results in different layout just because the server was moved over from one network to another makes me go "Ewww.." Linus ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-06 5:48 ` Linus Torvalds @ 2001-11-06 7:34 ` Mike Castle 0 siblings, 0 replies; 78+ messages in thread From: Mike Castle @ 2001-11-06 7:34 UTC (permalink / raw) To: linux-kernel On Mon, Nov 05, 2001 at 09:48:40PM -0800, Linus Torvalds wrote: > I'm not saying it's a bad heuristic - it's probably a fine (and certainly > simple) one. But the thought that when the NFS server has problems, a > straight "cp -a" of the same tree results in different layout just because > the server was moved over from one network to another makes me go "Ewww.." The layout is most likely going to be different anyway, isn't it? We don't know what has gone on in the past to get the FS into the current state. But now we have more information than we did when the file system was originally built. We don't have an extent based interface do we? So we can't say "I know this file is going to be X bytes in size." But if we accomplish nearly the same thing by a quick copy like this, what's the harm? mrc -- Mike Castle dalgoda@ix.netcom.com www.netcom.com/~dalgoda/ We are all of us living in the shadow of Manhattan. -- Watchmen fatal ("You are in a maze of twisty compiler features, all different"); -- gcc ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-06 5:31 ` Andrew Morton 2001-11-06 5:48 ` Linus Torvalds @ 2001-11-06 7:10 ` Kai Henningsen 1 sibling, 0 replies; 78+ messages in thread From: Kai Henningsen @ 2001-11-06 7:10 UTC (permalink / raw) To: torvalds; +Cc: linux-kernel akpm@zip.com.au (Andrew Morton) wrote on 05.11.01 in <3BE77599.9CFB5CA9@zip.com.au>: > Linus Torvalds wrote: > > > > I do believe that there are probably more "high-level" heuristics that can > > be useful, though. Looking at man common big trees, the ownership issue is > > one obvious clue. Sadly the trees obviously end up being _created_ without > > owner information, and the chown/chgrp happens later, but it might still > > be useable for some clues. Size of the parent directory might be another clue. > I didn't understand your objection to the heuristic "was the > parent directory created within the past 30 seconds?". If the > parent and child were created at the same time, chances are that > they'll be accessed at the same time? Thought experiment: Put stuff on a disk the usual slow way. Backup. Mkfs. Restore. Should the allocation pattern now be different? Why? > And there's always the `chattr' cop-out, to alter the allocation > policy at a chosen point in the tree by administrative act. Much help that's going to be in the above scenario, given how tar calls chattr ... not. > Any change in ext2 allocation policy at this point in time really, > really worries me. If it screws up we'll have 10,000,000 linux > boxes running like dead dogs in a year. So if we _do_ make a change, I'd > suggest that it be opt-in. Call me a wimp. Well, with Alex' cleanups, switchable policies might just become possible. MfG Kai ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-06 1:33 ` Alexander Viro 2001-11-06 2:10 ` Linus Torvalds @ 2001-11-09 22:35 ` Riley Williams 1 sibling, 0 replies; 78+ messages in thread From: Riley Williams @ 2001-11-09 22:35 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Linux Kernel Hi Al, Linus. >> For example, spreading out (and the inherent assumption of "slow >> growth") might make sense for the root directory, and possibly for a >> level below that. It almost certainly does _not_ make sense for a >> directory created four levels down. > Depends on the naming scheme used by admin, for one thing. Linus, I > seriously think that getting the real-life traces to play with would > be a Good Thing(tm) - at least that would allow to test how well > would heuristics of that kind do. Let's be realistic here - you two seem to be talking at cross-purposes, and that will get us precisely nowhere. Linus: If I'm right, your "root directory" refers to the root directory of a particular partition. Correct? Al: If I'm right, your "root directory" refers to the root directory of the entire VFS tree. Correct? If you can try reading each other's comments with both of the above definitions, I think you'll see why you're clearly drifting apart from each other in your comments. Best wishes from Riley. ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-05 23:50 ` Linus Torvalds 2001-11-06 0:03 ` Linus Torvalds @ 2001-11-06 1:28 ` Alexander Viro 2001-11-06 9:16 ` Wojtek Pilorz 2001-11-08 12:51 ` Pavel Machek 2 siblings, 1 reply; 78+ messages in thread From: Alexander Viro @ 2001-11-06 1:28 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel On Mon, 5 Nov 2001, Linus Torvalds wrote: > Fast-growth is what we should care about from a performance standpoint - > slow growth can be sanely handled with slower-paced maintenance issues > (including approaches like "defragment the disk once a month"). Ehh... Slow growth can involve a _lot_ of IO and file creation/removal, just no directory tree changes. > By definition, "slow growth" is amenable to defragmentation. And by > definition, fast growth isn't. And if we're talking about speedups on the > order of _five times_ for something as simple as a "cvs co", I don't see > where your "pure hell" comes in. Take a look at the allocator for non-directory inodes. > A five-time slowdown on real work _is_ pure hell. You've not shown a > credible argument that the slow-growth behaviour would ever result in a > five-time slowdown for _anything_. find(1). Stuff that runs every damn night. We end up doing getdents() + bunch of lstat() on the entries we had found. And that means access to inodes referred from them. Try that on a large directory with child inodes spread all over the disk. Linus, I'm not saying that the fast-growth scenario doesn't need to be handled or that 5 times slowdown on cvs co is OK. We need a decent heuristic to distinguish between the slow and fast variants. BTW, the problem is not new or Linux-specific. IIRC, there were a couple of papers by folks who gathered real-life traces of allocations/freeing of objects on different boxen and had run them through several allocators. I'll try to locate that stuff and see if we can get to their raw data. OK? 
I'd really like to have something beyond "current allocator sucks for fast-growth" to deal with - especially if we start inventing heuristics for switching between the fast and slow modes. ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-06 1:28 ` Alexander Viro @ 2001-11-06 9:16 ` Wojtek Pilorz 2001-11-06 9:58 ` Alexander Viro 0 siblings, 1 reply; 78+ messages in thread From: Wojtek Pilorz @ 2001-11-06 9:16 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, linux-kernel On Mon, 5 Nov 2001, Alexander Viro wrote: [...] > > find(1). Stuff that runs every damn night. We end up doing getdents() + > bunch of lstat() on the entries we had found. And that means access to > inodes refered from them. Try that on a large directory with child > inodes spread all over the disk. > [...] Part of the problem is that, as far as I know, there is no sane way to request the kernel to execute a number of lstat-s or similar calls in the order that would be convenient to the system (I do not consider creating threads for this purpose a sane way). If a program (say find, or tar, or anything) needs information from 5000 lstats or fstats or reads, it has to specify them one-by-one in some order, without knowledge of which order would be best. If we had a call to execute a vector of lstats, or opens, or reads (from different handles), a program which would use such calls would allow the kernel to decide the best order of individual operations. I remember that using similar 'vector' functions in the old days on IBM OS/MVT gave dramatic performance improvements (maybe for different reasons; machine memories were often half a megabyte, so opening a file required the system to read and execute several modules loaded from system disks; when opening 10 files each of the modules had to be loaded once rather than 10 times). Would it be possible and a good idea to add such 'vector' operations to the Linux kernel? Best regards, Wojtek -------------------- Wojtek Pilorz Wojtek.Pilorz@bdk.pl ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-06 9:16 ` Wojtek Pilorz @ 2001-11-06 9:58 ` Alexander Viro 0 siblings, 0 replies; 78+ messages in thread From: Alexander Viro @ 2001-11-06 9:58 UTC (permalink / raw) To: Wojtek Pilorz; +Cc: Linus Torvalds, linux-kernel On Tue, 6 Nov 2001, Wojtek Pilorz wrote: > Would it be possible and a good idea to add such 'vector' operations to > the Linux kernel? Occam's Razor and all such. Plus portability problems. Besides, you get enough information from getdents() if you really want to play silly games with that. ^ permalink raw reply [flat|nested] 78+ messages in thread
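Al's getdents() remark refers to the fact that the directory entries returned by getdents() already carry each entry's inode number (d_ino), so a nightly find-style scan could sort its entries by inode number and issue its lstat() calls in roughly on-disk order, with no new vector syscall. The sorting step, sketched with a simplified stand-in entry type:

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for a directory entry: only what the sort needs. */
struct entry {
    unsigned long d_ino;
    const char *name;
};

static int by_ino(const void *a, const void *b)
{
    unsigned long ia = ((const struct entry *)a)->d_ino;
    unsigned long ib = ((const struct entry *)b)->d_ino;
    return (ia > ib) - (ia < ib);
}

/* Sort directory entries by inode number so the subsequent lstat()
 * calls walk the inode tables in roughly ascending disk order. */
static void sort_for_stat(struct entry *e, size_t n)
{
    qsort(e, n, sizeof(*e), by_ino);
}
```

Since ext2 inode numbers map directly to block-group inode table positions, ascending inode order approximates ascending disk order, which is presumably the "silly game" Al has in mind.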
* Re: [Ext2-devel] disk throughput 2001-11-05 23:50 ` Linus Torvalds 2001-11-06 0:03 ` Linus Torvalds 2001-11-06 1:28 ` Alexander Viro @ 2001-11-08 12:51 ` Pavel Machek 2 siblings, 0 replies; 78+ messages in thread From: Pavel Machek @ 2001-11-08 12:51 UTC (permalink / raw) To: Linus Torvalds; +Cc: Alexander Viro, linux-kernel Hi! > By definition, "slow growth" is amenable to defragmentation. And by > definition, fast growth isn't. And if we're talking about speedups on the Heh, last time I looked, defrage2fs crashed on a >2G partition and was clearly not maintained. Have things changed? -- Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt, details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html. ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-05 20:28 ` m 2001-11-05 21:39 ` Andrew Morton @ 2001-11-06 21:48 ` Stephen Tweedie 2001-11-06 23:17 ` ext2/ialloc.c cleanup Alexander Viro 1 sibling, 1 reply; 78+ messages in thread From: Stephen Tweedie @ 2001-11-06 21:48 UTC (permalink / raw) To: m Cc: Andreas Dilger, Andrew Morton, Albert D. Cahalan, Mike Fedyk, lkml, ext2-devel Hi, On Tue, Nov 06, 2001 at 07:28:02AM +1100, m@mo.optusnet.com.au wrote: > Another heuristic to try may be to only use a different blockgroup > when the mkdir()s are separate in time. If I'm creating a hierarchy with "tar xvzf", why am I doing it? I might be unpacking a source tree where all the files are going to be used together. But I might be restoring a backup of /home. Assuming that we want a compact unpacking is not always going to be correct. Cheers, Stephen ^ permalink raw reply [flat|nested] 78+ messages in thread
* ext2/ialloc.c cleanup 2001-11-06 21:48 ` Stephen Tweedie @ 2001-11-06 23:17 ` Alexander Viro 2001-11-07 19:34 ` [Ext2-devel] " Andreas Dilger 0 siblings, 1 reply; 78+ messages in thread From: Alexander Viro @ 2001-11-06 23:17 UTC (permalink / raw) To: Stephen Tweedie Cc: ext2-devel, m, Andreas Dilger, Andrew Morton, Albert D. Cahalan, Mike Fedyk, lkml Folks, promised cleanup of ialloc.c is on ftp.math.psu.edu:pub/viro/ialloc.c,v And yes, it's in RCS. The thing is split into _really_ small steps (commented in the log). Each is a trivial transformation and it should be very easy to verify correctness of any of them. Please, review. IMO it's cut fine enough to make the gradual merge possible for 2.4 - look and you'll see. ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] ext2/ialloc.c cleanup 2001-11-06 23:17 ` ext2/ialloc.c cleanup Alexander Viro @ 2001-11-07 19:34 ` Andreas Dilger 2001-11-07 20:02 ` Alexander Viro 0 siblings, 1 reply; 78+ messages in thread From: Andreas Dilger @ 2001-11-07 19:34 UTC (permalink / raw) To: Alexander Viro Cc: Stephen Tweedie, ext2-devel, m, Andrew Morton, Albert D. Cahalan, Mike Fedyk, lkml On Nov 06, 2001 18:17 -0500, Alexander Viro wrote: > Folks, promised cleanup of ialloc.c is on > ftp.math.psu.edu:pub/viro/ialloc.c,v > > And yes, it's in RCS. The thing is split into _really_ small > steps (commented in the log). Each is a trivial transformation > and it should be very easy to verify correctness of any of > them. > > Please, review. IMO it's cut fine enough to make the gradual merge > possible for 2.4 - look and you'll see. Minor nits, from my changes to this same function: 1) please replace use of "i" for best block group in find_cg_*, to something better like "group", just for clarity. 2) in find_cg_*, when you fail the quadratic search, the linear search should skip groups that were previously checked in the quadratic search, with slight changes to both loops: start = dir->u.ext2_i.i_block_group; /* Use quadratic hash to find group with a free inode */ for (j = 1; j < count; j <<= 1) { group = start + j; if (group >= count) group -= count; cg = ext2_get_group_desc(sb, group, &bh); if (cg && le16_to_cpu(cg->bg_free_inodes_count)) goto found; } /* That failed: try linear search for a free inode * skipping groups we checked in the previous loop. */ for (j = 3; j < count; j++) { if ((j & (j - 1)) == 0) continue; group = start + j; if (group >= count) group -= count; cg = ext2_get_group_desc(sb, group, &bh); if (cg && le16_to_cpu(cg->bg_free_inodes_count)) goto found; } 3) I know that "cylinder groups" were used in old FFS/whatever implementation, but all of the ext2 code/documentation refers to these as block groups. Can you stick with that for ext2 (e.g. 
gdp, not cg; bg_foo, not cg_foo)? 4) sbi can be gotten by "EXT2_SB(sb)". Cheers, Andreas -- Andreas Dilger http://sourceforge.net/projects/ext2resize/ http://www-mddsp.enel.ucalgary.ca/People/adilger/ ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] ext2/ialloc.c cleanup 2001-11-07 19:34 ` [Ext2-devel] " Andreas Dilger @ 2001-11-07 20:02 ` Alexander Viro 2001-11-08 2:06 ` Andrew Morton 0 siblings, 1 reply; 78+ messages in thread From: Alexander Viro @ 2001-11-07 20:02 UTC (permalink / raw) To: Andreas Dilger Cc: Stephen Tweedie, ext2-devel, m, Andrew Morton, Albert D. Cahalan, Mike Fedyk, lkml On Wed, 7 Nov 2001, Andreas Dilger wrote: > Minor nits, from my changes to this same function: > 1) please replace use of "i" for best block group in find_cg_*, to > something better like "group", just for clarity. Consider that done. > 2) in find_cg_*, when you fail the quadratic search, the linear search > should skip groups that were previously checked in the quadratic search, > with slight changes to both loops: I'm not actually sure that it's a good thing. The difference between the sequences is that I do n, n+1, n+3, n+7, ..., then n+2 (linear), and you do n, n+1, n+2, n+4, n+8, ..., then n+3 (linear), which has slightly worse properties. You avoid the duplicated check on n+3, but lose a very nice property - shifting the old sequence is guaranteed not to have many intersections with the original in the beginning (the distances between elements do not repeat). With your sequence that's no longer true. > 3) I know that "cylinder groups" were used in old FFS/whatever implementation, > but all of the ext2 code/documentation refers to these as block groups. > Can you stick with that for ext2 (e.g. gdp, not cg; bg_foo, not cg_foo)? Ehh... Try to read that aloud. Maybe it's just me, but "gdp" sounds (and looks) bad... > 4) sbi can be gotten by "EXT2_SB(sb)". True, consider that done. Right now I'm working on an alternative strategy for directory allocation; as soon as I finish it I'll put the result in the usual place. ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] ext2/ialloc.c cleanup 2001-11-07 20:02 ` Alexander Viro @ 2001-11-08 2:06 ` Andrew Morton 2001-11-08 20:45 ` Andrew Morton 0 siblings, 1 reply; 78+ messages in thread From: Andrew Morton @ 2001-11-08 2:06 UTC (permalink / raw) To: Alexander Viro; +Cc: ext2-devel, lkml Alexander Viro wrote: > > Right now I'm working on an alternative strategy for directory allocation; > > as soon as I finish it I'll put the result in the usual place. I'll be interested in seeing the algorithm. Keith Smith has kindly directed me to the current location of his testing workload. This attempts to replay ten months' worth of real file server activity. It's described in his paper, at http://www.eecs.harvard.edu/~keith/usenix96 It's all set up for a 512 megabyte filesystem. So I ran it sixteen times in succession on sixteen different directories at the top level of an eight-gig filesystem. It takes about five hours. The filesystem ends up 76% full. I did this on two filesystems: one with stock 2.4.14 and the other with the `if (0 &&' change. The patched kernel was consistently 10-15% faster at running the workload. Now, lots of researchers seem to define lots of different ways of measuring fragmentation, and they are all rather unconvincing - they don't take into account the extreme non-linear behaviour of modern disks. So I simply measured how long it took to read some files. Which doesn't give any insight into the nature and sources of fragmentation, but it is the bottom line. And reading nine thousand files containing three hundred megs of data from the filesystem which was populated with the `if (0 &&' patch takes 1.5x to 2x as long as it does from stock 2.4.14. So that idea is a non-starter. This seems to be a pretty good fragmentation test and I'm inclined to stick with it. I suspect that the current algorithm, with some hysteresis to make it less inclined to hop into a different block group, is the way to go. - ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] ext2/ialloc.c cleanup 2001-11-08 2:06 ` Andrew Morton @ 2001-11-08 20:45 ` Andrew Morton 2001-11-08 22:16 ` Alexander Viro 0 siblings, 1 reply; 78+ messages in thread From: Andrew Morton @ 2001-11-08 20:45 UTC (permalink / raw) To: Alexander Viro, ext2-devel, lkml OK, some more columns of numbers. The 10-month simulated workload generates a total of 328 megs of data. This workload was applied seventeen times in succession to sequentially numbered directories at the top of an empty 8 gigabyte ext2 fs. Each top-level directory ends up containing 62 directories and 8757 files. After the seventeen directories were created, the disk was 75% full. To assess the fragmentation we see how long it takes to read all the files in each top-level directory, cache-cold, using

	find . -type f | xargs cat > /dev/null

We can find the "ideal" time for running the above command by copying the 328 megs of files onto a pristine partition using an ext2 which has the `if (0 &&' patch and then timing how long that takes to read. It's 13 seconds, which is pretty darn good. Reading an equivalently-sized single file takes 12.6 seconds, so we're maxing out the disk.

	Tree	Stock	Stock/	Patched	Patched/
		(secs)	ideal	(secs)	Stock
	0	37	2.9	74	2.0
	1	41	3.2	89	2.2
	2	41	3.2	97	2.4
	3	38	3.0	97	2.6
	4	55	4.3	102	1.9
	5	51	4.0	98	1.9
	6	72	5.7	94	1.3
	7	56	4.4	96	1.7
	8	64	5.1	102	1.6
	9	63	5.0	100	1.6
	10	57	4.5	95	1.7
	11	83	6.6	102	1.2
	12	54	4.3	101	1.9
	13	82	6.5	104	1.3
	14	89	7.1	103	1.2
	15	88	7.0	95	1.1
	16	106	8.4	99	0.9
	Mean:	63.35	4.9	96.4	1.7

So.

- Even after a single iteration of the Smith test, ext2 fragmentation has reduced efficiency by a factor of three. And the disk was never more than 5% full during the creation of these files.

- By the time the disk is 75% full, read efficiency is down by a factor of over eight. Given that the most-recently created files are also the most-frequently accessed, this hurts.
- The `if (0 &&' patch makes things 1.7x worse, but it never got worse than stock ext2's worst case. Directory and inode fragmentation is not significant. The time to run `du -s' against tree number 8 is two seconds. This seems to be fertile ground for improvement; however, it looks as though even a single pass of the Smith test has fragged the disk to death, and this is perhaps not a case we want to optimise for. The total amount of data which a single pass of the Smith workload passes into write() is approx 52 gigabytes. Thankfully, very little of that actually hits disk, because it's deleted before it gets written back; but that's fine for this exercise. So after writing 52 gigs of data and removing 51.7 gigs, an 8 gig ext2 filesystem is badly fragmented. Does this mean that the tale about ext2 being resistant to fragmentation is a myth? Or is this workload simply asking too much? If the answer to the first question is yes, then one solution is to change ext2 block allocation to favour the fast-growth case and develop an online defrag tool. It seems we have looked at both ends of the spectrum: the ultimate case of fast growth and the ultimate case of slow growth. In both cases, ext2 block allocation is running 3x to 5x worse than ideal. Where is its sweet spot? - ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] ext2/ialloc.c cleanup 2001-11-08 20:45 ` Andrew Morton @ 2001-11-08 22:16 ` Alexander Viro 2001-11-08 22:43 ` Andreas Dilger 0 siblings, 1 reply; 78+ messages in thread From: Alexander Viro @ 2001-11-08 22:16 UTC (permalink / raw) To: Andrew Morton; +Cc: ext2-devel, lkml OK, here's something that might be worth testing. It isn't a final version and it most likely needs tuning, but... Anyway, it builds, boots and seems to be working. The worst it can do is to make bad choices for directory allocation ;-) Comparing this beast (and its variants with different parameters) with standard allocator would be interesting. diff -urN S14/fs/ext2/ialloc.c S14-ext2/fs/ext2/ialloc.c --- S14/fs/ext2/ialloc.c Tue Oct 9 21:47:26 2001 +++ S14-ext2/fs/ext2/ialloc.c Thu Nov 8 15:57:13 2001 @@ -17,7 +17,7 @@ #include <linux/ext2_fs.h> #include <linux/locks.h> #include <linux/quotaops.h> - +#include <linux/random.h> /* * ialloc.c contains the inodes allocation and deallocation routines @@ -39,37 +39,26 @@ * Read the inode allocation bitmap for a given block_group, reading * into the specified slot in the superblock's bitmap cache. * - * Return >=0 on success or a -ve error code. + * Return buffer_head of bitmap on success or NULL. 
*/ -static int read_inode_bitmap (struct super_block * sb, - unsigned long block_group, - unsigned int bitmap_nr) +static struct buffer_head *read_inode_bitmap (struct super_block * sb, + unsigned long block_group) { struct ext2_group_desc * gdp; struct buffer_head * bh = NULL; - int retval = 0; gdp = ext2_get_group_desc (sb, block_group, NULL); - if (!gdp) { - retval = -EIO; + if (!gdp) goto error_out; - } + bh = bread (sb->s_dev, le32_to_cpu(gdp->bg_inode_bitmap), sb->s_blocksize); - if (!bh) { + if (!bh) ext2_error (sb, "read_inode_bitmap", "Cannot read inode bitmap - " "block_group = %lu, inode_bitmap = %lu", block_group, (unsigned long) gdp->bg_inode_bitmap); - retval = -EIO; - } - /* - * On IO error, just leave a zero in the superblock's block pointer for - * this group. The IO will be retried next time. - */ error_out: - sb->u.ext2_sb.s_inode_bitmap_number[bitmap_nr] = block_group; - sb->u.ext2_sb.s_inode_bitmap[bitmap_nr] = bh; - return retval; + return bh; } /* @@ -83,79 +72,62 @@ * 2/ If the file system contains less than EXT2_MAX_GROUP_LOADED groups, * this function reads the bitmap without maintaining a LRU cache. * - * Return the slot used to store the bitmap, or a -ve error code. 
+ * Return the buffer_head of the bitmap or the ERR_PTR(error) */ -static int load_inode_bitmap (struct super_block * sb, - unsigned int block_group) +static struct buffer_head *load_inode_bitmap (struct super_block * sb, + unsigned int block_group) { - int i, j, retval = 0; - unsigned long inode_bitmap_number; - struct buffer_head * inode_bitmap; + int i, slot = 0; + struct ext2_sb_info *sbi = &sb->u.ext2_sb; + struct buffer_head *bh = sbi->s_inode_bitmap[0]; - if (block_group >= sb->u.ext2_sb.s_groups_count) + if (block_group >= sbi->s_groups_count) ext2_panic (sb, "load_inode_bitmap", "block_group >= groups_count - " "block_group = %d, groups_count = %lu", - block_group, sb->u.ext2_sb.s_groups_count); - if (sb->u.ext2_sb.s_loaded_inode_bitmaps > 0 && - sb->u.ext2_sb.s_inode_bitmap_number[0] == block_group && - sb->u.ext2_sb.s_inode_bitmap[0] != NULL) - return 0; - if (sb->u.ext2_sb.s_groups_count <= EXT2_MAX_GROUP_LOADED) { - if (sb->u.ext2_sb.s_inode_bitmap[block_group]) { - if (sb->u.ext2_sb.s_inode_bitmap_number[block_group] != block_group) - ext2_panic (sb, "load_inode_bitmap", - "block_group != inode_bitmap_number"); - else - return block_group; - } else { - retval = read_inode_bitmap (sb, block_group, - block_group); - if (retval < 0) - return retval; - return block_group; - } + block_group, sbi->s_groups_count); + + if (sbi->s_loaded_inode_bitmaps > 0 && + sbi->s_inode_bitmap_number[0] == block_group && bh) + goto found; + + if (sbi->s_groups_count <= EXT2_MAX_GROUP_LOADED) { + slot = block_group; + bh = sbi->s_inode_bitmap[slot]; + if (!bh) + goto read_it; + if (sbi->s_inode_bitmap_number[slot] == slot) + goto found; + ext2_panic (sb, "load_inode_bitmap", + "block_group != inode_bitmap_number"); } - for (i = 0; i < sb->u.ext2_sb.s_loaded_inode_bitmaps && - sb->u.ext2_sb.s_inode_bitmap_number[i] != block_group; + bh = NULL; + for (i = 0; i < sbi->s_loaded_inode_bitmaps && + sbi->s_inode_bitmap_number[i] != block_group; i++) ; - if (i < 
sb->u.ext2_sb.s_loaded_inode_bitmaps && - sb->u.ext2_sb.s_inode_bitmap_number[i] == block_group) { - inode_bitmap_number = sb->u.ext2_sb.s_inode_bitmap_number[i]; - inode_bitmap = sb->u.ext2_sb.s_inode_bitmap[i]; - for (j = i; j > 0; j--) { - sb->u.ext2_sb.s_inode_bitmap_number[j] = - sb->u.ext2_sb.s_inode_bitmap_number[j - 1]; - sb->u.ext2_sb.s_inode_bitmap[j] = - sb->u.ext2_sb.s_inode_bitmap[j - 1]; - } - sb->u.ext2_sb.s_inode_bitmap_number[0] = inode_bitmap_number; - sb->u.ext2_sb.s_inode_bitmap[0] = inode_bitmap; - - /* - * There's still one special case here --- if inode_bitmap == 0 - * then our last attempt to read the bitmap failed and we have - * just ended up caching that failure. Try again to read it. - */ - if (!inode_bitmap) - retval = read_inode_bitmap (sb, block_group, 0); - - } else { - if (sb->u.ext2_sb.s_loaded_inode_bitmaps < EXT2_MAX_GROUP_LOADED) - sb->u.ext2_sb.s_loaded_inode_bitmaps++; - else - brelse (sb->u.ext2_sb.s_inode_bitmap[EXT2_MAX_GROUP_LOADED - 1]); - for (j = sb->u.ext2_sb.s_loaded_inode_bitmaps - 1; j > 0; j--) { - sb->u.ext2_sb.s_inode_bitmap_number[j] = - sb->u.ext2_sb.s_inode_bitmap_number[j - 1]; - sb->u.ext2_sb.s_inode_bitmap[j] = - sb->u.ext2_sb.s_inode_bitmap[j - 1]; - } - retval = read_inode_bitmap (sb, block_group, 0); - } - return retval; + if (i < sbi->s_loaded_inode_bitmaps) + bh = sbi->s_inode_bitmap[i]; + else if (sbi->s_loaded_inode_bitmaps < EXT2_MAX_GROUP_LOADED) + sbi->s_loaded_inode_bitmaps++; + else + brelse (sbi->s_inode_bitmap[--i]); + + while (i--) { + sbi->s_inode_bitmap_number[i+1] = sbi->s_inode_bitmap_number[i]; + sbi->s_inode_bitmap[i+1] = sbi->s_inode_bitmap[i]; + } + +read_it: + if (!bh) + bh = read_inode_bitmap (sb, block_group); + sbi->s_inode_bitmap_number[slot] = block_group; + sbi->s_inode_bitmap[slot] = bh; + if (!bh) + return ERR_PTR(-EIO); +found: + return bh; } /* @@ -183,7 +155,6 @@ struct buffer_head * bh2; unsigned long block_group; unsigned long bit; - int bitmap_nr; struct ext2_group_desc 
* gdp; struct ext2_super_block * es; @@ -215,12 +186,10 @@ } block_group = (ino - 1) / EXT2_INODES_PER_GROUP(sb); bit = (ino - 1) % EXT2_INODES_PER_GROUP(sb); - bitmap_nr = load_inode_bitmap (sb, block_group); - if (bitmap_nr < 0) + bh = load_inode_bitmap (sb, block_group); + if (IS_ERR(bh)) goto error_return; - bh = sb->u.ext2_sb.s_inode_bitmap[bitmap_nr]; - /* Ok, now we can actually update the inode bitmaps.. */ if (!ext2_clear_bit (bit, bh->b_data)) ext2_error (sb, "ext2_free_inode", @@ -230,9 +199,11 @@ if (gdp) { gdp->bg_free_inodes_count = cpu_to_le16(le16_to_cpu(gdp->bg_free_inodes_count) + 1); - if (is_directory) + if (is_directory) { gdp->bg_used_dirs_count = cpu_to_le16(le16_to_cpu(gdp->bg_used_dirs_count) - 1); + sb->u.ext2_sb.s_dir_count--; + } } mark_buffer_dirty(bh2); es->s_free_inodes_count = @@ -259,23 +230,229 @@ * For other inodes, search forward from the parent directory\'s block * group to find a free inode. */ + +static int find_cg_dir(struct super_block *sb, const struct inode *parent) +{ + struct ext2_super_block * es = sb->u.ext2_sb.s_es; + int ngroups = sb->u.ext2_sb.s_groups_count; + int avefreei = le32_to_cpu(es->s_free_inodes_count) / ngroups; + struct ext2_group_desc *cg, *best_cg = NULL; + struct buffer_head *bh, *best_bh = NULL; + int i = -1, j; + + for (j = 0; j < ngroups; j++) { + cg = ext2_get_group_desc (sb, j, &bh); + if (!cg || !cg->bg_free_inodes_count) + continue; + if (le16_to_cpu(cg->bg_free_inodes_count) < avefreei) + continue; + if (!best_cg || + (le16_to_cpu(cg->bg_free_blocks_count) > + le16_to_cpu(best_cg->bg_free_blocks_count))) { + i = j; + best_cg = cg; + best_bh = bh; + } + } + if (!best_cg) + return -1; + + best_cg->bg_free_inodes_count = + cpu_to_le16(le16_to_cpu(best_cg->bg_free_inodes_count) - 1); + best_cg->bg_used_dirs_count = + cpu_to_le16(le16_to_cpu(best_cg->bg_used_dirs_count) + 1); + sb->u.ext2_sb.s_dir_count++; + mark_buffer_dirty(best_bh); + return i; +} + +/* + * Orlov's allocator for directories. 
+ * + * We always try to spread first-level directories: + * If there are directories with both free inodes and free blocks counts + * not worse than average we return one with smallest directory count. + * Otherwise we simply return a random group. + * + * For the rest rules look so: + * + * It's OK to put directory into a group unless + * it has too many directories already (max_dirs) or + * it has too few free inodes left (min_inodes) or + * it has too few free blocks left (min_blocks) or + * it's already running too large debt (max_debt). + * Parent's group is prefered, if it doesn't satisfy these + * conditions we search cyclically through the rest. If none + * of the groups look good we just look for a group with more + * free inodes than average (starting at parent's group). + * + * Debt is incremented each time we allocate a directory and decremented + * when we allocate an inode, within 0--255. + */ + +#define INODE_COST 64 +#define BLOCK_COST 256 + +static int find_cg_dir_orlov(struct super_block *sb, const struct inode *parent) +{ + int parent_cg = parent->u.ext2_i.i_block_group; + struct ext2_super_block * es = sb->u.ext2_sb.s_es; + int ngroups = sb->u.ext2_sb.s_groups_count; + int inodes_per_group = EXT2_INODES_PER_GROUP(sb); + int avefreei = le32_to_cpu(es->s_free_inodes_count) / ngroups; + int avefreeb = le32_to_cpu(es->s_free_blocks_count) / ngroups; + int blocks_per_dir; + int ndirs = sb->u.ext2_sb.s_dir_count; + int max_debt, max_dirs, min_blocks, min_inodes; + int group = -1, i; + struct ext2_group_desc *cg; + struct buffer_head *bh; + + if (parent == sb->s_root->d_inode) { + struct ext2_group_desc *best_cg = NULL; + struct buffer_head *best_bh = NULL; + int best_ndir = inodes_per_group; + int best_group = -1; + + for (group = 0; group < ngroups; group++) { + cg = ext2_get_group_desc (sb, group, &bh); + if (!cg || !cg->bg_free_inodes_count) + continue; + if (le16_to_cpu(cg->bg_used_dirs_count) >= best_ndir) + continue; + if 
(le16_to_cpu(cg->bg_free_inodes_count) < avefreei) + continue; + if (le16_to_cpu(cg->bg_free_blocks_count) < avefreeb) + continue; + best_group = group; + best_ndir = le16_to_cpu(cg->bg_used_dirs_count); + best_cg = cg; + best_bh = bh; + } + if (best_group >= 0) { + cg = best_cg; + bh = best_bh; + group = best_group; + goto found; + } + get_random_bytes(&group, sizeof(group)); + goto fallback; + } + + blocks_per_dir = (le32_to_cpu(es->s_blocks_count) - + le32_to_cpu(es->s_free_blocks_count)) / ndirs; + + max_dirs = ndirs / ngroups + inodes_per_group / 16; + min_inodes = avefreei - inodes_per_group / 4; + min_blocks = avefreeb - EXT2_BLOCKS_PER_GROUP(sb) / 4; + + max_debt = EXT2_BLOCKS_PER_GROUP(sb) / max(blocks_per_dir, BLOCK_COST); + if (max_debt * INODE_COST > inodes_per_group) + max_debt = inodes_per_group / INODE_COST; + if (max_debt > 255) + max_debt = 255; + if (max_debt == 0) + max_debt = 1; + + for (i = 0; i < ngroups; i++) { + group = (parent_cg + i) % ngroups; + cg = ext2_get_group_desc (sb, group, &bh); + if (!cg || !cg->bg_free_inodes_count) + continue; + if (sb->u.ext2_sb.debts[group] >= max_debt) + continue; + if (le16_to_cpu(cg->bg_used_dirs_count) >= max_dirs) + continue; + if (le16_to_cpu(cg->bg_free_inodes_count) < min_inodes) + continue; + if (le16_to_cpu(cg->bg_free_blocks_count) < min_blocks) + continue; + goto found; + } + +fallback: + for (i = 0; i < ngroups; i++) { + group = (parent_cg + i) % ngroups; + cg = ext2_get_group_desc (sb, group, &bh); + if (!cg || !cg->bg_free_inodes_count) + continue; + if (le16_to_cpu(cg->bg_free_inodes_count) >= avefreei) + goto found; + } + + return -1; + +found: + cg->bg_free_inodes_count = + cpu_to_le16(le16_to_cpu(cg->bg_free_inodes_count) - 1); + cg->bg_used_dirs_count = + cpu_to_le16(le16_to_cpu(cg->bg_used_dirs_count) + 1); + sb->u.ext2_sb.s_dir_count++; + mark_buffer_dirty(bh); + return group; +} + +static int find_cg_other(struct super_block *sb, const struct inode *parent) +{ + int parent_cg = 
parent->u.ext2_i.i_block_group; + int ngroups = sb->u.ext2_sb.s_groups_count; + struct ext2_group_desc *cg; + struct buffer_head *bh; + int i, j; + + /* + * Try to place the inode in its parent directory + */ + i = parent_cg; + cg = ext2_get_group_desc (sb, i, &bh); + if (cg && le16_to_cpu(cg->bg_free_inodes_count)) + goto found; + + /* + * Use a quadratic hash to find a group with a + * free inode + */ + for (j = 1; j < ngroups; j <<= 1) { + i += j; + if (i >= ngroups) + i -= ngroups; + cg = ext2_get_group_desc (sb, i, &bh); + if (cg && le16_to_cpu(cg->bg_free_inodes_count)) + goto found; + } + + /* + * That failed: try linear search for a free inode + */ + i = parent_cg + 1; + for (j = 2; j < ngroups; j++) { + if (++i >= ngroups) + i = 0; + cg = ext2_get_group_desc (sb, i, &bh); + if (cg && le16_to_cpu(cg->bg_free_inodes_count)) + goto found; + } + + return -1; + +found: + cg->bg_free_inodes_count = + cpu_to_le16(le16_to_cpu(cg->bg_free_inodes_count) - 1); + mark_buffer_dirty(bh); + return i; +} + struct inode * ext2_new_inode (const struct inode * dir, int mode) { struct super_block * sb; struct buffer_head * bh; struct buffer_head * bh2; - int i, j, avefreei; + int i, j; struct inode * inode; - int bitmap_nr; struct ext2_group_desc * gdp; - struct ext2_group_desc * tmp; struct ext2_super_block * es; int err; - /* Cannot create files in a deleted directory */ - if (!dir || !dir->i_nlink) - return ERR_PTR(-EPERM); - sb = dir->i_sb; inode = new_inode(sb); if (!inode) @@ -284,140 +461,52 @@ lock_super (sb); es = sb->u.ext2_sb.s_es; repeat: - gdp = NULL; i=0; - - if (S_ISDIR(mode)) { - avefreei = le32_to_cpu(es->s_free_inodes_count) / - sb->u.ext2_sb.s_groups_count; -/* I am not yet convinced that this next bit is necessary. 
- i = dir->u.ext2_i.i_block_group; - for (j = 0; j < sb->u.ext2_sb.s_groups_count; j++) { - tmp = ext2_get_group_desc (sb, i, &bh2); - if (tmp && - (le16_to_cpu(tmp->bg_used_dirs_count) << 8) < - le16_to_cpu(tmp->bg_free_inodes_count)) { - gdp = tmp; - break; - } - else - i = ++i % sb->u.ext2_sb.s_groups_count; - } -*/ - if (!gdp) { - for (j = 0; j < sb->u.ext2_sb.s_groups_count; j++) { - tmp = ext2_get_group_desc (sb, j, &bh2); - if (tmp && - le16_to_cpu(tmp->bg_free_inodes_count) && - le16_to_cpu(tmp->bg_free_inodes_count) >= avefreei) { - if (!gdp || - (le16_to_cpu(tmp->bg_free_blocks_count) > - le16_to_cpu(gdp->bg_free_blocks_count))) { - i = j; - gdp = tmp; - } - } - } - } - } + if (S_ISDIR(mode)) + i = find_cg_dir_orlov(sb, dir); else - { - /* - * Try to place the inode in its parent directory - */ - i = dir->u.ext2_i.i_block_group; - tmp = ext2_get_group_desc (sb, i, &bh2); - if (tmp && le16_to_cpu(tmp->bg_free_inodes_count)) - gdp = tmp; - else - { - /* - * Use a quadratic hash to find a group with a - * free inode - */ - for (j = 1; j < sb->u.ext2_sb.s_groups_count; j <<= 1) { - i += j; - if (i >= sb->u.ext2_sb.s_groups_count) - i -= sb->u.ext2_sb.s_groups_count; - tmp = ext2_get_group_desc (sb, i, &bh2); - if (tmp && - le16_to_cpu(tmp->bg_free_inodes_count)) { - gdp = tmp; - break; - } - } - } - if (!gdp) { - /* - * That failed: try linear search for a free inode - */ - i = dir->u.ext2_i.i_block_group + 1; - for (j = 2; j < sb->u.ext2_sb.s_groups_count; j++) { - if (++i >= sb->u.ext2_sb.s_groups_count) - i = 0; - tmp = ext2_get_group_desc (sb, i, &bh2); - if (tmp && - le16_to_cpu(tmp->bg_free_inodes_count)) { - gdp = tmp; - break; - } - } - } - } + i = find_cg_other(sb, dir); err = -ENOSPC; - if (!gdp) + if (i == -1) goto fail; err = -EIO; - bitmap_nr = load_inode_bitmap (sb, i); - if (bitmap_nr < 0) - goto fail; + bh = load_inode_bitmap (sb, i); + if (IS_ERR(bh)) + goto fail2; + + j = ext2_find_first_zero_bit ((unsigned long *) bh->b_data, + 
EXT2_INODES_PER_GROUP(sb)); + if (j >= EXT2_INODES_PER_GROUP(sb)) + goto bad_count; + ext2_set_bit (j, bh->b_data); - bh = sb->u.ext2_sb.s_inode_bitmap[bitmap_nr]; - if ((j = ext2_find_first_zero_bit ((unsigned long *) bh->b_data, - EXT2_INODES_PER_GROUP(sb))) < - EXT2_INODES_PER_GROUP(sb)) { - if (ext2_set_bit (j, bh->b_data)) { - ext2_error (sb, "ext2_new_inode", - "bit already set for inode %d", j); - goto repeat; - } - mark_buffer_dirty(bh); - if (sb->s_flags & MS_SYNCHRONOUS) { - ll_rw_block (WRITE, 1, &bh); - wait_on_buffer (bh); - } - } else { - if (le16_to_cpu(gdp->bg_free_inodes_count) != 0) { - ext2_error (sb, "ext2_new_inode", - "Free inodes count corrupted in group %d", - i); - /* Is it really ENOSPC? */ - err = -ENOSPC; - if (sb->s_flags & MS_RDONLY) - goto fail; - - gdp->bg_free_inodes_count = 0; - mark_buffer_dirty(bh2); - } - goto repeat; + mark_buffer_dirty(bh); + if (sb->s_flags & MS_SYNCHRONOUS) { + ll_rw_block (WRITE, 1, &bh); + wait_on_buffer (bh); } + j += i * EXT2_INODES_PER_GROUP(sb) + 1; if (j < EXT2_FIRST_INO(sb) || j > le32_to_cpu(es->s_inodes_count)) { ext2_error (sb, "ext2_new_inode", "reserved inode or inode > inodes count - " "block_group = %d,inode=%d", i, j); err = -EIO; - goto fail; + goto fail2; } - gdp->bg_free_inodes_count = - cpu_to_le16(le16_to_cpu(gdp->bg_free_inodes_count) - 1); - if (S_ISDIR(mode)) - gdp->bg_used_dirs_count = - cpu_to_le16(le16_to_cpu(gdp->bg_used_dirs_count) + 1); - mark_buffer_dirty(bh2); + es->s_free_inodes_count = cpu_to_le32(le32_to_cpu(es->s_free_inodes_count) - 1); + + if (S_ISDIR(mode)) { + if (sb->u.ext2_sb.debts[i] < 255) + sb->u.ext2_sb.debts[i]++; + } else { + if (sb->u.ext2_sb.debts[i]) + sb->u.ext2_sb.debts[i]--; + } + mark_buffer_dirty(sb->u.ext2_sb.s_sbh); sb->s_dirt = 1; inode->i_uid = current->fsuid; @@ -438,14 +527,7 @@ inode->u.ext2_i.i_new_inode = 1; inode->u.ext2_i.i_flags = dir->u.ext2_i.i_flags; if (S_ISLNK(mode)) - inode->u.ext2_i.i_flags &= ~(EXT2_IMMUTABLE_FL | EXT2_APPEND_FL); - 
inode->u.ext2_i.i_faddr = 0; - inode->u.ext2_i.i_frag_no = 0; - inode->u.ext2_i.i_frag_size = 0; - inode->u.ext2_i.i_file_acl = 0; - inode->u.ext2_i.i_dir_acl = 0; - inode->u.ext2_i.i_dtime = 0; - inode->u.ext2_i.i_prealloc_count = 0; + inode->u.ext2_i.i_flags &= ~(EXT2_IMMUTABLE_FL|EXT2_APPEND_FL); inode->u.ext2_i.i_block_group = i; if (inode->u.ext2_i.i_flags & EXT2_SYNC_FL) inode->i_flags |= S_SYNC; @@ -464,38 +546,59 @@ ext2_debug ("allocating inode %lu\n", inode->i_ino); return inode; +fail2: + gdp = ext2_get_group_desc (sb, i, &bh2); + gdp->bg_free_inodes_count = + cpu_to_le16(le16_to_cpu(gdp->bg_free_inodes_count) + 1); + if (S_ISDIR(mode)) { + gdp->bg_used_dirs_count = + cpu_to_le16(le16_to_cpu(gdp->bg_used_dirs_count) - 1); + sb->u.ext2_sb.s_dir_count--; + } + mark_buffer_dirty(bh2); fail: unlock_super(sb); make_bad_inode(inode); iput(inode); return ERR_PTR(err); + +bad_count: + ext2_error (sb, "ext2_new_inode", + "Free inodes count corrupted in group %d", + i); + /* Is it really ENOSPC? 
*/ + err = -ENOSPC; + if (sb->s_flags & MS_RDONLY) + goto fail; + + gdp = ext2_get_group_desc (sb, i, &bh2); + gdp->bg_free_inodes_count = 0; + mark_buffer_dirty(bh2); + goto repeat; } unsigned long ext2_count_free_inodes (struct super_block * sb) { #ifdef EXT2FS_DEBUG struct ext2_super_block * es; - unsigned long desc_count, bitmap_count, x; - int bitmap_nr; - struct ext2_group_desc * gdp; + unsigned long desc_count = 0, bitmap_count = 0; int i; lock_super (sb); es = sb->u.ext2_sb.s_es; - desc_count = 0; - bitmap_count = 0; - gdp = NULL; for (i = 0; i < sb->u.ext2_sb.s_groups_count; i++) { - gdp = ext2_get_group_desc (sb, i, NULL); + struct ext2_group_desc *gdp = ext2_get_group_desc (sb, i, NULL); + struct buffer_head *bh; + unsigned x; + if (!gdp) continue; desc_count += le16_to_cpu(gdp->bg_free_inodes_count); - bitmap_nr = load_inode_bitmap (sb, i); - if (bitmap_nr < 0) + bh = load_inode_bitmap (sb, i); + if (IS_ERR(bh)) continue; - x = ext2_count_free (sb->u.ext2_sb.s_inode_bitmap[bitmap_nr], - EXT2_INODES_PER_GROUP(sb) / 8); + x = ext2_count_free (bh, EXT2_INODES_PER_GROUP(sb) / 8); printk ("group %d: stored = %d, counted = %lu\n", i, le16_to_cpu(gdp->bg_free_inodes_count), x); bitmap_count += x; @@ -509,31 +612,42 @@ #endif } +/* Called at mount-time, super-block is locked */ +unsigned long ext2_count_dirs (struct super_block * sb) +{ + unsigned long count = 0; + int i; + + for (i = 0; i < sb->u.ext2_sb.s_groups_count; i++) { + struct ext2_group_desc *gdp = ext2_get_group_desc (sb, i, NULL); + if (!gdp) + continue; + count += le16_to_cpu(gdp->bg_used_dirs_count); + } + return count; +} + #ifdef CONFIG_EXT2_CHECK /* Called at mount-time, super-block is locked */ void ext2_check_inodes_bitmap (struct super_block * sb) { - struct ext2_super_block * es; - unsigned long desc_count, bitmap_count, x; - int bitmap_nr; - struct ext2_group_desc * gdp; + struct ext2_super_block * es = sb->u.ext2_sb.s_es; + unsigned long desc_count = 0, bitmap_count = 0; int i; - es = 
sb->u.ext2_sb.s_es; - desc_count = 0; - bitmap_count = 0; - gdp = NULL; for (i = 0; i < sb->u.ext2_sb.s_groups_count; i++) { - gdp = ext2_get_group_desc (sb, i, NULL); + struct ext2_group_desc *gdp = ext2_get_group_desc (sb, i, NULL); + struct buffer_head *bh; + unsigned x; + if (!gdp) continue; desc_count += le16_to_cpu(gdp->bg_free_inodes_count); - bitmap_nr = load_inode_bitmap (sb, i); - if (bitmap_nr < 0) + bh = load_inode_bitmap (sb, i); + if (IS_ERR(bh)) continue; - x = ext2_count_free (sb->u.ext2_sb.s_inode_bitmap[bitmap_nr], - EXT2_INODES_PER_GROUP(sb) / 8); + x = ext2_count_free (bh, EXT2_INODES_PER_GROUP(sb) / 8); if (le16_to_cpu(gdp->bg_free_inodes_count) != x) ext2_error (sb, "ext2_check_inodes_bitmap", "Wrong free inodes count in group %d, " @@ -545,7 +659,7 @@ ext2_error (sb, "ext2_check_inodes_bitmap", "Wrong free inodes count in super block, " "stored = %lu, counted = %lu", - (unsigned long) le32_to_cpu(es->s_free_inodes_count), + (unsigned long)le32_to_cpu(es->s_free_inodes_count), bitmap_count); } #endif diff -urN S14/fs/ext2/super.c S14-ext2/fs/ext2/super.c --- S14/fs/ext2/super.c Tue Oct 9 21:47:26 2001 +++ S14-ext2/fs/ext2/super.c Thu Nov 8 15:29:07 2001 @@ -605,6 +605,12 @@ printk ("EXT2-fs: not enough memory\n"); goto failed_mount; } + sb->u.ext2_sb.debts = kmalloc(sb->u.ext2_sb.s_groups_count, GFP_KERNEL); + if (!sb->u.ext2_sb.debts) { + printk ("EXT2-fs: not enough memory\n"); + goto failed_mount; + } + memset(sb->u.ext2_sb.debts, 0, sb->u.ext2_sb.s_groups_count); for (i = 0; i < db_count; i++) { sb->u.ext2_sb.s_group_desc[i] = bread (dev, logic_sb_block + i + 1, sb->s_blocksize); @@ -621,6 +627,7 @@ db_count = i; goto failed_mount2; } + sb->u.ext2_sb.s_dir_count = ext2_count_dirs(sb); for (i = 0; i < EXT2_MAX_GROUP_LOADED; i++) { sb->u.ext2_sb.s_inode_bitmap_number[i] = 0; sb->u.ext2_sb.s_inode_bitmap[i] = NULL; diff -urN S14/include/linux/ext2_fs.h S14-ext2/include/linux/ext2_fs.h --- S14/include/linux/ext2_fs.h Thu Nov 8 16:33:29 2001 
+++ S14-ext2/include/linux/ext2_fs.h Thu Nov 8 16:20:16 2001 @@ -578,6 +578,7 @@ extern unsigned long ext2_count_free_inodes (struct super_block *); extern void ext2_check_inodes_bitmap (struct super_block *); extern unsigned long ext2_count_free (struct buffer_head *, unsigned); +extern unsigned long ext2_count_dirs (struct super_block *); /* inode.c */ extern void ext2_read_inode (struct inode *); diff -urN S14/include/linux/ext2_fs_sb.h S14-ext2/include/linux/ext2_fs_sb.h --- S14/include/linux/ext2_fs_sb.h Fri Feb 16 21:29:52 2001 +++ S14-ext2/include/linux/ext2_fs_sb.h Thu Nov 8 15:28:34 2001 @@ -56,6 +56,8 @@ int s_desc_per_block_bits; int s_inode_size; int s_first_ino; + unsigned long s_dir_count; + u8 *debts; }; #endif /* _LINUX_EXT2_FS_SB */ ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] ext2/ialloc.c cleanup 2001-11-08 22:16 ` Alexander Viro @ 2001-11-08 22:43 ` Andreas Dilger 2001-11-08 23:08 ` Alexander Viro 0 siblings, 1 reply; 78+ messages in thread From: Andreas Dilger @ 2001-11-08 22:43 UTC (permalink / raw) To: Alexander Viro; +Cc: Andrew Morton, ext2-devel, lkml On Nov 08, 2001 17:16 -0500, Alexander Viro wrote: > + get_random_bytes(&group, sizeof(group)); > + goto fallback; Little chance that 0 <= group < ngroups. Even so, value is unused, AFAICS. > +fallback: > + for (i = 0; i < ngroups; i++) { > + group = (parent_cg + i) % ngroups; > + cg = ext2_get_group_desc (sb, group, &bh); > + if (!cg || !cg->bg_free_inodes_count) > + continue; > + if (le16_to_cpu(cg->bg_free_inodes_count) >= avefreei) > + goto found; > + } > + > + return -1; > diff -urN S14/include/linux/ext2_fs_sb.h S14-ext2/include/linux/ext2_fs_sb.h > --- S14/include/linux/ext2_fs_sb.h Fri Feb 16 21:29:52 2001 > +++ S14-ext2/include/linux/ext2_fs_sb.h Thu Nov 8 15:28:34 2001 > @@ -56,6 +56,8 @@ > int s_desc_per_block_bits; > int s_inode_size; > int s_first_ino; > + unsigned long s_dir_count; > + u8 *debts; > }; I had thought a couple of times it may be useful to have an in-memory list of group descriptors, each with a pointer to the on-disk struct, like ext2_sb points to s_es. Cheers, Andreas -- Andreas Dilger http://sourceforge.net/projects/ext2resize/ http://www-mddsp.enel.ucalgary.ca/People/adilger/ ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] ext2/ialloc.c cleanup 2001-11-08 22:43 ` Andreas Dilger @ 2001-11-08 23:08 ` Alexander Viro 2001-11-09 6:15 ` Andrew Morton 0 siblings, 1 reply; 78+ messages in thread From: Alexander Viro @ 2001-11-08 23:08 UTC (permalink / raw) To: Andreas Dilger; +Cc: Andrew Morton, ext2-devel, lkml On Thu, 8 Nov 2001, Andreas Dilger wrote: > On Nov 08, 2001 17:16 -0500, Alexander Viro wrote: > > + get_random_bytes(&group, sizeof(group)); Grrr.... + parent_cg = group % ngroups; /me wonders how the heck had it disappeared... Corrected patch follows: diff -urN S14/fs/ext2/ialloc.c S14-ext2/fs/ext2/ialloc.c --- S14/fs/ext2/ialloc.c Tue Oct 9 21:47:26 2001 +++ S14-ext2/fs/ext2/ialloc.c Thu Nov 8 18:05:00 2001 @@ -17,7 +17,7 @@ #include <linux/ext2_fs.h> #include <linux/locks.h> #include <linux/quotaops.h> - +#include <linux/random.h> /* * ialloc.c contains the inodes allocation and deallocation routines @@ -39,37 +39,26 @@ * Read the inode allocation bitmap for a given block_group, reading * into the specified slot in the superblock's bitmap cache. * - * Return >=0 on success or a -ve error code. + * Return buffer_head of bitmap on success or NULL. */ -static int read_inode_bitmap (struct super_block * sb, - unsigned long block_group, - unsigned int bitmap_nr) +static struct buffer_head *read_inode_bitmap (struct super_block * sb, + unsigned long block_group) { struct ext2_group_desc * gdp; struct buffer_head * bh = NULL; - int retval = 0; gdp = ext2_get_group_desc (sb, block_group, NULL); - if (!gdp) { - retval = -EIO; + if (!gdp) goto error_out; - } + bh = bread (sb->s_dev, le32_to_cpu(gdp->bg_inode_bitmap), sb->s_blocksize); - if (!bh) { + if (!bh) ext2_error (sb, "read_inode_bitmap", "Cannot read inode bitmap - " "block_group = %lu, inode_bitmap = %lu", block_group, (unsigned long) gdp->bg_inode_bitmap); - retval = -EIO; - } - /* - * On IO error, just leave a zero in the superblock's block pointer for - * this group. The IO will be retried next time. 
- */ error_out: - sb->u.ext2_sb.s_inode_bitmap_number[bitmap_nr] = block_group; - sb->u.ext2_sb.s_inode_bitmap[bitmap_nr] = bh; - return retval; + return bh; } /* @@ -83,79 +72,62 @@ * 2/ If the file system contains less than EXT2_MAX_GROUP_LOADED groups, * this function reads the bitmap without maintaining a LRU cache. * - * Return the slot used to store the bitmap, or a -ve error code. + * Return the buffer_head of the bitmap or the ERR_PTR(error) */ -static int load_inode_bitmap (struct super_block * sb, - unsigned int block_group) +static struct buffer_head *load_inode_bitmap (struct super_block * sb, + unsigned int block_group) { - int i, j, retval = 0; - unsigned long inode_bitmap_number; - struct buffer_head * inode_bitmap; + int i, slot = 0; + struct ext2_sb_info *sbi = &sb->u.ext2_sb; + struct buffer_head *bh = sbi->s_inode_bitmap[0]; - if (block_group >= sb->u.ext2_sb.s_groups_count) + if (block_group >= sbi->s_groups_count) ext2_panic (sb, "load_inode_bitmap", "block_group >= groups_count - " "block_group = %d, groups_count = %lu", - block_group, sb->u.ext2_sb.s_groups_count); - if (sb->u.ext2_sb.s_loaded_inode_bitmaps > 0 && - sb->u.ext2_sb.s_inode_bitmap_number[0] == block_group && - sb->u.ext2_sb.s_inode_bitmap[0] != NULL) - return 0; - if (sb->u.ext2_sb.s_groups_count <= EXT2_MAX_GROUP_LOADED) { - if (sb->u.ext2_sb.s_inode_bitmap[block_group]) { - if (sb->u.ext2_sb.s_inode_bitmap_number[block_group] != block_group) - ext2_panic (sb, "load_inode_bitmap", - "block_group != inode_bitmap_number"); - else - return block_group; - } else { - retval = read_inode_bitmap (sb, block_group, - block_group); - if (retval < 0) - return retval; - return block_group; - } + block_group, sbi->s_groups_count); + + if (sbi->s_loaded_inode_bitmaps > 0 && + sbi->s_inode_bitmap_number[0] == block_group && bh) + goto found; + + if (sbi->s_groups_count <= EXT2_MAX_GROUP_LOADED) { + slot = block_group; + bh = sbi->s_inode_bitmap[slot]; + if (!bh) + goto read_it; + if 
(sbi->s_inode_bitmap_number[slot] == slot) + goto found; + ext2_panic (sb, "load_inode_bitmap", + "block_group != inode_bitmap_number"); } - for (i = 0; i < sb->u.ext2_sb.s_loaded_inode_bitmaps && - sb->u.ext2_sb.s_inode_bitmap_number[i] != block_group; + bh = NULL; + for (i = 0; i < sbi->s_loaded_inode_bitmaps && + sbi->s_inode_bitmap_number[i] != block_group; i++) ; - if (i < sb->u.ext2_sb.s_loaded_inode_bitmaps && - sb->u.ext2_sb.s_inode_bitmap_number[i] == block_group) { - inode_bitmap_number = sb->u.ext2_sb.s_inode_bitmap_number[i]; - inode_bitmap = sb->u.ext2_sb.s_inode_bitmap[i]; - for (j = i; j > 0; j--) { - sb->u.ext2_sb.s_inode_bitmap_number[j] = - sb->u.ext2_sb.s_inode_bitmap_number[j - 1]; - sb->u.ext2_sb.s_inode_bitmap[j] = - sb->u.ext2_sb.s_inode_bitmap[j - 1]; - } - sb->u.ext2_sb.s_inode_bitmap_number[0] = inode_bitmap_number; - sb->u.ext2_sb.s_inode_bitmap[0] = inode_bitmap; - - /* - * There's still one special case here --- if inode_bitmap == 0 - * then our last attempt to read the bitmap failed and we have - * just ended up caching that failure. Try again to read it. 
- */ - if (!inode_bitmap) - retval = read_inode_bitmap (sb, block_group, 0); - - } else { - if (sb->u.ext2_sb.s_loaded_inode_bitmaps < EXT2_MAX_GROUP_LOADED) - sb->u.ext2_sb.s_loaded_inode_bitmaps++; - else - brelse (sb->u.ext2_sb.s_inode_bitmap[EXT2_MAX_GROUP_LOADED - 1]); - for (j = sb->u.ext2_sb.s_loaded_inode_bitmaps - 1; j > 0; j--) { - sb->u.ext2_sb.s_inode_bitmap_number[j] = - sb->u.ext2_sb.s_inode_bitmap_number[j - 1]; - sb->u.ext2_sb.s_inode_bitmap[j] = - sb->u.ext2_sb.s_inode_bitmap[j - 1]; - } - retval = read_inode_bitmap (sb, block_group, 0); - } - return retval; + if (i < sbi->s_loaded_inode_bitmaps) + bh = sbi->s_inode_bitmap[i]; + else if (sbi->s_loaded_inode_bitmaps < EXT2_MAX_GROUP_LOADED) + sbi->s_loaded_inode_bitmaps++; + else + brelse (sbi->s_inode_bitmap[--i]); + + while (i--) { + sbi->s_inode_bitmap_number[i+1] = sbi->s_inode_bitmap_number[i]; + sbi->s_inode_bitmap[i+1] = sbi->s_inode_bitmap[i]; + } + +read_it: + if (!bh) + bh = read_inode_bitmap (sb, block_group); + sbi->s_inode_bitmap_number[slot] = block_group; + sbi->s_inode_bitmap[slot] = bh; + if (!bh) + return ERR_PTR(-EIO); +found: + return bh; } /* @@ -183,7 +155,6 @@ struct buffer_head * bh2; unsigned long block_group; unsigned long bit; - int bitmap_nr; struct ext2_group_desc * gdp; struct ext2_super_block * es; @@ -215,12 +186,10 @@ } block_group = (ino - 1) / EXT2_INODES_PER_GROUP(sb); bit = (ino - 1) % EXT2_INODES_PER_GROUP(sb); - bitmap_nr = load_inode_bitmap (sb, block_group); - if (bitmap_nr < 0) + bh = load_inode_bitmap (sb, block_group); + if (IS_ERR(bh)) goto error_return; - bh = sb->u.ext2_sb.s_inode_bitmap[bitmap_nr]; - /* Ok, now we can actually update the inode bitmaps.. 
*/ if (!ext2_clear_bit (bit, bh->b_data)) ext2_error (sb, "ext2_free_inode", @@ -230,9 +199,11 @@ if (gdp) { gdp->bg_free_inodes_count = cpu_to_le16(le16_to_cpu(gdp->bg_free_inodes_count) + 1); - if (is_directory) + if (is_directory) { gdp->bg_used_dirs_count = cpu_to_le16(le16_to_cpu(gdp->bg_used_dirs_count) - 1); + sb->u.ext2_sb.s_dir_count--; + } } mark_buffer_dirty(bh2); es->s_free_inodes_count = @@ -259,23 +230,230 @@ * For other inodes, search forward from the parent directory\'s block * group to find a free inode. */ + +static int find_cg_dir(struct super_block *sb, const struct inode *parent) +{ + struct ext2_super_block * es = sb->u.ext2_sb.s_es; + int ngroups = sb->u.ext2_sb.s_groups_count; + int avefreei = le32_to_cpu(es->s_free_inodes_count) / ngroups; + struct ext2_group_desc *cg, *best_cg = NULL; + struct buffer_head *bh, *best_bh = NULL; + int i = -1, j; + + for (j = 0; j < ngroups; j++) { + cg = ext2_get_group_desc (sb, j, &bh); + if (!cg || !cg->bg_free_inodes_count) + continue; + if (le16_to_cpu(cg->bg_free_inodes_count) < avefreei) + continue; + if (!best_cg || + (le16_to_cpu(cg->bg_free_blocks_count) > + le16_to_cpu(best_cg->bg_free_blocks_count))) { + i = j; + best_cg = cg; + best_bh = bh; + } + } + if (!best_cg) + return -1; + + best_cg->bg_free_inodes_count = + cpu_to_le16(le16_to_cpu(best_cg->bg_free_inodes_count) - 1); + best_cg->bg_used_dirs_count = + cpu_to_le16(le16_to_cpu(best_cg->bg_used_dirs_count) + 1); + sb->u.ext2_sb.s_dir_count++; + mark_buffer_dirty(best_bh); + return i; +} + +/* + * Orlov's allocator for directories. + * + * We always try to spread first-level directories: + * If there are directories with both free inodes and free blocks counts + * not worse than average we return one with smallest directory count. + * Otherwise we simply return a random group. 
+ * + * For the rest rules look so: + * + * It's OK to put directory into a group unless + * it has too many directories already (max_dirs) or + * it has too few free inodes left (min_inodes) or + * it has too few free blocks left (min_blocks) or + * it's already running too large debt (max_debt). + * Parent's group is prefered, if it doesn't satisfy these + * conditions we search cyclically through the rest. If none + * of the groups look good we just look for a group with more + * free inodes than average (starting at parent's group). + * + * Debt is incremented each time we allocate a directory and decremented + * when we allocate an inode, within 0--255. + */ + +#define INODE_COST 64 +#define BLOCK_COST 256 + +static int find_cg_dir_orlov(struct super_block *sb, const struct inode *parent) +{ + int parent_cg = parent->u.ext2_i.i_block_group; + struct ext2_super_block * es = sb->u.ext2_sb.s_es; + int ngroups = sb->u.ext2_sb.s_groups_count; + int inodes_per_group = EXT2_INODES_PER_GROUP(sb); + int avefreei = le32_to_cpu(es->s_free_inodes_count) / ngroups; + int avefreeb = le32_to_cpu(es->s_free_blocks_count) / ngroups; + int blocks_per_dir; + int ndirs = sb->u.ext2_sb.s_dir_count; + int max_debt, max_dirs, min_blocks, min_inodes; + int group = -1, i; + struct ext2_group_desc *cg; + struct buffer_head *bh; + + if (parent == sb->s_root->d_inode) { + struct ext2_group_desc *best_cg = NULL; + struct buffer_head *best_bh = NULL; + int best_ndir = inodes_per_group; + int best_group = -1; + + for (group = 0; group < ngroups; group++) { + cg = ext2_get_group_desc (sb, group, &bh); + if (!cg || !cg->bg_free_inodes_count) + continue; + if (le16_to_cpu(cg->bg_used_dirs_count) >= best_ndir) + continue; + if (le16_to_cpu(cg->bg_free_inodes_count) < avefreei) + continue; + if (le16_to_cpu(cg->bg_free_blocks_count) < avefreeb) + continue; + best_group = group; + best_ndir = le16_to_cpu(cg->bg_used_dirs_count); + best_cg = cg; + best_bh = bh; + } + if (best_group >= 0) { + cg = 
best_cg; + bh = best_bh; + group = best_group; + goto found; + } + get_random_bytes(&group, sizeof(group)); + parent_cg = group % ngroups; + goto fallback; + } + + blocks_per_dir = (le32_to_cpu(es->s_blocks_count) - + le32_to_cpu(es->s_free_blocks_count)) / ndirs; + + max_dirs = ndirs / ngroups + inodes_per_group / 16; + min_inodes = avefreei - inodes_per_group / 4; + min_blocks = avefreeb - EXT2_BLOCKS_PER_GROUP(sb) / 4; + + max_debt = EXT2_BLOCKS_PER_GROUP(sb) / max(blocks_per_dir, BLOCK_COST); + if (max_debt * INODE_COST > inodes_per_group) + max_debt = inodes_per_group / INODE_COST; + if (max_debt > 255) + max_debt = 255; + if (max_debt == 0) + max_debt = 1; + + for (i = 0; i < ngroups; i++) { + group = (parent_cg + i) % ngroups; + cg = ext2_get_group_desc (sb, group, &bh); + if (!cg || !cg->bg_free_inodes_count) + continue; + if (sb->u.ext2_sb.debts[group] >= max_debt) + continue; + if (le16_to_cpu(cg->bg_used_dirs_count) >= max_dirs) + continue; + if (le16_to_cpu(cg->bg_free_inodes_count) < min_inodes) + continue; + if (le16_to_cpu(cg->bg_free_blocks_count) < min_blocks) + continue; + goto found; + } + +fallback: + for (i = 0; i < ngroups; i++) { + group = (parent_cg + i) % ngroups; + cg = ext2_get_group_desc (sb, group, &bh); + if (!cg || !cg->bg_free_inodes_count) + continue; + if (le16_to_cpu(cg->bg_free_inodes_count) >= avefreei) + goto found; + } + + return -1; + +found: + cg->bg_free_inodes_count = + cpu_to_le16(le16_to_cpu(cg->bg_free_inodes_count) - 1); + cg->bg_used_dirs_count = + cpu_to_le16(le16_to_cpu(cg->bg_used_dirs_count) + 1); + sb->u.ext2_sb.s_dir_count++; + mark_buffer_dirty(bh); + return group; +} + +static int find_cg_other(struct super_block *sb, const struct inode *parent) +{ + int parent_cg = parent->u.ext2_i.i_block_group; + int ngroups = sb->u.ext2_sb.s_groups_count; + struct ext2_group_desc *cg; + struct buffer_head *bh; + int i, j; + + /* + * Try to place the inode in its parent directory + */ + i = parent_cg; + cg = 
ext2_get_group_desc (sb, i, &bh); + if (cg && le16_to_cpu(cg->bg_free_inodes_count)) + goto found; + + /* + * Use a quadratic hash to find a group with a + * free inode + */ + for (j = 1; j < ngroups; j <<= 1) { + i += j; + if (i >= ngroups) + i -= ngroups; + cg = ext2_get_group_desc (sb, i, &bh); + if (cg && le16_to_cpu(cg->bg_free_inodes_count)) + goto found; + } + + /* + * That failed: try linear search for a free inode + */ + i = parent_cg + 1; + for (j = 2; j < ngroups; j++) { + if (++i >= ngroups) + i = 0; + cg = ext2_get_group_desc (sb, i, &bh); + if (cg && le16_to_cpu(cg->bg_free_inodes_count)) + goto found; + } + + return -1; + +found: + cg->bg_free_inodes_count = + cpu_to_le16(le16_to_cpu(cg->bg_free_inodes_count) - 1); + mark_buffer_dirty(bh); + return i; +} + struct inode * ext2_new_inode (const struct inode * dir, int mode) { struct super_block * sb; struct buffer_head * bh; struct buffer_head * bh2; - int i, j, avefreei; + int i, j; struct inode * inode; - int bitmap_nr; struct ext2_group_desc * gdp; - struct ext2_group_desc * tmp; struct ext2_super_block * es; int err; - /* Cannot create files in a deleted directory */ - if (!dir || !dir->i_nlink) - return ERR_PTR(-EPERM); - sb = dir->i_sb; inode = new_inode(sb); if (!inode) @@ -284,140 +462,52 @@ lock_super (sb); es = sb->u.ext2_sb.s_es; repeat: - gdp = NULL; i=0; - - if (S_ISDIR(mode)) { - avefreei = le32_to_cpu(es->s_free_inodes_count) / - sb->u.ext2_sb.s_groups_count; -/* I am not yet convinced that this next bit is necessary. 
- i = dir->u.ext2_i.i_block_group; - for (j = 0; j < sb->u.ext2_sb.s_groups_count; j++) { - tmp = ext2_get_group_desc (sb, i, &bh2); - if (tmp && - (le16_to_cpu(tmp->bg_used_dirs_count) << 8) < - le16_to_cpu(tmp->bg_free_inodes_count)) { - gdp = tmp; - break; - } - else - i = ++i % sb->u.ext2_sb.s_groups_count; - } -*/ - if (!gdp) { - for (j = 0; j < sb->u.ext2_sb.s_groups_count; j++) { - tmp = ext2_get_group_desc (sb, j, &bh2); - if (tmp && - le16_to_cpu(tmp->bg_free_inodes_count) && - le16_to_cpu(tmp->bg_free_inodes_count) >= avefreei) { - if (!gdp || - (le16_to_cpu(tmp->bg_free_blocks_count) > - le16_to_cpu(gdp->bg_free_blocks_count))) { - i = j; - gdp = tmp; - } - } - } - } - } + if (S_ISDIR(mode)) + i = find_cg_dir_orlov(sb, dir); else - { - /* - * Try to place the inode in its parent directory - */ - i = dir->u.ext2_i.i_block_group; - tmp = ext2_get_group_desc (sb, i, &bh2); - if (tmp && le16_to_cpu(tmp->bg_free_inodes_count)) - gdp = tmp; - else - { - /* - * Use a quadratic hash to find a group with a - * free inode - */ - for (j = 1; j < sb->u.ext2_sb.s_groups_count; j <<= 1) { - i += j; - if (i >= sb->u.ext2_sb.s_groups_count) - i -= sb->u.ext2_sb.s_groups_count; - tmp = ext2_get_group_desc (sb, i, &bh2); - if (tmp && - le16_to_cpu(tmp->bg_free_inodes_count)) { - gdp = tmp; - break; - } - } - } - if (!gdp) { - /* - * That failed: try linear search for a free inode - */ - i = dir->u.ext2_i.i_block_group + 1; - for (j = 2; j < sb->u.ext2_sb.s_groups_count; j++) { - if (++i >= sb->u.ext2_sb.s_groups_count) - i = 0; - tmp = ext2_get_group_desc (sb, i, &bh2); - if (tmp && - le16_to_cpu(tmp->bg_free_inodes_count)) { - gdp = tmp; - break; - } - } - } - } + i = find_cg_other(sb, dir); err = -ENOSPC; - if (!gdp) + if (i == -1) goto fail; err = -EIO; - bitmap_nr = load_inode_bitmap (sb, i); - if (bitmap_nr < 0) - goto fail; + bh = load_inode_bitmap (sb, i); + if (IS_ERR(bh)) + goto fail2; + + j = ext2_find_first_zero_bit ((unsigned long *) bh->b_data, + 
EXT2_INODES_PER_GROUP(sb)); + if (j >= EXT2_INODES_PER_GROUP(sb)) + goto bad_count; + ext2_set_bit (j, bh->b_data); - bh = sb->u.ext2_sb.s_inode_bitmap[bitmap_nr]; - if ((j = ext2_find_first_zero_bit ((unsigned long *) bh->b_data, - EXT2_INODES_PER_GROUP(sb))) < - EXT2_INODES_PER_GROUP(sb)) { - if (ext2_set_bit (j, bh->b_data)) { - ext2_error (sb, "ext2_new_inode", - "bit already set for inode %d", j); - goto repeat; - } - mark_buffer_dirty(bh); - if (sb->s_flags & MS_SYNCHRONOUS) { - ll_rw_block (WRITE, 1, &bh); - wait_on_buffer (bh); - } - } else { - if (le16_to_cpu(gdp->bg_free_inodes_count) != 0) { - ext2_error (sb, "ext2_new_inode", - "Free inodes count corrupted in group %d", - i); - /* Is it really ENOSPC? */ - err = -ENOSPC; - if (sb->s_flags & MS_RDONLY) - goto fail; - - gdp->bg_free_inodes_count = 0; - mark_buffer_dirty(bh2); - } - goto repeat; + mark_buffer_dirty(bh); + if (sb->s_flags & MS_SYNCHRONOUS) { + ll_rw_block (WRITE, 1, &bh); + wait_on_buffer (bh); } + j += i * EXT2_INODES_PER_GROUP(sb) + 1; if (j < EXT2_FIRST_INO(sb) || j > le32_to_cpu(es->s_inodes_count)) { ext2_error (sb, "ext2_new_inode", "reserved inode or inode > inodes count - " "block_group = %d,inode=%d", i, j); err = -EIO; - goto fail; + goto fail2; } - gdp->bg_free_inodes_count = - cpu_to_le16(le16_to_cpu(gdp->bg_free_inodes_count) - 1); - if (S_ISDIR(mode)) - gdp->bg_used_dirs_count = - cpu_to_le16(le16_to_cpu(gdp->bg_used_dirs_count) + 1); - mark_buffer_dirty(bh2); + es->s_free_inodes_count = cpu_to_le32(le32_to_cpu(es->s_free_inodes_count) - 1); + + if (S_ISDIR(mode)) { + if (sb->u.ext2_sb.debts[i] < 255) + sb->u.ext2_sb.debts[i]++; + } else { + if (sb->u.ext2_sb.debts[i]) + sb->u.ext2_sb.debts[i]--; + } + mark_buffer_dirty(sb->u.ext2_sb.s_sbh); sb->s_dirt = 1; inode->i_uid = current->fsuid; @@ -438,14 +528,7 @@ inode->u.ext2_i.i_new_inode = 1; inode->u.ext2_i.i_flags = dir->u.ext2_i.i_flags; if (S_ISLNK(mode)) - inode->u.ext2_i.i_flags &= ~(EXT2_IMMUTABLE_FL | EXT2_APPEND_FL); - 
inode->u.ext2_i.i_faddr = 0; - inode->u.ext2_i.i_frag_no = 0; - inode->u.ext2_i.i_frag_size = 0; - inode->u.ext2_i.i_file_acl = 0; - inode->u.ext2_i.i_dir_acl = 0; - inode->u.ext2_i.i_dtime = 0; - inode->u.ext2_i.i_prealloc_count = 0; + inode->u.ext2_i.i_flags &= ~(EXT2_IMMUTABLE_FL|EXT2_APPEND_FL); inode->u.ext2_i.i_block_group = i; if (inode->u.ext2_i.i_flags & EXT2_SYNC_FL) inode->i_flags |= S_SYNC; @@ -464,38 +547,59 @@ ext2_debug ("allocating inode %lu\n", inode->i_ino); return inode; +fail2: + gdp = ext2_get_group_desc (sb, i, &bh2); + gdp->bg_free_inodes_count = + cpu_to_le16(le16_to_cpu(gdp->bg_free_inodes_count) + 1); + if (S_ISDIR(mode)) { + gdp->bg_used_dirs_count = + cpu_to_le16(le16_to_cpu(gdp->bg_used_dirs_count) - 1); + sb->u.ext2_sb.s_dir_count--; + } + mark_buffer_dirty(bh2); fail: unlock_super(sb); make_bad_inode(inode); iput(inode); return ERR_PTR(err); + +bad_count: + ext2_error (sb, "ext2_new_inode", + "Free inodes count corrupted in group %d", + i); + /* Is it really ENOSPC? 
*/ + err = -ENOSPC; + if (sb->s_flags & MS_RDONLY) + goto fail; + + gdp = ext2_get_group_desc (sb, i, &bh2); + gdp->bg_free_inodes_count = 0; + mark_buffer_dirty(bh2); + goto repeat; } unsigned long ext2_count_free_inodes (struct super_block * sb) { #ifdef EXT2FS_DEBUG struct ext2_super_block * es; - unsigned long desc_count, bitmap_count, x; - int bitmap_nr; - struct ext2_group_desc * gdp; + unsigned long desc_count = 0, bitmap_count = 0; int i; lock_super (sb); es = sb->u.ext2_sb.s_es; - desc_count = 0; - bitmap_count = 0; - gdp = NULL; for (i = 0; i < sb->u.ext2_sb.s_groups_count; i++) { - gdp = ext2_get_group_desc (sb, i, NULL); + struct ext2_group_desc *gdp = ext2_get_group_desc (sb, i, NULL); + struct buffer_head *bh; + unsigned x; + if (!gdp) continue; desc_count += le16_to_cpu(gdp->bg_free_inodes_count); - bitmap_nr = load_inode_bitmap (sb, i); - if (bitmap_nr < 0) + bh = load_inode_bitmap (sb, i); + if (IS_ERR(bh)) continue; - x = ext2_count_free (sb->u.ext2_sb.s_inode_bitmap[bitmap_nr], - EXT2_INODES_PER_GROUP(sb) / 8); + x = ext2_count_free (bh, EXT2_INODES_PER_GROUP(sb) / 8); printk ("group %d: stored = %d, counted = %lu\n", i, le16_to_cpu(gdp->bg_free_inodes_count), x); bitmap_count += x; @@ -509,31 +613,42 @@ #endif } +/* Called at mount-time, super-block is locked */ +unsigned long ext2_count_dirs (struct super_block * sb) +{ + unsigned long count = 0; + int i; + + for (i = 0; i < sb->u.ext2_sb.s_groups_count; i++) { + struct ext2_group_desc *gdp = ext2_get_group_desc (sb, i, NULL); + if (!gdp) + continue; + count += le16_to_cpu(gdp->bg_used_dirs_count); + } + return count; +} + #ifdef CONFIG_EXT2_CHECK /* Called at mount-time, super-block is locked */ void ext2_check_inodes_bitmap (struct super_block * sb) { - struct ext2_super_block * es; - unsigned long desc_count, bitmap_count, x; - int bitmap_nr; - struct ext2_group_desc * gdp; + struct ext2_super_block * es = sb->u.ext2_sb.s_es; + unsigned long desc_count = 0, bitmap_count = 0; int i; - es = 
sb->u.ext2_sb.s_es; - desc_count = 0; - bitmap_count = 0; - gdp = NULL; for (i = 0; i < sb->u.ext2_sb.s_groups_count; i++) { - gdp = ext2_get_group_desc (sb, i, NULL); + struct ext2_group_desc *gdp = ext2_get_group_desc (sb, i, NULL); + struct buffer_head *bh; + unsigned x; + if (!gdp) continue; desc_count += le16_to_cpu(gdp->bg_free_inodes_count); - bitmap_nr = load_inode_bitmap (sb, i); - if (bitmap_nr < 0) + bh = load_inode_bitmap (sb, i); + if (IS_ERR(bh)) continue; - x = ext2_count_free (sb->u.ext2_sb.s_inode_bitmap[bitmap_nr], - EXT2_INODES_PER_GROUP(sb) / 8); + x = ext2_count_free (bh, EXT2_INODES_PER_GROUP(sb) / 8); if (le16_to_cpu(gdp->bg_free_inodes_count) != x) ext2_error (sb, "ext2_check_inodes_bitmap", "Wrong free inodes count in group %d, " @@ -545,7 +660,7 @@ ext2_error (sb, "ext2_check_inodes_bitmap", "Wrong free inodes count in super block, " "stored = %lu, counted = %lu", - (unsigned long) le32_to_cpu(es->s_free_inodes_count), + (unsigned long)le32_to_cpu(es->s_free_inodes_count), bitmap_count); } #endif diff -urN S14/fs/ext2/super.c S14-ext2/fs/ext2/super.c --- S14/fs/ext2/super.c Tue Oct 9 21:47:26 2001 +++ S14-ext2/fs/ext2/super.c Thu Nov 8 15:29:07 2001 @@ -605,6 +605,12 @@ printk ("EXT2-fs: not enough memory\n"); goto failed_mount; } + sb->u.ext2_sb.debts = kmalloc(sb->u.ext2_sb.s_groups_count, GFP_KERNEL); + if (!sb->u.ext2_sb.debts) { + printk ("EXT2-fs: not enough memory\n"); + goto failed_mount; + } + memset(sb->u.ext2_sb.debts, 0, sb->u.ext2_sb.s_groups_count); for (i = 0; i < db_count; i++) { sb->u.ext2_sb.s_group_desc[i] = bread (dev, logic_sb_block + i + 1, sb->s_blocksize); @@ -621,6 +627,7 @@ db_count = i; goto failed_mount2; } + sb->u.ext2_sb.s_dir_count = ext2_count_dirs(sb); for (i = 0; i < EXT2_MAX_GROUP_LOADED; i++) { sb->u.ext2_sb.s_inode_bitmap_number[i] = 0; sb->u.ext2_sb.s_inode_bitmap[i] = NULL; diff -urN S14/include/linux/ext2_fs.h S14-ext2/include/linux/ext2_fs.h --- S14/include/linux/ext2_fs.h Thu Nov 8 16:33:29 2001 
+++ S14-ext2/include/linux/ext2_fs.h Thu Nov 8 16:20:16 2001 @@ -578,6 +578,7 @@ extern unsigned long ext2_count_free_inodes (struct super_block *); extern void ext2_check_inodes_bitmap (struct super_block *); extern unsigned long ext2_count_free (struct buffer_head *, unsigned); +extern unsigned long ext2_count_dirs (struct super_block *); /* inode.c */ extern void ext2_read_inode (struct inode *); diff -urN S14/include/linux/ext2_fs_sb.h S14-ext2/include/linux/ext2_fs_sb.h --- S14/include/linux/ext2_fs_sb.h Fri Feb 16 21:29:52 2001 +++ S14-ext2/include/linux/ext2_fs_sb.h Thu Nov 8 15:28:34 2001 @@ -56,6 +56,8 @@ int s_desc_per_block_bits; int s_inode_size; int s_first_ino; + unsigned long s_dir_count; + u8 *debts; }; #endif /* _LINUX_EXT2_FS_SB */ ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] ext2/ialloc.c cleanup 2001-11-08 23:08 ` Alexander Viro @ 2001-11-09 6:15 ` Andrew Morton 2001-11-09 6:56 ` Andreas Dilger 0 siblings, 1 reply; 78+ messages in thread From: Andrew Morton @ 2001-11-09 6:15 UTC (permalink / raw) To: Alexander Viro; +Cc: Andreas Dilger, ext2-devel, lkml Thanks, Al. First a couple of comments on the patch (looks nice, BTW): /* * Orlov's allocator for directories. * * We always try to spread first-level directories: * If there are directories with both free inodes and free blocks counts ^^^^^^^^^^^ cylinder groups * not worse than average we return one with smallest directory count. (I agree with Andreas on this one. Why switch terminology?) get_random_bytes(&group, sizeof(group)); parent_cg = group % ngroups; goto fallback; AFAICT, get_random_bytes() here can set `group' to a negative value, and parent_cg can go negative, and that propagates to `group' going negative, and getting passed to ext2_get_group_desc(), and everything goes generally pear-shaped. Really, all this arith should be using unsigneds. >From here: max_dirs = ndirs / ngroups + inodes_per_group / 16; min_inodes = avefreei - inodes_per_group / 4; min_blocks = avefreeb - EXT2_BLOCKS_PER_GROUP(sb) / 4; things start to get a bit confusing. A couple of 1-2 line comments which explain what the variables actually represent would help to clarify things. Also, an explanation of `debt' is needed. Offtopic, in ext2_new_inode(): mark_buffer_dirty(bh); if (sb->s_flags & MS_SYNCHRONOUS) { ll_rw_block (WRITE, 1, &bh); wait_on_buffer (bh); } Does the inode bitmap writeout actually need to be synchronous here? The file will still be recoverable by fsck if this is not done? If the sync _is_ needed, shouldn't we be doing it with the group descriptors? Finally, please, please, please take the opportunity to rename `bh', `bh2', `i' and `j' in ext2_new_inode() to something semantically meaningful. What we have now is just insane. 
We need to test carefully with ENOSPC, but it looks solid.

Performance-wise, the Orlov approach is almost as good as the `if (0 &&' approach for fast growth. This is the "manipulate kernel trees on an empty partition" test:

        Stock   Patched   Orlov
untar     31       14      14
diff     184       72      82.6
tar       15        3       3
rm        30       10      10.3

So the diffing was a bit slower; not much.

For the slow growth test, same as last time (cut-n-paste from the very excellent staroffice 6 spreadsheet):

Tree   Stock   Stock/ideal   Patched   Patched/stock   Orlov   Orlov/ideal
       (secs)                (secs)                    (secs)
  0      37       2.85          74        200.00%        63       4.85
  1      41       3.15          89        217.07%        68       5.23
  2      41       3.15          97        236.59%        74       5.69
  3      38       2.92          97        255.26%        84       6.46
  4      55       4.23         102        185.45%        78       6
  5      51       3.92          98        192.16%        76       5.85
  6      72       5.54          94        130.56%        73       5.62
  7      56       4.31          96        171.43%        71       5.46
  8      64       4.92         102        159.38%        73       5.62
  9      63       4.85         100        158.73%        71       5.46
 10      57       4.38          95        166.67%        74       5.69
 11      83       6.38         102        122.89%        78       6
 12      54       4.15         101        187.04%        76       5.85
 13      82       6.31         104        126.83%        78       6
 14      89       6.85         103        115.73%        77       5.92
 15      88       6.77          95        107.95%        77       5.92
 16     106       8.15          99         93.40%        77       5.92

We see that Orlov is more prone to fragmentation than stock 2.4.14: The time to read the first batch of 378 megs of files is 37 seconds with 2.4.14, 63 seconds with Orlov. But it's better than the `if (0 && ' approach.

So I just don't know at this stage. Even after a single pass of the Smith workload, we're running at 3x to 5x worse than ideal. If that's simply the best we can do, then we need to compare stock 2.4.14 with Orlov partway through the progress of the Smith workload to evaluate how much more serious the fragmentation is. That's easy enough - I'll do it.

The next task is to work out why a single pass of the Smith workload fragments the filesystem so much, and whether this can be improved.

Oh. And some preliminary testing with SCSI shows markedly different behaviour: those 3x to 5x speedups I was demonstrating with IDE are only 1.5x with a Quantum Atlas IV.
There is some magical characteristic of modern IDE disks which stock 2.4.14 is failing to exploit, and I don't yet have a handle on precisely what it is. - ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] ext2/ialloc.c cleanup 2001-11-09 6:15 ` Andrew Morton @ 2001-11-09 6:56 ` Andreas Dilger 2001-11-09 7:09 ` Andrew Morton 2001-11-09 7:12 ` Alexander Viro 0 siblings, 2 replies; 78+ messages in thread From: Andreas Dilger @ 2001-11-09 6:56 UTC (permalink / raw) To: Andrew Morton; +Cc: Alexander Viro, ext2-devel, lkml On Nov 08, 2001 22:15 -0800, Andrew Morton wrote: > max_dirs = ndirs / ngroups + inodes_per_group / 16; > min_inodes = avefreei - inodes_per_group / 4; > min_blocks = avefreeb - EXT2_BLOCKS_PER_GROUP(sb) / 4; > > things start to get a bit confusing. A couple of 1-2 line comments > which explain what the variables actually represent would help to > clarify things. Also, an explanation of `debt' is needed. Yes, my eyes glazed over at this point as well. > Tree Stock Stock/ideal Patched Patched/stock Orlov Orlov/ideal > (secs) (secs) (secs) Andrew, could you stick with a single metric (either /ideal or /stock)? > 12 54 4.15 101 187.04% 76 5.85 > 13 82 6.31 104 126.83% 78 6 > 14 89 6.85 103 115.73% 77 5.92 > 15 88 6.77 95 107.95% 77 5.92 > 16 106 8.15 99 93.40% 77 5.92 What else would be useful here is percent full for the filesystem. Since stock ext2 preallocates up to 7 blocks for each file, this is good for reducing fragmentation, but as the filesystem gets full it makes fragmentation worse in the end. > So I just don't know at this stage. Even after a single pass of the Smith > workload, we're running at 3x to 5x worse than ideal. If that's simply > the best we can do, then we need to compare stock 2.4.14 with Orlov > partway through the progress of the Smith workload to evaluate how much > more serious the fragmentation is. That's easy enough - I'll do it. > > The next task is to work out why a single pass of the Smith workload > fragments the filesystem so much, and whether this can be improved. 
After reading the paper, one possible cause of skewed results may be a result of their "reverse engineering" of the filesystem layout from the inode numbers. They state that they were not able to capture the actual pathnames of the files in the workload, so they invented pathnames that would put the files in the same cylinder groups based on the FFS allocation algorithms then in use, assuming the test filesystem had an identical layout. It may be possible to hack the test data into ext2 by creating a filesystem with the same number of block groups as the test FFS filesystem with the Smith workload. It may also not be valid for our needs, as we are playing with the actual group selection algorithm, so real pathnames may give us a different layout. Have you ever looked at the resulting data from a Smith test? Does it do more than simply create 27 directories (the number of cylinder groups) and dump files directly therein, as opposed to creating more subdirectories? Remember that they were strictly concerned with block allocation for files, and not cylinder group selection for directories. Cheers, Andreas -- Andreas Dilger http://sourceforge.net/projects/ext2resize/ http://www-mddsp.enel.ucalgary.ca/People/adilger/ ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] ext2/ialloc.c cleanup 2001-11-09 6:56 ` Andreas Dilger @ 2001-11-09 7:09 ` Andrew Morton 2001-11-09 7:12 ` Alexander Viro 1 sibling, 0 replies; 78+ messages in thread From: Andrew Morton @ 2001-11-09 7:09 UTC (permalink / raw) To: Andreas Dilger; +Cc: Alexander Viro, ext2-devel, lkml Andreas Dilger wrote: > > > Tree Stock Stock/ideal Patched Patched/stock Orlov Orlov/ideal > > (secs) (secs) (secs) > > Andrew, could you stick with a single metric (either /ideal or /stock)? yup. > > 12 54 4.15 101 187.04% 76 5.85 > > 13 82 6.31 104 126.83% 78 6 > > 14 89 6.85 103 115.73% 77 5.92 > > 15 88 6.77 95 107.95% 77 5.92 > > 16 106 8.15 99 93.40% 77 5.92 > > What else would be useful here is percent full for the filesystem. > Since stock ext2 preallocates up to 7 blocks for each file, this is good > for reducing fragmentation, but as the filesystem gets full it makes > fragmentation worse in the end. After the 17th iteration the fs was 75% full, so at iteration 8 we're at 37%, etc. > > So I just don't know at this stage. Even after a single pass of the Smith > > workload, we're running at 3x to 5x worse than ideal. If that's simply > > the best we can do, then we need to compare stock 2.4.14 with Orlov > > partway through the progress of the Smith workload to evaluate how much > > more serious the fragmentation is. That's easy enough - I'll do it. > > > > The next task is to work out why a single pass of the Smith workload > > fragments the filesystem so much, and whether this can be improved. > > After reading the paper, one possible cause of skewed results may be > a result of their "reverse engineering" of the filesystem layout from > the inode numbers. They state that they were not able to capture the > actual pathnames of the files in the workload, so they invented pathnames > that would put the files in the same cylinder groups based on the FFS > allocation algorithms then in use, assuming the test filesystem had an > identical layout. 
Their attempt to put each top-level dir in its own cylinder group _may_ work with ext2 on pass zero, but it'll come unstuck on passes one to sixteen. I've been treating this as basically random fs activity. Any conclusions need to be verified with Constantin's test code. > ... > Have you ever looked at the resulting data from a Smith test? Does it do > more than simply create 27 directories (the number of cylinder groups) and > dump files directly therein, as opposed to creating more subdirectories? > Remember that they were strictly concerned with block allocation for files, > and not cylinder group selection for directories. > True. At the end of the run, we end up with 62 directories and 8757 files. So under /mountpoint/NN/ we end up with 61 directories and 474 files. The remainder of the files are immediately under those 61 directories. It's a wide and shallow layout. ^ permalink raw reply [flat|nested] 78+ messages in thread
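The "wide and shallow" claim above is the kind of thing that's easy to check mechanically. A small sketch of how one might measure a tree's shape (the `layout_stats` helper and the toy tree are invented for illustration; they are not from the thread):

```python
import os
import tempfile

def layout_stats(root):
    """Walk a tree and return (directory_count, file_count, max_depth)."""
    ndirs = nfiles = max_depth = 0
    root_depth = root.rstrip(os.sep).count(os.sep)
    for dirpath, dirnames, filenames in os.walk(root):
        ndirs += len(dirnames)
        nfiles += len(filenames)
        # depth relative to the root of the walk
        max_depth = max(max_depth, dirpath.count(os.sep) - root_depth)
    return ndirs, nfiles, max_depth

# Build a toy wide-and-shallow tree: 5 directories, 3 files each
top = tempfile.mkdtemp()
for d in range(5):
    sub = os.path.join(top, "dir%02d" % d)
    os.mkdir(sub)
    for f in range(3):
        open(os.path.join(sub, "f%d" % f), "w").close()

print(layout_stats(top))  # -> (5, 15, 1): wide, shallow
```

Run against the Smith output directory, a result like (62, 8757, 2) would confirm the layout described above.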
* Re: [Ext2-devel] ext2/ialloc.c cleanup 2001-11-09 6:56 ` Andreas Dilger 2001-11-09 7:09 ` Andrew Morton @ 2001-11-09 7:12 ` Alexander Viro 2001-11-09 7:18 ` Andrew Morton 1 sibling, 1 reply; 78+ messages in thread From: Alexander Viro @ 2001-11-09 7:12 UTC (permalink / raw) To: Andreas Dilger; +Cc: Andrew Morton, ext2-devel, lkml On Thu, 8 Nov 2001, Andreas Dilger wrote: > It may be possible to hack the test data into ext2 by creating a filesystem > with the same number of block groups as the test FFS filesystem with the > Smith workload. It may also not be valid for our needs, as we are playing > with the actual group selection algorithm, so real pathnames may give us > a different layout. Umm... What were the block and fragment sizes in their tests? ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] ext2/ialloc.c cleanup 2001-11-09 7:12 ` Alexander Viro @ 2001-11-09 7:18 ` Andrew Morton 0 siblings, 0 replies; 78+ messages in thread From: Andrew Morton @ 2001-11-09 7:18 UTC (permalink / raw) To: Alexander Viro; +Cc: Andreas Dilger, ext2-devel, lkml Alexander Viro wrote: > > On Thu, 8 Nov 2001, Andreas Dilger wrote: > > > It may be possible to hack the test data into ext2 by creating a filesystem > > with the same number of block groups as the test FFS filesystem with the > > Smith workload. It may also not be valid for our needs, as we are playing > > with the actual group selection algorithm, so real pathnames may give us > > a different layout. > > Umm... What were the block and fragment sizes in their tests? Size: 502M Fragment size: 1k Block size: 8k Max cluster size: 56k I haven't been trying to recreate the Smith tests, BTW. Just using them as a representative workload and something which is worth optimising for. ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-05 5:54 ` Albert D. Cahalan 2001-11-05 8:04 ` Andrew Morton @ 2001-11-05 9:45 ` Alex Bligh - linux-kernel 2001-11-05 9:58 ` Alex Bligh - linux-kernel 1 sibling, 1 reply; 78+ messages in thread From: Alex Bligh - linux-kernel @ 2001-11-05 9:45 UTC (permalink / raw) To: Albert D. Cahalan, Mike Fedyk Cc: Andrew Morton, lkml, ext2-devel, Alex Bligh - linux-kernel --On Monday, 05 November, 2001 12:54 AM -0500 "Albert D. Cahalan" <acahalan@cs.uml.edu> wrote: > By putting directories far apart, you leave room for regular > files (added at some future time) to be packed close together. Mmmm... but if you read ahead your directories (see lkml passim) I'd be prepared to bet the gain here is either minimized, or negative. -- Alex Bligh ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] disk throughput 2001-11-05 9:45 ` [Ext2-devel] disk throughput Alex Bligh - linux-kernel @ 2001-11-05 9:58 ` Alex Bligh - linux-kernel 0 siblings, 0 replies; 78+ messages in thread From: Alex Bligh - linux-kernel @ 2001-11-05 9:58 UTC (permalink / raw) To: Alex Bligh - linux-kernel, Albert D. Cahalan, Mike Fedyk Cc: Andrew Morton, lkml, ext2-devel, Alex Bligh - linux-kernel I think I was slightly too brief: > --On Monday, 05 November, 2001 12:54 AM -0500 "Albert D. Cahalan" > <acahalan@cs.uml.edu> wrote: > >> By putting directories far apart, you leave room for regular >> files (added at some future time) to be packed close together. > > Mmmm... but if you read ahead your directories (see lkml passim) > I'd be prepared to bet the gain here is either minimized, or > negative. I.e. if you put in code to queue and cache a subdirectory when its entry in the parent is enumerated (either in userland or in kernelland), then you get far less seeking as it all comes 'up front' and the seeks can be monotonic. However, the closer the directories are, the better still. But more to the point, if your directories tend to be cached this way, you surely win by having the directories allocated the same way as files (i.e. packed /all/ close together). -- Alex Bligh ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: disk throughput 2001-11-05 2:13 disk throughput Andrew Morton 2001-11-05 3:20 ` Mohammad A. Haque 2001-11-05 3:32 ` [Ext2-devel] " Mike Fedyk @ 2001-11-05 8:47 ` Jan Kara 2001-11-05 8:50 ` [Ext2-devel] " Mike Fedyk 2001-11-05 12:23 ` Matthias Andree 3 siblings, 1 reply; 78+ messages in thread From: Jan Kara @ 2001-11-05 8:47 UTC (permalink / raw) To: Andrew Morton; +Cc: lkml, ext2-devel > I've been taking a look at one particular workload - the creation > and use of many smallish files. ie: the things we do every day > with kernel trees. > > There are a number of things which cause linux to perform quite badly > with this workload. I've fixed most of them and the speedups are quite > dramatic. The changes include: > > - reorganise the sync_inode() paths so we don't do > read-a-block,write-a-block,read-a-block, ... > > - asynchronous preread of an ext2 inode's block when we > create the inode, so: > > a) the reads will cluster into larger requests and > b) during inode writeout we don't keep stalling on > reads, preventing write clustering. > > - Move ext2's bitmap loading code outside lock_super(), > so other threads can still get in and write to the > fs while the reader sleeps, thus increasing write > request merging. This benefits multithreaded workloads > (ie: dbench) and needs more thought. > > The above changes are basically a search-and-destroy mission against > the things which are preventing effective writeout request merging. > Once they're in place we also need: > > - Alter the request queue size and elvtune settings > > > The time to create 100,000 4k files (10 per directory) has fallen > from 3:09 (3min 9second) down to 0:30. A six-fold speedup. Nice :). > All well and good, and still a WIP. 
But by far the most dramatic > speedups come from disabling ext2's policy of placing new directories > in a different blockgroup from the parent: > > --- linux-2.4.14-pre8/fs/ext2/ialloc.c Tue Oct 9 21:31:40 2001 > +++ linux-akpm/fs/ext2/ialloc.c Sun Nov 4 17:40:43 2001 > @@ -286,7 +286,7 @@ struct inode * ext2_new_inode (const str > repeat: > gdp = NULL; i=0; > > - if (S_ISDIR(mode)) { > + if (0 && S_ISDIR(mode)) { > avefreei = le32_to_cpu(es->s_free_inodes_count) / > sb->u.ext2_sb.s_groups_count; > /* I am not yet convinced that this next bit is necessary. > > > Numbers. The machine has 768 megs; the disk is IDE with a two meg cache. > The workload consists of untarring, tarring, diffing and removing kernel > trees. This filesystem is 21 gigs, and has 176 block groups. I'm not sure whether this speedup isn't tied to your particular testcase.. If I understood the code correctly, it tries to spread files uniformly over the fs (ie. all groups equally full). I think that on a filesystem like /home, where large and small files change a lot, your change could actually lead to more fragmentation - the groups at the beginning get full (files are created in the same group as their parent). Then if some file gets deleted and a new one created, the filesystem will try to stuff the new file into the first groups, and that causes fragmentation.. But it's just an idea - some testing would probably be more useful... > After each test which wrote data a `sync' was performed, and was included > in the timing under the assumption that all the data will be written back > by kupdate in a few seconds, and running `sync' allows measurement of the > cost of that. > > The filesystem was unmounted between each test - all tests are with > cold caches. > > stock patched > untar one kernel tree, sync: 0:31 0:14 > diff two trees: 3:04 1:12 > tar kernel tree to /dev/null: 0:15 0:03 > remove 2 kernel trees, sync: 0:30 0:10 > > A significant thing here is the improvement in read performance as well > as writes.
All of the other speedup changes only affect writes. > > We are paying an extremely heavy price for placing directories in > different block groups from their parent. Why do we do this, and > is it worth the cost? Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs ^ permalink raw reply [flat|nested] 78+ messages in thread
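Jan's hole-filling worry can be illustrated with a toy first-fit model (group counts, sizes, and the `Fs` class are all invented for illustration; this is nothing like the real ext2 allocator, just the failure mode he describes):

```python
# Toy model: without spreading, files pile into the lowest-numbered
# groups; after a delete leaves a small hole there, the next file is
# split between the hole and a later group - i.e. fragmented.
class Fs:
    def __init__(self, ngroups, blocks_per_group):
        self.free = [blocks_per_group] * ngroups

    def alloc(self, start_group, nblocks):
        """First-fit from start_group, spilling into later groups.
        Returns the allocation as a list of (group, count) extents."""
        extents = []
        g = start_group
        while nblocks > 0 and g < len(self.free):
            take = min(nblocks, self.free[g])
            if take:
                self.free[g] -= take
                extents.append((g, take))
                nblocks -= take
            g += 1
        return extents

fs = Fs(ngroups=4, blocks_per_group=10)
fs.alloc(0, 8)           # first file fills most of group 0
fs.free[0] += 3          # delete part of it, leaving a 3-block hole
print(fs.alloc(0, 6))    # -> [(0, 5), (1, 1)]: split across two groups
```

With the stock spread-directories policy the second file would likely have started in an emptier group and stayed contiguous - which is exactly the trade-off being debated.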
* Re: [Ext2-devel] Re: disk throughput 2001-11-05 8:47 ` Jan Kara @ 2001-11-05 8:50 ` Mike Fedyk 2001-11-05 9:01 ` Jan Kara 0 siblings, 1 reply; 78+ messages in thread From: Mike Fedyk @ 2001-11-05 8:50 UTC (permalink / raw) To: Jan Kara; +Cc: Andrew Morton, lkml, ext2-devel On Mon, Nov 05, 2001 at 09:47:01AM +0100, Jan Kara wrote: > If I understood well the code it tries to spread files uniformly over the > fs (ie. all groups equally full). I think that if you have filesystem like > /home where are large+small files changing a lot your change can actually > lead to more fragmentation - groups in the beginning gets full (files > are created in the same group as it's parent). Then if some file gets deleted > and new one created filesystem will try to stuff new file in the first > groups and that causes fragmentation.. But it's just an idea - some testing > would be probably more useful... > Shouldn't it choose another block group if the file won't fit in the current one? ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [Ext2-devel] Re: disk throughput 2001-11-05 8:50 ` [Ext2-devel] " Mike Fedyk @ 2001-11-05 9:01 ` Jan Kara 0 siblings, 0 replies; 78+ messages in thread From: Jan Kara @ 2001-11-05 9:01 UTC (permalink / raw) To: Andrew Morton, lkml, ext2-devel > On Mon, Nov 05, 2001 at 09:47:01AM +0100, Jan Kara wrote: > > If I understood well the code it tries to spread files uniformly over the > > fs (ie. all groups equally full). I think that if you have filesystem like > > /home where are large+small files changing a lot your change can actually > > lead to more fragmentation - groups in the beginning gets full (files > > are created in the same group as it's parent). Then if some file gets deleted > > and new one created filesystem will try to stuff new file in the first > > groups and that causes fragmentation.. But it's just an idea - some testing > > would be probably more useful... > > > > Shouldn't it choose another block group if the file won't fit in the current > one? If I understand the code correctly: when ext2 creates a new file, it will try to allocate data in the same group where it got the inode. If there's no free block (or better, 8 free blocks) in that group, it will try another one, and so on (but that one will probably also be full - on average we will spread the file over half of the full groups before hitting an empty one). But I agree that this way it will fill the holes with the file, and the next time it will start in an empty group. Anyway, this allocation strategy will IMHO not keep inodes near data... Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs ^ permalink raw reply [flat|nested] 78+ messages in thread
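The fallback search Jan describes can be sketched roughly as follows (`find_group` and its thresholds are illustrative simplifications, not the kernel's actual ext2_new_inode/ext2_new_block logic):

```python
def find_group(free_blocks, goal_group, want=8):
    """Sketch of the fallback: prefer the inode's own group if it has
    a decent run of free blocks, otherwise scan forward (wrapping) for
    the first group that does, and finally settle for any group with
    at least one free block."""
    n = len(free_blocks)
    order = [(goal_group + i) % n for i in range(n)]
    for g in order:                 # first pass: want a run of blocks
        if free_blocks[g] >= want:
            return g
    for g in order:                 # second pass: take anything
        if free_blocks[g] > 0:
            return g
    return -1                       # filesystem full

# goal group 0 is full, groups 1 and 2 are nearly full -> jump to 3
print(find_group([0, 3, 0, 12], goal_group=0))  # -> 3
```

This is the behaviour behind Jan's complaint: once the early groups are nearly full, the scan repeatedly walks past them, and the data ends up far from the inode.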
* Re: disk throughput 2001-11-05 2:13 disk throughput Andrew Morton ` (2 preceding siblings ...) 2001-11-05 8:47 ` Jan Kara @ 2001-11-05 12:23 ` Matthias Andree 2001-11-05 22:39 ` Andrew Morton 3 siblings, 1 reply; 78+ messages in thread From: Matthias Andree @ 2001-11-05 12:23 UTC (permalink / raw) To: lkml On Sun, 04 Nov 2001, Andrew Morton wrote: > Numbers. The machine has 768 megs; the disk is IDE with a two meg cache. > The workload consists of untarring, tarring, diffing and removing kernel > trees. This filesystem is 21 gigs, and has 176 block groups. Does that IDE disk run with its write cache enabled or disabled? ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: disk throughput 2001-11-05 12:23 ` Matthias Andree @ 2001-11-05 22:39 ` Andrew Morton 2001-11-05 23:41 ` Matthias Andree 0 siblings, 1 reply; 78+ messages in thread From: Andrew Morton @ 2001-11-05 22:39 UTC (permalink / raw) To: Matthias Andree; +Cc: lkml Matthias Andree wrote: > > On Sun, 04 Nov 2001, Andrew Morton wrote: > > > Numbers. The machine has 768 megs; the disk is IDE with a two meg cache. > > The workload consists of untarring, tarring, diffing and removing kernel > > trees. This filesystem is 21 gigs, and has 176 block groups. > > Does that IDE disk run with its write cache enabled or disabled? write-behind is enabled. I'm not religious about write-behind. Yes, there's a small chance that the disk will decide to write a commit block in front of the data (which is at lower LBAs). But a) It's improbable b) if it does happen, the time window where you need to crash is small. c) if your kernel crashes, the data will still be written. It has to be a power-down. But good point - I'll test without writebehind, and with SCSI. - ^ permalink raw reply [flat|nested] 78+ messages in thread
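For anyone wanting to repeat the write-behind comparison Andrew mentions, the IDE drive's write cache can be queried and toggled with hdparm. (The device name is only an example, the commands need root, and older hdparm versions may require an explicit 0/1 argument to -W; treat this as a sketch of the procedure, not part of the thread.)

```shell
# -W gets/sets the drive's write-caching feature
hdparm -W /dev/hda     # report the current write-cache setting
hdparm -W0 /dev/hda    # disable write-behind for the "safe" run
hdparm -W1 /dev/hda    # re-enable it afterwards
```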
* Re: disk throughput 2001-11-05 22:39 ` Andrew Morton @ 2001-11-05 23:41 ` Matthias Andree 0 siblings, 0 replies; 78+ messages in thread From: Matthias Andree @ 2001-11-05 23:41 UTC (permalink / raw) To: lkml On Mon, 05 Nov 2001, Andrew Morton wrote: > write-behind is enabled. I'm not religious about write-behind. > Yes, there's a small chance that the disk will decide to write a > commit block in front of the data (which is at lower LBAs). But Well, you're talking about performance here. In my simple tests, the write cache has a major impact (twofold) on linear writes (just dd to a partition), so your mileage may vary a lot depending on the cache configuration. ^ permalink raw reply [flat|nested] 78+ messages in thread
end of thread, other threads:[~2001-11-10 1:00 UTC | newest] Thread overview: 78+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2001-11-05 2:13 disk throughput Andrew Morton 2001-11-05 3:20 ` Mohammad A. Haque 2001-11-05 3:31 ` Andrew Morton 2001-11-05 3:32 ` [Ext2-devel] " Mike Fedyk 2001-11-05 3:45 ` Andrew Morton 2001-11-05 4:39 ` Mike Fedyk 2001-11-05 7:06 ` Jens Axboe 2001-11-05 7:14 ` Andrew Morton 2001-11-05 7:26 ` Jens Axboe 2001-11-05 7:14 ` Mike Fedyk 2001-11-05 7:18 ` Jens Axboe 2001-11-05 7:18 ` Jens Axboe 2001-11-05 9:14 ` Mike Fedyk 2001-11-05 9:20 ` Jens Axboe 2001-11-05 5:54 ` Albert D. Cahalan 2001-11-05 8:04 ` Andrew Morton 2001-11-05 12:28 ` Matthias Andree 2001-11-05 14:23 ` Alexander Viro 2001-11-05 22:22 ` Andrew Morton 2001-11-05 22:41 ` Andreas Dilger 2001-11-05 22:53 ` Andrew Morton 2001-11-08 15:28 ` Constantin Loizides 2001-11-05 23:14 ` Dan Hollis 2001-11-06 10:52 ` Daniel Phillips 2001-11-06 16:17 ` Jeremy Fitzhardinge 2001-11-08 15:24 ` Constantin Loizides 2001-11-08 16:46 ` Jeremy Fitzhardinge 2001-11-09 6:08 ` Andrew Morton 2001-11-09 8:49 ` Jeremy Fitzhardinge 2001-11-06 21:45 ` Stephen Tweedie 2001-11-05 20:16 ` Andreas Dilger 2001-11-05 20:28 ` m 2001-11-05 21:39 ` Andrew Morton 2001-11-05 22:59 ` Linus Torvalds 2001-11-05 23:36 ` Alexander Viro 2001-11-05 23:50 ` Linus Torvalds 2001-11-06 0:03 ` Linus Torvalds 2001-11-06 1:33 ` Alexander Viro 2001-11-06 2:10 ` Linus Torvalds 2001-11-06 3:02 ` Alexander Viro 2001-11-06 8:39 ` Alan Cox 2001-11-06 8:37 ` Alexander Viro 2001-11-06 8:48 ` Andrew Morton 2001-11-06 3:49 ` Alexander Viro 2001-11-06 4:01 ` Linus Torvalds 2001-11-06 4:21 ` Alexander Viro 2001-11-06 5:01 ` Linus Torvalds 2001-11-06 5:31 ` Andrew Morton 2001-11-06 5:48 ` Linus Torvalds 2001-11-06 7:34 ` Mike Castle 2001-11-06 7:10 ` Kai Henningsen 2001-11-09 22:35 ` Riley Williams 2001-11-06 1:28 ` Alexander Viro 2001-11-06 9:16 ` Wojtek Pilorz 2001-11-06 9:58 ` Alexander 
Viro 2001-11-08 12:51 ` Pavel Machek 2001-11-06 21:48 ` Stephen Tweedie 2001-11-06 23:17 ` ext2/ialloc.c cleanup Alexander Viro 2001-11-07 19:34 ` [Ext2-devel] " Andreas Dilger 2001-11-07 20:02 ` Alexander Viro 2001-11-08 2:06 ` Andrew Morton 2001-11-08 20:45 ` Andrew Morton 2001-11-08 22:16 ` Alexander Viro 2001-11-08 22:43 ` Andreas Dilger 2001-11-08 23:08 ` Alexander Viro 2001-11-09 6:15 ` Andrew Morton 2001-11-09 6:56 ` Andreas Dilger 2001-11-09 7:09 ` Andrew Morton 2001-11-09 7:12 ` Alexander Viro 2001-11-09 7:18 ` Andrew Morton 2001-11-05 9:45 ` [Ext2-devel] disk throughput Alex Bligh - linux-kernel 2001-11-05 9:58 ` Alex Bligh - linux-kernel 2001-11-05 8:47 ` Jan Kara 2001-11-05 8:50 ` [Ext2-devel] " Mike Fedyk 2001-11-05 9:01 ` Jan Kara 2001-11-05 12:23 ` Matthias Andree 2001-11-05 22:39 ` Andrew Morton 2001-11-05 23:41 ` Matthias Andree