* ext3 allocate-with-reservation latencies
  From: Lee Revell @ 2005-04-05  3:51 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Andrew Morton

I can trigger latencies up to ~1.1 ms with a CVS checkout.  It looks
like inside ext3_try_to_allocate_with_rsv, we spend a long time in this
loop:

ext3_test_allocatable (bitmap_search_next_usable_block)
find_next_zero_bit (bitmap_search_next_usable_block)
find_next_zero_bit (bitmap_search_next_usable_block)

ext3_test_allocatable (bitmap_search_next_usable_block)
find_next_zero_bit (bitmap_search_next_usable_block)
find_next_zero_bit (bitmap_search_next_usable_block)

etc.

Lee

[-- Attachment #2: ext3_reservation.bz2 (application/x-bzip, 3287 bytes) --]

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: ext3 allocate-with-reservation latencies
  From: Ingo Molnar @ 2005-04-05  4:13 UTC
  To: Lee Revell; +Cc: linux-kernel, Andrew Morton

* Lee Revell <rlrevell@joe-job.com> wrote:

> I can trigger latencies up to ~1.1 ms with a CVS checkout.  It looks
> like inside ext3_try_to_allocate_with_rsv, we spend a long time in this
> loop:
>
> ext3_test_allocatable (bitmap_search_next_usable_block)
> find_next_zero_bit (bitmap_search_next_usable_block)
> find_next_zero_bit (bitmap_search_next_usable_block)
>
> ext3_test_allocatable (bitmap_search_next_usable_block)
> find_next_zero_bit (bitmap_search_next_usable_block)
> find_next_zero_bit (bitmap_search_next_usable_block)

Breaking the lock is not really possible at that point, and it doesn't
look too easy to make that path preemptable either.  (To make it
preemptable, rsv_lock would need to become a semaphore; this could be
fine, as it's only used when a new reservation window is created.)

The hard part is the seqlock: the read side is performance-critical.
Maybe it could be solved via a preemptable but still scalable seqlock
variant that uses a semaphore for the write side?  It all depends on
what the scalability impact of using a semaphore for the new-window
code would be.

The best long-term solution for these types of tradeoffs seems to be to
add a locking primitive that is a spinlock on !PREEMPT kernels and a
semaphore on PREEMPT kernels.  I.e. not as drastic as a full PREEMPT_RT
kernel, but good enough to make latency-critical codepaths of ext3
preemptable, without hurting scalability on !PREEMPT.

The PREEMPT_RT kernel has all the compile-time type-switching
infrastructure for such tricks; all that needs to change to switch a
lock's type is the spinlock definition, and all the spin_lock(&lock)
call sites can remain unchanged.  (The same method is used on
PREEMPT_RT to have 'dual-type' spinlocks.)

The same thing could then also be used for things like the mm lock and
other longer-held locks that PREEMPT would like to see preemptable.  It
would also be a good first step towards merging the PREEMPT_RT
infrastructure ;-)

I'll cook up something.

	Ingo
* Re: ext3 allocate-with-reservation latencies
  From: Mingming Cao @ 2005-04-05  6:10 UTC
  To: Ingo Molnar; +Cc: Lee Revell, linux-kernel, Andrew Morton

On Tue, 2005-04-05 at 06:13 +0200, Ingo Molnar wrote:
> * Lee Revell <rlrevell@joe-job.com> wrote:
>
> > I can trigger latencies up to ~1.1 ms with a CVS checkout.  It looks
> > like inside ext3_try_to_allocate_with_rsv, we spend a long time in this
> > loop:

We have not modified the reservation allocation algorithm for a long
time.  Sorry, I missed the background here: on which kernel did you see
this latency?  And what tests were you running?

> > ext3_test_allocatable (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> >
> > ext3_test_allocatable (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)

The loop in bitmap_search_next_usable_block() tries to find the next
zero bit in the bitmap; if it finds one, it then needs to check whether
that bit is also zero in the journalling's last-committed copy of the
bitmap.  The double check is there to prevent re-using a just-freed
block on disk before the free is committed by journalling.  If the bit
is not marked free in the journalling copy, the loop finds the next
free bit in that copy first, then loops back, etc.

The bitmap_search_next_usable_block() code is quite simple:

static int
bitmap_search_next_usable_block(int start, struct buffer_head *bh,
					int maxblocks)
{
	int next;
	struct journal_head *jh = bh2jh(bh);

	/*
	 * The bitmap search --- search forward alternately through the actual
	 * bitmap and the last-committed copy until we find a bit free in
	 * both
	 */
	while (start < maxblocks) {
		next = ext3_find_next_zero_bit(bh->b_data, maxblocks, start);
		if (next >= maxblocks)
			return -1;
		if (ext3_test_allocatable(next, bh))
			return next;
		jbd_lock_bh_state(bh);
		if (jh->b_committed_data)
			start = ext3_find_next_zero_bit(jh->b_committed_data,
							maxblocks, next);
		jbd_unlock_bh_state(bh);
	}
	return -1;
}

The latency report shows the pattern of two calls to
ext3_find_next_zero_bit() after each call to ext3_test_allocatable(),
which matches what I suspect above.

I wonder if re-running the test with the -o noreservation mount option
makes any difference?

> Breaking the lock is not really possible at that point, and it doesn't
> look too easy to make that path preemptable either.  (To make it
> preemptable, rsv_lock would need to become a semaphore; this could be
> fine, as it's only used when a new reservation window is created.)
>
> The hard part is the seqlock: the read side is performance-critical.
> Maybe it could be solved via a preemptable but still scalable seqlock
> variant that uses a semaphore for the write side?  It all depends on
> what the scalability impact of using a semaphore for the new-window
> code would be.

The seqlock was there to prevent an SMP race, but it turned out to be
unnecessary and has been removed recently.  Even when it was there, it
would not have caused much latency here: the whole of
ext3_try_to_allocate_with_rsv() is guarded by ei->truncate_sem, so
there are no multiple readers/writers at the same time.
* Re: ext3 allocate-with-reservation latencies
  From: Lee Revell @ 2005-04-05 16:38 UTC
  To: cmm; +Cc: Ingo Molnar, linux-kernel, Andrew Morton

On Mon, 2005-04-04 at 23:10 -0700, Mingming Cao wrote:
> On Tue, 2005-04-05 at 06:13 +0200, Ingo Molnar wrote:
> > * Lee Revell <rlrevell@joe-job.com> wrote:
> >
> > > I can trigger latencies up to ~1.1 ms with a CVS checkout.  It looks
> > > like inside ext3_try_to_allocate_with_rsv, we spend a long time in
> > > this loop:
>
> We have not modified the reservation allocation algorithm for a long
> time.  Sorry, I missed the background here: on which kernel did you see
> this latency?  And what tests were you running?

Makes sense, I have been seeing this one for a long time.  I get it
with dbench, or when doing a CVS checkout of a large module.  Kernel is
2.6.12-rc1-RT-V0.7.43-05, with PREEMPT_DESKTOP, which AFAIK should be
equivalent to mainline.

Lee
* Re: ext3 allocate-with-reservation latencies
  From: Mingming Cao @ 2005-04-06  5:35 UTC
  To: Ingo Molnar, sct; +Cc: Lee Revell, linux-kernel, Andrew Morton

On Tue, 2005-04-05 at 06:13 +0200, Ingo Molnar wrote:
> * Lee Revell <rlrevell@joe-job.com> wrote:
>
> > I can trigger latencies up to ~1.1 ms with a CVS checkout.  It looks
> > like inside ext3_try_to_allocate_with_rsv, we spend a long time in this
> > loop:
> >
> > ext3_test_allocatable (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> >
> > ext3_test_allocatable (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
> > find_next_zero_bit (bitmap_search_next_usable_block)
>
> Breaking the lock is not really possible at that point, and it doesn't
> look too easy to make that path preemptable either.  (To make it
> preemptable, rsv_lock would need to become a semaphore; this could be
> fine, as it's only used when a new reservation window is created.)

It seems we are holding the rsv_lock while searching the bitmap for a
free bit.  In alloc_new_reservation(), we first find an available place
to create a reservation window, then we check the bitmap to see if it
contains any free block.  If not, we search for the next available
window, and so on.  During the whole process we are holding the global
rsv_lock.  We could, and probably should, avoid that: just unlock the
rsv_lock before the bitmap search and re-grab it afterwards.  We would
need to make sure that the space is still available after we re-grab
the lock.

Another option is to hold that available window before we release the
rsv_lock, and if there is no free bit inside that window, remove it
from the tree in the next round of searching for the next available
window.

I prefer the second option, and plan to code it up soon.  Any comments?

Thanks,
Mingming
* Re: ext3 allocate-with-reservation latencies
  From: Stephen C. Tweedie @ 2005-04-06  9:51 UTC
  To: Mingming Cao
  Cc: Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton, Stephen Tweedie

Hi,

On Wed, 2005-04-06 at 06:35, Mingming Cao wrote:

> It seems we are holding the rsv_lock while searching the bitmap for a
> free bit.

Probably something to avoid!

> In alloc_new_reservation(), we first find an available place to
> create a reservation window, then we check the bitmap to see if it
> contains any free block.  If not, we search for the next available
> window, and so on.  During the whole process we are holding the global
> rsv_lock.  We could, and probably should, avoid that: just unlock the
> rsv_lock before the bitmap search and re-grab it afterwards.  We would
> need to make sure that the space is still available after we re-grab
> the lock.

Not necessarily.  As long as the windows remain mutually exclusive in
the rbtree, it doesn't matter too much if we occasionally allocate a
full one --- as long as that case is rare, the worst that happens is
that we fail to allocate from the window and have to repeat the window
reserve.

The difficulty will be in keeping it rare.  What we need to avoid is the
case where multiple tasks need a new window, they all drop the lock,
find the same bits free in the bitmap, then all try to take that
window.  One will succeed, the others will fail; but as the files in
that bit of the disk continue to grow, we risk those processes
*continually* repeating the same stomp-on-each-others'-windows operation
and raising the allocation overhead significantly.

> Another option is to hold that available window before we release the
> rsv_lock, and if there is no free bit inside that window, remove it
> from the tree in the next round of searching for the next available
> window.

Possible, but not necessarily nice.  If you've got a nearly-full disk,
most bits will be already allocated.  As you scan the bitmaps, it may
take quite a while to find a free bit; do you really want to (a) lock
the whole block group with a temporary window just to do the scan, or
(b) keep allocating multiple smaller windows until you finally find a
free bit?  The former is bad for concurrency if you have multiple tasks
trying to allocate nearby on disk --- you'll force them into different
block groups.  The latter is high overhead.

--Stephen
* Re: ext3 allocate-with-reservation latencies
  From: Mingming Cao @ 2005-04-06 16:53 UTC
  To: Stephen C. Tweedie; +Cc: Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton

On Wed, 2005-04-06 at 10:51 +0100, Stephen C. Tweedie wrote:
> On Wed, 2005-04-06 at 06:35, Mingming Cao wrote:
>
> > It seems we are holding the rsv_lock while searching the bitmap for a
> > free bit.
>
> Probably something to avoid!
>
> Not necessarily.  As long as the windows remain mutually exclusive in
> the rbtree, it doesn't matter too much if we occasionally allocate a
> full one --- as long as that case is rare, the worst that happens is
> that we fail to allocate from the window and have to repeat the window
> reserve.
>
> The difficulty will be in keeping it rare.  What we need to avoid is the
> case where multiple tasks need a new window, they all drop the lock,
> find the same bits free in the bitmap, then all try to take that
> window.  One will succeed, the others will fail; but as the files in
> that bit of the disk continue to grow, we risk those processes
> *continually* repeating the same stomp-on-each-others'-windows
> operation and raising the allocation overhead significantly.

Agreed.  That's why the second option is preferred: when a task finds a
window open to book, it books it, then drops the lock and does the
bitmap search.  Other tasks will not attempt to book the same window.

> Possible, but not necessarily nice.  If you've got a nearly-full disk,
> most bits will be already allocated.  As you scan the bitmaps, it may
> take quite a while to find a free bit; do you really want to (a) lock
> the whole block group with a temporary window just to do the scan, or
> (b) keep allocating multiple smaller windows until you finally find a
> free bit?  The former is bad for concurrency if you have multiple tasks
> trying to allocate nearby on disk --- you'll force them into different
> block groups.  The latter is high overhead.

I do not quite understand what you mean by (a): in this proposal, we
drop the lock before the scan.

And for (b), maybe I did not make myself clear: I am not proposing to
keep allocating multiple smaller windows until we finally find a free
bit.  I mean, we book the window (just link the node into the tree)
before we drop the lock; if there is no free bit inside that window, we
go back and search for another window (call
find_next_reserveable_window()), inside which we remove the temporary
window we just created and find the next one.  So we only have one
temporary window at a time.  In the case where we did find a free block
inside the temporarily created window, we make it official: only then
do we do the rb-tree colouring/rotating.

Does it make sense?  Let me know, thanks.  Maybe I should post my
proposed patch here (I am testing it now).
---

 linux-2.6.11-ming/fs/ext3/balloc.c |  107 +++++++++++++++++++++++--------------
 1 files changed, 68 insertions(+), 39 deletions(-)

diff -puN fs/ext3/balloc.c~ext3-reduce-reservation-lock-latency fs/ext3/balloc.c
--- linux-2.6.11/fs/ext3/balloc.c~ext3-reduce-reservation-lock-latency	2005-04-05 10:48:31.000000000 -0700
+++ linux-2.6.11-ming/fs/ext3/balloc.c	2005-04-06 09:24:24.203028968 -0700
@@ -241,8 +241,13 @@ void ext3_rsv_window_add(struct super_bl
 		BUG();
 	}
 
+	/*
+	 * we only link the node the the rb tree here
+	 * the real color balancing will be done
+	 * after we confirm the window has some free
+	 * block.
+	 */
 	rb_link_node(node, parent, p);
-	rb_insert_color(node, root);
 }
 
 static void rsv_window_remove(struct super_block *sb,
@@ -749,24 +754,24 @@ fail_access:
  * to find a free region that is of my size and has not
  * been reserved.
  *
- * on succeed, it returns the reservation window to be appended to.
- * failed, return NULL.
  */
-static struct ext3_reserve_window_node *find_next_reservable_window(
+static int find_next_reservable_window(
 				struct ext3_reserve_window_node *search_head,
-				unsigned long size, int *start_block,
+				struct ext3_reserve_window_node *my_rsv,
+				struct super_block * sb, int start_block,
 				int last_block)
 {
 	struct rb_node *next;
 	struct ext3_reserve_window_node *rsv, *prev;
 	int cur;
+	int size = my_rsv->rsv_goal_size;
 
 	/* TODO: make the start of the reservation window byte-aligned */
 	/* cur = *start_block & ~7;*/
-	cur = *start_block;
+	cur = start_block;
 	rsv = search_head;
 	if (!rsv)
-		return NULL;
+		return -1;
 
 	while (1) {
 		if (cur <= rsv->rsv_end)
@@ -782,7 +787,7 @@ static struct ext3_reserve_window_node *
 		 * space with expected-size (or more)...
 		 */
 		if (cur > last_block)
-			return NULL; /* fail */
+			return -1; /* fail */
 
 		prev = rsv;
 		next = rb_next(&rsv->rsv_node);
@@ -813,8 +818,29 @@ static struct ext3_reserve_window_node *
 	 * return the reservation window that we could append to.
 	 * succeed.
 	 */
-	*start_block = cur;
-	return prev;
+
+	if ((prev != my_rsv) && (!rsv_is_empty(&my_rsv->rsv_window)))
+		rsv_window_remove(sb, my_rsv);
+
+	/* let's book the whole avaliable window for now
+	 * we will check the
+	 * disk bitmap later and then, if there are free block
+	 * then we adjust the window size if the it's
+	 * larger than requested.
+	 * Otherwise, we will remove this node from the tree next time
+	 * call find_next_reservable_window.
+	 */
+	my_rsv->rsv_start = cur;
+	if (rsv)
+		my_rsv->rsv_end = rsv->rsv_start - 1;
+	else
+		my_rsv->rsv_end = last_block;
+	my_rsv->rsv_alloc_hit = 0;
+
+	if (prev != my_rsv)
+		ext3_rsv_window_add(sb, my_rsv);
+
+	return 0;
 }
 
 /**
@@ -852,6 +878,7 @@ static struct ext3_reserve_window_node *
  * @sb: the super block
  * @group: the group we are trying to allocate in
  * @bitmap_bh: the block group block bitmap
+ *
  */
 static int alloc_new_reservation(struct ext3_reserve_window_node *my_rsv,
 		int goal, struct super_block *sb,
@@ -860,10 +887,10 @@ static int alloc_new_reservation(struct
 	struct ext3_reserve_window_node *search_head;
 	int group_first_block, group_end_block, start_block;
 	int first_free_block;
-	int reservable_space_start;
-	struct ext3_reserve_window_node *prev_rsv;
 	struct rb_root *fs_rsv_root = &EXT3_SB(sb)->s_rsv_window_root;
 	unsigned long size;
+	int ret;
+	spinlock_t *rsv_lock = &EXT3_SB(sb)->s_rsv_window_lock;
 
 	group_first_block = le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block) +
 				group * EXT3_BLOCKS_PER_GROUP(sb);
@@ -875,6 +902,8 @@ static int alloc_new_reservation(struct
 		start_block = goal + group_first_block;
 
 	size = my_rsv->rsv_goal_size;
+
+	spin_lock(rsv_lock);
 	if (!rsv_is_empty(&my_rsv->rsv_window)) {
 		/*
 		 * if the old reservation is cross group boundary
@@ -908,6 +937,7 @@ static int alloc_new_reservation(struct
 			my_rsv->rsv_goal_size= size;
 		}
 	}
+
 	/*
 	 * shift the search start to the window near the goal block
 	 */
@@ -921,11 +951,11 @@ static int alloc_new_reservation(struct
 	 * need to check the bitmap after we found a reservable window.
 	 */
retry:
-	prev_rsv = find_next_reservable_window(search_head, size,
-					&start_block, group_end_block);
-	if (prev_rsv == NULL)
+	ret = find_next_reservable_window(search_head, my_rsv, sb,
+					start_block, group_end_block);
+	if (ret == -1)
 		goto failed;
-	reservable_space_start = start_block;
+
 	/*
	 * On success, find_next_reservable_window() returns the
	 * reservation window where there is a reservable space after it.
@@ -937,10 +967,13 @@ retry:
	 * block. Search start from the start block of the reservable space
	 * we just found.
	 */
+	spin_unlock(rsv_lock);
 	first_free_block = bitmap_search_next_usable_block(
-			reservable_space_start - group_first_block,
+			my_rsv->rsv_start - group_first_block,
 			bitmap_bh, group_end_block - group_first_block + 1);
+	spin_lock(rsv_lock);
+
 	if (first_free_block < 0) {
 		/*
		 * no free block left on the bitmap, no point
@@ -948,13 +981,20 @@ retry:
		 */
 		goto failed;
 	}
+
+	/*
+	 * it is possible someone else freed the window during we check
+	 * the bitmap
+	 */
+	if (rsv_is_empty(&my_rsv->rsv_window))
+		goto retry;
+
 	start_block = first_free_block + group_first_block;
 	/*
	 * check if the first free block is within the
	 * free space we just found
	 */
-	if ((start_block >= reservable_space_start) &&
-	    (start_block < reservable_space_start + size))
+	if ((start_block >= my_rsv->rsv_start) &&
+	    (start_block < my_rsv->rsv_end))
 		goto found_rsv_window;
 	/*
	 * if the first free bit we found is out of the reservable space
@@ -963,27 +1003,19 @@ retry:
	 * start from where the free block is,
	 * we also shift the list head to where we stopped last time
	 */
-	search_head = prev_rsv;
+	search_head = my_rsv;
 	goto retry;
 
found_rsv_window:
 	/*
	 * great! the reservable space contains some free blocks.
-	 * if the search returns that we should add the new
-	 * window just next to where the old window, we don't
-	 * need to remove the old window first then add it to the
-	 * same place, just update the new start and new end.
-	 */
-	if (my_rsv != prev_rsv) {
-		if (!rsv_is_empty(&my_rsv->rsv_window))
-			rsv_window_remove(sb, my_rsv);
-	}
-	my_rsv->rsv_start = reservable_space_start;
-	my_rsv->rsv_end = my_rsv->rsv_start + size - 1;
-	my_rsv->rsv_alloc_hit = 0;
-	if (my_rsv != prev_rsv) {
-		ext3_rsv_window_add(sb, my_rsv);
-	}
+	 */
+	my_rsv->rsv_start = start_block;
+	if ((my_rsv->rsv_end) > (start_block + size -1 ))
+		my_rsv->rsv_end = my_rsv->rsv_start + size - 1;
+
+	rb_insert_color(&my_rsv->rsv_node, fs_rsv_root);
+	spin_unlock(rsv_lock);
 	return 0;		/* succeed */
 failed:
 	/*
@@ -993,6 +1025,7 @@ failed:
	 */
 	if (!rsv_is_empty(&my_rsv->rsv_window))
 		rsv_window_remove(sb, my_rsv);
+	spin_unlock(rsv_lock);
 	return -1;		/* failed */
 }
 
@@ -1023,7 +1056,6 @@ ext3_try_to_allocate_with_rsv(struct sup
 			int goal, struct ext3_reserve_window_node * my_rsv,
 			int *errp)
 {
-	spinlock_t *rsv_lock;
 	unsigned long group_first_block;
 	int ret = 0;
 	int fatal;
@@ -1052,7 +1084,6 @@ ext3_try_to_allocate_with_rsv(struct sup
 		ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh,
 						goal, NULL);
 		goto out;
 	}
-	rsv_lock = &EXT3_SB(sb)->s_rsv_window_lock;
 	/*
	 * goal is a group relative block number (if there is a goal)
	 * 0 < goal < EXT3_BLOCKS_PER_GROUP(sb)
@@ -1085,12 +1116,10 @@ ext3_try_to_allocate_with_rsv(struct sup
 		if (rsv_is_empty(&rsv_copy) || (ret < 0) ||
 			!goal_in_my_reservation(&rsv_copy, goal, group, sb)) {
-			spin_lock(rsv_lock);
 			ret = alloc_new_reservation(my_rsv, goal, sb,
 							group, bitmap_bh);
 			rsv_copy._rsv_start = my_rsv->rsv_start;
 			rsv_copy._rsv_end = my_rsv->rsv_end;
-			spin_unlock(rsv_lock);
 			if (ret < 0)
 				break;			/* failed */
_
* Re: ext3 allocate-with-reservation latencies
  From: Stephen C. Tweedie @ 2005-04-06 18:22 UTC
  To: Mingming Cao
  Cc: Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton, Stephen Tweedie

Hi,

On Wed, 2005-04-06 at 17:53, Mingming Cao wrote:

> > Possible, but not necessarily nice.  If you've got a nearly-full disk,
> > most bits will be already allocated.  As you scan the bitmaps, it may
> > take quite a while to find a free bit; do you really want to (a) lock
> > the whole block group with a temporary window just to do the scan, or
> > (b) keep allocating multiple smaller windows until you finally find a
> > free bit?  The former is bad for concurrency if you have multiple
> > tasks trying to allocate nearby on disk --- you'll force them into
> > different block groups.  The latter is high overhead.
>
> I do not quite understand what you mean by (a): in this proposal, we
> drop the lock before the scan.

s/lock/reserve/.

> And for (b), maybe I did not make myself clear: I am not proposing to
> keep allocating multiple smaller windows until we finally find a free
> bit.  I mean, we book the window (just link the node into the tree)
> before we drop the lock; if there is no free bit inside that window, we
> go back and search for another window (call
> find_next_reserveable_window()), inside which we remove the temporary
> window we just created and find the next one.  So we only have one
> temporary window at a time.

And that's the problem.  Either we create small temporary windows, in
which case we may end up thrashing through vast numbers of them before
we find a bit that's available --- very expensive as the disk gets full
--- or we use large windows but get worse layout when there are
parallel allocators going on.

--Stephen
* Re: ext3 allocate-with-reservation latencies
  From: Mingming Cao @ 2005-04-06 19:03 UTC
  To: Stephen C. Tweedie; +Cc: Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton

On Wed, 2005-04-06 at 19:22 +0100, Stephen C. Tweedie wrote:
> On Wed, 2005-04-06 at 17:53, Mingming Cao wrote:
>
> > And for (b), maybe I did not make myself clear: I am not proposing to
> > keep allocating multiple smaller windows until we finally find a free
> > bit.  I mean, we book the window (just link the node into the tree)
> > before we drop the lock; if there is no free bit inside that window,
> > we go back and search for another window (call
> > find_next_reserveable_window()), inside which we remove the temporary
> > window we just created and find the next one.  So we only have one
> > temporary window at a time.
>
> And that's the problem.  Either we create small temporary windows, in
> which case we may end up thrashing through vast numbers of them before
> we find a bit that's available --- very expensive as the disk gets full
> --- or we use large windows but get worse layout when there are
> parallel allocators going on.

Ah...  I see your points.  (a) is certainly not a good option.  (b) is
not very nice, but not necessarily that bad when the disk is really
full: we do not only scan the bitmap within the reserved window range;
instead, we scan the bitmap starting from the reserved window's start
block up to the last block of the group to find the next free bit.  So
in the case where the group is really full, we can reduce the number of
small windows to try.

Mingming
* Re: ext3 allocate-with-reservation latencies
  From: Ingo Molnar @ 2005-04-07  8:14 UTC
  To: Mingming Cao; +Cc: sct, Lee Revell, linux-kernel, Andrew Morton

* Mingming Cao <cmm@us.ibm.com> wrote:

> It seems we are holding the rsv_lock while searching the bitmap for a
> free bit.  In alloc_new_reservation(), we first find an available place
> to create a reservation window, then we check the bitmap to see if it
> contains any free block.  If not, we search for the next available
> window, and so on.  During the whole process we are holding the global
> rsv_lock.  We could, and probably should, avoid that: just unlock the
> rsv_lock before the bitmap search and re-grab it afterwards.  We would
> need to make sure that the space is still available after we re-grab
> the lock.
>
> Another option is to hold that available window before we release the
> rsv_lock, and if there is no free bit inside that window, remove it
> from the tree in the next round of searching for the next available
> window.
>
> I prefer the second option, and plan to code it up soon.  Any comments?

Doesn't the first option also allow searches to be in parallel?  This
particular one took over 1 msec, so it seems there's fair room for
parallelizing multiple writers and thus improving scalability.  (Or is
this codepath serialized globally anyway?)

	Ingo
* Re: ext3 allocate-with-reservation latencies
  From: Stephen C. Tweedie @ 2005-04-07 13:08 UTC
  To: Ingo Molnar
  Cc: Mingming Cao, Lee Revell, linux-kernel, Andrew Morton, Stephen Tweedie

Hi,

On Thu, 2005-04-07 at 09:14, Ingo Molnar wrote:

> Doesn't the first option also allow searches to be in parallel?

In terms of CPU usage, yes.  But either we use large windows, in which
case we *can't* search remotely near areas of the disk in parallel; or
we use small windows, in which case failure to find a free bit becomes
disproportionately expensive (we have to do an rbtree link and unlink
for each window we search.)

> This particular one took over 1 msec, so it seems there's fair room
> for parallelizing multiple writers and thus improving scalability.
> (Or is this codepath serialized globally anyway?)

No, the rbtree ops are serialised per-sb, and allocations within any
one inode are serialised, but the bitmap searches need not be
serialised globally.  (They are in the case of new window searches
right now, which is what we're trying to fix.)

--Stephen
* Re: ext3 allocate-with-reservation latencies 2005-04-07 13:08 ` Stephen C. Tweedie @ 2005-04-07 19:16 ` Mingming Cao 2005-04-07 23:37 ` Mingming Cao 1 sibling, 0 replies; 68+ messages in thread From: Mingming Cao @ 2005-04-07 19:16 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton On Thu, 2005-04-07 at 14:08 +0100, Stephen C. Tweedie wrote: > Hi, > > On Thu, 2005-04-07 at 09:14, Ingo Molnar wrote: > > > doesnt the first option also allow searches to be in parallel? > > In terms of CPU usage, yes. But either we use large windows, in which > case we *can't* search remotely near areas of the disk in parallel; or > we use small windows, in which case failure to find a free bit becomes > disproportionately expensive (we have to do an rbtree link and unlink > for each window we search.) > I was thinking that at the end of find_next_reservable_window(), since we already know the previous node in the rbtree to insert, we could just link the window there, that way the rbtree link cost will be minimized. Then I realize in rbtree, previous node is not necessary the parent node, we still have to search part of the rbtree to find the parent node, which is really expensive in the case the insertion operations are very frequent. Hmmm... > > This particular one took over 1 msec, If you look at the pattern in the latency report: ......... 
cvs-14981 0.n.2 1044us : find_next_zero_bit (bitmap_search_next_usable_block) cvs-14981 0.n.2 1045us : ext3_test_allocatable (bitmap_search_next_usable_block) cvs-14981 0.n.2 1045us : find_next_zero_bit (bitmap_search_next_usable_block) cvs-14981 0.n.2 1046us : find_next_zero_bit (bitmap_search_next_usable_block) cvs-14981 0.n.2 1047us : ext3_test_allocatable (bitmap_search_next_usable_block) cvs-14981 0.n.2 1048us : find_next_zero_bit (bitmap_search_next_usable_block) cvs-14981 0.n.2 1049us : find_next_zero_bit (bitmap_search_next_usable_block) cvs-14981 0.n.2 1049us : ext3_test_allocatable (bitmap_search_next_usable_block) ........ the pattern is: two calls to find_next_zero_bit() followed by an ext3_test_allocatable(). And we have about 440 iterations in this report. That means we are really unlucky: we checked the on-disk bitmap, found a free bit, but it's not free in the journal copy; then we check the journal copy, find a free bit, but it's not free on disk.... I could imagine we need to check the journal copy once or twice (to prevent re-use of a just-freed bit before it has been committed), but how could this many (440) iterations happen? Mingming ^ permalink raw reply [flat|nested] 68+ messages in thread
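The alternating pattern in the trace can be modelled in user space. The following is a hypothetical, simplified sketch (not the actual ext3 code; all names besides the two traced functions are illustrative): a block is only allocatable if it is free in both the on-disk bitmap and the journal's committed copy, so a run of bits that are free in one map but busy in the other produces exactly this kind of long alternating loop.

```c
#include <assert.h>
#include <stddef.h>

/* Test whether bit 'bit' is set in a byte-array bitmap. */
static int test_bit(const unsigned char *map, size_t bit)
{
    return (map[bit / 8] >> (bit % 8)) & 1;
}

/* Linear scan for the next clear bit, as find_next_zero_bit() does. */
static size_t find_next_zero_bit(const unsigned char *map,
                                 size_t size, size_t start)
{
    for (size_t i = start; i < size; i++)
        if (!test_bit(map, i))
            return i;
    return size;
}

/* Model of bitmap_search_next_usable_block(): alternate between the
 * on-disk bitmap and the journal's committed copy until some bit is
 * free in both, or the block group is exhausted. */
static size_t search_next_usable_block(const unsigned char *disk,
                                       const unsigned char *journal,
                                       size_t size, size_t start)
{
    size_t bit = start;

    while (bit < size) {
        bit = find_next_zero_bit(disk, size, bit); /* free on disk? */
        if (bit >= size)
            break;
        if (!test_bit(journal, bit))  /* also free in the journal copy */
            return bit;
        /* Free on disk but still busy in the journal copy (e.g. just
         * freed, not yet committed): advance and retry.  With badly
         * mismatched bitmaps this loop runs many times, which is what
         * the ~440-iteration trace above shows. */
        bit++;
    }
    return size;  /* nothing usable in this group */
}
```

With two bitmaps that disagree on most free bits, the iteration count grows with every disagreement, which is why the trace shows hundreds of find_next_zero_bit() calls for a single allocation.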
* Re: ext3 allocate-with-reservation latencies 2005-04-07 13:08 ` Stephen C. Tweedie 2005-04-07 19:16 ` Mingming Cao @ 2005-04-07 23:37 ` Mingming Cao 2005-04-08 14:40 ` Stephen C. Tweedie 1 sibling, 1 reply; 68+ messages in thread From: Mingming Cao @ 2005-04-07 23:37 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton On Thu, 2005-04-07 at 14:08 +0100, Stephen C. Tweedie wrote: > Hi, > > On Thu, 2005-04-07 at 09:14, Ingo Molnar wrote: > > > doesnt the first option also allow searches to be in parallel? > > In terms of CPU usage, yes. But either we use large windows, in which > case we *can't* search remotely near areas of the disk in parallel; or > we use small windows, in which case failure to find a free bit becomes > disproportionately expensive (we have to do an rbtree link and unlink > for each window we search.) Actually, we do not have to do an rbtree link and unlink for every window we search. If the reserved (old) window has no free bit and the new reservable window is right after the old one, there is no need to unlink the old window from the rbtree and then link the new one; we just update the start and end of the old window. That's in the current design already, to reduce the cost of rbtree links and unlinks. Mingming ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: ext3 allocate-with-reservation latencies 2005-04-07 23:37 ` Mingming Cao @ 2005-04-08 14:40 ` Stephen C. Tweedie 2005-04-08 16:06 ` Arjan van de Ven 2005-04-08 18:10 ` Mingming Cao 0 siblings, 2 replies; 68+ messages in thread From: Stephen C. Tweedie @ 2005-04-08 14:40 UTC (permalink / raw) To: Mingming Cao Cc: Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton, Stephen Tweedie Hi, On Fri, 2005-04-08 at 00:37, Mingming Cao wrote: > Actually, we do not have to do an rbtree link and unlink for every > window we search. If the reserved window(old) has no free bit and the > new reservable window's is right after the old one, no need to unlink > the old window from the rbtree and then link the new window, just update > the start and end of the old one. It still needs to be done under locking to prevent us from expanding over the next window, though. And having to take and drop a spinlock a dozen times or more just to find out that there are no usable free blocks in the current block group is still expensive, even if we're not actually fully unlinking the window each time. I wonder if this can possibly be done safely without locking? It would be really good if we could rotate windows forward with no global locks. At the very least, a fair rwlock would let us freeze the total layout of the tree, while still letting us modify individual windows safely. As long as we use wmb() to make sure that we always extend the end of the window before we shrink the start of it, I think we could get away with such shared locking. And rw locking is much better for concurrency, so we might be able to hold it over the whole bitmap search rather than taking it and dropping it at each window advance. --Stephen ^ permalink raw reply [flat|nested] 68+ messages in thread
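The "always extend the end of the window before we shrink the start of it" ordering described above could look roughly like this. This is an illustrative user-space sketch under assumed names (struct and function names are made up, not the real ext3 code), and the wmb() appears only as a comment since the barrier matters only with a concurrent lock-free reader:

```c
#include <assert.h>

/* Illustrative reservation window; not the actual ext3 structure. */
struct rsv_window {
    unsigned long start;   /* first block covered by the window */
    unsigned long end;     /* last block covered (inclusive) */
};

/* Roll a window forward without ever excluding a block that a
 * concurrent reader might currently consider reserved: publish the
 * grown end first, then retract the start.  At every intermediate
 * state the window covers at least the old range or the new range,
 * never less than either. */
static void roll_forward(struct rsv_window *w,
                         unsigned long new_start, unsigned long new_end)
{
    w->end = new_end;      /* step 1: extend the end */
    /* wmb() would go here in kernel code, so the new end is visible
     * to other CPUs before the start moves forward. */
    w->start = new_start;  /* step 2: shrink (advance) the start */
}
```

The point of the ordering is that a reader walking the tree under only a shared lock can never observe a window that has momentarily shrunk on both sides.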
* Re: ext3 allocate-with-reservation latencies 2005-04-08 14:40 ` Stephen C. Tweedie @ 2005-04-08 16:06 ` Arjan van de Ven 2005-04-08 18:10 ` Mingming Cao 0 siblings, 0 replies; 68+ messages in thread From: Arjan van de Ven @ 2005-04-08 16:06 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Mingming Cao, Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton > And rw locking is much better for concurrency, so > we might be able to hold it over the whole bitmap search rather than > taking it and dropping it at each window advance. rw locks only help if readers are 10x more common than writers, generally speaking. They are quite expensive locks, so they really should be used with the utmost care. (If you have really long hold times the dynamics are different, but if you have really long hold times your latency sucks too, and wasn't that what this thread was trying to fix?) ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: ext3 allocate-with-reservation latencies 2005-04-08 14:40 ` Stephen C. Tweedie 2005-04-08 16:06 ` Arjan van de Ven @ 2005-04-08 18:10 ` Mingming Cao 2005-04-08 18:12 ` Lee Revell 2005-04-11 11:48 ` Stephen C. Tweedie 1 sibling, 2 replies; 68+ messages in thread From: Mingming Cao @ 2005-04-08 18:10 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton On Fri, 2005-04-08 at 15:40 +0100, Stephen C. Tweedie wrote: > Hi, > > On Fri, 2005-04-08 at 00:37, Mingming Cao wrote: > > > Actually, we do not have to do an rbtree link and unlink for every > > window we search. If the reserved window(old) has no free bit and the > > new reservable window's is right after the old one, no need to unlink > > the old window from the rbtree and then link the new window, just update > > the start and end of the old one. > > It still needs to be done under locking to prevent us from expanding > over the next window, though. And having to take and drop a spinlock a > dozen times or more just to find out that there are no usable free > blocks in the current block group is still expensive, even if we're not > actually fully unlinking the window each time. > Isn't this a rare case? The whole group is relatively full and the free bits are all reserved by other files. Probably we should avoid trying to make a reservation in this block group in the first place, if we could find a way to detect that the number of _usable_ free bits is less than the requested window size. > I wonder if this can possibly be done safely without locking? It would > be really good if we could rotate windows forward with no global locks. > At the very least, a fair rwlock would let us freeze the total layout of > the tree, while still letting us modify individual windows safely. As > long as we use wmb() to make sure that we always extend the end of the > window before we shrink the start of it, I think we could get away with > such shared locking.
> And rw locking is much better for concurrency, so > we might be able to hold it over the whole bitmap search rather than > taking it and dropping it at each window advance. > You are proposing that we hold the read lock first, do the window search and bitmap scan, then once we confirm there is a free bit in the candidate window, we grab the write lock and update the tree? I think this is a good idea to address the case you are concerned about: when we need to do lots of window searches before settling down. Also if later we decide (I think we have discussed this before) to always try to reserve a window with at least 8 contiguous free blocks, the search will be more expensive and the read lock will help. However I am still worried that the rw lock will allow concurrent files to try to lock the same window at the same time. Only one will succeed, though, so CPU usage will be high. And also, in the normal case the filesystem is not really full, so the rw lock probably becomes expensive.
* Re: ext3 allocate-with-reservation latencies 2005-04-08 18:10 ` Mingming Cao @ 2005-04-08 18:12 ` Lee Revell 2005-04-11 11:48 ` Stephen C. Tweedie 1 sibling, 0 replies; 68+ messages in thread From: Lee Revell @ 2005-04-08 18:12 UTC (permalink / raw) To: cmm; +Cc: Stephen C. Tweedie, Ingo Molnar, linux-kernel, Andrew Morton On Fri, 2005-04-08 at 11:10 -0700, Mingming Cao wrote: > However I am still worried that the rw lock will allow concurrent files > trying to lock the same window at the same time. Only one succeed > though., high cpu usage then. And also, in the normal case the > filesystem is not really full, probably rw lock becomes expensive. FWIW this was a 125GB filesystem, about 70% full. Lee ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: ext3 allocate-with-reservation latencies 2005-04-08 18:10 ` Mingming Cao 2005-04-08 18:12 ` Lee Revell @ 2005-04-11 11:48 ` Stephen C. Tweedie 2005-04-11 18:38 ` Mingming Cao 1 sibling, 1 reply; 68+ messages in thread From: Stephen C. Tweedie @ 2005-04-11 11:48 UTC (permalink / raw) To: Mingming Cao Cc: Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton, Stephen Tweedie Hi, On Fri, 2005-04-08 at 19:10, Mingming Cao wrote: > > It still needs to be done under locking to prevent us from expanding > > over the next window, though. And having to take and drop a spinlock a > > dozen times or more just to find out that there are no usable free > > blocks in the current block group is still expensive, even if we're not > > actually fully unlinking the window each time. > Isn't this a rare case? The whole group is relatively full and the free > bits are all reserved by other files. Well, that's not much different from the case where there _are_ usable bits but they are all near the end of the bitmap. And that's common enough as you fill up a block group with small files, for example. Once the bg becomes nearly full, new file opens-for-write will still try to allocate in that bg (because that's where the inode was allocated), but as it's a new fd we have no prior knowledge of _where_ in the bg to allocate, and we'll have to walk it to the end to find any free space. This is the access pattern I'd expect of (for example) "tar xvjf linux-2.6.12.tar.bz2", not exactly a rare case. > Probably we should avoid trying > to make reservation in this block group at the first place Not in this case, because it's the "home" of the file in question, and skipping to another bg would just leave useful space empty --- and that leads to fragmentation. > You are proposing that we hold the read lock first, do the window search > and bitmap scan, then once we confirm there is free bit in the candidate > window, we grab the write lock and update the tree?
No, I'm suggesting that if we need the write lock for tree updates, we may still be able to get away with just a read lock when updating an individual window. If all we're doing is winding the window forwards a bit, that's not actually changing the structure of the tree. > However I am still worried that the rw lock will allow concurrent files > trying to lock the same window at the same time. Adding a new window will need the write lock, always. But with the read lock, we can still check the neighbouring windows (the structure is pinned so those remain valid), so advancing a window with just that lock can still be done without risking overlapping the next window. --Stephen ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: ext3 allocate-with-reservation latencies 2005-04-11 11:48 ` Stephen C. Tweedie @ 2005-04-11 18:38 ` Mingming Cao 2005-04-11 19:12 ` Lee Revell 2005-04-11 19:57 ` Stephen C. Tweedie 0 siblings, 2 replies; 68+ messages in thread From: Mingming Cao @ 2005-04-11 18:38 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton On Mon, 2005-04-11 at 12:48 +0100, Stephen C. Tweedie wrote: > Hi, > > On Fri, 2005-04-08 at 19:10, Mingming Cao wrote: > > > > It still needs to be done under locking to prevent us from expanding > > > over the next window, though. And having to take and drop a spinlock a > > > dozen times or more just to find out that there are no usable free > > > blocks in the current block group is still expensive, even if we're not > > > actually fully unlinking the window each time. > > > Isn't this a rare case? The whole group is relatively full and the free > > bits are all reserved by other files. > > Well, that's not much different from the case where there _are_ usable > bits but they are all near the end of the bitmap. And that's common > enough as you fill up a block group with small files, for example. Once > the bg becomes nearly full, new file opens-for-write will still try to > allocate in that bg (because that's where the inode was allocated), but > as it's a new fd we have no prior knowledge of _where_ in the bh to > allocate, and we'll have to walk it to the end to find any free space. > This is the access pattern I'd expect of (for example) "tar xvjf > linux-2.6.12.tar.bz2", not exactly a rare case. > Okey. > > Probably we should avoid trying > > to make reservation in this block group at the first place > > Not in this case, because it's the "home" of the file in question, and > skipping to another bg would just leave useful space empty --- and that > leads to fragmentation. > I agree. We should not skip the home block group of the file. 
I guess what I was suggesting is, if allocation from the home group fails and we continue the linear search through the rest of the block groups, we could probably try to skip the block groups without enough usable free blocks to make a reservation. Checking for the number of free blocks left in the query bg is a good way, but probably not good enough, since those free blocks might already be reserved. > > You are proposing that we hold the read lock first, do the window search > > and bitmap scan, then once we confirm there is free bit in the candidate > > window, we grab the write lock and update the tree? > > No, I'm suggesting that if we need the write lock for tree updates, we > may still be able to get away with just a read lock when updating an > individual window. If all we're doing is winding the window forwards a > bit, that's not actually changing the structure of the tree. > > > However I am still worried that the rw lock will allow concurrent files > > trying to lock the same window at the same time. > > Adding a new window will need the write lock, always. But with the read > lock, we can still check the neighbouring windows (the structure is > pinned so those remain valid), so advancing a window with just that lock > can still be done without risking overlapping the next window. > Ah.. I see the win with the read lock now: once a reservation window is added, updating it (either winding it forward or searching for an available window) is probably the majority of the operations on the tree, and using a read lock for that should help reduce the latency. I will work on a patch for Lee to try sometime tonight. Thanks, Mingming ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: ext3 allocate-with-reservation latencies 2005-04-11 18:38 ` Mingming Cao @ 2005-04-11 19:12 ` Lee Revell 2005-04-11 19:22 ` Lee Revell 2005-04-11 19:57 ` Stephen C. Tweedie 1 sibling, 1 reply; 68+ messages in thread From: Lee Revell @ 2005-04-11 19:12 UTC (permalink / raw) To: cmm; +Cc: Stephen C. Tweedie, Ingo Molnar, linux-kernel, Andrew Morton On Mon, 2005-04-11 at 11:38 -0700, Mingming Cao wrote: > I will work on a patch for Lee to try sometime tonight. > Just FYI, this will take a while to test, because this latency seems quite rare. I haven't seen it again since my original report. Hopefully I can reproduce it with enough dbench processes. This seems to agree with your earlier observation that we seem to have been quite unlucky. Anyway, if I understand the thread it seems like this could increase performance as well as help latency. Lee ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: ext3 allocate-with-reservation latencies 2005-04-11 19:12 ` Lee Revell @ 2005-04-11 19:22 ` Lee Revell 0 siblings, 0 replies; 68+ messages in thread From: Lee Revell @ 2005-04-11 19:22 UTC (permalink / raw) To: cmm; +Cc: Stephen C. Tweedie, Ingo Molnar, linux-kernel, Andrew Morton On Mon, 2005-04-11 at 15:12 -0400, Lee Revell wrote: > On Mon, 2005-04-11 at 11:38 -0700, Mingming Cao wrote: > > I will work on a patch for Lee to try sometime tonight. > > > > Just FYI, this will take a while to test, because this latency seems > quite rare. I haven't seen it again since my original report. Never mind, I spoke too soon. As soon as I posted this, I hit it again. Seems like doing a CVS checkout reliably triggers the problem. Lee ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: ext3 allocate-with-reservation latencies 2005-04-11 18:38 ` Mingming Cao 2005-04-11 19:12 ` Lee Revell @ 2005-04-11 19:57 ` Stephen C. Tweedie 2005-04-12 6:41 ` Mingming Cao 1 sibling, 1 reply; 68+ messages in thread From: Stephen C. Tweedie @ 2005-04-11 19:57 UTC (permalink / raw) To: Mingming Cao Cc: Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton, Stephen Tweedie Hi, On Mon, 2005-04-11 at 19:38, Mingming Cao wrote: > I agree. We should not skip the home block group of the file. I guess > what I was suggesting is, if allocation from the home group failed and > we continuing the linear search the rest of block groups, we could > probably try to skip the block groups without enough usable free blocks > to make a reservation. Fair enough. Once those are the only bgs left, performance is going to drop pretty quickly, but that's not really avoidable on a very full filesystem. > Ah.. I see the win with the read lock now: once the a reservation window > is added, updating it (either winding it forward and or searching for a > avaliable window) probably is the majorirty of the operations on the > tree, and using read lock for that should help reduce the latency. Right. The down side is that for things like a kernel "tar xf", we'll be doing lots of small file unpacks, and hopefully most files will be just one or two reservations --- so there's little winding forward going on. The searching will still be improved in that case. Note that this may improve average case latencies, but it's not likely to improve worst-case ones. We still need a write lock to install a new window, and that's going to have to wait for us to finish finding a free bit even if that operation starts using a read lock. I'm not really sure what to do about worst-case here. For that, we really do want to drop the lock entirely while we do the bitmap scan. That leaves two options. Hold a reservation while we do that; or don't. 
Holding one poses the problems we discussed before: either you hold a large reservation (bad for disk layout in the presence of concurrent allocators), or you hold smaller ones (high cost as you continually advance the window, which requires some read lock on the tree to avoid bumping into the next window.) Just how bad would it be if we didn't hold a lock _or_ a window at all while doing the search for new window space? I didn't like that alternative at first because of the problem when you've got multiple tasks trying to allocate in the same space at the same time; but given the locking overhead of the alternatives, I'm wondering if this is actually the lesser evil. --Stephen ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: ext3 allocate-with-reservation latencies 2005-04-11 19:57 ` Stephen C. Tweedie @ 2005-04-12 6:41 ` Mingming Cao 2005-04-12 11:18 ` Stephen C. Tweedie 0 siblings, 1 reply; 68+ messages in thread From: Mingming Cao @ 2005-04-12 6:41 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton On Mon, 2005-04-11 at 20:57 +0100, Stephen C. Tweedie wrote: > Hi, Hi Stephen, > > On Mon, 2005-04-11 at 19:38, Mingming Cao wrote: > > Ah.. I see the win with the read lock now: once the a reservation window > > is added, updating it (either winding it forward and or searching for a > > avaliable window) probably is the majorirty of the operations on the > > tree, and using read lock for that should help reduce the latency. > > Right. The down side is that for things like a kernel "tar xf", we'll > be doing lots of small file unpacks, and hopefully most files will be > just one or two reservations --- so there's little winding forward going > on. The searching will still be improved in that case. Just a side note that "tar xf" should know the file size before unpacking it. So it could set the reservation window size large enough to fit the entire file before doing the file write, via an ioctl command. > Note that this may improve average case latencies, but it's not likely > to improve worst-case ones. We still need a write lock to install a new > window, and that's going to have to wait for us to finish finding a free > bit even if that operation starts using a read lock. > Yes indeed. However nothing is free and there are always trade-offs.:) By worst case do you mean multiple writes trying to allocate blocks around the same area? But I wonder if the latency seen by Lee belongs to this worst case: the latency comes mostly from the loop calling find_next_zero_bit() in bitmap_search_next_usable_block().
Even if we take out the whole reservation, we could still possibly run into this kind of latency: the bitmap on disk and the one in the journal are extremely inconsistent, so we need to keep searching them both until we find a bit that is free in both maps. > I'm not really sure what to do about worst-case here. For that, we > really do want to drop the lock entirely while we do the bitmap scan. > Hmm... if we drop the lock entirely while scanning the bitmap, assuming you mean drop the read lock, then I am afraid we have to re-check the tree (requiring a read or write lock) to see if the new window space is still there after the scan succeeds. This is probably not very interesting for the window rotating case. > That leaves two options. Hold a reservation while we do that; or don't. > Holding one poses the problems we discussed before: either you hold a > large reservation (bad for disk layout in the presence of concurrent > allocators), or you hold smaller ones (high cost as you continually > advance the window, which requires some read lock on the tree to avoid > bumping into the next window.) > Well, we cannot hold a reservation (which needs a tree update) without a write lock. I guess if we want to improve the average-case latency by replacing the current spin_lock with the read lock for the new window space search, we don't have much choice here. > Just how bad would it be if we didn't hold a lock _or_ a window at all > while doing the search for new window space? I wonder if this is feasible: walk through the rb tree without a lock? What if some node is being removed by another thread while we are walking through the tree and trying to get the next node from it? Thanks, Mingming ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: ext3 allocate-with-reservation latencies 2005-04-12 6:41 ` Mingming Cao @ 2005-04-12 11:18 ` Stephen C. Tweedie 2005-04-12 23:27 ` Mingming Cao 0 siblings, 1 reply; 68+ messages in thread From: Stephen C. Tweedie @ 2005-04-12 11:18 UTC (permalink / raw) To: Mingming Cao Cc: Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton, Stephen Tweedie Hi, On Tue, 2005-04-12 at 07:41, Mingming Cao wrote: > > Note that this may improve average case latencies, but it's not likely > > to improve worst-case ones. We still need a write lock to install a new > > window, and that's going to have to wait for us to finish finding a free > > bit even if that operation starts using a read lock. > > > Yes indeed. However nothing is free and there are always trade-offs.:) > > By worse case you mean multiple writes trying to allocate blocks around > same area? It doesn't matter where they are; multiple new file opens will all be looking for a write lock. You only need one long-held read lock and all the writers still block. The worst-case latencies can't be properly solved with r/w locks --- those let the readers go more quickly (assuming they are in the majority), which helps the average case, but writers still have to wait for exclusive access. We only really help them by dropping the lock entirely. > Even if we take out the whole > reservation, we still possibility run into this kind of latency: the > bitmap on disk and on journal are extremely inconsistent so we need to > keep searching them both until we find a free bit on both map. Quite possibly. But as long as that code is running without locks, it's much easier to deal with those latencies: they won't impact other CPUs, cond_resched() is easier, and there's even CONFIG_PREEMPT. > > I'm not really sure what to do about worst-case here. For that, we > > really do want to drop the lock entirely while we do the bitmap scan. 
> Hmm...if we drop the lock entirely while scan the bitmap, assuming you > mean drop the read lock, then I am afraid we have to re-check the tree > (require a read or write lock ) to see if the new window space is still > there after the scan succeed. Sure. You basically start off with a provisional window, and then if necessary roll it forward just the same way you roll normal windows forward when they get to their end. That means you can still drop the lock while you search for new space. When you get there, reacquire the lock and check that the intervening space is still available. That's really cheap for the common case. The difficulty is when you have many parallel allocations hitting the same bg: they allocate provisional windows, find the same free area later on in the bg, and then stomp on each other as they try to move their windows there. I wonder if there's not a simple solution for this --- mark the window as "provisional", and if any other task tries to allocate in the space immediately following such a window, it needs to block until that window is released. --Stephen ^ permalink raw reply [flat|nested] 68+ messages in thread
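One way to picture the "provisional window" idea sketched above: a flag on the window that makes any allocator trying to claim the space immediately after it back off until the owner is done scanning. The following is a hypothetical user-space sketch (all names invented); the real thing would sleep on a wait queue rather than simply refuse:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical window with a "provisional" flag, set while its owner
 * is scanning the bitmap with no lock held. */
struct window {
    unsigned long start, end;
    int provisional;
    struct window *next;   /* next window in block order, or NULL */
};

/* Can a new window be placed in the gap right after 'prev'?  Not while
 * prev is provisional: its owner may still roll it forward into that
 * gap, so the caller must wait (modelled here as a refusal). */
static int can_claim_space_after(const struct window *prev)
{
    return !prev->provisional;
}
```

The attraction is that the tree itself is only touched under the lock for the brief insert/clear of the flag, while the expensive bitmap scan runs with no lock at all.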
* Re: ext3 allocate-with-reservation latencies 2005-04-12 11:18 ` Stephen C. Tweedie @ 2005-04-12 23:27 ` Mingming Cao 2005-04-13 10:29 ` Stephen C. Tweedie 0 siblings, 1 reply; 68+ messages in thread From: Mingming Cao @ 2005-04-12 23:27 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton On Tue, 2005-04-12 at 12:18 +0100, Stephen C. Tweedie wrote: > Hi, > > On Tue, 2005-04-12 at 07:41, Mingming Cao wrote: > > > > Note that this may improve average case latencies, but it's not likely > > > to improve worst-case ones. We still need a write lock to install a new > > > window, and that's going to have to wait for us to finish finding a free > > > bit even if that operation starts using a read lock. > > > > > Yes indeed. However nothing is free and there are always trade-offs.:) > > > > By worse case you mean multiple writes trying to allocate blocks around > > same area? > > It doesn't matter where they are; multiple new file opens will all be > looking for a write lock. You only need one long-held read lock and all > the writers still block. The worst-case latencies can't be properly > solved with r/w locks --- those let the readers go more quickly > (assuming they are in the majority), which helps the average case, but > writers still have to wait for exclusive access. We only really help > them by dropping the lock entirely. > > > Even if we take out the whole > > reservation, we still possibility run into this kind of latency: the > > bitmap on disk and on journal are extremely inconsistent so we need to > > keep searching them both until we find a free bit on both map. > > Quite possibly. But as long as that code is running without locks, it's > much easier to deal with those latencies: they won't impact other CPUs, > cond_resched() is easier, and there's even CONFIG_PREEMPT. > Okey, I agree. > > > I'm not really sure what to do about worst-case here. 
For that, we > > > really do want to drop the lock entirely while we do the bitmap scan. > > > Hmm...if we drop the lock entirely while scan the bitmap, assuming you > > mean drop the read lock, then I am afraid we have to re-check the tree > > (require a read or write lock ) to see if the new window space is still > > there after the scan succeed. > > Sure. You basically start off with a provisional window, and then if > necessary roll it forward just the same way you roll normal windows > forward when they get to their end. That means you can still drop the > lock while you search for new space. When you get there, reacquire the > lock and check that the intervening space is still available. > Please note that the bitmap scan does not only scan the provisional window range, it will return the first free it on the bitmap start from the start block of the provisional window until the end of the whole block group. So in the case the new window's neighbor is the same as the old one(that means window is rolled forward), we only need roll forward once to find it. Either we find a free bit inside the provisional window, or we find a free bit out of it. In the first case we just roll window forward and we are done. In the second case, it's possible the free bit is inside someone else's window(which means we can't take that window) or it inside the new space after a already reserved window. Either way we have to lock up the whole tree to remove the old window and insert the new window. > That's really cheap for the common case. The difficulty is when you > have many parallel allocations hitting the same bg: they allocate > provisional windows, find the same free area later on in the bg, and > then stomp on each other as they try to move their windows there. 
> > I wonder if there's not a simple solution for this --- mark the window > as "provisional", and if any other task tries to allocate in the space > immediately following such a window, it needs to block until that window > is released. > Sounds interesting. However that implies we need a write lock to mark the window as provisional and to block other files looking for windows near it: we need to insert the provisional window into the tree and then mark it as a temporary window, to really let other files notice this kind of "hold". I wonder if the benefit of the read/write lock is worth all the hassle now. If the new window's neighbor stays the same, we only need to roll forward once. If not, after a successful scan, we need to grab the write lock and make sure the window is still there. If we dropped the lock without holding the new space, we would have to search the whole tree again to find out if the space is still there; we cannot use the previous node returned by find_next_reservable_window(), since that node could be gone while we scan the bitmap without the lock. Basically we do the tree search twice (once for the roll-forward case) and lock twice in the normal case. Also I am concerned about possible starvation of writers. I am thinking, maybe going back to the spin_lock is not that bad with the "mark provisional" suggestion you made? It allows us to mark the new space as provisional if we find a new space (preventing other window searches from running into the same neighborhood). We could release the lock and scan the bitmap without worrying about the new space being taken by others. If there is a free bit, then we can just clear the provisional bit and simply return (we don't have to re-grab the lock there). In the case where the final new window is just the old window rolled forward, we only need to grab the spin_lock once and move the window forward. This seems to me the simplest way to fix the latency that Lee reported, and it balances both the average and worst cases.
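The spin_lock-plus-provisional flow described in this message might look like the following in outline. This is a user-space model under assumed names (a simple flag stands in for the kernel spinlock, and the tree insertion is reduced to filling in one struct), not the actual ext3 patch:

```c
#include <assert.h>

/* Toy stand-in for the rsv_lock spinlock: asserts catch misuse. */
static int tree_locked;
static void tree_lock(void)   { assert(!tree_locked); tree_locked = 1; }
static void tree_unlock(void) { assert(tree_locked);  tree_locked = 0; }

struct window {
    unsigned long start, end;
    int provisional;
};

/* Step 1: under the lock, stake out a provisional window so that other
 * searchers skip this neighborhood.  In the real tree this is where
 * the window node would be inserted. */
static void reserve_provisional(struct window *w,
                                unsigned long start, unsigned long end)
{
    tree_lock();
    w->start = start;
    w->end = end;
    w->provisional = 1;
    tree_unlock();
}

/* Step 2 happens with no lock held: scan the bitmap within the window.
 * Step 3: if a free bit was found, just clear the flag -- the space was
 * never up for grabs, so no re-grab of the lock or re-check of the tree
 * is needed. */
static void commit_window(struct window *w)
{
    w->provisional = 0;
}
```

Compared with the rwlock variant, the lock is taken exactly once in the common path, and the provisional flag carries the "hold" across the unlocked bitmap scan.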
Does it still look too bad to you? Thanks, Mingming
* Re: ext3 allocate-with-reservation latencies 2005-04-12 23:27 ` Mingming Cao @ 2005-04-13 10:29 ` Stephen C. Tweedie [not found] ` <1113597161.3899.80.camel@localhost.localdomain> 2005-04-22 22:10 ` [RFC][PATCH] Reduce ext3 allocate-with-reservation lock latencies Mingming Cao 0 siblings, 2 replies; 68+ messages in thread From: Stephen C. Tweedie @ 2005-04-13 10:29 UTC (permalink / raw) To: Mingming Cao Cc: Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton, Stephen Tweedie Hi, On Wed, 2005-04-13 at 00:27, Mingming Cao wrote: > > I wonder if there's not a simple solution for this --- mark the window > > as "provisional", and if any other task tries to allocate in the space > > immediately following such a window, it needs to block until that window > > is released. > Sounds interesting. However that implies we need a write lock to mark > the window as provisional and block other files looking for windows near > it: we need to insert the provisional window into the tree and then mark > it as a temporary window, to really let other file notice this kind of > "hold". We need a lock for the tree modification, yes. > I wonder if the benefit of read/write lock is worth all the hassles now. The point of the provisional window is that you don't need read/write locks at all. The provisional window lets you unlock the tree completely while you do the bitmap scan, so there's really no reason to have rwlocks for the tree any more. > If the new window's neighbor stays the same, we only need to roll > forward once. If not, after a successful scan, we need to grab the > write lock, and make sure the window is still there. When we take the provisional window, we can make a note of how much space we have before the next window. 
And because all future allocators will stall if they try to allocate at this point due to the provisional window, we know that that space will remain outside any other window until we come to fix the provisional window and potentially roll it forward to the space we found. > If we dropped the > lock without holding the new space, we have to search the whole tree > again to find out if the space is still there As long as the space is within the area between the provisional window and its successor, we don't need to do that. (It gets more complex if the bitmap search returns a bit _beyond_ the next window, though.) > Also I am concerned about the possible > starvation on writers. In what way? > I am thinking, maybe back to the spin_lock is not that bad with the > "mark provisional" suggest you made? Right, that was the intent --- sorry if I didn't make it clear. > It allows us to mark the new space > as provisional if we find a new space(prevent other window searching run > into the same neighborhood). We could release the lock and scan the > bitmap without worry about the new space will be taking by others. Exactly. Note that there _is_ some additional complexity here. It's not entirely free. Specifically, if we have other tasks waiting on our provisional window, then we need to be very careful about the life expectancy of the wait queues involved, so that if the first task fixes its window and then deletes it before the waiters have woken up, they don't get confused by the provisional window vanishing while being waited for. The easy solution is a global wait queue for that, but that doesn't scale well. The complex solution is a per-window wait queue and reference count, which is obviously a bit of bloat, though probably worth it for the high-load case. Cheers, Stephen
* Re: ext3 allocate-with-reservation latencies [not found] ` <1113597161.3899.80.camel@localhost.localdomain> @ 2005-04-18 18:00 ` Stephen C. Tweedie 2005-04-18 21:56 ` [Ext2-devel] " Mingming Cao 0 siblings, 1 reply; 68+ messages in thread From: Stephen C. Tweedie @ 2005-04-18 18:00 UTC (permalink / raw) To: Mingming Cao; +Cc: Andrew Morton, linux-kernel, ext2-devel, Stephen Tweedie Hi, On Fri, 2005-04-15 at 21:32, Mingming Cao wrote: > Sorry for the delaying. I was not in office these days. No problem. > > > Also I am concerned about the possible > > > starvation on writers. > > In what way? > I was worried about the rw lock case.:) OK, so we're both on the same track here. :-) > > Note that there _is_ some additional complexity here. It's not entirely > > free. Specifically, if we have other tasks waiting on our provisional > > window, then we need to be very careful about the life expectancy of the > > wait queues involved, so that if the first task fixes its window and > > then deletes it before the waiters have woken up, they don't get > > confused by the provisional window vanishing while being waited for. > This approach(provisional window) sounds interesting to me, but it's a > little more complicated than I thought:( Yep. Once you have other tasks waiting on your window while it's not locked, object lifetime becomes a much bigger problem. > alloc_new_reservation() > retry: > lock the tree > search for a new space start from the given startblock > found a new space > if the new space is not a "provisional window" I was thinking rather that we _start_ with the window, and _then_ look for new space. 
So we'd start with:

    if we already have a window,
        mark it provisional;
    else,
        do
            search for a new window;
            if the immediately preceding window is provisional,
                wait for that window;
                continue;
            allocate the window and mark it provisional;
            break

At this point we have a provisional window, and we know that nobody else is going to allocate either in it, or in the empty space following it (because if they tried to, they'd bump into our own provisional window as their predecessor and would wait for us.) So even if the window doesn't occupy the _whole_ unreserved area, we can still keep searching without fear of multiple tasks trying to do so in the same space at the same time. --Stephen
* Re: [Ext2-devel] Re: ext3 allocate-with-reservation latencies 2005-04-18 18:00 ` Stephen C. Tweedie @ 2005-04-18 21:56 ` Mingming Cao 0 siblings, 0 replies; 68+ messages in thread From: Mingming Cao @ 2005-04-18 21:56 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Andrew Morton, linux-kernel, ext2-devel On Mon, 2005-04-18 at 19:00 +0100, Stephen C. Tweedie wrote: > > > Note that there _is_ some additional complexity here. It's not entirely > > > free. Specifically, if we have other tasks waiting on our provisional > > > window, then we need to be very careful about the life expectancy of the > > > wait queues involved, so that if the first task fixes its window and > > > then deletes it before the waiters have woken up, they don't get > > > confused by the provisional window vanishing while being waited for. > > > This approach(provisional window) sounds interesting to me, but it's a > > little more complicated than I thought:( > > Yep. Once you have other tasks waiting on your window while it's not > locked, object lifetime becomes a much bigger problem. > > > alloc_new_reservation() > > retry: > > lock the tree > > search for a new space start from the given startblock > > found a new space > > if the new space is not a "provisional window" > I was thinking rather that we _start_ with the window, and _then_ look > for new space. > It seems I am lost here. Could you elaborate more, please? What is "the window" you are referring to here? The old (stale) reservation window? I thought we were discussing the algorithm for how to allocate a new window.
> So we'd start with:
>
>     if we already have a window,
>         mark it provisional;
>     else,
>         do
>             search for a new window;
>             if the immediately preceding window is provisional,
>                 wait for that window;
>                 continue;
>             allocate the window and mark it provisional;
>             break
>
> At this point we have a provisional window, and we know that nobody else
> is going to allocate either in it, or in the empty space following it
> (because if they tried to, they'd bump into our own provisional window
> as their predecessor and would wait for us.)  So even if the window
> doesn't occupy the _whole_ unreserved area, we can still keep searching
> without fear of multiple tasks trying to do so in the same space at the
> same time.
>
> --Stephen

Hmm...... This thread was to address the latency of holding the global per-fs reservation lock while scanning the block group bitmap. And the whole point of the "provisional window" is to prevent other inodes from being forced into other block groups when one inode holds/reserves the whole block group as a temporary window to do the bitmap scan. I clearly see it's a win if there are no multiple threads allocating blocks nearby at the same time. However, the whole reservation scheme is there to address the case where multiple allocations happen at the same time near the same place. With the provisional method, if multiple threads try to allocate new windows in the same area, they still have to wait for the other new-window-allocation-and-bitmap-scans to finish. After that they will probably compete for the same window again and again... :( I am worried that the latency is not improved over before (holding the lock for the bitmap scan), and we have paid extra CPU time or context switches. Also, the changes to the algorithm are not going to be trivial. Now we all agree that the right thing to fix the latency is to drop the lock and then scan the bitmap.
Before that we need to reserve the open window in case someone else is targeting the same window. The question was whether we should reserve the whole free reservable space or just the window size we need. Now that we have explored and discussed so many possible solutions, I think the least evil and least intrusive way is to just lock the small window. We were worried that, in the case where the scanned free bit is not inside the temporarily reserved window, we may have to do many window unlink and link operations on the rb tree; however, this is not avoidable with the provisional window proposal either. Probably we could start with that simple proposal first, and if benchmarks show the CPU usage is really high and a concern, we could address that later? Thanks, Mingming
* [RFC][PATCH] Reduce ext3 allocate-with-reservation lock latencies 2005-04-13 10:29 ` Stephen C. Tweedie [not found] ` <1113597161.3899.80.camel@localhost.localdomain> @ 2005-04-22 22:10 ` Mingming Cao 2005-04-28 3:45 ` Lee Revell 2005-04-29 6:28 ` [PATCH] Reduce ext3 allocate-with-reservation lock latencies Mingming Cao 1 sibling, 2 replies; 68+ messages in thread From: Mingming Cao @ 2005-04-22 22:10 UTC (permalink / raw) To: Andrew Morton, Stephen C. Tweedie; +Cc: Ingo Molnar, Lee Revell, linux-kernel Andrew, Stephen, Currently in the ext3 block reservation code, the global filesystem reservation tree lock (rsv_block) is held during the whole process of searching for a space to make a new reservation window, including while scanning the block bitmap to verify that the available window has a free block. Holding the lock during the bitmap scan is unnecessary and could cause scalability and latency issues. This patch tries to address this by dropping the lock before scanning the bitmap. Before that we need to reserve the open window in case someone else is targeting the same window. The question was whether we should reserve the whole free reservable space or just the window size we need. Reserving the whole free reservable space could force other threads that intended to allocate blocks nearby to move to another block group (causing bad layout). In this patch, we just reserve the desired size before dropping the lock and scanning the block bitmap. Please review. I have tested with fsx on an SMP box. Lee, if you get time, please try this patch.
Signed-Off-By: Mingming Cao <cmm@us.ibm.com> --- linux-2.6.11-ming/fs/ext3/balloc.c | 135 +++++++++++++++++-------------------- linux-2.6.11-ming/fs/ext3/file.c | 4 + 2 files changed, 68 insertions(+), 71 deletions(-) diff -puN fs/ext3/balloc.c~ext3-reduce-reservation-lock-latency fs/ext3/balloc.c --- linux-2.6.11/fs/ext3/balloc.c~ext3-reduce-reservation-lock-latency 2005-04-22 10:49:04.000000000 -0700 +++ linux-2.6.11-ming/fs/ext3/balloc.c 2005-04-22 14:42:35.518751070 -0700 @@ -749,24 +749,24 @@ fail_access: * to find a free region that is of my size and has not * been reserved. * - * on succeed, it returns the reservation window to be appended to. - * failed, return NULL. */ -static struct ext3_reserve_window_node *find_next_reservable_window( +static int find_next_reservable_window( struct ext3_reserve_window_node *search_head, - unsigned long size, int *start_block, + struct ext3_reserve_window_node *my_rsv, + struct super_block * sb, int start_block, int last_block) { struct rb_node *next; struct ext3_reserve_window_node *rsv, *prev; int cur; + int size = my_rsv->rsv_goal_size; /* TODO: make the start of the reservation window byte-aligned */ /* cur = *start_block & ~7;*/ - cur = *start_block; + cur = start_block; rsv = search_head; if (!rsv) - return NULL; + return -1; while (1) { if (cur <= rsv->rsv_end) @@ -782,7 +782,7 @@ static struct ext3_reserve_window_node * * space with expected-size (or more)... */ if (cur > last_block) - return NULL; /* fail */ + return -1; /* fail */ prev = rsv; next = rb_next(&rsv->rsv_node); @@ -813,8 +813,26 @@ static struct ext3_reserve_window_node * * return the reservation window that we could append to. * succeed. 
*/ - *start_block = cur; - return prev; + + if ((prev != my_rsv) && (!rsv_is_empty(&my_rsv->rsv_window))) + rsv_window_remove(sb, my_rsv); + + /* let's book the whole avaliable window for now + * we will check the + * disk bitmap later and then, if there are free block + * then we adjust the window size if the it's + * larger than requested. + * Otherwise, we will remove this node from the tree next time + * call find_next_reservable_window. + */ + my_rsv->rsv_start = cur; + my_rsv->rsv_end = cur + size - 1; + my_rsv->rsv_alloc_hit = 0; + + if (prev != my_rsv) + ext3_rsv_window_add(sb, my_rsv); + + return 0; } /** @@ -852,6 +870,7 @@ static struct ext3_reserve_window_node * * @sb: the super block * @group: the group we are trying to allocate in * @bitmap_bh: the block group block bitmap + * */ static int alloc_new_reservation(struct ext3_reserve_window_node *my_rsv, int goal, struct super_block *sb, @@ -860,10 +879,10 @@ static int alloc_new_reservation(struct struct ext3_reserve_window_node *search_head; int group_first_block, group_end_block, start_block; int first_free_block; - int reservable_space_start; - struct ext3_reserve_window_node *prev_rsv; struct rb_root *fs_rsv_root = &EXT3_SB(sb)->s_rsv_window_root; unsigned long size; + int ret; + spinlock_t *rsv_lock = &EXT3_SB(sb)->s_rsv_window_lock; group_first_block = le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block) + group * EXT3_BLOCKS_PER_GROUP(sb); @@ -875,6 +894,7 @@ static int alloc_new_reservation(struct start_block = goal + group_first_block; size = my_rsv->rsv_goal_size; + if (!rsv_is_empty(&my_rsv->rsv_window)) { /* * if the old reservation is cross group boundary @@ -908,6 +928,8 @@ static int alloc_new_reservation(struct my_rsv->rsv_goal_size= size; } } + + spin_lock(rsv_lock); /* * shift the search start to the window near the goal block */ @@ -921,11 +943,16 @@ static int alloc_new_reservation(struct * need to check the bitmap after we found a reservable window. 
*/ retry: - prev_rsv = find_next_reservable_window(search_head, size, - &start_block, group_end_block); - if (prev_rsv == NULL) - goto failed; - reservable_space_start = start_block; + ret = find_next_reservable_window(search_head, my_rsv, sb, + start_block, group_end_block); + + if (ret == -1) { + if (!rsv_is_empty(&my_rsv->rsv_window)) + rsv_window_remove(sb, my_rsv); + spin_unlock(rsv_lock); + return -1; + } + /* * On success, find_next_reservable_window() returns the * reservation window where there is a reservable space after it. @@ -937,8 +964,9 @@ retry: * block. Search start from the start block of the reservable space * we just found. */ + spin_unlock(rsv_lock); first_free_block = bitmap_search_next_usable_block( - reservable_space_start - group_first_block, + my_rsv->rsv_start - group_first_block, bitmap_bh, group_end_block - group_first_block + 1); if (first_free_block < 0) { @@ -946,54 +974,30 @@ retry: * no free block left on the bitmap, no point * to reserve the space. return failed. 
*/ - goto failed; + spin_lock(rsv_lock); + if (!rsv_is_empty(&my_rsv->rsv_window)) + rsv_window_remove(sb, my_rsv); + spin_unlock(rsv_lock); + return -1; /* failed */ } + start_block = first_free_block + group_first_block; /* * check if the first free block is within the - * free space we just found + * free space we just reserved */ - if ((start_block >= reservable_space_start) && - (start_block < reservable_space_start + size)) - goto found_rsv_window; + if ((start_block >= my_rsv->rsv_start) && + (start_block < my_rsv->rsv_end)) + return 0; /* succeed */ /* * if the first free bit we found is out of the reservable space - * this means there is no free block on the reservable space - * we should continue search for next reservable space, + * continue search for next reservable space, * start from where the free block is, * we also shift the list head to where we stopped last time */ - search_head = prev_rsv; + search_head = my_rsv; + spin_lock(rsv_lock); goto retry; - -found_rsv_window: - /* - * great! the reservable space contains some free blocks. - * if the search returns that we should add the new - * window just next to where the old window, we don't - * need to remove the old window first then add it to the - * same place, just update the new start and new end. 
- */ - if (my_rsv != prev_rsv) { - if (!rsv_is_empty(&my_rsv->rsv_window)) - rsv_window_remove(sb, my_rsv); - } - my_rsv->rsv_start = reservable_space_start; - my_rsv->rsv_end = my_rsv->rsv_start + size - 1; - my_rsv->rsv_alloc_hit = 0; - if (my_rsv != prev_rsv) { - ext3_rsv_window_add(sb, my_rsv); - } - return 0; /* succeed */ -failed: - /* - * failed to find a new reservation window in the current - * group, remove the current(stale) reservation window - * if there is any - */ - if (!rsv_is_empty(&my_rsv->rsv_window)) - rsv_window_remove(sb, my_rsv); - return -1; /* failed */ } /* @@ -1023,7 +1027,6 @@ ext3_try_to_allocate_with_rsv(struct sup int goal, struct ext3_reserve_window_node * my_rsv, int *errp) { - spinlock_t *rsv_lock; unsigned long group_first_block; int ret = 0; int fatal; @@ -1052,7 +1055,6 @@ ext3_try_to_allocate_with_rsv(struct sup ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, NULL); goto out; } - rsv_lock = &EXT3_SB(sb)->s_rsv_window_lock; /* * goal is a group relative block number (if there is a goal) * 0 < goal < EXT3_BLOCKS_PER_GROUP(sb) @@ -1078,30 +1080,21 @@ ext3_try_to_allocate_with_rsv(struct sup * then we could go to allocate from the reservation window directly. 
*/ while (1) { - struct ext3_reserve_window rsv_copy; - - rsv_copy._rsv_start = my_rsv->rsv_start; - rsv_copy._rsv_end = my_rsv->rsv_end; - - if (rsv_is_empty(&rsv_copy) || (ret < 0) || - !goal_in_my_reservation(&rsv_copy, goal, group, sb)) { - spin_lock(rsv_lock); + if (rsv_is_empty(&my_rsv->rsv_window) || (ret < 0) || + !goal_in_my_reservation(&my_rsv->rsv_window, goal, group, sb)) { ret = alloc_new_reservation(my_rsv, goal, sb, group, bitmap_bh); - rsv_copy._rsv_start = my_rsv->rsv_start; - rsv_copy._rsv_end = my_rsv->rsv_end; - spin_unlock(rsv_lock); if (ret < 0) break; /* failed */ - if (!goal_in_my_reservation(&rsv_copy, goal, group, sb)) + if (!goal_in_my_reservation(&my_rsv->rsv_window, goal, group, sb)) goal = -1; } - if ((rsv_copy._rsv_start >= group_first_block + EXT3_BLOCKS_PER_GROUP(sb)) - || (rsv_copy._rsv_end < group_first_block)) + if ((my_rsv->rsv_start >= group_first_block + EXT3_BLOCKS_PER_GROUP(sb)) + || (my_rsv->rsv_end < group_first_block)) BUG(); ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, - &rsv_copy); + &my_rsv->rsv_window); if (ret >= 0) { my_rsv->rsv_alloc_hit++; break; /* succeed */ diff -puN fs/ext3/file.c~ext3-reduce-reservation-lock-latency fs/ext3/file.c --- linux-2.6.11/fs/ext3/file.c~ext3-reduce-reservation-lock-latency 2005-04-22 10:49:04.000000000 -0700 +++ linux-2.6.11-ming/fs/ext3/file.c 2005-04-22 10:49:04.000000000 -0700 @@ -36,7 +36,11 @@ static int ext3_release_file (struct ino /* if we are the last writer on the inode, drop the block reservation */ if ((filp->f_mode & FMODE_WRITE) && (atomic_read(&inode->i_writecount) == 1)) + { + down(&EXT3_I(inode)->truncate_sem); ext3_discard_reservation(inode); + up(&EXT3_I(inode)->truncate_sem); + } if (is_dx(inode) && filp->private_data) ext3_htree_free_dir_info(filp->private_data); _ ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC][PATCH] Reduce ext3 allocate-with-reservation lock latencies 2005-04-22 22:10 ` [RFC][PATCH] Reduce ext3 allocate-with-reservation lock latencies Mingming Cao @ 2005-04-28 3:45 ` Lee Revell 2005-04-28 7:37 ` Mingming Cao 2005-04-28 19:14 ` [RFC] Adding multiple block allocation to current ext3 Mingming Cao 2005-04-29 6:28 ` [PATCH] Reduce ext3 allocate-with-reservation lock latencies Mingming Cao 1 sibling, 2 replies; 68+ messages in thread From: Lee Revell @ 2005-04-28 3:45 UTC (permalink / raw) To: cmm; +Cc: Andrew Morton, Stephen C. Tweedie, Ingo Molnar, linux-kernel On Fri, 2005-04-22 at 15:10 -0700, Mingming Cao wrote: > Please review. I have tested on fsx on SMP box. Lee, if you got time, > please try this patch. I have tested and this does fix the problem. I ran my tests and no ext3 code paths showed up on the latency tracer at all, it never went above 33 usecs. Please apply. Lee
* Re: [RFC][PATCH] Reduce ext3 allocate-with-reservation lock latencies 2005-04-28 3:45 ` Lee Revell @ 2005-04-28 7:37 ` Mingming Cao 2005-04-28 16:12 ` Lee Revell 0 siblings, 1 reply; 68+ messages in thread From: Mingming Cao @ 2005-04-28 7:37 UTC (permalink / raw) To: Lee Revell; +Cc: Andrew Morton, Stephen C. Tweedie, Ingo Molnar, linux-kernel On Wed, 2005-04-27 at 23:45 -0400, Lee Revell wrote: > On Fri, 2005-04-22 at 15:10 -0700, Mingming Cao wrote: > > Please review. I have tested on fsx on SMP box. Lee, if you got time, > > please try this patch. > > I have tested and this does fix the problem. I ran my tests and no ext3 > code paths showed up on the latency tracer at all, it never went above > 33 usecs. > Thanks, Lee. The patch survived many fsx tests over 20 hours on a 2-CPU machine. I tested the patch on the same machine with tiobench (1-64 threads) and a kernel-tree untar test; no regression there. However, I see about a 5-7% throughput drop on dbench with 16 threads. It is probably due to the CPU cost that we have discussed. Mingming
* Re: [RFC][PATCH] Reduce ext3 allocate-with-reservation lock latencies 2005-04-28 7:37 ` Mingming Cao @ 2005-04-28 16:12 ` Lee Revell 2005-04-28 18:34 ` Mingming Cao 0 siblings, 1 reply; 68+ messages in thread From: Lee Revell @ 2005-04-28 16:12 UTC (permalink / raw) To: cmm; +Cc: Andrew Morton, Stephen C. Tweedie, Ingo Molnar, linux-kernel On Thu, 2005-04-28 at 00:37 -0700, Mingming Cao wrote: > On Wed, 2005-04-27 at 23:45 -0400, Lee Revell wrote: > > On Fri, 2005-04-22 at 15:10 -0700, Mingming Cao wrote: > > > Please review. I have tested on fsx on SMP box. Lee, if you got time, > > > please try this patch. > > > > I have tested and this does fix the problem. I ran my tests and no ext3 > > code paths showed up on the latency tracer at all, it never went above > > 33 usecs. > > > Thanks, Lee. > > The patch survived on many fsx test over 20 hours on a 2cpu machine. > Tested the patch on the same machine with tiobench (1-64 threads), and > untar a kernel tree test, no regression there. > However I see about 5-7% throughput drop on dbench with 16 threads. It > probably due to the cpu cost that we have discussed. Hmm, I guess someone needs to test it on a bigger system. AFAICT this should improve SMP scalability quite a bit. Maybe that lock is rarely contended. Lee ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC][PATCH] Reduce ext3 allocate-with-reservation lock latencies 2005-04-28 16:12 ` Lee Revell @ 2005-04-28 18:34 ` Mingming Cao 2005-04-29 6:18 ` Mingming Cao 0 siblings, 1 reply; 68+ messages in thread From: Mingming Cao @ 2005-04-28 18:34 UTC (permalink / raw) To: Lee Revell; +Cc: Andrew Morton, Stephen C. Tweedie, Ingo Molnar, linux-kernel On Thu, 2005-04-28 at 12:12 -0400, Lee Revell wrote: > On Thu, 2005-04-28 at 00:37 -0700, Mingming Cao wrote: > > On Wed, 2005-04-27 at 23:45 -0400, Lee Revell wrote: > > > On Fri, 2005-04-22 at 15:10 -0700, Mingming Cao wrote: > > > > Please review. I have tested on fsx on SMP box. Lee, if you got time, > > > > please try this patch. > > > > > > I have tested and this does fix the problem. I ran my tests and no ext3 > > > code paths showed up on the latency tracer at all, it never went above > > > 33 usecs. > > > > > Thanks, Lee. > > > > The patch survived on many fsx test over 20 hours on a 2cpu machine. > > Tested the patch on the same machine with tiobench (1-64 threads), and > > untar a kernel tree test, no regression there. > > However I see about 5-7% throughput drop on dbench with 16 threads. It > > probably due to the cpu cost that we have discussed. > > Hmm, I guess someone needs to test it on a bigger system. AFAICT this > should improve SMP scalability quite a bit. Maybe that lock is rarely > contended. Probably. I will try it on a relatively full filesystem(where bitmap scan to find a free block takes time), and on a 4 cpu box with many threads allocating at the same time. Mingming ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC][PATCH] Reduce ext3 allocate-with-reservation lock latencies 2005-04-28 18:34 ` Mingming Cao @ 2005-04-29 6:18 ` Mingming Cao 0 siblings, 0 replies; 68+ messages in thread From: Mingming Cao @ 2005-04-29 6:18 UTC (permalink / raw) To: Lee Revell; +Cc: Andrew Morton, Stephen C. Tweedie, Ingo Molnar, linux-kernel On Thu, 2005-04-28 at 11:34 -0700, Mingming Cao wrote: > On Thu, 2005-04-28 at 12:12 -0400, Lee Revell wrote: > > On Thu, 2005-04-28 at 00:37 -0700, Mingming Cao wrote: > > > On Wed, 2005-04-27 at 23:45 -0400, Lee Revell wrote: > > > > On Fri, 2005-04-22 at 15:10 -0700, Mingming Cao wrote: > > > > > Please review. I have tested on fsx on SMP box. Lee, if you got time, > > > > > please try this patch. > > > > > > > > I have tested and this does fix the problem. I ran my tests and no ext3 > > > > code paths showed up on the latency tracer at all, it never went above > > > > 33 usecs. > > > > > > > Thanks, Lee. > > > > > > The patch survived on many fsx test over 20 hours on a 2cpu machine. > > > Tested the patch on the same machine with tiobench (1-64 threads), and > > > untar a kernel tree test, no regression there. > > > However I see about 5-7% throughput drop on dbench with 16 threads. It > > > probably due to the cpu cost that we have discussed. > > > > Hmm, I guess someone needs to test it on a bigger system. AFAICT this > > should improve SMP scalability quite a bit. Maybe that lock is rarely > > contended. > > Probably. I will try it on a relatively full filesystem(where bitmap > scan to find a free block takes time), and on a 4 cpu box with many > threads allocating at the same time. I ran a 64-thread dbench on a 4-way box on a 90% full filesystem; the patched kernel is slightly faster. I think part of the reason is that on a 90% full filesystem, the block bitmap scan takes a long time.
Thus dropping the lock before the bitmap scan helps, although there is some overhead from dropping/re-grabbing the lock and the extra window search/link/unlink before we find a window with a free block. On a freshly created filesystem, where the regression is seen, the bitmap scan takes a shorter time, so the overhead of dropping/re-grabbing the lock and the extra window search/link/unlink is probably what stands out. I also ran 1000 dd tests in parallel, each of them writing to a different file (2M size). There seems to be no gain and no regression. Anyway, we should not hold the global reservation lock while scanning the bitmap; that's both a scalability issue and a latency issue. To drop the lock during the bitmap scan we need to do extra work; I don't think that's avoidable. Andrew, could you add it to the mm tree? I will re-send the patch with the changelog. Thanks, Mingming
* [RFC] Adding multiple block allocation to current ext3 2005-04-28 3:45 ` Lee Revell 2005-04-28 7:37 ` Mingming Cao @ 2005-04-28 19:14 ` Mingming Cao 2005-04-29 13:52 ` [Ext2-devel] [RFC] Adding multiple block allocation Suparna Bhattacharya 2006-01-10 23:26 ` [PATCH 0/5] multiple block allocation to current ext3 Mingming Cao 1 sibling, 2 replies; 68+ messages in thread From: Mingming Cao @ 2005-04-28 19:14 UTC (permalink / raw) To: Andrew Morton; +Cc: Stephen C. Tweedie, linux-kernel, ext2-devel Currently ext3_get_block()/ext3_new_block() only allocate one block at a time. To allocate multiple blocks, the caller, for example the ext3 direct IO routine, has to invoke ext3_get_block() many times. This is quite inefficient for sequential IO workloads. The benefits of a real get_blocks() include:

1) A better chance of getting contiguous blocks, reducing the possibility of fragmentation due to interleaved allocations from other threads (should be good for the non-reservation case).
2) Fewer CPU cycles spent in repeated get_block() calls.
3) Batched metadata updates and journaling in one shot.
4) Possibly faster future get_blocks() lookups, by caching the last mapped blocks in the inode.

Multiple block allocation for ext3 is very useful for supporting delayed allocation, and also useful for direct IO. This effort tries to fill these requirements on top of the current ext3 disk format. Given a file offset and a maximum length of blocks to map, ext3_get_blocks() will find or allocate the corresponding chunk of contiguous physical blocks on disk. It will return the buffer head mapped to the first physical block corresponding to the file offset, and the number of contiguous blocks just mapped or allocated. First, it searches the [i,d,t]direct blocks indexed by the file offset to find the physical block.
If it is allocated, it continues mapping the next logical block until the mapped physical blocks are no longer adjacent, or there is no corresponding physical block allocated. In the case where block allocation is needed, it first walks through the inode's block mapping tree to count the number of adjacent blocks to allocate. Then it passes the number of blocks to allocate to ext3_new_blocks(), where the real block allocation is performed. After allocating the first block, ext3_new_blocks() will always attempt to allocate the next few adjacent blocks (up to the requested size and not beyond the reservation window) at the same time. Once the blocks are allocated and the bitmap updated, it finally updates the metadata blocks with the new block info and splices the whole branch into the block mapping tree. Partial read maps and partial write maps are not supported, since there is no guarantee that the block allocation will return contiguous physical blocks. Allocation across an indirect block boundary is not supported, due to the concern that it may place a metadata block far from its data blocks. Below is just a first cut of the proposed patch, attached here to show the effort/direction. I'd appreciate any comments/suggestions/critiques!
Thanks, Mingming --- linux-2.6.11-ming/fs/ext3/balloc.c | 89 ++++++--- linux-2.6.11-ming/fs/ext3/inode.c | 284 ++++++++++++++++++++++++++++-- linux-2.6.11-ming/fs/ext3/xattr.c | 3 linux-2.6.11-ming/include/linux/ext3_fs.h | 2 4 files changed, 333 insertions(+), 45 deletions(-) diff -puN fs/ext3/balloc.c~ext3_multiple_blocks_allocation fs/ext3/balloc.c --- linux-2.6.11/fs/ext3/balloc.c~ext3_multiple_blocks_allocation 2005-04-28 12:12:06.168475152 -0700 +++ linux-2.6.11-ming/fs/ext3/balloc.c 2005-04-28 12:12:06.186473049 -0700 @@ -652,9 +652,11 @@ claim_block(spinlock_t *lock, int block, */ static int ext3_try_to_allocate(struct super_block *sb, handle_t *handle, int group, - struct buffer_head *bitmap_bh, int goal, struct ext3_reserve_window *my_rsv) + struct buffer_head *bitmap_bh, int goal, unsigned long *count, + struct ext3_reserve_window *my_rsv) { int group_first_block, start, end; + unsigned long num = 0; /* we do allocation within the reservation window if we have a window */ if (my_rsv) { @@ -712,8 +714,18 @@ repeat: goto fail_access; goto repeat; } - return goal; + num++; + goal++; + while (num < *count && goal < end + && ext3_test_allocatable(goal, bitmap_bh) + && claim_block(sb_bgl_lock(EXT3_SB(sb), group), goal, bitmap_bh)) { + num++; + goal++; + } + *count = num; + return goal - num; fail_access: + *count = num; return -1; } @@ -1025,11 +1037,13 @@ static int ext3_try_to_allocate_with_rsv(struct super_block *sb, handle_t *handle, unsigned int group, struct buffer_head *bitmap_bh, int goal, struct ext3_reserve_window_node * my_rsv, - int *errp) + unsigned long *count, int *errp) { unsigned long group_first_block; int ret = 0; int fatal; + int original_goal=goal; + unsigned long num = *count; *errp = 0; @@ -1052,7 +1066,8 @@ ext3_try_to_allocate_with_rsv(struct sup * or last attempt to allocate a block with reservation turned on failed */ if (my_rsv == NULL ) { - ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, NULL); + ret = 
ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, + count, NULL); goto out; } /* @@ -1094,13 +1109,17 @@ ext3_try_to_allocate_with_rsv(struct sup || (my_rsv->rsv_end < group_first_block)) BUG(); ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, - &my_rsv->rsv_window); + &num, &my_rsv->rsv_window); if (ret >= 0) { - my_rsv->rsv_alloc_hit++; + my_rsv->rsv_alloc_hit += num; + *count = num; break; /* succeed */ } + num = *count; } out: + if (my_rsv) + printk("Allocated blocks from %d ,count is %d, goal was %d, group is %d, inode is %, reservation window (%d,%d)\n", ret, *count, original_goal, group, my_rsv->rsv_start, my_rsv->rsv_end); if (ret >= 0) { BUFFER_TRACE(bitmap_bh, "journal_dirty_metadata for " "bitmap block"); @@ -1155,8 +1174,8 @@ int ext3_should_retry_alloc(struct super * bitmap, and then for any free bit if that fails. * This function also updates quota and i_blocks field. */ -int ext3_new_block(handle_t *handle, struct inode *inode, - unsigned long goal, int *errp) +int ext3_new_blocks(handle_t *handle, struct inode *inode, + unsigned long goal, unsigned long* count, int *errp) { struct buffer_head *bitmap_bh = NULL; struct buffer_head *gdp_bh; @@ -1179,7 +1198,8 @@ int ext3_new_block(handle_t *handle, str static int goal_hits, goal_attempts; #endif unsigned long ngroups; - + unsigned long num = *count; + int i; *errp = -ENOSPC; sb = inode->i_sb; if (!sb) { @@ -1190,7 +1210,7 @@ int ext3_new_block(handle_t *handle, str /* * Check quota for allocation of this block. 
*/ - if (DQUOT_ALLOC_BLOCK(inode, 1)) { + if (DQUOT_ALLOC_BLOCK(inode, num)) { *errp = -EDQUOT; return 0; } @@ -1245,7 +1265,7 @@ retry: if (!bitmap_bh) goto io_error; ret_block = ext3_try_to_allocate_with_rsv(sb, handle, group_no, - bitmap_bh, ret_block, my_rsv, &fatal); + bitmap_bh, ret_block, my_rsv, &num, &fatal); if (fatal) goto out; if (ret_block >= 0) @@ -1282,7 +1302,7 @@ retry: if (!bitmap_bh) goto io_error; ret_block = ext3_try_to_allocate_with_rsv(sb, handle, group_no, - bitmap_bh, -1, my_rsv, &fatal); + bitmap_bh, -1, my_rsv, &num, &fatal); if (fatal) goto out; if (ret_block >= 0) @@ -1317,14 +1337,17 @@ allocated: target_block = ret_block + group_no * EXT3_BLOCKS_PER_GROUP(sb) + le32_to_cpu(es->s_first_data_block); - if (target_block == le32_to_cpu(gdp->bg_block_bitmap) || - target_block == le32_to_cpu(gdp->bg_inode_bitmap) || - in_range(target_block, le32_to_cpu(gdp->bg_inode_table), - EXT3_SB(sb)->s_itb_per_group)) - ext3_error(sb, "ext3_new_block", - "Allocating block in system zone - " - "block = %u", target_block); - + for (i = 0; i < num; i++, target_block++) { + if (target_block == le32_to_cpu(gdp->bg_block_bitmap) || + target_block == le32_to_cpu(gdp->bg_inode_bitmap) || + in_range(target_block, le32_to_cpu(gdp->bg_inode_table), + EXT3_SB(sb)->s_itb_per_group)) { + ext3_error(sb, "ext3_new_block", + "Allocating block in system zone - " + "block = %u", target_block); + goto out; + } + } performed_allocation = 1; #ifdef CONFIG_JBD_DEBUG @@ -1342,10 +1365,12 @@ allocated: jbd_lock_bh_state(bitmap_bh); spin_lock(sb_bgl_lock(sbi, group_no)); if (buffer_jbd(bitmap_bh) && bh2jh(bitmap_bh)->b_committed_data) { - if (ext3_test_bit(ret_block, - bh2jh(bitmap_bh)->b_committed_data)) { - printk("%s: block was unexpectedly set in " - "b_committed_data\n", __FUNCTION__); + for (i = 0; i < num ; i++) { + if (ext3_test_bit(ret_block++, + bh2jh(bitmap_bh)->b_committed_data)) { + printk("%s: block was unexpectedly set in " + "b_committed_data\n", __FUNCTION__); + 
} } } ext3_debug("found bit %d\n", ret_block); @@ -1354,12 +1379,12 @@ allocated: #endif /* ret_block was blockgroup-relative. Now it becomes fs-relative */ - ret_block = target_block; + ret_block = target_block - num; - if (ret_block >= le32_to_cpu(es->s_blocks_count)) { + if (target_block - 1>= le32_to_cpu(es->s_blocks_count)) { ext3_error(sb, "ext3_new_block", - "block(%d) >= blocks count(%d) - " - "block_group = %d, es == %p ", ret_block, + "block(%d) >= fs blocks count(%d) - " + "block_group = %d, es == %p ", target_block - 1, le32_to_cpu(es->s_blocks_count), group_no, es); goto out; } @@ -1374,9 +1399,9 @@ allocated: spin_lock(sb_bgl_lock(sbi, group_no)); gdp->bg_free_blocks_count = - cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count) - 1); + cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count)-num); spin_unlock(sb_bgl_lock(sbi, group_no)); - percpu_counter_mod(&sbi->s_freeblocks_counter, -1); + percpu_counter_mod(&sbi->s_freeblocks_counter, -num); BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for group descriptor"); err = ext3_journal_dirty_metadata(handle, gdp_bh); @@ -1388,6 +1413,8 @@ allocated: goto out; *errp = 0; + DQUOT_FREE_BLOCK(inode, *count-num); + *count = num; brelse(bitmap_bh); return ret_block; @@ -1402,7 +1429,7 @@ out: * Undo the block allocation */ if (!performed_allocation) - DQUOT_FREE_BLOCK(inode, 1); + DQUOT_FREE_BLOCK(inode, *count); brelse(bitmap_bh); return 0; } diff -puN fs/ext3/inode.c~ext3_multiple_blocks_allocation fs/ext3/inode.c --- linux-2.6.11/fs/ext3/inode.c~ext3_multiple_blocks_allocation 2005-04-28 12:12:06.172474685 -0700 +++ linux-2.6.11-ming/fs/ext3/inode.c 2005-04-28 12:12:06.192472348 -0700 @@ -237,12 +237,12 @@ static int ext3_alloc_block (handle_t *h struct inode * inode, unsigned long goal, int *err) { unsigned long result; + unsigned long count = 1; - result = ext3_new_block(handle, inode, goal, err); + result = ext3_new_blocks(handle, inode, goal, &count, err); return result; } - typedef struct { __le32 *p; __le32 
key; @@ -328,7 +328,7 @@ static int ext3_block_to_path(struct ino ext3_warning (inode->i_sb, "ext3_block_to_path", "block > big"); } if (boundary) - *boundary = (i_block & (ptrs - 1)) == (final - 1); + *boundary = final - 1 - (i_block & (ptrs - 1)); return n; } @@ -526,7 +526,7 @@ static int ext3_alloc_branch(handle_t *h /* * Get buffer_head for parent block, zero it out * and set the pointer to new one, then send - * parent to disk. + * parent to disk. */ bh = sb_getblk(inode->i_sb, parent); branch[n].bh = bh; @@ -566,6 +566,111 @@ static int ext3_alloc_branch(handle_t *h ext3_free_blocks(handle, inode, le32_to_cpu(branch[i].key), 1); return err; } +#define GBS_DEBUG 0 +static int ext3_alloc_splice_branch(handle_t *handle, struct inode *inode, + unsigned long num, unsigned long first_block, + int *offsets, Indirect *branch, int depth) +{ + int blocksize = inode->i_sb->s_blocksize; + int n = 0; + int err = 0; + int i; + unsigned long current_block; + struct buffer_head *bh; + + current_block = first_block; + branch[0].key = cpu_to_le32(current_block); + + if (GBS_DEBUG) + printk("block %d, branch[0].p :%x\n", current_block, branch[0].p); + for (n = 1; n < depth; n++) { + current_block++; + branch[n].key = cpu_to_le32(current_block); + + /* + * Get buffer_head for parent block, zero it out + * and set the pointer to new one, then send + * parent to disk. 
+ */ + bh = sb_getblk(inode->i_sb, current_block - 1); + branch[n].bh = bh; + lock_buffer(bh); + BUFFER_TRACE(bh, "call get_create_access"); + err = ext3_journal_get_create_access(handle, bh); + if (err) { + unlock_buffer(bh); + brelse(bh); + goto failed; + } + + memset(bh->b_data, 0, blocksize); + branch[n].p = (__le32*) bh->b_data + offsets[n]; + *branch[n].p = branch[n].key; + /* + * allocate blocks to the rest missing data blocks + */ + if (n == depth -1 ) { + i = 0; + while (current_block - first_block + 1 < num) { + *(branch[n].p + i + 1) = ++current_block; + i++; + } + } + BUFFER_TRACE(bh, "marking uptodate"); + set_buffer_uptodate(bh); + unlock_buffer(bh); + + BUFFER_TRACE(bh, "call ext3_journal_dirty_metadata"); + err = ext3_journal_dirty_metadata(handle, bh); + if (err) + goto failed; + } + + bh = branch[0].bh; + if (bh){ + BUFFER_TRACE(bh, "call get_write_access"); + err = ext3_journal_get_write_access(handle, bh); + if (err) + goto failed; + } + + *(branch[0].p) = branch[0].key; + + /* allocate blocks to the rest missing data blocks */ + i = 0; + while (current_block - first_block + 1< num) { + *(branch[0].p + i + 1) = ++current_block; + i++; + } + + if (GBS_DEBUG) { + for (i=0; i< n; i++) + printk("inode %x, branch[%d].p:%x, branch[%d].key:%d,\n", inode, i, branch[i].p, i, branch[i].key); + for (i=0; i< num-n; i++) + printk("inode %x, branch[%d].p + %d + 1:%x, *(branch[%d].p+%d+1):%d,\n, branch[%d].bh:%x\n", inode, n-1, i, branch[n-1].p + i +1, n-1, i, *(branch[n-1].p+i+1),n-1, branch[n-1].bh); + } + if (bh) { + BUFFER_TRACE(bh, "marking uptodate"); + set_buffer_uptodate(bh); + unlock_buffer(bh); + + BUFFER_TRACE(bh, "call ext3_journal_dirty_metadata"); + err = ext3_journal_dirty_metadata(handle, bh); + if (err) + goto failed; + } + + return err; +failed: + /* Allocation failed, free what we already allocated */ + for (i = 0; i < n ; i++) { + BUFFER_TRACE(branch[i].bh, "call journal_forget"); + ext3_journal_forget(handle, branch[n].bh); + } + for (i = 
0; i < num; i++) + ext3_free_blocks(handle, inode, first_block + i, 1); + return err; +} /** * ext3_splice_branch - splice the allocated branch onto inode. @@ -777,8 +882,153 @@ out: return err; } -static int ext3_get_block(struct inode *inode, sector_t iblock, - struct buffer_head *bh_result, int create) +static int +ext3_count_blocks_to_allocate(Indirect * branch, int k, + unsigned long maxblocks, int blocks_to_boundary) +{ + unsigned long count = 0; + + if (k == 0) return 0; + /* + * Simple case, [t,d]Indirect block(s) has not allocated yet + * then it's clear blocks on that path have not allocated + */ + if (GBS_DEBUG) + printk("maxblocks: %d, k: %d, boundary : %d \n",maxblocks, k, + blocks_to_boundary); + if (k > 1) { + /* right now don't hanel cross boundary allocation */ + if ((maxblocks - count) < blocks_to_boundary) + count += maxblocks; + else + count += blocks_to_boundary; + count += k - 1; /* blocks for [t,d]indirect blocks */ + return count; + } + + count++; + while (count < maxblocks && count <= blocks_to_boundary + && *(branch[0].p + count) == 0) { + count++; + } + return count; +} +static int +ext3_get_blocks_handle(handle_t *handle, struct inode *inode, sector_t iblock, + unsigned long *maxblocks, struct buffer_head *bh_result, + int create, int extend_disksize) +{ + int err = -EIO; + int offsets[4]; + Indirect chain[4]; + Indirect *partial = NULL; + unsigned long goal; + int left; + int blocks_to_boundary = 0; + int depth; + struct ext3_inode_info *ei = EXT3_I(inode); + unsigned long count = 0; + unsigned long first_block; + struct ext3_block_alloc_info *block_i; + + J_ASSERT(handle != NULL || create == 0); + + if (GBS_DEBUG) + printk("ext3_get_blocks_handle: maxblocks= %d, iblock = %d\n", *maxblocks, iblock); + down(&ei->truncate_sem); + depth = ext3_block_to_path(inode, iblock, offsets, &blocks_to_boundary); + if (depth == 0) + goto out; + partial = ext3_get_branch(inode, depth, + offsets, chain, &err); + /* Simplest case - block found */ + if 
(!err && !partial) { + first_block = chain[depth-1].key; + clear_buffer_new(bh_result); + count ++; + /* map more blocks */ + while (count < *maxblocks && count <= blocks_to_boundary + && (*(chain[depth-1].p+count) == first_block + count)) { + count ++; + } + up(&ei->truncate_sem); + goto got_it; + } + /* got mapped blocks or plain lookup or failed read of indirect block */ + if (!create || err == -EIO){ + up(&ei->truncate_sem); + goto out; + } + /* Okey, we need to do block allocation */ + /* lazy initialize the block allocation info here if necessary */ + if (S_ISREG(inode->i_mode) && (!ei->i_block_alloc_info)) { + ext3_init_block_alloc_info(inode); + } + + goal = ext3_find_goal(inode, iblock, chain, partial); + + /* number of missing meta data blocks need to allocate for this branch */ + left = chain + depth - partial; + count = ext3_count_blocks_to_allocate(partial, left, *maxblocks, blocks_to_boundary); + if (GBS_DEBUG) + printk("blocks to allocate: %d\n", count); + + first_block = ext3_new_blocks(handle, inode, goal, &count, &err); + if (GBS_DEBUG) + printk("blocks allocated(%d,%d,%d)\n", (int)iblock, (int)first_block, count); + + if (!err) + err = ext3_alloc_splice_branch(handle, inode, count, + first_block, offsets+(partial-chain), partial, left); + if (err) { + up(&ei->truncate_sem); + goto cleanup; + } + /* i_disksize growing is protected by truncate_sem + * don't forget to protect it if you're about to implement + * concurrent ext3_get_block() -bzzz */ + if (extend_disksize && inode->i_size > ei->i_disksize) + ei->i_disksize = inode->i_size; + /* + * update the most recently allocated logical & physical block + * in i_block_alloc_info, to assist find the proper goal block for next + * allocation + */ + block_i = ei->i_block_alloc_info; + if (block_i) { + block_i->last_alloc_logical_block = iblock + count - left; + block_i->last_alloc_physical_block = first_block + count - 1; + } + + inode->i_ctime = CURRENT_TIME_SEC; + ext3_mark_inode_dirty(handle, 
inode); + + up(&ei->truncate_sem); + if (err) + goto cleanup; + + set_buffer_new(bh_result); +got_it: + map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key)); + if (blocks_to_boundary == 0) + set_buffer_boundary(bh_result); + /* Clean up and exit */ + partial = chain+depth-1; /* the whole chain */ +cleanup: + while (partial > chain) { + BUFFER_TRACE(partial->bh, "call brelse"); + brelse(partial->bh); + partial--; + } + BUFFER_TRACE(bh_result, "returned"); +out: + *maxblocks = count; + return err; +} + +static int ext3_get_blocks(struct inode *inode, sector_t iblock, + unsigned long maxblocks, struct buffer_head *bh_result, + int create) { handle_t *handle = NULL; int ret; @@ -787,15 +1037,21 @@ static int ext3_get_block(struct inode * handle = ext3_journal_current_handle(); J_ASSERT(handle != 0); } - ret = ext3_get_block_handle(handle, inode, iblock, + ret = ext3_get_blocks_handle(handle, inode, iblock, &maxblocks, bh_result, create, 1); - return ret; + bh_result->b_size = (maxblocks << inode->i_blkbits); + return ret; +} + +static int ext3_get_block(struct inode *inode, sector_t iblock, + struct buffer_head *bh_result, int create) +{ + return ext3_get_blocks(inode, iblock, 1, bh_result, create); } #define DIO_CREDITS (EXT3_RESERVE_TRANS_BLOCKS + 32) -static int -ext3_direct_io_get_blocks(struct inode *inode, sector_t iblock, +static int ext3_direct_io_get_blocks(struct inode *inode, sector_t iblock, unsigned long max_blocks, struct buffer_head *bh_result, int create) { @@ -831,10 +1087,14 @@ ext3_direct_io_get_blocks(struct inode * } get_block: + if (GBS_DEBUG) + printk("Calling ext3_get_blocks_handle from dio: maxblocks= %d, iblock = %d", (int)max_blocks, (int)iblock); if (ret == 0) - ret = ext3_get_block_handle(handle, inode, iblock, + ret = ext3_get_blocks_handle(handle, inode, iblock, &max_blocks, bh_result, create, 0); - bh_result->b_size = (1 << inode->i_blkbits); + bh_result->b_size = (max_blocks << inode->i_blkbits); + if (GBS_DEBUG) + 
printk("ext3_get_blocks_handle returns to dio: maxblocks= %d, iblock = %d", (int)max_blocks, (int)iblock); return ret; } diff -puN fs/ext3/xattr.c~ext3_multiple_blocks_allocation fs/ext3/xattr.c --- linux-2.6.11/fs/ext3/xattr.c~ext3_multiple_blocks_allocation 2005-04-28 12:12:06.177474100 -0700 +++ linux-2.6.11-ming/fs/ext3/xattr.c 2005-04-28 12:12:06.194472114 -0700 @@ -796,7 +796,8 @@ inserted: EXT3_SB(sb)->s_es->s_first_data_block) + EXT3_I(inode)->i_block_group * EXT3_BLOCKS_PER_GROUP(sb); - int block = ext3_new_block(handle, inode, goal, &error); + unsigned long count = 1; + int block = ext3_new_blocks(handle, inode, goal, &count, &error); if (error) goto cleanup; ea_idebug(inode, "creating block %d", block); diff -puN include/linux/ext3_fs.h~ext3_multiple_blocks_allocation include/linux/ext3_fs.h --- linux-2.6.11/include/linux/ext3_fs.h~ext3_multiple_blocks_allocation 2005-04-28 12:12:06.180473750 -0700 +++ linux-2.6.11-ming/include/linux/ext3_fs.h 2005-04-28 12:12:06.196471880 -0700 @@ -714,7 +714,7 @@ struct dir_private_info { /* balloc.c */ extern int ext3_bg_has_super(struct super_block *sb, int group); extern unsigned long ext3_bg_num_gdb(struct super_block *sb, int group); -extern int ext3_new_block (handle_t *, struct inode *, unsigned long, int *); +extern int ext3_new_blocks (handle_t *, struct inode *, unsigned long, unsigned long*, int *); extern void ext3_free_blocks (handle_t *, struct inode *, unsigned long, unsigned long); extern void ext3_free_blocks_sb (handle_t *, struct super_block *, _ ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Ext2-devel] [RFC] Adding multiple block allocation 2005-04-28 19:14 ` [RFC] Adding multiple block allocation to current ext3 Mingming Cao @ 2005-04-29 13:52 ` Suparna Bhattacharya 2005-04-29 17:10 ` Mingming Cao ` (3 more replies) 2006-01-10 23:26 ` [PATCH 0/5] multiple block allocation to current ext3 Mingming Cao 1 sibling, 4 replies; 68+ messages in thread From: Suparna Bhattacharya @ 2005-04-29 13:52 UTC (permalink / raw) To: Mingming Cao Cc: Andrew Morton, Stephen C. Tweedie, linux-kernel, ext2-devel, linux-fsdevel

On Thu, Apr 28, 2005 at 12:14:24PM -0700, Mingming Cao wrote:
> Currently ext3_get_block()/ext3_new_block() only allocate one block at a
> time. To allocate multiple blocks, the caller, for example, ext3 direct
> IO routine, has to invoke ext3_get_block() many times. This is quite
> inefficient for sequential IO workload.
>
> The benefit of a real get_blocks() include
> 1) increase the possibility to get contiguous blocks, reduce possibility
> of fragmentation due to interleaved allocations from other threads.
> (should good for non reservation case)
> 2) Reduces CPU cycles spent in repeated get_block() calls
> 3) Batch meta data update and journaling in one short
> 4) Could possibly speed up future get_blocks() look up by cache the last
> mapped blocks in inode.

And here is the patch to make mpage_writepages use get_blocks() for multiple block lookup/allocation. It performs a radix-tree lookup for contiguous dirty pages and issues a single get_blocks() for the whole range. It maintains an mpageio structure to track intermediate mapping state, somewhat like the DIO code.

It does need some more testing, especially with block_size < PAGE_SIZE. The JFS workaround can be dropped if the JFS get_blocks fix from Dave Kleikamp is integrated. Review feedback would be welcome.

Mingming, let me know if you have a chance to try this out with your patch.
Regards Suparna -- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India diff -urp -X dontdiff linux-2.6.12-rc3/fs/buffer.c linux-2.6.12-rc3-getblocks/fs/buffer.c --- linux-2.6.12-rc3/fs/buffer.c 2005-04-21 05:33:15.000000000 +0530 +++ linux-2.6.12-rc3-getblocks/fs/buffer.c 2005-04-22 15:08:33.000000000 +0530 @@ -2514,53 +2514,10 @@ EXPORT_SYMBOL(nobh_commit_write); * that it tries to operate without attaching bufferheads to * the page. */ -int nobh_writepage(struct page *page, get_block_t *get_block, - struct writeback_control *wbc) +int nobh_writepage(struct page *page, get_blocks_t *get_blocks, + struct writeback_control *wbc, writepage_t bh_writepage_fn) { - struct inode * const inode = page->mapping->host; - loff_t i_size = i_size_read(inode); - const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT; - unsigned offset; - void *kaddr; - int ret; - - /* Is the page fully inside i_size? */ - if (page->index < end_index) - goto out; - - /* Is the page fully outside i_size? (truncate in progress) */ - offset = i_size & (PAGE_CACHE_SIZE-1); - if (page->index >= end_index+1 || !offset) { - /* - * The page may have dirty, unmapped buffers. For example, - * they may have been added in ext3_writepage(). Make them - * freeable here, so the page does not leak. - */ -#if 0 - /* Not really sure about this - do we need this ? */ - if (page->mapping->a_ops->invalidatepage) - page->mapping->a_ops->invalidatepage(page, offset); -#endif - unlock_page(page); - return 0; /* don't care */ - } - - /* - * The page straddles i_size. It must be zeroed out on each and every - * writepage invocation because it may be mmapped. "A file is mapped - * in multiples of the page size. For a file that is not a multiple of - * the page size, the remaining memory is zeroed when mapped, and - * writes to that region are not written out to the file." 
- */ - kaddr = kmap_atomic(page, KM_USER0); - memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset); - flush_dcache_page(page); - kunmap_atomic(kaddr, KM_USER0); -out: - ret = mpage_writepage(page, get_block, wbc); - if (ret == -EAGAIN) - ret = __block_write_full_page(inode, page, get_block, wbc); - return ret; + return mpage_writepage(page, get_blocks, wbc, bh_writepage_fn); } EXPORT_SYMBOL(nobh_writepage); diff -urp -X dontdiff linux-2.6.12-rc3/fs/ext2/inode.c linux-2.6.12-rc3-getblocks/fs/ext2/inode.c --- linux-2.6.12-rc3/fs/ext2/inode.c 2005-04-21 05:33:15.000000000 +0530 +++ linux-2.6.12-rc3-getblocks/fs/ext2/inode.c 2005-04-22 16:30:42.000000000 +0530 @@ -639,12 +639,6 @@ ext2_nobh_prepare_write(struct file *fil return nobh_prepare_write(page,from,to,ext2_get_block); } -static int ext2_nobh_writepage(struct page *page, - struct writeback_control *wbc) -{ - return nobh_writepage(page, ext2_get_block, wbc); -} - static sector_t ext2_bmap(struct address_space *mapping, sector_t block) { return generic_block_bmap(mapping,block,ext2_get_block); @@ -662,6 +656,12 @@ ext2_get_blocks(struct inode *inode, sec return ret; } +static int ext2_nobh_writepage(struct page *page, + struct writeback_control *wbc) +{ + return nobh_writepage(page, ext2_get_blocks, wbc, ext2_writepage); +} + static ssize_t ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, loff_t offset, unsigned long nr_segs) @@ -676,7 +676,8 @@ ext2_direct_IO(int rw, struct kiocb *ioc static int ext2_writepages(struct address_space *mapping, struct writeback_control *wbc) { - return mpage_writepages(mapping, wbc, ext2_get_block); + return __mpage_writepages(mapping, wbc, ext2_get_blocks, + ext2_writepage); } struct address_space_operations ext2_aops = { diff -urp -X dontdiff linux-2.6.12-rc3/fs/ext3/inode.c linux-2.6.12-rc3-getblocks/fs/ext3/inode.c --- linux-2.6.12-rc3/fs/ext3/inode.c 2005-04-21 05:33:15.000000000 +0530 +++ linux-2.6.12-rc3-getblocks/fs/ext3/inode.c 2005-04-22 
15:08:33.000000000 +0530 @@ -866,10 +866,10 @@ get_block: return ret; } -static int ext3_writepages_get_block(struct inode *inode, sector_t iblock, - struct buffer_head *bh, int create) +static int ext3_writepages_get_blocks(struct inode *inode, sector_t iblock, + unsigned long max_blocks, struct buffer_head *bh, int create) { - return ext3_direct_io_get_blocks(inode, iblock, 1, bh, create); + return ext3_direct_io_get_blocks(inode, iblock, max_blocks, bh, create); } /* @@ -1369,11 +1369,11 @@ ext3_writeback_writepages(struct address return ret; } - ret = __mpage_writepages(mapping, wbc, ext3_writepages_get_block, + ret = __mpage_writepages(mapping, wbc, ext3_writepages_get_blocks, ext3_writeback_writepage_helper); /* - * Need to reaquire the handle since ext3_writepages_get_block() + * Need to reaquire the handle since ext3_writepages_get_blocks() * can restart the handle */ handle = journal_current_handle(); @@ -1402,7 +1402,8 @@ static int ext3_writeback_writepage(stru } if (test_opt(inode->i_sb, NOBH)) - ret = nobh_writepage(page, ext3_get_block, wbc); + ret = nobh_writepage(page, ext3_writepages_get_blocks, wbc, + ext3_writeback_writepage_helper); else ret = block_write_full_page(page, ext3_get_block, wbc); diff -urp -X dontdiff linux-2.6.12-rc3/fs/ext3/super.c linux-2.6.12-rc3-getblocks/fs/ext3/super.c --- linux-2.6.12-rc3/fs/ext3/super.c 2005-04-21 05:33:15.000000000 +0530 +++ linux-2.6.12-rc3-getblocks/fs/ext3/super.c 2005-04-22 15:08:33.000000000 +0530 @@ -1321,6 +1321,7 @@ static int ext3_fill_super (struct super sbi->s_resgid = le16_to_cpu(es->s_def_resgid); set_opt(sbi->s_mount_opt, RESERVATION); + set_opt(sbi->s_mount_opt, NOBH); /* temp: set nobh default */ if (!parse_options ((char *) data, sb, &journal_inum, NULL, 0)) goto failed_mount; @@ -1567,6 +1568,7 @@ static int ext3_fill_super (struct super printk(KERN_ERR "EXT3-fs: Journal does not support " "requested data journaling mode\n"); goto failed_mount3; + set_opt(sbi->s_mount_opt, NOBH); /* temp: 
set nobh default */ } default: break; @@ -1584,6 +1586,7 @@ static int ext3_fill_super (struct super "its supported only with writeback mode\n"); clear_opt(sbi->s_mount_opt, NOBH); } + printk("NOBH option set\n"); } /* * The journal_load will have done any necessary log recovery, diff -urp -X dontdiff linux-2.6.12-rc3/fs/hfs/inode.c linux-2.6.12-rc3-getblocks/fs/hfs/inode.c --- linux-2.6.12-rc3/fs/hfs/inode.c 2005-04-21 05:33:15.000000000 +0530 +++ linux-2.6.12-rc3-getblocks/fs/hfs/inode.c 2005-04-22 15:08:33.000000000 +0530 @@ -124,7 +124,7 @@ static ssize_t hfs_direct_IO(int rw, str static int hfs_writepages(struct address_space *mapping, struct writeback_control *wbc) { - return mpage_writepages(mapping, wbc, hfs_get_block); + return mpage_writepages(mapping, wbc, hfs_get_blocks); } struct address_space_operations hfs_btree_aops = { diff -urp -X dontdiff linux-2.6.12-rc3/fs/hfsplus/inode.c linux-2.6.12-rc3-getblocks/fs/hfsplus/inode.c --- linux-2.6.12-rc3/fs/hfsplus/inode.c 2005-04-21 05:33:15.000000000 +0530 +++ linux-2.6.12-rc3-getblocks/fs/hfsplus/inode.c 2005-04-22 15:08:33.000000000 +0530 @@ -121,7 +121,7 @@ static ssize_t hfsplus_direct_IO(int rw, static int hfsplus_writepages(struct address_space *mapping, struct writeback_control *wbc) { - return mpage_writepages(mapping, wbc, hfsplus_get_block); + return mpage_writepages(mapping, wbc, hfsplus_get_blocks); } struct address_space_operations hfsplus_btree_aops = { diff -urp -X dontdiff linux-2.6.12-rc3/fs/jfs/inode.c linux-2.6.12-rc3-getblocks/fs/jfs/inode.c --- linux-2.6.12-rc3/fs/jfs/inode.c 2005-04-21 05:33:15.000000000 +0530 +++ linux-2.6.12-rc3-getblocks/fs/jfs/inode.c 2005-04-22 16:27:19.000000000 +0530 @@ -267,21 +267,41 @@ jfs_get_blocks(struct inode *ip, sector_ return rc; } +static int +jfs_mpage_get_blocks(struct inode *ip, sector_t lblock, unsigned long + max_blocks, struct buffer_head *bh_result, int create) +{ + /* + * fixme: temporary workaround: return one block at a time until + * we 
figure out why we see exposures with truncate on + * allocating multiple blocks in one shot. + */ + return jfs_get_blocks(ip, lblock, 1, bh_result, create); +} + static int jfs_get_block(struct inode *ip, sector_t lblock, struct buffer_head *bh_result, int create) { return jfs_get_blocks(ip, lblock, 1, bh_result, create); } +static int jfs_bh_writepage(struct page *page, + struct writeback_control *wbc) +{ + return block_write_full_page(page, jfs_get_block, wbc); +} + + static int jfs_writepage(struct page *page, struct writeback_control *wbc) { - return nobh_writepage(page, jfs_get_block, wbc); + return nobh_writepage(page, jfs_mpage_get_blocks, wbc, jfs_bh_writepage); } static int jfs_writepages(struct address_space *mapping, struct writeback_control *wbc) { - return mpage_writepages(mapping, wbc, jfs_get_block); + return __mpage_writepages(mapping, wbc, jfs_mpage_get_blocks, + jfs_bh_writepage); } static int jfs_readpage(struct file *file, struct page *page) diff -urp -X dontdiff linux-2.6.12-rc3/fs/mpage.c linux-2.6.12-rc3-getblocks/fs/mpage.c --- linux-2.6.12-rc3/fs/mpage.c 2005-04-21 05:33:15.000000000 +0530 +++ linux-2.6.12-rc3-getblocks/fs/mpage.c 2005-04-22 16:19:14.000000000 +0530 @@ -370,6 +370,67 @@ int mpage_readpage(struct page *page, ge } EXPORT_SYMBOL(mpage_readpage); +struct mpageio { + struct bio *bio; + struct buffer_head map_bh; + unsigned long block_in_file; + unsigned long final_block_in_request; + sector_t long block_in_bio; + int boundary; + sector_t boundary_block; + struct block_device *boundary_bdev; +}; + +/* + * Maps as many contiguous disk blocks as it can within the range of + * the request, and returns the total number of contiguous mapped + * blocks in the mpageio. 
+ */ +static unsigned long mpage_get_more_blocks(struct mpageio *mio, + struct inode *inode, get_blocks_t get_blocks) +{ + struct buffer_head map_bh = {.b_state = 0}; + unsigned long mio_nblocks = mio->map_bh.b_size >> inode->i_blkbits; + unsigned long first_unmapped = mio->block_in_file + mio_nblocks; + unsigned long next_contig_block = mio->map_bh.b_blocknr + mio_nblocks; + + while ((first_unmapped < mio->final_block_in_request) && + (mio->map_bh.b_size < PAGE_SIZE)) { + + if (get_blocks(inode, first_unmapped, + mio->final_block_in_request - first_unmapped, + &map_bh, 1)) + break; + if (mio_nblocks && ((map_bh.b_blocknr != next_contig_block) || + map_bh.b_bdev != mio->map_bh.b_bdev)) + break; + + if (buffer_new(&map_bh)) { + int i = 0; + for (; i < map_bh.b_size >> inode->i_blkbits; i++) + unmap_underlying_metadata(map_bh.b_bdev, + map_bh.b_blocknr + i); + } + + if (buffer_boundary(&map_bh)) { + mio->boundary = 1; + mio->boundary_block = map_bh.b_blocknr; + mio->boundary_bdev = map_bh.b_bdev; + } + if (mio_nblocks == 0) { + mio->map_bh.b_bdev = map_bh.b_bdev; + mio->map_bh.b_blocknr = map_bh.b_blocknr; + } + + mio_nblocks += map_bh.b_size >> inode->i_blkbits; + first_unmapped = mio->block_in_file + mio_nblocks; + next_contig_block = mio->map_bh.b_blocknr + mio_nblocks; + mio->map_bh.b_size += map_bh.b_size; + } + + return mio_nblocks; +} + /* * Writing is not so simple. * @@ -386,9 +447,9 @@ EXPORT_SYMBOL(mpage_readpage); * written, so it can intelligently allocate a suitably-sized BIO. For now, * just allocate full-size (16-page) BIOs. 
 */
-static struct bio *
-__mpage_writepage(struct bio *bio, struct page *page, get_block_t get_block,
-	sector_t *last_block_in_bio, int *ret, struct writeback_control *wbc,
+static int
+__mpage_writepage(struct mpageio *mio, struct page *page,
+	get_blocks_t get_blocks, struct writeback_control *wbc,
 	writepage_t writepage_fn)
 {
 	struct address_space *mapping = page->mapping;
@@ -396,9 +457,8 @@ __mpage_writepage(struct bio *bio, struc
 	const unsigned blkbits = inode->i_blkbits;
 	unsigned long end_index;
 	const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
-	sector_t last_block;
+	sector_t last_block, blocks_to_skip;
 	sector_t block_in_file;
-	sector_t blocks[MAX_BUF_PER_PAGE];
 	unsigned page_block;
 	unsigned first_unmapped = blocks_per_page;
 	struct block_device *bdev = NULL;
@@ -406,8 +466,10 @@ __mpage_writepage(struct bio *bio, struc
 	sector_t boundary_block = 0;
 	struct block_device *boundary_bdev = NULL;
 	int length;
-	struct buffer_head map_bh;
 	loff_t i_size = i_size_read(inode);
+	struct buffer_head *map_bh = &mio->map_bh;
+	struct bio *bio = mio->bio;
+	int ret = 0;
 
 	if (page_has_buffers(page)) {
 		struct buffer_head *head = page_buffers(page);
@@ -435,10 +497,13 @@ __mpage_writepage(struct bio *bio, struc
 			if (!buffer_dirty(bh) || !buffer_uptodate(bh))
 				goto confused;
 			if (page_block) {
-				if (bh->b_blocknr != blocks[page_block-1] + 1)
+				if (bh->b_blocknr != map_bh->b_blocknr
+						+ page_block)
 					goto confused;
+			} else {
+				map_bh->b_blocknr = bh->b_blocknr;
+				map_bh->b_size = PAGE_SIZE;
 			}
-			blocks[page_block++] = bh->b_blocknr;
 			boundary = buffer_boundary(bh);
 			if (boundary) {
 				boundary_block = bh->b_blocknr;
@@ -465,33 +530,30 @@ __mpage_writepage(struct bio *bio, struc
 	BUG_ON(!PageUptodate(page));
 	block_in_file = page->index << (PAGE_CACHE_SHIFT - blkbits);
 	last_block = (i_size - 1) >> blkbits;
-	map_bh.b_page = page;
-	for (page_block = 0; page_block < blocks_per_page; ) {
-
-		map_bh.b_state = 0;
-		if (get_block(inode, block_in_file, &map_bh, 1))
-			goto confused;
-		if (buffer_new(&map_bh))
-			unmap_underlying_metadata(map_bh.b_bdev,
-						map_bh.b_blocknr);
-		if (buffer_boundary(&map_bh)) {
-			boundary_block = map_bh.b_blocknr;
-			boundary_bdev = map_bh.b_bdev;
-		}
-		if (page_block) {
-			if (map_bh.b_blocknr != blocks[page_block-1] + 1)
-				goto confused;
-		}
-		blocks[page_block++] = map_bh.b_blocknr;
-		boundary = buffer_boundary(&map_bh);
-		bdev = map_bh.b_bdev;
-		if (block_in_file == last_block)
-			break;
-		block_in_file++;
+	blocks_to_skip = block_in_file - mio->block_in_file;
+	mio->block_in_file = block_in_file;
+	if (blocks_to_skip < (map_bh->b_size >> blkbits)) {
+		map_bh->b_blocknr += blocks_to_skip;
+		map_bh->b_size -= blocks_to_skip << blkbits;
+	} else {
+		map_bh->b_state = 0;
+		map_bh->b_size = 0;
+		if (mio->final_block_in_request > last_block)
+			mio->final_block_in_request = last_block;
+		mpage_get_more_blocks(mio, inode, get_blocks);
 	}
-	BUG_ON(page_block == 0);
+	if (map_bh->b_size < PAGE_SIZE)
+		goto confused;
 
-	first_unmapped = page_block;
+	if (mio->boundary && (mio->boundary_block < map_bh->b_blocknr +
+				blocks_per_page)) {
+		boundary = 1;
+		boundary_block = mio->boundary_block;
+		boundary_bdev = mio->boundary_bdev;
+	}
+
+	bdev = map_bh->b_bdev;
+	first_unmapped = blocks_per_page;
 
 page_is_mapped:
 	end_index = i_size >> PAGE_CACHE_SHIFT;
@@ -518,12 +580,16 @@ page_is_mapped:
 
 	/*
 	 * This page will go to BIO.  Do we need to send this BIO off first?
 	 */
-	if (bio && *last_block_in_bio != blocks[0] - 1)
+	if (bio && mio->block_in_bio != map_bh->b_blocknr - 1)
 		bio = mpage_bio_submit(WRITE, bio);
 
 alloc_new:
 	if (bio == NULL) {
-		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
+		/*
+		 * Fixme: bio size can be limited to final_block - block, or
+		 * even mio->map_bh.b_size
+		 */
+		bio = mpage_alloc(bdev, map_bh->b_blocknr << (blkbits - 9),
 				bio_get_nr_vecs(bdev), GFP_NOFS|__GFP_HIGH);
 		if (bio == NULL)
 			goto confused;
@@ -539,6 +605,9 @@ alloc_new:
 		bio = mpage_bio_submit(WRITE, bio);
 		goto alloc_new;
 	}
+	map_bh->b_blocknr += blocks_per_page;
+	map_bh->b_size -= PAGE_SIZE;
+	mio->block_in_file += blocks_per_page;
 
 	/*
 	 * OK, we have our BIO, so we can now mark the buffers clean.  Make
@@ -575,7 +644,8 @@ alloc_new:
 					boundary_block, 1 << blkbits);
 		}
 	} else {
-		*last_block_in_bio = blocks[blocks_per_page - 1];
+		/* we can pack more pages into the bio, don't submit yet */
+		mio->block_in_bio = map_bh->b_blocknr - 1;
 	}
 
 	goto out;
@@ -584,22 +654,23 @@ confused:
 		bio = mpage_bio_submit(WRITE, bio);
 
 	if (writepage_fn) {
-		*ret = (*writepage_fn)(page, wbc);
+		ret = (*writepage_fn)(page, wbc);
 	} else {
-		*ret = -EAGAIN;
+		ret = -EAGAIN;
 		goto out;
 	}
 
 	/*
 	 * The caller has a ref on the inode, so *mapping is stable
 	 */
-	if (*ret) {
-		if (*ret == -ENOSPC)
+	if (ret) {
+		if (ret == -ENOSPC)
 			set_bit(AS_ENOSPC, &mapping->flags);
 		else
 			set_bit(AS_EIO, &mapping->flags);
 	}
 out:
-	return bio;
+	mio->bio = bio;
+	return ret;
 }
 
 /**
@@ -625,20 +696,21 @@ out:
  */
 int
 mpage_writepages(struct address_space *mapping,
-		struct writeback_control *wbc, get_block_t get_block)
+		struct writeback_control *wbc, get_blocks_t get_blocks)
 {
-	return __mpage_writepages(mapping, wbc, get_block,
+	return __mpage_writepages(mapping, wbc, get_blocks,
 			mapping->a_ops->writepage);
 }
 
 int
 __mpage_writepages(struct address_space *mapping,
-		struct writeback_control *wbc, get_block_t get_block,
+		struct writeback_control *wbc, get_blocks_t get_blocks,
 		writepage_t writepage_fn)
 {
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	struct inode *inode = mapping->host;
+	const unsigned blkbits = inode->i_blkbits;
 	struct bio *bio = NULL;
-	sector_t last_block_in_bio = 0;
 	int ret = 0;
 	int done = 0;
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
@@ -648,6 +720,9 @@ __mpage_writepages
 	pgoff_t end = -1;		/* Inclusive */
 	int scanned = 0;
 	int is_range = 0;
+	struct mpageio mio = {
+		.bio = NULL
+	};
 
 	if (wbc->nonblocking && bdi_write_congested(bdi)) {
 		wbc->encountered_congestion = 1;
@@ -655,7 +730,7 @@ __mpage_writepages(struct address_space
 	}
 
 	writepage = NULL;
-	if (get_block == NULL)
+	if (get_blocks == NULL)
 		writepage = mapping->a_ops->writepage;
 
 	pagevec_init(&pvec, 0);
@@ -672,12 +747,15 @@ __mpage_writepages(struct address_space
 		scanned = 1;
 	}
 retry:
+	down_read(&inode->i_alloc_sem);
 	while (!done && (index <= end) &&
-			(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
-			PAGECACHE_TAG_DIRTY,
+			(nr_pages = pagevec_contig_lookup_tag(&pvec, mapping,
+			&index, PAGECACHE_TAG_DIRTY,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
 		unsigned i;
 
+		mio.final_block_in_request = min(index, end) <<
+			(PAGE_CACHE_SHIFT - blkbits);
 		scanned = 1;
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
@@ -702,7 +780,7 @@ retry:
 				unlock_page(page);
 				continue;
 			}
-
+
 			if (wbc->sync_mode != WB_SYNC_NONE)
 				wait_on_page_writeback(page);
 
@@ -723,9 +801,9 @@ retry:
 							&mapping->flags);
 				}
 			} else {
-				bio = __mpage_writepage(bio, page, get_block,
-						&last_block_in_bio, &ret, wbc,
-						writepage_fn);
+				ret = __mpage_writepage(&mio, page, get_blocks,
+						wbc, writepage_fn);
+				bio = mio.bio;
 			}
 			if (ret || (--(wbc->nr_to_write) <= 0))
 				done = 1;
@@ -737,6 +815,9 @@ retry:
 		pagevec_release(&pvec);
 		cond_resched();
 	}
+
+	up_read(&inode->i_alloc_sem);
+
 	if (!scanned && !done) {
 		/*
 		 * We hit the last page and there is more work to be done: wrap
@@ -755,17 +836,23 @@ retry:
 EXPORT_SYMBOL(mpage_writepages);
 EXPORT_SYMBOL(__mpage_writepages);
 
-int mpage_writepage(struct page *page, get_block_t get_block,
-	struct writeback_control *wbc)
+int mpage_writepage(struct page *page, get_blocks_t get_blocks,
+	struct writeback_control *wbc, writepage_t writepage_fn)
 {
 	int ret = 0;
-	struct bio *bio;
-	sector_t last_block_in_bio = 0;
-
-	bio = __mpage_writepage(NULL, page, get_block,
-			&last_block_in_bio, &ret, wbc, NULL);
-	if (bio)
-		mpage_bio_submit(WRITE, bio);
+	struct address_space *mapping = page->mapping;
+	struct inode *inode = mapping->host;
+	const unsigned blkbits = inode->i_blkbits;
+	struct mpageio mio = {
+		.final_block_in_request = (page->index + 1) << (PAGE_CACHE_SHIFT
+			- blkbits)
+	};
+
+	dump_stack();
+	ret = __mpage_writepage(&mio, page, get_blocks,
+			wbc, writepage_fn);
+	if (mio.bio)
+		mpage_bio_submit(WRITE, mio.bio);
 
 	return ret;
 }
diff -urp -X dontdiff linux-2.6.12-rc3/include/linux/buffer_head.h linux-2.6.12-rc3-getblocks/include/linux/buffer_head.h
--- linux-2.6.12-rc3/include/linux/buffer_head.h	2005-04-21 05:33:16.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/include/linux/buffer_head.h	2005-04-22 15:08:33.000000000 +0530
@@ -203,8 +203,8 @@ int file_fsync(struct file *, struct den
 int nobh_prepare_write(struct page*, unsigned, unsigned, get_block_t*);
 int nobh_commit_write(struct file *, struct page *, unsigned, unsigned);
 int nobh_truncate_page(struct address_space *, loff_t);
-int nobh_writepage(struct page *page, get_block_t *get_block,
-			struct writeback_control *wbc);
+int nobh_writepage(struct page *page, get_blocks_t *get_blocks,
+			struct writeback_control *wbc, writepage_t bh_writepage);
 
 /*
diff -urp -X dontdiff linux-2.6.12-rc3/include/linux/fs.h linux-2.6.12-rc3-getblocks/include/linux/fs.h
--- linux-2.6.12-rc3/include/linux/fs.h	2005-04-21 05:33:16.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/include/linux/fs.h	2005-04-22 15:08:33.000000000 +0530
@@ -304,6 +304,8 @@ struct address_space;
 struct writeback_control;
 struct kiocb;
+typedef int (writepage_t)(struct page *page, struct writeback_control *wbc);
+
 struct address_space_operations {
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
 	int (*readpage)(struct file *, struct page *);
diff -urp -X dontdiff linux-2.6.12-rc3/include/linux/mpage.h linux-2.6.12-rc3-getblocks/include/linux/mpage.h
--- linux-2.6.12-rc3/include/linux/mpage.h	2005-04-21 05:33:16.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/include/linux/mpage.h	2005-04-22 15:08:33.000000000 +0530
@@ -11,17 +11,16 @@
  */
 
 struct writeback_control;
-typedef int (writepage_t)(struct page *page, struct writeback_control *wbc);
 
 int mpage_readpages(struct address_space *mapping, struct list_head *pages,
 				unsigned nr_pages, get_block_t get_block);
 int mpage_readpage(struct page *page, get_block_t get_block);
 int mpage_writepages(struct address_space *mapping,
-		struct writeback_control *wbc, get_block_t get_block);
-int mpage_writepage(struct page *page, get_block_t *get_block,
-	struct writeback_control *wbc);
+		struct writeback_control *wbc, get_blocks_t get_blocks);
+int mpage_writepage(struct page *page, get_blocks_t *get_blocks,
+	struct writeback_control *wbc, writepage_t writepage);
 int __mpage_writepages(struct address_space *mapping,
-		struct writeback_control *wbc, get_block_t get_block,
+		struct writeback_control *wbc, get_blocks_t get_blocks,
 		writepage_t writepage);
 
 static inline int
diff -urp -X dontdiff linux-2.6.12-rc3/include/linux/pagemap.h linux-2.6.12-rc3-getblocks/include/linux/pagemap.h
--- linux-2.6.12-rc3/include/linux/pagemap.h	2005-04-21 05:33:16.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/include/linux/pagemap.h	2005-04-22 15:08:33.000000000 +0530
@@ -73,7 +73,8 @@ extern struct page * find_or_create_page
 unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
 			unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
-			int tag, unsigned int nr_pages, struct page **pages);
+			int tag, unsigned int nr_pages, struct page **pages,
+			int contig);
 
 /*
  * Returns locked page at given index in given cache, creating it if needed.
diff -urp -X dontdiff linux-2.6.12-rc3/include/linux/pagevec.h linux-2.6.12-rc3-getblocks/include/linux/pagevec.h
--- linux-2.6.12-rc3/include/linux/pagevec.h	2005-04-21 05:33:16.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/include/linux/pagevec.h	2005-04-22 15:08:33.000000000 +0530
@@ -28,6 +28,9 @@ unsigned pagevec_lookup(struct pagevec *
 unsigned pagevec_lookup_tag(struct pagevec *pvec,
 		struct address_space *mapping, pgoff_t *index, int tag,
 		unsigned nr_pages);
+unsigned pagevec_contig_lookup_tag(struct pagevec *pvec,
+		struct address_space *mapping, pgoff_t *index, int tag,
+		unsigned nr_pages);
 
 static inline void pagevec_init(struct pagevec *pvec, int cold)
 {
diff -urp -X dontdiff linux-2.6.12-rc3/include/linux/radix-tree.h linux-2.6.12-rc3-getblocks/include/linux/radix-tree.h
--- linux-2.6.12-rc3/include/linux/radix-tree.h	2005-04-21 05:33:16.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/include/linux/radix-tree.h	2005-04-22 15:08:33.000000000 +0530
@@ -59,8 +59,18 @@ void *radix_tree_tag_clear(struct radix_
 int radix_tree_tag_get(struct radix_tree_root *root,
 			unsigned long index, int tag);
 unsigned int
-radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
-		unsigned long first_index, unsigned int max_items, int tag);
+__radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
+		unsigned long first_index, unsigned int max_items, int tag,
+		int contig);
+
+static inline unsigned int radix_tree_gang_lookup_tag(struct radix_tree_root
+		*root, void **results, unsigned long first_index,
+		unsigned int max_items, int tag)
+{
+	return __radix_tree_gang_lookup_tag(root, results, first_index,
+			max_items, tag, 0);
+}
+
 int radix_tree_tagged(struct radix_tree_root *root, int tag);
 
 static inline void radix_tree_preload_end(void)
diff -urp -X dontdiff linux-2.6.12-rc3/lib/radix-tree.c linux-2.6.12-rc3-getblocks/lib/radix-tree.c
--- linux-2.6.12-rc3/lib/radix-tree.c	2005-04-21 05:33:16.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/lib/radix-tree.c	2005-04-22 16:34:29.000000000 +0530
@@ -557,12 +557,13 @@ EXPORT_SYMBOL(radix_tree_gang_lookup);
  */
 static unsigned int
 __lookup_tag(struct radix_tree_root *root, void **results, unsigned long index,
-	unsigned int max_items, unsigned long *next_index, int tag)
+	unsigned int max_items, unsigned long *next_index, int tag, int contig)
 {
 	unsigned int nr_found = 0;
 	unsigned int shift;
 	unsigned int height = root->height;
 	struct radix_tree_node *slot;
+	unsigned long cindex = (contig && (*next_index)) ? *next_index : -1;
 
 	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
 	slot = root->rnode;
@@ -575,6 +576,11 @@ __lookup_tag(struct radix_tree_root *roo
 				BUG_ON(slot->slots[i] == NULL);
 				break;
 			}
+			if (contig && index >= cindex) {
+				/* break in contiguity */
+				index = 0;
+				goto out;
+			}
 			index &= ~((1UL << shift) - 1);
 			index += 1UL << shift;
 			if (index == 0)
@@ -593,6 +599,10 @@ __lookup_tag(struct radix_tree_root *roo
 					results[nr_found++] = slot->slots[j];
 					if (nr_found == max_items)
 						goto out;
+				} else if (contig && nr_found) {
+					/* break in contiguity */
+					index = 0;
+					goto out;
 				}
 			}
 		}
@@ -618,29 +628,32 @@ out:
  *	returns the number of items which were placed at *@results.
  */
 unsigned int
-radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
-		unsigned long first_index, unsigned int max_items, int tag)
+__radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
+		unsigned long first_index, unsigned int max_items, int tag,
+		int contig)
 {
 	const unsigned long max_index = radix_tree_maxindex(root->height);
 	unsigned long cur_index = first_index;
+	unsigned long next_index = 0;	/* Index of next contiguous search */
 	unsigned int ret = 0;
 
 	while (ret < max_items) {
 		unsigned int nr_found;
-		unsigned long next_index;	/* Index of next search */
 
 		if (cur_index > max_index)
 			break;
 		nr_found = __lookup_tag(root, results + ret, cur_index,
-					max_items - ret, &next_index, tag);
+				max_items - ret, &next_index, tag, contig);
 		ret += nr_found;
 		if (next_index == 0)
 			break;
 		cur_index = next_index;
+		if (!nr_found)
+			next_index = 0;
 	}
 	return ret;
 }
-EXPORT_SYMBOL(radix_tree_gang_lookup_tag);
+EXPORT_SYMBOL(__radix_tree_gang_lookup_tag);
 
 /**
  *	radix_tree_delete    -    delete an item from a radix tree
diff -urp -X dontdiff linux-2.6.12-rc3/mm/filemap.c linux-2.6.12-rc3-getblocks/mm/filemap.c
--- linux-2.6.12-rc3/mm/filemap.c	2005-04-21 05:33:16.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/mm/filemap.c	2005-04-22 16:20:30.000000000 +0530
@@ -635,16 +635,19 @@ unsigned find_get_pages(struct address_s
 /*
  * Like find_get_pages, except we only return pages which are tagged with
  * `tag'.   We update *index to index the next page for the traversal.
+ * If 'contig' is 1, then we return only pages which are contiguous in the
+ * file.
  */
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
-			int tag, unsigned int nr_pages, struct page **pages)
+			int tag, unsigned int nr_pages, struct page **pages,
+			int contig)
 {
 	unsigned int i;
 	unsigned int ret;
 
 	read_lock_irq(&mapping->tree_lock);
-	ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
-				(void **)pages, *index, nr_pages, tag);
+	ret = __radix_tree_gang_lookup_tag(&mapping->page_tree,
+			(void **)pages, *index, nr_pages, tag, contig);
 	for (i = 0; i < ret; i++)
 		page_cache_get(pages[i]);
 	if (ret)
diff -urp -X dontdiff linux-2.6.12-rc3/mm/swap.c linux-2.6.12-rc3-getblocks/mm/swap.c
--- linux-2.6.12-rc3/mm/swap.c	2005-04-21 05:33:16.000000000 +0530
+++ linux-2.6.12-rc3-getblocks/mm/swap.c	2005-04-22 15:08:33.000000000 +0530
@@ -384,7 +384,16 @@ unsigned pagevec_lookup_tag(struct pagev
 		pgoff_t *index, int tag, unsigned nr_pages)
 {
 	pvec->nr = find_get_pages_tag(mapping, index, tag,
-					nr_pages, pvec->pages);
+					nr_pages, pvec->pages, 0);
+	return pagevec_count(pvec);
+}
+
+unsigned int
+pagevec_contig_lookup_tag(struct pagevec *pvec, struct address_space *mapping,
+		pgoff_t *index, int tag, unsigned nr_pages)
+{
+	pvec->nr = find_get_pages_tag(mapping, index, tag,
+					nr_pages, pvec->pages, 1);
 	return pagevec_count(pvec);
 }

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [Ext2-devel] [RFC] Adding multiple block allocation
  2005-04-29 13:52   ` [Ext2-devel] [RFC] Adding multiple block allocation Suparna Bhattacharya
@ 2005-04-29 17:10     ` Mingming Cao
  2005-04-29 19:42       ` Mingming Cao
  2005-04-30 16:00       ` Suparna Bhattacharya
  2005-04-29 18:45     ` Badari Pulavarty
                        ` (2 subsequent siblings)
  3 siblings, 2 replies; 68+ messages in thread
From: Mingming Cao @ 2005-04-29 17:10 UTC (permalink / raw)
  To: suparna
  Cc: Andrew Morton, Stephen C. Tweedie, linux-kernel, ext2-devel,
	linux-fsdevel

On Fri, 2005-04-29 at 19:22 +0530, Suparna Bhattacharya wrote:
> On Thu, Apr 28, 2005 at 12:14:24PM -0700, Mingming Cao wrote:
> > Currently ext3_get_block()/ext3_new_block() only allocate one block at a
> > time. To allocate multiple blocks, the caller, for example, ext3 direct
> > IO routine, has to invoke ext3_get_block() many times. This is quite
> > inefficient for sequential IO workload.
> >
> > The benefit of a real get_blocks() include
> > 1) increase the possibility to get contiguous blocks, reduce possibility
> > of fragmentation due to interleaved allocations from other threads.
> > (should good for non reservation case)
> > 2) Reduces CPU cycles spent in repeated get_block() calls
> > 3) Batch meta data update and journaling in one short
> > 4) Could possibly speed up future get_blocks() look up by cache the last
> > mapped blocks in inode.
> >
>
> And here is the patch to make mpage_writepages use get_blocks() for
> multiple block lookup/allocation. It performs a radix-tree contiguous
> pages lookup, and issues a get_blocks for the range together. It maintains
> an mpageio structure to track intermediate mapping state, somewhat
> like the DIO code.
>
> It does need some more testing, especially block_size < PAGE_SIZE.
> The JFS workaround can be dropped if the JFS get_blocks fix from
> Dave Kleikamp is integrated.
>
> Review feedback would be welcome.
>
> Mingming,
> Let me know if you have a chance to try this out with your patch.
Sure, Suparna, I will try your patch soon!

In my patch, I have modified the ext3 direct I/O code to make use of
ext3_get_blocks(). Tested with a simple file write with O_DIRECT; it
seems to work fine. Allocating blocks for a 120k file now invokes
ext3_get_blocks() only 4 times (perfect would be 1, but before it was
30 calls to ext3_get_block()). Of the 4 calls to ext3_get_blocks(), 2
happen because we reach a metadata block boundary (direct -> indirect),
and another 2 because we reach the end of the reservation window. The
latter 2 could be avoided by extending the reservation window before
calling ext3_new_blocks() when the window size is less than the number
of blocks to allocate.

But if we try to allocate blocks in a hole (with direct I/O), blocks
are still allocated one by one. I am looking at that right now.

Thanks,
Mingming

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [Ext2-devel] [RFC] Adding multiple block allocation
  2005-04-29 17:10         ` Mingming Cao
@ 2005-04-29 19:42           ` Mingming Cao
  2005-04-29 20:57             ` Andrew Morton
  0 siblings, 1 reply; 68+ messages in thread
From: Mingming Cao @ 2005-04-29 19:42 UTC (permalink / raw)
  To: suparna
  Cc: Andrew Morton, Stephen C. Tweedie, linux-kernel, ext2-devel,
	linux-fsdevel

On Fri, 2005-04-29 at 10:10 -0700, Mingming Cao wrote:
> But if it try to allocating blocks in the hole (with direct IO), blocks
> are allocated one by one. I am looking at it right now.
>

Hi Andrew, Badari,

If we do a direct write (block allocation) to a hole, I found that the
"create" flag passed to ext3_direct_io_get_blocks() is 0 when we are
trying to _write_ to a file hole. Is this expected?

This is what happens on mainline 2.6.12-rc2 (and with my patch). To
simplify, here is the problem description on mainline: if we do a
30-block write to a new file at offset 800k, fine, the create flag is 1
throughout. But if we then seek back to offset 400k and write another
30 blocks, the create flag is 0:

-bash-2.05b# mount -t ext3 /dev/ubdc /mnt/ext3
-bash-2.05b# cd /mnt/ext3
-bash-2.05b# touch a
-bash-2.05b# /root/filetst -o 819200 -b 122880 -c 1 -w -d -f a
Calling ext3_get_block_handle from ext3_direct_io_get_blocks: maxblocks = 30, iblock = 200, create = 1
Calling ext3_get_block_handle from ext3_direct_io_get_blocks: maxblocks = 29, iblock = 201, create = 1
Calling ext3_get_block_handle from ext3_direct_io_get_blocks: maxblocks = 28, iblock = 202, create = 1
Calling ext3_get_block_handle from ext3_direct_io_get_blocks: maxblocks = 27, iblock = 203, create = 1
Calling ext3_get_block_handle from ext3_direct_io_get_blocks: maxblocks = 26, iblock = 204, create = 1
...................
Calling ext3_get_block_handle from ext3_direct_io_get_blocks: maxblocks = 5, iblock = 225, create = 1
Calling ext3_get_block_handle from ext3_direct_io_get_blocks: maxblocks = 4, iblock = 226, create = 1
Calling ext3_get_block_handle from ext3_direct_io_get_blocks: maxblocks = 3, iblock = 227, create = 1
Calling ext3_get_block_handle from ext3_direct_io_get_blocks: maxblocks = 2, iblock = 228, create = 1
Calling ext3_get_block_handle from ext3_direct_io_get_blocks: maxblocks = 1, iblock = 229, create = 1
-bash-2.05b# /root/filetst -o 409600 -b 122880 -c 1 -w -d -f a
Calling ext3_get_block_handle from ext3_direct_io_get_blocks: maxblocks = 30, iblock = 100, create = 0

Because the create flag is 0, ext3_get_block() does no block allocation
and returns immediately after the lookup fails. ext3_get_block_handle()
is then called from some path other than ext3_direct_io_get_blocks()
(I am not sure where) to allocate the desired 30 blocks. (Thus, when the
ext3_get_blocks patch is applied, ext3_get_blocks() is not called.)

Could you clarify?

Thanks,
Mingming

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [Ext2-devel] [RFC] Adding multiple block allocation
  2005-04-29 19:42           ` Mingming Cao
@ 2005-04-29 20:57             ` Andrew Morton
  2005-04-29 21:12               ` Mingming Cao
  0 siblings, 1 reply; 68+ messages in thread
From: Andrew Morton @ 2005-04-29 20:57 UTC (permalink / raw)
  To: cmm; +Cc: suparna, sct, linux-kernel, ext2-devel, linux-fsdevel

Mingming Cao <cmm@us.ibm.com> wrote:
>
> If we do direct write(block allocation) to a hole, I found that the
> "create" flag passed to ext3_direct_io_get_blocks() is 0 if we are
> trying to _write_ to a file hole. Is this expected?

Nope.  The code in get_more_blocks() is pretty explicit.

> But if it try to allocating blocks in the hole (with direct IO), blocks
> are allocated one by one. I am looking at it right now.
>

Please see the comment over get_more_blocks().

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [Ext2-devel] [RFC] Adding multiple block allocation
  2005-04-29 20:57             ` Andrew Morton
@ 2005-04-29 21:12               ` Mingming Cao
  2005-04-29 21:34                 ` Andrew Morton
  0 siblings, 1 reply; 68+ messages in thread
From: Mingming Cao @ 2005-04-29 21:12 UTC (permalink / raw)
  To: Andrew Morton, pbadari
  Cc: suparna, sct, linux-kernel, ext2-devel, linux-fsdevel

On Fri, 2005-04-29 at 13:57 -0700, Andrew Morton wrote:
> Mingming Cao <cmm@us.ibm.com> wrote:
> >
> > If we do direct write(block allocation) to a hole, I found that the
> > "create" flag passed to ext3_direct_io_get_blocks() is 0 if we are
> > trying to _write_ to a file hole. Is this expected?
>
> Nope.  The code in get_more_blocks() is pretty explicit.
>
> > But if it try to allocating blocks in the hole (with direct IO), blocks
> > are allocated one by one. I am looking at it right now.
> >
>
> Please see the comment over get_more_blocks().
>

I went to look at get_more_blocks(): the create flag is cleared if the
file offset is less than i_size (see the code below).

static int get_more_blocks(struct dio *dio)
{
	..........
	.........
	create = dio->rw == WRITE;
	if (dio->lock_type == DIO_LOCKING) {
		if (dio->block_in_file < (i_size_read(dio->inode) >>
						dio->blkbits))
			create = 0;
	} else if (dio->lock_type == DIO_NO_LOCKING) {
		create = 0;
	}
	ret = (*dio->get_blocks)(dio->inode, fs_startblk, fs_count,
					map_bh, create);

When we direct-write to a hole, the create flag is cleared, so the
underlying filesystem's get_blocks() function only does a block lookup
(the lookup fails, and no blocks are allocated):

	do_direct_IO -> get_more_blocks -> ext3_direct_io_get_blocks

do_direct_IO() handles the write-to-hole situation by simply returning
the error ENOTBLK. direct_io_worker() detects this error and simply
reports the number of bytes written as 0. The real write is then handled
by buffered I/O: in __generic_file_aio_write_nolock(),
generic_file_buffered_write() is called when the byte count returned by
generic_file_direct_write() is 0.

Could you confirm the observation above? I hope I understand this right:
right now a direct I/O write to a hole is actually handled by buffered
I/O. If so, there must be some valid reason for it that I cannot see now.

Thanks,
Mingming

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [Ext2-devel] [RFC] Adding multiple block allocation
  2005-04-29 21:12               ` Mingming Cao
@ 2005-04-29 21:34                 ` Andrew Morton
  0 siblings, 0 replies; 68+ messages in thread
From: Andrew Morton @ 2005-04-29 21:34 UTC (permalink / raw)
  To: cmm; +Cc: pbadari, suparna, sct, linux-kernel, ext2-devel, linux-fsdevel

Mingming Cao <cmm@us.ibm.com> wrote:
>
> right now direct io write to a hole is actually handled by buffered IO.
> If so, there must be some valid reason that I could not see now.

Oh yes, that's right.  It's all to do with the fixes which went in a year
ago to prevent concurrent readers from peeking at uninitialised disk
blocks.

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [Ext2-devel] [RFC] Adding multiple block allocation
  2005-04-29 17:10         ` Mingming Cao
  2005-04-29 19:42           ` Mingming Cao
@ 2005-04-30 16:00           ` Suparna Bhattacharya
  1 sibling, 0 replies; 68+ messages in thread
From: Suparna Bhattacharya @ 2005-04-30 16:00 UTC (permalink / raw)
  To: Mingming Cao
  Cc: Andrew Morton, Stephen C. Tweedie, linux-kernel, ext2-devel,
	linux-fsdevel

On Fri, Apr 29, 2005 at 10:10:08AM -0700, Mingming Cao wrote:
> On Fri, 2005-04-29 at 19:22 +0530, Suparna Bhattacharya wrote:
> > On Thu, Apr 28, 2005 at 12:14:24PM -0700, Mingming Cao wrote:
> > > Currently ext3_get_block()/ext3_new_block() only allocate one block at a
> > > time. To allocate multiple blocks, the caller, for example, ext3 direct
> > > IO routine, has to invoke ext3_get_block() many times. This is quite
> > > inefficient for sequential IO workload.
> > >
> > > The benefit of a real get_blocks() include
> > > 1) increase the possibility to get contiguous blocks, reduce possibility
> > > of fragmentation due to interleaved allocations from other threads.
> > > (should good for non reservation case)
> > > 2) Reduces CPU cycles spent in repeated get_block() calls
> > > 3) Batch meta data update and journaling in one short
> > > 4) Could possibly speed up future get_blocks() look up by cache the last
> > > mapped blocks in inode.
> > >
> >
> > And here is the patch to make mpage_writepages use get_blocks() for
> > multiple block lookup/allocation. It performs a radix-tree contiguous
> > pages lookup, and issues a get_blocks for the range together. It maintains
> > an mpageio structure to track intermediate mapping state, somewhat
> > like the DIO code.
> >
> > It does need some more testing, especially block_size < PAGE_SIZE.
> > The JFS workaround can be dropped if the JFS get_blocks fix from
> > Dave Kleikamp is integrated.
> >
> > Review feedback would be welcome.
> >
> > Mingming,
> > Let me know if you have a chance to try this out with your patch.
>
> Sure, Suparna, I will try your patch soon!
>
> In my patch, I have modified ext3 directo io code to make use of
> ext3_get_blocks(). Tested with a simple file write with O_DIRECT, seems
> work fine! Allocating blocks for a 120k file only invokes
> ext3_get_blocks() for 4 times(perfect is 1, but before is 30 times call
> to ext3_get_block). Among the 4 calls to ext3_get_blocks, 2 because of
> reach the meta data block boundary(direct ->indirect), another 2 because
> of reach the end of the reservation window. For the later 2, we could
> avoid that by extend the reservation window before calling
> ext3_new_blocks() if the window size is less than the number of blocks
> to allocate.
>
> But if it try to allocating blocks in the hole (with direct IO), blocks
> are allocated one by one. I am looking at it right now.

That is expected, because DIO falls back to using buffered IO for
overwriting holes. This is to avoid some tricky races between DIO and
buffered IO that could otherwise have led to exposure of stale data.

Regards
Suparna

> Thanks,
> Mingming
>

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [Ext2-devel] [RFC] Adding multiple block allocation
  2005-04-29 13:52   ` [Ext2-devel] [RFC] Adding multiple block allocation Suparna Bhattacharya
  2005-04-29 17:10     ` Mingming Cao
@ 2005-04-29 18:45     ` Badari Pulavarty
  2005-04-29 23:22       ` Mingming Cao
  2005-04-30 16:52       ` Suparna Bhattacharya
  2005-04-30  0:33     ` Mingming Cao
  2005-04-30  0:44     ` Mingming Cao
  3 siblings, 2 replies; 68+ messages in thread
From: Badari Pulavarty @ 2005-04-29 18:45 UTC (permalink / raw)
  To: suparna
  Cc: Mingming Cao, Andrew Morton, Stephen C. Tweedie, linux-kernel,
	ext2-devel, linux-fsdevel

Suparna Bhattacharya wrote:
> On Thu, Apr 28, 2005 at 12:14:24PM -0700, Mingming Cao wrote:
>
>>Currently ext3_get_block()/ext3_new_block() only allocate one block at a
>>time. To allocate multiple blocks, the caller, for example, ext3 direct
>>IO routine, has to invoke ext3_get_block() many times. This is quite
>>inefficient for sequential IO workload.
>>
>>The benefit of a real get_blocks() include
>>1) increase the possibility to get contiguous blocks, reduce possibility
>>of fragmentation due to interleaved allocations from other threads.
>>(should good for non reservation case)
>>2) Reduces CPU cycles spent in repeated get_block() calls
>>3) Batch meta data update and journaling in one short
>>4) Could possibly speed up future get_blocks() look up by cache the last
>>mapped blocks in inode.
>>
>
> And here is the patch to make mpage_writepages use get_blocks() for
> multiple block lookup/allocation. It performs a radix-tree contiguous
> pages lookup, and issues a get_blocks for the range together. It maintains
> an mpageio structure to track intermediate mapping state, somewhat
> like the DIO code.
>
> It does need some more testing, especially block_size < PAGE_SIZE.
> The JFS workaround can be dropped if the JFS get_blocks fix from
> Dave Kleikamp is integrated.
>
> Review feedback would be welcome.
>
> Mingming,
> Let me know if you have a chance to try this out with your patch.
> Regards
> Suparna
>

Suparna,

I touch-tested your patch earlier and it seems to work fine. Let's
integrate Mingming's getblocks() patches with this and see if we get any
benefit from the whole effort.

BTW, is the plan to remove repeated calls to getblocks() and work with
whatever getblocks() returned? Or do you find it better to call
getblocks() repeatedly in a loop until you satisfy all the pages in the
list (as long as the blocks are contiguous)?

Thanks,
Badari

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [Ext2-devel] [RFC] Adding multiple block allocation
  2005-04-29 18:45     ` Badari Pulavarty
@ 2005-04-29 23:22       ` Mingming Cao
  2005-04-30 16:10         ` Suparna Bhattacharya
  0 siblings, 1 reply; 68+ messages in thread
From: Mingming Cao @ 2005-04-29 23:22 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: suparna, Andrew Morton, Stephen C. Tweedie, linux-kernel,
	ext2-devel, linux-fsdevel

On Fri, 2005-04-29 at 11:45 -0700, Badari Pulavarty wrote:
> I touch tested your patch earlier and seems to work fine. Lets integrate
> Mingming's getblocks() patches with this and see if we get any benifit
> from the whole effort.
>

I tried Suparna's mpage_writepages_getblocks patch with my
ext3_get_blocks patch; it seems to work fine, except that still only one
block is allocated at a time. I got a little confused....

I did not see any delayed allocation code in your patch, so I assume you
have to update ext3_prepare_write() to not call ext3_get_block(), so
that block allocation is deferred to ext3_writepages time. Without the
delayed allocation part, then, the get_blocks in mpage_writepages only
does multiple-block lookup, right?

Mingming

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [Ext2-devel] [RFC] Adding multiple block allocation 2005-04-29 23:22 ` Mingming Cao @ 2005-04-30 16:10 ` Suparna Bhattacharya 2005-04-30 17:11 ` Suparna Bhattacharya 0 siblings, 1 reply; 68+ messages in thread From: Suparna Bhattacharya @ 2005-04-30 16:10 UTC (permalink / raw) To: Mingming Cao Cc: Badari Pulavarty, Andrew Morton, Stephen C. Tweedie, linux-kernel, ext2-devel, linux-fsdevel On Fri, Apr 29, 2005 at 04:22:59PM -0700, Mingming Cao wrote: > On Fri, 2005-04-29 at 11:45 -0700, Badari Pulavarty wrote: > > I touch tested your patch earlier and seems to work fine. Lets integrate > > Mingming's getblocks() patches with this and see if we get any benifit > > from the whole effort. > > > > I tried Suparna's mpage_writepages_getblocks patch with my > ext3_get_blocks patch, seems to work fine, except that still only one > block is allocated at a time. I got a little confused.... > > I did not see any delayed allocation code in your patch, I assume you > have to update ext3_prepare_write to not call ext3_get_block, so that > block allocation will be defered at ext3_writepages time. So without the > delayed allocation part, the get_blocks in mpage_writepages is doing > multiple blocks look up only, right? That's right - so you could try it with mmapped writes (which don't go through a prepare write) or with Badari's patch to not call ext3_get_block in prepare write. Regards Suparna > > > Mingming > > > > ------------------------------------------------------- > This SF.Net email is sponsored by: NEC IT Guy Games. > Get your fingers limbered up and give it your best shot. 4 great events, 4 > opportunities to win big! Highest score wins.NEC IT Guy Games. Play to > win an NEC 61 plasma display. 
Visit http://www.necitguy.com/?r=20 > _______________________________________________ > Ext2-devel mailing list > Ext2-devel@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/ext2-devel -- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Ext2-devel] [RFC] Adding multiple block allocation 2005-04-30 16:10 ` Suparna Bhattacharya @ 2005-04-30 17:11 ` Suparna Bhattacharya 2005-04-30 18:07 ` Mingming Cao 0 siblings, 1 reply; 68+ messages in thread From: Suparna Bhattacharya @ 2005-04-30 17:11 UTC (permalink / raw) To: Mingming Cao Cc: Badari Pulavarty, Andrew Morton, Stephen C. Tweedie, linux-kernel, ext2-devel, linux-fsdevel On Sat, Apr 30, 2005 at 09:40:43PM +0530, Suparna Bhattacharya wrote: > On Fri, Apr 29, 2005 at 04:22:59PM -0700, Mingming Cao wrote: > > On Fri, 2005-04-29 at 11:45 -0700, Badari Pulavarty wrote: > > > I touch tested your patch earlier and seems to work fine. Lets integrate > > > Mingming's getblocks() patches with this and see if we get any benifit > > > from the whole effort. > > > > > > > I tried Suparna's mpage_writepages_getblocks patch with my > > ext3_get_blocks patch, seems to work fine, except that still only one > > block is allocated at a time. I got a little confused.... > > > > I did not see any delayed allocation code in your patch, I assume you > > have to update ext3_prepare_write to not call ext3_get_block, so that > > block allocation will be defered at ext3_writepages time. So without the > > delayed allocation part, the get_blocks in mpage_writepages is doing > > multiple blocks look up only, right? > > That's right - so you could try it with mmapped writes (which don't > go through a prepare write) or with Badari's patch to not call > ext3_get_block in prepare write. BTW, I hope you are running with NOBH. Regards Suparna > > Regards > Suparna > > > > > Mingming -- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Ext2-devel] [RFC] Adding multiple block allocation 2005-04-30 17:11 ` Suparna Bhattacharya @ 2005-04-30 18:07 ` Mingming Cao 2005-05-02 4:46 ` Suparna Bhattacharya 0 siblings, 1 reply; 68+ messages in thread From: Mingming Cao @ 2005-04-30 18:07 UTC (permalink / raw) To: suparna Cc: Badari Pulavarty, Andrew Morton, Stephen C. Tweedie, linux-kernel, ext2-devel, linux-fsdevel On Sat, 2005-04-30 at 22:41 +0530, Suparna Bhattacharya wrote: > On Sat, Apr 30, 2005 at 09:40:43PM +0530, Suparna Bhattacharya wrote: > > On Fri, Apr 29, 2005 at 04:22:59PM -0700, Mingming Cao wrote: > > > On Fri, 2005-04-29 at 11:45 -0700, Badari Pulavarty wrote: > > > > I touch tested your patch earlier and seems to work fine. Lets integrate > > > > Mingming's getblocks() patches with this and see if we get any benifit > > > > from the whole effort. > > > > > > > > > > I tried Suparna's mpage_writepages_getblocks patch with my > > > ext3_get_blocks patch, seems to work fine, except that still only one > > > block is allocated at a time. I got a little confused.... > > > > > > I did not see any delayed allocation code in your patch, I assume you > > > have to update ext3_prepare_write to not call ext3_get_block, so that > > > block allocation will be defered at ext3_writepages time. So without the > > > delayed allocation part, the get_blocks in mpage_writepages is doing > > > multiple blocks look up only, right? > > > > That's right - so you could try it with mmapped writes (which don't > > go through a prepare write) or with Badari's patch to not call > > ext3_get_block in prepare write. > > BTW, I hope you are running with NOBH. No, it was not run with NOBH. Why do we need to turn on nobh here? There are some issues running fsx tests with both patches, without the direct IO option. I will spend more time on this next week. Thanks, Mingming ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Ext2-devel] [RFC] Adding multiple block allocation 2005-04-30 18:07 ` Mingming Cao @ 2005-05-02 4:46 ` Suparna Bhattacharya 0 siblings, 0 replies; 68+ messages in thread From: Suparna Bhattacharya @ 2005-05-02 4:46 UTC (permalink / raw) To: Mingming Cao Cc: Badari Pulavarty, Andrew Morton, Stephen C. Tweedie, linux-kernel, ext2-devel, linux-fsdevel On Sat, Apr 30, 2005 at 11:07:53AM -0700, Mingming Cao wrote: > On Sat, 2005-04-30 at 22:41 +0530, Suparna Bhattacharya wrote: > > On Sat, Apr 30, 2005 at 09:40:43PM +0530, Suparna Bhattacharya wrote: > > > On Fri, Apr 29, 2005 at 04:22:59PM -0700, Mingming Cao wrote: > > > > On Fri, 2005-04-29 at 11:45 -0700, Badari Pulavarty wrote: > > > > > I touch tested your patch earlier and seems to work fine. Lets integrate > > > > > Mingming's getblocks() patches with this and see if we get any benifit > > > > > from the whole effort. > > > > > > > > > > > > > I tried Suparna's mpage_writepages_getblocks patch with my > > > > ext3_get_blocks patch, seems to work fine, except that still only one > > > > block is allocated at a time. I got a little confused.... > > > > > > > > I did not see any delayed allocation code in your patch, I assume you > > > > have to update ext3_prepare_write to not call ext3_get_block, so that > > > > block allocation will be defered at ext3_writepages time. So without the > > > > delayed allocation part, the get_blocks in mpage_writepages is doing > > > > multiple blocks look up only, right? > > > > > > That's right - so you could try it with mmapped writes (which don't > > > go through a prepare write) or with Badari's patch to not call > > > ext3_get_block in prepare write. > > > > BTW, I hope you are running with NOBH. > > No, it was not run with NOBH. Why we need to turn on nobh here? 
If the page has buffers, then get_blocks won't be invoked -- it either finds all the buffers in a page mapped contiguously, in which case it can issue the IO directly, or it enters the confused case where it goes through block_write_full_page. > > There are some issue running fsx tests with both patches, w/o direct io > option. I will spend more time on this next week. OK, I can take a look at the fsx logs, just in case it is similar to something I've seen before. Regards Suparna -- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India ^ permalink raw reply [flat|nested] 68+ messages in thread
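Suparna's point about NOBH can be sketched as follows: when buffer heads are already attached to a page, the writepage path only checks whether the page's buffers are mapped contiguously and never calls get_blocks(); non-contiguous buffers mean multiple IOs for one page and a fall back to the per-buffer path. This is a simplified, hypothetical userspace model, not the kernel code:

```c
#include <assert.h>

/* Model of the buffered-page check in __mpage_writepage: given the
 * on-disk block number of each buffer in a page, decide whether one
 * contiguous bio can cover the page or the "confused" fallback
 * (block_write_full_page in the real code) is needed. */

enum { PAGE_CONTIG = 0, PAGE_CONFUSED = 1 };

static int classify_buffered_page(const long *blocknr, int nblocks)
{
    for (int i = 1; i < nblocks; i++)
        if (blocknr[i] != blocknr[i - 1] + 1)
            return PAGE_CONFUSED; /* would need multiple IOs */
    return PAGE_CONTIG; /* one bio starting at blocknr[0] */
}
```

With NOBH, pages carry no buffer heads, so this branch is skipped and the get_blocks()-based mapping path is actually exercised.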
* Re: [Ext2-devel] [RFC] Adding multiple block allocation 2005-04-29 18:45 ` Badari Pulavarty 2005-04-29 23:22 ` Mingming Cao @ 2005-04-30 16:52 ` Suparna Bhattacharya 1 sibling, 0 replies; 68+ messages in thread From: Suparna Bhattacharya @ 2005-04-30 16:52 UTC (permalink / raw) To: Badari Pulavarty Cc: Mingming Cao, Andrew Morton, Stephen C. Tweedie, linux-kernel, ext2-devel, linux-fsdevel On Fri, Apr 29, 2005 at 11:45:21AM -0700, Badari Pulavarty wrote: > Suparna Bhattacharya wrote: > >On Thu, Apr 28, 2005 at 12:14:24PM -0700, Mingming Cao wrote: > > > >>Currently ext3_get_block()/ext3_new_block() only allocate one block at a > >>time. To allocate multiple blocks, the caller, for example, ext3 direct > >>IO routine, has to invoke ext3_get_block() many times. This is quite > >>inefficient for sequential IO workload. > >> > >>The benefit of a real get_blocks() include > >>1) increase the possibility to get contiguous blocks, reduce possibility > >>of fragmentation due to interleaved allocations from other threads. > >>(should good for non reservation case) > >>2) Reduces CPU cycles spent in repeated get_block() calls > >>3) Batch meta data update and journaling in one short > >>4) Could possibly speed up future get_blocks() look up by cache the last > >>mapped blocks in inode. > >> > > > > > >And here is the patch to make mpage_writepages use get_blocks() for > >multiple block lookup/allocation. It performs a radix-tree contiguous > >pages lookup, and issues a get_blocks for the range together. It maintains > >an mpageio structure to track intermediate mapping state, somewhat > >like the DIO code. > > > >It does need some more testing, especially block_size < PAGE_SIZE. > >The JFS workaround can be dropped if the JFS get_blocks fix from > >Dave Kleikamp is integrated. > > > >Review feedback would be welcome. > > > >Mingming, > >Let me know if you have a chance to try this out with your patch. 
> > >Regards > >Suparna > > > > Suparna, > > I touch tested your patch earlier and seems to work fine. Lets integrate > Mingming's getblocks() patches with this and see if we get any benifit > from the whole effort. > > BTW, is it the plan to remove repeated calls to getblocks() and work > with whatever getblocks() returned ? Or do you find the its better to > repeatedly do getblocks() in a loop till you satisfy all the pages > in the list (as long as blocks are contiguous) ? The patch only loops through repeated get_blocks calls up to PAGE_SIZE, just like the earlier mpage_writepage code - in this case, if blocks aren't contiguous for at least PAGE_SIZE, it falls back to the confused case (since that means multiple IOs for the same page), just as before. What is new is the numblocks argument passed to get_blocks, so when it gets more contiguous blocks it can remember that mapping and avoid get_blocks calls for subsequent pages. Does that clarify? Regards Suparna -- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India ^ permalink raw reply [flat|nested] 68+ messages in thread
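The "remember the mapping and avoid get_blocks calls for subsequent pages" behaviour described above can be sketched with a toy version of the mpageio idea. The model uses one block per page for simplicity; all names, the mapper, and its pretend disk layout are hypothetical, not the patch's actual code:

```c
#include <assert.h>

/* Toy mpageio: a mapper call may return an extent covering several
 * pages; the writepages loop consumes the cached extent page by page
 * and only calls the mapper again when the cache runs out. */

struct toy_mio {
    long next_file_block;  /* first file block not yet consumed */
    long disk_block;       /* where that block lives on disk */
    long mapped;           /* contiguously mapped blocks remaining */
    int  calls;            /* how many times the mapper ran */
};

/* Hypothetical mapper: maps `want` blocks as one contiguous extent,
 * on a pretend identity-plus-offset disk layout. */
static long toy_map(long file_block, long want, long *disk, int *calls)
{
    (*calls)++;
    *disk = 1000 + file_block;
    return want;
}

/* Map one page's block, reusing the cached extent when it covers it. */
static long map_page(struct toy_mio *mio, long file_block, long nr_left)
{
    long skip = file_block - mio->next_file_block;
    if (skip >= 0 && skip < mio->mapped) {
        /* covered by the remembered extent: no mapper call */
        mio->disk_block += skip + 1;
        mio->mapped -= skip + 1;
        mio->next_file_block = file_block + 1;
        return mio->disk_block - 1;
    }
    long disk;
    long got = toy_map(file_block, nr_left, &disk, &mio->calls);
    mio->next_file_block = file_block + 1;
    mio->disk_block = disk + 1;
    mio->mapped = got - 1;
    return disk;
}
```

Writing three consecutive pages this way costs a single mapper call, which is the saving the numblocks argument buys over the old one-get_block-per-block loop.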
* Re: [Ext2-devel] [RFC] Adding multiple block allocation 2005-04-29 13:52 ` [Ext2-devel] [RFC] Adding multiple block allocation Suparna Bhattacharya 2005-04-29 17:10 ` Mingming Cao 2005-04-29 18:45 ` Badari Pulavarty @ 2005-04-30 0:33 ` Mingming Cao 2005-04-30 0:44 ` Mingming Cao 3 siblings, 0 replies; 68+ messages in thread From: Mingming Cao @ 2005-04-30 0:33 UTC (permalink / raw) To: suparna Cc: Andrew Morton, Stephen C. Tweedie, linux-kernel, ext2-devel, linux-fsdevel On Fri, 2005-04-29 at 19:22 +0530, Suparna Bhattacharya wrote: > On Thu, Apr 28, 2005 at 12:14:24PM -0700, Mingming Cao wrote: > > Currently ext3_get_block()/ext3_new_block() only allocate one block at a > > time. To allocate multiple blocks, the caller, for example, ext3 direct > > IO routine, has to invoke ext3_get_block() many times. This is quite > > inefficient for sequential IO workload. > > > > The benefit of a real get_blocks() include > > 1) increase the possibility to get contiguous blocks, reduce possibility > > of fragmentation due to interleaved allocations from other threads. > > (should good for non reservation case) > > 2) Reduces CPU cycles spent in repeated get_block() calls > > 3) Batch meta data update and journaling in one short > > 4) Could possibly speed up future get_blocks() look up by cache the last > > mapped blocks in inode. > > > > And here is the patch to make mpage_writepages use get_blocks() for > multiple block lookup/allocation. It performs a radix-tree contiguous > pages lookup, and issues a get_blocks for the range together. It maintains > an mpageio structure to track intermediate mapping state, somewhat > like the DIO code. > > It does need some more testing, especially block_size < PAGE_SIZE. > The JFS workaround can be dropped if the JFS get_blocks fix from > Dave Kleikamp is integrated. > > Review feedback would be welcome. > > Mingming, > Let me know if you have a chance to try this out with your patch. 
> > Regards > Suparna > > -- > Suparna Bhattacharya (suparna@in.ibm.com) > Linux Technology Center > IBM Software Lab, India > > > diff -urp -X dontdiff linux-2.6.12-rc3/fs/buffer.c linux-2.6.12-rc3-getblocks/fs/buffer.c > --- linux-2.6.12-rc3/fs/buffer.c 2005-04-21 05:33:15.000000000 +0530 > +++ linux-2.6.12-rc3-getblocks/fs/buffer.c 2005-04-22 15:08:33.000000000 +0530 > @@ -2514,53 +2514,10 @@ EXPORT_SYMBOL(nobh_commit_write); > * that it tries to operate without attaching bufferheads to > * the page. > */ > -int nobh_writepage(struct page *page, get_block_t *get_block, > - struct writeback_control *wbc) > +int nobh_writepage(struct page *page, get_blocks_t *get_blocks, > + struct writeback_control *wbc, writepage_t bh_writepage_fn) > { > - struct inode * const inode = page->mapping->host; > - loff_t i_size = i_size_read(inode); > - const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT; > - unsigned offset; > - void *kaddr; > - int ret; > - > - /* Is the page fully inside i_size? */ > - if (page->index < end_index) > - goto out; > - > - /* Is the page fully outside i_size? (truncate in progress) */ > - offset = i_size & (PAGE_CACHE_SIZE-1); > - if (page->index >= end_index+1 || !offset) { > - /* > - * The page may have dirty, unmapped buffers. For example, > - * they may have been added in ext3_writepage(). Make them > - * freeable here, so the page does not leak. > - */ > -#if 0 > - /* Not really sure about this - do we need this ? */ > - if (page->mapping->a_ops->invalidatepage) > - page->mapping->a_ops->invalidatepage(page, offset); > -#endif > - unlock_page(page); > - return 0; /* don't care */ > - } > - > - /* > - * The page straddles i_size. It must be zeroed out on each and every > - * writepage invocation because it may be mmapped. "A file is mapped > - * in multiples of the page size. For a file that is not a multiple of > - * the page size, the remaining memory is zeroed when mapped, and > - * writes to that region are not written out to the file." 
> - */ > - kaddr = kmap_atomic(page, KM_USER0); > - memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset); > - flush_dcache_page(page); > - kunmap_atomic(kaddr, KM_USER0); > -out: > - ret = mpage_writepage(page, get_block, wbc); > - if (ret == -EAGAIN) > - ret = __block_write_full_page(inode, page, get_block, wbc); > - return ret; > + return mpage_writepage(page, get_blocks, wbc, bh_writepage_fn); > } > EXPORT_SYMBOL(nobh_writepage); > > diff -urp -X dontdiff linux-2.6.12-rc3/fs/ext2/inode.c linux-2.6.12-rc3-getblocks/fs/ext2/inode.c > --- linux-2.6.12-rc3/fs/ext2/inode.c 2005-04-21 05:33:15.000000000 +0530 > +++ linux-2.6.12-rc3-getblocks/fs/ext2/inode.c 2005-04-22 16:30:42.000000000 +0530 > @@ -639,12 +639,6 @@ ext2_nobh_prepare_write(struct file *fil > return nobh_prepare_write(page,from,to,ext2_get_block); > } > > -static int ext2_nobh_writepage(struct page *page, > - struct writeback_control *wbc) > -{ > - return nobh_writepage(page, ext2_get_block, wbc); > -} > - > static sector_t ext2_bmap(struct address_space *mapping, sector_t block) > { > return generic_block_bmap(mapping,block,ext2_get_block); > @@ -662,6 +656,12 @@ ext2_get_blocks(struct inode *inode, sec > return ret; > } > > +static int ext2_nobh_writepage(struct page *page, > + struct writeback_control *wbc) > +{ > + return nobh_writepage(page, ext2_get_blocks, wbc, ext2_writepage); > +} > + > static ssize_t > ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, > loff_t offset, unsigned long nr_segs) > @@ -676,7 +676,8 @@ ext2_direct_IO(int rw, struct kiocb *ioc > static int > ext2_writepages(struct address_space *mapping, struct writeback_control *wbc) > { > - return mpage_writepages(mapping, wbc, ext2_get_block); > + return __mpage_writepages(mapping, wbc, ext2_get_blocks, > + ext2_writepage); > } > > struct address_space_operations ext2_aops = { > diff -urp -X dontdiff linux-2.6.12-rc3/fs/ext3/inode.c linux-2.6.12-rc3-getblocks/fs/ext3/inode.c > --- linux-2.6.12-rc3/fs/ext3/inode.c 
2005-04-21 05:33:15.000000000 +0530 > +++ linux-2.6.12-rc3-getblocks/fs/ext3/inode.c 2005-04-22 15:08:33.000000000 +0530 > @@ -866,10 +866,10 @@ get_block: > return ret; > } > > -static int ext3_writepages_get_block(struct inode *inode, sector_t iblock, > - struct buffer_head *bh, int create) > +static int ext3_writepages_get_blocks(struct inode *inode, sector_t iblock, > + unsigned long max_blocks, struct buffer_head *bh, int create) > { > - return ext3_direct_io_get_blocks(inode, iblock, 1, bh, create); > + return ext3_direct_io_get_blocks(inode, iblock, max_blocks, bh, create); > } > > /* > @@ -1369,11 +1369,11 @@ ext3_writeback_writepages(struct address > return ret; > } > > - ret = __mpage_writepages(mapping, wbc, ext3_writepages_get_block, > + ret = __mpage_writepages(mapping, wbc, ext3_writepages_get_blocks, > ext3_writeback_writepage_helper); > > /* > - * Need to reaquire the handle since ext3_writepages_get_block() > + * Need to reaquire the handle since ext3_writepages_get_blocks() > * can restart the handle > */ > handle = journal_current_handle(); > @@ -1402,7 +1402,8 @@ static int ext3_writeback_writepage(stru > } > > if (test_opt(inode->i_sb, NOBH)) > - ret = nobh_writepage(page, ext3_get_block, wbc); > + ret = nobh_writepage(page, ext3_writepages_get_blocks, wbc, > + ext3_writeback_writepage_helper); > else > ret = block_write_full_page(page, ext3_get_block, wbc); > > diff -urp -X dontdiff linux-2.6.12-rc3/fs/ext3/super.c linux-2.6.12-rc3-getblocks/fs/ext3/super.c > --- linux-2.6.12-rc3/fs/ext3/super.c 2005-04-21 05:33:15.000000000 +0530 > +++ linux-2.6.12-rc3-getblocks/fs/ext3/super.c 2005-04-22 15:08:33.000000000 +0530 > @@ -1321,6 +1321,7 @@ static int ext3_fill_super (struct super > sbi->s_resgid = le16_to_cpu(es->s_def_resgid); > > set_opt(sbi->s_mount_opt, RESERVATION); > + set_opt(sbi->s_mount_opt, NOBH); /* temp: set nobh default */ > > if (!parse_options ((char *) data, sb, &journal_inum, NULL, 0)) > goto failed_mount; > @@ -1567,6 +1568,7 
@@ static int ext3_fill_super (struct super > printk(KERN_ERR "EXT3-fs: Journal does not support " > "requested data journaling mode\n"); > goto failed_mount3; > + set_opt(sbi->s_mount_opt, NOBH); /* temp: set nobh default */ > } > default: > break; > @@ -1584,6 +1586,7 @@ static int ext3_fill_super (struct super > "its supported only with writeback mode\n"); > clear_opt(sbi->s_mount_opt, NOBH); > } > + printk("NOBH option set\n"); > } > /* > * The journal_load will have done any necessary log recovery, > diff -urp -X dontdiff linux-2.6.12-rc3/fs/hfs/inode.c linux-2.6.12-rc3-getblocks/fs/hfs/inode.c > --- linux-2.6.12-rc3/fs/hfs/inode.c 2005-04-21 05:33:15.000000000 +0530 > +++ linux-2.6.12-rc3-getblocks/fs/hfs/inode.c 2005-04-22 15:08:33.000000000 +0530 > @@ -124,7 +124,7 @@ static ssize_t hfs_direct_IO(int rw, str > static int hfs_writepages(struct address_space *mapping, > struct writeback_control *wbc) > { > - return mpage_writepages(mapping, wbc, hfs_get_block); > + return mpage_writepages(mapping, wbc, hfs_get_blocks); > } > > struct address_space_operations hfs_btree_aops = { > diff -urp -X dontdiff linux-2.6.12-rc3/fs/hfsplus/inode.c linux-2.6.12-rc3-getblocks/fs/hfsplus/inode.c > --- linux-2.6.12-rc3/fs/hfsplus/inode.c 2005-04-21 05:33:15.000000000 +0530 > +++ linux-2.6.12-rc3-getblocks/fs/hfsplus/inode.c 2005-04-22 15:08:33.000000000 +0530 > @@ -121,7 +121,7 @@ static ssize_t hfsplus_direct_IO(int rw, > static int hfsplus_writepages(struct address_space *mapping, > struct writeback_control *wbc) > { > - return mpage_writepages(mapping, wbc, hfsplus_get_block); > + return mpage_writepages(mapping, wbc, hfsplus_get_blocks); > } > > struct address_space_operations hfsplus_btree_aops = { > diff -urp -X dontdiff linux-2.6.12-rc3/fs/jfs/inode.c linux-2.6.12-rc3-getblocks/fs/jfs/inode.c > --- linux-2.6.12-rc3/fs/jfs/inode.c 2005-04-21 05:33:15.000000000 +0530 > +++ linux-2.6.12-rc3-getblocks/fs/jfs/inode.c 2005-04-22 16:27:19.000000000 +0530 > @@ -267,21 +267,41 
@@ jfs_get_blocks(struct inode *ip, sector_ > return rc; > } > > +static int > +jfs_mpage_get_blocks(struct inode *ip, sector_t lblock, unsigned long > + max_blocks, struct buffer_head *bh_result, int create) > +{ > + /* > + * fixme: temporary workaround: return one block at a time until > + * we figure out why we see exposures with truncate on > + * allocating multiple blocks in one shot. > + */ > + return jfs_get_blocks(ip, lblock, 1, bh_result, create); > +} > + > static int jfs_get_block(struct inode *ip, sector_t lblock, > struct buffer_head *bh_result, int create) > { > return jfs_get_blocks(ip, lblock, 1, bh_result, create); > } > > +static int jfs_bh_writepage(struct page *page, > + struct writeback_control *wbc) > +{ > + return block_write_full_page(page, jfs_get_block, wbc); > +} > + > + > static int jfs_writepage(struct page *page, struct writeback_control *wbc) > { > - return nobh_writepage(page, jfs_get_block, wbc); > + return nobh_writepage(page, jfs_mpage_get_blocks, wbc, jfs_bh_writepage); > } > > static int jfs_writepages(struct address_space *mapping, > struct writeback_control *wbc) > { > - return mpage_writepages(mapping, wbc, jfs_get_block); > + return __mpage_writepages(mapping, wbc, jfs_mpage_get_blocks, > + jfs_bh_writepage); > } > > static int jfs_readpage(struct file *file, struct page *page) > diff -urp -X dontdiff linux-2.6.12-rc3/fs/mpage.c linux-2.6.12-rc3-getblocks/fs/mpage.c > --- linux-2.6.12-rc3/fs/mpage.c 2005-04-21 05:33:15.000000000 +0530 > +++ linux-2.6.12-rc3-getblocks/fs/mpage.c 2005-04-22 16:19:14.000000000 +0530 > @@ -370,6 +370,67 @@ int mpage_readpage(struct page *page, ge > } > EXPORT_SYMBOL(mpage_readpage); > > +struct mpageio { > + struct bio *bio; > + struct buffer_head map_bh; > + unsigned long block_in_file; > + unsigned long final_block_in_request; > + sector_t long block_in_bio; > + int boundary; > + sector_t boundary_block; > + struct block_device *boundary_bdev; > +}; > + > +/* > + * Maps as many contiguous disk 
blocks as it can within the range of > + * the request, and returns the total number of contiguous mapped > + * blocks in the mpageio. > + */ > +static unsigned long mpage_get_more_blocks(struct mpageio *mio, > + struct inode *inode, get_blocks_t get_blocks) > +{ > + struct buffer_head map_bh = {.b_state = 0}; > + unsigned long mio_nblocks = mio->map_bh.b_size >> inode->i_blkbits; > + unsigned long first_unmapped = mio->block_in_file + mio_nblocks; > + unsigned long next_contig_block = mio->map_bh.b_blocknr + mio_nblocks; > + > + while ((first_unmapped < mio->final_block_in_request) && > + (mio->map_bh.b_size < PAGE_SIZE)) { > + > + if (get_blocks(inode, first_unmapped, > + mio->final_block_in_request - first_unmapped, > + &map_bh, 1)) > + break; > + if (mio_nblocks && ((map_bh.b_blocknr != next_contig_block) || > + map_bh.b_bdev != mio->map_bh.b_bdev)) > + break; > + > + if (buffer_new(&map_bh)) { > + int i = 0; > + for (; i < map_bh.b_size >> inode->i_blkbits; i++) > + unmap_underlying_metadata(map_bh.b_bdev, > + map_bh.b_blocknr + i); > + } > + > + if (buffer_boundary(&map_bh)) { > + mio->boundary = 1; > + mio->boundary_block = map_bh.b_blocknr; > + mio->boundary_bdev = map_bh.b_bdev; > + } > + if (mio_nblocks == 0) { > + mio->map_bh.b_bdev = map_bh.b_bdev; > + mio->map_bh.b_blocknr = map_bh.b_blocknr; > + } > + > + mio_nblocks += map_bh.b_size >> inode->i_blkbits; > + first_unmapped = mio->block_in_file + mio_nblocks; > + next_contig_block = mio->map_bh.b_blocknr + mio_nblocks; > + mio->map_bh.b_size += map_bh.b_size; > + } > + > + return mio_nblocks; > +} > + > /* > * Writing is not so simple. > * > @@ -386,9 +447,9 @@ EXPORT_SYMBOL(mpage_readpage); > * written, so it can intelligently allocate a suitably-sized BIO. For now, > * just allocate full-size (16-page) BIOs. 
> */ > -static struct bio * > -__mpage_writepage(struct bio *bio, struct page *page, get_block_t get_block, > - sector_t *last_block_in_bio, int *ret, struct writeback_control *wbc, > +static int > +__mpage_writepage(struct mpageio *mio, struct page *page, > + get_blocks_t get_blocks, struct writeback_control *wbc, > writepage_t writepage_fn) > { > struct address_space *mapping = page->mapping; > @@ -396,9 +457,8 @@ __mpage_writepage(struct bio *bio, struc > const unsigned blkbits = inode->i_blkbits; > unsigned long end_index; > const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits; > - sector_t last_block; > + sector_t last_block, blocks_to_skip; > sector_t block_in_file; > - sector_t blocks[MAX_BUF_PER_PAGE]; > unsigned page_block; > unsigned first_unmapped = blocks_per_page; > struct block_device *bdev = NULL; > @@ -406,8 +466,10 @@ __mpage_writepage(struct bio *bio, struc > sector_t boundary_block = 0; > struct block_device *boundary_bdev = NULL; > int length; > - struct buffer_head map_bh; > loff_t i_size = i_size_read(inode); > + struct buffer_head *map_bh = &mio->map_bh; > + struct bio *bio = mio->bio; > + int ret = 0; > > if (page_has_buffers(page)) { > struct buffer_head *head = page_buffers(page); > @@ -435,10 +497,13 @@ __mpage_writepage(struct bio *bio, struc > if (!buffer_dirty(bh) || !buffer_uptodate(bh)) > goto confused; > if (page_block) { > - if (bh->b_blocknr != blocks[page_block-1] + 1) > + if (bh->b_blocknr != map_bh->b_blocknr > + + page_block) > goto confused; > + } else { > + map_bh->b_blocknr = bh->b_blocknr; > + map_bh->b_size = PAGE_SIZE; > } > - blocks[page_block++] = bh->b_blocknr; > boundary = buffer_boundary(bh); > if (boundary) { > boundary_block = bh->b_blocknr; > @@ -465,33 +530,30 @@ __mpage_writepage(struct bio *bio, struc > BUG_ON(!PageUptodate(page)); > block_in_file = page->index << (PAGE_CACHE_SHIFT - blkbits); > last_block = (i_size - 1) >> blkbits; > - map_bh.b_page = page; > - for (page_block = 0; page_block < 
blocks_per_page; ) { > - > - map_bh.b_state = 0; > - if (get_block(inode, block_in_file, &map_bh, 1)) > - goto confused; > - if (buffer_new(&map_bh)) > - unmap_underlying_metadata(map_bh.b_bdev, > - map_bh.b_blocknr); > - if (buffer_boundary(&map_bh)) { > - boundary_block = map_bh.b_blocknr; > - boundary_bdev = map_bh.b_bdev; > - } > - if (page_block) { > - if (map_bh.b_blocknr != blocks[page_block-1] + 1) > - goto confused; > - } > - blocks[page_block++] = map_bh.b_blocknr; > - boundary = buffer_boundary(&map_bh); > - bdev = map_bh.b_bdev; > - if (block_in_file == last_block) > - break; > - block_in_file++; > + blocks_to_skip = block_in_file - mio->block_in_file; > + mio->block_in_file = block_in_file; > + if (blocks_to_skip < (map_bh->b_size >> blkbits)) { > + map_bh->b_blocknr += blocks_to_skip; > + map_bh->b_size -= blocks_to_skip << blkbits; > + } else { > + map_bh->b_state = 0; > + map_bh->b_size = 0; > + if (mio->final_block_in_request > last_block) > + mio->final_block_in_request = last_block; > + mpage_get_more_blocks(mio, inode, get_blocks); > } > - BUG_ON(page_block == 0); > + if (map_bh->b_size < PAGE_SIZE) > + goto confused; > > - first_unmapped = page_block; > + if (mio->boundary && (mio->boundary_block < map_bh->b_blocknr > + + blocks_per_page)) { > + boundary = 1; > + boundary_block = mio->boundary_block; > + boundary_bdev = mio->boundary_bdev; > + } > + > + bdev = map_bh->b_bdev; > + first_unmapped = blocks_per_page; > > page_is_mapped: > end_index = i_size >> PAGE_CACHE_SHIFT; > @@ -518,12 +580,16 @@ page_is_mapped: > /* > * This page will go to BIO. Do we need to send this BIO off first? 
> */ > - if (bio && *last_block_in_bio != blocks[0] - 1) > + if (bio && mio->block_in_bio != map_bh->b_blocknr - 1) > bio = mpage_bio_submit(WRITE, bio); > > alloc_new: > if (bio == NULL) { > - bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9), > + /* > + * Fixme: bio size can be limited to final_block - block, or > + * even mio->map_bh.b_size > + */ > + bio = mpage_alloc(bdev, map_bh->b_blocknr << (blkbits - 9), > bio_get_nr_vecs(bdev), GFP_NOFS|__GFP_HIGH); > if (bio == NULL) > goto confused; > @@ -539,6 +605,9 @@ alloc_new: > bio = mpage_bio_submit(WRITE, bio); > goto alloc_new; > } > + map_bh->b_blocknr += blocks_per_page; > + map_bh->b_size -= PAGE_SIZE; > + mio->block_in_file += blocks_per_page; > > /* > * OK, we have our BIO, so we can now mark the buffers clean. Make > @@ -575,7 +644,8 @@ alloc_new: > boundary_block, 1 << blkbits); > } > } else { > - *last_block_in_bio = blocks[blocks_per_page - 1]; > + /* we can pack more pages into the bio, don't submit yet */ > + mio->block_in_bio = map_bh->b_blocknr - 1; > } > goto out; > > @@ -584,22 +654,23 @@ confused: > bio = mpage_bio_submit(WRITE, bio); > > if (writepage_fn) { > - *ret = (*writepage_fn)(page, wbc); > + ret = (*writepage_fn)(page, wbc); > } else { > - *ret = -EAGAIN; > + ret = -EAGAIN; > goto out; > } > /* > * The caller has a ref on the inode, so *mapping is stable > */ > - if (*ret) { > - if (*ret == -ENOSPC) > + if (ret) { > + if (ret == -ENOSPC) > set_bit(AS_ENOSPC, &mapping->flags); > else > set_bit(AS_EIO, &mapping->flags); > } > out: > - return bio; > + mio->bio = bio; > + return ret; > } > > /** > @@ -625,20 +696,21 @@ out: > */ > int > mpage_writepages(struct address_space *mapping, > - struct writeback_control *wbc, get_block_t get_block) > + struct writeback_control *wbc, get_blocks_t get_blocks) > { > - return __mpage_writepages(mapping, wbc, get_block, > + return __mpage_writepages(mapping, wbc, get_blocks, > mapping->a_ops->writepage); > } > > int > __mpage_writepages(struct 
address_space *mapping, > - struct writeback_control *wbc, get_block_t get_block, > + struct writeback_control *wbc, get_blocks_t get_blocks, > writepage_t writepage_fn) > { > struct backing_dev_info *bdi = mapping->backing_dev_info; > + struct inode *inode = mapping->host; > + const unsigned blkbits = inode->i_blkbits; > struct bio *bio = NULL; > - sector_t last_block_in_bio = 0; > int ret = 0; > int done = 0; > int (*writepage)(struct page *page, struct writeback_control *wbc); > @@ -648,6 +720,9 @@ __mpage_writepages(struct address_space > pgoff_t end = -1; /* Inclusive */ > int scanned = 0; > int is_range = 0; > + struct mpageio mio = { > + .bio = NULL > + }; > > if (wbc->nonblocking && bdi_write_congested(bdi)) { > wbc->encountered_congestion = 1; > @@ -655,7 +730,7 @@ __mpage_writepages(struct address_space > } > > writepage = NULL; > - if (get_block == NULL) > + if (get_blocks == NULL) > writepage = mapping->a_ops->writepage; > > pagevec_init(&pvec, 0); > @@ -672,12 +747,15 @@ __mpage_writepages(struct address_space > scanned = 1; > } > retry: > + down_read(&inode->i_alloc_sem); > while (!done && (index <= end) && > - (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, > - PAGECACHE_TAG_DIRTY, > + (nr_pages = pagevec_contig_lookup_tag(&pvec, mapping, > + &index, PAGECACHE_TAG_DIRTY, > min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) { > unsigned i; > > + mio.final_block_in_request = min(index, end) << > + (PAGE_CACHE_SHIFT - blkbits); > scanned = 1; > for (i = 0; i < nr_pages; i++) { > struct page *page = pvec.pages[i]; > @@ -702,7 +780,7 @@ retry: > unlock_page(page); > continue; > } > - > + > if (wbc->sync_mode != WB_SYNC_NONE) > wait_on_page_writeback(page); > > @@ -723,9 +801,9 @@ retry: > &mapping->flags); > } > } else { > - bio = __mpage_writepage(bio, page, get_block, > - &last_block_in_bio, &ret, wbc, > - writepage_fn); > + ret = __mpage_writepage(&mio, page, get_blocks, > + wbc, writepage_fn); > + bio = mio.bio; > } > if (ret || 
(--(wbc->nr_to_write) <= 0)) > done = 1; > @@ -737,6 +815,9 @@ retry: > pagevec_release(&pvec); > cond_resched(); > } > + > + up_read(&inode->i_alloc_sem); > + > if (!scanned && !done) { > /* > * We hit the last page and there is more work to be done: wrap > @@ -755,17 +836,23 @@ retry: > EXPORT_SYMBOL(mpage_writepages); > EXPORT_SYMBOL(__mpage_writepages); > > -int mpage_writepage(struct page *page, get_block_t get_block, > - struct writeback_control *wbc) > +int mpage_writepage(struct page *page, get_blocks_t get_blocks, > + struct writeback_control *wbc, writepage_t writepage_fn) > { > int ret = 0; > - struct bio *bio; > - sector_t last_block_in_bio = 0; > - > - bio = __mpage_writepage(NULL, page, get_block, > - &last_block_in_bio, &ret, wbc, NULL); > - if (bio) > - mpage_bio_submit(WRITE, bio); > + struct address_space *mapping = page->mapping; > + struct inode *inode = mapping->host; > + const unsigned blkbits = inode->i_blkbits; > + struct mpageio mio = { > + .final_block_in_request = (page->index + 1) << (PAGE_CACHE_SHIFT > + - blkbits) > + }; > + > + dump_stack(); > + ret = __mpage_writepage(&mio, page, get_blocks, > + wbc, writepage_fn); > + if (mio.bio) > + mpage_bio_submit(WRITE, mio.bio); > > return ret; > } > diff -urp -X dontdiff linux-2.6.12-rc3/include/linux/buffer_head.h linux-2.6.12-rc3-getblocks/include/linux/buffer_head.h > --- linux-2.6.12-rc3/include/linux/buffer_head.h 2005-04-21 05:33:16.000000000 +0530 > +++ linux-2.6.12-rc3-getblocks/include/linux/buffer_head.h 2005-04-22 15:08:33.000000000 +0530 > @@ -203,8 +203,8 @@ int file_fsync(struct file *, struct den > int nobh_prepare_write(struct page*, unsigned, unsigned, get_block_t*); > int nobh_commit_write(struct file *, struct page *, unsigned, unsigned); > int nobh_truncate_page(struct address_space *, loff_t); > -int nobh_writepage(struct page *page, get_block_t *get_block, > - struct writeback_control *wbc); > +int nobh_writepage(struct page *page, get_blocks_t *get_blocks, > + struct 
writeback_control *wbc, writepage_t bh_writepage); > > > /* > diff -urp -X dontdiff linux-2.6.12-rc3/include/linux/fs.h linux-2.6.12-rc3-getblocks/include/linux/fs.h > --- linux-2.6.12-rc3/include/linux/fs.h 2005-04-21 05:33:16.000000000 +0530 > +++ linux-2.6.12-rc3-getblocks/include/linux/fs.h 2005-04-22 15:08:33.000000000 +0530 > @@ -304,6 +304,8 @@ struct address_space; > struct writeback_control; > struct kiocb; > > +typedef int (writepage_t)(struct page *page, struct writeback_control *wbc); > + > struct address_space_operations { > int (*writepage)(struct page *page, struct writeback_control *wbc); > int (*readpage)(struct file *, struct page *); > diff -urp -X dontdiff linux-2.6.12-rc3/include/linux/mpage.h linux-2.6.12-rc3-getblocks/include/linux/mpage.h > --- linux-2.6.12-rc3/include/linux/mpage.h 2005-04-21 05:33:16.000000000 +0530 > +++ linux-2.6.12-rc3-getblocks/include/linux/mpage.h 2005-04-22 15:08:33.000000000 +0530 > @@ -11,17 +11,16 @@ > */ > > struct writeback_control; > -typedef int (writepage_t)(struct page *page, struct writeback_control *wbc); > > int mpage_readpages(struct address_space *mapping, struct list_head *pages, > unsigned nr_pages, get_block_t get_block); > int mpage_readpage(struct page *page, get_block_t get_block); > int mpage_writepages(struct address_space *mapping, > - struct writeback_control *wbc, get_block_t get_block); > -int mpage_writepage(struct page *page, get_block_t *get_block, > - struct writeback_control *wbc); > + struct writeback_control *wbc, get_blocks_t get_blocks); > +int mpage_writepage(struct page *page, get_blocks_t *get_blocks, > + struct writeback_control *wbc, writepage_t writepage); > int __mpage_writepages(struct address_space *mapping, > - struct writeback_control *wbc, get_block_t get_block, > + struct writeback_control *wbc, get_blocks_t get_blocks, > writepage_t writepage); > > static inline int > diff -urp -X dontdiff linux-2.6.12-rc3/include/linux/pagemap.h 
linux-2.6.12-rc3-getblocks/include/linux/pagemap.h > --- linux-2.6.12-rc3/include/linux/pagemap.h 2005-04-21 05:33:16.000000000 +0530 > +++ linux-2.6.12-rc3-getblocks/include/linux/pagemap.h 2005-04-22 15:08:33.000000000 +0530 > @@ -73,7 +73,8 @@ extern struct page * find_or_create_page > unsigned find_get_pages(struct address_space *mapping, pgoff_t start, > unsigned int nr_pages, struct page **pages); > unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index, > - int tag, unsigned int nr_pages, struct page **pages); > + int tag, unsigned int nr_pages, struct page **pages, > + int contig); > > /* > * Returns locked page at given index in given cache, creating it if needed. > diff -urp -X dontdiff linux-2.6.12-rc3/include/linux/pagevec.h linux-2.6.12-rc3-getblocks/include/linux/pagevec.h > --- linux-2.6.12-rc3/include/linux/pagevec.h 2005-04-21 05:33:16.000000000 +0530 > +++ linux-2.6.12-rc3-getblocks/include/linux/pagevec.h 2005-04-22 15:08:33.000000000 +0530 > @@ -28,6 +28,9 @@ unsigned pagevec_lookup(struct pagevec * > unsigned pagevec_lookup_tag(struct pagevec *pvec, > struct address_space *mapping, pgoff_t *index, int tag, > unsigned nr_pages); > +unsigned pagevec_contig_lookup_tag(struct pagevec *pvec, > + struct address_space *mapping, pgoff_t *index, int tag, > + unsigned nr_pages); > > static inline void pagevec_init(struct pagevec *pvec, int cold) > { > diff -urp -X dontdiff linux-2.6.12-rc3/include/linux/radix-tree.h linux-2.6.12-rc3-getblocks/include/linux/radix-tree.h > --- linux-2.6.12-rc3/include/linux/radix-tree.h 2005-04-21 05:33:16.000000000 +0530 > +++ linux-2.6.12-rc3-getblocks/include/linux/radix-tree.h 2005-04-22 15:08:33.000000000 +0530 > @@ -59,8 +59,18 @@ void *radix_tree_tag_clear(struct radix_ > int radix_tree_tag_get(struct radix_tree_root *root, > unsigned long index, int tag); > unsigned int > -radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results, > - unsigned long first_index, unsigned int 
max_items, int tag); > +__radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results, > + unsigned long first_index, unsigned int max_items, int tag, > + int contig); > + > +static inline unsigned int radix_tree_gang_lookup_tag(struct radix_tree_root > + *root, void **results, unsigned long first_index, > + unsigned int max_items, int tag) > +{ > + return __radix_tree_gang_lookup_tag(root, results, first_index, > + max_items, tag, 0); > +} > + > int radix_tree_tagged(struct radix_tree_root *root, int tag); > > static inline void radix_tree_preload_end(void) > diff -urp -X dontdiff linux-2.6.12-rc3/lib/radix-tree.c linux-2.6.12-rc3-getblocks/lib/radix-tree.c > --- linux-2.6.12-rc3/lib/radix-tree.c 2005-04-21 05:33:16.000000000 +0530 > +++ linux-2.6.12-rc3-getblocks/lib/radix-tree.c 2005-04-22 16:34:29.000000000 +0530 > @@ -557,12 +557,13 @@ EXPORT_SYMBOL(radix_tree_gang_lookup); > */ > static unsigned int > __lookup_tag(struct radix_tree_root *root, void **results, unsigned long index, > - unsigned int max_items, unsigned long *next_index, int tag) > + unsigned int max_items, unsigned long *next_index, int tag, int contig) > { > unsigned int nr_found = 0; > unsigned int shift; > unsigned int height = root->height; > struct radix_tree_node *slot; > + unsigned long cindex = (contig && (*next_index)) ? 
*next_index : -1; > > shift = (height - 1) * RADIX_TREE_MAP_SHIFT; > slot = root->rnode; > @@ -575,6 +576,11 @@ __lookup_tag(struct radix_tree_root *roo > BUG_ON(slot->slots[i] == NULL); > break; > } > + if (contig && index >= cindex) { > + /* break in contiguity */ > + index = 0; > + goto out; > + } > index &= ~((1UL << shift) - 1); > index += 1UL << shift; > if (index == 0) > @@ -593,6 +599,10 @@ __lookup_tag(struct radix_tree_root *roo > results[nr_found++] = slot->slots[j]; > if (nr_found == max_items) > goto out; > + } else if (contig && nr_found) { > + /* break in contiguity */ > + index = 0; > + goto out; > } > } > } > @@ -618,29 +628,32 @@ out: > * returns the number of items which were placed at *@results. > */ > unsigned int > -radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results, > - unsigned long first_index, unsigned int max_items, int tag) > +__radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results, > + unsigned long first_index, unsigned int max_items, int tag, > + int contig) > { > const unsigned long max_index = radix_tree_maxindex(root->height); > unsigned long cur_index = first_index; > + unsigned long next_index = 0; /* Index of next contiguous search */ > unsigned int ret = 0; > > while (ret < max_items) { > unsigned int nr_found; > - unsigned long next_index; /* Index of next search */ > > if (cur_index > max_index) > break; > nr_found = __lookup_tag(root, results + ret, cur_index, > - max_items - ret, &next_index, tag); > + max_items - ret, &next_index, tag, contig); > ret += nr_found; > if (next_index == 0) > break; > cur_index = next_index; > + if (!nr_found) > + next_index = 0; > } > return ret; > } > -EXPORT_SYMBOL(radix_tree_gang_lookup_tag); > +EXPORT_SYMBOL(__radix_tree_gang_lookup_tag); > > /** > * radix_tree_delete - delete an item from a radix tree > diff -urp -X dontdiff linux-2.6.12-rc3/mm/filemap.c linux-2.6.12-rc3-getblocks/mm/filemap.c > --- linux-2.6.12-rc3/mm/filemap.c 2005-04-21 
05:33:16.000000000 +0530 > +++ linux-2.6.12-rc3-getblocks/mm/filemap.c 2005-04-22 16:20:30.000000000 +0530 > @@ -635,16 +635,19 @@ unsigned find_get_pages(struct address_s > /* > * Like find_get_pages, except we only return pages which are tagged with > * `tag'. We update *index to index the next page for the traversal. > + * If 'contig' is 1, then we return only pages which are contiguous in the > + * file. > */ > unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index, > - int tag, unsigned int nr_pages, struct page **pages) > + int tag, unsigned int nr_pages, struct page **pages, > + int contig) > { > unsigned int i; > unsigned int ret; > > read_lock_irq(&mapping->tree_lock); > - ret = radix_tree_gang_lookup_tag(&mapping->page_tree, > - (void **)pages, *index, nr_pages, tag); > + ret = __radix_tree_gang_lookup_tag(&mapping->page_tree, > + (void **)pages, *index, nr_pages, tag, contig); > for (i = 0; i < ret; i++) > page_cache_get(pages[i]); > if (ret) > diff -urp -X dontdiff linux-2.6.12-rc3/mm/swap.c linux-2.6.12-rc3-getblocks/mm/swap.c > --- linux-2.6.12-rc3/mm/swap.c 2005-04-21 05:33:16.000000000 +0530 > +++ linux-2.6.12-rc3-getblocks/mm/swap.c 2005-04-22 15:08:33.000000000 +0530 > @@ -384,7 +384,16 @@ unsigned pagevec_lookup_tag(struct pagev > pgoff_t *index, int tag, unsigned nr_pages) > { > pvec->nr = find_get_pages_tag(mapping, index, tag, > - nr_pages, pvec->pages); > + nr_pages, pvec->pages, 0); > + return pagevec_count(pvec); > +} > + > +unsigned int > +pagevec_contig_lookup_tag(struct pagevec *pvec, struct address_space *mapping, > + pgoff_t *index, int tag, unsigned nr_pages) > +{ > + pvec->nr = find_get_pages_tag(mapping, index, tag, > + nr_pages, pvec->pages, 1); > return pagevec_count(pvec); > } > > - > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply 
[flat|nested] 68+ messages in thread
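The `contig` logic the patch adds to `__lookup_tag()` and `radix_tree_gang_lookup_tag()` can be illustrated with a small userspace model. This is only a sketch, not the kernel code: the radix-tree walk is replaced by a sorted array of dirty-page indices, and the function name is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of the "contig" gang lookup: scan a sorted list of
 * dirty-page indices and gather up to max_items hits starting at
 * first_index.  With contig set, stop at the first break in index
 * contiguity, mirroring what the patched __lookup_tag() does when
 * it hits an untagged slot between two tagged ones.
 */
static size_t gang_lookup_tag(const unsigned long *dirty, size_t n,
                              unsigned long first_index, size_t max_items,
                              unsigned long *results, int contig)
{
        size_t found = 0;
        unsigned long expect = 0;

        for (size_t i = 0; i < n && found < max_items; i++) {
                if (dirty[i] < first_index)
                        continue;
                if (contig && found && dirty[i] != expect)
                        break;          /* break in contiguity */
                results[found++] = dirty[i];
                expect = dirty[i] + 1;
        }
        return found;
}
```

With dirty pages at indices {3,4,5,8,9}, a contiguous lookup from 3 returns only {3,4,5}, while the ordinary lookup returns all five — which is exactly why `pagevec_contig_lookup_tag()` lets `__mpage_writepages` build one bio covering a single on-disk extent.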
* Re: [Ext2-devel] [RFC] Adding multiple block allocation 2005-04-29 13:52 ` [Ext2-devel] [RFC] Adding multiple block allocation Suparna Bhattacharya ` (2 preceding siblings ...) 2005-04-30 0:33 ` Mingming Cao @ 2005-04-30 0:44 ` Mingming Cao 2005-04-30 17:03 ` Suparna Bhattacharya 3 siblings, 1 reply; 68+ messages in thread From: Mingming Cao @ 2005-04-30 0:44 UTC (permalink / raw) To: suparna Cc: Andrew Morton, Stephen C. Tweedie, linux-kernel, ext2-devel, linux-fsdevel Oops, sorry about the empty message ... > -static int ext3_writepages_get_block(struct inode *inode, sector_t iblock, > - struct buffer_head *bh, int create) > +static int ext3_writepages_get_blocks(struct inode *inode, sector_t iblock, > + unsigned long max_blocks, struct buffer_head *bh, int create) > { > - return ext3_direct_io_get_blocks(inode, iblock, 1, bh, create); > + return ext3_direct_io_get_blocks(inode, iblock, max_blocks, bh, create); > } > I have a question here: ext3_direct_io_get_blocks uses DIO_CREDITS (EXT3_RESERVE_TRANS_BLOCKS + 32 = ) to reserve the space for journalling, but it seems to be based on the assumption of one data block update at a time. Is it sufficient to re-use that routine for multiple block allocation here? Don't we need something like ext3_writepage_trans_blocks() here? Thanks, Mingming ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Ext2-devel] [RFC] Adding multiple block allocation 2005-04-30 0:44 ` Mingming Cao @ 2005-04-30 17:03 ` Suparna Bhattacharya 0 siblings, 0 replies; 68+ messages in thread From: Suparna Bhattacharya @ 2005-04-30 17:03 UTC (permalink / raw) To: Mingming Cao Cc: Andrew Morton, Stephen C. Tweedie, linux-kernel, ext2-devel, linux-fsdevel On Fri, Apr 29, 2005 at 05:44:26PM -0700, Mingming Cao wrote: > Oops, sorry about the empty message ... > > > -static int ext3_writepages_get_block(struct inode *inode, sector_t iblock, > > - struct buffer_head *bh, int create) > > +static int ext3_writepages_get_blocks(struct inode *inode, sector_t iblock, > > + unsigned long max_blocks, struct buffer_head *bh, int create) > > { > > - return ext3_direct_io_get_blocks(inode, iblock, 1, bh, create); > > + return ext3_direct_io_get_blocks(inode, iblock, max_blocks, bh, create); > > } > > > > I have a question here, ext3_direct_io_get_blocks use DIO_CREDITS > (EXT3_RESERVE_TRANS_BLOCKS + 32 = ) to reserve the space for > journalling, but it seems based on assumption of one data block update > once a time. Is it sufficent to re-use that routine for multiple block > allocation here? Don't we need something like > ext3_writepage_trans_blocks() here? Quite likely - with your patch, as get_blocks actually allocates multiple blocks at a time, the min credits estimate would change for ext3_direct_io_get_blocks/ext3_writepages_get_blocks. Regards Suparna > > Thanks, > Mingming > > > > ------------------------------------------------------- > This SF.Net email is sponsored by: NEC IT Guy Games. > Get your fingers limbered up and give it your best shot. 4 great events, 4 > opportunities to win big! Highest score wins.NEC IT Guy Games. Play to > win an NEC 61 plasma display. 
Visit http://www.necitguy.com/?r=20 > _______________________________________________ > Ext2-devel mailing list > Ext2-devel@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/ext2-devel -- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India ^ permalink raw reply [flat|nested] 68+ messages in thread
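One rough way to see why the credit estimate changes is a toy model of transaction credits. The terms and numbers below are illustrative assumptions only, not ext3's actual accounting (`ext3_writepage_trans_blocks()` is the real reference): the point is simply that the per-block term scales with the number of blocks allocated, so a credit count sized for a single-block update no longer bounds a multi-block transaction.

```c
#include <assert.h>

/*
 * Hypothetical credit estimate for journalling an allocation of
 * nblocks blocks.  Fixed overhead (indirect blocks, bitmap, group
 * descriptor, inode, superblock) is shared across the extent, but
 * the nblocks term grows with the allocation size.
 */
static int estimate_credits(int nblocks)
{
        int indirect = 3;   /* worst-case indirect-tree depth touched */
        int bitmaps  = 1;   /* block bitmap */
        int gdesc    = 1;   /* group descriptor */
        int inode    = 1;   /* inode block */
        int sb       = 1;   /* superblock */

        return nblocks + indirect + bitmaps + gdesc + inode + sb;
}
```

Even in this crude model, allocating 16 blocks in one transaction needs far fewer credits than 16 single-block transactions, but strictly more than one — which is the gap a fixed `DIO_CREDITS` reservation would not cover.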
* [PATCH 0/5] multiple block allocation to current ext3 2005-04-28 19:14 ` [RFC] Adding multiple block allocation to current ext3 Mingming Cao 2005-04-29 13:52 ` [Ext2-devel] [RFC] Adding multiple block allocation Suparna Bhattacharya @ 2006-01-10 23:26 ` Mingming Cao 2006-01-11 5:25 ` Andrew Morton 1 sibling, 1 reply; 68+ messages in thread From: Mingming Cao @ 2006-01-10 23:26 UTC (permalink / raw) To: Andrew Morton; +Cc: hch, pbadari, Stephen C. Tweedie, linux-kernel, ext2-devel Currently ext3_get_block() maps or allocates only one block at a time. This is quite inefficient for sequential IO workloads. I have previously posted an early implementation of simple multiple block mapping and allocation for current ext3. The basic idea is to allocate the 1st block in the existing way, and attempt to allocate the next adjacent blocks on a best effort basis. More description of the implementation can be found here: http://marc.theaimsgroup.com/?l=ext2-devel&m=112162230003522&w=2 The following is the latest version of the patch: I have broken the original patch into 5 patches, reworked some logic, and fixed some bugs. The breakdown is: [patch 1] Add mapping of multiple blocks at a time in ext3_get_blocks() [patch 2] Extend ext3_get_blocks() to support multiple block allocation [patch 3] Implement multiple block allocation in ext3-try-to-allocate (called via ext3_new_block()). [patch 4] Proper accounting updates in ext3_new_blocks() [patch 5] Adjust reservation window size properly (by the given number of blocks to allocate) before block allocation to increase the possibility of allocating multiple blocks in a single call. Tests done so far include fsx, tiobench and dbench. The following numbers, collected from Direct IO tests (1G file creation/read), show that system time has been greatly reduced (more than 50% on my 8 cpu system) with the patches.
1G file DIO write:
        2.6.15        2.6.15+patches
real    0m31.275s     0m31.161s
user    0m0.000s      0m0.000s
sys     0m3.384s      0m0.564s

1G file DIO read:
        2.6.15        2.6.15+patches
real    0m30.733s     0m30.624s
user    0m0.000s      0m0.004s
sys     0m0.748s      0m0.380s

Some previous tests we did on buffered IO using multiple block allocation together with delayed allocation showed a noticeable improvement in throughput and system time. Patches are against 2.6.15. Please consider adding them to the mm tree. Thanks! Mingming ^ permalink raw reply [flat|nested] 68+ messages in thread
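The best-effort idea behind patch 3 can be sketched in userspace C over a toy in-memory block bitmap (the flat bitmap and the function names are invented for illustration; the real code works on ext3 group bitmaps under the reservation and group locks):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Best-effort multiple block allocation over a toy bitmap (1 = in
 * use).  The first block is claimed the way ext3 already does it;
 * following bits are then taken only while they happen to be free,
 * so a single call may return a short extent without retrying.
 */
static size_t alloc_blocks(unsigned char *bitmap, size_t nbits,
                           size_t goal, size_t want, size_t *first)
{
        size_t block = goal, got = 0;

        while (block < nbits && bitmap[block])      /* find 1st free */
                block++;
        if (block == nbits)
                return 0;                           /* group is full */

        *first = block;
        while (block < nbits && got < want && !bitmap[block]) {
                bitmap[block++] = 1;                /* claim adjacent bit */
                got++;
        }
        return got;
}
```

Asking for 8 blocks may return only 3 if the 4th adjacent bit is taken; the caller (here, `ext3_get_blocks()` in the patch set) simply maps what it got and comes back for the rest.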
* Re: [PATCH 0/5] multiple block allocation to current ext3 2006-01-10 23:26 ` [PATCH 0/5] multiple block allocation to current ext3 Mingming Cao @ 2006-01-11 5:25 ` Andrew Morton 2006-01-11 19:17 ` Mingming Cao 0 siblings, 1 reply; 68+ messages in thread From: Andrew Morton @ 2006-01-11 5:25 UTC (permalink / raw) To: cmm; +Cc: hch, pbadari, sct, linux-kernel, ext2-devel Mingming Cao <cmm@us.ibm.com> wrote: > > Tests done so far includes fsx,tiobench and dbench. The following > numbers collected from Direct IO tests (1G file creation/read) shows > the system time have been greatly reduced (more than 50% on my 8 cpu > system) with the patches. > > 1G file DIO write: > 2.6.15 2.6.15+patches > real 0m31.275s 0m31.161s > user 0m0.000s 0m0.000s > sys 0m3.384s 0m0.564s > > > 1G file DIO read: > 2.6.15 2.6.15+patches > real 0m30.733s 0m30.624s > user 0m0.000s 0m0.004s > sys 0m0.748s 0m0.380s > > Some previous test we did on buffered IO with using multiple blocks > allocation and delayed allocation shows noticeable improvement on > throughput and system time. I'd be interested in seeing benchmark results for the common allocate-one-block case - just normal old buffered IO without any additional multiblock patches. Would they show any regression? ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 0/5] multiple block allocation to current ext3 2006-01-11 5:25 ` Andrew Morton @ 2006-01-11 19:17 ` Mingming Cao 2006-01-11 19:43 ` Andrew Morton 0 siblings, 1 reply; 68+ messages in thread From: Mingming Cao @ 2006-01-11 19:17 UTC (permalink / raw) To: Andrew Morton; +Cc: hch, pbadari, sct, linux-kernel, ext2-devel On Tue, 2006-01-10 at 21:25 -0800, Andrew Morton wrote: > Mingming Cao <cmm@us.ibm.com> wrote: > > > > Tests done so far includes fsx,tiobench and dbench. The following > > numbers collected from Direct IO tests (1G file creation/read) shows > > the system time have been greatly reduced (more than 50% on my 8 cpu > > system) with the patches. > > > > 1G file DIO write: > > 2.6.15 2.6.15+patches > > real 0m31.275s 0m31.161s > > user 0m0.000s 0m0.000s > > sys 0m3.384s 0m0.564s > > > > > > 1G file DIO read: > > 2.6.15 2.6.15+patches > > real 0m30.733s 0m30.624s > > user 0m0.000s 0m0.004s > > sys 0m0.748s 0m0.380s > > > > Some previous test we did on buffered IO with using multiple blocks > > allocation and delayed allocation shows noticeable improvement on > > throughput and system time. > > I'd be interested in seeing benchmark results for the common > allocate-one-block case - just normal old buffered IO without any > additional multiblock patches. Would they show any regression? > Hi Andrew, One thing I want to clarify: for buffered IO, even with the multiple block patches, ext3 currently still allocates one block at a time (we will need delayed allocation to make use of the multiple block allocation). I did the same test on buffered IO, w/o the patches.
Very little regression (less than 1%, which could be considered noise) comparing the 2.6.15 kernel with and without the patches:

buffered IO write: (no sync)
# time ./filetst -b 1048576 -w -f /mnt/a
        2.6.15        2.6.15+patches
real    0m25.773s     0m26.102s
user    0m0.004s      0m0.000s
sys     0m15.065s     0m16.053s

buffered IO read (after umount/remount)
# time ./filetst -b 1048576 -r -f /mnt/a
        2.6.15        2.6.15+patches
real    0m29.257s     0m29.257s
user    0m0.000s      0m0.000s
sys     0m6.996s      0m6.980s

But I do notice a regression between the vanilla 2.6.14 kernel and the vanilla 2.6.15 kernel on buffered IO (about 18%).

# time ./filetst -b 1048576 -w -f /mnt/a
        2.6.14        2.6.15
real    0m21.710s     0m25.773s
user    0m0.012s      0m0.004s
sys     0m14.569s     0m15.065s

I also found that tiobench (sequential write test) and dbench show a similar regression between 2.6.14 and 2.6.15. Actually I found that 2.6.15-rc2 already has the regression. Is this a known issue? Anyway I will continue looking at the issue... Thanks, Mingming ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 0/5] multiple block allocation to current ext3 2006-01-11 19:17 ` Mingming Cao @ 2006-01-11 19:43 ` Andrew Morton 2006-01-11 21:31 ` Mingming Cao 2006-01-14 1:12 ` Fall back io scheduler for 2.6.15? Mingming Cao 0 siblings, 2 replies; 68+ messages in thread From: Andrew Morton @ 2006-01-11 19:43 UTC (permalink / raw) To: cmm; +Cc: hch, pbadari, sct, linux-kernel, ext2-devel Mingming Cao <cmm@us.ibm.com> wrote: > > # time ./filetst -b 1048576 -w -f /mnt/a > 2.6.14 2.6.15 > real 0m21.710s 0m25.773s > user 0m0.012s 0m0.004s > sys 0m14.569s 0m15.065s That's a big drop. Was it doing I/O, or was it all from pagecache? > I also found tiobench(sequential write test) and dbench has similar > regression between 2.6.14 and 2.6.15. Actually I found 2.6.15 rc2 > already has the regression. Is this a known issue? No, it is not known. > Anyway I will continue looking at the issue... Thanks. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 0/5] multiple block allocation to current ext3 2006-01-11 19:43 ` Andrew Morton @ 2006-01-11 21:31 ` Mingming Cao 2006-01-14 1:12 ` Fall back io scheduler for 2.6.15? Mingming Cao 1 sibling, 0 replies; 68+ messages in thread From: Mingming Cao @ 2006-01-11 21:31 UTC (permalink / raw) To: Andrew Morton; +Cc: hch, pbadari, sct, linux-kernel, ext2-devel On Wed, 2006-01-11 at 11:43 -0800, Andrew Morton wrote: > Mingming Cao <cmm@us.ibm.com> wrote: > > > > # time ./filetst -b 1048576 -w -f /mnt/a > > 2.6.14 2.6.15 > > real 0m21.710s 0m25.773s > > user 0m0.012s 0m0.004s > > sys 0m14.569s 0m15.065s > > That's a big drop. > > Was it doing I/O, or was it all from pagecache? > It's doing IO on a freshly created filesystem. It was running with AS io scheduler. It seems DIO has no problem, and ext2 has no regression either... Will keep you posted. Mingming ^ permalink raw reply [flat|nested] 68+ messages in thread
* Fall back io scheduler for 2.6.15? 2006-01-11 19:43 ` Andrew Morton 2006-01-11 21:31 ` Mingming Cao @ 2006-01-14 1:12 ` Mingming Cao 2006-01-14 1:49 ` Andrew Morton 1 sibling, 1 reply; 68+ messages in thread From: Mingming Cao @ 2006-01-14 1:12 UTC (permalink / raw) To: Andrew Morton; +Cc: Seetharami Seelam, linux-kernel, ext2-devel On Wed, 2006-01-11 at 11:43 -0800, Andrew Morton wrote: > Mingming Cao <cmm@us.ibm.com> wrote: > > > > # time ./filetst -b 1048576 -w -f /mnt/a > > 2.6.14 2.6.15 > > real 0m21.710s 0m25.773s > > user 0m0.012s 0m0.004s > > sys 0m14.569s 0m15.065s > > That's a big drop. > > Was it doing I/O, or was it all from pagecache? > > > I also found tiobench(sequential write test) and dbench has similar > > regression between 2.6.14 and 2.6.15. Actually I found 2.6.15 rc2 > > already has the regression. Is this a known issue? > > No, it is not known. > > > Anyway I will continue looking at the issue... > > Thanks. Hi, Andrew, I did some tracing, and it turns out there isn't a regression between 2.6.14 and 2.6.15, and there is no problem in the ext3 filesystem. I was comparing apples to oranges: the tests were run with two different io schedulers. That explains the bogus throughput difference that I reported to you earlier this week. I gave the same boot option "elevator=as" to both 2.6.14 and 2.6.15-rc2 (this has been working for me for a long time to turn the anticipatory scheduler on), but it turns out the io schedulers actually in effect on the two kernels are different (see elevator_setup_default()). On 2.6.14, the fall back io scheduler (if the chosen io scheduler is not found) is set to the default io scheduler (anticipatory, in this case), but since 2.6.15-rc1, this behavior changed to fall back to noop. Is there any reason to fall back to noop instead of as? It seems anticipatory is much better than noop for ext3 with large sequential write tests (i.e., a 1G dd test) ... Thanks, Mingming ^ permalink raw reply [flat|nested] 68+ messages in thread
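The fallback semantics being discussed can be modeled in a few lines of userspace C. This is a sketch, not the kernel's elevator code: `registered[]` stands in for the list of compiled-in io schedulers, `pick()` for what elevator_setup_default() ends up choosing, and the scheduler names are spelled out in full for clarity.

```c
#include <assert.h>
#include <string.h>

/* Build-time default, as selected by kconfig (illustrative). */
#define CONFIG_DEFAULT_IOSCHED "anticipatory"

/*
 * Return the scheduler the boot would actually use for a given
 * elevator= choice.  Pre-2.6.15 behavior: a miss falls back to the
 * build-time default.  2.6.15-rc1 behavior: a miss falls back to
 * "noop" regardless of the configured default.
 */
static const char *pick(const char *chosen, const char *const *registered,
                        int n, int fall_back_to_noop)
{
        for (int i = 0; i < n; i++)
                if (strcmp(registered[i], chosen) == 0)
                        return chosen;
        return fall_back_to_noop ? "noop" : CONFIG_DEFAULT_IOSCHED;
}
```

In this model, booting with an unrecognized elevator= string on the 2.6.15 rule silently lands on noop even when anticipatory is available — the mismatch described above.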
* Re: Fall back io scheduler for 2.6.15? 2006-01-14 1:12 ` Fall back io scheduler for 2.6.15? Mingming Cao @ 2006-01-14 1:49 ` Andrew Morton 2006-01-14 5:22 ` Dave Jones ` (2 more replies) 0 siblings, 3 replies; 68+ messages in thread From: Andrew Morton @ 2006-01-14 1:49 UTC (permalink / raw) To: cmm; +Cc: seelam, linux-kernel, ext2-devel, Jens Axboe Mingming Cao <cmm@us.ibm.com> wrote: > > On 2.6.14, the > fall back io scheduler (if the chosen io scheduler is not found) is set > to the default io scheduler (anticipatory, in this case), but since > 2.6.15-rc1, this semanistic is changed to fall back to noop. OK. And I assume that AS wasn't compiled, so that's why it fell back? I actually thought that elevator= got removed, now we have /sys/block/sda/queue/scheduler. But I guess that's not very useful with CONFIG_SYSFS=n. > Is there any reason to fall back to noop instead of as? It seems > anticipatory is much better than noop for ext3 with large sequential > write tests (i.e, 1G dd test) ... I suspect that was an accident. Jens? ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Fall back io scheduler for 2.6.15? 2006-01-14 1:49 ` Andrew Morton @ 2006-01-14 5:22 ` Dave Jones 2006-01-16 8:43 ` Jens Axboe 2006-01-16 19:38 ` [Ext2-devel] " Mingming Cao 2 siblings, 0 replies; 68+ messages in thread From: Dave Jones @ 2006-01-14 5:22 UTC (permalink / raw) To: Andrew Morton; +Cc: cmm, seelam, linux-kernel, ext2-devel, Jens Axboe On Fri, Jan 13, 2006 at 05:49:14PM -0800, Andrew Morton wrote: > Mingming Cao <cmm@us.ibm.com> wrote: > > > > On 2.6.14, the > > fall back io scheduler (if the chosen io scheduler is not found) is set > > to the default io scheduler (anticipatory, in this case), but since > > 2.6.15-rc1, this semanistic is changed to fall back to noop. > > OK. And I assume that AS wasn't compiled, so that's why it fell back? > > I actually thought that elevator= got removed, now we have > /sys/block/sda/queue/scheduler. But I guess that's not very useful with > CONFIG_SYSFS=n. It's also a lifesaver if the default scheduler happens to trigger some breakage preventing boot, and you can tell users to workaround it with a bootparam until the real problem is fixed (which has bitten us twice now, as Fedora like several other distros chooses CFQ by default). Dave ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Fall back io scheduler for 2.6.15? 2006-01-14 1:49 ` Andrew Morton 2006-01-14 5:22 ` Dave Jones @ 2006-01-16 8:43 ` Jens Axboe 2006-01-16 19:45 ` [PATCH] Fall back io scheduler ( Re: [Ext2-devel] Re: Fall back io scheduler for 2.6.15?) Mingming Cao 2006-01-19 19:37 ` Fall back io scheduler for 2.6.15? Nate Diller 2006-01-16 19:38 ` [Ext2-devel] " Mingming Cao 2 siblings, 2 replies; 68+ messages in thread From: Jens Axboe @ 2006-01-16 8:43 UTC (permalink / raw) To: Andrew Morton; +Cc: cmm, seelam, linux-kernel, ext2-devel On Fri, Jan 13 2006, Andrew Morton wrote: > Mingming Cao <cmm@us.ibm.com> wrote: > > > > On 2.6.14, the > > fall back io scheduler (if the chosen io scheduler is not found) is set > > to the default io scheduler (anticipatory, in this case), but since > > 2.6.15-rc1, this semanistic is changed to fall back to noop. > > OK. And I assume that AS wasn't compiled, so that's why it fell back? > > I actually thought that elevator= got removed, now we have > /sys/block/sda/queue/scheduler. But I guess that's not very useful with > CONFIG_SYSFS=n. > > > Is there any reason to fall back to noop instead of as? It seems > > anticipatory is much better than noop for ext3 with large sequential > > write tests (i.e, 1G dd test) ... > > I suspect that was an accident. Jens? It is, it makes more sense to fallback to the default of course. -- Jens Axboe ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH] Fall back io scheduler ( Re: [Ext2-devel] Re: Fall back io scheduler for 2.6.15?) 2006-01-16 8:43 ` Jens Axboe @ 2006-01-16 19:45 ` Mingming Cao 2006-01-16 19:49 ` Jens Axboe 1 sibling, 1 reply; 68+ messages in thread From: Mingming Cao @ 2006-01-16 19:45 UTC (permalink / raw) To: Jens Axboe; +Cc: Andrew Morton, seelam, linux-kernel, ext2-devel On Mon, 2006-01-16 at 09:43 +0100, Jens Axboe wrote: > On Fri, Jan 13 2006, Andrew Morton wrote: > > Mingming Cao <cmm@us.ibm.com> wrote: > > > > > > On 2.6.14, the > > > fall back io scheduler (if the chosen io scheduler is not found) is set > > > to the default io scheduler (anticipatory, in this case), but since > > > 2.6.15-rc1, this semanistic is changed to fall back to noop. > > > > OK. And I assume that AS wasn't compiled, so that's why it fell back? > > > > I actually thought that elevator= got removed, now we have > > /sys/block/sda/queue/scheduler. But I guess that's not very useful with > > CONFIG_SYSFS=n. > > > > > Is there any reason to fall back to noop instead of as? It seems > > > anticipatory is much better than noop for ext3 with large sequential > > > write tests (i.e, 1G dd test) ... > > > > I suspect that was an accident. Jens? > > It is, it makes more sense to fallback to the default of course. > How about this one? In the case where the chosen io scheduler (elevator = "xxx") is not on the registered list, falling back to the default scheduler makes more sense.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>

---

 linux-2.6.15-ming/block/elevator.c | 4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff -puN block/elevator.c~default-fall-back-scheduler-fix block/elevator.c
--- linux-2.6.15/block/elevator.c~default-fall-back-scheduler-fix	2006-01-16 11:42:04.000446588 -0800
+++ linux-2.6.15-ming/block/elevator.c	2006-01-16 11:42:41.145131615 -0800
@@ -151,12 +151,12 @@ static void elevator_setup_default(void)
 		strcpy(chosen_elevator, CONFIG_DEFAULT_IOSCHED);
 
 	/*
-	 * If the given scheduler is not available, fall back to no-op.
+	 * If the given scheduler is not available, fall back to default.
 	 */
 	if ((e = elevator_find(chosen_elevator)))
 		elevator_put(e);
 	else
-		strcpy(chosen_elevator, "noop");
+		strcpy(chosen_elevator, CONFIG_DEFAULT_IOSCHED);
 }
 
 static int __init elevator_setup(char *str)
_

^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] Fall back io scheduler ( Re: [Ext2-devel] Re: Fall back io scheduler for 2.6.15?) 2006-01-16 19:45 ` [PATCH] Fall back io scheduler ( Re: [Ext2-devel] Re: Fall back io scheduler for 2.6.15?) Mingming Cao @ 2006-01-16 19:49 ` Jens Axboe 2006-01-16 19:57 ` Mingming Cao 0 siblings, 1 reply; 68+ messages in thread From: Jens Axboe @ 2006-01-16 19:49 UTC (permalink / raw) To: Mingming Cao; +Cc: Andrew Morton, seelam, linux-kernel, ext2-devel On Mon, Jan 16 2006, Mingming Cao wrote: > On Mon, 2006-01-16 at 09:43 +0100, Jens Axboe wrote: > > On Fri, Jan 13 2006, Andrew Morton wrote: > > > Mingming Cao <cmm@us.ibm.com> wrote: > > > > > > > > On 2.6.14, the > > > > fall back io scheduler (if the chosen io scheduler is not found) is set > > > > to the default io scheduler (anticipatory, in this case), but since > > > > 2.6.15-rc1, this semanistic is changed to fall back to noop. > > > > > > OK. And I assume that AS wasn't compiled, so that's why it fell back? > > > > > > I actually thought that elevator= got removed, now we have > > > /sys/block/sda/queue/scheduler. But I guess that's not very useful with > > > CONFIG_SYSFS=n. > > > > > > > Is there any reason to fall back to noop instead of as? It seems > > > > anticipatory is much better than noop for ext3 with large sequential > > > > write tests (i.e, 1G dd test) ... > > > > > > I suspect that was an accident. Jens? > > > > It is, it makes more sense to fallback to the default of course. > > > > How about this one? > > > In the case the chosen io scheduler (elevator = "xxx") is not on the > registered list, fall back to default scheduler makes more sense. > > Signed-off-by: Mingming Cao <cmm@us.ibm.com> I think you just recreated an identical patch to the one I added this morning: http://brick.kernel.dk/git/?p=linux-2.6-block.git;a=commitdiff;h=b7bfcf7cbd58d2a64aa46f3b4bec921e346e604f :-) -- Jens Axboe ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] Fall back io scheduler ( Re: [Ext2-devel] Re: Fall back io scheduler for 2.6.15?) 2006-01-16 19:49 ` Jens Axboe @ 2006-01-16 19:57 ` Mingming Cao 0 siblings, 0 replies; 68+ messages in thread From: Mingming Cao @ 2006-01-16 19:57 UTC (permalink / raw) To: Jens Axboe; +Cc: Andrew Morton, seelam, linux-kernel, ext2-devel On Mon, 2006-01-16 at 20:49 +0100, Jens Axboe wrote: > On Mon, Jan 16 2006, Mingming Cao wrote: > > On Mon, 2006-01-16 at 09:43 +0100, Jens Axboe wrote: > > > On Fri, Jan 13 2006, Andrew Morton wrote: > > > > Mingming Cao <cmm@us.ibm.com> wrote: > > > > > > > > > > On 2.6.14, the > > > > > fall back io scheduler (if the chosen io scheduler is not found) is set > > > > > to the default io scheduler (anticipatory, in this case), but since > > > > > 2.6.15-rc1, this semanistic is changed to fall back to noop. > > > > > > > > OK. And I assume that AS wasn't compiled, so that's why it fell back? > > > > > > > > I actually thought that elevator= got removed, now we have > > > > /sys/block/sda/queue/scheduler. But I guess that's not very useful with > > > > CONFIG_SYSFS=n. > > > > > > > > > Is there any reason to fall back to noop instead of as? It seems > > > > > anticipatory is much better than noop for ext3 with large sequential > > > > > write tests (i.e, 1G dd test) ... > > > > > > > > I suspect that was an accident. Jens? > > > > > > It is, it makes more sense to fallback to the default of course. > > > > > > > How about this one? > > > > > > In the case the chosen io scheduler (elevator = "xxx") is not on the > > registered list, fall back to default scheduler makes more sense. > > > > Signed-off-by: Mingming Cao <cmm@us.ibm.com> > > I think you just recreated an identical patch to the one I added this > morning: > > http://brick.kernel.dk/git/?p=linux-2.6-block.git;a=commitdiff;h=b7bfcf7cbd58d2a64aa46f3b4bec921e346e604f > > :-) > Hehe, please ignore mine:-) ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Fall back io scheduler for 2.6.15? 2006-01-16 8:43 ` Jens Axboe 2006-01-16 19:45 ` [PATCH] Fall back io scheduler ( Re: [Ext2-devel] Re: Fall back io scheduler for 2.6.15?) Mingming Cao @ 2006-01-19 19:37 ` Nate Diller 2006-01-20 8:10 ` Jens Axboe 1 sibling, 1 reply; 68+ messages in thread From: Nate Diller @ 2006-01-19 19:37 UTC (permalink / raw) To: Jens Axboe; +Cc: Andrew Morton, cmm, seelam, linux-kernel, ext2-devel On 1/16/06, Jens Axboe <axboe@suse.de> wrote: > On Fri, Jan 13 2006, Andrew Morton wrote: > > Mingming Cao <cmm@us.ibm.com> wrote: > > > > > > On 2.6.14, the > > > fall back io scheduler (if the chosen io scheduler is not found) is set > > > to the default io scheduler (anticipatory, in this case), but since > > > 2.6.15-rc1, this semanistic is changed to fall back to noop. > > > > OK. And I assume that AS wasn't compiled, so that's why it fell back? > > > > I actually thought that elevator= got removed, now we have > > /sys/block/sda/queue/scheduler. But I guess that's not very useful with > > CONFIG_SYSFS=n. > > > > > Is there any reason to fall back to noop instead of as? It seems > > > anticipatory is much better than noop for ext3 with large sequential > > > write tests (i.e, 1G dd test) ... > > > > I suspect that was an accident. Jens? > > It is, it makes more sense to fallback to the default of course. Not an accident at all, actually, because the original patch i submitted allowed you to select a scheduler as 'default' even if it were compiled as a module in kconfig. Since noop is guaranteed to be present in any system, it is the obvious choice if the chosen or default scheduler is not loaded. If you change it to fall back to the default, it will oops if the default is not available. NATE ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: Fall back io scheduler for 2.6.15? 2006-01-19 19:37 ` Fall back io scheduler for 2.6.15? Nate Diller @ 2006-01-20 8:10 ` Jens Axboe 0 siblings, 0 replies; 68+ messages in thread From: Jens Axboe @ 2006-01-20 8:10 UTC (permalink / raw) To: Nate Diller; +Cc: Andrew Morton, cmm, seelam, linux-kernel, ext2-devel On Thu, Jan 19 2006, Nate Diller wrote: > On 1/16/06, Jens Axboe <axboe@suse.de> wrote: > > On Fri, Jan 13 2006, Andrew Morton wrote: > > > Mingming Cao <cmm@us.ibm.com> wrote: > > > > > > > > On 2.6.14, the > > > > fall back io scheduler (if the chosen io scheduler is not found) is set > > > > to the default io scheduler (anticipatory, in this case), but since > > > > 2.6.15-rc1, this semanistic is changed to fall back to noop. > > > > > > OK. And I assume that AS wasn't compiled, so that's why it fell back? > > > > > > I actually thought that elevator= got removed, now we have > > > /sys/block/sda/queue/scheduler. But I guess that's not very useful with > > > CONFIG_SYSFS=n. > > > > > > > Is there any reason to fall back to noop instead of as? It seems > > > > anticipatory is much better than noop for ext3 with large sequential > > > > write tests (i.e, 1G dd test) ... > > > > > > I suspect that was an accident. Jens? > > > > It is, it makes more sense to fallback to the default of course. > > Not an accident at all, actually, because the original patch i > submitted allowed you to select a scheduler as 'default' even if it > were compiled as a module in kconfig. Since noop is guaranteed to be > present in any system, it is the obvious choice if the chosen or > default scheduler is not loaded. Yes and that was a bug in that patch. The default scheduler must be builtin, that's a given. The Kconfig rules should make a default selection as a module illegal. And they do, they have since been fixed. > If you change it to fall back to the default, it will oops if the > default is not available. It must be. 
-- Jens Axboe ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Ext2-devel] Re: Fall back io scheduler for 2.6.15? 2006-01-14 1:49 ` Andrew Morton 2006-01-14 5:22 ` Dave Jones 2006-01-16 8:43 ` Jens Axboe @ 2006-01-16 19:38 ` Mingming Cao 2 siblings, 0 replies; 68+ messages in thread From: Mingming Cao @ 2006-01-16 19:38 UTC (permalink / raw) To: Andrew Morton; +Cc: seelam, linux-kernel, ext2-devel, Jens Axboe On Fri, 2006-01-13 at 17:49 -0800, Andrew Morton wrote: > Mingming Cao <cmm@us.ibm.com> wrote: > > > > On 2.6.14, the > > fall back io scheduler (if the chosen io scheduler is not found) is set > > to the default io scheduler (anticipatory, in this case), but since > > 2.6.15-rc1, this semanistic is changed to fall back to noop. > > OK. And I assume that AS wasn't compiled, so that's why it fell back? > AS was compiled, and AS is set as the default scheduler. But since 2.6.15 doesn't recognize "elevator=as" (we need to say "elevator=anticipatory"), elevator_setup_default() will explicitly fall back to the "noop" scheduler. 2.6.14 doesn't recognize "elevator=as" either, but it falls back to the default scheduler instead, which makes more sense, as Jens pointed out. > I actually thought that elevator= got removed, now we have > /sys/block/sda/queue/scheduler. But I guess that's not very useful with > CONFIG_SYSFS=n. > > > Is there any reason to fall back to noop instead of as? It seems > > anticipatory is much better than noop for ext3 with large sequential > > write tests (i.e, 1G dd test) ... > > I suspect that was an accident. Jens?
^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH] Reduce ext3 allocate-with-reservation lock latencies 2005-04-22 22:10 ` [RFC][PATCH] Reduce ext3 allocate-with-reservation lock latencies Mingming Cao 2005-04-28 3:45 ` Lee Revell @ 2005-04-29 6:28 ` Mingming Cao 1 sibling, 0 replies; 68+ messages in thread From: Mingming Cao @ 2005-04-29 6:28 UTC (permalink / raw) To: Andrew Morton Cc: Stephen C. Tweedie, Ingo Molnar, Lee Revell, linux-kernel, ext2-devel Currently in the ext3 block reservation code, the global filesystem reservation tree lock (rsv_lock) is held during the whole process of searching for a space to make a new reservation window, including while scanning the block bitmap to verify that the available window has a free block. Holding the lock during the bitmap scan is unnecessary and could cause scalability and latency issues. This patch tries to address this by dropping the lock before scanning the bitmap. Before doing so we need to reserve the open window, in case someone else is targeting the same window. The question was whether we should reserve the whole free reservable space or just the window size we need. Reserving the whole free reservable space could force other threads that intended to allocate blocks nearby to move to another block group (causing bad layout). In this patch, we just reserve the desired size before dropping the lock and scanning the block bitmap. This patch fixes an ext3 reservation latency issue seen on a CVS checkout test. The patch was tested with many fsx, tiobench, dbench and kernel-untar tests. 
Signed-Off-By: Mingming Cao <cmm@us.ibm.com> --- linux-2.6.11-ming/fs/ext3/balloc.c | 135 +++++++++++++++++-------------------- linux-2.6.11-ming/fs/ext3/file.c | 4 + 2 files changed, 68 insertions(+), 71 deletions(-) diff -puN fs/ext3/balloc.c~ext3-reduce-reservation-lock-latency fs/ext3/balloc.c --- linux-2.6.11/fs/ext3/balloc.c~ext3-reduce-reservation-lock-latency 2005-04-22 10:49:04.000000000 -0700 +++ linux-2.6.11-ming/fs/ext3/balloc.c 2005-04-27 22:34:19.491331405 -0700 @@ -749,24 +749,24 @@ fail_access: * to find a free region that is of my size and has not * been reserved. * - * on succeed, it returns the reservation window to be appended to. - * failed, return NULL. */ -static struct ext3_reserve_window_node *find_next_reservable_window( +static int find_next_reservable_window( struct ext3_reserve_window_node *search_head, - unsigned long size, int *start_block, + struct ext3_reserve_window_node *my_rsv, + struct super_block * sb, int start_block, int last_block) { struct rb_node *next; struct ext3_reserve_window_node *rsv, *prev; int cur; + int size = my_rsv->rsv_goal_size; /* TODO: make the start of the reservation window byte-aligned */ /* cur = *start_block & ~7;*/ - cur = *start_block; + cur = start_block; rsv = search_head; if (!rsv) - return NULL; + return -1; while (1) { if (cur <= rsv->rsv_end) @@ -782,7 +782,7 @@ static struct ext3_reserve_window_node * * space with expected-size (or more)... */ if (cur > last_block) - return NULL; /* fail */ + return -1; /* fail */ prev = rsv; next = rb_next(&rsv->rsv_node); @@ -813,8 +813,26 @@ static struct ext3_reserve_window_node * * return the reservation window that we could append to. * succeed. 
*/ - *start_block = cur; - return prev; + + if ((prev != my_rsv) && (!rsv_is_empty(&my_rsv->rsv_window))) + rsv_window_remove(sb, my_rsv); + + /* let's book the whole avaliable window for now + * we will check the + * disk bitmap later and then, if there are free block + * then we adjust the window size if the it's + * larger than requested. + * Otherwise, we will remove this node from the tree next time + * call find_next_reservable_window. + */ + my_rsv->rsv_start = cur; + my_rsv->rsv_end = cur + size - 1; + my_rsv->rsv_alloc_hit = 0; + + if (prev != my_rsv) + ext3_rsv_window_add(sb, my_rsv); + + return 0; } /** @@ -852,6 +870,7 @@ static struct ext3_reserve_window_node * * @sb: the super block * @group: the group we are trying to allocate in * @bitmap_bh: the block group block bitmap + * */ static int alloc_new_reservation(struct ext3_reserve_window_node *my_rsv, int goal, struct super_block *sb, @@ -860,10 +879,10 @@ static int alloc_new_reservation(struct struct ext3_reserve_window_node *search_head; int group_first_block, group_end_block, start_block; int first_free_block; - int reservable_space_start; - struct ext3_reserve_window_node *prev_rsv; struct rb_root *fs_rsv_root = &EXT3_SB(sb)->s_rsv_window_root; unsigned long size; + int ret; + spinlock_t *rsv_lock = &EXT3_SB(sb)->s_rsv_window_lock; group_first_block = le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block) + group * EXT3_BLOCKS_PER_GROUP(sb); @@ -875,6 +894,7 @@ static int alloc_new_reservation(struct start_block = goal + group_first_block; size = my_rsv->rsv_goal_size; + if (!rsv_is_empty(&my_rsv->rsv_window)) { /* * if the old reservation is cross group boundary @@ -908,6 +928,8 @@ static int alloc_new_reservation(struct my_rsv->rsv_goal_size= size; } } + + spin_lock(rsv_lock); /* * shift the search start to the window near the goal block */ @@ -921,11 +943,16 @@ static int alloc_new_reservation(struct * need to check the bitmap after we found a reservable window. 
*/ retry: - prev_rsv = find_next_reservable_window(search_head, size, - &start_block, group_end_block); - if (prev_rsv == NULL) - goto failed; - reservable_space_start = start_block; + ret = find_next_reservable_window(search_head, my_rsv, sb, + start_block, group_end_block); + + if (ret == -1) { + if (!rsv_is_empty(&my_rsv->rsv_window)) + rsv_window_remove(sb, my_rsv); + spin_unlock(rsv_lock); + return -1; + } + /* * On success, find_next_reservable_window() returns the * reservation window where there is a reservable space after it. @@ -937,8 +964,9 @@ retry: * block. Search start from the start block of the reservable space * we just found. */ + spin_unlock(rsv_lock); first_free_block = bitmap_search_next_usable_block( - reservable_space_start - group_first_block, + my_rsv->rsv_start - group_first_block, bitmap_bh, group_end_block - group_first_block + 1); if (first_free_block < 0) { @@ -946,54 +974,30 @@ retry: * no free block left on the bitmap, no point * to reserve the space. return failed. 
*/ - goto failed; + spin_lock(rsv_lock); + if (!rsv_is_empty(&my_rsv->rsv_window)) + rsv_window_remove(sb, my_rsv); + spin_unlock(rsv_lock); + return -1; /* failed */ } + start_block = first_free_block + group_first_block; /* * check if the first free block is within the - * free space we just found + * free space we just reserved */ - if ((start_block >= reservable_space_start) && - (start_block < reservable_space_start + size)) - goto found_rsv_window; + if ((start_block >= my_rsv->rsv_start) && + (start_block < my_rsv->rsv_end)) + return 0; /* succeed */ /* * if the first free bit we found is out of the reservable space - * this means there is no free block on the reservable space - * we should continue search for next reservable space, + * continue search for next reservable space, * start from where the free block is, * we also shift the list head to where we stopped last time */ - search_head = prev_rsv; + search_head = my_rsv; + spin_lock(rsv_lock); goto retry; - -found_rsv_window: - /* - * great! the reservable space contains some free blocks. - * if the search returns that we should add the new - * window just next to where the old window, we don't - * need to remove the old window first then add it to the - * same place, just update the new start and new end. 
- */ - if (my_rsv != prev_rsv) { - if (!rsv_is_empty(&my_rsv->rsv_window)) - rsv_window_remove(sb, my_rsv); - } - my_rsv->rsv_start = reservable_space_start; - my_rsv->rsv_end = my_rsv->rsv_start + size - 1; - my_rsv->rsv_alloc_hit = 0; - if (my_rsv != prev_rsv) { - ext3_rsv_window_add(sb, my_rsv); - } - return 0; /* succeed */ -failed: - /* - * failed to find a new reservation window in the current - * group, remove the current(stale) reservation window - * if there is any - */ - if (!rsv_is_empty(&my_rsv->rsv_window)) - rsv_window_remove(sb, my_rsv); - return -1; /* failed */ } /* @@ -1023,7 +1027,6 @@ ext3_try_to_allocate_with_rsv(struct sup int goal, struct ext3_reserve_window_node * my_rsv, int *errp) { - spinlock_t *rsv_lock; unsigned long group_first_block; int ret = 0; int fatal; @@ -1052,7 +1055,6 @@ ext3_try_to_allocate_with_rsv(struct sup ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, NULL); goto out; } - rsv_lock = &EXT3_SB(sb)->s_rsv_window_lock; /* * goal is a group relative block number (if there is a goal) * 0 < goal < EXT3_BLOCKS_PER_GROUP(sb) @@ -1078,30 +1080,21 @@ ext3_try_to_allocate_with_rsv(struct sup * then we could go to allocate from the reservation window directly. 
*/ while (1) { - struct ext3_reserve_window rsv_copy; - - rsv_copy._rsv_start = my_rsv->rsv_start; - rsv_copy._rsv_end = my_rsv->rsv_end; - - if (rsv_is_empty(&rsv_copy) || (ret < 0) || - !goal_in_my_reservation(&rsv_copy, goal, group, sb)) { - spin_lock(rsv_lock); + if (rsv_is_empty(&my_rsv->rsv_window) || (ret < 0) || + !goal_in_my_reservation(&my_rsv->rsv_window, goal, group, sb)) { ret = alloc_new_reservation(my_rsv, goal, sb, group, bitmap_bh); - rsv_copy._rsv_start = my_rsv->rsv_start; - rsv_copy._rsv_end = my_rsv->rsv_end; - spin_unlock(rsv_lock); if (ret < 0) break; /* failed */ - if (!goal_in_my_reservation(&rsv_copy, goal, group, sb)) + if (!goal_in_my_reservation(&my_rsv->rsv_window, goal, group, sb)) goal = -1; } - if ((rsv_copy._rsv_start >= group_first_block + EXT3_BLOCKS_PER_GROUP(sb)) - || (rsv_copy._rsv_end < group_first_block)) + if ((my_rsv->rsv_start >= group_first_block + EXT3_BLOCKS_PER_GROUP(sb)) + || (my_rsv->rsv_end < group_first_block)) BUG(); ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, - &rsv_copy); + &my_rsv->rsv_window); if (ret >= 0) { my_rsv->rsv_alloc_hit++; break; /* succeed */ diff -puN fs/ext3/file.c~ext3-reduce-reservation-lock-latency fs/ext3/file.c --- linux-2.6.11/fs/ext3/file.c~ext3-reduce-reservation-lock-latency 2005-04-22 10:49:04.000000000 -0700 +++ linux-2.6.11-ming/fs/ext3/file.c 2005-04-26 11:09:56.000000000 -0700 @@ -36,7 +36,11 @@ static int ext3_release_file (struct ino /* if we are the last writer on the inode, drop the block reservation */ if ((filp->f_mode & FMODE_WRITE) && (atomic_read(&inode->i_writecount) == 1)) + { + down(&EXT3_I(inode)->truncate_sem); ext3_discard_reservation(inode); + up(&EXT3_I(inode)->truncate_sem); + } if (is_dx(inode) && filp->private_data) ext3_htree_free_dir_info(filp->private_data); _ ^ permalink raw reply [flat|nested] 68+ messages in thread
end of thread, other threads:[~2006-01-20 8:09 UTC | newest]

Thread overview: 68+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-04-05 3:51 ext3 allocate-with-reservation latencies Lee Revell
2005-04-05 4:13 ` Ingo Molnar
2005-04-05 6:10 ` Mingming Cao
2005-04-05 16:38 ` Lee Revell
2005-04-06 5:35 ` Mingming Cao
2005-04-06 9:51 ` Stephen C. Tweedie
2005-04-06 16:53 ` Mingming Cao
2005-04-06 18:22 ` Stephen C. Tweedie
2005-04-06 19:03 ` Mingming Cao
2005-04-07 8:14 ` Ingo Molnar
2005-04-07 13:08 ` Stephen C. Tweedie
2005-04-07 19:16 ` Mingming Cao
2005-04-07 23:37 ` Mingming Cao
2005-04-08 14:40 ` Stephen C. Tweedie
2005-04-08 16:06 ` Arjan van de Ven
2005-04-08 18:10 ` Mingming Cao
2005-04-08 18:12 ` Lee Revell
2005-04-11 11:48 ` Stephen C. Tweedie
2005-04-11 18:38 ` Mingming Cao
2005-04-11 19:12 ` Lee Revell
2005-04-11 19:22 ` Lee Revell
2005-04-11 19:57 ` Stephen C. Tweedie
2005-04-12 6:41 ` Mingming Cao
2005-04-12 11:18 ` Stephen C. Tweedie
2005-04-12 23:27 ` Mingming Cao
2005-04-13 10:29 ` Stephen C. Tweedie
[not found] ` <1113597161.3899.80.camel@localhost.localdomain>
2005-04-18 18:00 ` Stephen C. Tweedie
2005-04-18 21:56 ` [Ext2-devel] " Mingming Cao
2005-04-22 22:10 ` [RFC][PATCH] Reduce ext3 allocate-with-reservation lock latencies Mingming Cao
2005-04-28 3:45 ` Lee Revell
2005-04-28 7:37 ` Mingming Cao
2005-04-28 16:12 ` Lee Revell
2005-04-28 18:34 ` Mingming Cao
2005-04-29 6:18 ` Mingming Cao
2005-04-28 19:14 ` [RFC] Adding multiple block allocation to current ext3 Mingming Cao
2005-04-29 13:52 ` [Ext2-devel] [RFC] Adding multiple block allocation Suparna Bhattacharya
2005-04-29 17:10 ` Mingming Cao
2005-04-29 19:42 ` Mingming Cao
2005-04-29 20:57 ` Andrew Morton
2005-04-29 21:12 ` Mingming Cao
2005-04-29 21:34 ` Andrew Morton
2005-04-30 16:00 ` Suparna Bhattacharya
2005-04-29 18:45 ` Badari Pulavarty
2005-04-29 23:22 ` Mingming Cao
2005-04-30 16:10 ` Suparna Bhattacharya
2005-04-30 17:11 ` Suparna Bhattacharya
2005-04-30 18:07 ` Mingming Cao
2005-05-02 4:46 ` Suparna Bhattacharya
2005-04-30 16:52 ` Suparna Bhattacharya
2005-04-30 0:33 ` Mingming Cao
2005-04-30 0:44 ` Mingming Cao
2005-04-30 17:03 ` Suparna Bhattacharya
2006-01-10 23:26 ` [PATCH 0/5] multiple block allocation to current ext3 Mingming Cao
2006-01-11 5:25 ` Andrew Morton
2006-01-11 19:17 ` Mingming Cao
2006-01-11 19:43 ` Andrew Morton
2006-01-11 21:31 ` Mingming Cao
2006-01-14 1:12 ` Fall back io scheduler for 2.6.15? Mingming Cao
2006-01-14 1:49 ` Andrew Morton
2006-01-14 5:22 ` Dave Jones
2006-01-16 8:43 ` Jens Axboe
2006-01-16 19:45 ` [PATCH] Fall back io scheduler ( Re: [Ext2-devel] Re: Fall back io scheduler for 2.6.15?) Mingming Cao
2006-01-16 19:49 ` Jens Axboe
2006-01-16 19:57 ` Mingming Cao
2006-01-19 19:37 ` Fall back io scheduler for 2.6.15? Nate Diller
2006-01-20 8:10 ` Jens Axboe
2006-01-16 19:38 ` [Ext2-devel] " Mingming Cao
2005-04-29 6:28 ` [PATCH] Reduce ext3 allocate-with-reservation lock latencies Mingming Cao