* Re: Fall back io scheduler for 2.6.15?
@ 2006-01-14 16:10 Chuck Ebbert
2006-01-16 8:43 ` Jens Axboe
0 siblings, 1 reply; 11+ messages in thread
From: Chuck Ebbert @ 2006-01-14 16:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: Jens Axboe, linux-kernel, Mingming Cao
In-Reply-To: <20060113174914.7907bf2c.akpm@osdl.org>
On Fri, 13 Jan 2006, Andrew Morton wrote:
> OK. And I assume that AS wasn't compiled, so that's why it fell back?
As of 2.6.15 you need to use "anticipatory" instead of "as".
Maybe this patch would help?
Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
--- 2.6.15a.orig/block/elevator.c
+++ 2.6.15a/block/elevator.c
@@ -150,6 +150,13 @@ static void elevator_setup_default(void)
 	if (!chosen_elevator[0])
 		strcpy(chosen_elevator, CONFIG_DEFAULT_IOSCHED);
 
+	/*
+	 * Be backwards-compatible with previous kernels, so users
+	 * won't get the wrong elevator.
+	 */
+	if (!strcmp(chosen_elevator, "as"))
+		strcpy(chosen_elevator, "anticipatory");
+
 	/*
 	 * If the given scheduler is not available, fall back to no-op.
 	 */
--
Chuck
Currently reading: _Olympos_ by Dan Simmons
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Fall back io scheduler for 2.6.15? 2006-01-14 16:10 Fall back io scheduler for 2.6.15? Chuck Ebbert @ 2006-01-16 8:43 ` Jens Axboe 2006-01-19 19:38 ` Nate Diller 2006-01-21 6:14 ` Tejun Heo 0 siblings, 2 replies; 11+ messages in thread From: Jens Axboe @ 2006-01-16 8:43 UTC (permalink / raw) To: Chuck Ebbert; +Cc: Andrew Morton, linux-kernel, Mingming Cao On Sat, Jan 14 2006, Chuck Ebbert wrote: > In-Reply-To: <20060113174914.7907bf2c.akpm@osdl.org> > > On Fri, 13 Jan 2006, Andrew Morton wrote: > > > OK. And I assume that AS wasn't compiled, so that's why it fell back? > > As of 2.6.15 you need to use "anticipatory" instead of "as". > > Maybe this patch would help? > > Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com> > > --- 2.6.15a.orig/block/elevator.c > +++ 2.6.15a/block/elevator.c > @@ -150,6 +150,13 @@ static void elevator_setup_default(void) > if (!chosen_elevator[0]) > strcpy(chosen_elevator, CONFIG_DEFAULT_IOSCHED); > > + /* > + * Be backwards-compatible with previous kernels, so users > + * won't get the wrong elevator. > + */ > + if (!strcmp(chosen_elevator, "as")) > + strcpy(chosen_elevator, "anticipatory"); > + > /* > * If the given scheduler is not available, fall back to no-op. > */ We probably should apply this, since it used to be 'as'. -- Jens Axboe ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Fall back io scheduler for 2.6.15? 2006-01-16 8:43 ` Jens Axboe @ 2006-01-19 19:38 ` Nate Diller 2006-01-21 6:14 ` Tejun Heo 1 sibling, 0 replies; 11+ messages in thread From: Nate Diller @ 2006-01-19 19:38 UTC (permalink / raw) To: Jens Axboe; +Cc: Chuck Ebbert, Andrew Morton, linux-kernel, Mingming Cao On 1/16/06, Jens Axboe <axboe@suse.de> wrote: > On Sat, Jan 14 2006, Chuck Ebbert wrote: > > In-Reply-To: <20060113174914.7907bf2c.akpm@osdl.org> > > > > On Fri, 13 Jan 2006, Andrew Morton wrote: > > > > > OK. And I assume that AS wasn't compiled, so that's why it fell back? > > > > As of 2.6.15 you need to use "anticipatory" instead of "as". > > > > Maybe this patch would help? > > > > Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com> > > > > --- 2.6.15a.orig/block/elevator.c > > +++ 2.6.15a/block/elevator.c > > @@ -150,6 +150,13 @@ static void elevator_setup_default(void) > > if (!chosen_elevator[0]) > > strcpy(chosen_elevator, CONFIG_DEFAULT_IOSCHED); > > > > + /* > > + * Be backwards-compatible with previous kernels, so users > > + * won't get the wrong elevator. > > + */ > > + if (!strcmp(chosen_elevator, "as")) > > + strcpy(chosen_elevator, "anticipatory"); > > + > > /* > > * If the given scheduler is not available, fall back to no-op. > > */ > > We probably should apply this, since it used to be 'as'. i agree NATE ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Fall back io scheduler for 2.6.15?
  2006-01-16  8:43 ` Jens Axboe
  2006-01-19 19:38 ` Nate Diller
@ 2006-01-21  6:14 ` Tejun Heo
  2006-01-21 11:44 ` Jens Axboe
  1 sibling, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2006-01-21 6:14 UTC (permalink / raw)
To: Jens Axboe; +Cc: Chuck Ebbert, Andrew Morton, linux-kernel, Mingming Cao

Jens Axboe wrote:
> On Sat, Jan 14 2006, Chuck Ebbert wrote:
>
>>In-Reply-To: <20060113174914.7907bf2c.akpm@osdl.org>
>>
>>On Fri, 13 Jan 2006, Andrew Morton wrote:
>>
>>
>>>OK. And I assume that AS wasn't compiled, so that's why it fell back?
>>
>>As of 2.6.15 you need to use "anticipatory" instead of "as".
>>
>>Maybe this patch would help?
>>
>>Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
>>
>>--- 2.6.15a.orig/block/elevator.c
>>+++ 2.6.15a/block/elevator.c
>>@@ -150,6 +150,13 @@ static void elevator_setup_default(void)
>> 	if (!chosen_elevator[0])
>> 		strcpy(chosen_elevator, CONFIG_DEFAULT_IOSCHED);
>>
>>+	/*
>>+	 * Be backwards-compatible with previous kernels, so users
>>+	 * won't get the wrong elevator.
>>+	 */
>>+	if (!strcmp(chosen_elevator, "as"))
>>+		strcpy(chosen_elevator, "anticipatory");
>>+
>> 	/*
>> 	 * If the given scheduler is not available, fall back to no-op.
>> 	 */
>
>
> We probably should apply this, since it used to be 'as'.
>

Just out of curiosity, why did 'as' get renamed to 'anticipatory'?

--
tejun

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: Fall back io scheduler for 2.6.15?
  2006-01-21  6:14 ` Tejun Heo
@ 2006-01-21 11:44 ` Jens Axboe
  0 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2006-01-21 11:44 UTC (permalink / raw)
To: Tejun Heo; +Cc: Chuck Ebbert, Andrew Morton, linux-kernel, Mingming Cao

On Sat, Jan 21 2006, Tejun Heo wrote:
> Jens Axboe wrote:
> >On Sat, Jan 14 2006, Chuck Ebbert wrote:
> >
> >>In-Reply-To: <20060113174914.7907bf2c.akpm@osdl.org>
> >>
> >>On Fri, 13 Jan 2006, Andrew Morton wrote:
> >>
> >>
> >>>OK. And I assume that AS wasn't compiled, so that's why it fell back?
> >>
> >>As of 2.6.15 you need to use "anticipatory" instead of "as".
> >>
> >>Maybe this patch would help?
> >>
> >>Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
> >>
> >>--- 2.6.15a.orig/block/elevator.c
> >>+++ 2.6.15a/block/elevator.c
> >>@@ -150,6 +150,13 @@ static void elevator_setup_default(void)
> >> 	if (!chosen_elevator[0])
> >> 		strcpy(chosen_elevator, CONFIG_DEFAULT_IOSCHED);
> >>
> >>+	/*
> >>+	 * Be backwards-compatible with previous kernels, so users
> >>+	 * won't get the wrong elevator.
> >>+	 */
> >>+	if (!strcmp(chosen_elevator, "as"))
> >>+		strcpy(chosen_elevator, "anticipatory");
> >>+
> >> 	/*
> >> 	 * If the given scheduler is not available, fall back to no-op.
> >> 	 */
> >
> >
> >We probably should apply this, since it used to be 'as'.
> >
>
> Just out of curiosity, why did 'as' get renamed to 'anticipatory'?

Side effect really, not intentional. 'as' always registered itself with
the elevator core as "anticipatory"; the logic to match the elevator=foo
string to a scheduler used to be completely separate, prior to the
addition of online switchable elevators. So when those two were tied
together, the "as" name disappeared. I guess the right thing to do at
that time was to rename "anticipatory", but that didn't happen...

--
Jens Axboe

^ permalink raw reply	[flat|nested] 11+ messages in thread
* ext3 allocate-with-reservation latencies @ 2005-04-05 3:51 Lee Revell 2005-04-07 13:08 ` Stephen C. Tweedie 0 siblings, 1 reply; 11+ messages in thread From: Lee Revell @ 2005-04-05 3:51 UTC (permalink / raw) To: Ingo Molnar; +Cc: linux-kernel, Andrew Morton [-- Attachment #1: Type: text/plain, Size: 484 bytes --] I can trigger latencies up to ~1.1 ms with a CVS checkout. It looks like inside ext3_try_to_allocate_with_rsv, we spend a long time in this loop: ext3_test_allocatable (bitmap_search_next_usable_block) find_next_zero_bit (bitmap_search_next_usable_block) find_next_zero_bit (bitmap_search_next_usable_block) ext3_test_allocatable (bitmap_search_next_usable_block) find_next_zero_bit (bitmap_search_next_usable_block) find_next_zero_bit (bitmap_search_next_usable_block) etc. Lee [-- Attachment #2: ext3_reservation.bz2 --] [-- Type: application/x-bzip, Size: 3287 bytes --] ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: ext3 allocate-with-reservation latencies @ 2005-04-07 13:08 ` Stephen C. Tweedie 2005-04-07 23:37 ` Mingming Cao 0 siblings, 1 reply; 11+ messages in thread From: Stephen C. Tweedie @ 2005-04-07 13:08 UTC (permalink / raw) To: Ingo Molnar Cc: Mingming Cao, Lee Revell, linux-kernel, Andrew Morton, Stephen Tweedie Hi, On Thu, 2005-04-07 at 09:14, Ingo Molnar wrote: > doesnt the first option also allow searches to be in parallel? In terms of CPU usage, yes. But either we use large windows, in which case we *can't* search remotely near areas of the disk in parallel; or we use small windows, in which case failure to find a free bit becomes disproportionately expensive (we have to do an rbtree link and unlink for each window we search.) > This > particular one took over 1 msec, so it seems there's a fair room for > parallellizing multiple writers and thus improving scalability. (or is > this codepath serialized globally anyway?) No, the rbtree ops are serialised per-sb, and allocations within any one inode are serialised, but the bitmap searches need not be serialised globally. (They are in the case of new window searches right now, which is what we're trying to fix.) --Stephen ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: ext3 allocate-with-reservation latencies
  2005-04-07 13:08 ` Stephen C. Tweedie
@ 2005-04-07 23:37 ` Mingming Cao
  2005-04-08 14:40 ` Stephen C. Tweedie
  0 siblings, 1 reply; 11+ messages in thread
From: Mingming Cao @ 2005-04-07 23:37 UTC (permalink / raw)
To: Stephen C. Tweedie; +Cc: Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton

On Thu, 2005-04-07 at 14:08 +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Thu, 2005-04-07 at 09:14, Ingo Molnar wrote:
>
> > doesnt the first option also allow searches to be in parallel?
>
> In terms of CPU usage, yes.  But either we use large windows, in which
> case we *can't* search remotely near areas of the disk in parallel; or
> we use small windows, in which case failure to find a free bit becomes
> disproportionately expensive (we have to do an rbtree link and unlink
> for each window we search.)

Actually, we do not have to do an rbtree link and unlink for every
window we search. If the reserved (old) window has no free bit and the
new reservable window is right after the old one, there is no need to
unlink the old window from the rbtree and then link the new window;
just update the start and end of the old one. That's in the current
design already, to reduce the cost of rbtree link and unlink.

Mingming

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: ext3 allocate-with-reservation latencies 2005-04-07 23:37 ` Mingming Cao @ 2005-04-08 14:40 ` Stephen C. Tweedie 2005-04-08 18:10 ` Mingming Cao 0 siblings, 1 reply; 11+ messages in thread From: Stephen C. Tweedie @ 2005-04-08 14:40 UTC (permalink / raw) To: Mingming Cao Cc: Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton, Stephen Tweedie Hi, On Fri, 2005-04-08 at 00:37, Mingming Cao wrote: > Actually, we do not have to do an rbtree link and unlink for every > window we search. If the reserved window(old) has no free bit and the > new reservable window's is right after the old one, no need to unlink > the old window from the rbtree and then link the new window, just update > the start and end of the old one. It still needs to be done under locking to prevent us from expanding over the next window, though. And having to take and drop a spinlock a dozen times or more just to find out that there are no usable free blocks in the current block group is still expensive, even if we're not actually fully unlinking the window each time. I wonder if this can possibly be done safely without locking? It would be really good if we could rotate windows forward with no global locks. At the very least, a fair rwlock would let us freeze the total layout of the tree, while still letting us modify individual windows safely. As long as we use wmb() to make sure that we always extend the end of the window before we shrink the start of it, I think we could get away with such shared locking. And rw locking is much better for concurrency, so we might be able to hold it over the whole bitmap search rather than taking it and dropping it at each window advance. --Stephen ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: ext3 allocate-with-reservation latencies 2005-04-08 14:40 ` Stephen C. Tweedie @ 2005-04-08 18:10 ` Mingming Cao 2005-04-11 11:48 ` Stephen C. Tweedie 0 siblings, 1 reply; 11+ messages in thread From: Mingming Cao @ 2005-04-08 18:10 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton On Fri, 2005-04-08 at 15:40 +0100, Stephen C. Tweedie wrote: > Hi, > > On Fri, 2005-04-08 at 00:37, Mingming Cao wrote: > > > Actually, we do not have to do an rbtree link and unlink for every > > window we search. If the reserved window(old) has no free bit and the > > new reservable window's is right after the old one, no need to unlink > > the old window from the rbtree and then link the new window, just update > > the start and end of the old one. > > It still needs to be done under locking to prevent us from expanding > over the next window, though. And having to take and drop a spinlock a > dozen times or more just to find out that there are no usable free > blocks in the current block group is still expensive, even if we're not > actually fully unlinking the window each time. > Isn't this a rare case? The whole group is relatively full and the free bits are all reserved by other files. Probably we should avoid trying to make reservation in this block group at the first place, if we could find a way to detect the number of _usable_ free bits are less than the requested window size. > I wonder if this can possibly be done safely without locking? It would > be really good if we could rotate windows forward with no global locks. > At the very least, a fair rwlock would let us freeze the total layout of > the tree, while still letting us modify individual windows safely. As > long as we use wmb() to make sure that we always extend the end of the > window before we shrink the start of it, I think we could get away with > such shared locking. 
And rw locking is much better for concurrency, so
> we might be able to hold it over the whole bitmap search rather than
> taking it and dropping it at each window advance.
>

You are proposing that we hold the read lock first, do the window search
and bitmap scan, then once we confirm there is a free bit in the
candidate window, we grab the write lock and update the tree? I think
this is a good idea to address the case you are concerned about: when we
need to do lots of window searches before settling down. Also, if later
we decide (I think we have discussed this before) to always try to
reserve a window with at least 8 contiguous free blocks, the search will
be more expensive and the read lock will help.

However, I am still worried that the rw lock will allow concurrent files
to try to lock the same window at the same time. Only one succeeds,
though, so CPU usage will be high. And also, in the normal case the
filesystem is not really full, so the rw lock probably becomes expensive.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: ext3 allocate-with-reservation latencies 2005-04-08 18:10 ` Mingming Cao @ 2005-04-11 11:48 ` Stephen C. Tweedie 2005-04-11 18:38 ` Mingming Cao 0 siblings, 1 reply; 11+ messages in thread From: Stephen C. Tweedie @ 2005-04-11 11:48 UTC (permalink / raw) To: Mingming Cao Cc: Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton, Stephen Tweedie Hi, On Fri, 2005-04-08 at 19:10, Mingming Cao wrote: > > It still needs to be done under locking to prevent us from expanding > > over the next window, though. And having to take and drop a spinlock a > > dozen times or more just to find out that there are no usable free > > blocks in the current block group is still expensive, even if we're not > > actually fully unlinking the window each time. > Isn't this a rare case? The whole group is relatively full and the free > bits are all reserved by other files. Well, that's not much different from the case where there _are_ usable bits but they are all near the end of the bitmap. And that's common enough as you fill up a block group with small files, for example. Once the bg becomes nearly full, new file opens-for-write will still try to allocate in that bg (because that's where the inode was allocated), but as it's a new fd we have no prior knowledge of _where_ in the bh to allocate, and we'll have to walk it to the end to find any free space. This is the access pattern I'd expect of (for example) "tar xvjf linux-2.6.12.tar.bz2", not exactly a rare case. > Probably we should avoid trying > to make reservation in this block group at the first place Not in this case, because it's the "home" of the file in question, and skipping to another bg would just leave useful space empty --- and that leads to fragmentation. > You are proposing that we hold the read lock first, do the window search > and bitmap scan, then once we confirm there is free bit in the candidate > window, we grab the write lock and update the tree? 
No, I'm suggesting that if we need the write lock for tree updates, we may still be able to get away with just a read lock when updating an individual window. If all we're doing is winding the window forwards a bit, that's not actually changing the structure of the tree. > However I am still worried that the rw lock will allow concurrent files > trying to lock the same window at the same time. Adding a new window will need the write lock, always. But with the read lock, we can still check the neighbouring windows (the structure is pinned so those remain valid), so advancing a window with just that lock can still be done without risking overlapping the next window. --Stephen ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: ext3 allocate-with-reservation latencies 2005-04-11 11:48 ` Stephen C. Tweedie @ 2005-04-11 18:38 ` Mingming Cao 2005-04-11 19:57 ` Stephen C. Tweedie 0 siblings, 1 reply; 11+ messages in thread From: Mingming Cao @ 2005-04-11 18:38 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton On Mon, 2005-04-11 at 12:48 +0100, Stephen C. Tweedie wrote: > Hi, > > On Fri, 2005-04-08 at 19:10, Mingming Cao wrote: > > > > It still needs to be done under locking to prevent us from expanding > > > over the next window, though. And having to take and drop a spinlock a > > > dozen times or more just to find out that there are no usable free > > > blocks in the current block group is still expensive, even if we're not > > > actually fully unlinking the window each time. > > > Isn't this a rare case? The whole group is relatively full and the free > > bits are all reserved by other files. > > Well, that's not much different from the case where there _are_ usable > bits but they are all near the end of the bitmap. And that's common > enough as you fill up a block group with small files, for example. Once > the bg becomes nearly full, new file opens-for-write will still try to > allocate in that bg (because that's where the inode was allocated), but > as it's a new fd we have no prior knowledge of _where_ in the bh to > allocate, and we'll have to walk it to the end to find any free space. > This is the access pattern I'd expect of (for example) "tar xvjf > linux-2.6.12.tar.bz2", not exactly a rare case. > Okey. > > Probably we should avoid trying > > to make reservation in this block group at the first place > > Not in this case, because it's the "home" of the file in question, and > skipping to another bg would just leave useful space empty --- and that > leads to fragmentation. > I agree. We should not skip the home block group of the file. 
I guess what I was suggesting is, if allocation from the home group
fails and we continue the linear search through the rest of the block
groups, we could probably try to skip the block groups without enough
usable free blocks to make a reservation. Checking the number of free
blocks left in the block group in question is a good way, but probably
not good enough, since those free blocks might already be reserved.

> > You are proposing that we hold the read lock first, do the window search
> > and bitmap scan, then once we confirm there is a free bit in the candidate
> > window, we grab the write lock and update the tree?
>
> No, I'm suggesting that if we need the write lock for tree updates, we
> may still be able to get away with just a read lock when updating an
> individual window.  If all we're doing is winding the window forwards a
> bit, that's not actually changing the structure of the tree.
>
> > However I am still worried that the rw lock will allow concurrent files
> > trying to lock the same window at the same time.
>
> Adding a new window will need the write lock, always.  But with the read
> lock, we can still check the neighbouring windows (the structure is
> pinned so those remain valid), so advancing a window with just that lock
> can still be done without risking overlapping the next window.
>

Ah.. I see the win with the read lock now: once a reservation window is
added, updating it (either winding it forward or searching for an
available window) is probably the majority of the operations on the
tree, and using a read lock for that should help reduce the latency.

I will work on a patch for Lee to try sometime tonight.

Thanks,
Mingming

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: ext3 allocate-with-reservation latencies 2005-04-11 18:38 ` Mingming Cao @ 2005-04-11 19:57 ` Stephen C. Tweedie 2005-04-12 6:41 ` Mingming Cao 0 siblings, 1 reply; 11+ messages in thread From: Stephen C. Tweedie @ 2005-04-11 19:57 UTC (permalink / raw) To: Mingming Cao Cc: Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton, Stephen Tweedie Hi, On Mon, 2005-04-11 at 19:38, Mingming Cao wrote: > I agree. We should not skip the home block group of the file. I guess > what I was suggesting is, if allocation from the home group failed and > we continuing the linear search the rest of block groups, we could > probably try to skip the block groups without enough usable free blocks > to make a reservation. Fair enough. Once those are the only bgs left, performance is going to drop pretty quickly, but that's not really avoidable on a very full filesystem. > Ah.. I see the win with the read lock now: once the a reservation window > is added, updating it (either winding it forward and or searching for a > avaliable window) probably is the majorirty of the operations on the > tree, and using read lock for that should help reduce the latency. Right. The down side is that for things like a kernel "tar xf", we'll be doing lots of small file unpacks, and hopefully most files will be just one or two reservations --- so there's little winding forward going on. The searching will still be improved in that case. Note that this may improve average case latencies, but it's not likely to improve worst-case ones. We still need a write lock to install a new window, and that's going to have to wait for us to finish finding a free bit even if that operation starts using a read lock. I'm not really sure what to do about worst-case here. For that, we really do want to drop the lock entirely while we do the bitmap scan. That leaves two options. Hold a reservation while we do that; or don't. 
Holding one poses the problems we discussed before: either you hold a large reservation (bad for disk layout in the presence of concurrent allocators), or you hold smaller ones (high cost as you continually advance the window, which requires some read lock on the tree to avoid bumping into the next window.) Just how bad would it be if we didn't hold a lock _or_ a window at all while doing the search for new window space? I didn't like that alternative at first because of the problem when you've got multiple tasks trying to allocate in the same space at the same time; but given the locking overhead of the alternatives, I'm wondering if this is actually the lesser evil. --Stephen ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: ext3 allocate-with-reservation latencies
  2005-04-11 19:57 ` Stephen C. Tweedie
@ 2005-04-12  6:41 ` Mingming Cao
  2005-04-12 11:18 ` Stephen C. Tweedie
  0 siblings, 1 reply; 11+ messages in thread
From: Mingming Cao @ 2005-04-12 6:41 UTC (permalink / raw)
To: Stephen C. Tweedie; +Cc: Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton

On Mon, 2005-04-11 at 20:57 +0100, Stephen C. Tweedie wrote:
> Hi,

Hi Stephen,

> On Mon, 2005-04-11 at 19:38, Mingming Cao wrote:
> > Ah.. I see the win with the read lock now: once a reservation window
> > is added, updating it (either winding it forward or searching for an
> > available window) is probably the majority of the operations on the
> > tree, and using a read lock for that should help reduce the latency.
>
> Right.  The down side is that for things like a kernel "tar xf", we'll
> be doing lots of small file unpacks, and hopefully most files will be
> just one or two reservations --- so there's little winding forward going
> on.  The searching will still be improved in that case.

Just a side note that "tar xf" should know the file size before
unpacking it, so it could set the reservation window size (through an
ioctl command) large enough to fit the entire file before doing the
file write.

> Note that this may improve average case latencies, but it's not likely
> to improve worst-case ones.  We still need a write lock to install a new
> window, and that's going to have to wait for us to finish finding a free
> bit even if that operation starts using a read lock.
>

Yes indeed. However, nothing is free and there are always trade-offs. :)

By worst case you mean multiple writes trying to allocate blocks around
the same area? But I wonder if the latency seen by Lee belongs to this
worst case: the latency comes mostly from the loop calling
find_next_zero_bit() in bitmap_search_next_usable_block().
Even if we take out the whole reservation, we could still run into this
kind of latency: if the bitmap on disk and the one in the journal are
extremely inconsistent, we need to keep searching them both until we
find a free bit in both maps.

> I'm not really sure what to do about worst-case here.  For that, we
> really do want to drop the lock entirely while we do the bitmap scan.
>

Hmm... if we drop the lock entirely while scanning the bitmap, assuming
you mean drop the read lock, then I am afraid we have to re-check the
tree (requiring a read or write lock) to see if the new window space is
still there after the scan succeeds. This is probably not very
interesting for the window rotating case.

> That leaves two options.  Hold a reservation while we do that; or don't.
> Holding one poses the problems we discussed before: either you hold a
> large reservation (bad for disk layout in the presence of concurrent
> allocators), or you hold smaller ones (high cost as you continually
> advance the window, which requires some read lock on the tree to avoid
> bumping into the next window.)
>

Well, we cannot hold a reservation (which needs updating the tree)
without a write lock. I guess if we want to improve the average-case
latency by replacing the current spin_lock with a read lock for the new
window space search, we don't have much choice here.

> Just how bad would it be if we didn't hold a lock _or_ a window at all
> while doing the search for new window space?

I wonder if this is feasible: walk through the rb tree without a lock?
What if some node is being removed by another thread while we are
walking through the tree and trying to get the next node from it?

Thanks,
Mingming

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: ext3 allocate-with-reservation latencies 2005-04-12 6:41 ` Mingming Cao @ 2005-04-12 11:18 ` Stephen C. Tweedie 2005-04-12 23:27 ` Mingming Cao 0 siblings, 1 reply; 11+ messages in thread From: Stephen C. Tweedie @ 2005-04-12 11:18 UTC (permalink / raw) To: Mingming Cao Cc: Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton, Stephen Tweedie Hi, On Tue, 2005-04-12 at 07:41, Mingming Cao wrote: > > Note that this may improve average case latencies, but it's not likely > > to improve worst-case ones. We still need a write lock to install a new > > window, and that's going to have to wait for us to finish finding a free > > bit even if that operation starts using a read lock. > > > Yes indeed. However nothing is free and there are always trade-offs.:) > > By worse case you mean multiple writes trying to allocate blocks around > same area? It doesn't matter where they are; multiple new file opens will all be looking for a write lock. You only need one long-held read lock and all the writers still block. The worst-case latencies can't be properly solved with r/w locks --- those let the readers go more quickly (assuming they are in the majority), which helps the average case, but writers still have to wait for exclusive access. We only really help them by dropping the lock entirely. > Even if we take out the whole > reservation, we still possibility run into this kind of latency: the > bitmap on disk and on journal are extremely inconsistent so we need to > keep searching them both until we find a free bit on both map. Quite possibly. But as long as that code is running without locks, it's much easier to deal with those latencies: they won't impact other CPUs, cond_resched() is easier, and there's even CONFIG_PREEMPT. > > I'm not really sure what to do about worst-case here. For that, we > > really do want to drop the lock entirely while we do the bitmap scan. 
> Hmm...if we drop the lock entirely while scan the bitmap, assuming you > mean drop the read lock, then I am afraid we have to re-check the tree > (require a read or write lock ) to see if the new window space is still > there after the scan succeed. Sure. You basically start off with a provisional window, and then if necessary roll it forward just the same way you roll normal windows forward when they get to their end. That means you can still drop the lock while you search for new space. When you get there, reacquire the lock and check that the intervening space is still available. That's really cheap for the common case. The difficulty is when you have many parallel allocations hitting the same bg: they allocate provisional windows, find the same free area later on in the bg, and then stomp on each other as they try to move their windows there. I wonder if there's not a simple solution for this --- mark the window as "provisional", and if any other task tries to allocate in the space immediately following such a window, it needs to block until that window is released. --Stephen ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: ext3 allocate-with-reservation latencies
2005-04-12 11:18 ` Stephen C. Tweedie
@ 2005-04-12 23:27 ` Mingming Cao
2005-04-13 10:29 ` Stephen C. Tweedie
0 siblings, 1 reply; 11+ messages in thread
From: Mingming Cao @ 2005-04-12 23:27 UTC (permalink / raw)
To: Stephen C. Tweedie; +Cc: Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton

On Tue, 2005-04-12 at 12:18 +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Tue, 2005-04-12 at 07:41, Mingming Cao wrote:
>
> > > Note that this may improve average case latencies, but it's not likely
> > > to improve worst-case ones.  We still need a write lock to install a new
> > > window, and that's going to have to wait for us to finish finding a free
> > > bit even if that operation starts using a read lock.
> > >
> > Yes indeed. However nothing is free and there are always trade-offs.:)
> >
> > By worse case you mean multiple writes trying to allocate blocks around
> > same area?
>
> It doesn't matter where they are; multiple new file opens will all be
> looking for a write lock.  You only need one long-held read lock and all
> the writers still block.  The worst-case latencies can't be properly
> solved with r/w locks --- those let the readers go more quickly
> (assuming they are in the majority), which helps the average case, but
> writers still have to wait for exclusive access.  We only really help
> them by dropping the lock entirely.
>
> > Even if we take out the whole
> > reservation, we still possibility run into this kind of latency: the
> > bitmap on disk and on journal are extremely inconsistent so we need to
> > keep searching them both until we find a free bit on both map.
>
> Quite possibly.  But as long as that code is running without locks, it's
> much easier to deal with those latencies: they won't impact other CPUs,
> cond_resched() is easier, and there's even CONFIG_PREEMPT.
>
Okay, I agree.

> > > I'm not really sure what to do about worst-case here.  For that, we
> > > really do want to drop the lock entirely while we do the bitmap scan.
> >
> > Hmm...if we drop the lock entirely while scan the bitmap, assuming you
> > mean drop the read lock, then I am afraid we have to re-check the tree
> > (require a read or write lock ) to see if the new window space is still
> > there after the scan succeed.
>
> Sure.  You basically start off with a provisional window, and then if
> necessary roll it forward just the same way you roll normal windows
> forward when they get to their end.  That means you can still drop the
> lock while you search for new space.  When you get there, reacquire the
> lock and check that the intervening space is still available.
>
Please note that the bitmap scan does not only scan the provisional
window range: it returns the first free bit on the bitmap, starting
from the start block of the provisional window, up to the end of the
whole block group.

So when the new window's neighbor is the same as the old one's (that
is, the window is rolled forward), we only need to roll forward once to
find it.  Either we find a free bit inside the provisional window, or
we find one outside it.  In the first case we just roll the window
forward and we are done.  In the second case, the free bit may be
inside someone else's window (which means we can't take that window),
or it may be in the free space after an already-reserved window.
Either way we have to lock the whole tree to remove the old window and
insert the new one.

> That's really cheap for the common case.  The difficulty is when you
> have many parallel allocations hitting the same bg: they allocate
> provisional windows, find the same free area later on in the bg, and
> then stomp on each other as they try to move their windows there.
>
> I wonder if there's not a simple solution for this --- mark the window
> as "provisional", and if any other task tries to allocate in the space
> immediately following such a window, it needs to block until that window
> is released.
>
Sounds interesting.  However, that implies we need a write lock to mark
the window as provisional and block other files looking for windows
near it: we need to insert the provisional window into the tree and
then mark it as a temporary window, to really let other files notice
this kind of "hold".  I wonder if the benefit of the read/write lock is
worth all the hassles now.

If the new window's neighbor stays the same, we only need to roll
forward once.  If not, after a successful scan, we need to grab the
write lock and make sure the window is still there.  If we dropped the
lock without holding the new space, we have to search the whole tree
again to find out if the space is still there: we cannot reuse the
previous node returned by find_next_reservable_window(), since that
node could be gone while we scan the bitmap without the lock.
Basically we do the tree search twice (once in the roll-forward case)
and take the lock twice in the normal case.  Also, I am concerned about
possible starvation of writers.

I am thinking that maybe going back to the spin_lock is not that bad,
combined with the "mark provisional" suggestion you made?  It allows us
to mark the new space as provisional when we find it (preventing other
window searches from running into the same neighborhood).  We could
then release the lock and scan the bitmap without worrying that the new
space will be taken by others.  If there is a free bit, we could just
clear the provisional bit and simply return (we don't have to re-grab
the lock there).  In the case where the final new window just rolls the
old window forward, we only need to grab the spin_lock once and move
the window forward.

This seems to me the simplest way to fix the latency that Lee reported,
and it balances both the average and worst cases.  Does it still look
too bad to you?

Thanks,

Mingming

^ permalink raw reply	[flat|nested] 11+ messages in thread
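The tree walk Mingming refers to (`find_next_reservable_window()`) can be modelled in a few lines of userspace C. This is a toy model only: a sorted array stands in for ext3's rb-tree of reservation windows, and the function name and return convention are invented for the sketch — it just shows the "skip past existing windows until a big-enough gap appears" search that has to happen before a window can be booked.

```c
#include <assert.h>

/*
 * Toy model of the reservation-window search: windows are kept sorted
 * by start block (an array stands in for ext3's rb-tree).  Return the
 * start of the first gap of at least 'size' blocks at or after 'goal',
 * or -1 if no such gap exists before 'last_block'.  Illustrative only.
 */
struct rsv { int start, end; };

static int find_gap(const struct rsv *w, int nw, int goal, int size,
                    int last_block)
{
        int cur = goal;

        for (int i = 0; i < nw; i++) {
                if (w[i].end < cur)
                        continue;       /* window entirely behind us */
                if (cur + size - 1 < w[i].start)
                        return cur;     /* gap before this window fits */
                cur = w[i].end + 1;     /* skip past this window */
        }
        return (cur + size - 1 <= last_block) ? cur : -1;
}
```

In the real code this search runs under the tree lock; the whole debate above is about how much of the rest of the allocation path can run after dropping that lock.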
* Re: ext3 allocate-with-reservation latencies
2005-04-12 23:27 ` Mingming Cao
@ 2005-04-13 10:29 ` Stephen C. Tweedie
2005-04-22 22:10 ` [RFC][PATCH] Reduce ext3 allocate-with-reservation lock latencies Mingming Cao
0 siblings, 1 reply; 11+ messages in thread
From: Stephen C. Tweedie @ 2005-04-13 10:29 UTC (permalink / raw)
To: Mingming Cao
Cc: Ingo Molnar, Lee Revell, linux-kernel, Andrew Morton, Stephen Tweedie

Hi,

On Wed, 2005-04-13 at 00:27, Mingming Cao wrote:
> > I wonder if there's not a simple solution for this --- mark the window
> > as "provisional", and if any other task tries to allocate in the space
> > immediately following such a window, it needs to block until that window
> > is released.
> Sounds interesting. However that implies we need a write lock to mark
> the window as provisional and block other files looking for windows near
> it: we need to insert the provisional window into the tree and then mark
> it as a temporary window, to really let other file notice this kind of
> "hold".

We need a lock for the tree modification, yes.

> I wonder if the benefit of read/write lock is worth all the hassles now.

The point of the provisional window is that you don't need read/write
locks at all.  The provisional window lets you unlock the tree
completely while you do the bitmap scan, so there's really no reason to
have rwlocks for the tree any more.

> If the new window's neighbor stays the same, we only need to roll
> forward once.  If not, after a successful scan, we need to grab the
> write lock, and make sure the window is still there.

When we take the provisional window, we can make a note of how much
space we have before the next window.  And because all future allocators
will stall if they try to allocate at this point due to the provisional
window, we know that that space will remain outside any other window
until we come to fix the provisional window and potentially roll it
forward to the space we found.

> If we dropped the
> lock without holding the new space, we have to search the whole tree
> again to find out if the space is still there

As long as the space is within the area between the provisional window
and its successor, we don't need to do that.  (It gets more complex if
the bitmap search returns a bit _beyond_ the next window, though.)

> Also I am concerned about the possible
> starvation on writers.

In what way?

> I am thinking, maybe back to the spin_lock is not that bad with the
> "mark provisional" suggest you made?

Right, that was the intent --- sorry if I didn't make it clear.

> It allows us to mark the new space
> as provisional if we find a new space(prevent other window searching run
> into the same neighborhood). We could release the lock and scan the
> bitmap without worry about the new space will be taking by others.

Exactly.

Note that there _is_ some additional complexity here.  It's not entirely
free.  Specifically, if we have other tasks waiting on our provisional
window, then we need to be very careful about the life expectancy of the
wait queues involved, so that if the first task fixes its window and
then deletes it before the waiters have woken up, they don't get
confused by the provisional window vanishing while being waited for.

The easy solution is a global wait queue for that, but that doesn't
scale well.  The complex solution is a per-window wait queue and
reference count, which is obviously a bit of bloat, though probably
worth it for the high-load case.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 11+ messages in thread
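The per-window lifetime problem Stephen describes is the classic refcount-protects-memory pattern: a waiter takes a reference before sleeping on the window's wait queue, so the structure outlives its removal from the tree. A minimal userspace model (names invented; the actual free is replaced by a flag so the demo can observe it):

```c
#include <assert.h>

/*
 * Model of the per-window wait-queue lifetime rule: a waiter takes a
 * reference before sleeping on the window, so even if the owner fixes
 * and deletes the window from the tree first, the memory (and the wait
 * queue embedded in it) stays valid until the last reference drops.
 * Illustrative only -- not ext3 code.
 */
struct win {
        int refs;
        int freed;      /* demo stand-in for "memory was released" */
};

static struct win *win_get(struct win *w)
{
        w->refs++;      /* waiter pins the window before sleeping */
        return w;
}

static void win_put(struct win *w)
{
        if (--w->refs == 0)
                w->freed = 1;   /* last user: kfree() would go here */
}
```

The global-wait-queue alternative Stephen mentions avoids this bookkeeping entirely, at the cost of waking (and serializing) unrelated waiters under load.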
* [RFC][PATCH] Reduce ext3 allocate-with-reservation lock latencies 2005-04-13 10:29 ` Stephen C. Tweedie @ 2005-04-22 22:10 ` Mingming Cao 2005-04-28 3:45 ` Lee Revell 0 siblings, 1 reply; 11+ messages in thread From: Mingming Cao @ 2005-04-22 22:10 UTC (permalink / raw) To: Andrew Morton, Stephen C. Tweedie; +Cc: Ingo Molnar, Lee Revell, linux-kernel Andrew, Stephen, Currently in ext3 block reservation code, the global filesystem reservation tree lock (rsv_block) is hold during the process of searching for a space to make a new reservation window, including while scaning the block bitmap to verify if the available window has a free block. Holding the lock during bitmap scan is unnecessary and could possibly cause scalability issue and latency issues. This patch trying to address this by dropping the lock before scan the bitmap. Before that we need to reserve the open window in case someone else is targeting at the same window. Question was should we reserve the whole free reservable space or just the window size we need. Reserve the whole free reservable space will possibly force other threads which intended to do block allocation nearby move to another block group(cause bad layout). In this patch, we just reserve the desired size before drop the lock and scan the block bitmap. Please review. I have tested on fsx on SMP box. Lee, if you got time, please try this patch. Signed-Off-By: Mingming Cao <cmm@us.ibm.com> --- linux-2.6.11-ming/fs/ext3/balloc.c | 135 +++++++++++++++++-------------------- linux-2.6.11-ming/fs/ext3/file.c | 4 + 2 files changed, 68 insertions(+), 71 deletions(-) diff -puN fs/ext3/balloc.c~ext3-reduce-reservation-lock-latency fs/ext3/balloc.c --- linux-2.6.11/fs/ext3/balloc.c~ext3-reduce-reservation-lock-latency 2005-04-22 10:49:04.000000000 -0700 +++ linux-2.6.11-ming/fs/ext3/balloc.c 2005-04-22 14:42:35.518751070 -0700 @@ -749,24 +749,24 @@ fail_access: * to find a free region that is of my size and has not * been reserved. 
* - * on succeed, it returns the reservation window to be appended to. - * failed, return NULL. */ -static struct ext3_reserve_window_node *find_next_reservable_window( +static int find_next_reservable_window( struct ext3_reserve_window_node *search_head, - unsigned long size, int *start_block, + struct ext3_reserve_window_node *my_rsv, + struct super_block * sb, int start_block, int last_block) { struct rb_node *next; struct ext3_reserve_window_node *rsv, *prev; int cur; + int size = my_rsv->rsv_goal_size; /* TODO: make the start of the reservation window byte-aligned */ /* cur = *start_block & ~7;*/ - cur = *start_block; + cur = start_block; rsv = search_head; if (!rsv) - return NULL; + return -1; while (1) { if (cur <= rsv->rsv_end) @@ -782,7 +782,7 @@ static struct ext3_reserve_window_node * * space with expected-size (or more)... */ if (cur > last_block) - return NULL; /* fail */ + return -1; /* fail */ prev = rsv; next = rb_next(&rsv->rsv_node); @@ -813,8 +813,26 @@ static struct ext3_reserve_window_node * * return the reservation window that we could append to. * succeed. */ - *start_block = cur; - return prev; + + if ((prev != my_rsv) && (!rsv_is_empty(&my_rsv->rsv_window))) + rsv_window_remove(sb, my_rsv); + + /* let's book the whole avaliable window for now + * we will check the + * disk bitmap later and then, if there are free block + * then we adjust the window size if the it's + * larger than requested. + * Otherwise, we will remove this node from the tree next time + * call find_next_reservable_window. 
+ */ + my_rsv->rsv_start = cur; + my_rsv->rsv_end = cur + size - 1; + my_rsv->rsv_alloc_hit = 0; + + if (prev != my_rsv) + ext3_rsv_window_add(sb, my_rsv); + + return 0; } /** @@ -852,6 +870,7 @@ static struct ext3_reserve_window_node * * @sb: the super block * @group: the group we are trying to allocate in * @bitmap_bh: the block group block bitmap + * */ static int alloc_new_reservation(struct ext3_reserve_window_node *my_rsv, int goal, struct super_block *sb, @@ -860,10 +879,10 @@ static int alloc_new_reservation(struct struct ext3_reserve_window_node *search_head; int group_first_block, group_end_block, start_block; int first_free_block; - int reservable_space_start; - struct ext3_reserve_window_node *prev_rsv; struct rb_root *fs_rsv_root = &EXT3_SB(sb)->s_rsv_window_root; unsigned long size; + int ret; + spinlock_t *rsv_lock = &EXT3_SB(sb)->s_rsv_window_lock; group_first_block = le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block) + group * EXT3_BLOCKS_PER_GROUP(sb); @@ -875,6 +894,7 @@ static int alloc_new_reservation(struct start_block = goal + group_first_block; size = my_rsv->rsv_goal_size; + if (!rsv_is_empty(&my_rsv->rsv_window)) { /* * if the old reservation is cross group boundary @@ -908,6 +928,8 @@ static int alloc_new_reservation(struct my_rsv->rsv_goal_size= size; } } + + spin_lock(rsv_lock); /* * shift the search start to the window near the goal block */ @@ -921,11 +943,16 @@ static int alloc_new_reservation(struct * need to check the bitmap after we found a reservable window. 
*/ retry: - prev_rsv = find_next_reservable_window(search_head, size, - &start_block, group_end_block); - if (prev_rsv == NULL) - goto failed; - reservable_space_start = start_block; + ret = find_next_reservable_window(search_head, my_rsv, sb, + start_block, group_end_block); + + if (ret == -1) { + if (!rsv_is_empty(&my_rsv->rsv_window)) + rsv_window_remove(sb, my_rsv); + spin_unlock(rsv_lock); + return -1; + } + /* * On success, find_next_reservable_window() returns the * reservation window where there is a reservable space after it. @@ -937,8 +964,9 @@ retry: * block. Search start from the start block of the reservable space * we just found. */ + spin_unlock(rsv_lock); first_free_block = bitmap_search_next_usable_block( - reservable_space_start - group_first_block, + my_rsv->rsv_start - group_first_block, bitmap_bh, group_end_block - group_first_block + 1); if (first_free_block < 0) { @@ -946,54 +974,30 @@ retry: * no free block left on the bitmap, no point * to reserve the space. return failed. 
*/ - goto failed; + spin_lock(rsv_lock); + if (!rsv_is_empty(&my_rsv->rsv_window)) + rsv_window_remove(sb, my_rsv); + spin_unlock(rsv_lock); + return -1; /* failed */ } + start_block = first_free_block + group_first_block; /* * check if the first free block is within the - * free space we just found + * free space we just reserved */ - if ((start_block >= reservable_space_start) && - (start_block < reservable_space_start + size)) - goto found_rsv_window; + if ((start_block >= my_rsv->rsv_start) && + (start_block < my_rsv->rsv_end)) + return 0; /* succeed */ /* * if the first free bit we found is out of the reservable space - * this means there is no free block on the reservable space - * we should continue search for next reservable space, + * continue search for next reservable space, * start from where the free block is, * we also shift the list head to where we stopped last time */ - search_head = prev_rsv; + search_head = my_rsv; + spin_lock(rsv_lock); goto retry; - -found_rsv_window: - /* - * great! the reservable space contains some free blocks. - * if the search returns that we should add the new - * window just next to where the old window, we don't - * need to remove the old window first then add it to the - * same place, just update the new start and new end. 
- */ - if (my_rsv != prev_rsv) { - if (!rsv_is_empty(&my_rsv->rsv_window)) - rsv_window_remove(sb, my_rsv); - } - my_rsv->rsv_start = reservable_space_start; - my_rsv->rsv_end = my_rsv->rsv_start + size - 1; - my_rsv->rsv_alloc_hit = 0; - if (my_rsv != prev_rsv) { - ext3_rsv_window_add(sb, my_rsv); - } - return 0; /* succeed */ -failed: - /* - * failed to find a new reservation window in the current - * group, remove the current(stale) reservation window - * if there is any - */ - if (!rsv_is_empty(&my_rsv->rsv_window)) - rsv_window_remove(sb, my_rsv); - return -1; /* failed */ } /* @@ -1023,7 +1027,6 @@ ext3_try_to_allocate_with_rsv(struct sup int goal, struct ext3_reserve_window_node * my_rsv, int *errp) { - spinlock_t *rsv_lock; unsigned long group_first_block; int ret = 0; int fatal; @@ -1052,7 +1055,6 @@ ext3_try_to_allocate_with_rsv(struct sup ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, NULL); goto out; } - rsv_lock = &EXT3_SB(sb)->s_rsv_window_lock; /* * goal is a group relative block number (if there is a goal) * 0 < goal < EXT3_BLOCKS_PER_GROUP(sb) @@ -1078,30 +1080,21 @@ ext3_try_to_allocate_with_rsv(struct sup * then we could go to allocate from the reservation window directly. 
*/ while (1) { - struct ext3_reserve_window rsv_copy; - - rsv_copy._rsv_start = my_rsv->rsv_start; - rsv_copy._rsv_end = my_rsv->rsv_end; - - if (rsv_is_empty(&rsv_copy) || (ret < 0) || - !goal_in_my_reservation(&rsv_copy, goal, group, sb)) { - spin_lock(rsv_lock); + if (rsv_is_empty(&my_rsv->rsv_window) || (ret < 0) || + !goal_in_my_reservation(&my_rsv->rsv_window, goal, group, sb)) { ret = alloc_new_reservation(my_rsv, goal, sb, group, bitmap_bh); - rsv_copy._rsv_start = my_rsv->rsv_start; - rsv_copy._rsv_end = my_rsv->rsv_end; - spin_unlock(rsv_lock); if (ret < 0) break; /* failed */ - if (!goal_in_my_reservation(&rsv_copy, goal, group, sb)) + if (!goal_in_my_reservation(&my_rsv->rsv_window, goal, group, sb)) goal = -1; } - if ((rsv_copy._rsv_start >= group_first_block + EXT3_BLOCKS_PER_GROUP(sb)) - || (rsv_copy._rsv_end < group_first_block)) + if ((my_rsv->rsv_start >= group_first_block + EXT3_BLOCKS_PER_GROUP(sb)) + || (my_rsv->rsv_end < group_first_block)) BUG(); ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, - &rsv_copy); + &my_rsv->rsv_window); if (ret >= 0) { my_rsv->rsv_alloc_hit++; break; /* succeed */ diff -puN fs/ext3/file.c~ext3-reduce-reservation-lock-latency fs/ext3/file.c --- linux-2.6.11/fs/ext3/file.c~ext3-reduce-reservation-lock-latency 2005-04-22 10:49:04.000000000 -0700 +++ linux-2.6.11-ming/fs/ext3/file.c 2005-04-22 10:49:04.000000000 -0700 @@ -36,7 +36,11 @@ static int ext3_release_file (struct ino /* if we are the last writer on the inode, drop the block reservation */ if ((filp->f_mode & FMODE_WRITE) && (atomic_read(&inode->i_writecount) == 1)) + { + down(&EXT3_I(inode)->truncate_sem); ext3_discard_reservation(inode); + up(&EXT3_I(inode)->truncate_sem); + } if (is_dx(inode) && filp->private_data) ext3_htree_free_dir_info(filp->private_data); _ ^ permalink raw reply [flat|nested] 11+ messages in thread
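The bitmap scan that this patch moves outside `rsv_lock` has to respect both copies of the block bitmap: as Mingming noted earlier in the thread, a block is only usable if it is free both in the on-disk bitmap and in the journal's committed copy, otherwise a just-freed block could be reused before the transaction that freed it commits. A byte-per-block toy model of that contract (not the real bit-packed `bitmap_search_next_usable_block()`):

```c
#include <assert.h>

/*
 * Toy model: one byte per block, 1 = in use.  'disk' is the live block
 * bitmap, 'committed' is the journal's committed copy.  A block may be
 * handed out only when it is free in BOTH, which is part of why the
 * scan can be slow and is worth running without the reservation lock.
 */
static int next_usable_block(const unsigned char *disk,
                             const unsigned char *committed,
                             int from, int nblocks)
{
        for (int i = from; i < nblocks; i++)
                if (!disk[i] && !committed[i])
                        return i;
        return -1;      /* no usable block in this group */
}
```

With the patch applied, this is the work that now runs between the `spin_unlock(rsv_lock)` and the re-lock in `alloc_new_reservation()`, which is what removes the long lock hold times Lee measured.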
* Re: [RFC][PATCH] Reduce ext3 allocate-with-reservation lock latencies
2005-04-22 22:10 ` [RFC][PATCH] Reduce ext3 allocate-with-reservation lock latencies Mingming Cao
@ 2005-04-28 3:45 ` Lee Revell
2005-04-28 19:14 ` [RFC] Adding multiple block allocation to current ext3 Mingming Cao
0 siblings, 1 reply; 11+ messages in thread
From: Lee Revell @ 2005-04-28 3:45 UTC (permalink / raw)
To: cmm; +Cc: Andrew Morton, Stephen C. Tweedie, Ingo Molnar, linux-kernel

On Fri, 2005-04-22 at 15:10 -0700, Mingming Cao wrote:
> Please review. I have tested on fsx on SMP box. Lee, if you got time,
> please try this patch.

I have tested and this does fix the problem.  I ran my tests and no
ext3 code paths showed up on the latency tracer at all; it never went
above 33 usecs.

Please apply.

Lee

^ permalink raw reply	[flat|nested] 11+ messages in thread
* [RFC] Adding multiple block allocation to current ext3 2005-04-28 3:45 ` Lee Revell @ 2005-04-28 19:14 ` Mingming Cao 2006-01-10 23:26 ` [PATCH 0/5] " Mingming Cao 0 siblings, 1 reply; 11+ messages in thread From: Mingming Cao @ 2005-04-28 19:14 UTC (permalink / raw) To: Andrew Morton; +Cc: Stephen C. Tweedie, linux-kernel, ext2-devel Currently ext3_get_block()/ext3_new_block() only allocate one block at a time. To allocate multiple blocks, the caller, for example, ext3 direct IO routine, has to invoke ext3_get_block() many times. This is quite inefficient for sequential IO workload. The benefit of a real get_blocks() include 1) increase the possibility to get contiguous blocks, reduce possibility of fragmentation due to interleaved allocations from other threads. (should good for non reservation case) 2) Reduces CPU cycles spent in repeated get_block() calls 3) Batch meta data update and journaling in one short 4) Could possibly speed up future get_blocks() look up by cache the last mapped blocks in inode. Multiple block allocation for ext3 is very useful for support delayed allocation, also useful for direct io. This effort is trying to fill this requirements on top of the current ext3 disk format. Given a file offset and maximum length of blocks to map, ext3_get_blocks() will find out or allocate the corresponding chunk of contiguous physical blocks on disk. It will return the buffer head mapped to the first physical block corresponding to the file offset, and the number of the contiguous blocks just mapped or allocated. First, search the [i,d,t]direct blocks indexed by the file offset to found out the physical block. In the case that it is allocated, then continue mapping the next logical block until the mapped physical block is dis-adjacent, or there is no corresponding physical block being allocated. In the case block allocation is needed, first, it walks through the inode's block mapping tree to count the number of adjacent blocks to allocate. 
Then it passes the number of blocks to allocate to ext3_new_blocks(), while the real block allocation is performed. After allocated the first block, ext3_new_blocks() will always attempt to allocate the next few(up to the requested size and not beyond the reservation window) adjacent blocks at the same time. Once the blocks are allocated and updated on bitmap, finally update those metadata blocks with those new blocks info and splice the whole branch into the block mapping tree. Partial read map and partial write map is not supported since there is no garantuee that the block allocation will allocate a contiguous physical block back. Cross indirect block boundary allocation is not supported due to the concern that it may cause metadata block far from it's data blocks. Below is just a first cut of proposed patch. Attach here just to show the effort/direction. Appreciate any comments/suggestions/critics! Thanks, Mingming --- linux-2.6.11-ming/fs/ext3/balloc.c | 89 ++++++--- linux-2.6.11-ming/fs/ext3/inode.c | 284 ++++++++++++++++++++++++++++-- linux-2.6.11-ming/fs/ext3/xattr.c | 3 linux-2.6.11-ming/include/linux/ext3_fs.h | 2 4 files changed, 333 insertions(+), 45 deletions(-) diff -puN fs/ext3/balloc.c~ext3_multiple_blocks_allocation fs/ext3/balloc.c --- linux-2.6.11/fs/ext3/balloc.c~ext3_multiple_blocks_allocation 2005-04-28 12:12:06.168475152 -0700 +++ linux-2.6.11-ming/fs/ext3/balloc.c 2005-04-28 12:12:06.186473049 -0700 @@ -652,9 +652,11 @@ claim_block(spinlock_t *lock, int block, */ static int ext3_try_to_allocate(struct super_block *sb, handle_t *handle, int group, - struct buffer_head *bitmap_bh, int goal, struct ext3_reserve_window *my_rsv) + struct buffer_head *bitmap_bh, int goal, unsigned long *count, + struct ext3_reserve_window *my_rsv) { int group_first_block, start, end; + unsigned long num = 0; /* we do allocation within the reservation window if we have a window */ if (my_rsv) { @@ -712,8 +714,18 @@ repeat: goto fail_access; goto repeat; } - return 
goal; + num++; + goal++; + while (num < *count && goal < end + && ext3_test_allocatable(goal, bitmap_bh) + && claim_block(sb_bgl_lock(EXT3_SB(sb), group), goal, bitmap_bh)) { + num++; + goal++; + } + *count = num; + return goal - num; fail_access: + *count = num; return -1; } @@ -1025,11 +1037,13 @@ static int ext3_try_to_allocate_with_rsv(struct super_block *sb, handle_t *handle, unsigned int group, struct buffer_head *bitmap_bh, int goal, struct ext3_reserve_window_node * my_rsv, - int *errp) + unsigned long *count, int *errp) { unsigned long group_first_block; int ret = 0; int fatal; + int original_goal=goal; + unsigned long num = *count; *errp = 0; @@ -1052,7 +1066,8 @@ ext3_try_to_allocate_with_rsv(struct sup * or last attempt to allocate a block with reservation turned on failed */ if (my_rsv == NULL ) { - ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, NULL); + ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, + count, NULL); goto out; } /* @@ -1094,13 +1109,17 @@ ext3_try_to_allocate_with_rsv(struct sup || (my_rsv->rsv_end < group_first_block)) BUG(); ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, - &my_rsv->rsv_window); + &num, &my_rsv->rsv_window); if (ret >= 0) { - my_rsv->rsv_alloc_hit++; + my_rsv->rsv_alloc_hit += num; + *count = num; break; /* succeed */ } + num = *count; } out: + if (my_rsv) + printk("Allocated blocks from %d ,count is %d, goal was %d, group is %d, inode is %, reservation window (%d,%d)\n", ret, *count, original_goal, group, my_rsv->rsv_start, my_rsv->rsv_end); if (ret >= 0) { BUFFER_TRACE(bitmap_bh, "journal_dirty_metadata for " "bitmap block"); @@ -1155,8 +1174,8 @@ int ext3_should_retry_alloc(struct super * bitmap, and then for any free bit if that fails. * This function also updates quota and i_blocks field. 
*/ -int ext3_new_block(handle_t *handle, struct inode *inode, - unsigned long goal, int *errp) +int ext3_new_blocks(handle_t *handle, struct inode *inode, + unsigned long goal, unsigned long* count, int *errp) { struct buffer_head *bitmap_bh = NULL; struct buffer_head *gdp_bh; @@ -1179,7 +1198,8 @@ int ext3_new_block(handle_t *handle, str static int goal_hits, goal_attempts; #endif unsigned long ngroups; - + unsigned long num = *count; + int i; *errp = -ENOSPC; sb = inode->i_sb; if (!sb) { @@ -1190,7 +1210,7 @@ int ext3_new_block(handle_t *handle, str /* * Check quota for allocation of this block. */ - if (DQUOT_ALLOC_BLOCK(inode, 1)) { + if (DQUOT_ALLOC_BLOCK(inode, num)) { *errp = -EDQUOT; return 0; } @@ -1245,7 +1265,7 @@ retry: if (!bitmap_bh) goto io_error; ret_block = ext3_try_to_allocate_with_rsv(sb, handle, group_no, - bitmap_bh, ret_block, my_rsv, &fatal); + bitmap_bh, ret_block, my_rsv, &num, &fatal); if (fatal) goto out; if (ret_block >= 0) @@ -1282,7 +1302,7 @@ retry: if (!bitmap_bh) goto io_error; ret_block = ext3_try_to_allocate_with_rsv(sb, handle, group_no, - bitmap_bh, -1, my_rsv, &fatal); + bitmap_bh, -1, my_rsv, &num, &fatal); if (fatal) goto out; if (ret_block >= 0) @@ -1317,14 +1337,17 @@ allocated: target_block = ret_block + group_no * EXT3_BLOCKS_PER_GROUP(sb) + le32_to_cpu(es->s_first_data_block); - if (target_block == le32_to_cpu(gdp->bg_block_bitmap) || - target_block == le32_to_cpu(gdp->bg_inode_bitmap) || - in_range(target_block, le32_to_cpu(gdp->bg_inode_table), - EXT3_SB(sb)->s_itb_per_group)) - ext3_error(sb, "ext3_new_block", - "Allocating block in system zone - " - "block = %u", target_block); - + for (i = 0; i < num; i++, target_block++) { + if (target_block == le32_to_cpu(gdp->bg_block_bitmap) || + target_block == le32_to_cpu(gdp->bg_inode_bitmap) || + in_range(target_block, le32_to_cpu(gdp->bg_inode_table), + EXT3_SB(sb)->s_itb_per_group)) { + ext3_error(sb, "ext3_new_block", + "Allocating block in system zone - " + "block = %u", 
target_block); + goto out; + } + } performed_allocation = 1; #ifdef CONFIG_JBD_DEBUG @@ -1342,10 +1365,12 @@ allocated: jbd_lock_bh_state(bitmap_bh); spin_lock(sb_bgl_lock(sbi, group_no)); if (buffer_jbd(bitmap_bh) && bh2jh(bitmap_bh)->b_committed_data) { - if (ext3_test_bit(ret_block, - bh2jh(bitmap_bh)->b_committed_data)) { - printk("%s: block was unexpectedly set in " - "b_committed_data\n", __FUNCTION__); + for (i = 0; i < num ; i++) { + if (ext3_test_bit(ret_block++, + bh2jh(bitmap_bh)->b_committed_data)) { + printk("%s: block was unexpectedly set in " + "b_committed_data\n", __FUNCTION__); + } } } ext3_debug("found bit %d\n", ret_block); @@ -1354,12 +1379,12 @@ allocated: #endif /* ret_block was blockgroup-relative. Now it becomes fs-relative */ - ret_block = target_block; + ret_block = target_block - num; - if (ret_block >= le32_to_cpu(es->s_blocks_count)) { + if (target_block - 1>= le32_to_cpu(es->s_blocks_count)) { ext3_error(sb, "ext3_new_block", - "block(%d) >= blocks count(%d) - " - "block_group = %d, es == %p ", ret_block, + "block(%d) >= fs blocks count(%d) - " + "block_group = %d, es == %p ", target_block - 1, le32_to_cpu(es->s_blocks_count), group_no, es); goto out; } @@ -1374,9 +1399,9 @@ allocated: spin_lock(sb_bgl_lock(sbi, group_no)); gdp->bg_free_blocks_count = - cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count) - 1); + cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count)-num); spin_unlock(sb_bgl_lock(sbi, group_no)); - percpu_counter_mod(&sbi->s_freeblocks_counter, -1); + percpu_counter_mod(&sbi->s_freeblocks_counter, -num); BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for group descriptor"); err = ext3_journal_dirty_metadata(handle, gdp_bh); @@ -1388,6 +1413,8 @@ allocated: goto out; *errp = 0; + DQUOT_FREE_BLOCK(inode, *count-num); + *count = num; brelse(bitmap_bh); return ret_block; @@ -1402,7 +1429,7 @@ out: * Undo the block allocation */ if (!performed_allocation) - DQUOT_FREE_BLOCK(inode, 1); + DQUOT_FREE_BLOCK(inode, *count); 
brelse(bitmap_bh); return 0; } diff -puN fs/ext3/inode.c~ext3_multiple_blocks_allocation fs/ext3/inode.c --- linux-2.6.11/fs/ext3/inode.c~ext3_multiple_blocks_allocation 2005-04-28 12:12:06.172474685 -0700 +++ linux-2.6.11-ming/fs/ext3/inode.c 2005-04-28 12:12:06.192472348 -0700 @@ -237,12 +237,12 @@ static int ext3_alloc_block (handle_t *h struct inode * inode, unsigned long goal, int *err) { unsigned long result; + unsigned long count = 1; - result = ext3_new_block(handle, inode, goal, err); + result = ext3_new_blocks(handle, inode, goal, &count, err); return result; } - typedef struct { __le32 *p; __le32 key; @@ -328,7 +328,7 @@ static int ext3_block_to_path(struct ino ext3_warning (inode->i_sb, "ext3_block_to_path", "block > big"); } if (boundary) - *boundary = (i_block & (ptrs - 1)) == (final - 1); + *boundary = final - 1 - (i_block & (ptrs - 1)); return n; } @@ -526,7 +526,7 @@ static int ext3_alloc_branch(handle_t *h /* * Get buffer_head for parent block, zero it out * and set the pointer to new one, then send - * parent to disk. + * parent to disk. */ bh = sb_getblk(inode->i_sb, parent); branch[n].bh = bh; @@ -566,6 +566,111 @@ static int ext3_alloc_branch(handle_t *h ext3_free_blocks(handle, inode, le32_to_cpu(branch[i].key), 1); return err; } +#define GBS_DEBUG 0 +static int ext3_alloc_splice_branch(handle_t *handle, struct inode *inode, + unsigned long num, unsigned long first_block, + int *offsets, Indirect *branch, int depth) +{ + int blocksize = inode->i_sb->s_blocksize; + int n = 0; + int err = 0; + int i; + unsigned long current_block; + struct buffer_head *bh; + + current_block = first_block; + branch[0].key = cpu_to_le32(current_block); + + if (GBS_DEBUG) + printk("block %d, branch[0].p :%x\n", current_block, branch[0].p); + for (n = 1; n < depth; n++) { + current_block++; + branch[n].key = cpu_to_le32(current_block); + + /* + * Get buffer_head for parent block, zero it out + * and set the pointer to new one, then send + * parent to disk. 
+	 */
+	bh = sb_getblk(inode->i_sb, current_block - 1);
+	branch[n].bh = bh;
+	lock_buffer(bh);
+	BUFFER_TRACE(bh, "call get_create_access");
+	err = ext3_journal_get_create_access(handle, bh);
+	if (err) {
+		unlock_buffer(bh);
+		brelse(bh);
+		goto failed;
+	}
+
+	memset(bh->b_data, 0, blocksize);
+	branch[n].p = (__le32*) bh->b_data + offsets[n];
+	*branch[n].p = branch[n].key;
+	/*
+	 * allocate blocks to the rest missing data blocks
+	 */
+	if (n == depth -1 ) {
+		i = 0;
+		while (current_block - first_block + 1 < num) {
+			*(branch[n].p + i + 1) = ++current_block;
+			i++;
+		}
+	}
+	BUFFER_TRACE(bh, "marking uptodate");
+	set_buffer_uptodate(bh);
+	unlock_buffer(bh);
+
+	BUFFER_TRACE(bh, "call ext3_journal_dirty_metadata");
+	err = ext3_journal_dirty_metadata(handle, bh);
+	if (err)
+		goto failed;
+	}
+
+	bh = branch[0].bh;
+	if (bh){
+		BUFFER_TRACE(bh, "call get_write_access");
+		err = ext3_journal_get_write_access(handle, bh);
+		if (err)
+			goto failed;
+	}
+
+	*(branch[0].p) = branch[0].key;
+
+	/* allocate blocks to the rest missing data blocks */
+	i = 0;
+	while (current_block - first_block + 1< num) {
+		*(branch[0].p + i + 1) = ++current_block;
+		i++;
+	}
+
+	if (GBS_DEBUG) {
+		for (i=0; i< n; i++)
+			printk("inode %x, branch[%d].p:%x, branch[%d].key:%d,\n", inode, i, branch[i].p, i, branch[i].key);
+		for (i=0; i< num-n; i++)
+			printk("inode %x, branch[%d].p + %d + 1:%x, *(branch[%d].p+%d+1):%d,\n, branch[%d].bh:%x\n", inode, n-1, i, branch[n-1].p + i +1, n-1, i, *(branch[n-1].p+i+1),n-1, branch[n-1].bh);
+	}
+	if (bh) {
+		BUFFER_TRACE(bh, "marking uptodate");
+		set_buffer_uptodate(bh);
+		unlock_buffer(bh);
+
+		BUFFER_TRACE(bh, "call ext3_journal_dirty_metadata");
+		err = ext3_journal_dirty_metadata(handle, bh);
+		if (err)
+			goto failed;
+	}
+
+	return err;
+failed:
+	/* Allocation failed, free what we already allocated */
+	for (i = 0; i < n ; i++) {
+		BUFFER_TRACE(branch[i].bh, "call journal_forget");
+		ext3_journal_forget(handle, branch[n].bh);
+	}
+	for (i = 0; i < num; i++)
+		ext3_free_blocks(handle, inode, first_block + i, 1);
+	return err;
+}
 
 /**
  * ext3_splice_branch - splice the allocated branch onto inode.
@@ -777,8 +882,153 @@ out:
 	return err;
 }
 
-static int ext3_get_block(struct inode *inode, sector_t iblock,
-		struct buffer_head *bh_result, int create)
+static int
+ext3_count_blocks_to_allocate(Indirect * branch, int k,
+		unsigned long maxblocks, int blocks_to_boundary)
+{
+	unsigned long count = 0;
+
+	if (k == 0) return 0;
+	/*
+	 * Simple case, [t,d]Indirect block(s) has not allocated yet
+	 * then it's clear blocks on that path have not allocated
+	 */
+	if (GBS_DEBUG)
+		printk("maxblocks: %d, k: %d, boundary : %d \n",maxblocks, k,
+			blocks_to_boundary);
+	if (k > 1) {
+		/* right now don't hanel cross boundary allocation */
+		if ((maxblocks - count) < blocks_to_boundary)
+			count += maxblocks;
+		else
+			count += blocks_to_boundary;
+		count += k - 1;	/* blocks for [t,d]indirect blocks */
+		return count;
+	}
+
+	count++;
+	while (count < maxblocks && count <= blocks_to_boundary
+		&& *(branch[0].p + count) == 0) {
+		count++;
+	}
+	return count;
+}
+static int
+ext3_get_blocks_handle(handle_t *handle, struct inode *inode, sector_t iblock,
+		unsigned long *maxblocks, struct buffer_head *bh_result,
+		int create, int extend_disksize)
+{
+	int err = -EIO;
+	int offsets[4];
+	Indirect chain[4];
+	Indirect *partial = NULL;
+	unsigned long goal;
+	int left;
+	int blocks_to_boundary = 0;
+	int depth;
+	struct ext3_inode_info *ei = EXT3_I(inode);
+	unsigned long count = 0;
+	unsigned long first_block;
+	struct ext3_block_alloc_info *block_i;
+
+	J_ASSERT(handle != NULL || create == 0);
+
+	if (GBS_DEBUG)
+		printk("ext3_get_blocks_handle: maxblocks= %d, iblock = %d\n", *maxblocks, iblock);
+	down(&ei->truncate_sem);
+	depth = ext3_block_to_path(inode, iblock, offsets, &blocks_to_boundary);
+	if (depth == 0)
+		goto out;
+	partial = ext3_get_branch(inode, depth,
+				offsets, chain, &err);
+	/* Simplest case - block found */
+	if (!err && !partial) {
+		first_block = chain[depth-1].key;
+		clear_buffer_new(bh_result);
+		count ++;
+		/* map more blocks */
+		while (count < *maxblocks && count <= blocks_to_boundary
+			&& (*(chain[depth-1].p+count) == first_block + count)) {
+			count ++;
+		}
+		up(&ei->truncate_sem);
+		goto got_it;
+	}
+	/* got mapped blocks or plain lookup or failed read of indirect block */
+	if (!create || err == -EIO){
+		up(&ei->truncate_sem);
+		goto out;
+	}
+	/* Okey, we need to do block allocation */
+	/* lazy initialize the block allocation info here if necessary */
+	if (S_ISREG(inode->i_mode) && (!ei->i_block_alloc_info)) {
+		ext3_init_block_alloc_info(inode);
+	}
+
+	goal = ext3_find_goal(inode, iblock, chain, partial);
+
+	/* number of missing meta data blocks need to allocate for this branch */
+	left = chain + depth - partial;
+	count = ext3_count_blocks_to_allocate(partial, left, *maxblocks, blocks_to_boundary);
+	if (GBS_DEBUG)
+		printk("blocks to allocate: %d\n", count);
+
+	first_block = ext3_new_blocks(handle, inode, goal, &count, &err);
+	if (GBS_DEBUG)
+		printk("blocks allocated(%d,%d,%d)\n", (int)iblock, (int)first_block, count);
+
+	if (!err)
+		err = ext3_alloc_splice_branch(handle, inode, count,
+				first_block, offsets+(partial-chain), partial, left);
+	if (err) {
+		up(&ei->truncate_sem);
+		goto cleanup;
+	}
+	/* i_disksize growing is protected by truncate_sem
+	 * don't forget to protect it if you're about to implement
+	 * concurrent ext3_get_block() -bzzz */
+	if (extend_disksize && inode->i_size > ei->i_disksize)
+		ei->i_disksize = inode->i_size;
+	/*
+	 * update the most recently allocated logical & physical block
+	 * in i_block_alloc_info, to assist find the proper goal block for next
+	 * allocation
+	 */
+	block_i = ei->i_block_alloc_info;
+	if (block_i) {
+		block_i->last_alloc_logical_block = iblock + count - left;
+		block_i->last_alloc_physical_block = first_block + count - 1;
+	}
+
+	inode->i_ctime = CURRENT_TIME_SEC;
+	ext3_mark_inode_dirty(handle, inode);
+
+	up(&ei->truncate_sem);
+	if (err)
+		goto cleanup;
+
+	set_buffer_new(bh_result);
+got_it:
+	map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key));
+	if (blocks_to_boundary == 0)
+		set_buffer_boundary(bh_result);
+	/* Clean up and exit */
+	partial = chain+depth-1; /* the whole chain */
+cleanup:
+	while (partial > chain) {
+		BUFFER_TRACE(partial->bh, "call brelse");
+		brelse(partial->bh);
+		partial--;
+	}
+	BUFFER_TRACE(bh_result, "returned");
+out:
+	*maxblocks = count;
+	return err;
+}
+
+static int ext3_get_blocks(struct inode *inode, sector_t iblock,
+		unsigned long maxblocks, struct buffer_head *bh_result,
+		int create)
 {
 	handle_t *handle = NULL;
 	int ret;
@@ -787,15 +1037,21 @@ static int ext3_get_block(struct inode *
 		handle = ext3_journal_current_handle();
 		J_ASSERT(handle != 0);
 	}
-	ret = ext3_get_block_handle(handle, inode, iblock,
+	ret = ext3_get_blocks_handle(handle, inode, iblock, &maxblocks,
 				bh_result, create, 1);
-	return ret;
+	bh_result->b_size = (maxblocks << inode->i_blkbits);
+	return ret;
+}
+
+static int ext3_get_block(struct inode *inode, sector_t iblock,
+		struct buffer_head *bh_result, int create)
+{
+	return ext3_get_blocks(inode, iblock, 1, bh_result, create);
 }
 
 #define DIO_CREDITS (EXT3_RESERVE_TRANS_BLOCKS + 32)
 
-static int
-ext3_direct_io_get_blocks(struct inode *inode, sector_t iblock,
+static int ext3_direct_io_get_blocks(struct inode *inode, sector_t iblock,
 		unsigned long max_blocks, struct buffer_head *bh_result,
 		int create)
 {
@@ -831,10 +1087,14 @@ ext3_direct_io_get_blocks(struct inode *
 	}
 
 get_block:
+	if (GBS_DEBUG)
+		printk("Calling ext3_get_blocks_handle from dio: maxblocks= %d, iblock = %d", (int)max_blocks, (int)iblock);
 	if (ret == 0)
-		ret = ext3_get_block_handle(handle, inode, iblock,
+		ret = ext3_get_blocks_handle(handle, inode, iblock, &max_blocks,
 				bh_result, create, 0);
-	bh_result->b_size = (1 << inode->i_blkbits);
+	bh_result->b_size = (max_blocks << inode->i_blkbits);
+	if (GBS_DEBUG)
+		printk("ext3_get_blocks_handle returns to dio: maxblocks= %d, iblock = %d", (int)max_blocks, (int)iblock);
 	return ret;
 }
 
diff -puN fs/ext3/xattr.c~ext3_multiple_blocks_allocation fs/ext3/xattr.c
--- linux-2.6.11/fs/ext3/xattr.c~ext3_multiple_blocks_allocation	2005-04-28 12:12:06.177474100 -0700
+++ linux-2.6.11-ming/fs/ext3/xattr.c	2005-04-28 12:12:06.194472114 -0700
@@ -796,7 +796,8 @@ inserted:
 				EXT3_SB(sb)->s_es->s_first_data_block) +
 				EXT3_I(inode)->i_block_group *
 				EXT3_BLOCKS_PER_GROUP(sb);
-			int block = ext3_new_block(handle, inode, goal, &error);
+			unsigned long count = 1;
+			int block = ext3_new_blocks(handle, inode, goal, &count, &error);
 			if (error)
 				goto cleanup;
 			ea_idebug(inode, "creating block %d", block);
diff -puN include/linux/ext3_fs.h~ext3_multiple_blocks_allocation include/linux/ext3_fs.h
--- linux-2.6.11/include/linux/ext3_fs.h~ext3_multiple_blocks_allocation	2005-04-28 12:12:06.180473750 -0700
+++ linux-2.6.11-ming/include/linux/ext3_fs.h	2005-04-28 12:12:06.196471880 -0700
@@ -714,7 +714,7 @@ struct dir_private_info {
 /* balloc.c */
 extern int ext3_bg_has_super(struct super_block *sb, int group);
 extern unsigned long ext3_bg_num_gdb(struct super_block *sb, int group);
-extern int ext3_new_block (handle_t *, struct inode *, unsigned long, int *);
+extern int ext3_new_blocks (handle_t *, struct inode *, unsigned long, unsigned long*, int *);
 extern void ext3_free_blocks (handle_t *, struct inode *, unsigned long,
 			      unsigned long);
 extern void ext3_free_blocks_sb (handle_t *, struct super_block *,
_

^ permalink raw reply	[flat|nested] 11+ messages in thread
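[Editorial sketch] The fast path that the patch above adds to ext3_get_blocks_handle() — once the first requested block is found already mapped, keep counting as long as the block numbers recorded in the indirect block stay physically contiguous — can be modeled outside the kernel. All names and parameters below are made up for illustration; the real function additionally handles locking, little-endian conversion, and buffer-head state.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative model (not the ext3 code) of the "map more blocks" loop:
 * given the array of physical block numbers an indirect block holds,
 * count how many entries starting at `idx` are contiguous on disk,
 * without exceeding the caller's maxblocks or the indirect-block
 * boundary.
 */
static unsigned long count_contiguous(const uint32_t *map, unsigned long idx,
                                      unsigned long maxblocks,
                                      unsigned long blocks_to_boundary)
{
    uint32_t first = map[idx];      /* physical block of the first hit */
    unsigned long count = 1;

    while (count < maxblocks && count <= blocks_to_boundary &&
           map[idx + count] == first + count)
        count++;
    return count;
}
```

With a map of {100, 101, 102, 200, 201}, a lookup starting at index 0 reports 3 contiguous blocks, so one call can satisfy three block requests that previously needed three separate lookups.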
* [PATCH 0/5] multiple block allocation to current ext3
  2005-04-28 19:14 ` [RFC] Adding multiple block allocation to current ext3 Mingming Cao
@ 2006-01-10 23:26 ` Mingming Cao
  2006-01-11  5:25 ` Andrew Morton
  0 siblings, 1 reply; 11+ messages in thread
From: Mingming Cao @ 2006-01-10 23:26 UTC (permalink / raw)
To: Andrew Morton; +Cc: hch, pbadari, Stephen C. Tweedie, linux-kernel, ext2-devel

Currently ext3_get_block() only maps or allocates one block at a time.
This is quite inefficient for sequential IO workloads.

I previously posted a simple implementation of multiple block mapping
and allocation for current ext3. The basic idea is to allocate the
first block in the existing way, then attempt to allocate the next
adjacent blocks on a best-effort basis. More description of the
implementation can be found here:

http://marc.theaimsgroup.com/?l=ext2-devel&m=112162230003522&w=2

The following is the latest version of the patch: the original patch is
broken into 5 patches, some logic has been reworked, and some bugs have
been fixed. The breakdown is:

[patch 1] Add mapping of multiple blocks at a time in ext3_get_blocks()
[patch 2] Extend ext3_get_blocks() to support multiple block allocation
[patch 3] Implement multiple block allocation in ext3-try-to-allocate
          (called via ext3_new_block()).
[patch 4] Proper accounting updates in ext3_new_blocks()
[patch 5] Adjust the reservation window size properly (by the given
          number of blocks to allocate) before block allocation, to
          increase the chance of allocating multiple blocks in a
          single call.

Tests done so far include fsx, tiobench and dbench. The following
numbers, collected from Direct IO tests (1G file creation/read), show
that system time has been greatly reduced (by more than 50% on my 8-cpu
system) with the patches.

1G file DIO write:
        2.6.15          2.6.15+patches
real    0m31.275s       0m31.161s
user    0m0.000s        0m0.000s
sys     0m3.384s        0m0.564s

1G file DIO read:
        2.6.15          2.6.15+patches
real    0m30.733s       0m30.624s
user    0m0.000s        0m0.004s
sys     0m0.748s        0m0.380s

Some previous tests we did on buffered IO, using multiple block
allocation together with delayed allocation, showed a noticeable
improvement in throughput and system time.

Patches are against 2.6.15. Please consider adding them to the -mm tree.

Thanks!

Mingming

^ permalink raw reply	[flat|nested] 11+ messages in thread
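[Editorial sketch] The best-effort idea described in the posting above — allocate the first block in the usual way, then keep taking adjacent free blocks until the request is satisfied or a used block is hit — can be sketched against a toy one-byte-per-block bitmap. The flat bitmap and the alloc_blocks() name are assumptions made for this sketch; the real ext3_new_blocks() works on on-disk bitmap bits, under journaling and reservation windows.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative model of best-effort multiple block allocation: find the
 * first free block at or after `goal`, then greedily claim adjacent free
 * blocks until *count blocks are taken or a used block stops the run.
 * Returns the first allocated block, or -1 if the group is full;
 * *count is updated to how many blocks were actually obtained.
 */
static long alloc_blocks(unsigned char *bitmap, size_t nblocks,
                         size_t goal, unsigned long *count)
{
    size_t first, i;
    unsigned long got = 0;

    for (first = goal; first < nblocks && bitmap[first]; first++)
        ;                           /* find the first free block */
    if (first == nblocks) {
        *count = 0;
        return -1;                  /* no space left */
    }
    for (i = first; i < nblocks && !bitmap[i] && got < *count; i++) {
        bitmap[i] = 1;              /* grab adjacent free blocks */
        got++;
    }
    *count = got;                   /* may be fewer than requested */
    return (long)first;
}
```

Asking for 8 blocks from a bitmap {1,0,0,0,1,0} yields first block 1 and only 3 blocks, illustrating the "best effort" part: the caller must cope with a short allocation.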
* Re: [PATCH 0/5] multiple block allocation to current ext3
  2006-01-10 23:26 ` [PATCH 0/5] " Mingming Cao
@ 2006-01-11  5:25 ` Andrew Morton
  2006-01-11 19:17 ` Mingming Cao
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2006-01-11 5:25 UTC (permalink / raw)
To: cmm; +Cc: hch, pbadari, sct, linux-kernel, ext2-devel

Mingming Cao <cmm@us.ibm.com> wrote:
>
> Tests done so far includes fsx,tiobench and dbench. The following
> numbers collected from Direct IO tests (1G file creation/read) shows
> the system time have been greatly reduced (more than 50% on my 8 cpu
> system) with the patches.
>
> 1G file DIO write:
>         2.6.15          2.6.15+patches
> real    0m31.275s       0m31.161s
> user    0m0.000s        0m0.000s
> sys     0m3.384s        0m0.564s
>
>
> 1G file DIO read:
>         2.6.15          2.6.15+patches
> real    0m30.733s       0m30.624s
> user    0m0.000s        0m0.004s
> sys     0m0.748s        0m0.380s
>
> Some previous test we did on buffered IO with using multiple blocks
> allocation and delayed allocation shows noticeable improvement on
> throughput and system time.

I'd be interested in seeing benchmark results for the common
allocate-one-block case - just normal old buffered IO without any
additional multiblock patches.  Would they show any regression?

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH 0/5] multiple block allocation to current ext3
  2006-01-11  5:25 ` Andrew Morton
@ 2006-01-11 19:17 ` Mingming Cao
  2006-01-11 19:43 ` Andrew Morton
  0 siblings, 1 reply; 11+ messages in thread
From: Mingming Cao @ 2006-01-11 19:17 UTC (permalink / raw)
To: Andrew Morton; +Cc: hch, pbadari, sct, linux-kernel, ext2-devel

On Tue, 2006-01-10 at 21:25 -0800, Andrew Morton wrote:
> Mingming Cao <cmm@us.ibm.com> wrote:
> >
> > Tests done so far includes fsx,tiobench and dbench. The following
> > numbers collected from Direct IO tests (1G file creation/read) shows
> > the system time have been greatly reduced (more than 50% on my 8 cpu
> > system) with the patches.
> >
> > 1G file DIO write:
> >         2.6.15          2.6.15+patches
> > real    0m31.275s       0m31.161s
> > user    0m0.000s        0m0.000s
> > sys     0m3.384s        0m0.564s
> >
> >
> > 1G file DIO read:
> >         2.6.15          2.6.15+patches
> > real    0m30.733s       0m30.624s
> > user    0m0.000s        0m0.004s
> > sys     0m0.748s        0m0.380s
> >
> > Some previous test we did on buffered IO with using multiple blocks
> > allocation and delayed allocation shows noticeable improvement on
> > throughput and system time.
>
> I'd be interested in seeing benchmark results for the common
> allocate-one-block case - just normal old buffered IO without any
> additional multiblock patches.  Would they show any regression?
>

Hi Andrew,

One thing I want to clarify: for buffered IO, even with the multiple
block patches, ext3 currently still allocates one block at a time. (We
will need delayed allocation to make use of multiple block allocation.)

I ran the same test on buffered IO, with and without the patches. There
is very little regression (less than 1%, which could be considered
noise) compared with the 2.6.15 kernel without the patches:

buffered IO write: (no sync)
# time ./filetst -b 1048576 -w -f /mnt/a
        2.6.15          2.6.15+patches
real    0m25.773s       0m26.102s
user    0m0.004s        0m0.000s
sys     0m15.065s       0m16.053s

buffered IO read (after umount/remount)
# time ./filetst -b 1048576 -r -f /mnt/a
        2.6.15          2.6.15+patches
real    0m29.257s       0m29.257s
user    0m0.000s        0m0.000s
sys     0m6.996s        0m6.980s

But I do notice a regression between the vanilla 2.6.14 and vanilla
2.6.15 kernels on buffered IO (about 18%):

# time ./filetst -b 1048576 -w -f /mnt/a
        2.6.14          2.6.15
real    0m21.710s       0m25.773s
user    0m0.012s        0m0.004s
sys     0m14.569s       0m15.065s

I also found that tiobench (sequential write test) and dbench show a
similar regression between 2.6.14 and 2.6.15. Actually, 2.6.15-rc2
already has the regression. Is this a known issue?

Anyway, I will continue looking at the issue...

Thanks,

Mingming

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH 0/5] multiple block allocation to current ext3
  2006-01-11 19:17 ` Mingming Cao
@ 2006-01-11 19:43 ` Andrew Morton
  2006-01-14  1:12 ` Fall back io scheduler for 2.6.15? Mingming Cao
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2006-01-11 19:43 UTC (permalink / raw)
To: cmm; +Cc: hch, pbadari, sct, linux-kernel, ext2-devel

Mingming Cao <cmm@us.ibm.com> wrote:
>
> # time ./filetst -b 1048576 -w -f /mnt/a
>         2.6.14          2.6.15
> real    0m21.710s       0m25.773s
> user    0m0.012s        0m0.004s
> sys     0m14.569s       0m15.065s

That's a big drop.

Was it doing I/O, or was it all from pagecache?

> I also found tiobench(sequential write test) and dbench has similar
> regression between 2.6.14 and 2.6.15. Actually I found 2.6.15 rc2
> already has the regression. Is this a known issue?

No, it is not known.

> Anyway I will continue looking at the issue...

Thanks.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Fall back io scheduler for 2.6.15?
  2006-01-11 19:43 ` Andrew Morton
@ 2006-01-14  1:12 ` Mingming Cao
  2006-01-14  1:49 ` Andrew Morton
  0 siblings, 1 reply; 11+ messages in thread
From: Mingming Cao @ 2006-01-14 1:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: Seetharami Seelam, linux-kernel, ext2-devel

On Wed, 2006-01-11 at 11:43 -0800, Andrew Morton wrote:
> Mingming Cao <cmm@us.ibm.com> wrote:
> >
> > # time ./filetst -b 1048576 -w -f /mnt/a
> >         2.6.14          2.6.15
> > real    0m21.710s       0m25.773s
> > user    0m0.012s        0m0.004s
> > sys     0m14.569s       0m15.065s
>
> That's a big drop.
>
> Was it doing I/O, or was it all from pagecache?
>
> > I also found tiobench(sequential write test) and dbench has similar
> > regression between 2.6.14 and 2.6.15. Actually I found 2.6.15 rc2
> > already has the regression. Is this a known issue?
>
> No, it is not known.
>
> > Anyway I will continue looking at the issue...
>
> Thanks.

Hi Andrew,

I did some tracing, and it turns out there is no regression between
2.6.14 and 2.6.15 after all, and no problem in the ext3 filesystem. I
was comparing apples to oranges: the tests were run on two different io
schedulers. That is what produced the bogus throughput difference I
reported to you earlier this week.

I gave the same boot option "elevator=as" to both 2.6.14 and 2.6.15-rc2
(this has worked for me for a long time as the way to select the
anticipatory scheduler), but the io schedulers actually in effect on
the two kernels were different (see elevator_setup_default()). On
2.6.14, the fallback io scheduler (used if the chosen io scheduler is
not found) is the default io scheduler (anticipatory, in this case),
but since 2.6.15-rc1 the semantics have changed to fall back to noop.

Is there any reason to fall back to noop instead of as? Anticipatory
seems much better than noop for ext3 with large sequential write tests
(e.g., a 1G dd test)...

Thanks,

Mingming

^ permalink raw reply	[flat|nested] 11+ messages in thread
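[Editorial sketch] The semantic change under discussion — what a boot-time elevator= request resolves to when the named scheduler is unavailable — can be modeled with a small lookup. The string table and pick_elevator() below are illustrative stand-ins, not the kernel's API; elevator_setup_default() walks its registry of built-in elevator types rather than a string array.

```c
#include <assert.h>
#include <string.h>

/*
 * Illustrative model of the fallback semantics: resolve a requested
 * scheduler name against the set that was compiled in.  On 2.6.14 an
 * unknown request fell back to the compiled-in default; since
 * 2.6.15-rc1 it falls back to "noop" (modeled by the flag).
 */
static const char *pick_elevator(const char *requested,
                                 const char *const *available, int navail,
                                 const char *compiled_default,
                                 int fallback_to_default)
{
    int i;

    if (!requested || !requested[0])
        return compiled_default;    /* nothing requested at boot */
    for (i = 0; i < navail; i++)
        if (!strcmp(requested, available[i]))
            return available[i];    /* requested scheduler is built in */
    return fallback_to_default ? compiled_default : "noop";
}
```

This also captures the "as" versus "anticipatory" rename that bit users here: with the 2.6.15 table the name "as" no longer matches anything, so an elevator=as boot silently lands on noop.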
* Re: Fall back io scheduler for 2.6.15?
  2006-01-14  1:12 ` Fall back io scheduler for 2.6.15? Mingming Cao
@ 2006-01-14  1:49 ` Andrew Morton
  2006-01-14  5:22 ` Dave Jones
  2006-01-16  8:43 ` Jens Axboe
  0 siblings, 2 replies; 11+ messages in thread
From: Andrew Morton @ 2006-01-14 1:49 UTC (permalink / raw)
To: cmm; +Cc: seelam, linux-kernel, ext2-devel, Jens Axboe

Mingming Cao <cmm@us.ibm.com> wrote:
>
> On 2.6.14, the
> fall back io scheduler (if the chosen io scheduler is not found) is set
> to the default io scheduler (anticipatory, in this case), but since
> 2.6.15-rc1, this semanistic is changed to fall back to noop.

OK.  And I assume that AS wasn't compiled, so that's why it fell back?

I actually thought that elevator= got removed, now we have
/sys/block/sda/queue/scheduler.  But I guess that's not very useful with
CONFIG_SYSFS=n.

> Is there any reason to fall back to noop instead of as? It seems
> anticipatory is much better than noop for ext3 with large sequential
> write tests (i.e, 1G dd test) ...

I suspect that was an accident.  Jens?

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: Fall back io scheduler for 2.6.15?
  2006-01-14  1:49 ` Andrew Morton
@ 2006-01-14  5:22 ` Dave Jones
  0 siblings, 0 replies; 11+ messages in thread
From: Dave Jones @ 2006-01-14 5:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: cmm, seelam, linux-kernel, ext2-devel, Jens Axboe

On Fri, Jan 13, 2006 at 05:49:14PM -0800, Andrew Morton wrote:
> Mingming Cao <cmm@us.ibm.com> wrote:
> >
> > On 2.6.14, the
> > fall back io scheduler (if the chosen io scheduler is not found) is set
> > to the default io scheduler (anticipatory, in this case), but since
> > 2.6.15-rc1, this semanistic is changed to fall back to noop.
>
> OK.  And I assume that AS wasn't compiled, so that's why it fell back?
>
> I actually thought that elevator= got removed, now we have
> /sys/block/sda/queue/scheduler.  But I guess that's not very useful with
> CONFIG_SYSFS=n.

It's also a lifesaver if the default scheduler happens to trigger some
breakage preventing boot, and you can tell users to work around it with
a boot parameter until the real problem is fixed (which has bitten us
twice now, as Fedora, like several other distros, chooses CFQ by
default).

		Dave

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: Fall back io scheduler for 2.6.15?
  2006-01-14  1:49 ` Andrew Morton
  2006-01-14  5:22 ` Dave Jones
@ 2006-01-16  8:43 ` Jens Axboe
  2006-01-19 19:37 ` Nate Diller
  1 sibling, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2006-01-16 8:43 UTC (permalink / raw)
To: Andrew Morton; +Cc: cmm, seelam, linux-kernel, ext2-devel

On Fri, Jan 13 2006, Andrew Morton wrote:
> Mingming Cao <cmm@us.ibm.com> wrote:
> >
> > On 2.6.14, the
> > fall back io scheduler (if the chosen io scheduler is not found) is set
> > to the default io scheduler (anticipatory, in this case), but since
> > 2.6.15-rc1, this semanistic is changed to fall back to noop.
>
> OK.  And I assume that AS wasn't compiled, so that's why it fell back?
>
> I actually thought that elevator= got removed, now we have
> /sys/block/sda/queue/scheduler.  But I guess that's not very useful with
> CONFIG_SYSFS=n.
>
> > Is there any reason to fall back to noop instead of as? It seems
> > anticipatory is much better than noop for ext3 with large sequential
> > write tests (i.e, 1G dd test) ...
>
> I suspect that was an accident.  Jens?

It is; it makes more sense to fall back to the default, of course.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: Fall back io scheduler for 2.6.15?
  2006-01-16  8:43 ` Jens Axboe
@ 2006-01-19 19:37 ` Nate Diller
  2006-01-20  8:10 ` Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Nate Diller @ 2006-01-19 19:37 UTC (permalink / raw)
To: Jens Axboe; +Cc: Andrew Morton, cmm, seelam, linux-kernel, ext2-devel

On 1/16/06, Jens Axboe <axboe@suse.de> wrote:
> On Fri, Jan 13 2006, Andrew Morton wrote:
> > Mingming Cao <cmm@us.ibm.com> wrote:
> > >
> > > On 2.6.14, the
> > > fall back io scheduler (if the chosen io scheduler is not found) is set
> > > to the default io scheduler (anticipatory, in this case), but since
> > > 2.6.15-rc1, this semanistic is changed to fall back to noop.
> >
> > OK.  And I assume that AS wasn't compiled, so that's why it fell back?
> >
> > I actually thought that elevator= got removed, now we have
> > /sys/block/sda/queue/scheduler.  But I guess that's not very useful with
> > CONFIG_SYSFS=n.
> >
> > > Is there any reason to fall back to noop instead of as? It seems
> > > anticipatory is much better than noop for ext3 with large sequential
> > > write tests (i.e, 1G dd test) ...
> >
> > I suspect that was an accident.  Jens?
>
> It is, it makes more sense to fallback to the default of course.

Not an accident at all, actually, because the original patch I
submitted allowed you to select a scheduler as 'default' even if it
were compiled as a module in kconfig.  Since noop is guaranteed to be
present in any system, it is the obvious choice if the chosen or
default scheduler is not loaded.

If you change it to fall back to the default, it will oops if the
default is not available.

NATE

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: Fall back io scheduler for 2.6.15?
  2006-01-19 19:37 ` Nate Diller
@ 2006-01-20  8:10 ` Jens Axboe
  0 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2006-01-20 8:10 UTC (permalink / raw)
To: Nate Diller; +Cc: Andrew Morton, cmm, seelam, linux-kernel, ext2-devel

On Thu, Jan 19 2006, Nate Diller wrote:
> On 1/16/06, Jens Axboe <axboe@suse.de> wrote:
> > On Fri, Jan 13 2006, Andrew Morton wrote:
> > > Mingming Cao <cmm@us.ibm.com> wrote:
> > > >
> > > > On 2.6.14, the
> > > > fall back io scheduler (if the chosen io scheduler is not found) is set
> > > > to the default io scheduler (anticipatory, in this case), but since
> > > > 2.6.15-rc1, this semanistic is changed to fall back to noop.
> > >
> > > OK.  And I assume that AS wasn't compiled, so that's why it fell back?
> > >
> > > I actually thought that elevator= got removed, now we have
> > > /sys/block/sda/queue/scheduler.  But I guess that's not very useful with
> > > CONFIG_SYSFS=n.
> > >
> > > > Is there any reason to fall back to noop instead of as? It seems
> > > > anticipatory is much better than noop for ext3 with large sequential
> > > > write tests (i.e, 1G dd test) ...
> > >
> > > I suspect that was an accident.  Jens?
> >
> > It is, it makes more sense to fallback to the default of course.
>
> Not an accident at all, actually, because the original patch i
> submitted allowed you to select a scheduler as 'default' even if it
> were compiled as a module in kconfig.  Since noop is guaranteed to be
> present in any system, it is the obvious choice if the chosen or
> default scheduler is not loaded.

Yes and that was a bug in that patch.  The default scheduler must be
builtin, that's a given.  The Kconfig rules should make a default
selection as a module illegal.  And they do, they have since been
fixed.

> If you change it to fall back to the default, it will oops if the
> default is not available.

It must be.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 11+ messages in thread
end of thread, other threads:[~2006-01-21 11:42 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-01-14 16:10 Fall back io scheduler for 2.6.15? Chuck Ebbert
2006-01-16  8:43 ` Jens Axboe
2006-01-19 19:38   ` Nate Diller
2006-01-21  6:14   ` Tejun Heo
2006-01-21 11:44     ` Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2005-04-05  3:51 ext3 allocate-with-reservation latencies Lee Revell
2005-04-07 13:08 ` Stephen C. Tweedie
2005-04-07 23:37   ` Mingming Cao
2005-04-08 14:40     ` Stephen C. Tweedie
2005-04-08 18:10       ` Mingming Cao
2005-04-11 11:48         ` Stephen C. Tweedie
2005-04-11 18:38           ` Mingming Cao
2005-04-11 19:57             ` Stephen C. Tweedie
2005-04-12  6:41               ` Mingming Cao
2005-04-12 11:18                 ` Stephen C. Tweedie
2005-04-12 23:27                   ` Mingming Cao
2005-04-13 10:29                     ` Stephen C. Tweedie
2005-04-22 22:10                       ` [RFC][PATCH] Reduce ext3 allocate-with-reservation lock latencies Mingming Cao
2005-04-28  3:45                         ` Lee Revell
2005-04-28 19:14                         ` [RFC] Adding multiple block allocation to current ext3 Mingming Cao
2006-01-10 23:26                           ` [PATCH 0/5] " Mingming Cao
2006-01-11  5:25                             ` Andrew Morton
2006-01-11 19:17                               ` Mingming Cao
2006-01-11 19:43                                 ` Andrew Morton
2006-01-14  1:12                                   ` Fall back io scheduler for 2.6.15? Mingming Cao
2006-01-14  1:49                                     ` Andrew Morton
2006-01-14  5:22                                       ` Dave Jones
2006-01-16  8:43                                       ` Jens Axboe
2006-01-19 19:37                                         ` Nate Diller
2006-01-20  8:10                                           ` Jens Axboe
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).