Hello all,

[ Short version: the attached patch should fix io latencies in 2.4.21.
Please review and/or give it a try. ]
 
My last set of patches was directed at reducing the latencies in
__get_request_wait, which really helped reduce stalls when you had lots
of io to one device and balance_dirty() was causing pauses while you
tried to do io to other devices.

But, a streaming write could still starve reads to the same device,
mostly because the read would have to send down any huge merged writes
that were before it in the queue.

Andrea's kernel has a fix for this too: he limits the total number of
sectors that can be in the request queue at any given time.  But, his
patches change blk_finished_io, both in the arguments it takes and the
side effects of calling it.  I don't think we can merge his patches in
their current form without breaking external drivers.

So, I added a can_throttle flag to the queue struct.  Drivers can enable
it if they are going to call the new blk_started_sectors and
blk_finished_sectors funcs any time they call blk_{started,finished}_io.
The new funcs do all the -aa style sector throttling.
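
For anyone who wants the driver contract spelled out, here is a rough
userspace model of the accounting.  It is only a sketch, not code from
the patch: struct throttle_queue, every field other than can_throttle,
and the tq_* helpers are names made up for illustration, and the real
blk_started_sectors and blk_finished_sectors may well take different
arguments.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the request queue; only the fields needed
 * to show the accounting are modelled. */
struct throttle_queue {
    bool can_throttle;      /* driver promises to call both sector hooks */
    int  sectors_in_flight; /* sectors started but not yet finished */
    int  max_sectors;       /* throttle limit, in sectors (assumed field) */
};

/* Model of blk_started_sectors: count sectors handed to the driver. */
static void tq_started_sectors(struct throttle_queue *q, int count)
{
    if (!q->can_throttle)
        return;             /* unconverted driver: no accounting at all */
    q->sectors_in_flight += count;
}

/* Model of blk_finished_sectors: count sectors the driver completed. */
static void tq_finished_sectors(struct throttle_queue *q, int count)
{
    if (!q->can_throttle)
        return;
    assert(q->sectors_in_flight >= count);
    q->sectors_in_flight -= count;
}

/* Would a new request have to wait for in-flight sectors to drain? */
static bool tq_must_throttle(const struct throttle_queue *q)
{
    return q->can_throttle && q->sectors_in_flight >= q->max_sectors;
}
```

The point of the flag in this model is that a driver which never sets
can_throttle never touches the counter, so it can never be throttled by
mistake and needs no source changes.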

I made a few other small changes to Andrea's patch.  He wasn't
setting q->full when get_request decided there were too many sectors in
flight, which resulted in large latencies in __get_request_wait.  He was
also unconditionally clearing q->full in blkdev_release_request; my code
only clears q->full when all the waiters are gone.
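
The two q->full rules above can be modelled in a few lines of userspace
C.  This is a sketch under assumptions, not the patch itself: struct
full_queue, its waiters counter and the fq_* helpers are invented for
illustration, and I am assuming that a set q->full makes every new
caller wait behind the tasks already sleeping.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the queue state relevant to q->full. */
struct full_queue {
    int  sectors_in_flight;
    int  max_sectors; /* sector limit (assumed field name) */
    int  waiters;     /* tasks sleeping in the model's __get_request_wait */
    bool full;        /* models q->full */
};

/* Model of get_request: returns true if the caller gets a request.
 * Hitting the sector limit sets full, so later callers also back off
 * instead of racing past the tasks that are already waiting. */
static bool fq_get_request(struct full_queue *q, int nr_sectors)
{
    if (q->full || q->sectors_in_flight >= q->max_sectors) {
        q->full = true;
        return false;
    }
    q->sectors_in_flight += nr_sectors;
    return true;
}

/* Model of blkdev_release_request: full is cleared only once nobody
 * is left waiting, never unconditionally. */
static void fq_release_request(struct full_queue *q, int nr_sectors)
{
    assert(q->sectors_in_flight >= nr_sectors);
    q->sectors_in_flight -= nr_sectors;
    if (q->waiters == 0)
        q->full = false;
}
```

In the model, a release that happens while someone is still waiting
leaves full set, so a newcomer calling fq_get_request is still turned
away even though sectors have drained.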

I changed generic_unplug_device to zero the elevator_sequence field of
the last request on the queue.  This means new io won't be merged with
requests that were already pending when the unplug was done, which helps
limit the number of sectors that need to be sent down during the
run_task_queue(&tq_disk) in wait_on_buffer.
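
To show why zeroing one field is enough, here is a toy model of the
merge scan.  It assumes the elevator walks the queue from the tail and
gives up at the first request whose elevator_sequence has run out; that
is my reading of the 2.4 elevator rather than something this patch adds,
and struct merge_request and the mr_* helpers are hypothetical names.

```c
#include <assert.h>

/* Hypothetical, stripped-down request: position, size and aging count. */
struct merge_request {
    long sector;            /* first sector of the request */
    int  nr_sectors;        /* length in sectors */
    int  elevator_sequence; /* how many more times it may be passed over */
};

/* Scan from the tail for a request that ends exactly where the new io
 * starts (a back merge).  Returns its index, or -1 if none is allowed.
 * A request with no sequence left acts as a barrier for the scan. */
static int mr_find_back_merge(struct merge_request *queue, int n, long sector)
{
    assert(n >= 0);
    for (int i = n - 1; i >= 0; i--) {
        if (queue[i].elevator_sequence-- <= 0)
            return -1;
        if (queue[i].sector + queue[i].nr_sectors == sector)
            return i;
    }
    return -1;
}

/* Model of the unplug change: age out the last request on the queue. */
static void mr_unplug(struct merge_request *queue, int n)
{
    if (n > 0)
        queue[n - 1].elevator_sequence = 0;
}
```

Because the scan starts at the tail, the zeroed request is the first
one it meets, so in this model nothing queued before the unplug can
grow any further.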

I lowered the -aa default limit on sectors in flight from 4MB to 2MB.
We probably want an elvtune option for it, since large arrays with
writeback cache should be able to tolerate larger values.

There's still a little work left to do.  This patch enables sector
throttling for scsi and IDE; cciss, DAC960 and cpqarray need
modification too (99% done already in -aa).  No sense in doing that
until after the bulk of the patch is reviewed, though.

As before, most of the code here is from Andrea and Nick, I've just
wrapped a lot of duct tape around it and done some tweaking.  The
primary pieces are:

fix-pausing (Andrea, corner cases where wakeups are missed)
elevator-low-latency (Andrea, limit sectors in flight)
queue_full (Nick, fairness in __get_request_wait)

I've removed my latency stats for __get_request_wait in hopes of making
it a better merging candidate.

-chris