* Status of the IO scheduler fixes for 2.4
@ 2003-07-02 22:26 Marcelo Tosatti
  2003-07-02 22:28 ` Marcelo Tosatti
  0 siblings, 1 reply; 21+ messages in thread
From: Marcelo Tosatti @ 2003-07-02 22:26 UTC (permalink / raw)
  To: lkml; +Cc: Alan Cox, Chris Mason, Andrea Arcangeli, Nick Piggin


Hello people,

What is the status of the IO scheduler fixes for increased fairness for
2.4 ?

I haven't had time to read and think about everything you guys discussed,
so a brief summary would be very helpful for me.

Danke


* Re: Status of the IO scheduler fixes for 2.4
  2003-07-02 22:26 Status of the IO scheduler fixes for 2.4 Marcelo Tosatti
@ 2003-07-02 22:28 ` Marcelo Tosatti
  2003-07-03  2:02   ` Chris Mason
  2003-07-03 18:07   ` Chris Mason
  0 siblings, 2 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2003-07-02 22:28 UTC (permalink / raw)
  To: lkml; +Cc: Alan Cox, Chris Mason, Andrea Arcangeli, Nick Piggin



On Wed, 2 Jul 2003, Marcelo Tosatti wrote:

>
> Hello people,
>
> What is the status of the IO scheduler fixes for increased fairness for
> 2.4 ?
>
> I haven't had time to read and think about everything you guys discussed,
> so a brief summary would be very helpful for me.
>
> Danke

Ah, we all want the fairness issues to be fixed in 2.4.22, right?


* Re: Status of the IO scheduler fixes for 2.4
  2003-07-02 22:28 ` Marcelo Tosatti
@ 2003-07-03  2:02   ` Chris Mason
  2003-07-03 10:58     ` Stephan von Krawczynski
                       ` (2 more replies)
  2003-07-03 18:07   ` Chris Mason
  1 sibling, 3 replies; 21+ messages in thread
From: Chris Mason @ 2003-07-03  2:02 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: lkml, Alan Cox, Andrea Arcangeli, Nick Piggin

[-- Attachment #1: Type: text/plain, Size: 3096 bytes --]

On Wed, 2003-07-02 at 18:28, Marcelo Tosatti wrote:
> On Wed, 2 Jul 2003, Marcelo Tosatti wrote:
> 
> >
> > Hello people,
> >
> > What is the status of the IO scheduler fixes for increased fairness for
> > 2.4 ?
> >
> > I haven't had time to read and think about everything you guys discussed,
> > so a brief summary would be very helpful for me.
> >
> > Danke
> 
> Ah, we all want the fairness issues to be fixed in 2.4.22, right?

My current code is attached; it's basically a merge of these 3 patches,
with modifications based on benchmarks and latency measurements here.

fix_pausing:  From Andrea, it fixes a few corner case races where
wakeups can be missed in wait_on_buffer, wait_on_page, and
__get_request_wait.

elevator-low-latency: From Andrea, it limits the amount of io on a given
queue to a reasonable number.  This prevents a small number of huge
requests from introducing large latencies on smaller requests.

q->full: From Nick, it reduces latency in __get_request_wait by making
sure new io can't come in and steal requests before old waiters are
served.
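
For reference, the gate itself is tiny; condensed from the attached
ll_rw_blk.c changes, allocation is refused outright while q->full is set,
and the flag is cleared (waking anyone left) once the wait queue drains:

static inline struct request *get_request(request_queue_t *q, int rw)
{
	if (q->full)			/* waiters pending, no stealing */
		return NULL;
	return __get_request(q, rw);	/* sets q->full on exhaustion when
					   q->low_latency is enabled */
}

static inline void clear_full_and_wake(request_queue_t *q)
{
	q->full = 0;
	mb();
	if (waitqueue_active(&q->wait_for_requests))
		wake_up(&q->wait_for_requests);
}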

Those represent the big 3 areas I believe the latencies are coming
from.  The q->full patch can hurt throughput badly as the number of
writers increases (50% of what 2.4.21 gets for 10 or more concurrent
streaming writers), but it really seems to help desktop workloads here.

Andrea's elevator-low-latency patch solves 90% of the latency problem; it
keeps heavy io from taking over the disk entirely and starving out
small readers (like ls).  It also keeps good throughput numbers.
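
The throttle itself is a per-queue count of 512-byte sectors in flight,
checked against a cap that defaults to 4MB; condensed from the attached
blkdev.h changes:

#define MAX_QUEUE_SECTORS (4 << (20 - 9)) /* 4 mbytes when full sized */

static inline int blk_oversized_queue(request_queue_t * q)
{
	if (q->can_throttle)
		return atomic_read(&q->nr_sectors) > q->max_queue_sectors;
	return q->rq.count == 0;	/* untouched drivers keep the old check */
}

__get_request_wait() unplugs the queue and sleeps whenever this returns
true, which is what keeps a single queue from building an unbounded backlog.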

So, the patch attached includes the q->full code but has it off by
default.  I've got code locally for an elvtune interface that can toggle
q->full check on a per device basis, as well as tune the max io per
queue.  I've got two choices on how to submit it, I can either add a new
ioctl or abuse the max_bomb_segments field in the existing ioctl.

If we can agree on the userland tuning side, I can have some kind of
elvtune patch tomorrow.

Note: my merge of elevator-low-latency is a little different from
Andrea's.  I added a blk_finished_sectors call to keep track of io as it
finishes instead of changing blk_finished_io.  If a given driver is
going to call blk_finished_sectors when it calls blk_finished_io, it
should call blk_queue_throttle_sectors(q, 1) after blk_init_queue to
tell the rest of ll_rw_block to enable throttling.  The extra calls are
there to keep compatibility with external drivers.
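
For a driver, the opt-in is a one-liner after queue init.  A minimal sketch
(the mydrv_* names are hypothetical; the real hunks for IDE and SCSI are in
the diff below):

static void mydrv_init_queue(request_queue_t *q, request_fn_proc *fn)
{
	blk_init_queue(q, fn);
	/* promise that the driver's completion path calls
	 * blk_finished_sectors alongside blk_finished_io, so the
	 * sector throttling code can be enabled for this queue */
	blk_queue_throttle_sectors(q, 1);
}

Drivers that never make the call just keep the old request-count-only
behavior, which is the compatibility point above.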

[ random summary of negative comments that come to mind ]

Andrea and I disagree about the code in get_request_wait_wakeup; he
wants to find a way to unplug the queue instead of triggering wakeups.
He also really doesn't like my change to __generic_unplug_device, which
is meant to lower latencies when a reader is running the task queue in
__wait_on_buffer.  This code is off by default in the attached patch.

Nick would like to see a better balance of throughput/fairness, I wimped
out and went for the userspace toggle instead because I think anything
else requires pulling in larger changes from 2.5 land.

-chris


[-- Attachment #2: io-stalls-8-min.diff --]
[-- Type: text/plain, Size: 24255 bytes --]

diff -Nru a/drivers/block/ll_rw_blk.c b/drivers/block/ll_rw_blk.c
--- a/drivers/block/ll_rw_blk.c	Wed Jul  2 16:57:17 2003
+++ b/drivers/block/ll_rw_blk.c	Wed Jul  2 16:57:17 2003
@@ -176,11 +176,12 @@
 {
 	int count = q->nr_requests;
 
-	count -= __blk_cleanup_queue(&q->rq[READ]);
-	count -= __blk_cleanup_queue(&q->rq[WRITE]);
+	count -= __blk_cleanup_queue(&q->rq);
 
 	if (count)
 		printk("blk_cleanup_queue: leaked requests (%d)\n", count);
+	if (atomic_read(&q->nr_sectors))
+		printk("blk_cleanup_queue: leaked sectors (%d)\n", atomic_read(&q->nr_sectors));
 
 	memset(q, 0, sizeof(*q));
 }
@@ -215,6 +216,24 @@
 }
 
 /**
+ * blk_queue_throttle_sectors - indicates you will call sector throttling funcs
+ * @q:       The queue which this applies to.
+ * @active:  A flag indicating if you want sector throttling on
+ *
+ * Description:
+ * The sector throttling code allows us to put a limit on the number of
+ * sectors pending io to the disk at a given time, sending @active nonzero
+ * indicates you will call blk_started_sectors and blk_finished_sectors in
+ * addition to calling blk_started_io and blk_finished_io in order to
+ * keep track of the number of sectors in flight.
+ **/
+ 
+void blk_queue_throttle_sectors(request_queue_t * q, int active)
+{
+	q->can_throttle = active;
+}
+
+/**
  * blk_queue_make_request - define an alternate make_request function for a device
  * @q:  the request queue for the device to be affected
  * @mfn: the alternate make_request function
@@ -360,8 +379,22 @@
 {
 	if (q->plugged) {
 		q->plugged = 0;
-		if (!list_empty(&q->queue_head))
+		if (!list_empty(&q->queue_head)) {
+
+			/* we don't want merges later on to come in 
+			 * and significantly increase the amount of
+			 * work during an unplug, it can lead to high
+			 * latencies while some poor waiter tries to
+			 * run an ever increasing chunk of io.
+			 * This does lower throughput some though.
+			 */
+			if (q->low_latency) {
+				struct request *rq;
+				rq = blkdev_entry_prev_request(&q->queue_head),
+				rq->elevator_sequence = 0;
+			}
 			q->request_fn(q);
+		}
 	}
 }
 
@@ -389,7 +422,7 @@
  *
  * Returns the (new) number of requests which the queue has available.
  */
-int blk_grow_request_list(request_queue_t *q, int nr_requests)
+int blk_grow_request_list(request_queue_t *q, int nr_requests, int max_queue_sectors)
 {
 	unsigned long flags;
 	/* Several broken drivers assume that this function doesn't sleep,
@@ -399,21 +432,34 @@
 	spin_lock_irqsave(&io_request_lock, flags);
 	while (q->nr_requests < nr_requests) {
 		struct request *rq;
-		int rw;
 
 		rq = kmem_cache_alloc(request_cachep, SLAB_ATOMIC);
 		if (rq == NULL)
 			break;
 		memset(rq, 0, sizeof(*rq));
 		rq->rq_status = RQ_INACTIVE;
-		rw = q->nr_requests & 1;
-		list_add(&rq->queue, &q->rq[rw].free);
-		q->rq[rw].count++;
+ 		list_add(&rq->queue, &q->rq.free);
+ 		q->rq.count++;
+
 		q->nr_requests++;
 	}
+
+ 	/*
+ 	 * Wakeup waiters after both one quarter of the
+ 	 * max-in-flight queue and one quarter of the requests
+ 	 * are available again.
+ 	 */
+
 	q->batch_requests = q->nr_requests / 4;
 	if (q->batch_requests > 32)
 		q->batch_requests = 32;
+ 	q->batch_sectors = max_queue_sectors / 4;
+ 
+ 	q->max_queue_sectors = max_queue_sectors;
+ 
+ 	BUG_ON(!q->batch_sectors);
+ 	atomic_set(&q->nr_sectors, 0);
+
 	spin_unlock_irqrestore(&io_request_lock, flags);
 	return q->nr_requests;
 }
@@ -422,23 +468,27 @@
 {
 	struct sysinfo si;
 	int megs;		/* Total memory, in megabytes */
-	int nr_requests;
-
-	INIT_LIST_HEAD(&q->rq[READ].free);
-	INIT_LIST_HEAD(&q->rq[WRITE].free);
-	q->rq[READ].count = 0;
-	q->rq[WRITE].count = 0;
+ 	int nr_requests, max_queue_sectors = MAX_QUEUE_SECTORS;
+  
+ 	INIT_LIST_HEAD(&q->rq.free);
+	q->rq.count = 0;
 	q->nr_requests = 0;
 
 	si_meminfo(&si);
 	megs = si.totalram >> (20 - PAGE_SHIFT);
-	nr_requests = 128;
-	if (megs < 32)
-		nr_requests /= 2;
-	blk_grow_request_list(q, nr_requests);
+ 	nr_requests = MAX_NR_REQUESTS;
+ 	if (megs < 30) {
+  		nr_requests /= 2;
+ 		max_queue_sectors /= 2;
+ 	}
+ 	/* notice early if anybody screwed the defaults */
+ 	BUG_ON(!nr_requests);
+ 	BUG_ON(!max_queue_sectors);
+ 
+ 	blk_grow_request_list(q, nr_requests, max_queue_sectors);
+
+ 	init_waitqueue_head(&q->wait_for_requests);
 
-	init_waitqueue_head(&q->wait_for_requests[0]);
-	init_waitqueue_head(&q->wait_for_requests[1]);
 	spin_lock_init(&q->queue_lock);
 }
 
@@ -491,6 +541,10 @@
 	q->plug_tq.routine	= &generic_unplug_device;
 	q->plug_tq.data		= q;
 	q->plugged        	= 0;
+	q->full			= 0;
+	q->can_throttle		= 0;
+	q->low_latency		= 0;
+
 	/*
 	 * These booleans describe the queue properties.  We set the
 	 * default (and most common) values here.  Other drivers can
@@ -508,12 +562,13 @@
  * Get a free request. io_request_lock must be held and interrupts
  * disabled on the way in.  Returns NULL if there are no free requests.
  */
-static struct request *get_request(request_queue_t *q, int rw)
+static struct request *__get_request(request_queue_t *q, int rw)
 {
 	struct request *rq = NULL;
-	struct request_list *rl = q->rq + rw;
+	struct request_list *rl;
 
-	if (!list_empty(&rl->free)) {
+	rl = &q->rq;
+	if (!list_empty(&rl->free) && !blk_oversized_queue(q)) {
 		rq = blkdev_free_rq(&rl->free);
 		list_del(&rq->queue);
 		rl->count--;
@@ -521,35 +576,47 @@
 		rq->cmd = rw;
 		rq->special = NULL;
 		rq->q = q;
-	}
-
+	} else if (q->low_latency)
+		q->full = 1;
 	return rq;
 }
 
 /*
- * Here's the request allocation design:
+ * get a free request, honoring the queue_full condition
+ */
+static inline struct request *get_request(request_queue_t *q, int rw)
+{
+	if (q->full)
+		return NULL;
+	return __get_request(q, rw);
+}
+
+/* 
+ * helper func to do memory barriers and wakeups when we finally decide
+ * to clear the queue full condition
+ */
+static inline void clear_full_and_wake(request_queue_t *q)
+{
+	q->full = 0;
+	mb();
+	if (waitqueue_active(&q->wait_for_requests))
+		wake_up(&q->wait_for_requests);
+}
+
+/*
+ * Here's the request allocation design, low latency version:
  *
  * 1: Blocking on request exhaustion is a key part of I/O throttling.
  * 
  * 2: We want to be `fair' to all requesters.  We must avoid starvation, and
  *    attempt to ensure that all requesters sleep for a similar duration.  Hence
  *    no stealing requests when there are other processes waiting.
- * 
- * 3: We also wish to support `batching' of requests.  So when a process is
- *    woken, we want to allow it to allocate a decent number of requests
- *    before it blocks again, so they can be nicely merged (this only really
- *    matters if the process happens to be adding requests near the head of
- *    the queue).
- * 
- * 4: We want to avoid scheduling storms.  This isn't really important, because
- *    the system will be I/O bound anyway.  But it's easy.
- * 
- *    There is tension between requirements 2 and 3.  Once a task has woken,
- *    we don't want to allow it to sleep as soon as it takes its second request.
- *    But we don't want currently-running tasks to steal all the requests
- *    from the sleepers.  We handle this with wakeup hysteresis around
- *    0 .. batch_requests and with the assumption that request taking is much,
- *    much faster than request freeing.
+ *
+ * There used to be more here, attempting to allow a process to send in a
+ * number of requests once it has woken up.  But, there's no way to 
+ * tell if a process has just been woken up, or if it is a new process
+ * coming in to steal requests from the waiters.  So, we give up and force
+ * everyone to wait fairly.
  * 
  * So here's what we do:
  * 
@@ -561,50 +628,67 @@
  * 
  *  When a process wants a new request:
  * 
- *    b) If free_requests == 0, the requester sleeps in FIFO manner.
- * 
- *    b) If 0 <  free_requests < batch_requests and there are waiters,
- *       we still take a request non-blockingly.  This provides batching.
- *
- *    c) If free_requests >= batch_requests, the caller is immediately
- *       granted a new request.
+ *    b) If free_requests == 0, the requester sleeps in FIFO manner, and
+ *       the queue full condition is set.  The full condition is not
+ *       cleared until there are no longer any waiters.  Once the full
+ *       condition is set, all new io must wait, hopefully for a very
+ *       short period of time.
  * 
  *  When a request is released:
  * 
- *    d) If free_requests < batch_requests, do nothing.
- * 
- *    f) If free_requests >= batch_requests, wake up a single waiter.
+ *    c) If free_requests < batch_requests, do nothing.
  * 
- *   The net effect is that when a process is woken at the batch_requests level,
- *   it will be able to take approximately (batch_requests) requests before
- *   blocking again (at the tail of the queue).
- * 
- *   This all assumes that the rate of taking requests is much, much higher
- *   than the rate of releasing them.  Which is very true.
+ *    d) If free_requests >= batch_requests, wake up a single waiter.
  *
- * -akpm, Feb 2002.
+ *   As each waiter gets a request, he wakes another waiter.  We do this
+ *   to prevent a race where an unplug might get run before a request makes
+ *   its way onto the queue.  The result is a cascade of wakeups, so delaying
+ *   the initial wakeup until we've got batch_requests available helps avoid
+ *   wakeups where there aren't any requests available yet.
  */
 
 static struct request *__get_request_wait(request_queue_t *q, int rw)
 {
 	register struct request *rq;
 	DECLARE_WAITQUEUE(wait, current);
+	int oversized;
+
+	add_wait_queue_exclusive(&q->wait_for_requests, &wait);
 
-	add_wait_queue(&q->wait_for_requests[rw], &wait);
 	do {
 		set_current_state(TASK_UNINTERRUPTIBLE);
-		generic_unplug_device(q);
-		if (q->rq[rw].count == 0)
-			schedule();
 		spin_lock_irq(&io_request_lock);
-		rq = get_request(q, rw);
+		oversized = blk_oversized_queue(q);
+		if (q->full || oversized) {
+			if (oversized)
+				__generic_unplug_device(q);
+			spin_unlock_irq(&io_request_lock);
+			schedule();
+			spin_lock_irq(&io_request_lock);
+		}
+		rq = __get_request(q, rw);
 		spin_unlock_irq(&io_request_lock);
 	} while (rq == NULL);
-	remove_wait_queue(&q->wait_for_requests[rw], &wait);
+	remove_wait_queue(&q->wait_for_requests, &wait);
 	current->state = TASK_RUNNING;
+
+	if (!waitqueue_active(&q->wait_for_requests))
+		clear_full_and_wake(q);
+
 	return rq;
 }
 
+static void get_request_wait_wakeup(request_queue_t *q, int rw)
+{
+	/*
+	 * avoid losing an unplug if a second __get_request_wait did the
+	 * generic_unplug_device while our __get_request_wait was running
+	 * w/o the queue_lock held and w/ our request out of the queue.
+	 */	
+	if (waitqueue_active(&q->wait_for_requests))
+		wake_up(&q->wait_for_requests);
+}
+
 /* RO fail safe mechanism */
 
 static long ro_bits[MAX_BLKDEV][8];
@@ -818,7 +902,6 @@
 void blkdev_release_request(struct request *req)
 {
 	request_queue_t *q = req->q;
-	int rw = req->cmd;
 
 	req->rq_status = RQ_INACTIVE;
 	req->q = NULL;
@@ -828,9 +911,19 @@
 	 * assume it has free buffers and check waiters
 	 */
 	if (q) {
-		list_add(&req->queue, &q->rq[rw].free);
-		if (++q->rq[rw].count >= q->batch_requests)
-			wake_up(&q->wait_for_requests[rw]);
+		int oversized_batch = 0;
+
+		if (q->can_throttle)
+			oversized_batch = blk_oversized_queue_batch(q);
+		q->rq.count++;
+		list_add(&req->queue, &q->rq.free);
+		if (q->rq.count >= q->batch_requests && !oversized_batch) {
+			smp_mb();
+			if (waitqueue_active(&q->wait_for_requests))
+				wake_up(&q->wait_for_requests);
+			else
+				clear_full_and_wake(q);
+		}
 	}
 }
 
@@ -908,6 +1001,7 @@
 	struct list_head *head, *insert_here;
 	int latency;
 	elevator_t *elevator = &q->elevator;
+	int should_wake = 0;
 
 	count = bh->b_size >> 9;
 	sector = bh->b_rsector;
@@ -948,7 +1042,6 @@
 	 */
 	max_sectors = get_max_sectors(bh->b_rdev);
 
-again:
 	req = NULL;
 	head = &q->queue_head;
 	/*
@@ -957,7 +1050,9 @@
 	 */
 	spin_lock_irq(&io_request_lock);
 
+again:
 	insert_here = head->prev;
+
 	if (list_empty(head)) {
 		q->plug_device_fn(q, bh->b_rdev); /* is atomic */
 		goto get_rq;
@@ -976,6 +1071,7 @@
 			req->bhtail = bh;
 			req->nr_sectors = req->hard_nr_sectors += count;
 			blk_started_io(count);
+			blk_started_sectors(req, count);
 			drive_stat_acct(req->rq_dev, req->cmd, count, 0);
 			req_new_io(req, 1, count);
 			attempt_back_merge(q, req, max_sectors, max_segments);
@@ -998,6 +1094,7 @@
 			req->sector = req->hard_sector = sector;
 			req->nr_sectors = req->hard_nr_sectors += count;
 			blk_started_io(count);
+			blk_started_sectors(req, count);
 			drive_stat_acct(req->rq_dev, req->cmd, count, 0);
 			req_new_io(req, 1, count);
 			attempt_front_merge(q, head, req, max_sectors, max_segments);
@@ -1030,7 +1127,7 @@
 		 * See description above __get_request_wait()
 		 */
 		if (rw_ahead) {
-			if (q->rq[rw].count < q->batch_requests) {
+			if (q->rq.count < q->batch_requests || blk_oversized_queue_batch(q)) {
 				spin_unlock_irq(&io_request_lock);
 				goto end_io;
 			}
@@ -1042,6 +1139,9 @@
 			if (req == NULL) {
 				spin_unlock_irq(&io_request_lock);
 				freereq = __get_request_wait(q, rw);
+				head = &q->queue_head;
+				spin_lock_irq(&io_request_lock);
+				should_wake = 1;
 				goto again;
 			}
 		}
@@ -1064,10 +1164,13 @@
 	req->start_time = jiffies;
 	req_new_io(req, 0, count);
 	blk_started_io(count);
+	blk_started_sectors(req, count);
 	add_request(q, req, insert_here);
 out:
 	if (freereq)
 		blkdev_release_request(freereq);
+	if (should_wake)
+		get_request_wait_wakeup(q, rw);
 	spin_unlock_irq(&io_request_lock);
 	return 0;
 end_io:
@@ -1196,8 +1299,15 @@
 	bh->b_rdev = bh->b_dev;
 	bh->b_rsector = bh->b_blocknr * count;
 
+	get_bh(bh);
 	generic_make_request(rw, bh);
 
+	/* fix race condition with wait_on_buffer() */
+	smp_mb(); /* spin_unlock may have inclusive semantics */
+	if (waitqueue_active(&bh->b_wait))
+		wake_up(&bh->b_wait);
+
+	put_bh(bh);
 	switch (rw) {
 		case WRITE:
 			kstat.pgpgout += count;
@@ -1350,6 +1460,7 @@
 	if ((bh = req->bh) != NULL) {
 		nsect = bh->b_size >> 9;
 		blk_finished_io(nsect);
+		blk_finished_sectors(req, nsect);
 		req->bh = bh->b_reqnext;
 		bh->b_reqnext = NULL;
 		bh->b_end_io(bh, uptodate);
@@ -1509,6 +1620,7 @@
 EXPORT_SYMBOL(blk_get_queue);
 EXPORT_SYMBOL(blk_cleanup_queue);
 EXPORT_SYMBOL(blk_queue_headactive);
+EXPORT_SYMBOL(blk_queue_throttle_sectors);
 EXPORT_SYMBOL(blk_queue_make_request);
 EXPORT_SYMBOL(generic_make_request);
 EXPORT_SYMBOL(blkdev_release_request);
diff -Nru a/drivers/ide/ide-probe.c b/drivers/ide/ide-probe.c
--- a/drivers/ide/ide-probe.c	Wed Jul  2 16:57:17 2003
+++ b/drivers/ide/ide-probe.c	Wed Jul  2 16:57:17 2003
@@ -971,6 +971,7 @@
 
 	q->queuedata = HWGROUP(drive);
 	blk_init_queue(q, do_ide_request);
+	blk_queue_throttle_sectors(q, 1);
 }
 
 #undef __IRQ_HELL_SPIN
diff -Nru a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
--- a/drivers/scsi/scsi.c	Wed Jul  2 16:57:17 2003
+++ b/drivers/scsi/scsi.c	Wed Jul  2 16:57:17 2003
@@ -197,6 +197,7 @@
 
 	blk_init_queue(q, scsi_request_fn);
 	blk_queue_headactive(q, 0);
+	blk_queue_throttle_sectors(q, 1);
 	q->queuedata = (void *) SDpnt;
 }
 
diff -Nru a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
--- a/drivers/scsi/scsi_lib.c	Wed Jul  2 16:57:17 2003
+++ b/drivers/scsi/scsi_lib.c	Wed Jul  2 16:57:17 2003
@@ -378,6 +378,7 @@
 		if ((bh = req->bh) != NULL) {
 			nsect = bh->b_size >> 9;
 			blk_finished_io(nsect);
+			blk_finished_sectors(req, nsect);
 			req->bh = bh->b_reqnext;
 			bh->b_reqnext = NULL;
 			sectors -= nsect;
diff -Nru a/fs/buffer.c b/fs/buffer.c
--- a/fs/buffer.c	Wed Jul  2 16:57:17 2003
+++ b/fs/buffer.c	Wed Jul  2 16:57:17 2003
@@ -153,10 +153,23 @@
 	get_bh(bh);
 	add_wait_queue(&bh->b_wait, &wait);
 	do {
-		run_task_queue(&tq_disk);
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 		if (!buffer_locked(bh))
 			break;
+		/*
+		 * We must read tq_disk in TQ_ACTIVE after the
+		 * add_wait_queue effect is visible to other cpus.
+		 * We could unplug some line above it wouldn't matter
+		 * but we can't do that right after add_wait_queue
+		 * without an smp_mb() in between because spin_unlock
+		 * has inclusive semantics.
+		 * Doing it here is the most efficient place so we
+		 * don't do a spurious unplug if we get a racy
+		 * wakeup that makes buffer_locked return 0, and
+		 * doing it here avoids an explicit smp_mb() we
+		 * rely on the implicit one in set_task_state.
+		 */
+		run_task_queue(&tq_disk);
 		schedule();
 	} while (buffer_locked(bh));
 	tsk->state = TASK_RUNNING;
@@ -1507,6 +1520,9 @@
 
 	/* Done - end_buffer_io_async will unlock */
 	SetPageUptodate(page);
+
+	wakeup_page_waiters(page);
+
 	return 0;
 
 out:
@@ -1538,6 +1554,7 @@
 	} while (bh != head);
 	if (need_unlock)
 		UnlockPage(page);
+	wakeup_page_waiters(page);
 	return err;
 }
 
@@ -1765,6 +1782,8 @@
 		else
 			submit_bh(READ, bh);
 	}
+
+	wakeup_page_waiters(page);
 	
 	return 0;
 }
@@ -2378,6 +2397,7 @@
 		submit_bh(rw, bh);
 		bh = next;
 	} while (bh != head);
+	wakeup_page_waiters(page);
 	return 0;
 }
 
diff -Nru a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
--- a/fs/reiserfs/inode.c	Wed Jul  2 16:57:17 2003
+++ b/fs/reiserfs/inode.c	Wed Jul  2 16:57:17 2003
@@ -2048,6 +2048,7 @@
     */
     if (nr) {
         submit_bh_for_writepage(arr, nr) ;
+	wakeup_page_waiters(page);
     } else {
         UnlockPage(page) ;
     }
diff -Nru a/include/linux/blkdev.h b/include/linux/blkdev.h
--- a/include/linux/blkdev.h	Wed Jul  2 16:57:17 2003
+++ b/include/linux/blkdev.h	Wed Jul  2 16:57:17 2003
@@ -64,12 +64,6 @@
 typedef void (plug_device_fn) (request_queue_t *q, kdev_t device);
 typedef void (unplug_device_fn) (void *q);
 
-/*
- * Default nr free requests per queue, ll_rw_blk will scale it down
- * according to available RAM at init time
- */
-#define QUEUE_NR_REQUESTS	8192
-
 struct request_list {
 	unsigned int count;
 	struct list_head free;
@@ -80,7 +74,7 @@
 	/*
 	 * the queue request freelist, one for reads and one for writes
 	 */
-	struct request_list	rq[2];
+	struct request_list	rq;
 
 	/*
 	 * The total number of requests on each queue
@@ -93,6 +87,21 @@
 	int batch_requests;
 
 	/*
+	 * The total number of 512byte blocks on each queue
+	 */
+	atomic_t nr_sectors;
+
+	/*
+	 * Batching threshold for sleep/wakeup decisions
+	 */
+	int batch_sectors;
+
+	/*
+	 * The max number of 512byte blocks on each queue
+	 */
+	int max_queue_sectors;
+
+	/*
 	 * Together with queue_head for cacheline sharing
 	 */
 	struct list_head	queue_head;
@@ -118,13 +127,34 @@
 	/*
 	 * Boolean that indicates whether this queue is plugged or not.
 	 */
-	char			plugged;
+	int			plugged:1;
 
 	/*
 	 * Boolean that indicates whether current_request is active or
 	 * not.
 	 */
-	char			head_active;
+	int			head_active:1;
+
+	/*
+	 * Booleans that indicate whether the queue's free requests have
+	 * been exhausted and is waiting to drop below the batch_requests
+	 * threshold
+	 */
+	int			full:1;
+	
+	/*
+	 * Boolean that indicates you will use blk_started_sectors
+	 * and blk_finished_sectors in addition to blk_started_io
+	 * and blk_finished_io.  It enables the throttling code to 
+	 * help keep the size of the queue in sectors to a reasonable number
+	 */
+	int			can_throttle:1;
+
+	/*
+	 * Boolean that indicates the queue should prefer low
+	 * latency over throughput.  This enables the q->full checks
+	 */
+	int			low_latency:1;
 
 	unsigned long		bounce_pfn;
 
@@ -137,7 +167,7 @@
 	/*
 	 * Tasks wait here for free read and write requests
 	 */
-	wait_queue_head_t	wait_for_requests[2];
+	wait_queue_head_t	wait_for_requests;
 };
 
 #define blk_queue_plugged(q)	(q)->plugged
@@ -217,14 +247,16 @@
 extern void generic_make_request(int rw, struct buffer_head * bh);
 extern inline request_queue_t *blk_get_queue(kdev_t dev);
 extern void blkdev_release_request(struct request *);
+extern void blk_print_stats(kdev_t dev);
 
 /*
  * Access functions for manipulating queue properties
  */
-extern int blk_grow_request_list(request_queue_t *q, int nr_requests);
+extern int blk_grow_request_list(request_queue_t *q, int nr_requests, int max_queue_sectors);
 extern void blk_init_queue(request_queue_t *, request_fn_proc *);
 extern void blk_cleanup_queue(request_queue_t *);
 extern void blk_queue_headactive(request_queue_t *, int);
+extern void blk_queue_throttle_sectors(request_queue_t *, int);
 extern void blk_queue_make_request(request_queue_t *, make_request_fn *);
 extern void generic_unplug_device(void *);
 extern inline int blk_seg_merge_ok(struct buffer_head *, struct buffer_head *);
@@ -243,6 +275,8 @@
 
 #define MAX_SEGMENTS 128
 #define MAX_SECTORS 255
+#define MAX_QUEUE_SECTORS (4 << (20 - 9)) /* 4 mbytes when full sized */
+#define MAX_NR_REQUESTS 1024 /* 1024k when in 512 units, normally min is 1M in 1k units */
 
 #define PageAlignSize(size) (((size) + PAGE_SIZE -1) & PAGE_MASK)
 
@@ -268,8 +302,50 @@
 	return retval;
 }
 
+static inline int blk_oversized_queue(request_queue_t * q)
+{
+	if (q->can_throttle)
+		return atomic_read(&q->nr_sectors) > q->max_queue_sectors;
+	return q->rq.count == 0;
+}
+
+static inline int blk_oversized_queue_batch(request_queue_t * q)
+{
+	return atomic_read(&q->nr_sectors) > q->max_queue_sectors - q->batch_sectors;
+}
+
 #define blk_finished_io(nsects)	do { } while (0)
 #define blk_started_io(nsects)	do { } while (0)
+
+static inline void blk_started_sectors(struct request *rq, int count)
+{
+	request_queue_t *q = rq->q;
+	if (q && q->can_throttle) {
+		atomic_add(count, &q->nr_sectors);
+		if (atomic_read(&q->nr_sectors) < 0) {
+			printk("nr_sectors is %d\n", atomic_read(&q->nr_sectors));
+			BUG();
+		}
+	}
+}
+
+static inline void blk_finished_sectors(struct request *rq, int count)
+{
+	request_queue_t *q = rq->q;
+	if (q && q->can_throttle) {
+		atomic_sub(count, &q->nr_sectors);
+		
+		smp_mb();
+		if (q->rq.count >= q->batch_requests && !blk_oversized_queue_batch(q)) {
+			if (waitqueue_active(&q->wait_for_requests))
+				wake_up(&q->wait_for_requests);
+		}
+		if (atomic_read(&q->nr_sectors) < 0) {
+			printk("nr_sectors is %d\n", atomic_read(&q->nr_sectors));
+			BUG();
+		}
+	}
+}
 
 static inline unsigned int blksize_bits(unsigned int size)
 {
diff -Nru a/include/linux/elevator.h b/include/linux/elevator.h
--- a/include/linux/elevator.h	Wed Jul  2 16:57:17 2003
+++ b/include/linux/elevator.h	Wed Jul  2 16:57:17 2003
@@ -80,7 +81,7 @@
 	return latency;
 }
 
-#define ELV_LINUS_SEEK_COST	16
+#define ELV_LINUS_SEEK_COST	1
 
 #define ELEVATOR_NOOP							\
 ((elevator_t) {								\
@@ -93,8 +94,8 @@
 
 #define ELEVATOR_LINUS							\
 ((elevator_t) {								\
-	2048,				/* read passovers */		\
-	8192,				/* write passovers */		\
+	128,				/* read passovers */		\
+	512,				/* write passovers */		\
 									\
 	elevator_linus_merge,		/* elevator_merge_fn */		\
 	elevator_linus_merge_req,	/* elevator_merge_req_fn */	\
diff -Nru a/include/linux/pagemap.h b/include/linux/pagemap.h
--- a/include/linux/pagemap.h	Wed Jul  2 16:57:17 2003
+++ b/include/linux/pagemap.h	Wed Jul  2 16:57:17 2003
@@ -97,6 +97,8 @@
 		___wait_on_page(page);
 }
 
+extern void FASTCALL(wakeup_page_waiters(struct page * page));
+
 /*
  * Returns locked page at given index in given cache, creating it if needed.
  */
diff -Nru a/kernel/ksyms.c b/kernel/ksyms.c
--- a/kernel/ksyms.c	Wed Jul  2 16:57:17 2003
+++ b/kernel/ksyms.c	Wed Jul  2 16:57:17 2003
@@ -295,6 +295,7 @@
 EXPORT_SYMBOL(filemap_fdatawait);
 EXPORT_SYMBOL(lock_page);
 EXPORT_SYMBOL(unlock_page);
+EXPORT_SYMBOL(wakeup_page_waiters);
 
 /* device registration */
 EXPORT_SYMBOL(register_chrdev);
diff -Nru a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c	Wed Jul  2 16:57:17 2003
+++ b/mm/filemap.c	Wed Jul  2 16:57:17 2003
@@ -812,6 +812,20 @@
 	return &wait[hash];
 }
 
+/*
+ * This must be called after every submit_bh with end_io
+ * callbacks that would result into the blkdev layer waking
+ * up the page after a queue unplug.
+ */
+void wakeup_page_waiters(struct page * page)
+{
+	wait_queue_head_t * head;
+
+	head = page_waitqueue(page);
+	if (waitqueue_active(head))
+		wake_up(head);
+}
+
 /* 
  * Wait for a page to get unlocked.
  *


* Re: Status of the IO scheduler fixes for 2.4
  2003-07-03  2:02   ` Chris Mason
@ 2003-07-03 10:58     ` Stephan von Krawczynski
  2003-07-03 12:01       ` Chris Mason
  2003-07-03 12:07       ` Marc-Christian Petersen
  2003-07-03 12:31     ` Marc-Christian Petersen
  2003-07-04 20:01     ` Marcelo Tosatti
  2 siblings, 2 replies; 21+ messages in thread
From: Stephan von Krawczynski @ 2003-07-03 10:58 UTC (permalink / raw)
  To: Chris Mason; +Cc: marcelo, linux-kernel, alan, andrea, piggin

On 02 Jul 2003 22:02:07 -0400
Chris Mason <mason@suse.com> wrote:

> [...]
> Nick would like to see a better balance of throughput/fairness, I wimped
> out and went for the userspace toggle instead because I think anything
> else requires pulling in larger changes from 2.5 land.

I have a short question on that: did you check if there are any drawbacks on
network performance through this? We had a phenomenon here with 2.4.21 with
both samba and simple ftp where network performance dropped to a crawl when
simply entering "sync" on the console. Even simple telnet-sessions seemed to be
affected. As we could not create a reproducible setup I did not talk about this
up to now, but I wonder if anyone else ever checked that out ...

Regards,
Stephan




* Re: Status of the IO scheduler fixes for 2.4
  2003-07-03 10:58     ` Stephan von Krawczynski
@ 2003-07-03 12:01       ` Chris Mason
  2003-07-03 12:07       ` Marc-Christian Petersen
  1 sibling, 0 replies; 21+ messages in thread
From: Chris Mason @ 2003-07-03 12:01 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: marcelo, linux-kernel, alan, andrea, piggin

On Thu, 2003-07-03 at 06:58, Stephan von Krawczynski wrote:
> On 02 Jul 2003 22:02:07 -0400
> Chris Mason <mason@suse.com> wrote:
> 
> > [...]
> > Nick would like to see a better balance of throughput/fairness, I wimped
> > out and went for the userspace toggle instead because I think anything
> > else requires pulling in larger changes from 2.5 land.
> 
> I have a short question on that: did you check if there are any drawbacks on
> network performance through this? We had a phenomenon here with 2.4.21 with
> both samba and simple ftp where network performance dropped to a crawl when
> simply entering "sync" on the console. Even simple telnet-sessions seemed to be
> affected. As we could not create a reproducible setup I did not talk about this
> up to now, but I wonder if anyone else ever checked that out ...

It's possible the network programs involved are actually getting stuck
in atime updates somewhere.  A decoded sysrq-t during one of the stalls
would help find the real cause.

-chris



* Re: Status of the IO scheduler fixes for 2.4
  2003-07-03 10:58     ` Stephan von Krawczynski
  2003-07-03 12:01       ` Chris Mason
@ 2003-07-03 12:07       ` Marc-Christian Petersen
  2003-07-03 12:28         ` Stephan von Krawczynski
  1 sibling, 1 reply; 21+ messages in thread
From: Marc-Christian Petersen @ 2003-07-03 12:07 UTC (permalink / raw)
  To: Stephan von Krawczynski, Chris Mason
  Cc: marcelo, linux-kernel, alan, andrea, piggin

On Thursday 03 July 2003 12:58, Stephan von Krawczynski wrote:

Hi Stephan,

> I have a short question on that: did you check if there are any drawbacks
> on network performance through this? We had a phenomenon here with 2.4.21
> with both samba and simple ftp where network performance dropped to a crawl
> when simply entering "sync" on the console. Even simple telnet-sessions
> seemed to be affected. As we could not create a reproducible setup I did
> not talk about this up to now, but I wonder if anyone else ever checked
> that out ...
Does the patch from Chris cause this?

ciao, Marc



* Re: Status of the IO scheduler fixes for 2.4
  2003-07-03 12:07       ` Marc-Christian Petersen
@ 2003-07-03 12:28         ` Stephan von Krawczynski
  0 siblings, 0 replies; 21+ messages in thread
From: Stephan von Krawczynski @ 2003-07-03 12:28 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: mason, marcelo, linux-kernel, alan, andrea, piggin

On Thu, 3 Jul 2003 14:07:12 +0200
Marc-Christian Petersen <m.c.p@wolk-project.de> wrote:

> On Thursday 03 July 2003 12:58, Stephan von Krawczynski wrote:
> 
> Hi Stephan,
> 
> > I have a short question on that: did you check if there are any drawbacks
> > on network performance through this? We had a phenomenon here with 2.4.21
> > with both samba and simple ftp where network performance dropped to a crawl
> > when simply entering "sync" on the console. Even simple telnet-sessions
> > seemed to be affected. As we could not create a reproducable setup I did
> > not talk about this up to now, but I wonder if anyone else ever checked
> > that out ...
> does the patch from Chris cause this?

No, this is an independent issue. We have not tried these patches so far;
the question was purely informative. We have not dug into the subject yet for lack
of time - just a "try and see if anyone can back our experience".

Regards,
Stephan



* Re: Status of the IO scheduler fixes for 2.4
  2003-07-03  2:02   ` Chris Mason
  2003-07-03 10:58     ` Stephan von Krawczynski
@ 2003-07-03 12:31     ` Marc-Christian Petersen
  2003-07-03 13:11       ` Chris Mason
  2003-07-04 20:01     ` Marcelo Tosatti
  2 siblings, 1 reply; 21+ messages in thread
From: Marc-Christian Petersen @ 2003-07-03 12:31 UTC (permalink / raw)
  To: Chris Mason, Marcelo Tosatti
  Cc: lkml, Alan Cox, Andrea Arcangeli, Nick Piggin

On Thursday 03 July 2003 04:02, Chris Mason wrote:

Hi Chris,

> So, the patch attached includes the q->full code but has it off by
> default.  I've got code locally for an elvtune interface that can toggle
> q->full check on a per device basis, as well as tune the max io per
> queue.  I've got two choices on how to submit it, I can either add a new
> ioctl or abuse the max_bomb_segments field in the existing ioctl.
> If we can agree on the userland tuning side, I can have some kind of
> elvtune patch tomorrow.
what about /proc ?

ciao, Marc



* Re: Status of the IO scheduler fixes for 2.4
  2003-07-03 12:31     ` Marc-Christian Petersen
@ 2003-07-03 13:11       ` Chris Mason
  0 siblings, 0 replies; 21+ messages in thread
From: Chris Mason @ 2003-07-03 13:11 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: Marcelo Tosatti, lkml, Alan Cox, Andrea Arcangeli, Nick Piggin

On Thu, 2003-07-03 at 08:31, Marc-Christian Petersen wrote:
> On Thursday 03 July 2003 04:02, Chris Mason wrote:
> 
> Hi Chris,
> 
> > So, the patch attached includes the q->full code but has it off by
> > default.  I've got code locally for an elvtune interface that can toggle
> > q->full check on a per device basis, as well as tune the max io per
> > queue.  I've got two choices on how to submit it, I can either add a new
> > ioctl or abuse the max_bomb_segments field in the existing ioctl.
> > If we can agree on the userland tuning side, I can have some kind of
> > elvtune patch tomorrow.
> what about /proc ?

Always an option.  If elvtune didn't exist at all I'd say proc was a
better choice.  But I do want to be able to tune things on a per device
basis, which probably means a new directory tree somewhere in proc.  Our
chances are only 50/50 of getting that patch in without a long thread
about the one true way to access kernel tunables through an fs 
interface visible to userland ;-)

For the most part I'm only visiting drivers/block/*.c right now, so I'll
code whatever interface the long term maintainers hate the least.

-chris




* Re: Status of the IO scheduler fixes for 2.4
  2003-07-02 22:28 ` Marcelo Tosatti
  2003-07-03  2:02   ` Chris Mason
@ 2003-07-03 18:07   ` Chris Mason
  1 sibling, 0 replies; 21+ messages in thread
From: Chris Mason @ 2003-07-03 18:07 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: lkml, Alan Cox, Andrea Arcangeli, Nick Piggin

[-- Attachment #1: Type: text/plain, Size: 1052 bytes --]

On Wed, 2003-07-02 at 18:28, Marcelo Tosatti wrote:
> On Wed, 2 Jul 2003, Marcelo Tosatti wrote:
> 
> >
> > Hello people,
> >
> > What is the status of the IO scheduler fixes for increased fairness for
> > 2.4 ?
> >
> > I haven't had time to read and think about everything you guys discussed,
> > so a brief summary would be very helpful for me.

Ok, based on mail from Andrea, it seems like reusing the existing
elvtune ioctl is the way to go for now.  The patch below uses
max_bomb_segments for two things.

1) the number of sectors allowed on a given request queue before new io
triggers an unplug (q->max_queue_sectors)

2) to enable or disable q->full checks.  If max_bomb_segments is odd, the
q->full checks are on; if it is even, they are off.

The ioctl code on the kernel side is set up to allow this:

elvtune -b 1 /dev/xxx 
elvtune -b 0 /dev/xxx

These just switch q->full on and off without changing
q->max_queue_sectors.

elvtune -b 8192 /dev/xxx will get you the current default behavior (4MB
in flight, q->full checks off)
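
For reference, the kernel-side decode is condensed below from the
blkelvset_ioctl() change in the attached patch (the mask definition isn't in
the visible hunks, but the odd/even convention above implies
MAX_BOMB_LATENCY_MASK is the low bit):

	q->low_latency = input.max_bomb_segments & MAX_BOMB_LATENCY_MASK ? 1:0;
	input.max_bomb_segments &= ~MAX_BOMB_LATENCY_MASK;
	if (input.max_bomb_segments)
		q->max_queue_sectors = input.max_bomb_segments;

Since the masked value is zero for both -b 1 and -b 0, those two calls leave
q->max_queue_sectors alone, as described above.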

-chris

[-- Attachment #2: io-stalls-9.diff --]
[-- Type: text/plain, Size: 27324 bytes --]

diff -Nru a/drivers/block/blkpg.c b/drivers/block/blkpg.c
--- a/drivers/block/blkpg.c	Thu Jul  3 14:02:50 2003
+++ b/drivers/block/blkpg.c	Thu Jul  3 14:02:50 2003
@@ -261,10 +261,10 @@
 			return blkpg_ioctl(dev, (struct blkpg_ioctl_arg *) arg);
 			
 		case BLKELVGET:
-			return blkelvget_ioctl(&blk_get_queue(dev)->elevator,
+			return blkelvget_ioctl(blk_get_queue(dev),
 					       (blkelv_ioctl_arg_t *) arg);
 		case BLKELVSET:
-			return blkelvset_ioctl(&blk_get_queue(dev)->elevator,
+			return blkelvset_ioctl(blk_get_queue(dev),
 					       (blkelv_ioctl_arg_t *) arg);
 
 		case BLKBSZGET:
diff -Nru a/drivers/block/elevator.c b/drivers/block/elevator.c
--- a/drivers/block/elevator.c	Thu Jul  3 14:02:50 2003
+++ b/drivers/block/elevator.c	Thu Jul  3 14:02:50 2003
@@ -180,23 +180,28 @@
 
 void elevator_noop_merge_req(struct request *req, struct request *next) {}
 
-int blkelvget_ioctl(elevator_t * elevator, blkelv_ioctl_arg_t * arg)
+int blkelvget_ioctl(request_queue_t *q, blkelv_ioctl_arg_t * arg)
 {
+	elevator_t *elevator = &q->elevator;
 	blkelv_ioctl_arg_t output;
 
 	output.queue_ID			= elevator->queue_ID;
 	output.read_latency		= elevator->read_latency;
 	output.write_latency		= elevator->write_latency;
-	output.max_bomb_segments	= 0;
-
+	output.max_bomb_segments	= q->max_queue_sectors;
+	if (q->low_latency)
+		output.max_bomb_segments |= MAX_BOMB_LATENCY_MASK;	
+	else
+		output.max_bomb_segments &= ~MAX_BOMB_LATENCY_MASK;	
 	if (copy_to_user(arg, &output, sizeof(blkelv_ioctl_arg_t)))
 		return -EFAULT;
 
 	return 0;
 }
 
-int blkelvset_ioctl(elevator_t * elevator, const blkelv_ioctl_arg_t * arg)
+int blkelvset_ioctl(request_queue_t *q, const blkelv_ioctl_arg_t * arg)
 {
+	elevator_t *elevator = &q->elevator;
 	blkelv_ioctl_arg_t input;
 
 	if (copy_from_user(&input, arg, sizeof(blkelv_ioctl_arg_t)))
@@ -206,9 +211,19 @@
 		return -EINVAL;
 	if (input.write_latency < 0)
 		return -EINVAL;
+	if (input.max_bomb_segments < 0)
+		return -EINVAL;
 
 	elevator->read_latency		= input.read_latency;
 	elevator->write_latency		= input.write_latency;
+	q->low_latency = input.max_bomb_segments & MAX_BOMB_LATENCY_MASK ? 1:0;
+	printk("queue %d: low latency mode is now %s\n", elevator->queue_ID, 
+		q->low_latency ? "on" : "off");
+	input.max_bomb_segments &= ~MAX_BOMB_LATENCY_MASK;
+	if (input.max_bomb_segments)
+		q->max_queue_sectors = input.max_bomb_segments;
+	printk("queue %d: max queue sectors is now %d\n", elevator->queue_ID, 
+		q->max_queue_sectors);
 	return 0;
 }
 
diff -Nru a/drivers/block/ll_rw_blk.c b/drivers/block/ll_rw_blk.c
--- a/drivers/block/ll_rw_blk.c	Thu Jul  3 14:02:50 2003
+++ b/drivers/block/ll_rw_blk.c	Thu Jul  3 14:02:50 2003
@@ -176,11 +176,12 @@
 {
 	int count = q->nr_requests;
 
-	count -= __blk_cleanup_queue(&q->rq[READ]);
-	count -= __blk_cleanup_queue(&q->rq[WRITE]);
+	count -= __blk_cleanup_queue(&q->rq);
 
 	if (count)
 		printk("blk_cleanup_queue: leaked requests (%d)\n", count);
+	if (atomic_read(&q->nr_sectors))
+		printk("blk_cleanup_queue: leaked sectors (%d)\n", atomic_read(&q->nr_sectors));
 
 	memset(q, 0, sizeof(*q));
 }
@@ -215,6 +216,24 @@
 }
 
 /**
+ * blk_queue_throttle_sectors - indicates you will call sector throttling funcs
+ * @q:       The queue which this applies to.
+ * @active:  A flag indicating if you want sector throttling on
+ *
+ * Description:
+ * The sector throttling code allows us to put a limit on the number of
+ * sectors pending io to the disk at a given time, sending @active nonzero
+ * indicates you will call blk_started_sectors and blk_finished_sectors in
+ * addition to calling blk_started_io and blk_finished_io in order to
+ * keep track of the number of sectors in flight.
+ **/
+ 
+void blk_queue_throttle_sectors(request_queue_t * q, int active)
+{
+	q->can_throttle = active;
+}
+
+/**
  * blk_queue_make_request - define an alternate make_request function for a device
  * @q:  the request queue for the device to be affected
  * @mfn: the alternate make_request function
@@ -360,8 +379,22 @@
 {
 	if (q->plugged) {
 		q->plugged = 0;
-		if (!list_empty(&q->queue_head))
+		if (!list_empty(&q->queue_head)) {
+
+			/* we don't want merges later on to come in 
+			 * and significantly increase the amount of
+			 * work during an unplug, it can lead to high
+			 * latencies while some poor waiter tries to
+			 * run an ever increasing chunk of io.
+			 * This does lower throughput some though.
+			 */
+			if (q->low_latency) {
+				struct request *rq;
+				rq = blkdev_entry_prev_request(&q->queue_head),
+				rq->elevator_sequence = 0;
+			}
 			q->request_fn(q);
+		}
 	}
 }
 
@@ -389,7 +422,7 @@
  *
  * Returns the (new) number of requests which the queue has available.
  */
-int blk_grow_request_list(request_queue_t *q, int nr_requests)
+int blk_grow_request_list(request_queue_t *q, int nr_requests, int max_queue_sectors)
 {
 	unsigned long flags;
 	/* Several broken drivers assume that this function doesn't sleep,
@@ -399,21 +432,34 @@
 	spin_lock_irqsave(&io_request_lock, flags);
 	while (q->nr_requests < nr_requests) {
 		struct request *rq;
-		int rw;
 
 		rq = kmem_cache_alloc(request_cachep, SLAB_ATOMIC);
 		if (rq == NULL)
 			break;
 		memset(rq, 0, sizeof(*rq));
 		rq->rq_status = RQ_INACTIVE;
-		rw = q->nr_requests & 1;
-		list_add(&rq->queue, &q->rq[rw].free);
-		q->rq[rw].count++;
+ 		list_add(&rq->queue, &q->rq.free);
+ 		q->rq.count++;
+
 		q->nr_requests++;
 	}
+
+ 	/*
+ 	 * Wakeup waiters after both one quarter of the
+ 	 * max-in-flight queue and one quarter of the requests
+ 	 * are available again.
+ 	 */
+
 	q->batch_requests = q->nr_requests / 4;
 	if (q->batch_requests > 32)
 		q->batch_requests = 32;
+ 	q->batch_sectors = max_queue_sectors / 4;
+ 
+ 	q->max_queue_sectors = max_queue_sectors;
+ 
+ 	BUG_ON(!q->batch_sectors);
+ 	atomic_set(&q->nr_sectors, 0);
+
 	spin_unlock_irqrestore(&io_request_lock, flags);
 	return q->nr_requests;
 }
@@ -422,23 +468,27 @@
 {
 	struct sysinfo si;
 	int megs;		/* Total memory, in megabytes */
-	int nr_requests;
-
-	INIT_LIST_HEAD(&q->rq[READ].free);
-	INIT_LIST_HEAD(&q->rq[WRITE].free);
-	q->rq[READ].count = 0;
-	q->rq[WRITE].count = 0;
+ 	int nr_requests, max_queue_sectors = MAX_QUEUE_SECTORS;
+  
+ 	INIT_LIST_HEAD(&q->rq.free);
+	q->rq.count = 0;
 	q->nr_requests = 0;
 
 	si_meminfo(&si);
 	megs = si.totalram >> (20 - PAGE_SHIFT);
-	nr_requests = 128;
-	if (megs < 32)
-		nr_requests /= 2;
-	blk_grow_request_list(q, nr_requests);
+ 	nr_requests = MAX_NR_REQUESTS;
+ 	if (megs < 30) {
+  		nr_requests /= 2;
+ 		max_queue_sectors /= 2;
+ 	}
+ 	/* notice early if anybody screwed the defaults */
+ 	BUG_ON(!nr_requests);
+ 	BUG_ON(!max_queue_sectors);
+ 
+ 	blk_grow_request_list(q, nr_requests, max_queue_sectors);
+
+ 	init_waitqueue_head(&q->wait_for_requests);
 
-	init_waitqueue_head(&q->wait_for_requests[0]);
-	init_waitqueue_head(&q->wait_for_requests[1]);
 	spin_lock_init(&q->queue_lock);
 }
 
@@ -491,6 +541,10 @@
 	q->plug_tq.routine	= &generic_unplug_device;
 	q->plug_tq.data		= q;
 	q->plugged        	= 0;
+	q->full			= 0;
+	q->can_throttle		= 0;
+	q->low_latency		= 0;
+
 	/*
 	 * These booleans describe the queue properties.  We set the
 	 * default (and most common) values here.  Other drivers can
@@ -508,12 +562,13 @@
  * Get a free request. io_request_lock must be held and interrupts
  * disabled on the way in.  Returns NULL if there are no free requests.
  */
-static struct request *get_request(request_queue_t *q, int rw)
+static struct request *__get_request(request_queue_t *q, int rw)
 {
 	struct request *rq = NULL;
-	struct request_list *rl = q->rq + rw;
+	struct request_list *rl;
 
-	if (!list_empty(&rl->free)) {
+	rl = &q->rq;
+	if (!list_empty(&rl->free) && !blk_oversized_queue(q)) {
 		rq = blkdev_free_rq(&rl->free);
 		list_del(&rq->queue);
 		rl->count--;
@@ -521,35 +576,47 @@
 		rq->cmd = rw;
 		rq->special = NULL;
 		rq->q = q;
-	}
-
+	} else if (q->low_latency)
+		q->full = 1;
 	return rq;
 }
 
 /*
- * Here's the request allocation design:
+ * get a free request, honoring the queue_full condition
+ */
+static inline struct request *get_request(request_queue_t *q, int rw)
+{
+	if (q->full)
+		return NULL;
+	return __get_request(q, rw);
+}
+
+/* 
+ * helper func to do memory barriers and wakeups when we finally decide
+ * to clear the queue full condition
+ */
+static inline void clear_full_and_wake(request_queue_t *q)
+{
+	q->full = 0;
+	mb();
+	if (waitqueue_active(&q->wait_for_requests))
+		wake_up(&q->wait_for_requests);
+}
+
+/*
+ * Here's the request allocation design, low latency version:
  *
  * 1: Blocking on request exhaustion is a key part of I/O throttling.
  * 
  * 2: We want to be `fair' to all requesters.  We must avoid starvation, and
  *    attempt to ensure that all requesters sleep for a similar duration.  Hence
  *    no stealing requests when there are other processes waiting.
- * 
- * 3: We also wish to support `batching' of requests.  So when a process is
- *    woken, we want to allow it to allocate a decent number of requests
- *    before it blocks again, so they can be nicely merged (this only really
- *    matters if the process happens to be adding requests near the head of
- *    the queue).
- * 
- * 4: We want to avoid scheduling storms.  This isn't really important, because
- *    the system will be I/O bound anyway.  But it's easy.
- * 
- *    There is tension between requirements 2 and 3.  Once a task has woken,
- *    we don't want to allow it to sleep as soon as it takes its second request.
- *    But we don't want currently-running tasks to steal all the requests
- *    from the sleepers.  We handle this with wakeup hysteresis around
- *    0 .. batch_requests and with the assumption that request taking is much,
- *    much faster than request freeing.
+ *
+ * There used to be more here, attempting to allow a process to send in a
+ * number of requests once it has woken up.  But, there's no way to 
+ * tell if a process has just been woken up, or if it is a new process
+ * coming in to steal requests from the waiters.  So, we give up and force
+ * everyone to wait fairly.
  * 
  * So here's what we do:
  * 
@@ -561,50 +628,67 @@
  * 
  *  When a process wants a new request:
  * 
- *    b) If free_requests == 0, the requester sleeps in FIFO manner.
- * 
- *    b) If 0 <  free_requests < batch_requests and there are waiters,
- *       we still take a request non-blockingly.  This provides batching.
- *
- *    c) If free_requests >= batch_requests, the caller is immediately
- *       granted a new request.
+ *    b) If free_requests == 0, the requester sleeps in FIFO manner, and
+ *       the queue full condition is set.  The full condition is not
+ *       cleared until there are no longer any waiters.  Once the full
+ *       condition is set, all new io must wait, hopefully for a very
+ *       short period of time.
  * 
  *  When a request is released:
  * 
- *    d) If free_requests < batch_requests, do nothing.
- * 
- *    f) If free_requests >= batch_requests, wake up a single waiter.
+ *    c) If free_requests < batch_requests, do nothing.
  * 
- *   The net effect is that when a process is woken at the batch_requests level,
- *   it will be able to take approximately (batch_requests) requests before
- *   blocking again (at the tail of the queue).
- * 
- *   This all assumes that the rate of taking requests is much, much higher
- *   than the rate of releasing them.  Which is very true.
+ *    d) If free_requests >= batch_requests, wake up a single waiter.
  *
- * -akpm, Feb 2002.
+ *   As each waiter gets a request, he wakes another waiter.  We do this
+ *   to prevent a race where an unplug might get run before a request makes
+ *   its way onto the queue.  The result is a cascade of wakeups, so delaying
+ *   the initial wakeup until we've got batch_requests available helps avoid
+ *   wakeups where there aren't any requests available yet.
  */
 
 static struct request *__get_request_wait(request_queue_t *q, int rw)
 {
 	register struct request *rq;
 	DECLARE_WAITQUEUE(wait, current);
+	int oversized;
+
+	add_wait_queue_exclusive(&q->wait_for_requests, &wait);
 
-	add_wait_queue(&q->wait_for_requests[rw], &wait);
 	do {
 		set_current_state(TASK_UNINTERRUPTIBLE);
-		generic_unplug_device(q);
-		if (q->rq[rw].count == 0)
-			schedule();
 		spin_lock_irq(&io_request_lock);
-		rq = get_request(q, rw);
+		oversized = blk_oversized_queue(q);
+		if (q->full || oversized) {
+			if (oversized)
+				__generic_unplug_device(q);
+			spin_unlock_irq(&io_request_lock);
+			schedule();
+			spin_lock_irq(&io_request_lock);
+		}
+		rq = __get_request(q, rw);
 		spin_unlock_irq(&io_request_lock);
 	} while (rq == NULL);
-	remove_wait_queue(&q->wait_for_requests[rw], &wait);
+	remove_wait_queue(&q->wait_for_requests, &wait);
 	current->state = TASK_RUNNING;
+
+	if (!waitqueue_active(&q->wait_for_requests))
+		clear_full_and_wake(q);
+
 	return rq;
 }
 
+static void get_request_wait_wakeup(request_queue_t *q, int rw)
+{
+	/*
+	 * avoid losing an unplug if a second __get_request_wait did the
+	 * generic_unplug_device while our __get_request_wait was running
+	 * w/o the queue_lock held and w/ our request out of the queue.
+	 */	
+	if (waitqueue_active(&q->wait_for_requests))
+		wake_up(&q->wait_for_requests);
+}
+
 /* RO fail safe mechanism */
 
 static long ro_bits[MAX_BLKDEV][8];
@@ -818,7 +902,6 @@
 void blkdev_release_request(struct request *req)
 {
 	request_queue_t *q = req->q;
-	int rw = req->cmd;
 
 	req->rq_status = RQ_INACTIVE;
 	req->q = NULL;
@@ -828,9 +911,19 @@
 	 * assume it has free buffers and check waiters
 	 */
 	if (q) {
-		list_add(&req->queue, &q->rq[rw].free);
-		if (++q->rq[rw].count >= q->batch_requests)
-			wake_up(&q->wait_for_requests[rw]);
+		int oversized_batch = 0;
+
+		if (q->can_throttle)
+			oversized_batch = blk_oversized_queue_batch(q);
+		q->rq.count++;
+		list_add(&req->queue, &q->rq.free);
+		if (q->rq.count >= q->batch_requests && !oversized_batch) {
+			smp_mb();
+			if (waitqueue_active(&q->wait_for_requests))
+				wake_up(&q->wait_for_requests);
+			else
+				clear_full_and_wake(q);
+		}
 	}
 }
 
@@ -908,6 +1001,7 @@
 	struct list_head *head, *insert_here;
 	int latency;
 	elevator_t *elevator = &q->elevator;
+	int should_wake = 0;
 
 	count = bh->b_size >> 9;
 	sector = bh->b_rsector;
@@ -948,7 +1042,6 @@
 	 */
 	max_sectors = get_max_sectors(bh->b_rdev);
 
-again:
 	req = NULL;
 	head = &q->queue_head;
 	/*
@@ -957,7 +1050,9 @@
 	 */
 	spin_lock_irq(&io_request_lock);
 
+again:
 	insert_here = head->prev;
+
 	if (list_empty(head)) {
 		q->plug_device_fn(q, bh->b_rdev); /* is atomic */
 		goto get_rq;
@@ -976,6 +1071,7 @@
 			req->bhtail = bh;
 			req->nr_sectors = req->hard_nr_sectors += count;
 			blk_started_io(count);
+			blk_started_sectors(req, count);
 			drive_stat_acct(req->rq_dev, req->cmd, count, 0);
 			req_new_io(req, 1, count);
 			attempt_back_merge(q, req, max_sectors, max_segments);
@@ -998,6 +1094,7 @@
 			req->sector = req->hard_sector = sector;
 			req->nr_sectors = req->hard_nr_sectors += count;
 			blk_started_io(count);
+			blk_started_sectors(req, count);
 			drive_stat_acct(req->rq_dev, req->cmd, count, 0);
 			req_new_io(req, 1, count);
 			attempt_front_merge(q, head, req, max_sectors, max_segments);
@@ -1030,7 +1127,7 @@
 		 * See description above __get_request_wait()
 		 */
 		if (rw_ahead) {
-			if (q->rq[rw].count < q->batch_requests) {
+			if (q->rq.count < q->batch_requests || blk_oversized_queue_batch(q)) {
 				spin_unlock_irq(&io_request_lock);
 				goto end_io;
 			}
@@ -1042,6 +1139,9 @@
 			if (req == NULL) {
 				spin_unlock_irq(&io_request_lock);
 				freereq = __get_request_wait(q, rw);
+				head = &q->queue_head;
+				spin_lock_irq(&io_request_lock);
+				should_wake = 1;
 				goto again;
 			}
 		}
@@ -1064,10 +1164,13 @@
 	req->start_time = jiffies;
 	req_new_io(req, 0, count);
 	blk_started_io(count);
+	blk_started_sectors(req, count);
 	add_request(q, req, insert_here);
 out:
 	if (freereq)
 		blkdev_release_request(freereq);
+	if (should_wake)
+		get_request_wait_wakeup(q, rw);
 	spin_unlock_irq(&io_request_lock);
 	return 0;
 end_io:
@@ -1196,8 +1299,15 @@
 	bh->b_rdev = bh->b_dev;
 	bh->b_rsector = bh->b_blocknr * count;
 
+	get_bh(bh);
 	generic_make_request(rw, bh);
 
+	/* fix race condition with wait_on_buffer() */
+	smp_mb(); /* spin_unlock may have inclusive semantics */
+	if (waitqueue_active(&bh->b_wait))
+		wake_up(&bh->b_wait);
+
+	put_bh(bh);
 	switch (rw) {
 		case WRITE:
 			kstat.pgpgout += count;
@@ -1350,6 +1460,7 @@
 	if ((bh = req->bh) != NULL) {
 		nsect = bh->b_size >> 9;
 		blk_finished_io(nsect);
+		blk_finished_sectors(req, nsect);
 		req->bh = bh->b_reqnext;
 		bh->b_reqnext = NULL;
 		bh->b_end_io(bh, uptodate);
@@ -1509,6 +1620,7 @@
 EXPORT_SYMBOL(blk_get_queue);
 EXPORT_SYMBOL(blk_cleanup_queue);
 EXPORT_SYMBOL(blk_queue_headactive);
+EXPORT_SYMBOL(blk_queue_throttle_sectors);
 EXPORT_SYMBOL(blk_queue_make_request);
 EXPORT_SYMBOL(generic_make_request);
 EXPORT_SYMBOL(blkdev_release_request);
diff -Nru a/drivers/ide/ide-probe.c b/drivers/ide/ide-probe.c
--- a/drivers/ide/ide-probe.c	Thu Jul  3 14:02:50 2003
+++ b/drivers/ide/ide-probe.c	Thu Jul  3 14:02:50 2003
@@ -971,6 +971,7 @@
 
 	q->queuedata = HWGROUP(drive);
 	blk_init_queue(q, do_ide_request);
+	blk_queue_throttle_sectors(q, 1);
 }
 
 #undef __IRQ_HELL_SPIN
diff -Nru a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
--- a/drivers/scsi/scsi.c	Thu Jul  3 14:02:50 2003
+++ b/drivers/scsi/scsi.c	Thu Jul  3 14:02:50 2003
@@ -197,6 +197,7 @@
 
 	blk_init_queue(q, scsi_request_fn);
 	blk_queue_headactive(q, 0);
+	blk_queue_throttle_sectors(q, 1);
 	q->queuedata = (void *) SDpnt;
 }
 
diff -Nru a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
--- a/drivers/scsi/scsi_lib.c	Thu Jul  3 14:02:50 2003
+++ b/drivers/scsi/scsi_lib.c	Thu Jul  3 14:02:50 2003
@@ -378,6 +378,7 @@
 		if ((bh = req->bh) != NULL) {
 			nsect = bh->b_size >> 9;
 			blk_finished_io(nsect);
+			blk_finished_sectors(req, nsect);
 			req->bh = bh->b_reqnext;
 			bh->b_reqnext = NULL;
 			sectors -= nsect;
diff -Nru a/fs/buffer.c b/fs/buffer.c
--- a/fs/buffer.c	Thu Jul  3 14:02:50 2003
+++ b/fs/buffer.c	Thu Jul  3 14:02:50 2003
@@ -153,10 +153,23 @@
 	get_bh(bh);
 	add_wait_queue(&bh->b_wait, &wait);
 	do {
-		run_task_queue(&tq_disk);
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 		if (!buffer_locked(bh))
 			break;
+		/*
+		 * We must read tq_disk in TQ_ACTIVE after the
+		 * add_wait_queue effect is visible to other cpus.
+		 * We could unplug some line above it wouldn't matter
+		 * but we can't do that right after add_wait_queue
+		 * without an smp_mb() in between because spin_unlock
+		 * has inclusive semantics.
+		 * Doing it here is the most efficient place so we
+		 * don't do a suprious unplug if we get a racy
+		 * wakeup that make buffer_locked to return 0, and
+		 * doing it here avoids an explicit smp_mb() we
+		 * rely on the implicit one in set_task_state.
+		 */
+		run_task_queue(&tq_disk);
 		schedule();
 	} while (buffer_locked(bh));
 	tsk->state = TASK_RUNNING;
@@ -1516,6 +1529,9 @@
 
 	/* Done - end_buffer_io_async will unlock */
 	SetPageUptodate(page);
+
+	wakeup_page_waiters(page);
+
 	return 0;
 
 out:
@@ -1547,6 +1563,7 @@
 	} while (bh != head);
 	if (need_unlock)
 		UnlockPage(page);
+	wakeup_page_waiters(page);
 	return err;
 }
 
@@ -1774,6 +1791,8 @@
 		else
 			submit_bh(READ, bh);
 	}
+
+	wakeup_page_waiters(page);
 	
 	return 0;
 }
@@ -2400,6 +2419,7 @@
 		submit_bh(rw, bh);
 		bh = next;
 	} while (bh != head);
+	wakeup_page_waiters(page);
 	return 0;
 }
 
diff -Nru a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
--- a/fs/reiserfs/inode.c	Thu Jul  3 14:02:50 2003
+++ b/fs/reiserfs/inode.c	Thu Jul  3 14:02:50 2003
@@ -2048,6 +2048,7 @@
     */
     if (nr) {
         submit_bh_for_writepage(arr, nr) ;
+	wakeup_page_waiters(page);
     } else {
         UnlockPage(page) ;
     }
diff -Nru a/include/linux/blkdev.h b/include/linux/blkdev.h
--- a/include/linux/blkdev.h	Thu Jul  3 14:02:50 2003
+++ b/include/linux/blkdev.h	Thu Jul  3 14:02:50 2003
@@ -64,12 +64,6 @@
 typedef void (plug_device_fn) (request_queue_t *q, kdev_t device);
 typedef void (unplug_device_fn) (void *q);
 
-/*
- * Default nr free requests per queue, ll_rw_blk will scale it down
- * according to available RAM at init time
- */
-#define QUEUE_NR_REQUESTS	8192
-
 struct request_list {
 	unsigned int count;
 	struct list_head free;
@@ -80,7 +74,7 @@
 	/*
 	 * the queue request freelist, one for reads and one for writes
 	 */
-	struct request_list	rq[2];
+	struct request_list	rq;
 
 	/*
 	 * The total number of requests on each queue
@@ -93,6 +87,21 @@
 	int batch_requests;
 
 	/*
+	 * The total number of 512byte blocks on each queue
+	 */
+	atomic_t nr_sectors;
+
+	/*
+	 * Batching threshold for sleep/wakeup decisions
+	 */
+	int batch_sectors;
+
+	/*
+	 * The max number of 512byte blocks on each queue
+	 */
+	int max_queue_sectors;
+
+	/*
 	 * Together with queue_head for cacheline sharing
 	 */
 	struct list_head	queue_head;
@@ -118,13 +127,34 @@
 	/*
 	 * Boolean that indicates whether this queue is plugged or not.
 	 */
-	char			plugged;
+	int			plugged:1;
 
 	/*
 	 * Boolean that indicates whether current_request is active or
 	 * not.
 	 */
-	char			head_active;
+	int			head_active:1;
+
+	/*
+	 * Booleans that indicate whether the queue's free requests have
+	 * been exhausted and is waiting to drop below the batch_requests
+	 * threshold
+	 */
+	int			full:1;
+	
+	/*
+	 * Boolean that indicates you will use blk_started_sectors
+	 * and blk_finished_sectors in addition to blk_started_io
+	 * and blk_finished_io.  It enables the throttling code to 
+	 * help keep the sectors in flight to a reasonable value
+	 */
+	int			can_throttle:1;
+
+	/*
+	 * Boolean that indicates the queue should prefer low
+	 * latency over throughput.  This enables the q->full checks
+	 */
+	int			low_latency:1;
 
 	unsigned long		bounce_pfn;
 
@@ -137,7 +167,7 @@
 	/*
 	 * Tasks wait here for free read and write requests
 	 */
-	wait_queue_head_t	wait_for_requests[2];
+	wait_queue_head_t	wait_for_requests;
 };
 
 #define blk_queue_plugged(q)	(q)->plugged
@@ -221,10 +251,11 @@
 /*
  * Access functions for manipulating queue properties
  */
-extern int blk_grow_request_list(request_queue_t *q, int nr_requests);
+extern int blk_grow_request_list(request_queue_t *q, int nr_requests, int max_queue_sectors);
 extern void blk_init_queue(request_queue_t *, request_fn_proc *);
 extern void blk_cleanup_queue(request_queue_t *);
 extern void blk_queue_headactive(request_queue_t *, int);
+extern void blk_queue_throttle_sectors(request_queue_t *, int);
 extern void blk_queue_make_request(request_queue_t *, make_request_fn *);
 extern void generic_unplug_device(void *);
 extern inline int blk_seg_merge_ok(struct buffer_head *, struct buffer_head *);
@@ -243,6 +274,8 @@
 
 #define MAX_SEGMENTS 128
 #define MAX_SECTORS 255
+#define MAX_QUEUE_SECTORS (4 << (20 - 9)) /* 4 mbytes when full sized */
+#define MAX_NR_REQUESTS 1024 /* 1024k when in 512 units, normally min is 1M in 1k units */
 
 #define PageAlignSize(size) (((size) + PAGE_SIZE -1) & PAGE_MASK)
 
@@ -268,8 +301,50 @@
 	return retval;
 }
 
+static inline int blk_oversized_queue(request_queue_t * q)
+{
+	if (q->can_throttle)
+		return atomic_read(&q->nr_sectors) > q->max_queue_sectors;
+	return q->rq.count == 0;
+}
+
+static inline int blk_oversized_queue_batch(request_queue_t * q)
+{
+	return atomic_read(&q->nr_sectors) > q->max_queue_sectors - q->batch_sectors;
+}
+
 #define blk_finished_io(nsects)	do { } while (0)
 #define blk_started_io(nsects)	do { } while (0)
+
+static inline void blk_started_sectors(struct request *rq, int count)
+{
+	request_queue_t *q = rq->q;
+	if (q && q->can_throttle) {
+		atomic_add(count, &q->nr_sectors);
+		if (atomic_read(&q->nr_sectors) < 0) {
+			printk("nr_sectors is %d\n", atomic_read(&q->nr_sectors));
+			BUG();
+		}
+	}
+}
+
+static inline void blk_finished_sectors(struct request *rq, int count)
+{
+	request_queue_t *q = rq->q;
+	if (q && q->can_throttle) {
+		atomic_sub(count, &q->nr_sectors);
+		
+		smp_mb();
+		if (q->rq.count >= q->batch_requests && !blk_oversized_queue_batch(q)) {
+			if (waitqueue_active(&q->wait_for_requests))
+				wake_up(&q->wait_for_requests);
+		}
+		if (atomic_read(&q->nr_sectors) < 0) {
+			printk("nr_sectors is %d\n", atomic_read(&q->nr_sectors));
+			BUG();
+		}
+	}
+}
 
 static inline unsigned int blksize_bits(unsigned int size)
 {
diff -Nru a/include/linux/elevator.h b/include/linux/elevator.h
--- a/include/linux/elevator.h	Thu Jul  3 14:02:50 2003
+++ b/include/linux/elevator.h	Thu Jul  3 14:02:50 2003
@@ -35,14 +35,19 @@
 	int queue_ID;
 	int read_latency;
 	int write_latency;
+/*
+ * (max_bomb_segments & MAX_BOMB_LATENCY_MASK) == 1 indicates low latency
+ * mode.  We're using any odd number to indicate low latency is on.
+ */
+#define MAX_BOMB_LATENCY_MASK 1
 	int max_bomb_segments;
 } blkelv_ioctl_arg_t;
 
 #define BLKELVGET   _IOR(0x12,106,sizeof(blkelv_ioctl_arg_t))
 #define BLKELVSET   _IOW(0x12,107,sizeof(blkelv_ioctl_arg_t))
 
-extern int blkelvget_ioctl(elevator_t *, blkelv_ioctl_arg_t *);
-extern int blkelvset_ioctl(elevator_t *, const blkelv_ioctl_arg_t *);
+extern int blkelvget_ioctl(request_queue_t *, blkelv_ioctl_arg_t *);
+extern int blkelvset_ioctl(request_queue_t *, const blkelv_ioctl_arg_t *);
 
 extern void elevator_init(elevator_t *, elevator_t);
 
@@ -80,7 +85,7 @@
 	return latency;
 }
 
-#define ELV_LINUS_SEEK_COST	16
+#define ELV_LINUS_SEEK_COST	1
 
 #define ELEVATOR_NOOP							\
 ((elevator_t) {								\
@@ -93,8 +98,8 @@
 
 #define ELEVATOR_LINUS							\
 ((elevator_t) {								\
-	2048,				/* read passovers */		\
-	8192,				/* write passovers */		\
+	128,				/* read passovers */		\
+	512,				/* write passovers */		\
 									\
 	elevator_linus_merge,		/* elevator_merge_fn */		\
 	elevator_linus_merge_req,	/* elevator_merge_req_fn */	\
diff -Nru a/include/linux/pagemap.h b/include/linux/pagemap.h
--- a/include/linux/pagemap.h	Thu Jul  3 14:02:50 2003
+++ b/include/linux/pagemap.h	Thu Jul  3 14:02:50 2003
@@ -97,6 +97,8 @@
 		___wait_on_page(page);
 }
 
+extern void FASTCALL(wakeup_page_waiters(struct page * page));
+
 /*
  * Returns locked page at given index in given cache, creating it if needed.
  */
diff -Nru a/kernel/ksyms.c b/kernel/ksyms.c
--- a/kernel/ksyms.c	Thu Jul  3 14:02:50 2003
+++ b/kernel/ksyms.c	Thu Jul  3 14:02:50 2003
@@ -296,6 +296,7 @@
 EXPORT_SYMBOL(filemap_fdatawait);
 EXPORT_SYMBOL(lock_page);
 EXPORT_SYMBOL(unlock_page);
+EXPORT_SYMBOL(wakeup_page_waiters);
 
 /* device registration */
 EXPORT_SYMBOL(register_chrdev);
diff -Nru a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c	Thu Jul  3 14:02:50 2003
+++ b/mm/filemap.c	Thu Jul  3 14:02:50 2003
@@ -810,6 +810,20 @@
 	return &wait[hash];
 }
 
+/*
+ * This must be called after every submit_bh with end_io
+ * callbacks that would result into the blkdev layer waking
+ * up the page after a queue unplug.
+ */
+void wakeup_page_waiters(struct page * page)
+{
+	wait_queue_head_t * head;
+
+	head = page_waitqueue(page);
+	if (waitqueue_active(head))
+		wake_up(head);
+}
+
 /* 
  * Wait for a page to get unlocked.
  *

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Status of the IO scheduler fixes for 2.4
  2003-07-03  2:02   ` Chris Mason
  2003-07-03 10:58     ` Stephan von Krawczynski
  2003-07-03 12:31     ` Marc-Christian Petersen
@ 2003-07-04 20:01     ` Marcelo Tosatti
  2003-07-04 21:37       ` Chris Mason
  2003-07-05  0:00       ` Andrea Arcangeli
  2 siblings, 2 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2003-07-04 20:01 UTC (permalink / raw)
  To: Chris Mason; +Cc: lkml, Alan Cox, Andrea Arcangeli, Nick Piggin



On Wed, 2 Jul 2003, Chris Mason wrote:

> On Wed, 2003-07-02 at 18:28, Marcelo Tosatti wrote:
> > On Wed, 2 Jul 2003, Marcelo Tosatti wrote:
> >
> > >
> > > Hello people,
> > >
> > > What is the status of the IO scheduler fixes for increased fairness for
> > > 2.4 ?
> > >
> > > I haven't had time to read and think about everything you guys discussed,
> > > so a brief summary would be very helpful for me.
> > >
> > > Danke
> >
> > Ah, we all want that the fairness issues to be fixed in 2.4.22, right ?
>
> My current code is attached, it's basically a merge of these 3 patches,
> with modifications based on benchmarks and latency measurements here.
>
> fix_pausing:  From Andrea, it fixes a few corner case races where
> wakeups can be missed in wait_on_buffer, wait_on_page, and
> __get_request_wait.
>
> elevator-low-latency: From Andrea, it keeps the amount of io on a given
> queue to a reasonable number.  This prevents a small number of huge
> requests from introducing large latencies on smaller requests.
>
> q->full: From Nick, it reduces latency in __get_request_wait by making
> sure new io can't come in and steal requests before old waiters are
> served.
>
> Those represent the big 3 areas I believe the latencies are coming
> from.  The q->full patch can hurt throughput badly as the number of
> writers increases (50% of what 2.4.21 gets for 10 or more concurrent
> streaming writers), but it really seems to help desktop workloads here.

Chris,

Would you please separate those three fixes into separate diffs?

For me it seems low latency and fix-pausing patches should be enough for
"better" IO fairness. I might be wrong about that, though.

Let's try this: Include elevator-low-latency in -pre3 (which I'm trying to
release today), then fix pausing in -pre4. If the IO fairness still doesn't
get somewhat better for general use (we'll get isolated user reports and
benchmarks for both patches), then I might consider the q->full
patch (it has throughput drawbacks and I prefer avoiding a tunable
there).


Sounds good?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Status of the IO scheduler fixes for 2.4
  2003-07-04 20:01     ` Marcelo Tosatti
@ 2003-07-04 21:37       ` Chris Mason
  2003-07-05  0:05         ` Andrea Arcangeli
  2003-07-06  7:58         ` Marc-Christian Petersen
  2003-07-05  0:00       ` Andrea Arcangeli
  1 sibling, 2 replies; 21+ messages in thread
From: Chris Mason @ 2003-07-04 21:37 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: lkml, Alan Cox, Andrea Arcangeli, Nick Piggin

[-- Attachment #1: Type: text/plain, Size: 1737 bytes --]

On Fri, 2003-07-04 at 16:01, Marcelo Tosatti wrote:

> Chris,
> 
> Would you please separate those three fixes into separate diffs?
> 
> For me it seems low latency and fix-pausing patches should be enough for
> "better" IO fairness. I might be wrong about that, though.
> 
> Let's try this: Include elevator-low-latency in -pre3 (which I'm trying to
> release today), then fix pausing in -pre4. 

I don't believe elevator-low-latency is good without fix-pausing; we
need the SMP corner-case fixes that fix-pausing provides.

> If the IO fairness still doesn't
> get somewhat better for general use (we'll get isolated user reports and
> benchmarks for both patches), then I might consider the q->full
> patch (it has throughput drawbacks and I prefer avoiding a tunable
> there).
> 
> 
> Sounds good?

We effectively had this in 2.4.19-aa and 2.4.20-aa, and people still
reported stalls on those kernels.  So, aside from testing them
separately I'm not sure what benefit we really get.  I really like the
tunable for max_queue_sectors at least (without the q->full overload);
I'd like to get that in sometime during 2.4.22-pre if we take
elevator-low-latency.

It's a holiday weekend here, so I don't have a huge amount of time to
rediff and test.  The code below is fix-pausing and elevator-low-latency
combined, without any of the q->full code.  It is lightly tested, so
please run it through a few paces locally if you plan on applying it.

I've also attached a patch I've been working on to solve the latencies a
different way.  bdflush-progress.diff changes balance_dirty to wait on
bdflush instead of trying to start io.  It helps a lot here (both
throughput and latency) but hasn't yet been tested on a box with tons of
drives.

-chris

[-- Attachment #2: bdflush_progress-3.diff --]
[-- Type: text/plain, Size: 2509 bytes --]

--- 1.87/fs/buffer.c	Thu Jul  3 13:43:56 2003
+++ edited/fs/buffer.c	Fri Jul  4 13:01:28 2003
@@ -83,6 +83,8 @@
 static int nr_unused_buffer_heads;
 static spinlock_t unused_list_lock = SPIN_LOCK_UNLOCKED;
 static DECLARE_WAIT_QUEUE_HEAD(buffer_wait);
+static DECLARE_WAIT_QUEUE_HEAD(bdflush_progress_wait);
+static atomic_t bdflush_generation = ATOMIC_INIT(0);
 
 static int grow_buffers(kdev_t dev, unsigned long block, int size);
 static int osync_buffers_list(struct list_head *);
@@ -1036,7 +1038,6 @@
 	if (state < 0)
 		return;
 
-	wakeup_bdflush();
 
 	/*
 	 * And if we're _really_ out of balance, wait for
@@ -1044,9 +1045,20 @@
 	 * This will throttle heavy writers.
 	 */
 	if (state > 0) {
-		spin_lock(&lru_list_lock);
-		write_some_buffers(NODEV);
-	}
+		int gen = atomic_read(&bdflush_generation);
+		DECLARE_WAITQUEUE(wait, current);
+		add_wait_queue_exclusive(&bdflush_progress_wait, &wait);
+		do {
+			set_current_state(TASK_UNINTERRUPTIBLE);
+			wakeup_bdflush();
+			if (gen != atomic_read(&bdflush_generation))
+				break;
+			schedule();
+		} while(gen == atomic_read(&bdflush_generation));
+		remove_wait_queue(&bdflush_progress_wait, &wait);
+		set_current_state(TASK_RUNNING);
+	} else
+		wakeup_bdflush();
 }
 EXPORT_SYMBOL(balance_dirty);
 
@@ -2947,6 +2959,16 @@
 	return 0;
 }
 
+static void poke_throttled_writers(int all) {
+	atomic_inc(&bdflush_generation);
+	smp_mb();
+	if (waitqueue_active(&bdflush_progress_wait)) {
+		if (all)
+			wake_up_all(&bdflush_progress_wait);
+		else
+			wake_up(&bdflush_progress_wait);
+	}
+}
 /*
  * This is the actual bdflush daemon itself. It used to be started from
  * the syscall above, but now we launch it ourselves internally with
@@ -2955,6 +2977,7 @@
 int bdflush(void *startup)
 {
 	struct task_struct *tsk = current;
+	DECLARE_WAITQUEUE(wait, tsk);
 
 	/*
 	 *	We have a bare-bones task_struct, and really should fill
@@ -2992,12 +3015,21 @@
 
 		while (ndirty > 0) {
 			spin_lock(&lru_list_lock);
-			if (!write_some_buffers(NODEV))
+			if (!write_some_buffers(NODEV)) {
+				poke_throttled_writers(0);
 				break;
+			} 
+			poke_throttled_writers(0);
 			ndirty -= NRSYNC;
 		}
-		if (ndirty > 0 || bdflush_stop())
-			interruptible_sleep_on(&bdflush_wait);
+		if (ndirty > 0 || bdflush_stop()) {
+			add_wait_queue(&bdflush_wait, &wait);
+			set_current_state(TASK_INTERRUPTIBLE);
+			poke_throttled_writers(1);
+			schedule();
+			remove_wait_queue(&bdflush_wait, &wait);
+			set_current_state(TASK_RUNNING);
+		}
 	}
 }
 

[-- Attachment #3: io-stalls-10.diff --]
[-- Type: text/plain, Size: 22060 bytes --]

===== drivers/block/ll_rw_blk.c 1.45 vs edited =====
--- 1.45/drivers/block/ll_rw_blk.c	Wed May 28 03:50:02 2003
+++ edited/drivers/block/ll_rw_blk.c	Fri Jul  4 17:35:08 2003
@@ -176,11 +176,12 @@
 {
 	int count = q->nr_requests;
 
-	count -= __blk_cleanup_queue(&q->rq[READ]);
-	count -= __blk_cleanup_queue(&q->rq[WRITE]);
+	count -= __blk_cleanup_queue(&q->rq);
 
 	if (count)
 		printk("blk_cleanup_queue: leaked requests (%d)\n", count);
+	if (atomic_read(&q->nr_sectors))
+		printk("blk_cleanup_queue: leaked sectors (%d)\n", atomic_read(&q->nr_sectors));
 
 	memset(q, 0, sizeof(*q));
 }
@@ -215,6 +216,24 @@
 }
 
 /**
+ * blk_queue_throttle_sectors - indicates you will call sector throttling funcs
+ * @q:       The queue which this applies to.
+ * @active:  A flag indication if you want sector throttling on
+ *
+ * Description:
+ * The sector throttling code allows us to put a limit on the number of
+ * sectors pending io to the disk at a given time, sending @active nonzero
+ * indicates you will call blk_started_sectors and blk_finished_sectors in
+ * addition to calling blk_started_io and blk_finished_io in order to
+ * keep track of the number of sectors in flight.
+ **/
+ 
+void blk_queue_throttle_sectors(request_queue_t * q, int active)
+{
+	q->can_throttle = active;
+}
+
+/**
  * blk_queue_make_request - define an alternate make_request function for a device
  * @q:  the request queue for the device to be affected
  * @mfn: the alternate make_request function
@@ -389,7 +408,7 @@
  *
  * Returns the (new) number of requests which the queue has available.
  */
-int blk_grow_request_list(request_queue_t *q, int nr_requests)
+int blk_grow_request_list(request_queue_t *q, int nr_requests, int max_queue_sectors)
 {
 	unsigned long flags;
 	/* Several broken drivers assume that this function doesn't sleep,
@@ -399,21 +418,34 @@
 	spin_lock_irqsave(&io_request_lock, flags);
 	while (q->nr_requests < nr_requests) {
 		struct request *rq;
-		int rw;
 
 		rq = kmem_cache_alloc(request_cachep, SLAB_ATOMIC);
 		if (rq == NULL)
 			break;
 		memset(rq, 0, sizeof(*rq));
 		rq->rq_status = RQ_INACTIVE;
-		rw = q->nr_requests & 1;
-		list_add(&rq->queue, &q->rq[rw].free);
-		q->rq[rw].count++;
+ 		list_add(&rq->queue, &q->rq.free);
+ 		q->rq.count++;
+
 		q->nr_requests++;
 	}
+
+ 	/*
+ 	 * Wakeup waiters after both one quarter of the
+ 	 * max-in-fligh queue and one quarter of the requests
+ 	 * are available again.
+ 	 */
+
 	q->batch_requests = q->nr_requests / 4;
 	if (q->batch_requests > 32)
 		q->batch_requests = 32;
+ 	q->batch_sectors = max_queue_sectors / 4;
+ 
+ 	q->max_queue_sectors = max_queue_sectors;
+ 
+ 	BUG_ON(!q->batch_sectors);
+ 	atomic_set(&q->nr_sectors, 0);
+
 	spin_unlock_irqrestore(&io_request_lock, flags);
 	return q->nr_requests;
 }
@@ -422,23 +454,27 @@
 {
 	struct sysinfo si;
 	int megs;		/* Total memory, in megabytes */
-	int nr_requests;
-
-	INIT_LIST_HEAD(&q->rq[READ].free);
-	INIT_LIST_HEAD(&q->rq[WRITE].free);
-	q->rq[READ].count = 0;
-	q->rq[WRITE].count = 0;
+ 	int nr_requests, max_queue_sectors = MAX_QUEUE_SECTORS;
+  
+ 	INIT_LIST_HEAD(&q->rq.free);
+	q->rq.count = 0;
 	q->nr_requests = 0;
 
 	si_meminfo(&si);
 	megs = si.totalram >> (20 - PAGE_SHIFT);
-	nr_requests = 128;
-	if (megs < 32)
-		nr_requests /= 2;
-	blk_grow_request_list(q, nr_requests);
+ 	nr_requests = MAX_NR_REQUESTS;
+ 	if (megs < 30) {
+  		nr_requests /= 2;
+ 		max_queue_sectors /= 2;
+ 	}
+ 	/* notice early if anybody screwed the defaults */
+ 	BUG_ON(!nr_requests);
+ 	BUG_ON(!max_queue_sectors);
+ 
+ 	blk_grow_request_list(q, nr_requests, max_queue_sectors);
+
+ 	init_waitqueue_head(&q->wait_for_requests);
 
-	init_waitqueue_head(&q->wait_for_requests[0]);
-	init_waitqueue_head(&q->wait_for_requests[1]);
 	spin_lock_init(&q->queue_lock);
 }
 
@@ -491,6 +527,8 @@
 	q->plug_tq.routine	= &generic_unplug_device;
 	q->plug_tq.data		= q;
 	q->plugged        	= 0;
+	q->can_throttle		= 0;
+
 	/*
 	 * These booleans describe the queue properties.  We set the
 	 * default (and most common) values here.  Other drivers can
@@ -511,9 +549,10 @@
 static struct request *get_request(request_queue_t *q, int rw)
 {
 	struct request *rq = NULL;
-	struct request_list *rl = q->rq + rw;
+	struct request_list *rl;
 
-	if (!list_empty(&rl->free)) {
+	rl = &q->rq;
+	if (!list_empty(&rl->free) && !blk_oversized_queue(q)) {
 		rq = blkdev_free_rq(&rl->free);
 		list_del(&rq->queue);
 		rl->count--;
@@ -522,34 +561,23 @@
 		rq->special = NULL;
 		rq->q = q;
 	}
-
 	return rq;
 }
 
 /*
- * Here's the request allocation design:
+ * Here's the request allocation design, low latency version:
  *
  * 1: Blocking on request exhaustion is a key part of I/O throttling.
  * 
  * 2: We want to be `fair' to all requesters.  We must avoid starvation, and
  *    attempt to ensure that all requesters sleep for a similar duration.  Hence
  *    no stealing requests when there are other processes waiting.
- * 
- * 3: We also wish to support `batching' of requests.  So when a process is
- *    woken, we want to allow it to allocate a decent number of requests
- *    before it blocks again, so they can be nicely merged (this only really
- *    matters if the process happens to be adding requests near the head of
- *    the queue).
- * 
- * 4: We want to avoid scheduling storms.  This isn't really important, because
- *    the system will be I/O bound anyway.  But it's easy.
- * 
- *    There is tension between requirements 2 and 3.  Once a task has woken,
- *    we don't want to allow it to sleep as soon as it takes its second request.
- *    But we don't want currently-running tasks to steal all the requests
- *    from the sleepers.  We handle this with wakeup hysteresis around
- *    0 .. batch_requests and with the assumption that request taking is much,
- *    much faster than request freeing.
+ *
+ * There used to be more here, attempting to allow a process to send in a
+ * number of requests once it has woken up.  But, there's no way to 
+ * tell if a process has just been woken up, or if it is a new process
+ * coming in to steal requests from the waiters.  So, we give up and force
+ * everyone to wait fairly.
  * 
  * So here's what we do:
  * 
@@ -561,28 +589,23 @@
  * 
  *  When a process wants a new request:
  * 
- *    b) If free_requests == 0, the requester sleeps in FIFO manner.
- * 
- *    b) If 0 <  free_requests < batch_requests and there are waiters,
- *       we still take a request non-blockingly.  This provides batching.
- *
- *    c) If free_requests >= batch_requests, the caller is immediately
- *       granted a new request.
+ *    b) If free_requests == 0, the requester sleeps in FIFO manner, and
+ *       the queue full condition is set.  The full condition is not
+ *       cleared until there are no longer any waiters.  Once the full
+ *       condition is set, all new io must wait, hopefully for a very
+ *       short period of time.
  * 
  *  When a request is released:
  * 
- *    d) If free_requests < batch_requests, do nothing.
- * 
- *    f) If free_requests >= batch_requests, wake up a single waiter.
+ *    c) If free_requests < batch_requests, do nothing.
  * 
- *   The net effect is that when a process is woken at the batch_requests level,
- *   it will be able to take approximately (batch_requests) requests before
- *   blocking again (at the tail of the queue).
- * 
- *   This all assumes that the rate of taking requests is much, much higher
- *   than the rate of releasing them.  Which is very true.
+ *    d) If free_requests >= batch_requests, wake up a single waiter.
  *
- * -akpm, Feb 2002.
+ *   As each waiter gets a request, he wakes another waiter.  We do this
+ *   to prevent a race where an unplug might get run before a request makes
+ *   it's way onto the queue.  The result is a cascade of wakeups, so delaying
+ *   the initial wakeup until we've got batch_requests available helps avoid
+ *   wakeups where there aren't any requests available yet.
  */
 
 static struct request *__get_request_wait(request_queue_t *q, int rw)
@@ -590,21 +613,37 @@
 	register struct request *rq;
 	DECLARE_WAITQUEUE(wait, current);
 
-	add_wait_queue(&q->wait_for_requests[rw], &wait);
+	add_wait_queue_exclusive(&q->wait_for_requests, &wait);
+
 	do {
 		set_current_state(TASK_UNINTERRUPTIBLE);
-		generic_unplug_device(q);
-		if (q->rq[rw].count == 0)
-			schedule();
 		spin_lock_irq(&io_request_lock);
+		if (blk_oversized_queue(q)) {
+			__generic_unplug_device(q);
+			spin_unlock_irq(&io_request_lock);
+			schedule();
+			spin_lock_irq(&io_request_lock);
+		}
 		rq = get_request(q, rw);
 		spin_unlock_irq(&io_request_lock);
 	} while (rq == NULL);
-	remove_wait_queue(&q->wait_for_requests[rw], &wait);
+	remove_wait_queue(&q->wait_for_requests, &wait);
 	current->state = TASK_RUNNING;
+
 	return rq;
 }
 
+static void get_request_wait_wakeup(request_queue_t *q, int rw)
+{
+	/*
+	 * avoid losing an unplug if a second __get_request_wait did the
+	 * generic_unplug_device while our __get_request_wait was running
+	 * w/o the queue_lock held and w/ our request out of the queue.
+	 */	
+	if (waitqueue_active(&q->wait_for_requests))
+		wake_up(&q->wait_for_requests);
+}
+
 /* RO fail safe mechanism */
 
 static long ro_bits[MAX_BLKDEV][8];
@@ -818,7 +857,6 @@
 void blkdev_release_request(struct request *req)
 {
 	request_queue_t *q = req->q;
-	int rw = req->cmd;
 
 	req->rq_status = RQ_INACTIVE;
 	req->q = NULL;
@@ -828,9 +866,17 @@
 	 * assume it has free buffers and check waiters
 	 */
 	if (q) {
-		list_add(&req->queue, &q->rq[rw].free);
-		if (++q->rq[rw].count >= q->batch_requests)
-			wake_up(&q->wait_for_requests[rw]);
+		int oversized_batch = 0;
+
+		if (q->can_throttle)
+			oversized_batch = blk_oversized_queue_batch(q);
+		q->rq.count++;
+		list_add(&req->queue, &q->rq.free);
+		if (q->rq.count >= q->batch_requests && !oversized_batch) {
+			smp_mb();
+			if (waitqueue_active(&q->wait_for_requests))
+				wake_up(&q->wait_for_requests);
+		}
 	}
 }
 
@@ -908,6 +954,7 @@
 	struct list_head *head, *insert_here;
 	int latency;
 	elevator_t *elevator = &q->elevator;
+	int should_wake = 0;
 
 	count = bh->b_size >> 9;
 	sector = bh->b_rsector;
@@ -948,7 +995,6 @@
 	 */
 	max_sectors = get_max_sectors(bh->b_rdev);
 
-again:
 	req = NULL;
 	head = &q->queue_head;
 	/*
@@ -957,7 +1003,9 @@
 	 */
 	spin_lock_irq(&io_request_lock);
 
+again:
 	insert_here = head->prev;
+
 	if (list_empty(head)) {
 		q->plug_device_fn(q, bh->b_rdev); /* is atomic */
 		goto get_rq;
@@ -976,6 +1024,7 @@
 			req->bhtail = bh;
 			req->nr_sectors = req->hard_nr_sectors += count;
 			blk_started_io(count);
+			blk_started_sectors(req, count);
 			drive_stat_acct(req->rq_dev, req->cmd, count, 0);
 			req_new_io(req, 1, count);
 			attempt_back_merge(q, req, max_sectors, max_segments);
@@ -998,6 +1047,7 @@
 			req->sector = req->hard_sector = sector;
 			req->nr_sectors = req->hard_nr_sectors += count;
 			blk_started_io(count);
+			blk_started_sectors(req, count);
 			drive_stat_acct(req->rq_dev, req->cmd, count, 0);
 			req_new_io(req, 1, count);
 			attempt_front_merge(q, head, req, max_sectors, max_segments);
@@ -1030,7 +1080,7 @@
 		 * See description above __get_request_wait()
 		 */
 		if (rw_ahead) {
-			if (q->rq[rw].count < q->batch_requests) {
+			if (q->rq.count < q->batch_requests || blk_oversized_queue_batch(q)) {
 				spin_unlock_irq(&io_request_lock);
 				goto end_io;
 			}
@@ -1042,6 +1092,9 @@
 			if (req == NULL) {
 				spin_unlock_irq(&io_request_lock);
 				freereq = __get_request_wait(q, rw);
+				head = &q->queue_head;
+				spin_lock_irq(&io_request_lock);
+				should_wake = 1;
 				goto again;
 			}
 		}
@@ -1064,10 +1117,13 @@
 	req->start_time = jiffies;
 	req_new_io(req, 0, count);
 	blk_started_io(count);
+	blk_started_sectors(req, count);
 	add_request(q, req, insert_here);
 out:
 	if (freereq)
 		blkdev_release_request(freereq);
+	if (should_wake)
+		get_request_wait_wakeup(q, rw);
 	spin_unlock_irq(&io_request_lock);
 	return 0;
 end_io:
@@ -1196,8 +1252,15 @@
 	bh->b_rdev = bh->b_dev;
 	bh->b_rsector = bh->b_blocknr * count;
 
+	get_bh(bh);
 	generic_make_request(rw, bh);
 
+	/* fix race condition with wait_on_buffer() */
+	smp_mb(); /* spin_unlock may have inclusive semantics */
+	if (waitqueue_active(&bh->b_wait))
+		wake_up(&bh->b_wait);
+
+	put_bh(bh);
 	switch (rw) {
 		case WRITE:
 			kstat.pgpgout += count;
@@ -1350,6 +1413,7 @@
 	if ((bh = req->bh) != NULL) {
 		nsect = bh->b_size >> 9;
 		blk_finished_io(nsect);
+		blk_finished_sectors(req, nsect);
 		req->bh = bh->b_reqnext;
 		bh->b_reqnext = NULL;
 		bh->b_end_io(bh, uptodate);
@@ -1509,6 +1573,7 @@
 EXPORT_SYMBOL(blk_get_queue);
 EXPORT_SYMBOL(blk_cleanup_queue);
 EXPORT_SYMBOL(blk_queue_headactive);
+EXPORT_SYMBOL(blk_queue_throttle_sectors);
 EXPORT_SYMBOL(blk_queue_make_request);
 EXPORT_SYMBOL(generic_make_request);
 EXPORT_SYMBOL(blkdev_release_request);
===== drivers/ide/ide-probe.c 1.17 vs edited =====
--- 1.17/drivers/ide/ide-probe.c	Sun Apr  6 07:29:19 2003
+++ edited/drivers/ide/ide-probe.c	Fri Jul  4 17:12:45 2003
@@ -971,6 +971,7 @@
 
 	q->queuedata = HWGROUP(drive);
 	blk_init_queue(q, do_ide_request);
+	blk_queue_throttle_sectors(q, 1);
 }
 
 #undef __IRQ_HELL_SPIN
===== drivers/scsi/scsi.c 1.17 vs edited =====
--- 1.17/drivers/scsi/scsi.c	Tue Jun 18 06:57:37 2002
+++ edited/drivers/scsi/scsi.c	Fri Jul  4 17:12:45 2003
@@ -197,6 +197,7 @@
 
 	blk_init_queue(q, scsi_request_fn);
 	blk_queue_headactive(q, 0);
+	blk_queue_throttle_sectors(q, 1);
 	q->queuedata = (void *) SDpnt;
 }
 
===== drivers/scsi/scsi_lib.c 1.16 vs edited =====
--- 1.16/drivers/scsi/scsi_lib.c	Fri Feb  7 20:25:46 2003
+++ edited/drivers/scsi/scsi_lib.c	Fri Jul  4 17:12:45 2003
@@ -378,6 +378,7 @@
 		if ((bh = req->bh) != NULL) {
 			nsect = bh->b_size >> 9;
 			blk_finished_io(nsect);
+			blk_finished_sectors(req, nsect);
 			req->bh = bh->b_reqnext;
 			bh->b_reqnext = NULL;
 			sectors -= nsect;
===== fs/buffer.c 1.86 vs edited =====
--- 1.86/fs/buffer.c	Mon Jun 23 02:39:21 2003
+++ edited/fs/buffer.c	Fri Jul  4 17:36:04 2003
@@ -153,10 +153,23 @@
 	get_bh(bh);
 	add_wait_queue(&bh->b_wait, &wait);
 	do {
-		run_task_queue(&tq_disk);
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 		if (!buffer_locked(bh))
 			break;
+		/*
+		 * We must read tq_disk in TQ_ACTIVE after the
+		 * add_wait_queue effect is visible to other cpus.
+		 * We could unplug some line above it wouldn't matter
+		 * but we can't do that right after add_wait_queue
+		 * without an smp_mb() in between because spin_unlock
+		 * has inclusive semantics.
+		 * Doing it here is the most efficient place so we
+		 * don't do a suprious unplug if we get a racy
+		 * wakeup that make buffer_locked to return 0, and
+		 * doing it here avoids an explicit smp_mb() we
+		 * rely on the implicit one in set_task_state.
+		 */
+		run_task_queue(&tq_disk);
 		schedule();
 	} while (buffer_locked(bh));
 	tsk->state = TASK_RUNNING;
@@ -1516,6 +1529,9 @@
 
 	/* Done - end_buffer_io_async will unlock */
 	SetPageUptodate(page);
+
+	wakeup_page_waiters(page);
+
 	return 0;
 
 out:
@@ -1547,6 +1563,7 @@
 	} while (bh != head);
 	if (need_unlock)
 		UnlockPage(page);
+	wakeup_page_waiters(page);
 	return err;
 }
 
@@ -1774,6 +1791,8 @@
 		else
 			submit_bh(READ, bh);
 	}
+
+	wakeup_page_waiters(page);
 	
 	return 0;
 }
@@ -2400,6 +2419,7 @@
 		submit_bh(rw, bh);
 		bh = next;
 	} while (bh != head);
+	wakeup_page_waiters(page);
 	return 0;
 }
 
===== fs/reiserfs/inode.c 1.45 vs edited =====
--- 1.45/fs/reiserfs/inode.c	Thu May 22 16:35:02 2003
+++ edited/fs/reiserfs/inode.c	Fri Jul  4 17:12:45 2003
@@ -2048,6 +2048,7 @@
     */
     if (nr) {
         submit_bh_for_writepage(arr, nr) ;
+	wakeup_page_waiters(page);
     } else {
         UnlockPage(page) ;
     }
===== include/linux/blkdev.h 1.23 vs edited =====
--- 1.23/include/linux/blkdev.h	Fri Nov 29 17:03:01 2002
+++ edited/include/linux/blkdev.h	Fri Jul  4 17:18:12 2003
@@ -64,12 +64,6 @@
 typedef void (plug_device_fn) (request_queue_t *q, kdev_t device);
 typedef void (unplug_device_fn) (void *q);
 
-/*
- * Default nr free requests per queue, ll_rw_blk will scale it down
- * according to available RAM at init time
- */
-#define QUEUE_NR_REQUESTS	8192
-
 struct request_list {
 	unsigned int count;
 	struct list_head free;
@@ -80,7 +74,7 @@
 	/*
 	 * the queue request freelist, one for reads and one for writes
 	 */
-	struct request_list	rq[2];
+	struct request_list	rq;
 
 	/*
 	 * The total number of requests on each queue
@@ -93,6 +87,21 @@
 	int batch_requests;
 
 	/*
+	 * The total number of 512byte blocks on each queue
+	 */
+	atomic_t nr_sectors;
+
+	/*
+	 * Batching threshold for sleep/wakeup decisions
+	 */
+	int batch_sectors;
+
+	/*
+	 * The max number of 512byte blocks on each queue
+	 */
+	int max_queue_sectors;
+
+	/*
 	 * Together with queue_head for cacheline sharing
 	 */
 	struct list_head	queue_head;
@@ -118,13 +127,21 @@
 	/*
 	 * Boolean that indicates whether this queue is plugged or not.
 	 */
-	char			plugged;
+	int			plugged:1;
 
 	/*
 	 * Boolean that indicates whether current_request is active or
 	 * not.
 	 */
-	char			head_active;
+	int			head_active:1;
+
+	/*
+	 * Boolean that indicates you will use blk_started_sectors
+	 * and blk_finished_sectors in addition to blk_started_io
+	 * and blk_finished_io.  It enables the throttling code to 
+	 * help keep the sectors in flight to a reasonable value
+	 */
+	int			can_throttle:1;
 
 	unsigned long		bounce_pfn;
 
@@ -137,7 +154,7 @@
 	/*
 	 * Tasks wait here for free read and write requests
 	 */
-	wait_queue_head_t	wait_for_requests[2];
+	wait_queue_head_t	wait_for_requests;
 };
 
 #define blk_queue_plugged(q)	(q)->plugged
@@ -221,10 +238,11 @@
 /*
  * Access functions for manipulating queue properties
  */
-extern int blk_grow_request_list(request_queue_t *q, int nr_requests);
+extern int blk_grow_request_list(request_queue_t *q, int nr_requests, int max_queue_sectors);
 extern void blk_init_queue(request_queue_t *, request_fn_proc *);
 extern void blk_cleanup_queue(request_queue_t *);
 extern void blk_queue_headactive(request_queue_t *, int);
+extern void blk_queue_throttle_sectors(request_queue_t *, int);
 extern void blk_queue_make_request(request_queue_t *, make_request_fn *);
 extern void generic_unplug_device(void *);
 extern inline int blk_seg_merge_ok(struct buffer_head *, struct buffer_head *);
@@ -243,6 +261,8 @@
 
 #define MAX_SEGMENTS 128
 #define MAX_SECTORS 255
+#define MAX_QUEUE_SECTORS (4 << (20 - 9)) /* 4 mbytes when full sized */
+#define MAX_NR_REQUESTS 1024 /* 1024k when in 512 units, normally min is 1M in 1k units */
 
 #define PageAlignSize(size) (((size) + PAGE_SIZE -1) & PAGE_MASK)
 
@@ -268,8 +288,50 @@
 	return retval;
 }
 
+static inline int blk_oversized_queue(request_queue_t * q)
+{
+	if (q->can_throttle)
+		return atomic_read(&q->nr_sectors) > q->max_queue_sectors;
+	return q->rq.count == 0;
+}
+
+static inline int blk_oversized_queue_batch(request_queue_t * q)
+{
+	return atomic_read(&q->nr_sectors) > q->max_queue_sectors - q->batch_sectors;
+}
+
 #define blk_finished_io(nsects)	do { } while (0)
 #define blk_started_io(nsects)	do { } while (0)
+
+static inline void blk_started_sectors(struct request *rq, int count)
+{
+	request_queue_t *q = rq->q;
+	if (q && q->can_throttle) {
+		atomic_add(count, &q->nr_sectors);
+		if (atomic_read(&q->nr_sectors) < 0) {
+			printk("nr_sectors is %d\n", atomic_read(&q->nr_sectors));
+			BUG();
+		}
+	}
+}
+
+static inline void blk_finished_sectors(struct request *rq, int count)
+{
+	request_queue_t *q = rq->q;
+	if (q && q->can_throttle) {
+		atomic_sub(count, &q->nr_sectors);
+		
+		smp_mb();
+		if (q->rq.count >= q->batch_requests && !blk_oversized_queue_batch(q)) {
+			if (waitqueue_active(&q->wait_for_requests))
+				wake_up(&q->wait_for_requests);
+		}
+		if (atomic_read(&q->nr_sectors) < 0) {
+			printk("nr_sectors is %d\n", atomic_read(&q->nr_sectors));
+			BUG();
+		}
+	}
+}
 
 static inline unsigned int blksize_bits(unsigned int size)
 {
===== include/linux/elevator.h 1.5 vs edited =====
--- 1.5/include/linux/elevator.h	Thu Aug 15 01:57:19 2002
+++ edited/include/linux/elevator.h	Fri Jul  4 17:12:45 2003
@@ -80,7 +80,7 @@
 	return latency;
 }
 
-#define ELV_LINUS_SEEK_COST	16
+#define ELV_LINUS_SEEK_COST	1
 
 #define ELEVATOR_NOOP							\
 ((elevator_t) {								\
@@ -93,8 +93,8 @@
 
 #define ELEVATOR_LINUS							\
 ((elevator_t) {								\
-	2048,				/* read passovers */		\
-	8192,				/* write passovers */		\
+	128,				/* read passovers */		\
+	512,				/* write passovers */		\
 									\
 	elevator_linus_merge,		/* elevator_merge_fn */		\
 	elevator_linus_merge_req,	/* elevator_merge_req_fn */	\
===== include/linux/pagemap.h 1.19 vs edited =====
--- 1.19/include/linux/pagemap.h	Sun Aug 25 15:32:11 2002
+++ edited/include/linux/pagemap.h	Fri Jul  4 17:18:02 2003
@@ -97,6 +97,8 @@
 		___wait_on_page(page);
 }
 
+extern void FASTCALL(wakeup_page_waiters(struct page * page));
+
 /*
  * Returns locked page at given index in given cache, creating it if needed.
  */
===== kernel/ksyms.c 1.71 vs edited =====
--- 1.71/kernel/ksyms.c	Tue May 27 06:44:48 2003
+++ edited/kernel/ksyms.c	Fri Jul  4 17:12:45 2003
@@ -296,6 +296,7 @@
 EXPORT_SYMBOL(filemap_fdatawait);
 EXPORT_SYMBOL(lock_page);
 EXPORT_SYMBOL(unlock_page);
+EXPORT_SYMBOL(wakeup_page_waiters);
 
 /* device registration */
 EXPORT_SYMBOL(register_chrdev);
===== mm/filemap.c 1.79 vs edited =====
--- 1.79/mm/filemap.c	Tue Jun  3 07:23:57 2003
+++ edited/mm/filemap.c	Fri Jul  4 17:12:45 2003
@@ -810,6 +810,20 @@
 	return &wait[hash];
 }
 
+/*
+ * This must be called after every submit_bh with end_io
+ * callbacks that would result into the blkdev layer waking
+ * up the page after a queue unplug.
+ */
+void wakeup_page_waiters(struct page * page)
+{
+	wait_queue_head_t * head;
+
+	head = page_waitqueue(page);
+	if (waitqueue_active(head))
+		wake_up(head);
+}
+
 /* 
  * Wait for a page to get unlocked.
  *

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Status of the IO scheduler fixes for 2.4
  2003-07-04 20:01     ` Marcelo Tosatti
  2003-07-04 21:37       ` Chris Mason
@ 2003-07-05  0:00       ` Andrea Arcangeli
  2003-07-05 15:59         ` Marcelo Tosatti
  1 sibling, 1 reply; 21+ messages in thread
From: Andrea Arcangeli @ 2003-07-05  0:00 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Chris Mason, lkml, Alan Cox, Nick Piggin

On Fri, Jul 04, 2003 at 05:01:54PM -0300, Marcelo Tosatti wrote:
> release today), then fix pausing in -pre4. If the IO fairness still doesnt

fix pausing is a showstopper bugfix, the box will hang for days without
it.

lowlatency elevator is for the desktop complaints we get about
interactivity compared to 2.5, so it's much lower prio than fix pausing.
I would never merge fix pausing after lowlatency elevator. But that's
just me.

Andrea

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Status of the IO scheduler fixes for 2.4
  2003-07-04 21:37       ` Chris Mason
@ 2003-07-05  0:05         ` Andrea Arcangeli
  2003-07-05  0:47           ` Chris Mason
  2003-07-06  7:58         ` Marc-Christian Petersen
  1 sibling, 1 reply; 21+ messages in thread
From: Andrea Arcangeli @ 2003-07-05  0:05 UTC (permalink / raw)
  To: Chris Mason; +Cc: Marcelo Tosatti, lkml, Alan Cox, Nick Piggin

On Fri, Jul 04, 2003 at 05:37:35PM -0400, Chris Mason wrote:
> I've also attached a patch I've been working on to solve the latencies a
> different way.  bdflush-progress.diff changes balance_dirty to wait on
> bdflush instead of trying to start io.  It helps a lot here (both
> throughput and latency) but hasn't yet been tested on a box with tons of
> drives.

that's orthogonal, it changes the write throttling, it doesn't touch the
blkdev layer like the other patches do. So if it helps it solves a
different kind of latency.

However the implementation in theory can run the box oom, since it won't
limit the dirty buffers anymore. To be safe you need to wait 2
generations. I doubt in practice it would be easily reproducible though ;).
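
A sketch of what waiting for two generations could look like, reusing the
bdflush_generation counter, bdflush_progress_wait queue and wakeup_bdflush()
from the attached bdflush-progress patch (the helper name is made up, and
this is untested illustration only, not a proposed diff):

static void wait_for_bdflush_progress(void)
{
	int gen = atomic_read(&bdflush_generation);
	DECLARE_WAITQUEUE(wait, current);

	add_wait_queue_exclusive(&bdflush_progress_wait, &wait);
	do {
		set_current_state(TASK_UNINTERRUPTIBLE);
		wakeup_bdflush();
		/*
		 * one tick may only reflect a batch bdflush wrote before
		 * we sampled gen; two ticks guarantee at least one full
		 * batch was written after we started waiting.
		 */
		if (atomic_read(&bdflush_generation) - gen >= 2)
			break;
		schedule();
	} while (atomic_read(&bdflush_generation) - gen < 2);
	remove_wait_queue(&bdflush_progress_wait, &wait);
	set_current_state(TASK_RUNNING);
}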

Andrea

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Status of the IO scheduler fixes for 2.4
  2003-07-05  0:05         ` Andrea Arcangeli
@ 2003-07-05  0:47           ` Chris Mason
  2003-07-05  1:04             ` Andrea Arcangeli
  0 siblings, 1 reply; 21+ messages in thread
From: Chris Mason @ 2003-07-05  0:47 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Marcelo Tosatti, lkml, Alan Cox, Nick Piggin

On Fri, 2003-07-04 at 20:05, Andrea Arcangeli wrote:
> On Fri, Jul 04, 2003 at 05:37:35PM -0400, Chris Mason wrote:
> > I've also attached a patch I've been working on to solve the latencies a
> > different way.  bdflush-progress.diff changes balance_dirty to wait on
> > bdflush instead of trying to start io.  It helps a lot here (both
> > throughput and latency) but hasn't yet been tested on a box with tons of
> > drives.
> 
> that's orthogonal, it changes the write throttling, it doesn't touch the
> blkdev layer like the other patches do. So if it helps it solves a
> different kind of latency.

It's also almost useless without elevator-low-latency applied ;-)  One
major source of our latencies is a bunch of procs hammering on
__get_request_wait, so bdflush-progress helps by reducing the number of
procs doing io.  It does push some of the latency problem up higher into
balance_dirty, but I believe it is easier to manage there.

bdflush-progress doesn't help at all for non-async workloads.

> However the implementation in theory can run the box oom, since it won't
> limit the dirty buffers anymore. To be safe you need to wait 2
> generations. I doubt in practice it would be easily reproducible though ;).

Hmmm, I think the critical part here is to write some buffers after
marking a buffer dirty.  We don't check the generation number until
after marking the buffer dirty, so if the generation has incremented at
all we've made progress.  What am I missing?

-chris



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Status of the IO scheduler fixes for 2.4
  2003-07-05  0:47           ` Chris Mason
@ 2003-07-05  1:04             ` Andrea Arcangeli
  0 siblings, 0 replies; 21+ messages in thread
From: Andrea Arcangeli @ 2003-07-05  1:04 UTC (permalink / raw)
  To: Chris Mason; +Cc: Marcelo Tosatti, lkml, Alan Cox, Nick Piggin

On Fri, Jul 04, 2003 at 08:47:00PM -0400, Chris Mason wrote:
> On Fri, 2003-07-04 at 20:05, Andrea Arcangeli wrote:
> > On Fri, Jul 04, 2003 at 05:37:35PM -0400, Chris Mason wrote:
> > > I've also attached a patch I've been working on to solve the latencies a
> > > different way.  bdflush-progress.diff changes balance_dirty to wait on
> > > bdflush instead of trying to start io.  It helps a lot here (both
> > > throughput and latency) but hasn't yet been tested on a box with tons of
> > > drives.
> > 
> > that's orthogonal, it changes the write throttling, it doesn't touch the
> > blkdev layer like the other patches do. So if it helps it solves a
> > different kind of latency.
> 
> It's also almost useless without elevator-low-latency applied ;-)  One
> major source of our latencies is a bunch of procs hammering on
> __get_request_wait, so bdflush-progress helps by reducing the number of
> procs doing io.  It does push some of the latency problem up higher into
> balance_dirty, but I believe it is easier to manage there.
> 
> bdflush-progress doesn't help at all for non-async workloads.

agreed.

> > However the implementation in theory can run the box oom, since it won't
> > limit the dirty buffers anymore. To be safe you need to wait 2
> > generations. I doubt in practice it would be easily reproducible though ;).
> 
> Hmmm, I think the critical part here is to write some buffers after
> marking a buffer dirty.  We don't check the generation number until

btw, not after a buffer, but after as many as "buffers per page" buffers
(in fact, for the code to be correct on systems with a quite big page size,
the 32 should be replaced with max(bh_per_page, 32))

> after marking the buffer dirty, so if the generation has incremented at
> all we've made progress.  What am I missing?

	write 32 buffers
				read generation
	inc generation
				generation changed -> OOM

Am I missing any smp serialization between the two cpus?

Andrea

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Status of the IO scheduler fixes for 2.4
  2003-07-05  0:00       ` Andrea Arcangeli
@ 2003-07-05 15:59         ` Marcelo Tosatti
  2003-07-05 16:36           ` Andrea Arcangeli
  0 siblings, 1 reply; 21+ messages in thread
From: Marcelo Tosatti @ 2003-07-05 15:59 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Chris Mason, lkml, Alan Cox, Nick Piggin


On Sat, 5 Jul 2003, Andrea Arcangeli wrote:

> On Fri, Jul 04, 2003 at 05:01:54PM -0300, Marcelo Tosatti wrote:
> > release today), then fix pausing in -pre4. If the IO fairness still doesnt
>
> fix pausing is a showstopper bugfix, the box will hang for days without
> it.
>
> lowlatency elevator is for the desktop complaints we get about
> interactivity compared to 2.5, so it's much lower prio than fix pausing.
> I would never merge fix pausing after lowlatency elevator. But that's
> just me.

You're right. I'll merge both patches in -pre3.

Danke

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Status of the IO scheduler fixes for 2.4
  2003-07-05 15:59         ` Marcelo Tosatti
@ 2003-07-05 16:36           ` Andrea Arcangeli
  0 siblings, 0 replies; 21+ messages in thread
From: Andrea Arcangeli @ 2003-07-05 16:36 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Chris Mason, lkml, Alan Cox, Nick Piggin

On Sat, Jul 05, 2003 at 12:59:19PM -0300, Marcelo Tosatti wrote:
> 
> On Sat, 5 Jul 2003, Andrea Arcangeli wrote:
> 
> > On Fri, Jul 04, 2003 at 05:01:54PM -0300, Marcelo Tosatti wrote:
> > > release today), then fix pausing in -pre4. If the IO fairness still doesnt
> >
> > fix pausing is a showstopper bugfix, the box will hang for days without
> > it.
> >
> > lowlatency elevator is for the desktop complaints we get about
> > interactivity compared to 2.5, so it's much lower prio than fix pausing.
> > I would never merge fix pausing after lowlatency elevator. But that's
> > just me.
> 
> You're right. I'll merge both patches in -pre3.

agreed, thanks.

> Danke

prego ;)

Andrea

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Status of the IO scheduler fixes for 2.4
  2003-07-04 21:37       ` Chris Mason
  2003-07-05  0:05         ` Andrea Arcangeli
@ 2003-07-06  7:58         ` Marc-Christian Petersen
  2003-07-06 18:51           ` Chris Mason
  1 sibling, 1 reply; 21+ messages in thread
From: Marc-Christian Petersen @ 2003-07-06  7:58 UTC (permalink / raw)
  To: Chris Mason; +Cc: lkml, Andrea Arcangeli, Nick Piggin

On Friday 04 July 2003 23:37, Chris Mason wrote:

Hi Chris,

> > If the IO fairness still doesn't
> > get somewhat better for general use (we'll get isolated user reports and
> > benchmarks for both patches), then I might consider the q->full
> > patch (it has throughput drawbacks and I prefer avoiding a tunable
> > there).
now there is io-stalls-10 in .22-pre3 (lowlat elev. + fixpausing). Could you 
please send the "q->full patch" on top of -pre3? :-)

Thank you.

ciao, Marc


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Status of the IO scheduler fixes for 2.4
  2003-07-06  7:58         ` Marc-Christian Petersen
@ 2003-07-06 18:51           ` Chris Mason
  2003-07-07  4:06             ` Nick Piggin
  0 siblings, 1 reply; 21+ messages in thread
From: Chris Mason @ 2003-07-06 18:51 UTC (permalink / raw)
  To: Marc-Christian Petersen; +Cc: lkml, Andrea Arcangeli, Nick Piggin

[-- Attachment #1: Type: text/plain, Size: 831 bytes --]

On Sun, 2003-07-06 at 03:58, Marc-Christian Petersen wrote:
> On Friday 04 July 2003 23:37, Chris Mason wrote:
> 
> Hi Chris,
> 
> > > If the IO fairness still doesn't
> > > get somewhat better for general use (we'll get isolated user reports and
> > > benchmarks for both patches), then I might consider the q->full
> > > patch (it has throughput drawbacks and I prefer avoiding a tunable
> > > there).
> now there is io-stalls-10 in .22-pre3 (lowlat elev. + fixpausing). Could you 
> please send the "q->full patch" on top of -pre3? :-)

Attached, this defaults to q->full off and keeps the elvtune changes. 
So to turn on the q->full low latency fixes, you need to:

elvtune -b 1 /dev/xxxx

Note that for lvm and md, you need to elvtune each underlying device. 
Running it on an lvm/md device doesn't do anything.
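
For reference, elvtune -b boils down to a BLKELVGET/BLKELVSET round trip.
Below is a minimal userspace sketch of turning the low-latency bit on, with
the struct layout, ioctl numbers and MAX_BOMB_LATENCY_MASK copied from the
patched include/linux/elevator.h in the attachment; the helper name, device
handling and error handling are just an example, untested:

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/* mirrors the patched include/linux/elevator.h */
typedef struct {
	int queue_ID;
	int read_latency;
	int write_latency;
	int max_bomb_segments;
} blkelv_ioctl_arg_t;

#define BLKELVGET   _IOR(0x12,106,sizeof(blkelv_ioctl_arg_t))
#define BLKELVSET   _IOW(0x12,107,sizeof(blkelv_ioctl_arg_t))
#define MAX_BOMB_LATENCY_MASK 1	/* odd value => q->low_latency on */

int set_low_latency(const char *dev)
{
	blkelv_ioctl_arg_t arg;
	int fd = open(dev, O_RDONLY);

	if (fd < 0)
		return -1;
	if (ioctl(fd, BLKELVGET, &arg) < 0) {
		close(fd);
		return -1;
	}
	/* keep the current max_queue_sectors, just set the low bit */
	arg.max_bomb_segments |= MAX_BOMB_LATENCY_MASK;
	if (ioctl(fd, BLKELVSET, &arg) < 0) {
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}

Per the note above, for lvm/md you would call this once for each underlying
component device rather than on the lvm/md device itself.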

-chris


[-- Attachment #2: q_full.diff --]
[-- Type: text/plain, Size: 7259 bytes --]

--- 1.9/drivers/block/blkpg.c	Sat Mar 30 06:58:05 2002
+++ edited/drivers/block/blkpg.c	Sun Jul  6 14:34:31 2003
@@ -261,10 +261,10 @@
 			return blkpg_ioctl(dev, (struct blkpg_ioctl_arg *) arg);
 			
 		case BLKELVGET:
-			return blkelvget_ioctl(&blk_get_queue(dev)->elevator,
+			return blkelvget_ioctl(blk_get_queue(dev),
 					       (blkelv_ioctl_arg_t *) arg);
 		case BLKELVSET:
-			return blkelvset_ioctl(&blk_get_queue(dev)->elevator,
+			return blkelvset_ioctl(blk_get_queue(dev),
 					       (blkelv_ioctl_arg_t *) arg);
 
 		case BLKBSZGET:
--- 1.8/drivers/block/elevator.c	Tue Oct 22 00:29:25 2002
+++ edited/drivers/block/elevator.c	Sun Jul  6 14:34:31 2003
@@ -180,23 +180,28 @@
 
 void elevator_noop_merge_req(struct request *req, struct request *next) {}
 
-int blkelvget_ioctl(elevator_t * elevator, blkelv_ioctl_arg_t * arg)
+int blkelvget_ioctl(request_queue_t *q, blkelv_ioctl_arg_t * arg)
 {
+	elevator_t *elevator = &q->elevator;
 	blkelv_ioctl_arg_t output;
 
 	output.queue_ID			= elevator->queue_ID;
 	output.read_latency		= elevator->read_latency;
 	output.write_latency		= elevator->write_latency;
-	output.max_bomb_segments	= 0;
-
+	output.max_bomb_segments	= q->max_queue_sectors;
+	if (q->low_latency)
+		output.max_bomb_segments |= MAX_BOMB_LATENCY_MASK;	
+	else
+		output.max_bomb_segments &= ~MAX_BOMB_LATENCY_MASK;	
 	if (copy_to_user(arg, &output, sizeof(blkelv_ioctl_arg_t)))
 		return -EFAULT;
 
 	return 0;
 }
 
-int blkelvset_ioctl(elevator_t * elevator, const blkelv_ioctl_arg_t * arg)
+int blkelvset_ioctl(request_queue_t *q, const blkelv_ioctl_arg_t * arg)
 {
+	elevator_t *elevator = &q->elevator;
 	blkelv_ioctl_arg_t input;
 
 	if (copy_from_user(&input, arg, sizeof(blkelv_ioctl_arg_t)))
@@ -206,9 +211,19 @@
 		return -EINVAL;
 	if (input.write_latency < 0)
 		return -EINVAL;
+	if (input.max_bomb_segments < 0)
+		return -EINVAL;
 
 	elevator->read_latency		= input.read_latency;
 	elevator->write_latency		= input.write_latency;
+	q->low_latency = input.max_bomb_segments & MAX_BOMB_LATENCY_MASK ? 1:0;
+	printk("queue %d: low latency mode is now %s\n", elevator->queue_ID, 
+		q->low_latency ? "on" : "off");
+	input.max_bomb_segments &= ~MAX_BOMB_LATENCY_MASK;
+	if (input.max_bomb_segments)
+		q->max_queue_sectors = input.max_bomb_segments;
+	printk("queue %d: max queue sectors is now %d\n", elevator->queue_ID, 
+		q->max_queue_sectors);
 	return 0;
 }
 
--- 1.46/drivers/block/ll_rw_blk.c	Fri Jul  4 13:35:08 2003
+++ edited/drivers/block/ll_rw_blk.c	Sun Jul  6 14:34:30 2003
@@ -379,8 +379,22 @@
 {
 	if (q->plugged) {
 		q->plugged = 0;
-		if (!list_empty(&q->queue_head))
+		if (!list_empty(&q->queue_head)) {
+
+			/* we don't want merges later on to come in 
+			 * and significantly increase the amount of
+			 * work during an unplug, it can lead to high
+			 * latencies while some poor waiter tries to
+			 * run an ever increasing chunk of io.
+			 * This does lower throughput some though.
+			 */
+			if (q->low_latency) {
+				struct request *rq;
+				rq = blkdev_entry_prev_request(&q->queue_head),
+				rq->elevator_sequence = 0;
+			}
 			q->request_fn(q);
+		}
 	}
 }
 
@@ -527,7 +541,9 @@
 	q->plug_tq.routine	= &generic_unplug_device;
 	q->plug_tq.data		= q;
 	q->plugged        	= 0;
+	q->full			= 0;
 	q->can_throttle		= 0;
+	q->low_latency		= 0;
 
 	/*
 	 * These booleans describe the queue properties.  We set the
@@ -546,7 +562,7 @@
  * Get a free request. io_request_lock must be held and interrupts
  * disabled on the way in.  Returns NULL if there are no free requests.
  */
-static struct request *get_request(request_queue_t *q, int rw)
+static struct request *__get_request(request_queue_t *q, int rw)
 {
 	struct request *rq = NULL;
 	struct request_list *rl;
@@ -560,11 +576,34 @@
 		rq->cmd = rw;
 		rq->special = NULL;
 		rq->q = q;
-	}
+	} else if (q->low_latency)
+		q->full = 1;
 	return rq;
 }
 
 /*
+ * get a free request, honoring the queue_full condition
+ */
+static inline struct request *get_request(request_queue_t *q, int rw)
+{
+	if (q->full)
+		return NULL;
+	return __get_request(q, rw);
+}
+
+/* 
+ * helper func to do memory barriers and wakeups when we finally decide
+ * to clear the queue full condition
+ */
+static inline void clear_full_and_wake(request_queue_t *q)
+{
+	q->full = 0;
+	mb();
+	if (waitqueue_active(&q->wait_for_requests))
+		wake_up(&q->wait_for_requests);
+}
+
+/*
  * Here's the request allocation design, low latency version:
  *
  * 1: Blocking on request exhaustion is a key part of I/O throttling.
@@ -612,24 +651,30 @@
 {
 	register struct request *rq;
 	DECLARE_WAITQUEUE(wait, current);
+	int oversized;
 
 	add_wait_queue_exclusive(&q->wait_for_requests, &wait);
 
 	do {
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		spin_lock_irq(&io_request_lock);
-		if (blk_oversized_queue(q)) {
-			__generic_unplug_device(q);
+		oversized = blk_oversized_queue(q);
+		if (q->full || oversized) {
+			if (oversized)
+				__generic_unplug_device(q);
 			spin_unlock_irq(&io_request_lock);
 			schedule();
 			spin_lock_irq(&io_request_lock);
 		}
-		rq = get_request(q, rw);
+		rq = __get_request(q, rw);
 		spin_unlock_irq(&io_request_lock);
 	} while (rq == NULL);
 	remove_wait_queue(&q->wait_for_requests, &wait);
 	current->state = TASK_RUNNING;
 
+	if (!waitqueue_active(&q->wait_for_requests))
+		clear_full_and_wake(q);
+
 	return rq;
 }
 
@@ -876,6 +921,8 @@
 			smp_mb();
 			if (waitqueue_active(&q->wait_for_requests))
 				wake_up(&q->wait_for_requests);
+			else
+				clear_full_and_wake(q);
 		}
 	}
 }
--- 1.24/include/linux/blkdev.h	Fri Jul  4 13:18:12 2003
+++ edited/include/linux/blkdev.h	Sun Jul  6 14:34:38 2003
@@ -136,12 +136,25 @@
 	int			head_active:1;
 
 	/*
+	 * Booleans that indicate whether the queue's free requests have
+	 * been exhausted and is waiting to drop below the batch_requests
+	 * threshold
+	 */
+	int			full:1;
+	
+	/*
 	 * Boolean that indicates you will use blk_started_sectors
 	 * and blk_finished_sectors in addition to blk_started_io
 	 * and blk_finished_io.  It enables the throttling code to 
 	 * help keep the sectors in flight to a reasonable value
 	 */
 	int			can_throttle:1;
+
+	/*
+	 * Boolean that indicates the queue should prefer low
+	 * latency over throughput.  This enables the q->full checks
+	 */
+	int			low_latency:1;
 
 	unsigned long		bounce_pfn;
 
--- 1.6/include/linux/elevator.h	Fri Jul  4 13:12:45 2003
+++ edited/include/linux/elevator.h	Sun Jul  6 14:34:31 2003
@@ -35,14 +35,19 @@
 	int queue_ID;
 	int read_latency;
 	int write_latency;
+/*
+ * (max_bomb_segments & MAX_BOMB_LATENCY_MASK) == 1 indicates low latency
+ * mode.  We're using any odd number to indicate low latency is on.
+ */
+#define MAX_BOMB_LATENCY_MASK 1
 	int max_bomb_segments;
 } blkelv_ioctl_arg_t;
 
 #define BLKELVGET   _IOR(0x12,106,sizeof(blkelv_ioctl_arg_t))
 #define BLKELVSET   _IOW(0x12,107,sizeof(blkelv_ioctl_arg_t))
 
-extern int blkelvget_ioctl(elevator_t *, blkelv_ioctl_arg_t *);
-extern int blkelvset_ioctl(elevator_t *, const blkelv_ioctl_arg_t *);
+extern int blkelvget_ioctl(request_queue_t *, blkelv_ioctl_arg_t *);
+extern int blkelvset_ioctl(request_queue_t *, const blkelv_ioctl_arg_t *);
 
 extern void elevator_init(elevator_t *, elevator_t);
 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Status of the IO scheduler fixes for 2.4
  2003-07-06 18:51           ` Chris Mason
@ 2003-07-07  4:06             ` Nick Piggin
  0 siblings, 0 replies; 21+ messages in thread
From: Nick Piggin @ 2003-07-07  4:06 UTC (permalink / raw)
  To: Chris Mason; +Cc: Marc-Christian Petersen, lkml, Andrea Arcangeli



Chris Mason wrote:

>On Sun, 2003-07-06 at 03:58, Marc-Christian Petersen wrote:
>
>>On Friday 04 July 2003 23:37, Chris Mason wrote:
>>
>>Hi Chris,
>>
>>
>>>>If the IO fairness still doesn't
>>>>get somewhat better for general use (we'll get isolated user reports and
>>>>benchmarks for both patches), then I might consider the q->full
>>>>patch (it has throughput drawbacks and I prefer avoiding a tunable
>>>>there).
>>>>
>>now there is io-stalls-10 in .22-pre3 (lowlat elev. + fixpausing). Could you 
>>please send the "q->full patch" on top of -pre3? :-)
>>
>
>Attached, this defaults to q->full off and keeps the elvtune changes. 
>So to turn on the q->full low latency fixes, you need to:
>
>

It's a shame to have it default off, seeing as it fixes a real starvation
/ unfairness problem, and nobody is going to turn it on. It would be
nice to turn it on by default and see how many people shout about the
throughput drop, but I guess you can't do that in a stable series :P

I guess it will be useful to be able to ask people to try it if they are
reporting bad behaviour.



^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread

Thread overview: 21+ messages
2003-07-02 22:26 Status of the IO scheduler fixes for 2.4 Marcelo Tosatti
2003-07-02 22:28 ` Marcelo Tosatti
2003-07-03  2:02   ` Chris Mason
2003-07-03 10:58     ` Stephan von Krawczynski
2003-07-03 12:01       ` Chris Mason
2003-07-03 12:07       ` Marc-Christian Petersen
2003-07-03 12:28         ` Stephan von Krawczynski
2003-07-03 12:31     ` Marc-Christian Petersen
2003-07-03 13:11       ` Chris Mason
2003-07-04 20:01     ` Marcelo Tosatti
2003-07-04 21:37       ` Chris Mason
2003-07-05  0:05         ` Andrea Arcangeli
2003-07-05  0:47           ` Chris Mason
2003-07-05  1:04             ` Andrea Arcangeli
2003-07-06  7:58         ` Marc-Christian Petersen
2003-07-06 18:51           ` Chris Mason
2003-07-07  4:06             ` Nick Piggin
2003-07-05  0:00       ` Andrea Arcangeli
2003-07-05 15:59         ` Marcelo Tosatti
2003-07-05 16:36           ` Andrea Arcangeli
2003-07-03 18:07   ` Chris Mason
