linux-kernel.vger.kernel.org archive mirror
* 2.[45] fixes for design locking bug in wait_on_page/wait_on_buffer/get_request_wait
@ 2002-11-12  3:57 Andrea Arcangeli
  2002-11-12  8:50 ` Andrew Morton
  0 siblings, 1 reply; 16+ messages in thread
From: Andrea Arcangeli @ 2002-11-12  3:57 UTC (permalink / raw)
  To: Linus Torvalds, Marcelo Tosatti; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 7288 bytes --]

I recently found and fixed a mysterious hang that could leave the kernel
with tasks in D state while the disk was idle.  We could reproduce very
long hangs (several hours) with tasks in D state with reiserfs after
some hours of intensive load (not cerberus, see below why), but it
wasn't a reiserfs-specific problem; reiserfs just happens to take the
lock_super while doing the fsync_buffers_list, and this leads kupdate to
get stuck on the lock_super waiting for a wait_on_buffer to return.
With kupdate stuck, the background run_task_queue() no longer runs every
5 seconds, and in turn, if there's a missing unplug somewhere, the
machine hangs in wait_on_buffer for an indefinite/infinite time (a kind
of deadlock, unless somebody can trigger a readpage or something else
that unplugs the disk queue; usually logging in with ssh fixed the
problem). Significantly increasing the kupdate interval could lead to
the same indefinite hang on ext2 while running fsync.

For some time I didn't even consider the possibility of wait_on_buffer
being the problem; there are over 700 patches applied in the kernel
where we could reproduce it, so for some time I was looking at
everything but the buggy place.  After ruling out various other bits
(scheduler fixes/compiler/fsync corruption fixes etc.) I finally
realized the problem is a longstanding design locking problem in
wait_on_buffer (then I found the same problem in wait_on_page, and
yesterday Chris found a similar problem in get_request_wait too; the
get_request_wait one is not exactly the same issue, but it's quite
similar and it can lead to exactly the same hangs).

Probably nobody noticed this yet because, with ext2/ext3, these hangs
happen on all machines but are resolved after a mean disk idle time of
2.5 seconds and they only happen once in a while; normally people would
see mean delays of 2.5 sec coming from the datacenter and think it's
normal I/O congestion, or the elevator, or something during the fsync on
ext2.  Furthermore, as Chris pointed out, with very intensive load
bdflush would usually be running in the background; this race can
trigger only with mid writepage loads when bdflush/pdflush has no reason
to run.

We also have the lowlatency fixes (they're fixes) inside submit_bh, so
we probably opened a larger window for the race to trigger than
mainline.  Chris also double-checked that the bug we were facing really
was this race by introducing a delay in submit_bh to make it
reproducible in a reasonable amount of time.

the race looks like this:

        CPU0                    CPU1 
        -----------------       ------------------------ 
        reiserfs_writepage 
        lock_buffer() 
                                fsync_buffers_list() under lock_super() 
                                wait_on_buffer() 
                                run_task_queue(&tq_disk) -> noop 
                                schedule() <- hang with lock_super acquired 
        submit_bh() 
        /* don't unplug here */ 
 
This example is reiserfs-specific, but any wait_on_buffer can definitely
hang indefinitely against any concurrent ll_rw_block or submit_bh (even
on UP, since submit_bh is a blocking operation, in particular with the
lowlat fixes). There's no big kernel lock anymore serializing
wait_on_buffer/ll_rw_block.  This design locking problem was introduced
with the removal of the BKL from wait_on_buffer/wait_on_page/ll_rw_blk
during one of the 2.3 scalability efforts, so any 2.4 kernel out there
is affected by this race.

In short, the problem here is that the wait_on_"something" has no clue
whether the locked "something" has just been inserted into the I/O queue
and is visible to the device, so it has no clue whether run_task_queue
may be a noop or whether it may affect the "something". And the writer
side that executes the submit_bh rightfully won't unplug the queue (to
allow merging and boost performance until somebody actually asks for the
I/O to be completed ASAP).

I fixed the race by simply doing a wakeup of any waiter after any
submit_bh/submit_bio that left stuff pending in the I/O queue. So if the
race triggers, the wait_on_something will now get a wakeup and in turn
will trigger the unplug again, closing the window for the race. This
fixes the problem in practice, and it seems the best fix at least for
2.4; I don't see any potential performance regression, so I don't feel
the need for anything more complicated than this, given the race
triggers only once every several hours and only under some special
workload. You could try to avoid loading the waitqueue head cacheline
during submit_bh, but at least for 2.4 I don't think it's worth the
complexity, and it's an I/O path anyway so it's certainly not critical.
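
Condensed from the attached 2.4 patch (refcounting and error handling
omitted), the two sides of the fix pair up like this:

	/* waiter side: __wait_on_buffer, simplified */
	add_wait_queue(&bh->b_wait, &wait);
	do {
		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
		if (!buffer_locked(bh))
			break;
		run_task_queue(&tq_disk);	/* unplug after queueing ourselves */
		schedule();
	} while (buffer_locked(bh));

	/* submit side: submit_bh, simplified */
	generic_make_request(rw, bh);
	smp_mb();	/* order the queue insert vs. the waitqueue read */
	if (waitqueue_active(&bh->b_wait))
		wake_up(&bh->b_wait);

If the waiter unplugged before the writer queued the buffer, the wakeup
makes it loop: it re-checks buffer_locked(), unplugs again (this time
the request is visible in the queue) and only then sleeps.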

The problem noticed by Chris with get_request_wait is similar: the
unplugging was run before adding the task to the waitqueue, so the
unplug could free the requests and somebody else could allocate the
freed requests without unplugging the queue afterwards. I fixed it
simply by unplugging the queue just before schedule(). That was really a
more genuine bug than the other subtle ones. With get_request_wait the
fix is so simple because we deal with entities that are always
guaranteed to be affected by the queue unplug (they're the I/O
requests); this isn't the case with locked bhs or locked/writeback
pages, and that was in fact the wrong assumption that allowed the other
races to trigger in the first place.

While doing these fixes I noticed various other bits:

1) in general, blk_run_queues()/run_task_queue()/sync_page should
always run just before schedule(); it's pointless to unplug anything if
we don't run schedule (ultramicrooptimization)
2) in 2.4, the run_task_queue() in wait_on_buffer could have its
TQ_ACTIVE test executed inside the add_wait_queue critical section,
since spin_unlock has inclusive semantics (speculative reads can
literally pass the spin_unlock even on x86/x86-64)
3) __blk_put_request/blkdev_release_request was increasing the count
and reading the waitqueue contents without even a barrier() for the asm
layer; it needs an smp_mb() in between to serialize against
get_request_wait, which runs locklessly (see the condensed pairing
below)
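
For point 3, this is the pairing as it looks in the attached 2.4 patch,
condensed: the smp_mb() orders the count increment against the lockless
waitqueue read, matching the implicit barrier of set_current_state() on
the sleeper side.

	/* release side: blkdev_release_request, simplified */
	list_add(&req->queue, &q->rq[rw].free);
	if (++q->rq[rw].count >= q->batch_requests) {
		smp_mb();	/* count store before waitqueue load */
		if (waitqueue_active(&q->wait_for_requests[rw]))
			wake_up(&q->wait_for_requests[rw]);
	}

	/* sleeper side: __get_request_wait, simplified */
	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
	set_current_state(TASK_UNINTERRUPTIBLE);	/* implies a barrier */
	if (q->rq[rw].count == 0) {
		generic_unplug_device(q);
		schedule();
	}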

I did two patches, one for 2.4.20rc1 and one for 2.5.47 (sorry, no bk
tree here; I will try to make bitdropper.py available shortly so I can
access the new info encoded in proprietary form too), that address all
these races.  However, 2.5 should be analysed further: I didn't search
too hard for all the possible places that could have this race in 2.5, I
searched hard in 2.4 and only addressed the same problems in 2.5. The
only bit I think could be problematic in 2.4 is the nfs speculative I/O,
the reason nfs implements a sync_page in the first place. That may have
the same race; in fact I heard of some report with nfs hung in
wait_on_page, and I wonder if this could explain it too.
I assume the fs maintainers will take care of checking their
filesystems for missing wakeups of page waiters in 2.4 and 2.5 now that
the problem is well known.

	http://www.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.20rc1/fix-pausing-1
	http://www.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.5/2.5.47/fix-pausing-1

They're attached to this email too since they're tiny.

Andrea

[-- Attachment #2: fix-pausing-1 --]
[-- Type: text/plain, Size: 5359 bytes --]

diff -urNp 2.4.20rc1/drivers/block/ll_rw_blk.c hangs-2.4/drivers/block/ll_rw_blk.c
--- 2.4.20rc1/drivers/block/ll_rw_blk.c	Sat Nov  2 19:45:33 2002
+++ hangs-2.4/drivers/block/ll_rw_blk.c	Tue Nov 12 02:18:35 2002
@@ -590,12 +590,20 @@ static struct request *__get_request_wai
 	register struct request *rq;
 	DECLARE_WAITQUEUE(wait, current);
 
-	generic_unplug_device(q);
 	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
 	do {
 		set_current_state(TASK_UNINTERRUPTIBLE);
-		if (q->rq[rw].count == 0)
+		if (q->rq[rw].count == 0) {
+			/*
+			 * All we care about is not to stall if any request
+			 * has been released after we set TASK_UNINTERRUPTIBLE.
+			 * This is the most efficient place to unplug the queue
+			 * in case we hit the race and we can get the request
+			 * without waiting.
+			 */
+			generic_unplug_device(q);
 			schedule();
+		}
 		spin_lock_irq(&io_request_lock);
 		rq = get_request(q, rw);
 		spin_unlock_irq(&io_request_lock);
@@ -829,9 +837,11 @@ void blkdev_release_request(struct reque
 	 */
 	if (q) {
 		list_add(&req->queue, &q->rq[rw].free);
-		if (++q->rq[rw].count >= q->batch_requests &&
-				waitqueue_active(&q->wait_for_requests[rw]))
-			wake_up(&q->wait_for_requests[rw]);
+		if (++q->rq[rw].count >= q->batch_requests) {
+			smp_mb();
+			if (waitqueue_active(&q->wait_for_requests[rw]))
+				wake_up(&q->wait_for_requests[rw]);
+		}
 	}
 }
 
@@ -1200,6 +1210,11 @@ void submit_bh(int rw, struct buffer_hea
 
 	generic_make_request(rw, bh);
 
+	/* fix race condition with wait_on_buffer() */
+	smp_mb(); /* spin_unlock may have inclusive semantics */
+	if (waitqueue_active(&bh->b_wait))
+		wake_up(&bh->b_wait);
+
 	switch (rw) {
 		case WRITE:
 			kstat.pgpgout += count;
diff -urNp 2.4.20rc1/fs/buffer.c hangs-2.4/fs/buffer.c
--- 2.4.20rc1/fs/buffer.c	Sat Nov  2 19:45:40 2002
+++ hangs-2.4/fs/buffer.c	Tue Nov 12 02:17:56 2002
@@ -153,10 +153,23 @@ void __wait_on_buffer(struct buffer_head
 	get_bh(bh);
 	add_wait_queue(&bh->b_wait, &wait);
 	do {
-		run_task_queue(&tq_disk);
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 		if (!buffer_locked(bh))
 			break;
+		/*
+		 * We must read tq_disk's TQ_ACTIVE state after the
+		 * add_wait_queue effect is visible to other cpus.
+		 * We could unplug some lines above and it wouldn't
+		 * matter, but we can't do that right after
+		 * add_wait_queue without an smp_mb() in between,
+		 * because spin_unlock has inclusive semantics.
+		 * Doing it here is the most efficient place, so we
+		 * don't do a spurious unplug if we get a racy
+		 * wakeup that makes buffer_locked return 0, and
+		 * doing it here avoids an explicit smp_mb(): we
+		 * rely on the implicit one in set_task_state.
+		 */
+		run_task_queue(&tq_disk);
 		schedule();
 	} while (buffer_locked(bh));
 	tsk->state = TASK_RUNNING;
@@ -1508,6 +1521,9 @@ static int __block_write_full_page(struc
 
 	/* Done - end_buffer_io_async will unlock */
 	SetPageUptodate(page);
+
+	wakeup_page_waiters(page);
+
 	return 0;
 
 out:
@@ -1539,6 +1555,7 @@ out:
 	} while (bh != head);
 	if (need_unlock)
 		UnlockPage(page);
+	wakeup_page_waiters(page);
 	return err;
 }
 
@@ -1755,6 +1772,8 @@ int block_read_full_page(struct page *pa
 		else
 			submit_bh(READ, bh);
 	}
+
+	wakeup_page_waiters(page);
 	
 	return 0;
 }
@@ -2368,6 +2387,7 @@ int brw_page(int rw, struct page *page, 
 		submit_bh(rw, bh);
 		bh = next;
 	} while (bh != head);
+	wakeup_page_waiters(page);
 	return 0;
 }
 
diff -urNp 2.4.20rc1/fs/reiserfs/inode.c hangs-2.4/fs/reiserfs/inode.c
--- 2.4.20rc1/fs/reiserfs/inode.c	Sat Nov  2 19:45:46 2002
+++ hangs-2.4/fs/reiserfs/inode.c	Tue Nov 12 02:17:56 2002
@@ -1993,6 +1993,7 @@ static int reiserfs_write_full_page(stru
     */
     if (nr) {
         submit_bh_for_writepage(arr, nr) ;
+	wakeup_page_waiters(page);
     } else {
         UnlockPage(page) ;
     }
diff -urNp 2.4.20rc1/include/linux/pagemap.h hangs-2.4/include/linux/pagemap.h
--- 2.4.20rc1/include/linux/pagemap.h	Sat Nov  2 19:45:48 2002
+++ hangs-2.4/include/linux/pagemap.h	Tue Nov 12 04:35:52 2002
@@ -97,6 +97,8 @@ static inline void wait_on_page(struct p
 		___wait_on_page(page);
 }
 
+extern void wakeup_page_waiters(struct page * page);
+
 /*
  * Returns locked page at given index in given cache, creating it if needed.
  */
diff -urNp 2.4.20rc1/kernel/ksyms.c hangs-2.4/kernel/ksyms.c
--- 2.4.20rc1/kernel/ksyms.c	Sat Nov  2 19:45:48 2002
+++ hangs-2.4/kernel/ksyms.c	Tue Nov 12 04:36:25 2002
@@ -293,6 +293,7 @@ EXPORT_SYMBOL(filemap_fdatasync);
 EXPORT_SYMBOL(filemap_fdatawait);
 EXPORT_SYMBOL(lock_page);
 EXPORT_SYMBOL(unlock_page);
+EXPORT_SYMBOL(wakeup_page_waiters);
 
 /* device registration */
 EXPORT_SYMBOL(register_chrdev);
diff -urNp 2.4.20rc1/mm/filemap.c hangs-2.4/mm/filemap.c
--- 2.4.20rc1/mm/filemap.c	Sat Nov  2 19:45:48 2002
+++ hangs-2.4/mm/filemap.c	Tue Nov 12 04:35:40 2002
@@ -909,6 +909,20 @@ void lock_page(struct page *page)
 }
 
 /*
+ * This must be called after every submit_bh with end_io
+ * callbacks that would result in the blkdev layer waking
+ * up the page after a queue unplug.
+ */
+void wakeup_page_waiters(struct page * page)
+{
+	wait_queue_head_t * head;
+
+	head = page_waitqueue(page);
+	if (waitqueue_active(head))
+		wake_up(head);
+}
+
+/*
  * a rather lightweight function, finding and getting a reference to a
  * hashed page atomically.
  */

[-- Attachment #3: fix-pausing-1 --]
[-- Type: text/plain, Size: 5331 bytes --]

diff -urNp 2.5.47/drivers/block/ll_rw_blk.c hangs-2.5/drivers/block/ll_rw_blk.c
--- 2.5.47/drivers/block/ll_rw_blk.c	Tue Nov 12 01:59:41 2002
+++ hangs-2.5/drivers/block/ll_rw_blk.c	Tue Nov 12 02:37:42 2002
@@ -1281,12 +1281,13 @@ static struct request *get_request_wait(
 
 	spin_lock_prefetch(q->queue_lock);
 
-	generic_unplug_device(q);
 	do {
 		prepare_to_wait_exclusive(&rl->wait, &wait,
 					TASK_UNINTERRUPTIBLE);
-		if (!rl->count)
+		if (!rl->count) {
+			generic_unplug_device(q);
 			io_schedule();
+		}
 		finish_wait(&rl->wait, &wait);
 		spin_lock_irq(q->queue_lock);
 		rq = get_request(q, rw);
@@ -1487,8 +1488,11 @@ void __blk_put_request(request_queue_t *
 		rl->count++;
 		if (rl->count >= queue_congestion_off_threshold())
 			clear_queue_congested(q, rw);
-		if (rl->count >= batch_requests && waitqueue_active(&rl->wait))
-			wake_up(&rl->wait);
+		if (rl->count >= batch_requests) {
+			smp_mb();
+			if (waitqueue_active(&rl->wait))
+				wake_up(&rl->wait);
+		}
 	}
 }
 
diff -urNp 2.5.47/fs/buffer.c hangs-2.5/fs/buffer.c
--- 2.5.47/fs/buffer.c	Tue Nov 12 01:59:42 2002
+++ hangs-2.5/fs/buffer.c	Tue Nov 12 02:47:46 2002
@@ -135,9 +135,10 @@ void __wait_on_buffer(struct buffer_head
 	get_bh(bh);
 	do {
 		prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-		blk_run_queues();
-		if (buffer_locked(bh))
+		if (buffer_locked(bh)) {
+			blk_run_queues();
 			schedule();
+		}
 	} while (buffer_locked(bh));
 	put_bh(bh);
 	finish_wait(wqh, &wait);
@@ -1727,7 +1728,8 @@ done:
 		if (uptodate)
 			SetPageUptodate(page);
 		end_page_writeback(page);
-	}
+	} else
+		wakeup_page_waiters(page);
 	if (err == 0)
 		return ret;
 	return err;
@@ -2011,6 +2013,7 @@ int block_read_full_page(struct page *pa
 		else
 			submit_bh(READ, bh);
 	}
+	wakeup_page_waiters(page);
 	return 0;
 }
 
@@ -2315,6 +2318,8 @@ static int end_bio_bh_io_sync(struct bio
 int submit_bh(int rw, struct buffer_head * bh)
 {
 	struct bio *bio;
+	int ret;
+	wait_queue_head_t *wqh = bh_waitq_head(bh);
 
 	BUG_ON(!buffer_locked(bh));
 	BUG_ON(!buffer_mapped(bh));
@@ -2348,7 +2353,13 @@ int submit_bh(int rw, struct buffer_head
 	bio->bi_end_io = end_bio_bh_io_sync;
 	bio->bi_private = bh;
 
-	return submit_bio(rw, bio);
+	ret = submit_bio(rw, bio);
+
+	smp_mb(); /* spin_unlock may have inclusive semantics */
+	if (waitqueue_active(wqh))
+		wake_up(wqh);
+
+	return ret;
 }
 
 /**
diff -urNp 2.5.47/fs/reiserfs/inode.c hangs-2.5/fs/reiserfs/inode.c
--- 2.5.47/fs/reiserfs/inode.c	Thu Oct 31 01:42:25 2002
+++ hangs-2.5/fs/reiserfs/inode.c	Tue Nov 12 02:50:47 2002
@@ -1987,6 +1987,7 @@ static int reiserfs_write_full_page(stru
     */
     if (nr) {
         submit_bh_for_writepage(arr, nr) ;
+	wakeup_page_waiters(page);
     } else {
         end_page_writeback(page) ;
     }
diff -urNp 2.5.47/include/linux/pagemap.h hangs-2.5/include/linux/pagemap.h
--- 2.5.47/include/linux/pagemap.h	Tue Nov 12 01:59:43 2002
+++ hangs-2.5/include/linux/pagemap.h	Tue Nov 12 02:45:27 2002
@@ -122,4 +122,7 @@ static inline void wait_on_page_writebac
 }
 
 extern void end_page_writeback(struct page *page);
+
+extern void wakeup_page_waiters(struct page * page);
+
 #endif /* _LINUX_PAGEMAP_H */
diff -urNp 2.5.47/mm/filemap.c hangs-2.5/mm/filemap.c
--- 2.5.47/mm/filemap.c	Tue Nov 12 01:59:43 2002
+++ hangs-2.5/mm/filemap.c	Tue Nov 12 02:44:59 2002
@@ -272,9 +272,10 @@ void wait_on_page_bit(struct page *page,
 
 	do {
 		prepare_to_wait(waitqueue, &wait, TASK_UNINTERRUPTIBLE);
-		sync_page(page);
-		if (test_bit(bit_nr, &page->flags))
+		if (test_bit(bit_nr, &page->flags)) {
+			sync_page(page);
 			io_schedule();
+		}
 	} while (test_bit(bit_nr, &page->flags));
 	finish_wait(waitqueue, &wait);
 }
@@ -336,15 +337,30 @@ void __lock_page(struct page *page)
 
 	while (TestSetPageLocked(page)) {
 		prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-		sync_page(page);
-		if (PageLocked(page))
+		if (PageLocked(page)) {
+			sync_page(page);
 			io_schedule();
+		}
 	}
 	finish_wait(wqh, &wait);
 }
 EXPORT_SYMBOL(__lock_page);
 
 /*
+ * This must be called after every submit_bh with end_io
+ * callbacks that would result in the blkdev layer waking
+ * up the page after a queue unplug.
+ */
+void wakeup_page_waiters(struct page * page)
+{
+	wait_queue_head_t * wqh;
+
+	wqh = page_waitqueue(page);
+	if (waitqueue_active(wqh))
+		wake_up(wqh);
+}
+
+/*
  * a rather lightweight function, finding and getting a reference to a
  * hashed page atomically.
  */
diff -urNp 2.5.47/mm/page_io.c hangs-2.5/mm/page_io.c
--- 2.5.47/mm/page_io.c	Thu Oct 31 01:41:56 2002
+++ hangs-2.5/mm/page_io.c	Tue Nov 12 02:50:12 2002
@@ -104,6 +104,7 @@ int swap_writepage(struct page *page)
 	SetPageWriteback(page);
 	unlock_page(page);
 	submit_bio(WRITE, bio);
+	wakeup_page_waiters(page);
 out:
 	return ret;
 }
@@ -121,6 +122,7 @@ int swap_readpage(struct file *file, str
 	}
 	inc_page_state(pswpin);
 	submit_bio(READ, bio);
+	wakeup_page_waiters(page);
 out:
 	return ret;
 }
--- hangs-2.5/kernel/ksyms.c.~1~	Tue Nov 12 01:59:43 2002
+++ hangs-2.5/kernel/ksyms.c	Tue Nov 12 04:36:37 2002
@@ -336,6 +336,7 @@ EXPORT_SYMBOL(filemap_fdatawrite);
 EXPORT_SYMBOL(filemap_fdatawait);
 EXPORT_SYMBOL(lock_page);
 EXPORT_SYMBOL(unlock_page);
+EXPORT_SYMBOL(wakeup_page_waiters);
 
 /* device registration */
 EXPORT_SYMBOL(register_chrdev);


* Re: 2.[45] fixes for design locking bug in  wait_on_page/wait_on_buffer/get_request_wait
  2002-11-12  3:57 2.[45] fixes for design locking bug in wait_on_page/wait_on_buffer/get_request_wait Andrea Arcangeli
@ 2002-11-12  8:50 ` Andrew Morton
  2002-11-12 18:15   ` Andrea Arcangeli
  0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2002-11-12  8:50 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Marcelo Tosatti, linux-kernel

Andrea Arcangeli wrote:
> 
> the race looks like this:
> 
>         CPU0                    CPU1
>         -----------------       ------------------------
>         reiserfs_writepage
>         lock_buffer()
>                                 fsync_buffers_list() under lock_super()
>                                 wait_on_buffer()
>                                 run_task_queue(&tq_disk) -> noop
>                                 schedule() <- hang with lock_super acquired
>         submit_bh()
>         /* don't unplug here */
> 

Or, more simply:

	lock_buffer()
				while (buffer_locked()) {
					blk_run_queues();	/* Nothing happens */
					if (buffer_locked(bh))
						schedule();
	submit_bh();	/* No unplug */


The fix seems reasonable to me.  It would perhaps be better to just do:

+       if (waitqueue_active(wqh))
+               blk_run_queues();

in submit_bh().  To save the context switch.


Moving the blk_run_queues() inside the TASK_UNINTERRUPTIBLE region
is something which always worried me, because if something down
there sets TASK_RUNNING, we end up in a busy wait.  But that's OK
for 2.5 and may be OK for 2.4's run_task_queue() - I haven't checked...

The multipage stuff in 2.5 does its own blk_run_queues() and looks to be
OK, which I assume is why you didn't touch that.

The little single-page reads like do_generic_mapping_read() look to be
OK because the process which waits is the one which submitted the IO.


wrt the get_request_wait changes: I never bothered about the barrier
because we know that there are tons of requests outstanding, and if
we don't do a wakeup the next guy will.  Plus *this* request has to
be put back sometime too, which will deliver a wakeup.  But whatever;
it's not exactly a fastpath.

However the function is still not watertight:

static struct request *get_request_wait(request_queue_t *q, int rw)
{
	DEFINE_WAIT(wait);
	struct request_list *rl = &q->rq[rw];
	struct request *rq;

	spin_lock_prefetch(q->queue_lock);

	generic_unplug_device(q);
	do {
		prepare_to_wait_exclusive(&rl->wait, &wait,
					TASK_UNINTERRUPTIBLE);
		if (!rl->count)
			io_schedule();
		finish_wait(&rl->wait, &wait);
		spin_lock_irq(q->queue_lock);
		rq = get_request(q, rw);
		spin_unlock_irq(q->queue_lock);
	} while (rq == NULL);
	return rq;
}

If someone has taken *all* the requests and hasn't submitted any of them
yet, there is nothing to unplug.   We go to sleep, all the requests are
submitted (behind a plug) and it's game over.  Could happen if the device
has a teeny queue...


I dunno.  I bet there are still more holes, and I for one am heartily sick
of unplug bugs.  Why not make the damn queue unplug itself after ten
milliseconds or 16 requests?  I bet that would actually increase throughput,
especially in the presence of kernel preemption and scheduling points.
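
A purely illustrative sketch of that idea against the 2.4 timer API
(the q->unplug_timer field and blk_plug_timeout() don't exist; they're
assumptions of this sketch, not an existing interface):

	/*
	 * hypothetical: force the unplug if the queue has been
	 * plugged for ~10ms without anybody asking for the I/O
	 */
	static void blk_plug_timeout(unsigned long data)
	{
		request_queue_t *q = (request_queue_t *) data;
		generic_unplug_device(q);
	}

	/* at plug time (q->unplug_timer initialized at queue setup): */
	q->unplug_timer.function = blk_plug_timeout;
	q->unplug_timer.data = (unsigned long) q;
	mod_timer(&q->unplug_timer, jiffies + HZ / 100);	/* ~10ms */

(the "16 requests" half would just be a counter in the plug path that
calls generic_unplug_device() directly when it overflows)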


* Re: 2.[45] fixes for design locking bug in wait_on_page/wait_on_buffer/get_request_wait
  2002-11-12  8:50 ` Andrew Morton
@ 2002-11-12 18:15   ` Andrea Arcangeli
  2002-11-12 19:46     ` Andrew Morton
  0 siblings, 1 reply; 16+ messages in thread
From: Andrea Arcangeli @ 2002-11-12 18:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linus Torvalds, Marcelo Tosatti, linux-kernel, Chris Mason

On Tue, Nov 12, 2002 at 12:50:46AM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> > 
> > the race looks like this:
> > 
> >         CPU0                    CPU1
> >         -----------------       ------------------------
> >         reiserfs_writepage
> >         lock_buffer()
> >                                 fsync_buffers_list() under lock_super()
> >                                 wait_on_buffer()
> >                                 run_task_queue(&tq_disk) -> noop
> >                                 schedule() <- hang with lock_super acquired
> >         submit_bh()
> >         /* don't unplug here */
> > 
> 
> Or, more simply:
> 
> 	lock_buffer()
> 				while (buffer_locked()) {
> 					blk_run_queues();	/* Nothing happens */
> 					if (buffer_locked(bh))
> 						schedule();
> 	submit_bh();	/* No unplug */
> 

Yep, that is the simple one. I mentioned the more complicated reiserfs
special case since it is the one that allowed this bug to see the light
of day; the simple race is resolved by kupdate after a mean 2.5 sec of
hang time.

> 
> The fix seems reasonable to me.  It would perhaps be better to just do:
> 
> +       if (waitqueue_active(wqh))
> +               blk_run_queues();
> 
> in submit_bh().  To save the context switch.

The race triggers very rarely, so I doubt saving the context switch
could make any difference, and wake_up looked conceptually simpler; but
run_task_queue(&tq_disk)/blk_run_queues would indeed fix it too. It's up
to you, I find the wake_up cleaner.

Something more important, if you care about microoptimizations, is
probably to add unlikely() around the waitqueue_active() check.
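
E.g. against the submit_bh hunk in the attached 2.4 patch that would
become:

	/* fix race condition with wait_on_buffer() */
	smp_mb(); /* spin_unlock may have inclusive semantics */
	if (unlikely(waitqueue_active(&bh->b_wait)))
		wake_up(&bh->b_wait);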

> Moving the blk_run_queues() inside the TASK_UNINTERRUPTIBLE region
> is something which always worried me, because if something down
> there sets TASK_RUNNING, we end up in a busy wait.  But that's OK
> for 2.5 and may be OK for 2.4's run_task_queue() - I haven't checked...

Even in the unlikely case that some buggy request_fn sets the task
runnable, the second invocation of run_task_queue will do nothing, so
there's no risk of busy waiting in R state, and that is the only safe
place to unplug; unplugging before setting the task state is totally
broken and racy, see the deadlock explanation in get_request_wait in
2.4/2.5.

> The multipage stuff in 2.5 does its own blk_run_queues() and looks to be
> OK, which I assume is why you didn't touch that.

Yes, but also because I didn't look much at the 2.5 code, so you should
double check there; at the moment I care most about 2.4, where this has
been a blocker for us until a few days ago.

> The little single-page reads like do_generic_mapping_read() look to be
> OK because the process which waits is the one which submitted the IO.

no, you forgot readahead, that's buggy and needs fixing in readpage too
(same in 2.4 and 2.5).

> wrt the get_request_wait changes: I never bothered about the barrier
> because we know that there are tons of requests outstanding, and if
> we don't do a wakeup the next guy will.  Plus *this* request has to
> be put back sometime too, which will deliver a wakeup.  But whatever;
> it's not exactly a fastpath.

I assume you're talking about the smp_mb() needed before reading the
waitqueue. I'm not claiming that the race is practical; it may not
trigger on most hardware due to the timings and hardware details, but it
can definitely happen in theory.

It doesn't matter that we know there are tons of requests outstanding:
if we read the waitqueue before increasing the count, we won't wake up
the other cpu, and we could get a flood of I/O completions on all cpus,
all reading the waitqueue first; you simply need tot_requests -
batch_requests == nr_cpus - 1 to be able to trigger it in practice. So,
algorithmically speaking, the smp_mb() (i.e. barrier() on UP) is needed.

> However the function is still not watertight:
> 
> static struct request *get_request_wait(request_queue_t *q, int rw)
> {
> 	DEFINE_WAIT(wait);
> 	struct request_list *rl = &q->rq[rw];
> 	struct request *rq;
> 
> 	spin_lock_prefetch(q->queue_lock);
> 
> 	generic_unplug_device(q);
> 	do {
> 		prepare_to_wait_exclusive(&rl->wait, &wait,
> 					TASK_UNINTERRUPTIBLE);
> 		if (!rl->count)
> 			io_schedule();
> 		finish_wait(&rl->wait, &wait);
> 		spin_lock_irq(q->queue_lock);
> 		rq = get_request(q, rw);
> 		spin_unlock_irq(q->queue_lock);
> 	} while (rq == NULL);
> 	return rq;
> }
> 
> If someone has taken *all* the requests and hasn't submitted any of them
> yet, there is nothing to unplug.   We go to sleep, all the requests are
> submitted (behind a plug) and it's game over.  Could happen if the device
> has a teeny queue...

I thought the queue spinlock would avoid that, but I overlooked
get_request_wait itself racing against another get_request_wait, so yes,
that is still an open problem. Nevertheless, my fix to get_request_wait
to unplug after being in the waitqueue and in uninterruptible mode is
still definitely needed, and it fixes another bug that could deadlock
get_request_wait too (actually a more severe bug than this subtle one).

I will think more about how to fix get_request_wait against
get_request_wait. One simple fix of course is to never release the
spinlock between the get_request() and the add_request(). This is in
fact what I thought was always the case, but as said I had forgotten
about get_request_wait against get_request_wait. In short, to fix this
cleanly (without wakeups [or spurious unplugs if you prefer]) we should
never drop the q->queue_lock with the request not in the waitqueue and
not in the freelist.
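
A rough illustration of that invariant (not the actual 2.5
__make_request code, which today drops the lock between
get_request_wait() returning and add_request()):

	spin_lock_irq(q->queue_lock);
	rq = get_request(q, rw);	/* freelist may now be empty */
	if (rq) {
		/* ... fill in rq from the bio ... */
		add_request(q, rq);	/* rq visible to any unplug */
	}
	spin_unlock_irq(q->queue_lock);
	/*
	 * whenever the lock is dropped, every request taken from the
	 * freelist is already in the queue, so a sleeper that sees
	 * rl->count == 0 always has pending I/O to unplug
	 */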

> I dunno.  I bet there are still more holes, and I for one am heartily sick
> of unplug bugs.  Why not make the damn queue unplug itself after ten
> milliseconds or 16 requests?  I bet that would actually increase throughput,
> especially in the presence of kernel preemption and scheduling points.

This doesn't remove the need to avoid the mean 5 msec delay before the
queue unplug (I know the race triggers rarely, but if we left all the
broken places alone, the number of 5 msec mean waits would increase bug
by bug over time). I'm not very excited by the idea (in particular for
2.4); if it performs so much better, I would say the queue length is
broken and way too oversized, and that leads to other problems with
fairness of task against task etc... We need the big number of requests
only with small requests; with big requests we should allow only very
few requests to be returned.

In short, I think my patch is correct and needed in both 2.4 and 2.5. I
will now think about how best to fix the get_request_wait against
get_request_wait race that you noticed too, and I will upload an
incremental fix for it. I would suggest merging the current fixes in the
meantime (if the mainline maintainers prefer to change it and unplug
instead of waking up the waiter, that's ok with me; conceptually I find
waking the waiter cleaner, but that's up to you)

Thanks for the help,

Andrea

PS. CC'ed Chris; he's certainly interested too, given he found the
get_request_wait problem. Now that I think about it, I guess Chris
really meant the get_request_wait against get_request_wait race that you
found above; I probably just happened to notice the simpler, more severe
bug in that function and thought Chris meant that simpler one too (not
exactly the same race as in lock_buffer/lock_page, in fact), and I
didn't imagine get_request_wait was really buggy in two different ways ;).


* Re: 2.[45] fixes for design locking bug in  wait_on_page/wait_on_buffer/get_request_wait
  2002-11-12 18:15   ` Andrea Arcangeli
@ 2002-11-12 19:46     ` Andrew Morton
  2002-11-13  0:33       ` Andrea Arcangeli
  0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2002-11-12 19:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Marcelo Tosatti, linux-kernel, Chris Mason

Andrea Arcangeli wrote:
> 
> 
> no, you forgot readahead, that's buggy and needs fixing in readpage too
> (same in 2.4 and 2.5).

readahead in 2.5 unplugs.  Which is quite wrong for AIO, but is unavoidable
given the current situation...
 
> ...
> 
> > I dunno.  I bet there are still more holes, and I for one am heartily sick
> > of unplug bugs.  Why not make the damn queue unplug itself after ten
> > milliseconds or 16 requests?  I bet that would actually increase throughput,
> > especially in the presence of kernel preemption and scheduling points.
> 
> This doesn't remove the need to avoid the mean 5 msec delay before the
> queue unplug (I know the race triggers rarely, but if we left all the
> broken places alone, the number of 5 msec mean waits would increase bug
> by bug over time). I'm not very excited by the idea (in particular for
> 2.4); if it performs so much better, I would say the queue length is
> broken and way too oversized, and that leads to other problems with
> fairness of task against task etc... We need the big number of requests
> only with small requests; with big requests we should allow only very
> few requests to be returned.

No, no, no.  Step back, miles back.  Forget the current implementation
and ask yourself "what is a good design for plugging?".  Because I can
assure you that if someone came out and proposed the current design for
inclusion in today's kernel, they would be pulverised.

Plugging is an optimisation which is specific to, and internal to the
request layer.  It should not require that every client of that layer
know about it, and be careful to unplug (with bizarrely complex locking
and ordering rules) simply because the request layer is too lame to look
after its own stuff.

Particularly because the clients of the request layer do not have enough
information to make a good decision about when to unplug.  Only the
request layer knows that.

And particularly because we've been hitting the same darn bug for years.
And we've been blaming the poor old callers!

The request layer should perform its own unplug "when the time is right",
and the external unplug should be viewed as an optimisation hint.

It all becomes rather obvious when you try to answer the question "when
should the kernel unplug for AIO reads"?

And yes, the fact that the time-based component of the unplug decision
would obscure failure by callers to make use of the optimisation hint
is unfortunate.  But it's only a few callsites, and 99.9999% coverage
there is good enough.


* Re: 2.[45] fixes for design locking bug in wait_on_page/wait_on_buffer/get_request_wait
  2002-11-12 19:46     ` Andrew Morton
@ 2002-11-13  0:33       ` Andrea Arcangeli
  2002-11-13  1:01         ` Andrew Morton
  0 siblings, 1 reply; 16+ messages in thread
From: Andrea Arcangeli @ 2002-11-13  0:33 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linus Torvalds, Marcelo Tosatti, linux-kernel, Chris Mason

On Tue, Nov 12, 2002 at 11:46:01AM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> > 
> > 
> > no, you forgot readahead, that's buggy and needs fixing in readpage too
> > (same in 2.4 and 2.5).
> 
> readahead in 2.5 unplugs.  Which is quite wrong for AIO, but is unavoidable
> given the current situation...

Current situation?? I fixed these bugs, and you certainly didn't know
about them previously. Furthermore, writeback won't unplug (or it would
totally suck, period), so you still need it at the page layer too.

> > > I dunno.  I bet there are still more holes, and I for one am heartily sick
> > > of unplug bugs.  Why not make the damn queue unplug itself after ten
> > > milliseconds or 16 requests?  I bet that would actually increase throughput,
> > > especially in the presence of kernel preemption and scheduling points.
> > 
> > This doesn't remove the need to avoid the mean 5 msec delay before the
> > queue unplug (I know the race triggers rarely, but if we left all the
> > broken places alone, the number of 5 msec mean waits would increase bug
> > by bug over time). I'm not very excited by the idea (in particular for
> > 2.4); if it performs so much better, I would say the queue length is
> > broken and way too oversized, and that leads to other problems with
> > fairness of task against task etc... We need the big number of requests
> > only with small requests; with big requests we should allow only very
> > few requests to be returned.
> 
> No, no, no.  Step back, miles back.  Forget the current implementation
> and ask yourself "what is a good design for plugging?".  Because I can
> assure you that if someone came out and proposed the current design for
> inclusion in today's kernel, they would be pulverised.
> 
> Plugging is an optimisation which is specific to, and internal to the
> request layer.  It should not require that every client of that layer
> know about it, and be careful to unplug (with bizarrely complex locking
> and ordering rules) simply because the request layer is too lame to look
> after its own stuff.
> 
> Particularly because the clients of the request layer do not have enough
> information to make a good decision about when to unplug.  Only the
> request layer knows that.
> 
> And particularly because we've been hitting the same darn bug for years.
> And we've been blaming the poor old callers!
> 
> The request layer should perform its own unplug "when the time is right",
> and the external unplug should be viewed as an optimisation hint.
> 
> It all becomes rather obvious when you try to answer the question "when
> should the kernel unplug for AIO reads"?
> 
> And yes, the fact that the time-based component of the unplug decision
> would obscure failure by callers to make use of the optimisation hint
> is unfortunate.  But it's only a few callsites, and 99.9999% coverage
> there is good enough.

So what? You agree the hint is needed, and if you don't fix the race
it's like having no hint at all, and you wait 5 msec on average, which I
don't want, period. Ok, 5 msec isn't too bad (yeah, the current 2.5 sec
hangs with ext2 aren't too bad either), but they're still "not too bad"
_bugs_. The more "not too bad" bugs you add, the more things will get
worse and worse.

Then tell me how to unplug from a timer irq; with a context switch? Not
nice.

Anyway, I think the answer to the question "when to unplug" has nothing
to do with these races: when wait_on_buffer arrives and says "I need
this I/O completed ASAP", that request must arrive at the blkdev layer,
no matter how and no matter whether it would lead to a deadlock or not.

We have bugs in delivering the "I need this I/O completed ASAP" message,
not in the "when to unplug" algorithm. You're arguing about a totally
different thing. Incidentally the two things now tend to overlap, but
that's just incidental, and IMHO it is a very efficient solution, more
so than doing a context switch at every unplug like you're suggesting,
most probably to hide the overkill size of some I/O queue. We just have
a background unplug every 5 sec (unless the system hangs due to bugs in
other places); if there are only a few requests in the queue, it's fine
to delay them by up to 5 sec. So in short I think the current logic is
just fine; feel free to prove me wrong, send me a patch and I will
benchmark it. I only reserve the possibility of fixing the blkdev layer
in case your patch would work around any stupid thing there.

Last but not least, by your argument that these fixes are useless, I
could as well fix our hang completely by simply adding a new kernel
thread doing schedule_timeout(1); run_task_queue(&tq_disk) in a loop. If
I ever remotely did anything like that, you would be flaming me for not
fixing the real bug, and I would still suffer from 5 msec slowdowns in
the buggy places, just like your universal unplug solution suffers too.

Not fixing these races would be very bad; I'm totally stunned that
you're against these fixes and propose to hide these bugs, making the
system worse and worse over time.

Andrea


* Re: 2.[45] fixes for design locking bug in  wait_on_page/wait_on_buffer/get_request_wait
  2002-11-13  0:33       ` Andrea Arcangeli
@ 2002-11-13  1:01         ` Andrew Morton
  0 siblings, 0 replies; 16+ messages in thread
From: Andrew Morton @ 2002-11-13  1:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Marcelo Tosatti, linux-kernel, Chris Mason

Andrea Arcangeli wrote:
> 

I agreed that the fix was reasonable, and then went on to discuss
broader things.


* RE: 2.[45] fixes for design locking bug in wait_on_page/wait_on_buffer/get_request_wait
  2002-11-17 17:37   ` Marc-Christian Petersen
@ 2002-11-17 19:19     ` Folkert van Heusden
  0 siblings, 0 replies; 16+ messages in thread
From: Folkert van Heusden @ 2002-11-17 19:19 UTC (permalink / raw)
  To: 'Marc-Christian Petersen'; +Cc: linux-kernel

> Andrea, this makes a difference! The pausings are much less than before,
> but still occur. Situation below.
MP> So, after some time spent replacing 2.4.20-rc2 files with the
MP> following files from 2.4.18 (and compiling successfully)
MP> drivers/block/ll_rw_blk.c
MP> drivers/block/elevator.c
MP> include/linux/elevator.h
MP> those "pauses/stops" are _GONE_!

What is really strange is that while I run the following in the
bootscripts:

echo "90 500 0 0 600000 600000 95 20 0" > /proc/sys/vm/bdflush
elvtune /dev/hda1 -r 2048 -w 131072

I never experience those stalls at all!
And that is surely not what I was expecting (cleaning buffers at 90% and
doing that synchronously at 95% gives little room for the first to
complete before new buffers get dirty; also, that elevator fiddling
should force large bursts. Why am I doing this anyway? I wanted to
configure the kernel et al. so that as few disk reads/writes as possible
occur).



* Re: 2.[45] fixes for design locking bug in wait_on_page/wait_on_buffer/get_request_wait
  2002-11-16 21:55 ` Marc-Christian Petersen
@ 2002-11-17 17:37   ` Marc-Christian Petersen
  2002-11-17 19:19     ` Folkert van Heusden
  0 siblings, 1 reply; 16+ messages in thread
From: Marc-Christian Petersen @ 2002-11-17 17:37 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton; +Cc: linux-kernel, Rodrigo Souza de Castro

On Saturday 16 November 2002 22:55, Marc-Christian Petersen wrote:

Hi Andrea,

> Andrea, this makes a difference! The pausings are much less than before,
> but still occur. Situation below.
So, after some time spent replacing 2.4.20-rc2 files with the following
files from 2.4.18 (and compiling successfully)

drivers/block/ll_rw_blk.c
drivers/block/elevator.c
include/linux/elevator.h

those "pauses/stopps" are _GONE_!

I've also tested all these situations on totally different machines. No
changes; they all behave the same.

I am really afraid that no one else (it seems) has noticed this but me!
I cannot believe that?! :-(

It is a really serious "bug" I cannot live with. Under that high disk
i/o even pings are ignored for as long as the stall lasts, which is an
absolutely unacceptable situation! *imho*

What do you think? :-)

ciao, Marc


* Re: 2.[45] fixes for design locking bug in wait_on_page/wait_on_buffer/get_request_wait
  2002-11-16 18:58 Marc-Christian Petersen
@ 2002-11-16 21:55 ` Marc-Christian Petersen
  2002-11-17 17:37   ` Marc-Christian Petersen
  0 siblings, 1 reply; 16+ messages in thread
From: Marc-Christian Petersen @ 2002-11-16 21:55 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton; +Cc: linux-kernel

On Saturday 16 November 2002 19:58, Marc-Christian Petersen wrote:

Hi Andrea,

> > Just to make a quick test, can you try a hack like this, combined with
> > a setting of elvtune -r 128 -w 256, on top of 2.4.20rc1?
> >
> > --- x/drivers/block/ll_rw_blk.c.~1~     Sat Nov  2 19:45:33 2002
> > +++ x/drivers/block/ll_rw_blk.c Sat Nov 16 19:44:20 2002
> > @@ -432,7 +432,7 @@ static void blk_init_free_list(request_q
> >
> >         si_meminfo(&si);
> >         megs = si.totalram >> (20 - PAGE_SHIFT);
> > -       nr_requests = 128;
> > +       nr_requests = 16;
> >         if (megs < 32)
> >                 nr_requests /= 2;
> >         blk_grow_request_list(q, nr_requests);
>
> Hehe, Andrea, it seems we are both thinking the same thing ... :-) ...
> I am just recompiling the kernel ... hang on.
Andrea, this makes a difference! The pausings are much less than before, but 
still occur. Situation below.

Another thing I've noticed: while doing the "cp -a xyz abc" in
a loop, the available memory decreases a lot (silly caching of files)


Before copying:
---------------
MemTotal:       515992 kB
MemFree:        440876 kB
Buffers:          3808 kB
Cached:          24128 kB


While copying:
--------------
root@codeman:[/usr/src] # ./bla.sh
+ cd /usr/src
+ rm -rf linux-2.4.19-blaold linux-2.4.19-blanew
+ COUNT=0
+ echo 0
0
+ cp -a linux-2.4.19-vanilla linux-2.4.19-blaold
+ cp -a linux-2.4.19-vanilla linux-2.4.19-blanew

The above has not yet finished and memory looks like this:

MemTotal:       515992 kB
MemFree:          3348 kB
Buffers:         12244 kB
Cached:         451608 kB

Swap (500MB) turned off. Pausings are almost none (without your patch /
elvtune there were massive pauses even then).

If swap is turned on, swap usage grows a lot and then, once swap is
used, we have pauses (though even fewer than without your patch).

To free the used/cached memory I use umount /usr/src.

I think your proposal is good. Anything else I should test/change? Any
further information/tests I can/should run?

Thanks a lot!

ciao, Marc




* Re: 2.[45] fixes for design locking bug in wait_on_page/wait_on_buffer/get_request_wait
@ 2002-11-16 18:58 Marc-Christian Petersen
  2002-11-16 21:55 ` Marc-Christian Petersen
  0 siblings, 1 reply; 16+ messages in thread
From: Marc-Christian Petersen @ 2002-11-16 18:58 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

Hi Andrea,

> Just to make a quick test, can you try a hack like this, combined with
> a setting of elvtune -r 128 -w 256, on top of 2.4.20rc1?
>
> --- x/drivers/block/ll_rw_blk.c.~1~     Sat Nov  2 19:45:33 2002
> +++ x/drivers/block/ll_rw_blk.c Sat Nov 16 19:44:20 2002
> @@ -432,7 +432,7 @@ static void blk_init_free_list(request_q
> 
>         si_meminfo(&si);
>         megs = si.totalram >> (20 - PAGE_SHIFT);
> -       nr_requests = 128;
> +       nr_requests = 16;
>         if (megs < 32)
>                 nr_requests /= 2;
>         blk_grow_request_list(q, nr_requests);
Hehe, Andrea, it seems we are both thinking the same thing ... :-) ...
I am just recompiling the kernel ... hang on.

ciao, Marc




* Re: 2.[45] fixes for design locking bug in wait_on_page/wait_on_buffer/get_request_wait
  2002-11-16 17:43       ` Marc-Christian Petersen
@ 2002-11-16 18:46         ` Andrea Arcangeli
  0 siblings, 0 replies; 16+ messages in thread
From: Andrea Arcangeli @ 2002-11-16 18:46 UTC (permalink / raw)
  To: Marc-Christian Petersen; +Cc: linux-kernel, Con Kolivas

On Sat, Nov 16, 2002 at 06:43:36PM +0100, Marc-Christian Petersen wrote:
> On Saturday 16 November 2002 18:32, Andrea Arcangeli wrote:
> 
> Hi Andrea,
> 
> > You may want to try this setting, which helps with very slow devices:
> > 	echo 2 500 0 0 500 3000 3 1 0 > /proc/sys/vm/bdflush
> >
> > or also my current default, tuned for high performance:
> > 	echo 50 500 0 0 500 3000 60 20 > /proc/sys/vm/bdflush
> I've tested both without any changes; the "pausings" are still there.
> 
> > You may have too many dirty buffers around and end up running at disk
> > speed at every memory allocation; the first setting will decrease the
> > amount of dirty buffers dramatically. If you still see a significant
> > slowdown with the first setting above, it's most probably only the
> > usual elevator issue.
> Seems so.
> 
> So I have to use 2.4.18 until there is a real proper fix for that.

Just to make a quick test, can you try a hack like this, combined with a
setting of elvtune -r 128 -w 256, on top of 2.4.20rc1?

--- x/drivers/block/ll_rw_blk.c.~1~	Sat Nov  2 19:45:33 2002
+++ x/drivers/block/ll_rw_blk.c	Sat Nov 16 19:44:20 2002
@@ -432,7 +432,7 @@ static void blk_init_free_list(request_q
 
 	si_meminfo(&si);
 	megs = si.totalram >> (20 - PAGE_SHIFT);
-	nr_requests = 128;
+	nr_requests = 16;
 	if (megs < 32)
 		nr_requests /= 2;
 	blk_grow_request_list(q, nr_requests);



Andrea


* Re: 2.[45] fixes for design locking bug in wait_on_page/wait_on_buffer/get_request_wait
  2002-11-16 17:32     ` Andrea Arcangeli
@ 2002-11-16 17:43       ` Marc-Christian Petersen
  2002-11-16 18:46         ` Andrea Arcangeli
  0 siblings, 1 reply; 16+ messages in thread
From: Marc-Christian Petersen @ 2002-11-16 17:43 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, Con Kolivas

On Saturday 16 November 2002 18:32, Andrea Arcangeli wrote:

Hi Andrea,

> You may want to try this setting, which helps with very slow devices:
> 	echo 2 500 0 0 500 3000 3 1 0 > /proc/sys/vm/bdflush
>
> or also my current default, tuned for high performance:
> 	echo 50 500 0 0 500 3000 60 20 > /proc/sys/vm/bdflush
I've tested both without any changes; the "pausings" are still there.

> You may have too many dirty buffers around and end up running at disk
> speed at every memory allocation; the first setting will decrease the
> amount of dirty buffers dramatically. If you still see a significant
> slowdown with the first setting above, it's most probably only the
> usual elevator issue.
Seems so.

So I have to use 2.4.18 until there is a real proper fix for that.

ciao, Marc


* Re: 2.[45] fixes for design locking bug in wait_on_page/wait_on_buffer/get_request_wait
  2002-11-16 17:23   ` Marc-Christian Petersen
@ 2002-11-16 17:32     ` Andrea Arcangeli
  2002-11-16 17:43       ` Marc-Christian Petersen
  0 siblings, 1 reply; 16+ messages in thread
From: Andrea Arcangeli @ 2002-11-16 17:32 UTC (permalink / raw)
  To: Marc-Christian Petersen; +Cc: linux-kernel, Con Kolivas

On Sat, Nov 16, 2002 at 06:23:26PM +0100, Marc-Christian Petersen wrote:
> On Saturday 16 November 2002 17:59, Andrea Arcangeli wrote:
> 
> Hi Andrea,
> 
> > Your pausing problem has little to do with the pausing fix; the
> > problem for you is the read latency, and you're not triggering the
> > race condition fixed by the pausing fix, so it can't make a
> > difference. One of the fundamental obstacles to the read latency is
> > the size of the I/O queue, a factor that is worked around by the
> > read-latency patch, which basically bypasses the size of the queue,
> > hiding the problem, and in turn can't fix the write latency with
> > O_SYNC or the read latency during true read aio etc...
> Ok, after some further tests, I think this is _somewhat_ FS dependent. I
> tried this with ext2, ext3 (no difference there) and also with ReiserFS,
> and what can I say: those "pausings" caused by the write latency are a
> lot less with ReiserFS than with ext2/ext3, but they still occur.
> 
> Something bullshitty must have gone into 2.4.19/2.4.20 that causes those
> ugly things, because 2.4.18 does not have that problem. This is still
> why I don't use any kernels >2.4.18.
> 
> After changing the elevator settings like this:
> 
> root@codeman:[/] # elvtune /dev/hda
> 
> /dev/hda elevator ID            0
>         read_latency:           2048
>         write_latency:          1024
>         max_bomb_segments:      0
> 
> those "pausings" are less severe than before but are still there.
> NOTE: the write latency is lower than the read latency (it's not a typo :)

You may want to try this setting, which helps with very slow devices:

	echo 2 500 0 0 500 3000 3 1 0 > /proc/sys/vm/bdflush

or also my current default, tuned for high performance:

	echo 50 500 0 0 500 3000 60 20 > /proc/sys/vm/bdflush

You may have too many dirty buffers around and end up running at disk
speed at every memory allocation; the first setting will decrease the
amount of dirty buffers dramatically. If you still see a significant
slowdown with the first setting above, it's most probably only the usual
elevator issue.

Andrea


* Re: 2.[45] fixes for design locking bug in wait_on_page/wait_on_buffer/get_request_wait
  2002-11-16 16:59 ` Andrea Arcangeli
@ 2002-11-16 17:23   ` Marc-Christian Petersen
  2002-11-16 17:32     ` Andrea Arcangeli
  0 siblings, 1 reply; 16+ messages in thread
From: Marc-Christian Petersen @ 2002-11-16 17:23 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, Con Kolivas

On Saturday 16 November 2002 17:59, Andrea Arcangeli wrote:

Hi Andrea,

> Your pausing problem has little to do with the pausing fix; the problem
> for you is the read latency, and you're not triggering the race
> condition fixed by the pausing fix, so it can't make a difference. One
> of the fundamental obstacles to the read latency is the size of the I/O
> queue, a factor that is worked around by the read-latency patch, which
> basically bypasses the size of the queue, hiding the problem, and in
> turn can't fix the write latency with O_SYNC or the read latency during
> true read aio etc...
Ok, after some further tests, I think this is _somewhat_ FS dependent. I
tried this with ext2, ext3 (no difference there) and also with ReiserFS,
and what can I say: those "pausings" caused by the write latency are a
lot less with ReiserFS than with ext2/ext3, but they still occur.

Something bullshitty must have gone into 2.4.19/2.4.20 that causes those
ugly things, because 2.4.18 does not have that problem. This is still
why I don't use any kernels >2.4.18.

After changing the elevator settings like this:

root@codeman:[/] # elvtune /dev/hda

/dev/hda elevator ID            0
        read_latency:           2048
        write_latency:          1024
        max_bomb_segments:      0

those "pausings" are less worse than before but are still there.
NOTE: Write latency is lower than read latency (it's not a typo :)

ciao, Marc




* Re: 2.[45] fixes for design locking bug in wait_on_page/wait_on_buffer/get_request_wait
  2002-11-16 16:24 Marc-Christian Petersen
@ 2002-11-16 16:59 ` Andrea Arcangeli
  2002-11-16 17:23   ` Marc-Christian Petersen
  0 siblings, 1 reply; 16+ messages in thread
From: Andrea Arcangeli @ 2002-11-16 16:59 UTC (permalink / raw)
  To: Marc-Christian Petersen; +Cc: linux-kernel, Con Kolivas

On Sat, Nov 16, 2002 at 05:24:17PM +0100, Marc-Christian Petersen wrote:
> Hi Andrea,
> 
> I've applied your patch to 2.4.19 final and 2.4.20 final and those
> "pausings" still exist when I do a full kernel tree copy with "cp -a
> linux-2.4.19 linux-2.4.20". When I am in a screen session and try to do
> a Ctrl-A-C to create a new screen session, it comes up within 3-5
> seconds ... If I want to switch to another session with Ctrl-A-N, it
> sometimes happens fast and sometimes within 3-5 seconds ... I've put
> the above in a "while true; do ...; done" loop. There is nothing else
> running or eating CPU, and also nothing else eating i/o. This does not
> happen with virgin 2.4.18.

Your pausing problem has little to do with the pausing fix; the problem
for you is the read latency, and you're not triggering the race
condition fixed by the pausing fix, so it can't make a difference. One
of the fundamental obstacles to the read latency is the size of the I/O
queue, a factor that is worked around by the read-latency patch, which
basically bypasses the size of the queue, hiding the problem, and in
turn can't fix the write latency with O_SYNC or the read latency during
true read aio etc...

Andrea


* Re: 2.[45] fixes for design locking bug in wait_on_page/wait_on_buffer/get_request_wait
@ 2002-11-16 16:24 Marc-Christian Petersen
  2002-11-16 16:59 ` Andrea Arcangeli
  0 siblings, 1 reply; 16+ messages in thread
From: Marc-Christian Petersen @ 2002-11-16 16:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrea Arcangeli, Con Kolivas

Hi Andrea,

I've applied your patch to 2.4.19 final and 2.4.20 final and those
"pausings" still exist when I do a full kernel tree copy with "cp -a
linux-2.4.19 linux-2.4.20". When I am in a screen session and try to do
a Ctrl-A-C to create a new screen session, it comes up within 3-5
seconds ... If I want to switch to another session with Ctrl-A-N, it
sometimes happens fast and sometimes within 3-5 seconds ... I've put the
above in a "while true; do ...; done" loop. There is nothing else
running or eating CPU, and also nothing else eating i/o. This does not
happen with virgin 2.4.18.

This is with a Celeron Tualatin 1300MHz, i815 chipset, 512MB RAM, UDMA100 HDD, 
DMA enabled.


/dev/hda:

 multcount    = 16 (on)
 IO_support   =  1 (32-bit)
 unmaskirq    =  0 (off)
 using_dma    =  1 (on)
 keepsettings =  0 (off)
 readonly     =  0 (off)
 readahead    =  8 (on)
 geometry     = 7299/255/63, sectors = 117266688, start = 0

 Model=MAXTOR 6L060J3, FwRev=A93.0500, SerialNo=663219752652
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
 RawCHS=16383/16/63, TrkSize=32256, SectSize=21298, ECCbytes=4
 BuffType=DualPortCache, BuffSize=1819kB, MaxMultSect=16, MultSect=16
 CurCHS=4047/16/255, CurSects=16511760, LBA=yes, LBAsects=117266688
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4 
 DMA modes:  mdma0 mdma1 mdma2 
 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 udma6 
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: ATA/ATAPI-5 T13 1321D revision 1:  1 2 3 4 5

It also appears if unmaskirq is set to 1.

ciao, Marc


