linux-kernel.vger.kernel.org archive mirror
* prepare_to_wait/waitqueue_active issues in 2.6
@ 2003-12-14  3:40 Anton Blanchard
  2003-12-14  3:53 ` Anton Blanchard
  0 siblings, 1 reply; 6+ messages in thread
From: Anton Blanchard @ 2003-12-14  3:40 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel


Hah, this is the second time in a few weeks that we've been hunting a 2.6
bug in the same area as you:

> Fix subtle bug in "finish_wait()", which can cause kernel stack
> corruption on SMP because of another CPU still accessing a waitqueue
> even after it was de-allocated.

> Use a careful version of the list emptiness check to make sure we
> don't de-allocate the stack frame before the waitqueue is all done.

We've been seeing unkillable tasks in a few places on ppc64:

[c000000000050378] io_schedule+0x54/0x94
[c00000000007e288] __lock_page+0xf0/0x174
[c00000000007ef64] do_generic_mapping_read+0x468/0x524
[c00000000007f37c] __generic_file_aio_read+0x1d4/0x218
[c00000000007f478] generic_file_read+0x60/0xa8
[c0000000000a7500] vfs_read+0x10c/0x174
[c0000000000a77fc] sys_read+0x50/0xa4

and

[c0000000000485f4] io_schedule+0x54/0x84
[c0000000000a0534] __wait_on_buffer+0x110/0x118
[c000000000105308] do_get_write_access+0xd8/0x83c
[c000000000105ab0] journal_get_write_access+0x44/0x74
[c0000000000f6928] ext3_reserve_inode_write+0xcc/0x130
[c0000000000f69b0] ext3_mark_inode_dirty+0x24/0x64
[c0000000000f6aa0] ext3_dirty_inode+0xb0/0xb8
[c0000000000cd3e0] __mark_inode_dirty+0x198/0x1a0
[c0000000000c4a10] update_atime+0xfc/0x104
[c00000000007263c] do_generic_mapping_read+0x414/0x73c
[c000000000072cdc] __generic_file_aio_read+0x1d4/0x208
[c000000000072d48] generic_file_aio_read+0x38/0x48
[c00000000009e77c] do_sync_read+0x7c/0xc4
[c00000000009e8d0] vfs_read+0x10c/0x164
[c0000000000aec08] kernel_read+0x38/0x5c
[c00000000002099c] do_execve32+0x1fc/0x35c
[c000000000020b6c] sys32_execve+0x70/0xdc

Taking the second example, we sleep in __wait_on_buffer:

void __wait_on_buffer(struct buffer_head * bh)
{
...
        do {
                prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
                if (buffer_locked(bh)) {
                        blk_run_queues();
                        io_schedule();
                }
        } while (buffer_locked(bh));
        finish_wait(wqh, &wait);
}

And

void wake_up_buffer(struct buffer_head *bh)
{
...
        smp_mb();
        if (waitqueue_active(wq))
                wake_up_all(wq);
}

wake_up_buffer() does a lockless waitqueue_active() check. Now look at
__wait_on_buffer(): there is nothing to prevent the load in buffer_locked()
from being reordered up into prepare_to_wait(). That means we could do the
check for buffer_locked() before we are actually on the waitqueue.

End result is waitqueue_active() returns 0 and buffer_locked() returns 1.
We miss a wakeup. Game over. I've attached a patch that forces the check
to happen after we are put on the waitqueue. Thanks to Brian Twichell
for the analysis and suggested fix for this.

__lock_page() has the same problem, but it got us thinking about whether
the other uses of waitqueue_active() are safe.

A quick grep turns up a number of uses in 2.6. Looking at kswapd:

int kswapd(void *p)
{
...
	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
	schedule();
	finish_wait(&pgdat->kswapd_wait, &wait);
}

void wakeup_kswapd(struct zone *zone)
{
...
        if (!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
                return;
        wake_up_interruptible(&zone->zone_pgdat->kswapd_wait);
}

This doesn't look safe either. The first example could be made safe
because we checked the condition after we added ourselves to the waitqueue
but before we went to sleep. Here we go straight to sleep, so there is a
large window in which we can miss the wakeup: it could take a _long_ time
for the stores in prepare_to_wait() to make it to the other CPU.

So at this stage I think the only valid uses of waitqueue_active() are
those which do one last check of the condition before sleeping (and only
if they have a memory barrier before the check); the others are all buggy.

Thoughts?

Anton

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: prepare_to_wait/waitqueue_active issues in 2.6
  2003-12-14  3:40 prepare_to_wait/waitqueue_active issues in 2.6 Anton Blanchard
@ 2003-12-14  3:53 ` Anton Blanchard
  2003-12-14  5:02   ` Linus Torvalds
  0 siblings, 1 reply; 6+ messages in thread
From: Anton Blanchard @ 2003-12-14  3:53 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel


> End result is waitqueue_active returns 0 and buffer_locked returns 1.
> We miss a wakeup. Game over. I've attached a patch that forces the check
> to happen after we are put on the waitqueue. Thanks to Brian Twichell
> for the analysis and suggested fix for this.

This time for sure.

Anton

===== mm/filemap.c 1.212 vs edited =====
--- 1.212/mm/filemap.c	Sun Oct 26 16:41:06 2003
+++ edited/mm/filemap.c	Sun Dec 14 14:25:45 2003
@@ -299,6 +299,7 @@
 
 	do {
 		prepare_to_wait(waitqueue, &wait, TASK_UNINTERRUPTIBLE);
+		smp_mb();
 		if (test_bit(bit_nr, &page->flags)) {
 			sync_page(page);
 			io_schedule();
@@ -372,6 +373,7 @@
 
 	while (TestSetPageLocked(page)) {
 		prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+		smp_mb();
 		if (PageLocked(page)) {
 			sync_page(page);
 			io_schedule();
===== fs/buffer.c 1.215 vs edited =====
--- 1.215/fs/buffer.c	Tue Sep 30 11:12:02 2003
+++ edited/fs/buffer.c	Sun Dec 14 14:24:26 2003
@@ -131,6 +131,7 @@
 
 	do {
 		prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+		smp_mb();
 		if (buffer_locked(bh)) {
 			blk_run_queues();
 			io_schedule();


* Re: prepare_to_wait/waitqueue_active issues in 2.6
  2003-12-14  3:53 ` Anton Blanchard
@ 2003-12-14  5:02   ` Linus Torvalds
  2003-12-14  5:23     ` Anton Blanchard
  0 siblings, 1 reply; 6+ messages in thread
From: Linus Torvalds @ 2003-12-14  5:02 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: Kernel Mailing List



On Sun, 14 Dec 2003, Anton Blanchard wrote:
>
> > End result is waitqueue_active returns 0 and buffer_locked returns 1.
> > We miss a wakeup. Game over. I've attached a patch that forces the check
> > to happen after we are put on the waitqueue. Thanks to Brian Twichell
> > for the analysis and suggested fix for this.
>
> This time for sure.

Pardon my French, but this patch sure looks like crap.

Wouldn't it be better to just make sure "prepare_to_wait()" has the proper
barriers in it?  If you have problems with fs/buffer.c, then you should
have problems with the page_unlock() path in mm/filemap.c too, which has
_exactly_ the same logic.

In fact, pretty much any use of "prepare_to_wait()" has the potential of
moving a subsequent test upwards to before the wait-queue addition, so
we'd have this bug for _any_ use of "waitqueue_active()". The spinlocks
protect against critical data moving outside the critical region, but not
against movement in_to_ the critical region.

In short - as-is, your patch looks to be right, but it only solves a small
part of what appears to be the larger problem.

So my preference would be to add the barrier into prepare_to_wait(), along
with a comment on why it is sometimes needed.  Something like the
appended.. (which just uses "set_current_state()", since that's what it
exists for).

Does this work for you? I'd much prefer to see a fix that fixes _all_ the
cases, and is just two lines (plus the comment, which is much more ;)

		Linus

----
--- 1.147/kernel/fork.c	Fri Dec 12 14:20:03 2003
+++ edited/kernel/fork.c	Sat Dec 13 20:59:29 2003
@@ -125,15 +125,28 @@

 EXPORT_SYMBOL(remove_wait_queue);

+
+/*
+ * Note: we use "set_current_state()" _after_ the wait-queue add,
+ * because we need a memory barrier there on SMP, so that any
+ * wake-function that tests for the wait-queue being active
+ * will be guaranteed to see waitqueue addition _or_ subsequent
+ * tests in this thread will see the wakeup having taken place.
+ *
+ * The spin_unlock() itself is semi-permeable and only protects
+ * one way (it only protects stuff inside the critical region and
+ * stops them from bleeding out - it would still allow subsequent
 + * loads to move into the critical region).
+ */
 void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
 {
 	unsigned long flags;

-	__set_current_state(state);
 	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
 	spin_lock_irqsave(&q->lock, flags);
 	if (list_empty(&wait->task_list))
 		__add_wait_queue(q, wait);
+	set_current_state(state);
 	spin_unlock_irqrestore(&q->lock, flags);
 }

@@ -144,11 +157,11 @@
 {
 	unsigned long flags;

-	__set_current_state(state);
 	wait->flags |= WQ_FLAG_EXCLUSIVE;
 	spin_lock_irqsave(&q->lock, flags);
 	if (list_empty(&wait->task_list))
 		__add_wait_queue_tail(q, wait);
+	set_current_state(state);
 	spin_unlock_irqrestore(&q->lock, flags);
 }



* Re: prepare_to_wait/waitqueue_active issues in 2.6
  2003-12-14  5:02   ` Linus Torvalds
@ 2003-12-14  5:23     ` Anton Blanchard
  2003-12-14  5:35       ` Linus Torvalds
  0 siblings, 1 reply; 6+ messages in thread
From: Anton Blanchard @ 2003-12-14  5:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

 
Hi,

> Pardon my French, but this patch sure looks like crap.

I agree. Your patch wins, and that comment would even pass akpm best
practices.

> So my preference would be to add the barrier into prepare_to_wait(), along
> with a comment on why it is sometimes needed.  Something like the
> appended.. (which just uses "set_current_state()", since that's what it
> exists for).

And that's pretty much how 2.4 handled the problem (set_task_state() sits
between the waitqueue addition and the test). That still leaves kswapd
with problems, but in low-memory conditions we'll be calling it often, so
one lost wakeup here and there shouldn't matter.

Anton


* Re: prepare_to_wait/waitqueue_active issues in 2.6
  2003-12-14  5:23     ` Anton Blanchard
@ 2003-12-14  5:35       ` Linus Torvalds
  2003-12-14  8:58         ` Anton Blanchard
  0 siblings, 1 reply; 6+ messages in thread
From: Linus Torvalds @ 2003-12-14  5:35 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: Kernel Mailing List



On Sun, 14 Dec 2003, Anton Blanchard wrote:
> > Pardon my French, but this patch sure looks like crap.
>
> I agree. Your patch wins, and that comment would even pass akpm best
> practices.

Heh. I may not comment my code much normally, but for the last few weeks
I've been aiming for one-liners, and making sure that I can read every
single patch and say with some comfort that I can pretty much _guarantee_
that the patch is right.

I'll go back to my lazy ways in 2.7.x, I'm sure.

Anyway, even if I think the patch is "obviously correct", can you do me a
favor and test it on the load that you've seen failing? Just to be anal.

		Linus


* Re: prepare_to_wait/waitqueue_active issues in 2.6
  2003-12-14  5:35       ` Linus Torvalds
@ 2003-12-14  8:58         ` Anton Blanchard
  0 siblings, 0 replies; 6+ messages in thread
From: Anton Blanchard @ 2003-12-14  8:58 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

 
> I'll go back to my lazy ways in 2.7.x, I'm sure.

Phew, you had me worried for a while :)

> Anyway, even if I think the patch is "obviously correct", can you do me a
> favor and test it on the load that you've seen failing? Just to be anal.

Sure, we'll run with this patch and report back.

Anton

