2.6.24 regression: pan hanging unkilleable and un-straceable

* 2.6.24 regression: pan hanging unkilleable and un-straceable
@ 2008-01-21 20:58 Frederik Himpe
  2008-01-22  0:05 ` Nick Piggin
  0 siblings, 1 reply; 19+ messages in thread
From: Frederik Himpe @ 2008-01-21 20:58 UTC (permalink / raw)
  To: linux-kernel

With Linux 2.6.24-rc8 I often have the problem that the pan usenet
reader starts using 100% of CPU time after some time. When this happens,
kill -9 does not work, and strace just hangs when trying to attach to
the process. The same with gdb. ps shows the process as being in the R
state.
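
The one-letter state that ps reports can also be read directly from /proc. A minimal sketch, assuming a Linux /proc filesystem; the helper names parse_stat_state and task_state are illustrative only and do not come from any tool mentioned in this thread:

```python
from pathlib import Path


def parse_stat_state(stat_line: str) -> str:
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST closing parenthesis.
    after_comm = stat_line.rsplit(")", 1)[1]
    # The first field after the command name is the state (R, S, D, ...).
    return after_comm.split()[0]


def task_state(pid: int) -> str:
    """Read the current state of a live process (hypothetical helper)."""
    return parse_stat_state(Path(f"/proc/{pid}/stat").read_text())
```

Calling task_state(8063) on the affected machine would be expected to return "R" while the process is spinning. This only confirms what ps shows; it says nothing about where in the kernel the task is stuck.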

I pressed Ctrl-Alt-SysRq-T, and this was shown for pan:
Jan 21 21:45:01 Anastacia kernel: pan           R  running task        0  8063      1
Jan 21 21:45:01 Anastacia kernel: ssh           S 0000000000000000     0  8323   6809
Jan 21 21:45:01 Anastacia kernel:  ffff81000a51f9c8 0000000000000082 ffff81000ed6dc00 ffffffff8045ad6f
Jan 21 21:45:01 Anastacia kernel:  ffffffff805875b8 ffffffff80623980 ffffffff80623980 ffffffff80623980
Jan 21 21:45:01 Anastacia kernel:  ffffffff8061fe80 ffffffff80623980 ffff81003941b8a8 ffffffff8043142b
Jan 21 21:45:01 Anastacia kernel: Call Trace:
Jan 21 21:45:01 Anastacia kernel:  [arp_bind_neighbour+143/208] arp_bind_neighbour+0x8f/0xd0
Jan 21 21:45:01 Anastacia kernel:  [rt_intern_hash+955/1056] rt_intern_hash+0x3bb/0x420
Jan 21 21:45:01 Anastacia kernel:  [nommu_map_single+56/96] nommu_map_single+0x38/0x60
Jan 21 21:45:01 Anastacia kernel:  [schedule_timeout+149/208] schedule_timeout+0x95/0xd0
Jan 21 21:45:01 Anastacia kernel:  [tty_ldisc_deref+82/128] tty_ldisc_deref+0x52/0x80
Jan 21 21:45:01 Anastacia kernel:  [tty_poll+145/160] tty_poll+0x91/0xa0
Jan 21 21:45:01 Anastacia kernel:  [do_select+1128/1376] do_select+0x468/0x560
Jan 21 21:45:01 Anastacia kernel:  [__pollwait+0/304] __pollwait+0x0/0x130
Jan 21 21:45:01 Anastacia kernel:  [default_wake_function+0/16] default_wake_function+0x0/0x10
Jan 21 21:45:01 Anastacia kernel:last message repeated 2 times
Jan 21 21:45:01 Anastacia kernel:  [enqueue_task+19/48] enqueue_task+0x13/0x30
Jan 21 21:45:01 Anastacia kernel:  [try_to_wake_up+98/720] try_to_wake_up+0x62/0x2d0
Jan 21 21:45:01 Anastacia kernel:  [default_wake_function+0/16] default_wake_function+0x0/0x10
Jan 21 21:45:01 Anastacia kernel:  [tcp_recvmsg+1463/3360] tcp_recvmsg+0x5b7/0xd20
Jan 21 21:45:01 Anastacia kernel:  [__wake_up_common+90/144] __wake_up_common+0x5a/0x90
Jan 21 21:45:01 Anastacia kernel:  [__wake_up+67/112] __wake_up+0x43/0x70
Jan 21 21:45:01 Anastacia kernel:  [n_tty_receive_buf+821/3888] n_tty_receive_buf+0x335/0xf30
Jan 21 21:45:01 Anastacia kernel:  [sock_aio_read+349/368] sock_aio_read+0x15d/0x170
Jan 21 21:45:01 Anastacia kernel:  [core_sys_select+521/768] core_sys_select+0x209/0x300
Jan 21 21:45:01 Anastacia kernel:  [remove_wait_queue+25/96] remove_wait_queue+0x19/0x60
Jan 21 21:45:01 Anastacia kernel:  [__wake_up+67/112] __wake_up+0x43/0x70
Jan 21 21:45:01 Anastacia kernel:  [tty_ldisc_deref+82/128] tty_ldisc_deref+0x52/0x80
Jan 21 21:45:01 Anastacia kernel:  [tty_write+569/592] tty_write+0x239/0x250
Jan 21 21:45:01 Anastacia kernel:  [sys_select+68/448] sys_select+0x44/0x1c0
Jan 21 21:45:01 Anastacia kernel:  [sys_write+83/144] sys_write+0x53/0x90
Jan 21 21:45:01 Anastacia kernel:  [system_call+126/131] system_call+0x7e/0x83

What could be causing this problem?

-- 
Frederik Himpe <fhimpe@telenet.be>


* Re: 2.6.24 regression: pan hanging unkilleable and un-straceable
  2008-01-21 20:58 2.6.24 regression: pan hanging unkilleable and un-straceable Frederik Himpe
@ 2008-01-22  0:05 ` Nick Piggin
  2008-01-22  5:03   ` Mike Galbraith
  2008-01-22 10:37   ` Ingo Molnar
  0 siblings, 2 replies; 19+ messages in thread
From: Nick Piggin @ 2008-01-22  0:05 UTC (permalink / raw)
  To: Frederik Himpe; +Cc: linux-kernel

On Tuesday 22 January 2008 07:58, Frederik Himpe wrote:
> With Linux 2.6.24-rc8 I often have the problem that the pan usenet
> reader starts using 100% of CPU time after some time. When this happens,
> kill -9 does not work, and strace just hangs when trying to attach to
> the process. The same with gdb. ps shows the process as being in the R
> state.
>
> I pressed Ctrl-Alt-SysRq-T, and this was shown for pan:
> Jan 21 21:45:01 Anastacia kernel: pan           R  running task        0 

Well I've twice tried to submit a patch to print stacks for running
tasks as well, but nobody seems interested. It would at least give a
chance to see something.

Can you post a few Sysrq+P traces?
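
When the keyboard combination is awkward to use, SysRq commands can also be requested by writing a single command character to /proc/sysrq-trigger as root. A minimal sketch, assuming magic SysRq support is enabled in the kernel; request_sysrq is a hypothetical helper name, and whether a dump requested this way is as useful as one taken from the keyboard is not something this thread establishes:

```python
def request_sysrq(command: str, trigger_path: str = "/proc/sysrq-trigger") -> None:
    """Ask the kernel to run one SysRq command, e.g. 't' for the task dump."""
    # The trigger file takes exactly one command character per request.
    if len(command) != 1:
        raise ValueError("SysRq commands are single characters")
    # Writing requires root; the output lands in the kernel log (dmesg/syslog).
    with open(trigger_path, "w") as trigger:
        trigger.write(command)
```

For example, request_sysrq("t") asks for the same task list that Ctrl-Alt-SysRq-T produces.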


* Re: 2.6.24 regression: pan hanging unkilleable and un-straceable
  2008-01-22  0:05 ` Nick Piggin
@ 2008-01-22  5:03   ` Mike Galbraith
  2008-01-22  5:25     ` Nick Piggin
  2008-01-22 10:37   ` Ingo Molnar
  1 sibling, 1 reply; 19+ messages in thread
From: Mike Galbraith @ 2008-01-22  5:03 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Frederik Himpe, linux-kernel


On Tue, 2008-01-22 at 11:05 +1100, Nick Piggin wrote:
> On Tuesday 22 January 2008 07:58, Frederik Himpe wrote:
> > With Linux 2.6.24-rc8 I often have the problem that the pan usenet
> > reader starts using 100% of CPU time after some time. When this happens,
> > kill -9 does not work, and strace just hangs when trying to attach to
> > the process. The same with gdb. ps shows the process as being in the R
> > state.
> >
> > I pressed Ctrl-Alt-SysRq-T, and this was shown for pan:
> > Jan 21 21:45:01 Anastacia kernel: pan           R  running task        0 
> 
> Well I've twice tried to submit a patch to print stacks for running
> tasks as well, but nobody seems interested. It would at least give a
> chance to see something.

I've hit same twice recently (not pan, and not repeatable).


* Re: 2.6.24 regression: pan hanging unkilleable and un-straceable
  2008-01-22  5:03   ` Mike Galbraith
@ 2008-01-22  5:25     ` Nick Piggin
  2008-01-22  5:47       ` Mike Galbraith
                         ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Nick Piggin @ 2008-01-22  5:25 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Frederik Himpe, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1071 bytes --]

On Tuesday 22 January 2008 16:03, Mike Galbraith wrote:
> On Tue, 2008-01-22 at 11:05 +1100, Nick Piggin wrote:
> > On Tuesday 22 January 2008 07:58, Frederik Himpe wrote:
> > > With Linux 2.6.24-rc8 I often have the problem that the pan usenet
> > > reader starts using 100% of CPU time after some time. When this
> > > happens, kill -9 does not work, and strace just hangs when trying to
> > > attach to the process. The same with gdb. ps shows the process as
> > > being in the R state.
> > >
> > > I pressed Ctrl-Alt-SysRq-T, and this was shown for pan:
> > > > Jan 21 21:45:01 Anastacia kernel: pan           R  running task        0
> >
> > Well I've twice tried to submit a patch to print stacks for running
> > tasks as well, but nobody seems interested. It would at least give a
> > chance to see something.
>
> I've hit same twice recently (not pan, and not repeatable).

Nasty. The attached patch is something really simple that can sometimes help.
sysrq+p is also an option, if you're on a UP system.

Any luck getting traces?


[-- Attachment #2: show-task-running-stack.patch --]
[-- Type: text/x-diff, Size: 453 bytes --]

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -4920,8 +4920,7 @@ static void show_task(struct task_struct
 	printk(KERN_CONT "%5lu %5d %6d\n", free,
 		task_pid_nr(p), task_pid_nr(p->real_parent));
 
-	if (state != TASK_RUNNING)
-		show_stack(p, NULL);
+	show_stack(p, NULL);
 }
 
 void show_state_filter(unsigned long state_filter)

* Re: 2.6.24 regression: pan hanging unkilleable and un-straceable
  2008-01-22  5:25     ` Nick Piggin
@ 2008-01-22  5:47       ` Mike Galbraith
  2008-02-04 14:49         ` Mike Galbraith
  2008-01-22 10:38       ` Ingo Molnar
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 19+ messages in thread
From: Mike Galbraith @ 2008-01-22  5:47 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Frederik Himpe, linux-kernel


On Tue, 2008-01-22 at 16:25 +1100, Nick Piggin wrote:
> On Tuesday 22 January 2008 16:03, Mike Galbraith wrote:

> > I've hit same twice recently (not pan, and not repeatable).
> 
> Nasty. The attached patch is something really simple that can sometimes help.
> sysrq+p is also an option, if you're on a UP system.

SMP (P4/HT imitating real cores)

> Any luck getting traces?

We'll see.  Armed.

	-Mike


* Re: 2.6.24 regression: pan hanging unkilleable and un-straceable
  2008-01-22  0:05 ` Nick Piggin
  2008-01-22  5:03   ` Mike Galbraith
@ 2008-01-22 10:37   ` Ingo Molnar
  2008-01-22 23:00     ` Nick Piggin
  1 sibling, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2008-01-22 10:37 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Frederik Himpe, linux-kernel


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> On Tuesday 22 January 2008 07:58, Frederik Himpe wrote:
> > With Linux 2.6.24-rc8 I often have the problem that the pan usenet
> > reader starts using 100% of CPU time after some time. When this happens,
> > kill -9 does not work, and strace just hangs when trying to attach to
> > the process. The same with gdb. ps shows the process as being in the R
> > state.
> >
> > I pressed Ctrl-Alt-SysRq-T, and this was shown for pan:
> > Jan 21 21:45:01 Anastacia kernel: pan           R  running task        0 
> 
> Well I've twice tried to submit a patch to print stacks for running 
> tasks as well, but nobody seems interested. It would at least give a 
> chance to see something.

i definitely remember having done this myself a couple of times (it 
makes tons of sense to get _some_ info out of the system) but some 
problem in -mm kept getting it reverted. I don't remember the specifics ... it 
was some race.

	Ingo

* Re: 2.6.24 regression: pan hanging unkilleable and un-straceable
  2008-01-22  5:25     ` Nick Piggin
  2008-01-22  5:47       ` Mike Galbraith
@ 2008-01-22 10:38       ` Ingo Molnar
  2008-01-24  5:30       ` Valdis.Kletnieks
  2008-01-26 13:29       ` Frederik Himpe
  3 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2008-01-22 10:38 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Mike Galbraith, Frederik Himpe, linux-kernel


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -4920,8 +4920,7 @@ static void show_task(struct task_struct
>  	printk(KERN_CONT "%5lu %5d %6d\n", free,
>  		task_pid_nr(p), task_pid_nr(p->real_parent));
>  
> -	if (state != TASK_RUNNING)
> -		show_stack(p, NULL);
> +	show_stack(p, NULL);

thanks - applied to sched-devel.git. We'll see whether it causes any 
problems.

	Ingo

* Re: 2.6.24 regression: pan hanging unkilleable and un-straceable
  2008-01-22 10:37   ` Ingo Molnar
@ 2008-01-22 23:00     ` Nick Piggin
  0 siblings, 0 replies; 19+ messages in thread
From: Nick Piggin @ 2008-01-22 23:00 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Frederik Himpe, linux-kernel

On Tuesday 22 January 2008 21:37, Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > Well I've twice tried to submit a patch to print stacks for running
> > tasks as well, but nobody seems interested. It would at least give a
> > chance to see something.
>
> i definitely remember having done this myself a couple of times (it
> makes tons of sense to get _some_ info out of the system) but some
> problem in -mm kept getting it reverted. I don't remember the specifics ... it
> was some race.

Hmm, that's not unlikely. But there is nothing in the backtrace code
which prevents a task from being woken up anyway, is there? I guess
it will be more common now, but if we find a race we can try to fix
the root cause.

* Re: 2.6.24 regression: pan hanging unkilleable and un-straceable
  2008-01-22  5:25     ` Nick Piggin
  2008-01-22  5:47       ` Mike Galbraith
  2008-01-22 10:38       ` Ingo Molnar
@ 2008-01-24  5:30       ` Valdis.Kletnieks
  2008-01-26 13:29       ` Frederik Himpe
  3 siblings, 0 replies; 19+ messages in thread
From: Valdis.Kletnieks @ 2008-01-24  5:30 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Mike Galbraith, Frederik Himpe, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 653 bytes --]

On Tue, 22 Jan 2008 16:25:58 +1100, Nick Piggin said:
>
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -4920,8 +4920,7 @@ static void show_task(struct task_struct
>  	printk(KERN_CONT "%5lu %5d %6d\n", free,
>  		task_pid_nr(p), task_pid_nr(p->real_parent));
>  
> -	if (state != TASK_RUNNING)
> -		show_stack(p, NULL);
> +	show_stack(p, NULL);
>  }

Maybe something like this would be better?

	if (state == TASK_RUNNING)
		printk("running task, stack trace may be inaccurate\n");
	show_stack(p, NULL);

Just a thought....

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

* Re: 2.6.24 regression: pan hanging unkilleable and un-straceable
  2008-01-22  5:25     ` Nick Piggin
                         ` (2 preceding siblings ...)
  2008-01-24  5:30       ` Valdis.Kletnieks
@ 2008-01-26 13:29       ` Frederik Himpe
  2008-01-26 13:46         ` Nick Piggin
  2008-01-28  1:46         ` Nick Piggin
  3 siblings, 2 replies; 19+ messages in thread
From: Frederik Himpe @ 2008-01-26 13:29 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Mike Galbraith, linux-kernel


On Tue, 2008-01-22 at 16:25 +1100, Nick Piggin wrote:

> > > On Tuesday 22 January 2008 07:58, Frederik Himpe wrote:
> > > > With Linux 2.6.24-rc8 I often have the problem that the pan usenet
> > > > reader starts using 100% of CPU time after some time. When this
> > > > happens, kill -9 does not work, and strace just hangs when trying to
> > > > attach to the process. The same with gdb. ps shows the process as
> > > > being in the R state.
> > > >
> > > > I pressed Ctrl-Alt-SysRq-T, and this was shown for pan:
> > > > Jan 21 21:45:01 Anastacia kernel: pan           R  running task        0

> Nasty. The attached patch is something really simple that can sometimes help.
> sysrq+p is also an option, if you're on a UP system.
> 
> Any luck getting traces?

I just succeeded in reproducing the problem with this patch applied. Does this
smell like an XFS problem?

Jan 26 14:17:43 Anastacia kernel: pan           R  running task        0  7564      1
Jan 26 14:17:43 Anastacia kernel:  000000003f5b3248 0000000000001000 ffffffff880c28b0 0000000000000000
Jan 26 14:17:43 Anastacia kernel:  ffff81003f5b3248 ffff81002d1ed900 000000002d1ed900 0000000000000000
Jan 26 14:17:43 Anastacia kernel:  ffff810016050dd0 fffff000fffff000 0000000000000000 ffff81002d1eda10
Jan 26 14:17:43 Anastacia kernel: Call Trace:
Jan 26 14:17:43 Anastacia kernel:  [_end+127964408/2129947720] :xfs:xfs_get_blocks+0x0/0x10
Jan 26 14:17:43 Anastacia kernel:  [unix_poll+0/176] unix_poll+0x0/0xb0
Jan 26 14:17:43 Anastacia kernel:  [_end+127964408/2129947720] :xfs:xfs_get_blocks+0x0/0x10
Jan 26 14:17:43 Anastacia kernel:  [iov_iter_copy_from_user_atomic+65/160] iov_iter_copy_from_user_atomic+0x41/0xa0
Jan 26 14:17:43 Anastacia kernel:  [iov_iter_copy_from_user_atomic+46/160] iov_iter_copy_from_user_atomic+0x2e/0xa0
Jan 26 14:17:43 Anastacia kernel:  [generic_file_buffered_write+383/1728] generic_file_buffered_write+0x17f/0x6c0
Jan 26 14:17:43 Anastacia kernel:  [current_fs_time+30/48] current_fs_time+0x1e/0x30
Jan 26 14:17:43 Anastacia kernel:  [_end+127997742/2129947720] :xfs:xfs_write+0x676/0x910
Jan 26 14:17:43 Anastacia kernel:  [find_lock_page+61/192] find_lock_page+0x3d/0xc0
Jan 26 14:17:43 Anastacia kernel:  [_end+127981080/2129947720] :xfs:xfs_file_aio_write+0x0/0x50
Jan 26 14:17:43 Anastacia kernel:  [do_sync_readv_writev+203/272] do_sync_readv_writev+0xcb/0x110
Jan 26 14:17:43 Anastacia kernel:  [__do_fault+501/1056] __do_fault+0x1f5/0x420
Jan 26 14:17:43 Anastacia kernel:  [autoremove_wake_function+0/48] autoremove_wake_function+0x0/0x30
Jan 26 14:17:43 Anastacia kernel:  [handle_mm_fault+1344/2048] handle_mm_fault+0x540/0x800
Jan 26 14:17:43 Anastacia kernel:  [rw_copy_check_uvector+157/336] rw_copy_check_uvector+0x9d/0x150
Jan 26 14:17:43 Anastacia kernel:  [do_readv_writev+253/560] do_readv_writev+0xfd/0x230
Jan 26 14:17:43 Anastacia kernel:  [sys_writev+83/144] sys_writev+0x53/0x90
Jan 26 14:17:43 Anastacia kernel:  [system_call+126/131] system_call+0x7e/0x83
Jan 26 14:17:43 Anastacia kernel: 
Jan 26 14:17:43 Anastacia kernel: pan           S 0000000000000000     0  7565      1
Jan 26 14:17:43 Anastacia kernel:  ffff810001401c58 0000000000000086 ffff810001401bb8 ffff81003cd3a280
Jan 26 14:17:43 Anastacia kernel:  ffff81003cd3a300 ffffffff80623980 ffffffff80623980 ffffffff80623980
Jan 26 14:17:43 Anastacia kernel:  ffffffff8061fe80 ffffffff80623980 ffff810001bcc9a8 ffff8100299b34e8
Jan 26 14:17:43 Anastacia kernel: Call Trace:
Jan 26 14:17:43 Anastacia kernel:  [__qdisc_run+173/528] __qdisc_run+0xad/0x210
Jan 26 14:17:43 Anastacia kernel:  [dev_queue_xmit+216/768] dev_queue_xmit+0xd8/0x300
Jan 26 14:17:43 Anastacia kernel:  [futex_wait+838/912] futex_wait+0x346/0x390
Jan 26 14:17:43 Anastacia kernel:  [tcp_connect+851/896] tcp_connect+0x353/0x380
Jan 26 14:17:43 Anastacia kernel:  [tcp_v4_connect+914/1696] tcp_v4_connect+0x392/0x6a0
Jan 26 14:17:43 Anastacia kernel:  [default_wake_function+0/16] default_wake_function+0x0/0x10
Jan 26 14:17:43 Anastacia kernel:  [do_futex+287/3008] do_futex+0x11f/0xbc0
Jan 26 14:17:43 Anastacia kernel:  [_spin_lock_bh+9/32] _spin_lock_bh+0x9/0x20
Jan 26 14:17:43 Anastacia kernel:  [cp_new_stat+229/256] cp_new_stat+0xe5/0x100
Jan 26 14:17:43 Anastacia kernel:  [sys_futex+171/304] sys_futex+0xab/0x130
Jan 26 14:17:43 Anastacia kernel:  [system_call+126/131] system_call+0x7e/0x83
Jan 26 14:17:43 Anastacia kernel: 
Jan 26 14:17:43 Anastacia kernel: pan           S 0000000000000000     0  7566      1
Jan 26 14:17:43 Anastacia kernel:  ffff8100013fdc58 0000000000000086 0000000000a492c0 0000000000a493c8
Jan 26 14:17:43 Anastacia kernel:  0000000000a494d0 ffffffff80623980 ffffffff80623980 ffffffff80623980
Jan 26 14:17:43 Anastacia kernel:  ffffffff8061fe80 ffffffff80623980 ffff810001bcd8a8 ffff8100339756c8
Jan 26 14:17:43 Anastacia kernel: Call Trace:
Jan 26 14:17:43 Anastacia kernel:  [enqueue_entity+55/112] enqueue_entity+0x37/0x70
Jan 26 14:17:43 Anastacia kernel:  [enqueue_task_fair+56/80] enqueue_task_fair+0x38/0x50
Jan 26 14:17:43 Anastacia kernel:  [futex_wait+838/912] futex_wait+0x346/0x390
Jan 26 14:17:43 Anastacia kernel:  [__wake_up+67/112] __wake_up+0x43/0x70
Jan 26 14:17:43 Anastacia kernel:  [wake_futex+57/80] wake_futex+0x39/0x50
Jan 26 14:17:43 Anastacia kernel:  [default_wake_function+0/16] default_wake_function+0x0/0x10
Jan 26 14:17:43 Anastacia kernel:  [do_futex+287/3008] do_futex+0x11f/0xbc0
Jan 26 14:17:43 Anastacia kernel:  [__up_read+33/176] __up_read+0x21/0xb0
Jan 26 14:17:43 Anastacia kernel:  [do_page_fault+411/2000] do_page_fault+0x19b/0x7d0
Jan 26 14:17:43 Anastacia kernel:  [sys_futex+171/304] sys_futex+0xab/0x130
Jan 26 14:17:43 Anastacia kernel:  [system_call+126/131] system_call+0x7e/0x83
Jan 26 14:17:43 Anastacia kernel: 
Jan 26 14:17:43 Anastacia kernel: pan           S 0000000000000000     0  7567      1
Jan 26 14:17:43 Anastacia kernel:  ffff810001409c58 0000000000000086 0000000000a51618 0000000000a51720
Jan 26 14:17:43 Anastacia kernel:  0000000000a51828 ffffffff80623980 ffffffff80623980 ffffffff80623980
Jan 26 14:17:43 Anastacia kernel:  ffffffff8061fe80 ffffffff80623980 ffff8100339758a8 ffff810033974f48
Jan 26 14:17:43 Anastacia kernel: Call Trace:
Jan 26 14:17:43 Anastacia kernel:  [enqueue_entity+55/112] enqueue_entity+0x37/0x70
Jan 26 14:17:43 Anastacia kernel:  [enqueue_task_fair+56/80] enqueue_task_fair+0x38/0x50
Jan 26 14:17:43 Anastacia kernel:  [futex_wait+838/912] futex_wait+0x346/0x390
Jan 26 14:17:43 Anastacia kernel:  [__wake_up+67/112] __wake_up+0x43/0x70
Jan 26 14:17:43 Anastacia kernel:  [wake_futex+57/80] wake_futex+0x39/0x50
Jan 26 14:17:43 Anastacia kernel:  [default_wake_function+0/16] default_wake_function+0x0/0x10
Jan 26 14:17:43 Anastacia kernel:  [do_futex+287/3008] do_futex+0x11f/0xbc0
Jan 26 14:17:43 Anastacia kernel:  [__up_read+33/176] __up_read+0x21/0xb0
Jan 26 14:17:43 Anastacia kernel:  [do_page_fault+411/2000] do_page_fault+0x19b/0x7d0
Jan 26 14:17:43 Anastacia kernel:  [sys_futex+171/304] sys_futex+0xab/0x130
Jan 26 14:17:43 Anastacia kernel:  [system_call+126/131] system_call+0x7e/0x83
Jan 26 14:17:43 Anastacia kernel: 
Jan 26 14:17:43 Anastacia kernel: pan           S 0000000000000000     0  7568      1
Jan 26 14:17:43 Anastacia kernel:  ffff8100013fbc58 0000000000000086 00002aaaabbe4320 ffffffffffffadbc
Jan 26 14:17:43 Anastacia kernel:  0000000000000000 ffffffff80623980 ffffffff80623980 ffffffff80623980
Jan 26 14:17:43 Anastacia kernel:  ffffffff8061fe80 ffffffff80623980 ffff810033975128 0000000001137ad0
Jan 26 14:17:43 Anastacia kernel: Call Trace:
Jan 26 14:17:43 Anastacia kernel:  [futex_wait+838/912] futex_wait+0x346/0x390
Jan 26 14:17:43 Anastacia kernel:  [zone_statistics+125/128] zone_statistics+0x7d/0x80
Jan 26 14:17:43 Anastacia kernel:  [__alloc_pages+170/976] __alloc_pages+0xaa/0x3d0
Jan 26 14:17:43 Anastacia kernel:  [default_wake_function+0/16] default_wake_function+0x0/0x10
Jan 26 14:17:43 Anastacia kernel:  [do_futex+287/3008] do_futex+0x11f/0xbc0
Jan 26 14:17:43 Anastacia kernel:  [__up_read+33/176] __up_read+0x21/0xb0
Jan 26 14:17:43 Anastacia kernel:  [do_page_fault+411/2000] do_page_fault+0x19b/0x7d0
Jan 26 14:17:43 Anastacia kernel:  [sys_futex+171/304] sys_futex+0xab/0x130


SysRq : Show Regs
CPU 0:
Modules linked in: usb_storage af_packet nvidia(P) vboxdrv ipv6 fuse
snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq cpufreq_ondemand
video output tc1100_wmi sbs sbshc container dock battery ac binfmt_misc
loop ext3 jbd dm_mirror sr_mod dm_mod pata_amd ata_generic sata_sil
usbmouse usbhid ff_memless floppy usblp powernow_k8 freq_table
snd_pcm_oss snd_mixer_oss snd_mpu401 snd_mpu401_uart snd_rawmidi ns558
gameport parport_pc snd_seq_device parport rtc_cmos pcspkr snd_intel8x0
k8temp snd_ac97_codec ohci1394 ac97_bus ieee1394 snd_pcm snd_timer skge
ohci_hcd ehci_hcd snd soundcore usbcore forcedeth snd_page_alloc ssb fan
pcmcia pcmcia_core i2c_nforce2 i2c_core button thermal processor sg
evdev genrtc xfs scsi_wait_scan sd_mod sata_nv libata scsi_mod
Pid: 7564, comm: pan Tainted: P        2.6.24-desktop-0.rc8.2.1mdv #1
RIP: 0010:[<ffffffff802d5a57>]  [<ffffffff802d5a57>] block_write_begin+0x87/0xe0
RSP: 0018:ffff81002e9b5ac8  EFLAGS: 00000286
RAX: ffff81003f5b3248 RBX: 00000000fffffff4 RCX: 0000000000000000
RDX: ffff81003f5b3248 RSI: 0000000000000000 RDI: ffff81002d1eda18
RBP: ffff81003f5b3248 R08: 0000000000000000 R09: ffff81002e9b5be0
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
R13: ffffffff880c28b0 R14: 0000000000001000 R15: 000000003f5b3248
FS:  00002b6bb3bf7960(0000) GS:ffffffff80589000(0000) knlGS:00000000f78568d0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002b0537cf6000 CR3: 00000000391d2000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Call Trace:
 [<ffffffff802d5a4f>] block_write_begin+0x7f/0xe0
 [<ffffffff880c2102>] :xfs:xfs_vm_write_begin+0x22/0x30
 [<ffffffff880c28b0>] :xfs:xfs_get_blocks+0x0/0x10
 [<ffffffff80280ba9>] generic_file_buffered_write+0x149/0x6c0
 [<ffffffff80240a2e>] current_fs_time+0x1e/0x30
 [<ffffffff880caae6>] :xfs:xfs_write+0x676/0x910
 [<ffffffff8027f98d>] find_lock_page+0x3d/0xc0
 [<ffffffff880c69d0>] :xfs:xfs_file_aio_write+0x0/0x50
 [<ffffffff802aee9b>] do_sync_readv_writev+0xcb/0x110
 [<ffffffff8028eb95>] __do_fault+0x1f5/0x420
 [<ffffffff802522b0>] autoremove_wake_function+0x0/0x30
 [<ffffffff80290b90>] handle_mm_fault+0x540/0x800
 [<ffffffff802aecdd>] rw_copy_check_uvector+0x9d/0x150
 [<ffffffff802af5fd>] do_readv_writev+0xfd/0x230
 [<ffffffff802afc33>] sys_writev+0x53/0x90
 [<ffffffff8020c36e>] system_call+0x7e/0x83


-- 
Frederik Himpe <fhimpe@telenet.be>


* Re: 2.6.24 regression: pan hanging unkilleable and un-straceable
  2008-01-26 13:29       ` Frederik Himpe
@ 2008-01-26 13:46         ` Nick Piggin
  2008-01-26 14:27           ` Pascal Terjan
  2008-01-28  1:46         ` Nick Piggin
  1 sibling, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2008-01-26 13:46 UTC (permalink / raw)
  To: Frederik Himpe; +Cc: Mike Galbraith, linux-kernel

On Sunday 27 January 2008 00:29, Frederik Himpe wrote:
> On Tue, 2008-01-22 at 16:25 +1100, Nick Piggin wrote:
> > > > On Tuesday 22 January 2008 07:58, Frederik Himpe wrote:
> > > > > With Linux 2.6.24-rc8 I often have the problem that the pan usenet
> > > > > reader starts using 100% of CPU time after some time. When this
> > > > > happens, kill -9 does not work, and strace just hangs when trying
> > > > > to attach to the process. The same with gdb. ps shows the process
> > > > > as being in the R state.
> > > > >
> > > > > I pressed Ctrl-Alt-SysRq-T, and this was shown for pan:
> > > > > Jan 21 21:45:01 Anastacia kernel: pan           R  running task        0
> >
> > Nasty. The attached patch is something really simple that can sometimes
> > help. sysrq+p is also an option, if you're on a UP system.
> >
> > Any luck getting traces?
>
> I just succeeded in reproducing the problem with this patch applied. Does this
> smell like an XFS problem?

Possible. Though I think it is more likely to be a bug in the
new deadlock avoidance code in the generic buffered write path.
Dang... I wonder why this hasn't come up earlier. It looks like
pan's use of writev might be tickling it.
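
For readers unfamiliar with writev: it writes several separate buffers to a file in one system call, which is the path visible in the trace (sys_writev down to generic_file_buffered_write). A minimal sketch of such a gather write follows. pan's actual call pattern is not shown in this thread, so the segments and the helper name gather_write are purely illustrative, and the sketch is not expected to reproduce the hang:

```python
import os


def gather_write(path: str, segments: list) -> int:
    """Write all byte segments to path with one writev() call.

    Returns the number of bytes the kernel reports as written.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # One system call, several user-space buffers (an iovec array).
        return os.writev(fd, segments)
    finally:
        os.close(fd)
```

Running the real application under strace, as requested below, would show the actual segment sizes pan passes to writev.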

How quickly can you reproduce this?

Can you use strace to see what the hanging syscall looks like?

Thanks,
Nick


> Jan 26 14:17:43 Anastacia kernel: pan           R  running task        0  7564      1
> Jan 26 14:17:43 Anastacia kernel:  000000003f5b3248 0000000000001000 ffffffff880c28b0 0000000000000000
> Jan 26 14:17:43 Anastacia kernel:  ffff81003f5b3248 ffff81002d1ed900 000000002d1ed900 0000000000000000
> Jan 26 14:17:43 Anastacia kernel:  ffff810016050dd0 fffff000fffff000 0000000000000000 ffff81002d1eda10
> Jan 26 14:17:43 Anastacia kernel: Call Trace:
> Jan 26 14:17:43 Anastacia kernel:  [_end+127964408/2129947720] :xfs:xfs_get_blocks+0x0/0x10
> Jan 26 14:17:43 Anastacia kernel:  [unix_poll+0/176] unix_poll+0x0/0xb0
> Jan 26 14:17:43 Anastacia kernel:  [_end+127964408/2129947720] :xfs:xfs_get_blocks+0x0/0x10
> Jan 26 14:17:43 Anastacia kernel:  [iov_iter_copy_from_user_atomic+65/160] iov_iter_copy_from_user_atomic+0x41/0xa0
> Jan 26 14:17:43 Anastacia kernel:  [iov_iter_copy_from_user_atomic+46/160] iov_iter_copy_from_user_atomic+0x2e/0xa0
> Jan 26 14:17:43 Anastacia kernel:  [generic_file_buffered_write+383/1728] generic_file_buffered_write+0x17f/0x6c0
> Jan 26 14:17:43 Anastacia kernel:  [current_fs_time+30/48] current_fs_time+0x1e/0x30
> Jan 26 14:17:43 Anastacia kernel:  [_end+127997742/2129947720] :xfs:xfs_write+0x676/0x910
> Jan 26 14:17:43 Anastacia kernel:  [find_lock_page+61/192] find_lock_page+0x3d/0xc0
> Jan 26 14:17:43 Anastacia kernel:  [_end+127981080/2129947720] :xfs:xfs_file_aio_write+0x0/0x50
> Jan 26 14:17:43 Anastacia kernel:  [do_sync_readv_writev+203/272] do_sync_readv_writev+0xcb/0x110
> Jan 26 14:17:43 Anastacia kernel:  [__do_fault+501/1056] __do_fault+0x1f5/0x420
> Jan 26 14:17:43 Anastacia kernel:  [autoremove_wake_function+0/48] autoremove_wake_function+0x0/0x30
> Jan 26 14:17:43 Anastacia kernel:  [handle_mm_fault+1344/2048] handle_mm_fault+0x540/0x800
> Jan 26 14:17:43 Anastacia kernel:  [rw_copy_check_uvector+157/336] rw_copy_check_uvector+0x9d/0x150
> Jan 26 14:17:43 Anastacia kernel:  [do_readv_writev+253/560] do_readv_writev+0xfd/0x230
> Jan 26 14:17:43 Anastacia kernel:  [sys_writev+83/144] sys_writev+0x53/0x90
> Jan 26 14:17:43 Anastacia kernel:  [system_call+126/131] system_call+0x7e/0x83
> Jan 26 14:17:43 Anastacia kernel:


> SysRq : Show Regs
> CPU 0:
> Modules linked in: usb_storage af_packet nvidia(P) vboxdrv ipv6 fuse
> snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq cpufreq_ondemand
> video output tc1100_wmi sbs sbshc container dock battery ac binfmt_misc
> loop ext3 jbd dm_mirror sr_mod dm_mod pata_amd ata_generic sata_sil
> usbmouse usbhid ff_memless floppy usblp powernow_k8 freq_table
> snd_pcm_oss snd_mixer_oss snd_mpu401 snd_mpu401_uart snd_rawmidi ns558
> gameport parport_pc snd_seq_device parport rtc_cmos pcspkr snd_intel8x0
> k8temp snd_ac97_codec ohci1394 ac97_bus ieee1394 snd_pcm snd_timer skge
> ohci_hcd ehci_hcd snd soundcore usbcore forcedeth snd_page_alloc ssb fan
> pcmcia pcmcia_core i2c_nforce2 i2c_core button thermal processor sg
> evdev genrtc xfs scsi_wait_scan sd_mod sata_nv libata scsi_mod
> Pid: 7564, comm: pan Tainted: P        2.6.24-desktop-0.rc8.2.1mdv #1
> RIP: 0010:[<ffffffff802d5a57>]  [<ffffffff802d5a57>] block_write_begin+0x87/0xe0
> RSP: 0018:ffff81002e9b5ac8  EFLAGS: 00000286
> RAX: ffff81003f5b3248 RBX: 00000000fffffff4 RCX: 0000000000000000
> RDX: ffff81003f5b3248 RSI: 0000000000000000 RDI: ffff81002d1eda18
> RBP: ffff81003f5b3248 R08: 0000000000000000 R09: ffff81002e9b5be0
> R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
> R13: ffffffff880c28b0 R14: 0000000000001000 R15: 000000003f5b3248
> FS:  00002b6bb3bf7960(0000) GS:ffffffff80589000(0000) knlGS:00000000f78568d0
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00002b0537cf6000 CR3: 00000000391d2000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>
> Call Trace:
>  [<ffffffff802d5a4f>] block_write_begin+0x7f/0xe0
>  [<ffffffff880c2102>] :xfs:xfs_vm_write_begin+0x22/0x30
>  [<ffffffff880c28b0>] :xfs:xfs_get_blocks+0x0/0x10
>  [<ffffffff80280ba9>] generic_file_buffered_write+0x149/0x6c0
>  [<ffffffff80240a2e>] current_fs_time+0x1e/0x30
>  [<ffffffff880caae6>] :xfs:xfs_write+0x676/0x910
>  [<ffffffff8027f98d>] find_lock_page+0x3d/0xc0
>  [<ffffffff880c69d0>] :xfs:xfs_file_aio_write+0x0/0x50
>  [<ffffffff802aee9b>] do_sync_readv_writev+0xcb/0x110
>  [<ffffffff8028eb95>] __do_fault+0x1f5/0x420
>  [<ffffffff802522b0>] autoremove_wake_function+0x0/0x30
>  [<ffffffff80290b90>] handle_mm_fault+0x540/0x800
>  [<ffffffff802aecdd>] rw_copy_check_uvector+0x9d/0x150
>  [<ffffffff802af5fd>] do_readv_writev+0xfd/0x230
>  [<ffffffff802afc33>] sys_writev+0x53/0x90
>  [<ffffffff8020c36e>] system_call+0x7e/0x83

* Re: 2.6.24 regression: pan hanging unkilleable and un-straceable
  2008-01-26 13:46         ` Nick Piggin
@ 2008-01-26 14:27           ` Pascal Terjan
  2008-01-28  1:49             ` Nick Piggin
  0 siblings, 1 reply; 19+ messages in thread
From: Pascal Terjan @ 2008-01-26 14:27 UTC (permalink / raw)
  To: linux-kernel

Nick Piggin <nickpiggin <at> yahoo.com.au> writes: 
> On Sunday 27 January 2008 00:29, Frederik Himpe wrote:
> > I just succeeded to reproduce the problem with this patch. Does this
> > smell like an XFS problem?

I got the same issue using ext3

> Possible. Though I think it is more likely to be a bug in the
> new deadlock avoidance code in the generic buffered write path.
> Dang... I wonder why this hasn't come up earlier. It looks like
> pan's use of writev might be tickling it.
> 
> How quickly can you reproduce this?

When I was using pan daily one month ago, I got it twice over a week

> Can you use strace to see what the hanging syscall looks like?

I tried for 5 hours last week without luck; I can try again.



* Re: 2.6.24 regression: pan hanging unkilleable and un-straceable
  2008-01-26 13:29       ` Frederik Himpe
  2008-01-26 13:46         ` Nick Piggin
@ 2008-01-28  1:46         ` Nick Piggin
  2008-01-28 18:05           ` Frederik Himpe
  2008-01-31 22:45           ` Frederik Himpe
  1 sibling, 2 replies; 19+ messages in thread
From: Nick Piggin @ 2008-01-28  1:46 UTC (permalink / raw)
  To: Frederik Himpe, Andrew Morton, stable; +Cc: Mike Galbraith, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2137 bytes --]

On Sunday 27 January 2008 00:29, Frederik Himpe wrote:
> On di, 2008-01-22 at 16:25 +1100, Nick Piggin wrote:
> > > > On Tuesday 22 January 2008 07:58, Frederik Himpe wrote:
> > > > > With Linux 2.6.24-rc8 I often have the problem that the pan usenet
> > > > > reader starts using 100% of CPU time after some time. When this
> > > > > happens, kill -9 does not work, and strace just hangs when trying
> > > > > to attach to the process. The same with gdb. ps shows the process
> > > > > as being in the R state.
> > > > >
> > > > > I pressed Ctrl-Alt-SysRq-T, and this was shown for pan:
> > > > > Jan 21 21:45:01 Anastacia kernel: pan           R  running task
> > > > > 0
> >
> > Nasty. The attached patch is something really simple that can sometimes
> > help. sysrq+p is also an option, if you're on a UP system.
> >
> > Any luck getting traces?
>
> I just succeeded to reproduce the problem with this patch. Does this
> smell like an XFS problem?
>
> Jan 26 14:17:43 Anastacia kernel: pan           R  running task        0  7564      1
> Jan 26 14:17:43 Anastacia kernel:  000000003f5b3248 0000000000001000 ffffffff880c28b0 0000000000000000
> Jan 26 14:17:43 Anastacia kernel:  ffff81003f5b3248 ffff81002d1ed900 000000002d1ed900 0000000000000000
> Jan 26 14:17:43 Anastacia kernel:  ffff810016050dd0 fffff000fffff000 0000000000000000 ffff81002d1eda10
> Jan 26 14:17:43 Anastacia kernel: Call Trace:
> Jan 26 14:17:43 Anastacia kernel:  [_end+127964408/2129947720] :xfs:xfs_get_blocks+0x0/0x10
> Jan 26 14:17:43 Anastacia kernel:  [unix_poll+0/176] unix_poll+0x0/0xb0
> Jan 26 14:17:43 Anastacia kernel:  [_end+127964408/2129947720] :xfs:xfs_get_blocks+0x0/0x10
> Jan 26 14:17:43 Anastacia kernel:  [iov_iter_copy_from_user_atomic+65/160] iov_iter_copy_from_user_atomic+0x41/0xa0
> Jan 26 14:17:43 Anastacia kernel:  [iov_iter_copy_from_user_atomic+46/160] iov_iter_copy_from_user_atomic+0x2e/0xa0
> Jan 26 14:17:43 Anastacia kernel:  [generic_file_buffered_write+383/1728]

Well after trying a lot of writev combinations, I've reproduced a hang
*hangs head*.

Does this help?

[-- Attachment #2: mm-zerolen-iov-fix.patch --]
[-- Type: text/x-diff, Size: 1681 bytes --]

Zero length iovecs can go into an infinite loop in writev, because the
iovec iterator does not always advance over them.

The sequence required to trigger this is not trivial. I think it requires
that a zero-length iovec be followed by a non-zero-length iovec which causes
a pagefault in the atomic usercopy. This causes the writev code to drop back
into single-segment copy mode, which then tries to copy the 0 bytes of the
zero-length iovec; a zero length copy looks like a failure though, so it
loops.

Put a test into iov_iter_advance to catch zero-length iovecs. We could just
put the test in the fallback path, but I feel it is more robust to skip
over zero-length iovecs throughout the code (iovec iterator may be used in
filesystems too, so it should be robust).

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1733,7 +1733,11 @@ static void __iov_iter_advance_iov(struc
 		const struct iovec *iov = i->iov;
 		size_t base = i->iov_offset;
 
-		while (bytes) {
+		/*
+		 * The !iov->iov_len check ensures we skip over unlikely
+		 * zero-length segments.
+		 */
+		while (bytes || !iov->iov_len) {
 			int copy = min(bytes, iov->iov_len - base);
 
 			bytes -= copy;
@@ -2251,6 +2255,7 @@ again:
 
 		cond_resched();
 
+		iov_iter_advance(i, copied);
 		if (unlikely(copied == 0)) {
 			/*
 			 * If we were unable to copy any data at all, we must
@@ -2264,7 +2269,6 @@ again:
 						iov_iter_single_seg_count(i));
 			goto again;
 		}
-		iov_iter_advance(i, copied);
 		pos += copied;
 		written += copied;
 


* Re: 2.6.24 regression: pan hanging unkilleable and un-straceable
  2008-01-26 14:27           ` Pascal Terjan
@ 2008-01-28  1:49             ` Nick Piggin
  0 siblings, 0 replies; 19+ messages in thread
From: Nick Piggin @ 2008-01-28  1:49 UTC (permalink / raw)
  To: Pascal Terjan; +Cc: linux-kernel

On Sunday 27 January 2008 01:27, Pascal Terjan wrote:
> Nick Piggin <nickpiggin <at> yahoo.com.au> writes:
> > On Sunday 27 January 2008 00:29, Frederik Himpe wrote:
> > > I just succeeded to reproduce the problem with this patch. Does this
> > > smell like an XFS problem?
>
> I got the same issue using ext3
>
> > Possible. Though I think it is more likely to be a bug in the
> > new deadlock avoidance code in the generic buffered write path.
> > Dang... I wonder why this hasn't come up earlier. It looks like
> > pan's use of writev might be tickling it.
> >
> > How quickly can you reproduce this?
>
> When I was using pan daily one month ago, I got it twice over a week
>
> > Can you use strace to see what the hanging syscall looks like?
>
> I tried last week during 5 hours without luck, I can try again

Dang, I didn't see any reports of this earlier :(

Anyway, I sent a patch to fix it in the original thread. (Can you
reply-to-all, please? It makes it a bit easier to keep threads together.)

Thanks,
Nick


* Re: 2.6.24 regression: pan hanging unkilleable and un-straceable
  2008-01-28  1:46         ` Nick Piggin
@ 2008-01-28 18:05           ` Frederik Himpe
  2008-01-31 22:45           ` Frederik Himpe
  1 sibling, 0 replies; 19+ messages in thread
From: Frederik Himpe @ 2008-01-28 18:05 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, stable, Mike Galbraith, linux-kernel


On ma, 2008-01-28 at 12:46 +1100, Nick Piggin wrote:
> On Sunday 27 January 2008 00:29, Frederik Himpe wrote:
> > On di, 2008-01-22 at 16:25 +1100, Nick Piggin wrote:
> > > > > On Tuesday 22 January 2008 07:58, Frederik Himpe wrote:
> > > > > > With Linux 2.6.24-rc8 I often have the problem that the pan usenet
> > > > > > reader starts using 100% of CPU time after some time. When this
> > > > > > happens, kill -9 does not work, and strace just hangs when trying
> > > > > > to attach to the process. The same with gdb. ps shows the process
> > > > > > as being in the R state.
> > > > > >
> > > > > > I pressed Ctrl-Alt-SysRq-T, and this was shown for pan:
> > > > > > Jan 21 21:45:01 Anastacia kernel: pan           R  running task
> > > > > > 0
> > >
> > > Nasty. The attached patch is something really simple that can sometimes
> > > help. sysrq+p is also an option, if you're on a UP system.
> > >
> > > Any luck getting traces?
> >
> > I just succeeded to reproduce the problem with this patch. Does this
> > smell like an XFS problem?
> >
> > Jan 26 14:17:43 Anastacia kernel: pan           R  running task        0  7564      1
> > Jan 26 14:17:43 Anastacia kernel:  000000003f5b3248 0000000000001000 ffffffff880c28b0 0000000000000000
> > Jan 26 14:17:43 Anastacia kernel:  ffff81003f5b3248 ffff81002d1ed900 000000002d1ed900 0000000000000000
> > Jan 26 14:17:43 Anastacia kernel:  ffff810016050dd0 fffff000fffff000 0000000000000000 ffff81002d1eda10
> > Jan 26 14:17:43 Anastacia kernel: Call Trace:
> > Jan 26 14:17:43 Anastacia kernel:  [_end+127964408/2129947720] :xfs:xfs_get_blocks+0x0/0x10
> > Jan 26 14:17:43 Anastacia kernel:  [unix_poll+0/176] unix_poll+0x0/0xb0
> > Jan 26 14:17:43 Anastacia kernel:  [_end+127964408/2129947720] :xfs:xfs_get_blocks+0x0/0x10
> > Jan 26 14:17:43 Anastacia kernel:  [iov_iter_copy_from_user_atomic+65/160] iov_iter_copy_from_user_atomic+0x41/0xa0
> > Jan 26 14:17:43 Anastacia kernel:  [iov_iter_copy_from_user_atomic+46/160] iov_iter_copy_from_user_atomic+0x2e/0xa0
> > Jan 26 14:17:43 Anastacia kernel:  [generic_file_buffered_write+383/1728]
> 
> Well after trying a lot of writev combinations, I've reproduced a hang
> *hangs head*.
> 
> Does this help?

I'm currently running with this patch. The problem happens about two
times a week, so it will take a few days to come to a conclusion whether
it is fixed. I'll let you all know.

Thanks for the patch!

-- 
Frederik Himpe <fhimpe@telenet.be>



* Re: 2.6.24 regression: pan hanging unkilleable and un-straceable
  2008-01-28  1:46         ` Nick Piggin
  2008-01-28 18:05           ` Frederik Himpe
@ 2008-01-31 22:45           ` Frederik Himpe
  2008-02-02  0:53             ` Nick Piggin
  1 sibling, 1 reply; 19+ messages in thread
From: Frederik Himpe @ 2008-01-31 22:45 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, stable, Mike Galbraith, linux-kernel


On ma, 2008-01-28 at 12:46 +1100, Nick Piggin wrote:
> On Sunday 27 January 2008 00:29, Frederik Himpe wrote:
> > On di, 2008-01-22 at 16:25 +1100, Nick Piggin wrote:
> > > > > On Tuesday 22 January 2008 07:58, Frederik Himpe wrote:
> > > > > > With Linux 2.6.24-rc8 I often have the problem that the pan usenet
> > > > > > reader starts using 100% of CPU time after some time. When this
> > > > > > happens, kill -9 does not work, and strace just hangs when trying
> > > > > > to attach to the process. The same with gdb. ps shows the process
> > > > > > as being in the R state.

> Well after trying a lot of writev combinations, I've reproduced a hang
> *hangs head*.
> 
> Does this help?

Just to confirm: in four days of testing, I haven't seen the problem
anymore, so it looks like this was indeed the right fix.

Thanks!

-- 
Frederik Himpe <fhimpe@telenet.be>




* Re: 2.6.24 regression: pan hanging unkilleable and un-straceable
  2008-01-31 22:45           ` Frederik Himpe
@ 2008-02-02  0:53             ` Nick Piggin
  0 siblings, 0 replies; 19+ messages in thread
From: Nick Piggin @ 2008-02-02  0:53 UTC (permalink / raw)
  To: Frederik Himpe; +Cc: Andrew Morton, stable, Mike Galbraith, linux-kernel

On Friday 01 February 2008 09:45, Frederik Himpe wrote:
> On ma, 2008-01-28 at 12:46 +1100, Nick Piggin wrote:
> > On Sunday 27 January 2008 00:29, Frederik Himpe wrote:
> > > On di, 2008-01-22 at 16:25 +1100, Nick Piggin wrote:
> > > > > > On Tuesday 22 January 2008 07:58, Frederik Himpe wrote:
> > > > > > > With Linux 2.6.24-rc8 I often have the problem that the pan
> > > > > > > usenet reader starts using 100% of CPU time after some time.
> > > > > > > When this happens, kill -9 does not work, and strace just hangs
> > > > > > > when trying to attach to the process. The same with gdb. ps
> > > > > > > shows the process as being in the R state.
> >
> > Well after trying a lot of writev combinations, I've reproduced a hang
> > *hangs head*.
> >
> > Does this help?
>
> Just to confirm: in four days of testing, I haven't seen the problem
> anymore, so it looks like this was indeed the right fix.

Thanks very much for reporting and testing. This patch needs to go
into 2.6.24.stable and upstream.


* Re: 2.6.24 regression: pan hanging unkilleable and un-straceable
  2008-01-22  5:47       ` Mike Galbraith
@ 2008-02-04 14:49         ` Mike Galbraith
  2008-02-04 23:02           ` Nick Piggin
  0 siblings, 1 reply; 19+ messages in thread
From: Mike Galbraith @ 2008-02-04 14:49 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Frederik Himpe, linux-kernel


On Tue, 2008-01-22 at 06:47 +0100, Mike Galbraith wrote:
> On Tue, 2008-01-22 at 16:25 +1100, Nick Piggin wrote:
> > On Tuesday 22 January 2008 16:03, Mike Galbraith wrote:
> 
> > > I've hit same twice recently (not pan, and not repeatable).
> > 
> > Nasty. The attached patch is something really simple that can sometimes help.
> > sysrq+p is also an option, if you're on a UP system.
> 
> SMP (P4/HT imitating real cores)
> 
> > Any luck getting traces?
> 
> We'll see.  Armed.

Hm.  ld just went loopy (but killable) in v2.6.24-6928-g9135f19.  During
kbuild, modpost segfaulted, restart build, ld goes gaga.  Third attempt,
build finished.  Not what I hit before, but mentionable.


[  674.589134] modpost[18588]: segfault at 3e8dc42c ip 0804a96d sp af982920 error 5 in modpost[8048000+9000]
[  674.589211] mm/memory.c:115: bad pgd 3e081163.
[  674.589214] mm/memory.c:115: bad pgd 3e0d2163.
[  674.589217] mm/memory.c:115: bad pgd 3eb01163.

[ 1407.322144]  =======================
[ 1407.322144] ld            R running      0 21963  21962
[ 1407.322144]        db9d7f1c 00200086 c75f9020 b1814300 b0428300 b0428300 b0428300 c75f9280 
[ 1407.322144]        b1814300 00000001 db9d7000 00000000 d08c2f90 dba4f300 00000002 00000000 
[ 1407.322144]        b1810120 dba4f334 00200046 ffffffff db9d7000 c75f9020 db9d7f30 b02f333f 
[ 1407.322144] Call Trace:
[ 1407.322144]  [<b02f333f>] preempt_schedule_irq+0x45/0x5b
[ 1407.322144]  [<b0117a10>] ? do_page_fault+0x0/0x470
[ 1407.322144]  [<b0104d6e>] need_resched+0x1f/0x21
[ 1407.322144]  [<b0117a10>] ? do_page_fault+0x0/0x470
[ 1407.322144]  [<b0117a5c>] ? do_page_fault+0x4c/0x470
[ 1407.322144]  [<b0117a10>] ? do_page_fault+0x0/0x470
[ 1407.322144]  [<b02f4a3a>] ? error_code+0x72/0x78
[ 1407.322144]  [<b02f0000>] ? init_transmeta+0xcf/0x22f <== zzt P4




* Re: 2.6.24 regression: pan hanging unkilleable and un-straceable
  2008-02-04 14:49         ` Mike Galbraith
@ 2008-02-04 23:02           ` Nick Piggin
  0 siblings, 0 replies; 19+ messages in thread
From: Nick Piggin @ 2008-02-04 23:02 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Frederik Himpe, linux-kernel

On Tuesday 05 February 2008 01:49, Mike Galbraith wrote:
> On Tue, 2008-01-22 at 06:47 +0100, Mike Galbraith wrote:
> > On Tue, 2008-01-22 at 16:25 +1100, Nick Piggin wrote:
> > > On Tuesday 22 January 2008 16:03, Mike Galbraith wrote:
> > > > I've hit same twice recently (not pan, and not repeatable).
> > >
> > > Nasty. The attached patch is something really simple that can sometimes
> > > help. sysrq+p is also an option, if you're on a UP system.
> >
> > SMP (P4/HT imitating real cores)
> >
> > > Any luck getting traces?
> >
> > We'll see.  Armed.
>
> Hm.  ld just went loopy (but killable) in v2.6.24-6928-g9135f19.  During
> kbuild, modpost segfaulted, restart build, ld goes gaga.  Third attempt,
> build finished.  Not what I hit before, but mentionable.
>
>
> [  674.589134] modpost[18588]: segfault at 3e8dc42c ip 0804a96d sp af982920 error 5 in modpost[8048000+9000]
> [  674.589211] mm/memory.c:115: bad pgd 3e081163.
> [  674.589214] mm/memory.c:115: bad pgd 3e0d2163.
> [  674.589217] mm/memory.c:115: bad pgd 3eb01163.

Hmm, this _could_ be bad memory. Or if it is very easy to reproduce with
a particular kernel version, then it is probably a memory scribble from
another part of the kernel :(

The first thing to do, I guess, is to run memtest86 for a while if you have
time; that would be easy and helpful.

If that's clean, then I don't have another good option except to bisect
the problem. Turning on DEBUG_VM, DEBUG_SLAB, DEBUG_LIST, DEBUG_PAGEALLOC,
DEBUG_STACKOVERFLOW, DEBUG_RODATA might help catch it sooner... SLAB and
PAGEALLOC could slow you down quite a bit though. And if the problem is
quite reproducible, then obviously don't touch your config ;)

Thanks,
Nick


>
> [ 1407.322144]  =======================
> [ 1407.322144] ld            R running      0 21963  21962
> [ 1407.322144]        db9d7f1c 00200086 c75f9020 b1814300 b0428300 b0428300 b0428300 c75f9280
> [ 1407.322144]        b1814300 00000001 db9d7000 00000000 d08c2f90 dba4f300 00000002 00000000
> [ 1407.322144]        b1810120 dba4f334 00200046 ffffffff db9d7000 c75f9020 db9d7f30 b02f333f
> [ 1407.322144] Call Trace:
> [ 1407.322144]  [<b02f333f>] preempt_schedule_irq+0x45/0x5b
> [ 1407.322144]  [<b0117a10>] ? do_page_fault+0x0/0x470
> [ 1407.322144]  [<b0104d6e>] need_resched+0x1f/0x21
> [ 1407.322144]  [<b0117a10>] ? do_page_fault+0x0/0x470
> [ 1407.322144]  [<b0117a5c>] ? do_page_fault+0x4c/0x470
> [ 1407.322144]  [<b0117a10>] ? do_page_fault+0x0/0x470
> [ 1407.322144]  [<b02f4a3a>] ? error_code+0x72/0x78
> [ 1407.322144]  [<b02f0000>] ? init_transmeta+0xcf/0x22f <== zzt P4


end of thread, other threads:[~2008-02-04 23:03 UTC | newest]

Thread overview: 19+ messages
2008-01-21 20:58 2.6.24 regression: pan hanging unkilleable and un-straceable Frederik Himpe
2008-01-22  0:05 ` Nick Piggin
2008-01-22  5:03   ` Mike Galbraith
2008-01-22  5:25     ` Nick Piggin
2008-01-22  5:47       ` Mike Galbraith
2008-02-04 14:49         ` Mike Galbraith
2008-02-04 23:02           ` Nick Piggin
2008-01-22 10:38       ` Ingo Molnar
2008-01-24  5:30       ` Valdis.Kletnieks
2008-01-26 13:29       ` Frederik Himpe
2008-01-26 13:46         ` Nick Piggin
2008-01-26 14:27           ` Pascal Terjan
2008-01-28  1:49             ` Nick Piggin
2008-01-28  1:46         ` Nick Piggin
2008-01-28 18:05           ` Frederik Himpe
2008-01-31 22:45           ` Frederik Himpe
2008-02-02  0:53             ` Nick Piggin
2008-01-22 10:37   ` Ingo Molnar
2008-01-22 23:00     ` Nick Piggin
