linux-kernel.vger.kernel.org archive mirror
* xfslogd-spinlock bug?
@ 2006-12-11 23:00 Haar János
  2006-12-12 14:32 ` Justin Piszcz
  0 siblings, 1 reply; 17+ messages in thread
From: Haar János @ 2006-12-11 23:00 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-kernel

Hello, list,

I am the "big red button men" with the one big 14TB xfs, if somebody can
remember me. :-)

Now i found something in the 2.6.16.18, and try the 2.6.18.4, and the
2.6.19, but the bug still exists:

Dec 11 22:47:21 dy-base BUG: spinlock bad magic on CPU#3, xfslogd/3/317
Dec 11 22:47:21 dy-base general protection fault: 0000 [1] SMP
Dec 11 22:47:21 dy-base CPU 3
Dec 11 22:47:21 dy-base Modules linked in: nbd rd netconsole e1000 video
Dec 11 22:47:21 dy-base Pid: 317, comm: xfslogd/3 Not tainted 2.6.19 #1
Dec 11 22:47:21 dy-base RIP: 0010:[<ffffffff803f3aba>]  [<ffffffff803f3aba>] spin_bug+0x69/0xdf
Dec 11 22:47:21 dy-base RSP: 0018:ffff81011fb89bc0  EFLAGS: 00010002
Dec 11 22:47:21 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
Dec 11 22:47:21 dy-base RDX: ffffffff808137a0 RSI: 0000000000000082 RDI: 0000000100000000
Dec 11 22:47:21 dy-base RBP: ffff81011fb89be0 R08: 0000000000026a70 R09: 000000006b6b6b6b
Dec 11 22:47:21 dy-base R10: 0000000000000082 R11: ffff81000584d380 R12: ffff8100db92ad80
Dec 11 22:47:21 dy-base R13: ffffffff80642dc6 R14: 0000000000000000 R15: 0000000000000003
Dec 11 22:47:21 dy-base FS:  0000000000000000(0000) GS:ffff81011fc76b90(0000) knlGS:0000000000000000
Dec 11 22:47:21 dy-base CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Dec 11 22:47:21 dy-base CR2: 00002ba007700000 CR3: 0000000108c05000 CR4: 00000000000006e0
Dec 11 22:47:21 dy-base Process xfslogd/3 (pid: 317, threadinfo ffff81011fb88000, task ffff81011fa7f830)
Dec 11 22:47:21 dy-base Stack:
Dec 11 22:47:21 dy-base  ffff81011fb89be0 ffff8100db92ad80 0000000000000000 0000000000000000
Dec 11 22:47:21 dy-base  ffff81011fb89c10 ffffffff803f3bdc 0000000000000282 0000000000000000
Dec 11 22:47:21 dy-base  0000000000000000 0000000000000000 ffff81011fb89c30 ffffffff805e7f2b
Dec 11 22:47:21 dy-base Call Trace:
Dec 11 22:47:21 dy-base  [<ffffffff803f3bdc>] _raw_spin_lock+0x23/0xf1
Dec 11 22:47:21 dy-base  [<ffffffff805e7f2b>] _spin_lock_irqsave+0x11/0x18
Dec 11 22:47:21 dy-base  [<ffffffff80222aab>] __wake_up+0x22/0x50
Dec 11 22:47:21 dy-base  [<ffffffff803c97f9>] xfs_buf_unpin+0x21/0x23
Dec 11 22:47:21 dy-base  [<ffffffff803970a4>] xfs_buf_item_unpin+0x2e/0xa6
Dec 11 22:47:21 dy-base  [<ffffffff803bc460>] xfs_trans_chunk_committed+0xc3/0xf7
Dec 11 22:47:21 dy-base  [<ffffffff803bc4dd>] xfs_trans_committed+0x49/0xde
Dec 11 22:47:21 dy-base  [<ffffffff803b1bde>] xlog_state_do_callback+0x185/0x33f
Dec 11 22:47:21 dy-base  [<ffffffff803b1e9c>] xlog_iodone+0x104/0x131
Dec 11 22:47:22 dy-base  [<ffffffff803c9dae>] xfs_buf_iodone_work+0x1a/0x3e
Dec 11 22:47:22 dy-base  [<ffffffff802399d8>] worker_thread+0x0/0x134
Dec 11 22:47:22 dy-base  [<ffffffff8023937e>] run_workqueue+0xa8/0xf8
Dec 11 22:47:22 dy-base  [<ffffffff803c9d94>] xfs_buf_iodone_work+0x0/0x3e
Dec 11 22:47:22 dy-base  [<ffffffff802399d8>] worker_thread+0x0/0x134
Dec 11 22:47:22 dy-base  [<ffffffff80239ad3>] worker_thread+0xfb/0x134
Dec 11 22:47:22 dy-base  [<ffffffff80223f6c>] default_wake_function+0x0/0xf
Dec 11 22:47:22 dy-base  [<ffffffff802399d8>] worker_thread+0x0/0x134
Dec 11 22:47:22 dy-base  [<ffffffff8023c6e5>] kthread+0xd8/0x10b
Dec 11 22:47:22 dy-base  [<ffffffff802256ac>] schedule_tail+0x45/0xa6
Dec 11 22:47:22 dy-base  [<ffffffff8020a6a8>] child_rip+0xa/0x12
Dec 11 22:47:22 dy-base  [<ffffffff802399d8>] worker_thread+0x0/0x134
Dec 11 22:47:22 dy-base  [<ffffffff8023c60d>] kthread+0x0/0x10b
Dec 11 22:47:22 dy-base  [<ffffffff8020a69e>] child_rip+0x0/0x12
Dec 11 22:47:22 dy-base Code: 8b 83 0c 01 00 00 48 8d 8b 98 02 00 00 41 8b 54 24 04 41 89
Dec 11 22:47:22 dy-base RIP  [<ffffffff803f3aba>] spin_bug+0x69/0xdf
Dec 11 22:47:22 dy-base  RSP <ffff81011fb89bc0>
Dec 11 22:47:22 dy-base Kernel panic - not syncing: Fatal exception
Dec 11 22:47:22 dy-base
Dec 11 22:47:22 dy-base Rebooting in 5 seconds..

After this, the server sometimes reboots normally, but sometimes it hangs:
no console, no sysrq, nothing.

This is a "simple" crash, with no major data loss or anything like that.

Can somebody help me track down the problem?

Thanks,
Janos Haar




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: xfslogd-spinlock bug?
  2006-12-11 23:00 xfslogd-spinlock bug? Haar János
@ 2006-12-12 14:32 ` Justin Piszcz
  2006-12-13  1:11   ` Haar János
  0 siblings, 1 reply; 17+ messages in thread
From: Justin Piszcz @ 2006-12-12 14:32 UTC (permalink / raw)
  To: Haar János; +Cc: linux-xfs, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 5771 bytes --]

I'm not sure what is causing this problem, but I was curious: is this on a
32-bit or 64-bit platform?

Justin.

On Tue, 12 Dec 2006, Haar János wrote:

> Hello, list,
>
> I am the "big red button man" with the one big 14TB XFS, if somebody
> remembers me. :-)
>
> I found something in 2.6.16.18, and tried 2.6.18.4 and 2.6.19, but the bug
> still exists:
>
> Dec 11 22:47:21 dy-base BUG: spinlock bad magic on CPU#3, xfslogd/3/317
..... [oops identical to the one quoted in full above; trimmed]
> Dec 11 22:47:22 dy-base Kernel panic - not syncing: Fatal exception
> Dec 11 22:47:22 dy-base Rebooting in 5 seconds..
>
> After this, the server sometimes reboots normally, but sometimes it hangs:
> no console, no sysrq, nothing.
>
> This is a "simple" crash, with no major data loss or anything like that.
>
> Can somebody help me track down the problem?
>
> Thanks,
> Janos Haar

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: xfslogd-spinlock bug?
  2006-12-12 14:32 ` Justin Piszcz
@ 2006-12-13  1:11   ` Haar János
  2006-12-16 11:19     ` Haar János
  0 siblings, 1 reply; 17+ messages in thread
From: Haar János @ 2006-12-13  1:11 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-xfs, linux-kernel

Hello, Justin,

This is a 64-bit system.

But I don't understand: what is curious about that? :-)

I am not a kernel developer or a C programmer, but the long pointers
tell me it is 64-bit. Or am I on the wrong track? :-)

Anyway, this issue happens about daily, or at most every 2-3 days.
But I have no idea what exactly causes it.
2.6.16.18 was stable for me for a long time, then one day it started
acting up, and it happens more and more often.
That's why I suspect some bad part of the (14TB) FS on the disks.

(This FS has a lot of errors that xfs_repair cannot correct, but it is a
production system and works well, except for this issue.
Some months ago I replaced the parity disk in one of the RAID4 arrays,
and the next day, during the resync process, another disk died.
I almost lost everything, but thanks to RAID4 and mdadm I successfully
recovered a lot of data with the 1-day-older parity-only drive.
This was really bad, and it left some scars on the FS.)

This issue looks to me like a race condition between the CPUs (2x Xeon HT).

Am I right? :-)

Thanks,

Janos




----- Original Message ----- 
From: "Justin Piszcz" <jpiszcz@lucidpixels.com>
To: "Haar János" <djani22@netcenter.hu>
Cc: <linux-xfs@oss.sgi.com>; <linux-kernel@vger.kernel.org>
Sent: Tuesday, December 12, 2006 3:32 PM
Subject: Re: xfslogd-spinlock bug?


I'm not sure what is causing this problem, but I was curious: is this on a
32-bit or 64-bit platform?

Justin.

On Tue, 12 Dec 2006, Haar János wrote:

> Hello, list,
>
> I am the "big red button man" with the one big 14TB XFS, if somebody
> remembers me. :-)
>
> I found something in 2.6.16.18, and tried 2.6.18.4 and 2.6.19, but the bug
> still exists:
>
> Dec 11 22:47:21 dy-base BUG: spinlock bad magic on CPU#3, xfslogd/3/317
..... [oops identical to the one quoted in full in the first message; trimmed]
> Dec 11 22:47:22 dy-base Kernel panic - not syncing: Fatal exception
> Dec 11 22:47:22 dy-base Rebooting in 5 seconds..
>
> After this, the server sometimes reboots normally, but sometimes it hangs:
> no console, no sysrq, nothing.
>
> This is a "simple" crash, with no major data loss or anything like that.
>
> Can somebody help me track down the problem?
>
> Thanks,
> Janos Haar


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: xfslogd-spinlock bug?
  2006-12-13  1:11   ` Haar János
@ 2006-12-16 11:19     ` Haar János
  2006-12-17 22:44       ` David Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Haar János @ 2006-12-16 11:19 UTC (permalink / raw)
  To: Haar János; +Cc: linux-xfs, linux-kernel

Hi

I have some news.

I don't know whether there is a connection between the two messages, but I
can see that the spinlock bug always comes on CPU #3.

Does anybody have any idea?

Thanks

Janos

Dec 16 11:42:48 dy-base BUG: warning at mm/truncate.c:398/invalidate_inode_pages2_range()
Dec 16 11:42:49 dy-base
Dec 16 11:42:49 dy-base Call Trace:
Dec 16 11:42:49 dy-base  [<ffffffff8025b8a2>] invalidate_inode_pages2_range+0x285/0x2b8
Dec 16 11:42:49 dy-base  [<ffffffff805e7ddd>] _spin_unlock+0x9/0xb
Dec 16 11:42:49 dy-base  [<ffffffff8031b98d>] nfs_sync_inode_wait+0x1ab/0x1bd
Dec 16 11:42:49 dy-base  [<ffffffff8025b8e4>] invalidate_inode_pages2+0xf/0x11
Dec 16 11:42:49 dy-base  [<ffffffff80314859>] nfs_revalidate_mapping+0xa0/0x152
Dec 16 11:42:49 dy-base  [<ffffffff80312a07>] nfs_file_read+0x6e/0xbe
Dec 16 11:42:49 dy-base  [<ffffffff80273f99>] do_sync_read+0xe2/0x126
Dec 16 11:42:49 dy-base  [<ffffffff805e7ffc>] unlock_kernel+0x35/0x37
Dec 16 11:42:49 dy-base  [<ffffffff8023c804>] autoremove_wake_function+0x0/0x38
Dec 16 11:42:49 dy-base  [<ffffffff802729e7>] nameidata_to_filp+0x2d/0x3e
Dec 16 11:42:49 dy-base  [<ffffffff8026fd15>] poison_obj+0x27/0x32
Dec 16 11:42:49 dy-base  [<ffffffff8026fee6>] cache_free_debugcheck+0x1c6/0x1d6
Dec 16 11:42:49 dy-base  [<ffffffff8027cb26>] putname+0x37/0x39
Dec 16 11:42:49 dy-base  [<ffffffff80274849>] vfs_read+0xcc/0x172
Dec 16 11:42:49 dy-base  [<ffffffff80274d3e>] sys_pread64+0x55/0x76
Dec 16 11:42:49 dy-base  [<ffffffff802098ee>] system_call+0x7e/0x83
Dec 16 11:42:49 dy-base

....

Dec 16 12:08:36 dy-base BUG: spinlock bad magic on CPU#3, xfslogd/3/317
Dec 16 12:08:36 dy-base general protection fault: 0000 [1] SMP
Dec 16 12:08:36 dy-base CPU 3
Dec 16 12:08:36 dy-base Modules linked in: nbd rd netconsole e1000 video
Dec 16 12:08:36 dy-base Pid: 317, comm: xfslogd/3 Not tainted 2.6.19 #1
Dec 16 12:08:36 dy-base RIP: 0010:[<ffffffff803f3aba>]  [<ffffffff803f3aba>] spin_bug+0x69/0xdf
Dec 16 12:08:36 dy-base RSP: 0018:ffff81011fdedbc0  EFLAGS: 00010002
Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
Dec 16 12:08:36 dy-base RDX: ffffffff807f3be2 RSI: 0000000000000082 RDI: 0000000100000000
Dec 16 12:08:36 dy-base RBP: ffff81011fdedbe0 R08: 0000000000006eb2 R09: 000000006b6b6b6b
Dec 16 12:08:36 dy-base R10: 0000000000000082 R11: ffff81000584d280 R12: ffff810081476098
Dec 16 12:08:36 dy-base R13: ffffffff80642dc6 R14: 0000000000000000 R15: 0000000000000003
Dec 16 12:08:36 dy-base FS:  0000000000000000(0000) GS:ffff81011fc76b90(0000) knlGS:0000000000000000
Dec 16 12:08:36 dy-base CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Dec 16 12:08:36 dy-base CR2: 00002afc7d3ea000 CR3: 0000000117afc000 CR4: 00000000000006e0
Dec 16 12:08:36 dy-base Process xfslogd/3 (pid: 317, threadinfo ffff81011fdec000, task ffff81011fafc140)
Dec 16 12:08:36 dy-base Stack:
Dec 16 12:08:36 dy-base  ffff81011fdedbe0 ffff810081476098 0000000000000000 0000000000000000
Dec 16 12:08:36 dy-base  ffff81011fdedc10 ffffffff803f3bdc 0000000000000282 0000000000000000
Dec 16 12:08:36 dy-base  0000000000000000 0000000000000000 ffff81011fdedc30 ffffffff805e7f2b
Dec 16 12:08:36 dy-base
Dec 16 12:08:36 dy-base Call Trace:
Dec 16 12:08:36 dy-base  [<ffffffff803f3bdc>] _raw_spin_lock+0x23/0xf1
Dec 16 12:08:36 dy-base  [<ffffffff805e7f2b>] _spin_lock_irqsave+0x11/0x18
Dec 16 12:08:36 dy-base  [<ffffffff80222aab>] __wake_up+0x22/0x50
Dec 16 12:08:36 dy-base  [<ffffffff803c97f9>] xfs_buf_unpin+0x21/0x23
Dec 16 12:08:36 dy-base  [<ffffffff803970a4>] xfs_buf_item_unpin+0x2e/0xa6
Dec 16 12:08:36 dy-base  [<ffffffff803bc460>] xfs_trans_chunk_committed+0xc3/0xf7
Dec 16 12:08:37 dy-base  [<ffffffff803bc4dd>] xfs_trans_committed+0x49/0xde
Dec 16 12:08:37 dy-base  [<ffffffff803b1bde>] xlog_state_do_callback+0x185/0x33f
Dec 16 12:08:37 dy-base  [<ffffffff803b1e9c>] xlog_iodone+0x104/0x131
Dec 16 12:08:37 dy-base  [<ffffffff803c9dae>] xfs_buf_iodone_work+0x1a/0x3e
Dec 16 12:08:37 dy-base  [<ffffffff802399d8>] worker_thread+0x0/0x134
Dec 16 12:08:37 dy-base  [<ffffffff8023937e>] run_workqueue+0xa8/0xf8
Dec 16 12:08:37 dy-base  [<ffffffff803c9d94>] xfs_buf_iodone_work+0x0/0x3e
Dec 16 12:08:37 dy-base  [<ffffffff802399d8>] worker_thread+0x0/0x134
Dec 16 12:08:37 dy-base  [<ffffffff80239ad3>] worker_thread+0xfb/0x134
Dec 16 12:08:37 dy-base  [<ffffffff80223f6c>] default_wake_function+0x0/0xf
Dec 16 12:08:37 dy-base  [<ffffffff802399d8>] worker_thread+0x0/0x134
Dec 16 12:08:37 dy-base  [<ffffffff8023c6e5>] kthread+0xd8/0x10b
Dec 16 12:08:37 dy-base  [<ffffffff802256ac>] schedule_tail+0x45/0xa6
Dec 16 12:08:37 dy-base  [<ffffffff8020a6a8>] child_rip+0xa/0x12
Dec 16 12:08:37 dy-base  [<ffffffff802399d8>] worker_thread+0x0/0x134
Dec 16 12:08:37 dy-base  [<ffffffff8023c60d>] kthread+0x0/0x10b
Dec 16 12:08:38 dy-base  [<ffffffff8020a69e>] child_rip+0x0/0x12
Dec 16 12:08:38 dy-base Code: 8b 83 0c 01 00 00 48 8d 8b 98 02 00 00 41 8b 54 24 04 41 89
Dec 16 12:08:38 dy-base RIP  [<ffffffff803f3aba>] spin_bug+0x69/0xdf
Dec 16 12:08:38 dy-base  RSP <ffff81011fdedbc0>
Dec 16 12:08:38 dy-base
Dec 16 12:08:38 dy-base Kernel panic - not syncing: Fatal exception
Dec 16 12:08:38 dy-base
Dec 16 12:08:38 dy-base Rebooting in 5 seconds..




----- Original Message ----- 
From: "Haar János" <djani22@netcenter.hu>
To: "Justin Piszcz" <jpiszcz@lucidpixels.com>
Cc: <linux-xfs@oss.sgi.com>; <linux-kernel@vger.kernel.org>
Sent: Wednesday, December 13, 2006 2:11 AM
Subject: Re: xfslogd-spinlock bug?


> Hello, Justin,
>
> This is a 64-bit system.
>
> But I don't understand: what is curious about that? :-)
>
> I am not a kernel developer or a C programmer, but the long pointers
> tell me it is 64-bit. Or am I on the wrong track? :-)
>
> Anyway, this issue happens about daily, or at most every 2-3 days.
> But I have no idea what exactly causes it.
> 2.6.16.18 was stable for me for a long time, then one day it started
> acting up, and it happens more and more often.
> That's why I suspect some bad part of the (14TB) FS on the disks.
>
> (This FS has a lot of errors that xfs_repair cannot correct, but it is a
> production system and works well, except for this issue.
> Some months ago I replaced the parity disk in one of the RAID4 arrays,
> and the next day, during the resync process, another disk died.
> I almost lost everything, but thanks to RAID4 and mdadm I successfully
> recovered a lot of data with the 1-day-older parity-only drive.
> This was really bad, and it left some scars on the FS.)
>
> This issue looks to me like a race condition between the CPUs (2x Xeon HT).
>
> Am I right? :-)
>
> Thanks,
>
> Janos
>
>
>
>
> ----- Original Message ----- 
> From: "Justin Piszcz" <jpiszcz@lucidpixels.com>
> To: "Haar János" <djani22@netcenter.hu>
> Cc: <linux-xfs@oss.sgi.com>; <linux-kernel@vger.kernel.org>
> Sent: Tuesday, December 12, 2006 3:32 PM
> Subject: Re: xfslogd-spinlock bug?
>
>
> I'm not sure what is causing this problem, but I was curious: is this on a
> 32-bit or 64-bit platform?
>
> Justin.
>
> On Tue, 12 Dec 2006, Haar János wrote:
>
> > Hello, list,
> >
> > I am the "big red button man" with the one big 14TB XFS, if somebody
> > remembers me. :-)
> >
> > I found something in 2.6.16.18, and tried 2.6.18.4 and 2.6.19, but the
> > bug still exists:
> >
> > Dec 11 22:47:21 dy-base BUG: spinlock bad magic on CPU#3, xfslogd/3/317
..... [oops identical to the one quoted in full in the first message; trimmed]
> > Dec 11 22:47:22 dy-base Kernel panic - not syncing: Fatal exception
> > Dec 11 22:47:22 dy-base Rebooting in 5 seconds..
> >
> > After this, the server sometimes reboots normally, but sometimes it
> > hangs: no console, no sysrq, nothing.
> >
> > This is a "simple" crash, with no major data loss or anything like that.
> >
> > Can somebody help me track down the problem?
> >
> > Thanks,
> > Janos Haar
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in

> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: xfslogd-spinlock bug?
  2006-12-16 11:19     ` Haar János
@ 2006-12-17 22:44       ` David Chinner
  2006-12-17 23:56         ` Haar János
  0 siblings, 1 reply; 17+ messages in thread
From: David Chinner @ 2006-12-17 22:44 UTC (permalink / raw)
  To: Haar János; +Cc: linux-xfs, linux-kernel

On Sat, Dec 16, 2006 at 12:19:45PM +0100, Haar János wrote:
> Hi
> 
> I have some news.
> 
> I don't know whether there is a connection between the two messages, but I
> can see that the spinlock bug always comes on CPU #3.
> 
> Does anybody have any idea?

Your disk interrupts are directed to CPU 3, and so log I/O completion
occurs on that CPU.

> Dec 16 12:08:36 dy-base BUG: spinlock bad magic on CPU#3, xfslogd/3/317
> Dec 16 12:08:36 dy-base general protection fault: 0000 [1]
> Dec 16 12:08:36 dy-base SMP
> Dec 16 12:08:36 dy-base
> Dec 16 12:08:36 dy-base CPU 3
> Dec 16 12:08:36 dy-base
> Dec 16 12:08:36 dy-base Modules linked in:
> Dec 16 12:08:36 dy-base  nbd

Are you using XFS on a NBD?

> Dec 16 12:08:36 dy-base  rd
> Dec 16 12:08:36 dy-base  netconsole
> Dec 16 12:08:36 dy-base  e1000
> Dec 16 12:08:36 dy-base  video
> Dec 16 12:08:36 dy-base
> Dec 16 12:08:36 dy-base Pid: 317, comm: xfslogd/3 Not tainted 2.6.19 #1
> Dec 16 12:08:36 dy-base RIP: 0010:[<ffffffff803f3aba>]
> Dec 16 12:08:36 dy-base  [<ffffffff803f3aba>] spin_bug+0x69/0xdf
> Dec 16 12:08:36 dy-base RSP: 0018:ffff81011fdedbc0  EFLAGS: 00010002
> Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX:
                                                     ^^^^^^^^^^^^^^^^
Anyone recognise that pattern?

> Dec 16 12:08:36 dy-base Call Trace:
> Dec 16 12:08:36 dy-base  [<ffffffff803f3bdc>] _raw_spin_lock+0x23/0xf1
> Dec 16 12:08:36 dy-base  [<ffffffff805e7f2b>] _spin_lock_irqsave+0x11/0x18
> Dec 16 12:08:36 dy-base  [<ffffffff80222aab>] __wake_up+0x22/0x50
> Dec 16 12:08:36 dy-base  [<ffffffff803c97f9>] xfs_buf_unpin+0x21/0x23
> Dec 16 12:08:36 dy-base  [<ffffffff803970a4>] xfs_buf_item_unpin+0x2e/0xa6

This implies a spinlock inside a wait_queue_head_t is corrupt.
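
For reference, a simplified sketch of the pieces involved, written from
memory of the 2.6-era sources, so treat it as illustration rather than
exact code:

	/* include/linux/wait.h */
	struct __wait_queue_head {
		spinlock_t		lock;	/* the lock spin_bug() complains about */
		struct list_head	task_list;
	};
	typedef struct __wait_queue_head wait_queue_head_t;

	/* lib/spinlock_debug.c: with CONFIG_DEBUG_SPINLOCK, every lock
	 * carries a magic value (SPINLOCK_MAGIC, 0xdead4ead) that is
	 * verified before each acquisition; other checks omitted here. */
	static void debug_spin_lock_before(spinlock_t *lock)
	{
		SPIN_BUG_ON(lock->magic != SPINLOCK_MAGIC, lock, "bad magic");
	}

So when __wake_up() takes the waitqueue lock embedded in the xfs_buf_t and
the magic value has been overwritten, you get exactly this "spinlock bad
magic" report.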

What type of system do you have, and what sort of
workload are you running?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: xfslogd-spinlock bug?
  2006-12-17 22:44       ` David Chinner
@ 2006-12-17 23:56         ` Haar János
  2006-12-18  6:24           ` David Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Haar János @ 2006-12-17 23:56 UTC (permalink / raw)
  To: David Chinner; +Cc: linux-xfs, linux-kernel


----- Original Message ----- 
From: "David Chinner" <dgc@sgi.com>
To: "Haar János" <djani22@netcenter.hu>
Cc: <linux-xfs@oss.sgi.com>; <linux-kernel@vger.kernel.org>
Sent: Sunday, December 17, 2006 11:44 PM
Subject: Re: xfslogd-spinlock bug?


> On Sat, Dec 16, 2006 at 12:19:45PM +0100, Haar János wrote:
> > Hi
> >
> > I have some news.
> >
> > I don't know whether there is a connection between the two messages, but
> > I can see that the spinlock bug always comes on CPU #3.
> >
> > Does anybody have any idea?
>
> Your disk interrupts are directed to CPU 3, and so log I/O completion
> occurs on that CPU.

           CPU0       CPU1       CPU2       CPU3
  0:        100          0          0    4583704   IO-APIC-edge      timer
  1:          0          0          0          2   IO-APIC-edge      i8042
  4:          0          0          0    3878668   IO-APIC-edge      serial
  8:          0          0          0          0   IO-APIC-edge      rtc
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 12:          0          0          0          3   IO-APIC-edge      i8042
 14:    3072118          0          0        181   IO-APIC-edge      ide0
 16:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb2
 18:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
 19:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 23:          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb1
 52:          0          0          0  213052723   IO-APIC-fasteoi   eth1
 53:          0          0          0   91913759   IO-APIC-fasteoi   eth2
100:          0          0          0   16776910   IO-APIC-fasteoi   eth0
NMI:      42271      43187      42234      43168
LOC:    4584247    4584219    4584215    4584198
ERR:          0

Maybe....
I have 3 XFS filesystems on this system, from 3 sources:

1. 200GB on one IDE HDD.
2. 2x200GB mirror on 1 IDE + 1 SATA HDD.
3. 4x3.3TB stripe on NBD.

The NBD is served through eth1, which is on CPU3, but ide0 is on CPU0.


>
> > Dec 16 12:08:36 dy-base BUG: spinlock bad magic on CPU#3, xfslogd/3/317
> > Dec 16 12:08:36 dy-base general protection fault: 0000 [1]
> > Dec 16 12:08:36 dy-base SMP
> > Dec 16 12:08:36 dy-base
> > Dec 16 12:08:36 dy-base CPU 3
> > Dec 16 12:08:36 dy-base
> > Dec 16 12:08:36 dy-base Modules linked in:
> > Dec 16 12:08:36 dy-base  nbd
>
> Are you using XFS on a NBD?

Yes, on the 3rd one.
I have used it for about 1.5 years.

(The NBD deadlock was fixed on my system, thanks to Herbert Xu, in 2.6.14.)

>
> > Dec 16 12:08:36 dy-base  rd
> > Dec 16 12:08:36 dy-base  netconsole
> > Dec 16 12:08:36 dy-base  e1000
> > Dec 16 12:08:36 dy-base  video
> > Dec 16 12:08:36 dy-base
> > Dec 16 12:08:36 dy-base Pid: 317, comm: xfslogd/3 Not tainted 2.6.19 #1
> > Dec 16 12:08:36 dy-base RIP: 0010:[<ffffffff803f3aba>]
> > Dec 16 12:08:36 dy-base  [<ffffffff803f3aba>] spin_bug+0x69/0xdf
> > Dec 16 12:08:36 dy-base RSP: 0018:ffff81011fdedbc0  EFLAGS: 00010002
> > Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX:
>                                                      ^^^^^^^^^^^^^^^^
> Anyone recognise that pattern?

I think I have one idea.
This issue sometimes stops the 5-second automatic restart after a crash,
which points to possible memory corruption, especially if the bug occurs in
the IRQ handling.... :-)
I have a lot of logs of this issue, and RAX and RBX are always the same.

>
> > Dec 16 12:08:36 dy-base Call Trace:
> > Dec 16 12:08:36 dy-base  [<ffffffff803f3bdc>] _raw_spin_lock+0x23/0xf1
> > Dec 16 12:08:36 dy-base  [<ffffffff805e7f2b>]
_spin_lock_irqsave+0x11/0x18
> > Dec 16 12:08:36 dy-base  [<ffffffff80222aab>] __wake_up+0x22/0x50
> > Dec 16 12:08:36 dy-base  [<ffffffff803c97f9>] xfs_buf_unpin+0x21/0x23
> > Dec 16 12:08:36 dy-base  [<ffffffff803970a4>]
xfs_buf_item_unpin+0x2e/0xa6
>
> This implies a spinlock inside a wait_queue_head_t is corrupt.
>
> What are you type of system do you have, and what sort of
> workload are you running?

OS: Fedora 5, 64-bit.
HW: dual Xeon with HT, 4GB RAM.
(The min_free_kbytes limit is set to 128000, because sometimes the e1000
driver runs out of reserved memory during IRQ handling.)

Workload:

I use this system for free web storage.
(2x Apache 2.0.xx, 12x pure-ftpd, 2x MySQL, but SQL only uses FS #2.)

The normal system load is ~20-40, but currently I have a little problem with
Apache: it sometimes starts to read a lot from the big XFS device and eats
all the memory, and the load rises to 700-800.
At that point I restart httpd and everything goes back to normal, but if I
am offline.....

Thanks a lot!

Janos

>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: xfslogd-spinlock bug?
  2006-12-17 23:56         ` Haar János
@ 2006-12-18  6:24           ` David Chinner
  2006-12-18  8:17             ` Haar János
  0 siblings, 1 reply; 17+ messages in thread
From: David Chinner @ 2006-12-18  6:24 UTC (permalink / raw)
  To: Haar János; +Cc: David Chinner, linux-xfs, linux-kernel

On Mon, Dec 18, 2006 at 12:56:41AM +0100, Haar János wrote:
> > On Sat, Dec 16, 2006 at 12:19:45PM +0100, Haar János wrote:
> > > I don't know whether there is a connection between the two messages,
> > > but I can see that the spinlock bug always comes on CPU #3.
> > >
> > > Does anybody have any idea?
> >
> > Your disk interrupts are directed to CPU 3, and so log I/O completion
> > occurs on that CPU.
> 
>            CPU0       CPU1       CPU2       CPU3
>   0:        100          0          0    4583704   IO-APIC-edge      timer
>   1:          0          0          0          2   IO-APIC-edge      i8042
>   4:          0          0          0    3878668   IO-APIC-edge      serial
.....
>  14:    3072118          0          0        181   IO-APIC-edge      ide0
.....
>  52:          0          0          0  213052723   IO-APIC-fasteoi   eth1
>  53:          0          0          0   91913759   IO-APIC-fasteoi   eth2
> 100:          0          0          0   16776910   IO-APIC-fasteoi   eth0
....
> 
> Maybe....
> I have 3 XFS filesystems on this system, from 3 sources:
> 
> 1. 200GB on one IDE HDD.
> 2. 2x200GB mirror on 1 IDE + 1 SATA HDD.
> 3. 4x3.3TB stripe on NBD.
> 
> The NBD is served through eth1, which is on CPU3, but ide0 is on CPU0.

I'd say your NBD-based XFS filesystem is having trouble.

> > Are you using XFS on a NBD?
> 
> Yes, on the 3rd one.

Ok, I've never heard of a problem like this before, and you are doing
something that very few people are doing (i.e. XFS on NBD). Hence I'd
start by suspecting a bug in the NBD driver.

> > > Dec 16 12:08:36 dy-base RSP: 0018:ffff81011fdedbc0  EFLAGS: 00010002
> > > Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX:
> >                                                      ^^^^^^^^^^^^^^^^
> > Anyone recognise that pattern?
> 
> I think I have one idea.
> This issue sometimes stops the 5-second automatic restart after a crash,
> which points to possible memory corruption, especially if the bug occurs
> in the IRQ handling.... :-)
> I have a lot of logs of this issue, and RAX and RBX are always the same.

And is this the only place where you see the problem? Or are there
other stack traces that you see this in as well?

> > This implies a spinlock inside a wait_queue_head_t is corrupt.
> >
> > What are you type of system do you have, and what sort of
> > workload are you running?
> 
> OS: Fedora 5, 64-bit.
> HW: dual Xeon with HT, 4GB RAM.
> (The min_free_kbytes limit is set to 128000, because sometimes the e1000
> driver runs out of reserved memory during IRQ handling.)

That does not sound good. What happens when it does run out of memory?
Is that when you start to see the above corruptions?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: xfslogd-spinlock bug?
  2006-12-18  6:24           ` David Chinner
@ 2006-12-18  8:17             ` Haar János
  2006-12-18 22:36               ` David Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Haar János @ 2006-12-18  8:17 UTC (permalink / raw)
  To: David Chinner; +Cc: dgc, linux-xfs, linux-kernel


----- Original Message ----- 
From: "David Chinner" <dgc@sgi.com>
To: "Haar János" <djani22@netcenter.hu>
Cc: "David Chinner" <dgc@sgi.com>; <linux-xfs@oss.sgi.com>;
<linux-kernel@vger.kernel.org>
Sent: Monday, December 18, 2006 7:24 AM
Subject: Re: xfslogd-spinlock bug?


> On Mon, Dec 18, 2006 at 12:56:41AM +0100, Haar János wrote:
> > > On Sat, Dec 16, 2006 at 12:19:45PM +0100, Haar János wrote:
> > > > I don't know whether there is a connection between the two messages,
> > > > but I can see that the spinlock bug always comes on CPU #3.
> > > >
> > > > Does anybody have any idea?
> > >
> > > Your disk interrupts are directed to CPU 3, and so log I/O completion
> > > occurs on that CPU.
> >
> >            CPU0       CPU1       CPU2       CPU3
> >   0:        100          0          0    4583704   IO-APIC-edge      timer
> >   1:          0          0          0          2   IO-APIC-edge      i8042
> >   4:          0          0          0    3878668   IO-APIC-edge      serial
> .....
> >  14:    3072118          0          0        181   IO-APIC-edge      ide0
> .....
> >  52:          0          0          0  213052723   IO-APIC-fasteoi   eth1
> >  53:          0          0          0   91913759   IO-APIC-fasteoi   eth2
> > 100:          0          0          0   16776910   IO-APIC-fasteoi   eth0
> ....
> >
> > Maybe....
> > I have 3 XFS filesystems on this system, from 3 sources:
> >
> > 1. 200GB on one IDE HDD.
> > 2. 2x200GB mirror on 1 IDE + 1 SATA HDD.
> > 3. 4x3.3TB stripe on NBD.
> >
> > The NBD is served through eth1, which is on CPU3, but ide0 is on CPU0.
>
> I'd say your NBD-based XFS filesystem is having trouble.
>
> > > Are you using XFS on a NBD?
> >
> > Yes, on the 3rd one.
>
> Ok, I've never heard of a problem like this before, and you are doing
> something that very few people are doing (i.e. XFS on NBD). Hence I'd
> start by suspecting a bug in the NBD driver.

Ok, if you are right, this can also be connected with the following issue:

http://download.netcenter.hu/bughunt/20061217/messages.txt   (10KB)


>
> > > > Dec 16 12:08:36 dy-base RSP: 0018:ffff81011fdedbc0  EFLAGS: 00010002
> > > > Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX:
> > >                                                      ^^^^^^^^^^^^^^^^
> > > Anyone recognise that pattern?
> >
> > I think I have one idea.
> > This issue sometimes stops the 5-second automatic restart after a crash,
> > which points to possible memory corruption, especially if the bug occurs
> > in the IRQ handling.... :-)
> > I have a lot of logs of this issue, and RAX and RBX are always the same.
>
> And is this the only place where you see the problem? Or are there
> other stack traces that you see this in as well?

I used 2.6.16.18 for a long time, and it was stable, except for this
issue (~20 dumps with xfslogd).
Then I tried the new releases, and now I have more. :-)

What do you mean exactly?
I can look in the logs, but search for what?
The RAX/RBX thing, the xfslogd-spinlock problem, or the old NBD deadlock +
memory corruption?

[root@NetCenter netlog]# grep "0000000000000033" messages*
messages.1:Dec 11 22:47:21 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 12 18:16:35 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 13 11:40:05 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 14 22:25:32 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 15 06:24:44 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.11:Oct  3 19:49:44 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.11:Oct  7 01:11:17 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.13:Sep 21 15:35:31 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.15:Sep  3 16:13:35 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.15:Sep  5 21:00:38 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.2:Dec  9 00:10:47 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.2:Dec  9 14:07:01 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.2:Dec 10 04:44:48 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.3:Nov 30 10:59:21 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.3:Dec  2 00:54:23 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.5:Nov 13 10:44:49 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.5:Nov 14 03:14:14 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.5:Nov 14 03:37:07 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.5:Nov 15 01:39:54 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov  6 14:48:54 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov  7 04:36:13 dy-base RAX: 0000000000000033 RBX: ffff8100057d2080 RCX: ffff810050d638f8
messages.6:Nov  7 04:36:13 dy-base RDX: 0000000000000008 RSI: 0000000000012cff RDI: 0000000000000033
messages.6:Nov  7 11:12:06 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov  8 03:20:38 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov  8 15:02:16 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov  8 15:27:12 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov 10 15:29:43 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov 11 20:44:14 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.9:Oct 18 15:31:02 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.9:Oct 19 13:53:24 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000


>
> > > This implies a spinlock inside a wait_queue_head_t is corrupt.
> > >
> > > What type of system do you have, and what sort of
> > > workload are you running?
> >
> > OS: Fedora 5, 64-bit.
> > HW: dual Xeon with HT, 4GB RAM.
> > (The min_free_kbytes limit is set to 128000, because sometimes the e1000
> > driver runs out of reserved memory during IRQ handling.)
>
> That does not sound good. What happens when it does run out of memory?



This is an old problem, on 2.6.16.18.
The default min_free_kbytes is 38xx, and the GigE controller can easily
overflow this small reserve.
If that happens, the system freezes, and I can only use the serial console +
sysrq to dump a stack trace:

download.netcenter.hu/bughunt/20060530/261618-good.txt
download.netcenter.hu/bughunt/20060530/dmesg.txt
download.netcenter.hu/bughunt/20060530/dump.txt

This problem is already fixed by setting min_free_kbytes to 128000 (~128MB).

> Is that when you start to see the above corruptions?

I think not, but I am not 100% sure.

Cheers,

Janos

>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: xfslogd-spinlock bug?
  2006-12-18  8:17             ` Haar János
@ 2006-12-18 22:36               ` David Chinner
  2006-12-18 23:39                 ` Haar János
  0 siblings, 1 reply; 17+ messages in thread
From: David Chinner @ 2006-12-18 22:36 UTC (permalink / raw)
  To: Haar János; +Cc: David Chinner, linux-xfs, linux-kernel

On Mon, Dec 18, 2006 at 09:17:50AM +0100, Haar János wrote:
> From: "David Chinner" <dgc@sgi.com>
> > > The NBD is served through eth1, which is on CPU3, but ide0 is on CPU0.
> >
> > I'd say your NBD-based XFS filesystem is having trouble.
> >
> > > > Are you using XFS on a NBD?
> > >
> > > Yes, on the 3rd one.
> >
> > Ok, I've never heard of a problem like this before, and you are doing
> > something that very few people are doing (i.e. XFS on NBD). Hence I'd
> > start by suspecting a bug in the NBD driver.
> 
> Ok, if you are right, this can also be connected with the following issue:
> 
> http://download.netcenter.hu/bughunt/20061217/messages.txt   (10KB)

Which appears to be a crash in wake_up_process() when doing memory
reclaim (waking the xfsbufd).

> > > > > Dec 16 12:08:36 dy-base RSP: 0018:ffff81011fdedbc0  EFLAGS: 00010002
> > > > > Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b
> RCX:
> > > >                                                      ^^^^^^^^^^^^^^^^
> > > > Anyone recognise that pattern?

Ok, I've found this pattern:

#define POISON_FREE 0x6b

Can you confirm that you are running with CONFIG_DEBUG_SLAB=y?

If so, we have a use after free occurring here and it would also
explain why no-one has reported it before.
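
To spell out what that value means, here is a simplified sketch of the slab
debug poisoning, written from memory of 2.6-era include/linux/poison.h and
mm/slab.c (take the details as approximate):

	#define POISON_INUSE	0x5a	/* written into an object on allocation */
	#define POISON_FREE	0x6b	/* written into an object on free */
	#define POISON_END	0xa5	/* end-byte of the poisoned region */

	/* With CONFIG_DEBUG_SLAB=y, freeing an object overwrites it: */
	static void poison_obj(struct kmem_cache *cachep, void *addr,
			       unsigned char val)
	{
		int size = obj_size(cachep);

		memset(addr, val, size);
		*(unsigned char *)(addr + size - 1) = POISON_END;
	}

So RBX: 6b6b6b6b6b6b6b6b is a pointer-sized value that was loaded from an
object after kmem_cache_free() had poisoned it - a classic use-after-free
signature.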

FWIW, can you turn on CONFIG_XFS_DEBUG=y and see if that triggers
a different bug check prior to the above dump?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: xfslogd-spinlock bug?
  2006-12-18 22:36               ` David Chinner
@ 2006-12-18 23:39                 ` Haar János
  2006-12-19  2:52                   ` David Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Haar János @ 2006-12-18 23:39 UTC (permalink / raw)
  To: David Chinner; +Cc: dgc, linux-xfs, linux-kernel


----- Original Message ----- 
From: "David Chinner" <dgc@sgi.com>
To: "Haar János" <djani22@netcenter.hu>
Cc: "David Chinner" <dgc@sgi.com>; <linux-xfs@oss.sgi.com>;
<linux-kernel@vger.kernel.org>
Sent: Monday, December 18, 2006 11:36 PM
Subject: Re: xfslogd-spinlock bug?


> On Mon, Dec 18, 2006 at 09:17:50AM +0100, Haar János wrote:
> > From: "David Chinner" <dgc@sgi.com>
> > > > The NBD is served through eth1, which is on CPU3, but ide0 is on
> > > > CPU0.
> > >
> > > I'd say your NBD-based XFS filesystem is having trouble.
> > >
> > > > > Are you using XFS on a NBD?
> > > >
> > > > Yes, on the 3rd one.
> > >
> > > Ok, I've never heard of a problem like this before, and you are doing
> > > something that very few people are doing (i.e. XFS on NBD). Hence I'd
> > > start by suspecting a bug in the NBD driver.
> >
> > Ok, if you are right, this can also be connected with the following
> > issue:
> >
> > http://download.netcenter.hu/bughunt/20061217/messages.txt   (10KB)
>
> Which appears to be a crash in wake_up_process() when doing memory
> reclaim (waking the xfsbufd).

Sorry, can you translate that into "poor man's language"? :-)
Is this a different bug?


>
> > > > > > Dec 16 12:08:36 dy-base RSP: 0018:ffff81011fdedbc0  EFLAGS: 00010002
> > > > > > Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX:
> > > > >                                                      ^^^^^^^^^^^^^^^^
> > > > > Anyone recognise that pattern?
>
> Ok, I've found this pattern:
>
> #define POISON_FREE 0x6b
>
> Can you confirm that you are running with CONFIG_DEBUG_SLAB=y?

Yes, I built with this option enabled.
Is that wrong?

>
> If so, we have a use after free occurring here and it would also
> explain why no-one has reported it before.
>
> FWIW, can you turn on CONFIG_XFS_DEBUG=y and see if that triggers
> a different bug check prior to the above dump?

[root@X64 linux-2.6.19]# make bzImage
scripts/kconfig/conf -s arch/x86_64/Kconfig
.config:7:warning: trying to assign nonexistent symbol XFS_DEBUG

Have I missed something?


Thanks,

Janos

>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: xfslogd-spinlock bug?
  2006-12-18 23:39                 ` Haar János
@ 2006-12-19  2:52                   ` David Chinner
  2006-12-19  4:47                     ` David Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: David Chinner @ 2006-12-19  2:52 UTC (permalink / raw)
  To: Haar János; +Cc: David Chinner, linux-xfs, linux-kernel

On Tue, Dec 19, 2006 at 12:39:46AM +0100, Haar János wrote:
> From: "David Chinner" <dgc@sgi.com>
> > On Mon, Dec 18, 2006 at 09:17:50AM +0100, Haar János wrote:
> > > From: "David Chinner" <dgc@sgi.com>
> > > > Ok, I've never heard of a problem like this before, and you are doing
> > > > something that very few people are doing (i.e. XFS on NBD). Hence I'd
> > > > start by suspecting a bug in the NBD driver.
> > >
> > > Ok, if you are right, this could also be connected with the following
> > > issue:
> > >
> > > http://download.netcenter.hu/bughunt/20061217/messages.txt   (10KB)
> >
> > Which appears to be a crash in wake_up_process() when doing memory
> > reclaim (waking the xfsbufd).
> 
> Sorry, can you translate that into "poor man's language"? :-)
> Is this a different bug?

Don't know - it's a different crash, but once again one that I've
never heard of occurring before.

> > Ok, I've found this pattern:
> >
> > #define POISON_FREE 0x6b
> >
> > Can you confirm that you are running with CONFIG_DEBUG_SLAB=y?
> 
> Yes, I built with this option enabled.
> Is this wrong?

No, but it does slow your machine down.
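
For reference, the mechanism is simple: on free, the slab allocator
overwrites the object with POISON_FREE and marks the last byte with
POISON_END (0xa5); any later dereference then shows up as the familiar
0x6b6b... pattern in register and memory dumps. A reduced sketch of the
idea - not the real mm/slab.c code:

#include <string.h>
#include <stddef.h>

#define POISON_FREE	0x6b	/* the 6b6b... seen in RBX above */
#define POISON_END	0xa5	/* marks the last byte of the object */

/* On free: make any later use of the object stand out. */
static void poison_obj(unsigned char *obj, size_t size)
{
	memset(obj, POISON_FREE, size - 1);
	obj[size - 1] = POISON_END;
}

/* On reallocation: detect writes that happened after the free. */
static int poison_intact(const unsigned char *obj, size_t size)
{
	size_t i;

	for (i = 0; i < size - 1; i++)
		if (obj[i] != POISON_FREE)
			return 0;	/* written to after free */
	return obj[size - 1] == POISON_END;
}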

> > If so, we have a use after free occurring here and it would also
> > explain why no-one has reported it before.
> >
> > FWIW, can you turn on CONFIG_XFS_DEBUG=y and see if that triggers
> > a different bug check prior to the above dump?
> 
> [root@X64 linux-2.6.19]# make bzImage
> scripts/kconfig/conf -s arch/x86_64/Kconfig
> .config:7:warning: trying to assign nonexistent symbol XFS_DEBUG
> 
> Have I missed something?

No - I forgot that config option doesn't exist in mainline XFS - it's
only in the dev tree.

FWIW, I've run XFSQA twice now on a SCSI disk with slab debugging turned
on and I haven't seen this problem. I'm not sure how to track down
the source of the problem without a test case, but as a quick test, can
you try the following patch?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


---
 fs/xfs/linux-2.6/xfs_buf.c |    4 ++++
 1 file changed, 4 insertions(+)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_buf.c	2006-12-19 12:22:54.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c	2006-12-19 13:48:36.937118569 +1100
@@ -942,11 +942,14 @@ xfs_buf_unlock(
 /*
  *	Pinning Buffer Storage in Memory
  *	Ensure that no attempt to force a buffer to disk will succeed.
+ *	Hold the buffer so we don't attempt to free it while it
+ *	is pinned.
  */
 void
 xfs_buf_pin(
 	xfs_buf_t		*bp)
 {
+	xfs_buf_hold(bp);
 	atomic_inc(&bp->b_pin_count);
 	XB_TRACE(bp, "pin", (long)bp->b_pin_count.counter);
 }
@@ -958,6 +961,7 @@ xfs_buf_unpin(
 	if (atomic_dec_and_test(&bp->b_pin_count))
 		wake_up_all(&bp->b_waiters);
 	XB_TRACE(bp, "unpin", (long)bp->b_pin_count.counter);
+	xfs_buf_rele(bp);
 }
 
 int
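
The idea behind the change, reduced to a sketch with hypothetical names
(this is not XFS code): an object that one context can pin while another
may drop its last reference needs a reference held for the whole pinned
interval, so that the final access in the unpin path - the
wake_up_all() - can never touch freed memory.

#include <linux/wait.h>
#include <linux/slab.h>
#include <asm/atomic.h>

struct pinnable {
	atomic_t		ref_count;
	atomic_t		pin_count;
	wait_queue_head_t	waiters;
};

static void obj_pin(struct pinnable *p)
{
	atomic_inc(&p->ref_count);	/* keep p alive while pinned */
	atomic_inc(&p->pin_count);
}

static void obj_unpin(struct pinnable *p)
{
	if (atomic_dec_and_test(&p->pin_count))
		wake_up_all(&p->waiters);	/* last touch of p */
	if (atomic_dec_and_test(&p->ref_count))
		kfree(p);			/* now safe to free */
}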

* Re: xfslogd-spinlock bug?
  2006-12-19  2:52                   ` David Chinner
@ 2006-12-19  4:47                     ` David Chinner
  2006-12-27 12:58                       ` Haar János
  0 siblings, 1 reply; 17+ messages in thread
From: David Chinner @ 2006-12-19  4:47 UTC (permalink / raw)
  To: David Chinner; +Cc: Haar János, linux-xfs, linux-kernel

On Tue, Dec 19, 2006 at 01:52:29PM +1100, David Chinner wrote:
> On Tue, Dec 19, 2006 at 12:39:46AM +0100, Haar János wrote:
> > From: "David Chinner" <dgc@sgi.com>
> > > #define POISON_FREE 0x6b
> > >
> > > Can you confirm that you are running with CONFIG_DEBUG_SLAB=y?
> > 
> > Yes, i build with this option enabled.

......

> FWIW, I've run XFSQA twice now on a SCSI disk with slab debugging turned
> on and I haven't seen this problem. I'm not sure how to track down
> the source of the problem without a test case, but as a quick test, can
> you try the following patch?

Third try and I got a crash on a poisoned object:

[1]kdb> md8c40 e00000300d7d5100
0xe00000300d7d5100 000000005a2cf071 0000000000000000   q.,Z............
0xe00000300d7d5110 000000005a2cf071 6b6b6b6b6b6b6b6b   q.,Z....kkkkkkkk
0xe00000300d7d5120 e0000039eb7b6320 6b6b6b6b6b6b6b6b    c{.9...kkkkkkkk
0xe00000300d7d5130 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
0xe00000300d7d5140 6b6b6b6f6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkokkkkkkkkkkk
0xe00000300d7d5150 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
0xe00000300d7d5160 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
0xe00000300d7d5170 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
0xe00000300d7d5180 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
0xe00000300d7d5190 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
0xe00000300d7d51a0 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
0xe00000300d7d51b0 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
0xe00000300d7d51c0 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
0xe00000300d7d51d0 6b6b6b6b6b6b6b6b a56b6b6b6b6b6b6b   kkkkkkkkkkkkkkk.
0xe00000300d7d51e0 000000005a2cf071 a000000100468c30   q.,Z....0.F.....
[1]kdb> mds 0xe00000300d7d51e0
0xe00000300d7d51e0 5a2cf071   q.,Z....
0xe00000300d7d51e8 a000000100468c30 xfs_inode_item_destroy+0x30

So the use-after-free here is on an inode item. You're tripping
over a buffer item.

Unfortunately, it is not the same problem - the problem I've just
hit is to do with a QA test that does a forced shutdown on an active
filesystem, and:

[1]kdb> xmount 0xe00000304393e238
.....
flags 0x440010 <FSSHUTDOWN IDELETE COMPAT_IOSIZE >

The filesystem was being shut down, so xfs_inode_item_destroy() just
frees the inode log item without removing it from the AIL. I'll fix that,
and see if I have any luck....
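
In sketch form, with hypothetical names (this is not the actual fix):
the destroy path has to unlink the item from the AIL before freeing it,
otherwise a later AIL walk dereferences freed, poisoned memory - which
is exactly the 0x6b pattern in the dump above.

#include <linux/list.h>
#include <linux/slab.h>

#define LI_IN_AIL	0x1		/* hypothetical flag */

struct log_item {
	unsigned int		li_flags;
	struct list_head	li_ail;	/* linkage on the AIL */
};

static void log_item_destroy(struct log_item *lip)
{
	/* On forced shutdown the item may still be queued on the AIL. */
	if (lip->li_flags & LI_IN_AIL)
		list_del(&lip->li_ail);	/* unlink before freeing */
	kfree(lip);
}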

So I'd still try that patch I sent in the previous email...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

* Re: xfslogd-spinlock bug?
  2006-12-19  4:47                     ` David Chinner
@ 2006-12-27 12:58                       ` Haar János
  2007-01-07 23:14                         ` David Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Haar János @ 2006-12-27 12:58 UTC (permalink / raw)
  To: David Chinner; +Cc: linux-xfs, linux-kernel, dgc

Hello,

----- Original Message ----- 
From: "David Chinner" <dgc@sgi.com>
To: "David Chinner" <dgc@sgi.com>
Cc: "Haar János" <djani22@netcenter.hu>; <linux-xfs@oss.sgi.com>;
<linux-kernel@vger.kernel.org>
Sent: Tuesday, December 19, 2006 5:47 AM
Subject: Re: xfslogd-spinlock bug?


> On Tue, Dec 19, 2006 at 01:52:29PM +1100, David Chinner wrote:
> > On Tue, Dec 19, 2006 at 12:39:46AM +0100, Haar János wrote:
> > > From: "David Chinner" <dgc@sgi.com>
> > > > #define POISON_FREE 0x6b
> > > >
> > > > Can you confirm that you are running with CONFIG_DEBUG_SLAB=y?
> > >
> > > Yes, i build with this option enabled.
>
> ......
>
> > FWIW, I've run XFSQA twice now on a SCSI disk with slab debugging turned
> > on and I haven't seen this problem. I'm not sure how to track down
> > the source of the problem without a test case, but as a quick test, can
> > you try the following patch?
>
> Third try and I got a crash on a poisoned object:
>
> [1]kdb> md8c40 e00000300d7d5100
> 0xe00000300d7d5100 000000005a2cf071 0000000000000000   q.,Z............
> 0xe00000300d7d5110 000000005a2cf071 6b6b6b6b6b6b6b6b   q.,Z....kkkkkkkk
> 0xe00000300d7d5120 e0000039eb7b6320 6b6b6b6b6b6b6b6b    c{.9...kkkkkkkk
> 0xe00000300d7d5130 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
> 0xe00000300d7d5140 6b6b6b6f6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkokkkkkkkkkkk
> 0xe00000300d7d5150 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
> 0xe00000300d7d5160 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
> 0xe00000300d7d5170 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
> 0xe00000300d7d5180 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
> 0xe00000300d7d5190 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
> 0xe00000300d7d51a0 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
> 0xe00000300d7d51b0 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
> 0xe00000300d7d51c0 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
> 0xe00000300d7d51d0 6b6b6b6b6b6b6b6b a56b6b6b6b6b6b6b   kkkkkkkkkkkkkkk.
> 0xe00000300d7d51e0 000000005a2cf071 a000000100468c30   q.,Z....0.F.....
> [1]kdb> mds 0xe00000300d7d51e0
> 0xe00000300d7d51e0 5a2cf071   q.,Z....
> 0xe00000300d7d51e8 a000000100468c30 xfs_inode_item_destroy+0x30
>
> So the use-after-free here is on an inode item. You're tripping
> over a buffer item.
>
> Unfortunately, it is not the same problem - the problem I've just
> hit is to do with a QA test that does a forced shutdown on an active
> filesystem, and:
>
> [1]kdb> xmount 0xe00000304393e238
> .....
> flags 0x440010 <FSSHUTDOWN IDELETE COMPAT_IOSIZE >
>
> > The filesystem was being shut down, so xfs_inode_item_destroy() just
> > frees the inode log item without removing it from the AIL. I'll fix that,
> > and see if I have any luck....
>
> > So I'd still try that patch I sent in the previous email...

I'm still using the patch, but it hasn't shown any messages at this point.

I've had 3 crashes/reboots, but 2 were caused by nbd disconnects, and this one:

Dec 27 13:41:29 dy-base BUG: warning at
kernel/mutex.c:220/__mutex_unlock_common_slowpath()
Dec 27 13:41:29 dy-base Unable to handle kernel paging request at
0000000066604480 RIP:
Dec 27 13:41:29 dy-base  [<ffffffff80222c64>] resched_task+0x12/0x64
Dec 27 13:41:29 dy-base PGD 115246067 PUD 0
Dec 27 13:41:29 dy-base Oops: 0000 [1] SMP
Dec 27 13:41:29 dy-base CPU 1
Dec 27 13:41:29 dy-base Modules linked in: nbd rd netconsole e1000 video
Dec 27 13:41:29 dy-base Pid: 4069, comm: httpd Not tainted 2.6.19 #3
Dec 27 13:41:29 dy-base RIP: 0010:[<ffffffff80222c64>]  [<ffffffff80222c64>]
resched_task+0x12/0x64
Dec 27 13:41:29 dy-base RSP: 0018:ffff810105c01b78  EFLAGS: 00010083
Dec 27 13:41:29 dy-base RAX: ffffffff807d5800 RBX: 00001749fd97c214 RCX:
ffff81001cbd0000
Dec 27 13:41:29 dy-base RDX: 000000001cbd0048 RSI: ffff810005834068 RDI:
ffff8101047bf040
Dec 27 13:41:29 dy-base RBP: ffff810105c01b78 R08: 0000000000000001 R09:
0000000000000000
Dec 27 13:41:29 dy-base R10: 0000000000000057 R11: ffff81000583cd80 R12:
ffff810116693140
Dec 27 13:41:29 dy-base R13: 0000000000000001 R14: 0000000000000000 R15:
0000000000000000
Dec 27 13:41:29 dy-base FS:  00002ba3c1ad07d0(0000)
GS:ffff81011fc769c8(0000) knlGS:0000000000000000
Dec 27 13:41:29 dy-base CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 27 13:41:29 dy-base CR2: 0000000066604480 CR3: 0000000118196000 CR4:
00000000000006e0
Dec 27 13:41:29 dy-base Process httpd (pid: 4069, threadinfo
ffff810105c00000, task ffff8101166e1040)
Dec 27 13:41:29 dy-base Stack:  ffff810105c01bf8 ffffffff80223f37
ffff810002996a00 0000000002cba600
Dec 27 13:41:29 dy-base  000000000000000f 0000000000000001 ffff810005833700
0000000100000000
Dec 27 13:41:29 dy-base  0000000000000005 0000000000000296 ffff810105c01bd8
ffff810117fef690
Dec 27 13:41:29 dy-base Call Trace:
Dec 27 13:41:29 dy-base  [<ffffffff80223f37>] try_to_wake_up+0x3a7/0x3dc
Dec 27 13:41:29 dy-base  [<ffffffff80223f98>] wake_up_process+0x10/0x12
Dec 27 13:41:29 dy-base  [<ffffffff803c9186>] xfsbufd_wakeup+0x34/0x61
Dec 27 13:41:29 dy-base  [<ffffffff8025cdf9>] shrink_slab+0x64/0x163
Dec 27 13:41:29 dy-base  [<ffffffff8025d913>] try_to_free_pages+0x19c/0x289
Dec 27 13:41:29 dy-base  [<ffffffff80258b62>] __alloc_pages+0x1b8/0x2c0
Dec 27 13:41:29 dy-base  [<ffffffff80267f94>] anon_vma_prepare+0x29/0xf1
Dec 27 13:41:29 dy-base  [<ffffffff80260d52>] __handle_mm_fault+0x496/0x9e3
Dec 27 13:41:29 dy-base  [<ffffffff805e7dfd>] _spin_unlock+0x9/0xb
Dec 27 13:41:29 dy-base  [<ffffffff8021a1b6>] do_page_fault+0x418/0x7b6
Dec 27 13:41:29 dy-base  [<ffffffff802082ed>] __switch_to+0x280/0x28f
Dec 27 13:41:29 dy-base  [<ffffffff805e7efb>] _spin_unlock_irq+0x9/0xc
Dec 27 13:41:29 dy-base  [<ffffffff805e5ca8>] thread_return+0x5e/0xf7
Dec 27 13:41:29 dy-base  [<ffffffff805e809d>] error_exit+0x0/0x84
Dec 27 13:41:29 dy-base
Dec 27 13:41:29 dy-base
Dec 27 13:41:29 dy-base Code: 48 8b 14 d5 40 42 78 80 48 03 42 08 8b 00 85
c0 7e 0a 0f 0b
Dec 27 13:41:29 dy-base RIP  [<ffffffff80222c64>] resched_task+0x12/0x64
Dec 27 13:41:29 dy-base  RSP <ffff810105c01b78>
Dec 27 13:41:29 dy-base CR2: 0000000066604480
Dec 27 13:41:29 dy-base  <0>Kernel panic - not syncing: Fatal exception
Dec 27 13:41:29 dy-base
Dec 27 13:41:29 dy-base Rebooting in 5 seconds..

I found one bug in my Apache config, and I think the test case has changed.
:-(
Before the config was fixed, some users could stress the XFS source device's
readahead and periodically overload the system that way.
I think the original bug only shows up on a highly overloaded system, and is
related to readahead + buffering/caching.

Thanks,
Janos

>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group


* Re: xfslogd-spinlock bug?
  2006-12-27 12:58                       ` Haar János
@ 2007-01-07 23:14                         ` David Chinner
  2007-01-10 17:18                           ` Janos Haar
  0 siblings, 1 reply; 17+ messages in thread
From: David Chinner @ 2007-01-07 23:14 UTC (permalink / raw)
  To: Haar János; +Cc: David Chinner, linux-xfs, linux-kernel

On Wed, Dec 27, 2006 at 01:58:06PM +0100, Haar János wrote:
> Hello,
> 
> ----- Original Message ----- 
> From: "David Chinner" <dgc@sgi.com>
> To: "David Chinner" <dgc@sgi.com>
> Cc: "Haar János" <djani22@netcenter.hu>; <linux-xfs@oss.sgi.com>;
> <linux-kernel@vger.kernel.org>
> Sent: Tuesday, December 19, 2006 5:47 AM
> Subject: Re: xfslogd-spinlock bug?
> 
> 
> > On Tue, Dec 19, 2006 at 01:52:29PM +1100, David Chinner wrote:
> >
> > The filesystem was being shut down, so xfs_inode_item_destroy() just
> > frees the inode log item without removing it from the AIL. I'll fix that,
> > and see if I have any luck....
> >
> > So I'd still try that patch I sent in the previous email...
> 
> I'm still using the patch, but it hasn't shown any messages at this point.
> 
> I've had 3 crashes/reboots, but 2 were caused by nbd disconnects, and this one:
> 
> Dec 27 13:41:29 dy-base BUG: warning at
> kernel/mutex.c:220/__mutex_unlock_common_slowpath()
> Dec 27 13:41:29 dy-base Unable to handle kernel paging request at
> 0000000066604480 RIP:
> Dec 27 13:41:29 dy-base  [<ffffffff80222c64>] resched_task+0x12/0x64
> Dec 27 13:41:29 dy-base PGD 115246067 PUD 0
> Dec 27 13:41:29 dy-base Oops: 0000 [1] SMP
> Dec 27 13:41:29 dy-base CPU 1
> Dec 27 13:41:29 dy-base Modules linked in: nbd rd netconsole e1000 video
> Dec 27 13:41:29 dy-base Pid: 4069, comm: httpd Not tainted 2.6.19 #3
> Dec 27 13:41:29 dy-base RIP: 0010:[<ffffffff80222c64>]  [<ffffffff80222c64>]
> resched_task+0x12/0x64
> Dec 27 13:41:29 dy-base RSP: 0018:ffff810105c01b78  EFLAGS: 00010083
> Dec 27 13:41:29 dy-base RAX: ffffffff807d5800 RBX: 00001749fd97c214 RCX:

Different corruption in RBX here. Looks like semi-random garbage there.
I wonder - what are the MAC and IP address(es) of your machine and nbd
servers?

(i.e. I suspect this is a nbd problem, not an XFS problem)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

* Re: xfslogd-spinlock bug?
  2007-01-07 23:14                         ` David Chinner
@ 2007-01-10 17:18                           ` Janos Haar
  2007-01-11  3:34                             ` David Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Janos Haar @ 2007-01-10 17:18 UTC (permalink / raw)
  To: David Chinner; +Cc: dgc, linux-xfs, linux-kernel


----- Original Message ----- 
From: "David Chinner" <dgc@sgi.com>
To: "Haar János" <djani22@netcenter.hu>
Cc: "David Chinner" <dgc@sgi.com>; <linux-xfs@oss.sgi.com>;
<linux-kernel@vger.kernel.org>
Sent: Monday, January 08, 2007 12:14 AM
Subject: Re: xfslogd-spinlock bug?


> On Wed, Dec 27, 2006 at 01:58:06PM +0100, Haar János wrote:
> > Hello,
> >
> > ----- Original Message ----- 
> > From: "David Chinner" <dgc@sgi.com>
> > To: "David Chinner" <dgc@sgi.com>
> > Cc: "Haar János" <djani22@netcenter.hu>; <linux-xfs@oss.sgi.com>;
> > <linux-kernel@vger.kernel.org>
> > Sent: Tuesday, December 19, 2006 5:47 AM
> > Subject: Re: xfslogd-spinlock bug?
> >
> >
> > > On Tue, Dec 19, 2006 at 01:52:29PM +1100, David Chinner wrote:
> > >
> > > The filesystem was being shut down, so xfs_inode_item_destroy() just
> > > frees the inode log item without removing it from the AIL. I'll fix
> > > that, and see if I have any luck....
> > >
> > > So I'd still try that patch I sent in the previous email...
> >
> > I'm still using the patch, but it hasn't shown any messages at this point.
> >
> > I've had 3 crashes/reboots, but 2 were caused by nbd disconnects, and this one:
> >
> > Dec 27 13:41:29 dy-base BUG: warning at
> > kernel/mutex.c:220/__mutex_unlock_common_slowpath()
> > Dec 27 13:41:29 dy-base Unable to handle kernel paging request at
> > 0000000066604480 RIP:
> > Dec 27 13:41:29 dy-base  [<ffffffff80222c64>] resched_task+0x12/0x64
> > Dec 27 13:41:29 dy-base PGD 115246067 PUD 0
> > Dec 27 13:41:29 dy-base Oops: 0000 [1] SMP
> > Dec 27 13:41:29 dy-base CPU 1
> > Dec 27 13:41:29 dy-base Modules linked in: nbd rd netconsole e1000 video
> > Dec 27 13:41:29 dy-base Pid: 4069, comm: httpd Not tainted 2.6.19 #3
> > Dec 27 13:41:29 dy-base RIP: 0010:[<ffffffff80222c64>]  [<ffffffff80222c64>]
> > resched_task+0x12/0x64
> > Dec 27 13:41:29 dy-base RSP: 0018:ffff810105c01b78  EFLAGS: 00010083
> > Dec 27 13:41:29 dy-base RAX: ffffffff807d5800 RBX: 00001749fd97c214 RCX:
>
> Different corruption in RBX here. Looks like semi-random garbage there.
> I wonder - what are the MAC and IP address(es) of your machine and nbd
> servers?

dy-base:
eth0      Link encap:Ethernet  HWaddr 00:90:27:A2:7B:8B
eth0:1    Link encap:Ethernet  HWaddr 00:90:27:A2:7B:8B
eth0:2    Link encap:Ethernet  HWaddr 00:90:27:A2:7B:8B
eth1      Link encap:Ethernet  HWaddr 00:07:E9:32:E6:D8
eth1:1    Link encap:Ethernet  HWaddr 00:07:E9:32:E6:D8
eth1:2    Link encap:Ethernet  HWaddr 00:07:E9:32:E6:D8
eth2      Link encap:Ethernet  HWaddr 00:07:E9:32:E6:D9

node1-4:

00:0E:0C:A0:E5:7E
00:0E:0C:A0:EF:5E
00:0E:0C:A0:E9:58
00:0E:0C:A0:EF:A3

Some new stuff:
Jan  8 18:11:16 dy-base BUG: warning at
kernel/mutex.c:220/__mutex_unlock_common_slowpath()
Jan  8 18:11:16 dy-base Unable to handle kernel NULL pointer dereference at
0000000000000008 RIP:
Jan  8 18:11:16 dy-base  [<ffffffff80222c74>] resched_task+0x1a/0x64
Jan  8 18:11:16 dy-base PGD 9859d067 PUD 4e347067 PMD 0
Jan  8 18:11:16 dy-base Oops: 0000 [1] SMP
Jan  8 18:11:16 dy-base CPU 3
Jan  8 18:11:16 dy-base Modules linked in: nbd rd netconsole e1000
Jan  8 18:11:16 dy-base Pid: 3471, comm: ls Not tainted 2.6.19 #4
Jan  8 18:11:16 dy-base RIP: 0010:[<ffffffff80222c74>]  [<ffffffff80222c74>]
resched_task+0x1a/0x64
Jan  8 18:11:16 dy-base RSP: 0000:ffff81011fd1fb10  EFLAGS: 00010097
Jan  8 18:11:16 dy-base RAX: ffffffff80810800 RBX: 000004f0e2850659 RCX:
ffff81010c0a2000
Jan  8 18:11:16 dy-base RDX: 0000000000000000 RSI: ffff81000583c368 RDI:
ffff81002b809830
Jan  8 18:11:16 dy-base RBP: ffff81011fd1fb10 R08: 0000000000000000 R09:
0000000000000080
Jan  8 18:11:16 dy-base R10: 0000000000000080 R11: ffff81000584d280 R12:
ffff8100d852a7b0
Jan  8 18:11:16 dy-base R13: 0000000000000003 R14: 0000000000000001 R15:
0000000000000001
Jan  8 18:11:16 dy-base FS:  00002b69786c8a90(0000)
GS:ffff81011fcb98c0(0000) knlGS:0000000000000000
Jan  8 18:11:16 dy-base CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan  8 18:11:16 dy-base CR2: 0000000000000008 CR3: 0000000056ec5000 CR4:
00000000000006e0
Jan  8 18:11:16 dy-base Process ls (pid: 3471, threadinfo ffff810080d80000,
task ffff8100dcdea770)
Jan  8 18:11:16 dy-base Stack:  ffff81011fd1fb90 ffffffff80223f3f
ffff81011fd1fb50 0000000000000282
Jan  8 18:11:16 dy-base  0000000000000001 0000000000000001 ffff81000583ba00
00000003032ca4b8
Jan  8 18:11:16 dy-base  000000000000000a 0000000000000082 ffff810103539d00
ffff810013867bf8
Jan  8 18:11:16 dy-base Call Trace:
Jan  8 18:11:16 dy-base  <IRQ>  [<ffffffff80223f3f>]
try_to_wake_up+0x3a7/0x3dc
Jan  8 18:11:16 dy-base  [<ffffffff80223f81>] default_wake_function+0xd/0xf
Jan  8 18:11:16 dy-base  [<ffffffff8023c81d>]
autoremove_wake_function+0x11/0x38
Jan  8 18:11:16 dy-base  [<ffffffff802225b1>] __wake_up_common+0x3e/0x68
Jan  8 18:11:16 dy-base  [<ffffffff80222ac9>] __wake_up+0x38/0x50
Jan  8 18:11:16 dy-base  [<ffffffff80569d1c>]
sk_stream_write_space+0x5d/0x83
Jan  8 18:11:16 dy-base  [<ffffffff80591afd>] tcp_check_space+0x8f/0xcd
Jan  8 18:11:16 dy-base  [<ffffffff80596de0>]
tcp_rcv_established+0x116/0x76e
Jan  8 18:11:16 dy-base  [<ffffffff8059ce41>] tcp_v4_do_rcv+0x2d/0x322
Jan  8 18:11:16 dy-base  [<ffffffff8059f59e>] tcp_v4_rcv+0x8bb/0x925
Jan  8 18:11:16 dy-base  [<ffffffff80583b24>]
ip_local_deliver_finish+0x0/0x1ce
Jan  8 18:11:16 dy-base  [<ffffffff805843e6>] ip_local_deliver+0x172/0x238
Jan  8 18:11:16 dy-base  [<ffffffff8058422c>] ip_rcv+0x44f/0x497
Jan  8 18:11:16 dy-base  [<ffffffff88004bbc>]
:e1000:e1000_alloc_rx_buffers+0x1e7/0x2cb
Jan  8 18:11:16 dy-base  [<ffffffff8056c0a6>] netif_receive_skb+0x1ee/0x255
Jan  8 18:11:16 dy-base  [<ffffffff8056de0c>] process_backlog+0x8a/0x10f
Jan  8 18:11:16 dy-base  [<ffffffff8056e060>] net_rx_action+0xa9/0x16e
Jan  8 18:11:16 dy-base  [<ffffffff8022edf9>] __do_softirq+0x57/0xc7
Jan  8 18:11:16 dy-base  [<ffffffff8020aa1c>] call_softirq+0x1c/0x28
Jan  8 18:11:16 dy-base  [<ffffffff8020c68a>] do_softirq+0x34/0x87
Jan  8 18:11:16 dy-base  [<ffffffff8022ed0f>] irq_exit+0x3f/0x41
Jan  8 18:11:16 dy-base  [<ffffffff8020c786>] do_IRQ+0xa9/0xc7
Jan  8 18:11:16 dy-base  [<ffffffff80209e11>] ret_from_intr+0x0/0xa
Jan  8 18:11:16 dy-base  <EOI>
Jan  8 18:11:16 dy-base
Jan  8 18:11:16 dy-base Code: 48 03 42 08 8b 00 85 c0 7e 0a 0f 0b 68 ed 80
64 80 c2 f0 03
Jan  8 18:11:16 dy-base RIP  [<ffffffff80222c74>] resched_task+0x1a/0x64
Jan  8 18:11:16 dy-base  RSP <ffff81011fd1fb10>
Jan  8 18:11:16 dy-base CR2: 0000000000000008
Jan  8 18:11:16 dy-base  <0>Kernel panic - not syncing: Fatal exception
Jan  8 18:11:16 dy-base
Jan  8 18:11:16 dy-base Rebooting in 5 seconds..

(I have disabled slab debugging, because I need more performance.)

Thanks,
Janos

>
> (i.e. I suspect this is a nbd problem, not an XFS problem)
>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group


* Re: xfslogd-spinlock bug?
  2007-01-10 17:18                           ` Janos Haar
@ 2007-01-11  3:34                             ` David Chinner
  2007-01-11 20:15                               ` Janos Haar
  0 siblings, 1 reply; 17+ messages in thread
From: David Chinner @ 2007-01-11  3:34 UTC (permalink / raw)
  To: Janos Haar; +Cc: David Chinner, linux-xfs, linux-kernel

On Wed, Jan 10, 2007 at 06:18:08PM +0100, Janos Haar wrote:
> From: "David Chinner" <dgc@sgi.com>
> > Different corruption in RBX here. Looks like semi-random garbage there.
> > I wonder - what are the MAC and IP address(es) of your machine and nbd
> > servers?
> 
> dy-base:

No matches. Oh well, it was a long shot.
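
For anyone who wants to repeat that comparison, a trivial userspace
helper makes it mechanical (hypothetical, not from this thread): print
each MAC address as the little-endian 8-byte value it would leave in a
64-bit register, then eyeball the output against the trashed RBX
contents (00001749fd97c214, 000004f0e2850659):

#include <stdio.h>
#include <stdint.h>

/* A MAC copied into memory and read back as a 64-bit little-endian
 * word: byte 0 of the address ends up in the low-order byte. */
static uint64_t mac_as_u64(const uint8_t mac[6])
{
	uint64_t v = 0;
	int i;

	for (i = 5; i >= 0; i--)
		v = (v << 8) | mac[i];
	return v;
}

int main(void)
{
	const uint8_t node1[6] = { 0x00, 0x0e, 0x0c, 0xa0, 0xe5, 0x7e };

	printf("node1 pattern: %016llx\n",
	       (unsigned long long)mac_as_u64(node1));
	return 0;
}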

> Some new stuff:
> Jan  8 18:11:16 dy-base RAX: ffffffff80810800 RBX: 000004f0e2850659 RCX:

RBX trashed again with more random garbage.

> Jan  8 18:11:16 dy-base  [<ffffffff80223f81>] default_wake_function+0xd/0xf
> Jan  8 18:11:16 dy-base  [<ffffffff8023c81d>]
> autoremove_wake_function+0x11/0x38
> Jan  8 18:11:16 dy-base  [<ffffffff802225b1>] __wake_up_common+0x3e/0x68
> Jan  8 18:11:16 dy-base  [<ffffffff80222ac9>] __wake_up+0x38/0x50
> Jan  8 18:11:16 dy-base  [<ffffffff80569d1c>]
> sk_stream_write_space+0x5d/0x83
> Jan  8 18:11:16 dy-base  [<ffffffff80591afd>] tcp_check_space+0x8f/0xcd
> Jan  8 18:11:16 dy-base  [<ffffffff80596de0>]
> tcp_rcv_established+0x116/0x76e
> Jan  8 18:11:16 dy-base  [<ffffffff8059ce41>] tcp_v4_do_rcv+0x2d/0x322
> Jan  8 18:11:16 dy-base  [<ffffffff8059f59e>] tcp_v4_rcv+0x8bb/0x925
> Jan  8 18:11:16 dy-base  [<ffffffff80583b24>]
> ip_local_deliver_finish+0x0/0x1ce
> Jan  8 18:11:16 dy-base  [<ffffffff805843e6>] ip_local_deliver+0x172/0x238
> Jan  8 18:11:16 dy-base  [<ffffffff8058422c>] ip_rcv+0x44f/0x497
> Jan  8 18:11:16 dy-base  [<ffffffff88004bbc>]
> :e1000:e1000_alloc_rx_buffers+0x1e7/0x2cb
> Jan  8 18:11:16 dy-base  [<ffffffff8056c0a6>] netif_receive_skb+0x1ee/0x255
> Jan  8 18:11:16 dy-base  [<ffffffff8056de0c>] process_backlog+0x8a/0x10f
> Jan  8 18:11:16 dy-base  [<ffffffff8056e060>] net_rx_action+0xa9/0x16e
> Jan  8 18:11:16 dy-base  [<ffffffff8022edf9>] __do_softirq+0x57/0xc7
> Jan  8 18:11:16 dy-base  [<ffffffff8020aa1c>] call_softirq+0x1c/0x28
> Jan  8 18:11:16 dy-base  [<ffffffff8020c68a>] do_softirq+0x34/0x87
> Jan  8 18:11:16 dy-base  [<ffffffff8022ed0f>] irq_exit+0x3f/0x41
> Jan  8 18:11:16 dy-base  [<ffffffff8020c786>] do_IRQ+0xa9/0xc7
> Jan  8 18:11:16 dy-base  [<ffffffff80209e11>] ret_from_intr+0x0/0xa
......
> > (i.e. I suspect this is a nbd problem, not an XFS problem)

There's something seriously wrong in your kernel that has, AFAICT,
nothing to do with XFS. I suggest talking to the NBD folk, as that is
the only unusual thing that I can see that you are using....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

* Re: xfslogd-spinlock bug?
  2007-01-11  3:34                             ` David Chinner
@ 2007-01-11 20:15                               ` Janos Haar
  0 siblings, 0 replies; 17+ messages in thread
From: Janos Haar @ 2007-01-11 20:15 UTC (permalink / raw)
  To: David Chinner; +Cc: dgc, linux-xfs, linux-kernel


----- Original Message ----- 
From: "David Chinner" <dgc@sgi.com>
To: "Janos Haar" <djani22@netcenter.hu>
Cc: "David Chinner" <dgc@sgi.com>; <linux-xfs@oss.sgi.com>;
<linux-kernel@vger.kernel.org>
Sent: Thursday, January 11, 2007 4:34 AM
Subject: Re: xfslogd-spinlock bug?


> On Wed, Jan 10, 2007 at 06:18:08PM +0100, Janos Haar wrote:
> > From: "David Chinner" <dgc@sgi.com>
> > > Different corruption in RBX here. Looks like semi-random garbage
> > > there.
> > > I wonder - what are the MAC and IP address(es) of your machine and nbd
> > > servers?
> >
> > dy-base:
>
> No matches. Oh well, it was a long shot.
>
> > Some new stuff:
> > Jan  8 18:11:16 dy-base RAX: ffffffff80810800 RBX: 000004f0e2850659 RCX:
>
> RBX trashed again with more random garbage.
>
> > Jan  8 18:11:16 dy-base  [<ffffffff80223f81>] default_wake_function+0xd/0xf
> > Jan  8 18:11:16 dy-base  [<ffffffff8023c81d>]
> > autoremove_wake_function+0x11/0x38
> > Jan  8 18:11:16 dy-base  [<ffffffff802225b1>] __wake_up_common+0x3e/0x68
> > Jan  8 18:11:16 dy-base  [<ffffffff80222ac9>] __wake_up+0x38/0x50
> > Jan  8 18:11:16 dy-base  [<ffffffff80569d1c>]
> > sk_stream_write_space+0x5d/0x83
> > Jan  8 18:11:16 dy-base  [<ffffffff80591afd>] tcp_check_space+0x8f/0xcd
> > Jan  8 18:11:16 dy-base  [<ffffffff80596de0>]
> > tcp_rcv_established+0x116/0x76e
> > Jan  8 18:11:16 dy-base  [<ffffffff8059ce41>] tcp_v4_do_rcv+0x2d/0x322
> > Jan  8 18:11:16 dy-base  [<ffffffff8059f59e>] tcp_v4_rcv+0x8bb/0x925
> > Jan  8 18:11:16 dy-base  [<ffffffff80583b24>]
> > ip_local_deliver_finish+0x0/0x1ce
> > Jan  8 18:11:16 dy-base  [<ffffffff805843e6>] ip_local_deliver+0x172/0x238
> > Jan  8 18:11:16 dy-base  [<ffffffff8058422c>] ip_rcv+0x44f/0x497
> > Jan  8 18:11:16 dy-base  [<ffffffff88004bbc>]
> > :e1000:e1000_alloc_rx_buffers+0x1e7/0x2cb
> > Jan  8 18:11:16 dy-base  [<ffffffff8056c0a6>] netif_receive_skb+0x1ee/0x255
> > Jan  8 18:11:16 dy-base  [<ffffffff8056de0c>] process_backlog+0x8a/0x10f
> > Jan  8 18:11:16 dy-base  [<ffffffff8056e060>] net_rx_action+0xa9/0x16e
> > Jan  8 18:11:16 dy-base  [<ffffffff8022edf9>] __do_softirq+0x57/0xc7
> > Jan  8 18:11:16 dy-base  [<ffffffff8020aa1c>] call_softirq+0x1c/0x28
> > Jan  8 18:11:16 dy-base  [<ffffffff8020c68a>] do_softirq+0x34/0x87
> > Jan  8 18:11:16 dy-base  [<ffffffff8022ed0f>] irq_exit+0x3f/0x41
> > Jan  8 18:11:16 dy-base  [<ffffffff8020c786>] do_IRQ+0xa9/0xc7
> > Jan  8 18:11:16 dy-base  [<ffffffff80209e11>] ret_from_intr+0x0/0xa
> ......
> > > (i.e. I suspect this is a nbd problem, not an XFS problem)
>
> There's something seriously wrong in your kernel that has, AFAICT,
> nothing to do with XFS. I suggest talking to the NBD folk, as that is
> the only unusual thing that I can see that you are using....
>
> Cheers,

OK, I will try....

Thanks a lot,

Janos


>
> Dave.
> -- 
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group

