* Somebody take a look please! (some kind of kernel bug?)
@ 2010-03-24 20:39 Janos Haar
  2010-03-25  3:29 ` Américo Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Janos Haar @ 2010-03-24 20:39 UTC (permalink / raw)
  To: linux-kernel

Dear developers,

This is one of my production servers, which suddenly started to freeze 
(crash) a few weeks ago.
I have done all I can (I think); please, somebody, give me a 
suggestion:

Mar 24 19:22:28 alfa kernel: BUG: Bad page map in process httpd 
pte:2bf1e025 pmd:1535b5067
Mar 24 19:22:28 alfa kernel: page:ffffea0000f1b250 flags:4000000000000404 
count:1 mapcount:-1 mapping:(null) index:0
Mar 24 19:22:28 alfa kernel: addr:00000037b4008000 vm_flags:08000875 
anon_vma:(null) mapping:ffff88022b5d25a8 index:8
Mar 24 19:22:28 alfa kernel: vma->vm_ops->fault: filemap_fault+0x0/0x34d
Mar 24 19:22:28 alfa kernel: vma->vm_file->f_op->mmap: 
xfs_file_mmap+0x0/0x33
Mar 24 19:22:28 alfa kernel: Pid: 7512, comm: httpd Not tainted 2.6.32.10 #2
Mar 24 19:22:28 alfa kernel: Call Trace:
Mar 24 19:22:28 alfa kernel:  [<ffffffff810c2ea3>] print_bad_pte+0x210/0x229
Mar 24 19:22:28 alfa kernel:  [<ffffffff810c3c98>] unmap_vmas+0x44b/0x787
Mar 24 19:22:28 alfa kernel:  [<ffffffff810c81d5>] exit_mmap+0xb0/0x133
Mar 24 19:22:28 alfa kernel:  [<ffffffff81041f83>] mmput+0x48/0xb9
Mar 24 19:22:28 alfa kernel:  [<ffffffff810463b0>] exit_mm+0x105/0x110
Mar 24 19:22:28 alfa kernel:  [<ffffffff81371287>] ? 
tty_audit_exit+0x28/0x85
Mar 24 19:22:28 alfa kernel:  [<ffffffff810477a0>] do_exit+0x1e9/0x6d2
Mar 24 19:22:28 alfa kernel:  [<ffffffff81053c37>] ? 
__dequeue_signal+0xf1/0x127
Mar 24 19:22:28 alfa kernel:  [<ffffffff81047d00>] do_group_exit+0x77/0xa1
Mar 24 19:22:28 alfa kernel:  [<ffffffff810560f7>] 
get_signal_to_deliver+0x32c/0x37f
Mar 24 19:22:28 alfa kernel:  [<ffffffff8100a484>] 
do_notify_resume+0x90/0x740
Mar 24 19:22:28 alfa kernel:  [<ffffffff8102724b>] ? 
__bad_area_nosemaphore+0x178/0x1a2
Mar 24 19:22:28 alfa kernel:  [<ffffffff810272b9>] ? __bad_area+0x44/0x4d
Mar 24 19:22:28 alfa kernel:  [<ffffffff8100bba2>] retint_signal+0x46/0x84
Mar 24 19:22:28 alfa kernel: Disabling lock debugging due to kernel taint
Mar 24 19:22:28 alfa kernel: swap_free: Bad swap file entry 6c800000
Mar 24 19:22:28 alfa kernel: BUG: Bad page map in process httpd 
pte:d900000000 pmd:1535b5067
Mar 24 19:22:28 alfa kernel: addr:00000037b400a000 vm_flags:08000875 
anon_vma:(null) mapping:ffff88022b5d25a8 index:a
Mar 24 19:22:28 alfa kernel: vma->vm_ops->fault: filemap_fault+0x0/0x34d
Mar 24 19:22:28 alfa kernel: vma->vm_file->f_op->mmap: 
xfs_file_mmap+0x0/0x33
Mar 24 19:22:28 alfa kernel: Pid: 7512, comm: httpd Tainted: G    B 
2.6.32.10 #2
Mar 24 19:22:28 alfa kernel: Call Trace:
Mar 24 19:22:28 alfa kernel:  [<ffffffff81044551>] ? add_taint+0x32/0x3e
Mar 24 19:22:28 alfa kernel:  [<ffffffff810c2ea3>] print_bad_pte+0x210/0x229
Mar 24 19:22:28 alfa kernel:  [<ffffffff810c3d47>] unmap_vmas+0x4fa/0x787
Mar 24 19:22:28 alfa kernel:  [<ffffffff810c81d5>] exit_mmap+0xb0/0x133
Mar 24 19:22:28 alfa kernel:  [<ffffffff81041f83>] mmput+0x48/0xb9
Mar 24 19:22:28 alfa kernel:  [<ffffffff810463b0>] exit_mm+0x105/0x110
Mar 24 19:22:28 alfa kernel:  [<ffffffff81371287>] ? 
tty_audit_exit+0x28/0x85
Mar 24 19:22:28 alfa kernel:  [<ffffffff810477a0>] do_exit+0x1e9/0x6d2
Mar 24 19:22:28 alfa kernel:  [<ffffffff81053c37>] ? 
__dequeue_signal+0xf1/0x127
Mar 24 19:22:28 alfa kernel:  [<ffffffff81047d00>] do_group_exit+0x77/0xa1
Mar 24 19:22:28 alfa kernel:  [<ffffffff810560f7>] 
get_signal_to_deliver+0x32c/0x37f
Mar 24 19:22:28 alfa kernel:  [<ffffffff8100a484>] 
do_notify_resume+0x90/0x740
Mar 24 19:22:28 alfa kernel:  [<ffffffff8102724b>] ? 
__bad_area_nosemaphore+0x178/0x1a2
Mar 24 19:22:28 alfa kernel:  [<ffffffff810272b9>] ? __bad_area+0x44/0x4d
Mar 24 19:22:28 alfa kernel:  [<ffffffff8100bba2>] retint_signal+0x46/0x84
Mar 24 19:22:28 alfa kernel: BUG: Bad page map in process httpd 
pte:2bfe8025 pmd:1535b5067
Mar 24 19:22:28 alfa kernel: page:ffffea0000f1f7c0 flags:4000000000000404 
count:1 mapcount:-1 mapping:(null) index:0
Mar 24 19:22:28 alfa kernel: addr:00000037b400c000 vm_flags:08000875 
anon_vma:(null) mapping:ffff88022b5d25a8 index:c
Mar 24 19:22:28 alfa kernel: vma->vm_ops->fault: filemap_fault+0x0/0x34d
Mar 24 19:22:28 alfa kernel: vma->vm_file->f_op->mmap: 
xfs_file_mmap+0x0/0x33
Mar 24 19:22:28 alfa kernel: Pid: 7512, comm: httpd Tainted: G    B 
2.6.32.10 #2
Mar 24 19:22:28 alfa kernel: Call Trace:
Mar 24 19:22:28 alfa kernel:  [<ffffffff81044551>] ? add_taint+0x32/0x3e
Mar 24 19:22:28 alfa kernel:  [<ffffffff810c2ea3>] print_bad_pte+0x210/0x229
Mar 24 19:22:28 alfa kernel:  [<ffffffff810c3c98>] unmap_vmas+0x44b/0x787
Mar 24 19:22:28 alfa kernel:  [<ffffffff810c81d5>] exit_mmap+0xb0/0x133
Mar 24 19:22:28 alfa kernel:  [<ffffffff81041f83>] mmput+0x48/0xb9
Mar 24 19:22:28 alfa kernel:  [<ffffffff810463b0>] exit_mm+0x105/0x110
.....

The entire log is here:
http://download.netcenter.hu/bughunt/20100324/messages

The current kernel is 2.6.32.10, but the crash series started under 2.6.28.10.

I have moved its tasks to another server and removed this one from the room; 
the hardware survived more than 7 days of continuous memtest86, and I have 
tested the HDDs one by one with badblocks -vvw, and all are good.
To me it looks like this is not a hardware problem.
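(For reference: badblocks -vvw is the destructive write-mode test; it 
overwrites every block with test patterns and verifies each pass, so it can 
only be run on disks whose contents may be lost.)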

Does somebody have any idea?

Thanks a lot,
Janos Haar 



* Re: Somebody take a look please! (some kind of kernel bug?)
  2010-03-24 20:39 Somebody take a look please! (some kind of kernel bug?) Janos Haar
@ 2010-03-25  3:29 ` Américo Wang
  2010-03-25  6:31   ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 33+ messages in thread
From: Américo Wang @ 2010-03-25  3:29 UTC (permalink / raw)
  To: Janos Haar; +Cc: linux-kernel, linux-mm

(Cc'ing linux-mm)

2010/3/25 Janos Haar <janos.haar@netcenter.hu>:
> Dear developers,
>
> [snip: full report and log quoted above]


* Re: Somebody take a look please! (some kind of kernel bug?)
  2010-03-25  3:29 ` Américo Wang
@ 2010-03-25  6:31   ` KAMEZAWA Hiroyuki
  2010-03-25  8:54     ` Janos Haar
  0 siblings, 1 reply; 33+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-25  6:31 UTC (permalink / raw)
  To: Américo Wang; +Cc: Janos Haar, linux-kernel, linux-mm

On Thu, 25 Mar 2010 11:29:25 +0800
Américo Wang <xiyou.wangcong@gmail.com> wrote:

> (Cc'ing linux-mm)
> 
Hmm.. here is a summary of the corruption (from the log), but no idea yet.

==
process's address          pte             pfn -> page

00000037b4008000             2bf1e025  -> PG_reserved
00000037b400a000           d900000000  -> bad swap
00000037b400c000             2bfe8025  -> PG_reserved
00000037b400d000            12bfe9025  -> belongs to some other file's page cache
00000037b400e000           ff00000000  -> bad swap
00000037b400f000           5400000000  -> bad swap
...
00000037b4019000           ff00000000  -> bad swap
==
All the PTEs are under the same pmd, 1535b5067.

I suspect some kind of buffer overflow bug is overwriting the page table...
Because the PTEs for addresses 00000037b4008000...00000037b400f000 are at the
head of a page (used for the pmd), some data on the page
[0x1535b4000..0x1535b5000) may have overflowed and corrupted the page table
in [0x1535b5000..0x1535b6000).
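
As a quick sanity check of this reading, here is a minimal user-space sketch
(illustration only, not kernel code; it assumes the standard x86-64 PTE
layout, where bit 0 is the present bit and any non-zero, non-present entry
gets decoded as a swap entry) that classifies the raw values from the log:

#include <stdio.h>
#include <stdint.h>

#define PTE_PRESENT (1ULL << 0)

/* Classify a raw x86-64 PTE value the way the log lines above read. */
static void decode(uint64_t pte)
{
	/* the pfn lives in bits 12..51 of a present PTE */
	uint64_t pfn = (pte >> 12) & ((1ULL << 40) - 1);

	if (pte & PTE_PRESENT)
		printf("%12llx: present, pfn %llx\n",
		       (unsigned long long)pte, (unsigned long long)pfn);
	else
		printf("%12llx: not present -> decoded as a swap entry\n",
		       (unsigned long long)pte);
}

int main(void)
{
	decode(0x2bf1e025ULL);   /* pfn 2bf1e -> the PG_reserved page   */
	decode(0xd900000000ULL); /* "swap_free: Bad swap file entry"    */
	decode(0xff00000000ULL); /* bad swap                            */
	return 0;
}

The 0x...025 entries look like normal present PTEs, while the 0x??00000000
values have every flag bit clear; that is exactly what stray data overwriting
the page table would produce.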

Was this bug also seen with 2.6.28.10?

If I investigate this issue, I'll check the owner of page 0x1535b4000 via a
crash dump.
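(With a captured vmcore that lookup would be something like
"crash> kmem -p 1535b4000" in the crash utility; I am quoting the command
form from memory, so treat it as a sketch.)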

Thanks,
-Kame



> 2010/3/25 Janos Haar <janos.haar@netcenter.hu>:
> > [snip: full report and log quoted above]



* Re: Somebody take a look please! (some kind of kernel bug?)
  2010-03-25  6:31   ` KAMEZAWA Hiroyuki
@ 2010-03-25  8:54     ` Janos Haar
  2010-04-01 10:01       ` Janos Haar
  0 siblings, 1 reply; 33+ messages in thread
From: Janos Haar @ 2010-03-25  8:54 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: xiyou.wangcong, linux-kernel, linux-mm


----- Original Message ----- 
From: "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>
To: "Américo Wang" <xiyou.wangcong@gmail.com>
Cc: "Janos Haar" <janos.haar@netcenter.hu>; <linux-kernel@vger.kernel.org>; 
<linux-mm@kvack.org>
Sent: Thursday, March 25, 2010 7:31 AM
Subject: Re: Somebody take a look please! (some kind of kernel bug?)


> On Thu, 25 Mar 2010 11:29:25 +0800
> Américo Wang <xiyou.wangcong@gmail.com> wrote:
>
>> (Cc'ing linux-mm)
>>
> Hmm.. here is a summary of the corruption (from the log), but no idea yet.
>
> ==
> process's address          pte             pfn -> page
>
> 00000037b4008000             2bf1e025  -> PG_reserved
> 00000037b400a000           d900000000  -> bad swap
> 00000037b400c000             2bfe8025  -> PG_reserved
> 00000037b400d000            12bfe9025  -> belongs to some other file's page cache
> 00000037b400e000           ff00000000  -> bad swap
> 00000037b400f000           5400000000  -> bad swap
> ...
> 00000037b4019000           ff00000000  -> bad swap
> ==
> All the PTEs are under the same pmd, 1535b5067.
>
> I suspect some kind of buffer overflow bug is overwriting the page table...
> Because the PTEs for addresses 00000037b4008000...00000037b400f000 are at 
> the head of

This is only one bit, right? :-)

> a page (used for the pmd), some data on the page
> [0x1535b4000..0x1535b5000) may have overflowed and corrupted the page
> table in [0x1535b5000..0x1535b6000).
>
> Was this bug also seen with 2.6.28.10?

No, the log I sent was from 2.6.32.10. (You can check it in the messages 
file at the link.)
The story began around March 9-10, but unfortunately the system was not 
always able to write out the messages file.
(At Mar 13 11:20:09 I triggered the sysrq process and memory dumps; you can 
see them in the file linked below.)
We had more crashes with 2.6.28.10 over the next few days, and the server 
was then pulled for testing (hence the 7-day hole in the log), but it looks 
stable on the bench.

Here are some more serious crashes from 2.6.28.10:

http://download.netcenter.hu/bughunt/20100324/marc11-14

To me it all looks memory, swap and XFS related.
I have checked/repaired all the filesystems offline, corrected the errors 
left behind by the previous crashes, then disabled swap, but nothing 
helps. :(

Finally, on March 21, I upgraded the kernel to 2.6.32.10, and the crashes 
seemed to be gone, but only for 4 days. (You can see the first dump in my 
first mail.)

Thanks for all the help,

Janos Haar


> [snip: remainder of quoted message]



* Re: Somebody take a look please! (some kind of kernel bug?)
  2010-03-25  8:54     ` Janos Haar
@ 2010-04-01 10:01       ` Janos Haar
  2010-04-01 10:37         ` Américo Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Janos Haar @ 2010-04-01 10:01 UTC (permalink / raw)
  To: linux-kernel; +Cc: KAMEZAWA Hiroyuki, linux-mm, xiyou.wangcong

Hello,

Another issue with this production server:
Can somebody point me in the right direction?
Or confirm whether or not this is a hardware problem?

The messages file is here: 
http://download.netcenter.hu/bughunt/20100324/marc30

Thanks,
Janos Haar

Mar 30 18:51:43 alfa kernel: BUG: unable to handle kernel paging request at 
000000320000008c
Mar 30 18:51:43 alfa kernel: IP: [<ffffffff811d755b>] 
xfs_iflush_cluster+0x148/0x35a
Mar 30 18:51:43 alfa kernel: PGD 102d7a067 PUD 0
Mar 30 18:51:43 alfa kernel: Oops: 0000 [#1] SMP
Mar 30 18:51:43 alfa kernel: last sysfs file: /sys/class/misc/rfkill/dev
Mar 30 18:51:43 alfa kernel: CPU 0
Mar 30 18:51:43 alfa kernel: Modules linked in: hidp l2cap crc16 bluetooth 
rfkill ipv6 video output sbs sbshc battery ac parport_pc lp parport 
serio_raw 8250_pnp 8250 serial_core shpchp button i2c_i801 i2c_core pcspkr
Mar 30 18:51:43 alfa kernel: Pid: 3242, comm: flush-8:16 Not tainted 
2.6.32.10 #2
Mar 30 18:51:43 alfa kernel: RIP: 0010:[<ffffffff811d755b>] 
[<ffffffff811d755b>] xfs_iflush_cluster+0x148/0x35a
Mar 30 18:51:43 alfa kernel: RSP: 0000:ffff880228ce5b60  EFLAGS: 00010206
Mar 30 18:51:43 alfa kernel: RAX: 0000003200000000 RBX: ffff8801537947d0 
RCX: 000000000000001a
Mar 30 18:51:43 alfa kernel: RDX: 0000000000000020 RSI: 00000000000c6cc2 
RDI: 0000000000000001
Mar 30 18:51:43 alfa kernel: RBP: ffff880228ce5bd0 R08: ffff880228ce5b20 
R09: ffff8801ea436928
Mar 30 18:51:43 alfa kernel: R10: 00000000000c6cc2 R11: 0000000000000001 
R12: ffff8800b630b11a
Mar 30 18:51:43 alfa kernel: R13: ffff8801bd54ab30 R14: ffff88022962d2b8 
R15: 00000000000c6ca0
Mar 30 18:51:43 alfa kernel: FS:  0000000000000000(0000) 
GS:ffff880028200000(0000) knlGS:0000000000000000
Mar 30 18:51:43 alfa kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 
000000008005003b
Mar 30 18:51:43 alfa kernel: CR2: 000000320000008c CR3: 0000000168e75000 
CR4: 00000000000006f0
Mar 30 18:51:43 alfa kernel: DR0: 0000000000000000 DR1: 0000000000000000 
DR2: 0000000000000000
Mar 30 18:51:43 alfa kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 
DR7: 0000000000000400
Mar 30 18:51:43 alfa kernel: Process flush-8:16 (pid: 3242, threadinfo 
ffff880228ce4000, task ffff880228ea4040)
Mar 30 18:51:43 alfa kernel: Stack:
Mar 30 18:51:43 alfa kernel:  ffff8801bd54ab30 ffff8800b630b140 
ffff88022a2d99d0 ffffffffffffffe0
Mar 30 18:51:43 alfa kernel: <0> 0000000000000020 ffff880218e3db60 
0000002028ce5bd0 0000000200000000
Mar 30 18:51:43 alfa kernel: <0> ffff880218e3db70 ffff8801bd54ab30 
ffff8800b630b140 0000000000000002
Mar 30 18:51:43 alfa kernel: Call Trace:
Mar 30 18:51:43 alfa kernel:  [<ffffffff811d7931>] xfs_iflush+0x1c4/0x272
Mar 30 18:51:43 alfa kernel:  [<ffffffff8103458e>] ? 
try_wait_for_completion+0x24/0x45
Mar 30 18:51:43 alfa kernel:  [<ffffffff811f819c>] 
xfs_fs_write_inode+0xe0/0x11e
Mar 30 18:51:43 alfa kernel:  [<ffffffff810f7bcf>] 
writeback_single_inode+0x109/0x215
Mar 30 18:51:43 alfa kernel:  [<ffffffff810f84bd>] 
writeback_inodes_wb+0x33a/0x3cc
Mar 30 18:51:43 alfa kernel:  [<ffffffff810f8686>] wb_writeback+0x137/0x1c7
Mar 30 18:51:43 alfa kernel:  [<ffffffff810f8830>] ? 
wb_do_writeback+0x7d/0x1ae
Mar 30 18:51:43 alfa kernel:  [<ffffffff810f892c>] 
wb_do_writeback+0x179/0x1ae
Mar 30 18:51:43 alfa kernel:  [<ffffffff810f8830>] ? 
wb_do_writeback+0x7d/0x1ae
Mar 30 18:51:43 alfa kernel:  [<ffffffff8105064c>] ? 
process_timeout+0x0/0x10
Mar 30 18:51:43 alfa kernel:  [<ffffffff810c10ed>] ? bdi_start_fn+0x0/0xd1
Mar 30 18:51:43 alfa kernel:  [<ffffffff810f898d>] 
bdi_writeback_task+0x2c/0xa2
Mar 30 18:51:43 alfa kernel:  [<ffffffff810c1163>] bdi_start_fn+0x76/0xd1
Mar 30 18:51:43 alfa kernel:  [<ffffffff810c10ed>] ? bdi_start_fn+0x0/0xd1
Mar 30 18:51:43 alfa kernel:  [<ffffffff8105dda1>] kthread+0x82/0x8d
Mar 30 18:51:43 alfa kernel:  [<ffffffff8100c15a>] child_rip+0xa/0x20
Mar 30 18:51:43 alfa kernel:  [<ffffffff8100bafc>] ? restore_args+0x0/0x30
Mar 30 18:51:43 alfa kernel:  [<ffffffff81038596>] ? 
finish_task_switch+0x0/0xbc
Mar 30 18:51:43 alfa kernel:  [<ffffffff8105dd1f>] ? kthread+0x0/0x8d
Mar 30 18:51:43 alfa kernel:  [<ffffffff8100c150>] ? child_rip+0x0/0x20
Mar 30 18:51:43 alfa kernel: Code: 8e eb 01 00 00 b8 01 00 00 00 48 d3 e0 ff 
c8 23 43 18 48 23 45 a8 4c 39 f8 0f 85 ae 00 00 00 48 8b 83 80 00 00 00 48 
85 c0
74 0b <66> f7 80 8c 00 00 00 ff 01 75 13 80 bb 0a 02 00 00 00 75 0a 8b
Mar 30 18:51:43 alfa kernel: RIP  [<ffffffff811d755b>] 
xfs_iflush_cluster+0x148/0x35a
Mar 30 18:51:43 alfa kernel:  RSP <ffff880228ce5b60>
Mar 30 18:51:43 alfa kernel: CR2: 000000320000008c
Mar 30 18:51:43 alfa kernel: ---[ end trace e6c8391ea76602f4 ]---
Mar 30 18:51:43 alfa kernel: flush-8:16 used greatest stack depth: 2464 
bytes left
Mar 30 19:09:39 alfa syslogd 1.4.1: restart.

----- Original Message ----- 
From: "Janos Haar" <janos.haar@netcenter.hu>
To: "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>
Cc: <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>; 
<linux-mm@kvack.org>
Sent: Thursday, March 25, 2010 10:54 AM
Subject: Re: Somebody take a look please! (some kind of kernel bug?)


> [snip: previous message quoted in full]



* Re: Somebody take a look please! (some kind of kernel bug?)
  2010-04-01 10:01       ` Janos Haar
@ 2010-04-01 10:37         ` Américo Wang
  2010-04-02 22:07           ` Janos Haar
  0 siblings, 1 reply; 33+ messages in thread
From: Américo Wang @ 2010-04-01 10:37 UTC (permalink / raw)
  To: Janos Haar; +Cc: linux-kernel, KAMEZAWA Hiroyuki, linux-mm, xfs, Jens Axboe

On Thu, Apr 1, 2010 at 6:01 PM, Janos Haar <janos.haar@netcenter.hu> wrote:
> Hello,
>

Hi,
This is a totally different bug from the previous one you reported. :)

> Another issue with this production server:
> Can somebody point me in the right direction?
> Or confirm whether or not this is a hardware problem?


Probably not; it looks like an XFS bug or a write-back bug.

Thanks for your report. Cc'ing related people...
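
(One detail that supports this reading, taken only from the oops itself and
not checked against the XFS source: the Code: line disassembles to
"testw $0x1ff,0x8c(%rax)", and with RAX=0000003200000000 that access is
exactly the faulting address CR2=000000320000008c. So xfs_iflush_cluster is
testing a 16-bit field at offset 0x8c of an inode pointer whose value is
garbage; the cluster scan appears to have picked up a stale or corrupted
inode pointer.)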


>
> The messages file is here:
> http://download.netcenter.hu/bughunt/20100324/marc30
> [snip: oops log quoted in full above]


* Re: Somebody take a look please! (some kind of kernel bug?)
  2010-04-01 10:37         ` Américo Wang
@ 2010-04-02 22:07           ` Janos Haar
  2010-04-02 23:09             ` Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...) Dave Chinner
  0 siblings, 1 reply; 33+ messages in thread
From: Janos Haar @ 2010-04-02 22:07 UTC (permalink / raw)
  To: Américo Wang; +Cc: linux-kernel, KAMEZAWA Hiroyuki, linux-mm, xfs, axboe

Hello,

----- Original Message ----- 
From: "Américo Wang" <xiyou.wangcong@gmail.com>
To: "Janos Haar" <janos.haar@netcenter.hu>
Cc: <linux-kernel@vger.kernel.org>; "KAMEZAWA Hiroyuki" 
<kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>; 
"Jens Axboe" <axboe@kernel.dk>
Sent: Thursday, April 01, 2010 12:37 PM
Subject: Re: Somebody take a look please! (some kind of kernel bug?)


> On Thu, Apr 1, 2010 at 6:01 PM, Janos Haar <janos.haar@netcenter.hu> 
> wrote:
>> Hello,
>>
>
> Hi,
> This is a totally different bug from the previous one you reported. :)

Today I got this again, exactly the same. (If somebody wants the log, just 
ask.)
Here is an excerpt:

Apr  1 18:50:02 alfa kernel: possible SYN flooding on port 80. Sending 
cookies.
Apr  2 21:16:59 alfa kernel: BUG: unable to handle kernel paging request at 
000000010000008c
Apr  2 21:16:59 alfa kernel: IP: [<ffffffff811d755b>] 
xfs_iflush_cluster+0x148/0x35a
Apr  2 21:16:59 alfa kernel: PGD a7374067 PUD 0
Apr  2 21:16:59 alfa kernel: Oops: 0000 [#1] SMP
Apr  2 21:16:59 alfa kernel: last sysfs file: /sys/class/misc/rfkill/dev
Apr  2 21:16:59 alfa kernel: CPU 1
Apr  2 21:16:59 alfa kernel: Modules linked in: hidp l2cap crc16 bluetooth 
rfkill ipv6 video output sbs sbshc battery ac parport_pc lp parport 
8250_pnp serio_raw shpchp 8250 serial_core i2c_i801 button pcspkr i2c_core
Apr  2 21:16:59 alfa kernel: Pid: 3118, comm: flush-8:16 Not tainted 
2.6.32.10 #2
Apr  2 21:16:59 alfa kernel: RIP: 0010:[<ffffffff811d755b>] 
[<ffffffff811d755b>] xfs_iflush_cluster+0x148/0x35a
Apr  2 21:16:59 alfa kernel: RSP: 0000:ffff88022849db60  EFLAGS: 00010206
Apr  2 21:16:59 alfa kernel: RAX: 0000000100000000 RBX: ffff8801535b47d0 
RCX: 000000000000001a
Apr  2 21:16:59 alfa kernel: RDX: 0000000000000020 RSI: ffff880178e49158 
RDI: ffff88022a5c8138
Apr  2 21:16:59 alfa kernel: RBP: ffff88022849dbd0 R08: 0000000000000001 
R09: ffff880137ba67a0
Apr  2 21:16:59 alfa kernel: R10: ffff88022849db50 R11: 0000000000000020 
R12: ffff880137ba6858
Apr  2 21:16:59 alfa kernel: R13: ffff880115f4cd68 R14: ffff88022953a9e0 
R15: 000000000061d440
Apr  2 21:16:59 alfa kernel: FS:  0000000000000000(0000) 
GS:ffff880028280000(0000) knlGS:0000000000000000
Apr  2 21:16:59 alfa kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 
000000008005003b
Apr  2 21:16:59 alfa kernel: CR2: 000000010000008c CR3: 0000000028154000 
CR4: 00000000000006e0
Apr  2 21:16:59 alfa kernel: DR0: 0000000000000000 DR1: 0000000000000000 
DR2: 0000000000000000
Apr  2 21:16:59 alfa kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 
DR7: 0000000000000400
Apr  2 21:16:59 alfa kernel: Process flush-8:16 (pid: 3118, threadinfo 
ffff88022849c000, task ffff88022a4f4040)
Apr  2 21:16:59 alfa kernel: Stack:
Apr  2 21:16:59 alfa kernel:  ffff88022953a9e0 ffff8801d8ac58d0 
ffff88022960f7a8 ffffffffffffffe0
Apr  2 21:16:59 alfa kernel: <0> 0000000000000020 ffff8801d53bb5e8 
000000202849dbd0 0000000a00000001
Apr  2 21:16:59 alfa kernel: <0> ffff8801d53bb638 ffff880115f4cd68 
ffff8801d8ac58d0 0000000000000002
Apr  2 21:16:59 alfa kernel: Call Trace:
Apr  2 21:16:59 alfa kernel:  [<ffffffff811d7931>] xfs_iflush+0x1c4/0x272
Apr  2 21:16:59 alfa kernel:  [<ffffffff8103458e>] ? 
try_wait_for_completion+0x24/0x45
Apr  2 21:16:59 alfa kernel:  [<ffffffff811f819c>] 
xfs_fs_write_inode+0xe0/0x11e
Apr  2 21:16:59 alfa kernel:  [<ffffffff810f7bcf>] 
writeback_single_inode+0x109/0x215
Apr  2 21:16:59 alfa kernel:  [<ffffffff810f84bd>] 
writeback_inodes_wb+0x33a/0x3cc
Apr  2 21:16:59 alfa kernel:  [<ffffffff810f8686>] wb_writeback+0x137/0x1c7
Apr  2 21:16:59 alfa kernel:  [<ffffffff810f8830>] ? 
wb_do_writeback+0x7d/0x1ae
Apr  2 21:16:59 alfa kernel:  [<ffffffff810f892c>] 
wb_do_writeback+0x179/0x1ae
Apr  2 21:16:59 alfa kernel:  [<ffffffff810f8830>] ? 
wb_do_writeback+0x7d/0x1ae
Apr  2 21:16:59 alfa kernel:  [<ffffffff8105064c>] ? 
process_timeout+0x0/0x10
Apr  2 21:16:59 alfa kernel:  [<ffffffff810c10ed>] ? bdi_start_fn+0x0/0xd1
Apr  2 21:16:59 alfa kernel:  [<ffffffff810f898d>] 
bdi_writeback_task+0x2c/0xa2
Apr  2 21:16:59 alfa kernel:  [<ffffffff810c1163>] bdi_start_fn+0x76/0xd1
Apr  2 21:16:59 alfa kernel:  [<ffffffff810c10ed>] ? bdi_start_fn+0x0/0xd1
Apr  2 21:16:59 alfa kernel:  [<ffffffff8105dda1>] kthread+0x82/0x8d
Apr  2 21:16:59 alfa kernel:  [<ffffffff8100c15a>] child_rip+0xa/0x20
Apr  2 21:16:59 alfa kernel:  [<ffffffff8100bafc>] ? restore_args+0x0/0x30
Apr  2 21:16:59 alfa kernel:  [<ffffffff81038596>] ? 
finish_task_switch+0x0/0xbc
Apr  2 21:16:59 alfa kernel:  [<ffffffff8105dd1f>] ? kthread+0x0/0x8d
Apr  2 21:16:59 alfa kernel:  [<ffffffff8100c150>] ? child_rip+0x0/0x20
Apr  2 21:16:59 alfa kernel: Code: 8e eb 01 00 00 b8 01 00 00 00 48 d3 e0 ff 
c8 23 43 18 48 23 45 a8 4c 39 f8 0f 85 ae 00 00 00 48 8b 83 80 00 00 00 48 
85 c0
74 0b <66> f7 80 8c 00 00 00 ff 01 75 13 80 bb 0a 02 00 00 00 75 0a 8b
Apr  2 21:16:59 alfa kernel: RIP  [<ffffffff811d755b>] 
xfs_iflush_cluster+0x148/0x35a
Apr  2 21:16:59 alfa kernel:  RSP <ffff88022849db60>
Apr  2 21:16:59 alfa kernel: CR2: 000000010000008c
Apr  2 21:16:59 alfa kernel: ---[ end trace 7528355f76bf7b08 ]---
Apr  2 21:17:53 alfa kernel: BUG: soft lockup - CPU#3 stuck for 61s! 
[httpd:17617]
Apr  2 21:17:53 alfa kernel: Modules linked in: hidp l2cap crc16 bluetooth 
rfkill ipv6 video output sbs sbshc battery ac parport_pc lp parport 
8250_pnp serio_raw shpchp 8250 serial_core i2c_i801 button pcspkr i2c_core
Apr  2 21:17:53 alfa kernel: CPU 3:
Apr  2 21:17:53 alfa kernel: Modules linked in: hidp l2cap crc16 bluetooth 
rfkill ipv6 video output sbs sbshc battery ac parport_pc lp parport 
8250_pnp serio_raw shpchp 8250 serial_core i2c_i801 button pcspkr i2c_core
Apr  2 21:17:53 alfa kernel: Pid: 17617, comm: httpd Tainted: G      D 
2.6.32.10 #2
Apr  2 21:17:53 alfa kernel: RIP: 0010:[<ffffffff8171a0cf>] 
[<ffffffff8171a0cf>] __write_lock_failed+0xf/0x20
Apr  2 21:17:53 alfa kernel: RSP: 0018:ffff8800a46b1a20  EFLAGS: 00000287
Apr  2 21:17:53 alfa kernel: RAX: 0000000000000003 RBX: ffff8800a46b1a38 
RCX: 0000000000000000
Apr  2 21:17:53 alfa kernel: RDX: 0000000000000000 RSI: 0000000000000000 
RDI: ffff88022960f7a8
Apr  2 21:17:53 alfa kernel: RBP: ffffffff8100bc2e R08: 0000000000000001 
R09: 0000000000000000
Apr  2 21:17:53 alfa kernel: R10: ffffffff812f1fcf R11: 0000000000014001 
R12: ffff88002838e820
Apr  2 21:17:53 alfa kernel: R13: 0000000000005033 R14: ffff8800a46b0000 
R15: 0000000000000100
Apr  2 21:17:53 alfa kernel: FS:  00007feea89c26f0(0000) 
GS:ffff880028380000(0000) knlGS:0000000000000000
Apr  2 21:17:53 alfa kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
0000000080050033
Apr  2 21:17:53 alfa kernel: CR2: 0000000000edba48 CR3: 000000017b034000 
CR4: 00000000000006e0
Apr  2 21:17:53 alfa kernel: DR0: 0000000000000000 DR1: 0000000000000000 
DR2: 0000000000000000
Apr  2 21:17:53 alfa kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 
DR7: 0000000000000400
Apr  2 21:17:53 alfa kernel: Call Trace:
Apr  2 21:17:53 alfa kernel:  [<ffffffff812fa0b0>] ? 
_raw_write_lock+0x6a/0x7e
Apr  2 21:17:53 alfa kernel:  [<ffffffff8175d8ce>] ? _write_lock+0x39/0x3e
Apr  2 21:17:53 alfa kernel:  [<ffffffff811d3ea1>] ? xfs_iget+0x2e3/0x422
Apr  2 21:17:53 alfa kernel:  [<ffffffff811d3ea1>] ? xfs_iget+0x2e3/0x422
Apr  2 21:17:53 alfa kernel:  [<ffffffff811e9591>] ? 
xfs_trans_iget+0x2a/0x55
Apr  2 21:17:53 alfa kernel:  [<ffffffff811d7a7a>] ? xfs_ialloc+0x9b/0x569
Apr  2 21:17:53 alfa kernel:  [<ffffffff8175d0b4>] ? 
__down_write_nested+0x1a/0xa1
Apr  2 21:17:53 alfa kernel:  [<ffffffff811e9ea3>] ? 
xfs_dir_ialloc+0x78/0x289
Apr  2 21:17:53 alfa kernel:  [<ffffffff81060dd2>] ? 
down_write_nested+0x52/0x59
Apr  2 21:17:53 alfa kernel:  [<ffffffff811ec35f>] ? xfs_create+0x317/0x526
Apr  2 21:17:53 alfa kernel:  [<ffffffff811f5642>] ? xfs_vn_mknod+0xdb/0x171
Apr  2 21:17:53 alfa kernel:  [<ffffffff811f56fd>] ? xfs_vn_create+0x10/0x12
Apr  2 21:17:53 alfa kernel:  [<ffffffff810e5844>] ? vfs_create+0xee/0x18c
Apr  2 21:17:53 alfa kernel:  [<ffffffff810e7c46>] ? 
do_filp_open+0x31a/0x99f
Apr  2 21:17:53 alfa kernel:  [<ffffffff810df47e>] ? cp_new_stat+0xfb/0x114
Apr  2 21:17:53 alfa kernel:  [<ffffffff810f0e8e>] ? alloc_fd+0x38/0x123
Apr  2 21:17:53 alfa kernel:  [<ffffffff8175d6b1>] ? _spin_unlock+0x2b/0x2f
Apr  2 21:17:53 alfa kernel:  [<ffffffff810d9f3b>] ? do_sys_open+0x62/0x109
Apr  2 21:17:53 alfa kernel:  [<ffffffff810da015>] ? sys_open+0x20/0x22
Apr  2 21:17:53 alfa kernel:  [<ffffffff8100b09b>] ? 
system_call_fastpath+0x16/0x1b
Apr  2 21:18:59 alfa kernel: BUG: soft lockup - CPU#3 stuck for 61s! 
[httpd:17617]
Apr  2 21:31:26 alfa syslogd 1.4.1: restart.


It looks like I can reproduce the bug on this server every 2-3 days.
The only problem is, my customer will kill me if I can't fix it soon. :-)

Can somebody help me or suggest another solution to avoid this problem?

Thanks a lot,

Janos Haar


>
>> Another issue with this production server:
>> Can somebody point me in the right direction?
>> Or confirm whether this is a hw problem or not?
>
>
> Probably not, it looks like an XFS bug or a write-back bug.
>
> Thanks for your report. Cc'ing related people...
>
>
>>
>> The messages file are here:
>> http://download.netcenter.hu/bughunt/20100324/marc30
>>
>> Thanks,
>> Janos Haar
>>
>> Mar 30 18:51:43 alfa kernel: BUG: unable to handle kernel paging request 
>> at
>> 000000320000008c
>> Mar 30 18:51:43 alfa kernel: IP: [<ffffffff811d755b>]
>> xfs_iflush_cluster+0x148/0x35a
>> Mar 30 18:51:43 alfa kernel: PGD 102d7a067 PUD 0
>> Mar 30 18:51:43 alfa kernel: Oops: 0000 [#1] SMP
>> Mar 30 18:51:43 alfa kernel: last sysfs file: /sys/class/misc/rfkill/dev
>> Mar 30 18:51:43 alfa kernel: CPU 0
>> Mar 30 18:51:43 alfa kernel: Modules linked in: hidp l2cap crc16 bluetooth
>> rfkill ipv6 video output sbs sbshc battery ac parport_pc lp parport
>> serio_raw 8250_pnp 8250 serial_core shpchp button i2c_i801 i2c_core pcspkr
>> Mar 30 18:51:43 alfa kernel: Pid: 3242, comm: flush-8:16 Not tainted
>> 2.6.32.10 #2
>> Mar 30 18:51:43 alfa kernel: RIP: 0010:[<ffffffff811d755b>]
>> [<ffffffff811d755b>] xfs_iflush_cluster+0x148/0x35a
>> Mar 30 18:51:43 alfa kernel: RSP: 0000:ffff880228ce5b60 EFLAGS: 00010206
>> Mar 30 18:51:43 alfa kernel: RAX: 0000003200000000 RBX: ffff8801537947d0
>> RCX: 000000000000001a
>> Mar 30 18:51:43 alfa kernel: RDX: 0000000000000020 RSI: 00000000000c6cc2
>> RDI: 0000000000000001
>> Mar 30 18:51:43 alfa kernel: RBP: ffff880228ce5bd0 R08: ffff880228ce5b20
>> R09: ffff8801ea436928
>> Mar 30 18:51:43 alfa kernel: R10: 00000000000c6cc2 R11: 0000000000000001
>> R12: ffff8800b630b11a
>> Mar 30 18:51:43 alfa kernel: R13: ffff8801bd54ab30 R14: ffff88022962d2b8
>> R15: 00000000000c6ca0
>> Mar 30 18:51:43 alfa kernel: FS: 0000000000000000(0000)
>> GS:ffff880028200000(0000) knlGS:0000000000000000
>> Mar 30 18:51:43 alfa kernel: CS: 0010 DS: 0018 ES: 0018 CR0:
>> 000000008005003b
>> Mar 30 18:51:43 alfa kernel: CR2: 000000320000008c CR3: 0000000168e75000
>> CR4: 00000000000006f0
>> Mar 30 18:51:43 alfa kernel: DR0: 0000000000000000 DR1: 0000000000000000
>> DR2: 0000000000000000
>> Mar 30 18:51:43 alfa kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0
>> DR7: 0000000000000400
>> Mar 30 18:51:43 alfa kernel: Process flush-8:16 (pid: 3242, threadinfo
>> ffff880228ce4000, task ffff880228ea4040)
>> Mar 30 18:51:43 alfa kernel: Stack:
>> Mar 30 18:51:43 alfa kernel: ffff8801bd54ab30 ffff8800b630b140
>> ffff88022a2d99d0 ffffffffffffffe0
>> Mar 30 18:51:43 alfa kernel: <0> 0000000000000020 ffff880218e3db60
>> 0000002028ce5bd0 0000000200000000
>> Mar 30 18:51:43 alfa kernel: <0> ffff880218e3db70 ffff8801bd54ab30
>> ffff8800b630b140 0000000000000002
>> Mar 30 18:51:43 alfa kernel: Call Trace:
>> Mar 30 18:51:43 alfa kernel: [<ffffffff811d7931>] xfs_iflush+0x1c4/0x272
>> Mar 30 18:51:43 alfa kernel: [<ffffffff8103458e>] ?
>> try_wait_for_completion+0x24/0x45
>> Mar 30 18:51:43 alfa kernel: [<ffffffff811f819c>]
>> xfs_fs_write_inode+0xe0/0x11e
>> Mar 30 18:51:43 alfa kernel: [<ffffffff810f7bcf>]
>> writeback_single_inode+0x109/0x215
>> Mar 30 18:51:43 alfa kernel: [<ffffffff810f84bd>]
>> writeback_inodes_wb+0x33a/0x3cc
>> Mar 30 18:51:43 alfa kernel: [<ffffffff810f8686>] 
>> wb_writeback+0x137/0x1c7
>> Mar 30 18:51:43 alfa kernel: [<ffffffff810f8830>] ?
>> wb_do_writeback+0x7d/0x1ae
>> Mar 30 18:51:43 alfa kernel: [<ffffffff810f892c>]
>> wb_do_writeback+0x179/0x1ae
>> Mar 30 18:51:43 alfa kernel: [<ffffffff810f8830>] ?
>> wb_do_writeback+0x7d/0x1ae
>> Mar 30 18:51:43 alfa kernel: [<ffffffff8105064c>] ?
>> process_timeout+0x0/0x10
>> Mar 30 18:51:43 alfa kernel: [<ffffffff810c10ed>] ? bdi_start_fn+0x0/0xd1
>> Mar 30 18:51:43 alfa kernel: [<ffffffff810f898d>]
>> bdi_writeback_task+0x2c/0xa2
>> Mar 30 18:51:43 alfa kernel: [<ffffffff810c1163>] bdi_start_fn+0x76/0xd1
>> Mar 30 18:51:43 alfa kernel: [<ffffffff810c10ed>] ? bdi_start_fn+0x0/0xd1
>> Mar 30 18:51:43 alfa kernel: [<ffffffff8105dda1>] kthread+0x82/0x8d
>> Mar 30 18:51:43 alfa kernel: [<ffffffff8100c15a>] child_rip+0xa/0x20
>> Mar 30 18:51:43 alfa kernel: [<ffffffff8100bafc>] ? restore_args+0x0/0x30
>> Mar 30 18:51:43 alfa kernel: [<ffffffff81038596>] ?
>> finish_task_switch+0x0/0xbc
>> Mar 30 18:51:43 alfa kernel: [<ffffffff8105dd1f>] ? kthread+0x0/0x8d
>> Mar 30 18:51:43 alfa kernel: [<ffffffff8100c150>] ? child_rip+0x0/0x20
>> Mar 30 18:51:43 alfa kernel: Code: 8e eb 01 00 00 b8 01 00 00 00 48 d3 e0 ff
>> c8 23 43 18 48 23 45 a8 4c 39 f8 0f 85 ae 00 00 00 48 8b 83 80 00 00 00 48
>> 85 c0 74 0b <66> f7 80 8c 00 00 00 ff 01 75 13 80 bb 0a 02 00 00 00 75 0a 8b
>> Mar 30 18:51:43 alfa kernel: RIP [<ffffffff811d755b>]
>> xfs_iflush_cluster+0x148/0x35a
>> Mar 30 18:51:43 alfa kernel: RSP <ffff880228ce5b60>
>> Mar 30 18:51:43 alfa kernel: CR2: 000000320000008c
>> Mar 30 18:51:43 alfa kernel: ---[ end trace e6c8391ea76602f4 ]---
>> Mar 30 18:51:43 alfa kernel: flush-8:16 used greatest stack depth: 2464
>> bytes left
>> Mar 30 19:09:39 alfa syslogd 1.4.1: restart.
>>



* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
  2010-04-02 22:07           ` Janos Haar
@ 2010-04-02 23:09             ` Dave Chinner
  2010-04-03 13:42               ` Janos Haar
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2010-04-02 23:09 UTC (permalink / raw)
  To: Janos Haar
  Cc: Américo Wang, linux-kernel, KAMEZAWA Hiroyuki, linux-mm, xfs, axboe

On Sat, Apr 03, 2010 at 12:07:00AM +0200, Janos Haar wrote:
> Hello,
> 
> ----- Original Message ----- From: "Américo Wang"
> <xiyou.wangcong@gmail.com>
> To: "Janos Haar" <janos.haar@netcenter.hu>
> Cc: <linux-kernel@vger.kernel.org>; "KAMEZAWA Hiroyuki"
> <kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>;
> <xfs@oss.sgi.com>; "Jens Axboe" <axboe@kernel.dk>
> Sent: Thursday, April 01, 2010 12:37 PM
> Subject: Re: Somebody take a look please! (some kind of kernel bug?)
> 
> 
> >On Thu, Apr 1, 2010 at 6:01 PM, Janos Haar
> ><janos.haar@netcenter.hu> wrote:
> >>Hello,
> >>
> >
> >Hi,
> >This is a totally different bug from the previous one reported by you. :)
> 
> Today I got this again, exactly the same. (if somebody wants
> the log, just ask)
> Here is a cut:

Small hint - please put the subsystem the bug occurred in in the
subject line. I missed this in the firehose of lkml traffic because
there was nothing to indicate to me it was in XFS. Something like:

"Kernel crash in xfs_iflush_cluster"

Won't get missed quite so easily....

This may be a fixed problem - what kernel are you running?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
  2010-04-02 23:09             ` Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...) Dave Chinner
@ 2010-04-03 13:42               ` Janos Haar
  2010-04-04 10:37                 ` Dave Chinner
  0 siblings, 1 reply; 33+ messages in thread
From: Janos Haar @ 2010-04-03 13:42 UTC (permalink / raw)
  To: Dave Chinner
  Cc: xiyou.wangcong, linux-kernel, kamezawa.hiroyu, linux-mm, xfs, axboe

Hello,

The kernel version currently running is 2.6.32.10.
Are there any significant fixes for me in the latest (.11) or in the next 
(2.6.33.x)?

Thanks,
Janos

----- Original Message ----- 
From: "Dave Chinner" <david@fromorbit.com>
To: "Janos Haar" <janos.haar@netcenter.hu>
Cc: "Américo Wang" <xiyou.wangcong@gmail.com>; 
<linux-kernel@vger.kernel.org>; "KAMEZAWA Hiroyuki" 
<kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>; 
<axboe@kernel.dk>
Sent: Saturday, April 03, 2010 1:09 AM
Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look 
please!...)


> On Sat, Apr 03, 2010 at 12:07:00AM +0200, Janos Haar wrote:
>> Hello,
>>
>> ----- Original Message ----- From: "Américo Wang"
>> <xiyou.wangcong@gmail.com>
>> To: "Janos Haar" <janos.haar@netcenter.hu>
>> Cc: <linux-kernel@vger.kernel.org>; "KAMEZAWA Hiroyuki"
>> <kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>;
>> <xfs@oss.sgi.com>; "Jens Axboe" <axboe@kernel.dk>
>> Sent: Thursday, April 01, 2010 12:37 PM
>> Subject: Re: Somebody take a look please! (some kind of kernel bug?)
>>
>>
>> >On Thu, Apr 1, 2010 at 6:01 PM, Janos Haar
>> ><janos.haar@netcenter.hu> wrote:
>> >>Hello,
>> >>
>> >
>> >Hi,
>> >This is a totally different bug from the previous one reported by you. 
>> >:)
>>
>> Today I got this again, exactly the same. (if somebody wants
>> the log, just ask)
>> Here is a cut:
>
> Small hint - please put the subsystem the bug occurred in in the
> subject line. I missed this in the firehose of lkml traffic because
> there was nothing to indicate to me it was in XFS. Something like:
>
> "Kernel crash in xfs_iflush_cluster"
>
> Won't get missed quite so easily....
>
> This may be a fixed problem - what kernel are you running?
>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com



* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
  2010-04-03 13:42               ` Janos Haar
@ 2010-04-04 10:37                 ` Dave Chinner
  2010-04-05 18:17                   ` Janos Haar
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2010-04-04 10:37 UTC (permalink / raw)
  To: Janos Haar
  Cc: xiyou.wangcong, linux-kernel, kamezawa.hiroyu, linux-mm, xfs, axboe

On Sat, Apr 03, 2010 at 03:42:10PM +0200, Janos Haar wrote:
> Hello,
> 
> The kernel version currently running is 2.6.32.10.
> Are there any significant fixes for me in the latest (.11) or in the
> next (2.6.33.x)?

The fixes for this bug are queued up already for the next
2.6.32.x release.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
  2010-04-04 10:37                 ` Dave Chinner
@ 2010-04-05 18:17                   ` Janos Haar
  2010-04-05 22:45                     ` Dave Chinner
  0 siblings, 1 reply; 33+ messages in thread
From: Janos Haar @ 2010-04-05 18:17 UTC (permalink / raw)
  To: Dave Chinner
  Cc: xiyou.wangcong, linux-kernel, kamezawa.hiroyu, linux-mm, xfs, axboe

Dave,

Thank you for your answer.
Like I said before, this is a production server running an important service.
Can you please send me the fix as soon as it is done, even just for testing 
it....
Or point me in the right direction to get it?

Thanks a lot,
Janos Haar

----- Original Message ----- 
From: "Dave Chinner" <david@fromorbit.com>
To: "Janos Haar" <janos.haar@netcenter.hu>
Cc: <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>; 
<kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>; 
<axboe@kernel.dk>
Sent: Sunday, April 04, 2010 12:37 PM
Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look 
please!...)


> On Sat, Apr 03, 2010 at 03:42:10PM +0200, Janos Haar wrote:
>> Hello,
>>
>> The kernel version currently running is 2.6.32.10.
>> Are there any significant fixes for me in the latest (.11) or in the
>> next (2.6.33.x)?
>
> The fixes for this bug are queued up already for the next
> 2.6.32.x release.
>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com 



* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
  2010-04-05 18:17                   ` Janos Haar
@ 2010-04-05 22:45                     ` Dave Chinner
  2010-04-05 22:59                       ` Janos Haar
  2010-04-08  2:45                       ` Janos Haar
  0 siblings, 2 replies; 33+ messages in thread
From: Dave Chinner @ 2010-04-05 22:45 UTC (permalink / raw)
  To: Janos Haar
  Cc: xiyou.wangcong, linux-kernel, kamezawa.hiroyu, linux-mm, xfs, axboe

On Mon, Apr 05, 2010 at 08:17:27PM +0200, Janos Haar wrote:
> Dave,
> 
> Thank you for your answer.
> Like I said before, this is a production server running an important service.
> Can you please send me the fix as soon as it is done, even just for
> testing it....
> Or point me in the right direction to get it?

It's in 2.6.33 if you want to upgrade the kernel, or if you don't
want to wait for the next 2.6.32.x kernel, you can apply this series
of 19 patches yourself:

http://oss.sgi.com/archives/xfs/2010-03/msg00125.html
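
Applying them is roughly the sketch below - a rough outline only, and
the directory names are placeholders of mine, not the real ones:

  cd /usr/src/linux-2.6.32.10        # your 2.6.32.10 source tree
  for p in ~/xfs-fixes/*.patch; do   # the 19 patches saved from that post
      patch -p1 < "$p" || break      # stop at the first one that fails
  done
  make oldconfig && make && make modules_install install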

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
  2010-04-05 22:45                     ` Dave Chinner
@ 2010-04-05 22:59                       ` Janos Haar
  2010-04-08  2:45                       ` Janos Haar
  1 sibling, 0 replies; 33+ messages in thread
From: Janos Haar @ 2010-04-05 22:59 UTC (permalink / raw)
  To: Dave Chinner
  Cc: xiyou.wangcong, linux-kernel, kamezawa.hiroyu, linux-mm, xfs, axboe


----- Original Message ----- 
From: "Dave Chinner" <david@fromorbit.com>
To: "Janos Haar" <janos.haar@netcenter.hu>
Cc: <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>; 
<kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>; 
<axboe@kernel.dk>
Sent: Tuesday, April 06, 2010 12:45 AM
Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look 
please!...)


> On Mon, Apr 05, 2010 at 08:17:27PM +0200, Janos Haar wrote:
>> Dave,
>>
>> Thank you for your answer.
>> Like I said before, this is a production server running an important service.
>> Can you please send me the fix as soon as it is done, even just for
>> testing it....
>> Or point me in the right direction to get it?
>
> It's in 2.6.33 if you want to upgrade the kernel, or if you don't
> want to wait for the next 2.6.32.x kernel, you can apply this series
> of 19 patches yourself:

Generally, for this system, I much prefer the extra-stable series, 
but in this case I will try out 2.6.33, because these two versions are close 
to each other and I don't want to apply 19 patches manually. :-)
I will try it, and I will report the result this week.

Thanks to you and to all the people who work on XFS and Linux. :-)

Best Regards,
Janos Haar


>
> http://oss.sgi.com/archives/xfs/2010-03/msg00125.html
>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com 



* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
  2010-04-05 22:45                     ` Dave Chinner
  2010-04-05 22:59                       ` Janos Haar
@ 2010-04-08  2:45                       ` Janos Haar
  2010-04-08  2:58                         ` Dave Chinner
  1 sibling, 1 reply; 33+ messages in thread
From: Janos Haar @ 2010-04-08  2:45 UTC (permalink / raw)
  To: Dave Chinner
  Cc: xiyou.wangcong, linux-kernel, kamezawa.hiroyu, linux-mm, xfs, axboe

Hello,

Sorry, but I still have the problem with 2.6.33.2.

Apr  8 03:04:51 alfa kernel: BUG: unable to handle kernel paging request at 
000000610000008c
Apr  8 03:04:51 alfa kernel: IP: [<ffffffff811f17c4>] 
xfs_iflush_cluster+0x148/0x35a
Apr  8 03:04:51 alfa kernel: PGD 22258a067 PUD 0
Apr  8 03:04:51 alfa kernel: Oops: 0000 [#1] SMP
Apr  8 03:04:51 alfa kernel: last sysfs file: /sys/class/misc/rfkill/dev
Apr  8 03:04:51 alfa kernel: CPU 2
Apr  8 03:04:51 alfa kernel: Pid: 3049, comm: xfssyncd Not tainted 2.6.33.2 
#1 DP35DP/
Apr  8 03:04:51 alfa kernel: RIP: 0010:[<ffffffff811f17c4>] 
[<ffffffff811f17c4>] xfs_iflush_cluster+0x148/0x35a
Apr  8 03:04:51 alfa kernel: RSP: 0018:ffff880228e3bca0  EFLAGS: 00010206
Apr  8 03:04:51 alfa kernel: RAX: 0000006100000000 RBX: ffff880153795750 
RCX: 000000000000001a
Apr  8 03:04:51 alfa kernel: RDX: 0000000000000020 RSI: 00000000003dfdf4 
RDI: 0000000000000005
Apr  8 03:04:51 alfa kernel: RBP: ffff880228e3bd10 R08: ffff880228e3bc60 
R09: ffff8801c5d6e1b8
Apr  8 03:04:51 alfa kernel: R10: 00000000003dfdf4 R11: 0000000000000005 
R12: 000000000000001a
Apr  8 03:04:51 alfa kernel: R13: ffff8800b1d920d8 R14: ffff88022a7cabe0 
R15: 00000000003ddf80
Apr  8 03:04:51 alfa kernel: FS:  0000000000000000(0000) 
GS:ffff880028300000(0000) knlGS:0000000000000000
Apr  8 03:04:51 alfa kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
000000008005003b
Apr  8 03:04:51 alfa kernel: CR2: 000000610000008c CR3: 00000002222db000 
CR4: 00000000000006e0
Apr  8 03:04:51 alfa kernel: DR0: 0000000000000000 DR1: 0000000000000000 
DR2: 0000000000000000
Apr  8 03:04:51 alfa kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 
DR7: 0000000000000400
Apr  8 03:04:51 alfa kernel: Process xfssyncd (pid: 3049, threadinfo 
ffff880228e3a000, task ffff880228e66040)
Apr  8 03:04:51 alfa kernel: Stack:
Apr  8 03:04:51 alfa kernel:  ffff8800b1d920d8 ffff8800466bc100 
ffff880228c32580 ffffffffffffffe0
Apr  8 03:04:51 alfa kernel: <0> 0000000000000020 ffff880228e24930 
0000002028e3bd10 0000000300000000
Apr  8 03:04:51 alfa kernel: <0> ffff880228e24948 ffff8800b1d920d8 
0000000000000002 ffff8800466bc100
Apr  8 03:04:51 alfa kernel: Call Trace:
Apr  8 03:04:51 alfa kernel:  [<ffffffff811f1bcd>] xfs_iflush+0x1f7/0x2aa
Apr  8 03:04:51 alfa kernel:  [<ffffffff811ecc12>] ? xfs_ilock+0x66/0xb7
Apr  8 03:04:51 alfa kernel:  [<ffffffff81214653>] 
xfs_reclaim_inode+0xba/0xee
Apr  8 03:04:51 alfa kernel:  [<ffffffff8121498d>] 
xfs_inode_ag_walk+0x91/0xd7
Apr  8 03:04:51 alfa kernel:  [<ffffffff81214599>] ? 
xfs_reclaim_inode+0x0/0xee
Apr  8 03:04:51 alfa kernel:  [<ffffffff81214a30>] 
xfs_inode_ag_iterator+0x5d/0x8f
Apr  8 03:04:51 alfa kernel:  [<ffffffff81214599>] ? 
xfs_reclaim_inode+0x0/0xee
Apr  8 03:04:51 alfa kernel:  [<ffffffff81214a81>] 
xfs_reclaim_inodes+0x1f/0x21
Apr  8 03:04:51 alfa kernel:  [<ffffffff81214ab6>] xfs_sync_worker+0x33/0x6f
Apr  8 03:04:51 alfa kernel:  [<ffffffff812144cf>] xfssyncd+0x149/0x198
Apr  8 03:04:51 alfa kernel:  [<ffffffff81214386>] ? xfssyncd+0x0/0x198
Apr  8 03:04:51 alfa kernel:  [<ffffffff81057061>] kthread+0x82/0x8a
Apr  8 03:04:51 alfa kernel:  [<ffffffff81002fd4>] 
kernel_thread_helper+0x4/0x10
Apr  8 03:04:51 alfa kernel:  [<ffffffff8179217c>] ? restore_args+0x0/0x30
Apr  8 03:04:51 alfa kernel:  [<ffffffff81056fdf>] ? kthread+0x0/0x8a
Apr  8 03:04:51 alfa kernel:  [<ffffffff81002fd0>] ? 
kernel_thread_helper+0x0/0x10
Apr  8 03:04:51 alfa kernel: Code: 8e eb 01 00 00 b8 01 00 00 00 48 d3 e0 ff 
c8 23 43 18 48 23 45 a8 4c 39 f8 0f 85 ae 00 00 00 48 8b 83 80 00 00 00 48 
85 c0 74 0b <66> f7 80 8c 00 00 00 ff 01 75 13 80 bb 0a 02 00 00 00 75 0a 8b
Apr  8 03:04:51 alfa kernel: RIP  [<ffffffff811f17c4>] 
xfs_iflush_cluster+0x148/0x35a
Apr  8 03:04:51 alfa kernel:  RSP <ffff880228e3bca0>
Apr  8 03:04:51 alfa kernel: CR2: 000000610000008c
Apr  8 03:04:51 alfa kernel: ---[ end trace d1fc6fbf3568ba3f ]---
Apr  8 04:41:11 alfa syslogd 1.4.1: restart.


----- Original Message ----- 
From: "Dave Chinner" <david@fromorbit.com>
To: "Janos Haar" <janos.haar@netcenter.hu>
Cc: <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>; 
<kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>; 
<axboe@kernel.dk>
Sent: Tuesday, April 06, 2010 12:45 AM
Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look 
please!...)


> On Mon, Apr 05, 2010 at 08:17:27PM +0200, Janos Haar wrote:
>> Dave,
>>
>> Thank you for your answer.
>> Like I said before, this is a production server running an important service.
>> Can you please send me the fix as soon as it is done, even just for
>> testing it....
>> Or point me in the right direction to get it?
>
> It's in 2.6.33 if you want to upgrade the kernel, or if you don't
> want to wait for the next 2.6.32.x kernel, you can apply this series
> of 19 patches yourself:
>
> http://oss.sgi.com/archives/xfs/2010-03/msg00125.html
>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com



* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
  2010-04-08  2:45                       ` Janos Haar
@ 2010-04-08  2:58                         ` Dave Chinner
  2010-04-08 11:21                           ` Janos Haar
                                             ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Dave Chinner @ 2010-04-08  2:58 UTC (permalink / raw)
  To: Janos Haar
  Cc: xiyou.wangcong, linux-kernel, kamezawa.hiroyu, linux-mm, xfs, axboe

On Thu, Apr 08, 2010 at 04:45:13AM +0200, Janos Haar wrote:
> Hello,
> 
> Sorry, but I still have the problem with 2.6.33.2.

Yeah, there is still a fix that needs to be backported to .33
to solve this problem. It's in the series for 2.6.32.x, so maybe
pulling the 2.6.32-stable-queue tree in the meantime is your best
bet.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
  2010-04-08  2:58                         ` Dave Chinner
@ 2010-04-08 11:21                           ` Janos Haar
  2010-04-09 21:37                             ` Christian Kujau
  2010-04-10 21:15                           ` Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...) Janos Haar
  2010-04-11 22:44                           ` Janos Haar
  2 siblings, 1 reply; 33+ messages in thread
From: Janos Haar @ 2010-04-08 11:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: xiyou.wangcong, linux-kernel, kamezawa.hiroyu, linux-mm, xfs, axboe


----- Original Message ----- 
From: "Dave Chinner" <david@fromorbit.com>
To: "Janos Haar" <janos.haar@netcenter.hu>
Cc: <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>; 
<kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>; 
<axboe@kernel.dk>
Sent: Thursday, April 08, 2010 4:58 AM
Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look 
please!...)


> On Thu, Apr 08, 2010 at 04:45:13AM +0200, Janos Haar wrote:
>> Hello,
>>
>> Sorry, but I still have the problem with 2.6.33.2.
>
> Yeah, there is still a fix that needs to be backported to .33
> to solve this problem. It's in the series for 2.6.32.x, so maybe
> pulling the 2.6.32-stable-queue tree in the meantime is your best
> bet.

Ok, thank you.
But where can I find this tree?

Thanks,
Janos

>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com



* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
  2010-04-08 11:21                           ` Janos Haar
@ 2010-04-09 21:37                             ` Christian Kujau
  2010-04-09 22:44                               ` Janos Haar
  0 siblings, 1 reply; 33+ messages in thread
From: Christian Kujau @ 2010-04-09 21:37 UTC (permalink / raw)
  To: Janos Haar
  Cc: Dave Chinner, axboe, LKML, xfs, linux-mm, xiyou.wangcong,
	kamezawa.hiroyu

On Thu, 8 Apr 2010 at 13:21, Janos Haar wrote:
> > Yeah, there is still a fix that needs to be backported to .33
> > to solve this problem. It's in the series for 2.6.32.x, so maybe
> > pulling the 2.6.32-stable-queue tree in the meantime is your best
> > bet.
> 
> Ok, thank you.
> But where can I find this tree?


Perhaps Dave meant the stable-queue?

http://git.kernel.org/?p=linux/kernel/git/stable/stable-queue.git
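
A rough sketch of using it (the clone path and the queue directory
layout here are my guesses - adjust to your setup):

  git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git /usr/src/stable-queue
  cd /usr/src/linux-2.6.32.10
  # each queue-* directory holds the pending patches plus a "series" file
  for p in $(cat /usr/src/stable-queue/queue-2.6.32/series); do
      patch -p1 < /usr/src/stable-queue/queue-2.6.32/"$p"
  done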

Then again, 2.6.34-rc3 needs testing too! :-)

Christian.
-- 
BOFH excuse #98:

The vendor put the bug there.


* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
  2010-04-09 21:37                             ` Christian Kujau
@ 2010-04-09 22:44                               ` Janos Haar
  2010-04-10  8:06                                 ` Américo Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Janos Haar @ 2010-04-09 22:44 UTC (permalink / raw)
  To: Christian Kujau
  Cc: david, axboe, "LKML",
	xfs, linux-mm, xiyou.wangcong, kamezawa.hiroyu

Hello,

I have just started to test the stable-queue patch series on 2.6.32.10.
It is running now, we will see...
2.6.33.2 produced 4 crashes in the last 3 days. :-(
That was even worse than the original 2.6.32.10.

(I am very interested anyway; this is the last shot for this server.
The owner has given me an ultimatum.
If the server crashes again in the next week, I need to replace the entire 
HW, the OS, and the services as well...)

Thanks a lot for help.

Best Regards,
Janos Haar

----- Original Message ----- 
From: "Christian Kujau" <lists@nerdbynature.de>
To: "Janos Haar" <janos.haar@netcenter.hu>
Cc: "Dave Chinner" <david@fromorbit.com>; <axboe@kernel.dk>; "LKML" 
<linux-kernel@vger.kernel.org>; <xfs@oss.sgi.com>; <linux-mm@kvack.org>; 
<xiyou.wangcong@gmail.com>; <kamezawa.hiroyu@jp.fujitsu.com>
Sent: Friday, April 09, 2010 11:37 PM
Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look 
please!...)


> On Thu, 8 Apr 2010 at 13:21, Janos Haar wrote:
>> > Yeah, there is still a fix that needs to be backported to .33
>> > to solve this problem. It's in the series for 2.6.32.x, so maybe
>> > pulling the 2.6.32-stable-queue tree in the meantime is your best
>> > bet.
>>
>> Ok, thank you.
>> But where can i find this tree?
>
>
> Perhaps Dave meant the stable-queue?
>
> http://git.kernel.org/?p=linux/kernel/git/stable/stable-queue.git
>
> Then again, 2.6.34-rc3 needs testing too! :-)
>
> Christian.
> -- 
> BOFH excuse #98:
>
> The vendor put the bug there. 



* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
  2010-04-09 22:44                               ` Janos Haar
@ 2010-04-10  8:06                                 ` Américo Wang
  2010-04-10 21:21                                   ` Kernel crash in xfs_iflush_cluster (was Somebody take a lookplease!...) Janos Haar
  0 siblings, 1 reply; 33+ messages in thread
From: Américo Wang @ 2010-04-10  8:06 UTC (permalink / raw)
  To: Janos Haar
  Cc: Christian Kujau, david, axboe, LKML, xfs, linux-mm,
	xiyou.wangcong, kamezawa.hiroyu

On Sat, Apr 10, 2010 at 12:44:28AM +0200, Janos Haar wrote:
> Hello,
>

Hi,

> I have just started to test the stable-queue patch series on 2.6.32.10.
> It is running now, we will see...
> 2.6.33.2 produced 4 crashes in the last 3 days. :-(
> That was even worse than the original 2.6.32.10.
>
> (I am very interested anyway; this is the last shot for this server.
> The owner has given me an ultimatum.
> If the server crashes again in the next week, I need to replace the 
> entire HW, the OS, and the services as well...)
>

I would recommend you use a distribution-released kernel
rather than a stable kernel from kernel.org, because the
distribution usually maintains a kernel with longer support than
kernel.org does.

Just a little suggestion. Hope it helps you choose Linux. ;)

Thanks.

-- 
Live like a child, think like the god.
 


* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
  2010-04-08  2:58                         ` Dave Chinner
  2010-04-08 11:21                           ` Janos Haar
@ 2010-04-10 21:15                           ` Janos Haar
  2010-04-11 22:44                           ` Janos Haar
  2 siblings, 0 replies; 33+ messages in thread
From: Janos Haar @ 2010-04-10 21:15 UTC (permalink / raw)
  To: Dave Chinner
  Cc: xiyou.wangcong, linux-kernel, kamezawa.hiroyu, linux-mm, xfs, axboe

Dave,

Now the server looks stable, but it has only been running for 23 hours at 
this point.

Now I can see these and similar messages:
Apr 10 09:59:09 alfa kernel: Filesystem "sdb2": corrupt dinode 673160714, 
extent total = -1392508927, nblocks = 5.  Unmount and run xfs_repair.
Apr 10 09:59:09 alfa kernel: ffff880153797a00: 49 4e 81 a4 01 02 00 01 00 00 
00 30 00 00 00 30  IN.........0...0
Apr 10 09:59:09 alfa kernel: Filesystem "sdb2": XFS internal error 
xfs_iformat(1) at line 332 of file fs/xfs/xfs_inode.c.  Caller 
0xffffffff811d70d6
Apr 10 09:59:09 alfa kernel:
Apr 10 09:59:09 alfa kernel: Pid: 2324, comm: updatedb Not tainted 2.6.32.10 
#3
Apr 10 09:59:09 alfa kernel: Call Trace:
Apr 10 09:59:09 alfa kernel:  [<ffffffff811cf87d>] 
xfs_error_report+0x41/0x43
Apr 10 09:59:09 alfa kernel:  [<ffffffff811d70d6>] ? xfs_iread+0xb1/0x184
Apr 10 09:59:09 alfa kernel:  [<ffffffff811cf8d1>] 
xfs_corruption_error+0x52/0x5e
Apr 10 09:59:09 alfa kernel:  [<ffffffff811d6c68>] xfs_iformat+0x10d/0x4ca
Apr 10 09:59:09 alfa kernel:  [<ffffffff811d70d6>] ? xfs_iread+0xb1/0x184
Apr 10 09:59:09 alfa kernel:  [<ffffffff811d70d6>] xfs_iread+0xb1/0x184
Apr 10 09:59:09 alfa kernel:  [<ffffffff811d3ee2>] xfs_iget+0x2c3/0x455
Apr 10 09:59:09 alfa kernel:  [<ffffffff811eab8b>] xfs_lookup+0x82/0xb3
Apr 10 09:59:09 alfa kernel:  [<ffffffff811f5a8f>] xfs_vn_lookup+0x45/0x86
Apr 10 09:59:09 alfa kernel:  [<ffffffff810e3f73>] do_lookup+0xde/0x1ca
Apr 10 09:59:09 alfa kernel:  [<ffffffff810e65b6>] 
__link_path_walk+0x84e/0xcb3
Apr 10 09:59:09 alfa kernel:  [<ffffffff810e4462>] ? path_init+0xaf/0x156
Apr 10 09:59:09 alfa kernel:  [<ffffffff810e6a6e>] path_walk+0x53/0x9c
Apr 10 09:59:09 alfa kernel:  [<ffffffff810e6b9e>] do_path_lookup+0x2f/0xac
Apr 10 09:59:09 alfa kernel:  [<ffffffff810e7603>] user_path_at+0x57/0x91
Apr 10 09:59:09 alfa kernel:  [<ffffffff810ec2e5>] ? dput+0x54/0x132
Apr 10 09:59:09 alfa kernel:  [<ffffffff810df492>] ? cp_new_stat+0xfb/0x114
Apr 10 09:59:09 alfa kernel:  [<ffffffff810df670>] vfs_fstatat+0x3a/0x67
Apr 10 09:59:09 alfa kernel:  [<ffffffff810df6f4>] vfs_lstat+0x1e/0x20
Apr 10 09:59:09 alfa kernel:  [<ffffffff810df715>] sys_newlstat+0x1f/0x39
Apr 10 09:59:09 alfa kernel:  [<ffffffff8175d2d3>] ? 
trace_hardirqs_on_thunk+0x3a/0x3f
Apr 10 09:59:09 alfa kernel:  [<ffffffff811d3f36>] ? xfs_iget+0x317/0x455
Apr 10 09:59:09 alfa kernel:  [<ffffffff8100b09b>] 
system_call_fastpath+0x16/0x1b
Apr 10 09:59:09 alfa kernel: Filesystem "sdb2": corrupt inode 673160713 
((a)extents = 16777217).  Unmount and run xfs_repair.
Apr 10 09:59:09 alfa kernel: ffff880153797900: 49 4e 81 a4 01 02 00 01 00 00 
00 30 00 00 00 30  IN.........0...0
Apr 10 09:59:09 alfa kernel: Filesystem "sdb2": XFS internal error 
xfs_iformat_extents(1) at line 558 of file fs/xfs/xfs_inode.c.  Caller 
0xffffffff811d6e70
Apr 10 09:59:09 alfa kernel:

All of the reports point at sdb2 for corruption.
I will test this partition as soon as I can schedule a few minutes of 
planned downtime....
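
The plan for that pause is roughly the usual offline sequence - just a
sketch, with the device name taken from the logs above:

  umount /dev/sdb2
  xfs_repair -n /dev/sdb2   # no-modify mode first, to see what it would touch
  xfs_repair /dev/sdb2
  mount /dev/sdb2           # assuming an fstab entry for it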

Thanks for all your help again.

Best Regards:
Janos Haar



----- Original Message ----- 
From: "Dave Chinner" <david@fromorbit.com>
To: "Janos Haar" <janos.haar@netcenter.hu>
Cc: <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>; 
<kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>; 
<axboe@kernel.dk>
Sent: Thursday, April 08, 2010 4:58 AM
Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look 
please!...)


> On Thu, Apr 08, 2010 at 04:45:13AM +0200, Janos Haar wrote:
>> Hello,
>>
>> Sorry, but I still have the problem with 2.6.33.2.
>
> Yeah, there is still a fix that needs to be backported to .33
> to solve this problem. It's in the series for 2.6.32.x, so maybe
> pulling the 2.6.32-stable-queue tree in the meantime is your best
> bet.
>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com



* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a lookplease!...)
  2010-04-10  8:06                                 ` Américo Wang
@ 2010-04-10 21:21                                   ` Janos Haar
  0 siblings, 0 replies; 33+ messages in thread
From: Janos Haar @ 2010-04-10 21:21 UTC (permalink / raw)
  To: Américo Wang
  Cc: lists, david, axboe, "LKML",
	xfs, linux-mm, xiyou.wangcong, kamezawa.hiroyu

Hi,


----- Original Message ----- 
From: "Américo Wang" <xiyou.wangcong@gmail.com>
To: "Janos Haar" <janos.haar@netcenter.hu>
Cc: "Christian Kujau" <lists@nerdbynature.de>; <david@fromorbit.com>; 
<axboe@kernel.dk>; "LKML" <linux-kernel@vger.kernel.org>; <xfs@oss.sgi.com>; 
<linux-mm@kvack.org>; <xiyou.wangcong@gmail.com>; 
<kamezawa.hiroyu@jp.fujitsu.com>
Sent: Saturday, April 10, 2010 10:06 AM
Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a 
lookplease!...)


> On Sat, Apr 10, 2010 at 12:44:28AM +0200, Janos Haar wrote:
>> Hello,
>>
>
> Hi,
>
>> I have just started to test the stable-queue patch series on 2.6.32.10.
>> It is running now, we will see...
>> 2.6.33.2 produced 4 crashes in the last 3 days. :-(
>> That was even worse than the original 2.6.32.10.
>>
>> (I am very interested anyway; this is the last shot for this server.
>> The owner has given me an ultimatum.
>> If the server crashes again in the next week, I need to replace the
>> entire HW, the OS, and the services as well...)
>>
>
> I would recommend you use a distribution-released kernel
> rather than a stable kernel from kernel.org, because the
> distribution usually maintains a kernel with longer support than
> kernel.org does.
>
> Just a little suggestion. Hope it helps you choose Linux. ;)

Personally, I really like Linux and have used it since 1990. :-)
I have set up about 30 servers or more...
Usually I can't use the distro-released kernels, because they are usually too 
old.
Additionally, I can find bugs in any software and in any kernel version.... 
B-)
This is not the first time I have reported kernel bugs, maybe the 4th 
or 5th time...
(I was also the one who helped solve the original NBD deadlock problem, around 
2005.)

Anyway, thanks for your suggestion. ;-)

Cheers,
Janos

>
> Thanks.
>
> -- 
> Live like a child, think like the god.
>



* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
  2010-04-08  2:58                         ` Dave Chinner
  2010-04-08 11:21                           ` Janos Haar
  2010-04-10 21:15                           ` Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...) Janos Haar
@ 2010-04-11 22:44                           ` Janos Haar
  2010-04-12  0:11                             ` Dave Chinner
  2 siblings, 1 reply; 33+ messages in thread
From: Janos Haar @ 2010-04-11 22:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: xiyou.wangcong, linux-kernel, kamezawa.hiroyu, linux-mm, xfs, axboe

Hi,

Ok, here comes the funny part:
I have got several messages from the kernel about one of my XFS filesystems 
(sdb2) having corrupted inodes, but my xfs_repair (v2.8.11) says the FS is 
clean and shiny.
Should I upgrade my xfs_repair, or is this another bug? :-)

Thanks,

Janos

----- Original Message ----- 
From: "Dave Chinner" <david@fromorbit.com>
To: "Janos Haar" <janos.haar@netcenter.hu>
Cc: <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>; 
<kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>; 
<axboe@kernel.dk>
Sent: Thursday, April 08, 2010 4:58 AM
Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look 
please!...)


> On Thu, Apr 08, 2010 at 04:45:13AM +0200, Janos Haar wrote:
>> Hello,
>>
>> Sorry, but I still have the problem with 2.6.33.2.
>
> Yeah, there is still a fix that needs to be backported to .33
> to solve this problem. It's in the series for 2.6.32.x, so maybe
> pulling the 2.6.32-stable-queue tree in the meantime is your best
> bet.
>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com



* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
  2010-04-11 22:44                           ` Janos Haar
@ 2010-04-12  0:11                             ` Dave Chinner
  2010-04-13  8:00                               ` Janos Haar
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2010-04-12  0:11 UTC (permalink / raw)
  To: Janos Haar
  Cc: xiyou.wangcong, linux-kernel, kamezawa.hiroyu, linux-mm, xfs, axboe

On Mon, Apr 12, 2010 at 12:44:37AM +0200, Janos Haar wrote:
> Hi,
> 
> Ok, here comes the funny part:
> I have got several messages from the kernel about one of my XFS
> filesystems (sdb2) having corrupted inodes, but my xfs_repair (v2.8.11)
> says the FS is clean and shiny.
> Should I upgrade my xfs_repair, or is this another bug? :-)

v2.8.11 is positively ancient. :/

I'd upgrade (current is 3.1.1) and re-run repair again.
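
Building it from a source tarball is the usual routine - a rough sketch
only, with the tarball name assumed:

  tar xzf xfsprogs-3.1.1.tar.gz
  cd xfsprogs-3.1.1
  ./configure && make && make install
  xfs_repair -V    # should now report version 3.1.1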

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
  2010-04-12  0:11                             ` Dave Chinner
@ 2010-04-13  8:00                               ` Janos Haar
  2010-04-13  8:39                                 ` Dave Chinner
  0 siblings, 1 reply; 33+ messages in thread
From: Janos Haar @ 2010-04-13  8:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: xiyou.wangcong, linux-kernel, kamezawa.hiroyu, linux-mm, xfs, axboe


----- Original Message ----- 
From: "Dave Chinner" <david@fromorbit.com>
To: "Janos Haar" <janos.haar@netcenter.hu>
Cc: <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>; 
<kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>; 
<axboe@kernel.dk>
Sent: Monday, April 12, 2010 2:11 AM
Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look 
please!...)


> On Mon, Apr 12, 2010 at 12:44:37AM +0200, Janos Haar wrote:
>> Hi,
>>
>> Ok, here comes the funny part:
>> I have got several messages from the kernel about one of my XFS
>> filesystems (sdb2) having corrupted inodes, but my xfs_repair (v2.8.11)
>> says the FS is clean and shiny.
>> Should I upgrade my xfs_repair, or is this another bug? :-)
>
> v2.8.11 is positively ancient. :/
>
> I'd upgrade (current is 3.1.1) and re-run repair again.

OK, I will get the new repair today.

Btw,
since I tested the FS with 2.8.11, this morning I found this in the 
log:

...
Apr 12 00:41:10 alfa kernel: XFS mounting filesystem sdb2   # This was the 
point of the check with xfs_repair v2.8.11
Apr 13 03:08:33 alfa kernel: xfs_da_do_buf: bno 32768
Apr 13 03:08:33 alfa kernel: dir: inode 474253931
Apr 13 03:08:33 alfa kernel: Filesystem "sdb2": XFS internal error 
xfs_da_do_buf(1) at line 2020 of file fs/xfs/xfs_da_btree.c.  Caller 
0xffffffff811c4fa6
Apr 13 03:08:33 alfa kernel:
Apr 13 03:08:33 alfa kernel: Pid: 27304, comm: 01vegzet_runner Not tainted 
2.6.32.10 #3
Apr 13 03:08:33 alfa kernel: Call Trace:
Apr 13 03:08:33 alfa kernel:  [<ffffffff811cf87d>] 
xfs_error_report+0x41/0x43
Apr 13 03:08:33 alfa kernel:  [<ffffffff811c4fa6>] ? 
xfs_da_read_buf+0x2a/0x2c
Apr 13 03:08:33 alfa kernel:  [<ffffffff811c4c30>] xfs_da_do_buf+0x2a6/0x5aa
Apr 13 03:08:33 alfa kernel:  [<ffffffff811c4fa6>] xfs_da_read_buf+0x2a/0x2c
Apr 13 03:08:33 alfa kernel:  [<ffffffff811ca0f1>] ? 
xfs_dir2_leaf_lookup_int+0x104/0x259
Apr 13 03:08:33 alfa kernel:  [<ffffffff811ca0f1>] 
xfs_dir2_leaf_lookup_int+0x104/0x259
Apr 13 03:08:33 alfa kernel:  [<ffffffff811ca56e>] 
xfs_dir2_leaf_lookup+0x26/0xb5
Apr 13 03:08:33 alfa kernel:  [<ffffffff811c6d60>] ? 
xfs_dir2_isleaf+0x21/0x52
Apr 13 03:08:33 alfa kernel:  [<ffffffff811c74ea>] 
xfs_dir_lookup+0x104/0x157
Apr 13 03:08:33 alfa kernel:  [<ffffffff811eab59>] xfs_lookup+0x50/0xb3
Apr 13 03:08:33 alfa kernel:  [<ffffffff811f5a8f>] xfs_vn_lookup+0x45/0x86
Apr 13 03:08:33 alfa kernel:  [<ffffffff810e4164>] __lookup_hash+0x105/0x12a
Apr 13 03:08:33 alfa kernel:  [<ffffffff810e41c4>] lookup_hash+0x3b/0x40
Apr 13 03:08:33 alfa kernel:  [<ffffffff810e7021>] do_unlinkat+0x71/0x17d
Apr 13 03:08:33 alfa kernel:  [<ffffffff8175d2d3>] ? 
trace_hardirqs_on_thunk+0x3a/0x3f
Apr 13 03:08:33 alfa kernel:  [<ffffffff810e5a1d>] ? putname+0x3c/0x3e
Apr 13 03:08:33 alfa kernel:  [<ffffffff810e7143>] sys_unlink+0x16/0x18
Apr 13 03:08:33 alfa kernel:  [<ffffffff8100b09b>] 
system_call_fastpath+0x16/0x1b
....

The entire log is here:
http://download.netcenter.hu/bughunt/20100413/messages

What is the best next step?
Check with the new repair?

Thanks,
Janos


>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com



* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
  2010-04-13  8:00                               ` Janos Haar
@ 2010-04-13  8:39                                 ` Dave Chinner
  2010-04-13  9:23                                   ` Janos Haar
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2010-04-13  8:39 UTC (permalink / raw)
  To: Janos Haar
  Cc: xiyou.wangcong, linux-kernel, kamezawa.hiroyu, linux-mm, xfs, axboe

On Tue, Apr 13, 2010 at 10:00:17AM +0200, Janos Haar wrote:
> >On Mon, Apr 12, 2010 at 12:44:37AM +0200, Janos Haar wrote:
> >>Hi,
> >>
> >>Ok, here comes the funny part:
> >>I have got several messages from the kernel about one of my XFS
> >>filesystems (sdb2) having corrupted inodes, but my xfs_repair (v2.8.11)
> >>says the FS is clean and shiny.
> >>Should I upgrade my xfs_repair, or is this another bug? :-)
> >
> >v2.8.11 is positively ancient. :/
> >
> >I'd upgrade (current is 3.1.1) and re-run repair again.
> 
> OK, I will get the new repair today.
> 
> Btw,
> since I tested the FS with 2.8.11, this morning I found this in
> the log:
> 
> ...
> Apr 12 00:41:10 alfa kernel: XFS mounting filesystem sdb2   # This
> was the point of the check with xfs_repair v2.8.11
> Apr 13 03:08:33 alfa kernel: xfs_da_do_buf: bno 32768
> Apr 13 03:08:33 alfa kernel: dir: inode 474253931
> Apr 13 03:08:33 alfa kernel: Filesystem "sdb2": XFS internal error
> xfs_da_do_buf(1) at line 2020 of file fs/xfs/xfs_da_btree.c.  Caller
> 0xffffffff811c4fa6

A corrupted directory. There have been several different types of
directory corruption that 2.8.11 didn't detect but 3.1.1 does.

> The entire log is here:
> http://download.netcenter.hu/bughunt/20100413/messages

So the bad inodes are:

$ awk '/corrupt inode/ { print $10 } /dir: inode/ { print $8 }' messages | sort -n -u
474253931
474253936
474253937
474253938
474253939
474253940
474253941
474253943
474253945
474253946
474253947
474253948
474253949
474253950
474253951
673160704
673160708
673160712
673160713

It looks like the bad inodes are confined to two inode clusters. The
nature of the errors - bad block mappings and bad extent counts -
makes me think you might have bad memory in the machine:

$ awk '/xfs_da_do_buf: bno/ { printf "%x\n", $8 }' messages | sort -n -u
4d8000
5e0000
7f8001
8000
8001
10000
10001
20001
28001
38000
270001
370001
548001
568000
568001
600000
600001
618000
618001
628000
628001
650001

I think they should all be 0 or 1, and:

$ awk '/corrupt inode/ { split($13, a, ")"); printf "%x\n", a[1] }' messages | sort -n -u
fffffffffd000001
6b000001
1000001
75000001

I think they should all be 1, too.

I've seen this sort of error pattern before on a machine that had a
bad DIMM.  If the corruption is on disk then the buffers were
corrupted between the time that the CPU writes to them and being
written to disk. If there is no corruption on disk, then the CPU is
reading bad data from memory...

If you run:

$ xfs_db -r -c "inode 474253940" -c p /dev/sdb2

Then I can can confirm whether there is corruption on disk or not.
Probably best to sample multiple of the inode numbers from the above
list of bad inodes.
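
For example, a quick loop over a few of them (the three inode numbers
here are just a sample from the list above):

  for ino in 474253931 474253940 673160713; do
      xfs_db -r -c "inode $ino" -c p /dev/sdb2
  done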

FWIW, I'd strongly suggest backing up everything you can first
before running an updated xfs_repair....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
  2010-04-13  8:39                                 ` Dave Chinner
@ 2010-04-13  9:23                                   ` Janos Haar
  2010-04-13 11:34                                     ` Dave Chinner
  0 siblings, 1 reply; 33+ messages in thread
From: Janos Haar @ 2010-04-13  9:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: xiyou.wangcong, linux-kernel, kamezawa.hiroyu, linux-mm, xfs, axboe


----- Original Message ----- 
From: "Dave Chinner" <david@fromorbit.com>
To: "Janos Haar" <janos.haar@netcenter.hu>
Cc: <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>; 
<kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>; 
<axboe@kernel.dk>
Sent: Tuesday, April 13, 2010 10:39 AM
Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look 
please!...)


> On Tue, Apr 13, 2010 at 10:00:17AM +0200, Janos Haar wrote:
>> >On Mon, Apr 12, 2010 at 12:44:37AM +0200, Janos Haar wrote:
>> >>Hi,
>> >>
>> >>Ok, here comes the funny part:
>> >>I have got several messages from the kernel about one of my XFS
>> >>filesystems (sdb2) having corrupted inodes, but my xfs_repair (v2.8.11)
>> >>says the FS is clean and shiny.
>> >>Should I upgrade my xfs_repair, or is this another bug? :-)
>> >
>> >v2.8.11 is positively ancient. :/
>> >
>> >I'd upgrade (current is 3.1.1) and re-run repair again.
>>
>> OK, I will get the new repair today.
>>
>> Btw,
>> since I tested the FS with 2.8.11, this morning I found this in
>> the log:
>>
>> ...
>> Apr 12 00:41:10 alfa kernel: XFS mounting filesystem sdb2   # This
>> was the point of check with xfs_repair v2.8.11
>> Apr 13 03:08:33 alfa kernel: xfs_da_do_buf: bno 32768
>> Apr 13 03:08:33 alfa kernel: dir: inode 474253931
>> Apr 13 03:08:33 alfa kernel: Filesystem "sdb2": XFS internal error
>> xfs_da_do_buf(1) at line 2020 of file fs/xfs/xfs_da_btree.c.  Caller
>> 0xffffffff811c4fa6
>
> A corrupted directory. There have been several different types of
> directory corruption that 2.8.11 didn't detect that 3.1.1 does.
>
>> The entire log is here:
>> http://download.netcenter.hu/bughunt/20100413/messages
>
> So the bad inodes are:
>
> $ awk '/corrupt inode/ { print $10 } /dir: inode/ { print $8 }' messages | 
> sort -n -u
> 474253931
> 474253936
> 474253937
> 474253938
> 474253939
> 474253940
> 474253941
> 474253943
> 474253945
> 474253946
> 474253947
> 474253948
> 474253949
> 474253950
> 474253951
> 673160704
> 673160708
> 673160712
> 673160713
>
> It looks like the bad inodes are confined to two inode clusters. The
> nature of the errors - bad block mappings and bad extent counts -
> makes me think you might have bad memory in the machine:
>
> $ awk '/xfs_da_do_buf: bno/ { printf "%x\n", $8 }' messages | sort -n -u
> 4d8000
> 5e0000
> 7f8001
> 8000
> 8001
> 10000
> 10001
> 20001
> 28001
> 38000
> 270001
> 370001
> 548001
> 568000
> 568001
> 600000
> 600001
> 618000
> 618001
> 628000
> 628001
> 650001
>
> I think they should all be 0 or 1, and:
>
> $ awk '/corrupt inode/ { split($13, a, ")"); printf "%x\n", a[1] }' 
> messages | sort -n -u
> fffffffffd000001
> 6b000001
> 1000001
> 75000001
>
> I think they should all be 1, too.
>
> I've seen this sort of error pattern before on a machine that had a
> bad DIMM.  If the corruption is on disk then the buffers were
> corrupted between the time that the CPU writes to them and being
> written to disk. If there is no corruption on disk, then the CPU is
> reading bad data from memory...
>
> If you run:
>
> $ xfs_db -r -c "inode 474253940" -c p /dev/sdb2
>
> Then I can confirm whether there is corruption on disk or not.
> Probably best to sample several of the inode numbers from the above
> list of bad inodes.

Here is the log:
http://download.netcenter.hu/bughunt/20100413/debug.log

xfs_db gets a segmentation fault. :-)

Btw, memory corruption:
At the beginning of March, one of my bets was a memory problem too, but the 
server was offline for 7 days, running memtest86 on the hw the whole time, 
and it passed all 8GB 74 times without any bit error.
I don't think it is a memory problem; additionally, the server can create big 
.tar.gz files without CRC problems.
If I force myself to suspect a hw memory problem, I can only think of the 
raid card's cache memory, which I can't test with memtest86.
Or the cache on the HDD's PCB...

On the other hand, I have seen more people reporting memory corruption with 
these kernel versions; can we check this and determine for sure which is the 
problem (hw or sw)?
I mean, if I am right, a hw memory problem seriously makes only 1-2 bits of 
corruption, while a sw page handling problem makes whole bad memory pages, no?

>
> FWIW, I'd strongly suggest backing up everything you can first
> before running an updated xfs_repair....

Yes, i know that too. :-)

Thanks,
Janos

>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com



* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
  2010-04-13  9:23                                   ` Janos Haar
@ 2010-04-13 11:34                                     ` Dave Chinner
  2010-04-13 23:36                                       ` Janos Haar
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2010-04-13 11:34 UTC (permalink / raw)
  To: Janos Haar
  Cc: xiyou.wangcong, linux-kernel, kamezawa.hiroyu, linux-mm, xfs, axboe

On Tue, Apr 13, 2010 at 11:23:36AM +0200, Janos Haar wrote:
> >If you run:
> >
> >$ xfs_db -r -c "inode 474253940" -c p /dev/sdb2
> >
> >Then I can confirm whether there is corruption on disk or not.
> >Probably best to sample several of the inode numbers from the above
> >list of bad inodes.
> 
> Here is the log:
> http://download.netcenter.hu/bughunt/20100413/debug.log

There are multiple fields in the inode that are corrupted.
I am really surprised that xfs_repair - even an old version - is not
picking up the corruption....

> xfs_db gets a segmentation fault. :-)

Yup, it probably ran off into la-la land chasing corrupted
extent pointers.

> Btw, memory corruption:
> At the beginning of March, one of my bets was a memory problem too, but
> the server was offline for 7 days, running memtest86 on the hw the
> whole time, and it passed all 8GB 74 times without any bit
> error.
> I don't think it is a memory problem; additionally, the server can
> create big .tar.gz files without CRC problems.

Ok.

> If I force myself to suspect a hw memory problem, I can only think of
> the raid card's cache memory, which I can't test with memtest86.
> Or the cache on the HDD's PCB...

Yes, it could be something like that, too, but the only way to test
it is to swap out the card....

> On the other hand, I have seen more people reporting memory
> corruption with these kernel versions; can we check this and determine
> for sure which is the problem (hw or sw)?

I haven't heard of any significant memory corruption problems in
2.6.32 or 2.6.33, but it is a possibility given the nature of the
corruption. However, it may have only happened once and be completely
unreproducible.

I'd suggest fixing the existing corruption first, and then seeing if
it reappears. If it does reappear, then we know there's a
reproducible problem we need to dig out....

> I mean, if I am right, a hw memory problem seriously makes only 1-2
> bits of corruption, while a sw page handling problem makes whole bad
> memory pages, no?

RAM ECC guarantees correction of single bit errors and detection of
double bit errors (which cause the kernel to panic, IIRC). I can't
tell you what happens when larger errors occur, though...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
  2010-04-13 11:34                                     ` Dave Chinner
@ 2010-04-13 23:36                                       ` Janos Haar
  2010-04-14  0:16                                         ` Dave Chinner
  0 siblings, 1 reply; 33+ messages in thread
From: Janos Haar @ 2010-04-13 23:36 UTC (permalink / raw)
  To: Dave Chinner
  Cc: xiyou.wangcong, linux-kernel, kamezawa.hiroyu, linux-mm, xfs, axboe

Dave,

----- Original Message ----- 
From: "Dave Chinner" <david@fromorbit.com>
To: "Janos Haar" <janos.haar@netcenter.hu>
Cc: <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>; 
<kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>; 
<axboe@kernel.dk>
Sent: Tuesday, April 13, 2010 1:34 PM
Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look 
please!...)


> On Tue, Apr 13, 2010 at 11:23:36AM +0200, Janos Haar wrote:
>> >If you run:
>> >
>> >$ xfs_db -r -c "inode 474253940" -c p /dev/sdb2
>> >
>> >Then I can confirm whether there is corruption on disk or not.
>> >Probably best to sample several of the inode numbers from the above
>> >list of bad inodes.
>>
>> Here is the log:
>> http://download.netcenter.hu/bughunt/20100413/debug.log
>
> There are multiple fields in the inode that are corrupted.
> I am really surprised that xfs_repair - even an old version - is not
> picking up the corruption....

I think I now know the reason....
My case is starting to turn more and more interesting.

(Just a little note for the record: Tuesday night, I ran the old 2.8.11 
xfs_repair on the partition which was reported as corrupt by the kernel, but 
it was clean.
The system was not restarted!)

Like you suggested, today, i have tried to make a backup from the data.
During the copy, the kernel reported a lot of corrupted entries again, and 
finally the kernel crashed! (with the 19 patch pack)
Unfortunately the kernel can't write the debug info into the syslog.
The system restarted automatically, the service runs again, and i can't do 
another backup attempt because force of the owner.
Today night, when the traffic was in the low period, i have stopped the 
service, umount the partition, and repeat the xfs_repair on the previously 
reported partition on more ways.

Here you can see the results:
xfs_repair 2.8.11 run #1:
http://download.netcenter.hu/bughunt/20100413/repair2811-nr1.log

xfs_repair 2.8.11 run #2:
http://download.netcenter.hu/bughunt/20100413/repair2811-nr2.log

echo 3 >/proc/sys/vm/drop_caches - performed

xfs_repair 2.8.11 run #3:
http://download.netcenter.hu/bughunt/20100413/repair2811-nr3.log

xfs_repair 3.1.1 run #1:
http://download.netcenter.hu/bughunt/20100413/repair311-nr1.log

xfs_repair 3.1.1 run #2: sorry, I had no time to play offline any longer. :-(

For me, it looks like the FS gets corrupted between Tuesday night and
tonight.
Note: because I am expecting kernel crashes, the dirty data flush
timeout was set to only a few milliseconds to prevent losing too much
data.
There was one kernel crash in this period, but XFS has a journal and
should have recovered cleanly. (I don't think this is the problem.)
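
(For reference, a minimal sketch of one way such an aggressive flush
could be set up, assuming the stock /proc/sys/vm writeback knobs were
what was tuned; the knobs are in centiseconds, and the values here are
illustrative, not the ones actually used:)

  # flusher wakes ~every 10ms; dirty data considered expired after ~10ms
  echo 1 > /proc/sys/vm/dirty_writeback_centisecs
  echo 1 > /proc/sys/vm/dirty_expire_centisecs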

The other interesting thing is: why does only this partition get
corrupted? (Again and again?)
Note: this is a partition on a 4-disk hardware RAID10, and 3 of the 4
HDDs were replaced in the last 3 weeks because we are hunting this
bug....
Note 2: why not 4 of 4? Because the first 3 were fine and were replaced
with bigger drives, and I don't know what will happen once all the
drives grow; I am not sure I could swap the 300G Raptors back in.

>
>> xfs_db gets a segmentation fault. :-)
>
> Yup, it probably ran off into la-la land chasing corrupted
> extent pointers.
>
>> Btw memory corruption:
>> In the beginning of March, one of my bets was a memory problem too,
>> but the server was offline for 7 days and ran memtest86 on the
>> hardware the whole time; it passed all 8GB 74 times without a single
>> bit error.
>> I don't think it is a memory problem; additionally, the server can
>> create big .tar.gz files without CRC problems.
>
> Ok.
>
>> If I force my mind to think of a hardware memory problem, I can only
>> think of the RAID card's cache memory, which I can't test with
>> memtest86. Or the cache on the HDDs' PCBs...
>
> Yes, it could be something like that, too, but the only way to test
> it is to swap out the card....

Yeah, but I don't have another one. :-/

>
>> On the other hand, I have seen more people reporting memory
>> corruption with these kernel versions; can we check this and
>> determine for sure which is the problem (hw or sw)?
>
> I haven't heard of any significant memory corruption problems in
> 2.6.32 or 2.6.33, but it is a possibility given the nature of the
> corruption. However, it may have only happened once and be completely
> unreproducible.

I reported one strange bug; it was the first mail in this series, with
the original title "Somebody take a look please...".
I can also see this on the kernel list: "[Bug #15585] [Bisected
Regression in 2.6.32.8] i915 with KMS enabled causes memory corruption
when resuming from suspend-to-disk"
And another one: "Re: Memory corruption with 2.6.32.10, but not with
2.6.34-rc3"
Note: I am only reading the titles; I don't have much time at the
moment.

>
> I'd suggest fixing the existing corruption first, and then seeing if
> it re-appears. If it does reappear, then we know there's a
> reproducible problem we need to dig out....

I am on it. :-)

>
>> I mean, if I am right, a hardware memory problem seriously makes
>> only 1-2 bits of corruption, while a software page-handling problem
>> makes whole memory pages bad, no?
>
> RAM ECC guarantees correction of single bit errors and detection of
> double bit errors (which cause the kernel to panic, IIRC). I can't
> tell you what happens when larger errors occur, though...

Yes, but unfortunately this system has non-ECC RAM.
But I am 99.999% sure this corruption is not mobo/CPU/RAM related.
It must be something else...

Now I have tried copying one 4.5GB .gz file onto this problematic
partition 3 times, and running gzip -v -t on all the archives.
All were fine.
This makes me think it is a software problem and not simple memory
corruption, or else the corruption appears in the hardware only for a
short time.
That would be really nasty.

Anyway, I have set up a cron script to test all the 4GB .gz files every
hour of the day and write the results to a log with dates; a rough
sketch of it follows below.
Maybe it will be useful for something....
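
(A hypothetical sketch of such a check; the archive and log paths are
assumptions, and gzip -t just verifies each archive's CRC:)

  #!/bin/sh
  # hourly .gz integrity check; paths below are made up for illustration
  LOG=/var/log/gz-check.log
  for f in /data/*.gz; do
      if gzip -t "$f" 2>>"$LOG"; then
          echo "`date '+%Y-%m-%d %H:%M:%S'` OK  $f" >> "$LOG"
      else
          echo "`date '+%Y-%m-%d %H:%M:%S'` BAD $f" >> "$LOG"
      fi
  done

(Run hourly from cron with an entry like:
  0 * * * * /root/gz-check.sh)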

Thanks again,
Janos



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
  2010-04-13 23:36                                       ` Janos Haar
@ 2010-04-14  0:16                                         ` Dave Chinner
  2010-04-15  7:00                                           ` Janos Haar
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2010-04-14  0:16 UTC (permalink / raw)
  To: Janos Haar
  Cc: xiyou.wangcong, linux-kernel, kamezawa.hiroyu, linux-mm, xfs, axboe

On Wed, Apr 14, 2010 at 01:36:56AM +0200, Janos Haar wrote:
> ----- Original Message ----- From: "Dave Chinner"
> >On Tue, Apr 13, 2010 at 11:23:36AM +0200, Janos Haar wrote:
> >>>If you run:
> >>>
> >>>$ xfs_db -r -c "inode 474253940" -c p /dev/sdb2
> >>>
> >>>Then I can confirm whether there is corruption on disk or not.
> >>>Probably best to sample multiple of the inode numbers from the above
> >>>list of bad inodes.
> >>
> >>Here is the log:
> >>http://download.netcenter.hu/bughunt/20100413/debug.log
> >
> >There are multiple fields in the inode that are corrupted.
> >I am really surprised that xfs_repair - even an old version - is not
> >picking up the corruption....
> 
> I think I now know the reason....
> My case is turning more and more interesting.
> 
> (Just a little reminder: on Tuesday night I ran the old 2.8.11
> xfs_repair on the partition which the kernel had reported as
> corrupt, and it was clean.
> The system was not restarted!)
> 
> As you suggested, today I tried to make a backup of the data.
> During the copy, the kernel reported a lot of corrupted entries
> again, and finally the kernel crashed! (with the 19-patch pack)
> Unfortunately the kernel couldn't write the debug info into the
> syslog.
> The system restarted automatically, the service is running again,
> and I can't make another backup attempt because the owner forbids it.
> Tonight, during the low-traffic period, I stopped the service,
> unmounted the partition, and repeated xfs_repair on the previously
> reported partition in several ways.
> 
> Here you can see the results:
> xfs_repair 2.8.11 run #1:
> http://download.netcenter.hu/bughunt/20100413/repair2811-nr1.log

So this successfully detected and repaired the corruption.  I don't
think this is new corruption - the corrupted inode numbers are the
same as you reported a few days back.

> xfs_repair 2.8.11 run #2:
> http://download.netcenter.hu/bughunt/20100413/repair2811-nr2.log
> 
> echo 3 >/proc/sys/vm/drop_caches - performed
> 
> xfs_repair 2.8.11 run #3:
> http://download.netcenter.hu/bughunt/20100413/repair2811-nr3.log

These two are clearing lost+found and rediscovering the
disconnected inodes that were discovered in the first pass. Nothing
wrong here, that's just the way older repair versions behaved.

> xfs_repair 3.1.1 run #1:
> http://download.netcenter.hu/bughunt/20100413/repair311-nr1.log

And this detected nothing wrong, either.

> For me, it looks like the FS gets corrupted between Tuesday night
> and tonight.
> Note: because I am expecting kernel crashes, the dirty data flush
> timeout was set to only a few milliseconds to prevent losing too
> much data.
> There was one kernel crash in this period, but XFS has a journal and
> should have recovered cleanly. (I don't think this is the problem.)
> 
> The other interesting thing is: why does only this partition get
> corrupted? (Again and again?)

Can you reproduce the corruption again now that the filesystem has
been repaired? I want to know (if the corruption appears again)
whether it appears in the same location as this one.

> >>I mean, if I am right, a hardware memory problem seriously makes
> >>only 1-2 bits of corruption, while a software page-handling
> >>problem makes whole memory pages bad, no?
> >
> >RAM ECC guarantees correction of single bit errors and detection of
> >double bit errors (which cause the kernel to panic, IIRC). I can't
> >tell you what happens when larger errors occur, though...
> 
> Yes, but unfortunately this system has non-ECC RAM.

If your hardware doesn't have ECC, then you can't rule out anything
- even a dodgy power supply can cause this sort of transient
problem. I'm not saying that this is the cause, but I've been
assuming that you're actually running hardware with ECC on RAM,
caches, buses, etc.

> This makes me think it is a software problem and not simple memory
> corruption, or else the corruption appears in the hardware only for
> a short time.

If you can take the performance hit, turn on the kernel memory leak
detector and see if that catches anything.
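
(For reference, a minimal sketch of poking kmemleak, assuming a kernel
built with CONFIG_DEBUG_KMEMLEAK=y; note it detects leaked
allocations, so whether it would catch this kind of corruption is an
open question:)

  # mount debugfs if it is not already mounted
  mount -t debugfs nodev /sys/kernel/debug
  # trigger an immediate scan, then read back any suspects
  echo scan > /sys/kernel/debug/kmemleak
  cat /sys/kernel/debug/kmemleak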

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
  2010-04-14  0:16                                         ` Dave Chinner
@ 2010-04-15  7:00                                           ` Janos Haar
  2010-04-15  9:23                                             ` Dave Chinner
  0 siblings, 1 reply; 33+ messages in thread
From: Janos Haar @ 2010-04-15  7:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: xiyou.wangcong, linux-kernel, kamezawa.hiroyu, linux-mm, xfs, axboe

Dave,

The corruption + crash reproduced. (unfortunately)

http://download.netcenter.hu/bughunt/20100413/messages-15

Apr 14 01:06:33 alfa kernel: XFS mounting filesystem sdb2

This was the point after the multiple xfs_repair runs.

Regards,
Janos


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
  2010-04-15  7:00                                           ` Janos Haar
@ 2010-04-15  9:23                                             ` Dave Chinner
  2010-04-15 10:23                                               ` Janos Haar
  2010-04-16  8:01                                               ` Janos Haar
  0 siblings, 2 replies; 33+ messages in thread
From: Dave Chinner @ 2010-04-15  9:23 UTC (permalink / raw)
  To: Janos Haar
  Cc: xiyou.wangcong, linux-kernel, kamezawa.hiroyu, linux-mm, xfs, axboe

On Thu, Apr 15, 2010 at 09:00:49AM +0200, Janos Haar wrote:
> Dave,
> 
> The corruption + crash reproduced. (unfortunately)
> 
> http://download.netcenter.hu/bughunt/20100413/messages-15
> 
> Apr 14 01:06:33 alfa kernel: XFS mounting filesystem sdb2
> 
> This was the point after the multiple xfs_repair runs.

OK, the inodes that are corrupted are different, so there's still
something funky going on here. I still would suggest replacing the
RAID controller to rule that out as the cause.

FWIW, do you have any other servers with similar h/w, s/w and
workloads? If so, are they seeing problems?

Can you recompile the kernel with CONFIG_XFS_DEBUG enabled and
reboot into it before you repair and remount the filesystem again?
(i.e. so that we know that we have started with a clean filesystem
and the debug kernel) I'm hoping that this will catch the corruption
much sooner, perhaps before it gets to disk. Note that this will
cause the machine to panic when corruption is detected, and it is
much, much more careful about checking in-memory structures, so there
is a CPU overhead involved as well.
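
(For reference, a minimal sketch of one way to do that rebuild; the
source tree path and make options are assumptions:)

  cd /usr/src/linux-2.6.32.10
  # Filesystems -> XFS -> "XFS Debugging support" => CONFIG_XFS_DEBUG=y
  make menuconfig
  make -j4 bzImage modules
  make modules_install install
  reboot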

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
  2010-04-15  9:23                                             ` Dave Chinner
@ 2010-04-15 10:23                                               ` Janos Haar
  2010-04-16  8:01                                               ` Janos Haar
  1 sibling, 0 replies; 33+ messages in thread
From: Janos Haar @ 2010-04-15 10:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: xiyou.wangcong, linux-kernel, kamezawa.hiroyu, linux-mm, xfs, axboe


----- Original Message ----- 
From: "Dave Chinner" <david@fromorbit.com>
To: "Janos Haar" <janos.haar@netcenter.hu>
Cc: <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>; 
<kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>; 
<axboe@kernel.dk>
Sent: Thursday, April 15, 2010 11:23 AM
Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look 
please!...)


> On Thu, Apr 15, 2010 at 09:00:49AM +0200, Janos Haar wrote:
>> Dave,
>>
>> The corruption + crash reproduced. (unfortunately)
>>
>> http://download.netcenter.hu/bughunt/20100413/messages-15
>>
>> Apr 14 01:06:33 alfa kernel: XFS mounting filesystem sdb2
>>
>> This was the point after the multiple xfs_repair runs.
>
> OK, the inodes that are corrupted are different, so there's still
> something funky going on here. I still would suggest replacing the
> RAID controller to rule that out as the cause.

This was not a cheap card, and I can't replace it because I have only
one, and the owner has already decided that I need to replace the
entire server on Saturday.
I have only 2 days to get useful debug information while the server is
online.
This is bad for testing too, because the workload will disappear, and
we will need to figure out something to reproduce the problem
offline...

>
> FWIW, do you have any other servers with similar h/w, s/w and
> workloads? If so, are they seeing problems?

This is a web-based game which generates a lot of small files on the
corrupted filesystem, and as far as I can see, the corruption happens
only on writes, not on reads.
Because I can copy big gz files across the partitions multiple times,
compare them and test their CRCs, and there is a cron tester which
checks 12GB of gz files hourly but can't find any problem, this shows
me that the corruption only happens when writing, and not in the file
contents but in the FS itself.
This makes the RAID card theory even less likely, am I right? :-)

Additionally, in the last 3 days I have tried twice to cp -aR the
entire partition to another one, and both times the corruption
appeared ON THE SOURCE, and finally the kernel crashed.

Step 1: repair.
Step 2: run the game (files are generated...).
Step 3: start copying the partition's data in the background.
Step 4: corruption reported by the kernel.
Step 5: kernel crash during the write.

Can this be a race between read and write?
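
(For reference, a hypothetical shell sketch of the sequence above; the
device, mount point, and service script names are assumptions, not the
real ones:)

  umount /game
  xfs_repair /dev/sdb2                 # step 1: repair
  mount /dev/sdb2 /game
  /etc/init.d/game start               # step 2: run the game, files get generated
  cp -aR /game /backup/game-copy &     # step 3: copy the data in the background
  tail -f /var/log/messages            # steps 4-5: watch for corruption reports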

Btw, I have 2 servers with this game; the differences are these:

- The game's language
- The HW structure is similar, but all the parts are totally different
  brands, except the Intel CPU. :-)
- The workload is lower on the stable server
- The stable server is not selected for replacement. :-)

The important matches:
- The base OS is FC6 on both
- The actual kernel on the stable server is 2.6.28.10
(The kernel on the server we are working on started to crash at the
beginning of March.)
- The FS and the internal structure are the same

>
> Can you recompile the kernel with CONFIG_XFS_DEBUG enabled and
> reboot into it before you repair and remount the filesystem again?

Yes, of course!
I will do it now; we have 2 days left to get useful info....

> (i.e. so that we know that we have started with a clean filesystem
> and the debug kernel) I'm hoping that this will catch the corruption
> much sooner, perhaps before it gets to disk. Note that this will
> cause the machine to panic when corruption is detected, and it is
> much, much more careful about checking in-memory structures, so there
> is a CPU overhead involved as well.

Not a problem.




^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
  2010-04-15  9:23                                             ` Dave Chinner
  2010-04-15 10:23                                               ` Janos Haar
@ 2010-04-16  8:01                                               ` Janos Haar
  1 sibling, 0 replies; 33+ messages in thread
From: Janos Haar @ 2010-04-16  8:01 UTC (permalink / raw)
  To: Dave Chinner
  Cc: xiyou.wangcong, linux-kernel, kamezawa.hiroyu, linux-mm, xfs, axboe


----- Original Message ----- 
From: "Dave Chinner" <david@fromorbit.com>
To: "Janos Haar" <janos.haar@netcenter.hu>
Cc: <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>; 
<kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>; 
<axboe@kernel.dk>
Sent: Thursday, April 15, 2010 11:23 AM
Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look 
please!...)


> On Thu, Apr 15, 2010 at 09:00:49AM +0200, Janos Haar wrote:
>> Dave,
>>
>> The corruption + crash reproduced. (unfortunately)
>>
>> http://download.netcenter.hu/bughunt/20100413/messages-15
>>
>> Apr 14 01:06:33 alfa kernel: XFS mounting filesystem sdb2
>>
>> This was the point after the multiple xfs_repair runs.
>
> OK, the inodes that are corrupted are different, so there's still
> something funky going on here. I still would suggest replacing the
> RAID controller to rule that out as the cause.

News:

(A reminder of the current state:
xfs_repair fixed the fs, then the kernel reported the corruption again
and crashed; I wrote the previous letter to report this.)

Yesterday I stopped the service and ran xfs_repair (the new version
only) on 2 filesystems, but they were clean!
(This shows me that the reported corruption was only in memory, or the
kernel repaired it on reboot.)
(XFS debugging had been turned on beforehand.)
This morning I have new messages in the syslog about sdb2 again.
At this point, I don't know what to think.

http://download.netcenter.hu/bughunt/20100413/messages-16

Regards,
Janos




^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2010-04-16  8:01 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-03-24 20:39 Somebody take a look please! (some kind of kernel bug?) Janos Haar
2010-03-25  3:29 ` Américo Wang
2010-03-25  6:31   ` KAMEZAWA Hiroyuki
2010-03-25  8:54     ` Janos Haar
2010-04-01 10:01       ` Janos Haar
2010-04-01 10:37         ` Américo Wang
2010-04-02 22:07           ` Janos Haar
2010-04-02 23:09             ` Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...) Dave Chinner
2010-04-03 13:42               ` Janos Haar
2010-04-04 10:37                 ` Dave Chinner
2010-04-05 18:17                   ` Janos Haar
2010-04-05 22:45                     ` Dave Chinner
2010-04-05 22:59                       ` Janos Haar
2010-04-08  2:45                       ` Janos Haar
2010-04-08  2:58                         ` Dave Chinner
2010-04-08 11:21                           ` Janos Haar
2010-04-09 21:37                             ` Christian Kujau
2010-04-09 22:44                               ` Janos Haar
2010-04-10  8:06                                 ` Américo Wang
2010-04-10 21:21                                   ` Kernel crash in xfs_iflush_cluster (was Somebody take a lookplease!...) Janos Haar
2010-04-10 21:15                           ` Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...) Janos Haar
2010-04-11 22:44                           ` Janos Haar
2010-04-12  0:11                             ` Dave Chinner
2010-04-13  8:00                               ` Janos Haar
2010-04-13  8:39                                 ` Dave Chinner
2010-04-13  9:23                                   ` Janos Haar
2010-04-13 11:34                                     ` Dave Chinner
2010-04-13 23:36                                       ` Janos Haar
2010-04-14  0:16                                         ` Dave Chinner
2010-04-15  7:00                                           ` Janos Haar
2010-04-15  9:23                                             ` Dave Chinner
2010-04-15 10:23                                               ` Janos Haar
2010-04-16  8:01                                               ` Janos Haar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).