* RE: 2.6.10-ac10 oops in journal_commit_transaction
@ 2005-04-21 22:10 Zou, Nanhai
2005-04-21 22:29 ` Chris Wright
0 siblings, 1 reply; 7+ messages in thread
From: Zou, Nanhai @ 2005-04-21 22:10 UTC
To: Alan Cox, Brice Figureau; +Cc: Andrew Morton, Linux Kernel Mailing List
Hi Alan,
We have seen the same oops on the same point.
Can you point me to the URL where the patch is?
I am not sure which patch I should get.
Thanks
Zou Nan hai
> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org
> [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Alan Cox
> Sent: Monday, March 07, 2005 6:59 AM
> To: Brice Figureau
> Cc: Andrew Morton; Linux Kernel Mailing List
> Subject: Re: 2.6.10-ac10 oops in journal_commit_transaction
>
> FYI Stephen Tweedie has now posted a patch for 2.6.x that ought to fix
> this one.
>
> Alan
* Re: 2.6.10-ac10 oops in journal_commit_transaction
2005-04-21 22:10 2.6.10-ac10 oops in journal_commit_transaction Zou, Nanhai
@ 2005-04-21 22:29 ` Chris Wright
2005-04-22 9:23 ` Alan Cox
0 siblings, 1 reply; 7+ messages in thread
From: Chris Wright @ 2005-04-21 22:29 UTC
To: Zou, Nanhai
Cc: Alan Cox, Brice Figureau, Andrew Morton, Linux Kernel Mailing List
* Zou, Nanhai (nanhai.zou@intel.com) wrote:
> We have seen the same oops on the same point.
> Can you point me to the URL where the patch is?
> I am not sure which patch I should get.
I believe it's fixed in 2.6.11-ac, and we fixed it in the current stable
2.6.11.7 tree. The following patch is what went into 2.6.11.7:
---
From: Stephen Tweedie
Subject: Prevent race condition in jbd
This patch from Stephen Tweedie fixes a race in the jbd code (it
showed up as more or less random NULL dereferences in the
journal code).
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Chris Mason <mason@suse.com>
Signed-off-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
--- linux-2.6-ext3/fs/jbd/transaction.c.=K0000=.orig
+++ linux-2.6-ext3/fs/jbd/transaction.c
@@ -1775,10 +1775,10 @@ static int journal_unmap_buffer(journal_
JBUFFER_TRACE(jh, "checkpointed: add to BJ_Forget");
ret = __dispose_buffer(jh,
journal->j_running_transaction);
+ journal_put_journal_head(jh);
spin_unlock(&journal->j_list_lock);
jbd_unlock_bh_state(bh);
spin_unlock(&journal->j_state_lock);
- journal_put_journal_head(jh);
return ret;
} else {
/* There is no currently-running transaction. So the
@@ -1789,10 +1789,10 @@ static int journal_unmap_buffer(journal_
JBUFFER_TRACE(jh, "give to committing trans");
ret = __dispose_buffer(jh,
journal->j_committing_transaction);
+ journal_put_journal_head(jh);
spin_unlock(&journal->j_list_lock);
jbd_unlock_bh_state(bh);
spin_unlock(&journal->j_state_lock);
- journal_put_journal_head(jh);
return ret;
} else {
/* The orphan record's transaction has
@@ -1813,10 +1813,10 @@ static int journal_unmap_buffer(journal_
journal->j_running_transaction);
jh->b_next_transaction = NULL;
}
+ journal_put_journal_head(jh);
spin_unlock(&journal->j_list_lock);
jbd_unlock_bh_state(bh);
spin_unlock(&journal->j_state_lock);
- journal_put_journal_head(jh);
return 0;
} else {
/* Good, the buffer belongs to the running transaction.
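Editorial note on the diff above: as far as the hunks show, the only change is ordering. journal_put_journal_head(jh) now runs before the three unlock calls instead of after them, so the reference is dropped while the locks are still held. The following is a minimal userspace sketch of that ordering, not the kernel code. The names toy_head, toy_put_locked, toy_dispose and list_lock are hypothetical stand-ins, and a pthread mutex takes the place of the kernel spinlocks.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a journal_head; NOT the kernel structure. */
struct toy_head {
	int refcount;	/* protected by list_lock */
	void *owner;	/* back-pointer that teardown clears */
};

/* Hypothetical lock standing in for the locks held in the patched paths. */
pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Drop one reference. The caller must hold list_lock.
 * Returns 1 if this was the last reference and the head was torn down. */
int toy_put_locked(struct toy_head *h)
{
	assert(h->refcount > 0);
	if (--h->refcount == 0) {
		h->owner = NULL;	/* teardown happens under the lock */
		return 1;
	}
	return 0;
}

/* Ordering after the patch: put first, unlock second. No other lock
 * holder can observe the head between the unlock and a late put. */
int toy_dispose(struct toy_head *h)
{
	int torn_down;

	pthread_mutex_lock(&list_lock);
	torn_down = toy_put_locked(h);
	pthread_mutex_unlock(&list_lock);
	return torn_down;
}
```

Calling toy_dispose() twice on a head that starts with two references tears it down only on the second call, and always with the lock held. The pre-patch ordering would correspond to calling toy_put_locked() after pthread_mutex_unlock(), which is the window the patch appears to close.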
* Re: 2.6.10-ac10 oops in journal_commit_transaction
2005-04-21 22:29 ` Chris Wright
@ 2005-04-22 9:23 ` Alan Cox
0 siblings, 0 replies; 7+ messages in thread
From: Alan Cox @ 2005-04-22 9:23 UTC
To: Chris Wright
Cc: Zou, Nanhai, Brice Figureau, Andrew Morton, Linux Kernel Mailing List
On Iau, 2005-04-21 at 23:29, Chris Wright wrote:
> I believe it's fixed in 2.6.11-ac, and we fixed it in the current stable
> 2.6.11.7 tree. The following patch is what went into 2.6.11.7:
2.6.11.7 or 2.6.11ac7 (ie 2.6.11.7-ac 8) both have this fixed.
* Re: 2.6.10-ac10 oops in journal_commit_transaction
2005-03-04 10:51 ` Brice Figureau
@ 2005-03-06 22:59 ` Alan Cox
0 siblings, 0 replies; 7+ messages in thread
From: Alan Cox @ 2005-03-06 22:59 UTC
To: Brice Figureau; +Cc: Andrew Morton, Linux Kernel Mailing List
FYI Stephen Tweedie has now posted a patch for 2.6.x that ought to fix
this one.
Alan
* Re: 2.6.10-ac10 oops in journal_commit_transaction
2005-03-03 23:37 ` Andrew Morton
@ 2005-03-04 10:51 ` Brice Figureau
2005-03-06 22:59 ` Alan Cox
0 siblings, 1 reply; 7+ messages in thread
From: Brice Figureau @ 2005-03-04 10:51 UTC
To: Andrew Morton; +Cc: linux-kernel
Hi Andrew,
On Thu, 2005-03-03 at 15:37 -0800, Andrew Morton wrote:
> Brice Figureau <brice+lklm@daysofwonder.com> wrote:
[snip]
> > Unable to handle kernel NULL pointer dereference at virtual address 0000000c
> > printing eip:
> > c01a858d
> > *pde = 00000000
> > Oops: 0002 [#1]
> > PREEMPT SMP
> > Modules linked in: i2c_i801 i2c_core ip_conntrack_ftp ipt_LOG ipt_limit ipt_REJECT ipt_state iptable_filter ip_conntrack ip_tables
> > CPU: 2
> > EIP: 0060:[journal_commit_transaction+877/5264] Not tainted VLI
> > EFLAGS: 00010286 (2.6.10-ac10)
> > EIP is at journal_commit_transaction+0x36d/0x1490
>
> Please do:
>
> gdb vmlinux
> (gdb) l *0xc01a858d
Unfortunately this kernel is not compiled with CONFIG_DEBUG_INFO=y, so
the above command does not work.
But:
(gdb) disassemble 0xc01a858d
Dump of assembler code for function journal_commit_transaction:
[snipped]
...
0xc01a8568 <journal_commit_transaction+840>: test %eax,%eax
0xc01a856a <journal_commit_transaction+842>: jne 0xc01a93bf <journal_commit_transaction+4511>
0xc01a8570 <journal_commit_transaction+848>: mov 0xfffffea8(%ebp),%edx
0xc01a8576 <journal_commit_transaction+854>: mov 0x18(%edx),%eax
0xc01a8579 <journal_commit_transaction+857>: test %eax,%eax
0xc01a857b <journal_commit_transaction+859>: je 0xc01a8606 <journal_commit_transaction+998>
0xc01a8581 <journal_commit_transaction+865>: mov $0xffffe000,%esi
0xc01a8586 <journal_commit_transaction+870>: and %esp,%esi
0xc01a8588 <journal_commit_transaction+872>: mov 0x20(%eax),%edi
0xc01a858b <journal_commit_transaction+875>: mov (%edi),%ebx
0xc01a858d <journal_commit_transaction+877>: lock incl 0xc(%ebx)
0xc01a8591 <journal_commit_transaction+881>: mov (%ebx),%eax
0xc01a8593 <journal_commit_transaction+883>: test $0x4,%al
0xc01a8595 <journal_commit_transaction+885>: jne 0xc01a9379 <journal_commit_transaction+4441>
0xc01a859b <journal_commit_transaction+891>: mov %ebx,0x4(%esp)
0xc01a859f <journal_commit_transaction+895>: mov 0x8(%ebp),%ecx
0xc01a85a2 <journal_commit_transaction+898>: mov %ecx,(%esp)
0xc01a85a5 <journal_commit_transaction+901>: call 0xc01a81d0 <inverted_lock>
0xc01a85aa <journal_commit_transaction+906>: test %eax,%eax
0xc01a85ac <journal_commit_transaction+908>: je 0xc01a9373 <journal_commit_transaction+4435>
0xc01a85b2 <journal_commit_transaction+914>: mov (%ebx),%eax
0xc01a85b4 <journal_commit_transaction+916>: test $0x20,%ah
So I recompiled my kernel with CONFIG_DEBUG_INFO=y, hoping that the
code would not move too far and that I could still locate it:
On the kernel with *debug* enabled:
(gdb) l *0xc01a858d
0xc01a858d is in journal_commit_transaction (buffer_head.h:104).
99 * Emit the buffer bitops functions. Note that there are also functions
100 * of the form "mark_buffer_foo()". These are higher-level functions which
101 * do something in addition to setting a b_state bit.
102 */
103 BUFFER_FNS(Uptodate, uptodate)
104 BUFFER_FNS(Dirty, dirty)
105 TAS_BUFFER_FNS(Dirty, dirty)
106 BUFFER_FNS(Lock, locked)
107 TAS_BUFFER_FNS(Lock, locked)
108 BUFFER_FNS(Req, req)
Which does not seem to match the code included in the oops.
(gdb) disassemble 0xc01a858d
[snip]
0xc01a85c8 <journal_commit_transaction+840>: test %eax,%eax
0xc01a85ca <journal_commit_transaction+842>: jne 0xc01a941f <journal_commit_transaction+4511>
0xc01a85d0 <journal_commit_transaction+848>: mov 0xfffffea8(%ebp),%edx
0xc01a85d6 <journal_commit_transaction+854>: mov 0x18(%edx),%eax
0xc01a85d9 <journal_commit_transaction+857>: test %eax,%eax
0xc01a85db <journal_commit_transaction+859>: je 0xc01a8666 <journal_commit_transaction+998>
0xc01a85e1 <journal_commit_transaction+865>: mov $0xffffe000,%esi
0xc01a85e6 <journal_commit_transaction+870>: and %esp,%esi
0xc01a85e8 <journal_commit_transaction+872>: mov 0x20(%eax),%edi
0xc01a85eb <journal_commit_transaction+875>: mov (%edi),%ebx
0xc01a85ed <journal_commit_transaction+877>: lock incl 0xc(%ebx)
0xc01a85f1 <journal_commit_transaction+881>: mov (%ebx),%eax
0xc01a85f3 <journal_commit_transaction+883>: test $0x4,%al
0xc01a85f5 <journal_commit_transaction+885>: jne 0xc01a93d9 <journal_commit_transaction+4441>
0xc01a85fb <journal_commit_transaction+891>: mov %ebx,0x4(%esp)
0xc01a85ff <journal_commit_transaction+895>: mov 0x8(%ebp),%ecx
0xc01a8602 <journal_commit_transaction+898>: mov %ecx,(%esp)
0xc01a8605 <journal_commit_transaction+901>: call 0xc01a8230 <inverted_lock>
0xc01a860a <journal_commit_transaction+906>: test %eax,%eax
0xc01a860c <journal_commit_transaction+908>: je 0xc01a93d3 <journal_commit_transaction+4435>
0xc01a8612 <journal_commit_transaction+914>: mov (%ebx),%eax
0xc01a8614 <journal_commit_transaction+916>: test $0x20,%ah
So the same code is now at 0xc01a85ed:
(gdb) l *0xc01a85ed
0xc01a85ed is in journal_commit_transaction (atomic.h:103).
98 *
99 * Atomically increments @v by 1.
100 */
101 static __inline__ void atomic_inc(atomic_t *v)
102 {
103 __asm__ __volatile__(
104 LOCK "incl %0"
105 :"=m" (v->counter)
106 :"m" (v->counter));
107 }
It seems to me that get_bh is the culprit because of the following
definition from include/linux/buffer_head.h:
static inline void get_bh(struct buffer_head *bh)
{
atomic_inc(&bh->b_count);
}
I hope this will help you. Let me know if you need more information.
Thanks for taking care of that problem,
Regards,
--
Brice Figureau <brice+lklm@daysofwonder.com>
* Re: 2.6.10-ac10 oops in journal_commit_transaction
2005-03-03 13:45 Brice Figureau
@ 2005-03-03 23:37 ` Andrew Morton
2005-03-04 10:51 ` Brice Figureau
0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2005-03-03 23:37 UTC
To: Brice Figureau; +Cc: linux-kernel
Brice Figureau <brice+lklm@daysofwonder.com> wrote:
>
> I'm reporting an oops on a bi-Xeon database server under 2.6.10-ac10
> quite similar to:
> http://marc.theaimsgroup.com/?l=ext3-users&m=110848085314238&w=2
>
> I also got another server crashing (a mail server this time), but I
> couldn't get the oops/panic.
>
> This was after more than two weeks of uptime, I was running 2.6.10-ac1
> before and never got this problem.
>
> Here is the oops information:
>
> Unable to handle kernel NULL pointer dereference at virtual address 0000000c
> printing eip:
> c01a858d
> *pde = 00000000
> Oops: 0002 [#1]
> PREEMPT SMP
> Modules linked in: i2c_i801 i2c_core ip_conntrack_ftp ipt_LOG ipt_limit ipt_REJECT ipt_state iptable_filter ip_conntrack ip_tables
> CPU: 2
> EIP: 0060:[journal_commit_transaction+877/5264] Not tainted VLI
> EFLAGS: 00010286 (2.6.10-ac10)
> EIP is at journal_commit_transaction+0x36d/0x1490
Please do:
gdb vmlinux
(gdb) l *0xc01a858d
* 2.6.10-ac10 oops in journal_commit_transaction
@ 2005-03-03 13:45 Brice Figureau
2005-03-03 23:37 ` Andrew Morton
0 siblings, 1 reply; 7+ messages in thread
From: Brice Figureau @ 2005-03-03 13:45 UTC
To: linux-kernel
Hi,
I'm reporting an oops on a bi-Xeon database server under 2.6.10-ac10
quite similar to:
http://marc.theaimsgroup.com/?l=ext3-users&m=110848085314238&w=2
I also got another server crashing (a mail server this time), but I
couldn't get the oops/panic.
This was after more than two weeks of uptime, I was running 2.6.10-ac1
before and never got this problem.
Here is the oops information:
Unable to handle kernel NULL pointer dereference at virtual address 0000000c
printing eip:
c01a858d
*pde = 00000000
Oops: 0002 [#1]
PREEMPT SMP
Modules linked in: i2c_i801 i2c_core ip_conntrack_ftp ipt_LOG ipt_limit ipt_REJECT ipt_state iptable_filter ip_conntrack ip_tables
CPU: 2
EIP: 0060:[journal_commit_transaction+877/5264] Not tainted VLI
EFLAGS: 00010286 (2.6.10-ac10)
EIP is at journal_commit_transaction+0x36d/0x1490
eax: db38a56c ebx: 00000000 ecx: 00000000 edx: f7779480
esi: f76fa000 edi: db38a56c ebp: f76fbf60 esp: f76fbdc8
ds: 007b es: 007b ss: 0068
Process kjournald (pid: 1206, threadinfo=f76fa000 task=f7454020)
Stack: f191fadc f191fadc 00000008 00000aa2 f76fbe04 f7fea4c0 f7c305b0 00000000
f77794b8 f7fea414 00000000 00000000 00000000 00000000 00000000 db313efc
f7779480 e4079c2c 00000aa2 00000001 f76fbe28 c01239b0 00000001 f76fbea8
Call Trace:
[show_stack+127/160] show_stack+0x7f/0xa0
[show_registers+351/464] show_registers+0x15f/0x1d0
[die+256/400] die+0x100/0x190
[do_page_fault+672/1712] do_page_fault+0x2a0/0x6b0
[error_code+43/48] error_code+0x2b/0x30
[kjournald+212/576] kjournald+0xd4/0x240
[kernel_thread_helper+5/16] kernel_thread_helper+0x5/0x10
Code: 8b 85 a0 fe ff ff 85 c0 0f 85 4f 0e 00 00 8b 95 a8 fe ff ff 8b 42 18 85 c0 0f 84 85 00 00 00 be 00 e0 ff ff 21 e6 8b 78 20 8b 1f <f0> ff 43 0c 8b 03 a8 04 0f 85 de 0d 00 00 89 5c 24 04 8b 4d 08
<6>note: kjournald[1206] exited with preempt_count 1
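Editorial note: the two EIP lines in the oops above give the same location in decimal (+877/5264) and in hex (+0x36d/0x1490). The sketch below cross-checks that arithmetic and derives the function's start address from the faulting EIP. The helper names parse_sym_offset and func_start are hypothetical and only illustrate the calculation; they are not part of any kernel tool.

```c
#include <assert.h>
#include <stdio.h>

/* Parse a "symbol+0xOFF/0xSIZE" string as printed on the "EIP is at" line.
 * sym must point to a buffer of at least 64 bytes.
 * Returns 1 on success, 0 if the string does not have that shape. */
int parse_sym_offset(const char *s, char *sym,
		     unsigned long *off, unsigned long *size)
{
	/* %lx accepts an optional 0x prefix, like strtoul() with base 16. */
	return sscanf(s, "%63[^+]+%lx/%lx", sym, off, size) == 3;
}

/* Start of the function = faulting EIP minus the offset into the function. */
unsigned long func_start(unsigned long eip, unsigned long off)
{
	assert(eip >= off);
	return eip - off;
}
```

Feeding it "journal_commit_transaction+0x36d/0x1490" yields offset 877 and size 5264, matching the bracketed decimal form, and func_start(0xc01a858d, 0x36d) gives 0xc01a8220 as the function's entry point in this particular build.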
The code crashes in fs/jbd/commit.c journal_commit_transaction in this
particular area at line 314:
...
	/*
	 * Wait for all previously submitted IO to complete.
	 */
	while (commit_transaction->t_locked_list) {
		struct buffer_head *bh;

		jh = commit_transaction->t_locked_list->b_tprev;
		bh = jh2bh(jh);
		get_bh(bh);	<--- crash here because bh is NULL
		if (buffer_locked(bh)) {
			spin_unlock(&journal->j_list_lock);
			wait_on_buffer(bh);
			if (unlikely(!buffer_uptodate(bh)))
				err = -EIO;
			spin_lock(&journal->j_list_lock);
		}
		if (!inverted_lock(journal, bh)) {
			put_bh(bh);
			spin_lock(&journal->j_list_lock);
			continue;
		}
		if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
			__journal_unfile_buffer(jh);
			jbd_unlock_bh_state(bh);
			journal_remove_journal_head(bh);
			put_bh(bh);
		} else {
			jbd_unlock_bh_state(bh);
		}
		put_bh(bh);
		if (need_resched()) {
			spin_unlock(&journal->j_list_lock);
			cond_resched();
			spin_lock(&journal->j_list_lock);
		}
	}
...
And more precisely at this stage of the code:
jh = commit_transaction->t_locked_list->b_tprev;
8b 78 20 mov 0x20(%eax),%edi
bh = jh2bh(jh);
8b 1f mov (%edi),%ebx
get_bh(bh);
f0 ff 43 0c lock incl 0xc(%ebx) <-- crash because ebx is null
8b 03 mov (%ebx),%eax
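Editorial note: the reported fault address 0000000c is consistent with the annotated instructions above. With ebx (bh) equal to NULL, "lock incl 0xc(%ebx)" touches address 0 + 0xc. The sketch below models that with offsetof. The struct bh_model is a hypothetical 32-bit layout: only b_state at offset 0 and b_count at offset 0xc are supported by the disassembly, while the two middle field names are an assumption about the header of that era.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical i386 model of the leading fields of struct buffer_head.
 * Pointers are modelled as 32-bit integers so the layout is the same
 * on any host. */
struct bh_model {
	uint32_t b_state;	/* offset 0x0: tested via (%ebx) */
	uint32_t b_this_page;	/* assumed field name, pointer on i386 */
	uint32_t b_page;	/* assumed field name, pointer on i386 */
	int32_t  b_count;	/* offset 0xc: target of lock incl 0xc(%ebx) */
};

/* Address touched by an access to bh->b_count for a given bh value. */
uint32_t b_count_addr(uint32_t bh)
{
	assert(offsetof(struct bh_model, b_count) == 0xc);
	return bh + (uint32_t)offsetof(struct bh_model, b_count);
}
```

b_count_addr(0) evaluates to 0xc, the virtual address in the "Unable to handle kernel NULL pointer dereference" line, which supports the reading that jh2bh(jh) returned NULL.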
Unfortunately I don't have the knowledge (or the time to acquire it) that
would help me chase down this bug.
If you need more information (including .config and other) I'll be happy
to provide it.
Please CC me, as I'm not subscribed to the list.
Regards,
--
Brice Figureau <brice+lklm@daysofwonder.com>