linux-kernel.vger.kernel.org archive mirror
* unmount oops in log_do_checkpoint
@ 2006-01-16 16:04 Nick Piggin
  2006-01-16 21:22 ` Jan Kara
  0 siblings, 1 reply; 11+ messages in thread
From: Nick Piggin @ 2006-01-16 16:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel Mailing List

2.6.15-git12 (and 11, not sure when it started) oops when unmounting
an ext3 filesystem. Looks like 'transaction' in log_do_checkpoint is
garbage.

Reproduced it every time for about 3 or 4 reboots now.

Unable to handle kernel paging request at virtual address f63b4f7c
 printing eip:
c019db03
*pde = 004b0067
*pte = 363b4000
Oops: 0000 [#1]
SMP DEBUG_PAGEALLOC
Modules linked in: ide_cd cdrom 8250_pnp 8250 serial_core piix ide_core
CPU:    1
EIP:    0060:[<c019db03>]    Not tainted VLI
EFLAGS: 00010206   (2.6.15-git12) 
EIP is at log_do_checkpoint+0x6b/0x47b
eax: f63b4f78   ebx: 00000001   ecx: 00000000   edx: 00015e0a
esi: f57b78cc   edi: f57e7e90   ebp: f57e7e6c   esp: f57e7d38
ds: 007b   es: 007b   ss: 0068
Process umount (pid: 2418, threadinfo=f57e7000 task=f5396ad0)
Stack: <0>00000001 c18364b8 00000082 c04aa3a8 348ea000 f642bf24 f642bdf8 f63b4f78 
       00015e0a 00000001 c1836490 00000000 f57e7d80 c0114e96 00000000 00000000 
       f57e7d98 f4e7bad0 f57e7d98 c01151a0 c0441788 0000001f 00000000 c0441780 
Call Trace:
 [<c0103dc0>] show_stack_log_lvl+0xbb/0x105
 [<c0103f69>] show_registers+0x15f/0x1ef
 [<c010422a>] die+0x11b/0x22d
 [<c0114217>] do_page_fault+0x1ea/0x5d4
 [<c01037d7>] error_code+0x4f/0x54
 [<c019ff19>] journal_destroy+0x10d/0x24b
 [<c0195986>] ext3_put_super+0x20/0x1ec
 [<c015c89d>] generic_shutdown_super+0x89/0x131
 [<c015c954>] kill_block_super+0xf/0x20
 [<c015cb60>] deactivate_super+0x62/0x75
 [<c016f59c>] mntput_no_expire+0x44/0x62
 [<c0162f85>] path_release_on_umount+0x15/0x18
 [<c0170362>] sys_umount+0x3a/0x21d
 [<c017055e>] sys_oldumount+0x19/0x1b
 [<c0102bdf>] sysenter_past_esp+0x54/0x75
Code: e8 fe ff ff 89 d0 85 d2 0f 84 f2 01 00 00 8b 52 04 89 95 ec fe ff ff 39 85 e8 fe ff ff 74 15 8b 95 ec fe ff ff 8b 85 e8 fe ff ff <3b> 50 04 0f 85 cc 01 00 00 c7 45 f0 00 00 00 00 8b 85 e8 fe ff 



* Re: unmount oops in log_do_checkpoint
  2006-01-16 16:04 unmount oops in log_do_checkpoint Nick Piggin
@ 2006-01-16 21:22 ` Jan Kara
  2006-01-17 11:37   ` Nick Piggin
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2006-01-16 21:22 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Linux Kernel Mailing List

> 2.6.15-git12 (and 11, not sure when it started) oops when unmounting
> an ext3 filesystem. Looks like 'transaction' in log_do_checkpoint is
> garbage.
> 
> Reproduced it every time for about 3 or 4 reboots now.
>
> Unable to handle kernel paging request at virtual address f63b4f7c
>  printing eip:
> c019db03
> *pde = 004b0067
> *pte = 363b4000
> Oops: 0000 [#1]
> SMP DEBUG_PAGEALLOC
> Modules linked in: ide_cd cdrom 8250_pnp 8250 serial_core piix ide_core
> CPU:    1
> EIP:    0060:[<c019db03>]    Not tainted VLI
> EFLAGS: 00010206   (2.6.15-git12) 
> EIP is at log_do_checkpoint+0x6b/0x47b
> eax: f63b4f78   ebx: 00000001   ecx: 00000000   edx: 00015e0a
> esi: f57b78cc   edi: f57e7e90   ebp: f57e7e6c   esp: f57e7d38
> ds: 007b   es: 007b   ss: 0068
> Process umount (pid: 2418, threadinfo=f57e7000 task=f5396ad0)
> Stack: <0>00000001 c18364b8 00000082 c04aa3a8 348ea000 f642bf24 f642bdf8 f63b4f78 
>        00015e0a 00000001 c1836490 00000000 f57e7d80 c0114e96 00000000 00000000 
>        f57e7d98 f4e7bad0 f57e7d98 c01151a0 c0441788 0000001f 00000000 c0441780 
> Call Trace:
>  [<c0103dc0>] show_stack_log_lvl+0xbb/0x105
>  [<c0103f69>] show_registers+0x15f/0x1ef
>  [<c010422a>] die+0x11b/0x22d
>  [<c0114217>] do_page_fault+0x1ea/0x5d4
>  [<c01037d7>] error_code+0x4f/0x54
>  [<c019ff19>] journal_destroy+0x10d/0x24b
>  [<c0195986>] ext3_put_super+0x20/0x1ec
>  [<c015c89d>] generic_shutdown_super+0x89/0x131
>  [<c015c954>] kill_block_super+0xf/0x20
>  [<c015cb60>] deactivate_super+0x62/0x75
>  [<c016f59c>] mntput_no_expire+0x44/0x62
>  [<c0162f85>] path_release_on_umount+0x15/0x18
>  [<c0170362>] sys_umount+0x3a/0x21d
>  [<c017055e>] sys_oldumount+0x19/0x1b
>  [<c0102bdf>] sysenter_past_esp+0x54/0x75
> Code: e8 fe ff ff 89 d0 85 d2 0f 84 f2 01 00 00 8b 52 04 89 95 ec fe ff ff 39 85 e8 fe ff ff 74 15 8b 95 ec fe ff ff 8b 85 e8 fe ff ff <3b> 50 04 0f 85 cc 01 00 00 c7 45 f0 00 00 00 00 8b 85 e8 fe ff 

  It would be useful to find out which patch caused it (by git bisect),
but one obvious suspect is my merged ext3 patch to checkpoint.c. I'll
investigate tomorrow.

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs


* Re: unmount oops in log_do_checkpoint
  2006-01-16 21:22 ` Jan Kara
@ 2006-01-17 11:37   ` Nick Piggin
  2006-01-17 11:46     ` Andrew Morton
  0 siblings, 1 reply; 11+ messages in thread
From: Nick Piggin @ 2006-01-17 11:37 UTC (permalink / raw)
  To: Jan Kara; +Cc: Nick Piggin, Andrew Morton, Linux Kernel Mailing List

On Mon, Jan 16, 2006 at 10:22:50PM +0100, Jan Kara wrote:
> > 2.6.15-git12 (and 11, not sure when it started) oops when unmounting
> > an ext3 filesystem. Looks like 'transaction' in log_do_checkpoint is
> > garbage.
> > 

[oops]

>   It would be useful to find out which patch cause it (by git bisect)
> but one obvious suspect is my merged ext3 patch to checkpoint.c. I'll
> investigate tomorrow.
> 

Yep, reverting the jbd split checkpoint lists patch in -git12 fixes it.
The oops is 100% reproducible so far without the revert, and it has not
shown up on any reboot with the revert applied.

Nick



* Re: unmount oops in log_do_checkpoint
  2006-01-17 11:37   ` Nick Piggin
@ 2006-01-17 11:46     ` Andrew Morton
  2006-01-17 11:59       ` Nick Piggin
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2006-01-17 11:46 UTC (permalink / raw)
  To: Nick Piggin; +Cc: jack, npiggin, linux-kernel

Nick Piggin <npiggin@suse.de> wrote:
>
> On Mon, Jan 16, 2006 at 10:22:50PM +0100, Jan Kara wrote:
> > > 2.6.15-git12 (and 11, not sure when it started) oops when unmounting
> > > an ext3 filesystem. Looks like 'transaction' in log_do_checkpoint is
> > > garbage.
> > > 
> 
> [oops]
> 
> >   It would be useful to find out which patch cause it (by git bisect)
> > but one obvious suspect is my merged ext3 patch to checkpoint.c. I'll
> > investigate tomorrow.
> > 
> 
> Yep, reverting jbd split checkpoint lists in -git12 fixes it. It is
> 100% reproducible so far, and every time rebooting with a patched
> kernel fails to result in the oops.
> 

But that patch was in -mm for months.  How come you didn't hit the oops
earlier?  One would almost expect some odd patch interaction, but changes
in ext3 have been small for a long time.


* Re: unmount oops in log_do_checkpoint
  2006-01-17 11:46     ` Andrew Morton
@ 2006-01-17 11:59       ` Nick Piggin
  2006-01-17 14:09         ` Nick Piggin
  0 siblings, 1 reply; 11+ messages in thread
From: Nick Piggin @ 2006-01-17 11:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, jack, linux-kernel

On Tue, Jan 17, 2006 at 03:46:01AM -0800, Andrew Morton wrote:
> Nick Piggin <npiggin@suse.de> wrote:
> >
> > On Mon, Jan 16, 2006 at 10:22:50PM +0100, Jan Kara wrote:
> > > > 2.6.15-git12 (and 11, not sure when it started) oops when unmounting
> > > > an ext3 filesystem. Looks like 'transaction' in log_do_checkpoint is
> > > > garbage.
> > > > 
> > 
> > [oops]
> > 
> > >   It would be useful to find out which patch cause it (by git bisect)
> > > but one obvious suspect is my merged ext3 patch to checkpoint.c. I'll
> > > investigate tomorrow.
> > > 
> > 
> > Yep, reverting jbd split checkpoint lists in -git12 fixes it. It is
> > 100% reproducible so far, and every time rebooting with a patched
> > kernel fails to result in the oops.
> > 
> 
> But that patch was in -mm for months.  How come you didn't hit the oops
> earlier?  One would almost expect some odd patch interaction, but changes
> in ext3 have been small for a long time.

Haven't run -mm on that machine for quite a while, unfortunately.

What's strange is that nobody else has hit it... 




* Re: unmount oops in log_do_checkpoint
  2006-01-17 11:59       ` Nick Piggin
@ 2006-01-17 14:09         ` Nick Piggin
  2006-01-17 16:32           ` Jan Kara
  0 siblings, 1 reply; 11+ messages in thread
From: Nick Piggin @ 2006-01-17 14:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: jack, linux-kernel

On Tue, Jan 17, 2006 at 12:59:45PM +0100, Nick Piggin wrote:
> On Tue, Jan 17, 2006 at 03:46:01AM -0800, Andrew Morton wrote:
> > Nick Piggin <npiggin@suse.de> wrote:
> > >
> > > On Mon, Jan 16, 2006 at 10:22:50PM +0100, Jan Kara wrote:
> > > > > 2.6.15-git12 (and 11, not sure when it started) oops when unmounting
> > > > > an ext3 filesystem. Looks like 'transaction' in log_do_checkpoint is
> > > > > garbage.
> > > > > 
> > > 
> > > [oops]
> > > 
> > > >   It would be useful to find out which patch cause it (by git bisect)
> > > > but one obvious suspect is my merged ext3 patch to checkpoint.c. I'll
> > > > investigate tomorrow.
> > > > 
> > > 
> > > Yep, reverting jbd split checkpoint lists in -git12 fixes it. It is
> > > 100% reproducible so far, and every time rebooting with a patched
> > > kernel fails to result in the oops.
> > > 
> > 
> > But that patch was in -mm for months.  How come you didn't hit the oops
> > earlier?  One would almost expect some odd patch interaction, but changes
> > in ext3 have been small for a long time.
> 
> Haven't run -mm on that machine for quite a while, unfortunately.
> 
> What's strange is that nobody else has hit it... 
> 

Maybe it is because people haven't been turning on their debugging options,
tsk tsk ;) It only oopses when DEBUG_SLAB and DEBUG_PAGEALLOC are both
enabled. And only then when the jbd patch is not reverted. Weird.

Nick


* Re: unmount oops in log_do_checkpoint
  2006-01-17 14:09         ` Nick Piggin
@ 2006-01-17 16:32           ` Jan Kara
  2006-01-17 16:36             ` Nick Piggin
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2006-01-17 16:32 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel

> On Tue, Jan 17, 2006 at 12:59:45PM +0100, Nick Piggin wrote:
> > On Tue, Jan 17, 2006 at 03:46:01AM -0800, Andrew Morton wrote:
> > > Nick Piggin <npiggin@suse.de> wrote:
> > > >
> > > > On Mon, Jan 16, 2006 at 10:22:50PM +0100, Jan Kara wrote:
> > > > > > 2.6.15-git12 (and 11, not sure when it started) oops when unmounting
> > > > > > an ext3 filesystem. Looks like 'transaction' in log_do_checkpoint is
> > > > > > garbage.
> > > > > > 
> > > > 
> > > > [oops]
> > > > 
> > > > >   It would be useful to find out which patch cause it (by git bisect)
> > > > > but one obvious suspect is my merged ext3 patch to checkpoint.c. I'll
> > > > > investigate tomorrow.
> > > > > 
> > > > 
> > > > Yep, reverting jbd split checkpoint lists in -git12 fixes it. It is
> > > > 100% reproducible so far, and every time rebooting with a patched
> > > > kernel fails to result in the oops.
> > > > 
> > > 
> > > But that patch was in -mm for months.  How come you didn't hit the oops
> > > earlier?  One would almost expect some odd patch interaction, but changes
> > > in ext3 have been small for a long time.
> > 
> > Haven't run -mm on that machine for quite a while, unfortunately.
> > 
> > What's strange is that nobody else has hit it... 
> > 
> 
> Maybe it is because people haven't been turning on their debugging options,
> tsk tsk ;) It only oopses when DEBUG_SLAB and DEBUG_PAGEALLOC are both
> enabled. And only then when the jbd patch is not reverted. Weird.
  Hmm, that's really strange, maybe we have some use-after-free
problem or so... I'll see what I can do :).

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs


* Re: unmount oops in log_do_checkpoint
  2006-01-17 16:32           ` Jan Kara
@ 2006-01-17 16:36             ` Nick Piggin
  2006-01-17 22:23               ` Jan Kara
  0 siblings, 1 reply; 11+ messages in thread
From: Nick Piggin @ 2006-01-17 16:36 UTC (permalink / raw)
  To: Jan Kara; +Cc: Nick Piggin, Andrew Morton, linux-kernel

On Tue, Jan 17, 2006 at 05:32:35PM +0100, Jan Kara wrote:
> > On Tue, Jan 17, 2006 at 12:59:45PM +0100, Nick Piggin wrote:
> > 
> > Maybe it is because people haven't been turning on their debugging options,
> > tsk tsk ;) It only oopses when DEBUG_SLAB and DEBUG_PAGEALLOC are both
> > enabled. And only then when the jbd patch is not reverted. Weird.
>   Hmm, that's really strange, maybe we have some use-after-free
> problem or so... I'll see what I can do :).
> 

Are you able to reproduce? If not I can test patches...

Nick


* Re: unmount oops in log_do_checkpoint
  2006-01-17 16:36             ` Nick Piggin
@ 2006-01-17 22:23               ` Jan Kara
  2006-01-18  5:41                 ` Nick Piggin
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2006-01-17 22:23 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1378 bytes --]

> On Tue, Jan 17, 2006 at 05:32:35PM +0100, Jan Kara wrote:
> > > On Tue, Jan 17, 2006 at 12:59:45PM +0100, Nick Piggin wrote:
> > > 
> > > Maybe it is because people haven't been turning on their debugging options,
> > > tsk tsk ;) It only oopses when DEBUG_SLAB and DEBUG_PAGEALLOC are both
> > > enabled. And only then when the jbd patch is not reverted. Weird.
> >   Hmm, that's really strange, maybe we have some use-after-free
> > problem or so... I'll see what I can do :).
> > 
> 
> Are you able to reproduce? If not I can test patches...
  Hmm, I was not able to reproduce the problem even with those debug
options set :(. Looking at the code, it seems that somebody
managed to free the transaction but did not clear the
j_checkpoint_transactions pointer. It's even stranger that it happens at
umount time, when there should not be many processes playing with the JBD
structures on that filesystem.
  Attached is a patch that fixes two minor possible problems I've
found. Neither of them should be causing your oops but one never knows
:). Also turn on JBD debugging in the config. Maybe it spits out something
useful. If you still see the same oops, I'll create a debugging
patch.
  BTW: does the oops during umount come after some activity on the
filesystem, or do you just mount & umount?

						Thanks for testing
								Honza

-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

[-- Attachment #2: jbd-2.6.16-rc1-checkpointfix.diff --]
[-- Type: text/plain, Size: 1332 bytes --]

diff -rupX /home/jack/.kerndiffexclude linux-2.6.16-rc1/fs/jbd/checkpoint.c linux-2.6.16-rc1-1-checkpoint-fix/fs/jbd/checkpoint.c
--- linux-2.6.16-rc1/fs/jbd/checkpoint.c	2006-01-17 21:44:02.000000000 +0100
+++ linux-2.6.16-rc1-1-checkpoint-fix/fs/jbd/checkpoint.c	2006-01-17 23:35:49.000000000 +0100
@@ -338,7 +338,7 @@ restart:
 	 * done (maybe it's a new transaction, but it fell at the same
 	 * address).
 	 */
- 	if (journal->j_checkpoint_transactions == transaction ||
+ 	if (journal->j_checkpoint_transactions == transaction &&
 			transaction->t_tid == this_tid) {
 		int batch_count = 0;
 		struct buffer_head *bhs[NR_BATCH];
diff -rupX /home/jack/.kerndiffexclude linux-2.6.16-rc1/fs/jbd/commit.c linux-2.6.16-rc1-1-checkpoint-fix/fs/jbd/commit.c
--- linux-2.6.16-rc1/fs/jbd/commit.c	2006-01-15 00:20:12.000000000 +0100
+++ linux-2.6.16-rc1-1-checkpoint-fix/fs/jbd/commit.c	2006-01-17 23:35:19.000000000 +0100
@@ -829,7 +829,8 @@ restart_loop:
 	journal->j_committing_transaction = NULL;
 	spin_unlock(&journal->j_state_lock);
 
-	if (commit_transaction->t_checkpoint_list == NULL) {
+	if (commit_transaction->t_checkpoint_list == NULL &&
+	    commit_transaction->t_checkpoint_io_list == NULL) {
 		__journal_drop_transaction(journal, commit_transaction);
 	} else {
 		if (journal->j_checkpoint_transactions == NULL) {


* Re: unmount oops in log_do_checkpoint
  2006-01-17 22:23               ` Jan Kara
@ 2006-01-18  5:41                 ` Nick Piggin
  2006-01-18 10:35                   ` Jan Kara
  0 siblings, 1 reply; 11+ messages in thread
From: Nick Piggin @ 2006-01-18  5:41 UTC (permalink / raw)
  To: Jan Kara; +Cc: Nick Piggin, Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1837 bytes --]

On Tue, Jan 17, 2006 at 11:23:53PM +0100, Jan Kara wrote:
> > On Tue, Jan 17, 2006 at 05:32:35PM +0100, Jan Kara wrote:
> > > > On Tue, Jan 17, 2006 at 12:59:45PM +0100, Nick Piggin wrote:
> > > > 
> > > > Maybe it is because people haven't been turning on their debugging options,
> > > > tsk tsk ;) It only oopses when DEBUG_SLAB and DEBUG_PAGEALLOC are both
> > > > enabled. And only then when the jbd patch is not reverted. Weird.
> > >   Hmm, that's really strange, maybe we have some use-after-free
> > > problem or so... I'll see what I can do :).
> > > 
> > 
> > Are you able to reproduce? If not I can test patches...
>   Hmm, I was not able to reproduce the problem even with those debug
> options set :(. As I'm looking into the code it seems that somebody
> managed to free the transaction but did not clear the
> j_checkpoint_transactions pointer. It's even stranger that it's during
> umount time when there should not be much processes playing with the JBD
> structures on that filesystem.
>   Attached is the patch that fixes two minor possible problems I've
> found. Neither of them should be causing your oops but one never knows
> :). Also turn on the JBD debugging in config. Maybe it spits something
> useful. If you still see the same oops, I'll create some debugging
> patch.

This patch does the trick. Survived several reboots now while without
the patch it has oopsed 100% of the time so far. Thanks!

I have also attached a full jbd debug output and oops for the vanilla
2.6.16-rc1 case, just in case that helps.

>   BTW: the oops during umount is after some activity on the filesystem
> or you just mount & umount?
> 

A plain mount and unmount doesn't seem to trigger it, nor does a bit of
filesystem activity. I haven't tracked down exactly what *does* trigger
it, but booting then rebooting does it every time.

Nick

[-- Attachment #2: dmesg.bad.gz --]
[-- Type: application/x-gunzip, Size: 4380 bytes --]


* Re: unmount oops in log_do_checkpoint
  2006-01-18  5:41                 ` Nick Piggin
@ 2006-01-18 10:35                   ` Jan Kara
  0 siblings, 0 replies; 11+ messages in thread
From: Jan Kara @ 2006-01-18 10:35 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2148 bytes --]

> On Tue, Jan 17, 2006 at 11:23:53PM +0100, Jan Kara wrote:
> > > On Tue, Jan 17, 2006 at 05:32:35PM +0100, Jan Kara wrote:
> > > > > On Tue, Jan 17, 2006 at 12:59:45PM +0100, Nick Piggin wrote:
> > > > > 
> > > > > Maybe it is because people haven't been turning on their debugging options,
> > > > > tsk tsk ;) It only oopses when DEBUG_SLAB and DEBUG_PAGEALLOC are both
> > > > > enabled. And only then when the jbd patch is not reverted. Weird.
> > > >   Hmm, that's really strange, maybe we have some use-after-free
> > > > problem or so... I'll see what I can do :).
> > > > 
> > > 
> > > Are you able to reproduce? If not I can test patches...
> >   Hmm, I was not able to reproduce the problem even with those debug
> > options set :(. As I'm looking into the code it seems that somebody
> > managed to free the transaction but did not clear the
> > j_checkpoint_transactions pointer. It's even stranger that it's during
> > umount time when there should not be much processes playing with the JBD
> > structures on that filesystem.
> >   Attached is the patch that fixes two minor possible problems I've
> > found. Neither of them should be causing your oops but one never knows
> > :). Also turn on the JBD debugging in config. Maybe it spits something
> > useful. If you still see the same oops, I'll create some debugging
> > patch.
> 
> This patch does the trick. Survived several reboots now while without
> the patch it has oopsed 100% of the time so far. Thanks!
  Good to hear :).

> I have also attached a full jbd debug output and oops for the vanilla
> 2.6.16-rc1 case, just in case that helps.
  It helped me verify my idea of why my patch helped. Thanks. The problem
was the || instead of && in log_do_checkpoint(). When the journal checkpoint
list became empty, the pointer check in the loop failed, and so because
of the || we also evaluated the second check, which reads transaction->t_tid.
If the transaction was already freed, bad luck...
  Andrew, attached is the fix with logs etc. I split my original patch
into two as they are in fact unrelated things. Please apply.
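
  The short-circuit behaviour described above can be sketched in a small
userspace model. This is only an illustration under stated assumptions:
struct txn_model, struct journal_model, read_tid() and replay_unmount() are
hypothetical stand-ins, not the kernel's definitions, and the model marks a
transaction as "freed" with a flag instead of really freeing it, so the
demonstration itself stays well defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the JBD structures; not the kernel definitions. */
struct txn_model {
	unsigned int t_tid;
	bool freed;	/* the model flags a "freed" transaction instead of freeing it */
};

struct journal_model {
	struct txn_model *j_checkpoint_transactions;	/* NULL when the list is empty */
};

/* Number of times the model read t_tid from a transaction flagged as freed. */
int stale_reads;

unsigned int read_tid(const struct txn_model *t)
{
	if (t->freed)
		stale_reads++;	/* stands in for the read that can touch freed memory */
	return t->t_tid;
}

/* Pre-fix condition: a failed pointer check falls through to the t_tid read. */
bool still_ours_or(const struct journal_model *j, const struct txn_model *t,
		   unsigned int this_tid)
{
	return j->j_checkpoint_transactions == t || read_tid(t) == this_tid;
}

/* Fixed condition: t_tid is read only if the pointer check succeeded. */
bool still_ours_and(const struct journal_model *j, const struct txn_model *t,
		    unsigned int this_tid)
{
	return j->j_checkpoint_transactions == t && read_tid(t) == this_tid;
}

/* Replays "list emptied, transaction freed"; returns the stale read count. */
int replay_unmount(bool use_fix)
{
	struct txn_model t = { .t_tid = 42, .freed = true };
	struct journal_model j = { .j_checkpoint_transactions = NULL };

	assert(j.j_checkpoint_transactions != &t);	/* the pointer check must fail */
	stale_reads = 0;
	if (use_fix)
		(void)still_ours_and(&j, &t, 42);
	else
		(void)still_ours_or(&j, &t, 42);
	return stale_reads;
}
```

  In this model, replay_unmount(false) records one read of the freed
transaction and replay_unmount(true) records none, which matches the
explanation: with && the t_tid comparison is skipped as soon as the pointer
check fails.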

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

[-- Attachment #2: jbd-2.6.16-rc1-1-log_do_checkpoint_fix.diff --]
[-- Type: text/plain, Size: 849 bytes --]

While checkpointing we have to check that our transaction still is in the
checkpoint list *and* (not or) that it's not just a different transaction
with the same address.

Signed-off-by: Jan Kara <jack@suse.cz>

diff -rupX /home/jack/.kerndiffexclude linux-2.6.16-rc1/fs/jbd/checkpoint.c linux-2.6.16-rc1-1-checkpoint-fix/fs/jbd/checkpoint.c
--- linux-2.6.16-rc1/fs/jbd/checkpoint.c	2006-01-17 21:44:02.000000000 +0100
+++ linux-2.6.16-rc1-1-checkpoint-fix/fs/jbd/checkpoint.c	2006-01-17 23:35:49.000000000 +0100
@@ -338,7 +338,7 @@ restart:
 	 * done (maybe it's a new transaction, but it fell at the same
 	 * address).
 	 */
- 	if (journal->j_checkpoint_transactions == transaction ||
+ 	if (journal->j_checkpoint_transactions == transaction &&
 			transaction->t_tid == this_tid) {
 		int batch_count = 0;
 		struct buffer_head *bhs[NR_BATCH];


[-- Attachment #3: jbd-2.6.16-rc1-2-commit_remove_trans_fix.diff --]
[-- Type: text/plain, Size: 837 bytes --]

We have to check that the second checkpoint list is also empty before
dropping the transaction.

Signed-off-by: Jan Kara <jack@suse.cz>

diff -rupX /home/jack/.kerndiffexclude linux-2.6.16-rc1/fs/jbd/commit.c linux-2.6.16-rc1-1-checkpoint-fix/fs/jbd/commit.c
--- linux-2.6.16-rc1/fs/jbd/commit.c	2006-01-15 00:20:12.000000000 +0100
+++ linux-2.6.16-rc1-1-checkpoint-fix/fs/jbd/commit.c	2006-01-17 23:35:19.000000000 +0100
@@ -829,7 +829,8 @@ restart_loop:
 	journal->j_committing_transaction = NULL;
 	spin_unlock(&journal->j_state_lock);
 
-	if (commit_transaction->t_checkpoint_list == NULL) {
+	if (commit_transaction->t_checkpoint_list == NULL &&
+	    commit_transaction->t_checkpoint_io_list == NULL) {
 		__journal_drop_transaction(journal, commit_transaction);
 	} else {
 		if (journal->j_checkpoint_transactions == NULL) {
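
The effect of the commit.c change above can be sketched as a pair of
predicates. This is a minimal model under stated assumptions: struct
jh_model, struct commit_txn_model, may_drop_old() and may_drop_new() are
hypothetical names for illustration, and the thread does not establish
whether the "only the io list is populated" state is reachable in practice
at that point in the commit path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical placeholder for a journal head on a checkpoint list. */
struct jh_model {
	int unused;
};

/* Hypothetical stand-in for the two per-transaction checkpoint lists. */
struct commit_txn_model {
	struct jh_model *t_checkpoint_list;	/* buffers still to be checkpointed */
	struct jh_model *t_checkpoint_io_list;	/* buffers already submitted for I/O */
};

/* Condition before the patch: only the first list is consulted. */
bool may_drop_old(const struct commit_txn_model *t)
{
	return t->t_checkpoint_list == NULL;
}

/* Condition after the patch: both lists must be empty. */
bool may_drop_new(const struct commit_txn_model *t)
{
	return t->t_checkpoint_list == NULL &&
	       t->t_checkpoint_io_list == NULL;
}

/* A transaction whose only remaining buffers sit on the io list. */
void check_io_list_only(void)
{
	struct jh_model jh = { 0 };
	struct commit_txn_model t = {
		.t_checkpoint_list = NULL,
		.t_checkpoint_io_list = &jh,
	};

	assert(may_drop_old(&t));	/* old test would drop it too early */
	assert(!may_drop_new(&t));	/* new test keeps it on the checkpoint list */
}
```

Calling check_io_list_only() exercises the one case where the two
conditions disagree; when both lists are empty, both predicates agree that
the transaction may be dropped.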
