linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Problem in log_do_checkpoint()?
@ 2005-04-04  9:04 Jan Kara
  2005-04-08 14:06 ` Stephen C. Tweedie
  2005-04-08 17:14 ` Badari Pulavarty
  0 siblings, 2 replies; 7+ messages in thread
From: Jan Kara @ 2005-04-04  9:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: sct, jeffm

  Hello,

  I've been looking through the JBD code when trying to understand the
assertion failure in log_do_checkpoint() (it was on old SUSE 2.6.5 kernel
though the reporter claims to be able to get the failure even with the
Stephen's patch fixing a race with journal_put_journal_head()) and I've
spotted one place where I think could be a race (the code around there
seems to be the same in latest kernels):
  In log_do_checkpoint() we go through the t_checkpoint_list of a
transaction and call __flush_buffer() on each buffer. Suppose there is
just one buffer on the list and it is dirty. __flush_buffer() sees it and
puts it to an array of buffers for flushing. Then the loop finishes,
retry=0, drop_count=0, batch_count=1. So __flush_batch() is called - we
drop all locks and sleep. While we are sleeping somebody else comes and
makes the buffer dirty again (OK, that is not probable, but I think it
could be possible). Now we wake up and call __cleanup_transaction().
It's not able to do anything and returns 0. And we fail on the assertion
J_ASSERT(drop_count != 0 || cleanup_ret != 0).
  Am I missing something? In my opinion we should set retry=1 after we
call __flush_batch() even if we call it outside of the "__flush_buffer-loop"...

								Honza

-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Problem in log_do_checkpoint()?
  2005-04-04  9:04 Problem in log_do_checkpoint()? Jan Kara
@ 2005-04-08 14:06 ` Stephen C. Tweedie
  2005-04-11 11:36   ` Jan Kara
  2005-04-08 17:14 ` Badari Pulavarty
  1 sibling, 1 reply; 7+ messages in thread
From: Stephen C. Tweedie @ 2005-04-08 14:06 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, jeffm, Stephen Tweedie, Andrew Morton

Hi,

On Mon, 2005-04-04 at 10:04, Jan Kara wrote:

>   In log_do_checkpoint() we go through the t_checkpoint_list of a
> transaction and call __flush_buffer() on each buffer. Suppose there is
> just one buffer on the list and it is dirty. __flush_buffer() sees it and
> puts it to an array of buffers for flushing. Then the loop finishes,
> retry=0, drop_count=0, batch_count=1. So __flush_batch() is called - we
> drop all locks and sleep. While we are sleeping somebody else comes and
> makes the buffer dirty again (OK, that is not probable, but I think it
> could be possible). 

Yes, there's no reason why not at that point.

> Now we wake up and call __cleanup_transaction().
> It's not able to do anything and returns 0.

I think the _right_ answer here is to have two separate checkpoint
lists: the current one, plus one for which the checkpoint write has
already been submitted.  That way, we can wait for IO completion on
submitted writes without (a) getting conned into doing multiple rewrites
if there's somebody else dirtying the buffer; or (b) getting confused
about how much progress we're making.  Buffers on the pre-write list get
written; buffers on the post-write list get waited for; and both count
as progress (eliminating the false assert-failure when we failed to
detect progress).

The prevention of multiple writes in this case should also improve
performance a little.

That ought to be pretty straightforward, I think.  The existing cases
where we remove buffers from a checkpoint shouldn't have to care about
which list_head we're removing from; those cases already handle buffers
in both states.  It's only when doing the flush/wait that we have to
distinguish the two.

Sounds good?

--Stephen


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Problem in log_do_checkpoint()?
  2005-04-04  9:04 Problem in log_do_checkpoint()? Jan Kara
  2005-04-08 14:06 ` Stephen C. Tweedie
@ 2005-04-08 17:14 ` Badari Pulavarty
  2005-04-11 11:13   ` Jan Kara
  2005-04-11 11:37   ` Stephen C. Tweedie
  1 sibling, 2 replies; 7+ messages in thread
From: Badari Pulavarty @ 2005-04-08 17:14 UTC (permalink / raw)
  To: Jan Kara; +Cc: Linux Kernel Mailing List, sct, jeffm

I get OOPs  in log_do_checkpoint() while using ext3 quotas.
Is this anyway related to what you are working on ?

Unable to handle kernel NULL pointer dereference at virtual address
00000000
 printing eip:
801aeee1
*pde = 52b31001
Oops: 0002 [#1]
PREEMPT SMP 
Modules linked in:
CPU:    3
EIP:    0060:[<801aeee1>]    Not tainted VLI
EFLAGS: 00010213   (2.6.11-22) 
EIP is at log_do_checkpoint+0x91/0x220
eax: 00000002   ebx: b7d09e0c   ecx: 00000001   edx: e24a2000
esi: 00000000   edi: c4bac47c   ebp: cceb726c   esp: e24a2d18
ds: 007b   es: 007b   ss: 0068
Process rm (pid: 8694, threadinfo=e24a2000 task=f7b79040)
Stack: f7dc70e4 a1d60b3c e24a2d44 e24a2d3c e24a2d40 e24a2000 00004df4
a6062200 
       f7dc70e4 00000000 00000000 95447db0 95447e4c ec6c1d7c b52210e4
ec032b40 
       ec032b0c 936a5800 e5a262b8 95447cac eb4c4354 936a57cc 936a5798
ac0e93bc 
Call Trace:
 [<801ae94f>] __log_wait_for_space+0x9f/0xc0
 [<801a9b42>] start_this_handle+0x132/0x3f0
 [<8012f720>] autoremove_wake_function+0x0/0x60
 [<8012f720>] autoremove_wake_function+0x0/0x60
 [<801a9efd>] journal_start+0xad/0xe0
 [<801a68b1>] ext3_dquot_initialize+0x51/0x70
 [<801a2d0d>] ext3_rmdir+0x4d/0x1c0
 [<8031df76>] _spin_lock+0x16/0x90
 [<80168aa9>] vfs_rmdir+0x189/0x230
 [<80168be9>] sys_rmdir+0x99/0xf0
 [<8010272f>] syscall_call+0x7/0xb
Code: 8b 54 24 1c 89 5c 24 28 8b 40 04 89 44 24 18 8b 5a 28 8b 6b 2c 89
df 8d 76 00 89 fb b8 01 00 00 00 8b 7f 28 8b 33 e8 cf 76 f6 ff <f0> 0f
ba 2e 13 19 c0 85 c0 0f 85 3f 01 00 00 89 5c 24 04 8d 44 

Thanks,
Badari



On Mon, 2005-04-04 at 02:04, Jan Kara wrote:
>   Hello,
> 
>   I've been looking through the JBD code when trying to understand the
> assertion failure in log_do_checkpoint() (it was on old SUSE 2.6.5 kernel
> though the reporter claims to be able to get the failure even with the
> Stephen's patch fixing a race with journal_put_journal_head()) and I've
> spotted one place where I think could be a race (the code around there
> seems to be the same in latest kernels):
>   In log_do_checkpoint() we go through the t_checkpoint_list of a
> transaction and call __flush_buffer() on each buffer. Suppose there is
> just one buffer on the list and it is dirty. __flush_buffer() sees it and
> puts it to an array of buffers for flushing. Then the loop finishes,
> retry=0, drop_count=0, batch_count=1. So __flush_batch() is called - we
> drop all locks and sleep. While we are sleeping somebody else comes and
> makes the buffer dirty again (OK, that is not probable, but I think it
> could be possible). Now we wake up and call __cleanup_transaction().
> It's not able to do anything and returns 0. And we fail on the assertion
> J_ASSERT(drop_count != 0 || cleanup_ret != 0).
>   Am I missing something? In my opinion we should set retry=1 after we
> call __flush_batch() even if we call it outside of the "__flush_buffer-loop"...
> 
> 								Honza


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Problem in log_do_checkpoint()?
  2005-04-08 17:14 ` Badari Pulavarty
@ 2005-04-11 11:13   ` Jan Kara
  2005-04-11 11:37   ` Stephen C. Tweedie
  1 sibling, 0 replies; 7+ messages in thread
From: Jan Kara @ 2005-04-11 11:13 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Linux Kernel Mailing List, sct, jeffm

> I get OOPs  in log_do_checkpoint() while using ext3 quotas.
> Is this anyway related to what you are working on ?
  Nope, it does not seem to be the same problem. In theory it could be a
bug Stephen fixed some time ago - could you try to reproduce the problem
with 2.6.12-rc2 (it contains the fix)? If you're still able to get the
oops, could you please find out where exactly in log_do_checkpoint() it
occured?

> 
> Unable to handle kernel NULL pointer dereference at virtual address
> 00000000
>  printing eip:
> 801aeee1
> *pde = 52b31001
> Oops: 0002 [#1]
> PREEMPT SMP 
> Modules linked in:
> CPU:    3
> EIP:    0060:[<801aeee1>]    Not tainted VLI
> EFLAGS: 00010213   (2.6.11-22) 
> EIP is at log_do_checkpoint+0x91/0x220
> eax: 00000002   ebx: b7d09e0c   ecx: 00000001   edx: e24a2000
> esi: 00000000   edi: c4bac47c   ebp: cceb726c   esp: e24a2d18
> ds: 007b   es: 007b   ss: 0068
> Process rm (pid: 8694, threadinfo=e24a2000 task=f7b79040)
> Stack: f7dc70e4 a1d60b3c e24a2d44 e24a2d3c e24a2d40 e24a2000 00004df4
> a6062200 
>        f7dc70e4 00000000 00000000 95447db0 95447e4c ec6c1d7c b52210e4
> ec032b40 
>        ec032b0c 936a5800 e5a262b8 95447cac eb4c4354 936a57cc 936a5798
> ac0e93bc 
> Call Trace:
>  [<801ae94f>] __log_wait_for_space+0x9f/0xc0
>  [<801a9b42>] start_this_handle+0x132/0x3f0
>  [<8012f720>] autoremove_wake_function+0x0/0x60
>  [<8012f720>] autoremove_wake_function+0x0/0x60
>  [<801a9efd>] journal_start+0xad/0xe0
>  [<801a68b1>] ext3_dquot_initialize+0x51/0x70
>  [<801a2d0d>] ext3_rmdir+0x4d/0x1c0
>  [<8031df76>] _spin_lock+0x16/0x90
>  [<80168aa9>] vfs_rmdir+0x189/0x230
>  [<80168be9>] sys_rmdir+0x99/0xf0
>  [<8010272f>] syscall_call+0x7/0xb
> Code: 8b 54 24 1c 89 5c 24 28 8b 40 04 89 44 24 18 8b 5a 28 8b 6b 2c 89
> df 8d 76 00 89 fb b8 01 00 00 00 8b 7f 28 8b 33 e8 cf 76 f6 ff <f0> 0f
> ba 2e 13 19 c0 85 c0 0f 85 3f 01 00 00 89 5c 24 04 8d 44 

							Thanks for report
									Honza
								

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Problem in log_do_checkpoint()?
  2005-04-08 14:06 ` Stephen C. Tweedie
@ 2005-04-11 11:36   ` Jan Kara
  2005-04-12 14:40     ` Stephen C. Tweedie
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Kara @ 2005-04-11 11:36 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: linux-kernel, jeffm, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 2138 bytes --]

  Hello,

> On Mon, 2005-04-04 at 10:04, Jan Kara wrote:
> 
> >   In log_do_checkpoint() we go through the t_checkpoint_list of a
> > transaction and call __flush_buffer() on each buffer. Suppose there is
> > just one buffer on the list and it is dirty. __flush_buffer() sees it and
> > puts it to an array of buffers for flushing. Then the loop finishes,
> > retry=0, drop_count=0, batch_count=1. So __flush_batch() is called - we
> > drop all locks and sleep. While we are sleeping somebody else comes and
> > makes the buffer dirty again (OK, that is not probable, but I think it
> > could be possible). 
> 
> Yes, there's no reason why not at that point.
> 
> > Now we wake up and call __cleanup_transaction().
> > It's not able to do anything and returns 0.
> 
> I think the _right_ answer here is to have two separate checkpoint
> lists: the current one, plus one for which the checkpoint write has
> already been submitted.  That way, we can wait for IO completion on
> submitted writes without (a) getting conned into doing multiple rewrites
> if there's somebody else dirtying the buffer; or (b) getting confused
> about how much progress we're making.  Buffers on the pre-write list get
> written; buffers on the post-write list get waited for; and both count
> as progress (eliminating the false assert-failure when we failed to
> detect progress).
  Yes, this seems to be a better long-term solution. A hotfix (retrying
after __flush_batch()) is attached if somebody is interested - it should
be safe and is lightly tested.

> The prevention of multiple writes in this case should also improve
> performance a little.
> 
> That ought to be pretty straightforward, I think.  The existing cases
> where we remove buffers from a checkpoint shouldn't have to care about
> which list_head we're removing from; those cases already handle buffers
> in both states.  It's only when doing the flush/wait that we have to
> distinguish the two.
  Yes, AFAICS the changes should remain local to the checkpointing code
(plus __unlink_buffer()). Should I write the patch or will you?

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

[-- Attachment #2: jbd-2.6.12-rc2-checkpoint.diff --]
[-- Type: text/plain, Size: 686 bytes --]

Fix possible false assertion failure in JBD checkpointing code.

Signed-off-by: Jan Kara <jack@suse.cz>

diff -rupX /home/jack/.kerndiffexclude linux-2.6.12-rc2/fs/jbd/checkpoint.c linux-2.6.12-rc2-checkpoint/fs/jbd/checkpoint.c
--- linux-2.6.12-rc2/fs/jbd/checkpoint.c	2005-03-03 18:58:29.000000000 +0100
+++ linux-2.6.12-rc2-checkpoint/fs/jbd/checkpoint.c	2005-04-05 13:26:42.000000000 +0200
@@ -339,8 +339,10 @@ int log_do_checkpoint(journal_t *journal
 			}
 		} while (jh != last_jh && !retry);
 
-		if (batch_count)
+		if (batch_count) {
 			__flush_batch(journal, bhs, &batch_count);
+			retry = 1;
+		}
 
 		/*
 		 * If someone cleaned up this transaction while we slept, we're

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Problem in log_do_checkpoint()?
  2005-04-08 17:14 ` Badari Pulavarty
  2005-04-11 11:13   ` Jan Kara
@ 2005-04-11 11:37   ` Stephen C. Tweedie
  1 sibling, 0 replies; 7+ messages in thread
From: Stephen C. Tweedie @ 2005-04-11 11:37 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Jan Kara, Linux Kernel Mailing List, jeffm, Stephen Tweedie

Hi,

On Fri, 2005-04-08 at 18:14, Badari Pulavarty wrote:
> I get OOPs  in log_do_checkpoint() while using ext3 quotas.
> Is this anyway related to what you are working on ?
> 
> Unable to handle kernel NULL pointer dereference at virtual address
> 00000000

Doesn't look like it, no.  If we understand it right, Jan's problem
would only ever manifest itself as an assert failure, not as an oops.

--Stephen


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Problem in log_do_checkpoint()?
  2005-04-11 11:36   ` Jan Kara
@ 2005-04-12 14:40     ` Stephen C. Tweedie
  0 siblings, 0 replies; 7+ messages in thread
From: Stephen C. Tweedie @ 2005-04-12 14:40 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, jeffm, Andrew Morton, Stephen Tweedie

Hi,

On Mon, 2005-04-11 at 12:36, Jan Kara wrote:

> > The prevention of multiple writes in this case should also improve
> > performance a little.
> > 
> > That ought to be pretty straightforward, I think.  The existing cases
> > where we remove buffers from a checkpoint shouldn't have to care about
> > which list_head we're removing from; those cases already handle buffers
> > in both states.  It's only when doing the flush/wait that we have to
> > distinguish the two.
>   Yes, AFAICS the changes should remain local to the checkpointing code
> (plus __unlink_buffer()). Should I write the patch or will you?

Feel free, but please let me know if you start.  I'm doing a bit of
chasing of leaks and dealing with that O_SYNC thing for 2.4 right now,
but I'll get back to the checkpoint code after that if you haven't
started by then.

Cheers,
 Stephen


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2005-04-12 14:41 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-04-04  9:04 Problem in log_do_checkpoint()? Jan Kara
2005-04-08 14:06 ` Stephen C. Tweedie
2005-04-11 11:36   ` Jan Kara
2005-04-12 14:40     ` Stephen C. Tweedie
2005-04-08 17:14 ` Badari Pulavarty
2005-04-11 11:13   ` Jan Kara
2005-04-11 11:37   ` Stephen C. Tweedie

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).