* re: [PATCH] Change ll_rw_block() calls in JBD
@ 2006-05-18  8:25 Zoltan Menyhart
  2006-05-18 13:45 ` Jan Kara
  0 siblings, 1 reply; 19+ messages in thread
From: Zoltan Menyhart @ 2006-05-18  8:25 UTC (permalink / raw)
  To: jack, sct, linux-kernel

> We must be sure that the current data in the buffer are sent to disk.
> Hence we have to call ll_rw_block() with SWRITE.

Let's consider the following case:

	while (commit_transaction->t_sync_datalist) {
		...

		// Assume a "bh" got locked before starting this loop

		if (buffer_locked(bh)) {
			...
			__journal_temp_unlink_buffer(jh);
			__journal_file_buffer(jh, commit_transaction, BJ_Locked);
		} else ...
	}
	...
	while (commit_transaction->t_locked_list) {
		...

		// Assume our "bh" is not locked any more
		// Nothing has happened to this "bh", someone just wanted
		// to look at it in a safe way

		if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
			__journal_unfile_buffer(jh);
			jbd_unlock_bh_state(bh);
			journal_remove_journal_head(bh);
			put_bh(bh);
		} else ...
	}

I.e., a "bh" that was already locked when the first loop ran is dropped from
the transaction without its data ever being submitted for write-out.

Regards,

Zoltan Menyhart



* Re: [PATCH] Change ll_rw_block() calls in JBD
  2006-05-18  8:25 [PATCH] Change ll_rw_block() calls in JBD Zoltan Menyhart
@ 2006-05-18 13:45 ` Jan Kara
  2006-05-18 15:11   ` Zoltan Menyhart
                     ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Jan Kara @ 2006-05-18 13:45 UTC (permalink / raw)
  To: Zoltan Menyhart; +Cc: sct, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1210 bytes --]

> >We must be sure that the current data in the buffer are sent to disk.
> >Hence we have to call ll_rw_block() with SWRITE.
> 
> Let's consider the following case:
> 
> 	while (commit_transaction->t_sync_datalist) {
> 		...
> 
> 		// Assume a "bh" got locked before starting this loop
> 
> 		if (buffer_locked(bh)) {
> 			...
> 			__journal_temp_unlink_buffer(jh);
> 			__journal_file_buffer(jh, commit_transaction, BJ_Locked);
> 		} else ...
> 	}
> 	...
> 	while (commit_transaction->t_locked_list) {
> 		...
> 
> 		// Assume our "bh" is not locked any more
> 		// Nothing has happened to this "bh", someone just wanted
> 		// to look at it in a safe way
> 
> 		if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
> 			__journal_unfile_buffer(jh);
> 			jbd_unlock_bh_state(bh);
> 			journal_remove_journal_head(bh);
> 			put_bh(bh);
> 		} else ...
> 	}
> 
> I.e., a "bh" that was already locked when the first loop ran is dropped
> from the transaction without its data ever being submitted for write-out.
  Yes, I'm aware of this problem. Actually I wrote a patch (attached) for it
some time ago, but as I check current kernels it seems to have been lost
somewhere along the way. I'll rediff it and submit it again. Thanks for
spotting the problem.

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

[-- Attachment #2: jbd-2.6.14-rc5-1-orderedwrite.diff --]
[-- Type: text/plain, Size: 4474 bytes --]

  The old code assumed that when a buffer is locked it is being
written, but that need not always be true. Hence we have to be more
careful when processing ordered data buffers.
  If a buffer is not dirty, we know that the write has already started
(and may even have completed already), and hence just refiling the buffer
to t_locked_list (or unfiling it completely in case the IO has finished)
is enough. If the buffer is dirty, we have to acquire the buffer lock
and do the write. In this case we also immediately refile the buffer
to t_locked_list, thus always making some progress.
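
Distilling the description above, the per-buffer decision is (a sketch
for orientation only, not part of the patch):

	/*
	 *	dirty?	state observed		action
	 *	------	--------------		------
	 *	no	locked (IO running)	refile to t_locked_list
	 *	no	unlocked (IO done)	unfile, drop the journal head
	 *	yes	we took the lock	clear dirty, submit_bh(), refile
	 */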

Signed-off-by: Jan Kara <jack@suse.cz>

diff -rupX /home/jack/.kerndiffexclude linux-2.6.14-rc5/fs/jbd/commit.c linux-2.6.14-rc5-1-orderedwrite/fs/jbd/commit.c
--- linux-2.6.14-rc5/fs/jbd/commit.c	2005-10-25 08:53:22.000000000 +0200
+++ linux-2.6.14-rc5-1-orderedwrite/fs/jbd/commit.c	2005-10-25 09:07:43.000000000 +0200
@@ -333,57 +333,77 @@ write_out_data:
 
 	while (commit_transaction->t_sync_datalist) {
 		struct buffer_head *bh;
+		int was_dirty;
 
 		jh = commit_transaction->t_sync_datalist;
-		commit_transaction->t_sync_datalist = jh->b_tnext;
 		bh = jh2bh(jh);
-		if (buffer_locked(bh)) {
-			BUFFER_TRACE(bh, "locked");
-			if (!inverted_lock(journal, bh))
-				goto write_out_data;
+
+		/*
+		 * If the buffer is not dirty, we don't need to submit any
+		 * IO (either it is running or it has already finished) and
+		 * hence we don't need the buffer lock. In case we need the
+		 * lock, try to lock the buffer without blocking. If we fail,
+		 * we need to drop j_list_lock and do blocking lock_buffer().
+		 */
+		was_dirty = buffer_dirty(bh);
+		if (was_dirty && test_set_buffer_locked(bh)) {
+			BUFFER_TRACE(bh, "needs blocking lock");
+			get_bh(bh);
+			spin_unlock(&journal->j_list_lock);
+			lock_buffer(bh);
+			spin_lock(&journal->j_list_lock);
+			/* Someone already cleaned up the buffer? Restart. */
+			if (!buffer_jbd(bh) || jh->b_jlist != BJ_SyncData) {
+				unlock_buffer(bh);
+				BUFFER_TRACE(bh, "already cleaned up");
+				put_bh(bh);
+				continue;
+			}
+			put_bh(bh);
+		}
+
+		if (!jbd_trylock_bh_state(bh)) {
+			if (was_dirty)
+				unlock_buffer(bh);
+			spin_unlock(&journal->j_list_lock);
+			schedule();
+			goto write_out_data;
+		}
+		/*
+		 * Now we know that the buffer either was not dirty or we
+		 * have the buffer lock. If the buffer was not dirty,
+		 * write-out is running or already complete and we can just
+		 * refile the buffer to Locked list or unfile the buffer
+		 * respectively. In case the buffer was dirty, we have to
+		 * submit the buffer for IO before refiling it.
+		 */
+		BUFFER_TRACE(bh, "locked the buffer");
+		if ((!was_dirty && !buffer_locked(bh))
+		    || (was_dirty && !test_clear_buffer_dirty(bh))) {
+			BUFFER_TRACE(bh, "writeout complete: unfile");
+			__journal_unfile_buffer(jh);
+			jbd_unlock_bh_state(bh);
+			journal_remove_journal_head(bh);
+			if (was_dirty)
+				unlock_buffer(bh);
+			put_bh(bh);
+		}
+		else {
+			BUFFER_TRACE(bh, "needs writeout, submitting");
 			__journal_temp_unlink_buffer(jh);
 			__journal_file_buffer(jh, commit_transaction,
 						BJ_Locked);
 			jbd_unlock_bh_state(bh);
-			if (lock_need_resched(&journal->j_list_lock)) {
-				spin_unlock(&journal->j_list_lock);
-				goto write_out_data;
-			}
-		} else {
-			if (buffer_dirty(bh)) {
-				BUFFER_TRACE(bh, "start journal writeout");
+			if (was_dirty) {
 				get_bh(bh);
-				wbuf[bufs++] = bh;
-				if (bufs == journal->j_wbufsize) {
-					jbd_debug(2, "submit %d writes\n",
-							bufs);
-					spin_unlock(&journal->j_list_lock);
-					ll_rw_block(SWRITE, bufs, wbuf);
-					journal_brelse_array(wbuf, bufs);
-					bufs = 0;
-					goto write_out_data;
-				}
-			} else {
-				BUFFER_TRACE(bh, "writeout complete: unfile");
-				if (!inverted_lock(journal, bh))
-					goto write_out_data;
-				__journal_unfile_buffer(jh);
-				jbd_unlock_bh_state(bh);
-				journal_remove_journal_head(bh);
-				put_bh(bh);
-				if (lock_need_resched(&journal->j_list_lock)) {
-					spin_unlock(&journal->j_list_lock);
-					goto write_out_data;
-				}
+				bh->b_end_io = end_buffer_write_sync;
+				submit_bh(WRITE, bh);
 			}
 		}
-	}
-
-	if (bufs) {
-		spin_unlock(&journal->j_list_lock);
-		ll_rw_block(SWRITE, bufs, wbuf);
-		journal_brelse_array(wbuf, bufs);
-		spin_lock(&journal->j_list_lock);
+		if (lock_need_resched(&journal->j_list_lock)) {
+			spin_unlock(&journal->j_list_lock);
+			goto write_out_data;
+		}
 	}
 
 	/*


* Re: [PATCH] Change ll_rw_block() calls in JBD
  2006-05-18 13:45 ` Jan Kara
@ 2006-05-18 15:11   ` Zoltan Menyhart
  2006-05-18 22:25     ` Stephen C. Tweedie
  2006-05-19  1:30     ` Jan Kara
  2006-05-19 15:06   ` Stephen C. Tweedie
                     ` (3 subsequent siblings)
  4 siblings, 2 replies; 19+ messages in thread
From: Zoltan Menyhart @ 2006-05-18 15:11 UTC (permalink / raw)
  To: Jan Kara; +Cc: sct, linux-kernel

Getting better :-)

> +		was_dirty = buffer_dirty(bh);

Why don't we use "buffer_jbddirty()"?

> +		if (was_dirty && test_set_buffer_locked(bh)) {
> +			BUFFER_TRACE(bh, "needs blocking lock");
> +			get_bh(bh);

Why do you need the "get_bh(bh)"?
"bh" is attached to "jh".
Can it go away while we are waiting for the buffer lock?
("jh" is on "t_sync_datalist", it cannot go away.)

> +			spin_unlock(&journal->j_list_lock);
> +			lock_buffer(bh);
> +			spin_lock(&journal->j_list_lock);
> +			/* Someone already cleaned up the buffer? Restart. */
> +			if (!buffer_jbd(bh) || jh->b_jlist != BJ_SyncData) {

Who (else) can take the journal head away and remove our "jh" from the
sync data list?

> +		else {
> +			BUFFER_TRACE(bh, "needs writeout, submitting");
>  			__journal_temp_unlink_buffer(jh);
>  			__journal_file_buffer(jh, commit_transaction,
>  						BJ_Locked);

A simple "__journal_file_buffer(...)" would be enough, as it already includes:

	if (jh->b_transaction)
		__journal_temp_unlink_buffer(jh);


Wouldn't it be easier to read like this?

                if ((!was_dirty && buffer_locked(bh))
                    || (was_dirty && test_clear_buffer_dirty(bh))) {
                        BUFFER_TRACE(bh, "needs writeout, submitting");
                        __journal_temp_unlink_buffer(jh);
                        __journal_file_buffer(jh, commit_transaction,
                                                BJ_Locked);
                        jbd_unlock_bh_state(bh);
                        if (was_dirty) {
                                get_bh(bh);
                                bh->b_end_io = end_buffer_write_sync;
                                submit_bh(WRITE, bh);
                        }
                }
                else {

                        BUFFER_TRACE(bh, "writeout complete: unfile");
                        __journal_unfile_buffer(jh);
                        jbd_unlock_bh_state(bh);
                        journal_remove_journal_head(bh);
                        if (was_dirty)
                                unlock_buffer(bh);
                        put_bh(bh);
                }


As the sync data handling is a compact piece of code, couldn't it be moved
out of "journal_commit_transaction()", like e.g. "journal_write_revoke_records()"?
(Just for better readability...)

Thanks,

Zoltan

* Re: [PATCH] Change ll_rw_block() calls in JBD
  2006-05-18 15:11   ` Zoltan Menyhart
@ 2006-05-18 22:25     ` Stephen C. Tweedie
  2006-05-19 10:01       ` Zoltan Menyhart
  2006-05-19  1:30     ` Jan Kara
  1 sibling, 1 reply; 19+ messages in thread
From: Stephen C. Tweedie @ 2006-05-18 22:25 UTC (permalink / raw)
  To: Zoltan Menyhart; +Cc: Jan Kara, linux-kernel, Stephen Tweedie

Hi,

On Thu, 2006-05-18 at 17:11 +0200, Zoltan Menyhart wrote:

> > +		was_dirty = buffer_dirty(bh);
> 
> Why don't we use "buffer_jbddirty()"?

jbddirty is what we use for metadata that we don't want to get written
to disk yet.

For journaling to work, we must ensure that the transaction's metadata
gets written to and committed in the journal before it is allowed to be
written back to main backing store.  If we don't, and writeback occurs
early, then on a crash we can't roll back to the previous transaction.

So when we mark metadata as dirty, we set the jbddirty flag but do *not*
set the normal dirty flag; until the buffer is journaled, it is not
eligible for normal writeback.

This whole mechanism is only useful for metadata (plus data if we're in
data=journal mode).  For normal ordered data writes, we never use
jbddirty because there's just no problem if the data is written before
the transaction completes.
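
For illustration, the flag handoff looks roughly like this when a buffer
is filed on a metadata list (a simplified sketch from memory along the
lines of __journal_file_buffer() in fs/jbd/transaction.c, not the
verbatim source):

	/* Trade the generic dirty bit for the JBD-private one, so that
	 * normal writeback leaves the buffer alone until commit has
	 * journaled it safely: */
	if (test_clear_buffer_dirty(bh))
		set_buffer_jbddirty(bh);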

--Stephen

* Re: [PATCH] Change ll_rw_block() calls in JBD
  2006-05-18 15:11   ` Zoltan Menyhart
  2006-05-18 22:25     ` Stephen C. Tweedie
@ 2006-05-19  1:30     ` Jan Kara
  2006-05-19 12:33       ` Zoltan Menyhart
  1 sibling, 1 reply; 19+ messages in thread
From: Jan Kara @ 2006-05-19  1:30 UTC (permalink / raw)
  To: Zoltan Menyhart; +Cc: sct, linux-kernel

> Getting better :-)
> 
> >+		was_dirty = buffer_dirty(bh);
> 
> Why don't we use "buffer_jbddirty()"?
  I think Stephen has already explained it... We have a data buffer and
hence we use buffer_dirty().

> 
> >+		if (was_dirty && test_set_buffer_locked(bh)) {
> >+			BUFFER_TRACE(bh, "needs blocking lock");
> >+			get_bh(bh);
> 
> Why do you need the "get_bh(bh)"?
> "bh" is attached to "jh".
> Can it go away while we are waiting for the buffer lock?
> ("jh" is on "t_sync_datalist", it cannot go away.)
> 
> >+			spin_unlock(&journal->j_list_lock);
> >+			lock_buffer(bh);
> >+			spin_lock(&journal->j_list_lock);
> >+			/* Someone already cleaned up the buffer? Restart. */
> >+			if (!buffer_jbd(bh) || jh->b_jlist != BJ_SyncData) {
> 
> Who (else) can take the journal head away and remove our "jh" from the
> sync data list?
  For both of the above comments: under memory pressure data buffers can
be written out earlier and then released by __journal_try_to_free_buffer(),
as they are not dirty any more. The above checks protect us against this.
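
  That competing path looks roughly like this (a simplified sketch from
memory along the lines of __journal_try_to_free_buffer() in
fs/jbd/transaction.c, not the verbatim source):

	/* Under memory pressure, a clean, unlocked ordered-data buffer
	 * whose IO has completed can be unfiled and its journal head
	 * freed while the committing thread sleeps in lock_buffer(): */
	if (!buffer_locked(bh) && !buffer_dirty(bh) &&
	    jh->b_jlist == BJ_SyncData) {
		__journal_unfile_buffer(jh);
		journal_remove_journal_head(bh);	/* "jh" is gone now */
		__brelse(bh);
	}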

> >+		else {
> >+			BUFFER_TRACE(bh, "needs writeout, submitting");
> > 			__journal_temp_unlink_buffer(jh);
> > 			__journal_file_buffer(jh, commit_transaction,
> > 						BJ_Locked);
> 
> A simple "__journal_file_buffer(...)" would be enough, as it already includes:
> 
> 	if (jh->b_transaction)
> 		__journal_temp_unlink_buffer(jh);
  Yes, you are right here.

> Wouldn't it be easier to read like this?
> 
>                if ((!was_dirty && buffer_locked(bh))
>                    || (was_dirty && test_clear_buffer_dirty(bh))) {
>                        BUFFER_TRACE(bh, "needs writeout, submitting");
>                        __journal_temp_unlink_buffer(jh);
>                        __journal_file_buffer(jh, commit_transaction,
>                                                BJ_Locked);
>                        jbd_unlock_bh_state(bh);
>                        if (was_dirty) {
>                                get_bh(bh);
>                                bh->b_end_io = end_buffer_write_sync;
>                                submit_bh(WRITE, bh);
>                        }
>                }
>                else {
> 
>                        BUFFER_TRACE(bh, "writeout complete: unfile");
>                        __journal_unfile_buffer(jh);
>                        jbd_unlock_bh_state(bh);
>                        journal_remove_journal_head(bh);
>                        if (was_dirty)
>                                unlock_buffer(bh);
>                        put_bh(bh);
>                }
  So you basically mean swapping those two branches of the if... OK, maybe.

> As the sync data handling is a compact piece of code, couldn't it be moved
> out of "journal_commit_transaction()", like e.g. "journal_write_revoke_records()"?
> (Just for better readability...)
  Yes, moving it to a new function would probably improve readability.
Thanks for suggestions.

								Honza 
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs


* Re: [PATCH] Change ll_rw_block() calls in JBD
  2006-05-18 22:25     ` Stephen C. Tweedie
@ 2006-05-19 10:01       ` Zoltan Menyhart
  2006-05-19 12:26         ` Stephen C. Tweedie
  0 siblings, 1 reply; 19+ messages in thread
From: Zoltan Menyhart @ 2006-05-19 10:01 UTC (permalink / raw)
  To: Stephen C. Tweedie, Jan Kara; +Cc: linux-kernel

I'd like to take this opportunity to ask some questions
I always wanted to ask about committing a transaction(*)
but was afraid to ask :-)

We wait for several types of I/O:
- "Wait for all previously submitted IO to complete" (t_sync_datalist)
- wait_for_iobuf: (t_buffers)
- wait_for_ctlbuf: (t_log_list)
before "journal_write_commit_record()" gets invoked.

Why don't we wait for them all at the very last moment, in a batch? Is
it important that e.g. all the buffers from "t_buffers" be
completed before we start with the ones on "t_log_list"?

If "JFS_BARRIER" is set, why do we wait for these I/O-s at all?
(The "ordered" attribute is set => all the previous I/O-s must have hit
the permanent storage before the commit record can do.)

Why do we let the EXT3 layer decide? Why don't we ask the "bio" layer
whether the "ordered" attribute is supported and use it systematically?

There is a comment in "journal_write_commit_record()":

	/* is it possible for another commit to fail at roughly
	 * the same time as this one?...

Another commit for the same journal?
(If not the same journal, why is it a problem?)

If a barrier-based sync has failed on a device, does the actual
I/O start by itself (ignoring the ordering issue)?

Thanks,

Zoltan


* Re: [PATCH] Change ll_rw_block() calls in JBD
  2006-05-19 10:01       ` Zoltan Menyhart
@ 2006-05-19 12:26         ` Stephen C. Tweedie
  0 siblings, 0 replies; 19+ messages in thread
From: Stephen C. Tweedie @ 2006-05-19 12:26 UTC (permalink / raw)
  To: Zoltan Menyhart; +Cc: Jan Kara, linux-kernel, Stephen Tweedie

Hi,

On Fri, 2006-05-19 at 12:01 +0200, Zoltan Menyhart wrote:
> I'd like to take this opportunity to ask some questions
> I always wanted to ask about committing a transaction(*)
> but were afraid to ask :-)
> 
> We wait for several types of I/O:
> - "Wait for all previously submitted IO to complete" (t_sync_datalist)
> - wait_for_iobuf: (t_buffers)
> - wait_for_ctlbuf: (t_log_list)
> before "journal_write_commit_record()" gets invoked.
> 
> Why don't we wait for them all at the very last moment, in a batch? Is
> it important that e.g. all the buffers from "t_buffers" be
> completed before we start with the ones on "t_log_list"?

We don't.  We write the control and journaled metadata buffers first,
recording them on the t_buffers and t_log_list lists.  We then wait on
both of those lists, but there's no IO being submitted at that stage and
the order in which we wait on them has no effect at all on the progress
of the IO.

The sync_datalist writes could be waited for separately at the end if we
wanted to.  That code was written a _long_ time ago but I vaguely
recall some reasoning that we may have a lot more data than metadata, so
it would be reasonable to clean up the data writes as quickly as possible
and to prevent data and metadata writes from getting submitted in
parallel and potentially causing seeks between the two.  The elevator
should prevent the latter effect, though, so it would be entirely
reasonable to move that wait to later on in commit if we wanted to.

> If "JFS_BARRIER" is set, why do we wait for these I/O-s at all?
> (The "ordered" attribute is set => all the previous I/O-s must have hit
> the permanent storage before the commit record can do.)

> Why do we let the EXT3 layer decide? Why don't we ask the "bio" layer
> whether the "ordered" attribute is supported and use it systematically?

It would be entirely reasonable to do that, sure.  It just hasn't been
done yet.

> There is a comment in "journal_write_commit_record()":
> 
> 	/* is it possible for another commit to fail at roughly
> 	 * the same time as this one?...
> 
> Another commit for the same journal?

Not likely. :-)

> If a barrier-based sync has failed on a device, does the actual
> I/O start by itself (ignoring the ordering issue)?

It fails with EOPNOTSUPP and must be retried, iirc.  
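
The retry path is roughly this (a sketch of journal_write_commit_record()
from memory, not the verbatim source; barrier_done is a local flag set
when the commit block was submitted with the barrier attribute):

	ret = sync_dirty_buffer(bh);
	if (ret == -EOPNOTSUPP && barrier_done) {
		/* The device rejected the barrier: disable barriers
		 * for this journal and retry as a plain write. */
		journal->j_flags &= ~JFS_BARRIER;
		set_buffer_uptodate(bh);
		set_buffer_dirty(bh);
		ret = sync_dirty_buffer(bh);
	}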

--Stephen

* Re: [PATCH] Change ll_rw_block() calls in JBD
  2006-05-19  1:30     ` Jan Kara
@ 2006-05-19 12:33       ` Zoltan Menyhart
  2006-05-19 15:05         ` Stephen C. Tweedie
  0 siblings, 1 reply; 19+ messages in thread
From: Zoltan Menyhart @ 2006-05-19 12:33 UTC (permalink / raw)
  To: Jan Kara; +Cc: sct, linux-kernel

Jan Kara wrote:

>>>+			if (!buffer_jbd(bh) || jh->b_jlist != BJ_SyncData) {
>>
>>Who (else) can take the journal head away and remove our "jh" from the
>>sync data list?
> 
>   For both of the above comments: under memory pressure data buffers can
> be written out earlier and then released by __journal_try_to_free_buffer(),
> as they are not dirty any more. The above checks protect us against this.

Assume "bh" has been set free in the mean time.
Assume it is now used for another transaction (maybe for another file system).

The first part of the test should verify not only if "bh" is used for _any_
journal head but if it is exactly for our current one:

	if (buffer_jbd(bh) != jh || ...

Thanks,

Zoltan

* Re: [PATCH] Change ll_rw_block() calls in JBD
  2006-05-19 12:33       ` Zoltan Menyhart
@ 2006-05-19 15:05         ` Stephen C. Tweedie
  0 siblings, 0 replies; 19+ messages in thread
From: Stephen C. Tweedie @ 2006-05-19 15:05 UTC (permalink / raw)
  To: Zoltan Menyhart; +Cc: Jan Kara, linux-kernel, Stephen Tweedie

Hi,

On Fri, 2006-05-19 at 14:33 +0200, Zoltan Menyhart wrote:

> >   For both of the above comments: under memory pressure data buffers can
> > be written out earlier and then released by __journal_try_to_free_buffer(),
> > as they are not dirty any more. The above checks protect us against this.
> 
> Assume "bh" has been set free in the mean time.
> Assume it is now used for another transaction (maybe for another file system).

I don't follow --- we already have a refcount on the bh, how can it be
reused?  We took the j_list_lock first, then looked up the jh on this
transaction's sync data list, then did the get_bh() without dropping
j_list_lock; so by the time we take the refcount, the j_list_lock
ensures it cannot have come off this transaction.  And subsequently, the
refcount prevents reuse.
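
Schematically (a sketch of that ordering, not the exact patch code):

	spin_lock(&journal->j_list_lock);
	jh = commit_transaction->t_sync_datalist;	/* still on our list */
	bh = jh2bh(jh);
	get_bh(bh);			/* pin before dropping the lock */
	spin_unlock(&journal->j_list_lock);
	lock_buffer(bh);	/* bh can change state, but cannot be reused */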

--Stephen

* Re: [PATCH] Change ll_rw_block() calls in JBD
  2006-05-18 13:45 ` Jan Kara
  2006-05-18 15:11   ` Zoltan Menyhart
@ 2006-05-19 15:06   ` Stephen C. Tweedie
  2006-05-24 17:33     ` Jan Kara
  2006-05-23 16:01   ` Zoltan Menyhart
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Stephen C. Tweedie @ 2006-05-19 15:06 UTC (permalink / raw)
  To: Jan Kara; +Cc: Zoltan Menyhart, linux-kernel, Stephen Tweedie

Hi,

On Thu, 2006-05-18 at 15:45 +0200, Jan Kara wrote:

>   Yes, I'm aware of this problem. Actually I wrote a patch (attached) for it
> some time ago, but as I check current kernels it seems to have been lost
> somewhere along the way. I'll rediff it and submit it again. Thanks for
> spotting the problem.

...
> 
> +               was_dirty = buffer_dirty(bh);
> +               if (was_dirty && test_set_buffer_locked(bh)) {

The way was_dirty is used here seems confusing and hard to read; there
are completely separate code paths for dirty and non-dirty, lockable and
already-locked buffers, with all the paths interleaved to share a few
common bits of locking.  It would be far more readable with any sharable
locking code simply removed to a separate function (such as we already
have for inverted_lock(), for example), and the different dirty/clean
logic laid out separately.  Otherwise the code is littered with 

> +                       if (was_dirty)
> +                               unlock_buffer(bh);

and it's not obvious at any point just what locks are held.

Having said that, it looks like it should work --- it just took more
effort than it should have to check!

--Stephen

* Re: [PATCH] Change ll_rw_block() calls in JBD
  2006-05-18 13:45 ` Jan Kara
  2006-05-18 15:11   ` Zoltan Menyhart
  2006-05-19 15:06   ` Stephen C. Tweedie
@ 2006-05-23 16:01   ` Zoltan Menyhart
  2006-05-24  9:14   ` Zoltan Menyhart
       [not found]   ` <447F13B3.6050505@bull.net>
  4 siblings, 0 replies; 19+ messages in thread
From: Zoltan Menyhart @ 2006-05-23 16:01 UTC (permalink / raw)
  To: Jan Kara; +Cc: sct, linux-kernel

Please have a look at my version of the "write_out_data:" loop.

Thanks,

Zoltan

--- linux-2.6.16.16-save/fs/jbd/commit.c	2006-05-19 15:00:50.000000000 +0200
+++ linux-2.6.16.16/fs/jbd/commit.c	2006-05-23 17:44:46.000000000 +0200
@@ -161,6 +161,104 @@
 }
 
 /*
+ * Flush data from the "->t_sync_datalist" of the committing transaction.
+ */
+static void write_out_sync_data(journal_t *journal,
+					transaction_t *commit_transaction)
+{
+	struct journal_head *jh;
+	struct buffer_head *bh;
+	int err = 0;
+
+	/*
+	 * Whenever we unlock the journal and sleep, things can get removed
+	 * from "->t_sync_datalist" by "journal_dirty_data()", so we have
+	 * to keep looping back to write_out_data until we *know* that the
+	 * list is empty.
+	 *
+	 * Cleanup any flushed data buffers from the data list.  Even in
+	 * abort mode, we want to flush this out as soon as possible.
+	 */
+write_out_data:
+	cond_resched();
+	spin_lock(&journal->j_list_lock);
+	while ((jh = commit_transaction->t_sync_datalist) != NULL){
+		bh = jh2bh(jh);
+		if (buffer_locked(bh)){		/* Unsafe */
+			BUFFER_TRACE(bh, "locked");
+			if (!inverted_lock(journal, bh)) /* jbd_lock_bh_state */
+				goto write_out_data;
+			/* "bh" may have been unlocked in the mean time */
+			__journal_file_buffer(jh, commit_transaction, BJ_Locked);
+			jbd_unlock_bh_state(bh);
+		} else {
+			/* "bh" may have become locked in the mean time */
+			if (buffer_dirty(bh)){	/* Unsafe */
+				if (test_set_buffer_locked(bh)){
+					BUFFER_TRACE(bh, "locked");
+					continue; /* Put on the BJ_Locked list */
+				}
+				if (test_clear_buffer_dirty(bh)){
+					BUFFER_TRACE(bh, "start journal wr");
+					get_bh(bh);
+					bh->b_end_io = end_buffer_write_sync;
+					submit_bh(WRITE, bh);
+					continue; /* Put on the BJ_Locked list */
+				}
+				unlock_buffer(bh);
+			} else {
+				/* "bh" may have become dirty in the mean time */
+				/* Just do nothing for this transaction */
+			}
+			BUFFER_TRACE(bh, "writeout complete: unfile");
+			if (!inverted_lock(journal, bh)) /* jbd_lock_bh_state */
+				goto write_out_data;
+			__journal_unfile_buffer(jh);
+			jbd_unlock_bh_state(bh);
+			journal_remove_journal_head(bh);
+			put_bh(bh);
+		}
+		if (lock_need_resched(&journal->j_list_lock)){
+			spin_unlock(&journal->j_list_lock);
+			goto write_out_data;
+		}
+	}
+	/*
+	 * Wait for all previously submitted IO to complete.
+	 */
+	while (commit_transaction->t_locked_list) {
+		jh = commit_transaction->t_locked_list->b_tprev;
+		bh = jh2bh(jh);
+		get_bh(bh);
+		if (buffer_locked(bh)) {
+			spin_unlock(&journal->j_list_lock);
+			wait_on_buffer(bh);
+			if (unlikely(!buffer_uptodate(bh)))
+				err = -EIO;
+			spin_lock(&journal->j_list_lock);
+		}
+		if (!inverted_lock(journal, bh)) {
+			put_bh(bh);
+			spin_lock(&journal->j_list_lock);
+			continue;
+		}
+		if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
+			__journal_unfile_buffer(jh);
+			jbd_unlock_bh_state(bh);
+			journal_remove_journal_head(bh);
+			put_bh(bh);
+		} else {
+			jbd_unlock_bh_state(bh);
+		}
+		put_bh(bh);
+		cond_resched_lock(&journal->j_list_lock);
+	}
+	spin_unlock(&journal->j_list_lock);
+	if (err)
+		__journal_abort_hard(journal);
+}
+
+/*
  * journal_commit_transaction
  *
  * The primary function for committing a transaction to the log.  This
@@ -173,7 +271,7 @@
 	struct buffer_head **wbuf = journal->j_wbuf;
 	int bufs;
 	int flags;
-	int err;
+	int err = 0;
 	unsigned long blocknr;
 	char *tagp = NULL;
 	journal_header_t *header;
@@ -313,113 +411,7 @@
 	 * Now start flushing things to disk, in the order they appear
 	 * on the transaction lists.  Data blocks go first.
 	 */
-
-	err = 0;
-	/*
-	 * Whenever we unlock the journal and sleep, things can get added
-	 * onto ->t_sync_datalist, so we have to keep looping back to
-	 * write_out_data until we *know* that the list is empty.
-	 */
-	bufs = 0;
-	/*
-	 * Cleanup any flushed data buffers from the data list.  Even in
-	 * abort mode, we want to flush this out as soon as possible.
-	 */
-write_out_data:
-	cond_resched();
-	spin_lock(&journal->j_list_lock);
-
-	while (commit_transaction->t_sync_datalist) {
-		struct buffer_head *bh;
-
-		jh = commit_transaction->t_sync_datalist;
-		commit_transaction->t_sync_datalist = jh->b_tnext;
-		bh = jh2bh(jh);
-		if (buffer_locked(bh)) {
-			BUFFER_TRACE(bh, "locked");
-			if (!inverted_lock(journal, bh))
-				goto write_out_data;
-			__journal_temp_unlink_buffer(jh);
-			__journal_file_buffer(jh, commit_transaction,
-						BJ_Locked);
-			jbd_unlock_bh_state(bh);
-			if (lock_need_resched(&journal->j_list_lock)) {
-				spin_unlock(&journal->j_list_lock);
-				goto write_out_data;
-			}
-		} else {
-			if (buffer_dirty(bh)) {
-				BUFFER_TRACE(bh, "start journal writeout");
-				get_bh(bh);
-				wbuf[bufs++] = bh;
-				if (bufs == journal->j_wbufsize) {
-					jbd_debug(2, "submit %d writes\n",
-							bufs);
-					spin_unlock(&journal->j_list_lock);
-					ll_rw_block(SWRITE, bufs, wbuf);
-					journal_brelse_array(wbuf, bufs);
-					bufs = 0;
-					goto write_out_data;
-				}
-			} else {
-				BUFFER_TRACE(bh, "writeout complete: unfile");
-				if (!inverted_lock(journal, bh))
-					goto write_out_data;
-				__journal_unfile_buffer(jh);
-				jbd_unlock_bh_state(bh);
-				journal_remove_journal_head(bh);
-				put_bh(bh);
-				if (lock_need_resched(&journal->j_list_lock)) {
-					spin_unlock(&journal->j_list_lock);
-					goto write_out_data;
-				}
-			}
-		}
-	}
-
-	if (bufs) {
-		spin_unlock(&journal->j_list_lock);
-		ll_rw_block(SWRITE, bufs, wbuf);
-		journal_brelse_array(wbuf, bufs);
-		spin_lock(&journal->j_list_lock);
-	}
-
-	/*
-	 * Wait for all previously submitted IO to complete.
-	 */
-	while (commit_transaction->t_locked_list) {
-		struct buffer_head *bh;
-
-		jh = commit_transaction->t_locked_list->b_tprev;
-		bh = jh2bh(jh);
-		get_bh(bh);
-		if (buffer_locked(bh)) {
-			spin_unlock(&journal->j_list_lock);
-			wait_on_buffer(bh);
-			if (unlikely(!buffer_uptodate(bh)))
-				err = -EIO;
-			spin_lock(&journal->j_list_lock);
-		}
-		if (!inverted_lock(journal, bh)) {
-			put_bh(bh);
-			spin_lock(&journal->j_list_lock);
-			continue;
-		}
-		if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
-			__journal_unfile_buffer(jh);
-			jbd_unlock_bh_state(bh);
-			journal_remove_journal_head(bh);
-			put_bh(bh);
-		} else {
-			jbd_unlock_bh_state(bh);
-		}
-		put_bh(bh);
-		cond_resched_lock(&journal->j_list_lock);
-	}
-	spin_unlock(&journal->j_list_lock);
-
-	if (err)
-		__journal_abort_hard(journal);
+	write_out_sync_data(journal, commit_transaction);
 
 	journal_write_revoke_records(journal, commit_transaction);
 


* Re: [PATCH] Change ll_rw_block() calls in JBD
  2006-05-18 13:45 ` Jan Kara
                     ` (2 preceding siblings ...)
  2006-05-23 16:01   ` Zoltan Menyhart
@ 2006-05-24  9:14   ` Zoltan Menyhart
  2006-05-24 17:18     ` Jan Kara
       [not found]   ` <447F13B3.6050505@bull.net>
  4 siblings, 1 reply; 19+ messages in thread
From: Zoltan Menyhart @ 2006-05-24  9:14 UTC (permalink / raw)
  To: Jan Kara; +Cc: sct, linux-kernel

Yesterday I forgot a small "spin_unlock(&journal->j_list_lock);".

Thanks,

Zoltan

--- linux-2.6.16.16-save/fs/jbd/commit.c	2006-05-19 15:00:50.000000000 +0200
+++ linux-2.6.16.16/fs/jbd/commit.c	2006-05-24 10:43:32.000000000 +0200
@@ -161,6 +161,107 @@
 }
 
 /*
+ * Flush data from the "->t_sync_datalist" of the committing transaction.
+ */
+static void write_out_sync_data(journal_t *journal,
+					transaction_t *commit_transaction)
+{
+	struct journal_head *jh;
+	struct buffer_head *bh;
+	int err = 0;
+
+	/*
+	 * Whenever we unlock the journal and sleep, things can get removed
+	 * from "->t_sync_datalist" by "journal_dirty_data()", so we have
+	 * to keep looping back to write_out_data until we *know* that the
+	 * list is empty.
+	 *
+	 * Cleanup any flushed data buffers from the data list.  Even in
+	 * abort mode, we want to flush this out as soon as possible.
+	 */
+write_out_data:
+	cond_resched();
+	spin_lock(&journal->j_list_lock);
+	while ((jh = commit_transaction->t_sync_datalist) != NULL){
+		bh = jh2bh(jh);
+		if (buffer_locked(bh)){		/* Unsafe */
+			BUFFER_TRACE(bh, "locked");
+			if (!inverted_lock(journal, bh)) /* jbd_lock_bh_state */
+				goto write_out_data;
+			/* "bh" may have been unlocked in the mean time */
+			__journal_file_buffer(jh, commit_transaction, BJ_Locked);
+			jbd_unlock_bh_state(bh);
+		} else {
+			/* "bh" may have become locked in the mean time */
+			if (buffer_dirty(bh)){	/* Unsafe */
+				if (test_set_buffer_locked(bh)){
+					BUFFER_TRACE(bh, "locked");
+					/* Put it on the BJ_Locked list */
+					continue;
+				}
+				if (test_clear_buffer_dirty(bh)){
+					BUFFER_TRACE(bh, "start journal wr");
+					spin_unlock(&journal->j_list_lock);
+					get_bh(bh);
+					bh->b_end_io = end_buffer_write_sync;
+					submit_bh(WRITE, bh);
+					/* Put it on the BJ_Locked list */
+					goto write_out_data;
+				}
+				unlock_buffer(bh);
+			} else {
+				/* "bh" may have become dirty in the mean time */
+				/* Just do nothing for this transaction */
+			}
+			BUFFER_TRACE(bh, "writeout complete: unfile");
+			if (!inverted_lock(journal, bh)) /* jbd_lock_bh_state */
+				goto write_out_data;
+			__journal_unfile_buffer(jh);
+			jbd_unlock_bh_state(bh);
+			journal_remove_journal_head(bh);
+			put_bh(bh);
+		}
+		if (lock_need_resched(&journal->j_list_lock)){
+			spin_unlock(&journal->j_list_lock);
+			goto write_out_data;
+		}
+	}
+	/*
+	 * Wait for all previously submitted IO to complete.
+	 */
+	while (commit_transaction->t_locked_list) {
+		jh = commit_transaction->t_locked_list->b_tprev;
+		bh = jh2bh(jh);
+		get_bh(bh);
+		if (buffer_locked(bh)) {
+			spin_unlock(&journal->j_list_lock);
+			wait_on_buffer(bh);
+			if (unlikely(!buffer_uptodate(bh)))
+				err = -EIO;
+			spin_lock(&journal->j_list_lock);
+		}
+		if (!inverted_lock(journal, bh)) {
+			put_bh(bh);
+			spin_lock(&journal->j_list_lock);
+			continue;
+		}
+		if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
+			__journal_unfile_buffer(jh);
+			jbd_unlock_bh_state(bh);
+			journal_remove_journal_head(bh);
+			put_bh(bh);
+		} else {
+			jbd_unlock_bh_state(bh);
+		}
+		put_bh(bh);
+		cond_resched_lock(&journal->j_list_lock);
+	}
+	spin_unlock(&journal->j_list_lock);
+	if (err)
+		__journal_abort_hard(journal);
+}
+
+/*
  * journal_commit_transaction
  *
  * The primary function for committing a transaction to the log.  This
@@ -173,7 +274,7 @@
 	struct buffer_head **wbuf = journal->j_wbuf;
 	int bufs;
 	int flags;
-	int err;
+	int err = 0;
 	unsigned long blocknr;
 	char *tagp = NULL;
 	journal_header_t *header;
@@ -313,113 +414,7 @@
 	 * Now start flushing things to disk, in the order they appear
 	 * on the transaction lists.  Data blocks go first.
 	 */
-
-	err = 0;
-	/*
-	 * Whenever we unlock the journal and sleep, things can get added
-	 * onto ->t_sync_datalist, so we have to keep looping back to
-	 * write_out_data until we *know* that the list is empty.
-	 */
-	bufs = 0;
-	/*
-	 * Cleanup any flushed data buffers from the data list.  Even in
-	 * abort mode, we want to flush this out as soon as possible.
-	 */
-write_out_data:
-	cond_resched();
-	spin_lock(&journal->j_list_lock);
-
-	while (commit_transaction->t_sync_datalist) {
-		struct buffer_head *bh;
-
-		jh = commit_transaction->t_sync_datalist;
-		commit_transaction->t_sync_datalist = jh->b_tnext;
-		bh = jh2bh(jh);
-		if (buffer_locked(bh)) {
-			BUFFER_TRACE(bh, "locked");
-			if (!inverted_lock(journal, bh))
-				goto write_out_data;
-			__journal_temp_unlink_buffer(jh);
-			__journal_file_buffer(jh, commit_transaction,
-						BJ_Locked);
-			jbd_unlock_bh_state(bh);
-			if (lock_need_resched(&journal->j_list_lock)) {
-				spin_unlock(&journal->j_list_lock);
-				goto write_out_data;
-			}
-		} else {
-			if (buffer_dirty(bh)) {
-				BUFFER_TRACE(bh, "start journal writeout");
-				get_bh(bh);
-				wbuf[bufs++] = bh;
-				if (bufs == journal->j_wbufsize) {
-					jbd_debug(2, "submit %d writes\n",
-							bufs);
-					spin_unlock(&journal->j_list_lock);
-					ll_rw_block(SWRITE, bufs, wbuf);
-					journal_brelse_array(wbuf, bufs);
-					bufs = 0;
-					goto write_out_data;
-				}
-			} else {
-				BUFFER_TRACE(bh, "writeout complete: unfile");
-				if (!inverted_lock(journal, bh))
-					goto write_out_data;
-				__journal_unfile_buffer(jh);
-				jbd_unlock_bh_state(bh);
-				journal_remove_journal_head(bh);
-				put_bh(bh);
-				if (lock_need_resched(&journal->j_list_lock)) {
-					spin_unlock(&journal->j_list_lock);
-					goto write_out_data;
-				}
-			}
-		}
-	}
-
-	if (bufs) {
-		spin_unlock(&journal->j_list_lock);
-		ll_rw_block(SWRITE, bufs, wbuf);
-		journal_brelse_array(wbuf, bufs);
-		spin_lock(&journal->j_list_lock);
-	}
-
-	/*
-	 * Wait for all previously submitted IO to complete.
-	 */
-	while (commit_transaction->t_locked_list) {
-		struct buffer_head *bh;
-
-		jh = commit_transaction->t_locked_list->b_tprev;
-		bh = jh2bh(jh);
-		get_bh(bh);
-		if (buffer_locked(bh)) {
-			spin_unlock(&journal->j_list_lock);
-			wait_on_buffer(bh);
-			if (unlikely(!buffer_uptodate(bh)))
-				err = -EIO;
-			spin_lock(&journal->j_list_lock);
-		}
-		if (!inverted_lock(journal, bh)) {
-			put_bh(bh);
-			spin_lock(&journal->j_list_lock);
-			continue;
-		}
-		if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
-			__journal_unfile_buffer(jh);
-			jbd_unlock_bh_state(bh);
-			journal_remove_journal_head(bh);
-			put_bh(bh);
-		} else {
-			jbd_unlock_bh_state(bh);
-		}
-		put_bh(bh);
-		cond_resched_lock(&journal->j_list_lock);
-	}
-	spin_unlock(&journal->j_list_lock);
-
-	if (err)
-		__journal_abort_hard(journal);
+	write_out_sync_data(journal, commit_transaction);
 
 	journal_write_revoke_records(journal, commit_transaction);
 



* Re: [PATCH] Change ll_rw_block() calls in JBD
  2006-05-24  9:14   ` Zoltan Menyhart
@ 2006-05-24 17:18     ` Jan Kara
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Kara @ 2006-05-24 17:18 UTC (permalink / raw)
  To: Zoltan Menyhart; +Cc: sct, linux-kernel

  Hello,

  I tested my version of the patch yesterday, so I'll post it in
another mail... Anyway, I have two objections to your version:
  1) It does not solve the original problem, namely that a 'locked'
     buffer does not mean 'a buffer being written'. You cannot just
     refile such buffers; you really have to check the dirty bit...
  2) You drop j_list_lock after writing a single buffer. That may
     impact performance significantly (originally we dropped it only
     when writing a whole batch of buffers).

								Honza

> --- linux-2.6.16.16-save/fs/jbd/commit.c	2006-05-19 15:00:50.000000000 +0200
> +++ linux-2.6.16.16/fs/jbd/commit.c	2006-05-24 10:43:32.000000000 +0200
> @@ -161,6 +161,107 @@
> }
> 
> /*
> + * Flush data from the "->t_sync_datalist" of the committing transaction.
> + */
> +static void write_out_sync_data(journal_t *journal,
> +					transaction_t *commit_transaction)
> +{
> +	struct journal_head *jh;
> +	struct buffer_head *bh;
> +	int err = 0;
> +
> +	/*
> +	 * Whenever we unlock the journal and sleep, things can get removed
> +	 * from "->t_sync_datalist" by "journal_dirty_data()", so we have
> +	 * to keep looping back to write_out_data until we *know* that the
> +	 * list is empty.
> +	 *
> +	 * Cleanup any flushed data buffers from the data list.  Even in
> +	 * abort mode, we want to flush this out as soon as possible.
> +	 */
> +write_out_data:
> +	cond_resched();
> +	spin_lock(&journal->j_list_lock);
> +	while ((jh = commit_transaction->t_sync_datalist) != NULL){
> +		bh = jh2bh(jh);
> +		if (buffer_locked(bh)){		/* Unsafe */
> +			BUFFER_TRACE(bh, "locked");
> +		if (!inverted_lock(journal, bh)) /* jbd_lock_bh_state */
> +				goto write_out_data;
> +			/* "bh" may have been unlocked in the mean time */
> +		__journal_file_buffer(jh, commit_transaction, BJ_Locked);
> +			jbd_unlock_bh_state(bh);
> +		} else {
> +			/* "bh" may have become locked in the mean time */
> +			if (buffer_dirty(bh)){	/* Unsafe */
> +				if (test_set_buffer_locked(bh)){
> +					BUFFER_TRACE(bh, "locked");
> +					/* Put it on the BJ_Locked list */
> +					continue;
> +				}
> +				if (test_clear_buffer_dirty(bh)){
> +					BUFFER_TRACE(bh, "start journal wr");
> +					spin_unlock(&journal->j_list_lock);
> +					get_bh(bh);
> +					bh->b_end_io = end_buffer_write_sync;
> +					submit_bh(WRITE, bh);
> +					/* Put it on the BJ_Locked list */
> +					goto write_out_data;
> +				}
> +				unlock_buffer(bh);
> +			} else {
> +			/* "bh" may have become dirty in the mean time */
> +				/* Just do nothing for this transaction */
> +			}
> +			BUFFER_TRACE(bh, "writeout complete: unfile");
> +		if (!inverted_lock(journal, bh)) /* jbd_lock_bh_state */
> +				goto write_out_data;
> +			__journal_unfile_buffer(jh);
> +			jbd_unlock_bh_state(bh);
> +			journal_remove_journal_head(bh);
> +			put_bh(bh);
> +		}
> +		if (lock_need_resched(&journal->j_list_lock)){
> +			spin_unlock(&journal->j_list_lock);
> +			goto write_out_data;
> +		}
> +	}
> +	/*
> +	 * Wait for all previously submitted IO to complete.
> +	 */
> +	while (commit_transaction->t_locked_list) {
> +		jh = commit_transaction->t_locked_list->b_tprev;
> +		bh = jh2bh(jh);
> +		get_bh(bh);
> +		if (buffer_locked(bh)) {
> +			spin_unlock(&journal->j_list_lock);
> +			wait_on_buffer(bh);
> +			if (unlikely(!buffer_uptodate(bh)))
> +				err = -EIO;
> +			spin_lock(&journal->j_list_lock);
> +		}
> +		if (!inverted_lock(journal, bh)) {
> +			put_bh(bh);
> +			spin_lock(&journal->j_list_lock);
> +			continue;
> +		}
> +		if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
> +			__journal_unfile_buffer(jh);
> +			jbd_unlock_bh_state(bh);
> +			journal_remove_journal_head(bh);
> +			put_bh(bh);
> +		} else {
> +			jbd_unlock_bh_state(bh);
> +		}
> +		put_bh(bh);
> +		cond_resched_lock(&journal->j_list_lock);
> +	}
> +	spin_unlock(&journal->j_list_lock);
> +	if (err)
> +		__journal_abort_hard(journal);
> +}
> +
> +/*
>  * journal_commit_transaction
>  *
>  * The primary function for committing a transaction to the log.  This
> @@ -173,7 +274,7 @@
> 	struct buffer_head **wbuf = journal->j_wbuf;
> 	int bufs;
> 	int flags;
> -	int err;
> +	int err = 0;
> 	unsigned long blocknr;
> 	char *tagp = NULL;
> 	journal_header_t *header;
> @@ -313,113 +414,7 @@
> 	 * Now start flushing things to disk, in the order they appear
> 	 * on the transaction lists.  Data blocks go first.
> 	 */
> -
> -	err = 0;
> -	/*
> -	 * Whenever we unlock the journal and sleep, things can get added
> -	 * onto ->t_sync_datalist, so we have to keep looping back to
> -	 * write_out_data until we *know* that the list is empty.
> -	 */
> -	bufs = 0;
> -	/*
> -	 * Cleanup any flushed data buffers from the data list.  Even in
> -	 * abort mode, we want to flush this out as soon as possible.
> -	 */
> -write_out_data:
> -	cond_resched();
> -	spin_lock(&journal->j_list_lock);
> -
> -	while (commit_transaction->t_sync_datalist) {
> -		struct buffer_head *bh;
> -
> -		jh = commit_transaction->t_sync_datalist;
> -		commit_transaction->t_sync_datalist = jh->b_tnext;
> -		bh = jh2bh(jh);
> -		if (buffer_locked(bh)) {
> -			BUFFER_TRACE(bh, "locked");
> -			if (!inverted_lock(journal, bh))
> -				goto write_out_data;
> -			__journal_temp_unlink_buffer(jh);
> -			__journal_file_buffer(jh, commit_transaction,
> -						BJ_Locked);
> -			jbd_unlock_bh_state(bh);
> -			if (lock_need_resched(&journal->j_list_lock)) {
> -				spin_unlock(&journal->j_list_lock);
> -				goto write_out_data;
> -			}
> -		} else {
> -			if (buffer_dirty(bh)) {
> -				BUFFER_TRACE(bh, "start journal writeout");
> -				get_bh(bh);
> -				wbuf[bufs++] = bh;
> -				if (bufs == journal->j_wbufsize) {
> -					jbd_debug(2, "submit %d writes\n",
> -							bufs);
> -					spin_unlock(&journal->j_list_lock);
> -					ll_rw_block(SWRITE, bufs, wbuf);
> -					journal_brelse_array(wbuf, bufs);
> -					bufs = 0;
> -					goto write_out_data;
> -				}
> -			} else {
> -				BUFFER_TRACE(bh, "writeout complete: unfile");
> -				if (!inverted_lock(journal, bh))
> -					goto write_out_data;
> -				__journal_unfile_buffer(jh);
> -				jbd_unlock_bh_state(bh);
> -				journal_remove_journal_head(bh);
> -				put_bh(bh);
> -				if (lock_need_resched(&journal->j_list_lock)) {
> -					spin_unlock(&journal->j_list_lock);
> -					goto write_out_data;
> -				}
> -			}
> -		}
> -	}
> -
> -	if (bufs) {
> -		spin_unlock(&journal->j_list_lock);
> -		ll_rw_block(SWRITE, bufs, wbuf);
> -		journal_brelse_array(wbuf, bufs);
> -		spin_lock(&journal->j_list_lock);
> -	}
> -
> -	/*
> -	 * Wait for all previously submitted IO to complete.
> -	 */
> -	while (commit_transaction->t_locked_list) {
> -		struct buffer_head *bh;
> -
> -		jh = commit_transaction->t_locked_list->b_tprev;
> -		bh = jh2bh(jh);
> -		get_bh(bh);
> -		if (buffer_locked(bh)) {
> -			spin_unlock(&journal->j_list_lock);
> -			wait_on_buffer(bh);
> -			if (unlikely(!buffer_uptodate(bh)))
> -				err = -EIO;
> -			spin_lock(&journal->j_list_lock);
> -		}
> -		if (!inverted_lock(journal, bh)) {
> -			put_bh(bh);
> -			spin_lock(&journal->j_list_lock);
> -			continue;
> -		}
> -		if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
> -			__journal_unfile_buffer(jh);
> -			jbd_unlock_bh_state(bh);
> -			journal_remove_journal_head(bh);
> -			put_bh(bh);
> -		} else {
> -			jbd_unlock_bh_state(bh);
> -		}
> -		put_bh(bh);
> -		cond_resched_lock(&journal->j_list_lock);
> -	}
> -	spin_unlock(&journal->j_list_lock);
> -
> -	if (err)
> -		__journal_abort_hard(journal);
> +	write_out_sync_data(journal, commit_transaction);
> 
> 	journal_write_revoke_records(journal, commit_transaction);
> 
> 
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs


* Re: [PATCH] Change ll_rw_block() calls in JBD
  2006-05-19 15:06   ` Stephen C. Tweedie
@ 2006-05-24 17:33     ` Jan Kara
  2006-05-30 15:36       ` Zoltan Menyhart
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2006-05-24 17:33 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Zoltan Menyhart, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1679 bytes --]

  Hello,

> On Thu, 2006-05-18 at 15:45 +0200, Jan Kara wrote:
  <snip>

> The way was_dirty is used here seems confusing and hard to read; there
> are completely separate code paths for dirty and non-dirty, lockable and
> already-locked buffers, with all the paths interleaved to share a few
> common bits of locking.  It would be far more readable with any sharable
> locking code simply removed to a separate function (such as we already
> have for inverted_lock(), for example), and the different dirty/clean
> logic laid out separately.  Otherwise the code is littered with 
> 
> > +                       if (was_dirty)
> > +                               unlock_buffer(bh);
> 
> and it's not obvious at any point just what locks are held.
> 
> Having said that, it looks like it should work --- it just took more
> effort than it should have to check!
  Thanks for the comments. I have written a new patch that tries to address
both your and Zoltan's comments. I've also fixed a bug caused by calling
submit_bh() under j_list_lock (submit_bh() can sleep when allocating the
bio). That actually required some more changes, because I don't want to
drop the j_list_lock for writing each buffer, but I also want to keep the
buffer locked so that I can immediately refile it to the BJ_Locked list.
If I just released the buffer lock, I would have to keep the buffer on the
BJ_SyncData list, and I'm afraid that could cause livelocks when
somebody keeps redirtying the buffer.
  So I'm not totally convinced this is the right way of doing things.
Anyway please have a look at the code and tell me what you think about
it. Thanks.

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

[-- Attachment #2: jbd-2.6.17-rc3-1-orderedwrite.diff --]
[-- Type: text/plain, Size: 6635 bytes --]

  The old code assumed that when a buffer is locked it is being
written, but that need not always be true. Hence we have to be more
careful when processing ordered data buffers.
  If a buffer is not dirty, we know that the write has already started
(and may even have completed already), and hence just refiling the buffer
to t_locked_list (or unfiling it completely in case the IO has finished)
is enough. If the buffer is dirty, we have to acquire the buffer lock
and do the write. This gets a bit tricky, as submitting a buffer for IO
may block, but we don't want to drop j_list_lock too often.

Signed-off-by: Jan Kara <jack@suse.cz>

diff -rupX /home/jack/.kerndiffexclude linux-2.6.16.orig/fs/jbd/commit.c linux-2.6.16-2-realloc_freed_fix/fs/jbd/commit.c
--- linux-2.6.16.orig/fs/jbd/commit.c	2006-01-28 08:59:58.000000000 +0100
+++ linux-2.6.16-2-realloc_freed_fix/fs/jbd/commit.c	2006-05-25 02:46:04.000000000 +0200
@@ -160,6 +160,117 @@ static int journal_write_commit_record(j
 	return (ret == -EIO);
 }
 
+static void submit_buffers(int bufs, struct buffer_head **wbuf)
+{
+	int i;
+
+	jbd_debug(2, "submit %d writes\n", bufs);
+	for (i = 0; i < bufs; i++) {
+		wbuf[i]->b_end_io = end_buffer_write_sync;
+		submit_bh(WRITE, wbuf[i]);
+	}
+}
+
+/*
+ *  Submit all the data buffers to disk
+ */
+static void journal_submit_data_buffers(journal_t *journal,
+				transaction_t *commit_transaction)
+{
+	struct journal_head *jh;
+	struct buffer_head *bh;
+	struct buffer_head **wbuf = journal->j_wbuf;
+	int bufs = 0;
+
+	/*
+	 * Whenever we unlock the journal and sleep, things can get added
+	 * onto ->t_sync_datalist, so we have to keep looping back to
+	 * write_out_data until we *know* that the list is empty.
+	 *
+	 * Cleanup any flushed data buffers from the data list.  Even in
+	 * abort mode, we want to flush this out as soon as possible.
+	 */
+write_out_data:
+	/* Submit all the buffers we locked so far. We had to release
+	 * j_list_lock anyway so there is no point in not sending all
+	 * those buffers for IO. */
+	submit_buffers(bufs, wbuf);
+	bufs = 0;
+	cond_resched();
+	spin_lock(&journal->j_list_lock);
+
+	while (commit_transaction->t_sync_datalist) {
+		jh = commit_transaction->t_sync_datalist;
+		bh = jh2bh(jh);
+
+		/* If the buffer is dirty, we need to submit IO and hence
+		 * we need the buffer lock. We try to lock the buffer without
+		 * blocking. If we fail, we need to drop j_list_lock and do
+		 * blocking lock_buffer().
+		 */
+		if (buffer_dirty(bh)) {
+			if (test_set_buffer_locked(bh)) {
+				BUFFER_TRACE(bh, "needs blocking lock");
+				get_bh(bh);
+				spin_unlock(&journal->j_list_lock);
+				/* Submit all accumulated buffers to prevent
+				 * possible deadlocks */
+				submit_buffers(bufs, wbuf);
+				bufs = 0;
+				lock_buffer(bh);
+				spin_lock(&journal->j_list_lock);
+				/* Someone already cleaned up the buffer? */
+				if (!buffer_jbd(bh)
+					|| jh->b_jlist != BJ_SyncData) {
+					unlock_buffer(bh);
+					BUFFER_TRACE(bh, "already cleaned up");
+					put_bh(bh);
+					continue;
+				}
+				put_bh(bh);
+			}
+			if (test_clear_buffer_dirty(bh)) {
+				BUFFER_TRACE(bh, "needs writeout, submitting");
+				get_bh(bh);
+				wbuf[bufs++] = bh;
+				if (bufs == journal->j_wbufsize) {
+					spin_unlock(&journal->j_list_lock);
+					/* Writeout will be done at the
+					 * beginning of the loop */
+					goto write_out_data;
+				}
+			}
+			else
+				unlock_buffer(bh);
+		}
+		/* Now we are sure the buffer has been submitted for IO if it
+		 * was needed. The only thing left is to put it on the
+		 * correct list. */
+		if (!inverted_lock(journal, bh))
+			goto write_out_data;
+		if (buffer_locked(bh)) {
+			__journal_file_buffer(jh, commit_transaction,
+						BJ_Locked);
+			jbd_unlock_bh_state(bh);
+		}
+		else {
+			BUFFER_TRACE(bh, "writeout complete: unfile");
+			__journal_unfile_buffer(jh);
+			jbd_unlock_bh_state(bh);
+			journal_remove_journal_head(bh);
+			put_bh(bh);
+		}
+
+		if (lock_need_resched(&journal->j_list_lock)) {
+			spin_unlock(&journal->j_list_lock);
+			goto write_out_data;
+		}
+	}
+	spin_unlock(&journal->j_list_lock);
+	/* Submit the rest of buffers */
+	submit_buffers(bufs, wbuf);
+}
+
 /*
  * journal_commit_transaction
  *
@@ -313,80 +424,13 @@ void journal_commit_transaction(journal_
 	 * Now start flushing things to disk, in the order they appear
 	 * on the transaction lists.  Data blocks go first.
 	 */
-
 	err = 0;
-	/*
-	 * Whenever we unlock the journal and sleep, things can get added
-	 * onto ->t_sync_datalist, so we have to keep looping back to
-	 * write_out_data until we *know* that the list is empty.
-	 */
-	bufs = 0;
-	/*
-	 * Cleanup any flushed data buffers from the data list.  Even in
-	 * abort mode, we want to flush this out as soon as possible.
-	 */
-write_out_data:
-	cond_resched();
-	spin_lock(&journal->j_list_lock);
-
-	while (commit_transaction->t_sync_datalist) {
-		struct buffer_head *bh;
-
-		jh = commit_transaction->t_sync_datalist;
-		commit_transaction->t_sync_datalist = jh->b_tnext;
-		bh = jh2bh(jh);
-		if (buffer_locked(bh)) {
-			BUFFER_TRACE(bh, "locked");
-			if (!inverted_lock(journal, bh))
-				goto write_out_data;
-			__journal_temp_unlink_buffer(jh);
-			__journal_file_buffer(jh, commit_transaction,
-						BJ_Locked);
-			jbd_unlock_bh_state(bh);
-			if (lock_need_resched(&journal->j_list_lock)) {
-				spin_unlock(&journal->j_list_lock);
-				goto write_out_data;
-			}
-		} else {
-			if (buffer_dirty(bh)) {
-				BUFFER_TRACE(bh, "start journal writeout");
-				get_bh(bh);
-				wbuf[bufs++] = bh;
-				if (bufs == journal->j_wbufsize) {
-					jbd_debug(2, "submit %d writes\n",
-							bufs);
-					spin_unlock(&journal->j_list_lock);
-					ll_rw_block(SWRITE, bufs, wbuf);
-					journal_brelse_array(wbuf, bufs);
-					bufs = 0;
-					goto write_out_data;
-				}
-			} else {
-				BUFFER_TRACE(bh, "writeout complete: unfile");
-				if (!inverted_lock(journal, bh))
-					goto write_out_data;
-				__journal_unfile_buffer(jh);
-				jbd_unlock_bh_state(bh);
-				journal_remove_journal_head(bh);
-				put_bh(bh);
-				if (lock_need_resched(&journal->j_list_lock)) {
-					spin_unlock(&journal->j_list_lock);
-					goto write_out_data;
-				}
-			}
-		}
-	}
-
-	if (bufs) {
-		spin_unlock(&journal->j_list_lock);
-		ll_rw_block(SWRITE, bufs, wbuf);
-		journal_brelse_array(wbuf, bufs);
-		spin_lock(&journal->j_list_lock);
-	}
+	journal_submit_data_buffers(journal, commit_transaction);
 
 	/*
 	 * Wait for all previously submitted IO to complete.
 	 */
+	spin_lock(&journal->j_list_lock);
 	while (commit_transaction->t_locked_list) {
 		struct buffer_head *bh;
 


* Re: [PATCH] Change ll_rw_block() calls in JBD
  2006-05-24 17:33     ` Jan Kara
@ 2006-05-30 15:36       ` Zoltan Menyhart
  2006-05-30 16:40         ` Jan Kara
  0 siblings, 1 reply; 19+ messages in thread
From: Zoltan Menyhart @ 2006-05-30 15:36 UTC (permalink / raw)
  To: Jan Kara; +Cc: Stephen C. Tweedie, linux-kernel

I have got a concern about this code fragment:

> +		if (buffer_dirty(bh)) {
> +			if (test_set_buffer_locked(bh)) {
> +				...
> +				get_bh(bh);
> +				spin_unlock(&journal->j_list_lock);
> +				/* Submit all accumulated buffers to prevent
> +				 * possible deadlocks */
> +				submit_buffers(bufs, wbuf);
> +				bufs = 0;
> +				lock_buffer(bh);
> +				spin_lock(&journal->j_list_lock);
> +				/* Someone already cleaned up the buffer? */
> +				if (!buffer_jbd(bh)
> +					|| jh->b_jlist != BJ_SyncData) {
> +					unlock_buffer(bh);
> +					BUFFER_TRACE(bh, "already cleaned up");
> +					put_bh(bh);
> +					continue;
> +				}
> +				put_bh(bh);
> +			}
> +			if (test_clear_buffer_dirty(bh)) {
> +				BUFFER_TRACE(bh, "needs writeout, submitting");
> +				get_bh(bh);
> +				wbuf[bufs++] = bh;
> +				if (bufs == journal->j_wbufsize) {
> +					spin_unlock(&journal->j_list_lock);
> +					/* Writeout will be done at the
> +					 * beginning of the loop */
> +					goto write_out_data;
> +				}
> +			}

We lock up to "->j_wbufsize" buffers, one after the other.

Originally, we took a buffer, we did something to it, and we released it.
When we wanted two of them and the second one was not available, we
released the first one, too, in order to avoid deadlocks.

Keeping a number of buffers locked for an unpredictably long time...
(Here you keep N-1 buffers locked while you are waiting for the Nth one.)
And not imposing / respecting any locking order...

The original code did not lock the buffers while it was composing the
list of buffers. "ll_rw_block()" locked the buffers one by one
and submitted them to the BIO layer. These buffers eventually got unlocked,
possibly before the last buffers got locked by "ll_rw_block()".

This change implies an important difference in locking behavior.

Regards,

Zoltan


* Re: [PATCH] Change ll_rw_block() calls in JBD
  2006-05-30 15:36       ` Zoltan Menyhart
@ 2006-05-30 16:40         ` Jan Kara
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Kara @ 2006-05-30 16:40 UTC (permalink / raw)
  To: Zoltan Menyhart; +Cc: Stephen C. Tweedie, linux-kernel

  Hello,

> I have got a concern about this code fragment:
> 
> >+		if (buffer_dirty(bh)) {
> >+			if (test_set_buffer_locked(bh)) {
> >+				...
> >+				get_bh(bh);
> >+				spin_unlock(&journal->j_list_lock);
> >+				/* Submit all accumulated buffers to prevent
> >+				 * possible deadlocks */
> >+				submit_buffers(bufs, wbuf);
> >+				bufs = 0;
> >+				lock_buffer(bh);
> >+				spin_lock(&journal->j_list_lock);
> >+				/* Someone already cleaned up the buffer? */
> >+				if (!buffer_jbd(bh)
> >+					|| jh->b_jlist != BJ_SyncData) {
> >+					unlock_buffer(bh);
> >+					BUFFER_TRACE(bh, "already cleaned up");
> >+					put_bh(bh);
> >+					continue;
> >+				}
> >+				put_bh(bh);
> >+			}
> >+			if (test_clear_buffer_dirty(bh)) {
> >+				BUFFER_TRACE(bh, "needs writeout, submitting");
> >+				get_bh(bh);
> >+				wbuf[bufs++] = bh;
> >+				if (bufs == journal->j_wbufsize) {
> >+					spin_unlock(&journal->j_list_lock);
> >+					/* Writeout will be done at the
> >+					 * beginning of the loop */
> >+					goto write_out_data;
> >+				}
> >+			}
> 
> We lock up to "->j_wbufsize" buffers, one after the other.
> 
> Originally, we took a buffer, did something to it, and released it.
> When we wanted two of them and the second one was not available, we
> released the first one, too, in order to avoid deadlocks.
> 
> Here we keep a number of buffers locked for an unpredictably long time
> (N-1 buffers stay locked while we are waiting for the Nth one), without
> imposing or respecting any locking order...
  You are not right here. This is exactly why we call submit_buffers()
before we call lock_buffer(): either we get the lock without waiting, via
test_set_buffer_locked(), or we send all accumulated buffers for IO (and
hence they will eventually get unlocked).
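
In other words, the pattern is roughly this (a condensed sketch of the
patched loop quoted above, not the exact code; submit_buffers() is the
helper from the patch):

	if (test_set_buffer_locked(bh)) {
		/* Try-lock failed: someone else holds the buffer lock.
		 * Before blocking, push out every buffer accumulated so
		 * far; their IO completion unlocks them, so we never
		 * sleep on one buffer lock while sitting on N-1 others. */
		spin_unlock(&journal->j_list_lock);
		submit_buffers(bufs, wbuf);
		bufs = 0;
		lock_buffer(bh);	/* now it is safe to block */
		spin_lock(&journal->j_list_lock);
	}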

> The original code did not lock the buffers while it was composing the
> list of buffers. "ll_rw_block()" locked the buffers one by one
> and submitted them to the block layer. Each buffer eventually got unlocked,
> possibly before the last buffers in the list were locked by "ll_rw_block()".
> 
> This change implies an important difference in locking behavior.
  I agree the locking behaviour has changed a bit, but not as much as you
thought.

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs


* Re: [PATCH] Change ll_rw_block() calls in JBD
       [not found]         ` <20060602134923.GA1644@atrey.karlin.mff.cuni.cz>
@ 2006-06-20 16:33           ` Zoltan Menyhart
  2006-06-21  0:09             ` Jan Kara
  0 siblings, 1 reply; 19+ messages in thread
From: Zoltan Menyhart @ 2006-06-20 16:33 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jan Kara, sct

I have got some crashes due to:

Assertion failure in __journal_file_buffer():
      "jh->b_transaction == transaction || jh->b_transaction == 0"

[<a0000002053b44e0>] __journal_file_buffer+0x420/0x7c0 [jbd]
      r32 : e000000161a1f3e0  jh
      r33 : e00000010396a380  transaction
      r34 : 0000000000000008  jlist == BJ_Locked

*(struct journal_head *) 0xe000000161a1f3e0: // jh
{
 b_bh = 0xe00000048bb36930,
 b_jcount = 0x0,
 b_jlist = 0x1,
 b_modified = 0x0,
 b_frozen_data = 0x0,
 b_committed_data = 0x0,
 b_transaction = 0xe0000020014adb80,	// ->j_running_transaction
 b_next_transaction = 0x0,
 b_tnext = 0xe0000001c17306e0,
 b_tprev = 0xe00000204757e540,
 b_cp_transaction = 0x0,
 b_cpnext = 0x0,
 b_cpprev = 0x0
}

*(struct buffer_head *) 0xe00000048bb36930: // jh->b_bh
{
b_state = 0x8201d,
b_this_page = 0xe00000048bb33d88,
b_page = 0xa07ffffff9201300,
b_count = {
  counter = 0x2
},
b_size = 0x1000,
b_blocknr = 0xadc001,
b_data = 0xe000000492a0e000,
b_bdev = 0xe0000023fe1ca300,
b_end_io = 0xa000000100630be0,
b_private = 0xe000000161a1f3e0,
b_assoc_buffers = {
  next = 0xe00000048bb36978,
  prev = 0xe00000048bb36978
}
}

--- Called from --- :

journal_submit_data_buffers+0x200/0x660 [jbd]
      r32 : e0000001035ec100  journal
      r33 : e00000010396a380  commit_transaction

As you can see, the current "jh" has been stolen for the new
"->j_running_transaction" while we temporarily released "->j_list_lock"
in the middle of "journal_submit_data_buffers()".

Therefore the test "jh->b_jlist != BJ_SyncData", i.e. checking whether
it is still on a (indeed, _any_) sync data list, is not enough.
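
Schematically, my understanding of the race is as follows (simplified;
the refile presumably happens via journal_dirty_data() on the other CPU):

	/*
	 * commit thread                    another thread
	 * -------------                    --------------
	 * spin_unlock(&journal->j_list_lock);
	 * lock_buffer(bh);   // may sleep
	 *                                  refiles jh to the running
	 *                                  transaction; b_jlist stays
	 *                                  BJ_SyncData
	 * spin_lock(&journal->j_list_lock);
	 * // jh->b_jlist == BJ_SyncData still holds, so the old check
	 * // passes, but jh->b_transaction is now the running
	 * // transaction, not commit_transaction.
	 */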

--- linux-2.6.16.20-orig/fs/jbd/commit.c	2006-06-20 17:19:47.000000000 +0200
+++ linux-2.6.16.20/fs/jbd/commit.c	2006-06-20 17:35:54.000000000 +0200
@@ -219,15 +219,26 @@
 				bufs = 0;
 				lock_buffer(bh);
 				spin_lock(&journal->j_list_lock);
+				/* Stolen (e.g. for a new transaction) ? */
+				if (jh != commit_transaction->t_sync_datalist) {
+					unlock_buffer(bh);
+					JBUFFER_TRACE(jh, "stolen sync. data");
+					put_bh(bh);
+					continue;
+				}
 				/* Someone already cleaned up the buffer? */
-				if (!buffer_jbd(bh)
-					|| jh->b_jlist != BJ_SyncData) {
+
+				// Can this happen???
+
+				if (!buffer_jbd(bh)) {
 					unlock_buffer(bh);
 					BUFFER_TRACE(bh, "already cleaned up");
 					put_bh(bh);
 					continue;
 				}
 				put_bh(bh);
+				J_ASSERT_JH(jh, jh->b_transaction ==
+							commit_transaction);
 			}
 			if (test_clear_buffer_dirty(bh)) {
 				BUFFER_TRACE(bh, "needs writeout, submitting");

I am not really sure that the "!buffer_jbd(bh)" test is still useful.
I left it in place so as not to introduce a new bug.
If you can confirm that it is unnecessary, I can remove it.

Thanks,

Zoltan

* Re: [PATCH] Change ll_rw_block() calls in JBD
  2006-06-20 16:33           ` Zoltan Menyhart
@ 2006-06-21  0:09             ` Jan Kara
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Kara @ 2006-06-21  0:09 UTC (permalink / raw)
  To: Zoltan Menyhart; +Cc: linux-kernel, sct

  Hi,

> I have got some crashes due to:
> 
> Assertion failure in __journal_file_buffer():
>      "jh->b_transaction == transaction || jh->b_transaction == 0"
  Yes, I've seen your reports. Thanks for them and sorry for being a bit
unresponsive (I have a lot of other work...).

<snip>
> 
> --- Called from --- :
> 
> journal_submit_data_buffers+0x200/0x660 [jbd]
>      r32 : e0000001035ec100  journal
>      r33 : e00000010396a380  commit_transaction
> 
> As you can see, the current "jh" has been stolen for the new
> "->j_running_transaction" while we released temporarily "->j_list_lock"
> in the middle of "journal_submit_data_buffers()".
  Yes, this seems to be a correct analysis.

> Therefore the test "jh->b_jlist != BJ_SyncData", i.e. if it is still
> on a (_any_) sync. list is not enough.
  Right.

> --- linux-2.6.16.20-orig/fs/jbd/commit.c	2006-06-20 17:19:47.000000000 +0200
> +++ linux-2.6.16.20/fs/jbd/commit.c	2006-06-20 17:35:54.000000000 +0200
> @@ -219,15 +219,26 @@
> 				bufs = 0;
> 				lock_buffer(bh);
> 				spin_lock(&journal->j_list_lock);
> +				/* Stolen (e.g. for a new transaction) ? */
> +				if (jh != commit_transaction->t_sync_datalist) {
> +					unlock_buffer(bh);
> +					JBUFFER_TRACE(jh, "stolen sync. data");
> +					put_bh(bh);
> +					continue;
> +				}
  Yes, this is definitely a safer check, and it should also catch the case
where the jh was released from memory, so buffer_jbd() is not needed any more.

> 				/* Someone already cleaned up the buffer? */
> -				if (!buffer_jbd(bh)
> -					|| jh->b_jlist != BJ_SyncData) {
> +
> +				// Can this happen???
> +
> +				if (!buffer_jbd(bh)) {
> 					unlock_buffer(bh);
> 					BUFFER_TRACE(bh, "already cleaned up");
> 					put_bh(bh);
> 					continue;
> 				}
> 				put_bh(bh);
> +				J_ASSERT_JH(jh, jh->b_transaction ==
> +							commit_transaction);
> 			}
> 			if (test_clear_buffer_dirty(bh)) {
> 				BUFFER_TRACE(bh, "needs writeout, submitting");
> 
> I am not really sure that the test "!buffer_jbd(bh)" is really useful.
> I left it alone for not introducing a new bug.
> If you can confirm that it is not necessary, I can take it away.
  BTW: I've also written another version of the patch using a slightly
different approach. When I have some time I can benchmark the two against
each other to see if there is any difference...

									Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs


* [PATCH] Change ll_rw_block() calls in JBD
@ 2005-07-11 15:52 Jan Kara
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Kara @ 2005-07-11 15:52 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm

[-- Attachment #1: Type: text/plain, Size: 164 bytes --]

  Hi,

  the attached patch changes the ll_rw_block() calls in JBD to make sure
the data really reach the disk.

								Honza

-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

[-- Attachment #2: jbd-2.6.12-2-ll_rw_block-fix.diff --]
[-- Type: text/plain, Size: 2651 bytes --]

We must be sure that the current data in the buffer are sent to disk. Hence
we have to call ll_rw_block() with SWRITE.
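
The difference matters because ll_rw_block() called with plain WRITE
simply skips any buffer it cannot lock immediately, so the current
contents of an already-locked buffer may never be submitted; with SWRITE
it waits for the buffer lock instead. Roughly (a simplified sketch of the
fs/buffer.c write path, with the read path omitted; not the verbatim
source):

	void ll_rw_block(int rw, int nr, struct buffer_head *bhs[])
	{
		int i;

		for (i = 0; i < nr; i++) {
			struct buffer_head *bh = bhs[i];

			if (rw == SWRITE)
				lock_buffer(bh);	/* wait for the buffer lock */
			else if (test_set_buffer_locked(bh))
				continue;		/* WRITE: skip busy buffers */

			if (test_clear_buffer_dirty(bh)) {
				bh->b_end_io = end_buffer_write_sync;
				get_bh(bh);
				submit_bh(WRITE, bh);
				continue;
			}
			unlock_buffer(bh);	/* already clean: nothing to do */
		}
	}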

Signed-off-by: Jan Kara <jack@suse.cz>

diff -rupX /home/jack/.kerndiffexclude linux-2.6.12-1-forgetfix/fs/jbd/checkpoint.c linux-2.6.12-2-ll_rw_block-fix/fs/jbd/checkpoint.c
--- linux-2.6.12-1-forgetfix/fs/jbd/checkpoint.c	2005-06-28 13:26:18.000000000 +0200
+++ linux-2.6.12-2-ll_rw_block-fix/fs/jbd/checkpoint.c	2005-07-07 07:18:47.000000000 +0200
@@ -204,7 +204,7 @@ __flush_batch(journal_t *journal, struct
 	int i;
 
 	spin_unlock(&journal->j_list_lock);
-	ll_rw_block(WRITE, *batch_count, bhs);
+	ll_rw_block(SWRITE, *batch_count, bhs);
 	spin_lock(&journal->j_list_lock);
 	for (i = 0; i < *batch_count; i++) {
 		struct buffer_head *bh = bhs[i];
diff -rupX /home/jack/.kerndiffexclude linux-2.6.12-1-forgetfix/fs/jbd/commit.c linux-2.6.12-2-ll_rw_block-fix/fs/jbd/commit.c
--- linux-2.6.12-1-forgetfix/fs/jbd/commit.c	2005-07-06 01:22:13.000000000 +0200
+++ linux-2.6.12-2-ll_rw_block-fix/fs/jbd/commit.c	2005-07-07 07:18:20.000000000 +0200
@@ -358,7 +358,7 @@ write_out_data:
 					jbd_debug(2, "submit %d writes\n",
 							bufs);
 					spin_unlock(&journal->j_list_lock);
-					ll_rw_block(WRITE, bufs, wbuf);
+					ll_rw_block(SWRITE, bufs, wbuf);
 					journal_brelse_array(wbuf, bufs);
 					bufs = 0;
 					goto write_out_data;
@@ -381,7 +381,7 @@ write_out_data:
 
 	if (bufs) {
 		spin_unlock(&journal->j_list_lock);
-		ll_rw_block(WRITE, bufs, wbuf);
+		ll_rw_block(SWRITE, bufs, wbuf);
 		journal_brelse_array(wbuf, bufs);
 		spin_lock(&journal->j_list_lock);
 	}
diff -rupX /home/jack/.kerndiffexclude linux-2.6.12-1-forgetfix/fs/jbd/journal.c linux-2.6.12-2-ll_rw_block-fix/fs/jbd/journal.c
--- linux-2.6.12-1-forgetfix/fs/jbd/journal.c	2005-06-28 13:26:18.000000000 +0200
+++ linux-2.6.12-2-ll_rw_block-fix/fs/jbd/journal.c	2005-07-07 07:17:11.000000000 +0200
@@ -969,7 +969,7 @@ void journal_update_superblock(journal_t
 	if (wait)
 		sync_dirty_buffer(bh);
 	else
-		ll_rw_block(WRITE, 1, &bh);
+		ll_rw_block(SWRITE, 1, &bh);
 
 out:
 	/* If we have just flushed the log (by marking s_start==0), then
diff -rupX /home/jack/.kerndiffexclude linux-2.6.12-1-forgetfix/fs/jbd/revoke.c linux-2.6.12-2-ll_rw_block-fix/fs/jbd/revoke.c
--- linux-2.6.12-1-forgetfix/fs/jbd/revoke.c	2005-03-03 18:58:29.000000000 +0100
+++ linux-2.6.12-2-ll_rw_block-fix/fs/jbd/revoke.c	2005-07-07 07:12:34.000000000 +0200
@@ -613,7 +613,7 @@ static void flush_descriptor(journal_t *
 	set_buffer_jwrite(bh);
 	BUFFER_TRACE(bh, "write");
 	set_buffer_dirty(bh);
-	ll_rw_block(WRITE, 1, &bh);
+	ll_rw_block(SWRITE, 1, &bh);
 }
 #endif
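
Note that the diff above does not show the definition of SWRITE itself,
nor the corresponding handling inside ll_rw_block(); as far as I know it
was merged along these lines (include/linux/fs.h, around 2.6.13):

	#define SWRITE	3	/* for ll_rw_block(): wait for buffer lock */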
 


end of thread

Thread overview: 19+ messages
2006-05-18  8:25 [PATCH] Change ll_rw_block() calls in JBD Zoltan Menyhart
2006-05-18 13:45 ` Jan Kara
2006-05-18 15:11   ` Zoltan Menyhart
2006-05-18 22:25     ` Stephen C. Tweedie
2006-05-19 10:01       ` Zoltan Menyhart
2006-05-19 12:26         ` Stephen C. Tweedie
2006-05-19  1:30     ` Jan Kara
2006-05-19 12:33       ` Zoltan Menyhart
2006-05-19 15:05         ` Stephen C. Tweedie
2006-05-19 15:06   ` Stephen C. Tweedie
2006-05-24 17:33     ` Jan Kara
2006-05-30 15:36       ` Zoltan Menyhart
2006-05-30 16:40         ` Jan Kara
2006-05-23 16:01   ` Zoltan Menyhart
2006-05-24  9:14   ` Zoltan Menyhart
2006-05-24 17:18     ` Jan Kara
     [not found]   ` <447F13B3.6050505@bull.net>
     [not found]     ` <20060601162751.GH26933@atrey.karlin.mff.cuni.cz>
     [not found]       ` <44801E16.3040300@bull.net>
     [not found]         ` <20060602134923.GA1644@atrey.karlin.mff.cuni.cz>
2006-06-20 16:33           ` Zoltan Menyhart
2006-06-21  0:09             ` Jan Kara
  -- strict thread matches above, loose matches on Subject: below --
2005-07-11 15:52 Jan Kara
