linux-kernel.vger.kernel.org archive mirror
* 2.0, 2.2, 2.4, 2.5: fsync buffer race
@ 2003-02-02 23:32 Mikulas Patocka
  2003-02-03  0:00 ` Andrew Morton
  0 siblings, 1 reply; 16+ messages in thread
From: Mikulas Patocka @ 2003-02-02 23:32 UTC (permalink / raw)
  To: linux-kernel

Hi

there's a race condition in the filesystem.

let's have two inodes that are placed in the same buffer.

call fsync on inode 1
it goes down to ext2_update_inode [update == 1]
it calls ll_rw_block at the end
ll_rw_block starts to write buffer
ext2_update_inode waits on buffer

while the buffer is writing, another process calls fsync on inode 2
it goes again to ext2_update_inode
it calls ll_rw_block
ll_rw_block sees the buffer locked and exits immediately
ext2_update_inode waits for buffer
the first write finishes, ext2_update_inode exits, and the changes made by
the second process to inode 2 ARE NOT WRITTEN TO DISK.

This bug means that when you simultaneously fsync two inodes in the same
buffer, only the first is really written to disk.
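
For reference, the call pattern that races looks roughly like this (a
simplified sketch, not the literal ext2 source; modify_inode_in() is a
hypothetical stand-in for the code that fills the on-disk inode in the
buffer):

	/* two fsync() callers may run this concurrently on the same bh */
	modify_inode_in(bh->b_data);	/* hypothetical helper */
	mark_buffer_dirty(bh);
	ll_rw_block(WRITE, 1, &bh);	/* does nothing if bh is already locked */
	wait_on_buffer(bh);		/* may only wait for the OLD write */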

Mikulas






* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race
  2003-02-02 23:32 2.0, 2.2, 2.4, 2.5: fsync buffer race Mikulas Patocka
@ 2003-02-03  0:00 ` Andrew Morton
  2003-02-03  1:13   ` Mikulas Patocka
  2003-02-04 23:16   ` Pavel Machek
  0 siblings, 2 replies; 16+ messages in thread
From: Andrew Morton @ 2003-02-03  0:00 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: linux-kernel

Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> wrote:
>
> Hi
> 
> there's a race condition in the filesystem.
> 
> let's have two inodes that are placed in the same buffer.
> 
> call fsync on inode 1
> it goes down to ext2_update_inode [update == 1]
> it calls ll_rw_block at the end
> ll_rw_block starts to write buffer
> ext2_update_inode waits on buffer
> 
> while the buffer is writing, another process calls fsync on inode 2
> it goes again to ext2_update_inode
> it calls ll_rw_block
> ll_rw_block sees the buffer locked and exits immediately
> ext2_update_inode waits for buffer
> the first write finishes, ext2_update_inode exits, and the changes made by
> the second process to inode 2 ARE NOT WRITTEN TO DISK.
> 

hmm, yes.  This is a general weakness in the ll_rw_block() interface.  It is
not suitable for data-integrity writeouts, as you've pointed out.

A suitable fix would be to create a new

void wait_and_rw_block(...)
{
	wait_on_buffer(bh);
	ll_rw_block(...);
}

and go use that in all the appropriate places.

I shall make that change for 2.5, thanks.


* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race
  2003-02-03  0:00 ` Andrew Morton
@ 2003-02-03  1:13   ` Mikulas Patocka
  2003-02-03  1:20     ` Andrew Morton
  2003-02-04 23:16   ` Pavel Machek
  1 sibling, 1 reply; 16+ messages in thread
From: Mikulas Patocka @ 2003-02-03  1:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Sun, 2 Feb 2003, Andrew Morton wrote:

> Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> wrote:
> >
> > Hi
> >
> > there's a race condition in the filesystem.
> >
> > let's have two inodes that are placed in the same buffer.
> >
> > call fsync on inode 1
> > it goes down to ext2_update_inode [update == 1]
> > it calls ll_rw_block at the end
> > ll_rw_block starts to write buffer
> > ext2_update_inode waits on buffer
> >
> > while the buffer is writing, another process calls fsync on inode 2
> > it goes again to ext2_update_inode
> > it calls ll_rw_block
> > ll_rw_block sees the buffer locked and exits immediately
> > ext2_update_inode waits for buffer
> > the first write finishes, ext2_update_inode exits, and the changes made by
> > the second process to inode 2 ARE NOT WRITTEN TO DISK.
> >
>
> hmm, yes.  This is a general weakness in the ll_rw_block() interface.  It is
> not suitable for data-integrity writeouts, as you've pointed out.
>
> A suitable fix would be to create a new
>
> void wait_and_rw_block(...)
> {
> 	wait_on_buffer(bh);
> 	ll_rw_block(...);
> }

It would fail if another CPU submits I/O with ll_rw_block after the
wait_on_buffer but before the ll_rw_block. ll_rw_block really needs to
be rewritten.

Mikulas

> and go use that in all the appropriate places.
>
> I shall make that change for 2.5, thanks.
>



* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race
  2003-02-03  1:13   ` Mikulas Patocka
@ 2003-02-03  1:20     ` Andrew Morton
  2003-02-03  9:29       ` Mikulas Patocka
  0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2003-02-03  1:20 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: linux-kernel

Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> wrote:
>
> > void wait_and_rw_block(...)
> > {
> > 	wait_on_buffer(bh);
> > 	ll_rw_block(...);
> > }
> 
> It would fail if another CPU submits I/O with ll_rw_block after
> wait_on_buffer but before ll_rw_block.

In that case, the caller's data gets written anyway, and the caller will wait
upon the I/O which the other CPU started.  So the ll_rw_block() behaviour is
appropriate.


* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race
  2003-02-03  1:20     ` Andrew Morton
@ 2003-02-03  9:29       ` Mikulas Patocka
  0 siblings, 0 replies; 16+ messages in thread
From: Mikulas Patocka @ 2003-02-03  9:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Sun, 2 Feb 2003, Andrew Morton wrote:

> Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> wrote:
> >
> > > void wait_and_rw_block(...)
> > > {
> > > 	wait_on_buffer(bh);
> > > 	ll_rw_block(...);
> > > }
> >
> > It would fail if another CPU submits I/O with ll_rw_block after
> > wait_on_buffer but before ll_rw_block.
>
> In that case, the caller's data gets written anyway, and the caller will wait
> upon the I/O which the other CPU started.  So the ll_rw_block() behaviour is
> appropriate.

You are partly right, but it suffers from an SMP memory-ordering bug:

CPU 1
writes data to the buffer (but the data
sit in a CPU-local write buffer and do
not go out to the bus yet)

tests buffer_locked in
wait_and_rw_block->wait_on_buffer,
sees the buffer unlocked

					CPU 2
					starts to write the buffer, but
					does not yet see the data
					written by CPU 1

CPU 1 flushes its data to the bus,
calls ll_rw_block, which sees
buffer_locked, and exits.
The new data are lost.

There should be smp_mb(); before wait_on_buffer in wait_and_rw_block.
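
In other words, the wait_and_rw_block() sketch quoted above would become
something like this (same elided arguments, only the barrier added):

	void wait_and_rw_block(...)
	{
		smp_mb();	/* make our buffer modifications visible
				   before testing the buffer lock */
		wait_on_buffer(bh);
		ll_rw_block(...);
	}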


BTW, why don't you just patch ll_rw_block so that it waits if it sees a
locked buffer? You get much cleaner code, with only one test for a
locked buffer.

Mikulas



* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race
  2003-02-03  0:00 ` Andrew Morton
  2003-02-03  1:13   ` Mikulas Patocka
@ 2003-02-04 23:16   ` Pavel Machek
  2003-02-05 15:13     ` Mikulas Patocka
  2003-02-10 13:07     ` Andrea Arcangeli
  1 sibling, 2 replies; 16+ messages in thread
From: Pavel Machek @ 2003-02-04 23:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mikulas Patocka, linux-kernel

Hi!

> > there's a race condition in the filesystem.
> > 
> > let's have two inodes that are placed in the same buffer.
> > 
> > call fsync on inode 1
> > it goes down to ext2_update_inode [update == 1]
> > it calls ll_rw_block at the end
> > ll_rw_block starts to write buffer
> > ext2_update_inode waits on buffer
> > 
> > while the buffer is writing, another process calls fsync on inode 2
> > it goes again to ext2_update_inode
> > it calls ll_rw_block
> > ll_rw_block sees the buffer locked and exits immediately
> > ext2_update_inode waits for buffer
> > the first write finishes, ext2_update_inode exits, and the changes made by
> > the second process to inode 2 ARE NOT WRITTEN TO DISK.
> > 
> 
> hmm, yes.  This is a general weakness in the ll_rw_block() interface.  It is
> not suitable for data-integrity writeouts, as you've pointed out.
> 
> A suitable fix would be to create a new
> 
> void wait_and_rw_block(...)
> {
> 	wait_on_buffer(bh);
> 	ll_rw_block(...);
> }
> 
> and go use that in all the appropriate places.
> 
> I shall make that change for 2.5, thanks.

Should this be fixed at least in 2.4, too? It seems pretty serious for
mail servers (etc)...
								Pavel
-- 
Worst form of spam? Adding advertisement signatures a la sourceforge.net.
What comes next? Inserting advertisements *into* email?


* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race
  2003-02-04 23:16   ` Pavel Machek
@ 2003-02-05 15:13     ` Mikulas Patocka
  2003-02-10 13:07     ` Andrea Arcangeli
  1 sibling, 0 replies; 16+ messages in thread
From: Mikulas Patocka @ 2003-02-05 15:13 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Andrew Morton, linux-kernel

> Hi!
>
> > > there's a race condition in the filesystem.
> > >
> > > let's have two inodes that are placed in the same buffer.
> > >
> > > call fsync on inode 1
> > > it goes down to ext2_update_inode [update == 1]
> > > it calls ll_rw_block at the end
> > > ll_rw_block starts to write buffer
> > > ext2_update_inode waits on buffer
> > >
> > > while the buffer is writing, another process calls fsync on inode 2
> > > it goes again to ext2_update_inode
> > > it calls ll_rw_block
> > > ll_rw_block sees the buffer locked and exits immediately
> > > ext2_update_inode waits for buffer
> > > the first write finishes, ext2_update_inode exits, and the changes made by
> > > the second process to inode 2 ARE NOT WRITTEN TO DISK.
> > >
> >
> > hmm, yes.  This is a general weakness in the ll_rw_block() interface.  It is
> > not suitable for data-integrity writeouts, as you've pointed out.
> >
> > A suitable fix would be to create a new
> >
> > void wait_and_rw_block(...)
> > {
> > 	wait_on_buffer(bh);
> > 	ll_rw_block(...);
> > }
> >
> > and go use that in all the appropriate places.
> >
> > I shall make that change for 2.5, thanks.
>
> Should this be fixed at least in 2.4, too? It seems pretty serious for
> mail servers (etc)...
> 								Pavel

It should, but it is hazardous. The problem is that every use of
ll_rw_block has this bug, not only the one in ext2 fsync. The cleanest
thing would be to modify ll_rw_block to wait until the buffer becomes
unlocked, but no one knows whether that might produce some weird
behaviour.

Even Linus didn't know what he was doing; see this comment around the
buggy part in 2.2, 2.0 and earlier kernels.

ll_rw_blk.c:
        /* Uhhuh.. Nasty dead-lock possible here.. */
        if (buffer_locked(bh))
                return;
        /* Maybe the above fixes it, and maybe it doesn't boot. Life is interesting */
        lock_buffer(bh);

Mikulas



* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race
  2003-02-04 23:16   ` Pavel Machek
  2003-02-05 15:13     ` Mikulas Patocka
@ 2003-02-10 13:07     ` Andrea Arcangeli
  2003-02-10 16:28       ` Mikulas Patocka
  1 sibling, 1 reply; 16+ messages in thread
From: Andrea Arcangeli @ 2003-02-10 13:07 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Andrew Morton, Mikulas Patocka, linux-kernel

On Wed, Feb 05, 2003 at 12:16:53AM +0100, Pavel Machek wrote:
> Hi!
> 
> > > there's a race condition in the filesystem.
> > > 
> > > let's have two inodes that are placed in the same buffer.
> > > 
> > > call fsync on inode 1
> > > it goes down to ext2_update_inode [update == 1]
> > > it calls ll_rw_block at the end
> > > ll_rw_block starts to write buffer
> > > ext2_update_inode waits on buffer
> > > 
> > > while the buffer is writing, another process calls fsync on inode 2
> > > it goes again to ext2_update_inode
> > > it calls ll_rw_block
> > > ll_rw_block sees the buffer locked and exits immediately
> > > ext2_update_inode waits for buffer
> > > the first write finishes, ext2_update_inode exits, and the changes made by
> > > the second process to inode 2 ARE NOT WRITTEN TO DISK.
> > > 
> > 
> > hmm, yes.  This is a general weakness in the ll_rw_block() interface.  It is
> > not suitable for data-integrity writeouts, as you've pointed out.
> > 
> > A suitable fix would be to create a new
> > 
> > void wait_and_rw_block(...)
> > {
> > 	wait_on_buffer(bh);
> > 	ll_rw_block(...);
> > }
> > 
> > and go use that in all the appropriate places.
> > 
> > I shall make that change for 2.5, thanks.
> 
> Should this be fixed at least in 2.4, too? It seems pretty serious for
> mail servers (etc)...

Actually the low level is currently supposed to take care of that if it
writes directly with ll_rw_block, the way fsync_buffers_list takes care
of it before calling ll_rw_block. But the whole point is that normally
the write_inode path only marks the buffer dirty, it never writes
directly, so no dirty bitflag can be lost, and we flush those dirty
buffers with the proper wait_on_buffer before ll_rw_block. So I don't
think it is really happening in 2.4, at least w/o mounting the fs with
-osync.

But I thought about Mikulas' suggestion of adding lock_buffer in place
of the test-and-set bit. This looks very appealing. The main reason we
didn't do that while fixing fsync_buffers_list a few months ago, in
2.4.20 or so, is that it is a very risky change; I mean, I'm feeling
like it'll break something subtle.

In theory the only thing those bitflags control here is the coherency of
the buffer cache; the rest is serialized by design at the high level.
And the buffer cache should be OK with the lock_buffer(), since we'll
see the buffer update and we'll skip the write if we race with other
reads (we'll sleep in lock_buffer rather than in wait_on_buffer). For
writes we'll overwrite the data one more time (the feature,
incidentally ;).  So it sounds safe.  Performance-wise it shouldn't
matter; for reads it can't matter, in fact.

So I guess it's OK to make that change, and then we could even move the
wait_on_buffer from fsync_buffers_list to the xfs buffer_delay path of
write_buffer. I'm tempted to make this change for the next -aa. I feel
it makes more sense and is a better fix even if risky. Can anybody see
a problem with that?
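
For illustration, the change under discussion boils down to roughly this
inside the ll_rw_block() write loop (a sketch, not the literal 2.4
source):

	/* current behaviour: a buffer that is already locked is skipped */
	if (test_and_set_bit(BH_Lock, &bh->b_state))
		continue;

	/* proposed behaviour: wait for the in-flight I/O, then take the
	 * lock ourselves and submit the redirtied buffer */
	lock_buffer(bh);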

Andrea


* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race
  2003-02-10 13:07     ` Andrea Arcangeli
@ 2003-02-10 16:28       ` Mikulas Patocka
  2003-02-10 16:57         ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Mikulas Patocka @ 2003-02-10 16:28 UTC (permalink / raw)
  To: torvalds; +Cc: Andrea Arcangeli, Pavel Machek, Andrew Morton, linux-kernel

Linus, do you remember what "nasty deadlock" you meant when you wrote
this in 2.[0-2].* and maybe 1.*?

ll_rw_blk.c - ll_rw_block():
        /* Uhhuh.. Nasty dead-lock possible here.. */
        if (buffer_locked(bh))
                return;
        /* Maybe the above fixes it, and maybe it doesn't boot. Life is interesting */
        lock_buffer(bh);

Mikulas

On Mon, 10 Feb 2003, Andrea Arcangeli wrote:

> On Wed, Feb 05, 2003 at 12:16:53AM +0100, Pavel Machek wrote:
> > Hi!
> >
> > > > there's a race condition in the filesystem.
> > > >
> > > > let's have two inodes that are placed in the same buffer.
> > > >
> > > > call fsync on inode 1
> > > > it goes down to ext2_update_inode [update == 1]
> > > > it calls ll_rw_block at the end
> > > > ll_rw_block starts to write buffer
> > > > ext2_update_inode waits on buffer
> > > >
> > > > while the buffer is writing, another process calls fsync on inode 2
> > > > it goes again to ext2_update_inode
> > > > it calls ll_rw_block
> > > > ll_rw_block sees the buffer locked and exits immediately
> > > > ext2_update_inode waits for buffer
> > > > the first write finishes, ext2_update_inode exits, and the changes made by
> > > > the second process to inode 2 ARE NOT WRITTEN TO DISK.
> > > >
> > >
> > > hmm, yes.  This is a general weakness in the ll_rw_block() interface.  It is
> > > not suitable for data-integrity writeouts, as you've pointed out.
> > >
> > > A suitable fix would be to create a new
> > >
> > > void wait_and_rw_block(...)
> > > {
> > > 	wait_on_buffer(bh);
> > > 	ll_rw_block(...);
> > > }
> > >
> > > and go use that in all the appropriate places.
> > >
> > > I shall make that change for 2.5, thanks.
> >
> > Should this be fixed at least in 2.4, too? It seems pretty serious for
> > mail servers (etc)...
>
> actually the lowlevel currently is supposed to take care of that if it
> writes directly with ll_rw_block like fsync_buffers_list takes care of
> it before calling ll_rw_block. But the whole point is that normally the
> write_inode path only marks the buffer dirty, it never writes directly,
> and no dirty bitflag can be lost and we flush those dirty buffers right
> with the proper wait_on_buffer before ll_rw_block. So I don't think it's
> happening really in 2.4, at least w/o mounting the fs with -osync.
>
> But I thought about Mikulas' suggestion of adding lock_buffer
> in place of the test-and-set bit. This looks very appealing. The main
> reason we didn't do that while fixing fsync_buffers_list a few months
> ago, in 2.4.20 or so, is that it is a very risky change; I mean, I'm
> feeling like it'll break something subtle.
>
> In theory the only thing those bitflags controls are the coherency of
> the buffer cache here, the rest is serialized by design at the
> highlevel. And the buffer_cache should be ok with the lock_buffer(),
> since we'll see the buffer update and we'll skip the write if we race
> with other reads (we'll sleep in lock_buffer rather than in
> wait_on_buffer). For the writes we'll overwrite the data one more time
> (the feature incidentally ;).  so it sounds safe.  Performance wise it
> shouldn't matter; for reads it can't matter, in fact.
>
> So I guess it's ok to make that change, then we could even move the
> wait_on_buffer from fsync_buffers_list to the xfs buffer_delay path of
> write_buffer. I'm tempted to make this change for next -aa. I feel it
> makes more sense and it's a better fix even if risky. Can anybody see
> a problem with that?
>
> Andrea
>




* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race
  2003-02-10 16:28       ` Mikulas Patocka
@ 2003-02-10 16:57         ` Linus Torvalds
  2003-02-10 20:40           ` Andrew Morton
  0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2003-02-10 16:57 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andrea Arcangeli, Pavel Machek, Andrew Morton, linux-kernel


On Mon, 10 Feb 2003, Mikulas Patocka wrote:
>
> Linus, do you remember what "nasty deadlock" you meant when you wrote
> this in 2.[0-2].* and maybe 1.*?
> 
> ll_rw_blk.c - ll_rw_block():
>         /* Uhhuh.. Nasty dead-lock possible here.. */
>         if (buffer_locked(bh))
>                 return;
>         /* Maybe the above fixes it, and maybe it doesn't boot. Life is interesting */
>         lock_buffer(bh);

Nope, that's a _long_ time ago. It may well have been simply an MM bug
that has long since been fixed, i.e. something like

 - dirty writeout happens
 - writeout needs memory
 - buffer.c tries to clean up pages
 - hang.

However, some of the comments in this thread simply do not seem to be
correct:

> > > > > while the buffer is writing, another process calls fsync on inode 2
> > > > > it goes again to ext2_update_inode
> > > > > it calls ll_rw_block
> > > > > ll_rw_block sees the buffer locked and exits immediately
> > > > > ext2_update_inode waits for buffer
> > > > > the first write finishes, ext2_update_inode exits, and the changes made by
> > > > > the second process to inode 2 ARE NOT WRITTEN TO DISK.

The dirty bit is not cleared, so the changes _are_ written to disk. 
Eventually. The fact that something tests "unlocked" instead of "dirty" is 
not the fault of ll_rw_block(), and never has been.

So the bug is not in ll_rw_block(), but in the waiter. If you expect to be
able to wait for dirty blocks, you can't do what the code apparently does.  
You can't assume that "unlocked" means "no longer dirty". Unlocked means
unlocked, and dirty means dirty.

If you want to synchronously wait for dirty blocks, you should do 
something like

	lock_buffer();
	.. dirty buffer..
	mark_buffer_dirty();
	unlock_buffer();

	ll_rw_block(..)

Btw, it is probably a really bad idea to just do

	wait_and_write()
	{
		wait_for_buffer(..);
		ll_rw_block(..);
	}

for things like this. It's just a better idea to lock the buffer before 
making the modification.

Of course, these days, if you really want to, you can just use submit_bh() 
etc instead. At which point _you_ are responsible for all the locking and 
unlocking.

		Linus



* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race
  2003-02-10 16:57         ` Linus Torvalds
@ 2003-02-10 20:40           ` Andrew Morton
  2003-02-10 21:18             ` Andrea Arcangeli
  0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2003-02-10 20:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: mikulas, andrea, pavel, linux-kernel

Linus Torvalds <torvalds@transmeta.com> wrote:
>
> If you want to synchronously wait for dirty blocks, you should do 
> something like
> 
> 	lock_buffer();
> 	.. dirty buffer..
> 	mark_buffer_dirty();
> 	unlock_buffer();
> 
> 	ll_rw_block(..)

Almost all of the problematic code paths were doing this:

	alter_data_at(bh->b_data);
	mark_buffer_dirty(bh);
	if (IS_SYNC(inode)) {
		ll_rw_block(bh);
		wait_on_buffer(bh);
	}

and the bug is that if writeout was already underway, the new changes are not
committed to disk here.

The approach you describe here would involve changing that to:

	if (IS_SYNC(inode))
		lock_buffer(bh);
	alter_data_at(bh->b_data);
	mark_buffer_dirty(bh);
	if (IS_SYNC(inode)) {
		submit_bh(bh);
		wait_on_buffer(bh);
	}

which seems a bit odd.  Or perhaps

	lock_buffer(bh);
	alter_data_at(bh->b_data);
	mark_buffer_dirty(bh);
	if (IS_SYNC(inode)) {
		submit_bh(bh);
		wait_on_buffer(bh);
	} else {
		unlock_buffer(bh);
	}

which is nice.  But it'll suck for the non-IS_SYNC case due to undesirable
serialisation.

We don't need the buffer lock in there for serialisation because that is
handled at a higher level - usually i_sem.

I wrote the below patch to address this problem.





Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> points out a bug in
ll_rw_block() usage.

Typical usage is:

	mark_buffer_dirty(bh);
	ll_rw_block(WRITE, 1, &bh);
	wait_on_buffer(bh);

The problem is that if the buffer was locked on entry to this code sequence
(due to in-progress I/O), ll_rw_block() will neither wait for it nor start
new I/O.  So this code will wait on the _old_ I/O, and will then continue
execution, leaving the buffer dirty.

It turns out that all callers were only writing one buffer, and they were all
waiting on that writeout.  So I added a new sync_dirty_buffer() function:

	void sync_dirty_buffer(struct buffer_head *bh)
	{
		lock_buffer(bh);
		if (test_clear_buffer_dirty(bh)) {
			get_bh(bh);
			bh->b_end_io = end_buffer_io_sync;
			submit_bh(WRITE, bh);
		} else {
			unlock_buffer(bh);
		}
	}

which allowed a fair amount of code to be removed, while adding the desired
data-integrity guarantees.

UFS has its own wrappers around ll_rw_block() which got in the way, so this
operation was open-coded in that case.
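
A typical converted call site (in the spirit of the hunks below) then
looks like:

	mark_buffer_dirty(bh);
	if (IS_SYNC(inode))
		sync_dirty_buffer(bh);	/* waits for old I/O, writes, waits */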



 fs/buffer.c                 |   18 ++++++++++++++++++
 fs/ext2/balloc.c            |   12 ++++--------
 fs/ext2/ialloc.c            |   12 ++++--------
 fs/ext2/inode.c             |    9 +++------
 fs/ext2/super.c             |    3 +--
 fs/ext2/xattr.c             |    9 +++------
 fs/ext3/super.c             |    6 ++----
 fs/jbd/commit.c             |    3 +--
 fs/jbd/journal.c            |    8 ++++----
 fs/jbd/transaction.c        |    6 ++----
 fs/jfs/jfs_imap.c           |    3 +--
 fs/jfs/jfs_mount.c          |    3 +--
 fs/jfs/namei.c              |    6 ++----
 fs/jfs/resize.c             |    9 +++------
 fs/minix/inode.c            |    3 +--
 fs/ntfs/super.c             |    3 +--
 fs/qnx4/inode.c             |    3 +--
 fs/reiserfs/journal.c       |    6 ++----
 fs/reiserfs/resize.c        |    3 +--
 fs/sysv/inode.c             |    3 +--
 fs/sysv/itree.c             |    6 ++----
 fs/udf/inode.c              |    3 +--
 fs/ufs/balloc.c             |   16 ++++++++--------
 fs/ufs/dir.c                |   18 ++++++------------
 fs/ufs/ialloc.c             |    2 ++
 fs/ufs/inode.c              |   12 ++++--------
 fs/ufs/truncate.c           |    3 +++
 include/linux/buffer_head.h |    1 +
 include/linux/hfs_sysdep.h  |    9 ++-------
 kernel/ksyms.c              |    1 +
 30 files changed, 86 insertions(+), 113 deletions(-)

diff -puN fs/buffer.c~ll_rw_block-fix fs/buffer.c
--- 25/fs/buffer.c~ll_rw_block-fix	Wed Feb  5 16:27:02 2003
+++ 25-akpm/fs/buffer.c	Wed Feb  5 16:27:03 2003
@@ -2622,6 +2622,24 @@ void ll_rw_block(int rw, int nr, struct 
 }
 
 /*
+ * For a data-integrity writeout, we need to wait upon any in-progress I/O
+ * and then start new I/O and then wait upon it.
+ */
+void sync_dirty_buffer(struct buffer_head *bh)
+{
+	WARN_ON(atomic_read(&bh->b_count) < 1);
+	lock_buffer(bh);
+	if (test_clear_buffer_dirty(bh)) {
+		get_bh(bh);
+		bh->b_end_io = end_buffer_io_sync;
+		submit_bh(WRITE, bh);
+		wait_on_buffer(bh);
+	} else {
+		unlock_buffer(bh);
+	}
+}
+
+/*
  * Sanity checks for try_to_free_buffers.
  */
 static void check_ttfb_buffer(struct page *page, struct buffer_head *bh)
diff -puN fs/ext2/balloc.c~ll_rw_block-fix fs/ext2/balloc.c
--- 25/fs/ext2/balloc.c~ll_rw_block-fix	Wed Feb  5 16:27:02 2003
+++ 25-akpm/fs/ext2/balloc.c	Wed Feb  5 16:27:03 2003
@@ -233,10 +233,8 @@ do_more:
 	}
 
 	mark_buffer_dirty(bitmap_bh);
-	if (sb->s_flags & MS_SYNCHRONOUS) {
-		ll_rw_block(WRITE, 1, &bitmap_bh);
-		wait_on_buffer(bitmap_bh);
-	}
+	if (sb->s_flags & MS_SYNCHRONOUS)
+		sync_dirty_buffer(bitmap_bh);
 
 	group_release_blocks(desc, bh2, group_freed);
 	freed += group_freed;
@@ -466,10 +464,8 @@ got_block:
 	write_unlock(&EXT2_I(inode)->i_meta_lock);
 
 	mark_buffer_dirty(bitmap_bh);
-	if (sb->s_flags & MS_SYNCHRONOUS) {
-		ll_rw_block(WRITE, 1, &bitmap_bh);
-		wait_on_buffer(bitmap_bh);
-	}
+	if (sb->s_flags & MS_SYNCHRONOUS)
+		sync_dirty_buffer(bitmap_bh);
 
 	ext2_debug ("allocating block %d. ", block);
 
diff -puN fs/ext2/ialloc.c~ll_rw_block-fix fs/ext2/ialloc.c
--- 25/fs/ext2/ialloc.c~ll_rw_block-fix	Wed Feb  5 16:27:02 2003
+++ 25-akpm/fs/ext2/ialloc.c	Wed Feb  5 16:27:03 2003
@@ -146,10 +146,8 @@ void ext2_free_inode (struct inode * ino
 		mark_buffer_dirty(EXT2_SB(sb)->s_sbh);
 	}
 	mark_buffer_dirty(bitmap_bh);
-	if (sb->s_flags & MS_SYNCHRONOUS) {
-		ll_rw_block(WRITE, 1, &bitmap_bh);
-		wait_on_buffer(bitmap_bh);
-	}
+	if (sb->s_flags & MS_SYNCHRONOUS)
+		sync_dirty_buffer(bitmap_bh);
 	sb->s_dirt = 1;
 error_return:
 	brelse(bitmap_bh);
@@ -485,10 +483,8 @@ repeat:
 	ext2_set_bit(i, bitmap_bh->b_data);
 
 	mark_buffer_dirty(bitmap_bh);
-	if (sb->s_flags & MS_SYNCHRONOUS) {
-		ll_rw_block(WRITE, 1, &bitmap_bh);
-		wait_on_buffer(bitmap_bh);
-	}
+	if (sb->s_flags & MS_SYNCHRONOUS)
+		sync_dirty_buffer(bitmap_bh);
 	brelse(bitmap_bh);
 
 	ino = group * EXT2_INODES_PER_GROUP(sb) + i + 1;
diff -puN fs/ext2/inode.c~ll_rw_block-fix fs/ext2/inode.c
--- 25/fs/ext2/inode.c~ll_rw_block-fix	Wed Feb  5 16:27:02 2003
+++ 25-akpm/fs/ext2/inode.c	Wed Feb  5 16:27:03 2003
@@ -443,10 +443,8 @@ static int ext2_alloc_branch(struct inod
 		 * But we now rely upon generic_osync_inode()
 		 * and b_inode_buffers.  But not for directories.
 		 */
-		if (S_ISDIR(inode->i_mode) && IS_DIRSYNC(inode)) {
-			ll_rw_block(WRITE, 1, &bh);
-			wait_on_buffer(bh);
-		}
+		if (S_ISDIR(inode->i_mode) && IS_DIRSYNC(inode))
+			sync_dirty_buffer(bh);
 		parent = nr;
 	}
 	if (n == num)
@@ -1208,8 +1206,7 @@ static int ext2_update_inode(struct inod
 		raw_inode->i_block[n] = ei->i_data[n];
 	mark_buffer_dirty(bh);
 	if (do_sync) {
-		ll_rw_block (WRITE, 1, &bh);
-		wait_on_buffer (bh);
+		sync_dirty_buffer(bh);
 		if (buffer_req(bh) && !buffer_uptodate(bh)) {
 			printk ("IO error syncing ext2 inode [%s:%08lx]\n",
 				sb->s_id, (unsigned long) ino);
diff -puN fs/ext2/super.c~ll_rw_block-fix fs/ext2/super.c
--- 25/fs/ext2/super.c~ll_rw_block-fix	Wed Feb  5 16:27:02 2003
+++ 25-akpm/fs/ext2/super.c	Wed Feb  5 16:27:03 2003
@@ -842,8 +842,7 @@ static void ext2_sync_super(struct super
 {
 	es->s_wtime = cpu_to_le32(get_seconds());
 	mark_buffer_dirty(EXT2_SB(sb)->s_sbh);
-	ll_rw_block(WRITE, 1, &EXT2_SB(sb)->s_sbh);
-	wait_on_buffer(EXT2_SB(sb)->s_sbh);
+	sync_dirty_buffer(EXT2_SB(sb)->s_sbh);
 	sb->s_dirt = 0;
 }
 
diff -puN fs/ext2/xattr.c~ll_rw_block-fix fs/ext2/xattr.c
--- 25/fs/ext2/xattr.c~ll_rw_block-fix	Wed Feb  5 16:27:02 2003
+++ 25-akpm/fs/ext2/xattr.c	Wed Feb  5 16:27:03 2003
@@ -774,8 +774,7 @@ ext2_xattr_set2(struct inode *inode, str
 		}
 		mark_buffer_dirty(new_bh);
 		if (IS_SYNC(inode)) {
-			ll_rw_block(WRITE, 1, &new_bh);
-			wait_on_buffer(new_bh); 
+			sync_dirty_buffer(new_bh);
 			error = -EIO;
 			if (buffer_req(new_bh) && !buffer_uptodate(new_bh))
 				goto cleanup;
@@ -865,10 +864,8 @@ ext2_xattr_delete_inode(struct inode *in
 		HDR(bh)->h_refcount = cpu_to_le32(
 			le32_to_cpu(HDR(bh)->h_refcount) - 1);
 		mark_buffer_dirty(bh);
-		if (IS_SYNC(inode)) {
-			ll_rw_block(WRITE, 1, &bh);
-			wait_on_buffer(bh);
-		}
+		if (IS_SYNC(inode))
+			sync_dirty_buffer(bh);
 		DQUOT_FREE_BLOCK(inode, 1);
 	}
 	EXT2_I(inode)->i_file_acl = 0;
diff -puN fs/ext3/super.c~ll_rw_block-fix fs/ext3/super.c
--- 25/fs/ext3/super.c~ll_rw_block-fix	Wed Feb  5 16:27:02 2003
+++ 25-akpm/fs/ext3/super.c	Wed Feb  5 16:27:03 2003
@@ -1627,10 +1627,8 @@ static void ext3_commit_super (struct su
 	es->s_wtime = cpu_to_le32(get_seconds());
 	BUFFER_TRACE(EXT3_SB(sb)->s_sbh, "marking dirty");
 	mark_buffer_dirty(EXT3_SB(sb)->s_sbh);
-	if (sync) {
-		ll_rw_block(WRITE, 1, &EXT3_SB(sb)->s_sbh);
-		wait_on_buffer(EXT3_SB(sb)->s_sbh);
-	}
+	if (sync)
+		sync_dirty_buffer(EXT3_SB(sb)->s_sbh);
 }
 
 
diff -puN fs/jbd/commit.c~ll_rw_block-fix fs/jbd/commit.c
--- 25/fs/jbd/commit.c~ll_rw_block-fix	Wed Feb  5 16:27:02 2003
+++ 25-akpm/fs/jbd/commit.c	Wed Feb  5 16:27:03 2003
@@ -607,8 +607,7 @@ start_journal_io:
 	{
 		struct buffer_head *bh = jh2bh(descriptor);
 		set_buffer_uptodate(bh);
-		ll_rw_block(WRITE, 1, &bh);
-		wait_on_buffer(bh);
+		sync_dirty_buffer(bh);
 		__brelse(bh);		/* One for getblk() */
 		journal_unlock_journal_head(descriptor);
 	}
diff -puN fs/jbd/journal.c~ll_rw_block-fix fs/jbd/journal.c
--- 25/fs/jbd/journal.c~ll_rw_block-fix	Wed Feb  5 16:27:02 2003
+++ 25-akpm/fs/jbd/journal.c	Wed Feb  5 16:27:03 2003
@@ -960,9 +960,10 @@ void journal_update_superblock(journal_t
 
 	BUFFER_TRACE(bh, "marking dirty");
 	mark_buffer_dirty(bh);
-	ll_rw_block(WRITE, 1, &bh);
 	if (wait)
-		wait_on_buffer(bh);
+		sync_dirty_buffer(bh);
+	else
+		ll_rw_block(WRITE, 1, &bh);
 
 	/* If we have just flushed the log (by marking s_start==0), then
 	 * any future commit will have to be careful to update the
@@ -1296,8 +1297,7 @@ static int journal_convert_superblock_v1
 	bh = journal->j_sb_buffer;
 	BUFFER_TRACE(bh, "marking dirty");
 	mark_buffer_dirty(bh);
-	ll_rw_block(WRITE, 1, &bh);
-	wait_on_buffer(bh);
+	sync_dirty_buffer(bh);
 	return 0;
 }
 
diff -puN fs/jbd/transaction.c~ll_rw_block-fix fs/jbd/transaction.c
--- 25/fs/jbd/transaction.c~ll_rw_block-fix	Wed Feb  5 16:27:02 2003
+++ 25-akpm/fs/jbd/transaction.c	Wed Feb  5 16:27:03 2003
@@ -1079,8 +1079,7 @@ int journal_dirty_data (handle_t *handle
 				atomic_inc(&bh->b_count);
 				spin_unlock(&journal_datalist_lock);
 				need_brelse = 1;
-				ll_rw_block(WRITE, 1, &bh);
-				wait_on_buffer(bh);
+				sync_dirty_buffer(bh);
 				spin_lock(&journal_datalist_lock);
 				/* The buffer may become locked again at any
 				   time if it is redirtied */
@@ -1361,8 +1360,7 @@ void journal_sync_buffer(struct buffer_h
 		}
 		atomic_inc(&bh->b_count);
 		spin_unlock(&journal_datalist_lock);
-		ll_rw_block (WRITE, 1, &bh);
-		wait_on_buffer(bh);
+		sync_dirty_buffer(bh);
 		__brelse(bh);
 		goto out;
 	}
diff -puN fs/jfs/jfs_imap.c~ll_rw_block-fix fs/jfs/jfs_imap.c
--- 25/fs/jfs/jfs_imap.c~ll_rw_block-fix	Wed Feb  5 16:27:02 2003
+++ 25-akpm/fs/jfs/jfs_imap.c	Wed Feb  5 16:27:03 2003
@@ -2980,8 +2980,7 @@ static void duplicateIXtree(struct super
 		j_sb->s_flag |= JFS_BAD_SAIT;
 
 		mark_buffer_dirty(bh);
-		ll_rw_block(WRITE, 1, &bh);
-		wait_on_buffer(bh);
+		sync_dirty_buffer(bh);
 		brelse(bh);
 		return;
 	}
diff -puN fs/jfs/jfs_mount.c~ll_rw_block-fix fs/jfs/jfs_mount.c
--- 25/fs/jfs/jfs_mount.c~ll_rw_block-fix	Wed Feb  5 16:27:02 2003
+++ 25-akpm/fs/jfs/jfs_mount.c	Wed Feb  5 16:27:03 2003
@@ -449,8 +449,7 @@ int updateSuper(struct super_block *sb, 
 	}
 
 	mark_buffer_dirty(bh);
-	ll_rw_block(WRITE, 1, &bh);
-	wait_on_buffer(bh);
+	sync_dirty_buffer(bh);
 	brelse(bh);
 
 	return 0;
diff -puN fs/jfs/namei.c~ll_rw_block-fix fs/jfs/namei.c
--- 25/fs/jfs/namei.c~ll_rw_block-fix	Wed Feb  5 16:27:02 2003
+++ 25-akpm/fs/jfs/namei.c	Wed Feb  5 16:27:03 2003
@@ -972,10 +972,8 @@ int jfs_symlink(struct inode *dip, struc
 #if 0
 				set_buffer_uptodate(bp);
 				mark_buffer_dirty(bp, 1);
-				if (IS_SYNC(dip)) {
-					ll_rw_block(WRITE, 1, &bp);
-					wait_on_buffer(bp);
-				}
+				if (IS_SYNC(dip))
+					sync_dirty_buffer(bp);
 				brelse(bp);
 #endif				/* 0 */
 				ssize -= copy_size;
diff -puN fs/jfs/resize.c~ll_rw_block-fix fs/jfs/resize.c
--- 25/fs/jfs/resize.c~ll_rw_block-fix	Wed Feb  5 16:27:02 2003
+++ 25-akpm/fs/jfs/resize.c	Wed Feb  5 16:27:03 2003
@@ -243,8 +243,7 @@ int jfs_extendfs(struct super_block *sb,
 
 		/* synchronously update superblock */
 		mark_buffer_dirty(bh);
-		ll_rw_block(WRITE, 1, &bh);
-		wait_on_buffer(bh);
+		sync_dirty_buffer(bh);
 		brelse(bh);
 
 		/*
@@ -512,15 +511,13 @@ int jfs_extendfs(struct super_block *sb,
 		memcpy(j_sb2, j_sb, sizeof (struct jfs_superblock));
 
 		mark_buffer_dirty(bh);
-		ll_rw_block(WRITE, 1, &bh2);
-		wait_on_buffer(bh2);
+		sync_dirty_buffer(bh2);
 		brelse(bh2);
 	}
 
 	/* write primary superblock */
 	mark_buffer_dirty(bh);
-	ll_rw_block(WRITE, 1, &bh);
-	wait_on_buffer(bh);
+	sync_dirty_buffer(bh);
 	brelse(bh);
 
 	goto resume;
diff -puN fs/minix/inode.c~ll_rw_block-fix fs/minix/inode.c
--- 25/fs/minix/inode.c~ll_rw_block-fix	Wed Feb  5 16:27:02 2003
+++ 25-akpm/fs/minix/inode.c	Wed Feb  5 16:27:03 2003
@@ -517,8 +517,7 @@ int minix_sync_inode(struct inode * inod
 	bh = minix_update_inode(inode);
 	if (bh && buffer_dirty(bh))
 	{
-		ll_rw_block(WRITE, 1, &bh);
-		wait_on_buffer(bh);
+		sync_dirty_buffer(bh);
 		if (buffer_req(bh) && !buffer_uptodate(bh))
 		{
 			printk ("IO error syncing minix inode [%s:%08lx]\n",
diff -puN fs/ntfs/super.c~ll_rw_block-fix fs/ntfs/super.c
--- 25/fs/ntfs/super.c~ll_rw_block-fix	Wed Feb  5 16:27:02 2003
+++ 25-akpm/fs/ntfs/super.c	Wed Feb  5 16:27:03 2003
@@ -505,8 +505,7 @@ hotfix_primary_boot_sector:
 			memcpy(bh_primary->b_data, bh_backup->b_data,
 					sb->s_blocksize);
 			mark_buffer_dirty(bh_primary);
-			ll_rw_block(WRITE, 1, &bh_primary);
-			wait_on_buffer(bh_primary);
+			sync_dirty_buffer(bh_primary);
 			if (buffer_uptodate(bh_primary)) {
 				brelse(bh_backup);
 				return bh_primary;
diff -puN fs/qnx4/inode.c~ll_rw_block-fix fs/qnx4/inode.c
--- 25/fs/qnx4/inode.c~ll_rw_block-fix	Wed Feb  5 16:27:03 2003
+++ 25-akpm/fs/qnx4/inode.c	Wed Feb  5 16:27:03 2003
@@ -44,8 +44,7 @@ int qnx4_sync_inode(struct inode *inode)
    	bh = qnx4_update_inode(inode);
 	if (bh && buffer_dirty(bh))
 	{
-		ll_rw_block(WRITE, 1, &bh);
-		wait_on_buffer(bh);
+		sync_dirty_buffer(bh);
 		if (buffer_req(bh) && !buffer_uptodate(bh))
 		{
 			printk ("IO error syncing qnx4 inode [%s:%08lx]\n",
diff -puN fs/reiserfs/journal.c~ll_rw_block-fix fs/reiserfs/journal.c
--- 25/fs/reiserfs/journal.c~ll_rw_block-fix	Wed Feb  5 16:27:03 2003
+++ 25-akpm/fs/reiserfs/journal.c	Wed Feb  5 16:27:03 2003
@@ -735,8 +735,7 @@ reiserfs_panic(s, "journal-539: flush_co
   }
 
   mark_buffer_dirty(jl->j_commit_bh) ;
-  ll_rw_block(WRITE, 1, &(jl->j_commit_bh)) ;
-  wait_on_buffer(jl->j_commit_bh) ;
+  sync_dirty_buffer(jl->j_commit_bh) ;
   if (!buffer_uptodate(jl->j_commit_bh)) {
     reiserfs_panic(s, "journal-615: buffer write failed\n") ;
   }
@@ -828,8 +827,7 @@ static int _update_journal_header_block(
     jh->j_first_unflushed_offset = cpu_to_le32(offset) ;
     jh->j_mount_id = cpu_to_le32(SB_JOURNAL(p_s_sb)->j_mount_id) ;
     set_buffer_dirty(SB_JOURNAL(p_s_sb)->j_header_bh) ;
-    ll_rw_block(WRITE, 1, &(SB_JOURNAL(p_s_sb)->j_header_bh)) ;
-    wait_on_buffer((SB_JOURNAL(p_s_sb)->j_header_bh)) ; 
+    sync_dirty_buffer(SB_JOURNAL(p_s_sb)->j_header_bh) ;
     if (!buffer_uptodate(SB_JOURNAL(p_s_sb)->j_header_bh)) {
       printk( "reiserfs: journal-837: IO error during journal replay\n" );
       return -EIO ;
diff -puN fs/reiserfs/resize.c~ll_rw_block-fix fs/reiserfs/resize.c
--- 25/fs/reiserfs/resize.c~ll_rw_block-fix	Wed Feb  5 16:27:03 2003
+++ 25-akpm/fs/reiserfs/resize.c	Wed Feb  5 16:27:03 2003
@@ -120,8 +120,7 @@ int reiserfs_resize (struct super_block 
 
 		mark_buffer_dirty(bitmap[i].bh) ;
 		set_buffer_uptodate(bitmap[i].bh);
-		ll_rw_block(WRITE, 1, &bitmap[i].bh);
-		wait_on_buffer(bitmap[i].bh);
+		sync_dirty_buffer(bitmap[i].bh);
 		// update bitmap_info stuff
 		bitmap[i].first_zero_hint=1;
 		bitmap[i].free_count = sb_blocksize(sb) * 8 - 1;
diff -puN fs/sysv/inode.c~ll_rw_block-fix fs/sysv/inode.c
--- 25/fs/sysv/inode.c~ll_rw_block-fix	Wed Feb  5 16:27:03 2003
+++ 25-akpm/fs/sysv/inode.c	Wed Feb  5 16:27:03 2003
@@ -265,8 +265,7 @@ int sysv_sync_inode(struct inode * inode
 
         bh = sysv_update_inode(inode);
         if (bh && buffer_dirty(bh)) {
-                ll_rw_block(WRITE, 1, &bh);
-                wait_on_buffer(bh);
+                sync_dirty_buffer(bh);
                 if (buffer_req(bh) && !buffer_uptodate(bh)) {
                         printk ("IO error syncing sysv inode [%s:%08lx]\n",
                                 inode->i_sb->s_id, inode->i_ino);
diff -puN fs/sysv/itree.c~ll_rw_block-fix fs/sysv/itree.c
--- 25/fs/sysv/itree.c~ll_rw_block-fix	Wed Feb  5 16:27:03 2003
+++ 25-akpm/fs/sysv/itree.c	Wed Feb  5 16:27:03 2003
@@ -15,10 +15,8 @@ enum {DIRECT = 10, DEPTH = 4};	/* Have t
 static inline void dirty_indirect(struct buffer_head *bh, struct inode *inode)
 {
 	mark_buffer_dirty_inode(bh, inode);
-	if (IS_SYNC(inode)) {
-		ll_rw_block (WRITE, 1, &bh);
-		wait_on_buffer (bh);
-	}
+	if (IS_SYNC(inode))
+		sync_dirty_buffer(bh);
 }
 
 static int block_to_path(struct inode *inode, long block, int offsets[DEPTH])
diff -puN fs/udf/inode.c~ll_rw_block-fix fs/udf/inode.c
--- 25/fs/udf/inode.c~ll_rw_block-fix	Wed Feb  5 16:27:03 2003
+++ 25-akpm/fs/udf/inode.c	Wed Feb  5 16:27:03 2003
@@ -1520,8 +1520,7 @@ udf_update_inode(struct inode *inode, in
 	mark_buffer_dirty(bh);
 	if (do_sync)
 	{
-		ll_rw_block(WRITE, 1, &bh);
-		wait_on_buffer(bh);
+		sync_dirty_buffer(bh);
 		if (buffer_req(bh) && !buffer_uptodate(bh))
 		{
 			printk("IO error syncing udf inode [%s:%08lx]\n",
diff -puN fs/ufs/balloc.c~ll_rw_block-fix fs/ufs/balloc.c
--- 25/fs/ufs/balloc.c~ll_rw_block-fix	Wed Feb  5 16:27:03 2003
+++ 25-akpm/fs/ufs/balloc.c	Wed Feb  5 16:27:03 2003
@@ -114,6 +114,7 @@ void ufs_free_fragments (struct inode * 
 	ubh_mark_buffer_dirty (USPI_UBH);
 	ubh_mark_buffer_dirty (UCPI_UBH);
 	if (sb->s_flags & MS_SYNCHRONOUS) {
+		ubh_wait_on_buffer (UCPI_UBH);
 		ubh_ll_rw_block (WRITE, 1, (struct ufs_buffer_head **)&ucpi);
 		ubh_wait_on_buffer (UCPI_UBH);
 	}
@@ -199,6 +200,7 @@ do_more:
 	ubh_mark_buffer_dirty (USPI_UBH);
 	ubh_mark_buffer_dirty (UCPI_UBH);
 	if (sb->s_flags & MS_SYNCHRONOUS) {
+		ubh_wait_on_buffer (UCPI_UBH);
 		ubh_ll_rw_block (WRITE, 1, (struct ufs_buffer_head **)&ucpi);
 		ubh_wait_on_buffer (UCPI_UBH);
 	}
@@ -228,10 +230,8 @@ failed:
 		memset (bh->b_data, 0, sb->s_blocksize); \
 		set_buffer_uptodate(bh); \
 		mark_buffer_dirty (bh); \
-		if (IS_SYNC(inode)) { \
-			ll_rw_block (WRITE, 1, &bh); \
-			wait_on_buffer (bh); \
-		} \
+		if (IS_SYNC(inode)) \
+			sync_dirty_buffer(bh); \
 		brelse (bh); \
 	}
 
@@ -364,10 +364,8 @@ unsigned ufs_new_fragments (struct inode
 				clear_buffer_dirty(bh);
 				bh->b_blocknr = result + i;
 				mark_buffer_dirty (bh);
-				if (IS_SYNC(inode)) {
-					ll_rw_block (WRITE, 1, &bh);
-					wait_on_buffer (bh);
-				}
+				if (IS_SYNC(inode))
+					sync_dirty_buffer(bh);
 				brelse (bh);
 			}
 			else
@@ -459,6 +457,7 @@ unsigned ufs_add_fragments (struct inode
 	ubh_mark_buffer_dirty (USPI_UBH);
 	ubh_mark_buffer_dirty (UCPI_UBH);
 	if (sb->s_flags & MS_SYNCHRONOUS) {
+		ubh_wait_on_buffer (UCPI_UBH);
 		ubh_ll_rw_block (WRITE, 1, (struct ufs_buffer_head **)&ucpi);
 		ubh_wait_on_buffer (UCPI_UBH);
 	}
@@ -584,6 +583,7 @@ succed:
 	ubh_mark_buffer_dirty (USPI_UBH);
 	ubh_mark_buffer_dirty (UCPI_UBH);
 	if (sb->s_flags & MS_SYNCHRONOUS) {
+		ubh_wait_on_buffer (UCPI_UBH);
 		ubh_ll_rw_block (WRITE, 1, (struct ufs_buffer_head **)&ucpi);
 		ubh_wait_on_buffer (UCPI_UBH);
 	}
diff -puN fs/ufs/dir.c~ll_rw_block-fix fs/ufs/dir.c
--- 25/fs/ufs/dir.c~ll_rw_block-fix	Wed Feb  5 16:27:03 2003
+++ 25-akpm/fs/ufs/dir.c	Wed Feb  5 16:27:03 2003
@@ -356,10 +356,8 @@ void ufs_set_link(struct inode *dir, str
 	dir->i_version++;
 	de->d_ino = cpu_to_fs32(dir->i_sb, inode->i_ino);
 	mark_buffer_dirty(bh);
-	if (IS_DIRSYNC(dir)) {
-		ll_rw_block (WRITE, 1, &bh);
-		wait_on_buffer(bh);
-	}
+	if (IS_DIRSYNC(dir))
+		sync_dirty_buffer(bh);
 	brelse (bh);
 }
 
@@ -457,10 +455,8 @@ int ufs_add_link(struct dentry *dentry, 
 	de->d_ino = cpu_to_fs32(sb, inode->i_ino);
 	ufs_set_de_type(sb, de, inode->i_mode);
 	mark_buffer_dirty(bh);
-	if (IS_DIRSYNC(dir)) {
-		ll_rw_block (WRITE, 1, &bh);
-		wait_on_buffer (bh);
-	}
+	if (IS_DIRSYNC(dir))
+		sync_dirty_buffer(bh);
 	brelse (bh);
 	dir->i_mtime = dir->i_ctime = CURRENT_TIME;
 	dir->i_version++;
@@ -508,10 +504,8 @@ int ufs_delete_entry (struct inode * ino
 			inode->i_ctime = inode->i_mtime = CURRENT_TIME;
 			mark_inode_dirty(inode);
 			mark_buffer_dirty(bh);
-			if (IS_DIRSYNC(inode)) {
-				ll_rw_block(WRITE, 1, &bh);
-				wait_on_buffer(bh);
-			}
+			if (IS_DIRSYNC(inode))
+				sync_dirty_buffer(bh);
 			brelse(bh);
 			UFSD(("EXIT\n"))
 			return 0;
diff -puN fs/ufs/ialloc.c~ll_rw_block-fix fs/ufs/ialloc.c
--- 25/fs/ufs/ialloc.c~ll_rw_block-fix	Wed Feb  5 16:27:03 2003
+++ 25-akpm/fs/ufs/ialloc.c	Wed Feb  5 16:27:03 2003
@@ -124,6 +124,7 @@ void ufs_free_inode (struct inode * inod
 	ubh_mark_buffer_dirty (USPI_UBH);
 	ubh_mark_buffer_dirty (UCPI_UBH);
 	if (sb->s_flags & MS_SYNCHRONOUS) {
+		ubh_wait_on_buffer (UCPI_UBH);
 		ubh_ll_rw_block (WRITE, 1, (struct ufs_buffer_head **) &ucpi);
 		ubh_wait_on_buffer (UCPI_UBH);
 	}
@@ -248,6 +249,7 @@ cg_found:
 	ubh_mark_buffer_dirty (USPI_UBH);
 	ubh_mark_buffer_dirty (UCPI_UBH);
 	if (sb->s_flags & MS_SYNCHRONOUS) {
+		ubh_wait_on_buffer (UCPI_UBH);
 		ubh_ll_rw_block (WRITE, 1, (struct ufs_buffer_head **) &ucpi);
 		ubh_wait_on_buffer (UCPI_UBH);
 	}
diff -puN fs/ufs/inode.c~ll_rw_block-fix fs/ufs/inode.c
--- 25/fs/ufs/inode.c~ll_rw_block-fix	Wed Feb  5 16:27:03 2003
+++ 25-akpm/fs/ufs/inode.c	Wed Feb  5 16:27:03 2003
@@ -298,10 +298,8 @@ repeat:
 	}
 
 	mark_buffer_dirty(bh);
-	if (IS_SYNC(inode)) {
-		ll_rw_block (WRITE, 1, &bh);
-		wait_on_buffer (bh);
-	}
+	if (IS_SYNC(inode))
+		sync_dirty_buffer(bh);
 	inode->i_ctime = CURRENT_TIME;
 	mark_inode_dirty(inode);
 out:
@@ -635,10 +633,8 @@ static int ufs_update_inode(struct inode
 		memset (ufs_inode, 0, sizeof(struct ufs_inode));
 		
 	mark_buffer_dirty(bh);
-	if (do_sync) {
-		ll_rw_block (WRITE, 1, &bh);
-		wait_on_buffer (bh);
-	}
+	if (do_sync)
+		sync_dirty_buffer(bh);
 	brelse (bh);
 	
 	UFSD(("EXIT\n"))
diff -puN fs/ufs/truncate.c~ll_rw_block-fix fs/ufs/truncate.c
--- 25/fs/ufs/truncate.c~ll_rw_block-fix	Wed Feb  5 16:27:03 2003
+++ 25-akpm/fs/ufs/truncate.c	Wed Feb  5 16:27:03 2003
@@ -284,6 +284,7 @@ next:;
 		}
 	}
 	if (IS_SYNC(inode) && ind_ubh && ubh_buffer_dirty(ind_ubh)) {
+		ubh_wait_on_buffer (ind_ubh);
 		ubh_ll_rw_block (WRITE, 1, &ind_ubh);
 		ubh_wait_on_buffer (ind_ubh);
 	}
@@ -351,6 +352,7 @@ static int ufs_trunc_dindirect (struct i
 		}
 	}
 	if (IS_SYNC(inode) && dind_bh && ubh_buffer_dirty(dind_bh)) {
+		ubh_wait_on_buffer (dind_bh);
 		ubh_ll_rw_block (WRITE, 1, &dind_bh);
 		ubh_wait_on_buffer (dind_bh);
 	}
@@ -415,6 +417,7 @@ static int ufs_trunc_tindirect (struct i
 		}
 	}
 	if (IS_SYNC(inode) && tind_bh && ubh_buffer_dirty(tind_bh)) {
+		ubh_wait_on_buffer (tind_bh);
 		ubh_ll_rw_block (WRITE, 1, &tind_bh);
 		ubh_wait_on_buffer (tind_bh);
 	}
diff -puN include/linux/buffer_head.h~ll_rw_block-fix include/linux/buffer_head.h
--- 25/include/linux/buffer_head.h~ll_rw_block-fix	Wed Feb  5 16:27:03 2003
+++ 25-akpm/include/linux/buffer_head.h	Wed Feb  5 16:27:03 2003
@@ -169,6 +169,7 @@ struct buffer_head *alloc_buffer_head(vo
 void free_buffer_head(struct buffer_head * bh);
 void FASTCALL(unlock_buffer(struct buffer_head *bh));
 void ll_rw_block(int, int, struct buffer_head * bh[]);
+void sync_dirty_buffer(struct buffer_head *bh);
 int submit_bh(int, struct buffer_head *);
 void write_boundary_block(struct block_device *bdev,
 			sector_t bblock, unsigned blocksize);
diff -puN include/linux/hfs_sysdep.h~ll_rw_block-fix include/linux/hfs_sysdep.h
--- 25/include/linux/hfs_sysdep.h~ll_rw_block-fix	Wed Feb  5 16:27:03 2003
+++ 25-akpm/include/linux/hfs_sysdep.h	Wed Feb  5 16:27:03 2003
@@ -155,13 +155,8 @@ static inline void hfs_buffer_dirty(hfs_
 }
 
 static inline void hfs_buffer_sync(hfs_buffer buffer) {
-	while (buffer_locked(buffer)) {
-		wait_on_buffer(buffer);
-	}
-	if (buffer_dirty(buffer)) {
-		ll_rw_block(WRITE, 1, &buffer);
-		wait_on_buffer(buffer);
-	}
+	if (buffer_dirty(buffer))
+		sync_dirty_buffer(buffer);
 }
 
 static inline void *hfs_buffer_data(const hfs_buffer buffer) {
diff -puN kernel/ksyms.c~ll_rw_block-fix kernel/ksyms.c
--- 25/kernel/ksyms.c~ll_rw_block-fix	Wed Feb  5 16:27:03 2003
+++ 25-akpm/kernel/ksyms.c	Wed Feb  5 16:27:03 2003
@@ -208,6 +208,7 @@ EXPORT_SYMBOL(close_bdev_excl);
 EXPORT_SYMBOL(__brelse);
 EXPORT_SYMBOL(__bforget);
 EXPORT_SYMBOL(ll_rw_block);
+EXPORT_SYMBOL(sync_dirty_buffer);
 EXPORT_SYMBOL(submit_bh);
 EXPORT_SYMBOL(unlock_buffer);
 EXPORT_SYMBOL(__wait_on_buffer);

_



* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race
  2003-02-10 20:40           ` Andrew Morton
@ 2003-02-10 21:18             ` Andrea Arcangeli
  2003-02-10 21:44               ` Andrew Morton
  0 siblings, 1 reply; 16+ messages in thread
From: Andrea Arcangeli @ 2003-02-10 21:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linus Torvalds, mikulas, pavel, linux-kernel

On Mon, Feb 10, 2003 at 12:40:00PM -0800, Andrew Morton wrote:
> 	void sync_dirty_buffer(struct buffer_head *bh)
> 	{
> 		lock_buffer(bh);
> 		if (test_clear_buffer_dirty(bh)) {
> 			get_bh(bh);
> 			bh->b_end_io = end_buffer_io_sync;
> 			submit_bh(WRITE, bh);
> 		} else {
> 			unlock_buffer(bh);
> 		}
> 	}

If we don't take the lock around mark_buffer_dirty as Linus
suggested (to avoid serializing in the non-sync case), why don't you
simply add lock_buffer() to ll_rw_block() as we suggested originally,
and #define sync_dirty_buffer as ll_rw_block+wait_on_buffer if you
really want to make the cleanup?

There would be no difference. I don't see the need for the above
specialized ll_rw_block clone just for the O_SYNC usage; O_SYNC isn't
that special a case. lock_buffer in ll_rw_block makes more sense to me;
leaving the test-and-set bit in there and having a secondary
ll_rw_block with different behaviour just for O_SYNC doesn't look that
clean to me.

Especially in 2.4 I wouldn't like to make the change below, which is
100% equivalent to a one-liner patch that just adds lock_buffer()
instead of the test-and-set bit (for reads I see no problem either).

BTW, Linus's approach of taking the lock around the data modifications
(unconditionally) would also enforce metadata coherency, so it would
provide an additional coherency guarantee (but that's not directly
related to this problem and it may be overkill). Normally we always
allow in-core modifications of the buffer during write-I/O to disk
(also for the data in the pagecache). Only the journal commits must be
very careful to avoid that (just as applications must be careful to run
fsync and not to overwrite the data during the fsync). So normally,
taking the lock around the in-core modification and mark_buffer_dirty
would be overkill IMHO.

Andrea


* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race
  2003-02-10 21:18             ` Andrea Arcangeli
@ 2003-02-10 21:44               ` Andrew Morton
  2003-02-10 21:59                 ` Andrea Arcangeli
  0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2003-02-10 21:44 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: torvalds, mikulas, pavel, linux-kernel

Andrea Arcangeli <andrea@suse.de> wrote:
>
> On Mon, Feb 10, 2003 at 12:40:00PM -0800, Andrew Morton wrote:
> > 	void sync_dirty_buffer(struct buffer_head *bh)
> > 	{
> > 		lock_buffer(bh);
> > 		if (test_clear_buffer_dirty(bh)) {
> > 			get_bh(bh);
> > 			bh->b_end_io = end_buffer_io_sync;
> > 			submit_bh(WRITE, bh);
> > 		} else {
> > 			unlock_buffer(bh);
> > 		}
> > 	}
> 
> If we don't take the lock around mark_buffer_dirty as Linus
> suggested (to avoid serializing in the non-sync case), why don't you
> simply add lock_buffer() to ll_rw_block() as we suggested originally

That is undesirable for READA.

> and
> you #define sync_dirty_buffer as ll_rw_block+wait_on_buffer if you
> really want to make the cleanup?

Linux 2.4 tends to contain costly confusion between writeout for memory
cleansing and writeout for data integrity.

In 2.5 I have been trying to make it very clear and explicit that these are
fundamentally different things.

> ...
> Especially in 2.4 I wouldn't like to make the below change that is
> 100% equivalent to a one liner patch that just adds lock_buffer()
> instead of the test-and-set-bit (for reads I see no problems either).

That'd probably be OK, with a don't-do-that for READA.

> BTW, Linus's way that suggests the lock around the data modifications
> (unconditionally), would also enforce metadata coherency so it would
> provide an additional coherency guarantee (but it's not directly related
> to this problem and it may be overkill). Normally we always allow
> in-core modifications of the buffer during write-IO to disk (also for
> the data in pagecache). Only the journal commits must be very careful in
> avoiding that (like applications must be careful to run fsync and not to
> overwrite the data during the fsync). So normally taking the lock around
> the in-core modification and mark_buffer_dirty, would be overkill IMHO.

Yup.  Except for a non-uptodate buffer.  If software is bringing a
non-uptodate buffer uptodate by hand it should generally be locked, else a
concurrent read may stomp on the changes.  There are few places where this
happens.
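
A minimal sketch of that pattern (an assumed example, not code from the
patch above): the contents are constructed by hand under the buffer
lock, so a concurrent read completion cannot overwrite them.

	struct buffer_head *bh = sb_getblk(sb, block);	/* not uptodate yet */

	lock_buffer(bh);
	if (!buffer_uptodate(bh)) {
		memset(bh->b_data, 0, bh->b_size);	/* fill by hand */
		set_buffer_uptodate(bh);
	}
	mark_buffer_dirty(bh);
	unlock_buffer(bh);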



* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race
  2003-02-10 21:44               ` Andrew Morton
@ 2003-02-10 21:59                 ` Andrea Arcangeli
  2003-03-11 13:58                   ` Andrea Arcangeli
  0 siblings, 1 reply; 16+ messages in thread
From: Andrea Arcangeli @ 2003-02-10 21:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: torvalds, mikulas, pavel, linux-kernel

On Mon, Feb 10, 2003 at 01:44:34PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > On Mon, Feb 10, 2003 at 12:40:00PM -0800, Andrew Morton wrote:
> > > 	void sync_dirty_buffer(struct buffer_head *bh)
> > > 	{
> > > 		lock_buffer(bh);
> > > 		if (test_clear_buffer_dirty(bh)) {
> > > 			get_bh(bh);
> > > 			bh->b_end_io = end_buffer_io_sync;
> > > 			submit_bh(WRITE, bh);
> > > 		} else {
> > > 			unlock_buffer(bh);
> > > 		}
> > > 	}
> > 
> > If we don't take the lock around mark_buffer_dirty as Linus
> > suggested (to avoid serializing in the non-sync case), why don't you
> > simply add lock_buffer() to ll_rw_block() as we suggested originally
> 
> That is undesirable for READA.

In 2.4 we killed READA some releases ago because of races:

		case READA:
#if 0	/* bread() misinterprets failed READA attempts as IO errors on SMP */
			rw_ahead = 1;
#endif

so I wasn't focusing on it.

> > and
> > you #define sync_dirty_buffer as ll_rw_block+wait_on_buffer if you
> > really want to make the cleanup?
> 
> Linux 2.4 tends to contain costly confusion between writeout for memory
> cleansing and writeout for data integrity.
> 
> In 2.5 I have been trying to make it very clear and explicit that these are
> fundamentally different things.

Yes, and these are in the memory-cleansing area (only the journal
commits [and somehow the superblock again in journaling] need to be
writeouts for data integrity). Data doesn't need to provide data
integrity either; userspace has to take care of that.

> 
> > ...
> > Especially in 2.4 I wouldn't like to make the below change that is
> > 100% equivalent to a one liner patch that just adds lock_buffer()
> > instead of the test-and-set-bit (for reads I see no problems either).
> 
> That'd probably be OK, with a dont-do-that for READA.

Not an issue in 2.4. In 2.5 the bio layer still provides READA though,
so it would be wrong to wait_on_buffer there.

> > BTW, Linus's way that suggests the lock around the data modifications
> > (unconditionally), would also enforce metadata coherency so it would
> > provide an additional coherency guarantee (but it's not directly related
> > to this problem and it may be overkill). Normally we always allow
> > in-core modifications of the buffer during write-IO to disk (also for
> > the data in pagecache). Only the journal commits must be very careful in
> > avoiding that (like applications must be careful to run fsync and not to
> > overwrite the data during the fsync). So normally taking the lock around
> > the in-core modification and mark_buffer_dirty, would be overkill IMHO.
> 
> Yup.  Except for a non-uptodate buffer.  If software is bringing a
> non-uptodate buffer uptodate by hand it should generally be locked, else a
> concurrent read may stomp on the changes.  There are few places where this
> happens.

I recall I fixed all those cases where we write a non-uptodate buffer
under reads in ext2 in 2.2, with proper wait_on_buffer calls before
modifying the buffer. That was enough to guarantee that any underlying
read would complete before we touched the buffer. However, today with
threading (if they're not protected against reads by the big kernel
lock) they should all be converted to lock_buffer(). They were easy to
spot by grepping for getblk, IIRC.

Andrea


* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race
  2003-02-10 21:59                 ` Andrea Arcangeli
@ 2003-03-11 13:58                   ` Andrea Arcangeli
  2003-03-14  6:42                     ` Andrea Arcangeli
  0 siblings, 1 reply; 16+ messages in thread
From: Andrea Arcangeli @ 2003-03-11 13:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: torvalds, mikulas, pavel, linux-kernel

On Mon, Feb 10, 2003 at 10:59:40PM +0100, Andrea Arcangeli wrote:
> On Mon, Feb 10, 2003 at 01:44:34PM -0800, Andrew Morton wrote:
> > Andrea Arcangeli <andrea@suse.de> wrote:
> > >
> > > On Mon, Feb 10, 2003 at 12:40:00PM -0800, Andrew Morton wrote:
> > > > 	void sync_dirty_buffer(struct buffer_head *bh)
> > > > 	{
> > > > 		lock_buffer(bh);
> > > > 		if (test_clear_buffer_dirty(bh)) {
> > > > 			get_bh(bh);
> > > > 			bh->b_end_io = end_buffer_io_sync;
> > > > 			submit_bh(WRITE, bh);
> > > > 		} else {
> > > > 			unlock_buffer(bh);
> > > > 		}
> > > > 	}
> > > 
> > > If we don't take the lock around mark_buffer_dirty as Linus
> > > suggested (to avoid serializing in the non-sync case), why don't you
> > > simply add lock_buffer() to ll_rw_block() as we suggested originally
> > 
> > That is undesirable for READA.
> 
> In 2.4 we killed READA some releases ago because of races:
> 
> 		case READA:
> #if 0	/* bread() misinterprets failed READA attempts as IO errors on SMP */
> 			rw_ahead = 1;
> #endif
> 
> so I wasn't focusing on it.
> 
> > > and
> > > you #define sync_dirty_buffer as ll_rw_block+wait_on_buffer if you
> > > really want to make the cleanup?
> > 
> > Linux 2.4 tends to contain costly confusion between writeout for memory
> > cleansing and writeout for data integrity.
> > 
> > In 2.5 I have been trying to make it very clear and explicit that these are
> > fundamentally different things.
> 
> yes, and these are in the memory cleansing area (only the journal
> commits [and again the superblock, in journaling] need to be
> writeouts for data integrity). Data doesn't need to provide data
> integrity either; userspace has to take care of that.

I've been experimenting with lock_buffer() in ll_rw_block for a month
(not only for the sync writes, as in Andrew's patch, but for everything
including reads), and I believe this is what triggered a deadlock on my
most stressed machine last night.

I've attached below the best debug info I've collected so far (I have the
full SYSRQ+T output too, of course, but the excerpt below should make it
easier to focus on the right place).

I'm going to back out that patch
(http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21pre4aa3/00_ll_rw_block-sync-race-1)
until this issue is resolved.

One simple case where it could introduce a deadlock is when a locked
buffer visible to the rest of the fs/vm isn't immediately inserted into
the I/O queue: if two tasks run that code at the same time, they can
deadlock against each other. Another possibility is that the journal is
running ll_rw_block on a locked buffer not yet in the I/O queue and
expects not to deadlock for whatever reason. In either case I believe
this is an issue in ext3 that would better be fixed there, but with the
00_ll_rw_block patch backed out it isn't necessarily a bug. Note: it may
be something else in my tree that deadlocked the box, but I really doubt
it; the wait_on_buffer called from ll_rw_block in the first trace line
makes it quite obvious that the new behaviour is involved in the
deadlock.
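
To make the first case concrete, a schematic of the ordering (a
hypothetical call sequence, not taken from the traces below):

	/* task A */
	lock_buffer(bh1);		/* bh1 locked, not yet submitted */
	ll_rw_block(WRITE, 1, &bh2);	/* with the patch: sleeps in
					 * lock_buffer(bh2) */

	/* task B */
	lock_buffer(bh2);		/* bh2 locked, not yet submitted */
	ll_rw_block(WRITE, 1, &bh1);	/* sleeps in lock_buffer(bh1) */

	/*
	 * Neither bh1 nor bh2 ever reaches the I/O queue, so neither
	 * buffer lock is ever released: a classic ABBA deadlock.
	 */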

Any help from the ext3 developers and everybody involved in this thread
(the 2.0, 2.2, 2.4 fsync races) is welcome, thanks!

(Of course the tasks blocked in the VFS down() shouldn't matter; the
wait_on_buffer traces are the interesting ones here, but since those
tasks were in D state too I didn't cut them out.)

mutt          D 00000292     0 14627  24890                     (NOTLB)
Call Trace:    [__wait_on_buffer+86/144] [ext3_free_inode+391/1248] [ll_rw_block+309/464] [ext3_find_entry+828/864] [ext3_lookup+65/208]
  [lookup_hash+194/288] [sys_unlink+139/288] [system_call+51/56]
tee           D ECD560C0     0   111  11445                 110 (NOTLB)
Call Trace:    [journal_dirty_metadata+420/608] [__wait_on_buffer+86/144] [__cleanup_transaction+362/368] [journal_cancel_revoke+91/448] [log_do_checkpoint+299/448]
  [n_tty_receive_buf+400/1808] [journal_dirty_metadata+420/608] [n_tty_receive_buf+400/1808] [journal_dirty_metadata+420/608] [ext3_do_update_inode+375/992] [__jbd_kmalloc+35/176]
  [log_wait_for_space+130/144] [start_this_handle+191/864] [new_handle+75/112] [journal_start+165/208] [ext3_prepare_write+567/576] [write_chan+335/512]
  [generic_file_write_nolock+1023/2032] [pipe_wait+125/160] [generic_file_write+77/112] [ext3_file_write+57/224] [sys_write+163/336] [system_call+51/56]
python        D 00000180     0 21334    112 21336   21335       (NOTLB)
Call Trace:    [start_request+409/544] [__down+105/192] [__down_failed+8/12] [.text.lock.checkpoint+15/173] [start_this_handle+191/864]
  [new_handle+75/112] [journal_start+165/208] [ext3_lookup+0/208] [ext3_create+300/320] [vfs_create+153/304] [open_namei+1301/1456]
  [filp_open+62/112] [sys_open+83/192] [system_call+51/56]
python        D E2EB302C  2388 21335    112 21337         21334 (NOTLB)
Call Trace:    [dput+33/320] [__down+105/192] [__down_failed+8/12] [.text.lock.namei+265/988] [filp_open+62/112]
  [sys_open+83/192] [system_call+51/56]
procmail      D E51AC016     0 21378      1               19102 (NOTLB)
Call Trace:    [dput+33/320] [__down+105/192] [__down_failed+8/12] [.text.lock.namei+265/988] [filp_open+62/112]
  [sys_open+83/192] [system_call+51/56]
python        D C01186FF     0 21470    107         21471       (NOTLB)
Call Trace:    [do_page_fault+463/1530] [__down+105/192] [__jbd_kmalloc+35/176] [__down_failed+8/12] [.text.lock.checkpoint+15/173]
  [start_this_handle+191/864] [new_handle+75/112] [journal_start+165/208] [do_wp_page+416/1088] [ext3_dirty_inode+336/368] [__mark_inode_dirty+195/208]
  [update_inode_times+39/64] [generic_file_write_nolock+630/2032] [generic_file_write+77/112] [ext3_file_write+57/224] [sys_write+163/336] [system_call+51/56]
python        D C01186FF     0 21471    107               21470 (NOTLB)
Call Trace:    [do_page_fault+463/1530] [__down+105/192] [__jbd_kmalloc+35/176] [__down_failed+8/12] [.text.lock.checkpoint+15/173]
  [start_this_handle+191/864] [new_handle+75/112] [journal_start+165/208] [do_wp_page+416/1088] [ext3_dirty_inode+336/368] [__mark_inode_dirty+195/208]
  [update_inode_times+39/64] [generic_file_write_nolock+630/2032] [generic_file_write+77/112] [ext3_file_write+57/224] [sys_write+163/336] [system_call+51/56]
python        D 000001E0     0 21732  21731                     (NOTLB)
Call Trace:    [bread+32/160] [__down+105/192] [__jbd_kmalloc+35/176] [__down_failed+8/12] [.text.lock.checkpoint+15/173]
  [start_this_handle+191/864] [new_handle+75/112] [journal_start+165/208] [dput+33/320] [ext3_unlink+440/448] [vfs_unlink+233/480]
  [sys_unlink+183/288] [system_call+51/56]
procmail      D D0E13016     0 21914  21913                     (NOTLB)
Call Trace:    [dput+33/320] [__down+105/192] [__down_failed+8/12] [.text.lock.namei+265/988] [filp_open+62/112]
  [sys_open+83/192] [system_call+51/56]
procmail      D E39F3016     0 21999  21998                     (NOTLB)
Call Trace:    [dput+33/320] [__down+105/192] [__down_failed+8/12] [.text.lock.namei+265/988] [filp_open+62/112]
  [sys_open+83/192] [system_call+51/56]

Andrea

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: 2.0, 2.2, 2.4, 2.5: fsync buffer race
  2003-03-11 13:58                   ` Andrea Arcangeli
@ 2003-03-14  6:42                     ` Andrea Arcangeli
  0 siblings, 0 replies; 16+ messages in thread
From: Andrea Arcangeli @ 2003-03-14  6:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: torvalds, mikulas, pavel, linux-kernel

Hi!

> mutt          D 00000292     0 14627  24890                     (NOTLB)
> Call Trace:    [__wait_on_buffer+86/144] [ext3_free_inode+391/1248] [ll_rw_block+309/464] [ext3_find_entry+828/864] [ext3_lookup+65/208]
>   [lookup_hash+194/288] [sys_unlink+139/288] [system_call+51/56]
> tee           D ECD560C0     0   111  11445                 110 (NOTLB)
> Call Trace:    [journal_dirty_metadata+420/608] [__wait_on_buffer+86/144] [__cleanup_transaction+362/368] [journal_cancel_revoke+91/448] [log_do_checkpoint+299/448]
>   [n_tty_receive_buf+400/1808] [journal_dirty_metadata+420/608] [n_tty_receive_buf+400/1808] [journal_dirty_metadata+420/608] [ext3_do_update_inode+375/992] [__jbd_kmalloc+35/176]
>   [log_wait_for_space+130/144] [start_this_handle+191/864] [new_handle+75/112] [journal_start+165/208] [ext3_prepare_write+567/576] [write_chan+335/512]
>   [generic_file_write_nolock+1023/2032] [pipe_wait+125/160] [generic_file_write+77/112] [ext3_file_write+57/224] [sys_write+163/336] [system_call+51/56]
> python        D 00000180     0 21334    112 21336   21335       (NOTLB)
> Call Trace:    [start_request+409/544] [__down+105/192] [__down_failed+8/12] [.text.lock.checkpoint+15/173] [start_this_handle+191/864]
>   [new_handle+75/112] [journal_start+165/208] [ext3_lookup+0/208] [ext3_create+300/320] [vfs_create+153/304] [open_namei+1301/1456]
>   [filp_open+62/112] [sys_open+83/192] [system_call+51/56]
> python        D E2EB302C  2388 21335    112 21337         21334 (NOTLB)
> Call Trace:    [dput+33/320] [__down+105/192] [__down_failed+8/12] [.text.lock.namei+265/988] [filp_open+62/112]
>   [sys_open+83/192] [system_call+51/56]
> procmail      D E51AC016     0 21378      1               19102 (NOTLB)
> Call Trace:    [dput+33/320] [__down+105/192] [__down_failed+8/12] [.text.lock.namei+265/988] [filp_open+62/112]
>   [sys_open+83/192] [system_call+51/56]
> python        D C01186FF     0 21470    107         21471       (NOTLB)
> Call Trace:    [do_page_fault+463/1530] [__down+105/192] [__jbd_kmalloc+35/176] [__down_failed+8/12] [.text.lock.checkpoint+15/173]
>   [start_this_handle+191/864] [new_handle+75/112] [journal_start+165/208] [do_wp_page+416/1088] [ext3_dirty_inode+336/368] [__mark_inode_dirty+195/208]
>   [update_inode_times+39/64] [generic_file_write_nolock+630/2032] [generic_file_write+77/112] [ext3_file_write+57/224] [sys_write+163/336] [system_call+51/56]
> python        D C01186FF     0 21471    107               21470 (NOTLB)
> Call Trace:    [do_page_fault+463/1530] [__down+105/192] [__jbd_kmalloc+35/176] [__down_failed+8/12] [.text.lock.checkpoint+15/173]
>   [start_this_handle+191/864] [new_handle+75/112] [journal_start+165/208] [do_wp_page+416/1088] [ext3_dirty_inode+336/368] [__mark_inode_dirty+195/208]
>   [update_inode_times+39/64] [generic_file_write_nolock+630/2032] [generic_file_write+77/112] [ext3_file_write+57/224] [sys_write+163/336] [system_call+51/56]
> python        D 000001E0     0 21732  21731                     (NOTLB)
> Call Trace:    [bread+32/160] [__down+105/192] [__jbd_kmalloc+35/176] [__down_failed+8/12] [.text.lock.checkpoint+15/173]
>   [start_this_handle+191/864] [new_handle+75/112] [journal_start+165/208] [dput+33/320] [ext3_unlink+440/448] [vfs_unlink+233/480]
>   [sys_unlink+183/288] [system_call+51/56]
> procmail      D D0E13016     0 21914  21913                     (NOTLB)
> Call Trace:    [dput+33/320] [__down+105/192] [__down_failed+8/12] [.text.lock.namei+265/988] [filp_open+62/112]
>   [sys_open+83/192] [system_call+51/56]
> procmail      D E39F3016     0 21999  21998                     (NOTLB)
> Call Trace:    [dput+33/320] [__down+105/192] [__down_failed+8/12] [.text.lock.namei+265/988] [filp_open+62/112]
>   [sys_open+83/192] [system_call+51/56]

Never mind, I found the real bug, and it's neither in the fsync fix nor in ext3.

This other, more recent, cleaned-up trace made the real bug more obvious,
since less was involved on the fs side (because it wasn't an fs problem
after all):

gzip          D CEAE5740  1616  1234   1229                     (NOTLB)
Call Trace:    [ext3_new_block+1062/2448] [do_schedule+284/416] [__wait_on_buffer+84/144] [__cleanup_transaction+252/256] [log_do_checkpoint+286/432]
  [ide_build_sglist+156/512] [__ide_dma_begin+57/80] [__ide_dma_count+22/32] [__ide_dma_write+326/352] [dma_timer_expiry+0/224] [do_rw_disk+884/1712]
  [context_switch+131/320] [ide_wait_stat+159/304] [start_request+409/544] [__jbd_kmalloc+35/176] [log_wait_for_space+108/128] [start_this_handle+184/848]
  [new_handle+75/112] [journal_start+165/208] [__alloc_pages+100/624] [ext3_prepare_write+336/352] [lru_cache_add+99/112] [generic_file_write_nolock+999/1856]
  [generic_file_write+76/112] [ext3_file_write+57/224] [sys_write+163/304] [system_call+51/56]
shutdown      D DFE7D940     0  2308      1  2309          2169 (NOTLB)
Call Trace:    [do_schedule+284/416] [__wait_on_buffer+84/144] [__refile_buffer+86/112] [wait_for_buffers+172/176] [wait_for_locked_buffers+35/48]
  [sync_buffers+106/112] [fsync_dev+68/80] [sys_sync+15/32] [system_call+51/56]
sshd          D D0CC93DC     0  2338    601  2339    2340  1226 (NOTLB)
Call Trace:    [do_schedule+284/416] [__down+90/160] [__jbd_kmalloc+35/176] [__down_failed+8/12] [.text.lock.checkpoint+15/109]
  [start_this_handle+184/848] [new_handle+75/112] [journal_start+165/208] [ext3_dirty_inode+276/288] [__mark_inode_dirty+150/160] [update_atime+94/96]
  [generic_file_read+174/352] [file_read_actor+0/144] [sys_read+163/304] [sys_fstat64+73/128] [system_call+51/56]
sync          D DFF858C0    80  2422   2394                     (NOTLB)
Call Trace:    [do_schedule+284/416] [__wait_on_buffer+84/144] [wait_for_buffers+172/176] [wait_for_locked_buffers+35/48] [sync_buffers+106/112]
  [fsync_dev+68/80] [sys_sync+15/32] [system_call+51/56]

sshd waits for gzip on the journal lock; gzip and shutdown wait on a
buffer that was supposed to be submitted by `sync`, but `sync` never
really submitted it, it only locked it. This bug is present only in
2.4.21pre4aa3 (no previous release had this problem). It was extremely
hard to trigger, so it's not surprising that few people noticed it,
especially on fast machines. The lowlatency improvements in
write_some_buffers zeroed out 'count' on the second pass, so we lost
track of those bhs if need_resched was set during the write_some_buffers
pass. This will be fixed in my next tree, which I can now comfortably
release back in rock-solid state, with the fsync-race fix re-added ;)

Here is the untested fix against 2.4.21pre4aa3 that I will include in
2.4.21pre5aa1:

--- x/fs/buffer.c.~1~	2003-03-13 06:15:29.000000000 +0100
+++ x/fs/buffer.c	2003-03-14 07:36:40.000000000 +0100
@@ -236,10 +236,10 @@ static int write_some_buffers(kdev_t dev
 	unsigned int count;
 	int nr, num_sched = 0;
 
+	count = 0;
  restart:
 	next = lru_list[BUF_DIRTY];
 	nr = nr_buffers_type[BUF_DIRTY];
-	count = 0;
 	while (next && --nr >= 0) {
 		struct buffer_head * bh = next;
 		next = bh->b_next_free;
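
For readers without the -aa tree at hand, a simplified sketch of the loop
in question (the structure approximates the stock 2.4 write_some_buffers()
plus a paraphrased lowlatency restart; it is illustrative, not the exact
2.4.21pre4aa3 code -- only the position of 'count = 0' matters):

	struct buffer_head *array[NRSYNC];
	struct buffer_head *next;
	unsigned int count;
	int nr;

	count = 0;			/* after the fix: initialised once */
 restart:
	next = lru_list[BUF_DIRTY];
	nr = nr_buffers_type[BUF_DIRTY];
	/* before the fix, 'count = 0' sat here, forgetting the buffers
	 * already locked into array[] on an earlier pass */
	while (next && --nr >= 0) {
		struct buffer_head *bh = next;
		next = bh->b_next_free;

		if (!buffer_dirty(bh) ||
		    test_and_set_bit(BH_Lock, &bh->b_state))
			continue;
		get_bh(bh);
		array[count++] = bh;	/* locked: must reach submission */
		if (count == NRSYNC)
			break;
		if (current->need_resched) {
			schedule();	/* lowlatency: give up the CPU */
			goto restart;	/* array[0..count-1] must survive */
		}
	}
	if (count)
		write_locked_buffers(array, count);	/* submits, unlocks */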

Sorry for the noise.

Andrea

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2003-03-14  6:32 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-02-02 23:32 2.0, 2.2, 2.4, 2.5: fsync buffer race Mikulas Patocka
2003-02-03  0:00 ` Andrew Morton
2003-02-03  1:13   ` Mikulas Patocka
2003-02-03  1:20     ` Andrew Morton
2003-02-03  9:29       ` Mikulas Patocka
2003-02-04 23:16   ` Pavel Machek
2003-02-05 15:13     ` Mikulas Patocka
2003-02-10 13:07     ` Andrea Arcangeli
2003-02-10 16:28       ` Mikulas Patocka
2003-02-10 16:57         ` Linus Torvalds
2003-02-10 20:40           ` Andrew Morton
2003-02-10 21:18             ` Andrea Arcangeli
2003-02-10 21:44               ` Andrew Morton
2003-02-10 21:59                 ` Andrea Arcangeli
2003-03-11 13:58                   ` Andrea Arcangeli
2003-03-14  6:42                     ` Andrea Arcangeli
