linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: Test12 ll_rw_block error.
       [not found] <3A398F84.2CD3039D@thebarn.com>
@ 2000-12-15  6:20 ` Linus Torvalds
  2000-12-15  7:00   ` Alexander Viro
  2000-12-17  0:21   ` Russell Cattelan
  0 siblings, 2 replies; 57+ messages in thread
From: Linus Torvalds @ 2000-12-15  6:20 UTC (permalink / raw)
  To: Russell Cattelan; +Cc: linux-kernel



On Thu, 14 Dec 2000, Russell Cattelan wrote:
> 
> Ok one more wrinkle.
> sync_buffers calls ll_rw_block, this is going to have the same problem as
> calling ll_rw_block directly.

Good point. 

This actually looks fairly nasty to fix. The obvious fix would be to not
put such buffers on the dirty list at all, and instead rely on the VM
layer calling "writepage()" when it wants to push out the pages.
That would be the nice behaviour from a VM standpoint.

However, that assumes that you don't have any "anonymous" buffers, which
is probably an unrealistic assumption.

The problem is that we don't have any per-buffer "writebuffer()" function,
the way we have them per-page. It was never needed for any of the normal
filesystems, and XFS just happened to be able to take advantage of the
b_end_io behaviour.

Suggestions welcome. 

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Test12 ll_rw_block error.
  2000-12-15  6:20 ` Test12 ll_rw_block error Linus Torvalds
@ 2000-12-15  7:00   ` Alexander Viro
  2000-12-15  9:14     ` Chris Mason
                       ` (2 more replies)
  2000-12-17  0:21   ` Russell Cattelan
  1 sibling, 3 replies; 57+ messages in thread
From: Alexander Viro @ 2000-12-15  7:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Stephen C. Tweedie, Russell Cattelan, linux-kernel



On Thu, 14 Dec 2000, Linus Torvalds wrote:

> Good point. 
> 
> This actually looks fairly nasty to fix. The obvious fix would be to not
> put such buffers on the dirty list at all, and instead rely on the VM
> layer calling "writepage()" when it wants to push out the pages.
> That would be the nice behaviour from a VM standpoint.
> 
> However, that assumes that you don't have any "anonymous" buffers, which
> is probably an unrealistic assumption.
> 
> The problem is that we don't have any per-buffer "writebuffer()" function,
> the way we have them per-page. It was never needed for any of the normal
> filesystems, and XFS just happened to be able to take advantage of the
> b_end_io behaviour.
> 
> Suggestions welcome. 

Just one: any fs that really cares about completion callback is very likely
to be picky about the requests ordering. So sync_buffers() is very unlikely
to be useful anyway.

In that sense we really don't have anonymous buffers here. I seriously
suspect that "unrealistic" assumption is not unrealistic at all. I'm
not sufficiently familiar with XFS code to say for sure, but...

What we really need is a way for VFS/VM to pass the pressure on filesystem.
That's it. If fs wants unusual completions for requests - let it have its
own queueing mechanism and submit these requests when it finds that convenient.

Stephen, you probably already thought about that area. Could you comment on
that?
								Cheers,
									Al

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Test12 ll_rw_block error.
  2000-12-15  7:00   ` Alexander Viro
@ 2000-12-15  9:14     ` Chris Mason
  2000-12-17  0:54       ` Russell Cattelan
  2000-12-15 10:51     ` Stephen C. Tweedie
  2000-12-17  0:51     ` Test12 ll_rw_block error Russell Cattelan
  2 siblings, 1 reply; 57+ messages in thread
From: Chris Mason @ 2000-12-15  9:14 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Linus Torvalds, Stephen C. Tweedie, Russell Cattelan, linux-kernel



On Fri, 15 Dec 2000, Alexander Viro wrote:

> Just one: any fs that really cares about completion callback is very likely
> to be picky about the requests ordering. So sync_buffers() is very unlikely
> to be useful anyway.
> 
Somewhat.  I guess there are at least two ways to do it.  First flush the
buffers where ordering matters (log blocks), then send the others onto the
dirty list (general metadata).  You might have your own end_io for those, and
sync_buffers would lose it.

Second way (reiserfs recently changed to this method) is to do all the
flushing yourself, and remove the need for an end_io call back.

> In that sense we really don't have anonymous buffers here. I seriously
> suspect that "unrealistic" assumption is not unrealistic at all. I'm
> not sufficiently familiar with XFS code to say for sure, but...
> 
> What we really need is a way for VFS/VM to pass the pressure on filesystem.
> That's it. If fs wants unusual completions for requests - let it have its
> own queueing mechanism and submit these requests when it finds that convenient.
> 
Yes, this is exactly what we've discussed.

-chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Test12 ll_rw_block error.
  2000-12-15  7:00   ` Alexander Viro
  2000-12-15  9:14     ` Chris Mason
@ 2000-12-15 10:51     ` Stephen C. Tweedie
  2000-12-15 19:04       ` Linus Torvalds
                         ` (2 more replies)
  2000-12-17  0:51     ` Test12 ll_rw_block error Russell Cattelan
  2 siblings, 3 replies; 57+ messages in thread
From: Stephen C. Tweedie @ 2000-12-15 10:51 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Linus Torvalds, Stephen C. Tweedie, Russell Cattelan, linux-kernel

Hi,

On Fri, Dec 15, 2000 at 02:00:19AM -0500, Alexander Viro wrote:
> On Thu, 14 Dec 2000, Linus Torvalds wrote:
> 
> Just one: any fs that really cares about completion callback is very likely
> to be picky about the requests ordering. So sync_buffers() is very unlikely
> to be useful anyway.
> 
> In that sense we really don't have anonymous buffers here. I seriously
> suspect that "unrealistic" assumption is not unrealistic at all. I'm
> not sufficiently familiar with XFS code to say for sure, but...

Right.  ext3 and reiserfs just want to submit their own IOs when it
comes to the journal.  (At least in ext3, already-journaled buffers
can be written back by the VM freely.)  It's a matter of telling the
fs when that should start.

> What we really need is a way for VFS/VM to pass the pressure on filesystem.
> That's it. If fs wants unusual completions for requests - let it have its
> own queueing mechanism and submit these requests when it finds that convenient.

There is a very clean way of doing this with address spaces.  It's
something I would like to see done properly for 2.5: eliminate all
knowledge of buffer_heads from the VM layer.  It would be pretty
simple to remove page->buffers completely and replace it with a
page->private pointer, owned by whatever address_space controlled the
page.  Instead of trying to unmap and flush buffers on the page
directly, these operations would become address_space operations.

We could still provide the standard try_to_free_buffers() and
unmap_underlying_metadata() functions to operate on the per-page
buffer_head lists, and existing filesystems would only have to point
their address_space "private metadata" operations at the generic
functions.  However, a filesystem which had additional ordering
constraints could then intercept the flush or writeback calls from the
VM and decide on its own how best to honour the VM pressure.

This even works both for hashed and unhashed anonymous buffers, *if*
you allow the filesystem to attach all of its hashed buffers buffers
to an address_space of its own rather than having the buffer cache
pool all such buffers together.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Test12 ll_rw_block error.
  2000-12-15 10:51     ` Stephen C. Tweedie
@ 2000-12-15 19:04       ` Linus Torvalds
  2000-12-17  1:08       ` Russell Cattelan
  2000-12-17  2:38       ` Marcelo Tosatti
  2 siblings, 0 replies; 57+ messages in thread
From: Linus Torvalds @ 2000-12-15 19:04 UTC (permalink / raw)
  To: linux-kernel

In article <20001215105148.E11931@redhat.com>,
Stephen C. Tweedie <sct@redhat.com> wrote:
>
>> What we really need is a way for VFS/VM to pass the pressure on filesystem.
>> That's it. If fs wants unusual completions for requests - let it have its
>> own queueing mechanism and submit these requests when it finds that convenient.
>
>There is a very clean way of doing this with address spaces.  It's
>something I would like to see done properly for 2.5: eliminate all
>knowledge of buffer_heads from the VM layer.

Note that you should be able to already get this effect with the current
2.4.0 tree.

The way to get the VM to ignore your buffer heads is to never mark the
buffers dirty, and instead mark the page they are on dirty along with
giving it a mapping (you can have a special per-superblock
"metadata-mapping" for stuff that isn't actually associated with any
particular file.

That way, the VM will just call writepage() for you, and you can use
that to schedule your writeouts (you don't actually need to write the
page the VM _asks_ you to write - you can just mark it dirty again and
consider the writepage to be mainly a VM pressure indicator).

Now, I also agree that we should be able to clean this up properly for
2.5.x, and actually do exactly this for the anonymous buffers, so that
the VM no longer needs to worry about buffer knowledge, and fs/buffer.c
becomes just another user of the writepage functionality.  That is not
all that hard to do, it mainly just requires some small changes to how
"mark_buffer_dirty()" works (ie it would also mark the page dirty, so
that the VM layer would know to call "writepage()"). 

I really think almost all of the VM infrastructure for this is in place,
the PageDirty code has both simplified the VM enormously and made it a
lot more powerful at the same time.

			Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Test12 ll_rw_block error.
  2000-12-15  6:20 ` Test12 ll_rw_block error Linus Torvalds
  2000-12-15  7:00   ` Alexander Viro
@ 2000-12-17  0:21   ` Russell Cattelan
  1 sibling, 0 replies; 57+ messages in thread
From: Russell Cattelan @ 2000-12-17  0:21 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds wrote:

> On Thu, 14 Dec 2000, Russell Cattelan wrote:
> >
> > Ok one more wrinkle.
> > sync_buffers calls ll_rw_block, this is going to have the same problem as
> > calling ll_rw_block directly.
>
> Good point.
>
> This actually looks fairly nasty to fix. The obvious fix would be to not
> put such buffers on the dirty list at all, and instead rely on the VM
> layer calling "writepage()" when it wants to push out the pages.
> That would be the nice behaviour from a VM standpoint.
>
> However, that assumes that you don't have any "anonymous" buffers, which
> is probably an unrealistic assumption.
>
> The problem is that we don't have any per-buffer "writebuffer()" function,
> the way we have them per-page. It was never needed for any of the normal
> filesystems, and XFS just happened to be able to take advantage of the
> b_end_io behaviour.
>
> Suggestions welcome.
>
>                 Linus

Ok after a bit of trial and error I do have something working.
I wouldn't call it the most elegant solution but it does work
and it isn't very intrusive.

#define BH_End_io  7    /* End io function defined don't remap it */

                /*  don't change the callback if somebody explicitly set it */

                if(!test_bit(BH_End_io, &bh->b_state)){
                  bh->b_end_io = end_buffer_io_sync;
                }
What I've done is in the XFS set buffer_head setup functions is
set the initial value of b_state to BH_Locked  and BH_End_io
set the callback function and the rest of the relevant fields and then unlock
the
buffer.

The only other quick fix that comes to mind is to change sync_buffers to use
submit_bh rather than ll_rw_block.

--
Russell Cattelan
cattelan@thebarn.com



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Test12 ll_rw_block error.
  2000-12-15  7:00   ` Alexander Viro
  2000-12-15  9:14     ` Chris Mason
  2000-12-15 10:51     ` Stephen C. Tweedie
@ 2000-12-17  0:51     ` Russell Cattelan
  2 siblings, 0 replies; 57+ messages in thread
From: Russell Cattelan @ 2000-12-17  0:51 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Stephen C. Tweedie, linux-kernel

Alexander Viro wrote:

> On Thu, 14 Dec 2000, Linus Torvalds wrote:
>
> > Good point.
> >
> > This actually looks fairly nasty to fix. The obvious fix would be to not
> > put such buffers on the dirty list at all, and instead rely on the VM
> > layer calling "writepage()" when it wants to push out the pages.
> > That would be the nice behaviour from a VM standpoint.
> >
> > However, that assumes that you don't have any "anonymous" buffers, which
> > is probably an unrealistic assumption.
> >
> > The problem is that we don't have any per-buffer "writebuffer()" function,
> > the way we have them per-page. It was never needed for any of the normal
> > filesystems, and XFS just happened to be able to take advantage of the
> > b_end_io behaviour.
> >
> > Suggestions welcome.
>
> Just one: any fs that really cares about completion callback is very likely
> to be picky about the requests ordering. So sync_buffers() is very unlikely
> to be useful anyway.

Actually no,  that's not the issue.

The XFS log uses a LSN (Log Sequence Number) to keep track of log write ordering.
Sync IO on each log buffer isn't realistic; the performance hit would be to great.

I wasn't around when  most of XFS was developed, but  from I what I understand it
was discovered early on that firing off writes in a particular order doesn't
guarantee that
they will finish in that order.  Thus the implantation of a sequence number for
each log write.


One of the obstacles we ran into early on in the linux port was the fact that
linux used fixed size IO requests to any given device.
But most of XFS's meta data structures vary in size in multiples of 512 bytes.

We were also implementing a page caching / clustering layer called
page_buf which understands  primarily  pages and not necessary
disk blocks. If your FS block size happens to match your page size then things
are good,  but it doesn't....
So we added a  bit map field to the pages structure.
Each bit then represents one BASIC BLOCK eg 512 for all practical purposes

The end_io functions XFS defines updates the correct bit or the whole bit array
if the whole page is valid, thus signaling the rest of the page_buf that the io
has
completed.

Ok there is a lot more to it than what I've just described but you probably get
the idea.


>
>
> In that sense we really don't have anonymous buffers here. I seriously
> suspect that "unrealistic" assumption is not unrealistic at all. I'm
> not sufficiently familiar with XFS code to say for sure, but...
>
> What we really need is a way for VFS/VM to pass the pressure on filesystem.
> That's it. If fs wants unusual completions for requests - let it have its
> own queueing mechanism and submit these requests when it finds that convenient.
>
> Stephen, you probably already thought about that area. Could you comment on
> that?
>                                                                 Cheers,
>                                                                         Al

--
Russell Cattelan
cattelan@thebarn.com



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Test12 ll_rw_block error.
  2000-12-15  9:14     ` Chris Mason
@ 2000-12-17  0:54       ` Russell Cattelan
  2000-12-17 12:20         ` Chris Mason
  0 siblings, 1 reply; 57+ messages in thread
From: Russell Cattelan @ 2000-12-17  0:54 UTC (permalink / raw)
  To: Chris Mason
  Cc: Alexander Viro, Linus Torvalds, Stephen C. Tweedie, linux-kernel

Chris Mason wrote:

> On Fri, 15 Dec 2000, Alexander Viro wrote:
>
> > Just one: any fs that really cares about completion callback is very likely
> > to be picky about the requests ordering. So sync_buffers() is very unlikely
> > to be useful anyway.
> >
> Somewhat.  I guess there are at least two ways to do it.  First flush the
> buffers where ordering matters (log blocks), then send the others onto the
> dirty list (general metadata).  You might have your own end_io for those, and
> sync_buffers would lose it.
>
> Second way (reiserfs recently changed to this method) is to do all the
> flushing yourself, and remove the need for an end_io call back.
>
I'm curious about this.
Does the mean reiserFS is doing all of it's own buffer management?

This would seem a little redundant with what is already in the kernel?

>
>
> > In that sense we really don't have anonymous buffers here. I seriously
> > suspect that "unrealistic" assumption is not unrealistic at all. I'm
> > not sufficiently familiar with XFS code to say for sure, but...
> >
> > What we really need is a way for VFS/VM to pass the pressure on filesystem.
> > That's it. If fs wants unusual completions for requests - let it have its
> > own queueing mechanism and submit these requests when it finds that convenient.
> >
> Yes, this is exactly what we've discussed.
>
> -chris

--
Russell Cattelan
cattelan@thebarn.com



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Test12 ll_rw_block error.
  2000-12-15 10:51     ` Stephen C. Tweedie
  2000-12-15 19:04       ` Linus Torvalds
@ 2000-12-17  1:08       ` Russell Cattelan
  2000-12-18 11:44         ` Stephen C. Tweedie
  2000-12-17  2:38       ` Marcelo Tosatti
  2 siblings, 1 reply; 57+ messages in thread
From: Russell Cattelan @ 2000-12-17  1:08 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Alexander Viro, Linus Torvalds, linux-kernel

"Stephen C. Tweedie" wrote:

> Hi,
>
> On Fri, Dec 15, 2000 at 02:00:19AM -0500, Alexander Viro wrote:
> > On Thu, 14 Dec 2000, Linus Torvalds wrote:
> >
> > Just one: any fs that really cares about completion callback is very likely
> > to be picky about the requests ordering. So sync_buffers() is very unlikely
> > to be useful anyway.
> >
> > In that sense we really don't have anonymous buffers here. I seriously
> > suspect that "unrealistic" assumption is not unrealistic at all. I'm
> > not sufficiently familiar with XFS code to say for sure, but...
>
> Right.  ext3 and reiserfs just want to submit their own IOs when it
> comes to the journal.  (At least in ext3, already-journaled buffers
> can be written back by the VM freely.)  It's a matter of telling the
> fs when that should start.
>
> > What we really need is a way for VFS/VM to pass the pressure on filesystem.
> > That's it. If fs wants unusual completions for requests - let it have its
> > own queueing mechanism and submit these requests when it finds that convenient.
>
> There is a very clean way of doing this with address spaces.  It's
> something I would like to see done properly for 2.5: eliminate all
> knowledge of buffer_heads from the VM layer.  It would be pretty
> simple to remove page->buffers completely and replace it with a
> page->private pointer, owned by whatever address_space controlled the
> page.  Instead of trying to unmap and flush buffers on the page
> directly, these operations would become address_space operations.

Yes this is a lot of what page buf would like to do eventually.
Have the VM system pressure page_buf for pages which would
then be able to intelligently call the file system to free up cached pages.
A big part of getting Delay Alloc to not completely consume all the
system pages, is being told when it's time to start really allocating disk
space and push pages out.


>
>
> We could still provide the standard try_to_free_buffers() and
> unmap_underlying_metadata() functions to operate on the per-page
> buffer_head lists, and existing filesystems would only have to point
> their address_space "private metadata" operations at the generic
> functions.  However, a filesystem which had additional ordering
> constraints could then intercept the flush or writeback calls from the
> VM and decide on its own how best to honour the VM pressure.
>
> This even works both for hashed and unhashed anonymous buffers, *if*
> you allow the filesystem to attach all of its hashed buffers buffers
> to an address_space of its own rather than having the buffer cache
> pool all such buffers together.
>
> --Stephen

--
Russell Cattelan
cattelan@thebarn.com



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Test12 ll_rw_block error.
  2000-12-15 10:51     ` Stephen C. Tweedie
  2000-12-15 19:04       ` Linus Torvalds
  2000-12-17  1:08       ` Russell Cattelan
@ 2000-12-17  2:38       ` Marcelo Tosatti
  2000-12-18 11:46         ` Stephen C. Tweedie
  2 siblings, 1 reply; 57+ messages in thread
From: Marcelo Tosatti @ 2000-12-17  2:38 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Alexander Viro, Linus Torvalds, Russell Cattelan, linux-kernel


On Fri, 15 Dec 2000, Stephen C. Tweedie wrote:

> Hi,
> 
> On Fri, Dec 15, 2000 at 02:00:19AM -0500, Alexander Viro wrote:
> > On Thu, 14 Dec 2000, Linus Torvalds wrote:
> > 
> > Just one: any fs that really cares about completion callback is very likely
> > to be picky about the requests ordering. So sync_buffers() is very unlikely
> > to be useful anyway.
> > 
> > In that sense we really don't have anonymous buffers here. I seriously
> > suspect that "unrealistic" assumption is not unrealistic at all. I'm
> > not sufficiently familiar with XFS code to say for sure, but...
> 
> Right.  ext3 and reiserfs just want to submit their own IOs when it
> comes to the journal.  (At least in ext3, already-journaled buffers
> can be written back by the VM freely.)  It's a matter of telling the
> fs when that should start.
> 
> > What we really need is a way for VFS/VM to pass the pressure on filesystem.
> > That's it. If fs wants unusual completions for requests - let it have its
> > own queueing mechanism and submit these requests when it finds that convenient.
> 
> There is a very clean way of doing this with address spaces.  It's
> something I would like to see done properly for 2.5: eliminate all
> knowledge of buffer_heads from the VM layer.  It would be pretty
> simple to remove page->buffers completely and replace it with a
> page->private pointer, owned by whatever address_space controlled the
> page.  Instead of trying to unmap and flush buffers on the page
> directly, these operations would become address_space operations.
> 
> We could still provide the standard try_to_free_buffers() and
> unmap_underlying_metadata() functions to operate on the per-page
> buffer_head lists, and existing filesystems would only have to point
> their address_space "private metadata" operations at the generic
> functions.  However, a filesystem which had additional ordering
> constraints could then intercept the flush or writeback calls from the
> VM and decide on its own how best to honour the VM pressure.

Stephen,

The ->flush() operation (which we've been discussing a bit) would be very
useful now (mainly for XFS).

At page_launder(), we can call ->flush() if the given page has it defined.
Otherwise use try_to_free_buffers() as we do now for filesystems which
dont care about the special flushing treatment. 



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Test12 ll_rw_block error.
  2000-12-17  0:54       ` Russell Cattelan
@ 2000-12-17 12:20         ` Chris Mason
  0 siblings, 0 replies; 57+ messages in thread
From: Chris Mason @ 2000-12-17 12:20 UTC (permalink / raw)
  To: Russell Cattelan
  Cc: Alexander Viro, Linus Torvalds, Stephen C. Tweedie, linux-kernel



On Sat, 16 Dec 2000, Russell Cattelan wrote:
> >
> I'm curious about this.
> Does the mean reiserFS is doing all of it's own buffer management?
> 
> This would seem a little redundant with what is already in the kernel?
> 
For metadata only reiserfs does its own write management.  The buffers
come from getblk. We just don't mark the buffers dirty for flushing by
flush_dirty_buffers()

This has the advantage of avoiding races against bdflush and friends, and
makes it easier to keep track of which buffers have actually made their
way to disk.  It has all of the obvious disadvantages with respect to
memory pressure.

-chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Test12 ll_rw_block error.
  2000-12-17  1:08       ` Russell Cattelan
@ 2000-12-18 11:44         ` Stephen C. Tweedie
  0 siblings, 0 replies; 57+ messages in thread
From: Stephen C. Tweedie @ 2000-12-18 11:44 UTC (permalink / raw)
  To: Russell Cattelan
  Cc: Stephen C. Tweedie, Alexander Viro, Linus Torvalds, linux-kernel

Hi,

On Sat, Dec 16, 2000 at 07:08:02PM -0600, Russell Cattelan wrote:
> > There is a very clean way of doing this with address spaces.  It's
> > something I would like to see done properly for 2.5: eliminate all
> > knowledge of buffer_heads from the VM layer.  It would be pretty
> > simple to remove page->buffers completely and replace it with a
> > page->private pointer, owned by whatever address_space controlled the
> > page.  Instead of trying to unmap and flush buffers on the page
> > directly, these operations would become address_space operations.
> 
> Yes this is a lot of what page buf would like to do eventually.
> Have the VM system pressure page_buf for pages which would
> then be able to intelligently call the file system to free up cached pages.
> A big part of getting Delay Alloc to not completely consume all the
> system pages, is being told when it's time to start really allocating disk
> space and push pages out.

Delayed allocation is actually much easier, since it's entirely an
operation on logical page addresses, not physical ones --- by
definition you don't have any buffer_heads yet because you haven't
decided on the disk blocks.  If you're just dealing with pages, not
blocks, then the address_space is the natural way of dealing with it
already.  

Only the full semantics of the flush callback have been missing to
date, and with 2.4.0-test12 even that is mostly solved, since
page_launder will give you the writeback() callbacks you need to flush
things to disk when you start getting memory pressure.  You can even
treat the writepage() as an advisory call.  

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Test12 ll_rw_block error.
  2000-12-17  2:38       ` Marcelo Tosatti
@ 2000-12-18 11:46         ` Stephen C. Tweedie
  2000-12-19 14:05           ` Marcelo Tosatti
  2000-12-21 23:25           ` [RFC] changes to buffer.c (was Test12 ll_rw_block error) Chris Mason
  0 siblings, 2 replies; 57+ messages in thread
From: Stephen C. Tweedie @ 2000-12-18 11:46 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Stephen C. Tweedie, Alexander Viro, Linus Torvalds,
	Russell Cattelan, linux-kernel

Hi,

On Sun, Dec 17, 2000 at 12:38:17AM -0200, Marcelo Tosatti wrote:
> On Fri, 15 Dec 2000, Stephen C. Tweedie wrote:
> 
> Stephen,
> 
> The ->flush() operation (which we've been discussing a bit) would be very
> useful now (mainly for XFS).
> 
> At page_launder(), we can call ->flush() if the given page has it defined.
> Otherwise use try_to_free_buffers() as we do now for filesystems which
> dont care about the special flushing treatment. 

As of 2.4.0test12, page_launder() will already call the
per-address-space writepage() operation for dirty pages.  Do you need
something similar for clean pages too, or does Linus's new laundry
code give you what you need now?

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Test12 ll_rw_block error.
  2000-12-18 11:46         ` Stephen C. Tweedie
@ 2000-12-19 14:05           ` Marcelo Tosatti
  2000-12-19 16:43             ` Daniel Phillips
  2000-12-21 23:25           ` [RFC] changes to buffer.c (was Test12 ll_rw_block error) Chris Mason
  1 sibling, 1 reply; 57+ messages in thread
From: Marcelo Tosatti @ 2000-12-19 14:05 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Alexander Viro, Linus Torvalds, Russell Cattelan, linux-kernel


On Mon, 18 Dec 2000, Stephen C. Tweedie wrote:

> Hi,
> 
> On Sun, Dec 17, 2000 at 12:38:17AM -0200, Marcelo Tosatti wrote:
> > On Fri, 15 Dec 2000, Stephen C. Tweedie wrote:
> > 
> > Stephen,
> > 
> > The ->flush() operation (which we've been discussing a bit) would be very
> > useful now (mainly for XFS).
> > 
> > At page_launder(), we can call ->flush() if the given page has it defined.
> > Otherwise use try_to_free_buffers() as we do now for filesystems which
> > dont care about the special flushing treatment. 
> 
> As of 2.4.0test12, page_launder() will already call the
> per-address-space writepage() operation for dirty pages.  Do you need
> something similar for clean pages too, or does Linus's new laundry
> code give you what you need now?

I think the semantics of the filesystem specific ->flush and ->writepage
are not the same.

Is ok for filesystem specific writepage() code to sync other "physically
contiguous" dirty pages with reference to the one requested by
writepage() ? 

If so, it can do the same job as the ->flush() idea we've discussing.



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Test12 ll_rw_block error.
  2000-12-19 14:05           ` Marcelo Tosatti
@ 2000-12-19 16:43             ` Daniel Phillips
  2000-12-19 17:18               ` Marcelo Tosatti
  2000-12-19 17:40               ` Christoph Hellwig
  0 siblings, 2 replies; 57+ messages in thread
From: Daniel Phillips @ 2000-12-19 16:43 UTC (permalink / raw)
  To: Marcelo Tosatti, linux-kernel

Marcelo Tosatti wrote:
> 
> On Mon, 18 Dec 2000, Stephen C. Tweedie wrote:
> 
> > Hi,
> >
> > On Sun, Dec 17, 2000 at 12:38:17AM -0200, Marcelo Tosatti wrote:
> > > On Fri, 15 Dec 2000, Stephen C. Tweedie wrote:
> > >
> > > Stephen,
> > >
> > > The ->flush() operation (which we've been discussing a bit) would be very
> > > useful now (mainly for XFS).
> > >
> > > At page_launder(), we can call ->flush() if the given page has it defined.
> > > Otherwise use try_to_free_buffers() as we do now for filesystems which
> > > dont care about the special flushing treatment.
> >
> > As of 2.4.0test12, page_launder() will already call the
> > per-address-space writepage() operation for dirty pages.  Do you need
> > something similar for clean pages too, or does Linus's new laundry
> > code give you what you need now?
> 
> I think the semantics of the filesystem specific ->flush and ->writepage
> are not the same.
> 
> Is ok for filesystem specific writepage() code to sync other "physically
> contiguous" dirty pages with reference to the one requested by
> writepage() ?
> 
> If so, it can do the same job as the ->flush() idea we've discussing.

Except that for ->writepage you don't have the option of *not* writing
the specified page.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Test12 ll_rw_block error.
  2000-12-19 16:43             ` Daniel Phillips
@ 2000-12-19 17:18               ` Marcelo Tosatti
  2000-12-19 17:40               ` Christoph Hellwig
  1 sibling, 0 replies; 57+ messages in thread
From: Marcelo Tosatti @ 2000-12-19 17:18 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel


On Tue, 19 Dec 2000, Daniel Phillips wrote:

> Marcelo Tosatti wrote:
> > 
> > On Mon, 18 Dec 2000, Stephen C. Tweedie wrote:
> > 
> > > Hi,
> > >
> > > On Sun, Dec 17, 2000 at 12:38:17AM -0200, Marcelo Tosatti wrote:
> > > > On Fri, 15 Dec 2000, Stephen C. Tweedie wrote:
> > > >
> > > > Stephen,
> > > >
> > > > The ->flush() operation (which we've been discussing a bit) would be very
> > > > useful now (mainly for XFS).
> > > >
> > > > At page_launder(), we can call ->flush() if the given page has it defined.
> > > > Otherwise use try_to_free_buffers() as we do now for filesystems which
> > > > dont care about the special flushing treatment.
> > >
> > > As of 2.4.0test12, page_launder() will already call the
> > > per-address-space writepage() operation for dirty pages.  Do you need
> > > something similar for clean pages too, or does Linus's new laundry
> > > code give you what you need now?
> > 
> > I think the semantics of the filesystem specific ->flush and ->writepage
> > are not the same.
> > 
> > Is ok for filesystem specific writepage() code to sync other "physically
> > contiguous" dirty pages with reference to the one requested by
> > writepage() ?
> > 
> > If so, it can do the same job as the ->flush() idea we've discussing.
> 
> Except that for ->writepage you don't have the option of *not* writing
> the specified page.

It does.

Both the swapper writepage() operation and shm_writepage() cannot be able
to write the page. 
 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Test12 ll_rw_block error.
  2000-12-19 16:43             ` Daniel Phillips
  2000-12-19 17:18               ` Marcelo Tosatti
@ 2000-12-19 17:40               ` Christoph Hellwig
  1 sibling, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2000-12-19 17:40 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Marcelo Tosatti, linux-kernel

In article <3A3F904F.8FBC46A5@innominate.de> you wrote:
>> I think the semantics of the filesystem specific ->flush and ->writepage
>> are not the same.
>> 
>> Is ok for filesystem specific writepage() code to sync other "physically
>> contiguous" dirty pages with reference to the one requested by
>> writepage() ?
>> 
>> If so, it can do the same job as the ->flush() idea we've discussing.
>
> Except that for ->writepage you don't have the option of *not* writing
> the specified page.

In current -test13pre you have.

	Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-18 11:46         ` Stephen C. Tweedie
  2000-12-19 14:05           ` Marcelo Tosatti
@ 2000-12-21 23:25           ` Chris Mason
  2000-12-22  0:19             ` Marcelo Tosatti
  2000-12-22  1:54             ` Alexander Viro
  1 sibling, 2 replies; 57+ messages in thread
From: Chris Mason @ 2000-12-21 23:25 UTC (permalink / raw)
  To: Stephen C. Tweedie, Marcelo Tosatti
  Cc: Alexander Viro, Linus Torvalds, Russell Cattelan, linux-kernel



Ok guys, I think I've taken Linus' suggestion to have buffer.c use its
own writepage a bit too far.  This patch marks pages dirty when the 
buffer head is marked dirty, and changes flush_dirty_buffers and 
sync_buffers to use writepage instead of ll_rw_block.  The idea is 
to allow filesystems to use the buffer lists and provide their own 
i/o mechanism.

The result is a serious semantics change for writepage, which now is 
expected to do partial page writes when the page isn't up to date and 
there are dirty buffers inside.  For all the obvious reasons, this isn't 
fit for 2.4.0, and if you all feel it is a 2.5. thing I'll send along 
the  shorter  patch Linus originally suggested.  But, I think it would 
be  pretty useful for the new filesystems (once I also fix 
fsync_inode_buffers and sync_page_buffers).

Other changes:  submit_bh now cleans the buffers.  I don't see how 
they were getting cleaned before, it must have been try_to_free_buffers 
sending the page through sync_page_buffers, meaning they were probably 
getting written twice.  Unless someone throws a clue my way, I'll send 
this out as a separate diff.

page_launder doesn't fiddle with the buffermem_pages counter, it is done 
in try_to_free_buffers instead.

Obvious bug, block_write_full_page zeros out the bits past the end of 
file every time.  This should not be needed for normal file writes.

Most testing was on ext2, who actually calls mark_buffer_dirty, and 
supports blocksizes < intel page size.  More tests are running 
overnight.

-chris

diff -urN linux.test13p3p/drivers/block/ll_rw_blk.c linux-test12/drivers/block/ll_rw_blk.c
--- linux.test13p3p/drivers/block/ll_rw_blk.c	Tue Dec 19 05:07:50 2000
+++ linux.test13p3/drivers/block/ll_rw_blk.c	Thu Dec 21 16:56:43 2000
@@ -954,6 +954,7 @@
 	if (!test_bit(BH_Lock, &bh->b_state))
 		BUG();
 
+	mark_buffer_clean(bh) ;
 	set_bit(BH_Req, &bh->b_state);
 
 	/*
diff -urN linux.test13p3p/fs/buffer.c linux-test12/fs/buffer.c
--- linux.test13p3p/fs/buffer.c	Thu Dec 21 17:24:17 2000
+++ linux.test13p3/fs/buffer.c	Thu Dec 21 17:28:46 2000
@@ -97,6 +97,16 @@
 
 static int grow_buffers(int size);
 static void __refile_buffer(struct buffer_head *);
+static int block_write_anon_page(struct page *);
+
+static struct address_space_operations anon_space_ops = {
+	writepage: block_write_anon_page,
+	sync_page: block_sync_page,
+} ;
+static struct address_space anon_space_mapping = {
+	pages: { &anon_space_mapping.pages, &anon_space_mapping.pages },
+	a_ops: &anon_space_ops,
+} ;
 
 /* This is used by some architectures to estimate available memory. */
 atomic_t buffermem_pages = ATOMIC_INIT(0);
@@ -161,6 +171,37 @@
 	atomic_dec(&bh->b_count);
 }
 
+/*
+** util function for sync_buffers and flush_dirty_buffers
+** uses either the writepage func supplied in the page's mapping,
+** or the anon address space writepage
+*/
+static int dirty_list_writepage(struct page *page) {
+	int (*writepage)(struct page *)  ;
+	int ret ;
+
+	writepage = page->mapping->a_ops->writepage ;
+
+	if (!writepage) {
+		writepage = anon_space_ops.writepage ;
+	}
+
+	page_cache_get(page) ;
+	ClearPageDirty(page) ;
+	ret = writepage(page) ;
+	/* note, when writepage returns 1, we mark the page dirty again
+	** but the writepage func is responsible for making sure any
+	** buffers that need to stay dirty really do stay dirty
+	** ick.
+	*/
+	if (ret == 1) {
+		SetPageDirty(page) ;
+		UnlockPage(page) ;
+	}
+	page_cache_release(page) ;
+	return ret ;
+}
+
 /* Call sync_buffers with wait!=0 to ensure that the call does not
  * return until all buffer writes have completed.  Sync() may return
  * before the writes have finished; fsync() may not.
@@ -175,6 +216,7 @@
 {
 	int i, retry, pass = 0, err = 0;
 	struct buffer_head * bh, *next;
+	struct page *page ;
 
 	/* One pass for no-wait, three for wait:
 	 * 0) write out all dirty, unlocked buffers;
@@ -230,10 +272,19 @@
 			if (!buffer_dirty(bh) || pass >= 2)
 				continue;
 
-			atomic_inc(&bh->b_count);
+			page = bh->b_page ;
+			if (TryLockPage(page)) {
+				if (!wait || !pass) {
+					retry = 1 ;
+					continue ;
+				}
+				spin_unlock(&lru_list_lock);
+				wait_on_page(page) ;
+				goto repeat ;
+			}
 			spin_unlock(&lru_list_lock);
-			ll_rw_block(WRITE, 1, &bh);
-			atomic_dec(&bh->b_count);
+
+			dirty_list_writepage(page) ;
 			retry = 1;
 			goto repeat;
 		}
@@ -1096,8 +1147,10 @@
 	int dispose = BUF_CLEAN;
 	if (buffer_locked(bh))
 		dispose = BUF_LOCKED;
-	if (buffer_dirty(bh))
+	if (buffer_dirty(bh)) {
 		dispose = BUF_DIRTY;
+		SetPageDirty(bh->b_page) ;
+	}
 	if (buffer_protected(bh))
 		dispose = BUF_PROTECTED;
 	if (dispose != bh->b_list) {
@@ -1473,6 +1526,52 @@
  * "Dirty" is valid only with the last case (mapped+uptodate).
  */
 
+static int block_write_anon_page(struct page *page) 
+{
+	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+	int i, nr = 0 ;
+	int partial = 0 ;
+	int ret = 0 ;
+
+	if (!PageLocked(page))
+		BUG();
+
+	if (!page->buffers)
+		BUG() ;
+
+	head = page->buffers;
+	bh = head;
+
+	/* Stage 1: find the dirty buffers, lock them for submit_bh */
+	do {
+		if (!test_and_set_bit(BH_Lock, &bh->b_state)) {
+			if (buffer_uptodate(bh) && buffer_dirty(bh)) {
+				bh->b_end_io = end_buffer_io_async;
+				atomic_inc(&bh->b_count);
+				arr[nr++] = bh ;
+			} else {
+				partial = 1 ;
+				unlock_buffer(bh) ;
+			}
+		} else {
+			partial = 1 ;
+		}
+		bh = bh->b_this_page;
+	} while (bh != head);
+
+	/* Stage 2: submit the IO */
+	for (i = 0 ; i < nr ; i++) {
+		submit_bh(WRITE, arr[i]) ;
+	}
+	/* Done - end_buffer_io_async will unlock */
+	if (!partial)
+		SetPageUptodate(page);
+	if (nr == 0) {
+		UnlockPage(page) ;
+	}
+	return ret ;
+}
+
 /*
  * block_write_full_page() is SMP-safe - currently it's still
  * being called with the kernel lock held, but the code is ready.
@@ -1482,6 +1581,10 @@
 	int err, i;
 	unsigned long block;
 	struct buffer_head *bh, *head;
+	int nr = 0 ;
+	struct buffer_head *arr[MAX_BUF_PER_PAGE] ;
+	int page_ok = Page_Uptodate(page) ;
+	int partial = 0;
 
 	if (!PageLocked(page))
 		BUG();
@@ -1504,36 +1607,46 @@
 		 *
 		 * Leave it to the low-level FS to make all those
 		 * decisions (block #0 may actually be a valid block)
+		 *
+		 * only bother when the page is up to date or the buffer
+		 * is dirty.
 		 */
-		if (!buffer_mapped(bh)) {
-			err = get_block(inode, block, bh, 1);
-			if (err)
-				goto out;
-			if (buffer_new(bh))
-				unmap_underlying_metadata(bh);
+		if (page_ok || buffer_dirty(bh)) {
+			if (!buffer_mapped(bh)) {
+				err = get_block(inode, block, bh, 1);
+				if (err)
+					goto out;
+				if (buffer_new(bh))
+					unmap_underlying_metadata(bh);
+			}
+			arr[nr++] = bh ; 
+		} else {
+			partial = 1 ;
 		}
 		bh = bh->b_this_page;
 		block++;
 	} while (bh != head);
 
 	/* Stage 2: lock the buffers, mark them dirty */
-	do {
+	for (i = 0 ; i < nr ; i++) {
+		bh = arr[i] ;
 		lock_buffer(bh);
 		bh->b_end_io = end_buffer_io_async;
 		atomic_inc(&bh->b_count);
 		set_bit(BH_Uptodate, &bh->b_state);
 		set_bit(BH_Dirty, &bh->b_state);
-		bh = bh->b_this_page;
-	} while (bh != head);
+	} 
 
-	/* Stage 3: submit the IO */
-	do {
-		submit_bh(WRITE, bh);
-		bh = bh->b_this_page;		
-	} while (bh != head);
+	for (i = 0 ; i < nr ; i++) {
+		submit_bh(WRITE, arr[i]) ;
+	}
+
+	if (nr == 0) 
+		UnlockPage(page) ;
 
 	/* Done - end_buffer_io_async will unlock */
-	SetPageUptodate(page);
+	if (!partial)
+		SetPageUptodate(page);
 	return 0;
 
 out:
@@ -1653,6 +1766,45 @@
 }
 
 /*
+** just sets the dirty bits for a range of buffers in the page.  Does
+** not balance the dirty list, or put the buffers onto the dirty list
+*/
+static int __block_dirty_range(struct inode *inode, struct page *page,
+		unsigned from, unsigned to)
+{
+	unsigned block_start, block_end;
+	int partial = 0 ;
+	unsigned blocksize;
+	struct buffer_head *bh, *head;
+
+	blocksize = inode->i_sb->s_blocksize;
+
+	for(bh = head = page->buffers, block_start = 0;
+	    bh != head || !block_start;
+	    block_start=block_end, bh = bh->b_this_page) {
+		block_end = block_start + blocksize;
+		if (block_end <= from || block_start >= to) {
+			if (!buffer_uptodate(bh))
+				partial = 1;
+		} else {
+			set_bit(BH_Uptodate, &bh->b_state);
+			if (!atomic_set_buffer_dirty(bh)) {
+				buffer_insert_inode_queue(bh, inode);
+			}
+		}
+	}
+	/*
+	 * is this a partial write that happened to make all buffers
+	 * uptodate then we can optimize away a bogus readpage() for
+	 * the next read(). Here we 'discover' wether the page went
+	 * uptodate as a result of this (potentially partial) write.
+	 */
+	if (!partial)
+		SetPageUptodate(page);
+	return 0;
+}
+
+/*
  * Generic "read page" function for block devices that have the normal
  * get_block functionality. This is most of the block device filesystems.
  * Reads the page asynchronously --- the unlock_buffer() and
@@ -1942,13 +2093,23 @@
 	if (!err) {
 		memset(page_address(page) + offset, 0, PAGE_CACHE_SIZE - offset);
 		flush_dcache_page(page);
-		__block_commit_write(inode,page,0,offset);
+
+		/* this will just set the dirty bits for block_write_full_page
+		** it is only safe because we have the page locked and
+		** nobody will try to write the buffers between
+		** the block_dirty_range and the write_full_page calls
+		** we have to clear the page up to date so the buffers
+		** past the end of the file won't get written
+		*/
+		__block_dirty_range(inode,page,0,offset);
+		ClearPageUptodate(page);
+		err = __block_write_full_page(inode, page, get_block) ;
 done:
 		kunmap(page);
-		UnlockPage(page);
 		return err;
 	}
 	ClearPageUptodate(page);
+	UnlockPage(page);
 	goto done;
 }
 
@@ -2239,7 +2400,7 @@
 	struct buffer_head *bh, *tmp;
 	struct buffer_head * insert_point;
 	int isize;
-
+	unsigned long index ;
 	if ((size & 511) || (size > PAGE_SIZE)) {
 		printk("VFS: grow_buffers: size = %d\n",size);
 		return 0;
@@ -2255,6 +2416,16 @@
 
 	isize = BUFSIZE_INDEX(size);
 
+	/* don't put this buffer head on the free list until the
+	** page is setup.  Is there a better index to use?  Would 0
+	** be good enough?
+	*/
+	page->flags &= ~(1 << PG_referenced);
+	index = atomic_read(&buffermem_pages) ;
+	atomic_inc(&buffermem_pages);
+	add_to_page_cache_locked(page, &anon_space_mapping, index) ;
+	page->buffers = bh;
+
 	spin_lock(&free_list[isize].lock);
 	insert_point = free_list[isize].list;
 	tmp = bh;
@@ -2278,11 +2449,7 @@
 	free_list[isize].list = bh;
 	spin_unlock(&free_list[isize].lock);
 
-	page->buffers = bh;
-	page->flags &= ~(1 << PG_referenced);
-	lru_cache_add(page);
 	UnlockPage(page);
-	atomic_inc(&buffermem_pages);
 	return 1;
 
 no_buffer_head:
@@ -2381,6 +2548,9 @@
 
 	/* And free the page */
 	page->buffers = NULL;
+	if (page->mapping == (&anon_space_mapping)) {
+		atomic_dec(&buffermem_pages) ;
+	}
 	page_cache_release(page);
 	spin_unlock(&free_list[index].lock);
 	write_unlock(&hash_table_lock);
@@ -2559,6 +2729,7 @@
 static int flush_dirty_buffers(int check_flushtime)
 {
 	struct buffer_head * bh, *next;
+	struct page *page ;
 	int flushed = 0, i;
 
  restart:
@@ -2575,6 +2746,8 @@
 		}
 		if (buffer_locked(bh))
 			continue;
+		if (!buffer_uptodate(bh))
+			continue ;
 
 		if (check_flushtime) {
 			/* The dirty lru list is chronologically ordered so
@@ -2587,13 +2760,12 @@
 			if (++flushed > bdf_prm.b_un.ndirty)
 				goto out_unlock;
 		}
-
-		/* OK, now we are committed to write it out. */
-		atomic_inc(&bh->b_count);
-		spin_unlock(&lru_list_lock);
-		ll_rw_block(WRITE, 1, &bh);
-		atomic_dec(&bh->b_count);
-
+		page = bh->b_page ;
+		if (TryLockPage(page)) {
+			continue ;
+		}
+		spin_unlock(&lru_list_lock) ;
+		dirty_list_writepage(page) ;
 		if (current->need_resched)
 			schedule();
 		goto restart;
diff -urN linux.test13p3p/mm/page_alloc.c linux-test12/mm/page_alloc.c
--- linux.test13p3p/mm/page_alloc.c	Mon Dec 18 06:53:45 2000
+++ linux.test13p3/mm/page_alloc.c	Thu Dec 21 16:56:38 2000
@@ -317,11 +317,12 @@
 	/*
 	 * If we are about to get low on free pages and cleaning
 	 * the inactive_dirty pages would fix the situation,
-	 * wake up bdflush.
+	 * wake up kswapd here as well, so page_launder can start
+	 * sending things to disk.
 	 */
 	else if (free_shortage() && nr_inactive_dirty_pages > free_shortage()
 			&& nr_inactive_dirty_pages >= freepages.high)
-		wakeup_bdflush(0);
+		wakeup_kswapd(0);
 
 try_again:
 	/*
diff -urN linux.test13p3p/mm/vmscan.c linux-test12/mm/vmscan.c
--- linux.test13p3p/mm/vmscan.c	Tue Dec 19 05:07:57 2000
+++ linux.test13p3/mm/vmscan.c	Thu Dec 21 17:32:57 2000
@@ -678,7 +678,6 @@
 
 			/* The page was only in the buffer cache. */
 			} else if (!page->mapping) {
-				atomic_dec(&buffermem_pages);
 				freed_page = 1;
 				cleaned_pages++;
 
@@ -739,9 +738,6 @@
 	 * free anything yet, we wait synchronously on the writeout of
 	 * MAX_SYNC_LAUNDER pages.
 	 *
-	 * We also wake up bdflush, since bdflush should, under most
-	 * loads, flush out the dirty pages before we have to wait on
-	 * IO.
 	 */
 	if (can_get_io_locks && !launder_loop && free_shortage()) {
 		launder_loop = 1;
@@ -750,8 +746,6 @@
 			sync = 0;
 		/* We only do a few "out of order" flushes. */
 		maxlaunder = MAX_LAUNDER;
-		/* Kflushd takes care of the rest. */
-		wakeup_bdflush(0);
 		goto dirty_page_rescan;
 	}
 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-21 23:25           ` [RFC] changes to buffer.c (was Test12 ll_rw_block error) Chris Mason
@ 2000-12-22  0:19             ` Marcelo Tosatti
  2000-12-22  2:20               ` Andreas Dilger
  2000-12-22  1:54             ` Alexander Viro
  1 sibling, 1 reply; 57+ messages in thread
From: Marcelo Tosatti @ 2000-12-22  0:19 UTC (permalink / raw)
  To: Chris Mason
  Cc: Stephen C. Tweedie, Alexander Viro, Linus Torvalds,
	Russell Cattelan, linux-kernel


On Thu, 21 Dec 2000, Chris Mason wrote:

> Ok guys, I think I've taken Linus' suggestion to have buffer.c use its
> own writepage a bit too far.  This patch marks pages dirty when the 
> buffer head is marked dirty, and changes flush_dirty_buffers and 
> sync_buffers to use writepage instead of ll_rw_block.  The idea is 
> to allow filesystems to use the buffer lists and provide their own 
> i/o mechanism.
> 
> The result is a serious semantics change for writepage, which now is 
> expected to do partial page writes when the page isn't up to date and 
> there are dirty buffers inside.  For all the obvious reasons, this isn't 
> fit for 2.4.0, and if you all feel it is a 2.5. thing I'll send along 
> the  shorter  patch Linus originally suggested.  But, I think it would 
> be  pretty useful for the new filesystems (once I also fix 
> fsync_inode_buffers and sync_page_buffers).

It is very powerful.

With this on place, the filesystem is able to do write clustering at its
writepage() function by checking if the on-disk physically nearby pages
are dirty.
  
> Other changes:  submit_bh now cleans the buffers.  I don't see how 
> they were getting cleaned before, it must have been try_to_free_buffers 
> sending the page through sync_page_buffers, meaning they were probably 
> getting written twice.  Unless someone throws a clue my way, I'll send 
> this out as a separate diff.

Ouch.

> page_launder doesn't fiddle with the buffermem_pages counter, it is done 
> in try_to_free_buffers instead.
>
> Obvious bug, block_write_full_page zeros out the bits past the end of 
> file every time.  This should not be needed for normal file writes.
> 
> Most testing was on ext2, who actually calls mark_buffer_dirty, and 
> supports blocksizes < intel page size.  More tests are running 
> overnight.

It seems your code has a problem with bh flush time.

In flush_dirty_buffers(), a buffer may (if being called from kupdate) only
be written in case its old enough. (bh->b_flushtime)

If the flush happens for an anonymous buffer, you'll end up writing all
buffers which are sitting on the same page (with block_write_anon_page),
but these other buffers are not necessarily old enough to be flushed.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-22  2:20               ` Andreas Dilger
@ 2000-12-22  0:38                 ` Marcelo Tosatti
  2000-12-22 13:56                   ` Chris Mason
  2000-12-22 15:07                   ` Chris Mason
  0 siblings, 2 replies; 57+ messages in thread
From: Marcelo Tosatti @ 2000-12-22  0:38 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Chris Mason, Stephen C. Tweedie, Alexander Viro, Linus Torvalds,
	Russell Cattelan, linux-kernel


On Thu, 21 Dec 2000, Andreas Dilger wrote:

> Marcelo Tosatti writes:
> > It seems your code has a problem with bh flush time.
> > 
> > In flush_dirty_buffers(), a buffer may (if being called from kupdate) only
> > be written in case its old enough. (bh->b_flushtime)
> > 
> > If the flush happens for an anonymous buffer, you'll end up writing all
> > buffers which are sitting on the same page (with block_write_anon_page),
> > but these other buffers are not necessarily old enough to be flushed.
> 
> This isn't really a "problem" however.  The page is the _maximum_ age of
> the buffer before it needs to be written.  If we can efficiently write it
> out with another buffer 


> (essentially for free if they are on the same spot on disk)

Are you sure this is true for buffer pages in most cases? 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-21 23:25           ` [RFC] changes to buffer.c (was Test12 ll_rw_block error) Chris Mason
  2000-12-22  0:19             ` Marcelo Tosatti
@ 2000-12-22  1:54             ` Alexander Viro
  2000-12-22 13:49               ` Chris Mason
  1 sibling, 1 reply; 57+ messages in thread
From: Alexander Viro @ 2000-12-22  1:54 UTC (permalink / raw)
  To: Chris Mason
  Cc: Stephen C. Tweedie, Marcelo Tosatti, Linus Torvalds,
	Russell Cattelan, linux-kernel



On Thu, 21 Dec 2000, Chris Mason wrote:

> Obvious bug, block_write_full_page zeros out the bits past the end of 
> file every time.  This should not be needed for normal file writes.

Unfortunately, it _is_ needed for pageout path. mmap() the last page
of file. Dirty the data past the EOF (MMU will not catch that access).
Let the pageout send the page to disk. You don't want to have the data
past EOF end up on the backstore.

Al, slowly getting back to life and sanity. Fsck the flu...

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-22  0:19             ` Marcelo Tosatti
@ 2000-12-22  2:20               ` Andreas Dilger
  2000-12-22  0:38                 ` Marcelo Tosatti
  0 siblings, 1 reply; 57+ messages in thread
From: Andreas Dilger @ 2000-12-22  2:20 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Chris Mason, Stephen C. Tweedie, Alexander Viro, Linus Torvalds,
	Russell Cattelan, linux-kernel

Marcelo Tosatti writes:
> It seems your code has a problem with bh flush time.
> 
> In flush_dirty_buffers(), a buffer may (if being called from kupdate) only
> be written in case its old enough. (bh->b_flushtime)
> 
> If the flush happens for an anonymous buffer, you'll end up writing all
> buffers which are sitting on the same page (with block_write_anon_page),
> but these other buffers are not necessarily old enough to be flushed.

This isn't really a "problem" however.  The page is the _maximum_ age of
the buffer before it needs to be written.  If we can efficiently write it
out with another buffer (essentially for free if they are on the same
spot on disk), then there is less work for us to do later.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-22  1:54             ` Alexander Viro
@ 2000-12-22 13:49               ` Chris Mason
  0 siblings, 0 replies; 57+ messages in thread
From: Chris Mason @ 2000-12-22 13:49 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Stephen C. Tweedie, Marcelo Tosatti, Linus Torvalds,
	Russell Cattelan, linux-kernel



On Thursday, December 21, 2000 20:54:09 -0500 Alexander Viro <viro@math.psu.edu> wrote:

> 
> 
> On Thu, 21 Dec 2000, Chris Mason wrote:
> 
>> Obvious bug, block_write_full_page zeros out the bits past the end of 
>> file every time.  This should not be needed for normal file writes.
> 
> Unfortunately, it _is_ needed for pageout path. mmap() the last page
> of file. Dirty the data past the EOF (MMU will not catch that access).
> Let the pageout send the page to disk. You don't want to have the data
> past EOF end up on the backstore.
> 

Sorry, I wasn't very clear.  I'd like to find a way to just do the memset in the one case it is needed (a real writepage), and not do it when we are using writepage just to flush dirty buffers.  I don't see how to do it though, without making a new flush operation.  That would also solve my concerns about the change in writepage semantics (even though every FS I could find in the kernel tree was using block_write_full_page). 

thanks,
Chris


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-22  0:38                 ` Marcelo Tosatti
@ 2000-12-22 13:56                   ` Chris Mason
  2000-12-22 16:45                     ` Daniel Phillips
  2000-12-22 15:07                   ` Chris Mason
  1 sibling, 1 reply; 57+ messages in thread
From: Chris Mason @ 2000-12-22 13:56 UTC (permalink / raw)
  To: Marcelo Tosatti, Andreas Dilger
  Cc: Stephen C. Tweedie, Alexander Viro, Linus Torvalds,
	Russell Cattelan, linux-kernel



On Thursday, December 21, 2000 22:38:04 -0200 Marcelo Tosatti <marcelo@conectiva.com.br> wrote:

> 
> On Thu, 21 Dec 2000, Andreas Dilger wrote:
> 
>> Marcelo Tosatti writes:
>> > It seems your code has a problem with bh flush time.
>> > 
>> > In flush_dirty_buffers(), a buffer may (if being called from kupdate) only
>> > be written in case its old enough. (bh->b_flushtime)
>> > 
>> > If the flush happens for an anonymous buffer, you'll end up writing all
>> > buffers which are sitting on the same page (with block_write_anon_page),
>> > but these other buffers are not necessarily old enough to be flushed.
>> 
>> This isn't really a "problem" however.  The page is the _maximum_ age of
>> the buffer before it needs to be written.  If we can efficiently write it
>> out with another buffer 
> 
> 
>> (essentially for free if they are on the same spot on disk)
> 
> Are you sure this is true for buffer pages in most cases? 
> 
> 

It's a good point.  block_write_anon_page could be changed to just write the oldest buffer and redirty the page (if the buffers are far apart).  If memory is tight, and we *really* need the page back, it will be flushed by try_to_free_buffers.

It seems a bit nasty to me though...writepage should write the page.

-chris



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-22  0:38                 ` Marcelo Tosatti
  2000-12-22 13:56                   ` Chris Mason
@ 2000-12-22 15:07                   ` Chris Mason
  2000-12-22 19:52                     ` Marcelo Tosatti
  1 sibling, 1 reply; 57+ messages in thread
From: Chris Mason @ 2000-12-22 15:07 UTC (permalink / raw)
  To: Marcelo Tosatti, Andreas Dilger
  Cc: Stephen C. Tweedie, Alexander Viro, Linus Torvalds,
	Russell Cattelan, linux-kernel



On Thursday, December 21, 2000 22:38:04 -0200 Marcelo Tosatti
<marcelo@conectiva.com.br> wrote:

>> Marcelo Tosatti writes:
>> > It seems your code has a problem with bh flush time.
>> > 
>> > In flush_dirty_buffers(), a buffer may (if being called from kupdate)
>> > only be written in case its old enough. (bh->b_flushtime)
>> > 
>> > If the flush happens for an anonymous buffer, you'll end up writing all
>> > buffers which are sitting on the same page (with
>> > block_write_anon_page), but these other buffers are not necessarily
>> > old enough to be flushed.
>> 

A quick benchmark shows there's room for improvement here.  I'll play
around with a version of block_write_anon_page that tries to be more
selective when flushing things out.

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-22 13:56                   ` Chris Mason
@ 2000-12-22 16:45                     ` Daniel Phillips
  2000-12-22 19:35                       ` Chris Mason
  0 siblings, 1 reply; 57+ messages in thread
From: Daniel Phillips @ 2000-12-22 16:45 UTC (permalink / raw)
  To: Chris Mason, linux-kernel

Chris Mason wrote:
> 
> On Thursday, December 21, 2000 22:38:04 -0200 Marcelo Tosatti <marcelo@conectiva.com.br> wrote:
> >
> > On Thu, 21 Dec 2000, Andreas Dilger wrote:
> >
> >> Marcelo Tosatti writes:
> >> > It seems your code has a problem with bh flush time.
> >> >
> >> > In flush_dirty_buffers(), a buffer may (if being called from kupdate) only
> >> > be written in case its old enough. (bh->b_flushtime)
> >> >
> >> > If the flush happens for an anonymous buffer, you'll end up writing all
> >> > buffers which are sitting on the same page (with block_write_anon_page),
> >> > but these other buffers are not necessarily old enough to be flushed.
> >>
> >> This isn't really a "problem" however.  The page is the _maximum_ age of
> >> the buffer before it needs to be written.  If we can efficiently write it
> >> out with another buffer
> >
> >> (essentially for free if they are on the same spot on disk)
> >
> > Are you sure this is true for buffer pages in most cases?
> 
> It's a good point.  block_write_anon_page could be changed to just 
> write the oldest buffer and redirty the page (if the buffers are 
> far apart).  If memory is tight, and we *really* need the page back,
> it will be flushed by try_to_free_buffers.
> 
> It seems a bit nasty to me though...writepage should write the page.

Um.  Why cater to the uncommon case of 1K blocks?  Just let
bdflush/kupdated deal with them in the normal way - it's pretty
efficient.  Only try to do the clustering optimization when buffer size
matches memory page size.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-22 16:45                     ` Daniel Phillips
@ 2000-12-22 19:35                       ` Chris Mason
  0 siblings, 0 replies; 57+ messages in thread
From: Chris Mason @ 2000-12-22 19:35 UTC (permalink / raw)
  To: Daniel Phillips, linux-kernel



On Friday, December 22, 2000 17:45:57 +0100 Daniel Phillips
<phillips@innominate.de> wrote:

[ flushing a page at a time in bdflush ]

> Um.  Why cater to the uncommon case of 1K blocks?  Just let
> bdflush/kupdated deal with them in the normal way - it's pretty
> efficient.  Only try to do the clustering optimization when buffer size
> matches memory page size.
> 

This isn't really an attempt at a clustering optimization.  The problem at
hand is that buffer cache buffers can be on relatively random pages.  So, a
page might have buffers that are very far apart, where one needs flushing
and the other doesn't.

In the blocksize == page size case, this won't happen, and we don't lose
any speed over the existing code.  In the blocksize < pagesize case, my new
code is slower, so my goal is to fix just that problem.

Real write clustering would be a different issue entirely, and is worth
doing ;-)

-chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-22 15:07                   ` Chris Mason
@ 2000-12-22 19:52                     ` Marcelo Tosatti
  2000-12-22 23:18                       ` Chris Mason
  2000-12-27  0:57                       ` Chris Mason
  0 siblings, 2 replies; 57+ messages in thread
From: Marcelo Tosatti @ 2000-12-22 19:52 UTC (permalink / raw)
  To: Chris Mason
  Cc: Andreas Dilger, Stephen C. Tweedie, Alexander Viro,
	Linus Torvalds, Russell Cattelan, linux-kernel


On Fri, 22 Dec 2000, Chris Mason wrote:

> 
> 
> On Thursday, December 21, 2000 22:38:04 -0200 Marcelo Tosatti
> <marcelo@conectiva.com.br> wrote:
> 
> >> Marcelo Tosatti writes:
> >> > It seems your code has a problem with bh flush time.
> >> > 
> >> > In flush_dirty_buffers(), a buffer may (if being called from kupdate)
> >> > only be written in case its old enough. (bh->b_flushtime)
> >> > 
> >> > If the flush happens for an anonymous buffer, you'll end up writing all
> >> > buffers which are sitting on the same page (with
> >> > block_write_anon_page), but these other buffers are not necessarily
> >> > old enough to be flushed.
> >> 
> 
> A quick benchmark shows there's room for improvement here.  I'll play
> around with a version of block_write_anon_page that tries to be more
> selective when flushing things out.

There is one more nasty issue to deal with. 

You only want to take into account the buffer flushtime if
"check_flushtime" parameter is passed as true to flush_dirty_buffers
(which is done by kupdate).

Thinking a bit more about the issue, I dont see any reason why we want to
write all buffers of an anonymous page at
sync_buffers/flush_dirty_buffers.

I think we could simply write the buffer with ll_rw_block() if the page
which this buffer is sitting on does not have mapping->a_ops->writepage
operation defined.

Chris? 






-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-22 19:52                     ` Marcelo Tosatti
@ 2000-12-22 23:18                       ` Chris Mason
  2000-12-22 23:26                         ` Marcelo Tosatti
  2000-12-23 15:47                         ` Daniel Phillips
  2000-12-27  0:57                       ` Chris Mason
  1 sibling, 2 replies; 57+ messages in thread
From: Chris Mason @ 2000-12-22 23:18 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Andreas Dilger, Stephen C. Tweedie, Alexander Viro,
	Linus Torvalds, Russell Cattelan, linux-kernel



On Friday, December 22, 2000 17:52:28 -0200 Marcelo Tosatti
<marcelo@conectiva.com.br> wrote:

> There is one more nasty issue to deal with. 
> 
> You only want to take into account the buffer flushtime if
> "check_flushtime" parameter is passed as true to flush_dirty_buffers
> (which is done by kupdate).
> 
> Thinking a bit more about the issue, I dont see any reason why we want to
> write all buffers of an anonymous page at
> sync_buffers/flush_dirty_buffers.
> 
> I think we could simply write the buffer with ll_rw_block() if the page
> which this buffer is sitting on does not have mapping->a_ops->writepage
> operation defined.
> 

It is enough to leave buffer heads we don't flush on the dirty list (and
redirty the page), they'll get written by a future loop through
flush_dirty_pages, or by page_launder.  We could use ll_rw_block instead,
even though anon pages do have a writepage with this patch (just check if
page->mapping == &anon_space_mapping).

There are lots of things we could try here, including some logic inside
block_write_anon_page to check the current memory pressure/dirty levels to
see how much work should be done.

I'll play with this a bit, but more ideas would be appreciated.

BTW, recent change to my local code was to remove the ll_rw_block call from
sync_page_buffers.  It seemed cleaner than making try_to_free_buffers
understand that sometimes writepage will be called, and sometimes the page
will end up unlocked because of it....comments?

-chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-22 23:18                       ` Chris Mason
@ 2000-12-22 23:26                         ` Marcelo Tosatti
  2000-12-23 18:21                           ` Chris Mason
  2000-12-23 15:47                         ` Daniel Phillips
  1 sibling, 1 reply; 57+ messages in thread
From: Marcelo Tosatti @ 2000-12-22 23:26 UTC (permalink / raw)
  To: Chris Mason
  Cc: Andreas Dilger, Stephen C. Tweedie, Alexander Viro,
	Linus Torvalds, Russell Cattelan, linux-kernel


On Fri, 22 Dec 2000, Chris Mason wrote:

> It is enough to leave buffer heads we don't flush on the dirty list (and
> redirty the page), they'll get written by a future loop through
> flush_dirty_pages, or by page_launder.  We could use ll_rw_block instead,
> even though anon pages do have a writepage with this patch (just check if
> page->mapping == &anon_space_mapping).

If we use ll_rw_block directly on buffers of anonymous pages
(page->mapping == &anon_space_mapping) instead using
dirty_list_writepage() (which will end up calling block_write_anon_page)
we can fix the buffer flushtime issue.

> There are lots of things we could try here, including some logic inside
> block_write_anon_page to check the current memory pressure/dirty levels to
> see how much work should be done.

At writepage() context we do not know if we should care about the
flushtime.

> I'll play with this a bit, but more ideas would be appreciated.
>
> BTW, recent change to my local code was to remove the ll_rw_block call from
> sync_page_buffers.  It seemed cleaner than making try_to_free_buffers
> understand that sometimes writepage will be called, and sometimes the page
> will end up unlocked because of it....comments?

Could you please send me your latest patch ?

Thanks

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-22 23:18                       ` Chris Mason
  2000-12-22 23:26                         ` Marcelo Tosatti
@ 2000-12-23 15:47                         ` Daniel Phillips
  1 sibling, 0 replies; 57+ messages in thread
From: Daniel Phillips @ 2000-12-23 15:47 UTC (permalink / raw)
  To: Chris Mason, linux-kernel

Chris Mason wrote:
> It is enough to leave buffer heads we don't flush on the dirty list (and
> redirty the page), they'll get written by a future loop through
> flush_dirty_pages, or by page_launder.  We could use ll_rw_block instead,
> even though anon pages do have a writepage with this patch (just check if
> page->mapping == &anon_space_mapping).

anon_space_mapping... this is really useful.  This would make a nice
patch on its own.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-22 23:26                         ` Marcelo Tosatti
@ 2000-12-23 18:21                           ` Chris Mason
  2000-12-23 19:02                             ` Linus Torvalds
  0 siblings, 1 reply; 57+ messages in thread
From: Chris Mason @ 2000-12-23 18:21 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Andreas Dilger, Stephen C. Tweedie, Alexander Viro,
	Linus Torvalds, Russell Cattelan, linux-kernel



On Friday, December 22, 2000 21:26:33 -0200 Marcelo Tosatti 
> If we use ll_rw_block directly on buffers of anonymous pages
> (page->mapping == &anon_space_mapping) instead using
> dirty_list_writepage() (which will end up calling block_write_anon_page)
> we can fix the buffer flushtime issue.
> 

Ok, I'm just being stubborn.  The point of the patch was to get rid
of the ll_rw_block calls, so I'm resisting adding one back in ;-)
But, your way does seem like the least complicated method to honor 
the flushtime param for anon pages, and I'll switch over to that.

I've updated to test13-pre4, and removed the hunk for submit_bh.
Looks as though pre4 changed the submit_bh callers to clear the dirty
bit, so my code does the same.

Other changes: sync_page_buffers doesn't write blocks on the page, it 
only waits on them.  Minor change to when the page count is
increased before writepage is called.

I still need to update fsync_inode_buffers....doesn't look like that
will happen before I head off to visit family, where for some reason
they've banned me from reading email ;-)  I'll be back on Tuesday,
hope everyone has a good holiday.  New patch below:

-chris

diff -urN linux-test13-pre4/fs/buffer.c linux-anon-space/fs/buffer.c
--- linux-test13-pre4/fs/buffer.c	Sat Dec 23 13:14:48 2000
+++ linux-anon-space/fs/buffer.c	Sat Dec 23 13:30:24 2000
@@ -97,6 +97,16 @@
 
 static int grow_buffers(int size);
 static void __refile_buffer(struct buffer_head *);
+static int block_write_anon_page(struct page *);
+
+static struct address_space_operations anon_space_ops = {
+	writepage: block_write_anon_page,
+	sync_page: block_sync_page,
+} ;
+static struct address_space anon_space_mapping = {
+	pages: { &anon_space_mapping.pages, &anon_space_mapping.pages },
+	a_ops: &anon_space_ops,
+} ;
 
 /* This is used by some architectures to estimate available memory. */
 atomic_t buffermem_pages = ATOMIC_INIT(0);
@@ -161,6 +171,30 @@
 	atomic_dec(&bh->b_count);
 }
 
+/*
+** util function for sync_buffers and flush_dirty_buffers
+** uses either the writepage func supplied in the page's mapping,
+** or the anon address space writepage
+*/
+static int dirty_list_writepage(struct page *page) {
+	int (*writepage)(struct page *)  ;
+	int ret ;
+
+	writepage = page->mapping->a_ops->writepage ;
+
+	if (!writepage) {
+		writepage = anon_space_ops.writepage ;
+	}
+
+	ClearPageDirty(page) ;
+	ret = writepage(page) ;
+	if (ret == 1) {
+		SetPageDirty(page) ;
+		UnlockPage(page) ;
+	}
+	return ret ;
+}
+
 /* Call sync_buffers with wait!=0 to ensure that the call does not
  * return until all buffer writes have completed.  Sync() may return
  * before the writes have finished; fsync() may not.
@@ -175,6 +209,7 @@
 {
 	int i, retry, pass = 0, err = 0;
 	struct buffer_head * bh, *next;
+	struct page *page ;
 
 	/* One pass for no-wait, three for wait:
 	 * 0) write out all dirty, unlocked buffers;
@@ -230,10 +265,22 @@
 			if (!buffer_dirty(bh) || pass >= 2)
 				continue;
 
-			atomic_inc(&bh->b_count);
+			page = bh->b_page ;
+			page_cache_get(page) ;
+			if (TryLockPage(page)) {
+				if (!wait || !pass) {
+					retry = 1 ;
+					continue ;
+				}
+				spin_unlock(&lru_list_lock);
+				wait_on_page(page) ;
+				page_cache_release(page) ;
+				goto repeat ;
+			}
 			spin_unlock(&lru_list_lock);
-			ll_rw_block(WRITE, 1, &bh);
-			atomic_dec(&bh->b_count);
+
+			dirty_list_writepage(page) ;
+			page_cache_release(page) ;
 			retry = 1;
 			goto repeat;
 		}
@@ -1101,8 +1148,10 @@
 	int dispose = BUF_CLEAN;
 	if (buffer_locked(bh))
 		dispose = BUF_LOCKED;
-	if (buffer_dirty(bh))
+	if (buffer_dirty(bh)) {
 		dispose = BUF_DIRTY;
+		SetPageDirty(bh->b_page) ;
+	}
 	if (buffer_protected(bh))
 		dispose = BUF_PROTECTED;
 	if (dispose != bh->b_list) {
@@ -1478,6 +1527,53 @@
  * "Dirty" is valid only with the last case (mapped+uptodate).
  */
 
+static int block_write_anon_page(struct page *page) 
+{
+	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+	int i, nr = 0 ;
+	int partial = 0 ;
+	int ret = 0 ;
+
+	if (!PageLocked(page))
+		BUG();
+
+	if (!page->buffers)
+		BUG() ;
+
+	head = page->buffers;
+	bh = head;
+
+	/* Stage 1: find the dirty buffers, lock them for submit_bh */
+	do {
+		if (!test_and_set_bit(BH_Lock, &bh->b_state)) {
+			if (buffer_uptodate(bh) && buffer_dirty(bh)) {
+				bh->b_end_io = end_buffer_io_async;
+				clear_bit(BH_Dirty, &bh->b_state) ;
+				atomic_inc(&bh->b_count);
+				arr[nr++] = bh ;
+			} else {
+				partial = 1 ;
+				unlock_buffer(bh) ;
+			}
+		} else {
+			partial = 1 ;
+		}
+		bh = bh->b_this_page;
+	} while (bh != head);
+
+	/* Stage 2: submit the IO */
+	for (i = 0 ; i < nr ; i++) {
+		submit_bh(WRITE, arr[i]) ;
+	}
+	/* Done - end_buffer_io_async will unlock */
+	if (!partial)
+		SetPageUptodate(page);
+	if (nr == 0) {
+		UnlockPage(page) ;
+	}
+	return ret ;
+}
+
 /*
  * block_write_full_page() is SMP-safe - currently it's still
  * being called with the kernel lock held, but the code is ready.
@@ -1487,6 +1583,10 @@
 	int err, i;
 	unsigned long block;
 	struct buffer_head *bh, *head;
+	int nr = 0 ;
+	struct buffer_head *arr[MAX_BUF_PER_PAGE] ;
+	int page_ok = Page_Uptodate(page) ;
+	int partial = 0;
 
 	if (!PageLocked(page))
 		BUG();
@@ -1509,36 +1609,46 @@
 		 *
 		 * Leave it to the low-level FS to make all those
 		 * decisions (block #0 may actually be a valid block)
+		 *
+		 * only bother when the page is up to date or the buffer
+		 * is dirty.
 		 */
-		if (!buffer_mapped(bh)) {
-			err = get_block(inode, block, bh, 1);
-			if (err)
-				goto out;
-			if (buffer_new(bh))
-				unmap_underlying_metadata(bh);
+		if (page_ok || buffer_dirty(bh)) {
+			if (!buffer_mapped(bh)) {
+				err = get_block(inode, block, bh, 1);
+				if (err)
+					goto out;
+				if (buffer_new(bh))
+					unmap_underlying_metadata(bh);
+			}
+			arr[nr++] = bh ; 
+		} else {
+			partial = 1 ;
 		}
 		bh = bh->b_this_page;
 		block++;
 	} while (bh != head);
 
 	/* Stage 2: lock the buffers, mark them clean */
-	do {
+	for (i = 0 ; i < nr ; i++) {
+		bh = arr[i] ;
 		lock_buffer(bh);
 		bh->b_end_io = end_buffer_io_async;
 		atomic_inc(&bh->b_count);
 		set_bit(BH_Uptodate, &bh->b_state);
 		clear_bit(BH_Dirty, &bh->b_state);
-		bh = bh->b_this_page;
-	} while (bh != head);
+	} 
 
-	/* Stage 3: submit the IO */
-	do {
-		submit_bh(WRITE, bh);
-		bh = bh->b_this_page;		
-	} while (bh != head);
+	for (i = 0 ; i < nr ; i++) {
+		submit_bh(WRITE, arr[i]) ;
+	}
+
+	if (nr == 0) 
+		UnlockPage(page) ;
 
 	/* Done - end_buffer_io_async will unlock */
-	SetPageUptodate(page);
+	if (!partial)
+		SetPageUptodate(page);
 	return 0;
 
 out:
@@ -1658,6 +1768,45 @@
 }
 
 /*
+** just sets the dirty bits for a range of buffers in the page.  Does
+** not balance the dirty list, or put the buffers onto the dirty list
+*/
+static int __block_dirty_range(struct inode *inode, struct page *page,
+		unsigned from, unsigned to)
+{
+	unsigned block_start, block_end;
+	int partial = 0 ;
+	unsigned blocksize;
+	struct buffer_head *bh, *head;
+
+	blocksize = inode->i_sb->s_blocksize;
+
+	for(bh = head = page->buffers, block_start = 0;
+	    bh != head || !block_start;
+	    block_start=block_end, bh = bh->b_this_page) {
+		block_end = block_start + blocksize;
+		if (block_end <= from || block_start >= to) {
+			if (!buffer_uptodate(bh))
+				partial = 1;
+		} else {
+			set_bit(BH_Uptodate, &bh->b_state);
+			if (!atomic_set_buffer_dirty(bh)) {
+				buffer_insert_inode_queue(bh, inode);
+			}
+		}
+	}
+	/*
+	 * is this a partial write that happened to make all buffers
+	 * uptodate then we can optimize away a bogus readpage() for
+	 * the next read(). Here we 'discover' wether the page went
+	 * uptodate as a result of this (potentially partial) write.
+	 */
+	if (!partial)
+		SetPageUptodate(page);
+	return 0;
+}
+
+/*
  * Generic "read page" function for block devices that have the normal
  * get_block functionality. This is most of the block device filesystems.
  * Reads the page asynchronously --- the unlock_buffer() and
@@ -1947,13 +2096,23 @@
 	if (!err) {
 		memset(page_address(page) + offset, 0, PAGE_CACHE_SIZE - offset);
 		flush_dcache_page(page);
-		__block_commit_write(inode,page,0,offset);
+
+		/* this will just set the dirty bits for block_write_full_page
+		** it is only safe because we have the page locked and
+		** nobody will try to write the buffers between
+		** the block_dirty_range and the write_full_page calls
+		** we have to clear the page up to date so the buffers
+		** past the end of the file won't get written
+		*/
+		__block_dirty_range(inode,page,0,offset);
+		ClearPageUptodate(page);
+		err = __block_write_full_page(inode, page, get_block) ;
 done:
 		kunmap(page);
-		UnlockPage(page);
 		return err;
 	}
 	ClearPageUptodate(page);
+	UnlockPage(page);
 	goto done;
 }
 
@@ -2244,7 +2403,7 @@
 	struct buffer_head *bh, *tmp;
 	struct buffer_head * insert_point;
 	int isize;
-
+	unsigned long index ;
 	if ((size & 511) || (size > PAGE_SIZE)) {
 		printk("VFS: grow_buffers: size = %d\n",size);
 		return 0;
@@ -2260,6 +2419,16 @@
 
 	isize = BUFSIZE_INDEX(size);
 
+	/* don't put this buffer head on the free list until the
+	** page is setup.  Is there a better index to use?  Would 0
+	** be good enough?
+	*/
+	page->flags &= ~(1 << PG_referenced);
+	index = atomic_read(&buffermem_pages) ;
+	atomic_inc(&buffermem_pages);
+	add_to_page_cache_locked(page, &anon_space_mapping, index) ;
+	page->buffers = bh;
+
 	spin_lock(&free_list[isize].lock);
 	insert_point = free_list[isize].list;
 	tmp = bh;
@@ -2283,11 +2452,7 @@
 	free_list[isize].list = bh;
 	spin_unlock(&free_list[isize].lock);
 
-	page->buffers = bh;
-	page->flags &= ~(1 << PG_referenced);
-	lru_cache_add(page);
 	UnlockPage(page);
-	atomic_inc(&buffermem_pages);
 	return 1;
 
 no_buffer_head:
@@ -2309,7 +2474,6 @@
  *
  * Wait:
  *	0 - no wait (this does not get called - see try_to_free_buffers below)
- *	1 - start IO for dirty buffers
  *	2 - wait for completion of locked buffers
  */
 static void sync_page_buffers(struct buffer_head *bh, int wait)
@@ -2319,11 +2483,9 @@
 	do {
 		struct buffer_head *p = tmp;
 		tmp = tmp->b_this_page;
-		if (buffer_locked(p)) {
-			if (wait > 1)
-				__wait_on_buffer(p);
-		} else if (buffer_dirty(p))
-			ll_rw_block(WRITE, 1, &p);
+		if (buffer_locked(p) && wait > 1) {
+			__wait_on_buffer(p);
+		} 
 	} while (tmp != bh);
 }
 
@@ -2386,6 +2548,9 @@
 
 	/* And free the page */
 	page->buffers = NULL;
+	if (page->mapping == (&anon_space_mapping)) {
+		atomic_dec(&buffermem_pages) ;
+	}
 	page_cache_release(page);
 	spin_unlock(&free_list[index].lock);
 	write_unlock(&hash_table_lock);
@@ -2564,6 +2729,7 @@
 static int flush_dirty_buffers(int check_flushtime)
 {
 	struct buffer_head * bh, *next;
+	struct page *page ;
 	int flushed = 0, i;
 
  restart:
@@ -2580,6 +2746,8 @@
 		}
 		if (buffer_locked(bh))
 			continue;
+		if (!buffer_uptodate(bh))
+			continue ;
 
 		if (check_flushtime) {
 			/* The dirty lru list is chronologically ordered so
@@ -2592,13 +2760,15 @@
 			if (++flushed > bdf_prm.b_un.ndirty)
 				goto out_unlock;
 		}
-
-		/* OK, now we are committed to write it out. */
-		atomic_inc(&bh->b_count);
-		spin_unlock(&lru_list_lock);
-		ll_rw_block(WRITE, 1, &bh);
-		atomic_dec(&bh->b_count);
-
+		page = bh->b_page ;
+		page_cache_get(page) ;
+		if (TryLockPage(page)) {
+			page_cache_release(page) ;
+			continue ;
+		}
+		spin_unlock(&lru_list_lock) ;
+		dirty_list_writepage(page) ;
+		page_cache_release(page) ;
 		if (current->need_resched)
 			schedule();
 		goto restart;
diff -urN linux-test13-pre4/mm/page_alloc.c linux-anon-space/mm/page_alloc.c
--- linux-test13-pre4/mm/page_alloc.c	Tue Nov 28 13:54:31 2000
+++ linux-anon-space/mm/page_alloc.c	Thu Dec 21 16:24:44 2000
@@ -317,11 +317,12 @@
 	/*
 	 * If we are about to get low on free pages and cleaning
 	 * the inactive_dirty pages would fix the situation,
-	 * wake up bdflush.
+	 * wake up kswapd here as well, so page_launder can start
+	 * sending things to disk.
 	 */
 	else if (free_shortage() && nr_inactive_dirty_pages > free_shortage()
 			&& nr_inactive_dirty_pages >= freepages.high)
-		wakeup_bdflush(0);
+		wakeup_kswapd(0);
 
 try_again:
 	/*
diff -urN linux-test13-pre4/mm/vmscan.c linux-anon-space/mm/vmscan.c
--- linux-test13-pre4/mm/vmscan.c	Sat Dec 23 13:14:26 2000
+++ linux-anon-space/mm/vmscan.c	Sat Dec 23 13:33:53 2000
@@ -678,7 +678,6 @@
 
 			/* The page was only in the buffer cache. */
 			} else if (!page->mapping) {
-				atomic_dec(&buffermem_pages);
 				freed_page = 1;
 				cleaned_pages++;
 
@@ -739,9 +738,6 @@
 	 * free anything yet, we wait synchronously on the writeout of
 	 * MAX_SYNC_LAUNDER pages.
 	 *
-	 * We also wake up bdflush, since bdflush should, under most
-	 * loads, flush out the dirty pages before we have to wait on
-	 * IO.
 	 */
 	if (can_get_io_locks && !launder_loop && free_shortage()) {
 		launder_loop = 1;
@@ -750,8 +746,6 @@
 			sync = 0;
 		/* We only do a few "out of order" flushes. */
 		maxlaunder = MAX_LAUNDER;
-		/* Kflushd takes care of the rest. */
-		wakeup_bdflush(0);
 		goto dirty_page_rescan;
 	}
 





-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-23 18:21                           ` Chris Mason
@ 2000-12-23 19:02                             ` Linus Torvalds
  2000-12-23 19:25                               ` Chris Mason
  0 siblings, 1 reply; 57+ messages in thread
From: Linus Torvalds @ 2000-12-23 19:02 UTC (permalink / raw)
  To: Chris Mason
  Cc: Marcelo Tosatti, Andreas Dilger, Stephen C. Tweedie,
	Alexander Viro, Russell Cattelan, linux-kernel



On Sat, 23 Dec 2000, Chris Mason wrote:
> 
> I've updated to test13-pre4, and removed the hunk for submit_bh.
> Looks as though pre4 changed the submit_bh callers to clear the dirty
> bit, so my code does the same.

Basically, I wanted to think of "submit_bh()" as a pure IO thing. When we
call submit_bh(), that is basically the same as "statr IO on this thing".
Which implies to me that submit_bh() doesn't care, or know, about why the
higher layers did this.

Which is why I prefer the higher layers handling the dirty/uptodate/xxx
bits. 

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-23 19:02                             ` Linus Torvalds
@ 2000-12-23 19:25                               ` Chris Mason
  0 siblings, 0 replies; 57+ messages in thread
From: Chris Mason @ 2000-12-23 19:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Marcelo Tosatti, Andreas Dilger, Stephen C. Tweedie,
	Alexander Viro, Russell Cattelan, linux-kernel


On Saturday, December 23, 2000 11:02:53 -0800 Linus Torvalds
<torvalds@transmeta.com> wrote:
> 
> Which is why I prefer the higher layers handling the dirty/uptodate/xxx
> bits. 
> 

Grin, I should have taken the hint when we talked about the buffer up to
date checks for block_read_full_page, it made sense then too.  

brw_page doesn't clear the dirty bit in pre4, was that on purpose?

-chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-27  0:57                       ` Chris Mason
@ 2000-12-26 23:18                         ` Marcelo Tosatti
  2000-12-27 20:26                         ` Daniel Phillips
  1 sibling, 0 replies; 57+ messages in thread
From: Marcelo Tosatti @ 2000-12-26 23:18 UTC (permalink / raw)
  To: Chris Mason
  Cc: Andreas Dilger, Stephen C. Tweedie, Alexander Viro,
	Linus Torvalds, Russell Cattelan, ananth, linux-kernel



On Tue, 26 Dec 2000, Chris Mason wrote:

> 
> Hi guys,
> 
> Here's my latest code, which uses ll_rw_block for anon pages (or
> pages without a writepage func) when flush_dirty_buffers, 
> sync_buffers, or fsync_inode_buffers are flushing things.  This
> seems to have fixed my slowdown on 1k buffer sizes, but I
> haven't done extensive benchmarks yet.

Great.

I'll run some benchmarks around here too and let you know. 

> Other changes:  After freeing a page with buffers, page_launder
> now stops if (!free_shortage()).  This is a mod of the check where 
> page_launder checked free_shortage after freeing a buffer cache 
> page.  Code outside buffer.c can't detect buffer cache pages with 
> this patch, so the old check doesn't apply.  
> 
> My change doesn't seem quite right though, if page_launder wants 
> to stop when there isn't a shortage, it should do that regardless of
> if the page it just freed had buffers.  It looks like this was added
> so bdflush could call page_launder, and get an early out after
> freeing some buffer heads, but I'm not sure.
> 
> In test13-pre4, invalidate_buffers skips buffers on a page
> with a mapping.  I changed that to skip mappings other than the
> anon space mapping.
> 
> Comments and/or suggestions on how to make better use of this stuff
> are more than welcome ;-)

Well, the best use of this patch seems to be the ability to do write
clustering in the ->writepage() operation for normal filesystems.

I'll try to do a lightweight write clustering patch for
block_write_full_page soon.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-22 19:52                     ` Marcelo Tosatti
  2000-12-22 23:18                       ` Chris Mason
@ 2000-12-27  0:57                       ` Chris Mason
  2000-12-26 23:18                         ` Marcelo Tosatti
  2000-12-27 20:26                         ` Daniel Phillips
  1 sibling, 2 replies; 57+ messages in thread
From: Chris Mason @ 2000-12-27  0:57 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Andreas Dilger, Stephen C. Tweedie, Alexander Viro,
	Linus Torvalds, Russell Cattelan, linux-kernel


Hi guys,

Here's my latest code, which uses ll_rw_block for anon pages (or
pages without a writepage func) when flush_dirty_buffers, 
sync_buffers, or fsync_inode_buffers are flushing things.  This
seems to have fixed my slowdown on 1k buffer sizes, but I
haven't done extensive benchmarks yet.

Other changes:  After freeing a page with buffers, page_launder
now stops if (!free_shortage()).  This is a mod of the check where 
page_launder checked free_shortage after freeing a buffer cache 
page.  Code outside buffer.c can't detect buffer cache pages with 
this patch, so the old check doesn't apply.  

My change doesn't seem quite right though, if page_launder wants 
to stop when there isn't a shortage, it should do that regardless of
if the page it just freed had buffers.  It looks like this was added
so bdflush could call page_launder, and get an early out after
freeing some buffer heads, but I'm not sure.

In test13-pre4, invalidate_buffers skips buffers on a page
with a mapping.  I changed that to skip mappings other than the
anon space mapping.

Comments and/or suggestions on how to make better use of this stuff
are more than welcome ;-)

-chris

diff -urN linux-test13-pre4/fs/buffer.c linux-anon-space/fs/buffer.c
--- linux-test13-pre4/fs/buffer.c	Sat Dec 23 13:14:48 2000
+++ linux-anon-space/fs/buffer.c	Tue Dec 26 00:58:06 2000
@@ -97,6 +97,17 @@
 
 static int grow_buffers(int size);
 static void __refile_buffer(struct buffer_head *);
+static int block_write_anon_page(struct page *);
+static void end_buffer_io_async(struct buffer_head * bh, int uptodate) ;
+
+static struct address_space_operations anon_space_ops = {
+	writepage: block_write_anon_page,
+	sync_page: block_sync_page,
+} ;
+static struct address_space anon_space_mapping = {
+	pages: { &anon_space_mapping.pages, &anon_space_mapping.pages },
+	a_ops: &anon_space_ops,
+} ;
 
 /* This is used by some architectures to estimate available memory. */
 atomic_t buffermem_pages = ATOMIC_INIT(0);
@@ -161,6 +172,73 @@
 	atomic_dec(&bh->b_count);
 }
 
+/* just for use with anon pages, or pages that don't provide their own
+** writepage func.  We just want to write bh, not the whole page, so we
+** queue that io here instead of calling writepage.
+*/
+static int __dirty_list_writepage(struct page *page, struct buffer_head *bh) {
+	int other_dirty = 0 ;
+	struct buffer_head *cur ;
+
+	/* check for other dirty buffers on this page.  If there are none,
+	** clear the page dirty bit
+	*/
+	cur = bh->b_this_page ;
+	while(cur != bh) {
+		other_dirty += buffer_dirty(cur) ;	
+		cur = cur->b_this_page ;
+	} 
+	if (other_dirty == 0) {
+		ClearPageDirty(page) ;
+	} 
+
+	/* we want the page available for locking again right away.  
+	** someone walking the dirty buffer list might find another
+	** buffer from this page, and we don't want them to skip it in
+	** favor of a younger buffer.
+	*/
+	atomic_inc(&bh->b_count) ;
+	ll_rw_block(WRITE, 1, &bh) ;
+	atomic_dec(&bh->b_count) ;
+	UnlockPage(page) ;
+	return 0 ;
+}
+
+/*
+** util function for sync_buffers and flush_dirty_buffers
+** uses either the writepage func supplied in the page's mapping,
+** or the anon address space writepage
+*/
+static int dirty_list_writepage(struct page *page, struct buffer_head *bh) {
+	int (*writepage)(struct page *)  ;
+	int ret ;
+
+	/* someone wrote this page while we were locking before */
+	if (!PageDirty(page) && !buffer_dirty(bh)) {
+		UnlockPage(page) ;
+		return 0 ;
+	}
+	writepage = page->mapping->a_ops->writepage ;
+
+	/* For anon pages, and pages that don't have a writepage
+	** func, just write this one dirty buffer.  __dirty_list_writepage
+	** does a little more work to make sure the page dirty bit is cleared
+	** when we are the only dirty buffer on this page
+	*/
+	if (!writepage || page->mapping == &anon_space_mapping) {
+		writepage = anon_space_ops.writepage ;
+		return __dirty_list_writepage(page, bh) ;
+	}
+
+	ClearPageDirty(page) ;
+	ret = writepage(page) ;
+	if (ret == 1) {
+		SetPageDirty(page) ;
+		UnlockPage(page) ;
+	}
+	return ret ;
+}
+
 /* Call sync_buffers with wait!=0 to ensure that the call does not
  * return until all buffer writes have completed.  Sync() may return
  * before the writes have finished; fsync() may not.
@@ -175,6 +253,7 @@
 {
 	int i, retry, pass = 0, err = 0;
 	struct buffer_head * bh, *next;
+	struct page *page ;
 
 	/* One pass for no-wait, three for wait:
 	 * 0) write out all dirty, unlocked buffers;
@@ -230,10 +309,27 @@
 			if (!buffer_dirty(bh) || pass >= 2)
 				continue;
 
-			atomic_inc(&bh->b_count);
+			page = bh->b_page ;
+			page_cache_get(page) ;
+			if (TryLockPage(page)) {
+				if (!wait || !pass) {
+					retry = 1 ;
+					continue ;
+				}
+				spin_unlock(&lru_list_lock);
+				wait_on_page(page) ;
+				page_cache_release(page) ;
+				goto repeat ;
+			}
 			spin_unlock(&lru_list_lock);
-			ll_rw_block(WRITE, 1, &bh);
-			atomic_dec(&bh->b_count);
+
+			/* if the writepage func returns 1, it is 
+			** responsible for marking the buffers dirty
+			** again (or not marking them clean at all).
+			** we'll catch them again on the next pass
+			*/
+			dirty_list_writepage(page, bh) ;
+			page_cache_release(page) ;
 			retry = 1;
 			goto repeat;
 		}
@@ -644,7 +740,7 @@
 			if (bh->b_dev != dev)
 				continue;
 			/* Part of a mapping? */
-			if (bh->b_page->mapping)
+			if (bh->b_page->mapping != &anon_space_mapping)
 				continue;
 			if (buffer_locked(bh)) {
 				atomic_inc(&bh->b_count);
@@ -852,13 +948,14 @@
 int fsync_inode_buffers(struct inode *inode)
 {
 	struct buffer_head *bh;
-	struct inode tmp;
+	struct inode tmp ;
 	int err = 0, err2;
+	struct page * page ;
+	int ret ;
 	
 	INIT_LIST_HEAD(&tmp.i_dirty_buffers);
 	
 	spin_lock(&lru_list_lock);
-
 	while (!list_empty(&inode->i_dirty_buffers)) {
 		bh = BH_ENTRY(inode->i_dirty_buffers.next);
 		list_del(&bh->b_inode_buffers);
@@ -868,11 +965,28 @@
 			bh->b_inode = &tmp;
 			list_add(&bh->b_inode_buffers, &tmp.i_dirty_buffers);
 			if (buffer_dirty(bh)) {
-				atomic_inc(&bh->b_count);
+				page = bh->b_page ;
+				page_cache_get(page) ;
 				spin_unlock(&lru_list_lock);
-				ll_rw_block(WRITE, 1, &bh);
-				brelse(bh);
+
+				LockPage(page) ;
+				ret = dirty_list_writepage(page, bh) ;
+				page_cache_release(page) ;
+
 				spin_lock(&lru_list_lock);
+
+				/* if the writepage func decided to skip
+				** this page, we have to put it back onto
+				** the dirty buffer list.  we add onto the 
+				** tail so this buffer will be retried after
+				** all the other writes have gone through.
+				*/
+				if (ret == 1) {
+					list_del(&bh->b_inode_buffers) ;
+					list_add_tail(&bh->b_inode_buffers,
+						      &inode->i_dirty_buffers) ;
+					bh->b_inode = inode ;
+				}
 			}
 		}
 	}
@@ -1101,8 +1215,10 @@
 	int dispose = BUF_CLEAN;
 	if (buffer_locked(bh))
 		dispose = BUF_LOCKED;
-	if (buffer_dirty(bh))
+	if (buffer_dirty(bh)) {
 		dispose = BUF_DIRTY;
+		SetPageDirty(bh->b_page) ;
+	}
 	if (buffer_protected(bh))
 		dispose = BUF_PROTECTED;
 	if (dispose != bh->b_list) {
@@ -1478,6 +1594,53 @@
  * "Dirty" is valid only with the last case (mapped+uptodate).
  */
 
+static int block_write_anon_page(struct page *page) 
+{
+	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+	int i, nr = 0 ;
+	int partial = 0 ;
+	int ret = 0 ;
+
+	if (!PageLocked(page))
+		BUG();
+
+	if (!page->buffers)
+		BUG() ;
+
+	head = page->buffers;
+	bh = head;
+
+	/* Stage 1: find the dirty buffers, lock them for submit_bh */
+	do {
+		if (!test_and_set_bit(BH_Lock, &bh->b_state)) {
+			if (buffer_uptodate(bh) && buffer_dirty(bh)) {
+				bh->b_end_io = end_buffer_io_async;
+				clear_bit(BH_Dirty, &bh->b_state) ;
+				atomic_inc(&bh->b_count);
+				arr[nr++] = bh ;
+			} else {
+				partial = 1 ;
+				unlock_buffer(bh) ;
+			}
+		} else {
+			partial = 1 ;
+		}
+		bh = bh->b_this_page;
+	} while (bh != head);
+
+	/* Stage 2: submit the IO */
+	for (i = 0 ; i < nr ; i++) {
+		submit_bh(WRITE, arr[i]) ;
+	}
+	/* Done - end_buffer_io_async will unlock */
+	if (!partial)
+		SetPageUptodate(page);
+	if (nr == 0) {
+		UnlockPage(page) ;
+	}
+	return ret ;
+}
+
 /*
  * block_write_full_page() is SMP-safe - currently it's still
  * being called with the kernel lock held, but the code is ready.
@@ -1487,6 +1650,10 @@
 	int err, i;
 	unsigned long block;
 	struct buffer_head *bh, *head;
+	int nr = 0 ;
+	struct buffer_head *arr[MAX_BUF_PER_PAGE] ;
+	int page_ok = Page_Uptodate(page) ;
+	int partial = 0;
 
 	if (!PageLocked(page))
 		BUG();
@@ -1509,36 +1676,46 @@
 		 *
 		 * Leave it to the low-level FS to make all those
 		 * decisions (block #0 may actually be a valid block)
+		 *
+		 * only bother when the page is up to date or the buffer
+		 * is dirty.
 		 */
-		if (!buffer_mapped(bh)) {
-			err = get_block(inode, block, bh, 1);
-			if (err)
-				goto out;
-			if (buffer_new(bh))
-				unmap_underlying_metadata(bh);
+		if (page_ok || buffer_dirty(bh)) {
+			if (!buffer_mapped(bh)) {
+				err = get_block(inode, block, bh, 1);
+				if (err)
+					goto out;
+				if (buffer_new(bh))
+					unmap_underlying_metadata(bh);
+			}
+			arr[nr++] = bh ; 
+		} else {
+			partial = 1 ;
 		}
 		bh = bh->b_this_page;
 		block++;
 	} while (bh != head);
 
 	/* Stage 2: lock the buffers, mark them clean */
-	do {
+	for (i = 0 ; i < nr ; i++) {
+		bh = arr[i] ;
 		lock_buffer(bh);
 		bh->b_end_io = end_buffer_io_async;
 		atomic_inc(&bh->b_count);
 		set_bit(BH_Uptodate, &bh->b_state);
 		clear_bit(BH_Dirty, &bh->b_state);
-		bh = bh->b_this_page;
-	} while (bh != head);
+	} 
 
-	/* Stage 3: submit the IO */
-	do {
-		submit_bh(WRITE, bh);
-		bh = bh->b_this_page;		
-	} while (bh != head);
+	for (i = 0 ; i < nr ; i++) {
+		submit_bh(WRITE, arr[i]) ;
+	}
+
+	if (nr == 0) 
+		UnlockPage(page) ;
 
 	/* Done - end_buffer_io_async will unlock */
-	SetPageUptodate(page);
+	if (!partial)
+		SetPageUptodate(page);
 	return 0;
 
 out:
@@ -1658,6 +1835,45 @@
 }
 
 /*
+** just sets the dirty bits for a range of buffers in the page.  Does
+** not balance the dirty list, or put the buffers onto the dirty list
+*/
+static int __block_dirty_range(struct inode *inode, struct page *page,
+		unsigned from, unsigned to)
+{
+	unsigned block_start, block_end;
+	int partial = 0 ;
+	unsigned blocksize;
+	struct buffer_head *bh, *head;
+
+	blocksize = inode->i_sb->s_blocksize;
+
+	for(bh = head = page->buffers, block_start = 0;
+	    bh != head || !block_start;
+	    block_start=block_end, bh = bh->b_this_page) {
+		block_end = block_start + blocksize;
+		if (block_end <= from || block_start >= to) {
+			if (!buffer_uptodate(bh))
+				partial = 1;
+		} else {
+			set_bit(BH_Uptodate, &bh->b_state);
+			if (!atomic_set_buffer_dirty(bh)) {
+				buffer_insert_inode_queue(bh, inode);
+			}
+		}
+	}
+	/*
+	 * is this a partial write that happened to make all buffers
+	 * uptodate then we can optimize away a bogus readpage() for
+	 * the next read(). Here we 'discover' wether the page went
+	 * uptodate as a result of this (potentially partial) write.
+	 */
+	if (!partial)
+		SetPageUptodate(page);
+	return 0;
+}
+
+/*
  * Generic "read page" function for block devices that have the normal
  * get_block functionality. This is most of the block device filesystems.
  * Reads the page asynchronously --- the unlock_buffer() and
@@ -1947,13 +2163,23 @@
 	if (!err) {
 		memset(page_address(page) + offset, 0, PAGE_CACHE_SIZE - offset);
 		flush_dcache_page(page);
-		__block_commit_write(inode,page,0,offset);
+
+		/* this will just set the dirty bits for block_write_full_page
+		** it is only safe because we have the page locked and
+		** nobody will try to write the buffers between
+		** the block_dirty_range and the write_full_page calls
+		** we have to clear the page up to date so the buffers
+		** past the end of the file won't get written
+		*/
+		__block_dirty_range(inode,page,0,offset);
+		ClearPageUptodate(page);
+		err = __block_write_full_page(inode, page, get_block) ;
 done:
 		kunmap(page);
-		UnlockPage(page);
 		return err;
 	}
 	ClearPageUptodate(page);
+	UnlockPage(page);
 	goto done;
 }
 
@@ -2244,7 +2470,7 @@
 	struct buffer_head *bh, *tmp;
 	struct buffer_head * insert_point;
 	int isize;
-
+	unsigned long index ;
 	if ((size & 511) || (size > PAGE_SIZE)) {
 		printk("VFS: grow_buffers: size = %d\n",size);
 		return 0;
@@ -2260,6 +2486,16 @@
 
 	isize = BUFSIZE_INDEX(size);
 
+	/* don't put this buffer head on the free list until the
+	** page is setup.  Is there a better index to use?  Would 0
+	** be good enough?
+	*/
+	page->flags &= ~(1 << PG_referenced);
+	index = atomic_read(&buffermem_pages) ;
+	atomic_inc(&buffermem_pages);
+	add_to_page_cache_locked(page, &anon_space_mapping, index) ;
+	page->buffers = bh;
+
 	spin_lock(&free_list[isize].lock);
 	insert_point = free_list[isize].list;
 	tmp = bh;
@@ -2283,11 +2519,7 @@
 	free_list[isize].list = bh;
 	spin_unlock(&free_list[isize].lock);
 
-	page->buffers = bh;
-	page->flags &= ~(1 << PG_referenced);
-	lru_cache_add(page);
 	UnlockPage(page);
-	atomic_inc(&buffermem_pages);
 	return 1;
 
 no_buffer_head:
@@ -2309,7 +2541,6 @@
  *
  * Wait:
  *	0 - no wait (this does not get called - see try_to_free_buffers below)
- *	1 - start IO for dirty buffers
  *	2 - wait for completion of locked buffers
  */
 static void sync_page_buffers(struct buffer_head *bh, int wait)
@@ -2319,11 +2550,9 @@
 	do {
 		struct buffer_head *p = tmp;
 		tmp = tmp->b_this_page;
-		if (buffer_locked(p)) {
-			if (wait > 1)
-				__wait_on_buffer(p);
-		} else if (buffer_dirty(p))
-			ll_rw_block(WRITE, 1, &p);
+		if (buffer_locked(p) && wait > 1) {
+			__wait_on_buffer(p);
+		} 
 	} while (tmp != bh);
 }
 
@@ -2386,6 +2615,9 @@
 
 	/* And free the page */
 	page->buffers = NULL;
+	if (page->mapping == &anon_space_mapping) {
+		atomic_dec(&buffermem_pages) ;
+	}
 	page_cache_release(page);
 	spin_unlock(&free_list[index].lock);
 	write_unlock(&hash_table_lock);
@@ -2564,6 +2796,7 @@
 static int flush_dirty_buffers(int check_flushtime)
 {
 	struct buffer_head * bh, *next;
+	struct page *page ;
 	int flushed = 0, i;
 
  restart:
@@ -2580,6 +2813,8 @@
 		}
 		if (buffer_locked(bh))
 			continue;
+		if (!buffer_uptodate(bh))
+			continue ;
 
 		if (check_flushtime) {
 			/* The dirty lru list is chronologically ordered so
@@ -2592,13 +2827,15 @@
 			if (++flushed > bdf_prm.b_un.ndirty)
 				goto out_unlock;
 		}
-
-		/* OK, now we are committed to write it out. */
-		atomic_inc(&bh->b_count);
-		spin_unlock(&lru_list_lock);
-		ll_rw_block(WRITE, 1, &bh);
-		atomic_dec(&bh->b_count);
-
+		page = bh->b_page ;
+		page_cache_get(page) ;
+		if (TryLockPage(page)) {
+			page_cache_release(page) ;
+			continue ;
+		}
+		spin_unlock(&lru_list_lock) ;
+		dirty_list_writepage(page, bh) ;
+		page_cache_release(page) ;
 		if (current->need_resched)
 			schedule();
 		goto restart;
diff -urN linux-test13-pre4/mm/page_alloc.c linux-anon-space/mm/page_alloc.c
--- linux-test13-pre4/mm/page_alloc.c	Tue Nov 28 13:54:31 2000
+++ linux-anon-space/mm/page_alloc.c	Sun Dec 24 19:00:31 2000
@@ -317,11 +317,12 @@
 	/*
 	 * If we are about to get low on free pages and cleaning
 	 * the inactive_dirty pages would fix the situation,
-	 * wake up bdflush.
+	 * wake up kswapd here as well, so page_launder can start
+	 * sending things to disk.
 	 */
 	else if (free_shortage() && nr_inactive_dirty_pages > free_shortage()
 			&& nr_inactive_dirty_pages >= freepages.high)
-		wakeup_bdflush(0);
+		wakeup_kswapd(0);
 
 try_again:
 	/*
diff -urN linux-test13-pre4/mm/vmscan.c linux-anon-space/mm/vmscan.c
--- linux-test13-pre4/mm/vmscan.c	Sat Dec 23 13:14:26 2000
+++ linux-anon-space/mm/vmscan.c	Tue Dec 26 00:52:32 2000
@@ -678,7 +678,6 @@
 
 			/* The page was only in the buffer cache. */
 			} else if (!page->mapping) {
-				atomic_dec(&buffermem_pages);
 				freed_page = 1;
 				cleaned_pages++;
 
@@ -701,10 +700,9 @@
 			page_cache_release(page);
 
 			/* 
-			 * If we're freeing buffer cache pages, stop when
-			 * we've got enough free memory.
+			 * stop when we've got enough free memory.
 			 */
-			if (freed_page && !free_shortage())
+			if (!free_shortage())
 				break;
 			continue;
 		} else if (page->mapping && !PageDirty(page)) {
@@ -739,9 +737,6 @@
 	 * free anything yet, we wait synchronously on the writeout of
 	 * MAX_SYNC_LAUNDER pages.
 	 *
-	 * We also wake up bdflush, since bdflush should, under most
-	 * loads, flush out the dirty pages before we have to wait on
-	 * IO.
 	 */
 	if (can_get_io_locks && !launder_loop && free_shortage()) {
 		launder_loop = 1;
@@ -750,8 +745,6 @@
 			sync = 0;
 		/* We only do a few "out of order" flushes. */
 		maxlaunder = MAX_LAUNDER;
-		/* Kflushd takes care of the rest. */
-		wakeup_bdflush(0);
 		goto dirty_page_rescan;
 	}
 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-27  0:57                       ` Chris Mason
  2000-12-26 23:18                         ` Marcelo Tosatti
@ 2000-12-27 20:26                         ` Daniel Phillips
  2000-12-27 20:49                           ` Chris Mason
  1 sibling, 1 reply; 57+ messages in thread
From: Daniel Phillips @ 2000-12-27 20:26 UTC (permalink / raw)
  To: Chris Mason, linux-kernel

Chris Mason wrote:
> 
> Hi guys,
> 
> Here's my latest code, which uses ll_rw_block for anon pages (or
> pages without a writepage func) when flush_dirty_buffers,
> sync_buffers, or fsync_inode_buffers are flushing things.  This
> seems to have fixed my slowdown on 1k buffer sizes, but I
> haven't done extensive benchmarks yet.
> 
> Other changes:  After freeing a page with buffers, page_launder
> now stops if (!free_shortage()).  This is a mod of the check where
> page_launder checked free_shortage after freeing a buffer cache
> page.  Code outside buffer.c can't detect buffer cache pages with
> this patch, so the old check doesn't apply.
> 
> My change doesn't seem quite right though, if page_launder wants
> to stop when there isn't a shortage, it should do that regardless of
> if the page it just freed had buffers.  It looks like this was added
> so bdflush could call page_launder, and get an early out after
> freeing some buffer heads, but I'm not sure.
> 
> In test13-pre4, invalidate_buffers skips buffers on a page
> with a mapping.  I changed that to skip mappings other than the
> anon space mapping.
> 
> Comments and/or suggestions on how to make better use of this stuff
> are more than welcome ;-)

Hi Chris.  I took your patch for a test drive under dbench and it seems
impressively stable under load, but there are performance problems.

	Test machine: 64 meg, 500 Mhz K6, IDE, Ext2, Blocksize=4K
	Without patch: 9.5 MB/sec, 11 min 6 secs
	With patch: 3.12 MB/sec, 33 min 51 sec

Philosophically, I wonder if it's right for the buffer flush mechanism
to be calling into the filesystem.  It seems like the buffer lists
should stay sitting between the filesystem and the block layer, it
actually does a pretty good job.

What about just bumping the buffers of pages found on the inactive_dirty
list to the front of the buffer LRU list?  Then we can do write
clustering by being selective about the bumping.
 
--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-27 20:26                         ` Daniel Phillips
@ 2000-12-27 20:49                           ` Chris Mason
  2000-12-28 15:49                             ` Daniel Phillips
  0 siblings, 1 reply; 57+ messages in thread
From: Chris Mason @ 2000-12-27 20:49 UTC (permalink / raw)
  To: Daniel Phillips, linux-kernel



On Wednesday, December 27, 2000 21:26:02 +0100 Daniel Phillips
<phillips@innominate.de> wrote:

> Hi Chris.  I took your patch for a test drive under dbench and it seems
> impressively stable under load, but there are performance problems.
> 
>	 Test machine: 64 meg, 500 Mhz K6, IDE, Ext2, Blocksize=4K
>	 Without patch: 9.5 MB/sec, 11 min 6 secs
>	 With patch: 3.12 MB/sec, 33 min 51 sec
> 

Cool, thanks for the testing.  Which benchmark are you using?  bonnie and
dbench don't show any changes on my scsi disks, I'll give IDE a try as well.

> Philosophically, I wonder if it's right for the buffer flush mechanism
> to be calling into the filesystem.  It seems like the buffer lists
> should stay sitting between the filesystem and the block layer, it
> actually does a pretty good job.
> 
What I'm looking for is a separation of the write management (aging, memory
pressure, etc, etc) from the actual write method.  The lists (VM, buffer.c,
whatever) should do the management, and the FS should do the i/o.  This
patch is not a perfect solution by any means, but its a start.

-chris


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-27 20:49                           ` Chris Mason
@ 2000-12-28 15:49                             ` Daniel Phillips
  2000-12-28 19:19                               ` Chris Mason
  0 siblings, 1 reply; 57+ messages in thread
From: Daniel Phillips @ 2000-12-28 15:49 UTC (permalink / raw)
  To: Chris Mason, linux-kernel

Chris Mason wrote:
> 
> On Wednesday, December 27, 2000 21:26:02 +0100 Daniel Phillips
> <phillips@innominate.de> wrote:
> 
> > Hi Chris.  I took your patch for a test drive under dbench and it seems
> > impressively stable under load, but there are performance problems.
> >
> >        Test machine: 64 meg, 500 Mhz K6, IDE, Ext2, Blocksize=4K
> >        Without patch: 9.5 MB/sec, 11 min 6 secs
> >        With patch: 3.12 MB/sec, 33 min 51 sec
> >
> 
> Cool, thanks for the testing.  Which benchmark are you using?  bonnie and
> dbench don't show any changes on my scsi disks, I'll give IDE a try as well.

Chris, this was an error, I had accidently booted the wrong kernel.  The
'With patch' results above are for 2.2.16, not your patch.

With correct results you are looking much better:

      Test machine: 64 meg, 500 Mhz K6, IDE, Ext2, Blocksize=4K
      Test: dbench 48

      pre13 without patch:	9.5 MB/sec 	11 min 6 secs
      pre13 with patch:		8.9 MB/sec 	11 min 46 secs
      2.2.16:			3.1 MB/sec 	33 min 51 sec

This benchmark doesn't seem to suffer a lot from noise, so the 7%
slowdown with your patch likely real.

We've come a long way from 2.2.16, haven't we?  I'll run some of these
tests against 2.2 pre19 kernels and maybe fan the flames of the 2.2/2.4
competition a little.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-28 15:49                             ` Daniel Phillips
@ 2000-12-28 19:19                               ` Chris Mason
  2000-12-28 19:29                                 ` Linus Torvalds
  0 siblings, 1 reply; 57+ messages in thread
From: Chris Mason @ 2000-12-28 19:19 UTC (permalink / raw)
  To: Daniel Phillips, linux-kernel, Rik van Riel, Linus Torvalds



On Thursday, December 28, 2000 16:49:14 +0100 Daniel Phillips <phillips@innominate.de> wrote:

[ dbench 48 test on the anon space mapping patch ]

> 
> This benchmark doesn't seem to suffer a lot from noise, so the 7%
> slowdown with your patch likely real.
> 

Ok, page_launder is supposed to run through the inactive dirty
list twice, and on the second run, it wants to start i/o.  But,
if the page is dirty, writepage is called on the first run.  With
my patch, this flushes lots more data than it used to.  

I have writepage doing all the i/o, and try_to_free_buffers
only waits on it.  This diff makes it so writepage is only called
on the second loop through the inactive dirty list, could you 
please give it a try (slightly faster in my tests).

Linus and Rik are cc'd in to find out if this is a good idea in
general.

-chris

--- linux-test13-pre4/mm/vmscan.c	Sat Dec 23 13:14:26 2000
+++ linux/mm/vmscan.c	Thu Dec 28 15:02:08 2000
@@ -609,7 +609,7 @@
 				goto page_active;
 
 			/* Can't start IO? Move it to the back of the list */
-			if (!can_get_io_locks) {
+			if (!launder_loop || !can_get_io_locks) {
 				list_del(page_lru);
 				list_add(page_lru, &inactive_dirty_list);
 				UnlockPage(page);

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-28 19:19                               ` Chris Mason
@ 2000-12-28 19:29                                 ` Linus Torvalds
  2000-12-29 16:03                                   ` Chris Mason
  0 siblings, 1 reply; 57+ messages in thread
From: Linus Torvalds @ 2000-12-28 19:29 UTC (permalink / raw)
  To: Chris Mason; +Cc: Daniel Phillips, linux-kernel, Rik van Riel



On Thu, 28 Dec 2000, Chris Mason wrote:
> 
> Linus and Rik are cc'd in to find out if this is a good idea in
> general.

Probably. 

There are some arguments for starting the writeout early, but there are
tons of arguments against it too (the main one being "avoid doing IO if
you can do so"), so your patch is probably fine. In the end, the
performance characteristics are what matters. Does the patch make for
smoother behaviour and better performance?

Anyway, the "can_get_io_locks" check is subsumed by the "launder_loop"
check: we will never set "launder_loop" to non-zero if we can't get the
io_locks, so you might as well just make the test be

	/* First loop through? Don't start IO, just move it to the back of the list */
	if (!launder_loop) {
		....

and be done with it.

I'd like to hear what that does for dbench.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-28 19:29                                 ` Linus Torvalds
@ 2000-12-29 16:03                                   ` Chris Mason
  2000-12-29 16:21                                     ` Marcelo Tosatti
  2000-12-29 17:58                                     ` Daniel Phillips
  0 siblings, 2 replies; 57+ messages in thread
From: Chris Mason @ 2000-12-29 16:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Daniel Phillips, linux-kernel, Rik van Riel



On Thursday, December 28, 2000 11:29:01 AM -0800 Linus Torvalds
<torvalds@transmeta.com> wrote:
[ skipping io on the first walk in page_launder ]
> 
> There are some arguments for starting the writeout early, but there are
> tons of arguments against it too (the main one being "avoid doing IO if
> you can do so"), so your patch is probably fine. In the end, the
> performance characteristics are what matters. Does the patch make for
> smoother behaviour and better performance?

My dbench speeds have always varied from run to run, but the average speed
went up about 9% with the anon space mapping patch and the page_launder
change.  I could not find much difference in a pure test13-pre4, probably
because dbench doesn't generate much swap on my machine.  I'll do more
tests when I get back on Monday night.

Daniel, sounds like dbench varies less on your machine, what did the patch
do for you?

BTW, the last anon space mapping patch I sent also works on test13-pre5.
The block_truncate_page fix does help my patch, since I have bdflush
locking pages ( thanks Marcelo )

-chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-29 16:03                                   ` Chris Mason
@ 2000-12-29 16:21                                     ` Marcelo Tosatti
  2000-12-29 18:33                                       ` Alexander Viro
  2000-12-29 17:58                                     ` Daniel Phillips
  1 sibling, 1 reply; 57+ messages in thread
From: Marcelo Tosatti @ 2000-12-29 16:21 UTC (permalink / raw)
  To: Chris Mason; +Cc: Linus Torvalds, Daniel Phillips, linux-kernel, Rik van Riel



On Fri, 29 Dec 2000, Chris Mason wrote:

> BTW, the last anon space mapping patch I sent also works on test13-pre5.
> The block_truncate_page fix does help my patch, since I have bdflush
> locking pages ( thanks Marcelo )

Good to know. I thought that fix would not change performance noticeable.

Now the ext2 changes will for sure make a difference since right now the
superblock lock is suffering from contention. 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-29 16:03                                   ` Chris Mason
  2000-12-29 16:21                                     ` Marcelo Tosatti
@ 2000-12-29 17:58                                     ` Daniel Phillips
  2001-01-02 15:49                                       ` Chris Mason
  1 sibling, 1 reply; 57+ messages in thread
From: Daniel Phillips @ 2000-12-29 17:58 UTC (permalink / raw)
  To: Chris Mason, linux-kernel

Chris Mason wrote:
> 
> On Thursday, December 28, 2000 11:29:01 AM -0800 Linus Torvalds
> <torvalds@transmeta.com> wrote:
> [ skipping io on the first walk in page_launder ]
> >
> > There are some arguments for starting the writeout early, but there are
> > tons of arguments against it too (the main one being "avoid doing IO if
> > you can do so"), so your patch is probably fine. In the end, the
> > performance characteristics are what matters. Does the patch make for
> > smoother behaviour and better performance?
> 
> My dbench speeds have always varied from run to run, but the average speed
> went up about 9% with the anon space mapping patch and the page_launder
> change.  I could not find much difference in a pure test13-pre4, probably
> because dbench doesn't generate much swap on my machine.  I'll do more
> tests when I get back on Monday night.
> 
> Daniel, sounds like dbench varies less on your machine, what did the patch
> do for you?
> 
> BTW, the last anon space mapping patch I sent also works on test13-pre5.
> The block_truncate_page fix does help my patch, since I have bdflush
> locking pages ( thanks Marcelo )

Yes it does, but the little additional patch you posted no longer
applies.  Your patch is suffering badly under pre5 here, reducing
throughput from around 10 MB/sec to 5-6 MB/sec, and highly variable.  If
you're not seeing this you should probably try to get a few others to
run it.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-29 16:21                                     ` Marcelo Tosatti
@ 2000-12-29 18:33                                       ` Alexander Viro
  2001-01-05 15:54                                         ` Chris Mason
  0 siblings, 1 reply; 57+ messages in thread
From: Alexander Viro @ 2000-12-29 18:33 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Chris Mason, Linus Torvalds, Daniel Phillips, linux-kernel, Rik van Riel



On Fri, 29 Dec 2000, Marcelo Tosatti wrote:

> Now the ext2 changes will for sure make a difference since right now the
> superblock lock is suffering from contention. 

superblock lock is suffering from more than just contention. Consider the
following:

sys_ustat() does get_super(), which leaves the sb unlocked.
We are going to call ->statfs().
In the meanwhile, fs is umounted. Woops...

In other words, get_super() callers are oopsen waiting to happen. That's
what is fundamentally wrong with lock_super()/wait_super()/unlock_super() -
we have no reference counter on superblocks and we blindly grab references
to them assuming that they will stay.

The main source of get_super() calls is in device drivers. I propose to add

void invalidate_dev(kdev_t dev, int sync_flag)
{
	struct super_block *sb = get_super(dev);
	switch (sync_flag) {
		case 1: sync_dev(dev);break;
		case 2: fsync_dev(dev);break;
	}
	if (sb)
		invalidate_inodes(sb);
	invalidate_buffers(dev);
}

to fs/buffer.c, export it and let drivers call that instead of doing the
thing by hands. Then we have a chance to deal with the problem inside the
kernel proper. Right now almost every block device has that sequence in
BLKRRPART handling and fixing each of them separately is going to hurt like
hell.

Linus, I realize that it's changing the block devices near to release, but
AFAICS we have to change them anyway. I'm asking for permission to take
get_super() out.
							Cheers,
								Al

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-29 17:58                                     ` Daniel Phillips
@ 2001-01-02 15:49                                       ` Chris Mason
  0 siblings, 0 replies; 57+ messages in thread
From: Chris Mason @ 2001-01-02 15:49 UTC (permalink / raw)
  To: Daniel Phillips, linux-kernel



On Friday, December 29, 2000 06:58:01 PM +0100 Daniel Phillips
<phillips@innominate.de> wrote:

> Chris Mason wrote:
>> 
>> BTW, the last anon space mapping patch I sent also works on test13-pre5.
>> The block_truncate_page fix does help my patch, since I have bdflush
>> locking pages ( thanks Marcelo )
> 
> Yes it does, but the little additional patch you posted no longer
> applies.  Your patch is suffering badly under pre5 here, reducing
> throughput from around 10 MB/sec to 5-6 MB/sec, and highly variable.  If
> you're not seeing this you should probably try to get a few others to
> run it.
> 

The additional patch I sent (for skipping writepage calls on the first loop
in page_launder) was applied to test13-pre5, in a slightly better form.
So, you'll either need to benchmark in pre4, or make a reversed version for
pre5.

Porting to the prerelease now ;-)

-chris



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2001-01-05 15:54                                         ` Chris Mason
@ 2001-01-05 15:43                                           ` Marcelo Tosatti
  2001-01-05 17:49                                             ` Daniel Phillips
                                                               ` (2 more replies)
  2001-01-05 18:09                                           ` Juergen Schneider
  1 sibling, 3 replies; 57+ messages in thread
From: Marcelo Tosatti @ 2001-01-05 15:43 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-kernel


On Fri, 5 Jan 2001, Chris Mason wrote:

> 
> Here's the latest version of the patch, against 2.4.0.  The
> biggest open issues are what to do with bdflush, since
> page_launder could do everything bdflush does.  

I think we want to remove flush_dirty_buffers() from bdflush. 

While we are trying to be smart and do write clustering at the ->writepage
operation, flush_dirty_buffers() is "dumb" and will interfere with the
write clustering. 



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2000-12-29 18:33                                       ` Alexander Viro
@ 2001-01-05 15:54                                         ` Chris Mason
  2001-01-05 15:43                                           ` Marcelo Tosatti
  2001-01-05 18:09                                           ` Juergen Schneider
  0 siblings, 2 replies; 57+ messages in thread
From: Chris Mason @ 2001-01-05 15:54 UTC (permalink / raw)
  To: linux-kernel


Here's the latest version of the patch, against 2.4.0.  The
biggest open issues are what to do with bdflush, since
page_launder could do everything bdflush does.  And, how to
deal with writepage funcs that return 1 for fsync_inode_buffers.

-chris

diff -urN linux.2.4.0/fs/buffer.c linux/fs/buffer.c
--- linux.2.4.0/fs/buffer.c	Thu Jan  4 23:12:28 2001
+++ linux/fs/buffer.c	Thu Jan  4 18:58:09 2001
@@ -97,6 +97,20 @@
 
 static int grow_buffers(int size);
 static void __refile_buffer(struct buffer_head *);
+static int block_write_anon_page(struct page *);
+static void end_buffer_io_async(struct buffer_head * bh, int uptodate) ;
+
+static struct address_space_operations anon_space_ops = {
+	writepage: block_write_anon_page,
+	sync_page: block_sync_page,
+} ;
+static struct address_space anon_space_mapping = {
+	LIST_HEAD_INIT(anon_space_mapping.clean_pages),
+	LIST_HEAD_INIT(anon_space_mapping.dirty_pages),
+	LIST_HEAD_INIT(anon_space_mapping.locked_pages),
+	0, /* nr_pages */
+	&anon_space_ops,
+} ;
 
 /* This is used by some architectures to estimate available memory. */
 atomic_t buffermem_pages = ATOMIC_INIT(0);
@@ -161,6 +175,73 @@
 	atomic_dec(&bh->b_count);
 }
 
+/* just for use with anon pages, or pages that don't provide their own
+** writepage func.  We just want to write bh, not the whole page, so we
+** queue that io here instead of calling writepage.
+*/
+static int __dirty_list_writepage(struct page *page, struct buffer_head *bh) {
+	int other_dirty = 0 ;
+	struct buffer_head *cur ;
+
+	/* check for other dirty buffers on this page.  If there are none,
+	** clear the page dirty bit
+	*/
+	cur = bh->b_this_page ;
+	while(cur != bh) {
+		other_dirty += buffer_dirty(cur) ;	
+		cur = cur->b_this_page ;
+	} 
+	if (other_dirty == 0) {
+		ClearPageDirty(page) ;
+	} 
+
+	/* we want the page available for locking again right away.  
+	** someone walking the dirty buffer list might find another
+	** buffer from this page, and we don't want them to skip it in
+	** favor of a younger buffer.
+	*/
+	atomic_inc(&bh->b_count) ;
+	UnlockPage(page) ;
+	ll_rw_block(WRITE, 1, &bh) ;
+	atomic_dec(&bh->b_count) ;
+	return 0 ;
+}
+
+/*
+** util function for sync_buffers and flush_dirty_buffers
+** uses either the writepage func supplied in the page's mapping,
+** or the anon address space writepage
+*/
+static int dirty_list_writepage(struct page *page, struct buffer_head *bh) {
+	int (*writepage)(struct page *)  ;
+	int ret ;
+
+	/* someone wrote this page while we were locking before */
+	if (!PageDirty(page) && !buffer_dirty(bh)) {
+		UnlockPage(page) ;
+		return 0 ;
+	}
+	writepage = page->mapping->a_ops->writepage ;
+
+	/* For anon pages, and pages that don't have a writepage
+	** func, just write this one dirty buffer.  __dirty_list_writepage
+	** does a little more work to make sure the page dirty bit is cleared
+	** when we are the only dirty buffer on this page
+	*/
+	if (!writepage || page->mapping == &anon_space_mapping) {
+		writepage = anon_space_ops.writepage ;
+		return __dirty_list_writepage(page, bh) ;
+	}
+
+	ClearPageDirty(page) ;
+	ret = writepage(page) ;
+	if (ret == 1) {
+		SetPageDirty(page) ;
+		UnlockPage(page) ;
+	}
+	return ret ;
+}
+
 /* Call sync_buffers with wait!=0 to ensure that the call does not
  * return until all buffer writes have completed.  Sync() may return
  * before the writes have finished; fsync() may not.
@@ -175,6 +256,7 @@
 {
 	int i, retry, pass = 0, err = 0;
 	struct buffer_head * bh, *next;
+	struct page *page ;
 
 	/* One pass for no-wait, three for wait:
 	 * 0) write out all dirty, unlocked buffers;
@@ -230,10 +312,27 @@
 			if (!buffer_dirty(bh) || pass >= 2)
 				continue;
 
-			atomic_inc(&bh->b_count);
+			page = bh->b_page ;
+			page_cache_get(page) ;
+			if (TryLockPage(page)) {
+				if (!wait || !pass) {
+					retry = 1 ;
+					continue ;
+				}
+				spin_unlock(&lru_list_lock);
+				wait_on_page(page) ;
+				page_cache_release(page) ;
+				goto repeat ;
+			}
 			spin_unlock(&lru_list_lock);
-			ll_rw_block(WRITE, 1, &bh);
-			atomic_dec(&bh->b_count);
+
+			/* if the writepage func returns 1, it is 
+			** responsible for marking the buffers dirty
+			** again (or not marking them clean at all).
+			** we'll catch them again on the next pass
+			*/
+			dirty_list_writepage(page, bh) ;
+			page_cache_release(page) ;
 			retry = 1;
 			goto repeat;
 		}
@@ -649,7 +748,7 @@
 			if (bh->b_dev != dev)
 				continue;
 			/* Part of a mapping? */
-			if (bh->b_page->mapping)
+			if (bh->b_page->mapping != &anon_space_mapping)
 				continue;
 			if (buffer_locked(bh)) {
 				atomic_inc(&bh->b_count);
@@ -861,13 +960,14 @@
 int fsync_inode_buffers(struct inode *inode)
 {
 	struct buffer_head *bh;
-	struct inode tmp;
+	struct inode tmp ;
 	int err = 0, err2;
+	struct page * page ;
+	int ret ;
 	
 	INIT_LIST_HEAD(&tmp.i_dirty_buffers);
 	
 	spin_lock(&lru_list_lock);
-
 	while (!list_empty(&inode->i_dirty_buffers)) {
 		bh = BH_ENTRY(inode->i_dirty_buffers.next);
 		list_del(&bh->b_inode_buffers);
@@ -877,11 +977,28 @@
 			bh->b_inode = &tmp;
 			list_add(&bh->b_inode_buffers, &tmp.i_dirty_buffers);
 			if (buffer_dirty(bh)) {
-				atomic_inc(&bh->b_count);
+				page = bh->b_page ;
+				page_cache_get(page) ;
 				spin_unlock(&lru_list_lock);
-				ll_rw_block(WRITE, 1, &bh);
-				brelse(bh);
+
+				LockPage(page) ;
+				ret = dirty_list_writepage(page, bh) ;
+				page_cache_release(page) ;
+
 				spin_lock(&lru_list_lock);
+
+				/* if the writepage func decided to skip
+				** this page, we have to put it back onto
+				** the dirty buffer list.  we add onto the 
+				** tail so this buffer will be retried after
+				** all the other writes have gone through.
+				*/
+				if (ret == 1) {
+					list_del(&bh->b_inode_buffers) ;
+					list_add_tail(&bh->b_inode_buffers,
+						      &inode->i_dirty_buffers) ;
+					bh->b_inode = inode ;
+				}
 			}
 		}
 	}
@@ -1112,8 +1229,10 @@
 	int dispose = BUF_CLEAN;
 	if (buffer_locked(bh))
 		dispose = BUF_LOCKED;
-	if (buffer_dirty(bh))
+	if (buffer_dirty(bh)) {
 		dispose = BUF_DIRTY;
+		SetPageDirty(bh->b_page) ;
+	}
 	if (buffer_protected(bh))
 		dispose = BUF_PROTECTED;
 	if (dispose != bh->b_list) {
@@ -1489,6 +1608,53 @@
  * "Dirty" is valid only with the last case (mapped+uptodate).
  */
 
+static int block_write_anon_page(struct page *page) 
+{
+	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+	int i, nr = 0 ;
+	int partial = 0 ;
+	int ret = 0 ;
+
+	if (!PageLocked(page))
+		BUG();
+
+	if (!page->buffers)
+		BUG() ;
+
+	head = page->buffers;
+	bh = head;
+
+	/* Stage 1: find the dirty buffers, lock them for submit_bh */
+	do {
+		if (!test_and_set_bit(BH_Lock, &bh->b_state)) {
+			if (buffer_uptodate(bh) && buffer_dirty(bh)) {
+				bh->b_end_io = end_buffer_io_async;
+				clear_bit(BH_Dirty, &bh->b_state) ;
+				atomic_inc(&bh->b_count);
+				arr[nr++] = bh ;
+			} else {
+				partial = 1 ;
+				unlock_buffer(bh) ;
+			}
+		} else {
+			partial = 1 ;
+		}
+		bh = bh->b_this_page;
+	} while (bh != head);
+
+	/* Stage 2: submit the IO */
+	for (i = 0 ; i < nr ; i++) {
+		submit_bh(WRITE, arr[i]) ;
+	}
+	/* Done - end_buffer_io_async will unlock */
+	if (!partial)
+		SetPageUptodate(page);
+	if (nr == 0) {
+		UnlockPage(page) ;
+	}
+	return ret ;
+}
+
 /*
  * block_write_full_page() is SMP-safe - currently it's still
  * being called with the kernel lock held, but the code is ready.
@@ -1498,6 +1664,10 @@
 	int err, i;
 	unsigned long block;
 	struct buffer_head *bh, *head;
+	int nr = 0 ;
+	struct buffer_head *arr[MAX_BUF_PER_PAGE] ;
+	int page_ok = Page_Uptodate(page) ;
+	int partial = 0;
 
 	if (!PageLocked(page))
 		BUG();
@@ -1520,36 +1690,46 @@
 		 *
 		 * Leave it to the low-level FS to make all those
 		 * decisions (block #0 may actually be a valid block)
+		 *
+		 * only bother when the page is up to date or the buffer
+		 * is dirty.
 		 */
-		if (!buffer_mapped(bh)) {
-			err = get_block(inode, block, bh, 1);
-			if (err)
-				goto out;
-			if (buffer_new(bh))
-				unmap_underlying_metadata(bh);
+		if (page_ok || buffer_dirty(bh)) {
+			if (!buffer_mapped(bh)) {
+				err = get_block(inode, block, bh, 1);
+				if (err)
+					goto out;
+				if (buffer_new(bh))
+					unmap_underlying_metadata(bh);
+			}
+			arr[nr++] = bh ; 
+		} else {
+			partial = 1 ;
 		}
 		bh = bh->b_this_page;
 		block++;
 	} while (bh != head);
 
 	/* Stage 2: lock the buffers, mark them clean */
-	do {
+	for (i = 0 ; i < nr ; i++) {
+		bh = arr[i] ;
 		lock_buffer(bh);
 		bh->b_end_io = end_buffer_io_async;
 		atomic_inc(&bh->b_count);
 		set_bit(BH_Uptodate, &bh->b_state);
 		clear_bit(BH_Dirty, &bh->b_state);
-		bh = bh->b_this_page;
-	} while (bh != head);
+	} 
 
-	/* Stage 3: submit the IO */
-	do {
-		submit_bh(WRITE, bh);
-		bh = bh->b_this_page;		
-	} while (bh != head);
+	for (i = 0 ; i < nr ; i++) {
+		submit_bh(WRITE, arr[i]) ;
+	}
+
+	if (nr == 0) 
+		UnlockPage(page) ;
 
 	/* Done - end_buffer_io_async will unlock */
-	SetPageUptodate(page);
+	if (!partial)
+		SetPageUptodate(page);
 	return 0;
 
 out:
@@ -1669,6 +1849,45 @@
 }
 
 /*
+** just sets the dirty bits for a range of buffers in the page.  Does
+** not balance the dirty list, or put the buffers onto the dirty list
+*/
+static int __block_dirty_range(struct inode *inode, struct page *page,
+		unsigned from, unsigned to)
+{
+	unsigned block_start, block_end;
+	int partial = 0 ;
+	unsigned blocksize;
+	struct buffer_head *bh, *head;
+
+	blocksize = inode->i_sb->s_blocksize;
+
+	for(bh = head = page->buffers, block_start = 0;
+	    bh != head || !block_start;
+	    block_start=block_end, bh = bh->b_this_page) {
+		block_end = block_start + blocksize;
+		if (block_end <= from || block_start >= to) {
+			if (!buffer_uptodate(bh))
+				partial = 1;
+		} else {
+			set_bit(BH_Uptodate, &bh->b_state);
+			if (!atomic_set_buffer_dirty(bh)) {
+				buffer_insert_inode_queue(bh, inode);
+			}
+		}
+	}
+	/*
+	 * is this a partial write that happened to make all buffers
+	 * uptodate then we can optimize away a bogus readpage() for
+	 * the next read(). Here we 'discover' wether the page went
+	 * uptodate as a result of this (potentially partial) write.
+	 */
+	if (!partial)
+		SetPageUptodate(page);
+	return 0;
+}
+
+/*
  * Generic "read page" function for block devices that have the normal
  * get_block functionality. This is most of the block device filesystems.
  * Reads the page asynchronously --- the unlock_buffer() and
@@ -1958,13 +2177,23 @@
 	if (!err) {
 		memset(page_address(page) + offset, 0, PAGE_CACHE_SIZE - offset);
 		flush_dcache_page(page);
-		__block_commit_write(inode,page,0,offset);
+
+		/* this will just set the dirty bits for block_write_full_page
+		** it is only safe because we have the page locked and
+		** nobody will try to write the buffers between
+		** the block_dirty_range and the write_full_page calls
+		** we have to clear the page up to date so the buffers
+		** past the end of the file won't get written
+		*/
+		__block_dirty_range(inode,page,0,offset);
+		ClearPageUptodate(page);
+		err = __block_write_full_page(inode, page, get_block) ;
 done:
 		kunmap(page);
-		UnlockPage(page);
 		return err;
 	}
 	ClearPageUptodate(page);
+	UnlockPage(page);
 	goto done;
 }
 
@@ -2255,7 +2484,7 @@
 	struct buffer_head *bh, *tmp;
 	struct buffer_head * insert_point;
 	int isize;
-
+	unsigned long index ;
 	if ((size & 511) || (size > PAGE_SIZE)) {
 		printk("VFS: grow_buffers: size = %d\n",size);
 		return 0;
@@ -2271,6 +2500,16 @@
 
 	isize = BUFSIZE_INDEX(size);
 
+	/* don't put this buffer head on the free list until the
+	** page is setup.  Is there a better index to use?  Would 0
+	** be good enough?
+	*/
+	page->flags &= ~(1 << PG_referenced);
+	index = atomic_read(&buffermem_pages) ;
+	atomic_inc(&buffermem_pages);
+	add_to_page_cache_locked(page, &anon_space_mapping, index) ;
+	page->buffers = bh;
+
 	spin_lock(&free_list[isize].lock);
 	insert_point = free_list[isize].list;
 	tmp = bh;
@@ -2294,11 +2533,7 @@
 	free_list[isize].list = bh;
 	spin_unlock(&free_list[isize].lock);
 
-	page->buffers = bh;
-	page->flags &= ~(1 << PG_referenced);
-	lru_cache_add(page);
 	UnlockPage(page);
-	atomic_inc(&buffermem_pages);
 	return 1;
 
 no_buffer_head:
@@ -2320,7 +2555,6 @@
  *
  * Wait:
  *	0 - no wait (this does not get called - see try_to_free_buffers below)
- *	1 - start IO for dirty buffers
  *	2 - wait for completion of locked buffers
  */
 static void sync_page_buffers(struct buffer_head *bh, int wait)
@@ -2330,11 +2564,9 @@
 	do {
 		struct buffer_head *p = tmp;
 		tmp = tmp->b_this_page;
-		if (buffer_locked(p)) {
-			if (wait > 1)
-				__wait_on_buffer(p);
-		} else if (buffer_dirty(p))
-			ll_rw_block(WRITE, 1, &p);
+		if (buffer_locked(p) && wait > 1) {
+			__wait_on_buffer(p);
+		} 
 	} while (tmp != bh);
 }
 
@@ -2397,6 +2629,9 @@
 
 	/* And free the page */
 	page->buffers = NULL;
+	if (page->mapping == &anon_space_mapping) {
+		atomic_dec(&buffermem_pages) ;
+	}
 	page_cache_release(page);
 	spin_unlock(&free_list[index].lock);
 	write_unlock(&hash_table_lock);
@@ -2547,6 +2782,7 @@
 static int flush_dirty_buffers(int check_flushtime)
 {
 	struct buffer_head * bh, *next;
+	struct page *page ;
 	int flushed = 0, i;
 
  restart:
@@ -2563,6 +2799,8 @@
 		}
 		if (buffer_locked(bh))
 			continue;
+		if (!buffer_uptodate(bh))
+			continue ;
 
 		if (check_flushtime) {
 			/* The dirty lru list is chronologically ordered so
@@ -2575,13 +2813,15 @@
 			if (++flushed > bdf_prm.b_un.ndirty)
 				goto out_unlock;
 		}
-
-		/* OK, now we are committed to write it out. */
-		atomic_inc(&bh->b_count);
-		spin_unlock(&lru_list_lock);
-		ll_rw_block(WRITE, 1, &bh);
-		atomic_dec(&bh->b_count);
-
+		page = bh->b_page ;
+		page_cache_get(page) ;
+		if (TryLockPage(page)) {
+			page_cache_release(page) ;
+			continue ;
+		}
+		spin_unlock(&lru_list_lock) ;
+		dirty_list_writepage(page, bh) ;
+		page_cache_release(page) ;
 		if (current->need_resched)
 			schedule();
 		goto restart;
diff -urN linux.2.4.0/mm/page_alloc.c linux/mm/page_alloc.c
--- linux.2.4.0/mm/page_alloc.c	Thu Jan  4 23:11:21 2001
+++ linux/mm/page_alloc.c	Thu Jan  4 18:51:10 2001
@@ -307,11 +307,12 @@
 	/*
 	 * If we are about to get low on free pages and cleaning
 	 * the inactive_dirty pages would fix the situation,
-	 * wake up bdflush.
+	 * wake up kswapd here as well, so page_launder can start
+	 * sending things to disk.
 	 */
 	else if (free_shortage() && nr_inactive_dirty_pages > free_shortage()
 			&& nr_inactive_dirty_pages >= freepages.high)
-		wakeup_bdflush(0);
+		wakeup_kswapd(0);
 
 try_again:
 	/*
diff -urN linux.2.4.0/mm/vmscan.c linux/mm/vmscan.c
--- linux.2.4.0/mm/vmscan.c	Thu Jan  4 23:11:21 2001
+++ linux/mm/vmscan.c	Thu Jan  4 18:53:54 2001
@@ -615,7 +615,6 @@
 
 			/* The page was only in the buffer cache. */
 			} else if (!page->mapping) {
-				atomic_dec(&buffermem_pages);
 				freed_page = 1;
 				cleaned_pages++;
 
@@ -638,10 +637,9 @@
 			page_cache_release(page);
 
 			/* 
-			 * If we're freeing buffer cache pages, stop when
-			 * we've got enough free memory.
+			 * stop when we've got enough free memory.
 			 */
-			if (freed_page && !free_shortage())
+			if (!free_shortage())
 				break;
 			continue;
 		} else if (page->mapping && !PageDirty(page)) {
@@ -676,9 +674,6 @@
 	 * free anything yet, we wait synchronously on the writeout of
 	 * MAX_SYNC_LAUNDER pages.
 	 *
-	 * We also wake up bdflush, since bdflush should, under most
-	 * loads, flush out the dirty pages before we have to wait on
-	 * IO.
 	 */
 	if (can_get_io_locks && !launder_loop && free_shortage()) {
 		launder_loop = 1;
@@ -687,8 +682,6 @@
 			sync = 0;
 		/* We only do a few "out of order" flushes. */
 		maxlaunder = MAX_LAUNDER;
-		/* Kflushd takes care of the rest. */
-		wakeup_bdflush(0);
 		goto dirty_page_rescan;
 	}
 



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2001-01-05 15:43                                           ` Marcelo Tosatti
@ 2001-01-05 17:49                                             ` Daniel Phillips
  2001-01-05 17:52                                             ` Chris Mason
  2001-01-05 19:51                                             ` Chris Mason
  2 siblings, 0 replies; 57+ messages in thread
From: Daniel Phillips @ 2001-01-05 17:49 UTC (permalink / raw)
  To: Marcelo Tosatti, linux-kernel

Marcelo Tosatti wrote:
> 
> On Fri, 5 Jan 2001, Chris Mason wrote:
> 
> >
> > Here's the latest version of the patch, against 2.4.0.  The
> > biggest open issues are what to do with bdflush, since
> > page_launder could do everything bdflush does.
> 
> I think we want to remove flush_dirty_buffers() from bdflush.
> 
> While we are trying to be smart and do write clustering at the ->writepage
> operation, flush_dirty_buffers() is "dumb" and will interfere with the
> write clustering.

Actually, I found it doesn't interfere that much.  Remember, there's
still an elevator in there.  Even if bdflush and clustered page flushing
are running at the same time the elevator makes the final decision what
gets written when.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2001-01-05 15:43                                           ` Marcelo Tosatti
  2001-01-05 17:49                                             ` Daniel Phillips
@ 2001-01-05 17:52                                             ` Chris Mason
  2001-01-05 19:51                                             ` Chris Mason
  2 siblings, 0 replies; 57+ messages in thread
From: Chris Mason @ 2001-01-05 17:52 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel



On Friday, January 05, 2001 01:43:07 PM -0200 Marcelo Tosatti
<marcelo@conectiva.com.br> wrote:

> 
> On Fri, 5 Jan 2001, Chris Mason wrote:
> 
>> 
>> Here's the latest version of the patch, against 2.4.0.  The
>> biggest open issues are what to do with bdflush, since
>> page_launder could do everything bdflush does.  
> 
> I think we want to remove flush_dirty_buffers() from bdflush. 
> 

I think you're right.  Now that bdflush calls page_launder with GFP_KERNEL,
the flush_dirty_buffers call isn't needed there.  I think the current
bdflush (with or without the flush_dirty_buffers call) will be more
aggressive at freeing buffer cache pages from the inactive_dirty list, and
it will be interesting to see how it performs.  I think it will be better,
but the blocksize < pagesize case might screw us up.

> While we are trying to be smart and do write clustering at the ->writepage
> operation, flush_dirty_buffers() is "dumb" and will interfere with the
> write clustering. 
> 
Only for the buffer cache pages.  For actual file data, flush_dirty_buffers
is calling the writepage func, and we should still be able to cluster it.

-chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2001-01-05 15:54                                         ` Chris Mason
  2001-01-05 15:43                                           ` Marcelo Tosatti
@ 2001-01-05 18:09                                           ` Juergen Schneider
  1 sibling, 0 replies; 57+ messages in thread
From: Juergen Schneider @ 2001-01-05 18:09 UTC (permalink / raw)
  To: Chris Mason, linux-kernel

Hi Chris,

I don't know if this is already known,
but this patch seems to solve the loop block device problem too.
I've tested it several times and did not get any kernel lockups after
applying the patch.
Until now this was a showstopper for me but you gave the solution.
Thank you very much.

Juergen Schneider
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2001-01-05 19:51                                             ` Chris Mason
@ 2001-01-05 18:32                                               ` Marcelo Tosatti
  2001-01-05 20:29                                                 ` Rik van Riel
  2001-01-05 21:08                                                 ` Chris Mason
  0 siblings, 2 replies; 57+ messages in thread
From: Marcelo Tosatti @ 2001-01-05 18:32 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-kernel


On Fri, 5 Jan 2001, Chris Mason wrote:

> 
> 
> On Friday, January 05, 2001 01:43:07 PM -0200 Marcelo Tosatti
> <marcelo@conectiva.com.br> wrote:
> 
> > 
> > On Fri, 5 Jan 2001, Chris Mason wrote:
> > 
> >> 
> >> Here's the latest version of the patch, against 2.4.0.  The
> >> biggest open issues are what to do with bdflush, since
> >> page_launder could do everything bdflush does.  
> > 
> > I think we want to remove flush_dirty_buffers() from bdflush. 
> > 
> 
> Whoops.  If bdflush doesn't balance the dirty list, who does?

Who marks buffers dirty. 

Linus changed mark_buffer_dirty() to use flush_dirty_buffers() in case
there are too many dirty buffers.

Also, I think in practice page_launder will help on balancing. 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2001-01-05 20:29                                                 ` Rik van Riel
@ 2001-01-05 18:43                                                   ` Marcelo Tosatti
  2001-01-05 20:37                                                     ` Rik van Riel
  0 siblings, 1 reply; 57+ messages in thread
From: Marcelo Tosatti @ 2001-01-05 18:43 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Chris Mason, linux-kernel



On Fri, 5 Jan 2001, Rik van Riel wrote:

> Also, you do not want the writer to block on writing out buffers
> if bdflush could write them out asynchronously while the dirty
> buffer producer can work on in the background.

flush_dirty_buffers() do not wait on the buffers written to get clean.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2001-01-05 15:43                                           ` Marcelo Tosatti
  2001-01-05 17:49                                             ` Daniel Phillips
  2001-01-05 17:52                                             ` Chris Mason
@ 2001-01-05 19:51                                             ` Chris Mason
  2001-01-05 18:32                                               ` Marcelo Tosatti
  2 siblings, 1 reply; 57+ messages in thread
From: Chris Mason @ 2001-01-05 19:51 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel



On Friday, January 05, 2001 01:43:07 PM -0200 Marcelo Tosatti
<marcelo@conectiva.com.br> wrote:

> 
> On Fri, 5 Jan 2001, Chris Mason wrote:
> 
>> 
>> Here's the latest version of the patch, against 2.4.0.  The
>> biggest open issues are what to do with bdflush, since
>> page_launder could do everything bdflush does.  
> 
> I think we want to remove flush_dirty_buffers() from bdflush. 
> 

Whoops.  If bdflush doesn't balance the dirty list, who does?

But, we can do this in bdflush:

if (free_shortage())
	flushed = page_launder(GFP_KERNEL, 0) ;
else
	flushed = flush_dirty_buffers(0) ;

The idea is that page_launder knows best which pages to free when memory is
low, and flush_dirty_buffers knows best which pages to write when things
are unbalanced.  Not the cleanest idea ever, since I'd prefer the first
pages written when we need balancing are also the ones most likely to be
freed by page_launder.

In other words, I'd like to make a flush_inactive_dirty() inside
mm/vmscan.c, and have bdflush call that before working on the buffer cache
dirty list.

-chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2001-01-05 18:32                                               ` Marcelo Tosatti
@ 2001-01-05 20:29                                                 ` Rik van Riel
  2001-01-05 18:43                                                   ` Marcelo Tosatti
  2001-01-05 21:08                                                 ` Chris Mason
  1 sibling, 1 reply; 57+ messages in thread
From: Rik van Riel @ 2001-01-05 20:29 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Chris Mason, linux-kernel

On Fri, 5 Jan 2001, Marcelo Tosatti wrote:
> On Fri, 5 Jan 2001, Chris Mason wrote:
> > On Friday, January 05, 2001 01:43:07 PM -0200 Marcelo Tosatti
> > <marcelo@conectiva.com.br> wrote:
> > > On Fri, 5 Jan 2001, Chris Mason wrote:
> > > 
> > >> 
> > >> Here's the latest version of the patch, against 2.4.0.  The
> > >> biggest open issues are what to do with bdflush, since
> > >> page_launder could do everything bdflush does.  
> > > 
> > > I think we want to remove flush_dirty_buffers() from bdflush. 
> > > 
> > 
> > Whoops.  If bdflush doesn't balance the dirty list, who does?
> 
> Who marks buffers dirty. 
> 
> Linus changed mark_buffer_dirty() to use flush_dirty_buffers() in case
> there are too many dirty buffers.
> 
> Also, I think in practice page_launder will help on balancing. 

Chris is right. It is possible (not very likely, but possible
notheless) that the majority of the dirty pages are _active_
pages ... in that case page_launder() won't help one bit and
only flush_dirty_buffers() can do the job.

Also, you do not want the writer to block on writing out buffers
if bdflush could write them out asynchronously while the dirty
buffer producer can work on in the background.

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to loose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2001-01-05 18:43                                                   ` Marcelo Tosatti
@ 2001-01-05 20:37                                                     ` Rik van Riel
  0 siblings, 0 replies; 57+ messages in thread
From: Rik van Riel @ 2001-01-05 20:37 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Chris Mason, linux-kernel

On Fri, 5 Jan 2001, Marcelo Tosatti wrote:
> On Fri, 5 Jan 2001, Rik van Riel wrote:
> 
> > Also, you do not want the writer to block on writing out buffers
> > if bdflush could write them out asynchronously while the dirty
> > buffer producer can work on in the background.
> 
> flush_dirty_buffers() do not wait on the buffers written to get clean.

This sounds more like a bug we should fix then a reason
not to use flush_dirty_buffers()

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to loose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC] changes to buffer.c (was Test12 ll_rw_block error)
  2001-01-05 18:32                                               ` Marcelo Tosatti
  2001-01-05 20:29                                                 ` Rik van Riel
@ 2001-01-05 21:08                                                 ` Chris Mason
  1 sibling, 0 replies; 57+ messages in thread
From: Chris Mason @ 2001-01-05 21:08 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel



On Friday, January 05, 2001 04:32:50 PM -0200 Marcelo Tosatti
<marcelo@conectiva.com.br> wrote:
 
>> > I think we want to remove flush_dirty_buffers() from bdflush. 
>> > 
>> 
>> Whoops.  If bdflush doesn't balance the dirty list, who does?
> 
> Who marks buffers dirty. 
> 
> Linus changed mark_buffer_dirty() to use flush_dirty_buffers() in case
> there are too many dirty buffers.
> 

Yes, but mark_buffer_dirty only ends up calling flush_dirty_buffers when
balance_dirty_state returns 1.  This means the only people balancing are
the procs (not some async daemon), and the writing only starts when we are
over the hard dirty limit.

-chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2001-01-05 21:08 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <3A398F84.2CD3039D@thebarn.com>
2000-12-15  6:20 ` Test12 ll_rw_block error Linus Torvalds
2000-12-15  7:00   ` Alexander Viro
2000-12-15  9:14     ` Chris Mason
2000-12-17  0:54       ` Russell Cattelan
2000-12-17 12:20         ` Chris Mason
2000-12-15 10:51     ` Stephen C. Tweedie
2000-12-15 19:04       ` Linus Torvalds
2000-12-17  1:08       ` Russell Cattelan
2000-12-18 11:44         ` Stephen C. Tweedie
2000-12-17  2:38       ` Marcelo Tosatti
2000-12-18 11:46         ` Stephen C. Tweedie
2000-12-19 14:05           ` Marcelo Tosatti
2000-12-19 16:43             ` Daniel Phillips
2000-12-19 17:18               ` Marcelo Tosatti
2000-12-19 17:40               ` Christoph Hellwig
2000-12-21 23:25           ` [RFC] changes to buffer.c (was Test12 ll_rw_block error) Chris Mason
2000-12-22  0:19             ` Marcelo Tosatti
2000-12-22  2:20               ` Andreas Dilger
2000-12-22  0:38                 ` Marcelo Tosatti
2000-12-22 13:56                   ` Chris Mason
2000-12-22 16:45                     ` Daniel Phillips
2000-12-22 19:35                       ` Chris Mason
2000-12-22 15:07                   ` Chris Mason
2000-12-22 19:52                     ` Marcelo Tosatti
2000-12-22 23:18                       ` Chris Mason
2000-12-22 23:26                         ` Marcelo Tosatti
2000-12-23 18:21                           ` Chris Mason
2000-12-23 19:02                             ` Linus Torvalds
2000-12-23 19:25                               ` Chris Mason
2000-12-23 15:47                         ` Daniel Phillips
2000-12-27  0:57                       ` Chris Mason
2000-12-26 23:18                         ` Marcelo Tosatti
2000-12-27 20:26                         ` Daniel Phillips
2000-12-27 20:49                           ` Chris Mason
2000-12-28 15:49                             ` Daniel Phillips
2000-12-28 19:19                               ` Chris Mason
2000-12-28 19:29                                 ` Linus Torvalds
2000-12-29 16:03                                   ` Chris Mason
2000-12-29 16:21                                     ` Marcelo Tosatti
2000-12-29 18:33                                       ` Alexander Viro
2001-01-05 15:54                                         ` Chris Mason
2001-01-05 15:43                                           ` Marcelo Tosatti
2001-01-05 17:49                                             ` Daniel Phillips
2001-01-05 17:52                                             ` Chris Mason
2001-01-05 19:51                                             ` Chris Mason
2001-01-05 18:32                                               ` Marcelo Tosatti
2001-01-05 20:29                                                 ` Rik van Riel
2001-01-05 18:43                                                   ` Marcelo Tosatti
2001-01-05 20:37                                                     ` Rik van Riel
2001-01-05 21:08                                                 ` Chris Mason
2001-01-05 18:09                                           ` Juergen Schneider
2000-12-29 17:58                                     ` Daniel Phillips
2001-01-02 15:49                                       ` Chris Mason
2000-12-22  1:54             ` Alexander Viro
2000-12-22 13:49               ` Chris Mason
2000-12-17  0:51     ` Test12 ll_rw_block error Russell Cattelan
2000-12-17  0:21   ` Russell Cattelan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).