linux-kernel.vger.kernel.org archive mirror
* Re: [RFC] using writepage to start io
@ 2001-08-05 18:34 Chris Mason
  2001-08-05 22:38 ` Daniel Phillips
  2001-08-06 15:13 ` Eric W. Biederman
  0 siblings, 2 replies; 30+ messages in thread
From: Chris Mason @ 2001-08-05 18:34 UTC (permalink / raw)
  To: Daniel Phillips, linux-kernel; +Cc: linux-mm, torvalds



On Wednesday, August 01, 2001 04:57:35 PM +0200 Daniel Phillips
<phillips@bonn-fries.net> wrote:

> On Tuesday 31 July 2001 21:07, Chris Mason wrote:
>> This has been tested a little more now, both ext2 (1k, 4k) and
>> reiserfs.  dbench and iozone testing don't show any difference, but I
>> need to spend a little more time on the benchmarks.
> 
> It's impressive that such seemingly radical surgery on the vm innards 
> is a) possible and b) doesn't make the system perform noticeably worse.

radical surgery is always possible ;-)  But I was expecting better
performance results than I got.  I'm trying a few other things out here;
more details will come if they work.

My real motivation for the patch, though, is to allow the filesystem better
control over how things get written.  If I can do this without making things
slower, I've won.  The big drawback is how muddy writepage has gotten with
the patch, as I've more or less had to require checks for partial page writes.

> 
>> The idea is that using flush_dirty_buffers to start i/o under memory
>> pressure is less than optimal.  flush_dirty_buffers knows the oldest
>> dirty buffer, but has no page aging info, so it might not flush a
>> page that we actually want to free.
> 
> Note that the fact that buffers dirtied by ->writepage are ordered by 
> time-dirtied means that the dirty_buffers list really does have 
> indirect knowledge of page aging.  There may well be benefits to your 
> approach but I doubt this is one of them.

A problem is that under memory pressure, we'll flush a buffer that has been
dirty for a long time, even if we are constantly redirtying it and have it
more or less pinned.  This might not be common enough to cause problems,
but it still isn't optimal.  Yes, it is a good idea to flush that page at
some time, but under memory pressure we want to do the least amount of work
that will lead to a freeable page.

> 
> It's surprising that 1K buffers aren't bothered by being grouped by 
> page in their IO requests.  I'd have thought that this would cause a 
> significant number of writes to be blocked waiting on the page lock 
> held by an unrelated buffer writeout.

Well, for non-buffer cache pages, we're getting a poor man's write
clustering.  If this doesn't slow down ext2, it's because of good disk layout.

ext2 probably doesn't use the buffer cache enough to show bad results here.
reiserfs on ia64 or alpha might.

> 
> The most interesting part of your patch to me is the anon_space_mapping.
> It's nice to make buffer handling look more like page cache handling, 
> and get rid of some special cases in the vm scanning.  On the other 
> hand, buffers are different from pages in that, once buffer heads are 
> removed, nobody can find them any more, so they can not be rescued.
> Now, if I'm reading this correctly, buffer pages *will* progress on to 
> the inactive_clean list from the inactive_dirty list instead of jumping 
> that queue and being directly freed by the page_cache_release.

Without my patch, it looks to me like refill_inactive_scan will put buffer
cache pages on the inactive dirty list by calling deactivate_page_nolock.
page_launder catches these by checking page->buffers, and calling
try_to_free_buffers which starts the io.  

So, the big difference now is just that page_launder sees the page is
dirty, and uses writepage to start the io and try_to_free_buffers only
waits on it.  The rest should work more or less the same.
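
To pin that down, here is a condensed sketch of the patched path, in
2.4-era terms (the function and its structure are invented for
illustration; this is not the actual source):

	/* hypothetical condensation of the behavior described above */
	static int launder_buffer_page(struct page *page)
	{
		if (!page->buffers)
			return 0;
		/* patched: the fs starts the io itself via ->writepage */
		if (PageDirty(page))
			page->mapping->a_ops->writepage(page);
		/* try_to_free_buffers() now only has to wait for the io;
		 * stock 2.4 would have started the io right here instead */
		return try_to_free_buffers(page, 1 /* wait */);
	}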

>  Maybe 
> this is good because it avoids the expensive-looking __free_pages_ok.
> 
> This looks scary:
> 
> +        index = atomic_read(&buffermem_pages) ;
> 
> Because buffermem_pages isn't unique.  This must mean you're never 
> doing page cache lookups for anon_space_mapping, because the 
> mapping+index key isn't unique.  There is a danger here of overloading 
> some hash buckets, which becomes a certainty if you use 0 or some other 
> constant for the index.  If you're never doing page cache lookups, why 
> even enter it into the page hash?

Path of least surprise, I suppose; I knew add_to_page_cache_locked() would
do what I wanted in terms of page setup.  If there's a better way, feel free
to advise ;-)  No page lookups are done on the buffer cache pages.

> That's all for now.  It's a very interesting patch.

thanks for the comments ;-)

-chris



* Re: [RFC] using writepage to start io
  2001-08-05 18:34 [RFC] using writepage to start io Chris Mason
@ 2001-08-05 22:38 ` Daniel Phillips
  2001-08-05 23:32   ` Chris Mason
  2001-08-06 15:13 ` Eric W. Biederman
  1 sibling, 1 reply; 30+ messages in thread
From: Daniel Phillips @ 2001-08-05 22:38 UTC (permalink / raw)
  To: Chris Mason, linux-kernel; +Cc: linux-mm, torvalds

On Sunday 05 August 2001 20:34, Chris Mason wrote:
> I wrote:
> > Note that the fact that buffers dirtied by ->writepage are ordered
> > by time-dirtied means that the dirty_buffers list really does have
> > indirect knowledge of page aging.  There may well be benefits to
> > your approach but I doubt this is one of them.
>
> A problem is that under memory pressure, we'll flush a buffer that has
> been dirty for a long time, even if we are constantly redirtying it
> and have it more or less pinned.  This might not be common enough to
> cause problems, but it still isn't optimal.  Yes, it is a good idea to
> flush that page at some time, but under memory pressure we want to do
> the least amount of work that will lead to a freeable page.

But we don't have a choice.  The user has set an explicit limit on how 
long a dirty buffer can hang around before being flushed.  The 
old-buffer rule trumps the need to allocate new memory.  As you noted,
it doesn't cost a lot because if the system is that heavily loaded
then the rate of dirty buffer production is naturally throttled.

> > The most interesting part of your patch to me is the
> > anon_space_mapping. It's nice to make buffer handling look more like
> > page cache handling, and get rid of some special cases in the vm
> > scanning.  On the other hand, buffers are different from pages in
> > that, once buffer heads are removed, nobody can find them any more,
> > so they can not be rescued. Now, if I'm reading this correctly,
> > buffer pages *will* progress on to the inactive_clean list from the
> > inactive_dirty list instead of jumping that queue and being directly
> > freed by the page_cache_release.
>
> Without my patch, it looks to me like refill_inactive_scan will put
> buffer cache pages on the inactive dirty list by calling
> deactivate_page_nolock. page_launder catches these by checking
> page->buffers, and calling try_to_free_buffers which starts the io.
>
> So, the big difference now is just that page_launder sees the page is
> dirty, and uses writepage to start the io and try_to_free_buffers only
> waits on it.  The rest should work more or less the same.

And with your patch, buffers are no longer freed by page_launder; they
are moved on to the inactive_clean list instead, where they are picked
up by reclaim_page.  I'm just wondering if that's a little more
efficient than going through __free_pages_ok/__alloc_pages_limit.

> >  Maybe
> > this is good because it avoids the expensive-looking
> > __free_pages_ok.
> >
> > This looks scary:
> >
> > +        index = atomic_read(&buffermem_pages) ;
> >
> > Because buffermem_pages isn't unique.  This must mean you're never
> > doing page cache lookups for anon_space_mapping, because the
> > mapping+index key isn't unique.  There is a danger here of
> > overloading some hash buckets, which becomes a certainty if you use
> > 0 or some other constant for the index.  If you're never doing page
> > cache lookups, why even enter it into the page hash?
>
> path of least surprise I suppose; I knew add_to_page_cache_locked()
> would do what I wanted in terms of page setup, if there's a better way
> feel free to advise ;-)  No page lookups are done on the buffer cache
> pages.

If you must enter it into the page hash you'd be safer generating a 
random number for the page index.  But why not just take what you need
from add_to_page_cache_locked:

	page_cache_get(page);			/* pin the page */
	spin_lock(&pagecache_lock);
	add_page_to_inode_queue(mapping, page);	/* onto the mapping's page
						 * list, no hash insertion */
	lru_cache_add(page);			/* and onto the lru */
	spin_unlock(&pagecache_lock);

--
Daniel


* Re: [RFC] using writepage to start io
  2001-08-05 22:38 ` Daniel Phillips
@ 2001-08-05 23:32   ` Chris Mason
  2001-08-06  5:39     ` Daniel Phillips
  0 siblings, 1 reply; 30+ messages in thread
From: Chris Mason @ 2001-08-05 23:32 UTC (permalink / raw)
  To: Daniel Phillips, linux-kernel; +Cc: linux-mm, torvalds



On Monday, August 06, 2001 12:38:01 AM +0200 Daniel Phillips
<phillips@bonn-fries.net> wrote:

> On Sunday 05 August 2001 20:34, Chris Mason wrote:
>> I wrote:
>> > Note that the fact that buffers dirtied by ->writepage are ordered
>> > by time-dirtied means that the dirty_buffers list really does have
>> > indirect knowledge of page aging.  There may well be benefits to
>> > your approach but I doubt this is one of them.
>> 
>> A problem is that under memory pressure, we'll flush a buffer that has
>> been dirty for a long time, even if we are constantly redirtying it
>> and have it more or less pinned.  This might not be common enough to
>> cause problems, but it still isn't optimal.  Yes, it is a good idea to
>> flush that page at some time, but under memory pressure we want to do
>> the least amount of work that will lead to a freeable page.
> 
> But we don't have a choice.  The user has set an explicit limit on how 
> long a dirty buffer can hang around before being flushed.  The 
> old-buffer rule trumps the need to allocate new memory.  As you noted,
> it doesn't cost a lot because if the system is that heavily loaded
> then the rate of dirty buffer production is naturally throttled.

there are at least 3 reasons to write buffers to disk

1) they are too old
2) the percentage of dirty buffers is too high
3) you need to reclaim them due to memory pressure

These are 3 completely different things; there's no trumping of priorities.
Under memory pressure you write buffers you have a high chance of freeing,
during write throttling you write buffers that won't get dirty again right
away, and when writing old buffers you write the oldest first.

This doesn't mean you can always make the right decision on all 3 cases, or
that making the right decision is worth the effort ;-)
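
Roughly, with all three selector helpers invented for illustration:

	/* hypothetical: three different "which buffer next" policies */
	enum flush_reason { FLUSH_AGE, FLUSH_RATIO, FLUSH_PRESSURE };

	static struct buffer_head *pick_buffer(enum flush_reason why)
	{
		switch (why) {
		case FLUSH_AGE:		/* kupdate: strictly oldest first */
			return oldest_dirty_buffer();
		case FLUSH_RATIO:	/* throttling: buffers unlikely to
					 * be redirtied right away */
			return stable_dirty_buffer();
		case FLUSH_PRESSURE:	/* reclaim: buffers whose page is
					 * likely to be freeable afterwards */
			return best_freeable_buffer();
		}
		return NULL;
	}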

> If you must enter it into the page hash you'd be safer generating a 
> random number for the page index.  But why not just take what you need
> from add_to_page_cache_locked:
> 

Mostly to keep down the cut-n'-pasted code in the patch.  But I'll try
this out...

-chris



* Re: [RFC] using writepage to start io
  2001-08-05 23:32   ` Chris Mason
@ 2001-08-06  5:39     ` Daniel Phillips
  2001-08-06 13:24       ` Chris Mason
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Phillips @ 2001-08-06  5:39 UTC (permalink / raw)
  To: Chris Mason, linux-kernel; +Cc: linux-mm, torvalds

On Monday 06 August 2001 01:32, Chris Mason wrote:
> On Monday, August 06, 2001 12:38:01 AM +0200 Daniel Phillips
>
> <phillips@bonn-fries.net> wrote:
> > On Sunday 05 August 2001 20:34, Chris Mason wrote:
> >> I wrote:
> >> > Note that the fact that buffers dirtied by ->writepage are
> >> > ordered by time-dirtied means that the dirty_buffers list really
> >> > does have indirect knowledge of page aging.  There may well be
> >> > benefits to your approach but I doubt this is one of them.
> >>
> >> A problem is that under memory pressure, we'll flush a buffer that
> >> has been dirty for a long time, even if we are constantly
> >> redirtying it and have it more or less pinned.  This might not be
> >> common enough to cause problems, but it still isn't optimal.  Yes,
> >> it is a good idea to flush that page at some time, but under memory
> >> pressure we want to do the least amount of work that will lead to a
> >> freeable page.
> >
> > But we don't have a choice.  The user has set an explicit limit on
> > how long a dirty buffer can hang around before being flushed.  The
> > old-buffer rule trumps the need to allocate new memory.  As you
> > noted, it doesn't cost a lot because if the system is that heavily
> > loaded then the rate of dirty buffer production is naturally
> > throttled.
>
> there are at least 3 reasons to write buffers to disk
>
> 1) they are too old
> 2) the percentage of dirty buffers is too high
> 3) you need to reclaim them due to memory pressure
>
> There are 3 completely different things; there's no trumping of
> priorities.

There is.  If your heavily loaded machine goes down and you lose edits 
from half an hour ago, even though your bdflush parms specify a 30-second 
update cycle, you'll call the system broken, whereas if it runs 5% slower 
under heavy write+swap load, that's just life.

> Under memory pressure you write buffers you have a high
> chance of freeing, during write throttling you write buffers that
> won't get dirty again right away, and when writing old buffers you
> write the oldest first.
>
> This doesn't mean you can always make the right decision on all 3
> cases, or that making the right decision is worth the effort ;-)

If we need to do write throttling we should do it at the point where we 
still know it's a write, i.e., somewhere in sys_write.  Some time after 
writes are throttled (specified by bdflush parms) all the old write 
buffers will have worked their way through to the drives and your case 
(3) gets all the bandwidth.  I don't see a conflict, except that we 
don't have such an upstream write throttling mechanism yet.  We sort-of 
have one in that a writer will busy itself trying to help out with lru 
scanning when it can't get a free page for the page cache.  This has the 
ugly result that we have bunches of processes spinning on the lru lock 
and we have no idea what the queue scanning rates really are.  We can do 
something much more intelligent and predictable there and we'll be a lot 
closer to being able to balance intelligently between your cases.

By the way, I think you should combine (2) and (3) using an and, which 
gets us back to the "kupdate thing" vs the "bdflush thing".

--
Daniel


* Re: [RFC] using writepage to start io
  2001-08-06  5:39     ` Daniel Phillips
@ 2001-08-06 13:24       ` Chris Mason
  2001-08-06 16:13         ` Daniel Phillips
  0 siblings, 1 reply; 30+ messages in thread
From: Chris Mason @ 2001-08-06 13:24 UTC (permalink / raw)
  To: Daniel Phillips, linux-kernel; +Cc: linux-mm, torvalds



On Monday, August 06, 2001 07:39:47 AM +0200 Daniel Phillips
<phillips@bonn-fries.net> wrote:
 
>> there are at least 3 reasons to write buffers to disk
>> 
>> 1) they are too old
>> 2) the percentage of dirty buffers is too high
>> 3) you need to reclaim them due to memory pressure
>> 
>> There are 3 completely different things; there's no trumping of
>> priorities.
> 
> There is.  If your heavily loaded machine goes down and you lose edits 
> from 1/2 an hour ago even though your bdflush parms specify a 30 second 
> update cycle you'll call the system broken, whereas if it runs 5% slower 
> under heavy write+swap load that's just life.

Ok, we're getting caught up in semantics here.  I'm not saying kupdate
should switch over to write buffers that might get reclaimed instead of old
buffers.  There still needs to be proper flushing of old data.

I am saying that it should be possible to have the best buffer flushed
under memory pressure (by kswapd/bdflush) and still get the old data to
disk in time through kupdate.

> 
>> Under memory pressure you write buffers you have a high
>> chance of freeing, during write throttling you write buffers that
>> won't get dirty again right away, and when writing old buffers you
>> write the oldest first.
>> 
>> This doesn't mean you can always make the right decision on all 3
>> cases, or that making the right decision is worth the effort ;-)
> 
> If we need to do write throttling we should do it at the point where we 
> still know it's a write, i.e., somewhere in sys_write.  

The rest of the stuff below does make sense, but we need to keep in mind
that sys_write isn't the only way to dirty file pages.

> Some time after 
> writes are throttled (specified by bdflush parms) all the old write 
> buffers will have worked their way through to the drives and your case 
> (3) gets all the bandwidth.  I don't see a conflict, except that we 
> don't have such an upstream write throttling mechanism yet.  We sort-of 
> have one in that a writer will busy itself trying to help out with lru 
> scanning when it can't get a free page for the page cache.  This has the 
> ugly result that we have bunches of processes spinning on the lru lock 
> and we have no idea what the queue scanning rates really are.  We can do 
> something much more intelligent and predictable there and we'll be a lot 
> closer to being able to balance intelligently between your cases.
> 
> By the way, I think you should combine (2) and (3) using an and, which 
> gets us back to the "kupdate thing" vs the "bdflush thing".

Perhaps, since I think they would be handled in roughly the same way.

-chris



* Re: [RFC] using writepage to start io
  2001-08-05 18:34 [RFC] using writepage to start io Chris Mason
  2001-08-05 22:38 ` Daniel Phillips
@ 2001-08-06 15:13 ` Eric W. Biederman
  1 sibling, 0 replies; 30+ messages in thread
From: Eric W. Biederman @ 2001-08-06 15:13 UTC (permalink / raw)
  To: Chris Mason; +Cc: Daniel Phillips, linux-kernel, linux-mm, torvalds

Chris Mason <mason@suse.com> writes:

> On Wednesday, August 01, 2001 04:57:35 PM +0200 Daniel Phillips
> <phillips@bonn-fries.net> wrote:
> 
> > On Tuesday 31 July 2001 21:07, Chris Mason wrote:
> >> This has been tested a little more now, both ext2 (1k, 4k) and
> >> reiserfs.  dbench and iozone testing don't show any difference, but I
> >> need to spend a little more time on the benchmarks.
> > 
> > It's impressive that such seemingly radical surgery on the vm innards 
> > is a) possible and b) doesn't make the system perform noticeably worse.
> 
> radical surgery is always possible ;-)  But, I was expecting better
> performance results than I got.  I'm trying a few other things out here,
> more details will come if they work.  

Hmm.  I would expect that you could entirely avoid the issue of which
hash chain to put pages on if you did the block-devices-in-the-page-cache
thing.  Then the current buffer cache interface becomes just a
backwards compatibility layer, which should make things cleaner.

Eric


* Re: [RFC] using writepage to start io
  2001-08-06 13:24       ` Chris Mason
@ 2001-08-06 16:13         ` Daniel Phillips
  2001-08-06 16:51           ` Chris Mason
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Phillips @ 2001-08-06 16:13 UTC (permalink / raw)
  To: Chris Mason, linux-kernel; +Cc: linux-mm

On Monday 06 August 2001 15:24, Chris Mason wrote:
> On Monday, August 06, 2001 07:39:47 AM +0200 Daniel Phillips wrote:
> >Chris Mason wrote:
> >> there are at least 3 reasons to write buffers to disk
> >>
> >> 1) they are too old
> >> 2) the percentage of dirty buffers is too high
> >> 3) you need to reclaim them due to memory pressure
> >>
> >> There are 3 completely different things; there's no trumping of
> >> priorities.
> >
> > There is.  If your heavily loaded machine goes down and you lose
> > edits from 1/2 an hour ago even though your bdflush parms specify a
> > 30 second update cycle you'll call the system broken, whereas if it
> > runs 5% slower under heavy write+swap load that's just life.
>
> Ok, we're getting caught up in semantics here.  I'm not saying kupdate
> should switch over to write buffers that might get reclaimed instead
> of old buffers.  There still needs to be proper flushing of old data.
>
> I am saying that it should be possible to have the best buffer flushed
> under memory pressure (by kswapd/bdflush) and still get the old data
> to disk in time through kupdate.

Yes, to phrase this more precisely, after we've submitted all the 
too-old buffers we then gain the freedom to select which of the younger 
buffers to flush.  When there is memory pressure we could benefit by 
skipping over some of the sys_write buffers in favor of page_launder 
buffers.  We may well be able to recognize the latter by looking for 
!bh->b_page->age.  This method would be an alternative to your 
writepage approach.

> > By the way, I think you should combine (2) and (3) using an and,
> > which gets us back to the "kupdate thing" vs the "bdflush thing".
>
> Perhaps, since I think they would be handled in roughly the same way.

(warning: I'm going to drift pretty far off the original topic now...)

I don't see why it makes sense to have both a kupdate and a bdflush 
thread.  We should complete the process of merging these (sharing 
flush_dirty buffers was a big step) and look into the possibility of 
adding more intelligence about what to submit next.  The proof of the 
pudding is to come up with a throughput-improving patch, not so easy 
since the ore in these hills has been sought after for a good number of 
years by many skilled prospectors.

Note that bdflush also competes with an unbounded number of threads 
doing wakeup_bdflush(1)->flush_dirty_buffers.

These are called through balance_dirty:

  mark_buffer_dirty->balance_dirty
  __block_commit_write->balance_dirty
  refill_freelist->balance_dirty

(Curiously, refill_freelist also calls wakeup_bdflush(1) directly.)  You 
can see that each of these paths is very popular, and as soon as we pass 
the hard_dirty_limit everybody will jump in to try to help with buffer 
writeout.
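
The decision behind that pile-in is roughly the following (a paraphrase
of the soft/hard limit logic, with the helper names invented, not the
exact 2.4 source):

	/* -1 = do nothing, 0 = wake bdflush, 1 = the caller must help
	 * with the writeout itself (this throttles heavy writers) */
	static int balance_dirty_state(void)
	{
		unsigned long dirty = nr_dirty_buffer_pages();	/* invented */
		unsigned long tot = nr_buffer_cache_pages();	/* invented */

		if (dirty * 100 > tot * hard_dirty_limit)
			return 1;
		if (dirty * 100 > tot * soft_dirty_limit)
			return 0;
		return -1;
	}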

As I recall, the current arrangement was arrived at after a flurry of 
dbench-inspired tweaking last fall and hasn't changed much since then.  
I think we need to take another look at this.  My instinct is that 
it's wrong to ever have more than one instance of flush_dirty_buffers 
active per spindle, and that the current arrangement is an attempt to 
reduce context switches or perhaps to keep buffer submission flowing 
even when page_launder blocks on writepage-><read metadata 
synchronously>.  There has to be a cleaner way to approach this.

--
Daniel


* Re: [RFC] using writepage to start io
  2001-08-06 16:13         ` Daniel Phillips
@ 2001-08-06 16:51           ` Chris Mason
  2001-08-06 19:45             ` Daniel Phillips
  0 siblings, 1 reply; 30+ messages in thread
From: Chris Mason @ 2001-08-06 16:51 UTC (permalink / raw)
  To: Daniel Phillips, linux-kernel; +Cc: linux-mm



On Monday, August 06, 2001 06:13:20 PM +0200 Daniel Phillips
<phillips@bonn-fries.net> wrote:

>> I am saying that it should be possible to have the best buffer flushed
>> under memory pressure (by kswapd/bdflush) and still get the old data
>> to disk in time through kupdate.
> 
> Yes, to phrase this more precisely, after we've submitted all the 
> too-old buffers we then gain the freedom to select which of the younger 
> buffers to flush.  

Almost ;-)  Memory pressure doesn't need to care about how long a buffer has
been dirty; that's kupdate's job.  kupdate doesn't care if the buffer it is
writing is a good candidate for freeing; that's taken care of elsewhere.
The two never need to talk (aside from optimizations).

> When there is memory pressure we could benefit by 
> skipping over some of the sys_write buffers in favor of page_launder 
> buffers.  We may well be able to recognize the latter by looking for 
> !bh->b_page->age.  This method would be an alternative to your 
> writepage approach.

Yes, I had experimented with this in addition to the writepage patch; it
would probably be better to try it as a standalone idea.

> 
>> > By the way, I think you should combine (2) and (3) using an and,
>> > which gets us back to the "kupdate thing" vs the "bdflush thing".
>> 
>> Perhaps, since I think they would be handled in roughly the same way.
> 
> (warning: I'm going to drift pretty far off the original topic now...)
> 
> I don't see why it makes sense to have both a kupdate and a bdflush 
> thread.  

Having two threads is exactly what allows memory pressure to not be
concerned about how long a buffer has been dirty.

> We should complete the process of merging these (sharing 
> flush_dirty buffers was a big step) and look into the possibility of 
> adding more intelligence about what to submit next.  The proof of the 
> pudding is to come up with a throughput-improving patch, not so easy 
> since the ore in these hills has been sought after for a good number of 
> years by many skilled prospectors.
> 
> Note that bdflush also competes with an unbounded number of threads 
> doing wakeup_bdflush(1)->flush_dirty_buffers.

Nods.  Of course, processes could wait on bdflush instead, but bdflush
might not be able to keep up.  It would be interesting to experiment with a
bdflush thread per device, one that uses write_unlocked_buffers to get the
io done.  I'll start by switching from flush_dirty_buffers to
write_unlocked_buffers in the current code...
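
Something like this, say (the thread body is entirely hypothetical;
write_unlocked_buffers() is the helper from the patch):

	/* hypothetical per-device flush thread, one per spindle */
	static int bdflush_dev(void *arg)
	{
		kdev_t dev = (kdev_t)(unsigned long)arg;

		for (;;) {
			/* push whatever is dirty and unlocked on dev */
			write_unlocked_buffers(dev);
			/* then sleep until more buffers are dirtied on
			 * this device; the wait primitive is invented */
			sleep_on_dirty(dev);
		}
		return 0;
	}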

-chris



* Re: [RFC] using writepage to start io
  2001-08-06 16:51           ` Chris Mason
@ 2001-08-06 19:45             ` Daniel Phillips
  2001-08-06 20:12               ` Chris Mason
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Phillips @ 2001-08-06 19:45 UTC (permalink / raw)
  To: Chris Mason, linux-kernel; +Cc: linux-mm

On Monday 06 August 2001 18:51, Chris Mason wrote:
> On Monday, August 06, 2001 06:13:20 PM +0200 Daniel Phillips
>
> <phillips@bonn-fries.net> wrote:
> >> I am saying that it should be possible to have the best buffer
> >> flushed under memory pressure (by kswapd/bdflush) and still get the
> >> old data to disk in time through kupdate.
> >
> > Yes, to phrase this more precisely, after we've submitted all the
> > too-old buffers we then gain the freedom to select which of the
> > younger buffers to flush.
>
> Almost ;-) memory pressure doesn't need to care about how long a
> buffer has been dirty, that's kupdate's job.  kupdate doesn't care if
> the buffer it is writing is a good candidate for freeing, that's taken
> care of elsewhere. The two never need to talk (aside from
> optimizations).

My point is, they should talk; in fact, they should be the same function. 
It's never right for bdflush to submit younger buffers when there are 
dirty buffers whose flush time has already passed.

> > I don't see why it makes sense to have both a kupdate and a bdflush
> > thread.
>
> Having two threads is exactly what allows memory pressure to not be
> concerned about how long a buffer has been dirty.

I'm missing something.  How is it impossible for a single thread to act 
this way?

--
Daniel


* Re: [RFC] using writepage to start io
  2001-08-06 19:45             ` Daniel Phillips
@ 2001-08-06 20:12               ` Chris Mason
  2001-08-06 21:18                 ` Daniel Phillips
  2001-08-07 12:02                 ` Anton Altaparmakov
  0 siblings, 2 replies; 30+ messages in thread
From: Chris Mason @ 2001-08-06 20:12 UTC (permalink / raw)
  To: Daniel Phillips, linux-kernel; +Cc: linux-mm



On Monday, August 06, 2001 09:45:12 PM +0200 Daniel Phillips
<phillips@bonn-fries.net> wrote:

>> Almost ;-) memory pressure doesn't need to care about how long a
>> buffer has been dirty, that's kupdate's job.  kupdate doesn't care if
>> the buffer it is writing is a good candidate for freeing, that's taken
>> care of elsewhere. The two never need to talk (aside from
>> optimizations).
> 
> My point is, they should talk, in fact they should be the same function. 
> It's never right for bdflush to submit younger buffers when there are 
> dirty buffers whose flush time has already passed.
> 

Grin, we're talking in circles.  My point is that by having two threads,
bdflush is allowed to skip over older buffers in favor of younger ones
because somebody else is responsible for writing the older ones out.

Take away the kupdate thread and bdflush must write the older buffer.  I
believe this limits optimizations, unless kswapd is changed to handle all
memory pressure flushes.

-chris






* Re: [RFC] using writepage to start io
  2001-08-06 20:12               ` Chris Mason
@ 2001-08-06 21:18                 ` Daniel Phillips
  2001-08-07 11:02                   ` Stephen C. Tweedie
  2001-08-07 12:07                   ` Anton Altaparmakov
  2001-08-07 12:02                 ` Anton Altaparmakov
  1 sibling, 2 replies; 30+ messages in thread
From: Daniel Phillips @ 2001-08-06 21:18 UTC (permalink / raw)
  To: Chris Mason, linux-kernel; +Cc: linux-mm

On Monday 06 August 2001 22:12, Chris Mason wrote:
> On Monday, August 06, 2001 09:45:12 PM +0200 Daniel Phillips
>
> <phillips@bonn-fries.net> wrote:
> >> Almost ;-) memory pressure doesn't need to care about how long a
> >> buffer has been dirty, that's kupdate's job.  kupdate doesn't care
> >> if the buffer it is writing is a good candidate for freeing, that's
> >> taken care of elsewhere. The two never need to talk (aside from
> >> optimizations).
> >
> > My point is, they should talk, in fact they should be the same
> > function. It's never right for bdflush to submit younger buffers
> > when there are dirty buffers whose flush time has already passed.
>
> Grin, we're talking in circles.  My point is that by having two
> threads, bdflush is allowed to skip over older buffers in favor of
> younger ones because somebody else is responsible for writing the
> older ones out.

Yes, and you can't imagine an algorithm that could do that with *one* 
thread?

--
Daniel


* Re: [RFC] using writepage to start io
  2001-08-06 21:18                 ` Daniel Phillips
@ 2001-08-07 11:02                   ` Stephen C. Tweedie
  2001-08-07 11:39                     ` Ed Tomlinson
  2001-08-07 12:07                   ` Anton Altaparmakov
  1 sibling, 1 reply; 30+ messages in thread
From: Stephen C. Tweedie @ 2001-08-07 11:02 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Chris Mason, linux-kernel, linux-mm

Hi,

On Mon, Aug 06, 2001 at 11:18:26PM +0200, Daniel Phillips wrote:
> > On Monday, August 06, 2001 09:45:12 PM +0200 Daniel Phillips
> >
> > Grin, we're talking in circles.  My point is that by having two
> > threads, bdflush is allowed to skip over older buffers in favor of
> > younger ones because somebody else is responsible for writing the
> > older ones out.
> 
> Yes, and you can't imagine an algorithm that could do that with *one* 
> thread?

FWIW, we've seen big performance degradations in the past when testing
different ext3 checkpointing modes.  You can't reuse a disk block in
the journal without making sure that the data in it has been flushed
to disk, so ext3 does regular checkpointing to flush journaled blocks
out.  That can interact very badly with normal VM writeback if you're
not careful: having two threads doing the same thing at the same time
can just thrash the disk.

Parallel sync() calls from multiple processes has shown up the same
behaviour on ext2 in the past.  I'd definitely like to see at most one
thread of writeback per disk to avoid that.

Cheers,
 Stephen


* Re: [RFC] using writepage to start io
  2001-08-07 11:02                   ` Stephen C. Tweedie
@ 2001-08-07 11:39                     ` Ed Tomlinson
  2001-08-07 18:36                       ` Daniel Phillips
  0 siblings, 1 reply; 30+ messages in thread
From: Ed Tomlinson @ 2001-08-07 11:39 UTC (permalink / raw)
  To: Stephen C. Tweedie, Daniel Phillips; +Cc: Chris Mason, linux-kernel, linux-mm

On August 7, 2001 07:02 am, Stephen C. Tweedie wrote:
> Hi,
>
> On Mon, Aug 06, 2001 at 11:18:26PM +0200, Daniel Phillips wrote:
> > > On Monday, August 06, 2001 09:45:12 PM +0200 Daniel Phillips
> > >
> > > Grin, we're talking in circles.  My point is that by having two
> > > threads, bdflush is allowed to skip over older buffers in favor of
> > > younger ones because somebody else is responsible for writing the
> > > older ones out.
> >
> > Yes, and you can't imagine an algorithm that could do that with *one*
> > thread?
>
> FWIW, we've seen big performance degradations in the past when testing
> different ext3 checkpointing modes.  You can't reuse a disk block in
> the journal without making sure that the data in it has been flushed
> to disk, so ext3 does regular checkpointing to flush journaled blocks
> out.  That can interact very badly with normal VM writeback if you're
> not careful: having two threads doing the same thing at the same time
> can just thrash the disk.
>
> Parallel sync() calls from multiple processes has shown up the same
> behaviour on ext2 in the past.  I'd definitely like to see at most one
> thread of writeback per disk to avoid that.

Be careful here.  I have a system (Solaris) at the office that has 96 drives 
on it.  Do we really want 96 writeback threads?  With 96 drives, I suspect the
bus bandwidth would be the limiting factor.

Ed Tomlinson 


* Re: [RFC] using writepage to start io
  2001-08-06 20:12               ` Chris Mason
  2001-08-06 21:18                 ` Daniel Phillips
@ 2001-08-07 12:02                 ` Anton Altaparmakov
  2001-08-07 13:29                   ` Daniel Phillips
  1 sibling, 1 reply; 30+ messages in thread
From: Anton Altaparmakov @ 2001-08-07 12:02 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Daniel Phillips, Chris Mason, linux-kernel, linux-mm

Hi,

At 12:02 07/08/01, Stephen C. Tweedie wrote:
>On Mon, Aug 06, 2001 at 11:18:26PM +0200, Daniel Phillips wrote:
>FWIW, we've seen big performance degradations in the past when testing
>different ext3 checkpointing modes.  You can't reuse a disk block in
>the journal without making sure that the data in it has been flushed
>to disk, so ext3 does regular checkpointing to flush journaled blocks
>out.  That can interact very badly with normal VM writeback if you're
>not careful: having two threads doing the same thing at the same time
>can just thrash the disk.
>
>Parallel sync() calls from multiple processes has shown up the same
>behaviour on ext2 in the past.  I'd definitely like to see at most one
>thread of writeback per disk to avoid that.

Why not have a facility with which each fs can register its own writeback 
function with a time interval? The daemon would do the writing to the 
device, invoking the fs-registered writers every <time interval> seconds. 
That would avoid the problem of having two fs trying to write in parallel. 
It ignores the problem of having two parallel writers on separate 
partitions of the same disk, but that could be solved at the fs writeback 
function level.
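
The registration side of such a facility might look something like this
(all names invented, just to pin the idea down):

	/* hypothetical fs writeback registration */
	struct writeback_reg {
		struct list_head list;
		unsigned long interval;		/* jiffies between flushes */
		unsigned long due;		/* jiffies of the next flush */
		void (*flush)(void *private);	/* fs-supplied writer */
		void *private;			/* e.g. the super_block */
	};

	/* the daemon walks the registrations strictly in sequence, so no
	 * two fs writers ever run against the device at the same time */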

At least for NTFS TNG I was thinking of having a daemon running every 5 
seconds and committing dirty data to disk, but it would iterate over 
all mounted ntfs volumes in sequence, flushing all dirty data for each, 
thus avoiding concurrent writing to the same disk, which I had thought 
might cause a problem and you just confirmed it...[1]

Just a thought,

Anton

[1] I am aware this probably doesn't scale too well, but considering that a 
volume can span several partitions on the same disk, or across several 
disks, I don't see how to parallelize at the fs level.


-- 
   "Nothing succeeds like success." - Alexandre Dumas
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/



* Re: [RFC] using writepage to start io
  2001-08-06 21:18                 ` Daniel Phillips
  2001-08-07 11:02                   ` Stephen C. Tweedie
@ 2001-08-07 12:07                   ` Anton Altaparmakov
  1 sibling, 0 replies; 30+ messages in thread
From: Anton Altaparmakov @ 2001-08-07 12:07 UTC (permalink / raw)
  To: Ed Tomlinson
  Cc: Stephen C. Tweedie, Daniel Phillips, Chris Mason, linux-kernel, linux-mm

At 12:39 07/08/01, Ed Tomlinson wrote:
>On August 7, 2001 07:02 am, Stephen C. Tweedie wrote:
> > On Mon, Aug 06, 2001 at 11:18:26PM +0200, Daniel Phillips wrote:
> > FWIW, we've seen big performance degradations in the past when testing
> > different ext3 checkpointing modes.  You can't reuse a disk block in
> > the journal without making sure that the data in it has been flushed
> > to disk, so ext3 does regular checkpointing to flush journaled blocks
> > out.  That can interact very badly with normal VM writeback if you're
> > not careful: having two threads doing the same thing at the same time
> > can just thrash the disk.
> >
> > Parallel sync() calls from multiple processes has shown up the same
> > behaviour on ext2 in the past.  I'd definitely like to see at most one
> > thread of writeback per disk to avoid that.
>
>Be careful here.  I have a system (Solaris) at the office that has 96 drives
>on it.  Do we really want 96 writeback threads?  With 96 drives, I suspect the
>bus bandwidth would be the limiting factor.

But we have that situation today already.  There is one thread running for 
each of the md devices in my file server (so I have six threads at the 
moment, each named raid1d), so if you were using the md driver extensively 
to, say, mirror half of your drives onto the other half, you would have 48 
threads running already...

Anton


-- 
   "Nothing succeeds like success." - Alexandre Dumas
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/



* Re: [RFC] using writepage to start io
  2001-08-07 12:02                 ` Anton Altaparmakov
@ 2001-08-07 13:29                   ` Daniel Phillips
  2001-08-07 13:31                     ` Alexander Viro
  2001-08-07 14:23                     ` Stephen C. Tweedie
  0 siblings, 2 replies; 30+ messages in thread
From: Daniel Phillips @ 2001-08-07 13:29 UTC (permalink / raw)
  To: Anton Altaparmakov, Stephen C. Tweedie
  Cc: Chris Mason, linux-kernel, linux-mm

On Tuesday 07 August 2001 14:02, Anton Altaparmakov wrote:
> At 12:02 07/08/01, Stephen C. Tweedie wrote:
> >On Mon, Aug 06, 2001 at 11:18:26PM +0200, Daniel Phillips wrote:
> >FWIW, we've seen big performance degradations in the past when
> > testing different ext3 checkpointing modes.  You can't reuse a disk
> > block in the journal without making sure that the data in it has
> > been flushed to disk, so ext3 does regular checkpointing to flush
> > journaled blocks out.  That can interact very badly with normal VM
> > writeback if you're not careful: having two threads doing the same
> > thing at the same time can just thrash the disk.
> >
> >Parallel sync() calls from multiple processes has shown up the same
> >behaviour on ext2 in the past.  I'd definitely like to see at most
> > one thread of writeback per disk to avoid that.
>
> Why not have a facility with which each fs can register their own
> writeback functions with a time interval? The daemon would be doing
> the writing to the device and would be invoking the fs registered
> writers every <time interval> seconds. That would avoid the problem of
> having two fs trying to write in parallel but that ignores the problem
> of having two parallel writers on separate partitions of the same disk
> but that could be solved at the fs writeback function level.
>
> At least for NTFS TNG I was thinking of having a daemon running every
> 5 seconds and committing dirty data to disk but it would be iterating
> over all mounted ntfs volumes in sequence and flushing all dirty data
> for each, thus avoiding concurrent writing to the same disk, which I
> had thought might cause a problem and you just confirmed it...[1]

Let me see:

  Ext3 has its own writeback daemon
  ReiserFS has its own writeback daemon
  NTFS quite possibly will have its own writeback daemon
  Tux2 has its own writeback daemon
  xfs... does it?
  jfs?

And then there is kupdate, which is a writeback daemon for all those
filesystems too dumb to have their own.

I think I see a pattern here.  We must come up with a model for
efficient interaction between these writeback daemons, or better yet,
provide a generic mechanism that provides the scheduling for all fs
writeback, and knows about the fs->device topology.

> [1] I am aware this probably doesn't scale too well but considering a
> volume can span several disk partitions on the same disk or across
> several disks I don't see how to parallelize at the fs level.

One thread per block device; flushes across mounts on the same device
are serialized.  This model works well for fs->device graphs that are
strict trees.  For a non-strict tree (acyclic graph) it's not clear
what to do, but you could argue that such a configuration is stupid,
so any kind of punt would do.
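
In data structure terms the model amounts to something like this
(hypothetical, just to illustrate the serialization):

	/* one flush context per block device; every mount on the device
	 * shares it, so their flushes are serialized by construction */
	struct flush_queue {
		kdev_t dev;
		struct list_head mounts;	/* all fs mounted on dev */
		struct task_struct *thread;	/* the sole writer for dev */
	};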

--
Daniel


* Re: [RFC] using writepage to start io
  2001-08-07 13:29                   ` Daniel Phillips
@ 2001-08-07 13:31                     ` Alexander Viro
  2001-08-07 15:52                       ` Daniel Phillips
  2001-08-07 14:23                     ` Stephen C. Tweedie
  1 sibling, 1 reply; 30+ messages in thread
From: Alexander Viro @ 2001-08-07 13:31 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Anton Altaparmakov, Stephen C. Tweedie, Chris Mason,
	linux-kernel, linux-mm



On Tue, 7 Aug 2001, Daniel Phillips wrote:

> One thread per block device; flushes across mounts on the same device
> are serialized.  This model works well for fs->device graphs that are
> strict trees.  For a non-strict tree (acyclic graph) it's not clear
> what to do, but you could argue that such a configuration is stupid,
> so any kind of punt would do.

Except that you can have a part of the fs structures on a separate device.
The journal, for one thing. Now think of two disks, both partitioned. Two
filesystems. Each has its data on the first partition of its own disk,
and its journal on the second partition of the other.



* Re: [RFC] using writepage to start io
  2001-08-07 13:29                   ` Daniel Phillips
  2001-08-07 13:31                     ` Alexander Viro
@ 2001-08-07 14:23                     ` Stephen C. Tweedie
  2001-08-07 15:51                       ` Daniel Phillips
  1 sibling, 1 reply; 30+ messages in thread
From: Stephen C. Tweedie @ 2001-08-07 14:23 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Anton Altaparmakov, Stephen C. Tweedie, Chris Mason,
	linux-kernel, linux-mm

Hi,

On Tue, Aug 07, 2001 at 03:29:26PM +0200, Daniel Phillips wrote:

>   Ext3 has its own writeback daemon

Ext3 has a daemon to schedule commits to the journal, but it uses the
normal IO scheduler for unforced writebacks.

Cheers,
 Stephen


* Re: [RFC] using writepage to start io
  2001-08-07 14:23                     ` Stephen C. Tweedie
@ 2001-08-07 15:51                       ` Daniel Phillips
  2001-08-08 14:49                         ` Stephen C. Tweedie
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Phillips @ 2001-08-07 15:51 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Anton Altaparmakov, Stephen C. Tweedie, Chris Mason,
	linux-kernel, linux-mm

On Tuesday 07 August 2001 16:23, Stephen C. Tweedie wrote:
> Hi,
>
> On Tue, Aug 07, 2001 at 03:29:26PM +0200, Daniel Phillips wrote:
> >   Ext3 has its own writeback daemon
>
> Ext3 has a daemon to schedule commits to the journal, but it uses the
> normal IO scheduler for unforced writebacks.

Yes.  The currently favored journalling mode uses a writeback journal,
no?  In other words the ext3 journal daemon seems to fit the description
pretty well, especially if you have several of them on one disk.

--
Daniel


* Re: [RFC] using writepage to start io
  2001-08-07 13:31                     ` Alexander Viro
@ 2001-08-07 15:52                       ` Daniel Phillips
  0 siblings, 0 replies; 30+ messages in thread
From: Daniel Phillips @ 2001-08-07 15:52 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Anton Altaparmakov, Stephen C. Tweedie, Chris Mason,
	linux-kernel, linux-mm

On Tuesday 07 August 2001 15:31, Alexander Viro wrote:
> On Tue, 7 Aug 2001, Daniel Phillips wrote:
> > One thread per block device; flushes across mounts on the same
> > device are serialized.  This model works well for fs->device graphs
> > that are strict trees.  For a non-strict tree (acyclic graph) it's
> > not clear what to do, but you could argue that such a configuration
> > is stupid, so any kind of punt would do.
>
> Except that you can have a part of fs structures on a separate device.
> Journal, for one thing. Now think of two disks, both partitioned. Two
> filesystems. Each has data on the first partition of its own disk.
> And journal on the second of another.

And you want the write scheduling to not interfere when both are
active?  Good luck.  I'd call this a dumb configuration from the
point of view of performance.

That said, the update daemon can just ignore the connection between
data and journal partitions and let the fs take care of that itself
if it needs to.  Ext3 doesn't need to, the usual kupdate rules are
good enough.

So the parallel/serial update strategy survives this case just fine;
however something tells me your fevered imagination is capable of
coming up with yet another bizarre configuration or two...

--
Daniel


* Re: [RFC] using writepage to start io
  2001-08-07 11:39                     ` Ed Tomlinson
@ 2001-08-07 18:36                       ` Daniel Phillips
  0 siblings, 0 replies; 30+ messages in thread
From: Daniel Phillips @ 2001-08-07 18:36 UTC (permalink / raw)
  To: Ed Tomlinson, Stephen C. Tweedie; +Cc: Chris Mason, linux-kernel, linux-mm

On Tuesday 07 August 2001 13:39, Ed Tomlinson wrote:
> On August 7, 2001 07:02 am, Stephen C. Tweedie wrote:
> > On Mon, Aug 06, 2001 at 11:18:26PM +0200, Daniel Phillips wrote:
> > > > On Monday, August 06, 2001 09:45:12 PM +0200 Daniel Phillips
> > > >
> > > > Grin, we're talking in circles.  My point is that by having two
> > > > threads, bdflush is allowed to skip over older buffers in favor
> > > > of younger ones because somebody else is responsible for writing
> > > > the older ones out.
> > >
> > > Yes, and you can't imagine an algorithm that could do that with
> > > *one* thread?
> >
> > FWIW, we've seen big performance degradations in the past when
> > testing different ext3 checkpointing modes.  You can't reuse a disk
> > block in the journal without making sure that the data in it has
> > been flushed to disk, so ext3 does regular checkpointing to flush
> > journaled blocks out.  That can interact very badly with normal VM
> > writeback if you're not careful: having two threads doing the same
> > thing at the same time can just thrash the disk.
> >
> > Parallel sync() calls from multiple processes has shown up the same
> > behaviour on ext2 in the past.  I'd definitely like to see at most
> > one thread of writeback per disk to avoid that.
>
> Be careful here.  I have a system (Solaris) at the office that has 96
> drives on it.  Do we really want 96 writeback threads?  With 96
> drives, I suspect the bus bandwidth would be the limiting factor.

Surely these are grouped into some kind of raid?  You'd have one queue
per raid.  Since the buffer submission is nonblocking[1] it's a matter
of taste whether that translates into multiple threads or not.

[1] So long as you don't run into the low-level request limits, which
should never happen if you pay attention to how much IO is in flight.

--
Daniel


* Re: [RFC] using writepage to start io
  2001-08-07 15:51                       ` Daniel Phillips
@ 2001-08-08 14:49                         ` Stephen C. Tweedie
  0 siblings, 0 replies; 30+ messages in thread
From: Stephen C. Tweedie @ 2001-08-08 14:49 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Stephen C. Tweedie, Anton Altaparmakov, Chris Mason,
	linux-kernel, linux-mm

Hi,

On Tue, Aug 07, 2001 at 05:51:26PM +0200, Daniel Phillips wrote:
> On Tuesday 07 August 2001 16:23, Stephen C. Tweedie wrote:
> > On Tue, Aug 07, 2001 at 03:29:26PM +0200, Daniel Phillips wrote:
> > >   Ext3 has its own writeback daemon
> >
> > Ext3 has a daemon to schedule commits to the journal, but it uses the
> > normal IO scheduler for unforced writebacks.
 
> Yes.  The currently favored journalling mode uses a writeback journal,
> no?

It's lazy commit, but that's not really writeback --- when you're
journaling, you write once to the journal, then after commit you write
again to primary storage.  The commit is lazy, sure, but it's not
doing the writeback.

Cheers,
 Stephen


* Re: [RFC] using writepage to start io
@ 2001-08-07 15:19 Chris Mason
  0 siblings, 0 replies; 30+ messages in thread
From: Chris Mason @ 2001-08-07 15:19 UTC (permalink / raw)
  To: Daniel Phillips, linux-kernel; +Cc: linux-mm



On Monday, August 06, 2001 11:18:26 PM +0200 Daniel Phillips
<phillips@bonn-fries.net> wrote:

>> Grin, we're talking in circles.  My point is that by having two
>> threads, bdflush is allowed to skip over older buffers in favor of
>> younger ones because somebody else is responsible for writing the
>> older ones out.
> 
> Yes, and you can't imagine an algorithm that could do that with *one* 
> thread?

Imagine one?  Yes.  We're mixing a bunch of issues, so I'll list the 3
different cases again: memory pressure, write throttling, age limiting.
Pretending that a single thread could get enough context information about
which of the 3 (perhaps more than one) it is currently facing, it can make
the right decisions.

The problem with that right now is that a single thread can't keep up (with
one case, let alone all 3) as the number of devices increases.  We can more
or less just replay the entire l-k discussion(s) on threading models here.

In my mind, in order for a single thread to get the job done, it can't end
up waiting on a device while there are still buffers ready for writeout to
idle devices.
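
The constraint can be sketched like so (the iterator and predicates are
all invented):

	/* hypothetical single-thread dispatch: never block on a busy
	 * device while another device still has writeout to start */
	for (;;) {
		int progress = 0;
		kdev_t dev;

		for_each_flushable_device(dev) {
			if (!request_queue_full(dev))
				progress += start_dirty_writeout(dev);
		}
		if (!progress)
			sleep_until_more_work();
	}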

As for a generic mechanism to schedule all FS writeback, I've been trying
to use writepage ;-)  The bad part here is that it makes the async issues even
bigger, since the flushing thread ends up calling into the FS (who knows
what that might lead to).

-chris



* Re: [RFC] using writepage to start io
  2001-07-31 19:07 ` Chris Mason
  2001-08-01  1:01   ` Daniel Phillips
@ 2001-08-01 14:57   ` Daniel Phillips
  1 sibling, 0 replies; 30+ messages in thread
From: Daniel Phillips @ 2001-08-01 14:57 UTC (permalink / raw)
  To: Chris Mason, linux-kernel; +Cc: linux-mm, torvalds

On Tuesday 31 July 2001 21:07, Chris Mason wrote:
> This has been tested a little more now, both ext2 (1k, 4k) and
> reiserfs.  dbench and iozone testing don't show any difference, but I
> need to spend a little more time on the benchmarks.

It's impressive that such seemingly radical surgery on the vm innards 
is a) possible and b) doesn't make the system perform noticeably worse.

> The idea is that using flush_dirty_buffers to start i/o under memory
> pressure is less than optimal.  flush_dirty_buffers knows the oldest
> dirty buffer, but has no page aging info, so it might not flush a
> page that we actually want to free.

Note that the fact that buffers dirtied by ->writepage are ordered by 
time-dirtied means that the dirty_buffers list really does have 
indirect knowledge of page aging.  There may well be benefits to your 
approach but I doubt this is one of them.

It's surprising that 1K buffers aren't bothered by being grouped by 
page in their IO requests.  I'd have thought that this would cause a 
significant number of writes to be blocked waiting on the page lock 
held by an unrelated buffer writeout.

The most interesting part of your patch to me is the anon_space_mapping.
It's nice to make buffer handling look more like page cache handling, 
and get rid of some special cases in the vm scanning.  On the other 
hand, buffers are different from pages in that, once buffer heads are 
removed, nobody can find them any more, so they can not be rescued.
Now, if I'm reading this correctly, buffer pages *will* progress on to 
the inactive_clean list from the inactive_dirty list instead of jumping 
that queue and being directly freed by the page_cache_release.  Maybe 
this is good because it avoids the expensive-looking __free_pages_ok.

This looks scary:

+        index = atomic_read(&buffermem_pages) ;

Because buffermem_pages isn't unique.  This must mean you're never 
doing page cache lookups for anon_space_mapping, because the 
mapping+index key isn't unique.  There is a danger here of overloading 
some hash buckets, which becomes a certainty if you use 0 or some other 
constant for the index.  If you're never doing page cache lookups, why 
even enter it into the page hash?

That's all for now.  It's a very interesting patch.

--
Daniel


* Re: [RFC] using writepage to start io
  2001-08-01  1:01   ` Daniel Phillips
@ 2001-08-01  2:05     ` Chris Mason
  0 siblings, 0 replies; 30+ messages in thread
From: Chris Mason @ 2001-08-01  2:05 UTC (permalink / raw)
  To: Daniel Phillips, linux-kernel; +Cc: linux-mm, torvalds



On Wednesday, August 01, 2001 03:01:17 AM +0200 Daniel Phillips
<phillips@bonn-fries.net> wrote:

> Hi, Chris
> 
> On Tuesday 31 July 2001 21:07, Chris Mason wrote:
>> I had to keep some of the flush_dirty_buffer calls as page_launder
>> wasn't triggering enough i/o on its own.  What I'd like to do now is
>> experiment with changing bdflush to only write pages off the inactive
>> dirty lists.
> 
> Will kupdate continue to enforce the "no dirty buffer older than 
> XX" guarantee?

Yes, kupdate still calls flush_dirty_buffers(1).  I'm curious to see how
your write early stuff interacts with it all though....

-chris




* Re: [RFC] using writepage to start io
  2001-07-31 19:07 ` Chris Mason
@ 2001-08-01  1:01   ` Daniel Phillips
  2001-08-01  2:05     ` Chris Mason
  2001-08-01 14:57   ` Daniel Phillips
  1 sibling, 1 reply; 30+ messages in thread
From: Daniel Phillips @ 2001-08-01  1:01 UTC (permalink / raw)
  To: Chris Mason, linux-kernel; +Cc: linux-mm, torvalds

Hi, Chris

On Tuesday 31 July 2001 21:07, Chris Mason wrote:
> I had to keep some of the flush_dirty_buffer calls as page_launder
> wasn't triggering enough i/o on its own.  What I'd like to do now is
> experiment with changing bdflush to only write pages off the inactive
> dirty lists.

Will kupdate continue to enforce the "no dirty buffer older than 
XX" guarantee?

--
Daniel


* Re: [RFC] using writepage to start io
  2001-07-28 16:01 Chris Mason
  2001-07-28 16:18 ` Alexander Viro
@ 2001-07-31 19:07 ` Chris Mason
  2001-08-01  1:01   ` Daniel Phillips
  2001-08-01 14:57   ` Daniel Phillips
  1 sibling, 2 replies; 30+ messages in thread
From: Chris Mason @ 2001-07-31 19:07 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, torvalds



[ linux-mm is cc'd this time, so description is included again, patch 
is slightly updated from the one on Saturday ]

Hello everyone,

This has been tested a little more now, both ext2 (1k, 4k) and reiserfs.  
dbench and iozone testing don't show any difference, but I need to 
spend a little more time on the benchmarks.

This patch changes fs/buffer.c to use writepage to start i/o,
instead of writing buffers directly.  It also changes refile_buffer
to mark the page dirty when marking buffers dirty.

The idea is that using flush_dirty_buffers to start i/o under memory
pressure is less than optimal.  flush_dirty_buffers knows the oldest
dirty buffer, but has no page aging info, so it might not flush a page
that we actually want to free.

I had to keep some of the flush_dirty_buffers calls as page_launder
wasn't triggering enough i/o on its own.  What I'd like to do now is
experiment with changing bdflush to only write pages off the inactive
dirty lists.

Since fs/buffer.c uses writepage instead of writing the page directly,
the filesystems can do more advanced things on writeout, but can still
use the generic dirty lists.
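
Condensed from dirty_list_writepage in the patch below, the dispatch 
looks like this: anon pages (and mappings with no writepage at all) 
get a single-buffer write, everything else goes through its own 
writepage:

	writepage = page->mapping->a_ops->writepage ;
	if (!writepage || page->mapping == &anon_space_mapping)
		return __dirty_list_writepage(page, bh) ;	/* write just bh */

	ClearPageDirty(page) ;
	return writepage(page) ;	/* fs controls the writeout */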

Major changes:

writepage can now be called on pages that are not up to date, so the
writepage func needs to check the page's uptodate bit before writing
buffers that are not themselves marked up to date.
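
Condensed from the __block_write_full_page hunk below, the per-buffer 
decision becomes roughly:

	int page_ok = Page_Uptodate(page) ;
	do {
		/* only write buffers that are dirty or covered by an
		** up to date page, and that sit before the eof block
		*/
		if ((buffer_dirty(bh) || page_ok) && block < lblock)
			arr[nr++] = bh ;	/* queued for submit_bh(WRITE, ...) */
		else
			partial = 1 ;	/* skipped, page stays not up to date */
		bh = bh->b_this_page;
		block++;
	} while (bh != head);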

block_write_full_page used to send the last page in the file through
block_commit_write to mark the buffers dirty.  That doesn't work here
since writepage is used to start the io.  Instead __block_write_full_page
was changed to skip io on buffers past eof.

The ext2 directory page cache code fills placeholders in the page past
the EOF.  But, the generic block_write_full_page zeros past eof, which
makes the ext2 code very unhappy.  This adds a block_write_dir_page
that doesn't zero past eof, and changes ext2 to use it for directories.

flush_dirty_buffers now calls into the filesystem, so it is only safe
to mark a buffer dirty if you know your writepage func won't deadlock
when called by flush_dirty_buffers (thus the reiserfs_writepage changes
in this patch).
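
Condensed from the fs/reiserfs/inode.c hunk below, that mostly means 
starting the transaction lazily, and never while a path is held:

	if (!transaction_started) {
	    pathrelse(&path) ;	/* can't journal_begin with a path held */
	    journal_begin(&th, inode->i_sb, jbegin_count) ;
	    transaction_started = 1 ;
	    goto research ;
	}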

-chris

Patch is against 2.4.7:
diff -Nru a/fs/buffer.c b/fs/buffer.c
--- a/fs/buffer.c	Tue Jul 31 14:43:40 2001
+++ b/fs/buffer.c	Tue Jul 31 14:43:40 2001
@@ -98,6 +98,21 @@
 
 static int grow_buffers(int size);
 static void __refile_buffer(struct buffer_head *);
+int block_write_anon_page(struct page *);
+static int dirty_list_writepage(struct page *page, struct buffer_head *bh) ;
+static void end_buffer_io_async(struct buffer_head * bh, int uptodate) ;
+
+static struct address_space_operations anon_space_ops = {
+	writepage: block_write_anon_page,
+	sync_page: block_sync_page,
+} ;
+static struct address_space anon_space_mapping = {
+	LIST_HEAD_INIT(anon_space_mapping.clean_pages),
+	LIST_HEAD_INIT(anon_space_mapping.dirty_pages),
+	LIST_HEAD_INIT(anon_space_mapping.locked_pages),
+	0, /* nr_pages */
+	&anon_space_ops,
+} ;
 
 /* This is used by some architectures to estimate available memory. */
 atomic_t buffermem_pages = ATOMIC_INIT(0);
@@ -181,66 +196,46 @@
 	put_bh(bh);
 }
 
-/*
- * The buffers have been marked clean and locked.  Just submit the dang
- * things.. 
- *
- * We'll wait for the first one of them - "sync" is not exactly
- * performance-critical, and this makes us not hog the IO subsystem
- * completely, while still allowing for a fair amount of concurrent IO. 
- */
-static void write_locked_buffers(struct buffer_head **array, unsigned int count)
-{
-	struct buffer_head *wait = *array;
-	get_bh(wait);
-	do {
-		struct buffer_head * bh = *array++;
-		bh->b_end_io = end_buffer_io_sync;
-		submit_bh(WRITE, bh);
-	} while (--count);
-	wait_on_buffer(wait);
-	put_bh(wait);
-}
-
-#define NRSYNC (32)
 static void write_unlocked_buffers(kdev_t dev)
 {
 	struct buffer_head *next;
-	struct buffer_head *array[NRSYNC];
-	unsigned int count;
+	struct page *page ;
 	int nr;
 
 repeat:
 	spin_lock(&lru_list_lock);
 	next = lru_list[BUF_DIRTY];
 	nr = nr_buffers_type[BUF_DIRTY] * 2;
-	count = 0;
 	while (next && --nr >= 0) {
 		struct buffer_head * bh = next;
 		next = bh->b_next_free;
 
 		if (dev && bh->b_dev != dev)
 			continue;
-		if (test_and_set_bit(BH_Lock, &bh->b_state))
-			continue;
-		get_bh(bh);
-		if (atomic_set_buffer_clean(bh)) {
-			__refile_buffer(bh);
-			array[count++] = bh;
-			if (count < NRSYNC)
-				continue;
+		page = bh->b_page ;
 
-			spin_unlock(&lru_list_lock);
-			write_locked_buffers(array, count);
-			goto repeat;
+		/* if we wrote the buffer the last time through, it
+		** might not have been refiled yet.  Without this check,
+		** we just end up working on the same buffer over and
+		** over again due to the goto after dirty_list_writepage
+		*/
+		if (buffer_locked(bh) || !buffer_dirty(bh)) {
+			__refile_buffer(bh) ;
+			continue ;
 		}
-		unlock_buffer(bh);
-		put_bh(bh);
+		if (TryLockPage(page)) {
+			continue ;
+		}
+		page_cache_get(page) ;
+		get_bh(bh) ;
+		spin_unlock(&lru_list_lock);
+		dirty_list_writepage(page, bh) ;
+		put_bh(bh) ;
+		page_cache_release(page) ;
+		goto repeat;
+
 	}
 	spin_unlock(&lru_list_lock);
-
-	if (count)
-		write_locked_buffers(array, count);
 }
 
 static int wait_for_locked_buffers(kdev_t dev, int index, int refile)
@@ -274,6 +269,69 @@
 	return 0;
 }
 
+/* just for use with anon pages, or pages that don't provide their own
+** writepage func.  We just want to write bh, not the whole page, so we
+** queue that io here instead of calling writepage.
+*/
+static int __dirty_list_writepage(struct page *page, struct buffer_head *bh) {
+	int other_dirty = 0 ;
+	struct buffer_head *cur ;
+
+	/* check for other dirty buffers on this page.  If there are none,
+	** clear the page dirty bit
+	*/
+	cur = bh->b_this_page ;
+	while(cur != bh) {
+		other_dirty += buffer_dirty(cur) ;	
+		cur = cur->b_this_page ;
+	} 
+	if (other_dirty == 0) {
+		ClearPageDirty(page) ;
+	} 
+
+	/* we want the page available for locking again right away.  
+	** someone walking the dirty buffer list might find another
+	** buffer from this page, and we don't want them to skip it in
+	** favor of a younger buffer.
+	**
+	*/
+	get_bh(bh) ;
+	lock_buffer(bh) ;
+	clear_bit(BH_Dirty, &bh->b_state) ;
+	bh->b_end_io = end_buffer_io_sync ;
+	UnlockPage(page) ;
+	submit_bh(WRITE, bh) ;
+	return 0 ;
+}
+
+/*
+** util function for sync_buffers and flush_dirty_buffers
+** uses either the writepage func supplied in the page's mapping,
+** or the anon address space writepage
+*/
+static int dirty_list_writepage(struct page *page, struct buffer_head *bh) {
+	int (*writepage)(struct page *)  ;
+
+	/* someone wrote this page while we were locking before */
+	if (!PageDirty(page) && !buffer_dirty(bh)) {
+		UnlockPage(page) ;
+		return 0 ;
+	}
+	writepage = page->mapping->a_ops->writepage ;
+
+	/* For anon pages, and pages that don't have a writepage
+	** func, just write this one dirty buffer.  __dirty_list_writepage
+	** does a little more work to make sure the page dirty bit is cleared
+	** when we are the only dirty buffer on this page
+	*/
+	if (!writepage || page->mapping == &anon_space_mapping) {
+		return __dirty_list_writepage(page, bh) ;
+	}
+
+	ClearPageDirty(page) ;
+	return writepage(page) ;
+}
+
 /* Call sync_buffers with wait!=0 to ensure that the call does not
  * return until all buffer writes have completed.  Sync() may return
  * before the writes have finished; fsync() may not.
@@ -775,7 +833,7 @@
 	if (free_shortage())
 		page_launder(GFP_NOFS, 0);
 	if (!grow_buffers(size)) {
-		wakeup_bdflush(1);
+		wakeup_kswapd() ;
 		current->policy |= SCHED_YIELD;
 		__set_current_state(TASK_RUNNING);
 		schedule();
@@ -795,6 +853,7 @@
 	unsigned long flags;
 	struct buffer_head *tmp;
 	struct page *page;
+	int partial = 0 ;
 
 	mark_buffer_uptodate(bh, uptodate);
 
@@ -820,6 +879,8 @@
 	unlock_buffer(bh);
 	tmp = bh->b_this_page;
 	while (tmp != bh) {
+		if (!buffer_uptodate(tmp))
+			partial = 1 ;
 		if (tmp->b_end_io == end_buffer_io_async && buffer_locked(tmp))
 			goto still_busy;
 		tmp = tmp->b_this_page;
@@ -833,7 +894,7 @@
 	 * if none of the buffers had errors then we can set the
 	 * page uptodate:
 	 */
-	if (!PageError(page))
+	if (!PageError(page) && !partial)
 		SetPageUptodate(page);
 
 	/*
@@ -897,8 +958,16 @@
 			if (buffer_dirty(bh)) {
 				get_bh(bh);
 				spin_unlock(&lru_list_lock);
-				ll_rw_block(WRITE, 1, &bh);
-				brelse(bh);
+				lock_buffer(bh) ;
+				/* if we are the only buffer on this page,
+				** we know we have cleaned the page
+				*/
+				if (bh->b_this_page == bh) {
+					ClearPageDirty(bh->b_page) ;    
+				}
+				clear_bit(BH_Dirty, &bh->b_state) ;
+				bh->b_end_io = end_buffer_io_sync ;
+				submit_bh(WRITE, bh) ;
 				spin_lock(&lru_list_lock);
 			}
 		}
@@ -1086,7 +1155,7 @@
 
 	if (state < 0)
 		return;
-	wakeup_bdflush(state);
+	wakeup_bdflush(state) ;
 }
 
 static __inline__ void __mark_dirty(struct buffer_head *bh)
@@ -1120,8 +1189,10 @@
 	int dispose = BUF_CLEAN;
 	if (buffer_locked(bh))
 		dispose = BUF_LOCKED;
-	if (buffer_dirty(bh))
+	if (buffer_dirty(bh)) {
 		dispose = BUF_DIRTY;
+		SetPageDirty(bh->b_page) ;
+	}
 	if (buffer_protected(bh))
 		dispose = BUF_PROTECTED;
 	if (dispose != bh->b_list) {
@@ -1160,10 +1231,14 @@
  */
 void __bforget(struct buffer_head * buf)
 {
+	struct address_space *mapping ;
 	/* grab the lru lock here to block bdflush. */
 	spin_lock(&lru_list_lock);
 	write_lock(&hash_table_lock);
-	if (!atomic_dec_and_test(&buf->b_count) || buffer_locked(buf) || buffer_protected(buf))
+	mapping = buf->b_page->mapping ;
+	if (mapping != &anon_space_mapping || 
+	    !atomic_dec_and_test(&buf->b_count) || 
+	    buffer_locked(buf) || buffer_protected(buf))
 		goto in_use;
 	__hash_unlink(buf);
 	remove_inode_queue(buf);
@@ -1497,6 +1572,50 @@
  * "Dirty" is valid only with the last case (mapped+uptodate).
  */
 
+int block_write_anon_page(struct page *page) 
+{
+	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+	int i, nr = 0 ;
+	int partial = 0 ;
+	int ret = 0 ;
+
+	if (!PageLocked(page))
+		BUG();
+
+	if (!page->buffers)
+		BUG() ;
+
+	head = page->buffers;
+	bh = head;
+
+	/* Stage 1: find the dirty buffers, lock them for submit_bh */
+	do {
+		if (buffer_dirty(bh)) {
+			get_bh(bh) ;
+			lock_buffer(bh) ;
+			set_bit(BH_Uptodate, &bh->b_state) ;
+			bh->b_end_io = end_buffer_io_async;
+			clear_bit(BH_Dirty, &bh->b_state) ;
+			arr[nr++] = bh ;
+		} else {
+			partial = 1 ;
+		}
+		bh = bh->b_this_page;
+	} while (bh != head);
+
+	/* Stage 2: submit the IO */
+	for (i = 0 ; i < nr ; i++) {
+		submit_bh(WRITE, arr[i]) ;
+	}
+	/* Done - end_buffer_io_async will unlock */
+	if (!partial)
+		SetPageUptodate(page);
+	if (nr == 0) {
+		UnlockPage(page) ;
+	}
+	return ret ;
+}
+
 /*
  * block_write_full_page() is SMP-safe - currently it's still
  * being called with the kernel lock held, but the code is ready.
@@ -1505,16 +1624,23 @@
 {
 	int err, i;
 	unsigned long block;
+	unsigned long lblock ;
+	unsigned long blocksize = inode->i_sb->s_blocksize ;
 	struct buffer_head *bh, *head;
+	int page_ok = Page_Uptodate(page) ;
+	struct buffer_head *arr[MAX_BUF_PER_PAGE] ;
+	int nr = 0 ;
+	int partial = 0 ;
 
 	if (!PageLocked(page))
 		BUG();
 
 	if (!page->buffers)
-		create_empty_buffers(page, inode->i_dev, inode->i_sb->s_blocksize);
+		create_empty_buffers(page, inode->i_dev, blocksize);
 	head = page->buffers;
 
 	block = page->index << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
+	lblock = (inode->i_size+blocksize-1) >> inode->i_sb->s_blocksize_bits;
 
 	bh = head;
 	i = 0;
@@ -1529,36 +1655,42 @@
 		 * Leave it to the low-level FS to make all those
 		 * decisions (block #0 may actually be a valid block)
 		 */
-		if (!buffer_mapped(bh)) {
-			err = get_block(inode, block, bh, 1);
-			if (err)
-				goto out;
-			if (buffer_new(bh))
-				unmap_underlying_metadata(bh);
+		if ((buffer_dirty(bh) || page_ok) && block < lblock) {
+			if (!buffer_mapped(bh)) {
+				err = get_block(inode, block, bh, 1);
+				if (err)
+					goto out;
+				if (buffer_new(bh))
+					unmap_underlying_metadata(bh);
+			}
+			arr[nr++] = bh ;
+		} else {
+			partial = 1 ;
 		}
 		bh = bh->b_this_page;
 		block++;
 	} while (bh != head);
 
 	/* Stage 2: lock the buffers, mark them clean */
-	do {
+	for(i = 0 ; i < nr ; i++) {
+		bh = arr[i] ;
 		lock_buffer(bh);
 		bh->b_end_io = end_buffer_io_async;
 		get_bh(bh);
 		set_bit(BH_Uptodate, &bh->b_state);
 		clear_bit(BH_Dirty, &bh->b_state);
-		bh = bh->b_this_page;
-	} while (bh != head);
+	} 
 
 	/* Stage 3: submit the IO */
-	do {
-		struct buffer_head *next = bh->b_this_page;
-		submit_bh(WRITE, bh);
-		bh = next;
-	} while (bh != head);
+	for(i = 0 ; i < nr ; i++) {
+		submit_bh(WRITE, arr[i]);
+	}
+	if (nr == 0)
+		UnlockPage(page) ;
 
 	/* Done - end_buffer_io_async will unlock */
-	SetPageUptodate(page);
+	if (!partial)
+		SetPageUptodate(page);
 	return 0;
 
 out:
@@ -1948,7 +2080,6 @@
 	struct inode *inode = page->mapping->host;
 	unsigned long end_index = inode->i_size >> PAGE_CACHE_SHIFT;
 	unsigned offset;
-	int err;
 
 	/* easy case */
 	if (page->index < end_index)
@@ -1963,18 +2094,35 @@
 	}
 
 	/* Sigh... will have to work, then... */
-	err = __block_prepare_write(inode, page, 0, offset, get_block);
-	if (!err) {
-		memset(page_address(page) + offset, 0, PAGE_CACHE_SIZE - offset);
-		flush_dcache_page(page);
-		__block_commit_write(inode,page,0,offset);
-done:
-		kunmap(page);
+	memset(kmap(page) + offset, 0, PAGE_CACHE_SIZE - offset);
+	flush_dcache_page(page);
+	kunmap(page);
+	return __block_write_full_page(inode, page, get_block) ;
+}
+
+int block_write_dir_page(struct page *page, get_block_t *get_block)
+{
+	struct inode *inode = page->mapping->host;
+	unsigned long end_index = inode->i_size >> PAGE_CACHE_SHIFT;
+	unsigned offset;
+
+	/* easy case */
+	if (page->index < end_index)
+		return __block_write_full_page(inode, page, get_block);
+
+	/* things got complicated... */
+	offset = inode->i_size & (PAGE_CACHE_SIZE-1);
+	/* OK, are we completely out? */
+	if (page->index >= end_index+1 || !offset) {
 		UnlockPage(page);
-		return err;
+		return -EIO;
 	}
-	ClearPageUptodate(page);
-	goto done;
+
+	/* directories in the page cache don't need to have the buffers
+	** past the eof zerod out.  In fact, ext2 sets up fake entries
+	** past the eof in their algs.
+	*/
+	return __block_write_full_page(inode, page, get_block) ;
 }
 
 int generic_block_bmap(struct address_space *mapping, long block, get_block_t *get_block)
@@ -2250,7 +2398,7 @@
 	struct buffer_head *bh, *tmp;
 	struct buffer_head * insert_point;
 	int isize;
-
+	unsigned long index ;
 	if ((size & 511) || (size > PAGE_SIZE)) {
 		printk(KERN_ERR "VFS: grow_buffers: size = %d\n",size);
 		return 0;
@@ -2266,6 +2414,16 @@
 
 	isize = BUFSIZE_INDEX(size);
 
+	/* don't put this buffer head on the free list until the
+	** page is setup.  Is there a better index to use?  Would 0
+	** be good enough?
+	*/
+	page->flags &= ~(1 << PG_referenced);
+	index = atomic_read(&buffermem_pages) ;
+	atomic_inc(&buffermem_pages);
+	add_to_page_cache_locked(page, &anon_space_mapping, index) ;
+	page->buffers = bh;
+
 	spin_lock(&free_list[isize].lock);
 	insert_point = free_list[isize].list;
 	tmp = bh;
@@ -2289,11 +2447,7 @@
 	free_list[isize].list = bh;
 	spin_unlock(&free_list[isize].lock);
 
-	page->buffers = bh;
-	page->flags &= ~(1 << PG_referenced);
-	lru_cache_add(page);
 	UnlockPage(page);
-	atomic_inc(&buffermem_pages);
 	return 1;
 
 no_buffer_head:
@@ -2315,10 +2469,9 @@
  *
  * Wait:
  *	0 - no wait (this does not get called - see try_to_free_buffers below)
- *	1 - start IO for dirty buffers
  *	2 - wait for completion of locked buffers
  */
-static void sync_page_buffers(struct buffer_head *bh, unsigned int gfp_mask)
+static void wait_page_buffers(struct buffer_head *bh, unsigned int gfp_mask)
 {
 	struct buffer_head * tmp = bh;
 
@@ -2326,10 +2479,8 @@
 		struct buffer_head *p = tmp;
 		tmp = tmp->b_this_page;
 		if (buffer_locked(p)) {
-			if (gfp_mask & __GFP_WAIT)
-				__wait_on_buffer(p);
-		} else if (buffer_dirty(p))
-			ll_rw_block(WRITE, 1, &p);
+			__wait_on_buffer(p);
+		} 
 	} while (tmp != bh);
 }
 
@@ -2390,6 +2541,9 @@
 
 	/* And free the page */
 	page->buffers = NULL;
+	if (page->mapping == &anon_space_mapping) {
+		atomic_dec(&buffermem_pages) ;
+	}
 	page_cache_release(page);
 	spin_unlock(&free_list[index].lock);
 	write_unlock(&hash_table_lock);
@@ -2401,14 +2555,13 @@
 	spin_unlock(&free_list[index].lock);
 	write_unlock(&hash_table_lock);
 	spin_unlock(&lru_list_lock);
-	if (gfp_mask & __GFP_IO) {
-		sync_page_buffers(bh, gfp_mask);
+	if (gfp_mask & __GFP_WAIT) {
+		wait_page_buffers(bh, gfp_mask);
 		/* We waited synchronously, so we can free the buffers. */
-		if ((gfp_mask & __GFP_WAIT) && !loop) {
+		if (!loop) {
 			loop = 1;
 			goto cleaned_buffers_try_again;
 		}
-		wakeup_bdflush(0);
 	}
 	return 0;
 }
@@ -2541,6 +2694,7 @@
 static int flush_dirty_buffers(int check_flushtime)
 {
 	struct buffer_head * bh, *next;
+	struct page *page ;
 	int flushed = 0, i;
 
  restart:
@@ -2557,6 +2711,8 @@
 		}
 		if (buffer_locked(bh))
 			continue;
+		if (!buffer_uptodate(bh))
+			continue ;
 
 		if (check_flushtime) {
 			/* The dirty lru list is chronologically ordered so
@@ -2570,11 +2726,19 @@
 				goto out_unlock;
 		}
 
+		page = bh->b_page ;
+		if (TryLockPage(page)) {
+		    flushed-- ;
+		    continue ;
+		}
+
 		/* OK, now we are committed to write it out. */
+		page_cache_get(page) ;
 		get_bh(bh);
 		spin_unlock(&lru_list_lock);
-		ll_rw_block(WRITE, 1, &bh);
+		dirty_list_writepage(page, bh) ;
 		put_bh(bh);
+		page_cache_release(page) ;
 
 		if (current->need_resched)
 			schedule();
diff -Nru a/fs/ext2/inode.c b/fs/ext2/inode.c
--- a/fs/ext2/inode.c	Tue Jul 31 14:43:40 2001
+++ b/fs/ext2/inode.c	Tue Jul 31 14:43:40 2001
@@ -574,6 +574,10 @@
 {
 	return block_write_full_page(page,ext2_get_block);
 }
+static int ext2_dir_writepage(struct page *page)
+{
+	return block_write_dir_page(page,ext2_get_block);
+}
 static int ext2_readpage(struct file *file, struct page *page)
 {
 	return block_read_full_page(page,ext2_get_block);
@@ -595,6 +599,15 @@
 	bmap: ext2_bmap
 };
 
+struct address_space_operations ext2_dir_aops = {
+	readpage: ext2_readpage,
+	writepage: ext2_dir_writepage,
+	sync_page: block_sync_page,
+	prepare_write: ext2_prepare_write,
+	commit_write: generic_commit_write,
+	bmap: ext2_bmap
+};
+
 /*
  * Probably it should be a library function... search for first non-zero word
  * or memcmp with zero_page, whatever is better for particular architecture.
@@ -985,7 +998,7 @@
 	} else if (S_ISDIR(inode->i_mode)) {
 		inode->i_op = &ext2_dir_inode_operations;
 		inode->i_fop = &ext2_dir_operations;
-		inode->i_mapping->a_ops = &ext2_aops;
+		inode->i_mapping->a_ops = &ext2_dir_aops;
 	} else if (S_ISLNK(inode->i_mode)) {
 		if (!inode->i_blocks)
 			inode->i_op = &ext2_fast_symlink_inode_operations;
diff -Nru a/fs/ext2/namei.c b/fs/ext2/namei.c
--- a/fs/ext2/namei.c	Tue Jul 31 14:43:40 2001
+++ b/fs/ext2/namei.c	Tue Jul 31 14:43:40 2001
@@ -194,7 +194,7 @@
 
 	inode->i_op = &ext2_dir_inode_operations;
 	inode->i_fop = &ext2_dir_operations;
-	inode->i_mapping->a_ops = &ext2_aops;
+	inode->i_mapping->a_ops = &ext2_dir_aops;
 
 	ext2_inc_count(inode);
 
diff -Nru a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
--- a/fs/reiserfs/inode.c	Tue Jul 31 14:43:40 2001
+++ b/fs/reiserfs/inode.c	Tue Jul 31 14:43:40 2001
@@ -1694,13 +1694,19 @@
     int use_get_block = 0 ;
     int bytes_copied = 0 ;
     int copy_size ;
+    int transaction_started = 0 ;
 
 start_over:
     lock_kernel() ;
-    journal_begin(&th, inode->i_sb, jbegin_count) ;
 
     make_cpu_key(&key, inode, byte_offset, TYPE_ANY, 3) ;
 
+    /* we know we've got a direct item, start the transaction now */
+    if (buffer_mapped(bh_result) && bh_result->b_blocknr == 0) {
+	journal_begin(&th, inode->i_sb, jbegin_count) ;
+	transaction_started = 1 ;
+    }
+
 research:
     retval = search_for_position_by_key(inode->i_sb, &key, &path) ;
     if (retval != POSITION_FOUND) {
@@ -1727,6 +1733,17 @@
 	mark_buffer_uptodate(bh_result, 1);
     } else if (is_direct_le_ih(ih)) {
         char *p ; 
+
+	/* sigh, we can't start a transaction with a path held, so we
+	** have to drop the path, start the transaction, and start over
+	*/
+	if (!transaction_started) {
+	    pathrelse(&path) ;
+	    journal_begin(&th, inode->i_sb, jbegin_count) ;
+	    transaction_started = 1 ;
+	    goto research ;
+	}
+
         p = page_address(bh_result->b_page) ;
         p += (byte_offset -1) & (PAGE_CACHE_SIZE - 1) ;
         copy_size = le16_to_cpu(ih->ih_item_len) - pos_in_item ;
@@ -1761,7 +1778,8 @@
     
 out:
     pathrelse(&path) ;
-    journal_end(&th, inode->i_sb, jbegin_count) ;
+    if (transaction_started)
+	journal_end(&th, inode->i_sb, jbegin_count) ;
     unlock_kernel() ;
 
     /* this is where we fill in holes in the file. */
@@ -1813,6 +1831,7 @@
     int partial = 0 ;
     struct buffer_head *arr[PAGE_CACHE_SIZE/512] ;
     int nr = 0 ;
+    int page_ok = Page_Uptodate(page) ;
 
     if (!page->buffers) {
         block_prepare_write(page, 0, 0, NULL) ;
@@ -1837,7 +1856,7 @@
     block = page->index << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits) ;
     do {
 	/* if this offset in the page is outside the file */
-	if (cur_offset >= last_offset) {
+	if (!(buffer_dirty(bh) || page_ok) || cur_offset >= last_offset) {
 	    if (!buffer_uptodate(bh))
 	        partial = 1 ;
 	} else {
diff -Nru a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
--- a/fs/reiserfs/journal.c	Tue Jul 31 14:43:40 2001
+++ b/fs/reiserfs/journal.c	Tue Jul 31 14:43:40 2001
@@ -702,10 +702,12 @@
       } 
       if (buffer_dirty(tbh)) {
 	printk("journal-569: flush_commit_list, block already dirty!\n") ;
-      } else {				
-	mark_buffer_dirty(tbh) ;
-      }
-      ll_rw_block(WRITE, 1, &tbh) ;
+      } 
+      get_bh(tbh) ; /* inc the count for the end_io handler */
+      lock_buffer(tbh) ;
+      clear_bit(BH_Dirty, &tbh->b_state) ;
+      tbh->b_end_io = end_buffer_io_sync ;
+      submit_bh(WRITE, tbh) ;
       count++ ;
       put_bh(tbh) ; /* once for our get_hash */
     } 
@@ -738,8 +740,11 @@
                        atomic_read(&(jl->j_commit_left)));
   }
 
-  mark_buffer_dirty(jl->j_commit_bh) ;
-  ll_rw_block(WRITE, 1, &(jl->j_commit_bh)) ;
+  get_bh(jl->j_commit_bh) ;
+  lock_buffer(jl->j_commit_bh) ;
+  clear_bit(BH_Dirty, &(jl->j_commit_bh->b_state)) ;
+  jl->j_commit_bh->b_end_io = end_buffer_io_sync ;
+  submit_bh(WRITE, jl->j_commit_bh) ;
   wait_on_buffer(jl->j_commit_bh) ;
   if (!buffer_uptodate(jl->j_commit_bh)) {
     reiserfs_panic(s, "journal-615: buffer write failed\n") ;
diff -Nru a/fs/reiserfs/tail_conversion.c b/fs/reiserfs/tail_conversion.c
--- a/fs/reiserfs/tail_conversion.c	Tue Jul 31 14:43:40 2001
+++ b/fs/reiserfs/tail_conversion.c	Tue Jul 31 14:43:40 2001
@@ -184,6 +184,12 @@
         cur_index += bh->b_size ;
         if (cur_index > tail_index) {
           reiserfs_unmap_buffer(bh) ;
+	  /* if we are the first buffer on the page, we know the
+	  ** page is clean 
+	  */
+	  if (bh == page->buffers) {
+	    ClearPageDirty(page) ;
+	  }
         }
 	bh = next ;
       } while (bh != head) ;
diff -Nru a/include/linux/ext2_fs.h b/include/linux/ext2_fs.h
--- a/include/linux/ext2_fs.h	Tue Jul 31 14:43:40 2001
+++ b/include/linux/ext2_fs.h	Tue Jul 31 14:43:40 2001
@@ -640,6 +640,7 @@
 extern struct inode_operations ext2_fast_symlink_inode_operations;
 
 extern struct address_space_operations ext2_aops;
+extern struct address_space_operations ext2_dir_aops;
 
 #endif	/* __KERNEL__ */
 
diff -Nru a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h	Tue Jul 31 14:43:40 2001
+++ b/include/linux/fs.h	Tue Jul 31 14:43:40 2001
@@ -1326,6 +1326,8 @@
 extern int block_flushpage(struct page *, unsigned long);
 extern int block_symlink(struct inode *, const char *, int);
 extern int block_write_full_page(struct page*, get_block_t*);
+extern int block_write_dir_page(struct page*, get_block_t*);
+extern int block_write_anon_page(struct page*) ;
 extern int block_read_full_page(struct page*, get_block_t*);
 extern int block_prepare_write(struct page*, unsigned, unsigned, get_block_t*);
 extern int cont_prepare_write(struct page*, unsigned, unsigned, get_block_t*,
diff -Nru a/kernel/ksyms.c b/kernel/ksyms.c
--- a/kernel/ksyms.c	Tue Jul 31 14:43:40 2001
+++ b/kernel/ksyms.c	Tue Jul 31 14:43:40 2001
@@ -197,6 +197,7 @@
 EXPORT_SYMBOL(__wait_on_buffer);
 EXPORT_SYMBOL(___wait_on_page);
 EXPORT_SYMBOL(block_write_full_page);
+EXPORT_SYMBOL(block_write_dir_page);
 EXPORT_SYMBOL(block_read_full_page);
 EXPORT_SYMBOL(block_prepare_write);
 EXPORT_SYMBOL(block_sync_page);
diff -Nru a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c	Tue Jul 31 14:43:40 2001
+++ b/mm/vmscan.c	Tue Jul 31 14:43:40 2001
@@ -548,7 +548,6 @@
 
 			/* The page was only in the buffer cache. */
 			} else if (!page->mapping) {
-				atomic_dec(&buffermem_pages);
 				freed_page = 1;
 				cleaned_pages++;
 
@@ -571,10 +570,9 @@
 			page_cache_release(page);
 
 			/* 
-			 * If we're freeing buffer cache pages, stop when
-			 * we've got enough free memory.
+			 * stop when we've got enough free memory.
 			 */
-			if (freed_page && !free_shortage())
+			if (!free_shortage())
 				break;
 			continue;
 		} else if (page->mapping && !PageDirty(page)) {
@@ -609,9 +607,6 @@
 	 * free anything yet, we wait synchronously on the writeout of
 	 * MAX_SYNC_LAUNDER pages.
 	 *
-	 * We also wake up bdflush, since bdflush should, under most
-	 * loads, flush out the dirty pages before we have to wait on
-	 * IO.
 	 */
 	if (CAN_DO_IO && !launder_loop && free_shortage()) {
 		launder_loop = 1;



* Re: [RFC] using writepage to start io
  2001-07-28 16:18 ` Alexander Viro
@ 2001-07-28 16:25   ` Chris Mason
  0 siblings, 0 replies; 30+ messages in thread
From: Chris Mason @ 2001-07-28 16:25 UTC (permalink / raw)
  To: Alexander Viro; +Cc: linux-kernel, torvalds



On Saturday, July 28, 2001 12:18:02 PM -0400 Alexander Viro <viro@math.psu.edu> wrote:

> 
> 
> On Sat, 28 Jul 2001, Chris Mason wrote:
> 
>> 
>> Hello everyone,
>> 
>> This is an rfc only, as the code has only been _lightly_ tested.  I'll
>> do more tests/benchmarks next week.
>> 
>> This patch changes fs/buffer.c to use writepage to start i/o,
>> instead of writing buffers directly.  It also changes refile_buffer
>> to mark the page dirty when marking buffers dirty.
> 
> ->writepage() unlocks the page upon completion. How do you deal with that?

writepage funcs are added for use by anonymous pages, and
flush_dirty_buffers and friends are changed to expect writepage to unlock
the page on completion.
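
Condensed from the flush_dirty_buffers change, the calling convention 
is: the caller takes the page lock, and writepage (or the anon 
fallback) drops it, even when no io ends up being submitted:

	if (TryLockPage(page))
		continue ;			/* busy page, skip it */
	page_cache_get(page) ;
	get_bh(bh);
	spin_unlock(&lru_list_lock);
	dirty_list_writepage(page, bh) ;	/* unlocks the page */
	put_bh(bh);
	page_cache_release(page) ;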

Also, end_buffer_io_async is changed to only mark the page up to date
when all the buffers on it are up to date.  This is important when the
blocksize is less than the page size, and block_prepare_write/
block_commit_write are used to dirty a buffer in the middle of a page
that is not up to date.  If that dirty buffer is written with an async
end_io handler, we don't want the handler to mark the whole page up
to date.
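
The check itself is small -- condensed from the end_buffer_io_async 
hunk:

	int partial = 0 ;
	struct buffer_head *tmp = bh->b_this_page;

	while (tmp != bh) {
		if (!buffer_uptodate(tmp))
			partial = 1 ;
		tmp = tmp->b_this_page;
	}
	/* partial keeps a half-filled page from being marked up to
	** date just because one of its buffers completed
	*/
	if (!PageError(page) && !partial)
		SetPageUptodate(page);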

-chris








* Re: [RFC] using writepage to start io
  2001-07-28 16:01 Chris Mason
@ 2001-07-28 16:18 ` Alexander Viro
  2001-07-28 16:25   ` Chris Mason
  2001-07-31 19:07 ` Chris Mason
  1 sibling, 1 reply; 30+ messages in thread
From: Alexander Viro @ 2001-07-28 16:18 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-kernel, torvalds



On Sat, 28 Jul 2001, Chris Mason wrote:

> 
> Hello everyone,
> 
> This is an rfc only, as the code has only been _lightly_ tested.  I'll
> do more tests/benchmarks next week.
> 
> This patch changes fs/buffer.c to use writepage to start i/o,
> instead of writing buffers directly.  It also changes refile_buffer
> to mark the page dirty when marking buffers dirty.

->writepage() unlocks the page upon completion. How do you deal with that?



* [RFC] using writepage to start io
@ 2001-07-28 16:01 Chris Mason
  2001-07-28 16:18 ` Alexander Viro
  2001-07-31 19:07 ` Chris Mason
  0 siblings, 2 replies; 30+ messages in thread
From: Chris Mason @ 2001-07-28 16:01 UTC (permalink / raw)
  To: linux-kernel; +Cc: torvalds


Hello everyone,

This is an rfc only, as the code has only been _lightly_ tested.  I'll
do more tests/benchmarks next week.

This patch changes fs/buffer.c to use writepage to start i/o,
instead of writing buffers directly.  It also changes refile_buffer
to mark the page dirty when marking buffers dirty.

The idea is that using flush_dirty_buffers to start i/o under memory
pressure is less than optimal.  flush_dirty_buffers knows the oldest
dirty buffer, but has no page aging info, so it might not flush a page
that we actually want to free.

Also, if fs/buffer.c uses writepage instead of writing the page directly,
the filesystems can do more advanced things on writeout, but can still
use the generic dirty lists.

Major changes:

writepage can now be called on pages that are not up to date, so the
writepage func needs to check the page's uptodate bit before writing
buffers that are not themselves marked up to date.

block_write_full_page used to send the last page in the file through
block_commit_write to mark the buffers dirty.  That doesn't work here
since writepage is used to start the io.  Instead __block_write_full_page
was changed to skip io on buffers past eof.

The ext2 directory page cache code fills placeholders in the page past
the EOF.  But, the generic block_write_full_page zeros past eof, which
makes the ext2 code very unhappy.  This adds a block_write_dir_page
that doesn't zero past eof, and changes ext2 to use it for directories.

flush_dirty_buffers now calls into the filesystem, so it isn't safe to
call it in refill_freelist, or to trigger it in balance_dirty.
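
Condensed from the refill_freelist hunk: rather than kicking bdflush 
(which would now call back into the filesystem), memory pressure just 
wakes kswapd:

	if (!grow_buffers(size)) {
		wakeup_kswapd() ;	/* was: wakeup_bdflush(1) */
		current->policy |= SCHED_YIELD;
		__set_current_state(TASK_RUNNING);
		schedule();
	}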

-chris

diff -Nru a/fs/buffer.c b/fs/buffer.c
--- a/fs/buffer.c	Sat Jul 28 11:33:18 2001
+++ b/fs/buffer.c	Sat Jul 28 11:33:18 2001
@@ -98,6 +98,21 @@
 
 static int grow_buffers(int size);
 static void __refile_buffer(struct buffer_head *);
+int block_write_anon_page(struct page *);
+static int dirty_list_writepage(struct page *page, struct buffer_head *bh) ;
+static void end_buffer_io_async(struct buffer_head * bh, int uptodate) ;
+
+static struct address_space_operations anon_space_ops = {
+	writepage: block_write_anon_page,
+	sync_page: block_sync_page,
+} ;
+static struct address_space anon_space_mapping = {
+	LIST_HEAD_INIT(anon_space_mapping.clean_pages),
+	LIST_HEAD_INIT(anon_space_mapping.dirty_pages),
+	LIST_HEAD_INIT(anon_space_mapping.locked_pages),
+	0, /* nr_pages */
+	&anon_space_ops,
+} ;
 
 /* This is used by some architectures to estimate available memory. */
 atomic_t buffermem_pages = ATOMIC_INIT(0);
@@ -181,66 +196,48 @@
 	put_bh(bh);
 }
 
-/*
- * The buffers have been marked clean and locked.  Just submit the dang
- * things.. 
- *
- * We'll wait for the first one of them - "sync" is not exactly
- * performance-critical, and this makes us not hog the IO subsystem
- * completely, while still allowing for a fair amount of concurrent IO. 
- */
-static void write_locked_buffers(struct buffer_head **array, unsigned int count)
-{
-	struct buffer_head *wait = *array;
-	get_bh(wait);
-	do {
-		struct buffer_head * bh = *array++;
-		bh->b_end_io = end_buffer_io_sync;
-		submit_bh(WRITE, bh);
-	} while (--count);
-	wait_on_buffer(wait);
-	put_bh(wait);
-}
-
-#define NRSYNC (32)
 static void write_unlocked_buffers(kdev_t dev)
 {
 	struct buffer_head *next;
-	struct buffer_head *array[NRSYNC];
-	unsigned int count;
+	struct page *page ;
 	int nr;
 
 repeat:
 	spin_lock(&lru_list_lock);
 	next = lru_list[BUF_DIRTY];
 	nr = nr_buffers_type[BUF_DIRTY] * 2;
-	count = 0;
 	while (next && --nr >= 0) {
 		struct buffer_head * bh = next;
 		next = bh->b_next_free;
 
 		if (dev && bh->b_dev != dev)
 			continue;
-		if (test_and_set_bit(BH_Lock, &bh->b_state))
-			continue;
-		get_bh(bh);
-		if (atomic_set_buffer_clean(bh)) {
-			__refile_buffer(bh);
-			array[count++] = bh;
-			if (count < NRSYNC)
-				continue;
+		page = bh->b_page ;
 
-			spin_unlock(&lru_list_lock);
-			write_locked_buffers(array, count);
-			goto repeat;
+		/* if we wrote the buffer the last time through, it
+		** might not have been refiled yet.  Without this check,
+		** we just end up working on the same buffer over and
+		** over again due to the goto after dirty_list_writepage
+		*/
+		if (buffer_locked(bh) || !buffer_dirty(bh)) {
+			__refile_buffer(bh) ;
+			continue ;
 		}
-		unlock_buffer(bh);
-		put_bh(bh);
+		if (TryLockPage(page)) {
+			continue ;
+		}
+		page_cache_get(page) ;
+		get_bh(bh) ;
+		spin_unlock(&lru_list_lock);
+		dirty_list_writepage(page, bh) ;
+
+		refile_buffer(bh) ;
+		put_bh(bh) ;
+		page_cache_release(page) ;
+		goto repeat;
+
 	}
 	spin_unlock(&lru_list_lock);
-
-	if (count)
-		write_locked_buffers(array, count);
 }
 
 static int wait_for_locked_buffers(kdev_t dev, int index, int refile)
@@ -274,6 +271,69 @@
 	return 0;
 }
 
+/* just for use with anon pages, or pages that don't provide their own
+** writepage func.  We just want to write bh, not the whole page, so we
+** queue that io here instead of calling writepage.
+*/
+static int __dirty_list_writepage(struct page *page, struct buffer_head *bh) {
+	int other_dirty = 0 ;
+	struct buffer_head *cur ;
+
+	/* check for other dirty buffers on this page.  If there are none,
+	** clear the page dirty bit
+	*/
+	cur = bh->b_this_page ;
+	while(cur != bh) {
+		other_dirty += buffer_dirty(cur) ;	
+		cur = cur->b_this_page ;
+	} 
+	if (other_dirty == 0) {
+		ClearPageDirty(page) ;
+	} 
+
+	/* we want the page available for locking again right away.  
+	** someone walking the dirty buffer list might find another
+	** buffer from this page, and we don't want them to skip it in
+	** favor of a younger buffer.
+	**
+	*/
+	get_bh(bh) ;
+	lock_buffer(bh) ;
+	clear_bit(BH_Dirty, &bh->b_state) ;
+	bh->b_end_io = end_buffer_io_sync ;
+	UnlockPage(page) ;
+	submit_bh(WRITE, bh) ;
+	return 0 ;
+}
+
+/*
+** util function for sync_buffers and flush_dirty_buffers
+** uses either the writepage func supplied in the page's mapping,
+** or the anon address space writepage
+*/
+static int dirty_list_writepage(struct page *page, struct buffer_head *bh) {
+	int (*writepage)(struct page *)  ;
+
+	/* someone wrote this page while we were locking before */
+	if (!PageDirty(page) && !buffer_dirty(bh)) {
+		UnlockPage(page) ;
+		return 0 ;
+	}
+	writepage = page->mapping->a_ops->writepage ;
+
+	/* For anon pages, and pages that don't have a writepage
+	** func, just write this one dirty buffer.  __dirty_list_writepage
+	** does a little more work to make sure the page dirty bit is cleared
+	** when we are the only dirty buffer on this page
+	*/
+	if (!writepage || page->mapping == &anon_space_mapping) {
+		return __dirty_list_writepage(page, bh) ;
+	}
+
+	ClearPageDirty(page) ;
+	return writepage(page) ;
+}
+
 /* Call sync_buffers with wait!=0 to ensure that the call does not
  * return until all buffer writes have completed.  Sync() may return
  * before the writes have finished; fsync() may not.
@@ -775,7 +835,7 @@
 	if (free_shortage())
 		page_launder(GFP_NOFS, 0);
 	if (!grow_buffers(size)) {
-		wakeup_bdflush(1);
+		wakeup_kswapd() ;
 		current->policy |= SCHED_YIELD;
 		__set_current_state(TASK_RUNNING);
 		schedule();
@@ -795,6 +855,7 @@
 	unsigned long flags;
 	struct buffer_head *tmp;
 	struct page *page;
+	int partial = 0 ;
 
 	mark_buffer_uptodate(bh, uptodate);
 
@@ -820,6 +881,8 @@
 	unlock_buffer(bh);
 	tmp = bh->b_this_page;
 	while (tmp != bh) {
+		if (!buffer_uptodate(tmp))
+			partial = 1 ;
 		if (tmp->b_end_io == end_buffer_io_async && buffer_locked(tmp))
 			goto still_busy;
 		tmp = tmp->b_this_page;
@@ -833,7 +896,7 @@
 	 * if none of the buffers had errors then we can set the
 	 * page uptodate:
 	 */
-	if (!PageError(page))
+	if (!PageError(page) && !partial)
 		SetPageUptodate(page);
 
 	/*
@@ -1086,7 +1149,7 @@
 
 	if (state < 0)
 		return;
-	wakeup_bdflush(state);
+	wakeup_bdflush(0) ;
 }
 
 static __inline__ void __mark_dirty(struct buffer_head *bh)
@@ -1120,8 +1183,10 @@
 	int dispose = BUF_CLEAN;
 	if (buffer_locked(bh))
 		dispose = BUF_LOCKED;
-	if (buffer_dirty(bh))
+	if (buffer_dirty(bh)) {
 		dispose = BUF_DIRTY;
+		SetPageDirty(bh->b_page) ;
+	}
 	if (buffer_protected(bh))
 		dispose = BUF_PROTECTED;
 	if (dispose != bh->b_list) {
@@ -1160,10 +1225,14 @@
  */
 void __bforget(struct buffer_head * buf)
 {
+	struct address_space *mapping ;
 	/* grab the lru lock here to block bdflush. */
 	spin_lock(&lru_list_lock);
 	write_lock(&hash_table_lock);
-	if (!atomic_dec_and_test(&buf->b_count) || buffer_locked(buf) || buffer_protected(buf))
+	mapping = buf->b_page->mapping ;
+	if (mapping != &anon_space_mapping || 
+	    !atomic_dec_and_test(&buf->b_count) || 
+	    buffer_locked(buf) || buffer_protected(buf))
 		goto in_use;
 	__hash_unlink(buf);
 	remove_inode_queue(buf);
@@ -1497,6 +1566,50 @@
  * "Dirty" is valid only with the last case (mapped+uptodate).
  */
 
+int block_write_anon_page(struct page *page) 
+{
+	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+	int i, nr = 0 ;
+	int partial = 0 ;
+	int ret = 0 ;
+
+	if (!PageLocked(page))
+		BUG();
+
+	if (!page->buffers)
+		BUG() ;
+
+	head = page->buffers;
+	bh = head;
+
+	/* Stage 1: find the dirty buffers, lock them for submit_bh */
+	do {
+		if (buffer_dirty(bh)) {
+			get_bh(bh) ;
+			lock_buffer(bh) ;
+			set_bit(BH_Uptodate, &bh->b_state) ;
+			bh->b_end_io = end_buffer_io_async;
+			clear_bit(BH_Dirty, &bh->b_state) ;
+			arr[nr++] = bh ;
+		} else {
+			partial = 1 ;
+		}
+		bh = bh->b_this_page;
+	} while (bh != head);
+
+	/* Stage 2: submit the IO */
+	for (i = 0 ; i < nr ; i++) {
+		submit_bh(WRITE, arr[i]) ;
+	}
+	/* Done - end_buffer_io_async will unlock */
+	if (!partial)
+		SetPageUptodate(page);
+	if (nr == 0) {
+		UnlockPage(page) ;
+	}
+	return ret ;
+}
+
 /*
  * block_write_full_page() is SMP-safe - currently it's still
  * being called with the kernel lock held, but the code is ready.
@@ -1505,16 +1618,23 @@
 {
 	int err, i;
 	unsigned long block;
+	unsigned long lblock ;
+	unsigned long blocksize = inode->i_sb->s_blocksize ;
 	struct buffer_head *bh, *head;
+	int page_ok = Page_Uptodate(page) ;
+	struct buffer_head *arr[MAX_BUF_PER_PAGE] ;
+	int nr = 0 ;
+	int partial = 0 ;
 
 	if (!PageLocked(page))
 		BUG();
 
 	if (!page->buffers)
-		create_empty_buffers(page, inode->i_dev, inode->i_sb->s_blocksize);
+		create_empty_buffers(page, inode->i_dev, blocksize);
 	head = page->buffers;
 
 	block = page->index << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
+	lblock = (inode->i_size+blocksize-1) >> inode->i_sb->s_blocksize_bits;
 
 	bh = head;
 	i = 0;
@@ -1529,36 +1649,42 @@
 		 * Leave it to the low-level FS to make all those
 		 * decisions (block #0 may actually be a valid block)
 		 */
-		if (!buffer_mapped(bh)) {
-			err = get_block(inode, block, bh, 1);
-			if (err)
-				goto out;
-			if (buffer_new(bh))
-				unmap_underlying_metadata(bh);
+		if ((buffer_uptodate(bh) || page_ok) && block < lblock) {
+			if (!buffer_mapped(bh)) {
+				err = get_block(inode, block, bh, 1);
+				if (err)
+					goto out;
+				if (buffer_new(bh))
+					unmap_underlying_metadata(bh);
+			}
+			arr[nr++] = bh ;
+		} else {
+			partial = 1 ;
 		}
 		bh = bh->b_this_page;
 		block++;
 	} while (bh != head);
 
 	/* Stage 2: lock the buffers, mark them clean */
-	do {
+	for(i = 0 ; i < nr ; i++) {
+		bh = arr[i] ;
 		lock_buffer(bh);
 		bh->b_end_io = end_buffer_io_async;
 		get_bh(bh);
 		set_bit(BH_Uptodate, &bh->b_state);
 		clear_bit(BH_Dirty, &bh->b_state);
-		bh = bh->b_this_page;
-	} while (bh != head);
+	} 
 
 	/* Stage 3: submit the IO */
-	do {
-		struct buffer_head *next = bh->b_this_page;
-		submit_bh(WRITE, bh);
-		bh = next;
-	} while (bh != head);
+	for(i = 0 ; i < nr ; i++) {
+		submit_bh(WRITE, arr[i]);
+	}
+	if (nr == 0)
+		UnlockPage(page) ;
 
 	/* Done - end_buffer_io_async will unlock */
-	SetPageUptodate(page);
+	if (!partial)
+		SetPageUptodate(page);
 	return 0;
 
 out:
@@ -1948,7 +2074,6 @@
 	struct inode *inode = page->mapping->host;
 	unsigned long end_index = inode->i_size >> PAGE_CACHE_SHIFT;
 	unsigned offset;
-	int err;
 
 	/* easy case */
 	if (page->index < end_index)
@@ -1963,18 +2088,35 @@
 	}
 
 	/* Sigh... will have to work, then... */
-	err = __block_prepare_write(inode, page, 0, offset, get_block);
-	if (!err) {
-		memset(page_address(page) + offset, 0, PAGE_CACHE_SIZE - offset);
-		flush_dcache_page(page);
-		__block_commit_write(inode,page,0,offset);
-done:
-		kunmap(page);
+	memset(kmap(page) + offset, 0, PAGE_CACHE_SIZE - offset);
+	flush_dcache_page(page);
+	kunmap(page);
+	return __block_write_full_page(inode, page, get_block) ;
+}
+
+int block_write_dir_page(struct page *page, get_block_t *get_block)
+{
+	struct inode *inode = page->mapping->host;
+	unsigned long end_index = inode->i_size >> PAGE_CACHE_SHIFT;
+	unsigned offset;
+
+	/* easy case */
+	if (page->index < end_index)
+		return __block_write_full_page(inode, page, get_block);
+
+	/* things got complicated... */
+	offset = inode->i_size & (PAGE_CACHE_SIZE-1);
+	/* OK, are we completely out? */
+	if (page->index >= end_index+1 || !offset) {
 		UnlockPage(page);
-		return err;
+		return -EIO;
 	}
-	ClearPageUptodate(page);
-	goto done;
+
+	/* directories in the page cache don't need to have the buffers
+	** past the eof zerod out.  In fact, ext2 sets up fake entries
+	** past the eof in their algs.
+	*/
+	return __block_write_full_page(inode, page, get_block) ;
 }
 
 int generic_block_bmap(struct address_space *mapping, long block, get_block_t *get_block)
@@ -2250,7 +2392,7 @@
 	struct buffer_head *bh, *tmp;
 	struct buffer_head * insert_point;
 	int isize;
-
+	unsigned long index ;
 	if ((size & 511) || (size > PAGE_SIZE)) {
 		printk(KERN_ERR "VFS: grow_buffers: size = %d\n",size);
 		return 0;
@@ -2266,6 +2408,16 @@
 
 	isize = BUFSIZE_INDEX(size);
 
+	/* don't put this buffer head on the free list until the
+	** page is setup.  Is there a better index to use?  Would 0
+	** be good enough?
+	*/
+	page->flags &= ~(1 << PG_referenced);
+	index = atomic_read(&buffermem_pages) ;
+	atomic_inc(&buffermem_pages);
+	add_to_page_cache_locked(page, &anon_space_mapping, index) ;
+	page->buffers = bh;
+
 	spin_lock(&free_list[isize].lock);
 	insert_point = free_list[isize].list;
 	tmp = bh;
@@ -2289,11 +2441,7 @@
 	free_list[isize].list = bh;
 	spin_unlock(&free_list[isize].lock);
 
-	page->buffers = bh;
-	page->flags &= ~(1 << PG_referenced);
-	lru_cache_add(page);
 	UnlockPage(page);
-	atomic_inc(&buffermem_pages);
 	return 1;
 
 no_buffer_head:
@@ -2315,10 +2463,9 @@
  *
  * Wait:
  *	0 - no wait (this does not get called - see try_to_free_buffers below)
- *	1 - start IO for dirty buffers
  *	2 - wait for completion of locked buffers
  */
-static void sync_page_buffers(struct buffer_head *bh, unsigned int gfp_mask)
+static void wait_page_buffers(struct buffer_head *bh, unsigned int gfp_mask)
 {
 	struct buffer_head * tmp = bh;
 
@@ -2326,10 +2473,8 @@
 		struct buffer_head *p = tmp;
 		tmp = tmp->b_this_page;
 		if (buffer_locked(p)) {
-			if (gfp_mask & __GFP_WAIT)
-				__wait_on_buffer(p);
-		} else if (buffer_dirty(p))
-			ll_rw_block(WRITE, 1, &p);
+			__wait_on_buffer(p);
+		} 
 	} while (tmp != bh);
 }
 
@@ -2390,6 +2535,9 @@
 
 	/* And free the page */
 	page->buffers = NULL;
+	if (page->mapping == &anon_space_mapping) {
+		atomic_dec(&buffermem_pages) ;
+	}
 	page_cache_release(page);
 	spin_unlock(&free_list[index].lock);
 	write_unlock(&hash_table_lock);
@@ -2401,14 +2549,13 @@
 	spin_unlock(&free_list[index].lock);
 	write_unlock(&hash_table_lock);
 	spin_unlock(&lru_list_lock);
-	if (gfp_mask & __GFP_IO) {
-		sync_page_buffers(bh, gfp_mask);
+	if (gfp_mask & __GFP_WAIT) {
+		wait_page_buffers(bh, gfp_mask);
 		/* We waited synchronously, so we can free the buffers. */
-		if ((gfp_mask & __GFP_WAIT) && !loop) {
+		if (!loop) {
 			loop = 1;
 			goto cleaned_buffers_try_again;
 		}
-		wakeup_bdflush(0);
 	}
 	return 0;
 }
@@ -2541,6 +2688,7 @@
 static int flush_dirty_buffers(int check_flushtime)
 {
 	struct buffer_head * bh, *next;
+	struct page *page ;
 	int flushed = 0, i;
 
  restart:
@@ -2557,6 +2705,8 @@
 		}
 		if (buffer_locked(bh))
 			continue;
+		if (!buffer_uptodate(bh))
+			continue ;
 
 		if (check_flushtime) {
 			/* The dirty lru list is chronologically ordered so
@@ -2570,11 +2720,19 @@
 				goto out_unlock;
 		}
 
+		page = bh->b_page ;
+		if (TryLockPage(page)) {
+		    flushed-- ;
+		    continue ;
+		}
+
 		/* OK, now we are committed to write it out. */
+		page_cache_get(page) ;
 		get_bh(bh);
 		spin_unlock(&lru_list_lock);
-		ll_rw_block(WRITE, 1, &bh);
+		dirty_list_writepage(page, bh) ;
 		put_bh(bh);
+		page_cache_release(page) ;
 
 		if (current->need_resched)
 			schedule();
diff -Nru a/fs/ext2/inode.c b/fs/ext2/inode.c
--- a/fs/ext2/inode.c	Sat Jul 28 11:33:18 2001
+++ b/fs/ext2/inode.c	Sat Jul 28 11:33:18 2001
@@ -574,6 +574,10 @@
 {
 	return block_write_full_page(page,ext2_get_block);
 }
+static int ext2_dir_writepage(struct page *page)
+{
+	return block_write_dir_page(page,ext2_get_block);
+}
 static int ext2_readpage(struct file *file, struct page *page)
 {
 	return block_read_full_page(page,ext2_get_block);
@@ -595,6 +599,15 @@
 	bmap: ext2_bmap
 };
 
+struct address_space_operations ext2_dir_aops = {
+	readpage: ext2_readpage,
+	writepage: ext2_dir_writepage,
+	sync_page: block_sync_page,
+	prepare_write: ext2_prepare_write,
+	commit_write: generic_commit_write,
+	bmap: ext2_bmap
+};
+
 /*
  * Probably it should be a library function... search for first non-zero word
  * or memcmp with zero_page, whatever is better for particular architecture.
@@ -985,7 +998,7 @@
 	} else if (S_ISDIR(inode->i_mode)) {
 		inode->i_op = &ext2_dir_inode_operations;
 		inode->i_fop = &ext2_dir_operations;
-		inode->i_mapping->a_ops = &ext2_aops;
+		inode->i_mapping->a_ops = &ext2_dir_aops;
 	} else if (S_ISLNK(inode->i_mode)) {
 		if (!inode->i_blocks)
 			inode->i_op = &ext2_fast_symlink_inode_operations;
diff -Nru a/fs/ext2/namei.c b/fs/ext2/namei.c
--- a/fs/ext2/namei.c	Sat Jul 28 11:33:18 2001
+++ b/fs/ext2/namei.c	Sat Jul 28 11:33:18 2001
@@ -194,7 +194,7 @@
 
 	inode->i_op = &ext2_dir_inode_operations;
 	inode->i_fop = &ext2_dir_operations;
-	inode->i_mapping->a_ops = &ext2_aops;
+	inode->i_mapping->a_ops = &ext2_dir_aops;
 
 	ext2_inc_count(inode);
 
diff -Nru a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
--- a/fs/reiserfs/inode.c	Sat Jul 28 11:33:18 2001
+++ b/fs/reiserfs/inode.c	Sat Jul 28 11:33:18 2001
@@ -1813,6 +1813,7 @@
     int partial = 0 ;
     struct buffer_head *arr[PAGE_CACHE_SIZE/512] ;
     int nr = 0 ;
+    int page_ok = Page_Uptodate(page) ;
 
     if (!page->buffers) {
         block_prepare_write(page, 0, 0, NULL) ;
@@ -1837,7 +1838,7 @@
     block = page->index << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits) ;
     do {
 	/* if this offset in the page is outside the file */
-	if (cur_offset >= last_offset) {
+	if (!(buffer_dirty(bh) || page_ok) || cur_offset >= last_offset) {
 	    if (!buffer_uptodate(bh))
 	        partial = 1 ;
 	} else {
diff -Nru a/include/linux/ext2_fs.h b/include/linux/ext2_fs.h
--- a/include/linux/ext2_fs.h	Sat Jul 28 11:33:18 2001
+++ b/include/linux/ext2_fs.h	Sat Jul 28 11:33:18 2001
@@ -640,6 +640,7 @@
 extern struct inode_operations ext2_fast_symlink_inode_operations;
 
 extern struct address_space_operations ext2_aops;
+extern struct address_space_operations ext2_dir_aops;
 
 #endif	/* __KERNEL__ */
 
diff -Nru a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h	Sat Jul 28 11:33:18 2001
+++ b/include/linux/fs.h	Sat Jul 28 11:33:18 2001
@@ -1326,6 +1326,8 @@
 extern int block_flushpage(struct page *, unsigned long);
 extern int block_symlink(struct inode *, const char *, int);
 extern int block_write_full_page(struct page*, get_block_t*);
+extern int block_write_dir_page(struct page*, get_block_t*);
+extern int block_write_anon_page(struct page*) ;
 extern int block_read_full_page(struct page*, get_block_t*);
 extern int block_prepare_write(struct page*, unsigned, unsigned, get_block_t*);
 extern int cont_prepare_write(struct page*, unsigned, unsigned, get_block_t*,
diff -Nru a/kernel/ksyms.c b/kernel/ksyms.c
--- a/kernel/ksyms.c	Sat Jul 28 11:33:17 2001
+++ b/kernel/ksyms.c	Sat Jul 28 11:33:18 2001
@@ -197,6 +197,7 @@
 EXPORT_SYMBOL(__wait_on_buffer);
 EXPORT_SYMBOL(___wait_on_page);
 EXPORT_SYMBOL(block_write_full_page);
+EXPORT_SYMBOL(block_write_dir_page);
 EXPORT_SYMBOL(block_read_full_page);
 EXPORT_SYMBOL(block_prepare_write);
 EXPORT_SYMBOL(block_sync_page);
diff -Nru a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c	Sat Jul 28 11:33:18 2001
+++ b/mm/vmscan.c	Sat Jul 28 11:33:18 2001
@@ -548,7 +548,6 @@
 
 			/* The page was only in the buffer cache. */
 			} else if (!page->mapping) {
-				atomic_dec(&buffermem_pages);
 				freed_page = 1;
 				cleaned_pages++;
 
@@ -571,10 +570,9 @@
 			page_cache_release(page);
 
 			/* 
-			 * If we're freeing buffer cache pages, stop when
-			 * we've got enough free memory.
+			 * stop when we've got enough free memory.
 			 */
-			if (freed_page && !free_shortage())
+			if (!free_shortage())
 				break;
 			continue;
 		} else if (page->mapping && !PageDirty(page)) {
@@ -609,9 +607,6 @@
 	 * free anything yet, we wait synchronously on the writeout of
 	 * MAX_SYNC_LAUNDER pages.
 	 *
-	 * We also wake up bdflush, since bdflush should, under most
-	 * loads, flush out the dirty pages before we have to wait on
-	 * IO.
 	 */
 	if (CAN_DO_IO && !launder_loop && free_shortage()) {
 		launder_loop = 1;
@@ -620,8 +615,6 @@
 			sync = 0;
 		/* We only do a few "out of order" flushes. */
 		maxlaunder = MAX_LAUNDER;
-		/* Kflushd takes care of the rest. */
-		wakeup_bdflush(0);
 		goto dirty_page_rescan;
 	}
 



Thread overview: 30+ messages
2001-08-05 18:34 [RFC] using writepage to start io Chris Mason
2001-08-05 22:38 ` Daniel Phillips
2001-08-05 23:32   ` Chris Mason
2001-08-06  5:39     ` Daniel Phillips
2001-08-06 13:24       ` Chris Mason
2001-08-06 16:13         ` Daniel Phillips
2001-08-06 16:51           ` Chris Mason
2001-08-06 19:45             ` Daniel Phillips
2001-08-06 20:12               ` Chris Mason
2001-08-06 21:18                 ` Daniel Phillips
2001-08-07 11:02                   ` Stephen C. Tweedie
2001-08-07 11:39                     ` Ed Tomlinson
2001-08-07 18:36                       ` Daniel Phillips
2001-08-07 12:07                   ` Anton Altaparmakov
2001-08-07 12:02                 ` Anton Altaparmakov
2001-08-07 13:29                   ` Daniel Phillips
2001-08-07 13:31                     ` Alexander Viro
2001-08-07 15:52                       ` Daniel Phillips
2001-08-07 14:23                     ` Stephen C. Tweedie
2001-08-07 15:51                       ` Daniel Phillips
2001-08-08 14:49                         ` Stephen C. Tweedie
2001-08-06 15:13 ` Eric W. Biederman
  -- strict thread matches above, loose matches on Subject: below --
2001-08-07 15:19 Chris Mason
2001-07-28 16:01 Chris Mason
2001-07-28 16:18 ` Alexander Viro
2001-07-28 16:25   ` Chris Mason
2001-07-31 19:07 ` Chris Mason
2001-08-01  1:01   ` Daniel Phillips
2001-08-01  2:05     ` Chris Mason
2001-08-01 14:57   ` Daniel Phillips
