From: Daniel Phillips <phillips@bonn-fries.net>
To: Mike Galbraith <mikeg@wen-online.de>
Cc: Rik van Riel <riel@conectiva.com.br>,
Pavel Machek <pavel@suse.cz>, John Stoffel <stoffel@casc.com>,
Roger Larsson <roger.larsson@norran.net>, <thunder7@xs4all.nl>,
Linux-Kernel <linux-kernel@vger.kernel.org>
Subject: [RFC] Early flush (was: spindown)
Date: Wed, 20 Jun 2001 03:50:33 +0200 [thread overview]
Message-ID: <01062003503300.00439@starship> (raw)
In-Reply-To: <Pine.LNX.4.33.0106171156410.318-100000@mikeg.weiden.de> <01061816220503.11745@starship>
In-Reply-To: <01061816220503.11745@starship>
I never realized how much I didn't like the good old 5 second delay between
saving an edit and actually getting it written to disk until it went away.
Now the question is, did I lose any performance in doing that. What I wrote
in the previous email turned out to be pretty accurate, so I'll just quote it
to keep it together with the patch:
> I'm now in the midst of hatching a patch. [1] The first thing I had to do
> is go explore the block driver code, yum yum. I found that it already
> computes the statistic I'm interested in, namely queued_sectors, which is
> used to pace the IO on block devices. It's a little crude - we really want
> this to be per-queue and have one queue per "spindle" - but even in its
> current form it's workable.
>
> The idea is that when queued_sectors drops below some threshold we have
> 'unused disk bandwidth' so it would be nice to do something useful with it:
>
> 1) Do an early 'sync_old_buffers'
> 2) Do some preemptive pageout
>
> The benefit of (1) is that it lets disks go idle a few seconds earlier, and
> (2) should improve the system's latency in response to load surges. There
> are drawbacks too, which have been pointed out to me privately, but they
> tend to be pretty minor, for example: on a flash disk you'd do a few extra
> writes and wear it out ever-so-slightly sooner. All the same, such special
> devices can be dealt easily once we progress a little further in improving
> the kernel's 'per spindle' intelligence.
>
> Now how to implement this. I considered putting a (newly minted)
> wakeup_kflush in blk_finished_io, conditional on a loaded-to-unloaded
> transition, and that's fine except it doesn't do the whole job: we also
> need to have the early flush for any write to a disk file while the disks
> are lightly loaded, i.e., there is no convenient loaded-to-unloaded
> transition to trigger it. The missing trigger could be inserted into
> __mark_dirty, but that would penalize the loaded state (a little, but
> that's still too much). Furthermore, it's probably desirable to maintain a
> small delay between the dirty and the flush. So what I'll try first is
> just running kflush's timer faster, and make its reschedule period vary
> with disk load, i.e., when there are fewer queued_sectors, kflush looks at
> the dirty buffer list more often.
>
> The rest of what has to happen in kflush is pretty straightforward. It
> just uses queued_sectors to determine how far to walk the dirty buffer
> list, which is maintained in time-since-dirtied order. If queued_sectors
> is below some threshold the entire list is flushed. Note that we want to
> change the sense of b_flushtime to b_timedirtied. It's more efficient to
> do it this way anyway.
>
> I haven't done anything about preemptive pageout yet, but similar ideas
> apply.
>
> [1] This is an experiment, do not worry, it will not show up in your tree
> any time soon. IOW, constructive criticism appreciated, flames copied to
> /dev/null.
I originally intended to implement a sliding flush delay based on disk load.
This turned out to be a lot of work for a hard-to-discern benefit. So the
current approach has just two delays: .1 second and whatever the bdflush
delay is set to. If there is any non-flush disk traffic the longer delay is
used. This is crude but effective... I think. I hope that somebody will run
this through some benchmarks to see if I lost any performance. According to
my calculations, I did not. I tested this mainly in UML, and also ran it
briefly on my laptop. The interactive feel of the change is immediately
obvious, and for me at least, a big improvement.
The patch is against 2.4.5. To apply:
cd /your/source/tree
patch <this/patch -p0
--- ../uml.2.4.5.clean/fs/buffer.c Sat May 26 02:57:46 2001
+++ ./fs/buffer.c Wed Jun 20 01:55:21 2001
@@ -1076,7 +1076,7 @@
static __inline__ void __mark_dirty(struct buffer_head *bh)
{
- bh->b_flushtime = jiffies + bdf_prm.b_un.age_buffer;
+ bh->b_dirtytime = jiffies;
refile_buffer(bh);
}
@@ -2524,12 +2524,20 @@
as all dirty buffers lives _only_ in the DIRTY lru list.
As we never browse the LOCKED and CLEAN lru lists they are infact
completly useless. */
-static int flush_dirty_buffers(int check_flushtime)
+static int flush_dirty_buffers (int update)
{
struct buffer_head * bh, *next;
int flushed = 0, i;
+ unsigned queued = atomic_read (&queued_sectors);
+ unsigned long youngest_to_update;
- restart:
+#ifdef DEBUG
+ if (update)
+ printk("kupdate %lu %i\n", jiffies, queued);
+#endif
+
+restart:
+ youngest_to_update = jiffies - (queued? bdf_prm.b_un.age_buffer: 0);
spin_lock(&lru_list_lock);
bh = lru_list[BUF_DIRTY];
if (!bh)
@@ -2544,19 +2552,14 @@
if (buffer_locked(bh))
continue;
- if (check_flushtime) {
- /* The dirty lru list is chronologically ordered so
- if the current bh is not yet timed out,
- then also all the following bhs
- will be too young. */
- if (time_before(jiffies, bh->b_flushtime))
+ if (update) {
+ if (time_before (youngest_to_update, bh->b_dirtytime))
goto out_unlock;
} else {
if (++flushed > bdf_prm.b_un.ndirty)
goto out_unlock;
}
- /* OK, now we are committed to write it out. */
atomic_inc(&bh->b_count);
spin_unlock(&lru_list_lock);
ll_rw_block(WRITE, 1, &bh);
@@ -2717,7 +2720,7 @@
int kupdate(void *sem)
{
struct task_struct * tsk = current;
- int interval;
+ int update_when = 0;
tsk->session = 1;
tsk->pgrp = 1;
@@ -2733,11 +2736,11 @@
up((struct semaphore *)sem);
for (;;) {
- /* update interval */
- interval = bdf_prm.b_un.interval;
- if (interval) {
+ unsigned check_interval = HZ/10, update_interval = bdf_prm.b_un.interval;
+
+ if (update_interval) {
tsk->state = TASK_INTERRUPTIBLE;
- schedule_timeout(interval);
+ schedule_timeout(check_interval);
} else {
stop_kupdate:
tsk->state = TASK_STOPPED;
@@ -2756,10 +2759,15 @@
if (stopped)
goto stop_kupdate;
}
+ update_when -= check_interval;
+ if (update_when > 0 && atomic_read(&queued_sectors))
+ continue;
+
#ifdef DEBUG
printk("kupdate() activated...\n");
#endif
sync_old_buffers();
+ update_when = update_interval;
}
}
--- ../uml.2.4.5.clean/include/linux/fs.h Sat May 26 03:01:28 2001
+++ ./include/linux/fs.h Tue Jun 19 15:12:18 2001
@@ -236,7 +236,7 @@
atomic_t b_count; /* users using this block */
kdev_t b_rdev; /* Real device */
unsigned long b_state; /* buffer state bitmap (see above) */
- unsigned long b_flushtime; /* Time when (dirty) buffer should be written */
+ unsigned long b_dirtytime; /* Time buffer became dirty */
struct buffer_head *b_next_free;/* lru/free list linkage */
struct buffer_head *b_prev_free;/* doubly linked list of buffers */
--- ../uml.2.4.5.clean/mm/filemap.c Thu May 31 15:29:06 2001
+++ ./mm/filemap.c Tue Jun 19 15:32:47 2001
@@ -349,7 +349,7 @@
if (buffer_locked(bh) || !buffer_dirty(bh) || !buffer_uptodate(bh))
continue;
- bh->b_flushtime = jiffies;
+ bh->b_dirtytime = jiffies /*- bdf_prm.b_un.age_buffer*/; // needed??
ll_rw_block(WRITE, 1, &bh);
} while ((bh = bh->b_this_page) != head);
return 0;
--- ../uml.2.4.5.clean/mm/highmem.c Sat May 26 02:57:46 2001
+++ ./mm/highmem.c Tue Jun 19 15:33:22 2001
@@ -400,7 +400,7 @@
bh->b_rdev = bh_orig->b_rdev;
bh->b_state = bh_orig->b_state;
#ifdef HIGHMEM_DEBUG
- bh->b_flushtime = jiffies;
+ bh->b_dirtytime = jiffies /*- bdf_prm.b_un.age_buffer*/; // needed??
bh->b_next_free = NULL;
bh->b_prev_free = NULL;
/* bh->b_this_page */
next prev parent reply other threads:[~2001-06-20 1:48 UTC|newest]
Thread overview: 52+ messages / expand[flat|nested] mbox.gz Atom feed top
2001-06-13 19:31 2.4.6-pre2, pre3 VM Behavior Tom Sightler
2001-06-13 20:21 ` Rik van Riel
2001-06-14 1:49 ` Tom Sightler
2001-06-14 3:16 ` Rik van Riel
2001-06-14 7:59 ` Laramie Leavitt
2001-06-14 9:24 ` Helge Hafting
2001-06-14 17:38 ` Mark Hahn
2001-06-15 8:27 ` Helge Hafting
2001-06-14 8:47 ` Daniel Phillips
2001-06-14 20:23 ` Roger Larsson
2001-06-15 6:04 ` Mike Galbraith
2001-06-14 20:39 ` John Stoffel
2001-06-14 20:51 ` Rik van Riel
2001-06-14 21:33 ` John Stoffel
2001-06-14 22:23 ` Rik van Riel
2001-06-15 15:23 ` spindown [was Re: 2.4.6-pre2, pre3 VM Behavior] Pavel Machek
2001-06-16 20:50 ` Daniel Phillips
2001-06-16 21:06 ` Rik van Riel
2001-06-16 21:25 ` Rik van Riel
2001-06-16 21:44 ` Daniel Phillips
2001-06-16 21:54 ` Rik van Riel
2001-06-17 10:28 ` Daniel Phillips
2001-06-17 10:05 ` Mike Galbraith
2001-06-17 12:49 ` (lkml)Re: " thunder7
2001-06-17 16:40 ` Mike Galbraith
2001-06-18 14:22 ` Daniel Phillips
2001-06-19 4:35 ` Mike Galbraith
2001-06-20 1:50 ` Daniel Phillips [this message]
2001-06-20 20:58 ` [RFC] Early flush (was: spindown) Tom Sightler
2001-06-20 22:09 ` Daniel Phillips
2001-06-24 3:20 ` Anuradha Ratnaweera
2001-06-24 11:14 ` Daniel Phillips
2001-06-24 15:06 ` Rik van Riel
2001-06-24 16:21 ` Daniel Phillips
2001-06-20 4:39 ` Richard Gooch
2001-06-20 14:29 ` Daniel Phillips
2001-06-20 16:12 ` Richard Gooch
2001-06-22 23:25 ` Daniel Kobras
2001-06-23 5:10 ` Daniel Phillips
2001-06-25 11:33 ` Pavel Machek
2001-06-25 11:31 ` Pavel Machek
2001-06-18 20:21 ` spindown Simon Huggins
2001-06-19 10:46 ` spindown Pavel Machek
2001-06-20 16:52 ` spindown Daniel Phillips
2001-06-20 17:32 ` spindown Rik van Riel
2001-06-20 18:00 ` spindown Daniel Phillips
2001-06-21 16:07 ` spindown Jamie Lokier
2001-06-22 22:09 ` spindown Daniel Kobras
2001-06-28 0:27 ` spindown Troy Benjegerdes
2001-06-14 15:10 ` 2.4.6-pre2, pre3 VM Behavior John Stoffel
2001-06-14 18:25 ` Daniel Phillips
2001-06-14 8:30 ` Mike Galbraith
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=01062003503300.00439@starship \
--to=phillips@bonn-fries.net \
--cc=linux-kernel@vger.kernel.org \
--cc=mikeg@wen-online.de \
--cc=pavel@suse.cz \
--cc=riel@conectiva.com.br \
--cc=roger.larsson@norran.net \
--cc=stoffel@casc.com \
--cc=thunder7@xs4all.nl \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).