* [RFC] kernel facilities for cache prefetching
@ 2006-05-02  7:50 ` Wu Fengguang
  2006-05-02 12:46   ` Diego Calleja
                     ` (3 more replies)
  0 siblings, 4 replies; 48+ messages in thread
From: Wu Fengguang @ 2006-05-02  7:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty

Pre-caching reloaded ;)

I have been exploring cache prefetching on desktop systems for quite
some time, and recently found ways that have _proven_ effects.

The basic ideas are described in the following Google SoC proposal,
with a proof-of-concept patch for I/O priority support attached.

I'd like to know whether the two proposed kernel components would be
acceptable for the mainline kernel, and would welcome any recommendations
for improving them.

Thanks,
Wu

------------------------------------------------------------------------
SOC PROPOSAL

	Rapid linux desktop startup through pre-caching


MOTIVATION

	KDE, Gnome, OpenOffice, and Firefox all take too long to start up.
	Boot-time pre-caching seems to be the single most straightforward and
	effective way to improve that and make the linux desktop experience
	more comfortable. It is a great pleasure for me to take up the work.


QUALIFICATION

	I have been working on linux kernel readahead for over a year, and
	developed an adaptive readahead patch (http://lwn.net/Articles/155510/)
	that brings a bunch of new capabilities to linux.  During that time I
	gained knowledge about the VM/IO subsystems of the linux kernel, became
	acquainted with the slow startup problem and the existing solutions,
	and learned why they do not work as one would expect and how to get
	the job done better.


PREVIOUS WORKS

	There have been some similar efforts, e.g.
		- Linux: Boot Time Speedups Through Precaching
		  http://kerneltrap.org/node/2157
		- Andrew Morton's kernel module solution
		  http://www.zip.com.au/~akpm/linux/fboot.tar.gz
		- preload - adaptive readahead daemon
		  http://sourceforge.net/projects/preload
	The former two are mainly kernel-side work, while the latter is a pure
	user-land solution. But none seems to do the trick for system startup
	time.  Andrew saw ~10% speedup, while preload actually saw a slowdown.

	IMHO, Andrew's kernel module to dump the contents of the pagecache and
	restore it at boot time is one step towards the final solution.
	However, it takes one more step to actually win the time: optimizing
	the I/O at restore time.


I/O ANALYSIS

	How come the prefetcher gains little or even negative time?

	After some hunting in the source tree and some experiments,
	I came to the following conclusions:

	1. the prefetcher has to readahead one file _after_ another, thus loses
	   the opportunity to reorder the blocks and reduce seeking time.
	   == reasoning ==
	   It tends to be blocked on calling open(), where the kernel has to do
	   some dir/inode/bmap lookups and do tiny I/Os _synchronously_ and with
	   _locks_ held.

	2. the readahead I/O stands in the way of normal I/O, so the prefetcher
	   blocks normal apps and loses the opportunity to overlap CPU/IO time.
	   == reasoning ==
	   Support for I/O priority is still incomplete in linux. Of the three
	   I/O elevators, anticipatory & deadline simply overlook priority;
	   CFQ is built for fair I/O priorities, but is still not enough for
	   the case of readahead.  Imagine that the prefetcher first issues an
	   I/O request for page A with low priority, then comes the real app
	   that needs page A and simply waits on the page to become available,
	   which can take a rather long time because the elevator still treats
	   the request as a low-priority one.

	So we get the amazing fact that prefetching actually slows things down!


SCHEME/GOAL

	1) kernel facility to provide necessary I/O priority support
		- add basic I/O priority support to AS/deadline elevators:
		  never have readahead I/O stand in the way of normal I/O
		- enable AS/CFQ/deadline to react to I/O priority changes:
		  reschedule a previously issued readahead I/O that is now
		  actually demanded by a legacy program

	2) kernel module to query the file cache
		- on loading: create /proc/filecache
		- setting up: echo "param value" > /proc/filecache
		- info query: cat /proc/filecache
		- sample sessions:

		# modprobe filecache
		$ cat /proc/filecache
		# file ALL
		# mask 0
		#
		# ino	size	cached	status	refcnt	name
		0	2929846	3181	D	1	/dev/hda1
		81455	9	9	_	1	/sbin/init
		......
	
		$ echo "file /dev/hda1" > /proc/filecache
		$ cat /proc/filecache
		# file /dev/hda1
		# mask 0
		#
		# idx	len
		0	24
		48	2
		53	5
		......

	3) user-land tools to dump the current contents of the file cache,
	   and to restore them at boot time
		- store as plain text files to be script/user friendly
		- be able to run at the very beginning of the boot process
		- be able to trim down the cache records (for small systems)
		- optional poor man's defrag ;)
		- record and replay I/O for any task by taking two cache
		  snapshots and doing a set-subtract

	A proof-of-concept implementation has been developed and run on my
	desktop. According to the experimental results, the expected effect
	of the final work would be:
		- the power-on to login time remains roughly the same
		- most GUI files are ready at login time
		- the login to usable-desktop time comes close to a warm startup

BIO

	I am currently pursuing a PhD degree at the University of Science and
	Technology of China. I enjoy computing and GNU/Linux.

	- programming since 1993
	- using linux since 1998
	- playing with PXE since 2000
	- developing OSS since 2004
	- developing adaptive readahead for linux since 2005

------------------------------------------------------------------------
 block/deadline-iosched.c |   35 +++++++++++++++++++++++++++++++++++
 block/ll_rw_blk.c        |    8 ++++++++
 fs/buffer.c              |    5 +++--
 include/linux/elevator.h |    2 ++
 4 files changed, 48 insertions(+), 2 deletions(-)

--- linux.orig/block/deadline-iosched.c
+++ linux/block/deadline-iosched.c
@@ -310,6 +310,40 @@ deadline_add_request(struct request_queu
 }
 
 /*
+ * kick a page for io
+ */
+static int
+deadline_kick_page(struct request_queue *q, struct page *page)
+{
+	struct deadline_data *dd = q->elevator->elevator_data;
+	struct deadline_rq *drq;
+	struct request *rq;
+	struct list_head *pos;
+	struct bio_vec *bvec;
+	struct bio *bio;
+	int i;
+
+	list_for_each(pos, &dd->fifo_list[READ]) {
+		drq = list_entry_fifo(pos);
+		rq = drq->request;
+		rq_for_each_bio(bio, rq) {
+			bio_for_each_segment(bvec, bio, i) {
+				if (page == bvec->bv_page)
+					goto found;
+			}
+		}
+	}
+
+	return -1;
+
+found:
+	rq->ioprio = 1;
+	list_del(&drq->fifo);
+	deadline_add_drq_fifo(dd, rq);
+	return 0;
+}
+
+/*
  * remove rq from rbtree, fifo, and hash
  */
 static void deadline_remove_request(request_queue_t *q, struct request *rq)
@@ -784,6 +818,7 @@ static struct elevator_type iosched_dead
 		.elevator_merge_req_fn =	deadline_merged_requests,
 		.elevator_dispatch_fn =		deadline_dispatch_requests,
 		.elevator_add_req_fn =		deadline_add_request,
+		.elevator_kick_page_fn =	deadline_kick_page,
 		.elevator_queue_empty_fn =	deadline_queue_empty,
 		.elevator_former_req_fn =	deadline_former_request,
 		.elevator_latter_req_fn =	deadline_latter_request,
--- linux.orig/block/ll_rw_blk.c
+++ linux/block/ll_rw_blk.c
@@ -1620,6 +1620,14 @@ static void blk_backing_dev_unplug(struc
 {
 	request_queue_t *q = bdi->unplug_io_data;
 
+	if (page &&
+		IOPRIO_PRIO_CLASS(current->ioprio) != IOPRIO_CLASS_IDLE &&
+		q->elevator && q->elevator->ops->elevator_kick_page_fn) {
+		spin_lock_irq(q->queue_lock);
+		q->elevator->ops->elevator_kick_page_fn(q, page);
+		spin_unlock_irq(q->queue_lock);
+	}
+
 	/*
 	 * devices don't necessarily have an ->unplug_fn defined
 	 */
--- linux.orig/fs/buffer.c
+++ linux/fs/buffer.c
@@ -63,8 +63,9 @@ static int sync_buffer(void *word)
 
 	smp_mb();
 	bd = bh->b_bdev;
-	if (bd)
-		blk_run_address_space(bd->bd_inode->i_mapping);
+	if (bd && bd->bd_inode && bd->bd_inode->i_mapping)
+		blk_run_backing_dev(bd->bd_inode->i_mapping->backing_dev_info,
+					bh->b_page);
 	io_schedule();
 	return 0;
 }
--- linux.orig/include/linux/elevator.h
+++ linux/include/linux/elevator.h
@@ -20,6 +20,7 @@ typedef int (elevator_set_req_fn) (reque
 typedef void (elevator_put_req_fn) (request_queue_t *, struct request *);
 typedef void (elevator_activate_req_fn) (request_queue_t *, struct request *);
 typedef void (elevator_deactivate_req_fn) (request_queue_t *, struct request *);
+typedef int (elevator_kick_page_fn) (request_queue_t *, struct page *);
 
 typedef int (elevator_init_fn) (request_queue_t *, elevator_t *);
 typedef void (elevator_exit_fn) (elevator_t *);
@@ -34,6 +35,7 @@ struct elevator_ops
 	elevator_add_req_fn *elevator_add_req_fn;
 	elevator_activate_req_fn *elevator_activate_req_fn;
 	elevator_deactivate_req_fn *elevator_deactivate_req_fn;
+	elevator_kick_page_fn *elevator_kick_page_fn;
 
 	elevator_queue_empty_fn *elevator_queue_empty_fn;
 	elevator_completed_req_fn *elevator_completed_req_fn;


* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-02  7:58 ` Arjan van de Ven
  2006-05-02  8:06     ` Wu Fengguang
  0 siblings, 1 reply; 48+ messages in thread
From: Arjan van de Ven @ 2006-05-02  7:58 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Jens Axboe,
	Nick Piggin, Badari Pulavarty


> 
> PREVIOUS WORKS
> 
> 	There has been some similar efforts, i.e.
> 		- Linux: Boot Time Speedups Through Precaching
> 		  http://kerneltrap.org/node/2157
> 		- Andrew Morton's kernel module solution
> 		  http://www.zip.com.au/~akpm/linux/fboot.tar.gz
> 		- preload - adaptive readahead daemon
> 		  http://sourceforge.net/projects/preload

you missed the solution Fedora has been deploying for over a year, using readahead





* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-02  8:06     ` Wu Fengguang
  0 siblings, 0 replies; 48+ messages in thread
From: Wu Fengguang @ 2006-05-02  8:06 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Jens Axboe,
	Nick Piggin, Badari Pulavarty

On Tue, May 02, 2006 at 09:58:44AM +0200, Arjan van de Ven wrote:
> 
> > 
> > PREVIOUS WORKS
> > 
> > 	There has been some similar efforts, i.e.
> > 		- Linux: Boot Time Speedups Through Precaching
> > 		  http://kerneltrap.org/node/2157
> > 		- Andrew Morton's kernel module solution
> > 		  http://www.zip.com.au/~akpm/linux/fboot.tar.gz
> > 		- preload - adaptive readahead daemon
> > 		  http://sourceforge.net/projects/preload
> 
> you missed the solution Fedora deploys since over a year using readahead

Thanks, and sorry about the additional previous works that I failed to mention here :)

Wu


* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-02  8:09 ` Jens Axboe
  2006-05-02  8:20     ` Wu Fengguang
  2006-05-03 22:05   ` Benjamin LaHaise
  0 siblings, 2 replies; 48+ messages in thread
From: Jens Axboe @ 2006-05-02  8:09 UTC (permalink / raw)
  To: Wu Fengguang, linux-kernel, Linus Torvalds, Andrew Morton,
	Nick Piggin, Badari Pulavarty

On Tue, May 02 2006, Wu Fengguang wrote:
> I'd like to know whether the two proposed kernel components would be
> acceptable for the mainline kernel, and any recommends to improve them.

I tried something very similar to this years ago, except I made it
explicit instead of hiding it in the blk_run_backing_dev() which we
didn't have at that time. My initial results showed that you would get a
load of requests for different pages so would end up doing io randomly
instead again.

Your patch isn't exactly pretty, but that is fixable. I'm more
interested in how you'd control that sanely.

-- 
Jens Axboe



* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-02  8:20     ` Wu Fengguang
  0 siblings, 0 replies; 48+ messages in thread
From: Wu Fengguang @ 2006-05-02  8:20 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
	Badari Pulavarty

On Tue, May 02, 2006 at 10:09:36AM +0200, Jens Axboe wrote:
> I tried something very similar to this years ago, except I made it
> explicit instead of hiding it in the blk_run_backing_dev() which we
> didn't have at that time. My initial results showed that you would get a
> load of requests for different pages so would end up doing io randomly
> instead again.

Yes, the hard part would be _not_ impacting normal I/Os in any way.
A simple solution for the case of the deadline scheduler would be:

- static const int read_expire = HZ / 2;  /* max time before a read is submitted. */
+ static const int read_expire = HZ / 2;  /* max time before an impending read is submitted. */
+ static const int reada_expire = HZ * 30;  /* max time before a read-ahead is submitted. */

Wu


* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-02  8:30     ` Arjan van de Ven
  2006-05-02  8:53         ` Wu Fengguang
  2006-05-02 22:03       ` Dave Jones
  0 siblings, 2 replies; 48+ messages in thread
From: Arjan van de Ven @ 2006-05-02  8:30 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Jens Axboe,
	Nick Piggin, Badari Pulavarty

On Tue, 2006-05-02 at 16:06 +0800, Wu Fengguang wrote:
> On Tue, May 02, 2006 at 09:58:44AM +0200, Arjan van de Ven wrote:
> > 
> > > 
> > > PREVIOUS WORKS
> > > 
> > > 	There has been some similar efforts, i.e.
> > > 		- Linux: Boot Time Speedups Through Precaching
> > > 		  http://kerneltrap.org/node/2157
> > > 		- Andrew Morton's kernel module solution
> > > 		  http://www.zip.com.au/~akpm/linux/fboot.tar.gz
> > > 		- preload - adaptive readahead daemon
> > > 		  http://sourceforge.net/projects/preload
> > 
> > you missed the solution Fedora deploys since over a year using readahead
> 
> Thanks, and sorry for more previous works that I failed to mention here :)

one interesting thing that came out of the fedora readahead work is that
most of the bootup isn't actually IO bound. My test machine for example
can load all the data into ram in about 10 seconds, so that the rest of
the boot is basically IO-less. But that still takes 2 minutes....
So I'm not entirely sure how much you can win by just attacking this.

Another interesting approach would be to actually put all the data you
want to use in a non-fragmented, sequential area on disk somehow (there
is an OLS paper submitted about that by Ben) so that at least the disk
side is seekless... 



* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-02  8:53         ` Wu Fengguang
  2006-05-06  6:49           ` Denis Vlasenko
  0 siblings, 1 reply; 48+ messages in thread
From: Wu Fengguang @ 2006-05-02  8:53 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Jens Axboe,
	Nick Piggin, Badari Pulavarty

On Tue, May 02, 2006 at 10:30:17AM +0200, Arjan van de Ven wrote:
> one interesting thing that came out of the fedora readahead work is that
> most of the bootup isn't actually IO bound. My test machine for example
> can load all the data into ram in about 10 seconds, so that the rest of
> the boot is basically IO-less. But that still takes 2 minutes....
> So I'm not entirely sure how much you can win by just attacking this.

Yes, I find it hard to improve the boot time of the init.d stage.
However, it is perfectly ok to preload all the GUI stuff during that
timespan, by overlapping CPU/IO activities.

> Another interesting approach would be to actually put all the data you
> want to use in a non-fragmented, sequential area on disk somehow (there
> is an OLS paper submitted about that by Ben) so that at least the disk
> side is seekless... 

You are right. Merely reducing seek distances does not help much. My
fluxbox desktop requires nearly 3k seeks, which can be loaded within the
20s init.d booting time.  But for KDE/GNOME desktops, some defragging
would be necessary to fit them into that 20s time span.

I found ext3 to be a rather good filesystem for supporting the poor man's
defrag described by Chris Mason:
        http://www.gelato.unsw.edu.au/archives/git/0504/1690.html
        This certainly can help.  Based on some ideas from andrea I
        made a poor man's defrag script last year that was similar.
        It worked by copying files into a flat dir in the order you
        expected to read them in, deleting the original, then hard
        linking them into their original name.

Making the 'flat dir' a top-level dir would do the trick for ext3.
We have a good chance of merging 10k seeks into 3k seeks with this trick :)
For illustration, a rough user-land sketch of that trick follows.
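
The sketch below is a hypothetical tool, not part of any existing package;
it ignores ownership, permissions and error cleanup, and assumes flatdir
lives on the same filesystem as the files being relocated:

/* poor-man-defrag.c: copy each file into a flat dir (in expected read
 * order), unlink the original, then hard-link the fresh copy back under
 * its old name, so the data ends up in newly allocated, mostly
 * contiguous blocks. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

static int relocate(const char *path, const char *flatdir, int seq)
{
	char tmp[4096], buf[65536];
	ssize_t n;
	int src, dst;

	snprintf(tmp, sizeof(tmp), "%s/%08d", flatdir, seq);

	src = open(path, O_RDONLY);
	dst = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0644);
	if (src < 0 || dst < 0)
		return -1;

	while ((n = read(src, buf, sizeof(buf))) > 0)
		if (write(dst, buf, n) != n)
			return -1;

	close(src);
	close(dst);

	if (unlink(path) < 0)		/* drop the old, scattered copy */
		return -1;
	return link(tmp, path);		/* same data, new block layout */
}

int main(int argc, char **argv)
{
	int i;

	if (argc < 3) {
		fprintf(stderr, "usage: %s flatdir file...\n", argv[0]);
		return 1;
	}
	for (i = 2; i < argc; i++)	/* files given in expected read order */
		if (relocate(argv[i], argv[1], i) < 0)
			perror(argv[i]);
	return 0;
}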

Wu


* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-02  8:55         ` Arjan van de Ven
  2006-05-02 11:39           ` Jan Engelhardt
  2006-05-02 11:48             ` Wu Fengguang
  0 siblings, 2 replies; 48+ messages in thread
From: Arjan van de Ven @ 2006-05-02  8:55 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Jens Axboe,
	Nick Piggin, Badari Pulavarty

On Tue, 2006-05-02 at 16:53 +0800, Wu Fengguang wrote:
> On Tue, May 02, 2006 at 10:30:17AM +0200, Arjan van de Ven wrote:
> > one interesting thing that came out of the fedora readahead work is that
> > most of the bootup isn't actually IO bound. My test machine for example
> > can load all the data into ram in about 10 seconds, so that the rest of
> > the boot is basically IO-less. But that still takes 2 minutes....
> > So I'm not entirely sure how much you can win by just attacking this.
> 
> Yes, I find it hard to improve the boot time of the init.d stage.
> However, it is perfectly ok to preload all GUI staffs during that
> timespan, by overlapping CPU/IO activities.

fwiw fedora even loads a bunch of GUI apps into memory already

> 
> > Another interesting approach would be to actually put all the data you
> > want to use in a non-fragmented, sequential area on disk somehow (there
> > is an OLS paper submitted about that by Ben) so that at least the disk
> > side is seekless... 
> 
> You are right, reducing seeking distances helps not much. My fluxbox
> desktop requires near 3k seeks, which can be loaded in the 20s init.d
> booting time.  But for KDE/GNOME desktops, some defragging would be
> necessary to fit them into the 20s time span.

or just move the blocks (or copy them) to a reserved area...




* Re: [RFC] kernel facilities for cache prefetching
  2006-05-02  8:55         ` Arjan van de Ven
@ 2006-05-02 11:39           ` Jan Engelhardt
  2006-05-02 11:48             ` Wu Fengguang
  1 sibling, 0 replies; 48+ messages in thread
From: Jan Engelhardt @ 2006-05-02 11:39 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Wu Fengguang, linux-kernel, Linus Torvalds, Andrew Morton,
	Jens Axboe, Nick Piggin, Badari Pulavarty

>> > Another interesting approach would be to actually put all the data you
>> > want to use in a non-fragmented, sequential area on disk somehow (there
>> > is an OLS paper submitted about that by Ben) so that at least the disk
>> > side is seekless... 
>> 
>> You are right, reducing seeking distances helps not much. My fluxbox
>> desktop requires near 3k seeks, which can be loaded in the 20s init.d
>> booting time.  But for KDE/GNOME desktops, some defragging would be
>> necessary to fit them into the 20s time span.
>
>or just move the blocks (or copy them) to a reserved area...

Or even put it into a ramdisk, and then fix userspace. When Gnome loads 
more than 10000 files [http://kerneltrap.org/node/2157], it sounds like a 
design problem.


Jan Engelhardt
-- 


* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-02 11:48             ` Wu Fengguang
  0 siblings, 0 replies; 48+ messages in thread
From: Wu Fengguang @ 2006-05-02 11:48 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Jens Axboe,
	Nick Piggin, Badari Pulavarty

On Tue, May 02, 2006 at 10:55:29AM +0200, Arjan van de Ven wrote:
> > Yes, I find it hard to improve the boot time of the init.d stage.
> > However, it is perfectly ok to preload all GUI staffs during that
> > timespan, by overlapping CPU/IO activities.
> 
> fwiw fedora even loads a bunch of GUI apps into memory already

Hopefully the warm cache time has been improving.
I have very good performance with fluxbox.  Lubos Lunak said
(http://wiki.kde.org/tiki-index.php?page=KDE%20Google%20SoC%202006%20ideas#id106304)
that "KDE startup with cold caches is several times slower than with
warm caches." The latest GNOME release also saw good speedup.

> > > Another interesting approach would be to actually put all the data you
> > > want to use in a non-fragmented, sequential area on disk somehow (there
> > > is an OLS paper submitted about that by Ben) so that at least the disk
> > > side is seekless... 
> > 
> > You are right, reducing seeking distances helps not much. My fluxbox
> > desktop requires near 3k seeks, which can be loaded in the 20s init.d
> > booting time.  But for KDE/GNOME desktops, some defragging would be
> > necessary to fit them into the 20s time span.
> 
> or just move the blocks (or copy them) to a reserved area...

Ah, we could save discontiguous pages into the journal file on unmounting
and restore them on mounting. But let's do it the simple way first :)

Wu


* Re: [RFC] kernel facilities for cache prefetching
  2006-05-02  7:50 ` Wu Fengguang
@ 2006-05-02 12:46   ` Diego Calleja
  2006-05-02 14:42       ` Wu Fengguang
  2006-05-02 15:55   ` Linus Torvalds
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 48+ messages in thread
From: Diego Calleja @ 2006-05-02 12:46 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-kernel, torvalds, akpm, axboe, nickpiggin, pbadari, arjan

On Tue, 2 May 2006 15:50:49 +0800,
Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:

> 	2) kernel module to query the file cache

Can't mincore() + /proc/$PID/* stuff be used to replace that?
Improving boot time is nice, and querying the file cache would work
for that, but improving the start-up time of some programs once the system
is running (i.e. running openoffice 6 hours after booting) is something
that other preloaders do on other OSes as well; querying the full file
cache wouldn't be that useful for such cases.

The main reason why I believe that the pure userspace (preload.sf.net)
solution slows down in some cases is because it uses Bayesian heuristics
(!) as a magic ball to guess the future, which is a flawed idea IMHO.
I started (but didn't finish) a preloader which uses the process event
connector to get notifications of what processes are being launched,
then profiles them (using mincore(), /proc/$PID/* stuff, etc.) and
preloads things optimally the next time it gets a notification of the
same app.

Mac OS X has a program that implements your idea; the sources are available
at http://darwinsource.opendarwin.org/projects/apsl/BootCache-25/

Also, Mac OS X uses launchd as an init/init.d replacement
(http://darwinsource.opendarwin.org/projects/apsl/launchd-106/).
If (as Arjan noticed) the bootup is not really IO-bound, launchd could
help reduce the amount of CPU time wasted, if that time is being wasted
in shell interpreters (a more unixy port has been done by the freebsd
people - http://wikitest.freebsd.org/moin.cgi/launchd).


* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-02 14:42       ` Wu Fengguang
  2006-05-02 16:07         ` Diego Calleja
  0 siblings, 1 reply; 48+ messages in thread
From: Wu Fengguang @ 2006-05-02 14:42 UTC (permalink / raw)
  To: Diego Calleja
  Cc: linux-kernel, torvalds, akpm, axboe, nickpiggin, pbadari, arjan

Diego,

On Tue, May 02, 2006 at 02:46:41PM +0200, Diego Calleja wrote:
> On Tue, 2 May 2006 15:50:49 +0800,
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> 
> > 	2) kernel module to query the file cache
> 
> Can't mincore() + /proc/$PID/* stuff be used to replace that ?

Nope. mincore() only provides info about files that are currently
opened by the process itself. The majority of files in the file cache
are closed files.

> Improving boot time is nice and querying the file cache would work
> for that, but improving the boot time of some programs once the system
> is running (ie: running openoffice 6 hours after booting) is something
> that other preloaders do in other OSes aswell, querying the full file
> cache wouldn't be that useful for such cases.

Yes, it can still be useful after booting :) One can get the cache
footprint of any task started at any time by taking snapshots of the
cache before and after the task, and doing a set-subtract on them.
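
For illustration, a rough sketch of that set-subtract, assuming the two
snapshots were simply saved with 'cat /proc/filecache' before and after
the task (the line-by-line comparison is deliberately naive):

/* cache-diff.c: print snapshot lines present in 'after' but not in 'before' */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINES 262144

static char *before[MAX_LINES];
static int nbefore;

int main(int argc, char **argv)
{
	char line[1024];
	FILE *f;
	int i;

	if (argc != 3) {
		fprintf(stderr, "usage: %s before after\n", argv[0]);
		return 1;
	}

	f = fopen(argv[1], "r");
	if (!f)
		return 1;
	while (nbefore < MAX_LINES && fgets(line, sizeof(line), f))
		before[nbefore++] = strdup(line);
	fclose(f);

	f = fopen(argv[2], "r");
	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		for (i = 0; i < nbefore; i++)
			if (strcmp(line, before[i]) == 0)
				break;
		if (i == nbefore)	/* cached only after the task ran */
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}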

> The main reason why I believe that the pure userspace (preload.sf.net)
> solution slows down in some cases is becauses it uses bayesian heuristics
> (!) as a magic ball to guess the future, which is a flawed idea IMHO.
> I started (but didn't finish) a preloader which uses the process event
> connector to get notifications of what processes are being launched,
> then it profiles it (using mincore(), /proc/$PID/* stuff, etc) and
> preloads things optimally the next time it gets a notification of the
> same app.

My thought is that any prefetcher that ignores the thousands of small
data files is incomplete. Only the kernel can provide that info.

> Mac OS X has a program that implements your idea, available (the sources)
> at http://darwinsource.opendarwin.org/projects/apsl/BootCache-25/

Ah, thanks for mentioning it.

It seems to do the job mostly in the kernel, comprising 2400+ LOC,
whereas my plan is to write a module of ~300 LOC that _only_ provides
the necessary info, and leaves other jobs to smart user-land tools.

Wu


* Re: [RFC] kernel facilities for cache prefetching
  2006-05-02  7:50 ` Wu Fengguang
  2006-05-02 12:46   ` Diego Calleja
@ 2006-05-02 15:55   ` Linus Torvalds
  2006-05-02 16:35     ` Andi Kleen
                       ` (5 more replies)
  2006-05-03 21:45   ` Linda Walsh
  2006-05-04  9:02   ` Helge Hafting
  3 siblings, 6 replies; 48+ messages in thread
From: Linus Torvalds @ 2006-05-02 15:55 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-kernel, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty



On Tue, 2 May 2006, Wu Fengguang wrote:
>
>  block/deadline-iosched.c |   35 +++++++++++++++++++++++++++++++++++
>  block/ll_rw_blk.c        |    8 ++++++++
>  fs/buffer.c              |    5 +++--
>  include/linux/elevator.h |    2 ++
>  4 files changed, 48 insertions(+), 2 deletions(-)

Now, regardless of the other issues people have brought up, I'd like to 
say that I think this is broken.

Doing prefetching on a physical block basis is simply not a valid 
approach, for several reasons:

 - it misses several important cases (you can surely prefetch over NFS 
   too)
 - it gives you only a very limited view into what is actually going on.
 - it doesn't much allow you to _fix_ any problems, it just allows you to 
   try to paper them over.
 - it's useless anyway, since pretty much all kernel caching is based on 
   virtual caches, so if you "pre-read" the physical buffers, it won't 
   help: you'll just waste time reading the data into a buffer that will 
   never be used, and when the real request comes in, the read will 
   be done _again_.

So I would _seriously_ claim that the place to do all the statistics 
allocation is in anything that ends up having to call "->readpage()", and 
do it all on a virtual mapping level.

Yes, it isn't perfect either (I'll mention some problems), but it's a 
_lot_ better. It means that when you gather the statistics, you can see 
the actual _files_ and offsets being touched. You can even get the 
filenames by following the address space -> inode -> i_dentry list.

   This is important for several reasons:
    (a) it makes it a hell of a lot more readable, and the user gets a 
	lot more information that may make him see the higher-level issues 
	involved.
    (b) it's in the form that we cache things, so if you read-ahead in 
	that form, you'll actually get real information.
    (c) it's in a form where you can actually _do_ something about things 
	like fragmentation etc ("Oh, I could move these files all to a 
	separate area")

Now, admittedly it has a few downsides:

 - right now "readpage()" is called in several places, and you'd have to 
   create some kind of nice wrapper for the most common 
   "mapping->a_ops->readpage()" thing and hook into there to avoid 
   duplicating the effort.

   Alternatively, you could decide that you only want to do this at the 
   filesystem level, which actually simplifies some things. If you 
   instrument "mpage_readpage[2]()", you'll already get several of the 
   ones you care about, and you could do the others individually.

   [ As a third alternative, you might decide that the only thing you
   actually care about is when you have to wait on a locked page, and 
   instrument the page wait-queues instead. ]

 - it will miss any situation where a filesystem does a read some other 
   way. Notably, in many loads, the _directory_ accesses are the important 
   ones, and if you want statistics for those you'd often have to do that 
   separately (not always - some of the filesystems just use the same 
   page reading stuff).

The downsides basically boil down to the fact that it's not as clearly 
just one single point. You can't just look at the request queue and see 
what physical requests go out.

NOTE! You can obviously do both, and try to correlate one against the 
other, and you'd get the best possible information ("is this seek because 
we started reading another file, or is it because the file itself is 
fragmented" kind of stuff). So physical statistics aren't meaningless. 
They're just _less_ important than the virtual ones, and you should do 
them only if you already do virtual stats.

		Linus


* Re: [RFC] kernel facilities for cache prefetching
  2006-05-02 14:42       ` Wu Fengguang
@ 2006-05-02 16:07         ` Diego Calleja
  2006-05-03  6:45             ` Wu Fengguang
  0 siblings, 1 reply; 48+ messages in thread
From: Diego Calleja @ 2006-05-02 16:07 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-kernel, torvalds, akpm, axboe, nickpiggin, pbadari, arjan

On Tue, 2 May 2006 22:42:03 +0800,
Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:

> Nope. mincore() only provides info about files that are currently
> opened, by the process itself. The majority in the file cache are
> closed files.

Yes; what I meant was:
- start listening to the process event connector as the first task on startup
- get a list of what files are being used (every second or something)
  by looking at /proc/$PID/* stuff
- mincore() all those files at the end of the bootup
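
A rough sketch of that last step, just to make it concrete (standalone,
per-file; plain syscalls, minimal error handling):

/* mincore-file.c: report how many pages of a file are resident in the
 * page cache.  Note it only sees files we open and map ourselves. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	unsigned char *vec;
	struct stat st;
	size_t pages, resident = 0, i;
	long psize = sysconf(_SC_PAGESIZE);
	void *map;
	int fd;

	if (argc != 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
		return 1;

	pages = (st.st_size + psize - 1) / psize;
	map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	vec = malloc(pages);
	if (map == MAP_FAILED || !vec)
		return 1;

	/* one status byte per page; bit 0 set means "resident in core" */
	if (mincore(map, st.st_size, vec) < 0)
		return 1;
	for (i = 0; i < pages; i++)
		resident += vec[i] & 1;

	printf("%s: %zu of %zu pages cached\n", argv[1], resident, pages);
	return 0;
}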

> Yes, it can still be useful after booting :) One can get the cache
> footprint of any task started at any time by taking snapshots of the
> cache before and after the task, and do a set-subtract on them.

Although this certainly looks simpler for userspace (and complete, if
you want to get absolutely all the info about files that get opened and
closed faster than the profile interval of a prefetcher) 

(another useful tool would be a dtrace-like thing)


* Re: [RFC] kernel facilities for cache prefetching
  2006-05-02 15:55   ` Linus Torvalds
@ 2006-05-02 16:35     ` Andi Kleen
  2006-05-03  4:11       ` Wu Fengguang
                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2006-05-02 16:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Andrew Morton, Jens Axboe, Nick Piggin,
	Badari Pulavarty, wfg

Linus Torvalds <torvalds@osdl.org> writes:
> 
> The downsides basically boil down to the fact that it's not as clearly 
> just one single point. You can't just look at the request queue and see 
> what physical requests go out.

The other problem is that readpages has no idea about the layout
of the disk so just doing it dumbly might end up with a lot of seeking.

I suppose you would at least need some tweaks to the scheduler to make
sure it doesn't give these reads any priority and keeps them in the
queue long enough to get real good sorting and large requests.

The problem is that this would likely need to be asynchronous to be of any
use, unless you were willing to block a thread for a potentially
long time for each file to be prefetched.

-Andi


* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-02 19:10 ` Pavel Machek
  2006-05-02 23:36   ` Nigel Cunningham
                     ` (3 more replies)
  0 siblings, 4 replies; 48+ messages in thread
From: Pavel Machek @ 2006-05-02 19:10 UTC (permalink / raw)
  To: Wu Fengguang, linux-kernel, Linus Torvalds, Andrew Morton,
	Jens Axboe, Nick Piggin, Badari Pulavarty, Nigel Cunningham,
	Rafael J. Wysocki

Hi!

> SOC PROPOSAL
> 
> 	Rapid linux desktop startup through pre-caching

Looks nice to me...

> SCHEME/GOAL
> 	2) kernel module to query the file cache
> 		- on loading: create /proc/filecache
> 		- setting up: echo "param value" > /proc/filecache
> 		- info query: cat /proc/filecache
> 		- sample sessions:
> 
> 		# modprobe filecache
> 		$ cat /proc/filecache
> 		# file ALL
> 		# mask 0
> 		#
> 		# ino	size	cached	status	refcnt	name
> 		0	2929846	3181	D	1	/dev/hda1
> 		81455	9	9	_	1	/sbin/init
> 		......
> 	
> 		$ echo "file /dev/hda1" > /proc/filecache
> 		$ cat /proc/filecache
> 		# file /dev/hda1
> 		# mask 0
> 		#
> 		# idx	len
> 		0	24
> 		48	2
> 		53	5
> 		......

Could we use this instead of blockdev freezing/big suspend image
support? It should permit us to resume quickly (with a small image), and
then do readahead... that would give us a usable machine quickly, and
still a very responsive desktop after resume?

								Pavel
-- 
Thanks for all the (sleeping) penguins.


* Re: [RFC] kernel facilities for cache prefetching
  2006-05-02  8:30     ` Arjan van de Ven
  2006-05-02  8:53         ` Wu Fengguang
@ 2006-05-02 22:03       ` Dave Jones
  1 sibling, 0 replies; 48+ messages in thread
From: Dave Jones @ 2006-05-02 22:03 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Wu Fengguang, linux-kernel, Linus Torvalds, Andrew Morton,
	Jens Axboe, Nick Piggin, Badari Pulavarty

On Tue, May 02, 2006 at 10:30:17AM +0200, Arjan van de Ven wrote:

 > one interesting thing that came out of the fedora readahead work is that
 > most of the bootup isn't actually IO bound. 

Here's another interesting datapoint.
A huge proportion of I/O done during bootup is _completely_ unnecessary.

Profiling just a few of the places that seemed to stall yielded some
lovely things, like cupsd reading in and parsing descriptions for
every printer ever known to man, even if there's no printer connected.
Or my favorite.. hald reloading and reparsing the same XML hardware description
files.. ~50 times, _each_.

Utterly insane.

(And these aren't Fedora specific problems btw, every distro shipping those
 bits will have the same dumb behaviour [modulo versions])

		Dave

-- 
http://www.codemonkey.org.uk


* Re: [RFC] kernel facilities for cache prefetching
  2006-05-02 19:10 ` Pavel Machek
@ 2006-05-02 23:36   ` Nigel Cunningham
  2006-05-03  2:35       ` Wu Fengguang
  2006-05-03  2:32     ` Wu Fengguang
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 48+ messages in thread
From: Nigel Cunningham @ 2006-05-02 23:36 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Wu Fengguang, linux-kernel, Linus Torvalds, Andrew Morton,
	Jens Axboe, Nick Piggin, Badari Pulavarty, Rafael J. Wysocki


Hi.

On Wednesday 03 May 2006 05:10, Pavel Machek wrote:
> Could we use this instead of blockdev freezing/big suspend image
> support? It should permit us to resume quickly (with small image), and
> then do readahead. ... that will give us usable machine quickly, still
> very responsive desktop after resume?

Unrelated to bdev freezing, and will involve far more seeking than reading a 
contiguous image (as they normally are).

Regards,

Nigel
-- 
See our web page for Howtos, FAQs, the Wiki and mailing list info.
http://www.suspend2.net                IRC: #suspend2 on Freenode



* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-03  2:32     ` Wu Fengguang
  0 siblings, 0 replies; 48+ messages in thread
From: Wu Fengguang @ 2006-05-03  2:32 UTC (permalink / raw)
  To: Pavel Machek
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Jens Axboe,
	Nick Piggin, Badari Pulavarty, Nigel Cunningham,
	Rafael J. Wysocki

Hi Pavel,

On Tue, May 02, 2006 at 09:10:01PM +0200, Pavel Machek wrote:
> > SOC PROPOSAL
> > 
> > 	Rapid linux desktop startup through pre-caching
> 
> Looks nice me...

Thanks.

> > SCHEME/GOAL
> > 	2) kernel module to query the file cache
> > 		- on loading: create /proc/filecache
> > 		- setting up: echo "param value" > /proc/filecache
> > 		- info query: cat /proc/filecache
> > 		- sample sessions:
> > 
> > 		# modprobe filecache
> > 		$ cat /proc/filecache
> > 		# file ALL
> > 		# mask 0
> > 		#
> > 		# ino	size	cached	status	refcnt	name
> > 		0	2929846	3181	D	1	/dev/hda1
> > 		81455	9	9	_	1	/sbin/init
> > 		......
> > 	
> > 		$ echo "file /dev/hda1" > /proc/filecache
> > 		$ cat /proc/filecache
> > 		# file /dev/hda1
> > 		# mask 0
> > 		#
> > 		# idx	len
> > 		0	24
> > 		48	2
> > 		53	5
> > 		......
> 
> Could we use this instead of blockdev freezing/big suspend image
> support? It should permit us to resume quickly (with small image), and
> then do readahead. ... that will give us usable machine quickly, still
> very responsive desktop after resume?

Sure. Indeed I have considered the possibility of integrating my
work with the foresighted userland-swsusp idea. The added value of
this approach is that user-land tools can do all kinds of smart
processing of the contents, and present users with different policies.

Another imagined user of /proc/filecache is system admins :)

Wu


* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-03  2:35       ` Wu Fengguang
  0 siblings, 0 replies; 48+ messages in thread
From: Wu Fengguang @ 2006-05-03  2:35 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Pavel Machek, linux-kernel, Linus Torvalds, Andrew Morton,
	Jens Axboe, Nick Piggin, Badari Pulavarty, Rafael J. Wysocki

On Wed, May 03, 2006 at 09:36:12AM +1000, Nigel Cunningham wrote:
> > Could we use this instead of blockdev freezing/big suspend image
> > support? It should permit us to resume quickly (with small image), and
> > then do readahead. ... that will give us usable machine quickly, still
> > very responsive desktop after resume?
> 
> Unrelated to bdev freezing, and will involve far more seeking than reading a 
> contiguous image (as they normally are).

Yes, this issue needs more pondering...


* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-03  4:11       ` Wu Fengguang
  0 siblings, 0 replies; 48+ messages in thread
From: Wu Fengguang @ 2006-05-03  4:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty

On Tue, May 02, 2006 at 08:55:06AM -0700, Linus Torvalds wrote:
> Doing prefetching on a physical block basis is simply not a valid 
> approach, for several reasons:

Sorry!
I made a misleading introduction. I'll try to explain it in more detail.

DATA ACQUISITION

/proc/filecache provides an interface to query the cached pages of any
file. This information is expressed in tuples of <idx, len>, which
more specifically means <mapping-offset, pages>.

Normally one should use 'echo' to set up two parameters before doing
'cat':
        @file
                the filename;
                use 'ALL' to get a list of all cached files
        @mask
                only show the pages with non-zero (page-flags & @mask);
                for simplicity, use '0' to show all present pages (0 is taken as ~0)

Normally, one should first get the file list using the param 'file ALL',
and then iterate through all the files and pages of interest with the
params 'file filename' and 'mask pagemask'.

The param 'mask' acts as a filter for different users: it allows
sysadmins to know where their memory goes, and the prefetcher to ignore
pages from false readahead.

One can use 'mask hex(PG_active|PG_referenced|PG_mapped)' in its hex form
to show only accessed pages (here PG_mapped is a faked flag), and
'mask hex(PG_dirty)' to show only dirtied pages.

One can use 
        $ echo "file /sbin/init" > /proc/filecache
        $ echo "mask 0" > /proc/filecache
        $ cat /proc/filecache
to get an idea which pages of /sbin/init are currently cached.

In the proposal, I used the following example, which proved to be
rather misleading:
        $ echo "file /dev/hda1" > /proc/filecache
        $ cat /proc/filecache
The intention of that example was to show that filesystem dir/inode
buffer status -- which is the key data for user-land pre-caching --
can also be retrieved through this interface.

So the proposed solution is to
        - prefetch normal files on the virtual mapping level
        - prefetch fs dir/inode buffers on a physical block basis

I/O SUBMISSION

How can we avoid unnecessary seeks when prefetching on the virtual mapping
level?  The answer is to leave this job to the I/O elevators. What we
should do is present the elevators with most of the readahead requests
before too many requests have been submitted to the disk drivers.
The proposed scheme is to:
        1) (first things first)
           issue all readahead requests for filesystem buffers
        2) (in background, often blocked)
           issue all readahead requests for normal files
        -) make sure the above requests are of really _low_ priority
        3) regular system boot continues
        4) promote the priority of any request that is now demanded by
           legacy programs

In the scheme, most work is done by user-land tools. The required
kernel support is minimal and general-purpose:
        - a /proc/filecache interface
        - the ability to promote the I/O priority of demanded pages

By this approach, we avoid the complicated OS X BootCache solution,
which is a physical-blocks-based, special-handling-in-kernel solution
of exactly the kind that Linus is against.
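
For reference, here is a minimal user-land sketch of the "really _low_
priority" part of the scheme: the prefetcher puts itself into the idle I/O
class before issuing readahead() calls. The ioprio constants are copied
inline because glibc has no ioprio_set() wrapper, and it assumes
__NR_ioprio_set is available from the installed kernel headers (values
mirror include/linux/ioprio.h):

/* prefetch.c: demote ourselves to the idle I/O class, then queue
 * readahead for every file named on the command line. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/syscall.h>

#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_CLASS_IDLE	3
#define IOPRIO_WHO_PROCESS	1
#define IOPRIO_PRIO_VALUE(class, data)	(((class) << IOPRIO_CLASS_SHIFT) | (data))

int main(int argc, char **argv)
{
	struct stat st;
	int i, fd;

	/* everything submitted from now on should never stand in the way
	 * of normal reads (given elevator support for the idle class) */
	if (syscall(__NR_ioprio_set, IOPRIO_WHO_PROCESS, 0,
		    IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)) < 0)
		perror("ioprio_set");

	for (i = 1; i < argc; i++) {	/* file list comes from the replay tool */
		fd = open(argv[i], O_RDONLY);
		if (fd < 0)
			continue;
		if (fstat(fd, &st) == 0)
			readahead(fd, 0, st.st_size);
		close(fd);
	}
	return 0;
}

Whether those requests actually stay out of the way then depends on the
elevator honoring the submitter's ioprio, which is what the patch above
tries to provide for deadline.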

Thanks,
Wu


* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-03  6:45             ` Wu Fengguang
  2006-05-03 18:14               ` Diego Calleja
  0 siblings, 1 reply; 48+ messages in thread
From: Wu Fengguang @ 2006-05-03  6:45 UTC (permalink / raw)
  To: Diego Calleja
  Cc: linux-kernel, torvalds, akpm, axboe, nickpiggin, pbadari, arjan

On Tue, May 02, 2006 at 06:07:53PM +0200, Diego Calleja wrote:
> > Nope. mincore() only provides info about files that are currently
> > opened, by the process itself. The majority in the file cache are
> > closed files.
> 
> Yes; what I meant was:
> - start listening the process event connector as firsk task on startup
> - get a list of what files are being used (every second or something)
>   by looking at /proc/$PID/* stuff
> - mincore() all those files at the end of the bootup
> 
> > Yes, it can still be useful after booting :) One can get the cache
> > footprint of any task started at any time by taking snapshots of the
> > cache before and after the task, and do a set-subtract on them.
> 
> Although this certainly looks simpler for userspace (and complete, if
> you want to get absolutely all the info about files that get opened and
> closed faster than the profile interval of a prefetcher) 

Thanks, so it's a question of simplicity/completeness :)

> (another useful tool would be a dtrace-like thing)

Lubos Lunak also reminded me of SUSE's preload
(http://en.opensuse.org/index.php?title=SUPER_preloading_internals),
which is a user-land solution using strace to collect the info.

And there's Andrea Arcangeli's "bootcache userspace logging" kernel
patch (http://lkml.org/lkml/2004/8/6/216).

Wu


* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-03  7:13       ` Wu Fengguang
  0 siblings, 0 replies; 48+ messages in thread
From: Wu Fengguang @ 2006-05-03  7:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty

On Tue, May 02, 2006 at 08:55:06AM -0700, Linus Torvalds wrote:
> So I would _seriously_ claim that the place to do all the statistics 
> allocation is in anything that ends up having to call "->readpage()", and 
> do it all on a virtual mapping level.
> 
> Yes, it isn't perfect either (I'll mention some problems), but it's a 
> _lot_ better. It means that when you gather the statistics, you can see 
> the actual _files_ and offsets being touched. You can even get the 
> filenames by following the address space -> inode -> i_dentry list.
> 
>    This is important for several reasons:
>     (a) it makes it a hell of a lot more readable, and the user gets a 
> 	lot more information that may make him see the higher-level issues 
> 	involved.
>     (b) it's in the form that we cache things, so if you read-ahead in 
> 	that form, you'll actually get real information.
>     (c) it's in a form where you can actually _do_ something about things 
> 	like fragmentation etc ("Oh, I could move these files all to a 
> 	separate area")

There have been two alternatives for me:
        1) a static/passive interface, i.e. the /proc/filecache querier
           - user-land tools request a dump of the cache contents on demand
        2) a dynamic/active interface, i.e. the readpage() logger
           - a user-land daemon receives live page access/IO activities

> Now, admittedly it has a few downsides:
> 
>  - right now "readpage()" is called in several places, and you'd have to 
>    create some kind of nice wrapper for the most common 
>    "mapping->a_ops->readpage()" thing and hook into there to avoid 
>    duplicating the effort.
> 
>    Alternatively, you could decide that you only want to do this at the 
>    filesystem level, which actually simplifies some things. If you 
>    instrument "mpage_readpage[2]()", you'll already get several of the 
>    ones you care about, and you could do the others individually.
> 
>    [ As a third alternative, you might decide that the only thing you
>    actually care about is when you have to wait on a locked page, and 
>    instrument the page wait-queues instead. ]
> 
>  - it will miss any situation where a filesystem does a read some other 
>    way. Notably, in many loads, the _directory_ accesses are the important 
>    ones, and if you want statistics for those you'd often have to do that 
>    separately (not always - some of the filesystems just use the same 
>    page reading stuff).
> 
> The downsides basically boil down to the fact that it's not as clearly 
> just one single point. You can't just look at the request queue and see 
> what physical requests go out.

Good insights.
The readpage() activity logging idea is appealing to me.
We might even go further and log mark_page_accessed() calls for more
information.

This approach is more precise, and provides process/page
correlations and timing info that the /proc/filecache interface cannot
provide. Though it involves more complexity and overhead (which, for me,
mean a higher possibility of being rejected :).
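
Just to make the idea concrete, a very rough sketch of such a logging
wrapper (hypothetical names, not existing kernel interfaces; printk only
to keep it short, a real implementation would feed relayfs or a /proc
channel instead):

/* hypothetical knob, e.g. a module parameter */
static int filecache_log_enabled;

/* sketch: a wrapper around the common ->readpage() call site that logs
 * <device, inode, page index>, so a user-land daemon can reconstruct
 * per-file access patterns */
static inline int readpage_logged(struct file *file, struct page *page)
{
	struct address_space *mapping = page->mapping;
	struct inode *inode = mapping->host;

	if (filecache_log_enabled)
		printk(KERN_DEBUG "readpage %u:%u ino %lu idx %lu\n",
		       MAJOR(inode->i_sb->s_dev), MINOR(inode->i_sb->s_dev),
		       inode->i_ino, (unsigned long)page->index);

	return mapping->a_ops->readpage(file, page);
}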

Wu


* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-03  7:19     ` Wu Fengguang
  0 siblings, 0 replies; 48+ messages in thread
From: Wu Fengguang @ 2006-05-03  7:19 UTC (permalink / raw)
  To: Pavel Machek
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Jens Axboe,
	Nick Piggin, Badari Pulavarty, Nigel Cunningham,
	Rafael J. Wysocki

On Tue, May 02, 2006 at 09:10:01PM +0200, Pavel Machek wrote:
> Could we use this instead of blockdev freezing/big suspend image
> support? It should permit us to resume quickly (with small image), and
> then do readahead. ... that will give us usable machine quickly, still
> very responsive desktop after resume?

It seems that it could help in reducing the image size:
write only small ranges of file pages to the suspend image (maybe 80MB
= 10k ranges * 8k avg size), and let the prefetcher restore other large
chunks of code/data, depending on user-specified policies.

Wu


* Re: [RFC] kernel facilities for cache prefetching
  2006-05-02 15:55   ` Linus Torvalds
                       ` (2 preceding siblings ...)
  2006-05-03  7:13       ` Wu Fengguang
@ 2006-05-03 12:59     ` Nikita Danilov
  2006-05-03 22:20     ` Rik van Riel
  2006-05-04  0:28     ` Linda Walsh
  5 siblings, 0 replies; 48+ messages in thread
From: Nikita Danilov @ 2006-05-03 12:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty

Linus Torvalds writes:
 > 

[...]

 > 
 > Now, admittedly it has a few downsides:
 > 
 >  - right now "readpage()" is called in several places, and you'd have to 
 >    create some kind of nice wrapper for the most common 
 >    "mapping->a_ops->readpage()" thing and hook into there to avoid 
 >    duplicating the effort.
 > 
 >    Alternatively, you could decide that you only want to do this at the 
 >    filesystem level, which actually simplifies some things. If you 
 >    instrument "mpage_readpage[2]()", you'll already get several of the 
 >    ones you care about, and you could do the others individually.
 > 
 >    [ As a third alternative, you might decide that the only thing you
 >    actually care about is when you have to wait on a locked page, and 
 >    instrument the page wait-queues instead. ]
 > 
 >  - it will miss any situation where a filesystem does a read some other 
 >    way. Notably, in many loads, the _directory_ accesses are the important 
 >    ones, and if you want statistics for those you'd often have to do that 
 >    separately (not always - some of the filesystems just use the same 
 >    page reading stuff).

Another disadvantage is that ->readpage() can only do read-ahead within
a single file, which is not helpful for the case of reading a lot of small
files (and this is what happens during startup).

And to implement reasonable multi-file read-ahead at the file system
layer, one needs an asynchronous inode loading interface implemented for
every file system.

Nikita.


* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-03 17:28       ` Badari Pulavarty
       [not found]         ` <346733486.30800@ustc.edu.cn>
  0 siblings, 1 reply; 48+ messages in thread
From: Badari Pulavarty @ 2006-05-03 17:28 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Linus Torvalds, lkml, Andrew Morton, Jens Axboe, Nick Piggin

On Wed, 2006-05-03 at 12:11 +0800, Wu Fengguang wrote:
> On Tue, May 02, 2006 at 08:55:06AM -0700, Linus Torvalds wrote:
> > Doing prefetching on a physical block basis is simply not a valid 
> > approach, for several reasons:
> 
> Sorry!
> I made a misleading introduction. I'll try to explain it in more detail.
> 
> DATA ACQUISITION
> 
> /proc/filecache provides an interface to query the cached pages of any
> file. This information is expressed in tuples of <idx, len>, which
> more specifically means <mapping-offset, pages>.
> 
> Normally one should use 'echo' to setup two parameters before doing
> 'cat':
>         @file
>                 the filename;
>                 use 'ALL' to get a list all files cached
>         @mask
>                 only show the pages with non-zero (page-flags & @mask);
>                 for simplicity, use '0' to show all present pages(take 0 as ~0)
> 
> Normally, one should first get the file list using param 'file ALL',
> and then iterate through all the files and pages of interested with
> params 'file filename' and 'mask pagemask'.
> 
> The param 'mask' acts as a filter for different users: it allows
> sysadms to know where his memory goes, and the prefetcher to ignore
> pages from false readahead.
> 
> One can use 'mask hex(PG_active|PG_referenced|PG_mapped)' in its hex form
> to show only accessed pages(here PG_mapped is a faked flag), and use
> 'mask hex(PG_dirty)' to show only dirtied pages.
> 
> One can use 
>         $ echo "file /sbin/init" > /proc/filecache
>         $ echo "mask 0" > /proc/filecache
>         $ cat /proc/filecache
> to get an idea which pages of /sbin/init are currently cached.
> 
> In the proposal, I used the following example, which is proved to be
> rather misleading:
>         $ echo "file /dev/hda1" > /proc/filecache
>         $ cat /proc/filecache
> The intention of that example was to show that filesystem dir/inode
> buffer status -- which is the key data for user-land pre-caching --
> can also be retrieved through this interface.
> 
> So the proposed solution is to
>         - prefetch normal files on the virtual mapping level
>         - prefetch fs dir/inode buffers on a physical block basis
> 
> I/O SUBMISSION
> How can we avoid unnecessary seeks when prefetching on virtual mapping
> level?  The answer is to leave this job to i/o elevators. What we
> should do is to present elevators with most readahead requests before
> too many requests being submitted to disk drivers.
> The proposed scheme is to:
>         1) (first things first)
>            issue all readahead requests for filesystem buffers
>         2) (in background, often blocked)
>            issue all readahead requests for normal files
>         -) make sure the above requests are of really _low_ priority
>         3) regular system boot continues
>         4) promote the priority of any request that is now demanded by
>            legacy programs
> 
> In the scheme, most work is done by user-land tools. The required
> kernel support is minimal and general-purpose:
>         - an /proc/filecache interface
>         - the ability to promote I/O priority on demanded pages
> 
> By this approach, we avoided the complicated OSX bootcache solution,
> which is a physical-blocks-based, special-handlings-in-kernel solution
> that is exactly what Linus is against.

Wu,

A while ago, I hacked up a similar /proc interface:
	echo "<filesystem-name>" > /proc/pagecache-usage

It showed the pagecache usage of every file in that filesystem
(filename, #pages). My main objective was to shoot down the pagecache
for all the files in a given filesystem. I ended up using it to do
posix_fadvise(POSIX_FADV_DONTNEED) on those files. (Initially, I tried
to do this without the interface, by doing fadvise() on all files in the
filesystem, but that ended up bloating the inode and dentry caches.)
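
For reference, the drop-from-cache step looks roughly like this (the file
list would come from an interface like the one described above; note that
POSIX_FADV_DONTNEED only evicts clean pages, so dirty ones need a
writeback first):

/* drop-cache.c: evict the cached pages of the files named on the command line */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int i, fd, err;

	for (i = 1; i < argc; i++) {
		fd = open(argv[i], O_RDONLY);
		if (fd < 0) {
			perror(argv[i]);
			continue;
		}
		/* len == 0 means "up to the end of the file" */
		err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
		if (err)
			fprintf(stderr, "%s: fadvise: %d\n", argv[i], err);
		close(fd);
	}
	return 0;
}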

Yeah, having this kind of information would be useful. But I am not sure
how much it can benefit regular workloads, unless one is willing
to tweak things heavily. The bottom line is, you need a better strategy
for how you would use the information.

Thanks,
Badari




* Re: [RFC] kernel facilities for cache prefetching
  2006-05-03  6:45             ` Wu Fengguang
@ 2006-05-03 18:14               ` Diego Calleja
  2006-05-03 23:39                 ` Zan Lynx
  0 siblings, 1 reply; 48+ messages in thread
From: Diego Calleja @ 2006-05-03 18:14 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-kernel, torvalds, akpm, axboe, nickpiggin, pbadari, arjan

On Wed, 3 May 2006 14:45:03 +0800,
Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:

> Lubos Lunak also reminds me of SUSE's preload
> (http://en.opensuse.org/index.php?title=SUPER_preloading_internals)
> which is a user-land solution using strace to collect the info.
> 
> And there's Andrea Arcangeli's "bootcache userspace logging" kernel
> patch(http://lkml.org/lkml/2004/8/6/216).

Just for completeness, Windows Vista will include an enhanced prefetcher
called (sic) SuperFetch. The idea behind it seems to be to analyze I/O
patterns and then "mirror" the most frequently used disk blocks onto
a USB flash drive; then, when the USB flash drive is plugged in,
the system will read those blocks from it instead of from the hard drive
the next time you run the app
(http://www.windowsitpro.com/Windows/Article/ArticleID/48085/48085.html)


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] kernel facilities for cache prefetching
  2006-05-02  7:50 ` Wu Fengguang
  2006-05-02 12:46   ` Diego Calleja
  2006-05-02 15:55   ` Linus Torvalds
@ 2006-05-03 21:45   ` Linda Walsh
  2006-05-04 12:12       ` Wu Fengguang
  2006-05-04  9:02   ` Helge Hafting
  3 siblings, 1 reply; 48+ messages in thread
From: Linda Walsh @ 2006-05-03 21:45 UTC (permalink / raw)
  To: Wu Fengguang, linux-kernel

Wu Fengguang wrote:
> Pre-caching reloaded ;)
> I/O ANALYSE...
> SCHEME/GOAL(s)...
>   
    Some good analysis and ideas.  I don't know if it is wanted, but I'd
like to add a 'few cents' referring to the pre-fetch mechanism in
XP, which addresses both boot and application prefetch and has the
benefit of showing measurable performance improvements (compare the
boot time of an NT4 system to XP; maybe a 5-8x performance boost?).

    1. As you mention; reading files "sequentially" through the file
system is "bad" for several reasons.  Areas of interest:
    a) don't go through the file system.  Don't waste time doing
directory lookups and following file-allocation maps;  Instead,
use raw-disk i/o and read sectors in using device & block number.
    b) Be "dynamic"; "Trace" (record (dev&blockno/range) blocks
starting ASAP after system boot and continuing for some "configurable"
number of seconds past reaching the desired "run-level" (coinciding with
initial disk quiescence).  Save as "configurable" (~6-8?) number of
traces to allow finding the common initial subset of blocks needed.
    c) Allow specification of max# of blocks and max number of "sections"
(discontiguous areas on disk);
    d) "Ideally", would have a way to "defrag" the common set of blocks.
I.e. -- moving the needed blocks from potentially disparate areas of
files into 1 or 2 contiguous areas, hopefully near the beginning of
the disk (or partition(s)).

    That's the area of "boot" pre-caching.

Next is doing something similar for "application" starts.  Start tracing
when an application is loaded & observe what blocks are requested for
that app for the first 20 ("configurable") seconds of execution.  Store
traces on a per-application basis.  Again, it would be ideal if the
different files (blocks, really), needed by an application could be
grouped so that sequentially needed disk-blocks are stored sequentially
on disk (this _could_ imply the containing files are not contiguous).

Essentially, one wants to do for applications, the same thing one does
for booting.  On small applications, the benefit would likely be negligible,
but on loading a large app like a windowing system, IDE, or database app,
multiple configuration files could be read into the cache in one large
read.

    That's "application" pre-caching.

    A third area -- that can't be easily done in the kernel, but would
require a higher skill level on the part of application and library
developers, is to move towards using "delay-loaded" libraries.  In
Windows, it seems common among system libraries to use this feature. 
An obvious benefit -- if certain features of a program are not used,
the associated libraries are never loaded.  Not loading unneeded parts
of a program should speed up initial application load time significantly.

    I don't know where the "cross-over" point is, but moving to
demand-loaded .so's can bring large benefits for interactive usage.  In
addition to load-time benefits, additional benefits are gained by not
wasting memory on unused libraries and program features.
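
(Purely as a sketch of the idea -- the library name and symbol below are
hypothetical, not taken from any real program:)

#include <dlfcn.h>
#include <stdio.h>

/* Load the (hypothetical) LDAP library only when the user actually asks
 * for an LDAP lookup, instead of linking it in at program start. */
static int ldap_lookup(const char *name)
{
	int (*lookup)(const char *);
	void *handle = dlopen("libldap.so.2", RTLD_LAZY);

	if (!handle) {
		fprintf(stderr, "LDAP support unavailable: %s\n", dlerror());
		return -1;
	}
	lookup = (int (*)(const char *)) dlsym(handle, "ldap_simple_lookup");
	if (!lookup) {
		fprintf(stderr, "%s\n", dlerror());
		dlclose(handle);
		return -1;
	}
	return lookup(name);
}

(Built with -ldl; the point is just that the open/mmap of the library only
happens on the first lookup, not at application start.)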

    In looking at the distro I use, many unused libraries are linked in
with commonly used programs.  For a small office or home setup, I rarely
have a need for LDAP, internationalized libraries, & application support
for virtually any hardware & software configuration.

    This isn't a kernel problem -- the kernel has dynamically loadable
modules to allow loading only of hardware & software drivers that are
needed in a particular kernel.  User applications that I have been
exposed to in Linux usually don't have this adaptability -- virtually
everything is loaded at execution time -- not as needed during
program execution. 

    I've seen this feature used on Unix systems to dynamically present
feature interfaces depending on the user's software configuration.  On
linux, I more often see a "everything, including the kitchen sink"
approach, where every possible software configuration is supported
via "static", shared libraries that must be present and loaded into
memory before the program begins execution.

    This has the potential for a greater benefit as the application
environment becomes more complex, if you think about how the number
of statically loaded, sharable libraries has increased (I have seen the
addition of ldap, pam, and most recently, selinux libraries that are
required to be loaded before application execution).

    Good luck in speeding these things up.  It might require some
level of cooperation in different areas (kernel, fs utils,
distro-configuration, application design and build...etc).

-linda





^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] kernel facilities for cache prefetching
  2006-05-02  8:09 ` Jens Axboe
  2006-05-02  8:20     ` Wu Fengguang
@ 2006-05-03 22:05   ` Benjamin LaHaise
  1 sibling, 0 replies; 48+ messages in thread
From: Benjamin LaHaise @ 2006-05-03 22:05 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Wu Fengguang, linux-kernel, Linus Torvalds, Andrew Morton,
	Nick Piggin, Badari Pulavarty

On Tue, May 02, 2006 at 10:09:36AM +0200, Jens Axboe wrote:
> Your patch isn't exactly pretty, but that is fixable. I'm more
> interested in how you'd control that sanely.

I have a quick and dirty hack called BootCache (only afterwards did I notice 
that Apple used the same name) that snoops all the read requests during 
startup and uses that to populate a file which gets read in to populate 
the cache on subsequent boots.  The results are impressive: eliminating 
the seeks (since the file is contiguous) results in a 15s reduction (out of
the 45s from init starting to the login prompt) in boot time on ext3.

Prefetch without reordering isn't particularly effective, as the cost of
the seeks to read in the inodes and file data tends to block critical-path
I/O in the startup scripts.

I hope to release the code sometime soon and rework it to be a better fit 
for merging with the kernel -- it should be possible to use the block io 
tracing facility to log the blocks required and then feed that back into 
the filesystem to reallocate blocks in the filesystem.  That would eliminate 
the messy part of bootcache that has to snoop block writes to maintain the 
cache.  This is all part of what I'm aiming to present at OLS.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] kernel facilities for cache prefetching
  2006-05-02 15:55   ` Linus Torvalds
                       ` (3 preceding siblings ...)
  2006-05-03 12:59     ` Nikita Danilov
@ 2006-05-03 22:20     ` Rik van Riel
  2006-05-06  1:11         ` Wu Fengguang
  2006-05-04  0:28     ` Linda Walsh
  5 siblings, 1 reply; 48+ messages in thread
From: Rik van Riel @ 2006-05-03 22:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Wu Fengguang, linux-kernel, Andrew Morton, Jens Axboe,
	Nick Piggin, Badari Pulavarty

On Tue, 2 May 2006, Linus Torvalds wrote:
> On Tue, 2 May 2006, Wu Fengguang wrote:
> >
> >  4 files changed, 48 insertions(+), 2 deletions(-)
> 
> So I would _seriously_ claim that the place to do all the statistics 
> allocation is in anything that ends up having to call "->readpage()", and 
> do it all on a virtual mapping level.

Why not simply read everything in a whole file at a time
at boot time, while we still have enough free memory ?

We can have a small modification to the readahead code
to read in the whole file on the first read or fault,
or maybe even on open.

Once the system is done booting, it can switch this
bootup readahead mode off through a tunable in /proc.
If the system is booting on a system with not enough
memory to load everything file-at-a-time, the bootup
scripts can switch this off earlier (or not switch
it on).

The kernel modifications needed to make this work
are minimal.  It might need some tweaking so we don't
try to read in truly enormous files, but that is easy.
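
(For illustration only, a rough userspace approximation of the same
whole-file idea, using the readahead() syscall rather than the
readahead-code tweak described above; preload_whole_file() is a made-up
helper and the 8MB cap is an arbitrary choice:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Ask for readahead of the whole file in one go, skipping huge files. */
static void preload_whole_file(const char *path)
{
	struct stat st;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return;
	if (fstat(fd, &st) == 0 && st.st_size > 0 && st.st_size <= (8 << 20))
		readahead(fd, 0, st.st_size);
	close(fd);
}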

Does this sound reasonable ?

-- 
All Rights Reversed

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] kernel facilities for cache prefetching
  2006-05-03 18:14               ` Diego Calleja
@ 2006-05-03 23:39                 ` Zan Lynx
  2006-05-04  1:37                   ` Diego Calleja
  0 siblings, 1 reply; 48+ messages in thread
From: Zan Lynx @ 2006-05-03 23:39 UTC (permalink / raw)
  To: Diego Calleja
  Cc: Wu Fengguang, linux-kernel, torvalds, akpm, axboe, nickpiggin,
	pbadari, arjan


On Wed, 2006-05-03 at 20:14 +0200, Diego Calleja wrote:
> Just for completeness, Windows Vista will include an enhanced prefetcher
> called (sic) SuperFetch. The idea behind it seems to be to analyze I/O
> patterns and then "mirror" the most frequently used disk blocks onto
> a USB flash drive; then, when the USB flash drive is plugged in,
> the system will read those blocks from it instead of from the hard drive
> the next time you run the app
> (http://www.windowsitpro.com/Windows/Article/ArticleID/48085/48085.html)

Linux should be able to do something like this using unionfs.  It could
be worthwhile to try it with one of the very fastest flash cards or USB
drives.

With slower cards and USB keys it's more of a loss, unless the faster seek
speed can make up for it, because sequential hard drive access is
faster.

For comparison, an OCZ USB 2.0 Flash Drive with dual channel works at
about 30 MB/s.  One of my 7,200 RPM SATA desktop drives does 45 MB/s.  A
15k SCSI drive can do over 60 MB/s.

It'd be great for laptops though.  My slug of a laptop drive does 20
MB/s on a good day.
-- 
Zan Lynx <zlynx@acm.org>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] kernel facilities for cache prefetching
  2006-05-02 15:55   ` Linus Torvalds
                       ` (4 preceding siblings ...)
  2006-05-03 22:20     ` Rik van Riel
@ 2006-05-04  0:28     ` Linda Walsh
  2006-05-04  1:31       ` Linus Torvalds
  5 siblings, 1 reply; 48+ messages in thread
From: Linda Walsh @ 2006-05-04  0:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Wu Fengguang, linux-kernel, Andrew Morton, Jens Axboe,
	Nick Piggin, Badari Pulavarty

Linus Torvalds wrote:
 > Now, regardless of the other issues people have brought up, I'd like to
 > say that I think this is broken.
 >
 > Doing prefetching on a physical block basis is simply not a valid
 > approach, for several reasons:
 >
 >  - it misses several important cases (you can surely prefetch over NFS
 >    too)
---

    I don't think fetching over a network is a case that needs
to be "optimized".  I don't think any there would be sufficient gain
in "pre-fetching" over a network (even if it was one long network read)
to outweigh network latencies.  I would strongly suggest this be
ignored.

 >  - it gives you only a very limited view into what is actually going on.
---

    ???  In what way?  I don't think we need a *complex* view of what
is going on.  If anything, complexity can be a performance drag (though
in some cases, an excellent algorithm can outweigh the extra computational
complexity).

 >  - it doesn't much allow you to _fix_ any problems, it just allows
 >    you to try to paper them over.
----

    It isn't designed to be the one "true" fix for every problem known.
It's meant to provide speedups for a common subset of I/O requests.

    If someone comes up with the one uber-algorithm in three years, then
replace this one; meanwhile...

 >  - it's useless anyway, since pretty all kernel caching is based on
 >    virtual caches, so if you "pre-read" the physical buffers, it won't
 >    help: you'll just waste time reading the data into a buffer that will
 >    never be used, and when the real request comes in, the read will
 >    be done _again_.
----

    This is the big drawback/problem.  But this leads me to an obvious
question.  If I read a file into memory, it is cached, _seemingly_,
relative to the filename.  If I read the same physical block into
memory using _block_-i/o, relative to the device, the claim is that
two separate reads are done into the block cache?  Can't this result
in inconsistent results between the two copies?  If I change the
file-relative block, how is this coordinated with the device-relative
block?  Surely there is some block coalescing done at some point to
ensure the two different views of the block are consistent?

    If not, I'd tend to view this as a "bug". 

    But presuming this is "dealt with" at some layer, can't the
same information used to merge the two blocks be used to identify that
the same block is already in memory?

    If file-blockno addresses can't be converted to device-blocknos,
how can the blocks be written to disk?

    If the file-blocknos are convertible, can't this be used to
compare which physical blocks are already in the cache?  Certainly this
would be faster than the milliseconds it could take to do another
physical read, no?

   
 >     (a) it makes it a hell of a lot more readable, and the user gets a
 >     lot more information that may make him see the higher-level issues
 >     involved.
---

    I'm not sure this is a benefit that outweighs the elimination of
having to go through the file system (following directory chains,
indirect blocks, extents, etc).

 >     (b) it's in the form that we cache things, so if you read-ahead in
 >     that form, you'll actually get real information.
----

    This may not be a great reason.  Using the fact that "this is the
way things are already done" may not allow optimal implementation of
faster algorithms.
   
 >     (c) it's in a form where you can actually _do_ something about
 >     things like fragmentation etc ("Oh, I could move these files all to a
 >     separate area")
----

    This could theoretically be done solely on a block basis.  While
file pointers would have to be updated to point to "out-of-sequence"
blocks, it might be more optimal to store _some_ files discontiguously
in order to pack the desired blocks into a (relatively) "small" area
that can be read into memory in some minimal number of physical i/o
requests.

   
 >  - it will miss any situation where a filesystem does a read some other
 >    way. Notably, in many loads, the _directory_ accesses are the important
 >    ones, and if you want statistics for those you'd often have to do that
 >    separately (not always - some of the filesystems just use the same
 >    page reading stuff).
---

    Recording physical block reads could automatically include this
situation.   

 >
 > The downsides basically boil down to the fact that it's not as clearly
 > just one single point. You can't just look at the request queue and see
 > what physical requests go out.
---

    You also have the downside of having to go through the file-system
and following each path and file-chain in order to read a potentially
discontiguous section of a file.  It also, _seemingly_, has the downside
of reading in "entire files" when only subsections of files may be
needed on startup.


   
 >
 > NOTE! You can obviously do both, and try to correlate one against the
 > other, and you'd get the best possible information ("is this seek because
 > we started reading another file, or is it because the file itself is
 > fragmented" kind of stuff). So physical statistics aren't meaningless.
 > They're just _less_ important than the virtual ones, and you should do
 > them only if you already do virtual stats.
---
    Sounds like you might need to have 2 sets of "addresses" per block:
file-relative block addresses, and device-relative block addresses.


-l



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] kernel facilities for cache prefetching
  2006-05-04  0:28     ` Linda Walsh
@ 2006-05-04  1:31       ` Linus Torvalds
  2006-05-04  7:08         ` Ph. Marek
  0 siblings, 1 reply; 48+ messages in thread
From: Linus Torvalds @ 2006-05-04  1:31 UTC (permalink / raw)
  To: Linda Walsh
  Cc: Wu Fengguang, linux-kernel, Andrew Morton, Jens Axboe,
	Nick Piggin, Badari Pulavarty



On Wed, 3 May 2006, Linda Walsh wrote:
> 
> >  - it gives you only a very limited view into what is actually going on.
> 
>    ???  In what way?  I don't think we need a *complex* view of what
> is going on.

The block-based view is a very simple one, but it's _too_ simple. It 
doesn't actually tell the user what is happening. It doesn't tell you why 
the request happened in the first place, so it leaves you no way to really 
do anything sane about it.

You can't prefetch (because the indexing is wrong), and you can't actually 
even analyze it (because you don't know any background to what you're 
seeing). In short, you can't _do_ anything with it.

			Linus

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] kernel facilities for cache prefetching
  2006-05-03 23:39                 ` Zan Lynx
@ 2006-05-04  1:37                   ` Diego Calleja
  0 siblings, 0 replies; 48+ messages in thread
From: Diego Calleja @ 2006-05-04  1:37 UTC (permalink / raw)
  To: Zan Lynx
  Cc: wfg, linux-kernel, torvalds, akpm, axboe, nickpiggin, pbadari, arjan

El Wed, 03 May 2006 17:39:51 -0600,
Zan Lynx <zlynx@acm.org> escribió:

> Linux should be able to do something like this using unionfs.  It could
> be worthwhile to try it with one of the very fastest flash cards or USB
> drives.

BTW, I forgot that the next Intel laptops (and apparently, mactel laptops)
will have a small flash memory built in for this very purpose. According
to an article at extremetech.com, the flash memory requires an "I/O
controller"; IOW, it is not transparent and seems to need support from the
OS (although I guess that vista's "superprefetch" was designed precisely
for this hardware). Apparently, the main purpose here is to improve the
battery life (disks seeking for too many seconds can eat a lot of power,
I guess). "Prefetchers" are becoming trendy, it seems

(page 14) http://www.intel.com/pressroom/kits/events/idfspr_2006/20060307_MaloneyTranscript.pdf

> With slower cards and USB keys its more of a loss unless the faster seek
> speed can make up for it, because sequential hard drive access is
> faster.

That's where the gain is; if the hard drive access were sequential, people
wouldn't be talking about prefetchers. My SATA desktop drive also does
~45 MB/s, but it doesn't go beyond 2 MB/s when seeking.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] kernel facilities for cache prefetching
  2006-05-04  1:31       ` Linus Torvalds
@ 2006-05-04  7:08         ` Ph. Marek
  2006-05-04  7:33           ` Arjan van de Ven
  0 siblings, 1 reply; 48+ messages in thread
From: Ph. Marek @ 2006-05-04  7:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linda Walsh, Wu Fengguang, linux-kernel, Andrew Morton,
	Jens Axboe, Nick Piggin, Badari Pulavarty

On Thursday 04 May 2006 03:31, Linus Torvalds wrote:
> On Wed, 3 May 2006, Linda Walsh wrote:
> > >  - it gives you only a very limited view into what is actually going
> > > on.
> >
> >    ???  In what way?  I don't think we need a *complex* view of what
> > is going on.
>
> The block-based view is a very simple one, but it's _too_ simple. It
> doesn't actually tell the user what is happening. It doesn't tell you why
> the request happened in the first place, so it leaves you no way to really
> do anything sane about it.
>
> You can't prefetch (because the indexing is wrong), and you can't actually
> even analyze it (because you don't know any background to what you're
> seeing). In short, you can't _do_ anything with it.

Please forgive me if I'm dumb, but what about the following reasoning:

The user as such (the person just *using* the computer) won't do anything
with "program xyz uses blocks a, b, and c".
But it wouldn't help much if he were told "program xyz uses files a, b, and d."
That's a task for package maintainers and distributions to optimize the 
*number* of accesses.


So the simple information about which block numbers we need would not benefit
the users (or the distribution developers), but it's enough for caching things:


Ascending block numbers on disk can be read very fast, as the disk needs little
or no seeking. That's even true for stripes and mirrors. (I grant you that
there are complicated setups out there, but these could be handled similarly.)


If we know which blocks are read (directories, inodes, data), an early 
userspace application (possibly even from initrd) could
- read these blocks into a ramdisk device (linearly),
- define a device mapper as a mirror of 
  - the original root partition,
  - and a sparse device composed from parts of the ramdisk, interleaved by
    non-existing (error-producing) ranges, and
- set up the device mapper as the root device.
- At a specified point during startup the mapping device gets set to just
  a linear interpretation of the root partition, and the ramdisk gets
  discarded.



Example:
--------

N specifies a needed block, the numbers other blocks.

  Harddisk: 1 2 3 N N N 7 8 N 10 N 12 13 14 N 16 N ...
  RAM-Disk contains 4 5 6 9 11 15 17

The mirror set would look like this, 
with E representing an error-returning block:

  Harddisk:             1 2 3 N N N 7 8 N 10  N 12 13 14  N 16  N ...
  mapped from RAM-Disk: E E E 4 5 6 E E 9  E 11  E  E  E 15  E 17 ...


Advantages:
- As long as there's a good chance that the needed blocks are read from the
  ramdisk (which was filled linearly, with 20-60MB/sec as reported by others),
  *all* disk-access times (inodes, directories, data) should be nearly zero.
  Of course, there will be blocks which now have to be read that weren't
  needed during the last boot; these will have to be taken from the harddisk,
  but the majority should be taken from the ramdisk.

Problems:
- We'd need to define a priority in the mirror set, so that *everything*
  possible is taken from the ramdisk
- There has to be a way to *not* disable the ram-disk when an error-block is
  read.


That's completely doable in userspace (except for knowing which blocks have to 
be read), and doesn't sound like much overhead to me.
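
(A rough sketch of just the first step in the list above -- copying a
recorded list of block extents from the root device into the ramdisk back
to back; fill_ramdisk() and the (start block, block count) extent format
are made up for the example:)

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define BLKSIZE 4096

/* Copy each recorded (start block, block count) extent from the root
 * device into the ramdisk, packed back to back, so a single sequential
 * sweep of the ramdisk covers all the blocks needed at boot. */
static int fill_ramdisk(int root_fd, int ram_fd,
			const uint64_t (*extents)[2], int n)
{
	static char buf[BLKSIZE];
	off_t out = 0;
	uint64_t b;
	int i;

	for (i = 0; i < n; i++)
		for (b = 0; b < extents[i][1]; b++) {
			off_t in = (off_t)((extents[i][0] + b) * BLKSIZE);

			if (pread(root_fd, buf, BLKSIZE, in) != BLKSIZE)
				return -1;
			if (pwrite(ram_fd, buf, BLKSIZE, out) != BLKSIZE)
				return -1;
			out += BLKSIZE;
		}
	return 0;
}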


Suggestions, ideas?
Does that work? Does it not?


Regards,

Phil

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] kernel facilities for cache prefetching
  2006-05-04  7:08         ` Ph. Marek
@ 2006-05-04  7:33           ` Arjan van de Ven
  2006-05-04 12:14               ` Wu Fengguang
  0 siblings, 1 reply; 48+ messages in thread
From: Arjan van de Ven @ 2006-05-04  7:33 UTC (permalink / raw)
  To: Ph. Marek
  Cc: Linus Torvalds, Linda Walsh, Wu Fengguang, linux-kernel,
	Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty



> Ascending block numbers on disk can be read very fast, as the disk needs no or 
> less seeking. That's even true for stripes and mirrors. (I grant you that 
> there are complicated setups out there, but these could be handled similar.)
> 


btw this all really spells out that you may want to do this as a device
mapper thing; eg have a device mapper module that can do "lookaside" to
a different order/mirror block whatever. The filesystem just doesn't
want to know; do it at the DM level ;) That also solves the entire
caching layering problem etc ;)


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] kernel facilities for cache prefetching
  2006-05-02  7:50 ` Wu Fengguang
                     ` (2 preceding siblings ...)
  2006-05-03 21:45   ` Linda Walsh
@ 2006-05-04  9:02   ` Helge Hafting
  3 siblings, 0 replies; 48+ messages in thread
From: Helge Hafting @ 2006-05-04  9:02 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Jens Axboe,
	Nick Piggin, Badari Pulavarty

Wu Fengguang wrote:

>	Rapid linux desktop startup through pre-caching
>
>
>MOTIVATION
>
>	KDE, Gnome, OpenOffice, and Firefox all take too long to start up.
>	Boot time pre-caching seems to be the single most straightforward and
>	effective way to improve it and make linux desktop experience more
>	comfortable. It is a great pleasure for me to take up the work.
>  
>
Actually, the best way is to not run so much software.  And yes,
that is an option.  I won't say no to an improved kernel too though. :-)

The apps mentioned are popular, but few people need *all* of them.
One can do without KDE and gnome, run a nice lightweight
window manager instead.  Take the kde/gnome performance hit
only when you actually need some kde/gnome app. Not every day.
A nice window manager like icewm or fluxbox brings the login
delay down to 3s or so for me.

Openoffice has lightweight alternatives for every task.
(abiword, lyx, gnumeric, ...).  Strange that this bloated sw is
as popular as it is, given the many alternatives.  Not something
I use every month, and I use linux exclusively for my office tasks.

Another alternative is to profile the slow apps and improve them.
Fix algorithms, optimize stuff. 

The slow boot is fixable by:
1) Run boot scripts in parallel instead of sequentially - somewhat
    experimental, but it helps.  Especially if you can bring up X before
    the slowest stuff completes.
2) Don't run what you don't use/need!  Don't install everything and the
    kitchen sink just because it is free software.  I am guilty of
    installing too much myself, so I suffer a 40-second bootup time.  But
    I don't reboot my office pc every week; normally I only have that 3s
    login delay.

Helge Hafting



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-04 12:12       ` Wu Fengguang
  2006-05-04 18:57         ` Linda Walsh
  0 siblings, 1 reply; 48+ messages in thread
From: Wu Fengguang @ 2006-05-04 12:12 UTC (permalink / raw)
  To: Linda Walsh; +Cc: linux-kernel

On Wed, May 03, 2006 at 02:45:53PM -0700, Linda Walsh wrote:
>    1. As you mention; reading files "sequentially" through the file
> system is "bad" for several reasons.  Areas of interest:
>    a) don't go through the file system.  Don't waste time doing
> directory lookups and following file-allocation maps;  Instead,
> use raw-disk i/o and read sectors in using device & block number.

Sorry, it does not fit in linux's cache model.

>    b) Be "dynamic"; "Trace" (record (dev&blockno/range) blocks
> starting ASAP after system boot and continuing for some "configurable"
> number of seconds past reaching the desired "run-level" (coinciding with
> initial disk quiescence).  Save as "configurable" (~6-8?) number of
> traces to allow finding the common initial subset of blocks needed.

It is an alternative way of doing the same job: more precise, with more
complexity and more overhead.  However, the 'blockno' way is not so
tasteful.

>    c) Allow specification of max# of blocks and max number of "sections"
> (discontiguous areas on disk);

Good point, will do it in my work.

>    d) "Ideally", would have a way to "defrag" the common set of blocks.
> I.e. -- moving the needed blocks from potentially disparate areas of
> files into 1 or 2 contiguous areas, hopefully near the beginning of
> the disk (or partition(s)).
>    That's the area of "boot" pre-caching.

I guess poor man's defrag would be good enough for the seeking storm.

> Next is doing something similar for "application" starts.  Start tracing
> when an application is loaded & observe what blocks are requested for
> that app for the first 20 ("configurable") seconds of execution.  Store
> traces on a per-application basis.  Again, it would be ideal if the
> different files (blocks, really), needed by an application could be
> grouped so that sequentially needed disk-blocks are stored sequentially
> on disk (this _could_ imply the containing files are not contiguous).
> 
> Essentially, one wants to do for applications, the same thing one does
> for booting.  On small applications, the benefit would likely be negligible,
> but on loading a large app like a windowing system, IDE, or database app,
> multiple configuration files could be read into the cache in one large
> read.
> 
>    That's "application" pre-caching.

Yes, it is a planned feature, will do it.

>    A third area -- that can't be easily done in the kernel, but would
> require a higher skill level on the part of application and library
> developers, is to move towards using "delay-loaded" libraries.  In
> Windows, it seems common among system libraries to use this feature. 
> An obvious benefit -- if certain features of a program are not used,
> the associated libraries are never loaded.  Not loading unneeded parts
> of a program should speed up initial application load time, significantly.

Linux already does lazy loading for linked libs. The only pitfall
is that /lib/ld-linux.so.2 seems to touch the first 512B of data of every
lib before doing mmap():

% strace date
execve("/bin/date", ["date"], [/* 41 vars */]) = 0
uname({sys="Linux", node="lark", ...})  = 0
brk(0)                                  = 0x8053000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f0e000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f0d000
open("/etc/ld.so.cache", O_RDONLY)      = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=77643, ...}) = 0
mmap2(NULL, 77643, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7efa000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)

open("/lib/tls/librt.so.1", O_RDONLY)   = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0`\35\0\000"..., 512) = 512
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~HERE~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fstat64(3, {st_mode=S_IFREG|0644, st_size=30612, ...}) = 0
mmap2(NULL, 29264, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7ef2000
mmap2(0xb7ef8000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6) = 0xb7ef8000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)

open("/lib/tls/libc.so.6", O_RDONLY)    = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\260O\1"..., 512) = 512
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~HERE~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fstat64(3, {st_mode=S_IFREG|0755, st_size=1270928, ...}) = 0
mmap2(NULL, 1276892, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7dba000
mmap2(0xb7ee8000, 32768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x12e) = 0xb7ee8000
mmap2(0xb7ef0000, 7132, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb7ef0000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)

open("/lib/tls/libpthread.so.0", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\340G\0"..., 512) = 512
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~HERE~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fstat64(3, {st_mode=S_IFREG|0755, st_size=85770, ...}) = 0
mmap2(NULL, 70104, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7da8000
mmap2(0xb7db6000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xd) = 0xb7db6000
mmap2(0xb7db8000, 4568, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb7db8000
close(3)                                = 0

They can lead to more seeks, and also disturb the readahead logic quite a bit.

Wu

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-04 12:14               ` Wu Fengguang
  0 siblings, 0 replies; 48+ messages in thread
From: Wu Fengguang @ 2006-05-04 12:14 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ph. Marek, Linus Torvalds, Linda Walsh, linux-kernel,
	Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty

On Thu, May 04, 2006 at 09:33:24AM +0200, Arjan van de Ven wrote:
> 
> 
> > Ascending block numbers on disk can be read very fast, as the disk needs no or 
> > less seeking. That's even true for stripes and mirrors. (I grant you that 
> > there are complicated setups out there, but these could be handled similar.)
> > 
> 
> 
> btw this all really spells out that you may want to do this as a device
> mapper thing; eg have a device mapper module that can do "lookaside" to
> a different order/mirror block whatever. The filesystem just doesn't
> want to know; do it at the DM level ;) That also solves the entire
> caching layering problem etc ;)

I guess some big corps might want to install such a layer into their
storage products ;)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-04 12:28     ` Wu Fengguang
  0 siblings, 0 replies; 48+ messages in thread
From: Wu Fengguang @ 2006-05-04 12:28 UTC (permalink / raw)
  To: Pavel Machek
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Jens Axboe,
	Nick Piggin, Badari Pulavarty, Nigel Cunningham,
	Rafael J. Wysocki

On Tue, May 02, 2006 at 09:10:01PM +0200, Pavel Machek wrote:
> > 		$ echo "file /dev/hda1" > /proc/filecache
> > 		$ cat /proc/filecache
> > 		# file /dev/hda1
> > 		# mask 0
> > 		#
> > 		# idx	len
> > 		0	24
> > 		48	2
> > 		53	5
> > 		......
> 
> Could we use this instead of blockdev freezing/big suspend image
> support? It should permit us to resume quickly (with small image), and
> then do readahead. ... that will give us usable machine quickly, still
> very responsive desktop after resume?

Badari's use case gave me the idea that on suspend we can
- first invoke a user-land tool to do all the cache
  status-logging/analyzing/selective-dropping jobs
- then let the kernel write all the remaining caches (made up of many
  small chunks) to the suspend image

And do the reverse on resume.
With that we move all the strategy to userspace.

Wu

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-04 12:34               ` Arjan van de Ven
  0 siblings, 0 replies; 48+ messages in thread
From: Arjan van de Ven @ 2006-05-04 12:34 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Ph. Marek, Linus Torvalds, Linda Walsh, linux-kernel,
	Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty

On Thu, 2006-05-04 at 20:14 +0800, Wu Fengguang wrote:
> On Thu, May 04, 2006 at 09:33:24AM +0200, Arjan van de Ven wrote:
> > 
> > 
> > > Ascending block numbers on disk can be read very fast, as the disk needs no or 
> > > less seeking. That's even true for stripes and mirrors. (I grant you that 
> > > there are complicated setups out there, but these could be handled similar.)
> > > 
> > 
> > 
> > btw this all really spells out that you may want to do this as a device
> > mapper thing; eg have a device mapper module that can do "lookaside" to
> > a different order/mirror block whatever. The filesystem just doesn't
> > want to know; do it at the DM level ;) That also solves the entire
> > caching layering problem etc ;)
> 
> I guess some big corps might want to install such a layer into their
> storage products ;)

maybe, could be.
Doing it at this level also has the advantage that fs metadata becomes
seek free as well; since that is mostly fixed-location inside the fs,
reordering that on an fs-level is really hard.. on a dm level.. easy.



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] kernel facilities for cache prefetching
       [not found]         ` <346733486.30800@ustc.edu.cn>
@ 2006-05-04 15:03           ` Linus Torvalds
  2006-05-04 16:57             ` Badari Pulavarty
  2006-05-05 14:44               ` Wu Fengguang
  0 siblings, 2 replies; 48+ messages in thread
From: Linus Torvalds @ 2006-05-04 15:03 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Badari Pulavarty, lkml, Andrew Morton, Jens Axboe, Nick Piggin



On Thu, 4 May 2006, Wu Fengguang wrote:
> 
> On Wed, May 03, 2006 at 10:28:00AM -0700, Badari Pulavarty wrote:
> > While ago, I hacked up similar /proc interface  
> > 	echo "<filesystem-name>" > /proc/pagecache-usage
> > 
> > Which showed pagecache usage of every file in that filesystem
> > (filename, #num pages). My main objective was to shoot down pagecache
> > for all the files in a given filesystem. I ended up using it to do
> > posix_fadvise(POSIX_FADV_DONTNEED) on those files. (Initially, tried
> > to do this without this, by doing fadvise() on all files in the
> > filesystem - but ended up bloating up inode and dcache). 
> 
> Ah, I have not thought of the possibility of querying the page cache
> just to drop some caches -- a subset of sysctl_drop_caches functions.

Actually, I did something even simpler for a totally one-off thing: you 
don't actually even need to drop caches or track pretty much anything, 
it's sufficient for most analysis to just have a timestamp on each page 
cache, and then have some way to read out the current cached contents.

You can then use the timestamps to get a pretty good idea of what order 
things happened in.

You'll lose the temporary file information (files that are created and 
deleted), because deleting a file will also flush the page cache for it, 
but temporary files tend to not be hugely interesting.

The really nice thing was that you don't even have to set the timestamp in 
any complex place: you do it at page _allocation_ time. That automatically 
gets the right answer for any page cache page, and you can do it in a 
single place.

		Linus

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] kernel facilities for cache prefetching
  2006-05-04 15:03           ` Linus Torvalds
@ 2006-05-04 16:57             ` Badari Pulavarty
  2006-05-05 14:44               ` Wu Fengguang
  1 sibling, 0 replies; 48+ messages in thread
From: Badari Pulavarty @ 2006-05-04 16:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Wu Fengguang, lkml, Andrew Morton, Jens Axboe, Nick Piggin

On Thu, 2006-05-04 at 08:03 -0700, Linus Torvalds wrote:
> 
> On Thu, 4 May 2006, Wu Fengguang wrote:
> > 
> > On Wed, May 03, 2006 at 10:28:00AM -0700, Badari Pulavarty wrote:
> > > While ago, I hacked up similar /proc interface  
> > > 	echo "<filesystem-name>" > /proc/pagecache-usage
> > > 
> > > Which showed pagecache usage of every file in that filesystem
> > > (filename, #num pages). My main objective was to shoot down pagecache
> > > for all the files in a given filesystem. I ended up using it to do
> > > posix_fadvise(POSIX_FADV_DONTNEED) on those files. (Initially, tried
> > > to do this without this, by doing fadvise() on all files in the
> > > filesystem - but ended up bloating up inode and dcache). 
> > 
> > Ah, I have not thought of the possibility of querying the page cache
> > just to drop some caches -- a subset of sysctl_drop_caches functions.
> 
> Actually, I did something even simpler for a totally one-off thing: you 
> don't actually even need to drop caches or track pretty much anything, 
> it's sufficient for most analysis to just have a timestamp on each page 
> cache, and then have some way to read out the current cached contents.
> 
> You can then use the timestamps to get a pretty good idea of what order 
> things happened in.
> 
> You'll lose the temporary file information (files that are created and 
> deleted), because deleting a file will also flush the page cache for it, 
> but temporary files tend to not be hugely interesting.
> 
> The really nice thing was that you don't even have to set the timestamp in 
> any complex place: you do it at page _allocation_ time. That automatically 
> gets the right answer for any page cache page, and you can do it in a 
> single place.

Sounds like an interesting idea.

The problem I was trying to address was that one of our large
database customers was complaining about the variation (degradation)
in database performance when other applications like backup, ftp, scp,
or tar were running on other filesystems on the system. The VM ends up
swapping out some of the database process data to make room for these
reads/writes.

They wanted a way to completely flush out pagecache data for a given
filesystem to see if it improves their situation (although "swappiness"
should in *theory* do the same thing). In fact, they want a way to
control the amount of pagecache filesystems populate through /proc :(

My patch was an initial attempt to unwind the mysteries of VM behaviour :)
(The VM may be doing the right thing from the kernel's perspective - but
the customer was complaining that it's thrashing their workload.)

Thanks,
Badari


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] kernel facilities for cache prefetching
  2006-05-04 12:12       ` Wu Fengguang
@ 2006-05-04 18:57         ` Linda Walsh
  2006-05-05 15:20             ` Wu Fengguang
  0 siblings, 1 reply; 48+ messages in thread
From: Linda Walsh @ 2006-05-04 18:57 UTC (permalink / raw)
  To: Wu Fengguang, linux-kernel

Wu Fengguang wrote:
> On Wed, May 03, 2006 at 02:45:53PM -0700, Linda Walsh wrote:
>   
>>    1. As you mention; reading files "sequentially" through the file
>> system is "bad" for several reasons.  Areas of interest:
>>    a) don't go through the file system.  Don't waste time doing
>> directory lookups and following file-allocation maps;  Instead,
>> use raw-disk i/o and read sectors in using device & block number.
>>     
>
> Sorry, it does not fit in the linux's cache model.
>   
---
    Maybe linux's cache model needs to be _improved_ to better
allow for hardware acceleration?^**  It is the job of the OS to provide
sufficiently low level facilities to allow optimal use of the system
hardware, while at the same time providing enough high level facilities
to support applications that don't require such tuning.


>>    b) Be "dynamic"; "Trace" (record (dev&blockno/range) blocks
>> starting ASAP after system boot and continuing for some "configurable"
>> number of seconds past reaching the desired "run-level" (coinciding with
>> initial disk quiescence).  Save as "configurable" (~6-8?) number of
>> traces to allow finding the common initial subset of blocks needed.
>>     
>
> It is a alternative way of doing the same job: more precise, with more
> complexity and more overhead.  However the 'blockno' way is not so
> tasteful.
>   
----
Maybe not so tasteful to you, but it is an alternate path that
circumvents unnecessary i/o redirections.  An additional optimization
is to have a "cache" of frequently used "system applications" that have
their pathnames registered, so no run-time searching of the user PATH
is necessary.

>>    c) Allow specification of max# of blocks and max number of "sections"
>> (discontiguous areas on disk);
>>     
>
> Good point, will do it in my work.
>
>   
>>    d) "Ideally", would have a way to "defrag" the common set of blocks.
>> I.e. -- moving the needed blocks from potentially disparate areas of
>> files into 1 or 2 contiguous areas, hopefully near the beginning of
>> the disk (or partition(s)).
>>    That's the area of "boot" pre-caching.
>>     
>
>   
> I guess poor man's defrag would be good enough for the seeking storm.
>   
---
    I disagree. The "poor man's defrag, as you call it, puts entire files
into contiguous sections -- each of which will have to be referenced by 
following
a PATH and directory chain.  The optimization I'm talking about would 
only store
the referenced data-blocks actually used in the files.  This would allow a
directy read into memory of a *bunch* of needed sectors while not including
sectors from the same files that are not actually read from during the 
boot *or*
app. initialization process.

>>    That's "application" pre-caching.
>>     
> Yes, it is a planned feature, will do it.
>   
Très cool.
>>    A third area -- that can't be easily done in the kernel, but would
>> require a higher skill level on the part of application and library
>> developers, is to move towards using "delay-loaded" libraries.  In
>> Windows, it seems common among system libraries to use this feature. 
>> An obvious benefit -- if certain features of a program are not used,
>> the associated libraries are never loaded.  Not loading unneeded parts
>> of a program should speed up initial application load time, significantly.
>>     
> Linux already does lazy loading for linked libs. The only one pitfall
> is that /lib/ld-linux.so.2 seems to touch the first 512B data of every
> libs before doing mmap():
>   
----
    Yes -- and this is the "killer".  If my app may "potentially" use
50 run-time libs, but in any single invocation, only uses 20 of those
libs, the page tables and fixups for the unused 30 libraries don't
need to be done.  In fact, in some configurations, those 30 libs may
not even need to be present on disk!

    Typical example - "Active Directory"; I don't use it.  I don't
need the libraries on my system or need them "resolved" at runtime. 
It would be far preferable if programs would only load those
libraries actually used at run-time -- and load them *dynamically*,
as needed (for libraries that may not actually be called or
used).  This way, the initial time to start the program is
significantly reduced to the "core" set of libraries needed to
run the program.  Optional features are loaded as those features
are called for off disk.  Delays of loading optional libraries
one-two at a time, interactively, are not likely to be noticed,
but if you load all of those "optional" libraries prior to execution,
the sum-total will be noticed in an interactive environment.

^** -- "somewhere", it _seems_, the physical, device relative sector
must be resolved.  If it is not, how is i/o-block buffer consistency
maintained when the user references "device "hda", sector "1000", then
the same sector as "hda1", sector 500, and also as file "/etc/passwd",
sector 0?  _If_ cache consistency is maintained (and I _assume_^**2
it is), they all need to be mapped to a physical sector at some point.

^**2 - Use of assumption noted; feel free to correct me and tell me
this isn't the case if linux doesn't maintain disk-block cache
consistency.


-linda





^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-05 14:44               ` Wu Fengguang
  0 siblings, 0 replies; 48+ messages in thread
From: Wu Fengguang @ 2006-05-05 14:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Badari Pulavarty, lkml, Andrew Morton, Jens Axboe, Nick Piggin

On Thu, May 04, 2006 at 08:03:50AM -0700, Linus Torvalds wrote:
> Actually, I did something even simpler for a totally one-off thing: you 
> don't actually even need to drop caches or track pretty much anything, 
> it's sufficient for most analysis to just have a timestamp on each page 
> cache, and then have some way to read out the current cached contents.
> 
> You can then use the timestamps to get a pretty good idea of what order 
> things happened in.
 
> The really nice thing was that you don't even have to set the timestamp in 
> any complex place: you do it at page _allocation_ time. That automatically 
> gets the right answer for any page cache page, and you can do it in a 
> single place.

Looks nice.  A detailed scheme might be:

1) ctime/atime for each radix tree node (or cluster of pages)
It seems to be a good accuracy/overhead compromise.
And is perfect for the most common case of sequential accesses.

struct radix_tree_node {
+        unsigned long ctime, atime;
} 

radix_tree_node_alloc()
{
+        node->ctime = some_virtual_time();
}

radix_tree_lookup_slot()
{
+        node->atime = some_virtual_time();
}

2) eviction-time for each recently evicted page
Store eviction-time _in place_ in the slot that used to store the
page's address, with minimal space/time impact.

+#define SLOT_IS_EVICTION_TIME  1
radix_tree_delete()
{
-                pathp->node->slots[pathp->offset] = NULL;
+                pathp->node->slots[pathp->offset] =
+                                 some_virtual_time() | SLOT_IS_EVICTION_TIME;
}

Regards,
Wu

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-05 15:20             ` Wu Fengguang
  0 siblings, 0 replies; 48+ messages in thread
From: Wu Fengguang @ 2006-05-05 15:20 UTC (permalink / raw)
  To: Linda Walsh; +Cc: linux-kernel

On Thu, May 04, 2006 at 11:57:21AM -0700, Linda Walsh wrote:
> Wu Fengguang wrote:
> >>   b) Be "dynamic"; "Trace" (record (dev&blockno/range) blocks
> >>starting ASAP after system boot and continuing for some "configurable"
> >>number of seconds past reaching the desired "run-level" (coinciding with
> >>initial disk quiescence).  Save as "configurable" (~6-8?) number of
> >>traces to allow finding the common initial subset of blocks needed.
> >>    
> >
> >It is a alternative way of doing the same job: more precise, with more
> >complexity and more overhead.  However the 'blockno' way is not so
> >tasteful.
> >  
> ----
> Maybe not so tasteful to you, but it is an alternate path that
> circumvents unnecessary i/o redirections.  An additional optimization

Yes, it's about the trade-off between the overhead and the generality
that indirections bring.

> is to have a "cache" of frequently used "system applications" that have
> their pathnames registered, so no run-time seaching of the user PATH
> is necessary.

The facility is already there: it is called dcache.

> >I guess poor man's defrag would be good enough for the seeking storm.
> >  
> ---
>    I disagree. The "poor man's defrag, as you call it, puts entire files
> into contiguous sections -- each of which will have to be referenced by 
> following
> a PATH and directory chain.  The optimization I'm talking about would 
> only store
> the referenced data-blocks actually used in the files.  This would allow a
> directy read into memory of a *bunch* of needed sectors while not including
> sectors from the same files that are not actually read from during the 
> boot *or*
> app. initialization process.

The seek storm is mostly caused by small files (which are read in whole),
and the corresponding scattered dir/inode buffers. By moving these
files into one single directory, both the file data and inode buffers are
kept contiguous enough.

> ^** -- "somewhere", it _seems_, the physical, device relative sector
> must be resolved.  If it is not, how is i/o-block buffer consistency
> maintained when the user references "device "hda", sector "1000", then
> the same sector as "hda1", sector 500, and also as file "/etc/passwd",
> sector 0?  _If_ cache consistency is maintained (and I _assume_^**2
> it is), they all need to be mapped to a physical sector at some point.
> 
> ^**2 - Use of assumption noted; feel free to correct me and tell me
> this isn't the case if linux doesn't maintain disk-block cache
> consistency.

The linux cache model is something like:

                     physical drive             file /dev/hda1
                             | <-----------------> |         
file /bin/ls                 |                     |         
    |<---------------------> |                     |         
    |<---------------------> |                     |         
                             | <-----------------> |         
                             |                     |         
                             |                     |         
file /boot/vmlinuz           |                     |         
    |<---------------------> |                     |         
    |                        |                     |         
    |                        |                     |         
    |<---------------------> |                     |         
                             | <-----------------> |         
                             |                     |         
                             |                     |         
                             |                     |         

As opposed to this one:

                      file /dev/hda1         physical drive
                             | <-----------------> |         
file /bin/ls                 |                     |         
    |<---------------------> |                     |         
    |<---------------------> |                     |         
                             | <-----------------> |         
                             |                     |         
                             |                     |         
file /boot/vmlinuz           |                     |         
    |<---------------------> |                     |         
    |                        |                     |         
    |                        |                     |         
    |<---------------------> |                     |         
                             | <-----------------> |         
                             |                     |         
                             |                     |         
                             |                     |         

Wu

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-06  1:11         ` Wu Fengguang
  0 siblings, 0 replies; 48+ messages in thread
From: Wu Fengguang @ 2006-05-06  1:11 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, linux-kernel, Andrew Morton, Jens Axboe,
	Nick Piggin, Badari Pulavarty

On Wed, May 03, 2006 at 06:20:29PM -0400, Rik van Riel wrote:
> Why not simply read everything in a whole file at a time
> at boot time, while we still have enough free memory ?

Sure it helps, though it may be unnecessary once we can rely on userspace
pre-caching tools to do it in a more accurate and efficient way.

The main concern would be inter-file/inter-buffer readahead.
I have some data to support it:

% wc -l hda* startup.files
 1743 hda2_buffers      <== my /
 335 hda3_buffers       <== my /home
 1130 startup.files
 3208 total

The above cache sizes sum up to ~110MB, so it's about 36KB per seek.

Thanks,
Wu

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] kernel facilities for cache prefetching
  2006-05-02  8:53         ` Wu Fengguang
@ 2006-05-06  6:49           ` Denis Vlasenko
  0 siblings, 0 replies; 48+ messages in thread
From: Denis Vlasenko @ 2006-05-06  6:49 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Arjan van de Ven, linux-kernel, Linus Torvalds, Andrew Morton,
	Jens Axboe, Nick Piggin, Badari Pulavarty

On Tuesday 02 May 2006 11:53, Wu Fengguang wrote:
> On Tue, May 02, 2006 at 10:30:17AM +0200, Arjan van de Ven wrote:
> > one interesting thing that came out of the fedora readahead work is that
> > most of the bootup isn't actually IO bound. My test machine for example
> > can load all the data into ram in about 10 seconds, so that the rest of
> > the boot is basically IO-less. But that still takes 2 minutes....
> > So I'm not entirely sure how much you can win by just attacking this.
> 
> Yes, I find it hard to improve the boot time of the init.d stage.

Then _get rid of your init.d_.

SysV init sequence is insane.

I use a daemontools-based setup and it starts up to the login prompt
in 5-10 secs (counting from the root fs mount).

In my setup, after a few simple startup scripts a svscan process
is called, which starts all services in parallel.
Gettys are services, and start right away.
--
vda

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread

Thread overview: 48+ messages
2006-05-02  7:50 [RFC] kernel facilities for cache prefetching Wu Fengguang
2006-05-02  7:50 ` Wu Fengguang
2006-05-02 12:46   ` Diego Calleja
2006-05-02 14:42     ` Wu Fengguang
2006-05-02 14:42       ` Wu Fengguang
2006-05-02 16:07         ` Diego Calleja
2006-05-03  6:45           ` Wu Fengguang
2006-05-03  6:45             ` Wu Fengguang
2006-05-03 18:14               ` Diego Calleja
2006-05-03 23:39                 ` Zan Lynx
2006-05-04  1:37                   ` Diego Calleja
2006-05-02 15:55   ` Linus Torvalds
2006-05-02 16:35     ` Andi Kleen
2006-05-03  4:11     ` Wu Fengguang
2006-05-03  4:11       ` Wu Fengguang
2006-05-03 17:28       ` Badari Pulavarty
     [not found]         ` <346733486.30800@ustc.edu.cn>
2006-05-04 15:03           ` Linus Torvalds
2006-05-04 16:57             ` Badari Pulavarty
2006-05-05 14:44             ` Wu Fengguang
2006-05-05 14:44               ` Wu Fengguang
2006-05-03  7:13     ` Wu Fengguang
2006-05-03  7:13       ` Wu Fengguang
2006-05-03 12:59     ` Nikita Danilov
2006-05-03 22:20     ` Rik van Riel
2006-05-06  1:11       ` Wu Fengguang
2006-05-06  1:11         ` Wu Fengguang
2006-05-04  0:28     ` Linda Walsh
2006-05-04  1:31       ` Linus Torvalds
2006-05-04  7:08         ` Ph. Marek
2006-05-04  7:33           ` Arjan van de Ven
2006-05-04 12:14             ` Wu Fengguang
2006-05-04 12:14               ` Wu Fengguang
2006-05-04 12:34               ` Arjan van de Ven
2006-05-03 21:45   ` Linda Walsh
2006-05-04 12:12     ` Wu Fengguang
2006-05-04 12:12       ` Wu Fengguang
2006-05-04 18:57         ` Linda Walsh
2006-05-05 15:20           ` Wu Fengguang
2006-05-05 15:20             ` Wu Fengguang
2006-05-04  9:02   ` Helge Hafting
2006-05-02  7:58 ` Arjan van de Ven
2006-05-02  8:06   ` Wu Fengguang
2006-05-02  8:06     ` Wu Fengguang
2006-05-02  8:30     ` Arjan van de Ven
2006-05-02  8:53       ` Wu Fengguang
2006-05-02  8:53         ` Wu Fengguang
2006-05-06  6:49           ` Denis Vlasenko
2006-05-02  8:55         ` Arjan van de Ven
2006-05-02 11:39           ` Jan Engelhardt
2006-05-02 11:48           ` Wu Fengguang
2006-05-02 11:48             ` Wu Fengguang
2006-05-02 22:03       ` Dave Jones
2006-05-02  8:09 ` Jens Axboe
2006-05-02  8:20   ` Wu Fengguang
2006-05-02  8:20     ` Wu Fengguang
2006-05-03 22:05   ` Benjamin LaHaise
2006-05-02 19:10 ` Pavel Machek
2006-05-02 23:36   ` Nigel Cunningham
2006-05-03  2:35     ` Wu Fengguang
2006-05-03  2:35       ` Wu Fengguang
2006-05-03  2:32   ` Wu Fengguang
2006-05-03  2:32     ` Wu Fengguang
2006-05-03  7:19   ` Wu Fengguang
2006-05-03  7:19     ` Wu Fengguang
2006-05-04 12:28   ` Wu Fengguang
2006-05-04 12:28     ` Wu Fengguang
