* Re: Getting FS access events
       [not found] <200105140117.f4E1HqN07362@vindaloo.ras.ucalgary.ca>
@ 2001-05-14  1:32 ` Linus Torvalds
  2001-05-14  1:45   ` Larry McVoy
  2001-05-14  2:39   ` Richard Gooch
  2001-05-14  2:24 ` Richard Gooch
  1 sibling, 2 replies; 75+ messages in thread
From: Linus Torvalds @ 2001-05-14  1:32 UTC (permalink / raw)
  To: Richard Gooch; +Cc: Kernel Mailing List


On Sun, 13 May 2001, Richard Gooch wrote:
>
>   Hi, Linus. I've been thinking more about trying to warm the page
> cache with blocks needed by the bootup process. What is currently
> missing is (AFAIK) a mechanism to find out what inodes and blocks have
> been accessed. Sure, you can use bmap() to convert from file block to
> device block, but first you need to figure out the file blocks
> accessed. I'd like to find out what kind of patch you'd accept to
> provide the missing functionality.

Why would you use bmap() anyway? You CANNOT warm up the page cache with
the physical map nr as discussed. So there's no real point in using
bmap() at any time.

> One approach would be to create a new ioctl(2) for a FS that would
> read out inum,bnum pairs.

Why not just "path,pagenr" instead? You make your instrumentation save
away the whole pathname, by just using the dentry pointer. Many
filesystems don't even _have_ a "inum", so anything less doesn't work
anyway.

Example acceptable approach:

 - save away full dentry and page number. Don't make it an ioctl. Think
   "profiling" - this is _exactly_ the same thing, and profiling uses a
	(a) command line argument to turn it on
	(b) /proc/profile
   (and because you have the full pathname, you should just make the dang
   /proc/fsaccess file be ASCII)

 - add a "prefetch()" system call that does all the same things
   "read()" does, but doesn't actually wait for (or transfer) the
   data. Basically just a read-ahead thing. So you'd basically end up
   doing

	foreach (filename in /proc/fsaccess)
		fd = open(filename);
		foreach (sorted pagenr for filename in /proc/fsaccess)
			prefetch(fd, pagenr);
		end
	end
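   (A minimal userspace sketch of that loop -- assuming the ASCII
   /proc/fsaccess proposed above holds "pathname pagenr" lines sorted
   per file, and using posix_fadvise(POSIX_FADV_WILLNEED) as a
   stand-in for the proposed prefetch() call, since neither exists as
   described here:)

	#define _XOPEN_SOURCE 600
	#include <stdio.h>
	#include <string.h>
	#include <sys/types.h>
	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		FILE *log = fopen("/proc/fsaccess", "r");	/* proposed file */
		char path[4096], prev[4096] = "";
		long pagesize = sysconf(_SC_PAGESIZE);
		unsigned long pagenr;
		int fd = -1;

		if (!log)
			return 1;
		while (fscanf(log, "%4095s %lu", path, &pagenr) == 2) {
			if (strcmp(path, prev) != 0) {	/* next file in the log */
				if (fd >= 0)
					close(fd);
				fd = open(path, O_RDONLY);
				strcpy(prev, path);
			}
			if (fd >= 0)	/* queue read-ahead, don't wait for data */
				posix_fadvise(fd, (off_t)pagenr * pagesize,
					      pagesize, POSIX_FADV_WILLNEED);
		}
		if (fd >= 0)
			close(fd);
		fclose(log);
		return 0;
	}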

Forget about all these crappy "ioctl" ideas. Basic rule of thumb: if you
think an ioctl is a good idea, you're (a) being stupid and (b) thinking
wrong and (c) on the wrong track.

And notice how there's not a single bmap anywhere, and not a single "raw
device open" anywhere.

		Linus



* Re: Getting FS access events
  2001-05-14  1:32 ` Getting FS access events Linus Torvalds
@ 2001-05-14  1:45   ` Larry McVoy
  2001-05-14  2:39   ` Richard Gooch
  1 sibling, 0 replies; 75+ messages in thread
From: Larry McVoy @ 2001-05-14  1:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Richard Gooch, Kernel Mailing List

On Sun, May 13, 2001 at 06:32:02PM -0700, Linus Torvalds wrote:
> >   Hi, Linus. I've been thinking more about trying to warm the page
> > cache with blocks needed by the bootup process. What is currently
> > missing is (AFAIK) a mechanism to find out what inodes and blocks have
> > been accessed. Sure, you can use bmap() to convert from file block to
> > device block, but first you need to figure out the file blocks
> > accessed. I'd like to find out what kind of patch you'd accept to
> > provide the missing functionality.
> 
> Why would you use bmap() anyway? You CANNOT warm up the page cache with
> the physical map nr as discussed. So there's no real point in using
> bmap() at any time.

Ha.  For once you're both wrong but not where you are thinking.  One of
the few places that I actually hacked Linux was for exactly this - it was
in the 0.99 days I think.  I saved the list of I/O's in a file and filled
the buffer cache with them at next boot.  It actually didn't help at all.

I don't remember why, maybe it was back so long ago that I didn't have the
memory, but I think it was more subtle than that.  It's basically a queuing
problem and my instincts were wrong, I thought if I could get all the data
in there then things would go faster.  If you think through all the stuff
going on during a boot it doesn't really work that way.

Anyway, a _much_ better thing to do would be to have all this data laid
out contig, then slurp in all the blocks in one I/O and then let them get
turned into files.  This has been true for the last 30 years and people
still don't do it.  We're actually moving in this direction with BitKeeper,
in the future, large numbers of small files will be stored in one big file
and extracted on demand.  Then we do one I/O to get all the related stuff.

Dave Hitz at NetApp is about the only guy I know who really gets this,
Daniel Phillips may also get it, he's certainly thinking about it.  Lots
of little I/O's == bad, one big I/O == good.  Work through the numbers 
and it starts to look like you'd never want to do less than a 1MB I/O,
probably not less than a 4MB I/O.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 


* Re: Getting FS access events
       [not found] <200105140117.f4E1HqN07362@vindaloo.ras.ucalgary.ca>
  2001-05-14  1:32 ` Getting FS access events Linus Torvalds
@ 2001-05-14  2:24 ` Richard Gooch
  2001-05-14  4:46   ` Linus Torvalds
  2001-05-14  5:15   ` Richard Gooch
  1 sibling, 2 replies; 75+ messages in thread
From: Richard Gooch @ 2001-05-14  2:24 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

Linus Torvalds writes:
> 
> On Sun, 13 May 2001, Richard Gooch wrote:
> >
> >   Hi, Linus. I've been thinking more about trying to warm the page
> > cache with blocks needed by the bootup process. What is currently
> > missing is (AFAIK) a mechanism to find out what inodes and blocks have
> > been accessed. Sure, you can use bmap() to convert from file block to
> > device block, but first you need to figure out the file blocks
> > accessed. I'd like to find out what kind of patch you'd accept to
> > provide the missing functionality.
> 
> Why would you use bmap() anyway? You CANNOT warm up the page cache
> with the physical map nr as discussed. So there's no real point in
> using bmap() at any time.

Think about it:-) You need to generate prefetch accesses in ascending
device bnum order. So the bmap() is there to tell you those device
bnums. You'd still prefetch using file bnums, but the *ordering* is
done based on device bnum. In fact, once the list is sorted, you can
chuck out the device bnums. You only need to store inum/path and file
bnum in the database.
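
(Purely for illustration: from userspace the device bnum for a given
file bnum can be read with the FIBMAP ioctl -- this needs root, and
the sketch below ignores holes and errors; it only shows the sort
key, not a finished tool:)

	#include <sys/ioctl.h>
	#include <linux/fs.h>		/* FIBMAP */

	/* Map file block 'file_bnum' of the open file 'fd' to the device
	 * block backing it.  The result is used only for ordering the
	 * prefetch list; the prefetches themselves still use file bnums. */
	static int device_bnum(int fd, int file_bnum)
	{
		int blk = file_bnum;	/* in: file block, out: device block */

		if (ioctl(fd, FIBMAP, &blk) < 0)
			return 0;	/* treat errors and holes as "don't care" */
		return blk;
	}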

> > One approach would be to create a new ioctl(2) for a FS that would
> > read out inum,bnum pairs.
> 
> Why not just "path,pagenr" instead? You make your instrumentation save
> away the whole pathname, by just using the dentry pointer. Many
> filesystems don't even _have_ a "inum", so anything less doesn't work
> anyway.

Sure, this would work too. I'm a bit worried about the increased
amount of traffic this will generate.

> Example acceptable approach:
> 
>  - save away full dentry and page number. Don't make it an ioctl. Think
>    "profiling" - this is _exactly_ the same thing, and profiling uses a
> 	(a) command line argument to turn it on
> 	(b) /proc/profile
>    (and because you have the full pathname, you should just make the dang
>    /proc/fsaccess file be ASCII)

So on every page fault or read(2) call, we have to generate the full
path from the dentry? Isn't that going to add a fair bit of overhead?
Remember, we want to do this on every boot (to keep the database as
up-to-date as possible).

>  - add a "prefetch()" system call that does all the same things
>    "read()" does, but doesn't actually wait for (or transfer) the
>    data. Basically just a read-ahead thing. So you'd basically end up
>    doing
> 
> 	foreach (filename in /proc/fsaccess)
> 		fd = open(filename);
> 		foreach (sorted pagenr for filename in /proc/fsaccess)
> 			prefetch(fd, pagenr);
> 		end
> 	end

I don't see the advantage of the prefetch(2) system call. It seems to
me I can get the same effect by just making read(2) calls in another
task. Of course, I'd need to use bmap() to generate the sort key, but
I don't see why that's a bad thing.

> Forget about all these crappy "ioctl" ideas. Basic rule of thumb: if
> you think an ioctl is a good idea, you're (a) being stupid and (b)
> thinking wrong and (c) on the wrong track.

Don't hold back now. Tell us what you really think :-)

> And notice how there's not a single bmap anywhere, and not a single
> "raw device open" anywhere.

I don't mind the /proc/fsaccess approach, I'm just worried about the
overhead of doing the dentry->pathname conversions on each fault/read.

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca


* Re: Getting FS access events
  2001-05-14  1:32 ` Getting FS access events Linus Torvalds
  2001-05-14  1:45   ` Larry McVoy
@ 2001-05-14  2:39   ` Richard Gooch
  2001-05-14  3:09     ` Rik van Riel
                       ` (2 more replies)
  1 sibling, 3 replies; 75+ messages in thread
From: Richard Gooch @ 2001-05-14  2:39 UTC (permalink / raw)
  To: Larry McVoy; +Cc: Linus Torvalds, Kernel Mailing List

Larry McVoy writes:
> On Sun, May 13, 2001 at 06:32:02PM -0700, Linus Torvalds wrote:
> > >   Hi, Linus. I've been thinking more about trying to warm the page
> > > cache with blocks needed by the bootup process. What is currently
> > > missing is (AFAIK) a mechanism to find out what inodes and blocks have
> > > been accessed. Sure, you can use bmap() to convert from file block to
> > > device block, but first you need to figure out the file blocks
> > > accessed. I'd like to find out what kind of patch you'd accept to
> > > provide the missing functionality.
> > 
> > Why would you use bmap() anyway? You CANNOT warm up the page cache with
> > the physical map nr as discussed. So there's no real point in using
> > bmap() at any time.
> 
> Ha.  For once you're both wrong but not where you are thinking.  One
> of the few places that I actually hacked Linux was for exactly this
> - it was in the 0.99 days I think.  I saved the list of I/O's in a
> file and filled the buffer cache with them at next boot.  It
> actually didn't help at all.

Maybe you did something wrong :-) Seriously, maybe you're right, and
maybe not. I'd like to find out, and having the infrastructure to get
FS access events will help in that (as well as your preferred
approach: see below). If I am digging into a rathole, I'll do it with
my eyes open ;-)

> I don't remember why, maybe it was back so long ago that I didn't
> have the memory, but I think it was more subtle than that.  It's
> basically a queuing problem and my instincts were wrong, I thought
> if I could get all the data in there then things would go faster.
> If you think through all the stuff going on during a boot it doesn't
> really work that way.

Well, on my machines anyway, the discs rattle an awful lot during
bootup. Not just little adjacent seeks, but big, partition crossing
seeks.

> Anyway, a _much_ better thing to do would be to have all this data
> laid out contig, then slurp in all the blocks in one I/O and then let
> them get turned into files.  This has been true for the last 30
> years and people still don't do it.  We're actually moving in this
> direction with BitKeeper, in the future, large numbers of small
> files will be stored in one big file and extracted on demand.  Then
> we do one I/O to get all the related stuff.

Yeah, we need a decent unfragmenter. We can do that now with bmap().
But to speed up boots, for example, we need to lay all the inodes that
are accessed during boot in one contiguous chunk on the disc. Again,
we need to know which files are being accessed to know that.
/proc/fsaccess would tell us that.

The down side of just relying on contiguous files is that some files
(especially bloated C libraries) are not fully used. I would not be at
all surprised if more than 75% of glibc is not (or rarely) used.
There's a lot of stuff in there that isn't used very often.

However, a *refragmenter* might be interesting. Find out which blocks
in which files are actually used during boot, and lay just those out
in a contiguous section. *That* would smoke!

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca


* Re: Getting FS access events
  2001-05-14  2:39   ` Richard Gooch
@ 2001-05-14  3:09     ` Rik van Riel
  2001-05-14  4:27     ` Richard Gooch
  2001-05-15  4:37     ` Chris Wedgwood
  2 siblings, 0 replies; 75+ messages in thread
From: Rik van Riel @ 2001-05-14  3:09 UTC (permalink / raw)
  To: Richard Gooch; +Cc: Larry McVoy, Linus Torvalds, Kernel Mailing List

On Sun, 13 May 2001, Richard Gooch wrote:
> Larry McVoy writes:

> > Ha.  For once you're both wrong but not where you are thinking.  One
> > of the few places that I actually hacked Linux was for exactly this
> > - it was in the 0.99 days I think.  I saved the list of I/O's in a
> > file and filled the buffer cache with them at next boot.  It
> > actually didn't help at all.
> 
> Maybe you did something wrong :-)

How about "the data loads got instrumented, but the metadata
loads which caused over half of the disk seeks didn't" ?

(just a wild guess ... if it turns out to be true we may want
to look into doing aggressive readahead on inode blocks ;))

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)



* Re: Getting FS access events
  2001-05-14  2:39   ` Richard Gooch
  2001-05-14  3:09     ` Rik van Riel
@ 2001-05-14  4:27     ` Richard Gooch
  2001-05-15  4:37     ` Chris Wedgwood
  2 siblings, 0 replies; 75+ messages in thread
From: Richard Gooch @ 2001-05-14  4:27 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Larry McVoy, Linus Torvalds, Kernel Mailing List

Rik van Riel writes:
> On Sun, 13 May 2001, Richard Gooch wrote:
> > Larry McVoy writes:
> 
> > > Ha.  For once you're both wrong but not where you are thinking.  One
> > > of the few places that I actually hacked Linux was for exactly this
> > > - it was in the 0.99 days I think.  I saved the list of I/O's in a
> > > file and filled the buffer cache with them at next boot.  It
> > > actually didn't help at all.
> > 
> > Maybe you did something wrong :-)
> 
> How about "the data loads got instrumented, but the metadata
> loads which caused over half of the disk seeks didn't" ?
> 
> (just a wild guess ... if it turns out to be true we may want
> to look into doing aggressive readahead on inode blocks ;))

Caching metadata is definitely part of my cunning plan. I'd like to
think that once Al's metadata-in-page-cache patches go in, we'll get
that for free.

However, that will still leave indirect blocks unordered. I don't see
a clean way of fixing that. Which is why doing things at the block
device layer has its attractions (except it doesn't work).

Hm. Is there a reason why the page cache can't see if a block is in
the block cache, and read it from there first?

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca


* Re: Getting FS access events
  2001-05-14  2:24 ` Richard Gooch
@ 2001-05-14  4:46   ` Linus Torvalds
  2001-05-14  5:15   ` Richard Gooch
  1 sibling, 0 replies; 75+ messages in thread
From: Linus Torvalds @ 2001-05-14  4:46 UTC (permalink / raw)
  To: Richard Gooch; +Cc: Kernel Mailing List


On Sun, 13 May 2001, Richard Gooch wrote:
> 
> Think about it:-) You need to generate prefetch accesses in ascending
> device bnum order.

I seriously doubt it is worth it.

The kernel will do the ordering for you anyway: that's what the elevator
is, and that's why you have a "prefetch" system call (to avoid the
synchronization that kills the elevator). And you'll end up wanting to
pre-fetch on virtual addresses, which implies that you have to open the
files: I doubt you want to have tons of files open and try to get a
"global" order.

But sure, you can use bmap if you want. It would be interesting to hear
whether it makes much of a difference..

> > Why not just "path,pagenr" instead? You make your instrumentation save
> > away the whole pathname, by just using the dentry pointer. Many
> > filesystems don't even _have_ a "inum", so anything less doesn't work
> > anyway.
> 
> Sure, this would work too. I'm a bit worried about the increased
> amount of traffic this will generate.

No increased traffic. "path" is a pointer (to a dentry), ie 32
bits. "ino" is at least 128 bits on some filesystems. You make for _less_
data to save.

> So on every page fault or read(2) call, we have to generate the full
> path from the dentry? Isn't that going to add a fair bit of overhead?

You just save the dentry pointer. You do the path _later_, when somebody
reads it away from the /proc file.

> I don't see the advantage of the prefetch(2) system call. It seems to
> me I can get the same effect by just making read(2) calls in another
> task. Of course, I'd need to use bmap() to generate the sort key, but
> I don't see why that's a bad thing.

Try it. You won't be able to. "read()" is an inherently synchronizing
operation, and you cannot get _any_ overlap with multiple reads, except
for the pre-fetching that the kernel will do for you anyway.

And when it comes to IO and the elevator, overlap is where it
matters. Sending out several tagged commands to the disk in one go.

You'd have to have multiple processes doing the reads to get the same kind
of performance. Much easier to do "prefetch()", when that's really what
you want anyway.

Remember, you're not interested in the data. You're just populating the
cache.

			Linus



* Re: Getting FS access events
  2001-05-14  2:24 ` Richard Gooch
  2001-05-14  4:46   ` Linus Torvalds
@ 2001-05-14  5:15   ` Richard Gooch
  2001-05-14 13:04     ` Daniel Phillips
                       ` (2 more replies)
  1 sibling, 3 replies; 75+ messages in thread
From: Richard Gooch @ 2001-05-14  5:15 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

Linus Torvalds writes:
> 
> On Sun, 13 May 2001, Richard Gooch wrote:
> > 
> > Think about it:-) You need to generate prefetch accesses in ascending
> > device bnum order.
> 
> I seriously doubt it is worth it.
> 
> The kernel will do the ordering for you anyway: that's what the
> elevator is, and that's why you have a "prefetch" system call (to
> avoid the synchronization that kills the elevator). And you'll end
> up wanting to pre-fetch on virtual addresses, which implies that you
> have to open the files: I doubt you want to have tons of files open
> and try to get a "global" order.

OK, provided the prefetch will queue up a large number of requests
before starting the I/O. If there was a way of controlling when the
I/O actually starts (say by having a START flag), that would be ideal,
I think.

> But sure, you can use bmap if you want. It would be interesting to
> hear whether it makes much of a difference..

I doubt bmap() would make any difference if there is a way of
controlling when the I/O starts.

However, this still doesn't address the issue of indirect blocks. If
the indirect block has a higher bnum than the data blocks it points
to, you've got a costly seek. This is why I'm still attracted to the
idea of doing this at the block device layer. It's easy to capture
*all* accesses and then warm the buffer cache.

So, why can't the page cache check if a block is in the buffer cache?

> > Sure, this would work too. I'm a bit worried about the increased
> > amount of traffic this will generate.
> 
> No increased traffic. "path" is a pointer (to a dentry), ie 32
> bits. "ino" is at least 128 bits on some filesystems. You make for _less_
> data to save.
> 
> > So on every page fault or read(2) call, we have to generate the full
> > path from the dentry? Isn't that going to add a fair bit of overhead?
> 
> You just save the dentry pointer. You do the path _later_, when
> somebody reads it away from the /proc file.

That opens up a nasty race: if the dentry is released before the
pointer is harvested, you get a bogus pointer.

> > I don't see the advantage of the prefetch(2) system call. It seems to
> > me I can get the same effect by just making read(2) calls in another
> > task. Of course, I'd need to use bmap() to generate the sort key, but
> > I don't see why that's a bad thing.
> 
> Try it. You won't be able to. "read()" is an inherently
> synchronizing operation, and you cannot get _any_ overlap with
> multiple reads, except for the pre-fetching that the kernel will do
> for you anyway.

How's that? It won't matter if read(2) synchronises, because I'll be
issuing the requests in device bnum order.

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca


* Re: Getting FS access events
  2001-05-14  5:15   ` Richard Gooch
@ 2001-05-14 13:04     ` Daniel Phillips
  2001-05-14 18:00       ` Andreas Dilger
  2001-05-14 20:16     ` Linus Torvalds
  2001-05-14 23:19     ` Richard Gooch
  2 siblings, 1 reply; 75+ messages in thread
From: Daniel Phillips @ 2001-05-14 13:04 UTC (permalink / raw)
  To: Richard Gooch, Linus Torvalds; +Cc: Kernel Mailing List

On Monday 14 May 2001 07:15, Richard Gooch wrote:
> Linus Torvalds writes:
> > But sure, you can use bmap if you want. It would be interesting to
> > hear whether it makes much of a difference..
>
> I doubt bmap() would make any difference if there is a way of
> controlling when the I/O starts.
>
> However, this still doesn't address the issue of indirect blocks. If
> the indirect block has a higher bnum than the data blocks it points
> to, you've got a costly seek. This is why I'm still attracted to the
> idea of doing this at the block device layer. It's easy to capture
> *all* accesses and then warm the buffer cache.
>
> So, why can't the page cache check if a block is in the buffer cache?

That's not quite what you want, if only because there won't be anything 
in the buffer cache pretty soon.  What we really want is a block cache, 
tightly integrated with the page cache.  Readahead with a block cache 
would be more effective than our current file-based readahead.  For 
example, it handles the case where blocks of two files are interleaved.

Since we know that the page cache maps each block at most once, the 
optimal thing to do would be to just move a pointer from the block 
cache to the page cache whenever we can.  Unfortunately the layering in 
the VFS as it stands isn't friendly to this: typically we allocate a 
page in generic_file_read long before we ask the filesystem to map it.  
To test this zero-copy idea we'd need to replace generic_file_read and 
for mmap, filemap_nopage.

But we don't need anything so fancy to try out your idea, we just need 
a lvm-like device that can:

  - Maintain a block cache
  - Remap logical to physical blocks
  - Record the block accesses
  - Physically reorder the blocks according to the recorded order
  - Load a given region of disk into the block cache on command

None of this has to be particularly general to get to the benchmarking 
stage.  E.g, the 'block cache' only needs to cache one physical region.
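
A tiny sketch of the remapping and recording core such a driver would 
need (all names here are invented, no locking or error handling):

	/* phys[] maps logical to physical blocks; access_log remembers the
	 * logical blocks in first-access order so they can later be laid
	 * out contiguously.  The driver's request path would call
	 * remap_block() for every block it handles. */
	struct remap_dev {
		unsigned long nblocks;
		unsigned long *phys;		/* phys[logical] = physical block */
		unsigned long *access_log;	/* logical blocks, in access order */
		unsigned long log_len;
	};

	static unsigned long remap_block(struct remap_dev *dev, unsigned long logical)
	{
		if (dev->log_len < dev->nblocks)
			dev->access_log[dev->log_len++] = logical;
		return dev->phys[logical];
	}

Physically reordering the blocks then amounts to moving the data and 
rewriting phys[] so that the logged order becomes sequential on disk.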

The central idea here is that you obviously can't do any better than to 
have all the blocks you want to read at boot physically together on 
disk.

The advantage of using this lvm-style remapping is, it will work for 
any filesystem.  The disadvantage is that the ordering is then cast in 
stone - after the system is up it might not like the ordering you chose 
for the boot, and the elevator will be completely confused ;-)  But the 
thing is, everything you need to measure the boot performance is 
together in one place, just one device driver to write.  Then once you 
know what the perfect result is you have a yardstick to measure the 
effectiveness of other, less intrusive approaches.

I took a look at the lvm and md code to see if there's a quick way to 
press them into service for this test, and there probably is, but the 
complexity there is daunting.  I think starting with a clean sheet and 
writing a new driver would be easier.

--
Daniel


* Re: Getting FS access events
  2001-05-14 13:04     ` Daniel Phillips
@ 2001-05-14 18:00       ` Andreas Dilger
  0 siblings, 0 replies; 75+ messages in thread
From: Andreas Dilger @ 2001-05-14 18:00 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Richard Gooch, Linus Torvalds, Kernel Mailing List

Daniel writes:
> But we don't need anything so fancy to try out your idea, we just need 
> a lvm-like device that can:
> 
>   - Maintain a block cache
>   - Remap logical to physical blocks
>   - Record the block accesses
>   - Physically reorder the blocks according to the recorded order
>   - Load a given region of disk into the block cache on command

The current LVM device (if compiled with DEBUG_MAP) will report all of
the logical->physical block mappings via printk.  Probably too heavy-
weight for a large amount of IO.  It could be changed to save the block
numbers into a cache, to be extracted later.  All of the LVM mapping
is done in the lvm_map() function.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert


* Re: Getting FS access events
  2001-05-14  5:15   ` Richard Gooch
  2001-05-14 13:04     ` Daniel Phillips
@ 2001-05-14 20:16     ` Linus Torvalds
  2001-05-14 23:19     ` Richard Gooch
  2 siblings, 0 replies; 75+ messages in thread
From: Linus Torvalds @ 2001-05-14 20:16 UTC (permalink / raw)
  To: Richard Gooch; +Cc: Kernel Mailing List



On Sun, 13 May 2001, Richard Gooch wrote:
>
> OK, provided the prefetch will queue up a large number of requests
> before starting the I/O. If there was a way of controlling when the
> I/O actually starts (say by having a START flag), that would be ideal,
> I think.

Ehh. The "start" flag is when you actually start reading, or when you've
prefetched so much that the queue has filled up. That's the behaviour
you'd get naturally, and it's the behaviour you want.

> So, why can't the page cache check if a block is in the buffer cache?

Because it would make the damn thing slower.

The whole point of the page cache is to be FAST FAST FAST. The reason we
_have_ a page cache is that the buffer cache is slow and inefficient, and
it will always remain so.

We want to get _away_ from the buffer cache, not add support for a legacy
cache into the new and more efficient one.

And remember: when raw devices are in the page cache, you simply WILL NOT
HAVE a buffer cache at all.

Just stop this line of thought. It's not going anywhere.

> That opens up a nasty race: if the dentry is released before the
> pointer is harvested, you get a bogus pointer.

..which is why you increment the dentry count when you profile it, and
decrement it when you have output the path...
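
(A rough sketch of that, with invented helper names -- the point is
only that the hot path stores a pinned pointer and the pathname is
built later, when /proc/fsaccess is read:)

	/* At access time, just pin the dentry and remember the page number. */
	struct fsaccess_entry {
		struct dentry *dentry;		/* pinned with dget() */
		unsigned long pagenr;
	};

	static void fsaccess_log(struct dentry *dentry, unsigned long pagenr)
	{
		struct fsaccess_entry *e = fsaccess_next_slot();	/* invented */

		e->dentry = dget(dentry);	/* take a reference, no path walk here */
		e->pagenr = pagenr;
	}

	/* Only when userspace reads /proc/fsaccess is each entry turned into
	 * an ASCII "pathname pagenr" line with d_path(), and the reference
	 * dropped again with dput(). */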

> > Try it. You won't be able to. "read()" is an inherently
> > synchronizing operation, and you cannot get _any_ overlap with
> > multiple reads, except for the pre-fetching that the kernel will do
> > for you anyway.
>
> How's that? It won't matter if read(2) synchronises, because I'll be
> issuing the requests in device bnum order.

Ehh.. You don't seem to know how disks work.

By the time you follow up with the next "read", the platter will probably
have rotated past the point you want to read. You need to have multiple
outstanding requests (or _biiig_ requests) to get close to platter speed.

[ Aside: with most IDE stuff doing extensive track buffering, you won't
  see this as much. It depends on the disk, the cache size, and the
  buffering characteristics. ]

		Linus



* Re: Getting FS access events
  2001-05-14  5:15   ` Richard Gooch
  2001-05-14 13:04     ` Daniel Phillips
  2001-05-14 20:16     ` Linus Torvalds
@ 2001-05-14 23:19     ` Richard Gooch
  2001-05-15  0:42       ` Daniel Phillips
                         ` (2 more replies)
  2 siblings, 3 replies; 75+ messages in thread
From: Richard Gooch @ 2001-05-14 23:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

Linus Torvalds writes:
> 
> 
> On Sun, 13 May 2001, Richard Gooch wrote:
> >
> > OK, provided the prefetch will queue up a large number of requests
> > before starting the I/O. If there was a way of controlling when the
> > I/O actually starts (say by having a START flag), that would be ideal,
> > I think.
> 
> Ehh. The "start" flag is when you actually start reading,

That would be OK.

> > So, why can't the page cache check if a block is in the buffer cache?
> 
> Because it would make the damn thing slower.
> 
> The whole point of the page cache is to be FAST FAST FAST. The
> reason we _have_ a page cache is that the buffer cache is slow and
> inefficient, and it will always remain so.

Is there some fundamental reason why a buffer cache can't ever be
fast?

> We want to get _away_ from the buffer cache, not add support for a legacy
> cache into the new and more efficient one.
> 
> And remember: when raw devices are in the page cache, you simply WILL NOT
> HAVE a buffer cache at all.
> 
> Just stop this line of thought. It's not going anywhere.

I'm just going back to it because I don't see how we can otherwise
handle this case:
- inode at block N
- indirect block at N+k+j
- data block at N+k

and have the prefetch read blocks N, N+k and N+k+j in that order.
Reading them via the FS will result in two seeks, because we need to
read N before we know to read N+k+j, and we need to read N+k+j before
we know to read N+k.

Doing the work at the block device layer makes this simple. However,
if there was a way of doing this at the page cache level, then I'd be
happy.

> > > Try it. You won't be able to. "read()" is an inherently
> > > synchronizing operation, and you cannot get _any_ overlap with
> > > multiple reads, except for the pre-fetching that the kernel will do
> > > for you anyway.
> >
> > How's that? It won't matter if read(2) synchronises, because I'll be
> > issuing the requests in device bnum order.
> 
> Ehh.. You don't seem to know how disks work.
> 
> By the time you follow up with the next "read", the platter will
> probably have rotated past the point you want to read. You need to
> have multiple outstanding requests (or _biiig_ requests) to get
> close to platter speed.

Sure, I know about rotational latency. I'm counting on read-ahead.

> [ Aside: with most IDE stuff doing extensive track buffering, you won't
>   see this as much. It depends on the disk, the cache size, and the
>   buffering characteristics. ]

These days, even IDE drives come with 2 MiB of cache or more.

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca


* Re: Getting FS access events
  2001-05-14 23:19     ` Richard Gooch
@ 2001-05-15  0:42       ` Daniel Phillips
  2001-05-15  4:00       ` Linus Torvalds
  2001-05-15  6:13       ` Richard Gooch
  2 siblings, 0 replies; 75+ messages in thread
From: Daniel Phillips @ 2001-05-15  0:42 UTC (permalink / raw)
  To: Richard Gooch, Linus Torvalds; +Cc: Kernel Mailing List

On Tuesday 15 May 2001 01:19, Richard Gooch wrote:
> Linus Torvalds writes:
> > On Sun, 13 May 2001, Richard Gooch wrote:
> > > So, why can't the page cache check if a block is in the buffer
> > > cache?
> >
> > Because it would make the damn thing slower.
> >
> > The whole point of the page cache is to be FAST FAST FAST. The
> > reason we _have_ a page cache is that the buffer cache is slow and
> > inefficient, and it will always remain so.
>
> Is there some fundamental reason why a buffer cache can't ever be
> fast?

Just looking at getblk, it takes one more lock than read_cache_page 
(these are noops in UP) and otherwise has very nearly the same sequence 
of operations.  This can't be the slowness he's talking about.

I know of three ways the buffer cache earned its reputation for 
slowness:  1) There used to be a copy from the buffer cache to page 
cache on every write, to keep the two in sync 2) Having the same data 
in both the buffer and page cache created extra memory pressure 3) To 
get at file data through the buffer cache you have to traverse all the 
index blocks every time, whereas with the logically-indexed page cache 
you go straight to the page data, if it's there, and in theory[1], only 
up as many levels of index as you have to.

Once you have looked into the page cache and know the page isn't there 
you know you are going to have to read it.  At this point, the overhead 
of hashing into, say, the buffer cache to see if the block is there is 
trivial.  Just one saved read by doing that will be worth hundreds of 
hash lookups.  But why use the buffer cache?  The page cache will work 
perfectly well for this.

There's a big saving in using a block cache for readahead instead of 
file-oriented readahead: if we guess wrong and don't actually need the 
readahead blocks then we paid less to get them - we didn't call into 
the filesystem to map each one.  Additionally, a block cache can do 
things that file readahead can't, as you showed in your example:

> - inode at block N
> - indirect block at N+k+j
> - data block at N+k

Another example is where you have blocks from two different files mixed 
together, and you read both of those files.

Note that your scsi disk controller is keeping a cache for you over on 
its side of the bus.  This erodes the benefit of the block cache 
somewhat, but the same argument applies to file readahead.  For all 
people who don't have scsi the block cache would be a big win.

[1] This remains theoretical until we get the indirect blocks into the 
page cache.

--
Daniel



* Re: Getting FS access events
  2001-05-14 23:19     ` Richard Gooch
  2001-05-15  0:42       ` Daniel Phillips
@ 2001-05-15  4:00       ` Linus Torvalds
  2001-05-15  4:35         ` Larry McVoy
                           ` (3 more replies)
  2001-05-15  6:13       ` Richard Gooch
  2 siblings, 4 replies; 75+ messages in thread
From: Linus Torvalds @ 2001-05-15  4:00 UTC (permalink / raw)
  To: Richard Gooch; +Cc: Kernel Mailing List


On Mon, 14 May 2001, Richard Gooch wrote:
> 
> Is there some fundamental reason why a buffer cache can't ever be
> fast?

Yes.

Or rather, there is a fundamental reason why we must NEVER EVER look at
the buffer cache: it is not coherent with the page cache. 

And keeping it coherent would be _extremely_ expensive. How do we
know? Because we used to do that. Remember the small mindcraft
benchmark? Yup. Double copies all over the place, double lookups, double
everything.

You could think: "oh, we only need to look up the buffer cache when we
create a new page cache mapping, so..".

You'd be wrong. We'd need to go the other way too: every time we create a
new buffer cache entry, we'd need to make sure that it isn't mapped
somewhere in the page cache (impossible), or otherwise we'd do the wrong
thing sometimes (ie we might have two dirty copies, and we wouldn't know
_which_ one is valid etc).

Aliasing is bad. Don't do it.

Really. Give it up. Your silly "make bootup faster" is not going to happen
this way. You're trying to break some rather fundamental data structures,
all for the unusual case of booting up. There are other ways to boot up
quickly: look into pre-filling your memory image (aka "resume from disk"),
which I will _guarantee_ you is a lot faster than anything else you can
come up with, and which doesn't have the downsides that your approach has.

You know, the mark of intelligence is realizing when you're making the
same mistake over and over and over again, and not hitting your head in
the wall five hundred times before you understand that it's not a clever
thing to do.

Please show some intelligence.

			Linus



* Re: Getting FS access events
  2001-05-15  4:00       ` Linus Torvalds
@ 2001-05-15  4:35         ` Larry McVoy
  2001-05-15  4:59           ` Alexander Viro
  2001-05-15  4:43         ` Linus Torvalds
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 75+ messages in thread
From: Larry McVoy @ 2001-05-15  4:35 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Richard Gooch, Kernel Mailing List

On Mon, May 14, 2001 at 09:00:44PM -0700, Linus Torvalds wrote:
> Or rather, there is a fundamental reason why we must NEVER EVER look at
> the buffer cache: it is not coherent with the page cache. 

Not that Linus needs any backing up but Sun got rid of the buffer cache
and just had a page cache in SunOS 4.0, which was before I got there, I 
suspect something like 15 years ago.  It was a good move.  SunOS was 
an extremely pleasant place to work, all you had to understand was 
vnode,offset and you basically understood the VM system.

It is so _blindingly_ obvious that Linus is right, it's been proven,
you don't have to think about it, just read some history.

Hell, that's the OS that gave us mmap, remember that?  

> Really. Give it up. Your silly "make bootup faster" is not going to happen
> this way. You're trying to break some rather fundamental data structures,
> all for the unusual case of booting up. There are other ways to boot up
> quickly: look into pre-filling your memory image (aka "resume from disk"),

Which is pretty much what I have been asking for, in a general way, for
a long time.  I've wanted "directory clustering" forever, where you read
one file, read the next, and go into "file readahead mode" wherein you
slurp in the entire directories worth of files in one I/O.  If we had
that, not only would we go faster in general, you could easily tweak it
slightly for the fast bootup.  

> You know, the mark of intelligence is realizing when you're making the
> same mistake over and over and over again, and not hitting your head in
> the wall five hundred times before you understand that it's not a clever
> thing to do.
> 
> Please show some intelligence.

Those who don't learn from history are doomed to repeat it, eh?
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 


* Re: Getting FS access events
  2001-05-14  2:39   ` Richard Gooch
  2001-05-14  3:09     ` Rik van Riel
  2001-05-14  4:27     ` Richard Gooch
@ 2001-05-15  4:37     ` Chris Wedgwood
  2001-05-23 11:37       ` Stephen C. Tweedie
  2 siblings, 1 reply; 75+ messages in thread
From: Chris Wedgwood @ 2001-05-15  4:37 UTC (permalink / raw)
  To: Richard Gooch; +Cc: Larry McVoy, Linus Torvalds, Kernel Mailing List

On Sun, May 13, 2001 at 08:39:23PM -0600, Richard Gooch wrote:

    Yeah, we need a decent unfragmenter. We can do that now with
    bmap().

SCT wrote a defragger for ext2 but it only handles 1k blocks :(
Making it work for 4k blocks looked non-trivial to me, but smarter
people may not find it difficult at all.



  --cw


* Re: Getting FS access events
  2001-05-15  4:00       ` Linus Torvalds
  2001-05-15  4:35         ` Larry McVoy
@ 2001-05-15  4:43         ` Linus Torvalds
  2001-05-15  5:04           ` Alexander Viro
                             ` (2 more replies)
  2001-05-15  4:57         ` David S. Miller
  2001-05-15  6:20         ` Richard Gooch
  3 siblings, 3 replies; 75+ messages in thread
From: Linus Torvalds @ 2001-05-15  4:43 UTC (permalink / raw)
  To: Richard Gooch; +Cc: Kernel Mailing List


On Mon, 14 May 2001, Linus Torvalds wrote:
> 
> Or rather, there is a fundamental reason why we must NEVER EVER look at
> the buffer cache: it is not coherent with the page cache. 
> 
> And keeping it coherent would be _extremely_ expensive. How do we
> know? Because we used to do that. Remember the small mindcraft
> benchmark? Yup. Double copies all over the place, double lookups, double
> everything.

I think I should explain a bit more.

The current page cache is completely non-coherent (with _anything_: it's
not coherent with other files using a page cache because they have a
different index, and it's not coherent with the buffer cache because that
one isn't even in the same name space).

Now, being non-coherent is always the best option if you can get away with
it. It means that there is no way you can ever have _any_ performance
overhead from maintaining the coherency, and it's 100% reproducible -
there's no question where the page cache gets its data from (the raw disk
device. No if's, but's and why's).

The disadvantage of virtual caches is that they can have aliases. That's
fine, but you have to be aware of it, and you have to live with the
consequences. That's what we do now. There are no aliases that are worth
worrying about, so virtual caches work perfectly. This is not always true
(virtual CPU data caches tend to be a really bad idea, while virtual CPU
instruction caches tend to work fairly well, although potentially with a
lower utilization ratio than a physical one due to aliasing).

The other alternative is to have a physical cache. That's fine too: you
avoid aliases, but you have to look up the physical address when looking
up the cache. THIS is the real cost of the buffer cache - not the hashing
and the locking, but the fact that you have to know the physical
location. 

A mixed-mode cache is not a good idea. It gets the worst from both worlds,
without getting _any_ of the good qualities. You have the horrible
coherency issue, together with the overhead of having to find out the
physical address. 

You could choose to do "partial coherency", ie be coherent only one way,
for example. That would make the coherency overhead much less, but would
also make the caches basically act very unpredictably - you might have
somebody write through the page cache yet on a read actually not _see_
what he wrote, because it got written out to disk and was shadowed by
cached data in the buffer cache that didn't get updated.

So "partial coherency" might avoid some of the performance issues, but
it's unacceptable to me simply because it's pretty non-repeatable and has some
strange behaviour that can be considered "obviously wrong" (see above
about one example).

Which leaves us with the fact that the page cache is best done the way it
is, and anybody who has coherency concerns might really think about those
concerns another way.

I'm really serious about doing "resume from disk". If you want a fast
boot, I will bet you a dollar that you cannot do it faster than by loading
a contiguous image of several megabytes contiguously into memory. There is
NO overhead, you're pretty much guaranteed platter speeds, and there are
no issues about trying to order accesses etc. There are also no issues
about messing up any run-time data structures.

Give it some thought.

		Linus



* Re: Getting FS access events
  2001-05-15  4:00       ` Linus Torvalds
  2001-05-15  4:35         ` Larry McVoy
  2001-05-15  4:43         ` Linus Torvalds
@ 2001-05-15  4:57         ` David S. Miller
  2001-05-15  5:12           ` Alexander Viro
  2001-05-15  9:10           ` Alan Cox
  2001-05-15  6:20         ` Richard Gooch
  3 siblings, 2 replies; 75+ messages in thread
From: David S. Miller @ 2001-05-15  4:57 UTC (permalink / raw)
  To: Larry McVoy; +Cc: Linus Torvalds, Richard Gooch, Kernel Mailing List


Larry McVoy writes:
 > Hell, that's the OS that gave us mmap, remember that?  

Larry, go read up on TOPS-20. :-) SunOS did give unix mmap(), but it
did not come up with the idea.

Later,
David S. Miller
davem@redhat.com


* Re: Getting FS access events
  2001-05-15  4:35         ` Larry McVoy
@ 2001-05-15  4:59           ` Alexander Viro
  2001-05-15 17:01             ` Pavel Machek
  0 siblings, 1 reply; 75+ messages in thread
From: Alexander Viro @ 2001-05-15  4:59 UTC (permalink / raw)
  To: Larry McVoy; +Cc: Linus Torvalds, Richard Gooch, Kernel Mailing List



On Mon, 14 May 2001, Larry McVoy wrote:

> Hell, that's the OS that gave us mmap, remember that?  

"I got it from Agnes..."

Don't get me wrong, SunOS 4 was probably the nicest thing Sun had ever
released and I love it, but mmap(2) was _not_ the best of ideas. Files
as streams of bytes and files as persistent segments really do not mix
well. If you still have their source check the effects of write() from
mmaped area. Especially when you play with unaligned stuff.

Said that, in all sane cases we want indexing by (vnode,offset), not by
(device,block number). We _certainly_ don't want uncontrolled readahead
on block level. E.g. because we might have just allocated a new block
and are busy filling it with data we want to write. The last thing we
want is some fsckwit overwriting it with crap we have on disk. And that's
what such readahead is.

Besides, just how often do you reboot the box? If that's the hotspot for
you - when the hell does the poor beast find time to do something useful?



* Re: Getting FS access events
  2001-05-15  4:43         ` Linus Torvalds
@ 2001-05-15  5:04           ` Alexander Viro
  2001-05-15 16:17           ` Pavel Machek
  2001-05-18  7:55           ` Rogier Wolff
  2 siblings, 0 replies; 75+ messages in thread
From: Alexander Viro @ 2001-05-15  5:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Richard Gooch, Kernel Mailing List



On Mon, 14 May 2001, Linus Torvalds wrote:

> The current page cache is completely non-coherent (with _anything_: it's
> not coherent with other files using a page cache because they have a
> different index, and it's not coherent with the buffer cache because that
> one isn't even in the same name space).

Unfortunately, we have cases when disk block migrates from buffer cache
to page cache. Source of serious PITA and (IMO) the only serious reason
to take indirect blocks into page cache.



* Re: Getting FS access events
  2001-05-15  4:57         ` David S. Miller
@ 2001-05-15  5:12           ` Alexander Viro
  2001-05-15  9:10           ` Alan Cox
  1 sibling, 0 replies; 75+ messages in thread
From: Alexander Viro @ 2001-05-15  5:12 UTC (permalink / raw)
  To: David S. Miller
  Cc: Larry McVoy, Linus Torvalds, Richard Gooch, Kernel Mailing List



On Mon, 14 May 2001, David S. Miller wrote:

> 
> Larry McVoy writes:
>  > Hell, that's the OS that gave us mmap, remember that?  
> 
> Larry, go read up on TOPS-20. :-) SunOS did give unix mmap(), but it
> did not come up the idea.

s/TOPS-20/Multics/



* Re: Getting FS access events
  2001-05-14 23:19     ` Richard Gooch
  2001-05-15  0:42       ` Daniel Phillips
  2001-05-15  4:00       ` Linus Torvalds
@ 2001-05-15  6:13       ` Richard Gooch
  2 siblings, 0 replies; 75+ messages in thread
From: Richard Gooch @ 2001-05-15  6:13 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Richard Gooch, Kernel Mailing List

Linus Torvalds writes:
> 
> On Mon, 14 May 2001, Richard Gooch wrote:
> > 
> > Is there some fundamental reason why a buffer cache can't ever be
> > fast?
> 
> Yes.
> 
> Or rather, there is a fundamental reason why we must NEVER EVER look at
> the buffer cache: it is not coherent with the page cache. 
> 
> And keeping it coherent would be _extremely_ expensive. How do we
> know? Because we used to do that. Remember the small mindcraft
> benchmark? Yup. Double copies all over the place, double lookups, double
> everything.
> 
> You could think: "oh, we only need to look up the buffer cache when we
> create a new page cache mapping, so..".
> 
> You'd be wrong. We'd need to go the other way too: every time we create a
> new buffer cache entry, we'd need to make sure that it isn't mapped
> somewhere in the page cache (impossible), or otherwise we'd do the wrong
> thing sometimes (ie we might have two dirty copies, and we wouldn't know
> _which_ one is valid etc).
> 
> Aliasing is bad. Don't do it.

OK, this (combined with the other message) explains why we want to
keep away from the buffer cache. Thanks.

> You know, the mark of intelligence is realizing when you're making
> the same mistake over and over and over again, and not hitting your
> head in the wall five hundred times before you understand that it's
> not a clever thing to do.

But you didn't have to add this. Please note that I asked why not use
the buffer cache. I didn't proclaim that it was the ideal solution. I
did say what benefits it had, but I didn't assert that the benefits
outweighed the disadvantages.

> Please show some intelligence.

Well, frankly, I think I have. Things are obvious when you know them
already. Even if I'm ignorant, I'm not stupid!

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca


* Re: Getting FS access events
  2001-05-15  4:00       ` Linus Torvalds
                           ` (2 preceding siblings ...)
  2001-05-15  4:57         ` David S. Miller
@ 2001-05-15  6:20         ` Richard Gooch
  2001-05-15  6:28           ` Linus Torvalds
  2001-05-15  6:49           ` Richard Gooch
  3 siblings, 2 replies; 75+ messages in thread
From: Richard Gooch @ 2001-05-15  6:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

Linus Torvalds writes:
> You could choose to do "partial coherency", ie be coherent only one
> way, for example. That would make the coherency overhead much less,
> but would also make the caches basically act very unpredictably -
> you might have somebody write through the page cache yet on a read
> actually not _see_ what he wrote, because it got written out to disk
> and was shadowed by cached data in the buffer cache that didn't get
> updated.

OK, I see your concern. And the old way of doing things, placing a
copy in the buffer cache when the page cache does a write, will eat
away performance.

However, what about simply invalidating an entry in the buffer cache
when you do a write from the page cache? By the time you get ready to
do the I/O, you have the device bnum, so then isn't it a trivial
operation to index into the buffer cache and invalidate that block?

Is there some other subtlety I'm missing here?

Actually, I'd kind of like it if the page cache steals from the buffer
cache on read. The buffer cache is mostly populated by fsck. Once I've
done the fsck, those buffers are useless to me. They might be useful
again if they are steal-able by the page cache.

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca


* Re: Getting FS access events
  2001-05-15  6:20         ` Richard Gooch
@ 2001-05-15  6:28           ` Linus Torvalds
  2001-05-15  6:49           ` Richard Gooch
  1 sibling, 0 replies; 75+ messages in thread
From: Linus Torvalds @ 2001-05-15  6:28 UTC (permalink / raw)
  To: Richard Gooch; +Cc: Kernel Mailing List


On Tue, 15 May 2001, Richard Gooch wrote:
> 
> However, what about simply invalidating an entry in the buffer cache
> when you do a write from the page cache?

And how do you do the invalidate the other way, pray tell?

What happens if you create a buffer cache entry? Does that invalidate the
page cache one? Or do you just allow invalidates one way, and not the
other? And why?

> Actually, I'd kind of like it if the page cache steals from the buffer
> cache on read. The buffer cache is mostly populated by fsck. Once I've
> done the fsck, those buffers are useless to me. They might be useful
> again if they are steal-able by the page cache.

Ehh.. And then you'll be unhappy _again_, when we early in 2.5.x start
using the page cache for block device accesses. Which we _have_ to do if
we want to be able to mmap block devices. Which we _do_ want to do (hint:
DVD's etc).

Face it. What you ask for is stupid and fundamentally unworkable. 

Tell me WHY you are completely ignoring my arguments, when I (a) tell you
why your way is bad and stupid (and when you ignore the arguments, don't
complain when I call you stupid) and (b) I give you alternate ways to do
the same thing, except my suggestion is _faster_ and has none of the
downside yours has.

WHY?

		Linus



* Re: Getting FS access events
  2001-05-15  6:20         ` Richard Gooch
  2001-05-15  6:28           ` Linus Torvalds
@ 2001-05-15  6:49           ` Richard Gooch
  2001-05-15  6:57             ` Alexander Viro
                               ` (4 more replies)
  1 sibling, 5 replies; 75+ messages in thread
From: Richard Gooch @ 2001-05-15  6:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

Linus Torvalds writes:
> 
> On Tue, 15 May 2001, Richard Gooch wrote:
> > 
> > However, what about simply invalidating an entry in the buffer cache
> > when you do a write from the page cache?
> 
> And how do you do the invalidate the other way, pray tell?
> 
> What happens if you create a buffer cache entry? Does that
> invalidate the page cache one? Or do you just allow invalidates one
> way, and not the other? And why?

I just figured on one way invalidates, because that seems cheap and
easy and has some benefits. Invalidating the other way is costly, so
don't bother, even if there were some benefits.

> > Actually, I'd kind of like it if the page cache steals from the buffer
> > cache on read. The buffer cache is mostly populated by fsck. Once I've
> > done the fsck, those buffers are useless to me. They might be useful
> > again if they are steal-able by the page cache.
> 
> Ehh.. And then you'll be unhappy _again_, when we early in 2.5.x
> start using the page cache for block device accesses. Which we
> _have_ to do if we want to be able to mmap block devices. Which we
> _do_ want to do (hint: DVD's etc).

So what happens if I dd from the block device and also from a file on
the mounted FS, where that file overlaps the bnums I dd'ed? Do we get
two copies in the page cache? One for the block device access, and one
for the file access?

> Face it. What you ask for is stupid and fundamentally unworkable. 
> 
> Tell me WHY you are completely ignoring my arguments, when I (a)
> tell you why your way is bad and stupid (and when you ignore the
> arguments, don't complain when I call you stupid) and (b) I give you
> alternate ways to do the same thing, except my suggestion is
> _faster_ and has none of the downside yours has.
> 
> WHY?

Because I like to understand completely all the different options
before giving up on any. That in itself is a good enough reason, IMO.

Because I've found that when arguing about this kind of stuff, even if
the other person asks for something that is "wrong" or "stupid" from
your own point of view, if you respect their intelligence, then maybe
you can together find an alternative solution that solves the
underlying problem but does it cleanly.

I've been on the other side of this with a friend and colleague. We
used to have healthy arguments that lasted all afternoon. He'd ask for
something that was unclean and didn't fit into the structure or the
philosophy. But I respected his intelligence, skill and his need for a
solution. In the end, we'd come up with a better way than either one
would have proposed. We had a dialogue.

And because your suspend/resume idea isn't really going to help me
much. That's because my boot scripts have the notion of
"personalities" (change the boot configuration by asking the user
early on in the boot process). If I suspend after I've got XDM
running, it's too late.

So what I want is a solution that will keep the kernel clean (believe
me, I really do want to keep it clean), but gives me a fast boot too.
And I believe the solution is out there. We just haven't found it yet.

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15  6:49           ` Richard Gooch
@ 2001-05-15  6:57             ` Alexander Viro
  2001-05-15 10:33               ` Daniel Phillips
  2001-05-15  7:13             ` Linus Torvalds
                               ` (3 subsequent siblings)
  4 siblings, 1 reply; 75+ messages in thread
From: Alexander Viro @ 2001-05-15  6:57 UTC (permalink / raw)
  To: Richard Gooch; +Cc: Linus Torvalds, Kernel Mailing List



On Tue, 15 May 2001, Richard Gooch wrote:

> > What happens if you create a buffer cache entry? Does that
> > invalidate the page cache one? Or do you just allow invalidates one
> > way, and not the other? And why?
> 
> I just figured on one way invalidates, because that seems cheap and
> easy and has some benefits. Invalidating the other way is costly, so
> don't bother, even if there were some benefits.

Cute.
	* create an instance in pagecache
	* start reading into buffer cache (doesn't invalidate, right?)
	* start writing using pagecache
	* lose the page
	* try to read it (via pagecache)
Woops - just found a copy in buffer cache, let's pick data from it.
Pity that said data is obsolete...

> So what happens if I dd from the block device and also from a file on
> the mounted FS, where that file overlaps the bnums I dd'ed? Do we get
> two copies in the page cache? One for the block device access, and one
> for the file access?

Yes.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15  6:49           ` Richard Gooch
  2001-05-15  6:57             ` Alexander Viro
@ 2001-05-15  7:13             ` Linus Torvalds
  2001-05-15  7:56               ` Chris Wedgwood
  2001-05-15 10:04             ` Anton Altaparmakov
                               ` (2 subsequent siblings)
  4 siblings, 1 reply; 75+ messages in thread
From: Linus Torvalds @ 2001-05-15  7:13 UTC (permalink / raw)
  To: Richard Gooch; +Cc: Kernel Mailing List


On Tue, 15 May 2001, Richard Gooch wrote:
> > 
> > What happens if you create a buffer cache entry? Does that
> > invalidate the page cache one? Or do you just allow invalidates one
> > way, and not the other? And why?
> 
> I just figured on one way invalidates, because that seems cheap and
> easy and has some benefits. Invalidating the other way is costly, so
> don't bother, even if there were some benefits.

Ahh..

Well, excuse me while I puke all over your shoes.

Why don't you go hack the NT kernel, or something like that? I have some
taste, and part of that is having this silly notion of "Things should make
sense".

We should not create crap code just because we _can_. Sure, it's easy to
write the code you suggest. Do you really want a system like that? A
system where you have rules that make no sense, except "it was easy to
invalidate one way, so let's do that, and never mind that it makes no
logical sense at all?".

> > Ehh.. And then you'll be unhappy _again_, when we early in 2.5.x
> > start using the page cache for block device accesses. Which we
> > _have_ to do if we want to be able to mmap block devices. Which we
> > _do_ want to do (hint: DVD's etc).
> 
> So what happens if I dd from the block device and also from a file on
> the mounted FS, where that file overlaps the bnums I dd'ed? Do we get
> two copies in the page cache? One for the block device access, and one
> for the file access?

Yup. And never the two shall meet.

Why should they? Why would you ever do something like that, or care about
the fact? Why would you design a system around a perversity, slowing down
(and uglifying) the sane and common case?

> And because your suspend/resume idea isn't really going to help me
> much. That's because my boot scripts have the notion of
> "personalities" (change the boot configuration by asking the user
> early on in the boot process). If I suspend after I've got XDM
> running, it's too late.

Note that I never said "suspend". I said _resume_. You would create the
resume-image once, and you'd create it not at shutdown time, but at the
point you want to resume from.

You don't want to ever suspend the dang thing - just shut it down, and
reboot it quickly by resuming from the snapshot. So you just create a
simple resume snapshot. Which is easy to do, with the exact same tools
that you've been talking about all the time.

What you do is:
 - trace what pages get loaded off the disk
 - create a snapshot of the contents of those pages
 - archive it all up (may I suggest compressing it at the same time?)
 - the "resume" function is just a "uncompress and populate the virtual
   caches with the contents" action.
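
To make the archive step concrete, here is a purely hypothetical record
layout for such a cache-priming image -- nothing like this exists in the
kernel, and all names are made up for the sketch.  One record per traced
page (which file, which page, then the page contents), packed so that the
whole image can be read back in a single linear sweep:

	/*
	 * Hypothetical sketch only: one record per traced page.
	 * The pathname and the page data follow the fixed header.
	 */
	struct prime_record {
		unsigned int path_len;	/* length of the pathname that follows */
		unsigned long pagenr;	/* page index within that file         */
		/* then: char path[path_len]; char data[PAGE_SIZE]; */
	};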

Note that the "uncompress and populate" doesn't actually have to use the
_real_ disk contents of the file. A byte is a byte is a byte, and it
doesn't actually need to come from the actual filesystem the system
_thinks_ it comes from. Once it is loaded into memory, it's just a
value. You've "primed" your caches, so when you actually run the bootup
scripts, you'll have some random hit-rate (say, 98%), and improve the
bootup immensely that way.

Another way of saying this: Imagine that you "tar" up and compress the
files you need for booting. You then uncompress and untar the archive, but
instead of untar'ing onto a filesystem, you _just_ populate the caches. 

This is how some CPU's bootstrap themselves: they fill their icache from a
serial rom (at least some alpha chips did this). Never mind that they
didn't actually get that initial state from the _real_ backing store (RAM,
or in the hypothetical "resume" case, the filesystem off disk). There's no
way to tell, if your cached copies have the same data as the data on
disk. Never mind that the data _got_ there a strange way.

(And yes, your "cache priming" had better prime the cache with the same
stuff that _is_ on the real filesystem, otherwise you'd obviously get
strange behaviour with the caches not actually matching what the
filesystem contents are. But that's simple to do, and it's easy enough to
boot up in safe mode without a cache priming stage).

One of the advantages of "resuming" (or "priming the cache", or whatever
you want to call it) is that you're free to lay out the resume/cache image
any way you want on disk, as it has nothing to do with the actual
filesystem - except for the fact of sharing some of the same data. Which
means that you can really read it in efficiently.

		Linus


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15  7:13             ` Linus Torvalds
@ 2001-05-15  7:56               ` Chris Wedgwood
  2001-05-15  8:06                 ` Linus Torvalds
  0 siblings, 1 reply; 75+ messages in thread
From: Chris Wedgwood @ 2001-05-15  7:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Richard Gooch, Kernel Mailing List

On Tue, May 15, 2001 at 12:13:13AM -0700, Linus Torvalds wrote:

    We should not create crap code just because we _can_.

How about removing code?


In 2.5.x if we move fs metadata into the pagecache, do we even need a
buffer cache anymore? Can't we just ditch it completely and make all
device access raw?

It seems to me this is not only simple but also elegant, or perhaps I
am fundamentally missing something?



  --cw

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15  7:56               ` Chris Wedgwood
@ 2001-05-15  8:06                 ` Linus Torvalds
  2001-05-15  8:33                   ` Alexander Viro
  2001-05-19  5:26                   ` Chris Wedgwood
  0 siblings, 2 replies; 75+ messages in thread
From: Linus Torvalds @ 2001-05-15  8:06 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: Richard Gooch, Kernel Mailing List



On Tue, 15 May 2001, Chris Wedgwood wrote:
>
> On Tue, May 15, 2001 at 12:13:13AM -0700, Linus Torvalds wrote:
>
>     We should not create crap code just because we _can_.
>
> How about removing code?

Absolutely. It's not all that often that we can do it, but when we can,
it's the best thing in the world.

> In 2.5.x if we move fs metadata into the pagecache, do we even need a
> buffer cache anymore? Can't we just ditch it completely and make all
> device access raw?

Yes and no.

Yes, it would be nice.

But no, I doubt we'll move _all_ metadata into the page-cache. I doubt,
for example, that we'll find people re-doing all the other filesystems. So
even if ext2 was page-cache only, what about all the 35 other filesystems
out there in the standard sources, never mind others that haven't been
integrated (XFS, ext3 etc..).

Yeah, I know. Some of them already do not use the buffer cache at all (the
network filesystems come to mind ;), but even so..

Looks like there are 19 filesystems that use the buffer cache right now:

	grep -l bread fs/*/*.c | cut -d/ -f2 | sort -u | wc

So quite a bit of work involved.

But on the whole I'm definitely hoping that yes, we'll relegate the
"buffer_head" to be mainly just for IO, and not be a first-class caching
entity at all. It's just that I think it will take a _loooong_ time until
we actually reach that noble goal completely.

		Linus


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15  8:06                 ` Linus Torvalds
@ 2001-05-15  8:33                   ` Alexander Viro
  2001-05-15 10:27                     ` David Woodhouse
                                       ` (2 more replies)
  2001-05-19  5:26                   ` Chris Wedgwood
  1 sibling, 3 replies; 75+ messages in thread
From: Alexander Viro @ 2001-05-15  8:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Chris Wedgwood, Richard Gooch, Kernel Mailing List



On Tue, 15 May 2001, Linus Torvalds wrote:

> Looks like there are 19 filesystems that use the buffer cache right now:
> 
> 	grep -l bread fs/*/*.c | cut -d/ -f2 | sort -u | wc
> 
> So quite a bit of work involved.

UNIX-like ones (and that includes QNX) are easy. HFS is hopeless - it won't
be fixed unless the authors do it. Tigran will probably fix BFS just as a
learning experience ;-) ADFS looks tolerably easy to fix. AFFS... directories
will be pure hell - blocks jump from directory to directory at zero notice.
NTFS and HPFS will win from the switch (esp. NTFS). FAT is not a problem, if we
are willing to break CVF and let the author fix it. Reiserfs... Dunno. They've
got a private (slightly mutated) copy of ~60% of fs/buffer.c. UDF should be
OK. ISOFS... ask Peter. JFFS - dunno.

So probably we'll have to keep the buffer cache (AFFS looks like a real
killer), but we will be able to do pagecache-only versions of a_ops methods.
If fs has no metadata in buffer cache we can drop unmap_underlying_metadata()
for it.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15  4:57         ` David S. Miller
  2001-05-15  5:12           ` Alexander Viro
@ 2001-05-15  9:10           ` Alan Cox
  2001-05-15  9:48             ` Lars Brinkhoff
  1 sibling, 1 reply; 75+ messages in thread
From: Alan Cox @ 2001-05-15  9:10 UTC (permalink / raw)
  To: David S. Miller
  Cc: Larry McVoy, Linus Torvalds, Richard Gooch, Kernel Mailing List

> Larry, go read up on TOPS-20. :-) SunOS did give unix mmap(), but it
> did not come up the idea.

Seems to be TOPS-10 ....

http://www.opost.com/dlm/tenex/fjcc72/ 
[Storage Organization and Management in TENEX - 1972]



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15  9:10           ` Alan Cox
@ 2001-05-15  9:48             ` Lars Brinkhoff
  2001-05-15  9:54               ` Alexander Viro
  2001-05-15 20:17               ` Kai Henningsen
  0 siblings, 2 replies; 75+ messages in thread
From: Lars Brinkhoff @ 2001-05-15  9:48 UTC (permalink / raw)
  To: Alan Cox; +Cc: David S. Miller, Larry McVoy, Kernel Mailing List

Alan Cox <alan@lxorguk.ukuu.org.uk> writes:
> > Larry, go read up on TOPS-20. :-) SunOS did give unix mmap(), but it
> > did not come up the idea.
> Seems to be TOPS-10 ....
> http://www.opost.com/dlm/tenex/fjcc72/ 

TENEX is not TOPS-10.  TOPS-10 didn't get virtual memory until around
1974.  By then, TENEX had been around for years.

TOPS-20 was developed from TENEX starting around 1973.

-- 
http://lars.nocrew.org/

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15  9:48             ` Lars Brinkhoff
@ 2001-05-15  9:54               ` Alexander Viro
  2001-05-15 20:17               ` Kai Henningsen
  1 sibling, 0 replies; 75+ messages in thread
From: Alexander Viro @ 2001-05-15  9:54 UTC (permalink / raw)
  To: Lars Brinkhoff
  Cc: Alan Cox, David S. Miller, Larry McVoy, Kernel Mailing List



On 15 May 2001, Lars Brinkhoff wrote:

> Alan Cox <alan@lxorguk.ukuu.org.uk> writes:
> > > Larry, go read up on TOPS-20. :-) SunOS did give unix mmap(), but it
> > > did not come up the idea.
> > Seems to be TOPS-10 ....
> > http://www.opost.com/dlm/tenex/fjcc72/ 
> 
> TENEX is not TOPS-10.  TOPS-10 didn't get virtual memory until around
> 1974.  By then, TENEX had been around for years.
> 
> TOPS-20 was developed from TENEX starting around 1973.

... and Multics had all access to files through equivalent of mmap()
in 60s. "Segments" in ls(1) got that name for a good reason.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15  6:49           ` Richard Gooch
  2001-05-15  6:57             ` Alexander Viro
  2001-05-15  7:13             ` Linus Torvalds
@ 2001-05-15 10:04             ` Anton Altaparmakov
  2001-05-15 19:28               ` H. Peter Anvin
  2001-05-15 16:26             ` Pavel Machek
  2001-05-15 18:02             ` Craig Milo Rogers
  4 siblings, 1 reply; 75+ messages in thread
From: Anton Altaparmakov @ 2001-05-15 10:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Richard Gooch, Kernel Mailing List

At 08:13 15/05/01, Linus Torvalds wrote:
>On Tue, 15 May 2001, Richard Gooch wrote:
> > So what happens if I dd from the block device and also from a file on
> > the mounted FS, where that file overlaps the bnums I dd'ed? Do we get
> > two copies in the page cache? One for the block device access, and one
> > for the file access?
>
>Yup. And never the two shall meet.
>
>Why should they? Why would you ever do something like that, or care about
>the fact?

They shouldn't, but maybe some stupid utility or a typo will do it creating 
two incoherent copies of the same block on the device. -> Bad Things can 
happen.

Can't we simply stop people from doing it by say having mount lock the 
device from further opens (and vice versa of course, doing a "dd" should 
result in lock of device preventing a mount during the duration of "dd"). - 
Wouldn't this be a good thing, guaranteeing that problems cannot happen 
while not incurring any overhead except on device open/close? Or is this a 
matter of "give the user enough rope"? - If proper rw locking is 
implemented it could allow simultaneous -o ro mount with a dd from the 
device but do exclusive write locking, for example, for maximum flexibility.

Just my 2p.

Anton


-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://sourceforge.net/projects/linux-ntfs/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15  8:33                   ` Alexander Viro
@ 2001-05-15 10:27                     ` David Woodhouse
  2001-05-15 16:00                     ` Chris Mason
  2001-05-15 19:26                     ` H. Peter Anvin
  2 siblings, 0 replies; 75+ messages in thread
From: David Woodhouse @ 2001-05-15 10:27 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Linus Torvalds, Chris Wedgwood, Richard Gooch, Kernel Mailing List


viro@math.psu.edu said:
> JFFS - dunno.

Bah. JFFS doesn't use any of those horrible block device thingies.

--
dwmw2



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15  6:57             ` Alexander Viro
@ 2001-05-15 10:33               ` Daniel Phillips
  2001-05-15 10:44                 ` Alexander Viro
  0 siblings, 1 reply; 75+ messages in thread
From: Daniel Phillips @ 2001-05-15 10:33 UTC (permalink / raw)
  To: Alexander Viro, Richard Gooch; +Cc: Linus Torvalds, Kernel Mailing List

On Tuesday 15 May 2001 08:57, Alexander Viro wrote:
> On Tue, 15 May 2001, Richard Gooch wrote:
> > > What happens if you create a buffer cache entry? Does that
> > > invalidate the page cache one? Or do you just allow invalidates
> > > one way, and not the other? And why?
> >
> > I just figured on one way invalidates, because that seems cheap and
> > easy and has some benefits. Invalidating the other way is costly,
> > so don't bother, even if there were some benefits.
>
> Cute.
> 	* create an instance in pagecache
> 	* start reading into buffer cache (doesn't invalidate, right?)
> 	* start writing using pagecache
> 	* lose the page
> 	* try to read it (via pagecache)
> Woops - just found a copy in buffer cache, let's pick data from it.
> Pity that said data is obsolete...

That's because you left out his invalidate:

 	* create an instance in pagecache
 	* start reading into buffer cache (doesn't invalidate, right?)
 	* start writing using pagecache (invalidate buffer copy)
 	* lose the page
 	* try to read it (via pagecache)

Everything ok.  As an optimization, instead of 'lose the page', do 'move 
page blocks to buffer cache'.

--
Daniel

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15 10:33               ` Daniel Phillips
@ 2001-05-15 10:44                 ` Alexander Viro
  2001-05-15 14:42                   ` Daniel Phillips
  0 siblings, 1 reply; 75+ messages in thread
From: Alexander Viro @ 2001-05-15 10:44 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Richard Gooch, Linus Torvalds, Kernel Mailing List



On Tue, 15 May 2001, Daniel Phillips wrote:

> That's because you left out his invalidate:
> 
>  	* create an instance in pagecache
>  	* start reading into buffer cache (doesn't invalidate, right?)
>  	* start writing using pagecache (invalidate buffer copy)

Bzzert. You have a race here. Let's make it explicit:

start writing
put write request in queue
block on that
					start reading into buffer cache
					put read request into queue
					read from media
write to media

And no, we can't invalidate from IO completion hook.

>  	* lose the page
>  	* try to read it (via pagecache)
> 
> > Everything ok.

Nope.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15 10:44                 ` Alexander Viro
@ 2001-05-15 14:42                   ` Daniel Phillips
  0 siblings, 0 replies; 75+ messages in thread
From: Daniel Phillips @ 2001-05-15 14:42 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Richard Gooch, Linus Torvalds, Kernel Mailing List

On Tuesday 15 May 2001 12:44, Alexander Viro wrote:
> On Tue, 15 May 2001, Daniel Phillips wrote:
> > That's because you left out his invalidate:
> >
> >  	* create an instance in pagecache
> >  	* start reading into buffer cache (doesn't invalidate, right?)
> >  	* start writing using pagecache (invalidate buffer copy)
>
> Bzzert. You have a race here. Let's make it explicit:
>
> start writing
> put write request in queue
> block on that
> 					start reading into buffer cache
> 					put read request into queue
> 					read from media
> write to media
>
> And no, we can't invalidate from IO completion hook.
>
> >  	* lose the page
> >  	* try to read it (via pagecache)
> >
> > Everything ok.
>
> Nope.

The problem is that we have two IO operations on the same physical 
block in the queue at the same time, and we don't know it.  Maybe we 
should know it.

For your specific example we are ok if we do:

         * create an instance in pagecache
         * start reading into buffer cache (doesn't invalidate, right?)
         * start writing using pagecache (invalidate buffer copy)
         * lose the page (invalidate buffer copy)
         * try to read it (via pagecache)

We are also ok if we follow my suggested optimization and move the page 
to the buffer cache instead of just losing it.

We are not ok if we do:

         * try to read it (via buffercache)

because its copy is out of date, but this can be fixed by enforcing 
coherency in the request queue. 

1) Why should the request queue not be coherent?

2) Can we stop talking about buffer cache here and start talking about 
blocks mapped into a separate address space in the page cache?  From 
Linus's previous comments in this thread we are going to have that 
anyway, and your race also applies there.

I'd like to call that separate address space a 'block cache'.

--
Daniel

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15  8:33                   ` Alexander Viro
  2001-05-15 10:27                     ` David Woodhouse
@ 2001-05-15 16:00                     ` Chris Mason
  2001-05-15 19:26                     ` H. Peter Anvin
  2 siblings, 0 replies; 75+ messages in thread
From: Chris Mason @ 2001-05-15 16:00 UTC (permalink / raw)
  To: Alexander Viro, Linus Torvalds
  Cc: Chris Wedgwood, Richard Gooch, Kernel Mailing List



On Tuesday, May 15, 2001 04:33:57 AM -0400 Alexander Viro
<viro@math.psu.edu> wrote:

> 
> 
> On Tue, 15 May 2001, Linus Torvalds wrote:
> 
>> Looks like there are 19 filesystems that use the buffer cache right now:
>> 
>> 	grep -l bread fs/*/*.c | cut -d/ -f2 | sort -u | wc
>> 
>> So quite a bit of work involved.
> 
> Reiserfs... Dunno. They've got a private (slightly mutated) copy of
> ~60% of fs/buffer.c. 

But, putting the log and the metadata in the page cache makes memory
pressure and such cleaner, so this is one of my goals for 2.5.  reiserfs
will still have alias issues due to the packed tails (one copy in the
btree, another in the page), but it will be no worse than it is now.

-chris


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15  4:43         ` Linus Torvalds
  2001-05-15  5:04           ` Alexander Viro
@ 2001-05-15 16:17           ` Pavel Machek
  2001-05-19 19:39             ` Linus Torvalds
  2001-05-18  7:55           ` Rogier Wolff
  2 siblings, 1 reply; 75+ messages in thread
From: Pavel Machek @ 2001-05-15 16:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Richard Gooch, Kernel Mailing List

Hi!

> I'm really serious about doing "resume from disk". If you want a fast
> boot, I will bet you a dollar that you cannot do it faster than by loading
> a contiguous image of several megabytes contiguously into memory. There is
> NO overhead, you're pretty much guaranteed platter speeds, and there are
> no issues about trying to order accesses etc. There are also no issues
> about messing up any run-time data structures.

Resume from disk is actually pretty hard to do in a way where it is read linearly.

While playing with swsusp patches (== suspend to disk) I found out that
it was slow. It needs to do an atomic snapshot, and the only reasonable way
to do that is to free half of RAM, cli() and copy.

-- 
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15  6:49           ` Richard Gooch
                               ` (2 preceding siblings ...)
  2001-05-15 10:04             ` Anton Altaparmakov
@ 2001-05-15 16:26             ` Pavel Machek
  2001-05-15 18:02             ` Craig Milo Rogers
  4 siblings, 0 replies; 75+ messages in thread
From: Pavel Machek @ 2001-05-15 16:26 UTC (permalink / raw)
  To: Richard Gooch; +Cc: Linus Torvalds, Kernel Mailing List

Hi!

> And because your suspend/resume idea isn't really going to help me
> much. That's because my boot scripts have the notion of
> "personalities" (change the boot configuration by asking the user
> early on in the boot process). If I suspend after I've got XDM
> running, it's too late.

Why not e2defrag so that everything needed for bootup is linear on the
start of disk? Use strace to collect statistics of what happens during 
bootup. [strace should be good enough. If not, uml is.]
-- 
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15  4:59           ` Alexander Viro
@ 2001-05-15 17:01             ` Pavel Machek
  0 siblings, 0 replies; 75+ messages in thread
From: Pavel Machek @ 2001-05-15 17:01 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Larry McVoy, Linus Torvalds, Richard Gooch, Kernel Mailing List

Hi!

> Besides, just how often do you reboot the box? If that's the hotspot for
> you - when the hell does the poor beast find time to do something useful?

Ten times a day?

But booting is special case: You can read your mail while compiling kernel, 
but try to read your mail while your machine is booting.

What's worse, boot time tends to be time critical, as in "I need to find that 
mail that tells me where I'm expected to be half an hour from now. Ouch. It's 
going to take 40 minutes to get there."
									Pavel
-- 
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15  6:49           ` Richard Gooch
                               ` (3 preceding siblings ...)
  2001-05-15 16:26             ` Pavel Machek
@ 2001-05-15 18:02             ` Craig Milo Rogers
  4 siblings, 0 replies; 75+ messages in thread
From: Craig Milo Rogers @ 2001-05-15 18:02 UTC (permalink / raw)
  To: Richard Gooch; +Cc: Linus Torvalds, Kernel Mailing List

>And because your suspend/resume idea isn't really going to help me
>much. That's because my boot scripts have the notion of
>"personalities" (change the boot configuration by asking the user
>early on in the boot process). If I suspend after I've got XDM
>running, it's too late.

	Preface: As has been mentioned on this discussion thread, some
disk devices maintain a cache of their own, running on a small (by
today's standards) CPU.  These caches are probably sector oriented,
not block oriented, but are almost certainly not page oriented or
filesystem oriented.  Well, OK, some might have DOS filesystem
knowledge built-in, I suppose... yuck!

	Anyway, although there may be slight differences, they are
effectively block-oriented caches.  As long as they are write-through
(and/or there are cache flushing commands, etc), they are reasonably
coherent with the operating system's main cache, and they meet the
expectations of database programs, etc. that want stable storage.

	In terms of efficiency, there are questions about read-ahead,
write-behind, write-through with invalidation or write-through with
cache update -- the usual stuff.  I leave it as an exercise for the
reader to decide how to best tune their system, and merely assert that
it can be done.

	Imagine, as a mental exercise, that you move this
block-oriented cache out of the disk drive, and into the main CPU and
operating system, say roughly at the disk driver level.  We lose the
efficiency of having the small CPU do the block lookups, but a hashed
block lookup is rather cheap nowadays, wouldn't you say?  Ignoring
issues of, "What if the disk drive fails independently of the main
CPU, or vice versa?", the transplanted block cache should operate
pretty much as it did in the disk drive.
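
	As a toy illustration of the "hashed block lookup is rather
cheap" point above, here is a purely illustrative userspace sketch (all
names are invented for the example; this is not kernel code): blocks are
hashed by block number into buckets and chained, which is essentially how
a simple block cache indexes cached sectors.

	#include <stdlib.h>

	#define NBUCKETS 1024

	struct cached_block {
		unsigned long blocknr;		/* block number on the device */
		struct cached_block *next;	/* hash-chain link            */
		char data[512];			/* cached sector contents     */
	};

	static struct cached_block *buckets[NBUCKETS];

	/* O(chain length) lookup -- trivial next to a real disk access. */
	static struct cached_block *lookup_block(unsigned long blocknr)
	{
		struct cached_block *b = buckets[blocknr % NBUCKETS];

		while (b && b->blocknr != blocknr)
			b = b->next;
		return b;
	}

	/* Insert a freshly read block at the head of its hash chain. */
	static struct cached_block *insert_block(unsigned long blocknr)
	{
		unsigned long h = blocknr % NBUCKETS;
		struct cached_block *b = calloc(1, sizeof(*b));

		if (b) {
			b->blocknr = blocknr;
			b->next = buckets[h];
			buckets[h] = b;
		}
		return b;
	}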

	In particular, it should continue to operate properly with the
main CPU's main page cache.

	Conclusion: a page cache can successfully run over an
appropriately designed block cache.  QED.

	What's the hitch?  It's the "appropriately designed"
constraint.  It is quite possible that the Linux block cache is not
designed (data structures and code paths considered together) in a way
that allows it to mimic a simple disk drive's block cache.  I assume
that there's some impediment, or this discussion wouldn't have lasted
so long -- the idea of using the Linux block cache to model a disk
drive's block cache is pretty obvious, after all.

>So what I want is a solution that will keep the kernel clean (believe
>me, I really do want to keep it clean), but gives me a fast boot too.
>And I believe the solution is out there. We just haven't found it yet.

	Well, if you want a fast boot *on a single type of disk
drive*, and the existing Linux block cache doesn't work, you could
extend the driver for that hardware with an optional block cache,
independently of Linux' block cache, along with an appropriate
interface to populate it with boot-time blocks, and to flush it when
no longer needed.  That's not exactly clean, though, is it?

	You could extend the md (or LVM) drivers, or create a new
driver similar to one of them, that incorporates a simple block cache,
with appropriate mechanisms for populating and flushing it.  Clean?
er, no, rather muddy, in fact.

	You might want to lock down the pages that you've
prepopulated, rather than let them be discarded before they're needed.
This could be designed into a new block cache, but you might need to
play some accounting games to get it right with the existing block
cache.

	Finally, there's Linus' offer for a preread call, to
prepopulate the page cache.  By virtue of your knowledge of the
underlying implementation of the system, you could preload the file
system index pages into the block cache, and load the data pages into
the page cache.  Clean!  Sewer-like!

						Craig Milo Rogers


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15  8:33                   ` Alexander Viro
  2001-05-15 10:27                     ` David Woodhouse
  2001-05-15 16:00                     ` Chris Mason
@ 2001-05-15 19:26                     ` H. Peter Anvin
  2001-05-15 20:03                       ` Alexander Viro
  2 siblings, 1 reply; 75+ messages in thread
From: H. Peter Anvin @ 2001-05-15 19:26 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <Pine.GSO.4.21.0105150424310.19333-100000@weyl.math.psu.edu>
By author:    Alexander Viro <viro@math.psu.edu>
In newsgroup: linux.dev.kernel
> 
> UNIX-like ones (and that includes QNX) are easy. HFS is hopeless - it won't
> be fixed unless the authors do it. Tigran will probably fix BFS just as a
> learning experience ;-) ADFS looks tolerably easy to fix. AFFS... directories
> will be pure hell - blocks jump from directory to directory at zero notice.
> NTFS and HPFS will win from the switch (esp. NTFS). FAT is not a problem, if we
> are willing to break CVF and let the author fix it. Reiserfs... Dunno. They've
> got a private (slightly mutated) copy of ~60% of fs/buffer.c. UDF should be
> OK. ISOFS... ask Peter. JFFS - dunno.
> 

isofs wouldn't be too bad as long as struct mapping:struct inode is a
many-to-one mapping.

	-hpa
-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15 10:04             ` Anton Altaparmakov
@ 2001-05-15 19:28               ` H. Peter Anvin
  2001-05-15 22:31                 ` Albert D. Cahalan
  0 siblings, 1 reply; 75+ messages in thread
From: H. Peter Anvin @ 2001-05-15 19:28 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <5.1.0.14.2.20010515105633.00a22c10@pop.cus.cam.ac.uk>
By author:    Anton Altaparmakov <aia21@cam.ac.uk>
In newsgroup: linux.dev.kernel
> 
> They shouldn't, but maybe some stupid utility or a typo will do it creating 
> two incoherent copies of the same block on the device. -> Bad Things can 
> happen.
> 
> Can't we simply stop people from doing it by say having mount lock the 
> device from further opens (and vice versa of course, doing a "dd" should 
> result in lock of device preventing a mount during the duration of "dd"). - 
> Wouldn't this be a good thing, guaranteeing that problems cannot happen 
> while not incurring any overhead except on device open/close? Or is this a 
> matter of "give the user enough rope"? - If proper rw locking is 
> implemented it could allow simultaneous -o ro mount with a dd from the 
> device but do exclusive write locking, for example, for maximum flexibility.
> 

This would leave no way (without introducing new interfaces) to write,
for example, the boot block on an ext2 filesystem.  Note that the
bootblock (defined as the first 1024 bytes) is not actually used by
the filesystem, although depending on the block size it may share a
block with the superblock (if blocksize > 1024).

	-hpa

-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15 19:26                     ` H. Peter Anvin
@ 2001-05-15 20:03                       ` Alexander Viro
  2001-05-15 20:07                         ` H. Peter Anvin
  0 siblings, 1 reply; 75+ messages in thread
From: Alexander Viro @ 2001-05-15 20:03 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel



On 15 May 2001, H. Peter Anvin wrote:

> isofs wouldn't be too bad as long as struct mapping:struct inode is a
> many-to-one mapping.

Erm... What's wrong with inode->u.isofs_i.my_very_own_address_space ?
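
A rough sketch of what that could look like (made-up names, not the real
isofs code): the fs-specific part of the in-core inode keeps its own
private mapping, distinct from inode->i_mapping, and the fs's metadata
pages live under that mapping.

	struct address_space;		/* the page-cache mapping type */

	/* Hypothetical fs-private inode info -- illustrative only. */
	struct isofs_inode_info_sketch {
		/* ... existing isofs fields ... */
		struct address_space *my_very_own_address_space;
	};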


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15 20:03                       ` Alexander Viro
@ 2001-05-15 20:07                         ` H. Peter Anvin
  2001-05-15 20:15                           ` Alexander Viro
  0 siblings, 1 reply; 75+ messages in thread
From: H. Peter Anvin @ 2001-05-15 20:07 UTC (permalink / raw)
  To: Alexander Viro; +Cc: H. Peter Anvin, linux-kernel

Alexander Viro wrote:
> 
> On 15 May 2001, H. Peter Anvin wrote:
> 
> > isofs wouldn't be too bad as long as struct mapping:struct inode is a
> > many-to-one mapping.
> 
> Erm... What's wrong with inode->u.isofs_i.my_very_own_address_space ?
> 

None whatsoever.  The one thing that matters is that no one starts making
the assumption that mapping->host->i_mapping == mapping.

	-hpa

-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15 20:07                         ` H. Peter Anvin
@ 2001-05-15 20:15                           ` Alexander Viro
  2001-05-15 20:17                             ` H. Peter Anvin
  0 siblings, 1 reply; 75+ messages in thread
From: Alexander Viro @ 2001-05-15 20:15 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: H. Peter Anvin, linux-kernel



On Tue, 15 May 2001, H. Peter Anvin wrote:

> Alexander Viro wrote:
> > 
> > On 15 May 2001, H. Peter Anvin wrote:
> > 
> > > isofs wouldn't be too bad as long as struct mapping:struct inode is a
> > > many-to-one mapping.
> > 
> > Erm... What's wrong with inode->u.isofs_i.my_very_own_address_space ?
> > 
> 
> None whatsoever.  The one thing that matters is that noone starts making
> the assumption that mapping->host->i_mapping == mapping.

One actually shouldn't assume that mapping->host is an inode.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15  9:48             ` Lars Brinkhoff
  2001-05-15  9:54               ` Alexander Viro
@ 2001-05-15 20:17               ` Kai Henningsen
  2001-05-15 20:58                 ` Alexander Viro
  1 sibling, 1 reply; 75+ messages in thread
From: Kai Henningsen @ 2001-05-15 20:17 UTC (permalink / raw)
  To: linux-kernel

viro@math.psu.edu (Alexander Viro)  wrote on 15.05.01 in <Pine.GSO.4.21.0105150550110.21081-100000@weyl.math.psu.edu>:

> ... and Multics had all access to files through equivalent of mmap()
> in 60s. "Segments" in ls(1) got that name for a good reason.

Where's something called "segments" connected with ls(1)? I can't seem to  
find the reference.


MfG Kai

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15 20:15                           ` Alexander Viro
@ 2001-05-15 20:17                             ` H. Peter Anvin
  2001-05-15 20:22                               ` Alexander Viro
  0 siblings, 1 reply; 75+ messages in thread
From: H. Peter Anvin @ 2001-05-15 20:17 UTC (permalink / raw)
  To: Alexander Viro; +Cc: H. Peter Anvin, linux-kernel

Alexander Viro wrote:
> >
> > None whatsoever.  The one thing that matters is that no one starts making
> > the assumption that mapping->host->i_mapping == mapping.
> 
> One actually shouldn't assume that mapping->host is an inode.
> 

What else could it be, since it's a "struct inode *"?  NULL?

	-hpa

-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15 20:17                             ` H. Peter Anvin
@ 2001-05-15 20:22                               ` Alexander Viro
  2001-05-15 20:26                                 ` H. Peter Anvin
  2001-05-15 21:02                                 ` Linus Torvalds
  0 siblings, 2 replies; 75+ messages in thread
From: Alexander Viro @ 2001-05-15 20:22 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: H. Peter Anvin, linux-kernel



On Tue, 15 May 2001, H. Peter Anvin wrote:

> Alexander Viro wrote:
> > >
> > > None whatsoever.  The one thing that matters is that no one starts making
> > > the assumption that mapping->host->i_mapping == mapping.
> > 
> > One actually shouldn't assume that mapping->host is an inode.
> > 
> 
> What else could it be, since it's a "struct inode *"?  NULL?

struct block_device *, for one thing. We'll have to do that as soon
as we do block devices in pagecache.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15 20:22                               ` Alexander Viro
@ 2001-05-15 20:26                                 ` H. Peter Anvin
  2001-05-15 20:31                                   ` Alexander Viro
  2001-05-15 21:02                                 ` Linus Torvalds
  1 sibling, 1 reply; 75+ messages in thread
From: H. Peter Anvin @ 2001-05-15 20:26 UTC (permalink / raw)
  To: Alexander Viro; +Cc: H. Peter Anvin, linux-kernel

Alexander Viro wrote:
> >
> > What else could it be, since it's a "struct inode *"?  NULL?
> 
> struct block_device *, for one thing. We'll have to do that as soon
> as we do block devices in pagecache.
> 

How would you know what datatype it is?  A union?  Making "struct
block_device *" a "struct inode *" in a nonmounted filesystem?  In a
devfs?  (Seriously.  Being able to do these kinds of data-structural
equivalence is IMO the nice thing about devfs & co...)

	-hpa

-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15 20:26                                 ` H. Peter Anvin
@ 2001-05-15 20:31                                   ` Alexander Viro
  2001-05-15 21:12                                     ` Linus Torvalds
  2001-05-15 21:22                                     ` H. Peter Anvin
  0 siblings, 2 replies; 75+ messages in thread
From: Alexander Viro @ 2001-05-15 20:31 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: H. Peter Anvin, linux-kernel



On Tue, 15 May 2001, H. Peter Anvin wrote:

> Alexander Viro wrote:
> > >
> > > What else could it be, since it's a "struct inode *"?  NULL?
> > 
> > struct block_device *, for one thing. We'll have to do that as soon
> > as we do block devices in pagecache.
> > 
> 
> How would you know what datatype it is?  A union?  Making "struct
> block_device *" a "struct inode *" in a nonmounted filesystem?  In a
> devfs?  (Seriously.  Being able to do these kinds of data-structural
> equivalence is IMO the nice thing about devfs & co...)

void *.

Look, methods of your address_space certainly know what the hell they
are dealing with. Just as autofs_root_readdir() knows what inode->u.generic_ip
really points to.

Anybody else has no business to care about the contents of ->host.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15 20:17               ` Kai Henningsen
@ 2001-05-15 20:58                 ` Alexander Viro
  2001-05-15 21:08                   ` Alexander Viro
  0 siblings, 1 reply; 75+ messages in thread
From: Alexander Viro @ 2001-05-15 20:58 UTC (permalink / raw)
  To: Kai Henningsen; +Cc: linux-kernel



On 15 May 2001, Kai Henningsen wrote:

> viro@math.psu.edu (Alexander Viro)  wrote on 15.05.01 in <Pine.GSO.4.21.0105150550110.21081-100000@weyl.math.psu.edu>:
> 
> > ... and Multics had all access to files through equivalent of mmap()
> > in 60s. "Segments" in ls(1) got that name for a good reason.
> 
> Where's something called "segments" connected with ls(1)? I can't seem to  
> find the reference.

ls == list segments. Name came from Multics.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15 20:22                               ` Alexander Viro
  2001-05-15 20:26                                 ` H. Peter Anvin
@ 2001-05-15 21:02                                 ` Linus Torvalds
  2001-05-15 21:53                                   ` Jan Harkes
  1 sibling, 1 reply; 75+ messages in thread
From: Linus Torvalds @ 2001-05-15 21:02 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.GSO.4.21.0105151621350.21081-100000@weyl.math.psu.edu>,
Alexander Viro  <viro@math.psu.edu> wrote:
>On Tue, 15 May 2001, H. Peter Anvin wrote:
>
>> Alexander Viro wrote:
>> > >
>> > > None whatsoever.  The one thing that matters is that no one starts making
>> > > the assumption that mapping->host->i_mapping == mapping.
>> > 
>> > One actually shouldn't assume that mapping->host is an inode.
>> > 
>> 
>> What else could it be, since it's a "struct inode *"?  NULL?
>
>struct block_device *, for one thing. We'll have to do that as soon
>as we do block devices in pagecache.

No, Al. It's an inode. It was a major mistake to ever think anything
else.

I see your problem, but it's not a real problem.  What you do for block
devices (or anything like that where you might have _multiple_ inodes
pointing to the same thing, is to just create a "virtual inode", and
have THAT be the one that the mapping is associated with.  Basically
each "struct block_device *" would have an inode associated with it, to
act as a anchor for things like this. 

What is "struct inode", after all? It's just the virtual representation
of a "entity". The inodes associated with /dev/hda are not the inodes
associated with the actual _device_ - they are just on-disk "links" to
the physical device. 

[ Aside: there are good arguments to _not_ embed "struct inode" into
  "struct block_device", but instead do it the other way around - the
  same way we have filesystem-specific inode data inside "struct inode"
  we can easily have device-type specific data there.  And it makes a
  whole lot more sense to attach a mount to an inode than it makes to
  attach a mount to a "struct block_device".

  Done right, we could eventually get rid of "loopback block devices".
  They'd just be inodes that aren't of type "struct block_device", and
  the index to "struct buffer_head" would not be <block_device *, blknr,
  size>, but <inode *, blknr, size>. See? The added level of indirection
  is one that we actually already _use_, it's just that we have this
  loopback device special case for it..

  In a "perfect" setup you could actually do "mount -t ext2 file /mnt/x"
  without having _any_ loopback setup or anything like that, simply
  because you don't _need_ it. It would be automatic. ]
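
A hypothetical sketch of the indexing change described in the aside above
(assumed names only; nothing like this is in the tree): the key for a
cached block becomes an inode rather than a block device, so a plain file
can serve as the backing store directly, without a loopback device.

	struct inode;			/* forward declaration for the sketch */

	/* Hypothetical cache key, replacing <block_device *, blknr, size>. */
	struct blkcache_key {
		struct inode *backing;	/* device inode, or a plain file inode */
		unsigned long blocknr;	/* block number within that inode      */
		unsigned int size;	/* block size in bytes                 */
	};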

		Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15 20:58                 ` Alexander Viro
@ 2001-05-15 21:08                   ` Alexander Viro
  0 siblings, 0 replies; 75+ messages in thread
From: Alexander Viro @ 2001-05-15 21:08 UTC (permalink / raw)
  To: Kai Henningsen; +Cc: linux-kernel



On Tue, 15 May 2001, Alexander Viro wrote:

> On 15 May 2001, Kai Henningsen wrote:
> 
> > viro@math.psu.edu (Alexander Viro)  wrote on 15.05.01 in <Pine.GSO.4.21.0105150550110.21081-100000@weyl.math.psu.edu>:
> > 
> > > ... and Multics had all access to files through equivalent of mmap()
> > > in 60s. "Segments" in ls(1) got that name for a good reason.
> > 
> > Where's something called "segments" connected with ls(1)? I can't seem to  
> > find the reference.
> 
> ls == list segments. Name came from Multics.

Basically, they had the whole address space consisting of mmaped files.
address was (segment << 18) + offset (both up to 18 bits) and primitive
was "attach segment (== file) to address space". Each segment had its
own page table, BTW. Directories were special segments and contained
references to other segments (both files and directories). Root had fixed
ID. You could lookup segment by name.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15 20:31                                   ` Alexander Viro
@ 2001-05-15 21:12                                     ` Linus Torvalds
  2001-05-15 21:22                                     ` H. Peter Anvin
  1 sibling, 0 replies; 75+ messages in thread
From: Linus Torvalds @ 2001-05-15 21:12 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.GSO.4.21.0105151628340.21081-100000@weyl.math.psu.edu>,
Alexander Viro  <viro@math.psu.edu> wrote:
>> 
>> How would you know what datatype it is?  A union?  Making "struct
>> block_device *" a "struct inode *" in a nonmounted filesystem?  In a
>> devfs?  (Seriously.  Being able to do these kinds of data-structural
>> equivalence is IMO the nice thing about devfs & co...)
>
>void *.

No. It used to be that way, and it was a horrible mess.

We _need_ to know that it's an inode, because the generic mapping
functions basically need to do things like

	mark_inode_dirty_pages(mapping->host);

which in turn needs the host to be an inode (otherwise you don't know
how and where to write the dang things back again).

There's no question that you can avoid it being an inode by virtualizing
more of it, and adding more virtual functions to the mapping operations
(right now the only one you'd HAVE to add is the "mark_page_dirty()"
operation), but the fact is that code gets really ugly by doing things
like that.

It was an absolute pleasure to remove all the casts of "mapping->host".
With "void *" it needed to be cast to the right type (and you had to be
able to _prove_ that you knew what the right type was). With "inode *",
the type is statically known, and you don't actually lose anything (at
worst, you'd have a virtual inode and then do an extra layer of
indirection there).
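
As a small illustration of the difference (hypothetical structures, shown
only to make the cast issue visible; mark_inode_dirty_pages is the call
named above, declared here just so the sketch is self-contained):

	struct inode;
	void mark_inode_dirty_pages(struct inode *inode);	/* as above */

	struct mapping_with_void_host  { void *host; };
	struct mapping_with_inode_host { struct inode *host; };

	static void dirty_void_host(struct mapping_with_void_host *m)
	{
		/* caller must know (and prove) what the pointer really is */
		mark_inode_dirty_pages((struct inode *) m->host);
	}

	static void dirty_inode_host(struct mapping_with_inode_host *m)
	{
		/* statically typed -- no cast, no guessing */
		mark_inode_dirty_pages(m->host);
	}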

I really don't think we want to go back to "void *". 

		Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15 20:31                                   ` Alexander Viro
  2001-05-15 21:12                                     ` Linus Torvalds
@ 2001-05-15 21:22                                     ` H. Peter Anvin
  1 sibling, 0 replies; 75+ messages in thread
From: H. Peter Anvin @ 2001-05-15 21:22 UTC (permalink / raw)
  To: Alexander Viro; +Cc: H. Peter Anvin, linux-kernel

Alexander Viro wrote:
> 
> void *.
> 
> Look, methods of your address_space certainly know what the hell they
> are dealing with. Just as autofs_root_readdir() knows what inode->u.generic_ip
> really points to.
> 
> Anybody else has no business to care about the contents of ->host.
> 

Why do we need a ->host at all, then?  Why not simply make it a private
pointer?

	-hpa

-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15 21:02                                 ` Linus Torvalds
@ 2001-05-15 21:53                                   ` Jan Harkes
  0 siblings, 0 replies; 75+ messages in thread
From: Jan Harkes @ 2001-05-15 21:53 UTC (permalink / raw)
  To: linux-kernel

On Tue, May 15, 2001 at 02:02:29PM -0700, Linus Torvalds wrote:
> In article <Pine.GSO.4.21.0105151621350.21081-100000@weyl.math.psu.edu>,
> Alexander Viro  <viro@math.psu.edu> wrote:
> >On Tue, 15 May 2001, H. Peter Anvin wrote:
> >
> >> Alexander Viro wrote:
> >> > >
> >> > > None whatsoever.  The one thing that matters is that no one starts making
> >> > > the assumption that mapping->host->i_mapping == mapping.

Don't worry too much about that, that relationship has been false for
Coda ever since i_mapping was introduced.

The only problem that is still lingering is related to i_size. Writes
update inode->i_mapping->host->i_size, and stat reads inode->i_size,
which are not the same.

I sent a small patch to stat.c for this a long time ago (Linux
2.3.99-pre6-7), which made the assumption in stat that i_mapping->host
was an inode. (effectively tmp.st_size = inode->i_mapping->host->i_size)

Other solutions were to finish the getattr implementation, or keep a
small Coda-specific wrapper for generic_file_write around.

> >> > One actually shouldn't assume that mapping->host is an inode.
> >> 
> >> What else could it be, since it's a "struct inode *"?  NULL?
> >
> >struct block_device *, for one thing. We'll have to do that as soon
> >as we do block devices in pagecache.
> 
> No, Al. It's an inode. It was a major mistake to ever think anything
> else.

So is anyone interested in a small patch for stat.c? It fixes, as far as
I know, the last place that 'assumes' that inode->i_mapping->host is
identical to &inode.

Jan


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15 19:28               ` H. Peter Anvin
@ 2001-05-15 22:31                 ` Albert D. Cahalan
  2001-05-15 22:35                   ` H. Peter Anvin
  2001-05-16  1:17                   ` Anton Altaparmakov
  0 siblings, 2 replies; 75+ messages in thread
From: Albert D. Cahalan @ 2001-05-15 22:31 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel

H. Peter Anvin writes:

> This would leave no way (without introducing new interfaces) to write,
> for example, the boot block on an ext2 filesystem.  Note that the
> bootblock (defined as the first 1024 bytes) is not actually used by
> the filesystem, although depending on the block size it may share a
> block with the superblock (if blocksize > 1024).

The lack of coherency would screw this up anyway, doesn't it?
You have a block device, soon to be in the page cache, and
a superblock, also soon to be in the page cache. LILO writes to
the block device, while the ext2 driver updates the superblock.
Whatever gets written out last wins, and the other is lost.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15 22:31                 ` Albert D. Cahalan
@ 2001-05-15 22:35                   ` H. Peter Anvin
  2001-05-16  1:17                   ` Anton Altaparmakov
  1 sibling, 0 replies; 75+ messages in thread
From: H. Peter Anvin @ 2001-05-15 22:35 UTC (permalink / raw)
  To: Albert D. Cahalan; +Cc: H. Peter Anvin, linux-kernel

"Albert D. Cahalan" wrote:
> 
> H. Peter Anvin writes:
> 
> > This would leave no way (without introducing new interfaces) to write,
> > for example, the boot block on an ext2 filesystem.  Note that the
> > bootblock (defined as the first 1024 bytes) is not actually used by
> > the filesystem, although depending on the block size it may share a
> > block with the superblock (if blocksize > 1024).
> 
> The lack of coherency would screw this up anyway, doesn't it?
> You have a block device, soon to be in the page cache, and
> a superblock, also soon to be in the page cache. LILO writes to
> the block device, while the ext2 driver updates the superblock.
> Whatever gets written out last wins, and the other is lost.
> 

Albert, I *did* say "this better work or we have a problem."

	-hpa

-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15 22:31                 ` Albert D. Cahalan
  2001-05-15 22:35                   ` H. Peter Anvin
@ 2001-05-16  1:17                   ` Anton Altaparmakov
  2001-05-16  1:30                     ` H. Peter Anvin
  2001-05-16  8:34                     ` Anton Altaparmakov
  1 sibling, 2 replies; 75+ messages in thread
From: Anton Altaparmakov @ 2001-05-16  1:17 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Albert D. Cahalan, H. Peter Anvin, linux-kernel

At 23:35 15/05/2001, H. Peter Anvin wrote:
>"Albert D. Cahalan" wrote:
> > H. Peter Anvin writes:
> > > This would leave no way (without introducing new interfaces) to write,
> > > for example, the boot block on an ext2 filesystem.  Note that the
> > > bootblock (defined as the first 1024 bytes) is not actually used by
> > > the filesystem, although depending on the block size it may share a
> > > block with the superblock (if blocksize > 1024).
> >
> > The lack of coherency would screw this up anyway, doesn't it?
> > You have a block device, soon to be in the page cache, and
> > a superblock, also soon to be in the page cache. LILO writes to
> > the block device, while the ext2 driver updates the superblock.
> > Whatever gets written out last wins, and the other is lost.
>
>Albert, I *did* say "this better work or we have a problem."

And how are you thinking of this working "without introducing new 
interfaces" if the caches are indeed incoherent? Please correct me if I 
understand wrong, but when two caches are incoherent, I thought it means 
that the above _would_ screw up unless protected by exclusive write locking 
as I suggested in my previous post with the side effect that you can't 
write the boot block without unmounting the filesystem or modifying some 
interface somewhere.

As not all filesystems are like ext2, perhaps it would be better to fix 
ext2 and not the cache coherency? If ext2 is claiming ownership of a 
device, then it should do so in its entirety IMHO. You could always extend 
ext2 to use the NTFS approach where the bootsector is nothing more than a 
file which happens to exist on sector(s) zero (and following) of the 
device... (just a thought)

Best regards,

Anton


-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://sourceforge.net/projects/linux-ntfs/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-16  1:17                   ` Anton Altaparmakov
@ 2001-05-16  1:30                     ` H. Peter Anvin
  2001-05-16  8:34                     ` Anton Altaparmakov
  1 sibling, 0 replies; 75+ messages in thread
From: H. Peter Anvin @ 2001-05-16  1:30 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Albert D. Cahalan, H. Peter Anvin, linux-kernel

Anton Altaparmakov wrote:
> 
> And how are you thinking of this working "without introducing new
> interfaces" if the caches are indeed incoherent? Please correct me if I
> understand wrong, but when two caches are incoherent, I thought it means
> that the above _would_ screw up unless protected by exclusive write locking
> as I suggested in my previous post with the side effect that you can't
> write the boot block without unmounting the filesystem or modifying some
> interface somewhere.
> 

Not if direct device access and the superblock exist in the same mapping
space, OR an explicit interface to write the boot block is created.

	-hpa

-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-16  1:17                   ` Anton Altaparmakov
  2001-05-16  1:30                     ` H. Peter Anvin
@ 2001-05-16  8:34                     ` Anton Altaparmakov
  2001-05-16 16:27                       ` H. Peter Anvin
  1 sibling, 1 reply; 75+ messages in thread
From: Anton Altaparmakov @ 2001-05-16  8:34 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Albert D. Cahalan, H. Peter Anvin, linux-kernel

At 02:30 16/05/2001, H. Peter Anvin wrote:
>Anton Altaparmakov wrote:
> > And how are you thinking of this working "without introducing new
> > interfaces" if the caches are indeed incoherent? Please correct me if I
> > understand wrong, but when two caches are incoherent, I thought it means
> > that the above _would_ screw up unless protected by exclusive write locking
> > as I suggested in my previous post with the side effect that you can't
> > write the boot block without unmounting the filesystem or modifying some
> > interface somewhere.
>
>Not if direct device access and the superblock exist in the same mapping
>space, OR an explicit interface to write the boot block is created.

True, but I was under the impression that Linus' master plan was that the 
two would be in entirely separate name spaces using separate cached copies 
of the device blocks. Putting them into the same cache would make things 
work of course, although direct access would probably give you a view of an 
inconsistent file system if the fs was writing around the page cache at the 
time (unless the fs and direct accesses lock every page on write access, 
perhaps by zeroing the uptodate flag on the page).

An explicit interface for the boot block would be interesting. AFAICS it 
would have to call down into the file system driver itself (a 
read/write_boot_block method in super_operations perhaps?) due to the 
differences in how the boot block is stored on different filesystems 
(thinking of the "boot block is a file" NTFS case).
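
To make that concrete, a rough sketch of the kind of hook I have in mind 
(the names and signatures here are purely hypothetical - nothing like this 
exists in super_operations today):

	/*
	 * Hypothetical only: a per-filesystem hook for boot block access.
	 * ext2 would map it onto the reserved first 1024 bytes, NTFS would
	 * route it through its "boot block is a file" $Boot inode.
	 */
	struct super_block;	/* opaque for this sketch */

	struct boot_block_operations {
		int (*read_boot_block)(struct super_block *sb,
				       void *buf, unsigned long len);
		int (*write_boot_block)(struct super_block *sb,
					const void *buf, unsigned long len);
	};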

Best regards,

         Anton


-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://sourceforge.net/projects/linux-ntfs/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-16  8:34                     ` Anton Altaparmakov
@ 2001-05-16 16:27                       ` H. Peter Anvin
  0 siblings, 0 replies; 75+ messages in thread
From: H. Peter Anvin @ 2001-05-16 16:27 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Albert D. Cahalan, H. Peter Anvin, linux-kernel

Anton Altaparmakov wrote:
> 
> True, but I was under the impression that Linus' master plan was that the
> two would be in entirely separate name spaces using separate cached copies
> of the device blocks.
> 

Nothing was said about the superblock at all.

	-hpa

-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15  4:43         ` Linus Torvalds
  2001-05-15  5:04           ` Alexander Viro
  2001-05-15 16:17           ` Pavel Machek
@ 2001-05-18  7:55           ` Rogier Wolff
  2001-05-23 11:36             ` Stephen C. Tweedie
  2 siblings, 1 reply; 75+ messages in thread
From: Rogier Wolff @ 2001-05-18  7:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Richard Gooch, Kernel Mailing List

Linus Torvalds wrote:
> I'm really serious about doing "resume from disk". If you want a fast
> boot, I will bet you a dollar that you cannot do it faster than by loading
> a contiguous image of several megabytes contiguously into memory. There is
> NO overhead, you're pretty much guaranteed platter speeds, and there are
> no issues about trying to order accesses etc. There are also no issues
> about messing up any run-time data structures.

Linus, 

The "boot quickly" was an example. "Load netscape quickly" on some
systems is done by dd-ing the binary to /dev/null. 

Now, you're going to say again that this won't work because of
buffer-cache/page-cache incoherency.  That is NOT the point. The point
is that the fun about a cache is that it's just a cache. It speeds
things up transparently. 

If I need a new "prime-the-cache" program to mmap the files, and
trigger a page-in in the right order, then that's fine with me.
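
Something along these lines would be enough, I imagine - a quick userspace 
sketch (error handling kept minimal, and nothing here relies on any new 
kernel interface; it just faults every page in so the page cache warms up):

	/* prime-cache: mmap each file given on the command line and touch
	 * one byte per page so the kernel reads it into the page cache. */
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <sys/types.h>

	int main(int argc, char **argv)
	{
		long pagesize = sysconf(_SC_PAGESIZE);
		int i;

		for (i = 1; i < argc; i++) {
			struct stat st;
			char *map;
			off_t off;
			volatile char sink = 0;
			int fd = open(argv[i], O_RDONLY);

			if (fd < 0)
				continue;
			if (fstat(fd, &st) < 0 || st.st_size == 0) {
				close(fd);
				continue;
			}
			map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
			if (map != MAP_FAILED) {
				for (off = 0; off < st.st_size; off += pagesize)
					sink += map[off];	/* fault the page in */
				munmap(map, st.st_size);
			}
			close(fd);
		}
		return 0;
	}

(The interesting part, of course, is feeding it filenames in the order 
they were needed last time.)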

The fun about doing these tricks is that it works, and keeps on
working (functionally) even if it stops working (fast).

Yes, there is a way to boot even faster: preloading memory. Fine. But
this doesn't allow me to load netscape quicker.

			Roger. 

-- 
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots. 
* There are also old, bald pilots. 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15  8:06                 ` Linus Torvalds
  2001-05-15  8:33                   ` Alexander Viro
@ 2001-05-19  5:26                   ` Chris Wedgwood
  1 sibling, 0 replies; 75+ messages in thread
From: Chris Wedgwood @ 2001-05-19  5:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Richard Gooch, Kernel Mailing List

On Tue, May 15, 2001 at 01:06:47AM -0700, Linus Torvalds wrote:

    But no, I doubt we'll move _all_ metadata into the page-cache. I
    doubt, for example, that we'll find people re-doing all the other
    filesystems. So even if ext2 was page-cache only, what about all
    the 35 other filesystems out there in the standard sources, never
    mind others that haven't been integrated (XFS, ext3 etc..).

Hmm... what about dropping the block device per se and creating a
pseudo-fs (say /dev/blk/) that gives a page-cache view of the
underlying raw devices?

That way older filesystems and tools could use that view of things
until they are moved into the page-cache, and those people without a
dependence on those filesystems could completely forgo all of this?




  --cw

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15 16:17           ` Pavel Machek
@ 2001-05-19 19:39             ` Linus Torvalds
  2001-05-19 19:44               ` Pavel Machek
  2001-05-20  4:30               ` Chris Wedgwood
  0 siblings, 2 replies; 75+ messages in thread
From: Linus Torvalds @ 2001-05-19 19:39 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Richard Gooch, Kernel Mailing List


On Tue, 15 May 2001, Pavel Machek wrote:
> 
> resume from disk is actually pretty hard to do in a way where it is read
> linearly.
> 
> While playing with swsusp patches (== suspend to disk) I found out that
> it was slow. It needs to do an atomic snapshot, and the only reasonable
> way to do that is to free half of RAM, cli() and copy.

Note that "resume from disk" does _not_ have to necessarily resume kernel
data structures. It is enough if it just resumes the caches etc. 

Don't get _too_ hung up about the power-management kind of "invisible
suspend/resume" sequence where you resume the whole kernel state.

		Linus


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-19 19:39             ` Linus Torvalds
@ 2001-05-19 19:44               ` Pavel Machek
  2001-05-19 19:47                 ` Linus Torvalds
  2001-05-20  4:30               ` Chris Wedgwood
  1 sibling, 1 reply; 75+ messages in thread
From: Pavel Machek @ 2001-05-19 19:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Richard Gooch, Kernel Mailing List

Hi!

> > resume from disk is actually pretty hard to do in a way where it is read
> > linearly.
> > 
> > While playing with swsusp patches (== suspend to disk) I found out that
> > it was slow. It needs to do an atomic snapshot, and the only reasonable
> > way to do that is to free half of RAM, cli() and copy.
> 
> Note that "resume from disk" does _not_ have to necessarily resume kernel
> data structures. It is enough if it just resumes the caches etc. 

> Don't get _too_ hung up about the power-management kind of "invisible
> suspend/resume" sequence where you resume the whole kernel state.

Ugh. Now I'm confused. How do you do a useful resume from disk when you
don't restore the complete state? Do you propose something like "write
only the pagecache to disk"?
								Pavel
-- 
The best software in life is free (not shareware)!		Pavel
GCM d? s-: !g p?:+ au- a--@ w+ v- C++@ UL+++ L++ N++ E++ W--- M- Y- R+

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-19 19:44               ` Pavel Machek
@ 2001-05-19 19:47                 ` Linus Torvalds
  2001-05-23 11:29                   ` Stephen C. Tweedie
  0 siblings, 1 reply; 75+ messages in thread
From: Linus Torvalds @ 2001-05-19 19:47 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Richard Gooch, Kernel Mailing List


On Sat, 19 May 2001, Pavel Machek wrote:
> 
> > Don't get _too_ hung up about the power-management kind of "invisible
> > suspend/resume" sequence where you resume the whole kernel state.
> 
> Ugh. Now I'm confused. How do you do a useful resume from disk when you
> don't restore the complete state? Do you propose something like "write
> only the pagecache to disk"?

Go back to the original _reason_ for this whole discussion. 

It's not really a "resume" event, it's a "populate caches really
efficiently at boot" event. But the two are basically the same problem;
it's only a matter of how much you populate (do you populate _everything_,
or just the disk caches? Populating just the caches is the smaller and
simpler problem, which only solves the "fast boot" issue).

		Linus


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-19 19:39             ` Linus Torvalds
  2001-05-19 19:44               ` Pavel Machek
@ 2001-05-20  4:30               ` Chris Wedgwood
  2001-05-20 19:47                 ` Alan Cox
  1 sibling, 1 reply; 75+ messages in thread
From: Chris Wedgwood @ 2001-05-20  4:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Pavel Machek, Richard Gooch, Kernel Mailing List

On Sat, May 19, 2001 at 12:39:18PM -0700, Linus Torvalds wrote:

    Note that "resume from disk" does _not_ have to necessarily
    resume kernel data structures. It is enough if it just resumes
    the caches etc.

For speeding up a boot process, sure... but for suspend/resume on a
laptop --- why would you bother?

    Don't get _too_ hung up about the power-management kind of
    "invisible suspend/resume" sequence where you resume the whole
    kernel state.

I'm confused. I've always wondered why, before you suspend the state
of a machine to disk, we don't just throw away unnecessary data,
like anything not actively referenced.



  --cw

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-20  4:30               ` Chris Wedgwood
@ 2001-05-20 19:47                 ` Alan Cox
  0 siblings, 0 replies; 75+ messages in thread
From: Alan Cox @ 2001-05-20 19:47 UTC (permalink / raw)
  To: Chris Wedgwood
  Cc: Linus Torvalds, Pavel Machek, Richard Gooch, Kernel Mailing List

> I'm confused. I've always wondered why, before you suspend the state
> of a machine to disk, we don't just throw away unnecessary data,
> like anything not actively referenced.

swsusp does exactly that.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-19 19:47                 ` Linus Torvalds
@ 2001-05-23 11:29                   ` Stephen C. Tweedie
  0 siblings, 0 replies; 75+ messages in thread
From: Stephen C. Tweedie @ 2001-05-23 11:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pavel Machek, Richard Gooch, Kernel Mailing List, Stephen Tweedie

Hi,

On Sat, May 19, 2001 at 12:47:15PM -0700, Linus Torvalds wrote:
> 
> On Sat, 19 May 2001, Pavel Machek wrote:
> > 
> > > Don't get _too_ hung up about the power-management kind of "invisible
> > > suspend/resume" sequence where you resume the whole kernel state.
> > 
> > Ugh. Now I'm confused. How do you do a useful resume from disk when you
> > don't restore the complete state? Do you propose something like "write
> > only the pagecache to disk"?
> 
> Go back to the original _reason_ for this whole discussion. 
> 
> It's not really a "resume" event, it's a "populate caches really
> efficiently at boot" event.

Then you'd better be sure that the cache (or at least, the saved
image) only contains data which is guaranteed not to be written
between successive restores from the same image.  The big advantage of
just resuming from the state of the previous shutdown (whether it's
cache or the whole kernel state) is that you've got a much higher
expectation that nothing on disk got modified between the save and the
restore.

--Stephen

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-18  7:55           ` Rogier Wolff
@ 2001-05-23 11:36             ` Stephen C. Tweedie
  0 siblings, 0 replies; 75+ messages in thread
From: Stephen C. Tweedie @ 2001-05-23 11:36 UTC (permalink / raw)
  To: Rogier Wolff
  Cc: Linus Torvalds, Richard Gooch, Kernel Mailing List, Stephen Tweedie

Hi,

On Fri, May 18, 2001 at 09:55:14AM +0200, Rogier Wolff wrote:

> The "boot quickly" was an example. "Load netscape quickly" on some
> systems is done by dd-ing the binary to /dev/null. 

This is one of the reasons why some filesystems use extent maps
instead of inode indirection trees.  The problem of caching the
metadata basically just goes away if your mapping information is a few
bytes saying "this file is an extent of a hundred blocks at offset FOO
followed by fifty blocks at offset BAR."
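
In other words, something like this (purely illustrative - not any real 
filesystem's on-disk format):

	/*
	 * Illustrative only: one small record describes a whole run of
	 * blocks, instead of one block pointer per block.
	 */
	struct extent {
		unsigned int file_block;	/* first logical block covered */
		unsigned int disk_block;	/* where the run starts on disk */
		unsigned int count;		/* contiguous blocks in the run */
	};

	/* "a hundred blocks at FOO, then fifty at BAR" is just:
	 *	{ 0, FOO, 100 }, { 100, BAR, 50 }
	 */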

If the mapping metadata is _that_ compact, then your binaries are
almost guaranteed to be either mapped in the inode or in a single
mapping block, so the problem of seeking between indirect blocks
basically just goes away.  You still have to do things like prime the
inode/indirect cache before the first data access if you want
directory scans to go fast, and you still have to preload data pages
for readahead, of course.  

If the objective is "start netscape faster", then the cost of having
to do one synchronous IO to pull in a single indirect extent map block
is going to be negligible next to the other costs.

(Extent maps have their own problems, especially when it comes to
dealing with holes, but that's a different story...)

--Stephen

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Getting FS access events
  2001-05-15  4:37     ` Chris Wedgwood
@ 2001-05-23 11:37       ` Stephen C. Tweedie
  0 siblings, 0 replies; 75+ messages in thread
From: Stephen C. Tweedie @ 2001-05-23 11:37 UTC (permalink / raw)
  To: Chris Wedgwood
  Cc: Richard Gooch, Larry McVoy, Linus Torvalds, Kernel Mailing List,
	Stephen Tweedie

Hi,

On Tue, May 15, 2001 at 04:37:01PM +1200, Chris Wedgwood wrote:
> On Sun, May 13, 2001 at 08:39:23PM -0600, Richard Gooch wrote:
> 
>     Yeah, we need a decent unfragmenter. We can do that now with
>     bmap().
> 
> SCT wrote a defragger for ext2 but it only handles 1k blocks :(

Actually, I wrote it for extfs, and Alexey Vovenko ported it to ext2.
Extfs *really* needed a defragmenter, because it had weird behaviour
patterns which included allocating all of the blocks of a file in
descending disk block order at times.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 75+ messages in thread

end of thread, other threads:[~2001-05-23 12:06 UTC | newest]

Thread overview: 75+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <200105140117.f4E1HqN07362@vindaloo.ras.ucalgary.ca>
2001-05-14  1:32 ` Getting FS access events Linus Torvalds
2001-05-14  1:45   ` Larry McVoy
2001-05-14  2:39   ` Richard Gooch
2001-05-14  3:09     ` Rik van Riel
2001-05-14  4:27     ` Richard Gooch
2001-05-15  4:37     ` Chris Wedgwood
2001-05-23 11:37       ` Stephen C. Tweedie
2001-05-14  2:24 ` Richard Gooch
2001-05-14  4:46   ` Linus Torvalds
2001-05-14  5:15   ` Richard Gooch
2001-05-14 13:04     ` Daniel Phillips
2001-05-14 18:00       ` Andreas Dilger
2001-05-14 20:16     ` Linus Torvalds
2001-05-14 23:19     ` Richard Gooch
2001-05-15  0:42       ` Daniel Phillips
2001-05-15  4:00       ` Linus Torvalds
2001-05-15  4:35         ` Larry McVoy
2001-05-15  4:59           ` Alexander Viro
2001-05-15 17:01             ` Pavel Machek
2001-05-15  4:43         ` Linus Torvalds
2001-05-15  5:04           ` Alexander Viro
2001-05-15 16:17           ` Pavel Machek
2001-05-19 19:39             ` Linus Torvalds
2001-05-19 19:44               ` Pavel Machek
2001-05-19 19:47                 ` Linus Torvalds
2001-05-23 11:29                   ` Stephen C. Tweedie
2001-05-20  4:30               ` Chris Wedgwood
2001-05-20 19:47                 ` Alan Cox
2001-05-18  7:55           ` Rogier Wolff
2001-05-23 11:36             ` Stephen C. Tweedie
2001-05-15  4:57         ` David S. Miller
2001-05-15  5:12           ` Alexander Viro
2001-05-15  9:10           ` Alan Cox
2001-05-15  9:48             ` Lars Brinkhoff
2001-05-15  9:54               ` Alexander Viro
2001-05-15 20:17               ` Kai Henningsen
2001-05-15 20:58                 ` Alexander Viro
2001-05-15 21:08                   ` Alexander Viro
2001-05-15  6:20         ` Richard Gooch
2001-05-15  6:28           ` Linus Torvalds
2001-05-15  6:49           ` Richard Gooch
2001-05-15  6:57             ` Alexander Viro
2001-05-15 10:33               ` Daniel Phillips
2001-05-15 10:44                 ` Alexander Viro
2001-05-15 14:42                   ` Daniel Phillips
2001-05-15  7:13             ` Linus Torvalds
2001-05-15  7:56               ` Chris Wedgwood
2001-05-15  8:06                 ` Linus Torvalds
2001-05-15  8:33                   ` Alexander Viro
2001-05-15 10:27                     ` David Woodhouse
2001-05-15 16:00                     ` Chris Mason
2001-05-15 19:26                     ` H. Peter Anvin
2001-05-15 20:03                       ` Alexander Viro
2001-05-15 20:07                         ` H. Peter Anvin
2001-05-15 20:15                           ` Alexander Viro
2001-05-15 20:17                             ` H. Peter Anvin
2001-05-15 20:22                               ` Alexander Viro
2001-05-15 20:26                                 ` H. Peter Anvin
2001-05-15 20:31                                   ` Alexander Viro
2001-05-15 21:12                                     ` Linus Torvalds
2001-05-15 21:22                                     ` H. Peter Anvin
2001-05-15 21:02                                 ` Linus Torvalds
2001-05-15 21:53                                   ` Jan Harkes
2001-05-19  5:26                   ` Chris Wedgwood
2001-05-15 10:04             ` Anton Altaparmakov
2001-05-15 19:28               ` H. Peter Anvin
2001-05-15 22:31                 ` Albert D. Cahalan
2001-05-15 22:35                   ` H. Peter Anvin
2001-05-16  1:17                   ` Anton Altaparmakov
2001-05-16  1:30                     ` H. Peter Anvin
2001-05-16  8:34                     ` Anton Altaparmakov
2001-05-16 16:27                       ` H. Peter Anvin
2001-05-15 16:26             ` Pavel Machek
2001-05-15 18:02             ` Craig Milo Rogers
2001-05-15  6:13       ` Richard Gooch

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).