All of lore.kernel.org
 help / color / mirror / Atom feed
* statfs() / statvfs() syscall ballsup...
@ 2003-10-09 22:16 Trond Myklebust
  2003-10-09 22:26 ` Linus Torvalds
  2003-10-09 23:16 ` Andreas Dilger
  0 siblings, 2 replies; 64+ messages in thread
From: Trond Myklebust @ 2003-10-09 22:16 UTC (permalink / raw)
  To: Ulrich Drepper, Linus Torvalds; +Cc: Linux Kernel


Hi,

  We appear to have a problem with the new statfs interface
in 2.6.0...

The problem is that as far as userland is concerned, 'struct statfs'
reports f_blocks, f_bfree,... in units of the "optimal transfer size":
f_bsize (backwards compatibility).

OTOH 'struct statvfs' reports the same values in units of the fragment
size (the blocksize of the underlying filesyste): f_frsize. (says
Single User Spec v2)

Both are apparently supposed to syscall down via sys_statfs()...

Question: how we're supposed to reconcile the two cases for something
like NFS, where these 2 values are supposed to differ?

Note that f_bsize is usually larger than f_frsize, hence conversions
from the former to the latter are subject to rounding errors...

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-09 22:16 statfs() / statvfs() syscall ballsup Trond Myklebust
@ 2003-10-09 22:26 ` Linus Torvalds
  2003-10-09 23:19   ` Ulrich Drepper
                     ` (2 more replies)
  2003-10-09 23:16 ` Andreas Dilger
  1 sibling, 3 replies; 64+ messages in thread
From: Linus Torvalds @ 2003-10-09 22:26 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Ulrich Drepper, Linux Kernel


On Thu, 9 Oct 2003, Trond Myklebust wrote:
> 
> Question: how we're supposed to reconcile the two cases for something
> like NFS, where these 2 values are supposed to differ?

I'd suggest going for "optimal block size everywhere".

> Note that f_bsize is usually larger than f_frsize, hence conversions
> from the former to the latter are subject to rounding errors...

User space shouldn't know or care about frsize, and it doesn't even 
necessarily make any sense on a lot of filesystems, so make it easy for 
the user. It's not as if the rounding errors really matter.

		Linus


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-09 22:16 statfs() / statvfs() syscall ballsup Trond Myklebust
  2003-10-09 22:26 ` Linus Torvalds
@ 2003-10-09 23:16 ` Andreas Dilger
  2003-10-09 23:24   ` Linus Torvalds
  1 sibling, 1 reply; 64+ messages in thread
From: Andreas Dilger @ 2003-10-09 23:16 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Ulrich Drepper, Linus Torvalds, Linux Kernel

On Oct 09, 2003  18:16 -0400, Trond Myklebust wrote:
>   We appear to have a problem with the new statfs interface
> in 2.6.0...
> 
> The problem is that as far as userland is concerned, 'struct statfs'
> reports f_blocks, f_bfree,... in units of the "optimal transfer size":
> f_bsize (backwards compatibility).
> 
> OTOH 'struct statvfs' reports the same values in units of the fragment
> size (the blocksize of the underlying filesyste): f_frsize. (says
> Single User Spec v2)
> 
> Both are apparently supposed to syscall down via sys_statfs()...
> 
> Question: how we're supposed to reconcile the two cases for something
> like NFS, where these 2 values are supposed to differ?

Actually, what is also a problem is that there is no hook for the system
to return different results for the 32-bit and 64-bit statfs structs.
Because Lustre is used on very large filesystems (i.e. 100TB+) we can't
fit the result into 32 bits without increasing f_bsize and reducing
f_bavail/f_bfree/f_blocks proportionately.

It would be nice if we could know in advance if we are returning values
for sys_statfs() or sys_statfs64() (e.g. by sys_statfs64() calling an
optional sb->s_op->statfs64() method if available) so we didn't have to
do this munging.  We can't just assume 64-bit results, or callers of
sys_statfs() will get EOVERFLOW instead of slightly innacurate results.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-09 22:26 ` Linus Torvalds
@ 2003-10-09 23:19   ` Ulrich Drepper
  2003-10-10  0:22     ` viro
  2003-10-09 23:31   ` Trond Myklebust
  2003-10-10 12:27   ` Joel Becker
  2 siblings, 1 reply; 64+ messages in thread
From: Ulrich Drepper @ 2003-10-09 23:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Trond Myklebust, Linux Kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Linus Torvalds wrote:

> User space shouldn't know or care about frsize, and it doesn't even 
> necessarily make any sense on a lot of filesystems, so make it easy for 
> the user. It's not as if the rounding errors really matter.

There have been numerous requests to add a statvfs syscall, at least
made to me.  The problem is that the emulation through statfs cannot be
optimal.  The emulation has to get all kinds of additional information
(like mount flags) which in some cases lead to hangs or delays.

- From what I see statvfs is much more frequently used than statfs so such
an extension would be justified.  And then the kernel would be able to
determine all the right values and guide the user with them as it pleases.

- -- 
- --------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)

iD8DBQE/he0B2ijCOnn/RHQRAg+GAKC48tj7myC+lITvghxPK/ZEWcLTnQCgpUh4
5whszj+14fucakVcsZ4sOIQ=
=EVjn
-----END PGP SIGNATURE-----


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-09 23:16 ` Andreas Dilger
@ 2003-10-09 23:24   ` Linus Torvalds
  0 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2003-10-09 23:24 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Trond Myklebust, Ulrich Drepper, Linux Kernel


On Thu, 9 Oct 2003, Andreas Dilger wrote:
> 
> It would be nice if we could know in advance if we are returning values
> for sys_statfs() or sys_statfs64() (e.g. by sys_statfs64() calling an
> optional sb->s_op->statfs64() method if available) so we didn't have to
> do this munging.  We can't just assume 64-bit results, or callers of
> sys_statfs() will get EOVERFLOW instead of slightly innacurate results.

This is something that sys_statfs() could do on its own. It's probably 
always better to try to scale the block size up than to return EOVERFLOW.

(Some things can't be scaled up, of course, like f_ffree etc. But it 
should be trivial to just do a "try to shift to make it fit" in the 
vfs_statfs_native() function in fs/open.c).

		Linus


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-09 22:26 ` Linus Torvalds
  2003-10-09 23:19   ` Ulrich Drepper
@ 2003-10-09 23:31   ` Trond Myklebust
  2003-10-10 12:27   ` Joel Becker
  2 siblings, 0 replies; 64+ messages in thread
From: Trond Myklebust @ 2003-10-09 23:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ulrich Drepper, Linux Kernel

>>>>> " " == Linus Torvalds <torvalds@osdl.org> writes:

    >> Note that f_bsize is usually larger than f_frsize, hence
    >> conversions from the former to the latter are subject to
    >> rounding errors...

     > User space shouldn't know or care about frsize, and it doesn't
     > even necessarily make any sense on a lot of filesystems, so
     > make it easy for the user. It's not as if the rounding errors
     > really matter.

It can lead to funny quirks when doing df: Used + Available != Total

Granted the effects won't be enormous (typically you'll see between 1
and 63 blocks off in the case of NFS w/ 32kwsize and 512byte frsize)
but people get upset about this. That was the reason for adding an
f_frsize field in the first place...

Note: one solution might be to swap the positions of f_frsize and
f_bsize in the kernel struct that is passed up to userland. I.e. pass
up

 struct statfs {
         __u32 f_type;
-         __u32 f_bsize;
+         __u32 f_frsize;
         __u32 f_blocks;
         __u32 f_bfree;
         __u32 f_bavail;
         __u32 f_files;
         __u32 f_ffree;
         __kernel_fsid_t f_fsid;
         __u32 f_namelen;
-         __u32 f_frsize;
+         __u32 f_bsize;
         __u32 f_spare[5];
 };

That will give correct values for the f_bfree, f_bavail,... in the
legacy statfs() case for all existing filesystems.

glibc's statvfs() can then do the correct thing if it detects a >=2.6.0
kernel. It needs to do a copy to its private statvfs struct anyway.

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-09 23:19   ` Ulrich Drepper
@ 2003-10-10  0:22     ` viro
  2003-10-10  4:49       ` Jamie Lokier
  0 siblings, 1 reply; 64+ messages in thread
From: viro @ 2003-10-10  0:22 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Linus Torvalds, Trond Myklebust, Linux Kernel

On Thu, Oct 09, 2003 at 04:19:29PM -0700, Ulrich Drepper wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Linus Torvalds wrote:
> 
> > User space shouldn't know or care about frsize, and it doesn't even 
> > necessarily make any sense on a lot of filesystems, so make it easy for 
> > the user. It's not as if the rounding errors really matter.
> 
> There have been numerous requests to add a statvfs syscall, at least
> made to me.  The problem is that the emulation through statfs cannot be
> optimal.  The emulation has to get all kinds of additional information
> (like mount flags) which in some cases lead to hangs or delays.

Umm...  I don't see anything equivalent to statfs(2) ->f_type in statvfs(2).
->f_frsize makes no sense for practically all filesystems we support.
->f_namemax is not well-defined ("maximum filename length" as in "you won't
see filenames longer than..." or "attempt to create a file with name longer
than... will fail" or "longer than that and I'm truncating";  and that is
aside of lovely questions about the meaning of "length" - strlen()?  number
of multibyte characters accepted by that fs? something else?)
->f_fsid is also practically undefined (and left 0 by practically every fs,
so no userland code can do anything useful with it).
->f_flag might be useful, all right.  However, I'd like to see real-world
examples of code (Solaris, whatever) that would use it in any meaningful
way...

Conclusion: if we care about something like statvfs(), it should *not* have
the statvfs() interface.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10  0:22     ` viro
@ 2003-10-10  4:49       ` Jamie Lokier
  2003-10-10  5:26         ` Trond Myklebust
  0 siblings, 1 reply; 64+ messages in thread
From: Jamie Lokier @ 2003-10-10  4:49 UTC (permalink / raw)
  To: viro; +Cc: Ulrich Drepper, Linus Torvalds, Trond Myklebust, Linux Kernel

viro@parcelfarce.linux.theplanet.co.uk wrote:
> Umm...  I don't see anything equivalent to statfs(2) ->f_type in statvfs(2).
> ->f_frsize makes no sense for practically all filesystems we support.
> ->f_namemax is not well-defined ("maximum filename length" as in "you won't
> see filenames longer than..." or "attempt to create a file with name longer
> than... will fail" or "longer than that and I'm truncating";  and that is
> aside of lovely questions about the meaning of "length" - strlen()?  number
> of multibyte characters accepted by that fs? something else?)
> ->f_fsid is also practically undefined (and left 0 by practically every fs,
> so no userland code can do anything useful with it).
> ->f_flag might be useful, all right.  However, I'd like to see real-world
> examples of code (Solaris, whatever) that would use it in any meaningful
> way...

On this theme, I'd like to know:

    - are dnotify / lease / lock reliable indicators on this filesystem?
      (i.e. dnotify is reliable on all local filesystems, but not over any
      of the remote ones AFAIK).

    - is stat() reliable (local filesystems and many remote) or potentially
      out of date without open/close (NFS due to attribute cacheing)

-- Jamie

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10  4:49       ` Jamie Lokier
@ 2003-10-10  5:26         ` Trond Myklebust
  2003-10-10 12:37           ` Jamie Lokier
  0 siblings, 1 reply; 64+ messages in thread
From: Trond Myklebust @ 2003-10-10  5:26 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Linux Kernel

>>>>> " " == Jamie Lokier <jamie@shareable.org> writes:

     >     - are dnotify / lease / lock reliable indicators on this filesystem?
     >       (i.e. dnotify is reliable on all local filesystems, but
     >       not over any of the remote ones AFAIK).

Belongs in fcntl()... Just return ENOLCK if someone tries to set a
lease or a directory notification on an NFS file...

     >     - is stat() reliable (local filesystems and many remote) or
     >       potentially out of date without open/close (NFS due to
     >       attribute cacheing)

There are many possible cache consistency models out there. Consider
for instance AFS connected/disconnected modes, NFSv4 delegations or
CIFS shares. How are you going to distinguish between them all and
how do you propose that applications make use of this information?

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-09 22:26 ` Linus Torvalds
  2003-10-09 23:19   ` Ulrich Drepper
  2003-10-09 23:31   ` Trond Myklebust
@ 2003-10-10 12:27   ` Joel Becker
  2003-10-10 14:59     ` Linus Torvalds
  2 siblings, 1 reply; 64+ messages in thread
From: Joel Becker @ 2003-10-10 12:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Trond Myklebust, Ulrich Drepper, Linux Kernel

On Thu, Oct 09, 2003 at 03:26:47PM -0700, Linus Torvalds wrote:
> User space shouldn't know or care about frsize, and it doesn't even 
> necessarily make any sense on a lot of filesystems, so make it easy for 
> the user. It's not as if the rounding errors really matter.

	User space has to know about frsize for O_DIRECT alignment.
Some times you just want to write the 512 B you have in hand, not have to
read-modify-write the n KB around it.  frsize is much nicer that hunting
up the appropriate block device to BLKSSZGET on .


Joel


-- 

"I have never let my schooling interfere with my education."
        - Mark Twain

Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10  5:26         ` Trond Myklebust
@ 2003-10-10 12:37           ` Jamie Lokier
  2003-10-10 13:46             ` Trond Myklebust
  0 siblings, 1 reply; 64+ messages in thread
From: Jamie Lokier @ 2003-10-10 12:37 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux Kernel

Trond Myklebust wrote:
>      >     - are dnotify / lease / lock reliable indicators on this filesystem?
>      >       (i.e. dnotify is reliable on all local filesystems, but
>      >       not over any of the remote ones AFAIK).
> 
> Belongs in fcntl()... Just return ENOLCK if someone tries to set a
> lease or a directory notification on an NFS file...

Yes, that would make sense.  It should be a filesystem hook, so that
even remote filesystems like SMB can implement it, although it must be
understood that remote notification has different ordering properties
than local.

>      >     - is stat() reliable (local filesystems and many remote) or
>      >       potentially out of date without open/close (NFS due to
>      >       attribute cacheing)
> 
> There are many possible cache consistency models out there. Consider
> for instance AFS connected/disconnected modes, NFSv4 delegations or
> CIFS shares. How are you going to distinguish between them all and
> how do you propose that applications make use of this information?

The difference is that NFSv3 can return _stale_ data, while local
_cannot_.  I call stat(), and the information is up to date.

I don't care about the cache semantics at all; what I care about is
whether a returned stat() result may be stale.

Why?  This is the difference between "make" generating correct data,
and "make" generating incorrect data.[1]

The caching model isn't the issue.  That's the filesystem's problem.
I just want a way to get up to date data in my application.

My motivation isn't actually "make" although that's important;
generally, I need to know how to verify my in-application cache of a
file.  (Think fontconfig, ccache etc).  I use dnotify for similar
purposes, when it's local.  (dnotify is much faster than many stats
for a complex cache dependency).

Currently, I use statfs() and read /proc/mounts to determine whether
the filesystem is a known type or mounted on a block device, to decide
whether stat() and/or dnotify are reliable.  This is not ideal.  In
particular, I don't know of any way to _guarantee_ that I have the
latest file contents from remote filesystems short of F_SETLK, which
way too heavy.[2]

-- Jamie


[1] I have built programs, including kernels, which crashed due to
timestamps not appearing on a different computer after changing code
so make didn't compile everything.

[2] I have lost code I was editing due to saving it and then a
different computer updating the file by reading a stale version,
modifying it and writing it.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 12:37           ` Jamie Lokier
@ 2003-10-10 13:46             ` Trond Myklebust
  2003-10-10 14:35               ` Jamie Lokier
  2003-10-10 14:39               ` statfs() / statvfs() syscall ballsup Jamie Lokier
  0 siblings, 2 replies; 64+ messages in thread
From: Trond Myklebust @ 2003-10-10 13:46 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Linux Kernel

>>>>> " " == Jamie Lokier <jamie@shareable.org> writes:

     > Trond Myklebust wrote:
    >> Belongs in fcntl()... Just return ENOLCK if someone tries to
    >> set a lease or a directory notification on an NFS file...

     > It should be a filesystem hook, so that even remote filesystems
     > like SMB can implement it, although it must be understood that
     > remote notification has different ordering properties than
     > local.

Sure. We might even try actually implementing leases on NFSv4 for
delegated files.

     > I don't care about the cache semantics at all; what I care
     > about is whether a returned stat() result may be stale.

Note that this too may be a per-file property. Under NFSv4 I can
guarantee you that stat() results are correct in the case where I have
a delegation. Otherwise, you are indeed subject to inherent races.
"noac" cannot entirely resolve such races, but it sounds as if it
could in the particular cases you describe.

     > This is not ideal.  In particular, I don't know of any way to
     > _guarantee_ that I have the latest file contents from remote
     > filesystems short of F_SETLK, which way too heavy.[2]

Err... open() should normally suffice to do that...

Unless you are simultaneously writing to the file on a remote system,
in which case you really need mandatory locking rather than NFSv2/v3's
weaker advisory model. Or possibly something like CIFS/SMB's open
"share" model (which can also be implemented in NFSv4).



...so I would argue that the caching models both can and do make a
difference to your example cases (contrary to what you assert).

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 13:46             ` Trond Myklebust
@ 2003-10-10 14:35               ` Jamie Lokier
  2003-10-10 15:32                 ` Misc NFSv4 (was Re: statfs() / statvfs() syscall ballsup...) Trond Myklebust
  2003-10-10 14:39               ` statfs() / statvfs() syscall ballsup Jamie Lokier
  1 sibling, 1 reply; 64+ messages in thread
From: Jamie Lokier @ 2003-10-10 14:35 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux Kernel

Trond Myklebust wrote:
> Sure. We might even try actually implementing leases on NFSv4 for
> delegated files.

That would be nice.  (Aside: Can NFSv4 do anything like dnotify, or am I
restricted to, in effect, keeping many files open to detect changes in
any of them?)

Generally NFSv4 sounds like the way to go.  Should I be recommending
it to all my friends yet, is the implementation ready for that?

>      > I don't care about the cache semantics at all; what I care
>      > about is whether a returned stat() result may be stale.
> 
> Note that this too may be a per-file property. Under NFSv4 I can
> guarantee you that stat() results are correct in the case where I have
> a delegation. Otherwise, you are indeed subject to inherent races.
> "noac" cannot entirely resolve such races, but it sounds as if it
> could in the particular cases you describe.

You're right, in the cases I describe "noac" is fine.

I don't like having to ship an FAQ with a program which explains that
the program is theoretically fine, users should simply mount their
home directory with "noac", and tough if that's not within their
administrative power.

I'd rather make the program work correctly with the default mount
options, and maybe have an entry in the FAQ saying that "noac" may
improve performance but is not required for correct behaviour.

Unfortunately that means ugly knowledge of filesystem specifics and
/proc/mount parsing - or significantly lower performance on local
filesystems, which largely negates the purpose of the program.  (It is
very much about caching things derived from file contents).

>      > This is not ideal.  In particular, I don't know of any way to
>      > _guarantee_ that I have the latest file contents from remote
>      > filesystems short of F_SETLK, which way too heavy.[2]
> 
> Err... open() should normally suffice to do that...

Server = RH linux-2.4.20-18.9.  Client = 2.6.0-test6.  I have done
this in the last few days:

	[on client] editing file in emacs, save-buffer
	[on server] diff -ur mumble commands >> file
		    (and wait until command prompt returns)
	[on client] in emacs, find-alternate-file which discards the
		    current buffer and opens & reads the file from fs.
	[on client] edit some more, save file, post to l-k etc.
	[meta]	    notice that the diff wasn't appended to the file

Emacs didn't see the appended data.  (The reason I did the diff command
on the server is that it's a lot faster - a tree's worth of stat calls
is slow over PCMCIA ethernet).

> ...so I would argue that the caching models both can and do make a
> difference to your example cases (contrary to what you assert).

Of course they make a difference when there is no call to say "just do
X and hide the implementation details from me".  What I'd like is an
abstraction so I don't observe a difference, or at least a systematic
way of working around them at application level.

In the same way I expect CPUs to abstract away the (sometimes very)
complex memory caching models, and present something simple to the
program code.

-- Jamie

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 13:46             ` Trond Myklebust
  2003-10-10 14:35               ` Jamie Lokier
@ 2003-10-10 14:39               ` Jamie Lokier
  1 sibling, 0 replies; 64+ messages in thread
From: Jamie Lokier @ 2003-10-10 14:39 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux Kernel

Trond Myklebust wrote:
>      > I don't care about the cache semantics at all; what I care
>      > about is whether a returned stat() result may be stale.
> 
> Note that this too may be a per-file property.

Yes.  A flag from stat() or similar to say it's stale would make
sense.  Alternatively, a flag _into_ something like stat() to ask for
an up to date value, if that is possible.

I've often wondered if stat() couldn't be a bit more extensible with
some flags or extended attributes.

-- Jamie

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 12:27   ` Joel Becker
@ 2003-10-10 14:59     ` Linus Torvalds
  2003-10-10 15:27       ` Joel Becker
  0 siblings, 1 reply; 64+ messages in thread
From: Linus Torvalds @ 2003-10-10 14:59 UTC (permalink / raw)
  To: Joel Becker; +Cc: Trond Myklebust, Ulrich Drepper, Linux Kernel


On Fri, 10 Oct 2003, Joel Becker wrote:
> 
> 	User space has to know about frsize for O_DIRECT alignment.

Have you ever noticed that O_DIRECT is a piece of crap?

The interface is fundamentally flawed, it has nasty security issues, it 
lacks any kind of sane synchronization, and it exposes stuff that 
shouldn't be exposed to user space.

I hope disk-based databases die off quickly. Yeah, I see where you are
working, but where I'm coming from, I see all the _crap_ that Oracle tries
to push down to the kernel, and most of the time I go "huh - that's a
f**king bad design".

		Linus


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 14:59     ` Linus Torvalds
@ 2003-10-10 15:27       ` Joel Becker
  2003-10-10 16:00         ` Linus Torvalds
  2003-10-10 16:01         ` Jamie Lokier
  0 siblings, 2 replies; 64+ messages in thread
From: Joel Becker @ 2003-10-10 15:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Trond Myklebust, Ulrich Drepper, Linux Kernel

On Fri, Oct 10, 2003 at 07:59:34AM -0700, Linus Torvalds wrote:
> The interface is fundamentally flawed, it has nasty security issues, it 
> lacks any kind of sane synchronization, and it exposes stuff that 
> shouldn't be exposed to user space.

	Um, sure, the interface as implemented has a few "don't do
that"s.  Yes, we've found security issues.  Those can be fixed.  That
doesn't make the concept bad.

> I hope disk-based databases die off quickly.

	As opposed to what?  Not a challenge, just interested in what
you think they should be.

> Yeah, I see where you are
> working, but where I'm coming from, I see all the _crap_ that Oracle tries
> to push down to the kernel, and most of the time I go "huh - that's a
> f**king bad design".

	I'm hoping that you've seen a marked improvement in the stuff
Oracle requests over the past couple years.  We've worked hard to filter
out the junk that really, really is bad.
	Where I work doesn't change the need for O_DIRECT.  If your Big
App has it's own cache, why copy the cache in the kernel?  That just
wastes RAM.  If your app is sharing data, whether physical disk, logical
disk, or via some network filesystem or storage device, you must
absolutely guarantee that reads and writes hit the storage, not the
kernel cache which has no idea whether another node wrote an update or
needs a cache flush.
	Putting my employer's hat back on, Oracle uses O_DIRECT because
it was the existing API for this.  If Linux came up with a better,
cleaner method, Oracle might change.  I can't guarantee that, but I know
I push like hell for obvious improvements.

Joel

-- 

"I don't want to achieve immortality through my work; I want to
 achieve immortality through not dying."
        - Woody Allen

Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Misc NFSv4 (was Re: statfs() / statvfs() syscall ballsup...)
  2003-10-10 14:35               ` Jamie Lokier
@ 2003-10-10 15:32                 ` Trond Myklebust
  2003-10-10 15:53                   ` Jamie Lokier
  2003-10-10 15:55                   ` Michael Shuey
  0 siblings, 2 replies; 64+ messages in thread
From: Trond Myklebust @ 2003-10-10 15:32 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Linux Kernel

>>>>> " " == Jamie Lokier <jamie@shareable.org> writes:

     > Trond Myklebust wrote:
    >> Sure. We might even try actually implementing leases on NFSv4
    >> for delegated files.

     > That would be nice.  (Aside: Can NFSv4 do anything like
     > dnotify, or am I restricted to, in effect, keeping many files
     > open to detect changes in any of them?)

Delegations for directories are in the pipeline for the next minor
revision of the protocol (NFSv4.1). Delegations are such a new feature
to NFS that it was decided to restrict them to files only to give us
time to learn how best to use them.

I can't tell as of yet whether or not the model chosen will include
all the features of dnotify (for instance recall in case the
attributes change on a subfile is a subject of hot debate), but
certainly some of us are pushing for something like this.

     > Generally NFSv4 sounds like the way to go.  Should I be
     > recommending it to all my friends yet, is the implementation
     > ready for that?

The client implementation in 2.6.0 is still lacking several important
features, including locking, ACLs, delegation support and recovery of
state (in case of server reboot or network partitions). I'm hoping
Andrew/Linus will allow me to send updates once the early 2.6.x
codefreeze period is over.

That said, I definitely encourage people to test out the existing code
for stability, and I will be offering an 'NFS_ALL' series with those
features that are missing from the main tree as and when I judge they
are approaching release quality.

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Misc NFSv4 (was Re: statfs() / statvfs() syscall ballsup...)
  2003-10-10 15:32                 ` Misc NFSv4 (was Re: statfs() / statvfs() syscall ballsup...) Trond Myklebust
@ 2003-10-10 15:53                   ` Jamie Lokier
  2003-10-10 16:07                     ` Trond Myklebust
  2003-10-10 15:55                   ` Michael Shuey
  1 sibling, 1 reply; 64+ messages in thread
From: Jamie Lokier @ 2003-10-10 15:53 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux Kernel

Trond Myklebust wrote:
> I can't tell as of yet whether or not the model chosen will include
> all the features of dnotify (for instance recall in case the
> attributes change on a subfile is a subject of hot debate), but
> certainly some of us are pushing for something like this.

Different types of delegation, depending on what the client asked for,
could be offered:

Cacheing readdir() and stat() on the directory requires delegation
without subfile recall; if there's a dnotify on the client, it
requires delegation with recall.

An uber-cool capability would be notification of sub-files to any
depth.  You can't imagine how tedious it has been watching a makefile
take 5 minutes _just_ to run the "find" command on a source tree to
find newer files than the last successful make.  (It was a big tree).
That was the optimised makefile.  Without the "find" command, make's
own dependency logic took 20 minutes to do the same thing.

With any depth notifications, that would be eliminated to roughly zero
time, and just running the few compile commands that are needed.

-- Jamie

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Misc NFSv4 (was Re: statfs() / statvfs() syscall ballsup...)
  2003-10-10 15:32                 ` Misc NFSv4 (was Re: statfs() / statvfs() syscall ballsup...) Trond Myklebust
  2003-10-10 15:53                   ` Jamie Lokier
@ 2003-10-10 15:55                   ` Michael Shuey
  2003-10-10 16:20                     ` Trond Myklebust
  2003-10-10 16:45                     ` J. Bruce Fields
  1 sibling, 2 replies; 64+ messages in thread
From: Michael Shuey @ 2003-10-10 15:55 UTC (permalink / raw)
  To: trond.myklebust; +Cc: Linux Kernel

On Friday 10 October 2003 10:32 am, Trond Myklebust wrote:
> The client implementation in 2.6.0 is still lacking several important
> features, including locking, ACLs, delegation support and recovery of
> state (in case of server reboot or network partitions). I'm hoping
> Andrew/Linus will allow me to send updates once the early 2.6.x
> codefreeze period is over.

How about other features?  In particular, do the client/server do 
authentication (krb5? lipkey/spkm3?), integrity and privacy?

Also, are any patches on Citi's site useful anymore?  I see patches for 
2.6.0-test1, but nothing more recent.  Have they been folded into the main 
tree?

> That said, I definitely encourage people to test out the existing code
> for stability, and I will be offering an 'NFS_ALL' series with those
> features that are missing from the main tree as and when I judge they
> are approaching release quality.

Neato!  Those of us with hordes of machines using Linux's NFS appreciate the 
extra effort.

-- 
Mike Shuey

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 15:27       ` Joel Becker
@ 2003-10-10 16:00         ` Linus Torvalds
  2003-10-10 16:26           ` Joel Becker
                             ` (2 more replies)
  2003-10-10 16:01         ` Jamie Lokier
  1 sibling, 3 replies; 64+ messages in thread
From: Linus Torvalds @ 2003-10-10 16:00 UTC (permalink / raw)
  To: Joel Becker; +Cc: Trond Myklebust, Ulrich Drepper, Linux Kernel


On Fri, 10 Oct 2003, Joel Becker wrote:
> > I hope disk-based databases die off quickly.
> 
> 	As opposed to what?  Not a challenge, just interested in what
> you think they should be.

I'm hoping in-memory databases will just kill off the current crop 
totally. That solves all the IO problems - the only thing that goes to 
disk is the log and the backups, and both go there totally linearly unless 
the designer was crazy.

Yeah, I don't follow the db market, but it's just insane to try to keep 
the on-disk data in any other format if you've got enough memory. Recovery 
may take a long time (reading that whole backup into memory and redoing 
the log will be pretty expensive), but replication should handle that 
trivially.

> 	Where I work doesn't change the need for O_DIRECT.  If your Big
> App has it's own cache, why copy the cache in the kernel?

Why indeed? 

But why do you think you need O_DIRECT with very bad semantics to handle
this?

The kernel page cache does multiple things:
 - staging area for letting the filesystem do blocking (ie this is why a 
   regular "write()" or "read()" doesn't need to care about alignment etc)
 - a synchronization entity - making sure that a write and a read cannot 
   pass each other, and that mmap contents are always _coherent_.
 - a cache

O_DIRECT throws the cache part away, but it throws out the baby with the
bathwater, and breaks the other parts. Which is why O_DIRECT breaks things
like disk scheduling in really subtle ways - think about writing and
reading to the same area on the disk, and re-ordering at all different 
levels. 

And the thing is, uncaching is _trivial_. It's not like it is hard to say
"try to get rid of these pages if they aren't mapped anywhere" and "insert
this user page directly into the page cache". But people are so fixated
with "direct to disk" that they don't even think about it.

			Linus


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 15:27       ` Joel Becker
  2003-10-10 16:00         ` Linus Torvalds
@ 2003-10-10 16:01         ` Jamie Lokier
  2003-10-10 16:33           ` Joel Becker
  2003-10-10 18:20           ` Andrea Arcangeli
  1 sibling, 2 replies; 64+ messages in thread
From: Jamie Lokier @ 2003-10-10 16:01 UTC (permalink / raw)
  To: Linus Torvalds, Trond Myklebust, Ulrich Drepper, Linux Kernel

Joel Becker wrote:
> 	Where I work doesn't change the need for O_DIRECT.  If your Big
> App has it's own cache, why copy the cache in the kernel?  That just
> wastes RAM.

Why don't you _share_ the App's cache with the kernel's?  That's what
mmap() and remap_file_pages() are for.

>  If your app is sharing data, whether physical disk, logical
> disk, or via some network filesystem or storage device, you must
> absolutely guarantee that reads and writes hit the storage, not the
> kernel cache which has no idea whether another node wrote an update or
> needs a cache flush.

That's tough to guarantee at the platter level regardless of O_DIRECT,
but otherwise: you have fdatasync() and msync().

> If Linux came up with a better, cleaner method, Oracle might change.

Take a look at remap_file_pages() and write a note here to say if it
fits the bill.  I thought remap_file_pages() was added for Oracle, but
perhaps it was for a more modern database ;)

Thanks,
-- Jamie

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Misc NFSv4 (was Re: statfs() / statvfs() syscall ballsup...)
  2003-10-10 15:53                   ` Jamie Lokier
@ 2003-10-10 16:07                     ` Trond Myklebust
  0 siblings, 0 replies; 64+ messages in thread
From: Trond Myklebust @ 2003-10-10 16:07 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Linux Kernel

>>>>> " " == Jamie Lokier <jamie@shareable.org> writes:

     > An uber-cool capability would be notification of sub-files to
     > any depth.  You can't imagine how tedious it has been watching
     > a makefile take 5 minutes _just_ to run the "find" command on a
     > source tree to find newer files than the last successful make.
     > (It was a big tree).  That was the optimised makefile.  Without
     > the "find" command, make's own dependency logic took 20 minutes
     > to do the same thing.

     > With any depth notifications, that would be eliminated to
     > roughly zero time, and just running the few compile commands
     > that are needed.

In the very long term (post NFSv4.1), we're investigating something
even more cool: 'WRITE' delegation of directories could allow you to
work in a quasi-disconnected mode on all entries plus sub-entries
(files, subdirs,....).
You could do your compilation entirely locally (backed either by
memory or cachefs) and then just flush the final results out to the
server.

AFS has, of course, had similar capabilities for some time, but I'm
not sure if they have the delegation recall feature. IIRC, their
disconnected operation overwrites whatever changes have been made on
the server when your client reconnects.

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Misc NFSv4 (was Re: statfs() / statvfs() syscall ballsup...)
  2003-10-10 15:55                   ` Michael Shuey
@ 2003-10-10 16:20                     ` Trond Myklebust
  2003-10-10 16:45                     ` J. Bruce Fields
  1 sibling, 0 replies; 64+ messages in thread
From: Trond Myklebust @ 2003-10-10 16:20 UTC (permalink / raw)
  To: shuey; +Cc: Linux Kernel

>>>>> " " == Michael Shuey <shuey@fmepnet.org> writes:

     > How about other features?  In particular, do the client/server
     > do authentication (krb5? lipkey/spkm3?), integrity and privacy?

Client side krb5 authentication was added in November last
year. Privacy and integrity are queued but fell afoul of the
code-freeze. I'll bun(d|g)le them into an NFS_ALL after we've tested
them out in the v4 Bakeathon in Austin (so in about a fortnight).

I believe the server support is ready too but hasn't yet been merged
in due to bugs in the upcall mechanism.

     > Also, are any patches on Citi's site useful anymore?  I see
     > patches for 2.6.0-test1, but nothing more recent.  Have they
     > been folded into the main tree?

I'm cherrypicking the relevant bugfixes from CITI and folding those
into the tree. Much of the rest will be part of the forthcoming
NFS_ALL.

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 16:00         ` Linus Torvalds
@ 2003-10-10 16:26           ` Joel Becker
  2003-10-10 16:50             ` Linus Torvalds
  2003-10-10 16:27           ` Valdis.Kletnieks
  2003-10-10 16:33           ` Chris Friesen
  2 siblings, 1 reply; 64+ messages in thread
From: Joel Becker @ 2003-10-10 16:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Trond Myklebust, Ulrich Drepper, Linux Kernel

On Fri, Oct 10, 2003 at 09:00:23AM -0700, Linus Torvalds wrote:
> I'm hoping in-memory databases will just kill off the current crop 
> totally. That solves all the IO problems - the only thing that goes to 
> disk is the log and the backups, and both go there totally linearly unless 
> the designer was crazy.

	Memory is continuously too small and too expensive.  Even if you
can buy a machine with 10TB of RAM, the price is going to be
prohibitive.  And when 10TB of RAM costs better, the database is going
to be 100TB.
	I'm not saying that in-memory is bad.  Big databases do
everything they can to make the workload look almost like in-memory.
It's the only way to go.

> But why do you think you need O_DIRECT with very bad semantics to handle
> this?

	I don't need O_DIRECT with bad semantics.  I need the semantics
I need, I know that other OSes have O_DIRECT to provide those
capabilities, and everyone loves portability.  That said...

> O_DIRECT throws the cache part away, but it throws out the baby with the
> bathwater, and breaks the other parts. Which is why O_DIRECT breaks things
> like disk scheduling in really subtle ways - think about writing and
> reading to the same area on the disk, and re-ordering at all different 
> levels. 

	Sure, but you don't do that.  The breakage in mixing O_DIRECT
with pagecache I/O to the same areas of the disk isn't even all that
subtle.  But you shouldn't be doing that, at least constantly.

> And the thing is, uncaching is _trivial_. It's not like it is hard to say
> "try to get rid of these pages if they aren't mapped anywhere" and "insert
> this user page directly into the page cache". But people are so fixated
> with "direct to disk" that they don't even think about it.

	I'm not fixated.  "Use this user page for the page cache entry
for this offset into the file", "Change this user page from representing
this offset in this file to representing that offset in that file", and
"whatever you do, always read/write from backing store for this page"
are the semantics needed.  For the latter, you'd have to have a way for
the app to trigger a read or write out of the cache.  You don't want to
do it on every page modification or access, that's too often.  The
application knows the syncronization points, not the kernel.

Joel

-- 

"There is a country in Europe where multiple-choice tests are
 illegal."
        - Sigfried Hulzer

Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 16:00         ` Linus Torvalds
  2003-10-10 16:26           ` Joel Becker
@ 2003-10-10 16:27           ` Valdis.Kletnieks
  2003-10-10 16:33           ` Chris Friesen
  2 siblings, 0 replies; 64+ messages in thread
From: Valdis.Kletnieks @ 2003-10-10 16:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel

[-- Attachment #1: Type: text/plain, Size: 468 bytes --]

On Fri, 10 Oct 2003 09:00:23 PDT, Linus Torvalds said:

> I'm hoping in-memory databases will just kill off the current crop 
> totally. That solves all the IO problems - the only thing that goes to 
> disk is the log and the backups, and both go there totally linearly unless 
> the designer was crazy.

I can process a 100GB database on a current 2U Dell rackmount server. I hesitate to
think about what would be required to deal with a terabyte-sized database...



[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 16:01         ` Jamie Lokier
@ 2003-10-10 16:33           ` Joel Becker
  2003-10-10 16:58             ` Chris Friesen
                               ` (2 more replies)
  2003-10-10 18:20           ` Andrea Arcangeli
  1 sibling, 3 replies; 64+ messages in thread
From: Joel Becker @ 2003-10-10 16:33 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Linus Torvalds, Trond Myklebust, Ulrich Drepper, Linux Kernel

On Fri, Oct 10, 2003 at 05:01:44PM +0100, Jamie Lokier wrote:
> Why don't you _share_ the App's cache with the kernel's?  That's what
> mmap() and remap_file_pages() are for.

	Because you can't force flush/read.  You can't say "I need you
to go to disk for this."  If you do, you're doing O_DIRECT through mmap
(yes, I've pondered it) and you end up with perhaps the same races folks
worry about.  Doesn't mean it can't be done.

> That's tough to guarantee at the platter level regardless of O_DIRECT,
> but otherwise: you have fdatasync() and msync().

	Platter level doesn't matter.  Storage access level matters.
Node1 and Node2 have to see the same thing.  As long as I am absolutely
sure that when Node1's write() returns, any subsequent read() on Node2
will see the change (normal barrier stuff, really), it doesn't matter
what happend on the Storage.  The data could be in storage cache, on
platter, passed back to some other entity.

> Take a look at remap_file_pages() and write a note here to say if it
> fits the bill.  I thought remap_file_pages() was added for Oracle, but
> perhaps it was for a more modern database ;)

	remap_file_pages() was indeed somethign Oracle wanted, but as a
way to create 8GB shmfs files and map them into the x86 crappy address
space.  It still does not have the ability to force reads and writes to
the storage, and it even has other issues.

Joel

-- 

Life's Little Instruction Book #511

	"Call your mother."

Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 16:00         ` Linus Torvalds
  2003-10-10 16:26           ` Joel Becker
  2003-10-10 16:27           ` Valdis.Kletnieks
@ 2003-10-10 16:33           ` Chris Friesen
  2003-10-10 17:04             ` Linus Torvalds
  2 siblings, 1 reply; 64+ messages in thread
From: Chris Friesen @ 2003-10-10 16:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Joel Becker, Trond Myklebust, Ulrich Drepper, Linux Kernel

Linus Torvalds wrote:

> I'm hoping in-memory databases will just kill off the current crop 
> totally. That solves all the IO problems - the only thing that goes to 
> disk is the log and the backups, and both go there totally linearly unless 
> the designer was crazy.

How does this play with massive (ie hundreds or thousands of gigabytes) 
databases?  Surely you can't expect to put it all in memory?

Chris


-- 
Chris Friesen                    | MailStop: 043/33/F10
Nortel Networks                  | work: (613) 765-0557
3500 Carling Avenue              | fax:  (613) 765-2986
Nepean, ON K2H 8E9 Canada        | email: cfriesen@nortelnetworks.com


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Misc NFSv4 (was Re: statfs() / statvfs() syscall ballsup...)
  2003-10-10 15:55                   ` Michael Shuey
  2003-10-10 16:20                     ` Trond Myklebust
@ 2003-10-10 16:45                     ` J. Bruce Fields
  1 sibling, 0 replies; 64+ messages in thread
From: J. Bruce Fields @ 2003-10-10 16:45 UTC (permalink / raw)
  To: Michael Shuey; +Cc: trond.myklebust, Linux Kernel

On Fri, Oct 10, 2003 at 10:55:10AM -0500, Michael Shuey wrote:
> On Friday 10 October 2003 10:32 am, Trond Myklebust wrote:
> > The client implementation in 2.6.0 is still lacking several important
> > features, including locking, ACLs, delegation support and recovery of
> > state (in case of server reboot or network partitions). I'm hoping
> > Andrew/Linus will allow me to send updates once the early 2.6.x
> > codefreeze period is over.
> 
> How about other features?  In particular, do the client/server do 
> authentication (krb5? lipkey/spkm3?), integrity and privacy?

The client has krb5 authentication support, the server doesn't.  Patches
are available from the citi web page for server-side authentication and
client-side integrity.

> Also, are any patches on Citi's site useful anymore?

The test1 patches probably apply (possibly with some manual
intervention) up to about test6.  At least one of them (the first gss
patch) is a fairly critical bugfix.  I'm just updating to test7 myself
right now; I'll try to post new patches soon, but in the worst case it
might not be till after we get back from testing at Connectathon (in two
weeks).

--Bruce Fields

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 16:26           ` Joel Becker
@ 2003-10-10 16:50             ` Linus Torvalds
  2003-10-10 17:33               ` Joel Becker
  0 siblings, 1 reply; 64+ messages in thread
From: Linus Torvalds @ 2003-10-10 16:50 UTC (permalink / raw)
  To: Joel Becker; +Cc: Trond Myklebust, Ulrich Drepper, Linux Kernel


On Fri, 10 Oct 2003, Joel Becker wrote:
> 
> 	Memory is continuously too small and too expensive.  Even if you
> can buy a machine with 10TB of RAM, the price is going to be
> prohibitive.  And when 10TB of RAM costs better, the database is going
> to be 100TB.

Hah. 

Look at the number of supercomputers and the number of desktops today.

The fact is, the high end is getting smaller and smaller. If Oracle wants 
to go after that high-end-only market, then be my guest. 

But don't be surprised if others end up taking the remaining 99%.

Have you guys learnt _nothing_ from the past? The reason MicroSoft and
Linux are kicking all the other vendors butts is that _small_ is
beautiful. Especially when small is "powerful enough".

Hint: why does Oracle care at all about the small business market? Why is
MySQL even a blip on your radar? Because it's those things that really
_drive_ stuff. The same way PC's have driven the tech market for the last 
15 years.

And believing that the load will keep up with "big iron hardware" is just 
not _true_. It's never been true. "Small iron" not only keeps up, but 
overtakes it - to the point where you have to start doing new things just 
to be able to take advantage of it.

Believe in history.

> 
> > O_DIRECT throws the cache part away, but it throws out the baby with the
> > bathwater, and breaks the other parts. Which is why O_DIRECT breaks things
> > like disk scheduling in really subtle ways - think about writing and
> > reading to the same area on the disk, and re-ordering at all different 
> > levels. 
> 
> 	Sure, but you don't do that.  The breakage in mixing O_DIRECT
> with pagecache I/O to the same areas of the disk isn't even all that
> subtle.  But you shouldn't be doing that, at least constantly.

Ok. Let's just hope all the crackers and virus writers believe you when 
you say "you shouldn't do that".

BIG FRIGGING HINT: a _real_ OS doesn't allow data corruption even for
cases where "you shouldn't do that". It shouldn't allow reading of data
that you haven't written. And "you shouldn't do that" is _not_ an excuse
for having bad interfaces that cause problems.

We're not NT.

		Linus


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 16:33           ` Joel Becker
@ 2003-10-10 16:58             ` Chris Friesen
  2003-10-10 17:05               ` Trond Myklebust
  2003-10-10 17:20               ` Joel Becker
  2003-10-10 20:07             ` Jamie Lokier
  2003-10-12 15:31             ` Greg Stark
  2 siblings, 2 replies; 64+ messages in thread
From: Chris Friesen @ 2003-10-10 16:58 UTC (permalink / raw)
  To: Joel Becker
  Cc: Jamie Lokier, Linus Torvalds, Trond Myklebust, Ulrich Drepper,
	Linux Kernel

Joel Becker wrote:
> On Fri, Oct 10, 2003 at 05:01:44PM +0100, Jamie Lokier wrote:
> 
>>Why don't you _share_ the App's cache with the kernel's?  That's what
>>mmap() and remap_file_pages() are for.

> 	Because you can't force flush/read.  You can't say "I need you
> to go to disk for this."

According to my man pages, this is exactly what msync() is for, no?

>>That's tough to guarantee at the platter level regardless of O_DIRECT,
>>but otherwise: you have fdatasync() and msync().

> 	Platter level doesn't matter.  Storage access level matters.
> Node1 and Node2 have to see the same thing.  As long as I am absolutely
> sure that when Node1's write() returns, any subsequent read() on Node2
> will see the change (normal barrier stuff, really), it doesn't matter
> what happend on the Storage.

Isn't that exactly what msync() exists for?

Chris

-- 
Chris Friesen                    | MailStop: 043/33/F10
Nortel Networks                  | work: (613) 765-0557
3500 Carling Avenue              | fax:  (613) 765-2986
Nepean, ON K2H 8E9 Canada        | email: cfriesen@nortelnetworks.com


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 16:33           ` Chris Friesen
@ 2003-10-10 17:04             ` Linus Torvalds
  2003-10-10 17:07               ` Linus Torvalds
  0 siblings, 1 reply; 64+ messages in thread
From: Linus Torvalds @ 2003-10-10 17:04 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Joel Becker, Trond Myklebust, Ulrich Drepper, Linux Kernel


On Fri, 10 Oct 2003, Chris Friesen wrote:
> 
> How does this play with massive (ie hundreds or thousands of gigabytes) 
> databases?  Surely you can't expect to put it all in memory?

Hey, I'm a big believer in mass market.

Which means that I think odd-ball users will have to use odd-ball
databases, and pay through the nose for them. That's fine. But those db's
are doing to be very rare.

Your arguments are all the same stuff that made PC's "irrelevant" 15 years 
ago. 

I'm not sayign in-memory is here tomorrow. I'm just saying that anybody 
who isn't looking at it for the mass market _will_ be steamrolled over 
when they arrive. 

If you were a company, which market would you prefer: the high-end 0.1% or
the rest? Yes, you can charge a _lot_ more for the high-end side, but you 
will eternally live in the knowledge that your customers are slowly moving 
to the "low end" - simply because it gets more capable.

And the thing is, the economics of the 99% means that that is the one that 
sees all the real improvements. That's the one that will have the nice 
admin tools, and the cottage industry that builds up around it. 

			Linus


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 16:58             ` Chris Friesen
@ 2003-10-10 17:05               ` Trond Myklebust
  2003-10-10 17:20               ` Joel Becker
  1 sibling, 0 replies; 64+ messages in thread
From: Trond Myklebust @ 2003-10-10 17:05 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Linux Kernel

>>>>> " " == Chris Friesen <cfriesen@nortelnetworks.com> writes:

    >> Platter level doesn't matter.  Storage access level matters.
    >> Node1 and Node2 have to see the same thing.  As long as I am
    >> absolutely sure that when Node1's write() returns, any
    >> subsequent read() on Node2 will see the change (normal barrier
    >> stuff, really), it doesn't matter what happend on the Storage.

     > Isn't that exactly what msync() exists for?

It can't, be used to invalidate the page cache (at least not in the
current implentation) so it won't help you in the above case where you
have 2 nodes writing to the same device.

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 17:04             ` Linus Torvalds
@ 2003-10-10 17:07               ` Linus Torvalds
  2003-10-10 17:21                 ` Joel Becker
  0 siblings, 1 reply; 64+ messages in thread
From: Linus Torvalds @ 2003-10-10 17:07 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Joel Becker, Trond Myklebust, Ulrich Drepper, Linux Kernel


On Fri, 10 Oct 2003, Linus Torvalds wrote:
> 
> I'm not sayign in-memory is here tomorrow. I'm just saying that anybody 
> who isn't looking at it for the mass market _will_ be steamrolled over 
> when they arrive. 

Btw, anybody that takes me too seriously is an idiot. I know what _I_ 
believe in, but part of the beauty of Linux is that what I believe doesn't 
really matter all that much. 

			Linus


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 16:58             ` Chris Friesen
  2003-10-10 17:05               ` Trond Myklebust
@ 2003-10-10 17:20               ` Joel Becker
  2003-10-10 17:33                 ` Chris Friesen
  2003-10-10 17:40                 ` Linus Torvalds
  1 sibling, 2 replies; 64+ messages in thread
From: Joel Becker @ 2003-10-10 17:20 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Jamie Lokier, Linus Torvalds, Trond Myklebust, Ulrich Drepper,
	Linux Kernel

On Fri, Oct 10, 2003 at 12:58:05PM -0400, Chris Friesen wrote:
> >	Because you can't force flush/read.  You can't say "I need you
> >to go to disk for this."
> 
> According to my man pages, this is exactly what msync() is for, no?

	msync() forces write(), like fsync().  It doesn't force read().

Joel

-- 

"Get right to the heart of matters.
 It's the heart that matters more."

Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 17:07               ` Linus Torvalds
@ 2003-10-10 17:21                 ` Joel Becker
  0 siblings, 0 replies; 64+ messages in thread
From: Joel Becker @ 2003-10-10 17:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Friesen, Trond Myklebust, Ulrich Drepper, Linux Kernel

On Fri, Oct 10, 2003 at 10:07:52AM -0700, Linus Torvalds wrote:
> Btw, anybody that takes me too seriously is an idiot. I know what _I_ 
> believe in, but part of the beauty of Linux is that what I believe doesn't 
> really matter all that much. 

	Sure, but you're not exactly an idiot either.  If folks never
thought about what you said, they'd be an idiot as well.

Joel

-- 

"In the long run...we'll all be dead."
                                        -Unknown

Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 17:20               ` Joel Becker
@ 2003-10-10 17:33                 ` Chris Friesen
  2003-10-10 17:40                 ` Linus Torvalds
  1 sibling, 0 replies; 64+ messages in thread
From: Chris Friesen @ 2003-10-10 17:33 UTC (permalink / raw)
  To: Joel Becker
  Cc: Jamie Lokier, Linus Torvalds, Trond Myklebust, Ulrich Drepper,
	Linux Kernel

Joel Becker wrote:
> On Fri, Oct 10, 2003 at 12:58:05PM -0400, Chris Friesen wrote:
> 
>>>	Because you can't force flush/read.  You can't say "I need you
>>>to go to disk for this."
>>>
>>According to my man pages, this is exactly what msync() is for, no?
>>
> 
> 	msync() forces write(), like fsync().  It doesn't force read().

Oh, of course.

So do the applications know when they need to invalidate the cache 
(allowing for the reader to do a reverse-msync kind of thing), or do 
they have to read from disk all the time?

Chris



-- 
Chris Friesen                    | MailStop: 043/33/F10
Nortel Networks                  | work: (613) 765-0557
3500 Carling Avenue              | fax:  (613) 765-2986
Nepean, ON K2H 8E9 Canada        | email: cfriesen@nortelnetworks.com


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 16:50             ` Linus Torvalds
@ 2003-10-10 17:33               ` Joel Becker
  2003-10-10 17:51                 ` Linus Torvalds
  0 siblings, 1 reply; 64+ messages in thread
From: Joel Becker @ 2003-10-10 17:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Trond Myklebust, Ulrich Drepper, Linux Kernel

On Fri, Oct 10, 2003 at 09:50:02AM -0700, Linus Torvalds wrote:
> The fact is, the high end is getting smaller and smaller. If Oracle wants 
> to go after that high-end-only market, then be my guest. 

	No, the high-end for hardware is getting smaller.  The need for
high-end jobs is just fine.  But as you point out, the high-end jobs are
being done by low-end hardware.  And here is Oracle, promoting a bank of
cheap-ass 2-way boxen to do the job.

> Have you guys learnt _nothing_ from the past? The reason MicroSoft and
> Linux are kicking all the other vendors butts is that _small_ is
> beautiful. Especially when small is "powerful enough".

	Again, we need this sort of stuff precisely because we'd rather
use 2 $5k Linux/Intel servers than 1 $40k Sun server (and the Linux box
outruns the Sun, quite comfortably).  That's the "powerful enough",
right there.

> And believing that the load will keep up with "big iron hardware" is just 
> not _true_. It's never been true. "Small iron" not only keeps up, but 
> overtakes it - to the point where you have to start doing new things just 
> to be able to take advantage of it.

	Linus, I've said it twice above.  This has been our entire
direction for the past couple years, and we've been loud about it.
Please, knock us for what we do wrong, but recognize what we are
actually doing wrong, not what you think we are doing.

> Ok. Let's just hope all the crackers and virus writers believe you when 
> you say "you shouldn't do that".

	Well, if a cracker and virus writer can get enough priviledge to
write(), cached or O_DIRECT, they can corrupt you without worrying about
this specific gotcha.  That doesn't mean you don't fix it, but it also
doesn't mean you throw up your hands and claim you can't do it.

> BIG FRIGGING HINT: a _real_ OS doesn't allow data corruption even for
> cases where "you shouldn't do that". It shouldn't allow reading of data
> that you haven't written. And "you shouldn't do that" is _not_ an excuse
> for having bad interfaces that cause problems.

	I know that, I agree with it, and I said as much a few emails
past.  Linux should refuse to corrupt your data.  But you've taken the
tack "It is unsafe today, so we should abandon it altogether, never mind
fixing it.", which doesn't logically follow.

Joel

-- 

"Behind every successful man there's a lot of unsuccessful years."
        - Bob Brown

Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
-- 

"When choosing between two evils, I always like to try the one
 I've never tried before."
        - Mae West

Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 17:20               ` Joel Becker
  2003-10-10 17:33                 ` Chris Friesen
@ 2003-10-10 17:40                 ` Linus Torvalds
  2003-10-10 17:54                   ` Trond Myklebust
  2003-10-10 18:05                   ` Joel Becker
  1 sibling, 2 replies; 64+ messages in thread
From: Linus Torvalds @ 2003-10-10 17:40 UTC (permalink / raw)
  To: Joel Becker
  Cc: Chris Friesen, Jamie Lokier, Trond Myklebust, Ulrich Drepper,
	Linux Kernel


On Fri, 10 Oct 2003, Joel Becker wrote:
> 
> 	msync() forces write(), like fsync().  It doesn't force read().

Actually, the kernel has a "readahead(fd, offset, size)" system call that
will start asynchronous read-ahead on any mapping. After that, just
touching the page will obviously map in and synchronize the result.

I don't think anybody uses it, and the interface may be broken, but it was
literally 20 lines of code, and I had a trivial test program that
populated the cache for a directory structure really quickly using it.

In general, it would be really nice to have more oracle people discussing
what their particular pet horror is, and what they'd really like to do.

I know you're more used to just doing your own thing and working with
vendors, but even just people getting used to do the unofficial "this is
what we do, and it sucks because xxx" would make people more aware of what 
you wan tto do, and maybe it would suggest novel ways of doing things.

I suspect most of the things would get shot down as being impractical, but
there have always been a lot of discussion about more direct control of
the page cache for programs that really want it, and I'm more than willing
to discuss things (obviously 2.7.x material, but still.. A lot of it is
trivial and could be back-ported to 2.6.x if people start using it).

For example, things we can do, but don't, partly because of interface 
issues and because there is no point in doing it if people wouldn't use 
it:

 - moving a page back and forth between user space. It's _trivial_ to do, 
   with a fallback on copying if the page happens to be busy (ie we can 
   often just replace the existing page cache page, but if somebody else
   has it mapped, we'd have to copy the contents instead)

   We can't do this for "regular" read and write, because the resulting 
   copy-on-write sitution makes it less than desireable in most cases, but 
   if the user space specifically says "you can throw these pages away
   after moving them to the page cache", that avoids a lot of horror.

   The "remap_file_pages()" thing kind of does this on the read side (ie 
   it says "map in this page cache entry into my virtual address space"), 
   but we don't have the reverse aka "take this page in the virtual 
   address space and map it into the page cache".

   Interfaces like these would also allow things like zero-copy file
   copies with smaller page cache footprints - at the expense of 
   invalidating the cache for the source file as a result of the copy. 
   Which is why it can't be a _regular_ read - but it's one of those 
   things where if the user knows what he wants..

 - dirty mapping control (ie controlling partial page dirty state, and 
   also _delaying_ writeout if it needs to be ordered). Possibly by having 
   a separate backing store (ie a mmap that says "read from this file, but
   write back to that other file") to avoid the nasty memory management 
   problems.

A lot of these are really easy to do, but the usage and the interfaces are 
non-obvious.

		Linus


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 17:33               ` Joel Becker
@ 2003-10-10 17:51                 ` Linus Torvalds
  2003-10-10 18:13                   ` Joel Becker
  0 siblings, 1 reply; 64+ messages in thread
From: Linus Torvalds @ 2003-10-10 17:51 UTC (permalink / raw)
  To: Joel Becker; +Cc: Trond Myklebust, Ulrich Drepper, Linux Kernel


On Fri, 10 Oct 2003, Joel Becker wrote:
> 
> 	I know that, I agree with it, and I said as much a few emails
> past.  Linux should refuse to corrupt your data.  But you've taken the
> tack "It is unsafe today, so we should abandon it altogether, never mind
> fixing it.", which doesn't logically follow.

No, we've fixed it, the problem is that it ends up being a lot of extra
complexity that isn't obvious when just initially looking at it. For
example, just the IO scheduler ended up having serious problems with
overlapping IO requests. That's in addition to all the issues with
out-of-sync ordering etc that could cause direct_io reads to bypass
regular writes and read stuff off the disk that was a potential security
issue.

So right now we have extra code and extra complexity (which implies not
only potential for more bugs, but there are performance worries etc that
can impact even users that don't need it).

And these are fundamental problems to DIRECT_IO. Which means that likely
at some point we will _have_ to actually implement DIRECT_IO entirely
through the page cache to make sure that it's safe. So my bet is that
eventually we'll make DIRECT_IO just be an awkward way to do page cache
manipulation.

And maybe it works out ok. And we'll clearly have to keep it working. The 
issue is whether there are better interfaces. And I think there are bound 
to be.

		Linus


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 17:40                 ` Linus Torvalds
@ 2003-10-10 17:54                   ` Trond Myklebust
  2003-10-10 18:05                     ` Linus Torvalds
  2003-10-11  2:53                     ` Andrew Morton
  2003-10-10 18:05                   ` Joel Becker
  1 sibling, 2 replies; 64+ messages in thread
From: Trond Myklebust @ 2003-10-10 17:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Joel Becker, Chris Friesen, Jamie Lokier, Linux Kernel

>>>>> " " == Linus Torvalds <torvalds@osdl.org> writes:

     > On Fri, 10 Oct 2003, Joel Becker wrote:
    >>
    >> msync() forces write(), like fsync().  It doesn't force read().

     > Actually, the kernel has a "readahead(fd, offset, size)" system
     > call that will start asynchronous read-ahead on any
     > mapping. After that, just touching the page will obviously map
     > in and synchronize the result.

That's different. That's just preheating the page cache.

It does nothing for the case Joel mentioned where 2 different nodes
are writing to the same device, and you need to force a read in order
to resynchronize the page cache.
Apart from O_DIRECT, we have nothing in the kernel as it stands that
will allow userland to deal with this case.

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 17:54                   ` Trond Myklebust
@ 2003-10-10 18:05                     ` Linus Torvalds
  2003-10-10 20:40                       ` Trond Myklebust
  2003-10-11  2:53                     ` Andrew Morton
  1 sibling, 1 reply; 64+ messages in thread
From: Linus Torvalds @ 2003-10-10 18:05 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Joel Becker, Chris Friesen, Jamie Lokier, Linux Kernel


On Fri, 10 Oct 2003, Trond Myklebust wrote:
>
> Apart from O_DIRECT, we have nothing in the kernel as it stands that
> will allow userland to deal with this case.

Oh, but that's just another case of the general notion of allowing people 
to control the page cache a bit more. 

There's nothing wrong with having kernel interfaces that say "this region
is potentially stale" or "this region is dirty" or "this region is not
needed any more".

For example, using DIRECT_IO to make sure that something is uptodate is
just _stupid_, because clearly it only matters to shared-disk (either over
networks/FC or though things like SCSI device sharing) setups. So now the 
app has to have a way to query for whether the storage is shared, and 
have two totally different code-paths depending on the answer. 

This is another example of a bad design, that ends up causing more
problems (remember why this thread started in the first place: bad design
of O_DIRECT causing the app to have to care about something _else_ it
shouldn't care about. At all).

If you had a "this region is stale" thing, you'd just use it. And if it 
was local disk, it wouldn't do anything. 

		Linus


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 17:40                 ` Linus Torvalds
  2003-10-10 17:54                   ` Trond Myklebust
@ 2003-10-10 18:05                   ` Joel Becker
  2003-10-10 18:31                     ` Andrea Arcangeli
  2003-10-10 20:33                     ` Helge Hafting
  1 sibling, 2 replies; 64+ messages in thread
From: Joel Becker @ 2003-10-10 18:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Fri, Oct 10, 2003 at 10:40:40AM -0700, Linus Torvalds wrote:
> Actually, the kernel has a "readahead(fd, offset, size)" system call that
> will start asynchronous read-ahead on any mapping. After that, just
> touching the page will obviously map in and synchronize the result.

	Ok, a quick peruse of sys_readahead() seems to say that it
doesn't check for existing uptodate()ness.  That would be interesting.
I could have missed it, though.  

> I don't think anybody uses it, and the interface may be broken, but it was
> literally 20 lines of code, and I had a trivial test program that
> populated the cache for a directory structure really quickly using it.

	The problem we have with msync() and friends is not 'quick
population', it's "page is in the page cache already; another node
writes to the storage; must mark page as !uptodate so as to force a
re-read from disk".  I can't find where sys_readahead() checks for
uptodate, so perhaps calling sys_readahead() on a range always causes
I/O.  Correct me if I missed it.

> For example, things we can do, but don't, partly because of interface 
> issues and because there is no point in doing it if people wouldn't use 
> it:

	Lots of interesting stuff snipped.  This discussion has me
thinking, knowing now that there's possibility to move to a more optimal
interface.

Joel

-- 

Life's Little Instruction Book #464

	"Don't miss the magic of the moment by focusing on what's
	 to come."

Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 17:51                 ` Linus Torvalds
@ 2003-10-10 18:13                   ` Joel Becker
  0 siblings, 0 replies; 64+ messages in thread
From: Joel Becker @ 2003-10-10 18:13 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Trond Myklebust, Ulrich Drepper, Linux Kernel

On Fri, Oct 10, 2003 at 10:51:52AM -0700, Linus Torvalds wrote:
> And maybe it works out ok. And we'll clearly have to keep it working. The 
> issue is whether there are better interfaces. And I think there are bound 
> to be.

	Agreed.

Joel

-- 

"Well-timed silence hath more eloquence than speech."  
         - Martin Fraquhar Tupper

Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 16:01         ` Jamie Lokier
  2003-10-10 16:33           ` Joel Becker
@ 2003-10-10 18:20           ` Andrea Arcangeli
  2003-10-10 18:36             ` Linus Torvalds
  1 sibling, 1 reply; 64+ messages in thread
From: Andrea Arcangeli @ 2003-10-10 18:20 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Linus Torvalds, Trond Myklebust, Ulrich Drepper, Linux Kernel

On Fri, Oct 10, 2003 at 05:01:44PM +0100, Jamie Lokier wrote:
> Joel Becker wrote:
> > 	Where I work doesn't change the need for O_DIRECT.  If your Big
> > App has it's own cache, why copy the cache in the kernel?  That just
> > wastes RAM.
> 
> Why don't you _share_ the App's cache with the kernel's?  That's what
> mmap() and remap_file_pages() are for.

I covered this some time ago in the remap_file_pages threads with Wil.

remap_file_pages requires pte modifications and tlb flushes.

O_DIRECT only walk the pagetables, no pte mangling, no tlb flushes, the
TLB is preserved fully.

thinking only 64bit in the above of course, 32bit is different but still
mmap+remap_file_pages can't beat O_DIRECT if you dedicate your machine
for the database task.

> >  If your app is sharing data, whether physical disk, logical
> > disk, or via some network filesystem or storage device, you must
> > absolutely guarantee that reads and writes hit the storage, not the
> > kernel cache which has no idea whether another node wrote an update or
> > needs a cache flush.
> 
> That's tough to guarantee at the platter level regardless of O_DIRECT,
> but otherwise: you have fdatasync() and msync().
> 
> > If Linux came up with a better, cleaner method, Oracle might change.
> 
> Take a look at remap_file_pages() and write a note here to say if it
> fits the bill.  I thought remap_file_pages() was added for Oracle, but
> perhaps it was for a more modern database ;)

no way, it has the disavantages I mentioned above, it would be a bad
idea to use remap_file_pages on any 64bit system out there.

we know remap_file_pages has a chance to improve the /dev/shm mappings
from 32bit systems, but that has nothing to do with the long run 64bit
machines, remap_file_pages is mostly a 32bit hack for ia32 with PAE.

About the in-memory databases, that's really the big iron
non-mass-market, not the other way around. only the big irons have
enough money to buy that much ram, you sure can't compare the price of
the ram with the price of disk, or at least not yet in this market AFIK.

Andrea - If you prefer relying on open source software, check these links:
	    rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.[45]/
	    http://www.cobite.com/cvsps/

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 18:05                   ` Joel Becker
@ 2003-10-10 18:31                     ` Andrea Arcangeli
  2003-10-10 20:33                     ` Helge Hafting
  1 sibling, 0 replies; 64+ messages in thread
From: Andrea Arcangeli @ 2003-10-10 18:31 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel

On Fri, Oct 10, 2003 at 11:05:35AM -0700, Joel Becker wrote:
> thinking, knowing now that there's possibility to move to a more optimal
> interface.

cleaner and simpler could very well be (many simpler db works that way
infact), but more optimal I doubt. To be more optimal you should let the
kernel do all the garbage collection of mappings, and not use
remap_file_pages. But then I'm unsure if the kernel is really able
better than you to choose what info to discard from the cache, and you'd
still have to pay for page faults that you don't have to right now.

And if you use remap_file_pages to still choose what to ""discard
first"" from userspace, then you'd better use O_DIRECT instead, that
doesn't require any pte mangling (ignoring the readahead, async-io and
msync, scsi-shared issues that sounds fixable).

About the security issues, they existed in older kernels they're
nowadays fixed thanks to Stephen's i_alloc_sem.

though, I'd be interesting to compared different models in practice to
be sure, I just don't have expectations for it being a "more optimal"
design at the moment.

Andrea - If you prefer relying on open source software, check these links:
	    rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.[45]/
	    http://www.cobite.com/cvsps/

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 18:20           ` Andrea Arcangeli
@ 2003-10-10 18:36             ` Linus Torvalds
  2003-10-10 19:03               ` Andrea Arcangeli
  0 siblings, 1 reply; 64+ messages in thread
From: Linus Torvalds @ 2003-10-10 18:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Jamie Lokier, Trond Myklebust, Ulrich Drepper, Linux Kernel


On Fri, 10 Oct 2003, Andrea Arcangeli wrote:
> 
> O_DIRECT only walk the pagetables, no pte mangling, no tlb flushes, the
> TLB is preserved fully.

Yes. However, it's even _nicer_ if you don't need to walk the page tables 
at all.

Quite a lot of operations could be done directly on the page cache. I'm 
not a huge fan of mmap() myself - the biggest advantage of mmap is when 
you don't know your access patterns, and you have reasonably good 
locality. In many other cases mmap is just a total loss, because the page 
table walking is often more expensive than even a memcpy().

That's _especially_ true if you have to move mappings around, and you have 
to invalidate TLB's. 

memcpy() often gets a bad name. Yeah, memory is slow, but especially if 
you copy something you just worked on, you're actually often better off 
letting the CPU cache do its job, rather than walking page tables and 
trying to be clever.

Just as an example: copying often means that you don't need nearly as much 
locking and synchronization - which in turn avoids one whole big mess 
(yes, the memcpy() will look very hot in profiles, but then doing extra 
work to avoid the memcpy() will cause spread-out overhead that is a lot 
worse and harder to think about).

This is why a simple read()/write() loop often _beats_ mmap approaches. 
And often it's actually better to not even have big buffers (ie the old 
"avoid system calls by aggregation" approach) because that just blows your 
cache away.

Right now, the fastest way to copy a file is apparently by doing lots of
~8kB read/write pairs (that data may be slightly stale, but it was true at
some point). Never mind the system call overhead - just having the extra
buffer stay in the L1 cache and avoiding page faults from mmap is a bigger
win.

And I don't think mmap _can_ beat that. It's fundamental. 

In contrast, direct page cache accesses really can do so. Exactly because 
they don't touch any page tables at all, and because they can take 
advantage of internal kernel data structure layout and move pages around 
without any cost..

		Linus


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 18:36             ` Linus Torvalds
@ 2003-10-10 19:03               ` Andrea Arcangeli
  0 siblings, 0 replies; 64+ messages in thread
From: Andrea Arcangeli @ 2003-10-10 19:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, Trond Myklebust, Ulrich Drepper, Linux Kernel

On Fri, Oct 10, 2003 at 11:36:29AM -0700, Linus Torvalds wrote:
> 
> On Fri, 10 Oct 2003, Andrea Arcangeli wrote:
> > 
> > O_DIRECT only walk the pagetables, no pte mangling, no tlb flushes, the
> > TLB is preserved fully.
> 
> Yes. However, it's even _nicer_ if you don't need to walk the page tables 
> at all.
> 
> Quite a lot of operations could be done directly on the page cache. I'm 
> not a huge fan of mmap() myself - the biggest advantage of mmap is when 
> you don't know your access patterns, and you have reasonably good 
> locality. In many other cases mmap is just a total loss, because the page 
> table walking is often more expensive than even a memcpy().
> 
> That's _especially_ true if you have to move mappings around, and you have 
> to invalidate TLB's. 

agreed. that's what remap_file_pages does infact.

> memcpy() often gets a bad name. Yeah, memory is slow, but especially if 
> you copy something you just worked on, you're actually often better off 
> letting the CPU cache do its job, rather than walking page tables and 
> trying to be clever.
> 
> Just as an example: copying often means that you don't need nearly as much 
> locking and synchronization - which in turn avoids one whole big mess 
> (yes, the memcpy() will look very hot in profiles, but then doing extra 
> work to avoid the memcpy() will cause spread-out overhead that is a lot 
> worse and harder to think about).
> 
> This is why a simple read()/write() loop often _beats_ mmap approaches. 
> And often it's actually better to not even have big buffers (ie the old 
> "avoid system calls by aggregation" approach) because that just blows your 
> cache away.
> 
> Right now, the fastest way to copy a file is apparently by doing lots of
> ~8kB read/write pairs (that data may be slightly stale, but it was true at
> some point). Never mind the system call overhead - just having the extra
> buffer stay in the L1 cache and avoiding page faults from mmap is a bigger
> win.
> 
> And I don't think mmap _can_ beat that. It's fundamental. 

That's my whole point, agreed. Though using mmap would be sure cleaner
and simpler.

> In contrast, direct page cache accesses really can do so. Exactly because 
> they don't touch any page tables at all, and because they can take 
> advantage of internal kernel data structure layout and move pages around 
> without any cost..

Which basically means removing O_DIRECT from the open syscalls and still
use read/write if I understand correctly.

With todays commodity dirtcheap hardware, it has been proven that
walking the pte (NOTE: only walking, no mangling and no tlb flushing) is
much faster than doing the memcpy. More cpu is left free for the other
tasks and the cost of the I/O is the same. The different isn't
measurable in I/O bound tasks, but a database is both IO bound and cpu
bound at the same time, so for a db it's measurable. At least this is
the case for Oracle. I believe Joel has access to these numbers too, and
that's why he's interested in O_DIRECT in the first place.

With faster membus things may change of course (to the point where
there's no difference between the two models), but still I don't see how
can walking tree pointers to be more expensive than copying 512bytes of
data (assuming the smaller blocksize). And you're ignoring the CPU *has*
to walk those three pointers _anyways_ implicitly to allow the memcpy to
run. So as far as I can tell the memcpy is pure overhead that can be
avoided with O_DIRECT.

this is also why I rejected all approcches that wanted to allow
readahead via O_DIRECT by preloading data in pagecache, my argument is:
if you can't avoid the memcpy you must not use O_DIRECT. The only signle
object of O_DIRECT is to avoid the memcpy, the cache pollution avoidance
is a very minor issue, the main point is to avoid the memcpy.

I also posted a number of benchmarks at some point, where I've shown a
dramatical reduction of the cpu usage, up to 10% reduction, on a normal
cheap hardware w/o reduction of I/O bandwidth. This means 10% more cpu
to use for doing something useful in the cpu bound part of the database.

The main downside of O_DIRECT I believe conceptual, starting from the
ugliness inside the kernel, like the cache coherency handling and
i_alloc_sem need to avoid reads to run in parallel of block allocations,
etc... but the practical effect I doubt can be easily beaten in the
numbers. That said maybe we can provide a nicer API that does the same
thing internally I don't know, but certainly that can't be
remap_file_pages because that does a very different thing.

Andrea - If you prefer relying on open source software, check these links:
	    rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.[45]/
	    http://www.cobite.com/cvsps/

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 16:33           ` Joel Becker
  2003-10-10 16:58             ` Chris Friesen
@ 2003-10-10 20:07             ` Jamie Lokier
  2003-10-12 15:31             ` Greg Stark
  2 siblings, 0 replies; 64+ messages in thread
From: Jamie Lokier @ 2003-10-10 20:07 UTC (permalink / raw)
  To: Joel Becker; +Cc: Linus Torvalds, Trond Myklebust, Ulrich Drepper, Linux Kernel

Joel Becker wrote:
> 	Platter level doesn't matter.  Storage access level matters.
> Node1 and Node2 have to see the same thing.  As long as I am absolutely
> sure that when Node1's write() returns, any subsequent read() on Node2
> will see the change (normal barrier stuff, really), it doesn't matter
> what happend on the Storage.  The data could be in storage cache, on
> platter, passed back to some other entity.

That's two specifications.  Please choose one!

First you say the storage access level matters, then you say it
doesn't matter, that the only important thing is any two nodes see
each other's changes.

Committing data to a certain level of storage, for the sake of
_storing_ it, is trivially covered by fdatasync().  We won't talk
about that any more.

The other requirement is about barriers between nodes accessing data.

On a single machine, your second specification means the data doesn't
need to hit the disk at all.  On a SAN, it means the data doesn't need
to hit the SAN's storage - nor, in fact does the data have to be
transferred over the SAN when you write it!  Distributed cache
coherency exists for that purpose.

For example, let's imagine 32 processes, 8 per machine, and a giant
shared disk.  Pages in the database are regularly read and written by
pairs of nodes and, because of the way you direct requests based on
keys, certain pages tend to be accessed only by certain pairs of
nodes.

That means a significant proportion of the pages do _not_ need to be
transmitted through the shared disk every time they are made visible
to other nodes - because those page accesses are local to one _machine_
for many transfers.

That means O_DIRECT is using more storage bandwidth than you need to
use.  The waste is greatest on a single machine (i.e. infinity) but
with multiple machines there is still waste and the amount depends on
access patterns.

You should be using cache coherency protocols between nodes - at the
database level (which you very likely are, as performance would
plummet without it) - and at the filesystem level.

"Forcing a read" is *not* a required operation if you have a sound
network filesystem or even network disk protocol.  Merely reading a
page will force the read, if another node has written to it - and
*only* if that is necessary.  Some of the distributed filesystems,
like Sistina's, get this right I believe.

If, however, your shared file does not maintain coherent views between
different nodes, then you _do_ you need to force writes and force
reads.

Your quality database will not waste storage bandwidth by doing
_unnecessary_ reads, if the underlying storage isn't coherent, merely
to see whether a page changed.  For that, you should be communicating
metadata between nodes that say "this page is now dirty; you will need
to read it" and "ok" - along the lines of MESI.

That is the worst case I can think of (i.e. the kernel filesystem/san
driver doesn't do coherence so you have to do it in the database
program), and indeed you do need the ability to flush read pages in
that case.  Ideally you want the ability to pass pages directly
between nodes without requiring a storage commit, too.

Linus' suggestion of "this data is stale" is ok.  Another flag to
remap_file_pages would work, too, saving a system call in some cases,
but doing unwanted reads (sometimes you just want to invalidate) in
some others.  Btw, fadvise(POSIX_FADV_DONTNEED) appears to offser this
already.

Using O_DIRECT always can be inefficient, because it commits things to
storage which don't need to be committed so early or often, and
because it moves data when it does not need to be moved, with the
worst case being a small cluster of large machines or just one
machine.

It's likely your expensive shared disk doesn't mind all those commits
because of journalling NVRAM etc..

To avoid wasting bandwidth to the shared disk associated processing
costs you have to analyse the topology of node interconnections,
specifically to avoid using O_DIRECT and/or unnecessary reads and
writes when they aren't necessary (between local nodes).

You need that anyway even with Linus' suggestion, because there's no
way the kernel can know automatically whether you are doing a
coherence operation between two local nodes or remote ones.  Local
filesystems look like a worthy exception, but even those are iffy if
there's a remote client accessing it over a network filesystem as well
as local nodes synchronising over it.  It has to be an unexported
local filesystem, and the kernel doesn't even know that, because of
userspace servers like Samba.

   ======

That long discussion leads to this:

The best in theory is a network-coherent filesystem.  It knows the
topology and it can implement the optimal strategy.

Without one of those, it is necessary to know the topology between
nodes to get optimal performance for any method (i.e. minimum pages
transferred around, minimum operations etc.).  This is true of using
O_DIRECT or Linus' page cache manipulations.

O_DIRECT works, but causes unnecessary storage commitment when all you
need is synchronisation.

Page cache manipulation may already be possible using fdatasync +
MADV_DONTNEED + POSIX_FADV_DONTNEED, however that isn't optimal
either, because:

Both of those mechanisms do not provide a way to transfer a dirty page
from one node to another without (a) committing to storage; or (b)
copying the data at the receiver.  O_DIRECT does the commit (write at
one node; read at the other), but is zero-copy at the receiver, as
mapped files are generally.  Without O_DIRECT, you'd have to use
application level socket<->socket communication, and there is as yet
no zero-copy receive.  Zero-copy UDP receive or similar is needed to
get the best from this.

Conclusions
-----------

Only a coherent distributed filesystem actually minimises the amount
of file data transferred and copied, and automatically too.

All other suggestions so far have weaknesses in this regard.

Although the page cache manipulation methods could minimise the
transfers and copies if zero-copy socket receive was available, it is
a set of mechanisms that look like it would, after it's implemented in
the application, still be slower than a coherent filesystem just
because the latter can do the combination of manipulations etc. more
easily; on the other hand, resolution of cache coherency by a
filesystem would incur more page faults than doing it at the
application level.  So it is not absolutely clear which can be made
faster with lots of attention.

The end :)

Thanks for getting this far...
-- Jamie

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 18:05                   ` Joel Becker
  2003-10-10 18:31                     ` Andrea Arcangeli
@ 2003-10-10 20:33                     ` Helge Hafting
  1 sibling, 0 replies; 64+ messages in thread
From: Helge Hafting @ 2003-10-10 20:33 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel, Joel.Becker

On Fri, Oct 10, 2003 at 11:05:35AM -0700, Joel Becker wrote:

> 	The problem we have with msync() and friends is not 'quick
> population', it's "page is in the page cache already; another node
> writes to the storage; must mark page as !uptodate so as to force a
> re-read from disk".  I can't find where sys_readahead() checks for
> uptodate, so perhaps calling sys_readahead() on a range always causes
> I/O.  Correct me if I missed it.
>
 
Wouldn't this be solvable by giving userspace a way of invalidating
a range of mmapped pages?  I.e. a "minvalidate();" to use when
the other node tells you it is about to write?

This will cause the pages to be paged in again on next reference,
or you can issue a read in advance if you believe you'll need them.

Helge Hafting

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 18:05                     ` Linus Torvalds
@ 2003-10-10 20:40                       ` Trond Myklebust
  2003-10-10 21:09                         ` Linus Torvalds
  0 siblings, 1 reply; 64+ messages in thread
From: Trond Myklebust @ 2003-10-10 20:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Joel Becker, Chris Friesen, Jamie Lokier, Linux Kernel


     > If you had a "this region is stale" thing, you'd just use
     > it. And if it was local disk, it wouldn't do anything.

Note that in order to be race-free, such a command would also have to
wait on any outstanding operations (i.e. both pending reads and
writes) on the region in question in order to make sure that none have
crossed the synchronization point. It is not a question of just
calling invalidate_inode_pages() and thinking that all is well...

In fact, I recently noticed that we still have this race in the NFS
file locking code: readahead may have been scheduled before we
actually set the file lock on the server, and may thus fill the page
cache with stale data.

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 20:40                       ` Trond Myklebust
@ 2003-10-10 21:09                         ` Linus Torvalds
  2003-10-10 22:17                           ` Trond Myklebust
  0 siblings, 1 reply; 64+ messages in thread
From: Linus Torvalds @ 2003-10-10 21:09 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Joel Becker, Chris Friesen, Jamie Lokier, Linux Kernel


On Fri, 10 Oct 2003, Trond Myklebust wrote:
> 
> In fact, I recently noticed that we still have this race in the NFS
> file locking code: readahead may have been scheduled before we
> actually set the file lock on the server, and may thus fill the page
> cache with stale data.

The current "invalidate_inode_pages()" is _not_ equivalent to a specific
user saying "these pages are bad and have to be updated".

The main difference is that invalidate_inode_pages() really cannot assume
that the pages are bad: the pages may be mapped into another process that 
is actively writing to them, so the regular "invalidate_inode_pages()" 
literally must not force a re-read - that would throw out real 
information.

So "invalidate_inode_pages()" really is a hint, not a forced eviction.

A forced eviction can be done only by a user that says "I have write
permission to this file, and I will now say that these pages _have_ to be
thrown away, whether dirty or not".

And that's totally different, and will require a totally different 
approach.

(As to the read-ahead issue: there's nothing saying that you can't wait
for the pages if they aren't up-to-date, and really synchronize with
read-ahead. But that will require filesystem help, if only to be able to
recognize that there is active IO going on. So NFS would have to keep 
track of a "read list" the same way it does for writeback pages).

		Linus


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 21:09                         ` Linus Torvalds
@ 2003-10-10 22:17                           ` Trond Myklebust
  0 siblings, 0 replies; 64+ messages in thread
From: Trond Myklebust @ 2003-10-10 22:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel

>>>>> " " == Linus Torvalds <torvalds@osdl.org> writes:

     > (As to the read-ahead issue: there's nothing saying that you
     > can't wait for the pages if they aren't up-to-date, and really
     > synchronize with read-ahead. But that will require filesystem
     > help, if only to be able to recognize that there is active IO
     > going on. So NFS would have to keep track of a "read list" the
     > same way it does for writeback pages).

Well... I was thinking more in terms of a rw_semaphore to lock out new
calls to nfs_file_(read|write|sendfile) in combination with a call to
invalidate_inode_pages2().

Such a mechanism can also be used in schemes to improve on the generic
data/attribute cache consistency in order to reduce the number of
bogus cache invalidations due to RPC ordering races. Those can tend to
be expensive...

Note: Anybody using mmap() in combination with file locking will
however continue to enjoy the privilege of being able to screw up...

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 17:54                   ` Trond Myklebust
  2003-10-10 18:05                     ` Linus Torvalds
@ 2003-10-11  2:53                     ` Andrew Morton
  2003-10-11  3:47                       ` Trond Myklebust
  1 sibling, 1 reply; 64+ messages in thread
From: Andrew Morton @ 2003-10-11  2:53 UTC (permalink / raw)
  To: trond.myklebust; +Cc: torvalds, Joel.Becker, cfriesen, jamie, linux-kernel

Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
>
> It does nothing for the case Joel mentioned where 2 different nodes
> are writing to the same device, and you need to force a read in order
> to resynchronize the page cache.
> Apart from O_DIRECT, we have nothing in the kernel as it stands that
> will allow userland to deal with this case.

Applications may use fadvise(POSIX_FADV_DONTNEED) to invalidate sections of
a file's pagecache.

It is not designed to be 100% reliable though: mmapped pages will be
retained, and dirty pages are skipped.

For the dirty pages it might be useful to add a new mode to fadvise which
syncs a section of a file's pages; -mm has the necessary infrastructure for
that.

POSIX does not define the fadvise() semantics very clearly, so it is largely
up to us to decide what makes sense.  There are a number of things which we
can do quite easily in there - it's mainly a matter of working out exactly
what we want to do.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-11  2:53                     ` Andrew Morton
@ 2003-10-11  3:47                       ` Trond Myklebust
  0 siblings, 0 replies; 64+ messages in thread
From: Trond Myklebust @ 2003-10-11  3:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linus Torvalds, Joel.Becker, cfriesen, jamie, linux-kernel

>>>>> " " == Andrew Morton <akpm@osdl.org> writes:

     > POSIX does not define the fadvise() semantics very clearly, so
     > it is largely up to us to decide what makes sense.  There are a
     > number of things which we can do quite easily in there - it's
     > mainly a matter of working out exactly what we want to do.

Possibly, but there really is no need to get over-creative either. The
SUS definition of msync(MS_INVALIDATE) reads as follows:

        When MS_INVALIDATE is specified, msync() shall invalidate all
        cached copies of mapped data that are inconsistent with the
        permanent storage locations such that subsequent references
        shall obtain data that was consistent with the permanent
        storage locations sometime between the call to msync() and the
        first subsequent memory reference to the data.

(ref: http://www.opengroup.org/onlinepubs/007904975/functions/msync.html)

i.e. a strict implementation would mean that msync() will in fact act
as a synchronization point that is fully consistent with Linus'
proposal for a "this region is stale" function.

Unfortunately Linux appears incapable of implementing such a strict
definition of msync() as it stands.

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-10 16:33           ` Joel Becker
  2003-10-10 16:58             ` Chris Friesen
  2003-10-10 20:07             ` Jamie Lokier
@ 2003-10-12 15:31             ` Greg Stark
  2003-10-12 16:13               ` Linus Torvalds
  2 siblings, 1 reply; 64+ messages in thread
From: Greg Stark @ 2003-10-12 15:31 UTC (permalink / raw)
  To: Joel Becker
  Cc: Jamie Lokier, Linus Torvalds, Trond Myklebust, Ulrich Drepper,
	Linux Kernel

Joel Becker <Joel.Becker@oracle.com> writes:

> On Fri, Oct 10, 2003 at 05:01:44PM +0100, Jamie Lokier wrote:
> > Why don't you _share_ the App's cache with the kernel's?  That's what
> > mmap() and remap_file_pages() are for.
> 
> 	Because you can't force flush/read.  You can't say "I need you
> to go to disk for this."  If you do, you're doing O_DIRECT through mmap
> (yes, I've pondered it) and you end up with perhaps the same races folks
> worry about.  Doesn't mean it can't be done.

There are other reasons databases want to control their own cache. The
application knows more about the usage and the future usage of the data than
the kernel does.

There's currently a thread on the Postgres mailing list about a problem with
an administrative job that needs to touch potentially all the blocks of a table.
The more frequently it's run the less work it has to do, so the recommendation
is to run it very frequently.

However on busy servers whenever it's run it causes lots of pain because the
kernel flushes all the cached data in favour of the data this job touches. And
worse, there's no way to indicate that the i/o it's doing is lower priority,
so i/o bound servers get hit dramatically. 

Postgres knows the fact that this job touched the data means nothing for the
regular functioning of the server, and it knows that the i/o it's doing is low
priority. It needs some way to indicate to the kernel that this job is low
priority not only for cpu resources but for cache resources and i/o resources
as well.

There are other cases. Oracle, for example, puts blocks it reads due to full
table scans at the end of its LRU list to avoid a similar effect on the cache.

Then there's the transaction log. The database needs to know when the
transaction log is written to disk. The blocks it writes there won't be useful
to cache unless the database crashed right there. And ideally it should bypass
any disk i/o reordering and write the data to the transaction log *first*. Raw
bandwidth is not as important as latency on writes to the transaction log.

The reason mmap is tempting is not because it's faster. It's because it
provides a nice clean abstract interface. The database could simply mmap the
entire database and then pretend it is an in-memory database. The code would
be much simpler and more complex algorithms would be easier to implement.

Unfortunately there are some problems with mmap. Currently it would be just as
complex to use as read/write because the address space is limited to only a
fraction of the database. On a 64 bit machine you might be able to mmap the
entire database and then use custom syscalls to indicate to the kernel which
pages to keep in cache and which to sync.

-- 
greg


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-12 15:31             ` Greg Stark
@ 2003-10-12 16:13               ` Linus Torvalds
  2003-10-12 22:09                 ` Greg Stark
  0 siblings, 1 reply; 64+ messages in thread
From: Linus Torvalds @ 2003-10-12 16:13 UTC (permalink / raw)
  To: Greg Stark
  Cc: Joel Becker, Jamie Lokier, Trond Myklebust, Ulrich Drepper, Linux Kernel


On 12 Oct 2003, Greg Stark wrote:
> 
> There are other reasons databases want to control their own cache. The
> application knows more about the usage and the future usage of the data than
> the kernel does.

But this again is not an argument for not using the page cache - it's only 
an argument for _telling_ the kernel about its use.

> However on busy servers whenever it's run it causes lots of pain because the
> kernel flushes all the cached data in favour of the data this job touches.

Yes. But this is actually pretty easy to avoid in-kernel, since all of the 
LRU logic is pretty localized.

It could be done on a per-process thing ("this process should not pollute 
the active list") or on a per-fd thing ("accesses through this particular 
open are not to pollute the active list"). 

>									 And
> worse, there's no way to indicate that the i/o it's doing is lower priority,
> so i/o bound servers get hit dramatically. 

IO priorities are pretty much worthless. It doesn't _matter_ if other 
processes get preferred treatment - what is costly is the latency cost of 
seeking. What you want is not priorities, but batching.

			Linus


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-12 16:13               ` Linus Torvalds
@ 2003-10-12 22:09                 ` Greg Stark
  2003-10-13  8:45                   ` Helge Hafting
  0 siblings, 1 reply; 64+ messages in thread
From: Greg Stark @ 2003-10-12 22:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Greg Stark, Joel Becker, Jamie Lokier, Trond Myklebust,
	Ulrich Drepper, Linux Kernel

Linus Torvalds <torvalds@osdl.org> writes:

> > worse, there's no way to indicate that the i/o it's doing is lower priority,
> > so i/o bound servers get hit dramatically. 
> 
> IO priorities are pretty much worthless. It doesn't _matter_ if other 
> processes get preferred treatment - what is costly is the latency cost of 
> seeking. What you want is not priorities, but batching.

What you want depends very much on the circumstances. I'm sure in a lot of
cases batching helps, but in this case it's not the issue.

The vacuum job that runs periodically in fact is batched very well. In fact
that's the main reason it exists rather than having the cleanup handled in the
critical path in the transaction itself. 

I'm not aware of all the details but my understanding is that it reads every
block in the table sequentially, keeping note of all the records that are no
longer visible to any transaction. When it's finished reading it writes out a
"free space map" that subsequent transactions read and use to find available
space in the table.

The vacuum job is makes very efficient use of disk i/o. In fact too efficient.
Frequently people have their disks running at 50-90% capacity simply handling
the random seeks to read data. Those seeks are already batched to the OS's
best ability. 

But then vacuum comes along and tries to read the entire table sequentially.
In the best case the sequential read will take up a lot of the available disk
bandwidth and delay transactions. In the worst case the OS will actually
prefer the sequential read because the elevator algorithm always sees that it
can get more bandwidth by handling it ahead of the random access.

In reality there is no time pressure on the vacuum at all. As long as it
completes faster than dead records can pile up it's fast enough. The
transactions on the other hand must complete as fast as possible.

Certainly batching is useful and in many cases is more important than
prioritizing, but in this case it's not the whole answer.

I'll mention this thread on the postgresql-hackers list, perhaps some of the
more knowledgeable programmers there will have thought about these issues and
will be able to post their wishlist ideas for kernel APIs.

I can see why back in the day Oracle preferred to simply tell all the OS
vendors, "just give us direct control over disk accesses, we'll figure it out"
rather than have to really hash out all the details of their low level needs
with every OS vendor. But between being able to prioritize I/O resources and
cache resources, and being able to sync IDE disks properly and cleanly (that
other thread) Linux may be able drastically improve the kernel interface for
databases.

-- 
greg


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-12 22:09                 ` Greg Stark
@ 2003-10-13  8:45                   ` Helge Hafting
  2003-10-15 13:25                     ` Ingo Oeser
  0 siblings, 1 reply; 64+ messages in thread
From: Helge Hafting @ 2003-10-13  8:45 UTC (permalink / raw)
  To: Greg Stark
  Cc: Joel Becker, Jamie Lokier, Trond Myklebust, Ulrich Drepper, Linux Kernel

Greg Stark wrote:
[...]
> 
> But then vacuum comes along and tries to read the entire table sequentially.
> In the best case the sequential read will take up a lot of the available disk
> bandwidth and delay transactions. In the worst case the OS will actually
> prefer the sequential read because the elevator algorithm always sees that it
> can get more bandwidth by handling it ahead of the random access.
> 
> In reality there is no time pressure on the vacuum at all. As long as it
> completes faster than dead records can pile up it's fast enough. The
> transactions on the other hand must complete as fast as possible.

This seems almost trivial.  If the vacuum job runs too much,
overusing disk bandwith - throttle it!
This is easier than trying to tell the kernel that the job is
less important, that goes wrong wether the job runs too much
or too little.  Let that job  sleep a little when its services
aren't needed, or when you need the disk bandwith elsewhere.


Helge Hafting




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-13  8:45                   ` Helge Hafting
@ 2003-10-15 13:25                     ` Ingo Oeser
  2003-10-15 15:03                       ` Greg Stark
  0 siblings, 1 reply; 64+ messages in thread
From: Ingo Oeser @ 2003-10-15 13:25 UTC (permalink / raw)
  To: Helge Hafting, Greg Stark
  Cc: Joel Becker, Jamie Lokier, Trond Myklebust, Ulrich Drepper, Linux Kernel

On Monday 13 October 2003 10:45, Helge Hafting wrote:
> Greg Stark wrote:
> [...]
> > In reality there is no time pressure on the vacuum at all. As long as it
> > completes faster than dead records can pile up it's fast enough. The
> > transactions on the other hand must complete as fast as possible.
>
> This seems almost trivial.  If the vacuum job runs too much,
> overusing disk bandwith - throttle it!

If you are using regular read/write syscalls and not too big chunks --> trivial.
If you mmap you database --> harder.

If you would like to tell the kernel, that this should not be treated
like a sequential read --> fadvise/madvise.

> This is easier than trying to tell the kernel that the job is
> less important, that goes wrong wether the job runs too much
> or too little.  Let that job  sleep a little when its services
> aren't needed, or when you need the disk bandwith elsewhere.


Here I agree as this seems like a solution. 

The problem is, that you sometimes need low latency for your
transactions and then you cannot start throttling a heavy IO process,
whose IO is already issued and who is basically just waiting for disk
eating its bandwidth.


The questions are: How IO-intensive vacuum? How fast can a throttling
free disk bandwidth (and memory)?


Regards

Ingo Oeser




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-15 13:25                     ` Ingo Oeser
@ 2003-10-15 15:03                       ` Greg Stark
  2003-10-15 18:37                         ` Helge Hafting
  2003-10-16 10:29                         ` Ingo Oeser
  0 siblings, 2 replies; 64+ messages in thread
From: Greg Stark @ 2003-10-15 15:03 UTC (permalink / raw)
  To: Ingo Oeser
  Cc: Helge Hafting, Greg Stark, Joel Becker, Jamie Lokier,
	Trond Myklebust, Ulrich Drepper, Linux Kernel

Ingo Oeser <ioe-lkml@rameria.de> writes:

> On Monday 13 October 2003 10:45, Helge Hafting wrote:
> 
> > This is easier than trying to tell the kernel that the job is
> > less important, that goes wrong wether the job runs too much
> > or too little.  Let that job  sleep a little when its services
> > aren't needed, or when you need the disk bandwith elsewhere.

Actually I think that's exactly backwards. The problem is that if the
user-space tries to throttle the process it doesn't know how much or when.
The kernel knows exactly when there are other higher priority writes, it can
schedule just enough writes from vacuum to not interfere.

So if vacuum slept a bit, say every 64k of data vacuumed. It could end up
sleeping when the disks are actually idle. Or it could be not sleeping enough
and still be interfering with transactions.

Though actually this avenue has some promise. It would not be nearly as ideal
as a kernel based solution that could take advantage of the idle times between
transactions, but it would still work somewhat as a work-around.

> The questions are: How IO-intensive vacuum? How fast can a throttling
> free disk bandwidth (and memory)?

It's purely i/o bound on large sequential reads. Ideally it should still have
large enough sequential reads to not lose the streaming advantage, but not so
large that it preempts the more random-access transactions.

-- 
greg


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-15 15:03                       ` Greg Stark
@ 2003-10-15 18:37                         ` Helge Hafting
  2003-10-16 10:29                         ` Ingo Oeser
  1 sibling, 0 replies; 64+ messages in thread
From: Helge Hafting @ 2003-10-15 18:37 UTC (permalink / raw)
  To: Greg Stark; +Cc: Ingo Oeser, Joel Becker, Linux Kernel

On Wed, Oct 15, 2003 at 11:03:23AM -0400, Greg Stark wrote:
> Ingo Oeser <ioe-lkml@rameria.de> writes:
> 
> > On Monday 13 October 2003 10:45, Helge Hafting wrote:
> > 
> > > This is easier than trying to tell the kernel that the job is
> > > less important, that goes wrong wether the job runs too much
> > > or too little.  Let that job  sleep a little when its services
> > > aren't needed, or when you need the disk bandwith elsewhere.
> 
> Actually I think that's exactly backwards. The problem is that if the
> user-space tries to throttle the process it doesn't know how much or when.
> The kernel knows exactly when there are other higher priority writes, it can
> schedule just enough writes from vacuum to not interfere.
> 
Isn't those higher-priority writes issued from userspace?
I am of course assuming that source for _everything_ is available.
So the process with the high-priority write can tell vacuum to
take a nap until its transaction completes.

> So if vacuum slept a bit, say every 64k of data vacuumed. It could end up
> sleeping when the disks are actually idle. Or it could be not sleeping enough
> and still be interfering with transactions.
> 
It can run at full speed normally, take voluntary pauses if it ever
detects a "nothing to do now" condition. And it can be paused
(forcibly or through cooperation) when there are important transactions
to sync. 

> Though actually this avenue has some promise. It would not be nearly as ideal
> as a kernel based solution that could take advantage of the idle times between
> transactions, but it would still work somewhat as a work-around.
> 
Don't that other process know when it is about to submit important transactions?

> > The questions are: How IO-intensive vacuum? How fast can a throttling
> > free disk bandwidth (and memory)?
> 

Helge Hafting

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-15 15:03                       ` Greg Stark
  2003-10-15 18:37                         ` Helge Hafting
@ 2003-10-16 10:29                         ` Ingo Oeser
  2003-10-16 14:02                           ` Greg Stark
  1 sibling, 1 reply; 64+ messages in thread
From: Ingo Oeser @ 2003-10-16 10:29 UTC (permalink / raw)
  To: Greg Stark
  Cc: Helge Hafting, Joel Becker, Jamie Lokier, Trond Myklebust,
	Ulrich Drepper, Linux Kernel

Hi there,

first: I think the problem is solvable with mixing blocking and
non-blocking IO or simply AIO, which will be supported nicely by 2.6.0,
is a POSIX standard and is meant for doing your own IO scheduling.

On Wednesday 15 October 2003 17:03, Greg Stark wrote:
> Ingo Oeser <ioe-lkml@rameria.de> writes:
> > On Monday 13 October 2003 10:45, Helge Hafting wrote:
> > > This is easier than trying to tell the kernel that the job is
> > > less important, that goes wrong wether the job runs too much
> > > or too little.  Let that job  sleep a little when its services
> > > aren't needed, or when you need the disk bandwith elsewhere.
>
> Actually I think that's exactly backwards. The problem is that if the
> user-space tries to throttle the process it doesn't know how much or when.
> The kernel knows exactly when there are other higher priority writes, it
> can schedule just enough writes from vacuum to not interfere.

On dedicated servers this might be true. But on these you could also
solve it in user space by measuring disk bandwidth and issueing just
enough IO to keep up roughly with it.

> So if vacuum slept a bit, say every 64k of data vacuumed. It could end up
> sleeping when the disks are actually idle. Or it could be not sleeping
> enough and still be interfering with transactions.

The vacuum io is submitted (via AIO or simulation of it) normally in a
unit U and waiting ALWAYS for U to complete, before submitting a new one.
Between submitting units, the vacuums checks for outstanding transactions 
and stops, when we have one.

Now a transaction is submitted and the submitting from vacuum is stopped
by it existing. The transaction waits for completion (e.g.  aio_suspend()) 
and signals vacuum to continue.

So the disk(s) should be always in good use.

I don't know much of the design internals of your database, but this
sounds promising and is portable.

> > The questions are: How IO-intensive vacuum? How fast can a throttling
> > free disk bandwidth (and memory)?
>
> It's purely i/o bound on large sequential reads. Ideally it should still
> have large enough sequential reads to not lose the streaming advantage, but
> not so large that it preempts the more random-access transactions.

Ok, so we can ignore the processing time and the above should just work.


Regards

Ingo Oeser



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-16 10:29                         ` Ingo Oeser
@ 2003-10-16 14:02                           ` Greg Stark
  2003-10-21 11:47                             ` Ingo Oeser
  0 siblings, 1 reply; 64+ messages in thread
From: Greg Stark @ 2003-10-16 14:02 UTC (permalink / raw)
  To: Ingo Oeser
  Cc: Greg Stark, Helge Hafting, Joel Becker, Jamie Lokier,
	Trond Myklebust, Ulrich Drepper, Linux Kernel


Ingo Oeser <ioe-lkml@rameria.de> writes:

> Hi there,
> 
> first: I think the problem is solvable with mixing blocking and
> non-blocking IO or simply AIO, which will be supported nicely by 2.6.0,
> is a POSIX standard and is meant for doing your own IO scheduling.

I think aio could be very useful for databases, but not in this area. I think
it's useful as a more fine-grained tool than sync/fsync. Currently the
database has to fsync a file to commit a transaction, which means flushing
_all_writes to the file even ones from other transactions. If aio inserted
write barriers to the disk controller then it would provide a way to ensure
the current transaction is synced without having to flush all other
transactions writes at the same time.

But I don't see how it's useful for the problem I'm describing.

> On Wednesday 15 October 2003 17:03, Greg Stark wrote:
> > Ingo Oeser <ioe-lkml@rameria.de> writes:
> > > On Monday 13 October 2003 10:45, Helge Hafting wrote:
> > > > This is easier than trying to tell the kernel that the job is
> > > > less important, that goes wrong wether the job runs too much
> > > > or too little.  Let that job  sleep a little when its services
> > > > aren't needed, or when you need the disk bandwith elsewhere.
> >
> > Actually I think that's exactly backwards. The problem is that if the
> > user-space tries to throttle the process it doesn't know how much or when.
> > The kernel knows exactly when there are other higher priority writes, it
> > can schedule just enough writes from vacuum to not interfere.
> 
> On dedicated servers this might be true. But on these you could also
> solve it in user space by measuring disk bandwidth and issueing just
> enough IO to keep up roughly with it.

Indeed we're discussing methods for doing that now. But this seems like a
awkward way to accomplish what the kernel could do very precisely. I don't see
why non-dedicated servers would be make priorities any less useful, in fact I
think that's exactly where they would shine.

> > So if vacuum slept a bit, say every 64k of data vacuumed. It could end up
> > sleeping when the disks are actually idle. Or it could be not sleeping
> > enough and still be interfering with transactions.
> 
> The vacuum io is submitted (via AIO or simulation of it) normally in a
> unit U and waiting ALWAYS for U to complete, before submitting a new one.
> Between submitting units, the vacuums checks for outstanding transactions 
> and stops, when we have one.
> 
> Now a transaction is submitted and the submitting from vacuum is stopped
> by it existing. The transaction waits for completion (e.g.  aio_suspend()) 
> and signals vacuum to continue.

User-space has no idea if disk i/o is occurring. The data the transaction
needs could be cached, or it could be on a different disk.

Besides, I think this is far too coarse-grained than what's needed.
Transactions sometimes run for seconds, minutes, or hours,, some of that time
is spent doing disk i/o and some of it doing cpu calculations. It can't stop
and signal another process every time it finishes reading a block and needs to
do a bit of calculation. Then context switch again a millisecond later so it
can read the next block...

And besides, this is would only useful on dedicated servers.

-- 
greg


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: statfs() / statvfs() syscall ballsup...
  2003-10-16 14:02                           ` Greg Stark
@ 2003-10-21 11:47                             ` Ingo Oeser
  0 siblings, 0 replies; 64+ messages in thread
From: Ingo Oeser @ 2003-10-21 11:47 UTC (permalink / raw)
  To: Greg Stark
  Cc: Greg Stark, Helge Hafting, Joel Becker, Jamie Lokier,
	Trond Myklebust, Ulrich Drepper, Linux Kernel

Hi Greg,

On Thursday 16 October 2003 16:02, Greg Stark wrote:
> Ingo Oeser <ioe-lkml@rameria.de> writes:
> > Hi there,
> >
> > first: I think the problem is solvable with mixing blocking and
> > non-blocking IO or simply AIO, which will be supported nicely by 2.6.0,
> > is a POSIX standard and is meant for doing your own IO scheduling.
>
> I think aio could be very useful for databases, but not in this area. 
[AIO for write barriers]
> But I don't see how it's useful for the problem I'm describing.

It can, because this way, you generate sth. like a "user space request
queue" and can control it's activity and saturation as fine grained as
the syncing. You simply notice, if an event is in flight or not and can
estimate current bandwidth that way.

> Indeed we're discussing methods for doing that now. But this seems like a
> awkward way to accomplish what the kernel could do very precisely. I don't
> see why non-dedicated servers would be make priorities any less useful, in
> fact I think that's exactly where they would shine.

The kernel problem is, that an IO operation is not associated with any
process, just with a physical page and a backing store. This is esp.
true for reads. So userspace doesn't know in many cases, whether the
kernel needs to do an IO at all to satisfy this request. Direct-IO helps
this by having you to do the IO ALWAYS, but isn't that nice for the
kernel.

So if you say "This fd has an IO priority of 1 and that fd has one of 2"
for the same file, then what should the kernel do?

Or another secenario: You have chunk A and chunk B both of 128k. Now
vacuum wants to read chunk B as low priority and transaction wants
to read second page from chunk A and chunk B high priority (readv()).

Readahead of second page from chunk A brings in first page of chunk B
which vacuum has been waiting for and is woken and vacuums until chunk C
is needed, which causes IO again.

Now the transaction continues and can read immediately from page cache
the page vacuum left.

This will be even more fun, if vacuum is working so fast per timeslice,
that it will push the cached pages out of memory ;-)

See how controlling submission from vacuum might be better, then actions
done by the kernel?

If you just prioritize work, then the low priority work accumulates and
takes up kernel memory. So better stop submission.

> > > So if vacuum slept a bit, say every 64k of data vacuumed. It could end
> > > up sleeping when the disks are actually idle. Or it could be not
> > > sleeping enough and still be interfering with transactions.
> >
> > The vacuum io is submitted (via AIO or simulation of it) normally in a
> > unit U and waiting ALWAYS for U to complete, before submitting a new one.
> > Between submitting units, the vacuums checks for outstanding transactions
> > and stops, when we have one.
> >
> > Now a transaction is submitted and the submitting from vacuum is stopped
> > by it existing. The transaction waits for completion (e.g. 
> > aio_suspend()) and signals vacuum to continue.
>
> User-space has no idea if disk i/o is occurring. The data the transaction
> needs could be cached, or it could be on a different disk.

So how should it prioritze then, if it doesn't know which will preempt
which?

> Besides, I think this is far too coarse-grained than what's needed.
> Transactions sometimes run for seconds, minutes, or hours,, some of that
> time is spent doing disk i/o and some of it doing cpu calculations. It
> can't stop and signal another process every time it finishes reading a
> block and needs to do a bit of calculation. Then context switch again a
> millisecond later so it can read the next block...

I don't want it to signal vacuum, I just want vacuum to check for
existance of more important things to do. Like a "disk idle process".
This can be as simple as having vacuum at extremly low process priority
and reading some atomically set variable, whether it can submit more now
or not.

I think you need to do sth. like the kernel does for page writing
for your user space task. (stepping by watermarks from none, async to sync)

PS: Sorry for the late answer, but needed to rethink a bit more.

If you could point me to the source files actually triggering and doing
vacuum, I might get more enlightment ;-)

Regards

Ingo Oeser



^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2003-10-21 11:50 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-10-09 22:16 statfs() / statvfs() syscall ballsup Trond Myklebust
2003-10-09 22:26 ` Linus Torvalds
2003-10-09 23:19   ` Ulrich Drepper
2003-10-10  0:22     ` viro
2003-10-10  4:49       ` Jamie Lokier
2003-10-10  5:26         ` Trond Myklebust
2003-10-10 12:37           ` Jamie Lokier
2003-10-10 13:46             ` Trond Myklebust
2003-10-10 14:35               ` Jamie Lokier
2003-10-10 15:32                 ` Misc NFSv4 (was Re: statfs() / statvfs() syscall ballsup...) Trond Myklebust
2003-10-10 15:53                   ` Jamie Lokier
2003-10-10 16:07                     ` Trond Myklebust
2003-10-10 15:55                   ` Michael Shuey
2003-10-10 16:20                     ` Trond Myklebust
2003-10-10 16:45                     ` J. Bruce Fields
2003-10-10 14:39               ` statfs() / statvfs() syscall ballsup Jamie Lokier
2003-10-09 23:31   ` Trond Myklebust
2003-10-10 12:27   ` Joel Becker
2003-10-10 14:59     ` Linus Torvalds
2003-10-10 15:27       ` Joel Becker
2003-10-10 16:00         ` Linus Torvalds
2003-10-10 16:26           ` Joel Becker
2003-10-10 16:50             ` Linus Torvalds
2003-10-10 17:33               ` Joel Becker
2003-10-10 17:51                 ` Linus Torvalds
2003-10-10 18:13                   ` Joel Becker
2003-10-10 16:27           ` Valdis.Kletnieks
2003-10-10 16:33           ` Chris Friesen
2003-10-10 17:04             ` Linus Torvalds
2003-10-10 17:07               ` Linus Torvalds
2003-10-10 17:21                 ` Joel Becker
2003-10-10 16:01         ` Jamie Lokier
2003-10-10 16:33           ` Joel Becker
2003-10-10 16:58             ` Chris Friesen
2003-10-10 17:05               ` Trond Myklebust
2003-10-10 17:20               ` Joel Becker
2003-10-10 17:33                 ` Chris Friesen
2003-10-10 17:40                 ` Linus Torvalds
2003-10-10 17:54                   ` Trond Myklebust
2003-10-10 18:05                     ` Linus Torvalds
2003-10-10 20:40                       ` Trond Myklebust
2003-10-10 21:09                         ` Linus Torvalds
2003-10-10 22:17                           ` Trond Myklebust
2003-10-11  2:53                     ` Andrew Morton
2003-10-11  3:47                       ` Trond Myklebust
2003-10-10 18:05                   ` Joel Becker
2003-10-10 18:31                     ` Andrea Arcangeli
2003-10-10 20:33                     ` Helge Hafting
2003-10-10 20:07             ` Jamie Lokier
2003-10-12 15:31             ` Greg Stark
2003-10-12 16:13               ` Linus Torvalds
2003-10-12 22:09                 ` Greg Stark
2003-10-13  8:45                   ` Helge Hafting
2003-10-15 13:25                     ` Ingo Oeser
2003-10-15 15:03                       ` Greg Stark
2003-10-15 18:37                         ` Helge Hafting
2003-10-16 10:29                         ` Ingo Oeser
2003-10-16 14:02                           ` Greg Stark
2003-10-21 11:47                             ` Ingo Oeser
2003-10-10 18:20           ` Andrea Arcangeli
2003-10-10 18:36             ` Linus Torvalds
2003-10-10 19:03               ` Andrea Arcangeli
2003-10-09 23:16 ` Andreas Dilger
2003-10-09 23:24   ` Linus Torvalds

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.