* statfs() / statvfs() syscall ballsup... @ 2003-10-09 22:16 Trond Myklebust 2003-10-09 22:26 ` Linus Torvalds 2003-10-09 23:16 ` Andreas Dilger 0 siblings, 2 replies; 64+ messages in thread From: Trond Myklebust @ 2003-10-09 22:16 UTC (permalink / raw) To: Ulrich Drepper, Linus Torvalds; +Cc: Linux Kernel Hi, We appear to have a problem with the new statfs interface in 2.6.0... The problem is that as far as userland is concerned, 'struct statfs' reports f_blocks, f_bfree,... in units of the "optimal transfer size": f_bsize (backwards compatibility). OTOH 'struct statvfs' reports the same values in units of the fragment size (the blocksize of the underlying filesystem): f_frsize. (says Single Unix Spec v2) Both are apparently supposed to syscall down via sys_statfs()... Question: how are we supposed to reconcile the two cases for something like NFS, where these two values are supposed to differ? Note that f_bsize is usually larger than f_frsize, hence conversions between the two units are subject to rounding errors... Cheers, Trond ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-09 22:16 statfs() / statvfs() syscall ballsup Trond Myklebust @ 2003-10-09 22:26 ` Linus Torvalds 2003-10-09 23:19 ` Ulrich Drepper ` (2 more replies) 2003-10-09 23:16 ` Andreas Dilger 1 sibling, 3 replies; 64+ messages in thread From: Linus Torvalds @ 2003-10-09 22:26 UTC (permalink / raw) To: Trond Myklebust; +Cc: Ulrich Drepper, Linux Kernel On Thu, 9 Oct 2003, Trond Myklebust wrote: > > Question: how we're supposed to reconcile the two cases for something > like NFS, where these 2 values are supposed to differ? I'd suggest going for "optimal block size everywhere". > Note that f_bsize is usually larger than f_frsize, hence conversions > from the former to the latter are subject to rounding errors... User space shouldn't know or care about frsize, and it doesn't even necessarily make any sense on a lot of filesystems, so make it easy for the user. It's not as if the rounding errors really matter. Linus ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-09 22:26 ` Linus Torvalds @ 2003-10-09 23:19 ` Ulrich Drepper 2003-10-10 0:22 ` viro 2003-10-09 23:31 ` Trond Myklebust 2003-10-10 12:27 ` Joel Becker 2 siblings, 1 reply; 64+ messages in thread From: Ulrich Drepper @ 2003-10-09 23:19 UTC (permalink / raw) To: Linus Torvalds; +Cc: Trond Myklebust, Linux Kernel Linus Torvalds wrote: > User space shouldn't know or care about frsize, and it doesn't even > necessarily make any sense on a lot of filesystems, so make it easy for > the user. It's not as if the rounding errors really matter. There have been numerous requests to add a statvfs syscall, at least made to me. The problem is that the emulation through statfs cannot be optimal. The emulation has to get all kinds of additional information (like mount flags), which in some cases leads to hangs or delays. From what I see statvfs is much more frequently used than statfs, so such an extension would be justified. And then the kernel would be able to determine all the right values and provide the user with them as it pleases. -- Ulrich Drepper, Red Hat (drepper at redhat.com), 444 Castro Street, Mountain View, CA 94041 USA ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-09 23:19 ` Ulrich Drepper @ 2003-10-10 0:22 ` viro 2003-10-10 4:49 ` Jamie Lokier 0 siblings, 1 reply; 64+ messages in thread From: viro @ 2003-10-10 0:22 UTC (permalink / raw) To: Ulrich Drepper; +Cc: Linus Torvalds, Trond Myklebust, Linux Kernel On Thu, Oct 09, 2003 at 04:19:29PM -0700, Ulrich Drepper wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Linus Torvalds wrote: > > > User space shouldn't know or care about frsize, and it doesn't even > > necessarily make any sense on a lot of filesystems, so make it easy for > > the user. It's not as if the rounding errors really matter. > > There have been numerous requests to add a statvfs syscall, at least > made to me. The problem is that the emulation through statfs cannot be > optimal. The emulation has to get all kinds of additional information > (like mount flags) which in some cases lead to hangs or delays. Umm... I don't see anything equivalent to statfs(2) ->f_type in statvfs(2). ->f_frsize makes no sense for practically all filesystems we support. ->f_namemax is not well-defined ("maximum filename length" as in "you won't see filenames longer than..." or "attempt to create a file with name longer than... will fail" or "longer than that and I'm truncating"; and that is aside of lovely questions about the meaning of "length" - strlen()? number of multibyte characters accepted by that fs? something else?) ->f_fsid is also practically undefined (and left 0 by practically every fs, so no userland code can do anything useful with it). ->f_flag might be useful, all right. However, I'd like to see real-world examples of code (Solaris, whatever) that would use it in any meaningful way... Conclusion: if we care about something like statvfs(), it should *not* have the statvfs() interface. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 0:22 ` viro @ 2003-10-10 4:49 ` Jamie Lokier 2003-10-10 5:26 ` Trond Myklebust 0 siblings, 1 reply; 64+ messages in thread From: Jamie Lokier @ 2003-10-10 4:49 UTC (permalink / raw) To: viro; +Cc: Ulrich Drepper, Linus Torvalds, Trond Myklebust, Linux Kernel viro@parcelfarce.linux.theplanet.co.uk wrote: > Umm... I don't see anything equivalent to statfs(2) ->f_type in statvfs(2). > ->f_frsize makes no sense for practically all filesystems we support. > ->f_namemax is not well-defined ("maximum filename length" as in "you won't > see filenames longer than..." or "attempt to create a file with name longer > than... will fail" or "longer than that and I'm truncating"; and that is > aside of lovely questions about the meaning of "length" - strlen()? number > of multibyte characters accepted by that fs? something else?) > ->f_fsid is also practically undefined (and left 0 by practically every fs, > so no userland code can do anything useful with it). > ->f_flag might be useful, all right. However, I'd like to see real-world > examples of code (Solaris, whatever) that would use it in any meaningful > way... On this theme, I'd like to know: - are dnotify / lease / lock reliable indicators on this filesystem? (i.e. dnotify is reliable on all local filesystems, but not over any of the remote ones AFAIK). - is stat() reliable (local filesystems and many remote) or potentially out of date without open/close (NFS due to attribute cacheing) -- Jamie ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 4:49 ` Jamie Lokier @ 2003-10-10 5:26 ` Trond Myklebust 2003-10-10 12:37 ` Jamie Lokier 0 siblings, 1 reply; 64+ messages in thread From: Trond Myklebust @ 2003-10-10 5:26 UTC (permalink / raw) To: Jamie Lokier; +Cc: Linux Kernel >>>>> " " == Jamie Lokier <jamie@shareable.org> writes: > - are dnotify / lease / lock reliable indicators on this filesystem? > (i.e. dnotify is reliable on all local filesystems, but > not over any of the remote ones AFAIK). Belongs in fcntl()... Just return ENOLCK if someone tries to set a lease or a directory notification on an NFS file... > - is stat() reliable (local filesystems and many remote) or > potentially out of date without open/close (NFS due to > attribute cacheing) There are many possible cache consistency models out there. Consider for instance AFS connected/disconnected modes, NFSv4 delegations or CIFS shares. How are you going to distinguish between them all and how do you propose that applications make use of this information? Cheers, Trond ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 5:26 ` Trond Myklebust @ 2003-10-10 12:37 ` Jamie Lokier 2003-10-10 13:46 ` Trond Myklebust 0 siblings, 1 reply; 64+ messages in thread From: Jamie Lokier @ 2003-10-10 12:37 UTC (permalink / raw) To: Trond Myklebust; +Cc: Linux Kernel Trond Myklebust wrote: > > - are dnotify / lease / lock reliable indicators on this filesystem? > > (i.e. dnotify is reliable on all local filesystems, but > > not over any of the remote ones AFAIK). > > Belongs in fcntl()... Just return ENOLCK if someone tries to set a > lease or a directory notification on an NFS file... Yes, that would make sense. It should be a filesystem hook, so that even remote filesystems like SMB can implement it, although it must be understood that remote notification has different ordering properties than local. > > - is stat() reliable (local filesystems and many remote) or > > potentially out of date without open/close (NFS due to > > attribute cacheing) > > There are many possible cache consistency models out there. Consider > for instance AFS connected/disconnected modes, NFSv4 delegations or > CIFS shares. How are you going to distinguish between them all and > how do you propose that applications make use of this information? The difference is that NFSv3 can return _stale_ data, while local _cannot_. I call stat(), and the information is up to date. I don't care about the cache semantics at all; what I care about is whether a returned stat() result may be stale. Why? This is the difference between "make" generating correct data, and "make" generating incorrect data.[1] The caching model isn't the issue. That's the filesystem's problem. I just want a way to get up to date data in my application. My motivation isn't actually "make" although that's important; generally, I need to know how to verify my in-application cache of a file. (Think fontconfig, ccache etc). I use dnotify for similar purposes, when it's local. 
(dnotify is much faster than many stats for a complex cache dependency). Currently, I use statfs() and read /proc/mounts to determine whether the filesystem is a known type or mounted on a block device, to decide whether stat() and/or dnotify are reliable. This is not ideal. In particular, I don't know of any way to _guarantee_ that I have the latest file contents from remote filesystems short of F_SETLK, which is way too heavy.[2] -- Jamie [1] I have built programs, including kernels, which crashed due to timestamps not appearing on a different computer after changing code, so make didn't recompile everything. [2] I have lost code I was editing due to saving it and then a different computer updating the file by reading a stale version, modifying it and writing it. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 12:37 ` Jamie Lokier @ 2003-10-10 13:46 ` Trond Myklebust 2003-10-10 14:35 ` Jamie Lokier 2003-10-10 14:39 ` statfs() / statvfs() syscall ballsup Jamie Lokier 0 siblings, 2 replies; 64+ messages in thread From: Trond Myklebust @ 2003-10-10 13:46 UTC (permalink / raw) To: Jamie Lokier; +Cc: Linux Kernel >>>>> " " == Jamie Lokier <jamie@shareable.org> writes: > Trond Myklebust wrote: >> Belongs in fcntl()... Just return ENOLCK if someone tries to >> set a lease or a directory notification on an NFS file... > It should be a filesystem hook, so that even remote filesystems > like SMB can implement it, although it must be understood that > remote notification has different ordering properties than > local. Sure. We might even try actually implementing leases on NFSv4 for delegated files. > I don't care about the cache semantics at all; what I care > about is whether a returned stat() result may be stale. Note that this too may be a per-file property. Under NFSv4 I can guarantee you that stat() results are correct in the case where I have a delegation. Otherwise, you are indeed subject to inherent races. "noac" cannot entirely resolve such races, but it sounds as if it could in the particular cases you describe. > This is not ideal. In particular, I don't know of any way to > _guarantee_ that I have the latest file contents from remote > filesystems short of F_SETLK, which way too heavy.[2] Err... open() should normally suffice to do that... Unless you are simultaneously writing to the file on a remote system, in which case you really need mandatory locking rather than NFSv2/v3's weaker advisory model. Or possibly something like CIFS/SMB's open "share" model (which can also be implemented in NFSv4). ...so I would argue that the caching models both can and do make a difference to your example cases (contrary to what you assert). Cheers, Trond ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 13:46 ` Trond Myklebust @ 2003-10-10 14:35 ` Jamie Lokier 2003-10-10 15:32 ` Misc NFSv4 (was Re: statfs() / statvfs() syscall ballsup...) Trond Myklebust 2003-10-10 14:39 ` statfs() / statvfs() syscall ballsup Jamie Lokier 1 sibling, 1 reply; 64+ messages in thread From: Jamie Lokier @ 2003-10-10 14:35 UTC (permalink / raw) To: Trond Myklebust; +Cc: Linux Kernel Trond Myklebust wrote: > Sure. We might even try actually implementing leases on NFSv4 for > delegated files. That would be nice. (Aside: Can NFSv4 do anything like dnotify, or am I restricted to, in effect, keeping many files open to detect changes in any of them?) Generally NFSv4 sounds like the way to go. Should I be recommending it to all my friends yet? Is the implementation ready for that? > > I don't care about the cache semantics at all; what I care > > about is whether a returned stat() result may be stale. > > Note that this too may be a per-file property. Under NFSv4 I can > guarantee you that stat() results are correct in the case where I have > a delegation. Otherwise, you are indeed subject to inherent races. > "noac" cannot entirely resolve such races, but it sounds as if it > could in the particular cases you describe. You're right, in the cases I describe "noac" is fine. I don't like having to ship an FAQ with a program which explains that the program is theoretically fine, users should simply mount their home directory with "noac", and tough if that's not within their administrative power. I'd rather make the program work correctly with the default mount options, and maybe have an entry in the FAQ saying that "noac" may improve performance but is not required for correct behaviour. Unfortunately that means ugly knowledge of filesystem specifics and /proc/mounts parsing - or significantly lower performance on local filesystems, which largely negates the purpose of the program. 
(It is very much about caching things derived from file contents). > > This is not ideal. In particular, I don't know of any way to > > _guarantee_ that I have the latest file contents from remote > > filesystems short of F_SETLK, which way too heavy.[2] > > Err... open() should normally suffice to do that... Server = RH linux-2.4.20-18.9. Client = 2.6.0-test6. I have done this in the last few days: [on client] editing file in emacs, save-buffer [on server] diff -ur mumble commands >> file (and wait until command prompt returns) [on client] in emacs, find-alternate-file which discards the current buffer and opens & reads the file from fs. [on client] edit some more, save file, post to l-k etc. [meta] notice that the diff wasn't appended to the file Emacs didn't see the appended data. (The reason I did the diff command on the server is that it's a lot faster - a tree's worth of stat calls is slow over PCMCIA ethernet). > ...so I would argue that the caching models both can and do make a > difference to your example cases (contrary to what you assert). Of course they make a difference when there is no call to say "just do X and hide the implementation details from me". What I'd like is an abstraction so I don't observe a difference, or at least a systematic way of working around them at application level. In the same way I expect CPUs to abstract away the (sometimes very) complex memory caching models, and present something simple to the program code. -- Jamie ^ permalink raw reply [flat|nested] 64+ messages in thread
* Misc NFSv4 (was Re: statfs() / statvfs() syscall ballsup...) 2003-10-10 14:35 ` Jamie Lokier @ 2003-10-10 15:32 ` Trond Myklebust 2003-10-10 15:53 ` Jamie Lokier 2003-10-10 15:55 ` Michael Shuey 0 siblings, 2 replies; 64+ messages in thread From: Trond Myklebust @ 2003-10-10 15:32 UTC (permalink / raw) To: Jamie Lokier; +Cc: Linux Kernel >>>>> " " == Jamie Lokier <jamie@shareable.org> writes: > Trond Myklebust wrote: >> Sure. We might even try actually implementing leases on NFSv4 >> for delegated files. > That would be nice. (Aside: Can NFSv4 do anything like > dnotify, or am I restricted to, in effect, keeping many files > open to detect changes in any of them?) Delegations for directories are in the pipeline for the next minor revision of the protocol (NFSv4.1). Delegations are such a new feature to NFS that it was decided to restrict them to files only to give us time to learn how best to use them. I can't tell as of yet whether or not the model chosen will include all the features of dnotify (for instance recall in case the attributes change on a subfile is a subject of hot debate), but certainly some of us are pushing for something like this. > Generally NFSv4 sounds like the way to go. Should I be > recommending it to all my friends yet, is the implementation > ready for that? The client implementation in 2.6.0 is still lacking several important features, including locking, ACLs, delegation support and recovery of state (in case of server reboot or network partitions). I'm hoping Andrew/Linus will allow me to send updates once the early 2.6.x codefreeze period is over. That said, I definitely encourage people to test out the existing code for stability, and I will be offering an 'NFS_ALL' series with those features that are missing from the main tree as and when I judge they are approaching release quality. Cheers, Trond ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Misc NFSv4 (was Re: statfs() / statvfs() syscall ballsup...) 2003-10-10 15:32 ` Misc NFSv4 (was Re: statfs() / statvfs() syscall ballsup...) Trond Myklebust @ 2003-10-10 15:53 ` Jamie Lokier 2003-10-10 16:07 ` Trond Myklebust 2003-10-10 15:55 ` Michael Shuey 1 sibling, 1 reply; 64+ messages in thread From: Jamie Lokier @ 2003-10-10 15:53 UTC (permalink / raw) To: Trond Myklebust; +Cc: Linux Kernel Trond Myklebust wrote: > I can't tell as of yet whether or not the model chosen will include > all the features of dnotify (for instance recall in case the > attributes change on a subfile is a subject of hot debate), but > certainly some of us are pushing for something like this. Different types of delegation, depending on what the client asked for, could be offered: Cacheing readdir() and stat() on the directory requires delegation without subfile recall; if there's a dnotify on the client, it requires delegation with recall. An uber-cool capability would be notification of sub-files to any depth. You can't imagine how tedious it has been watching a makefile take 5 minutes _just_ to run the "find" command on a source tree to find newer files than the last successful make. (It was a big tree). That was the optimised makefile. Without the "find" command, make's own dependency logic took 20 minutes to do the same thing. With any depth notifications, that would be eliminated to roughly zero time, and just running the few compile commands that are needed. -- Jamie ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Misc NFSv4 (was Re: statfs() / statvfs() syscall ballsup...) 2003-10-10 15:53 ` Jamie Lokier @ 2003-10-10 16:07 ` Trond Myklebust 0 siblings, 0 replies; 64+ messages in thread From: Trond Myklebust @ 2003-10-10 16:07 UTC (permalink / raw) To: Jamie Lokier; +Cc: Linux Kernel >>>>> " " == Jamie Lokier <jamie@shareable.org> writes: > An uber-cool capability would be notification of sub-files to > any depth. You can't imagine how tedious it has been watching > a makefile take 5 minutes _just_ to run the "find" command on a > source tree to find newer files than the last successful make. > (It was a big tree). That was the optimised makefile. Without > the "find" command, make's own dependency logic took 20 minutes > to do the same thing. > With any depth notifications, that would be eliminated to > roughly zero time, and just running the few compile commands > that are needed. In the very long term (post NFSv4.1), we're investigating something even more cool: 'WRITE' delegation of directories could allow you to work in a quasi-disconnected mode on all entries plus sub-entries (files, subdirs,....). You could do your compilation entirely locally (backed either by memory or cachefs) and then just flush the final results out to the server. AFS has, of course, had similar capabilities for some time, but I'm not sure if they have the delegation recall feature. IIRC, their disconnected operation overwrites whatever changes have been made on the server when your client reconnects. Cheers, Trond ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Misc NFSv4 (was Re: statfs() / statvfs() syscall ballsup...) 2003-10-10 15:32 ` Misc NFSv4 (was Re: statfs() / statvfs() syscall ballsup...) Trond Myklebust 2003-10-10 15:53 ` Jamie Lokier @ 2003-10-10 15:55 ` Michael Shuey 2003-10-10 16:20 ` Trond Myklebust 2003-10-10 16:45 ` J. Bruce Fields 1 sibling, 2 replies; 64+ messages in thread From: Michael Shuey @ 2003-10-10 15:55 UTC (permalink / raw) To: trond.myklebust; +Cc: Linux Kernel On Friday 10 October 2003 10:32 am, Trond Myklebust wrote: > The client implementation in 2.6.0 is still lacking several important > features, including locking, ACLs, delegation support and recovery of > state (in case of server reboot or network partitions). I'm hoping > Andrew/Linus will allow me to send updates once the early 2.6.x > codefreeze period is over. How about other features? In particular, do the client/server do authentication (krb5? lipkey/spkm3?), integrity and privacy? Also, are any patches on Citi's site useful anymore? I see patches for 2.6.0-test1, but nothing more recent. Have they been folded into the main tree? > That said, I definitely encourage people to test out the existing code > for stability, and I will be offering an 'NFS_ALL' series with those > features that are missing from the main tree as and when I judge they > are approaching release quality. Neato! Those of us with hordes of machines using Linux's NFS appreciate the extra effort. -- Mike Shuey ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Misc NFSv4 (was Re: statfs() / statvfs() syscall ballsup...) 2003-10-10 15:55 ` Michael Shuey @ 2003-10-10 16:20 ` Trond Myklebust 2003-10-10 16:45 ` J. Bruce Fields 1 sibling, 0 replies; 64+ messages in thread From: Trond Myklebust @ 2003-10-10 16:20 UTC (permalink / raw) To: shuey; +Cc: Linux Kernel >>>>> " " == Michael Shuey <shuey@fmepnet.org> writes: > How about other features? In particular, do the client/server > do authentication (krb5? lipkey/spkm3?), integrity and privacy? Client side krb5 authentication was added in November last year. Privacy and integrity are queued but fell afoul of the code-freeze. I'll bun(d|g)le them into an NFS_ALL after we've tested them out in the v4 Bakeathon in Austin (so in about a fortnight). I believe the server support is ready too but hasn't yet been merged in due to bugs in the upcall mechanism. > Also, are any patches on Citi's site useful anymore? I see > patches for 2.6.0-test1, but nothing more recent. Have they > been folded into the main tree? I'm cherrypicking the relevant bugfixes from CITI and folding those into the tree. Much of the rest will be part of the forthcoming NFS_ALL. Cheers, Trond ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Misc NFSv4 (was Re: statfs() / statvfs() syscall ballsup...) 2003-10-10 15:55 ` Michael Shuey 2003-10-10 16:20 ` Trond Myklebust @ 2003-10-10 16:45 ` J. Bruce Fields 1 sibling, 0 replies; 64+ messages in thread From: J. Bruce Fields @ 2003-10-10 16:45 UTC (permalink / raw) To: Michael Shuey; +Cc: trond.myklebust, Linux Kernel On Fri, Oct 10, 2003 at 10:55:10AM -0500, Michael Shuey wrote: > On Friday 10 October 2003 10:32 am, Trond Myklebust wrote: > > The client implementation in 2.6.0 is still lacking several important > > features, including locking, ACLs, delegation support and recovery of > > state (in case of server reboot or network partitions). I'm hoping > > Andrew/Linus will allow me to send updates once the early 2.6.x > > codefreeze period is over. > > How about other features? In particular, do the client/server do > authentication (krb5? lipkey/spkm3?), integrity and privacy? The client has krb5 authentication support, the server doesn't. Patches are available from the citi web page for server-side authentication and client-side integrity. > Also, are any patches on Citi's site useful anymore? The test1 patches probably apply (possibly with some manual intervention) up to about test6. At least one of them (the first gss patch) is a fairly critical bugfix. I'm just updating to test7 myself right now; I'll try to post new patches soon, but in the worst case it might not be till after we get back from testing at Connectathon (in two weeks). --Bruce Fields ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 13:46 ` Trond Myklebust 2003-10-10 14:35 ` Jamie Lokier @ 2003-10-10 14:39 ` Jamie Lokier 1 sibling, 0 replies; 64+ messages in thread From: Jamie Lokier @ 2003-10-10 14:39 UTC (permalink / raw) To: Trond Myklebust; +Cc: Linux Kernel Trond Myklebust wrote: > > I don't care about the cache semantics at all; what I care > > about is whether a returned stat() result may be stale. > > Note that this too may be a per-file property. Yes. A flag from stat() or similar to say it's stale would make sense. Alternatively, a flag _into_ something like stat() to ask for an up to date value, if that is possible. I've often wondered if stat() couldn't be a bit more extensible with some flags or extended attributes. -- Jamie ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-09 22:26 ` Linus Torvalds 2003-10-09 23:19 ` Ulrich Drepper @ 2003-10-09 23:31 ` Trond Myklebust 2003-10-10 12:27 ` Joel Becker 2 siblings, 0 replies; 64+ messages in thread From: Trond Myklebust @ 2003-10-09 23:31 UTC (permalink / raw) To: Linus Torvalds; +Cc: Ulrich Drepper, Linux Kernel >>>>> " " == Linus Torvalds <torvalds@osdl.org> writes: >> Note that f_bsize is usually larger than f_frsize, hence >> conversions from the former to the latter are subject to >> rounding errors... > User space shouldn't know or care about frsize, and it doesn't > even necessarily make any sense on a lot of filesystems, so > make it easy for the user. It's not as if the rounding errors > really matter. It can lead to funny quirks when doing df: Used + Available != Total Granted, the effects won't be enormous (typically you'll see between 1 and 63 blocks off in the case of NFS w/ 32k wsize and 512-byte frsize) but people get upset about this. That was the reason for adding an f_frsize field in the first place... Note: one solution might be to swap the positions of f_frsize and f_bsize in the kernel struct that is passed up to userland. I.e. pass up

	struct statfs {
		__u32 f_type;
-		__u32 f_bsize;
+		__u32 f_frsize;
		__u32 f_blocks;
		__u32 f_bfree;
		__u32 f_bavail;
		__u32 f_files;
		__u32 f_ffree;
		__kernel_fsid_t f_fsid;
		__u32 f_namelen;
-		__u32 f_frsize;
+		__u32 f_bsize;
		__u32 f_spare[5];
	};

That will give correct values for the f_bfree, f_bavail,... in the legacy statfs() case for all existing filesystems. glibc's statvfs() can then do the correct thing if it detects a >=2.6.0 kernel. It needs to do a copy to its private statvfs struct anyway. Cheers, Trond ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-09 22:26 ` Linus Torvalds 2003-10-09 23:19 ` Ulrich Drepper 2003-10-09 23:31 ` Trond Myklebust @ 2003-10-10 12:27 ` Joel Becker 2003-10-10 14:59 ` Linus Torvalds 2 siblings, 1 reply; 64+ messages in thread From: Joel Becker @ 2003-10-10 12:27 UTC (permalink / raw) To: Linus Torvalds; +Cc: Trond Myklebust, Ulrich Drepper, Linux Kernel On Thu, Oct 09, 2003 at 03:26:47PM -0700, Linus Torvalds wrote: > User space shouldn't know or care about frsize, and it doesn't even > necessarily make any sense on a lot of filesystems, so make it easy for > the user. It's not as if the rounding errors really matter. User space has to know about frsize for O_DIRECT alignment. Sometimes you just want to write the 512 B you have in hand, not have to read-modify-write the n KB around it. frsize is much nicer than hunting up the appropriate block device to BLKSSZGET on. Joel -- "I have never let my schooling interfere with my education." - Mark Twain Joel Becker Senior Member of Technical Staff Oracle Corporation E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 12:27 ` Joel Becker @ 2003-10-10 14:59 ` Linus Torvalds 2003-10-10 15:27 ` Joel Becker 0 siblings, 1 reply; 64+ messages in thread From: Linus Torvalds @ 2003-10-10 14:59 UTC (permalink / raw) To: Joel Becker; +Cc: Trond Myklebust, Ulrich Drepper, Linux Kernel On Fri, 10 Oct 2003, Joel Becker wrote: > > User space has to know about frsize for O_DIRECT alignment. Have you ever noticed that O_DIRECT is a piece of crap? The interface is fundamentally flawed, it has nasty security issues, it lacks any kind of sane synchronization, and it exposes stuff that shouldn't be exposed to user space. I hope disk-based databases die off quickly. Yeah, I see where you are working, but where I'm coming from, I see all the _crap_ that Oracle tries to push down to the kernel, and most of the time I go "huh - that's a f**king bad design". Linus ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 14:59 ` Linus Torvalds @ 2003-10-10 15:27 ` Joel Becker 2003-10-10 16:00 ` Linus Torvalds 2003-10-10 16:01 ` Jamie Lokier 0 siblings, 2 replies; 64+ messages in thread From: Joel Becker @ 2003-10-10 15:27 UTC (permalink / raw) To: Linus Torvalds; +Cc: Trond Myklebust, Ulrich Drepper, Linux Kernel On Fri, Oct 10, 2003 at 07:59:34AM -0700, Linus Torvalds wrote: > The interface is fundamentally flawed, it has nasty security issues, it > lacks any kind of sane synchronization, and it exposes stuff that > shouldn't be exposed to user space. Um, sure, the interface as implemented has a few "don't do that"s. Yes, we've found security issues. Those can be fixed. That doesn't make the concept bad. > I hope disk-based databases die off quickly. As opposed to what? Not a challenge, just interested in what you think they should be. > Yeah, I see where you are > working, but where I'm coming from, I see all the _crap_ that Oracle tries > to push down to the kernel, and most of the time I go "huh - that's a > f**king bad design". I'm hoping that you've seen a marked improvement in the stuff Oracle requests over the past couple years. We've worked hard to filter out the junk that really, really is bad. Where I work doesn't change the need for O_DIRECT. If your Big App has its own cache, why copy the cache in the kernel? That just wastes RAM. If your app is sharing data, whether physical disk, logical disk, or via some network filesystem or storage device, you must absolutely guarantee that reads and writes hit the storage, not the kernel cache which has no idea whether another node wrote an update or needs a cache flush. Putting my employer's hat back on, Oracle uses O_DIRECT because it was the existing API for this. If Linux came up with a better, cleaner method, Oracle might change. I can't guarantee that, but I know I push like hell for obvious improvements. 
Joel -- "I don't want to achieve immortality through my work; I want to achieve immortality through not dying." - Woody Allen Joel Becker Senior Member of Technical Staff Oracle Corporation E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 15:27 ` Joel Becker @ 2003-10-10 16:00 ` Linus Torvalds 2003-10-10 16:26 ` Joel Becker ` (2 more replies) 2003-10-10 16:01 ` Jamie Lokier 1 sibling, 3 replies; 64+ messages in thread From: Linus Torvalds @ 2003-10-10 16:00 UTC (permalink / raw) To: Joel Becker; +Cc: Trond Myklebust, Ulrich Drepper, Linux Kernel On Fri, 10 Oct 2003, Joel Becker wrote: > > I hope disk-based databases die off quickly. > > As opposed to what? Not a challenge, just interested in what > you think they should be. I'm hoping in-memory databases will just kill off the current crop totally. That solves all the IO problems - the only thing that goes to disk is the log and the backups, and both go there totally linearly unless the designer was crazy. Yeah, I don't follow the db market, but it's just insane to try to keep the on-disk data in any other format if you've got enough memory. Recovery may take a long time (reading that whole backup into memory and redoing the log will be pretty expensive), but replication should handle that trivially. > Where I work doesn't change the need for O_DIRECT. If your Big > App has it's own cache, why copy the cache in the kernel? Why indeed? But why do you think you need O_DIRECT with very bad semantics to handle this? The kernel page cache does multiple things: - staging area for letting the filesystem do blocking (ie this is why a regular "write()" or "read()" doesn't need to care about alignment etc) - a synchronization entity - making sure that a write and a read cannot pass each other, and that mmap contents are always _coherent_. - a cache O_DIRECT throws the cache part away, but it throws out the baby with the bathwater, and breaks the other parts. Which is why O_DIRECT breaks things like disk scheduling in really subtle ways - think about writing and reading to the same area on the disk, and re-ordering at all different levels. And the thing is, uncaching is _trivial_. 
It's not like it is hard to say "try to get rid of these pages if they aren't mapped anywhere" and "insert this user page directly into the page cache". But people are so fixated with "direct to disk" that they don't even think about it. Linus ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 16:00 ` Linus Torvalds @ 2003-10-10 16:26 ` Joel Becker 2003-10-10 16:50 ` Linus Torvalds 2003-10-10 16:27 ` Valdis.Kletnieks 2003-10-10 16:33 ` Chris Friesen 2 siblings, 1 reply; 64+ messages in thread From: Joel Becker @ 2003-10-10 16:26 UTC (permalink / raw) To: Linus Torvalds; +Cc: Trond Myklebust, Ulrich Drepper, Linux Kernel On Fri, Oct 10, 2003 at 09:00:23AM -0700, Linus Torvalds wrote: > I'm hoping in-memory databases will just kill off the current crop > totally. That solves all the IO problems - the only thing that goes to > disk is the log and the backups, and both go there totally linearly unless > the designer was crazy. Memory is continuously too small and too expensive. Even if you can buy a machine with 10TB of RAM, the price is going to be prohibitive. And by the time 10TB of RAM is affordable, the database is going to be 100TB. I'm not saying that in-memory is bad. Big databases do everything they can to make the workload look almost like in-memory. It's the only way to go. > But why do you think you need O_DIRECT with very bad semantics to handle > this? I don't need O_DIRECT with bad semantics. I need the semantics I need, I know that other OSes have O_DIRECT to provide those capabilities, and everyone loves portability. That said... > O_DIRECT throws the cache part away, but it throws out the baby with the > bathwater, and breaks the other parts. Which is why O_DIRECT breaks things > like disk scheduling in really subtle ways - think about writing and > reading to the same area on the disk, and re-ordering at all different > levels. Sure, but you don't do that. The breakage in mixing O_DIRECT with pagecache I/O to the same areas of the disk isn't even all that subtle. But you shouldn't be doing that, at least not constantly. > And the thing is, uncaching is _trivial_. 
It's not like it is hard to say > "try to get rid of these pages if they aren't mapped anywhere" and "insert > this user page directly into the page cache". But people are so fixated > with "direct to disk" that they don't even think about it. I'm not fixated. "Use this user page for the page cache entry for this offset into the file", "Change this user page from representing this offset in this file to representing that offset in that file", and "whatever you do, always read/write from backing store for this page" are the semantics needed. For the latter, you'd have to have a way for the app to trigger a read or write out of the cache. You don't want to do it on every page modification or access, that's too often. The application knows the synchronization points, not the kernel. Joel -- "There is a country in Europe where multiple-choice tests are illegal." - Sigfried Hulzer Joel Becker Senior Member of Technical Staff Oracle Corporation E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 16:26 ` Joel Becker @ 2003-10-10 16:50 ` Linus Torvalds 2003-10-10 17:33 ` Joel Becker 0 siblings, 1 reply; 64+ messages in thread From: Linus Torvalds @ 2003-10-10 16:50 UTC (permalink / raw) To: Joel Becker; +Cc: Trond Myklebust, Ulrich Drepper, Linux Kernel On Fri, 10 Oct 2003, Joel Becker wrote: > > Memory is continuously too small and too expensive. Even if you > can buy a machine with 10TB of RAM, the price is going to be > prohibitive. And by the time 10TB of RAM is affordable, the database is going > to be 100TB. Hah. Look at the number of supercomputers and the number of desktops today. The fact is, the high end is getting smaller and smaller. If Oracle wants to go after that high-end-only market, then be my guest. But don't be surprised if others end up taking the remaining 99%. Have you guys learnt _nothing_ from the past? The reason MicroSoft and Linux are kicking all the other vendors' butts is that _small_ is beautiful. Especially when small is "powerful enough". Hint: why does Oracle care at all about the small business market? Why is MySQL even a blip on your radar? Because it's those things that really _drive_ stuff. The same way PC's have driven the tech market for the last 15 years. And believing that the load will keep up with "big iron hardware" is just not _true_. It's never been true. "Small iron" not only keeps up, but overtakes it - to the point where you have to start doing new things just to be able to take advantage of it. Believe in history. > > > O_DIRECT throws the cache part away, but it throws out the baby with the > > bathwater, and breaks the other parts. Which is why O_DIRECT breaks things > > like disk scheduling in really subtle ways - think about writing and > > reading to the same area on the disk, and re-ordering at all different > > levels. > > Sure, but you don't do that. 
The breakage in mixing O_DIRECT > with pagecache I/O to the same areas of the disk isn't even all that > subtle. But you shouldn't be doing that, at least constantly. Ok. Let's just hope all the crackers and virus writers believe you when you say "you shouldn't do that". BIG FRIGGING HINT: a _real_ OS doesn't allow data corruption even for cases where "you shouldn't do that". It shouldn't allow reading of data that you haven't written. And "you shouldn't do that" is _not_ an excuse for having bad interfaces that cause problems. We're not NT. Linus ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 16:50 ` Linus Torvalds @ 2003-10-10 17:33 ` Joel Becker 2003-10-10 17:51 ` Linus Torvalds 0 siblings, 1 reply; 64+ messages in thread From: Joel Becker @ 2003-10-10 17:33 UTC (permalink / raw) To: Linus Torvalds; +Cc: Trond Myklebust, Ulrich Drepper, Linux Kernel On Fri, Oct 10, 2003 at 09:50:02AM -0700, Linus Torvalds wrote: > The fact is, the high end is getting smaller and smaller. If Oracle wants > to go after that high-end-only market, then be my guest. No, the high-end for hardware is getting smaller. The need for high-end jobs is just fine. But as you point out, the high-end jobs are being done by low-end hardware. And here is Oracle, promoting a bank of cheap-ass 2-way boxen to do the job. > Have you guys learnt _nothing_ from the past? The reason MicroSoft and > Linux are kicking all the other vendors butts is that _small_ is > beautiful. Especially when small is "powerful enough". Again, we need this sort of stuff precisely because we'd rather use 2 $5k Linux/Intel servers than 1 $40k Sun server (and the Linux box outruns the Sun, quite comfortably). That's the "powerful enough", right there. > And believing that the load will keep up with "big iron hardware" is just > not _true_. It's never been true. "Small iron" not only keeps up, but > overtakes it - to the point where you have to start doing new things just > to be able to take advantage of it. Linus, I've said it twice above. This has been our entire direction for the past couple years, and we've been loud about it. Please, knock us for what we do wrong, but recognize what we are actually doing wrong, not what you think we are doing. > Ok. Let's just hope all the crackers and virus writers believe you when > you say "you shouldn't do that". Well, if a cracker and virus writer can get enough privilege to write(), cached or O_DIRECT, they can corrupt you without worrying about this specific gotcha. 
That doesn't mean you don't fix it, but it also doesn't mean you throw up your hands and claim you can't do it. > BIG FRIGGING HINT: a _real_ OS doesn't allow data corruption even for > cases where "you shouldn't do that". It shouldn't allow reading of data > that you haven't written. And "you shouldn't do that" is _not_ an excuse > for having bad interfaces that cause problems. I know that, I agree with it, and I said as much a few emails past. Linux should refuse to corrupt your data. But you've taken the tack "It is unsafe today, so we should abandon it altogether, never mind fixing it.", which doesn't logically follow. Joel -- "Behind every successful man there's a lot of unsuccessful years." - Bob Brown Joel Becker Senior Member of Technical Staff Oracle Corporation E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 17:33 ` Joel Becker @ 2003-10-10 17:51 ` Linus Torvalds 2003-10-10 18:13 ` Joel Becker 0 siblings, 1 reply; 64+ messages in thread From: Linus Torvalds @ 2003-10-10 17:51 UTC (permalink / raw) To: Joel Becker; +Cc: Trond Myklebust, Ulrich Drepper, Linux Kernel On Fri, 10 Oct 2003, Joel Becker wrote: > > I know that, I agree with it, and I said as much a few emails > past. Linux should refuse to corrupt your data. But you've taken the > tack "It is unsafe today, so we should abandon it altogether, never mind > fixing it.", which doesn't logically follow. No, we've fixed it, the problem is that it ends up being a lot of extra complexity that isn't obvious when just initially looking at it. For example, just the IO scheduler ended up having serious problems with overlapping IO requests. That's in addition to all the issues with out-of-sync ordering etc that could cause direct_io reads to bypass regular writes and read stuff off the disk that was a potential security issue. So right now we have extra code and extra complexity (which implies not only potential for more bugs, but there are performance worries etc that can impact even users that don't need it). And these are fundamental problems to DIRECT_IO. Which means that likely at some point we will _have_ to actually implement DIRECT_IO entirely through the page cache to make sure that it's safe. So my bet is that eventually we'll make DIRECT_IO just be an awkward way to do page cache manipulation. And maybe it works out ok. And we'll clearly have to keep it working. The issue is whether there are better interfaces. And I think there are bound to be. Linus ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 17:51 ` Linus Torvalds @ 2003-10-10 18:13 ` Joel Becker 0 siblings, 0 replies; 64+ messages in thread From: Joel Becker @ 2003-10-10 18:13 UTC (permalink / raw) To: Linus Torvalds; +Cc: Trond Myklebust, Ulrich Drepper, Linux Kernel On Fri, Oct 10, 2003 at 10:51:52AM -0700, Linus Torvalds wrote: > And maybe it works out ok. And we'll clearly have to keep it working. The > issue is whether there are better interfaces. And I think there are bound > to be. Agreed. Joel -- "Well-timed silence hath more eloquence than speech." - Martin Fraquhar Tupper Joel Becker Senior Member of Technical Staff Oracle Corporation E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 16:00 ` Linus Torvalds 2003-10-10 16:26 ` Joel Becker @ 2003-10-10 16:27 ` Valdis.Kletnieks 2003-10-10 16:33 ` Chris Friesen 2 siblings, 0 replies; 64+ messages in thread From: Valdis.Kletnieks @ 2003-10-10 16:27 UTC (permalink / raw) To: Linus Torvalds; +Cc: Linux Kernel On Fri, 10 Oct 2003 09:00:23 PDT, Linus Torvalds said: > I'm hoping in-memory databases will just kill off the current crop > totally. That solves all the IO problems - the only thing that goes to > disk is the log and the backups, and both go there totally linearly unless > the designer was crazy. I can process a 100GB database on a current 2U Dell rackmount server. I hesitate to think about what would be required to deal with a terabyte-sized database... ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 16:00 ` Linus Torvalds 2003-10-10 16:26 ` Joel Becker 2003-10-10 16:27 ` Valdis.Kletnieks @ 2003-10-10 16:33 ` Chris Friesen 2003-10-10 17:04 ` Linus Torvalds 2 siblings, 1 reply; 64+ messages in thread From: Chris Friesen @ 2003-10-10 16:33 UTC (permalink / raw) To: Linus Torvalds; +Cc: Joel Becker, Trond Myklebust, Ulrich Drepper, Linux Kernel Linus Torvalds wrote: > I'm hoping in-memory databases will just kill off the current crop > totally. That solves all the IO problems - the only thing that goes to > disk is the log and the backups, and both go there totally linearly unless > the designer was crazy. How does this play with massive (ie hundreds or thousands of gigabytes) databases? Surely you can't expect to put it all in memory? Chris -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfriesen@nortelnetworks.com ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 16:33 ` Chris Friesen @ 2003-10-10 17:04 ` Linus Torvalds 2003-10-10 17:07 ` Linus Torvalds 0 siblings, 1 reply; 64+ messages in thread From: Linus Torvalds @ 2003-10-10 17:04 UTC (permalink / raw) To: Chris Friesen; +Cc: Joel Becker, Trond Myklebust, Ulrich Drepper, Linux Kernel On Fri, 10 Oct 2003, Chris Friesen wrote: > > How does this play with massive (ie hundreds or thousands of gigabytes) > databases? Surely you can't expect to put it all in memory? Hey, I'm a big believer in mass market. Which means that I think odd-ball users will have to use odd-ball databases, and pay through the nose for them. That's fine. But those db's are going to be very rare. Your arguments are all the same stuff that made PC's "irrelevant" 15 years ago. I'm not saying in-memory is here tomorrow. I'm just saying that anybody who isn't looking at it for the mass market _will_ be steamrolled over when they arrive. If you were a company, which market would you prefer: the high-end 0.1% or the rest? Yes, you can charge a _lot_ more for the high-end side, but you will eternally live in the knowledge that your customers are slowly moving to the "low end" - simply because it gets more capable. And the thing is, the economics of the 99% means that that is the one that sees all the real improvements. That's the one that will have the nice admin tools, and the cottage industry that builds up around it. Linus ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 17:04 ` Linus Torvalds @ 2003-10-10 17:07 ` Linus Torvalds 2003-10-10 17:21 ` Joel Becker 0 siblings, 1 reply; 64+ messages in thread From: Linus Torvalds @ 2003-10-10 17:07 UTC (permalink / raw) To: Chris Friesen; +Cc: Joel Becker, Trond Myklebust, Ulrich Drepper, Linux Kernel On Fri, 10 Oct 2003, Linus Torvalds wrote: > > I'm not saying in-memory is here tomorrow. I'm just saying that anybody > who isn't looking at it for the mass market _will_ be steamrolled over > when they arrive. Btw, anybody that takes me too seriously is an idiot. I know what _I_ believe in, but part of the beauty of Linux is that what I believe doesn't really matter all that much. Linus ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 17:07 ` Linus Torvalds @ 2003-10-10 17:21 ` Joel Becker 0 siblings, 0 replies; 64+ messages in thread From: Joel Becker @ 2003-10-10 17:21 UTC (permalink / raw) To: Linus Torvalds Cc: Chris Friesen, Trond Myklebust, Ulrich Drepper, Linux Kernel On Fri, Oct 10, 2003 at 10:07:52AM -0700, Linus Torvalds wrote: > Btw, anybody that takes me too seriously is an idiot. I know what _I_ > believe in, but part of the beauty of Linux is that what I believe doesn't > really matter all that much. Sure, but you're not exactly an idiot either. If folks never thought about what you said, they'd be an idiot as well. Joel -- "In the long run...we'll all be dead." -Unknown Joel Becker Senior Member of Technical Staff Oracle Corporation E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 15:27 ` Joel Becker 2003-10-10 16:00 ` Linus Torvalds @ 2003-10-10 16:01 ` Jamie Lokier 2003-10-10 16:33 ` Joel Becker 2003-10-10 18:20 ` Andrea Arcangeli 1 sibling, 2 replies; 64+ messages in thread From: Jamie Lokier @ 2003-10-10 16:01 UTC (permalink / raw) To: Linus Torvalds, Trond Myklebust, Ulrich Drepper, Linux Kernel Joel Becker wrote: > Where I work doesn't change the need for O_DIRECT. If your Big > App has its own cache, why copy the cache in the kernel? That just > wastes RAM. Why don't you _share_ the App's cache with the kernel's? That's what mmap() and remap_file_pages() are for. > If your app is sharing data, whether physical disk, logical > disk, or via some network filesystem or storage device, you must > absolutely guarantee that reads and writes hit the storage, not the > kernel cache which has no idea whether another node wrote an update or > needs a cache flush. That's tough to guarantee at the platter level regardless of O_DIRECT, but otherwise: you have fdatasync() and msync(). > If Linux came up with a better, cleaner method, Oracle might change. Take a look at remap_file_pages() and write a note here to say if it fits the bill. I thought remap_file_pages() was added for Oracle, but perhaps it was for a more modern database ;) Thanks, -- Jamie ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 16:01 ` Jamie Lokier @ 2003-10-10 16:33 ` Joel Becker 2003-10-10 16:58 ` Chris Friesen ` (2 more replies) 2003-10-10 18:20 ` Andrea Arcangeli 1 sibling, 3 replies; 64+ messages in thread From: Joel Becker @ 2003-10-10 16:33 UTC (permalink / raw) To: Jamie Lokier Cc: Linus Torvalds, Trond Myklebust, Ulrich Drepper, Linux Kernel On Fri, Oct 10, 2003 at 05:01:44PM +0100, Jamie Lokier wrote: > Why don't you _share_ the App's cache with the kernel's? That's what > mmap() and remap_file_pages() are for. Because you can't force flush/read. You can't say "I need you to go to disk for this." If you do, you're doing O_DIRECT through mmap (yes, I've pondered it) and you end up with perhaps the same races folks worry about. Doesn't mean it can't be done. > That's tough to guarantee at the platter level regardless of O_DIRECT, > but otherwise: you have fdatasync() and msync(). Platter level doesn't matter. Storage access level matters. Node1 and Node2 have to see the same thing. As long as I am absolutely sure that when Node1's write() returns, any subsequent read() on Node2 will see the change (normal barrier stuff, really), it doesn't matter what happened on the Storage. The data could be in storage cache, on platter, passed back to some other entity. > Take a look at remap_file_pages() and write a note here to say if it > fits the bill. I thought remap_file_pages() was added for Oracle, but > perhaps it was for a more modern database ;) remap_file_pages() was indeed something Oracle wanted, but as a way to create 8GB shmfs files and map them into the x86 crappy address space. It still does not have the ability to force reads and writes to the storage, and it even has other issues. Joel -- Life's Little Instruction Book #511 "Call your mother." Joel Becker Senior Member of Technical Staff Oracle Corporation E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 16:33 ` Joel Becker @ 2003-10-10 16:58 ` Chris Friesen 2003-10-10 17:05 ` Trond Myklebust 2003-10-10 17:20 ` Joel Becker 2003-10-10 20:07 ` Jamie Lokier 2003-10-12 15:31 ` Greg Stark 2 siblings, 2 replies; 64+ messages in thread From: Chris Friesen @ 2003-10-10 16:58 UTC (permalink / raw) To: Joel Becker Cc: Jamie Lokier, Linus Torvalds, Trond Myklebust, Ulrich Drepper, Linux Kernel Joel Becker wrote: > On Fri, Oct 10, 2003 at 05:01:44PM +0100, Jamie Lokier wrote: > >>Why don't you _share_ the App's cache with the kernel's? That's what >>mmap() and remap_file_pages() are for. > Because you can't force flush/read. You can't say "I need you > to go to disk for this." According to my man pages, this is exactly what msync() is for, no? >>That's tough to guarantee at the platter level regardless of O_DIRECT, >>but otherwise: you have fdatasync() and msync(). > Platter level doesn't matter. Storage access level matters. > Node1 and Node2 have to see the same thing. As long as I am absolutely > sure that when Node1's write() returns, any subsequent read() on Node2 > will see the change (normal barrier stuff, really), it doesn't matter > what happened on the Storage. Isn't that exactly what msync() exists for? Chris -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfriesen@nortelnetworks.com ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 16:58 ` Chris Friesen @ 2003-10-10 17:05 ` Trond Myklebust 2003-10-10 17:20 ` Joel Becker 1 sibling, 0 replies; 64+ messages in thread From: Trond Myklebust @ 2003-10-10 17:05 UTC (permalink / raw) To: Chris Friesen; +Cc: Linux Kernel >>>>> " " == Chris Friesen <cfriesen@nortelnetworks.com> writes: >> Platter level doesn't matter. Storage access level matters. >> Node1 and Node2 have to see the same thing. As long as I am >> absolutely sure that when Node1's write() returns, any >> subsequent read() on Node2 will see the change (normal barrier >> stuff, really), it doesn't matter what happend on the Storage. > Isn't that exactly what msync() exists for? It can't be used to invalidate the page cache (at least not in the current implementation) so it won't help you in the above case where you have 2 nodes writing to the same device. Cheers, Trond ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 16:58 ` Chris Friesen 2003-10-10 17:05 ` Trond Myklebust @ 2003-10-10 17:20 ` Joel Becker 2003-10-10 17:33 ` Chris Friesen 2003-10-10 17:40 ` Linus Torvalds 1 sibling, 2 replies; 64+ messages in thread From: Joel Becker @ 2003-10-10 17:20 UTC (permalink / raw) To: Chris Friesen Cc: Jamie Lokier, Linus Torvalds, Trond Myklebust, Ulrich Drepper, Linux Kernel On Fri, Oct 10, 2003 at 12:58:05PM -0400, Chris Friesen wrote: > > Because you can't force flush/read. You can't say "I need you > >to go to disk for this." > > According to my man pages, this is exactly what msync() is for, no? msync() forces write(), like fsync(). It doesn't force read(). Joel -- "Get right to the heart of matters. It's the heart that matters more." Joel Becker Senior Member of Technical Staff Oracle Corporation E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 17:20 ` Joel Becker @ 2003-10-10 17:33 ` Chris Friesen 2003-10-10 17:40 ` Linus Torvalds 1 sibling, 0 replies; 64+ messages in thread From: Chris Friesen @ 2003-10-10 17:33 UTC (permalink / raw) To: Joel Becker Cc: Jamie Lokier, Linus Torvalds, Trond Myklebust, Ulrich Drepper, Linux Kernel Joel Becker wrote: > On Fri, Oct 10, 2003 at 12:58:05PM -0400, Chris Friesen wrote: > >>> Because you can't force flush/read. You can't say "I need you >>>to go to disk for this." >>> >>According to my man pages, this is exactly what msync() is for, no? >> > > msync() forces write(), like fsync(). It doesn't force read(). Oh, of course. So do the applications know when they need to invalidate the cache (allowing for the reader to do a reverse-msync kind of thing), or do they have to read from disk all the time? Chris -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfriesen@nortelnetworks.com ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 17:20 ` Joel Becker 2003-10-10 17:33 ` Chris Friesen @ 2003-10-10 17:40 ` Linus Torvalds 2003-10-10 17:54 ` Trond Myklebust 2003-10-10 18:05 ` Joel Becker 1 sibling, 2 replies; 64+ messages in thread From: Linus Torvalds @ 2003-10-10 17:40 UTC (permalink / raw) To: Joel Becker Cc: Chris Friesen, Jamie Lokier, Trond Myklebust, Ulrich Drepper, Linux Kernel On Fri, 10 Oct 2003, Joel Becker wrote: > > msync() forces write(), like fsync(). It doesn't force read(). Actually, the kernel has a "readahead(fd, offset, size)" system call that will start asynchronous read-ahead on any mapping. After that, just touching the page will obviously map in and synchronize the result. I don't think anybody uses it, and the interface may be broken, but it was literally 20 lines of code, and I had a trivial test program that populated the cache for a directory structure really quickly using it. In general, it would be really nice to have more oracle people discussing what their particular pet horror is, and what they'd really like to do. I know you're more used to just doing your own thing and working with vendors, but even just people getting used to doing the unofficial "this is what we do, and it sucks because xxx" would make people more aware of what you want to do, and maybe it would suggest novel ways of doing things. I suspect most of the things would get shot down as being impractical, but there have always been a lot of discussion about more direct control of the page cache for programs that really want it, and I'm more than willing to discuss things (obviously 2.7.x material, but still.. A lot of it is trivial and could be back-ported to 2.6.x if people start using it). For example, things we can do, but don't, partly because of interface issues and because there is no point in doing it if people wouldn't use it: - moving a page back and forth between user space. 
It's _trivial_ to do, with a fallback on copying if the page happens to be busy (ie we can often just replace the existing page cache page, but if somebody else has it mapped, we'd have to copy the contents instead) We can't do this for "regular" read and write, because the resulting copy-on-write situation makes it less than desirable in most cases, but if the user space specifically says "you can throw these pages away after moving them to the page cache", that avoids a lot of horror. The "remap_file_pages()" thing kind of does this on the read side (ie it says "map in this page cache entry into my virtual address space"), but we don't have the reverse aka "take this page in the virtual address space and map it into the page cache". Interfaces like these would also allow things like zero-copy file copies with smaller page cache footprints - at the expense of invalidating the cache for the source file as a result of the copy. Which is why it can't be a _regular_ read - but it's one of those things where if the user knows what he wants.. - dirty mapping control (ie controlling partial page dirty state, and also _delaying_ writeout if it needs to be ordered). Possibly by having a separate backing store (ie a mmap that says "read from this file, but write back to that other file") to avoid the nasty memory management problems. A lot of these are really easy to do, but the usage and the interfaces are non-obvious. Linus ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 17:40 ` Linus Torvalds @ 2003-10-10 17:54 ` Trond Myklebust 2003-10-10 18:05 ` Linus Torvalds 2003-10-11 2:53 ` Andrew Morton 2003-10-10 18:05 ` Joel Becker 1 sibling, 2 replies; 64+ messages in thread From: Trond Myklebust @ 2003-10-10 17:54 UTC (permalink / raw) To: Linus Torvalds; +Cc: Joel Becker, Chris Friesen, Jamie Lokier, Linux Kernel >>>>> " " == Linus Torvalds <torvalds@osdl.org> writes: > On Fri, 10 Oct 2003, Joel Becker wrote: >> >> msync() forces write(), like fsync(). It doesn't force read(). > Actually, the kernel has a "readahead(fd, offset, size)" system > call that will start asynchronous read-ahead on any > mapping. After that, just touching the page will obviously map > in and synchronize the result. That's different. That's just preheating the page cache. It does nothing for the case Joel mentioned where 2 different nodes are writing to the same device, and you need to force a read in order to resynchronize the page cache. Apart from O_DIRECT, we have nothing in the kernel as it stands that will allow userland to deal with this case. Cheers, Trond ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 17:54 ` Trond Myklebust @ 2003-10-10 18:05 ` Linus Torvalds 2003-10-10 20:40 ` Trond Myklebust 2003-10-11 2:53 ` Andrew Morton 1 sibling, 1 reply; 64+ messages in thread From: Linus Torvalds @ 2003-10-10 18:05 UTC (permalink / raw) To: Trond Myklebust; +Cc: Joel Becker, Chris Friesen, Jamie Lokier, Linux Kernel On Fri, 10 Oct 2003, Trond Myklebust wrote: > > Apart from O_DIRECT, we have nothing in the kernel as it stands that > will allow userland to deal with this case. Oh, but that's just another case of the general notion of allowing people to control the page cache a bit more. There's nothing wrong with having kernel interfaces that say "this region is potentially stale" or "this region is dirty" or "this region is not needed any more". For example, using DIRECT_IO to make sure that something is uptodate is just _stupid_, because clearly it only matters to shared-disk (either over networks/FC or though things like SCSI device sharing) setups. So now the app has to have a way to query for whether the storage is shared, and have two totally different code-paths depending on the answer. This is another example of a bad design, that ends up causing more problems (remember why this thread started in the first place: bad design of O_DIRECT causing the app to have to care about something _else_ it shouldn't care about. At all). If you had a "this region is stale" thing, you'd just use it. And if it was local disk, it wouldn't do anything. Linus ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 18:05 ` Linus Torvalds @ 2003-10-10 20:40 ` Trond Myklebust 2003-10-10 21:09 ` Linus Torvalds 0 siblings, 1 reply; 64+ messages in thread From: Trond Myklebust @ 2003-10-10 20:40 UTC (permalink / raw) To: Linus Torvalds; +Cc: Joel Becker, Chris Friesen, Jamie Lokier, Linux Kernel > If you had a "this region is stale" thing, you'd just use > it. And if it was local disk, it wouldn't do anything. Note that in order to be race-free, such a command would also have to wait on any outstanding operations (i.e. both pending reads and writes) on the region in question in order to make sure that none have crossed the synchronization point. It is not a question of just calling invalidate_inode_pages() and thinking that all is well... In fact, I recently noticed that we still have this race in the NFS file locking code: readahead may have been scheduled before we actually set the file lock on the server, and may thus fill the page cache with stale data. Cheers, Trond ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 20:40 ` Trond Myklebust @ 2003-10-10 21:09 ` Linus Torvalds 2003-10-10 22:17 ` Trond Myklebust 0 siblings, 1 reply; 64+ messages in thread From: Linus Torvalds @ 2003-10-10 21:09 UTC (permalink / raw) To: Trond Myklebust; +Cc: Joel Becker, Chris Friesen, Jamie Lokier, Linux Kernel On Fri, 10 Oct 2003, Trond Myklebust wrote: > > In fact, I recently noticed that we still have this race in the NFS > file locking code: readahead may have been scheduled before we > actually set the file lock on the server, and may thus fill the page > cache with stale data. The current "invalidate_inode_pages()" is _not_ equivalent to a specific user saying "these pages are bad and have to be updated". The main difference is that invalidate_inode_pages() really cannot assume that the pages are bad: the pages may be mapped into another process that is actively writing to them, so the regular "invalidate_inode_pages()" literally must not force a re-read - that would throw out real information. So "invalidate_inode_pages()" really is a hint, not a forced eviction. A forced eviction can be done only by a user that says "I have write permission to this file, and I will now say that these pages _have_ to be thrown away, whether dirty or not". And that's totally different, and will require a totally different approach. (As to the read-ahead issue: there's nothing saying that you can't wait for the pages if they aren't up-to-date, and really synchronize with read-ahead. But that will require filesystem help, if only to be able to recognize that there is active IO going on. So NFS would have to keep track of a "read list" the same way it does for writeback pages). Linus ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 21:09 ` Linus Torvalds @ 2003-10-10 22:17 ` Trond Myklebust 0 siblings, 0 replies; 64+ messages in thread From: Trond Myklebust @ 2003-10-10 22:17 UTC (permalink / raw) To: Linus Torvalds; +Cc: Linux Kernel >>>>> " " == Linus Torvalds <torvalds@osdl.org> writes: > (As to the read-ahead issue: there's nothing saying that you > can't wait for the pages if they aren't up-to-date, and really > synchronize with read-ahead. But that will require filesystem > help, if only to be able to recognize that there is active IO > going on. So NFS would have to keep track of a "read list" the > same way it does for writeback pages). Well... I was thinking more in terms of a rw_semaphore to lock out new calls to nfs_file_(read|write|sendfile) in combination with a call to invalidate_inode_pages2(). Such a mechanism can also be used in schemes to improve on the generic data/attribute cache consistency in order to reduce the number of bogus cache invalidations due to RPC ordering races. Those can tend to be expensive... Note: Anybody using mmap() in combination with file locking will however continue to enjoy the privilege of being able to screw up... Cheers, Trond ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 17:54 ` Trond Myklebust 2003-10-10 18:05 ` Linus Torvalds @ 2003-10-11 2:53 ` Andrew Morton 2003-10-11 3:47 ` Trond Myklebust 1 sibling, 1 reply; 64+ messages in thread From: Andrew Morton @ 2003-10-11 2:53 UTC (permalink / raw) To: trond.myklebust; +Cc: torvalds, Joel.Becker, cfriesen, jamie, linux-kernel Trond Myklebust <trond.myklebust@fys.uio.no> wrote: > > It does nothing for the case Joel mentioned where 2 different nodes > are writing to the same device, and you need to force a read in order > to resynchronize the page cache. > Apart from O_DIRECT, we have nothing in the kernel as it stands that > will allow userland to deal with this case. Applications may use fadvise(POSIX_FADV_DONTNEED) to invalidate sections of a file's pagecache. It is not designed to be 100% reliable though: mmapped pages will be retained, and dirty pages are skipped. For the dirty pages it might be useful to add a new mode to fadvise which syncs a section of a file's pages; -mm has the necessary infrastructure for that. POSIX does not define the fadvise() semantics very clearly, so it is largely up to us to decide what makes sense. There are a number of things which we can do quite easily in there - it's mainly a matter of working out exactly what we want to do. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-11 2:53 ` Andrew Morton @ 2003-10-11 3:47 ` Trond Myklebust 0 siblings, 0 replies; 64+ messages in thread From: Trond Myklebust @ 2003-10-11 3:47 UTC (permalink / raw) To: Andrew Morton; +Cc: Linus Torvalds, Joel.Becker, cfriesen, jamie, linux-kernel >>>>> " " == Andrew Morton <akpm@osdl.org> writes: > POSIX does not define the fadvise() semantics very clearly, so > it is largely up to us to decide what makes sense. There are a > number of things which we can do quite easily in there - it's > mainly a matter of working out exactly what we want to do. Possibly, but there really is no need to get over-creative either. The SUS definition of msync(MS_INVALIDATE) reads as follows: When MS_INVALIDATE is specified, msync() shall invalidate all cached copies of mapped data that are inconsistent with the permanent storage locations such that subsequent references shall obtain data that was consistent with the permanent storage locations sometime between the call to msync() and the first subsequent memory reference to the data. (ref: http://www.opengroup.org/onlinepubs/007904975/functions/msync.html) i.e. a strict implementation would mean that msync() will in fact act as a synchronization point that is fully consistent with Linus' proposal for a "this region is stale" function. Unfortunately Linux appears incapable of implementing such a strict definition of msync() as it stands. Cheers, Trond ^ permalink raw reply [flat|nested] 64+ messages in thread
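[Editorial note: the SUS guarantee quoted above can be sketched in a few lines. This is a toy on a single node, where Linux's unified page cache means the mapping would usually see the new data even without the call; the MS_INVALIDATE call is what a strict implementation would have to honour when the storage changed underneath the mapping.]

```c
/* Sketch of msync(MS_INVALIDATE): after the call, references through the
 * mapping must observe what is on permanent storage. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int demo(void)
{
    int fd = open("/tmp/msync_demo", O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0 || pwrite(fd, "old!", 4, 0) != 4)
        return -1;
    char *map = mmap(NULL, 4, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return -1;
    if (pwrite(fd, "new!", 4, 0) != 4)      /* "storage" changes under us */
        return -1;
    /* Discard cached copies inconsistent with storage for this range. */
    if (msync(map, 4, MS_SYNC | MS_INVALIDATE) != 0)
        return -1;
    int ok = memcmp(map, "new!", 4) == 0;   /* must now see the new data */
    munmap(map, 4);
    close(fd);
    unlink("/tmp/msync_demo");
    return ok ? 0 : -1;
}
```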
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 17:40 ` Linus Torvalds 2003-10-10 17:54 ` Trond Myklebust @ 2003-10-10 18:05 ` Joel Becker 2003-10-10 18:31 ` Andrea Arcangeli 2003-10-10 20:33 ` Helge Hafting 1 sibling, 2 replies; 64+ messages in thread From: Joel Becker @ 2003-10-10 18:05 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel On Fri, Oct 10, 2003 at 10:40:40AM -0700, Linus Torvalds wrote: > Actually, the kernel has a "readahead(fd, offset, size)" system call that > will start asynchronous read-ahead on any mapping. After that, just > touching the page will obviously map in and synchronize the result. Ok, a quick peruse of sys_readahead() seems to say that it doesn't check for existing uptodate()ness. That would be interesting. I could have missed it, though. > I don't think anybody uses it, and the interface may be broken, but it was > literally 20 lines of code, and I had a trivial test program that > populated the cache for a directory structure really quickly using it. The problem we have with msync() and friends is not 'quick population', it's "page is in the page cache already; another node writes to the storage; must mark page as !uptodate so as to force a re-read from disk". I can't find where sys_readahead() checks for uptodate, so perhaps calling sys_readahead() on a range always causes I/O. Correct me if I missed it. > For example, things we can do, but don't, partly because of interface > issues and because there is no point in doing it if people wouldn't use > it: Lots of interesting stuff snipped. This discussion has me thinking, knowing now that there's a possibility to move to a more optimal interface. Joel -- Life's Little Instruction Book #464 "Don't miss the magic of the moment by focusing on what's to come." Joel Becker Senior Member of Technical Staff Oracle Corporation E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 64+ messages in thread
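[Editorial note: for reference, the readahead(2) interface discussed above looks like this from userspace (Linux-specific, needs _GNU_SOURCE). As Joel suspects, it only populates the cache; it never marks already-cached pages stale, so it cannot force a re-read.]

```c
/* Minimal use of readahead(2): start asynchronous population of the page
 * cache for a range of a file. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int demo(void)
{
    int fd = open("/tmp/ra_demo", O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, 64 * 1024) != 0)      /* something to read ahead */
        return -1;
    /* Kick off asynchronous cache population for the first 64k.  If the
     * pages are already cached and uptodate, this does no I/O at all. */
    int rc = (int)readahead(fd, 0, 64 * 1024);
    close(fd);
    unlink("/tmp/ra_demo");
    return rc;                               /* 0 on success */
}
```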
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 18:05 ` Joel Becker @ 2003-10-10 18:31 ` Andrea Arcangeli 2003-10-10 20:33 ` Helge Hafting 1 sibling, 0 replies; 64+ messages in thread From: Andrea Arcangeli @ 2003-10-10 18:31 UTC (permalink / raw) To: Linus Torvalds, linux-kernel On Fri, Oct 10, 2003 at 11:05:35AM -0700, Joel Becker wrote: > thinking, knowing now that there's a possibility to move to a more optimal > interface. Cleaner and simpler it could very well be (many simpler DBs work that way, in fact), but more optimal I doubt. To be more optimal you should let the kernel do all the garbage collection of mappings, and not use remap_file_pages. But then I'm unsure whether the kernel is really better than you at choosing what info to discard from the cache, and you'd still have to pay for page faults that you don't have to right now. And if you use remap_file_pages to still choose what to ""discard first"" from userspace, then you'd better use O_DIRECT instead, which doesn't require any pte mangling (ignoring the readahead, async-io, msync and scsi-shared issues, which sound fixable). About the security issues: they existed in older kernels, but they're nowadays fixed thanks to Stephen's i_alloc_sem. Though it'd be interesting to compare different models in practice to be sure; I just don't have expectations for it being a "more optimal" design at the moment. Andrea - If you prefer relying on open source software, check these links: rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.[45]/ http://www.cobite.com/cvsps/ ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 18:05 ` Joel Becker 2003-10-10 18:31 ` Andrea Arcangeli @ 2003-10-10 20:33 ` Helge Hafting 1 sibling, 0 replies; 64+ messages in thread From: Helge Hafting @ 2003-10-10 20:33 UTC (permalink / raw) To: Linus Torvalds, linux-kernel, Joel.Becker On Fri, Oct 10, 2003 at 11:05:35AM -0700, Joel Becker wrote: > The problem we have with msync() and friends is not 'quick > population', it's "page is in the page cache already; another node > writes to the storage; must mark page as !uptodate so as to force a > re-read from disk". I can't find where sys_readahead() checks for > uptodate, so perhaps calling sys_readahead() on a range always causes > I/O. Correct me if I missed it. > Wouldn't this be solvable by giving userspace a way of invalidating a range of mmapped pages? I.e. a "minvalidate();" to use when the other node tells you it is about to write? This will cause the pages to be paged in again on next reference, or you can issue a read in advance if you believe you'll need them. Helge Hafting ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 16:33 ` Joel Becker 2003-10-10 16:58 ` Chris Friesen @ 2003-10-10 20:07 ` Jamie Lokier 2003-10-12 15:31 ` Greg Stark 2 siblings, 0 replies; 64+ messages in thread From: Jamie Lokier @ 2003-10-10 20:07 UTC (permalink / raw) To: Joel Becker; +Cc: Linus Torvalds, Trond Myklebust, Ulrich Drepper, Linux Kernel Joel Becker wrote: > Platter level doesn't matter. Storage access level matters. > Node1 and Node2 have to see the same thing. As long as I am absolutely > sure that when Node1's write() returns, any subsequent read() on Node2 > will see the change (normal barrier stuff, really), it doesn't matter > what happened on the Storage. The data could be in storage cache, on > platter, passed back to some other entity. That's two specifications. Please choose one! First you say the storage access level matters, then you say it doesn't matter, that the only important thing is any two nodes see each other's changes. Committing data to a certain level of storage, for the sake of _storing_ it, is trivially covered by fdatasync(). We won't talk about that any more. The other requirement is about barriers between nodes accessing data. On a single machine, your second specification means the data doesn't need to hit the disk at all. On a SAN, it means the data doesn't need to hit the SAN's storage - nor, in fact, does the data have to be transferred over the SAN when you write it! Distributed cache coherency exists for that purpose. For example, let's imagine 32 processes, 8 per machine, and a giant shared disk. Pages in the database are regularly read and written by pairs of nodes and, because of the way you direct requests based on keys, certain pages tend to be accessed only by certain pairs of nodes. 
That means a significant proportion of the pages do _not_ need to be transmitted through the shared disk every time they are made visible to other nodes - because those page accesses are local to one _machine_ for many transfers. That means O_DIRECT is using more storage bandwidth than you need to use. The waste is greatest on a single machine (i.e. infinity) but with multiple machines there is still waste and the amount depends on access patterns. You should be using cache coherency protocols between nodes - at the database level (which you very likely are, as performance would plummet without it) - and at the filesystem level. "Forcing a read" is *not* a required operation if you have a sound network filesystem or even network disk protocol. Merely reading a page will force the read, if another node has written to it - and *only* if that is necessary. Some of the distributed filesystems, like Sistina's, get this right I believe. If, however, your shared file does not maintain coherent views between different nodes, then you _do_ need to force writes and force reads. Your quality database will not waste storage bandwidth by doing _unnecessary_ reads, if the underlying storage isn't coherent, merely to see whether a page changed. For that, you should be communicating metadata between nodes that say "this page is now dirty; you will need to read it" and "ok" - along the lines of MESI. That is the worst case I can think of (i.e. the kernel filesystem/san driver doesn't do coherence so you have to do it in the database program), and indeed you do need the ability to flush read pages in that case. Ideally you want the ability to pass pages directly between nodes without requiring a storage commit, too. Linus' suggestion of "this data is stale" is ok. Another flag to remap_file_pages would work, too, saving a system call in some cases, but doing unwanted reads (sometimes you just want to invalidate) in some others. 
Btw, fadvise(POSIX_FADV_DONTNEED) appears to offer this already. Using O_DIRECT always can be inefficient, because it commits things to storage which don't need to be committed so early or often, and because it moves data when it does not need to be moved, with the worst case being a small cluster of large machines or just one machine. It's likely your expensive shared disk doesn't mind all those commits because of journalling NVRAM etc.. To avoid wasting bandwidth to the shared disk and the associated processing costs, you have to analyse the topology of node interconnections, specifically to avoid using O_DIRECT and/or unnecessary reads and writes when they aren't necessary (between local nodes). You need that anyway even with Linus' suggestion, because there's no way the kernel can know automatically whether you are doing a coherence operation between two local nodes or remote ones. Local filesystems look like a worthy exception, but even those are iffy if there's a remote client accessing it over a network filesystem as well as local nodes synchronising over it. It has to be an unexported local filesystem, and the kernel doesn't even know that, because of userspace servers like Samba. ====== That long discussion leads to this: The best in theory is a network-coherent filesystem. It knows the topology and it can implement the optimal strategy. Without one of those, it is necessary to know the topology between nodes to get optimal performance for any method (i.e. minimum pages transferred around, minimum operations etc.). This is true of using O_DIRECT or Linus' page cache manipulations. O_DIRECT works, but causes unnecessary storage commitment when all you need is synchronisation. 
Page cache manipulation may already be possible using fdatasync + MADV_DONTNEED + POSIX_FADV_DONTNEED, however that isn't optimal either, because: Both of those mechanisms do not provide a way to transfer a dirty page from one node to another without (a) committing to storage; or (b) copying the data at the receiver. O_DIRECT does the commit (write at one node; read at the other), but is zero-copy at the receiver, as mapped files are generally. Without O_DIRECT, you'd have to use application level socket<->socket communication, and there is as yet no zero-copy receive. Zero-copy UDP receive or similar is needed to get the best from this. Conclusions ----------- Only a coherent distributed filesystem actually minimises the amount of file data transferred and copied, and automatically too. All other suggestions so far have weaknesses in this regard. Although the page cache manipulation methods could minimise the transfers and copies if zero-copy socket receive was available, it is a set of mechanisms that look like it would, after it's implemented in the application, still be slower than a coherent filesystem just because the latter can do the combination of manipulations etc. more easily; on the other hand, resolution of cache coherency by a filesystem would incur more page faults than doing it at the application level. So it is not absolutely clear which can be made faster with lots of attention. The end :) Thanks for getting this far... -- Jamie ^ permalink raw reply [flat|nested] 64+ messages in thread
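[Editorial note: the "this page is now dirty; you will need to read it" metadata exchange Jamie describes, along the lines of MESI, can be sketched as a toy state machine. All names are hypothetical; a real implementation would sit on a cluster interconnect and cover the Exclusive state and acknowledgements as well.]

```c
/* Toy per-node cache state: a peer's dirty notice invalidates our copy
 * cheaply, and the forced read from shared storage happens only on the
 * next actual access to that page. */
#define NPAGES 8

enum page_state { P_INVALID, P_SHARED, P_MODIFIED };   /* MESI, minus E */

struct node_cache {
    enum page_state state[NPAGES];
    int reads_from_disk;            /* counts forced reads, for the demo */
};

/* Local write: mark modified; in real life, broadcast the dirty notice. */
static void local_write(struct node_cache *n, int page)
{
    n->state[page] = P_MODIFIED;
}

/* Peer tells us it dirtied the page: drop our copy, no I/O yet. */
static void on_peer_dirty(struct node_cache *n, int page)
{
    n->state[page] = P_INVALID;
}

/* Local access: only go to shared storage if our copy is stale. */
static void access_page(struct node_cache *n, int page)
{
    if (n->state[page] == P_INVALID) {
        n->reads_from_disk++;       /* the forced read happens here */
        n->state[page] = P_SHARED;
    }
}

int demo(void)
{
    struct node_cache a, b;
    a.reads_from_disk = b.reads_from_disk = 0;
    for (int i = 0; i < NPAGES; i++)
        a.state[i] = b.state[i] = P_SHARED;
    local_write(&b, 3);             /* node B dirties page 3 ... */
    on_peer_dirty(&a, 3);           /* ... and notifies node A */
    access_page(&a, 3);             /* first touch re-reads it */
    access_page(&a, 3);             /* second touch is free */
    access_page(&a, 5);             /* untouched pages are never re-read */
    return a.reads_from_disk;       /* exactly one forced read */
}
```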
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 16:33 ` Joel Becker 2003-10-10 16:58 ` Chris Friesen 2003-10-10 20:07 ` Jamie Lokier @ 2003-10-12 15:31 ` Greg Stark 2003-10-12 16:13 ` Linus Torvalds 2 siblings, 1 reply; 64+ messages in thread From: Greg Stark @ 2003-10-12 15:31 UTC (permalink / raw) To: Joel Becker Cc: Jamie Lokier, Linus Torvalds, Trond Myklebust, Ulrich Drepper, Linux Kernel Joel Becker <Joel.Becker@oracle.com> writes: > On Fri, Oct 10, 2003 at 05:01:44PM +0100, Jamie Lokier wrote: > > Why don't you _share_ the App's cache with the kernel's? That's what > > mmap() and remap_file_pages() are for. > > Because you can't force flush/read. You can't say "I need you > to go to disk for this." If you do, you're doing O_DIRECT through mmap > (yes, I've pondered it) and you end up with perhaps the same races folks > worry about. Doesn't mean it can't be done. There are other reasons databases want to control their own cache. The application knows more about the usage and the future usage of the data than the kernel does. There's currently a thread on the Postgres mailing list about a problem with an administrative job that needs to touch potentially all the blocks of a table. The more frequently it's run the less work it has to do, so the recommendation is to run it very frequently. However on busy servers whenever it's run it causes lots of pain because the kernel flushes all the cached data in favour of the data this job touches. And worse, there's no way to indicate that the i/o it's doing is lower priority, so i/o bound servers get hit dramatically. Postgres knows the fact that this job touched the data means nothing for the regular functioning of the server, and it knows that the i/o it's doing is low priority. It needs some way to indicate to the kernel that this job is low priority not only for cpu resources but for cache resources and i/o resources as well. There are other cases. 
Oracle, for example, puts blocks it reads due to full table scans at the end of its LRU list to avoid a similar effect on the cache. Then there's the transaction log. The database needs to know when the transaction log is written to disk. The blocks it writes there won't be useful to cache unless the database crashed right there. And ideally it should bypass any disk i/o reordering and write the data to the transaction log *first*. Raw bandwidth is not as important as latency on writes to the transaction log. The reason mmap is tempting is not because it's faster. It's because it provides a nice clean abstract interface. The database could simply mmap the entire database and then pretend it is an in-memory database. The code would be much simpler and more complex algorithms would be easier to implement. Unfortunately there are some problems with mmap. Currently it would be just as complex to use as read/write because the address space is limited to only a fraction of the database. On a 64 bit machine you might be able to mmap the entire database and then use custom syscalls to indicate to the kernel which pages to keep in cache and which to sync. -- greg ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-12 15:31 ` Greg Stark @ 2003-10-12 16:13 ` Linus Torvalds 2003-10-12 22:09 ` Greg Stark 0 siblings, 1 reply; 64+ messages in thread From: Linus Torvalds @ 2003-10-12 16:13 UTC (permalink / raw) To: Greg Stark Cc: Joel Becker, Jamie Lokier, Trond Myklebust, Ulrich Drepper, Linux Kernel On 12 Oct 2003, Greg Stark wrote: > > There are other reasons databases want to control their own cache. The > application knows more about the usage and the future usage of the data than > the kernel does. But this again is not an argument for not using the page cache - it's only an argument for _telling_ the kernel about its use. > However on busy servers whenever it's run it causes lots of pain because the > kernel flushes all the cached data in favour of the data this job touches. Yes. But this is actually pretty easy to avoid in-kernel, since all of the LRU logic is pretty localized. It could be done on a per-process thing ("this process should not pollute the active list") or on a per-fd thing ("accesses through this particular open are not to pollute the active list"). > And > worse, there's no way to indicate that the i/o it's doing is lower priority, > so i/o bound servers get hit dramatically. IO priorities are pretty much worthless. It doesn't _matter_ if other processes get preferred treatment - what is costly is the latency cost of seeking. What you want is not priorities, but batching. Linus ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-12 16:13 ` Linus Torvalds @ 2003-10-12 22:09 ` Greg Stark 2003-10-13 8:45 ` Helge Hafting 0 siblings, 1 reply; 64+ messages in thread From: Greg Stark @ 2003-10-12 22:09 UTC (permalink / raw) To: Linus Torvalds Cc: Greg Stark, Joel Becker, Jamie Lokier, Trond Myklebust, Ulrich Drepper, Linux Kernel Linus Torvalds <torvalds@osdl.org> writes: > > worse, there's no way to indicate that the i/o it's doing is lower priority, > > so i/o bound servers get hit dramatically. > > IO priorities are pretty much worthless. It doesn't _matter_ if other > processes get preferred treatment - what is costly is the latency cost of > seeking. What you want is not priorities, but batching. What you want depends very much on the circumstances. I'm sure in a lot of cases batching helps, but in this case it's not the issue. The vacuum job that runs periodically in fact is batched very well. In fact that's the main reason it exists rather than having the cleanup handled in the critical path in the transaction itself. I'm not aware of all the details but my understanding is that it reads every block in the table sequentially, keeping note of all the records that are no longer visible to any transaction. When it's finished reading it writes out a "free space map" that subsequent transactions read and use to find available space in the table. The vacuum job makes very efficient use of disk i/o. In fact too efficient. Frequently people have their disks running at 50-90% capacity simply handling the random seeks to read data. Those seeks are already batched to the OS's best ability. But then vacuum comes along and tries to read the entire table sequentially. In the best case the sequential read will take up a lot of the available disk bandwidth and delay transactions. 
In the worst case the OS will actually prefer the sequential read because the elevator algorithm always sees that it can get more bandwidth by handling it ahead of the random access. In reality there is no time pressure on the vacuum at all. As long as it completes faster than dead records can pile up it's fast enough. The transactions on the other hand must complete as fast as possible. Certainly batching is useful and in many cases is more important than prioritizing, but in this case it's not the whole answer. I'll mention this thread on the postgresql-hackers list; perhaps some of the more knowledgeable programmers there will have thought about these issues and will be able to post their wishlist ideas for kernel APIs. I can see why back in the day Oracle preferred to simply tell all the OS vendors, "just give us direct control over disk accesses, we'll figure it out" rather than have to really hash out all the details of their low level needs with every OS vendor. But between being able to prioritize I/O resources and cache resources, and being able to sync IDE disks properly and cleanly (that other thread) Linux may be able to drastically improve the kernel interface for databases. -- greg ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-12 22:09 ` Greg Stark @ 2003-10-13 8:45 ` Helge Hafting 2003-10-15 13:25 ` Ingo Oeser 0 siblings, 1 reply; 64+ messages in thread From: Helge Hafting @ 2003-10-13 8:45 UTC (permalink / raw) To: Greg Stark Cc: Joel Becker, Jamie Lokier, Trond Myklebust, Ulrich Drepper, Linux Kernel Greg Stark wrote: [...] > > But then vacuum comes along and tries to read the entire table sequentially. > In the best case the sequential read will take up a lot of the available disk > bandwidth and delay transactions. In the worst case the OS will actually > prefer the sequential read because the elevator algorithm always sees that it > can get more bandwidth by handling it ahead of the random access. > > In reality there is no time pressure on the vacuum at all. As long as it > completes faster than dead records can pile up it's fast enough. The > transactions on the other hand must complete as fast as possible. This seems almost trivial. If the vacuum job runs too much, overusing disk bandwidth - throttle it! This is easier than trying to tell the kernel that the job is less important, which goes wrong whether the job runs too much or too little. Let that job sleep a little when its services aren't needed, or when you need the disk bandwidth elsewhere. Helge Hafting ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-13 8:45 ` Helge Hafting @ 2003-10-15 13:25 ` Ingo Oeser 2003-10-15 15:03 ` Greg Stark 0 siblings, 1 reply; 64+ messages in thread From: Ingo Oeser @ 2003-10-15 13:25 UTC (permalink / raw) To: Helge Hafting, Greg Stark Cc: Joel Becker, Jamie Lokier, Trond Myklebust, Ulrich Drepper, Linux Kernel On Monday 13 October 2003 10:45, Helge Hafting wrote: > Greg Stark wrote: > [...] > > In reality there is no time pressure on the vacuum at all. As long as it > > completes faster than dead records can pile up it's fast enough. The > > transactions on the other hand must complete as fast as possible. > > This seems almost trivial. If the vacuum job runs too much, > overusing disk bandwidth - throttle it! If you are using regular read/write syscalls and not too big chunks --> trivial. If you mmap your database --> harder. If you would like to tell the kernel that this should not be treated like a sequential read --> fadvise/madvise. > This is easier than trying to tell the kernel that the job is > less important, which goes wrong whether the job runs too much > or too little. Let that job sleep a little when its services > aren't needed, or when you need the disk bandwidth elsewhere. Here I agree, as this seems like a solution. The problem is that you sometimes need low latency for your transactions, and then you cannot start throttling a heavy IO process whose IO is already issued and who is basically just waiting for disk, eating its bandwidth. The questions are: How IO-intensive is vacuum? How fast can throttling free up disk bandwidth (and memory)? Regards Ingo Oeser ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-15 13:25 ` Ingo Oeser @ 2003-10-15 15:03 ` Greg Stark 2003-10-15 18:37 ` Helge Hafting 2003-10-16 10:29 ` Ingo Oeser 1 sibling, 2 replies; 64+ messages in thread From: Greg Stark @ 2003-10-15 15:03 UTC (permalink / raw) To: Ingo Oeser Cc: Helge Hafting, Greg Stark, Joel Becker, Jamie Lokier, Trond Myklebust, Ulrich Drepper, Linux Kernel Ingo Oeser <ioe-lkml@rameria.de> writes: > On Monday 13 October 2003 10:45, Helge Hafting wrote: > > > This is easier than trying to tell the kernel that the job is > > less important, which goes wrong whether the job runs too much > > or too little. Let that job sleep a little when its services > > aren't needed, or when you need the disk bandwidth elsewhere. Actually I think that's exactly backwards. The problem is that if user space tries to throttle the process it doesn't know how much or when. The kernel knows exactly when there are other higher-priority writes; it can schedule just enough writes from vacuum to not interfere. So if vacuum slept a bit, say every 64k of data vacuumed, it could end up sleeping when the disks are actually idle. Or it could be not sleeping enough and still be interfering with transactions. Though actually this avenue has some promise. It would not be nearly as ideal as a kernel-based solution that could take advantage of the idle times between transactions, but it would still work somewhat as a work-around. > The questions are: How IO-intensive is vacuum? How fast can throttling > free up disk bandwidth (and memory)? It's purely i/o bound on large sequential reads. Ideally it should still have large enough sequential reads to not lose the streaming advantage, but not so large that it preempts the more random-access transactions. -- greg ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-15 15:03 ` Greg Stark @ 2003-10-15 18:37 ` Helge Hafting 0 siblings, 0 replies; 64+ messages in thread From: Helge Hafting @ 2003-10-15 18:37 UTC (permalink / raw) To: Greg Stark; +Cc: Ingo Oeser, Joel Becker, Linux Kernel On Wed, Oct 15, 2003 at 11:03:23AM -0400, Greg Stark wrote: > Ingo Oeser <ioe-lkml@rameria.de> writes: > > > On Monday 13 October 2003 10:45, Helge Hafting wrote: > > > > > This is easier than trying to tell the kernel that the job is > > > less important, which goes wrong whether the job runs too much > > > or too little. Let that job sleep a little when its services > > > aren't needed, or when you need the disk bandwidth elsewhere. > > Actually I think that's exactly backwards. The problem is that if > user space tries to throttle the process it doesn't know how much or when. > The kernel knows exactly when there are other higher-priority writes; it can > schedule just enough writes from vacuum to not interfere. > Aren't those higher-priority writes issued from userspace? I am of course assuming that source for _everything_ is available. So the process with the high-priority write can tell vacuum to take a nap until its transaction completes. > So if vacuum slept a bit, say every 64k of data vacuumed, it could end up > sleeping when the disks are actually idle. Or it could be not sleeping enough > and still be interfering with transactions. > It can run at full speed normally, taking voluntary pauses if it ever detects a "nothing to do now" condition. And it can be paused (forcibly or through cooperation) when there are important transactions to sync. > Though actually this avenue has some promise. It would not be nearly as ideal > as a kernel-based solution that could take advantage of the idle times between > transactions, but it would still work somewhat as a work-around. > Doesn't that other process know when it is about to submit important transactions? 
> > The questions are: How IO-intensive vacuum? How fast can a throttling > > free disk bandwidth (and memory)? > Helge Hafting ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-15 15:03 ` Greg Stark 2003-10-15 18:37 ` Helge Hafting @ 2003-10-16 10:29 ` Ingo Oeser 2003-10-16 14:02 ` Greg Stark 1 sibling, 1 reply; 64+ messages in thread From: Ingo Oeser @ 2003-10-16 10:29 UTC (permalink / raw) To: Greg Stark Cc: Helge Hafting, Joel Becker, Jamie Lokier, Trond Myklebust, Ulrich Drepper, Linux Kernel Hi there, first: I think the problem is solvable by mixing blocking and non-blocking IO, or simply AIO, which will be supported nicely by 2.6.0, is a POSIX standard, and is meant for doing your own IO scheduling. On Wednesday 15 October 2003 17:03, Greg Stark wrote: > Ingo Oeser <ioe-lkml@rameria.de> writes: > > On Monday 13 October 2003 10:45, Helge Hafting wrote: > > > This is easier than trying to tell the kernel that the job is > > > less important, that goes wrong wether the job runs too much > > > or too little. Let that job sleep a little when its services > > > aren't needed, or when you need the disk bandwith elsewhere. > > Actually I think that's exactly backwards. The problem is that if the > user-space tries to throttle the process it doesn't know how much or when. > The kernel knows exactly when there are other higher priority writes, it > can schedule just enough writes from vacuum to not interfere. On dedicated servers this might be true. But on these you could also solve it in user space by measuring disk bandwidth and issuing just enough IO to keep up roughly with it. > So if vacuum slept a bit, say every 64k of data vacuumed. It could end up > sleeping when the disks are actually idle. Or it could be not sleeping > enough and still be interfering with transactions. The vacuum IO is submitted (via AIO or a simulation of it) normally in a unit U, always waiting for U to complete before submitting a new one. Between submitting units, vacuum checks for outstanding transactions and stops when there is one. 
Now a transaction is submitted, and its existence stops vacuum from submitting more. The transaction waits for completion (e.g. aio_suspend()) and signals vacuum to continue. So the disk(s) should always be in good use. I don't know much about the design internals of your database, but this sounds promising and is portable. > > The questions are: How IO-intensive vacuum? How fast can a throttling > > free disk bandwidth (and memory)? > > It's purely i/o bound on large sequential reads. Ideally it should still > have large enough sequential reads to not lose the streaming advantage, but > not so large that it preempts the more random-access transactions. Ok, so we can ignore the processing time and the above should just work. Regards Ingo Oeser ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-16 10:29 ` Ingo Oeser @ 2003-10-16 14:02 ` Greg Stark 2003-10-21 11:47 ` Ingo Oeser 0 siblings, 1 reply; 64+ messages in thread From: Greg Stark @ 2003-10-16 14:02 UTC (permalink / raw) To: Ingo Oeser Cc: Greg Stark, Helge Hafting, Joel Becker, Jamie Lokier, Trond Myklebust, Ulrich Drepper, Linux Kernel Ingo Oeser <ioe-lkml@rameria.de> writes: > Hi there, > > first: I think the problem is solvable by mixing blocking and > non-blocking IO, or simply AIO, which will be supported nicely by 2.6.0, > is a POSIX standard, and is meant for doing your own IO scheduling. I think aio could be very useful for databases, but not in this area. I think it's useful as a more fine-grained tool than sync/fsync. Currently the database has to fsync a file to commit a transaction, which means flushing _all_ writes to the file, even ones from other transactions. If aio inserted write barriers to the disk controller then it would provide a way to ensure the current transaction is synced without having to flush all other transactions' writes at the same time. But I don't see how it's useful for the problem I'm describing. > On Wednesday 15 October 2003 17:03, Greg Stark wrote: > > Ingo Oeser <ioe-lkml@rameria.de> writes: > > > On Monday 13 October 2003 10:45, Helge Hafting wrote: > > > > This is easier than trying to tell the kernel that the job is > > > > less important, that goes wrong wether the job runs too much > > > > or too little. Let that job sleep a little when its services > > > > aren't needed, or when you need the disk bandwith elsewhere. > > > > Actually I think that's exactly backwards. The problem is that if the > > user-space tries to throttle the process it doesn't know how much or when. > > The kernel knows exactly when there are other higher priority writes, it > > can schedule just enough writes from vacuum to not interfere. > > On dedicated servers this might be true. 
> But on these you could also solve it in user space by measuring disk > bandwidth and issuing just enough IO to keep up roughly with it. Indeed we're discussing methods for doing that now. But this seems like an awkward way to accomplish what the kernel could do very precisely. I don't see why non-dedicated servers would make priorities any less useful; in fact I think that's exactly where they would shine. > > So if vacuum slept a bit, say every 64k of data vacuumed. It could end up > > sleeping when the disks are actually idle. Or it could be not sleeping > > enough and still be interfering with transactions. > > The vacuum IO is submitted (via AIO or a simulation of it) normally in a > unit U, always waiting for U to complete before submitting a new one. > Between submitting units, vacuum checks for outstanding transactions > and stops when there is one. > > Now a transaction is submitted, and its existence stops vacuum from > submitting more. The transaction waits for completion (e.g. > aio_suspend()) and signals vacuum to continue. User-space has no idea if disk i/o is occurring. The data the transaction needs could be cached, or it could be on a different disk. Besides, I think this is far too coarse-grained for what's needed. Transactions sometimes run for seconds, minutes, or hours; some of that time is spent doing disk i/o and some of it doing cpu calculations. It can't stop and signal another process every time it finishes reading a block and needs to do a bit of calculation. Then context switch again a millisecond later so it can read the next block... And besides, this would only be useful on dedicated servers. -- greg ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-16 14:02 ` Greg Stark @ 2003-10-21 11:47 ` Ingo Oeser 0 siblings, 0 replies; 64+ messages in thread From: Ingo Oeser @ 2003-10-21 11:47 UTC (permalink / raw) To: Greg Stark Cc: Greg Stark, Helge Hafting, Joel Becker, Jamie Lokier, Trond Myklebust, Ulrich Drepper, Linux Kernel Hi Greg, On Thursday 16 October 2003 16:02, Greg Stark wrote: > Ingo Oeser <ioe-lkml@rameria.de> writes: > > Hi there, > > > > first: I think the problem is solvable by mixing blocking and > > non-blocking IO, or simply AIO, which will be supported nicely by 2.6.0, > > is a POSIX standard, and is meant for doing your own IO scheduling. > > I think aio could be very useful for databases, but not in this area. [AIO for write barriers] > But I don't see how it's useful for the problem I'm describing. It can, because this way you generate something like a "user space request queue" and can control its activity and saturation as fine-grained as the syncing. You simply notice whether an event is in flight or not, and can estimate the current bandwidth that way. > Indeed we're discussing methods for doing that now. But this seems like an > awkward way to accomplish what the kernel could do very precisely. I don't > see why non-dedicated servers would make priorities any less useful; in > fact I think that's exactly where they would shine. The kernel problem is that an IO operation is not associated with any process, just with a physical page and a backing store. This is especially true for reads. So in many cases userspace doesn't know whether the kernel needs to do any IO at all to satisfy this request. Direct-IO helps this by having you do the IO ALWAYS, but that isn't so nice for the kernel. So if you say "This fd has an IO priority of 1 and that fd has one of 2" for the same file, then what should the kernel do? Or another scenario: You have chunk A and chunk B, both of 128k. 
Now vacuum wants to read chunk B at low priority, and a transaction wants to read the second page from chunk A and chunk B at high priority (readv()). Readahead of the second page from chunk A brings in the first page of chunk B, which vacuum has been waiting for; vacuum is woken and vacuums until chunk C is needed, which causes IO again. Now the transaction continues and can read immediately from page cache the page vacuum left. This will be even more fun if vacuum is working so fast per timeslice that it will push the cached pages out of memory ;-) See how controlling submission from vacuum might be better than actions done by the kernel? If you just prioritize work, then the low priority work accumulates and takes up kernel memory. So better stop submission. > > > So if vacuum slept a bit, say every 64k of data vacuumed. It could end > > > up sleeping when the disks are actually idle. Or it could be not > > > sleeping enough and still be interfering with transactions. > > > > The vacuum IO is submitted (via AIO or a simulation of it) normally in a > > unit U, always waiting for U to complete before submitting a new one. > > Between submitting units, vacuum checks for outstanding transactions > > and stops when there is one. > > > > Now a transaction is submitted, and its existence stops vacuum from > > submitting more. The transaction waits for completion (e.g. > > aio_suspend()) and signals vacuum to continue. > > User-space has no idea if disk i/o is occurring. The data the transaction > needs could be cached, or it could be on a different disk. So how should it prioritize then, if it doesn't know which will preempt which? > Besides, I think this is far too coarse-grained for what's needed. > Transactions sometimes run for seconds, minutes, or hours; some of that > time is spent doing disk i/o and some of it doing cpu calculations. It > can't stop and signal another process every time it finishes reading a > block and needs to do a bit of calculation. 
> Then context switch again a millisecond later so it can read the next block... I don't want it to signal vacuum; I just want vacuum to check for the existence of more important things to do. Like a "disk idle process". This can be as simple as having vacuum at extremely low process priority, reading some atomically set variable that says whether it can submit more now or not. I think you need to do something like the kernel does for page writing for your user space task (stepping by watermarks from none, to async, to sync). PS: Sorry for the late answer, but I needed to rethink a bit more. If you could point me to the source files actually triggering and doing vacuum, I might get more enlightenment ;-) Regards Ingo Oeser ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 16:01 ` Jamie Lokier 2003-10-10 16:33 ` Joel Becker @ 2003-10-10 18:20 ` Andrea Arcangeli 2003-10-10 18:36 ` Linus Torvalds 1 sibling, 1 reply; 64+ messages in thread From: Andrea Arcangeli @ 2003-10-10 18:20 UTC (permalink / raw) To: Jamie Lokier Cc: Linus Torvalds, Trond Myklebust, Ulrich Drepper, Linux Kernel On Fri, Oct 10, 2003 at 05:01:44PM +0100, Jamie Lokier wrote: > Joel Becker wrote: > > Where I work doesn't change the need for O_DIRECT. If your Big > > App has it's own cache, why copy the cache in the kernel? That just > > wastes RAM. > > Why don't you _share_ the App's cache with the kernel's? That's what > mmap() and remap_file_pages() are for. I covered this some time ago in the remap_file_pages threads with Wil. remap_file_pages requires pte modifications and tlb flushes. O_DIRECT only walk the pagetables, no pte mangling, no tlb flushes, the TLB is preserved fully. thinking only 64bit in the above of course, 32bit is different but still mmap+remap_file_pages can't beat O_DIRECT if you dedicate your machine for the database task. > > If your app is sharing data, whether physical disk, logical > > disk, or via some network filesystem or storage device, you must > > absolutely guarantee that reads and writes hit the storage, not the > > kernel cache which has no idea whether another node wrote an update or > > needs a cache flush. > > That's tough to guarantee at the platter level regardless of O_DIRECT, > but otherwise: you have fdatasync() and msync(). > > > If Linux came up with a better, cleaner method, Oracle might change. > > Take a look at remap_file_pages() and write a note here to say if it > fits the bill. I thought remap_file_pages() was added for Oracle, but > perhaps it was for a more modern database ;) no way, it has the disavantages I mentioned above, it would be a bad idea to use remap_file_pages on any 64bit system out there. 
we know remap_file_pages has a chance to improve the /dev/shm mappings from 32bit systems, but that has nothing to do with the long run 64bit machines, remap_file_pages is mostly a 32bit hack for ia32 with PAE. About the in-memory databases, that's really the big iron non-mass-market, not the other way around. only the big irons have enough money to buy that much ram, you sure can't compare the price of the ram with the price of disk, or at least not yet in this market AFIK. Andrea - If you prefer relying on open source software, check these links: rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.[45]/ http://www.cobite.com/cvsps/ ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 18:20 ` Andrea Arcangeli @ 2003-10-10 18:36 ` Linus Torvalds 2003-10-10 19:03 ` Andrea Arcangeli 0 siblings, 1 reply; 64+ messages in thread From: Linus Torvalds @ 2003-10-10 18:36 UTC (permalink / raw) To: Andrea Arcangeli Cc: Jamie Lokier, Trond Myklebust, Ulrich Drepper, Linux Kernel On Fri, 10 Oct 2003, Andrea Arcangeli wrote: > > O_DIRECT only walk the pagetables, no pte mangling, no tlb flushes, the > TLB is preserved fully. Yes. However, it's even _nicer_ if you don't need to walk the page tables at all. Quite a lot of operations could be done directly on the page cache. I'm not a huge fan of mmap() myself - the biggest advantage of mmap is when you don't know your access patterns, and you have reasonably good locality. In many other cases mmap is just a total loss, because the page table walking is often more expensive than even a memcpy(). That's _especially_ true if you have to move mappings around, and you have to invalidate TLB's. memcpy() often gets a bad name. Yeah, memory is slow, but especially if you copy something you just worked on, you're actually often better off letting the CPU cache do its job, rather than walking page tables and trying to be clever. Just as an example: copying often means that you don't need nearly as much locking and synchronization - which in turn avoids one whole big mess (yes, the memcpy() will look very hot in profiles, but then doing extra work to avoid the memcpy() will cause spread-out overhead that is a lot worse and harder to think about). This is why a simple read()/write() loop often _beats_ mmap approaches. And often it's actually better to not even have big buffers (ie the old "avoid system calls by aggregation" approach) because that just blows your cache away. Right now, the fastest way to copy a file is apparently by doing lots of ~8kB read/write pairs (that data may be slightly stale, but it was true at some point). 
Never mind the system call overhead - just having the extra buffer stay in the L1 cache and avoiding page faults from mmap is a bigger win. And I don't think mmap _can_ beat that. It's fundamental. In contrast, direct page cache accesses really can do so. Exactly because they don't touch any page tables at all, and because they can take advantage of internal kernel data structure layout and move pages around without any cost.. Linus ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-10 18:36 ` Linus Torvalds @ 2003-10-10 19:03 ` Andrea Arcangeli 0 siblings, 0 replies; 64+ messages in thread From: Andrea Arcangeli @ 2003-10-10 19:03 UTC (permalink / raw) To: Linus Torvalds Cc: Jamie Lokier, Trond Myklebust, Ulrich Drepper, Linux Kernel On Fri, Oct 10, 2003 at 11:36:29AM -0700, Linus Torvalds wrote: > > On Fri, 10 Oct 2003, Andrea Arcangeli wrote: > > > > O_DIRECT only walk the pagetables, no pte mangling, no tlb flushes, the > > TLB is preserved fully. > > Yes. However, it's even _nicer_ if you don't need to walk the page tables > at all. > > Quite a lot of operations could be done directly on the page cache. I'm > not a huge fan of mmap() myself - the biggest advantage of mmap is when > you don't know your access patterns, and you have reasonably good > locality. In many other cases mmap is just a total loss, because the page > table walking is often more expensive than even a memcpy(). > > That's _especially_ true if you have to move mappings around, and you have > to invalidate TLB's. agreed. that's what remap_file_pages does infact. > memcpy() often gets a bad name. Yeah, memory is slow, but especially if > you copy something you just worked on, you're actually often better off > letting the CPU cache do its job, rather than walking page tables and > trying to be clever. > > Just as an example: copying often means that you don't need nearly as much > locking and synchronization - which in turn avoids one whole big mess > (yes, the memcpy() will look very hot in profiles, but then doing extra > work to avoid the memcpy() will cause spread-out overhead that is a lot > worse and harder to think about). > > This is why a simple read()/write() loop often _beats_ mmap approaches. > And often it's actually better to not even have big buffers (ie the old > "avoid system calls by aggregation" approach) because that just blows your > cache away. 
> > Right now, the fastest way to copy a file is apparently by doing lots of > ~8kB read/write pairs (that data may be slightly stale, but it was true at > some point). Never mind the system call overhead - just having the extra > buffer stay in the L1 cache and avoiding page faults from mmap is a bigger > win. > > And I don't think mmap _can_ beat that. It's fundamental. That's my whole point, agreed. Though using mmap would sure be cleaner and simpler. > In contrast, direct page cache accesses really can do so. Exactly because > they don't touch any page tables at all, and because they can take > advantage of internal kernel data structure layout and move pages around > without any cost.. Which basically means removing O_DIRECT from the open syscall and still using read/write, if I understand correctly. With today's commodity dirt-cheap hardware, it has been proven that walking the pte (NOTE: only walking, no mangling and no tlb flushing) is much faster than doing the memcpy. More cpu is left free for the other tasks and the cost of the I/O is the same. The difference isn't measurable in I/O bound tasks, but a database is both IO bound and cpu bound at the same time, so for a db it's measurable. At least this is the case for Oracle. I believe Joel has access to these numbers too, and that's why he's interested in O_DIRECT in the first place. With a faster memory bus things may change of course (to the point where there's no difference between the two models), but still I don't see how walking three pointers can be more expensive than copying 512 bytes of data (assuming the smaller blocksize). And you're ignoring that the CPU *has* to walk those three pointers _anyways_ implicitly to allow the memcpy to run. So as far as I can tell the memcpy is pure overhead that can be avoided with O_DIRECT. 
This is also why I rejected all approaches that wanted to allow readahead via O_DIRECT by preloading data in pagecache; my argument is: if you can't avoid the memcpy you must not use O_DIRECT. The single object of O_DIRECT is to avoid the memcpy; the cache pollution avoidance is a very minor issue, the main point is to avoid the memcpy. I also posted a number of benchmarks at some point, where I've shown a dramatic reduction of cpu usage, up to 10%, on normal cheap hardware without reduction of I/O bandwidth. This means 10% more cpu to use for doing something useful in the cpu bound part of the database. The main downside of O_DIRECT is, I believe, conceptual, starting from the ugliness inside the kernel, like the cache coherency handling and the i_alloc_sem needed to avoid reads running in parallel with block allocations, etc... but I doubt the practical effect can easily be beaten in the numbers. That said, maybe we can provide a nicer API that does the same thing internally, I don't know, but certainly that can't be remap_file_pages because that does a very different thing. Andrea - If you prefer relying on open source software, check these links: rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.[45]/ http://www.cobite.com/cvsps/ ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-09 22:16 statfs() / statvfs() syscall ballsup Trond Myklebust 2003-10-09 22:26 ` Linus Torvalds @ 2003-10-09 23:16 ` Andreas Dilger 2003-10-09 23:24 ` Linus Torvalds 1 sibling, 1 reply; 64+ messages in thread From: Andreas Dilger @ 2003-10-09 23:16 UTC (permalink / raw) To: Trond Myklebust; +Cc: Ulrich Drepper, Linus Torvalds, Linux Kernel On Oct 09, 2003 18:16 -0400, Trond Myklebust wrote: > We appear to have a problem with the new statfs interface > in 2.6.0... > > The problem is that as far as userland is concerned, 'struct statfs' > reports f_blocks, f_bfree,... in units of the "optimal transfer size": > f_bsize (backwards compatibility). > > OTOH 'struct statvfs' reports the same values in units of the fragment > size (the blocksize of the underlying filesyste): f_frsize. (says > Single User Spec v2) > > Both are apparently supposed to syscall down via sys_statfs()... > > Question: how we're supposed to reconcile the two cases for something > like NFS, where these 2 values are supposed to differ? Actually, what is also a problem is that there is no hook for the system to return different results for the 32-bit and 64-bit statfs structs. Because Lustre is used on very large filesystems (i.e. 100TB+) we can't fit the result into 32 bits without increasing f_bsize and reducing f_bavail/f_bfree/f_blocks proportionately. It would be nice if we could know in advance if we are returning values for sys_statfs() or sys_statfs64() (e.g. by sys_statfs64() calling an optional sb->s_op->statfs64() method if available) so we didn't have to do this munging. We can't just assume 64-bit results, or callers of sys_statfs() will get EOVERFLOW instead of slightly inaccurate results. Cheers, Andreas -- Andreas Dilger http://sourceforge.net/projects/ext2resize/ http://www-mddsp.enel.ucalgary.ca/People/adilger/ ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: statfs() / statvfs() syscall ballsup... 2003-10-09 23:16 ` Andreas Dilger @ 2003-10-09 23:24 ` Linus Torvalds 0 siblings, 0 replies; 64+ messages in thread From: Linus Torvalds @ 2003-10-09 23:24 UTC (permalink / raw) To: Andreas Dilger; +Cc: Trond Myklebust, Ulrich Drepper, Linux Kernel On Thu, 9 Oct 2003, Andreas Dilger wrote: > > It would be nice if we could know in advance if we are returning values > for sys_statfs() or sys_statfs64() (e.g. by sys_statfs64() calling an > optional sb->s_op->statfs64() method if available) so we didn't have to > do this munging. We can't just assume 64-bit results, or callers of > sys_statfs() will get EOVERFLOW instead of slightly inaccurate results. This is something that sys_statfs() could do on its own. It's probably always better to try to scale the block size up than to return EOVERFLOW. (Some things can't be scaled up, of course, like f_ffree etc. But it should be trivial to just do a "try to shift to make it fit" in the vfs_statfs_native() function in fs/open.c). Linus ^ permalink raw reply [flat|nested] 64+ messages in thread