From: Greg Stark <email@example.com>
To: Joel Becker <Joel.Becker@oracle.com>
Cc: Jamie Lokier <firstname.lastname@example.org>,
	Linus Torvalds <email@example.com>,
	Trond Myklebust <firstname.lastname@example.org>,
	Ulrich Drepper <email@example.com>,
	Linux Kernel <firstname.lastname@example.org>
Subject: Re: statfs() / statvfs() syscall ballsup...
Date: 12 Oct 2003 11:31:31 -0400
Message-ID: <email@example.com>
In-Reply-To: <20031010163300.GC28773@ca-server1.us.oracle.com>

Joel Becker <Joel.Becker@oracle.com> writes:

> On Fri, Oct 10, 2003 at 05:01:44PM +0100, Jamie Lokier wrote:
> > Why don't you _share_ the App's cache with the kernel's? That's what
> > mmap() and remap_file_pages() are for.
>
> Because you can't force flush/read. You can't say "I need you
> to go to disk for this." If you do, you're doing O_DIRECT through mmap
> (yes, I've pondered it) and you end up with perhaps the same races folks
> worry about. Doesn't mean it can't be done.

There are other reasons databases want to control their own cache. The
application knows more about the usage and the future usage of the data
than the kernel does.

There's currently a thread on the Postgres mailing list about a problem
with an administrative job that needs to touch potentially all the
blocks of a table. The more frequently it's run the less work it has to
do, so the recommendation is to run it very frequently.

However on busy servers whenever it's run it causes lots of pain because
the kernel flushes all the cached data in favour of the data this job
touches. And worse, there's no way to indicate that the i/o it's doing
is lower priority, so i/o bound servers get hit dramatically.

Postgres knows the fact that this job touched the data means nothing for
the regular functioning of the server, and it knows that the i/o it's
doing is low priority. It needs some way to indicate to the kernel that
this job is low priority not only for cpu resources but for cache
resources and i/o resources as well.

There are other cases. Oracle, for example, puts blocks it reads due to
full table scans at the end of its LRU list to avoid a similar effect on
the cache.

Then there's the transaction log. The database needs to know when the
transaction log is written to disk. The blocks it writes there won't be
useful to cache unless the database crashed right there. And ideally it
should bypass any disk i/o reordering and write the data to the
transaction log *first*. Raw bandwidth is not as important as latency on
writes to the transaction log.

The reason mmap is tempting is not because it's faster. It's because it
provides a nice clean abstract interface. The database could simply mmap
the entire database and then pretend it is an in-memory database. The
code would be much simpler and more complex algorithms would be easier
to implement.

Unfortunately there are some problems with mmap. Currently it would be
just as complex to use as read/write because the address space is
limited to only a fraction of the database. On a 64 bit machine you
might be able to mmap the entire database and then use custom syscalls
to indicate to the kernel which pages to keep in cache and which to
sync.

-- 
greg
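
As a rough illustration of the transaction-log and maintenance-scan
points above, here is a minimal sketch using calls that exist on Linux
today: fdatasync() to learn when a log record has reached disk, and
posix_fadvise(POSIX_FADV_DONTNEED) as a hint that the pages just touched
are not worth keeping. The function names log_append_durable() and
scan_low_priority() are made up for this example and are not taken from
Postgres or Oracle. It only covers the cache side: it does not lower the
i/o priority of the scan, and it does not make the log write bypass disk
i/o reordering.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: append one record to the transaction log and
 * return only once the data has been flushed to stable storage.
 * Returns 0 on success, -1 on error.
 */
int log_append_durable(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, retry the write */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Block until the data (not necessarily all metadata) is on disk. */
    if (fdatasync(fd) != 0)
        return -1;

    /*
     * Log blocks are only re-read after a crash, so hint that the
     * kernel may drop them. Purely advisory; the result is ignored.
     */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    return 0;
}

/*
 * Hypothetical helper: read a whole file the way a maintenance job
 * might, then hint that its pages need not stay cached.
 * Returns the number of bytes read, or -1 on error.
 */
long scan_low_priority(int fd)
{
    char buf[8192];
    off_t off = 0;

    for (;;) {
        ssize_t n = pread(fd, buf, sizeof buf, off);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            break;              /* end of file */
        off += n;
    }

    /*
     * Crude: this asks the kernel to drop the file's clean cached
     * pages even if other processes would have liked them kept.
     */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    return (long)off;
}
```

A caller would use log_append_durable() at commit time and treat a
return of 0 as "the record is on disk". The scan helper is a blunt
approximation of what Oracle does by placing full-scan blocks at the end
of its own LRU list.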