From: Greg Stark <gsstark@mit.edu>
To: Joel Becker <Joel.Becker@oracle.com>
Cc: Jamie Lokier <jamie@shareable.org>,
	Linus Torvalds <torvalds@osdl.org>,
	Trond Myklebust <trond.myklebust@fys.uio.no>,
	Ulrich Drepper <drepper@redhat.com>,
	Linux Kernel <linux-kernel@vger.kernel.org>
Subject: Re: statfs() / statvfs() syscall ballsup...
Date: 12 Oct 2003 11:31:31 -0400
Message-ID: <87ekxi4gmk.fsf@stark.dyndns.tv>
In-Reply-To: <20031010163300.GC28773@ca-server1.us.oracle.com>

Joel Becker <Joel.Becker@oracle.com> writes:

> On Fri, Oct 10, 2003 at 05:01:44PM +0100, Jamie Lokier wrote:
> > Why don't you _share_ the App's cache with the kernel's?  That's what
> > mmap() and remap_file_pages() are for.
> 
> 	Because you can't force flush/read.  You can't say "I need you
> to go to disk for this."  If you do, you're doing O_DIRECT through mmap
> (yes, I've pondered it) and you end up with perhaps the same races folks
> worry about.  Doesn't mean it can't be done.

There are other reasons databases want to control their own cache. The
application knows more about the usage and the future usage of the data than
the kernel does.

There's currently a thread on the Postgres mailing list about a problem with
an administrative job that needs to touch potentially all the blocks of a table.
The more frequently it's run the less work it has to do, so the recommendation
is to run it very frequently.

However, on busy servers every run causes a lot of pain because the kernel
evicts all the cached data in favour of the data this job touches. And worse,
there's no way to indicate that the i/o it's doing is lower priority, so
i/o-bound servers get hit dramatically.

Postgres knows that the fact that this job touched the data means nothing for
the regular functioning of the server, and it knows that the i/o it's doing is
low priority. It needs some way to tell the kernel that this job is low
priority not only for cpu resources but for cache and i/o resources as well.
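
As an illustration of the kind of cache hint described above, here is a
minimal sketch in C. It assumes posix_fadvise() is available, which this
message does not mention. The helper name scan_without_caching is
hypothetical. The hint is advisory only, and it does nothing about i/o
priority.

```c
#define _POSIX_C_SOURCE 200112L  /* for posix_fadvise() */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the original message): read a file from
 * start to end and, after each chunk, advise the kernel that the chunk
 * will not be needed again so it can drop those pages from its cache.
 * Returns the number of bytes read, or -1 on error.
 */
long scan_without_caching(const char *path)
{
    char buf[8192];
    long total = 0;
    off_t done = 0;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Advisory: we intend to read sequentially. Failure is not fatal. */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n < 0) {
            close(fd);
            return -1;
        }
        if (n == 0)
            break;
        /* Advisory: the range just read will not be reused by this job. */
        (void)posix_fadvise(fd, done, n, POSIX_FADV_DONTNEED);
        done += n;
        total += n;
    }
    close(fd);
    return total;
}
```

Note that POSIX_FADV_DONTNEED applies to the file's pages regardless of who
brought them into the cache, so it is a blunt approximation of "this job's
reads are low priority", not an equivalent.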

There are other cases. Oracle, for example, puts blocks it reads due to full
table scans at the end of its LRU list to avoid a similar effect on the cache.

Then there's the transaction log. The database needs to know when the
transaction log has been written to disk. The blocks it writes there aren't
worth caching unless the database crashes right then. And ideally it should
bypass any disk i/o reordering and write the data to the transaction log
*first*. Raw bandwidth is not as important as latency on writes to the
transaction log.
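
The "needs to know when the log is written" part can be sketched with
standard calls. This is a minimal sketch, assuming a plain write() followed
by fdatasync(). The helper name log_append_durable is hypothetical. It does
not bypass i/o reordering or give the log priority over other writes, which
is the harder half of the problem described above.

```c
#define _POSIX_C_SOURCE 200112L  /* for fdatasync() */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the original message): append one record
 * to an already-open log file descriptor and return only after the kernel
 * reports the data as written to stable storage.
 * Returns 0 on success, -1 on failure. EINTR handling is omitted.
 */
int log_append_durable(int fd, const void *rec, size_t len)
{
    const char *p = rec;

    assert(rec != NULL);
    while (len > 0) {
        ssize_t n = write(fd, p, len);  /* may write fewer bytes than asked */
        if (n < 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    /* Flush file data (not necessarily all metadata) to the device. */
    return fdatasync(fd);
}
```

A caller would open the log with O_WRONLY | O_APPEND and report a commit to
the client only after this function returns 0.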

The reason mmap is tempting is not because it's faster. It's because it
provides a nice clean abstract interface. The database could simply mmap the
entire database and then pretend it is an in-memory database. The code would
be much simpler and more complex algorithms would be easier to implement.

Unfortunately there are some problems with mmap. Currently it would be just as
complex to use as read/write because the address space can cover only a
fraction of the database. On a 64-bit machine you might be able to mmap the
entire database and then use custom syscalls to tell the kernel which pages to
keep in cache and which to sync.
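
For a file small enough to map whole, the "pretend it is an in-memory
database" idea looks roughly like the sketch below. It uses the existing
mmap() and msync() calls, not the custom syscalls imagined above. The helper
name mapped_update is hypothetical. For simplicity it flushes the whole
mapping instead of only the touched pages.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the original message): map an entire file,
 * overwrite len bytes at offset off as if it were ordinary memory, then
 * synchronously flush the mapping back to the file.
 * Returns 0 on success, -1 on failure or if the range is out of bounds.
 */
int mapped_update(const char *path, size_t off, const void *src, size_t len)
{
    struct stat st;
    int rc = -1;
    int fd;

    assert(path != NULL && src != NULL);
    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    if (fstat(fd, &st) == 0 && st.st_size > 0 &&
        off <= (size_t)st.st_size && len <= (size_t)st.st_size - off) {
        size_t size = (size_t)st.st_size;
        char *base = mmap(NULL, size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        if (base != MAP_FAILED) {
            memcpy(base + off, src, len);     /* plain memory write */
            rc = msync(base, size, MS_SYNC);  /* wait for write-back */
            munmap(base, size);
        }
    }
    close(fd);
    return rc;
}
```

msync() covers the "which to sync" half. Telling the kernel which pages to
keep in cache has no exact equivalent here; mlock() is the closest standard
call.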

-- 
greg

