From: Andrea Arcangeli <andrea@suse.de>
To: Linus Torvalds <torvalds@osdl.org>
Cc: Jamie Lokier <jamie@shareable.org>,
	Trond Myklebust <trond.myklebust@fys.uio.no>,
	Ulrich Drepper <drepper@redhat.com>,
	Linux Kernel <linux-kernel@vger.kernel.org>
Subject: Re: statfs() / statvfs() syscall ballsup...
Date: Fri, 10 Oct 2003 21:03:38 +0200
Message-ID: <20031010190338.GI16013@velociraptor.random>
In-Reply-To: <Pine.LNX.4.44.0310101126120.20420-100000@home.osdl.org>

On Fri, Oct 10, 2003 at 11:36:29AM -0700, Linus Torvalds wrote:
> 
> On Fri, 10 Oct 2003, Andrea Arcangeli wrote:
> > 
> > O_DIRECT only walk the pagetables, no pte mangling, no tlb flushes, the
> > TLB is preserved fully.
> 
> Yes. However, it's even _nicer_ if you don't need to walk the page tables 
> at all.
> 
> Quite a lot of operations could be done directly on the page cache. I'm 
> not a huge fan of mmap() myself - the biggest advantage of mmap is when 
> you don't know your access patterns, and you have reasonably good 
> locality. In many other cases mmap is just a total loss, because the page 
> table walking is often more expensive than even a memcpy().
> 
> That's _especially_ true if you have to move mappings around, and you have 
> to invalidate TLB's. 

Agreed. That's what remap_file_pages does, in fact.

> memcpy() often gets a bad name. Yeah, memory is slow, but especially if 
> you copy something you just worked on, you're actually often better off 
> letting the CPU cache do its job, rather than walking page tables and 
> trying to be clever.
> 
> Just as an example: copying often means that you don't need nearly as much 
> locking and synchronization - which in turn avoids one whole big mess 
> (yes, the memcpy() will look very hot in profiles, but then doing extra 
> work to avoid the memcpy() will cause spread-out overhead that is a lot 
> worse and harder to think about).
> 
> This is why a simple read()/write() loop often _beats_ mmap approaches. 
> And often it's actually better to not even have big buffers (ie the old 
> "avoid system calls by aggregation" approach) because that just blows your 
> cache away.
> 
> Right now, the fastest way to copy a file is apparently by doing lots of
> ~8kB read/write pairs (that data may be slightly stale, but it was true at
> some point). Never mind the system call overhead - just having the extra
> buffer stay in the L1 cache and avoiding page faults from mmap is a bigger
> win.
> 
> And I don't think mmap _can_ beat that. It's fundamental. 

That's my whole point, agreed. Though using mmap would surely be cleaner
and simpler.
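
For reference, a minimal sketch of the read()/write() copy loop described in the quoted text, assuming an 8 kB chunk (the "~8kB" figure above, which Linus himself flags as possibly stale). It is not code from this thread; the names copy_fd and COPY_CHUNK are hypothetical.

```c
#include <errno.h>
#include <unistd.h>

#define COPY_CHUNK 8192   /* assumed chunk size, taken from the "~8kB" figure above */

/* Copy everything from in_fd to out_fd. Returns 0 on success, -1 on error. */
int copy_fd(int in_fd, int out_fd)
{
    char buf[COPY_CHUNK];   /* small buffer, intended to stay cache-resident */

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n == 0)
            return 0;                   /* end of input */
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, retry the read */
            return -1;
        }

        /* write() may be short, so loop until the whole chunk is out */
        for (ssize_t done = 0; done < n; ) {
            ssize_t w = write(out_fd, buf + done, (size_t)(n - done));
            if (w < 0) {
                if (errno == EINTR)
                    continue;           /* interrupted, retry the write */
                return -1;
            }
            done += w;
        }
    }
}
```

Whether 8 kB is still the best chunk size depends on the machine's cache sizes, so treat the constant as something to measure rather than a recommendation.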

> In contrast, direct page cache accesses really can do so. Exactly because 
> they don't touch any page tables at all, and because they can take 
> advantage of internal kernel data structure layout and move pages around 
> without any cost..

Which basically means removing O_DIRECT from the open syscall and still
using read/write, if I understand correctly.

With today's dirt-cheap commodity hardware, it has been proven that
walking the ptes (NOTE: only walking, no mangling and no tlb flushing) is
much faster than doing the memcpy. More cpu is left free for the other
tasks and the cost of the I/O is the same. The difference isn't
measurable in I/O bound tasks, but a database is both I/O bound and cpu
bound at the same time, so for a db it's measurable. At least this is
the case for Oracle. I believe Joel has access to these numbers too, and
that's why he's interested in O_DIRECT in the first place.

With a faster membus things may change of course (to the point where
there's no difference between the two models), but still I don't see how
walking three pointers can be more expensive than copying 512 bytes of
data (assuming the smallest blocksize). And you're ignoring that the CPU
*has* to walk those three pointers _anyway_, implicitly, to allow the
memcpy to run. So as far as I can tell the memcpy is pure overhead that
can be avoided with O_DIRECT.

This is also why I rejected all approaches that wanted to allow
readahead via O_DIRECT by preloading data in pagecache. My argument is:
if you can't avoid the memcpy you must not use O_DIRECT. The single
objective of O_DIRECT is to avoid the memcpy; avoiding cache pollution
is a very minor issue by comparison.

I also posted a number of benchmarks at some point, where I showed a
dramatic reduction of the cpu usage, up to 10%, on normal cheap
hardware w/o any reduction of I/O bandwidth. This means 10% more cpu
to use for doing something useful in the cpu bound part of the database.

The main downside of O_DIRECT is, I believe, conceptual, starting from the
ugliness inside the kernel, like the cache coherency handling and the
i_alloc_sem needed to prevent reads from running in parallel with block
allocations, etc... but I doubt the practical effect can be easily beaten
in the numbers. That said, maybe we can provide a nicer API that does the
same thing internally, I don't know, but certainly that can't be
remap_file_pages because that does a very different thing.

Andrea - If you prefer relying on open source software, check these links:
	    rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.[45]/
	    http://www.cobite.com/cvsps/
