From: Andrea Arcangeli <andrea@suse.de>
To: Linus Torvalds <torvalds@osdl.org>
Cc: Jamie Lokier <jamie@shareable.org>,
	Trond Myklebust <trond.myklebust@fys.uio.no>,
	Ulrich Drepper <drepper@redhat.com>,
	Linux Kernel <linux-kernel@vger.kernel.org>
Subject: Re: statfs() / statvfs() syscall ballsup...
Date: Fri, 10 Oct 2003 21:03:38 +0200
Message-ID: <20031010190338.GI16013@velociraptor.random>
In-Reply-To: <Pine.LNX.4.44.0310101126120.20420-100000@home.osdl.org>

On Fri, Oct 10, 2003 at 11:36:29AM -0700, Linus Torvalds wrote:
> 
> On Fri, 10 Oct 2003, Andrea Arcangeli wrote:
> > 
> > O_DIRECT only walk the pagetables, no pte mangling, no tlb flushes, the
> > TLB is preserved fully.
> 
> Yes. However, it's even _nicer_ if you don't need to walk the page tables 
> at all.
> 
> Quite a lot of operations could be done directly on the page cache. I'm 
> not a huge fan of mmap() myself - the biggest advantage of mmap is when 
> you don't know your access patterns, and you have reasonably good 
> locality. In many other cases mmap is just a total loss, because the page 
> table walking is often more expensive than even a memcpy().
> 
> That's _especially_ true if you have to move mappings around, and you have 
> to invalidate TLB's. 

Agreed. That's what remap_file_pages does, in fact.

> memcpy() often gets a bad name. Yeah, memory is slow, but especially if 
> you copy something you just worked on, you're actually often better off 
> letting the CPU cache do its job, rather than walking page tables and 
> trying to be clever.
> 
> Just as an example: copying often means that you don't need nearly as much 
> locking and synchronization - which in turn avoids one whole big mess 
> (yes, the memcpy() will look very hot in profiles, but then doing extra 
> work to avoid the memcpy() will cause spread-out overhead that is a lot 
> worse and harder to think about).
> 
> This is why a simple read()/write() loop often _beats_ mmap approaches. 
> And often it's actually better to not even have big buffers (ie the old 
> "avoid system calls by aggregation" approach) because that just blows your 
> cache away.
> 
> Right now, the fastest way to copy a file is apparently by doing lots of
> ~8kB read/write pairs (that data may be slightly stale, but it was true at
> some point). Never mind the system call overhead - just having the extra
> buffer stay in the L1 cache and avoiding page faults from mmap is a bigger
> win.
> 
> And I don't think mmap _can_ beat that. It's fundamental. 

That's my whole point, agreed. Though using mmap would surely be cleaner
and simpler.
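
A minimal sketch of the small-buffer read()/write() copy loop described in
the quoted text above, assuming a POSIX system and an 8192-byte buffer to
match the "~8kB" figure. The function name copy_file_small_buf and the
constant COPY_BUF_SIZE are made up for illustration, and nothing here
claims that this size is still the fastest on current hardware.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Assumption: 8192 bytes stands in for the "~8kB" mentioned above. */
enum { COPY_BUF_SIZE = 8192 };

/*
 * Copy src to dst with small read()/write() pairs, so the same buffer
 * is reused on every iteration and can stay cache-resident.
 * Returns 0 on success, -1 on error (errno is set by the failing call).
 */
int copy_file_small_buf(const char *src, const char *dst)
{
    char buf[COPY_BUF_SIZE];
    int rc = 0;

    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;

    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }

    for (;;) {
        ssize_t n = read(in, buf, sizeof buf);
        if (n == 0)
            break;                      /* end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, retry the read */
            rc = -1;
            break;
        }

        /* write() may be partial, so loop until the chunk is flushed. */
        char *p = buf;
        while (n > 0) {
            ssize_t w = write(out, p, (size_t)n);
            if (w < 0) {
                if (errno == EINTR)
                    continue;           /* interrupted, retry the write */
                rc = -1;
                break;
            }
            p += w;
            n -= w;
        }
        if (rc < 0)
            break;
    }

    close(in);
    if (close(out) < 0)
        rc = -1;
    return rc;
}
```

Calling copy_file_small_buf("a", "b") copies file a to b; the buffer size
is the only tuning knob, which is the point of the comparison with mmap.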

> In contrast, direct page cache accesses really can do so. Exactly because 
> they don't touch any page tables at all, and because they can take 
> advantage of internal kernel data structure layout and move pages around 
> without any cost..

Which basically means removing O_DIRECT from the open syscall and still
using read/write, if I understand correctly.

With today's dirt-cheap commodity hardware, it has been proven that
walking the ptes (NOTE: only walking, no mangling and no tlb flushing) is
much faster than doing the memcpy. More cpu is left free for the other
tasks and the cost of the I/O is the same. The difference isn't
measurable in I/O bound tasks, but a database is both I/O bound and cpu
bound at the same time, so for a db it's measurable. At least this is
the case for Oracle. I believe Joel has access to these numbers too, and
that's why he's interested in O_DIRECT in the first place.

With a faster membus things may change of course (to the point where
there's no difference between the two models), but still I don't see how
walking three pointers can be more expensive than copying 512 bytes of
data (assuming the smallest blocksize). And you're ignoring that the CPU
*has* to walk those three pointers _anyway_, implicitly, to allow the
memcpy to run. So as far as I can tell the memcpy is pure overhead that
can be avoided with O_DIRECT.
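
For reference, a minimal userspace sketch of an O_DIRECT read, assuming
Linux with glibc or musl. The helper name direct_read and its align
parameter are hypothetical; the required alignment depends on the
filesystem and device, so the caller has to supply a suitable value (the
buffer, offset and length generally all need to be aligned). open() can
fail with EINVAL on filesystems that do not support O_DIRECT.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE     /* exposes O_DIRECT in <fcntl.h> */
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Read len bytes at offset off from path, bypassing the page cache so
 * the device transfers straight into the user buffer (no memcpy from
 * pagecache). On success returns the number of bytes read and stores
 * an aligned buffer in *out that the caller must free(). Returns -1 on
 * failure, with *out set to NULL.
 *
 * Assumption: align is a power of two, a multiple of sizeof(void *),
 * and satisfies the device's O_DIRECT alignment; off and len are
 * multiples of it.
 */
ssize_t direct_read(const char *path, off_t off, size_t len,
                    size_t align, void **out)
{
    void *buf = NULL;

    *out = NULL;
    if (posix_memalign(&buf, align, len) != 0)
        return -1;

    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    ssize_t n = pread(fd, buf, len, off);
    close(fd);
    if (n < 0) {
        free(buf);
        return -1;
    }

    *out = buf;
    return n;
}
```

A call such as direct_read("datafile", 0, 4096, 4096, &buf) reads the
first 4096 bytes, where 4096 is only an example alignment and not a value
taken from this thread.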

This is also why I rejected all approaches that wanted to allow
readahead via O_DIRECT by preloading data into pagecache. My argument is:
if you can't avoid the memcpy you must not use O_DIRECT. The single
purpose of O_DIRECT is to avoid the memcpy; the cache pollution avoidance
is a very minor issue.

I also posted a number of benchmarks at some point, where I showed a
dramatic reduction of the cpu usage, up to 10%, on normal cheap hardware
without any reduction of I/O bandwidth. This means up to 10% more cpu to
use for doing something useful in the cpu bound part of the database.

The main downside of O_DIRECT is, I believe, conceptual, starting from
the ugliness inside the kernel, like the cache coherency handling and the
i_alloc_sem needed to keep reads from running in parallel with block
allocations, etc... but I doubt the practical effect can be easily beaten
in the numbers. That said, maybe we can provide a nicer API that does the
same thing internally, I don't know, but that certainly can't be
remap_file_pages because that does a very different thing.

Andrea - If you prefer relying on open source software, check these links:
	    rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.[45]/
	    http://www.cobite.com/cvsps/
