From: Andrea Arcangeli <firstname.lastname@example.org>
To: Linus Torvalds <email@example.com>
Cc: Jamie Lokier <firstname.lastname@example.org>, Trond Myklebust <email@example.com>,
    Ulrich Drepper <firstname.lastname@example.org>, Linux Kernel <email@example.com>
Subject: Re: statfs() / statvfs() syscall ballsup...
Date: Fri, 10 Oct 2003 21:03:38 +0200
Message-ID: <20031010190338.GI16013@velociraptor.random>
In-Reply-To: <Pine.LNX.firstname.lastname@example.org>

On Fri, Oct 10, 2003 at 11:36:29AM -0700, Linus Torvalds wrote:
>
> On Fri, 10 Oct 2003, Andrea Arcangeli wrote:
> >
> > O_DIRECT only walk the pagetables, no pte mangling, no tlb flushes, the
> > TLB is preserved fully.
>
> Yes. However, it's even _nicer_ if you don't need to walk the page tables
> at all.
>
> Quite a lot of operations could be done directly on the page cache. I'm
> not a huge fan of mmap() myself - the biggest advantage of mmap is when
> you don't know your access patterns, and you have reasonably good
> locality. In many other cases mmap is just a total loss, because the page
> table walking is often more expensive than even a memcpy().
>
> That's _especially_ true if you have to move mappings around, and you have
> to invalidate TLB's.

agreed. that's what remap_file_pages does in fact.

> memcpy() often gets a bad name. Yeah, memory is slow, but especially if
> you copy something you just worked on, you're actually often better off
> letting the CPU cache do its job, rather than walking page tables and
> trying to be clever.
>
> Just as an example: copying often means that you don't need nearly as much
> locking and synchronization - which in turn avoids one whole big mess
> (yes, the memcpy() will look very hot in profiles, but then doing extra
> work to avoid the memcpy() will cause spread-out overhead that is a lot
> worse and harder to think about).
>
> This is why a simple read()/write() loop often _beats_ mmap approaches.
> And often it's actually better to not even have big buffers (ie the old
> "avoid system calls by aggregation" approach) because that just blows your
> cache away.
>
> Right now, the fastest way to copy a file is apparently by doing lots of
> ~8kB read/write pairs (that data may be slightly stale, but it was true at
> some point). Never mind the system call overhead - just having the extra
> buffer stay in the L1 cache and avoiding page faults from mmap is a bigger
> win.
>
> And I don't think mmap _can_ beat that. It's fundamental.

That's my whole point, agreed. Though using mmap would certainly be
cleaner and simpler.

> In contrast, direct page cache accesses really can do so. Exactly because
> they don't touch any page tables at all, and because they can take
> advantage of internal kernel data structure layout and move pages around
> without any cost..

Which basically means removing O_DIRECT from the open syscall and still
using read/write, if I understand correctly.

With today's dirt-cheap commodity hardware, it has been proven that
walking the pte (NOTE: only walking, no mangling and no tlb flushing) is
much faster than doing the memcpy. More cpu is left free for the other
tasks and the cost of the I/O is the same. The difference isn't
measurable in I/O bound tasks, but a database is both I/O bound and cpu
bound at the same time, so for a db it's measurable. At least this is
the case for Oracle. I believe Joel has access to these numbers too, and
that's why he's interested in O_DIRECT in the first place.

With a faster membus things may change of course (to the point where
there's no difference between the two models), but still I don't see how
walking three pointers can be more expensive than copying 512 bytes of
data (assuming the smaller blocksize). And you're ignoring that the CPU
*has* to walk those three pointers _anyway_, implicitly, to allow the
memcpy to run. So as far as I can tell the memcpy is pure overhead that
can be avoided with O_DIRECT.
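
For reference, a minimal sketch of the small-buffer read()/write() loop
that Linus describes in the quoted text above. The names copy_fd and
COPY_CHUNK are hypothetical, picked only for this illustration, and the
8192-byte size simply mirrors the "~8kB" figure from the quote; nothing
here is taken from a benchmark in the thread.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

#define COPY_CHUNK 8192   /* assumption: mirrors the "~8kB" in the quote */

/* Copy everything readable from src_fd to dst_fd through one small
 * buffer that is reused for every chunk.
 * Returns the number of bytes copied, or -1 on error (errno is set). */
long long copy_fd(int src_fd, int dst_fd)
{
    char buf[COPY_CHUNK];
    long long total = 0;

    assert(src_fd >= 0 && dst_fd >= 0);

    for (;;) {
        ssize_t n = read(src_fd, buf, sizeof buf);
        if (n == 0)
            return total;               /* end of input */
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, retry the read */
            return -1;
        }

        /* write() may accept fewer bytes than asked; loop until done */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(dst_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += n;
    }
}
```

Every byte still crosses the kernel/user boundary twice here, which is
the memcpy cost the surrounding paragraphs argue about.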

This is also why I rejected all approaches that wanted to allow
readahead via O_DIRECT by preloading data in pagecache. My argument is:
if you can't avoid the memcpy you must not use O_DIRECT. The single
objective of O_DIRECT is to avoid the memcpy; the cache pollution
avoidance is a very minor issue.

I also posted a number of benchmarks at some point, where I showed a
dramatic reduction of the cpu usage, up to 10%, on normal cheap hardware
w/o reduction of I/O bandwidth. This means up to 10% more cpu to use for
doing something useful in the cpu bound part of the database.

The main downside of O_DIRECT is, I believe, conceptual, starting from
the ugliness inside the kernel, like the cache coherency handling and
the i_alloc_sem needed to keep reads from running in parallel with block
allocations, etc. But I doubt its practical effect can easily be beaten
in the numbers.

That said, maybe we can provide a nicer API that does the same thing
internally, I don't know, but certainly that can't be remap_file_pages
because that does a very different thing.

Andrea - If you prefer relying on open source software, check these links:
	    rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2./
	    http://www.cobite.com/cvsps/
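
For readers who have not used the flag under discussion, here is a
hedged sketch of a single O_DIRECT read from user space. The helper
names alloc_io_buffer and read_direct are hypothetical, and the
4096-byte IO_ALIGN value is an assumption: the real alignment
requirement for the buffer, length and offset depends on the kernel
version, filesystem and device (the message above mentions 512-byte
blocks as the smaller case).

```c
#define _GNU_SOURCE          /* exposes O_DIRECT in <fcntl.h> on glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define IO_ALIGN 4096        /* assumed alignment, see the note above */

/* Allocate a buffer aligned for O_DIRECT I/O; release it with free(). */
void *alloc_io_buffer(size_t len)
{
    void *p = NULL;
    if (posix_memalign(&p, IO_ALIGN, len) != 0)
        return NULL;
    return p;
}

/* Read len bytes at offset from path with O_DIRECT, so the data is
 * transferred into buf without an intermediate copy through the
 * page cache.
 * buf, len and offset must all be multiples of IO_ALIGN.
 * Returns the number of bytes read, or -1 on error (errno is set). */
ssize_t read_direct(const char *path, void *buf, size_t len, off_t offset)
{
    assert((uintptr_t)buf % IO_ALIGN == 0);
    assert(len % IO_ALIGN == 0 && offset % IO_ALIGN == 0);

    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;           /* e.g. EINVAL if the fs lacks O_DIRECT */

    ssize_t n = pread(fd, buf, len, offset);
    int saved = errno;       /* close() must not clobber pread's errno */
    close(fd);
    errno = saved;
    return n;
}
```

A caller would obtain the buffer from alloc_io_buffer() and pass it to
read_direct(); some filesystems reject O_DIRECT at open() time, so the
-1 return has to be handled.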