From: Linus Torvalds <torvalds@osdl.org>
To: Andrew Morton <akpm@osdl.org>
Cc: Theodore Tso <tytso@mit.edu>,
Segher Boessenkool <segher@kernel.crashing.org>,
David Miller <davem@davemloft.net>,
nickpiggin@yahoo.com.au, kenneth.w.chen@intel.com,
guichaz@yahoo.fr, hugh@veritas.com,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
ranma@tdiedrich.de, gordonfarquharson@gmail.com,
a.p.zijlstra@chello.nl, tbm@cyrius.com, arjan@infradead.org,
andrei.popa@i-neo.ro, linux-ext4@vger.kernel.org
Subject: Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
Date: Fri, 29 Dec 2006 16:50:39 -0800 (PST)
Message-ID: <Pine.LNX.4.64.0612291633500.4473@woody.osdl.org>
In-Reply-To: <20061229160520.e498789f.akpm@osdl.org>
On Fri, 29 Dec 2006, Andrew Morton wrote:
>
> Adam Richter spent considerable time a few years ago trying to make the
> mpage code go direct-to-BIO in all cases and we eventually gave up. The
> conceptual layering of page<->blocks<->bio is pretty clean, and it is hard
> and ugly to fully optimise away the "block" bit in the middle.
Using the buffer cache as a translation layer to the physical address is
fine. That's what _any_ block device will do.
I'm not at all saying that "buffer heads must go away". They work fine.
What I'm saying is that
- if you index by buffer heads, you're screwed.
- if you do IO by starting at buffer heads, you're screwed.
Both indexing and writeback decisions should be done at the page cache
layer. Then, when you actually need to do IO, you look at the buffers. But
you start from the "page". YOU SHOULD NEVER LOOK UP a buffer on its own
merits, and YOU SHOULD NEVER DO IO on a buffer head on its own cognizance.
So by all means keep the buffer heads as a way to keep the
"virtual->physical" translation. It's what they were designed for. But
they were _originally_ also designed for "lookup" and "driving the start
of IO", and that is wrong, and has been wrong for a long time now, because
- lookup based on physical address is fundamentally slow and inefficient.
You have to look up the virtual->physical translation somewhere else,
so it's by design an unnecessary indirection _and_ that "somewhere
else" is also by definition filesystem-specific, so you can't do any
of these things at the VFS layer.
Ergo: anything that needs to look up the physical address in order to
find the buffer head is BROKEN in this day and age. We look up the
_virtual_ page cache page, and then we can trivially find the buffer
heads within that page thanks to page->buffers.
Example: ext2 vs ext3 readdir. One of them sucks, the other doesn't.
- starting IO based on the physical entity is insane. It's insane exactly
_because_ the VM doesn't actually think in physical addresses, or in
buffer-sized blocks. The VM only really knows about whole pages, and
all the VM decisions fundamentally have to be page-based. We don't ever
"free a buffer". We free a whole page, and as such, doing writeback
based on buffers is pointless, because it doesn't actually say anything
about the "page state" which is what the VM tracks.
But neither of these means that "buffer_head" itself has to go away. They
both really boil down to the same thing: you should never KEY things by
the buffer head. All actions should be based on virtual indexes as far as
at all humanly possible.
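As an illustration of the lookup point, here is a minimal userspace sketch (a toy model, not kernel code). Every name in it (toy_mapping, toy_page, toy_buffer, toy_add_page, toy_find_page, toy_find_buffer) is hypothetical, and the four-blocks-per-page geometry is an assumption. The lookup is keyed only by the virtual page index; the physical block number is read out of the buffer after the page has been found, so nothing filesystem-specific is needed to do the lookup itself.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCKS_PER_PAGE 4   /* assumed geometry: e.g. 4 KiB page, 1 KiB blocks */
#define NR_PAGES        8   /* toy cache size */

/* Toy stand-in for a buffer head: it only holds the virtual->physical mapping. */
struct toy_buffer {
	unsigned long phys_block;
};

/* Toy stand-in for a page cache page with its attached buffers. */
struct toy_page {
	int present;
	struct toy_buffer buffers[BLOCKS_PER_PAGE];
};

/* Toy per-file page cache, keyed by the virtual page index. */
struct toy_mapping {
	struct toy_page pages[NR_PAGES];
};

/* Populate one page; its blocks are made physically contiguous for simplicity. */
static void toy_add_page(struct toy_mapping *m, unsigned long index,
			 unsigned long first_phys_block)
{
	int i;

	assert(index < NR_PAGES);
	m->pages[index].present = 1;
	for (i = 0; i < BLOCKS_PER_PAGE; i++)
		m->pages[index].buffers[i].phys_block = first_phys_block + i;
}

/* Lookup keyed by the virtual index: no filesystem knowledge required. */
static struct toy_page *toy_find_page(struct toy_mapping *m, unsigned long index)
{
	if (index >= NR_PAGES || !m->pages[index].present)
		return NULL;
	return &m->pages[index];
}

/* Once the page is found, the buffer is just an offset inside it. */
static struct toy_buffer *toy_find_buffer(struct toy_mapping *m,
					  unsigned long file_block)
{
	struct toy_page *page = toy_find_page(m, file_block / BLOCKS_PER_PAGE);

	return page ? &page->buffers[file_block % BLOCKS_PER_PAGE] : NULL;
}
```

Going the other way (finding a buffer given only its physical block) would need a separate, filesystem-specific translation step before any lookup could happen, which is the indirection being argued against.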
Once you do lookup and locking and writeback _starting_ from the page,
it's then easy to look up the actual buffer head within the page, and use
that as a way to do the actual _IO_ on the physical address. So the buffer
heads still exist in ext2, for example, but they don't drive the show
quite as much.
(They still do in some areas: the allocation bitmaps, the xattr code etc.
But as long as none of those have big VM footprints, and as long as no
_common_ operations really care deeply, and as long as those data
structures never need to be touched by the VM or VFS layer, nobody will
ever really care).
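A rough sketch of "start from the page, then use the buffers for the IO" might look like the following. Again this is a userspace toy, not the kernel's writeback path: toy_page, toy_buffer, toy_io_log, toy_record_block and toy_writepage are all made-up names, and the submit callback stands in for whatever actually issues the block IO. The decision to write is made on the page's state; the buffers are only consulted to learn which physical blocks to write.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCKS_PER_PAGE 4   /* assumed geometry */

/* Toy buffer: physical location plus a per-block dirty bit. */
struct toy_buffer {
	unsigned long phys_block;
	int dirty;
};

/* Toy page: the unit the (toy) VM tracks and makes decisions about. */
struct toy_page {
	int dirty;
	struct toy_buffer buffers[BLOCKS_PER_PAGE];
};

/* Hypothetical IO hook: called once per physical block to be written. */
typedef void (*toy_submit_fn)(unsigned long phys_block, void *ctx);

/* Simple recorder used in place of real IO. */
struct toy_io_log {
	unsigned long blocks[BLOCKS_PER_PAGE];
	int count;
};

static void toy_record_block(unsigned long phys_block, void *ctx)
{
	struct toy_io_log *log = ctx;

	if (log->count < BLOCKS_PER_PAGE)
		log->blocks[log->count++] = phys_block;
}

/*
 * Page-driven writeback: the caller picked this page (by page state),
 * and only now do we look at its buffers to find the physical blocks.
 * Returns the number of blocks submitted.
 */
static int toy_writepage(struct toy_page *page, toy_submit_fn submit, void *ctx)
{
	int i, submitted = 0;

	assert(page != NULL && submit != NULL);
	if (!page->dirty)
		return 0;               /* the page state is what drives writeback */

	page->dirty = 0;
	for (i = 0; i < BLOCKS_PER_PAGE; i++) {
		struct toy_buffer *bh = &page->buffers[i];

		if (!bh->dirty)
			continue;
		bh->dirty = 0;
		submit(bh->phys_block, ctx);   /* IO goes to the physical address */
		submitted++;
	}
	return submitted;
}
```

Nothing in this sketch ever picks a buffer first and works back to a page; a clean page is skipped without looking at its buffers at all.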
The directory case comes up just because "readdir()" actually is very
common, and sometimes very slow. And it can have a big VM working set
footprint ("find"), so trying to be page-based actually really helps,
because it all drives things like writeback on the _right_ issues, and we
can do things like LRU's and writeback decisions on the level that really
matters.
I actually suspect that the inode tables could benefit from being in the
page cache too (although I think that the inode buffer address is actually
"physical", so there's no indirection for inode tables, which means that
the virtual vs physical addressing doesn't matter). For directories, there
definitely is a big cost to continually doing the virtual->physical
translation all the time.
Linus