From: Andrew Morton <akpm@osdl.org>
To: Linus Torvalds <torvalds@osdl.org>
Cc: Segher Boessenkool <segher@kernel.crashing.org>,
David Miller <davem@davemloft.net>,
nickpiggin@yahoo.com.au, kenneth.w.chen@intel.com,
guichaz@yahoo.fr, hugh@veritas.com,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
ranma@tdiedrich.de, gordonfarquharson@gmail.com,
a.p.zijlstra@chello.nl, tbm@cyrius.com, arjan@infradead.org,
andrei.popa@i-neo.ro
Subject: Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
Date: Fri, 29 Dec 2006 14:16:32 -0800
Message-ID: <20061229141632.51c8c080.akpm@osdl.org>
In-Reply-To: <Pine.LNX.4.64.0612290202350.4473@woody.osdl.org>
On Fri, 29 Dec 2006 02:48:35 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:
> + if (mapping && mapping_cap_account_dirty(mapping)) {
> + /*
> + * Yes, Virginia, this is indeed insane.
> + *
> + * We use this sequence to make sure that
> + * (a) we account for dirty stats properly
> + * (b) we tell the low-level filesystem to
> + * mark the whole page dirty if it was
> + * dirty in a pagetable. Only to then
> + * (c) clean the page again and return 1 to
> + * cause the writeback.
> + *
> + * This way we avoid all nasty races with the
> + * dirty bit in multiple places and clearing
> + * them concurrently from different threads.
> + *
> + * Note! Normally the "set_page_dirty(page)"
> + * has no effect on the actual dirty bit - since
> + * that will already usually be set. But we
> + * need the side effects, and it can help us
> + * avoid races.
> + *
> + * We basically use the page "master dirty bit"
> + * as a serialization point for all the different
> + * threads doing their things.
> + *
> + * FIXME! We still have a race here: if somebody
> + * adds the page back to the page tables in
> + * between the "page_mkclean()" and the "TestClearPageDirty()",
> + * we might have it mapped without the dirty bit set.
> + */
> + if (page_mkclean(page))
> + set_page_dirty(page);
> + if (TestClearPageDirty(page)) {
> dec_zone_page_state(page, NR_FILE_DIRTY);
> + return 1;
> }
- Presumably reiser3's ordered-data mode has the same problem. And ext4,
of course. Dunno about other filesystems.
- The above change means that we do extra writeout. If a page is dirtied
once, kjournald will write it and then pdflush will come along and
needlessly write it again.
But otoh, if a mapping is being repeatedly dirtied, kjournald will
write the page once per 30 seconds (dirty_expire_centisecs) and pdflush
will write the page once per 30 seconds as well. But we _should_ be
writing it once per five seconds (kjournald commit interval). So we're
still ahead ;)
- Poor old IO accounting broke again.
- People were saying that ext2 and ext3,data=writeback were also showing
corruption. What's up with that?
- For a long time I've wanted to nuke the current ext3/jbd ordered-data
implementation altogether, and just make kjournald call into the
standard writeback code to do a standard superblock->inodes->pages walk.
I think it'd be fairly straightforward to do. We'd need to teach the
writeback code to be able to skip dirty pages which don't have a disk
mapping, so that kjournald doesn't end up waiting for kjournald to free
up journal space..
Would need to avoid possible deadlocks where someone calls
ext3_force_commit() or otherwise does a synchronous commit while holding
VFS locks.
reiser3 and ext4 could be converted too.
Not a short-term project, but this would avoid the problem.
- It's pretty obnoxious that the VM now sets a clean page "dirty" and
then proceeds to modify its contents. It would be nice to stop doing
that.
We could stop marking the page dirty in do_wp_page() and create a new
VM counter "NR_PTE_DIRTY", which means
"number of mapping_cap_account_dirty() pages which have a dirty pte
pointing at them".
Or, perhaps
"number of dirty ptes which point at mapping_cap_account_dirty() pages".
Which can be larger, but the writeout code will probably cope.
Then we take NR_PTE_DIRTY into account in the dirty-page balancing act.
So
- do_wp_page() will still run balance_dirty_pages()
- but it would no longer run set_page_dirty().
- But it needs to run mark_inode_dirty() so the fs-writeback code
notices the file.
- And mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) becomes insufficient.
The tricky part here is "how do we do the writeback"? The
pte-dirty,!PageDirty pages aren't tagged as dirty in the radix-tree and
writeback needs to find them so that it can effectively do an msync() on
them. Walking all the mm's and vma's would be insane. Visiting all the
pages in the file would also probably be insane.
Perhaps this can be solved by adding a new radix-tree tag which means
"this page might have dirty ptes pointing at it". For each file
writeback would do a radix-tree walk of these pages,
cleaning-and-write-protecting ptes, marking the corresponding pages
dirty and clearing their PAGECACHE_TAG_PTE_DIRTY tags.
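The walk described above can be sketched as a toy user-space model. This is
an illustration only, not kernel code: struct toy_page, the TOY_TAG_* values
and toy_writeback_pte_dirty() are hypothetical names, and a flat array stands
in for the radix tree (a real tagged lookup would visit only the tagged
pages instead of scanning everything).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tag numbers; TOY_TAG_PTE_DIRTY models the proposed new tag. */
enum { TOY_TAG_DIRTY = 0, TOY_TAG_PTE_DIRTY = 1 };

/* Toy stand-in for a pagecache page plus the state of the ptes mapping it. */
struct toy_page {
	bool pte_dirty;     /* some pte pointing at the page has its dirty bit set */
	bool pte_writable;  /* some pte pointing at the page is writable */
	bool page_dirty;    /* models PageDirty */
	unsigned int tags;  /* one bit per tag, models the radix-tree tags */
};

/*
 * For every page tagged "might have dirty ptes": clean and write-protect
 * the ptes, mark the page dirty if a pte really was dirty, and clear the
 * tag. Returns the number of pages that became PageDirty.
 */
static size_t toy_writeback_pte_dirty(struct toy_page *pages, size_t n)
{
	size_t moved = 0;

	for (size_t i = 0; i < n; i++) {
		struct toy_page *p = &pages[i];

		if (!(p->tags & (1u << TOY_TAG_PTE_DIRTY)))
			continue;

		/* The tag only says "might": the pte can already be clean. */
		bool was_dirty = p->pte_dirty;

		p->pte_dirty = false;
		p->pte_writable = false;

		if (was_dirty) {
			p->page_dirty = true;
			p->tags |= 1u << TOY_TAG_DIRTY;
			moved++;
		}
		p->tags &= ~(1u << TOY_TAG_PTE_DIRTY);
	}
	return moved;
}
```

After the walk, the pages that were only pte-dirty carry the ordinary dirty
tag, so the existing writeback path can find them.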
Then we can fix the mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)
problem by doing
mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) ||
mapping_tagged(mapping, PAGECACHE_TAG_PTE_DIRTY)
or, better,
mapping_tagged(mapping,
(1<<PAGECACHE_TAG_DIRTY)|(1<<PAGECACHE_TAG_PTE_DIRTY))
perhaps.
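To make the difference between the two forms concrete, here is a small
user-space sketch. It assumes that mapping_tagged() currently takes a single
tag index, so the mask form would be an interface change. struct
demo_mapping, the DEMO_* names and both helper functions are hypothetical and
only model the idea.

```c
#include <stdbool.h>

/* Hypothetical tag numbers; DEMO_TAG_PTE_DIRTY models the proposed tag. */
enum { DEMO_TAG_DIRTY = 0, DEMO_TAG_PTE_DIRTY = 1 };

#define DEMO_DIRTY_MASK \
	((1u << DEMO_TAG_DIRTY) | (1u << DEMO_TAG_PTE_DIRTY))

/* Toy mapping: one summary bit per tag, set if any page carries that tag. */
struct demo_mapping {
	unsigned int root_tags;
};

/* Single-tag form: one lookup per tag, so two calls joined with ||. */
static bool demo_mapping_tagged(const struct demo_mapping *m, int tag)
{
	return (m->root_tags & (1u << tag)) != 0;
}

/* Mask form: one lookup answers "is any of these tags set?". */
static bool demo_mapping_tagged_any(const struct demo_mapping *m,
				    unsigned int mask)
{
	return (m->root_tags & mask) != 0;
}
```

With the mask form, a caller that wants "any kind of dirty" would pass
DEMO_DIRTY_MASK once instead of testing each tag separately.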
The msync() code would need to be taught to call the
PAGECACHE_TAG_PTE_DIRTY walker for the appropriate page range.
This is also not a quick-fix.