LKML Archive on lore.kernel.org
* Re: 2.6.19 file content corruption on ext3
@ 2006-12-17  0:13 Andrei Popa
  2006-12-17 12:06 ` Andrew Morton
  0 siblings, 1 reply; 311+ messages in thread
From: Andrei Popa @ 2006-12-17  0:13 UTC (permalink / raw)
  To: Linux Kernel Mailing List

Hello,
I had filesystem data corruption with rtorrent with 2.6.19.
I tried recent git with Peter Zijlstra's patch
http://lkml.org/lkml/2006/12/16/144 and it seems that the problem is
fixed.

Please CC me, as I am not subscribed to lkml.

Andrei


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-17  0:13 2.6.19 file content corruption on ext3 Andrei Popa
@ 2006-12-17 12:06 ` Andrew Morton
  2006-12-17 12:19   ` Marc Haber
                     ` (2 more replies)
  0 siblings, 3 replies; 311+ messages in thread
From: Andrew Morton @ 2006-12-17 12:06 UTC (permalink / raw)
  To: andrei.popa
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins,
	Linus Torvalds, Florian Weimer, Marc Haber

On Sun, 17 Dec 2006 02:13:18 +0200
Andrei Popa <andrei.popa@i-neo.ro> wrote:

> Hello,
> I had filesystem data corruption with rtorrent with 2.6.19.
> I tried recent git with Peter Zijlstra patch
> http://lkml.org/lkml/2006/12/16/144 and it seems that the problem is
> fixed.
> 

oh crap, I'd forgotten that test_clear_page_dirty() now fiddles with the
ptes.

I'd be really surprised if this was all due to a race though.  Is everyone
who has observed this problem running SMP and/or preemptible kernels?

Peter, why isn't that proposed patch's cleaning of the pte racy against
do_wp_page()?


* Re: 2.6.19 file content corruption on ext3
  2006-12-17 12:06 ` Andrew Morton
@ 2006-12-17 12:19   ` Marc Haber
  2006-12-17 12:32   ` Andrei Popa
  2006-12-17 13:39   ` Andrei Popa
  2 siblings, 0 replies; 311+ messages in thread
From: Marc Haber @ 2006-12-17 12:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: andrei.popa, Linux Kernel Mailing List, Peter Zijlstra,
	Hugh Dickins, Linus Torvalds, Florian Weimer

On Sun, Dec 17, 2006 at 04:06:20AM -0800, Andrew Morton wrote:
> I'd be really surprised if this was all due to a race though.  Is everyone
> who has observed this problem running SMP and/or preemptible kernels?

Linux torres 2.6.19.1-zgsrv #1 SMP PREEMPT Wed Dec 13 01:31:27 UTC 2006 i686 GNU/Linux

So, it's a "yes" to both counts, and I'll build a kernel without SMP
and without preemption asap.

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Mannheim, Germany  |  lose things."    Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature |  How to make an American Quilt | Fax: *49 621 72739835


* Re: 2.6.19 file content corruption on ext3
  2006-12-17 12:06 ` Andrew Morton
  2006-12-17 12:19   ` Marc Haber
@ 2006-12-17 12:32   ` Andrei Popa
  2006-12-17 13:39   ` Andrei Popa
  2 siblings, 0 replies; 311+ messages in thread
From: Andrei Popa @ 2006-12-17 12:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins,
	Linus Torvalds, Florian Weimer, Marc Haber


ierdnac ~ # uname -a
Linux ierdnac 2.6.20-rc1 #1 SMP PREEMPT Sun Dec 17 01:52:28 EET 2006
i686 Genuine Intel(R) CPU           T2050  @ 1.60GHz GenuineIntel
GNU/Linux


On Sun, 2006-12-17 at 04:06 -0800, Andrew Morton wrote:
> On Sun, 17 Dec 2006 02:13:18 +0200
> Andrei Popa <andrei.popa@i-neo.ro> wrote:
> 
> > Hello,
> > I had filesystem data corruption with rtorrent with 2.6.19.
> > I tried recent git with Peter Zijlstra patch
> > http://lkml.org/lkml/2006/12/16/144 and it seems that the problem is
> > fixed.
> > 
> 
> oh crap, I'd forgotten that test_clear_page_dirty() now fiddles with the
> ptes.
> 
> I'd be really surprised if this was all due to a race though.  Is everyone
> who has observed this problem running SMP and/or preemptible kernels?
> 
> Peter, why isn't that proposed patch's cleaning of the pte racy against
> do_wp_page()?



* Re: 2.6.19 file content corruption on ext3
  2006-12-17 12:06 ` Andrew Morton
  2006-12-17 12:19   ` Marc Haber
  2006-12-17 12:32   ` Andrei Popa
@ 2006-12-17 13:39   ` Andrei Popa
  2006-12-17 23:40     ` Andrew Morton
  2 siblings, 1 reply; 311+ messages in thread
From: Andrei Popa @ 2006-12-17 13:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins,
	Linus Torvalds, Florian Weimer, Marc Haber

I was mistaken, I'm still having file corruption with rtorrent.

On Sun, 2006-12-17 at 04:06 -0800, Andrew Morton wrote:
> On Sun, 17 Dec 2006 02:13:18 +0200
> Andrei Popa <andrei.popa@i-neo.ro> wrote:
> 
> > Hello,
> > I had filesystem data corruption with rtorrent with 2.6.19.
> > I tried recent git with Peter Zijlstra patch
> > http://lkml.org/lkml/2006/12/16/144 and it seems that the problem is
> > fixed.
> > 
> 
> oh crap, I'd forgotten that test_clear_page_dirty() now fiddles with the
> ptes.
> 
> I'd be really surprised if this was all due to a race though.  Is everyone
> who has observed this problem running SMP and/or preemptible kernels?
> 
> Peter, why isn't that proposed patch's cleaning of the pte racy against
> do_wp_page()?



* Re: 2.6.19 file content corruption on ext3
  2006-12-17 13:39   ` Andrei Popa
@ 2006-12-17 23:40     ` Andrew Morton
  2006-12-18  1:02       ` Linus Torvalds
                         ` (2 more replies)
  0 siblings, 3 replies; 311+ messages in thread
From: Andrew Morton @ 2006-12-17 23:40 UTC (permalink / raw)
  To: andrei.popa
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins,
	Linus Torvalds, Florian Weimer, Marc Haber, Martin Michlmayr

On Sun, 17 Dec 2006 15:39:32 +0200
Andrei Popa <andrei.popa@i-neo.ro> wrote:

> I was mistaken, I'm still having file corruption with rtorrent.
> 

Well I'm not very optimistic, but if people could try this, please...



From: Andrew Morton <akpm@osdl.org>

try_to_free_buffers() clears the page's dirty state if it successfully removed
the page's buffers.

  Background for this:

  - a process does a one-byte-write to a file on a 64k pagesize, 4k
    blocksize ext3 filesystem.  The page is now PageDirty, !PageUptodate and
    has one dirty buffer and 15 not uptodate buffers.

  - kjournald writes the dirty buffer.  The page is now PageDirty,
    !PageUptodate and has a mix of clean and not uptodate buffers.

  - try_to_free_buffers() removes the page's buffers.  It MUST now clear
    PageDirty.  If we were to leave the page dirty then we'd have a dirty, not
    uptodate page with no buffer_heads.

    We're screwed: we cannot write the page because we don't know which
    sections of it contain garbage.  We cannot read the page because we don't
    know which sections of it contain modified data.  We cannot free the page
    because it is dirty.
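
The three steps above can be walked through in a userspace toy model.  This is
only a sketch: struct toy_page, struct toy_buffer and every toy_* function are
hypothetical stand-ins invented for illustration, not kernel code, and the
clear_page_dirty argument exists only so that both outcomes can be compared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BLOCK_SIZE       4096
#define TOY_BUFFERS_PER_PAGE 16	/* 64k page / 4k blocks */

struct toy_buffer {
	bool uptodate;
	bool dirty;
};

struct toy_page {
	bool dirty;		/* models PG_dirty */
	bool uptodate;		/* models PG_uptodate */
	bool has_buffers;
	struct toy_buffer bh[TOY_BUFFERS_PER_PAGE];
};

/* A one-byte write() validates and dirties only the block it touches. */
static void toy_write_one_byte(struct toy_page *page, size_t offset)
{
	assert(offset < (size_t)TOY_BLOCK_SIZE * TOY_BUFFERS_PER_PAGE);

	struct toy_buffer *bh = &page->bh[offset / TOY_BLOCK_SIZE];

	page->has_buffers = true;
	bh->uptodate = true;
	bh->dirty = true;
	page->dirty = true;
}

/* The journal writes dirty buffers by hand and leaves the page flag alone. */
static void toy_journal_writeout(struct toy_page *page)
{
	for (size_t i = 0; i < TOY_BUFFERS_PER_PAGE; i++)
		page->bh[i].dirty = false;
}

/* Drop the buffers if none is dirty; optionally clean the page as well. */
static bool toy_try_to_free_buffers(struct toy_page *page, bool clear_page_dirty)
{
	for (size_t i = 0; i < TOY_BUFFERS_PER_PAGE; i++)
		if (page->bh[i].dirty)
			return false;

	for (size_t i = 0; i < TOY_BUFFERS_PER_PAGE; i++)
		page->bh[i].uptodate = false;	/* per-block state is gone */
	page->has_buffers = false;
	if (clear_page_dirty)
		page->dirty = false;
	return true;
}

/* Dirty, not uptodate, no buffers: cannot write, read, or free the page. */
static bool toy_page_is_stuck(const struct toy_page *page)
{
	return page->dirty && !page->uptodate && !page->has_buffers;
}
```

With clear_page_dirty set, the model ends up with a clean page and no buffers.
Without it, toy_page_is_stuck() reports the dead-end state described above.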


Peter's "mm: tracking shared dirty pages"
(d08b3851da41d0ee60851f2c75b118e1f7a5fc89) modified clear_page_dirty() so that
it also clears the page's pte mapping's dirty flags, arranging for a
subsequent userspace modification of the page to cause a fault.

That change to clear_page_dirty() was correct for when it is called on the
writeback path.  Here, we effectively do:

	ClearPageDirty()
	pte_mkclean()
	submit-the-writeout

If a page is dirtied via write() or via the ptes after the ClearPageDirty()
or the pte_mkclean(), then the page is redirtied while writeout is in flight
and the page will again need writing; no probs.
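
A userspace sketch of why that ordering is safe.  The names here (struct
toy_page, toy_start_writeback, toy_user_store) are hypothetical, and the
pte_writable field is an assumption used to model "a subsequent userspace
modification causes a fault"; it is not a claim about the exact pte
manipulation the kernel performs.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_page {
	bool pg_dirty;		/* models PG_dirty */
	bool pte_dirty;		/* models the pte dirty bit */
	bool pte_writable;	/* models pte write permission */
	int writeouts;		/* number of writeouts submitted */
};

/* Writeback path: clean the page flag, clean the pte, then submit I/O. */
static void toy_start_writeback(struct toy_page *page)
{
	assert(page->pg_dirty);		/* only dirty pages get written */
	page->pg_dirty = false;		/* ClearPageDirty() */
	page->pte_dirty = false;	/* pte_mkclean() */
	page->pte_writable = false;	/* the next store must fault */
	page->writeouts++;		/* submit-the-writeout */
}

/* A userspace store through the mapping. */
static void toy_user_store(struct toy_page *page)
{
	if (!page->pte_writable) {
		/* write fault: the page is redirtied before the store lands */
		page->pg_dirty = true;
		page->pte_writable = true;
	}
	page->pte_dirty = true;		/* hardware sets the pte dirty bit */
}
```

A store that arrives after toy_start_writeback() leaves pg_dirty set again, so
the model schedules a second writeout and no modification is dropped.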

But that change to clear_page_dirty() was incorrect for when it is called on
the try_to_free_buffers() path.  Here, we want to preserve any pte-dirtiness
because we're not going to write the page to backing store.  We need to keep
a record of any userspace modification to the page.

One way of addressing this would be to bail from try_to_free_buffers() if the
page is mapped into pagetables.  However that is racy, because the pagefault
path doesn't lock the page when establishing a pte against it (I wish it did
- it would solve a lot of nasties).

So this patch instead arranges for clear_page_dirty() to not clean the pte's
when it is called on the try_to_free_buffers() path.
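
The effect of the new argument can be modelled in userspace roughly as follows.
This is a sketch under assumptions: struct toy_page and toy_nr_file_dirty are
hypothetical stand-ins, and the mapping_cap_account_dirty() check and all
locking from the real function are left out.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_page {
	bool pg_dirty;		/* models PG_dirty */
	bool pte_dirty;		/* models the pte dirty bit */
};

static long toy_nr_file_dirty;	/* stand-in for the NR_FILE_DIRTY counter */

/* Returns 1 if the page was previously dirty, like the function it models. */
static int toy_test_clear_page_dirty(struct toy_page *page, int must_clean_ptes)
{
	if (!page->pg_dirty)
		return 0;

	page->pg_dirty = false;
	if (must_clean_ptes)
		page->pte_dirty = false;	/* models page_mkclean() */

	assert(toy_nr_file_dirty > 0);
	toy_nr_file_dirty--;
	return 1;
}
```

Passing 0 keeps the pte dirty bit as a record of the userspace modification,
while the page flag and the counter are still dropped.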

clear_page_dirty() had several callers and it's not immediately obvious to me
what the appropriate behaviour is in each case.  Could maintainers please take
a look?

From my quick reading, all callers of try_to_free_buffers() have already
unmapped the page from pagetables, and given that the reported ext3 corruption
happens on uniprocessor, non-preempt kernels, I doubt if this patch will fix
things.

But even if it is true that try_to_free_buffers() callers unmap the page
first, this fix is still needed, because a minor fault could reestablish pte's
in the meanwhile.

Note that with this change, we can now restore try_to_free_buffers()'s
->private_lock to cover the test_clear_page_dirty().  If we indeed need to do
that, it'll be in a separate patch.

(Need to think about this some more.  How can a page be pte-dirty, but not
have dirty buffers?  We're supposed to clean the pte's when we write the
page, and we dirty the page and buffers when userspace dirties the pte...)


Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: <reiserfs-dev@namesys.com>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: David Chinner <dgc@sgi.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 fs/buffer.c                 |    2 +-
 fs/cifs/file.c              |    2 +-
 fs/fuse/file.c              |    2 +-
 fs/hugetlbfs/inode.c        |    2 +-
 fs/jfs/jfs_metapage.c       |    2 +-
 fs/reiserfs/stree.c         |    2 +-
 fs/xfs/linux-2.6/xfs_aops.c |    2 +-
 include/linux/page-flags.h  |    6 +++---
 mm/page-writeback.c         |    5 +++--
 mm/truncate.c               |    4 ++--
 10 files changed, 15 insertions(+), 14 deletions(-)

diff -puN fs/buffer.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/buffer.c
--- a/fs/buffer.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/fs/buffer.c
@@ -2858,7 +2858,7 @@ int try_to_free_buffers(struct page *pag
 		 * the page's buffers clean.  We discover that here and clean
 		 * the page also.
 		 */
-		if (test_clear_page_dirty(page))
+		if (test_clear_page_dirty(page, 0))
 			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
 	}
 out:
diff -puN fs/fuse/file.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/fuse/file.c
--- a/fs/fuse/file.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/fs/fuse/file.c
@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
 		spin_unlock(&fc->lock);
 
 		if (offset == 0 && to == PAGE_CACHE_SIZE) {
-			clear_page_dirty(page);
+			clear_page_dirty(page, 0);
 			SetPageUptodate(page);
 		}
 	}
diff -puN fs/hugetlbfs/inode.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/hugetlbfs/inode.c
--- a/fs/hugetlbfs/inode.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
 
 static void truncate_huge_page(struct page *page)
 {
-	clear_page_dirty(page);
+	clear_page_dirty(page, 1);
 	ClearPageUptodate(page);
 	remove_from_page_cache(page);
 	put_page(page);
diff -puN fs/jfs/jfs_metapage.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/jfs/jfs_metapage.c
--- a/fs/jfs/jfs_metapage.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/fs/jfs/jfs_metapage.c
@@ -773,7 +773,7 @@ void release_metapage(struct metapage * 
 
 	/* Retest mp->count since we may have released page lock */
 	if (test_bit(META_discard, &mp->flag) && !mp->count) {
-		clear_page_dirty(page);
+		clear_page_dirty(page, 1);
 		ClearPageUptodate(page);
 	}
 #else
diff -puN fs/reiserfs/stree.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/reiserfs/stree.c
--- a/fs/reiserfs/stree.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/fs/reiserfs/stree.c
@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
 				bh = next;
 			} while (bh != head);
 			if (PAGE_SIZE == bh->b_size) {
-				clear_page_dirty(page);
+				clear_page_dirty(page, 0);
 			}
 		}
 	}
diff -puN fs/xfs/linux-2.6/xfs_aops.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/xfs/linux-2.6/xfs_aops.c
--- a/fs/xfs/linux-2.6/xfs_aops.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/fs/xfs/linux-2.6/xfs_aops.c
@@ -343,7 +343,7 @@ xfs_start_page_writeback(
 	ASSERT(!PageWriteback(page));
 	set_page_writeback(page);
 	if (clear_dirty)
-		clear_page_dirty(page);
+		clear_page_dirty(page, 1);
 	unlock_page(page);
 	if (!buffers) {
 		end_page_writeback(page);
diff -puN include/linux/page-flags.h~try_to_free_buffers-dont-clear-pte-dirty-bits include/linux/page-flags.h
--- a/include/linux/page-flags.h~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/include/linux/page-flags.h
@@ -253,13 +253,13 @@ static inline void SetPageUptodate(struc
 
 struct page;	/* forward declaration */
 
-int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);
 
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int must_clean_ptes)
 {
-	test_clear_page_dirty(page);
+	test_clear_page_dirty(page, must_clean_ptes);
 }
 
 static inline void set_page_writeback(struct page *page)
diff -puN mm/page-writeback.c~try_to_free_buffers-dont-clear-pte-dirty-bits mm/page-writeback.c
--- a/mm/page-writeback.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/mm/page-writeback.c
@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
  * Clear a page's dirty flag, while caring for dirty memory accounting. 
  * Returns true if the page was previously dirty.
  */
-int test_clear_page_dirty(struct page *page)
+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
 {
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
@@ -866,7 +866,8 @@ int test_clear_page_dirty(struct page *p
 		 * page is locked, which pins the address_space
 		 */
 		if (mapping_cap_account_dirty(mapping)) {
-			page_mkclean(page);
+			if (must_clean_ptes)
+				page_mkclean(page);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
 		return 1;
diff -puN mm/truncate.c~try_to_free_buffers-dont-clear-pte-dirty-bits mm/truncate.c
--- a/mm/truncate.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/mm/truncate.c
@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
-	if (test_clear_page_dirty(page))
+	if (test_clear_page_dirty(page, 1))
 		task_io_account_cancelled_write(PAGE_CACHE_SIZE);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);
@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
 					  PAGE_CACHE_SIZE, 0);
 				}
 			}
-			was_dirty = test_clear_page_dirty(page);
+			was_dirty = test_clear_page_dirty(page, 0);
 			if (!invalidate_complete_page2(mapping, page)) {
 				if (was_dirty)
 					set_page_dirty(page);
diff -puN fs/cifs/file.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/cifs/file.c
--- a/fs/cifs/file.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/fs/cifs/file.c
@@ -1245,7 +1245,7 @@ retry:
 				wait_on_page_writeback(page);
 
 			if (PageWriteback(page) ||
-					!test_clear_page_dirty(page)) {
+					!test_clear_page_dirty(page, 1)) {
 				unlock_page(page);
 				break;
 			}
_



* Re: 2.6.19 file content corruption on ext3
  2006-12-17 23:40     ` Andrew Morton
@ 2006-12-18  1:02       ` Linus Torvalds
  2006-12-18  1:22       ` Linus Torvalds
  2006-12-18 16:55       ` Peter Zijlstra
  2 siblings, 0 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-18  1:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: andrei.popa, Linux Kernel Mailing List, Peter Zijlstra,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr



On Sun, 17 Dec 2006, Andrew Morton wrote:
> 
> So this patch instead arranges for clear_page_dirty() to not clean the pte's
> when it is called on the try_to_free_buffers() path.

No. This is wrong.

It's wrong exactly because it now _breaks_ the whole thing that the 2.6.19 
PG_dirty changes were all about: keeping track of dirty pages. Now you 
have a page that is dirty, but it's no longer marked PG_dirty, and thus it 
doesn't participate in the dirty accounting.
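
The objection can be made concrete with a small userspace model.  It is a
sketch only: struct toy_page and both helpers are hypothetical, and "accounted"
simply means "has the PG_dirty stand-in set".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
	bool pg_dirty;		/* models PG_dirty */
	bool pte_dirty;		/* models the pte dirty bit */
};

/* What the dirty accounting can see: pages with the page flag set. */
static size_t toy_accounted_dirty(const struct toy_page *pages, size_t n)
{
	size_t count = 0;

	assert(pages != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (pages[i].pg_dirty)
			count++;
	return count;
}

/* Pages modified through a mapping that the accounting cannot see. */
static size_t toy_untracked_dirty(const struct toy_page *pages, size_t n)
{
	size_t count = 0;

	assert(pages != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (pages[i].pte_dirty && !pages[i].pg_dirty)
			count++;
	return count;
}
```

Clearing the page flag on one page while leaving its pte dirty makes
toy_untracked_dirty() non-zero, which is the state being objected to.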

> From my quick reading, all callers of try_to_free_buffers() have already
> unmapped the page from pagetables, and given that the reported ext3 corruption
> happens on uniprocessor, non-preempt kernels, I doubt if this patch will fix
> things.

So not only are you breaking this, you also claim that it cannot happen in 
the first place. So either the patch is buggy, or it's pointless. In 
neither case does it seem to be a good idea to do.

		Linus


* Re: 2.6.19 file content corruption on ext3
  2006-12-17 23:40     ` Andrew Morton
  2006-12-18  1:02       ` Linus Torvalds
@ 2006-12-18  1:22       ` Linus Torvalds
  2006-12-18  1:29         ` Linus Torvalds
  2006-12-18 16:55       ` Peter Zijlstra
  2 siblings, 1 reply; 311+ messages in thread
From: Linus Torvalds @ 2006-12-18  1:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: andrei.popa, Linux Kernel Mailing List, Peter Zijlstra,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr



On Sun, 17 Dec 2006, Andrew Morton wrote:
> 
> From my quick reading, all callers of try_to_free_buffers() have already
> unmapped the page from pagetables, and given that the reported ext3 corruption
> happens on uniprocessor, non-preempt kernels, I doubt if this patch will fix
> things.

Hmm. One possible explanation: maybe the page actually _did_ get unmapped 
from the page tables, but got added back?

I don't think we lock the page when faulting it in (we want it to be 
uptodate, but not necessarily locked). So assuming the pageout sequence 
always _does_ follow the rule that it only does try_to_free_buffers() on 
pages that aren't mapped, what actually protects them from not becoming 
mapped (and dirtied) during that sequence?
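
The interleaving being asked about, written out as a deterministic userspace
sketch.  Everything here is hypothetical (struct toy_page, the toy_* steps, the
unwritten_data field), and the model assumes the page's buffers are already
clean; it illustrates the question, not a confirmed kernel bug.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_page {
	bool mapped;		/* page is present in a page table */
	bool pte_dirty;		/* models the pte dirty bit */
	bool pg_dirty;		/* models PG_dirty */
	bool unwritten_data;	/* userspace data not yet on disk */
};

/* Step 1: the pageout side observes that the page is not mapped. */
static bool toy_reclaim_sees_unmapped(const struct toy_page *page)
{
	return !page->mapped;
}

/* Step 2: a fault maps the page without the page lock, then a store hits. */
static void toy_fault_and_store(struct toy_page *page)
{
	page->mapped = true;
	page->pte_dirty = true;
	page->pg_dirty = true;
	page->unwritten_data = true;
}

/* Step 3: the pageout side goes ahead and cleans page flag and pte. */
static void toy_reclaim_clear_dirty(struct toy_page *page)
{
	assert(page->pg_dirty);
	page->pg_dirty = false;
	page->pte_dirty = false;
}

/* Nothing records the modification any more. */
static bool toy_store_lost(const struct toy_page *page)
{
	return page->unwritten_data && !page->pg_dirty && !page->pte_dirty;
}
```

If step 2 lands between steps 1 and 3, toy_store_lost() becomes true in the
model, which is why some wait or lock in the fault path is being considered.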

So we should probably do a "wait_for_page()" in do_no_page()?

Or maybe only do it for write accesses (since we don't really care about 
getting mapped readably)? If so, we need to do it in the write case of 
do_no_page() _and_ in the do_wp_page() case. Hmm?

		Linus


* Re: 2.6.19 file content corruption on ext3
  2006-12-18  1:22       ` Linus Torvalds
@ 2006-12-18  1:29         ` Linus Torvalds
  2006-12-18  1:57           ` Linus Torvalds
  0 siblings, 1 reply; 311+ messages in thread
From: Linus Torvalds @ 2006-12-18  1:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: andrei.popa, Linux Kernel Mailing List, Peter Zijlstra,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr



On Sun, 17 Dec 2006, Linus Torvalds wrote:
> 
> So we should probably do a "wait_for_page()" in do_no_page()?
> 
> Or maybe only do it for write accesses (since we don't really care about 
> getting mapped readably)? If so, we need to do it in the write case of 
> do_no_page() _and_ in the do_wp_page() case. Hmm?

I think we discussed doing exactly this at some earlier time, actually, 
just to try to throttle people who do lots of page dirtying. 

Maybe we even do it somewhere, but I tried to see it, and in the normal 
"nopage()" routine we very much try to _avoid_ locking the page (ie if 
it's marked PageUptodate() we'll return it whether locked or not). Which 
is fine - especially for readers, there really isn't any reason to ever 
delay them getting access to a page just because it's locked for write-out 
or something (once it's mapped, they'll have access to it regardless of 
any locked state in the kernel anyway).

So I don't actually see any serialization at all that would keep a random 
page from being paged back in.

		Linus


* Re: 2.6.19 file content corruption on ext3
  2006-12-18  1:29         ` Linus Torvalds
@ 2006-12-18  1:57           ` Linus Torvalds
  2006-12-18  4:51             ` Nick Piggin
  0 siblings, 1 reply; 311+ messages in thread
From: Linus Torvalds @ 2006-12-18  1:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: andrei.popa, Linux Kernel Mailing List, Peter Zijlstra,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr


[ Replying to myself - a sure sign that I don't get out enough ]

On Sun, 17 Dec 2006, Linus Torvalds wrote:
> 
> So I don't actually see any serialization at all that would keep a random 
> page from being paged back in.

We do actually serialize, but we do it _after_ the page has already been 
mapped back. Ie we do it for the dirty page case at the end of 
do_wp_page() and do_no_page() when we do the "set_page_dirty_balance()", 
but that's potentially too late - we've already mapped the page read-write 
into the address space.

That said, this means that only threaded apps should ever trigger any 
problems, which would seem to make it unlikely that this is the issue.

But Andrew: I don't think it's necessarily true that 
"try_to_free_buffers()" callers have all unmapped the page.

That seems to be true for vmscan.c (ie the shrink_page_list -> 
try_to_release_page -> try_to_free_buffers callchain), but what about 
the other callchains (through filesystems, or through "pagevec_strip()" or 
similar?) That pagevec_strip() is called from shrink_active_list(), I 
don't see that unmapping the pages..

		Linus


* Re: 2.6.19 file content corruption on ext3
  2006-12-18  1:57           ` Linus Torvalds
@ 2006-12-18  4:51             ` Nick Piggin
  2006-12-18  5:43               ` Andrew Morton
  2006-12-18  5:50               ` Linus Torvalds
  0 siblings, 2 replies; 311+ messages in thread
From: Nick Piggin @ 2006-12-18  4:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, andrei.popa, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

Linus Torvalds wrote:
> [ Replying to myself - a sure sign that I don't get out enough ]
> 
> On Sun, 17 Dec 2006, Linus Torvalds wrote:
> 
>>So I don't actually see any serialization at all that would keep a random 
>>page from being paged back in.
> 
> 
> We do actually serialize, but we do it _after_ the page has already been 
> mapped back. Ie we do it for the dirty page case at the end of 
> do_wp_page() and do_no_page() when we do the "set_page_dirty_balance()", 
> but that's potentially too late - we've already mapped the page read-write 
> into the address space.

I can't see how that's exactly a problem -- so long as the page does not
get reclaimed (it won't, because we have a ref on it) then all that matters
is that the page eventually gets marked dirty.

> That said, this means that only threaded apps should ever trigger any 
> problems, which would seem to make it unlikely that this is the issue.
> 
> But Andrew: I don't think it's necessarily true that 
> "try_to_free_buffers()" callers have all unmapped the page.
> 
> That seems to be true for vmscan.c (ie the shrink_page_list -> 
> try_to_release_page -> try_to_free_buffers callchain), but what about 
> the other callchains (through filesystems, or through "pagevec_strip()" or 
> similar?) That pagevec_strip() is called from shrink_active_list(), I 
> don't see that unmapping the pages..

Right. But would it really matter whether they are currently mapped or
not, given that we agree it may become mapped at any point?

I think the problem Andrew identified is real.

The issue is the disconnect between the pte dirtiness and a filesystem
bringing buffers clean. But I disagree with his fix, because we don't
actually want to just throw out that pte dirtiness information: we're
just trying to get the PG_dirty bit into synch with what the buffers are
telling us, not actually clean or dirty anything, as such.

Can we clear the page dirty bit, then run set_page_dirty afterwards, if
any dirty ptes are found?
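
Roughly, in userspace toy form (a sketch only; struct toy_page, TOY_MAX_PTES
and both helpers are hypothetical, and a real version would need to walk the
rmap and deal with the races discussed in this thread):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PTES 4

struct toy_page {
	bool pg_dirty;			/* models PG_dirty */
	size_t nr_ptes;			/* number of mappings of the page */
	bool pte_dirty[TOY_MAX_PTES];	/* dirty bit of each mapping */
};

/* Look at every mapping without modifying any of them. */
static bool toy_any_pte_dirty(const struct toy_page *page)
{
	assert(page->nr_ptes <= TOY_MAX_PTES);
	for (size_t i = 0; i < page->nr_ptes; i++)
		if (page->pte_dirty[i])
			return true;
	return false;
}

/* Bring the page flag in sync with clean buffers, keeping pte information. */
static void toy_sync_page_dirty(struct toy_page *page)
{
	page->pg_dirty = false;			/* buffers say: clean */
	if (toy_any_pte_dirty(page))
		page->pg_dirty = true;		/* ptes say: still dirty */
}
```

The pte dirty bits are only read, never cleared, so the record of a userspace
modification survives and the page flag ends up agreeing with it.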

The other thing we might be able to do is to skip doing the
clear_page_dirty if the page is uptodate. This feels more hackish but it
might be faster?

-- 
SUSE Labs, Novell Inc.


* Re: 2.6.19 file content corruption on ext3
  2006-12-18  4:51             ` Nick Piggin
@ 2006-12-18  5:43               ` Andrew Morton
  2006-12-18  7:22                 ` Nick Piggin
  2006-12-19  8:51                 ` Marc Haber
  2006-12-18  5:50               ` Linus Torvalds
  1 sibling, 2 replies; 311+ messages in thread
From: Andrew Morton @ 2006-12-18  5:43 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, andrei.popa, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

On Mon, 18 Dec 2006 15:51:52 +1100
Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> I think the problem Andrew identified is real.

I don't.  In fact I don't think I described any problem (well, I tried to,
but then I contradicted myself).

Six hours here of fsx-linux plus high memory pressure on SMP on 1k
blocksize ext3, mainline.  Zero failures.  It's unlikely that this testing
would pass, yet people running normal workloads are able to easily trigger
failures.  I suspect we're looking in the wrong place.

> The issue is the disconnect between the pte dirtiness and a filesystem
> bringing buffers clean.

Really?  The dirtying direction goes pte_dirty->PG_dirty->BH_Dirty and the
cleaning direction goes !BH_Dirty->!PG_dirty->!pte_dirty.  That's pretty
simple, setting aside races.
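
As a userspace sketch of those two directions, for a page that is dirtied only
through a single mapping and with no concurrency at all.  struct toy_state and
the step functions are hypothetical; the ordering check is a property of this
toy model, not an invariant the kernel is claimed to enforce.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_state {
	bool pte_dirty;		/* models pte_dirty */
	bool pg_dirty;		/* models PG_dirty */
	bool bh_dirty;		/* models BH_Dirty */
};

/* A lower level is never dirty while the level above it is clean. */
static bool toy_chain_ok(const struct toy_state *s)
{
	return (!s->bh_dirty || s->pg_dirty) && (!s->pg_dirty || s->pte_dirty);
}

/* One step of pte_dirty -> PG_dirty -> BH_Dirty; false when nothing is left. */
static bool toy_dirty_step(struct toy_state *s)
{
	if (!s->pte_dirty)
		s->pte_dirty = true;
	else if (!s->pg_dirty)
		s->pg_dirty = true;
	else if (!s->bh_dirty)
		s->bh_dirty = true;
	else
		return false;
	assert(toy_chain_ok(s));
	return true;
}

/* One step of !BH_Dirty -> !PG_dirty -> !pte_dirty; false when all clean. */
static bool toy_clean_step(struct toy_state *s)
{
	if (s->bh_dirty)
		s->bh_dirty = false;
	else if (s->pg_dirty)
		s->pg_dirty = false;
	else if (s->pte_dirty)
		s->pte_dirty = false;
	else
		return false;
	assert(toy_chain_ok(s));
	return true;
}
```

The try_to_free_buffers case corresponds to the model sitting for a long time
in the state reached after the first toy_clean_step() call.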

In the try_to_free_buffers case there's a large time interval between
!BH_Dirty and !PG_dirty, but that shouldn't affect anything.

I don't think we even have a theory as to what's gone wrong yet.



* Re: 2.6.19 file content corruption on ext3
  2006-12-18  4:51             ` Nick Piggin
  2006-12-18  5:43               ` Andrew Morton
@ 2006-12-18  5:50               ` Linus Torvalds
  2006-12-18  7:16                 ` Andrew Morton
                                   ` (2 more replies)
  1 sibling, 3 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-18  5:50 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, andrei.popa, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr



On Mon, 18 Dec 2006, Nick Piggin wrote:
> 
> I can't see how that's exactly a problem -- so long as the page does not
> get reclaimed (it won't, because we have a ref on it) then all that matters
> is that the page eventually gets marked dirty.

But the point being that "try_to_free_buffers()" marks it clean 
AFTERWARDS.

So yes, the page gets marked dirty in the pte's - the hardware generally 
does that for us, so we don't have to worry about that part going on.

But "try_to_free_buffers()" seems to clear those dirty bits without 
really serializing it in any way. It just says "ok, I will now clear them". 
Without knowing whether the dirty bits got set before the IO that cleared 
the buffer head dirty bits or not.

What is _that_ serialization? As far as I can see, the only way to 
guarantee that to happen (since the dirty bits in the page tables will get 
set without us ever even being notified) is that the page tables 
themselves must simply never contain that page in a writable form at all.

And that seems to be lacking.

Anyway, I have what I consider a much simpler solution: just don't DO all 
that crap in try_to_free_buffers() at all. I sent it out to some people 
already, but not very widely. 

I reproduce my suggestion here for you (and maybe others too who weren't 
cc'd in that other discussion group) to comment on..

		Linus

---

So I think your patch is really broken, how about this one instead?

It's really my previous patch, BUT it also adds a 

	if (PageDirty(page) ..
		return 0;

case, on the assumption that since PageDirty() means that one of the 
buffers should be dirty, there's no point in even _trying_ drop_buffers, 
since that should just fail anyway.

Now, that assumption is obviously wrong _if_ the buffers have been cleaned 
by something else. So in that case, we now don't remove the buffer heads, 
but who really cares? The page will remain on the dirty list, and 
something should be trying to write it out, but since now all the buffers 
are clean, once that happens, there is no actual IO to happen.

Hmm? So this means that we simply don't remove the buffers early from such 
pages, but there shouldn't be any real downside.

Now, the only question would be if the page is marked dirty _while_ this 
is running. We do hold the page lock, but page dirtying doesn't get the 
lock, does it? But at least we won't mark the page _clean_ when it 
shouldn't be.. And we still are atomic wrt the actual buffer lists 
(mapping->private_lock), so I think this should all be ok, and 
drop_buffers() will do the right thing.

So no race possible either.

At least as far as I can see. And the patch certainly is simple.
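
The control flow being proposed, as a userspace toy model.  This is a sketch
under assumptions: struct toy_page and the toy_* functions are hypothetical,
and mapping->private_lock and the mapping == NULL case are not modelled.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_page {
	bool locked;		/* models PageLocked() */
	bool dirty;		/* models PageDirty() */
	bool writeback;		/* models PageWriteback() */
	bool has_buffers;
	int nr_dirty_buffers;
};

/* Succeeds only when no buffer is dirty, like the function it models. */
static int toy_drop_buffers(struct toy_page *page)
{
	if (page->nr_dirty_buffers > 0)
		return 0;
	page->has_buffers = false;
	return 1;
}

static int toy_try_to_free_buffers(struct toy_page *page)
{
	assert(page->locked);		/* BUG_ON(!PageLocked(page)) */

	/* A dirty page keeps its buffers; writeout will deal with it later. */
	if (page->dirty || page->writeback)
		return 0;

	/* The page dirty flag is never touched on this path. */
	return toy_drop_buffers(page);
}
```

A dirty page whose buffers were all cleaned by something else simply keeps its
buffers in this model, and the page flag stays set until writeout runs.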

Now the question whether this actually _fixes_ any problems does remain, 
but I think this should be a pretty good solution if the bug really is 
here. Andrew?

		Linus

----
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageDirty(page) || PageWriteback(page))
 		return 0;
 
 	if (mapping == NULL) {		/* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
 	spin_unlock(&mapping->private_lock);
-	if (ret) {
-		/*
-		 * If the filesystem writes its buffers by hand (eg ext3)
-		 * then we can have clean buffers against a dirty page.  We
-		 * clean the page here; otherwise later reattachment of buffers
-		 * could encounter a non-uptodate page, which is unresolvable.
-		 * This only applies in the rare case where try_to_free_buffers
-		 * succeeds but the page is not freed.
-		 *
-		 * Also, during truncate, discard_buffer will have marked all
-		 * the page's buffers clean.  We discover that here and clean
-		 * the page also.
-		 */
-		if (test_clear_page_dirty(page))
-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
-	}
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;



* Re: 2.6.19 file content corruption on ext3
  2006-12-18  5:50               ` Linus Torvalds
@ 2006-12-18  7:16                 ` Andrew Morton
  2006-12-18  7:17                   ` Andrew Morton
  2006-12-18  9:30                   ` Nick Piggin
  2006-12-18  7:30                 ` Nick Piggin
  2006-12-18  9:19                 ` Andrei Popa
  2 siblings, 2 replies; 311+ messages in thread
From: Andrew Morton @ 2006-12-18  7:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, andrei.popa, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

On Sun, 17 Dec 2006 21:50:43 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> 
> 
> On Mon, 18 Dec 2006, Nick Piggin wrote:
> > 
> > I can't see how that's exactly a problem -- so long as the page does not
> > get reclaimed (it won't, because we have a ref on it) then all that matters
> > is that the page eventually gets marked dirty.
> 
> But the point being that "try_to_free_buffers()" marks it clean 
> AFTERWARDS.
> 
> So yes, the page gets marked dirty in the pte's - the hardware generally 
> does that for us, so we don't have to worry about that part going on.
> 
> But "try_to_free_buffers()" seems to clear those dirty bits without 
> serializing it really any way. It just says "ok, I will now clear them". 
> Without knowing whether the dirty bits got set before the IO that cleared 
> the buffer head dirty bits or not.

Yes, I can't see anything correct about the current behaviour.

But I'm going blue in the face here trying to feed try_to_free_buffers() a
page_mapped(page), without success.  pagevec_strip() presumably isn't
triggering.

> What is _that_ serialization? As far as I can see, the only way to 
> guarantee that to happen (since the dirty bits in the page tables will get 
> set without us ever even being notified) is that the page tables 
> themselves must simply never contain that page in a writable form at all.
> 
> And that seems to be lacking.
> 
> Anyway, I have what I consider a much simpler solution: just don't DO all 
> that crap in try_to_free_buffers() at all. I sent it out to some people 
> already, not not very widely. 
> 
> I reproduce my suggestion here for you (and maybe others too who weren't 
> cc'd in that other discussion group) to comment on..
>
> ...
>
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
>  	int ret = 0;
>  
>  	BUG_ON(!PageLocked(page));
> -	if (PageWriteback(page))
> +	if (PageDirty(page) || PageWriteback(page))
>  		return 0;
>  
>  	if (mapping == NULL) {		/* can this still happen? */
> @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
>  	spin_lock(&mapping->private_lock);
>  	ret = drop_buffers(page, &buffers_to_free);
>  	spin_unlock(&mapping->private_lock);
> -	if (ret) {
> -		/*
> -		 * If the filesystem writes its buffers by hand (eg ext3)
> -		 * then we can have clean buffers against a dirty page.  We
> -		 * clean the page here; otherwise later reattachment of buffers
> -		 * could encounter a non-uptodate page, which is unresolvable.
> -		 * This only applies in the rare case where try_to_free_buffers
> -		 * succeeds but the page is not freed.
> -		 *
> -		 * Also, during truncate, discard_buffer will have marked all
> -		 * the page's buffers clean.  We discover that here and clean
> -		 * the page also.
> -		 */
> -		if (test_clear_page_dirty(page))
> -			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
> -	}
>  out:
>  	if (buffers_to_free) {
>  		struct buffer_head *bh = buffers_to_free;

This will (at least) cause truncate to do peculiar things. 
do_invalidatepage() runs discard_buffer() against the dirty page and will
then expect try_to_free_buffers() to remove those buffers and then clean
the page.  truncate_complete_page() will clean the page, but it still has
those invalidated buffers.  We'll end up with a large number of clean,
unused pages on the LRU, with attached buffers.  These should eventually
get reaped, but it'll change the page aging dynamics.
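
The truncate sequence described above can be sketched as a small userspace
toy model. This is only an illustration under stated assumptions: struct
trunc_page and every trunc_* function are hypothetical stand-ins, not kernel
code, and the model ignores locking entirely. It shows why a
try_to_free_buffers() that bails out on a dirty page would leave a clean
page with buffers still attached.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy page: only the state the truncate discussion needs. */
struct trunc_page {
	bool dirty;          /* stands in for PG_dirty */
	bool writeback;      /* stands in for PG_writeback */
	int  nr_buffers;     /* number of attached buffer_heads */
	bool buffers_dirty;  /* true if any attached buffer is dirty */
};

/* Models discard_buffer() run over every buffer: all of them become clean. */
void trunc_discard_buffers(struct trunc_page *p)
{
	p->buffers_dirty = false;
}

/*
 * Models try_to_free_buffers().  With "patched" set it returns early on a
 * dirty page and never clears the page's dirty flag, as in the proposed
 * patch.  Returns 1 if the buffers were dropped, 0 otherwise.
 */
int trunc_try_to_free_buffers(struct trunc_page *p, bool patched)
{
	/* The real function is never called on a page without buffers. */
	assert(p->nr_buffers > 0);

	if (p->writeback || (patched && p->dirty))
		return 0;
	if (p->buffers_dirty)
		return 0;            /* models drop_buffers() failing */
	p->nr_buffers = 0;
	if (!patched)
		p->dirty = false;    /* old behaviour: clean the page here */
	return 1;
}

/* Models the truncate path: invalidate the buffers, then clean the page. */
struct trunc_page trunc_truncate(bool patched)
{
	struct trunc_page p = {
		.dirty = true, .writeback = false,
		.nr_buffers = 4, .buffers_dirty = true,
	};

	trunc_discard_buffers(&p);
	trunc_try_to_free_buffers(&p, patched);
	p.dirty = false;             /* models truncate_complete_page() */
	return p;
}
```

Calling trunc_truncate(false) models the old code and ends with no buffers
attached; trunc_truncate(true) models the patched code and ends with a clean
page that still carries its four invalidated buffers.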


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  7:16                 ` Andrew Morton
@ 2006-12-18  7:17                   ` Andrew Morton
  2006-12-18  9:30                   ` Nick Piggin
  1 sibling, 0 replies; 311+ messages in thread
From: Andrew Morton @ 2006-12-18  7:17 UTC (permalink / raw)
  To: Linus Torvalds, Nick Piggin, andrei.popa,
	Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins,
	Florian Weimer, Marc Haber, Martin Michlmayr

On Sun, 17 Dec 2006 23:16:17 -0800
Andrew Morton <akpm@osdl.org> wrote:

> >  out:
> >  	if (buffers_to_free) {
> >  		struct buffer_head *bh = buffers_to_free;
> 
> This will (at least) cause truncate to do peculiar things. 
> do_invalidatepage() runs discard_buffer() against the dirty page and will
> then expect try_to_free_buffers() to remove those buffers and then clean
> the page.  truncate_complete_page() will clean the page, but it still has
> those invalidated buffers.  We'll end up with a large number of clean,
> unused pages on the LRU, with attached buffers.  These should eventually
> get reaped, but it'll change the page aging dynamics.

That being said, it'd be great to get this tested by someone who can
trigger this bug, please.

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  5:43               ` Andrew Morton
@ 2006-12-18  7:22                 ` Nick Piggin
  2006-12-18  9:18                   ` Andrew Morton
  2006-12-19  8:51                 ` Marc Haber
  1 sibling, 1 reply; 311+ messages in thread
From: Nick Piggin @ 2006-12-18  7:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, andrei.popa, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

Andrew Morton wrote:
> On Mon, 18 Dec 2006 15:51:52 +1100
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> 
>>I think the problem Andrew identified is real.
> 
> 
> I don't.  In fact I don't think I described any problem (well, I tried to,
> but then I contradicted myself).

By saying that there shouldn't be any dirty ptes if there are no
dirty buffers? But in that case the _page_ shouldn't be dirty either,
so that clear_page_dirty would be redundant. But presumably it isn't.

> Six hours here of fsx-linux plus high memory pressure on SMP on 1k
> blocksize ext3, mainline.  Zero failures.  It's unlikely that this testing
> would pass, yet people running normal workloads are able to easily trigger
> failures.  I suspect we're looking in the wrong place.

Yes, I could believe the corruption is caused by something else
completely.

>>The issue is the disconnect between the pte dirtiness and a filesystem
>>bringing buffers clean.
> 
> 
> Really?  The dirtying direction goes pte_dirty->PG_dirty->BH_Dirty and the
> cleaning direction goes !BH_Dirty->!PG_dirty->!pte_dirty.  That's pretty
> simple, setting aside races.
> 
> In the try_to_free_buffers case there's a large time inverval between
> !BH_Dirty and !PG_dirty, but that shouldn't affect anything.

After try_to_free_buffers detaches the buffers from the page, a
pagefault can come in, and mark the pte writeable, then set_page_dirty
(which finds no buffers, so only sets PG_dirty).

The page can now get dirtied through this mapping.

try_to_free_buffers then goes on to clean the page and ptes.

I really thought you were the one who identified this race, and I didn't
see where you showed it is safe.

It may be very unlikely with small SMPs, but less so with preempt. All
we have to do is preempt at spin_unlock in try_to_free_buffers AFAIKS.
Were you testing with preempt?

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  5:50               ` Linus Torvalds
  2006-12-18  7:16                 ` Andrew Morton
@ 2006-12-18  7:30                 ` Nick Piggin
  2006-12-18  9:19                 ` Andrei Popa
  2 siblings, 0 replies; 311+ messages in thread
From: Nick Piggin @ 2006-12-18  7:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, andrei.popa, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

Linus Torvalds wrote:
> 
> On Mon, 18 Dec 2006, Nick Piggin wrote:
> 
>>I can't see how that's exactly a problem -- so long as the page does not
>>get reclaimed (it won't, because we have a ref on it) then all that matters
>>is that the page eventually gets marked dirty.
> 
> 
> But the point being that "try_to_free_buffers()" marks it clean 
> AFTERWARDS.

For some reason I thought you were suggesting it is a problem on its own :P

Yes I agree there is a pagefault vs ttfb race.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  7:22                 ` Nick Piggin
@ 2006-12-18  9:18                   ` Andrew Morton
  2006-12-18  9:26                     ` Andrei Popa
  2006-12-18  9:42                     ` Nick Piggin
  0 siblings, 2 replies; 311+ messages in thread
From: Andrew Morton @ 2006-12-18  9:18 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, andrei.popa, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

On Mon, 18 Dec 2006 18:22:42 +1100
Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Andrew Morton wrote:
> > On Mon, 18 Dec 2006 15:51:52 +1100
> > Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > 
> > 
> >>I think the problem Andrew identified is real.
> > 
> > 
> > I don't.  In fact I don't think I described any problem (well, I tried to,
> > but then I contradicted myself).
> 
> By saying that there shouldn't be any dirty ptes if there are no
> dirty buffers? But in that case the _page_ shouldn't be dirty either,
> so that clear_page_dirty would be redundant. But presumably it isn't.

I don't follow that.

The linkage between pte-dirtiness and buffer_heads is a bit hard to follow
without also considering page-dirtiness.

> > Six hours here of fsx-linux plus high memory pressure on SMP on 1k
> > blocksize ext3, mainline.  Zero failures.  It's unlikely that this testing
> > would pass, yet people running normal workloads are able to easily trigger
> > failures.  I suspect we're looking in the wrong place.
> 
> Yes I could believe it the corruption is caused by something else
> completely.

Think so.  We do have a problem here, but only on threaded apps, I believe.
rtorrent doesn't appear to be threaded, and the bug is hit on non-preempt
UP.

> >>The issue is the disconnect between the pte dirtiness and a filesystem
> >>bringing buffers clean.
> > 
> > 
> > Really?  The dirtying direction goes pte_dirty->PG_dirty->BH_Dirty and the
> > cleaning direction goes !BH_Dirty->!PG_dirty->!pte_dirty.  That's pretty
> > simple, setting aside races.
> > 
> > In the try_to_free_buffers case there's a large time inverval between
> > !BH_Dirty and !PG_dirty, but that shouldn't affect anything.
> 
> After try_to_free_buffers detaches the buffers from the page, a
> pagefault can come in, and mark the pte writeable, then set_page_dirty
> (which finds no buffers, so only sets PG_dirty).
> 
> The page can now get dirtied through this mapping.
> 
> try_to_free_buffers then goes on to clean the page and ptes.

try_to_free_buffers() isn't called against a page which doesn't have
buffers.  It'll oops.

> Were you testing with preempt?

nope, just SMP.


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  5:50               ` Linus Torvalds
  2006-12-18  7:16                 ` Andrew Morton
  2006-12-18  7:30                 ` Nick Piggin
@ 2006-12-18  9:19                 ` Andrei Popa
  2006-12-18  9:38                   ` Andrew Morton
  2 siblings, 1 reply; 311+ messages in thread
From: Andrei Popa @ 2006-12-18  9:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Andrew Morton, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

I tried the latest git with the patch from this email and I still get file
content corruption. If I can help you debug the problem further, tell me
what to do.

On Sun, 2006-12-17 at 21:50 -0800, Linus Torvalds wrote:
> 
> On Mon, 18 Dec 2006, Nick Piggin wrote:
> > 
> > I can't see how that's exactly a problem -- so long as the page does not
> > get reclaimed (it won't, because we have a ref on it) then all that matters
> > is that the page eventually gets marked dirty.
> 
> But the point being that "try_to_free_buffers()" marks it clean 
> AFTERWARDS.
> 
> So yes, the page gets marked dirty in the pte's - the hardware generally 
> does that for us, so we don't have to worry about that part going on.
> 
> But "try_to_free_buffers()" seems to clear those dirty bits without 
> serializing it really any way. It just says "ok, I will now clear them". 
> Without knowing whether the dirty bits got set before the IO that cleared 
> the buffer head dirty bits or not.
> 
> What is _that_ serialization? As far as I can see, the only way to 
> guarantee that to happen (since the dirty bits in the page tables will get 
> set without us ever even being notified) is that the page tables 
> themselves must simply never contain that page in a writable form at all.
> 
> And that seems to be lacking.
> 
> Anyway, I have what I consider a much simpler solution: just don't DO all 
> that crap in try_to_free_buffers() at all. I sent it out to some people 
> already, not not very widely. 
> 
> I reproduce my suggestion here for you (and maybe others too who weren't 
> cc'd in that other discussion group) to comment on..
> 
> 		Linus
> 
> ---
> 
> So I think your patch is really broken, how about this one instead?
> 
> It's really my previous patch, BUT it also adds a 
> 
> 	if (PageDirty(page) ..
> 		return 0;
> 
> case, on the assumption that since PageDirty() measn that one of the 
> buffers should be dirty, there's no point in even _trying_ drop_buffers, 
> since that should just fail anyway.
> 
> Now, that assumption is obviously wrong _if_ the buffers have been cleaned 
> by something else. So in that case, we now don't remove the buffer heads, 
> but who really cares? The page will remain on the dirty list, and 
> something should be trying to write it out, but since now all the buffers 
> are clean, once that happens, there is no actual IO to happen.
> 
> Hmm? So this means that we simply don't remove the buffers early from such 
> pages, but there shouldn't be any real downside.
> 
> Now, the only question would be if the page is marked dirty _while_ this 
> is running. We do hold the page lock, but page dirtying doesn't get the 
> lock, does it? But at least we won't mark the page _clean_ when it 
> shouldn't be.. And we still are atomic wrt the actual buffer lists 
> (mapping->private_lock), so I think this should all be ok, and 
> drop_buffers() will do the right thing.
> 
> So no race possible either.
> 
> At least as far as I can see. And the patch certainly is simple.
> 
> Now the question whether this actually _fixes_ any problems does remain, 
> but I think this should be a pretty good solution if the bug really is 
> here. Andrew?
> 
> 		Linus
> 
> ----
> diff --git a/fs/buffer.c b/fs/buffer.c
> index d1f1b54..263f88e 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
>  	int ret = 0;
>  
>  	BUG_ON(!PageLocked(page));
> -	if (PageWriteback(page))
> +	if (PageDirty(page) || PageWriteback(page))
>  		return 0;
>  
>  	if (mapping == NULL) {		/* can this still happen? */
> @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
>  	spin_lock(&mapping->private_lock);
>  	ret = drop_buffers(page, &buffers_to_free);
>  	spin_unlock(&mapping->private_lock);
> -	if (ret) {
> -		/*
> -		 * If the filesystem writes its buffers by hand (eg ext3)
> -		 * then we can have clean buffers against a dirty page.  We
> -		 * clean the page here; otherwise later reattachment of buffers
> -		 * could encounter a non-uptodate page, which is unresolvable.
> -		 * This only applies in the rare case where try_to_free_buffers
> -		 * succeeds but the page is not freed.
> -		 *
> -		 * Also, during truncate, discard_buffer will have marked all
> -		 * the page's buffers clean.  We discover that here and clean
> -		 * the page also.
> -		 */
> -		if (test_clear_page_dirty(page))
> -			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
> -	}
>  out:
>  	if (buffers_to_free) {
>  		struct buffer_head *bh = buffers_to_free;
> 


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  9:18                   ` Andrew Morton
@ 2006-12-18  9:26                     ` Andrei Popa
  2006-12-18  9:42                     ` Nick Piggin
  1 sibling, 0 replies; 311+ messages in thread
From: Andrei Popa @ 2006-12-18  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Linus Torvalds, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr


On Mon, 2006-12-18 at 01:18 -0800, Andrew Morton wrote:
> On Mon, 18 Dec 2006 18:22:42 +1100
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> > Andrew Morton wrote:
> > > On Mon, 18 Dec 2006 15:51:52 +1100
> > > Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > 
> > > 
> > >>I think the problem Andrew identified is real.
> > > 
> > > 
> > > I don't.  In fact I don't think I described any problem (well, I tried to,
> > > but then I contradicted myself).
> > 
> > By saying that there shouldn't be any dirty ptes if there are no
> > dirty buffers? But in that case the _page_ shouldn't be dirty either,
> > so that clear_page_dirty would be redundant. But presumably it isn't.
> 
> I don't follow that.
> 
> The linkage between pte-dirtiness and buffer_heads is a bit hard to follow
> without also considering page-dirtiness.
> 
> > > Six hours here of fsx-linux plus high memory pressure on SMP on 1k
> > > blocksize ext3, mainline.  Zero failures.  It's unlikely that this testing
> > > would pass, yet people running normal workloads are able to easily trigger
> > > failures.  I suspect we're looking in the wrong place.
> > 
> > Yes I could believe it the corruption is caused by something else
> > completely.
> 
> Think so.  We do have a problem here, but only on threaded apps, I believe.
> rtorrent doesn't appear to be threaded, and the bug is hit on non-preempt
> UP.


ierdnac ~ # uname -a
Linux ierdnac 2.6.20-rc1 #2 SMP PREEMPT Mon Dec 18 11:01:52 EET 2006
i686 Genuine Intel(R) CPU           T2050  @ 1.60GHz GenuineIntel
GNU/Linux


The other person who had corruption with rtorrent also has SMP and
PREEMPT.


> 
> > >>The issue is the disconnect between the pte dirtiness and a filesystem
> > >>bringing buffers clean.
> > > 
> > > 
> > > Really?  The dirtying direction goes pte_dirty->PG_dirty->BH_Dirty and the
> > > cleaning direction goes !BH_Dirty->!PG_dirty->!pte_dirty.  That's pretty
> > > simple, setting aside races.
> > > 
> > > In the try_to_free_buffers case there's a large time inverval between
> > > !BH_Dirty and !PG_dirty, but that shouldn't affect anything.
> > 
> > After try_to_free_buffers detaches the buffers from the page, a
> > pagefault can come in, and mark the pte writeable, then set_page_dirty
> > (which finds no buffers, so only sets PG_dirty).
> > 
> > The page can now get dirtied through this mapping.
> > 
> > try_to_free_buffers then goes on to clean the page and ptes.
> 
> try_to_free_buffers() isn't called against a page which doesn't have
> buffers.  It'll oops.
> 
> > Were you testing with preempt?
> 
> nope, just SMP.
> 


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  7:16                 ` Andrew Morton
  2006-12-18  7:17                   ` Andrew Morton
@ 2006-12-18  9:30                   ` Nick Piggin
  1 sibling, 0 replies; 311+ messages in thread
From: Nick Piggin @ 2006-12-18  9:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, andrei.popa, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

Andrew Morton wrote:
> On Sun, 17 Dec 2006 21:50:43 -0800 (PST)
> Linus Torvalds <torvalds@osdl.org> wrote:
> 
> 
>>
>>On Mon, 18 Dec 2006, Nick Piggin wrote:
>>
>>>I can't see how that's exactly a problem -- so long as the page does not
>>>get reclaimed (it won't, because we have a ref on it) then all that matters
>>>is that the page eventually gets marked dirty.
>>
>>But the point being that "try_to_free_buffers()" marks it clean 
>>AFTERWARDS.
>>
>>So yes, the page gets marked dirty in the pte's - the hardware generally 
>>does that for us, so we don't have to worry about that part going on.
>>
>>But "try_to_free_buffers()" seems to clear those dirty bits without 
>>serializing it really any way. It just says "ok, I will now clear them". 
>>Without knowing whether the dirty bits got set before the IO that cleared 
>>the buffer head dirty bits or not.
> 
> 
> Yes, I can't see anything correct about the current behaviour.
> 
> But I'm going blue in the face here trying to feed try_to_free_buffers() a
> page_mapped(page), without success.  pagevec_strip() presumably isn't
> triggering.

I can trigger it here, with a kernel patch to call pagevec_strip
unconditionally. I am seeing it clearing pte dirty bits, which is surely
a dataloss bug.

BUG: warning at mm/page-writeback.c:862/clear_page_dirty_warn()
  [<c013f65a>] clear_page_dirty_warn+0xdb/0xdd
  [<c0174309>] try_to_free_buffers+0x6b/0x7e
  [<c01937ec>] ext3_releasepage+0x0/0x74
  [<c013bb48>] try_to_release_page+0x2c/0x40
  [<c0140925>] pagevec_strip+0x52/0x54
  [<c0141580>] shrink_active_list+0x2a0/0x3c8
  [<c0142100>] shrink_zone+0xcd/0xea
  [<c014266d>] kswapd+0x311/0x41e
  [<c012c6aa>] autoremove_wake_function+0x0/0x37
  [<c014235c>] kswapd+0x0/0x41e
  [<c012c527>] kthread+0xde/0xe2
  [<c012c449>] kthread+0x0/0xe2
  [<c010395b>] kernel_thread_helper+0x7/0x1c
  =======================

(clear_page_dirty_warn() is test_clear_page_dirty which WARN_ON()s the
result of page_mkclean)


> This will (at least) cause truncate to do peculiar things. 
> do_invalidatepage() runs discard_buffer() against the dirty page and will
> then expect try_to_free_buffers() to remove those buffers and then clean
> the page.  truncate_complete_page() will clean the page, but it still has
> those invalidated buffers.  We'll end up with a large number of clean,
> unused pages on the LRU, with attached buffers.  These should eventually
> get reaped, but it'll change the page aging dynamics.

This isn't so nice. I wonder if you could just ClearPageDirty before
calling try_to_free_buffers in this case, or is that too much of a
hack? Ideally I guess you want a variant that is happy to discard
dirtiness (alternatively, my proposal to redirty the page if we find
a dirty pte should also handle this).

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  9:19                 ` Andrei Popa
@ 2006-12-18  9:38                   ` Andrew Morton
  2006-12-18 10:00                     ` Andrei Popa
  0 siblings, 1 reply; 311+ messages in thread
From: Andrew Morton @ 2006-12-18  9:38 UTC (permalink / raw)
  To: andrei.popa
  Cc: Linus Torvalds, Nick Piggin, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

On Mon, 18 Dec 2006 11:19:04 +0200
Andrei Popa <andrei.popa@i-neo.ro> wrote:

> 
> I tried latest git with the patch from this email and it still get file
> content corruption. If I can help you further debug the problem tell me
> what to do.

Can you please tell us all the steps which we need to take to reproduce this?

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  9:18                   ` Andrew Morton
  2006-12-18  9:26                     ` Andrei Popa
@ 2006-12-18  9:42                     ` Nick Piggin
  1 sibling, 0 replies; 311+ messages in thread
From: Nick Piggin @ 2006-12-18  9:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, andrei.popa, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

Andrew Morton wrote:
> On Mon, 18 Dec 2006 18:22:42 +1100
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:

 >>Yes I could believe it the corruption is caused by something else
 >>completely.
 >
 >
 > Think so.  We do have a problem here, but only on threaded apps, I believe.
 > rtorrent doesn't appear to be threaded, and the bug is hit on non-preempt
 > UP.

I think (see below) that it does not apply only to threaded apps. But
it would need one of SMP or PREEMPT to trigger.


>>After try_to_free_buffers detaches the buffers from the page, a
>>pagefault can come in, and mark the pte writeable, then set_page_dirty
>>(which finds no buffers, so only sets PG_dirty).
>>
>>The page can now get dirtied through this mapping.
>>
>>try_to_free_buffers then goes on to clean the page and ptes.
> 
> 
> try_to_free_buffers() isn't called against a page which doesn't have
> buffers.  It'll oops.

Sure. But I think the race exists... I'll try spelling it out in
the conventional way:

try_to_free_buffers()
   drop_buffers() (succeeds)

** preempt here or run right-hand thread on 2nd CPU in SMP **

                                do_no_page()
                                  set_page_dirty()

                                [now modify the page via this mapping
                                (from this process or a concurrent thread)]


   clear_page_dirty() (clears PG_dirty + pte dirty, oops)
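
The interleaving above can be replayed deterministically in a userspace toy
model. This is a sketch, not kernel code: struct race_page and the race_*
functions are hypothetical names, the three steps are run in a fixed order
instead of on real threads, and the starting state assumes a dirty page whose
buffers were already written out by the filesystem.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy page: just enough state to replay the race. */
struct race_page {
	bool pg_dirty;       /* stands in for PG_dirty */
	bool pte_dirty;      /* stands in for the dirty bit in the pte */
	bool has_buffers;    /* buffer_heads attached? */
	bool data_modified;  /* memory holds data newer than the disk copy */
};

/* Left-hand column, step 1: drop_buffers() succeeds. */
void race_drop_buffers(struct race_page *p)
{
	assert(p->has_buffers);  /* never called on a page without buffers */
	p->has_buffers = false;
}

/* Right-hand column: fault, set_page_dirty(), then a store via the mapping. */
void race_fault_and_write(struct race_page *p)
{
	p->pg_dirty = true;      /* no buffers found, so only the page flag */
	p->pte_dirty = true;     /* set by the store through the writable pte */
	p->data_modified = true;
}

/* Left-hand column, step 2: clear the page flag and the pte dirty bit. */
void race_clear_page_dirty(struct race_page *p)
{
	p->pg_dirty = false;
	p->pte_dirty = false;
}

/* True if modified data is left with nothing marking it for writeback. */
bool race_lost_write(const struct race_page *p)
{
	return p->data_modified && !p->pg_dirty && !p->pte_dirty;
}

/* Replays the sequence, with or without the fault landing in the window. */
bool race_run(bool fault_in_window)
{
	struct race_page p = {
		.pg_dirty = true, .pte_dirty = false,
		.has_buffers = true, .data_modified = false,
	};

	race_drop_buffers(&p);
	if (fault_in_window)
		race_fault_and_write(&p);
	race_clear_page_dirty(&p);
	return race_lost_write(&p);
}
```

In this model race_run(true) reports a lost write and race_run(false) does
not, which is the window being argued about: the only thing that differs
between the two runs is whether the fault lands between the two left-hand
steps.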


-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  9:38                   ` Andrew Morton
@ 2006-12-18 10:00                     ` Andrei Popa
  2006-12-18 10:11                       ` Peter Zijlstra
  0 siblings, 1 reply; 311+ messages in thread
From: Andrei Popa @ 2006-12-18 10:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Nick Piggin, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

On Mon, 2006-12-18 at 01:38 -0800, Andrew Morton wrote: 
> On Mon, 18 Dec 2006 11:19:04 +0200
> Andrei Popa <andrei.popa@i-neo.ro> wrote:
> 
> > 
> > I tried latest git with the patch from this email and it still get file
> > content corruption. If I can help you further debug the problem tell me
> > what to do.
> 
> Can you please tell us all the steps which we need to take to reproduce this?

I'm using rtorrent-0.7.0 and libtorrent-0.11.0. Just download a torrent
with multiple files (I downloaded 84 rar files). When the download
finishes, rtorrent does a hash check, and at the end of the check it says
"Hash check on download completion found bad chunks, consider using
"safe_sync"." and stops; most of the downloaded files are broken. With
Peter Zijlstra's patch this error doesn't show, but there is still file
corruption (although fewer files are corrupted); after the hash check,
rtorrent downloads the bad chunks again and does another hash check, and
then all files are ok.


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 10:00                     ` Andrei Popa
@ 2006-12-18 10:11                       ` Peter Zijlstra
  2006-12-18 10:49                         ` Andrei Popa
  0 siblings, 1 reply; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-18 10:11 UTC (permalink / raw)
  To: andrei.popa
  Cc: Andrew Morton, Linus Torvalds, Nick Piggin,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 12:00 +0200, Andrei Popa wrote:
> On Mon, 2006-12-18 at 01:38 -0800, Andrew Morton wrote: 
> > On Mon, 18 Dec 2006 11:19:04 +0200
> > Andrei Popa <andrei.popa@i-neo.ro> wrote:
> > 
> > > 
> > > I tried latest git with the patch from this email and it still get file
> > > content corruption. If I can help you further debug the problem tell me
> > > what to do.
> > 
> > Can you please tell us all the steps which we need to take to reproduce this?
> 
> I'm using rtorrent-0.7.0 and libtorrent-0.11.0, just download a torrent
> with multiple files(I downloaded 84 rar files) and when it will finish
> it will do a hash check and at the end of the check will say "Hash check
> on download completion found bad chunks, consider using "safe_sync"."
> and stop and most of the downloaded files are broken. With Peter
> Zijlstra patch this error doesn't show but there is file
> corruption(although less files are corrupted); afther the hash check,
> rtorrent will download the bad chunks and do another hash check and all
> files are ok.

OK, I'll try this on an ext3 box. BTW, what data mode are you using ext3
in?


Also, for testing's sake, could you give this a go:
It's a total hack but I guess worth testing.

---
 mm/rmap.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6-git/mm/rmap.c
===================================================================
--- linux-2.6-git.orig/mm/rmap.c	2006-12-18 11:06:29.000000000 +0100
+++ linux-2.6-git/mm/rmap.c	2006-12-18 11:07:16.000000000 +0100
@@ -448,7 +448,7 @@ static int page_mkclean_one(struct page 
 		goto unlock;
 
 	entry = ptep_get_and_clear(mm, address, pte);
-	entry = pte_mkclean(entry);
+	/* entry = pte_mkclean(entry); */
 	entry = pte_wrprotect(entry);
 	ptep_establish(vma, address, pte, entry);
 	lazy_mmu_prot_update(entry);
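
What the hack changes can be illustrated with a userspace toy model of a pte
as two flag bits. This is a sketch under assumptions: the TOY_PTE_* bit
values and all toy_* functions are invented for illustration and do not
match any architecture's real pte layout. The point is only that, with
pte_mkclean() commented out, the entry is still write-protected but keeps
its dirty bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout, chosen only for this model. */
#define TOY_PTE_WRITE 0x1u
#define TOY_PTE_DIRTY 0x2u

typedef uint32_t toy_pte_t;

toy_pte_t toy_pte_mkclean(toy_pte_t pte)   { return pte & ~TOY_PTE_DIRTY; }
toy_pte_t toy_pte_wrprotect(toy_pte_t pte) { return pte & ~TOY_PTE_WRITE; }
bool toy_pte_dirty(toy_pte_t pte)          { return (pte & TOY_PTE_DIRTY) != 0; }
bool toy_pte_write(toy_pte_t pte)          { return (pte & TOY_PTE_WRITE) != 0; }

/*
 * Models the two lines touched by the patch in page_mkclean_one().
 * keep_dirty == true corresponds to the hack (pte_mkclean commented out).
 */
toy_pte_t toy_page_mkclean_one(toy_pte_t pte, bool keep_dirty)
{
	/* The model only knows about the two bits defined above. */
	assert((pte & ~(TOY_PTE_WRITE | TOY_PTE_DIRTY)) == 0);

	if (!keep_dirty)
		pte = toy_pte_mkclean(pte);
	return toy_pte_wrprotect(pte);
}
```

For a writable, dirty entry, toy_page_mkclean_one(pte, false) returns an
entry with both bits cleared, while toy_page_mkclean_one(pte, true) returns
a read-only entry that is still dirty, so in this model a later scan of the
pte would still see that the page was written to.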



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 10:11                       ` Peter Zijlstra
@ 2006-12-18 10:49                         ` Andrei Popa
  2006-12-18 15:24                           ` Gene Heskett
  0 siblings, 1 reply; 311+ messages in thread
From: Andrei Popa @ 2006-12-18 10:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Linus Torvalds, Nick Piggin,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

> OK, I'll try this on a ext3 box. BTW, what data mode are you using ext3
> in?
> 

ordered

> 
> Also, for testings sake, could you give this a go:
> It's a total hack but I guess worth testing.
> 
> ---
>  mm/rmap.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux-2.6-git/mm/rmap.c
> ===================================================================
> --- linux-2.6-git.orig/mm/rmap.c	2006-12-18 11:06:29.000000000 +0100
> +++ linux-2.6-git/mm/rmap.c	2006-12-18 11:07:16.000000000 +0100
> @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page 
>  		goto unlock;
>  
>  	entry = ptep_get_and_clear(mm, address, pte);
> -	entry = pte_mkclean(entry);
> +	/* entry = pte_mkclean(entry); */
>  	entry = pte_wrprotect(entry);
>  	ptep_establish(vma, address, pte, entry);
>  	lazy_mmu_prot_update(entry);
> 

With the latest git and this patch there is no corruption!




^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 10:49                         ` Andrei Popa
@ 2006-12-18 15:24                           ` Gene Heskett
  2006-12-18 15:32                             ` Peter Zijlstra
  0 siblings, 1 reply; 311+ messages in thread
From: Gene Heskett @ 2006-12-18 15:24 UTC (permalink / raw)
  To: linux-kernel, andrei.popa
  Cc: Peter Zijlstra, Andrew Morton, Linus Torvalds, Nick Piggin,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Monday 18 December 2006 05:49, Andrei Popa wrote:
>> OK, I'll try this on a ext3 box. BTW, what data mode are you using
>> ext3 in?
>
>ordered
>
>> Also, for testings sake, could you give this a go:
>> It's a total hack but I guess worth testing.
>>
>> ---
>>  mm/rmap.c |    2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> Index: linux-2.6-git/mm/rmap.c
>> ===================================================================
>> --- linux-2.6-git.orig/mm/rmap.c	2006-12-18 11:06:29.000000000 +0100
>> +++ linux-2.6-git/mm/rmap.c	2006-12-18 11:07:16.000000000 +0100
>> @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page
>>  		goto unlock;
>>
>>  	entry = ptep_get_and_clear(mm, address, pte);
>> -	entry = pte_mkclean(entry);
>> +	/* entry = pte_mkclean(entry); */
>>  	entry = pte_wrprotect(entry);
>>  	ptep_establish(vma, address, pte, entry);
>>  	lazy_mmu_prot_update(entry);
>
>with latest git and this patch there is no corruption !
>
I've not run a torrent app here recently.  Should this patch be applied to
a plain 2.6.20-rc1 before I do run Azureus or similar apps?

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2006 by Maurice Eugene Heskett, all rights reserved.


* Re: 2.6.19 file content corruption on ext3
  2006-12-18 15:24                           ` Gene Heskett
@ 2006-12-18 15:32                             ` Peter Zijlstra
  2006-12-18 15:47                               ` Gene Heskett
  0 siblings, 1 reply; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-18 15:32 UTC (permalink / raw)
  To: Gene Heskett
  Cc: linux-kernel, andrei.popa, Andrew Morton, Linus Torvalds,
	Nick Piggin, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

On Mon, 2006-12-18 at 10:24 -0500, Gene Heskett wrote:
> On Monday 18 December 2006 05:49, Andrei Popa wrote:
> >> OK, I'll try this on a ext3 box. BTW, what data mode are you using
> >> ext3 in?
> >
> >ordered
> >
> >> Also, for testings sake, could you give this a go:
> >> It's a total hack but I guess worth testing.
> >>
> >> ---
> >>  mm/rmap.c |    2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> Index: linux-2.6-git/mm/rmap.c
> >> ===================================================================
> >> --- linux-2.6-git.orig/mm/rmap.c	2006-12-18 11:06:29.000000000 +0100
> >> +++ linux-2.6-git/mm/rmap.c	2006-12-18 11:07:16.000000000 +0100
> >> @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page
> >>  		goto unlock;
> >>
> >>  	entry = ptep_get_and_clear(mm, address, pte);
> >> -	entry = pte_mkclean(entry);
> >> +	/* entry = pte_mkclean(entry); */
> >>  	entry = pte_wrprotect(entry);
> >>  	ptep_establish(vma, address, pte, entry);
> >>  	lazy_mmu_prot_update(entry);
> >
> >with latest git and this patch there is no corruption !
> >
> I've not run a torrent app here recently.  Should this patch be applied to
> a plain 2.6.20-rc1 before I do run Azureus or similar apps?

Depends on what the blue frog does; if it uses MAP_SHARED like rtorrent
does, then yeah, probably. This patch really should not be the final one;
I'm currently still trying to wrap my head around the issue. That said,
it should be safe to use :-)
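
For anyone unsure what "uses MAP_SHARED" means in practice, here is a
minimal userspace sketch of that access pattern: a chunk is stored into a
file through a shared mapping and later compared against what read()
returns. How rtorrent itself is structured is not shown in this thread, so
the function names, the chunk layout and the explicit msync() are
assumptions made for illustration only.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700	/* pread() and ftruncate() under strict -std= modes */
#endif
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create (or truncate) a file of the given size; unwritten ranges read as zeros. */
int make_sparse_file(const char *path, off_t size)
{
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	int rc;

	if (fd < 0)
		return -1;
	rc = ftruncate(fd, size);
	close(fd);
	return rc;
}

/* Store one chunk through a shared file mapping. Returns 0 on success, -1 on error. */
int write_chunk_shared(const char *path, off_t offset, const void *data, size_t len)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	struct stat st;
	char *map;
	off_t base;
	size_t skew;
	int rc = -1;
	int fd = open(path, O_RDWR);

	assert(pagesize > 0);
	if (fd < 0)
		return -1;
	/* Touching a mapping beyond EOF raises SIGBUS, so check the range first. */
	if (len == 0 || offset < 0 || fstat(fd, &st) < 0 ||
	    offset + (off_t)len > st.st_size)
		goto out;

	base = offset - (offset % pagesize);	/* mmap offsets must be page aligned */
	skew = (size_t)(offset - base);
	map = mmap(NULL, skew + len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, base);
	if (map == MAP_FAILED)
		goto out;

	/* No write() call here: the kernel only learns about this store
	 * through page faults and the pte dirty bits. */
	memcpy(map + skew, data, len);
	rc = msync(map, skew + len, MS_SYNC);
	if (munmap(map, skew + len) < 0)
		rc = -1;
out:
	close(fd);
	return rc;
}

/* Read the chunk back with pread(). Returns 1 on match, 0 on mismatch, -1 on error. */
int verify_chunk(const char *path, off_t offset, const void *data, size_t len)
{
	const char *want = data;
	char buf[4096];
	int ok = 1;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	while (len > 0 && ok == 1) {
		size_t chunk = len < sizeof(buf) ? len : sizeof(buf);
		ssize_t n = pread(fd, buf, chunk, offset);

		if (n <= 0) {
			ok = -1;
		} else if (memcmp(buf, want, (size_t)n) != 0) {
			ok = 0;
		} else {
			want += n;
			offset += n;
			len -= (size_t)n;
		}
	}
	close(fd);
	return ok;
}
```

A mismatch reported by verify_chunk() after a successful
write_chunk_shared() would be the userspace-visible symptom discussed in
this thread. On a healthy kernel this is not expected to happen, and a
single run like this is unlikely to reproduce the problem by itself.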



* Re: 2.6.19 file content corruption on ext3
  2006-12-18 15:32                             ` Peter Zijlstra
@ 2006-12-18 15:47                               ` Gene Heskett
  0 siblings, 0 replies; 311+ messages in thread
From: Gene Heskett @ 2006-12-18 15:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, andrei.popa, Andrew Morton, Linus Torvalds,
	Nick Piggin, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

On Monday 18 December 2006 10:32, Peter Zijlstra wrote:
[...]
>>
>> I've not run a torrent app here recently.  Should this patch be
>> applied to a plain 2.6.20-rc1 before I do run Azureus or similar apps?
>
>depends on what the blue frog does, if it uses MAP_SHARED like rtorrent
>does then yeah, probably. This patch really should not be the final one,
>I'm currently still trying to wrap my head around the issue. That said,
>it should be safe to use :-)
>
Thanks, I'll do it.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2006 by Maurice Eugene Heskett, all rights reserved.


* Re: 2.6.19 file content corruption on ext3
  2006-12-17 23:40     ` Andrew Morton
  2006-12-18  1:02       ` Linus Torvalds
  2006-12-18  1:22       ` Linus Torvalds
@ 2006-12-18 16:55       ` Peter Zijlstra
  2006-12-18 18:03         ` Linus Torvalds
  2 siblings, 1 reply; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-18 16:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: andrei.popa, Linux Kernel Mailing List, Hugh Dickins,
	Linus Torvalds, Florian Weimer, Marc Haber, Martin Michlmayr

On Sun, 2006-12-17 at 15:40 -0800, Andrew Morton wrote:
> On Sun, 17 Dec 2006 15:39:32 +0200
> Andrei Popa <andrei.popa@i-neo.ro> wrote:
> 
> > I was mistaken, I'm still having file corruption with rtorrent.
> > 
> 
> Well I'm not very optimistic, but if people could try this, please...
> 
> 
> 
> From: Andrew Morton <akpm@osdl.org>
> 
> try_to_free_buffers() clears the page's dirty state if it successfully removed
> the page's buffers.
> 
>   Background for this:
> 
>   - a process does a one-byte-write to a file on a 64k pagesize, 4k
>     blocksize ext3 filesystem.  The page is now PageDirty, !PageUptodate and
>     has one dirty buffer and 15 not uptodate buffers.
> 
>   - kjournald writes the dirty buffer.  The page is now PageDirty,
>     !PageUptodate and has a mix of clean and not uptodate buffers.
> 
>   - try_to_free_buffers() removes the page's buffers.  It MUST now clear
>     PageDirty.  If we were to leave the page dirty then we'd have a dirty, not
>     uptodate page with no buffer_heads.
> 
>     We're screwed: we cannot write the page because we don't know which
>     sections of it contain garbage.  We cannot read the page because we don't
>     know which sections of it contain modified data.  We cannot free the page
>     because it is dirty.
> 

How about we stick something like this on top of that patch. It should
preserve the dirty state as required.

I tried to tinker with avoiding the clear/set thing but could not
convince myself it was close to safe.

This should be safe; page_mkclean walks the rmap and flips the pte's
under the pte lock and records the dirty state while iterating.
Concurrent faults will either do set_page_dirty() before we get around
to doing it or vice versa, but dirty state is not lost.

---
 mm/page-writeback.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux-2.6-git/mm/page-writeback.c
===================================================================
--- linux-2.6-git.orig/mm/page-writeback.c	2006-12-18 17:24:41.000000000 +0100
+++ linux-2.6-git/mm/page-writeback.c	2006-12-18 17:26:56.000000000 +0100
@@ -872,8 +872,9 @@ int test_clear_page_dirty(struct page *p
 		 * page is locked, which pins the address_space
 		 */
 		if (mapping_cap_account_dirty(mapping)) {
-			if (must_clean_ptes)
-				page_mkclean(page);
+			int cleaned = page_mkclean(page);
+			if (!must_clean_ptes && cleaned)
+				set_page_dirty(page);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
 		return 1;
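
The ordering argument above can be checked mechanically on a toy model.
The sketch below is not kernel code: it reduces a page to three booleans,
splits test_clear_page_dirty() (with the hunk above, must_clean_ptes == 0)
into three steps, and injects one store from a shared mapping at every
possible position. All toy_* names are made up for illustration, locking is
not modelled, and the model assumes page_mkclean() reports a pte as cleaned
when it was dirty or writable.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_state {
	bool page_dirty;	/* PG_dirty on the struct page */
	bool pte_dirty;		/* dirty bit in the single pte mapping the page */
	bool pte_writable;
	int cleaned;		/* what the page_mkclean() step returned */
};

/* Step 0: the TestClearPageDirty part. The page is assumed to start dirty. */
void toy_clear_flag(struct toy_state *s)
{
	s->page_dirty = false;
}

/* Step 1: page_mkclean(), clean and write-protect the pte. */
void toy_mkclean(struct toy_state *s)
{
	s->cleaned = s->pte_dirty || s->pte_writable;
	s->pte_dirty = false;
	s->pte_writable = false;
}

/* Step 2: the re-dirty added by the hunk above. */
void toy_redirty(struct toy_state *s)
{
	if (s->cleaned)
		s->page_dirty = true;
}

/* One store through the mapping. A writable pte just gets its dirty bit
 * set by the hardware; a write-protected one faults, and the fault path
 * marks the page dirty as well. */
void toy_store(struct toy_state *s)
{
	if (!s->pte_writable) {
		s->pte_writable = true;
		s->page_dirty = true;	/* set_page_dirty() from the fault */
	}
	s->pte_dirty = true;
}

/* Run the three steps with the store injected before step fault_at
 * (3 means after the last step, any other value means no store at all).
 * Returns the final PG_dirty. */
bool toy_run(struct toy_state s, int fault_at)
{
	void (*steps[3])(struct toy_state *) = {
		toy_clear_flag, toy_mkclean, toy_redirty
	};

	for (int i = 0; i < 3; i++) {
		if (i == fault_at)
			toy_store(&s);
		steps[i](&s);
	}
	if (fault_at == 3)
		toy_store(&s);
	return s.page_dirty;
}

/* True if no placement of the store leaves the page marked clean. */
bool toy_dirty_never_lost(struct toy_state initial)
{
	for (int at = 0; at <= 3; at++)
		if (!toy_run(initial, at))
			return false;
	return true;
}
```

In this model the page ends up dirty wherever the store lands, for a
write-protected, a writable and a writable-and-dirty starting pte alike.
That only illustrates the claim made above; it says nothing about races
the model leaves out.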




* Re: 2.6.19 file content corruption on ext3
  2006-12-18 16:55       ` Peter Zijlstra
@ 2006-12-18 18:03         ` Linus Torvalds
  2006-12-18 18:24           ` Peter Zijlstra
  2006-12-19  4:36           ` Nick Piggin
  0 siblings, 2 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-18 18:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, andrei.popa, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr


Andrei,
 could you try Peter's patch (on top of Andrew's patch - it depends on 
it, and wouldn't work on an unmodified -git kernel), but add the WARN_ON() 
I mention in this email? You seem to be able to reproduce this easily.. 
Thanks.

On Mon, 18 Dec 2006, Peter Zijlstra wrote:
> 
> This should be safe; page_mkclean walks the rmap and flips the pte's
> under the pte lock and records the dirty state while iterating.
> Concurrent faults will either do set_page_dirty() before we get around
> to doing it or vice versa, but dirty state is not lost.

Ok, I really liked this patch, but the more I thought about it, the more I 
started to doubt the reasons for liking it.

I think we have some core fundamental problem here that this patch is 
needed at all.

So let's think about this: we apparently have two cases of 
"clear_page_dirty()":

 - the one that really wants to clear the bit unconditionally (Andrew 
   calls this the "must_clean_ptes" case, which I personally find to be a 
   really confusing name, but whatever)

 - the other case. The case that doesn't want to really clear the pte 
   dirty bits.

and I thought your patch made sense, because it saved away the pte state 
in the page dirty state, and that matches my mental model, but the more I 
think about it, the less sense that whole "the other case" situation makes 
AT ALL.

Why does "the other case" exist at all? If you want to clear the dirty 
page flag, what is _ever_ the reason for not wanting to drop PTE dirty 
information? In other words, what possible reason can there ever be for 
saying "I want this page to be clean", while at the same time saying "but 
if it was dirty in the page tables, don't forget about that state".

So I absolutely detested Andrew's original patch, because that one made 
zero sense at all even from a code standpoint. With your patch on top, it 
all suddenly makes sense: at least you don't just leave dirty pages in the 
PTE's with a "struct page" that is marked clean, and the end result is 
undeniably at least _consistent_.

So Andrew's patch I can't stand, because the whole point of it seems to be 
to leave the system in an inconsistent state (dirty in the pte's but 
marked "clean"), and if we want to have that state, then we should just 
revert _everything_ to the 2.6.18 situation, and not play these games at 
all.

Andrew's patch with your patch on top makes me happy, because now we're 
at least honoring all the basic rules (we don't get into an inconsistent 
state), so on a local level it all makes sense. HOWEVER, I then don't 
actually understand how it could ever actually make sense to ask for 
"please clean the page, but don't actually clean it".

So _I_ think that we should add a honking huge WARN_ON() for this case. 
Ie, do your patch, but instead of re-dirtying the page:

+                       if (!must_clean_ptes && cleaned)
+                               set_page_dirty(page);

we would do

+                       if (!must_clean_ptes && cleaned) {
+                               WARN_ON(1);
+                               set_page_dirty(page);
+                       }

and ask the people who see this problem to see if they get the WARN_ON() 
(assuming it _fixes_ their data corruption).

Because whoever calls "clear_page_dirty()" without actually wanting to 
clean the PTE's is really a bug: those dirty PTE's had better not exist.

Or maybe the WARN_ON() just points out _why_ somebody would want to do 
something this insane. Right now I just can't see why it's a valid thing 
to do.
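
To make the "consistent" versus "inconsistent" distinction concrete, here
is a toy model, not kernel code. It compares Andrew's patch alone (skip
page_mkclean() unless must_clean_ptes is set) with Peter's hunk on top plus
the proposed WARN_ON(1), which is replaced by a counter. The toy_* names,
the single-pte page and the exact consistency rule are simplifying
assumptions.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_page {
	bool page_dirty;	/* PG_dirty on the struct page */
	bool pte_dirty;		/* dirty bit of the one pte mapping it */
	bool pte_writable;
};

int toy_warn_count;		/* stands in for WARN_ON(1) backtraces */

/* The rule argued for above: never a dirty pte behind a "clean" page. */
bool toy_consistent(const struct toy_page *p)
{
	return p->page_dirty || !p->pte_dirty;
}

/* Models page_mkclean(); assumed to return 1 if the pte was dirty or writable. */
int toy_page_mkclean(struct toy_page *p)
{
	int cleaned = p->pte_dirty || p->pte_writable;

	p->pte_dirty = false;
	p->pte_writable = false;
	return cleaned;
}

/* Andrew's patch alone: the ptes are only cleaned when asked for. */
int toy_clear_dirty_andrew(struct toy_page *p, int must_clean_ptes)
{
	if (!p->page_dirty)
		return 0;
	p->page_dirty = false;
	if (must_clean_ptes)
		toy_page_mkclean(p);
	return 1;
}

/* Peter's hunk on top, with the WARN_ON(1) replaced by a counter. */
int toy_clear_dirty_warn(struct toy_page *p, int must_clean_ptes)
{
	int cleaned;

	if (!p->page_dirty)
		return 0;
	p->page_dirty = false;
	cleaned = toy_page_mkclean(p);
	if (!must_clean_ptes && cleaned) {
		toy_warn_count++;
		p->page_dirty = true;	/* set_page_dirty(page) */
	}
	return 1;
}
```

With a dirty, writable pte and must_clean_ptes == 0, the first variant
leaves toy_consistent() false, while the second keeps it true and bumps the
counter once. That counter is the signal a real WARN_ON() would provide.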

Maybe I'm still confused. 

		Linus


* Re: 2.6.19 file content corruption on ext3
  2006-12-18 18:03         ` Linus Torvalds
@ 2006-12-18 18:24           ` Peter Zijlstra
  2006-12-18 18:35             ` Linus Torvalds
  2006-12-19  4:36           ` Nick Piggin
  1 sibling, 1 reply; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-18 18:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, andrei.popa, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 10:03 -0800, Linus Torvalds wrote:
> Andrei,
>  could you try Peter's patch (on top of Andrew's patch - it depends on 
> it, and wouldn't work on an unmodified -git kernel, but add the WARN_ON() 
> I mention in this email? You seem to be able to reproduce this easily.. 
> Thanks)

I finally beat yum into submission and I hope to have rtorrent compiled
shortly.

> On Mon, 18 Dec 2006, Peter Zijlstra wrote:
> > 
> > This should be safe; page_mkclean walks the rmap and flips the pte's
> > under the pte lock and records the dirty state while iterating.
> > Concurrent faults will either do set_page_dirty() before we get around
> > to doing it or vice versa, but dirty state is not lost.
> 
> Ok, I really liked this patch, but the more I thought about it, the more I 
> started to doubt the reasons for liking it.
> 
> I think we have some core fundamental problem here that this patch is 
> needed at all.

I agree, but I suspect this is like the buffered write deadlock Nick is
working on, in that it will require some proper filesystem surgery to
get right. Having the kernel working in the meantime has my
preference ;-)

> So let's think about this: we apparently have two cases of 
> "clear_page_dirty()":
> 
>  - the one that really wants to clear the bit unconditionally (Andrew 
>    calls this the "must_clean_ptes" case, which I personally find to be a 
>    really confusing name, but whatever)

I'm probably worse with names so I'm not even going to try and fix that.

>  - the other case. The case that doesn't want to really clear the pte 
>    dirty bits.
> 
> and I thought your patch made sense, because it saved away the pte state 
> in the page dirty state, and that matches my mental model, but the more I 
> think about it, the less sense that whole "the other case" situation makes 
> AT ALL.
>
> Why does "the other case" exist at all? If you want to clear the dirty 
> page flag, what is _ever_ the reason for not wanting to drop PTE dirty 
> information? In other words, what possible reason can there ever be for 
> saying "I want this page to be clean", while at the same time saying "but 
> if it was dirty in the page tables, don't forget about that state".

I have tried to get my head around this, and have so far failed. Andrew's
mail with the patch (great-grandparent to this mail) was the one that
made the most sense explaining it, afaics.

> So I absolutely detested Andrew's original patch, because that one made 
> zero sense at all even from a code standpoint. With your patch on top, it 
> all suddenly makes sense: at least you don't just leave dirty pages in the 
> PTE's with a "struct page" that is marked clean, and the end result is 
> undeniably at least _consistent_.
> 
> So Andrew's patch I can't stand, because the whole point of it seems to be 
> to leave the system in an inconsistent state (dirty in the pte's but 
> marked "clean"), and if we want to have that state, then we should just 
> revert _everything_ to the 2.6.18 situation, and not play these games at 
> all.
> 
> Andrew's patch with your patch on top makes me happy, because now we're 
> at least honoring all the basic rules (we don't get into an inconsistent 
> state), so on a local level it all makes sense. HOWEVER, I then don't 
> actually understand how it could ever actually make sense to ask for 
> "please clean the page, but don't actually clean it".

Somehow it loses track of the actual page content dirtiness when it does
the page buffer game.

Is this because page buffers are used to do sub-page sized writes
without RMW cycles?

Can this case not be avoided when the page is mapped, since at that
point the whole page will be resident anyway?

> So _I_ think that we should add a honking huge WARN_ON() for this case. 
> Ie, do your patch, but instead of re-dirtying the page:
> 
> +                       if (!must_clean_ptes && cleaned)
> +                               set_page_dirty(page);
> 
> we would do
> 
> +                       if (!must_clean_ptes && cleaned) {
> +                               WARN_ON(1);
> +                               set_page_dirty(page);
> +                       }
> 
> and ask the people who see this problem to see if they get the WARN_ON() 
> (assuming it _fixes_ their data corruption).
> 
> Because whoever calls "clear_page_dirty()" without actually wanting to 
> clean the PTE's is really a bug: those dirty PTE's had better not exist.
> 
> Or maybe the WARN_ON() just points out _why_ somebody would want to do 
> something this insane. Right now I just can't see why it's a valid thing 
> to do.

Maybe, but I think Nick's mail here:
  http://lkml.org/lkml/2006/12/18/59

shows a trace like that. I'm guessing that if we do the WARN_ON() some
folks might get a lot of output, perhaps WARN_ON_ONCE() ?



* Re: 2.6.19 file content corruption on ext3
  2006-12-18 18:24           ` Peter Zijlstra
@ 2006-12-18 18:35             ` Linus Torvalds
  2006-12-18 19:04               ` Andrei Popa
  0 siblings, 1 reply; 311+ messages in thread
From: Linus Torvalds @ 2006-12-18 18:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, andrei.popa, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr



On Mon, 18 Dec 2006, Peter Zijlstra wrote:
> > 
> > Or maybe the WARN_ON() just points out _why_ somebody would want to do 
> > something this insane. Right now I just can't see why it's a valid thing 
> > to do.
> 
> Maybe, but I think Nick's mail here:
>   http://lkml.org/lkml/2006/12/18/59
> 
> shows a trace like that.

Sure, but I actually think that "try_to_free_buffers()" was buggy in the 
first place, shouldn't have done what it did at all (it has NO business 
clearing dirty data), and should be fixed with my other simple and clean 
patch that just removes the crap.

But sadly, Andrei said that he still saw data corruption, which implies 
that the problem had nothing to do with "try_to_free_buffers()" at all.

(On that note: Andrei - if you do test this out, I'd suggest applying my 
patch too - the one that you already tested. It won't apply cleanly on top 
of Andrew's patch, but it should be trivial to apply by hand, since you 
really just want to remove the whole "if (ret) {...}" sequence. I realize 
that it didn't make any difference for you, but applying that patch is 
probably a good idea just to remove the noise for a codepath that you 
already showed to not matter)

> I'm guessing that if we do the WARN_ON() some folks might get a lot of 
> output, perhaps WARN_ON_ONCE() ?

Well, I'd rather get lots of noise to see all the paths that can cause 
this. We've been concentrating mainly on one (try_to_free_buffers()), but 
that one was already shown not to matter or at least not to be the _whole_ 
issue, so..

		Linus


* Re: 2.6.19 file content corruption on ext3
  2006-12-18 18:35             ` Linus Torvalds
@ 2006-12-18 19:04               ` Andrei Popa
  2006-12-18 19:10                 ` Peter Zijlstra
  2006-12-18 19:18                 ` Linus Torvalds
  0 siblings, 2 replies; 311+ messages in thread
From: Andrei Popa @ 2006-12-18 19:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr


> (On that note: Andrei - if you do test this out, I'd suggest applying my 
> patch too - the one that you already tested. It won't apply cleanly on top 
> of Andrew's patch, but it should be trivial to apply by hand, since you 
> really just want to remove the whole "if (ret) {...}" sequence. I realize 
> that it didn't make any difference for you, but applying that patch is 
> probably a good idea just to remove the noise for a codepath that you 
> already showed to not matter)


I applied Linus' patch, Andrew's patch, and Peter Zijlstra's patches (the
last two). The combined unified patch is attached. I tested it and I have
no corruption.


diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageDirty(page) || PageWriteback(page))
 		return 0;
 
 	if (mapping == NULL) {		/* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
 	spin_unlock(&mapping->private_lock);
-	if (ret) {
-		/*
-		 * If the filesystem writes its buffers by hand (eg ext3)
-		 * then we can have clean buffers against a dirty page.  We
-		 * clean the page here; otherwise later reattachment of buffers
-		 * could encounter a non-uptodate page, which is unresolvable.
-		 * This only applies in the rare case where try_to_free_buffers
-		 * succeeds but the page is not freed.
-		 *
-		 * Also, during truncate, discard_buffer will have marked all
-		 * the page's buffers clean.  We discover that here and clean
-		 * the page also.
-		 */
-		if (test_clear_page_dirty(page))
-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
-	}
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0f05cab..760442f 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1245,7 +1245,7 @@ retry:
 				wait_on_page_writeback(page);
 
 			if (PageWriteback(page) ||
-					!test_clear_page_dirty(page)) {
+					!test_clear_page_dirty(page, 1)) {
 				unlock_page(page);
 				break;
 			}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1387749..da2bdb1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
 		spin_unlock(&fc->lock);
 
 		if (offset == 0 && to == PAGE_CACHE_SIZE) {
-			clear_page_dirty(page);
+			clear_page_dirty(page, 0);
 			SetPageUptodate(page);
 		}
 	}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..7b87875 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
 
 static void truncate_huge_page(struct page *page)
 {
-	clear_page_dirty(page);
+	clear_page_dirty(page, 1);
 	ClearPageUptodate(page);
 	remove_from_page_cache(page);
 	put_page(page);
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index b1a1c72..47a6b62 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1
 
 	/* Retest mp->count since we may have released page lock */
 	if (test_bit(META_discard, &mp->flag) && !mp->count) {
-		clear_page_dirty(page);
+		clear_page_dirty(page, 1);
 		ClearPageUptodate(page);
 	}
 #else
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 47e7027..a97e198 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
 				bh = next;
 			} while (bh != head);
 			if (PAGE_SIZE == bh->b_size) {
-				clear_page_dirty(page);
+				clear_page_dirty(page, 0);
 			}
 		}
 	}
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index b56eb75..d65ba84 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -343,7 +343,7 @@ xfs_start_page_writeback(
 	ASSERT(!PageWriteback(page));
 	set_page_writeback(page);
 	if (clear_dirty)
-		clear_page_dirty(page);
+		clear_page_dirty(page, 1);
 	unlock_page(page);
 	if (!buffers) {
 		end_page_writeback(page);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..175ab3c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,13 +253,13 @@ #define ClearPageUncached(page)	clear_bi
 
 struct page;	/* forward declaration */
 
-int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);
 
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int must_clean_ptes)
 {
-	test_clear_page_dirty(page);
+	test_clear_page_dirty(page, must_clean_ptes);
 }
 
 static inline void set_page_writeback(struct page *page)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..561d702 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
  * Clear a page's dirty flag, while caring for dirty memory accounting.
  * Returns true if the page was previously dirty.
  */
-int test_clear_page_dirty(struct page *page)
+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
 {
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
@@ -866,7 +866,9 @@ int test_clear_page_dirty(struct page *p
 		 * page is locked, which pins the address_space
 		 */
 		if (mapping_cap_account_dirty(mapping)) {
-			page_mkclean(page);
+			int cleaned = page_mkclean(page);
+			if (!must_clean_ptes && cleaned)
+				set_page_dirty(page);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
 		return 1;
diff --git a/mm/rmap.c b/mm/rmap.c
index d8a842a..3f9061e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -448,7 +448,7 @@ static int page_mkclean_one(struct page 
 		goto unlock;
 
 	entry = ptep_get_and_clear(mm, address, pte);
-	entry = pte_mkclean(entry);
+	/*entry = pte_mkclean(entry);*/
 	entry = pte_wrprotect(entry);
 	ptep_establish(vma, address, pte, entry);
 	lazy_mmu_prot_update(entry);
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..cafa843 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
-	if (test_clear_page_dirty(page))
+	if (test_clear_page_dirty(page, 1))
 		task_io_account_cancelled_write(PAGE_CACHE_SIZE);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);
@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
 					  PAGE_CACHE_SIZE, 0);
 				}
 			}
-			was_dirty = test_clear_page_dirty(page);
+			was_dirty = test_clear_page_dirty(page, 0);
 			if (!invalidate_complete_page2(mapping, page)) {
 				if (was_dirty)
 					set_page_dirty(page);

> 
> > I'm guessing that if we do the WARN_ON() some folks might get a lot of 
> > output, perhaps WARN_ON_ONCE() ?
> 
> Well, I'd rather get lots of noise to see all the paths that can cause 
> this. We've been concentrating mainly on one (try_to_free_buffers()), but 
> that one was already shown not to matter or at least not to be the _whole_ 
> issue, so..
> 
> 		Linus



* Re: 2.6.19 file content corruption on ext3
  2006-12-18 19:04               ` Andrei Popa
@ 2006-12-18 19:10                 ` Peter Zijlstra
  2006-12-18 19:18                 ` Linus Torvalds
  1 sibling, 0 replies; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-18 19:10 UTC (permalink / raw)
  To: andrei.popa
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 21:04 +0200, Andrei Popa wrote:

> diff --git a/mm/rmap.c b/mm/rmap.c
> index d8a842a..3f9061e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page 
>  		goto unlock;
>  
>  	entry = ptep_get_and_clear(mm, address, pte);
> -	entry = pte_mkclean(entry);
> +	/*entry = pte_mkclean(entry);*/
>  	entry = pte_wrprotect(entry);
>  	ptep_establish(vma, address, pte, entry);
>  	lazy_mmu_prot_update(entry);

Please drop this chunk; it will always make the problem go away.




* Re: 2.6.19 file content corruption on ext3
  2006-12-18 19:04               ` Andrei Popa
  2006-12-18 19:10                 ` Peter Zijlstra
@ 2006-12-18 19:18                 ` Linus Torvalds
  2006-12-18 19:44                   ` Andrei Popa
  2006-12-19  7:38                   ` Peter Zijlstra
  1 sibling, 2 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-18 19:18 UTC (permalink / raw)
  To: Andrei Popa
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr



On Mon, 18 Dec 2006, Andrei Popa wrote:
> 
> I applied Linus patch, Andrew patch, Peter Zijlstra patches(the last
> two). All unified patch is attached. I tested and I have no corruption.

That wasn't very interesting, because you also had the patch that just 
disabled "page_mkclean_one()" entirely:

> diff --git a/mm/rmap.c b/mm/rmap.c
> index d8a842a..3f9061e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page 
>  		goto unlock;
>  
>  	entry = ptep_get_and_clear(mm, address, pte);
> -	entry = pte_mkclean(entry);
> +	/*entry = pte_mkclean(entry);*/
>  	entry = pte_wrprotect(entry);
>  	ptep_establish(vma, address, pte, entry);
>  	lazy_mmu_prot_update(entry);

The above patch is bad. It's always going to hide the bug, but it hides it 
by just not doing anything at all. So any patch combination that contains 
that patch will probably _always_ fix your problem, but it won't be an 
interesting patch..
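
A toy model of why that hunk masks the problem rather than fixing it (not
kernel code; the toy_* names are made up). It assumes, purely for
illustration, that some not yet identified path wrongly drops the page's
dirty flag, and that tearing down the mapping later transfers a dirty pte
back to the page, as the unmap path does for file pages.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_pte {
	bool dirty;
	bool writable;
};

/* Models the body of page_mkclean_one(); skip_mkclean selects the
 * variant with pte_mkclean() commented out. */
int toy_page_mkclean_one(struct toy_pte *pte, bool skip_mkclean)
{
	if (!pte->dirty && !pte->writable)
		return 0;
	if (!skip_mkclean)
		pte->dirty = false;	/* pte_mkclean() */
	pte->writable = false;		/* pte_wrprotect() */
	return 1;
}

/* Models unmapping: a dirty pte re-dirties the page. Returns the new PG_dirty. */
bool toy_unmap(struct toy_pte *pte, bool page_dirty)
{
	if (pte->dirty)
		page_dirty = true;
	pte->dirty = false;
	pte->writable = false;
	return page_dirty;
}

/* Does modified data still count as dirty after the hypothetical buggy
 * path has cleared PG_dirty? */
bool toy_data_survives(bool skip_mkclean)
{
	struct toy_pte pte = { .dirty = true, .writable = true };
	bool page_dirty = true;

	toy_page_mkclean_one(&pte, skip_mkclean);
	page_dirty = false;	/* hypothetical: the flag is lost here */
	return toy_unmap(&pte, page_dirty);
}
```

With the hack the pte never loses its dirty bit, so the data is found again
at unmap time whatever happened to the page flag in between. Without it the
model loses the data, which is the only case that tells us anything about
the real bug.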

So can you remove that small fragment? Also, it would be nice if you added 
the WARN_ON() to this sequence in mm/page-writeback.c:

+                       if (!must_clean_ptes && cleaned)
+                               set_page_dirty(page);

just make it do a WARN_ON() if this ever triggers.

Then, IF the corruption is gone, we'd love to see the WARN_ON results..

		Linus


* Re: 2.6.19 file content corruption on ext3
  2006-12-18 19:18                 ` Linus Torvalds
@ 2006-12-18 19:44                   ` Andrei Popa
  2006-12-18 20:14                     ` Linus Torvalds
  2006-12-19  7:38                   ` Peter Zijlstra
  1 sibling, 1 reply; 311+ messages in thread
From: Andrei Popa @ 2006-12-18 19:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 11:18 -0800, Linus Torvalds wrote:
> 
> On Mon, 18 Dec 2006, Andrei Popa wrote:
> > 
> > I applied Linus patch, Andrew patch, Peter Zijlstra patches(the last
> > two). All unified patch is attached. I tested and I have no corruption.
> 
> That wasn't very interesting, because you also had the patch that just 
> disabled "page_mkclean_one()" entirely:
> 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index d8a842a..3f9061e 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page 
> >  		goto unlock;
> >  
> >  	entry = ptep_get_and_clear(mm, address, pte);
> > -	entry = pte_mkclean(entry);
> > +	/*entry = pte_mkclean(entry);*/
> >  	entry = pte_wrprotect(entry);
> >  	ptep_establish(vma, address, pte, entry);
> >  	lazy_mmu_prot_update(entry);
> 
> The above patch is bad. It's always going to hide the bug, but it hides it 
> by just not doing anything at all. So any patch combination that contains 
> that patch will probably _always_ fix your problem, but it won't be an 
> interesting patch..
> 
> So can you remove that small fragment? Also, it would be nice if you added 
> the WARN_ON() to this sequence in mm/page-writeback.c:
> 
> +                       if (!must_clean_ptes && cleaned)
> +                               set_page_dirty(page);
> 
> just make it do a WARN_ON() if this ever triggers.
> 
> Then, IF the corruption is gone, we'd love to see the WARN_ON results..
> 
> 		Linus

I dropped that patch and added WARN_ON(1), the unified patch is
attached.

I got corruption: "Hash check on download completion found bad chunks,
consider using "safe_sync"."

There is no message from WARN_ON(1) in dmesg. My .config is attached.
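
For reference, the check added to test_clear_page_dirty() in the patch
below can be modelled in userspace roughly as follows. This is only a
sketch under stated assumptions: struct toy_page, toy_page_mkclean() and
toy_test_clear_page_dirty() are hypothetical stand-ins, not kernel code,
and locking, the address_space and the dirty accounting are left out. It
shows that the warning can only fire when a caller passes
must_clean_ptes == 0 and a dirty pte was still found, so a silent dmesg
suggests the corruption seen here did not go through that path.

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for struct page; NOT the kernel structure. */
struct toy_page {
	bool dirty;      /* models the PG_dirty flag */
	int dirty_ptes;  /* number of mappings whose pte is dirty */
	int warnings;    /* how many times the WARN_ON stand-in fired */
};

/* Models page_mkclean(): clean every pte, return how many were dirty. */
static int toy_page_mkclean(struct toy_page *page)
{
	int cleaned = page->dirty_ptes;

	page->dirty_ptes = 0;
	return cleaned;
}

/*
 * Models test_clear_page_dirty(page, must_clean_ptes) from the patch.
 * Returns 1 if the page was previously dirty, 0 otherwise.
 */
static int toy_test_clear_page_dirty(struct toy_page *page, int must_clean_ptes)
{
	if (!page->dirty)
		return 0;
	page->dirty = false;

	int cleaned = toy_page_mkclean(page);
	if (!must_clean_ptes && cleaned) {
		/* WARN_ON(1) in the patch */
		page->warnings++;
		fprintf(stderr, "toy WARN: dirty pte cleaned unexpectedly\n");
		/* set_page_dirty(page) in the patch */
		page->dirty = true;
	}
	return 1;
}
```

Calling toy_test_clear_page_dirty() on a dirty page with dirty ptes and
must_clean_ptes == 0 bumps the warning counter and leaves the page dirty;
with must_clean_ptes == 1 the page ends up clean and nothing is reported.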



diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageDirty(page) || PageWriteback(page))
 		return 0;
 
 	if (mapping == NULL) {		/* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
 	spin_unlock(&mapping->private_lock);
-	if (ret) {
-		/*
-		 * If the filesystem writes its buffers by hand (eg ext3)
-		 * then we can have clean buffers against a dirty page.  We
-		 * clean the page here; otherwise later reattachment of buffers
-		 * could encounter a non-uptodate page, which is unresolvable.
-		 * This only applies in the rare case where try_to_free_buffers
-		 * succeeds but the page is not freed.
-		 *
-		 * Also, during truncate, discard_buffer will have marked all
-		 * the page's buffers clean.  We discover that here and clean
-		 * the page also.
-		 */
-		if (test_clear_page_dirty(page))
-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
-	}
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0f05cab..760442f 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1245,7 +1245,7 @@ retry:
 				wait_on_page_writeback(page);
 
 			if (PageWriteback(page) ||
-					!test_clear_page_dirty(page)) {
+					!test_clear_page_dirty(page, 1)) {
 				unlock_page(page);
 				break;
 			}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1387749..da2bdb1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
 		spin_unlock(&fc->lock);
 
 		if (offset == 0 && to == PAGE_CACHE_SIZE) {
-			clear_page_dirty(page);
+			clear_page_dirty(page, 0);
 			SetPageUptodate(page);
 		}
 	}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..7b87875 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
 
 static void truncate_huge_page(struct page *page)
 {
-	clear_page_dirty(page);
+	clear_page_dirty(page, 1);
 	ClearPageUptodate(page);
 	remove_from_page_cache(page);
 	put_page(page);
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index b1a1c72..47a6b62 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1
 
 	/* Retest mp->count since we may have released page lock */
 	if (test_bit(META_discard, &mp->flag) && !mp->count) {
-		clear_page_dirty(page);
+		clear_page_dirty(page, 1);
 		ClearPageUptodate(page);
 	}
 #else
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 47e7027..a97e198 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
 				bh = next;
 			} while (bh != head);
 			if (PAGE_SIZE == bh->b_size) {
-				clear_page_dirty(page);
+				clear_page_dirty(page, 0);
 			}
 		}
 	}
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index b56eb75..d65ba84 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -343,7 +343,7 @@ xfs_start_page_writeback(
 	ASSERT(!PageWriteback(page));
 	set_page_writeback(page);
 	if (clear_dirty)
-		clear_page_dirty(page);
+		clear_page_dirty(page, 1);
 	unlock_page(page);
 	if (!buffers) {
 		end_page_writeback(page);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..175ab3c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,13 +253,13 @@ #define ClearPageUncached(page)	clear_bi
 
 struct page;	/* forward declaration */
 
-int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);
 
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int must_clean_ptes)
 {
-	test_clear_page_dirty(page);
+	test_clear_page_dirty(page, must_clean_ptes);
 }
 
 static inline void set_page_writeback(struct page *page)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..f7e0cc8 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
  * Clear a page's dirty flag, while caring for dirty memory accounting. 
  * Returns true if the page was previously dirty.
  */
-int test_clear_page_dirty(struct page *page)
+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
 {
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
@@ -866,7 +866,12 @@ int test_clear_page_dirty(struct page *p
 		 * page is locked, which pins the address_space
 		 */
 		if (mapping_cap_account_dirty(mapping)) {
-			page_mkclean(page);
+			int cleaned = page_mkclean(page);
+			if (!must_clean_ptes && cleaned) {
+				WARN_ON(1);
+				set_page_dirty(page);
+			}
+
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
 		return 1;
diff --git a/mm/rmap.c b/mm/rmap.c
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..cafa843 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
-	if (test_clear_page_dirty(page))
+	if (test_clear_page_dirty(page, 1))
 		task_io_account_cancelled_write(PAGE_CACHE_SIZE);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);
@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
 					  PAGE_CACHE_SIZE, 0);
 				}
 			}
-			was_dirty = test_clear_page_dirty(page);
+			was_dirty = test_clear_page_dirty(page, 0);
 			if (!invalidate_complete_page2(mapping, page)) {
 				if (was_dirty)
 					set_page_dirty(page);


#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.20-rc1
# Sun Dec 17 01:52:12 2006
#
CONFIG_X86_32=y
CONFIG_GENERIC_TIME=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_X86=y
CONFIG_MMU=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
# CONFIG_IPC_NS is not set
# CONFIG_POSIX_MQUEUE is not set
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_UTS_NS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
# CONFIG_IKCONFIG_PROC is not set
# CONFIG_CPUSETS is not set
# CONFIG_SYSFS_DEPRECATED is not set
# CONFIG_RELAY is not set
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SHMEM=y
CONFIG_SLAB=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
# CONFIG_SLOB is not set

#
# Loadable module support
#
# CONFIG_MODULES is not set
CONFIG_STOP_MACHINE=y

#
# Block layer
#
CONFIG_BLOCK=y
# CONFIG_LBD is not set
# CONFIG_BLK_DEV_IO_TRACE is not set
# CONFIG_LSF is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"

#
# Processor type and features
#
CONFIG_SMP=y
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_NUMAQ is not set
# CONFIG_X86_SUMMIT is not set
# CONFIG_X86_BIGSMP is not set
# CONFIG_X86_VISWS is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_ES7000 is not set
# CONFIG_PARAVIRT is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
CONFIG_MPENTIUMM=y
# CONFIG_MCORE2 is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP2 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_X86_GENERIC is not set
CONFIG_X86_CMPXCHG=y
CONFIG_X86_XADD=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_TSC=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_NR_CPUS=8
# CONFIG_SCHED_SMT is not set
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_BKL=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_NONFATAL=y
CONFIG_X86_MCE_P4THERMAL=y
CONFIG_VM86=y
# CONFIG_TOSHIBA is not set
# CONFIG_I8K is not set
# CONFIG_X86_REBOOTFIXUPS is not set
# CONFIG_MICROCODE is not set
# CONFIG_X86_MSR is not set
# CONFIG_X86_CPUID is not set

#
# Firmware Drivers
#
# CONFIG_EDD is not set
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set
# CONFIG_NOHIGHMEM is not set
CONFIG_HIGHMEM4G=y
# CONFIG_HIGHMEM64G is not set
CONFIG_PAGE_OFFSET=0xC0000000
CONFIG_HIGHMEM=y
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
CONFIG_SPARSEMEM_STATIC=y
CONFIG_SPLIT_PTLOCK_CPUS=4
# CONFIG_RESOURCES_64BIT is not set
# CONFIG_HIGHPTE is not set
# CONFIG_MATH_EMULATION is not set
CONFIG_MTRR=y
# CONFIG_EFI is not set
CONFIG_IRQBALANCE=y
# CONFIG_SECCOMP is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
# CONFIG_KEXEC is not set
# CONFIG_CRASH_DUMP is not set
# CONFIG_RELOCATABLE is not set
CONFIG_PHYSICAL_ALIGN=0x100000
CONFIG_HOTPLUG_CPU=y
# CONFIG_COMPAT_VDSO is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y

#
# Power management options (ACPI, APM)
#
CONFIG_PM=y
# CONFIG_PM_LEGACY is not set
# CONFIG_PM_DEBUG is not set
# CONFIG_PM_SYSFS_DEPRECATED is not set
CONFIG_SOFTWARE_SUSPEND=y
CONFIG_PM_STD_PARTITION=""
CONFIG_SUSPEND_SMP=y

#
# ACPI (Advanced Configuration and Power Interface) Support
#
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_SLEEP_PROC_FS=y
# CONFIG_ACPI_SLEEP_PROC_SLEEP is not set
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_VIDEO=y
CONFIG_ACPI_HOTKEY=y
CONFIG_ACPI_FAN=y
# CONFIG_ACPI_DOCK is not set
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ACPI_THERMAL=y
# CONFIG_ACPI_ASUS is not set
# CONFIG_ACPI_IBM is not set
# CONFIG_ACPI_TOSHIBA is not set
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_EC=y
CONFIG_ACPI_POWER=y
CONFIG_ACPI_SYSTEM=y
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y

#
# APM (Advanced Power Management) BIOS Support
#
# CONFIG_APM is not set

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
# CONFIG_CPU_FREQ_DEBUG is not set
CONFIG_CPU_FREQ_STAT=y
# CONFIG_CPU_FREQ_STAT_DETAILS is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y

#
# CPUFreq processor drivers
#
CONFIG_X86_ACPI_CPUFREQ=y
# CONFIG_X86_POWERNOW_K6 is not set
# CONFIG_X86_POWERNOW_K7 is not set
# CONFIG_X86_POWERNOW_K8 is not set
# CONFIG_X86_GX_SUSPMOD is not set
CONFIG_X86_SPEEDSTEP_CENTRINO=y
CONFIG_X86_SPEEDSTEP_CENTRINO_ACPI=y
# CONFIG_X86_SPEEDSTEP_CENTRINO_TABLE is not set
CONFIG_X86_SPEEDSTEP_ICH=y
# CONFIG_X86_SPEEDSTEP_SMI is not set
# CONFIG_X86_P4_CLOCKMOD is not set
# CONFIG_X86_CPUFREQ_NFORCE2 is not set
# CONFIG_X86_LONGRUN is not set
# CONFIG_X86_LONGHAUL is not set

#
# shared options
#
# CONFIG_X86_ACPI_CPUFREQ_PROC_INTF is not set
CONFIG_X86_SPEEDSTEP_LIB=y
# CONFIG_X86_SPEEDSTEP_RELAXED_CAP_CHECK is not set

#
# Bus options (PCI, PCMCIA, EISA, MCA, ISA)
#
CONFIG_PCI=y
# CONFIG_PCI_GOBIOS is not set
# CONFIG_PCI_GOMMCONFIG is not set
# CONFIG_PCI_GODIRECT is not set
CONFIG_PCI_GOANY=y
CONFIG_PCI_BIOS=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
# CONFIG_PCIEPORTBUS is not set
CONFIG_PCI_MSI=y
# CONFIG_PCI_MULTITHREAD_PROBE is not set
# CONFIG_PCI_DEBUG is not set
# CONFIG_HT_IRQ is not set
CONFIG_ISA_DMA_API=y
# CONFIG_ISA is not set
# CONFIG_MCA is not set
# CONFIG_SCx200 is not set

#
# PCCARD (PCMCIA/CardBus) support
#
# CONFIG_PCCARD is not set

#
# PCI Hotplug Support
#
# CONFIG_HOTPLUG_PCI is not set

#
# Executable file formats
#
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_AOUT=y
CONFIG_BINFMT_MISC=y

#
# Networking
#
CONFIG_NET=y

#
# Networking options
#
# CONFIG_NETDEBUG is not set
CONFIG_PACKET=y
CONFIG_PACKET_MMAP=y
CONFIG_UNIX=y
# CONFIG_NET_KEY is not set
CONFIG_INET=y
# CONFIG_IP_MULTICAST is not set
# CONFIG_IP_ADVANCED_ROUTER is not set
CONFIG_IP_FIB_HASH=y
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
# CONFIG_ARPD is not set
# CONFIG_SYN_COOKIES is not set
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_TUNNEL is not set
# CONFIG_INET_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET_XFRM_MODE_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_BEET is not set
# CONFIG_INET_DIAG is not set
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set
# CONFIG_IPV6 is not set
# CONFIG_INET6_XFRM_TUNNEL is not set
# CONFIG_INET6_TUNNEL is not set
# CONFIG_NETWORK_SECMARK is not set
# CONFIG_NETFILTER is not set

#
# DCCP Configuration (EXPERIMENTAL)
#
# CONFIG_IP_DCCP is not set

#
# SCTP Configuration (EXPERIMENTAL)
#
# CONFIG_IP_SCTP is not set

#
# TIPC Configuration (EXPERIMENTAL)
#
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_BRIDGE is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set

#
# QoS and/or fair queueing
#
# CONFIG_NET_SCHED is not set

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_HAMRADIO is not set
# CONFIG_IRDA is not set
CONFIG_BT=y
CONFIG_BT_L2CAP=y
CONFIG_BT_SCO=y
CONFIG_BT_RFCOMM=y
CONFIG_BT_RFCOMM_TTY=y
CONFIG_BT_BNEP=y
# CONFIG_BT_BNEP_MC_FILTER is not set
# CONFIG_BT_BNEP_PROTO_FILTER is not set
CONFIG_BT_HIDP=y

#
# Bluetooth device drivers
#
CONFIG_BT_HCIUSB=y
# CONFIG_BT_HCIUSB_SCO is not set
# CONFIG_BT_HCIUART is not set
# CONFIG_BT_HCIBCM203X is not set
# CONFIG_BT_HCIBPA10X is not set
# CONFIG_BT_HCIBFUSB is not set
# CONFIG_BT_HCIVHCI is not set
# CONFIG_IEEE80211 is not set
CONFIG_WIRELESS_EXT=y

#
# Device Drivers
#

#
# Generic Driver Options
#
# CONFIG_STANDALONE is not set
# CONFIG_PREVENT_FIRMWARE_BUILD is not set
CONFIG_FW_LOADER=y
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_SYS_HYPERVISOR is not set

#
# Connector - unified userspace <-> kernelspace linker
#
# CONFIG_CONNECTOR is not set

#
# Memory Technology Devices (MTD)
#
# CONFIG_MTD is not set

#
# Parallel port support
#
# CONFIG_PARPORT is not set

#
# Plug and Play support
#
CONFIG_PNP=y
# CONFIG_PNP_DEBUG is not set

#
# Protocols
#
CONFIG_PNPACPI=y

#
# Block devices
#
CONFIG_BLK_DEV_FD=y
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=y
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_UB is not set
# CONFIG_BLK_DEV_RAM is not set
# CONFIG_BLK_DEV_INITRD is not set
# CONFIG_CDROM_PKTCDVD is not set
# CONFIG_ATA_OVER_ETH is not set

#
# Misc devices
#
# CONFIG_IBM_ASM is not set
# CONFIG_SGI_IOC4 is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_MSI_LAPTOP is not set

#
# ATA/ATAPI/MFM/RLL support
#
CONFIG_IDE=y
CONFIG_BLK_DEV_IDE=y

#
# Please see Documentation/ide.txt for help/info on IDE drives
#
# CONFIG_BLK_DEV_IDE_SATA is not set
# CONFIG_BLK_DEV_HD_IDE is not set
CONFIG_BLK_DEV_IDEDISK=y
CONFIG_IDEDISK_MULTI_MODE=y
CONFIG_BLK_DEV_IDECD=y
# CONFIG_BLK_DEV_IDETAPE is not set
# CONFIG_BLK_DEV_IDEFLOPPY is not set
CONFIG_BLK_DEV_IDESCSI=y
# CONFIG_IDE_TASK_IOCTL is not set

#
# IDE chipset support/bugfixes
#
CONFIG_IDE_GENERIC=y
# CONFIG_BLK_DEV_CMD640 is not set
# CONFIG_BLK_DEV_IDEPNP is not set
CONFIG_BLK_DEV_IDEPCI=y
CONFIG_IDEPCI_SHARE_IRQ=y
# CONFIG_BLK_DEV_OFFBOARD is not set
CONFIG_BLK_DEV_GENERIC=y
# CONFIG_BLK_DEV_OPTI621 is not set
# CONFIG_BLK_DEV_RZ1000 is not set
CONFIG_BLK_DEV_IDEDMA_PCI=y
# CONFIG_BLK_DEV_IDEDMA_FORCED is not set
CONFIG_IDEDMA_PCI_AUTO=y
# CONFIG_IDEDMA_ONLYDISK is not set
# CONFIG_BLK_DEV_AEC62XX is not set
# CONFIG_BLK_DEV_ALI15X3 is not set
# CONFIG_BLK_DEV_AMD74XX is not set
# CONFIG_BLK_DEV_ATIIXP is not set
# CONFIG_BLK_DEV_CMD64X is not set
# CONFIG_BLK_DEV_TRIFLEX is not set
# CONFIG_BLK_DEV_CY82C693 is not set
# CONFIG_BLK_DEV_CS5520 is not set
# CONFIG_BLK_DEV_CS5530 is not set
# CONFIG_BLK_DEV_CS5535 is not set
# CONFIG_BLK_DEV_HPT34X is not set
# CONFIG_BLK_DEV_HPT366 is not set
# CONFIG_BLK_DEV_JMICRON is not set
# CONFIG_BLK_DEV_SC1200 is not set
CONFIG_BLK_DEV_PIIX=y
# CONFIG_BLK_DEV_IT821X is not set
# CONFIG_BLK_DEV_NS87415 is not set
# CONFIG_BLK_DEV_PDC202XX_OLD is not set
# CONFIG_BLK_DEV_PDC202XX_NEW is not set
# CONFIG_BLK_DEV_SVWKS is not set
# CONFIG_BLK_DEV_SIIMAGE is not set
# CONFIG_BLK_DEV_SIS5513 is not set
# CONFIG_BLK_DEV_SLC90E66 is not set
# CONFIG_BLK_DEV_TRM290 is not set
# CONFIG_BLK_DEV_VIA82CXXX is not set
# CONFIG_IDE_ARM is not set
CONFIG_BLK_DEV_IDEDMA=y
# CONFIG_IDEDMA_IVB is not set
CONFIG_IDEDMA_AUTO=y
# CONFIG_BLK_DEV_HD is not set

#
# SCSI device support
#
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=y
# CONFIG_SCSI_TGT is not set
# CONFIG_SCSI_NETLINK is not set
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=y
# CONFIG_BLK_DEV_SR_VENDOR is not set
CONFIG_CHR_DEV_SG=y
# CONFIG_CHR_DEV_SCH is not set

#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
CONFIG_SCSI_MULTI_LUN=y
# CONFIG_SCSI_CONSTANTS is not set
# CONFIG_SCSI_LOGGING is not set
# CONFIG_SCSI_SCAN_ASYNC is not set

#
# SCSI Transports
#
# CONFIG_SCSI_SPI_ATTRS is not set
# CONFIG_SCSI_FC_ATTRS is not set
# CONFIG_SCSI_ISCSI_ATTRS is not set
# CONFIG_SCSI_SAS_ATTRS is not set
# CONFIG_SCSI_SAS_LIBSAS is not set

#
# SCSI low-level drivers
#
# CONFIG_ISCSI_TCP is not set
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_ACARD is not set
# CONFIG_SCSI_AACRAID is not set
# CONFIG_SCSI_AIC7XXX is not set
# CONFIG_SCSI_AIC7XXX_OLD is not set
# CONFIG_SCSI_AIC79XX is not set
# CONFIG_SCSI_AIC94XX is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
# CONFIG_SCSI_ARCMSR is not set
# CONFIG_MEGARAID_NEWGEN is not set
# CONFIG_MEGARAID_LEGACY is not set
# CONFIG_MEGARAID_SAS is not set
# CONFIG_SCSI_HPTIOP is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_STEX is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
# CONFIG_SCSI_QLA_FC is not set
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_LPFC is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_NSP32 is not set
# CONFIG_SCSI_DEBUG is not set
# CONFIG_SCSI_SRP is not set

#
# Serial ATA (prod) and Parallel ATA (experimental) drivers
#
CONFIG_ATA=y
CONFIG_SATA_AHCI=y
# CONFIG_SATA_SVW is not set
CONFIG_ATA_PIIX=y
# CONFIG_SATA_MV is not set
# CONFIG_SATA_NV is not set
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SX4 is not set
# CONFIG_SATA_SIL is not set
# CONFIG_SATA_SIL24 is not set
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set
CONFIG_SATA_INTEL_COMBINED=y
# CONFIG_PATA_ALI is not set
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CS5520 is not set
# CONFIG_PATA_CS5530 is not set
# CONFIG_PATA_CS5535 is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_ATA_GENERIC is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_OLDPIIX is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RZ1000 is not set
# CONFIG_PATA_SC1200 is not set
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_SIL680 is not set
# CONFIG_PATA_SIS is not set
# CONFIG_PATA_VIA is not set
# CONFIG_PATA_WINBOND is not set

#
# Multi-device support (RAID and LVM)
#
# CONFIG_MD is not set

#
# Fusion MPT device support
#
# CONFIG_FUSION is not set
# CONFIG_FUSION_SPI is not set
# CONFIG_FUSION_FC is not set
# CONFIG_FUSION_SAS is not set

#
# IEEE 1394 (FireWire) support
#
CONFIG_IEEE1394=y

#
# Subsystem Options
#
# CONFIG_IEEE1394_VERBOSEDEBUG is not set
# CONFIG_IEEE1394_OUI_DB is not set
# CONFIG_IEEE1394_EXTRA_CONFIG_ROMS is not set
# CONFIG_IEEE1394_EXPORT_FULL_API is not set

#
# Device Drivers
#

#
# Texas Instruments PCILynx requires I2C
#
CONFIG_IEEE1394_OHCI1394=y

#
# Protocol Drivers
#
# CONFIG_IEEE1394_VIDEO1394 is not set
CONFIG_IEEE1394_SBP2=y
# CONFIG_IEEE1394_SBP2_PHYS_DMA is not set
# CONFIG_IEEE1394_ETH1394 is not set
# CONFIG_IEEE1394_DV1394 is not set
CONFIG_IEEE1394_RAWIO=y

#
# I2O device support
#
# CONFIG_I2O is not set

#
# Network device support
#
CONFIG_NETDEVICES=y
# CONFIG_DUMMY is not set
# CONFIG_BONDING is not set
# CONFIG_EQUALIZER is not set
# CONFIG_TUN is not set
# CONFIG_NET_SB1000 is not set

#
# ARCnet devices
#
# CONFIG_ARCNET is not set

#
# PHY device support
#
# CONFIG_PHYLIB is not set

#
# Ethernet (10 or 100Mbit)
#
CONFIG_NET_ETHERNET=y
CONFIG_MII=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
# CONFIG_NET_VENDOR_3COM is not set

#
# Tulip family network device support
#
# CONFIG_NET_TULIP is not set
# CONFIG_HP100 is not set
CONFIG_NET_PCI=y
# CONFIG_PCNET32 is not set
# CONFIG_AMD8111_ETH is not set
# CONFIG_ADAPTEC_STARFIRE is not set
# CONFIG_B44 is not set
# CONFIG_FORCEDETH is not set
# CONFIG_DGRS is not set
# CONFIG_EEPRO100 is not set
CONFIG_E100=y
# CONFIG_FEALNX is not set
# CONFIG_NATSEMI is not set
# CONFIG_NE2K_PCI is not set
# CONFIG_8139CP is not set
# CONFIG_8139TOO is not set
# CONFIG_SIS900 is not set
# CONFIG_EPIC100 is not set
# CONFIG_SUNDANCE is not set
# CONFIG_TLAN is not set
# CONFIG_VIA_RHINE is not set

#
# Ethernet (1000 Mbit)
#
# CONFIG_ACENIC is not set
# CONFIG_DL2K is not set
# CONFIG_E1000 is not set
# CONFIG_NS83820 is not set
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
# CONFIG_R8169 is not set
# CONFIG_SIS190 is not set
# CONFIG_SKGE is not set
# CONFIG_SKY2 is not set
# CONFIG_SK98LIN is not set
# CONFIG_VIA_VELOCITY is not set
# CONFIG_TIGON3 is not set
# CONFIG_BNX2 is not set
# CONFIG_QLA3XXX is not set

#
# Ethernet (10000 Mbit)
#
# CONFIG_CHELSIO_T1 is not set
# CONFIG_IXGB is not set
# CONFIG_S2IO is not set
# CONFIG_MYRI10GE is not set
# CONFIG_NETXEN_NIC is not set

#
# Token Ring devices
#
# CONFIG_TR is not set

#
# Wireless LAN (non-hamradio)
#
CONFIG_NET_RADIO=y
# CONFIG_NET_WIRELESS_RTNETLINK is not set

#
# Obsolete Wireless cards support (pre-802.11)
#
# CONFIG_STRIP is not set

#
# Wireless 802.11b ISA/PCI cards support
#
# CONFIG_IPW2100 is not set
# CONFIG_IPW2200 is not set
# CONFIG_AIRO is not set
# CONFIG_HERMES is not set
# CONFIG_ATMEL is not set

#
# Prism GT/Duette 802.11(a/b/g) PCI/Cardbus support
#
# CONFIG_PRISM54 is not set
# CONFIG_USB_ZD1201 is not set
# CONFIG_HOSTAP is not set
CONFIG_NET_WIRELESS=y

#
# Wan interfaces
#
# CONFIG_WAN is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_PPP is not set
# CONFIG_SLIP is not set
# CONFIG_NET_FC is not set
# CONFIG_SHAPER is not set
# CONFIG_NETCONSOLE is not set
# CONFIG_NETPOLL is not set
# CONFIG_NET_POLL_CONTROLLER is not set

#
# ISDN subsystem
#
# CONFIG_ISDN is not set

#
# Telephony Support
#
# CONFIG_PHONE is not set

#
# Input device support
#
CONFIG_INPUT=y
# CONFIG_INPUT_FF_MEMLESS is not set

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1280
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=800
# CONFIG_INPUT_JOYDEV is not set
# CONFIG_INPUT_TSDEV is not set
# CONFIG_INPUT_EVDEV is not set
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_PCSPKR is not set
CONFIG_INPUT_WISTRON_BTNS=y
# CONFIG_INPUT_UINPUT is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
# CONFIG_SERIO_SERPORT is not set
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_GAMEPORT is not set

#
# Character devices
#
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
# CONFIG_VT_HW_CONSOLE_BINDING is not set
# CONFIG_SERIAL_NONSTANDARD is not set

#
# Serial drivers
#
# CONFIG_SERIAL_8250 is not set

#
# Non-8250 serial port support
#
# CONFIG_SERIAL_JSM is not set
CONFIG_UNIX98_PTYS=y
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=256

#
# IPMI
#
# CONFIG_IPMI_HANDLER is not set

#
# Watchdog Cards
#
# CONFIG_WATCHDOG is not set
CONFIG_HW_RANDOM=y
CONFIG_HW_RANDOM_INTEL=y
# CONFIG_HW_RANDOM_AMD is not set
# CONFIG_HW_RANDOM_GEODE is not set
# CONFIG_HW_RANDOM_VIA is not set
CONFIG_NVRAM=y
CONFIG_RTC=y
# CONFIG_DTLK is not set
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set
# CONFIG_SONYPI is not set
CONFIG_AGP=y
# CONFIG_AGP_ALI is not set
# CONFIG_AGP_ATI is not set
# CONFIG_AGP_AMD is not set
# CONFIG_AGP_AMD64 is not set
CONFIG_AGP_INTEL=y
# CONFIG_AGP_NVIDIA is not set
# CONFIG_AGP_SIS is not set
# CONFIG_AGP_SWORKS is not set
# CONFIG_AGP_VIA is not set
# CONFIG_AGP_EFFICEON is not set
CONFIG_DRM=y
# CONFIG_DRM_TDFX is not set
# CONFIG_DRM_R128 is not set
# CONFIG_DRM_RADEON is not set
# CONFIG_DRM_I810 is not set
# CONFIG_DRM_I830 is not set
CONFIG_DRM_I915=y
# CONFIG_DRM_MGA is not set
# CONFIG_DRM_SIS is not set
# CONFIG_DRM_VIA is not set
# CONFIG_DRM_SAVAGE is not set
# CONFIG_MWAVE is not set
# CONFIG_PC8736x_GPIO is not set
# CONFIG_NSC_GPIO is not set
# CONFIG_CS5535_GPIO is not set
# CONFIG_RAW_DRIVER is not set
# CONFIG_HPET is not set
# CONFIG_HANGCHECK_TIMER is not set

#
# TPM devices
#
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set

#
# I2C support
#
# CONFIG_I2C is not set

#
# SPI support
#
# CONFIG_SPI is not set
# CONFIG_SPI_MASTER is not set

#
# Dallas's 1-wire bus
#
# CONFIG_W1 is not set

#
# Hardware Monitoring support
#
# CONFIG_HWMON is not set
# CONFIG_HWMON_VID is not set

#
# Multimedia devices
#
# CONFIG_VIDEO_DEV is not set

#
# Digital Video Broadcasting Devices
#
# CONFIG_DVB is not set
# CONFIG_USB_DABUSB is not set

#
# Graphics support
#
# CONFIG_FIRMWARE_EDID is not set
CONFIG_FB=y
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_MACMODES is not set
# CONFIG_FB_BACKLIGHT is not set
CONFIG_FB_MODE_HELPERS=y
# CONFIG_FB_TILEBLITTING is not set
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_VGA16 is not set
CONFIG_FB_VESA=y
# CONFIG_FB_HGA is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
# CONFIG_FB_RIVA is not set
CONFIG_FB_I810=y
CONFIG_FB_I810_GTF=y
# CONFIG_FB_I810_I2C is not set
CONFIG_FB_INTEL=y
# CONFIG_FB_INTEL_DEBUG is not set
# CONFIG_FB_INTEL_I2C is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_CYBLA is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_GEODE is not set
# CONFIG_FB_VIRTUAL is not set

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
# CONFIG_VGACON_SOFT_SCROLLBACK is not set
CONFIG_VIDEO_SELECT=y
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=y
# CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y

#
# Logo configuration
#
# CONFIG_LOGO is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
CONFIG_BACKLIGHT_CLASS_DEVICE=y
CONFIG_BACKLIGHT_DEVICE=y
CONFIG_LCD_CLASS_DEVICE=y
CONFIG_LCD_DEVICE=y

#
# Sound
#
CONFIG_SOUND=y

#
# Advanced Linux Sound Architecture
#
CONFIG_SND=y
CONFIG_SND_TIMER=y
CONFIG_SND_PCM=y
CONFIG_SND_SEQUENCER=y
# CONFIG_SND_SEQ_DUMMY is not set
# CONFIG_SND_MIXER_OSS is not set
# CONFIG_SND_PCM_OSS is not set
# CONFIG_SND_SEQUENCER_OSS is not set
CONFIG_SND_RTCTIMER=y
CONFIG_SND_SEQ_RTCTIMER_DEFAULT=y
# CONFIG_SND_DYNAMIC_MINORS is not set
CONFIG_SND_SUPPORT_OLD_API=y
CONFIG_SND_VERBOSE_PROCFS=y
# CONFIG_SND_VERBOSE_PRINTK is not set
# CONFIG_SND_DEBUG is not set

#
# Generic devices
#
CONFIG_SND_AC97_CODEC=y
# CONFIG_SND_DUMMY is not set
# CONFIG_SND_VIRMIDI is not set
# CONFIG_SND_MTPAV is not set
# CONFIG_SND_SERIAL_U16550 is not set
# CONFIG_SND_MPU401 is not set

#
# PCI devices
#
# CONFIG_SND_AD1889 is not set
# CONFIG_SND_ALS300 is not set
# CONFIG_SND_ALS4000 is not set
# CONFIG_SND_ALI5451 is not set
# CONFIG_SND_ATIIXP is not set
# CONFIG_SND_ATIIXP_MODEM is not set
# CONFIG_SND_AU8810 is not set
# CONFIG_SND_AU8820 is not set
# CONFIG_SND_AU8830 is not set
# CONFIG_SND_AZT3328 is not set
# CONFIG_SND_BT87X is not set
# CONFIG_SND_CA0106 is not set
# CONFIG_SND_CMIPCI is not set
# CONFIG_SND_CS4281 is not set
# CONFIG_SND_CS46XX is not set
# CONFIG_SND_CS5535AUDIO is not set
# CONFIG_SND_DARLA20 is not set
# CONFIG_SND_GINA20 is not set
# CONFIG_SND_LAYLA20 is not set
# CONFIG_SND_DARLA24 is not set
# CONFIG_SND_GINA24 is not set
# CONFIG_SND_LAYLA24 is not set
# CONFIG_SND_MONA is not set
# CONFIG_SND_MIA is not set
# CONFIG_SND_ECHO3G is not set
# CONFIG_SND_INDIGO is not set
# CONFIG_SND_INDIGOIO is not set
# CONFIG_SND_INDIGODJ is not set
# CONFIG_SND_EMU10K1 is not set
# CONFIG_SND_EMU10K1X is not set
# CONFIG_SND_ENS1370 is not set
# CONFIG_SND_ENS1371 is not set
# CONFIG_SND_ES1938 is not set
# CONFIG_SND_ES1968 is not set
# CONFIG_SND_FM801 is not set
CONFIG_SND_HDA_INTEL=y
# CONFIG_SND_HDSP is not set
# CONFIG_SND_HDSPM is not set
# CONFIG_SND_ICE1712 is not set
# CONFIG_SND_ICE1724 is not set
CONFIG_SND_INTEL8X0=y
CONFIG_SND_INTEL8X0M=y
# CONFIG_SND_KORG1212 is not set
# CONFIG_SND_MAESTRO3 is not set
# CONFIG_SND_MIXART is not set
# CONFIG_SND_NM256 is not set
# CONFIG_SND_PCXHR is not set
# CONFIG_SND_RIPTIDE is not set
# CONFIG_SND_RME32 is not set
# CONFIG_SND_RME96 is not set
# CONFIG_SND_RME9652 is not set
# CONFIG_SND_SONICVIBES is not set
# CONFIG_SND_TRIDENT is not set
# CONFIG_SND_VIA82XX is not set
# CONFIG_SND_VIA82XX_MODEM is not set
# CONFIG_SND_VX222 is not set
# CONFIG_SND_YMFPCI is not set
# CONFIG_SND_AC97_POWER_SAVE is not set

#
# USB devices
#
# CONFIG_SND_USB_AUDIO is not set
# CONFIG_SND_USB_USX2Y is not set

#
# Open Sound System
#
# CONFIG_SOUND_PRIME is not set
CONFIG_AC97_BUS=y

#
# HID Devices
#
# CONFIG_HID is not set

#
# USB support
#
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=y
# CONFIG_USB_DEBUG is not set

#
# Miscellaneous USB options
#
# CONFIG_USB_DEVICEFS is not set
# CONFIG_USB_BANDWIDTH is not set
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_SUSPEND is not set
# CONFIG_USB_MULTITHREAD_PROBE is not set
# CONFIG_USB_OTG is not set

#
# USB Host Controller Drivers
#
CONFIG_USB_EHCI_HCD=y
# CONFIG_USB_EHCI_SPLIT_ISO is not set
# CONFIG_USB_EHCI_ROOT_HUB_TT is not set
# CONFIG_USB_EHCI_TT_NEWSCHED is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_OHCI_HCD is not set
CONFIG_USB_UHCI_HCD=y
# CONFIG_USB_SL811_HCD is not set

#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
# CONFIG_USB_PRINTER is not set

#
# NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support'
#

#
# may also be needed; see USB_STORAGE Help for more information
#
CONFIG_USB_STORAGE=y
# CONFIG_USB_STORAGE_DEBUG is not set
# CONFIG_USB_STORAGE_DATAFAB is not set
# CONFIG_USB_STORAGE_FREECOM is not set
# CONFIG_USB_STORAGE_ISD200 is not set
# CONFIG_USB_STORAGE_DPCM is not set
# CONFIG_USB_STORAGE_USBAT is not set
# CONFIG_USB_STORAGE_SDDR09 is not set
# CONFIG_USB_STORAGE_SDDR55 is not set
# CONFIG_USB_STORAGE_JUMPSHOT is not set
# CONFIG_USB_STORAGE_ALAUDA is not set
# CONFIG_USB_STORAGE_KARMA is not set
# CONFIG_USB_LIBUSUAL is not set

#
# USB Input Devices
#

#
# USB HID Boot Protocol drivers
#
# CONFIG_USB_KBD is not set
# CONFIG_USB_MOUSE is not set
# CONFIG_USB_AIPTEK is not set
# CONFIG_USB_WACOM is not set
# CONFIG_USB_ACECAD is not set
# CONFIG_USB_KBTAB is not set
# CONFIG_USB_POWERMATE is not set
# CONFIG_USB_TOUCHSCREEN is not set
# CONFIG_USB_YEALINK is not set
# CONFIG_USB_XPAD is not set
# CONFIG_USB_ATI_REMOTE is not set
# CONFIG_USB_ATI_REMOTE2 is not set
# CONFIG_USB_KEYSPAN_REMOTE is not set
# CONFIG_USB_APPLETOUCH is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set

#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_USBNET_MII is not set
# CONFIG_USB_USBNET is not set
# CONFIG_USB_MON is not set

#
# USB port drivers
#

#
# USB Serial Converter support
#
# CONFIG_USB_SERIAL is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_AUERSWALD is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_LED is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_PHIDGET is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set

#
# USB DSL modem support
#

#
# USB Gadget Support
#
# CONFIG_USB_GADGET is not set

#
# MMC/SD Card support
#
# CONFIG_MMC is not set

#
# LED devices
#
# CONFIG_NEW_LEDS is not set

#
# LED drivers
#

#
# LED Triggers
#

#
# InfiniBand support
#
# CONFIG_INFINIBAND is not set

#
# EDAC - error detection and reporting (RAS) (EXPERIMENTAL)
#
# CONFIG_EDAC is not set

#
# Real Time Clock
#
# CONFIG_RTC_CLASS is not set

#
# DMA Engine support
#
# CONFIG_DMA_ENGINE is not set

#
# DMA Clients
#

#
# DMA Devices
#

#
# Virtualization
#
# CONFIG_KVM is not set

#
# File systems
#
CONFIG_EXT2_FS=y
# CONFIG_EXT2_FS_XATTR is not set
# CONFIG_EXT2_FS_XIP is not set
CONFIG_EXT3_FS=y
CONFIG_EXT3_FS_XATTR=y
# CONFIG_EXT3_FS_POSIX_ACL is not set
# CONFIG_EXT3_FS_SECURITY is not set
# CONFIG_EXT4DEV_FS is not set
CONFIG_JBD=y
# CONFIG_JBD_DEBUG is not set
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
# CONFIG_FS_POSIX_ACL is not set
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_ROMFS_FS is not set
# CONFIG_INOTIFY is not set
# CONFIG_QUOTA is not set
CONFIG_DNOTIFY=y
# CONFIG_AUTOFS_FS is not set
CONFIG_AUTOFS4_FS=y
# CONFIG_FUSE_FS is not set

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_ZISOFS_FS=y
CONFIG_UDF_FS=y
CONFIG_UDF_NLS=y

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=y
CONFIG_MSDOS_FS=y
CONFIG_VFAT_FS=y
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
CONFIG_NTFS_FS=y
# CONFIG_NTFS_DEBUG is not set
# CONFIG_NTFS_RW is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
# CONFIG_TMPFS_POSIX_ACL is not set
# CONFIG_HUGETLBFS is not set
# CONFIG_HUGETLB_PAGE is not set
CONFIG_RAMFS=y
# CONFIG_CONFIGFS_FS is not set

#
# Miscellaneous filesystems
#
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_CRAMFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_SYSV_FS is not set
CONFIG_UFS_FS=y
# CONFIG_UFS_FS_WRITE is not set
# CONFIG_UFS_DEBUG is not set

#
# Network File Systems
#
# CONFIG_NFS_FS is not set
# CONFIG_NFSD is not set
# CONFIG_SMB_FS is not set
CONFIG_CIFS=y
# CONFIG_CIFS_STATS is not set
# CONFIG_CIFS_WEAK_PW_HASH is not set
# CONFIG_CIFS_XATTR is not set
# CONFIG_CIFS_DEBUG2 is not set
# CONFIG_CIFS_EXPERIMENTAL is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
# CONFIG_9P_FS is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
# CONFIG_OSF_PARTITION is not set
# CONFIG_AMIGA_PARTITION is not set
# CONFIG_ATARI_PARTITION is not set
# CONFIG_MAC_PARTITION is not set
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
# CONFIG_MINIX_SUBPARTITION is not set
# CONFIG_SOLARIS_X86_PARTITION is not set
# CONFIG_UNIXWARE_DISKLABEL is not set
# CONFIG_LDM_PARTITION is not set
# CONFIG_SGI_PARTITION is not set
# CONFIG_ULTRIX_PARTITION is not set
# CONFIG_SUN_PARTITION is not set
# CONFIG_KARMA_PARTITION is not set
# CONFIG_EFI_PARTITION is not set

#
# Native Language Support
#
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="iso8859-1"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
# CONFIG_NLS_ASCII is not set
CONFIG_NLS_ISO8859_1=y
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
# CONFIG_NLS_UTF8 is not set

#
# Distributed Lock Manager
#
# CONFIG_DLM is not set

#
# Instrumentation Support
#
# CONFIG_PROFILING is not set

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
# CONFIG_PRINTK_TIME is not set
# CONFIG_ENABLE_MUST_CHECK is not set
CONFIG_MAGIC_SYSRQ=y
# CONFIG_UNUSED_SYMBOLS is not set
# CONFIG_DEBUG_FS is not set
# CONFIG_HEADERS_CHECK is not set
CONFIG_DEBUG_KERNEL=y
CONFIG_LOG_BUF_SHIFT=14
# CONFIG_DETECT_SOFTLOCKUP is not set
# CONFIG_SCHEDSTATS is not set
# CONFIG_DEBUG_SLAB is not set
# CONFIG_DEBUG_PREEMPT is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_RWSEMS is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_PROVE_LOCKING is not set
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_DEBUG_KOBJECT is not set
# CONFIG_DEBUG_HIGHMEM is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_INFO is not set
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_LIST is not set
# CONFIG_FRAME_POINTER is not set
# CONFIG_FORCED_INLINING is not set
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_EARLY_PRINTK=y
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set

#
# Page alloc debug is incompatible with Software Suspend on i386
#
# CONFIG_DEBUG_RODATA is not set
CONFIG_4KSTACKS=y
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_MPPARSE=y
CONFIG_DOUBLEFAULT=y

#
# Security options
#
# CONFIG_KEYS is not set
# CONFIG_SECURITY is not set

#
# Cryptographic options
#
CONFIG_CRYPTO=y
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_MANAGER=y
# CONFIG_CRYPTO_HMAC is not set
# CONFIG_CRYPTO_XCBC is not set
# CONFIG_CRYPTO_NULL is not set
# CONFIG_CRYPTO_MD4 is not set
CONFIG_CRYPTO_MD5=y
# CONFIG_CRYPTO_SHA1 is not set
# CONFIG_CRYPTO_SHA256 is not set
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_WP512 is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_GF128MUL is not set
# CONFIG_CRYPTO_ECB is not set
# CONFIG_CRYPTO_CBC is not set
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_DES is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_586 is not set
# CONFIG_CRYPTO_SERPENT is not set
CONFIG_CRYPTO_AES=y
CONFIG_CRYPTO_AES_586=y
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST6 is not set
# CONFIG_CRYPTO_TEA is not set
CONFIG_CRYPTO_ARC4=y
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_ANUBIS is not set
CONFIG_CRYPTO_DEFLATE=y
CONFIG_CRYPTO_MICHAEL_MIC=y
# CONFIG_CRYPTO_CRC32C is not set

#
# Hardware crypto devices
#
# CONFIG_CRYPTO_DEV_PADLOCK is not set
# CONFIG_CRYPTO_DEV_GEODE is not set

#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_CRC_CCITT=y
CONFIG_CRC16=y
CONFIG_CRC32=y
CONFIG_LIBCRC32C=y
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_PLIST=y
CONFIG_IOMAP_COPY=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_X86_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_X86_TRAMPOLINE=y
CONFIG_KTIME_SCALAR=y



* Re: 2.6.19 file content corruption on ext3
  2006-12-18 19:44                   ` Andrei Popa
@ 2006-12-18 20:14                     ` Linus Torvalds
  2006-12-18 20:41                       ` Linus Torvalds
                                         ` (3 more replies)
  0 siblings, 4 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-18 20:14 UTC (permalink / raw)
  To: Andrei Popa
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr



On Mon, 18 Dec 2006, Andrei Popa wrote:
> 
> I dropped that patch and added WARN_ON(1), the unified patch is
> attached.
> 
> I got corruption: "Hash check on download completion found bad chunks,
> consider using "safe_sync"."

Ok. That is actually _very_ interesting.

It's interesting because (a) the corruption obviously goes away with the 
one-liner that effectively disables "page_mkclean_one()".

So that tells us that yes, it's a PTE dirty bit that matters.

But at the same time, it's interesting that it still happens when we try 
to re-add the dirty bit. That would tell me that it's one of two cases:

 - there is another caller of page cleaning that should have done the same 
   thing (we could check that by just doing this all _inside_ the 
   page_mkclean() thing)

OR:

 - page_mkclean_one() is simply buggy.

And I'm starting to wonder about the second case. But it all LOOKS really 
fine - I can't see anything wrong there (it uses the extremely 
conservative "ptep_get_and_clear()", and seems to flush everything right 
too, through "ptep_establish()").

		Linus

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 20:14                     ` Linus Torvalds
@ 2006-12-18 20:41                       ` Linus Torvalds
  2006-12-18 21:11                         ` Andrei Popa
  2006-12-18 22:34                         ` Gene Heskett
  2006-12-18 21:43                       ` Andrew Morton
                                         ` (2 subsequent siblings)
  3 siblings, 2 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-18 20:41 UTC (permalink / raw)
  To: Andrei Popa
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr



On Mon, 18 Dec 2006, Linus Torvalds wrote:
> 
> But at the same time, it's interesting that it still happens when we try 
> to re-add the dirty bit. That would tell me that it's one of two cases:

Forget that. There's a third case, which is much more likely:

 - Andrew's patch had a ", 1" where it _should_ have had a ", 0".

This should be fairly easy to test: just change every single ", 1" case in 
the patch to ", 0".

The only case that _definitely_ would want ",1" is actually the case that 
already calls page_mkclean() directly: clear_page_dirty_for_io(). So no 
other ", 1" is valid, and that one that needed it already avoided even 
calling the "test_clear_page_dirty()" function, because it did it all by 
hand.

What happens for you in that case?

		Linus

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 20:41                       ` Linus Torvalds
@ 2006-12-18 21:11                         ` Andrei Popa
  2006-12-18 22:00                           ` Alessandro Suardi
  2006-12-18 22:32                           ` Linus Torvalds
  2006-12-18 22:34                         ` Gene Heskett
  1 sibling, 2 replies; 311+ messages in thread
From: Andrei Popa @ 2006-12-18 21:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 12:41 -0800, Linus Torvalds wrote:
> 
> On Mon, 18 Dec 2006, Linus Torvalds wrote:
> > 
> > But at the same time, it's interesting that it still happens when we try 
> > to re-add the dirty bit. That would tell me that it's one of two cases:
> 
> Forget that. There's a third case, which is much more likely:
> 
>  - Andrew's patch had a ", 1" where it _should_ have had a ", 0".
> 
> This should be fairly easy to test: just change every single ", 1" case in 
> the patch to ", 0".
> 
> The only case that _definitely_ would want ",1" is actually the case that 
> already calls page_mkclean() directly: clear_page_dirty_for_io(). So no 
> other ", 1" is valid, and that one that needed it already avoided even 
> calling the "test_clear_page_dirty()" function, because it did it all by 
> hand.
> 
> What happens for you in that case?
> 
> 		Linus

I have file corruption.


diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageDirty(page) || PageWriteback(page))
 		return 0;
 
 	if (mapping == NULL) {		/* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
 	spin_unlock(&mapping->private_lock);
-	if (ret) {
-		/*
-		 * If the filesystem writes its buffers by hand (eg ext3)
-		 * then we can have clean buffers against a dirty page.  We
-		 * clean the page here; otherwise later reattachment of buffers
-		 * could encounter a non-uptodate page, which is unresolvable.
-		 * This only applies in the rare case where try_to_free_buffers
-		 * succeeds but the page is not freed.
-		 *
-		 * Also, during truncate, discard_buffer will have marked all
-		 * the page's buffers clean.  We discover that here and clean
-		 * the page also.
-		 */
-		if (test_clear_page_dirty(page))
-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
-	}
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0f05cab..760442f 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1245,7 +1245,7 @@ retry:
 				wait_on_page_writeback(page);
 
 			if (PageWriteback(page) ||
-					!test_clear_page_dirty(page)) {
+					!test_clear_page_dirty(page, 0)) {
 				unlock_page(page);
 				break;
 			}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1387749..da2bdb1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
 		spin_unlock(&fc->lock);
 
 		if (offset == 0 && to == PAGE_CACHE_SIZE) {
-			clear_page_dirty(page);
+			clear_page_dirty(page, 0);
 			SetPageUptodate(page);
 		}
 	}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..7b87875 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
 
 static void truncate_huge_page(struct page *page)
 {
-	clear_page_dirty(page);
+	clear_page_dirty(page, 0);
 	ClearPageUptodate(page);
 	remove_from_page_cache(page);
 	put_page(page);
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index b1a1c72..47a6b62 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1
 
 	/* Retest mp->count since we may have released page lock */
 	if (test_bit(META_discard, &mp->flag) && !mp->count) {
-		clear_page_dirty(page);
+		clear_page_dirty(page, 0);
 		ClearPageUptodate(page);
 	}
 #else
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 47e7027..a97e198 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
 				bh = next;
 			} while (bh != head);
 			if (PAGE_SIZE == bh->b_size) {
-				clear_page_dirty(page);
+				clear_page_dirty(page, 0);
 			}
 		}
 	}
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index b56eb75..d65ba84 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -343,7 +343,7 @@ xfs_start_page_writeback(
 	ASSERT(!PageWriteback(page));
 	set_page_writeback(page);
 	if (clear_dirty)
-		clear_page_dirty(page);
+		clear_page_dirty(page, 0);
 	unlock_page(page);
 	if (!buffers) {
 		end_page_writeback(page);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..175ab3c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,13 +253,13 @@ #define ClearPageUncached(page)	clear_bi
 
 struct page;	/* forward declaration */
 
-int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);
 
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int must_clean_ptes)
 {
-	test_clear_page_dirty(page);
+	test_clear_page_dirty(page, must_clean_ptes);
 }
 
 static inline void set_page_writeback(struct page *page)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..f7e0cc8 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
  * Clear a page's dirty flag, while caring for dirty memory accounting.
  * Returns true if the page was previously dirty.
  */
-int test_clear_page_dirty(struct page *page)
+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
 {
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
@@ -866,7 +866,12 @@ int test_clear_page_dirty(struct page *p
 		 * page is locked, which pins the address_space
 		 */
 		if (mapping_cap_account_dirty(mapping)) {
-			page_mkclean(page);
+			int cleaned = page_mkclean(page);
+			if (!must_clean_ptes && cleaned) {
+				WARN_ON(1);
+				set_page_dirty(page);
+			}
+
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
 		return 1;
diff --git a/mm/rmap.c b/mm/rmap.c
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..cafa843 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
-	if (test_clear_page_dirty(page))
+	if (test_clear_page_dirty(page, 0))
 		task_io_account_cancelled_write(PAGE_CACHE_SIZE);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);
@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
 					  PAGE_CACHE_SIZE, 0);
 				}
 			}
-			was_dirty = test_clear_page_dirty(page);
+			was_dirty = test_clear_page_dirty(page, 0);
 			if (!invalidate_complete_page2(mapping, page)) {
 				if (was_dirty)
 					set_page_dirty(page);



* Re: 2.6.19 file content corruption on ext3
  2006-12-18 20:14                     ` Linus Torvalds
  2006-12-18 20:41                       ` Linus Torvalds
@ 2006-12-18 21:43                       ` Andrew Morton
  2006-12-18 21:49                       ` Peter Zijlstra
  2006-12-19 23:42                       ` Peter Zijlstra
  3 siblings, 0 replies; 311+ messages in thread
From: Andrew Morton @ 2006-12-18 21:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Popa, Peter Zijlstra, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 18 Dec 2006 12:14:35 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> OR:
> 
>  - page_mkclean_one() is simply buggy.
> 
> And I'm starting to wonder about the second case. But it all LOOKS really 
> fine - I can't see anything wrong there (it uses the extremely 
> conservative "ptep_get_and_clear()", and seems to flush everything right 
> too, through "ptep_establish()").

What does the call to page_check_address() in there do?

It'd be good to have a printk in there to see if it's triggering.

Is this all correct for non-linear VMAs?  (rtorrent doesn't use
MAP_NONLINEAR though).

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 20:14                     ` Linus Torvalds
  2006-12-18 20:41                       ` Linus Torvalds
  2006-12-18 21:43                       ` Andrew Morton
@ 2006-12-18 21:49                       ` Peter Zijlstra
  2006-12-19 23:42                       ` Peter Zijlstra
  3 siblings, 0 replies; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-18 21:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Popa, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> 
> On Mon, 18 Dec 2006, Andrei Popa wrote:
> > 
> > I dropped that patch and added WARN_ON(1), the unified patch is
> > attached.
> > 
> > I got corruption: "Hash check on download completion found bad chunks,
> > consider using "safe_sync"."
> 
> Ok. That is actually _very_ interesting.
> 
> It's interesting because (a) the corruption obviously goes away with the 
> one-liner that effectively disables "page_mkclean_one()".
> 
> So that tells us that yes, it's a PTE dirty bit that matters.
> 
> But at the same time, it's interesting that it still happens when we try 
> to re-add the dirty bit. That would tell me that it's one of two cases:
> 
>  - there is another caller of page cleaning that should have done the same 
>    thing (we could check that by just doing this all _inside_ the 
>    page_mkclean() thing)
> 
> OR:
> 
>  - page_mkclean_one() is simply buggy.
> 
> And I'm starting to wonder about the second case. But it all LOOKS really 
> fine - I can't see anything wrong there (it uses the extremely 
> conservative "ptep_get_and_clear()", and seems to flush everything right 
> too, through "ptep_establish()").

How about this:

We get confused about what PG_dirty tells us, so we fall back to pte_dirty,
transfer pte_dirty to PG_dirty and clear pte_dirty. Now it happens
again, but this time we don't have pte_dirty to fall back to anymore.

This would explain why disabling pte_mkclean() does make it go away and
none of the other approaches tried so far work.

We really need a way to sort out PG_dirty, independent of pte_dirty. 
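
The sequence described above can be sketched as a toy userspace model
in C. This is only an illustration of the hypothesis, not kernel code:
struct toy_page and every toy_* function are invented names, and
toy_lose_pg_dirty() stands in for whatever still-unidentified event
makes PG_dirty unreliable. The model also assumes that a write through
the mapping sets both dirty indications.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a page: one page-level flag, one pte dirty bit. */
struct toy_page {
	bool pg_dirty;   /* models PG_dirty */
	bool pte_dirty;  /* models the dirty bit in the mapping's pte */
};

/* Assumption of this toy: a write via the mapping sets both bits. */
static void toy_mmap_write(struct toy_page *p)
{
	p->pg_dirty = true;
	p->pte_dirty = true;
}

/* Hypothetical: the unknown event that wrongly drops PG_dirty. */
static void toy_lose_pg_dirty(struct toy_page *p)
{
	p->pg_dirty = false;
}

/* Fallback: move the pte dirty bit into PG_dirty and clean the pte. */
static void toy_transfer_pte_dirty(struct toy_page *p)
{
	if (p->pte_dirty) {
		p->pg_dirty = true;
		p->pte_dirty = false;
	}
}

/* The data is safe only while at least one dirty indication survives. */
static bool toy_dirty_known(const struct toy_page *p)
{
	return p->pg_dirty || p->pte_dirty;
}

/* Runs the hypothesised sequence; returns true if dirtiness is lost. */
static bool toy_run_sequence(void)
{
	struct toy_page p = { false, false };

	toy_mmap_write(&p);
	toy_lose_pg_dirty(&p);          /* first confusion */
	assert(toy_dirty_known(&p));    /* pte_dirty still covers for it */
	toy_transfer_pte_dirty(&p);     /* fall back, which cleans the pte */
	toy_lose_pg_dirty(&p);          /* second confusion */
	return !toy_dirty_known(&p);    /* nothing left to fall back to */
}
```

In this model toy_run_sequence() returns true: after the second loss
neither flag is set, so nothing would ever write the page back. If
toy_transfer_pte_dirty() never cleaned the pte, the second loss would
be harmless, which is the behaviour seen with the cleaning disabled.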


* Re: 2.6.19 file content corruption on ext3
  2006-12-18 21:11                         ` Andrei Popa
@ 2006-12-18 22:00                           ` Alessandro Suardi
  2006-12-18 22:45                             ` Linus Torvalds
  2006-12-18 22:32                           ` Linus Torvalds
  1 sibling, 1 reply; 311+ messages in thread
From: Alessandro Suardi @ 2006-12-18 22:00 UTC (permalink / raw)
  To: andrei.popa
  Cc: Linus Torvalds, Peter Zijlstra, Andrew Morton,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On 12/18/06, Andrei Popa <andrei.popa@i-neo.ro> wrote:
> On Mon, 2006-12-18 at 12:41 -0800, Linus Torvalds wrote:
> >
> > On Mon, 18 Dec 2006, Linus Torvalds wrote:
> > >
> > > But at the same time, it's interesting that it still happens when we try
> > > to re-add the dirty bit. That would tell me that it's one of two cases:
> >
> > Forget that. There's a third case, which is much more likely:
> >
> >  - Andrew's patch had a ", 1" where it _should_ have had a ", 0".
> >
> > This should be fairly easy to test: just change every single ", 1" case in
> > the patch to ", 0".
> >
> > The only case that _definitely_ would want ",1" is actually the case that
> > already calls page_mkclean() directly: clear_page_dirty_for_io(). So no
> > other ", 1" is valid, and that one that needed it already avoided even
> > calling the "test_clear_page_dirty()" function, because it did it all by
> > hand.
> >
> > What happens for you in that case?
> >
> >               Linus
>
> I have file corruption.

No idea whether this can be a data point or not, but
 here it goes... my P2P box is about to turn 5 days old
 while running nonstop one or both of aMule 2.1.3 and
 BitTorrent 4.4.0 on ext3 mounted w/default options
 on both IDE and USB disks. Zero corruption.

AMD K7-800, 512MB RAM, PREEMPT/UP kernel,
2.6.19-git20 on top of up-to-date FC6.

--alessandro

"...when I get it, I _get_ it"

     (Lara Eidemiller)

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 21:11                         ` Andrei Popa
  2006-12-18 22:00                           ` Alessandro Suardi
@ 2006-12-18 22:32                           ` Linus Torvalds
  2006-12-18 23:48                             ` Andrei Popa
  1 sibling, 1 reply; 311+ messages in thread
From: Linus Torvalds @ 2006-12-18 22:32 UTC (permalink / raw)
  To: Andrei Popa
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr



On Mon, 18 Dec 2006, Andrei Popa wrote:
> >
> > This should be fairly easy to test: just change every single ", 1" case in 
> > the patch to ", 0".
> >
> > What happens for you in that case?
> 
> I have file corruption.

Magic. And btw, _thanks_ for being such a great tester.

So now I have one more thing for you to try, if you can bother:

There are exactly two call sites that call "page_mkclean()" (and that is the 
only thing in turn that calls "page_mkclean_one()", which we already 
determined will cause the corruption). 

Both of them do 

	if (mapping_cap_account_dirty(mapping)) {
			..

things, although they do slightly different things inside that if in your 
patched kernel.

Can you just TOTALLY DISABLE that case for the test_clear_page_dirty() 
case? Just do an "#if 0 .. #endif" around that whole if-statement, leaving 
the _only_ thing that actually calls "page_mkclean()" to be the 
"clear_page_dirty_for_io()" call.

Do you still see corruption?

		Linus

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 20:41                       ` Linus Torvalds
  2006-12-18 21:11                         ` Andrei Popa
@ 2006-12-18 22:34                         ` Gene Heskett
  2006-12-22 17:27                           ` Linus Torvalds
  1 sibling, 1 reply; 311+ messages in thread
From: Gene Heskett @ 2006-12-18 22:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Andrei Popa, Peter Zijlstra, Andrew Morton,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Monday 18 December 2006 15:41, Linus Torvalds wrote:
>On Mon, 18 Dec 2006, Linus Torvalds wrote:
>> But at the same time, it's interesting that it still happens when we
>> try to re-add the dirty bit. That would tell me that it's one of two
>> cases:
>
>Forget that. There's a third case, which is much more likely:
>
> - Andrew's patch had a ", 1" where it _should_ have had a ", 0".
>
>This should be fairly easy to test: just change every single ", 1" case
> in the patch to ", 0".
>
>The only case that _definitely_ would want ",1" is actually the case
> that already calls page_mkclean() directly: clear_page_dirty_for_io().
> So no other ", 1" is valid, and that one that needed it already avoided
> even calling the "test_clear_page_dirty()" function, because it did it
> all by hand.
>
What about the mm/rmap.c one liner, in or out?

Thanks.

>What happens for you in that case?
>
>		Linus

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2006 by Maurice Eugene Heskett, all rights reserved.

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 22:00                           ` Alessandro Suardi
@ 2006-12-18 22:45                             ` Linus Torvalds
  2006-12-19  0:13                               ` Andrei Popa
  0 siblings, 1 reply; 311+ messages in thread
From: Linus Torvalds @ 2006-12-18 22:45 UTC (permalink / raw)
  To: Alessandro Suardi
  Cc: andrei.popa, Peter Zijlstra, Andrew Morton,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr



On Mon, 18 Dec 2006, Alessandro Suardi wrote:
> 
> No idea whether this can be a data point or not, but
> here it goes... my P2P box is about to turn 5 days old
> while running nonstop one or both of aMule 2.1.3 and
> BitTorrent 4.4.0 on ext3 mounted w/default options
> on both IDE and USB disks. Zero corruption.
> 
> AMD K7-800, 512MB RAM, PREEMPT/UP kernel,
> 2.6.19-git20 on top of up-to-date FC6.

It _looks_ like PREEMPT/SMP is one common configuration.

It might also be that the blocksize of the filesystem matters. 4kB 
filesystems are fundamentally simpler than 1kB filesystems, for example. 
You can tell at least with "/sbin/dumpe2fs -h /dev/..." or something.

Andrei - one thing that might be interesting to see: when corruption 
occurs, can you get the corrupted file somehow? And compare it with a 
known-good copy to see what the corruption looks like?

		Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 22:32                           ` Linus Torvalds
@ 2006-12-18 23:48                             ` Andrei Popa
  2006-12-19  0:04                               ` Linus Torvalds
  2006-12-19  1:03                               ` Gene Heskett
  0 siblings, 2 replies; 311+ messages in thread
From: Andrei Popa @ 2006-12-18 23:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 14:32 -0800, Linus Torvalds wrote:
> 
> On Mon, 18 Dec 2006, Andrei Popa wrote:
> > >
> > > This should be fairly easy to test: just change every single ", 1" case in 
> > > the patch to ", 0".
> > >
> > > What happens for you in that case?
> > 
> > I have file corruption.
> 
> Magic. And btw, _thanks_ for being such a great tester.
> 
> So now I have one more thing for you to try, if you can bother:
> 
> There's exactly two call sites that call "page_mkclean()" (and that is the 
> only thing in turn that calls "page_mkclean_one()", which we already 
> determined will cause the corruption). 
> 
> Both of them do 
> 
> 	if (mapping_cap_account_dirty(mapping)) {
> 			..
> 
> things, although they do slightly different things inside that if in your 
> patched kernel.
> 
> Can you just TOTALLY DISABLE that case for the test_clear_page_dirty() 
> case? Just do an "#if 0 .. #endif" around that whole if-statement, leaving 
> the _only_ thing that actually calls "page_mkclean()" to be the 
> "clear_page_dirty_for_io()" call.
> 
> Do you still see corruption?

nope, no file corruption at all.



diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageDirty(page) || PageWriteback(page))
 		return 0;
 
 	if (mapping == NULL) {		/* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
 	spin_unlock(&mapping->private_lock);
-	if (ret) {
-		/*
-		 * If the filesystem writes its buffers by hand (eg ext3)
-		 * then we can have clean buffers against a dirty page.  We
-		 * clean the page here; otherwise later reattachment of buffers
-		 * could encounter a non-uptodate page, which is unresolvable.
-		 * This only applies in the rare case where try_to_free_buffers
-		 * succeeds but the page is not freed.
-		 *
-		 * Also, during truncate, discard_buffer will have marked all
-		 * the page's buffers clean.  We discover that here and clean
-		 * the page also.
-		 */
-		if (test_clear_page_dirty(page))
-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
-	}
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0f05cab..2d8bbbb 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1245,7 +1245,7 @@ retry:
 				wait_on_page_writeback(page);
 
 			if (PageWriteback(page) ||
-					!test_clear_page_dirty(page)) {
+					!test_clear_page_dirty(page, 0)) {
 				unlock_page(page);
 				break;
 			}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1387749..da2bdb1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
 		spin_unlock(&fc->lock);
 
 		if (offset == 0 && to == PAGE_CACHE_SIZE) {
-			clear_page_dirty(page);
+			clear_page_dirty(page, 0);
 			SetPageUptodate(page);
 		}
 	}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..9f82cd0 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
 
 static void truncate_huge_page(struct page *page)
 {
-	clear_page_dirty(page);
+	clear_page_dirty(page, 0);
 	ClearPageUptodate(page);
 	remove_from_page_cache(page);
 	put_page(page);
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index b1a1c72..5e29b37 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1
 
 	/* Retest mp->count since we may have released page lock */
 	if (test_bit(META_discard, &mp->flag) && !mp->count) {
-		clear_page_dirty(page);
+		clear_page_dirty(page, 0);
 		ClearPageUptodate(page);
 	}
 #else
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 47e7027..a97e198 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
 				bh = next;
 			} while (bh != head);
 			if (PAGE_SIZE == bh->b_size) {
-				clear_page_dirty(page);
+				clear_page_dirty(page, 0);
 			}
 		}
 	}
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index b56eb75..44ac434 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -343,7 +343,7 @@ xfs_start_page_writeback(
 	ASSERT(!PageWriteback(page));
 	set_page_writeback(page);
 	if (clear_dirty)
-		clear_page_dirty(page);
+		clear_page_dirty(page, 0);
 	unlock_page(page);
 	if (!buffers) {
 		end_page_writeback(page);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..175ab3c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,13 +253,13 @@ #define ClearPageUncached(page)	clear_bi
 
 struct page;	/* forward declaration */
 
-int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);
 
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int must_clean_ptes)
 {
-	test_clear_page_dirty(page);
+	test_clear_page_dirty(page, must_clean_ptes);
 }
 
 static inline void set_page_writeback(struct page *page)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..f2a157d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
  * Clear a page's dirty flag, while caring for dirty memory accounting. 
  * Returns true if the page was previously dirty.
  */
-int test_clear_page_dirty(struct page *page)
+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
 {
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
@@ -857,6 +857,8 @@ int test_clear_page_dirty(struct page *p
 		return TestClearPageDirty(page);
 
 	write_lock_irqsave(&mapping->tree_lock, flags);
+
+#if 0
 	if (TestClearPageDirty(page)) {
 		radix_tree_tag_clear(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
@@ -866,11 +868,19 @@ int test_clear_page_dirty(struct page *p
 		 * page is locked, which pins the address_space
 		 */
 		if (mapping_cap_account_dirty(mapping)) {
-			page_mkclean(page);
+			int cleaned = page_mkclean(page);
+			if (!must_clean_ptes && cleaned){
+			WARN_ON(1);
+			set_page_dirty(page);
+			}
+
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
 		return 1;
 	}
+
+#endif
+
 	write_unlock_irqrestore(&mapping->tree_lock, flags);
 	return 0;
 }
diff --git a/mm/rmap.c b/mm/rmap.c
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..9a01d9e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
-	if (test_clear_page_dirty(page))
+	if (test_clear_page_dirty(page, 0))
 		task_io_account_cancelled_write(PAGE_CACHE_SIZE);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);
@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
 					  PAGE_CACHE_SIZE, 0);
 				}
 			}
-			was_dirty = test_clear_page_dirty(page);
+			was_dirty = test_clear_page_dirty(page, 0);
 			if (!invalidate_complete_page2(mapping, page)) {
 				if (was_dirty)
 					set_page_dirty(page);



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 23:48                             ` Andrei Popa
@ 2006-12-19  0:04                               ` Linus Torvalds
  2006-12-19  0:29                                 ` Andrei Popa
  2006-12-19  1:03                               ` Gene Heskett
  1 sibling, 1 reply; 311+ messages in thread
From: Linus Torvalds @ 2006-12-19  0:04 UTC (permalink / raw)
  To: Andrei Popa
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr



On Tue, 19 Dec 2006, Andrei Popa wrote:
> > 
> > There's exactly two call sites that call "page_mkclean()" (and that is the 
> > only thing in turn that calls "page_mkclean_one()", which we already 
> > determined will cause the corruption). 
> >
> > Can you just TOTALLY DISABLE that case for the test_clear_page_dirty() 
> > case? Just do an "#if 0 .. #endif" around that whole if-statement, leaving 
> > the _only_ thing that actually calls "page_mkclean()" to be the 
> > "clear_page_dirty_for_io()" call.
> > 
> > Do you still see corruption?
> 
> nope, no file corruption at all.

Ok. That's interesting, but I think you actually #ifdef'ed out too 
much:

> +
> +#if 0
>  	if (TestClearPageDirty(page)) {
>  		radix_tree_tag_clear(&mapping->page_tree,
>  				page_index(page), PAGECACHE_TAG_DIRTY);
> @@ -866,11 +868,19 @@ int test_clear_page_dirty(struct page *p
>  		 * page is locked, which pins the address_space
>  		 */
>  		if (mapping_cap_account_dirty(mapping)) {
> -			page_mkclean(page);
> +			int cleaned = page_mkclean(page);
> +			if (!must_clean_ptes && cleaned){
> +			WARN_ON(1);
> +			set_page_dirty(page);
> +			}
> +
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  		}
>  		return 1;
>  	}
> +
> +#endif
> +

It was really just the _inner_ "if (mapping_cap_account_dirty(.." 
statement that I meant you should remove.

Can you try that too?

		Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 22:45                             ` Linus Torvalds
@ 2006-12-19  0:13                               ` Andrei Popa
  2006-12-19  0:29                                 ` Linus Torvalds
  0 siblings, 1 reply; 311+ messages in thread
From: Andrei Popa @ 2006-12-19  0:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alessandro Suardi, Peter Zijlstra, Andrew Morton,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 14:45 -0800, Linus Torvalds wrote:
> 
> On Mon, 18 Dec 2006, Alessandro Suardi wrote:
> > 
> > No idea whether this can be a data point or not, but
> > here it goes... my P2P box is about to turn 5 days old
> > while running nonstop one or both of aMule 2.1.3 and
> > BitTorrent 4.4.0 on ext3 mounted w/default options
> > on both IDE and USB disks. Zero corruption.
> > 
> > AMD K7-800, 512MB RAM, PREEMPT/UP kernel,
> > 2.6.19-git20 on top of up-to-date FC6.
> 
> It _looks_ like PREEMPT/SMP is one common configuration.
> 
> It might also be that the blocksize of the filesystem matters. 4kB 
> filesystems are fundamentally simpler than 1kB filesystems, for example. 
> You can tell at least with "/sbin/dumpe2fs -h /dev/..." or something.
> 
> Andrei - one thing that might be interesting to see: when corruption 
> occurs, can you get the corrupted file somehow? And compare it with a 
> known-good copy to see what the corruption looks like?

the corrupted file has a chunk full of zeros

http://193.226.119.62/corruption0.jpg
http://193.226.119.62/corruption1.jpg




^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  0:13                               ` Andrei Popa
@ 2006-12-19  0:29                                 ` Linus Torvalds
  0 siblings, 0 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-19  0:29 UTC (permalink / raw)
  To: Andrei Popa
  Cc: Alessandro Suardi, Peter Zijlstra, Andrew Morton,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr



On Tue, 19 Dec 2006, Andrei Popa wrote:
> 
> the corrupted file has a chunk full of zeros
> 
> http://193.226.119.62/corruption0.jpg
> http://193.226.119.62/corruption1.jpg

Thanks. Yup, filled with zeroes, and the corruption stops (but does _not_ 
start) at a page boundary.

That _does_ look very much like it was filled in linearly, then written 
out to disk when it was in the middle of the page, and then we simply lost 
the further writes that should also have gone on to that page. All 
consistent with dropping a dirty bit somewhere in the middle of the page 
updates.

Which we kind of knew must be the issue anyway, but it's good to know that 
the corruption pattern is consistent with what we're trying to figure out.
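
For anyone who wants to check their own corrupted file against a known-good copy, the pattern above can be tested mechanically. What follows is only a userspace sketch, not something from the kernel tree: the names `first_bad_range`, `struct bad_range` and `PAGE_SZ` are made up for illustration, and it assumes 4kB pages and that both files have already been read into memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 4096u	/* assumption: 4kB pages */

/* Hypothetical helper type: the span covering all differing bytes. */
struct bad_range {
	size_t start;	/* offset of the first differing byte */
	size_t end;	/* one past the last differing byte */
	int all_zero;	/* 1 if the bad copy is all zeroes in [start, end) */
};

/*
 * Compare a known-good buffer with a suspect one of the same length.
 * Returns 0 if they are identical, otherwise 1 with *r filled in.
 * The span runs from the first to the last differing byte, so a
 * file with several separate holes is reported as one large span.
 */
static int first_bad_range(const unsigned char *good,
			   const unsigned char *bad,
			   size_t len, struct bad_range *r)
{
	size_t first = 0, last, i;

	assert(good && bad && r);

	while (first < len && good[first] == bad[first])
		first++;
	if (first == len)
		return 0;		/* no difference at all */

	last = len - 1;
	while (good[last] == bad[last])
		last--;			/* stops at 'first' at the latest */

	r->start = first;
	r->end = last + 1;
	r->all_zero = 1;
	for (i = r->start; i < r->end; i++) {
		if (bad[i] != 0) {
			r->all_zero = 0;
			break;
		}
	}
	return 1;
}
```

A result with `all_zero` set, `end % PAGE_SZ == 0` and `start % PAGE_SZ != 0` would match the pattern described here: zeroes that begin mid-page and stop at a page boundary. If the good data itself happens to end in zero bytes just before the boundary, `end` will fall slightly short of it, so treat a near miss with some care.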

		Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  0:04                               ` Linus Torvalds
@ 2006-12-19  0:29                                 ` Andrei Popa
  2006-12-19  0:57                                   ` Linus Torvalds
  0 siblings, 1 reply; 311+ messages in thread
From: Andrei Popa @ 2006-12-19  0:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 16:04 -0800, Linus Torvalds wrote:
> 
> On Tue, 19 Dec 2006, Andrei Popa wrote:
> > > 
> > > There's exactly two call sites that call "page_mkclean()" (and that is the 
> > > only thing in turn that calls "page_mkclean_one()", which we already 
> > > determined will cause the corruption). 
> > >
> > > Can you just TOTALLY DISABLE that case for the test_clear_page_dirty() 
> > > case? Just do an "#if 0 .. #endif" around that whole if-statement, leaving 
> > > the _only_ thing that actually calls "page_mkclean()" to be the 
> > > "clear_page_dirty_for_io()" call.
> > > 
> > > Do you still see corruption?
> > 
> > nope, no file corruption at all.
> 
> Ok. That's interesting, but I think you actually #ifdef'ed out too 
> much:
> 
> > +
> > +#if 0
> >  	if (TestClearPageDirty(page)) {
> >  		radix_tree_tag_clear(&mapping->page_tree,
> >  				page_index(page), PAGECACHE_TAG_DIRTY);
> > @@ -866,11 +868,19 @@ int test_clear_page_dirty(struct page *p
> >  		 * page is locked, which pins the address_space
> >  		 */
> >  		if (mapping_cap_account_dirty(mapping)) {
> > -			page_mkclean(page);
> > +			int cleaned = page_mkclean(page);
> > +			if (!must_clean_ptes && cleaned){
> > +			WARN_ON(1);
> > +			set_page_dirty(page);
> > +			}
> > +
> >  			dec_zone_page_state(page, NR_FILE_DIRTY);
> >  		}
> >  		return 1;
> >  	}
> > +
> > +#endif
> > +
> 
> It was really just the _inner_ "if (mapping_cap_account_dirty(.." 
> statement that I meant you should remove.
> 
> Can you try that too?

I have file corruption: "Hash check on download completion found bad
chunks, consider using "safe_sync"."


diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageDirty(page) || PageWriteback(page))
 		return 0;
 
 	if (mapping == NULL) {		/* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
 	spin_unlock(&mapping->private_lock);
-	if (ret) {
-		/*
-		 * If the filesystem writes its buffers by hand (eg ext3)
-		 * then we can have clean buffers against a dirty page.  We
-		 * clean the page here; otherwise later reattachment of buffers
-		 * could encounter a non-uptodate page, which is unresolvable.
-		 * This only applies in the rare case where try_to_free_buffers
-		 * succeeds but the page is not freed.
-		 *
-		 * Also, during truncate, discard_buffer will have marked all
-		 * the page's buffers clean.  We discover that here and clean
-		 * the page also.
-		 */
-		if (test_clear_page_dirty(page))
-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
-	}
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0f05cab..2d8bbbb 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1245,7 +1245,7 @@ retry:
 				wait_on_page_writeback(page);
 
 			if (PageWriteback(page) ||
-					!test_clear_page_dirty(page)) {
+					!test_clear_page_dirty(page, 0)) {
 				unlock_page(page);
 				break;
 			}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1387749..da2bdb1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
 		spin_unlock(&fc->lock);
 
 		if (offset == 0 && to == PAGE_CACHE_SIZE) {
-			clear_page_dirty(page);
+			clear_page_dirty(page, 0);
 			SetPageUptodate(page);
 		}
 	}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..9f82cd0 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
 
 static void truncate_huge_page(struct page *page)
 {
-	clear_page_dirty(page);
+	clear_page_dirty(page, 0);
 	ClearPageUptodate(page);
 	remove_from_page_cache(page);
 	put_page(page);
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index b1a1c72..5e29b37 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1
 
 	/* Retest mp->count since we may have released page lock */
 	if (test_bit(META_discard, &mp->flag) && !mp->count) {
-		clear_page_dirty(page);
+		clear_page_dirty(page, 0);
 		ClearPageUptodate(page);
 	}
 #else
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 47e7027..a97e198 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
 				bh = next;
 			} while (bh != head);
 			if (PAGE_SIZE == bh->b_size) {
-				clear_page_dirty(page);
+				clear_page_dirty(page, 0);
 			}
 		}
 	}
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index b56eb75..44ac434 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -343,7 +343,7 @@ xfs_start_page_writeback(
 	ASSERT(!PageWriteback(page));
 	set_page_writeback(page);
 	if (clear_dirty)
-		clear_page_dirty(page);
+		clear_page_dirty(page, 0);
 	unlock_page(page);
 	if (!buffers) {
 		end_page_writeback(page);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..175ab3c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,13 +253,13 @@ #define ClearPageUncached(page)	clear_bi
 
 struct page;	/* forward declaration */
 
-int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);
 
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int must_clean_ptes)
 {
-	test_clear_page_dirty(page);
+	test_clear_page_dirty(page, must_clean_ptes);
 }
 
 static inline void set_page_writeback(struct page *page)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..4ff7f90 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
  * Clear a page's dirty flag, while caring for dirty memory accounting. 
  * Returns true if the page was previously dirty.
  */
-int test_clear_page_dirty(struct page *page)
+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
 {
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
@@ -857,6 +857,7 @@ int test_clear_page_dirty(struct page *p
 		return TestClearPageDirty(page);
 
 	write_lock_irqsave(&mapping->tree_lock, flags);
+
 	if (TestClearPageDirty(page)) {
 		radix_tree_tag_clear(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
@@ -865,12 +866,23 @@ int test_clear_page_dirty(struct page *p
 		 * We can continue to use `mapping' here because the
 		 * page is locked, which pins the address_space
 		 */
+
+#if 0
+
 		if (mapping_cap_account_dirty(mapping)) {
-			page_mkclean(page);
+			int cleaned = page_mkclean(page);
+			if (!must_clean_ptes && cleaned){
+			WARN_ON(1);
+			set_page_dirty(page);
+			}
+
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
+#endif
+
 		return 1;
 	}
+
 	write_unlock_irqrestore(&mapping->tree_lock, flags);
 	return 0;
 }
diff --git a/mm/rmap.c b/mm/rmap.c
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..9a01d9e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
-	if (test_clear_page_dirty(page))
+	if (test_clear_page_dirty(page, 0))
 		task_io_account_cancelled_write(PAGE_CACHE_SIZE);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);
@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
 					  PAGE_CACHE_SIZE, 0);
 				}
 			}
-			was_dirty = test_clear_page_dirty(page);
+			was_dirty = test_clear_page_dirty(page, 0);
 			if (!invalidate_complete_page2(mapping, page)) {
 				if (was_dirty)
 					set_page_dirty(page);



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  0:29                                 ` Andrei Popa
@ 2006-12-19  0:57                                   ` Linus Torvalds
  2006-12-19  1:21                                     ` Andrew Morton
  2006-12-19  1:50                                     ` Andrei Popa
  0 siblings, 2 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-19  0:57 UTC (permalink / raw)
  To: Andrei Popa
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr



On Tue, 19 Dec 2006, Andrei Popa wrote:
> > > 
> > > nope, no file corruption at all.
> > 
> > Ok. That's interesting, but I think you actually #ifdef'ed out too 
> > much:
> > 
> > It was really just the _inner_ "if (mapping_cap_account_dirty(.." 
> > statement that I meant you should remove.
> > 
> > Can you try that too?
> 
> I have file corruption: "Hash check on download completion found bad
> chunks, consider using "safe_sync"."

Ok, that's interesting.

So it doesn't seem to be the call to page_mkclean() itself that causes 
corruption. It looks like Peter's hunch that maybe there's some bug in 
PG_dirty handling _itself_ might be an idea..

And the reason it only started happening now is that it may just have been 
_hidden_ by the fact that while we kept the dirty bits in the page tables, 
we'd end up writing the dirty page _despite_ having lost the PG_dirty bit. 
So if it's some bad interaction between writable mappings and some other 
part of the system, we just didn't see it earlier, exactly because we had 
_lots_ of dirty bits, and it was enough that _one_ of them was right.
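
To make that hypothesis concrete, here is a toy model. It is emphatically not kernel code: `struct toy_page` and the `toy_*` functions are invented for illustration, and the model reduces a page to two booleans, one standing in for PG_dirty and one for the dirty bit in a pte.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a page that is mapped writable somewhere. */
struct toy_page {
	bool pg_dirty;	/* plays the role of PG_dirty */
	bool pte_dirty;	/* plays the role of the pte dirty bit */
};

/* A store through the mapping: both dirty indications get set. */
static void toy_write(struct toy_page *p)
{
	assert(p);
	p->pg_dirty = true;
	p->pte_dirty = true;
}

/*
 * Some (hypothetical) buggy path drops PG_dirty. If clean_ptes is
 * set, it also cleans the pte, as the new page_mkclean() path does.
 */
static void toy_lose_dirty(struct toy_page *p, bool clean_ptes)
{
	assert(p);
	p->pg_dirty = false;
	if (clean_ptes)
		p->pte_dirty = false;
}

/* The data still reaches disk if at least one dirty bit survived. */
static bool toy_will_be_written(const struct toy_page *p)
{
	assert(p);
	return p->pg_dirty || p->pte_dirty;
}
```

With `clean_ptes` false (the old behaviour in this model), a write followed by `toy_lose_dirty()` still leaves `toy_will_be_written()` true, so the lost PG_dirty goes unnoticed. With `clean_ptes` true, the same sequence leaves nothing dirty and the write is lost. That is all the model shows; it says nothing about where the real bug is.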

If you didn't see corruption when you #ifdef'ed out too much of the 
"test_clear_page_dirty()" function (the _whole_ TestClearPageDirty() 
if-statement), but you get it when you just comment out the stuff that 
does the page_mkclean(), that's interesting.

I'm left looking at the "radix_tree_tag_clear()" in 
test_clear_page_dirty().

What happens if you only ifdef out that single thing? 

The actual page-cleaning functions make sure to only clear the TAG_DIRTY 
bit _after_ the page has been marked for writeback. Is there some ordering 
constraint there, perhaps?

I'm really reaching here. I'm trying to see the pattern, and I'm not 
seeing it. I'm asking you to test things just to get more of a feel for 
what triggers the failure, than because I actually have any kind of idea 
of what the heck is going on.

Andrew, Nick, Hugh - any ideas?

			Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 23:48                             ` Andrei Popa
  2006-12-19  0:04                               ` Linus Torvalds
@ 2006-12-19  1:03                               ` Gene Heskett
  1 sibling, 0 replies; 311+ messages in thread
From: Gene Heskett @ 2006-12-19  1:03 UTC (permalink / raw)
  To: linux-kernel, andrei.popa
  Cc: Linus Torvalds, Peter Zijlstra, Andrew Morton, Hugh Dickins,
	Florian Weimer, Marc Haber, Martin Michlmayr

On Monday 18 December 2006 18:48, Andrei Popa wrote:
>On Mon, 2006-12-18 at 14:32 -0800, Linus Torvalds wrote:
>> On Mon, 18 Dec 2006, Andrei Popa wrote:
>> > > This should be fairly easy to test: just change every single ", 1"
>> > > case in the patch to ", 0".
>> > >
>> > > What happens for you in that case?
>> >
>> > I have file corruption.
>>
>> Magic. And btw, _thanks_ for being such a great tester.
>>
>> So now I have one more thing for you to try, if you can bother:
>>
>> There's exactly two call sites that call "page_mkclean()" (and that is
>> the only thing in turn that calls "page_mkclean_one()", which we
>> already determined will cause the corruption).
>>
>> Both of them do
>>
>> 	if (mapping_cap_account_dirty(mapping)) {
>> 			..
>>
>> things, although they do slightly different things inside that if in
>> your patched kernel.
>>
>> Can you just TOTALLY DISABLE that case for the test_clear_page_dirty()
>> case? Just do an "#if 0 .. #endif" around that whole if-statement,
>> leaving the _only_ thing that actually calls "page_mkclean()" to be
>> the "clear_page_dirty_for_io()" call.
>>
>> Do you still see corruption?
>
>nope, no file corruption at all.
>
Goody I says to nobody in particular, I'll go build this...
>
>diff --git a/fs/buffer.c b/fs/buffer.c
>index d1f1b54..263f88e 100644
>--- a/fs/buffer.c
>+++ b/fs/buffer.c
>@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
> 	int ret = 0;
>
> 	BUG_ON(!PageLocked(page));
>-	if (PageWriteback(page))
>+	if (PageDirty(page) || PageWriteback(page))
> 		return 0;
>
> 	if (mapping == NULL) {		/* can this still happen? */
>@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
> 	spin_lock(&mapping->private_lock);
> 	ret = drop_buffers(page, &buffers_to_free);
> 	spin_unlock(&mapping->private_lock);
>-	if (ret) {
>-		/*
>-		 * If the filesystem writes its buffers by hand (eg ext3)
>-		 * then we can have clean buffers against a dirty page.  We
>-		 * clean the page here; otherwise later reattachment of buffers
>-		 * could encounter a non-uptodate page, which is unresolvable.
>-		 * This only applies in the rare case where try_to_free_buffers
>-		 * succeeds but the page is not freed.
>-		 *
>-		 * Also, during truncate, discard_buffer will have marked all
>-		 * the page's buffers clean.  We discover that here and clean
>-		 * the page also.
>-		 */
>-		if (test_clear_page_dirty(page))
>-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
>-	}
> out:
> 	if (buffers_to_free) {
> 		struct buffer_head *bh = buffers_to_free;
>diff --git a/fs/cifs/file.c b/fs/cifs/file.c
>index 0f05cab..2d8bbbb 100644
>--- a/fs/cifs/file.c
>+++ b/fs/cifs/file.c
>@@ -1245,7 +1245,7 @@ retry:
> 				wait_on_page_writeback(page);
>
> 			if (PageWriteback(page) ||
>-					!test_clear_page_dirty(page)) {
>+					!test_clear_page_dirty(page, 0)) {
> 				unlock_page(page);
> 				break;
> 			}
>diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>index 1387749..da2bdb1 100644
>--- a/fs/fuse/file.c
>+++ b/fs/fuse/file.c
>@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
> 		spin_unlock(&fc->lock);
>
> 		if (offset == 0 && to == PAGE_CACHE_SIZE) {
>-			clear_page_dirty(page);
>+			clear_page_dirty(page, 0);
> 			SetPageUptodate(page);
> 		}
> 	}
>diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>index ed2c223..9f82cd0 100644
>--- a/fs/hugetlbfs/inode.c
>+++ b/fs/hugetlbfs/inode.c
>@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
>
> static void truncate_huge_page(struct page *page)
> {
>-	clear_page_dirty(page);
>+	clear_page_dirty(page, 0);
> 	ClearPageUptodate(page);
> 	remove_from_page_cache(page);
> 	put_page(page);
>diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
>index b1a1c72..5e29b37 100644
>--- a/fs/jfs/jfs_metapage.c
>+++ b/fs/jfs/jfs_metapage.c
>@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1
>
> 	/* Retest mp->count since we may have released page lock */
> 	if (test_bit(META_discard, &mp->flag) && !mp->count) {
>-		clear_page_dirty(page);
>+		clear_page_dirty(page, 0);
> 		ClearPageUptodate(page);
> 	}
> #else
>diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
>index 47e7027..a97e198 100644
>--- a/fs/reiserfs/stree.c
>+++ b/fs/reiserfs/stree.c
>@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
> 				bh = next;
> 			} while (bh != head);
> 			if (PAGE_SIZE == bh->b_size) {
>-				clear_page_dirty(page);
>+				clear_page_dirty(page, 0);
> 			}
> 		}
> 	}
>diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
>index b56eb75..44ac434 100644
>--- a/fs/xfs/linux-2.6/xfs_aops.c
>+++ b/fs/xfs/linux-2.6/xfs_aops.c
>@@ -343,7 +343,7 @@ xfs_start_page_writeback(
> 	ASSERT(!PageWriteback(page));
> 	set_page_writeback(page);
> 	if (clear_dirty)
>-		clear_page_dirty(page);
>+		clear_page_dirty(page, 0);
> 	unlock_page(page);
> 	if (!buffers) {
> 		end_page_writeback(page);
>diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>index 4830a3b..175ab3c 100644
>--- a/include/linux/page-flags.h
>+++ b/include/linux/page-flags.h
>@@ -253,13 +253,13 @@ #define ClearPageUncached(page)	clear_bi
>
> struct page;	/* forward declaration */
>
>-int test_clear_page_dirty(struct page *page);
>+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
> int test_clear_page_writeback(struct page *page);
> int test_set_page_writeback(struct page *page);
>
>-static inline void clear_page_dirty(struct page *page)
>+static inline void clear_page_dirty(struct page *page, int
>must_clean_ptes)
above looks wrapped to me so I fixed it to one line
> {
>-	test_clear_page_dirty(page);
>+	test_clear_page_dirty(page, must_clean_ptes);
> }
>
> static inline void set_page_writeback(struct page *page)
>diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>index 237107c..f2a157d 100644
>--- a/mm/page-writeback.c
>+++ b/mm/page-writeback.c
>@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>  * Clear a page's dirty flag, while caring for dirty memory
>accounting.
Likewise here, malformed patch otherwise
>  * Returns true if the page was previously dirty.
>  */
>-int test_clear_page_dirty(struct page *page)
>+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
> {
> 	struct address_space *mapping = page_mapping(page);
> 	unsigned long flags;
>@@ -857,6 +857,8 @@ int test_clear_page_dirty(struct page *p
> 		return TestClearPageDirty(page);
>
> 	write_lock_irqsave(&mapping->tree_lock, flags);
>+
>+#if 0
> 	if (TestClearPageDirty(page)) {
> 		radix_tree_tag_clear(&mapping->page_tree,
> 				page_index(page), PAGECACHE_TAG_DIRTY);
>@@ -866,11 +868,19 @@ int test_clear_page_dirty(struct page *p
> 		 * page is locked, which pins the address_space
> 		 */
> 		if (mapping_cap_account_dirty(mapping)) {
>-			page_mkclean(page);
>+			int cleaned = page_mkclean(page);
>+			if (!must_clean_ptes && cleaned){
>+			WARN_ON(1);
>+			set_page_dirty(page);
>+			}
>+
> 			dec_zone_page_state(page, NR_FILE_DIRTY);
> 		}
> 		return 1;
> 	}
>+
>+#endif
>+
> 	write_unlock_irqrestore(&mapping->tree_lock, flags);
> 	return 0;
> }
>diff --git a/mm/rmap.c b/mm/rmap.c
>diff --git a/mm/truncate.c b/mm/truncate.c
>index 9bfb8e8..9a01d9e 100644
>--- a/mm/truncate.c
>+++ b/mm/truncate.c
>@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
> 	if (PagePrivate(page))
> 		do_invalidatepage(page, 0);
>
>-	if (test_clear_page_dirty(page))
>+	if (test_clear_page_dirty(page, 0))
> 		task_io_account_cancelled_write(PAGE_CACHE_SIZE);
> 	ClearPageUptodate(page);
> 	ClearPageMappedToDisk(page);
>@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
> 					  PAGE_CACHE_SIZE, 0);
> 				}
> 			}
>-			was_dirty = test_clear_page_dirty(page);
>+			was_dirty = test_clear_page_dirty(page, 0);
> 			if (!invalidate_complete_page2(mapping, page)) {
> 				if (was_dirty)
> 					set_page_dirty(page);
>
I think I must have screwed the moose.  Following along in this thread, 
I'd patched things back and forth till I figured I'd better do a fresh 
tree, so starting with the full 2.6.19 tarball, I applied the 2.6.20-rc1 
patch, then the above patch, which should be the only thing different 
from what I'm running right now, which is the commented line in rmap.c, 
otherwise as it unpacked.

But:
In file included from include/linux/mm.h:230,
                 from include/linux/rmap.h:10,
                 from init/main.c:47:
include/linux/page-flags.h:260: error: expected declaration specifiers 
or ‘...’ before ‘in’
include/linux/page-flags.h: In function ‘clear_page_dirty’:
include/linux/page-flags.h:262: error: ‘must_clean_ptes’ undeclared (first 
use in this function)
include/linux/page-flags.h:262: error: (Each undeclared identifier is 
reported only once
include/linux/page-flags.h:262: error: for each function it appears in.)
make[1]: *** [init/main.o] Error 1
make: *** [init] Error 2

There were 2 places where this patch is word wrapped, and this was one of 
them:

-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int
must_clean_ptes)

The other one was in a comment, which broke the patch and needed fixing 
too.  Is it fubared someplace else I missed?  Or am I in fact being 
bitten by this bug?

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2006 by Maurice Eugene Heskett, all rights reserved.

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  0:57                                   ` Linus Torvalds
@ 2006-12-19  1:21                                     ` Andrew Morton
  2006-12-19  1:44                                       ` Andrei Popa
  2006-12-19  1:50                                     ` Andrei Popa
  1 sibling, 1 reply; 311+ messages in thread
From: Andrew Morton @ 2006-12-19  1:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Popa, Peter Zijlstra, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 18 Dec 2006 16:57:30 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> What happens if you only ifdef out that single thing? 
> 
> The actual page-cleaning functions make sure to only clear the TAG_DIRTY 
> bit _after_ the page has been marked for writeback. Is there some ordering 
> constraint there, perhaps?
> 
> I'm really reaching here. I'm trying to see the pattern, and I'm not 
> seeing it. I'm asking you to test things just to get more of a feel for 
> what triggers the failure, than because I actually have any kind of idea 
> of what the heck is going on.
> 
> Andrew, Nick, Hugh - any ideas?

If all of test_clear_page_dirty() has been commented out then the page will
never become clean hence will never fall out of pagecache, so unless Andrei
is doing a reboot before checking for corruption, perhaps the underlying
data on-disk is incorrect, but we can't see it.

Andrei, how _are_ you running this test?    What's the exact sequence of steps?

In particular, are you doing anything which would cause the corrupted file
to be evicted from memory, thus forcing a read from disk?  Such as
unmounting and then remounting the filesystem?

The point of my question is to check that the data is really incorrect
on-disk, or whether it is incorrect in pagecache.
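
A minimal userspace sketch of that check, not taken from this thread: it
checksums a file as currently cached, asks the kernel to drop that file's
cached pages, and checksums it again. The function names and the FNV-1a
checksum are illustrative choices; fsync() and posix_fadvise() are the
standard calls. The eviction is only advisory, so pages that remain dirty or
mapped may stay in pagecache, and an unmount/remount or reboot remains the
more reliable test.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* FNV-1a hash over the whole file, read through the given descriptor. */
static int checksum_fd(int fd, uint64_t *out)
{
	unsigned char buf[65536];
	uint64_t h = 14695981039346656037ULL;
	ssize_t n;

	if (lseek(fd, 0, SEEK_SET) < 0)
		return -1;
	while ((n = read(fd, buf, sizeof(buf))) > 0) {
		for (ssize_t i = 0; i < n; i++) {
			h ^= buf[i];
			h *= 1099511628211ULL;
		}
	}
	if (n < 0)
		return -1;
	*out = h;
	return 0;
}

/*
 * Hypothetical helper: returns 1 if the checksum taken before the eviction
 * request equals the one taken after it, 0 if they differ, -1 on error.
 */
int compare_cached_and_disk(const char *path)
{
	uint64_t cached, reread;
	int ret = -1;
	int fd;

	assert(path != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	if (checksum_fd(fd, &cached) == 0 &&
	    fsync(fd) == 0 &&		/* write back dirty pages first */
	    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) == 0 &&
	    checksum_fd(fd, &reread) == 0)
		ret = (cached == reread);

	close(fd);
	return ret;
}
```

A result of 0 would suggest that pagecache and disk disagree; a result of 1
shows nothing if the kernel declined to drop the pages.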

Also, it'd be useful if you could determine whether the bug appears with
the ext2 filesystem: do s/ext3/ext2/ in /etc/fstab, or boot with
rootfstype=ext2 if it's the root filesystem.

Thanks.

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  1:21                                     ` Andrew Morton
@ 2006-12-19  1:44                                       ` Andrei Popa
  2006-12-19  1:54                                         ` Andrew Morton
  0 siblings, 1 reply; 311+ messages in thread
From: Andrei Popa @ 2006-12-19  1:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 17:21 -0800, Andrew Morton wrote:
> On Mon, 18 Dec 2006 16:57:30 -0800 (PST)
> Linus Torvalds <torvalds@osdl.org> wrote:
> 
> > What happens if you only ifdef out that single thing? 
> > 
> > The actual page-cleaning functions make sure to only clear the TAG_DIRTY 
> > bit _after_ the page has been marked for writeback. Is there some ordering 
> > constraint there, perhaps?
> > 
> > I'm really reaching here. I'm trying to see the pattern, and I'm not 
> > seeing it. I'm asking you to test things just to get more of a feel for 
> > what triggers the failure, than because I actually have any kind of idea 
> > of what the heck is going on.
> > 
> > Andrew, Nick, Hugh - any ideas?
> 
> If all of test_clear_page_dirty() has been commented out then the page will
> never become clean hence will never fall out of pagecache, so unless Andrei
> is doing a reboot before checking for corruption, perhaps the underlying
> data on-disk is incorrect, but we can't see it.

If I do a sync and echo 1 > /proc/sys/vm/drop_caches, is the reboot
still necessary?

> 
> Andrei, how _are_ you running this test?    What's the exact sequence of steps?
> 
> In particular, are you doing anything which would cause the corrupted file
> to be evicted from memory, thus forcing a read from disk?  Such as
> unmounting and then remounting the filesystem?

I boot Linux, start rtorrent and start the download. While it's
downloading I start evolution and check my mail (my mbox is very large,
several hundred megabytes), close evolution (I use evolution just to
have another application which uses the filesystem and the memory), and
start evolution again. I start firefox. When the download is complete,
rtorrent says whether the hash is good or not. I do an "unrar t qwe.rar" to
test that all 84 downloaded rar files are OK and see the result.

> 
> The point of my question is to check that the data is really incorrect
> on-disk, or whether it is incorrect in pagecache.
> 
> Also, it'd be useful if you could determine whether the bug appears with
> the ext2 filesystem: do s/ext3/ext2/ in /etc/fstab, or boot with
> rootfstype=ext2 if it's the root filesystem.

I will test.

> 
> Thanks.


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  0:57                                   ` Linus Torvalds
  2006-12-19  1:21                                     ` Andrew Morton
@ 2006-12-19  1:50                                     ` Andrei Popa
  1 sibling, 0 replies; 311+ messages in thread
From: Andrei Popa @ 2006-12-19  1:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 16:57 -0800, Linus Torvalds wrote:
> 
> On Tue, 19 Dec 2006, Andrei Popa wrote:
> > > > 
> > > > nope, no file corruption at all.
> > > 
> > > Ok. That's interesting, but I think you actually #ifdef'ed out too 
> > > much:
> > > 
> > > It was really just the _inner_ "if (mapping_cap_account_dirty(.." 
> > > statement that I meant you should remove.
> > > 
> > > Can you try that too?
> > 
> > I have file corruption: "Hash check on download completion found bad
> > chunks, consider using "safe_sync"."
> 
> Ok, that's interesting.
> 
> So it doesn't seem to be the call to page_mkclean() itself that causes 
> corruption. It looks like Peter's hunch that maybe there's some bug in 
> PG_dirty handling _itself_ might be an idea..
> 
> And the reason it only started happening now is that it may just have been 
> _hidden_ by the fact that while we kept the dirty bits in the page tables, 
> we'd end up writing the dirty page _despite_ having lost the PG_dirty bit. 
> So if it's some bad interaction between writable mappings and some other 
> part of the system, we just didn't see it earlier, exactly because we had 
> _lots_ of dirty bits, and it was enough that _one_ of them was right.
> 
> If you didn't see corruption when you #ifdef'ed out too much of the 
> test_clear_page_dirty() function (the _whole_ TestClearPageDirty() 
> if-statement), but you get it when you just comment out the stuff that 
> does the page_mkclean(), that's interesting.
> 
> I'm left looking at the "radix_tree_tag_clear()" in 
> test_clear_page_dirty().
> 
> What happens if you only ifdef out that single thing? 

I have file corruption.

> 
> The actual page-cleaning functions make sure to only clear the TAG_DIRTY 
> bit _after_ the page has been marked for writeback. Is there some ordering 
> constraint there, perhaps?
> 
> I'm really reaching here. I'm trying to see the pattern, and I'm not 
> seeing it. I'm asking you to test things just to get more of a feel for 
> what triggers the failure, than because I actually have any kind of idea 
> of what the heck is going on.
> 
> Andrew, Nick, Hugh - any ideas?
> 
> 			Linus


diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageDirty(page) || PageWriteback(page))
 		return 0;
 
 	if (mapping == NULL) {		/* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
 	spin_unlock(&mapping->private_lock);
-	if (ret) {
-		/*
-		 * If the filesystem writes its buffers by hand (eg ext3)
-		 * then we can have clean buffers against a dirty page.  We
-		 * clean the page here; otherwise later reattachment of buffers
-		 * could encounter a non-uptodate page, which is unresolvable.
-		 * This only applies in the rare case where try_to_free_buffers
-		 * succeeds but the page is not freed.
-		 *
-		 * Also, during truncate, discard_buffer will have marked all
-		 * the page's buffers clean.  We discover that here and clean
-		 * the page also.
-		 */
-		if (test_clear_page_dirty(page))
-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
-	}
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0f05cab..2d8bbbb 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1245,7 +1245,7 @@ retry:
 				wait_on_page_writeback(page);
 
 			if (PageWriteback(page) ||
-					!test_clear_page_dirty(page)) {
+					!test_clear_page_dirty(page, 0)) {
 				unlock_page(page);
 				break;
 			}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1387749..da2bdb1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
 		spin_unlock(&fc->lock);
 
 		if (offset == 0 && to == PAGE_CACHE_SIZE) {
-			clear_page_dirty(page);
+			clear_page_dirty(page, 0);
 			SetPageUptodate(page);
 		}
 	}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..9f82cd0 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
 
 static void truncate_huge_page(struct page *page)
 {
-	clear_page_dirty(page);
+	clear_page_dirty(page, 0);
 	ClearPageUptodate(page);
 	remove_from_page_cache(page);
 	put_page(page);
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index b1a1c72..5e29b37 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1
 
 	/* Retest mp->count since we may have released page lock */
 	if (test_bit(META_discard, &mp->flag) && !mp->count) {
-		clear_page_dirty(page);
+		clear_page_dirty(page, 0);
 		ClearPageUptodate(page);
 	}
 #else
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 47e7027..a97e198 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
 				bh = next;
 			} while (bh != head);
 			if (PAGE_SIZE == bh->b_size) {
-				clear_page_dirty(page);
+				clear_page_dirty(page, 0);
 			}
 		}
 	}
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index b56eb75..44ac434 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -343,7 +343,7 @@ xfs_start_page_writeback(
 	ASSERT(!PageWriteback(page));
 	set_page_writeback(page);
 	if (clear_dirty)
-		clear_page_dirty(page);
+		clear_page_dirty(page, 0);
 	unlock_page(page);
 	if (!buffers) {
 		end_page_writeback(page);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..175ab3c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,13 +253,13 @@ #define ClearPageUncached(page)	clear_bi
 
 struct page;	/* forward declaration */
 
-int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);
 
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int must_clean_ptes)
 {
-	test_clear_page_dirty(page);
+	test_clear_page_dirty(page, must_clean_ptes);
 }
 
 static inline void set_page_writeback(struct page *page)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..4ff7f90 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
  * Clear a page's dirty flag, while caring for dirty memory accounting.
  * Returns true if the page was previously dirty.
  */
-int test_clear_page_dirty(struct page *page)
+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
 {
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
@@ -857,6 +857,7 @@ int test_clear_page_dirty(struct page *p
 		return TestClearPageDirty(page);
 
 	write_lock_irqsave(&mapping->tree_lock, flags);
+
 	if (TestClearPageDirty(page)) {
 		radix_tree_tag_clear(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
@@ -865,12 +866,23 @@ int test_clear_page_dirty(struct page *p
 		 * We can continue to use `mapping' here because the
 		 * page is locked, which pins the address_space
 		 */
+
+#if 0
+
 		if (mapping_cap_account_dirty(mapping)) {
-			page_mkclean(page);
+			int cleaned = page_mkclean(page);
+			if (!must_clean_ptes && cleaned){
+			WARN_ON(1);
+			set_page_dirty(page);
+			}
+
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
+#endif
+
 		return 1;
 	}
+
 	write_unlock_irqrestore(&mapping->tree_lock, flags);
 	return 0;
 }
diff --git a/mm/rmap.c b/mm/rmap.c
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..9a01d9e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
-	if (test_clear_page_dirty(page))
+	if (test_clear_page_dirty(page, 0))
 		task_io_account_cancelled_write(PAGE_CACHE_SIZE);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);
@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
 					  PAGE_CACHE_SIZE, 0);
 				}
 			}
-			was_dirty = test_clear_page_dirty(page);
+			was_dirty = test_clear_page_dirty(page, 0);
 			if (!invalidate_complete_page2(mapping, page)) {
 				if (was_dirty)
 					set_page_dirty(page);
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageDirty(page) || PageWriteback(page))
 		return 0;
 
 	if (mapping == NULL) {		/* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
 	spin_unlock(&mapping->private_lock);
-	if (ret) {
-		/*
-		 * If the filesystem writes its buffers by hand (eg ext3)
-		 * then we can have clean buffers against a dirty page.  We
-		 * clean the page here; otherwise later reattachment of buffers
-		 * could encounter a non-uptodate page, which is unresolvable.
-		 * This only applies in the rare case where try_to_free_buffers
-		 * succeeds but the page is not freed.
-		 *
-		 * Also, during truncate, discard_buffer will have marked all
-		 * the page's buffers clean.  We discover that here and clean
-		 * the page also.
-		 */
-		if (test_clear_page_dirty(page))
-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
-	}
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0f05cab..2d8bbbb 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1245,7 +1245,7 @@ retry:
 				wait_on_page_writeback(page);
 
 			if (PageWriteback(page) ||
-					!test_clear_page_dirty(page)) {
+					!test_clear_page_dirty(page, 0)) {
 				unlock_page(page);
 				break;
 			}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1387749..da2bdb1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
 		spin_unlock(&fc->lock);
 
 		if (offset == 0 && to == PAGE_CACHE_SIZE) {
-			clear_page_dirty(page);
+			clear_page_dirty(page, 0);
 			SetPageUptodate(page);
 		}
 	}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..9f82cd0 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
 
 static void truncate_huge_page(struct page *page)
 {
-	clear_page_dirty(page);
+	clear_page_dirty(page, 0);
 	ClearPageUptodate(page);
 	remove_from_page_cache(page);
 	put_page(page);
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index b1a1c72..5e29b37 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1
 
 	/* Retest mp->count since we may have released page lock */
 	if (test_bit(META_discard, &mp->flag) && !mp->count) {
-		clear_page_dirty(page);
+		clear_page_dirty(page, 0);
 		ClearPageUptodate(page);
 	}
 #else
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 47e7027..a97e198 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
 				bh = next;
 			} while (bh != head);
 			if (PAGE_SIZE == bh->b_size) {
-				clear_page_dirty(page);
+				clear_page_dirty(page, 0);
 			}
 		}
 	}
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index b56eb75..44ac434 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -343,7 +343,7 @@ xfs_start_page_writeback(
 	ASSERT(!PageWriteback(page));
 	set_page_writeback(page);
 	if (clear_dirty)
-		clear_page_dirty(page);
+		clear_page_dirty(page, 0);
 	unlock_page(page);
 	if (!buffers) {
 		end_page_writeback(page);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..175ab3c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,13 +253,13 @@ #define ClearPageUncached(page)	clear_bi
 
 struct page;	/* forward declaration */
 
-int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);
 
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int must_clean_ptes)
 {
-	test_clear_page_dirty(page);
+	test_clear_page_dirty(page, must_clean_ptes);
 }
 
 static inline void set_page_writeback(struct page *page)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..e6524a6 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
  * Clear a page's dirty flag, while caring for dirty memory accounting.
  * Returns true if the page was previously dirty.
  */
-int test_clear_page_dirty(struct page *page)
+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
 {
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
@@ -857,20 +857,35 @@ int test_clear_page_dirty(struct page *p
 		return TestClearPageDirty(page);
 
 	write_lock_irqsave(&mapping->tree_lock, flags);
+
 	if (TestClearPageDirty(page)) {
+
+#if 0
+
 		radix_tree_tag_clear(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
+
+#endif
+
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
 		/*
 		 * We can continue to use `mapping' here because the
 		 * page is locked, which pins the address_space
 		 */
+
+
 		if (mapping_cap_account_dirty(mapping)) {
-			page_mkclean(page);
+			int cleaned = page_mkclean(page);
+			if (!must_clean_ptes && cleaned){
+			WARN_ON(1);
+			set_page_dirty(page);
+			}
+
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
 		return 1;
 	}
+
 	write_unlock_irqrestore(&mapping->tree_lock, flags);
 	return 0;
 }
diff --git a/mm/rmap.c b/mm/rmap.c
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..9a01d9e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
-	if (test_clear_page_dirty(page))
+	if (test_clear_page_dirty(page, 0))
 		task_io_account_cancelled_write(PAGE_CACHE_SIZE);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);
@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
 					  PAGE_CACHE_SIZE, 0);
 				}
 			}
-			was_dirty = test_clear_page_dirty(page);
+			was_dirty = test_clear_page_dirty(page, 0);
 			if (!invalidate_complete_page2(mapping, page)) {
 				if (was_dirty)
 					set_page_dirty(page);



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  1:44                                       ` Andrei Popa
@ 2006-12-19  1:54                                         ` Andrew Morton
  2006-12-19  2:04                                           ` Andrei Popa
  2006-12-19  8:05                                           ` Andrei Popa
  0 siblings, 2 replies; 311+ messages in thread
From: Andrew Morton @ 2006-12-19  1:54 UTC (permalink / raw)
  To: andrei.popa
  Cc: Linus Torvalds, Peter Zijlstra, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Tue, 19 Dec 2006 03:44:51 +0200
Andrei Popa <andrei.popa@i-neo.ro> wrote:

> On Mon, 2006-12-18 at 17:21 -0800, Andrew Morton wrote:
> > On Mon, 18 Dec 2006 16:57:30 -0800 (PST)
> > Linus Torvalds <torvalds@osdl.org> wrote:
> > 
> > > What happens if you only ifdef out that single thing? 
> > > 
> > > The actual page-cleaning functions make sure to only clear the TAG_DIRTY 
> > > bit _after_ the page has been marked for writeback. Is there some ordering 
> > > constraint there, perhaps?
> > > 
> > > I'm really reaching here. I'm trying to see the pattern, and I'm not 
> > > seeing it. I'm asking you to test things just to get more of a feel for 
> > > what triggers the failure, than because I actually have any kind of idea 
> > > of what the heck is going on.
> > > 
> > > Andrew, Nick, Hugh - any ideas?
> > 
> > If all of test_clear_page_dirty() has been commented out then the page will
> > never become clean hence will never fall out of pagecache, so unless Andrei
> > is doing a reboot before checking for corruption, perhaps the underlying
> > data on-disk is incorrect, but we can't see it.
> 
> if I do a sync and echo 1 > /proc/sys/vm/drop_caches

OK, that works.

>  is the reboot
> still necessary?

It might be necessary to reboot in this case - if we're leaving the
pagecache dirty, writing to drop_caches won't remove it.  And you probably
won't be able to get a clean reboot either.

> > 
> > Andrei, how _are_ you running this test?    What's the exact sequence of steps?
> > 
> > In particular, are you doing anything which would cause the corrupted file
> > to be evicted from memory, thus forcing a read from disk?  Such as
> > unmounting and then remounting the filesystem?
> 
> I boot Linux, start rtorrent and start the download. While it's
> downloading I start evolution and check my mail (my mbox is very large,
> several hundred megabytes), close evolution (I use evolution just to
> have another application which uses the filesystem and the memory), and
> start evolution again. I start firefox. When the download is complete,
> rtorrent says whether the hash is good or not. I do an "unrar t qwe.rar" to
> test that all 84 downloaded rar files are OK and see the result.
> 
> > 
> > The point of my question is to check that the data is really incorrect
> > on-disk, or whether it is incorrect in pagecache.
> > 
> > Also, it'd be useful if you could determine whether the bug appears with
> > the ext2 filesystem: do s/ext3/ext2/ in /etc/fstab, or boot with
> > rootfstype=ext2 if it's the root filesystem.
> 
> I will test.

ok, thanks.

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  1:54                                         ` Andrew Morton
@ 2006-12-19  2:04                                           ` Andrei Popa
  2006-12-19  8:05                                           ` Andrei Popa
  1 sibling, 0 replies; 311+ messages in thread
From: Andrei Popa @ 2006-12-19  2:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr


> > > If all of test_clear_page_dirty() has been commented out then the page will
> > > never become clean hence will never fall out of pagecache, so unless Andrei
> > > is doing a reboot before checking for corruption, perhaps the underlying
> > > data on-disk is incorrect, but we can't see it.
> > 
> > if I do a sync and echo 1 > /proc/sys/vm/drop_caches
> 
> OK, that works.
> 
> >  is the reboot
> > still necessary?
> 
> It might be necessary to reboot in this case - if we're leaving the
> pagecache dirty, writing to drop_caches won't remove it.  And you probably
> won't be able to get a clean reboot either.
> 
> > > 
> > > Andrei, how _are_ you running this test?    What's the exact sequence of steps?
> > > 
> > > In particular, are you doing anything which would cause the corrupted file
> > > to be evicted from memory, thus forcing a read from disk?  Such as
> > > unmounting and then remounting the filesystem?
> > 
> > I boot Linux, start rtorrent and start the download. While it's
> > downloading I start evolution and check my mail (my mbox is very large,
> > several hundred megabytes), close evolution (I use evolution just to
> > have another application which uses the filesystem and the memory), and
> > start evolution again. I start firefox. When the download is complete,
> > rtorrent says whether the hash is good or not. I do an "unrar t qwe.rar" to
> > test that all 84 downloaded rar files are OK and see the result.
> > 
> > > 
> > > The point of my question is to check that the data is really incorrect
> > > on-disk, or whether it is incorrect in pagecache.

I rebooted and the files are still broken after reboot(tested twice) so
the data is incorrect on disk.

> > > 
> > > Also, it'd be useful if you could determine whether the bug appears with
> > > the ext2 filesystem: do s/ext3/ext2/ in /etc/fstab, or boot with
> > > rootfstype=ext2 if it's the root filesystem.
> > 
> > I will test.

Will test in a couple of hours; I have some work to do...

> 
> ok, thanks.


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 18:03         ` Linus Torvalds
  2006-12-18 18:24           ` Peter Zijlstra
@ 2006-12-19  4:36           ` Nick Piggin
  2006-12-19  6:34             ` Linus Torvalds
  2006-12-19  7:22             ` Peter Zijlstra
  1 sibling, 2 replies; 311+ messages in thread
From: Nick Piggin @ 2006-12-19  4:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

[-- Attachment #1: Type: text/plain, Size: 3403 bytes --]

Linus Torvalds wrote:
> On Mon, 18 Dec 2006, Peter Zijlstra wrote:
> 
>>This should be safe; page_mkclean walks the rmap and flips the pte's
>>under the pte lock and records the dirty state while iterating.
>>Concurrent faults will either do set_page_dirty() before we get around
>>to doing it or vice versa, but dirty state is not lost.
> 
> 
> Ok, I really liked this patch, but the more I thought about it, the more I 
> started to doubt the reasons for liking it.

Well this implements my suggestion to redirty the page if there were dirty
ptes. I think it is a good fix (whether or not it fixes Andrei's bug, it
does fix a bug), though maybe _slightly_ suboptimal.

> I think we have some core fundamental problem here that this patch is 
> needed at all.
> 
> So let's think about this: we apparently have two cases of 
> "clear_page_dirty()":
> 
>  - the one that really wants to clear the bit unconditionally (Andrew 
>    calls this the "must_clean_ptes" case, which I personally find to be a 
>    really confusing name, but whatever)
> 
>  - the other case. The case that doesn't want to really clear the pte 
>    dirty bits.

I don't think this characterises it correctly. Think about how it worked
before the page_mkclean went in there.

We really _never_ want to just clear pte dirty bits, because that would be
a data loss situation[*]. The only reason we clear PG_dirty is because some
filesystem may have cleaned each buffer without realising it has cleaned
the whole page. But if you have a dirty pte, then all bets are off: a
buffer with a clear dirty bit can not be considered clean.

Before the dirty page tracking, it was fine to clear PG_dirty here, because
we would pick up the pte dirty info later on. After the page dirty tracking,
clearing pte dirty is a bug here, and re-accounting the dirty page is
arguably the minimal fix.

[*] except in the truncate case where we are happy to throw out dirty data,
     but in that case there would be no ptes anyway.
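
To make the three behaviours concrete, here is a minimal user-space sketch
(not kernel code). It assumes a toy page with one PG_dirty-like flag and a
single pte dirty bit, and that the page starts out dirty; mapped_page,
clear_policy and the toy_* helpers are hypothetical names invented for this
illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page: one page-level dirty flag and one pte-level dirty bit. */
struct mapped_page {
	bool pg_dirty;	/* stands in for PG_dirty */
	bool pte_dirty;	/* stands in for the dirty bit of one mapping pte */
};

enum clear_policy {
	CLEAR_LEAVE_PTES,	/* before dirty tracking: touch only PG_dirty */
	CLEAR_AND_CLEAN_PTES,	/* 2.6.19 behaviour: clean the ptes as well */
	CLEAR_SYNC_PTES,	/* proposed: redirty the page if a pte was dirty */
};

/* Clear the page-level dirty flag according to the chosen policy. */
void toy_clear_page_dirty(struct mapped_page *p, enum clear_policy policy)
{
	bool pte_was_dirty;

	assert(p);
	pte_was_dirty = p->pte_dirty;

	p->pg_dirty = false;
	if (policy == CLEAR_LEAVE_PTES)
		return;			/* pte dirty bit is picked up later */

	p->pte_dirty = false;		/* models page_mkclean() */
	if (policy == CLEAR_SYNC_PTES && pte_was_dirty)
		p->pg_dirty = true;	/* models set_page_dirty() */
}

/* Dirty information survives if it is recorded in either place. */
bool toy_dirty_info_survives(const struct mapped_page *p)
{
	return p->pg_dirty || p->pte_dirty;
}
```

Starting with both flags set, CLEAR_LEAVE_PTES and CLEAR_SYNC_PTES leave
toy_dirty_info_survives() true, while CLEAR_AND_CLEAN_PTES leaves it false,
which is the data loss case described above.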

The only thing I would suggest is not applying Andrew's patch at all, and
do the special casing in try_to_free_buffers(). I've attached a patch for
comments.


> and I thought your patch made sense, because it saved away the pte state 
> in the page dirty state, and that matches my mental model, but the more I 
> think about it, the less sense that whole "the other case" situation makes 
> AT ALL.
> 
> Why does "the other case" exist at all? If you want to clear the dirty 
> page flag, what is _ever_ the reason for not wanting to drop PTE dirty 
> information? In other words, what possible reason can there ever be for 
> saying "I want this page to be clean", while at the same time saying "but 
> if it was dirty in the page tables, don't forget about that state".

We never want to drop dirty data! (ignoring the truncate case, which is
handled privately by truncate anyway)

This whole exercise is not about cleaning or dirtying or forgetting the actual
*data* in the page. It is about bringing the pagecache's notion of whether
the page is dirty or clean in line with the (more uptodate) filesystem's
notion.

After dirty write accounting, we also threw in "the virtual memory manager's
notion", but got that case slightly wrong.

As unlikely as this race is for SMP systems, I think it is easily possible
for PREEMPT kernels. And they have featured in all bug reports, AFAIKS.

-- 
SUSE Labs, Novell Inc.

[-- Attachment #2: fs-fix.patch --]
[-- Type: text/plain, Size: 3904 bytes --]

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2006-12-19 15:15:46.000000000 +1100
+++ linux-2.6/fs/buffer.c	2006-12-19 15:36:01.000000000 +1100
@@ -2852,7 +2852,17 @@ int try_to_free_buffers(struct page *pag
 		 * This only applies in the rare case where try_to_free_buffers
 		 * succeeds but the page is not freed.
 		 */
-		clear_page_dirty(page);
+
+		/*
+		 * If the page has been dirtied via the user mappings, then
+		 * clean buffers does not indicate the page data is actually
+		 * clean! Only clear the page dirty bit if there are no dirty
+		 * ptes either.
+		 *
+		 * If there are dirty ptes, then the page must be uptodate, so
+		 * the above concern does not apply.
+		 */
+		clear_page_dirty_sync_ptes(page);
 	}
 out:
 	if (buffers_to_free) {
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2006-12-19 15:17:18.000000000 +1100
+++ linux-2.6/include/linux/page-flags.h	2006-12-19 15:34:24.000000000 +1100
@@ -254,6 +254,7 @@ static inline void SetPageUptodate(struc
 struct page;	/* forward declaration */
 
 int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty_sync_ptes(struct page *page);
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);
 
@@ -262,6 +263,11 @@ static inline void clear_page_dirty(stru
 	test_clear_page_dirty(page);
 }
 
+static inline void clear_page_dirty_sync_ptes(struct page *page)
+{
+	test_clear_page_dirty_sync_ptes(page);
+}
+
 static inline void set_page_writeback(struct page *page)
 {
 	test_set_page_writeback(page);
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2006-12-19 15:17:53.000000000 +1100
+++ linux-2.6/mm/page-writeback.c	2006-12-19 15:33:29.000000000 +1100
@@ -844,9 +844,10 @@ EXPORT_SYMBOL(set_page_dirty_lock);
 
 /*
  * Clear a page's dirty flag, while caring for dirty memory accounting. 
+ * Does not clear pte dirty bits.
  * Returns true if the page was previously dirty.
  */
-int test_clear_page_dirty(struct page *page)
+static int test_clear_page_dirty_leave_ptes(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
@@ -862,10 +863,8 @@ int test_clear_page_dirty(struct page *p
 			 * We can continue to use `mapping' here because the
 			 * page is locked, which pins the address_space
 			 */
-			if (mapping_cap_account_dirty(mapping)) {
-				page_mkclean(page);
+			if (mapping_cap_account_dirty(mapping))
 				dec_zone_page_state(page, NR_FILE_DIRTY);
-			}
 			return 1;
 		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
@@ -873,9 +872,43 @@ int test_clear_page_dirty(struct page *p
 	}
 	return TestClearPageDirty(page);
 }
+
+/*
+ * As above, but does clear dirty bits from ptes
+ */
+int test_clear_page_dirty(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+
+	if (test_clear_page_dirty_leave_ptes(page)) {
+		if (mapping_cap_account_dirty(mapping))
+			page_mkclean(page);
+		return 1;
+	}
+	return 0;
+}
 EXPORT_SYMBOL(test_clear_page_dirty);
 
 /*
+ * As above, but redirties page if any dirty ptes are found (and then only
+ * if the mapping accounts dirty pages, otherwise dirty ptes are left dirty
+ * but the page is cleaned).
+ */
+int test_clear_page_dirty_sync_ptes(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+
+	if (test_clear_page_dirty_leave_ptes(page)) {
+		if (mapping_cap_account_dirty(mapping)) {
+			if (page_mkclean(page))
+				set_page_dirty(page);
+		}
+		return 1;
+	}
+	return 0;
+}
+
+/*
  * Clear a page's dirty flag, while caring for dirty memory accounting.
  * Returns true if the page was previously dirty.
  *

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  4:36           ` Nick Piggin
@ 2006-12-19  6:34             ` Linus Torvalds
  2006-12-19  6:51               ` Nick Piggin
  2006-12-19 20:03               ` dean gaudet
  2006-12-19  7:22             ` Peter Zijlstra
  1 sibling, 2 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-19  6:34 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr



On Tue, 19 Dec 2006, Nick Piggin wrote:
> 
> We never want to drop dirty data! (ignoring the truncate case, which is
> handled privately by truncate anyway)

Bzzt.

SURE we do.

We absolutely do want to drop dirty data in the writeout path.

How do you think dirty data ever _becomes_ clean data?

In other words, yes, we _do_ want to test-and-clear all the pgtable bits 
_and_ the PG_dirty bit. We want to do it for:
 - writeout
 - truncate
 - possibly a "drop" event (which could be a case for a journal entry that 
   becomes stale due to being replaced or something - kind of "truncate" 
   on metadata)

because all of those events _literally_ turn dirty state into clean 
state.

In no other circumstance do we ever want to clear a dirty bit, as far as I 
can tell. 

			Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  6:34             ` Linus Torvalds
@ 2006-12-19  6:51               ` Nick Piggin
  2006-12-19  7:26                 ` Linus Torvalds
  2006-12-19 20:03               ` dean gaudet
  1 sibling, 1 reply; 311+ messages in thread
From: Nick Piggin @ 2006-12-19  6:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

Linus Torvalds wrote:
> 
> On Tue, 19 Dec 2006, Nick Piggin wrote:
> 
>>We never want to drop dirty data! (ignoring the truncate case, which is
>>handled privately by truncate anyway)
> 
> 
> Bzzt.
> 
> SURE we do.
> 
> We absolutely do want to drop dirty data in the writeout path.
> 
> How do you think dirty data ever _becomes_ clean data?

I wouldn't have thought it becomes clean by dropping it ;) Is this a
trick question? My answer is that we clean a page by taking some
action such that the underlying data matches the data in RAM...

We don't "drop" any data until it has been cleaned (again, ignoring
things like truncate for a minute). That's a bug! And
try_to_free_buffers() is called from places outside the writeout path.
This is our bug (or at least, one of our bugs that appears to have the
same triggers and symptoms as people are reporting).

[...]

> In no other circumstance do we ever want to clear a dirty bit, as far as I 
> can tell. 

Exactly. And that is exactly what try_to_free_buffers is doing now.

I still think you should have a look at the patch.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  4:36           ` Nick Piggin
  2006-12-19  6:34             ` Linus Torvalds
@ 2006-12-19  7:22             ` Peter Zijlstra
  2006-12-19  7:59               ` Nick Piggin
  1 sibling, 1 reply; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-19  7:22 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Tue, 2006-12-19 at 15:36 +1100, Nick Piggin wrote:

> plain text document attachment (fs-fix.patch)
> Index: linux-2.6/fs/buffer.c
> ===================================================================
> --- linux-2.6.orig/fs/buffer.c	2006-12-19 15:15:46.000000000 +1100
> +++ linux-2.6/fs/buffer.c	2006-12-19 15:36:01.000000000 +1100
> @@ -2852,7 +2852,17 @@ int try_to_free_buffers(struct page *pag
>  		 * This only applies in the rare case where try_to_free_buffers
>  		 * succeeds but the page is not freed.
>  		 */
> -		clear_page_dirty(page);
> +
> +		/*
> +		 * If the page has been dirtied via the user mappings, then
> +		 * clean buffers does not indicate the page data is actually
> +		 * clean! Only clear the page dirty bit if there are no dirty
> +		 * ptes either.
> +		 *
> +		 * If there are dirty ptes, then the page must be uptodate, so
> +		 * the above concern does not apply.
> +		 */
> +		clear_page_dirty_sync_ptes(page);
>  	}
>  out:
>  	if (buffers_to_free) {
> Index: linux-2.6/include/linux/page-flags.h
> ===================================================================
> --- linux-2.6.orig/include/linux/page-flags.h	2006-12-19 15:17:18.000000000 +1100
> +++ linux-2.6/include/linux/page-flags.h	2006-12-19 15:34:24.000000000 +1100
> @@ -254,6 +254,7 @@ static inline void SetPageUptodate(struc
>  struct page;	/* forward declaration */
>  
>  int test_clear_page_dirty(struct page *page);
> +int test_clear_page_dirty_sync_ptes(struct page *page);
>  int test_clear_page_writeback(struct page *page);
>  int test_set_page_writeback(struct page *page);
>  
> @@ -262,6 +263,11 @@ static inline void clear_page_dirty(stru
>  	test_clear_page_dirty(page);
>  }
>  
> +static inline void clear_page_dirty_sync_ptes(struct page *page)
> +{
> +	test_clear_page_dirty_sync_ptes(page);
> +}
> +
>  static inline void set_page_writeback(struct page *page)
>  {
>  	test_set_page_writeback(page);
> Index: linux-2.6/mm/page-writeback.c
> ===================================================================
> --- linux-2.6.orig/mm/page-writeback.c	2006-12-19 15:17:53.000000000 +1100
> +++ linux-2.6/mm/page-writeback.c	2006-12-19 15:33:29.000000000 +1100
> @@ -844,9 +844,10 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>  
>  /*
>   * Clear a page's dirty flag, while caring for dirty memory accounting. 
> + * Does not clear pte dirty bits.
>   * Returns true if the page was previously dirty.
>   */
> -int test_clear_page_dirty(struct page *page)
> +static int test_clear_page_dirty_leave_ptes(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
>  	unsigned long flags;
> @@ -862,10 +863,8 @@ int test_clear_page_dirty(struct page *p
>  			 * We can continue to use `mapping' here because the
>  			 * page is locked, which pins the address_space
>  			 */
> -			if (mapping_cap_account_dirty(mapping)) {
> -				page_mkclean(page);
> +			if (mapping_cap_account_dirty(mapping))
>  				dec_zone_page_state(page, NR_FILE_DIRTY);
> -			}
>  			return 1;
>  		}
>  		write_unlock_irqrestore(&mapping->tree_lock, flags);
> @@ -873,9 +872,43 @@ int test_clear_page_dirty(struct page *p
>  	}
>  	return TestClearPageDirty(page);
>  }
> +
> +/*
> + * As above, but does clear dirty bits from ptes
> + */
> +int test_clear_page_dirty(struct page *page)
> +{
> +	struct address_space *mapping = page_mapping(page);
> +
> +	if (test_clear_page_dirty_leave_ptes(page)) {
> +		if (mapping_cap_account_dirty(mapping))
> +			page_mkclean(page);
> +		return 1;
> +	}
> +	return 0;
> +}
>  EXPORT_SYMBOL(test_clear_page_dirty);
>  
>  /*
> + * As above, but redirties page if any dirty ptes are found (and then only
> + * if the mapping accounts dirty pages, otherwise dirty ptes are left dirty
> + * but the page is cleaned).
> + */
> +int test_clear_page_dirty_sync_ptes(struct page *page)
> +{
> +	struct address_space *mapping = page_mapping(page);
> +
> +	if (test_clear_page_dirty_leave_ptes(page)) {
> +		if (mapping_cap_account_dirty(mapping)) {
> +			if (page_mkclean(page))
> +				set_page_dirty(page);
> +		}
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/*
>   * Clear a page's dirty flag, while caring for dirty memory accounting.
>   * Returns true if the page was previously dirty.
>   *

Hmm, not quite; it certainly looks better than the extra ,[01] tagged onto
test_clear_page_dirty() though. Although I would have expected it the
other way around - test_clear_page_dirty_sync_ptes to be the default
case and test_clear_page_dirty_clean_ptes to be used in
clear_page_dirty_for_io().

Anyway it has the same issues as the others. See what happens when you
run two test_clear_page_dirty_sync_ptes() consecutively: you still lose
PG_dirty even though the page might actually be dirty.




^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  6:51               ` Nick Piggin
@ 2006-12-19  7:26                 ` Linus Torvalds
  2006-12-19  8:04                   ` Linus Torvalds
  0 siblings, 1 reply; 311+ messages in thread
From: Linus Torvalds @ 2006-12-19  7:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr



On Tue, 19 Dec 2006, Nick Piggin wrote:
> 
> I wouldn't have thought it becomes clean by dropping it ;) Is this a
> trick question? My answer is that we clean a page by taking some
> action such that the underlying data matches the data in RAM...

Sure.

> We don't "drop" any data until it has been cleaned (again, ignoring
> things like truncate for a minute). That's a bug!

Actually, it's the other way around. We have to drop the dirty bits BEFORE 
cleaning. If we clean first, and _then_ drop the dirty bits, THAT is a 
bug, because the dirty bits can now refer to _new_ dirty data that didn't 
get written out.

So the proper sequence is _literally_ to mark the page clean FIRST. Drop 
all the dirty bits, but not the _data_ obviously (ie you have a reference 
to the page). And _then_ you do the writeout to actually clean the data 
itself.

So you actually state it exactly the wrong way around.

We MUST clear the dirty bits before we do the IO that actually cleans the 
data. Exactly because if new writes keep on happening, if we do it in the 
other order, we'll drop dirty data on the floor.
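
The ordering argument can be sketched in user space as follows. This is a
toy model under stated assumptions, not kernel code: a single flag stands in
for all the dirty bits, writeout is split into a "snapshot" step and a
"complete" step so that a store can race in between, and io_page plus the
toy_* and *_write/*_clear helpers are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page: contents in RAM, contents on disk, and one dirty flag. */
struct io_page {
	int ram;	/* current contents in memory */
	int disk;	/* contents last written out */
	bool dirty;	/* stands in for PG_dirty plus the pte dirty bits */
};

/* A store to the page: changes the data and marks it dirty. */
static void toy_store(struct io_page *p, int val)
{
	assert(p);
	p->ram = val;
	p->dirty = true;
}

/* Writeout in two steps, so a store can land between them. */
static int toy_start_io(const struct io_page *p) { return p->ram; }
static void toy_end_io(struct io_page *p, int snap) { p->disk = snap; }

/* Clear the dirty flag first, then do the IO; a store races with the IO. */
void clear_then_write(struct io_page *p, int racing_val)
{
	int snap;

	p->dirty = false;
	snap = toy_start_io(p);
	toy_store(p, racing_val);	/* re-dirties the page */
	toy_end_io(p, snap);
}

/* Do the IO first, then clear the dirty flag; same racing store. */
void write_then_clear(struct io_page *p, int racing_val)
{
	int snap;

	snap = toy_start_io(p);
	toy_store(p, racing_val);
	toy_end_io(p, snap);
	p->dirty = false;		/* wipes out the racing store's dirty bit */
}

/* Nothing is lost as long as the page is dirty or RAM matches disk. */
bool toy_consistent(const struct io_page *p)
{
	return p->dirty || p->ram == p->disk;
}
```

After clear_then_write() the page is still marked dirty, so the racing store
will be written out later. After write_then_clear() the page claims to be
clean while RAM and disk differ, which is the "dropped on the floor" case.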

> > In no other circumstance do we ever want to clear a dirty bit, as far as I
> > can tell. 
> 
> Exactly. And that is exactly what try_to_free_buffers is doing now.
> 
> I still think you should have a look at the patch.

I claim that dropping dirty bits AFTER the IO is always wrong. 
Try_to_free_buffers() must never touch the dirty bits at all, because by 
definition that thing happens after the IO has actually been done.

And yes, I looked at your patch. And it looks a million times cleaner 
than Andrew's patch. However, it's already been tested multiple times, and 
totally REMOVING the "clear_page_dirty()" from try_to_free_buffers() still 
resulted in the corruption.

That said, I think your patch is worth it just as a cleanup. Much nicer 
than Andrew's code, also from a naming standpoint. So I'm not actually 
disagreeing about the patch itself, but I _am_ saying that I don't 
actually see the point of ever moving the dirty bits around.

So I repeat: we have the case where we really want to _remove_ the dirty 
bits (because we're going to write the current state of the page to disk, 
and we need to clear the dirty bits BEFORE we do that). That's the one 
that makes sense, and that's the code we want to run before doing IO. It's 
the "clear_dirty_bits_for_io()" case.

The code that doesn't make sense is the "shuffle the dirty bits around". In 
other words: when does it actually make sense to call your 
(well-implemented, don't get me wrong) "test_clear_page_dirty_sync_ptes()"
function? It doesn't _fix_ anything. It just shuffles dirty bits from one 
place to another. What was the point again?

If the point is "try_to_free_buffers()", then my argument was that I had a 
much simpler solution: "Just don't do that then". My simple patch sadly 
didn't fix the data corruption, so the data corruption comes from 
something ELSE than try_to_free_buffers().

		Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 19:18                 ` Linus Torvalds
  2006-12-18 19:44                   ` Andrei Popa
@ 2006-12-19  7:38                   ` Peter Zijlstra
  1 sibling, 0 replies; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-19  7:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Popa, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 11:18 -0800, Linus Torvalds wrote:

> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index d8a842a..3f9061e 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page 
> >  		goto unlock;
> >  
> >  	entry = ptep_get_and_clear(mm, address, pte);
> > -	entry = pte_mkclean(entry);
> > +	/*entry = pte_mkclean(entry);*/
> >  	entry = pte_wrprotect(entry);
> >  	ptep_establish(vma, address, pte, entry);
> >  	lazy_mmu_prot_update(entry);
> 
> The above patch is bad. It's always going to hide the bug, but it hides it 
> by just not doing anything at all. 

Not quite, it does wrprotect still, so further updates will trigger the
do_wp_page() path and call set_page_dirty().

So we could make 'something' that would keep the tracking working and
not create corruption, say something like this:

However I'll try and figure out how we get so terribly confused on the
PG_dirty state that we have to clean it and fall back to pte_dirty. That
is the real issue we have.

---
 include/linux/rmap.h |    6 ++++++
 mm/page-writeback.c  |    3 ++-
 mm/rmap.c            |   23 ++++++++++++++++++-----
 3 files changed, 26 insertions(+), 6 deletions(-)

Index: linux-2.6-git/mm/rmap.c
===================================================================
--- linux-2.6-git.orig/mm/rmap.c	2006-12-18 11:06:29.000000000 +0100
+++ linux-2.6-git/mm/rmap.c	2006-12-19 08:33:57.000000000 +0100
@@ -428,7 +428,8 @@ int page_referenced(struct page *page, i
 	return referenced;
 }
 
-static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
+static int page_mkcw_one(struct page *page,
+			 struct vm_area_struct *vma, int make_clean)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
@@ -448,7 +449,8 @@ static int page_mkclean_one(struct page 
 		goto unlock;
 
 	entry = ptep_get_and_clear(mm, address, pte);
-	entry = pte_mkclean(entry);
+	if (make_clean)
+		entry = pte_mkclean(entry);
 	entry = pte_wrprotect(entry);
 	ptep_establish(vma, address, pte, entry);
 	lazy_mmu_prot_update(entry);
@@ -460,7 +462,8 @@ out:
 	return ret;
 }
 
-static int page_mkclean_file(struct address_space *mapping, struct page *page)
+static int page_mkcw_file(struct address_space *mapping,
+			  struct page *page, int make_clean)
 {
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 	struct vm_area_struct *vma;
@@ -478,7 +481,7 @@ static int page_mkclean_file(struct addr
 	return ret;
 }
 
-int page_mkclean(struct page *page)
+static int page_mkcw(struct page *page, int make_clean)
 {
 	int ret = 0;
 
@@ -487,12 +490,22 @@ int page_mkclean(struct page *page)
 	if (page_mapped(page)) {
 		struct address_space *mapping = page_mapping(page);
 		if (mapping)
-			ret = page_mkclean_file(mapping, page);
+			ret = page_mkcw_file(mapping, page, make_clean);
 	}
 
 	return ret;
 }
 
+int page_mkclean(struct page *page)
+{
+	return page_mkcw(page, 1);
+}
+
+int page_wrprotect(struct page *page)
+{
+	return page_mkcw(page, 0);
+}
+
 /**
  * page_set_anon_rmap - setup new anonymous rmap
  * @page:	the page to add the mapping to
Index: linux-2.6-git/include/linux/rmap.h
===================================================================
--- linux-2.6-git.orig/include/linux/rmap.h	2006-12-19 08:31:59.000000000 +0100
+++ linux-2.6-git/include/linux/rmap.h	2006-12-19 08:32:28.000000000 +0100
@@ -110,6 +110,7 @@ unsigned long page_address_in_vma(struct
  * returns the number of cleaned PTEs.
  */
 int page_mkclean(struct page *);
+int page_wrprotect(struct page *);
 
 #else	/* !CONFIG_MMU */
 
@@ -125,6 +126,11 @@ static inline int page_mkclean(struct pa
 	return 0;
 }
 
+static inline int page_wrprotect(struct page *page)
+{
+	return 0;
+}
+
 
 #endif	/* CONFIG_MMU */
 
Index: linux-2.6-git/mm/page-writeback.c
===================================================================
--- linux-2.6-git.orig/mm/page-writeback.c	2006-12-19 08:24:48.000000000 +0100
+++ linux-2.6-git/mm/page-writeback.c	2006-12-19 08:31:43.000000000 +0100
@@ -872,7 +872,8 @@ int test_clear_page_dirty(struct page *p
 		 * page is locked, which pins the address_space
 		 */
 		if (mapping_cap_account_dirty(mapping)) {
-			page_mkclean(page);
+			if (page_wrprotect(page))
+				set_page_dirty(page);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
 		return 1;





^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  7:22             ` Peter Zijlstra
@ 2006-12-19  7:59               ` Nick Piggin
  2006-12-19  8:14                 ` Linus Torvalds
  0 siblings, 1 reply; 311+ messages in thread
From: Nick Piggin @ 2006-12-19  7:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

Peter Zijlstra wrote:
> On Tue, 2006-12-19 at 15:36 +1100, Nick Piggin wrote:
> 
> 
>>plain text document attachment (fs-fix.patch)
>>Index: linux-2.6/fs/buffer.c
>>===================================================================
>>--- linux-2.6.orig/fs/buffer.c	2006-12-19 15:15:46.000000000 +1100
>>+++ linux-2.6/fs/buffer.c	2006-12-19 15:36:01.000000000 +1100
>>@@ -2852,7 +2852,17 @@ int try_to_free_buffers(struct page *pag
>> 		 * This only applies in the rare case where try_to_free_buffers
>> 		 * succeeds but the page is not freed.
>> 		 */
>>-		clear_page_dirty(page);
>>+
>>+		/*
>>+		 * If the page has been dirtied via the user mappings, then
>>+		 * clean buffers does not indicate the page data is actually
>>+		 * clean! Only clear the page dirty bit if there are no dirty
>>+		 * ptes either.
>>+		 *
>>+		 * If there are dirty ptes, then the page must be uptodate, so
>>+		 * the above concern does not apply.
>>+		 */
>>+		clear_page_dirty_sync_ptes(page);
>> 	}
>> out:
>> 	if (buffers_to_free) {
>>Index: linux-2.6/include/linux/page-flags.h
>>===================================================================
>>--- linux-2.6.orig/include/linux/page-flags.h	2006-12-19 15:17:18.000000000 +1100
>>+++ linux-2.6/include/linux/page-flags.h	2006-12-19 15:34:24.000000000 +1100
>>@@ -254,6 +254,7 @@ static inline void SetPageUptodate(struc
>> struct page;	/* forward declaration */
>> 
>> int test_clear_page_dirty(struct page *page);
>>+int test_clear_page_dirty_sync_ptes(struct page *page);
>> int test_clear_page_writeback(struct page *page);
>> int test_set_page_writeback(struct page *page);
>> 
>>@@ -262,6 +263,11 @@ static inline void clear_page_dirty(stru
>> 	test_clear_page_dirty(page);
>> }
>> 
>>+static inline void clear_page_dirty_sync_ptes(struct page *page)
>>+{
>>+	test_clear_page_dirty_sync_ptes(page);
>>+}
>>+
>> static inline void set_page_writeback(struct page *page)
>> {
>> 	test_set_page_writeback(page);
>>Index: linux-2.6/mm/page-writeback.c
>>===================================================================
>>--- linux-2.6.orig/mm/page-writeback.c	2006-12-19 15:17:53.000000000 +1100
>>+++ linux-2.6/mm/page-writeback.c	2006-12-19 15:33:29.000000000 +1100
>>@@ -844,9 +844,10 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>> 
>> /*
>>  * Clear a page's dirty flag, while caring for dirty memory accounting. 
>>+ * Does not clear pte dirty bits.
>>  * Returns true if the page was previously dirty.
>>  */
>>-int test_clear_page_dirty(struct page *page)
>>+static int test_clear_page_dirty_leave_ptes(struct page *page)
>> {
>> 	struct address_space *mapping = page_mapping(page);
>> 	unsigned long flags;
>>@@ -862,10 +863,8 @@ int test_clear_page_dirty(struct page *p
>> 			 * We can continue to use `mapping' here because the
>> 			 * page is locked, which pins the address_space
>> 			 */
>>-			if (mapping_cap_account_dirty(mapping)) {
>>-				page_mkclean(page);
>>+			if (mapping_cap_account_dirty(mapping))
>> 				dec_zone_page_state(page, NR_FILE_DIRTY);
>>-			}
>> 			return 1;
>> 		}
>> 		write_unlock_irqrestore(&mapping->tree_lock, flags);
>>@@ -873,9 +872,43 @@ int test_clear_page_dirty(struct page *p
>> 	}
>> 	return TestClearPageDirty(page);
>> }
>>+
>>+/*
>>+ * As above, but does clear dirty bits from ptes
>>+ */
>>+int test_clear_page_dirty(struct page *page)
>>+{
>>+	struct address_space *mapping = page_mapping(page);
>>+
>>+	if (test_clear_page_dirty_leave_ptes(page)) {
>>+		if (mapping_cap_account_dirty(mapping))
>>+			page_mkclean(page);
>>+		return 1;
>>+	}
>>+	return 0;
>>+}
>> EXPORT_SYMBOL(test_clear_page_dirty);
>> 
>> /*
>>+ * As above, but redirties page if any dirty ptes are found (and then only
>>+ * if the mapping accounts dirty pages, otherwise dirty ptes are left dirty
>>+ * but the page is cleaned).
>>+ */
>>+int test_clear_page_dirty_sync_ptes(struct page *page)
>>+{
>>+	struct address_space *mapping = page_mapping(page);
>>+
>>+	if (test_clear_page_dirty_leave_ptes(page)) {
>>+		if (mapping_cap_account_dirty(mapping)) {
>>+			if (page_mkclean(page))
>>+				set_page_dirty(page);
>>+		}
>>+		return 1;
>>+	}
>>+	return 0;
>>+}
>>+
>>+/*
>>  * Clear a page's dirty flag, while caring for dirty memory accounting.
>>  * Returns true if the page was previously dirty.
>>  *
> 
> 
> Hmm, not quite; it certainly looks better than the extra ,[01] tagged onto
> test_clear_page_dirty() though. Although I would have expected it the
> other way around - test_clear_page_dirty_sync_ptes to be the default
> case and test_clear_page_dirty_clean_ptes to be used in
> clear_page_dirty_for_io().
> 
> Anyway it has the same issues as the others. See what happens when you
> run two test_clear_page_dirty_sync_ptes() consecutively: you still lose
> PG_dirty even though the page might actually be dirty.

How can this happen? We'll only test_clear_page_dirty_sync_ptes again
after buffers have been reattached, and subsequently cleaned. And in
that case if the ptes are still clean at this point then the page really
is clean.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  7:26                 ` Linus Torvalds
@ 2006-12-19  8:04                   ` Linus Torvalds
  2006-12-19  9:00                     ` Peter Zijlstra
       [not found]                     ` <4587B762.2030603@yahoo.com.au>
  0 siblings, 2 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-19  8:04 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr



On Mon, 18 Dec 2006, Linus Torvalds wrote:
> 
> The code that doesn't make sense is the "shuffle the dirty bits around". In 
> other words: when does it actually make sense to call your 
> (well-implemented, don't get me wrong) "test_clear_page_dirty_sync_ptes()"
> function? It doesn't _fix_ anything. It just shuffles dirty bits from one 
> place to another. What was the point again?

Let me try to phrase that another way, in terms that you defined.

In other words, look at your test_clear_page_dirty_sync_ptes() function.

First, start out from the _inner_ part, the:

	if (mapping_cap_account_dirty(mapping)) {
		if (page_mkclean(page))
			set_page_dirty(page);
	}

part.

This is the one that both you and I agree is a "working" situation: we are 
moving the dirty bits from the pte into the "struct page", and we both 
agree that this is fine. No dirty bits get lost. You even make a BIG DEAL 
about the fact that no dirty bits get lost.

So begin by just explaining:
 - why do it?

Why shuffle the dirty bits around? Why not just _leave_ the PG_dirty bit 
on the "struct page", and simply leave it all at that? I agree that the 
above doesn't lose any dirty bits, but what I'm asking for is WHAT IS THE 
POINT?

So that is the code that we both agree "works", but I personally don't see 
the _point_ of. However, that's actually not even important, because I 
don't even care about the point. I wanted to bring that up just in order 
to then ignore it, and look at the stuff _around_ it, namely the other 
part in "test_clear_page_dirty_sync_ptes()":

	int test_clear_page_dirty_sync_ptes(struct page *page)
	{
		if (test_clear_page_dirty_leave_ptes(page)) {
			.. do the inner part ..
			return 1;
		}
		return 0;
	}

Now, the above is the OUTER part. Please realize that this DOES actually 
drop the PG_dirty bit. So ignore the inner part entirely (which is a no-op 
for the case where the page isn't mapped), and explain to me why it's ok 
to DROP the dirty bit in the outer part, when you tried to say that it was 
NOT ok to drop it in the inner part?

NOTICE? First you make a BIG DEAL about how dirty bits should never get 
lost, but THE VERY SAME FUNCTION actually very much on purpose DOES drop 
the dirty bit for when it's not in the page tables.

In fact, if you just call that function twice, the first time it will 
MOVE the dirty bits from the PTE to the "struct page *", and the _second_ 
time it will just clear the dirty bit from the "struct page *". You end up 
with a clean page. It returned the same return value BOTH TIMES, even 
though it did two very different things (once just moving dirty bits 
around, and the second time actually _removing_ the dirty bit entirely).
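
The two-call behaviour can be reproduced with a small user-space sketch. It
is a toy model, not the kernel function: it assumes a mapping that accounts
dirty pages and a page with a single mapping pte, and sync_page,
toy_page_mkclean and toy_test_clear_dirty_sync_ptes are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page: one page-level dirty flag and one pte-level dirty bit. */
struct sync_page {
	bool pg_dirty;	/* stands in for PG_dirty in "struct page" */
	bool pte_dirty;	/* stands in for the dirty bit in one pte */
};

/* Models page_mkclean(): clean the pte, report whether it was dirty. */
static bool toy_page_mkclean(struct sync_page *p)
{
	bool was_dirty = p->pte_dirty;

	p->pte_dirty = false;
	return was_dirty;
}

/* Models the structure of test_clear_page_dirty_sync_ptes(). */
bool toy_test_clear_dirty_sync_ptes(struct sync_page *p)
{
	assert(p);
	if (p->pg_dirty) {
		p->pg_dirty = false;		/* outer part: drop PG_dirty */
		if (toy_page_mkclean(p))	/* inner part: pte was dirty, */
			p->pg_dirty = true;	/* so move the bit to the page */
		return true;
	}
	return false;
}
```

Starting from a page that is dirty in both places, the first call returns
true and leaves PG_dirty set with a clean pte; the second call also returns
true but leaves the page clean everywhere.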

Again, I have a very simple claim: I claim that NONE of the 
"test_clear_page_dirty()" functions make any sense what-so-ever. They're 
all wrong.

The "funny" part is, that the only thing that Andrei reports actually 
fixed his corruption (apart from the patch that just stops removing the 
dirty bits from the PTE's _entirely_) is actually the part where he had an 
"#if 0 .. #endif" around basically _all_ of the "test_clear_page_dirty()" 
function (ie he had mis-understood what I asked for, and put it outside 
the _outer_ if(), rather than putting it around the inner one).

So I claim:
 - there is ONE and only ONE place where you can really drop the dirty 
   bits: it's when you're going to immediately afterwards do a writeout.

   This is the "clear_page_dirty_for_io()"

 - all the other "[test_and_]clear_dirty*()" functions seem to be outright 
   buggy and bogus. Shuffling dirty bits around from the page tables to 
   the "struct page *" (after having _cleared_ that "very important" 
   PG_dirty bit just before - apparently it wasn't that important after 
   all, was it?) is insane.

Nobody has actually ever explained why "test_clear_page_dirty()" is good 
at all.

 - Why is it ever used instead of "clear_page_dirty_for_io()"?

 - What is the difference?

 - Why would you EVER want to clear bits just in the "struct page *" or 
   just in the PTE's?

 - Why is it EVER correct to clear dirty bits except JUST BEFORE THE IO?

In other words, I have a theory:

 "A lot of this is actually historical cruft. Some of it may even be code 
  that was never supposed to work, but because we maintained _other_ dirty 
  bits in the PTE's, and never touched them before, we never even realized 
  that the code that played with PG_dirty was totally insane"

Now, that's just a theory. And yeah, it may be stated a bit provocatively. 
It may not be entirely correct. I'm just saying.. maybe it is?

And yes, we actually really _do_ have a data-point from Andrei that says 
that if you just make "test_clear_page_dirty()" a no-op, the corruption 
goes away. It was unintentional, but hey, it's a real datapoint.

See the email from Andrei:

	Subject: Re: 2.6.19 file content corruption on ext3
	From: Andrei Popa <andrei.popa@i-neo.ro>
	Date: Tue, 19 Dec 2006 01:48:11 +0200
	Message-Id: <1166485691.6977.6.camel@localhost>

and look at what remains of his "test_clear_page_dirty()". 

Scary, isn't it? And a big hint that "test_clear_page_dirty()" is just 
totally BOGUS. 

And the thing is, I think it's bogus just because I don't understand why 
it would EVER be ok to drop those dirty bits _except_ very much just 
before doing the IO that makes it non-dirty (where "truncate()" is really 
a special case where the IO ends up being not done, but it's the same kind 
of situation).

			Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  1:54                                         ` Andrew Morton
  2006-12-19  2:04                                           ` Andrei Popa
@ 2006-12-19  8:05                                           ` Andrei Popa
  2006-12-19  8:24                                             ` Andrew Morton
  1 sibling, 1 reply; 311+ messages in thread
From: Andrei Popa @ 2006-12-19  8:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

> > > Also, it'd be useful if you could determine whether the bug appears with
> > > the ext2 filesystem: do s/ext3/ext2/ in /etc/fstab, or boot with
> > > rootfstype=ext2 if it's the root filesystem.
> > 
 I have file corruption.


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  7:59               ` Nick Piggin
@ 2006-12-19  8:14                 ` Linus Torvalds
  2006-12-19  9:40                   ` Nick Piggin
  0 siblings, 1 reply; 311+ messages in thread
From: Linus Torvalds @ 2006-12-19  8:14 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr



On Tue, 19 Dec 2006, Nick Piggin wrote:
> > 
> > Anyway it has the same issues as the others. See what happens when you
> > run two test_clear_page_dirty_sync_ptes() consecutively, you still lose
> > PG_dirty even though the page might actually be dirty.
> 
> How can this happen? We'll only test_clear_page_dirty_sync_ptes again
> after buffers have been reattached, and subsequently cleaned. And in
> that case if the ptes are still clean at this point then the page really
> is clean.

Why do you talk about buffers being reattached? Are you still in some 
world where "try_to_free_buffers()" matters? Have you not followed the 
discussion? Why do you ignore my MUCH SIMPLER patch that just removed all 
this crap ENTIRELY from "try_to_free_buffers()", and the exact same 
corruption happened?

Forget about "try_to_free_buffers()". Please apply this patch to your tree 
first. That gets rid of _one_ copy of totally insane code that did all the 
wrong things.

Only after you have applied this patch should you look at the code again. 
Realizing that the corruption still happens.

So forget about buffers already. That piece of code was crap.

		Linus

---
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageDirty(page) || PageWriteback(page))
 		return 0;
 
 	if (mapping == NULL) {		/* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
 	spin_unlock(&mapping->private_lock);
-	if (ret) {
-		/*
-		 * If the filesystem writes its buffers by hand (eg ext3)
-		 * then we can have clean buffers against a dirty page.  We
-		 * clean the page here; otherwise later reattachment of buffers
-		 * could encounter a non-uptodate page, which is unresolvable.
-		 * This only applies in the rare case where try_to_free_buffers
-		 * succeeds but the page is not freed.
-		 *
-		 * Also, during truncate, discard_buffer will have marked all
-		 * the page's buffers clean.  We discover that here and clean
-		 * the page also.
-		 */
-		if (test_clear_page_dirty(page))
-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
-	}
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  8:05                                           ` Andrei Popa
@ 2006-12-19  8:24                                             ` Andrew Morton
  2006-12-19  8:34                                               ` Pekka Enberg
  2006-12-19  9:13                                               ` Marc Haber
  0 siblings, 2 replies; 311+ messages in thread
From: Andrew Morton @ 2006-12-19  8:24 UTC (permalink / raw)
  To: andrei.popa
  Cc: Linus Torvalds, Peter Zijlstra, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Tue, 19 Dec 2006 10:05:03 +0200
Andrei Popa <andrei.popa@i-neo.ro> wrote:

> > > > Also, it'd be useful if you could determine whether the bug appears with
> > > > the ext2 filesystem: do s/ext3/ext2/ in /etc/fstab, or boot with
> > > > rootfstype=ext2 if it's the root filesystem.
> > > 
>  I have file corruption.

Wow.  I didn't expect that, because Marc Haber reported that ext3's data=writeback
fixed it.   Maybe he didn't run it for long enough?

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  8:24                                             ` Andrew Morton
@ 2006-12-19  8:34                                               ` Pekka Enberg
  2006-12-19  9:13                                               ` Marc Haber
  1 sibling, 0 replies; 311+ messages in thread
From: Pekka Enberg @ 2006-12-19  8:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: andrei.popa, Linus Torvalds, Peter Zijlstra,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On 12/19/06, Andrew Morton <akpm@osdl.org> wrote:
> Wow.  I didn't expect that, because Marc Haber reported that ext3's data=writeback
> fixed it.   Maybe he didn't run it for long enough?

I don't think it did fix it for Marc:

http://marc.theaimsgroup.com/?l=linux-kernel&m=116625777306843&w=2

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  5:43               ` Andrew Morton
  2006-12-18  7:22                 ` Nick Piggin
@ 2006-12-19  8:51                 ` Marc Haber
  2006-12-19  9:28                   ` Martin Michlmayr
  2006-12-28 18:05                   ` Marc Haber
  1 sibling, 2 replies; 311+ messages in thread
From: Marc Haber @ 2006-12-19  8:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Linus Torvalds, andrei.popa,
	Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins,
	Florian Weimer, Martin Michlmayr

On Sun, Dec 17, 2006 at 09:43:08PM -0800, Andrew Morton wrote:
> Six hours here of fsx-linux plus high memory pressure on SMP on 1k
> blocksize ext3, mainline.  Zero failures.  It's unlikely that this testing
> would pass, yet people running normal workloads are able to easily trigger
> failures.  I suspect we're looking in the wrong place.

I do not have a clue about memory management at all, but is it
possible that you're testing on a box with too much memory? My box has
only 256 MB, and I used to use mutt with a _huge_ inbox with mutt
taking about 150 MB. Add spamassassin and a reasonably busy mail
server, and the box used to be like 150 MB in swap.

I have tidied my inbox in the mean time and mutt's memory requirement
has been reduced to about 30 MB, which might be the reason that I
don't see the issue that often any more.

Greetings
Marc, just trying to give input

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Mannheim, Germany  |  lose things."    Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature |  How to make an American Quilt | Fax: *49 621 72739835

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  8:04                   ` Linus Torvalds
@ 2006-12-19  9:00                     ` Peter Zijlstra
  2006-12-19  9:05                       ` Peter Zijlstra
       [not found]                     ` <4587B762.2030603@yahoo.com.au>
  1 sibling, 1 reply; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-19  9:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Tue, 2006-12-19 at 00:04 -0800, Linus Torvalds wrote:

> Nobody has actually ever explained why "test_clear_page_dirty()" is good 
> at all.
> 
>  - Why is it ever used instead of "clear_page_dirty_for_io()"?
> 
>  - What is the difference?
> 
>  - Why would you EVER want to clear bits just in the "struct page *" or 
>    just in the PTE's?
> 
>  - Why is it EVER correct to clear dirty bits except JUST BEFORE THE IO?
> 
> In other words, I have a theory:
> 
>  "A lot of this is actually historical cruft. Some of it may even be code 
>   that was never supposed to work, but because we maintained _other_ dirty 
>   bits in the PTE's, and never touched them before, we never even realized 
>   that the code that played with PG_dirty was totally insane"
> 
> Now, that's just a theory. And yeah, it may be stated a bit provocatively. 
> It may not be entirely correct. I'm just saying.. maybe it is?

On Sun, 2006-12-17 at 15:40 -0800, Andrew Morton wrote:

> try_to_free_buffers() clears the page's dirty state if it successfully removed
> the page's buffers.
> 
>   Background for this:
> 
>   - a process does a one-byte-write to a file on a 64k pagesize, 4k
>     blocksize ext3 filesystem.  The page is now PageDirty, !PageUptodate and
>     has one dirty buffer and 15 not uptodate buffers.
> 
>   - kjournald writes the dirty buffer.  The page is now PageDirty,
>     !PageUptodate and has a mix of clean and not uptodate buffers.
> 
>   - try_to_free_buffers() removes the page's buffers.  It MUST now clear
>     PageDirty.  If we were to leave the page dirty then we'd have a dirty, not
>     uptodate page with no buffer_heads.
> 
>     We're screwed: we cannot write the page because we don't know which
>     sections of it contain garbage.  We cannot read the page because we don't
>     know which sections of it contain modified data.  We cannot free the page
>     because it is dirty.

However!! this is not true for mapped pages because mapped pages must
have the whole (16k in akpm's example) page loaded. Hence I suspect that
what Andrei did by accident - remove the if (mapping) case in
test_clear_page_dirty() - is actually totally correct.
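
A toy userspace model of the quoted background may help here. It follows
akpm's 64k pagesize, 4k blocksize example, so 16 buffers per page. The
toy_* names and the structs are invented for illustration and are not the
kernel's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum {
	TOY_PAGE_SIZE  = 64 * 1024,
	TOY_BLOCK_SIZE = 4 * 1024,
	TOY_NR_BUFFERS = TOY_PAGE_SIZE / TOY_BLOCK_SIZE,	/* 16 buffers */
};

struct toy_buffer {
	bool uptodate;
	bool dirty;
};

struct toy_bpage {
	bool dirty;		/* models PageDirty */
	bool uptodate;		/* models PageUptodate */
	bool has_buffers;
	struct toy_buffer bh[TOY_NR_BUFFERS];
};

/* Step 1: a one-byte write dirties exactly one 4k buffer and the page. */
void toy_write_byte(struct toy_bpage *p, size_t offset)
{
	struct toy_buffer *bh;

	assert(offset < TOY_PAGE_SIZE);
	bh = &p->bh[offset / TOY_BLOCK_SIZE];
	p->has_buffers = true;
	bh->uptodate = true;
	bh->dirty = true;
	p->dirty = true;
}

/* Step 2: the journal thread writes the buffers by hand; PageDirty stays set. */
void toy_journal_writeout(struct toy_bpage *p)
{
	for (size_t i = 0; i < TOY_NR_BUFFERS; i++)
		p->bh[i].dirty = false;
}

/* Step 3: drop the buffers, optionally cleaning the page as well. */
void toy_free_buffers(struct toy_bpage *p, bool clear_page_dirty)
{
	for (size_t i = 0; i < TOY_NR_BUFFERS; i++)
		p->bh[i] = (struct toy_buffer){ false, false };
	p->has_buffers = false;
	if (clear_page_dirty)
		p->dirty = false;
}

size_t toy_count_not_uptodate(const struct toy_bpage *p)
{
	size_t n = 0;

	for (size_t i = 0; i < TOY_NR_BUFFERS; i++)
		if (!p->bh[i].uptodate)
			n++;
	return n;
}

/* The unresolvable state: dirty, not uptodate, and no per-block info left. */
bool toy_page_is_stuck(const struct toy_bpage *p)
{
	return p->dirty && !p->uptodate && !p->has_buffers;
}
```

A mapped page would have uptodate set for the whole page, so in this model
toy_page_is_stuck() can never be true for it, whether or not the dirty bit
is cleared when the buffers go away.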




^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  9:00                     ` Peter Zijlstra
@ 2006-12-19  9:05                       ` Peter Zijlstra
  0 siblings, 0 replies; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-19  9:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Tue, 2006-12-19 at 10:00 +0100, Peter Zijlstra wrote:
> On Tue, 2006-12-19 at 00:04 -0800, Linus Torvalds wrote:
> 
> > Nobody has actually ever explained why "test_clear_page_dirty()" is good 
> > at all.
> > 
> >  - Why is it ever used instead of "clear_page_dirty_for_io()"?
> > 
> >  - What is the difference?
> > 
> >  - Why would you EVER want to clear bits just in the "struct page *" or 
> >    just in the PTE's?
> > 
> >  - Why is it EVER correct to clear dirty bits except JUST BEFORE THE IO?
> > 
> > In other words, I have a theory:
> > 
> >  "A lot of this is actually historical cruft. Some of it may even be code 
> >   that was never supposed to work, but because we maintained _other_ dirty 
> >   bits in the PTE's, and never touched them before, we never even realized 
> >   that the code that played with PG_dirty was totally insane"
> > 
> > Now, that's just a theory. And yeah, it may be stated a bit provocatively. 
> > It may not be entirely correct. I'm just saying.. maybe it is?
> 
> On Sun, 2006-12-17 at 15:40 -0800, Andrew Morton wrote:
> 
> > try_to_free_buffers() clears the page's dirty state if it successfully removed
> > the page's buffers.
> > 
> >   Background for this:
> > 
> >   - a process does a one-byte-write to a file on a 64k pagesize, 4k
> >     blocksize ext3 filesystem.  The page is now PageDirty, !PageUptodate and
> >     has one dirty buffer and 15 not uptodate buffers.
> > 
> >   - kjournald writes the dirty buffer.  The page is now PageDirty,
> >     !PageUptodate and has a mix of clean and not uptodate buffers.
> > 
> >   - try_to_free_buffers() removes the page's buffers.  It MUST now clear
> >     PageDirty.  If we were to leave the page dirty then we'd have a dirty, not
> >     uptodate page with no buffer_heads.
> > 
> >     We're screwed: we cannot write the page because we don't know which
> >     sections of it contain garbage.  We cannot read the page because we don't
> >     know which sections of it contain modified data.  We cannot free the page
> >     because it is dirty.
> 
> However!! this is not true for mapped pages because mapped pages must
> have the whole (16k in akpm's example) page loaded. Hence I suspect that
> what Andrei did by accident - remove the if (mapping) case in
> test_clear_page_dirty() - is actually totally correct.

Obviously I need my morning shot, 64k of course.


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  8:24                                             ` Andrew Morton
  2006-12-19  8:34                                               ` Pekka Enberg
@ 2006-12-19  9:13                                               ` Marc Haber
  1 sibling, 0 replies; 311+ messages in thread
From: Marc Haber @ 2006-12-19  9:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: andrei.popa, Linus Torvalds, Peter Zijlstra,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Martin Michlmayr

On Tue, Dec 19, 2006 at 12:24:16AM -0800, Andrew Morton wrote:
> Wow.  I didn't expect that, because Marc Haber reported that ext3's data=writeback
> fixed it.   Maybe he didn't run it for long enough?

My test case is Debian's "aptitude update" running once an hour, and
it was always the same file getting corrupted. With 2.6.19, I had this
corruption like every third hour (but -only- if run from cron, running
from a shell was always fine), data=writeback made the issue disappear
for about two days before I booted into 2.6.19.1 without
data=writeback (defaults chosen then), after which the issue only
shows up like every other day.

So, I feel out of the loop, since rtorrent seems much better at
reproducing this.

I notice, though, that both aptitude and rtorrent do downloads from
the net, so there might be a relation to tcp/ip and/or the network
driver. My box has a Linksys NC100 network card running with the tulip
driver.

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Mannheim, Germany  |  lose things."    Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature |  How to make an American Quilt | Fax: *49 621 72739835

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  8:51                 ` Marc Haber
@ 2006-12-19  9:28                   ` Martin Michlmayr
  2006-12-28 18:05                   ` Marc Haber
  1 sibling, 0 replies; 311+ messages in thread
From: Martin Michlmayr @ 2006-12-19  9:28 UTC (permalink / raw)
  To: Marc Haber
  Cc: Andrew Morton, Nick Piggin, Linus Torvalds, andrei.popa,
	Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins,
	Florian Weimer

* Marc Haber <mh+linux-kernel@zugschlus.de> [2006-12-19 09:51]:
> I do not have a clue about memory management at all, but is it
> possible that you're testing on a box with too much memory? My box has
> only 256 MB, and I used to use mutt with a _huge_ inbox with mutt
> taking about 150 MB. Add spamassassin and a reasonably busy mail
> server, and the box used to be like 150 MB in swap.

FWIW, the ARM box I see this on has only 32 MB memory (and a 133 or
266 MHz CPU).  I don't see it on another ARM box (different ARM
sub-arch) with 128 MB memory and a 600 MHz CPU.
-- 
Martin Michlmayr
http://www.cyrius.com/

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  8:14                 ` Linus Torvalds
@ 2006-12-19  9:40                   ` Nick Piggin
  2006-12-19 16:46                     ` Linus Torvalds
  0 siblings, 1 reply; 311+ messages in thread
From: Nick Piggin @ 2006-12-19  9:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

Linus Torvalds wrote:
> 
> On Tue, 19 Dec 2006, Nick Piggin wrote:
> 
>>>Anyway it has the same issues as the others. See what happens when you
>>>run two test_clear_page_dirty_sync_ptes() consecutively, you still lose
>>>PG_dirty even though the page might actually be dirty.
>>
>>How can this happen? We'll only test_clear_page_dirty_sync_ptes again
>>after buffers have been reattached, and subsequently cleaned. And in
>>that case if the ptes are still clean at this point then the page really
>>is clean.
> 
> 
> Why do you talk about buffers being reattached? Are you still in some 
> world where "try_to_free_buffers()" matters? Have you not followed the 

I'm talking about fixing just the race Andrew noticed via inspection. No
it doesn't appear to fix Andrei's problem, unfortunately. But it needs
to be fixed all the same, doesn't it?

> discussion? Why do you ignore my MUCH SIMPLER patch that just removed all 
> this crap ENTIRELY from "try_to_free_buffers()", and the exact same 
> corruption happened?
> 
> Forget about "try_to_free_buffers()". Please apply this patch to your tree 
> first. That gets rid of _one_ copy of totally insane code that did all the 
> wrong things.
> 
> Only after you have applied this patch should you look at the code again. 
> Realizing that the corruption still happens.
> 
> So forget about buffers already. That piece of code was crap.

Now I'm not exactly sure how ext3 (or any other) filesystems make use
of this particular feature of try_to_free_buffers(), but it is clear
from the comments what it is for. So your patch isn't really a minimal
fix (ie. it would require an OK from all filesystems, wouldn't it?)

Or did I miss a mail where you reasoned that it is safe to make this
change (/me goes to reread the thread)...

> 
> 		Linus
> 
> ---
> diff --git a/fs/buffer.c b/fs/buffer.c
> index d1f1b54..263f88e 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
>  	int ret = 0;
>  
>  	BUG_ON(!PageLocked(page));
> -	if (PageWriteback(page))
> +	if (PageDirty(page) || PageWriteback(page))
>  		return 0;
>  
>  	if (mapping == NULL) {		/* can this still happen? */
> @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
>  	spin_lock(&mapping->private_lock);
>  	ret = drop_buffers(page, &buffers_to_free);
>  	spin_unlock(&mapping->private_lock);
> -	if (ret) {
> -		/*
> -		 * If the filesystem writes its buffers by hand (eg ext3)
> -		 * then we can have clean buffers against a dirty page.  We
> -		 * clean the page here; otherwise later reattachment of buffers
> -		 * could encounter a non-uptodate page, which is unresolvable.
> -		 * This only applies in the rare case where try_to_free_buffers
> -		 * succeeds but the page is not freed.
> -		 *
> -		 * Also, during truncate, discard_buffer will have marked all
> -		 * the page's buffers clean.  We discover that here and clean
> -		 * the page also.
> -		 */
> -		if (test_clear_page_dirty(page))
> -			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
> -	}
>  out:
>  	if (buffers_to_free) {
>  		struct buffer_head *bh = buffers_to_free;
> 


-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
       [not found]                     ` <4587B762.2030603@yahoo.com.au>
@ 2006-12-19 10:32                       ` Andrew Morton
  2006-12-19 10:42                         ` Nick Piggin
                                           ` (3 more replies)
  2006-12-19 16:51                       ` Linus Torvalds
  1 sibling, 4 replies; 311+ messages in thread
From: Andrew Morton @ 2006-12-19 10:32 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Peter Zijlstra, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Tue, 19 Dec 2006 20:56:50 +1100
Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Linus Torvalds wrote:
> 
> > NOTICE? First you make a BIG DEAL about how dirty bits should never get 
> > lost, but THE VERY SAME FUNCTION actually very much on purpose DOES drop 
> > the dirty bit for when it's not in the page tables.
> 
> try_to_free_buffers is quite a special case, where we're transferring
> the page dirty metadata from the buffers to the page. I think Andrew
> would have a better grasp of it so he could correct me, but what it
> does is legitimate.

Well it used to be.  After 2.6.19 it can do the wrong thing for mapped
pages.  But it turns out that we don't feed it mapped pages, apart from
pagevec_strip() and possibly races against pagefaults.

> I think it could be very likely that indeed the bug is a latent one in
> a clear_page_dirty caller, rather than dirty-tracking itself.

The only callers are try_to_free_buffers(), truncate and a few scruffy
possibly-wrong-for-fsync filesystems which aren't being used here.


<spots a race in do_no_page()>

If a write-fault races with a read-fault and the write-fault loses, we forget
to mark the page dirty.

Something like this, but it's probably wrong - I didn't try very hard (am
feeling ill, and vaguely grumpy)


From: Andrew Morton <akpm@osdl.org>

Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 mm/memory.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

diff -puN mm/memory.c~a mm/memory.c
--- a/mm/memory.c~a
+++ a/mm/memory.c
@@ -2264,10 +2264,22 @@ retry:
 		}
 	} else {
 		/* One of our sibling threads was faster, back out. */
+		if (write_access) {
+			/*
+			 * We might have raced against a read-fault.  We still
+			 * need to dirty the page.
+			 */
+			dirty_page = vm_normal_page(vma, address, *page_table);
+			if (dirty_page) {
+				get_page(dirty_page);
+				goto dirty_it;
+			}
+		}
 		page_cache_release(new_page);
 		goto unlock;
 	}
 
+dirty_it:
 	/* no need to invalidate: a not-present page shouldn't be cached */
 	update_mmu_cache(vma, address, entry);
 	lazy_mmu_prot_update(entry);
_


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 10:32                       ` Andrew Morton
@ 2006-12-19 10:42                         ` Nick Piggin
  2006-12-19 10:47                         ` Andrew Morton
                                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 311+ messages in thread
From: Nick Piggin @ 2006-12-19 10:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

Andrew Morton wrote:
> On Tue, 19 Dec 2006 20:56:50 +1100
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> 
>>Linus Torvalds wrote:
>>
>>
>>>NOTICE? First you make a BIG DEAL about how dirty bits should never get 
>>>lost, but THE VERY SAME FUNCTION actually very much on purpose DOES drop 
>>>the dirty bit for when it's not in the page tables.
>>
>>try_to_free_buffers is quite a special case, where we're transferring
>>the page dirty metadata from the buffers to the page. I think Andrew
>>would have a better grasp of it so he could correct me, but what it
>>does is legitimate.
> 
> 
> Well it used to be.  After 2.6.19 it can do the wrong thing for mapped
> pages.

Yes, that is what I was trying to get at.

>  But it turns out that we don't feed it mapped pages, apart from
> pagevec_strip() and possibly races against pagefaults.

True, and I think we have pretty well established that this isn't the
cause of Andrei's problem, but I think we all agree it is *a* bug?

And surely Andrei's data corruption will be of the same flavour in
that test_clear_page_dirty somewhere is now stripping pte dirty bits
where it shouldn't? (because it went away after Peter nooped that
behaviour)

>>I think it could be very likely that indeed the bug is a latent one in
>>a clear_page_dirty caller, rather than dirty-tracking itself.
> 
> 
> The only callers are try_to_free_buffers(), truncate and a few scruffy
> possibly-wrong-for-fsync filesystems which aren't being used here.
> 
> 
> <spots a race in do_no_page()>
> 
> If a write-fault races with a read-fault and the write-fault loses, we forget
> to mark the page dirty.

Hmm.. in that case will the pte still be readonly, and thus the write
faulter will have to try again I think?

> 
> Something like this, but it's probably wrong - I didn't try very hard (am
> feeling ill, and vaguely grumpy)
> 
> 
> From: Andrew Morton <akpm@osdl.org>
> 
> Signed-off-by: Andrew Morton <akpm@osdl.org>
> ---
> 
>  mm/memory.c |   12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff -puN mm/memory.c~a mm/memory.c
> --- a/mm/memory.c~a
> +++ a/mm/memory.c
> @@ -2264,10 +2264,22 @@ retry:
>  		}
>  	} else {
>  		/* One of our sibling threads was faster, back out. */
> +		if (write_access) {
> +			/*
> +			 * We might have raced against a read-fault.  We still
> +			 * need to dirty the page.
> +			 */
> +			dirty_page = vm_normal_page(vma, address, *page_table);
> +			if (dirty_page) {
> +				get_page(dirty_page);
> +				goto dirty_it;
> +			}
> +		}
>  		page_cache_release(new_page);
>  		goto unlock;
>  	}
>  
> +dirty_it:
>  	/* no need to invalidate: a not-present page shouldn't be cached */
>  	update_mmu_cache(vma, address, entry);
>  	lazy_mmu_prot_update(entry);
> _
> 
> 


-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 10:32                       ` Andrew Morton
  2006-12-19 10:42                         ` Nick Piggin
@ 2006-12-19 10:47                         ` Andrew Morton
  2006-12-19 10:52                         ` Peter Zijlstra
  2006-12-19 10:55                         ` Nick Piggin
  3 siblings, 0 replies; 311+ messages in thread
From: Andrew Morton @ 2006-12-19 10:47 UTC (permalink / raw)
  To: Nick Piggin, Linus Torvalds, Peter Zijlstra, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Tue, 19 Dec 2006 02:32:55 -0800
Andrew Morton <akpm@osdl.org> wrote:

> <spots a race in do_no_page()>
> 
> If a write-fault races with a read-fault and the write-fault loses, we forget
> to mark the page dirty.

No that isn't right, is it.  The writer just retakes the fault and
all the right things happen.  Ho hum.
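
A toy userspace model of why the retaken fault is enough. It assumes that
the read fault installs a present but read-only pte, so the losing writer's
retried store takes a write-protect fault that dirties the page. The toy_*
names are hypothetical and this is a sketch of the argument, not of
do_no_page() or do_wp_page() themselves.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_pte {
	bool present;
	bool writable;
	bool dirty;
};

struct toy_state {
	struct toy_pte pte;
	bool page_dirty;	/* models the page-level dirty accounting */
};

enum toy_fault {
	TOY_NO_FAULT,
	TOY_FAULT_NOT_PRESENT,
	TOY_FAULT_WRITE_PROTECT,
};

/* What the hardware would do for a single access attempt. */
enum toy_fault toy_access(struct toy_state *s, bool write)
{
	if (!s->pte.present)
		return TOY_FAULT_NOT_PRESENT;
	if (write && !s->pte.writable)
		return TOY_FAULT_WRITE_PROTECT;
	if (write)
		s->pte.dirty = true;
	return TOY_NO_FAULT;
}

/* Not-present handler: the loser of a race backs out without changes. */
void toy_fault_not_present(struct toy_state *s, bool write)
{
	if (s->pte.present)
		return;		/* a sibling thread was faster, back out */
	s->pte.present = true;
	s->pte.writable = write;
	if (write)
		s->page_dirty = true;
}

/* Write-protect handler: make the pte writable and dirty the page. */
void toy_fault_write_protect(struct toy_state *s)
{
	assert(s->pte.present);
	s->pte.writable = true;
	s->page_dirty = true;
}

/* Retry a store until it succeeds; returns the number of faults taken. */
int toy_write_with_retries(struct toy_state *s)
{
	int faults = 0;

	for (;;) {
		switch (toy_access(s, true)) {
		case TOY_NO_FAULT:
			return faults;
		case TOY_FAULT_NOT_PRESENT:
			toy_fault_not_present(s, true);
			break;
		case TOY_FAULT_WRITE_PROTECT:
			toy_fault_write_protect(s);
			break;
		}
		faults++;
	}
}
```

In the model, a writer that lost the not-present race leaves the page clean
at first, but its retried store takes one more fault and the page ends up
dirty, so no explicit dirtying is needed in the back-out path.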

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 10:32                       ` Andrew Morton
  2006-12-19 10:42                         ` Nick Piggin
  2006-12-19 10:47                         ` Andrew Morton
@ 2006-12-19 10:52                         ` Peter Zijlstra
  2006-12-19 10:58                           ` Nick Piggin
  2006-12-19 10:55                         ` Nick Piggin
  3 siblings, 1 reply; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-19 10:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Linus Torvalds, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Tue, 2006-12-19 at 02:32 -0800, Andrew Morton wrote:
> On Tue, 19 Dec 2006 20:56:50 +1100
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> > Linus Torvalds wrote:
> > 
> > > NOTICE? First you make a BIG DEAL about how dirty bits should never get 
> > > lost, but THE VERY SAME FUNCTION actually very much on purpose DOES drop 
> > > the dirty bit for when it's not in the page tables.
> > 
> > try_to_free_buffers is quite a special case, where we're transferring
> > the page dirty metadata from the buffers to the page. I think Andrew
> > would have a better grasp of it so he could correct me, but what it
> > does is legitimate.
> 
> Well it used to be.  After 2.6.19 it can do the wrong thing for mapped
> pages.  But it turns out that we don't feed it mapped pages, apart from
> pagevec_strip() and possibly races against pagefaults.

So how about this:

Index: linux-2.6-git/mm/page-writeback.c
===================================================================
--- linux-2.6-git.orig/mm/page-writeback.c	2006-12-19 08:24:48.000000000 +0100
+++ linux-2.6-git/mm/page-writeback.c	2006-12-19 11:43:31.000000000 +0100
@@ -859,6 +859,9 @@ int test_clear_page_dirty(struct page *p
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
 
+	if (page_mapped(page))
+		return 0;
+
 	if (!mapping)
 		return TestClearPageDirty(page);
 



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 10:32                       ` Andrew Morton
                                           ` (2 preceding siblings ...)
  2006-12-19 10:52                         ` Peter Zijlstra
@ 2006-12-19 10:55                         ` Nick Piggin
  3 siblings, 0 replies; 311+ messages in thread
From: Nick Piggin @ 2006-12-19 10:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

Andrew Morton wrote:
> On Tue, 19 Dec 2006 20:56:50 +1100
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:

>>I think it could be very likely that indeed the bug is a latent one in
>>a clear_page_dirty caller, rather than dirty-tracking itself.
> 
> 
> The only callers are try_to_free_buffers(), truncate and a few scruffy
> possibly-wrong-for-fsync filesystems which aren't being used here.

Well truncate/invalidate will not operate on mapped pages (barring the
very-unlikely truncate/invalidate vs fault races). We can ignore those
filesystems as they don't include ext3. Which brings us back to
try_to_free_buffers().

Maybe it is something else entirely, but did try_to_free_buffers ever
get completely cleared? Or was some of Andrei's corruption possibly
leftover on-disk corruption from a previous kernel?

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 10:52                         ` Peter Zijlstra
@ 2006-12-19 10:58                           ` Nick Piggin
  2006-12-19 11:51                             ` Peter Zijlstra
  0 siblings, 1 reply; 311+ messages in thread
From: Nick Piggin @ 2006-12-19 10:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Linus Torvalds, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

Peter Zijlstra wrote:
> On Tue, 2006-12-19 at 02:32 -0800, Andrew Morton wrote:

>>Well it used to be.  After 2.6.19 it can do the wrong thing for mapped
>>pages.  But it turns out that we don't feed it mapped pages, apart from
>>pagevec_strip() and possibly races against pagefaults.
> 
> 
> So how about this:

Well that's still racy. Anyway several earlier patches (including
the one I posted) closed this race. Some were still reported to
trigger corruption IIRC.

> Index: linux-2.6-git/mm/page-writeback.c
> ===================================================================
> --- linux-2.6-git.orig/mm/page-writeback.c	2006-12-19 08:24:48.000000000 +0100
> +++ linux-2.6-git/mm/page-writeback.c	2006-12-19 11:43:31.000000000 +0100
> @@ -859,6 +859,9 @@ int test_clear_page_dirty(struct page *p
>  	struct address_space *mapping = page_mapping(page);
>  	unsigned long flags;
>  
> +	if (page_mapped(page))
> +		return 0;
> +
>  	if (!mapping)
>  		return TestClearPageDirty(page);
>  
> 
> 
> -

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 10:58                           ` Nick Piggin
@ 2006-12-19 11:51                             ` Peter Zijlstra
  0 siblings, 0 replies; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-19 11:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linus Torvalds, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Tue, 2006-12-19 at 21:58 +1100, Nick Piggin wrote:
> Peter Zijlstra wrote:
> > On Tue, 2006-12-19 at 02:32 -0800, Andrew Morton wrote:
> 
> >>Well it used to be.  After 2.6.19 it can do the wrong thing for mapped
> >>pages.  But it turns out that we don't feed it mapped pages, apart from
> >>pagevec_strip() and possibly races against pagefaults.
> > 
> > 
> > So how about this:
> 
> Well that's still racy. Anyway several earlier patches (including
> the one I posted) closed this race. Some were still reported to
> trigger corruption IIRC.

I can't remember a patch that removes mapped pages from this code path,
however I could have missed it. Removing the mapping branch in ttfb()
outright did also fix the problem - which is a superset of the
page_mapped() case.

I'm now building a kernel with this patch, and will submit that to
rtorrent with mem=256M on a 1k ext3 filesystem on x86_64 smp preempt.

---
 fs/buffer.c |   32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2798,11 +2798,38 @@ static inline int buffer_busy(struct buf
 		(bh->b_state & ((1 << BH_Dirty) | (1 << BH_Lock)));
 }
 
+/*
+ * AKPM sayeth:
+ *
+ * - a process does a one-byte-write to a file on a 64k pagesize, 4k
+ *   blocksize ext3 filesystem.  The page is now PageDirty, !PageUptodate and
+ *   has one dirty buffer and 15 not uptodate buffers.
+ *
+ * - kjournald writes the dirty buffer.  The page is now PageDirty,
+ *   !PageUptodate and has a mix of clean and not uptodate buffers.
+ *
+ * - try_to_free_buffers() removes the page's buffers.  It MUST now clear
+ *   PageDirty.  If we were to leave the page dirty then we'd have a dirty, not
+ *   uptodate page with no buffer_heads.
+ *
+ *   We're screwed: we cannot write the page because we don't know which
+ *   sections of it contain garbage.  We cannot read the page because we don't
+ *   know which sections of it contain modified data.  We cannot free the page
+ *   because it is dirty.
+ *
+ * However for mapped pages this is not true; mapped pages will be fully
+ * loaded and thus cannot have not uptodate buffers.
+ *
+ * Hence allow the PG_dirty bit to stay for pages that had no not uptodate
+ * buffers (and assert that mapped pages never have those).
+ */
+
 static int
 drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
 {
 	struct buffer_head *head = page_buffers(page);
 	struct buffer_head *bh;
+	int uptodate = 1;
 
 	bh = head;
 	do {
@@ -2818,11 +2845,14 @@ drop_buffers(struct page *page, struct b
 
 		if (!list_empty(&bh->b_assoc_buffers))
 			__remove_assoc_queue(bh);
+		if (!buffer_uptodate(bh))
+			uptodate = 0;
 		bh = next;
 	} while (bh != head);
 	*buffers_to_free = head;
 	__clear_page_buffers(page);
-	return 1;
+	VM_BUG_ON(page_mapped(page) && !uptodate);
+	return !uptodate;
 failed:
 	return 0;
 }



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  9:40                   ` Nick Piggin
@ 2006-12-19 16:46                     ` Linus Torvalds
  0 siblings, 0 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-19 16:46 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr



On Tue, 19 Dec 2006, Nick Piggin wrote:
> 
> Now I'm not exactly sure how ext3 (or any other) filesystems make use
> of this particular feature of try_to_free_buffers(), but it is clear
> from the comments what it is for. So your patch isn't really a minimal
> fix (ie. it would require an OK from all filesystems, wouldn't it?)
> 
> Or did I miss a mail where you reasoned that it is safe to make this
> change (/me goes to reread the thread)...

I'm saying it had _better_ be safe, and no, low-level filesystems don't 
actually matter.

The page has to be cleanable _some_ way. So if we test for "page_dirty()" 
at the top, and just refuse to do it in try_to_free_pages(), we still know 
that the _proper_ page cleaning had better clean it. Because ttfp() is 
never going to clean the page in the general case _anyway_.

So I'm really saying:

 - the page WILL be cleaned by the real page cleaning action (ie memory 
   pressure or sync or something else causing us to go through the 
   bog-standard page-based writeout).

   Does anybody dispute this?

 - the "ttfp()" hack was a HACK. It was an ugly and nasty hack even when 
   it was first introduced. It gets doubly worse now that we know we have 
   something wrong with page cleaning, and it has distracted from the real 
   problem.

 - I removed that ugly and disgusting hack entirely at first, but Andrew
   points out that he really wants to keep the buffers there, because the 
   buffers being clean actually say something. That, together with the 
   fact that as long as the page is dirty, the buffers really do end up 
   having a job to do, made me add a much smaller hack to replace the big
   ugly one ("don't even try, if the page is marked dirty").

 - so with that thing in place, there isn't even any change in behaviour 
   wrt the buffers and low-level filesystems. It's just that we make them 
   a bit harder to get rid of. But arguably that shouldn't actually ever 
   really _happen_ anyway (because I think it's a BUG if the page is 
   marked dirty but none of the buffers are), so I think that part is a 
   non-issue.

In other words, ttfp() _never_ had anything to do with "page cleaning". 
Not originally, not with the horrible hack, and not with my patch. 

Trying to mix it in just caused a bug that _everybody_ agrees is a bug. 
It's not the bug we're chasing, but we've got three different patches to 
fix it (Andrew's, mine and yours), and mine is the simplest one by far 
especially in the long run, because it just REMOVES the ugly dependency.

And yes, I probably care more about "in the long run" than most. To me, a 
bug is a bug even if it's _just_ a maintenance headache. Andrew's patch
made things _worse_ ("magic insane flag"), and while yours didn't make the 
code worse, it still introduced the notion of a totally insane "clean the 
page but if the PTE's are dirty, do something else" notion.

IF THE PAGE TRULY IS CLEAN (and both you and Andrew claim it is, if all 
buffers are clean - since you mark it clean in the non-mapped case) THEN 
YOU SHOULD BE ABLE TO CLEAN THE PAGE TABLE BITS TOO.

And by claiming that the page table bits are different from PG_dirty, 
you're just making the issues worse. They shouldn't be. That's what the 
whole point of Peter's patch was: PG_dirty fundamentally _means_ that the
page tables might be dirty too. That was the whole _point_ of doing all 
this in 2.6.19 in the first place.

So if you cannot accept that page table bits should be on "equal footing" 
with PG_dirty, then you should just say "Let's remove Peter's patch 
entirely".

		Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
       [not found]                     ` <4587B762.2030603@yahoo.com.au>
  2006-12-19 10:32                       ` Andrew Morton
@ 2006-12-19 16:51                       ` Linus Torvalds
  2006-12-19 17:43                         ` Linus Torvalds
  1 sibling, 1 reply; 311+ messages in thread
From: Linus Torvalds @ 2006-12-19 16:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr



On Tue, 19 Dec 2006, Nick Piggin wrote:
> 
> Counterexample? Well AFAIKS, the clearing of PG_dirty in ttfb() in
> response to finding all buffers clean is perfectly valid. What makes
> you think otherwise?

If the page really is clean, then why the heck can't we just clean the
page table bits too?

Either it's clean or it isn't. If all the buffers being clean means that 
the page is clean, then it's clean. WE SHOULD NOT THINK THAT PTE'S ARE ANY 
DIFFERENT.

I really don't see your point. Is it clean? If it is, then clear the damn 
dirty bits from the page tables too. Don't go pussyfooting around the 
issue and confuse yourself and everybody but me by saying "but if it's 
dirty in the page tables, it's magically dirty". NO.

It really is that simple. Is it clean or not?

If it's clean, you can remove ALL the dirty bits. Not just some.

			Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 16:51                       ` Linus Torvalds
@ 2006-12-19 17:43                         ` Linus Torvalds
  2006-12-19 18:59                           ` Linus Torvalds
                                             ` (2 more replies)
  0 siblings, 3 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-19 17:43 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4156 bytes --]



Btw,
 here's a totally new tangent on this: it's possible that user code is 
simply BUGGY. 

There is one case where the kernel actually forcibly writes zeroes into a 
file: when we're writing a page that straddles the "inode->i_size" 
boundary. See the various writepages in fs/buffer.c, they all contain 
variations on that theme (although most of them aren't as well commented 
as this snippet):

        /*
         * The page straddles i_size.  It must be zeroed out on each and every
         * writepage invocation because it may be mmapped.  "A file is mapped
         * in multiples of the page size.  For a file that is not a multiple of
         * the  page size, the remaining memory is zeroed when mapped, and
         * writes to that region are not written out to the file."
         */
        kaddr = kmap_atomic(page, KM_USER0);
        memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
        flush_dcache_page(page);
        kunmap_atomic(kaddr, KM_USER0);

Now, this should _matter_ only for user processes that are buggy, and that 
have written to the page _before_ extending it with ftruncate(). That's 
definitely a serious bug, but it's one that can go totally undetected
depending on when the actual write-out happens.

So what I'm saying is that if we end up writing things earlier thanks to 
the more aggressive dirty-page-management thing in 2.6.19, we might 
actually just expose a long-time userspace bug that was just a LOT harder 
to trigger before..

I'm not saying this is the cause of all this, but we've been tearing our 
hair out, and it might be worthwhile trying this really really really
stupid patch that will notice when that happens at truncate() time, and 
tell the user that he's a total idiot. Or something to that effect.

Maybe the reason this is so easy to trigger with rtorrent is not because 
rtorrent does some magic pattern that triggers a kernel bug, but simply 
because rtorrent itself might have a bug.

Ok, so it's a long shot, but it's still worth testing, I suspect. The 
patch is very simple: whenever we do an _expanding_ truncate, we check the 
last page of the _old_ size, and if there were non-zero contents past the 
old size, we complain.

Attached is a test-program that _should_ trigger a
kernel message like

	a.out: BADNESS: truncate check 17000

for good measure, just so that you can verify that the patch works and 
actually catches this case.

(The 17000 number is just the one-hundred _invalid_ 0xaa bytes - out of 
the 200 we wrote - that were summed up: 100*0xaa == 17000. Anything 
non-zero is always a bug).

I doubt this is really it, but it's worth trying. If you fill out a page, 
and only do "ftruncate()" in response to SIGBUS messages (and don't 
truncate to whole pages), you could potentially see zeroes at the end of 
the page exactly because _writeout_ cleared the page for you! So it 
_could_ explain the symptoms, but only if user-space was horribly horribly 
broken.

		Linus

----
diff --git a/mm/memory.c b/mm/memory.c
index c00bac6..79cecab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
+static void check_last_page(struct address_space *mapping, loff_t size)
+{
+	pgoff_t index;
+	unsigned int offset;
+	struct page *page;
+
+	if (!mapping)
+		return;
+	offset = size & ~PAGE_MASK;
+	if (!offset)
+		return;
+	index = size >> PAGE_SHIFT;
+	page = find_lock_page(mapping, index);
+	if (page) {
+		unsigned int check = 0;
+		unsigned char *kaddr = kmap_atomic(page, KM_USER0);
+		do {
+			check += kaddr[offset++];
+		} while (offset < PAGE_SIZE);
+		kunmap_atomic(kaddr,KM_USER0);
+		unlock_page(page);
+		page_cache_release(page);
+		if (check)
+			printk("%s: BADNESS: truncate check %u\n", current->comm, check);
+	}
+}
+
 /**
  * vmtruncate - unmap mappings "freed" by truncate() syscall
  * @inode: inode of the file used
@@ -1875,6 +1902,7 @@ do_expand:
 		goto out_sig;
 	if (offset > inode->i_sb->s_maxbytes)
 		goto out_big;
+	check_last_page(mapping, inode->i_size);
 	i_size_write(inode, offset);
 
 out_truncate:

[-- Attachment #2: Type: TEXT/PLAIN, Size: 566 bytes --]

#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <string.h>

int main(int argc, char **argv)
{
	char *mapping;
	int fd;

	fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, 10) < 0)
		return -1;
	mapping = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (-1 == (int)(long)mapping)
		return -1;
	memset(mapping, 0x55, 10);
	if (ftruncate(fd, 100) < 0)
		return -1;
	memset(mapping, 0xaa, 200);
	if (ftruncate(fd, 200) < 0)
		return -1;
	return 0;
}

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 17:43                         ` Linus Torvalds
@ 2006-12-19 18:59                           ` Linus Torvalds
  2006-12-19 21:30                             ` Peter Zijlstra
  2006-12-20  5:56                             ` Jari Sundell
  2006-12-19 21:56                           ` Florian Weimer
  2006-12-21 13:03                           ` Peter Zijlstra
  2 siblings, 2 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-19 18:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr



On Tue, 19 Dec 2006, Linus Torvalds wrote:
>
>  here's a totally new tangent on this: it's possible that user code is 
> simply BUGGY. 

Btw, here's a simpler test-program that actually shows the difference 
between 2.6.18 and 2.6.19 in action, and why it could explain why a 
program like rtorrent might show corruption behaviour that it didn't show
before.

	#include <sys/mman.h>
	#include <sys/fcntl.h>
	#include <unistd.h>
	#include <string.h>
	
	int main(int argc, char **argv)
	{
		char *mapping;
		int fd;
	
		fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
		if (fd < 0)
			return -1;
		if (ftruncate(fd, 10) < 0)
			return -1;
		mapping = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (-1 == (int)(long)mapping)
			return -1;
		memset(mapping, 0xaa, 20);
		sync();
		if (ftruncate(fd, 40) < 0)
			return -1;
		memset(mapping + 20, 0x55, 20);
		write(1, mapping, 40);
		return 0;
	}

Notice the "sync()" in between the "memset()" and the "ftruncate()". In 
2.6.18, that would normally do absolutely _nothing_ to the shared memory 
mapping, because we simply couldn't track pages that were dirty in the
page tables. 

So in 2.6.18, if you try this, with

	./a.out | od -x

you should see something like

	0000000 aaaa aaaa aaaa aaaa aaaa aaaa aaaa aaaa
	0000020 aaaa aaaa 5555 5555 5555 5555 5555 5555
	0000040 5555 5555 5555 5555
	0000050

which matches your memset() patterns: 20 bytes of 0xaa, and 20 bytes of 
0x55.

HOWEVER. 

In 2.6.19, because we actually track dirty data so much better, "sync()" 
will actually be smart enough to write out the dirty mmap'ed data too. But 
since the user program has only allocated ten bytes for it in the file, 
when it is written out, the rest of the page is cleared. When you then 
write the last 20 bytes (after _properly_ allocating memory for them), you 
should now see a pattern like

	0000000 aaaa aaaa aaaa aaaa aaaa 0000 0000 0000
	0000020 0000 0000 5555 5555 5555 5555 5555 5555
	0000040 5555 5555 5555 5555
	0000050

instead: with ten bytes of zero in between, because the data that couldn't 
be written out was cleared.

So 2.6.19 is strictly _better_, but exactly because it's tracking dirty 
status much more precisely, you'll see certain user-level bugs much more 
easily.

NOTE NOTE NOTE! The code really _was_ buggy in 2.6.18 too, and you _can_ 
get the zeroes in the middle of the file with an older kernel. But in 
older kernels, you need to be really really unlucky, and have the page 
cleaned by strong memory pressure. In 2.6.19, any "sync()" activity 
(including from the outside) will clean the page, so a user program with
this bug can just be made to trigger the bug much more easily.

			Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  6:34             ` Linus Torvalds
  2006-12-19  6:51               ` Nick Piggin
@ 2006-12-19 20:03               ` dean gaudet
  1 sibling, 0 replies; 311+ messages in thread
From: dean gaudet @ 2006-12-19 20:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Mon, 18 Dec 2006, Linus Torvalds wrote:

> On Tue, 19 Dec 2006, Nick Piggin wrote:
> > 
> > We never want to drop dirty data! (ignoring the truncate case, which is
> > handled privately by truncate anyway)
> 
> Bzzt.
> 
> SURE we do.
> 
> We absolutely do want to drop dirty data in the writeout path.
> 
> How do you think dirty data ever _becomes_ clean data?
> 
> In other words, yes, we _do_ want to test-and-clear all the pgtable bits 
> _and_ the PG_dirty bit. We want to do it for:
>  - writeout
>  - truncate
>  - possibly a "drop" event (which could be a case for a journal entry that 
>    becomes stale due to being replaced or something - kind of "truncate" 
>    on metadata)
> 
> because both of those events _literally_ turn dirty state into clean 
> state.
> 
> In no other circumstance do we ever want to clear a dirty bit, as far as I 
> can tell. 

i admit this may not be entirely relevant, but it seems like a good place 
to bring up an old problem:  when a disk dies with lots of queued writes 
it can totally bring a system to its knees... even after the disk is 
removed.  i wrote up something about this a while ago:

http://lkml.org/lkml/2005/8/18/243

so there's another reason to "clear a dirty bit"... well, in fact -- drop 
the pages entirely.

-dean

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 18:59                           ` Linus Torvalds
@ 2006-12-19 21:30                             ` Peter Zijlstra
  2006-12-19 22:51                               ` Linus Torvalds
  2006-12-20 18:02                               ` Stephen Clark
  2006-12-20  5:56                             ` Jari Sundell
  1 sibling, 2 replies; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-19 21:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote:
> 
> On Tue, 19 Dec 2006, Linus Torvalds wrote:
> >
> >  here's a totally new tangent on this: it's possible that user code is 
> > simply BUGGY. 

I'm sad to say this doesn't trigger :-(



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 17:43                         ` Linus Torvalds
  2006-12-19 18:59                           ` Linus Torvalds
@ 2006-12-19 21:56                           ` Florian Weimer
  2006-12-21 13:03                           ` Peter Zijlstra
  2 siblings, 0 replies; 311+ messages in thread
From: Florian Weimer @ 2006-12-19 21:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Marc Haber,
	Martin Michlmayr

* Linus Torvalds:

> Now, this should _matter_ only for user processes that are buggy,
> and that have written to the page _before_ extending it with
> ftruncate().

APT seems to properly extend the file before mapping it, by writing a
zero byte at the desired position (creating a hole).

24986 open("/var/cache/apt/pkgcache.bin", O_RDWR|O_CREAT|O_TRUNC, 0666) = 6

24986 lseek(6, 12582911, SEEK_SET)      = 12582911
24986 write(6, "\0", 1)                 = 1

24986 mmap(NULL, 12582912, PROT_READ|PROT_WRITE, MAP_SHARED, 6, 0) = 0x2b6578636000

24986 msync(0x2b6578636000, 7464112, MS_SYNC) = 0
24986 msync(0x2b6578636000, 8656, MS_SYNC) = 0
24986 munmap(0x2b6578636000, 12582912)  = 0
24986 ftruncate(6, 7464112)             = 0
24986 fstat(6, {st_mode=S_IFREG|0644, st_size=7464112, ...}) = 0
24986 mmap(NULL, 7464112, PROT_READ, MAP_SHARED, 6, 0) = 0x2b6578636000

APT's code is pretty convoluted, though, and there might be some code
path in it that gets it wrong. 8-P

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 21:30                             ` Peter Zijlstra
@ 2006-12-19 22:51                               ` Linus Torvalds
  2006-12-19 22:58                                 ` Andrew Morton
  2006-12-20 18:02                               ` Stephen Clark
  1 sibling, 1 reply; 311+ messages in thread
From: Linus Torvalds @ 2006-12-19 22:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nick Piggin, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr



On Tue, 19 Dec 2006, Peter Zijlstra wrote:

> On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote:
> > 
> > On Tue, 19 Dec 2006, Linus Torvalds wrote:
> > >
> > >  here's a totally new tangent on this: it's possible that user code is 
> > > simply BUGGY. 
> 
> I'm sad to say this doesn't trigger :-(

Oh, well. It was a theory. 

		Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 22:51                               ` Linus Torvalds
@ 2006-12-19 22:58                                 ` Andrew Morton
  2006-12-19 23:06                                   ` Peter Zijlstra
  0 siblings, 1 reply; 311+ messages in thread
From: Andrew Morton @ 2006-12-19 22:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Nick Piggin, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Tue, 19 Dec 2006 14:51:55 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> 
> 
> On Tue, 19 Dec 2006, Peter Zijlstra wrote:
> 
> > On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote:
> > > 
> > > On Tue, 19 Dec 2006, Linus Torvalds wrote:
> > > >
> > > >  here's a totally new tangent on this: it's possible that user code is 
> > > > simply BUGGY. 
> > 
> > I'm sad to say this doesn't trigger :-(
> 
> Oh, well. It was a theory. 
> 

Well... we'd need to see (corruption && this-not-triggering) to be sure.

Peter, have you been able to trigger the corruption?

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 22:58                                 ` Andrew Morton
@ 2006-12-19 23:06                                   ` Peter Zijlstra
  2006-12-19 23:07                                     ` Peter Zijlstra
  2006-12-20  0:03                                     ` Linus Torvalds
  0 siblings, 2 replies; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-19 23:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Nick Piggin, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Tue, 2006-12-19 at 14:58 -0800, Andrew Morton wrote:

> Well... we'd need to see (corruption && this-not-triggering) to be sure.
> 
> Peter, have you been able to trigger the corruption?

Yes; however the mail I sent describing that seems to be lost in space.

/me quotes from the sent folder:

> The bad news is, that doesn't help either. The good news is I can
> reproduce it.
> 
> What I did to achieve that:
>  
>  - get a sizable torrent from legaltorrents.com / or create a torrent
> yourself that is around ~600M and has multiple files.
> 
>  - start a tracker, and multiple seeds (I used three machines here)
> 
>  - pull the torrent on a fourth machine
> 
> the seeding machines don't much matter of course.
> 
> the fourth machine was a dual core x86-64 with an SMP kernel and
> PREEMPT, mem=256M (so that the torrent is quite a bit larger and does
> require writeout) and I used an ext3 partition with 1k blocks.


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 23:06                                   ` Peter Zijlstra
@ 2006-12-19 23:07                                     ` Peter Zijlstra
  2006-12-20  0:03                                     ` Linus Torvalds
  1 sibling, 0 replies; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-19 23:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Nick Piggin, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Wed, 2006-12-20 at 00:06 +0100, Peter Zijlstra wrote:
> On Tue, 2006-12-19 at 14:58 -0800, Andrew Morton wrote:
> 
> > Well... we'd need to see (corruption && this-not-triggering) to be sure.
> > 
> > Peter, have you been able to trigger the corruption?
> 
> Yes; however the mail I sent describing that seems to be lost in space.
> 
> /me quotes from the sent folder:
> 
> > The bad news is, that doesn't help either. The good news is I can
> > reproduce it.
> > 
> > What I did to achieve that:
> >  
> >  - get a sizable torrent from legaltorrents.com / or create a torrent
> > yourself that is around ~600M and has multiple files.
> > 
> >  - start a tracker, and multiple seeds (I used three machines here)
> > 
> >  - pull the torrent on a fourth machine
> > 
> > the seeding machines don't much matter of course.
> > 
> > the fourth machine was a dual core x86-64 with an SMP kernel and
> > PREEMPT, mem=256M (so that the torrent is quite a bit larger and does
> > require writeout) and I used an ext3 partition with 1k blocks.

PS. this was a reply to:
 http://lkml.org/lkml/2006/12/19/121


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 20:14                     ` Linus Torvalds
                                         ` (2 preceding siblings ...)
  2006-12-18 21:49                       ` Peter Zijlstra
@ 2006-12-19 23:42                       ` Peter Zijlstra
  2006-12-20  0:23                         ` Linus Torvalds
  2006-12-20 14:15                         ` Andrei Popa
  3 siblings, 2 replies; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-19 23:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Popa, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:

> OR:
> 
>  - page_mkclean_one() is simply buggy.

GOLD!

it seems to work with all this (full diff against current git).

/me rebuilds full kernel to make sure...
reboot...
test...      pff the tension...
yay, still good!

Andrei; would you please verify.

The magic seems to be in the extra tlb flush after clearing the dirty
bit. Just too bad ptep_clear_flush_dirty() needs ptep not entry.

diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c
index 5e7cd45..2b8893b 100644
--- a/drivers/connector/connector.c
+++ b/drivers/connector/connector.c
@@ -135,8 +135,7 @@ static int cn_call_callback(struct cn_msg *msg, void (*destruct_data)(void *), v
 	spin_lock_bh(&dev->cbdev->queue_lock);
 	list_for_each_entry(__cbq, &dev->cbdev->queue_list, callback_entry) {
 		if (cn_cb_equal(&__cbq->id.id, &msg->id)) {
-			if (likely(!test_bit(WORK_STRUCT_PENDING,
-					     &__cbq->work.work.management) &&
+			if (likely(!delayed_work_pending(&__cbq->work) &&
 					__cbq->data.ddata == NULL)) {
 				__cbq->data.callback_priv = msg;
 
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageDirty(page) || PageWriteback(page))
 		return 0;
 
 	if (mapping == NULL) {		/* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
 	spin_unlock(&mapping->private_lock);
-	if (ret) {
-		/*
-		 * If the filesystem writes its buffers by hand (eg ext3)
-		 * then we can have clean buffers against a dirty page.  We
-		 * clean the page here; otherwise later reattachment of buffers
-		 * could encounter a non-uptodate page, which is unresolvable.
-		 * This only applies in the rare case where try_to_free_buffers
-		 * succeeds but the page is not freed.
-		 *
-		 * Also, during truncate, discard_buffer will have marked all
-		 * the page's buffers clean.  We discover that here and clean
-		 * the page also.
-		 */
-		if (test_clear_page_dirty(page))
-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
-	}
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
diff --git a/mm/memory.c b/mm/memory.c
index c00bac6..60e0945 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
+static void check_last_page(struct address_space *mapping, loff_t size)
+{
+	pgoff_t index;
+	unsigned int offset;
+	struct page *page;
+
+	if (!mapping)
+		return;
+	offset = size & ~PAGE_MASK;
+	if (!offset)
+		return;
+	index = size >> PAGE_SHIFT;
+	page = find_lock_page(mapping, index);
+	if (page) {
+		unsigned int check = 0;
+		unsigned char *kaddr = kmap_atomic(page, KM_USER0);
+		do {
+			check += kaddr[offset++];
+		} while (offset < PAGE_SIZE);
+		kunmap_atomic(kaddr, KM_USER0);
+		unlock_page(page);
+		page_cache_release(page);
+		if (check)
+			printk(KERN_ERR "%s: BADNESS: truncate check %u\n", current->comm, check);
+	}
+}
+
 /**
  * vmtruncate - unmap mappings "freed" by truncate() syscall
  * @inode: inode of the file used
@@ -1875,6 +1902,7 @@ do_expand:
 		goto out_sig;
 	if (offset > inode->i_sb->s_maxbytes)
 		goto out_big;
+	check_last_page(mapping, inode->i_size);
 	i_size_write(inode, offset);
 
 out_truncate:
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..f561e72 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -957,7 +957,7 @@ int test_set_page_writeback(struct page *page)
 EXPORT_SYMBOL(test_set_page_writeback);
 
 /*
- * Return true if any of the pages in the mapping are marged with the
+ * Return true if any of the pages in the mapping are marked with the
  * passed tag.
  */
 int mapping_tagged(struct address_space *mapping, int tag)
diff --git a/mm/rmap.c b/mm/rmap.c
index d8a842a..900229a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -432,7 +432,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
-	pte_t *pte, entry;
+	pte_t *ptep, entry;
 	spinlock_t *ptl;
 	int ret = 0;
 
@@ -440,22 +440,23 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
 	if (address == -EFAULT)
 		goto out;
 
-	pte = page_check_address(page, mm, address, &ptl);
-	if (!pte)
+	ptep = page_check_address(page, mm, address, &ptl);
+	if (!ptep)
 		goto out;
 
-	if (!pte_dirty(*pte) && !pte_write(*pte))
+	if (!pte_dirty(*ptep) && !pte_write(*ptep))
 		goto unlock;
 
-	entry = ptep_get_and_clear(mm, address, pte);
-	entry = pte_mkclean(entry);
+	entry = ptep_get_and_clear(mm, address, ptep);
 	entry = pte_wrprotect(entry);
-	ptep_establish(vma, address, pte, entry);
+	ptep_establish(vma, address, ptep, entry);
+	ret = ptep_clear_flush_dirty(vma, address, ptep) ||
+		page_test_and_clear_dirty(page);
 	lazy_mmu_prot_update(entry);
 	ret = 1;
 
 unlock:
-	pte_unmap_unlock(pte, ptl);
+	pte_unmap_unlock(ptep, ptl);
 out:
 	return ret;
 }




* Re: 2.6.19 file content corruption on ext3
  2006-12-19 23:06                                   ` Peter Zijlstra
  2006-12-19 23:07                                     ` Peter Zijlstra
@ 2006-12-20  0:03                                     ` Linus Torvalds
  2006-12-20  0:18                                       ` Andrew Morton
  1 sibling, 1 reply; 311+ messages in thread
From: Linus Torvalds @ 2006-12-20  0:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Nick Piggin, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr



On Wed, 20 Dec 2006, Peter Zijlstra wrote:

> On Tue, 2006-12-19 at 14:58 -0800, Andrew Morton wrote:
> 
> > Well... we'd need to see (corruption && this-not-triggering) to be sure.
> > 
> > Peter, have you been able to trigger the corruption?
> 
> Yes; however the mail I sent describing that seems to be lost in space.

Btw, can somebody actually explain the mess that is ext3 "dirtying"?

Ext3 does NOT use __set_page_dirty_buffers. It does

	static int ext3_journalled_set_page_dirty(struct page *page)
	{
	        SetPageChecked(page);
	        return __set_page_dirty_nobuffers(page);
	}

and uses that "Checked" bit as a "whole page is dirty" bit (which it tests 
in "writepage()").

You realize what this all means? It means that ANYTHING that actually 
clears the _real_ dirty bit won't actually be doing anything at all for 
ext3, because the Checked bit will still stay set, and any IO down the 
line on that page would totally ignore the dirty bits on the buffer heads 
and just write out everything.

That is "The Mess(tm)".

It also basically means that anything that clears the dirty bit without 
just calling "writepage()" had _better_ call "invalidatepage()" for the 
whole page, because otherwise the PageChecked bit will never be cleared as 
far as I can see. Happily, at least ext3 seems to _test_ for that case in 
the releasepage() function, so it appears that we do do this.

But this seems to just strengthen my argument: you can NEVER clean a page, 
unless you (a) do IO on it immediately afterwards (writeback) or (b) 
invalidate it entirely (truncate).

I'd really like to see just those two functions exist. Preferably in a 
form where you can see easily that we actually follow those rules. Rather 
than having a confusing set of "clear_page_dirty()" and
"test_and_clear_page_dirty()" functions that are called from random 
places.

IOW, I think the "clear_page_dirty_for_io()" is fine (it's case (a) above), 
and then we should probably have a "cancel_dirty_page()" function 
that does all the current clear_page_dirty() but also makes sure that we 
actually call the invalidatepage() function itself. 

Hmm?

			Linus
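
The two-path rule above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: `struct toy_page` and every `toy_*` function are hypothetical stand-ins, not the kernel's `struct page` or its real helpers, and the Checked bit is reduced to a plain boolean. It shows a page becoming clean only through (a) immediate writeout or (b) cancellation that also invalidates the page, so the Checked bit is never left behind.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page: only the state discussed above. */
struct toy_page {
	bool dirty;	/* the "real" dirty bit */
	bool checked;	/* ext3 data=journal "whole page is dirty" bit */
	int writes;	/* how many times the page was submitted for IO */
};

/* Case (a): clear the dirty bit only because IO follows immediately. */
static bool toy_clear_dirty_for_io(struct toy_page *page)
{
	bool was_dirty = page->dirty;

	page->dirty = false;
	return was_dirty;
}

/* Hypothetical writepage(): writes the page and consumes the Checked bit. */
static void toy_writepage(struct toy_page *page)
{
	assert(page != NULL);
	page->checked = false;
	page->writes++;
}

/* Hypothetical invalidatepage(): drops the filesystem's private state. */
static void toy_invalidatepage(struct toy_page *page)
{
	page->checked = false;
}

/* Case (b): cancel the dirty state and always invalidate with it. */
static void toy_cancel_dirty_page(struct toy_page *page)
{
	page->dirty = false;
	toy_invalidatepage(page);
}

/* Writeback path: the only caller of toy_clear_dirty_for_io(). */
static void toy_writeback(struct toy_page *page)
{
	if (toy_clear_dirty_for_io(page))
		toy_writepage(page);
}
```

With a page that starts out dirty and Checked, `toy_writeback()` leaves it clean with one write submitted, while `toy_cancel_dirty_page()` leaves it clean with no write at all. In neither case does a stale Checked bit survive.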


* Re: 2.6.19 file content corruption on ext3
  2006-12-20  0:03                                     ` Linus Torvalds
@ 2006-12-20  0:18                                       ` Andrew Morton
  0 siblings, 0 replies; 311+ messages in thread
From: Andrew Morton @ 2006-12-20  0:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Nick Piggin, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Tue, 19 Dec 2006 16:03:49 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> 
> 
> On Wed, 20 Dec 2006, Peter Zijlstra wrote:
> 
> > On Tue, 2006-12-19 at 14:58 -0800, Andrew Morton wrote:
> > 
> > > Well... we'd need to see (corruption && this-not-triggering) to be sure.
> > > 
> > > Peter, have you been able to trigger the corruption?
> > 
> > Yes; however the mail I sent describing that seems to be lost in space.
> 
> Btw, can somebody actually explain the mess that is ext3 "dirtying"?
> 
> Ext3 does NOT use __set_page_dirty_buffers. It does
> 
> 	static int ext3_journalled_set_page_dirty(struct page *page)
> 	{
> 	        SetPageChecked(page);
> 	        return __set_page_dirty_nobuffers(page);
> 	}
> 
> and uses that "Checked" bit as a "whole page is dirty" bit (which it tests 
> in "writepage()").

This is purely for data=journal, which is rarely used.

In journalled-data mode, write(), write-fault, etc are not allowed to dirty
the pages and buffers, because the data has to be written to the journal
first.  Only after the data has been written to the journal do we mark
buffers (and hence pages) dirty as far as the VFS is concerned, for
checkpointing the data back to its real place on the disk.


For MAP_SHARED pages ext3 cheats madly and doesn't journal the data at all.
In all journalling modes, MAP_SHARED data follows the regular ext2-style
handling.  Which is a bit of a wart.



* Re: 2.6.19 file content corruption on ext3
  2006-12-19 23:42                       ` Peter Zijlstra
@ 2006-12-20  0:23                         ` Linus Torvalds
  2006-12-20  9:01                           ` Peter Zijlstra
  2006-12-20  9:32                           ` Peter Zijlstra
  2006-12-20 14:15                         ` Andrei Popa
  1 sibling, 2 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-20  0:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrei Popa, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr



On Wed, 20 Dec 2006, Peter Zijlstra wrote:
> On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > OR:
> > 
> >  - page_mkclean_one() is simply buggy.
> 
> GOLD!

Ok. I was looking at that, and I wondered..

However, if that works, then I _think_ the correct sequence is the 
following..

The rule should be:
 - we flush the tlb _after_ we have cleared it, but _before_ we insert the 
   new entry.

But I dunno. These things are damn subtle. Does this patch fix it for you?

I actually suspect we should do this as an arch-specific macro, and 
totally replace the current "ptep_clear_flush_dirty()" with one that does 
"ptep_clear_flush_dirty_and_set_wp()".

Because what I'd _really_ prefer to do on x86 (and probably on most other 
sane architectures) is to do

 - atomically replace the pte with the EXACT SAME ONE, but one that 
   has the writable bit clear.

	bit_clear(_PAGE_BIT_RW, &(ptep)->pte_low);

 - flush the TLB, making sure that all CPU's will no longer write to it:

	flush_tlb_page(vma, address);

 - finally, just fetch-and-clear the dirty bit (and since it's no longer 
   writable, nobody should be setting it any more)

	ret = bit_clear(_PAGE_BIT_DIRTY, &(ptep)->pte_low);

and now we should be all done.
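
As a rough illustration of the three steps above, here is a userspace model using C11 atomics. It is a sketch, not kernel code: the pte is just an atomic 32-bit word, the `TOY_PTE_*` masks and all `toy_*` names are made up for the example, and the TLB flush is a stub that only counts calls.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit positions for the toy pte word. */
#define TOY_PTE_RW	(UINT32_C(1) << 1)
#define TOY_PTE_DIRTY	(UINT32_C(1) << 6)

/* Stub standing in for flush_tlb_page(): it only counts invocations. */
static int toy_tlb_flushes;

static void toy_flush_tlb_page(void)
{
	toy_tlb_flushes++;
}

/*
 * Write-protect first, flush, then fetch-and-clear the dirty bit.
 * Returns true if the pte was dirty.
 */
static bool toy_clear_dirty_and_set_wp(_Atomic uint32_t *pte)
{
	uint32_t old;

	assert(pte != NULL);

	/* 1. Atomically drop the writable bit; every other bit is kept. */
	atomic_fetch_and(pte, ~TOY_PTE_RW);

	/* 2. Flush, so no CPU keeps writing through a stale writable entry. */
	toy_flush_tlb_page();

	/* 3. Nobody can set the dirty bit any more: fetch and clear it. */
	old = atomic_fetch_and(pte, ~TOY_PTE_DIRTY);
	return (old & TOY_PTE_DIRTY) != 0;
}
```

The point of the ordering is that the dirty bit is only sampled after the write-protect and the flush, so in this model a late write cannot slip in between "read dirty" and "make read-only".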

But the "ptep_get_and_clear() + flush_tlb_page()" sequence should 
hopefully also work.

Pls test.

		Linus

----
diff --git a/mm/rmap.c b/mm/rmap.c
index d8a842a..eec8706 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -448,9 +448,10 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
 		goto unlock;
 
 	entry = ptep_get_and_clear(mm, address, pte);
+	flush_tlb_page(vma, address);
 	entry = pte_mkclean(entry);
 	entry = pte_wrprotect(entry);
-	ptep_establish(vma, address, pte, entry);
+	set_pte_at(mm, address, pte, entry);
 	lazy_mmu_prot_update(entry);
 	ret = 1;
 



* Re: 2.6.19 file content corruption on ext3
  2006-12-19 18:59                           ` Linus Torvalds
  2006-12-19 21:30                             ` Peter Zijlstra
@ 2006-12-20  5:56                             ` Jari Sundell
  1 sibling, 0 replies; 311+ messages in thread
From: Jari Sundell @ 2006-12-20  5:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On 12/20/06, Linus Torvalds <torvalds@osdl.org> wrote:
> On Tue, 19 Dec 2006, Linus Torvalds wrote:
> >
> >  here's a totally new tangent on this: it's possible that user code is
> > simply BUGGY.
>
> Btw, here's a simpler test-program that actually shows the difference
> between 2.6.18 and 2.6.19 in action, and why it could explain why a
> program like rtorrent might show corruption behaviour that it didn't show
> before.

Kinda late to the discussion, but I guess I could summarize what
rtorrent actually does, or should be doing.

When downloading a new torrent, it will create the files and truncate
them to the final size. It will never call truncate after this and the
files will remain sparse until data is downloaded. A 'piece' is mapped
to memory using MAP_SHARED, which will be page aligned on single file
torrents but unlikely to be so on multi-file torrents.

So on multi-file torrents it'll often end up with two mappings
overlapping on one page, each of which only writes to its own part
of the page. These will then be sync'ed with MS_ASYNC, or MS_SYNC if low
on disk space. After that it might be unmapped, then mapped as
read-only.
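
The access pattern described above can be approximated with a small test program. This is a sketch of the pattern only, not rtorrent's actual code: the function name `toy_overlap_check`, the three-page file size and the mid-page split point are assumptions made for the example. It truncates a new file to its final size, creates two MAP_SHARED mappings that share one page, writes each mapping's own part, syncs with MS_ASYNC and then reads the file back.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns the number of bytes that do not hold the expected value,
 * or -1 if the file could not be set up. The file is removed afterwards.
 */
static long toy_overlap_check(const char *path)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t size = 3 * page;		/* final file size */
	size_t split = page + page / 2;	/* piece boundary inside page 1 */
	unsigned char *a = MAP_FAILED, *b = MAP_FAILED, *buf = NULL;
	long bad = -1;
	int fd;

	assert(split % page != 0);	/* the boundary must not be page aligned */

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	/* Create the sparse file at its final size, as on download start. */
	if (ftruncate(fd, (off_t)size) != 0)
		goto out;

	/* Piece A covers [0, split); piece B is mapped from page 1 onwards. */
	a = mmap(NULL, split, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	b = mmap(NULL, size - page, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
		 (off_t)page);
	if (a == MAP_FAILED || b == MAP_FAILED)
		goto out;

	/* Each mapping writes only its own part of the shared page. */
	memset(a, 'A', split);
	memset(b + (split - page), 'B', size - split);

	if (msync(a, split, MS_ASYNC) != 0 ||
	    msync(b, size - page, MS_ASYNC) != 0)
		goto out;

	/* Read the file back through the page cache and compare. */
	buf = malloc(size);
	if (buf == NULL || pread(fd, buf, size, 0) != (ssize_t)size)
		goto out;
	bad = 0;
	for (size_t i = 0; i < size; i++)
		bad += buf[i] != (i < split ? 'A' : 'B');
out:
	free(buf);
	if (a != MAP_FAILED)
		munmap(a, split);
	if (b != MAP_FAILED)
		munmap(b, size - page);
	close(fd);
	unlink(path);
	return bad;
}
```

A single run like this is unlikely to hit the corruption discussed in this thread, which needed memory pressure and real writeout to show up. It only makes the overlapping-mapping layout concrete.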

I haven't thought of asking if single file torrents are ok.

Rakshasa


* Re: 2.6.19 file content corruption on ext3
  2006-12-20  0:23                         ` Linus Torvalds
@ 2006-12-20  9:01                           ` Peter Zijlstra
  2006-12-20  9:12                             ` Peter Zijlstra
                                               ` (2 more replies)
  2006-12-20  9:32                           ` Peter Zijlstra
  1 sibling, 3 replies; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-20  9:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Popa, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr,
	Martin Schwidefsky, Heiko Carstens

On Tue, 2006-12-19 at 16:23 -0800, Linus Torvalds wrote:
> 
> On Wed, 20 Dec 2006, Peter Zijlstra wrote:
> > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > > OR:
> > > 
> > >  - page_mkclean_one() is simply buggy.
> > 
> > GOLD!
> 
> Ok. I was looking at that, and I wondered..
> 
> However, if that works, then I _think_ the correct sequence is the 
> following..
> 
> The rule should be:
>  - we flush the tlb _after_ we have cleared it, but _before_ we insert the 
>    new entry.
> 
> But I dunno. These things are damn subtle. Does this patch fix it for you?

I will try, but I had a look around the different architectures'
implementations of ptep_clear_flush_dirty() and saw that not all do the
actual flush. So if we go down this road perhaps we should introduce
another per-arch function that does the potential flush, like
flush_tlb_on_clear_dirty() or something like that.

Then we could write:

  entry = ptep_get_and_clear(mm, address, ptep);
  flush_tlb_on_clear_dirty(vma, address);
  entry = pte_mkclean(entry);
  entry = pte_wrprotect(entry);
  set_pte_at(mm, address, ptep, entry);

> I actually suspect we should do this as an arch-specific macro, and 
> totally replace the current "ptep_clear_flush_dirty()" with one that does 
> "ptep_clear_flush_dirty_and_set_wp()".
> 
> Because what I'd _really_ prefer to do on x86 (and probably on most other 
> sane architectures) is to do
> 
>  - atomically replace the pte with the EXACT SAME ONE, but one that 
>    has the writable bit clear.
> 
> 	bit_clear(_PAGE_BIT_RW, &(ptep)->pte_low);
> 
>  - flush the TLB, making sure that all CPU's will no longer write to it:
> 
> 	flush_tlb_page(vma, address);
> 
>  - finally, just fetch-and-clear the dirty bit (and since it's no longer 
>    writable, nobody should be setting it any more)
> 
> 	ret = bit_clear(_PAGE_BIT_DIRTY, &(ptep)->pte_low);
> 
> and now we should be all done.

Hmm, should we not flush after clearing the dirty bit? That is, why does
ptep_clear_flush_dirty() need a flush after clearing that bit? does it
leak through in the tlb copy?

Also, what is this page_test_and_clear_dirty() business, that seems to
be exclusively s390 btw. However they do seem to need this.

> But the "ptep_get_and_clear() + flush_tlb_page()" sequence should 
> hopefully also work.

Yeah, probably, not optimally so on some archs that don't actually need
the flush though. And as above, I wonder about s390.

(added our s390 friends to the CC list)



* Re: 2.6.19 file content corruption on ext3
  2006-12-20  9:01                           ` Peter Zijlstra
@ 2006-12-20  9:12                             ` Peter Zijlstra
  2006-12-20  9:39                             ` Arjan van de Ven
  2006-12-20 14:27                             ` 2.6.19 file content corruption on ext3 Martin Schwidefsky
  2 siblings, 0 replies; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-20  9:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Popa, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr,
	Martin Schwidefsky, Heiko Carstens

On Wed, 2006-12-20 at 10:01 +0100, Peter Zijlstra wrote:

> I will try, but I had a look around the different architectures'
> implementations of ptep_clear_flush_dirty() and saw that not all do the
> actual flush. So if we go down this road perhaps we should introduce
> another per-arch function that does the potential flush, like
> flush_tlb_on_clear_dirty() or something like that.

never mind, we do need an unconditional flush for changing the
protection too.




* Re: 2.6.19 file content corruption on ext3
  2006-12-20  0:23                         ` Linus Torvalds
  2006-12-20  9:01                           ` Peter Zijlstra
@ 2006-12-20  9:32                           ` Peter Zijlstra
  1 sibling, 0 replies; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-20  9:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Popa, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Tue, 2006-12-19 at 16:23 -0800, Linus Torvalds wrote:

> Pls test.

Is good. Only s390 remains a question.

Another point, change_protection() also does a cache flush, should we
too?

> ----
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d8a842a..eec8706 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -448,9 +448,10 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
>  		goto unlock;
>  
>  	entry = ptep_get_and_clear(mm, address, pte);
          flush_cache_page(vma, address, pte_pfn(entry));
> +	flush_tlb_page(vma, address);
>  	entry = pte_mkclean(entry);
>  	entry = pte_wrprotect(entry);
> -	ptep_establish(vma, address, pte, entry);
> +	set_pte_at(mm, address, pte, entry);
>  	lazy_mmu_prot_update(entry);
>  	ret = 1;
>  
> 



* Re: 2.6.19 file content corruption on ext3
  2006-12-20  9:01                           ` Peter Zijlstra
  2006-12-20  9:12                             ` Peter Zijlstra
@ 2006-12-20  9:39                             ` Arjan van de Ven
  2006-12-20 11:26                               ` [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) Peter Zijlstra
  2006-12-20 14:27                             ` 2.6.19 file content corruption on ext3 Martin Schwidefsky
  2 siblings, 1 reply; 311+ messages in thread
From: Arjan van de Ven @ 2006-12-20  9:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr, Martin Schwidefsky, Heiko Carstens


> Hmm, should we not flush after clearing the dirty bit? That is, why does
> ptep_clear_flush_dirty() need a flush after clearing that bit? does it
> leak through in the tlb copy?

afaics you need to 
1) clear
2) flush 
3) check and go to 1) if needed

to be race free. 
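
The clear / flush / re-check loop above can be sketched in userspace with C11 atomics. Everything here is an assumption made for illustration: the pte is a plain atomic word, the `TOY_PTE_*` masks and `toy_*` names are invented, and the "race" is simulated by a flush callback that re-dirties the entry once, standing in for a late write through a stale TLB entry.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit positions for the toy pte word. */
#define TOY_PTE_RW	(UINT32_C(1) << 1)
#define TOY_PTE_DIRTY	(UINT32_C(1) << 6)

typedef void (*toy_flush_fn)(_Atomic uint32_t *pte);

/*
 * Repeat "clear, flush" until a re-check finds the entry neither
 * writable nor dirty. Returns the number of rounds that were needed.
 */
static int toy_mkclean_loop(_Atomic uint32_t *pte, toy_flush_fn flush)
{
	int rounds = 0;

	assert(pte != NULL && flush != NULL);

	while (atomic_load(pte) & (TOY_PTE_RW | TOY_PTE_DIRTY)) {
		/* 1) clear */
		atomic_fetch_and(pte, ~(TOY_PTE_RW | TOY_PTE_DIRTY));
		/* 2) flush */
		flush(pte);
		/* 3) the loop condition re-checks and goes to 1) if needed */
		rounds++;
	}
	return rounds;
}

/* Simulated race: one late write re-dirties the entry during the flush. */
static int toy_late_writes = 1;

static void toy_racy_flush(_Atomic uint32_t *pte)
{
	if (toy_late_writes > 0) {
		toy_late_writes--;
		atomic_fetch_or(pte, TOY_PTE_DIRTY);
	}
}
```

In this model the first round loses the race and a second round is needed. Without the re-check the entry would be left dirty after the function returned.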





* [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20  9:39                             ` Arjan van de Ven
@ 2006-12-20 11:26                               ` Peter Zijlstra
  2006-12-20 11:39                                 ` Jesper Juhl
                                                   ` (2 more replies)
  0 siblings, 3 replies; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-20 11:26 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linus Torvalds, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann


fix page_mkclean_one()

it had several issues:
 - it failed to flush the cache
 - it failed to flush the tlb
 - it failed to do s390 (s390 guys, please verify this is now correct)

Also, clear in a loop to ensure SMP safeness as suggested by Arjan.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/rmap.c |   29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -432,7 +432,7 @@ static int page_mkclean_one(struct page 
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
-	pte_t *pte, entry;
+	pte_t *ptep;
 	spinlock_t *ptl;
 	int ret = 0;
 
@@ -440,22 +440,23 @@ static int page_mkclean_one(struct page 
 	if (address == -EFAULT)
 		goto out;
 
-	pte = page_check_address(page, mm, address, &ptl);
-	if (!pte)
+	ptep = page_check_address(page, mm, address, &ptl);
+	if (!ptep)
 		goto out;
 
-	if (!pte_dirty(*pte) && !pte_write(*pte))
-		goto unlock;
-
-	entry = ptep_get_and_clear(mm, address, pte);
-	entry = pte_mkclean(entry);
-	entry = pte_wrprotect(entry);
-	ptep_establish(vma, address, pte, entry);
-	lazy_mmu_prot_update(entry);
-	ret = 1;
+	while (pte_dirty(*ptep) || pte_write(*ptep)) {
+		pte_t entry = ptep_get_and_clear(mm, address, ptep);
+		flush_cache_page(vma, address, pte_pfn(entry));
+		flush_tlb_page(vma, address);
+		(void)page_test_and_clear_dirty(page); /* do the s390 thing */
+		entry = pte_wrprotect(entry);
+		entry = pte_mkclean(entry);
+		set_pte_at(mm, address, ptep, entry);
+		lazy_mmu_prot_update(entry);
+		ret = 1;
+	}
 
-unlock:
-	pte_unmap_unlock(pte, ptl);
+	pte_unmap_unlock(ptep, ptl);
 out:
 	return ret;
 }




* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 11:26                               ` [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) Peter Zijlstra
@ 2006-12-20 11:39                                 ` Jesper Juhl
  2006-12-20 11:42                                   ` Peter Zijlstra
  2006-12-20 13:00                                 ` Hugh Dickins
  2006-12-20 14:55                                 ` Martin Schwidefsky
  2 siblings, 1 reply; 311+ messages in thread
From: Jesper Juhl @ 2006-12-20 11:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, Linus Torvalds, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann

On 20/12/06, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> fix page_mkclean_one()
>
> it had several issues:
>  - it failed to flush the cache
>  - it failed to flush the tlb
>  - it failed to do s390 (s390 guys, please verify this is now correct)
>
> Also, clear in a loop to ensure SMP safeness as suggested by Arjan.
>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  mm/rmap.c |   29 +++++++++++++++--------------
>  1 file changed, 15 insertions(+), 14 deletions(-)
>
> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c
> +++ linux-2.6/mm/rmap.c
> @@ -432,7 +432,7 @@ static int page_mkclean_one(struct page
>  {
>         struct mm_struct *mm = vma->vm_mm;
>         unsigned long address;
> -       pte_t *pte, entry;
> +       pte_t *ptep;
>         spinlock_t *ptl;
>         int ret = 0;
>
> @@ -440,22 +440,23 @@ static int page_mkclean_one(struct page
>         if (address == -EFAULT)
>                 goto out;
>
> -       pte = page_check_address(page, mm, address, &ptl);
> -       if (!pte)
> +       ptep = page_check_address(page, mm, address, &ptl);
> +       if (!ptep)
>                 goto out;
>
> -       if (!pte_dirty(*pte) && !pte_write(*pte))
> -               goto unlock;
> -
> -       entry = ptep_get_and_clear(mm, address, pte);
> -       entry = pte_mkclean(entry);
> -       entry = pte_wrprotect(entry);
> -       ptep_establish(vma, address, pte, entry);
> -       lazy_mmu_prot_update(entry);
> -       ret = 1;
> +       while (pte_dirty(*ptep) || pte_write(*ptep)) {
> +               pte_t entry = ptep_get_and_clear(mm, address, ptep);
> +               flush_cache_page(vma, address, pte_pfn(entry));
> +               flush_tlb_page(vma, address);
> +               (void)page_test_and_clear_dirty(page); /* do the s390 thing */
> +               entry = pte_wrprotect(entry);
> +               entry = pte_mkclean(entry);
> +               set_pte_at(mm, address, ptep, entry);
> +               lazy_mmu_prot_update(entry);
> +               ret = 1;
> +       }
>
Having the assignment of "ret = 1;" inside the loop seems a little
pointless. Perhaps gcc can optimize it, but still, that assignment
really only needs to happen once outside the loop.


> -unlock:
> -       pte_unmap_unlock(pte, ptl);
> +       pte_unmap_unlock(ptep, ptl);
>  out:
>         return ret;
>  }
>

-- 
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please      http://www.expita.com/nomime.html


* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 11:39                                 ` Jesper Juhl
@ 2006-12-20 11:42                                   ` Peter Zijlstra
  2006-12-20 12:12                                     ` Jesper Juhl
  0 siblings, 1 reply; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-20 11:42 UTC (permalink / raw)
  To: Jesper Juhl
  Cc: Arjan van de Ven, Linus Torvalds, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann

On Wed, 2006-12-20 at 12:39 +0100, Jesper Juhl wrote:
> On 20/12/06, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >
> > fix page_mkclean_one()
> >
> > it had several issues:
> >  - it failed to flush the cache
> >  - it failed to flush the tlb
> >  - it failed to do s390 (s390 guys, please verify this is now correct)
> >
> > Also, clear in a loop to ensure SMP safeness as suggested by Arjan.
> >
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > ---
> >  mm/rmap.c |   29 +++++++++++++++--------------
> >  1 file changed, 15 insertions(+), 14 deletions(-)
> >
> > Index: linux-2.6/mm/rmap.c
> > ===================================================================
> > --- linux-2.6.orig/mm/rmap.c
> > +++ linux-2.6/mm/rmap.c
> > @@ -432,7 +432,7 @@ static int page_mkclean_one(struct page
> >  {
> >         struct mm_struct *mm = vma->vm_mm;
> >         unsigned long address;
> > -       pte_t *pte, entry;
> > +       pte_t *ptep;
> >         spinlock_t *ptl;
> >         int ret = 0;
> >
> > @@ -440,22 +440,23 @@ static int page_mkclean_one(struct page
> >         if (address == -EFAULT)
> >                 goto out;
> >
> > -       pte = page_check_address(page, mm, address, &ptl);
> > -       if (!pte)
> > +       ptep = page_check_address(page, mm, address, &ptl);
> > +       if (!ptep)
> >                 goto out;
> >
> > -       if (!pte_dirty(*pte) && !pte_write(*pte))
> > -               goto unlock;
> > -
> > -       entry = ptep_get_and_clear(mm, address, pte);
> > -       entry = pte_mkclean(entry);
> > -       entry = pte_wrprotect(entry);
> > -       ptep_establish(vma, address, pte, entry);
> > -       lazy_mmu_prot_update(entry);
> > -       ret = 1;
> > +       while (pte_dirty(*ptep) || pte_write(*ptep)) {
> > +               pte_t entry = ptep_get_and_clear(mm, address, ptep);
> > +               flush_cache_page(vma, address, pte_pfn(entry));
> > +               flush_tlb_page(vma, address);
> > +               (void)page_test_and_clear_dirty(page); /* do the s390 thing */
> > +               entry = pte_wrprotect(entry);
> > +               entry = pte_mkclean(entry);
> > +               set_pte_at(mm, address, ptep, entry);
> > +               lazy_mmu_prot_update(entry);
> > +               ret = 1;
> > +       }
> >
> Having the assignment of "ret = 1;" inside the loop seems a little
> pointless. Perhaps gcc can optimize it, but still, that assignment
> really only needs to happen once outside the loop.

Sure, but I was hoping gcc was smart enough. Placing it outside the loop
would require an extra if stmt. Also the chance this loop will actually
be traversed more than once is _very_ small.





* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 11:42                                   ` Peter Zijlstra
@ 2006-12-20 12:12                                     ` Jesper Juhl
  0 siblings, 0 replies; 311+ messages in thread
From: Jesper Juhl @ 2006-12-20 12:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, Linus Torvalds, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann

On 20/12/06, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Wed, 2006-12-20 at 12:39 +0100, Jesper Juhl wrote:
> > Having the assignment of "ret = 1;" inside the loop seems a little
> > pointless. Perhaps gcc can optimize it, but still, that assignment
> > really only needs to happen once outside the loop.
>
> Sure, but I was hoping gcc was smart enough. Placing it outside the loop
> would require an extra if stmt. Also the chance this loop will actually
> be traversed more than once is _very_ small.
>

all right - I just spotted it and thought I'd point it out :-)

-- 
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please      http://www.expita.com/nomime.html


* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 11:26                               ` [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) Peter Zijlstra
  2006-12-20 11:39                                 ` Jesper Juhl
@ 2006-12-20 13:00                                 ` Hugh Dickins
  2006-12-20 13:56                                   ` Peter Zijlstra
  2006-12-20 14:55                                 ` Martin Schwidefsky
  2 siblings, 1 reply; 311+ messages in thread
From: Hugh Dickins @ 2006-12-20 13:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, Linus Torvalds, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Michlmayr, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann

On Wed, 20 Dec 2006, Peter Zijlstra wrote:
> 
> fix page_mkclean_one()

Congratulations on getting to the bottom of it, Peter (if you have:
I haven't digested enough of the thread to tell).  I'm mostly offline at
present, no time for dialogue, I'll throw out a few remarks and run...

> 
> it had several issues:
>  - it failed to flush the cache

It's unclear to me why it should need to flush the cache, but I don't
know much about that, and mprotect does flush the cache in advance -
I think others will tell you that if it does need to be flushed, it must
be flushed while there's still a valid pte (on some arches at least).

>  - it failed to flush the tlb

Eh?  It flushed the TLB inside ptep_establish, didn't it?
I guess you mean you've found a race before it flushed the TLB.

>  - it failed to do s390 (s390 guys, please verify this is now correct)

Hmm, I thought we cleared it with them back at the time.

> 
> Also, clear in a loop to ensure SMP safeness as suggested by Arjan.

Yikes.  Well, please compare with mprotect's change_pte_range.  I think
I took that as the relevant standard when checking your implementation,
and back then satisfied myself that what you were doing was equivalent.
If page_mkclean_one is now agreed to be significantly defective, then
I suspect change_pte_range is also; perhaps others too.

(But I haven't found time to do more than skim through the thread,
I've not thought through the issues at all: I am surprised that it's
now found defective, we looked at it long and hard back then.)

And trivial point: please undo those distracting "pte" to "ptep" mods:
if you want to call pte pointers ptep, throughout rmap.c and throughout
mm, that's another patch entirely (which I won't welcome, but others may).

Hugh

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 13:00                                 ` Hugh Dickins
@ 2006-12-20 13:56                                   ` Peter Zijlstra
  2006-12-20 17:03                                     ` Martin Michlmayr
  0 siblings, 1 reply; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-20 13:56 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Arjan van de Ven, Linus Torvalds, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Michlmayr, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann

On Wed, 2006-12-20 at 13:00 +0000, Hugh Dickins wrote:
> On Wed, 20 Dec 2006, Peter Zijlstra wrote:
> > 
> > fix page_mkclean_one()
> 
> Congratulations on getting to the bottom of it, Peter (if you have:
> I haven't digested enough of the thread to tell).

Well, I thought I understood, you just shattered that.

>   I'm mostly offline at
> present, no time for dialogue, I'll throw out a few remarks and run...

I wondered where you were ;-) Enjoy your time away from the computer.

> > 
> > it had several issues:
> >  - it failed to flush the cache
> 
> It's unclear to me why it should need to flush the cache, but I don't
> know much about that, and mprotect does flush the cache in advance -
> I think others will tell you that if it does need to be flushed,

I was still thinking about why exactly, but indeed since mprotect does I
thought it prudent to also do it.

> it must
> be flushed while there's still a valid pte (on some arches at least).

Ah, good point, makes sense I guess.

> >  - it failed to flush the tlb
> 
> Eh?  It flushed the TLB inside ptep_establish, didn't it?
> I guess you mean you've found a race before it flushed the TLB.

Hmm, quite right indeed. I missed that. So moving the flush inside the
pte cleared section closed a race. It seems I must have a long hard look
at these architecture manuals...

> >  - it failed to do s390 (s390 guys, please verify this is now correct)
> 
> Hmm, I thought we cleared it with them back at the time.

/me queries mail folder...
can't seem to find it.

> > 
> > Also, clear in a loop to ensure SMP safeness as suggested by Arjan.
> 
> Yikes.  Well, please compare with mprotect's change_pte_range.  I think
> I took that as the relevant standard when checking your implementation,
> and back then satisfied myself that what you were doing was equivalent.
> If page_mkclean_one is now agreed to be significantly defective, then
> I suspect change_pte_range is also; perhaps others too.

Arjan argued that mprotect and msync would mostly race with themselves
in userspace. 

> (But I haven't found time to do more than skim through the thread,
> I've not thought through the issues at all: I am surprised that it's
> now found defective, we looked at it long and hard back then.)

---

page_mkclean_one() fix

it had several issues:
 - it failed to flush the cache
 - a race wrt tlb flushing
 - it failed to do s390 (s390 guys, please verify this is now correct)

Also, clear in a loop to ensure SMP safeness as suggested by Arjan.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/rmap.c |   23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -432,7 +432,7 @@ static int page_mkclean_one(struct page 
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
-	pte_t *pte, entry;
+	pte_t *pte;
 	spinlock_t *ptl;
 	int ret = 0;
 
@@ -444,17 +444,20 @@ static int page_mkclean_one(struct page 
 	if (!pte)
 		goto out;
 
-	if (!pte_dirty(*pte) && !pte_write(*pte))
-		goto unlock;
+	while (pte_dirty(*pte) || pte_write(*pte)) {
+		pte_t entry;
 
-	entry = ptep_get_and_clear(mm, address, pte);
-	entry = pte_mkclean(entry);
-	entry = pte_wrprotect(entry);
-	ptep_establish(vma, address, pte, entry);
-	lazy_mmu_prot_update(entry);
-	ret = 1;
+		flush_cache_page(vma, address, pte_pfn(*pte));
+		entry = ptep_get_and_clear(mm, address, pte);
+		flush_tlb_page(vma, address);
+		(void)page_test_and_clear_dirty(page); /* do the s390 thing */
+		entry = pte_wrprotect(entry);
+		entry = pte_mkclean(entry);
+		set_pte_at(vma, address, pte, entry);
+		lazy_mmu_prot_update(entry);
+		ret = 1;
+	}
 
-unlock:
 	pte_unmap_unlock(pte, ptl);
 out:
 	return ret;



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 23:42                       ` Peter Zijlstra
  2006-12-20  0:23                         ` Linus Torvalds
@ 2006-12-20 14:15                         ` Andrei Popa
  2006-12-20 14:23                           ` Peter Zijlstra
  1 sibling, 1 reply; 311+ messages in thread
From: Andrei Popa @ 2006-12-20 14:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Wed, 2006-12-20 at 00:42 +0100, Peter Zijlstra wrote:
> On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> 
> > OR:
> > 
> >  - page_mkclean_one() is simply buggy.
> 
> GOLD!
> 
> it seems to work with all this (full diff against current git).
> 
> /me rebuilds full kernel to make sure...
> reboot...
> test...      pff the tension...
> yay, still good!
> 
> Andrei; would you please verify.

I have corrupted files.

> The magic seems to be in the extra tlb flush after clearing the dirty
> bit. Just too bad ptep_clear_flush_dirty() needs ptep not entry.
> 
> diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c
> index 5e7cd45..2b8893b 100644
> --- a/drivers/connector/connector.c
> +++ b/drivers/connector/connector.c
> @@ -135,8 +135,7 @@ static int cn_call_callback(struct cn_msg *msg, void (*destruct_data)(void *), v
>  	spin_lock_bh(&dev->cbdev->queue_lock);
>  	list_for_each_entry(__cbq, &dev->cbdev->queue_list, callback_entry) {
>  		if (cn_cb_equal(&__cbq->id.id, &msg->id)) {
> -			if (likely(!test_bit(WORK_STRUCT_PENDING,
> -					     &__cbq->work.work.management) &&
> +			if (likely(!delayed_work_pending(&__cbq->work) &&
>  					__cbq->data.ddata == NULL)) {
>  				__cbq->data.callback_priv = msg;
>  
> diff --git a/fs/buffer.c b/fs/buffer.c
> index d1f1b54..263f88e 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
>  	int ret = 0;
>  
>  	BUG_ON(!PageLocked(page));
> -	if (PageWriteback(page))
> +	if (PageDirty(page) || PageWriteback(page))
>  		return 0;
>  
>  	if (mapping == NULL) {		/* can this still happen? */
> @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
>  	spin_lock(&mapping->private_lock);
>  	ret = drop_buffers(page, &buffers_to_free);
>  	spin_unlock(&mapping->private_lock);
> -	if (ret) {
> -		/*
> -		 * If the filesystem writes its buffers by hand (eg ext3)
> -		 * then we can have clean buffers against a dirty page.  We
> -		 * clean the page here; otherwise later reattachment of buffers
> -		 * could encounter a non-uptodate page, which is unresolvable.
> -		 * This only applies in the rare case where try_to_free_buffers
> -		 * succeeds but the page is not freed.
> -		 *
> -		 * Also, during truncate, discard_buffer will have marked all
> -		 * the page's buffers clean.  We discover that here and clean
> -		 * the page also.
> -		 */
> -		if (test_clear_page_dirty(page))
> -			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
> -	}
>  out:
>  	if (buffers_to_free) {
>  		struct buffer_head *bh = buffers_to_free;
> diff --git a/mm/memory.c b/mm/memory.c
> index c00bac6..60e0945 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_space *mapping,
>  }
>  EXPORT_SYMBOL(unmap_mapping_range);
>  
> +static void check_last_page(struct address_space *mapping, loff_t size)
> +{
> +	pgoff_t index;
> +	unsigned int offset;
> +	struct page *page;
> +
> +	if (!mapping)
> +		return;
> +	offset = size & ~PAGE_MASK;
> +	if (!offset)
> +		return;
> +	index = size >> PAGE_SHIFT;
> +	page = find_lock_page(mapping, index);
> +	if (page) {
> +		unsigned int check = 0;
> +		unsigned char *kaddr = kmap_atomic(page, KM_USER0);
> +		do {
> +			check += kaddr[offset++];
> +		} while (offset < PAGE_SIZE);
> +		kunmap_atomic(kaddr, KM_USER0);
> +		unlock_page(page);
> +		page_cache_release(page);
> +		if (check)
> +			printk(KERN_ERR "%s: BADNESS: truncate check %u\n", current->comm, check);
> +	}
> +}
> +
>  /**
>   * vmtruncate - unmap mappings "freed" by truncate() syscall
>   * @inode: inode of the file used
> @@ -1875,6 +1902,7 @@ do_expand:
>  		goto out_sig;
>  	if (offset > inode->i_sb->s_maxbytes)
>  		goto out_big;
> +	check_last_page(mapping, inode->i_size);
>  	i_size_write(inode, offset);
>  
>  out_truncate:
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 237107c..f561e72 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -957,7 +957,7 @@ int test_set_page_writeback(struct page *page)
>  EXPORT_SYMBOL(test_set_page_writeback);
>  
>  /*
> - * Return true if any of the pages in the mapping are marged with the
> + * Return true if any of the pages in the mapping are marked with the
>   * passed tag.
>   */
>  int mapping_tagged(struct address_space *mapping, int tag)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d8a842a..900229a 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -432,7 +432,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	unsigned long address;
> -	pte_t *pte, entry;
> +	pte_t *ptep, entry;
>  	spinlock_t *ptl;
>  	int ret = 0;
>  
> @@ -440,22 +440,23 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
>  	if (address == -EFAULT)
>  		goto out;
>  
> -	pte = page_check_address(page, mm, address, &ptl);
> -	if (!pte)
> +	ptep = page_check_address(page, mm, address, &ptl);
> +	if (!ptep)
>  		goto out;
>  
> -	if (!pte_dirty(*pte) && !pte_write(*pte))
> +	if (!pte_dirty(*ptep) && !pte_write(*ptep))
>  		goto unlock;
>  
> -	entry = ptep_get_and_clear(mm, address, pte);
> -	entry = pte_mkclean(entry);
> +	entry = ptep_get_and_clear(mm, address, ptep);
>  	entry = pte_wrprotect(entry);
> -	ptep_establish(vma, address, pte, entry);
> +	ptep_establish(vma, address, ptep, entry);
> +	ret = ptep_clear_flush_dirty(vma, address, ptep) ||
> +		page_test_and_clear_dirty(page);
>  	lazy_mmu_prot_update(entry);
>  	ret = 1;
>  
>  unlock:
> -	pte_unmap_unlock(pte, ptl);
> +	pte_unmap_unlock(ptep, ptl);
>  out:
>  	return ret;
>  }
> 
> 


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-20 14:15                         ` Andrei Popa
@ 2006-12-20 14:23                           ` Peter Zijlstra
  2006-12-20 16:30                             ` Andrei Popa
  0 siblings, 1 reply; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-20 14:23 UTC (permalink / raw)
  To: andrei.popa
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Wed, 2006-12-20 at 16:15 +0200, Andrei Popa wrote:
> On Wed, 2006-12-20 at 00:42 +0100, Peter Zijlstra wrote:
> > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > 
> > > OR:
> > > 
> > >  - page_mkclean_one() is simply buggy.
> > 
> > GOLD!
> > 
> > it seems to work with all this (full diff against current git).
> > 
> > /me rebuilds full kernel to make sure...
> > reboot...
> > test...      pff the tension...
> > yay, still good!
> > 
> > Andrei; would you please verify.
> 
> I have corrupted files.

drad; and with this patch:
  http://lkml.org/lkml/2006/12/20/112

/me goes rebuild his kernel and try more than 3 times


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-20  9:01                           ` Peter Zijlstra
  2006-12-20  9:12                             ` Peter Zijlstra
  2006-12-20  9:39                             ` Arjan van de Ven
@ 2006-12-20 14:27                             ` Martin Schwidefsky
  2 siblings, 0 replies; 311+ messages in thread
From: Martin Schwidefsky @ 2006-12-20 14:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr, Heiko Carstens

On Wed, 2006-12-20 at 10:01 +0100, Peter Zijlstra wrote:
> Also, what is this page_test_and_clear_dirty() business, that seems to
> be exclusively s390 btw. However they do seem to need this.
> 
> > But the "ptep_get_and_clear() + flush_tlb_page()" sequence should
> > hopefully also work.
> 
> Yeah, probably, not optimally so on some archs that don't actually need
> the flush though. And as above, I wonder about s390.

Simple, the s390 architecture does not keep the dirty bit in the pte but
in something called the storage key. For each physical page there is one
associated storage key. It is accessed with special instructions like
"iske", "sske" or "rrbe". To clear the dirty bit the storage key of a
page is read with iske, the bit is cleared and the storage key is stored
back with sske. That means that clearing the dirty bit is not an atomic
operation. rrbe is used to test and clear the referenced bit (young/old
information) and is atomic in regard to other storage key operations. If
you think about it, the storage keys are quite nice for the operating
system, page_referenced() can be implemented with a single test
"page_test_and_clear_young()". No need to read all the ptes pointing to
the page. The downside is that the storage keys have a cost on the
hardware side.
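
A minimal sketch of the above as a user-space toy model, not kernel or
s390 code. The toy_* names, the array of keys and the bit positions are
assumptions made up for illustration; only the split between a
non-atomic read/modify/write for the dirty bit and a single-step
test-and-reset for the referenced bit is taken from the description.

```c
/* Toy user-space model of s390 storage keys; not kernel code.
 * All names and bit values here are illustrative assumptions. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PAGES	8
#define TOY_KEY_REF	0x04u	/* "referenced" bit (assumed position) */
#define TOY_KEY_CHG	0x02u	/* "changed" (dirty) bit (assumed position) */

/* One storage key per physical page frame, independent of any pte. */
static unsigned char toy_storage_key[TOY_NR_PAGES];

/* Stand-in for iske: read the key of a page frame. */
static unsigned char toy_iske(size_t pfn)
{
	assert(pfn < TOY_NR_PAGES);
	return toy_storage_key[pfn];
}

/* Stand-in for sske: store a key for a page frame. */
static void toy_sske(size_t pfn, unsigned char key)
{
	assert(pfn < TOY_NR_PAGES);
	toy_storage_key[pfn] = key;
}

/* Stand-in for rrbe: test and reset the referenced bit as one step. */
static bool toy_rrbe(size_t pfn)
{
	assert(pfn < TOY_NR_PAGES);
	bool was_referenced = (toy_storage_key[pfn] & TOY_KEY_REF) != 0;
	toy_storage_key[pfn] &= (unsigned char)~TOY_KEY_REF;
	return was_referenced;
}

/* Read, clear the bit, store back: three separate steps, so not atomic. */
static bool toy_page_test_and_clear_dirty(size_t pfn)
{
	unsigned char key = toy_iske(pfn);

	if (!(key & TOY_KEY_CHG))
		return false;
	toy_sske(pfn, (unsigned char)(key & ~TOY_KEY_CHG));
	return true;
}

/* One call per page; no walk over the ptes mapping it is needed. */
static bool toy_page_test_and_clear_young(size_t pfn)
{
	return toy_rrbe(pfn);
}
```

With a key set to referenced + changed, the first call to each helper
reports true and clears its bit, the second reports false.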

-- 
blue skies,
  Martin.

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH

"Reality continues to ruin my life." - Calvin.



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 11:26                               ` [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) Peter Zijlstra
  2006-12-20 11:39                                 ` Jesper Juhl
  2006-12-20 13:00                                 ` Hugh Dickins
@ 2006-12-20 14:55                                 ` Martin Schwidefsky
  2 siblings, 0 replies; 311+ messages in thread
From: Martin Schwidefsky @ 2006-12-20 14:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, Linus Torvalds, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr, Heiko Carstens, Arnd Bergmann

On Wed, 2006-12-20 at 12:26 +0100, Peter Zijlstra wrote:
> fix page_mkclean_one()
> 
> it had several issues:
>  - it failed to flush the cache
>  - it failed to flush the tlb
>  - it failed to do s390 (s390 guys, please verify this is now correct)

Sorry, page_mkclean is broken for s390. But it has already been broken
before your change. It is only more broken now.

> @@ -440,22 +440,23 @@ static int page_mkclean_one(struct page
>  	if (address == -EFAULT)
>  		goto out;
> 
> -	pte = page_check_address(page, mm, address, &ptl);
> -	if (!pte)
> +	ptep = page_check_address(page, mm, address, &ptl);
> +	if (!ptep)
>  		goto out;
> 
> -	if (!pte_dirty(*pte) && !pte_write(*pte))
> -		goto unlock;
> -
> -	entry = ptep_get_and_clear(mm, address, pte);
> -	entry = pte_mkclean(entry);
> -	entry = pte_wrprotect(entry);
> -	ptep_establish(vma, address, pte, entry);
> -	lazy_mmu_prot_update(entry);
> -	ret = 1;
> +	while (pte_dirty(*ptep) || pte_write(*ptep)) {
> +		pte_t entry = ptep_get_and_clear(mm, address, ptep);
> +		flush_cache_page(vma, address, pte_pfn(entry));
> +		flush_tlb_page(vma, address);
> +		(void)page_test_and_clear_dirty(page); /* do the s390 thing */
> +		entry = pte_wrprotect(entry);
> +		entry = pte_mkclean(entry);
> +		set_pte_at(vma, address, ptep, entry);
> +		lazy_mmu_prot_update(entry);
> +		ret = 1;
> +	}
> 
> -unlock:
> -	pte_unmap_unlock(pte, ptl);
> +	pte_unmap_unlock(ptep, ptl);
>  out:
>  	return ret;
>  }

1) pte_dirty() is always false. The reason is that s390 keeps the dirty
bit information in the storage key and not the pte. If pte_write is
false as well nothing is done. There really should be a 
	if (page_test_and_clear_dirty(page))
		ret = 1;
at the end of page_mkclean.

2) Please use ptep_clear_flush instead of ptep_get_and_clear +
flush_tlb_page. The former uses an optimization on s390 that flushes
just one TLB, the latter flushes every TLB of the current mm.

My try to fix this up is attached. It moves the flush_cache_page after
the flush_tlb_page (see asm-generic/pgtable.h for the generic definition
of ptep_clear_flush that is used for i386). I hope this doesn't break
anything else.
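
To illustrate point 1) only, here is a user-space toy model of why the
return value needs the storage key check. It is a sketch under
assumptions, not the rmap code: the toy_* structs and functions are
hypothetical, and the pte handling is reduced to two booleans.

```c
/* Toy model of the return-value problem; not the kernel's rmap code.
 * Every name below is hypothetical. */
#include <assert.h>
#include <stdbool.h>

struct toy_pte {
	bool write;	/* mapping is writable */
	bool dirty;	/* never set in the s390 case: dirtiness lives in the key */
};

struct toy_page {
	bool key_dirty;	/* stands in for the storage key's changed bit */
};

static bool toy_key_test_and_clear_dirty(struct toy_page *page)
{
	bool was_dirty = page->key_dirty;

	page->key_dirty = false;
	return was_dirty;
}

/* Per-mapping step: reports work only if the pte looked dirty or writable. */
static int toy_mkclean_one(struct toy_pte *pte)
{
	int ret = 0;

	while (pte->dirty || pte->write) {
		pte->write = false;
		pte->dirty = false;
		ret = 1;
	}
	return ret;
}

/* Whole-page step: fold in the storage key after visiting the mappings. */
static int toy_mkclean(struct toy_page *page, struct toy_pte *ptes, int nr)
{
	int ret = 0;

	assert(nr >= 0);
	for (int i = 0; i < nr; i++)
		ret |= toy_mkclean_one(&ptes[i]);
	if (toy_key_test_and_clear_dirty(page))
		ret = 1;
	return ret;
}
```

A page that is only mapped read-only but whose key is dirty gets 0 from
toy_mkclean_one() and 1 from toy_mkclean(), which is the case the extra
check is meant to catch.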

-- 
blue skies,
  Martin.

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH

"Reality continues to ruin my life." - Calvin.

---
 mm/rmap.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff -urpN linux-2.6/mm/rmap.c linux-2.6-mkclean/mm/rmap.c
--- linux-2.6/mm/rmap.c	2006-12-20 15:49:01.000000000 +0100
+++ linux-2.6-mkclean/mm/rmap.c	2006-12-20 15:51:14.000000000 +0100
@@ -445,10 +445,8 @@ static int page_mkclean_one(struct page 
 		goto out;
 
 	while (pte_dirty(*ptep) || pte_write(*ptep)) {
-		pte_t entry = ptep_get_and_clear(mm, address, ptep);
+		pte_t entry = ptep_clear_flush(vma, address, ptep);
 		flush_cache_page(vma, address, pte_pfn(entry));
-		flush_tlb_page(vma, address);
-		(void)page_test_and_clear_dirty(page); /* do the s390 thing */
 		entry = pte_wrprotect(entry);
 		entry = pte_mkclean(entry);
 		set_pte_at(vma, address, ptep, entry);
@@ -490,6 +488,8 @@ int page_mkclean(struct page *page)
 		if (mapping)
 			ret = page_mkclean_file(mapping, page);
 	}
+	if (page_test_and_clear_dirty(page))
+		ret = 1;
 
 	return ret;
 }



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-20 14:23                           ` Peter Zijlstra
@ 2006-12-20 16:30                             ` Andrei Popa
  2006-12-20 16:36                               ` Peter Zijlstra
  0 siblings, 1 reply; 311+ messages in thread
From: Andrei Popa @ 2006-12-20 16:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Wed, 2006-12-20 at 15:23 +0100, Peter Zijlstra wrote:
> On Wed, 2006-12-20 at 16:15 +0200, Andrei Popa wrote:
> > On Wed, 2006-12-20 at 00:42 +0100, Peter Zijlstra wrote:
> > > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > > 
> > > > OR:
> > > > 
> > > >  - page_mkclean_one() is simply buggy.
> > > 
> > > GOLD!
> > > 
> > > it seems to work with all this (full diff against current git).
> > > 
> > > /me rebuilds full kernel to make sure...
> > > reboot...
> > > test...      pff the tension...
> > > yay, still good!
> > > 
> > > Andrei; would you please verify.
> > 
> > I have corrupted files.
> 
> drad; and with this patch:
>   http://lkml.org/lkml/2006/12/20/112

Hash check on download completion found bad chunks, consider using
"safe_sync".

> 
> /me goes rebuild his kernel and try more than 3 times
> 


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-20 16:30                             ` Andrei Popa
@ 2006-12-20 16:36                               ` Peter Zijlstra
  0 siblings, 0 replies; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-20 16:36 UTC (permalink / raw)
  To: andrei.popa
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Wed, 2006-12-20 at 18:30 +0200, Andrei Popa wrote:
> On Wed, 2006-12-20 at 15:23 +0100, Peter Zijlstra wrote:
> > On Wed, 2006-12-20 at 16:15 +0200, Andrei Popa wrote:
> > > On Wed, 2006-12-20 at 00:42 +0100, Peter Zijlstra wrote:
> > > > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > > > 
> > > > > OR:
> > > > > 
> > > > >  - page_mkclean_one() is simply buggy.
> > > > 
> > > > GOLD!
> > > > 
> > > > it seems to work with all this (full diff against current git).
> > > > 
> > > > /me rebuilds full kernel to make sure...
> > > > reboot...
> > > > test...      pff the tension...
> > > > yay, still good!
> > > > 
> > > > Andrei; would you please verify.
> > > 
> > > I have corrupted files.
> > 
> > drad; and with this patch:
> >   http://lkml.org/lkml/2006/12/20/112
> 
> Hash check on download completion found bad chunks, consider using
> "safe_sync".

*sigh* back to square 1.

and I need to look at my reproduction case ;-(

Thanks for testing.


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 13:56                                   ` Peter Zijlstra
@ 2006-12-20 17:03                                     ` Martin Michlmayr
  2006-12-20 17:35                                       ` Linus Torvalds
  2006-12-20 22:11                                       ` Russell King
  0 siblings, 2 replies; 311+ messages in thread
From: Martin Michlmayr @ 2006-12-20 17:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Arjan van de Ven, Linus Torvalds, Andrei Popa,
	Andrew Morton, Linux Kernel Mailing List, Florian Weimer,
	Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann,
	gordonfarquharson

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2006-12-20 14:56]:
> page_mkclean_one() fix

This patch doesn't fix my problem (apt segfaults on ARM because its
database is corrupted).
-- 
Martin Michlmayr
http://www.cyrius.com/

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 17:03                                     ` Martin Michlmayr
@ 2006-12-20 17:35                                       ` Linus Torvalds
  2006-12-20 17:53                                         ` Martin Michlmayr
  2006-12-20 22:11                                       ` Russell King
  1 sibling, 1 reply; 311+ messages in thread
From: Linus Torvalds @ 2006-12-20 17:35 UTC (permalink / raw)
  To: Martin Michlmayr
  Cc: Peter Zijlstra, Hugh Dickins, Arjan van de Ven, Andrei Popa,
	Andrew Morton, Linux Kernel Mailing List, Florian Weimer,
	Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann,
	gordonfarquharson



On Wed, 20 Dec 2006, Martin Michlmayr wrote:

> * Peter Zijlstra <a.p.zijlstra@chello.nl> [2006-12-20 14:56]:
> > page_mkclean_one() fix
> 
> This patch doesn't fix my problem (apt segfaults on ARM because its
> database is corrupted).

Can you remind us:
 - your ARM is UP, right? Do you have PREEMPT on?
 - This is probably a stupid question, but you did make sure that the 
   database was ok (with some rebuild command) and that you didn't have 
   preexisting corruption?

Anyway, the page_mkclean_one() fixes (along with _most_ things we've 
looked at) shouldn't matter on UP, at least certainly not without PREEMPT.

		Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 17:35                                       ` Linus Torvalds
@ 2006-12-20 17:53                                         ` Martin Michlmayr
  2006-12-20 19:01                                           ` Linus Torvalds
  0 siblings, 1 reply; 311+ messages in thread
From: Martin Michlmayr @ 2006-12-20 17:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Hugh Dickins, Arjan van de Ven, Andrei Popa,
	Andrew Morton, Linux Kernel Mailing List, Florian Weimer,
	Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann,
	gordonfarquharson

* Linus Torvalds <torvalds@osdl.org> [2006-12-20 09:35]:
> Can you remind us:
>  - your ARM is UP, right? Do you have PREEMPT on?

It's UP and PREEMPT is not set.  I used 2.6.19 plus the patch that has
been posted.

>  - This is probably a stupid question, but you did make sure that the
>    database was ok (with some rebuild command) and that you didn't have
>    preexisting corruption?

Yes, my test case is to install Debian on the ARM machine so the
database is created fresh.  While the corruption always triggers
during a fresh installation, it's much harder to see in a running
system.  Some people see it on their system but I haven't found a 100%
working recipe to reproduce it yet given a working system; doing a new
installation seems to trigger it all the time though.

> Anyway, the page_mkclean_one() fixes (along with _most_ things we've
> looked at) shouldn't matter on UP, at least certainly not without
> PREEMPT.

Hmm.  So what about UP without PREEMPT then...

Maybe the following information is helpful in some way: remember how I
said that we have applied 6 mm patches to 2.6.18 in Debian?  According
to Gordon Farquharson, who's helping me a great deal with testing
installation on this ARM machine (Linksys NSLU2), the corruption
doesn't always show up when you only apply
mm-tracking-shared-dirty-pages.patch to 2.6.18 but it shows up all the
time with all six patches applied.  As a reminder, the 6 patches we
apply are:

mm-tracking-shared-dirty-pages.patch
mm-balance-dirty-pages.patch
mm-optimize-mprotect.patch
mm-install_page-cleanup.patch
mm-do_wp_page-fixup.patch
mm-msync-cleanup.patch

-- 
Martin Michlmayr
http://www.cyrius.com/

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 21:30                             ` Peter Zijlstra
  2006-12-19 22:51                               ` Linus Torvalds
@ 2006-12-20 18:02                               ` Stephen Clark
  1 sibling, 0 replies; 311+ messages in thread
From: Stephen Clark @ 2006-12-20 18:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Nick Piggin, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

Peter Zijlstra wrote:

>On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote:
>  
>
>>On Tue, 19 Dec 2006, Linus Torvalds wrote:
>>    
>>
>>> here's a totally new tangent on this: it's possible that user code is 
>>>simply BUGGY. 
>>>      
>>>
>
>I'm sad to say this doesn't trigger :-(
>
Hi all,

I ran it a number of times on 2.6.16-1.2115_FC4 and always got
 ./a.out | od -x
0000000 aaaa aaaa aaaa aaaa aaaa aaaa aaaa aaaa
0000020 aaaa aaaa 5555 5555 5555 5555 5555 5555
0000040 5555 5555 5555 5555

but running it on 2.6.19-rc5 I always get zeros in the middle.

Steve

-- 

"They that give up essential liberty to obtain temporary safety, 
deserve neither liberty nor safety."  (Ben Franklin)

"The course of history shows that as a government grows, liberty 
decreases."  (Thomas Jefferson)




^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 17:53                                         ` Martin Michlmayr
@ 2006-12-20 19:01                                           ` Linus Torvalds
  2006-12-20 19:50                                             ` Linus Torvalds
  0 siblings, 1 reply; 311+ messages in thread
From: Linus Torvalds @ 2006-12-20 19:01 UTC (permalink / raw)
  To: Martin Michlmayr
  Cc: Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Andrei Popa, Andrew Morton, Linux Kernel Mailing List,
	Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann, gordonfarquharson



On Wed, 20 Dec 2006, Martin Michlmayr wrote:
> 
> > Anyway, the page_mkclean_one() fixes (along with _most_ things we've
> > looked at) shouldn't matter on UP, at least certainly not without
> > PREEMPT.
> 
> Hmm.  So what about UP without PREEMPT then...

So that's why I've been harping on the fact that I think we simply do 
really wrong things with PG_dirty at times, and that I find it confusing 
that there's

 - clear_page_dirty_for_io(): this one makes sense. The name makes sense, 
   and the implementation makes sense (which is _not_ the same thing as 
   "works", of course - "makes sense" does not mean "no bugs" ;).

 - test_clear_page_dirty: this one makes no sense WHATSOEVER, except as a 
   buggy way to do the "_for_io()" case. This makes sense neither from a 
   concept angle _or_ an implementation angle (the whole "test_" part is 
   nonsense: why would anybody care? What operation does this? What can it 
   do if the page is dirty?). It also has no sensible thing it can do to the 
   page tables.

 - clear_page_dirty(): this one makes sense only as a "cancel" operation, 
   for vmtruncate and friends (it's different from the "_for_io()" case in 
   several ways:
	(a) we should have unmapped such pages forcibly _anyway_, so 
	    looking at the PTEs makes no sense.
	(b) because we're not starting IO, we don't have the "mark for 
	    writeback" case, and we need to clear the dirty tags from the 
	    radix trees etc since the writeback logic won't do it for us.)
   The _implementation_ of "clear_page_dirty()" doesn't make sense, but 
   the concept does.

I've repeated that theory a few times, but neither Andrew nor Nick seem to 
really believe in it. So I'll just repeat it once more, only to be shot 
down. I think we have three operations, one of which is totally idiotic 
and senseless, and one of which is just badly implemented.
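
A rough user-space sketch of the two operations that do make sense, to 
show how they differ. This is a toy model under assumptions, not the 
kernel implementation: the struct, the toy_* names and the way the 
radix-tree tag is dropped when writeback starts are all hypothetical.

```c
/* Toy model of the two dirty-clearing operations that "make sense";
 * not kernel code, and every name is hypothetical. */
#include <assert.h>
#include <stdbool.h>

struct toy_page {
	bool pg_dirty;		/* stands in for PG_dirty */
	bool pg_writeback;	/* stands in for PG_writeback */
	bool tag_dirty;		/* stands in for the radix-tree dirty tag */
	int  nr_mapped;		/* how many ptes still map the page */
	bool pte_dirty;		/* some mapping has a dirty pte */
};

/* "for IO": fold pte dirtiness into the page, then clear PG_dirty.
 * The dirty tag is left alone here; starting writeback deals with it. */
static bool toy_clear_page_dirty_for_io(struct toy_page *page)
{
	if (page->pte_dirty) {
		page->pte_dirty = false;	/* clean + write-protect the ptes */
		page->pg_dirty = true;
	}
	bool was_dirty = page->pg_dirty;
	page->pg_dirty = false;
	return was_dirty;
}

/* Assumption: the writeback path is what drops the dirty tag. */
static void toy_start_writeback(struct toy_page *page)
{
	page->pg_writeback = true;
	if (!page->pg_dirty)
		page->tag_dirty = false;
}

/* "cancel": the caller must already have unmapped the page, so there is
 * no pte walk; no IO follows, so the tag has to be cleared right here. */
static void toy_cancel_page_dirty(struct toy_page *page)
{
	assert(page->nr_mapped == 0);
	page->pg_dirty = false;
	page->tag_dirty = false;
}
```

In this model the "_for_io()" path has to look at the ptes and relies on 
writeback for the tag, while the cancel path touches neither ptes nor 
writeback state.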

> Maybe the following information is helpful in some way: remember how I
> said that we have applied 6 mm patches to 2.6.18 in Debian?  According
> to Gordon Farquharson, who's helping me a great deal with testing
> installation on this ARM machine (Linksys NSLU2), the corruption
> doesn't always show up when you only apply
> mm-tracking-shared-dirty-pages.patch to 2.6.18 but it shows up all the
> time with all six patches applied.

I think the "it happens occasionally with just the first patch" is the 
really important part. The other patches really are likely to just change 
writeback timing behaviour (_especially_ the "tracking-shared-dirty-pages" 
patch), but if it happens occasionally even with the first one, that's the 
one that almost certainly introduced the real problem.

And my argument above is actually that the "real problem" goes a hell of a 
lot further back in time, but it didn't use to be a problem because we 
just considered dirty bits in the page tables to be something _completely_ 
independent of the "page dirty" status, so historically, it just didn't 
matter that we had insane implementations and senseless operations.

			Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 19:01                                           ` Linus Torvalds
@ 2006-12-20 19:50                                             ` Linus Torvalds
  2006-12-20 20:22                                               ` Peter Zijlstra
                                                                 ` (6 more replies)
  0 siblings, 7 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-20 19:50 UTC (permalink / raw)
  To: Martin Michlmayr
  Cc: Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Andrei Popa, Andrew Morton, Linux Kernel Mailing List,
	Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann, gordonfarquharson



On Wed, 20 Dec 2006, Linus Torvalds wrote:
> 
> So that's why I've been harping on the fact that I think we simply do 
> really wrong things with PG_dirty at times [ ... ]

Ok, I'll just put my money where my mouth is, and suggest a patch like 
THIS instead.

This one clears up all the issues I find irritating:

 - "test_clear_page_dirty()" is insane, both conceptually and as an 
   implementation. "Give me a 'C', give me an 'R', give me an 'A', give me 
   a 'P'".

   So rip out that mindfart entirely.

 - "clear_page_dirty()" is badly named, and should be about CANCELLING the 
   dirty bit, and must never be called with pages mapped anyway. So throw 
   that out too, and replace it with a new function:

	void cancel_dirty_page(struct page *page, unsigned int accounting_size);

 - "clear_page_dirty_for_io()" is fine.

And with that, I then either rip out any old users of 
"test_clear_page_dirty()" or "clear_page_dirty()", and if appropriate (and 
it's really only appropriate for "truncate()"), I replace them with the new 
"cancel_dirty_page()". Most of the time, they should just be deleted 
entirely.

NOTE NOTE NOTE! I _only_ did enough to make things compile for my 
particular configuration. That means that right now the following 
filesystems are broken with this patch (because they use the totally 
broken old crap):

	CIFS, FUSE, JFS, ReiserFS, XFS

and I don't know exactly what they need to be fixed. But most likely their 
usage was insane and pointless anyway (looking at the ReiserFS case, for 
example, that was DEFINITELY the case. I can't even imagine what the heck 
it thinks it is doing).

Anyway, I'm not at all guaranteeing that this solves anything at all. I 
_do_ guarantee that this is a h*ll of a lot saner than what we had before.

[ This also includes a few of my older patches, I didn't bother to sort 
  them out, and the fs/buffer.c patch is required because it got rid of 
  one of the insane uses of test_clear_page_dirty().

  So this goes directly on top of current -git, with no other changes in 
  the tree. ]

Nick, Hugh, Peter, Andrew? Comments? 

Martin, Andrei, does this make any difference for your corruption cases?

		Linus

---
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageDirty(page) || PageWriteback(page))
 		return 0;
 
 	if (mapping == NULL) {		/* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
 	spin_unlock(&mapping->private_lock);
-	if (ret) {
-		/*
-		 * If the filesystem writes its buffers by hand (eg ext3)
-		 * then we can have clean buffers against a dirty page.  We
-		 * clean the page here; otherwise later reattachment of buffers
-		 * could encounter a non-uptodate page, which is unresolvable.
-		 * This only applies in the rare case where try_to_free_buffers
-		 * succeeds but the page is not freed.
-		 *
-		 * Also, during truncate, discard_buffer will have marked all
-		 * the page's buffers clean.  We discover that here and clean
-		 * the page also.
-		 */
-		if (test_clear_page_dirty(page))
-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
-	}
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..4f4cd13 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct file *file,
 
 static void truncate_huge_page(struct page *page)
 {
-	clear_page_dirty(page);
+	cancel_dirty_page(page, /* No IO accounting for huge pages? */0);
 	ClearPageUptodate(page);
 	remove_from_page_cache(page);
 	put_page(page);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..350878a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,15 +253,11 @@ static inline void SetPageUptodate(struct page *page)
 
 struct page;	/* forward declaration */
 
-int test_clear_page_dirty(struct page *page);
+extern void cancel_dirty_page(struct page *page, unsigned int account_size);
+
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);
 
-static inline void clear_page_dirty(struct page *page)
-{
-	test_clear_page_dirty(page);
-}
-
 static inline void set_page_writeback(struct page *page)
 {
 	test_set_page_writeback(page);
diff --git a/mm/memory.c b/mm/memory.c
index c00bac6..79cecab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
+static void check_last_page(struct address_space *mapping, loff_t size)
+{
+	pgoff_t index;
+	unsigned int offset;
+	struct page *page;
+
+	if (!mapping)
+		return;
+	offset = size & ~PAGE_MASK;
+	if (!offset)
+		return;
+	index = size >> PAGE_SHIFT;
+	page = find_lock_page(mapping, index);
+	if (page) {
+		unsigned int check = 0;
+		unsigned char *kaddr = kmap_atomic(page, KM_USER0);
+		do {
+			check += kaddr[offset++];
+		} while (offset < PAGE_SIZE);
+		kunmap_atomic(kaddr,KM_USER0);
+		unlock_page(page);
+		page_cache_release(page);
+		if (check)
+			printk("%s: BADNESS: truncate check %u\n", current->comm, check);
+	}
+}
+
 /**
  * vmtruncate - unmap mappings "freed" by truncate() syscall
  * @inode: inode of the file used
@@ -1875,6 +1902,7 @@ do_expand:
 		goto out_sig;
 	if (offset > inode->i_sb->s_maxbytes)
 		goto out_big;
+	check_last_page(mapping, inode->i_size);
 	i_size_write(inode, offset);
 
 out_truncate:
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..b3a198c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -845,38 +845,6 @@ int set_page_dirty_lock(struct page *page)
 EXPORT_SYMBOL(set_page_dirty_lock);
 
 /*
- * Clear a page's dirty flag, while caring for dirty memory accounting. 
- * Returns true if the page was previously dirty.
- */
-int test_clear_page_dirty(struct page *page)
-{
-	struct address_space *mapping = page_mapping(page);
-	unsigned long flags;
-
-	if (!mapping)
-		return TestClearPageDirty(page);
-
-	write_lock_irqsave(&mapping->tree_lock, flags);
-	if (TestClearPageDirty(page)) {
-		radix_tree_tag_clear(&mapping->page_tree,
-				page_index(page), PAGECACHE_TAG_DIRTY);
-		write_unlock_irqrestore(&mapping->tree_lock, flags);
-		/*
-		 * We can continue to use `mapping' here because the
-		 * page is locked, which pins the address_space
-		 */
-		if (mapping_cap_account_dirty(mapping)) {
-			page_mkclean(page);
-			dec_zone_page_state(page, NR_FILE_DIRTY);
-		}
-		return 1;
-	}
-	write_unlock_irqrestore(&mapping->tree_lock, flags);
-	return 0;
-}
-EXPORT_SYMBOL(test_clear_page_dirty);
-
-/*
  * Clear a page's dirty flag, while caring for dirty memory accounting.
  * Returns true if the page was previously dirty.
  *
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..bf9e296 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -51,6 +51,20 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
 		do_invalidatepage(page, partial);
 }
 
+void cancel_dirty_page(struct page *page, unsigned int account_size)
+{
+	/* If we're cancelling the page, it had better not be mapped any more */
+	if (page_mapped(page)) {
+		static unsigned int warncount;
+
+		WARN_ON(++warncount < 5);
+	}
+		
+	if (TestClearPageDirty(page) && account_size)
+		task_io_account_cancelled_write(account_size);
+}
+
+
 /*
  * If truncate cannot remove the fs-private metadata from the page, the page
  * becomes anonymous.  It will be left on the LRU and may even be mapped into
@@ -70,8 +84,8 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
-	if (test_clear_page_dirty(page))
-		task_io_account_cancelled_write(PAGE_CACHE_SIZE);
+	cancel_dirty_page(page, PAGE_CACHE_SIZE);
+
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);
 	remove_from_page_cache(page);
@@ -350,7 +364,6 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 		for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 			pgoff_t page_index;
-			int was_dirty;
 
 			lock_page(page);
 			if (page->mapping != mapping) {
@@ -386,12 +399,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 					  PAGE_CACHE_SIZE, 0);
 				}
 			}
-			was_dirty = test_clear_page_dirty(page);
-			if (!invalidate_complete_page2(mapping, page)) {
-				if (was_dirty)
-					set_page_dirty(page);
+			if (!invalidate_complete_page2(mapping, page))
 				ret = -EIO;
-			}
 			unlock_page(page);
 		}
 		pagevec_release(&pvec);
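
The check_last_page() debugging helper in the mm/memory.c hunk above
splits the old file size into a page index and an offset within that
page, then sums every byte from that offset to the end of the page. A
nonzero sum means something scribbled past EOF. Below is a minimal
user-space sketch of just that arithmetic. It assumes 4096-byte pages,
and every toy_* name is invented for illustration; none of them are
kernel symbols.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page geometry for the sketch: 4096-byte pages. */
#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)
#define TOY_PAGE_MASK  (~(TOY_PAGE_SIZE - 1))

/* Index of the page that holds the first byte past EOF. */
static unsigned long toy_last_page_index(unsigned long long size)
{
	return (unsigned long)(size >> TOY_PAGE_SHIFT);
}

/* Offset of EOF inside that page; 0 means the file ends on a page boundary. */
static unsigned int toy_last_page_offset(unsigned long long size)
{
	return (unsigned int)(size & ~TOY_PAGE_MASK);
}

/*
 * Sum the bytes from `offset` to the end of the page.  With 4096-byte
 * pages the sum cannot wrap, so zero means "all bytes past EOF are zero".
 */
static unsigned int toy_tail_checksum(const unsigned char *page,
				      unsigned int offset)
{
	unsigned int check = 0;

	assert(page != NULL && offset <= TOY_PAGE_SIZE);
	while (offset < TOY_PAGE_SIZE)
		check += page[offset++];
	return check;
}
```

For a 10000-byte file this gives page index 2 and offset 1808, so the
check covers bytes 1808..4095 of the third page. The patch itself skips
the check entirely when the offset is zero.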

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 19:50                                             ` Linus Torvalds
@ 2006-12-20 20:22                                               ` Peter Zijlstra
  2006-12-20 21:55                                               ` Dave Kleikamp
                                                                 ` (5 subsequent siblings)
  6 siblings, 0 replies; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-20 20:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin Michlmayr, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Andrei Popa, Andrew Morton, Linux Kernel Mailing List,
	Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann, gordonfarquharson

On Wed, 2006-12-20 at 11:50 -0800, Linus Torvalds wrote:

> Nick, Hugh, Peter, Andrew? Comments? 

Hooray! I'm all for this cleanup. Let us see where this road leads..




^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 19:50                                             ` Linus Torvalds
  2006-12-20 20:22                                               ` Peter Zijlstra
@ 2006-12-20 21:55                                               ` Dave Kleikamp
  2006-12-20 22:25                                                 ` Linus Torvalds
  2006-12-20 22:15                                               ` Peter Zijlstra
                                                                 ` (4 subsequent siblings)
  6 siblings, 1 reply; 311+ messages in thread
From: Dave Kleikamp @ 2006-12-20 21:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin,
	Arjan van de Ven, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann,
	gordonfarquharson

On Wed, 2006-12-20 at 11:50 -0800, Linus Torvalds wrote:

> NOTE NOTE NOTE! I _only_ did enough to make things compile for my
> particular configuration. That means that right now the following
> filesystems are broken with this patch (because they use the totally
> broken old crap):
> 
> 	CIFS, FUSE, JFS, ReiserFS, XFS
> 
> and I don't know exactly what they need to be fixed. But most likely their
> usage was insane and pointless anyway (looking at the ReiserFS case, for
> example, that was DEFINITELY the case. I can't even imagine what the heck
> it thinks it is doing).

Here's a patch to get rid of clear_page_dirty() from jfs.  I'm not
convinced it was totally broken, but I'm not convinced it wasn't.
Either way, I don't think that bit of code was particularly beneficial.

Feel free to apply this patch independent of your patch if you really
think that jfs's use of clear_page_dirty is crap, or I can push it
through -mm first.

This patch removes some questionable code that attempted to make a
no-longer-used page easier to reclaim.

Calling metapage_writepage against such a page will not result in any
I/O being performed, so removing this code shouldn't be a big deal.

Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>

diff -Nurp linux-orig/fs/jfs/jfs_metapage.c linux/fs/jfs/jfs_metapage.c
--- linux-orig/fs/jfs/jfs_metapage.c	2006-12-07 17:12:58.000000000 -0600
+++ linux/fs/jfs/jfs_metapage.c	2006-12-20 15:19:48.000000000 -0600
@@ -764,22 +764,9 @@ void release_metapage(struct metapage * 
 	} else if (mp->lsn)	/* discard_metapage doesn't remove it */
 		remove_from_logsync(mp);
 
-#if MPS_PER_PAGE == 1
-	/*
-	 * If we know this is the only thing in the page, we can throw
-	 * the page out of the page cache.  If pages are larger, we
-	 * don't want to do this.
-	 */
-
-	/* Retest mp->count since we may have released page lock */
-	if (test_bit(META_discard, &mp->flag) && !mp->count) {
-		clear_page_dirty(page);
-		ClearPageUptodate(page);
-	}
-#else
 	/* Try to keep metapages from using up too much memory */
 	drop_metapage(page, mp);
-#endif
+
 	unlock_page(page);
 	page_cache_release(page);
 }



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 17:03                                     ` Martin Michlmayr
  2006-12-20 17:35                                       ` Linus Torvalds
@ 2006-12-20 22:11                                       ` Russell King
  2006-12-21  8:18                                         ` Martin Michlmayr
  1 sibling, 1 reply; 311+ messages in thread
From: Russell King @ 2006-12-20 22:11 UTC (permalink / raw)
  To: Martin Michlmayr
  Cc: Peter Zijlstra, Hugh Dickins, Arjan van de Ven, Linus Torvalds,
	Andrei Popa, Andrew Morton, Linux Kernel Mailing List,
	Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann, gordonfarquharson

On Wed, Dec 20, 2006 at 06:03:23PM +0100, Martin Michlmayr wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> [2006-12-20 14:56]:
> > page_mkclean_one() fix
> 
> This patch doesn't fix my problem (apt segfaults on ARM because its
> database is corrupted).

Are you using IDE in PIO mode?  If so, the bug probably lies there.

As I've said repeatedly when asked by IDE folk to test their PIO-based
cache coherency fixes, I am unable to reproduce the bug, ergo I am
unable to test the fix.

(Some people, such as Jeff Garzik to name names, took that as me being
entirely unreasonable and un-cooperative.  But consider carefully - how
can _anyone_ test something that they can't reproduce?  I consider Jeff's
comments extremely childish in that respect.)

Hence, as far as I'm aware, Linux on PIO-based IDE ARM hardware remains
utterly *unsafe*.

Sorry.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 19:50                                             ` Linus Torvalds
  2006-12-20 20:22                                               ` Peter Zijlstra
  2006-12-20 21:55                                               ` Dave Kleikamp
@ 2006-12-20 22:15                                               ` Peter Zijlstra
  2006-12-20 22:20                                                 ` Peter Zijlstra
                                                                   ` (2 more replies)
  2006-12-20 23:24                                               ` David Chinner
                                                                 ` (3 subsequent siblings)
  6 siblings, 3 replies; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-20 22:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin Michlmayr, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Andrei Popa, Andrew Morton, Linux Kernel Mailing List,
	Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann, gordonfarquharson

I think this is also needed:

---
 mm/truncate.c |    7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -320,19 +320,14 @@ invalidate_complete_page2(struct address
 	if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL))
 		return 0;
 
+	cancel_dirty_page(page, PAGE_CACHE_SIZE);
 	lock_page_ref_irq(page);
-	if (PageDirty(page))
-		goto failed;
-
 	BUG_ON(PagePrivate(page));
 	__remove_from_page_cache(page);
 	unlock_page_ref_irq(page);
 	ClearPageUptodate(page);
 	page_cache_release(page);	/* pagecache ref */
 	return 1;
-failed:
-	unlock_page_ref_irq(page);
-	return 0;
 }
 
 /**



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 22:15                                               ` Peter Zijlstra
@ 2006-12-20 22:20                                                 ` Peter Zijlstra
  2006-12-20 22:49                                                 ` Linus Torvalds
  2006-12-21  2:36                                                 ` [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) Trond Myklebust
  2 siblings, 0 replies; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-20 22:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin Michlmayr, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Andrei Popa, Andrew Morton, Linux Kernel Mailing List,
	Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann, gordonfarquharson

On Wed, 2006-12-20 at 23:15 +0100, Peter Zijlstra wrote:
> I think this is also needed:

See also:
  http://marc.theaimsgroup.com/?l=linux-kernel&m=116603599904278&w=2

> ---
>  mm/truncate.c |    7 +------
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> Index: linux-2.6/mm/truncate.c
> ===================================================================
> --- linux-2.6.orig/mm/truncate.c
> +++ linux-2.6/mm/truncate.c
> @@ -320,19 +320,14 @@ invalidate_complete_page2(struct address
>  	if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL))
>  		return 0;
>  
> +	cancel_dirty_page(page, PAGE_CACHE_SIZE);
>  	lock_page_ref_irq(page);
> -	if (PageDirty(page))
> -		goto failed;
> -
>  	BUG_ON(PagePrivate(page));
>  	__remove_from_page_cache(page);
>  	unlock_page_ref_irq(page);
>  	ClearPageUptodate(page);
>  	page_cache_release(page);	/* pagecache ref */
>  	return 1;
> -failed:
> -	unlock_page_ref_irq(page);
> -	return 0;
>  }
>  
>  /**
> 
> 


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 21:55                                               ` Dave Kleikamp
@ 2006-12-20 22:25                                                 ` Linus Torvalds
  2006-12-20 22:59                                                   ` Dave Kleikamp
  0 siblings, 1 reply; 311+ messages in thread
From: Linus Torvalds @ 2006-12-20 22:25 UTC (permalink / raw)
  To: Dave Kleikamp
  Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin,
	Arjan van de Ven, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann,
	gordonfarquharson



On Wed, 20 Dec 2006, Dave Kleikamp wrote:
> 
> This patch removes some questionable code that attempted to make a
> no-longer-used page easier to reclaim.

If so, "cancel_dirty_page()" may actually be the right thing to use, but 
only if you can guarantee that the page isn't mapped anywhere (and from 
the name of the function I guess it's not something that you'll ever map?)

So the JFS code _looks_ like you could just replace the

	clear_page_dirty(page);

with

	cancel_dirty_page(page, PAGE_CACHE_SIZE);

(where that second parameter is just used for statistics - it updates the 
"cancelled IO" byte-counts if CONFIG_TASK_IO_ACCOUNTING is set - so the 
number doesn't really matter, you could make it zero if you never want the 
thing to show up in the IO accounting).
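
As a rough illustration of the semantics described here, the following
is a user-space toy model and not the kernel code: struct toy_page, the
byte counter and the toy_* functions are all made up for the sketch. It
only shows that the dirty flag is always cleared, while the size
argument decides whether anything is added to the "cancelled write"
statistics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for struct page; only the fields the sketch needs. */
struct toy_page {
	bool dirty;	/* models PG_dirty */
	int mapcount;	/* models "is this page mapped anywhere?" */
};

/* Models the per-task "cancelled write" byte counter. */
static unsigned long toy_cancelled_write_bytes;

/* Models task_io_account_cancelled_write(). */
static void toy_account_cancelled_write(unsigned int bytes)
{
	toy_cancelled_write_bytes += bytes;
}

/*
 * Models cancel_dirty_page(): always drop the dirty flag, but only
 * touch the statistics if the page really was dirty and the caller
 * passed a nonzero size.
 */
static void toy_cancel_dirty_page(struct toy_page *page,
				  unsigned int account_size)
{
	bool was_dirty;

	assert(page != NULL);

	/* Cancelling a page that is still mapped is a caller bug. */
	if (page->mapcount > 0)
		fprintf(stderr, "warning: cancelling a mapped page\n");

	was_dirty = page->dirty;
	page->dirty = false;
	if (was_dirty && account_size)
		toy_account_cancelled_write(account_size);
}
```

Calling toy_cancel_dirty_page(&page, 4096) on a dirty page clears the
flag and adds 4096 to the counter; passing 0 instead clears the flag and
leaves the counter alone, which is the "never show up in the IO
accounting" case.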

			Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 22:15                                               ` Peter Zijlstra
  2006-12-20 22:20                                                 ` Peter Zijlstra
@ 2006-12-20 22:49                                                 ` Linus Torvalds
  2006-12-20 23:03                                                   ` Peter Zijlstra
  2006-12-21  2:36                                                 ` [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) Trond Myklebust
  2 siblings, 1 reply; 311+ messages in thread
From: Linus Torvalds @ 2006-12-20 22:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Martin Michlmayr, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Andrei Popa, Andrew Morton, Linux Kernel Mailing List,
	Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann, gordonfarquharson



On Wed, 20 Dec 2006, Peter Zijlstra wrote:
>
> I think this is also needed:

Yeah, that looks about right. Although I think it should go above the 
"try_to_release_page()", because right now we do that "ttrp()" with the 
dirty bit set, and we should let the low-level filesystem just know that 
it's simply not interesting any more (and, indeed, "try_to_free_buffers()" 
too, for that matter).

Anyway, I think that's a detail. I'd rather know whether this all actually 
makes any difference what-so-ever to the corruption behaviour of Andrei 
&co. 

Maybe the UP ARM case is some strange dcache alias issue with PIO IDE, and 
the only reason that started showing up at the same time is the different 
IO loads. Who knows.

[ Although I think you may have been on the right track with that dcache 
  flushing stuff in "page_mkclean()".. It might not have been quite 
  all there, but I think we should go back and look very closely at 
  page_mkclean() regardless of any other issues! ]

So far, my whole "cancel_dirty_page/clear_page_dirty_for_io" patch has 
really been just a "try to make the code _look_ sane". I don't think we 
have a single report that the patch actually makes any difference yet.

		Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 22:25                                                 ` Linus Torvalds
@ 2006-12-20 22:59                                                   ` Dave Kleikamp
  0 siblings, 0 replies; 311+ messages in thread
From: Dave Kleikamp @ 2006-12-20 22:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin,
	Arjan van de Ven, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann,
	gordonfarquharson

On Wed, 2006-12-20 at 14:25 -0800, Linus Torvalds wrote:
> 
> On Wed, 20 Dec 2006, Dave Kleikamp wrote:
> >
> > This patch removes some questionable code that attempted to make a
> > no-longer-used page easier to reclaim.
> 
> If so, "cancel_dirty_page()" may actually be the right thing to use, but
> only if you can guarantee that the page isn't mapped anywhere (and from
> the name of the function I guess it's not something that you'll ever map?)

That's correct.  It can't be mapped.  It's a private mapping only used
for metadata.

I'm really not sure the code in question is having the intended effect.
Maybe one of the gurus on cc: can take a look at the code and tell me if
it's worth keeping.  I apologize in advance if it makes anyone lose
their lunch.

> So the JFS code _looks_ like you could just replace the
> 
> 	clear_page_dirty(page);
> 
> with
> 
> 	cancel_dirty_page(page, PAGE_CACHE_SIZE);
> 
> (where that second parameter is just used for statistics - it updates the
> "cancelled IO" byte-counts if CONFIG_TASK_IO_ACCOUNTING is set - so the
> number doesn't really matter, you could make it zero if you never want the
> thing to show up in the IO accounting).

I'm not sure whether zero or PAGE_CACHE_SIZE would be better.  The
situation is where some page of metadata is no longer used, say
shrinking a directory tree or truncating a file and throwing out the
extent tree.

Thanks,
Shaggy
-- 
David Kleikamp
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 22:49                                                 ` Linus Torvalds
@ 2006-12-20 23:03                                                   ` Peter Zijlstra
  2006-12-21  9:16                                                     ` Martin Schwidefsky
  0 siblings, 1 reply; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-20 23:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin Michlmayr, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Andrei Popa, Andrew Morton, Linux Kernel Mailing List,
	Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann, gordonfarquharson

On Wed, 2006-12-20 at 14:49 -0800, Linus Torvalds wrote:
> 
> On Wed, 20 Dec 2006, Peter Zijlstra wrote:
> >
> > I think this is also needed:
> 
> Yeah, that looks about right. Although I think it should go above the 
> "try_to_release_page()", because right now we do that "ttrp()" with the 
> dirty bit set, and we should let the low-level filesystem just know that 
> it's simply not interesting any more (and, indeed, "try_to_free_buffers()" 
> too, for that matter).

That makes NFS unhappy, see nfs_release_page().

> Anyway, I think that's a detail. I'd rather know whether this all actually 
> makes any difference what-so-ever to the corruption behaviour of Andrei 
> &co. 

Yeah, I have to tinker with my test setup to make it fail again. Maybe I
have to add more seeds, that seemed to make a difference, it was
impossible to trigger with a single seed.

FWIW I also added some scribble past i_size checks in nobh_writepage()
and block_write_full_page().

FWIW2 I straced rtorrent for a bit and it does an awful lot of mmap
calls and relatively few msync(MS_ASYNC);munmap(), and no truncate apart
from creating sparse files at the beginning.

> Maybe the UP ARM case is some strange dcache alias issue with PIO IDE, and 
> the only reason that started showing up at the same time is the different 
> IO loads. Who knows.
> 
> [ Although I think you may have been on the right track with that dcache 
>   flushing stuff in "page_mkclean()".. It might not have been quite 
>   all there, but I think we should go back and look very closely at 
>   page_mkclean() regardless of any other issues! ]

current version

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/rmap.c |   23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -432,7 +432,7 @@ static int page_mkclean_one(struct page 
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
-	pte_t *pte, entry;
+	pte_t *pte;
 	spinlock_t *ptl;
 	int ret = 0;
 
@@ -444,17 +444,18 @@ static int page_mkclean_one(struct page 
 	if (!pte)
 		goto out;
 
-	if (!pte_dirty(*pte) && !pte_write(*pte))
-		goto unlock;
+	while (pte_dirty(*pte) || pte_write(*pte)) {
+		pte_t entry;
 
-	entry = ptep_get_and_clear(mm, address, pte);
-	entry = pte_mkclean(entry);
-	entry = pte_wrprotect(entry);
-	ptep_establish(vma, address, pte, entry);
-	lazy_mmu_prot_update(entry);
-	ret = 1;
+		flush_cache_page(vma, address, pte_pfn(*pte));
+		entry = ptep_clear_flush(vma, address, pte);
+		entry = pte_wrprotect(entry);
+		entry = pte_mkclean(entry);
+		ptep_establish(vma, address, pte, entry);
+		lazy_mmu_prot_update(entry);
+		ret = 1;
+	}
 
-unlock:
 	pte_unmap_unlock(pte, ptl);
 out:
 	return ret;
@@ -489,6 +490,8 @@ int page_mkclean(struct page *page)
 		if (mapping)
 			ret = page_mkclean_file(mapping, page);
 	}
+	if (page_test_and_clear_dirty(page))
+		ret = 1;
 
 	return ret;
 }


> So far, my whole "cancel_dirty_page/clear_page_dirty_for_io" patch has 
> really been just a "try to make the code _look_ sane". I don't think we 
> have a single report that the patch actually makes any difference yet.

I failed to compile a kernel with that patch (100% iowait and a bunch of
processes stuck in D state), and sysrq-t was borked (only numbers, no
symbols), so I have yet to retry - I noticed you kicked the unwinder?


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 19:50                                             ` Linus Torvalds
                                                                 ` (2 preceding siblings ...)
  2006-12-20 22:15                                               ` Peter Zijlstra
@ 2006-12-20 23:24                                               ` David Chinner
  2006-12-20 23:55                                                 ` Linus Torvalds
  2006-12-20 23:32                                               ` Andrew Morton
                                                                 ` (2 subsequent siblings)
  6 siblings, 1 reply; 311+ messages in thread
From: David Chinner @ 2006-12-20 23:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin,
	Arjan van de Ven, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann,
	gordonfarquharson

On Wed, Dec 20, 2006 at 11:50:50AM -0800, Linus Torvalds wrote:
> 
> 
> On Wed, 20 Dec 2006, Linus Torvalds wrote:
> > 
> > So that's why I've been harping on the fact that I think we simply do 
> > really wrong things with PG_dirty at times [ ... ]
> 
> Ok, I'll just put my money where my mouth is, and suggest a patch like 
> THIS instead.
> 
> This one clears up all the issues I find irritating:
> 
>  - "test_clear_page_dirty()" is insane, both conceptually and as an 
>    implementation. "Give me a 'C', give me an 'R', give me an 'A', give me 
>    a 'P'".
> 
>    So rip out that mindfart entirely.
> 
>  - "clear_page_dirty()" is badly named, and should be about CANCELLING the 
>    dirty bit, and must never be called with pages mapped anyway. So throw 
>    that out too, and replace it with a new function:
> 
> 	void cancel_dirty_page(struct page *page, unsigned int accounting_size);
> 
>  - "clear_page_dirty_for_io()" is fine.
> 
> And with that, I then either rip out any old users of 
> "test_clear_page_dirty()" or "clear_page_dirty()", and if appropriate (and 
> it's really only appropriate for "truncate()"), I replace them with the new 
> "cancel_dirty_page()". Most of the time, they should just be deleted 
> entirely.
> 
> NOTE NOTE NOTE! I _only_ did enough to make things compile for my 
> particular configuration. That means that right now the following 
> filesystems are broken with this patch (because they use the totally 
> broken old crap):
> 
> 	CIFS, FUSE, JFS, ReiserFS, XFS


XFS appears to call clear_page_dirty to get the mapping tree dirty
tag set correctly at the same time the page dirty flag is cleared. I
note that this can be done by set_page_writeback() if we clear the
dirty flag on the page first when we are writing back the entire page.

Hence it seems to me that the XFS call to clear_page_dirty() could
easily be substituted by clear_page_dirty_for_io() followed by a
call to set_page_writeback() to get the mapping tree tags set
correctly after the page has been marked clean.
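
To make the ordering argument concrete, here is a toy user-space model
(not kernel code). The struct, its fields and the toy_* helpers are all
made up for illustration. The only behaviour it assumes from the kernel,
as I read it, is that clear_page_dirty_for_io() touches just the page
flag and that set_page_writeback() drops the mapping's dirty tag when
the page flag is already clear.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct page plus its tags in the mapping tree. */
struct toy_page {
	bool locked;
	bool dirty;		/* models PG_dirty */
	bool writeback;		/* models PG_writeback */
	bool tag_dirty;		/* models the dirty tag in the mapping tree */
	bool tag_writeback;	/* models the writeback tag in the mapping tree */
};

/* Clears only the page flag; the tree tag is left alone. */
static bool toy_clear_page_dirty_for_io(struct toy_page *p)
{
	bool was_dirty = p->dirty;

	p->dirty = false;
	return was_dirty;
}

/* Sets writeback state and drops the dirty tag if the page flag is clean. */
static void toy_set_page_writeback(struct toy_page *p)
{
	assert(p->locked);
	p->writeback = true;
	p->tag_writeback = true;
	if (!p->dirty)
		p->tag_dirty = false;
}

/* Same order as the proposed change: clean first, then start writeback. */
static void toy_start_page_writeback(struct toy_page *p, bool clear_dirty)
{
	assert(p->locked);
	assert(!p->writeback);
	if (clear_dirty)
		toy_clear_page_dirty_for_io(p);
	toy_set_page_writeback(p);
	p->locked = false;
}
```

With clear_dirty set, both the flag and the tag end up clean; without
it, the page stays dirty in both places so a later pass can pick it up.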

Does this make sense (even without the posted patch)?

---
 fs/xfs/linux-2.6/xfs_aops.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_aops.c	2006-12-19 12:22:47.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c	2006-12-21 10:15:04.545375877 +1100
@@ -340,9 +340,9 @@ xfs_start_page_writeback(
 {
 	ASSERT(PageLocked(page));
 	ASSERT(!PageWriteback(page));
-	set_page_writeback(page);
 	if (clear_dirty)
-		clear_page_dirty(page);
+		clear_page_dirty_for_io(page);
+	set_page_writeback(page);
 	unlock_page(page);
 	if (!buffers) {
 		end_page_writeback(page);

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 19:50                                             ` Linus Torvalds
                                                                 ` (3 preceding siblings ...)
  2006-12-20 23:24                                               ` David Chinner
@ 2006-12-20 23:32                                               ` Andrew Morton
  2006-12-20 23:55                                                 ` Linus Torvalds
  2006-12-21  7:32                                               ` Gordon Farquharson
  2006-12-21 11:21                                               ` Martin Michlmayr
  6 siblings, 1 reply; 311+ messages in thread
From: Andrew Morton @ 2006-12-20 23:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin,
	Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List,
	Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann, gordonfarquharson, Chen, Kenneth W

On Wed, 20 Dec 2006 11:50:50 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> Ok, I'll just put my money where my mouth is, and suggest a patch like 
> THIS instead.
> 
> ...
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index d1f1b54..263f88e 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
>  	int ret = 0;
>  
>  	BUG_ON(!PageLocked(page));
> -	if (PageWriteback(page))
> +	if (PageDirty(page) || PageWriteback(page))
>  		return 0;
>  
>  	if (mapping == NULL) {		/* can this still happen? */
> @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
>  	spin_lock(&mapping->private_lock);
>  	ret = drop_buffers(page, &buffers_to_free);
>  	spin_unlock(&mapping->private_lock);
> -	if (ret) {
> -		/*
> -		 * If the filesystem writes its buffers by hand (eg ext3)
> -		 * then we can have clean buffers against a dirty page.  We
> -		 * clean the page here; otherwise later reattachment of buffers
> -		 * could encounter a non-uptodate page, which is unresolvable.
> -		 * This only applies in the rare case where try_to_free_buffers
> -		 * succeeds but the page is not freed.
> -		 *
> -		 * Also, during truncate, discard_buffer will have marked all
> -		 * the page's buffers clean.  We discover that here and clean
> -		 * the page also.
> -		 */
> -		if (test_clear_page_dirty(page))
> -			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
> -	}

I think this will be OK, because vmscan has just run ->writepage anyway. 
But we will need to make changes to truncate_complete_page() - make it run
cancel_dirty_page() before it runs do_invalidatepage().  


I don't think there's anything preventing zap_pte_range() or perhaps a
pagefault from coming in and dirtying this page after we've tested
PageDirty().

That could leave us with a dirty, non-uptodate page with no buffers, which
is very bad.  But this situation is hopefully impossible, because if the
page is not uptodate then the first thing a pagefault will do is bring it
uptodate, which involves locking it. And if zap_pte_range() is looking at
this page, it is uptodate.


If the page _was_ uptodate and the zap_pte_range() race happens, we'll end
up with either a dirty page with dirty buffers or a dirty uptodate
page with no buffers, both of which are OK.



> +void cancel_dirty_page(struct page *page, unsigned int account_size)
> +{
> +	/* If we're cancelling the page, it had better not be mapped any more */
> +	if (page_mapped(page)) {
> +		static unsigned int warncount;
> +
> +		WARN_ON(++warncount < 5);
> +	}
> +		
> +	if (TestClearPageDirty(page) && account_size)
> +		task_io_account_cancelled_write(account_size);
> +}

This doesn't clear the radix-tree dirty tags.  I'm not sure what effect
that would have on a truncated mapping.  Perhaps just a bit of extra work
in radix-tree lookup during writeback.

If we _know_ that this page is about to be removed from pagecache then
radix_tree_delete() will delete the tags for us anyway, but
invalidate_inode_pages2() can decide to back out.

> @@ -386,12 +399,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
>  					  PAGE_CACHE_SIZE, 0);
>  				}
>  			}
> -			was_dirty = test_clear_page_dirty(page);
> -			if (!invalidate_complete_page2(mapping, page)) {
> -				if (was_dirty)
> -					set_page_dirty(page);
> +			if (!invalidate_complete_page2(mapping, page))
>  				ret = -EIO;
> -			}
>  			unlock_page(page);

Well, it used to.

invalidate_complete_page2() is pretty gruesome.  We're handling the case
where someone went and redirtied the page (and hence its buffers) after the
invalidate_inode_pages2() caller (generic_file_direct_IO) synced it to
disk.

I'd prefer to just fail the direct-io if someone did that, but then
people's tests fail and they whine.

It's tempting to just truncate the damn page and discard the user's data -
the app is being silly.  But that would permit access to uninitialised disk
blocks.

With your change I think what'll happen is that we'll correctly handle the
case where the page and its buffers are dirty (it gets left in place), but
we'll needlessly fail in the case where the page is dirty but the buffers
are clean.  How important that will be in practice I do not know.  People
will get -EIOs where they used not to.

A suitable fix for that might be to simply not return -EIO here.  So
some thread went and dirtied a pagecache page after
generic_file_direct_IO() synced the data.  Big deal, that's your own fault.
Usually the disk will end up getting a copy of the dirtied pagecache page
and rarely it'll get a copy of the direct-io-written page.


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 23:32                                               ` Andrew Morton
@ 2006-12-20 23:55                                                 ` Linus Torvalds
  2006-12-21  0:11                                                   ` Andrew Morton
  2006-12-21  2:54                                                   ` Trond Myklebust
  0 siblings, 2 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-20 23:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin,
	Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List,
	Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann, gordonfarquharson, Chen, Kenneth W



On Wed, 20 Dec 2006, Andrew Morton wrote:
> 
> > +void cancel_dirty_page(struct page *page, unsigned int account_size)
> > +{
> > +	/* If we're cancelling the page, it had better not be mapped any more */
> > +	if (page_mapped(page)) {
> > +		static unsigned int warncount;
> > +
> > +		WARN_ON(++warncount < 5);
> > +	}
> > +		
> > +	if (TestClearPageDirty(page) && account_size)
> > +		task_io_account_cancelled_write(account_size);
> > +}
> 
> This doesn't clear the radix-tree dirty tags.  I'm not sure what effect
> that would have on a truncated mapping.  Perhaps just a bit of extra work
> in radix-tree lookup during writeback.

This should _only_ be a valid thing to do when we're removing the page 
from a mapping anyway, so I'd most definitely hope that the code 
immediately after (or before) will have done a "remove_from_page_cache()"

In which case the tags should not matter.

There is _no_ excuse for cancelling a page and leaving it in the page 
cache that I can see. Because your page contents will be _indeterminate_.

> > @@ -386,12 +399,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
> 
> invalidate_complete_page2() is pretty gruesome.  We're handling the case
> where someone went and redirtied the page (and hence its buffers) after the
> invalidate_inode_pages2() caller (generic_file_direct_IO) synced it to
> disk.
> 
> I'd prefer to just fail the direct-io if someone did that, but then
> people's tests fail and they whine.

So with my change, afaik, we will just return EIO to the invalidate, and 
do the write. Which should be ok. In fact, it appears to be the only 
possibly valid thing to do.

It really boils down to that same thing: if you remove the dirty bit, 
there is NO CONCEIVABLE GOOD THING YOU CAN DO EXCEPT FOR:
 - do the damn IO already ("clear_page_dirty_for_io()")
 - truncate the page (unmap and destroy it both from page cache AND from 
   any user-visible filesystem cases)

Anything else is simply a bug. Always has been. My patch just makes that 
very clear.

> With your change I think what'll happen is that we'll correctly handle the
> case where the page and its buffers are dirty (it gets left in place), but
> we'll needlessly fail in the case where the page is dirty but the buffers
> are clean.  How important that will be in practice I do not know.  People
> will get -EIOs where they used not to.

People will now get -EIO where they used to get an inconsistent system 
image. I really think it sounds like an improvement.

		Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 23:24                                               ` David Chinner
@ 2006-12-20 23:55                                                 ` Linus Torvalds
  2006-12-21  1:20                                                   ` David Chinner
  0 siblings, 1 reply; 311+ messages in thread
From: Linus Torvalds @ 2006-12-20 23:55 UTC (permalink / raw)
  To: David Chinner
  Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin,
	Arjan van de Ven, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann,
	gordonfarquharson



On Thu, 21 Dec 2006, David Chinner wrote:
> 
> XFS appears to call clear_page_dirty to get the mapping tree dirty
> tag set correctly at the same time the page dirty flag is cleared. I
> note that this can be done by set_page_writeback() if we clear the
> dirty flag on the page first when we are writing back the entire page.

Yes. I think the XFS routine should just use "clear_page_dirty_for_io()", 
since that matches what it actually wants to do (surprise surprise, it's 
going to write it out).

HOWEVER. Why is it conditional? Can somebody who understands XFS tell me 
why "clear_dirty" would ever be 0? I can grep the sources, and I see that 
it's an unconditional 1 in one call-site, but then in the other one it 
does

	xfs_start_page_writeback(page, wbc, !page_dirty, count);

and that part just blows my mind. Why would you do a 
xfs_start_page_writeback() and _not_ write the page out? Is this for a 
partial-page-only case?

Anyway, your patch looks fine. It seems to be the right thing to do. I'm 
just wondering why we're not always cleaning the whole page, and why we'd 
not set it unconditionally dirty?

			Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 23:55                                                 ` Linus Torvalds
@ 2006-12-21  0:11                                                   ` Andrew Morton
  2006-12-21  0:22                                                     ` Linus Torvalds
  2006-12-21  2:54                                                   ` Trond Myklebust
  1 sibling, 1 reply; 311+ messages in thread
From: Andrew Morton @ 2006-12-21  0:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin,
	Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List,
	Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann, gordonfarquharson, Chen, Kenneth W

On Wed, 20 Dec 2006 15:55:14 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> > > @@ -386,12 +399,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
> > 
> > invalidate_complete_page2() is pretty gruesome.  We're handling the case
> > where someone went and redirtied the page (and hence its buffers) after the
> > invalidate_inode_pages2() caller (generic_file_direct_IO) synced it to
> > disk.
> > 
> > I'd prefer to just fail the direct-io if someone did that, but then
> > people's tests fail and they whine.
> 
> So with my change, afaik, we will just return EIO to the invalidate, and 
> do the write.

The write's already been done by this stage.

> Which should be ok. In fact, it appears to be the only 
> possibly valid thing to do.
> 
> It really boils down to that same thing: if you remove the dirty bit, 
> there is NO CONCEIVABLE GOOD THING YOU CAN DO EXCEPT FOR:
>  - do the damn IO already ("clear_page_dirty_for_io()")
>  - truncate the page (unmap and destroy it both from page cache AND from 
>    any user-visible filesystem cases)

There's also redirty_page_for_writepage().


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-21  0:11                                                   ` Andrew Morton
@ 2006-12-21  0:22                                                     ` Linus Torvalds
  2006-12-21  0:24                                                       ` Linus Torvalds
  2006-12-21  0:43                                                       ` Linus Torvalds
  0 siblings, 2 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-21  0:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin,
	Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List,
	Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann, gordonfarquharson, Chen, Kenneth W



On Wed, 20 Dec 2006, Andrew Morton wrote:
> > 
> > So with my change, afaik, we will just return EIO to the invalidate, and 
> > do the write.
> 
> The write's already been done by this stage.

Ok, but the end result is the same: you MUST NOT just "cancel" a write. It 
needs to be done, or the backing store must be actually de-allocated. You 
can't just say "get rid of it" and think that it can work. Exactly because 
of security issues, and just the simple fact that reading it back gets 
random contents.

So I repeat: clearing a dirty bit really only has two valid cases. Not 
three, like we used to have. And the "cancel" case cannot be conditional: 
either you can cancel it or you cannot. There's no

	if (cancel_dirty_page()) {
			..

sequence that makes sense that I can think of.

> > It really boils down to that same thing: if you remove the dirty bit, 
> > there is NO CONCEIVABLE GOOD THING YOU CAN DO EXCEPT FOR:
> >  - do the damn IO already ("clear_page_dirty_for_io()")
> >  - truncate the page (unmap and destroy it both from page cache AND from 
> >    any user-visible filesystem cases)
> 
> There's also redirty_page_for_writepage().

_dirtying_ a page makes sense in any situation. You can always dirty them. 
I'm just saying that you can't just mark them *clean*.

If your point was that the filesystem had better be able to take care of 
"redirty_page_for_writepage()", then yes, of course. But since it's the 
filesystem itself that does it, it had _better_ be able to take care of 
the situation it puts itself into.

			Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-21  0:22                                                     ` Linus Torvalds
@ 2006-12-21  0:24                                                       ` Linus Torvalds
  2006-12-21 15:48                                                         ` Andrei Popa
  2006-12-21  0:43                                                       ` Linus Torvalds
  1 sibling, 1 reply; 311+ messages in thread
From: Linus Torvalds @ 2006-12-21  0:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin,
	Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List,
	Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann, gordonfarquharson, Chen, Kenneth W



Btw, I'd really love to hear whether the patch I sent out actually _helps_ 
at all, or whether we're just discussing something that in the end is just 
a cleanup..

Martin, Peter, Andrei, pls give it a try. (Martin and Andrei may be 
talking about different bugs, so _both_ of your experiences definitely 
matter here).

		Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-21  0:22                                                     ` Linus Torvalds
  2006-12-21  0:24                                                       ` Linus Torvalds
@ 2006-12-21  0:43                                                       ` Linus Torvalds
  2006-12-21  1:20                                                         ` Andrew Morton
  1 sibling, 1 reply; 311+ messages in thread
From: Linus Torvalds @ 2006-12-21  0:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin,
	Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List,
	Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann, gordonfarquharson, Chen, Kenneth W



On Wed, 20 Dec 2006, Linus Torvalds wrote:
> > 
> > There's also redirty_page_for_writepage().
> 
> _dirtying_ a page makes sense in any situation. You can always dirty them. 
> I'm just saying that you can't just mark them *clean*.
> 
> If your point was that the filesystem had better be able to take care of 
> "redirty_page_for_writepage()", then yes, of course. But since it's the 
> filesystem itself that does it, it had _better_ be able to take care of 
> the situation it puts itself into.

Btw, as an example of something where this may NOT be ok, look at 
migrate_page_copy().

I'm not at all convinced that "migrate_page_copy()" can work at all. It 
does:

	...
        if (PageDirty(page)) {
                clear_page_dirty_for_io(page);
                set_page_dirty(newpage);
        }
	...

which is an example of what NOT to do, because it claims to clear the page 
for IO, but doesn't actually _do_ any IO.

And this is wrong, for many reasons. 

For example, it's very possible that the old page is not actually 
up-to-date, and is only partially dirty using some FS-specific dirty data 
queues (like NFS does with its dirty data, or buffer-heads can do for 
local filesystems). When you do

	if (clear_dirty(page))
		set_page_dirty(page);

in generic VM code, that is a BUG. It's an insane operation. It cannot 
work. It's exactly what I'm trying to avoid.

So page migration is probably broken, but it's no less broken than it 
always has been. And I don't think many people use it anyway. It might 
work "by accident" in a lot of situations, but to actually be solid, it 
really would need to do something fundamentally different, like:

 - have a per-mapping "migrate()" function that actually knows HOW to 
   migrate the dirty state from one page to another.

 - or, preferably, by just not migrating dirty pages, and just actually 
   doing the writeback on them first.

Again, this is an example of just _incorrect_ code, that thinks that it 
can "just clear the dirty bit". You can't do that. It's wrong. And it is 
not wrong just because I say so, but because the operation itself simply 
is FUNDAMENTALLY not a sensible one.

This is why I keep harping on this issue: there are two cases, and two 
cases only, when you can clear a page. And no, "migrating the data to 
another page" was not one of those two cases. The cases are, and will 
_always_ be: (a) full writeback IO of _all_ the dirty data on the page 
(and that can only be done by the low-level filesystem, since it's the 
only one that knows what rules it has followed for marking things dirty) 
and (b) cancelling dirty data that got truncated and literally removed 
from the filesystem.
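
Purely as an illustration of that rule, here is a toy user-space sketch
(not kernel code; every name in it is made up). It only accepts a
request to clear the dirty bit when the reason is one of the two cases
above, and refuses anything else by leaving the page dirty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_clean_reason {
	DEMO_CLEAN_FOR_IO,	/* (a) full writeback of all dirty data */
	DEMO_CLEAN_TRUNCATE,	/* (b) the data is being removed from the file */
	DEMO_CLEAN_OTHER	/* anything else, e.g. "just move it elsewhere" */
};

struct demo_page {
	bool dirty;
	bool in_page_cache;
	bool io_started;
};

/* Returns true only if the dirty bit was cleared for a valid reason. */
static bool demo_clean_page(struct demo_page *p, enum demo_clean_reason why)
{
	assert(p != NULL);

	switch (why) {
	case DEMO_CLEAN_FOR_IO:
		p->dirty = false;
		p->io_started = true;	/* the IO really has to happen */
		return true;
	case DEMO_CLEAN_TRUNCATE:
		p->dirty = false;
		p->in_page_cache = false;	/* the page goes away too */
		return true;
	case DEMO_CLEAN_OTHER:
		break;
	}
	return false;	/* refused: the page stays dirty */
}
```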

So I don't claim that I fixed all the cases. mm/migrate.c is still broken. 
Maybe somebody else also uses "clear_page_dirty_for_io()" even though the 
name very clearly says FOR IO. I didn't check, but I think they're mostly 
right now.

		Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-21  0:43                                                       ` Linus Torvalds
@ 2006-12-21  1:20                                                         ` Andrew Morton
  0 siblings, 0 replies; 311+ messages in thread
From: Andrew Morton @ 2006-12-21  1:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin,
	Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List,
	Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann, gordonfarquharson, Chen, Kenneth W,
	Christoph Lameter

On Wed, 20 Dec 2006 16:43:31 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> 
> 
> On Wed, 20 Dec 2006, Linus Torvalds wrote:
> > > 
> > > There's also redirty_page_for_writepage().
> > 
> > _dirtying_ a page makes sense in any situation. You can always dirty them. 
> > I'm just saying that you can't just mark them *clean*.
> > 
> > If your point was that the filesystem had better be able to take care of 
> > "redirty_page_for_writepage()", then yes, of course. But since it's the 
> > filesystem itself that does it, it had _better_ be able to take care of 
> > the situation it puts itself into.
> 
> Btw, as an example of something where this may NOT be ok, look at 
> migrate_page_copy().
> 
> I'm not at all convinced that "migrate_page_copy()" can work at all. It 
> does:
> 
> 	...
>         if (PageDirty(page)) {
>                 clear_page_dirty_for_io(page);
>                 set_page_dirty(newpage);

Note that this is referring to different pages.

>         }
> 	...
> 
> which is an example of what NOT to do, because it claims to clear the page 
> for IO, but doesn't actually _do_ any IO.
> 
> And this is wrong, for many reasons. 
>
> For example, it's very possible that the old page is not actually 
> up-to-date, and is only partially dirty using some FS-specific dirty data 
> queues (like NFS does with its dirty data, or buffer-heads can do for 
> local filesystems).

afaict the code copes with those things.

> When you do
> 
> 	if (clear_dirty(page))
> 		set_page_dirty(page);
> 
> in generic VM code, that is a BUG. It's an insane operation. It cannot 
> work. It's exactly what I'm trying to avoid.

These are different pages.

We could view the copy_highpage() in migrate_page_copy() as an "io"
operation, only the backing store is a new pagecache page.

It'd be more logical if that copy_highpage() was occurring after the
clear_page_dirty_for_io().

I'm not sure why migrate_page_copy() is playing with
PageWriteback(newpage).  Surely newpage is locked, in which case nobody
will be starting writeback on it.

> So page migration is probably broken, but it's no less broken than it 
> always has been. And I don't think many people use it anyway. It might 
> work "by accident" in a lot of situations, but to actually be solid, it 
> really would need to do something fundamentally different, like:
> 
>  - have a per-mapping "migrate()" function that actually knows HOW to 
>    migrate the dirty state from one page to another.

That is how it's presently implemented.  You're looking at helper functions
which filesystems may point their address_space_operations.migratepage at.

>  - or, preferably, by just not migrating dirty pages, and just actually 
>    doing the writeback on them first.
> 
> Again, this is an example of just _incorrect_ code, that thinks that it 
> can "just clear the dirty bit". You can't do that. It's wrong. And it is 
> not wrong just because I say so, but because the operation itself simply 
> is FUNDAMENTALLY not a sensible one.

The dirty state is being transferred to the new page.  The tricky part is
handling the cases where these pages are mapped into pagetables.  That's
what the special migration ptes are there for.  I'll let Christoph explain
that lot ;)


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 23:55                                                 ` Linus Torvalds
@ 2006-12-21  1:20                                                   ` David Chinner
  0 siblings, 0 replies; 311+ messages in thread
From: David Chinner @ 2006-12-21  1:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Chinner, Martin Michlmayr, Peter Zijlstra, Hugh Dickins,
	Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann,
	gordonfarquharson

On Wed, Dec 20, 2006 at 03:55:25PM -0800, Linus Torvalds wrote:
> On Thu, 21 Dec 2006, David Chinner wrote:
> > 
> > XFS appears to call clear_page_dirty to get the mapping tree dirty
> > tag set correctly at the same time the page dirty flag is cleared. I
> > note that this can be done by set_page_writeback() if we clear the
> > dirty flag on the page first when we are writing back the entire page.
> 
> Yes. I think the XFS routine should just use "clear_page_dirty_for_io()", 
> since that matches what it actually wants to do (surprise surprise, it's 
> going to write it out).

Yup ;)

> HOWEVER. Why is it conditional? Can somebody who understands XFS tell me 
> why "clear_dirty" would ever be 0? I can grep the sources, and I see that 
> it's an unconditional 1 in one call-site, but then in the other one it 
> does
> 
> 	xfs_start_page_writeback(page, wbc, !page_dirty, count);

page dirty starts at the number of dirty buffers on the page, and as
we map each dirty buffer into the I/O we decrement the page dirty count.

Hence if we map all of the buffers into the I/O, we are cleaning
the entire page and hence we can clear the dirty state on the page.
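
Roughly, as a toy user-space sketch (not the XFS code; the struct, the
"mappable" flag and the function name are invented for illustration):
count the dirty buffers, decrement for each one that goes into the I/O,
and whatever is left over decides whether the page may be cleaned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one buffer on a page. */
struct toy_buffer {
	bool dirty;
	bool mappable;	/* can be added to the I/O being built right now */
};

/*
 * Returns the number of dirty buffers that could NOT be mapped into the
 * I/O. The caller would pass clear_dirty = !remaining, so the page is
 * only marked clean when every dirty buffer is being written.
 */
static int toy_map_dirty_buffers(struct toy_buffer *bufs, size_t n)
{
	int page_dirty = 0;
	size_t i;

	assert(bufs != NULL || n == 0);

	/* Start at the number of dirty buffers on the page. */
	for (i = 0; i < n; i++)
		if (bufs[i].dirty)
			page_dirty++;

	/* Each dirty buffer mapped into the I/O decrements the count. */
	for (i = 0; i < n; i++) {
		if (bufs[i].dirty && bufs[i].mappable) {
			bufs[i].dirty = false;
			page_dirty--;
		}
	}
	return page_dirty;
}
```

A non-zero result corresponds to the partial-page case: the page is
left dirty so the remaining buffers get flushed on a later pass.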

> and that part just blows my mind. Why would you do a 
> xfs_start_page_writeback() and _not_ write the page out? Is this for a 
> partial-page-only case?

Yes, partial-page-only case when doing speculative write clustering. We'll hit
this when an extent boundary is not page aligned (fs block size < page size
case) and we need to issue at least two separate I/Os to clean the page.
Because we leave the page dirty and we are working ahead of the index in
generic_writepages() we'll get the rest of the page flushed when we return
back to generic_writepages() as the page is still dirty in the mapping
tree....

> Anyway, your patch looks fine. It seems to be the right thing to do.

Ok, thanks, Linus.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 22:15                                               ` Peter Zijlstra
  2006-12-20 22:20                                                 ` Peter Zijlstra
  2006-12-20 22:49                                                 ` Linus Torvalds
@ 2006-12-21  2:36                                                 ` Trond Myklebust
  2006-12-21  8:10                                                   ` Peter Zijlstra
  2 siblings, 1 reply; 311+ messages in thread
From: Trond Myklebust @ 2006-12-21  2:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Martin Michlmayr, Hugh Dickins, Nick Piggin,
	Arjan van de Ven, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann,
	gordonfarquharson

On Wed, 2006-12-20 at 23:15 +0100, Peter Zijlstra wrote:
> I think this is also needed:

NAK

invalidate_inode_pages2() should _not_ be pretending that dirty pages
are clean. This patch is incorrect both for the NFS usage and for the
directIO usage.

In the latter case, if someone has the page mmapped, resulting in the
page getting marked as dirty _after_ a directIO write, then it would be
wrong to discard that data. Only dirty data from _before_ the directIO
write needs to be discarded (and that is achieved by unmapping,
then cleaning the page prior to the directIO call)...

For the NFS case, the race is a bit more tricky, since you have the
"unstable write" case which means that the page is neither marked as
dirty, nor is entirely clean ('cos we don't know that the server has
committed the data to permanent storage yet).

Cheers
  Trond

> ---
>  mm/truncate.c |    7 +------
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> Index: linux-2.6/mm/truncate.c
> ===================================================================
> --- linux-2.6.orig/mm/truncate.c
> +++ linux-2.6/mm/truncate.c
> @@ -320,19 +320,14 @@ invalidate_complete_page2(struct address
>  	if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL))
>  		return 0;
>  
> +	cancel_dirty_page(page, PAGE_CACHE_SIZE);
>  	lock_page_ref_irq(page);
> -	if (PageDirty(page))
> -		goto failed;
> -
>  	BUG_ON(PagePrivate(page));
>  	__remove_from_page_cache(page);
>  	unlock_page_ref_irq(page);
>  	ClearPageUptodate(page);
>  	page_cache_release(page);	/* pagecache ref */
>  	return 1;
> -failed:
> -	unlock_page_ref_irq(page);
> -	return 0;
>  }
>  
>  /**
> 
> 


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 23:55                                                 ` Linus Torvalds
  2006-12-21  0:11                                                   ` Andrew Morton
@ 2006-12-21  2:54                                                   ` Trond Myklebust
  2006-12-21 17:19                                                     ` Linus Torvalds
  1 sibling, 1 reply; 311+ messages in thread
From: Trond Myklebust @ 2006-12-21  2:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Martin Michlmayr, Peter Zijlstra, Hugh Dickins,
	Nick Piggin, Arjan van de Ven, Andrei Popa,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann,
	gordonfarquharson, Chen, Kenneth W

On Wed, 2006-12-20 at 15:55 -0800, Linus Torvalds wrote:
> > With your change I think what'll happen is that we'll correctly handle the
> > case where the page and its buffers are dirty (it gets left in place), but
> > we'll needlessly fail in the case where the page is dirty but the buffers
> > are clean.  How important that will be in practice I do not know.  People
> > will get -EIOs where they used not to.
> 
> People will now get -EIO where they used to get an inconsistent system 
> image. I really think it sounds like an improvement.

The hell it is. You end up with a corrupted page cache because
invalidate_inode_pages2_range() immediately exits without throwing out
the pages in the rest of the range.

I can't see that it is the business of invalidate_inode_pages2() to
resolve races between ->direct_IO() and pages that are redirtied by
mmap(). All it needs to ensure is that pages that are clean are discarded,
since those are neither consistent with data that the ->direct_IO() call
wrote to the disk nor are they scheduled to be written to disk.

The only case that I can see that is still problematic is NFS because it
may have unstable writes (hence the ->launder_page() patch that I posted
yesterday).

Trond


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 19:50                                             ` Linus Torvalds
                                                                 ` (4 preceding siblings ...)
  2006-12-20 23:32                                               ` Andrew Morton
@ 2006-12-21  7:32                                               ` Gordon Farquharson
  2006-12-21  7:53                                                 ` Linus Torvalds
  2006-12-21 11:21                                               ` Martin Michlmayr
  6 siblings, 1 reply; 311+ messages in thread
From: Gordon Farquharson @ 2006-12-21  7:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin,
	Arjan van de Ven, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann

On 12/20/06, Linus Torvalds <torvalds@osdl.org> wrote:

> Ok, I'll just put my money where my mouth is, and suggest a patch like
> THIS instead.

> Martin, Andrei, does this make any difference for your corruption cases?

Unfortunately, I cannot get the latest git version of the kernel to
boot on the ARM machine on which Martin and I are experiencing the apt
segfault. After the kernel is finished uncompressing it prints "done,
booting the kernel." as expected, but nothing more happens. I have
tried both with and without the patch. Hopefully either Andrei or
Martin will have better luck at testing this patch than I have had.

Gordon

-- 
Gordon Farquharson

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-21  7:32                                               ` Gordon Farquharson
@ 2006-12-21  7:53                                                 ` Linus Torvalds
  2006-12-21  8:38                                                   ` Martin Michlmayr
                                                                     ` (2 more replies)
  0 siblings, 3 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-21  7:53 UTC (permalink / raw)
  To: Gordon Farquharson
  Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin,
	Arjan van de Ven, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann



On Thu, 21 Dec 2006, Gordon Farquharson wrote:
> 
> Unfortunately, I cannot get the latest git version of the kernel to
> boot on the ARM machine on which Martin and I are experiencing the apt
> segfault.

Ouch.

> After the kernel is finished uncompressing it prints "done,
> booting the kernel." as expected, but nothing more happens. I have
> tried both with and without the patch. Hopefully either Andrei or
> Martin will have better luck at testing this patch than I have had.

That's obviously a bug worth fixing on its own. Do you know when it 
started?

That said, I think the patch I sent out should actually work on top of 
plain 2.6.19 too. I don't think things have changed in this area that 
much. IOW, you don't _need_ latest -git to test it, you just need a broken 
kernel ;)

		Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-21  2:36                                                 ` [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) Trond Myklebust
@ 2006-12-21  8:10                                                   ` Peter Zijlstra
  0 siblings, 0 replies; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-21  8:10 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Linus Torvalds, Martin Michlmayr, Hugh Dickins, Nick Piggin,
	Arjan van de Ven, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann,
	gordonfarquharson

On Wed, 2006-12-20 at 21:36 -0500, Trond Myklebust wrote:
> On Wed, 2006-12-20 at 23:15 +0100, Peter Zijlstra wrote:
> > I think this is also needed:
> 
> NAK
> 
> invalidate_inode_pages2() should _not_ be pretending that dirty pages
> are clean. This patch is incorrect both for the NFS usage and for the
> directIO usage.
> 
> In the latter case, if someone has the page mmapped, resulting in the
> page getting marked as dirty _after_ a directIO write, then it would be
> wrong to discard that data. Only dirty data from _before_ the directIO
> write needs to be discarded (and that is achieved by unmapping,
> then cleaning the page prior to the directIO call)...
> 
> For the NFS case, the race is a bit more tricky, since you have the
> "unstable write" case which means that the page is neither marked as
> dirty, nor is entirely clean ('cos we don't know that the server has
> committed the data to permanent storage yet).

Then this patch:
http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc1/2.6.20-rc1-mm1/broken-out/nfs-fix-nr_file_dirty-underflow.patch

is equally wrong, right?


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 22:11                                       ` Russell King
@ 2006-12-21  8:18                                         ` Martin Michlmayr
  2006-12-21  9:54                                           ` Russell King
  0 siblings, 1 reply; 311+ messages in thread
From: Martin Michlmayr @ 2006-12-21  8:18 UTC (permalink / raw)
  To: rmk+lkml, Peter Zijlstra, Hugh Dickins, Arjan van de Ven,
	Linus Torvalds, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann,
	gordonfarquharson

* Russell King <rmk+lkml@arm.linux.org.uk> [2006-12-20 22:11]:
> > This patch doesn't fix my problem (apt segfaults on ARM because its
> > database is corrupted).
> 
> Are you using IDE in PIO mode?  If so, the bug probably lies there.

I'm using usb-storage.  It's used to access an external IDE drive in
a USB enclosure, but I don't think it matters that it's IDE since
we're using the SCSI layer to talk to it, right?
-- 
Martin Michlmayr
http://www.cyrius.com/

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-21  7:53                                                 ` Linus Torvalds
@ 2006-12-21  8:38                                                   ` Martin Michlmayr
  2006-12-21  8:59                                                     ` Linus Torvalds
  2006-12-21  9:17                                                   ` Gordon Farquharson
  2006-12-21 12:30                                                   ` Russell King
  2 siblings, 1 reply; 311+ messages in thread
From: Martin Michlmayr @ 2006-12-21  8:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Gordon Farquharson, Peter Zijlstra, Hugh Dickins, Nick Piggin,
	Arjan van de Ven, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann

* Linus Torvalds <torvalds@osdl.org> [2006-12-20 23:53]:
> > Unfortunately, I cannot get the latest git version of the kernel to
> > boot on the ARM machine on which Martin and I are experiencing the apt
> > segfault.
> 
> Ouch.
> 
> That's obviously a bug worth fixing on its own. Do you know when it
> started?

This is a known issue.  The following patch has been proposed
http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=4030/1
although I just noticed that it has been marked as "discarded".
Apparently Russell King committed a better patch so this should be
fixed in git when he sends his next pull request.
-- 
Martin Michlmayr
http://www.cyrius.com/

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-21  8:38                                                   ` Martin Michlmayr
@ 2006-12-21  8:59                                                     ` Linus Torvalds
  0 siblings, 0 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-21  8:59 UTC (permalink / raw)
  To: Martin Michlmayr
  Cc: Gordon Farquharson, Peter Zijlstra, Hugh Dickins, Nick Piggin,
	Arjan van de Ven, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann



On Thu, 21 Dec 2006, Martin Michlmayr wrote:
> 
> This is a known issue.  The following patch has been proposed
> http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=4030/1
> although I just noticed that it has been marked as "discarded".
> Apparently Russell King committed a better patch so this should be
> fixed in git when he sends his next pull request.

Ahh, ok. Then it might even be in the set of merges I did earlier today 
(and which should mirror out soon enough, hopefully).

		Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 23:03                                                   ` Peter Zijlstra
@ 2006-12-21  9:16                                                     ` Martin Schwidefsky
  2006-12-21  9:20                                                       ` Peter Zijlstra
  0 siblings, 1 reply; 311+ messages in thread
From: Martin Schwidefsky @ 2006-12-21  9:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Martin Michlmayr, Hugh Dickins, Nick Piggin,
	Arjan van de Ven, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Heiko Carstens, Arnd Bergmann, gordonfarquharson

On Thu, 2006-12-21 at 00:03 +0100, Peter Zijlstra wrote:
> current version

Nitpicking ..

> @@ -444,17 +444,18 @@ static int page_mkclean_one(struct page
>  	if (!pte)
>  		goto out;
> 
> -	if (!pte_dirty(*pte) && !pte_write(*pte))
> -		goto unlock;
> +	while (pte_dirty(*pte) || pte_write(*pte)) {
> +		pte_t entry;
> 
> -	entry = ptep_get_and_clear(mm, address, pte);
> -	entry = pte_mkclean(entry);
> -	entry = pte_wrprotect(entry);
> -	ptep_establish(vma, address, pte, entry);
> -	lazy_mmu_prot_update(entry);
> -	ret = 1;
> +		flush_cache_page(vma, address, pte_pfn(*pte));
> +		entry = ptep_clear_flush(vma, address, pte);
> +		entry = pte_wrprotect(entry);
> +		entry = pte_mkclean(entry);
> +		ptep_establish(vma, address, pte, entry);

Now you are flushing the tlb twice. ptep_clear_flush clears the pte and
flushes the tlb, ptep_establish sets the new pte and flushes the tlb.
Not good. Use set_pte_at instead of the ptep_establish.
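
To make the flush count concrete, here is a toy userspace model, not
kernel code. toy_clear_flush(), toy_establish() and toy_set_pte() are
hypothetical stand-ins for ptep_clear_flush(), ptep_establish() and
set_pte_at(); the only behaviour modelled is the one described above,
namely that the first two each issue a TLB flush and the third does not.
The bit values are made up for the sketch.

```c
#include <assert.h>

/* Hypothetical flag bits for the toy pte; not the real arch layout. */
#define TOY_PTE_WRITE 0x1UL
#define TOY_PTE_DIRTY 0x2UL

/* Toy model of one page-table entry plus a TLB flush counter. */
struct toy_mmu {
	unsigned long pte;
	unsigned int tlb_flushes;
};

/* Stand-in for ptep_clear_flush(): clear the pte, flush, return old value. */
static unsigned long toy_clear_flush(struct toy_mmu *m)
{
	unsigned long old = m->pte;

	m->pte = 0;
	m->tlb_flushes++;
	return old;
}

/* Stand-in for ptep_establish(): install the pte and flush again. */
static void toy_establish(struct toy_mmu *m, unsigned long entry)
{
	m->pte = entry;
	m->tlb_flushes++;
}

/* Stand-in for set_pte_at(): install the pte without flushing. */
static void toy_set_pte(struct toy_mmu *m, unsigned long entry)
{
	m->pte = entry;
}

/*
 * Write-protect and clean the toy pte.  Returns the total number of
 * TLB flushes issued so far, so both variants can be compared.
 */
static unsigned int toy_mkclean(struct toy_mmu *m, int use_establish)
{
	unsigned long entry = toy_clear_flush(m);

	entry &= ~(TOY_PTE_WRITE | TOY_PTE_DIRTY);
	if (use_establish)
		toy_establish(m, entry);
	else
		toy_set_pte(m, entry);
	return m->tlb_flushes;
}
```

In this model both variants leave the same pte behind; the variant that
installs it with the non-flushing helper simply pays for one flush
instead of two.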

> +		lazy_mmu_prot_update(entry);
> +		ret = 1;
> +	}
> 
> -unlock:
>  	pte_unmap_unlock(pte, ptl);
>  out:
>  	return ret;

-- 
blue skies,
  Martin.

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH

"Reality continues to ruin my life." - Calvin.



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-21  7:53                                                 ` Linus Torvalds
  2006-12-21  8:38                                                   ` Martin Michlmayr
@ 2006-12-21  9:17                                                   ` Gordon Farquharson
  2006-12-21  9:27                                                     ` Andrew Morton
  2006-12-21 12:30                                                   ` Russell King
  2 siblings, 1 reply; 311+ messages in thread
From: Gordon Farquharson @ 2006-12-21  9:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin,
	Arjan van de Ven, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann

On 12/21/06, Linus Torvalds <torvalds@osdl.org> wrote:

> That said, I think the patch I sent out should actually work on top of
> plain 2.6.19 too. I don't think things have changed in this area that
> much. IOW, you don't _need_ latest -git to test it, you just need a broken
> kernel ;)

I created a version of your patch that applied to 2.6.19, but it
doesn't compile:

mm/built-in.o: In function `cancel_dirty_page':
slab.c:(.text+0x8964): undefined reference to `task_io_account_cancelled_write'
make[3]: *** [.tmp_vmlinux1] Error 1

It looks like task_io_account_cancelled_write() was added in

http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=7c3ab7381e79dfc7db14a67c6f4f3285664e1ec2

Can the call to task_io_account_cancelled_write() simply be removed
from cancel_dirty_page() for testing the patch with 2.6.19 (since
2.6.19 doesn't seem to have the task I/O accounting) ?

Gordon

-- 
Gordon Farquharson

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-21  9:16                                                     ` Martin Schwidefsky
@ 2006-12-21  9:20                                                       ` Peter Zijlstra
  2006-12-21  9:26                                                         ` Martin Schwidefsky
  2006-12-21 20:01                                                         ` Linus Torvalds
  0 siblings, 2 replies; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-21  9:20 UTC (permalink / raw)
  To: schwidefsky
  Cc: Linus Torvalds, Martin Michlmayr, Hugh Dickins, Nick Piggin,
	Arjan van de Ven, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Heiko Carstens, Arnd Bergmann, gordonfarquharson

On Thu, 2006-12-21 at 10:16 +0100, Martin Schwidefsky wrote:
> On Thu, 2006-12-21 at 00:03 +0100, Peter Zijlstra wrote:
> > current version
> 
> Nitpicking ..
> 
> > @@ -444,17 +444,18 @@ static int page_mkclean_one(struct page
> >  	if (!pte)
> >  		goto out;
> > 
> > -	if (!pte_dirty(*pte) && !pte_write(*pte))
> > -		goto unlock;
> > +	while (pte_dirty(*pte) || pte_write(*pte)) {
> > +		pte_t entry;
> > 
> > -	entry = ptep_get_and_clear(mm, address, pte);
> > -	entry = pte_mkclean(entry);
> > -	entry = pte_wrprotect(entry);
> > -	ptep_establish(vma, address, pte, entry);
> > -	lazy_mmu_prot_update(entry);
> > -	ret = 1;
> > +		flush_cache_page(vma, address, pte_pfn(*pte));
> > +		entry = ptep_clear_flush(vma, address, pte);
> > +		entry = pte_wrprotect(entry);
> > +		entry = pte_mkclean(entry);
> > +		ptep_establish(vma, address, pte, entry);
> 
> Now you are flushing the tlb twice. ptep_clear_flush clears the pte and
> flushes the tlb, ptep_establish sets the new pte and flushes the tlb.
> Not good. Use set_pte_at instead of the ptep_establish.

Yeah, sorry, I already noticed and corrected that :-|

Also, I'm dubious about the while thing and stuck a WARN_ON(ret) thing
at the beginning of the loop. flush_tlb_page() does IPI the other cpus
to flush their tlb too, so there should not be a SMP race, Arjan?

> > +		lazy_mmu_prot_update(entry);
> > +		ret = 1;
> > +	}
> > 
> > -unlock:
> >  	pte_unmap_unlock(pte, ptl);
> >  out:
> >  	return ret;
> 


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-21  9:20                                                       ` Peter Zijlstra
@ 2006-12-21  9:26                                                         ` Martin Schwidefsky
  2006-12-21 20:01                                                         ` Linus Torvalds
  1 sibling, 0 replies; 311+ messages in thread
From: Martin Schwidefsky @ 2006-12-21  9:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Martin Michlmayr, Hugh Dickins, Nick Piggin,
	Arjan van de Ven, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Heiko Carstens, Arnd Bergmann, gordonfarquharson

On Thu, 2006-12-21 at 10:20 +0100, Peter Zijlstra wrote:
> > Now you are flushing the tlb twice. ptep_clear_flush clears the pte and
> > flushes the tlb, ptep_establish sets the new pte and flushes the tlb.
> > Not good. Use set_pte_at instead of the ptep_establish.
> 
> Yeah, sorry, I already noticed and corrected that :-|
> 
> Also, I'm dubious about the while thing and stuck a WARN_ON(ret) thing
> at the beginning of the loop. flush_tlb_page() does IPI the other cpus
> to flush their tlb too, so there should not be a SMP race, Arjan?

The while loop is protected by the pte lock and flush_tlb_page has to
remove the tlbs on all cpus. So yes, I think the while loop is not
necessary.

-- 
blue skies,
  Martin.

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH

"Reality continues to ruin my life." - Calvin.



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-21  9:17                                                   ` Gordon Farquharson
@ 2006-12-21  9:27                                                     ` Andrew Morton
  2006-12-22  4:20                                                       ` Gordon Farquharson
  0 siblings, 1 reply; 311+ messages in thread
From: Andrew Morton @ 2006-12-21  9:27 UTC (permalink / raw)
  To: Gordon Farquharson
  Cc: Linus Torvalds, Martin Michlmayr, Peter Zijlstra, Hugh Dickins,
	Nick Piggin, Arjan van de Ven, Andrei Popa,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann

On Thu, 21 Dec 2006 02:17:05 -0700
"Gordon Farquharson" <gordonfarquharson@gmail.com> wrote:

> Can the call to task_io_account_cancelled_write() simply be removed
> from cancel_dirty_page() for testing the patch with 2.6.19 (since
> 2.6.19 doesn't seem to have the task I/O accounting) ?

Yes.

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-21  8:18                                         ` Martin Michlmayr
@ 2006-12-21  9:54                                           ` Russell King
  0 siblings, 0 replies; 311+ messages in thread
From: Russell King @ 2006-12-21  9:54 UTC (permalink / raw)
  To: Martin Michlmayr
  Cc: Peter Zijlstra, Hugh Dickins, Arjan van de Ven, Linus Torvalds,
	Andrei Popa, Andrew Morton, Linux Kernel Mailing List,
	Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann, gordonfarquharson

On Thu, Dec 21, 2006 at 09:18:45AM +0100, Martin Michlmayr wrote:
> * Russell King <rmk+lkml@arm.linux.org.uk> [2006-12-20 22:11]:
> > > This patch doesn't fix my problem (apt segfaults on ARM because its
> > > database is corrupted).
> > 
> > Are you using IDE in PIO mode?  If so, the bug probably lies there.
> 
> I'm using usb-storage.  It's used to access an external IDE drive in
> a USB enclosure, but I don't think it matters that it's IDE since
> we're using the SCSI layer to talk to it, right?

USB generally uses DMA so you're probably safe.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-20 19:50                                             ` Linus Torvalds
                                                                 ` (5 preceding siblings ...)
  2006-12-21  7:32                                               ` Gordon Farquharson
@ 2006-12-21 11:21                                               ` Martin Michlmayr
  6 siblings, 0 replies; 311+ messages in thread
From: Martin Michlmayr @ 2006-12-21 11:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Andrei Popa, Andrew Morton, Linux Kernel Mailing List,
	Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann, gordonfarquharson

* Linus Torvalds <torvalds@osdl.org> [2006-12-20 11:50]:
> Martin, Andrei, does this make any difference for your corruption
> cases?

Works for me.
-- 
Martin Michlmayr
http://www.cyrius.com/

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-21  7:53                                                 ` Linus Torvalds
  2006-12-21  8:38                                                   ` Martin Michlmayr
  2006-12-21  9:17                                                   ` Gordon Farquharson
@ 2006-12-21 12:30                                                   ` Russell King
  2006-12-21 12:36                                                     ` Russell King
  2 siblings, 1 reply; 311+ messages in thread
From: Russell King @ 2006-12-21 12:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Gordon Farquharson, Martin Michlmayr, Peter Zijlstra,
	Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa,
	Andrew Morton, Linux Kernel Mailing List, Florian Weimer,
	Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann

On Wed, Dec 20, 2006 at 11:53:25PM -0800, Linus Torvalds wrote:
> That's obviously a bug worth fixing on its own. Do you know when it 
> started?

My last merge, just before 2.6.19-rc1.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-21 12:30                                                   ` Russell King
@ 2006-12-21 12:36                                                     ` Russell King
  0 siblings, 0 replies; 311+ messages in thread
From: Russell King @ 2006-12-21 12:36 UTC (permalink / raw)
  To: Linus Torvalds, Gordon Farquharson, Martin Michlmayr,
	Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Andrei Popa, Andrew Morton, Linux Kernel Mailing List,
	Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann

On Thu, Dec 21, 2006 at 12:30:22PM +0000, Russell King wrote:
> On Wed, Dec 20, 2006 at 11:53:25PM -0800, Linus Torvalds wrote:
> > That's obviously a bug worth fixing on its own. Do you know when it 
> > started?
> 
> My last merge, just before 2.6.19-rc1.

Obviously 2.6.20-rc1.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 17:43                         ` Linus Torvalds
  2006-12-19 18:59                           ` Linus Torvalds
  2006-12-19 21:56                           ` Florian Weimer
@ 2006-12-21 13:03                           ` Peter Zijlstra
  2006-12-21 20:40                             ` Andrew Morton
  2 siblings, 1 reply; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-21 13:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Tue, 2006-12-19 at 09:43 -0800, Linus Torvalds wrote:
> 
> Btw,
>  here's a totally new tangent on this: it's possible that user code is 
> simply BUGGY. 

depmod: BADNESS: written outside isize 22183

---
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..5db9fd9 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2393,6 +2393,17 @@ int nobh_commit_write(struct file *file, struct page *page,
 }
 EXPORT_SYMBOL(nobh_commit_write);
 
+static void __check_tail_zero(char *kaddr, unsigned int offset)
+{
+	unsigned int check = 0;
+	do {
+		check += kaddr[offset++];
+	} while (offset < PAGE_CACHE_SIZE);
+	if (check)
+		printk(KERN_ERR "%s: BADNESS: written outside isize %u\n",
+				current->comm, check);
+}
+
 /*
  * nobh_writepage() - based on block_full_write_page() except
  * that it tries to operate without attaching bufferheads to
@@ -2437,6 +2448,7 @@ int nobh_writepage(struct page *page, get_block_t *get_block,
 	 * writes to that region are not written out to the file."
 	 */
 	kaddr = kmap_atomic(page, KM_USER0);
+	__check_tail_zero(kaddr, offset);
 	memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
 	flush_dcache_page(page);
 	kunmap_atomic(kaddr, KM_USER0);
@@ -2604,6 +2616,7 @@ int block_write_full_page(struct page *page, get_block_t *get_block,
 	 * writes to that region are not written out to the file."
 	 */
 	kaddr = kmap_atomic(page, KM_USER0);
+	__check_tail_zero(kaddr, offset);
 	memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
 	flush_dcache_page(page);
 	kunmap_atomic(kaddr, KM_USER0);
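
A note on the check itself, offered as a userspace sketch rather than a
change to the patch: __check_tail_zero() above sums the tail bytes
through a char pointer, so on a platform where char is signed a pair
such as 0x01 and 0xff could in principle cancel and go unreported.
Counting non-zero bytes avoids that. TOY_PAGE_SIZE and
count_nonzero_tail() are hypothetical names for this illustration;
TOY_PAGE_SIZE stands in for PAGE_CACHE_SIZE and its value is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for PAGE_CACHE_SIZE; 4096 is an assumption for the sketch. */
#define TOY_PAGE_SIZE 4096u

/*
 * Count the non-zero bytes in [offset, TOY_PAGE_SIZE) of a page buffer.
 * A result of 0 means the tail beyond i_size is still all zeroes.
 */
static size_t count_nonzero_tail(const unsigned char *page, size_t offset)
{
	size_t nonzero = 0;

	assert(offset <= TOY_PAGE_SIZE);
	for (; offset < TOY_PAGE_SIZE; offset++)
		if (page[offset])
			nonzero++;
	return nonzero;
}
```

The count also tells you how many stray bytes there were, which the sum
in the debug patch does not.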



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-21  0:24                                                       ` Linus Torvalds
@ 2006-12-21 15:48                                                         ` Andrei Popa
  2006-12-21 16:58                                                           ` Linus Torvalds
  0 siblings, 1 reply; 311+ messages in thread
From: Andrei Popa @ 2006-12-21 15:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Martin Michlmayr, Peter Zijlstra, Hugh Dickins,
	Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List,
	Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann, gordonfarquharson, Chen, Kenneth W

On Wed, 2006-12-20 at 16:24 -0800, Linus Torvalds wrote:
> 
> Btw, I'd really love to hear whether the patch I sent out actually _helps_ 
> at all, or whether we're just discussing something that in the end is just 
> a cleanup..
> 
> Martin, Peter, Andrei, pls give it a try. (Martin and Andrei may be 
> talking about different bugs, so _both_ of your experiences definitely 
> matter here).

with http://lkml.org/lkml/diff/2006/12/20/204/1
I have corruption: Hash check on download completion found bad chunks,
consider using "safe_sync".

> 
> 		Linus


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-21 15:48                                                         ` Andrei Popa
@ 2006-12-21 16:58                                                           ` Linus Torvalds
  0 siblings, 0 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-21 16:58 UTC (permalink / raw)
  To: Andrei Popa
  Cc: Andrew Morton, Martin Michlmayr, Peter Zijlstra, Hugh Dickins,
	Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List,
	Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens,
	Arnd Bergmann, gordonfarquharson, Chen, Kenneth W



On Thu, 21 Dec 2006, Andrei Popa wrote:

> On Wed, 2006-12-20 at 16:24 -0800, Linus Torvalds wrote:
> > 
> > Martin, Peter, Andrei, pls give it a try. (Martin and Andrei may be 
> > talking about different bugs, so _both_ of your experiences definitely 
> > matter here).
> 
> with http://lkml.org/lkml/diff/2006/12/20/204/1
> I have corruption: Hash check on download completion found bad chunks,
> consider using "safe_sync".

Gaah. Martin Michlmayr reported that it apparently fixes his ARM 
corruption.

Now, admittedly I already suspected the issues might be different (if only 
because of the UP vs SMP/PREEMPT case), but I really had my hopes up after 
Martin's report, because if anything, _his_ issue might have been a 
superset of your problem (while obviously any subtle SMP races you might 
be seeing are definitely not an issue in his case).

Oh well. I think the ARM case is enough of a reason to apply those patches 
(if it hadn't made any difference at all, I'd have waited until after 
2.6.20), and we'll just have to continue on the SMP PREEMPT angle.

			Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-21  2:54                                                   ` Trond Myklebust
@ 2006-12-21 17:19                                                     ` Linus Torvalds
  0 siblings, 0 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-21 17:19 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Andrew Morton, Martin Michlmayr, Peter Zijlstra, Hugh Dickins,
	Nick Piggin, Arjan van de Ven, Andrei Popa,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann,
	gordonfarquharson, Chen, Kenneth W



On Wed, 20 Dec 2006, Trond Myklebust wrote:
> 
> I can't see that it is the business of invalidate_inode_pages2() to
> resolve races between ->direct_IO() and pages that are redirtied by
> mmap(). All it needs to ensure is that pages that are clean are discarded,
> since those are neither consistent with data that the ->direct_IO() call
> wrote to the disk nor are they scheduled to be written to disk.

Sure, we could happily just remove the -EIO. Alternatively, we could still 
do all the invalidates over the whole range, and return -EIO at the end if 
any of the pages weren't invalidated because they had to be written back. 

I don't personally care whether we should just return success or something 
to indicate that there were busy pages, but somebody who _uses_ direct-IO 
might want to know that the thing didn't throw away everything. If you 
know such users, can you ask them?

(Maybe "-EAGAIN" is better than "-EIO", since it's not really even a fatal 
error).
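
The "invalidate the whole range, report at the end" alternative can be
sketched in user space as below. This is only an illustration: struct
sim_page, sim_invalidate_page() and sim_invalidate_range() are hypothetical
stand-ins, not the kernel's invalidate_inode_pages2_range(), and -EAGAIN is
used because it is the value floated above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page-cache page. */
struct sim_page {
	bool cached;	/* still present in the (simulated) page cache */
	bool busy;	/* dirty/redirtied, so it cannot simply be dropped */
};

/* Drop one page unless it is busy; returns true if it was dropped. */
static bool sim_invalidate_page(struct sim_page *p)
{
	if (p->busy)
		return false;
	p->cached = false;
	return true;
}

/*
 * Walk the whole range even when a page cannot be invalidated, and only
 * report the failure once the walk is complete.
 */
static int sim_invalidate_range(struct sim_page *pages, size_t n)
{
	int ret = 0;

	assert(pages != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!pages[i].cached)
			continue;
		if (!sim_invalidate_page(&pages[i]))
			ret = -EAGAIN;	/* remember it, but keep going */
	}
	return ret;
}
```

The difference from an early-exit loop is that clean pages after the first
busy one are still discarded; the caller just learns that something was
left behind.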

		Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-21  9:20                                                       ` Peter Zijlstra
  2006-12-21  9:26                                                         ` Martin Schwidefsky
@ 2006-12-21 20:01                                                         ` Linus Torvalds
  2006-12-28  0:00                                                           ` Martin Schwidefsky
  1 sibling, 1 reply; 311+ messages in thread
From: Linus Torvalds @ 2006-12-21 20:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: schwidefsky, Martin Michlmayr, Hugh Dickins, Nick Piggin,
	Arjan van de Ven, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Heiko Carstens, Arnd Bergmann, gordonfarquharson



On Thu, 21 Dec 2006, Peter Zijlstra wrote:
> 
> Also, I'm dubious about the while thing and stuck a WARN_ON(ret) thing
> at the beginning of the loop. flush_tlb_page() does IPI the other cpus
> to flush their tlb too, so there should not be a SMP race, Arjan?

Now, the reason I think the loop may be needed is:

	CPU#0				CPU#1
	-----				-----
					load old PTE entry
	clear dirty and WP bits
					write to page using old PTE
					NOT CHECKING that the new one
					is write-protected, and just
					setting the dirty bit blindly
					(but atomically)
	flush_tlb_page()
					TLB flushed, but we now have a
					page that is marked dirty and
					unwritable in the page tables,
					and we will mark it clean in
					"struct page *"

Now, the scary thing is, IF a CPU does this, then the way we do all this, 
we may actually have the following sequence:

	CPU#0				CPU#1
	-----				-----
					load old PTE entry
	ptep_clear_flush():
					atomic "set dirty bit" sequence
					PTEP now contains 0000040 !!!
	flush_tlb_page();
					TLB flushed, but PTEP is still 
					"dirty zero"
	write the clear/readonly PTE
					THE DIRTY BIT WAS LOST!

which might actually explain this bug.

I personally _thought_ that Intel CPUs don't actually do a "set dirty 
bit atomically" sequence, but more of a "set dirty bit but trap if the TLB 
is nonpresent" thing, but I have absolutely no proof for that.

Anyway, IF this is the case, then the following patch may or may not fix 
things. It avoids things by never overwriting a PTE entry, not even the 
"cleared" one. It always does an atomic "xchg()" with a valid new entry, 
and looks at the old bits.

What do you guys think? Does something like this work out for S/390 too? I 
tried to make that "ptep_flush_dirty()" concept work for architectures 
that hide the dirty bit somewhere else too, but..

It actually simplifies the architecture-specific code (you just need to 
implement a trivial "ptep_exchange()" and "ptep_flush_dirty()" macro), but 
I only did x86-64 and i386, and while I've booted with this, I haven't 
really given the thing a lot of really _deep_ thought.

But I think this might be safer, as per above.. And it _might_ actually 
explain the problem. Exactly because the "ptep_clear() + blindly assign to 
ptep" might lose a dirty bit that was written by another CPU.

But this really does depend on what a CPU does when it marks a page dirty. 
Does it just blindly write the dirty bit? Or does it actually _validate_ 
that the old page table entry was still present and writable?

This patch makes no assumptions. It should work even if a CPU just writes 
the dirty bit blindly, and the only expectation is that the page tables 
can be accessed atomically (which had _better_ be true on any SMP 
architecture)
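
To make the window concrete, here is a minimal single-threaded user-space
sketch that replays the second interleaving deterministically. It is not
kernel code: sim_pte_t, the PTE_* bit values, hw_blind_set_dirty() and both
clean_by_*() helpers are made up for illustration, and the "hardware" write
is assumed to be a blind atomic OR, which is exactly the behaviour in
question.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout, loosely x86-like; not the kernel's definitions. */
#define PTE_PRESENT 0x001u
#define PTE_WRITE   0x002u
#define PTE_DIRTY   0x040u

typedef _Atomic uint64_t sim_pte_t;

/* Assumed CPU behaviour: set the dirty bit atomically without rechecking. */
static void hw_blind_set_dirty(sim_pte_t *ptep)
{
	atomic_fetch_or(ptep, PTE_DIRTY);
}

static uint64_t mkclean_wrprotect(uint64_t pte)
{
	return pte & ~(uint64_t)(PTE_DIRTY | PTE_WRITE);
}

/* Clear, then blindly store the new entry: a racing dirty bit is overwritten. */
static bool clean_by_clear_then_store(sim_pte_t *ptep, bool racing_write)
{
	uint64_t old = atomic_exchange(ptep, 0);

	if (racing_write)
		hw_blind_set_dirty(ptep);	/* entry is now a "dirty zero" */
	atomic_store(ptep, mkclean_wrprotect(old));
	return (old & PTE_DIRTY) != 0;
}

/* Always exchange in a valid entry and inspect what was there before. */
static bool clean_by_exchange(sim_pte_t *ptep, bool racing_write)
{
	uint64_t old = atomic_load(ptep);
	uint64_t new = mkclean_wrprotect(old);
	bool dirty = false;

	if (old == new)
		return false;
	for (;;) {
		old = atomic_exchange(ptep, new);
		if (old == new)
			break;
		dirty |= (old & PTE_DIRTY) != 0;
		/* The real loop flushes the TLB here; the race hits before that. */
		if (racing_write) {
			hw_blind_set_dirty(ptep);
			racing_write = false;
		}
	}
	return dirty;
}
```

Starting from a present, writable, not-yet-dirty entry with one racing
write, the first helper ends up with a clean entry and reports "not dirty",
while the exchange loop goes around once more and picks the bit up.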

Arjan, can you please check within Intel, and ask what the "proper" 
sequence for doing something like this is?

			Linus

----
commit 301d2d53ca0e5d2f61b1c1c259da410c7ee6d6a7
Author: Linus Torvalds <torvalds@woody.osdl.org>
Date:   Thu Dec 21 11:11:05 2006 -0800

    Rewrite the page table "clear dirty and writable" accesses
    
    This is much simpler for most architectures, and allows us to do the
    dirty and writable clear in a single operation without any races or any
    double flushes.
    
    It's also much more careful: we never overwrite the old dirty bits at
    any time, and always make sure to do atomic memory ops to exchange and
    see the old value.
    
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 9d774d0..8879f1d 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -61,31 +61,6 @@ do {				  					  \
 })
 #endif
 
-#ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
-#define ptep_test_and_clear_dirty(__vma, __address, __ptep)		\
-({									\
-	pte_t __pte = *__ptep;						\
-	int r = 1;							\
-	if (!pte_dirty(__pte))						\
-		r = 0;							\
-	else								\
-		set_pte_at((__vma)->vm_mm, (__address), (__ptep),	\
-			   pte_mkclean(__pte));				\
-	r;								\
-})
-#endif
-
-#ifndef __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH
-#define ptep_clear_flush_dirty(__vma, __address, __ptep)		\
-({									\
-	int __dirty;							\
-	__dirty = ptep_test_and_clear_dirty(__vma, __address, __ptep);	\
-	if (__dirty)							\
-		flush_tlb_page(__vma, __address);			\
-	__dirty;							\
-})
-#endif
-
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 #define ptep_get_and_clear(__mm, __address, __ptep)			\
 ({									\
diff --git a/include/asm-i386/pgtable.h b/include/asm-i386/pgtable.h
index e6a4723..b61d6f9 100644
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -300,18 +300,20 @@ do {									\
 	flush_tlb_page(vma, address);					\
 } while (0)
 
-#define __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH
-#define ptep_clear_flush_dirty(vma, address, ptep)			\
-({									\
-	int __dirty;							\
-	__dirty = pte_dirty(*(ptep));					\
-	if (__dirty) {							\
-		clear_bit(_PAGE_BIT_DIRTY, &(ptep)->pte_low);		\
-		pte_update_defer((vma)->vm_mm, (address), (ptep));	\
-		flush_tlb_page(vma, address);				\
-	}								\
-	__dirty;							\
-})
+/*
+ * "ptep_exchange()" can be used to atomically change a set of
+ * page table protection bits, returning the old ones (the dirty
+ * and accessed bits in particular, since they are set by hw).
+ *
+ * "ptep_flush_dirty()" then returns the dirty status of the
+ * page (on x86-64, we just look at the dirty bit in the returned
+ * pte, but some other architectures have the dirty bits in
+ * other places than the page tables).
+ */
+#define ptep_exchange(vma, address, ptep, old, new) \
+	(old).pte_low = xchg(&(ptep)->pte_low, (new).pte_low);
+#define ptep_flush_dirty(vma, address, ptep, old) \
+	pte_dirty(old)
 
 #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
 #define ptep_clear_flush_young(vma, address, ptep)			\
diff --git a/include/asm-x86_64/pgtable.h b/include/asm-x86_64/pgtable.h
index 59901c6..07754b5 100644
--- a/include/asm-x86_64/pgtable.h
+++ b/include/asm-x86_64/pgtable.h
@@ -283,12 +283,20 @@ static inline pte_t pte_clrhuge(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) &
 
 struct vm_area_struct;
 
-static inline int ptep_test_and_clear_dirty(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
-{
-	if (!pte_dirty(*ptep))
-		return 0;
-	return test_and_clear_bit(_PAGE_BIT_DIRTY, &ptep->pte);
-}
+/*
+ * "ptep_exchange()" can be used to atomically change a set of
+ * page table protection bits, returning the old ones (the dirty
+ * and accessed bits in particular, since they are set by hw).
+ *
+ * "ptep_flush_dirty()" then returns the dirty status of the
+ * page (on x86-64, we just look at the dirty bit in the returned
+ * pte, but some other architectures have the dirty bits in
+ * other places than the page tables).
+ */
+#define ptep_exchange(vma, address, ptep, old, new) \
+	(old).pte = xchg(&(ptep)->pte, (new).pte);
+#define ptep_flush_dirty(vma, address, ptep, old) \
+	pte_dirty(old)
 
 static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
 {
diff --git a/mm/rmap.c b/mm/rmap.c
index d8a842a..a028803 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -432,7 +432,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
-	pte_t *pte, entry;
+	pte_t *ptep;
 	spinlock_t *ptl;
 	int ret = 0;
 
@@ -440,22 +440,24 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
 	if (address == -EFAULT)
 		goto out;
 
-	pte = page_check_address(page, mm, address, &ptl);
-	if (!pte)
-		goto out;
-
-	if (!pte_dirty(*pte) && !pte_write(*pte))
-		goto unlock;
-
-	entry = ptep_get_and_clear(mm, address, pte);
-	entry = pte_mkclean(entry);
-	entry = pte_wrprotect(entry);
-	ptep_establish(vma, address, pte, entry);
-	lazy_mmu_prot_update(entry);
-	ret = 1;
-
-unlock:
-	pte_unmap_unlock(pte, ptl);
+	ptep = page_check_address(page, mm, address, &ptl);
+	if (ptep) {
+		pte_t old, new;
+
+		old = *ptep;
+		new = pte_wrprotect(pte_mkclean(old));
+		if (!pte_same(old, new)) {
+			for (;;) {
+				flush_cache_page(vma, address, page_to_pfn(page));
+				ptep_exchange(vma, address, ptep, old, new);
+				if (pte_same(old, new))
+					break;
+				ret |= ptep_flush_dirty(vma, address, ptep, old);
+				flush_tlb_page(vma, address);
+			}
+		}
+		pte_unmap_unlock(ptep, ptl);
+	}
 out:
 	return ret;
 }

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-21 13:03                           ` Peter Zijlstra
@ 2006-12-21 20:40                             ` Andrew Morton
  0 siblings, 0 replies; 311+ messages in thread
From: Andrew Morton @ 2006-12-21 20:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Nick Piggin, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Thu, 21 Dec 2006 14:03:20 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Tue, 2006-12-19 at 09:43 -0800, Linus Torvalds wrote:
> > 
> > Btw,
> >  here's a totally new tangent on this: it's possible that user code is 
> > simply BUGGY. 
> 
> depmod: BADNESS: written outside isize 22183

akpm:/usr/src/module-init-tools-3.3-pre1> grep -r mmap .
./zlibsupport.c:        map = mmap(0, *size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);

So presumably it's in a library.

akpm:/usr/src/25> ldd /sbin/depmod
        linux-gate.so.1 =>  (0xffffe000)
        libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0x46afa000)
        /lib/ld-linux.so.2 (0x4631d000)

worrisome.

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-21  9:27                                                     ` Andrew Morton
@ 2006-12-22  4:20                                                       ` Gordon Farquharson
  2006-12-22  4:54                                                         ` Linus Torvalds
  2006-12-22 10:01                                                         ` Martin Michlmayr
  0 siblings, 2 replies; 311+ messages in thread
From: Gordon Farquharson @ 2006-12-22  4:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Martin Michlmayr, Peter Zijlstra, Hugh Dickins,
	Nick Piggin, Arjan van de Ven, Andrei Popa,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann

On 12/21/06, Andrew Morton <akpm@osdl.org> wrote:

> > Can the call to task_io_account_cancelled_write() simply be removed
> > from cancel_dirty_page() for testing the patch with 2.6.19 (since
> > 2.6.19 doesn't seem to have the task I/O accounting) ?
>
> Yes.

I tested 2.6.19 with a version of Linus's patch that applies cleanly
to 2.6.19 (patch appended to the end of this email) on ARM and apt-get
failed. It did not segfault this time, but instead got stuck for about
20 to 30 minutes and was accessing the hard drive frequently.

Here is some background about the problem we see with apt which may
help somebody with knowledge of the apt source code analyse the
problem in the context of the patch. When apt-get is first run, it
generates pkgcache.bin and srcpkgcache.bin in /var/cache/apt. We have
found that these are the files that get corrupted when we apply the
patch "mm: tracking shared dirty pages" [1] to 2.6.18. The corruption
of these files is what causes apt-get to segfault. I have observed
that the normal operation of apt-get is that while apt-get is
generating these files, pkgcache.bin grows to 12582912 bytes, and when
apt-get finishes, pkgcache.bin is 6425533 bytes and srcpkgcache.bin is
64254483 bytes. This time, when apt-get exited, it had only created
pkgcache.bin which was still 12582912 bytes. Also, the patch caused
apt to slow down a lot. I ran apt-get -f install after apt had exited,
and it took so long that I killed it before it had finished.

I did not try 2.6.20-git, but I presume that this version is what
Martin tried earlier. Maybe Linus's patch doesn't work with 2.6.19,
because 2.6.19 is missing some other patch.

Gordon

[1] http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d08b3851da41d0ee60851f2c75b118e1f7a5fc89

diff -Naupr linux-2.6.19.orig/fs/buffer.c linux-2.6.19/fs/buffer.c
--- linux-2.6.19.orig/fs/buffer.c       2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/fs/buffer.c    2006-12-21 01:16:31.000000000 -0700
@@ -2832,7 +2832,7 @@ int try_to_free_buffers(struct page *pag
        int ret = 0;

        BUG_ON(!PageLocked(page));
-       if (PageWriteback(page))
+       if (PageDirty(page) || PageWriteback(page))
                return 0;

        if (mapping == NULL) {          /* can this still happen? */
@@ -2843,17 +2843,6 @@ int try_to_free_buffers(struct page *pag
        spin_lock(&mapping->private_lock);
        ret = drop_buffers(page, &buffers_to_free);
        spin_unlock(&mapping->private_lock);
-       if (ret) {
-               /*
-                * If the filesystem writes its buffers by hand (eg ext3)
-                * then we can have clean buffers against a dirty page.  We
-                * clean the page here; otherwise later reattachment of buffers
-                * could encounter a non-uptodate page, which is unresolvable.
-                * This only applies in the rare case where try_to_free_buffers
-                * succeeds but the page is not freed.
-                */
-               clear_page_dirty(page);
-       }
 out:
        if (buffers_to_free) {
                struct buffer_head *bh = buffers_to_free;
diff -Naupr linux-2.6.19.orig/fs/hugetlbfs/inode.c linux-2.6.19/fs/hugetlbfs/inode.c
--- linux-2.6.19.orig/fs/hugetlbfs/inode.c      2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/fs/hugetlbfs/inode.c   2006-12-21 01:15:21.000000000 -0700
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct

  static void truncate_huge_page(struct page *page)
 {
-       clear_page_dirty(page);
+       cancel_dirty_page(page, /* No IO accounting for huge pages? */0);
        ClearPageUptodate(page);
        remove_from_page_cache(page);
        put_page(page);
diff -Naupr linux-2.6.19.orig/include/linux/page-flags.h linux-2.6.19/include/linux/page-flags.h
--- linux-2.6.19.orig/include/linux/page-flags.h        2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/include/linux/page-flags.h     2006-12-21 01:15:21.000000000 -0700
@@ -253,15 +253,11 @@ static inline void SetPageUptodate(struc

  struct page;   /* forward declaration */

-int test_clear_page_dirty(struct page *page);
+extern void cancel_dirty_page(struct page *page, unsigned int account_size);
+
  int test_clear_page_writeback(struct page *page);
  int test_set_page_writeback(struct page *page);

-static inline void clear_page_dirty(struct page *page)
-{
-       test_clear_page_dirty(page);
-}
-
  static inline void set_page_writeback(struct page *page)
 {
        test_set_page_writeback(page);
diff -Naupr linux-2.6.19.orig/mm/memory.c linux-2.6.19/mm/memory.c
--- linux-2.6.19.orig/mm/memory.c       2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/memory.c    2006-12-21 01:15:21.000000000 -0700
@@ -1832,6 +1832,33 @@ void unmap_mapping_range(struct address_
 }
 EXPORT_SYMBOL(unmap_mapping_range);

+static void check_last_page(struct address_space *mapping, loff_t size)
+{
+       pgoff_t index;
+       unsigned int offset;
+       struct page *page;
+
+       if (!mapping)
+               return;
+       offset = size & ~PAGE_MASK;
+       if (!offset)
+               return;
+       index = size >> PAGE_SHIFT;
+       page = find_lock_page(mapping, index);
+       if (page) {
+               unsigned int check = 0;
+               unsigned char *kaddr = kmap_atomic(page, KM_USER0);
+               do {
+                       check += kaddr[offset++];
+               } while (offset < PAGE_SIZE);
+               kunmap_atomic(kaddr,KM_USER0);
+               unlock_page(page);
+               page_cache_release(page);
+               if (check)
+                       printk("%s: BADNESS: truncate check %u\n", current->comm, check);
+       }
+}
+
 /**
  * vmtruncate - unmap mappings "freed" by truncate() syscall
  * @inode: inode of the file used
@@ -1865,6 +1892,7 @@ do_expand:
                goto out_sig;
        if (offset > inode->i_sb->s_maxbytes)
                goto out_big;
+       check_last_page(mapping, inode->i_size);
        i_size_write(inode, offset);

 out_truncate:
diff -Naupr linux-2.6.19.orig/mm/page-writeback.c linux-2.6.19/mm/page-writeback.c
--- linux-2.6.19.orig/mm/page-writeback.c       2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/page-writeback.c    2006-12-21 01:26:53.000000000 -0700
@@ -843,39 +843,6 @@ int set_page_dirty_lock(struct page *pag
 EXPORT_SYMBOL(set_page_dirty_lock);

 /*
- * Clear a page's dirty flag, while caring for dirty memory accounting.
- * Returns true if the page was previously dirty.
- */
-int test_clear_page_dirty(struct page *page)
-{
-       struct address_space *mapping = page_mapping(page);
-       unsigned long flags;
-
-       if (mapping) {
-               write_lock_irqsave(&mapping->tree_lock, flags);
-               if (TestClearPageDirty(page)) {
-                       radix_tree_tag_clear(&mapping->page_tree,
-                                               page_index(page),
-                                               PAGECACHE_TAG_DIRTY);
-                       write_unlock_irqrestore(&mapping->tree_lock, flags);
-                       /*
-                        * We can continue to use `mapping' here because the
-                        * page is locked, which pins the address_space
-                        */
-                       if (mapping_cap_account_dirty(mapping)) {
-                               page_mkclean(page);
-                               dec_zone_page_state(page, NR_FILE_DIRTY);
-                       }
-                       return 1;
-               }
-               write_unlock_irqrestore(&mapping->tree_lock, flags);
-               return 0;
-       }
-       return TestClearPageDirty(page);
-}
-EXPORT_SYMBOL(test_clear_page_dirty);
-
-/*
  * Clear a page's dirty flag, while caring for dirty memory accounting.
  * Returns true if the page was previously dirty.
  *
diff -Naupr linux-2.6.19.orig/mm/truncate.c linux-2.6.19/mm/truncate.c
--- linux-2.6.19.orig/mm/truncate.c     2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/truncate.c  2006-12-21 15:58:18.000000000 -0700
@@ -50,6 +50,17 @@ static inline void truncate_partial_page
                do_invalidatepage(page, partial);
 }

+void cancel_dirty_page(struct page *page, unsigned int account_size)
+{
+       /* If we're cancelling the page, it had better not be mapped any more */
+       if (page_mapped(page)) {
+               static unsigned int warncount;
+
+               WARN_ON(++warncount < 5);
+       }
+}
+
+
 /*
  * If truncate cannot remove the fs-private metadata from the page, the page
  * becomes anonymous.  It will be left on the LRU and may even be mapped into
@@ -69,7 +80,8 @@ truncate_complete_page(struct address_sp
        if (PagePrivate(page))
                do_invalidatepage(page, 0);

-       clear_page_dirty(page);
+       cancel_dirty_page(page, PAGE_CACHE_SIZE);
+
        ClearPageUptodate(page);
        ClearPageMappedToDisk(page);
        remove_from_page_cache(page);
@@ -348,7 +360,6 @@ int invalidate_inode_pages2_range(struct
                for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
                        struct page *page = pvec.pages[i];
                        pgoff_t page_index;
-                       int was_dirty;

                        lock_page(page);
                        if (page->mapping != mapping) {
@@ -384,12 +395,8 @@ int invalidate_inode_pages2_range(struct
                                          PAGE_CACHE_SIZE, 0);
                                }
                        }
-                       was_dirty = test_clear_page_dirty(page);
-                       if (!invalidate_complete_page2(mapping, page)) {
-                               if (was_dirty)
-                                       set_page_dirty(page);
+                       if (!invalidate_complete_page2(mapping, page))
                                ret = -EIO;
-                       }
                        unlock_page(page);
                }
                pagevec_release(&pvec);


-- 
Gordon Farquharson

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-22  4:20                                                       ` Gordon Farquharson
@ 2006-12-22  4:54                                                         ` Linus Torvalds
  2006-12-22 10:00                                                           ` Martin Michlmayr
  2006-12-22 15:08                                                           ` Gordon Farquharson
  2006-12-22 10:01                                                         ` Martin Michlmayr
  1 sibling, 2 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-22  4:54 UTC (permalink / raw)
  To: Gordon Farquharson
  Cc: Andrew Morton, Martin Michlmayr, Peter Zijlstra, Hugh Dickins,
	Nick Piggin, Arjan van de Ven, Andrei Popa,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann



On Thu, 21 Dec 2006, Gordon Farquharson wrote:
> 
> I tested 2.6.19 with a version of Linus's patch that applies cleanly
> to 2.6.19 (patch appended to the end of this email) on ARM and apt-get
> failed. It did not segfault this time, but instead got stuck for about
> 20 to 30 minutes and was accessing the hard drive frequently.

Ok, there's definitely something screwy going on.

Andrew located at least one bug: we run cancel_dirty_page() too late in 
"truncate_complete_page()", which means that do_invalidatepage() ends up 
not clearing the page cache.

His patch is appended.

But it sounds like I probably misunderstood something, because I thought 
that Martin had acknowledged that this patch actually worked for him. 
Which sounded very similar to your setup (he has a 32M ARM box too, no?)

And your failure sounds a lot like one that David Miller is reporting. At 
the same time, my own shared file mmap tests on my own machines obviously 
work fine (I lower the dirty-writeback thresholds to force writeback more 
easily, and then mmap a file and write and rewrite to it in memory, and 
truncate it).
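
For anyone who wants to try something similar, a rough sketch of that kind
of test follows. It is not the program actually used above: the function
name shared_mmap_rewrite_check(), the /tmp scratch file and the fill
pattern are all assumptions, and it makes no attempt to lower the
dirty-writeback thresholds.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700	/* for mkstemp(), pread(), ftruncate() */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 0 if the file holds the last pattern after truncation, else -1. */
int shared_mmap_rewrite_check(size_t npages, int rounds)
{
	char path[] = "/tmp/mmap-rewrite-XXXXXX";	/* hypothetical scratch file */
	size_t psz = (size_t)sysconf(_SC_PAGESIZE);
	size_t len = npages * psz;
	size_t keep = len - psz / 2;	/* new size ends inside the last page */
	unsigned char *map;
	struct stat st;
	int fd, r, ret = -1;
	char last = 0;

	assert(npages > 0 && rounds > 0);

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);			/* the file lives on until fd is closed */

	if (ftruncate(fd, (off_t)len) < 0)
		goto out;
	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out;

	/* Write and rewrite the whole mapping, nudging writeback each round. */
	for (r = 0; r < rounds; r++) {
		last = (char)('A' + r % 26);
		memset(map, last, len);
		msync(map, len, MS_ASYNC);
	}
	if (msync(map, len, MS_SYNC) < 0 || munmap(map, len) < 0)
		goto out;

	if (ftruncate(fd, (off_t)keep) < 0 || fstat(fd, &st) < 0)
		goto out;
	if ((size_t)st.st_size != keep)
		goto out;

	/* Read back through the fd; every byte must carry the last pattern. */
	for (size_t off = 0; off < keep; off++) {
		char c;

		if (pread(fd, &c, 1, (off_t)off) != 1 || c != last)
			goto out;
	}
	ret = 0;
out:
	close(fd);
	return ret;
}
```

Calling shared_mmap_rewrite_check(4, 3) on the filesystem under test (after
pointing the template at it) should return 0. A nonzero return would only
say that the read-back did not match, not why.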

Maybe it's a mount option issue? I've got data=ordered on my machine, are 
you perhaps running with something else?

		Linus

---
commit 3e67c0987d7567ad666641164a153dca9a43b11d
Author: Andrew Morton <akpm@osdl.org>
Date:   Thu Dec 21 11:00:33 2006 -0800

    [PATCH] truncate: clear page dirtiness before running try_to_free_buffers()
    
    truncate presently invalidates the dirty page's buffer_heads then shoots down
    the page.  But try_to_free_buffers() will now bale out because the page is
    dirty.
    
    Net effect: the LRU gets filled with dirty pages which have invalidated
    buffer_heads attached.  They have no ->mapping and hence cannot be cleaned.
    The machine leaks memory at an enormous rate.
    
    Fix this by cleaning the page before running try_to_free_buffers(), so
    try_to_free_buffers() can do its work.
    
    Also, remember to do dirty-page-accounting in cancel_dirty_page() so the
    machine won't wedge up trying to write non-existent dirty pages.
    
    Probably still wrong, but now less so.
    
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

diff --git a/mm/truncate.c b/mm/truncate.c
index bf9e296..89a5c35 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -60,11 +60,12 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 		WARN_ON(++warncount < 5);
 	}
 		
-	if (TestClearPageDirty(page) && account_size)
+	if (TestClearPageDirty(page) && account_size) {
+		dec_zone_page_state(page, NR_FILE_DIRTY);
 		task_io_account_cancelled_write(account_size);
+	}
 }
 
-
 /*
  * If truncate cannot remove the fs-private metadata from the page, the page
  * becomes anonymous.  It will be left on the LRU and may even be mapped into
@@ -81,11 +82,11 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
 	if (page->mapping != mapping)
 		return;
 
+	cancel_dirty_page(page, PAGE_CACHE_SIZE);
+
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
-	cancel_dirty_page(page, PAGE_CACHE_SIZE);
-
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);
 	remove_from_page_cache(page);

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-22  4:54                                                         ` Linus Torvalds
@ 2006-12-22 10:00                                                           ` Martin Michlmayr
  2006-12-22 10:06                                                             ` Martin Michlmayr
  2006-12-22 10:17                                                             ` Andrew Morton
  2006-12-22 15:08                                                           ` Gordon Farquharson
  1 sibling, 2 replies; 311+ messages in thread
From: Martin Michlmayr @ 2006-12-22 10:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Gordon Farquharson, Andrew Morton, Peter Zijlstra, Hugh Dickins,
	Nick Piggin, Arjan van de Ven, Andrei Popa,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann

* Linus Torvalds <torvalds@osdl.org> [2006-12-21 20:54]:
> But it sounds like I probably misunderstood something, because I thought
> that Martin had acknowledged that this patch actually worked for him.

That's what I thought too but now I can confirm what Gordon sees.  But
it's pretty weird.  Our testcase is to run Debian installer on the
NSLU2 arm device and apt-get would either segfault or hang at this
particular spot in the installation (when apt is first run).  With
your patch, apt works correctly where it normally fails (at least for
me).  I stopped the installation at this point and repeated it several
more times to make sure it's really working.  And, yes, I can repeat
this result.

This time, however, I let the installer continue and it seems that
with your patch apt now works where it failed in the past, but it
hangs later on.  It's pretty weird because I cannot even kill the
process:

sh-3.1# ps aux | grep 31126
root     31126  5.7 20.6  16240  6076 ?        R+   04:45   0:21 apt-get -o APT::Status-Fd=4 -o APT::Keep-Fds::=5 -o APT::Keep-Fds::=6 -q -y -f install popularity-contest
root     31157  0.0  1.6   1516   492 ttyS0    S+   04:51   0:00 grep 31126
sh-3.1# kill -9 31126
sh-3.1# kill -9 31126
sh-3.1# ps aux | grep 31126
root     31126  5.6 20.6  16240  6076 ?        R+   04:45   0:21 apt-get -o APT::Status-Fd=4 -o APT::Keep-Fds::=5 -o APT::Keep-Fds::=6 -q -y -f install popularity-contest
root     31159  0.0  1.6   1516   492 ttyS0    S+   04:51   0:00 grep 31126
sh-3.1#

> Which sounded very similar to your setup (he has a 32M ARM box too, no?)

It's the same device, a Linksys NSLU2.

> Author: Andrew Morton <akpm@osdl.org>

This patch makes it even worse for me.

> -	if (TestClearPageDirty(page) && account_size)
> +	if (TestClearPageDirty(page) && account_size) {
> +		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		task_io_account_cancelled_write(account_size);
> +	}

This hunk (on top of git from about 2 days ago and your latest patch)
results in the installer hanging right at the start.  The Linux kernel
boots fine, the debian-installer is loaded into a ramdisk but when
ncurses is being started it just hangs.  Reverting this hunk makes it
start again.

Does that help or confuse you even more?
-- 
Martin Michlmayr
http://www.cyrius.com/

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-22  4:20                                                       ` Gordon Farquharson
  2006-12-22  4:54                                                         ` Linus Torvalds
@ 2006-12-22 10:01                                                         ` Martin Michlmayr
  2006-12-22 15:16                                                           ` Gordon Farquharson
  1 sibling, 1 reply; 311+ messages in thread
From: Martin Michlmayr @ 2006-12-22 10:01 UTC (permalink / raw)
  To: Gordon Farquharson
  Cc: Andrew Morton, Linus Torvalds, Peter Zijlstra, Hugh Dickins,
	Nick Piggin, Arjan van de Ven, Andrei Popa,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann

* Gordon Farquharson <gordonfarquharson@gmail.com> [2006-12-21 21:20]:
> generating these files, pkgcache.bin grows to 12582912 bytes, and when
> apt-get finishes, pkgcache.bin is 6425533 bytes and srcpkgcache.bin is
> 64254483 bytes. This time, when apt-get exited, it had only created
> pkgcache.bin which was still 12582912 bytes.

Yes, same here:

sh-3.1# ls -l /var/cache/apt/
total 5252
drwxr-xr-x 3 root root    12288 Dec 22 04:41 archives
-rw-r--r-- 1 root root 12582912 Dec 22 04:45 pkgcache.bin
-rw-r--r-- 1 root root     8554 Dec 22 04:45 srcpkgcache.bin

Gordon, does it fail for you where it normally does (installing
initramfs-tools) or much later?  For me, the installer was able to
install initramfs-tools and the kernel, but apt now hangs at "Select
and install software".
-- 
Martin Michlmayr
http://www.cyrius.com/

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-22 10:00                                                           ` Martin Michlmayr
@ 2006-12-22 10:06                                                             ` Martin Michlmayr
  2006-12-22 10:10                                                               ` Martin Michlmayr
  2006-12-22 10:17                                                             ` Andrew Morton
  1 sibling, 1 reply; 311+ messages in thread
From: Martin Michlmayr @ 2006-12-22 10:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Gordon Farquharson, Andrew Morton, Peter Zijlstra, Hugh Dickins,
	Nick Piggin, Arjan van de Ven, Andrei Popa,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber

* Martin Michlmayr <tbm@cyrius.com> [2006-12-22 11:00]:
> This time, however, I let the installer continue and it seems that
> with your patch apt now works where it failed in the past, but it
> hangs later on.  It's pretty weird because I cannot even kill the
> process:

Okay, it's really weird.  So apt-get just hangs doing nothing and I
cannot even kill it.  I just tried to download strace via wget and
immediately when I started wget, the hanging apt-get process
continued.
-- 
Martin Michlmayr
http://www.cyrius.com/

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-22 10:06                                                             ` Martin Michlmayr
@ 2006-12-22 10:10                                                               ` Martin Michlmayr
  2006-12-22 11:07                                                                 ` Martin Michlmayr
  2006-12-22 15:30                                                                 ` Gordon Farquharson
  0 siblings, 2 replies; 311+ messages in thread
From: Martin Michlmayr @ 2006-12-22 10:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Gordon Farquharson, Andrew Morton, Peter Zijlstra, Hugh Dickins,
	Nick Piggin, Arjan van de Ven, Andrei Popa,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber

* Martin Michlmayr <tbm@cyrius.com> [2006-12-22 11:06]:
> Okay, it's really weird.  So apt-get just hangs doing nothing and I
> cannot even kill it.  I just tried to download strace via wget and
> immediately when I started wget, the hanging apt-get process
> continued.

... and now that we've completed this step, the apt cache has suddenly
been reduced (see Gordon's mail for an explanation) and it segfaults:

sh-3.1# ls -l /var/cache/apt/
total 12524
drwxr-xr-x 3 root root   12288 Dec 22 04:41 archives
-rw-r--r-- 1 root root 6426885 Dec 22 05:03 pkgcache.bin
-rw-r--r-- 1 root root 6426835 Dec 22 05:03 srcpkgcache.bin
sh-3.1# apt-get -f install
Reading package lists... Done
Segmentation faulty tree... 50%

-- 
Martin Michlmayr
http://www.cyrius.com/

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-22 10:00                                                           ` Martin Michlmayr
  2006-12-22 10:06                                                             ` Martin Michlmayr
@ 2006-12-22 10:17                                                             ` Andrew Morton
  2006-12-22 11:12                                                               ` Martin Michlmayr
  2006-12-22 12:24                                                               ` Andrei Popa
  1 sibling, 2 replies; 311+ messages in thread
From: Andrew Morton @ 2006-12-22 10:17 UTC (permalink / raw)
  To: Martin Michlmayr
  Cc: Linus Torvalds, Gordon Farquharson, Peter Zijlstra, Hugh Dickins,
	Nick Piggin, Arjan van de Ven, Andrei Popa,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann

On Fri, 22 Dec 2006 11:00:04 +0100
Martin Michlmayr <tbm@cyrius.com> wrote:

> > -	if (TestClearPageDirty(page) && account_size)
> > +	if (TestClearPageDirty(page) && account_size) {
> > +		dec_zone_page_state(page, NR_FILE_DIRTY);
> >  		task_io_account_cancelled_write(account_size);
> > +	}
> 
> This hunk (on top of git from about 2 days ago and your latest patch)
> results in the installer hanging right at the start. 

You'll need this also:

From: Andrew Morton <akpm@osdl.org>

Only (un)account for IO and page-dirtying for devices which have real backing
store (ie: not tmpfs or ramdisks).

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 mm/truncate.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff -puN mm/truncate.c~truncate-dirty-memory-accounting-fix mm/truncate.c
--- a/mm/truncate.c~truncate-dirty-memory-accounting-fix
+++ a/mm/truncate.c
@@ -60,7 +60,8 @@ void cancel_dirty_page(struct page *page
 		WARN_ON(++warncount < 5);
 	}
 		
-	if (TestClearPageDirty(page) && account_size) {
+	if (TestClearPageDirty(page) && account_size &&
+			mapping_cap_account_dirty(page->mapping)) {
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		task_io_account_cancelled_write(account_size);
 	}
_
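
For readers following the accounting logic, here is a minimal user-space sketch of the patched condition. It is only a model: the toy_* types, fields and counters are hypothetical stand-ins for struct page, mapping_cap_account_dirty(), NR_FILE_DIRTY and task_io_account_cancelled_write(), not kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; none of these are kernel types. */
struct toy_mapping {
	bool cap_account_dirty;	/* false for tmpfs/ramdisk-like mappings */
};

struct toy_page {
	bool dirty;
	struct toy_mapping *mapping;
};

struct toy_counters {
	long nr_file_dirty;		/* models NR_FILE_DIRTY */
	long cancelled_write_bytes;	/* models the cancelled-write accounting */
};

/* Clear the dirty flag and report whether it was set. */
static bool toy_test_clear_dirty(struct toy_page *page)
{
	bool was_dirty = page->dirty;

	page->dirty = false;
	return was_dirty;
}

/*
 * Model of the patched condition: the dirty flag is always cleared, but
 * the counters are only touched when the mapping accounts dirty pages.
 */
static void toy_cancel_dirty_page(struct toy_page *page,
				  unsigned int account_size,
				  struct toy_counters *c)
{
	if (toy_test_clear_dirty(page) && account_size &&
	    page->mapping->cap_account_dirty) {
		c->nr_file_dirty--;
		c->cancelled_write_bytes += account_size;
	}
}
```

In this model a dirty page on a mapping without dirty accounting still ends up clean, but the counters are left alone, so they are never decremented for pages that were never counted.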


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-22 10:10                                                               ` Martin Michlmayr
@ 2006-12-22 11:07                                                                 ` Martin Michlmayr
  2006-12-22 15:30                                                                 ` Gordon Farquharson
  1 sibling, 0 replies; 311+ messages in thread
From: Martin Michlmayr @ 2006-12-22 11:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Gordon Farquharson, Andrew Morton, Peter Zijlstra, Hugh Dickins,
	Nick Piggin, Arjan van de Ven, Andrei Popa,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber

* Martin Michlmayr <tbm@cyrius.com> [2006-12-22 11:10]:
> > immediately when I started wget, the hanging apt-get process
> > continued.
> ... and now that we've completed this step, the apt cache has suddenly
> been reduced (see Gordon's mail for an explanation) and it segfaults:

One of my questions was why apt-get worked to install the
initramfs-tools, the kernel and some other packages but later hung
while it was building the cache (which clearly it had built already to
install some packages): before the installer offers to install
additional packages, it changes the apt sources, which leads to apt
rebuilding the cache, and here it hangs.

Remember how I said that downloading a file with wget prompts apt to
work again?  Apparently any filesystem access will do (I just ran
find / > /dev/null).  Gordon, can you confirm this?
-- 
Martin Michlmayr
http://www.cyrius.com/

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-22 10:17                                                             ` Andrew Morton
@ 2006-12-22 11:12                                                               ` Martin Michlmayr
  2006-12-22 12:24                                                               ` Andrei Popa
  1 sibling, 0 replies; 311+ messages in thread
From: Martin Michlmayr @ 2006-12-22 11:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Gordon Farquharson, Peter Zijlstra, Hugh Dickins,
	Nick Piggin, Arjan van de Ven, Andrei Popa,
	Linux Kernel Mailing List

* Andrew Morton <akpm@osdl.org> [2006-12-22 02:17]:
> > This hunk (on top of git from about 2 days ago and your latest patch)
> > results in the installer hanging right at the start.
> 
> You'll need this also:

It starts again, thanks.
-- 
Martin Michlmayr
http://www.cyrius.com/

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-22 10:17                                                             ` Andrew Morton
  2006-12-22 11:12                                                               ` Martin Michlmayr
@ 2006-12-22 12:24                                                               ` Andrei Popa
  2006-12-22 12:32                                                                 ` Martin Michlmayr
  1 sibling, 1 reply; 311+ messages in thread
From: Andrei Popa @ 2006-12-22 12:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Martin Michlmayr, Linus Torvalds, Gordon Farquharson,
	Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann

With all three patches I have corruption....


diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageDirty(page) || PageWriteback(page))
 		return 0;
 
 	if (mapping == NULL) {		/* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
 	spin_unlock(&mapping->private_lock);
-	if (ret) {
-		/*
-		 * If the filesystem writes its buffers by hand (eg ext3)
-		 * then we can have clean buffers against a dirty page.  We
-		 * clean the page here; otherwise later reattachment of buffers
-		 * could encounter a non-uptodate page, which is unresolvable.
-		 * This only applies in the rare case where try_to_free_buffers
-		 * succeeds but the page is not freed.
-		 *
-		 * Also, during truncate, discard_buffer will have marked all
-		 * the page's buffers clean.  We discover that here and clean
-		 * the page also.
-		 */
-		if (test_clear_page_dirty(page))
-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
-	}
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..4f4cd13 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
 
 static void truncate_huge_page(struct page *page)
 {
-	clear_page_dirty(page);
+	cancel_dirty_page(page, /* No IO accounting for huge pages? */0);
 	ClearPageUptodate(page);
 	remove_from_page_cache(page);
 	put_page(page);
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 9d774d0..8879f1d 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -61,31 +61,6 @@ ({									\
 })
 #endif
 
-#ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
-#define ptep_test_and_clear_dirty(__vma, __address, __ptep)		\
-({									\
-	pte_t __pte = *__ptep;						\
-	int r = 1;							\
-	if (!pte_dirty(__pte))						\
-		r = 0;							\
-	else								\
-		set_pte_at((__vma)->vm_mm, (__address), (__ptep),	\
-			   pte_mkclean(__pte));				\
-	r;								\
-})
-#endif
-
-#ifndef __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH
-#define ptep_clear_flush_dirty(__vma, __address, __ptep)		\
-({									\
-	int __dirty;							\
-	__dirty = ptep_test_and_clear_dirty(__vma, __address, __ptep);	\
-	if (__dirty)							\
-		flush_tlb_page(__vma, __address);			\
-	__dirty;							\
-})
-#endif
-
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 #define ptep_get_and_clear(__mm, __address, __ptep)			\
 ({									\
diff --git a/include/asm-i386/pgtable.h b/include/asm-i386/pgtable.h
index e6a4723..b61d6f9 100644
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -300,18 +300,20 @@ do {									\
 	flush_tlb_page(vma, address);					\
 } while (0)
 
-#define __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH
-#define ptep_clear_flush_dirty(vma, address, ptep)			\
-({									\
-	int __dirty;							\
-	__dirty = pte_dirty(*(ptep));					\
-	if (__dirty) {							\
-		clear_bit(_PAGE_BIT_DIRTY, &(ptep)->pte_low);		\
-		pte_update_defer((vma)->vm_mm, (address), (ptep));	\
-		flush_tlb_page(vma, address);				\
-	}								\
-	__dirty;							\
-})
+/*
+ * "ptep_exchange()" can be used to atomically change a set of
+ * page table protection bits, returning the old ones (the dirty
+ * and accessed bits in particular, since they are set by hw).
+ *
+ * "ptep_flush_dirty()" then returns the dirty status of the
+ * page (on x86-64, we just look at the dirty bit in the returned
+ * pte, but some other architectures have the dirty bits in
+ * other places than the page tables).
+ */
+#define ptep_exchange(vma, address, ptep, old, new) \
+	(old).pte_low = xchg(&(ptep)->pte_low, (new).pte_low);
+#define ptep_flush_dirty(vma, address, ptep, old) \
+	pte_dirty(old)
 
 #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
 #define ptep_clear_flush_young(vma, address, ptep)			\
diff --git a/include/asm-x86_64/pgtable.h b/include/asm-x86_64/pgtable.h
index 59901c6..07754b5 100644
--- a/include/asm-x86_64/pgtable.h
+++ b/include/asm-x86_64/pgtable.h
@@ -283,12 +283,20 @@ static inline pte_t pte_clrhuge(pte_t pt
 
 struct vm_area_struct;
 
-static inline int ptep_test_and_clear_dirty(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
-{
-	if (!pte_dirty(*ptep))
-		return 0;
-	return test_and_clear_bit(_PAGE_BIT_DIRTY, &ptep->pte);
-}
+/*
+ * "ptep_exchange()" can be used to atomically change a set of
+ * page table protection bits, returning the old ones (the dirty
+ * and accessed bits in particular, since they are set by hw).
+ *
+ * "ptep_flush_dirty()" then returns the dirty status of the
+ * page (on x86-64, we just look at the dirty bit in the returned
+ * pte, but some other architectures have the dirty bits in
+ * other places than the page tables).
+ */
+#define ptep_exchange(vma, address, ptep, old, new) \
+	(old).pte = xchg(&(ptep)->pte, (new).pte);
+#define ptep_flush_dirty(vma, address, ptep, old) \
+	pte_dirty(old)
 
 static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
 {
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..350878a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,15 +253,11 @@ #define ClearPageUncached(page)	clear_bi
 
 struct page;	/* forward declaration */
 
-int test_clear_page_dirty(struct page *page);
+extern void cancel_dirty_page(struct page *page, unsigned int account_size);
+
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);
 
-static inline void clear_page_dirty(struct page *page)
-{
-	test_clear_page_dirty(page);
-}
-
 static inline void set_page_writeback(struct page *page)
 {
 	test_set_page_writeback(page);
diff --git a/mm/memory.c b/mm/memory.c
index c00bac6..79cecab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
+static void check_last_page(struct address_space *mapping, loff_t size)
+{
+	pgoff_t index;
+	unsigned int offset;
+	struct page *page;
+
+	if (!mapping)
+		return;
+	offset = size & ~PAGE_MASK;
+	if (!offset)
+		return;
+	index = size >> PAGE_SHIFT;
+	page = find_lock_page(mapping, index);
+	if (page) {
+		unsigned int check = 0;
+		unsigned char *kaddr = kmap_atomic(page, KM_USER0);
+		do {
+			check += kaddr[offset++];
+		} while (offset < PAGE_SIZE);
+		kunmap_atomic(kaddr,KM_USER0);
+		unlock_page(page);
+		page_cache_release(page);
+		if (check)
+			printk("%s: BADNESS: truncate check %u\n", current->comm, check);
+	}
+}
+
 /**
  * vmtruncate - unmap mappings "freed" by truncate() syscall
  * @inode: inode of the file used
@@ -1875,6 +1902,7 @@ do_expand:
 		goto out_sig;
 	if (offset > inode->i_sb->s_maxbytes)
 		goto out_big;
+	check_last_page(mapping, inode->i_size);
 	i_size_write(inode, offset);
 
 out_truncate:
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..b3a198c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -845,38 +845,6 @@ int set_page_dirty_lock(struct page *pag
 EXPORT_SYMBOL(set_page_dirty_lock);
 
 /*
- * Clear a page's dirty flag, while caring for dirty memory accounting.
- * Returns true if the page was previously dirty.
- */
-int test_clear_page_dirty(struct page *page)
-{
-	struct address_space *mapping = page_mapping(page);
-	unsigned long flags;
-
-	if (!mapping)
-		return TestClearPageDirty(page);
-
-	write_lock_irqsave(&mapping->tree_lock, flags);
-	if (TestClearPageDirty(page)) {
-		radix_tree_tag_clear(&mapping->page_tree,
-				page_index(page), PAGECACHE_TAG_DIRTY);
-		write_unlock_irqrestore(&mapping->tree_lock, flags);
-		/*
-		 * We can continue to use `mapping' here because the
-		 * page is locked, which pins the address_space
-		 */
-		if (mapping_cap_account_dirty(mapping)) {
-			page_mkclean(page);
-			dec_zone_page_state(page, NR_FILE_DIRTY);
-		}
-		return 1;
-	}
-	write_unlock_irqrestore(&mapping->tree_lock, flags);
-	return 0;
-}
-EXPORT_SYMBOL(test_clear_page_dirty);
-
-/*
  * Clear a page's dirty flag, while caring for dirty memory accounting.
  * Returns true if the page was previously dirty.
  *
diff --git a/mm/rmap.c b/mm/rmap.c
index d8a842a..a028803 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -432,7 +432,7 @@ static int page_mkclean_one(struct page 
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
-	pte_t *pte, entry;
+	pte_t *ptep;
 	spinlock_t *ptl;
 	int ret = 0;
 
@@ -440,22 +440,24 @@ static int page_mkclean_one(struct page 
 	if (address == -EFAULT)
 		goto out;
 
-	pte = page_check_address(page, mm, address, &ptl);
-	if (!pte)
-		goto out;
-
-	if (!pte_dirty(*pte) && !pte_write(*pte))
-		goto unlock;
-
-	entry = ptep_get_and_clear(mm, address, pte);
-	entry = pte_mkclean(entry);
-	entry = pte_wrprotect(entry);
-	ptep_establish(vma, address, pte, entry);
-	lazy_mmu_prot_update(entry);
-	ret = 1;
-
-unlock:
-	pte_unmap_unlock(pte, ptl);
+	ptep = page_check_address(page, mm, address, &ptl);
+	if (ptep) {
+		pte_t old, new;
+
+		old = *ptep;
+		new = pte_wrprotect(pte_mkclean(old));
+		if (!pte_same(old, new)) {
+			for (;;) {
+				flush_cache_page(vma, address, page_to_pfn(page));
+				ptep_exchange(vma, address, ptep, old, new);
+				if (pte_same(old, new))
+					break;
+				ret |= ptep_flush_dirty(vma, address, ptep, old);
+				flush_tlb_page(vma, address);
+			}
+		}
+		pte_unmap_unlock(ptep, ptl);
+	}
 out:
 	return ret;
 }
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..4a38dd1 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -51,6 +51,22 @@ static inline void truncate_partial_page
 		do_invalidatepage(page, partial);
 }
 
+void cancel_dirty_page(struct page *page, unsigned int account_size)
+{
+	/* If we're cancelling the page, it had better not be mapped any more */
+	if (page_mapped(page)) {
+		static unsigned int warncount;
+
+		WARN_ON(++warncount < 5);
+	}
+		
+	if (TestClearPageDirty(page) && account_size &&
+			mapping_cap_account_dirty(page->mapping)) {
+		dec_zone_page_state(page, NR_FILE_DIRTY);
+		task_io_account_cancelled_write(account_size);
+	}
+}
+
 /*
  * If truncate cannot remove the fs-private metadata from the page, the page
  * becomes anonymous.  It will be left on the LRU and may even be mapped into
@@ -67,11 +83,11 @@ truncate_complete_page(struct address_sp
 	if (page->mapping != mapping)
 		return;
 
+	cancel_dirty_page(page, PAGE_CACHE_SIZE);
+
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
-	if (test_clear_page_dirty(page))
-		task_io_account_cancelled_write(PAGE_CACHE_SIZE);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);
 	remove_from_page_cache(page);
@@ -350,7 +366,6 @@ int invalidate_inode_pages2_range(struct
 		for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 			pgoff_t page_index;
-			int was_dirty;
 
 			lock_page(page);
 			if (page->mapping != mapping) {
@@ -386,12 +401,8 @@ int invalidate_inode_pages2_range(struct
 					  PAGE_CACHE_SIZE, 0);
 				}
 			}
-			was_dirty = test_clear_page_dirty(page);
-			if (!invalidate_complete_page2(mapping, page)) {
-				if (was_dirty)
-					set_page_dirty(page);
+			if (!invalidate_complete_page2(mapping, page))
 				ret = -EIO;
-			}
 			unlock_page(page);
 		}
 		pagevec_release(&pvec);
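
The check_last_page() debugging hunk in the diff above sums the bytes of the last page that lie beyond i_size and prints "BADNESS" if the sum is non-zero. A minimal user-space sketch of that arithmetic follows. The function name tail_checksum() and its parameters are hypothetical, and page_size is assumed to be a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * User-space model of the debugging check: sum the bytes of the file's
 * last page that lie beyond end-of-file.  A non-zero result means there
 * is non-zero data past EOF.  `page` points at the page that contains
 * byte offset `file_size`; `page_size` must be a power of two.
 */
static unsigned int tail_checksum(const unsigned char *page,
				  size_t page_size, uint64_t file_size)
{
	size_t offset = (size_t)(file_size & (page_size - 1));
	unsigned int check = 0;

	/* EOF on a page boundary: there is no partial last page to check. */
	if (!offset)
		return 0;
	while (offset < page_size)
		check += page[offset++];
	return check;
}
```

As in the kernel hunk, a file size that is an exact multiple of the page size skips the check entirely.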



On Fri, 2006-12-22 at 02:17 -0800, Andrew Morton wrote:
> On Fri, 22 Dec 2006 11:00:04 +0100
> Martin Michlmayr <tbm@cyrius.com> wrote:
> 
> > > -	if (TestClearPageDirty(page) && account_size)
> > > +	if (TestClearPageDirty(page) && account_size) {
> > > +		dec_zone_page_state(page, NR_FILE_DIRTY);
> > >  		task_io_account_cancelled_write(account_size);
> > > +	}
> > 
> > This hunk (on top of git from about 2 days ago and your latest patch)
> > results in the installer hanging right at the start. 
> 
> You'll need this also:
> 
> From: Andrew Morton <akpm@osdl.org>
> 
> Only (un)account for IO and page-dirtying for devices which have real backing
> store (ie: not tmpfs or ramdisks).
> 
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Linus Torvalds <torvalds@osdl.org>
> Signed-off-by: Andrew Morton <akpm@osdl.org>
> ---
> 
>  mm/truncate.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff -puN mm/truncate.c~truncate-dirty-memory-accounting-fix mm/truncate.c
> --- a/mm/truncate.c~truncate-dirty-memory-accounting-fix
> +++ a/mm/truncate.c
> @@ -60,7 +60,8 @@ void cancel_dirty_page(struct page *page
>  		WARN_ON(++warncount < 5);
>  	}
>  		
> -	if (TestClearPageDirty(page) && account_size) {
> +	if (TestClearPageDirty(page) && account_size &&
> +			mapping_cap_account_dirty(page->mapping)) {
>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		task_io_account_cancelled_write(account_size);
>  	}
> _
> 


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-22 12:24                                                               ` Andrei Popa
@ 2006-12-22 12:32                                                                 ` Martin Michlmayr
  2006-12-22 12:59                                                                   ` Martin Michlmayr
                                                                                     ` (2 more replies)
  0 siblings, 3 replies; 311+ messages in thread
From: Martin Michlmayr @ 2006-12-22 12:32 UTC (permalink / raw)
  To: Andrei Popa
  Cc: Andrew Morton, Linus Torvalds, Gordon Farquharson,
	Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

* Andrei Popa <andrei.popa@i-neo.ro> [2006-12-22 14:24]:
> With all three patches I have corruption....

I've completed one installation with Linus' patch plus the two from
Andrew successfully, but I'm currently trying again... but I really
need a better testcase since an installation takes about an hour.
Andrei, which torrent do you download as a testcase?  It would be good
if someone could suggest a torrent which is legal and not too large.
-- 
Martin Michlmayr
http://www.cyrius.com/

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-22 12:32                                                                 ` Martin Michlmayr
@ 2006-12-22 12:59                                                                   ` Martin Michlmayr
  2006-12-22 13:25                                                                     ` Peter Zijlstra
  2006-12-22 15:01                                                                   ` [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) Patrick Mau
  2006-12-23  8:15                                                                   ` Andrei Popa
  2 siblings, 1 reply; 311+ messages in thread
From: Martin Michlmayr @ 2006-12-22 12:59 UTC (permalink / raw)
  To: Andrei Popa
  Cc: Andrew Morton, Linus Torvalds, Gordon Farquharson,
	Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

* Martin Michlmayr <tbm@cyrius.com> [2006-12-22 13:32]:
> I've completed one installation with Linus' patch plus the two from
> Andrew successfully, but I'm currently trying again...

... and it failed.
-- 
Martin Michlmayr
http://www.cyrius.com/

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-22 12:59                                                                   ` Martin Michlmayr
@ 2006-12-22 13:25                                                                     ` Peter Zijlstra
  2006-12-22 13:29                                                                       ` Peter Zijlstra
                                                                                         ` (2 more replies)
  0 siblings, 3 replies; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-22 13:25 UTC (permalink / raw)
  To: Martin Michlmayr
  Cc: Andrei Popa, Andrew Morton, Linus Torvalds, Gordon Farquharson,
	Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

On Fri, 2006-12-22 at 13:59 +0100, Martin Michlmayr wrote:
> * Martin Michlmayr <tbm@cyrius.com> [2006-12-22 13:32]:
> > I've completed one installation with Linus' patch plus the two from
> > Andrew successfully, but I'm currently trying again...
> 
> .... and it failed.

Since you are on ARM you might want to try with the page_mkclean_one
cleanup patch too.

Arjan agreed that the loop is not needed; we clear the pte, flush on all
CPUs and then re-establish the pte. Any race will fault and be
serialised on the pte lock.
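
As a rough illustration of that sequence (clear, modify a private copy, re-establish), here is a single-threaded user-space sketch. The toy_pte_t type and the TOY_PTE_* bits are hypothetical; the real flush and locking steps are only indicated in comments.

```c
#include <stdbool.h>

/* Hypothetical flag bits for a toy page-table entry; not real hardware values. */
#define TOY_PTE_PRESENT	0x1u
#define TOY_PTE_RW	0x2u
#define TOY_PTE_DIRTY	0x4u

typedef unsigned int toy_pte_t;

static toy_pte_t toy_pte_wrprotect(toy_pte_t pte)
{
	return pte & ~TOY_PTE_RW;
}

static toy_pte_t toy_pte_mkclean(toy_pte_t pte)
{
	return pte & ~TOY_PTE_DIRTY;
}

/*
 * Model of the non-looping sequence: take the entry out of the table
 * (standing in for a clear-and-flush step), clean and write-protect the
 * private copy, then install it again.  Returns true if the entry was
 * dirty or writable, i.e. if anything had to change.
 */
static bool toy_page_mkclean_one(toy_pte_t *ptep)
{
	toy_pte_t entry;

	if (!(*ptep & (TOY_PTE_DIRTY | TOY_PTE_RW)))
		return false;

	entry = *ptep;
	*ptep = 0;			/* cleared: a racing access would fault here */
	entry = toy_pte_wrprotect(entry);
	entry = toy_pte_mkclean(entry);
	*ptep = entry;			/* re-establish the clean, read-only entry */
	return true;
}
```

The point of the model is that the table entry is never visible in a writable-but-clean state: it is either absent or already write-protected.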

FWIW - with today's -git and Andrew's second cancel_dirty_page() patch:
  http://lkml.org/lkml/2006/12/22/49
I am unable to trigger any corruption. Earlier I could trigger it again by
raising the number of seeds from 3 to 6 (I am currently at 10 seeds).




From: Peter Zijlstra <a.p.zijlstra@chello.nl>

fix page_mkclean_one()

 - add flush_cache_page() for all those virtual indexed cache
   architectures.

 - handle s390.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/rmap.c |   38 +++++++++++++++++++++++++-------------
 1 file changed, 25 insertions(+), 13 deletions(-)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -432,7 +432,7 @@ static int page_mkclean_one(struct page 
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
-	pte_t *pte, entry;
+	pte_t *pte;
 	spinlock_t *ptl;
 	int ret = 0;
 
@@ -444,17 +444,18 @@ static int page_mkclean_one(struct page 
 	if (!pte)
 		goto out;
 
-	if (!pte_dirty(*pte) && !pte_write(*pte))
-		goto unlock;
+	if (pte_dirty(*pte) || pte_write(*pte)) {
+		pte_t entry;
 
-	entry = ptep_get_and_clear(mm, address, pte);
-	entry = pte_mkclean(entry);
-	entry = pte_wrprotect(entry);
-	ptep_establish(vma, address, pte, entry);
-	lazy_mmu_prot_update(entry);
-	ret = 1;
+		flush_cache_page(vma, address, pte_pfn(*pte));
+		entry = ptep_clear_flush(vma, address, pte);
+		entry = pte_wrprotect(entry);
+		entry = pte_mkclean(entry);
+		set_pte_at(vma, address, pte, entry);
+		lazy_mmu_prot_update(entry);
+		ret = 1;
+	}
 
-unlock:
 	pte_unmap_unlock(pte, ptl);
 out:
 	return ret;
@@ -489,6 +490,8 @@ int page_mkclean(struct page *page)
 		if (mapping)
 			ret = page_mkclean_file(mapping, page);
 	}
+	if (page_test_and_clear_dirty(page))
+		ret = 1;
 
 	return ret;
 }


 


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-22 13:25                                                                     ` Peter Zijlstra
@ 2006-12-22 13:29                                                                       ` Peter Zijlstra
  2006-12-22 17:56                                                                       ` Linus Torvalds
  2006-12-22 19:20                                                                       ` Martin Michlmayr
  2 siblings, 0 replies; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-22 13:29 UTC (permalink / raw)
  To: Martin Michlmayr
  Cc: Andrei Popa, Andrew Morton, Linus Torvalds, Gordon Farquharson,
	Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List


A cleanup of try_to_unmap. I have not identified any races that this
would solve, but it is included for consistency's sake.

Also includes a small s390 optimization by moving
page_test_and_clear_dirty() out of the vma iteration.


From: Peter Zijlstra <a.p.zijlstra@chello.nl>

We clear the page in the following sequence:
  ClearPageDirty - lock ptl, clear pte, unlock ptl

hence we should dirty in the opposite order:
  lock ptl, clear pte, unlock ptl - SetPageDirty

try_to_unmap_one violates this by doing the SetPageDirty under the ptl.

Also move page_test_and_clear_dirty() to try_to_unmap().

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/rmap.c |   10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -590,8 +590,6 @@ void page_remove_rmap(struct page *page)
 		 * Leaving it set also helps swapoff to reinstate ptes
 		 * faster for those pages still in swapcache.
 		 */
-		if (page_test_and_clear_dirty(page))
-			set_page_dirty(page);
 		__dec_zone_page_state(page,
 				PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
 	}
@@ -610,6 +608,7 @@ static int try_to_unmap_one(struct page 
 	pte_t pteval;
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
+	struct page *dirty_page = NULL;
 
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
@@ -636,7 +635,7 @@ static int try_to_unmap_one(struct page 
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
 	if (pte_dirty(pteval))
-		set_page_dirty(page);
+		dirty_page = page;
 
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
@@ -687,6 +686,8 @@ static int try_to_unmap_one(struct page 
 
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
+	if (dirty_page)
+		set_page_dirty(dirty_page);
 out:
 	return ret;
 }
@@ -918,6 +919,9 @@ int try_to_unmap(struct page *page, int 
 	else
 		ret = try_to_unmap_file(page, migration);
 
+	if (page_test_and_clear_dirty(page))
+		set_page_dirty(page);
+
 	if (!page_mapped(page))
 		ret = SWAP_SUCCESS;
 	return ret;



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-22 12:32                                                                 ` Martin Michlmayr
  2006-12-22 12:59                                                                   ` Martin Michlmayr
@ 2006-12-22 15:01                                                                   ` Patrick Mau
  2006-12-23  8:15                                                                   ` Andrei Popa
  2 siblings, 0 replies; 311+ messages in thread
From: Patrick Mau @ 2006-12-22 15:01 UTC (permalink / raw)
  To: Linux Kernel

On Fri, Dec 22, 2006 at 01:32:49PM +0100, Martin Michlmayr wrote:
> * Andrei Popa <andrei.popa@i-neo.ro> [2006-12-22 14:24]:
> > With all three patches I have corruption....
> 
> I've completed one installation with Linus' patch plus the two from
> Andrew successfully, but I'm currently trying again... but I really
> need a better testcase since an installation takes about an hour.
> Andrei, which torrent do you download as a testcase?  It would be good
> if someone could suggest a torrent which is legal and not too large.

Hi everyone,

I have been reading this thread for the last few days, but have been
silent. I have 3 torrents here for testing, if you want.

You can easily reproduce with "rtorrent", if you:

- Have a completely downloaded one, no matter what size
- Corrupt the download with
  dd if=/dev/zero of=download.file bs=16k count=1
- Restart 'rtorrent', hash-check fails
- It will download 1 piece that was corrupted.

The important part here is that rtorrent transfers one piece,
using its own code sequence to write to the file.

Let me offer to test until Saturday afternoon CET,
I have a cloned git repository, downloaded torrent files and "apt".

My systems that are affected are:

Linux oscar 2.6.18 SMP (2x450Mhz Intel P3)
(rolled back to 2.6.18 but can boot latest git)

Linux tony 2.6.20-git UP
(can be tested using all kinds of "apt" operations)

Both machines are using:
IDE  -> MD-RAID1 -> LVM -> EXT3 (data=ordered)
SCSI -> MD-RAID5 -> .....

I don't want to disturb your technical discussion,
just offering some help in testing.

Regards,
Patrick



* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-22  4:54                                                         ` Linus Torvalds
  2006-12-22 10:00                                                           ` Martin Michlmayr
@ 2006-12-22 15:08                                                           ` Gordon Farquharson
  1 sibling, 0 replies; 311+ messages in thread
From: Gordon Farquharson @ 2006-12-22 15:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Martin Michlmayr, Peter Zijlstra, Hugh Dickins,
	Nick Piggin, Arjan van de Ven, Andrei Popa,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann

On 12/21/06, Linus Torvalds <torvalds@osdl.org> wrote:

> Andrew located at least one bug: we run cancel_dirty_page() too late in
> "truncate_complete_page()", which means that do_invalidatepage() ends up
> not clearing the page cache.
>
> His patch is appended.

Thanks. I'll try this out later today.

> But it sounds like I probably misunderstood something, because I thought
> that Martin had acknowledged that this patch actually worked for him.
> Which sounded very similar to your setup (he has a 32M ARM box too, no?)

Yup, we have the same machines (Linksys NSLU2) and are running the
same test case (installing Debian). However, I'm not sure what kernel
version he had used for his latest test. I presumed 2.6.20-git,
whereas I had used 2.6.19.

> Maybe it's a mount option issue? I've got data=ordered on my machine, are
> you perhaps running with something else?

We are also using ordered.

/dev/scsi/host0/bus0/target0/lun0/part1 /target ext3 rw,data=ordered 0 0

Gordon

-- 
Gordon Farquharson


* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-22 10:01                                                         ` Martin Michlmayr
@ 2006-12-22 15:16                                                           ` Gordon Farquharson
  0 siblings, 0 replies; 311+ messages in thread
From: Gordon Farquharson @ 2006-12-22 15:16 UTC (permalink / raw)
  To: Martin Michlmayr
  Cc: Andrew Morton, Linus Torvalds, Peter Zijlstra, Hugh Dickins,
	Nick Piggin, Arjan van de Ven, Andrei Popa,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann

On 12/22/06, Martin Michlmayr <tbm@cyrius.com> wrote:

> sh-3.1# ls -l /var/cache/apt/
> total 5252
> drwxr-xr-x 3 root root    12288 Dec 22 04:41 archives
> -rw-r--r-- 1 root root 12582912 Dec 22 04:45 pkgcache.bin
> -rw-r--r-- 1 root root     8554 Dec 22 04:45 srcpkgcache.bin

This listing is a little different to what I got. For me,
srcpkgcache.bin did not exist when apt eventually finished. Did you
notice whether the install took a lot longer than usual ?

> Gordon, does it fail for you where it normally does (installing
> initramfs-tools) or much later?  For me, the installer was able to
> install initramfs-tools and the kernel, but apt now hangs at "Select
> and install software".

apt didn't hang for me, it just took 20 to 30 minutes to complete
building the package database. Usually, it takes less than a minute.
The installer stopped because it could not find a kernel to install. I
have seen this failure mode before and, as you have previously pointed
out, it is probably the same problem (corrupted apt cache files), just a
different manifestation.

Gordon

-- 
Gordon Farquharson


* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-22 10:10                                                               ` Martin Michlmayr
  2006-12-22 11:07                                                                 ` Martin Michlmayr
@ 2006-12-22 15:30                                                                 ` Gordon Farquharson
  2006-12-22 17:11                                                                   ` Martin Michlmayr
  1 sibling, 1 reply; 311+ messages in thread
From: Gordon Farquharson @ 2006-12-22 15:30 UTC (permalink / raw)
  To: Martin Michlmayr
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Hugh Dickins,
	Nick Piggin, Arjan van de Ven, Andrei Popa,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber

On 12/22/06, Martin Michlmayr <tbm@cyrius.com> wrote:

> ... and now that we've completed this step, the apt cache has suddenly
> been reduced (see Gordon's mail for an explanation) and it segfaults:
>
> sh-3.1# ls -l /var/cache/apt/
> total 12524
> drwxr-xr-x 3 root root   12288 Dec 22 04:41 archives
> -rw-r--r-- 1 root root 6426885 Dec 22 05:03 pkgcache.bin
> -rw-r--r-- 1 root root 6426835 Dec 22 05:03 srcpkgcache.bin
> sh-3.1# apt-get -f install
> Reading package lists... Done
> Segmentation faulty tree... 50%

I think that we are seeing different manifestations of apt's response
to corrupted cache files. There does not appear to be any pattern to
which manifestation occurs. Maybe it depends on where in the cache
file the corruption is located, i.e. when the corruption occurs. Based
on the kernel gurus' current knowledge of the problem, would you expect
the corruption to occur at the same point in a file, or is it possible
that the corruption could occur at different points on successive
Debian installer attempts on a UP, non PREEMPT system ?

Gordon

-- 
Gordon Farquharson


* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-22 15:30                                                                 ` Gordon Farquharson
@ 2006-12-22 17:11                                                                   ` Martin Michlmayr
  0 siblings, 0 replies; 311+ messages in thread
From: Martin Michlmayr @ 2006-12-22 17:11 UTC (permalink / raw)
  To: Gordon Farquharson
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Hugh Dickins,
	Nick Piggin, Arjan van de Ven, Andrei Popa,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber

* Gordon Farquharson <gordonfarquharson@gmail.com> [2006-12-22 08:30]:
> Based on the kernel gurus' current knowledge of the problem, would
> you expect the corruption to occur at the same point in a file, or
> is it possible that the corruption could occur at different points
> on successive Debian installer attempts on a UP, non PREEMPT system?

Seems like it can occur anywhere.  In fact, some people see apt
problems because of filesystem corruption on the NSLU2 after they have
already installed Debian.  I've only seen this once myself and failed
many times to find a reproducible situation.
-- 
Martin Michlmayr
http://www.cyrius.com/


* Re: 2.6.19 file content corruption on ext3
  2006-12-18 22:34                         ` Gene Heskett
@ 2006-12-22 17:27                           ` Linus Torvalds
  0 siblings, 0 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-22 17:27 UTC (permalink / raw)
  To: Gene Heskett
  Cc: linux-kernel, Andrei Popa, Peter Zijlstra, Andrew Morton,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr



On Mon, 18 Dec 2006, Gene Heskett wrote:
>
> What about the mm/rmap.c one liner, in or out?

The one that just removes the "pte_mkclean()"? That's definitely out, it 
was just a test-patch to verify that the pte dirty bits seemed to matter 
at all (and they do).

		Linus


* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-22 13:25                                                                     ` Peter Zijlstra
  2006-12-22 13:29                                                                       ` Peter Zijlstra
@ 2006-12-22 17:56                                                                       ` Linus Torvalds
  2006-12-22 19:20                                                                       ` Martin Michlmayr
  2 siblings, 0 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-22 17:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Martin Michlmayr, Andrei Popa, Andrew Morton, Gordon Farquharson,
	Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List



On Fri, 22 Dec 2006, Peter Zijlstra wrote:
> 
> fix page_mkclean_one()
> 
>  - add flush_cache_page() for all those virtual indexed cache
>    architectures.

I think the flush_cache_page() should be after we've actually flushed it 
from the TLB and re-inserted it (this is one reason why I did the 
"ptep_exchange()" version of this). Otherwise somebody can still write to 
the page _after_ the cache flush..

>  - handle s390.

Yeah, that looks like the proper way to handle that.

That said, it looks like we still see corruption. You may not, but Martin 
and Andrei still report problems, even with all the patches (including the 
last one from Andrew that avoids "dirty" going negative under some 
circumstances, and explains the "slow and/or never completed" case that 
Gordon and Martin saw).

The good news is that I think the code now is cleaner and more 
understandable. The bad news is that nothing we've ever tried seems to 
have fixed the _problem_.

And I don't think it's page_mkclean(). Especially not since the ARM people 
are seeing this under UP without PREEMPT. In that kind of scenario, the 
only possible races tend to be from things that actually block: 
"set_page_dirty()" (which blocks on IO in balancing), memory allocations, 
and obviously doing actual IO.

And it's not a virtual cache problem, since others see it on x86.

Of course, since it's quite possibly two different issues, maybe the 
virtual cache flush is required in order to force write-back to memory 
(which in turn is required for the DMA for the actual write!). So the ARM 
issue certainly could be due to the flush_cache_page() thing...

		Linus


* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-22 13:25                                                                     ` Peter Zijlstra
  2006-12-22 13:29                                                                       ` Peter Zijlstra
  2006-12-22 17:56                                                                       ` Linus Torvalds
@ 2006-12-22 19:20                                                                       ` Martin Michlmayr
  2006-12-24  8:10                                                                         ` Gordon Farquharson
  2 siblings, 1 reply; 311+ messages in thread
From: Martin Michlmayr @ 2006-12-22 19:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrei Popa, Andrew Morton, Linus Torvalds, Gordon Farquharson,
	Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 399 bytes --]

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2006-12-22 14:25]:
> > .... and it failed.
> Since you are on ARM you might want to try with the page_mkclean_one
> cleanup patch too.

I've already tried it and it didn't work.  I just tried it again
together with Linus' patch and the two from Andrew and it still fails.
(For reference, the patch is attached.)
-- 
Martin Michlmayr
http://www.cyrius.com/

[-- Attachment #2: p --]
[-- Type: text/plain, Size: 7798 bytes --]

--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageDirty(page) || PageWriteback(page))
 		return 0;
 
 	if (mapping == NULL) {		/* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
 	spin_unlock(&mapping->private_lock);
-	if (ret) {
-		/*
-		 * If the filesystem writes its buffers by hand (eg ext3)
-		 * then we can have clean buffers against a dirty page.  We
-		 * clean the page here; otherwise later reattachment of buffers
-		 * could encounter a non-uptodate page, which is unresolvable.
-		 * This only applies in the rare case where try_to_free_buffers
-		 * succeeds but the page is not freed.
-		 *
-		 * Also, during truncate, discard_buffer will have marked all
-		 * the page's buffers clean.  We discover that here and clean
-		 * the page also.
-		 */
-		if (test_clear_page_dirty(page))
-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
-	}
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..4f4cd13 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
 
 static void truncate_huge_page(struct page *page)
 {
-	clear_page_dirty(page);
+	cancel_dirty_page(page, /* No IO accounting for huge pages? */0);
 	ClearPageUptodate(page);
 	remove_from_page_cache(page);
 	put_page(page);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..350878a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,15 +253,11 @@ #define ClearPageUncached(page)	clear_bi
 
 struct page;	/* forward declaration */
 
-int test_clear_page_dirty(struct page *page);
+extern void cancel_dirty_page(struct page *page, unsigned int account_size);
+
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);
 
-static inline void clear_page_dirty(struct page *page)
-{
-	test_clear_page_dirty(page);
-}
-
 static inline void set_page_writeback(struct page *page)
 {
 	test_set_page_writeback(page);
diff --git a/mm/memory.c b/mm/memory.c
index c00bac6..79cecab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
+static void check_last_page(struct address_space *mapping, loff_t size)
+{
+	pgoff_t index;
+	unsigned int offset;
+	struct page *page;
+
+	if (!mapping)
+		return;
+	offset = size & ~PAGE_MASK;
+	if (!offset)
+		return;
+	index = size >> PAGE_SHIFT;
+	page = find_lock_page(mapping, index);
+	if (page) {
+		unsigned int check = 0;
+		unsigned char *kaddr = kmap_atomic(page, KM_USER0);
+		do {
+			check += kaddr[offset++];
+		} while (offset < PAGE_SIZE);
+		kunmap_atomic(kaddr,KM_USER0);
+		unlock_page(page);
+		page_cache_release(page);
+		if (check)
+			printk("%s: BADNESS: truncate check %u\n", current->comm, check);
+	}
+}
+
 /**
  * vmtruncate - unmap mappings "freed" by truncate() syscall
  * @inode: inode of the file used
@@ -1875,6 +1902,7 @@ do_expand:
 		goto out_sig;
 	if (offset > inode->i_sb->s_maxbytes)
 		goto out_big;
+	check_last_page(mapping, inode->i_size);
 	i_size_write(inode, offset);
 
 out_truncate:
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..b3a198c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -845,38 +845,6 @@ int set_page_dirty_lock(struct page *pag
 EXPORT_SYMBOL(set_page_dirty_lock);
 
 /*
- * Clear a page's dirty flag, while caring for dirty memory accounting. 
- * Returns true if the page was previously dirty.
- */
-int test_clear_page_dirty(struct page *page)
-{
-	struct address_space *mapping = page_mapping(page);
-	unsigned long flags;
-
-	if (!mapping)
-		return TestClearPageDirty(page);
-
-	write_lock_irqsave(&mapping->tree_lock, flags);
-	if (TestClearPageDirty(page)) {
-		radix_tree_tag_clear(&mapping->page_tree,
-				page_index(page), PAGECACHE_TAG_DIRTY);
-		write_unlock_irqrestore(&mapping->tree_lock, flags);
-		/*
-		 * We can continue to use `mapping' here because the
-		 * page is locked, which pins the address_space
-		 */
-		if (mapping_cap_account_dirty(mapping)) {
-			page_mkclean(page);
-			dec_zone_page_state(page, NR_FILE_DIRTY);
-		}
-		return 1;
-	}
-	write_unlock_irqrestore(&mapping->tree_lock, flags);
-	return 0;
-}
-EXPORT_SYMBOL(test_clear_page_dirty);
-
-/*
  * Clear a page's dirty flag, while caring for dirty memory accounting.
  * Returns true if the page was previously dirty.
  *
diff --git a/mm/rmap.c b/mm/rmap.c
index d8a842a..3278b2a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -432,7 +432,7 @@ static int page_mkclean_one(struct page
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
-	pte_t *pte, entry;
+	pte_t *pte;
 	spinlock_t *ptl;
 	int ret = 0;
 
@@ -444,17 +444,18 @@ static int page_mkclean_one(struct page
 	if (!pte)
 		goto out;
 
-	if (!pte_dirty(*pte) && !pte_write(*pte))
-		goto unlock;
+	if (pte_dirty(*pte) || pte_write(*pte)) {
+		pte_t entry;
 
-	entry = ptep_get_and_clear(mm, address, pte);
-	entry = pte_mkclean(entry);
-	entry = pte_wrprotect(entry);
-	ptep_establish(vma, address, pte, entry);
-	lazy_mmu_prot_update(entry);
-	ret = 1;
+		flush_cache_page(vma, address, pte_pfn(*pte));
+		entry = ptep_clear_flush(vma, address, pte);
+		entry = pte_wrprotect(entry);
+		entry = pte_mkclean(entry);
+		set_pte_at(vma, address, pte, entry);
+		lazy_mmu_prot_update(entry);
+		ret = 1;
+	}
 
-unlock:
 	pte_unmap_unlock(pte, ptl);
 out:
 	return ret;
@@ -489,6 +490,8 @@ int page_mkclean(struct page *page)
 		if (mapping)
 			ret = page_mkclean_file(mapping, page);
 	}
+	if (page_test_and_clear_dirty(page))
+		ret = 1;
 
 	return ret;
 }
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..4a38dd1 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -51,6 +51,22 @@ static inline void truncate_partial_page
 		do_invalidatepage(page, partial);
 }
 
+void cancel_dirty_page(struct page *page, unsigned int account_size)
+{
+	/* If we're cancelling the page, it had better not be mapped any more */
+	if (page_mapped(page)) {
+		static unsigned int warncount;
+
+		WARN_ON(++warncount < 5);
+	}
+		
+	if (TestClearPageDirty(page) && account_size &&
+			mapping_cap_account_dirty(page->mapping)) {
+		dec_zone_page_state(page, NR_FILE_DIRTY);
+		task_io_account_cancelled_write(account_size);
+	}
+}
+
 /*
  * If truncate cannot remove the fs-private metadata from the page, the page
  * becomes anonymous.  It will be left on the LRU and may even be mapped into
@@ -67,11 +83,11 @@ truncate_complete_page(struct address_sp
 	if (page->mapping != mapping)
 		return;
 
+	cancel_dirty_page(page, PAGE_CACHE_SIZE);
+
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
-	if (test_clear_page_dirty(page))
-		task_io_account_cancelled_write(PAGE_CACHE_SIZE);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);
 	remove_from_page_cache(page);
@@ -350,7 +366,6 @@ int invalidate_inode_pages2_range(struct
 		for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 			pgoff_t page_index;
-			int was_dirty;
 
 			lock_page(page);
 			if (page->mapping != mapping) {
@@ -386,12 +401,8 @@ int invalidate_inode_pages2_range(struct
 					  PAGE_CACHE_SIZE, 0);
 				}
 			}
-			was_dirty = test_clear_page_dirty(page);
-			if (!invalidate_complete_page2(mapping, page)) {
-				if (was_dirty)
-					set_page_dirty(page);
+			if (!invalidate_complete_page2(mapping, page))
 				ret = -EIO;
-			}
 			unlock_page(page);
 		}
 		pagevec_release(&pvec);


* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-22 12:32                                                                 ` Martin Michlmayr
  2006-12-22 12:59                                                                   ` Martin Michlmayr
  2006-12-22 15:01                                                                   ` [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) Patrick Mau
@ 2006-12-23  8:15                                                                   ` Andrei Popa
  2 siblings, 0 replies; 311+ messages in thread
From: Andrei Popa @ 2006-12-23  8:15 UTC (permalink / raw)
  To: Martin Michlmayr
  Cc: Andrew Morton, Linus Torvalds, Gordon Farquharson,
	Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

On Fri, 2006-12-22 at 13:32 +0100, Martin Michlmayr wrote:
> * Andrei Popa <andrei.popa@i-neo.ro> [2006-12-22 14:24]:
> > With all three patches I have corruption....
> 
> I've completed one installation with Linus' patch plus the two from
> Andrew successfully, but I'm currently trying again... but I really
> need a better testcase since an installation takes about an hour.
> Andrei, which torrent do you download as a testcase?  It would be good
> if someone could suggest a torrent which is legal and not too large.
It's a 1.4GB file torrent split in 84 rar files and there are many
seeders. I download with ~ 5MB/sec. The torrent is private.



* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-22 19:20                                                                       ` Martin Michlmayr
@ 2006-12-24  8:10                                                                         ` Gordon Farquharson
  2006-12-24  8:43                                                                           ` Linus Torvalds
  0 siblings, 1 reply; 311+ messages in thread
From: Gordon Farquharson @ 2006-12-24  8:10 UTC (permalink / raw)
  To: Martin Michlmayr
  Cc: Peter Zijlstra, Andrei Popa, Andrew Morton, Linus Torvalds,
	Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

On 12/22/06, Martin Michlmayr <tbm@cyrius.com> wrote:

> * Peter Zijlstra <a.p.zijlstra@chello.nl> [2006-12-22 14:25]:
> > > .... and it failed.
> > Since you are on ARM you might want to try with the page_mkclean_one
> > cleanup patch too.
>
> I've already tried it and it didn't work.  I just tried it again
> together with Linus' patch and the two from Andrew and it still fails.
> (For reference, the patch is attached.)

I can confirm this behaviour with 2.6.19 and the patches mentioned
above (cumulative patch for 2.6.19 appended to the end of this email).

Is there any way to provide any debugging information that may help
solve the problem ? Would it help to know the nature of the corruption
e.g. an analysis of the corruption in the file ? I have previously
asked apt developers if they wanted to look at the corrupted cache
files, but there were no takers then.
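
One way such an analysis could look, assuming a known-good copy of the file
is available for comparison, is sketched below. Nothing here comes from the
thread: the function names differing_ranges and pages_touched and the
4096-byte page size are assumptions for illustration.

```python
def differing_ranges(good, bad):
    """Return half-open (start, end) byte ranges where the two inputs differ.

    If the inputs have different lengths, the extra tail is reported as
    differing and merged with a mismatch that runs up to the shorter end.
    """
    ranges = []
    start = None
    common = min(len(good), len(bad))
    for i in range(common):
        if good[i] != bad[i]:
            if start is None:
                start = i
        elif start is not None:
            ranges.append((start, i))
            start = None
    longest = max(len(good), len(bad))
    if longest > common:
        if start is None:
            start = common
        ranges.append((start, longest))
    elif start is not None:
        ranges.append((start, common))
    return ranges


def pages_touched(ranges, page_size=4096):
    """Return the sorted page indexes covered by the given byte ranges."""
    pages = set()
    for start, end in ranges:
        pages.update(range(start // page_size, (end - 1) // page_size + 1))
    return sorted(pages)
```

Reading both files fully with open(path, "rb").read() and passing the two
byte strings to differing_ranges would show whether the damage starts and
ends on page boundaries, which may hint at whether whole pages were lost.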

BTW, I decided to try Linus's test program [1] on ARM (I don't think
that anybody had tried it on ARM before).

Since we see file corruption with 2.6.18 + [PATCH] mm: tracking shared
dirty pages [2], I ran Linus's program on machines with the following
setups:

2.6.18 + the following patches
   mm: tracking shared dirty pages [2]
   mm: balance dirty pages [3]
   mm: optimize the new mprotect() code a bit [4]
   mm: small cleanup of install_page() [5]
   mm: fixup do_wp_page() [6]
   mm: msync() cleanup [7]

$ ./mm-test | od -x
0000000 aaaa aaaa aaaa aaaa aaaa 0000 0000 0000
0000020 0000 0000 5555 5555 5555 5555 5555 5555
0000040 5555 5555 5555 5555
0000050

2.6.18 (no mm patches)

$ ./mm-test | od -x
0000000 aaaa aaaa aaaa aaaa aaaa aaaa aaaa aaaa
0000020 aaaa aaaa 5555 5555 5555 5555 5555 5555
0000040 5555 5555 5555 5555
0000050

I don't know if this helps at all.
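
For comparing dumps like the two above, the zeroed region can be located
mechanically. The following is a minimal sketch, not part of the test program
referenced in [1]: the names parse_od_x and zero_word_ranges are assumptions,
and it expects plain "od -x" output without "*" lines for suppressed
duplicates (use "od -v -x" to avoid those).

```python
def parse_od_x(text):
    """Parse `od -x` output into a list of 16-bit word values."""
    words = []
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "*":
            raise ValueError("duplicate-suppressed output; rerun od with -v")
        # The first field is the octal offset; the rest are hex words.
        for field in fields[1:]:
            words.append(int(field, 16))
    return words


def zero_word_ranges(words):
    """Return half-open (start_byte, end_byte) ranges covering runs of zero words."""
    ranges = []
    start = None
    for i, word in enumerate(words):
        if word == 0:
            if start is None:
                start = i
        elif start is not None:
            ranges.append((2 * start, 2 * i))
            start = None
    if start is not None:
        ranges.append((2 * start, 2 * len(words)))
    return ranges
```

Fed the first dump above (2.6.18 plus the mm patches), this reports a single
zero run covering bytes 10 to 19; fed the second dump (plain 2.6.18), it
reports none.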

Gordon

[1] http://lkml.org/lkml/2006/12/19/200
[2] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d08b3851da41d0ee60851f2c75b118e1f7a5fc89
[3] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=edc79b2a46ed854595e40edcf3f8b37f9f14aa3f
[4] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=c1e6098b23bb46e2b488fe9a26f831f867157483
[5] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=e88dd6c11c5aef74d8b74a062767add53315533b
[6] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ee6a6457886a80415db209e87033b63f2b06558c
[7] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=204ec841fbea3e5138168edbc3a76d46747cc987

diff -Naupr linux-2.6.19.orig/fs/buffer.c linux-2.6.19/fs/buffer.c
--- linux-2.6.19.orig/fs/buffer.c       2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/fs/buffer.c    2006-12-21 01:16:31.000000000 -0700
@@ -2832,7 +2832,7 @@ int try_to_free_buffers(struct page *pag
        int ret = 0;

        BUG_ON(!PageLocked(page));
-       if (PageWriteback(page))
+       if (PageDirty(page) || PageWriteback(page))
                return 0;

        if (mapping == NULL) {          /* can this still happen? */
@@ -2843,17 +2843,6 @@ int try_to_free_buffers(struct page *pag
        spin_lock(&mapping->private_lock);
        ret = drop_buffers(page, &buffers_to_free);
        spin_unlock(&mapping->private_lock);
-       if (ret) {
-               /*
-                * If the filesystem writes its buffers by hand (eg ext3)
-                * then we can have clean buffers against a dirty page.  We
-                * clean the page here; otherwise later reattachment of buffers
-                * could encounter a non-uptodate page, which is unresolvable.
-                * This only applies in the rare case where try_to_free_buffers
-                * succeeds but the page is not freed.
-                */
-               clear_page_dirty(page);
-       }
 out:
        if (buffers_to_free) {
                struct buffer_head *bh = buffers_to_free;
diff -Naupr linux-2.6.19.orig/fs/hugetlbfs/inode.c linux-2.6.19/fs/hugetlbfs/inode.c
--- linux-2.6.19.orig/fs/hugetlbfs/inode.c      2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/fs/hugetlbfs/inode.c   2006-12-21 01:15:21.000000000 -0700
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct

 static void truncate_huge_page(struct page *page)
 {
-       clear_page_dirty(page);
+       cancel_dirty_page(page, /* No IO accounting for huge pages? */0);
        ClearPageUptodate(page);
        remove_from_page_cache(page);
        put_page(page);
diff -Naupr linux-2.6.19.orig/include/linux/page-flags.h linux-2.6.19/include/linux/page-flags.h
--- linux-2.6.19.orig/include/linux/page-flags.h        2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/include/linux/page-flags.h     2006-12-21 01:15:21.000000000 -0700
@@ -253,15 +253,11 @@ static inline void SetPageUptodate(struc

 struct page;   /* forward declaration */

-int test_clear_page_dirty(struct page *page);
+extern void cancel_dirty_page(struct page *page, unsigned int account_size);
+
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);

-static inline void clear_page_dirty(struct page *page)
-{
-       test_clear_page_dirty(page);
-}
-
 static inline void set_page_writeback(struct page *page)
 {
        test_set_page_writeback(page);
diff -Naupr linux-2.6.19.orig/mm/memory.c linux-2.6.19/mm/memory.c
--- linux-2.6.19.orig/mm/memory.c       2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/memory.c    2006-12-21 01:15:21.000000000 -0700
@@ -1832,6 +1832,33 @@ void unmap_mapping_range(struct address_
 }
 EXPORT_SYMBOL(unmap_mapping_range);

+static void check_last_page(struct address_space *mapping, loff_t size)
+{
+       pgoff_t index;
+       unsigned int offset;
+       struct page *page;
+
+       if (!mapping)
+               return;
+       offset = size & ~PAGE_MASK;
+       if (!offset)
+               return;
+       index = size >> PAGE_SHIFT;
+       page = find_lock_page(mapping, index);
+       if (page) {
+               unsigned int check = 0;
+               unsigned char *kaddr = kmap_atomic(page, KM_USER0);
+               do {
+                       check += kaddr[offset++];
+               } while (offset < PAGE_SIZE);
+               kunmap_atomic(kaddr,KM_USER0);
+               unlock_page(page);
+               page_cache_release(page);
+               if (check)
+                       printk("%s: BADNESS: truncate check %u\n", current->comm, check);
+       }
+}
+
 /**
  * vmtruncate - unmap mappings "freed" by truncate() syscall
  * @inode: inode of the file used
@@ -1865,6 +1892,7 @@ do_expand:
                goto out_sig;
        if (offset > inode->i_sb->s_maxbytes)
                goto out_big;
+       check_last_page(mapping, inode->i_size);
        i_size_write(inode, offset);

 out_truncate:
diff -Naupr linux-2.6.19.orig/mm/page-writeback.c linux-2.6.19/mm/page-writeback.c
--- linux-2.6.19.orig/mm/page-writeback.c       2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/page-writeback.c    2006-12-21 01:26:53.000000000 -0700
@@ -843,39 +843,6 @@ int set_page_dirty_lock(struct page *pag
 EXPORT_SYMBOL(set_page_dirty_lock);

 /*
- * Clear a page's dirty flag, while caring for dirty memory accounting.
- * Returns true if the page was previously dirty.
- */
-int test_clear_page_dirty(struct page *page)
-{
-       struct address_space *mapping = page_mapping(page);
-       unsigned long flags;
-
-       if (mapping) {
-               write_lock_irqsave(&mapping->tree_lock, flags);
-               if (TestClearPageDirty(page)) {
-                       radix_tree_tag_clear(&mapping->page_tree,
-                                               page_index(page),
-                                               PAGECACHE_TAG_DIRTY);
-                       write_unlock_irqrestore(&mapping->tree_lock, flags);
-                       /*
-                        * We can continue to use `mapping' here because the
-                        * page is locked, which pins the address_space
-                        */
-                       if (mapping_cap_account_dirty(mapping)) {
-                               page_mkclean(page);
-                               dec_zone_page_state(page, NR_FILE_DIRTY);
-                       }
-                       return 1;
-               }
-               write_unlock_irqrestore(&mapping->tree_lock, flags);
-               return 0;
-       }
-       return TestClearPageDirty(page);
-}
-EXPORT_SYMBOL(test_clear_page_dirty);
-
-/*
  * Clear a page's dirty flag, while caring for dirty memory accounting.
  * Returns true if the page was previously dirty.
  *
diff -Naupr linux-2.6.19.orig/mm/rmap.c linux-2.6.19/mm/rmap.c
--- linux-2.6.19.orig/mm/rmap.c 2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/rmap.c      2006-12-22 23:25:09.000000000 -0700
@@ -432,7 +432,7 @@ static int page_mkclean_one(struct page
 {
        struct mm_struct *mm = vma->vm_mm;
        unsigned long address;
-       pte_t *pte, entry;
+       pte_t *pte;
        spinlock_t *ptl;
        int ret = 0;

@@ -444,17 +444,18 @@ static int page_mkclean_one(struct page
        if (!pte)
                goto out;

-       if (!pte_dirty(*pte) && !pte_write(*pte))
-               goto unlock;
+       if (pte_dirty(*pte) || pte_write(*pte)) {
+               pte_t entry;

-       entry = ptep_get_and_clear(mm, address, pte);
-       entry = pte_mkclean(entry);
-       entry = pte_wrprotect(entry);
-       ptep_establish(vma, address, pte, entry);
-       lazy_mmu_prot_update(entry);
-       ret = 1;
+               flush_cache_page(vma, address, pte_pfn(*pte));
+               entry = ptep_clear_flush(vma, address, pte);
+               entry = pte_wrprotect(entry);
+               entry = pte_mkclean(entry);
+               set_pte_at(vma, address, pte, entry);
+               lazy_mmu_prot_update(entry);
+               ret = 1;
+       }

-unlock:
        pte_unmap_unlock(pte, ptl);
 out:
        return ret;
@@ -489,6 +490,8 @@ int page_mkclean(struct page *page)
                if (mapping)
                        ret = page_mkclean_file(mapping, page);
        }
+       if (page_test_and_clear_dirty(page))
+               ret = 1;

        return ret;
 }
@@ -587,8 +590,6 @@ void page_remove_rmap(struct page *page)
                 * Leaving it set also helps swapoff to reinstate ptes
                 * faster for those pages still in swapcache.
                 */
-               if (page_test_and_clear_dirty(page))
-                       set_page_dirty(page);
                __dec_zone_page_state(page,
                                PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
        }
@@ -607,6 +608,7 @@ static int try_to_unmap_one(struct page
        pte_t pteval;
        spinlock_t *ptl;
        int ret = SWAP_AGAIN;
+       struct page *dirty_page = NULL;

        address = vma_address(page, vma);
        if (address == -EFAULT)
@@ -633,7 +635,7 @@ static int try_to_unmap_one(struct page

        /* Move the dirty bit to the physical page now the pte is gone. */
        if (pte_dirty(pteval))
-               set_page_dirty(page);
+               dirty_page = page;

        /* Update high watermark before we lower rss */
        update_hiwater_rss(mm);
@@ -684,6 +686,8 @@ static int try_to_unmap_one(struct page

 out_unmap:
        pte_unmap_unlock(pte, ptl);
+       if (dirty_page)
+               set_page_dirty(dirty_page);
 out:
        return ret;
 }
@@ -915,6 +919,9 @@ int try_to_unmap(struct page *page, int
        else
                ret = try_to_unmap_file(page, migration);

+       if (page_test_and_clear_dirty(page))
+               set_page_dirty(page);
+
        if (!page_mapped(page))
                ret = SWAP_SUCCESS;
        return ret;
diff -Naupr linux-2.6.19.orig/mm/truncate.c linux-2.6.19/mm/truncate.c
--- linux-2.6.19.orig/mm/truncate.c     2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/truncate.c  2006-12-23 13:21:42.000000000 -0700
@@ -50,6 +50,21 @@ static inline void truncate_partial_page
                do_invalidatepage(page, partial);
 }

+void cancel_dirty_page(struct page *page, unsigned int account_size)
+{
+       /* If we're cancelling the page, it had better not be mapped any more */
+       if (page_mapped(page)) {
+               static unsigned int warncount;
+
+               WARN_ON(++warncount < 5);
+       }
+
+       if (TestClearPageDirty(page) && account_size &&
+                       mapping_cap_account_dirty(page->mapping))
+               dec_zone_page_state(page, NR_FILE_DIRTY);
+}
+
+
 /*
  * If truncate cannot remove the fs-private metadata from the page, the page
  * becomes anonymous.  It will be left on the LRU and may even be mapped into
@@ -66,10 +81,11 @@ truncate_complete_page(struct address_sp
        if (page->mapping != mapping)
                return;

+       cancel_dirty_page(page, PAGE_CACHE_SIZE);
+
        if (PagePrivate(page))
                do_invalidatepage(page, 0);

-       clear_page_dirty(page);
        ClearPageUptodate(page);
        ClearPageMappedToDisk(page);
        remove_from_page_cache(page);
@@ -348,7 +364,6 @@ int invalidate_inode_pages2_range(struct
                for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
                        struct page *page = pvec.pages[i];
                        pgoff_t page_index;
-                       int was_dirty;

                        lock_page(page);
                        if (page->mapping != mapping) {
@@ -384,12 +399,8 @@ int invalidate_inode_pages2_range(struct
                                          PAGE_CACHE_SIZE, 0);
                                }
                        }
-                       was_dirty = test_clear_page_dirty(page);
-                       if (!invalidate_complete_page2(mapping, page)) {
-                               if (was_dirty)
-                                       set_page_dirty(page);
+                       if (!invalidate_complete_page2(mapping, page))
                                ret = -EIO;
-                       }
                        unlock_page(page);
                }
                pagevec_release(&pvec);

-- 
Gordon Farquharson

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24  8:10                                                                         ` Gordon Farquharson
@ 2006-12-24  8:43                                                                           ` Linus Torvalds
  2006-12-24  8:57                                                                             ` Andrew Morton
  2006-12-26 16:17                                                                             ` Tobias Diedrich
  0 siblings, 2 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-24  8:43 UTC (permalink / raw)
  To: Gordon Farquharson
  Cc: Martin Michlmayr, Peter Zijlstra, Andrei Popa, Andrew Morton,
	Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List



On Sun, 24 Dec 2006, Gordon Farquharson wrote:
> 
> Is there any way to provide any debugging information that may help
> solve the problem ?

I think we have people working on this. I know I'm trying to even come up 
with an idea of what is going on. I don't think we know yet.

> Would it help to know the nature of the corruption e.g. an analysis
> of the corruption in the file ?

I actually think we know that, because Andrei already gave details. The 
corruption seems to be basically a few pages that get zeroes at the end 
rather than the expected contents. That's consistent with the page being 
written out once, but then _not_ getting written out again despite being 
dirtied some more.

But if you see any other pattern, please holler, because that would be 
interesting.

> BTW, I decided to try Linus's test program [1] on ARM (I don't think
> that anybody had tried it on ARM before).

You get the expected results, and in fact, I'd be very surprised if you 
didn't. It's something subtler than that going on.

I now _suspect_ that we're talking about something like

 - we started a writeout. The IO is still pending, and the page was 
   marked clean and is now in the "writeback" phase.
 - a write happens to the page, and the page gets marked dirty again. 
   Marking the page dirty also marks all the _buffers_ in the page dirty, 
   but they were actually already dirty, because the IO hasn't completed 
   yet.
 - the IO from the _previous_ write completes, and marks the buffers clean 
   again.

And no, that's not actually what is going on. The thing is, we actually 
clear the buffer dirty bits when we start the IO, not when we end it, but 
I think it is going to be this _kind_ of situation, where we missed 
something, and marked it clean too late, and thus cleared a dirty bit.
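
A minimal userspace sketch of that _kind_ of interleaving, for anybody who
wants to see the ordering spelled out. This is a toy model, not kernel
code: "struct toy_page" and every toy_*() helper below are made-up names,
and a single flag stands in for all the buffer dirty bits. The assumption
being illustrated is only that clearing "dirty" at IO completion rather
than at IO start can wipe out a re-dirty that happened in between.

```c
#include <stdbool.h>

/* Toy model (hypothetical, not a kernel structure). */
struct toy_page {
	bool dirty;           /* memory is newer than what was queued for disk */
	bool writeback;       /* an IO for this page is in flight */
	int  mem_version;     /* version of the data currently in memory */
	int  inflight_version;/* version captured when the IO was started */
	int  disk_version;    /* version that has actually reached disk */
};

/* An application store into the page: new data, page becomes dirty. */
static void toy_write(struct toy_page *p)
{
	p->mem_version++;
	p->dirty = true;
}

/* Start writeout: snapshot the data and clear dirty up front. */
static void toy_start_writeback(struct toy_page *p)
{
	p->inflight_version = p->mem_version;
	p->writeback = true;
	p->dirty = false;
}

/* Correct completion: only the writeback state changes. */
static void toy_end_writeback_ok(struct toy_page *p)
{
	p->disk_version = p->inflight_version;
	p->writeback = false;
}

/* Buggy completion: marks the page clean "too late". */
static void toy_end_writeback_buggy(struct toy_page *p)
{
	p->disk_version = p->inflight_version;
	p->writeback = false;
	p->dirty = false;	/* loses any dirtying done during the IO */
}

/* Data is lost when the page looks clean but the disk copy is stale. */
static bool toy_lost_update(const struct toy_page *p)
{
	return !p->dirty && p->disk_version != p->mem_version;
}
```

Running write, start-writeback, write, end-writeback through the buggy
completion leaves a clean-looking page with stale data on disk; the same
sequence through the other completion leaves the page dirty, so it gets
written again later.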

I don't think it's a page table issue any more, it just doesn't look 
likely with the ARM UP corruption. It's also not apparently even on a 
cacheline boundary, so it probably is really a dirty bit that got cleared 
wrong due to some race with IO.

But right now we're all clueless. I personally suspect it's not even a new 
bug: it's probably an old bug that simply didn't matter before.
	
			Linus


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24  8:43                                                                           ` Linus Torvalds
@ 2006-12-24  8:57                                                                             ` Andrew Morton
  2006-12-24  9:26                                                                               ` Linus Torvalds
                                                                                                 ` (2 more replies)
  2006-12-26 16:17                                                                             ` Tobias Diedrich
  1 sibling, 3 replies; 311+ messages in thread
From: Andrew Morton @ 2006-12-24  8:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Gordon Farquharson, Martin Michlmayr, Peter Zijlstra,
	Andrei Popa, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

On Sun, 24 Dec 2006 00:43:54 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> I now _suspect_ that we're talking about something like
> 
>  - we started a writeout. The IO is still pending, and the page was 
>    marked clean and is now in the "writeback" phase.
>  - a write happens to the page, and the page gets marked dirty again. 
>    Marking the page dirty also marks all the _buffers_ in the page dirty, 
>    but they were actually already dirty, because the IO hasn't completed 
>    yet.
>  - the IO from the _previous_ write completes, and marks the buffers clean 
>    again.

Some things for the testers to try, please:

- mount the fs with ext2 with the no-buffer-head option.  That means either:

  grub.conf:  rootfstype=ext2 rootflags=nobh
  /etc/fstab: ext2 nobh

- mount the fs with ext3 data=writeback, nobh

  grub.conf:  rootfstype=ext3 rootflags=nobh,data=writeback  (I hope this works)
  /etc/fstab: ext3 data=writeback,nobh

if that still fails we can rule out buffer_head funnies.


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24  8:57                                                                             ` Andrew Morton
@ 2006-12-24  9:26                                                                               ` Linus Torvalds
  2006-12-24 12:14                                                                               ` Andrei Popa
  2006-12-24 14:05                                                                               ` Martin Michlmayr
  2 siblings, 0 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-24  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Gordon Farquharson, Martin Michlmayr, Peter Zijlstra,
	Andrei Popa, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List



On Sun, 24 Dec 2006, Andrew Morton wrote:
> 
> > I now _suspect_ that we're talking about something like
> > 
> >  - we started a writeout. The IO is still pending, and the page was 
> >    marked clean and is now in the "writeback" phase.
> >  - a write happens to the page, and the page gets marked dirty again. 
> >    Marking the page dirty also marks all the _buffers_ in the page dirty, 
> >    but they were actually already dirty, because the IO hasn't completed 
> >    yet.
> >  - the IO from the _previous_ write completes, and marks the buffers clean 
> >    again.
> 
> Some things for the testers to try, please:
> 
> - mount the fs with ext2 with the no-buffer-head option.  That means either:

[ snip snip ]

This is definitely worth testing, but the exact scenario I outlined is 
probably not the thing that happens. It was really meant to be more of an 
example of the _kind_ of situation I think we might have.

That would explain why we didn't see this before: we simply didn't mark 
pages clean all that aggressively, and an app like rtorrent would normally 
have caused its flushes to happen _synchronously_ by using msync() (even 
if the IO itself was done asynchronously, all the dirty bit stuff would be 
synchronous wrt any rtorrent behaviour).

And the things that /did/ use to clean pages asynchronously (VM scanning) 
would always actually look at the "young" bit (aka "accessed") and not 
even touch the dirty bit if an application had accessed the page recently, 
so that basically avoided any likely races, because we'd touch the dirty 
bit ONLY if the page was "cold".

So this is why I'm saying that it might be an old bug, and it would be 
just the new pattern of handling dirty bits that triggers it.
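
As a rough illustration of the old scanner behaviour described above, here
is a toy sketch. It is a simplification under stated assumptions, not the
real VM scanning code: "struct toy_pte" and toy_scan_clean() are
hypothetical names, and the whole reclaim path is reduced to two flags.

```c
#include <stdbool.h>

/* Toy pte (hypothetical): just the two bits that matter here. */
struct toy_pte {
	bool young;	/* "accessed" bit: set when the page was touched */
	bool dirty;	/* hardware dirty bit */
};

/*
 * Toy version of the old asynchronous cleaning rule: a recently used
 * ("young") page only loses its young bit and is otherwise left alone.
 * Only a cold page has its dirty bit moved to the page and cleared.
 * Returns true if the dirty bit was looked at, false if the page was
 * skipped because it was young.
 */
static bool toy_scan_clean(struct toy_pte *pte, bool *page_dirty)
{
	if (pte->young) {
		pte->young = false;	/* age it, look again next pass */
		return false;
	}
	if (pte->dirty) {
		*page_dirty = true;	/* transfer dirty state to the page */
		pte->dirty = false;
	}
	return true;
}
```

The point of the sketch is that a page an application is actively writing
to stays young, so in this model its dirty bit is never touched while the
application could race with the scanner.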

But avoiding buffer heads and testing that part is worth doing. Just to 
remove one thing from the equation.

		Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24  8:57                                                                             ` Andrew Morton
  2006-12-24  9:26                                                                               ` Linus Torvalds
@ 2006-12-24 12:14                                                                               ` Andrei Popa
  2006-12-24 12:26                                                                                 ` Andrei Popa
  2006-12-24 12:31                                                                                 ` Andrew Morton
  2006-12-24 14:05                                                                               ` Martin Michlmayr
  2 siblings, 2 replies; 311+ messages in thread
From: Andrei Popa @ 2006-12-24 12:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Gordon Farquharson, Martin Michlmayr,
	Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

On Sun, 2006-12-24 at 00:57 -0800, Andrew Morton wrote: 
> On Sun, 24 Dec 2006 00:43:54 -0800 (PST)
> Linus Torvalds <torvalds@osdl.org> wrote:
> 
> > I now _suspect_ that we're talking about something like
> > 
> >  - we started a writeout. The IO is still pending, and the page was 
> >    marked clean and is now in the "writeback" phase.
> >  - a write happens to the page, and the page gets marked dirty again. 
> >    Marking the page dirty also marks all the _buffers_ in the page dirty, 
> >    but they were actually already dirty, because the IO hasn't completed 
> >    yet.
> >  - the IO from the _previous_ write completes, and marks the buffers clean 
> >    again.
> 
> Some things for the testers to try, please:
> 
> - mount the fs with ext2 with the no-buffer-head option.  That means either:
> 
>   grub.conf:  rootfstype=ext2 rootflags=nobh
>   /etc/fstab: ext2 nobh

ierdnac ~ # mount
/dev/sda7 on / type ext2 (rw,noatime,nobh)

I have corruption.

> 
> - mount the fs with ext3 data=writeback, nobh
> 
>   grub.conf:  rootfstype=ext3 rootflags=nobh,data=writeback  (I hope this works)
>   /etc/fstab: ext3 data=writeback,nobh

ierdnac ~ # mount
/dev/sda7 on / type ext3 (rw,noatime,nobh)

ierdnac ~ # dmesg|grep EXT3
EXT3-fs: mounted filesystem with writeback data mode.
EXT3 FS on sda7, internal journal

I don't have corruption. I tested twice.

> 
> if that still fails we can rule out buffer_head funnies.
> 


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24 12:14                                                                               ` Andrei Popa
@ 2006-12-24 12:26                                                                                 ` Andrei Popa
  2006-12-24 12:30                                                                                   ` Andrew Morton
  2006-12-24 12:31                                                                                 ` Andrew Morton
  1 sibling, 1 reply; 311+ messages in thread
From: Andrei Popa @ 2006-12-24 12:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Gordon Farquharson, Martin Michlmayr,
	Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

On Sun, 2006-12-24 at 14:14 +0200, Andrei Popa wrote:
> On Sun, 2006-12-24 at 00:57 -0800, Andrew Morton wrote: 
> > On Sun, 24 Dec 2006 00:43:54 -0800 (PST)
> > Linus Torvalds <torvalds@osdl.org> wrote:
> > 
> > > I now _suspect_ that we're talking about something like
> > > 
> > >  - we started a writeout. The IO is still pending, and the page was 
> > >    marked clean and is now in the "writeback" phase.
> > >  - a write happens to the page, and the page gets marked dirty again. 
> > >    Marking the page dirty also marks all the _buffers_ in the page dirty, 
> > >    but they were actually already dirty, because the IO hasn't completed 
> > >    yet.
> > >  - the IO from the _previous_ write completes, and marks the buffers clean 
> > >    again.
> > 
> > Some things for the testers to try, please:
> > 
> > - mount the fs with ext2 with the no-buffer-head option.  That means either:
> > 
> >   grub.conf:  rootfstype=ext2 rootflags=nobh
> >   /etc/fstab: ext2 nobh
> 
> ierdnac ~ # mount
> /dev/sda7 on / type ext2 (rw,noatime,nobh)
> 
> I have corruption.
> 
> > 
> > - mount the fs with ext3 data=writeback, nobh
> > 
> >   grub.conf:  rootfstype=ext3 rootflags=nobh,data=writeback  (I hope this works)
> >   /etc/fstab: ext3 data=writeback,nobh
> 
> ierdnac ~ # mount
> /dev/sda7 on / type ext3 (rw,noatime,nobh)
> 
> ierdnac ~ # dmesg|grep EXT3
> EXT3-fs: mounted filesystem with writeback data mode.
> EXT3 FS on sda7, internal journal
> 
> I don't have corruption. I tested twice.
> 

I also tested with ext3 ordered, nobh  and I have file corruption...

> > 
> > if that still fails we can rule out buffer_head funnies.
> > 


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24 12:26                                                                                 ` Andrei Popa
@ 2006-12-24 12:30                                                                                   ` Andrew Morton
  0 siblings, 0 replies; 311+ messages in thread
From: Andrew Morton @ 2006-12-24 12:30 UTC (permalink / raw)
  To: andrei.popa
  Cc: Linus Torvalds, Gordon Farquharson, Martin Michlmayr,
	Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

On Sun, 24 Dec 2006 14:26:01 +0200
Andrei Popa <andrei.popa@i-neo.ro> wrote:

> I also tested with ext3 ordered, nobh  and I have file corruption...

ordered+nobh isn't a possible combination.  The filesystem probably ignored
nobh.  nobh mode only makes sense with data=writeback.

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24 12:14                                                                               ` Andrei Popa
  2006-12-24 12:26                                                                                 ` Andrei Popa
@ 2006-12-24 12:31                                                                                 ` Andrew Morton
  2006-12-24 16:45                                                                                   ` Andrei Popa
  1 sibling, 1 reply; 311+ messages in thread
From: Andrew Morton @ 2006-12-24 12:31 UTC (permalink / raw)
  To: andrei.popa
  Cc: Linus Torvalds, Gordon Farquharson, Martin Michlmayr,
	Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

On Sun, 24 Dec 2006 14:14:38 +0200
Andrei Popa <andrei.popa@i-neo.ro> wrote:

> > - mount the fs with ext2 with the no-buffer-head option.  That means either:
> > 
> >   grub.conf:  rootfstype=ext2 rootflags=nobh
> >   /etc/fstab: ext2 nobh
> 
> ierdnac ~ # mount
> /dev/sda7 on / type ext2 (rw,noatime,nobh)
> 
> I have corruption.
> 
> > 
> > - mount the fs with ext3 data=writeback, nobh
> > 
> >   grub.conf:  rootfstype=ext3 rootflags=nobh,data=writeback  (I hope this works)
> >   /etc/fstab: ext3 data=writeback,nobh
> 
> ierdnac ~ # mount
> /dev/sda7 on / type ext3 (rw,noatime,nobh)
> 
> ierdnac ~ # dmesg|grep EXT3
> EXT3-fs: mounted filesystem with writeback data mode.
> EXT3 FS on sda7, internal journal
> 
> I don't have corruption. I tested twice.

This is a surprising result.  Can you please retest ext3 data=writeback,nobh?

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24  8:57                                                                             ` Andrew Morton
  2006-12-24  9:26                                                                               ` Linus Torvalds
  2006-12-24 12:14                                                                               ` Andrei Popa
@ 2006-12-24 14:05                                                                               ` Martin Michlmayr
  2 siblings, 0 replies; 311+ messages in thread
From: Martin Michlmayr @ 2006-12-24 14:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Gordon Farquharson, Peter Zijlstra, Andrei Popa,
	Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

* Andrew Morton <akpm@osdl.org> [2006-12-24 00:57]:
>   /etc/fstab: ext2 nobh
>   /etc/fstab: ext3 data=writeback,nobh

It seems that busybox mount ignores the nobh option but both ext2 and
ext3 data=writeback work for me.  This is with plain 2.6.19 which
normally always fails.
-- 
Martin Michlmayr
http://www.cyrius.com/

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24 12:31                                                                                 ` Andrew Morton
@ 2006-12-24 16:45                                                                                   ` Andrei Popa
  2006-12-24 17:16                                                                                     ` Linus Torvalds
  0 siblings, 1 reply; 311+ messages in thread
From: Andrei Popa @ 2006-12-24 16:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Gordon Farquharson, Martin Michlmayr,
	Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

On Sun, 2006-12-24 at 04:31 -0800, Andrew Morton wrote:
> On Sun, 24 Dec 2006 14:14:38 +0200
> Andrei Popa <andrei.popa@i-neo.ro> wrote:
> 
> > > - mount the fs with ext2 with the no-buffer-head option.  That means either:
> > > 
> > >   grub.conf:  rootfstype=ext2 rootflags=nobh
> > >   /etc/fstab: ext2 nobh
> > 
> > ierdnac ~ # mount
> > /dev/sda7 on / type ext2 (rw,noatime,nobh)
> > 
> > I have corruption.
> > 
> > > 
> > > - mount the fs with ext3 data=writeback, nobh
> > > 
> > >   grub.conf:  rootfstype=ext3 rootflags=nobh,data=writeback  (I hope this works)
> > >   /etc/fstab: ext3 data=writeback,nobh
> > 
> > ierdnac ~ # mount
> > /dev/sda7 on / type ext3 (rw,noatime,nobh)
> > 
> > ierdnac ~ # dmesg|grep EXT3
> > EXT3-fs: mounted filesystem with writeback data mode.
> > EXT3 FS on sda7, internal journal
> > 
> > I don't have corruption. I tested twice.
> 
> This is a surprising result.  Can you please retest ext3 data=writeback,nobh?

Yes, no corruption. Also tested only with data=writeback and had no
corruption.


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24 16:45                                                                                   ` Andrei Popa
@ 2006-12-24 17:16                                                                                     ` Linus Torvalds
  2006-12-24 18:07                                                                                       ` Andrew Morton
                                                                                                         ` (2 more replies)
  0 siblings, 3 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-24 17:16 UTC (permalink / raw)
  To: Andrei Popa
  Cc: Andrew Morton, Gordon Farquharson, Martin Michlmayr,
	Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List



On Sun, 24 Dec 2006, Andrei Popa wrote:

> On Sun, 2006-12-24 at 04:31 -0800, Andrew Morton wrote:
> > Andrei Popa <andrei.popa@i-neo.ro> wrote:
> > > /dev/sda7 on / type ext3 (rw,noatime,nobh)
> > > 
> > > I don't have corruption. I tested twice.
> > 
> > This is a surprising result.  Can you please retest ext3 data=writeback,nobh?
> 
> Yes, no corruption. Also tested only with data=writeback and had no
> corruption.

Ok, so it would seem to be writeback related _somehow_. However, most of 
the differences (I _thought_) in ext3 actually show up only if you have 
*both* "nobh" and "data=writeback", and as far as I can tell, just a 
simple "data=writeback" should still use the bog-standard 
"block_write_full_page()".

Andrew?

Although as far as I can see, then ext2 should work as-is too (since it 
too also just uses "block_write_full_page()" without anything fancy).

Strange.

How about this particularly stupid diff? (please test with something that 
_would_ cause corruption normally).

It is _entirely_ untested, but what it tries to do is to simply serialize 
any writeback in progress with any process that tries to re-map a shared 
page into its address space and dirty it. I haven't tested it, and maybe 
it misses some case, but it looks like a good way to try to avoid races 
with marking pages dirty and the writeback phase ..

			Linus
---
diff --git a/mm/memory.c b/mm/memory.c
index 563792f..64ed10b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1544,6 +1544,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			if (!pte_same(*page_table, orig_pte))
 				goto unlock;
 		}
+		wait_on_page_writeback(old_page);
 		dirty_page = old_page;
 		get_page(dirty_page);
 		reuse = 1;
@@ -2215,6 +2216,7 @@ retry:
 				page_cache_release(new_page);
 				return VM_FAULT_SIGBUS;
 			}
+			wait_on_page_writeback(new_page);
 		}
 	}
 

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24 17:16                                                                                     ` Linus Torvalds
@ 2006-12-24 18:07                                                                                       ` Andrew Morton
  2006-12-24 18:37                                                                                       ` Linus Torvalds
  2006-12-24 19:27                                                                                       ` Gordon Farquharson
  2 siblings, 0 replies; 311+ messages in thread
From: Andrew Morton @ 2006-12-24 18:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Popa, Gordon Farquharson, Martin Michlmayr,
	Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

On Sun, 24 Dec 2006 09:16:06 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> 
> 
> On Sun, 24 Dec 2006, Andrei Popa wrote:
> 
> > On Sun, 2006-12-24 at 04:31 -0800, Andrew Morton wrote:
> > > Andrei Popa <andrei.popa@i-neo.ro> wrote:
> > > > /dev/sda7 on / type ext3 (rw,noatime,nobh)
> > > > 
> > > > I don't have corruption. I tested twice.
> > > 
> > > > This is a surprising result.  Can you please retest ext3 data=writeback,nobh?
> > 
> > Yes, no corruption. Also tested only with data=writeback and had no
> > corruption.
> 
> Ok, so it would seem to be writeback related _somehow_. However, most of 
> the differences (I _thought_) in ext3 actually show up only if you have 
> *both* "nobh" and "data=writeback", and as far as I can tell, just a 
> simple "data=writeback" should still use the bog-standard 
> "block_write_full_page()".
> 
> Andrew?
> 
> Although as far as I can see, then ext2 should work as-is too (since it 
> too also just uses "block_write_full_page()" without anything fancy).

ext2 uses the multipage-bio assembly code for writeback whereas ext3
doesn't.  But ext3 doesn't use that code in data=ordered mode, of course.

Still, this:

--- a/fs/ext2/inode.c~a
+++ a/fs/ext2/inode.c
@@ -693,7 +693,7 @@ const struct address_space_operations ex
 	.commit_write		= generic_commit_write,
 	.bmap			= ext2_bmap,
 	.direct_IO		= ext2_direct_IO,
-	.writepages		= ext2_writepages,
+//	.writepages		= ext2_writepages,
 	.migratepage		= buffer_migrate_page,
 };
 
@@ -711,7 +711,7 @@ const struct address_space_operations ex
 	.commit_write		= nobh_commit_write,
 	.bmap			= ext2_bmap,
 	.direct_IO		= ext2_direct_IO,
-	.writepages		= ext2_writepages,
+//	.writepages		= ext2_writepages,
 	.migratepage		= buffer_migrate_page,
 };
 
_

will switch it off for ext2.


> Strange.
> 
> How about this particularly stupid diff? (please test with something that 
> _would_ cause corruption normally).
> 
> It is _entirely_ untested, but what it tries to do is to simply serialize 
> any writeback in progress with any process that tries to re-map a shared 
> page into its address space and dirty it. I haven't tested it, and maybe 
> it misses some case, but it looks like a good way to try to avoid races 
> with marking pages dirty and the writeback phase ..
> 
> 			Linus
> ---
> diff --git a/mm/memory.c b/mm/memory.c
> index 563792f..64ed10b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1544,6 +1544,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  			if (!pte_same(*page_table, orig_pte))
>  				goto unlock;
>  		}
> +		wait_on_page_writeback(old_page);
>  		dirty_page = old_page;
>  		get_page(dirty_page);
>  		reuse = 1;
> @@ -2215,6 +2216,7 @@ retry:
>  				page_cache_release(new_page);
>  				return VM_FAULT_SIGBUS;
>  			}
> +			wait_on_page_writeback(new_page);
>  		}
>  	}

yup.  Also, we could perhaps lock the target page during pagefaults..


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24 17:16                                                                                     ` Linus Torvalds
  2006-12-24 18:07                                                                                       ` Andrew Morton
@ 2006-12-24 18:37                                                                                       ` Linus Torvalds
  2006-12-24 19:18                                                                                         ` Linus Torvalds
  2006-12-24 21:21                                                                                         ` Michael S. Tsirkin
  2006-12-24 19:27                                                                                       ` Gordon Farquharson
  2 siblings, 2 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-24 18:37 UTC (permalink / raw)
  To: Andrei Popa, Peter Zijlstra
  Cc: Andrew Morton, Gordon Farquharson, Martin Michlmayr,
	Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List




On Sun, 24 Dec 2006, Linus Torvalds wrote:
>
> How about this particularly stupid diff? (please test with something that 
> _would_ cause corruption normally).

Actually, here's an even more stupid diff, which actually to some degree 
seems to capture the real problem better.

Peter, tell me I'm crazy, but with the new rules, the following condition 
is a bug:

 - shared mapping
 - writable
 - not already marked dirty in the PTE

because that combination means that the hardware can mark the PTE dirty 
without us even realizing (and thus not marking the "struct page *" 
dirty).

(The above is actually a valid situation for IO mappings, but not for 
"real" mappings. And IO mappings should never take page faults, I think).
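
Spelled out as a standalone predicate, the three conditions above look
roughly like this. It is a userspace sketch only: the TOY_* flag values
and toy_bad_shared_pte() are invented for illustration and do not match
any architecture's real pte or vm_flags layout, and the IO-mapping
exception is deliberately ignored.

```c
#include <stdbool.h>

/* Hypothetical flag bits, chosen only for this sketch. */
#define TOY_VM_SHARED	0x1u	/* the vma is a shared mapping */
#define TOY_PTE_WRITE	0x2u	/* the pte allows writes */
#define TOY_PTE_DIRTY	0x4u	/* the pte is already marked dirty */

/*
 * True for the suspicious combination: shared, writable, and not yet
 * dirty in the pte. In that state the hardware could set the dirty bit
 * on a store without the kernel ever taking a fault for it.
 */
static bool toy_bad_shared_pte(unsigned int vm_flags, unsigned int pte_flags)
{
	return (vm_flags & TOY_VM_SHARED) &&
	       (pte_flags & TOY_PTE_WRITE) &&
	       !(pte_flags & TOY_PTE_DIRTY);
}
```

A private mapping, a read-only pte, or a pte that is already dirty all
come out false in this model; only shared + writable + clean is flagged.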

So, with that in mind, I wrote this stupid patch (for 32-bit x86, since I 
used my Mac Mini for testing rather than my main machine - but the x86-64 
version should be pretty much identical)..

And you know what, Peter? It triggers for me. I get

	WARNING at mm/memory.c:2274 do_no_page()
	 [<c0103d4a>] show_trace_log_lvl+0x1a/0x2f
	 [<c010436c>] show_trace+0x12/0x14
	 [<c01043f0>] dump_stack+0x16/0x18
	 [<c0159790>] __handle_mm_fault+0x38d/0x919
	 [<c011c8c4>] do_page_fault+0x1ff/0x507
	 [<c03fabcc>] error_code+0x7c/0x84

which seems to say that do_no_page() can be used to insert shared and 
non-dirty, but still writable, pages.

But maybe my patch is just bogus, and I didn't think it through.

Peter, I realize it's Christmas Eve, but let's face it, Santa appreciates 
good boys and girls, and we all want tons of loot. So please be good, and 
waste some time looking at this and tell me why I'm either wrong, or 
there's a real smoking gun here.. ;)

		Linus

---
diff --git a/include/asm-i386/pgtable.h b/include/asm-i386/pgtable.h
index e6a4723..1389bb7 100644
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -494,7 +494,13 @@ do {									\
  * The i386 doesn't have any external MMU info: the kernel page
  * tables contain all the necessary information.
  */
-#define update_mmu_cache(vma,address,pte) do { } while (0)
+#define bad_shared_pte(pte) (pte_write(pte) && !pte_dirty(pte))
+#define update_mmu_cache(vma,address,pte) do {		\
+	static int __cnt;				\
+	WARN_ON(((vma)->vm_flags & VM_SHARED)		\
+		 && bad_shared_pte(pte)			\
+		 && ++__cnt < 5);			\
+} while (0)
 #endif /* !__ASSEMBLY__ */
 
 #ifdef CONFIG_FLATMEM

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24 18:37                                                                                       ` Linus Torvalds
@ 2006-12-24 19:18                                                                                         ` Linus Torvalds
  2006-12-24 20:55                                                                                           ` Gordon Farquharson
  2006-12-26 10:31                                                                                           ` Nick Piggin
  2006-12-24 21:21                                                                                         ` Michael S. Tsirkin
  1 sibling, 2 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-24 19:18 UTC (permalink / raw)
  To: Andrei Popa, Peter Zijlstra, David S. Miller
  Cc: Andrew Morton, Gordon Farquharson, Martin Michlmayr,
	Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List



On Sun, 24 Dec 2006, Linus Torvalds wrote:
> 
> Peter, tell me I'm crazy, but with the new rules, the following condition 
> is a bug:
> 
>  - shared mapping
>  - writable
>  - not already marked dirty in the PTE

Ok, so how about this diff.

I'm actually feeling good about this one. It really looks like 
"do_no_page()" was simply buggy, and that this explains everything.
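
As a rough illustration of what the change to the file-backed branch amounts to, here is a user-space model of the old and new ways of building the entry. It is a sketch under assumptions, not kernel code: the flag values and the names old_file_pte() and new_file_pte() are made up, and the vma_writable argument stands in for the VM_WRITE test done by maybe_mkwrite().

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins only; these are NOT the kernel's real bit layouts. */
enum { PTE_WRITE = 1u << 0, PTE_DIRTY = 1u << 1 };

/*
 * Old logic: start from the VMA protection bits and only adjust them on a
 * write fault. A read fault on a writable shared mapping therefore leaves
 * the entry writable but clean.
 */
static unsigned old_file_pte(unsigned prot, bool vma_writable, bool write_access)
{
	unsigned entry = prot;

	if (write_access) {
		entry |= PTE_DIRTY;		/* models pte_mkdirty() */
		if (vma_writable)
			entry |= PTE_WRITE;	/* models maybe_mkwrite() */
	}
	return entry;
}

/*
 * New logic for file-backed pages: always write-protect first, then grant
 * write permission only together with the dirty bit on a write fault.
 */
static unsigned new_file_pte(unsigned prot, bool vma_writable, bool write_access)
{
	unsigned entry = prot & ~(unsigned)PTE_WRITE;	/* models pte_wrprotect() */

	if (write_access) {
		entry |= PTE_DIRTY;
		if (vma_writable)
			entry |= PTE_WRITE;
	}
	return entry;
}
```

With prot = PTE_WRITE, a read fault gives PTE_WRITE from the old function (writable and clean) and 0 from the new one; a write fault gives PTE_WRITE | PTE_DIRTY from both.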

Please please please test. Throw all the other patches away (with the 
possible exception of the "update_mmu_cache()" sanity checker, which is 
still interesting in case some _other_ place does this too).

Don't do the "wait_on_page_writeback()" thing, because it changes timings 
and might hide things for the wrong reasons.  Just apply this on top of a 
known failing kernel, and test.

			Linus

---
diff --git a/mm/memory.c b/mm/memory.c
index 563792f..cf429c4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2247,21 +2249,23 @@ retry:
 	if (pte_none(*page_table)) {
 		flush_icache_page(vma, new_page);
 		entry = mk_pte(new_page, vma->vm_page_prot);
-		if (write_access)
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-		set_pte_at(mm, address, page_table, entry);
 		if (anon) {
 			inc_mm_counter(mm, anon_rss);
 			lru_cache_add_active(new_page);
 			page_add_new_anon_rmap(new_page, vma, address);
+			if (write_access)
+				entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(new_page);
+			entry = pte_wrprotect(entry);
 			if (write_access) {
 				dirty_page = new_page;
 				get_page(dirty_page);
+				entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 			}
 		}
+		set_pte_at(mm, address, page_table, entry);
 	} else {
 		/* One of our sibling threads was faster, back out. */
 		page_cache_release(new_page);

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24 17:16                                                                                     ` Linus Torvalds
  2006-12-24 18:07                                                                                       ` Andrew Morton
  2006-12-24 18:37                                                                                       ` Linus Torvalds
@ 2006-12-24 19:27                                                                                       ` Gordon Farquharson
  2006-12-24 19:35                                                                                         ` Linus Torvalds
  2 siblings, 1 reply; 311+ messages in thread
From: Gordon Farquharson @ 2006-12-24 19:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Popa, Andrew Morton, Martin Michlmayr, Peter Zijlstra,
	Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

On 12/24/06, Linus Torvalds <torvalds@osdl.org> wrote:

> How about this particularly stupid diff? (please test with something that
> _would_ cause corruption normally).
>
> It is _entirely_ untested, but what it tries to do is to simply serialize
> any writeback in progress with any process that tries to re-map a shared
> page into its address space and dirty it. I haven't tested it, and maybe
> it misses some case, but it looks like a good way to try to avoid races
> with marking pages dirty and the writeback phase ..

The apt cache files (/var/cache/apt/*.bin) still get corrupted with
this patch and 2.6.19.

Gordon

diff -Naupr linux-2.6.19.orig/fs/buffer.c linux-2.6.19/fs/buffer.c
--- linux-2.6.19.orig/fs/buffer.c       2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/fs/buffer.c    2006-12-21 01:16:31.000000000 -0700
@@ -2832,7 +2832,7 @@ int try_to_free_buffers(struct page *pag
        int ret = 0;

        BUG_ON(!PageLocked(page));
-       if (PageWriteback(page))
+       if (PageDirty(page) || PageWriteback(page))
                return 0;

        if (mapping == NULL) {          /* can this still happen? */
@@ -2843,17 +2843,6 @@ int try_to_free_buffers(struct page *pag
        spin_lock(&mapping->private_lock);
        ret = drop_buffers(page, &buffers_to_free);
        spin_unlock(&mapping->private_lock);
-       if (ret) {
-               /*
-                * If the filesystem writes its buffers by hand (eg ext3)
-                * then we can have clean buffers against a dirty page.  We
-                * clean the page here; otherwise later reattachment of buffers
-                * could encounter a non-uptodate page, which is unresolvable.
-                * This only applies in the rare case where try_to_free_buffers
-                * succeeds but the page is not freed.
-                */
-               clear_page_dirty(page);
-       }
 out:
        if (buffers_to_free) {
                struct buffer_head *bh = buffers_to_free;
diff -Naupr linux-2.6.19.orig/fs/hugetlbfs/inode.c linux-2.6.19/fs/hugetlbfs/inode.c
--- linux-2.6.19.orig/fs/hugetlbfs/inode.c      2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/fs/hugetlbfs/inode.c   2006-12-21 01:15:21.000000000 -0700
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct

 static void truncate_huge_page(struct page *page)
 {
-       clear_page_dirty(page);
+       cancel_dirty_page(page, /* No IO accounting for huge pages? */0);
        ClearPageUptodate(page);
        remove_from_page_cache(page);
        put_page(page);
diff -Naupr linux-2.6.19.orig/include/linux/page-flags.h linux-2.6.19/include/linux/page-flags.h
--- linux-2.6.19.orig/include/linux/page-flags.h        2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/include/linux/page-flags.h     2006-12-21 01:15:21.000000000 -0700
@@ -253,15 +253,11 @@ static inline void SetPageUptodate(struc

 struct page;   /* forward declaration */

-int test_clear_page_dirty(struct page *page);
+extern void cancel_dirty_page(struct page *page, unsigned int account_size);
+
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);

-static inline void clear_page_dirty(struct page *page)
-{
-       test_clear_page_dirty(page);
-}
-
 static inline void set_page_writeback(struct page *page)
 {
        test_set_page_writeback(page);
diff -Naupr linux-2.6.19.orig/mm/memory.c linux-2.6.19/mm/memory.c
--- linux-2.6.19.orig/mm/memory.c       2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/memory.c    2006-12-24 11:04:03.000000000 -0700
@@ -1534,6 +1534,7 @@ static int do_wp_page(struct mm_struct *
                        if (!pte_same(*page_table, orig_pte))
                                goto unlock;
                }
+               wait_on_page_writeback(old_page);
                dirty_page = old_page;
                get_page(dirty_page);
                reuse = 1;
@@ -1832,6 +1833,33 @@ void unmap_mapping_range(struct address_
 }
 EXPORT_SYMBOL(unmap_mapping_range);

+static void check_last_page(struct address_space *mapping, loff_t size)
+{
+       pgoff_t index;
+       unsigned int offset;
+       struct page *page;
+
+       if (!mapping)
+               return;
+       offset = size & ~PAGE_MASK;
+       if (!offset)
+               return;
+       index = size >> PAGE_SHIFT;
+       page = find_lock_page(mapping, index);
+       if (page) {
+               unsigned int check = 0;
+               unsigned char *kaddr = kmap_atomic(page, KM_USER0);
+               do {
+                       check += kaddr[offset++];
+               } while (offset < PAGE_SIZE);
+               kunmap_atomic(kaddr,KM_USER0);
+               unlock_page(page);
+               page_cache_release(page);
+               if (check)
+                       printk("%s: BADNESS: truncate check %u\n", current->comm, check);
+       }
+}
+
 /**
  * vmtruncate - unmap mappings "freed" by truncate() syscall
  * @inode: inode of the file used
@@ -1865,6 +1893,7 @@ do_expand:
                goto out_sig;
        if (offset > inode->i_sb->s_maxbytes)
                goto out_big;
+       check_last_page(mapping, inode->i_size);
        i_size_write(inode, offset);

 out_truncate:
@@ -2206,6 +2235,7 @@ retry:
                                page_cache_release(new_page);
                                return VM_FAULT_SIGBUS;
                        }
+                       wait_on_page_writeback(new_page);
                }
        }

diff -Naupr linux-2.6.19.orig/mm/page-writeback.c linux-2.6.19/mm/page-writeback.c
--- linux-2.6.19.orig/mm/page-writeback.c       2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/page-writeback.c    2006-12-21 01:26:53.000000000 -0700
@@ -843,39 +843,6 @@ int set_page_dirty_lock(struct page *pag
 EXPORT_SYMBOL(set_page_dirty_lock);

 /*
- * Clear a page's dirty flag, while caring for dirty memory accounting.
- * Returns true if the page was previously dirty.
- */
-int test_clear_page_dirty(struct page *page)
-{
-       struct address_space *mapping = page_mapping(page);
-       unsigned long flags;
-
-       if (mapping) {
-               write_lock_irqsave(&mapping->tree_lock, flags);
-               if (TestClearPageDirty(page)) {
-                       radix_tree_tag_clear(&mapping->page_tree,
-                                               page_index(page),
-                                               PAGECACHE_TAG_DIRTY);
-                       write_unlock_irqrestore(&mapping->tree_lock, flags);
-                       /*
-                        * We can continue to use `mapping' here because the
-                        * page is locked, which pins the address_space
-                        */
-                       if (mapping_cap_account_dirty(mapping)) {
-                               page_mkclean(page);
-                               dec_zone_page_state(page, NR_FILE_DIRTY);
-                       }
-                       return 1;
-               }
-               write_unlock_irqrestore(&mapping->tree_lock, flags);
-               return 0;
-       }
-       return TestClearPageDirty(page);
-}
-EXPORT_SYMBOL(test_clear_page_dirty);
-
-/*
  * Clear a page's dirty flag, while caring for dirty memory accounting.
  * Returns true if the page was previously dirty.
  *
diff -Naupr linux-2.6.19.orig/mm/rmap.c linux-2.6.19/mm/rmap.c
--- linux-2.6.19.orig/mm/rmap.c 2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/rmap.c      2006-12-22 23:25:09.000000000 -0700
@@ -432,7 +432,7 @@ static int page_mkclean_one(struct page
 {
        struct mm_struct *mm = vma->vm_mm;
        unsigned long address;
-       pte_t *pte, entry;
+       pte_t *pte;
        spinlock_t *ptl;
        int ret = 0;

@@ -444,17 +444,18 @@ static int page_mkclean_one(struct page
        if (!pte)
                goto out;

-       if (!pte_dirty(*pte) && !pte_write(*pte))
-               goto unlock;
+       if (pte_dirty(*pte) || pte_write(*pte)) {
+               pte_t entry;

-       entry = ptep_get_and_clear(mm, address, pte);
-       entry = pte_mkclean(entry);
-       entry = pte_wrprotect(entry);
-       ptep_establish(vma, address, pte, entry);
-       lazy_mmu_prot_update(entry);
-       ret = 1;
+               flush_cache_page(vma, address, pte_pfn(*pte));
+               entry = ptep_clear_flush(vma, address, pte);
+               entry = pte_wrprotect(entry);
+               entry = pte_mkclean(entry);
+               set_pte_at(vma, address, pte, entry);
+               lazy_mmu_prot_update(entry);
+               ret = 1;
+       }

-unlock:
        pte_unmap_unlock(pte, ptl);
 out:
        return ret;
@@ -489,6 +490,8 @@ int page_mkclean(struct page *page)
                if (mapping)
                        ret = page_mkclean_file(mapping, page);
        }
+       if (page_test_and_clear_dirty(page))
+               ret = 1;

        return ret;
 }
@@ -587,8 +590,6 @@ void page_remove_rmap(struct page *page)
                 * Leaving it set also helps swapoff to reinstate ptes
                 * faster for those pages still in swapcache.
                 */
-               if (page_test_and_clear_dirty(page))
-                       set_page_dirty(page);
                __dec_zone_page_state(page,
                                PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
        }
@@ -607,6 +608,7 @@ static int try_to_unmap_one(struct page
        pte_t pteval;
        spinlock_t *ptl;
        int ret = SWAP_AGAIN;
+       struct page *dirty_page = NULL;

        address = vma_address(page, vma);
        if (address == -EFAULT)
@@ -633,7 +635,7 @@ static int try_to_unmap_one(struct page

        /* Move the dirty bit to the physical page now the pte is gone. */
        if (pte_dirty(pteval))
-               set_page_dirty(page);
+               dirty_page = page;

        /* Update high watermark before we lower rss */
        update_hiwater_rss(mm);
@@ -684,6 +686,8 @@ static int try_to_unmap_one(struct page

 out_unmap:
        pte_unmap_unlock(pte, ptl);
+       if (dirty_page)
+               set_page_dirty(dirty_page);
 out:
        return ret;
 }
@@ -915,6 +919,9 @@ int try_to_unmap(struct page *page, int
        else
                ret = try_to_unmap_file(page, migration);

+       if (page_test_and_clear_dirty(page))
+               set_page_dirty(page);
+
        if (!page_mapped(page))
                ret = SWAP_SUCCESS;
        return ret;
diff -Naupr linux-2.6.19.orig/mm/truncate.c linux-2.6.19/mm/truncate.c
--- linux-2.6.19.orig/mm/truncate.c     2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/truncate.c  2006-12-23 13:21:42.000000000 -0700
@@ -50,6 +50,21 @@ static inline void truncate_partial_page
                do_invalidatepage(page, partial);
 }

+void cancel_dirty_page(struct page *page, unsigned int account_size)
+{
+       /* If we're cancelling the page, it had better not be mapped any more */
+       if (page_mapped(page)) {
+               static unsigned int warncount;
+
+               WARN_ON(++warncount < 5);
+       }
+
+       if (TestClearPageDirty(page) && account_size &&
+                       mapping_cap_account_dirty(page->mapping))
+               dec_zone_page_state(page, NR_FILE_DIRTY);
+}
+
+
 /*
  * If truncate cannot remove the fs-private metadata from the page, the page
  * becomes anonymous.  It will be left on the LRU and may even be mapped into
@@ -66,10 +81,11 @@ truncate_complete_page(struct address_sp
        if (page->mapping != mapping)
                return;

+       cancel_dirty_page(page, PAGE_CACHE_SIZE);
+
        if (PagePrivate(page))
                do_invalidatepage(page, 0);

-       clear_page_dirty(page);
        ClearPageUptodate(page);
        ClearPageMappedToDisk(page);
        remove_from_page_cache(page);
@@ -348,7 +364,6 @@ int invalidate_inode_pages2_range(struct
                for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
                        struct page *page = pvec.pages[i];
                        pgoff_t page_index;
-                       int was_dirty;

                        lock_page(page);
                        if (page->mapping != mapping) {
@@ -384,12 +399,8 @@ int invalidate_inode_pages2_range(struct
                                          PAGE_CACHE_SIZE, 0);
                                }
                        }
-                       was_dirty = test_clear_page_dirty(page);
-                       if (!invalidate_complete_page2(mapping, page)) {
-                               if (was_dirty)
-                                       set_page_dirty(page);
+                       if (!invalidate_complete_page2(mapping, page))
                                ret = -EIO;
-                       }
                        unlock_page(page);
                }
                pagevec_release(&pvec);


-- 
Gordon Farquharson

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24 19:27                                                                                       ` Gordon Farquharson
@ 2006-12-24 19:35                                                                                         ` Linus Torvalds
  2006-12-24 20:10                                                                                           ` Andrei Popa
  2006-12-24 22:01                                                                                           ` Martin Michlmayr
  0 siblings, 2 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-24 19:35 UTC (permalink / raw)
  To: Gordon Farquharson
  Cc: Andrei Popa, Andrew Morton, Martin Michlmayr, Peter Zijlstra,
	Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List



On Sun, 24 Dec 2006, Gordon Farquharson wrote:
> 
> The apt cache files (/var/cache/apt/*.bin) still get corrupted with
> this patch and 2.6.19.

Yeah, if my guess about do_no_page() is right, _none_ of the previous 
patches should have ANY effect what-so-ever. In fact, I'd say that even 
the "ext3 works in writeback mode" thing that Andrei reports is probably a 
total fluke brought on by timing changes rather than anything else.

So please try the latest patch instead (on top of anything that shows 
corruption reliably - the patch should be _totally_ independent of all the 
other issues, and I think it will apply cleanly on top of 2.6.18.3 and 
2.6.19 too, so anything that shows corruption is a fine target - but try 
to choose something that has been the "best" at corrupting things for you, 
to make the testing as good as possible).

Patch included here again (although I think you were cc'd on my previous 
email too, so you should already have it, and our emails just crossed)

And if this doesn't fix it, I don't know what will..

		Linus

---
diff --git a/mm/memory.c b/mm/memory.c
index 563792f..cf429c4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2247,21 +2249,23 @@ retry:
 	if (pte_none(*page_table)) {
 		flush_icache_page(vma, new_page);
 		entry = mk_pte(new_page, vma->vm_page_prot);
-		if (write_access)
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-		set_pte_at(mm, address, page_table, entry);
 		if (anon) {
 			inc_mm_counter(mm, anon_rss);
 			lru_cache_add_active(new_page);
 			page_add_new_anon_rmap(new_page, vma, address);
+			if (write_access)
+				entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(new_page);
+			entry = pte_wrprotect(entry);
 			if (write_access) {
 				dirty_page = new_page;
 				get_page(dirty_page);
+				entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 			}
 		}
+		set_pte_at(mm, address, page_table, entry);
 	} else {
 		/* One of our sibling threads was faster, back out. */
 		page_cache_release(new_page);

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24 19:35                                                                                         ` Linus Torvalds
@ 2006-12-24 20:10                                                                                           ` Andrei Popa
  2006-12-24 20:24                                                                                             ` Linus Torvalds
  2006-12-24 22:01                                                                                           ` Martin Michlmayr
  1 sibling, 1 reply; 311+ messages in thread
From: Andrei Popa @ 2006-12-24 20:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Gordon Farquharson, Andrew Morton, Martin Michlmayr,
	Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

On Sun, 2006-12-24 at 11:35 -0800, Linus Torvalds wrote:
> 
> On Sun, 24 Dec 2006, Gordon Farquharson wrote:
> > 
> > The apt cache files (/var/cache/apt/*.bin) still get corrupted with
> > this patch and 2.6.19.
> 
> Yeah, if my guess about do_no_page() is right, _none_ of the previous 
> patches should have ANY effect what-so-ever. In fact, I'd say that even 
> the "ext3 works in writeback mode" thing that Andrei reports is probably a 
> total fluke brought on by timing changes rather than anything else.
> 
> So please try the latest patch instead (on top of anything that shows 
> corruption reliably - the patch should be _totally_ independent of all the 
> other issues, and I think it will apply cleanly on top of 2.6.18.3 and 
> 2.6.19 too, so anything that shows corruption is a fine target - but try 
> to choose something that has been the "best" at corrupting things for you, 
> to make the testing as good as possible).
> 
> Patch included here again (although I think you were cc'd on my previous 
> email too, so you should already have it, and our emails just crossed)
> 
> And if this doesn't fix it, I don't know what will..

With latest git and patches:
http://lkml.org/lkml/diff/2006/12/24/56/1
http://lkml.org/lkml/diff/2006/12/24/61/1

Hash check on download completion found bad chunks, consider using
"safe_sync".

> 
> 		Linus
> 
> ---
> diff --git a/mm/memory.c b/mm/memory.c
> index 563792f..cf429c4 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2247,21 +2249,23 @@ retry:
>  	if (pte_none(*page_table)) {
>  		flush_icache_page(vma, new_page);
>  		entry = mk_pte(new_page, vma->vm_page_prot);
> -		if (write_access)
> -			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> -		set_pte_at(mm, address, page_table, entry);
>  		if (anon) {
>  			inc_mm_counter(mm, anon_rss);
>  			lru_cache_add_active(new_page);
>  			page_add_new_anon_rmap(new_page, vma, address);
> +			if (write_access)
> +				entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>  		} else {
>  			inc_mm_counter(mm, file_rss);
>  			page_add_file_rmap(new_page);
> +			entry = pte_wrprotect(entry);
>  			if (write_access) {
>  				dirty_page = new_page;
>  				get_page(dirty_page);
> +				entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>  			}
>  		}
> +		set_pte_at(mm, address, page_table, entry);
>  	} else {
>  		/* One of our sibling threads was faster, back out. */
>  		page_cache_release(new_page);


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24 20:10                                                                                           ` Andrei Popa
@ 2006-12-24 20:24                                                                                             ` Linus Torvalds
  2006-12-24 20:30                                                                                               ` Andrei Popa
  2006-12-26 17:51                                                                                               ` Al Viro
  0 siblings, 2 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-24 20:24 UTC (permalink / raw)
  To: Andrei Popa
  Cc: Gordon Farquharson, Andrew Morton, Martin Michlmayr,
	Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List



On Sun, 24 Dec 2006, Andrei Popa wrote:
> 
> Hash check on download completion found bad chunks, consider using
> "safe_sync".

Dang. Did you get any warning messages from the kernel?

		Linus

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24 20:24                                                                                             ` Linus Torvalds
@ 2006-12-24 20:30                                                                                               ` Andrei Popa
  2006-12-26 17:51                                                                                               ` Al Viro
  1 sibling, 0 replies; 311+ messages in thread
From: Andrei Popa @ 2006-12-24 20:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Gordon Farquharson, Andrew Morton, Martin Michlmayr,
	Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

On Sun, 2006-12-24 at 12:24 -0800, Linus Torvalds wrote:
> 
> On Sun, 24 Dec 2006, Andrei Popa wrote:
> > 
> > Hash check on download completion found bad chunks, consider using
> > "safe_sync".
> 
> Dang. Did you get any warning messages from the kernel?
> 

only these:
ACPI: EC: evaluating _Q80
ACPI: EC: evaluating _Q80
ACPI: EC: evaluating _Q80

but I don't think they have anything to do with it...

> 		Linus


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24 19:18                                                                                         ` Linus Torvalds
@ 2006-12-24 20:55                                                                                           ` Gordon Farquharson
  2006-12-26 10:31                                                                                           ` Nick Piggin
  1 sibling, 0 replies; 311+ messages in thread
From: Gordon Farquharson @ 2006-12-24 20:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Popa, Peter Zijlstra, David S. Miller, Andrew Morton,
	Martin Michlmayr, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

On 12/24/06, Linus Torvalds <torvalds@osdl.org> wrote:

> Ok, so how about this diff.
>
> I'm actually feeling good about this one. It really looks like
> "do_no_page()" was simply buggy, and that this explains everything.

I tested with just this patch and 2.6.19 and no change. Sorry Linus,
no early Christmas present :-(

Gordon

-- 
Gordon Farquharson

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24 18:37                                                                                       ` Linus Torvalds
  2006-12-24 19:18                                                                                         ` Linus Torvalds
@ 2006-12-24 21:21                                                                                         ` Michael S. Tsirkin
  1 sibling, 0 replies; 311+ messages in thread
From: Michael S. Tsirkin @ 2006-12-24 21:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Popa, Peter Zijlstra, Andrew Morton, Gordon Farquharson,
	Martin Michlmayr, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	openib-general, Linux Kernel Mailing List

> Quoting Linus Torvalds <torvalds@osdl.org>:
> Subject: Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
>
> Peter, tell me I'm crazy, but with the new rules, the following condition 
> is a bug:
> 
>  - shared mapping
>  - writable
>  - not already marked dirty in the PTE
> 
> because that combination means that the hardware can mark the PTE dirty 
> without us even realizing (and thus not marking the "struct page *" 
> dirty).

Er.
Sorry about butting in, and I'm not sure I understand all of the discussion,
but this reminded me of an old issue with COW that created what looks
like a vaguely similar data corruption on infiniband. We solved this for
infiniband with MADV_DONTFORK, but I always wondered why it does not affect
other parts of the kernel.  Small reminder from that discussion:

down mmap sem
get user pages
up mmap sem
page becomes shared, and COW (e.g. fork)
process writes to first byte of page <----- gets a copy
Now we had a problem: struct page that we got from get user pages
does not point to a correct page in our process.
For example: if at some point we map this page for DMA, and
hardware writes to last byte of page -----> process does not
see this data.

So for infiniband, what we do is a combination of
- prevent page from becoming COW while hardware might DMA to this page, and
- ask users not to write to page if hardware might DMA to same page
  (even if it's using different bytes).
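
The first of those two measures can be sketched from user space with madvise(MADV_DONTFORK), which keeps a range out of any child created by fork() so that it never becomes COW-shared with one. The helper names alloc_nofork_buffer() and free_nofork_buffer() are hypothetical, and this is a minimal Linux-only sketch rather than what the infiniband libraries actually do.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS and MADV_DONTFORK */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map an anonymous, page-aligned buffer and mark it
 * MADV_DONTFORK so a later fork() does not share it copy-on-write with a
 * child. Returns NULL on failure.
 */
static void *alloc_nofork_buffer(size_t len)
{
	void *buf;

	assert(len > 0);
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return NULL;

	if (madvise(buf, len, MADV_DONTFORK) != 0) {
		munmap(buf, len);
		return NULL;
	}
	return buf;
}

/* Hypothetical helper: release a buffer obtained from alloc_nofork_buffer(). */
static void free_nofork_buffer(void *buf, size_t len)
{
	if (buf != NULL)
		munmap(buf, len);
}
```

The second measure (not writing to a page while hardware may DMA to it) is a usage rule and cannot be enforced by a call like this.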

I just wondered - is there some chance something like this could be happening in
the fs code?

HTH,

-- 
MST

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24 19:35                                                                                         ` Linus Torvalds
  2006-12-24 20:10                                                                                           ` Andrei Popa
@ 2006-12-24 22:01                                                                                           ` Martin Michlmayr
  1 sibling, 0 replies; 311+ messages in thread
From: Martin Michlmayr @ 2006-12-24 22:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Gordon Farquharson, Andrei Popa, Andrew Morton, Peter Zijlstra,
	Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

* Linus Torvalds <torvalds@osdl.org> [2006-12-24 11:35]:
> And if this doesn't fix it, I don't know what will..

Sorry, but it still fails (on top of plain 2.6.19).
-- 
Martin Michlmayr
http://www.cyrius.com/

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24 19:18                                                                                         ` Linus Torvalds
  2006-12-24 20:55                                                                                           ` Gordon Farquharson
@ 2006-12-26 10:31                                                                                           ` Nick Piggin
  2006-12-26 19:26                                                                                             ` Linus Torvalds
  1 sibling, 1 reply; 311+ messages in thread
From: Nick Piggin @ 2006-12-26 10:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Popa, Peter Zijlstra, David S. Miller, Andrew Morton,
	Gordon Farquharson, Martin Michlmayr, Hugh Dickins,
	Arjan van de Ven, Linux Kernel Mailing List

Linus Torvalds wrote:
> 
> On Sun, 24 Dec 2006, Linus Torvalds wrote:
> 
>>Peter, tell me I'm crazy, but with the new rules, the following condition 
>>is a bug:
>>
>> - shared mapping
>> - writable
>> - not already marked dirty in the PTE
> 
> 
> Ok, so how about this diff.
> 
> I'm actually feeling good about this one. It really looks like 
> "do_no_page()" was simply buggy, and that this explains everything.

Still trying to catch up here, so I'm not going to reply to any old
stuff and just start at the tip of the thread... Other than to say
that I really like cancel_page_dirty ;)

I think your patch is quite right so that's a good catch. But I'm not
too surprised that it does not help the problem, because I don't
think we have started shedding any old pte_dirty tests at
unmap/reclaim-time, have we? So the dirty bit isn't going to get lost,
as such.

I was hoping that you've almost narrowed it down to the filesystem
writeback code, with the last few mails?

Nick

> Please please please test. Throw all the other patches away (with the 
> possible exception of the "update_mmu_cache()" sanity checker, which is 
> still interesting in case some _other_ place does this too).
> 
> Don't do the "wait_on_page_writeback()" thing, because it changes timings 
> and might hide things for the wrong reasons.  Just apply this on top of a 
> known failing kernel, and test.
> 
> 			Linus
> 
> ---
> diff --git a/mm/memory.c b/mm/memory.c
> index 563792f..cf429c4 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2247,21 +2249,23 @@ retry:
>  	if (pte_none(*page_table)) {
>  		flush_icache_page(vma, new_page);
>  		entry = mk_pte(new_page, vma->vm_page_prot);
> -		if (write_access)
> -			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> -		set_pte_at(mm, address, page_table, entry);
>  		if (anon) {
>  			inc_mm_counter(mm, anon_rss);
>  			lru_cache_add_active(new_page);
>  			page_add_new_anon_rmap(new_page, vma, address);
> +			if (write_access)
> +				entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>  		} else {
>  			inc_mm_counter(mm, file_rss);
>  			page_add_file_rmap(new_page);
> +			entry = pte_wrprotect(entry);
>  			if (write_access) {
>  				dirty_page = new_page;
>  				get_page(dirty_page);
> +				entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>  			}
>  		}
> +		set_pte_at(mm, address, page_table, entry);
>  	} else {
>  		/* One of our sibling threads was faster, back out. */
>  		page_cache_release(new_page);
> 
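
For readers following the quoted diff, here is a rough user-space model of
what it changes. This is only a sketch, not kernel code: the Pte class, the
prot_writable flag (standing in for whether vma->vm_page_prot carries the
write bit) and the violates_rule() helper are made up for illustration. It
assumes a file-backed page mapped through a writable shared mapping.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Pte:
    """Toy page table entry: only the two bits discussed in the thread."""
    writable: bool
    dirty: bool = False


def maybe_mkwrite(pte, vma_may_write):
    # Like the kernel helper of the same name: set the write bit only if
    # the VMA permits writes.
    return replace(pte, writable=True) if vma_may_write else pte


def old_do_no_page(prot_writable, vma_may_write, write_access, anon):
    # Before the patch: the entry keeps whatever vm_page_prot gave it and
    # is only dirtied on a write fault (anon is unused on this path).
    entry = Pte(writable=prot_writable)
    if write_access:
        entry = maybe_mkwrite(replace(entry, dirty=True), vma_may_write)
    return entry


def new_do_no_page(prot_writable, vma_may_write, write_access, anon):
    # After the patch: file pages are write-protected first, and become
    # writable only together with the dirty bit.
    entry = Pte(writable=prot_writable)
    if anon:
        if write_access:
            entry = maybe_mkwrite(replace(entry, dirty=True), vma_may_write)
    else:
        entry = replace(entry, writable=False)  # pte_wrprotect()
        if write_access:
            entry = maybe_mkwrite(replace(entry, dirty=True), vma_may_write)
    return entry


def violates_rule(pte):
    # The condition Linus calls a bug for shared mappings:
    # writable, but not already marked dirty in the PTE.
    return pte.writable and not pte.dirty
```

In this model a read fault on a file page (write_access=False, anon=False)
with prot_writable=True yields a writable-but-clean entry from
old_do_no_page() and a write-protected one from new_do_no_page(). Whether
vm_page_prot really carries the write bit in that case depends on the
kernel version and is an assumption here.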


-- 
SUSE Labs, Novell Inc.

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24  8:43                                                                           ` Linus Torvalds
  2006-12-24  8:57                                                                             ` Andrew Morton
@ 2006-12-26 16:17                                                                             ` Tobias Diedrich
  2006-12-27  4:55                                                                               ` [PATCH] mm: fix page_mkclean_one David Miller
  1 sibling, 1 reply; 311+ messages in thread
From: Tobias Diedrich @ 2006-12-26 16:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Gordon Farquharson, Martin Michlmayr, Peter Zijlstra,
	Andrei Popa, Andrew Morton, Hugh Dickins, Nick Piggin,
	Arjan van de Ven, Linux Kernel Mailing List

Linus Torvalds wrote:
> I don't think it's a page table issue any more, it just doesn't look 
> likely with the ARM UP corruption. It's also not apparently even on a 
> cacheline boundary, so it probably is really a dirty bit that got cleared 
> wrong due to some race with IO.

So, until now it's only been reported for SMP on i386?
I'm seeing the issue on my Pentium-M Notebook (Thinkpad R52) over
here, UP kernel, no preempt.
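
For anyone comparing setups, a small sketch of how the "UP, no preempt"
claim could be checked mechanically against a .config like the one below.
The function names parse_kconfig() and is_up_nonpreempt() are made up for
this example; the only format assumed is the usual "CONFIG_FOO=value" and
"# CONFIG_FOO is not set" lines.

```python
import re


def parse_kconfig(text):
    """Map option names to their value, or to None if 'is not set'."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        m = re.fullmatch(r"# (CONFIG_\w+) is not set", line)
        if m:
            opts[m.group(1)] = None
            continue
        m = re.fullmatch(r"(CONFIG_\w+)=(.*)", line)
        if m:
            # Drop the quotes around string values such as "anticipatory".
            opts[m.group(1)] = m.group(2).strip('"')
    return opts


def is_up_nonpreempt(opts):
    # Uniprocessor and non-preemptible: neither CONFIG_SMP nor
    # CONFIG_PREEMPT is enabled (absent options count as disabled).
    return opts.get("CONFIG_SMP") is None and opts.get("CONFIG_PREEMPT") is None


sample = """\
# CONFIG_SMP is not set
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_HZ=300
"""
print(is_up_nonpreempt(parse_kconfig(sample)))
```

The sample lines are copied from the config in this mail. On a running
kernel built with CONFIG_IKCONFIG_PROC the same text should be readable
from /proc/config.gz.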

I first saw it with 2.6.20-rc1, but am running 2.6.20-rc2 now.
The corruption pattern looks like the one already reported: the
rtorrent hash check fails (for some files it succeeds at first, but
fails after "echo 1 > /proc/sys/vm/drop_caches"), and the corruption
is zeroes at the end of a page instead of data.
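
A minimal sketch of how one might look for this pattern, assuming a
known-good copy of the file is available for comparison. It is not the
check rtorrent performs; zero_tail_pages(), scan_files() and the 4 KiB
page size are assumptions made for the example.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on i386


def zero_tail_pages(good, suspect, page_size=PAGE_SIZE):
    """Return (page_index, offset) for each page of `suspect` that matches
    `good` up to `offset` and is all zeroes from there to the page end."""
    hits = []
    for start in range(0, min(len(good), len(suspect)), page_size):
        g = good[start:start + page_size]
        s = suspect[start:start + page_size]
        # Compare only the part both copies have (the last page may be short).
        n = min(len(g), len(s))
        g, s = g[:n], s[:n]
        if g == s:
            continue
        first = next(i for i in range(n) if g[i] != s[i])
        # any() over bytes is true if some byte is non-zero.
        if not any(s[first:]):
            hits.append((start // page_size, first))
    return hits


def scan_files(good_path, suspect_path):
    # Reads both files fully into memory; fine for a sketch, not for
    # multi-gigabyte downloads.
    with open(good_path, "rb") as f_good, open(suspect_path, "rb") as f_bad:
        return zero_tail_pages(f_good.read(), f_bad.read())
```

Pages that differ in some other way (for example non-zero garbage) are
deliberately not reported, so an empty result does not mean the two files
are identical.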

ii  rtorrent       0.6.4-1        ncurses BitTorrent client based on LibTorren
ii  libtorrent9    0.10.4-1       a C++ BitTorrent library

.config:
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.20-rc2
# Mon Dec 25 14:00:03 2006
#
CONFIG_X86_32=y
CONFIG_GENERIC_TIME=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_X86=y
CONFIG_MMU=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_ON_SMP=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
# CONFIG_IPC_NS is not set
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_UTS_NS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SHMEM=y
CONFIG_SLAB=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
# CONFIG_SLOB is not set

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y

#
# Block layer
#
CONFIG_BLOCK=y
CONFIG_LBD=y
CONFIG_BLK_DEV_IO_TRACE=y
# CONFIG_LSF is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_DEFAULT_AS=y
# CONFIG_DEFAULT_DEADLINE is not set
# CONFIG_DEFAULT_CFQ is not set
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="anticipatory"

#
# Processor type and features
#
# CONFIG_SMP is not set
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_NUMAQ is not set
# CONFIG_X86_SUMMIT is not set
# CONFIG_X86_BIGSMP is not set
# CONFIG_X86_VISWS is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_ES7000 is not set
# CONFIG_PARAVIRT is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
CONFIG_MPENTIUMM=y
# CONFIG_MCORE2 is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP2 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_X86_GENERIC is not set
CONFIG_X86_CMPXCHG=y
CONFIG_X86_XADD=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_TSC=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_X86_UP_APIC=y
CONFIG_X86_UP_IOAPIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_NONFATAL=y
CONFIG_X86_MCE_P4THERMAL=y
CONFIG_VM86=y
# CONFIG_TOSHIBA is not set
# CONFIG_I8K is not set
# CONFIG_X86_REBOOTFIXUPS is not set
# CONFIG_MICROCODE is not set
# CONFIG_X86_MSR is not set
# CONFIG_X86_CPUID is not set

#
# Firmware Drivers
#
# CONFIG_EDD is not set
# CONFIG_DELL_RBU is not set
CONFIG_DCDBAS=m
CONFIG_NOHIGHMEM=y
# CONFIG_HIGHMEM4G is not set
# CONFIG_HIGHMEM64G is not set
CONFIG_PAGE_OFFSET=0xC0000000
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
CONFIG_SPARSEMEM_STATIC=y
CONFIG_SPLIT_PTLOCK_CPUS=4
# CONFIG_RESOURCES_64BIT is not set
# CONFIG_MATH_EMULATION is not set
CONFIG_MTRR=y
# CONFIG_EFI is not set
# CONFIG_SECCOMP is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
CONFIG_HZ_300=y
# CONFIG_HZ_1000 is not set
CONFIG_HZ=300
# CONFIG_KEXEC is not set
# CONFIG_RELOCATABLE is not set
CONFIG_PHYSICAL_ALIGN=0x100000
CONFIG_COMPAT_VDSO=y

#
# Power management options (ACPI, APM)
#
CONFIG_PM=y
# CONFIG_PM_LEGACY is not set
# CONFIG_PM_DEBUG is not set
# CONFIG_PM_SYSFS_DEPRECATED is not set
CONFIG_SOFTWARE_SUSPEND=y
CONFIG_PM_STD_PARTITION=""

#
# ACPI (Advanced Configuration and Power Interface) Support
#
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_SLEEP_PROC_FS=y
# CONFIG_ACPI_SLEEP_PROC_SLEEP is not set
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_VIDEO=y
CONFIG_ACPI_HOTKEY=m
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_THERMAL=y
# CONFIG_ACPI_ASUS is not set
CONFIG_ACPI_IBM=m
# CONFIG_ACPI_TOSHIBA is not set
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_EC=y
CONFIG_ACPI_POWER=y
CONFIG_ACPI_SYSTEM=y
CONFIG_X86_PM_TIMER=y
# CONFIG_ACPI_CONTAINER is not set
# CONFIG_ACPI_SBS is not set

#
# APM (Advanced Power Management) BIOS Support
#
# CONFIG_APM is not set

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
# CONFIG_CPU_FREQ_DEBUG is not set
CONFIG_CPU_FREQ_STAT=y
# CONFIG_CPU_FREQ_STAT_DETAILS is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y

#
# CPUFreq processor drivers
#
# CONFIG_X86_ACPI_CPUFREQ is not set
# CONFIG_X86_POWERNOW_K6 is not set
# CONFIG_X86_POWERNOW_K7 is not set
# CONFIG_X86_POWERNOW_K8 is not set
# CONFIG_X86_GX_SUSPMOD is not set
CONFIG_X86_SPEEDSTEP_CENTRINO=y
CONFIG_X86_SPEEDSTEP_CENTRINO_ACPI=y
CONFIG_X86_SPEEDSTEP_CENTRINO_TABLE=y
CONFIG_X86_SPEEDSTEP_ICH=y
CONFIG_X86_SPEEDSTEP_SMI=y
# CONFIG_X86_P4_CLOCKMOD is not set
# CONFIG_X86_CPUFREQ_NFORCE2 is not set
# CONFIG_X86_LONGRUN is not set
# CONFIG_X86_LONGHAUL is not set

#
# shared options
#
# CONFIG_X86_ACPI_CPUFREQ_PROC_INTF is not set
CONFIG_X86_SPEEDSTEP_LIB=y
# CONFIG_X86_SPEEDSTEP_RELAXED_CAP_CHECK is not set

#
# Bus options (PCI, PCMCIA, EISA, MCA, ISA)
#
CONFIG_PCI=y
# CONFIG_PCI_GOBIOS is not set
# CONFIG_PCI_GOMMCONFIG is not set
# CONFIG_PCI_GODIRECT is not set
CONFIG_PCI_GOANY=y
CONFIG_PCI_BIOS=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCIEPORTBUS=y
# CONFIG_HOTPLUG_PCI_PCIE is not set
CONFIG_PCIEAER=y
CONFIG_PCI_MSI=y
# CONFIG_PCI_MULTITHREAD_PROBE is not set
# CONFIG_PCI_DEBUG is not set
CONFIG_HT_IRQ=y
CONFIG_ISA_DMA_API=y
CONFIG_ISA=y
# CONFIG_EISA is not set
# CONFIG_MCA is not set
# CONFIG_SCx200 is not set

#
# PCCARD (PCMCIA/CardBus) support
#
CONFIG_PCCARD=y
# CONFIG_PCMCIA_DEBUG is not set
CONFIG_PCMCIA=y
CONFIG_PCMCIA_LOAD_CIS=y
CONFIG_PCMCIA_IOCTL=y
CONFIG_CARDBUS=y

#
# PC-card bridges
#
CONFIG_YENTA=y
CONFIG_YENTA_O2=y
CONFIG_YENTA_RICOH=y
CONFIG_YENTA_TI=y
CONFIG_YENTA_ENE_TUNE=y
CONFIG_YENTA_TOSHIBA=y
# CONFIG_PD6729 is not set
# CONFIG_I82092 is not set
# CONFIG_I82365 is not set
# CONFIG_TCIC is not set
CONFIG_PCMCIA_PROBE=y
CONFIG_PCCARD_NONSTATIC=y

#
# PCI Hotplug Support
#
CONFIG_HOTPLUG_PCI=y
# CONFIG_HOTPLUG_PCI_FAKE is not set
# CONFIG_HOTPLUG_PCI_COMPAQ is not set
CONFIG_HOTPLUG_PCI_IBM=y
CONFIG_HOTPLUG_PCI_ACPI=y
CONFIG_HOTPLUG_PCI_ACPI_IBM=y
# CONFIG_HOTPLUG_PCI_CPCI is not set
# CONFIG_HOTPLUG_PCI_SHPC is not set

#
# Executable file formats
#
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_AOUT=y
CONFIG_BINFMT_MISC=y

#
# Networking
#
CONFIG_NET=y

#
# Networking options
#
# CONFIG_NETDEBUG is not set
CONFIG_PACKET=y
CONFIG_PACKET_MMAP=y
CONFIG_UNIX=y
CONFIG_XFRM=y
# CONFIG_XFRM_USER is not set
# CONFIG_XFRM_SUB_POLICY is not set
# CONFIG_NET_KEY is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
# CONFIG_IP_ADVANCED_ROUTER is not set
CONFIG_IP_FIB_HASH=y
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
# CONFIG_IP_MROUTE is not set
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_TUNNEL is not set
# CONFIG_INET_TUNNEL is not set
CONFIG_INET_XFRM_MODE_TRANSPORT=y
CONFIG_INET_XFRM_MODE_TUNNEL=y
CONFIG_INET_XFRM_MODE_BEET=y
CONFIG_INET_DIAG=y
CONFIG_INET_TCP_DIAG=y
CONFIG_TCP_CONG_ADVANCED=y
CONFIG_TCP_CONG_BIC=y
CONFIG_TCP_CONG_CUBIC=y
CONFIG_TCP_CONG_WESTWOOD=y
# CONFIG_TCP_CONG_HTCP is not set
CONFIG_TCP_CONG_HSTCP=y
# CONFIG_TCP_CONG_HYBLA is not set
CONFIG_TCP_CONG_VEGAS=y
# CONFIG_TCP_CONG_SCALABLE is not set
# CONFIG_TCP_CONG_LP is not set
# CONFIG_TCP_CONG_VENO is not set
# CONFIG_DEFAULT_BIC is not set
CONFIG_DEFAULT_CUBIC=y
# CONFIG_DEFAULT_HTCP is not set
# CONFIG_DEFAULT_VEGAS is not set
# CONFIG_DEFAULT_WESTWOOD is not set
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set

#
# IP: Virtual Server Configuration
#
# CONFIG_IP_VS is not set
CONFIG_IPV6=y
# CONFIG_IPV6_PRIVACY is not set
CONFIG_IPV6_ROUTER_PREF=y
CONFIG_IPV6_ROUTE_INFO=y
# CONFIG_INET6_AH is not set
# CONFIG_INET6_ESP is not set
# CONFIG_INET6_IPCOMP is not set
# CONFIG_IPV6_MIP6 is not set
# CONFIG_INET6_XFRM_TUNNEL is not set
CONFIG_INET6_TUNNEL=y
CONFIG_INET6_XFRM_MODE_TRANSPORT=y
CONFIG_INET6_XFRM_MODE_TUNNEL=y
CONFIG_INET6_XFRM_MODE_BEET=y
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
CONFIG_IPV6_SIT=y
CONFIG_IPV6_TUNNEL=y
# CONFIG_IPV6_MULTIPLE_TABLES is not set
# CONFIG_NETWORK_SECMARK is not set
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set
CONFIG_BRIDGE_NETFILTER=y

#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_NETLINK=y
CONFIG_NETFILTER_NETLINK_QUEUE=y
CONFIG_NETFILTER_NETLINK_LOG=y
# CONFIG_NF_CONNTRACK_ENABLED is not set
CONFIG_NETFILTER_XTABLES=y
CONFIG_NETFILTER_XT_TARGET_CLASSIFY=y
# CONFIG_NETFILTER_XT_TARGET_DSCP is not set
CONFIG_NETFILTER_XT_TARGET_MARK=y
CONFIG_NETFILTER_XT_TARGET_NFQUEUE=y
# CONFIG_NETFILTER_XT_TARGET_NFLOG is not set
CONFIG_NETFILTER_XT_MATCH_COMMENT=y
# CONFIG_NETFILTER_XT_MATCH_DCCP is not set
# CONFIG_NETFILTER_XT_MATCH_DSCP is not set
# CONFIG_NETFILTER_XT_MATCH_ESP is not set
# CONFIG_NETFILTER_XT_MATCH_LENGTH is not set
CONFIG_NETFILTER_XT_MATCH_LIMIT=y
CONFIG_NETFILTER_XT_MATCH_MAC=y
CONFIG_NETFILTER_XT_MATCH_MARK=y
# CONFIG_NETFILTER_XT_MATCH_POLICY is not set
CONFIG_NETFILTER_XT_MATCH_MULTIPORT=y
CONFIG_NETFILTER_XT_MATCH_PHYSDEV=y
CONFIG_NETFILTER_XT_MATCH_PKTTYPE=y
# CONFIG_NETFILTER_XT_MATCH_QUOTA is not set
CONFIG_NETFILTER_XT_MATCH_REALM=y
# CONFIG_NETFILTER_XT_MATCH_SCTP is not set
# CONFIG_NETFILTER_XT_MATCH_STATISTIC is not set
# CONFIG_NETFILTER_XT_MATCH_STRING is not set
CONFIG_NETFILTER_XT_MATCH_TCPMSS=y
# CONFIG_NETFILTER_XT_MATCH_HASHLIMIT is not set

#
# IP: Netfilter Configuration
#
CONFIG_IP_NF_QUEUE=y
CONFIG_IP_NF_IPTABLES=y
CONFIG_IP_NF_MATCH_IPRANGE=y
CONFIG_IP_NF_MATCH_TOS=y
# CONFIG_IP_NF_MATCH_RECENT is not set
CONFIG_IP_NF_MATCH_ECN=y
CONFIG_IP_NF_MATCH_AH=y
# CONFIG_IP_NF_MATCH_TTL is not set
CONFIG_IP_NF_MATCH_OWNER=y
CONFIG_IP_NF_MATCH_ADDRTYPE=y
CONFIG_IP_NF_FILTER=y
CONFIG_IP_NF_TARGET_REJECT=y
CONFIG_IP_NF_TARGET_LOG=y
# CONFIG_IP_NF_TARGET_ULOG is not set
CONFIG_IP_NF_TARGET_TCPMSS=y
CONFIG_IP_NF_MANGLE=y
CONFIG_IP_NF_TARGET_TOS=y
CONFIG_IP_NF_TARGET_ECN=y
# CONFIG_IP_NF_TARGET_TTL is not set
# CONFIG_IP_NF_RAW is not set
# CONFIG_IP_NF_ARPTABLES is not set

#
# IPv6: Netfilter Configuration (EXPERIMENTAL)
#
CONFIG_IP6_NF_QUEUE=y
# CONFIG_IP6_NF_IPTABLES is not set

#
# Bridge: Netfilter Configuration
#
# CONFIG_BRIDGE_NF_EBTABLES is not set

#
# DCCP Configuration (EXPERIMENTAL)
#
# CONFIG_IP_DCCP is not set

#
# SCTP Configuration (EXPERIMENTAL)
#
# CONFIG_IP_SCTP is not set

#
# TIPC Configuration (EXPERIMENTAL)
#
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
CONFIG_BRIDGE=y
CONFIG_VLAN_8021Q=y
# CONFIG_DECNET is not set
CONFIG_LLC=y
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set

#
# QoS and/or fair queueing
#
CONFIG_NET_SCHED=y
CONFIG_NET_SCH_FIFO=y
# CONFIG_NET_SCH_CLK_JIFFIES is not set
# CONFIG_NET_SCH_CLK_GETTIMEOFDAY is not set
CONFIG_NET_SCH_CLK_CPU=y

#
# Queueing/Scheduling
#
CONFIG_NET_SCH_CBQ=y
CONFIG_NET_SCH_HTB=y
# CONFIG_NET_SCH_HFSC is not set
CONFIG_NET_SCH_PRIO=y
CONFIG_NET_SCH_RED=y
CONFIG_NET_SCH_SFQ=y
# CONFIG_NET_SCH_TEQL is not set
CONFIG_NET_SCH_TBF=y
CONFIG_NET_SCH_GRED=y
CONFIG_NET_SCH_DSMARK=y
CONFIG_NET_SCH_NETEM=y
CONFIG_NET_SCH_INGRESS=y

#
# Classification
#
CONFIG_NET_CLS=y
CONFIG_NET_CLS_BASIC=y
CONFIG_NET_CLS_TCINDEX=y
CONFIG_NET_CLS_ROUTE4=y
CONFIG_NET_CLS_ROUTE=y
# CONFIG_NET_CLS_FW is not set
CONFIG_NET_CLS_U32=y
# CONFIG_CLS_U32_PERF is not set
# CONFIG_CLS_U32_MARK is not set
# CONFIG_NET_CLS_RSVP is not set
# CONFIG_NET_CLS_RSVP6 is not set
# CONFIG_NET_EMATCH is not set
# CONFIG_NET_CLS_ACT is not set
# CONFIG_NET_CLS_POLICE is not set
# CONFIG_NET_CLS_IND is not set
# CONFIG_NET_ESTIMATOR is not set

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_HAMRADIO is not set
# CONFIG_IRDA is not set
CONFIG_BT=y
CONFIG_BT_L2CAP=y
CONFIG_BT_SCO=y
CONFIG_BT_RFCOMM=y
CONFIG_BT_RFCOMM_TTY=y
CONFIG_BT_BNEP=y
CONFIG_BT_BNEP_MC_FILTER=y
CONFIG_BT_BNEP_PROTO_FILTER=y
CONFIG_BT_HIDP=y

#
# Bluetooth device drivers
#
CONFIG_BT_HCIUSB=m
CONFIG_BT_HCIUSB_SCO=y
# CONFIG_BT_HCIUART is not set
# CONFIG_BT_HCIBCM203X is not set
# CONFIG_BT_HCIBPA10X is not set
# CONFIG_BT_HCIBFUSB is not set
# CONFIG_BT_HCIDTL1 is not set
# CONFIG_BT_HCIBT3C is not set
# CONFIG_BT_HCIBLUECARD is not set
# CONFIG_BT_HCIBTUART is not set
# CONFIG_BT_HCIVHCI is not set
CONFIG_IEEE80211=y
# CONFIG_IEEE80211_DEBUG is not set
CONFIG_IEEE80211_CRYPT_WEP=y
CONFIG_IEEE80211_CRYPT_CCMP=y
CONFIG_IEEE80211_CRYPT_TKIP=y
CONFIG_IEEE80211_SOFTMAC=y
# CONFIG_IEEE80211_SOFTMAC_DEBUG is not set
CONFIG_WIRELESS_EXT=y

#
# Device Drivers
#

#
# Generic Driver Options
#
# CONFIG_STANDALONE is not set
# CONFIG_PREVENT_FIRMWARE_BUILD is not set
CONFIG_FW_LOADER=y
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_SYS_HYPERVISOR is not set

#
# Connector - unified userspace <-> kernelspace linker
#
CONFIG_CONNECTOR=y
# CONFIG_PROC_EVENTS is not set

#
# Memory Technology Devices (MTD)
#
CONFIG_MTD=m
# CONFIG_MTD_DEBUG is not set
# CONFIG_MTD_CONCAT is not set
CONFIG_MTD_PARTITIONS=y
# CONFIG_MTD_REDBOOT_PARTS is not set

#
# User Modules And Translation Layers
#
CONFIG_MTD_CHAR=m
CONFIG_MTD_BLOCK=m
# CONFIG_MTD_BLOCK_RO is not set
CONFIG_FTL=m
CONFIG_NFTL=m
# CONFIG_NFTL_RW is not set
CONFIG_INFTL=m
CONFIG_RFD_FTL=m
# CONFIG_SSFDC is not set

#
# RAM/ROM/Flash chip drivers
#
CONFIG_MTD_CFI=m
CONFIG_MTD_JEDECPROBE=m
CONFIG_MTD_GEN_PROBE=m
# CONFIG_MTD_CFI_ADV_OPTIONS is not set
CONFIG_MTD_MAP_BANK_WIDTH_1=y
CONFIG_MTD_MAP_BANK_WIDTH_2=y
CONFIG_MTD_MAP_BANK_WIDTH_4=y
# CONFIG_MTD_MAP_BANK_WIDTH_8 is not set
# CONFIG_MTD_MAP_BANK_WIDTH_16 is not set
# CONFIG_MTD_MAP_BANK_WIDTH_32 is not set
CONFIG_MTD_CFI_I1=y
CONFIG_MTD_CFI_I2=y
# CONFIG_MTD_CFI_I4 is not set
# CONFIG_MTD_CFI_I8 is not set
CONFIG_MTD_CFI_INTELEXT=m
CONFIG_MTD_CFI_AMDSTD=m
CONFIG_MTD_CFI_STAA=m
CONFIG_MTD_CFI_UTIL=m
CONFIG_MTD_RAM=m
CONFIG_MTD_ROM=m
# CONFIG_MTD_ABSENT is not set
# CONFIG_MTD_OBSOLETE_CHIPS is not set

#
# Mapping drivers for chip access
#
CONFIG_MTD_COMPLEX_MAPPINGS=y
# CONFIG_MTD_PHYSMAP is not set
# CONFIG_MTD_PNC2000 is not set
# CONFIG_MTD_NETSC520 is not set
# CONFIG_MTD_TS5500 is not set
# CONFIG_MTD_SBC_GXX is not set
# CONFIG_MTD_AMD76XROM is not set
# CONFIG_MTD_ICHXROM is not set
# CONFIG_MTD_SCB2_FLASH is not set
# CONFIG_MTD_NETtel is not set
# CONFIG_MTD_L440GX is not set
# CONFIG_MTD_PCI is not set
# CONFIG_MTD_PLATRAM is not set

#
# Self-contained MTD device drivers
#
# CONFIG_MTD_PMC551 is not set
# CONFIG_MTD_SLRAM is not set
# CONFIG_MTD_PHRAM is not set
# CONFIG_MTD_MTDRAM is not set
CONFIG_MTD_BLOCK2MTD=m

#
# Disk-On-Chip Device Drivers
#
# CONFIG_MTD_DOC2000 is not set
# CONFIG_MTD_DOC2001 is not set
# CONFIG_MTD_DOC2001PLUS is not set

#
# NAND Flash Device Drivers
#
CONFIG_MTD_NAND=m
# CONFIG_MTD_NAND_VERIFY_WRITE is not set
# CONFIG_MTD_NAND_ECC_SMC is not set
CONFIG_MTD_NAND_IDS=m
# CONFIG_MTD_NAND_DISKONCHIP is not set
# CONFIG_MTD_NAND_CS553X is not set
# CONFIG_MTD_NAND_NANDSIM is not set

#
# OneNAND Flash Device Drivers
#
# CONFIG_MTD_ONENAND is not set

#
# Parallel port support
#
CONFIG_PARPORT=y
CONFIG_PARPORT_PC=y
# CONFIG_PARPORT_SERIAL is not set
CONFIG_PARPORT_PC_FIFO=y
# CONFIG_PARPORT_PC_SUPERIO is not set
# CONFIG_PARPORT_PC_PCMCIA is not set
# CONFIG_PARPORT_GSC is not set
# CONFIG_PARPORT_AX88796 is not set
# CONFIG_PARPORT_1284 is not set

#
# Plug and Play support
#
CONFIG_PNP=y
# CONFIG_PNP_DEBUG is not set

#
# Protocols
#
# CONFIG_ISAPNP is not set
# CONFIG_PNPBIOS is not set
CONFIG_PNPACPI=y

#
# Block devices
#
# CONFIG_BLK_DEV_FD is not set
# CONFIG_BLK_DEV_XD is not set
# CONFIG_PARIDE is not set
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=y
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
CONFIG_BLK_DEV_NBD=y
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_UB is not set
# CONFIG_BLK_DEV_RAM is not set
# CONFIG_BLK_DEV_INITRD is not set
# CONFIG_CDROM_PKTCDVD is not set
# CONFIG_ATA_OVER_ETH is not set

#
# Misc devices
#
# CONFIG_IBM_ASM is not set
# CONFIG_SGI_IOC4 is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_MSI_LAPTOP is not set

#
# ATA/ATAPI/MFM/RLL support
#
# CONFIG_IDE is not set

#
# SCSI device support
#
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=y
# CONFIG_SCSI_TGT is not set
# CONFIG_SCSI_NETLINK is not set
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=y
# CONFIG_BLK_DEV_SR_VENDOR is not set
CONFIG_CHR_DEV_SG=y
# CONFIG_CHR_DEV_SCH is not set

#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
# CONFIG_SCSI_LOGGING is not set
CONFIG_SCSI_SCAN_ASYNC=y

#
# SCSI Transports
#
# CONFIG_SCSI_SPI_ATTRS is not set
# CONFIG_SCSI_FC_ATTRS is not set
# CONFIG_SCSI_ISCSI_ATTRS is not set
# CONFIG_SCSI_SAS_ATTRS is not set
# CONFIG_SCSI_SAS_LIBSAS is not set

#
# SCSI low-level drivers
#
# CONFIG_ISCSI_TCP is not set
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_7000FASST is not set
# CONFIG_SCSI_ACARD is not set
# CONFIG_SCSI_AHA152X is not set
# CONFIG_SCSI_AHA1542 is not set
# CONFIG_SCSI_AACRAID is not set
# CONFIG_SCSI_AIC7XXX is not set
# CONFIG_SCSI_AIC7XXX_OLD is not set
# CONFIG_SCSI_AIC79XX is not set
# CONFIG_SCSI_AIC94XX is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
# CONFIG_SCSI_IN2000 is not set
# CONFIG_SCSI_ARCMSR is not set
# CONFIG_MEGARAID_NEWGEN is not set
# CONFIG_MEGARAID_LEGACY is not set
# CONFIG_MEGARAID_SAS is not set
# CONFIG_SCSI_HPTIOP is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_DTC3280 is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_GENERIC_NCR5380 is not set
# CONFIG_SCSI_GENERIC_NCR5380_MMIO is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_PPA is not set
# CONFIG_SCSI_IMM is not set
# CONFIG_SCSI_NCR53C406A is not set
# CONFIG_SCSI_STEX is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_PAS16 is not set
# CONFIG_SCSI_PSI240I is not set
# CONFIG_SCSI_QLOGIC_FAS is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
# CONFIG_SCSI_QLA_FC is not set
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_LPFC is not set
# CONFIG_SCSI_SYM53C416 is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_T128 is not set
# CONFIG_SCSI_U14_34F is not set
# CONFIG_SCSI_ULTRASTOR is not set
# CONFIG_SCSI_NSP32 is not set
# CONFIG_SCSI_DEBUG is not set
# CONFIG_SCSI_SRP is not set

#
# PCMCIA SCSI adapter support
#
# CONFIG_PCMCIA_AHA152X is not set
# CONFIG_PCMCIA_FDOMAIN is not set
# CONFIG_PCMCIA_NINJA_SCSI is not set
# CONFIG_PCMCIA_QLOGIC is not set
# CONFIG_PCMCIA_SYM53C500 is not set

#
# Serial ATA (prod) and Parallel ATA (experimental) drivers
#
CONFIG_ATA=y
CONFIG_SATA_AHCI=y
# CONFIG_SATA_SVW is not set
CONFIG_ATA_PIIX=y
# CONFIG_SATA_MV is not set
# CONFIG_SATA_NV is not set
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SX4 is not set
# CONFIG_SATA_SIL is not set
# CONFIG_SATA_SIL24 is not set
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set
# CONFIG_PATA_ALI is not set
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CS5520 is not set
# CONFIG_PATA_CS5530 is not set
# CONFIG_PATA_CS5535 is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_ATA_GENERIC is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_LEGACY is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_OLDPIIX is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PCMCIA is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_QDI is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RZ1000 is not set
# CONFIG_PATA_SC1200 is not set
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_SIL680 is not set
# CONFIG_PATA_SIS is not set
# CONFIG_PATA_VIA is not set
# CONFIG_PATA_WINBOND is not set
# CONFIG_PATA_WINBOND_VLB is not set

#
# Old CD-ROM drivers (not SCSI, not IDE)
#
# CONFIG_CD_NO_IDESCSI is not set

#
# Multi-device support (RAID and LVM)
#
CONFIG_MD=y
# CONFIG_BLK_DEV_MD is not set
CONFIG_BLK_DEV_DM=y
# CONFIG_DM_DEBUG is not set
CONFIG_DM_CRYPT=y
CONFIG_DM_SNAPSHOT=y
# CONFIG_DM_MIRROR is not set
# CONFIG_DM_ZERO is not set
# CONFIG_DM_MULTIPATH is not set

#
# Fusion MPT device support
#
# CONFIG_FUSION is not set
# CONFIG_FUSION_SPI is not set
# CONFIG_FUSION_FC is not set
# CONFIG_FUSION_SAS is not set

#
# IEEE 1394 (FireWire) support
#
CONFIG_IEEE1394=y

#
# Subsystem Options
#
# CONFIG_IEEE1394_VERBOSEDEBUG is not set
# CONFIG_IEEE1394_OUI_DB is not set
CONFIG_IEEE1394_EXTRA_CONFIG_ROMS=y
CONFIG_IEEE1394_CONFIG_ROM_IP1394=y
# CONFIG_IEEE1394_EXPORT_FULL_API is not set

#
# Device Drivers
#
# CONFIG_IEEE1394_PCILYNX is not set
CONFIG_IEEE1394_OHCI1394=m

#
# Protocol Drivers
#
# CONFIG_IEEE1394_VIDEO1394 is not set
CONFIG_IEEE1394_SBP2=y
# CONFIG_IEEE1394_SBP2_PHYS_DMA is not set
CONFIG_IEEE1394_ETH1394=y
# CONFIG_IEEE1394_DV1394 is not set
CONFIG_IEEE1394_RAWIO=y

#
# I2O device support
#
# CONFIG_I2O is not set

#
# Network device support
#
CONFIG_NETDEVICES=y
# CONFIG_DUMMY is not set
CONFIG_BONDING=y
# CONFIG_EQUALIZER is not set
CONFIG_TUN=y
# CONFIG_NET_SB1000 is not set

#
# ARCnet devices
#
# CONFIG_ARCNET is not set

#
# PHY device support
#
# CONFIG_PHYLIB is not set

#
# Ethernet (10 or 100Mbit)
#
CONFIG_NET_ETHERNET=y
CONFIG_MII=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
# CONFIG_NET_VENDOR_3COM is not set
# CONFIG_LANCE is not set
# CONFIG_NET_VENDOR_SMC is not set
# CONFIG_NET_VENDOR_RACAL is not set

#
# Tulip family network device support
#
# CONFIG_NET_TULIP is not set
# CONFIG_AT1700 is not set
# CONFIG_DEPCA is not set
# CONFIG_HP100 is not set
# CONFIG_NET_ISA is not set
CONFIG_NET_PCI=y
CONFIG_PCNET32=y
# CONFIG_PCNET32_NAPI is not set
CONFIG_AMD8111_ETH=y
CONFIG_AMD8111E_NAPI=y
# CONFIG_ADAPTEC_STARFIRE is not set
# CONFIG_AC3200 is not set
# CONFIG_APRICOT is not set
# CONFIG_B44 is not set
# CONFIG_FORCEDETH is not set
# CONFIG_CS89x0 is not set
# CONFIG_DGRS is not set
# CONFIG_EEPRO100 is not set
CONFIG_E100=y
# CONFIG_FEALNX is not set
# CONFIG_NATSEMI is not set
# CONFIG_NE2K_PCI is not set
# CONFIG_8139CP is not set
CONFIG_8139TOO=y
CONFIG_8139TOO_PIO=y
# CONFIG_8139TOO_TUNE_TWISTER is not set
# CONFIG_8139TOO_8129 is not set
# CONFIG_8139_OLD_RX_RESET is not set
# CONFIG_SIS900 is not set
# CONFIG_EPIC100 is not set
# CONFIG_SUNDANCE is not set
# CONFIG_TLAN is not set
# CONFIG_VIA_RHINE is not set
# CONFIG_NET_POCKET is not set

#
# Ethernet (1000 Mbit)
#
# CONFIG_ACENIC is not set
# CONFIG_DL2K is not set
# CONFIG_E1000 is not set
# CONFIG_NS83820 is not set
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
# CONFIG_R8169 is not set
# CONFIG_SIS190 is not set
# CONFIG_SKGE is not set
# CONFIG_SKY2 is not set
# CONFIG_SK98LIN is not set
# CONFIG_VIA_VELOCITY is not set
CONFIG_TIGON3=y
# CONFIG_BNX2 is not set
# CONFIG_QLA3XXX is not set

#
# Ethernet (10000 Mbit)
#
# CONFIG_CHELSIO_T1 is not set
# CONFIG_IXGB is not set
CONFIG_S2IO=m
# CONFIG_S2IO_NAPI is not set
# CONFIG_MYRI10GE is not set
# CONFIG_NETXEN_NIC is not set

#
# Token Ring devices
#
# CONFIG_TR is not set

#
# Wireless LAN (non-hamradio)
#
CONFIG_NET_RADIO=y
CONFIG_NET_WIRELESS_RTNETLINK=y

#
# Obsolete Wireless cards support (pre-802.11)
#
# CONFIG_STRIP is not set
# CONFIG_ARLAN is not set
# CONFIG_WAVELAN is not set
# CONFIG_PCMCIA_WAVELAN is not set
# CONFIG_PCMCIA_NETWAVE is not set

#
# Wireless 802.11 Frequency Hopping cards support
#
# CONFIG_PCMCIA_RAYCS is not set

#
# Wireless 802.11b ISA/PCI cards support
#
# CONFIG_IPW2100 is not set
CONFIG_IPW2200=m
CONFIG_IPW2200_MONITOR=y
CONFIG_IPW2200_RADIOTAP=y
CONFIG_IPW2200_PROMISCUOUS=y
CONFIG_IPW2200_QOS=y
# CONFIG_IPW2200_DEBUG is not set
# CONFIG_AIRO is not set
# CONFIG_HERMES is not set
# CONFIG_ATMEL is not set

#
# Wireless 802.11b Pcmcia/Cardbus cards support
#
# CONFIG_AIRO_CS is not set
# CONFIG_PCMCIA_WL3501 is not set

#
# Prism GT/Duette 802.11(a/b/g) PCI/Cardbus support
#
# CONFIG_PRISM54 is not set
# CONFIG_USB_ZD1201 is not set
CONFIG_HOSTAP=m
CONFIG_HOSTAP_FIRMWARE=y
CONFIG_HOSTAP_FIRMWARE_NVRAM=y
# CONFIG_HOSTAP_PLX is not set
# CONFIG_HOSTAP_PCI is not set
CONFIG_HOSTAP_CS=m
# CONFIG_BCM43XX is not set
CONFIG_ZD1211RW=m
# CONFIG_ZD1211RW_DEBUG is not set
CONFIG_NET_WIRELESS=y

#
# PCMCIA network device support
#
# CONFIG_NET_PCMCIA is not set

#
# Wan interfaces
#
# CONFIG_WAN is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_PLIP is not set
CONFIG_PPP=y
CONFIG_PPP_MULTILINK=y
CONFIG_PPP_FILTER=y
CONFIG_PPP_ASYNC=y
CONFIG_PPP_SYNC_TTY=y
CONFIG_PPP_DEFLATE=y
CONFIG_PPP_BSDCOMP=y
# CONFIG_PPP_MPPE is not set
CONFIG_PPPOE=y
# CONFIG_SLIP is not set
CONFIG_SLHC=y
# CONFIG_NET_FC is not set
# CONFIG_SHAPER is not set
CONFIG_NETCONSOLE=y
CONFIG_NETPOLL=y
# CONFIG_NETPOLL_RX is not set
# CONFIG_NETPOLL_TRAP is not set
CONFIG_NET_POLL_CONTROLLER=y

#
# ISDN subsystem
#
# CONFIG_ISDN is not set

#
# Telephony Support
#
# CONFIG_PHONE is not set

#
# Input device support
#
CONFIG_INPUT=y
# CONFIG_INPUT_FF_MEMLESS is not set

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
# CONFIG_INPUT_TSDEV is not set
# CONFIG_INPUT_EVDEV is not set
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_INPORT is not set
# CONFIG_MOUSE_LOGIBM is not set
# CONFIG_MOUSE_PC110PAD is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_PCSPKR is not set
# CONFIG_INPUT_WISTRON_BTNS is not set
CONFIG_INPUT_UINPUT=m

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
# CONFIG_SERIO_SERPORT is not set
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PARKBD is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_GAMEPORT is not set

#
# Character devices
#
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
# CONFIG_VT_HW_CONSOLE_BINDING is not set
# CONFIG_SERIAL_NONSTANDARD is not set

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
# CONFIG_SERIAL_8250_CONSOLE is not set
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_PNP=y
# CONFIG_SERIAL_8250_CS is not set
CONFIG_SERIAL_8250_NR_UARTS=4
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
# CONFIG_SERIAL_8250_EXTENDED is not set

#
# Non-8250 serial port support
#
CONFIG_SERIAL_CORE=y
# CONFIG_SERIAL_JSM is not set
CONFIG_UNIX98_PTYS=y
# CONFIG_LEGACY_PTYS is not set
CONFIG_PRINTER=m
# CONFIG_LP_CONSOLE is not set
CONFIG_PPDEV=m
# CONFIG_TIPAR is not set

#
# IPMI
#
# CONFIG_IPMI_HANDLER is not set

#
# Watchdog Cards
#
# CONFIG_WATCHDOG is not set
CONFIG_HW_RANDOM=y
CONFIG_HW_RANDOM_INTEL=y
CONFIG_HW_RANDOM_AMD=y
CONFIG_HW_RANDOM_GEODE=y
CONFIG_HW_RANDOM_VIA=y
# CONFIG_NVRAM is not set
CONFIG_RTC=y
# CONFIG_DTLK is not set
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set
# CONFIG_SONYPI is not set
CONFIG_AGP=y
# CONFIG_AGP_ALI is not set
# CONFIG_AGP_ATI is not set
# CONFIG_AGP_AMD is not set
# CONFIG_AGP_AMD64 is not set
CONFIG_AGP_INTEL=y
# CONFIG_AGP_NVIDIA is not set
# CONFIG_AGP_SIS is not set
# CONFIG_AGP_SWORKS is not set
# CONFIG_AGP_VIA is not set
# CONFIG_AGP_EFFICEON is not set
CONFIG_DRM=y
# CONFIG_DRM_TDFX is not set
# CONFIG_DRM_R128 is not set
CONFIG_DRM_RADEON=m
# CONFIG_DRM_I810 is not set
# CONFIG_DRM_I830 is not set
# CONFIG_DRM_I915 is not set
# CONFIG_DRM_MGA is not set
# CONFIG_DRM_SIS is not set
# CONFIG_DRM_VIA is not set
# CONFIG_DRM_SAVAGE is not set

#
# PCMCIA character devices
#
# CONFIG_SYNCLINK_CS is not set
# CONFIG_CARDMAN_4000 is not set
# CONFIG_CARDMAN_4040 is not set
# CONFIG_MWAVE is not set
# CONFIG_PC8736x_GPIO is not set
# CONFIG_NSC_GPIO is not set
# CONFIG_CS5535_GPIO is not set
# CONFIG_RAW_DRIVER is not set
CONFIG_HPET=y
# CONFIG_HPET_RTC_IRQ is not set
CONFIG_HPET_MMAP=y
# CONFIG_HANGCHECK_TIMER is not set

#
# TPM devices
#
CONFIG_TCG_TPM=y
CONFIG_TCG_TIS=y
CONFIG_TCG_NSC=y
CONFIG_TCG_ATMEL=y
CONFIG_TCG_INFINEON=y
# CONFIG_TELCLOCK is not set

#
# I2C support
#
CONFIG_I2C=y
CONFIG_I2C_CHARDEV=y

#
# I2C Algorithms
#
CONFIG_I2C_ALGOBIT=y
# CONFIG_I2C_ALGOPCF is not set
# CONFIG_I2C_ALGOPCA is not set

#
# I2C Hardware Bus support
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
# CONFIG_I2C_ELEKTOR is not set
CONFIG_I2C_I801=y
CONFIG_I2C_I810=y
# CONFIG_I2C_PIIX4 is not set
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_PARPORT is not set
# CONFIG_I2C_PARPORT_LIGHT is not set
# CONFIG_I2C_PROSAVAGE is not set
# CONFIG_I2C_SAVAGE4 is not set
# CONFIG_SCx200_ACB is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_STUB is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set
# CONFIG_I2C_VOODOO3 is not set
# CONFIG_I2C_PCA_ISA is not set

#
# Miscellaneous I2C Chip support
#
# CONFIG_SENSORS_DS1337 is not set
# CONFIG_SENSORS_DS1374 is not set
CONFIG_SENSORS_EEPROM=m
# CONFIG_SENSORS_PCF8574 is not set
# CONFIG_SENSORS_PCA9539 is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_SENSORS_MAX6875 is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_I2C_DEBUG_CHIP is not set

#
# SPI support
#
# CONFIG_SPI is not set
# CONFIG_SPI_MASTER is not set

#
# Dallas's 1-wire bus
#
# CONFIG_W1 is not set

#
# Hardware Monitoring support
#
CONFIG_HWMON=y
# CONFIG_HWMON_VID is not set
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_K8TEMP is not set
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_F71805F is not set
# CONFIG_SENSORS_FSCHER is not set
# CONFIG_SENSORS_FSCPOS is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
# CONFIG_SENSORS_IT87 is not set
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_SIS5595 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47M192 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_VIA686A is not set
# CONFIG_SENSORS_VT1211 is not set
# CONFIG_SENSORS_VT8231 is not set
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83793 is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_W83627EHF is not set
CONFIG_SENSORS_HDAPS=m
# CONFIG_HWMON_DEBUG_CHIP is not set

#
# Multimedia devices
#
# CONFIG_VIDEO_DEV is not set

#
# Digital Video Broadcasting Devices
#
# CONFIG_DVB is not set
# CONFIG_USB_DABUSB is not set

#
# Graphics support
#
CONFIG_FIRMWARE_EDID=y
CONFIG_FB=m
CONFIG_FB_DDC=m
CONFIG_FB_CFB_FILLRECT=m
CONFIG_FB_CFB_COPYAREA=m
CONFIG_FB_CFB_IMAGEBLIT=m
# CONFIG_FB_MACMODES is not set
# CONFIG_FB_BACKLIGHT is not set
CONFIG_FB_MODE_HELPERS=y
# CONFIG_FB_TILEBLITTING is not set
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_VGA16 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
# CONFIG_FB_RIVA is not set
# CONFIG_FB_I810 is not set
# CONFIG_FB_INTEL is not set
# CONFIG_FB_MATROX is not set
CONFIG_FB_RADEON=m
CONFIG_FB_RADEON_I2C=y
CONFIG_FB_RADEON_DEBUG=y
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_CYBLA is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_GEODE is not set
# CONFIG_FB_VIRTUAL is not set

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64
CONFIG_VIDEO_SELECT=y
# CONFIG_MDA_CONSOLE is not set
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=m
# CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set
CONFIG_FONTS=y
# CONFIG_FONT_8x8 is not set
# CONFIG_FONT_8x16 is not set
# CONFIG_FONT_6x11 is not set
# CONFIG_FONT_7x14 is not set
# CONFIG_FONT_PEARL_8x8 is not set
# CONFIG_FONT_ACORN_8x8 is not set
# CONFIG_FONT_MINI_4x6 is not set
# CONFIG_FONT_SUN8x16 is not set
CONFIG_FONT_SUN12x22=y
# CONFIG_FONT_10x18 is not set

#
# Logo configuration
#
# CONFIG_LOGO is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
CONFIG_BACKLIGHT_CLASS_DEVICE=m
CONFIG_BACKLIGHT_DEVICE=y
CONFIG_LCD_CLASS_DEVICE=m
CONFIG_LCD_DEVICE=y

#
# Sound
#
CONFIG_SOUND=y

#
# Advanced Linux Sound Architecture
#
CONFIG_SND=y
CONFIG_SND_TIMER=y
CONFIG_SND_PCM=y
# CONFIG_SND_SEQUENCER is not set
CONFIG_SND_OSSEMUL=y
CONFIG_SND_MIXER_OSS=y
CONFIG_SND_PCM_OSS=y
# CONFIG_SND_PCM_OSS_PLUGINS is not set
CONFIG_SND_RTCTIMER=y
# CONFIG_SND_DYNAMIC_MINORS is not set
# CONFIG_SND_SUPPORT_OLD_API is not set
CONFIG_SND_VERBOSE_PROCFS=y
# CONFIG_SND_VERBOSE_PRINTK is not set
# CONFIG_SND_DEBUG is not set

#
# Generic devices
#
CONFIG_SND_AC97_CODEC=y
# CONFIG_SND_DUMMY is not set
# CONFIG_SND_MTPAV is not set
# CONFIG_SND_MTS64 is not set
# CONFIG_SND_SERIAL_U16550 is not set
# CONFIG_SND_MPU401 is not set

#
# ISA devices
#
# CONFIG_SND_ADLIB is not set
# CONFIG_SND_AD1816A is not set
# CONFIG_SND_AD1848 is not set
# CONFIG_SND_ALS100 is not set
# CONFIG_SND_AZT2320 is not set
# CONFIG_SND_CMI8330 is not set
# CONFIG_SND_CS4231 is not set
# CONFIG_SND_CS4232 is not set
# CONFIG_SND_CS4236 is not set
# CONFIG_SND_DT019X is not set
# CONFIG_SND_ES968 is not set
# CONFIG_SND_ES1688 is not set
# CONFIG_SND_ES18XX is not set
# CONFIG_SND_GUSCLASSIC is not set
# CONFIG_SND_GUSEXTREME is not set
# CONFIG_SND_GUSMAX is not set
# CONFIG_SND_INTERWAVE is not set
# CONFIG_SND_INTERWAVE_STB is not set
# CONFIG_SND_OPL3SA2 is not set
# CONFIG_SND_OPTI92X_AD1848 is not set
# CONFIG_SND_OPTI92X_CS4231 is not set
# CONFIG_SND_OPTI93X is not set
# CONFIG_SND_MIRO is not set
# CONFIG_SND_SB8 is not set
# CONFIG_SND_SB16 is not set
# CONFIG_SND_SBAWE is not set
# CONFIG_SND_SGALAXY is not set
# CONFIG_SND_SSCAPE is not set
# CONFIG_SND_WAVEFRONT is not set

#
# PCI devices
#
# CONFIG_SND_AD1889 is not set
# CONFIG_SND_ALS300 is not set
# CONFIG_SND_ALS4000 is not set
# CONFIG_SND_ALI5451 is not set
# CONFIG_SND_ATIIXP is not set
# CONFIG_SND_ATIIXP_MODEM is not set
# CONFIG_SND_AU8810 is not set
# CONFIG_SND_AU8820 is not set
# CONFIG_SND_AU8830 is not set
# CONFIG_SND_AZT3328 is not set
# CONFIG_SND_BT87X is not set
# CONFIG_SND_CA0106 is not set
# CONFIG_SND_CMIPCI is not set
# CONFIG_SND_CS4281 is not set
# CONFIG_SND_CS46XX is not set
# CONFIG_SND_CS5535AUDIO is not set
# CONFIG_SND_DARLA20 is not set
# CONFIG_SND_GINA20 is not set
# CONFIG_SND_LAYLA20 is not set
# CONFIG_SND_DARLA24 is not set
# CONFIG_SND_GINA24 is not set
# CONFIG_SND_LAYLA24 is not set
# CONFIG_SND_MONA is not set
# CONFIG_SND_MIA is not set
# CONFIG_SND_ECHO3G is not set
# CONFIG_SND_INDIGO is not set
# CONFIG_SND_INDIGOIO is not set
# CONFIG_SND_INDIGODJ is not set
# CONFIG_SND_EMU10K1 is not set
# CONFIG_SND_EMU10K1X is not set
# CONFIG_SND_ENS1370 is not set
# CONFIG_SND_ENS1371 is not set
# CONFIG_SND_ES1938 is not set
# CONFIG_SND_ES1968 is not set
# CONFIG_SND_FM801 is not set
CONFIG_SND_HDA_INTEL=y
# CONFIG_SND_HDSP is not set
# CONFIG_SND_HDSPM is not set
# CONFIG_SND_ICE1712 is not set
# CONFIG_SND_ICE1724 is not set
CONFIG_SND_INTEL8X0=y
# CONFIG_SND_INTEL8X0M is not set
# CONFIG_SND_KORG1212 is not set
# CONFIG_SND_MAESTRO3 is not set
# CONFIG_SND_MIXART is not set
# CONFIG_SND_NM256 is not set
# CONFIG_SND_PCXHR is not set
# CONFIG_SND_RIPTIDE is not set
# CONFIG_SND_RME32 is not set
# CONFIG_SND_RME96 is not set
# CONFIG_SND_RME9652 is not set
# CONFIG_SND_SONICVIBES is not set
# CONFIG_SND_TRIDENT is not set
# CONFIG_SND_VIA82XX is not set
# CONFIG_SND_VIA82XX_MODEM is not set
# CONFIG_SND_VX222 is not set
# CONFIG_SND_YMFPCI is not set
CONFIG_SND_AC97_POWER_SAVE=y

#
# USB devices
#
# CONFIG_SND_USB_AUDIO is not set
# CONFIG_SND_USB_USX2Y is not set

#
# PCMCIA devices
#
# CONFIG_SND_VXPOCKET is not set
# CONFIG_SND_PDAUDIOCF is not set

#
# Open Sound System
#
# CONFIG_SOUND_PRIME is not set
CONFIG_AC97_BUS=y

#
# HID Devices
#
CONFIG_HID=y

#
# USB support
#
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=y
# CONFIG_USB_DEBUG is not set

#
# Miscellaneous USB options
#
CONFIG_USB_DEVICEFS=y
# CONFIG_USB_BANDWIDTH is not set
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_SUSPEND is not set
CONFIG_USB_MULTITHREAD_PROBE=y
# CONFIG_USB_OTG is not set

#
# USB Host Controller Drivers
#
CONFIG_USB_EHCI_HCD=m
# CONFIG_USB_EHCI_SPLIT_ISO is not set
# CONFIG_USB_EHCI_ROOT_HUB_TT is not set
# CONFIG_USB_EHCI_TT_NEWSCHED is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_OHCI_HCD is not set
CONFIG_USB_UHCI_HCD=m
# CONFIG_USB_SL811_HCD is not set

#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
CONFIG_USB_PRINTER=y

#
# NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support'
#

#
# may also be needed; see USB_STORAGE Help for more information
#
CONFIG_USB_STORAGE=m
# CONFIG_USB_STORAGE_DEBUG is not set
# CONFIG_USB_STORAGE_DATAFAB is not set
# CONFIG_USB_STORAGE_FREECOM is not set
# CONFIG_USB_STORAGE_DPCM is not set
# CONFIG_USB_STORAGE_USBAT is not set
# CONFIG_USB_STORAGE_SDDR09 is not set
# CONFIG_USB_STORAGE_SDDR55 is not set
# CONFIG_USB_STORAGE_JUMPSHOT is not set
# CONFIG_USB_STORAGE_ALAUDA is not set
# CONFIG_USB_STORAGE_KARMA is not set
CONFIG_USB_LIBUSUAL=y

#
# USB Input Devices
#
CONFIG_USB_HID=y
# CONFIG_USB_HID_POWERBOOK is not set
# CONFIG_HID_FF is not set
# CONFIG_USB_HIDDEV is not set
# CONFIG_USB_AIPTEK is not set
# CONFIG_USB_WACOM is not set
# CONFIG_USB_ACECAD is not set
# CONFIG_USB_KBTAB is not set
# CONFIG_USB_POWERMATE is not set
# CONFIG_USB_TOUCHSCREEN is not set
# CONFIG_USB_YEALINK is not set
# CONFIG_USB_XPAD is not set
# CONFIG_USB_ATI_REMOTE is not set
# CONFIG_USB_ATI_REMOTE2 is not set
# CONFIG_USB_KEYSPAN_REMOTE is not set
# CONFIG_USB_APPLETOUCH is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set

#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
CONFIG_USB_USBNET_MII=y
CONFIG_USB_USBNET=y
CONFIG_USB_NET_AX8817X=y
CONFIG_USB_NET_CDCETHER=m
# CONFIG_USB_NET_GL620A is not set
CONFIG_USB_NET_NET1080=m
# CONFIG_USB_NET_PLUSB is not set
# CONFIG_USB_NET_MCS7830 is not set
CONFIG_USB_NET_RNDIS_HOST=m
CONFIG_USB_NET_CDC_SUBSET=m
# CONFIG_USB_ALI_M5632 is not set
# CONFIG_USB_AN2720 is not set
CONFIG_USB_BELKIN=y
CONFIG_USB_ARMLINUX=y
# CONFIG_USB_EPSON2888 is not set
CONFIG_USB_NET_ZAURUS=m
CONFIG_USB_MON=y

#
# USB port drivers
#
# CONFIG_USB_USS720 is not set

#
# USB Serial Converter support
#
CONFIG_USB_SERIAL=y
# CONFIG_USB_SERIAL_CONSOLE is not set
CONFIG_USB_SERIAL_GENERIC=y
# CONFIG_USB_SERIAL_AIRCABLE is not set
# CONFIG_USB_SERIAL_AIRPRIME is not set
# CONFIG_USB_SERIAL_ARK3116 is not set
# CONFIG_USB_SERIAL_BELKIN is not set
# CONFIG_USB_SERIAL_WHITEHEAT is not set
# CONFIG_USB_SERIAL_DIGI_ACCELEPORT is not set
# CONFIG_USB_SERIAL_CP2101 is not set
# CONFIG_USB_SERIAL_CYPRESS_M8 is not set
# CONFIG_USB_SERIAL_EMPEG is not set
# CONFIG_USB_SERIAL_FTDI_SIO is not set
# CONFIG_USB_SERIAL_FUNSOFT is not set
# CONFIG_USB_SERIAL_VISOR is not set
# CONFIG_USB_SERIAL_IPAQ is not set
# CONFIG_USB_SERIAL_IR is not set
# CONFIG_USB_SERIAL_EDGEPORT is not set
# CONFIG_USB_SERIAL_EDGEPORT_TI is not set
# CONFIG_USB_SERIAL_GARMIN is not set
# CONFIG_USB_SERIAL_IPW is not set
# CONFIG_USB_SERIAL_KEYSPAN_PDA is not set
# CONFIG_USB_SERIAL_KEYSPAN is not set
# CONFIG_USB_SERIAL_KLSI is not set
# CONFIG_USB_SERIAL_KOBIL_SCT is not set
# CONFIG_USB_SERIAL_MCT_U232 is not set
# CONFIG_USB_SERIAL_MOS7720 is not set
# CONFIG_USB_SERIAL_MOS7840 is not set
# CONFIG_USB_SERIAL_NAVMAN is not set
CONFIG_USB_SERIAL_PL2303=y
CONFIG_USB_SERIAL_HP4X=y
# CONFIG_USB_SERIAL_SAFE is not set
# CONFIG_USB_SERIAL_SIERRAWIRELESS is not set
# CONFIG_USB_SERIAL_TI is not set
# CONFIG_USB_SERIAL_CYBERJACK is not set
# CONFIG_USB_SERIAL_XIRCOM is not set
# CONFIG_USB_SERIAL_OPTION is not set
# CONFIG_USB_SERIAL_OMNINET is not set
# CONFIG_USB_SERIAL_DEBUG is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_AUERSWALD is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_LED is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_PHIDGET is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_TEST is not set

#
# USB DSL modem support
#

#
# USB Gadget Support
#
# CONFIG_USB_GADGET is not set

#
# MMC/SD Card support
#
# CONFIG_MMC is not set

#
# LED devices
#
# CONFIG_NEW_LEDS is not set

#
# LED drivers
#

#
# LED Triggers
#

#
# InfiniBand support
#
# CONFIG_INFINIBAND is not set

#
# EDAC - error detection and reporting (RAS) (EXPERIMENTAL)
#
CONFIG_EDAC=y

#
# Reporting subsystems
#
# CONFIG_EDAC_DEBUG is not set
CONFIG_EDAC_MM_EDAC=y
# CONFIG_EDAC_AMD76X is not set
# CONFIG_EDAC_E7XXX is not set
# CONFIG_EDAC_E752X is not set
# CONFIG_EDAC_I82875P is not set
# CONFIG_EDAC_I82860 is not set
# CONFIG_EDAC_R82600 is not set
CONFIG_EDAC_POLL=y

#
# Real Time Clock
#
# CONFIG_RTC_CLASS is not set

#
# DMA Engine support
#
# CONFIG_DMA_ENGINE is not set

#
# DMA Clients
#

#
# DMA Devices
#

#
# Virtualization
#
# CONFIG_KVM is not set

#
# File systems
#
CONFIG_EXT2_FS=y
CONFIG_EXT2_FS_XATTR=y
CONFIG_EXT2_FS_POSIX_ACL=y
CONFIG_EXT2_FS_SECURITY=y
# CONFIG_EXT2_FS_XIP is not set
CONFIG_EXT3_FS=y
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
# CONFIG_EXT4DEV_FS is not set
CONFIG_JBD=y
# CONFIG_JBD_DEBUG is not set
CONFIG_FS_MBCACHE=y
CONFIG_REISERFS_FS=y
# CONFIG_REISERFS_CHECK is not set
# CONFIG_REISERFS_PROC_INFO is not set
# CONFIG_REISERFS_FS_XATTR is not set
# CONFIG_JFS_FS is not set
CONFIG_FS_POSIX_ACL=y
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
CONFIG_MINIX_FS=y
# CONFIG_ROMFS_FS is not set
CONFIG_INOTIFY=y
CONFIG_INOTIFY_USER=y
# CONFIG_QUOTA is not set
CONFIG_DNOTIFY=y
# CONFIG_AUTOFS_FS is not set
CONFIG_AUTOFS4_FS=y
# CONFIG_FUSE_FS is not set

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
# CONFIG_ZISOFS is not set
CONFIG_UDF_FS=y
CONFIG_UDF_NLS=y

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=y
CONFIG_MSDOS_FS=y
CONFIG_VFAT_FS=y
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
CONFIG_NTFS_FS=y
# CONFIG_NTFS_DEBUG is not set
CONFIG_NTFS_RW=y

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
# CONFIG_TMPFS_POSIX_ACL is not set
# CONFIG_HUGETLBFS is not set
# CONFIG_HUGETLB_PAGE is not set
CONFIG_RAMFS=y
CONFIG_CONFIGFS_FS=y

#
# Miscellaneous filesystems
#
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
CONFIG_JFFS2_FS=m
CONFIG_JFFS2_FS_DEBUG=0
CONFIG_JFFS2_FS_WRITEBUFFER=y
# CONFIG_JFFS2_SUMMARY is not set
# CONFIG_JFFS2_FS_XATTR is not set
CONFIG_JFFS2_COMPRESSION_OPTIONS=y
CONFIG_JFFS2_ZLIB=y
CONFIG_JFFS2_RTIME=y
# CONFIG_JFFS2_RUBIN is not set
# CONFIG_JFFS2_CMODE_NONE is not set
CONFIG_JFFS2_CMODE_PRIORITY=y
# CONFIG_JFFS2_CMODE_SIZE is not set
CONFIG_CRAMFS=m
# CONFIG_VXFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set

#
# Network File Systems
#
CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
# CONFIG_NFS_V4 is not set
CONFIG_NFS_DIRECTIO=y
CONFIG_NFSD=y
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
# CONFIG_NFSD_V4 is not set
CONFIG_NFSD_TCP=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_EXPORTFS=y
CONFIG_NFS_ACL_SUPPORT=y
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=y
# CONFIG_RPCSEC_GSS_KRB5 is not set
# CONFIG_RPCSEC_GSS_SPKM3 is not set
CONFIG_SMB_FS=y
# CONFIG_SMB_NLS_DEFAULT is not set
CONFIG_CIFS=y
# CONFIG_CIFS_STATS is not set
# CONFIG_CIFS_WEAK_PW_HASH is not set
# CONFIG_CIFS_XATTR is not set
# CONFIG_CIFS_DEBUG2 is not set
# CONFIG_CIFS_EXPERIMENTAL is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
# CONFIG_9P_FS is not set

#
# Partition Types
#
# CONFIG_PARTITION_ADVANCED is not set
CONFIG_MSDOS_PARTITION=y

#
# Native Language Support
#
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="iso8859-1"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
CONFIG_NLS_CODEPAGE_850=y
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
CONFIG_NLS_CODEPAGE_932=y
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
# CONFIG_NLS_ASCII is not set
CONFIG_NLS_ISO8859_1=y
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
CONFIG_NLS_ISO8859_15=y
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
CONFIG_NLS_UTF8=y

#
# Distributed Lock Manager
#
# CONFIG_DLM is not set

#
# Instrumentation Support
#
# CONFIG_PROFILING is not set
# CONFIG_KPROBES is not set

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_PRINTK_TIME=y
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_MAGIC_SYSRQ=y
# CONFIG_UNUSED_SYMBOLS is not set
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
CONFIG_DEBUG_KERNEL=y
CONFIG_LOG_BUF_SHIFT=15
CONFIG_DETECT_SOFTLOCKUP=y
# CONFIG_SCHEDSTATS is not set
# CONFIG_DEBUG_SLAB is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_RWSEMS is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_PROVE_LOCKING is not set
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_INFO is not set
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_LIST is not set
# CONFIG_FRAME_POINTER is not set
CONFIG_FORCED_INLINING=y
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_EARLY_PRINTK=y
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set

#
# Page alloc debug is incompatible with Software Suspend on i386
#
# CONFIG_DEBUG_RODATA is not set
CONFIG_4KSTACKS=y
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_MPPARSE=y
CONFIG_DOUBLEFAULT=y

#
# Security options
#
# CONFIG_KEYS is not set
# CONFIG_SECURITY is not set

#
# Cryptographic options
#
CONFIG_CRYPTO=y
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_MANAGER=y
# CONFIG_CRYPTO_HMAC is not set
# CONFIG_CRYPTO_XCBC is not set
# CONFIG_CRYPTO_NULL is not set
# CONFIG_CRYPTO_MD4 is not set
# CONFIG_CRYPTO_MD5 is not set
# CONFIG_CRYPTO_SHA1 is not set
# CONFIG_CRYPTO_SHA256 is not set
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_WP512 is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_GF128MUL is not set
CONFIG_CRYPTO_ECB=y
CONFIG_CRYPTO_CBC=y
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_DES is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_586 is not set
# CONFIG_CRYPTO_SERPENT is not set
CONFIG_CRYPTO_AES=y
# CONFIG_CRYPTO_AES_586 is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST6 is not set
# CONFIG_CRYPTO_TEA is not set
CONFIG_CRYPTO_ARC4=y
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_ANUBIS is not set
# CONFIG_CRYPTO_DEFLATE is not set
CONFIG_CRYPTO_MICHAEL_MIC=y
# CONFIG_CRYPTO_CRC32C is not set
# CONFIG_CRYPTO_TEST is not set

#
# Hardware crypto devices
#
# CONFIG_CRYPTO_DEV_PADLOCK is not set
CONFIG_CRYPTO_DEV_GEODE=m

#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_CRC_CCITT=y
# CONFIG_CRC16 is not set
CONFIG_CRC32=y
# CONFIG_LIBCRC32C is not set
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_PLIST=y
CONFIG_IOMAP_COPY=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_KTIME_SCALAR=y


dmesg:
[    0.000000] Linux version 2.6.20-rc2 (ranma@navi) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #26 Mon Dec 25 14:00:08 CET 2006
[    0.000000] BIOS-provided physical RAM map:
[    0.000000] sanitize start
[    0.000000] sanitize end
[    0.000000] copy_e820_map() start: 0000000000000000 size: 000000000009f000 end: 000000000009f000 type: 1
[    0.000000] copy_e820_map() type is E820_RAM
[    0.000000] copy_e820_map() start: 000000000009f000 size: 0000000000001000 end: 00000000000a0000 type: 2
[    0.000000] copy_e820_map() start: 00000000000dc000 size: 0000000000024000 end: 0000000000100000 type: 2
[    0.000000] copy_e820_map() start: 0000000000100000 size: 000000001fde0000 end: 000000001fee0000 type: 1
[    0.000000] copy_e820_map() type is E820_RAM
[    0.000000] copy_e820_map() start: 000000001fee0000 size: 0000000000015000 end: 000000001fef5000 type: 3
[    0.000000] copy_e820_map() start: 000000001fef5000 size: 000000000000b000 end: 000000001ff00000 type: 4
[    0.000000] copy_e820_map() start: 000000001ff00000 size: 0000000000100000 end: 0000000020000000 type: 2
[    0.000000] copy_e820_map() start: 00000000e0000000 size: 0000000010000000 end: 00000000f0000000 type: 2
[    0.000000] copy_e820_map() start: 00000000f0008000 size: 0000000000004000 end: 00000000f000c000 type: 2
[    0.000000] copy_e820_map() start: 00000000fec00000 size: 0000000000010000 end: 00000000fec10000 type: 2
[    0.000000] copy_e820_map() start: 00000000fed14000 size: 0000000000006000 end: 00000000fed1a000 type: 2
[    0.000000] copy_e820_map() start: 00000000fed20000 size: 0000000000070000 end: 00000000fed90000 type: 2
[    0.000000] copy_e820_map() start: 00000000fee00000 size: 0000000000001000 end: 00000000fee01000 type: 2
[    0.000000] copy_e820_map() start: 00000000ff000000 size: 0000000001000000 end: 0000000100000000 type: 2
[    0.000000]  BIOS-e820: 0000000000000000 - 000000000009f000 (usable)
[    0.000000]  BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)
[    0.000000]  BIOS-e820: 00000000000dc000 - 0000000000100000 (reserved)
[    0.000000]  BIOS-e820: 0000000000100000 - 000000001fee0000 (usable)
[    0.000000]  BIOS-e820: 000000001fee0000 - 000000001fef5000 (ACPI data)
[    0.000000]  BIOS-e820: 000000001fef5000 - 000000001ff00000 (ACPI NVS)
[    0.000000]  BIOS-e820: 000000001ff00000 - 0000000020000000 (reserved)
[    0.000000]  BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
[    0.000000]  BIOS-e820: 00000000f0008000 - 00000000f000c000 (reserved)
[    0.000000]  BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
[    0.000000]  BIOS-e820: 00000000fed14000 - 00000000fed1a000 (reserved)
[    0.000000]  BIOS-e820: 00000000fed20000 - 00000000fed90000 (reserved)
[    0.000000]  BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
[    0.000000]  BIOS-e820: 00000000ff000000 - 0000000100000000 (reserved)
[    0.000000] 510MB LOWMEM available.
[    0.000000] Entering add_active_range(0, 0, 130784) 0 entries of 256 used
[    0.000000] Zone PFN ranges:
[    0.000000]   DMA             0 ->     4096
[    0.000000]   Normal       4096 ->   130784
[    0.000000] early_node_map[1] active PFN ranges
[    0.000000]     0:        0 ->   130784
[    0.000000] On node 0 totalpages: 130784
[    0.000000]   DMA zone: 32 pages used for memmap
[    0.000000]   DMA zone: 0 pages reserved
[    0.000000]   DMA zone: 4064 pages, LIFO batch:0
[    0.000000]   Normal zone: 989 pages used for memmap
[    0.000000]   Normal zone: 125699 pages, LIFO batch:31
[    0.000000] DMI present.
[    0.000000] ACPI: RSDP (v002 IBM                                   ) @ 0x000f6bf0
[    0.000000] ACPI: XSDT (v001 IBM    TP-76    0x00001270  LTP 0x00000000) @ 0x1fee6f9b
[    0.000000] ACPI: FADT (v003 IBM    TP-76    0x00001270 IBM  0x00000001) @ 0x1fee7000
[    0.000000] ACPI: SSDT (v001 IBM    TP-76    0x00001270 MSFT 0x0100000e) @ 0x1fee71b4
[    0.000000] ACPI: ECDT (v001 IBM    TP-76    0x00001270 IBM  0x00000001) @ 0x1fef4d46
[    0.000000] ACPI: TCPA (v001 IBM    TP-76    0x00001270 PTL  0x00000001) @ 0x1fef4d98
[    0.000000] ACPI: MADT (v001 IBM    TP-76    0x00001270 IBM  0x00000001) @ 0x1fef4dca
[    0.000000] ACPI: MCFG (v001 IBM    TP-76    0x00001270 IBM  0x00000001) @ 0x1fef4e24
[    0.000000] ACPI: BOOT (v001 IBM    TP-76    0x00001270  LTP 0x00000001) @ 0x1fef4fd8
[    0.000000] ACPI: DSDT (v001 IBM    TP-76    0x00001270 MSFT 0x0100000e) @ 0x00000000
[    0.000000] ACPI: PM-Timer IO Port: 0x1008
[    0.000000] ACPI: Local APIC address 0xfee00000
[    0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
[    0.000000] Processor #0 6:13 APIC version 20
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
[    0.000000] ACPI: IOAPIC (id[0x01] address[0xfec00000] gsi_base[0])
[    0.000000] IOAPIC[0]: apic_id 1, version 32, address 0xfec00000, GSI 0-23
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[    0.000000] ACPI: IRQ0 used by override.
[    0.000000] ACPI: IRQ2 used by override.
[    0.000000] ACPI: IRQ9 used by override.
[    0.000000] Enabling APIC mode:  Flat.  Using 1 I/O APICs
[    0.000000] Using ACPI (MADT) for SMP configuration information
[    0.000000] Allocating PCI resources starting at 30000000 (gap: 20000000:c0000000)
[    0.000000] Detected 1995.186 MHz processor.
[    2.815181] Built 1 zonelists.  Total pages: 129763
[    2.815183] Kernel command line: root=/dev/sda6 resume=/dev/sda5 vga=ext parport=auto ide0=noprobe ide1=noprobe libata.atapi_enabled=1 ro
[    2.815401] mapped APIC to ffff9000 (fee00000)
[    2.815404] mapped IOAPIC to ffff8000 (fec00000)
[    2.815406] Enabling fast FPU save and restore... done.
[    2.815408] Enabling unmasked SIMD FPU exception support... done.
[    2.815416] Initializing CPU#0
[    2.815473] CPU 0 irqstacks, hard=c05f3000 soft=c05f2000
[    2.815476] PID hash table entries: 2048 (order: 11, 8192 bytes)
[    2.815491] is_hpet_capable()
[    2.815493] trying to force-enable HPET
[    2.815498] RCBA already mapped at f0008000
[    2.815501] HPTC: RCBA Base is 0xf0008000, mapped at 0xffffc000 to 0xfffff000
[    2.815505] HPTC: RCBA 0x3404 is 0x00000080n<3>Intel HPET force-enabled at 0xfed00000
[    2.817499] Console: colour VGA+ 80x50
[    2.821573] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
[    2.821816] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
[    2.831460] Memory: 512836k/523136k available (3392k kernel code, 9880k reserved, 1444k data, 200k init, 0k highmem)
[    2.831572] virtual kernel memory layout:
[    2.831573]     fixmap  : 0xfffb3000 - 0xfffff000   ( 304 kB)
[    2.831574]     vmalloc : 0xe0800000 - 0xfffb1000   ( 503 MB)
[    2.831575]     lowmem  : 0xc0000000 - 0xdfee0000   ( 510 MB)
[    2.831576]       .init : 0xc05bb000 - 0xc05ed000   ( 200 kB)
[    2.831577]       .data : 0xc0450065 - 0xc05b90b8   (1444 kB)
[    2.831579]       .text : 0xc0100000 - 0xc0450065   (3392 kB)
[    2.832061] Checking if this processor honours the WP bit even in supervisor mode... Ok.
[    2.832297] hpet_enable
[    2.832382] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
[    2.832602] hpet0: 3 64-bit timers, 14318180 Hz
[    2.833675] Using HPET for base-timer
[    2.915669] Calibrating delay using timer specific routine.. 3994.20 BogoMIPS (lpj=6654729)
[    2.915836] Mount-cache hash table entries: 512
[    2.915981] CPU: After generic identify, caps: afe9fbff 00100000 00000000 00000000 00000180 00000000 00000000
[    2.915990] CPU: L1 I cache: 32K, L1 D cache: 32K
[    2.916101] CPU: L2 cache: 2048K
[    2.916171] CPU: After all inits, caps: afe9fbff 00100000 00000000 00002040 00000180 00000000 00000000
[    2.916176] Intel machine check architecture supported.
[    2.916248] Intel machine check reporting enabled on CPU#0.
[    2.916320] Compat vDSO mapped to ffffa000.
[    2.916396] CPU: Intel(R) Pentium(R) M processor 2.00GHz stepping 08
[    2.916543] Checking 'hlt' instruction... OK.
[    2.929131] ACPI: Core revision 20060707
[    2.945497] ENABLING IO-APIC IRQs
[    2.945753] ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
[    3.082402] NET: Registered protocol family 16
[    3.082649] ACPI: ACPI Dock Station Driver 
[    3.082749] ACPI: bus type pci registered
[    3.082824] PCI: Using MMCONFIG
[    3.083561] Setting up standard PCI resources
[    3.093993] ACPI: Interpreter enabled
[    3.094065] ACPI: Using IOAPIC for interrupt routing
[    3.094750] ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 9 10 *11)
[    3.095684] ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 9 10 *11)
[    3.096606] ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 6 7 9 10 *11)
[    3.097521] ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 9 10 *11)
[    3.098438] ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 6 7 9 10 *11)
[    3.099369] ACPI: PCI Interrupt Link [LNKF] (IRQs 3 4 5 6 7 9 10 *11)
[    3.100284] ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 *5 6 7 9 10 11)
[    3.101201] ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 5 6 7 9 10 *11)
[    3.101926] ACPI: PCI Root Bridge [PCI0] (0000:00)
[    3.102000] PCI: Probing PCI hardware (bus 00)
[    3.103425] HPTC: RCBA Base is 0xf0008000
[    3.103498] HPTC: RCBA 0x3404 is 0x80
[    3.103566] HPTC: HPTC enabled
[    3.103635] HPTC: HPET located at 0xfed00000
[    3.103707] PCI quirk: region 1000-107f claimed by ICH6 ACPI/GPIO/TCO
[    3.103779] PCI quirk: region 1180-11bf claimed by ICH6 GPIO
[    3.103989] Boot video device is 0000:01:00.0
[    3.104433] PCI: Transparent bridge - 0000:00:1e.0
[    3.104583] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
[    3.109110] ACPI: Power Resource [PUBS] (on)
[    3.110023] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.AGP_._PRT]
[    3.110275] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EXP0._PRT]
[    3.110438] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EXP2._PRT]
[    3.110626] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCI1._PRT]
[    3.112300] Linux Plug and Play Support v0.97 (c) Adam Belay
[    3.112376] pnp: PnP ACPI init
[    3.115896] pnp: PnP ACPI: found 13 devices
[    3.115984] intel_rng: FWH not detected
[    3.116140] SCSI subsystem initialized
[    3.116225] libata version 2.00 loaded.
[    3.116258] usbcore: registered new interface driver usbfs
[    3.116349] usbcore: registered new interface driver hub
[    3.116441] usbcore: registered new device driver usb
[    3.116549] PCI: Using ACPI for IRQ routing
[    3.116621] PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report
[    3.215491] Bluetooth: Core ver 2.11
[    3.215587] NET: Registered protocol family 31
[    3.215657] Bluetooth: HCI device and connection manager initialized
[    3.215728] Bluetooth: HCI socket layer initialized
[    3.216289] ieee1394: Initialized config rom entry `ip1394'
[    3.216345] PCI: Bridge: 0000:00:01.0
[    3.216417]   IO window: 3000-3fff
[    3.216487]   MEM window: b0100000-b01fffff
[    3.216557]   PREFETCH window: c0000000-c7ffffff
[    3.216626] PCI: Bridge: 0000:00:1c.0
[    3.216694]   IO window: disabled.
[    3.216766]   MEM window: b0200000-b02fffff
[    3.216835]   PREFETCH window: disabled.
[    3.216906] PCI: Bridge: 0000:00:1c.2
[    3.216976]   IO window: 4000-4fff
[    3.217047]   MEM window: b2000000-b3ffffff
[    3.217117]   PREFETCH window: c8000000-c80fffff
[    3.217190] PCI: Bus 12, cardbus bridge: 0000:0b:00.0
[    3.217260]   IO window: 00005000-000050ff
[    3.217331]   IO window: 00005400-000054ff
[    3.217403]   PREFETCH window: d0000000-d3ffffff
[    3.217474]   MEM window: b8000000-bbffffff
[    3.217545] PCI: Bridge: 0000:00:1e.0
[    3.217615]   IO window: 5000-8fff
[    3.217686]   MEM window: b4000000-bfffffff
[    3.217757]   PREFETCH window: d0000000-d7ffffff
[    3.217834] ACPI: PCI Interrupt 0000:00:01.0[A] -> GSI 16 (level, low) -> IRQ 16
[    3.217973] PCI: Setting latency timer of device 0000:00:01.0 to 64
[    3.217986] ACPI: PCI Interrupt 0000:00:1c.0[A] -> GSI 20 (level, low) -> IRQ 17
[    3.218125] PCI: Setting latency timer of device 0000:00:1c.0 to 64
[    3.218140] ACPI: PCI Interrupt 0000:00:1c.2[C] -> GSI 22 (level, low) -> IRQ 18
[    3.218278] PCI: Setting latency timer of device 0000:00:1c.2 to 64
[    3.218287] PCI: Setting latency timer of device 0000:00:1e.0 to 64
[    3.218298] ACPI: PCI Interrupt 0000:0b:00.0[A] -> GSI 16 (level, low) -> IRQ 16
[    3.218448] NET: Registered protocol family 2
[    3.248830] IP route cache hash table entries: 4096 (order: 2, 16384 bytes)
[    3.248952] TCP established hash table entries: 16384 (order: 4, 65536 bytes)
[    3.249071] TCP bind hash table entries: 8192 (order: 3, 32768 bytes)
[    3.249167] TCP: Hash tables configured (established 16384 bind 8192)
[    3.249238] TCP reno registered
[    3.258925] Simple Boot Flag at 0x35 set to 0x1
[    3.259018] Machine check exception polling timer started.
[    3.259287] Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
[    3.259478] NTFS driver 2.1.27 [Flags: R/W].
[    3.259606] io scheduler noop registered
[    3.259714] io scheduler anticipatory registered (default)
[    3.259858] io scheduler deadline registered
[    3.259969] io scheduler cfq registered
[    3.261652] PCI: Setting latency timer of device 0000:00:01.0 to 64
[    3.261667] assign_interrupt_mode Found MSI capability
[    3.261754] Allocate Port Service[0000:00:01.0:pcie00]
[    3.261774] Allocate Port Service[0000:00:01.0:pcie03]
[    3.261816] PCI: Setting latency timer of device 0000:00:1c.0 to 64
[    3.261852] assign_interrupt_mode Found MSI capability
[    3.261949] Allocate Port Service[0000:00:1c.0:pcie00]
[    3.261967] Allocate Port Service[0000:00:1c.0:pcie02]
[    3.261987] Allocate Port Service[0000:00:1c.0:pcie03]
[    3.262059] PCI: Setting latency timer of device 0000:00:1c.2 to 64
[    3.262095] assign_interrupt_mode Found MSI capability
[    3.262197] Allocate Port Service[0000:00:1c.2:pcie00]
[    3.262215] Allocate Port Service[0000:00:1c.2:pcie02]
[    3.262233] Allocate Port Service[0000:00:1c.2:pcie03]
[    3.262314] pci_hotplug: PCI Hot Plug PCI Core version: 0.5
[    3.262387] ibmphpd: IBM Hot Plug PCI Controller Driver version: 0.6
[    3.262462] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
[    3.265962] decode_hpp: Could not get hotplug parameters. Use defaults
[    3.266059] acpiphp: Slot [1] registered
[    3.267122] acpiphp_ibm: ibm_find_acpi_device:  Failed to get device information<3>acpiphp_ibm: ibm_find_acpi_device:  Failed to get device information<3>acpiphp_ibm: ibm_find_acpi_device:  Failed to get device information<3>acpiphp_ibm: ibm_acpiphp_init: acpi_walk_namespace failed
[    3.269969] ACPI: AC Adapter [AC] (on-line)
[    3.278158] ACPI: Battery Slot [BAT0] (battery present)
[    3.278284] input: Power Button (FF) as /class/input/input0
[    3.278358] ACPI: Power Button (FF) [PWRF]
[    3.278459] input: Lid Switch as /class/input/input1
[    3.278532] ACPI: Lid Switch [LID]
[    3.278632] input: Sleep Button (CM) as /class/input/input2
[    3.278706] ACPI: Sleep Button (CM) [SLPB]
[    3.278995] ACPI: Video Device [VID] (multi-head: yes  rom: no  post: no)
[    3.280635] ACPI: CPU0 (power states: C1[C1] C2[C2] C3[C3])
[    3.280857] ACPI: Processor [CPU] (supports 8 throttling states)
[    3.281966] ACPI: Thermal Zone [THM0] (63 C)
[    3.283336] Real Time Clock Driver v1.12ac
[    3.283432] Linux agpgart interface v0.101 (c) Dave Jones
[    3.283522] agpgart: Detected an Intel 915GM Chipset.
[    3.300594] agpgart: AGP aperture is 256M @ 0x0
[    3.300695] [drm] Initialized drm 1.1.0 20060810
[    3.300877] tpm_nsc tpm_nscl0: NSC TPM revision 2
[    3.301002] Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing disabled
[    3.301249] serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a NS16550A
[    3.302001] pnp: Device 00:09 activated.
[    3.302186] 00:09: ttyS0 at I/O 0x3f8 (irq = 4) is a NS16550A
[    3.302352] ACPI: PCI Interrupt 0000:00:1e.3[B] -> GSI 23 (level, low) -> IRQ 19
[    3.302495] ACPI: PCI interrupt for device 0000:00:1e.3 disabled
[    3.302599] parport: PnPBIOS parport detected.
[    3.302704] parport0: PC-style at 0x3bc (0x7bc), irq 7 [PCSPP(,...)]
[    3.303238] loop: loaded (max 8 devices)
[    3.303352] nbd: registered device at major 43
[    3.303761] Ethernet Channel Bonding Driver: v3.1.1 (September 26, 2006)
[    3.303837] bonding: Warning: either miimon or arp_interval and arp_ip_target module parameters must be specified, otherwise bonding will not detect link failures! see bonding.txt for details.
[    3.304025] pcnet32.c:v1.33 27.Jun.2006 tsbogend@alpha.franken.de
[    3.304115] e100: Intel(R) PRO/100 Network Driver, 3.5.17-k2-NAPI
[    3.304185] e100: Copyright(c) 1999-2006 Intel Corporation
[    3.304292] tg3.c:v3.71 (December 15, 2006)
[    3.304377] ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 16 (level, low) -> IRQ 16
[    3.304520] PCI: Setting latency timer of device 0000:02:00.0 to 64
[    0.399999] eth0: Tigon3 [partno(BCM95751M) rev 4101 PHY(5750)] (PCI Express) 10/100/1000Base-T Ethernet 00:0a:e4:c1:27:01
[    0.399999] eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[1] 
[    0.399999] eth0: dma_rwctrl[76180000] dma_mask[64-bit]
[    0.399999] PPP generic driver version 2.4.2
[    0.399999] PPP Deflate Compression module registered
[    0.399999] PPP BSD Compression module registered
[    0.403333] NET: Registered protocol family 24
[    0.403333] tun: Universal TUN/TAP device driver, 1.6
[    0.403333] tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
[    0.403333] netconsole: not configured, aborting
[    0.403333] ahci 0000:00:1f.2: version 2.0
[    0.403333] ahci: probe of 0000:00:1f.2 failed with error -12
[    0.403333] ata_piix 0000:00:1f.2: version 2.00ac7
[    0.403333] ata_piix 0000:00:1f.2: MAP [ P0 P2 IDE IDE ]
[    0.403333] PCI: Setting latency timer of device 0000:00:1f.2 to 64
[    0.403333] ata1: SATA max UDMA/133 cmd 0x1F0 ctl 0x3F6 bmdma 0x18C0 irq 14
[    0.403333] ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0x18C8 irq 15
[    0.403333] scsi0 : ata_piix
[    0.563333] ata1.00: ATA-6, max UDMA/100, 195371568 sectors: LBA 
[    0.563333] ata1.00: ata1: dev 0 multi count 16
[    0.563333] ata1.00: applying bridge limits
[    0.573333] ata1.00: configured for UDMA/100
[    0.573333] scsi1 : ata_piix
[    0.886666] ata2.00: ATAPI, max UDMA/33
[    1.046666] ata2.00: configured for UDMA/33
[    1.046666] scsi 0:0:0:0: Direct-Access     ATA      FUJITSU MHV2100A 0084 PQ: 0 ANSI: 5
[    1.046666] SCSI device sda: 195371568 512-byte hdwr sectors (100030 MB)
[    1.046666] sda: Write Protect is off
[    1.046666] sda: Mode Sense: 00 3a 00 00
[    1.046666] SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    1.046666] SCSI device sda: 195371568 512-byte hdwr sectors (100030 MB)
[    1.046666] sda: Write Protect is off
[    1.046666] sda: Mode Sense: 00 3a 00 00
[    1.046666] SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    1.046666]  sda: sda1 sda2 sda3 < sda5 sda6 sda7 > sda4
[    1.113333] sd 0:0:0:0: Attached scsi disk sda
[    1.113333] sd 0:0:0:0: Attached scsi generic sg0 type 0
[    1.116666] scsi 1:0:0:0: CD-ROM            MATSHITA DVD-RAM UJ-830S  1.02 PQ: 0 ANSI: 5
[    1.123333] sr0: scsi3-mmc drive: 24x/24x writer dvd-ram cd/rw xa/form2 cdda tray
[    1.123333] Uniform CD-ROM driver Revision: 3.20
[    1.123333] sr 1:0:0:0: Attached scsi CD-ROM sr0
[    1.123333] sr 1:0:0:0: Attached scsi generic sg1 type 5
[    1.123333] ieee1394: raw1394: /dev/raw1394 device initialized
[    1.123333] Yenta: CardBus bridge found at 0000:0b:00.0 [1014:0532]
[    1.249999] Yenta: ISA IRQ mask 0x0438, PCI irq 16
[    1.249999] Socket status: 30000006
[    1.249999] pcmcia: parent PCI bridge I/O window: 0x5000 - 0x8fff
[    1.249999] cs: IO port probe 0x5000-0x8fff: clean.
[    1.249999] pcmcia: parent PCI bridge Memory window: 0xb4000000 - 0xbfffffff
[    1.249999] pcmcia: parent PCI bridge Memory window: 0xd0000000 - 0xd7ffffff
[    1.503333] usbcore: registered new interface driver usblp
[    1.503333] drivers/usb/class/usblp.c: v0.13: USB Printer Device Class driver
[    1.503333] usbcore: registered new interface driver libusual
[    1.503333] usbcore: registered new interface driver usbhid
[    1.503333] drivers/usb/input/hid-core.c: v2.6:USB HID core driver
[    1.503333] usbcore: registered new interface driver asix
[    1.503333] usbcore: registered new interface driver usbserial
[    1.503333] drivers/usb/serial/usb-serial.c: USB Serial support registered for generic
[    1.503333] usbcore: registered new interface driver usbserial_generic
[    1.503333] drivers/usb/serial/usb-serial.c: USB Serial Driver core
[    1.503333] drivers/usb/serial/usb-serial.c: USB Serial support registered for hp4X
[    1.503333] usbcore: registered new interface driver hp4X
[    1.503333] drivers/usb/serial/hp4x.c: HP4x (48/49) Generic Serial driver v1.00
[    1.503333] drivers/usb/serial/usb-serial.c: USB Serial support registered for pl2303
[    1.503333] usbcore: registered new interface driver pl2303
[    1.503333] drivers/usb/serial/pl2303.c: Prolific PL2303 USB to serial adaptor driver
[    1.503333] PNP: PS/2 Controller [PNP0303:KBD,PNP0f13:MOU] at 0x60,0x64 irq 1,12
[    1.509999] serio: i8042 KBD port at 0x60,0x64 irq 1
[    1.509999] serio: i8042 AUX port at 0x60,0x64 irq 12
[    1.509999] mice: PS/2 mouse device common for all mice
[    1.513333] input: AT Translated Set 2 keyboard as /class/input/input3
[    1.519999] i2c /dev entries driver
[    1.519999] ACPI: PCI Interrupt 0000:00:1f.3[A] -> GSI 23 (level, low) -> IRQ 19
[    1.519999] device-mapper: ioctl: 4.11.0-ioctl (2006-10-12) initialised: dm-devel@redhat.com
[    1.519999] EDAC MC: Ver: 2.0.1 Dec 25 2006
[    1.549999] Advanced Linux Sound Architecture Driver Version 1.0.14rc1 (Wed Dec 20 08:11:48 2006 UTC).
[    1.549999] ACPI: PCI Interrupt 0000:00:1e.2[A] -> GSI 22 (level, low) -> IRQ 18
[    1.549999] PCI: Setting latency timer of device 0000:00:1e.2 to 64
[    1.816666] ACPI: EC: evaluating _Q75
[    2.133333] Synaptics Touchpad, model: 1, fw: 5.9, id: 0x2c6ab1, caps: 0x884793/0x0
[    2.133333] serio: Synaptics pass-through port at isa0060/serio1/input0
[    2.176666] input: SynPS/2 Synaptics TouchPad as /class/input/input4
[    2.473333] intel8x0_measure_ac97_clock: measured 53330 usecs
[    2.473333] intel8x0: clocking to 48000
[    2.473333] ALSA device list:
[    2.473333]   #0: Intel ICH6 with AD1981B at 0xb0000800, irq 18
[    2.473333] netem: version 1.2
[    2.473333] u32 classifier
[    2.473333] Netfilter messages via NETLINK v0.30.
[    2.473333] ip_tables: (C) 2000-2006 Netfilter Core Team
[    2.553333] TCP bic registered
[    2.553333] TCP cubic registered
[    2.553333] TCP westwood registered
[    2.553333] TCP highspeed registered
[    2.553333] TCP vegas registered
[    2.553333] NET: Registered protocol family 1
[    2.553333] NET: Registered protocol family 10
[    2.553333] IPv6 over IPv4 tunneling driver
[    2.553333] NET: Registered protocol family 17
[    2.633333] Bridge firewalling registered
[    2.633333] Bluetooth: L2CAP ver 2.8
[    2.633333] Bluetooth: L2CAP socket layer initialized
[    2.633333] Bluetooth: SCO (Voice Link) ver 0.5
[    2.633333] Bluetooth: SCO socket layer initialized
[    2.633333] Bluetooth: RFCOMM socket layer initialized
[    2.633333] Bluetooth: RFCOMM TTY layer initialized
[    2.633333] Bluetooth: RFCOMM ver 1.8
[    2.633333] Bluetooth: BNEP (Ethernet Emulation) ver 1.2
[    2.633333] Bluetooth: BNEP filters: protocol multicast
[    2.633333] Bluetooth: HIDP (Human Interface Emulation) ver 1.1
[    2.633333] 802.1Q VLAN Support v1.8 Ben Greear <greearb@candelatech.com>
[    2.633333] All bugs added by David S. Miller <davem@redhat.com>
[    2.633333] ieee80211: 802.11 data/management/control stack, git-1.1.13
[    2.633333] ieee80211: Copyright (C) 2004-2005 Intel Corporation <jketreno@linux.intel.com>
[    2.633333] ieee80211_crypt: registered algorithm 'NULL'
[    2.633333] ieee80211_crypt: registered algorithm 'WEP'
[    2.633333] ieee80211_crypt: registered algorithm 'CCMP'
[    2.633333] ieee80211_crypt: registered algorithm 'TKIP'
[    2.633333] speedstep-centrino with X86_SPEEDSTEP_CENTRINO_ACPIconfig is deprecated.
[    2.633333]  Use X86_ACPI_CPUFREQ (acpi-cpufreq instead.
[    2.633333] Using IPI Shortcut mode
[    2.633333] ACPI: (supports S0 S3 S4 S5)
[    2.636666] Time: tsc clocksource has been installed.
[    2.643333] Time: hpet clocksource has been installed.
[    7.439999] IBM TrackPoint firmware: 0x0e, buttons: 3/3
[    7.696665] input: TPPS/2 IBM TrackPoint as /class/input/input5
[    7.703332] ACPI: EC: evaluating _Q75
[    7.879999] kjournald starting.  Commit interval 5 seconds
[    7.879999] EXT3-fs: mounted filesystem with ordered data mode.
[    7.879999] VFS: Mounted root (ext3 filesystem) readonly.
[    7.879999] Freeing unused kernel memory: 200k freed
[   11.453332] ACPI: PCI Interrupt 0000:0b:00.1[B] -> GSI 17 (level, low) -> IRQ 20
[   11.506665] ohci1394: fw-host0: OHCI-1394 1.0 (PCI): IRQ=[20]  MMIO=[b1000000-b10007ff]  Max Packet=[2048]  IR/IT contexts=[4/4]
[   11.516665] eth1394: eth0: IEEE-1394 IPv4 over 1394 Ethernet (fw-host0)
[   11.679998] cs: IO port probe 0x100-0x4ff: excluding 0x4d0-0x4d7
[   11.683332] cs: IO port probe 0x800-0x8ff: clean.
[   11.683332] cs: IO port probe 0xc00-0xcff: clean.
[   11.683332] cs: IO port probe 0xa00-0xaff: clean.
[   12.166665] Adding 1958000k swap on /dev/sda5.  Priority:10 extents:1 across:1958000k
[   12.319998] EXT3 FS on sda6, internal journal
[   12.693332] ibm_acpi: ThinkPad EC firmware 76HT16WW-1.06
[   12.693332] ibm_acpi: IBM ThinkPad ACPI Extras v0.13
[   12.693332] ibm_acpi: http://ibm-acpi.sf.net/
[   12.699998] ibm_acpi: fan_init: initial fan status is unknown, assuming it is in auto mode
[   12.783332] ieee1394: Host added: ID:BUS[0-00:1023]  GUID[000ae405314003e1]
[   13.063332] kjournald starting.  Commit interval 5 seconds
[   13.063332] EXT3-fs: mounted filesystem with ordered data mode.
[   13.066665] kjournald starting.  Commit interval 5 seconds
[   13.066665] EXT3 FS on sda7, internal journal
[   13.066665] EXT3-fs: mounted filesystem with ordered data mode.
[   13.493331] pcmcia: Detected deprecated PCMCIA ioctl usage from process: discover.
[   13.493331] pcmcia: This interface will soon be removed from the kernel; please expect breakage unless you upgrade to new tools.
[   13.493331] pcmcia: see http://www.kernel.org/pub/linux/utils/kernel/pcmcia/pcmcia.html for details.
[   15.979998] ieee1394: Node removed: ID:BUS[0-00:1023]  GUID[000ae405314003e1]

-- 
Tobias						PGP: http://9ac7e0bc.uguu.de


* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24 20:24                                                                                             ` Linus Torvalds
  2006-12-24 20:30                                                                                               ` Andrei Popa
@ 2006-12-26 17:51                                                                                               ` Al Viro
  2006-12-26 17:58                                                                                                 ` Al Viro
  1 sibling, 1 reply; 311+ messages in thread
From: Al Viro @ 2006-12-26 17:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Popa, Gordon Farquharson, Andrew Morton, Martin Michlmayr,
	Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

On Sun, Dec 24, 2006 at 12:24:46PM -0800, Linus Torvalds wrote:
> 
> 
> On Sun, 24 Dec 2006, Andrei Popa wrote:
> > 
> > Hash check on download completion found bad chunks, consider using
> > "safe_sync".
> 
> Dang. Did you get any warning messages from the kernel?
> 
> 		Linus

BTW, rmap.c patch is broken - needs at least

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/mm/rmap.c b/mm/rmap.c
index 57306fa..669acb2 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -452,7 +452,7 @@ static int page_mkclean_one(struct page 
 		entry = ptep_clear_flush(vma, address, pte);
 		entry = pte_wrprotect(entry);
 		entry = pte_mkclean(entry);
-		set_pte_at(vma, address, pte, entry);
+		set_pte_at(mm, address, pte, entry);
 		lazy_mmu_prot_update(entry);
 		ret = 1;
 	}


* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-26 17:51                                                                                               ` Al Viro
@ 2006-12-26 17:58                                                                                                 ` Al Viro
  0 siblings, 0 replies; 311+ messages in thread
From: Al Viro @ 2006-12-26 17:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Popa, Gordon Farquharson, Andrew Morton, Martin Michlmayr,
	Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

On Tue, Dec 26, 2006 at 05:51:55PM +0000, Al Viro wrote:
> On Sun, Dec 24, 2006 at 12:24:46PM -0800, Linus Torvalds wrote:
> > 
> > 
> > On Sun, 24 Dec 2006, Andrei Popa wrote:
> > > 
> > > Hash check on download completion found bad chunks, consider using
> > > "safe_sync".
> > 
> > Dang. Did you get any warning messages from the kernel?
> > 
> > 		Linus
> 
> BTW, rmap.c patch is broken - needs at least

... but that doesn't affect most of the architectures - only sparc64 and
some of powerpc.  So it's definitely not enough.
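
A likely reason the typo only bites on some architectures: where
set_pte_at() is a macro that drops its first argument, the wrong pointer
is never expanded and so never reaches the compiler. The sketch below is
a userspace illustration only. The struct mm_struct, struct
vm_area_struct and pte_t here are made-up stand-ins, and
set_pte_at_macro() / set_pte_at_func() are hypothetical names, not the
kernel's definitions.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct mm_struct { int id; };
struct vm_area_struct { struct mm_struct *vm_mm; };
typedef struct { unsigned long val; } pte_t;

static void set_pte(pte_t *ptep, pte_t pteval)
{
	*ptep = pteval;
}

/*
 * Macro form: "mm" and "addr" are never expanded, so a vm_area_struct
 * pointer in the first slot compiles silently and behaves the same.
 */
#define set_pte_at_macro(mm, addr, ptep, pteval) set_pte(ptep, pteval)

/*
 * Function form: the first argument is type-checked and dereferenced,
 * so passing the wrong pointer type draws a compiler diagnostic.
 */
static int set_pte_at_func(struct mm_struct *mm, unsigned long addr,
			   pte_t *ptep, pte_t pteval)
{
	(void)addr;
	assert(mm);
	*ptep = pteval;
	return mm->id;
}

unsigned long demo_macro_ignores_first_arg(void)
{
	struct mm_struct mm = { .id = 7 };
	struct vm_area_struct vma = { .vm_mm = &mm };
	pte_t pte = { 0 }, entry = { 0x1234 };

	(void)vma;	/* the macro below never expands &vma */
	set_pte_at_macro(&vma, 0x1000UL, &pte, entry);	/* wrong type, compiles */
	return pte.val;
}

int demo_func_uses_first_arg(void)
{
	struct mm_struct mm = { .id = 7 };
	pte_t pte = { 0 }, entry = { 0x1234 };

	/* Passing &vma here instead of &mm would be flagged by the compiler. */
	return set_pte_at_func(&mm, 0x1000UL, &pte, entry);
}
```

With the macro form the bad call stores the entry exactly as the correct
call would, which fits the observation that the fix cannot explain the
corruption seen on i386.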


* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-26 10:31                                                                                           ` Nick Piggin
@ 2006-12-26 19:26                                                                                             ` Linus Torvalds
  2006-12-27 12:32                                                                                               ` Jari Sundell
                                                                                                                 ` (2 more replies)
  0 siblings, 3 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-26 19:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrei Popa, Peter Zijlstra, David S. Miller, Andrew Morton,
	Gordon Farquharson, Martin Michlmayr, Hugh Dickins,
	Arjan van de Ven, Linux Kernel Mailing List



On Tue, 26 Dec 2006, Nick Piggin wrote:

> Linus Torvalds wrote:
> > 
> > Ok, so how about this diff.
> > 
> > I'm actually feeling good about this one. It really looks like
> > "do_no_page()" was simply buggy, and that this explains everything.
> 
> Still trying to catch up here, so I'm not going to reply to any old
> stuff and just start at the tip of the thread... Other than to say
> that I really like cancel_page_dirty ;)

Yeah, I think that part is a bit clearer about what's going on now.

> I think your patch is quite right so that's a good catch.

Actually, since people told me it didn't matter, I went back and looked at 
_why_ - the thing is, "vma->vm_page_prot" should always be read-only 
anyway, except for mappings that don't do dirty accounting at all, so I 
think my patch only found cases that are unimportant (ie pages that get 
faulted in on filesystems like ramfs that don't do any dirty page 
accounting because they're all dirty anyway).

> But I'm not too surprised that it does not help the problem, because I 
> don't think we have started shedding any old pte_dirty tests at 
> unmap/reclaim-time, have we? So the dirty bit isn't going to get lost, 
> as such.

True. We should no longer _need_ those dirty bit reclaims at 
unmap/reclaim, but we still do them, so you're right, even if we were 
buggy in this area, it should only really matter for the dirty page 
counting, not for any lost data.

> I was hoping that you've almost narrowed it down to the filesystem
> writeback code, with the last few mails?

I think so, yes.

However, I've checked, and "rtorrent" really does seem to be fairly 
well-behaved wrt any filesystem activity. It does

 - no threading. It's 100% single-threaded, and doesn't even appear to use 
   signals.

 - exactly _one_ "ftruncate()", and it does it at the beginning, for the 
   full final size.

   IOW, it's not anything subtle with truncate and dirty page cancel.

 - It never uses mprotect on the shared mappings, but it _does_ do:
	"mincore()" - but the return values don't much matter (it's used 
	              as a heuristic on which parts to hash, apparently)

		      I double- and triple-checked this one, because I
		      did make changes to "mincore()", but those didn't go 
		      into the affected kernels anyway (ie they are not in 
		      plain 2.6.19, nor in 2.6.18.3 either)

	"madvise(MADV_WILLNEED)"
	"msync(MS_ASYNC)" (or MS_SYNC if you use a command line flag)
	"munmap()" of course

 - it never seems to mix mmap() and write() - it does _only_ mmap.

 - it seems to mmap/munmap the shared files in nice 64-page chunks, all 
   64-page aligned in the file (ie it does NOT create one big mapping, it 
   has some kind of LRU of these 64-page chunks). The only exception being 
   the last chunk, which it maps byte-accurate to the size.

 - I haven't checked whether it only ever has the same chunk mapped once 
   at a time.
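
For anyone who wants to poke at this without rtorrent, the access pattern
listed above can be approximated with a small test program. This is a
minimal sketch under stated assumptions, not what rtorrent actually does
internally: the function name, the fill pattern and the verify-by-read()
step are all made up for illustration, and it maps chunks strictly in
order rather than through any LRU.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK_PAGES 64

/* Deterministic, never-zero fill byte for a given file offset. */
static unsigned char pattern(size_t off)
{
	return (unsigned char)(off % 251 + 1);
}

/*
 * Fill "size" bytes of "path" purely through shared mappings, one
 * 64-page chunk at a time, then read the file back with read() and
 * count mismatching bytes. Returns the mismatch count, or -1 on error.
 */
long chunked_mmap_write_and_verify(const char *path, size_t size)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t chunk, off, len, i;
	unsigned char buf[4096];
	unsigned char *p;
	ssize_t n;
	long bad = 0;
	int fd;

	assert(page > 0);
	chunk = (size_t)page * CHUNK_PAGES;

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	/* Exactly one ftruncate() up front: the file starts fully sparse. */
	if (ftruncate(fd, (off_t)size) < 0)
		goto fail;

	for (off = 0; off < size; off += chunk) {
		/* The last chunk is mapped byte-accurate to the file size. */
		len = size - off < chunk ? size - off : chunk;
		p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			 fd, (off_t)off);
		if (p == MAP_FAILED)
			goto fail;
		madvise(p, len, MADV_WILLNEED);
		for (i = 0; i < len; i++)
			p[i] = pattern(off + i);
		msync(p, len, MS_ASYNC);
		munmap(p, len);
	}

	/* Verify through the read() path rather than through a mapping. */
	if (lseek(fd, 0, SEEK_SET) < 0)
		goto fail;
	off = 0;
	while ((n = read(fd, buf, sizeof(buf))) > 0) {
		for (i = 0; i < (size_t)n; i++)
			if (buf[i] != pattern(off + i))
				bad++;
		off += (size_t)n;
	}
	close(fd);
	return (n < 0 || off != size) ? -1 : bad;
fail:
	close(fd);
	return -1;
}
```

Reading back right away will mostly be served from the page cache, so a
clean result here proves little. To have any chance of exercising
writeback, the file would need to be large relative to RAM, or the
verification would need to happen after the pages have been evicted.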

Anyway, the _one_ half-way interesting thing is the fact that it doesn't 
allocate any backing store at all for the file, and as such the page 
writeback needs to create all the underlying buffers on the filesystem. I 
really don't see why that would be a problem either, but I could imagine 
that if we have some writeback bug where we can end up writing back the 
_same_ page concurrently, we'd actually end up racing in the kernel, and 
allocating two different backing stores, and then maybe the other one 
would effectively "get lost" (and the earlier writeback would win the 
race, explaining why we'd end up with zeroes at the end of a block).

Or something.

However, all the codepaths _seem_ to test for PG_writeback, and not even 
try to start another writeback while the first one is still active.
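
The invariant being relied on there can be pictured as a test-and-set
guard. The sketch below is a userspace model using C11 atomics; struct
fake_page, try_start_writeback() and end_writeback() are hypothetical
names for illustration and say nothing about how the kernel actually
implements PG_writeback.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; only one flag is modelled. */
struct fake_page {
	atomic_flag under_writeback;	/* plays the role of PG_writeback */
	int writes_started;
};

/* Returns true if this caller won the right to start writeback. */
bool try_start_writeback(struct fake_page *pg)
{
	assert(pg);
	/* test_and_set returns the previous value: true means already set. */
	if (atomic_flag_test_and_set(&pg->under_writeback))
		return false;
	pg->writes_started++;
	return true;
}

/* Called when the I/O completes; allows the next writeback to start. */
void end_writeback(struct fake_page *pg)
{
	atomic_flag_clear(&pg->under_writeback);
}
```

If every path goes through a guard of this shape, two writebacks of the
same page cannot overlap, which is why the double-allocation theory
above needs some path that skips the check.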

What would also actually be interesting is whether somebody can reproduce 
this on Reiserfs, for example. I _think_ all the reports I've seen are on 
ext2 or ext3, and if this is somehow writeback-related, it could be some 
bug that is just shared between the two by virtue of them still having a 
lot of stuff in common. 

			Linus


* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-26 16:17                                                                             ` Tobias Diedrich
@ 2006-12-27  4:55                                                                               ` David Miller
  2006-12-27  7:00                                                                                 ` Linus Torvalds
  2006-12-28  0:16                                                                                 ` Linus Torvalds
  0 siblings, 2 replies; 311+ messages in thread
From: David Miller @ 2006-12-27  4:55 UTC (permalink / raw)
  To: ranma
  Cc: torvalds, gordonfarquharson, tbm, a.p.zijlstra, andrei.popa,
	akpm, hugh, nickpiggin, arjan, linux-kernel

From: Tobias Diedrich <ranma@tdiedrich.de>
Date: Tue, 26 Dec 2006 17:17:00 +0100

> Linus Torvalds wrote:
> > I don't think it's a page table issue any more, it just doesn't look 
> > likely with the ARM UP corruption. It's also not apparently even on a 
> > cacheline boundary, so it probably is really a dirty bit that got cleared 
> > wrong due to some race with IO.
> 
> So, until now it's only been reported for SMP on i386?
> I'm seeing the issue on my Pentium-M Notebook (Thinkpad R52) over
> here, UP kernel, no preempt.

I've seen it on sparc64, UP kernel, no preempt.


* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-27  4:55                                                                               ` [PATCH] mm: fix page_mkclean_one David Miller
@ 2006-12-27  7:00                                                                                 ` Linus Torvalds
  2006-12-27  8:39                                                                                   ` Andrei Popa
  2006-12-28  0:16                                                                                 ` Linus Torvalds
  1 sibling, 1 reply; 311+ messages in thread
From: Linus Torvalds @ 2006-12-27  7:00 UTC (permalink / raw)
  To: David Miller
  Cc: ranma, gordonfarquharson, tbm, a.p.zijlstra, andrei.popa, akpm,
	hugh, nickpiggin, arjan, linux-kernel



On Tue, 26 Dec 2006, David Miller wrote:
> 
> I've seen it on sparc64, UP kernel, no preempt.

Btw, having tried to debug the writeback code, there's one very special 
case that just makes me go "hmm".

If we have a buffer that is "busy" when we try to write back a page, we 
have this magic "wbc->sync_mode == WB_SYNC_NONE && wbc->nonblocking" mode, 
where we won't wait for it, but instead we'll redirty the page and redo 
the whole thing.
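
Spelled out as a decision table, the special case looks roughly like the
model below. This is a userspace sketch that only mirrors the condition
quoted above; struct wbc_model, enum buffer_action and decide() are
hypothetical names, not kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; not the kernel's definitions. */
enum sync_mode { WB_SYNC_NONE, WB_SYNC_ALL };

struct wbc_model {
	enum sync_mode sync_mode;
	bool nonblocking;
};

enum buffer_action {
	ACT_WRITE,		/* buffer was free: lock it and write */
	ACT_WAIT_THEN_WRITE,	/* buffer busy, caller may block: wait */
	ACT_REDIRTY_SKIP	/* buffer busy, caller may not block: redirty page */
};

enum buffer_action decide(const struct wbc_model *wbc, bool buffer_busy)
{
	bool may_block;

	assert(wbc);
	may_block = wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking;

	if (!buffer_busy)
		return ACT_WRITE;
	return may_block ? ACT_WAIT_THEN_WRITE : ACT_REDIRTY_SKIP;
}
```

Only the combination WB_SYNC_NONE plus nonblocking plus a busy buffer
ends in the redirty path; every other combination ends with the buffer
being written.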

Looking at the code, that should all work, but at the same time, it 
triggers some of my debug messages about having a dirty page during 
writeback, and one way to trigger that debug message is to try to run 
rtorrent on the machine.. 

I dunno. With the writeback being suspicious, and the normal 
"block_write_full_page()" path being implicated in at least ext2, I just 
wonder. This is one of those "let's see if behaviour changes" patches, 
that I'm just throwing out there..

		Linus

---
diff --git a/fs/buffer.c b/fs/buffer.c
index 263f88e..4652ef1 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1653,19 +1653,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 	do {
 		if (!buffer_mapped(bh))
 			continue;
-		/*
-		 * If it's a fully non-blocking write attempt and we cannot
-		 * lock the buffer then redirty the page.  Note that this can
-		 * potentially cause a busy-wait loop from pdflush and kswapd
-		 * activity, but those code paths have their own higher-level
-		 * throttling.
-		 */
-		if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) {
-			lock_buffer(bh);
-		} else if (test_set_buffer_locked(bh)) {
-			redirty_page_for_writepage(wbc, page);
-			continue;
-		}
+		lock_buffer(bh);
 		if (test_clear_buffer_dirty(bh)) {
 			mark_buffer_async_write(bh);
 		} else {

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-27  7:00                                                                                 ` Linus Torvalds
@ 2006-12-27  8:39                                                                                   ` Andrei Popa
  0 siblings, 0 replies; 311+ messages in thread
From: Andrei Popa @ 2006-12-27  8:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, ranma, gordonfarquharson, tbm, a.p.zijlstra, akpm,
	hugh, nickpiggin, arjan, linux-kernel

I have corrupted files...

> ---
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 263f88e..4652ef1 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1653,19 +1653,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
>  	do {
>  		if (!buffer_mapped(bh))
>  			continue;
> -		/*
> -		 * If it's a fully non-blocking write attempt and we cannot
> -		 * lock the buffer then redirty the page.  Note that this can
> -		 * potentially cause a busy-wait loop from pdflush and kswapd
> -		 * activity, but those code paths have their own higher-level
> -		 * throttling.
> -		 */
> -		if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) {
> -			lock_buffer(bh);
> -		} else if (test_set_buffer_locked(bh)) {
> -			redirty_page_for_writepage(wbc, page);
> -			continue;
> -		}
> +		lock_buffer(bh);
>  		if (test_clear_buffer_dirty(bh)) {
>  			mark_buffer_async_write(bh);
>  		} else {


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-26 19:26                                                                                             ` Linus Torvalds
@ 2006-12-27 12:32                                                                                               ` Jari Sundell
  2006-12-27 12:44                                                                                               ` valdyn
  2007-01-07  2:06                                                                                               ` Tom Lanyon
  2 siblings, 0 replies; 311+ messages in thread
From: Jari Sundell @ 2006-12-27 12:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Andrei Popa, Peter Zijlstra, David S. Miller,
	Andrew Morton, Gordon Farquharson, Martin Michlmayr,
	Hugh Dickins, Arjan van de Ven, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 1553 bytes --]

On 12/27/06, Linus Torvalds <torvalds@osdl.org> wrote:
<snip>
>  - It never uses mprotect on the shared mappings, but it _does_ do:
>         "mincore()" - but the return values don't much matter (it's used
>                       as a heuristic on which parts to hash, apparently)
>
>                       I double- and triple-checked this one, because I
>                       did make changes to "mincore()", but those didn't go
>                       into the affected kernels anyway (ie they are not in
>                       plain 2.6.19, nor in 2.6.18.3 either)

Correct, mincore is only used to check if it should delay the hash checking.
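
A residency check of that kind might look roughly like the sketch below. It
is not rtorrent's code: `resident_pages()` is a hypothetical helper, and the
only real interfaces used are `mincore()`, `sysconf()` and `malloc()`/`free()`.
It assumes Linux, where the low bit of each `vec` byte says whether the page
is in core.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE         /* for mincore() and MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Count how many pages of [addr, addr + length) are resident in memory.
 * addr must be page-aligned. Returns -1 on error.
 */
static long resident_pages(void *addr, size_t length)
{
    long page_size = sysconf(_SC_PAGESIZE);
    assert(page_size > 0);

    size_t pages = (length + (size_t)page_size - 1) / (size_t)page_size;
    unsigned char *vec = malloc(pages ? pages : 1);
    long resident = 0;

    if (vec == NULL)
        return -1;
    if (mincore(addr, length, vec) != 0) {
        free(vec);
        return -1;
    }
    for (size_t i = 0; i < pages; i++)
        resident += vec[i] & 1;     /* bit 0: page is in core */
    free(vec);
    return resident;
}
```

A caller could compare the result with the number of pages in the chunk and
postpone hashing when some of them would have to be read from disk first.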

>         "madvise(MADV_WILLNEED)"
>         "msync(MS_ASYNC)" (or MS_SYNC if you use a command line flag)
>         "munmap()" of course
>
>  - it never seems to mix mmap() and write() - it does _only_ mmap.
>
>  - it seems to mmap/munmap the shared files in nice 64-page chunks, all
>    64-page aligned in the file (ie it does NOT create one big mapping, it
>    has some kind of LRU of these 64-page chunks). The only exception being
>    the last chunk, which it maps byte-accurate to the size.

The length of the chunks is only page aligned on single file torrents,
not so on multi-file torrents. I've attached a patch for rtorrent that
will extend the length to the page boundary.
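
The rounding involved can be sketched as below. This is not the attached
patch, whose contents are not shown here: `struct map_range` and
`extend_to_pages()` are hypothetical names. Rounding the start offset down
is an added assumption, based on `mmap()` requiring page-aligned file
offsets. The sketch does not clamp the result to the file size.

```c
#include <assert.h>
#include <stddef.h>

struct map_range {
    size_t offset;  /* page-aligned file offset to hand to mmap() */
    size_t length;  /* length covering whole pages only */
};

/* Widen [offset, offset + length) to page boundaries on both sides. */
static struct map_range extend_to_pages(size_t offset, size_t length,
                                        size_t page_size)
{
    /* page_size must be a non-zero power of two for the masks below */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    size_t start = offset & ~(page_size - 1);
    size_t end = (offset + length + page_size - 1) & ~(page_size - 1);
    struct map_range r = { start, end - start };
    return r;
}
```

For example, with 4096-byte pages a 3000-byte chunk at offset 5000 becomes a
single-page mapping at offset 4096.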

>  - I haven't checked whether it only ever has the same chunk mapped once
>    at a time.

This should be the case, but two mapped chunks may share a page,
sometimes with different r/w permissions.
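
Whether two chunks end up on a common page is plain arithmetic on their
byte ranges within the file. A minimal sketch, with `chunks_share_page()` as
a hypothetical helper and offsets and lengths given in bytes:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if the two byte ranges touch at least one common page. */
static bool chunks_share_page(size_t off_a, size_t len_a,
                              size_t off_b, size_t len_b,
                              size_t page_size)
{
    assert(len_a > 0 && len_b > 0 && page_size > 0);

    size_t first_a = off_a / page_size;
    size_t last_a = (off_a + len_a - 1) / page_size;
    size_t first_b = off_b / page_size;
    size_t last_b = (off_b + len_b - 1) / page_size;

    /* Two inclusive page intervals overlap iff each starts before the
     * other one ends. */
    return first_a <= last_b && first_b <= last_a;
}
```

With 4096-byte pages, two adjacent 64-page chunks that start on a page
boundary never share a page. If the boundary between them falls at byte
261144, as can happen when a file starts in the middle of a torrent chunk,
both chunks touch page 63.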

Jari Sundell

[-- Attachment #2: extend_mapping.diff --]
[-- Type: application/octet-stream, Size: 1887 bytes --]

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-26 19:26                                                                                             ` Linus Torvalds
  2006-12-27 12:32                                                                                               ` Jari Sundell
@ 2006-12-27 12:44                                                                                               ` valdyn
  2006-12-27 13:33                                                                                                 ` Jari Sundell
  2007-01-07  2:06                                                                                               ` Tom Lanyon
  2 siblings, 1 reply; 311+ messages in thread
From: valdyn @ 2006-12-27 12:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Nick Piggin, Andrei Popa, Peter Zijlstra, David S. Miller,
	Andrew Morton, Gordon Farquharson, Martin Michlmayr,
	Hugh Dickins, Arjan van de Ven, Linux Kernel Mailing List,
	Linus Torvalds

On Tue, Dec 26, 2006 at 11:26:50AM -0800, Linus Torvalds wrote:
> What would also actually be interesting is whether somebody can reproduce 
> this on Reiserfs, for example. I _think_ all the reports I've seen are on 
> ext2 or ext3, and if this is somehow writeback-related, it could be some 
> bug that is just shared between the two by virtue of them still having a 
> lot of stuff in common. 
> 
> 			Linus
I do get this error on reiserfs ( old one, didn't try on reiser4 ). 
Stock 2.6.19 plus reiser4 patch. Previously reported by me only in the
debian bts.

flo attenberger

---
Linux master 2.6.19 #1 PREEMPT Thu Dec 21 10:55:34 CET 2006 x86_64
GNU/Linux

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.19
# Thu Dec 21 10:45:05 2006
#
CONFIG_X86_64=y
CONFIG_64BIT=y
CONFIG_X86=y
CONFIG_ZONE_DMA32=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_MMU=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_CMPXCHG=y
CONFIG_EARLY_PRINTK=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_DMI=y
CONFIG_AUDIT_ARCH=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_ON_SMP=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
# CONFIG_IPC_NS is not set
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
# CONFIG_TASKSTATS is not set
# CONFIG_UTS_NS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_RELAY is not set
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SHMEM=y
CONFIG_SLAB=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
# CONFIG_SLOB is not set

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
CONFIG_MODVERSIONS=y
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y

#
# Block layer
#
CONFIG_BLOCK=y
# CONFIG_LBD is not set
# CONFIG_BLK_DEV_IO_TRACE is not set
# CONFIG_LSF is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=m
CONFIG_IOSCHED_DEADLINE=m
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"

#
# Processor type and features
#
CONFIG_X86_PC=y
# CONFIG_X86_VSMP is not set
CONFIG_MK8=y
# CONFIG_MPSC is not set
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_L1_CACHE_BYTES=64
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_INTERNODE_CACHE_BYTES=64
CONFIG_X86_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_MICROCODE=m
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=m
CONFIG_X86_CPUID=m
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_MTRR=y
# CONFIG_SMP is not set
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_BKL=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
# CONFIG_SPARSEMEM_STATIC is not set
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_RESOURCES_64BIT=y
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_HPET_TIMER=y
CONFIG_IOMMU=y
# CONFIG_CALGARY_IOMMU is not set
CONFIG_SWIOTLB=y
CONFIG_X86_MCE=y
# CONFIG_X86_MCE_INTEL is not set
CONFIG_X86_MCE_AMD=y
CONFIG_KEXEC=y
# CONFIG_CRASH_DUMP is not set
CONFIG_PHYSICAL_START=0x200000
CONFIG_SECCOMP=y
# CONFIG_CC_STACKPROTECTOR is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_REORDER=y
CONFIG_K8_NB=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_ISA_DMA_API=y

#
# Power management options
#
CONFIG_PM=y
CONFIG_PM_LEGACY=y
# CONFIG_PM_DEBUG is not set
CONFIG_PM_SYSFS_DEPRECATED=y
# CONFIG_SOFTWARE_SUSPEND is not set

#
# ACPI (Advanced Configuration and Power Interface) Support
#
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_SLEEP_PROC_FS=y
# CONFIG_ACPI_SLEEP_PROC_SLEEP is not set
CONFIG_ACPI_AC=m
# CONFIG_ACPI_BATTERY is not set
CONFIG_ACPI_BUTTON=m
CONFIG_ACPI_VIDEO=m
CONFIG_ACPI_HOTKEY=m
CONFIG_ACPI_FAN=m
# CONFIG_ACPI_DOCK is not set
CONFIG_ACPI_PROCESSOR=m
CONFIG_ACPI_THERMAL=m
# CONFIG_ACPI_ASUS is not set
# CONFIG_ACPI_IBM is not set
# CONFIG_ACPI_TOSHIBA is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_EC=y
CONFIG_ACPI_POWER=y
CONFIG_ACPI_SYSTEM=y
CONFIG_X86_PM_TIMER=y
# CONFIG_ACPI_CONTAINER is not set
# CONFIG_ACPI_SBS is not set

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=m
# CONFIG_CPU_FREQ_DEBUG is not set
CONFIG_CPU_FREQ_STAT=m
CONFIG_CPU_FREQ_STAT_DETAILS=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_PERFORMANCE=m
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=m
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m

#
# CPUFreq processor drivers
#
CONFIG_X86_POWERNOW_K8=m
CONFIG_X86_POWERNOW_K8_ACPI=y
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
CONFIG_X86_ACPI_CPUFREQ=m

#
# shared options
#
# CONFIG_X86_ACPI_CPUFREQ_PROC_INTF is not set
# CONFIG_X86_SPEEDSTEP_LIB is not set

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
# CONFIG_PCIEPORTBUS is not set
# CONFIG_PCI_MSI is not set
# CONFIG_PCI_DEBUG is not set
# CONFIG_HT_IRQ is not set

#
# PCCARD (PCMCIA/CardBus) support
#
# CONFIG_PCCARD is not set

#
# PCI Hotplug Support
#
CONFIG_HOTPLUG_PCI=m
CONFIG_HOTPLUG_PCI_FAKE=m
# CONFIG_HOTPLUG_PCI_ACPI is not set
# CONFIG_HOTPLUG_PCI_CPCI is not set
# CONFIG_HOTPLUG_PCI_SHPC is not set

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_MISC=m
CONFIG_IA32_EMULATION=y
# CONFIG_IA32_AOUT is not set
CONFIG_COMPAT=y
CONFIG_SYSVIPC_COMPAT=y

#
# Networking
#
CONFIG_NET=y

#
# Networking options
#
# CONFIG_NETDEBUG is not set
CONFIG_PACKET=m
CONFIG_PACKET_MMAP=y
CONFIG_UNIX=y
CONFIG_XFRM=y
CONFIG_XFRM_USER=m
# CONFIG_XFRM_SUB_POLICY is not set
CONFIG_NET_KEY=m
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_ASK_IP_FIB_HASH=y
# CONFIG_IP_FIB_TRIE is not set
CONFIG_IP_FIB_HASH=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_FWMARK=y
CONFIG_IP_ROUTE_MULTIPATH=y
# CONFIG_IP_ROUTE_MULTIPATH_CACHED is not set
CONFIG_IP_ROUTE_VERBOSE=y
# CONFIG_IP_PNP is not set
CONFIG_NET_IPIP=m
CONFIG_NET_IPGRE=m
# CONFIG_NET_IPGRE_BROADCAST is not set
CONFIG_IP_MROUTE=y
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
CONFIG_ARPD=y
CONFIG_SYN_COOKIES=y
CONFIG_INET_AH=m
CONFIG_INET_ESP=m
CONFIG_INET_IPCOMP=m
CONFIG_INET_XFRM_TUNNEL=m
CONFIG_INET_TUNNEL=m
# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET_XFRM_MODE_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_BEET is not set
CONFIG_INET_DIAG=m
CONFIG_INET_TCP_DIAG=m
CONFIG_TCP_CONG_ADVANCED=y
CONFIG_TCP_CONG_BIC=y
CONFIG_TCP_CONG_CUBIC=m
CONFIG_TCP_CONG_WESTWOOD=m
CONFIG_TCP_CONG_HTCP=m
# CONFIG_TCP_CONG_HSTCP is not set
# CONFIG_TCP_CONG_HYBLA is not set
# CONFIG_TCP_CONG_VEGAS is not set
# CONFIG_TCP_CONG_SCALABLE is not set
# CONFIG_TCP_CONG_LP is not set
# CONFIG_TCP_CONG_VENO is not set
CONFIG_DEFAULT_BIC=y
# CONFIG_DEFAULT_CUBIC is not set
# CONFIG_DEFAULT_HTCP is not set
# CONFIG_DEFAULT_VEGAS is not set
# CONFIG_DEFAULT_WESTWOOD is not set
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="bic"

#
# IP: Virtual Server Configuration
#
# CONFIG_IP_VS is not set
CONFIG_IPV6=m
CONFIG_IPV6_PRIVACY=y
# CONFIG_IPV6_ROUTER_PREF is not set
CONFIG_INET6_AH=m
CONFIG_INET6_ESP=m
CONFIG_INET6_IPCOMP=m
# CONFIG_IPV6_MIP6 is not set
CONFIG_INET6_XFRM_TUNNEL=m
CONFIG_INET6_TUNNEL=m
# CONFIG_INET6_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET6_XFRM_MODE_TUNNEL is not set
# CONFIG_INET6_XFRM_MODE_BEET is not set
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
CONFIG_IPV6_SIT=m
# CONFIG_IPV6_TUNNEL is not set
# CONFIG_IPV6_MULTIPLE_TABLES is not set
# CONFIG_NETLABEL is not set
# CONFIG_NETWORK_SECMARK is not set
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set

#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_NETLINK=m
CONFIG_NETFILTER_NETLINK_QUEUE=m
CONFIG_NETFILTER_NETLINK_LOG=m
CONFIG_NETFILTER_XTABLES=m
CONFIG_NETFILTER_XT_TARGET_CLASSIFY=m
CONFIG_NETFILTER_XT_TARGET_CONNMARK=m
# CONFIG_NETFILTER_XT_TARGET_DSCP is not set
[...]

#
# SCSI low-level drivers
#
# CONFIG_ISCSI_TCP is not set
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_ACARD is not set
# CONFIG_SCSI_AACRAID is not set
# CONFIG_SCSI_AIC7XXX is not set
# CONFIG_SCSI_AIC7XXX_OLD is not set
# CONFIG_SCSI_AIC79XX is not set
# CONFIG_SCSI_AIC94XX is not set
# CONFIG_SCSI_ARCMSR is not set
# CONFIG_MEGARAID_NEWGEN is not set
# CONFIG_MEGARAID_LEGACY is not set
# CONFIG_MEGARAID_SAS is not set
# CONFIG_SCSI_HPTIOP is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_PPA is not set
# CONFIG_SCSI_IMM is not set
# CONFIG_SCSI_STEX is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
# CONFIG_SCSI_QLA_FC is not set
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_LPFC is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_DEBUG is not set

#
# Serial ATA (prod) and Parallel ATA (experimental) drivers
#
CONFIG_ATA=y
# CONFIG_SATA_AHCI is not set
# CONFIG_SATA_SVW is not set
# CONFIG_ATA_PIIX is not set
# CONFIG_SATA_MV is not set
# CONFIG_SATA_NV is not set
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
CONFIG_SATA_PROMISE=m
# CONFIG_SATA_SX4 is not set
# CONFIG_SATA_SIL is not set
# CONFIG_SATA_SIL24 is not set
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_ULI is not set
CONFIG_SATA_VIA=y
# CONFIG_SATA_VITESSE is not set
# CONFIG_PATA_ALI is not set
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CS5520 is not set
# CONFIG_PATA_CS5530 is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_ATA_GENERIC is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# C[...]
# CONFIG_NETPOLL is not set
# CONFIG_NET_POLL_CONTROLLER is not set

#
# ISDN subsystem
#
CONFIG_ISDN=m

#
# Old ISDN4Linux
#
CONFIG_ISDN_I4L=m
CONFIG_ISDN_PPP=y
CONFIG_ISDN_PPP_VJ=y
CONFIG_ISDN_MPP=y
# CONFIG_IPPP_FILTER is not set
CONFIG_ISDN_PPP_BSDCOMP=m
CONFIG_ISDN_AUDIO=y
CONFIG_ISDN_TTY_FAX=y

#
# ISDN feature submodules
#
# CONFIG_ISDN_DRV_LOOP is not set
CONFIG_ISDN_DIVERSION=m

#
# ISDN4Linux hardware drivers
#

#
# Passive cards
#
CONFIG_ISDN_DRV_HISAX=m

#
# D-channel protocol features
#
CONFIG_HISAX_EURO=y
CONFIG_DE_AOC=y
# CONFIG_HISAX_NO_SENDCOMPLETE is not set
# CONFIG_HISAX_NO_LLC is not set
# CONFIG_HISAX_NO_KEYPAD is not set
# CONFIG_HISAX_1TR6 is not set
# CONFIG_HISAX_NI1 is not set
CONFIG_HISAX_MAX_CARDS=8

#
# HiSax supported cards
#
# CONFIG_HISAX_16_3 is not set
# CONFIG_HISAX_TELESPCI is not set
# CONFIG_HISAX_S0BOX is not set
CONFIG_HISAX_FRITZPCI=y
# CONFIG_HISAX_AVM_A1_PCMCIA is not set
# CONFIG_HISAX_ELSA is not set
# CONFIG_HISAX_DIEHLDIVA is not set
# CONFIG_HISAX_SEDLBAUER is not set
# CONFIG_HISAX_NETJET is not set
# CONFIG_HISAX_NETJET_U is not set
# CONFIG_HISAX_NICCY is not set
# CONFIG_HISAX_BKM_A4T is not set
# CONFIG_HISAX_SCT_QUADRO is not set
# CONFIG_HISAX_GAZEL is not set
# CONFIG_HISAX_HFC_PCI is not set
# CONFIG_HISAX_W6692 is not set
# CONFIG_HISAX_HFC_SX is not set
# CONFIG_HISAX_DEBUG is not set

#
# HiSax PCMCIA card service modules
#

#
# HiSax sub driver modules
#
# CONFIG_HISAX_ST5481 is not set
# CONFIG_HISAX_HFCUSB is not set
# CONFIG_HISAX_HFC4S8S is not set
CONFIG_HISAX_FRITZ_PCIPNP=m

#
# Active cards
#
# CONFIG_HYSDN is not set

#
# Siemens Gigaset
#
# CONFIG_ISDN_DRV_GIGASET is not set

#
# CAPI subsystem
#
CONFIG_ISDN_CAPI=m
# CONFIG_ISDN_DRV_AVMB1_VERBOSE_REASON is not set
CONFIG_ISDN_CAPI_MIDDLEWARE=y
CONFIG_ISDN_CAPI_CAPI20=m
CONFIG_ISDN_CAPI_CAPIFS_BOOL=y
CONFIG_ISDN_CAPI_CAPIFS=m
# CONFIG_ISDN_CAPI_CAPIDRV is not set

#
# CAPI hardware drivers
#

#
# Active AVM cards
#
# CONFIG_CAPI_AVM is not set

#
# Active Eicon DIVA Server cards
#
# CONFIG_CAPI_EICON is not set

#
# Telephony Support
#
# CONFIG_PHONE is not set

#
# Input device support
#
CONFIG_INPUT=y
# CONFIG_INPUT_FF_MEMLESS is not set

#
# Use[...]pport
#
# CONFIG_SPI is not set
# CONFIG_SPI_MASTER is not set

#
# Dallas's 1-wire bus
#
CONFIG_W1=m

#
# 1-wire Bus Masters
#
# CONFIG_W1_MASTER_MATROX is not set
# CONFIG_W1_MASTER_DS2490 is not set
# CONFIG_W1_MASTER_DS2482 is not set

#
# 1-wire Slaves
#
CONFIG_W1_SLAVE_THERM=m
CONFIG_W1_SLAVE_SMEM=m
CONFIG_W1_SLAVE_DS2433=m
# CONFIG_W1_SLAVE_DS2433_CRC is not set

#
# Hardware Monitoring support
#
CONFIG_HWMON=m
CONFIG_HWMON_VID=m
# CONFIG_SENSORS_ABITUGURU is not set
CONFIG_SENSORS_ADM1021=m
CONFIG_SENSORS_ADM1025=m
CONFIG_SENSORS_ADM1026=m
CONFIG_SENSORS_ADM1031=m
CONFIG_SENSORS_ADM9240=m
CONFIG_SENSORS_K8TEMP=m
CONFIG_SENSORS_ASB100=m
# CONFIG_SENSORS_ATXP1 is not set
CONFIG_SENSORS_DS1621=m
# CONFIG_SENSORS_F71805F is not set
CONFIG_SENSORS_FSCHER=m
CONFIG_SENSORS_FSCPOS=m
CONFIG_SENSORS_GL518SM=m
CONFIG_SENSORS_GL520SM=m
CONFIG_SENSORS_IT87=m
CONFIG_SENSORS_LM63=m
CONFIG_SENSORS_LM75=m
CONFIG_SENSORS_LM77=m
CONFIG_SENSORS_LM78=m
CONFIG_SENSORS_LM80=m
CONFIG_SENSORS_LM83=m
CONFIG_SENSORS_LM85=m
CONFIG_SENSORS_LM87=m
CONFIG_SENSORS_LM90=m
CONFIG_SENSORS_LM92=m
CONFIG_SENSORS_MAX1619=m
CONFIG_SENSORS_PC87360=m
CONFIG_SENSORS_SIS5595=m
CONFIG_SENSORS_SMSC47M1=m
# CONFIG_SENSORS_SMSC47M192 is not set
CONFIG_SENSORS_SMSC47B397=m
CONFIG_SENSORS_VIA686A=m
CONFIG_SENSORS_VT1211=m
CONFIG_SENSORS_VT8231=m
CONFIG_SENSORS_W83781D=m
# CONFIG_SENSORS_W83791D is not set
CONFIG_SENSORS_W83792D=m
CONFIG_SENSORS_W83L785TS=m
CONFIG_SENSORS_W83627HF=m
CONFIG_SENSORS_W83627EHF=m
# CONFIG_SENSORS_HDAPS is not set
# CONFIG_HWMON_DEBUG_CHIP is not set

#
# Multimedia devices
#
CONFIG_VIDEO_DEV=m
CONFIG_VIDEO_V4L1=y
CONFIG_VIDEO_V4L1_COMPAT=y
CONFIG_VIDEO_V4L2=y

#
# Video Capture Adapters
#

#
# Video Capture Adapters
#
# CONFIG_VIDEO_ADV_DEBUG is not set
CONFIG_VIDEO_HELPER_CHIPS_AUTO=y
CONFIG_VIDEO_TVAUDIO=m
CONFIG_VIDEO_TDA7432=m
CONFIG_VIDEO_TDA9875=m
CONFIG_VIDEO_MSP3400=m
# CONFIG_VIDEO_VIVI is not set
CONFIG_VIDEO_BT848=m
# CONFIG_VIDEO_BT848_DVB is not set
CONFIG_VIDEO_SAA6588=m
# CONFIG_VIDEO_BWQCAM is not set
# CONFIG_VIDEO_CQCAM is not set
# CONFIG_VIDEO_W9966 is not set
# CONFIG_VIDEO_CPIA is not set
# CONFIG_VIDEO_CPIA2 is not set
CONFIG_VIDEO_SAA5246A=m
CONFIG_VIDEO_[...]_UART=m
CONFIG_SND_AC97_CODEC=m
CONFIG_SND_AC97_BUS=m
CONFIG_SND_DUMMY=m
CONFIG_SND_VIRMIDI=m
# CONFIG_SND_MTPAV is not set
# CONFIG_SND_MTS64 is not set
# CONFIG_SND_SERIAL_U16550 is not set
CONFIG_SND_MPU401=m

#
# PCI devices
#
# CONFIG_SND_AD1889 is not set
# CONFIG_SND_ALS300 is not set
# CONFIG_SND_ALS4000 is not set
# CONFIG_SND_ALI5451 is not set
# CONFIG_SND_ATIIXP is not set
# CONFIG_SND_ATIIXP_MODEM is not set
# CONFIG_SND_AU8810 is not set
# CONFIG_SND_AU8820 is not set
# CONFIG_SND_AU8830 is not set
# CONFIG_SND_AZT3328 is not set
CONFIG_SND_BT87X=m
# CONFIG_SND_BT87X_OVERCLOCK is not set
# CONFIG_SND_CA0106 is not set
# CONFIG_SND_CMIPCI is not set
# CONFIG_SND_CS4281 is not set
# CONFIG_SND_CS46XX is not set
# CONFIG_SND_DARLA20 is not set
# CONFIG_SND_GINA20 is not set
# CONFIG_SND_LAYLA20 is not set
# CONFIG_SND_DARLA24 is not set
# CONFIG_SND_GINA24 is not set
# CONFIG_SND_LAYLA24 is not set
# CONFIG_SND_MONA is not set
# CONFIG_SND_MIA is not set
# CONFIG_SND_ECHO3G is not set
# CONFIG_SND_INDIGO is not set
# CONFIG_SND_INDIGOIO is not set
# CONFIG_SND_INDIGODJ is not set
# CONFIG_SND_EMU10K1 is not set
# CONFIG_SND_EMU10K1X is not set
# CONFIG_SND_ENS1370 is not set
CONFIG_SND_ENS1371=m
# CONFIG_SND_ES1938 is not set
# CONFIG_SND_ES1968 is not set
# CONFIG_SND_FM801 is not set
# CONFIG_SND_HDA_INTEL is not set
# CONFIG_SND_HDSP is not set
# CONFIG_SND_HDSPM is not set
# CONFIG_SND_ICE1712 is not set
# CONFIG_SND_ICE1724 is not set
# CONFIG_SND_INTEL8X0 is not set
# CONFIG_SND_INTEL8X0M is not set
# CONFIG_SND_KORG1212 is not set
# CONFIG_SND_MAESTRO3 is not set
# CONFIG_SND_MIXART is not set
# CONFIG_SND_NM256 is not set
# CONFIG_SND_PCXHR is not set
# CONFIG_SND_RIPTIDE is not set
# CONFIG_SND_RME32 is not set
# CONFIG_SND_RME96 is not set
# CONFIG_SND_RME9652 is not set
# CONFIG_SND_SONICVIBES is not set
# CONFIG_SND_TRIDENT is not set
CONFIG_SND_VIA82XX=m
# CONFIG_SND_VIA82XX_MODEM is not set
# CONFIG_SND_VX222 is not set
# CONFIG_SND_YMFPCI is not set
CONFIG_SND_AC97_POWER_SAVE=y

#
# USB devices
#
CONFIG_SND_USB_AUDIO=m
# CONFIG_SND_USB_USX2Y is not set

#
# Open Sound System
#
# CONFIG_SOUND_PRIME is not set

#
# USB support
#
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=m
# CONFIG_USB_DEBUG is not set

#
# Miscellaneous USB options
#
CONFIG_USB_DEVICEFS=y
# CONFIG_USB_BANDWIDTH is not set
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_SUSPEND is not set
# CONFIG_USB_OTG is not set

#
# USB Host Controller Drivers
#
CONFIG_USB_EHCI_HCD=m
CONFIG_USB_EHCI_SPLIT_ISO=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
# CONFIG_USB_EHCI_TT_NEWSCHED is not set
# CONFIG_USB_ISP116X_HCD is not set
CONFIG_USB_OHCI_HCD=m
# CONFIG_USB_OHCI_BIG_ENDIAN is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_UHCI_HCD=m
# CONFIG_USB_SL811_HCD is not set

#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
CONFIG_USB_PRINTER=m

#
# NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support'
#

#
# may also be needed; see USB_STORAGE Help for more information
#
CONFIG_USB_STORAGE=m
# CONFIG_USB_STORAGE_DEBUG is not set
# CONFIG_USB_STORAGE_DATAFAB is not set
# CONFIG_USB_STORAGE_FREECOM is not set
# CONFIG_USB_STORAGE_DPCM is not set
# CONFIG_USB_STORAGE_USBAT is not set
# CONFIG_USB_STORAGE_SDDR09 is not set
# CONFIG_USB_STORAGE_SDDR55 is not set
# CONFIG_USB_STORAGE_JUMPSHOT is not set
# CONFIG_USB_STORAGE_ALAUDA is not set
# CONFIG_USB_STORAGE_KARMA is not set
# CONFIG_USB_LIBUSUAL is not set

#
# USB Input Devices
#
CONFIG_USB_HID=m
CONFIG_USB_HIDINPUT=y
# CONFIG_USB_HIDINPUT_POWERBOOK is not set
# CONFIG_HID_FF is not set
CONFIG_USB_HIDDEV=y

#
# USB HID Boot Protocol drivers
#
# CONFIG_USB_KBD is not set
# CONFIG_USB_MOUSE is not set
# CONFIG_USB_AIPTEK is not set
# CONFIG_USB_WACOM is not set
# CONFIG_USB_ACECAD is not set
# CONFIG_USB_KBTAB is not set
# CONFIG_USB_POWERMATE is not set
# CONFIG_USB_TOUCHSCREEN is not set
# CONFIG_USB_YEALINK is not set
# CONFIG_USB_XPAD is not set
# CONFIG_USB_ATI_REMOTE is not set
# CONFIG_USB_ATI_REMOTE2 is not set
# CONFIG_USB_KEYSPAN_REMOTE is not set
# CONFIG_USB_APPLETOUCH is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set

#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_USBNET_MII is not set
# CONFIG_USB_USBNET is not set
# CONFIG_USB_MON is not set

#
# USB port drivers
#
# CONFIG_USB_USS720 is not set

#
# USB Serial Converter support
#
# CONFIG_USB_SERIAL is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_AUERSWALD is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_LED is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_PHIDGET is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_TEST is not set

#
# USB DSL modem support
#

#
# USB Gadget Support
#
# CONFIG_USB_GADGET is not set

#
# MMC/SD Card support
#
# CONFIG_MMC is not set

#
# LED devices
#
# CONFIG_NEW_LEDS is not set

#
# LED drivers
#

#
# LED Triggers
#

#
# InfiniBand support
#
# CONFIG_INFINIBAND is not set

#
# EDAC - error detection and reporting (RAS) (EXPERIMENTAL)
#
# CONFIG_EDAC is not set

#
# Real Time Clock
#
CONFIG_RTC_LIB=m
CONFIG_RTC_CLASS=m

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=m
CONFIG_RTC_INTF_PROC=m
CONFIG_RTC_INTF_DEV=m
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set

#
# RTC drivers
#
CONFIG_RTC_DRV_X1205=m
CONFIG_RTC_DRV_DS1307=m
CONFIG_RTC_DRV_DS1553=m
CONFIG_RTC_DRV_ISL1208=m
CONFIG_RTC_DRV_DS1672=m
CONFIG_RTC_DRV_DS1742=m
CONFIG_RTC_DRV_PCF8563=m
CONFIG_RTC_DRV_PCF8583=m
CONFIG_RTC_DRV_RS5C372=m
CONFIG_RTC_DRV_M48T86=m
CONFIG_RTC_DRV_TEST=m
CONFIG_RTC_DRV_V3020=m

#
# DMA Engine support
#
# CONFIG_DMA_ENGINE is not set

#
# DMA Clients
#

#
# DMA Devices
#

#
# Firmware Drivers
#
# CONFIG_EDD is not set
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set

#
# File systems
#
CONFIG_EXT2_FS=y
# CONFIG_EXT2_FS_XATTR is not set
# CONFIG_EXT2_FS_XIP is not set
CONFIG_EXT3_FS=m
# CONFIG_EXT3_FS_XATTR is not set
# CONFIG_EXT4DEV_FS is not set
CONFIG_JBD=m
# CONFIG_JBD_DEBUG is not set
CONFIG_REISER4_FS=y
# CONFIG_REISER4_DEBUG is not set
CONFIG_REISERFS_FS=y
# CONFIG_REISERFS_CHECK is not set
# CONFIG_REISERFS_PROC_INFO is not set
# CONFIG_REISERFS_FS_XATTR is not set
# CONFIG_JFS_FS is not set
CONFIG_FS_POSIX_ACL=y
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
CONFIG_MINIX_FS=m
CONFIG_ROMFS_FS=m
CONFIG_INOTIFY=y
CONFIG_INOTIFY_USER=y
# CONFIG_QUOTA is not set
CONFIG_DNOTIFY=y
CONFIG_AUTOFS_FS=m
CONFIG_AUTOFS4_FS=m
CONFIG_FUSE_FS=m

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=m
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_ZISOFS_FS=m
CONFIG_UDF_FS=m
CONFIG_UDF_NLS=y

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=m
CONFIG_MSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_FAT_DEFAULT_CODEPAGE=850
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-15"
CONFIG_NTFS_FS=m
# CONFIG_NTFS_DEBUG is not set
CONFIG_NTFS_RW=y

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
# CONFIG_TMPFS_POSIX_ACL is not set
# CONFIG_HUGETLBFS is not set
# CONFIG_HUGETLB_PAGE is not set
CONFIG_RAMFS=y
# CONFIG_CONFIGFS_FS is not set

#
# Miscellaneous filesystems
#
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
CONFIG_CRAMFS=m
# CONFIG_VXFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set

#
# Network File Systems
#
CONFIG_NFS_FS=m
CONFIG_NFS_V3=y
# CONFIG_NFS_V3_ACL is not set
CONFIG_NFS_V4=y
# CONFIG_NFS_DIRECTIO is not set
CONFIG_NFSD=m
CONFIG_NFSD_V3=y
# CONFIG_[...]
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_HMAC=y
CONFIG_CRYPTO_NULL=m
CONFIG_CRYPTO_MD4=m
CONFIG_CRYPTO_MD5=y
CONFIG_CRYPTO_SHA1=m
CONFIG_CRYPTO_SHA256=m
CONFIG_CRYPTO_SHA512=m
CONFIG_CRYPTO_WP512=m
CONFIG_CRYPTO_TGR192=m
CONFIG_CRYPTO_ECB=m
CONFIG_CRYPTO_CBC=m
CONFIG_CRYPTO_DES=m
CONFIG_CRYPTO_BLOWFISH=m
CONFIG_CRYPTO_TWOFISH=m
CONFIG_CRYPTO_TWOFISH_COMMON=m
CONFIG_CRYPTO_TWOFISH_X86_64=m
CONFIG_CRYPTO_SERPENT=m
CONFIG_CRYPTO_AES=m
CONFIG_CRYPTO_AES_X86_64=m
CONFIG_CRYPTO_CAST5=m
CONFIG_CRYPTO_CAST6=m
CONFIG_CRYPTO_TEA=m
CONFIG_CRYPTO_ARC4=m
CONFIG_CRYPTO_KHAZAD=m
CONFIG_CRYPTO_ANUBIS=m
CONFIG_CRYPTO_DEFLATE=m
CONFIG_CRYPTO_MICHAEL_MIC=m
CONFIG_CRYPTO_CRC32C=m
CONFIG_CRYPTO_TEST=m

#
# Hardware crypto devices
#

#
# Library routines
#
CONFIG_CRC_CCITT=m
CONFIG_CRC16=m
CONFIG_CRC32=m
CONFIG_LIBCRC32C=m
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_TEXTSEARCH=y
CONFIG_TEXTSEARCH_KMP=m
CONFIG_TEXTSEARCH_BM=m
CONFIG_TEXTSEARCH_FSM=m
CONFIG_PLIST=y


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-27 12:44                                                                                               ` valdyn
@ 2006-12-27 13:33                                                                                                 ` Jari Sundell
  0 siblings, 0 replies; 311+ messages in thread
From: Jari Sundell @ 2006-12-27 13:33 UTC (permalink / raw)
  To: valdyn
  Cc: linux-kernel, Nick Piggin, Andrei Popa, Peter Zijlstra,
	David S. Miller, Andrew Morton, Gordon Farquharson,
	Martin Michlmayr, Hugh Dickins, Arjan van de Ven, Linus Torvalds

On 12/27/06, valdyn@gmail.com <valdyn@gmail.com> wrote:
> I do get this error on reiserfs ( old one, didn't try on reiser4 ).
> Stock 2.6.19 plus reiser4 patch. Previously reported by me only in the
> debian bts.

I've had reports of corrupted data on earlier kernel releases with
reiserfs3, which were fixed by upgrading to reiserfs4.

Jari Sundell

^