linux-kernel.vger.kernel.org archive mirror
* 2.6.19 file content corruption on ext3
@ 2006-12-07 15:57 Marc Haber
  2006-12-07 16:50 ` Phillip Susi
                   ` (2 more replies)
  0 siblings, 3 replies; 154+ messages in thread
From: Marc Haber @ 2006-12-07 15:57 UTC (permalink / raw)
  To: linux-kernel

Hi,

One of my systems is running Debian stable with a self-compiled Linux
kernel. On this system, Debian's aptitude binary is started hourly
from cron to check for new packages (including virus scan definition
packages, which is the reason the update runs so often).

After updating to 2.6.19, Debian's apt control file
/var/cache/apt/pkgcache.bin corrupts pretty frequently - like in under
six hours. In that situation, "aptitude update" segfaults. When I
delete the file and have apt recreate it, things are fine again for a
few hours before the file is broken again and the segfaults start over.
In all cases, umounting the file system and doing an fsck does not
show issues with the file system.

I went back to 2.6.18.3 to debug this, and the system ran for three
days without problems and without corrupting
/var/cache/apt/pkgcache.bin. After booting 2.6.19 again, it took three
hours for the file corruption to show again.

I do not have an idea what could cause this other than the 2.6.19
kernel.

The file system in question is an ext3 file system on an LVM LV, which
is a member of a VG that has only a single PV, which in turn is on a
primary partition of the first IDE hard disk, hda. The IDE interface is
a VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus
Master IDE (rev 06). The box is a rented server in a colocation
facility, and I have neither console access nor physical access to the
box itself.

I'll happily deliver information that might be needed to nail down
this issue. Can anybody give advice about how to solve this?

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Mannheim, Germany  |  lose things."    Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature |  How to make an American Quilt | Fax: *49 621 72739835


* Re: 2.6.19 file content corruption on ext3
  2006-12-07 15:57 2.6.19 file content corruption on ext3 Marc Haber
@ 2006-12-07 16:50 ` Phillip Susi
  2006-12-08  1:38   ` Fernando Luis Vázquez Cao
  2006-12-09  9:26   ` Marc Haber
  2006-12-16 18:31 ` Florian Weimer
  2006-12-22 13:30 ` Daniel Drake
  2 siblings, 2 replies; 154+ messages in thread
From: Phillip Susi @ 2006-12-07 16:50 UTC (permalink / raw)
  To: Marc Haber; +Cc: linux-kernel

Marc Haber wrote:
> I went back to 2.6.18.3 to debug this, and the system ran for three
> days without problems and without corrupting
> /var/cache/apt/pkgcache.bin. After booting 2.6.19 again, it took three
> hours for the file corruption to show again.
> 
> I do not have an idea what could cause this other than the 2.6.19
> kernel.
<snip>
> I'll happily deliver information that might be needed to nail down
> this issue. Can anybody give advice about how to solve this?

I'd say start git bisecting to track down which commit the problem 
starts at.



* Re: 2.6.19 file content corruption on ext3
  2006-12-07 16:50 ` Phillip Susi
@ 2006-12-08  1:38   ` Fernando Luis Vázquez Cao
  2006-12-08 16:42     ` Marc Haber
  2006-12-09  9:26   ` Marc Haber
  1 sibling, 1 reply; 154+ messages in thread
From: Fernando Luis Vázquez Cao @ 2006-12-08  1:38 UTC (permalink / raw)
  To: Phillip Susi; +Cc: Marc Haber, linux-kernel

On Thu, 2006-12-07 at 11:50 -0500, Phillip Susi wrote:
> Marc Haber wrote:
> > I went back to 2.6.18.3 to debug this, and the system ran for three
> > days without problems and without corrupting
> > /var/cache/apt/pkgcache.bin. After booting 2.6.19 again, it took three
> > hours for the file corruption to show again.
> > 
> > I do not have an idea what could cause this other than the 2.6.19
> > kernel.
> <snip>
> > I'll happily deliver information that might be needed to nail down
> > this issue. Can anybody give advice about how to solve this?
> 
> I'd say start git bisecting to track down which commit the problem 
> starts at.

Does the patch below help?

http://marc.theaimsgroup.com/?l=linux-ext4&m=116483980823714&w=4



* Re: 2.6.19 file content corruption on ext3
  2006-12-08  1:38   ` Fernando Luis Vázquez Cao
@ 2006-12-08 16:42     ` Marc Haber
  2006-12-09 10:47       ` Jan Kara
  2006-12-09 23:46       ` Mike Galbraith
  0 siblings, 2 replies; 154+ messages in thread
From: Marc Haber @ 2006-12-08 16:42 UTC (permalink / raw)
  To: linux-kernel

On Fri, Dec 08, 2006 at 10:38:12AM +0900, Fernando Luis Vázquez Cao wrote:
> Does the patch below help?
> 
> http://marc.theaimsgroup.com/?l=linux-ext4&m=116483980823714&w=4

No, pkgcache.bin still getting corrupted within two hours of using
2.6.19.

Greetings
Marc, back to 2.6.18.3 for the time being


* Re: 2.6.19 file content corruption on ext3
  2006-12-07 16:50 ` Phillip Susi
  2006-12-08  1:38   ` Fernando Luis Vázquez Cao
@ 2006-12-09  9:26   ` Marc Haber
  2006-12-16 18:43     ` Martin Michlmayr
  1 sibling, 1 reply; 154+ messages in thread
From: Marc Haber @ 2006-12-09  9:26 UTC (permalink / raw)
  To: linux-kernel

On Thu, Dec 07, 2006 at 11:50:37AM -0500, Phillip Susi wrote:
> Marc Haber wrote:
> >I went back to 2.6.18.3 to debug this, and the system ran for three
> >days without problems and without corrupting
> >/var/cache/apt/pkgcache.bin. After booting 2.6.19 again, it took three
> >hours for the file corruption to show again.
> >
> >I do not have an idea what could cause this other than the 2.6.19
> >kernel.
> <snip>
> >I'll happily deliver information that might be needed to nail down
> >this issue. Can anybody give advice about how to solve this?
> 
> I'd say start git bisecting to track down which commit the problem 
> starts at.

Unfortunately, I am lacking the knowledge needed to do this in an
informed way. I am neither familiar enough with git nor do I possess
the necessary C powers.

Greetings
Marc


* Re: 2.6.19 file content corruption on ext3
  2006-12-08 16:42     ` Marc Haber
@ 2006-12-09 10:47       ` Jan Kara
  2006-12-11 19:07         ` Marc Haber
  2006-12-09 23:46       ` Mike Galbraith
  1 sibling, 1 reply; 154+ messages in thread
From: Jan Kara @ 2006-12-09 10:47 UTC (permalink / raw)
  To: Marc Haber; +Cc: linux-kernel

> On Fri, Dec 08, 2006 at 10:38:12AM +0900, Fernando Luis Vázquez Cao wrote:
> > Does the patch below help?
> > 
> > http://marc.theaimsgroup.com/?l=linux-ext4&m=116483980823714&w=4
> 
> No, pkgcache.bin still getting corrupted within two hours of using
> 2.6.19.
  Hmm, interesting. I'll try to reproduce the problem. In the meantime,
does mounting the filesystem with data=writeback help?

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs


* Re: 2.6.19 file content corruption on ext3
  2006-12-08 16:42     ` Marc Haber
  2006-12-09 10:47       ` Jan Kara
@ 2006-12-09 23:46       ` Mike Galbraith
  2006-12-11  9:31         ` Marc Haber
  1 sibling, 1 reply; 154+ messages in thread
From: Mike Galbraith @ 2006-12-09 23:46 UTC (permalink / raw)
  To: Marc Haber; +Cc: linux-kernel

On Fri, 2006-12-08 at 17:42 +0100, Marc Haber wrote:
> On Fri, Dec 08, 2006 at 10:38:12AM +0900, Fernando Luis Vázquez Cao wrote:
> > Does the patch below help?
> > 
> > http://marc.theaimsgroup.com/?l=linux-ext4&m=116483980823714&w=4
> 
> No, pkgcache.bin still getting corrupted within two hours of using
> 2.6.19.
> 
> Greetings
> Marc, back to 2.6.18.3 for the time being

Hi,

I've missed most of this thread, but have cause to be interested.  Do
you have a generic recipe for reproducing file corruption?  I seem to be
(read pretty darn sure, modulo hw (wish) vs sw testing methods...)
experiencing memory corruption problems with 2.6.19, and am interested
in anything that might be related (trigger!).

	-Mike



* Re: 2.6.19 file content corruption on ext3
  2006-12-09 23:46       ` Mike Galbraith
@ 2006-12-11  9:31         ` Marc Haber
  0 siblings, 0 replies; 154+ messages in thread
From: Marc Haber @ 2006-12-11  9:31 UTC (permalink / raw)
  To: linux-kernel

On Sun, Dec 10, 2006 at 12:46:01AM +0100, Mike Galbraith wrote:
> On Fri, 2006-12-08 at 17:42 +0100, Marc Haber wrote:
> > On Fri, Dec 08, 2006 at 10:38:12AM +0900, Fernando Luis Vázquez Cao wrote:
> > > Does the patch below help?
> > > 
> > > http://marc.theaimsgroup.com/?l=linux-ext4&m=116483980823714&w=4
> > 
> > No, pkgcache.bin still getting corrupted within two hours of using
> > 2.6.19.
> > 
> > Greetings
> > Marc, back to 2.6.18.3 for the time being
> 
> Hi,
> 
> I've missed most of this thread, but have cause to be interested.  Do
> you have a generic recipe for reproducing file corruption?

My recipe is running apt-get update from cron. This needs Debian
though. Maybe a chroot installation will suffice.

I'm going to try data=writeback first.

Greetings
Marc


* Re: 2.6.19 file content corruption on ext3
  2006-12-09 10:47       ` Jan Kara
@ 2006-12-11 19:07         ` Marc Haber
  2006-12-14 12:03           ` Jan Kara
  0 siblings, 1 reply; 154+ messages in thread
From: Marc Haber @ 2006-12-11 19:07 UTC (permalink / raw)
  To: linux-kernel

On Sat, Dec 09, 2006 at 11:47:58AM +0100, Jan Kara wrote:
>   In the mean time
>   does mounting the filesystem with data=writeback help?

I have now nine hours uptime with data=writeback, and the file is
still OK. Looks good.

By this posting, I'm going to invoke murphy, so I'll report again
tomorrow.

Greetings
Marc


* Re: 2.6.19 file content corruption on ext3
  2006-12-11 19:07         ` Marc Haber
@ 2006-12-14 12:03           ` Jan Kara
  2006-12-15  9:30             ` Marc Haber
  0 siblings, 1 reply; 154+ messages in thread
From: Jan Kara @ 2006-12-14 12:03 UTC (permalink / raw)
  To: Marc Haber; +Cc: linux-kernel

> On Sat, Dec 09, 2006 at 11:47:58AM +0100, Jan Kara wrote:
> >   In the mean time
> >   does mounting the filesystem with data=writeback help?
> 
> I have now nine hours uptime with data=writeback, and the file is
> still OK. Looks good.
> 
> By this posting, I'm going to invoke murphy, so I'll report again
> tomorrow.
  Since you haven't written till today I assume that data=writeback does
not have a problem. Hmm. I am really starting to suspect my changes to
the JBD commit code. But I tried to reproduce the problem by copying
files back and forth, without success :( I also checked the code and I
don't see how we could lose dirty bits on buffers (which is probably
what happens, as one person wrote to me that he also sees the problem
when using rtorrent, which verifies a checksum after downloading, and
that check passes fine). Next I'm going to try to reproduce the problem
with a heavy mmap load. Maybe that will trigger it.
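
A heavy-mmap test of this kind might look roughly like the sketch
below. It is not the reproducer used in this thread: the function name
mmap_write_verify(), the fill pattern and the file path are assumptions
made for illustration only. It writes a known pattern through a
MAP_SHARED mapping, flushes it with msync(), unmaps, and then re-reads
the file with read() and counts mismatching bytes.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical fill pattern: each byte depends on its offset and a seed. */
static unsigned char pattern_byte(size_t off, unsigned seed)
{
    return (unsigned char)((off * 31u + seed) & 0xffu);
}

/*
 * Write a pattern through a shared mapping, then verify it with read().
 * Returns the number of mismatching bytes, or -1 on a system call error.
 */
long mmap_write_verify(const char *path, size_t len, unsigned seed)
{
    assert(path != NULL);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    unsigned char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    for (size_t i = 0; i < len; i++)
        map[i] = pattern_byte(i, seed);

    /* Ask the kernel to write the dirty pages back, then drop the mapping. */
    int sync_rc = msync(map, len, MS_SYNC);
    int unmap_rc = munmap(map, len);
    if (sync_rc != 0 || unmap_rc != 0) {
        close(fd);
        return -1;
    }

    /* Re-read through the ordinary read() path and compare. */
    long bad = 0;
    size_t off = 0;
    unsigned char buf[4096];
    if (lseek(fd, 0, SEEK_SET) < 0) {
        close(fd);
        return -1;
    }
    while (off < len) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n <= 0) {
            close(fd);
            return -1;
        }
        for (ssize_t i = 0; i < n; i++, off++)
            if (buf[i] != pattern_byte(off, seed))
                bad++;
    }
    close(fd);
    return bad;
}
```

On a healthy kernel a call such as
mmap_write_verify("/var/tmp/mmaptest.bin", 64 << 20, 1) should return 0.
Note that the re-read is normally served from the page cache, so a test
like this would probably only notice lost writeback if it ran in a loop
under memory pressure, so that pages get evicted and re-read from disk.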

								Honza

* Re: 2.6.19 file content corruption on ext3
  2006-12-14 12:03           ` Jan Kara
@ 2006-12-15  9:30             ` Marc Haber
  2006-12-16  8:29               ` Marc Haber
  0 siblings, 1 reply; 154+ messages in thread
From: Marc Haber @ 2006-12-15  9:30 UTC (permalink / raw)
  To: linux-kernel

On Thu, Dec 14, 2006 at 01:03:41PM +0100, Jan Kara wrote:
> > On Sat, Dec 09, 2006 at 11:47:58AM +0100, Jan Kara wrote:
> > >   In the mean time
> > >   does mounting the filesystem with data=writeback help?
> > 
> > I have now nine hours uptime with data=writeback, and the file is
> > still OK. Looks good.
> > 
> > By this posting, I'm going to invoke murphy, so I'll report again
> > tomorrow.
>   Since you haven't written till today I assume that data=writeback does
> not have a problem.

It does not have a problem, right. Additionally, updating to 2.6.19.1
allowed me to remove data=writeback without the issue re-surfacing. I
suspect that the issue is fixed now.

Thanks for helping.

Greetings
Marc


* Re: 2.6.19 file content corruption on ext3
  2006-12-15  9:30             ` Marc Haber
@ 2006-12-16  8:29               ` Marc Haber
  0 siblings, 0 replies; 154+ messages in thread
From: Marc Haber @ 2006-12-16  8:29 UTC (permalink / raw)
  To: linux-kernel

On Fri, Dec 15, 2006 at 10:30:34AM +0100, Marc Haber wrote:
> Additionally, updating to 2.6.19.1
> allowed me to remove data=writeback without the issue re-surfacing. I
> suspect that the issue is fixed now.

Unfortunately, this suspicion proved wrong when the file was corrupted
again this morning.

Greetings
Marc


* Re: 2.6.19 file content corruption on ext3
  2006-12-07 15:57 2.6.19 file content corruption on ext3 Marc Haber
  2006-12-07 16:50 ` Phillip Susi
@ 2006-12-16 18:31 ` Florian Weimer
  2006-12-17 11:52   ` Andrew Morton
  2006-12-22 13:30 ` Daniel Drake
  2 siblings, 1 reply; 154+ messages in thread
From: Florian Weimer @ 2006-12-16 18:31 UTC (permalink / raw)
  To: Marc Haber; +Cc: linux-kernel

* Marc Haber:

> After updating to 2.6.19, Debian's apt control file
> /var/cache/apt/pkgcache.bin corrupts pretty frequently - like in under
> six hours.

I've seen that with Debian's 2.6.18 kernels as well.  Perhaps it's
related to this Debian bug?

<http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=401006>


* Re: 2.6.19 file content corruption on ext3
  2006-12-09  9:26   ` Marc Haber
@ 2006-12-16 18:43     ` Martin Michlmayr
  2006-12-16 19:18       ` Hugh Dickins
  2006-12-22 17:05       ` Marc Haber
  0 siblings, 2 replies; 154+ messages in thread
From: Martin Michlmayr @ 2006-12-16 18:43 UTC (permalink / raw)
  To: Marc Haber; +Cc: linux-kernel

* Marc Haber <mh+linux-kernel@zugschlus.de> [2006-12-09 10:26]:
> Unfortunately, I am lacking the knowledge needed to do this in an
> informed way. I am neither familiar enough with git nor do I possess
> the necessary C powers.

I wonder if what you're seeing is related to
http://lkml.org/lkml/2006/12/16/73

You said that you don't see any corruption with 2.6.18.  Can you try
to apply the patch from
http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d08b3851da41d0ee60851f2c75b118e1f7a5fc89
to 2.6.18 to see if the corruption shows up?
-- 
Martin Michlmayr
tbm@cyrius.com


* Re: 2.6.19 file content corruption on ext3
  2006-12-16 18:43     ` Martin Michlmayr
@ 2006-12-16 19:18       ` Hugh Dickins
  2006-12-16 21:29         ` Peter Zijlstra
  2006-12-17 13:52         ` Jan Kara
  2006-12-22 17:05       ` Marc Haber
  1 sibling, 2 replies; 154+ messages in thread
From: Hugh Dickins @ 2006-12-16 19:18 UTC (permalink / raw)
  To: Martin Michlmayr
  Cc: Marc Haber, Jan Kara, Peter Zijlstra, linux-mm, linux-kernel

On Sat, 16 Dec 2006, Martin Michlmayr wrote:
> * Marc Haber <mh+linux-kernel@zugschlus.de> [2006-12-09 10:26]:
> > Unfortunately, I am lacking the knowledge needed to do this in an
> > informed way. I am neither familiar enough with git nor do I possess
> > the necessary C powers.
> 
> I wonder if what you're seeing is related to
> http://lkml.org/lkml/2006/12/16/73
> 
> You said that you don't see any corruption with 2.6.18.  Can you try
> to apply the patch from
> http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d08b3851da41d0ee60851f2c75b118e1f7a5fc89
> to 2.6.18 to see if the corruption shows up?

I did wonder about the very first hunk of Peter's patch, where the
mapping->private_lock is unlocked earlier now in try_to_free_buffers,
before the clear_page_dirty.  I'm not at all familiar with that area,
I wonder if Jan has looked at that change, and might be able to say
whether it's good or not (earlier he worried about his JBD changes,
but they wouldn't be implicated if just 2.6.18+Peter's gives trouble).

Hugh


* Re: 2.6.19 file content corruption on ext3
  2006-12-16 19:18       ` Hugh Dickins
@ 2006-12-16 21:29         ` Peter Zijlstra
  2006-12-16 23:08           ` Hugh Dickins
  2006-12-17 13:52         ` Jan Kara
  1 sibling, 1 reply; 154+ messages in thread
From: Peter Zijlstra @ 2006-12-16 21:29 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Martin Michlmayr, Marc Haber, Jan Kara, linux-mm, linux-kernel

On Sat, 2006-12-16 at 19:18 +0000, Hugh Dickins wrote:
> On Sat, 16 Dec 2006, Martin Michlmayr wrote:
> > * Marc Haber <mh+linux-kernel@zugschlus.de> [2006-12-09 10:26]:
> > > Unfortunately, I am lacking the knowledge needed to do this in an
> > > informed way. I am neither familiar enough with git nor do I possess
> > > the necessary C powers.
> > 
> > I wonder if what you're seeing is related to
> > http://lkml.org/lkml/2006/12/16/73
> > 
> > You said that you don't see any corruption with 2.6.18.  Can you try
> > to apply the patch from
> > http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d08b3851da41d0ee60851f2c75b118e1f7a5fc89
> > to 2.6.18 to see if the corruption shows up?
> 
> I did wonder about the very first hunk of Peter's patch, where the
> mapping->private_lock is unlocked earlier now in try_to_free_buffers,
> before the clear_page_dirty.  I'm not at all familiar with that area,
> I wonder if Jan has looked at that change, and might be able to say
> whether it's good or not (earlier he worried about his JBD changes,
> but they wouldn't be implicated if just 2.6.18+Peter's gives trouble).

fs/buffer.c:2775

/*
 * try_to_free_buffers() checks if all the buffers on this particular page
 * are unused, and releases them if so.
 *
 * Exclusion against try_to_free_buffers may be obtained by either
 * locking the page or by holding its mapping's private_lock.
 *
 * If the page is dirty but all the buffers are clean then we need to
 * be sure to mark the page clean as well.  This is because the page
 * may be against a block device, and a later reattachment of buffers
 * to a dirty page will set *all* buffers dirty.  Which would corrupt
 * filesystem data on the same device.
 *
 * The same applies to regular filesystem pages: if all the buffers are
 * clean then we set the page clean and proceed.  To do that, we require
 * total exclusion from __set_page_dirty_buffers().  That is obtained with
 * private_lock.
 *
 * try_to_free_buffers() is non-blocking.
 */

Note the 3rd paragraph. Could I have opened up a race by moving that
unlock upwards, such that it is possible to re-attach buffers to the
page before it has been marked clean, which according to this text will
mark those buffers dirty and cause data corruption?

Hmm, how about something like this:

---
Moving the cleaning of the page out from under the private_lock opened
up a window where newly attached buffers might still see the page dirty
status and were thus marked (incorrectly) dirty themselves, resulting in
filesystem data corruption.

Close this by moving the cleaning of the page inside of the private_lock
scope again. However, it is not possible to call page_mkclean() from
within the private_lock (this violates locking order); thus introduce a
variant of test_clear_page_dirty() that does not call page_mkclean(),
and call page_mkclean() ourselves, outside of the private_lock, when we
did clean the page.

This is still safe because the page is still locked by means of
PG_locked.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/buffer.c                |   11 +++++++++--
 include/linux/page-flags.h |    1 +
 mm/page-writeback.c        |   10 ++++++++--
 3 files changed, 18 insertions(+), 4 deletions(-)

Index: linux-2.6-git/fs/buffer.c
===================================================================
--- linux-2.6-git.orig/fs/buffer.c	2006-12-16 22:18:24.000000000 +0100
+++ linux-2.6-git/fs/buffer.c	2006-12-16 22:22:17.000000000 +0100
@@ -42,6 +42,7 @@
 #include <linux/bitops.h>
 #include <linux/mpage.h>
 #include <linux/bit_spinlock.h>
+#include <linux/rmap.h>
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
 static void invalidate_bh_lrus(void);
@@ -2832,6 +2833,7 @@ int try_to_free_buffers(struct page *pag
 	struct address_space * const mapping = page->mapping;
 	struct buffer_head *buffers_to_free = NULL;
 	int ret = 0;
+	int must_clean = 0;
 
 	BUG_ON(!PageLocked(page));
 	if (PageWriteback(page))
@@ -2844,7 +2846,6 @@ int try_to_free_buffers(struct page *pag
 
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
-	spin_unlock(&mapping->private_lock);
 	if (ret) {
 		/*
 		 * If the filesystem writes its buffers by hand (eg ext3)
@@ -2858,9 +2859,15 @@ int try_to_free_buffers(struct page *pag
 		 * the page's buffers clean.  We discover that here and clean
 		 * the page also.
 		 */
-		if (test_clear_page_dirty(page))
+		if (__test_clear_page_dirty(page, 0)) {
 			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
+			if (mapping_cap_account_dirty(mapping))
+				must_clean = 1;
+		}
 	}
+	spin_unlock(&mapping->private_lock);
+	if (must_clean)
+		page_mkclean(page);
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
Index: linux-2.6-git/include/linux/page-flags.h
===================================================================
--- linux-2.6-git.orig/include/linux/page-flags.h	2006-12-16 22:19:56.000000000 +0100
+++ linux-2.6-git/include/linux/page-flags.h	2006-12-16 22:20:07.000000000 +0100
@@ -253,6 +253,7 @@ static inline void SetPageUptodate(struc
 
 struct page;	/* forward declaration */
 
+int __test_clear_page_dirty(struct page *page, int do_clean);
 int test_clear_page_dirty(struct page *page);
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);
Index: linux-2.6-git/mm/page-writeback.c
===================================================================
--- linux-2.6-git.orig/mm/page-writeback.c	2006-12-16 22:18:18.000000000 +0100
+++ linux-2.6-git/mm/page-writeback.c	2006-12-16 22:19:42.000000000 +0100
@@ -854,7 +854,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
  * Clear a page's dirty flag, while caring for dirty memory accounting. 
  * Returns true if the page was previously dirty.
  */
-int test_clear_page_dirty(struct page *page)
+int __test_clear_page_dirty(struct page *page, int do_clean)
 {
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
@@ -872,7 +872,8 @@ int test_clear_page_dirty(struct page *p
 		 * page is locked, which pins the address_space
 		 */
 		if (mapping_cap_account_dirty(mapping)) {
-			page_mkclean(page);
+			if (do_clean)
+				page_mkclean(page);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
 		return 1;
@@ -880,6 +881,11 @@ int test_clear_page_dirty(struct page *p
 	write_unlock_irqrestore(&mapping->tree_lock, flags);
 	return 0;
 }
+
+int test_clear_page_dirty(struct page *page)
+{
+	return __test_clear_page_dirty(page, 1);
+}
 EXPORT_SYMBOL(test_clear_page_dirty);
 
 /*
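
The locking pattern in the patch above can be illustrated in user
space. The following is only an analogy under stated assumptions, not
kernel code: struct fake_page, fake_mkclean() and
fake_try_to_free_buffers() are hypothetical names, a pthread mutex
stands in for mapping->private_lock, and the mapping_cap_account_dirty()
check is left out. The dirty flag is cleared while the lock is held, and
the call that must not run under the lock is deferred until after the
unlock, in the way the must_clean variable is used in the patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page and its mapping's private_lock. */
struct fake_page {
    pthread_mutex_t private_lock;
    bool dirty;         /* analogous to the page dirty flag */
    bool buffers_busy;  /* true if any buffer is still dirty or in use */
    bool has_buffers;   /* true while buffers are attached */
    int mkclean_calls;  /* counts deferred fake_mkclean() invocations */
};

/*
 * Stand-in for page_mkclean(): must not be called with private_lock held.
 * The caller is assumed to have exclusive access to the page otherwise,
 * which is the role PG_locked plays in the kernel.
 */
static void fake_mkclean(struct fake_page *p)
{
    p->mkclean_calls++;
}

/* Returns true if the buffers were dropped. */
bool fake_try_to_free_buffers(struct fake_page *p)
{
    bool dropped = false;
    bool must_clean = false;

    assert(p != NULL);

    pthread_mutex_lock(&p->private_lock);
    if (p->has_buffers && !p->buffers_busy) {
        p->has_buffers = false;
        dropped = true;
        /*
         * Clear the dirty flag while still holding the lock, so that
         * nobody attaching new buffers can observe a stale dirty state.
         */
        if (p->dirty) {
            p->dirty = false;
            must_clean = true;
        }
    }
    pthread_mutex_unlock(&p->private_lock);

    /* Deferred work that would violate lock ordering if done above. */
    if (must_clean)
        fake_mkclean(p);
    return dropped;
}
```

The point of the sketch is only the ordering: the decision is made and
recorded under the lock, and the side effect happens after the unlock.
Whether that ordering actually closes the race in the kernel is what
the testing requested in this thread would have to show.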




* Re: 2.6.19 file content corruption on ext3
  2006-12-16 21:29         ` Peter Zijlstra
@ 2006-12-16 23:08           ` Hugh Dickins
  0 siblings, 0 replies; 154+ messages in thread
From: Hugh Dickins @ 2006-12-16 23:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Martin Michlmayr, Marc Haber, Jan Kara, linux-mm, linux-kernel

On Sat, 16 Dec 2006, Peter Zijlstra wrote:
> Moving the cleaning of the page out from under the private_lock opened
> up a window where newly attached buffers might still see the page dirty
> status and were thus marked (incorrectly) dirty themselves, resulting in
> filesystem data corruption.

I'm not going to pretend to understand the buffers issues here:
people thought that change was safe originally, and I can't say
it's not - it just stood out as a potentially weakening change.

The patch you propose certainly looks like a good way out, if
that moved unlock really is a problem: your patch is very well
worth trying by those people seeing their corruption problems,
let's wait to hear their feedback.

Thanks!
Hugh


* Re: 2.6.19 file content corruption on ext3
  2006-12-16 18:31 ` Florian Weimer
@ 2006-12-17 11:52   ` Andrew Morton
  0 siblings, 0 replies; 154+ messages in thread
From: Andrew Morton @ 2006-12-17 11:52 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Marc Haber, linux-kernel, Peter Zijlstra, Hugh Dickins, Linus Torvalds

On Sat, 16 Dec 2006 19:31:25 +0100
Florian Weimer <fw@deneb.enyo.de> wrote:

> * Marc Haber:
> 
> > After updating to 2.6.19, Debian's apt control file
> > /var/cache/apt/pkgcache.bin corrupts pretty frequently - like in under
> > six hours.
> 
> I've seen that with Debian's 2.6.18 kernels as well.  Perhaps it's
> related to this Debian bug?
> 
> <http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=401006>

ugh, that's pretty damning.  And rtorrent uses MAP_SHARED.


* Re: 2.6.19 file content corruption on ext3
  2006-12-16 19:18       ` Hugh Dickins
  2006-12-16 21:29         ` Peter Zijlstra
@ 2006-12-17 13:52         ` Jan Kara
  1 sibling, 0 replies; 154+ messages in thread
From: Jan Kara @ 2006-12-17 13:52 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Martin Michlmayr, Marc Haber, Jan Kara, Peter Zijlstra, linux-mm,
	linux-kernel, Mikael Magnusson

> On Sat, 16 Dec 2006, Martin Michlmayr wrote:
> > * Marc Haber <mh+linux-kernel@zugschlus.de> [2006-12-09 10:26]:
> > > Unfortunately, I am lacking the knowledge needed to do this in an
> > > informed way. I am neither familiar enough with git nor do I possess
> > > the necessary C powers.
> > 
> > I wonder if what you're seeing is related to
> > http://lkml.org/lkml/2006/12/16/73
> > 
> > You said that you don't see any corruption with 2.6.18.  Can you try
> > to apply the patch from
> > http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d08b3851da41d0ee60851f2c75b118e1f7a5fc89
> > to 2.6.18 to see if the corruption shows up?
> 
> I did wonder about the very first hunk of Peter's patch, where the
> mapping->private_lock is unlocked earlier now in try_to_free_buffers,
> before the clear_page_dirty.  I'm not at all familiar with that area,
> I wonder if Jan has looked at that change, and might be able to say
> whether it's good or not (earlier he worried about his JBD changes,
> but they wouldn't be implicated if just 2.6.18+Peter's gives trouble).
  Thanks for the pointer. I was not aware of this change; I'll have a
look at it on Monday. Actually, Mikael has checked that he sees the
corruption even with all the JBD changes backed out, so I was going to
look for other changes in VFS that could cause that.

								Honza

* Re: 2.6.19 file content corruption on ext3
  2006-12-07 15:57 2.6.19 file content corruption on ext3 Marc Haber
  2006-12-07 16:50 ` Phillip Susi
  2006-12-16 18:31 ` Florian Weimer
@ 2006-12-22 13:30 ` Daniel Drake
  2006-12-22 17:03   ` Marc Haber
  2 siblings, 1 reply; 154+ messages in thread
From: Daniel Drake @ 2006-12-22 13:30 UTC (permalink / raw)
  To: Marc Haber; +Cc: linux-kernel

Marc Haber wrote:
> After updating to 2.6.19, Debian's apt control file
> /var/cache/apt/pkgcache.bin corrupts pretty frequently - like in under
> six hours. In that situation, "aptitude update" segfaults. When I
> delete the file and have apt recreate it, things are fine again for a
> few hours before the file is broken again and the segfaults start over.
> In all cases, umounting the file system and doing an fsck does not
> show issues with the file system.

Are you using wireless networking of any kind? If so which driver and 
security key system? Might be useful if you could post 'dmesg' output so 
that people can see the other hardware that you have.

Daniel



* Re: 2.6.19 file content corruption on ext3
  2006-12-22 13:30 ` Daniel Drake
@ 2006-12-22 17:03   ` Marc Haber
  0 siblings, 0 replies; 154+ messages in thread
From: Marc Haber @ 2006-12-22 17:03 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1188 bytes --]

On Fri, Dec 22, 2006 at 08:30:06AM -0500, Daniel Drake wrote:
> Marc Haber wrote:
> >After updating to 2.6.19, Debian's apt control file
> >/var/cache/apt/pkgcache.bin corrupts pretty frequently - like in under
> >six hours. In that situation, "aptitude update" segfaults. When I
> >delete the file and have apt recreate it, things are fine again for a
> >few hours before the file is broken again and the segfaults start over.
> >In all cases, umounting the file system and doing an fsck does not
> >show issues with the file system.
> 
> Are you using wireless networking of any kind?

Since the system in question is a colocated server box, I am pretty
sure that there is no wireless networking.

>  Might be useful if you could post 'dmesg' output so that people can
>  see the other hardware that you have.

I have attached what I could scrape from syslog.

Greetings
Marc


[-- Attachment #2: syslog --]
[-- Type: text/plain, Size: 17551 bytes --]

Dec 18 15:45:01 torres syslogd 1.4.1#17: restart.
Dec 18 15:45:01 torres kernel: klogd 1.4.1#17, log source = /proc/kmsg started.
Dec 18 15:45:01 torres kernel: Inspecting /boot/System.map-2.6.19.1-zgsrv
Dec 18 15:45:01 torres kernel: Loaded 26500 symbols from /boot/System.map-2.6.19.1-zgsrv.
Dec 18 15:45:01 torres kernel: Symbols match kernel version 2.6.19.
Dec 18 15:45:01 torres kernel: No module symbols loaded - kernel modules not enabled. 
Dec 18 15:45:01 torres kernel: Linux version 2.6.19.1-zgsrv (mh@nechayev) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #1 Sun Dec 17 12:44:56 UTC 2006
Dec 18 15:45:01 torres kernel: BIOS-provided physical RAM map:
Dec 18 15:45:01 torres kernel:  BIOS-e820: 0000000000000000 - 00000000000a0000 (usable)
Dec 18 15:45:01 torres kernel:  BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
Dec 18 15:45:01 torres kernel:  BIOS-e820: 0000000000100000 - 000000000f7f0000 (usable)
Dec 18 15:45:01 torres kernel:  BIOS-e820: 000000000f7f0000 - 000000000f7f3000 (ACPI NVS)
Dec 18 15:45:01 torres kernel:  BIOS-e820: 000000000f7f3000 - 000000000f800000 (ACPI data)
Dec 18 15:45:01 torres kernel:  BIOS-e820: 00000000ffff0000 - 0000000100000000 (reserved)
Dec 18 15:45:01 torres kernel: 0MB HIGHMEM available.
Dec 18 15:45:01 torres kernel: 247MB LOWMEM available.
Dec 18 15:45:01 torres kernel: Entering add_active_range(0, 0, 63472) 0 entries of 256 used
Dec 18 15:45:01 torres kernel: Zone PFN ranges:
Dec 18 15:45:01 torres kernel:   DMA             0 ->     4096
Dec 18 15:45:01 torres kernel:   Normal       4096 ->    63472
Dec 18 15:45:01 torres kernel:   HighMem     63472 ->    63472
Dec 18 15:45:01 torres kernel: early_node_map[1] active PFN ranges
Dec 18 15:45:01 torres kernel:     0:        0 ->    63472
Dec 18 15:45:01 torres kernel: On node 0 totalpages: 63472
Dec 18 15:45:01 torres kernel:   DMA zone: 32 pages used for memmap
Dec 18 15:45:01 torres kernel:   DMA zone: 0 pages reserved
Dec 18 15:45:01 torres kernel:   DMA zone: 4064 pages, LIFO batch:0
Dec 18 15:45:01 torres kernel:   Normal zone: 463 pages used for memmap
Dec 18 15:45:01 torres kernel:   Normal zone: 58913 pages, LIFO batch:15
Dec 18 15:45:01 torres kernel:   HighMem zone: 0 pages used for memmap
Dec 18 15:45:01 torres kernel: DMI 2.2 present.
Dec 18 15:45:01 torres kernel: ACPI: RSDP (v000 VIA694                                ) @ 0x000f8050
Dec 18 15:45:01 torres kernel: ACPI: RSDT (v001 VIA694 MSI ACPI 0x42302e31 AWRD 0x00000000) @ 0x0f7f3000
Dec 18 15:45:01 torres kernel: ACPI: FADT (v001 VIA694 MSI ACPI 0x42302e31 AWRD 0x00000000) @ 0x0f7f3040
Dec 18 15:45:01 torres kernel: ACPI: DSDT (v001 VIA694 AWRDACPI 0x00001000 MSFT 0x0100000c) @ 0x00000000
Dec 18 15:45:01 torres kernel: ACPI: PM-Timer IO Port: 0x4008
Dec 18 15:45:01 torres kernel: Allocating PCI resources starting at 10000000 (gap: 0f800000:f07f0000)
Dec 18 15:45:01 torres kernel: Detected 1466.361 MHz processor.
Dec 18 15:45:01 torres kernel: Built 1 zonelists.  Total pages: 62977
Dec 18 15:45:01 torres kernel: Kernel command line: root=/dev/hda1 ro vga=normal 
Dec 18 15:45:01 torres kernel: Enabling fast FPU save and restore... done.
Dec 18 15:45:01 torres kernel: Enabling unmasked SIMD FPU exception support... done.
Dec 18 15:45:01 torres kernel: Initializing CPU#0
Dec 18 15:45:01 torres kernel: PID hash table entries: 1024 (order: 10, 4096 bytes)
Dec 18 15:45:01 torres kernel: Console: colour VGA+ 80x25
Dec 18 15:45:01 torres kernel: Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
Dec 18 15:45:01 torres kernel: Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
Dec 18 15:45:01 torres kernel: Memory: 246964k/253888k available (2896k kernel code, 6368k reserved, 859k data, 204k init, 0k highmem)
Dec 18 15:45:01 torres kernel: virtual kernel memory layout:
Dec 18 15:45:01 torres kernel:     fixmap  : 0xfffea000 - 0xfffff000   (  84 kB)
Dec 18 15:45:01 torres kernel:     pkmap   : 0xff800000 - 0xffc00000   (4096 kB)
Dec 18 15:45:01 torres kernel:     vmalloc : 0xd0000000 - 0xff7fe000   ( 759 MB)
Dec 18 15:45:01 torres kernel:     lowmem  : 0xc0000000 - 0xcf7f0000   ( 247 MB)
Dec 18 15:45:01 torres kernel:       .init : 0xc04ae000 - 0xc04e1000   ( 204 kB)
Dec 18 15:45:01 torres kernel:       .data : 0xc03d40b2 - 0xc04aaff4   ( 859 kB)
Dec 18 15:45:01 torres kernel:       .text : 0xc0100000 - 0xc03d40b2   (2896 kB)
Dec 18 15:45:01 torres kernel: Checking if this processor honours the WP bit even in supervisor mode... Ok.
Dec 18 15:45:01 torres kernel: Calibrating delay using timer specific routine.. 2935.39 BogoMIPS (lpj=5870788)
Dec 18 15:45:01 torres kernel: Security Framework v1.0.0 initialized
Dec 18 15:45:01 torres kernel: Capability LSM initialized
Dec 18 15:45:01 torres kernel: Mount-cache hash table entries: 512
Dec 18 15:45:01 torres kernel: CPU: After generic identify, caps: 0383f9ff c1c3f9ff 00000000 00000000 00000000 00000000 00000000
Dec 18 15:45:01 torres kernel: CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
Dec 18 15:45:01 torres kernel: CPU: L2 Cache: 256K (64 bytes/line)
Dec 18 15:45:01 torres kernel: CPU: After all inits, caps: 0383f9ff c1c3f9ff 00000000 00000420 00000000 00000000 00000000
Dec 18 15:45:01 torres kernel: Intel machine check architecture supported.
Dec 18 15:45:01 torres kernel: Intel machine check reporting enabled on CPU#0.
Dec 18 15:45:01 torres kernel: Compat vDSO mapped to ffffe000.
Dec 18 15:45:01 torres kernel: CPU: AMD Athlon(tm) XP 1700+ stepping 02
Dec 18 15:45:01 torres kernel: Checking 'hlt' instruction... OK.
Dec 18 15:45:01 torres kernel: ACPI: Core revision 20060707
Dec 18 15:45:01 torres kernel: ACPI: setting ELCR to 1000 (from 1c00)
Dec 18 15:45:01 torres kernel: NET: Registered protocol family 16
Dec 18 15:45:01 torres kernel: ACPI: bus type pci registered
Dec 18 15:45:01 torres kernel: PCI: PCI BIOS revision 2.10 entry at 0xfb5c0, last bus=1
Dec 18 15:45:01 torres kernel: PCI: Using configuration type 1
Dec 18 15:45:01 torres kernel: Setting up standard PCI resources
Dec 18 15:45:01 torres kernel: ACPI: Interpreter enabled
Dec 18 15:45:01 torres kernel: ACPI: Using PIC for interrupt routing
Dec 18 15:45:01 torres kernel: ACPI: PCI Root Bridge [PCI0] (0000:00)
Dec 18 15:45:01 torres kernel: PCI: Probing PCI hardware (bus 00)
Dec 18 15:45:01 torres kernel: ACPI: Assume root bridge [\_SB_.PCI0] bus is 0
Dec 18 15:45:01 torres kernel: Disabling VIA memory write queue (PCI ID 3112, rev 00): [55] f9 & 1f -> 19
Dec 18 15:45:01 torres kernel: PCI quirk: region 6000-607f claimed by vt82c686 HW-mon
Dec 18 15:45:01 torres kernel: PCI quirk: region 5000-500f claimed by vt82c686 SMB
Dec 18 15:45:01 torres kernel: Boot video device is 0000:01:00.0
Dec 18 15:45:01 torres kernel: ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
Dec 18 15:45:01 torres kernel: ACPI: PCI Interrupt Link [LNKA] (IRQs 1 3 4 5 6 7 *10 11 12 14 15)
Dec 18 15:45:01 torres kernel: ACPI: PCI Interrupt Link [LNKB] (IRQs 1 3 4 5 6 7 10 11 12 14 15) *0, disabled.
Dec 18 15:45:01 torres kernel: ACPI: PCI Interrupt Link [LNKC] (IRQs 1 3 4 5 6 7 10 11 12 14 15) *0, disabled.
Dec 18 15:45:01 torres kernel: ACPI: PCI Interrupt Link [LNKD] (IRQs 1 3 4 5 6 7 10 *11 12 14 15)
Dec 18 15:45:01 torres kernel: Linux Plug and Play Support v0.97 (c) Adam Belay
Dec 18 15:45:01 torres kernel: pnp: PnP ACPI init
Dec 18 15:45:01 torres kernel: pnp: PnP ACPI: found 11 devices
Dec 18 15:45:01 torres kernel: PnPBIOS: Disabled by ACPI PNP
Dec 18 15:45:01 torres kernel: SCSI subsystem initialized
Dec 18 15:45:01 torres kernel: usbcore: registered new interface driver usbfs
Dec 18 15:45:01 torres kernel: usbcore: registered new interface driver hub
Dec 18 15:45:01 torres kernel: usbcore: registered new device driver usb
Dec 18 15:45:01 torres kernel: PCI: Using ACPI for IRQ routing
Dec 18 15:45:01 torres kernel: PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report
Dec 18 15:45:01 torres kernel: NetLabel: Initializing
Dec 18 15:45:01 torres kernel: NetLabel:  domain hash size = 128
Dec 18 15:45:01 torres kernel: NetLabel:  protocols = UNLABELED CIPSOv4
Dec 18 15:45:01 torres kernel: NetLabel:  unlabeled traffic allowed by default
Dec 18 15:45:01 torres kernel: PCI: Bridge: 0000:00:01.0
Dec 18 15:45:01 torres kernel:   IO window: disabled.
Dec 18 15:45:01 torres kernel:   MEM window: d4000000-d6ffffff
Dec 18 15:45:01 torres kernel:   PREFETCH window: 10000000-100fffff
Dec 18 15:45:01 torres kernel: PCI: Setting latency timer of device 0000:00:01.0 to 64
Dec 18 15:45:01 torres kernel: NET: Registered protocol family 2
Dec 18 15:45:01 torres kernel: IP route cache hash table entries: 2048 (order: 1, 8192 bytes)
Dec 18 15:45:01 torres kernel: TCP established hash table entries: 8192 (order: 3, 32768 bytes)
Dec 18 15:45:01 torres kernel: TCP bind hash table entries: 4096 (order: 2, 16384 bytes)
Dec 18 15:45:01 torres kernel: TCP: Hash tables configured (established 8192 bind 4096)
Dec 18 15:45:01 torres kernel: TCP reno registered
Dec 18 15:45:01 torres kernel: Machine check exception polling timer started.
Dec 18 15:45:01 torres kernel: apm: BIOS version 1.2 Flags 0x07 (Driver version 1.16ac)
Dec 18 15:45:01 torres kernel: apm: overridden by ACPI.
Dec 18 15:45:01 torres kernel: audit: initializing netlink socket (disabled)
Dec 18 15:45:01 torres kernel: audit(1166453090.432:1): initialized
Dec 18 15:45:01 torres kernel: SGI XFS with no debug enabled
Dec 18 15:45:01 torres kernel: io scheduler noop registered
Dec 18 15:45:01 torres kernel: io scheduler anticipatory registered (default)
Dec 18 15:45:01 torres kernel: io scheduler deadline registered
Dec 18 15:45:01 torres kernel: io scheduler cfq registered
Dec 18 15:45:01 torres kernel: Applying VIA southbridge workaround.
Dec 18 15:45:01 torres kernel: ACPI: Power Button (FF) [PWRF]
Dec 18 15:45:01 torres kernel: ACPI: Power Button (CM) [PWRB]
Dec 18 15:45:01 torres kernel: ACPI: Sleep Button (CM) [SLPB]
Dec 18 15:45:01 torres kernel: isapnp: Scanning for PnP cards...
Dec 18 15:45:01 torres kernel: isapnp: No Plug & Play device found
Dec 18 15:45:01 torres kernel: Linux agpgart interface v0.101 (c) Dave Jones
Dec 18 15:45:01 torres kernel: agpgart: Detected VIA KLE133 chipset
Dec 18 15:45:01 torres kernel: agpgart: AGP aperture is 64M @ 0xd0000000
Dec 18 15:45:01 torres kernel: Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing disabled
Dec 18 15:45:01 torres kernel: serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
Dec 18 15:45:01 torres kernel: 00:08: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
Dec 18 15:45:01 torres kernel: FDC 0 is a post-1991 82077
Dec 18 15:45:01 torres kernel: HP CISS Driver (v 3.6.10)
Dec 18 15:45:01 torres kernel: Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
Dec 18 15:45:01 torres kernel: ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
Dec 18 15:45:01 torres kernel: VP_IDE: IDE controller at PCI slot 0000:00:07.1
Dec 18 15:45:01 torres kernel: VP_IDE: chipset revision 6
Dec 18 15:45:01 torres kernel: VP_IDE: not 100%% native mode: will probe irqs later
Dec 18 15:45:01 torres kernel: VP_IDE: VIA vt82c686b (rev 40) IDE UDMA100 controller on pci0000:00:07.1
Dec 18 15:45:01 torres kernel:     ide0: BM-DMA at 0xe000-0xe007, BIOS settings: hda:DMA, hdb:pio
Dec 18 15:45:01 torres kernel:     ide1: BM-DMA at 0xe008-0xe00f, BIOS settings: hdc:pio, hdd:pio
Dec 18 15:45:01 torres kernel: Probing IDE interface ide0...
Dec 18 15:45:01 torres kernel: hda: WDC WD400BB-75DEA0, ATA DISK drive
Dec 18 15:45:01 torres kernel: ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
Dec 18 15:45:01 torres kernel: Probing IDE interface ide1...
Dec 18 15:45:01 torres kernel: Probing IDE interface ide1...
Dec 18 15:45:01 torres kernel: hda: max request size: 128KiB
Dec 18 15:45:01 torres kernel: hda: Host Protected Area detected.
Dec 18 15:45:01 torres kernel: ^Icurrent capacity is 78125000 sectors (40000 MB)
Dec 18 15:45:01 torres kernel: ^Inative  capacity is 78125040 sectors (40000 MB)
Dec 18 15:45:01 torres kernel: hda: Host Protected Area disabled.
Dec 18 15:45:01 torres kernel: hda: 78125040 sectors (40000 MB) w/2048KiB Cache, CHS=65535/16/63, UDMA(100)
Dec 18 15:45:01 torres kernel: hda: cache flushes not supported
Dec 18 15:45:01 torres kernel:  hda: hda1 hda2 hda3
Dec 18 15:45:01 torres kernel: 3ware Storage Controller device driver for Linux v1.26.02.001.
Dec 18 15:45:01 torres kernel: usbmon: debugfs is not available
Dec 18 15:45:01 torres kernel: ohci_hcd: 2006 August 04 USB 1.1 'Open' Host Controller (OHCI) Driver (PCI)
Dec 18 15:45:01 torres kernel: USB Universal Host Controller Interface driver v3.0
Dec 18 15:45:01 torres kernel: usbcore: registered new interface driver hiddev
Dec 18 15:45:01 torres kernel: usbcore: registered new interface driver usbhid
Dec 18 15:45:01 torres kernel: drivers/usb/input/hid-core.c: v2.6:USB HID core driver
Dec 18 15:45:01 torres kernel: usbcore: registered new interface driver usbserial
Dec 18 15:45:01 torres kernel: drivers/usb/serial/usb-serial.c: USB Serial support registered for generic
Dec 18 15:45:01 torres kernel: usbcore: registered new interface driver usbserial_generic
Dec 18 15:45:01 torres kernel: drivers/usb/serial/usb-serial.c: USB Serial Driver core
Dec 18 15:45:01 torres kernel: PNP: PS/2 Controller [PNP0303:PS2K] at 0x60,0x64 irq 1
Dec 18 15:45:01 torres kernel: PNP: PS/2 controller doesn't have AUX irq; using default 12
Dec 18 15:45:01 torres kernel: serio: i8042 KBD port at 0x60,0x64 irq 1
Dec 18 15:45:01 torres kernel: mice: PS/2 mouse device common for all mice
Dec 18 15:45:01 torres kernel: md: raid1 personality registered for level 1
Dec 18 15:45:01 torres kernel: raid6: int32x1    558 MB/s
Dec 18 15:45:01 torres kernel: raid6: int32x2    565 MB/s
Dec 18 15:45:01 torres kernel: raid6: int32x4    405 MB/s
Dec 18 15:45:01 torres kernel: raid6: int32x8    421 MB/s
Dec 18 15:45:01 torres kernel: raid6: mmxx1     1169 MB/s
Dec 18 15:45:01 torres kernel: raid6: mmxx2     1870 MB/s
Dec 18 15:45:01 torres kernel: raid6: sse1x1     586 MB/s
Dec 18 15:45:01 torres kernel: raid6: sse1x2    1178 MB/s
Dec 18 15:45:01 torres kernel: raid6: using algorithm sse1x2 (1178 MB/s)
Dec 18 15:45:01 torres kernel: md: raid6 personality registered for level 6
Dec 18 15:45:01 torres kernel: md: raid5 personality registered for level 5
Dec 18 15:45:01 torres kernel: md: raid4 personality registered for level 4
Dec 18 15:45:01 torres kernel: raid5: automatically using best checksumming function: pIII_sse
Dec 18 15:45:01 torres kernel:    pIII_sse  :  3401.000 MB/sec
Dec 18 15:45:01 torres kernel: raid5: using function: pIII_sse (3401.000 MB/sec)
Dec 18 15:45:01 torres kernel: device-mapper: ioctl: 4.10.0-ioctl (2006-09-14) initialised: dm-devel@redhat.com
Dec 18 15:45:01 torres kernel: TCP cubic registered
Dec 18 15:45:01 torres kernel: NET: Registered protocol family 1
Dec 18 15:45:01 torres kernel: NET: Registered protocol family 17
Dec 18 15:45:01 torres kernel: NET: Registered protocol family 15
Dec 18 15:45:01 torres kernel: Using IPI Shortcut mode
Dec 18 15:45:01 torres kernel: ACPI: (supports S0 S1 S4 S5)
Dec 18 15:45:01 torres kernel: Time: tsc clocksource has been installed.
Dec 18 15:45:01 torres kernel: input: AT Translated Set 2 keyboard as /class/input/input0
Dec 18 15:45:01 torres kernel: md: Autodetecting RAID arrays.
Dec 18 15:45:01 torres kernel: md: autorun ...
Dec 18 15:45:01 torres kernel: md: ... autorun DONE.
Dec 18 15:45:01 torres kernel: kjournald starting.  Commit interval 5 seconds
Dec 18 15:45:01 torres kernel: EXT3-fs: mounted filesystem with ordered data mode.
Dec 18 15:45:01 torres kernel: VFS: Mounted root (ext3 filesystem) readonly.
Dec 18 15:45:01 torres kernel: Freeing unused kernel memory: 204k freed
Dec 18 15:45:01 torres kernel: Adding 1952992k swap on /dev/hda2.  Priority:-1 extents:1 across:1952992k
Dec 18 15:45:01 torres kernel: EXT3 FS on hda1, internal journal
Dec 18 15:45:01 torres kernel: Linux Tulip driver version 1.1.14 (May 11, 2002)
Dec 18 15:45:01 torres kernel: ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 11
Dec 18 15:45:01 torres kernel: PCI: setting IRQ 11 as level-triggered
Dec 18 15:45:01 torres kernel: ACPI: PCI Interrupt 0000:00:0f.0[A] -> Link [LNKD] -> GSI 11 (level, low) -> IRQ 11
Dec 18 15:45:01 torres kernel: tulip0:  MII transceiver #1 config 1000 status 786d advertising 05e1.
Dec 18 15:45:01 torres kernel: eth0: ADMtek Comet rev 17 at Port 0xec00, 00:10:DC:6A:AA:0E, IRQ 11.
Dec 18 15:45:01 torres kernel: kjournald starting.  Commit interval 5 seconds
Dec 18 15:45:01 torres kernel: EXT3 FS on dm-0, internal journal
Dec 18 15:45:01 torres kernel: EXT3-fs: mounted filesystem with ordered data mode.
Dec 18 15:45:01 torres kernel: kjournald starting.  Commit interval 5 seconds
Dec 18 15:45:01 torres kernel: EXT3 FS on dm-1, internal journal
Dec 18 15:45:01 torres kernel: EXT3-fs: mounted filesystem with ordered data mode.
Dec 18 15:45:01 torres kernel: kjournald starting.  Commit interval 5 seconds
Dec 18 15:45:01 torres kernel: EXT3 FS on dm-2, internal journal
Dec 18 15:45:01 torres kernel: EXT3-fs: mounted filesystem with ordered data mode.
Dec 18 15:45:01 torres kernel: NET: Registered protocol family 10
Dec 18 15:45:01 torres kernel: lo: Disabled Privacy Extensions
Dec 18 15:45:03 torres kernel: 0000:00:0f.0: tulip_stop_rxtx() failed
Dec 18 15:45:03 torres kernel: eth0: Setting full-duplex based on MII#1 link partner capability of 41e1.
Dec 18 15:45:10 torres kernel: eth0: no IPv6 routers present

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-16 18:43     ` Martin Michlmayr
  2006-12-16 19:18       ` Hugh Dickins
@ 2006-12-22 17:05       ` Marc Haber
  1 sibling, 0 replies; 154+ messages in thread
From: Marc Haber @ 2006-12-22 17:05 UTC (permalink / raw)
  To: linux-kernel

On Sat, Dec 16, 2006 at 06:43:10PM +0000, Martin Michlmayr wrote:
> * Marc Haber <mh+linux-kernel@zugschlus.de> [2006-12-09 10:26]:
> > Unfortunately, I am lacking the knowledge needed to do this in an
> > informed way. I am neither familiar enough with git nor do I possess
> > the necessary C powers.
> 
> I wonder if what you're seeing is related to
> http://lkml.org/lkml/2006/12/16/73
> 
> You said that you don't see any corruption with 2.6.18.  Can you try
> to apply the patch from
> http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d08b3851da41d0ee60851f2c75b118e1f7a5fc89
> to 2.6.18 to see if the corruption shows up?

Since I am no longer seeing the issue after easing the memory load, I
doubt that this would make sense.

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Mannheim, Germany  |  lose things."    Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature |  How to make an American Quilt | Fax: *49 621 72739835

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-29 18:52                                 ` maximilian attems
@ 2006-12-29 19:14                                   ` Dave Jones
  0 siblings, 0 replies; 154+ messages in thread
From: Dave Jones @ 2006-12-29 19:14 UTC (permalink / raw)
  To: maximilian attems; +Cc: linux-kernel

On Fri, Dec 29, 2006 at 07:52:15PM +0100, maximilian attems wrote:
 
 > > The only -mm stuff I recall being in the Fedora 2.6.18 is
 > > the inode-diet stuff which ended up in 2.6.19, though the xmas
 > > break has left my head somewhat empty so I may be forgetting something.
 > > What patch in particular are you talking about?
 > 
 > it's no longer visible in the FC6 cvs, due to rebase
 >  but its name was linux-2.6-mm-tracking-dirty-pages.patch
 > it is an earlier amalgam of the merged patch series:
 >    - mm: tracking shared dirty pages
 >    - mm: balance dirty pages
 >    - mm: optimize the new mprotect() code a bit
 >    - mm: small cleanup of install_page()
 >    - mm: fixup do_wp_page()
 >    - mm: msync() cleanup (closes: #394392)

Ohh, that. Yes. I had forgotten all about that.
I've been hitting the nog a little too hard :)

		Dave

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-29 15:02                               ` Dave Jones
@ 2006-12-29 18:52                                 ` maximilian attems
  2006-12-29 19:14                                   ` Dave Jones
  0 siblings, 1 reply; 154+ messages in thread
From: maximilian attems @ 2006-12-29 18:52 UTC (permalink / raw)
  To: Dave Jones, linux-kernel

On Fri, Dec 29, 2006 at 10:02:53AM -0500, Dave Jones wrote:
> On Fri, Dec 29, 2006 at 10:23:14AM +0100, maximilian attems wrote:
>  > > On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote:
<snipp>
>  > >  > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
>  > >  > > (or older)?
>  > >  > 
>  > >  > Well, that was a really _old_ fedora kernel. I guarantee you it didn't 
>  > >  > have the page throttling patches in it, those were written this summer. So 
>  > >  > it would either have to be Fedora carrying around another patch that just 
>  > >  > happens to result in the same corruption for _years_, or it's the same 
>  > >  > bug.
>  > > 
>  > > The only notable VM patch in Fedora kernels of that vintage that I recall
>  > > was Ingo's 4g/4g thing.
>  > 
>  > no, the Fedora 2.6.18 kernel is affected.
> 
> I wasn't denying that, but Linus was talking about a 2.6.5 Fedora kernel.
> 
>  > it carries the same -mm patches that Debian backported
>  > for LSB 3.1 compliance.
> 
> The only -mm stuff I recall being in the Fedora 2.6.18 is
> the inode-diet stuff which ended up in 2.6.19, though the xmas
> break has left my head somewhat empty so I may be forgetting something.
> What patch in particular are you talking about?

it's no longer visible in the FC6 cvs, due to rebase
 but its name was linux-2.6-mm-tracking-dirty-pages.patch
it is an earlier amalgam of the merged patch series:
   - mm: tracking shared dirty pages
   - mm: balance dirty pages
   - mm: optimize the new mprotect() code a bit
   - mm: small cleanup of install_page()
   - mm: fixup do_wp_page()
   - mm: msync() cleanup (closes: #394392)

--
maks

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-28 19:00                     ` Linus Torvalds
  2006-12-28 19:05                       ` Petri Kaukasoina
  2006-12-28 21:24                       ` Linus Torvalds
@ 2006-12-29 17:49                       ` Guillaume Chazarain
  2 siblings, 0 replies; 154+ messages in thread
From: Guillaume Chazarain @ 2006-12-29 17:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Marc Haber, Andrew Morton, Nick Piggin, andrei.popa,
	Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins,
	Florian Weimer, Martin Michlmayr

Linus Torvalds a écrit :
> going back to Linux-2.6.5 at least, according to one tester).
>   

I apologize for the confusion, but it just occurred to me that I was
actually experiencing a totally different problem: I set a root filesystem
of 3MiB for qemu, so the test program just didn't have enough space for
its file.

-- 
Guillaume


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-29  9:23                             ` maximilian attems
@ 2006-12-29 15:02                               ` Dave Jones
  2006-12-29 18:52                                 ` maximilian attems
  0 siblings, 1 reply; 154+ messages in thread
From: Dave Jones @ 2006-12-29 15:02 UTC (permalink / raw)
  To: maximilian attems; +Cc: linux-kernel

On Fri, Dec 29, 2006 at 10:23:14AM +0100, maximilian attems wrote:
 > > On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote:
 > >  > 
 > >  > 
 > >  > On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
 > >  > > > me up), and that seems to show the corruption going way way back (ie going 
 > >  > > > back to Linux-2.6.5 at least, according to one tester).
 > >  > > 
 > >  > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
 > >  > > (or older)?
 > >  > 
 > >  > Well, that was a really _old_ fedora kernel. I guarantee you it didn't 
 > >  > have the page throttling patches in it, those were written this summer. So 
 > >  > it would either have to be Fedora carrying around another patch that just 
 > >  > happens to result in the same corruption for _years_, or it's the same 
 > >  > bug.
 > > 
 > > The only notable VM patch in Fedora kernels of that vintage that I recall
 > > was Ingo's 4g/4g thing.
 > 
 > no, the Fedora 2.6.18 kernel is affected.

I wasn't denying that, but Linus was talking about a 2.6.5 Fedora kernel.

 > it carries the same -mm patches that Debian backported
 > for LSB 3.1 compliance.

The only -mm stuff I recall being in the Fedora 2.6.18 is
the inode-diet stuff which ended up in 2.6.19, though the xmas
break has left my head somewhat empty so I may be forgetting something.
What patch in particular are you talking about?

		Dave

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-28 19:39                           ` Dave Jones
  2006-12-28 20:10                             ` Arjan van de Ven
@ 2006-12-29  9:23                             ` maximilian attems
  2006-12-29 15:02                               ` Dave Jones
  1 sibling, 1 reply; 154+ messages in thread
From: maximilian attems @ 2006-12-29  9:23 UTC (permalink / raw)
  To: davej; +Cc: linux-kernel

> On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote:
>  > 
>  > 
>  > On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
>  > > > me up), and that seems to show the corruption going way way back (ie going 
>  > > > back to Linux-2.6.5 at least, according to one tester).
>  > > 
>  > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
>  > > (or older)?
>  > 
>  > Well, that was a really _old_ fedora kernel. I guarantee you it didn't 
>  > have the page throttling patches in it, those were written this summer. So 
>  > it would either have to be Fedora carrying around another patch that just 
>  > happens to result in the same corruption for _years_, or it's the same 
>  > bug.
> 
> The only notable VM patch in Fedora kernels of that vintage that I recall
> was Ingo's 4g/4g thing.
> 
> 		Dave

no, the Fedora 2.6.18 kernel is affected.
it carries the same -mm patches that Debian backported
for LSB 3.1 compliance.

-- 
maks

ps sorry for stripping cc, only downloaded that message raw.

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-29  1:38                             ` Linus Torvalds
@ 2006-12-29  1:59                               ` Andrew Morton
  0 siblings, 0 replies; 154+ messages in thread
From: Andrew Morton @ 2006-12-29  1:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, guichaz, ranma, gordonfarquharson, mh+linux-kernel,
	nickpiggin, andrei.popa, linux-kernel, a.p.zijlstra, hugh, fw,
	tbm, arjan, kenneth.w.chen

On Thu, 28 Dec 2006 17:38:38 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> in 
> the hope that somebody else is working on this corruption issue and is 
> interested..

What corruption issue? ;)


I'm finding that the corruption happens trivially with your test app, but
apparently doesn't happen at all with ext2 or ext3, data=writeback.  Maybe
it will happen with increased rarity, but the difference is quite stark.

Removing the

                err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
                                        NULL, journal_dirty_data_fn);

from ext3_ordered_writepage() fixes things up.

The things which journal_submit_data_buffers() does after dropping all the
locks are ...  disturbing - I don't think we have sufficient tests in there
to ensure that the buffer is still where we think it is after we retake
locks (they're slippery little buggers).  But that wouldn't explain it
anyway.

It's inefficient that journal_dirty_data() will put these locked, clean
buffers onto BJ_SyncData instead of BJ_Locked, but
journal_submit_data_buffers() seems to dtrt with them.

So no theory yet.  Maybe ext3 is just altering timing.  But the difference
is really large..



Disabling all the WB_SYNC_NONE stuff and making everything go synchronous
everywhere has no effect.  Disabling bdi_write_congested() has no effect.




^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-28 22:50                           ` David Miller
  2006-12-28 23:01                             ` Linus Torvalds
@ 2006-12-29  1:38                             ` Linus Torvalds
  2006-12-29  1:59                               ` Andrew Morton
  1 sibling, 1 reply; 154+ messages in thread
From: Linus Torvalds @ 2006-12-29  1:38 UTC (permalink / raw)
  To: David Miller
  Cc: akpm, guichaz, ranma, gordonfarquharson, mh+linux-kernel,
	nickpiggin, andrei.popa, linux-kernel, a.p.zijlstra, hugh, fw,
	tbm, arjan, kenneth.w.chen

[-- Attachment #1: Type: TEXT/PLAIN, Size: 9586 bytes --]


Btw, much cleaned-up page tracing patch here, in case anybody cares (and
"test.c" attached, although I don't think it changed since last time).

The test.c output is a bit hard to read at times, since it will give 
offsets in bytes as hex (ie "00a77664" means page frame 00000a77, and byte 
664h within that page), while the kernel output is obvioiusly the page 
indexes (but the page fault _addresses_ can contain information about the 
exact byte in a page, so you can match them up when some kernel event is 
related to a page fault).

So both forms are necessary/logical, but it means that to match things up, 
you often need to ignore the last three hex digits of the address that 
"test.c" outputs.

This one also adds traces for the tags and the writeback activity, but 
since I'm going out for birthday dinner, I won't have time to try to 
actually analyse the trace I have.. Which is why I'm sending it out, in 
the hope that somebody else is working on this corruption issue and is 
interested..

		Linus

----
diff --git a/fs/buffer.c b/fs/buffer.c
index 263f88e..f5e132a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -722,6 +722,7 @@ int __set_page_dirty_buffers(struct page *page)
 			set_buffer_dirty(bh);
 			bh = bh->b_this_page;
 		} while (bh != head);
+		PAGE_TRACE(page, "dirtied buffers");
 	}
 	spin_unlock(&mapping->private_lock);
 
@@ -734,6 +735,7 @@ int __set_page_dirty_buffers(struct page *page)
 			__inc_zone_page_state(page, NR_FILE_DIRTY);
 			task_io_account_write(PAGE_CACHE_SIZE);
 		}
+		PAGE_TRACE(page, "setting TAG_DIRTY");
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 350878a..0cf3dce 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -91,6 +91,14 @@
 #define PG_nosave_free		18	/* Used for system suspend/resume */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
+#define SetPageInteresting(page) set_bit(PG_arch_1, &(page)->flags)
+#define PageInteresting(page)	test_bit(PG_arch_1, &(page)->flags)
+
+#define PAGE_TRACE(page, msg, arg...) do {	 				\
+	if (PageInteresting(page))	 					\
+		printk(KERN_DEBUG "PG %08lx: %s:%d " msg "\n", 			\
+			(page)->index, __FILE__, __LINE__ ,##arg );		\
+} while (0)
 
 #if (BITS_PER_LONG > 32)
 /*
@@ -183,32 +191,38 @@ static inline void SetPageUptodate(struct page *page)
 #define PageWriteback(page)	test_bit(PG_writeback, &(page)->flags)
 #define SetPageWriteback(page)						\
 	do {								\
-		if (!test_and_set_bit(PG_writeback,			\
-				&(page)->flags))			\
+		if (!test_and_set_bit(PG_writeback, &(page)->flags)) {	\
+			PAGE_TRACE(page, "set writeback");		\
 			inc_zone_page_state(page, NR_WRITEBACK);	\
+		}							\
 	} while (0)
 #define TestSetPageWriteback(page)					\
 	({								\
 		int ret;						\
 		ret = test_and_set_bit(PG_writeback,			\
 					&(page)->flags);		\
-		if (!ret)						\
+		if (!ret) {						\
+			PAGE_TRACE(page, "set writeback");		\
 			inc_zone_page_state(page, NR_WRITEBACK);	\
+		}							\
 		ret;							\
 	})
 #define ClearPageWriteback(page)					\
 	do {								\
-		if (test_and_clear_bit(PG_writeback,			\
-				&(page)->flags))			\
+		if (test_and_clear_bit(PG_writeback, &(page)->flags)) {	\
+			PAGE_TRACE(page, "end writeback");		\
 			dec_zone_page_state(page, NR_WRITEBACK);	\
+		}							\
 	} while (0)
 #define TestClearPageWriteback(page)					\
 	({								\
 		int ret;						\
 		ret = test_and_clear_bit(PG_writeback,			\
 				&(page)->flags);			\
-		if (ret)						\
+		if (ret) {						\
+			PAGE_TRACE(page, "end writeback");		\
 			dec_zone_page_state(page, NR_WRITEBACK);	\
+		}							\
 		ret;							\
 	})
 
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 5c26818..7735b83 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -79,7 +79,7 @@ config DEBUG_KERNEL
 
 config LOG_BUF_SHIFT
 	int "Kernel log buffer size (16 => 64KB, 17 => 128KB)" if DEBUG_KERNEL
-	range 12 21
+	range 12 24
 	default 17 if S390 || LOCKDEP
 	default 16 if X86_NUMAQ || IA64
 	default 15 if SMP
diff --git a/mm/filemap.c b/mm/filemap.c
index 8332c77..4df7d35 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -116,6 +116,7 @@ void __remove_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
 
+	PAGE_TRACE(page, "Removing page cache");
 	radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
 	mapping->nrpages--;
@@ -421,6 +422,23 @@ int filemap_write_and_wait_range(struct address_space *mapping,
 	return err;
 }
 
+static noinline int is_interesting(struct address_space *mapping)
+{
+	struct inode *inode = mapping->host;
+	struct dentry *dentry;
+	int retval = 0;
+
+	spin_lock(&dcache_lock);
+	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
+		if (strcmp(dentry->d_name.name, "mapfile"))
+			continue;
+		retval = 1;
+		break;
+	}
+	spin_unlock(&dcache_lock);
+	return retval;
+}
+
 /**
  * add_to_page_cache - add newly allocated pagecache pages
  * @page:	page to add
@@ -439,6 +457,9 @@ int add_to_page_cache(struct page *page, struct address_space *mapping,
 {
 	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 
+	if (is_interesting(mapping))
+		SetPageInteresting(page);
+
 	if (error == 0) {
 		write_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
diff --git a/mm/memory.c b/mm/memory.c
index 563792f..20af32f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -667,6 +667,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			if (unlikely(!page))
 				continue;
+			PAGE_TRACE(page, "unmapped at %08lx", addr);
 			if (unlikely(details) && details->nonlinear_vma
 			    && linear_page_index(details->nonlinear_vma,
 						addr) != page->index)
@@ -1605,6 +1606,7 @@ gotten:
 		 */
 		ptep_clear_flush(vma, address, page_table);
 		set_pte_at(mm, address, page_table, entry);
+		PAGE_TRACE(new_page, "write fault at %08lx", address);
 		update_mmu_cache(vma, address, entry);
 		lru_cache_add_active(new_page);
 		page_add_new_anon_rmap(new_page, vma, address);
@@ -2249,6 +2251,7 @@ retry:
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		if (write_access)
 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		PAGE_TRACE(new_page, "mapping at %08lx (%s)", address, write_access ? "write" : "read");
 		set_pte_at(mm, address, page_table, entry);
 		if (anon) {
 			inc_mm_counter(mm, anon_rss);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b3a198c..15f3aaf 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -773,6 +773,7 @@ int __set_page_dirty_nobuffers(struct page *page)
 				__inc_zone_page_state(page, NR_FILE_DIRTY);
 				task_io_account_write(PAGE_CACHE_SIZE);
 			}
+			PAGE_TRACE(page, "setting TAG_DIRTY");
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
@@ -813,6 +814,7 @@ int fastcall set_page_dirty(struct page *page)
 		if (!spd)
 			spd = __set_page_dirty_buffers;
 #endif
+		PAGE_TRACE(page, "setting dirty");
 		return (*spd)(page);
 	}
 	if (!PageDirty(page)) {
@@ -867,6 +869,7 @@ int clear_page_dirty_for_io(struct page *page)
 
 	if (TestClearPageDirty(page)) {
 		if (mapping_cap_account_dirty(mapping)) {
+			PAGE_TRACE(page, "clean_for_io");
 			page_mkclean(page);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
@@ -886,10 +889,12 @@ int test_clear_page_writeback(struct page *page)
 
 		write_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestClearPageWriteback(page);
-		if (ret)
+		if (ret) {
+			PAGE_TRACE(page, "clearing TAG_WRITEBACK");
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
+		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestClearPageWriteback(page);
@@ -907,14 +912,18 @@ int test_set_page_writeback(struct page *page)
 
 		write_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestSetPageWriteback(page);
-		if (!ret)
+		if (!ret) {
+			PAGE_TRACE(page, "setting TAG_WRITEBACK");
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
-		if (!PageDirty(page))
+		}
+		if (!PageDirty(page)) {
+			PAGE_TRACE(page, "clearing TAG_DIRTY");
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
+		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestSetPageWriteback(page);
diff --git a/mm/rmap.c b/mm/rmap.c
index 57306fa..e6b4676 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -448,6 +448,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
 	if (pte_dirty(*pte) || pte_write(*pte)) {
 		pte_t entry;
 
+		PAGE_TRACE(page, "cleaning PTE %08lx", address);
 		flush_cache_page(vma, address, pte_pfn(*pte));
 		entry = ptep_clear_flush(vma, address, pte);
 		entry = pte_wrprotect(entry);
@@ -637,6 +638,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		goto out_unmap;
 	}
 
+	PAGE_TRACE(page, "unmapping from %08lx", address);
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
 	pteval = ptep_clear_flush(vma, address, pte);
@@ -767,6 +769,7 @@ static void try_to_unmap_cluster(unsigned long cursor,
 		if (ptep_clear_flush_young(vma, address, pte))
 			continue;
 
+		PAGE_TRACE(page, "unmapping from %08lx", address);
 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address, pte_pfn(*pte));
 		pteval = ptep_clear_flush(vma, address, pte);

[-- Attachment #2: Type: TEXT/PLAIN, Size: 2975 bytes --]

#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

#define TARGETSIZE (22 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)

static void fillmem(void *start, int nr)
{
	memset(start, nr, CHUNKSIZE);
}

#define page_offset(buf, off) (unsigned)((unsigned long)(buf)+(off)-(unsigned long)(mapping))

static int chunkorder[NRCHUNKS];
static char *mapping;

static int order(int nr)
{
	int i;
	if (nr < 0 || nr >= NRCHUNKS)
		return -1;
	for (i = 0; i < NRCHUNKS; i++)
		if (chunkorder[i] == nr)
			return i;
	return -2;
}

static void checkmem(void *buf, int nr)
{
	unsigned int start = ~0u, end = 0;
	unsigned char c = nr, *p = buf, differs = 0;
	int i;
	for (i = 0; i < CHUNKSIZE; i++) {
		unsigned char got = *p++;
		if (got != c) {
			if (i < start)
				start = i;
			if (i > end)
				end = i;
			differs = got;
		}
	}
	if (start <= end) {	/* <= so that a single corrupted byte is reported too */
		printf("Chunk %d corrupted (%u-%u)  (%x-%x)            \n", nr, start, end,
			page_offset(buf, start), page_offset(buf, end));
		printf("Expected %u, got %u\n", c, differs);
		printf("Written as (%d)%d(%d)\n", order(nr-1), order(nr), order(nr+1));
	}
}

static char *remap(int fd, char *mapping)
{
	if (mapping) {
		munmap(mapping, SIZE);
		posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);
	}
	return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

int main(int argc, char **argv)
{
	int fd, i;

	/*
	 * Make some random ordering of writing the chunks to the
	 * memory map..
	 *
	 * Start with fully ordered..
	 */
	for (i = 0; i < NRCHUNKS; i++)
		chunkorder[i] = i;

	/* ..and then mix it up randomly */
	srandom(time(NULL));
	for (i = 0; i < NRCHUNKS; i++) {
		int index = (unsigned int) random() % NRCHUNKS;
		int nr = chunkorder[index];
		chunkorder[index] = chunkorder[i];
		chunkorder[i] = nr;
	}

	fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, SIZE) < 0)
		return -1;
	mapping = remap(fd, NULL);
	if (mapping == MAP_FAILED)
		return -1;

	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = chunkorder[i];
		printf("Writing chunk %d/%d (%d%%) (%08x)     \r",
			chunk, NRCHUNKS,
			100*i/NRCHUNKS,
			page_offset(mapping, chunk * CHUNKSIZE));
		fillmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	/* Unmap, drop, and remap.. */
	mapping = remap(fd, mapping);

	/* .. and check */
	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = i;
		printf("Checking chunk %d/%d (%d%%) (%08x)    \r",
			i, NRCHUNKS,
			100*i/NRCHUNKS,
			page_offset(mapping, i * CHUNKSIZE));
		checkmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	/* Clean up for next time */
	sleep(5);
	sync();
	sleep(5);
	munmap(mapping, SIZE);
	close(fd);
	unlink("mapfile");
	
	return 0;
}


* Re: 2.6.19 file content corruption on ext3
  2006-12-28 23:36                           ` Anton Altaparmakov
@ 2006-12-28 23:54                             ` Linus Torvalds
  0 siblings, 0 replies; 154+ messages in thread
From: Linus Torvalds @ 2006-12-28 23:54 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: Andrew Morton, Guillaume Chazarain, David Miller, ranma,
	gordonfarquharson, Marc Haber, Nick Piggin, andrei.popa,
	Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins,
	Florian Weimer, Martin Michlmayr, arjan, Chen Kenneth W



On Thu, 28 Dec 2006, Anton Altaparmakov wrote:
> 
> But are chunks 3 and 4 in separate buffer heads?  Sorry could not see it 
> immediately from the output you showed...

No, this is a 4kB filesystem. A single bh per page.

> It is just that there may be a different cause rather than buffer dirty 
> state...

Sure.

> A shot in the dark I know but it could perhaps be that a "COW for 
> MAP_PRIVATE" like event happens when the page is dirty already thus the 
> second write never actually makes it to the shared page thus it never gets 
> written out.

There are no private mappings anywhere, and no forks. Just a single mmap 
(well, we unmap and remap in order to force the page cache to be 
invalidated properly with the posix_fadvise() thing, but that's literally 
the only user).

		Linus


* Re: 2.6.19 file content corruption on ext3
  2006-12-28 22:37                         ` Linus Torvalds
  2006-12-28 22:50                           ` David Miller
@ 2006-12-28 23:36                           ` Anton Altaparmakov
  2006-12-28 23:54                             ` Linus Torvalds
  1 sibling, 1 reply; 154+ messages in thread
From: Anton Altaparmakov @ 2006-12-28 23:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Guillaume Chazarain, David Miller, ranma,
	gordonfarquharson, Marc Haber, Nick Piggin, andrei.popa,
	Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins,
	Florian Weimer, Martin Michlmayr, arjan, Chen Kenneth W

On Thu, 28 Dec 2006, Linus Torvalds wrote:
> Ok,
>  with the ugly trace capture patch, I've actually captured this corruption 
> in action, I think.
> 
> I did a full trace of all pages involved in one run, and picked one 
> corruption at random:
> 
> 	Chunk 14465 corrupted (0-75)  (01423fb4-01423fff)
> 	Expected 129, got 0
> 	Written as (5126)9509(15017)
> 
> That's the first 76 bytes of a chunk missing, and it's the last 76 bytes 
> on a page. It's page index 01423 in the mapped file, and bytes fb4-fff 
> within that page.
> 
> There were four chunks written to that page:
> 
> 	Writing chunk 14463/15800 (15%) (0142344c) (1)
> 	Writing chunk 14462/15800 (30%) (01422e98) (2) (overflows into 00001423)
> 	Writing chunk 14464/15800 (32%) (01423a00) (3)
> 	Writing chunk 14465/15800 (60%) (01423fb4) (4)  <--- LOST!
> 
> and the other three chunks checked out all right.
> 
> And here's the annotated trace as it concerns that page:
> 
>  - here we write the first chunk to the page:
> 	** (1)  do_no_page: mapping index 00001423 at b7d1f44c (write)
> 	**      Setting page 00001423 dirty
> 
>  - something flushes it out to disk:
> 	**      cpd_for_io: index 00001423
> 	**      cleaning index 00001423 at b7d1f000
> 
>  - here we write the second chunk (which was split over the previous page 
>    and the interesting one):
> 	** (2)  Setting page 00001422 dirty
> 	** (2)  Setting page 00001423 dirty
> 
>  - and here we do a cleaning event
> 	**      cpd_for_io: index 00001423
> 	**      cleaning index 00001423 at b7d1f000
> 
>  - here we write the third chunk:
> 	** (3)  Setting page 00001423 dirty
> 
>  - here we write the fourth chunk:
> 	** (4) NO DIRTY EVENT
> 
>  - and a third flush to disk: 
> 	**      cpd_for_io: index 00001423
> 	**      cleaning index 00001423 at b7d1f000
> 
>  - here we unmap and flush:
> 	**      Unmapped index 00001423 at b7d1f000
> 	**      Removing index 00001423 from page cache
> 
>  - here we remap to check:
> 	**      do_no_page: mapping index 00001423 at b7d1f000 (read)
> 	**      Unmapped index 00001423 at b7d1f000
> 
>  - and finally, here I remove the file after the run:
> 	**      Removing index 00001423 from page cache
> 
> Now, the important thing to see here is:
> 
>  - the missing write did not have a "Setting page 00001423 dirty" event 
>    associated with it.
> 
>  - but I can _see_ where the actual dirty event would be happening in the 
>    logs, because I can see the dirty events of the other chunk writes 
>    around it, so I know exactly where that fourth write happens. And 
>    indeed, it _shouldn't_ get a dirty event, because the page is still 
>    dirty from the write of chunk #3 to that page, which _did_ get a dirty 
>    event.
> 
>    I can see that, because the testing app writes the log of the pages it 
>    writes, and this is the log around the fourth and final write:
> 
> 	...
>         Writing chunk 5338/15800 (60%) (0076eb48)       PFN: 76e/76f
>         Writing chunk 960/15800 (60%) (00156300)        PFN: 156
>         Writing chunk 14465/15800 (60%) (01423fb4)  <----
>         Writing chunk 8594/15800 (60%) (00bf74a8)       PFN: bf7
>         Writing chunk 556/15800 (60%) (000c62f0)        PFN: c6
> 	Writing chunk 15190/15800 (60%) (01526678)	PFN: 1526
> 	...
> 
>    and I can match this up with the full log from the kernel, which looks 
>    like this:
> 
>         Setting page 0000076e dirty
>         Setting page 0000076f dirty
>         Setting page 00000156 dirty
>         Setting page 000000c6 dirty
> 	Setting page 00001526 dirty
> 
>    so I know exactly where the missing writes (to our page at pfn 1423, 
>    and the pfn-bf7 page) happened.
> 
>  - and the thing is, I can see a "cpd_for_io()" happening AFTER that 
>    fourth write. Quite a long while after, in fact. So all of this looks 
>    very fine indeed. We are not losing any dirty bits.
> 
>  - EVEN MORE INTERESTING: write 3 makes it onto disk, and it really uses 
>    the SAME dirty bit as write 4 did (which didn't make it out to disk!). 
>    The event that clears the dirty bit that write 3 did happens AFTER 
>    write 4 has happened!
> 
> So if we're not losing any dirty bits, what's going on?
> 
> I think we have some nasty interaction with the buffer heads. In 

But are chunks 3 and 4 in separate buffer heads?  Sorry could not see it 
immediately from the output you showed...

It is just that there may be a different cause rather than buffer dirty 
state...

A shot in the dark I know but it could perhaps be that a "COW for 
MAP_PRIVATE" like event happens when the page is dirty already thus the 
second write never actually makes it to the shared page thus it never gets 
written out.

I am almost certainly totally barking up the wrong tree but I thought it 
may be worth mentioning just in case there was a slip in the COW logic or 
page writable state maintenance somewhere...

Best regards,

	Anton

> particular, I don't think it's the dirty page bits that are broken (I 
> _see_ that the PageDirty bit was set after write 4 was done to memory in 
> the kernel traces). So I think that a real writeback just doesn't happen, 
> because somebody has marked the buffer heads clean _after_ it started IO 
> on them.
> 
> I think "__mpage_writepage()" is buggy in this regard, for example. It 
> even has a comment about its crapola behaviour:
> 
>         /*
>          * Must try to add the page before marking the buffer clean or
>          * the confused fail path above (OOM) will be very confused when
>          * it finds all bh marked clean (i.e. it will not write anything)
>          */
> 
> however, I don't think that particular thing explains it, because I don't 
> think we use that function for the cases I'm looking at.
> 
> Anyway, I'll add tracing for page-writeback setting/cleaning too, in case 
> I might see anything new there..
> 
> 		Linus

-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/


* Re: 2.6.19 file content corruption on ext3
  2006-12-28 22:50                           ` David Miller
@ 2006-12-28 23:01                             ` Linus Torvalds
  2006-12-29  1:38                             ` Linus Torvalds
  1 sibling, 0 replies; 154+ messages in thread
From: Linus Torvalds @ 2006-12-28 23:01 UTC (permalink / raw)
  To: David Miller
  Cc: akpm, guichaz, ranma, gordonfarquharson, mh+linux-kernel,
	nickpiggin, andrei.popa, linux-kernel, a.p.zijlstra, hugh, fw,
	tbm, arjan, kenneth.w.chen



On Thu, 28 Dec 2006, David Miller wrote:
> 
> What happens when we writeback, to the PTEs?

Not a damn thing.

We clear the PTE's _before_ we even start the write. The writeback does 
nothing to them. If the user dirties the page while writeback is in 
progress, we'll take the page fault and re-dirty it _again_.

> page_mkclean_file() iterates the VMAs and when it finds a shared
> one it goes:
> 
> 		entry = ptep_clear_flush(vma, address, pte);
> 		entry = pte_wrprotect(entry);
> 		entry = pte_mkclean(entry);
> 
> and that's fine, but that PTE is still marked writable, and
> I think that's key.

No it's not. It's right there. "pte_wrprotect(entry)". You even copied it 
yourself.

> What does the fault path do in this situation?
> 
> 	if (write_access) {
> 		if (!pte_write(entry))
> 			return do_wp_page(mm, vma, address,
> 					pte, pmd, ptl, entry);

So we call "do_wp_page()", and that does everything right.

		Linus


* Re: 2.6.19 file content corruption on ext3
  2006-12-28 22:37                         ` Linus Torvalds
@ 2006-12-28 22:50                           ` David Miller
  2006-12-28 23:01                             ` Linus Torvalds
  2006-12-29  1:38                             ` Linus Torvalds
  2006-12-28 23:36                           ` Anton Altaparmakov
  1 sibling, 2 replies; 154+ messages in thread
From: David Miller @ 2006-12-28 22:50 UTC (permalink / raw)
  To: torvalds
  Cc: akpm, guichaz, ranma, gordonfarquharson, mh+linux-kernel,
	nickpiggin, andrei.popa, linux-kernel, a.p.zijlstra, hugh, fw,
	tbm, arjan, kenneth.w.chen

From: Linus Torvalds <torvalds@osdl.org>
Date: Thu, 28 Dec 2006 14:37:37 -0800 (PST)

> So if we're not losing any dirty bits, what's going on?

What happens when we writeback, to the PTEs?

page_mkclean_file() iterates the VMAs and when it finds a shared
one it goes:

		entry = ptep_clear_flush(vma, address, pte);
		entry = pte_wrprotect(entry);
		entry = pte_mkclean(entry);

and that's fine, but that PTE is still marked writable, and
I think that's key.

What does the fault path do in this situation?

	if (write_access) {
		if (!pte_write(entry))
			return do_wp_page(mm, vma, address,
					pte, pmd, ptl, entry);
		entry = pte_mkdirty(entry);
	}

It does nothing to update the page dirty state, because it's
writable, it just sets the PTE dirty bit and that's it.  Should
it be setting the page dirty here for SHARED cases?

So until vmscan actually unmaps the PTE completely, we have this
window in which the application can write to the PTE and the
page dirty state doesn't get updated.

Perhaps something later cleans up after this, f.e. by rechecking the
PTE dirty bit at the end of I/O or when vmscan unmaps the page.
I guess that should handle things, but the above logic definitely
stood out to me.


* Re: 2.6.19 file content corruption on ext3
  2006-12-28 21:24                       ` Linus Torvalds
  2006-12-28 21:36                         ` Russell King
@ 2006-12-28 22:37                         ` Linus Torvalds
  2006-12-28 22:50                           ` David Miller
  2006-12-28 23:36                           ` Anton Altaparmakov
  1 sibling, 2 replies; 154+ messages in thread
From: Linus Torvalds @ 2006-12-28 22:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Guillaume Chazarain, David Miller, ranma, gordonfarquharson,
	Marc Haber, Nick Piggin, andrei.popa, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Martin Michlmayr,
	arjan, Chen Kenneth W


Ok,
 with the ugly trace capture patch, I've actually captured this corruption 
in action, I think.

I did a full trace of all pages involved in one run, and picked one 
corruption at random:

	Chunk 14465 corrupted (0-75)  (01423fb4-01423fff)
	Expected 129, got 0
	Written as (5126)9509(15017)

That's the first 76 bytes of a chunk missing, and it's the last 76 bytes 
on a page. It's page index 01423 in the mapped file, and bytes fb4-fff 
within that page.

There were four chunks written to that page:

	Writing chunk 14463/15800 (15%) (0142344c) (1)
	Writing chunk 14462/15800 (30%) (01422e98) (2) (overflows into 00001423)
	Writing chunk 14464/15800 (32%) (01423a00) (3)
	Writing chunk 14465/15800 (60%) (01423fb4) (4)  <--- LOST!

and the other three chunks checked out all right.

And here's the annotated trace as it concerns that page:

 - here we write the first chunk to the page:
	** (1)  do_no_page: mapping index 00001423 at b7d1f44c (write)
	**      Setting page 00001423 dirty

 - something flushes it out to disk:
	**      cpd_for_io: index 00001423
	**      cleaning index 00001423 at b7d1f000

 - here we write the second chunk (which was split over the previous page 
   and the interesting one):
	** (2)  Setting page 00001422 dirty
	** (2)  Setting page 00001423 dirty

 - and here we do a cleaning event
	**      cpd_for_io: index 00001423
	**      cleaning index 00001423 at b7d1f000

 - here we write the third chunk:
	** (3)  Setting page 00001423 dirty

 - here we write the fourth chunk:
	** (4) NO DIRTY EVENT

 - and a third flush to disk: 
	**      cpd_for_io: index 00001423
	**      cleaning index 00001423 at b7d1f000

 - here we unmap and flush:
	**      Unmapped index 00001423 at b7d1f000
	**      Removing index 00001423 from page cache

 - here we remap to check:
	**      do_no_page: mapping index 00001423 at b7d1f000 (read)
	**      Unmapped index 00001423 at b7d1f000

 - and finally, here I remove the file after the run:
	**      Removing index 00001423 from page cache

Now, the important thing to see here is:

 - the missing write did not have a "Setting page 00001423 dirty" event 
   associated with it.

 - but I can _see_ where the actual dirty event would be happening in the 
   logs, because I can see the dirty events of the other chunk writes 
   around it, so I know exactly where that fourth write happens. And 
   indeed, it _shouldn't_ get a dirty event, because the page is still 
   dirty from the write of chunk #3 to that page, which _did_ get a dirty 
   event.

   I can see that, because the testing app writes the log of the pages it 
   writes, and this is the log around the fourth and final write:

	...
        Writing chunk 5338/15800 (60%) (0076eb48)       PFN: 76e/76f
        Writing chunk 960/15800 (60%) (00156300)        PFN: 156
        Writing chunk 14465/15800 (60%) (01423fb4)  <----
        Writing chunk 8594/15800 (60%) (00bf74a8)       PFN: bf7
        Writing chunk 556/15800 (60%) (000c62f0)        PFN: c6
	Writing chunk 15190/15800 (60%) (01526678)	PFN: 1526
	...

   and I can match this up with the full log from the kernel, which looks 
   like this:

        Setting page 0000076e dirty
        Setting page 0000076f dirty
        Setting page 00000156 dirty
        Setting page 000000c6 dirty
	Setting page 00001526 dirty

   so I know exactly where the missing writes (to our page at pfn 1423, 
   and the pfn-bf7 page) happened.

 - and the thing is, I can see a "cpd_for_io()" happening AFTER that 
   fourth write. Quite a long while after, in fact. So all of this looks 
   very fine indeed. We are not losing any dirty bits.

 - EVEN MORE INTERESTING: write 3 makes it onto disk, and it really uses 
   the SAME dirty bit as write 4 did (which didn't make it out to disk!). 
   The event that clears the dirty bit that write 3 did happens AFTER 
   write 4 has happened!

So if we're not losing any dirty bits, what's going on?

I think we have some nasty interaction with the buffer heads. In 
particular, I don't think it's the dirty page bits that are broken (I 
_see_ that the PageDirty bit was set after write 4 was done to memory in 
the kernel traces). So I think that a real writeback just doesn't happen, 
because somebody has marked the buffer heads clean _after_ it started IO 
on them.

I think "__mpage_writepage()" is buggy in this regard, for example. It 
even has a comment about its crapola behaviour:

        /*
         * Must try to add the page before marking the buffer clean or
         * the confused fail path above (OOM) will be very confused when
         * it finds all bh marked clean (i.e. it will not write anything)
         */

however, I don't think that particular thing explains it, because I don't 
think we use that function for the cases I'm looking at.

Anyway, I'll add tracing for page-writeback setting/cleaning too, in case 
I might see anything new there..

		Linus


* Re: 2.6.19 file content corruption on ext3
  2006-12-28 21:24                       ` Linus Torvalds
@ 2006-12-28 21:36                         ` Russell King
  2006-12-28 22:37                         ` Linus Torvalds
  1 sibling, 0 replies; 154+ messages in thread
From: Russell King @ 2006-12-28 21:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Marc Haber, Andrew Morton, Nick Piggin, andrei.popa,
	Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins,
	Florian Weimer, Martin Michlmayr

On Thu, Dec 28, 2006 at 01:24:30PM -0800, Linus Torvalds wrote:
> On Thu, 28 Dec 2006, Linus Torvalds wrote:
> > 
> > What we need now is actually looking at the source code, and people who 
> > understand the VM, I'm afraid. I'm gathering traces now that I have a good 
> > test-case. I'll post my trace tools once I've tested that they work, in 
> > case others want to help.
> 
> Ok, I've got the traces, but quite frankly, I doubt anybody is crazy 
> enough to want to trawl through them. It's a bit painful, since we're 
> talking thousands of pages to trigger this problem.
> 
> Also, I've used the PG_arch_1 flag, which is fine on x86[-64] and probably 
> ARM, but is used for other things on ia64, powerpc and sparc64. But here's 
> the patch in case anybody cares.

PG_arch_1 is used on ARM to flag pages that need a dcache flush prior to
hitting userspace, in the same way that sparc64 uses it.  So ARM systems
should not have this patch applied.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


* Re: 2.6.19 file content corruption on ext3
  2006-12-28 19:00                     ` Linus Torvalds
  2006-12-28 19:05                       ` Petri Kaukasoina
@ 2006-12-28 21:24                       ` Linus Torvalds
  2006-12-28 21:36                         ` Russell King
  2006-12-28 22:37                         ` Linus Torvalds
  2006-12-29 17:49                       ` Guillaume Chazarain
  2 siblings, 2 replies; 154+ messages in thread
From: Linus Torvalds @ 2006-12-28 21:24 UTC (permalink / raw)
  To: Marc Haber
  Cc: Andrew Morton, Nick Piggin, andrei.popa,
	Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins,
	Florian Weimer, Martin Michlmayr



On Thu, 28 Dec 2006, Linus Torvalds wrote:
> 
> What we need now is actually looking at the source code, and people who 
> understand the VM, I'm afraid. I'm gathering traces now that I have a good 
> test-case. I'll post my trace tools once I've tested that they work, in 
> case others want to help.

Ok, I've got the traces, but quite frankly, I doubt anybody is crazy 
enough to want to trawl through them. It's a bit painful, since we're 
talking thousands of pages to trigger this problem.

Also, I've used the PG_arch_1 flag, which is fine on x86[-64] and probably 
ARM, but is used for other things on ia64, powerpc and sparc64. But here's 
the patch in case anybody cares.

It wants a _big_ kernel buffer to capture all the crud into (which is why 
I made the thing accept a bigger log buffer), and quite frankly, I'm not 
at all sure that all the locking is ok (ie I could imagine that the 
dcache-locking thing there in "is_interesting()" could deadlock, what do I 
know..)

But I've captured some real data with this, which I'll describe 
separately.

		Linus

----
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 350878a..967dd80 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -91,6 +91,8 @@
 #define PG_nosave_free		18	/* Used for system suspend/resume */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
+#define SetPageInteresting(page) set_bit(PG_arch_1, &(page)->flags)
+#define PageInteresting(page)	test_bit(PG_arch_1, &(page)->flags)
 
 #if (BITS_PER_LONG > 32)
 /*
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 5c26818..7735b83 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -79,7 +79,7 @@ config DEBUG_KERNEL
 
 config LOG_BUF_SHIFT
 	int "Kernel log buffer size (16 => 64KB, 17 => 128KB)" if DEBUG_KERNEL
-	range 12 21
+	range 12 24
 	default 17 if S390 || LOCKDEP
 	default 16 if X86_NUMAQ || IA64
 	default 15 if SMP
diff --git a/mm/filemap.c b/mm/filemap.c
index 8332c77..d6a0f56 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -116,6 +116,7 @@ void __remove_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
 
+if (PageInteresting(page)) printk("Removing index %08x from page cache\n", page->index);
 	radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
 	mapping->nrpages--;
@@ -421,6 +422,23 @@ int filemap_write_and_wait_range(struct address_space *mapping,
 	return err;
 }
 
+static noinline int is_interesting(struct address_space *mapping)
+{
+	struct inode *inode = mapping->host;
+	struct dentry *dentry;
+	int retval = 0;
+
+	spin_lock(&dcache_lock);
+	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
+		if (strcmp(dentry->d_name.name, "mapfile"))
+			continue;
+		retval = 1;
+		break;
+	}
+	spin_unlock(&dcache_lock);
+	return retval;
+}
+
 /**
  * add_to_page_cache - add newly allocated pagecache pages
  * @page:	page to add
@@ -439,6 +457,9 @@ int add_to_page_cache(struct page *page, struct address_space *mapping,
 {
 	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 
+	if (is_interesting(mapping))
+		SetPageInteresting(page);
+
 	if (error == 0) {
 		write_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
diff --git a/mm/memory.c b/mm/memory.c
index 563792f..14c9815 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -667,6 +667,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			if (unlikely(!page))
 				continue;
+if (PageInteresting(page))
+	printk("Unmapped index %08x at %08x\n", page->index, addr);
 			if (unlikely(details) && details->nonlinear_vma
 			    && linear_page_index(details->nonlinear_vma,
 						addr) != page->index)
@@ -1605,6 +1607,7 @@ gotten:
 		 */
 		ptep_clear_flush(vma, address, page_table);
 		set_pte_at(mm, address, page_table, entry);
+if (PageInteresting(new_page)) printk("do_wp_page: mapping index %08x at %08lx\n", new_page->index, address);
 		update_mmu_cache(vma, address, entry);
 		lru_cache_add_active(new_page);
 		page_add_new_anon_rmap(new_page, vma, address);
@@ -2249,6 +2252,7 @@ retry:
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		if (write_access)
 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+if (PageInteresting(new_page)) printk("do_no_page: mapping index %08x at %08lx (%s)\n", new_page->index, address, write_access ? "write" : "read");
 		set_pte_at(mm, address, page_table, entry);
 		if (anon) {
 			inc_mm_counter(mm, anon_rss);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b3a198c..0466601 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -813,6 +813,7 @@ int fastcall set_page_dirty(struct page *page)
 		if (!spd)
 			spd = __set_page_dirty_buffers;
 #endif
+if (PageInteresting(page)) printk("Setting page %08x dirty\n", page->index);
 		return (*spd)(page);
 	}
 	if (!PageDirty(page)) {
@@ -867,6 +868,7 @@ int clear_page_dirty_for_io(struct page *page)
 
 	if (TestClearPageDirty(page)) {
 		if (mapping_cap_account_dirty(mapping)) {
+if (PageInteresting(page)) printk("cpd_for_io: index %08x\n", page->index);
 			page_mkclean(page);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
diff --git a/mm/rmap.c b/mm/rmap.c
index 57306fa..e98e84c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -448,6 +448,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
 	if (pte_dirty(*pte) || pte_write(*pte)) {
 		pte_t entry;
 
+if (PageInteresting(page)) printk("cleaning index %08x at %08x\n", page->index, address);
 		flush_cache_page(vma, address, pte_pfn(*pte));
 		entry = ptep_clear_flush(vma, address, pte);
 		entry = pte_wrprotect(entry);
@@ -637,6 +638,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		goto out_unmap;
 	}
 
+if (PageInteresting(page)) printk("unmapping index %08x from %08lx\n", page->index, address);
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
 	pteval = ptep_clear_flush(vma, address, pte);
@@ -767,6 +769,7 @@ static void try_to_unmap_cluster(unsigned long cursor,
 		if (ptep_clear_flush_young(vma, address, pte))
 			continue;
 
+if (PageInteresting(page)) printk("Cluster-unmapping %08x from %08lx\n", page->index, address);
 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address, pte_pfn(*pte));
 		pteval = ptep_clear_flush(vma, address, pte);


* Re: 2.6.19 file content corruption on ext3
  2006-12-28 19:39                           ` Dave Jones
@ 2006-12-28 20:10                             ` Arjan van de Ven
  2006-12-29  9:23                             ` maximilian attems
  1 sibling, 0 replies; 154+ messages in thread
From: Arjan van de Ven @ 2006-12-28 20:10 UTC (permalink / raw)
  To: Dave Jones
  Cc: Linus Torvalds, Petri Kaukasoina, Marc Haber, Andrew Morton,
	Nick Piggin, andrei.popa, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Martin Michlmayr

On Thu, 2006-12-28 at 14:39 -0500, Dave Jones wrote:
> On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote:
>  > 
>  > 
>  > On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
>  > > > me up), and that seems to show the corruption going way way back (ie going 
>  > > > back to Linux-2.6.5 at least, according to one tester).
>  > > 
>  > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
>  > > (or older)?
>  > 
>  > Well, that was a really _old_ fedora kernel. I guarantee you it didn't 
>  > have the page throttling patches in it, those were written this summer. So 
>  > it would either have to be Fedora carrying around another patch that just 
>  > happens to result in the same corruption for _years_, or it's the same 
>  > bug.
> 
> The only notable VM patch in Fedora kernels of that vintage that I recall
> was Ingo's 4g/4g thing.

which does tlb flushes *all the time* so that even rules out (well
almost) a stale tlb somewhere...




* Re: 2.6.19 file content corruption on ext3
  2006-12-28 19:21                         ` Linus Torvalds
@ 2006-12-28 19:39                           ` Dave Jones
  2006-12-28 20:10                             ` Arjan van de Ven
  2006-12-29  9:23                             ` maximilian attems
  0 siblings, 2 replies; 154+ messages in thread
From: Dave Jones @ 2006-12-28 19:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Petri Kaukasoina, Marc Haber, Andrew Morton, Nick Piggin,
	andrei.popa, Linux Kernel Mailing List, Peter Zijlstra,
	Hugh Dickins, Florian Weimer, Martin Michlmayr

On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote:
 > 
 > 
 > On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
 > > > me up), and that seems to show the corruption going way way back (ie going 
 > > > back to Linux-2.6.5 at least, according to one tester).
 > > 
 > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
 > > (or older)?
 > 
 > Well, that was a really _old_ fedora kernel. I guarantee you it didn't 
 > have the page throttling patches in it, those were written this summer. So 
 > it would either have to be Fedora carrying around another patch that just 
 > happens to result in the same corruption for _years_, or it's the same 
 > bug.

The only notable VM patch in Fedora kernels of that vintage that I recall
was Ingo's 4g/4g thing.

		Dave

-- 
http://www.codemonkey.org.uk


* Re: 2.6.19 file content corruption on ext3
  2006-12-28 19:05                       ` Petri Kaukasoina
@ 2006-12-28 19:21                         ` Linus Torvalds
  2006-12-28 19:39                           ` Dave Jones
  0 siblings, 1 reply; 154+ messages in thread
From: Linus Torvalds @ 2006-12-28 19:21 UTC (permalink / raw)
  To: Petri Kaukasoina
  Cc: Marc Haber, Andrew Morton, Nick Piggin, andrei.popa,
	Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins,
	Florian Weimer, Martin Michlmayr



On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
> > me up), and that seems to show the corruption going way way back (ie going 
> > back to Linux-2.6.5 at least, according to one tester).
> 
> That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
> (or older)?

Well, that was a really _old_ fedora kernel. I guarantee you it didn't 
have the page throttling patches in it, those were written this summer. So 
it would either have to be Fedora carrying around another patch that just 
happens to result in the same corruption for _years_, or it's the same 
bug.

I bet it's the same bug, and it's been around for ages.

		Linus


* Re: 2.6.19 file content corruption on ext3
  2006-12-28 19:00                     ` Linus Torvalds
@ 2006-12-28 19:05                       ` Petri Kaukasoina
  2006-12-28 19:21                         ` Linus Torvalds
  2006-12-28 21:24                       ` Linus Torvalds
  2006-12-29 17:49                       ` Guillaume Chazarain
  2 siblings, 1 reply; 154+ messages in thread
From: Petri Kaukasoina @ 2006-12-28 19:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Marc Haber, Andrew Morton, Nick Piggin, andrei.popa,
	Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins,
	Florian Weimer, Martin Michlmayr

On Thu, Dec 28, 2006 at 11:00:46AM -0800, Linus Torvalds wrote:
> And I have a test-program that shows the corruption _much_ easier (at 
> least according to my own testing, and that of several reporters that back 
> me up), and that seems to show the corruption going way way back (ie going 
> back to Linux-2.6.5 at least, according to one tester).

That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
(or older)?


* Re: 2.6.19 file content corruption on ext3
  2006-12-28 18:05                   ` Marc Haber
@ 2006-12-28 19:00                     ` Linus Torvalds
  2006-12-28 19:05                       ` Petri Kaukasoina
                                         ` (2 more replies)
  0 siblings, 3 replies; 154+ messages in thread
From: Linus Torvalds @ 2006-12-28 19:00 UTC (permalink / raw)
  To: Marc Haber
  Cc: Andrew Morton, Nick Piggin, andrei.popa,
	Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins,
	Florian Weimer, Martin Michlmayr



On Thu, 28 Dec 2006, Marc Haber wrote:
> 
> After being up for ten days, I have now encountered the file
> corruption of pkgcache.bin for the first time again. The 256 MB i386
> box is like 26M in swap, is under very moderate load.
> 
> I am running plain vanilla 2.6.19.1. Is there a patch that I should
> apply against 2.6.19.1 that would help in debugging?

Not right now. 

And I have a test-program that shows the corruption _much_ easier (at 
least according to my own testing, and that of several reporters that back 
me up), and that seems to show the corruption going way way back (ie going 
back to Linux-2.6.5 at least, according to one tester).

So it just got a lot _easier_ to trigger in 2.6.19, but it's not a new 
bug.

What we need now is actually looking at the source code, and people who 
understand the VM, I'm afraid. I'm gathering traces now that I have a good 
test-case. I'll post my trace tools once I've tested that they work, in 
case others want to help.

(And hey, you don't have to be a VM expert to help: this could be a 
learning experience. However, I'll warn you: this is _the_ most grotty 
part of the whole kernel. It's not even ugly, it's just damn hard and 
complex).

		Linus


* Re: 2.6.19 file content corruption on ext3
  2006-12-19  8:51                 ` Marc Haber
  2006-12-19  9:28                   ` Martin Michlmayr
@ 2006-12-28 18:05                   ` Marc Haber
  2006-12-28 19:00                     ` Linus Torvalds
  1 sibling, 1 reply; 154+ messages in thread
From: Marc Haber @ 2006-12-28 18:05 UTC (permalink / raw)
  To: Andrew Morton, Nick Piggin, Linus Torvalds, andrei.popa,
	Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins,
	Florian Weimer, Martin Michlmayr

On Tue, Dec 19, 2006 at 09:51:49AM +0100, Marc Haber wrote:
> On Sun, Dec 17, 2006 at 09:43:08PM -0800, Andrew Morton wrote:
> > Six hours here of fsx-linux plus high memory pressure on SMP on 1k
> > blocksize ext3, mainline.  Zero failures.  It's unlikely that this testing
> > would pass, yet people running normal workloads are able to easily trigger
> > failures.  I suspect we're looking in the wrong place.
> 
> I do not have a clue about memory management at all, but is it
> possible that you're testing on a box with too much memory? My box has
> only 256 MB, and I used to use mutt with a _huge_ inbox with mutt
> taking somewhat 150 MB. Add spamassassin and a reasonably busy mail
> server, and the box used to be like 150 MB in swap.
> 
> I have tidied my inbox in the mean time and mutt's memory requirement
> has been reduced to somewhat 30 MB, which might be the cause that I
> don't see the issue that often any more.

After being up for ten days, I have now encountered the file
corruption of pkgcache.bin for the first time again. The 256 MB i386
box is like 26M in swap, is under very moderate load.

I am running plain vanilla 2.6.19.1. Is there a patch that I should
apply against 2.6.19.1 that would help in debugging?

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Mannheim, Germany  |  lose things."    Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature |  How to make an American Quilt | Fax: *49 621 72739835


* Re: 2.6.19 file content corruption on ext3
  2006-12-18 22:34                         ` Gene Heskett
@ 2006-12-22 17:27                           ` Linus Torvalds
  0 siblings, 0 replies; 154+ messages in thread
From: Linus Torvalds @ 2006-12-22 17:27 UTC (permalink / raw)
  To: Gene Heskett
  Cc: linux-kernel, Andrei Popa, Peter Zijlstra, Andrew Morton,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr



On Mon, 18 Dec 2006, Gene Heskett wrote:
>
> What about the mm/rmap.c one liner, in or out?

The one that just removes the "pte_mkclean()"? That's definitely out, it 
was just a test-patch to verify that the pte dirty bits seemed to matter 
at all (and they do).

		Linus


* Re: 2.6.19 file content corruption on ext3
  2006-12-21 13:03                           ` Peter Zijlstra
@ 2006-12-21 20:40                             ` Andrew Morton
  0 siblings, 0 replies; 154+ messages in thread
From: Andrew Morton @ 2006-12-21 20:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Nick Piggin, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Thu, 21 Dec 2006 14:03:20 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Tue, 2006-12-19 at 09:43 -0800, Linus Torvalds wrote:
> > 
> > Btw,
> >  here's a totally new tangent on this: it's possible that user code is 
> > simply BUGGY. 
> 
> depmod: BADNESS: written outside isize 22183

akpm:/usr/src/module-init-tools-3.3-pre1> grep -r mmap .
./zlibsupport.c:        map = mmap(0, *size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);

So presumably it's in a library.

akpm:/usr/src/25> ldd /sbin/depmod
        linux-gate.so.1 =>  (0xffffe000)
        libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0x46afa000)
        /lib/ld-linux.so.2 (0x4631d000)

worrisome.


* Re: 2.6.19 file content corruption on ext3
  2006-12-19 17:43                         ` Linus Torvalds
  2006-12-19 18:59                           ` Linus Torvalds
  2006-12-19 21:56                           ` Florian Weimer
@ 2006-12-21 13:03                           ` Peter Zijlstra
  2006-12-21 20:40                             ` Andrew Morton
  2 siblings, 1 reply; 154+ messages in thread
From: Peter Zijlstra @ 2006-12-21 13:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Tue, 2006-12-19 at 09:43 -0800, Linus Torvalds wrote:
> 
> Btw,
>  here's a totally new tangent on this: it's possible that user code is 
> simply BUGGY. 

depmod: BADNESS: written outside isize 22183

---
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..5db9fd9 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2393,6 +2393,17 @@ int nobh_commit_write(struct file *file, struct page *page,
 }
 EXPORT_SYMBOL(nobh_commit_write);
 
+static void __check_tail_zero(char *kaddr, unsigned int offset)
+{
+	unsigned int check = 0;
+	do {
+		check += kaddr[offset++];
+	} while (offset < PAGE_CACHE_SIZE);
+	if (check)
+		printk(KERN_ERR "%s: BADNESS: written outside isize %u\n",
+				current->comm, check);
+}
+
 /*
  * nobh_writepage() - based on block_full_write_page() except
  * that it tries to operate without attaching bufferheads to
@@ -2437,6 +2448,7 @@ int nobh_writepage(struct page *page, get_block_t *get_block,
 	 * writes to that region are not written out to the file."
 	 */
 	kaddr = kmap_atomic(page, KM_USER0);
+	__check_tail_zero(kaddr, offset);
 	memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
 	flush_dcache_page(page);
 	kunmap_atomic(kaddr, KM_USER0);
@@ -2604,6 +2616,7 @@ int block_write_full_page(struct page *page, get_block_t *get_block,
 	 * writes to that region are not written out to the file."
 	 */
 	kaddr = kmap_atomic(page, KM_USER0);
+	__check_tail_zero(kaddr, offset);
 	memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
 	flush_dcache_page(page);
 	kunmap_atomic(kaddr, KM_USER0);
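
A user-space model of the tail check in the patch above, as a minimal
sketch. The names sim_count_tail_nonzero() and SIM_PAGE_SIZE are made up
for illustration and are not kernel symbols. It counts non-zero bytes
rather than summing them, because a sum over plain char values can cancel
out where char is signed (0x01 followed by 0xff adds up to zero).

```c
#include <assert.h>

/* Hypothetical stand-in for PAGE_CACHE_SIZE; 4096 is an assumption. */
#define SIM_PAGE_SIZE 4096u

/*
 * Count the non-zero bytes in page[offset .. SIM_PAGE_SIZE).
 * A result of 0 means the tail beyond "offset" is still all zeroes,
 * i.e. nothing was written past the end of the file in this page.
 */
static unsigned int sim_count_tail_nonzero(const unsigned char *page,
					   unsigned int offset)
{
	unsigned int count = 0;

	assert(offset <= SIM_PAGE_SIZE);
	for (; offset < SIM_PAGE_SIZE; offset++) {
		if (page[offset])
			count++;
	}
	return count;
}
```

With a zeroed page that has bytes set at offsets 100 and 200, a check
starting at offset 50 reports both of them, and a check starting at 201
reports none.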




* Re: 2.6.19 file content corruption on ext3
  2006-12-19 21:30                             ` Peter Zijlstra
  2006-12-19 22:51                               ` Linus Torvalds
@ 2006-12-20 18:02                               ` Stephen Clark
  1 sibling, 0 replies; 154+ messages in thread
From: Stephen Clark @ 2006-12-20 18:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Nick Piggin, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

Peter Zijlstra wrote:

>On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote:
>  
>
>>On Tue, 19 Dec 2006, Linus Torvalds wrote:
>>    
>>
>>> here's a totally new tangent on this: it's possible that user code is 
>>>simply BUGGY. 
>>>      
>>>
>
>I'm sad to say this doesn't trigger :-(
>
>
>
>  
>
Hi all,

I ran it a number of times on 2.6.16-1.2115_FC4 and always got
 ./a.out | od -x
0000000 aaaa aaaa aaaa aaaa aaaa aaaa aaaa aaaa
0000020 aaaa aaaa 5555 5555 5555 5555 5555 5555
0000040 5555 5555 5555 5555

but running it on 2.6.19-rc5 I always get zeros in the middle.

Steve

-- 

"They that give up essential liberty to obtain temporary safety, 
deserve neither liberty nor safety."  (Ben Franklin)

"The course of history shows that as a government grows, liberty 
decreases."  (Thomas Jefferson)





* Re: 2.6.19 file content corruption on ext3
  2006-12-20 16:30                             ` Andrei Popa
@ 2006-12-20 16:36                               ` Peter Zijlstra
  0 siblings, 0 replies; 154+ messages in thread
From: Peter Zijlstra @ 2006-12-20 16:36 UTC (permalink / raw)
  To: andrei.popa
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Wed, 2006-12-20 at 18:30 +0200, Andrei Popa wrote:
> On Wed, 2006-12-20 at 15:23 +0100, Peter Zijlstra wrote:
> > On Wed, 2006-12-20 at 16:15 +0200, Andrei Popa wrote:
> > > On Wed, 2006-12-20 at 00:42 +0100, Peter Zijlstra wrote:
> > > > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > > > 
> > > > > OR:
> > > > > 
> > > > >  - page_mkclean_one() is simply buggy.
> > > > 
> > > > GOLD!
> > > > 
> > > > it seems to work with all this (full diff against current git).
> > > > 
> > > > /me rebuilds full kernel to make sure...
> > > > reboot...
> > > > test...      pff the tension...
> > > > yay, still good!
> > > > 
> > > > Andrei; would you please verify.
> > > 
> > > I have corrupted files.
> > 
> > drad; and with this patch:
> >   http://lkml.org/lkml/2006/12/20/112
> 
> Hash check on download completion found bad chunks, consider using
> "safe_sync".

*sigh* back to square 1.

and I need to look at my reproduction case ;-(

Thanks for testing.



* Re: 2.6.19 file content corruption on ext3
  2006-12-20 14:23                           ` Peter Zijlstra
@ 2006-12-20 16:30                             ` Andrei Popa
  2006-12-20 16:36                               ` Peter Zijlstra
  0 siblings, 1 reply; 154+ messages in thread
From: Andrei Popa @ 2006-12-20 16:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Wed, 2006-12-20 at 15:23 +0100, Peter Zijlstra wrote:
> On Wed, 2006-12-20 at 16:15 +0200, Andrei Popa wrote:
> > On Wed, 2006-12-20 at 00:42 +0100, Peter Zijlstra wrote:
> > > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > > 
> > > > OR:
> > > > 
> > > >  - page_mkclean_one() is simply buggy.
> > > 
> > > GOLD!
> > > 
> > > it seems to work with all this (full diff against current git).
> > > 
> > > /me rebuilds full kernel to make sure...
> > > reboot...
> > > test...      pff the tension...
> > > yay, still good!
> > > 
> > > Andrei; would you please verify.
> > 
> > I have corrupted files.
> 
> drad; and with this patch:
>   http://lkml.org/lkml/2006/12/20/112

Hash check on download completion found bad chunks, consider using
"safe_sync".

> 
> /me goes rebuild his kernel and try more than 3 times
> 



* Re: 2.6.19 file content corruption on ext3
  2006-12-20  9:01                           ` Peter Zijlstra
  2006-12-20  9:12                             ` Peter Zijlstra
  2006-12-20  9:39                             ` Arjan van de Ven
@ 2006-12-20 14:27                             ` Martin Schwidefsky
  2 siblings, 0 replies; 154+ messages in thread
From: Martin Schwidefsky @ 2006-12-20 14:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr, Heiko Carstens

On Wed, 2006-12-20 at 10:01 +0100, Peter Zijlstra wrote:
> Also, what is this page_test_and_clear_dirty() business, that seems to
> be exclusively s390 btw. However they do seem to need this.
> 
> > But the "ptep_get_and_clear() + flush_tlb_page()" sequence should
> > hopefully also work.
> 
> Yeah, probably, not optimally so on some archs that don't actually need
> the flush though. And as above, I wonder about s390.

Simple, the s390 architecture does not keep the dirty bit in the pte but
in something called the storage key. For each physical page there is one
associated storage key. It is accessed with special instructions like
"iske", "sske" or "rrbe". To clear the dirty bit the storage key of a
page is read with iske, the bit is cleared and the storage key is stored
back with sske. That means that clearing the dirty bit is not an atomic
operation. rrbe is used to test and clear the referenced bit (young/old
information) and is atomic in regard to other storage key operations. If
you think about it, the storage keys are quite nice for the operating
system, page_referenced() can be implemented with a single test
"page_test_and_clear_young()". No need to read all the ptes pointing to
the page. The downside is that the storage keys have a cost on the
hardware side.
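
The storage key scheme described above can be modelled in user space as
follows. This is only a sketch: every sim_* name and the bit values are
assumptions made for illustration, and the functions merely mimic what
iske, sske and rrbe are described as doing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names are kernel symbols. */
#define SIM_NPAGES         4
#define SIM_KEY_REFERENCED 0x04	/* illustrative bit values */
#define SIM_KEY_CHANGED    0x02

/* One storage key per physical page, independent of any pte. */
static uint8_t sim_storage_key[SIM_NPAGES];

/* Model of "iske": read the storage key of a page. */
static uint8_t sim_iske(unsigned int page)
{
	assert(page < SIM_NPAGES);
	return sim_storage_key[page];
}

/* Model of "sske": store a new storage key for a page. */
static void sim_sske(unsigned int page, uint8_t key)
{
	assert(page < SIM_NPAGES);
	sim_storage_key[page] = key;
}

/* Model of "rrbe": test and clear the referenced bit in one step. */
static bool sim_rrbe(unsigned int page)
{
	assert(page < SIM_NPAGES);
	bool was_referenced = sim_storage_key[page] & SIM_KEY_REFERENCED;

	sim_storage_key[page] &= (uint8_t)~SIM_KEY_REFERENCED;
	return was_referenced;
}

/*
 * Clearing the dirty bit is read (iske), modify, write (sske).
 * In this model nothing runs between the two steps; on real hardware
 * the two-step sequence is what makes the operation non-atomic.
 */
static bool sim_page_test_and_clear_dirty(unsigned int page)
{
	uint8_t key = sim_iske(page);

	if (!(key & SIM_KEY_CHANGED))
		return false;
	sim_sske(page, key & (uint8_t)~SIM_KEY_CHANGED);
	return true;
}

/* A page_referenced() equivalent needs one test per page, no pte walk. */
static bool sim_page_test_and_clear_young(unsigned int page)
{
	return sim_rrbe(page);
}
```

The point of the model is that both helpers take a physical page and
never look at a pte, which is the property the paragraph above relies on.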

-- 
blue skies,
  Martin.

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH

"Reality continues to ruin my life." - Calvin.




* Re: 2.6.19 file content corruption on ext3
  2006-12-20 14:15                         ` Andrei Popa
@ 2006-12-20 14:23                           ` Peter Zijlstra
  2006-12-20 16:30                             ` Andrei Popa
  0 siblings, 1 reply; 154+ messages in thread
From: Peter Zijlstra @ 2006-12-20 14:23 UTC (permalink / raw)
  To: andrei.popa
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Wed, 2006-12-20 at 16:15 +0200, Andrei Popa wrote:
> On Wed, 2006-12-20 at 00:42 +0100, Peter Zijlstra wrote:
> > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > 
> > > OR:
> > > 
> > >  - page_mkclean_one() is simply buggy.
> > 
> > GOLD!
> > 
> > it seems to work with all this (full diff against current git).
> > 
> > /me rebuilds full kernel to make sure...
> > reboot...
> > test...      pff the tension...
> > yay, still good!
> > 
> > Andrei; would you please verify.
> 
> I have corrupted files.

drad; and with this patch:
  http://lkml.org/lkml/2006/12/20/112

/me goes rebuild his kernel and try more than 3 times



* Re: 2.6.19 file content corruption on ext3
  2006-12-19 23:42                       ` Peter Zijlstra
  2006-12-20  0:23                         ` Linus Torvalds
@ 2006-12-20 14:15                         ` Andrei Popa
  2006-12-20 14:23                           ` Peter Zijlstra
  1 sibling, 1 reply; 154+ messages in thread
From: Andrei Popa @ 2006-12-20 14:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Wed, 2006-12-20 at 00:42 +0100, Peter Zijlstra wrote:
> On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> 
> > OR:
> > 
> >  - page_mkclean_one() is simply buggy.
> 
> GOLD!
> 
> it seems to work with all this (full diff against current git).
> 
> /me rebuilds full kernel to make sure...
> reboot...
> test...      pff the tension...
> yay, still good!
> 
> Andrei; would you please verify.

I have corrupted files.

> The magic seems to be in the extra tlb flush after clearing the dirty
> bit. Just too bad ptep_clear_flush_dirty() needs ptep not entry.
> 
> diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c
> index 5e7cd45..2b8893b 100644
> --- a/drivers/connector/connector.c
> +++ b/drivers/connector/connector.c
> @@ -135,8 +135,7 @@ static int cn_call_callback(struct cn_msg *msg, void (*destruct_data)(void *), v
>  	spin_lock_bh(&dev->cbdev->queue_lock);
>  	list_for_each_entry(__cbq, &dev->cbdev->queue_list, callback_entry) {
>  		if (cn_cb_equal(&__cbq->id.id, &msg->id)) {
> -			if (likely(!test_bit(WORK_STRUCT_PENDING,
> -					     &__cbq->work.work.management) &&
> +			if (likely(!delayed_work_pending(&__cbq->work) &&
>  					__cbq->data.ddata == NULL)) {
>  				__cbq->data.callback_priv = msg;
>  
> diff --git a/fs/buffer.c b/fs/buffer.c
> index d1f1b54..263f88e 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
>  	int ret = 0;
>  
>  	BUG_ON(!PageLocked(page));
> -	if (PageWriteback(page))
> +	if (PageDirty(page) || PageWriteback(page))
>  		return 0;
>  
>  	if (mapping == NULL) {		/* can this still happen? */
> @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
>  	spin_lock(&mapping->private_lock);
>  	ret = drop_buffers(page, &buffers_to_free);
>  	spin_unlock(&mapping->private_lock);
> -	if (ret) {
> -		/*
> -		 * If the filesystem writes its buffers by hand (eg ext3)
> -		 * then we can have clean buffers against a dirty page.  We
> -		 * clean the page here; otherwise later reattachment of buffers
> -		 * could encounter a non-uptodate page, which is unresolvable.
> -		 * This only applies in the rare case where try_to_free_buffers
> -		 * succeeds but the page is not freed.
> -		 *
> -		 * Also, during truncate, discard_buffer will have marked all
> -		 * the page's buffers clean.  We discover that here and clean
> -		 * the page also.
> -		 */
> -		if (test_clear_page_dirty(page))
> -			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
> -	}
>  out:
>  	if (buffers_to_free) {
>  		struct buffer_head *bh = buffers_to_free;
> diff --git a/mm/memory.c b/mm/memory.c
> index c00bac6..60e0945 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_space *mapping,
>  }
>  EXPORT_SYMBOL(unmap_mapping_range);
>  
> +static void check_last_page(struct address_space *mapping, loff_t size)
> +{
> +	pgoff_t index;
> +	unsigned int offset;
> +	struct page *page;
> +
> +	if (!mapping)
> +		return;
> +	offset = size & ~PAGE_MASK;
> +	if (!offset)
> +		return;
> +	index = size >> PAGE_SHIFT;
> +	page = find_lock_page(mapping, index);
> +	if (page) {
> +		unsigned int check = 0;
> +		unsigned char *kaddr = kmap_atomic(page, KM_USER0);
> +		do {
> +			check += kaddr[offset++];
> +		} while (offset < PAGE_SIZE);
> +		kunmap_atomic(kaddr, KM_USER0);
> +		unlock_page(page);
> +		page_cache_release(page);
> +		if (check)
> +			printk(KERN_ERR "%s: BADNESS: truncate check %u\n", current->comm, check);
> +	}
> +}
> +
>  /**
>   * vmtruncate - unmap mappings "freed" by truncate() syscall
>   * @inode: inode of the file used
> @@ -1875,6 +1902,7 @@ do_expand:
>  		goto out_sig;
>  	if (offset > inode->i_sb->s_maxbytes)
>  		goto out_big;
> +	check_last_page(mapping, inode->i_size);
>  	i_size_write(inode, offset);
>  
>  out_truncate:
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 237107c..f561e72 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -957,7 +957,7 @@ int test_set_page_writeback(struct page *page)
>  EXPORT_SYMBOL(test_set_page_writeback);
>  
>  /*
> - * Return true if any of the pages in the mapping are marged with the
> + * Return true if any of the pages in the mapping are marked with the
>   * passed tag.
>   */
>  int mapping_tagged(struct address_space *mapping, int tag)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d8a842a..900229a 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -432,7 +432,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	unsigned long address;
> -	pte_t *pte, entry;
> +	pte_t *ptep, entry;
>  	spinlock_t *ptl;
>  	int ret = 0;
>  
> @@ -440,22 +440,23 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
>  	if (address == -EFAULT)
>  		goto out;
>  
> -	pte = page_check_address(page, mm, address, &ptl);
> -	if (!pte)
> +	ptep = page_check_address(page, mm, address, &ptl);
> +	if (!ptep)
>  		goto out;
>  
> -	if (!pte_dirty(*pte) && !pte_write(*pte))
> +	if (!pte_dirty(*ptep) && !pte_write(*ptep))
>  		goto unlock;
>  
> -	entry = ptep_get_and_clear(mm, address, pte);
> -	entry = pte_mkclean(entry);
> +	entry = ptep_get_and_clear(mm, address, ptep);
>  	entry = pte_wrprotect(entry);
> -	ptep_establish(vma, address, pte, entry);
> +	ptep_establish(vma, address, ptep, entry);
> +	ret = ptep_clear_flush_dirty(vma, address, ptep) ||
> +		page_test_and_clear_dirty(page);
>  	lazy_mmu_prot_update(entry);
>  	ret = 1;
>  
>  unlock:
> -	pte_unmap_unlock(pte, ptl);
> +	pte_unmap_unlock(ptep, ptl);
>  out:
>  	return ret;
>  }
> 
> 



* Re: 2.6.19 file content corruption on ext3
  2006-12-20  9:01                           ` Peter Zijlstra
  2006-12-20  9:12                             ` Peter Zijlstra
@ 2006-12-20  9:39                             ` Arjan van de Ven
  2006-12-20 14:27                             ` Martin Schwidefsky
  2 siblings, 0 replies; 154+ messages in thread
From: Arjan van de Ven @ 2006-12-20  9:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr, Martin Schwidefsky, Heiko Carstens


> Hmm, should we not flush after clearing the dirty bit? That is, why does
> ptep_clear_flush_dirty() need a flush after clearing that bit? does it
> leak through in the tlb copy?

afaics you need to 
1) clear
2) flush 
3) check and go to 1) if needed

to be race free. 
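
As a rough illustration of that clear/flush/check loop, here is a small
userspace model. Everything in it is hypothetical (struct toy_pte,
toy_flush() and the "stale_writes" counter are made up for the sketch);
it only mimics a write that races in through a stale TLB entry and is not
kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one pte plus a CPU that may still hold a stale, writable
 * translation. All names here are hypothetical. */
struct toy_pte {
	bool dirty;		/* dirty bit in the page table entry */
	int stale_writes;	/* writes still in flight via a stale TLB entry */
};

/* Step 1: clear the dirty bit, reporting whether it was set. */
static bool toy_clear_dirty(struct toy_pte *p)
{
	bool was = p->dirty;

	p->dirty = false;
	return was;
}

/* Step 2: flush. In this model a write that raced with the clear lands
 * just before the flush completes and sets the bit again. */
static void toy_flush(struct toy_pte *p)
{
	if (p->stale_writes > 0) {
		p->stale_writes--;
		p->dirty = true;
	}
}

/* Steps 1-3 as a loop; returns true if the page was ever seen dirty. */
bool toy_mkclean(struct toy_pte *p, int *rounds)
{
	bool was_dirty = false;

	assert(p && rounds);
	*rounds = 0;
	do {
		was_dirty |= toy_clear_dirty(p);	/* 1) clear */
		toy_flush(p);				/* 2) flush */
		(*rounds)++;
	} while (p->dirty);				/* 3) check, go to 1) */
	return was_dirty;
}
```

With two racing writes pending, the loop in this model needs three rounds
before the entry stays clean.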




^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-20  0:23                         ` Linus Torvalds
  2006-12-20  9:01                           ` Peter Zijlstra
@ 2006-12-20  9:32                           ` Peter Zijlstra
  1 sibling, 0 replies; 154+ messages in thread
From: Peter Zijlstra @ 2006-12-20  9:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Popa, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Tue, 2006-12-19 at 16:23 -0800, Linus Torvalds wrote:

> Pls test.

Is good. Only s390 remains a question.

Another point, change_protection() also does a cache flush, should we
too?

> ----
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d8a842a..eec8706 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -448,9 +448,10 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
>  		goto unlock;
>  
>  	entry = ptep_get_and_clear(mm, address, pte);
          flush_cache_page(vma, address, pte_pfn(entry));
> +	flush_tlb_page(vma, address);
>  	entry = pte_mkclean(entry);
>  	entry = pte_wrprotect(entry);
> -	ptep_establish(vma, address, pte, entry);
> +	set_pte_at(mm, address, pte, entry);
>  	lazy_mmu_prot_update(entry);
>  	ret = 1;
>  
> 


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-20  9:01                           ` Peter Zijlstra
@ 2006-12-20  9:12                             ` Peter Zijlstra
  2006-12-20  9:39                             ` Arjan van de Ven
  2006-12-20 14:27                             ` Martin Schwidefsky
  2 siblings, 0 replies; 154+ messages in thread
From: Peter Zijlstra @ 2006-12-20  9:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Popa, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr,
	Martin Schwidefsky, Heiko Carstens

On Wed, 2006-12-20 at 10:01 +0100, Peter Zijlstra wrote:

> I will try, but I had a look around the different architectures'
> implementations of ptep_clear_flush_dirty() and saw that not all do the
> actual flush. So if we go down this road perhaps we should introduce
> another per-arch function that does the potential flush, like
> flush_tlb_on_clear_dirty() or something like that.

never mind, we do need an unconditional flush for changing the
protection too.



^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-20  0:23                         ` Linus Torvalds
@ 2006-12-20  9:01                           ` Peter Zijlstra
  2006-12-20  9:12                             ` Peter Zijlstra
                                               ` (2 more replies)
  2006-12-20  9:32                           ` Peter Zijlstra
  1 sibling, 3 replies; 154+ messages in thread
From: Peter Zijlstra @ 2006-12-20  9:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Popa, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr,
	Martin Schwidefsky, Heiko Carstens

On Tue, 2006-12-19 at 16:23 -0800, Linus Torvalds wrote:
> 
> On Wed, 20 Dec 2006, Peter Zijlstra wrote:
> > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > > OR:
> > > 
> > >  - page_mkclean_one() is simply buggy.
> > 
> > GOLD!
> 
> Ok. I was looking at that, and I wondered..
> 
> However, if that works, then I _think_ the correct sequence is the 
> following..
> 
> The rule should be:
>  - we flush the tlb _after_ we have cleared it, but _before_ we insert the 
>    new entry.
> 
> But I dunno. These things are damn subtle. Does this patch fix it for you?

I will try, but I had a look around the different architectures'
implementations of ptep_clear_flush_dirty() and saw that not all do the
actual flush. So if we go down this road perhaps we should introduce
another per-arch function that does the potential flush, like
flush_tlb_on_clear_dirty() or something like that.

Then we could write:

  entry = ptep_get_and_clear(mm, address, ptep)
  flush_tlb_on_clear_dirty(vma, address);
  entry = pte_mkclean(entry);
  entry = pte_wrprotect(entry);
  set_pte_at(mm, address, ptep, entry);

> I actually suspect we should do this as an arch-specific macro, and 
> totally replace the current "ptep_clear_flush_dirty()" with one that does 
> "ptep_clear_flush_dirty_and_set_wp()".
> 
> Because what I'd _really_ prefer to do on x86 (and probably on most other 
> sane architectures) is to do
> 
>  - atomically replace the pte with the EXACT SAME ONE, but one that 
>    has the writable bit clear.
> 
> 	bit_clear(_PAGE_BIT_RW, &(ptep)->pte_low);
> 
>  - flush the TLB, making sure that all CPU's will no longer write to it:
> 
> 	flush_tlb_page(vma, address);
> 
>  - finally, just fetch-and-clear the dirty bit (and since it's no longer 
>    writable, nobody should be setting it any more)
> 
> 	ret = bit_clear(_PAGE_BIT_DIRTY, &(ptep)->pte_low);
> 
> and now we should be all done.

Hmm, should we not flush after clearing the dirty bit? That is, why does
ptep_clear_flush_dirty() need a flush after clearing that bit? does it
leak through in the tlb copy?

Also, what is this page_test_and_clear_dirty() business, that seems to
be exclusively s390 btw. However they do seem to need this.

> But the "ptep_get_and_clear() + flush_tlb_page()" sequence should 
> hopefully also work.

Yeah, probably, not optimally so on some archs that don't actually need
the flush though. And as above, I wonder about s390.

(added our s390 friends to the CC list)


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 18:59                           ` Linus Torvalds
  2006-12-19 21:30                             ` Peter Zijlstra
@ 2006-12-20  5:56                             ` Jari Sundell
  1 sibling, 0 replies; 154+ messages in thread
From: Jari Sundell @ 2006-12-20  5:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On 12/20/06, Linus Torvalds <torvalds@osdl.org> wrote:
> On Tue, 19 Dec 2006, Linus Torvalds wrote:
> >
> >  here's a totally new tangent on this: it's possible that user code is
> > simply BUGGY.
>
> Btw, here's a simpler test-program that actually shows the difference
> between 2.6.18 and 2.6.19 in action, and why it could explain why a
> program like rtorrent might show corruption behaviour that it didn't show
> before.

Kinda late to the discussion, but I guess I could summarize what
rtorrent actually does, or should be doing.

When downloading a new torrent, it will create the files and truncate
them to the final size. It will never call truncate after this and the
files will remain sparse until data is downloaded. A 'piece' is mapped
to memory using MAP_SHARED, which will be page aligned on single file
torrents but unlikely to be so on multi-file torrents.

So on multi-file torrents it'll often end up with two mappings
overlapping on one page, each of which only writes to its own part of
the page. These will then be sync'ed with MS_ASYNC, or MS_SYNC if low
on disk space. After that it might be unmapped, then mapped as
read-only.
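
A minimal sketch of that pattern in C, assuming a two-page file whose
piece boundary falls in the middle of the second page. The function name
toy_overlapping_pieces(), the fill bytes and the layout are made up for
illustration and are not taken from rtorrent. On a kernel that behaves
correctly it should return 0; it is not claimed to reproduce the
corruption.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: write two adjacent "pieces" through two MAP_SHARED
 * mappings that overlap on one page, then verify through a read-only map.
 * Returns 0 if the file holds what was written, -1 otherwise.
 */
int toy_overlapping_pieces(const char *path)
{
	long psz = sysconf(_SC_PAGESIZE);
	size_t page, total, split, i;
	unsigned char *a, *b;
	const unsigned char *ro;
	int fd, ret = 0;

	assert(psz > 0);
	page = (size_t)psz;
	total = 2 * page;
	split = page + page / 2;	/* piece boundary, mid second page */

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666);
	if (fd < 0)
		return -1;
	/* size the file once, up front; it stays sparse until written */
	if (ftruncate(fd, (off_t)total) < 0)
		goto fail;

	/* piece 1 spans both pages, piece 2 lives in the second page only */
	a = mmap(NULL, total, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (a == MAP_FAILED)
		goto fail;
	b = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, (off_t)page);
	if (b == MAP_FAILED) {
		munmap(a, total);
		goto fail;
	}

	/* each mapping writes only its own part of the shared page */
	memset(a, 0xaa, split);
	memset(b + (split - page), 0x55, total - split);
	msync(a, total, MS_ASYNC);
	msync(b, page, MS_ASYNC);
	munmap(a, total);
	munmap(b, page);

	/* map again read-only and compare against what was written */
	ro = mmap(NULL, total, PROT_READ, MAP_SHARED, fd, 0);
	if (ro == MAP_FAILED)
		goto fail;
	for (i = 0; i < total; i++)
		if (ro[i] != (i < split ? 0xaa : 0x55))
			ret = -1;
	munmap((void *)ro, total);
	close(fd);
	return ret;
fail:
	close(fd);
	return -1;
}
```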

I haven't thought of asking if single file torrents are ok.

Rakshasa

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 23:42                       ` Peter Zijlstra
@ 2006-12-20  0:23                         ` Linus Torvalds
  2006-12-20  9:01                           ` Peter Zijlstra
  2006-12-20  9:32                           ` Peter Zijlstra
  2006-12-20 14:15                         ` Andrei Popa
  1 sibling, 2 replies; 154+ messages in thread
From: Linus Torvalds @ 2006-12-20  0:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrei Popa, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr



On Wed, 20 Dec 2006, Peter Zijlstra wrote:
> On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > OR:
> > 
> >  - page_mkclean_one() is simply buggy.
> 
> GOLD!

Ok. I was looking at that, and I wondered..

However, if that works, then I _think_ the correct sequence is the 
following..

The rule should be:
 - we flush the tlb _after_ we have cleared it, but _before_ we insert the 
   new entry.

But I dunno. These things are damn subtle. Does this patch fix it for you?

I actually suspect we should do this as an arch-specific macro, and 
totally replace the current "ptep_clear_flush_dirty()" with one that does 
"ptep_clear_flush_dirty_and_set_wp()".

Because what I'd _really_ prefer to do on x86 (and probably on most other 
sane architectures) is to do

 - atomically replace the pte with the EXACT SAME ONE, but one that 
   has the writable bit clear.

	bit_clear(_PAGE_BIT_RW, &(ptep)->pte_low);

 - flush the TLB, making sure that all CPU's will no longer write to it:

	flush_tlb_page(vma, address);

 - finally, just fetch-and-clear the dirty bit (and since it's no longer 
   writable, nobody should be setting it any more)

	ret = bit_clear(_PAGE_BIT_DIRTY, &(ptep)->pte_low);

and now we should be all done.

But the "ptep_get_and_clear() + flush_tlb_page()" sequence should 
hopefully also work.
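
To make the ordering of the three steps concrete, here is a userspace
model using C11 atomics. It is only a sketch: the TOY_PTE_* bits, the
flush counter and toy_clean_and_wrprotect() are hypothetical stand-ins,
and the "flush" is just a counter, not a real flush_tlb_page().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative bit layout for a toy pte word. */
#define TOY_PTE_RW	(1u << 1)
#define TOY_PTE_DIRTY	(1u << 6)

/* Counts simulated TLB flushes so the ordering can be observed. */
static unsigned int toy_tlb_flushes;

/* Stand-in for flush_tlb_page(); a real flush reaches every CPU. */
static void toy_flush_tlb_page(void)
{
	toy_tlb_flushes++;
}

/*
 * Write-protect first, flush, then fetch-and-clear the dirty bit.
 * Returns true if the entry was dirty.
 */
bool toy_clean_and_wrprotect(_Atomic unsigned int *pte)
{
	unsigned int old;

	assert(pte);
	/* 1: same entry, with the writable bit cleared atomically */
	atomic_fetch_and(pte, ~TOY_PTE_RW);
	/* 2: after this no CPU may write through a stale translation */
	toy_flush_tlb_page();
	/* 3: nobody can set the dirty bit any more, so collect it once */
	old = atomic_fetch_and(pte, ~TOY_PTE_DIRTY);
	return (old & TOY_PTE_DIRTY) != 0;
}
```

The point of the order is that the dirty bit is only collected once no
CPU can set it again, so a single fetch-and-clear is enough.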

Pls test.

		Linus

----
diff --git a/mm/rmap.c b/mm/rmap.c
index d8a842a..eec8706 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -448,9 +448,10 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
 		goto unlock;
 
 	entry = ptep_get_and_clear(mm, address, pte);
+	flush_tlb_page(vma, address);
 	entry = pte_mkclean(entry);
 	entry = pte_wrprotect(entry);
-	ptep_establish(vma, address, pte, entry);
+	set_pte_at(mm, address, pte, entry);
 	lazy_mmu_prot_update(entry);
 	ret = 1;
 


^ permalink raw reply related	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-20  0:03                                     ` Linus Torvalds
@ 2006-12-20  0:18                                       ` Andrew Morton
  0 siblings, 0 replies; 154+ messages in thread
From: Andrew Morton @ 2006-12-20  0:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Nick Piggin, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Tue, 19 Dec 2006 16:03:49 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> 
> 
> On Wed, 20 Dec 2006, Peter Zijlstra wrote:
> 
> > On Tue, 2006-12-19 at 14:58 -0800, Andrew Morton wrote:
> > 
> > > Well... we'd need to see (corruption && this-not-triggering) to be sure.
> > > 
> > > Peter, have you been able to trigger the corruption?
> > 
> > Yes; however the mail I sent describing that seems to be lost in space.
> 
> Btw, can somebody actually explain the mess that is ext3 "dirtying".
> 
> Ext3 does NOT use __set_page_dirty_buffers. It does
> 
> 	static int ext3_journalled_set_page_dirty(struct page *page)
> 	{
> 	        SetPageChecked(page);
> 	        return __set_page_dirty_nobuffers(page);
> 	}
> 
> and uses that "Checked" bit as a "whole page is dirty" bit (which it tests 
> in "writepage()").

This is purely for data=journal, which is rarely used.

In journalled-data mode, write(), write-fault, etc are not allowed to dirty
the pages and buffers, because the data has to be written to the journal
first.  Only after the data has been written to the journal do we mark
buffers (and hence pages) dirty as far as the VFS is concerned, for
checkpointing the data back to its real place on the disk.


For MAP_SHARED pages ext3 cheats madly and doesn't journal the data at all.
In all journalling modes, MAP_SHARED data follows the regular ext2-style
handling.  Which is a bit of a wart.


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 23:06                                   ` Peter Zijlstra
  2006-12-19 23:07                                     ` Peter Zijlstra
@ 2006-12-20  0:03                                     ` Linus Torvalds
  2006-12-20  0:18                                       ` Andrew Morton
  1 sibling, 1 reply; 154+ messages in thread
From: Linus Torvalds @ 2006-12-20  0:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Nick Piggin, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr



On Wed, 20 Dec 2006, Peter Zijlstra wrote:

> On Tue, 2006-12-19 at 14:58 -0800, Andrew Morton wrote:
> 
> > Well... we'd need to see (corruption && this-not-triggering) to be sure.
> > 
> > Peter, have you been able to trigger the corruption?
> 
> Yes; however the mail I sent describing that seems to be lost in space.

Btw, can somebody actually explain the mess that is ext3 "dirtying".

Ext3 does NOT use __set_page_dirty_buffers. It does

	static int ext3_journalled_set_page_dirty(struct page *page)
	{
	        SetPageChecked(page);
	        return __set_page_dirty_nobuffers(page);
	}

and uses that "Checked" bit as a "whole page is dirty" bit (which it tests 
in "writepage()").

You realize what this all means? It means that ANYTHING that actually 
clears the _real_ dirty bit won't actually be doing anything at all for 
ext3, because the Checked bit will still stay set, and any IO down the 
line on that page would totally ignore the dirty bits on the buffer heads 
and just write out everything.

That is "The Mess(tm)".

It also basically means that anything that clears the dirty bit without 
just calling "writepage()" had _better_ call "invalidatepage()" for the 
whole page, because otherwise the PageChecked bit will never be cleared as 
far as I can see. Happily, at least ext3 seems to _test_ for that case in 
the release_page() function, so it appears that we do do this.

But this seems to just strengthen my argument: you can NEVER clean a page, 
unless you (a) do IO on it immediately afterwards (writeback) or (b) 
invalidate it entirely (truncate).

I'd really like to see just those two functions exist. Preferably in a 
form where you can see easily that we actually follow those rules. Rather 
than having a confusing set of "clear_page_dirty()" and
"test_and_clear_page_dirty()" functions that are called from random 
places.

IOW, I think the "clear_page_dirty_for_io()" is fine (it's case (a) 
above), and then we should probably have a "cancel_dirty_page()" function 
that does all the current clear_page_dirty() but also makes sure that we 
actually call the invalidate_page() function itself. 

Hmm?

			Linus

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 20:14                     ` Linus Torvalds
                                         ` (2 preceding siblings ...)
  2006-12-18 21:49                       ` Peter Zijlstra
@ 2006-12-19 23:42                       ` Peter Zijlstra
  2006-12-20  0:23                         ` Linus Torvalds
  2006-12-20 14:15                         ` Andrei Popa
  3 siblings, 2 replies; 154+ messages in thread
From: Peter Zijlstra @ 2006-12-19 23:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Popa, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:

> OR:
> 
>  - page_mkclean_one() is simply buggy.

GOLD!

it seems to work with all this (full diff against current git).

/me rebuilds full kernel to make sure...
reboot...
test...      pff the tension...
yay, still good!

Andrei; would you please verify.

The magic seems to be in the extra tlb flush after clearing the dirty
bit. Just too bad ptep_clear_flush_dirty() needs ptep not entry.

diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c
index 5e7cd45..2b8893b 100644
--- a/drivers/connector/connector.c
+++ b/drivers/connector/connector.c
@@ -135,8 +135,7 @@ static int cn_call_callback(struct cn_msg *msg, void (*destruct_data)(void *), v
 	spin_lock_bh(&dev->cbdev->queue_lock);
 	list_for_each_entry(__cbq, &dev->cbdev->queue_list, callback_entry) {
 		if (cn_cb_equal(&__cbq->id.id, &msg->id)) {
-			if (likely(!test_bit(WORK_STRUCT_PENDING,
-					     &__cbq->work.work.management) &&
+			if (likely(!delayed_work_pending(&__cbq->work) &&
 					__cbq->data.ddata == NULL)) {
 				__cbq->data.callback_priv = msg;
 
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageDirty(page) || PageWriteback(page))
 		return 0;
 
 	if (mapping == NULL) {		/* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
 	spin_unlock(&mapping->private_lock);
-	if (ret) {
-		/*
-		 * If the filesystem writes its buffers by hand (eg ext3)
-		 * then we can have clean buffers against a dirty page.  We
-		 * clean the page here; otherwise later reattachment of buffers
-		 * could encounter a non-uptodate page, which is unresolvable.
-		 * This only applies in the rare case where try_to_free_buffers
-		 * succeeds but the page is not freed.
-		 *
-		 * Also, during truncate, discard_buffer will have marked all
-		 * the page's buffers clean.  We discover that here and clean
-		 * the page also.
-		 */
-		if (test_clear_page_dirty(page))
-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
-	}
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
diff --git a/mm/memory.c b/mm/memory.c
index c00bac6..60e0945 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
+static void check_last_page(struct address_space *mapping, loff_t size)
+{
+	pgoff_t index;
+	unsigned int offset;
+	struct page *page;
+
+	if (!mapping)
+		return;
+	offset = size & ~PAGE_MASK;
+	if (!offset)
+		return;
+	index = size >> PAGE_SHIFT;
+	page = find_lock_page(mapping, index);
+	if (page) {
+		unsigned int check = 0;
+		unsigned char *kaddr = kmap_atomic(page, KM_USER0);
+		do {
+			check += kaddr[offset++];
+		} while (offset < PAGE_SIZE);
+		kunmap_atomic(kaddr, KM_USER0);
+		unlock_page(page);
+		page_cache_release(page);
+		if (check)
+			printk(KERN_ERR "%s: BADNESS: truncate check %u\n", current->comm, check);
+	}
+}
+
 /**
  * vmtruncate - unmap mappings "freed" by truncate() syscall
  * @inode: inode of the file used
@@ -1875,6 +1902,7 @@ do_expand:
 		goto out_sig;
 	if (offset > inode->i_sb->s_maxbytes)
 		goto out_big;
+	check_last_page(mapping, inode->i_size);
 	i_size_write(inode, offset);
 
 out_truncate:
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..f561e72 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -957,7 +957,7 @@ int test_set_page_writeback(struct page *page)
 EXPORT_SYMBOL(test_set_page_writeback);
 
 /*
- * Return true if any of the pages in the mapping are marged with the
+ * Return true if any of the pages in the mapping are marked with the
  * passed tag.
  */
 int mapping_tagged(struct address_space *mapping, int tag)
diff --git a/mm/rmap.c b/mm/rmap.c
index d8a842a..900229a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -432,7 +432,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
-	pte_t *pte, entry;
+	pte_t *ptep, entry;
 	spinlock_t *ptl;
 	int ret = 0;
 
@@ -440,22 +440,23 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
 	if (address == -EFAULT)
 		goto out;
 
-	pte = page_check_address(page, mm, address, &ptl);
-	if (!pte)
+	ptep = page_check_address(page, mm, address, &ptl);
+	if (!ptep)
 		goto out;
 
-	if (!pte_dirty(*pte) && !pte_write(*pte))
+	if (!pte_dirty(*ptep) && !pte_write(*ptep))
 		goto unlock;
 
-	entry = ptep_get_and_clear(mm, address, pte);
-	entry = pte_mkclean(entry);
+	entry = ptep_get_and_clear(mm, address, ptep);
 	entry = pte_wrprotect(entry);
-	ptep_establish(vma, address, pte, entry);
+	ptep_establish(vma, address, ptep, entry);
+	ret = ptep_clear_flush_dirty(vma, address, ptep) ||
+		page_test_and_clear_dirty(page);
 	lazy_mmu_prot_update(entry);
 	ret = 1;
 
 unlock:
-	pte_unmap_unlock(pte, ptl);
+	pte_unmap_unlock(ptep, ptl);
 out:
 	return ret;
 }



^ permalink raw reply related	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 23:06                                   ` Peter Zijlstra
@ 2006-12-19 23:07                                     ` Peter Zijlstra
  2006-12-20  0:03                                     ` Linus Torvalds
  1 sibling, 0 replies; 154+ messages in thread
From: Peter Zijlstra @ 2006-12-19 23:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Nick Piggin, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Wed, 2006-12-20 at 00:06 +0100, Peter Zijlstra wrote:
> On Tue, 2006-12-19 at 14:58 -0800, Andrew Morton wrote:
> 
> > Well... we'd need to see (corruption && this-not-triggering) to be sure.
> > 
> > Peter, have you been able to trigger the corruption?
> 
> Yes; however the mail I sent describing that seems to be lost in space.
> 
> /me quotes from the sent folder:
> 
> > The bad news is, that doesn't help either. The good news is I can
> > reproduce it.
> > 
> > What I did to achieve that:
> >  
> >  - get a sizable torrent from legaltorrents.com / or create a torrent
> > yourself that is around ~600M and has multiple files.
> > 
> >  - start a tracker, and multiple seeds (I used three machines here)
> > 
> >  - pull the torrent on a fourth machine
> > 
> > the seeding machines don't much matter of course.
> > 
> > the fourth machine was a dual core x86-64 with an SMP kernel and
> > PREEMPT, mem=256M (so that the torrent is quite a bit larger and does
> > require writeout) and I used an ext3 partition with 1k blocks.

PS. this was a reply to:
 http://lkml.org/lkml/2006/12/19/121


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 22:58                                 ` Andrew Morton
@ 2006-12-19 23:06                                   ` Peter Zijlstra
  2006-12-19 23:07                                     ` Peter Zijlstra
  2006-12-20  0:03                                     ` Linus Torvalds
  0 siblings, 2 replies; 154+ messages in thread
From: Peter Zijlstra @ 2006-12-19 23:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Nick Piggin, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Tue, 2006-12-19 at 14:58 -0800, Andrew Morton wrote:

> Well... we'd need to see (corruption && this-not-triggering) to be sure.
> 
> Peter, have you been able to trigger the corruption?

Yes; however the mail I sent describing that seems to be lost in space.

/me quotes from the sent folder:

> The bad news is, that doesn't help either. The good news is I can
> reproduce it.
> 
> What I did to achieve that:
>  
>  - get a sizable torrent from legaltorrents.com / or create a torrent
> yourself that is around ~600M and has multiple files.
> 
>  - start a tracker, and multiple seeds (I used three machines here)
> 
>  - pull the torrent on a fourth machine
> 
> the seeding machines don't much matter of course.
> 
> the fourth machine was a dual core x86-64 with an SMP kernel and
> PREEMPT, mem=256M (so that the torrent is quite a bit larger and does
> require writeout) and I used an ext3 partition with 1k blocks.


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 22:51                               ` Linus Torvalds
@ 2006-12-19 22:58                                 ` Andrew Morton
  2006-12-19 23:06                                   ` Peter Zijlstra
  0 siblings, 1 reply; 154+ messages in thread
From: Andrew Morton @ 2006-12-19 22:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Nick Piggin, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Tue, 19 Dec 2006 14:51:55 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> 
> 
> On Tue, 19 Dec 2006, Peter Zijlstra wrote:
> 
> > On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote:
> > > 
> > > On Tue, 19 Dec 2006, Linus Torvalds wrote:
> > > >
> > > >  here's a totally new tangent on this: it's possible that user code is 
> > > > simply BUGGY. 
> > 
> > I'm sad to say this doesn't trigger :-(
> 
> Oh, well. It was a theory. 
> 

Well... we'd need to see (corruption && this-not-triggering) to be sure.

Peter, have you been able to trigger the corruption?

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 21:30                             ` Peter Zijlstra
@ 2006-12-19 22:51                               ` Linus Torvalds
  2006-12-19 22:58                                 ` Andrew Morton
  2006-12-20 18:02                               ` Stephen Clark
  1 sibling, 1 reply; 154+ messages in thread
From: Linus Torvalds @ 2006-12-19 22:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nick Piggin, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr



On Tue, 19 Dec 2006, Peter Zijlstra wrote:

> On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote:
> > 
> > On Tue, 19 Dec 2006, Linus Torvalds wrote:
> > >
> > >  here's a totally new tangent on this: it's possible that user code is 
> > > simply BUGGY. 
> 
> I'm sad to say this doesn't trigger :-(

Oh, well. It was a theory. 

		Linus

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 17:43                         ` Linus Torvalds
  2006-12-19 18:59                           ` Linus Torvalds
@ 2006-12-19 21:56                           ` Florian Weimer
  2006-12-21 13:03                           ` Peter Zijlstra
  2 siblings, 0 replies; 154+ messages in thread
From: Florian Weimer @ 2006-12-19 21:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Marc Haber,
	Martin Michlmayr

* Linus Torvalds:

> Now, this should _matter_ only for user processes that are buggy,
> and that have written to the page _before_ extending it with
> ftruncate().

APT seems to properly extend the file before mapping it, by writing a
zero byte at the desired position (creating a hole).

24986 open("/var/cache/apt/pkgcache.bin", O_RDWR|O_CREAT|O_TRUNC, 0666) = 6

24986 lseek(6, 12582911, SEEK_SET)      = 12582911
24986 write(6, "\0", 1)                 = 1

24986 mmap(NULL, 12582912, PROT_READ|PROT_WRITE, MAP_SHARED, 6, 0) = 0x2b6578636000

24986 msync(0x2b6578636000, 7464112, MS_SYNC) = 0
24986 msync(0x2b6578636000, 8656, MS_SYNC) = 0
24986 munmap(0x2b6578636000, 12582912)  = 0
24986 ftruncate(6, 7464112)             = 0
24986 fstat(6, {st_mode=S_IFREG|0644, st_size=7464112, ...}) = 0
24986 mmap(NULL, 7464112, PROT_READ, MAP_SHARED, 6, 0) = 0x2b6578636000

APT's code is pretty convoluted, though, and there might be some code
path in it that gets it wrong. 8-P
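
For anyone who wants to replay the syscall sequence from the trace above
outside of APT, here is a small C sketch. The function name
toy_apt_style_cache(), its parameters and the 0x5a fill byte are made up
for illustration; only the order of calls (extend by writing one zero
byte, map shared, msync, munmap, ftruncate, fstat, map read-only) follows
the strace. It is not APT's code.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: reserve "reserve" bytes, fill the first "used"
 * bytes through a shared mapping, shrink the file to "used" and verify
 * it through a read-only mapping. Returns 0 on success, -1 otherwise.
 */
int toy_apt_style_cache(const char *path, size_t reserve, size_t used)
{
	struct stat st;
	unsigned char *map;
	const unsigned char *ro;
	size_t i;
	int fd, ret = 0;

	assert(used > 0 && used <= reserve);

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666);
	if (fd < 0)
		return -1;
	/* extend the file by writing one zero byte at the last offset */
	if (lseek(fd, (off_t)reserve - 1, SEEK_SET) < 0 ||
	    write(fd, "\0", 1) != 1)
		goto fail;

	map = mmap(NULL, reserve, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto fail;
	memset(map, 0x5a, used);
	if (msync(map, used, MS_SYNC) < 0) {
		munmap(map, reserve);
		goto fail;
	}
	munmap(map, reserve);

	/* shrink to the part actually used, as in the trace */
	if (ftruncate(fd, (off_t)used) < 0 || fstat(fd, &st) < 0 ||
	    st.st_size != (off_t)used)
		goto fail;

	ro = mmap(NULL, used, PROT_READ, MAP_SHARED, fd, 0);
	if (ro == MAP_FAILED)
		goto fail;
	for (i = 0; i < used; i++)
		if (ro[i] != 0x5a)
			ret = -1;
	munmap((void *)ro, used);
	close(fd);
	return ret;
fail:
	close(fd);
	return -1;
}
```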

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 18:59                           ` Linus Torvalds
@ 2006-12-19 21:30                             ` Peter Zijlstra
  2006-12-19 22:51                               ` Linus Torvalds
  2006-12-20 18:02                               ` Stephen Clark
  2006-12-20  5:56                             ` Jari Sundell
  1 sibling, 2 replies; 154+ messages in thread
From: Peter Zijlstra @ 2006-12-19 21:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote:
> 
> On Tue, 19 Dec 2006, Linus Torvalds wrote:
> >
> >  here's a totally new tangent on this: it's possible that user code is 
> > simply BUGGY. 

I'm sad to say this doesn't trigger :-(



^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  6:34             ` Linus Torvalds
  2006-12-19  6:51               ` Nick Piggin
@ 2006-12-19 20:03               ` dean gaudet
  1 sibling, 0 replies; 154+ messages in thread
From: dean gaudet @ 2006-12-19 20:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Mon, 18 Dec 2006, Linus Torvalds wrote:

> On Tue, 19 Dec 2006, Nick Piggin wrote:
> > 
> > We never want to drop dirty data! (ignoring the truncate case, which is
> > handled privately by truncate anyway)
> 
> Bzzt.
> 
> SURE we do.
> 
> We absolutely do want to drop dirty data in the writeout path.
> 
> How do you think dirty data ever _becomes_ clean data?
> 
> In other words, yes, we _do_ want to test-and-clear all the pgtable bits 
> _and_ the PG_dirty bit. We want to do it for:
>  - writeout
>  - truncate
>  - possibly a "drop" event (which could be a case for a journal entry that 
>    becomes stale due to being replaced or something - kind of "truncate" 
>    on metadata)
> 
> because both of those events _literally_ turn dirty state into clean 
> state.
> 
> In no other circumstance do we ever want to clear a dirty bit, as far as I 
> can tell. 

i admit this may not be entirely relevant, but it seems like a good place 
to bring up an old problem:  when a disk dies with lots of queued writes 
it can totally bring a system to its knees... even after the disk is 
removed.  i wrote up something about this a while ago:

http://lkml.org/lkml/2005/8/18/243

so there's another reason to "clear a dirty bit"... well, in fact -- drop 
the pages entirely.

-dean

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 17:43                         ` Linus Torvalds
@ 2006-12-19 18:59                           ` Linus Torvalds
  2006-12-19 21:30                             ` Peter Zijlstra
  2006-12-20  5:56                             ` Jari Sundell
  2006-12-19 21:56                           ` Florian Weimer
  2006-12-21 13:03                           ` Peter Zijlstra
  2 siblings, 2 replies; 154+ messages in thread
From: Linus Torvalds @ 2006-12-19 18:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr



On Tue, 19 Dec 2006, Linus Torvalds wrote:
>
>  here's a totally new tangent on this: it's possible that user code is 
> simply BUGGY. 

Btw, here's a simpler test-program that actually shows the difference 
between 2.6.18 and 2.6.19 in action, and why it could explain why a 
program like rtorrent might show corruption behaviour that it didn't show 
before.

	#include <sys/mman.h>
	#include <sys/fcntl.h>
	#include <unistd.h>
	#include <string.h>
	
	int main(int argc, char **argv)
	{
		char *mapping;
		int fd;
	
		fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
		if (fd < 0)
			return -1;
		if (ftruncate(fd, 10) < 0)
			return -1;
		mapping = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (mapping == MAP_FAILED)
			return -1;
		memset(mapping, 0xaa, 20);
		sync();
		if (ftruncate(fd, 40) < 0)
			return -1;
		memset(mapping + 20, 0x55, 20);
		write(1, mapping, 40);
		return 0;
	}

Notice the "sync()" in between the "memset()" and the "ftruncate()". In 
2.6.18, that would normally do absolutely _nothing_ to the shared memory 
mapping, because we simply couldn't track pages that were dirty in the 
page tables. 

So in 2.6.18, if you try this, with

	./a.out | od -x

you should see something like

	0000000 aaaa aaaa aaaa aaaa aaaa aaaa aaaa aaaa
	0000020 aaaa aaaa 5555 5555 5555 5555 5555 5555
	0000040 5555 5555 5555 5555
	0000050

which matches your memset() patterns: 20 bytes of 0xaa, and 20 bytes of 
0x55.

HOWEVER. 

In 2.6.19, because we actually track dirty data so much better, "sync()" 
will actually be smart enough to write out the dirty mmap'ed data too. But 
since the user program has only allocated ten bytes for it in the file, 
when it is written out, the rest of the page is cleared. When you then 
write the last 20 bytes (after _properly_ allocating memory for them), you 
should now see a pattern like

	0000000 aaaa aaaa aaaa aaaa aaaa 0000 0000 0000
	0000020 0000 0000 5555 5555 5555 5555 5555 5555
	0000040 5555 5555 5555 5555
	0000050

instead: with ten bytes of zero in between, because the data that couldn't 
be written out was cleared.

So 2.6.19 is strictly _better_, but exactly because it's tracking dirty 
status much more precisely, you'll see certain user-level bugs much more 
easily.

NOTE NOTE NOTE! The code really _was_ buggy in 2.6.18 too, and you _can_ 
get the zeroes in the middle of the file with an older kernel. But in 
older kernels, you need to be really really unlucky, and have the page 
cleaned by strong memory pressure. In 2.6.19, any "sync()" activity 
(including from the outside) will clean the page, so a user program with 
this bug can just be made to trigger the bug much more easily.

			Linus

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 16:51                       ` Linus Torvalds
@ 2006-12-19 17:43                         ` Linus Torvalds
  2006-12-19 18:59                           ` Linus Torvalds
                                             ` (2 more replies)
  0 siblings, 3 replies; 154+ messages in thread
From: Linus Torvalds @ 2006-12-19 17:43 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4156 bytes --]



Btw,
 here's a totally new tangent on this: it's possible that user code is 
simply BUGGY. 

There is one case where the kernel actually forcibly writes zeroes into a 
file: when we're writing a page that straddles the "inode->i_size" 
boundary. See the various writepages in fs/buffer.c, they all contain 
variations on that theme (although most of them aren't as well commented 
as this snippet):

        /*
         * The page straddles i_size.  It must be zeroed out on each and every
         * writepage invocation because it may be mmapped.  "A file is mapped
         * in multiples of the page size.  For a file that is not a multiple of
         * the  page size, the remaining memory is zeroed when mapped, and
         * writes to that region are not written out to the file."
         */
        kaddr = kmap_atomic(page, KM_USER0);
        memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
        flush_dcache_page(page);
        kunmap_atomic(kaddr, KM_USER0);

Now, this should _matter_ only for user processes that are buggy, and that 
have written to the page _before_ extending it with ftruncate(). That's 
definitely a serious bug, but it's one that can go totally undetected 
depending on when the actual write-out happens.

So what I'm saying is that if we end up writing things earlier thanks to 
the more aggressive dirty-page-management thing in 2.6.19, we might 
actually just expose a long-time userspace bug that was just a LOT harder 
to trigger before..

I'm not saying this is the cause of all this, but we've been tearing our 
hair out, and it might be worthwhile trying this really really really 
stupid patch that will notice when that happens at truncate() time, and 
tell the user that he's a total idiot. Or something to that effect.

Maybe the reason this is so easy to trigger with rtorrent is not because 
rtorrent does some magic pattern that triggers a kernel bug, but simply 
because rtorrent itself might have a bug.

Ok, so it's a long shot, but it's still worth testing, I suspect. The 
patch is very simple: whenever we do an _expanding_ truncate, we check the 
last page of the _old_ size, and if there were non-zero contents past the 
old size, we complain.

Attached is a test-program that _should_ trigger a 
kernel message like

	a.out: BADNESS: truncate check 17000

for good measure, just so that you can verify that the patch works and 
actually catches this case.

(The 17000 number is just the one-hundred _invalid_ 0xaa bytes - out of 
the 200 we wrote - that were summed up: 100*0xaa == 17000. Anything 
non-zero is always a bug).

I doubt this is really it, but it's worth trying. If you fill out a page, 
and only do "ftruncate()" in response to SIGBUS messages (and don't 
truncate to whole pages), you could potentially see zeroes at the end of 
the page exactly because _writeout_ cleared the page for you! So it 
_could_ explain the symptoms, but only if user-space was horribly horribly 
broken.

		Linus

----
diff --git a/mm/memory.c b/mm/memory.c
index c00bac6..79cecab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
+static void check_last_page(struct address_space *mapping, loff_t size)
+{
+	pgoff_t index;
+	unsigned int offset;
+	struct page *page;
+
+	if (!mapping)
+		return;
+	offset = size & ~PAGE_MASK;
+	if (!offset)
+		return;
+	index = size >> PAGE_SHIFT;
+	page = find_lock_page(mapping, index);
+	if (page) {
+		unsigned int check = 0;
+		unsigned char *kaddr = kmap_atomic(page, KM_USER0);
+		do {
+			check += kaddr[offset++];
+		} while (offset < PAGE_SIZE);
+		kunmap_atomic(kaddr,KM_USER0);
+		unlock_page(page);
+		page_cache_release(page);
+		if (check)
+			printk("%s: BADNESS: truncate check %u\n", current->comm, check);
+	}
+}
+
 /**
  * vmtruncate - unmap mappings "freed" by truncate() syscall
  * @inode: inode of the file used
@@ -1875,6 +1902,7 @@ do_expand:
 		goto out_sig;
 	if (offset > inode->i_sb->s_maxbytes)
 		goto out_big;
+	check_last_page(mapping, inode->i_size);
 	i_size_write(inode, offset);
 
 out_truncate:

[-- Attachment #2: Type: TEXT/PLAIN, Size: 566 bytes --]

#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <string.h>

int main(int argc, char **argv)
{
	char *mapping;
	int fd;

	fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, 10) < 0)
		return -1;
	mapping = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (mapping == MAP_FAILED)
		return -1;
	memset(mapping, 0x55, 10);
	if (ftruncate(fd, 100) < 0)
		return -1;
	memset(mapping, 0xaa, 200);
	if (ftruncate(fd, 200) < 0)
		return -1;
	return 0;
}

^ permalink raw reply related	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
       [not found]                     ` <4587B762.2030603@yahoo.com.au>
  2006-12-19 10:32                       ` Andrew Morton
@ 2006-12-19 16:51                       ` Linus Torvalds
  2006-12-19 17:43                         ` Linus Torvalds
  1 sibling, 1 reply; 154+ messages in thread
From: Linus Torvalds @ 2006-12-19 16:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr



On Tue, 19 Dec 2006, Nick Piggin wrote:
> 
> Counterexample? Well AFAIKS, the clearing of PG_dirty in ttfb() in
> response to finding all buffers clean is perfectly valid. What makes
> you think otherwise?

If the page really is clean, then why the heck can't we just clean the 
page table bits too?

Either it's clean or it isn't. If all the buffers being clean means that 
the page is clean, then it's clean. WE SHOULD NOT THINK THAT PTE'S ARE ANY 
DIFFERENT.

I really don't see your point. Is it clean? If it is, then clear the damn 
dirty bits from the page tables too. Don't go pussyfooting around the 
issue and confuse yourself and everybody but me by saying "but if it's 
dirty in the page tables, it's magically dirty". NO.

It really is that simple. Is it clean or not?

If it's clean, you can remove ALL the dirty bits. Not just some.

			Linus

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  9:40                   ` Nick Piggin
@ 2006-12-19 16:46                     ` Linus Torvalds
  0 siblings, 0 replies; 154+ messages in thread
From: Linus Torvalds @ 2006-12-19 16:46 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr



On Tue, 19 Dec 2006, Nick Piggin wrote:
> 
> Now I'm not exactly sure how ext3 (or any other) filesystems make use
> of this particular feature of try_to_free_buffers(), but it is clear
> from the comments what it is for. So your patch isn't really a minimal
> fix (ie. it would require an OK from all filesystems, wouldn't it?)
> 
> Or did I miss a mail where you reasoned that it is safe to make this
> change (/me goes to reread the thread)...

I'm saying it had _better_ be safe, and no, low-level filesystems don't 
actually matter.

The page has to be cleanable _some_ way. So if we test for "PageDirty()" 
at the top, and just refuse to do it in try_to_free_buffers(), we still 
know that the _proper_ page cleaning had better clean it. Because ttfb() 
is never going to clean the page in the general case _anyway_.

So I'm really saying:

 - the page WILL be cleaned by the real page cleaning action (ie memory 
   pressure or sync or something else causing us to go through the 
   bog-standard page-based writeout).

   Does anybody dispute this?

 - the "ttfb()" hack was a HACK. It was an ugly and nasty hack even when 
   it was first introduced. It gets doubly worse now that we know we have 
   something wrong with page cleaning, and it has distracted from the real 
   problem.

 - I removed that ugly and disgusting hack entirely at first, but Andrew 
   points out that he really wants to keep the buffers there, because the 
   buffers being clean actually say something. That, together with the 
   fact that as long as the page is dirty, the buffers really do end up 
   having a job to do, made me add a much smaller hack to replace the big 
   ugly one ("don't even try, if the page is marked dirty").

 - so with that thing in place, there isn't even any change in behaviour 
   wrt the buffers and low-level filesystems. It's just that we make them 
   a bit harder to get rid of. But arguably that shouldn't actually ever 
   really _happen_ anyway (because I think it's a BUG if the page is 
   marked dirty but none of the buffers are), so I think that part is a 
   non-issue.

In other words, ttfb() _never_ had anything to do with "page cleaning". 
Not originally, not with the horrible hack, and not with my patch. 

Trying to mix it in just caused a bug that _everybody_ agrees is a bug. 
It's not the bug we're chasing, but we've got three different patches to 
fix it (Andrew's, mine and yours), and mine is the simplest one by far 
especially in the long run, because it just REMOVES the ugly dependency.

And yes, I probably care more about "in the long run" than most. To me, a 
bug is a bug even if it's _just_ a maintenance headache. Andrew's patch 
made things _worse_ ("magic insane flag"), and while yours didn't make the 
code worse, it still introduced the notion of a totally insane "clean the 
page but if the PTE's are dirty, do something else" notion.

IF THE PAGE TRULY IS CLEAN (and both you and Andrew claim it is, if all 
buffers are clean - since you mark it clean in the non-mapped case) THEN 
YOU SHOULD BE ABLE TO CLEAN THE PAGE TABLE BITS TOO.

And by claiming that the page table bits are different from PG_dirty, 
you're just making the issues worse. They shouldn't be. That's what the 
whole point of Peter's patch was: PG_dirty fundamentally _means_ that the 
page tables might be dirty too. That was the whole _point_ of doing all 
this in 2.6.19 in the first place.

So if you cannot accept that page table bits should be on "equal footing" 
with PG_dirty, then you should just say "Let's remove Peter's patch 
entirely".

		Linus

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 10:58                           ` Nick Piggin
@ 2006-12-19 11:51                             ` Peter Zijlstra
  0 siblings, 0 replies; 154+ messages in thread
From: Peter Zijlstra @ 2006-12-19 11:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linus Torvalds, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Tue, 2006-12-19 at 21:58 +1100, Nick Piggin wrote:
> Peter Zijlstra wrote:
> > On Tue, 2006-12-19 at 02:32 -0800, Andrew Morton wrote:
> 
> >>Well it used to be.  After 2.6.19 it can do the wrong thing for mapped
> >>pages.  But it turns out that we don't feed it mapped pages, apart from
> >>pagevec_strip() and possibly races against pagefaults.
> > 
> > 
> > So how about this:
> 
> Well that's still racy. Anyway several earlier patches (including
> the one I posted) closed this race. Some were still reported to
> trigger corruption IIRC.

I can't remember a patch that removes mapped pages from this code path,
however I could have missed it. All out removing the mapping branch in
ttfb() did also fix the problem - which is a superset of page_mapped().

I'm now building a kernel with this patch, and will submit that to
rtorrent with mem=256M on a 1k ext3 filesystem on x86_64 smp preempt.

---
 fs/buffer.c |   32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2798,11 +2798,38 @@ static inline int buffer_busy(struct buf
 		(bh->b_state & ((1 << BH_Dirty) | (1 << BH_Lock)));
 }
 
+/*
+ * AKPM sayeth:
+ *
+ * - a process does a one-byte-write to a file on a 64k pagesize, 4k
+ *   blocksize ext3 filesystem.  The page is now PageDirty, !PageUptodate and
+ *   has one dirty buffer and 15 not uptodate buffers.
+ *
+ * - kjournald writes the dirty buffer.  The page is now PageDirty,
+ *   !PageUptodate and has a mix of clean and not uptodate buffers.
+ *
+ * - try_to_free_buffers() removes the page's buffers.  It MUST now clear
+ *   PageDirty.  If we were to leave the page dirty then we'd have a dirty, not
+ *   uptodate page with no buffer_heads.
+ *
+ *   We're screwed: we cannot write the page because we don't know which
+ *   sections of it contain garbage.  We cannot read the page because we don't
+ *   know which sections of it contain modified data.  We cannot free the page
+ *   because it is dirty.
+ *
+ * However for mapped pages this is not true; mapped pages will be fully
+ * loaded and thus cannot have not uptodate buffers.
+ *
+ * Hence allow the PG_dirty bit to stay for pages that had no not uptodate
+ * buffers (and assert that mapped pages never have those).
+ */
+
 static int
 drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
 {
 	struct buffer_head *head = page_buffers(page);
 	struct buffer_head *bh;
+	int uptodate = 1;
 
 	bh = head;
 	do {
@@ -2818,11 +2845,14 @@ drop_buffers(struct page *page, struct b
 
 		if (!list_empty(&bh->b_assoc_buffers))
 			__remove_assoc_queue(bh);
+		if (!buffer_uptodate(bh))
+			uptodate = 0;
 		bh = next;
 	} while (bh != head);
 	*buffers_to_free = head;
 	__clear_page_buffers(page);
-	return 1;
+	VM_BUG_ON(page_mapped(page) && !uptodate);
+	return !uptodate;
 failed:
 	return 0;
 }



^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 10:52                         ` Peter Zijlstra
@ 2006-12-19 10:58                           ` Nick Piggin
  2006-12-19 11:51                             ` Peter Zijlstra
  0 siblings, 1 reply; 154+ messages in thread
From: Nick Piggin @ 2006-12-19 10:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Linus Torvalds, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

Peter Zijlstra wrote:
> On Tue, 2006-12-19 at 02:32 -0800, Andrew Morton wrote:

>>Well it used to be.  After 2.6.19 it can do the wrong thing for mapped
>>pages.  But it turns out that we don't feed it mapped pages, apart from
>>pagevec_strip() and possibly races against pagefaults.
> 
> 
> So how about this:

Well that's still racy. Anyway several earlier patches (including
the one I posted) closed this race. Some were still reported to
trigger corruption IIRC.

> Index: linux-2.6-git/mm/page-writeback.c
> ===================================================================
> --- linux-2.6-git.orig/mm/page-writeback.c	2006-12-19 08:24:48.000000000 +0100
> +++ linux-2.6-git/mm/page-writeback.c	2006-12-19 11:43:31.000000000 +0100
> @@ -859,6 +859,9 @@ int test_clear_page_dirty(struct page *p
>  	struct address_space *mapping = page_mapping(page);
>  	unsigned long flags;
>  
> +	if (page_mapped(page))
> +		return 0;
> +
>  	if (!mapping)
>  		return TestClearPageDirty(page);
>  
> 
> 
> -

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 10:32                       ` Andrew Morton
                                           ` (2 preceding siblings ...)
  2006-12-19 10:52                         ` Peter Zijlstra
@ 2006-12-19 10:55                         ` Nick Piggin
  3 siblings, 0 replies; 154+ messages in thread
From: Nick Piggin @ 2006-12-19 10:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

Andrew Morton wrote:
> On Tue, 19 Dec 2006 20:56:50 +1100
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:

>>I think it could be very likely that indeed the bug is a latent one in
>>a clear_page_dirty caller, rather than dirty-tracking itself.
> 
> 
> The only callers are try_to_free_buffers(), truncate and a few scruffy
> possibly-wrong-for-fsync filesystems which aren't being used here.

Well truncate/invalidate will not operate on mapped pages (barring the
very-unlikely truncate/invalidate vs fault races). We can ignore those
filesystems as they don't include ext3. Which brings us back to
try_to_free_buffers().

Maybe it is something else entirely, but did try_to_free_buffers ever
get completely cleared? Or was some of Andrei's corruption possibly
leftover on-disk corruption from a previous kernel?

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 10:32                       ` Andrew Morton
  2006-12-19 10:42                         ` Nick Piggin
  2006-12-19 10:47                         ` Andrew Morton
@ 2006-12-19 10:52                         ` Peter Zijlstra
  2006-12-19 10:58                           ` Nick Piggin
  2006-12-19 10:55                         ` Nick Piggin
  3 siblings, 1 reply; 154+ messages in thread
From: Peter Zijlstra @ 2006-12-19 10:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Linus Torvalds, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Tue, 2006-12-19 at 02:32 -0800, Andrew Morton wrote:
> On Tue, 19 Dec 2006 20:56:50 +1100
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> > Linus Torvalds wrote:
> > 
> > > NOTICE? First you make a BIG DEAL about how dirty bits should never get 
> > > lost, but THE VERY SAME FUNCTION actually very much on purpose DOES drop 
> > > the dirty bit for when it's not in the page tables.
> > 
> > try_to_free_buffers is quite a special case, where we're transferring
> > the page dirty metadata from the buffers to the page. I think Andrew
> > would have a better grasp of it so he could correct me, but what it
> > does is legitimate.
> 
> Well it used to be.  After 2.6.19 it can do the wrong thing for mapped
> pages.  But it turns out that we don't feed it mapped pages, apart from
> pagevec_strip() and possibly races against pagefaults.

So how about this:

Index: linux-2.6-git/mm/page-writeback.c
===================================================================
--- linux-2.6-git.orig/mm/page-writeback.c	2006-12-19 08:24:48.000000000 +0100
+++ linux-2.6-git/mm/page-writeback.c	2006-12-19 11:43:31.000000000 +0100
@@ -859,6 +859,9 @@ int test_clear_page_dirty(struct page *p
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
 
+	if (page_mapped(page))
+		return 0;
+
 	if (!mapping)
 		return TestClearPageDirty(page);
 



^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 10:32                       ` Andrew Morton
  2006-12-19 10:42                         ` Nick Piggin
@ 2006-12-19 10:47                         ` Andrew Morton
  2006-12-19 10:52                         ` Peter Zijlstra
  2006-12-19 10:55                         ` Nick Piggin
  3 siblings, 0 replies; 154+ messages in thread
From: Andrew Morton @ 2006-12-19 10:47 UTC (permalink / raw)
  To: Nick Piggin, Linus Torvalds, Peter Zijlstra, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Tue, 19 Dec 2006 02:32:55 -0800
Andrew Morton <akpm@osdl.org> wrote:

> <spots a race in do_no_page()>
> 
> If a write-fault races with a read-fault and the write-fault loses, we forget
> to mark the page dirty.

No that isn't right, is it.  The writer just retakes the fault and
all the right things happen.  Ho hum.

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19 10:32                       ` Andrew Morton
@ 2006-12-19 10:42                         ` Nick Piggin
  2006-12-19 10:47                         ` Andrew Morton
                                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 154+ messages in thread
From: Nick Piggin @ 2006-12-19 10:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

Andrew Morton wrote:
> On Tue, 19 Dec 2006 20:56:50 +1100
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> 
>>Linus Torvalds wrote:
>>
>>
>>>NOTICE? First you make a BIG DEAL about how dirty bits should never get 
>>>lost, but THE VERY SAME FUNCTION actually very much on purpose DOES drop 
>>>the dirty bit for when it's not in the page tables.
>>
>>try_to_free_buffers is quite a special case, where we're transferring
>>the page dirty metadata from the buffers to the page. I think Andrew
>>would have a better grasp of it so he could correct me, but what it
>>does is legitimate.
> 
> 
> Well it used to be.  After 2.6.19 it can do the wrong thing for mapped
> pages.

Yes, that is what I was trying to get at.

>  But it turns out that we don't feed it mapped pages, apart from
> pagevec_strip() and possibly races against pagefaults.

True, and I think we have pretty well established that this isn't the
cause of Andrei's problem, but I think we all agree it is *a* bug?

And surely Andrei's data corruption will be of the same flavour in
that test_clear_page_dirty somewhere is now stripping pte dirty bits
where it shouldn't? (because it went away after Peter nooped that
behaviour)

>>I think it could be very likely that indeed the bug is a latent one in
>>a clear_page_dirty caller, rather than dirty-tracking itself.
> 
> 
> The only callers are try_to_free_buffers(), truncate and a few scruffy
> possibly-wrong-for-fsync filesystems which aren't being used here.
> 
> 
> <spots a race in do_no_page()>
> 
> If a write-fault races with a read-fault and the write-fault loses, we forget
> to mark the page dirty.

Hmm.. in that case will the pte still be readonly, and thus the write
faulter will have to try again I think?

> 
> Something like this, but it's probably wrong - I didn't try very hard (am
> feeling ill, and vaguely grumpy)
> 
> 
> From: Andrew Morton <akpm@osdl.org>
> 
> Signed-off-by: Andrew Morton <akpm@osdl.org>
> ---
> 
>  mm/memory.c |   12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff -puN mm/memory.c~a mm/memory.c
> --- a/mm/memory.c~a
> +++ a/mm/memory.c
> @@ -2264,10 +2264,22 @@ retry:
>  		}
>  	} else {
>  		/* One of our sibling threads was faster, back out. */
> +		if (write_access) {
> +			/*
> +			 * We might have raced against a read-fault.  We still
> +			 * need to dirty the page.
> +			 */
> +			dirty_page = vm_normal_page(vma, address, *page_table);
> +			if (dirty_page) {
> +				get_page(dirty_page);
> +				goto dirty_it;
> +			}
> +		}
>  		page_cache_release(new_page);
>  		goto unlock;
>  	}
>  
> +dirty_it:
>  	/* no need to invalidate: a not-present page shouldn't be cached */
>  	update_mmu_cache(vma, address, entry);
>  	lazy_mmu_prot_update(entry);
> _
> 
> 


-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
       [not found]                     ` <4587B762.2030603@yahoo.com.au>
@ 2006-12-19 10:32                       ` Andrew Morton
  2006-12-19 10:42                         ` Nick Piggin
                                           ` (3 more replies)
  2006-12-19 16:51                       ` Linus Torvalds
  1 sibling, 4 replies; 154+ messages in thread
From: Andrew Morton @ 2006-12-19 10:32 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Peter Zijlstra, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Tue, 19 Dec 2006 20:56:50 +1100
Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Linus Torvalds wrote:
> 
> > NOTICE? First you make a BIG DEAL about how dirty bits should never get 
> > lost, but THE VERY SAME FUNCTION actually very much on purpose DOES drop 
> > the dirty bit for when it's not in the page tables.
> 
> try_to_free_buffers is quite a special case, where we're transferring
> the page dirty metadata from the buffers to the page. I think Andrew
> would have a better grasp of it so he could correct me, but what it
> does is legitimate.

Well it used to be.  After 2.6.19 it can do the wrong thing for mapped
pages.  But it turns out that we don't feed it mapped pages, apart from
pagevec_strip() and possibly races against pagefaults.

> I think it could be very likely that indeed the bug is a latent one in
> a clear_page_dirty caller, rather than dirty-tracking itself.

The only callers are try_to_free_buffers(), truncate and a few scruffy
possibly-wrong-for-fsync filesystems which aren't being used here.


<spots a race in do_no_page()>

If a write-fault races with a read-fault and the write-fault loses, we forget
to mark the page dirty.

Something like this, but it's probably wrong - I didn't try very hard (am
feeling ill, and vaguely grumpy)


From: Andrew Morton <akpm@osdl.org>

Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 mm/memory.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

diff -puN mm/memory.c~a mm/memory.c
--- a/mm/memory.c~a
+++ a/mm/memory.c
@@ -2264,10 +2264,22 @@ retry:
 		}
 	} else {
 		/* One of our sibling threads was faster, back out. */
+		if (write_access) {
+			/*
+			 * We might have raced against a read-fault.  We still
+			 * need to dirty the page.
+			 */
+			dirty_page = vm_normal_page(vma, address, *page_table);
+			if (dirty_page) {
+				get_page(dirty_page);
+				goto dirty_it;
+			}
+		}
 		page_cache_release(new_page);
 		goto unlock;
 	}
 
+dirty_it:
 	/* no need to invalidate: a not-present page shouldn't be cached */
 	update_mmu_cache(vma, address, entry);
 	lazy_mmu_prot_update(entry);
_


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  8:14                 ` Linus Torvalds
@ 2006-12-19  9:40                   ` Nick Piggin
  2006-12-19 16:46                     ` Linus Torvalds
  0 siblings, 1 reply; 154+ messages in thread
From: Nick Piggin @ 2006-12-19  9:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

Linus Torvalds wrote:
> 
> On Tue, 19 Dec 2006, Nick Piggin wrote:
> 
>>>Anyway it has the same issues as the others. See what happens when you
>>>run two test_clear_page_dirty_sync_ptes() consecutively, you still lose
>>>PG_dirty even though the page might actually be dirty.
>>
>>How can this happen? We'll only test_clear_page_dirty_sync_ptes again
>>after buffers have been reattached, and subsequently cleaned. And in
>>that case if the ptes are still clean at this point then the page really
>>is clean.
> 
> 
> Why do you talk about buffers being reattached? Are you still in some 
> world where "try_to_free_buffers()" matters? Have you not followed the 

I'm talking about fixing just the race Andrew noticed via inspection. No
it doesn't appear to fix Andrei's problem, unfortunately. But it needs
to be fixed all the same, doesn't it?

> discussion? Why do you ignore my MUCH SIMPLER patch that just removed all 
> this crap ENTIRELY from "try_to_free_buffers()", and the exact same 
> corruption happened?
> 
> Forget about "try_to_free_buffers()". Please apply this patch to your tree 
> first. That gets rid of _one_ copy of totally insane code that did all the 
> wrong things.
> 
> Only after you have applied this patch should you look at the code again. 
> Realizing that the corruption still happens.
> 
> So forget about buffers already. That piece of code was crap.

Now I'm not exactly sure how ext3 (or any other) filesystems make use
of this particular feature of try_to_free_buffers(), but it is clear
from the comments what it is for. So your patch isn't really a minimal
fix (ie. it would require an OK from all filesystems, wouldn't it?)

Or did I miss a mail where you reasoned that it is safe to make this
change (/me goes to reread the thread)...

> 
> 		Linus
> 
> ---
> diff --git a/fs/buffer.c b/fs/buffer.c
> index d1f1b54..263f88e 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
>  	int ret = 0;
>  
>  	BUG_ON(!PageLocked(page));
> -	if (PageWriteback(page))
> +	if (PageDirty(page) || PageWriteback(page))
>  		return 0;
>  
>  	if (mapping == NULL) {		/* can this still happen? */
> @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
>  	spin_lock(&mapping->private_lock);
>  	ret = drop_buffers(page, &buffers_to_free);
>  	spin_unlock(&mapping->private_lock);
> -	if (ret) {
> -		/*
> -		 * If the filesystem writes its buffers by hand (eg ext3)
> -		 * then we can have clean buffers against a dirty page.  We
> -		 * clean the page here; otherwise later reattachment of buffers
> -		 * could encounter a non-uptodate page, which is unresolvable.
> -		 * This only applies in the rare case where try_to_free_buffers
> -		 * succeeds but the page is not freed.
> -		 *
> -		 * Also, during truncate, discard_buffer will have marked all
> -		 * the page's buffers clean.  We discover that here and clean
> -		 * the page also.
> -		 */
> -		if (test_clear_page_dirty(page))
> -			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
> -	}
>  out:
>  	if (buffers_to_free) {
>  		struct buffer_head *bh = buffers_to_free;
> 


-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  8:51                 ` Marc Haber
@ 2006-12-19  9:28                   ` Martin Michlmayr
  2006-12-28 18:05                   ` Marc Haber
  1 sibling, 0 replies; 154+ messages in thread
From: Martin Michlmayr @ 2006-12-19  9:28 UTC (permalink / raw)
  To: Marc Haber
  Cc: Andrew Morton, Nick Piggin, Linus Torvalds, andrei.popa,
	Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins,
	Florian Weimer

* Marc Haber <mh+linux-kernel@zugschlus.de> [2006-12-19 09:51]:
> I do not have a clue about memory management at all, but is it
> possible that you're testing on a box with too much memory? My box has
> only 256 MB, and I used to use mutt with a _huge_ inbox with mutt
> taking somewhat 150 MB. Add spamassassin and a reasonably busy mail
> server, and the box used to be like 150 MB in swap.

FWIW, the ARM box I see this on has only 32 MB memory (and a 133 or
266 MHz CPU).  I don't see it on another ARM box (different ARM
sub-arch) with 128 MB memory and a 600 MHz CPU.
-- 
Martin Michlmayr
http://www.cyrius.com/

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  8:24                                             ` Andrew Morton
  2006-12-19  8:34                                               ` Pekka Enberg
@ 2006-12-19  9:13                                               ` Marc Haber
  1 sibling, 0 replies; 154+ messages in thread
From: Marc Haber @ 2006-12-19  9:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: andrei.popa, Linus Torvalds, Peter Zijlstra,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Martin Michlmayr

On Tue, Dec 19, 2006 at 12:24:16AM -0800, Andrew Morton wrote:
> Wow.  I didn't expect that, because Mark Haber reported that ext3's data=writeback
> fixed it.   Maybe he didn't run it for long enough?

My test case is Debian's "aptitude update" running once an hour, and
it was always the same file getting corrupted. With 2.6.19, I had this
corruption about every third hour (but -only- when run from cron; running
from a shell was always fine). data=writeback made the issue disappear
for about two days. I then booted into 2.6.19.1 without data=writeback
(using the defaults), and since then the issue only shows up about
every other day.

So I feel somewhat out of the loop, since rtorrent seems much better at
reproducing this.

I notice, though, that both aptitude and rtorrent do downloads from
the net, so there might be a relation to tcp/ip and/or the network
driver. My box has a Linksys NC100 network card running with the tulip
driver.

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Mannheim, Germany  |  lose things."    Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature |  How to make an American Quilt | Fax: *49 621 72739835

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  9:00                     ` Peter Zijlstra
@ 2006-12-19  9:05                       ` Peter Zijlstra
  0 siblings, 0 replies; 154+ messages in thread
From: Peter Zijlstra @ 2006-12-19  9:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Tue, 2006-12-19 at 10:00 +0100, Peter Zijlstra wrote:
> On Tue, 2006-12-19 at 00:04 -0800, Linus Torvalds wrote:
> 
> > Nobody has actually ever explained why "test_clear_page_dirty()" is good 
> > at all.
> > 
> >  - Why is it ever used instead of "clear_page_dirty_for_io()"?
> > 
> >  - What is the difference?
> > 
> >  - Why would you EVER want to clear bits just in the "struct page *" or 
> >    just in the PTE's?
> > 
> >  - Why is it EVER correct to clear dirty bits except JUST BEFORE THE IO?
> > 
> > In other words, I have a theory:
> > 
> >  "A lot of this is actually historical cruft. Some of it may even be code 
> >   that was never supposed to work, but because we maintained _other_ dirty 
> >   bits in the PTE's, and never touched them before, we never even realized 
> >   that the code that played with PG_dirty was totally insane"
> > 
> > Now, that's just a theory. And yeah, it may be stated a bit provocatively. 
> > It may not be entirely correct. I'm just saying.. maybe it is?
> 
> On Sun, 2006-12-17 at 15:40 -0800, Andrew Morton wrote:
> 
> > try_to_free_buffers() clears the page's dirty state if it successfully removed
> > the page's buffers.
> > 
> >   Background for this:
> > 
> >   - a process does a one-byte-write to a file on a 64k pagesize, 4k
> >     blocksize ext3 filesystem.  The page is now PageDirty, !PgeUptodate and
> >     has one dirty buffer and 15 not uptodate buffers.
> > 
> >   - kjournald writes the dirty buffer.  The page is now PageDirty,
> >     !PageUptodate and has a mix of clean and not uptodate buffers.
> > 
> >   - try_to_free_buffers() removes the page's buffers.  It MUST now clear
> >     PageDirty.  If we were to leave the page dirty then we'd have a dirty, not
> >     uptodate page with no buffer_heads.
> > 
> >     We're screwed: we cannot write the page because we don't know which
> >     sections of it contain garbage.  We cannot read the page because we don't
> >     know which sections of it contain modified data.  We cannot free the page
> >     because it is dirty.
> 
> However!! this is not true for mapped pages because mapped pages must
> have the whole (16k in akpm's example) page loaded. Hence I suspect that
> what Andrei did by accident - remove the if (mapping) case in
> test_clean_dirty_pages() - is actually totally correct.

Obviously I need my morning shot: 64k, of course.


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  8:04                   ` Linus Torvalds
@ 2006-12-19  9:00                     ` Peter Zijlstra
  2006-12-19  9:05                       ` Peter Zijlstra
       [not found]                     ` <4587B762.2030603@yahoo.com.au>
  1 sibling, 1 reply; 154+ messages in thread
From: Peter Zijlstra @ 2006-12-19  9:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Tue, 2006-12-19 at 00:04 -0800, Linus Torvalds wrote:

> Nobody has actually ever explained why "test_clear_page_dirty()" is good 
> at all.
> 
>  - Why is it ever used instead of "clear_page_dirty_for_io()"?
> 
>  - What is the difference?
> 
>  - Why would you EVER want to clear bits just in the "struct page *" or 
>    just in the PTE's?
> 
>  - Why is it EVER correct to clear dirty bits except JUST BEFORE THE IO?
> 
> In other words, I have a theory:
> 
>  "A lot of this is actually historical cruft. Some of it may even be code 
>   that was never supposed to work, but because we maintained _other_ dirty 
>   bits in the PTE's, and never touched them before, we never even realized 
>   that the code that played with PG_dirty was totally insane"
> 
> Now, that's just a theory. And yeah, it may be stated a bit provocatively. 
> It may not be entirely correct. I'm just saying.. maybe it is?

On Sun, 2006-12-17 at 15:40 -0800, Andrew Morton wrote:

> try_to_free_buffers() clears the page's dirty state if it successfully removed
> the page's buffers.
> 
>   Background for this:
> 
>   - a process does a one-byte-write to a file on a 64k pagesize, 4k
>     blocksize ext3 filesystem.  The page is now PageDirty, !PgeUptodate and
>     has one dirty buffer and 15 not uptodate buffers.
> 
>   - kjournald writes the dirty buffer.  The page is now PageDirty,
>     !PageUptodate and has a mix of clean and not uptodate buffers.
> 
>   - try_to_free_buffers() removes the page's buffers.  It MUST now clear
>     PageDirty.  If we were to leave the page dirty then we'd have a dirty, not
>     uptodate page with no buffer_heads.
> 
>     We're screwed: we cannot write the page because we don't know which
>     sections of it contain garbage.  We cannot read the page because we don't
>     know which sections of it contain modified data.  We cannot free the page
>     because it is dirty.

However!! This is not true for mapped pages, because mapped pages must
have the whole (16k in akpm's example) page loaded. Hence I suspect that
what Andrei did by accident - remove the if (mapping) case in
test_clear_page_dirty() - is actually totally correct.
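
The buffer bookkeeping in akpm's example (64k page, 4k blocks) can be
sketched as a toy userspace model. This is not kernel code: every toy_*
name is hypothetical and only stands in for the state described above,
under the assumption that a mapped page is always fully uptodate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_BUFFERS 64

enum toy_bh_state { TOY_BH_NOT_UPTODATE, TOY_BH_CLEAN, TOY_BH_DIRTY };

/* Toy stand-in for a page cache page with attached buffer_heads. */
struct toy_buffered_page {
	size_t block_size;
	size_t nr_buffers;
	bool has_buffers;	/* buffer_heads still attached? */
	bool pg_dirty;		/* models PG_dirty */
	bool pg_uptodate;	/* models PG_uptodate */
	enum toy_bh_state bh[TOY_MAX_BUFFERS];
};

void toy_init(struct toy_buffered_page *p, size_t page_size, size_t block_size)
{
	size_t i;

	assert(block_size > 0 && page_size / block_size <= TOY_MAX_BUFFERS);
	p->block_size = block_size;
	p->nr_buffers = page_size / block_size;
	p->has_buffers = true;
	p->pg_dirty = false;
	p->pg_uptodate = false;
	for (i = 0; i < p->nr_buffers; i++)
		p->bh[i] = TOY_BH_NOT_UPTODATE;
}

/* One-byte write(2): only the covering block becomes dirty. */
void toy_write_byte(struct toy_buffered_page *p, size_t offset)
{
	assert(offset / p->block_size < p->nr_buffers);
	p->bh[offset / p->block_size] = TOY_BH_DIRTY;
	p->pg_dirty = true;
}

/* The journal writes dirty buffers by hand; PG_dirty is left alone. */
void toy_journal_writeout(struct toy_buffered_page *p)
{
	size_t i;

	for (i = 0; i < p->nr_buffers; i++)
		if (p->bh[i] == TOY_BH_DIRTY)
			p->bh[i] = TOY_BH_CLEAN;
}

size_t toy_count(const struct toy_buffered_page *p, enum toy_bh_state s)
{
	size_t i, n = 0;

	for (i = 0; i < p->nr_buffers; i++)
		if (p->bh[i] == s)
			n++;
	return n;
}

void toy_drop_buffers(struct toy_buffered_page *p)
{
	p->has_buffers = false;
}

/* Without buffers, only a fully uptodate page tells us what is valid. */
bool toy_knows_valid_blocks(const struct toy_buffered_page *p)
{
	return p->has_buffers || p->pg_uptodate;
}
```

In this model a 64k/4k page has 16 buffers; after the one-byte write,
one is dirty and 15 are not uptodate. Once the buffers are dropped,
toy_knows_valid_blocks() is false unless pg_uptodate is set, which is
the case the mapped-page argument relies on.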




^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  5:43               ` Andrew Morton
  2006-12-18  7:22                 ` Nick Piggin
@ 2006-12-19  8:51                 ` Marc Haber
  2006-12-19  9:28                   ` Martin Michlmayr
  2006-12-28 18:05                   ` Marc Haber
  1 sibling, 2 replies; 154+ messages in thread
From: Marc Haber @ 2006-12-19  8:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Linus Torvalds, andrei.popa,
	Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins,
	Florian Weimer, Martin Michlmayr

On Sun, Dec 17, 2006 at 09:43:08PM -0800, Andrew Morton wrote:
> Six hours here of fsx-linux plus high memory pressure on SMP on 1k
> blocksize ext3, mainline.  Zero failures.  It's unlikely that this testing
> would pass, yet people running normal workloads are able to easily trigger
> failures.  I suspect we're looking in the wrong place.

I do not have a clue about memory management at all, but is it
possible that you're testing on a box with too much memory? My box has
only 256 MB, and I used to use mutt with a _huge_ inbox, with mutt
taking some 150 MB. Add spamassassin and a reasonably busy mail
server, and the box used to be about 150 MB into swap.

I have tidied my inbox in the meantime and mutt's memory requirement
has dropped to about 30 MB, which might be why I don't see the issue
as often any more.

Greetings
Marc, just trying to give input

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Mannheim, Germany  |  lose things."    Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature |  How to make an American Quilt | Fax: *49 621 72739835

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  8:24                                             ` Andrew Morton
@ 2006-12-19  8:34                                               ` Pekka Enberg
  2006-12-19  9:13                                               ` Marc Haber
  1 sibling, 0 replies; 154+ messages in thread
From: Pekka Enberg @ 2006-12-19  8:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: andrei.popa, Linus Torvalds, Peter Zijlstra,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On 12/19/06, Andrew Morton <akpm@osdl.org> wrote:
> Wow.  I didn't expect that, because Mark Haber reported that ext3's data=writeback
> fixed it.   Maybe he didn't run it for long enough?

I don't think it did fix it for Marc:

http://marc.theaimsgroup.com/?l=linux-kernel&m=116625777306843&w=2

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  8:05                                           ` Andrei Popa
@ 2006-12-19  8:24                                             ` Andrew Morton
  2006-12-19  8:34                                               ` Pekka Enberg
  2006-12-19  9:13                                               ` Marc Haber
  0 siblings, 2 replies; 154+ messages in thread
From: Andrew Morton @ 2006-12-19  8:24 UTC (permalink / raw)
  To: andrei.popa
  Cc: Linus Torvalds, Peter Zijlstra, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Tue, 19 Dec 2006 10:05:03 +0200
Andrei Popa <andrei.popa@i-neo.ro> wrote:

> > > > Also, it'd be useful if you could determine whether the bug appears with
> > > > the ext2 filesystem: do s/ext3/ext2/ in /etc/fstab, or boot with
> > > > rootfstype=ext2 if it's the root filesystem.
> > > 
>  I fave file corruption.

Wow.  I didn't expect that, because Marc Haber reported that ext3's data=writeback
fixed it.   Maybe he didn't run it for long enough?

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  7:59               ` Nick Piggin
@ 2006-12-19  8:14                 ` Linus Torvalds
  2006-12-19  9:40                   ` Nick Piggin
  0 siblings, 1 reply; 154+ messages in thread
From: Linus Torvalds @ 2006-12-19  8:14 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr



On Tue, 19 Dec 2006, Nick Piggin wrote:
> > 
> > Anyway it has the same issues as the others. See what happens when you
> > run two test_clear_page_dirty_sync_ptes() consecutively, you still loose
> > PG_dirty even though the page might actually be dirty.
> 
> How can this happen? We'll only test_clear_page_dirty_sync_ptes again
> after buffers have been reattached, and subsequently cleaned. And in
> that case if the ptes are still clean at this point then the page really
> is clean.

Why do you talk about buffers being reattached? Are you still in some 
world where "try_to_free_buffers()" matters? Have you not followed the 
discussion? Why do you ignore my MUCH SIMPLER patch that just removed all 
this crap ENTIRELY from "try_to_free_buffers()", and the exact same 
corruption happened?

Forget about "try_to_free_buffers()". Please apply this patch to your tree 
first. That gets rid of _one_ copy of totally insane code that did all the 
wrong things.

Only after you have applied this patch should you look at the code again. 
Realizing that the corruption still happens.

So forget about buffers already. That piece of code was crap.

		Linus

---
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageDirty(page) || PageWriteback(page))
 		return 0;
 
 	if (mapping == NULL) {		/* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
 	spin_unlock(&mapping->private_lock);
-	if (ret) {
-		/*
-		 * If the filesystem writes its buffers by hand (eg ext3)
-		 * then we can have clean buffers against a dirty page.  We
-		 * clean the page here; otherwise later reattachment of buffers
-		 * could encounter a non-uptodate page, which is unresolvable.
-		 * This only applies in the rare case where try_to_free_buffers
-		 * succeeds but the page is not freed.
-		 *
-		 * Also, during truncate, discard_buffer will have marked all
-		 * the page's buffers clean.  We discover that here and clean
-		 * the page also.
-		 */
-		if (test_clear_page_dirty(page))
-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
-	}
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;

^ permalink raw reply related	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  1:54                                         ` Andrew Morton
  2006-12-19  2:04                                           ` Andrei Popa
@ 2006-12-19  8:05                                           ` Andrei Popa
  2006-12-19  8:24                                             ` Andrew Morton
  1 sibling, 1 reply; 154+ messages in thread
From: Andrei Popa @ 2006-12-19  8:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

> > > Also, it'd be useful if you could determine whether the bug appears with
> > > the ext2 filesystem: do s/ext3/ext2/ in /etc/fstab, or boot with
> > > rootfstype=ext2 if it's the root filesystem.
> > 
I have file corruption.


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  7:26                 ` Linus Torvalds
@ 2006-12-19  8:04                   ` Linus Torvalds
  2006-12-19  9:00                     ` Peter Zijlstra
       [not found]                     ` <4587B762.2030603@yahoo.com.au>
  0 siblings, 2 replies; 154+ messages in thread
From: Linus Torvalds @ 2006-12-19  8:04 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr



On Mon, 18 Dec 2006, Linus Torvalds wrote:
> 
> The code that doesn't make sense is the "shuffle the dirty bits around" In 
> other words: when does it actually make sense to call your 
> (well-implemented, don't get me wrong) "test_clear_page_dirty_sync_ptes()"
> function? It doesn't _fix_ anything. It just shuffles dirty bits from one 
> place to another. What was the point again?

Let me try to phrase that another way, in terms that you defined.

In other words, look at your test_clear_page_dirty_sync_ptes() function.

First, start out from the _inner_ part, the:

	if (mapping_cap_account_dirty(mapping)) {
		if (page_mkclean(page))
			set_page_dirty(page);
	}

part.

This is the one that both you and I agree is a "working" situation: we are 
moving the dirty bits from the pte into the "struct page", and we both 
agree that this is fine. No dirty bits get lost. You even make a BIG DEAL 
about the fact that no dirty bits get lost.

So begin by just explaining:
 - why do it?

Why shuffle the dirty bits around? Why not just _leave_ the PG_dirty bit 
on the "struct page", and simply leave it all at that? I agree that the 
above doesn't lose any dirty bits, but what I'm asking for is WHAT IS THE 
POINT?

So that is the code that we both agree "works", but I personally don't see 
the _point_ of. However, that's actually not even important, because I 
don't even care about the point. I wanted to bring that up just in order 
to then ignore it, and look at the stuff _around_ it, namely the other 
part in "test_clear_page_dirty_sync_ptes()":

	int test_clear_page_dirty_sync_ptes(struct page *page)
	{
		if (test_clear_page_dirty_leave_ptes(page)) {
			.. do the inner part ..
			return 1;
		}
		return 0;
	}

Now, the above is the OUTER part. Please realize that this DOES actually 
drop the PG_dirty bit. So ignore the inner part entirely (which is a no-op 
for the case where the page isn't mapped), and explain to me why it's ok 
to DROP the dirty bit in the outer part, when you tried to say that it was 
NOT ok to drop it in the inner part?

NOTICE? First you make a BIG DEAL about how dirty bits should never get 
lost, but THE VERY SAME FUNCTION actually very much on purpose DOES drop 
the dirty bit for when it's not in the page tables.

In fact, if you just call that function twice, the first time it will 
MOVE the dirty bits from the PTE to the "struct page *", and the _second_ 
time it will just clear the dirty bit from the "struct page *". You end up 
with a clean page. It returned the same return value BOTH TIMES, even 
though it did two very different things (once just moving dirty bits 
around, and the second time actually _removing_ the dirty bit entirely).
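
The two-call sequence described above can be replayed in a toy userspace
model. This is a sketch, not kernel code: the toy_* names are
hypothetical stand-ins for the functions in Nick's patch, there is a
single PTE, and mapping_cap_account_dirty() is assumed to be true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: one page mapped by exactly one PTE. */
struct toy_page {
	bool pg_dirty;	/* stands in for PG_dirty in "struct page" */
	bool pte_dirty;	/* stands in for the dirty bit of the one PTE */
};

/* Models page_mkclean(): clear the PTE dirty bit, report if it was set. */
bool toy_page_mkclean(struct toy_page *p)
{
	bool was_dirty;

	assert(p != NULL);
	was_dirty = p->pte_dirty;
	p->pte_dirty = false;
	return was_dirty;
}

/* Models test_clear_page_dirty_leave_ptes(): test-and-clear PG_dirty only. */
bool toy_test_clear_leave_ptes(struct toy_page *p)
{
	bool was_dirty;

	assert(p != NULL);
	was_dirty = p->pg_dirty;
	p->pg_dirty = false;
	return was_dirty;
}

/* Models test_clear_page_dirty_sync_ptes() as quoted in this thread. */
int toy_test_clear_sync_ptes(struct toy_page *p)
{
	if (toy_test_clear_leave_ptes(p)) {
		if (toy_page_mkclean(p))
			p->pg_dirty = true;	/* set_page_dirty() */
		return 1;
	}
	return 0;
}
```

Starting from pg_dirty = pte_dirty = true, the first call returns 1 and
leaves pg_dirty set with the PTE clean. The second call also returns 1,
but leaves both bits clear.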

Again, I have a very simple claim: I claim that NONE of the 
"test_clear_page_dirty()" functions make any sense what-so-ever. They're 
all wrong.

The "funny" part is, that the only thing that Andrei reports actually 
fixed his corruption (apart from the patch that just stops removing the 
dirty bits from the PTE's _entirely_) is actually the part where he had an 
"#if 0 .. #endif" around basically _all_ of the "test_clear_page_dirty()" 
function (ie he had mis-understood what I asked for, and put it outside 
the _outer_ if(), rather than putting it around the inner one).

So I claim:
 - there is ONE and only ONE place where you can really drop the dirty 
   bits: it's when you're going to immediately afterwards do a writeout.

   This is the "clear_page_dirty_for_io()"

 - all the other "[test_and_]clear_dirty*()" functions seem to be outright 
   buggy and bogus. Shuffling dirty bits around from the page tables to 
   the "struct page *" (after having _cleared_ that "very important" 
   PG_dirty bit just before - apparently it wasn't that important after 
   all, was it?) is insane.

Nobody has actually ever explained why "test_clear_page_dirty()" is good 
at all.

 - Why is it ever used instead of "clear_page_dirty_for_io()"?

 - What is the difference?

 - Why would you EVER want to clear bits just in the "struct page *" or 
   just in the PTE's?

 - Why is it EVER correct to clear dirty bits except JUST BEFORE THE IO?

In other words, I have a theory:

 "A lot of this is actually historical cruft. Some of it may even be code 
  that was never supposed to work, but because we maintained _other_ dirty 
  bits in the PTE's, and never touched them before, we never even realized 
  that the code that played with PG_dirty was totally insane"

Now, that's just a theory. And yeah, it may be stated a bit provocatively. 
It may not be entirely correct. I'm just saying.. maybe it is?

And yes, we actually really _do_ have a data-point from Andrei that says 
that if you just make "test_clear_page_dirty()" a no-op, the corruption 
goes away. It was unintentional, but hey, it's a real datapoint.

See the email from Andrei:

	Subject: Re: 2.6.19 file content corruption on ext3
	From: Andrei Popa <andrei.popa@i-neo.ro>
	Date: Tue, 19 Dec 2006 01:48:11 +0200
	Message-Id: <1166485691.6977.6.camel@localhost>

and look at what remains of his "test_clear_page_dirty()". 

Scary, isn't it? And a big hint that "test_clear_page_dirty()" is just 
totally BOGUS. 

And the thing is, I think it's bogus just because I don't understand why 
it would EVER be ok to drop those dirty bits _except_ very much just 
before doing the IO that makes it non-dirty (where "truncate()" is really 
a special case where the IO ends up being not done, but it's the same kind 
of situation).

			Linus

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  7:22             ` Peter Zijlstra
@ 2006-12-19  7:59               ` Nick Piggin
  2006-12-19  8:14                 ` Linus Torvalds
  0 siblings, 1 reply; 154+ messages in thread
From: Nick Piggin @ 2006-12-19  7:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

Peter Zijlstra wrote:
> On Tue, 2006-12-19 at 15:36 +1100, Nick Piggin wrote:
> 
> 
>>plain text document attachment (fs-fix.patch)
>>Index: linux-2.6/fs/buffer.c
>>===================================================================
>>--- linux-2.6.orig/fs/buffer.c	2006-12-19 15:15:46.000000000 +1100
>>+++ linux-2.6/fs/buffer.c	2006-12-19 15:36:01.000000000 +1100
>>@@ -2852,7 +2852,17 @@ int try_to_free_buffers(struct page *pag
>> 		 * This only applies in the rare case where try_to_free_buffers
>> 		 * succeeds but the page is not freed.
>> 		 */
>>-		clear_page_dirty(page);
>>+
>>+		/*
>>+		 * If the page has been dirtied via the user mappings, then
>>+		 * clean buffers does not indicate the page data is actually
>>+		 * clean! Only clear the page dirty bit if there are no dirty
>>+		 * ptes either.
>>+		 *
>>+		 * If there are dirty ptes, then the page must be uptodate, so
>>+		 * the above concern does not apply.
>>+		 */
>>+		clear_page_dirty_sync_ptes(page);
>> 	}
>> out:
>> 	if (buffers_to_free) {
>>Index: linux-2.6/include/linux/page-flags.h
>>===================================================================
>>--- linux-2.6.orig/include/linux/page-flags.h	2006-12-19 15:17:18.000000000 +1100
>>+++ linux-2.6/include/linux/page-flags.h	2006-12-19 15:34:24.000000000 +1100
>>@@ -254,6 +254,7 @@ static inline void SetPageUptodate(struc
>> struct page;	/* forward declaration */
>> 
>> int test_clear_page_dirty(struct page *page);
>>+int test_clear_page_dirty_sync_ptes(struct page *page);
>> int test_clear_page_writeback(struct page *page);
>> int test_set_page_writeback(struct page *page);
>> 
>>@@ -262,6 +263,11 @@ static inline void clear_page_dirty(stru
>> 	test_clear_page_dirty(page);
>> }
>> 
>>+static inline void clear_page_dirty_sync_ptes(struct page *page)
>>+{
>>+	test_clear_page_dirty_sync_ptes(page);
>>+}
>>+
>> static inline void set_page_writeback(struct page *page)
>> {
>> 	test_set_page_writeback(page);
>>Index: linux-2.6/mm/page-writeback.c
>>===================================================================
>>--- linux-2.6.orig/mm/page-writeback.c	2006-12-19 15:17:53.000000000 +1100
>>+++ linux-2.6/mm/page-writeback.c	2006-12-19 15:33:29.000000000 +1100
>>@@ -844,9 +844,10 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>> 
>> /*
>>  * Clear a page's dirty flag, while caring for dirty memory accounting. 
>>+ * Does not clear pte dirty bits.
>>  * Returns true if the page was previously dirty.
>>  */
>>-int test_clear_page_dirty(struct page *page)
>>+static int test_clear_page_dirty_leave_ptes(struct page *page)
>> {
>> 	struct address_space *mapping = page_mapping(page);
>> 	unsigned long flags;
>>@@ -862,10 +863,8 @@ int test_clear_page_dirty(struct page *p
>> 			 * We can continue to use `mapping' here because the
>> 			 * page is locked, which pins the address_space
>> 			 */
>>-			if (mapping_cap_account_dirty(mapping)) {
>>-				page_mkclean(page);
>>+			if (mapping_cap_account_dirty(mapping))
>> 				dec_zone_page_state(page, NR_FILE_DIRTY);
>>-			}
>> 			return 1;
>> 		}
>> 		write_unlock_irqrestore(&mapping->tree_lock, flags);
>>@@ -873,9 +872,43 @@ int test_clear_page_dirty(struct page *p
>> 	}
>> 	return TestClearPageDirty(page);
>> }
>>+
>>+/*
>>+ * As above, but does clear dirty bits from ptes
>>+ */
>>+int test_clear_page_dirty(struct page *page)
>>+{
>>+	struct address_space *mapping = page_mapping(page);
>>+
>>+	if (test_clear_page_dirty_leave_ptes(page)) {
>>+		if (mapping_cap_account_dirty(mapping))
>>+			page_mkclean(page);
>>+		return 1;
>>+	}
>>+	return 0;
>>+}
>> EXPORT_SYMBOL(test_clear_page_dirty);
>> 
>> /*
>>+ * As above, but redirties page if any dirty ptes are found (and then only
>>+ * if the mapping accounts dirty pages, otherwise dirty ptes are left dirty
>>+ * but the page is cleaned).
>>+ */
>>+int test_clear_page_dirty_sync_ptes(struct page *page)
>>+{
>>+	struct address_space *mapping = page_mapping(page);
>>+
>>+	if (test_clear_page_dirty_leave_ptes(page)) {
>>+		if (mapping_cap_account_dirty(mapping)) {
>>+			if (page_mkclean(page))
>>+				set_page_dirty(page);
>>+		}
>>+		return 1;
>>+	}
>>+	return 0;
>>+}
>>+
>>+/*
>>  * Clear a page's dirty flag, while caring for dirty memory accounting.
>>  * Returns true if the page was previously dirty.
>>  *
> 
> 
> Hmm, not quite; It certainly look better than the extra ,[01] tagged to
> test_clear_page_dirty() though. Although I would have expected it the
> other way around - test_clear_pages_dirty_sync_ptes to be the default
> case and test_clear_pages_dirty_clean_ptes to be used in
> clear_page_dirty_for_io().
> 
> Anyway it has the same issues as the others. See what happens when you
> run two test_clear_page_dirty_sync_ptes() consecutively, you still loose
> PG_dirty even though the page might actually be dirty.

How can this happen? We'll only test_clear_page_dirty_sync_ptes again
after buffers have been reattached, and subsequently cleaned. And in
that case if the ptes are still clean at this point then the page really
is clean.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 19:18                 ` Linus Torvalds
  2006-12-18 19:44                   ` Andrei Popa
@ 2006-12-19  7:38                   ` Peter Zijlstra
  1 sibling, 0 replies; 154+ messages in thread
From: Peter Zijlstra @ 2006-12-19  7:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Popa, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 11:18 -0800, Linus Torvalds wrote:

> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index d8a842a..3f9061e 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page 
> >  		goto unlock;
> >  
> >  	entry = ptep_get_and_clear(mm, address, pte);
> > -	entry = pte_mkclean(entry);
> > +	/*entry = pte_mkclean(entry);*/
> >  	entry = pte_wrprotect(entry);
> >  	ptep_establish(vma, address, pte, entry);
> >  	lazy_mmu_prot_update(entry);
> 
> The above patch is bad. It's always going to hide the bug, but it hides it 
> by just not doing anything at all. 

Not quite: it still does the wrprotect, so further writes will trigger the
do_wp_page() path and call set_page_dirty().
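
A toy userspace model of that argument, as a sketch only: the toy_*
names are hypothetical, there is a single PTE, and the write fault is
reduced to "set PG_dirty and make the PTE writable again", which is a
simplification of the do_wp_page() path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: one page, one PTE with a dirty and a write-permission bit. */
struct toy_mapped_page {
	bool pg_dirty;	/* models PG_dirty */
	bool pte_dirty;	/* models the PTE dirty bit */
	bool pte_write;	/* models PTE write permission */
};

/* Always write-protect; clear the PTE dirty bit only when asked to. */
void toy_page_mkcw(struct toy_mapped_page *p, bool make_clean)
{
	assert(p != NULL);
	if (make_clean)
		p->pte_dirty = false;	/* pte_mkclean() */
	p->pte_write = false;		/* pte_wrprotect() */
}

/* Models a userspace store through the mapping. */
void toy_store(struct toy_mapped_page *p)
{
	assert(p != NULL);
	if (!p->pte_write) {
		/* write fault, standing in for the do_wp_page() path */
		p->pg_dirty = true;	/* set_page_dirty() */
		p->pte_write = true;
	}
	p->pte_dirty = true;	/* the store itself dirties the PTE */
}
```

With make_clean false the PTE dirty bit survives, and in both variants
the next store faults and sets pg_dirty again, so the tracking keeps
working in this model.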

So we could make 'something' that would keep the tracking working and
not create corruption, say something like this:

However, I'll try to figure out how we get so terribly confused about the
PG_dirty state that we have to clean it and fall back to pte_dirty. That
is the real issue we have.

---
 include/linux/rmap.h |    6 ++++++
 mm/page-writeback.c  |    3 ++-
 mm/rmap.c            |   23 ++++++++++++++++++-----
 3 files changed, 26 insertions(+), 6 deletions(-)

Index: linux-2.6-git/mm/rmap.c
===================================================================
--- linux-2.6-git.orig/mm/rmap.c	2006-12-18 11:06:29.000000000 +0100
+++ linux-2.6-git/mm/rmap.c	2006-12-19 08:33:57.000000000 +0100
@@ -428,7 +428,8 @@ int page_referenced(struct page *page, i
 	return referenced;
 }
 
-static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
+static int page_mkcw_one(struct page *page,
+			 struct vm_area_struct *vma, int make_clean)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
@@ -448,7 +449,8 @@ static int page_mkclean_one(struct page 
 		goto unlock;
 
 	entry = ptep_get_and_clear(mm, address, pte);
-	entry = pte_mkclean(entry);
+	if (make_clean)
+		entry = pte_mkclean(entry);
 	entry = pte_wrprotect(entry);
 	ptep_establish(vma, address, pte, entry);
 	lazy_mmu_prot_update(entry);
@@ -460,7 +462,8 @@ out:
 	return ret;
 }
 
-static int page_mkclean_file(struct address_space *mapping, struct page *page)
+static int page_mkcw_file(struct address_space *mapping,
+			  struct page *page, int make_clean)
 {
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 	struct vm_area_struct *vma;
@@ -478,7 +481,7 @@ static int page_mkclean_file(struct addr
 	return ret;
 }
 
-int page_mkclean(struct page *page)
+static int page_mkcw(struct page *page, int make_clean)
 {
 	int ret = 0;
 
@@ -487,12 +490,22 @@ int page_mkclean(struct page *page)
 	if (page_mapped(page)) {
 		struct address_space *mapping = page_mapping(page);
 		if (mapping)
-			ret = page_mkclean_file(mapping, page);
+			ret = page_mkcw_file(mapping, page, make_clean);
 	}
 
 	return ret;
 }
 
+int page_mkclean(struct page *page)
+{
+	return page_mkcw(page, 1);
+}
+
+int page_wrprotect(struct page *page)
+{
+	return page_mkcw(page, 0);
+}
+
 /**
  * page_set_anon_rmap - setup new anonymous rmap
  * @page:	the page to add the mapping to
Index: linux-2.6-git/include/linux/rmap.h
===================================================================
--- linux-2.6-git.orig/include/linux/rmap.h	2006-12-19 08:31:59.000000000 +0100
+++ linux-2.6-git/include/linux/rmap.h	2006-12-19 08:32:28.000000000 +0100
@@ -110,6 +110,7 @@ unsigned long page_address_in_vma(struct
  * returns the number of cleaned PTEs.
  */
 int page_mkclean(struct page *);
+int page_wrprotect(struct page *);
 
 #else	/* !CONFIG_MMU */
 
@@ -125,6 +126,11 @@ static inline int page_mkclean(struct pa
 	return 0;
 }
 
+static inline int page_wrprotect(struct page *page)
+{
+	return 0;
+}
+
 
 #endif	/* CONFIG_MMU */
 
Index: linux-2.6-git/mm/page-writeback.c
===================================================================
--- linux-2.6-git.orig/mm/page-writeback.c	2006-12-19 08:24:48.000000000 +0100
+++ linux-2.6-git/mm/page-writeback.c	2006-12-19 08:31:43.000000000 +0100
@@ -872,7 +872,8 @@ int test_clear_page_dirty(struct page *p
 		 * page is locked, which pins the address_space
 		 */
 		if (mapping_cap_account_dirty(mapping)) {
-			page_mkclean(page);
+			if (page_wrprotect(page))
+				set_page_dirty(page);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
 		return 1;






* Re: 2.6.19 file content corruption on ext3
  2006-12-19  6:51               ` Nick Piggin
@ 2006-12-19  7:26                 ` Linus Torvalds
  2006-12-19  8:04                   ` Linus Torvalds
  0 siblings, 1 reply; 154+ messages in thread
From: Linus Torvalds @ 2006-12-19  7:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr



On Tue, 19 Dec 2006, Nick Piggin wrote:
> 
> I wouldn't have thought it becomes clean by dropping it ;) Is this a
> trick question? My answer is that we clean a page by taking some
> action such that the underlying data matches the data in RAM...

Sure.

> We don't "drop" any data until it has been cleaned (again, ignoring
> things like truncate for a minute). That's a bug!

Actually, it's the other way around. We have to drop the dirty bits BEFORE 
cleaning. If we clean first, and _then_ drop the dirty bits, THAT is a 
bug, because the dirty bits can now refer to _new_ dirty data that didn't 
get written out.

So the proper sequence is _literally_ to mark the page clean FIRST. Drop 
all the dirty bits, but not the _data_ obviously (ie you have a reference 
to the page). And _then_ you do the writeout to actually clean the data 
itself.

So you actually state it exactly the wrong way around.

We MUST clear the dirty bits before we do the IO that actually cleans the 
data. Exactly because if new writes keep on happening, if we do it in the 
other order, we'll drop dirty data on the floor.
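
The ordering argument can be sketched with a toy userspace model (not kernel code; the "page", the "disk" and every function name below are hypothetical). A write that races with the IO is only remembered if the dirty flag was cleared before the IO started.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical page with contents and a single dirty flag. */
struct toy_page {
	char data[16];
	bool dirty;
};

/* Hypothetical backing store for that page. */
struct toy_disk {
	char data[16];
};

/* A writer stores new data and marks the page dirty. */
static void toy_write(struct toy_page *pg, const char *s)
{
	strncpy(pg->data, s, sizeof(pg->data) - 1);
	pg->data[sizeof(pg->data) - 1] = '\0';
	pg->dirty = true;
}

/* The IO step: copy the page's current contents to disk. */
static void toy_io(struct toy_disk *d, const struct toy_page *pg)
{
	memcpy(d->data, pg->data, sizeof(d->data));
}

/* Clear the dirty flag first, then do the IO; a racing write re-dirties. */
static bool writeout_clear_first(struct toy_page *pg, struct toy_disk *d,
				 const char *racing_write)
{
	pg->dirty = false;
	toy_io(d, pg);
	if (racing_write)
		toy_write(pg, racing_write);
	return pg->dirty;
}

/* Do the IO first, then clear the flag; the racing write is forgotten. */
static bool writeout_clear_last(struct toy_page *pg, struct toy_disk *d,
				const char *racing_write)
{
	toy_io(d, pg);
	if (racing_write)
		toy_write(pg, racing_write);
	pg->dirty = false;	/* wipes the flag that described the new data */
	return pg->dirty;
}

/* True if the page claims to be clean but differs from what is on disk. */
static bool toy_data_lost(const struct toy_page *pg, const struct toy_disk *d)
{
	return !pg->dirty && memcmp(pg->data, d->data, sizeof(pg->data)) != 0;
}
```

Starting from a zero-initialised page holding "old", `writeout_clear_first()` with a racing "new" leaves the page dirty, so a later writeout will pick it up; `writeout_clear_last()` leaves a clean-looking page whose contents never reached the disk.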

> > In no other circumstance do we ever want to clear a dirty bit, as far as I
> > can tell. 
> 
> Exactly. And that is exactly what try_to_free_buffers is doing now.
> 
> I still think you should have a look at the patch.

I claim that dropping dirty bits AFTER the IO is always wrong. 
Try_to_free_buffers() must never touch the dirty bits at all, because by 
definition that thing happens after the IO has actually been done.

And yes, I looked at your patch. And it looks a million times cleaner 
than Andrew's patch. However, it's already been tested multiple times, and 
totally REMOVING the "clear_page_dirty()" from try_to_free_buffers() still 
resulted in the corruption.

That said, I think your patch is worth it just as a cleanup. Much nicer 
than Andrew's code, also from a naming standpoint. So I'm not actually 
disagreeing about the patch itself, but I _am_ saying that I don't 
actually see the point of ever moving the dirty bits around.

So I repeat: we have the case where we really want to _remove_ the dirty 
bits (because we're going to write the current state of the page to disk, 
and we need to clear the dirty bits BEFORE we do that). That's the one 
that makes sense, and that's the code we want to run before doing IO. It's 
the "clear_dirty_bits_for_io()" case.

The code that doesn't make sense is the "shuffle the dirty bits around" 
case. In other words: when does it actually make sense to call your 
(well-implemented, don't get me wrong) "test_clear_page_dirty_sync_ptes()"
function? It doesn't _fix_ anything. It just shuffles dirty bits from one 
place to another. What was the point again?

If the point is "try_to_free_buffers()", then my argument was that I had a 
much simpler solution: "Just don't do that then". My simple patch sadly 
didn't fix the data corruption, so the data corruption comes from 
something ELSE than try_to_free_buffers().

		Linus


* Re: 2.6.19 file content corruption on ext3
  2006-12-19  4:36           ` Nick Piggin
  2006-12-19  6:34             ` Linus Torvalds
@ 2006-12-19  7:22             ` Peter Zijlstra
  2006-12-19  7:59               ` Nick Piggin
  1 sibling, 1 reply; 154+ messages in thread
From: Peter Zijlstra @ 2006-12-19  7:22 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Tue, 2006-12-19 at 15:36 +1100, Nick Piggin wrote:

> plain text document attachment (fs-fix.patch)
> Index: linux-2.6/fs/buffer.c
> ===================================================================
> --- linux-2.6.orig/fs/buffer.c	2006-12-19 15:15:46.000000000 +1100
> +++ linux-2.6/fs/buffer.c	2006-12-19 15:36:01.000000000 +1100
> @@ -2852,7 +2852,17 @@ int try_to_free_buffers(struct page *pag
>  		 * This only applies in the rare case where try_to_free_buffers
>  		 * succeeds but the page is not freed.
>  		 */
> -		clear_page_dirty(page);
> +
> +		/*
> +		 * If the page has been dirtied via the user mappings, then
> +		 * clean buffers does not indicate the page data is actually
> +		 * clean! Only clear the page dirty bit if there are no dirty
> +		 * ptes either.
> +		 *
> +		 * If there are dirty ptes, then the page must be uptodate, so
> +		 * the above concern does not apply.
> +		 */
> +		clear_page_dirty_sync_ptes(page);
>  	}
>  out:
>  	if (buffers_to_free) {
> Index: linux-2.6/include/linux/page-flags.h
> ===================================================================
> --- linux-2.6.orig/include/linux/page-flags.h	2006-12-19 15:17:18.000000000 +1100
> +++ linux-2.6/include/linux/page-flags.h	2006-12-19 15:34:24.000000000 +1100
> @@ -254,6 +254,7 @@ static inline void SetPageUptodate(struc
>  struct page;	/* forward declaration */
>  
>  int test_clear_page_dirty(struct page *page);
> +int test_clear_page_dirty_sync_ptes(struct page *page);
>  int test_clear_page_writeback(struct page *page);
>  int test_set_page_writeback(struct page *page);
>  
> @@ -262,6 +263,11 @@ static inline void clear_page_dirty(stru
>  	test_clear_page_dirty(page);
>  }
>  
> +static inline void clear_page_dirty_sync_ptes(struct page *page)
> +{
> +	test_clear_page_dirty_sync_ptes(page);
> +}
> +
>  static inline void set_page_writeback(struct page *page)
>  {
>  	test_set_page_writeback(page);
> Index: linux-2.6/mm/page-writeback.c
> ===================================================================
> --- linux-2.6.orig/mm/page-writeback.c	2006-12-19 15:17:53.000000000 +1100
> +++ linux-2.6/mm/page-writeback.c	2006-12-19 15:33:29.000000000 +1100
> @@ -844,9 +844,10 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>  
>  /*
>   * Clear a page's dirty flag, while caring for dirty memory accounting. 
> + * Does not clear pte dirty bits.
>   * Returns true if the page was previously dirty.
>   */
> -int test_clear_page_dirty(struct page *page)
> +static int test_clear_page_dirty_leave_ptes(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
>  	unsigned long flags;
> @@ -862,10 +863,8 @@ int test_clear_page_dirty(struct page *p
>  			 * We can continue to use `mapping' here because the
>  			 * page is locked, which pins the address_space
>  			 */
> -			if (mapping_cap_account_dirty(mapping)) {
> -				page_mkclean(page);
> +			if (mapping_cap_account_dirty(mapping))
>  				dec_zone_page_state(page, NR_FILE_DIRTY);
> -			}
>  			return 1;
>  		}
>  		write_unlock_irqrestore(&mapping->tree_lock, flags);
> @@ -873,9 +872,43 @@ int test_clear_page_dirty(struct page *p
>  	}
>  	return TestClearPageDirty(page);
>  }
> +
> +/*
> + * As above, but does clear dirty bits from ptes
> + */
> +int test_clear_page_dirty(struct page *page)
> +{
> +	struct address_space *mapping = page_mapping(page);
> +
> +	if (test_clear_page_dirty_leave_ptes(page)) {
> +		if (mapping_cap_account_dirty(mapping))
> +			page_mkclean(page);
> +		return 1;
> +	}
> +	return 0;
> +}
>  EXPORT_SYMBOL(test_clear_page_dirty);
>  
>  /*
> + * As above, but redirties page if any dirty ptes are found (and then only
> + * if the mapping accounts dirty pages, otherwise dirty ptes are left dirty
> + * but the page is cleaned).
> + */
> +int test_clear_page_dirty_sync_ptes(struct page *page)
> +{
> +	struct address_space *mapping = page_mapping(page);
> +
> +	if (test_clear_page_dirty_leave_ptes(page)) {
> +		if (mapping_cap_account_dirty(mapping)) {
> +			if (page_mkclean(page))
> +				set_page_dirty(page);
> +		}
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/*
>   * Clear a page's dirty flag, while caring for dirty memory accounting.
>   * Returns true if the page was previously dirty.
>   *

Hmm, not quite; it certainly looks better than the extra ", [01]" argument
tacked onto test_clear_page_dirty(), though. I would have expected it the
other way around, however: test_clear_page_dirty_sync_ptes() as the default
case, and a test_clear_page_dirty_clean_ptes() used in
clear_page_dirty_for_io().

Anyway, it has the same issue as the others. See what happens when you
run test_clear_page_dirty_sync_ptes() twice in a row: you still lose
PG_dirty even though the page might actually be dirty.
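
That sequence can be traced with a toy userspace model of the quoted function, assuming a single pte and a mapping that accounts dirty pages (not kernel code; all `toy_` names are hypothetical):

```c
#include <stdbool.h>

/* Hypothetical page state: one page flag plus one pte dirty bit. */
struct toy_page {
	bool pg_dirty;		/* stands in for PG_dirty */
	bool pte_dirty;		/* dirty bit of the single mapping pte */
	bool data_on_disk;	/* true once the contents were written out */
};

/* Models page_mkclean(): clear the pte dirty bit, report if it was set. */
static int toy_page_mkclean(struct toy_page *pg)
{
	int was_dirty = pg->pte_dirty;

	pg->pte_dirty = false;
	return was_dirty;
}

/* Models test_clear_page_dirty_sync_ptes() from the quoted patch. */
static int toy_clear_dirty_sync_ptes(struct toy_page *pg)
{
	if (!pg->pg_dirty)
		return 0;
	pg->pg_dirty = false;
	if (toy_page_mkclean(pg))
		pg->pg_dirty = true;	/* stands in for set_page_dirty() */
	return 1;
}

/* Dirty state is lost if the data is unwritten and nothing records it. */
static bool toy_dirty_state_lost(const struct toy_page *pg)
{
	return !pg->data_on_disk && !pg->pg_dirty && !pg->pte_dirty;
}
```

In this model the first call moves the pte dirty bit into `pg_dirty`. The second call clears `pg_dirty`, finds no dirty pte left to restore it from, and the page ends up looking clean although nothing was written.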





* Re: 2.6.19 file content corruption on ext3
  2006-12-19  6:34             ` Linus Torvalds
@ 2006-12-19  6:51               ` Nick Piggin
  2006-12-19  7:26                 ` Linus Torvalds
  2006-12-19 20:03               ` dean gaudet
  1 sibling, 1 reply; 154+ messages in thread
From: Nick Piggin @ 2006-12-19  6:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

Linus Torvalds wrote:
> 
> On Tue, 19 Dec 2006, Nick Piggin wrote:
> 
>>We never want to drop dirty data! (ignoring the truncate case, which is
>>handled privately by truncate anyway)
> 
> 
> Bzzt.
> 
> SURE we do.
> 
> We absolutely do want to drop dirty data in the writeout path.
> 
> How do you think dirty data ever _becomes_ clean data?

I wouldn't have thought it becomes clean by dropping it ;) Is this a
trick question? My answer is that we clean a page by taking some
action such that the underlying data matches the data in RAM...

We don't "drop" any data until it has been cleaned (again, ignoring
things like truncate for a minute). That's a bug! And
try_to_free_buffers() is called from places outside the writeout path.
This is our bug (or at least, one of our bugs that appears to have the
same triggers and symptoms as people are reporting).

[...]

> In no other circumstance do we ever want to clear a dirty bit, as far as I 
> can tell. 

Exactly. And that is exactly what try_to_free_buffers is doing now.

I still think you should have a look at the patch.

-- 
SUSE Labs, Novell Inc.


* Re: 2.6.19 file content corruption on ext3
  2006-12-19  4:36           ` Nick Piggin
@ 2006-12-19  6:34             ` Linus Torvalds
  2006-12-19  6:51               ` Nick Piggin
  2006-12-19 20:03               ` dean gaudet
  2006-12-19  7:22             ` Peter Zijlstra
  1 sibling, 2 replies; 154+ messages in thread
From: Linus Torvalds @ 2006-12-19  6:34 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr



On Tue, 19 Dec 2006, Nick Piggin wrote:
> 
> We never want to drop dirty data! (ignoring the truncate case, which is
> handled privately by truncate anyway)

Bzzt.

SURE we do.

We absolutely do want to drop dirty data in the writeout path.

How do you think dirty data ever _becomes_ clean data?

In other words, yes, we _do_ want to test-and-clear all the pgtable bits 
_and_ the PG_dirty bit. We want to do it for:
 - writeout
 - truncate
 - possibly a "drop" event (which could be a case for a journal entry that 
   becomes stale due to being replaced or something - kind of "truncate" 
   on metadata)

because all of those events _literally_ turn dirty state into clean 
state.

In no other circumstance do we ever want to clear a dirty bit, as far as I 
can tell. 

			Linus


* Re: 2.6.19 file content corruption on ext3
  2006-12-18 18:03         ` Linus Torvalds
  2006-12-18 18:24           ` Peter Zijlstra
@ 2006-12-19  4:36           ` Nick Piggin
  2006-12-19  6:34             ` Linus Torvalds
  2006-12-19  7:22             ` Peter Zijlstra
  1 sibling, 2 replies; 154+ messages in thread
From: Nick Piggin @ 2006-12-19  4:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrew Morton, andrei.popa,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

[-- Attachment #1: Type: text/plain, Size: 3403 bytes --]

Linus Torvalds wrote:
> On Mon, 18 Dec 2006, Peter Zijlstra wrote:
> 
>>This should be safe; page_mkclean walks the rmap and flips the pte's
>>under the pte lock and records the dirty state while iterating.
>>Concurrent faults will either do set_page_dirty() before we get around
>>to doing it or vice versa, but dirty state is not lost.
> 
> 
> Ok, I really liked this patch, but the more I thought about it, the more I 
> started to doubt the reasons for liking it.

Well this implements my suggestion to redirty the page if there were dirty
ptes. I think it is a good fix (whether or not it fixes Andrei's bug, it
does fix a bug), though maybe _slightly_ suboptimal.

> I think we have some core fundamental problem here that this patch is 
> needed at all.
> 
> So let's think about this: we apparently have two cases of 
> "clear_page_dirty()":
> 
>  - the one that really wants to clear the bit unconditionally (Andrew 
>    calls this the "must_clean_ptes" case, which I personally find to be a 
>    really confusing name, but whatever)
> 
>  - the other case. The case that doesn't want to really clear the pte 
>    dirty bits.

I don't think this characterises it correctly. Think about how it worked
before the page_mkclean went in there.

We really _never_ want to just clear pte dirty bits, because that would be
a data loss situation[*]. The only reason we clear PG_dirty is because some
filesystem may have cleaned each buffer without realising it has cleaned
the whole page. But if you have a dirty pte, then all bets are off: a
buffer with a clear dirty bit can not be considered clean.

Before the dirty page tracking, it was fine to clear PG_dirty here, because
we would pick up the pte dirty info later on. After the page dirty tracking,
clearing pte dirty is a bug here, and re-accounting the dirty page is
arguably the minimal fix.

[*] except in the truncate case where we are happy to throw out dirty data,
     but in that case there would be no ptes anyway.
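
A toy userspace sketch of the three behaviours described above, assuming one pte stands in for the whole rmap walk (not kernel code; every name below is made up for illustration):

```c
#include <stdbool.h>

/* Hypothetical page state: one page flag and one pte dirty bit. */
struct toy_page {
	bool pg_dirty;	/* stands in for PG_dirty */
	bool pte_dirty;	/* stands in for the dirty bit in a user pte */
};

/* Before dirty page tracking: only the page flag is cleared. */
static void toy_clear_old(struct toy_page *pg)
{
	pg->pg_dirty = false;
}

/* With the page_mkclean() call: the pte dirty bit is wiped as well. */
static void toy_clear_mkclean(struct toy_page *pg)
{
	pg->pg_dirty = false;
	pg->pte_dirty = false;
}

/* The suggested fix: fold a dirty pte back into the page flag. */
static void toy_clear_sync_ptes(struct toy_page *pg)
{
	bool pte_was_dirty = pg->pte_dirty;

	pg->pg_dirty = false;
	pg->pte_dirty = false;
	if (pte_was_dirty)
		pg->pg_dirty = true;
}

/* Does anything still record that the page needs writing? */
static bool toy_known_dirty(const struct toy_page *pg)
{
	return pg->pg_dirty || pg->pte_dirty;
}
```

Starting from a page that is dirty both in its flag and in its pte, only `toy_clear_mkclean()` leaves a state in which `toy_known_dirty()` is false; the other two keep a record somewhere that the page still has to be written.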

The only thing I would suggest is not applying Andrew's patch at all, and
do the special casing in try_to_free_buffers(). I've attached a patch for
comments.


> and I thought your patch made sense, because it saved away the pte state 
> in the page dirty state, and that matches my mental model, but the more I 
> think about it, the less sense that whole "the other case" situation makes 
> AT ALL.
> 
> Why does "the other case" exist at all? If you want to clear the dirty 
> page flag, what is _ever_ the reason for not wanting to drop PTE dirty 
> information? In other words, what possible reason can there ever be for 
> saying "I want this page to be clean", while at the same time saying "but 
> if it was dirty in the page tables, don't forget about that state".

We never want to drop dirty data! (ignoring the truncate case, which is
handled privately by truncate anyway)

This whole exercise is not about cleaning or dirtying or forgetting the actual
*data* in the page. It is about bringing the pagecache's notion of whether
the page is dirty or clean in line with the (more uptodate) filesystem's
notion.

After dirty write accounting, we also threw in "the virtual memory manager's
notion", but got that case slightly wrong.

As unlikely as this race is for SMP systems, I think it is easily possible
for PREEMPT kernels. And they have featured in all bug reports, AFAICS.

-- 
SUSE Labs, Novell Inc.

[-- Attachment #2: fs-fix.patch --]
[-- Type: text/plain, Size: 3904 bytes --]

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2006-12-19 15:15:46.000000000 +1100
+++ linux-2.6/fs/buffer.c	2006-12-19 15:36:01.000000000 +1100
@@ -2852,7 +2852,17 @@ int try_to_free_buffers(struct page *pag
 		 * This only applies in the rare case where try_to_free_buffers
 		 * succeeds but the page is not freed.
 		 */
-		clear_page_dirty(page);
+
+		/*
+		 * If the page has been dirtied via the user mappings, then
+		 * clean buffers does not indicate the page data is actually
+		 * clean! Only clear the page dirty bit if there are no dirty
+		 * ptes either.
+		 *
+		 * If there are dirty ptes, then the page must be uptodate, so
+		 * the above concern does not apply.
+		 */
+		clear_page_dirty_sync_ptes(page);
 	}
 out:
 	if (buffers_to_free) {
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2006-12-19 15:17:18.000000000 +1100
+++ linux-2.6/include/linux/page-flags.h	2006-12-19 15:34:24.000000000 +1100
@@ -254,6 +254,7 @@ static inline void SetPageUptodate(struc
 struct page;	/* forward declaration */
 
 int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty_sync_ptes(struct page *page);
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);
 
@@ -262,6 +263,11 @@ static inline void clear_page_dirty(stru
 	test_clear_page_dirty(page);
 }
 
+static inline void clear_page_dirty_sync_ptes(struct page *page)
+{
+	test_clear_page_dirty_sync_ptes(page);
+}
+
 static inline void set_page_writeback(struct page *page)
 {
 	test_set_page_writeback(page);
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2006-12-19 15:17:53.000000000 +1100
+++ linux-2.6/mm/page-writeback.c	2006-12-19 15:33:29.000000000 +1100
@@ -844,9 +844,10 @@ EXPORT_SYMBOL(set_page_dirty_lock);
 
 /*
  * Clear a page's dirty flag, while caring for dirty memory accounting. 
+ * Does not clear pte dirty bits.
  * Returns true if the page was previously dirty.
  */
-int test_clear_page_dirty(struct page *page)
+static int test_clear_page_dirty_leave_ptes(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
@@ -862,10 +863,8 @@ int test_clear_page_dirty(struct page *p
 			 * We can continue to use `mapping' here because the
 			 * page is locked, which pins the address_space
 			 */
-			if (mapping_cap_account_dirty(mapping)) {
-				page_mkclean(page);
+			if (mapping_cap_account_dirty(mapping))
 				dec_zone_page_state(page, NR_FILE_DIRTY);
-			}
 			return 1;
 		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
@@ -873,9 +872,43 @@ int test_clear_page_dirty(struct page *p
 	}
 	return TestClearPageDirty(page);
 }
+
+/*
+ * As above, but does clear dirty bits from ptes
+ */
+int test_clear_page_dirty(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+
+	if (test_clear_page_dirty_leave_ptes(page)) {
+		if (mapping_cap_account_dirty(mapping))
+			page_mkclean(page);
+		return 1;
+	}
+	return 0;
+}
 EXPORT_SYMBOL(test_clear_page_dirty);
 
 /*
+ * As above, but redirties page if any dirty ptes are found (and then only
+ * if the mapping accounts dirty pages, otherwise dirty ptes are left dirty
+ * but the page is cleaned).
+ */
+int test_clear_page_dirty_sync_ptes(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+
+	if (test_clear_page_dirty_leave_ptes(page)) {
+		if (mapping_cap_account_dirty(mapping)) {
+			if (page_mkclean(page))
+				set_page_dirty(page);
+		}
+		return 1;
+	}
+	return 0;
+}
+
+/*
  * Clear a page's dirty flag, while caring for dirty memory accounting.
  * Returns true if the page was previously dirty.
  *


* Re: 2.6.19 file content corruption on ext3
  2006-12-19  1:54                                         ` Andrew Morton
@ 2006-12-19  2:04                                           ` Andrei Popa
  2006-12-19  8:05                                           ` Andrei Popa
  1 sibling, 0 replies; 154+ messages in thread
From: Andrei Popa @ 2006-12-19  2:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr


> > > If all of test_clear_page_dirty() has been commented out then the page will
> > > never become clean hence will never fall out of pagecache, so unless Andrei
> > > is doing a reboot before checking for corruption, perhaps the underlying
> > > data on-disk is incorrect, but we can't see it.
> > 
> > if I do a sync and echo 1 > /proc/sys/vm/drop_caches
> 
> OK, that works.
> 
> > is the reboot
> > still necessary?
> 
> It might be necessary to reboot in this case - if we're leaving the
> pagecache dirty, writing to drop_caches won't remove it.  And you probably
> won't be able to get a clean reboot either.
> 
> > > 
> > > Andrei, how _are_ you running this test?    What's the exact sequence of steps?
> > > 
> > > In particular, are you doing anything which would cause the corrupted file
> > > to be evicted from memory, thus forcing a read from disk?  Such as
> > > unmounting and then remounting the filesystem?
> > 
> > I boot Linux, start rtorrent and start the download. While it's
> > downloading I start evolution and check my mail (my mbox is very large,
> > several hundred megabytes), close evolution (I use it just to have
> > another application which uses the filesystem and the memory), and
> > start evolution again. I start firefox. When the download is complete,
> > rtorrent says whether the hash is good or not. I run "unrar t qwe.rar" to
> > test that all 84 downloaded rar files are ok and look at the result.
> > 
> > > 
> > > The point of my question is to check that the data is really incorrect
> > > on-disk, or whether it is incorrect in pagecache.

I rebooted and the files are still broken after the reboot (tested twice),
so the data is incorrect on disk.

> > > 
> > > Also, it'd be useful if you could determine whether the bug appears with
> > > the ext2 filesystem: do s/ext3/ext2/ in /etc/fstab, or boot with
> > > rootfstype=ext2 if it's the root filesystem.
> > 
> > I will test.

I will test in a couple of hours; I have some work to do first.

> 
> ok, thanks.



* Re: 2.6.19 file content corruption on ext3
  2006-12-19  1:44                                       ` Andrei Popa
@ 2006-12-19  1:54                                         ` Andrew Morton
  2006-12-19  2:04                                           ` Andrei Popa
  2006-12-19  8:05                                           ` Andrei Popa
  0 siblings, 2 replies; 154+ messages in thread
From: Andrew Morton @ 2006-12-19  1:54 UTC (permalink / raw)
  To: andrei.popa
  Cc: Linus Torvalds, Peter Zijlstra, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Tue, 19 Dec 2006 03:44:51 +0200
Andrei Popa <andrei.popa@i-neo.ro> wrote:

> On Mon, 2006-12-18 at 17:21 -0800, Andrew Morton wrote:
> > On Mon, 18 Dec 2006 16:57:30 -0800 (PST)
> > Linus Torvalds <torvalds@osdl.org> wrote:
> > 
> > > What happens if you only ifdef out that single thing? 
> > > 
> > > The actual page-cleaning functions make sure to only clear the TAG_DIRTY 
> > > bit _after_ the page has been marked for writeback. Is there some ordering 
> > > constraint there, perhaps?
> > > 
> > > I'm really reaching here. I'm trying to see the pattern, and I'm not 
> > > seeing it. I'm asking you to test things just to get more of a feel for 
> > > what triggers the failure, than because I actually have any kind of idea 
> > > of what the heck is going on.
> > > 
> > > Andrew, Nick, Hugh - any ideas?
> > 
> > If all of test_clear_page_dirty() has been commented out then the page will
> > never become clean hence will never fall out of pagecache, so unless Andrei
> > is doing a reboot before checking for corruption, perhaps the underlying
> > data on-disk is incorrect, but we can't see it.
> 
> if I do a sync and echo 1 > /proc/sys/vm/drop_caches

OK, that works.

> is the reboot
> still necessary?

It might be necessary to reboot in this case - if we're leaving the
pagecache dirty, writing to drop_caches won't remove it.  And you probably
won't be able to get a clean reboot either.

> > 
> > Andrei, how _are_ you running this test?    What's the exact sequence of steps?
> > 
> > In particular, are you doing anything which would cause the corrupted file
> > to be evicted from memory, thus forcing a read from disk?  Such as
> > unmounting and then remounting the filesystem?
> 
> I boot Linux, start rtorrent and start the download. While it's
> downloading I start evolution and check my mail (my mbox is very large,
> several hundred megabytes), close evolution (I use it just to have
> another application which uses the filesystem and the memory), and
> start evolution again. I start firefox. When the download is complete,
> rtorrent says whether the hash is good or not. I run "unrar t qwe.rar" to
> test that all 84 downloaded rar files are ok and look at the result.
> 
> > 
> > The point of my question is to check that the data is really incorrect
> > on-disk, or whether it is incorrect in pagecache.
> > 
> > Also, it'd be useful if you could determine whether the bug appears with
> > the ext2 filesystem: do s/ext3/ext2/ in /etc/fstab, or boot with
> > rootfstype=ext2 if it's the root filesystem.
> 
> I will test.

ok, thanks.


* Re: 2.6.19 file content corruption on ext3
  2006-12-19  0:57                                   ` Linus Torvalds
  2006-12-19  1:21                                     ` Andrew Morton
@ 2006-12-19  1:50                                     ` Andrei Popa
  1 sibling, 0 replies; 154+ messages in thread
From: Andrei Popa @ 2006-12-19  1:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 16:57 -0800, Linus Torvalds wrote:
> 
> On Tue, 19 Dec 2006, Andrei Popa wrote:
> > > > 
> > > > nope, no file corruption at all.
> > > 
> > > Ok. That's interesting, but I think you actually #ifdef'ed out too 
> > > much:
> > > 
> > > It was really just the _inner_ "if (mapping_cap_account_dirty(.." 
> > > statement that I meant you should remove.
> > > 
> > > Can you try that too?
> > 
> > I have file corruption: "Hash check on download completion found bad
> > chunks, consider using "safe_sync"."
> 
> Ok, that's interesting.
> 
> So it doesn't seem to be the call to page_mkclean() itself that causes 
> corruption. It looks like Peter's hunch that maybe there's some bug in 
> PG_dirty handling _itself_ might be an idea..
> 
> And the reason it only started happening now is that it may just have been 
> _hidden_ by the fact that while we kept the dirty bits in the page tables, 
> we'd end up writing the dirty page _despite_ having lost the PG_dirty bit. 
> So if it's some bad interaction between writable mappings and some other 
> part of the system, we just didn't see it earlier, exactly because we had 
> _lots_ of dirty bits, and it was enough that _one_ of them was right.
> 
> If you didn't see corruption when you #ifdef'ed out too much of the 
> test_clear_page_dirty() function (the _whole_ TestClearPageDirty() 
> if-statement), but you get it when you just comment out the stuff that 
> does the page_mkclean(), that's interesting.
> 
> I'm left looking at the "radix_tree_tag_clear()" in 
> test_clear_page_dirty().
> 
> What happens if you only ifdef out that single thing? 

I have file corruption.

> 
> The actual page-cleaning functions make sure to only clear the TAG_DIRTY 
> bit _after_ the page has been marked for writeback. Is there some ordering 
> constraint there, perhaps?
> 
> I'm really reaching here. I'm trying to see the pattern, and I'm not 
> seeing it. I'm asking you to test things just to get more of a feel for 
> what triggers the failure, than because I actually have any kind of idea 
> of what the heck is going on.
> 
> Andrew, Nick, Hugh - any ideas?
> 
> 			Linus


diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageDirty(page) || PageWriteback(page))
 		return 0;
 
 	if (mapping == NULL) {		/* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
 	spin_unlock(&mapping->private_lock);
-	if (ret) {
-		/*
-		 * If the filesystem writes its buffers by hand (eg ext3)
-		 * then we can have clean buffers against a dirty page.  We
-		 * clean the page here; otherwise later reattachment of buffers
-		 * could encounter a non-uptodate page, which is unresolvable.
-		 * This only applies in the rare case where try_to_free_buffers
-		 * succeeds but the page is not freed.
-		 *
-		 * Also, during truncate, discard_buffer will have marked all
-		 * the page's buffers clean.  We discover that here and clean
-		 * the page also.
-		 */
-		if (test_clear_page_dirty(page))
-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
-	}
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0f05cab..2d8bbbb 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1245,7 +1245,7 @@ retry:
 				wait_on_page_writeback(page);
 
 			if (PageWriteback(page) ||
-					!test_clear_page_dirty(page)) {
+					!test_clear_page_dirty(page, 0)) {
 				unlock_page(page);
 				break;
 			}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1387749..da2bdb1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
 		spin_unlock(&fc->lock);
 
 		if (offset == 0 && to == PAGE_CACHE_SIZE) {
-			clear_page_dirty(page);
+			clear_page_dirty(page, 0);
 			SetPageUptodate(page);
 		}
 	}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..9f82cd0 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
 
 static void truncate_huge_page(struct page *page)
 {
-	clear_page_dirty(page);
+	clear_page_dirty(page, 0);
 	ClearPageUptodate(page);
 	remove_from_page_cache(page);
 	put_page(page);
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index b1a1c72..5e29b37 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1
 
 	/* Retest mp->count since we may have released page lock */
 	if (test_bit(META_discard, &mp->flag) && !mp->count) {
-		clear_page_dirty(page);
+		clear_page_dirty(page, 0);
 		ClearPageUptodate(page);
 	}
 #else
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 47e7027..a97e198 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
 				bh = next;
 			} while (bh != head);
 			if (PAGE_SIZE == bh->b_size) {
-				clear_page_dirty(page);
+				clear_page_dirty(page, 0);
 			}
 		}
 	}
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index b56eb75..44ac434 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -343,7 +343,7 @@ xfs_start_page_writeback(
 	ASSERT(!PageWriteback(page));
 	set_page_writeback(page);
 	if (clear_dirty)
-		clear_page_dirty(page);
+		clear_page_dirty(page, 0);
 	unlock_page(page);
 	if (!buffers) {
 		end_page_writeback(page);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..175ab3c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,13 +253,13 @@ #define ClearPageUncached(page)	clear_bi
 
 struct page;	/* forward declaration */
 
-int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);
 
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int must_clean_ptes)
 {
-	test_clear_page_dirty(page);
+	test_clear_page_dirty(page, must_clean_ptes);
 }
 
 static inline void set_page_writeback(struct page *page)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..4ff7f90 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
  * Clear a page's dirty flag, while caring for dirty memory accounting. 
  * Returns true if the page was previously dirty.
  */
-int test_clear_page_dirty(struct page *page)
+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
 {
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
@@ -857,6 +857,7 @@ int test_clear_page_dirty(struct page *p
 		return TestClearPageDirty(page);
 
 	write_lock_irqsave(&mapping->tree_lock, flags);
+
 	if (TestClearPageDirty(page)) {
 		radix_tree_tag_clear(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
@@ -865,12 +866,23 @@ int test_clear_page_dirty(struct page *p
 		 * We can continue to use `mapping' here because the
 		 * page is locked, which pins the address_space
 		 */
+
+#if 0
+
 		if (mapping_cap_account_dirty(mapping)) {
-			page_mkclean(page);
+			int cleaned = page_mkclean(page);
+			if (!must_clean_ptes && cleaned){
+			WARN_ON(1);
+			set_page_dirty(page);
+			}
+
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
+#endif
+
 		return 1;
 	}
+
 	write_unlock_irqrestore(&mapping->tree_lock, flags);
 	return 0;
 }
diff --git a/mm/rmap.c b/mm/rmap.c
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..9a01d9e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
-	if (test_clear_page_dirty(page))
+	if (test_clear_page_dirty(page, 0))
 		task_io_account_cancelled_write(PAGE_CACHE_SIZE);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);
@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
 					  PAGE_CACHE_SIZE, 0);
 				}
 			}
-			was_dirty = test_clear_page_dirty(page);
+			was_dirty = test_clear_page_dirty(page, 0);
 			if (!invalidate_complete_page2(mapping, page)) {
 				if (was_dirty)
 					set_page_dirty(page);
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageDirty(page) || PageWriteback(page))
 		return 0;
 
 	if (mapping == NULL) {		/* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
 	spin_unlock(&mapping->private_lock);
-	if (ret) {
-		/*
-		 * If the filesystem writes its buffers by hand (eg ext3)
-		 * then we can have clean buffers against a dirty page.  We
-		 * clean the page here; otherwise later reattachment of buffers
-		 * could encounter a non-uptodate page, which is unresolvable.
-		 * This only applies in the rare case where try_to_free_buffers
-		 * succeeds but the page is not freed.
-		 *
-		 * Also, during truncate, discard_buffer will have marked all
-		 * the page's buffers clean.  We discover that here and clean
-		 * the page also.
-		 */
-		if (test_clear_page_dirty(page))
-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
-	}
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0f05cab..2d8bbbb 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1245,7 +1245,7 @@ retry:
 				wait_on_page_writeback(page);
 
 			if (PageWriteback(page) ||
-					!test_clear_page_dirty(page)) {
+					!test_clear_page_dirty(page, 0)) {
 				unlock_page(page);
 				break;
 			}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1387749..da2bdb1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
 		spin_unlock(&fc->lock);
 
 		if (offset == 0 && to == PAGE_CACHE_SIZE) {
-			clear_page_dirty(page);
+			clear_page_dirty(page, 0);
 			SetPageUptodate(page);
 		}
 	}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..9f82cd0 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
 
 static void truncate_huge_page(struct page *page)
 {
-	clear_page_dirty(page);
+	clear_page_dirty(page, 0);
 	ClearPageUptodate(page);
 	remove_from_page_cache(page);
 	put_page(page);
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index b1a1c72..5e29b37 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1
 
 	/* Retest mp->count since we may have released page lock */
 	if (test_bit(META_discard, &mp->flag) && !mp->count) {
-		clear_page_dirty(page);
+		clear_page_dirty(page, 0);
 		ClearPageUptodate(page);
 	}
 #else
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 47e7027..a97e198 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
 				bh = next;
 			} while (bh != head);
 			if (PAGE_SIZE == bh->b_size) {
-				clear_page_dirty(page);
+				clear_page_dirty(page, 0);
 			}
 		}
 	}
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index b56eb75..44ac434 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -343,7 +343,7 @@ xfs_start_page_writeback(
 	ASSERT(!PageWriteback(page));
 	set_page_writeback(page);
 	if (clear_dirty)
-		clear_page_dirty(page);
+		clear_page_dirty(page, 0);
 	unlock_page(page);
 	if (!buffers) {
 		end_page_writeback(page);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..175ab3c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,13 +253,13 @@ #define ClearPageUncached(page)	clear_bi
 
 struct page;	/* forward declaration */
 
-int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);
 
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int must_clean_ptes)
 {
-	test_clear_page_dirty(page);
+	test_clear_page_dirty(page, must_clean_ptes);
 }
 
 static inline void set_page_writeback(struct page *page)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..e6524a6 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
  * Clear a page's dirty flag, while caring for dirty memory accounting. 
  * Returns true if the page was previously dirty.
  */
-int test_clear_page_dirty(struct page *page)
+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
 {
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
@@ -857,20 +857,35 @@ int test_clear_page_dirty(struct page *p
 		return TestClearPageDirty(page);
 
 	write_lock_irqsave(&mapping->tree_lock, flags);
+
 	if (TestClearPageDirty(page)) {
+
+#if 0
+
 		radix_tree_tag_clear(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
+
+#endif
+
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
 		/*
 		 * We can continue to use `mapping' here because the
 		 * page is locked, which pins the address_space
 		 */
+
+
 		if (mapping_cap_account_dirty(mapping)) {
-			page_mkclean(page);
+			int cleaned = page_mkclean(page);
+			if (!must_clean_ptes && cleaned){
+			WARN_ON(1);
+			set_page_dirty(page);
+			}
+
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
 		return 1;
 	}
+
 	write_unlock_irqrestore(&mapping->tree_lock, flags);
 	return 0;
 }
diff --git a/mm/rmap.c b/mm/rmap.c
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..9a01d9e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
-	if (test_clear_page_dirty(page))
+	if (test_clear_page_dirty(page, 0))
 		task_io_account_cancelled_write(PAGE_CACHE_SIZE);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);
@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
 					  PAGE_CACHE_SIZE, 0);
 				}
 			}
-			was_dirty = test_clear_page_dirty(page);
+			was_dirty = test_clear_page_dirty(page, 0);
 			if (!invalidate_complete_page2(mapping, page)) {
 				if (was_dirty)
 					set_page_dirty(page);



^ permalink raw reply related	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  1:21                                     ` Andrew Morton
@ 2006-12-19  1:44                                       ` Andrei Popa
  2006-12-19  1:54                                         ` Andrew Morton
  0 siblings, 1 reply; 154+ messages in thread
From: Andrei Popa @ 2006-12-19  1:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 17:21 -0800, Andrew Morton wrote:
> On Mon, 18 Dec 2006 16:57:30 -0800 (PST)
> Linus Torvalds <torvalds@osdl.org> wrote:
> 
> > What happens if you only ifdef out that single thing? 
> > 
> > The actual page-cleaning functions make sure to only clear the TAG_DIRTY 
> > bit _after_ the page has been marked for writeback. Is there some ordering 
> > constraint there, perhaps?
> > 
> > I'm really reaching here. I'm trying to see the pattern, and I'm not 
> > seeing it. I'm asking you to test things just to get more of a feel for 
> > what triggers the failure, than because I actually have any kind of idea 
> > of what the heck is going on.
> > 
> > Andrew, Nick, Hugh - any ideas?
> 
> If all of test_clear_page_dirty() has been commented out then the page will
> never become clean hence will never fall out of pagecache, so unless Andrei
> is doing a reboot before checking for corruption, perhaps the underlying
> data on-disk is incorrect, but we can't see it.

If I do a sync and echo 1 > /proc/sys/vm/drop_caches, is the reboot
still necessary?

> 
> Andrei, how _are_ you running this test?    What's the exact sequence of steps?
> 
> In particular, are you doing anything which would cause the corrupted file
> to be evicted from memory, thus forcing a read from disk?  Such as
> unmounting and then remounting the filesystem?

I boot Linux, start rtorrent and start the download. While it's
downloading I start evolution and check my mail (my mbox is very large,
several hundred megabytes), then close evolution (I use evolution just to
have another application which uses the filesystem and the memory) and
start it again. I start firefox. When the download is complete,
rtorrent says whether the hash is good or not. I do an "unrar t qwe.rar" to
test that all 84 downloaded rar files are ok and see the result.

> 
> The point of my question is to check that the data is really incorrect
> on-disk, or whether it is incorrect in pagecache.
> 
> Also, it'd be useful if you could determine whether the bug appears with
> the ext2 filesystem: do s/ext3/ext2/ in /etc/fstab, or boot with
> rootfstype=ext2 if it's the root filesystem.

I will test.

> 
> Thanks.


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  0:57                                   ` Linus Torvalds
@ 2006-12-19  1:21                                     ` Andrew Morton
  2006-12-19  1:44                                       ` Andrei Popa
  2006-12-19  1:50                                     ` Andrei Popa
  1 sibling, 1 reply; 154+ messages in thread
From: Andrew Morton @ 2006-12-19  1:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Popa, Peter Zijlstra, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 18 Dec 2006 16:57:30 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> What happens if you only ifdef out that single thing? 
> 
> The actual page-cleaning functions make sure to only clear the TAG_DIRTY 
> bit _after_ the page has been marked for writeback. Is there some ordering 
> constraint there, perhaps?
> 
> I'm really reaching here. I'm trying to see the pattern, and I'm not 
> seeing it. I'm asking you to test things just to get more of a feel for 
> what triggers the failure, than because I actually have any kind of idea 
> of what the heck is going on.
> 
> Andrew, Nick, Hugh - any ideas?

If all of test_clear_page_dirty() has been commented out then the page will
never become clean hence will never fall out of pagecache, so unless Andrei
is doing a reboot before checking for corruption, perhaps the underlying
data on-disk is incorrect, but we can't see it.

Andrei, how _are_ you running this test?    What's the exact sequence of steps?

In particular, are you doing anything which would cause the corrupted file
to be evicted from memory, thus forcing a read from disk?  Such as
unmounting and then remounting the filesystem?

The point of my question is to check that the data is really incorrect
on-disk, or whether it is incorrect in pagecache.

Also, it'd be useful if you could determine whether the bug appears with
the ext2 filesystem: do s/ext3/ext2/ in /etc/fstab, or boot with
rootfstype=ext2 if it's the root filesystem.

Thanks.
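
[Illustration] The pagecache-versus-disk question above can also be probed per file from userspace, without a reboot or remount. The following is only a minimal sketch under stated assumptions, not something anyone in this thread ran: the helper names (file_checksum, evict_from_pagecache, same_after_eviction) are hypothetical, and posix_fadvise(POSIX_FADV_DONTNEED) is advisory, so the kernel may keep pages that are still mapped or dirty. If the checksum changes after eviction, the cached copy and the on-disk copy probably differed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* FNV-1a hash over the whole file; returns 0 on success, -1 on I/O error. */
static int file_checksum(const char *path, uint64_t *sum)
{
	unsigned char buf[65536];
	uint64_t h = 0xcbf29ce484222325ULL;	/* FNV-1a 64-bit offset basis */
	ssize_t n;
	int fd;

	assert(path != NULL && sum != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	while ((n = read(fd, buf, sizeof(buf))) > 0) {
		for (ssize_t i = 0; i < n; i++) {
			h ^= buf[i];
			h *= 0x100000001b3ULL;	/* FNV-1a 64-bit prime */
		}
	}
	close(fd);
	if (n < 0)
		return -1;
	*sum = h;
	return 0;
}

/*
 * Flush this file's dirty pages, then ask the kernel to drop its cached
 * pages so that the next read has to go to disk. The request is advisory.
 */
static int evict_from_pagecache(const char *path)
{
	int fd = open(path, O_RDONLY);
	int err;

	if (fd < 0)
		return -1;
	if (fsync(fd) < 0) {
		close(fd);
		return -1;
	}
	err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
	close(fd);
	return err ? -1 : 0;
}

/* Returns 1 if the checksum is unchanged after eviction, 0 if it differs, -1 on error. */
static int same_after_eviction(const char *path)
{
	uint64_t before, after;

	if (file_checksum(path, &before) < 0)
		return -1;
	if (evict_from_pagecache(path) < 0)
		return -1;
	if (file_checksum(path, &after) < 0)
		return -1;
	return before == after;
}
```

Calling same_after_eviction() on each downloaded file right after the hash check would be one way to separate "wrong in pagecache" from "wrong on disk"; a full sync plus drop_caches or a remount remains the more thorough check.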

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 23:48                             ` Andrei Popa
  2006-12-19  0:04                               ` Linus Torvalds
@ 2006-12-19  1:03                               ` Gene Heskett
  1 sibling, 0 replies; 154+ messages in thread
From: Gene Heskett @ 2006-12-19  1:03 UTC (permalink / raw)
  To: linux-kernel, andrei.popa
  Cc: Linus Torvalds, Peter Zijlstra, Andrew Morton, Hugh Dickins,
	Florian Weimer, Marc Haber, Martin Michlmayr

On Monday 18 December 2006 18:48, Andrei Popa wrote:
>On Mon, 2006-12-18 at 14:32 -0800, Linus Torvalds wrote:
>> On Mon, 18 Dec 2006, Andrei Popa wrote:
>> > > This should be fairly easy to test: just change every single ", 1"
>> > > case in the patch to ", 0".
>> > >
>> > > What happens for you in that case?
>> >
>> > I have file corruption.
>>
>> Magic. And btw, _thanks_ for being such a great tester.
>>
>> So now I have one more thing for you to try, if you can bother:
>>
>> There's exactly two call sites that call "page_mkclean()" (and that is
>> the only thing in turn that calls "page_mkclean_one()", which we
>> already determined will cause the corruption).
>>
>> Both of them do
>>
>> 	if (mapping_cap_account_dirty(mapping)) {
>> 			..
>>
>> things, although they do slightly different things inside that if in
>> your patched kernel.
>>
>> Can you just TOTALLY DISABLE that case for the test_clear_page_dirty()
>> case? Just do an "#if 0 .. #endif" around that whole if-statement,
>> leaving the _only_ thing that actually calls "page_mkclean()" to be
>> the "clear_page_dirty_for_io()" call.
>>
>> Do you still see corruption?
>
>nope, no file corruption at all.
>
Goody I says to nobody in particular, I'll go build this...
>
>diff --git a/fs/buffer.c b/fs/buffer.c
>index d1f1b54..263f88e 100644
>--- a/fs/buffer.c
>+++ b/fs/buffer.c
>@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
> 	int ret = 0;
>
> 	BUG_ON(!PageLocked(page));
>-	if (PageWriteback(page))
>+	if (PageDirty(page) || PageWriteback(page))
> 		return 0;
>
> 	if (mapping == NULL) {		/* can this still happen? */
>@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
> 	spin_lock(&mapping->private_lock);
> 	ret = drop_buffers(page, &buffers_to_free);
> 	spin_unlock(&mapping->private_lock);
>-	if (ret) {
>-		/*
>-		 * If the filesystem writes its buffers by hand (eg ext3)
>-		 * then we can have clean buffers against a dirty page.  We
>-		 * clean the page here; otherwise later reattachment of buffers
>-		 * could encounter a non-uptodate page, which is unresolvable.
>-		 * This only applies in the rare case where try_to_free_buffers
>-		 * succeeds but the page is not freed.
>-		 *
>-		 * Also, during truncate, discard_buffer will have marked all
>-		 * the page's buffers clean.  We discover that here and clean
>-		 * the page also.
>-		 */
>-		if (test_clear_page_dirty(page))
>-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
>-	}
> out:
> 	if (buffers_to_free) {
> 		struct buffer_head *bh = buffers_to_free;
>diff --git a/fs/cifs/file.c b/fs/cifs/file.c
>index 0f05cab..2d8bbbb 100644
>--- a/fs/cifs/file.c
>+++ b/fs/cifs/file.c
>@@ -1245,7 +1245,7 @@ retry:
> 				wait_on_page_writeback(page);
>
> 			if (PageWriteback(page) ||
>-					!test_clear_page_dirty(page)) {
>+					!test_clear_page_dirty(page, 0)) {
> 				unlock_page(page);
> 				break;
> 			}
>diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>index 1387749..da2bdb1 100644
>--- a/fs/fuse/file.c
>+++ b/fs/fuse/file.c
>@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
> 		spin_unlock(&fc->lock);
>
> 		if (offset == 0 && to == PAGE_CACHE_SIZE) {
>-			clear_page_dirty(page);
>+			clear_page_dirty(page, 0);
> 			SetPageUptodate(page);
> 		}
> 	}
>diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>index ed2c223..9f82cd0 100644
>--- a/fs/hugetlbfs/inode.c
>+++ b/fs/hugetlbfs/inode.c
>@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
>
> static void truncate_huge_page(struct page *page)
> {
>-	clear_page_dirty(page);
>+	clear_page_dirty(page, 0);
> 	ClearPageUptodate(page);
> 	remove_from_page_cache(page);
> 	put_page(page);
>diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
>index b1a1c72..5e29b37 100644
>--- a/fs/jfs/jfs_metapage.c
>+++ b/fs/jfs/jfs_metapage.c
>@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1
>
> 	/* Retest mp->count since we may have released page lock */
> 	if (test_bit(META_discard, &mp->flag) && !mp->count) {
>-		clear_page_dirty(page);
>+		clear_page_dirty(page, 0);
> 		ClearPageUptodate(page);
> 	}
> #else
>diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
>index 47e7027..a97e198 100644
>--- a/fs/reiserfs/stree.c
>+++ b/fs/reiserfs/stree.c
>@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
> 				bh = next;
> 			} while (bh != head);
> 			if (PAGE_SIZE == bh->b_size) {
>-				clear_page_dirty(page);
>+				clear_page_dirty(page, 0);
> 			}
> 		}
> 	}
>diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
>index b56eb75..44ac434 100644
>--- a/fs/xfs/linux-2.6/xfs_aops.c
>+++ b/fs/xfs/linux-2.6/xfs_aops.c
>@@ -343,7 +343,7 @@ xfs_start_page_writeback(
> 	ASSERT(!PageWriteback(page));
> 	set_page_writeback(page);
> 	if (clear_dirty)
>-		clear_page_dirty(page);
>+		clear_page_dirty(page, 0);
> 	unlock_page(page);
> 	if (!buffers) {
> 		end_page_writeback(page);
>diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>index 4830a3b..175ab3c 100644
>--- a/include/linux/page-flags.h
>+++ b/include/linux/page-flags.h
>@@ -253,13 +253,13 @@ #define ClearPageUncached(page)	clear_bi
>
> struct page;	/* forward declaration */
>
>-int test_clear_page_dirty(struct page *page);
>+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
> int test_clear_page_writeback(struct page *page);
> int test_set_page_writeback(struct page *page);
>
>-static inline void clear_page_dirty(struct page *page)
>+static inline void clear_page_dirty(struct page *page, int
>must_clean_ptes)
above looks wrapped to me so I fixed it to one line
> {
>-	test_clear_page_dirty(page);
>+	test_clear_page_dirty(page, must_clean_ptes);
> }
>
> static inline void set_page_writeback(struct page *page)
>diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>index 237107c..f2a157d 100644
>--- a/mm/page-writeback.c
>+++ b/mm/page-writeback.c
>@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>  * Clear a page's dirty flag, while caring for dirty memory
>accounting.
Likewise here, malformed patch otherwise
>  * Returns true if the page was previously dirty.
>  */
>-int test_clear_page_dirty(struct page *page)
>+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
> {
> 	struct address_space *mapping = page_mapping(page);
> 	unsigned long flags;
>@@ -857,6 +857,8 @@ int test_clear_page_dirty(struct page *p
> 		return TestClearPageDirty(page);
>
> 	write_lock_irqsave(&mapping->tree_lock, flags);
>+
>+#if 0
> 	if (TestClearPageDirty(page)) {
> 		radix_tree_tag_clear(&mapping->page_tree,
> 				page_index(page), PAGECACHE_TAG_DIRTY);
>@@ -866,11 +868,19 @@ int test_clear_page_dirty(struct page *p
> 		 * page is locked, which pins the address_space
> 		 */
> 		if (mapping_cap_account_dirty(mapping)) {
>-			page_mkclean(page);
>+			int cleaned = page_mkclean(page);
>+			if (!must_clean_ptes && cleaned){
>+			WARN_ON(1);
>+			set_page_dirty(page);
>+			}
>+
> 			dec_zone_page_state(page, NR_FILE_DIRTY);
> 		}
> 		return 1;
> 	}
>+
>+#endif
>+
> 	write_unlock_irqrestore(&mapping->tree_lock, flags);
> 	return 0;
> }
>diff --git a/mm/rmap.c b/mm/rmap.c
>diff --git a/mm/truncate.c b/mm/truncate.c
>index 9bfb8e8..9a01d9e 100644
>--- a/mm/truncate.c
>+++ b/mm/truncate.c
>@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
> 	if (PagePrivate(page))
> 		do_invalidatepage(page, 0);
>
>-	if (test_clear_page_dirty(page))
>+	if (test_clear_page_dirty(page, 0))
> 		task_io_account_cancelled_write(PAGE_CACHE_SIZE);
> 	ClearPageUptodate(page);
> 	ClearPageMappedToDisk(page);
>@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
> 					  PAGE_CACHE_SIZE, 0);
> 				}
> 			}
>-			was_dirty = test_clear_page_dirty(page);
>+			was_dirty = test_clear_page_dirty(page, 0);
> 			if (!invalidate_complete_page2(mapping, page)) {
> 				if (was_dirty)
> 					set_page_dirty(page);
>
I think I must have screwed the moose.  Following along in this thread, 
I'd patched things back and forth till I figured I'd better do a fresh 
tree, so starting with the full 2.6.19 tarball, I applied the 2.6.20-rc1 
patch, then the above patch, which should be the only thing different 
from what I'm running right now, which is the commented line in rmap.c, 
otherwise as it unpacked.

But:
In file included from include/linux/mm.h:230,
                 from include/linux/rmap.h:10,
                 from init/main.c:47:
include/linux/page-flags.h:260: error: expected declaration specifiers or ‘...’ before ‘in’
include/linux/page-flags.h: In function ‘clear_page_dirty’:
include/linux/page-flags.h:262: error: ‘must_clean_ptes’ undeclared (first use in this function)
include/linux/page-flags.h:262: error: (Each undeclared identifier is reported only once
include/linux/page-flags.h:262: error: for each function it appears in.)
make[1]: *** [init/main.o] Error 1
make: *** [init] Error 2

There were 2 places where this patch is word wrapped, and this was one of 
them:

-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int
must_clean_ptes)

The other one was in a comment, which screwed the patch and needed fixed 
too.  Is it fubared someplace else I missed?  Or am I in fact being 
bitten by this bug?
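
[Illustration] For reference, this is roughly what the wrapped hunk is meant to produce once the two halves are joined. It is a sketch only: the definition of test_clear_page_dirty() and the last_flag variable are hypothetical stand-ins added so the fragment compiles in userspace, and they are not part of the patch. The "before ‘in’" in the compiler output may suggest that the parameter type ended up as "in" rather than "int" when the line was rejoined by hand, but that is a guess from the error text alone.

```c
#include <assert.h>
#include <stddef.h>

struct page;	/* forward declaration, as in page-flags.h */

/* Hypothetical stand-in: records the flag it was called with. */
static int last_flag = -1;

int test_clear_page_dirty(struct page *page, int must_clean_ptes)
{
	(void)page;
	last_flag = must_clean_ptes;
	return 0;
}

/* The declaration from the patch, on a single line with the full "int" type. */
static inline void clear_page_dirty(struct page *page, int must_clean_ptes)
{
	test_clear_page_dirty(page, must_clean_ptes);
}

/* Hypothetical helper: call the wrapper and report which flag reached the callee. */
static int flag_seen_by_callee(int must_clean_ptes)
{
	clear_page_dirty(NULL, must_clean_ptes);
	assert(last_flag == must_clean_ptes);
	return last_flag;
}
```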

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2006 by Maurice Eugene Heskett, all rights reserved.

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  0:29                                 ` Andrei Popa
@ 2006-12-19  0:57                                   ` Linus Torvalds
  2006-12-19  1:21                                     ` Andrew Morton
  2006-12-19  1:50                                     ` Andrei Popa
  0 siblings, 2 replies; 154+ messages in thread
From: Linus Torvalds @ 2006-12-19  0:57 UTC (permalink / raw)
  To: Andrei Popa
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr



On Tue, 19 Dec 2006, Andrei Popa wrote:
> > > 
> > > nope, no file corruption at all.
> > 
> > Ok. That's interesting, but I think you actually #ifdef'ed out too 
> > much:
> > 
> > It was really just the _inner_ "if (mapping_cap_account_dirty(.." 
> > statement that I meant you should remove.
> > 
> > Can you try that too?
> 
> I have file corruption: "Hash check on download completion found bad
> chunks, consider using "safe_sync"."

Ok, that's interesting.

So it doesn't seem to be the call to page_mkclean() itself that causes 
corruption. It looks like Peter's hunch that maybe there's some bug in 
PG_dirty handling _itself_ might be an idea..

And the reason it only started happening now is that it may just have been 
_hidden_ by the fact that while we kept the dirty bits in the page tables, 
we'd end up writing the dirty page _despite_ having lost the PG_dirty bit. 
So if it's some bad interaction between writable mappings and some other 
part of the system, we just didn't see it earlier, exactly because we had 
_lots_ of dirty bits, and it was enough that _one_ of them was right.

If you didn't see corruption when you #ifdef'ed out too much of the 
"test_clear_page_dirty()" function (the _whole_ TestClearPageDirty() 
if-statement), but you get it when you just comment out the stuff that 
does the page_mkclean(), that's interesting.

I'm left looking at the "radix_tree_tag_clear()" in 
test_clear_page_dirty().

What happens if you only ifdef out that single thing? 

The actual page-cleaning functions make sure to only clear the TAG_DIRTY 
bit _after_ the page has been marked for writeback. Is there some ordering 
constraint there, perhaps?

I'm really reaching here. I'm trying to see the pattern, and I'm not 
seeing it. I'm asking you to test things just to get more of a feel for 
what triggers the failure, than because I actually have any kind of idea 
of what the heck is going on.

Andrew, Nick, Hugh - any ideas?

			Linus
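
[Illustration] The "lots of dirty bits, and it was enough that one of them was right" argument above can be pictured with a toy model. This is a heavily simplified userspace sketch, not kernel code: struct toy_page and the functions below are hypothetical names, and the model assumes only that a page eventually gets written if either its page-level dirty flag or any pte-level dirty bit is still set.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PTES 4

/* Toy page: one PG_dirty-like flag plus one dirty bit per mapping pte. */
struct toy_page {
	bool pg_dirty;
	bool pte_dirty[MAX_PTES];
	int nr_ptes;
};

static bool any_pte_dirty(const struct toy_page *p)
{
	assert(p->nr_ptes >= 0 && p->nr_ptes <= MAX_PTES);
	for (int i = 0; i < p->nr_ptes; i++)
		if (p->pte_dirty[i])
			return true;
	return false;
}

/* In this model a page is eventually written if any dirty indication survives. */
static bool eventually_written(const struct toy_page *p)
{
	return p->pg_dirty || any_pte_dirty(p);
}

/*
 * Model some path that clears the page-level flag without writing the data.
 * clean_ptes == false: pte dirty bits are left alone (the older behaviour).
 * clean_ptes == true:  pte dirty bits are cleaned too (page_mkclean()-like).
 */
static void clear_dirty_without_write(struct toy_page *p, bool clean_ptes)
{
	p->pg_dirty = false;
	if (clean_ptes)
		for (int i = 0; i < p->nr_ptes; i++)
			p->pte_dirty[i] = false;
}
```

With the pte bits left alone, a stray clear of the page flag is masked because the pte dirty bit still gets the page written; once the pte bits are cleaned as well, the same stray clear loses the data. That matches the hypothesis in the message above, and it says nothing about where the stray clear actually comes from.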

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  0:04                               ` Linus Torvalds
@ 2006-12-19  0:29                                 ` Andrei Popa
  2006-12-19  0:57                                   ` Linus Torvalds
  0 siblings, 1 reply; 154+ messages in thread
From: Andrei Popa @ 2006-12-19  0:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 16:04 -0800, Linus Torvalds wrote:
> 
> On Tue, 19 Dec 2006, Andrei Popa wrote:
> > > 
> > > There's exactly two call sites that call "page_mkclean()" (and that is the 
> > > only thing in turn that calls "page_mkclean_one()", which we already 
> > > determined will cause the corruption). 
> > >
> > > Can you just TOTALLY DISABLE that case for the test_clear_page_dirty() 
> > > case? Just do an "#if 0 .. #endif" around that whole if-statement, leaving 
> > > the _only_ thing that actually calls "page_mkclean()" to be the 
> > > "clear_page_dirty_for_io()" call.
> > > 
> > > Do you still see corruption?
> > 
> > nope, no file corruption at all.
> 
> Ok. That's interesting, but I think you actually #ifdef'ed out too 
> much:
> 
> > +
> > +#if 0
> >  	if (TestClearPageDirty(page)) {
> >  		radix_tree_tag_clear(&mapping->page_tree,
> >  				page_index(page), PAGECACHE_TAG_DIRTY);
> > @@ -866,11 +868,19 @@ int test_clear_page_dirty(struct page *p
> >  		 * page is locked, which pins the address_space
> >  		 */
> >  		if (mapping_cap_account_dirty(mapping)) {
> > -			page_mkclean(page);
> > +			int cleaned = page_mkclean(page);
> > +			if (!must_clean_ptes && cleaned){
> > +			WARN_ON(1);
> > +			set_page_dirty(page);
> > +			}
> > +
> >  			dec_zone_page_state(page, NR_FILE_DIRTY);
> >  		}
> >  		return 1;
> >  	}
> > +
> > +#endif
> > +
> 
> It was really just the _inner_ "if (mapping_cap_account_dirty(.." 
> statement that I meant you should remove.
> 
> Can you try that too?

I have file corruption: "Hash check on download completion found bad
chunks, consider using "safe_sync"."


diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageDirty(page) || PageWriteback(page))
 		return 0;
 
 	if (mapping == NULL) {		/* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
 	spin_unlock(&mapping->private_lock);
-	if (ret) {
-		/*
-		 * If the filesystem writes its buffers by hand (eg ext3)
-		 * then we can have clean buffers against a dirty page.  We
-		 * clean the page here; otherwise later reattachment of buffers
-		 * could encounter a non-uptodate page, which is unresolvable.
-		 * This only applies in the rare case where try_to_free_buffers
-		 * succeeds but the page is not freed.
-		 *
-		 * Also, during truncate, discard_buffer will have marked all
-		 * the page's buffers clean.  We discover that here and clean
-		 * the page also.
-		 */
-		if (test_clear_page_dirty(page))
-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
-	}
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0f05cab..2d8bbbb 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1245,7 +1245,7 @@ retry:
 				wait_on_page_writeback(page);
 
 			if (PageWriteback(page) ||
-					!test_clear_page_dirty(page)) {
+					!test_clear_page_dirty(page, 0)) {
 				unlock_page(page);
 				break;
 			}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1387749..da2bdb1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
 		spin_unlock(&fc->lock);
 
 		if (offset == 0 && to == PAGE_CACHE_SIZE) {
-			clear_page_dirty(page);
+			clear_page_dirty(page, 0);
 			SetPageUptodate(page);
 		}
 	}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..9f82cd0 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
 
 static void truncate_huge_page(struct page *page)
 {
-	clear_page_dirty(page);
+	clear_page_dirty(page, 0);
 	ClearPageUptodate(page);
 	remove_from_page_cache(page);
 	put_page(page);
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index b1a1c72..5e29b37 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1
 
 	/* Retest mp->count since we may have released page lock */
 	if (test_bit(META_discard, &mp->flag) && !mp->count) {
-		clear_page_dirty(page);
+		clear_page_dirty(page, 0);
 		ClearPageUptodate(page);
 	}
 #else
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 47e7027..a97e198 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
 				bh = next;
 			} while (bh != head);
 			if (PAGE_SIZE == bh->b_size) {
-				clear_page_dirty(page);
+				clear_page_dirty(page, 0);
 			}
 		}
 	}
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index b56eb75..44ac434 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -343,7 +343,7 @@ xfs_start_page_writeback(
 	ASSERT(!PageWriteback(page));
 	set_page_writeback(page);
 	if (clear_dirty)
-		clear_page_dirty(page);
+		clear_page_dirty(page, 0);
 	unlock_page(page);
 	if (!buffers) {
 		end_page_writeback(page);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..175ab3c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,13 +253,13 @@ #define ClearPageUncached(page)	clear_bi
 
 struct page;	/* forward declaration */
 
-int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);
 
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int must_clean_ptes)
 {
-	test_clear_page_dirty(page);
+	test_clear_page_dirty(page, must_clean_ptes);
 }
 
 static inline void set_page_writeback(struct page *page)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..4ff7f90 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
  * Clear a page's dirty flag, while caring for dirty memory accounting.
  * Returns true if the page was previously dirty.
  */
-int test_clear_page_dirty(struct page *page)
+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
 {
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
@@ -857,6 +857,7 @@ int test_clear_page_dirty(struct page *p
 		return TestClearPageDirty(page);
 
 	write_lock_irqsave(&mapping->tree_lock, flags);
+
 	if (TestClearPageDirty(page)) {
 		radix_tree_tag_clear(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
@@ -865,12 +866,23 @@ int test_clear_page_dirty(struct page *p
 		 * We can continue to use `mapping' here because the
 		 * page is locked, which pins the address_space
 		 */
+
+#if 0
+
 		if (mapping_cap_account_dirty(mapping)) {
-			page_mkclean(page);
+			int cleaned = page_mkclean(page);
+			if (!must_clean_ptes && cleaned){
+			WARN_ON(1);
+			set_page_dirty(page);
+			}
+
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
+#endif
+
 		return 1;
 	}
+
 	write_unlock_irqrestore(&mapping->tree_lock, flags);
 	return 0;
 }
diff --git a/mm/rmap.c b/mm/rmap.c
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..9a01d9e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
-	if (test_clear_page_dirty(page))
+	if (test_clear_page_dirty(page, 0))
 		task_io_account_cancelled_write(PAGE_CACHE_SIZE);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);
@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
 					  PAGE_CACHE_SIZE, 0);
 				}
 			}
-			was_dirty = test_clear_page_dirty(page);
+			was_dirty = test_clear_page_dirty(page, 0);
 			if (!invalidate_complete_page2(mapping, page)) {
 				if (was_dirty)
 					set_page_dirty(page);



^ permalink raw reply related	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-19  0:13                               ` Andrei Popa
@ 2006-12-19  0:29                                 ` Linus Torvalds
  0 siblings, 0 replies; 154+ messages in thread
From: Linus Torvalds @ 2006-12-19  0:29 UTC (permalink / raw)
  To: Andrei Popa
  Cc: Alessandro Suardi, Peter Zijlstra, Andrew Morton,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr



On Tue, 19 Dec 2006, Andrei Popa wrote:
> 
> the corrupted file has a chunk full of zeros
> 
> http://193.226.119.62/corruption0.jpg
> http://193.226.119.62/corruption1.jpg

Thanks. Yup, filled with zeroes, and the corruption stops (but does _not_ 
start) at a page boundary.

That _does_ look very much like it was filled in linearly, then written 
out to disk when it was in the middle of the page, and then we simply lost 
the further writes that should also have gone on to that page. All 
consistent with dropping a dirty bit somewhere in the middle of the page 
updates.

Which we kind of knew must be the issue anyway, but it's good to know that 
the corruption pattern is consistent with what we're trying to figure out.

		Linus

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 22:45                             ` Linus Torvalds
@ 2006-12-19  0:13                               ` Andrei Popa
  2006-12-19  0:29                                 ` Linus Torvalds
  0 siblings, 1 reply; 154+ messages in thread
From: Andrei Popa @ 2006-12-19  0:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alessandro Suardi, Peter Zijlstra, Andrew Morton,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 14:45 -0800, Linus Torvalds wrote:
> 
> On Mon, 18 Dec 2006, Alessandro Suardi wrote:
> > 
> > No idea whether this can be a data point or not, but
> > here it goes... my P2P box is about to turn 5 days old
> > while running nonstop one or both of aMule 2.1.3 and
> > BitTorrent 4.4.0 on ext3 mounted w/default options
> > on both IDE and USB disks. Zero corruption.
> > 
> > AMD K7-800, 512MB RAM, PREEMPT/UP kernel,
> > 2.6.19-git20 on top of up-to-date FC6.
> 
> It _looks_ like PREEMPT/SMP is one common configuration.
> 
> It might also be that the blocksize of the filesystem matters. 4kB 
> filesystems are fundamentally simpler than 1kB filesystems, for example. 
> You can tell at least with "/sbin/dumpe2fs -h /dev/..." or something.
> 
> Andrei - one thing that might be interesting to see: when corruption 
> occurs, can you get the corrupted file somehow? And compare it with a 
> known-good copy to see what the corruption looks like?

the corrupted file has a chunk full of zeros

http://193.226.119.62/corruption0.jpg
http://193.226.119.62/corruption1.jpg




^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 23:48                             ` Andrei Popa
@ 2006-12-19  0:04                               ` Linus Torvalds
  2006-12-19  0:29                                 ` Andrei Popa
  2006-12-19  1:03                               ` Gene Heskett
  1 sibling, 1 reply; 154+ messages in thread
From: Linus Torvalds @ 2006-12-19  0:04 UTC (permalink / raw)
  To: Andrei Popa
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr



On Tue, 19 Dec 2006, Andrei Popa wrote:
> > 
> > There's exactly two call sites that call "page_mkclean()" (and that is the 
> > only thing in turn that calls "page_mkclean_one()", which we already 
> > determined will cause the corruption). 
> >
> > Can you just TOTALLY DISABLE that case for the test_clear_page_dirty() 
> > case? Just do an "#if 0 .. #endif" around that whole if-statement, leaving 
> > the _only_ thing that actually calls "page_mkclean()" to be the 
> > "clear_page_dirty_for_io()" call.
> > 
> > Do you still see corruption?
> 
> nope, no file corruption at all.

Ok. That's interesting, but I think you actually #ifdef'ed out too 
much:

> +
> +#if 0
>  	if (TestClearPageDirty(page)) {
>  		radix_tree_tag_clear(&mapping->page_tree,
>  				page_index(page), PAGECACHE_TAG_DIRTY);
> @@ -866,11 +868,19 @@ int test_clear_page_dirty(struct page *p
>  		 * page is locked, which pins the address_space
>  		 */
>  		if (mapping_cap_account_dirty(mapping)) {
> -			page_mkclean(page);
> +			int cleaned = page_mkclean(page);
> +			if (!must_clean_ptes && cleaned){
> +			WARN_ON(1);
> +			set_page_dirty(page);
> +			}
> +
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  		}
>  		return 1;
>  	}
> +
> +#endif
> +

It was really just the _inner_ "if (mapping_cap_account_dirty(.." 
statement that I meant you should remove.

Can you try that too?

		Linus

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 22:32                           ` Linus Torvalds
@ 2006-12-18 23:48                             ` Andrei Popa
  2006-12-19  0:04                               ` Linus Torvalds
  2006-12-19  1:03                               ` Gene Heskett
  0 siblings, 2 replies; 154+ messages in thread
From: Andrei Popa @ 2006-12-18 23:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 14:32 -0800, Linus Torvalds wrote:
> 
> On Mon, 18 Dec 2006, Andrei Popa wrote:
> > >
> > > This should be fairly easy to test: just change every single ", 1" case in 
> > > the patch to ", 0".
> > >
> > > What happens for you in that case?
> > 
> > I have file corruption.
> 
> Magic. And btw, _thanks_ for being such a great tester.
> 
> So now I have one more thing for you to try, if you can bother:
> 
> There's exactly two call sites that call "page_mkclean()" (and that is the 
> only thing in turn that calls "page_mkclean_one()", which we already 
> determined will cause the corruption). 
> 
> Both of them do 
> 
> 	if (mapping_cap_account_dirty(mapping)) {
> 			..
> 
> things, although they do slightly different things inside that if in your 
> patched kernel.
> 
> Can you just TOTALLY DISABLE that case for the test_clear_page_dirty() 
> case? Just do an "#if 0 .. #endif" around that whole if-statement, leaving 
> the _only_ thing that actually calls "page_mkclean()" to be the 
> "clear_page_dirty_for_io()" call.
> 
> Do you still see corruption?

nope, no file corruption at all.



diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageDirty(page) || PageWriteback(page))
 		return 0;
 
 	if (mapping == NULL) {		/* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
 	spin_unlock(&mapping->private_lock);
-	if (ret) {
-		/*
-		 * If the filesystem writes its buffers by hand (eg ext3)
-		 * then we can have clean buffers against a dirty page.  We
-		 * clean the page here; otherwise later reattachment of buffers
-		 * could encounter a non-uptodate page, which is unresolvable.
-		 * This only applies in the rare case where try_to_free_buffers
-		 * succeeds but the page is not freed.
-		 *
-		 * Also, during truncate, discard_buffer will have marked all
-		 * the page's buffers clean.  We discover that here and clean
-		 * the page also.
-		 */
-		if (test_clear_page_dirty(page))
-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
-	}
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0f05cab..2d8bbbb 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1245,7 +1245,7 @@ retry:
 				wait_on_page_writeback(page);
 
 			if (PageWriteback(page) ||
-					!test_clear_page_dirty(page)) {
+					!test_clear_page_dirty(page, 0)) {
 				unlock_page(page);
 				break;
 			}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1387749..da2bdb1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
 		spin_unlock(&fc->lock);
 
 		if (offset == 0 && to == PAGE_CACHE_SIZE) {
-			clear_page_dirty(page);
+			clear_page_dirty(page, 0);
 			SetPageUptodate(page);
 		}
 	}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..9f82cd0 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
 
 static void truncate_huge_page(struct page *page)
 {
-	clear_page_dirty(page);
+	clear_page_dirty(page, 0);
 	ClearPageUptodate(page);
 	remove_from_page_cache(page);
 	put_page(page);
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index b1a1c72..5e29b37 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1
 
 	/* Retest mp->count since we may have released page lock */
 	if (test_bit(META_discard, &mp->flag) && !mp->count) {
-		clear_page_dirty(page);
+		clear_page_dirty(page, 0);
 		ClearPageUptodate(page);
 	}
 #else
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 47e7027..a97e198 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
 				bh = next;
 			} while (bh != head);
 			if (PAGE_SIZE == bh->b_size) {
-				clear_page_dirty(page);
+				clear_page_dirty(page, 0);
 			}
 		}
 	}
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index b56eb75..44ac434 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -343,7 +343,7 @@ xfs_start_page_writeback(
 	ASSERT(!PageWriteback(page));
 	set_page_writeback(page);
 	if (clear_dirty)
-		clear_page_dirty(page);
+		clear_page_dirty(page, 0);
 	unlock_page(page);
 	if (!buffers) {
 		end_page_writeback(page);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..175ab3c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,13 +253,13 @@ #define ClearPageUncached(page)	clear_bi
 
 struct page;	/* forward declaration */
 
-int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);
 
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int must_clean_ptes)
 {
-	test_clear_page_dirty(page);
+	test_clear_page_dirty(page, must_clean_ptes);
 }
 
 static inline void set_page_writeback(struct page *page)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..f2a157d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
  * Clear a page's dirty flag, while caring for dirty memory accounting.
  * Returns true if the page was previously dirty.
  */
-int test_clear_page_dirty(struct page *page)
+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
 {
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
@@ -857,6 +857,8 @@ int test_clear_page_dirty(struct page *p
 		return TestClearPageDirty(page);
 
 	write_lock_irqsave(&mapping->tree_lock, flags);
+
+#if 0
 	if (TestClearPageDirty(page)) {
 		radix_tree_tag_clear(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
@@ -866,11 +868,19 @@ int test_clear_page_dirty(struct page *p
 		 * page is locked, which pins the address_space
 		 */
 		if (mapping_cap_account_dirty(mapping)) {
-			page_mkclean(page);
+			int cleaned = page_mkclean(page);
+			if (!must_clean_ptes && cleaned){
+			WARN_ON(1);
+			set_page_dirty(page);
+			}
+
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
 		return 1;
 	}
+
+#endif
+
 	write_unlock_irqrestore(&mapping->tree_lock, flags);
 	return 0;
 }
diff --git a/mm/rmap.c b/mm/rmap.c
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..9a01d9e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
-	if (test_clear_page_dirty(page))
+	if (test_clear_page_dirty(page, 0))
 		task_io_account_cancelled_write(PAGE_CACHE_SIZE);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);
@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
 					  PAGE_CACHE_SIZE, 0);
 				}
 			}
-			was_dirty = test_clear_page_dirty(page);
+			was_dirty = test_clear_page_dirty(page, 0);
 			if (!invalidate_complete_page2(mapping, page)) {
 				if (was_dirty)
 					set_page_dirty(page);



^ permalink raw reply related	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 22:00                           ` Alessandro Suardi
@ 2006-12-18 22:45                             ` Linus Torvalds
  2006-12-19  0:13                               ` Andrei Popa
  0 siblings, 1 reply; 154+ messages in thread
From: Linus Torvalds @ 2006-12-18 22:45 UTC (permalink / raw)
  To: Alessandro Suardi
  Cc: andrei.popa, Peter Zijlstra, Andrew Morton,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr



On Mon, 18 Dec 2006, Alessandro Suardi wrote:
> 
> No idea whether this can be a data point or not, but
> here it goes... my P2P box is about to turn 5 days old
> while running nonstop one or both of aMule 2.1.3 and
> BitTorrent 4.4.0 on ext3 mounted w/default options
> on both IDE and USB disks. Zero corruption.
> 
> AMD K7-800, 512MB RAM, PREEMPT/UP kernel,
> 2.6.19-git20 on top of up-to-date FC6.

It _looks_ like PREEMPT/SMP is one common configuration.

It might also be that the blocksize of the filesystem matters. 4kB 
filesystems are fundamentally simpler than 1kB filesystems, for example. 
You can tell at least with "/sbin/dumpe2fs -h /dev/..." or something.
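
[Editorial sketch, not part of the original message.] Besides dumpe2fs, the block size can probably also be read from a running program with POSIX statvfs(). The helper names below (fs_block_size, block_smaller_than_page) are hypothetical, and f_bsize is assumed to match the ext3 block size that dumpe2fs reports; treat this as a rough cross-check rather than a replacement for dumpe2fs.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/statvfs.h>

/*
 * Return the block size (in bytes) of the filesystem that contains
 * `path`, or 0 if statvfs() fails. Assumption: f_bsize reflects the
 * on-disk block size for ext3.
 */
static unsigned long fs_block_size(const char *path)
{
	struct statvfs sv;

	assert(path != NULL);
	if (statvfs(path, &sv) != 0)
		return 0;
	return (unsigned long)sv.f_bsize;
}

/*
 * Return 1 if one page holds more than one filesystem block (the
 * "1kB filesystem" case mentioned above), 0 otherwise.
 */
static int block_smaller_than_page(unsigned long block_size,
				   unsigned long page_size)
{
	return block_size != 0 && block_size < page_size;
}
```

Calling fs_block_size() on the mount point of the affected filesystem and passing the result, together with the page size from sysconf(_SC_PAGESIZE), to block_smaller_than_page() would show whether each page carries several buffers.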

Andrei - one thing that might be interesting to see: when corruption 
occurs, can you get the corrupted file somehow? And compare it with a 
known-good copy to see what the corruption looks like?
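
[Editorial sketch, not part of the original message.] One way to do the comparison suggested above is to walk both copies and report each run of differing bytes, noting whether the bad copy is zero-filled there and whether the run starts or ends on a page boundary. The names (diff_run, next_diff_run) are hypothetical and the 4096-byte page size is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SZ 4096u	/* assumed page size */

struct diff_run {
	size_t start;		/* offset of the first differing byte */
	size_t end;		/* offset one past the last differing byte */
	int all_zero;		/* 1 if the bad copy is all zeroes in the run */
	int start_aligned;	/* 1 if start is a multiple of PAGE_SZ */
	int end_aligned;	/* 1 if end is a multiple of PAGE_SZ */
};

/*
 * Find the first run of differing bytes at or after offset `from`.
 * Returns 1 and fills `out` if a run was found, 0 if the two buffers
 * are identical from `from` to `len`.
 */
static int next_diff_run(const unsigned char *good, const unsigned char *bad,
			 size_t len, size_t from, struct diff_run *out)
{
	size_t i = from;

	assert(out != NULL);

	/* Skip the matching prefix. */
	while (i < len && good[i] == bad[i])
		i++;
	if (i >= len)
		return 0;

	out->start = i;
	out->all_zero = 1;

	/* Extend the run while the bytes keep differing. */
	while (i < len && good[i] != bad[i]) {
		if (bad[i] != 0)
			out->all_zero = 0;
		i++;
	}
	out->end = i;
	out->start_aligned = (out->start % PAGE_SZ) == 0;
	out->end_aligned = (out->end % PAGE_SZ) == 0;
	return 1;
}
```

Calling it repeatedly with `from` set to the previous `end` lists every damaged range. A run is split wherever the good copy itself contains a zero byte, so adjacent runs may belong to one lost write.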

		Linus

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 20:41                       ` Linus Torvalds
  2006-12-18 21:11                         ` Andrei Popa
@ 2006-12-18 22:34                         ` Gene Heskett
  2006-12-22 17:27                           ` Linus Torvalds
  1 sibling, 1 reply; 154+ messages in thread
From: Gene Heskett @ 2006-12-18 22:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Andrei Popa, Peter Zijlstra, Andrew Morton,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Monday 18 December 2006 15:41, Linus Torvalds wrote:
>On Mon, 18 Dec 2006, Linus Torvalds wrote:
>> But at the same time, it's interesting that it still happens when we
>> try to re-add the dirty bit. That would tell me that it's one of two
>> cases:
>
>Forget that. There's a third case, which is much more likely:
>
> - Andrew's patch had a ", 1" where it _should_ have had a ", 0".
>
>This should be fairly easy to test: just change every single ", 1" case
> in the patch to ", 0".
>
>The only case that _definitely_ would want ",1" is actually the case
> that already calls page_mkclean() directly: clear_page_dirty_for_io().
> So no other ", 1" is valid, and that one that needed it already avoided
> even calling the "test_clear_page_dirty()" function, because it did it
> all by hand.
>
What about the mm/rmap.c one liner, in or out?

Thanks.

>What happens for you in that case?
>
>		Linus
>-
>To unsubscribe from this list: send the line "unsubscribe linux-kernel"
> in the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at  http://www.tux.org/lkml/

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2006 by Maurice Eugene Heskett, all rights reserved.

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 21:11                         ` Andrei Popa
  2006-12-18 22:00                           ` Alessandro Suardi
@ 2006-12-18 22:32                           ` Linus Torvalds
  2006-12-18 23:48                             ` Andrei Popa
  1 sibling, 1 reply; 154+ messages in thread
From: Linus Torvalds @ 2006-12-18 22:32 UTC (permalink / raw)
  To: Andrei Popa
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr



On Mon, 18 Dec 2006, Andrei Popa wrote:
> >
> > This should be fairly easy to test: just change every single ", 1" case in 
> > the patch to ", 0".
> >
> > What happens for you in that case?
> 
> I have file corruption.

Magic. And btw, _thanks_ for being such a great tester.

So now I have one more thng for you to try, it you can bother:

There's exactly two call sites that call "page_mkclean()" (an dthat is the 
only thing in turn that calls "page_mkclean_one()", which we already 
determined will cause the corruption). 

Both of them do 

	if (mapping_cap_account_dirty(mapping)) {
			..

things, although they do slightly different things inside that if in your 
patched kernel.

Can you just TOTALLY DISABLE that case for the test_clear_page_dirty() 
case? Just do an "#if 0 .. #endif" around that whole if-statement, leaving 
the _only_ thing that actually calls "page_mkclean()" to be the 
"clear_page_dirty_for_io()" call.

Do you still see corruption?

		Linus

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 21:11                         ` Andrei Popa
@ 2006-12-18 22:00                           ` Alessandro Suardi
  2006-12-18 22:45                             ` Linus Torvalds
  2006-12-18 22:32                           ` Linus Torvalds
  1 sibling, 1 reply; 154+ messages in thread
From: Alessandro Suardi @ 2006-12-18 22:00 UTC (permalink / raw)
  To: andrei.popa
  Cc: Linus Torvalds, Peter Zijlstra, Andrew Morton,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On 12/18/06, Andrei Popa <andrei.popa@i-neo.ro> wrote:
> On Mon, 2006-12-18 at 12:41 -0800, Linus Torvalds wrote:
> >
> > On Mon, 18 Dec 2006, Linus Torvalds wrote:
> > >
> > > But at the same time, it's interesting that it still happens when we try
> > > to re-add the dirty bit. That would tell me that it's one of two cases:
> >
> > Forget that. There's a third case, which is much more likely:
> >
> >  - Andrew's patch had a ", 1" where it _should_ have had a ", 0".
> >
> > This should be fairly easy to test: just change every single ", 1" case in
> > the patch to ", 0".
> >
> > The only case that _definitely_ would want ",1" is actually the case that
> > already calls page_mkclean() directly: clear_page_dirty_for_io(). So no
> > other ", 1" is valid, and that one that needed it already avoided even
> > calling the "test_clear_page_dirty()" function, because it did it all by
> > hand.
> >
> > What happens for you in that case?
> >
> >               Linus
>
> I have file corruption.

No idea whether this can be a data point or not, but
 here it goes... my P2P box is about to turn 5 days old
 while running nonstop one or both of aMule 2.1.3 and
 BitTorrent 4.4.0 on ext3 mounted w/default options
 on both IDE and USB disks. Zero corruption.

AMD K7-800, 512MB RAM, PREEMPT/UP kernel,
2.6.19-git20 on top of up-to-date FC6.

--alessandro

"...when I get it, I _get_ it"

     (Lara Eidemiller)

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 20:14                     ` Linus Torvalds
  2006-12-18 20:41                       ` Linus Torvalds
  2006-12-18 21:43                       ` Andrew Morton
@ 2006-12-18 21:49                       ` Peter Zijlstra
  2006-12-19 23:42                       ` Peter Zijlstra
  3 siblings, 0 replies; 154+ messages in thread
From: Peter Zijlstra @ 2006-12-18 21:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Popa, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> 
> On Mon, 18 Dec 2006, Andrei Popa wrote:
> > 
> > I dropped that patch and added WARN_ON(1), the unified patch is
> > attached.
> > 
> > I got corruption: "Hash check on download completion found bad chunks,
> > consider using "safe_sync"."
> 
> Ok. That is actually _very_ interesting.
> 
> It's interesting because (a) the corruption obviously goes away with the 
> one-liner that effectively disables "page_mkclean_one()".
> 
> So that tells us that yes, it's a PTE dirty bit that matters.
> 
> But at the same time, it's interesting that it still happens when we try 
> to re-add the dirty bit. That would tell me that it's one of two cases:
> 
>  - there is another caller of page cleaning that should have done the same 
>    thing (we could check that by just doing this all _inside_ the 
>    page_mkclean() thing)
> 
> OR:
> 
>  - page_mkclean_one() is simply buggy.
> 
> And I'm starting to wonder about the second case. But it all LOOKS really 
> fine - I can't see anything wrong there (it uses the extremely 
> conservative "ptep_get_and_clear()", and seems to flush everything right 
> too, through "ptep_establish()").

How about this:

we get confused on what PG_dirty tells us, we fall back to pte_dirty,
transfer pte_dirty to PG_dirty and clear pte_dirty. Now it happens
again, however we don't have pte_dirty to fall back to anymore.

This would explain why disabling pte_mkclean() does make it go away and
none of the other approaches tried so far works.

We really need a way to sort out PG_dirty, independent of pte_dirty. 
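
[Editorial sketch, not part of the original message.] The hypothesis above can be played through with a user-space toy model. This is only one possible reading of it, not kernel code: every name is hypothetical, and the "bogus clear" step stands in for whatever unknown path drops PG_dirty without writing the page.

```c
#include <assert.h>

/* Toy model: two dirty bits and a one-word "page". */
struct toy_page {
	int pte_dirty;	/* stand-in for the dirty bit in the PTE */
	int pg_dirty;	/* stand-in for PG_dirty in struct page */
	int in_memory;	/* current page contents */
	int on_disk;	/* what writeback last stored */
};

/* A store through a shared mapping only sets the PTE dirty bit. */
static void toy_mmap_store(struct toy_page *p, int value)
{
	p->in_memory = value;
	p->pte_dirty = 1;
}

/* Transfer pte_dirty to pg_dirty and clear pte_dirty. */
static void toy_transfer_pte_dirty(struct toy_page *p)
{
	if (p->pte_dirty) {
		p->pte_dirty = 0;
		p->pg_dirty = 1;
	}
}

/* The suspected bug: pg_dirty is cleared without writing the page. */
static void toy_bogus_clear(struct toy_page *p)
{
	p->pg_dirty = 0;
}

/* Writeback: fall back to pte_dirty first, then write if dirty. */
static void toy_writeback(struct toy_page *p)
{
	toy_transfer_pte_dirty(p);
	if (p->pg_dirty) {
		p->on_disk = p->in_memory;
		p->pg_dirty = 0;
	}
}

/* Returns 1 if the store reached "disk", 0 if it was lost. */
static int toy_store_survives(int clean_ptes_early)
{
	struct toy_page p = { 0, 0, 0, 0 };

	toy_mmap_store(&p, 42);
	if (clean_ptes_early)
		toy_transfer_pte_dirty(&p);	/* pte_dirty is used up here */
	toy_bogus_clear(&p);
	toy_writeback(&p);
	assert(p.pte_dirty == 0 && p.pg_dirty == 0);
	return p.on_disk == p.in_memory;
}
```

In this model toy_store_survives(0) keeps the data, because the untouched pte_dirty bit rescues the page at writeback time, while toy_store_survives(1) loses it: once the PTE bit has been folded into pg_dirty, a single bogus clear leaves nothing to fall back on.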


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 20:14                     ` Linus Torvalds
  2006-12-18 20:41                       ` Linus Torvalds
@ 2006-12-18 21:43                       ` Andrew Morton
  2006-12-18 21:49                       ` Peter Zijlstra
  2006-12-19 23:42                       ` Peter Zijlstra
  3 siblings, 0 replies; 154+ messages in thread
From: Andrew Morton @ 2006-12-18 21:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Popa, Peter Zijlstra, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 18 Dec 2006 12:14:35 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> OR:
> 
>  - page_mkclean_one() is simply buggy.
> 
> And I'm starting to wonder about the second case. But it all LOOKS really 
> fine - I can't see anything wrong there (it uses the extremely 
> conservative "ptep_get_and_clear()", and seems to flush everything right 
> too, through "ptep_establish()").

What does the call to page_check_address() in there do?

It'd be good to have a printk in there to see if it's triggering.

Is this all correct for non-linear VMAs?  (rtorrent doesn't use
MAP_NONLINEAR though).

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 20:41                       ` Linus Torvalds
@ 2006-12-18 21:11                         ` Andrei Popa
  2006-12-18 22:00                           ` Alessandro Suardi
  2006-12-18 22:32                           ` Linus Torvalds
  2006-12-18 22:34                         ` Gene Heskett
  1 sibling, 2 replies; 154+ messages in thread
From: Andrei Popa @ 2006-12-18 21:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 12:41 -0800, Linus Torvalds wrote:
> 
> On Mon, 18 Dec 2006, Linus Torvalds wrote:
> > 
> > But at the same time, it's interesting that it still happens when we try 
> > to re-add the dirty bit. That would tell me that it's one of two cases:
> 
> Forget that. There's a third case, which is much more likely:
> 
>  - Andrew's patch had a ", 1" where it _should_ have had a ", 0".
> 
> This should be fairly easy to test: just change every single ", 1" case in 
> the patch to ", 0".
> 
> The only case that _definitely_ would want ",1" is actually the case that 
> already calls page_mkclean() directly: clear_page_dirty_for_io(). So no 
> other ", 1" is valid, and that one that needed it already avoided even 
> calling the "test_clear_page_dirty()" function, because it did it all by 
> hand.
> 
> What happens for you in that case?
> 
> 		Linus

I have file corruption.


diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageDirty(page) || PageWriteback(page))
 		return 0;
 
 	if (mapping == NULL) {		/* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
 	spin_unlock(&mapping->private_lock);
-	if (ret) {
-		/*
-		 * If the filesystem writes its buffers by hand (eg ext3)
-		 * then we can have clean buffers against a dirty page.  We
-		 * clean the page here; otherwise later reattachment of buffers
-		 * could encounter a non-uptodate page, which is unresolvable.
-		 * This only applies in the rare case where try_to_free_buffers
-		 * succeeds but the page is not freed.
-		 *
-		 * Also, during truncate, discard_buffer will have marked all
-		 * the page's buffers clean.  We discover that here and clean
-		 * the page also.
-		 */
-		if (test_clear_page_dirty(page))
-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
-	}
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0f05cab..760442f 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1245,7 +1245,7 @@ retry:
 				wait_on_page_writeback(page);
 
 			if (PageWriteback(page) ||
-					!test_clear_page_dirty(page)) {
+					!test_clear_page_dirty(page, 0)) {
 				unlock_page(page);
 				break;
 			}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1387749..da2bdb1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
 		spin_unlock(&fc->lock);
 
 		if (offset == 0 && to == PAGE_CACHE_SIZE) {
-			clear_page_dirty(page);
+			clear_page_dirty(page, 0);
 			SetPageUptodate(page);
 		}
 	}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..7b87875 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
 
 static void truncate_huge_page(struct page *page)
 {
-	clear_page_dirty(page);
+	clear_page_dirty(page, 0);
 	ClearPageUptodate(page);
 	remove_from_page_cache(page);
 	put_page(page);
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index b1a1c72..47a6b62 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1
 
 	/* Retest mp->count since we may have released page lock */
 	if (test_bit(META_discard, &mp->flag) && !mp->count) {
-		clear_page_dirty(page);
+		clear_page_dirty(page, 0);
 		ClearPageUptodate(page);
 	}
 #else
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 47e7027..a97e198 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
 				bh = next;
 			} while (bh != head);
 			if (PAGE_SIZE == bh->b_size) {
-				clear_page_dirty(page);
+				clear_page_dirty(page, 0);
 			}
 		}
 	}
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index b56eb75..d65ba84 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -343,7 +343,7 @@ xfs_start_page_writeback(
 	ASSERT(!PageWriteback(page));
 	set_page_writeback(page);
 	if (clear_dirty)
-		clear_page_dirty(page);
+		clear_page_dirty(page, 0);
 	unlock_page(page);
 	if (!buffers) {
 		end_page_writeback(page);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..175ab3c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,13 +253,13 @@ #define ClearPageUncached(page)	clear_bi
 
 struct page;	/* forward declaration */
 
-int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);
 
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int must_clean_ptes)
 {
-	test_clear_page_dirty(page);
+	test_clear_page_dirty(page, must_clean_ptes);
 }
 
 static inline void set_page_writeback(struct page *page)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..f7e0cc8 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
 * Clear a page's dirty flag, while caring for dirty memory accounting.
  * Returns true if the page was previously dirty.
  */
-int test_clear_page_dirty(struct page *page)
+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
 {
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
@@ -866,7 +866,12 @@ int test_clear_page_dirty(struct page *p
 		 * page is locked, which pins the address_space
 		 */
 		if (mapping_cap_account_dirty(mapping)) {
-			page_mkclean(page);
+			int cleaned = page_mkclean(page);
+			if (!must_clean_ptes && cleaned) {
+				WARN_ON(1);
+				set_page_dirty(page);
+			}
+
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
 		return 1;
diff --git a/mm/rmap.c b/mm/rmap.c
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..cafa843 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
-	if (test_clear_page_dirty(page))
+	if (test_clear_page_dirty(page, 0))
 		task_io_account_cancelled_write(PAGE_CACHE_SIZE);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);
@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
 					  PAGE_CACHE_SIZE, 0);
 				}
 			}
-			was_dirty = test_clear_page_dirty(page);
+			was_dirty = test_clear_page_dirty(page, 0);
 			if (!invalidate_complete_page2(mapping, page)) {
 				if (was_dirty)
 					set_page_dirty(page);



^ permalink raw reply related	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 20:14                     ` Linus Torvalds
@ 2006-12-18 20:41                       ` Linus Torvalds
  2006-12-18 21:11                         ` Andrei Popa
  2006-12-18 22:34                         ` Gene Heskett
  2006-12-18 21:43                       ` Andrew Morton
                                         ` (2 subsequent siblings)
  3 siblings, 2 replies; 154+ messages in thread
From: Linus Torvalds @ 2006-12-18 20:41 UTC (permalink / raw)
  To: Andrei Popa
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr



On Mon, 18 Dec 2006, Linus Torvalds wrote:
> 
> But at the same time, it's interesting that it still happens when we try 
> to re-add the dirty bit. That would tell me that it's one of two cases:

Forget that. There's a third case, which is much more likely:

 - Andrew's patch had a ", 1" where it _should_ have had a ", 0".

This should be fairly easy to test: just change every single ", 1" case in 
the patch to ", 0".

The only case that _definitely_ would want ",1" is actually the case that 
already calls page_mkclean() directly: clear_page_dirty_for_io(). So no 
other ", 1" is valid, and that one that needed it already avoided even 
calling the "test_clear_page_dirty()" function, because it did it all by 
hand.
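
To make the ", 1" versus ", 0" distinction concrete, here is a minimal userspace sketch (not kernel code) of what the patched test_clear_page_dirty() does with its second argument. The struct toy_page type and every toy_ name are assumptions made up for this illustration; locking, the radix tree tag and the dirty accounting are all left out.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy stand-in for struct page: one page-level dirty flag and a single
 * PTE dirty bit. Hypothetical layout, for illustration only.
 */
struct toy_page {
	bool page_dirty;	/* models PG_dirty */
	bool pte_dirty;		/* models the dirty bit of one mapping PTE */
	int warnings;		/* counts would-be WARN_ON(1) hits */
};

/* Models page_mkclean(): clear the PTE dirty bit, report whether it was set. */
int toy_page_mkclean(struct toy_page *p)
{
	int was_dirty = p->pte_dirty;

	p->pte_dirty = false;
	return was_dirty;
}

/* Models the patched test_clear_page_dirty(page, must_clean_ptes). */
int toy_test_clear_page_dirty(struct toy_page *p, int must_clean_ptes)
{
	int cleaned;

	if (!p->page_dirty)
		return 0;
	p->page_dirty = false;

	cleaned = toy_page_mkclean(p);
	if (!must_clean_ptes && cleaned) {
		p->warnings++;		/* WARN_ON(1) in the test patch */
		p->page_dirty = true;	/* set_page_dirty(page) in the test patch */
	}
	return 1;
}
```

With must_clean_ptes set to 0, a dirty PTE found by the model bumps the warning counter and re-dirties the page, which is the event the WARN_ON(1) in the test patch is meant to report. With must_clean_ptes set to 1 the page simply ends up clean.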

What happens for you in that case?

		Linus

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 19:44                   ` Andrei Popa
@ 2006-12-18 20:14                     ` Linus Torvalds
  2006-12-18 20:41                       ` Linus Torvalds
                                         ` (3 more replies)
  0 siblings, 4 replies; 154+ messages in thread
From: Linus Torvalds @ 2006-12-18 20:14 UTC (permalink / raw)
  To: Andrei Popa
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr



On Mon, 18 Dec 2006, Andrei Popa wrote:
> 
> I dropped that patch and added WARN_ON(1), the unified patch is
> attached.
> 
> I got corruption: "Hash check on download completion found bad chunks,
> consider using "safe_sync"."

Ok. That is actually _very_ interesting.

It's interesting because (a) the corruption obviously goes away with the 
one-liner that effectively disables "page_mkclean_one()".

So that tells us that yes, it's a PTE dirty bit that matters.

But at the same time, it's interesting that it still happens when we try 
to re-add the dirty bit. That would tell me that it's one of two cases:

 - there is another caller of page cleaning that should have done the same 
   thing (we could check that by just doing this all _inside_ the 
   page_mkclean() thing)

OR:

 - page_mkclean_one() is simply buggy.

And I'm starting to wonder about the second case. But it all LOOKS really 
fine - I can't see anything wrong there (it uses the extremely 
conservative "ptep_get_and_clear()", and seems to flush everything right 
too, through "ptep_establish()").
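
For anyone following along, the sequence in question can be modelled in userspace as plain bit operations. This is only a sketch: toy_pte_t, the TOY_PTE_ bit values and all toy_ functions are hypothetical stand-ins, the final store merely stands in for ptep_establish() and its TLB flush, and nothing here models the hardware setting the dirty bit concurrently.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit positions only; treat the values as assumptions. */
#define TOY_PTE_PRESENT	(1u << 0)
#define TOY_PTE_WRITE	(1u << 1)
#define TOY_PTE_DIRTY	(1u << 6)

typedef uint32_t toy_pte_t;

/* Models pte_mkclean(): drop the dirty bit. */
toy_pte_t toy_pte_mkclean(toy_pte_t e)
{
	return e & ~TOY_PTE_DIRTY;
}

/* Models pte_wrprotect(): drop the write permission bit. */
toy_pte_t toy_pte_wrprotect(toy_pte_t e)
{
	return e & ~TOY_PTE_WRITE;
}

/* Models ptep_get_and_clear(): return the old entry, leave the slot empty. */
toy_pte_t toy_ptep_get_and_clear(toy_pte_t *slot)
{
	toy_pte_t old = *slot;

	*slot = 0;
	return old;
}

/*
 * Models the get/clean/write-protect/establish sequence. Passing
 * skip_mkclean = 1 reproduces the debugging one-liner that comments
 * out pte_mkclean(). Returns whether the old entry was dirty.
 */
int toy_page_mkclean_one(toy_pte_t *slot, int skip_mkclean)
{
	toy_pte_t entry = toy_ptep_get_and_clear(slot);
	int was_dirty = (entry & TOY_PTE_DIRTY) != 0;

	if (!skip_mkclean)
		entry = toy_pte_mkclean(entry);
	entry = toy_pte_wrprotect(entry);
	*slot = entry;	/* stands in for ptep_establish() */
	return was_dirty;
}
```

In this model the one-liner leaves the dirty bit set in the re-established entry, so a later scan of the PTE would still see the page as dirty. That matches the observation that the one-liner makes the corruption go away, without saying anything about where the real bug is.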

		Linus

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 19:18                 ` Linus Torvalds
@ 2006-12-18 19:44                   ` Andrei Popa
  2006-12-18 20:14                     ` Linus Torvalds
  2006-12-19  7:38                   ` Peter Zijlstra
  1 sibling, 1 reply; 154+ messages in thread
From: Andrei Popa @ 2006-12-18 19:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 11:18 -0800, Linus Torvalds wrote:
> 
> On Mon, 18 Dec 2006, Andrei Popa wrote:
> > 
> > I applied Linus patch, Andrew patch, Peter Zijlstra patches(the last
> > two). All unified patch is attached. I tested and I have no corruption.
> 
> That wasn't very interesting, because you also had the patch that just 
> disabled "page_mkclean_one()" entirely:
> 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index d8a842a..3f9061e 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page 
> >  		goto unlock;
> >  
> >  	entry = ptep_get_and_clear(mm, address, pte);
> > -	entry = pte_mkclean(entry);
> > +	/*entry = pte_mkclean(entry);*/
> >  	entry = pte_wrprotect(entry);
> >  	ptep_establish(vma, address, pte, entry);
> >  	lazy_mmu_prot_update(entry);
> 
> The above patch is bad. It's always going to hide the bug, but it hides it 
> by just not doing anything at all. So any patch combination that contains 
> that patch will probably _always_ fix your problem, but it won't be an 
> interesting patch..
> 
> So can you remove that small fragment? Also, it would be nice if you added 
> the WARN_ON() to this sequence in mm/page-writeback.c:
> 
> +                       if (!must_clean_ptes && cleaned)
> +                               set_page_dirty(page);
> 
> just make it do a WARN_ON() if this ever triggers.
> 
> Then, IF the corruption is gone, we'd love to see the WARN_ON results..
> 
> 		Linus

I dropped that patch and added WARN_ON(1), the unified patch is
attached.

I got corruption: "Hash check on download completion found bad chunks,
consider using "safe_sync"."

There is no message from WARN_ON(1) in dmesg. My .config is attached.



diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageDirty(page) || PageWriteback(page))
 		return 0;
 
 	if (mapping == NULL) {		/* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
 	spin_unlock(&mapping->private_lock);
-	if (ret) {
-		/*
-		 * If the filesystem writes its buffers by hand (eg ext3)
-		 * then we can have clean buffers against a dirty page.  We
-		 * clean the page here; otherwise later reattachment of buffers
-		 * could encounter a non-uptodate page, which is unresolvable.
-		 * This only applies in the rare case where try_to_free_buffers
-		 * succeeds but the page is not freed.
-		 *
-		 * Also, during truncate, discard_buffer will have marked all
-		 * the page's buffers clean.  We discover that here and clean
-		 * the page also.
-		 */
-		if (test_clear_page_dirty(page))
-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
-	}
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0f05cab..760442f 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1245,7 +1245,7 @@ retry:
 				wait_on_page_writeback(page);
 
 			if (PageWriteback(page) ||
-					!test_clear_page_dirty(page)) {
+					!test_clear_page_dirty(page, 1)) {
 				unlock_page(page);
 				break;
 			}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1387749..da2bdb1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
 		spin_unlock(&fc->lock);
 
 		if (offset == 0 && to == PAGE_CACHE_SIZE) {
-			clear_page_dirty(page);
+			clear_page_dirty(page, 0);
 			SetPageUptodate(page);
 		}
 	}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..7b87875 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
 
 static void truncate_huge_page(struct page *page)
 {
-	clear_page_dirty(page);
+	clear_page_dirty(page, 1);
 	ClearPageUptodate(page);
 	remove_from_page_cache(page);
 	put_page(page);
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index b1a1c72..47a6b62 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1
 
 	/* Retest mp->count since we may have released page lock */
 	if (test_bit(META_discard, &mp->flag) && !mp->count) {
-		clear_page_dirty(page);
+		clear_page_dirty(page, 1);
 		ClearPageUptodate(page);
 	}
 #else
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 47e7027..a97e198 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
 				bh = next;
 			} while (bh != head);
 			if (PAGE_SIZE == bh->b_size) {
-				clear_page_dirty(page);
+				clear_page_dirty(page, 0);
 			}
 		}
 	}
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index b56eb75..d65ba84 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -343,7 +343,7 @@ xfs_start_page_writeback(
 	ASSERT(!PageWriteback(page));
 	set_page_writeback(page);
 	if (clear_dirty)
-		clear_page_dirty(page);
+		clear_page_dirty(page, 1);
 	unlock_page(page);
 	if (!buffers) {
 		end_page_writeback(page);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..175ab3c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,13 +253,13 @@ #define ClearPageUncached(page)	clear_bi
 
 struct page;	/* forward declaration */
 
-int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);
 
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int must_clean_ptes)
 {
-	test_clear_page_dirty(page);
+	test_clear_page_dirty(page, must_clean_ptes);
 }
 
 static inline void set_page_writeback(struct page *page)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..f7e0cc8 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
 * Clear a page's dirty flag, while caring for dirty memory accounting.
  * Returns true if the page was previously dirty.
  */
-int test_clear_page_dirty(struct page *page)
+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
 {
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
@@ -866,7 +866,12 @@ int test_clear_page_dirty(struct page *p
 		 * page is locked, which pins the address_space
 		 */
 		if (mapping_cap_account_dirty(mapping)) {
-			page_mkclean(page);
+			int cleaned = page_mkclean(page);
+			if (!must_clean_ptes && cleaned) {
+				WARN_ON(1);
+				set_page_dirty(page);
+			}
+
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
 		return 1;
diff --git a/mm/rmap.c b/mm/rmap.c
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..cafa843 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
-	if (test_clear_page_dirty(page))
+	if (test_clear_page_dirty(page, 1))
 		task_io_account_cancelled_write(PAGE_CACHE_SIZE);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);
@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
 					  PAGE_CACHE_SIZE, 0);
 				}
 			}
-			was_dirty = test_clear_page_dirty(page);
+			was_dirty = test_clear_page_dirty(page, 0);
 			if (!invalidate_complete_page2(mapping, page)) {
 				if (was_dirty)
 					set_page_dirty(page);


#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.20-rc1
# Sun Dec 17 01:52:12 2006
#
CONFIG_X86_32=y
CONFIG_GENERIC_TIME=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_X86=y
CONFIG_MMU=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
# CONFIG_IPC_NS is not set
# CONFIG_POSIX_MQUEUE is not set
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_UTS_NS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
# CONFIG_IKCONFIG_PROC is not set
# CONFIG_CPUSETS is not set
# CONFIG_SYSFS_DEPRECATED is not set
# CONFIG_RELAY is not set
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SHMEM=y
CONFIG_SLAB=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
# CONFIG_SLOB is not set

#
# Loadable module support
#
# CONFIG_MODULES is not set
CONFIG_STOP_MACHINE=y

#
# Block layer
#
CONFIG_BLOCK=y
# CONFIG_LBD is not set
# CONFIG_BLK_DEV_IO_TRACE is not set
# CONFIG_LSF is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"

#
# Processor type and features
#
CONFIG_SMP=y
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_NUMAQ is not set
# CONFIG_X86_SUMMIT is not set
# CONFIG_X86_BIGSMP is not set
# CONFIG_X86_VISWS is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_ES7000 is not set
# CONFIG_PARAVIRT is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
CONFIG_MPENTIUMM=y
# CONFIG_MCORE2 is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP2 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_X86_GENERIC is not set
CONFIG_X86_CMPXCHG=y
CONFIG_X86_XADD=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_TSC=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_NR_CPUS=8
# CONFIG_SCHED_SMT is not set
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_BKL=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_NONFATAL=y
CONFIG_X86_MCE_P4THERMAL=y
CONFIG_VM86=y
# CONFIG_TOSHIBA is not set
# CONFIG_I8K is not set
# CONFIG_X86_REBOOTFIXUPS is not set
# CONFIG_MICROCODE is not set
# CONFIG_X86_MSR is not set
# CONFIG_X86_CPUID is not set

#
# Firmware Drivers
#
# CONFIG_EDD is not set
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set
# CONFIG_NOHIGHMEM is not set
CONFIG_HIGHMEM4G=y
# CONFIG_HIGHMEM64G is not set
CONFIG_PAGE_OFFSET=0xC0000000
CONFIG_HIGHMEM=y
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
CONFIG_SPARSEMEM_STATIC=y
CONFIG_SPLIT_PTLOCK_CPUS=4
# CONFIG_RESOURCES_64BIT is not set
# CONFIG_HIGHPTE is not set
# CONFIG_MATH_EMULATION is not set
CONFIG_MTRR=y
# CONFIG_EFI is not set
CONFIG_IRQBALANCE=y
# CONFIG_SECCOMP is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
# CONFIG_KEXEC is not set
# CONFIG_CRASH_DUMP is not set
# CONFIG_RELOCATABLE is not set
CONFIG_PHYSICAL_ALIGN=0x100000
CONFIG_HOTPLUG_CPU=y
# CONFIG_COMPAT_VDSO is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y

#
# Power management options (ACPI, APM)
#
CONFIG_PM=y
# CONFIG_PM_LEGACY is not set
# CONFIG_PM_DEBUG is not set
# CONFIG_PM_SYSFS_DEPRECATED is not set
CONFIG_SOFTWARE_SUSPEND=y
CONFIG_PM_STD_PARTITION=""
CONFIG_SUSPEND_SMP=y

#
# ACPI (Advanced Configuration and Power Interface) Support
#
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_SLEEP_PROC_FS=y
# CONFIG_ACPI_SLEEP_PROC_SLEEP is not set
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_VIDEO=y
CONFIG_ACPI_HOTKEY=y
CONFIG_ACPI_FAN=y
# CONFIG_ACPI_DOCK is not set
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ACPI_THERMAL=y
# CONFIG_ACPI_ASUS is not set
# CONFIG_ACPI_IBM is not set
# CONFIG_ACPI_TOSHIBA is not set
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_EC=y
CONFIG_ACPI_POWER=y
CONFIG_ACPI_SYSTEM=y
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y

#
# APM (Advanced Power Management) BIOS Support
#
# CONFIG_APM is not set

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
# CONFIG_CPU_FREQ_DEBUG is not set
CONFIG_CPU_FREQ_STAT=y
# CONFIG_CPU_FREQ_STAT_DETAILS is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y

#
# CPUFreq processor drivers
#
CONFIG_X86_ACPI_CPUFREQ=y
# CONFIG_X86_POWERNOW_K6 is not set
# CONFIG_X86_POWERNOW_K7 is not set
# CONFIG_X86_POWERNOW_K8 is not set
# CONFIG_X86_GX_SUSPMOD is not set
CONFIG_X86_SPEEDSTEP_CENTRINO=y
CONFIG_X86_SPEEDSTEP_CENTRINO_ACPI=y
# CONFIG_X86_SPEEDSTEP_CENTRINO_TABLE is not set
CONFIG_X86_SPEEDSTEP_ICH=y
# CONFIG_X86_SPEEDSTEP_SMI is not set
# CONFIG_X86_P4_CLOCKMOD is not set
# CONFIG_X86_CPUFREQ_NFORCE2 is not set
# CONFIG_X86_LONGRUN is not set
# CONFIG_X86_LONGHAUL is not set

#
# shared options
#
# CONFIG_X86_ACPI_CPUFREQ_PROC_INTF is not set
CONFIG_X86_SPEEDSTEP_LIB=y
# CONFIG_X86_SPEEDSTEP_RELAXED_CAP_CHECK is not set

#
# Bus options (PCI, PCMCIA, EISA, MCA, ISA)
#
CONFIG_PCI=y
# CONFIG_PCI_GOBIOS is not set
# CONFIG_PCI_GOMMCONFIG is not set
# CONFIG_PCI_GODIRECT is not set
CONFIG_PCI_GOANY=y
CONFIG_PCI_BIOS=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
# CONFIG_PCIEPORTBUS is not set
CONFIG_PCI_MSI=y
# CONFIG_PCI_MULTITHREAD_PROBE is not set
# CONFIG_PCI_DEBUG is not set
# CONFIG_HT_IRQ is not set
CONFIG_ISA_DMA_API=y
# CONFIG_ISA is not set
# CONFIG_MCA is not set
# CONFIG_SCx200 is not set

#
# PCCARD (PCMCIA/CardBus) support
#
# CONFIG_PCCARD is not set

#
# PCI Hotplug Support
#
# CONFIG_HOTPLUG_PCI is not set

#
# Executable file formats
#
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_AOUT=y
CONFIG_BINFMT_MISC=y

#
# Networking
#
CONFIG_NET=y

#
# Networking options
#
# CONFIG_NETDEBUG is not set
CONFIG_PACKET=y
CONFIG_PACKET_MMAP=y
CONFIG_UNIX=y
# CONFIG_NET_KEY is not set
CONFIG_INET=y
# CONFIG_IP_MULTICAST is not set
# CONFIG_IP_ADVANCED_ROUTER is not set
CONFIG_IP_FIB_HASH=y
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
# CONFIG_ARPD is not set
# CONFIG_SYN_COOKIES is not set
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_TUNNEL is not set
# CONFIG_INET_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET_XFRM_MODE_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_BEET is not set
# CONFIG_INET_DIAG is not set
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set
# CONFIG_IPV6 is not set
# CONFIG_INET6_XFRM_TUNNEL is not set
# CONFIG_INET6_TUNNEL is not set
# CONFIG_NETWORK_SECMARK is not set
# CONFIG_NETFILTER is not set

#
# DCCP Configuration (EXPERIMENTAL)
#
# CONFIG_IP_DCCP is not set

#
# SCTP Configuration (EXPERIMENTAL)
#
# CONFIG_IP_SCTP is not set

#
# TIPC Configuration (EXPERIMENTAL)
#
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_BRIDGE is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set

#
# QoS and/or fair queueing
#
# CONFIG_NET_SCHED is not set

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_HAMRADIO is not set
# CONFIG_IRDA is not set
CONFIG_BT=y
CONFIG_BT_L2CAP=y
CONFIG_BT_SCO=y
CONFIG_BT_RFCOMM=y
CONFIG_BT_RFCOMM_TTY=y
CONFIG_BT_BNEP=y
# CONFIG_BT_BNEP_MC_FILTER is not set
# CONFIG_BT_BNEP_PROTO_FILTER is not set
CONFIG_BT_HIDP=y

#
# Bluetooth device drivers
#
CONFIG_BT_HCIUSB=y
# CONFIG_BT_HCIUSB_SCO is not set
# CONFIG_BT_HCIUART is not set
# CONFIG_BT_HCIBCM203X is not set
# CONFIG_BT_HCIBPA10X is not set
# CONFIG_BT_HCIBFUSB is not set
# CONFIG_BT_HCIVHCI is not set
# CONFIG_IEEE80211 is not set
CONFIG_WIRELESS_EXT=y

#
# Device Drivers
#

#
# Generic Driver Options
#
# CONFIG_STANDALONE is not set
# CONFIG_PREVENT_FIRMWARE_BUILD is not set
CONFIG_FW_LOADER=y
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_SYS_HYPERVISOR is not set

#
# Connector - unified userspace <-> kernelspace linker
#
# CONFIG_CONNECTOR is not set

#
# Memory Technology Devices (MTD)
#
# CONFIG_MTD is not set

#
# Parallel port support
#
# CONFIG_PARPORT is not set

#
# Plug and Play support
#
CONFIG_PNP=y
# CONFIG_PNP_DEBUG is not set

#
# Protocols
#
CONFIG_PNPACPI=y

#
# Block devices
#
CONFIG_BLK_DEV_FD=y
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=y
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_UB is not set
# CONFIG_BLK_DEV_RAM is not set
# CONFIG_BLK_DEV_INITRD is not set
# CONFIG_CDROM_PKTCDVD is not set
# CONFIG_ATA_OVER_ETH is not set

#
# Misc devices
#
# CONFIG_IBM_ASM is not set
# CONFIG_SGI_IOC4 is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_MSI_LAPTOP is not set

#
# ATA/ATAPI/MFM/RLL support
#
CONFIG_IDE=y
CONFIG_BLK_DEV_IDE=y

#
# Please see Documentation/ide.txt for help/info on IDE drives
#
# CONFIG_BLK_DEV_IDE_SATA is not set
# CONFIG_BLK_DEV_HD_IDE is not set
CONFIG_BLK_DEV_IDEDISK=y
CONFIG_IDEDISK_MULTI_MODE=y
CONFIG_BLK_DEV_IDECD=y
# CONFIG_BLK_DEV_IDETAPE is not set
# CONFIG_BLK_DEV_IDEFLOPPY is not set
CONFIG_BLK_DEV_IDESCSI=y
# CONFIG_IDE_TASK_IOCTL is not set

#
# IDE chipset support/bugfixes
#
CONFIG_IDE_GENERIC=y
# CONFIG_BLK_DEV_CMD640 is not set
# CONFIG_BLK_DEV_IDEPNP is not set
CONFIG_BLK_DEV_IDEPCI=y
CONFIG_IDEPCI_SHARE_IRQ=y
# CONFIG_BLK_DEV_OFFBOARD is not set
CONFIG_BLK_DEV_GENERIC=y
# CONFIG_BLK_DEV_OPTI621 is not set
# CONFIG_BLK_DEV_RZ1000 is not set
CONFIG_BLK_DEV_IDEDMA_PCI=y
# CONFIG_BLK_DEV_IDEDMA_FORCED is not set
CONFIG_IDEDMA_PCI_AUTO=y
# CONFIG_IDEDMA_ONLYDISK is not set
# CONFIG_BLK_DEV_AEC62XX is not set
# CONFIG_BLK_DEV_ALI15X3 is not set
# CONFIG_BLK_DEV_AMD74XX is not set
# CONFIG_BLK_DEV_ATIIXP is not set
# CONFIG_BLK_DEV_CMD64X is not set
# CONFIG_BLK_DEV_TRIFLEX is not set
# CONFIG_BLK_DEV_CY82C693 is not set
# CONFIG_BLK_DEV_CS5520 is not set
# CONFIG_BLK_DEV_CS5530 is not set
# CONFIG_BLK_DEV_CS5535 is not set
# CONFIG_BLK_DEV_HPT34X is not set
# CONFIG_BLK_DEV_HPT366 is not set
# CONFIG_BLK_DEV_JMICRON is not set
# CONFIG_BLK_DEV_SC1200 is not set
CONFIG_BLK_DEV_PIIX=y
# CONFIG_BLK_DEV_IT821X is not set
# CONFIG_BLK_DEV_NS87415 is not set
# CONFIG_BLK_DEV_PDC202XX_OLD is not set
# CONFIG_BLK_DEV_PDC202XX_NEW is not set
# CONFIG_BLK_DEV_SVWKS is not set
# CONFIG_BLK_DEV_SIIMAGE is not set
# CONFIG_BLK_DEV_SIS5513 is not set
# CONFIG_BLK_DEV_SLC90E66 is not set
# CONFIG_BLK_DEV_TRM290 is not set
# CONFIG_BLK_DEV_VIA82CXXX is not set
# CONFIG_IDE_ARM is not set
CONFIG_BLK_DEV_IDEDMA=y
# CONFIG_IDEDMA_IVB is not set
CONFIG_IDEDMA_AUTO=y
# CONFIG_BLK_DEV_HD is not set

#
# SCSI device support
#
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=y
# CONFIG_SCSI_TGT is not set
# CONFIG_SCSI_NETLINK is not set
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=y
# CONFIG_BLK_DEV_SR_VENDOR is not set
CONFIG_CHR_DEV_SG=y
# CONFIG_CHR_DEV_SCH is not set

#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
CONFIG_SCSI_MULTI_LUN=y
# CONFIG_SCSI_CONSTANTS is not set
# CONFIG_SCSI_LOGGING is not set
# CONFIG_SCSI_SCAN_ASYNC is not set

#
# SCSI Transports
#
# CONFIG_SCSI_SPI_ATTRS is not set
# CONFIG_SCSI_FC_ATTRS is not set
# CONFIG_SCSI_ISCSI_ATTRS is not set
# CONFIG_SCSI_SAS_ATTRS is not set
# CONFIG_SCSI_SAS_LIBSAS is not set

#
# SCSI low-level drivers
#
# CONFIG_ISCSI_TCP is not set
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_ACARD is not set
# CONFIG_SCSI_AACRAID is not set
# CONFIG_SCSI_AIC7XXX is not set
# CONFIG_SCSI_AIC7XXX_OLD is not set
# CONFIG_SCSI_AIC79XX is not set
# CONFIG_SCSI_AIC94XX is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
# CONFIG_SCSI_ARCMSR is not set
# CONFIG_MEGARAID_NEWGEN is not set
# CONFIG_MEGARAID_LEGACY is not set
# CONFIG_MEGARAID_SAS is not set
# CONFIG_SCSI_HPTIOP is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_STEX is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
# CONFIG_SCSI_QLA_FC is not set
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_LPFC is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_NSP32 is not set
# CONFIG_SCSI_DEBUG is not set
# CONFIG_SCSI_SRP is not set

#
# Serial ATA (prod) and Parallel ATA (experimental) drivers
#
CONFIG_ATA=y
CONFIG_SATA_AHCI=y
# CONFIG_SATA_SVW is not set
CONFIG_ATA_PIIX=y
# CONFIG_SATA_MV is not set
# CONFIG_SATA_NV is not set
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SX4 is not set
# CONFIG_SATA_SIL is not set
# CONFIG_SATA_SIL24 is not set
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set
CONFIG_SATA_INTEL_COMBINED=y
# CONFIG_PATA_ALI is not set
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CS5520 is not set
# CONFIG_PATA_CS5530 is not set
# CONFIG_PATA_CS5535 is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_ATA_GENERIC is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_OLDPIIX is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RZ1000 is not set
# CONFIG_PATA_SC1200 is not set
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_SIL680 is not set
# CONFIG_PATA_SIS is not set
# CONFIG_PATA_VIA is not set
# CONFIG_PATA_WINBOND is not set

#
# Multi-device support (RAID and LVM)
#
# CONFIG_MD is not set

#
# Fusion MPT device support
#
# CONFIG_FUSION is not set
# CONFIG_FUSION_SPI is not set
# CONFIG_FUSION_FC is not set
# CONFIG_FUSION_SAS is not set

#
# IEEE 1394 (FireWire) support
#
CONFIG_IEEE1394=y

#
# Subsystem Options
#
# CONFIG_IEEE1394_VERBOSEDEBUG is not set
# CONFIG_IEEE1394_OUI_DB is not set
# CONFIG_IEEE1394_EXTRA_CONFIG_ROMS is not set
# CONFIG_IEEE1394_EXPORT_FULL_API is not set

#
# Device Drivers
#

#
# Texas Instruments PCILynx requires I2C
#
CONFIG_IEEE1394_OHCI1394=y

#
# Protocol Drivers
#
# CONFIG_IEEE1394_VIDEO1394 is not set
CONFIG_IEEE1394_SBP2=y
# CONFIG_IEEE1394_SBP2_PHYS_DMA is not set
# CONFIG_IEEE1394_ETH1394 is not set
# CONFIG_IEEE1394_DV1394 is not set
CONFIG_IEEE1394_RAWIO=y

#
# I2O device support
#
# CONFIG_I2O is not set

#
# Network device support
#
CONFIG_NETDEVICES=y
# CONFIG_DUMMY is not set
# CONFIG_BONDING is not set
# CONFIG_EQUALIZER is not set
# CONFIG_TUN is not set
# CONFIG_NET_SB1000 is not set

#
# ARCnet devices
#
# CONFIG_ARCNET is not set

#
# PHY device support
#
# CONFIG_PHYLIB is not set

#
# Ethernet (10 or 100Mbit)
#
CONFIG_NET_ETHERNET=y
CONFIG_MII=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
# CONFIG_NET_VENDOR_3COM is not set

#
# Tulip family network device support
#
# CONFIG_NET_TULIP is not set
# CONFIG_HP100 is not set
CONFIG_NET_PCI=y
# CONFIG_PCNET32 is not set
# CONFIG_AMD8111_ETH is not set
# CONFIG_ADAPTEC_STARFIRE is not set
# CONFIG_B44 is not set
# CONFIG_FORCEDETH is not set
# CONFIG_DGRS is not set
# CONFIG_EEPRO100 is not set
CONFIG_E100=y
# CONFIG_FEALNX is not set
# CONFIG_NATSEMI is not set
# CONFIG_NE2K_PCI is not set
# CONFIG_8139CP is not set
# CONFIG_8139TOO is not set
# CONFIG_SIS900 is not set
# CONFIG_EPIC100 is not set
# CONFIG_SUNDANCE is not set
# CONFIG_TLAN is not set
# CONFIG_VIA_RHINE is not set

#
# Ethernet (1000 Mbit)
#
# CONFIG_ACENIC is not set
# CONFIG_DL2K is not set
# CONFIG_E1000 is not set
# CONFIG_NS83820 is not set
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
# CONFIG_R8169 is not set
# CONFIG_SIS190 is not set
# CONFIG_SKGE is not set
# CONFIG_SKY2 is not set
# CONFIG_SK98LIN is not set
# CONFIG_VIA_VELOCITY is not set
# CONFIG_TIGON3 is not set
# CONFIG_BNX2 is not set
# CONFIG_QLA3XXX is not set

#
# Ethernet (10000 Mbit)
#
# CONFIG_CHELSIO_T1 is not set
# CONFIG_IXGB is not set
# CONFIG_S2IO is not set
# CONFIG_MYRI10GE is not set
# CONFIG_NETXEN_NIC is not set

#
# Token Ring devices
#
# CONFIG_TR is not set

#
# Wireless LAN (non-hamradio)
#
CONFIG_NET_RADIO=y
# CONFIG_NET_WIRELESS_RTNETLINK is not set

#
# Obsolete Wireless cards support (pre-802.11)
#
# CONFIG_STRIP is not set

#
# Wireless 802.11b ISA/PCI cards support
#
# CONFIG_IPW2100 is not set
# CONFIG_IPW2200 is not set
# CONFIG_AIRO is not set
# CONFIG_HERMES is not set
# CONFIG_ATMEL is not set

#
# Prism GT/Duette 802.11(a/b/g) PCI/Cardbus support
#
# CONFIG_PRISM54 is not set
# CONFIG_USB_ZD1201 is not set
# CONFIG_HOSTAP is not set
CONFIG_NET_WIRELESS=y

#
# Wan interfaces
#
# CONFIG_WAN is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_PPP is not set
# CONFIG_SLIP is not set
# CONFIG_NET_FC is not set
# CONFIG_SHAPER is not set
# CONFIG_NETCONSOLE is not set
# CONFIG_NETPOLL is not set
# CONFIG_NET_POLL_CONTROLLER is not set

#
# ISDN subsystem
#
# CONFIG_ISDN is not set

#
# Telephony Support
#
# CONFIG_PHONE is not set

#
# Input device support
#
CONFIG_INPUT=y
# CONFIG_INPUT_FF_MEMLESS is not set

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1280
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=800
# CONFIG_INPUT_JOYDEV is not set
# CONFIG_INPUT_TSDEV is not set
# CONFIG_INPUT_EVDEV is not set
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_PCSPKR is not set
CONFIG_INPUT_WISTRON_BTNS=y
# CONFIG_INPUT_UINPUT is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
# CONFIG_SERIO_SERPORT is not set
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_GAMEPORT is not set

#
# Character devices
#
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
# CONFIG_VT_HW_CONSOLE_BINDING is not set
# CONFIG_SERIAL_NONSTANDARD is not set

#
# Serial drivers
#
# CONFIG_SERIAL_8250 is not set

#
# Non-8250 serial port support
#
# CONFIG_SERIAL_JSM is not set
CONFIG_UNIX98_PTYS=y
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=256

#
# IPMI
#
# CONFIG_IPMI_HANDLER is not set

#
# Watchdog Cards
#
# CONFIG_WATCHDOG is not set
CONFIG_HW_RANDOM=y
CONFIG_HW_RANDOM_INTEL=y
# CONFIG_HW_RANDOM_AMD is not set
# CONFIG_HW_RANDOM_GEODE is not set
# CONFIG_HW_RANDOM_VIA is not set
CONFIG_NVRAM=y
CONFIG_RTC=y
# CONFIG_DTLK is not set
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set
# CONFIG_SONYPI is not set
CONFIG_AGP=y
# CONFIG_AGP_ALI is not set
# CONFIG_AGP_ATI is not set
# CONFIG_AGP_AMD is not set
# CONFIG_AGP_AMD64 is not set
CONFIG_AGP_INTEL=y
# CONFIG_AGP_NVIDIA is not set
# CONFIG_AGP_SIS is not set
# CONFIG_AGP_SWORKS is not set
# CONFIG_AGP_VIA is not set
# CONFIG_AGP_EFFICEON is not set
CONFIG_DRM=y
# CONFIG_DRM_TDFX is not set
# CONFIG_DRM_R128 is not set
# CONFIG_DRM_RADEON is not set
# CONFIG_DRM_I810 is not set
# CONFIG_DRM_I830 is not set
CONFIG_DRM_I915=y
# CONFIG_DRM_MGA is not set
# CONFIG_DRM_SIS is not set
# CONFIG_DRM_VIA is not set
# CONFIG_DRM_SAVAGE is not set
# CONFIG_MWAVE is not set
# CONFIG_PC8736x_GPIO is not set
# CONFIG_NSC_GPIO is not set
# CONFIG_CS5535_GPIO is not set
# CONFIG_RAW_DRIVER is not set
# CONFIG_HPET is not set
# CONFIG_HANGCHECK_TIMER is not set

#
# TPM devices
#
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set

#
# I2C support
#
# CONFIG_I2C is not set

#
# SPI support
#
# CONFIG_SPI is not set
# CONFIG_SPI_MASTER is not set

#
# Dallas's 1-wire bus
#
# CONFIG_W1 is not set

#
# Hardware Monitoring support
#
# CONFIG_HWMON is not set
# CONFIG_HWMON_VID is not set

#
# Multimedia devices
#
# CONFIG_VIDEO_DEV is not set

#
# Digital Video Broadcasting Devices
#
# CONFIG_DVB is not set
# CONFIG_USB_DABUSB is not set

#
# Graphics support
#
# CONFIG_FIRMWARE_EDID is not set
CONFIG_FB=y
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_MACMODES is not set
# CONFIG_FB_BACKLIGHT is not set
CONFIG_FB_MODE_HELPERS=y
# CONFIG_FB_TILEBLITTING is not set
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_VGA16 is not set
CONFIG_FB_VESA=y
# CONFIG_FB_HGA is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
# CONFIG_FB_RIVA is not set
CONFIG_FB_I810=y
CONFIG_FB_I810_GTF=y
# CONFIG_FB_I810_I2C is not set
CONFIG_FB_INTEL=y
# CONFIG_FB_INTEL_DEBUG is not set
# CONFIG_FB_INTEL_I2C is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_CYBLA is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_GEODE is not set
# CONFIG_FB_VIRTUAL is not set

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
# CONFIG_VGACON_SOFT_SCROLLBACK is not set
CONFIG_VIDEO_SELECT=y
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=y
# CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y

#
# Logo configuration
#
# CONFIG_LOGO is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
CONFIG_BACKLIGHT_CLASS_DEVICE=y
CONFIG_BACKLIGHT_DEVICE=y
CONFIG_LCD_CLASS_DEVICE=y
CONFIG_LCD_DEVICE=y

#
# Sound
#
CONFIG_SOUND=y

#
# Advanced Linux Sound Architecture
#
CONFIG_SND=y
CONFIG_SND_TIMER=y
CONFIG_SND_PCM=y
CONFIG_SND_SEQUENCER=y
# CONFIG_SND_SEQ_DUMMY is not set
# CONFIG_SND_MIXER_OSS is not set
# CONFIG_SND_PCM_OSS is not set
# CONFIG_SND_SEQUENCER_OSS is not set
CONFIG_SND_RTCTIMER=y
CONFIG_SND_SEQ_RTCTIMER_DEFAULT=y
# CONFIG_SND_DYNAMIC_MINORS is not set
CONFIG_SND_SUPPORT_OLD_API=y
CONFIG_SND_VERBOSE_PROCFS=y
# CONFIG_SND_VERBOSE_PRINTK is not set
# CONFIG_SND_DEBUG is not set

#
# Generic devices
#
CONFIG_SND_AC97_CODEC=y
# CONFIG_SND_DUMMY is not set
# CONFIG_SND_VIRMIDI is not set
# CONFIG_SND_MTPAV is not set
# CONFIG_SND_SERIAL_U16550 is not set
# CONFIG_SND_MPU401 is not set

#
# PCI devices
#
# CONFIG_SND_AD1889 is not set
# CONFIG_SND_ALS300 is not set
# CONFIG_SND_ALS4000 is not set
# CONFIG_SND_ALI5451 is not set
# CONFIG_SND_ATIIXP is not set
# CONFIG_SND_ATIIXP_MODEM is not set
# CONFIG_SND_AU8810 is not set
# CONFIG_SND_AU8820 is not set
# CONFIG_SND_AU8830 is not set
# CONFIG_SND_AZT3328 is not set
# CONFIG_SND_BT87X is not set
# CONFIG_SND_CA0106 is not set
# CONFIG_SND_CMIPCI is not set
# CONFIG_SND_CS4281 is not set
# CONFIG_SND_CS46XX is not set
# CONFIG_SND_CS5535AUDIO is not set
# CONFIG_SND_DARLA20 is not set
# CONFIG_SND_GINA20 is not set
# CONFIG_SND_LAYLA20 is not set
# CONFIG_SND_DARLA24 is not set
# CONFIG_SND_GINA24 is not set
# CONFIG_SND_LAYLA24 is not set
# CONFIG_SND_MONA is not set
# CONFIG_SND_MIA is not set
# CONFIG_SND_ECHO3G is not set
# CONFIG_SND_INDIGO is not set
# CONFIG_SND_INDIGOIO is not set
# CONFIG_SND_INDIGODJ is not set
# CONFIG_SND_EMU10K1 is not set
# CONFIG_SND_EMU10K1X is not set
# CONFIG_SND_ENS1370 is not set
# CONFIG_SND_ENS1371 is not set
# CONFIG_SND_ES1938 is not set
# CONFIG_SND_ES1968 is not set
# CONFIG_SND_FM801 is not set
CONFIG_SND_HDA_INTEL=y
# CONFIG_SND_HDSP is not set
# CONFIG_SND_HDSPM is not set
# CONFIG_SND_ICE1712 is not set
# CONFIG_SND_ICE1724 is not set
CONFIG_SND_INTEL8X0=y
CONFIG_SND_INTEL8X0M=y
# CONFIG_SND_KORG1212 is not set
# CONFIG_SND_MAESTRO3 is not set
# CONFIG_SND_MIXART is not set
# CONFIG_SND_NM256 is not set
# CONFIG_SND_PCXHR is not set
# CONFIG_SND_RIPTIDE is not set
# CONFIG_SND_RME32 is not set
# CONFIG_SND_RME96 is not set
# CONFIG_SND_RME9652 is not set
# CONFIG_SND_SONICVIBES is not set
# CONFIG_SND_TRIDENT is not set
# CONFIG_SND_VIA82XX is not set
# CONFIG_SND_VIA82XX_MODEM is not set
# CONFIG_SND_VX222 is not set
# CONFIG_SND_YMFPCI is not set
# CONFIG_SND_AC97_POWER_SAVE is not set

#
# USB devices
#
# CONFIG_SND_USB_AUDIO is not set
# CONFIG_SND_USB_USX2Y is not set

#
# Open Sound System
#
# CONFIG_SOUND_PRIME is not set
CONFIG_AC97_BUS=y

#
# HID Devices
#
# CONFIG_HID is not set

#
# USB support
#
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=y
# CONFIG_USB_DEBUG is not set

#
# Miscellaneous USB options
#
# CONFIG_USB_DEVICEFS is not set
# CONFIG_USB_BANDWIDTH is not set
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_SUSPEND is not set
# CONFIG_USB_MULTITHREAD_PROBE is not set
# CONFIG_USB_OTG is not set

#
# USB Host Controller Drivers
#
CONFIG_USB_EHCI_HCD=y
# CONFIG_USB_EHCI_SPLIT_ISO is not set
# CONFIG_USB_EHCI_ROOT_HUB_TT is not set
# CONFIG_USB_EHCI_TT_NEWSCHED is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_OHCI_HCD is not set
CONFIG_USB_UHCI_HCD=y
# CONFIG_USB_SL811_HCD is not set

#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
# CONFIG_USB_PRINTER is not set

#
# NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support'
#

#
# may also be needed; see USB_STORAGE Help for more information
#
CONFIG_USB_STORAGE=y
# CONFIG_USB_STORAGE_DEBUG is not set
# CONFIG_USB_STORAGE_DATAFAB is not set
# CONFIG_USB_STORAGE_FREECOM is not set
# CONFIG_USB_STORAGE_ISD200 is not set
# CONFIG_USB_STORAGE_DPCM is not set
# CONFIG_USB_STORAGE_USBAT is not set
# CONFIG_USB_STORAGE_SDDR09 is not set
# CONFIG_USB_STORAGE_SDDR55 is not set
# CONFIG_USB_STORAGE_JUMPSHOT is not set
# CONFIG_USB_STORAGE_ALAUDA is not set
# CONFIG_USB_STORAGE_KARMA is not set
# CONFIG_USB_LIBUSUAL is not set

#
# USB Input Devices
#

#
# USB HID Boot Protocol drivers
#
# CONFIG_USB_KBD is not set
# CONFIG_USB_MOUSE is not set
# CONFIG_USB_AIPTEK is not set
# CONFIG_USB_WACOM is not set
# CONFIG_USB_ACECAD is not set
# CONFIG_USB_KBTAB is not set
# CONFIG_USB_POWERMATE is not set
# CONFIG_USB_TOUCHSCREEN is not set
# CONFIG_USB_YEALINK is not set
# CONFIG_USB_XPAD is not set
# CONFIG_USB_ATI_REMOTE is not set
# CONFIG_USB_ATI_REMOTE2 is not set
# CONFIG_USB_KEYSPAN_REMOTE is not set
# CONFIG_USB_APPLETOUCH is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set

#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_USBNET_MII is not set
# CONFIG_USB_USBNET is not set
# CONFIG_USB_MON is not set

#
# USB port drivers
#

#
# USB Serial Converter support
#
# CONFIG_USB_SERIAL is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_AUERSWALD is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_LED is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_PHIDGET is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set

#
# USB DSL modem support
#

#
# USB Gadget Support
#
# CONFIG_USB_GADGET is not set

#
# MMC/SD Card support
#
# CONFIG_MMC is not set

#
# LED devices
#
# CONFIG_NEW_LEDS is not set

#
# LED drivers
#

#
# LED Triggers
#

#
# InfiniBand support
#
# CONFIG_INFINIBAND is not set

#
# EDAC - error detection and reporting (RAS) (EXPERIMENTAL)
#
# CONFIG_EDAC is not set

#
# Real Time Clock
#
# CONFIG_RTC_CLASS is not set

#
# DMA Engine support
#
# CONFIG_DMA_ENGINE is not set

#
# DMA Clients
#

#
# DMA Devices
#

#
# Virtualization
#
# CONFIG_KVM is not set

#
# File systems
#
CONFIG_EXT2_FS=y
# CONFIG_EXT2_FS_XATTR is not set
# CONFIG_EXT2_FS_XIP is not set
CONFIG_EXT3_FS=y
CONFIG_EXT3_FS_XATTR=y
# CONFIG_EXT3_FS_POSIX_ACL is not set
# CONFIG_EXT3_FS_SECURITY is not set
# CONFIG_EXT4DEV_FS is not set
CONFIG_JBD=y
# CONFIG_JBD_DEBUG is not set
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
# CONFIG_FS_POSIX_ACL is not set
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_ROMFS_FS is not set
# CONFIG_INOTIFY is not set
# CONFIG_QUOTA is not set
CONFIG_DNOTIFY=y
# CONFIG_AUTOFS_FS is not set
CONFIG_AUTOFS4_FS=y
# CONFIG_FUSE_FS is not set

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_ZISOFS_FS=y
CONFIG_UDF_FS=y
CONFIG_UDF_NLS=y

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=y
CONFIG_MSDOS_FS=y
CONFIG_VFAT_FS=y
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
CONFIG_NTFS_FS=y
# CONFIG_NTFS_DEBUG is not set
# CONFIG_NTFS_RW is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
# CONFIG_TMPFS_POSIX_ACL is not set
# CONFIG_HUGETLBFS is not set
# CONFIG_HUGETLB_PAGE is not set
CONFIG_RAMFS=y
# CONFIG_CONFIGFS_FS is not set

#
# Miscellaneous filesystems
#
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_CRAMFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_SYSV_FS is not set
CONFIG_UFS_FS=y
# CONFIG_UFS_FS_WRITE is not set
# CONFIG_UFS_DEBUG is not set

#
# Network File Systems
#
# CONFIG_NFS_FS is not set
# CONFIG_NFSD is not set
# CONFIG_SMB_FS is not set
CONFIG_CIFS=y
# CONFIG_CIFS_STATS is not set
# CONFIG_CIFS_WEAK_PW_HASH is not set
# CONFIG_CIFS_XATTR is not set
# CONFIG_CIFS_DEBUG2 is not set
# CONFIG_CIFS_EXPERIMENTAL is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
# CONFIG_9P_FS is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
# CONFIG_OSF_PARTITION is not set
# CONFIG_AMIGA_PARTITION is not set
# CONFIG_ATARI_PARTITION is not set
# CONFIG_MAC_PARTITION is not set
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
# CONFIG_MINIX_SUBPARTITION is not set
# CONFIG_SOLARIS_X86_PARTITION is not set
# CONFIG_UNIXWARE_DISKLABEL is not set
# CONFIG_LDM_PARTITION is not set
# CONFIG_SGI_PARTITION is not set
# CONFIG_ULTRIX_PARTITION is not set
# CONFIG_SUN_PARTITION is not set
# CONFIG_KARMA_PARTITION is not set
# CONFIG_EFI_PARTITION is not set

#
# Native Language Support
#
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="iso8859-1"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
# CONFIG_NLS_ASCII is not set
CONFIG_NLS_ISO8859_1=y
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
# CONFIG_NLS_UTF8 is not set

#
# Distributed Lock Manager
#
# CONFIG_DLM is not set

#
# Instrumentation Support
#
# CONFIG_PROFILING is not set

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
# CONFIG_PRINTK_TIME is not set
# CONFIG_ENABLE_MUST_CHECK is not set
CONFIG_MAGIC_SYSRQ=y
# CONFIG_UNUSED_SYMBOLS is not set
# CONFIG_DEBUG_FS is not set
# CONFIG_HEADERS_CHECK is not set
CONFIG_DEBUG_KERNEL=y
CONFIG_LOG_BUF_SHIFT=14
# CONFIG_DETECT_SOFTLOCKUP is not set
# CONFIG_SCHEDSTATS is not set
# CONFIG_DEBUG_SLAB is not set
# CONFIG_DEBUG_PREEMPT is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_RWSEMS is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_PROVE_LOCKING is not set
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_DEBUG_KOBJECT is not set
# CONFIG_DEBUG_HIGHMEM is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_INFO is not set
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_LIST is not set
# CONFIG_FRAME_POINTER is not set
# CONFIG_FORCED_INLINING is not set
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_EARLY_PRINTK=y
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set

#
# Page alloc debug is incompatible with Software Suspend on i386
#
# CONFIG_DEBUG_RODATA is not set
CONFIG_4KSTACKS=y
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_MPPARSE=y
CONFIG_DOUBLEFAULT=y

#
# Security options
#
# CONFIG_KEYS is not set
# CONFIG_SECURITY is not set

#
# Cryptographic options
#
CONFIG_CRYPTO=y
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_MANAGER=y
# CONFIG_CRYPTO_HMAC is not set
# CONFIG_CRYPTO_XCBC is not set
# CONFIG_CRYPTO_NULL is not set
# CONFIG_CRYPTO_MD4 is not set
CONFIG_CRYPTO_MD5=y
# CONFIG_CRYPTO_SHA1 is not set
# CONFIG_CRYPTO_SHA256 is not set
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_WP512 is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_GF128MUL is not set
# CONFIG_CRYPTO_ECB is not set
# CONFIG_CRYPTO_CBC is not set
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_DES is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_586 is not set
# CONFIG_CRYPTO_SERPENT is not set
CONFIG_CRYPTO_AES=y
CONFIG_CRYPTO_AES_586=y
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST6 is not set
# CONFIG_CRYPTO_TEA is not set
CONFIG_CRYPTO_ARC4=y
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_ANUBIS is not set
CONFIG_CRYPTO_DEFLATE=y
CONFIG_CRYPTO_MICHAEL_MIC=y
# CONFIG_CRYPTO_CRC32C is not set

#
# Hardware crypto devices
#
# CONFIG_CRYPTO_DEV_PADLOCK is not set
# CONFIG_CRYPTO_DEV_GEODE is not set

#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_CRC_CCITT=y
CONFIG_CRC16=y
CONFIG_CRC32=y
CONFIG_LIBCRC32C=y
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_PLIST=y
CONFIG_IOMAP_COPY=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_X86_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_X86_TRAMPOLINE=y
CONFIG_KTIME_SCALAR=y



^ permalink raw reply related	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 19:04               ` Andrei Popa
  2006-12-18 19:10                 ` Peter Zijlstra
@ 2006-12-18 19:18                 ` Linus Torvalds
  2006-12-18 19:44                   ` Andrei Popa
  2006-12-19  7:38                   ` Peter Zijlstra
  1 sibling, 2 replies; 154+ messages in thread
From: Linus Torvalds @ 2006-12-18 19:18 UTC (permalink / raw)
  To: Andrei Popa
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr



On Mon, 18 Dec 2006, Andrei Popa wrote:
> 
> I applied Linus's patch, Andrew's patch, and Peter Zijlstra's patches (the
> last two). The combined patch is attached. I tested and I have no corruption.

That wasn't very interesting, because you also had the patch that just 
disabled "page_mkclean_one()" entirely:

> diff --git a/mm/rmap.c b/mm/rmap.c
> index d8a842a..3f9061e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page 
>  		goto unlock;
>  
>  	entry = ptep_get_and_clear(mm, address, pte);
> -	entry = pte_mkclean(entry);
> +	/*entry = pte_mkclean(entry);*/
>  	entry = pte_wrprotect(entry);
>  	ptep_establish(vma, address, pte, entry);
>  	lazy_mmu_prot_update(entry);

The above patch is bad. It's always going to hide the bug, but it hides it 
by just not doing anything at all. So any patch combination that contains 
that patch will probably _always_ fix your problem, but it won't be an 
interesting patch..

So can you remove that small fragment? Also, it would be nice if you added 
the WARN_ON() to this sequence in mm/page-writeback.c:

+                       if (!must_clean_ptes && cleaned)
+                               set_page_dirty(page);

just make it do a WARN_ON() if this ever triggers.

Then, IF the corruption is gone, we'd love to see the WARN_ON results..

		Linus
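
[A minimal userspace sketch of the instrumentation requested above, not
the actual kernel change. It assumes the patched test_clear_page_dirty()
from the unified diff in this thread. The struct page fields, the
warn_count counter and the WARN_ON() macro below are hypothetical
stand-ins so that the logic can be compiled and run outside the kernel.]

```c
#include <stdio.h>

/* Hypothetical stand-in for the kernel's struct page. */
struct page {
	int dirty;	/* models the PG_dirty page flag */
	int pte_dirty;	/* models "some PTE mapping this page is dirty" */
};

static int warn_count;	/* how many times the warning fired (illustration only) */

/* Userspace imitation; the kernel's WARN_ON() also dumps a stack trace. */
#define WARN_ON(cond)							\
	do {								\
		if (cond) {						\
			warn_count++;					\
			fprintf(stderr, "WARNING: at %s:%d\n",		\
				__FILE__, __LINE__);			\
		}							\
	} while (0)

/* Model of page_mkclean(): returns 1 if a dirty PTE was cleaned. */
static int page_mkclean(struct page *page)
{
	int cleaned = page->pte_dirty;

	page->pte_dirty = 0;
	return cleaned;
}

static void set_page_dirty(struct page *page)
{
	page->dirty = 1;
}

/* Model of the patched test_clear_page_dirty() with the warning added. */
static int test_clear_page_dirty(struct page *page, int must_clean_ptes)
{
	int cleaned;

	if (!page->dirty)
		return 0;
	page->dirty = 0;

	cleaned = page_mkclean(page);
	if (!must_clean_ptes && cleaned) {
		WARN_ON(1);		/* report every caller that hits this */
		set_page_dirty(page);	/* keep the page dirty, as in the patch */
	}
	return 1;
}
```

[In the kernel the stack trace printed by WARN_ON() would identify which
caller reached this path, which is the information asked for here.]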

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 19:04               ` Andrei Popa
@ 2006-12-18 19:10                 ` Peter Zijlstra
  2006-12-18 19:18                 ` Linus Torvalds
  1 sibling, 0 replies; 154+ messages in thread
From: Peter Zijlstra @ 2006-12-18 19:10 UTC (permalink / raw)
  To: andrei.popa
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 21:04 +0200, Andrei Popa wrote:

> diff --git a/mm/rmap.c b/mm/rmap.c
> index d8a842a..3f9061e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page 
>  		goto unlock;
>  
>  	entry = ptep_get_and_clear(mm, address, pte);
> -	entry = pte_mkclean(entry);
> +	/*entry = pte_mkclean(entry);*/
>  	entry = pte_wrprotect(entry);
>  	ptep_establish(vma, address, pte, entry);
>  	lazy_mmu_prot_update(entry);

Please drop this chunk; it will always make the problem go away.
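
[A small userspace model of what the quoted hunk changes, offered only
as an illustration. The flag values follow i386's _PAGE_RW and
_PAGE_DIRTY, pte_t is simplified to a plain integer, and the two
mkclean_one*() helpers are hypothetical names, not kernel functions.
With pte_mkclean() commented out, the entry is still write-protected
but its dirty bit is never cleared.]

```c
#include <stdint.h>

/* Illustrative flag values modeled on i386's _PAGE_RW and _PAGE_DIRTY. */
#define PTE_RW		0x002u
#define PTE_DIRTY	0x040u

typedef uint32_t pte_t;	/* simplified; the kernel wraps this in a struct */

static pte_t pte_mkclean(pte_t pte)
{
	return pte & ~PTE_DIRTY;	/* clear the dirty bit */
}

static pte_t pte_wrprotect(pte_t pte)
{
	return pte & ~PTE_RW;		/* clear the writable bit */
}

/* The sequence page_mkclean_one() applies to the entry before the hunk. */
static pte_t mkclean_one(pte_t entry)
{
	entry = pte_mkclean(entry);
	entry = pte_wrprotect(entry);
	return entry;
}

/* The same sequence with the debugging hunk applied. */
static pte_t mkclean_one_disabled(pte_t entry)
{
	/* entry = pte_mkclean(entry); */
	entry = pte_wrprotect(entry);
	return entry;
}
```

[For an entry that starts out writable and dirty, mkclean_one() clears
both bits, while mkclean_one_disabled() leaves PTE_DIRTY set. That is
why any patch combination containing the hunk says little about the
other patches.]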



^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 18:35             ` Linus Torvalds
@ 2006-12-18 19:04               ` Andrei Popa
  2006-12-18 19:10                 ` Peter Zijlstra
  2006-12-18 19:18                 ` Linus Torvalds
  0 siblings, 2 replies; 154+ messages in thread
From: Andrei Popa @ 2006-12-18 19:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr


> (On that note: Andrei - if you do test this out, I'd suggest applying my 
> patch too - the one that you already tested. It won't apply cleanly on top 
> of Andrew's patch, but it should be trivial to apply by hand, since you 
> really just want to remove the whole "if (ret) {...}" sequence. I realize 
> that it didn't make any difference for you, but applying that patch is 
> probably a good idea just to remove the noise for a codepath that you 
> already showed to not matter)


I applied Linus's patch, Andrew's patch, and Peter Zijlstra's patches (the
last two). The combined patch is attached. I tested and I have no corruption.


diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageDirty(page) || PageWriteback(page))
 		return 0;
 
 	if (mapping == NULL) {		/* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
 	spin_unlock(&mapping->private_lock);
-	if (ret) {
-		/*
-		 * If the filesystem writes its buffers by hand (eg ext3)
-		 * then we can have clean buffers against a dirty page.  We
-		 * clean the page here; otherwise later reattachment of buffers
-		 * could encounter a non-uptodate page, which is unresolvable.
-		 * This only applies in the rare case where try_to_free_buffers
-		 * succeeds but the page is not freed.
-		 *
-		 * Also, during truncate, discard_buffer will have marked all
-		 * the page's buffers clean.  We discover that here and clean
-		 * the page also.
-		 */
-		if (test_clear_page_dirty(page))
-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
-	}
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0f05cab..760442f 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1245,7 +1245,7 @@ retry:
 				wait_on_page_writeback(page);
 
 			if (PageWriteback(page) ||
-					!test_clear_page_dirty(page)) {
+					!test_clear_page_dirty(page, 1)) {
 				unlock_page(page);
 				break;
 			}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1387749..da2bdb1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
 		spin_unlock(&fc->lock);
 
 		if (offset == 0 && to == PAGE_CACHE_SIZE) {
-			clear_page_dirty(page);
+			clear_page_dirty(page, 0);
 			SetPageUptodate(page);
 		}
 	}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..7b87875 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
 
 static void truncate_huge_page(struct page *page)
 {
-	clear_page_dirty(page);
+	clear_page_dirty(page, 1);
 	ClearPageUptodate(page);
 	remove_from_page_cache(page);
 	put_page(page);
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index b1a1c72..47a6b62 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1
 
 	/* Retest mp->count since we may have released page lock */
 	if (test_bit(META_discard, &mp->flag) && !mp->count) {
-		clear_page_dirty(page);
+		clear_page_dirty(page, 1);
 		ClearPageUptodate(page);
 	}
 #else
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 47e7027..a97e198 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
 				bh = next;
 			} while (bh != head);
 			if (PAGE_SIZE == bh->b_size) {
-				clear_page_dirty(page);
+				clear_page_dirty(page, 0);
 			}
 		}
 	}
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index b56eb75..d65ba84 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -343,7 +343,7 @@ xfs_start_page_writeback(
 	ASSERT(!PageWriteback(page));
 	set_page_writeback(page);
 	if (clear_dirty)
-		clear_page_dirty(page);
+		clear_page_dirty(page, 1);
 	unlock_page(page);
 	if (!buffers) {
 		end_page_writeback(page);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..175ab3c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,13 +253,13 @@ #define ClearPageUncached(page)	clear_bi
 
 struct page;	/* forward declaration */
 
-int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);
 
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int must_clean_ptes)
 {
-	test_clear_page_dirty(page);
+	test_clear_page_dirty(page, must_clean_ptes);
 }
 
 static inline void set_page_writeback(struct page *page)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..561d702 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
  * Clear a page's dirty flag, while caring for dirty memory accounting.
  * Returns true if the page was previously dirty.
  */
-int test_clear_page_dirty(struct page *page)
+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
 {
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
@@ -866,7 +866,9 @@ int test_clear_page_dirty(struct page *p
 		 * page is locked, which pins the address_space
 		 */
 		if (mapping_cap_account_dirty(mapping)) {
-			page_mkclean(page);
+			int cleaned = page_mkclean(page);
+			if (!must_clean_ptes && cleaned)
+				set_page_dirty(page);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
 		return 1;
diff --git a/mm/rmap.c b/mm/rmap.c
index d8a842a..3f9061e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -448,7 +448,7 @@ static int page_mkclean_one(struct page 
 		goto unlock;
 
 	entry = ptep_get_and_clear(mm, address, pte);
-	entry = pte_mkclean(entry);
+	/*entry = pte_mkclean(entry);*/
 	entry = pte_wrprotect(entry);
 	ptep_establish(vma, address, pte, entry);
 	lazy_mmu_prot_update(entry);
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..cafa843 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
-	if (test_clear_page_dirty(page))
+	if (test_clear_page_dirty(page, 1))
 		task_io_account_cancelled_write(PAGE_CACHE_SIZE);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);
@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
 					  PAGE_CACHE_SIZE, 0);
 				}
 			}
-			was_dirty = test_clear_page_dirty(page);
+			was_dirty = test_clear_page_dirty(page, 0);
 			if (!invalidate_complete_page2(mapping, page)) {
 				if (was_dirty)
 					set_page_dirty(page);

> 
> > I'm guessing that if we do the WARN_ON() some folks might get a lot of 
> > output, perhaps WARN_ON_ONCE() ?
> 
> Well, I'd rather get lots of noise to see all the paths that can cause 
> this. We've been concentrating mainly on one (try_to_free_buffers()), but 
> that one was already shown not to matter or at least not to be the _whole_ 
> issue, so..
> 
> 		Linus


^ permalink raw reply related	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 18:24           ` Peter Zijlstra
@ 2006-12-18 18:35             ` Linus Torvalds
  2006-12-18 19:04               ` Andrei Popa
  0 siblings, 1 reply; 154+ messages in thread
From: Linus Torvalds @ 2006-12-18 18:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, andrei.popa, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr



On Mon, 18 Dec 2006, Peter Zijlstra wrote:
> > 
> > Or maybe the WARN_ON() just points out _why_ somebody would want to do 
> > something this insane. Right now I just can't see why it's a valid thing 
> > to do.
> 
> Maybe, but I think Nick's mail here:
>   http://lkml.org/lkml/2006/12/18/59
> 
> shows a trace like that.

Sure, but I actually think that "try_to_free_buffers()" was buggy in the 
first place, shouldn't have done what it did at all (it has NO business 
clearing dirty data), and should be fixed with my other simple and clean 
patch that just removes the crap.

But sadly, Andrei said that he still saw data corruption, which implies 
that the problem had nothing to do with "try_to_free_buffers()" at all.

(On that note: Andrei - if you do test this out, I'd suggest applying my 
patch too - the one that you already tested. It won't apply cleanly on top 
of Andrew's patch, but it should be trivial to apply by hand, since you 
really just want to remove the whole "if (ret) {...}" sequence. I realize 
that it didn't make any difference for you, but applying that patch is 
probably a good idea just to remove the noise for a codepath that you 
already showed to not matter)

> I'm guessing that if we do the WARN_ON() some folks might get a lot of 
> output, perhaps WARN_ON_ONCE() ?

Well, I'd rather get lots of noise to see all the paths that can cause 
this. We've been concentrating mainly on one (try_to_free_buffers()), but 
that one was already shown not to matter or at least not to be the _whole_ 
issue, so..

		Linus

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 18:03         ` Linus Torvalds
@ 2006-12-18 18:24           ` Peter Zijlstra
  2006-12-18 18:35             ` Linus Torvalds
  2006-12-19  4:36           ` Nick Piggin
  1 sibling, 1 reply; 154+ messages in thread
From: Peter Zijlstra @ 2006-12-18 18:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, andrei.popa, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 10:03 -0800, Linus Torvalds wrote:
> Andrei,
>  could you try Peter's patch (on top of Andrew's patch - it depends on
> it, and wouldn't work on an unmodified -git kernel), but add the WARN_ON()
> I mention in this email? You seem to be able to reproduce this easily.
> Thanks.

I finally beat yum into submission and I hope to have rtorrent compiled
shortly.

> On Mon, 18 Dec 2006, Peter Zijlstra wrote:
> > 
> > This should be safe; page_mkclean walks the rmap and flips the pte's
> > under the pte lock and records the dirty state while iterating.
> > Concurrent faults will either do set_page_dirty() before we get around
> > to doing it or vice versa, but dirty state is not lost.
> 
> Ok, I really liked this patch, but the more I thought about it, the more I 
> started to doubt the reasons for liking it.
> 
> I think we have some core fundamental problem here that this patch is 
> needed at all.

I agree, but I suspect this is like the buffered write deadlock Nick is
working on, in that it will require some proper filesystem surgery to
get right. Having the kernel working in the meantime has my
preference ;-)

> So let's think about this: we apparently have two cases of 
> "clear_page_dirty()":
> 
>  - the one that really wants to clear the bit unconditionally (Andrew 
>    calls this the "must_clean_ptes" case, which I personally find to be a 
>    really confusing name, but whatever)

I'm probably worse with names so I'm not even going to try and fix that.

>  - the other case. The case that doesn't want to really clear the pte 
>    dirty bits.
> 
> and I thought your patch made sense, because it saved away the pte state 
> in the page dirty state, and that matches my mental model, but the more I 
> think about it, the less sense that whole "the other case" situation makes 
> AT ALL.
>
> Why does "the other case" exist at all? If you want to clear the dirty 
> page flag, what is _ever_ the reason for not wanting to drop PTE dirty 
> information? In other words, what possible reason can there ever be for 
> saying "I want this page to be clean", while at the same time saying "but 
> if it was dirty in the page tables, don't forget about that state".

I have tried to get my head around this, and have so far failed. Andrew's
mail with the patch (great-grandparent to this mail) was the one that
made most sense explaining it afaics.

> So I absolutely detested Andrew's original patch, because that one made 
> zero sense at all even from a code standpoint. With your patch on top, it 
> all suddenly makes sense: at least you don't just leave dirty pages in the 
> PTE's with a "struct page" that is marked clean, and the end result is 
> undeniably at least _consistent_.
> 
> So Andrew's patch I can't stand, because the whole point of it seems to be 
> to leave the system in an inconsistent state (dirty in the pte's but 
> marked "clean"), and if we want to have that state, then we should just 
> revert _everything_ to the 2.6.18 situation, and not play these games at 
> all.
> 
> Andrew's patch with your patch on top makes me happy, because now we're 
> at least honoring all the basic rules (we don't get into an inconsistent 
> state), so on a local level it all makes sense. HOWEVER, I then don't 
> actually understand how it could ever actually make sense to ask for 
> "please clean the page, but don't actually clean it".

Somehow it loses track of actual page content dirtiness when it does
the page buffer game.

Is this because page buffers are used to do sub-page sized writes
without RMW cycles?

Can't this case be avoided when the page is mapped, since at that
point the whole page will be resident anyway?

> So _I_ think that we should add a honking huge WARN_ON() for this case. 
> Ie, do your patch, but instead of re-dirtying the page:
> 
> +                       if (!must_clean_ptes && cleaned)
> +                               set_page_dirty(page);
> 
> we would do
> 
> +                       if (!must_clean_ptes && cleaned) {
> +                               WARN_ON(1);
> +                               set_page_dirty(page);
> +                       }
> 
> and ask the people who see this problem to see if they get the WARN_ON() 
> (assuming it _fixes_ their data corruption).
> 
> Because whoever calls "clear_page_dirty()" without actually wanting to 
> clean the PTE's is really a bug: those dirty PTE's had better not exist.
> 
> Or maybe the WARN_ON() just points out _why_ somebody would want to do 
> something this insane. Right now I just can't see why it's a valid thing 
> to do.

Maybe, but I think Nick's mail here:
  http://lkml.org/lkml/2006/12/18/59

shows a trace like that. I'm guessing that if we do the WARN_ON() some
folks might get a lot of output, perhaps WARN_ON_ONCE() ?


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 16:55       ` Peter Zijlstra
@ 2006-12-18 18:03         ` Linus Torvalds
  2006-12-18 18:24           ` Peter Zijlstra
  2006-12-19  4:36           ` Nick Piggin
  0 siblings, 2 replies; 154+ messages in thread
From: Linus Torvalds @ 2006-12-18 18:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, andrei.popa, Linux Kernel Mailing List,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr


Andrei,
 could you try Peter's patch (on top of Andrew's patch - it depends on
it, and wouldn't work on an unmodified -git kernel), but add the WARN_ON()
I mention in this email? You seem to be able to reproduce this easily.
Thanks.

On Mon, 18 Dec 2006, Peter Zijlstra wrote:
> 
> This should be safe; page_mkclean walks the rmap and flips the pte's
> under the pte lock and records the dirty state while iterating.
> Concurrent faults will either do set_page_dirty() before we get around
> to doing it or vice versa, but dirty state is not lost.

Ok, I really liked this patch, but the more I thought about it, the more I 
started to doubt the reasons for liking it.

I think we have some core fundamental problem here that this patch is 
needed at all.

So let's think about this: we apparently have two cases of 
"clear_page_dirty()":

 - the one that really wants to clear the bit unconditionally (Andrew 
   calls this the "must_clean_ptes" case, which I personally find to be a 
   really confusing name, but whatever)

 - the other case. The case that doesn't want to really clear the pte 
   dirty bits.

and I thought your patch made sense, because it saved away the pte state 
in the page dirty state, and that matches my mental model, but the more I 
think about it, the less sense that whole "the other case" situation makes 
AT ALL.

Why does "the other case" exist at all? If you want to clear the dirty 
page flag, what is _ever_ the reason for not wanting to drop PTE dirty 
information? In other words, what possible reason can there ever be for 
saying "I want this page to be clean", while at the same time saying "but 
if it was dirty in the page tables, don't forget about that state".

So I absolutely detested Andrew's original patch, because that one made 
zero sense at all even from a code standpoint. With your patch on top, it 
all suddenly makes sense: at least you don't just leave dirty pages in the 
PTE's with a "struct page" that is marked clean, and the end result is 
undeniably at least _consistent_.

So Andrew's patch I can't stand, because the whole point of it seems to be 
to leave the system in an inconsistent state (dirty in the pte's but 
marked "clean"), and if we want to have that state, then we should just 
revert _everything_ to the 2.6.18 situation, and not play these games at 
all.

Andrew's patch with your patch on top makes me happy, because now we're 
at least honoring all the basic rules (we don't get into an inconsistent 
state), so on a local level it all makes sense. HOWEVER, I then don't 
actually understand how it could ever actually make sense to ask for 
"please clean the page, but don't actually clean it".

So _I_ think that we should add a honking huge WARN_ON() for this case. 
Ie, do your patch, but instead of re-dirtying the page:

+                       if (!must_clean_ptes && cleaned)
+                               set_page_dirty(page);

we would do

+                       if (!must_clean_ptes && cleaned) {
+                               WARN_ON(1);
+                               set_page_dirty(page);
+                       }

and ask the people who see this problem to see if they get the WARN_ON() 
(assuming it _fixes_ their data corruption).

Because whoever calls "clear_page_dirty()" without actually wanting to 
clean the PTE's is really a bug: those dirty PTE's had better not exist.

Or maybe the WARN_ON() just points out _why_ somebody would want to do 
something this insane. Right now I just can't see why it's a valid thing 
to do.

Maybe I'm still confused. 

		Linus

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-17 23:40     ` Andrew Morton
  2006-12-18  1:02       ` Linus Torvalds
  2006-12-18  1:22       ` Linus Torvalds
@ 2006-12-18 16:55       ` Peter Zijlstra
  2006-12-18 18:03         ` Linus Torvalds
  2 siblings, 1 reply; 154+ messages in thread
From: Peter Zijlstra @ 2006-12-18 16:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: andrei.popa, Linux Kernel Mailing List, Hugh Dickins,
	Linus Torvalds, Florian Weimer, Marc Haber, Martin Michlmayr

On Sun, 2006-12-17 at 15:40 -0800, Andrew Morton wrote:
> On Sun, 17 Dec 2006 15:39:32 +0200
> Andrei Popa <andrei.popa@i-neo.ro> wrote:
> 
> > I was mistaken, I'm still having file corruption with rtorrent.
> > 
> 
> Well I'm not very optimistic, but if people could try this, please...
> 
> 
> 
> From: Andrew Morton <akpm@osdl.org>
> 
> try_to_free_buffers() clears the page's dirty state if it successfully removed
> the page's buffers.
> 
>   Background for this:
> 
>   - a process does a one-byte-write to a file on a 64k pagesize, 4k
>     blocksize ext3 filesystem.  The page is now PageDirty, !PageUptodate and
>     has one dirty buffer and 15 not uptodate buffers.
> 
>   - kjournald writes the dirty buffer.  The page is now PageDirty,
>     !PageUptodate and has a mix of clean and not uptodate buffers.
> 
>   - try_to_free_buffers() removes the page's buffers.  It MUST now clear
>     PageDirty.  If we were to leave the page dirty then we'd have a dirty, not
>     uptodate page with no buffer_heads.
> 
>     We're screwed: we cannot write the page because we don't know which
>     sections of it contain garbage.  We cannot read the page because we don't
>     know which sections of it contain modified data.  We cannot free the page
>     because it is dirty.
> 

How about we stick something like this on top of that patch. It should
preserve the dirty state as required.

I tried to tinker with avoiding the clear/set thing but could not
convince myself it was close to safe.

This should be safe; page_mkclean walks the rmap and flips the pte's
under the pte lock and records the dirty state while iterating.
Concurrent faults will either do set_page_dirty() before we get around
to doing it or vice versa, but dirty state is not lost.

---
 mm/page-writeback.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux-2.6-git/mm/page-writeback.c
===================================================================
--- linux-2.6-git.orig/mm/page-writeback.c	2006-12-18 17:24:41.000000000 +0100
+++ linux-2.6-git/mm/page-writeback.c	2006-12-18 17:26:56.000000000 +0100
@@ -872,8 +872,9 @@ int test_clear_page_dirty(struct page *p
 		 * page is locked, which pins the address_space
 		 */
 		if (mapping_cap_account_dirty(mapping)) {
-			if (must_clean_ptes)
-				page_mkclean(page);
+			int cleaned = page_mkclean(page);
+			if (!must_clean_ptes && cleaned)
+				set_page_dirty(page);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
 		return 1;
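
The intended effect of the hunk above can be modelled in user space. This
is only a minimal sketch under stated assumptions: struct model_page,
model_page_mkclean() and model_test_clear_page_dirty() are hypothetical
stand-ins for struct page, page_mkclean() and test_clear_page_dirty(). The
model tracks two booleans and ignores locking, write-protection of the
PTEs and the NR_FILE_DIRTY accounting.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a page-cache page; not the kernel's struct page. */
struct model_page {
	bool pg_dirty;  /* models PG_dirty in struct page */
	bool pte_dirty; /* models "some PTE mapping this page is dirty" */
};

/* Models page_mkclean(): clear the PTE dirty state, report if any was set. */
static int model_page_mkclean(struct model_page *p)
{
	int was_dirty = p->pte_dirty;

	p->pte_dirty = false;
	return was_dirty;
}

/*
 * Models test_clear_page_dirty() with the hunk applied.
 * Returns 1 if the page was dirty on entry, 0 otherwise.
 */
int model_test_clear_page_dirty(struct model_page *p, int must_clean_ptes)
{
	int cleaned;

	if (!p->pg_dirty)
		return 0;
	p->pg_dirty = false;

	cleaned = model_page_mkclean(p);
	if (!must_clean_ptes && cleaned)
		p->pg_dirty = true; /* models set_page_dirty() */
	return 1;
}
```

With must_clean_ptes == 0, a dirty PTE is folded back into the page's own
dirty flag instead of being dropped, so the page stays dirty. With
must_clean_ptes == 1, both flags end up clear.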



^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 15:32                             ` Peter Zijlstra
@ 2006-12-18 15:47                               ` Gene Heskett
  0 siblings, 0 replies; 154+ messages in thread
From: Gene Heskett @ 2006-12-18 15:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, andrei.popa, Andrew Morton, Linus Torvalds,
	Nick Piggin, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

On Monday 18 December 2006 10:32, Peter Zijlstra wrote:
[...]
>>
>> I've not run a torrent app here recently.  Should this patch be
>> applied to a plain 2.6.20-rc1 before I do run Azureus or similar apps?
>
>depends on what the blue frog does, if it uses MAP_SHARED like rtorrent
>does then yeah, probably. This patch really should not be the final one,
>I'm currently still trying to wrap my head around the issue. That said,
>it should be safe to use :-)
>
Thanks, I'll do it.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2006 by Maurice Eugene Heskett, all rights reserved.

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 15:24                           ` Gene Heskett
@ 2006-12-18 15:32                             ` Peter Zijlstra
  2006-12-18 15:47                               ` Gene Heskett
  0 siblings, 1 reply; 154+ messages in thread
From: Peter Zijlstra @ 2006-12-18 15:32 UTC (permalink / raw)
  To: Gene Heskett
  Cc: linux-kernel, andrei.popa, Andrew Morton, Linus Torvalds,
	Nick Piggin, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

On Mon, 2006-12-18 at 10:24 -0500, Gene Heskett wrote:
> On Monday 18 December 2006 05:49, Andrei Popa wrote:
> >> OK, I'll try this on an ext3 box. BTW, what data mode are you using
> >> ext3 in?
> >
> >ordered
> >
> >> Also, for testing's sake, could you give this a go:
> >> It's a total hack but I guess worth testing.
> >>
> >> ---
> >>  mm/rmap.c |    2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> Index: linux-2.6-git/mm/rmap.c
> >> ===================================================================
> >> --- linux-2.6-git.orig/mm/rmap.c	2006-12-18 11:06:29.000000000 +0100
> >> +++ linux-2.6-git/mm/rmap.c	2006-12-18 11:07:16.000000000 +0100
> >> @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page
> >>  		goto unlock;
> >>
> >>  	entry = ptep_get_and_clear(mm, address, pte);
> >> -	entry = pte_mkclean(entry);
> >> +	/* entry = pte_mkclean(entry); */
> >>  	entry = pte_wrprotect(entry);
> >>  	ptep_establish(vma, address, pte, entry);
> >>  	lazy_mmu_prot_update(entry);
> >
> >With the latest git and this patch there is no corruption!
> >
> I've not run a torrent app here recently.  Should this patch be applied to 
> a plain 2.6.20-rc1 before I do run Azureus or similar apps?

depends on what the blue frog does, if it uses MAP_SHARED like rtorrent
does then yeah, probably. This patch really should not be the final one,
I'm currently still trying to wrap my head around the issue. That said,
it should be safe to use :-)


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 10:49                         ` Andrei Popa
@ 2006-12-18 15:24                           ` Gene Heskett
  2006-12-18 15:32                             ` Peter Zijlstra
  0 siblings, 1 reply; 154+ messages in thread
From: Gene Heskett @ 2006-12-18 15:24 UTC (permalink / raw)
  To: linux-kernel, andrei.popa
  Cc: Peter Zijlstra, Andrew Morton, Linus Torvalds, Nick Piggin,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Monday 18 December 2006 05:49, Andrei Popa wrote:
>> OK, I'll try this on an ext3 box. BTW, what data mode are you using
>> ext3 in?
>
>ordered
>
>> Also, for testing's sake, could you give this a go:
>> It's a total hack but I guess worth testing.
>>
>> ---
>>  mm/rmap.c |    2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> Index: linux-2.6-git/mm/rmap.c
>> ===================================================================
>> --- linux-2.6-git.orig/mm/rmap.c	2006-12-18 11:06:29.000000000 +0100
>> +++ linux-2.6-git/mm/rmap.c	2006-12-18 11:07:16.000000000 +0100
>> @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page
>>  		goto unlock;
>>
>>  	entry = ptep_get_and_clear(mm, address, pte);
>> -	entry = pte_mkclean(entry);
>> +	/* entry = pte_mkclean(entry); */
>>  	entry = pte_wrprotect(entry);
>>  	ptep_establish(vma, address, pte, entry);
>>  	lazy_mmu_prot_update(entry);
>
>With the latest git and this patch there is no corruption!
>
I've not run a torrent app here recently.  Should this patch be applied to 
a plain 2.6.20-rc1 before I do run Azureus or similar apps?

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2006 by Maurice Eugene Heskett, all rights reserved.

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 10:11                       ` Peter Zijlstra
@ 2006-12-18 10:49                         ` Andrei Popa
  2006-12-18 15:24                           ` Gene Heskett
  0 siblings, 1 reply; 154+ messages in thread
From: Andrei Popa @ 2006-12-18 10:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Linus Torvalds, Nick Piggin,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

> OK, I'll try this on an ext3 box. BTW, what data mode are you using ext3
> in?
> 

ordered

> 
> Also, for testing's sake, could you give this a go:
> It's a total hack but I guess worth testing.
> 
> ---
>  mm/rmap.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux-2.6-git/mm/rmap.c
> ===================================================================
> --- linux-2.6-git.orig/mm/rmap.c	2006-12-18 11:06:29.000000000 +0100
> +++ linux-2.6-git/mm/rmap.c	2006-12-18 11:07:16.000000000 +0100
> @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page 
>  		goto unlock;
>  
>  	entry = ptep_get_and_clear(mm, address, pte);
> -	entry = pte_mkclean(entry);
> +	/* entry = pte_mkclean(entry); */
>  	entry = pte_wrprotect(entry);
>  	ptep_establish(vma, address, pte, entry);
>  	lazy_mmu_prot_update(entry);
> 

With the latest git and this patch there is no corruption!




^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18 10:00                     ` Andrei Popa
@ 2006-12-18 10:11                       ` Peter Zijlstra
  2006-12-18 10:49                         ` Andrei Popa
  0 siblings, 1 reply; 154+ messages in thread
From: Peter Zijlstra @ 2006-12-18 10:11 UTC (permalink / raw)
  To: andrei.popa
  Cc: Andrew Morton, Linus Torvalds, Nick Piggin,
	Linux Kernel Mailing List, Hugh Dickins, Florian Weimer,
	Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 12:00 +0200, Andrei Popa wrote:
> On Mon, 2006-12-18 at 01:38 -0800, Andrew Morton wrote: 
> > On Mon, 18 Dec 2006 11:19:04 +0200
> > Andrei Popa <andrei.popa@i-neo.ro> wrote:
> > 
> > > 
> > > I tried latest git with the patch from this email and I still get file
> > > content corruption. If I can help you further debug the problem, tell me
> > > what to do.
> > 
> > Can you please tell us all the steps which we need to take to reproduce this?
> 
> I'm using rtorrent-0.7.0 and libtorrent-0.11.0. Just download a torrent
> with multiple files (I downloaded 84 rar files); when it finishes, it
> does a hash check, and at the end of the check it says "Hash check
> on download completion found bad chunks, consider using "safe_sync"."
> and stops, and most of the downloaded files are broken. With Peter
> Zijlstra's patch this error doesn't show, but there is still file
> corruption (although fewer files are corrupted); after the hash check,
> rtorrent downloads the bad chunks again, does another hash check, and
> all files are ok.

OK, I'll try this on an ext3 box. BTW, what data mode are you using ext3
in?


Also, for testing's sake, could you give this a go:
It's a total hack but I guess worth testing.

---
 mm/rmap.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6-git/mm/rmap.c
===================================================================
--- linux-2.6-git.orig/mm/rmap.c	2006-12-18 11:06:29.000000000 +0100
+++ linux-2.6-git/mm/rmap.c	2006-12-18 11:07:16.000000000 +0100
@@ -448,7 +448,7 @@ static int page_mkclean_one(struct page 
 		goto unlock;
 
 	entry = ptep_get_and_clear(mm, address, pte);
-	entry = pte_mkclean(entry);
+	/* entry = pte_mkclean(entry); */
 	entry = pte_wrprotect(entry);
 	ptep_establish(vma, address, pte, entry);
 	lazy_mmu_prot_update(entry);
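
What the hack changes can be illustrated with a toy PTE. This is a sketch
only: the bit values, model_pte_t and the model_* helpers are hypothetical
and do not match any real architecture's page table layout. It shows that
with pte_mkclean() commented out the entry is still write-protected but
keeps its dirty bit, so a later scan of the PTE can still see that the page
was written to.

```c
#include <assert.h>

/* Hypothetical bit layout for illustration; not a real architecture's. */
#define MODEL_PTE_WRITE 0x1u
#define MODEL_PTE_DIRTY 0x2u

typedef unsigned int model_pte_t;

/* Models pte_mkclean(): drop the dirty bit. */
static model_pte_t model_pte_mkclean(model_pte_t e)
{
	return e & ~MODEL_PTE_DIRTY;
}

/* Models pte_wrprotect(): drop write permission. */
static model_pte_t model_pte_wrprotect(model_pte_t e)
{
	return e & ~MODEL_PTE_WRITE;
}

/*
 * Models the entry update in page_mkclean_one().
 * keep_dirty != 0 corresponds to the hack (pte_mkclean() commented out).
 */
model_pte_t model_mkclean_one(model_pte_t entry, int keep_dirty)
{
	if (!keep_dirty)
		entry = model_pte_mkclean(entry);
	return model_pte_wrprotect(entry);
}
```

Calling model_mkclean_one(MODEL_PTE_WRITE | MODEL_PTE_DIRTY, 1) leaves only
MODEL_PTE_DIRTY set, while passing 0 clears both bits.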



^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  9:38                   ` Andrew Morton
@ 2006-12-18 10:00                     ` Andrei Popa
  2006-12-18 10:11                       ` Peter Zijlstra
  0 siblings, 1 reply; 154+ messages in thread
From: Andrei Popa @ 2006-12-18 10:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Nick Piggin, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

On Mon, 2006-12-18 at 01:38 -0800, Andrew Morton wrote: 
> On Mon, 18 Dec 2006 11:19:04 +0200
> Andrei Popa <andrei.popa@i-neo.ro> wrote:
> 
> > 
> > I tried latest git with the patch from this email and I still get file
> > content corruption. If I can help you further debug the problem, tell me
> > what to do.
> 
> Can you please tell us all the steps which we need to take to reproduce this?

I'm using rtorrent-0.7.0 and libtorrent-0.11.0. Just download a torrent
with multiple files (I downloaded 84 rar files); when it finishes, it
does a hash check, and at the end of the check it says "Hash check
on download completion found bad chunks, consider using "safe_sync"."
and stops, and most of the downloaded files are broken. With Peter
Zijlstra's patch this error doesn't show, but there is still file
corruption (although fewer files are corrupted); after the hash check,
rtorrent downloads the bad chunks again, does another hash check, and
all files are ok.
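
For anyone without a torrent at hand, the access pattern in question
(writes through a MAP_SHARED file mapping, which is what rtorrent is said
elsewhere in this thread to use) can be exercised with a small C helper.
This is a rough sketch and not a confirmed reproducer: the function name,
the byte pattern and the file path passed in are all made up for
illustration. Note that the read-back goes through the page cache, so by
itself it only checks that the mapping and the file view agree. Provoking
the reported corruption would presumably also need a file much larger than
RAM, or other memory pressure, so that pages get written back and
reclaimed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Arbitrary, position-dependent test pattern (illustrative only). */
static unsigned char pattern_byte(size_t i)
{
	return (unsigned char)(i * 31u + 7u);
}

/*
 * Fill `path` with a pattern through a MAP_SHARED mapping, msync() it,
 * then read the file back with read() and compare.
 * Returns 0 on match, 1 on mismatch, -1 on error. `len` must be non-zero.
 */
int mmap_write_and_verify(const char *path, size_t len)
{
	int result = -1;
	int fd;
	void *map = MAP_FAILED;
	unsigned char *bytes;
	unsigned char *buf = NULL;
	size_t i, done = 0;
	ssize_t n;

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) != 0)
		goto out;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out;
	bytes = (unsigned char *)map;

	/* Dirty every page through the mapping, not through write(). */
	for (i = 0; i < len; i++)
		bytes[i] = pattern_byte(i);

	/* Ask the kernel to write the dirty pages back synchronously. */
	if (msync(map, len, MS_SYNC) != 0)
		goto out;

	buf = (unsigned char *)malloc(len);
	if (buf == NULL)
		goto out;
	if (lseek(fd, 0, SEEK_SET) != 0)
		goto out;
	while (done < len) {
		n = read(fd, buf + done, len - done);
		if (n <= 0)
			goto out;
		done += (size_t)n;
	}

	result = 0;
	for (i = 0; i < len; i++) {
		if (buf[i] != pattern_byte(i)) {
			result = 1;
			break;
		}
	}
out:
	free(buf);
	if (map != MAP_FAILED)
		munmap(map, len);
	close(fd);
	return result;
}
```

A caller would pass a path on the filesystem under test and a length of at
least one byte, and treat any non-zero return as a failed run.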


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  9:18                   ` Andrew Morton
  2006-12-18  9:26                     ` Andrei Popa
@ 2006-12-18  9:42                     ` Nick Piggin
  1 sibling, 0 replies; 154+ messages in thread
From: Nick Piggin @ 2006-12-18  9:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, andrei.popa, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

Andrew Morton wrote:
> On Mon, 18 Dec 2006 18:22:42 +1100
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:

 >>Yes, I could believe the corruption is caused by something else
 >>completely.
 >
 >
 > Think so.  We do have a problem here, but only on threaded apps, I believe.
 > rtorrent doesn't appear to be threaded, and the bug is hit on non-preempt
 > UP.

I think (see below) that it does not apply only to threaded apps. But
it would need one of SMP or PREEMPT to trigger.


>>After try_to_free_buffers detaches the buffers from the page, a
>>pagefault can come in, and mark the pte writeable, then set_page_dirty
>>(which finds no buffers, so only sets PG_dirty).
>>
>>The page can now get dirtied through this mapping.
>>
>>try_to_free_buffers then goes on to clean the page and ptes.
> 
> 
> try_to_free_buffers() isn't called against a page which doesn't have
> buffers.  It'll oops.

Sure. But I think the race exists... I'll try spelling it out in
the conventional way:

try_to_free_buffers()
   drop_buffers() (succeeds)

** preempt here or run right-hand thread on 2nd CPU in SMP **

                                do_no_page()
                                  set_page_dirty()

                                [now modify the page via this mapping
                                (from this process or a concurrent thread)]


   clear_page_dirty() (clears PG_dirty + pte dirty, oops)
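
The interleaving above can be replayed deterministically in user space.
This is a toy model under stated assumptions, not kernel code: struct
race_page and the race_* helpers are hypothetical, the "second CPU" is just
a function call placed inside the window, and the starting state (dirty
page, clean buffers already written out) is assumed for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one page-cache page; all names here are hypothetical. */
struct race_page {
	bool has_buffers;   /* buffer_heads still attached */
	bool pg_dirty;      /* models PG_dirty in struct page */
	bool pte_dirty;     /* models the dirty bit in a user PTE */
	bool data_modified; /* memory holds data that is not on disk yet */
};

/* Left column, step 1: drop_buffers() succeeds. */
static void race_drop_buffers(struct race_page *p)
{
	p->has_buffers = false;
}

/* Right column: fault, set_page_dirty(), then a store via the mapping. */
static void race_fault_and_write(struct race_page *p)
{
	p->pg_dirty = true;      /* no buffers, so only PG_dirty is set */
	p->pte_dirty = true;     /* the store dirties the PTE */
	p->data_modified = true;
}

/* Left column, step 2: clear_page_dirty() clears PG_dirty and PTE dirty. */
static void race_clear_page_dirty(struct race_page *p)
{
	p->pg_dirty = false;
	p->pte_dirty = false;
}

/*
 * Returns true if the run ends with modified data that no dirty bit
 * records any more, i.e. data that would never be written back.
 */
bool race_dirty_state_lost(bool fault_in_window)
{
	struct race_page p = { true, true, false, false };

	race_drop_buffers(&p);
	if (fault_in_window)
		race_fault_and_write(&p);
	race_clear_page_dirty(&p);

	return p.data_modified && !p.pg_dirty && !p.pte_dirty;
}
```

In this model race_dirty_state_lost(true) reports a lost update, while
race_dirty_state_lost(false), with no fault inside the window, does not.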


-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  9:19                 ` Andrei Popa
@ 2006-12-18  9:38                   ` Andrew Morton
  2006-12-18 10:00                     ` Andrei Popa
  0 siblings, 1 reply; 154+ messages in thread
From: Andrew Morton @ 2006-12-18  9:38 UTC (permalink / raw)
  To: andrei.popa
  Cc: Linus Torvalds, Nick Piggin, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

On Mon, 18 Dec 2006 11:19:04 +0200
Andrei Popa <andrei.popa@i-neo.ro> wrote:

> 
> I tried latest git with the patch from this email and I still get file
> content corruption. If I can help you further debug the problem, tell me
> what to do.

Can you please tell us all the steps which we need to take to reproduce this?

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  7:16                 ` Andrew Morton
  2006-12-18  7:17                   ` Andrew Morton
@ 2006-12-18  9:30                   ` Nick Piggin
  1 sibling, 0 replies; 154+ messages in thread
From: Nick Piggin @ 2006-12-18  9:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, andrei.popa, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

Andrew Morton wrote:
> On Sun, 17 Dec 2006 21:50:43 -0800 (PST)
> Linus Torvalds <torvalds@osdl.org> wrote:
> 
> 
>>
>>On Mon, 18 Dec 2006, Nick Piggin wrote:
>>
>>>I can't see how that's exactly a problem -- so long as the page does not
>>>get reclaimed (it won't, because we have a ref on it) then all that matters
>>>is that the page eventually gets marked dirty.
>>
>>But the point being that "try_to_free_buffers()" marks it clean 
>>AFTERWARDS.
>>
>>So yes, the page gets marked dirty in the pte's - the hardware generally 
>>does that for us, so we don't have to worry about that part going on.
>>
>>But "try_to_free_buffers()" seems to clear those dirty bits without 
>>serializing it really any way. It just says "ok, I will now clear them". 
>>Without knowing whether the dirty bits got set before the IO that cleared 
>>the buffer head dirty bits or not.
> 
> 
> Yes, I can't see anything correct about the current behaviour.
> 
> But I'm going blue in the face here trying to feed try_to_free_buffers() a
> page_mapped(page), without success.  pagevec_strip() presumably isn't
> triggering.

I can trigger it here, with a kernel patch to call pagevec_strip
unconditionally. I am seeing it clearing pte dirty bits, which is surely
a dataloss bug.

BUG: warning at mm/page-writeback.c:862/clear_page_dirty_warn()
  [<c013f65a>] clear_page_dirty_warn+0xdb/0xdd
  [<c0174309>] try_to_free_buffers+0x6b/0x7e
  [<c01937ec>] ext3_releasepage+0x0/0x74
  [<c013bb48>] try_to_release_page+0x2c/0x40
  [<c0140925>] pagevec_strip+0x52/0x54
  [<c0141580>] shrink_active_list+0x2a0/0x3c8
  [<c0142100>] shrink_zone+0xcd/0xea
  [<c014266d>] kswapd+0x311/0x41e
  [<c012c6aa>] autoremove_wake_function+0x0/0x37
  [<c014235c>] kswapd+0x0/0x41e
  [<c012c527>] kthread+0xde/0xe2
  [<c012c449>] kthread+0x0/0xe2
  [<c010395b>] kernel_thread_helper+0x7/0x1c
  =======================

(clear_page_dirty_warn() is test_clear_page_dirty which WARN_ON()s the
result of page_mkclean)
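
As an illustration of what such a debugging helper checks, here is a minimal user-space sketch. It is not the actual debug patch: struct dbg_page, dbg_page_mkclean(), dbg_clear_page_dirty_warn() and dbg_warn_count are hypothetical names, and the model assumes that page_mkclean reports whether it found any dirty pte to clean.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical toy page: one PG_dirty flag plus a count of dirty ptes. */
struct dbg_page {
	bool pg_dirty;
	int dirty_ptes;
};

/* Number of warnings emitted so far (stands in for WARN_ON output). */
int dbg_warn_count;

/* Toy page_mkclean: cleans every pte, returns nonzero if any was dirty. */
int dbg_page_mkclean(struct dbg_page *p)
{
	int cleaned = p->dirty_ptes;

	p->dirty_ptes = 0;
	return cleaned != 0;
}

/*
 * Toy test_clear_page_dirty with a warning attached: clears PG_dirty
 * and warns whenever doing so also threw away a dirty pte.
 * Returns the previous PG_dirty state.
 */
int dbg_clear_page_dirty_warn(struct dbg_page *p)
{
	bool was_dirty = p->pg_dirty;

	p->pg_dirty = false;
	if (was_dirty && dbg_page_mkclean(p)) {
		dbg_warn_count++;
		fprintf(stderr, "warning: cleared a dirty pte\n");
	}
	return was_dirty;
}
```

In this model a warning fires only when a dirty page still had a dirty pte at the moment it was cleaned, which is the condition the trace above reports.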


> This will (at least) cause truncate to do peculiar things. 
> do_invalidatepage() runs discard_buffer() against the dirty page and will
> then expect try_to_free_buffers() to remove those buffers and then clean
> the page.  truncate_complete_page() will clean the page, but it still has
> those invalidated buffers.  We'll end up with a large number of clean,
> unused pages on the LRU, with attached buffers.  These should eventually
> get reaped, but it'll change the page aging dynamics.

This isn't so nice. I wonder if you could just ClearPageDirty before
calling try_to_free_buffers in this case, or is that too much of a
hack? Ideally I guess you want a variant that is happy to discard
dirtiness (alternatively, my proposal to redirty the page if we find
a dirty pte should also handle this).

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  9:18                   ` Andrew Morton
@ 2006-12-18  9:26                     ` Andrei Popa
  2006-12-18  9:42                     ` Nick Piggin
  1 sibling, 0 replies; 154+ messages in thread
From: Andrei Popa @ 2006-12-18  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Linus Torvalds, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr


On Mon, 2006-12-18 at 01:18 -0800, Andrew Morton wrote:
> On Mon, 18 Dec 2006 18:22:42 +1100
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> > Andrew Morton wrote:
> > > On Mon, 18 Dec 2006 15:51:52 +1100
> > > Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > 
> > > 
> > >>I think the problem Andrew identified is real.
> > > 
> > > 
> > > I don't.  In fact I don't think I described any problem (well, I tried to,
> > > but then I contradicted myself).
> > 
> > By saying that there shouldn't be any dirty ptes if there are no
> > dirty buffers? But in that case the _page_ shouldn't be dirty either,
> > so that clear_page_dirty would be redundant. But presumably it isn't.
> 
> I don't follow that.
> 
> The linkage between pte-dirtiness and buffer_heads is a bit hard to follow
> without also considering page-dirtiness.
> 
> > > Six hours here of fsx-linux plus high memory pressure on SMP on 1k
> > > blocksize ext3, mainline.  Zero failures.  It's unlikely that this testing
> > > would pass, yet people running normal workloads are able to easily trigger
> > > failures.  I suspect we're looking in the wrong place.
> > 
> > > Yes, I could believe the corruption is caused by something else
> > > completely.
> 
> Think so.  We do have a problem here, but only on threaded apps, I believe.
> rtorrent doesn't appear to be threaded, and the bug is hit on non-preempt
> UP.


ierdnac ~ # uname -a
Linux ierdnac 2.6.20-rc1 #2 SMP PREEMPT Mon Dec 18 11:01:52 EET 2006 i686 Genuine Intel(R) CPU           T2050  @ 1.60GHz GenuineIntel GNU/Linux


and the other person who had corruption with rtorrent also has SMP and
PREEMPT.


> 
> > >>The issue is the disconnect between the pte dirtiness and a filesystem
> > >>bringing buffers clean.
> > > 
> > > 
> > > Really?  The dirtying direction goes pte_dirty->PG_dirty->BH_Dirty and the
> > > cleaning direction goes !BH_Dirty->!PG_dirty->!pte_dirty.  That's pretty
> > > simple, setting aside races.
> > > 
> > > In the try_to_free_buffers case there's a large time interval between
> > > !BH_Dirty and !PG_dirty, but that shouldn't affect anything.
> > 
> > After try_to_free_buffers detaches the buffers from the page, a
> > pagefault can come in, and mark the pte writeable, then set_page_dirty
> > (which finds no buffers, so only sets PG_dirty).
> > 
> > The page can now get dirtied through this mapping.
> > 
> > try_to_free_buffers then goes on to clean the page and ptes.
> 
> try_to_free_buffers() isn't called against a page which doesn't have
> buffers.  It'll oops.
> 
> > Were you testing with preempt?
> 
> nope, just SMP.
> 


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  5:50               ` Linus Torvalds
  2006-12-18  7:16                 ` Andrew Morton
  2006-12-18  7:30                 ` Nick Piggin
@ 2006-12-18  9:19                 ` Andrei Popa
  2006-12-18  9:38                   ` Andrew Morton
  2 siblings, 1 reply; 154+ messages in thread
From: Andrei Popa @ 2006-12-18  9:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Andrew Morton, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

I tried the latest git with the patch from this email and I still get file
content corruption. If I can help you debug the problem further, tell me
what to do.

On Sun, 2006-12-17 at 21:50 -0800, Linus Torvalds wrote:
> 
> On Mon, 18 Dec 2006, Nick Piggin wrote:
> > 
> > I can't see how that's exactly a problem -- so long as the page does not
> > get reclaimed (it won't, because we have a ref on it) then all that matters
> > is that the page eventually gets marked dirty.
> 
> But the point being that "try_to_free_buffers()" marks it clean 
> AFTERWARDS.
> 
> So yes, the page gets marked dirty in the pte's - the hardware generally 
> does that for us, so we don't have to worry about that part going on.
> 
> But "try_to_free_buffers()" seems to clear those dirty bits without 
> serializing it really any way. It just says "ok, I will now clear them". 
> Without knowing whether the dirty bits got set before the IO that cleared 
> the buffer head dirty bits or not.
> 
> What is _that_ serialization? As far as I can see, the only way to 
> guarantee that to happen (since the dirty bits in the page tables will get 
> set without us ever even being notified) is that the page tables 
> themselves must simply never contain that page in a writable form at all.
> 
> And that seems to be lacking.
> 
> Anyway, I have what I consider a much simpler solution: just don't DO all 
> that crap in try_to_free_buffers() at all. I sent it out to some people 
> already, but not very widely. 
> 
> I reproduce my suggestion here for you (and maybe others too who weren't 
> cc'd in that other discussion group) to comment on..
> 
> 		Linus
> 
> ---
> 
> So I think your patch is really broken, how about this one instead?
> 
> It's really my previous patch, BUT it also adds a 
> 
> 	if (PageDirty(page) ..
> 		return 0;
> 
> case, on the assumption that since PageDirty() means that one of the 
> buffers should be dirty, there's no point in even _trying_ drop_buffers, 
> since that should just fail anyway.
> 
> Now, that assumption is obviously wrong _if_ the buffers have been cleaned 
> by something else. So in that case, we now don't remove the buffer heads, 
> but who really cares? The page will remain on the dirty list, and 
> something should be trying to write it out, but since now all the buffers 
> are clean, once that happens, there is no actual IO to happen.
> 
> Hmm? So this means that we simply don't remove the buffers early from such 
> pages, but there shouldn't be any real downside.
> 
> Now, the only question would be if the page is marked dirty _while_ this 
> is running. We do hold the page lock, but page dirtying doesn't get the 
> lock, does it? But at least we won't mark the page _clean_ when it 
> shouldn't be.. And we still are atomic wrt the actual buffer lists 
> (mapping->private_lock), so I think this should all be ok, and 
> drop_buffers() will do the right thing.
> 
> So no race possible either.
> 
> At least as far as I can see. And the patch certainly is simple.
> 
> Now the question whether this actually _fixes_ any problems does remain, 
> but I think this should be a pretty good solution if the bug really is 
> here. Andrew?
> 
> 		Linus
> 
> ----
> diff --git a/fs/buffer.c b/fs/buffer.c
> index d1f1b54..263f88e 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
>  	int ret = 0;
>  
>  	BUG_ON(!PageLocked(page));
> -	if (PageWriteback(page))
> +	if (PageDirty(page) || PageWriteback(page))
>  		return 0;
>  
>  	if (mapping == NULL) {		/* can this still happen? */
> @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
>  	spin_lock(&mapping->private_lock);
>  	ret = drop_buffers(page, &buffers_to_free);
>  	spin_unlock(&mapping->private_lock);
> -	if (ret) {
> -		/*
> -		 * If the filesystem writes its buffers by hand (eg ext3)
> -		 * then we can have clean buffers against a dirty page.  We
> -		 * clean the page here; otherwise later reattachment of buffers
> -		 * could encounter a non-uptodate page, which is unresolvable.
> -		 * This only applies in the rare case where try_to_free_buffers
> -		 * succeeds but the page is not freed.
> -		 *
> -		 * Also, during truncate, discard_buffer will have marked all
> -		 * the page's buffers clean.  We discover that here and clean
> -		 * the page also.
> -		 */
> -		if (test_clear_page_dirty(page))
> -			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
> -	}
>  out:
>  	if (buffers_to_free) {
>  		struct buffer_head *bh = buffers_to_free;
> 


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  7:22                 ` Nick Piggin
@ 2006-12-18  9:18                   ` Andrew Morton
  2006-12-18  9:26                     ` Andrei Popa
  2006-12-18  9:42                     ` Nick Piggin
  0 siblings, 2 replies; 154+ messages in thread
From: Andrew Morton @ 2006-12-18  9:18 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, andrei.popa, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

On Mon, 18 Dec 2006 18:22:42 +1100
Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Andrew Morton wrote:
> > On Mon, 18 Dec 2006 15:51:52 +1100
> > Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > 
> > 
> >>I think the problem Andrew identified is real.
> > 
> > 
> > I don't.  In fact I don't think I described any problem (well, I tried to,
> > but then I contradicted myself).
> 
> By saying that there shouldn't be any dirty ptes if there are no
> dirty buffers? But in that case the _page_ shouldn't be dirty either,
> so that clear_page_dirty would be redundant. But presumably it isn't.

I don't follow that.

The linkage between pte-dirtiness and buffer_heads is a bit hard to follow
without also considering page-dirtiness.

> > Six hours here of fsx-linux plus high memory pressure on SMP on 1k
> > blocksize ext3, mainline.  Zero failures.  It's unlikely that this testing
> > would pass, yet people running normal workloads are able to easily trigger
> > failures.  I suspect we're looking in the wrong place.
> 
> Yes, I could believe the corruption is caused by something else
> completely.

Think so.  We do have a problem here, but only on threaded apps, I believe.
rtorrent doesn't appear to be threaded, and the bug is hit on non-preempt
UP.

> >>The issue is the disconnect between the pte dirtiness and a filesystem
> >>bringing buffers clean.
> > 
> > 
> > Really?  The dirtying direction goes pte_dirty->PG_dirty->BH_Dirty and the
> > cleaning direction goes !BH_Dirty->!PG_dirty->!pte_dirty.  That's pretty
> > simple, setting aside races.
> > 
> > In the try_to_free_buffers case there's a large time interval between
> > !BH_Dirty and !PG_dirty, but that shouldn't affect anything.
> 
> After try_to_free_buffers detaches the buffers from the page, a
> pagefault can come in, and mark the pte writeable, then set_page_dirty
> (which finds no buffers, so only sets PG_dirty).
> 
> The page can now get dirtied through this mapping.
> 
> try_to_free_buffers then goes on to clean the page and ptes.

try_to_free_buffers() isn't called against a page which doesn't have
buffers.  It'll oops.

> Were you testing with preempt?

nope, just SMP.


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  5:50               ` Linus Torvalds
  2006-12-18  7:16                 ` Andrew Morton
@ 2006-12-18  7:30                 ` Nick Piggin
  2006-12-18  9:19                 ` Andrei Popa
  2 siblings, 0 replies; 154+ messages in thread
From: Nick Piggin @ 2006-12-18  7:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, andrei.popa, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

Linus Torvalds wrote:
> 
> On Mon, 18 Dec 2006, Nick Piggin wrote:
> 
>>I can't see how that's exactly a problem -- so long as the page does not
>>get reclaimed (it won't, because we have a ref on it) then all that matters
>>is that the page eventually gets marked dirty.
> 
> 
> But the point being that "try_to_free_buffers()" marks it clean 
> AFTERWARDS.

For some reason I thought you were suggesting it is a problem on its own :P

Yes I agree there is a pagefault vs ttfb race.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  5:43               ` Andrew Morton
@ 2006-12-18  7:22                 ` Nick Piggin
  2006-12-18  9:18                   ` Andrew Morton
  2006-12-19  8:51                 ` Marc Haber
  1 sibling, 1 reply; 154+ messages in thread
From: Nick Piggin @ 2006-12-18  7:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, andrei.popa, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

Andrew Morton wrote:
> On Mon, 18 Dec 2006 15:51:52 +1100
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> 
>>I think the problem Andrew identified is real.
> 
> 
> I don't.  In fact I don't think I described any problem (well, I tried to,
> but then I contradicted myself).

By saying that there shouldn't be any dirty ptes if there are no
dirty buffers? But in that case the _page_ shouldn't be dirty either,
so that clear_page_dirty would be redundant. But presumably it isn't.

> Six hours here of fsx-linux plus high memory pressure on SMP on 1k
> blocksize ext3, mainline.  Zero failures.  It's unlikely that this testing
> would pass, yet people running normal workloads are able to easily trigger
> failures.  I suspect we're looking in the wrong place.

Yes, I could believe the corruption is caused by something else
completely.

>>The issue is the disconnect between the pte dirtiness and a filesystem
>>bringing buffers clean.
> 
> 
> Really?  The dirtying direction goes pte_dirty->PG_dirty->BH_Dirty and the
> cleaning direction goes !BH_Dirty->!PG_dirty->!pte_dirty.  That's pretty
> simple, setting aside races.
> 
> In the try_to_free_buffers case there's a large time interval between
> !BH_Dirty and !PG_dirty, but that shouldn't affect anything.

After try_to_free_buffers detaches the buffers from the page, a
pagefault can come in, and mark the pte writeable, then set_page_dirty
(which finds no buffers, so only sets PG_dirty).

The page can now get dirtied through this mapping.

try_to_free_buffers then goes on to clean the page and ptes.

I really thought you were the one who identified this race, and I didn't
see where you showed it is safe.

It may be very unlikely with small SMPs, but less so with preempt. All
we have to do is preempt at spin_unlock in try_to_free_buffers AFAIKS.
Were you testing with preempt?
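
The interleaving described above can be replayed deterministically in user space. This is only a sketch under stated assumptions: struct sim_page and every sim_* function are hypothetical stand-ins, the buffers are assumed to be clean already so that drop_buffers() succeeds, and the cleaning step is modelled as clearing PG_dirty and write-protecting and cleaning the pte in one go.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy page with a single pte mapping it. */
struct sim_page {
	bool has_buffers;	/* buffer_heads attached to the page */
	bool pg_dirty;		/* PG_dirty */
	bool pte_writable;	/* pte maps the page read-write */
	bool pte_dirty;		/* dirty bit set by hardware on a store */
	bool data_pending;	/* stored data that has not reached disk */
};

/* First half of try_to_free_buffers(): detach the (clean) buffers. */
void sim_ttfb_drop_buffers(struct sim_page *p)
{
	p->has_buffers = false;
}

/* Write fault: pte becomes writable; with no buffers only PG_dirty is set. */
void sim_write_fault(struct sim_page *p)
{
	p->pte_writable = true;
	p->pg_dirty = true;
}

/* Store through the mapping: hardware sets the pte dirty bit. */
void sim_user_store(struct sim_page *p)
{
	assert(p->pte_writable);
	p->pte_dirty = true;
	p->data_pending = true;
}

/* Second half of the unpatched path: clean the page and its pte. */
void sim_ttfb_clean_page(struct sim_page *p)
{
	p->pg_dirty = false;
	p->pte_dirty = false;
	p->pte_writable = false;
}

/* Data is lost if it is pending but nothing records the page as dirty. */
bool sim_data_lost(const struct sim_page *p)
{
	return p->data_pending && !p->pg_dirty && !p->pte_dirty;
}

/*
 * Run one scenario. If fault_in_window is true, the fault and the store
 * land between the two halves of try_to_free_buffers(); otherwise they
 * happen after it has finished. Returns true if the store was lost.
 */
bool sim_run(bool fault_in_window)
{
	struct sim_page p = { .has_buffers = true };

	sim_ttfb_drop_buffers(&p);
	if (fault_in_window) {
		sim_write_fault(&p);
		sim_user_store(&p);
	}
	sim_ttfb_clean_page(&p);
	if (!fault_in_window) {
		sim_write_fault(&p);
		sim_user_store(&p);
	}
	return sim_data_lost(&p);
}
```

In this model sim_run(true) reports a lost store and sim_run(false) does not, which is the window between detaching the buffers and cleaning the page that the text describes.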

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  7:16                 ` Andrew Morton
@ 2006-12-18  7:17                   ` Andrew Morton
  2006-12-18  9:30                   ` Nick Piggin
  1 sibling, 0 replies; 154+ messages in thread
From: Andrew Morton @ 2006-12-18  7:17 UTC (permalink / raw)
  To: Linus Torvalds, Nick Piggin, andrei.popa,
	Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins,
	Florian Weimer, Marc Haber, Martin Michlmayr

On Sun, 17 Dec 2006 23:16:17 -0800
Andrew Morton <akpm@osdl.org> wrote:

> >  out:
> >  	if (buffers_to_free) {
> >  		struct buffer_head *bh = buffers_to_free;
> 
> This will (at least) cause truncate to do peculiar things. 
> do_invalidatepage() runs discard_buffer() against the dirty page and will
> then expect try_to_free_buffers() to remove those buffers and then clean
> the page.  truncate_complete_page() will clean the page, but it still has
> those invalidated buffers.  We'll end up with a large number of clean,
> unused pages on the LRU, with attached buffers.  These should eventually
> get reaped, but it'll change the page aging dynamics.

That being said, it'd be great to get this tested by someone who can
trigger this bug, please.

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  5:50               ` Linus Torvalds
@ 2006-12-18  7:16                 ` Andrew Morton
  2006-12-18  7:17                   ` Andrew Morton
  2006-12-18  9:30                   ` Nick Piggin
  2006-12-18  7:30                 ` Nick Piggin
  2006-12-18  9:19                 ` Andrei Popa
  2 siblings, 2 replies; 154+ messages in thread
From: Andrew Morton @ 2006-12-18  7:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, andrei.popa, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

On Sun, 17 Dec 2006 21:50:43 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> 
> 
> On Mon, 18 Dec 2006, Nick Piggin wrote:
> > 
> > I can't see how that's exactly a problem -- so long as the page does not
> > get reclaimed (it won't, because we have a ref on it) then all that matters
> > is that the page eventually gets marked dirty.
> 
> But the point being that "try_to_free_buffers()" marks it clean 
> AFTERWARDS.
> 
> So yes, the page gets marked dirty in the pte's - the hardware generally 
> does that for us, so we don't have to worry about that part going on.
> 
> But "try_to_free_buffers()" seems to clear those dirty bits without 
> serializing it really any way. It just says "ok, I will now clear them". 
> Without knowing whether the dirty bits got set before the IO that cleared 
> the buffer head dirty bits or not.

Yes, I can't see anything correct about the current behaviour.

But I'm going blue in the face here trying to feed try_to_free_buffers() a
page_mapped(page), without success.  pagevec_strip() presumably isn't
triggering.

> What is _that_ serialization? As far as I can see, the only way to 
> guarantee that to happen (since the dirty bits in the page tables will get 
> set without us ever even being notified) is that the page tables 
> themselves must simply never contain that page in a writable form at all.
> 
> And that seems to be lacking.
> 
> Anyway, I have what I consider a much simpler solution: just don't DO all 
> that crap in try_to_free_buffers() at all. I sent it out to some people 
> already, but not very widely. 
> 
> I reproduce my suggestion here for you (and maybe others too who weren't 
> cc'd in that other discussion group) to comment on..
>
> ...
>
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
>  	int ret = 0;
>  
>  	BUG_ON(!PageLocked(page));
> -	if (PageWriteback(page))
> +	if (PageDirty(page) || PageWriteback(page))
>  		return 0;
>  
>  	if (mapping == NULL) {		/* can this still happen? */
> @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
>  	spin_lock(&mapping->private_lock);
>  	ret = drop_buffers(page, &buffers_to_free);
>  	spin_unlock(&mapping->private_lock);
> -	if (ret) {
> -		/*
> -		 * If the filesystem writes its buffers by hand (eg ext3)
> -		 * then we can have clean buffers against a dirty page.  We
> -		 * clean the page here; otherwise later reattachment of buffers
> -		 * could encounter a non-uptodate page, which is unresolvable.
> -		 * This only applies in the rare case where try_to_free_buffers
> -		 * succeeds but the page is not freed.
> -		 *
> -		 * Also, during truncate, discard_buffer will have marked all
> -		 * the page's buffers clean.  We discover that here and clean
> -		 * the page also.
> -		 */
> -		if (test_clear_page_dirty(page))
> -			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
> -	}
>  out:
>  	if (buffers_to_free) {
>  		struct buffer_head *bh = buffers_to_free;

This will (at least) cause truncate to do peculiar things. 
do_invalidatepage() runs discard_buffer() against the dirty page and will
then expect try_to_free_buffers() to remove those buffers and then clean
the page.  truncate_complete_page() will clean the page, but it still has
those invalidated buffers.  We'll end up with a large number of clean,
unused pages on the LRU, with attached buffers.  These should eventually
get reaped, but it'll change the page aging dynamics.


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  4:51             ` Nick Piggin
  2006-12-18  5:43               ` Andrew Morton
@ 2006-12-18  5:50               ` Linus Torvalds
  2006-12-18  7:16                 ` Andrew Morton
                                   ` (2 more replies)
  1 sibling, 3 replies; 154+ messages in thread
From: Linus Torvalds @ 2006-12-18  5:50 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, andrei.popa, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr



On Mon, 18 Dec 2006, Nick Piggin wrote:
> 
> I can't see how that's exactly a problem -- so long as the page does not
> get reclaimed (it won't, because we have a ref on it) then all that matters
> is that the page eventually gets marked dirty.

But the point being that "try_to_free_buffers()" marks it clean 
AFTERWARDS.

So yes, the page gets marked dirty in the pte's - the hardware generally 
does that for us, so we don't have to worry about that part going on.

But "try_to_free_buffers()" seems to clear those dirty bits without 
serializing it really any way. It just says "ok, I will now clear them". 
Without knowing whether the dirty bits got set before the IO that cleared 
the buffer head dirty bits or not.

What is _that_ serialization? As far as I can see, the only way to 
guarantee that to happen (since the dirty bits in the page tables will get 
set without us ever even being notified) is that the page tables 
themselves must simply never contain that page in a writable form at all.

And that seems to be lacking.

Anyway, I have what I consider a much simpler solution: just don't DO all 
that crap in try_to_free_buffers() at all. I sent it out to some people 
already, but not very widely. 

I reproduce my suggestion here for you (and maybe others too who weren't 
cc'd in that other discussion group) to comment on..

		Linus

---

So I think your patch is really broken, how about this one instead?

It's really my previous patch, BUT it also adds a 

	if (PageDirty(page) ..
		return 0;

case, on the assumption that since PageDirty() means that one of the 
buffers should be dirty, there's no point in even _trying_ drop_buffers, 
since that should just fail anyway.

Now, that assumption is obviously wrong _if_ the buffers have been cleaned 
by something else. So in that case, we now don't remove the buffer heads, 
but who really cares? The page will remain on the dirty list, and 
something should be trying to write it out, but since now all the buffers 
are clean, once that happens, there is no actual IO to happen.

Hmm? So this means that we simply don't remove the buffers early from such 
pages, but there shouldn't be any real downside.

Now, the only question would be if the page is marked dirty _while_ this 
is running. We do hold the page lock, but page dirtying doesn't get the 
lock, does it? But at least we won't mark the page _clean_ when it 
shouldn't be.. And we still are atomic wrt the actual buffer lists 
(mapping->private_lock), so I think this should all be ok, and 
drop_buffers() will do the right thing.

So no race possible either.

At least as far as I can see. And the patch certainly is simple.
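
To make the intended behaviour concrete, here is a small user-space model of the control flow the patch below aims for. It is a sketch, not kernel code: struct mini_page and mini_try_to_free_buffers() are hypothetical, and the buffers_busy flag merely stands in for drop_buffers() failing.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy page carrying only the flags the patch looks at. */
struct mini_page {
	bool locked;		/* PageLocked */
	bool dirty;		/* PageDirty */
	bool writeback;		/* PageWriteback */
	bool has_buffers;	/* buffer_heads attached */
	bool buffers_busy;	/* some buffer is dirty or locked */
};

/* Returns 1 if the buffers were freed, 0 if the page was left alone. */
int mini_try_to_free_buffers(struct mini_page *page)
{
	assert(page->locked);	/* mirrors BUG_ON(!PageLocked(page)) */

	/* The check the patch adds: never touch a dirty page. */
	if (page->dirty || page->writeback)
		return 0;

	/* Stand-in for drop_buffers() refusing busy buffers. */
	if (page->buffers_busy)
		return 0;

	page->has_buffers = false;
	/* The page's dirty state is never cleared on this path. */
	return 1;
}
```

A dirty page with clean buffers keeps both its buffers and its dirty flag in this model, which is the "we simply don't remove the buffers early" trade-off described above.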

Now the question whether this actually _fixes_ any problems does remain, 
but I think this should be a pretty good solution if the bug really is 
here. Andrew?

		Linus

----
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageDirty(page) || PageWriteback(page))
 		return 0;
 
 	if (mapping == NULL) {		/* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
 	spin_unlock(&mapping->private_lock);
-	if (ret) {
-		/*
-		 * If the filesystem writes its buffers by hand (eg ext3)
-		 * then we can have clean buffers against a dirty page.  We
-		 * clean the page here; otherwise later reattachment of buffers
-		 * could encounter a non-uptodate page, which is unresolvable.
-		 * This only applies in the rare case where try_to_free_buffers
-		 * succeeds but the page is not freed.
-		 *
-		 * Also, during truncate, discard_buffer will have marked all
-		 * the page's buffers clean.  We discover that here and clean
-		 * the page also.
-		 */
-		if (test_clear_page_dirty(page))
-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
-	}
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;


^ permalink raw reply related	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  4:51             ` Nick Piggin
@ 2006-12-18  5:43               ` Andrew Morton
  2006-12-18  7:22                 ` Nick Piggin
  2006-12-19  8:51                 ` Marc Haber
  2006-12-18  5:50               ` Linus Torvalds
  1 sibling, 2 replies; 154+ messages in thread
From: Andrew Morton @ 2006-12-18  5:43 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, andrei.popa, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

On Mon, 18 Dec 2006 15:51:52 +1100
Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> I think the problem Andrew identified is real.

I don't.  In fact I don't think I described any problem (well, I tried to,
but then I contradicted myself).

Six hours here of fsx-linux plus high memory pressure on SMP on 1k
blocksize ext3, mainline.  Zero failures.  It's unlikely that this testing
would pass, yet people running normal workloads are able to easily trigger
failures.  I suspect we're looking in the wrong place.

> The issue is the disconnect between the pte dirtiness and a filesystem
> bringing buffers clean.

Really?  The dirtying direction goes pte_dirty->PG_dirty->BH_Dirty and the
cleaning direction goes !BH_Dirty->!PG_dirty->!pte_dirty.  That's pretty
simple, setting aside races.

In the try_to_free_buffers case there's a large time interval between
!BH_Dirty and !PG_dirty, but that shouldn't affect anything.

I don't think we even have a theory as to what's gone wrong yet.
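
For readers following along, the two directions described above can be written down as a toy user-space model with one page, one pte and one buffer. All names (struct toy_page and the toy_* functions) are hypothetical, and the model deliberately ignores races, as the text does.

```c
#include <stdbool.h>

/* Hypothetical toy model: one page, one pte mapping it, one buffer. */
struct toy_page {
	bool pte_dirty;	/* hardware dirty bit in the page table entry */
	bool pg_dirty;	/* PG_dirty on the page */
	bool bh_dirty;	/* BH_Dirty on the attached buffer */
};

/* A store through the mapping sets the pte dirty bit. */
void toy_user_store(struct toy_page *p)
{
	p->pte_dirty = true;
}

/* Dirtying direction: pte_dirty -> PG_dirty -> BH_Dirty. */
void toy_propagate_dirty(struct toy_page *p)
{
	if (p->pte_dirty)
		p->pg_dirty = true;
	if (p->pg_dirty)
		p->bh_dirty = true;
}

/* The filesystem writes the buffer out, leaving it clean. */
void toy_write_buffer(struct toy_page *p)
{
	p->bh_dirty = false;
}

/* Cleaning direction: !BH_Dirty -> !PG_dirty -> !pte_dirty. */
void toy_propagate_clean(struct toy_page *p)
{
	if (!p->bh_dirty)
		p->pg_dirty = false;
	if (!p->pg_dirty)
		p->pte_dirty = false;
}
```

Run strictly in the order store, propagate dirty, write, propagate clean, the model ends with all three bits clear. Note that nothing in the cleaning step can tell whether a pte dirty bit was set before or after the write.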


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  1:57           ` Linus Torvalds
@ 2006-12-18  4:51             ` Nick Piggin
  2006-12-18  5:43               ` Andrew Morton
  2006-12-18  5:50               ` Linus Torvalds
  0 siblings, 2 replies; 154+ messages in thread
From: Nick Piggin @ 2006-12-18  4:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, andrei.popa, Linux Kernel Mailing List,
	Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber,
	Martin Michlmayr

Linus Torvalds wrote:
> [ Replying to myself - a sure sign that I don't get out enough ]
> 
> On Sun, 17 Dec 2006, Linus Torvalds wrote:
> 
>>So I don't actually see any serialization at all that would keep a random 
>>page from being paged back in.
> 
> 
> We do actually serialize, but we do it _after_ the page has already been 
> mapped back. Ie we do it for the dirty page case at the end of 
> do_wp_page() and do_no_page() when we do the "set_page_dirty_balance()", 
> but that's potentially too late - we've already mapped the page read-write 
> into the address space.

I can't see how that's exactly a problem -- so long as the page does not
get reclaimed (it won't, because we have a ref on it) then all that matters
is that the page eventually gets marked dirty.

> That said, this means that only threaded apps should ever trigger any 
> problems, which would seem to make it unlikely that this is the issue.
> 
> But Andrew: I don't think it's necessarily true that 
> "try_to_free_buffers()" callers have all unmapped the page.
> 
> That seems to be true for vmscan.c (ie the shrink_page_list -> 
> try_to_release_page -> try_to_free_buffers callchain), but what about 
> the other callchains (through filesystems, or through "pagevec_strip()" or 
> similar?) That pagevec_strip() is called from shrink_active_list(), I 
> don't see that unmapping the pages..

Right. But would it really matter whether they are currently mapped or
not, given that we agree it may become mapped at any point?

I think the problem Andrew identified is real.

The issue is the disconnect between the pte dirtiness and a filesystem
bringing buffers clean. But I disagree with his fix, because we don't
actually want to just throw out that pte dirtiness information: we're
just trying to get the PG_dirty bit into synch with what the buffers are
telling us, not actually clean or dirty anything, as such.

Can we clear the page dirty bit, then run set_page_dirty afterwards, if
any dirty ptes are found?

The other thing we might be able to do is to skip doing the
clear_page_dirty if the page is uptodate. This feels more hackish but it
might be faster?
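
The first idea (clear PG_dirty, then redirty if a dirty pte turns up) might look roughly like the user-space sketch below. It is only an illustration: struct rd_page, RD_MAX_PTES and rd_sync_page_dirty() are hypothetical, the final assignment stands in for set_page_dirty(), and the model assumes the pte dirty bits themselves are left untouched.

```c
#include <stdbool.h>

#define RD_MAX_PTES 4	/* arbitrary bound for the toy model */

/* Hypothetical toy page mapped by up to RD_MAX_PTES ptes. */
struct rd_page {
	bool pg_dirty;			/* PG_dirty */
	int nr_ptes;			/* number of ptes mapping the page */
	bool pte_dirty[RD_MAX_PTES];	/* per-pte hardware dirty bits */
};

/*
 * Bring PG_dirty into sync with clean buffers without discarding pte
 * dirtiness: clear the flag, then set it again if any pte is dirty.
 * Returns the resulting PG_dirty state.
 */
bool rd_sync_page_dirty(struct rd_page *p)
{
	bool found = false;
	int i;

	p->pg_dirty = false;
	for (i = 0; i < p->nr_ptes; i++)
		if (p->pte_dirty[i])
			found = true;

	if (found)
		p->pg_dirty = true;	/* stand-in for set_page_dirty() */
	return p->pg_dirty;
}
```

In this model a page whose only dirtiness came from the buffers ends up clean, while a page with a dirty pte stays dirty and keeps its pte bit.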

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  1:29         ` Linus Torvalds
@ 2006-12-18  1:57           ` Linus Torvalds
  2006-12-18  4:51             ` Nick Piggin
  0 siblings, 1 reply; 154+ messages in thread
From: Linus Torvalds @ 2006-12-18  1:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: andrei.popa, Linux Kernel Mailing List, Peter Zijlstra,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr


[ Replying to myself - a sure sign that I don't get out enough ]

On Sun, 17 Dec 2006, Linus Torvalds wrote:
> 
> So I don't actually see any serialization at all that would keep a random 
> page from being paged back in.

We do actually serialize, but we do it _after_ the page has already been 
mapped back. Ie we do it for the dirty page case at the end of 
do_wp_page() and do_no_page() when we do the "set_page_dirty_balance()", 
but that's potentially too late - we've already mapped the page read-write 
into the address space.

That said, this means that only threaded apps should ever trigger any 
problems, which would seem to make it unlikely that this is the issue.

But Andrew: I don't think it's necessarily true that 
"try_to_free_buffers()" callers have all unmapped the page.

That seems to be true for vmscan.c (ie the shrink_page_list -> 
try_to_release_page -> try_to_free_buffers callchain), but what about 
the other callchains (through filesystems, or through "pagevec_strip()" or 
similar?) That pagevec_strip() is called from shrink_active_list(), I 
don't see that unmapping the pages..

		Linus

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-18  1:22       ` Linus Torvalds
@ 2006-12-18  1:29         ` Linus Torvalds
  2006-12-18  1:57           ` Linus Torvalds
  0 siblings, 1 reply; 154+ messages in thread
From: Linus Torvalds @ 2006-12-18  1:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: andrei.popa, Linux Kernel Mailing List, Peter Zijlstra,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr



On Sun, 17 Dec 2006, Linus Torvalds wrote:
> 
> So we should probably do a "wait_for_page()" in do_no_page()?
> 
> Or maybe only do it for write accesses (since we don't really care about 
> getting mapped readably)? If so, we need to do it in the write case of 
> do_no_page() _and_ in the do_wp_page() case. Hmm?

I think we discussed doing exactly this at some earlier time, actually, 
just to try to throttle people who do lots of page dirtying. 

Maybe we even do it somewhere, but I tried to see it, and in the normal 
"nopage()" routine we very much try to _avoid_ locking the page (ie if 
it's marked PageUptodate() we'll return it whether locked or not). Which 
is fine - especially for readers, there really isn't any reason to ever 
delay them getting access to a page just because it's locked for write-out 
or something (once it's mapped, they'll have access to it regardless of 
any locked state in the kernel anyway).

So I don't actually see any serialization at all that would keep a random 
page from being paged back in.

		Linus

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-17 23:40     ` Andrew Morton
  2006-12-18  1:02       ` Linus Torvalds
@ 2006-12-18  1:22       ` Linus Torvalds
  2006-12-18  1:29         ` Linus Torvalds
  2006-12-18 16:55       ` Peter Zijlstra
  2 siblings, 1 reply; 154+ messages in thread
From: Linus Torvalds @ 2006-12-18  1:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: andrei.popa, Linux Kernel Mailing List, Peter Zijlstra,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr



On Sun, 17 Dec 2006, Andrew Morton wrote:
> 
> From my quick reading, all callers of try_to_free_buffers() have already
> unmapped the page from pagetables, and given that the reported ext3 corruption
> happens on uniprocessor, non-preempt kernels, I doubt if this patch will fix
> things.

Hmm. One possible explanation: maybe the page actually _did_ get unmapped 
from the page tables, but got added back?

I don't think we lock the page when faulting it in (we want it to be 
uptodate, but not necessarily locked). So assuming the pageout sequence 
always _does_ follow the rule that it only does try_to_free_buffers() on 
pages that aren't mapped, what actually protects them from not becoming 
mapped (and dirtied) during that sequence?

So we should probably do a "wait_for_page()" in do_no_page()?

Or maybe only do it for write accesses (since we don't really care about 
getting mapped readably)? If so, we need to do it in the write case of 
do_no_page() _and_ in the do_wp_page() case. Hmm?
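
[ Illustrative sketch, not part of the original mail. It replays the
  interleaving described above in a single-threaded toy model. All
  race_* names are hypothetical. "Waiting" on a locked page is modelled
  as the fault simply failing, where a real kernel would sleep. ]

```c
#include <assert.h>
#include <stdbool.h>

struct race_page {
	bool locked;
	bool mapped;
	bool pte_dirty;
	bool pg_dirty;
	bool has_unwritten_data;  /* user stored data not yet on disk */
};

/*
 * Write fault. With wait_for_lock set, a locked page is not mapped.
 * Returns true if the page was mapped and written to.
 */
static bool race_write_fault(struct race_page *p, bool wait_for_lock)
{
	if (wait_for_lock && p->locked)
		return false;
	p->mapped = true;
	p->pte_dirty = true;
	p->has_unwritten_data = true;
	return true;
}

/* First half of pageout: lock the page and check that it is unmapped. */
static bool race_pageout_begin(struct race_page *p)
{
	p->locked = true;
	return !p->mapped;
}

/* Second half: drop the dirty state, as on the try_to_free_buffers() path. */
static void race_pageout_finish(struct race_page *p)
{
	p->pg_dirty = false;
	p->pte_dirty = false;  /* the 2.6.19 behaviour under discussion */
	p->locked = false;
}

/* Modified data that no dirty bit records any more. */
static bool race_lost_write(const struct race_page *p)
{
	return p->has_unwritten_data && !p->pte_dirty && !p->pg_dirty;
}
```

If the fault runs between the two pageout halves without looking at the
lock, race_lost_write() reports a lost write. If the fault is held off
until the page is unlocked, the pte stays dirty and nothing is lost.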

		Linus

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-17 23:40     ` Andrew Morton
@ 2006-12-18  1:02       ` Linus Torvalds
  2006-12-18  1:22       ` Linus Torvalds
  2006-12-18 16:55       ` Peter Zijlstra
  2 siblings, 0 replies; 154+ messages in thread
From: Linus Torvalds @ 2006-12-18  1:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: andrei.popa, Linux Kernel Mailing List, Peter Zijlstra,
	Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr



On Sun, 17 Dec 2006, Andrew Morton wrote:
> 
> So this patch instead arranges for clear_page_dirty() to not clean the pte's
> when it is called on the try_to_free_buffers() path.

No. This is wrong.

It's wrong exactly because it now _breaks_ the whole thing that the 2.6.19 
PG_dirty changes were all about: keeping track of dirty pages. Now you 
have a page that is dirty, but it's no longer marked PG_dirty, and thus it 
doesn't participate in the dirty accounting.
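
[ Illustrative sketch, not part of the original mail. A toy model of the
  objection: clearing PG_dirty without cleaning the ptes leaves a page
  that holds modified data but is no longer counted. The acct_* names
  and the counter are hypothetical stand-ins, not kernel code. ]

```c
#include <assert.h>
#include <stdbool.h>

struct acct_page {
	bool pg_dirty;
	bool pte_dirty;
};

static long nr_file_dirty;  /* stands in for the NR_FILE_DIRTY counter */

static void acct_set_page_dirty(struct acct_page *page)
{
	if (!page->pg_dirty) {
		page->pg_dirty = true;
		nr_file_dirty++;
	}
}

/* Same shape as the patched test_clear_page_dirty(page, must_clean_ptes). */
static bool acct_test_clear_page_dirty(struct acct_page *page,
				       bool must_clean_ptes)
{
	if (!page->pg_dirty)
		return false;
	page->pg_dirty = false;
	if (must_clean_ptes)
		page->pte_dirty = false;
	nr_file_dirty--;
	return true;
}

/* Dirty through the pte, but invisible to the dirty accounting. */
static bool acct_is_untracked_dirty(const struct acct_page *page)
{
	return page->pte_dirty && !page->pg_dirty;
}
```

Calling acct_test_clear_page_dirty() with must_clean_ptes == false on a
pte-dirty page drops the counter to zero while the page still carries
modified data.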

> From my quick reading, all callers of try_to_free_buffers() have already
> unmapped the page from pagetables, and given that the reported ext3 corruption
> happens on uniprocessor, non-preempt kernels, I doubt if this patch will fix
> things.

So not only are you breaking this, you also claim that it cannot happen in 
the first place. So either the patch is buggy, or it's pointless. In 
neither case does it seem to be a good idea to do.

		Linus

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-17 13:39   ` Andrei Popa
@ 2006-12-17 23:40     ` Andrew Morton
  2006-12-18  1:02       ` Linus Torvalds
                         ` (2 more replies)
  0 siblings, 3 replies; 154+ messages in thread
From: Andrew Morton @ 2006-12-17 23:40 UTC (permalink / raw)
  To: andrei.popa
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins,
	Linus Torvalds, Florian Weimer, Marc Haber, Martin Michlmayr

On Sun, 17 Dec 2006 15:39:32 +0200
Andrei Popa <andrei.popa@i-neo.ro> wrote:

> I was mistaken, I'm still having file corruption with rtorrent.
> 

Well I'm not very optimistic, but if people could try this, please...



From: Andrew Morton <akpm@osdl.org>

try_to_free_buffers() clears the page's dirty state if it successfully removed
the page's buffers.

  Background for this:

  - a process does a one-byte-write to a file on a 64k pagesize, 4k
    blocksize ext3 filesystem.  The page is now PageDirty, !PageUptodate and
    has one dirty buffer and 15 not uptodate buffers.

  - kjournald writes the dirty buffer.  The page is now PageDirty,
    !PageUptodate and has a mix of clean and not uptodate buffers.

  - try_to_free_buffers() removes the page's buffers.  It MUST now clear
    PageDirty.  If we were to leave the page dirty then we'd have a dirty, not
    uptodate page with no buffer_heads.

    We're screwed: we cannot write the page because we don't know which
    sections of it contain garbage.  We cannot read the page because we don't
    know which sections of it contain modified data.  We cannot free the page
    because it is dirty.
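
[ Illustrative sketch, not part of the original mail. The three steps
  above, modelled with toy types: 16 buffers per page, as in the 64k
  page / 4k block example. The bg_* names are hypothetical and only
  loosely mirror buffer_head and struct page. ]

```c
#include <assert.h>
#include <stdbool.h>

#define BG_BUFFERS_PER_PAGE 16  /* 64k page / 4k block */

struct bg_buffer {
	bool dirty;
	bool uptodate;
};

struct bg_page {
	bool dirty;        /* models PageDirty */
	bool uptodate;     /* models PageUptodate */
	bool has_buffers;
	struct bg_buffer bh[BG_BUFFERS_PER_PAGE];
};

/* Step 1: a one-byte write dirties exactly one block of a fresh page. */
static void bg_write_one_byte(struct bg_page *page, int block)
{
	page->has_buffers = true;
	page->bh[block].dirty = true;
	page->bh[block].uptodate = true;
	page->dirty = true;  /* page stays !uptodate: 15 blocks never read */
}

/* Step 2: kjournald writes out every dirty buffer. */
static void bg_journal_writeout(struct bg_page *page)
{
	for (int i = 0; i < BG_BUFFERS_PER_PAGE; i++)
		page->bh[i].dirty = false;
}

/*
 * Step 3: free the buffers if none is dirty. The page dirty bit must go
 * with them, or nothing records which blocks hold valid data.
 * Returns true if the buffers were freed.
 */
static bool bg_try_to_free_buffers(struct bg_page *page)
{
	for (int i = 0; i < BG_BUFFERS_PER_PAGE; i++)
		if (page->bh[i].dirty)
			return false;
	page->has_buffers = false;
	page->dirty = false;
	return true;
}
```

Before the journal writeout, bg_try_to_free_buffers() refuses to free
anything. Afterwards it frees the buffers and clears the page dirty bit
in the same step.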


Peter's "mm: tracking shared dirty pages"
(d08b3851da41d0ee60851f2c75b118e1f7a5fc89) modified clear_page_dirty() so that
it also clears the page's pte mapping's dirty flags, arranging for a
subsequent userspace modification of the page to cause a fault.

That change to clear_page_dirty() was correct for when it is called on the
writeback path.  Here, we effectively do:

	ClearPageDirty()
	pte_mkclean()
	submit-the-writeout

If a page-dirtying via write() or via pte's happens after the ClearPageDirty()
or the pte_mkclean() then the page is redirtied while writeout is in flight
and the page will again need writing; no probs.
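
[ Illustrative sketch, not part of the original mail. A toy model of the
  writeback ordering above, showing why a store that lands while
  writeout is in flight is not lost. The wb_* names are hypothetical. ]

```c
#include <assert.h>
#include <stdbool.h>

struct wb_page {
	bool pg_dirty;   /* models PG_dirty */
	bool pte_dirty;  /* models the dirty bit in the mapping pte */
	bool writeback;  /* writeout in flight */
};

/* ClearPageDirty(); pte_mkclean(); submit-the-writeout */
static void wb_start_writeback(struct wb_page *page)
{
	page->pg_dirty = false;
	page->pte_dirty = false;
	page->writeback = true;
}

/* A store through a clean pte faults, and the fault redirties the page. */
static void wb_user_store(struct wb_page *page)
{
	if (!page->pte_dirty) {
		page->pte_dirty = true;
		page->pg_dirty = true;
	}
}

static void wb_end_writeback(struct wb_page *page)
{
	page->writeback = false;
}

/* True if the page still has to be written again. */
static bool wb_needs_writeout(const struct wb_page *page)
{
	return page->pg_dirty;
}
```

A store between wb_start_writeback() and wb_end_writeback() leaves the
page dirty again, so a later pass writes it once more.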

But that change to clear_page_dirty() was incorrect for when it is called on
the try_to_free_buffers() path.  Here, we want to preserve any pte-dirtiness
because we're not going to write the page to backing store.  We need to keep
a record of any userspace modification to the page.

One way of addressing this would be to bail from try_to_free_buffers() if the
page is mapped into pagetables.  However that is racy, because the pagefault
path doesn't lock the page when establishing a pte against it (I wish it did
- it would solve a lot of nasties).

So this patch instead arranges for clear_page_dirty() to not clean the pte's
when it is called on the try_to_free_buffers() path.

clear_page_dirty() had several callers and it's not immediately obvious to me
what the appropriate behaviour is in each case.  Could maintainers please take
a look?

From my quick reading, all callers of try_to_free_buffers() have already
unmapped the page from pagetables, and given that the reported ext3 corruption
happens on uniprocessor, non-preempt kernels, I doubt if this patch will fix
things.

But even if it is true that try_to_free_buffers() callers unmap the page
first, this fix is still needed, because a minor fault could reestablish pte's
in the meanwhile.

Note that with this change, we can now restore try_to_free_buffers()'s
->private_lock to cover the test_clear_page_dirty().  If we indeed need to do
that, it'll be in a separate patch.

(Need to think about this some more.  How can a page be pte-dirty, but not
have dirty buffers?  We're supposed to clean the pte's when we write the
page, and we dirty the page and buffers when userspace dirties the pte...)


Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: <reiserfs-dev@namesys.com>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: David Chinner <dgc@sgi.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 fs/buffer.c                 |    2 +-
 fs/cifs/file.c              |    2 +-
 fs/fuse/file.c              |    2 +-
 fs/hugetlbfs/inode.c        |    2 +-
 fs/jfs/jfs_metapage.c       |    2 +-
 fs/reiserfs/stree.c         |    2 +-
 fs/xfs/linux-2.6/xfs_aops.c |    2 +-
 include/linux/page-flags.h  |    6 +++---
 mm/page-writeback.c         |    5 +++--
 mm/truncate.c               |    4 ++--
 10 files changed, 15 insertions(+), 14 deletions(-)

diff -puN fs/buffer.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/buffer.c
--- a/fs/buffer.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/fs/buffer.c
@@ -2858,7 +2858,7 @@ int try_to_free_buffers(struct page *pag
 		 * the page's buffers clean.  We discover that here and clean
 		 * the page also.
 		 */
-		if (test_clear_page_dirty(page))
+		if (test_clear_page_dirty(page, 0))
 			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
 	}
 out:
diff -puN fs/fuse/file.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/fuse/file.c
--- a/fs/fuse/file.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/fs/fuse/file.c
@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
 		spin_unlock(&fc->lock);
 
 		if (offset == 0 && to == PAGE_CACHE_SIZE) {
-			clear_page_dirty(page);
+			clear_page_dirty(page, 0);
 			SetPageUptodate(page);
 		}
 	}
diff -puN fs/hugetlbfs/inode.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/hugetlbfs/inode.c
--- a/fs/hugetlbfs/inode.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
 
 static void truncate_huge_page(struct page *page)
 {
-	clear_page_dirty(page);
+	clear_page_dirty(page, 1);
 	ClearPageUptodate(page);
 	remove_from_page_cache(page);
 	put_page(page);
diff -puN fs/jfs/jfs_metapage.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/jfs/jfs_metapage.c
--- a/fs/jfs/jfs_metapage.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/fs/jfs/jfs_metapage.c
@@ -773,7 +773,7 @@ void release_metapage(struct metapage * 
 
 	/* Retest mp->count since we may have released page lock */
 	if (test_bit(META_discard, &mp->flag) && !mp->count) {
-		clear_page_dirty(page);
+		clear_page_dirty(page, 1);
 		ClearPageUptodate(page);
 	}
 #else
diff -puN fs/reiserfs/stree.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/reiserfs/stree.c
--- a/fs/reiserfs/stree.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/fs/reiserfs/stree.c
@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
 				bh = next;
 			} while (bh != head);
 			if (PAGE_SIZE == bh->b_size) {
-				clear_page_dirty(page);
+				clear_page_dirty(page, 0);
 			}
 		}
 	}
diff -puN fs/xfs/linux-2.6/xfs_aops.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/xfs/linux-2.6/xfs_aops.c
--- a/fs/xfs/linux-2.6/xfs_aops.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/fs/xfs/linux-2.6/xfs_aops.c
@@ -343,7 +343,7 @@ xfs_start_page_writeback(
 	ASSERT(!PageWriteback(page));
 	set_page_writeback(page);
 	if (clear_dirty)
-		clear_page_dirty(page);
+		clear_page_dirty(page, 1);
 	unlock_page(page);
 	if (!buffers) {
 		end_page_writeback(page);
diff -puN include/linux/page-flags.h~try_to_free_buffers-dont-clear-pte-dirty-bits include/linux/page-flags.h
--- a/include/linux/page-flags.h~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/include/linux/page-flags.h
@@ -253,13 +253,13 @@ static inline void SetPageUptodate(struc
 
 struct page;	/* forward declaration */
 
-int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);
 
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int must_clean_ptes)
 {
-	test_clear_page_dirty(page);
+	test_clear_page_dirty(page, must_clean_ptes);
 }
 
 static inline void set_page_writeback(struct page *page)
diff -puN mm/page-writeback.c~try_to_free_buffers-dont-clear-pte-dirty-bits mm/page-writeback.c
--- a/mm/page-writeback.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/mm/page-writeback.c
@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
  * Clear a page's dirty flag, while caring for dirty memory accounting. 
  * Returns true if the page was previously dirty.
  */
-int test_clear_page_dirty(struct page *page)
+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
 {
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
@@ -866,7 +866,8 @@ int test_clear_page_dirty(struct page *p
 		 * page is locked, which pins the address_space
 		 */
 		if (mapping_cap_account_dirty(mapping)) {
-			page_mkclean(page);
+			if (must_clean_ptes)
+				page_mkclean(page);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
 		return 1;
diff -puN mm/truncate.c~try_to_free_buffers-dont-clear-pte-dirty-bits mm/truncate.c
--- a/mm/truncate.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/mm/truncate.c
@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
-	if (test_clear_page_dirty(page))
+	if (test_clear_page_dirty(page, 1))
 		task_io_account_cancelled_write(PAGE_CACHE_SIZE);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);
@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
 					  PAGE_CACHE_SIZE, 0);
 				}
 			}
-			was_dirty = test_clear_page_dirty(page);
+			was_dirty = test_clear_page_dirty(page, 0);
 			if (!invalidate_complete_page2(mapping, page)) {
 				if (was_dirty)
 					set_page_dirty(page);
diff -puN fs/cifs/file.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/cifs/file.c
--- a/fs/cifs/file.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/fs/cifs/file.c
@@ -1245,7 +1245,7 @@ retry:
 				wait_on_page_writeback(page);
 
 			if (PageWriteback(page) ||
-					!test_clear_page_dirty(page)) {
+					!test_clear_page_dirty(page, 1)) {
 				unlock_page(page);
 				break;
 			}
_


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-17 12:06 ` Andrew Morton
  2006-12-17 12:19   ` Marc Haber
  2006-12-17 12:32   ` Andrei Popa
@ 2006-12-17 13:39   ` Andrei Popa
  2006-12-17 23:40     ` Andrew Morton
  2 siblings, 1 reply; 154+ messages in thread
From: Andrei Popa @ 2006-12-17 13:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins,
	Linus Torvalds, Florian Weimer, Marc Haber

I was mistaken, I'm still having file corruption with rtorrent.

On Sun, 2006-12-17 at 04:06 -0800, Andrew Morton wrote:
> On Sun, 17 Dec 2006 02:13:18 +0200
> Andrei Popa <andrei.popa@i-neo.ro> wrote:
> 
> > Hello,
> > I had filesystem data corruption with rtorrent with 2.6.19.
> > I tried recent git with Peter Zijlstra patch
> > http://lkml.org/lkml/2006/12/16/144 and it seems that the problem is
> > fixed.
> > 
> 
> oh crap, I'd forgotten that test_clear_page_dirty() now fiddles with the
> ptes.
> 
> I'd be really surprised if this was all due to a race though.  Is everyone
> who has observed this problem running SMP and/or preemptible kernels?
> 
> Peter, why isn't that proposed patch's cleaning of the pte racy against
> do_wp_page()?


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-17 12:06 ` Andrew Morton
  2006-12-17 12:19   ` Marc Haber
@ 2006-12-17 12:32   ` Andrei Popa
  2006-12-17 13:39   ` Andrei Popa
  2 siblings, 0 replies; 154+ messages in thread
From: Andrei Popa @ 2006-12-17 12:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins,
	Linus Torvalds, Florian Weimer, Marc Haber


ierdnac ~ # uname -a
Linux ierdnac 2.6.20-rc1 #1 SMP PREEMPT Sun Dec 17 01:52:28 EET 2006
i686 Genuine Intel(R) CPU           T2050  @ 1.60GHz GenuineIntel
GNU/Linux


On Sun, 2006-12-17 at 04:06 -0800, Andrew Morton wrote:
> On Sun, 17 Dec 2006 02:13:18 +0200
> Andrei Popa <andrei.popa@i-neo.ro> wrote:
> 
> > Hello,
> > I had filesystem data corruption with rtorrent with 2.6.19.
> > I tried recent git with Peter Zijlstra patch
> > http://lkml.org/lkml/2006/12/16/144 and it seems that the problem is
> > fixed.
> > 
> 
> oh crap, I'd forgotten that test_clear_page_dirty() now fiddles with the
> ptes.
> 
> I'd be really surprised if this was all due to a race though.  Is everyone
> who has observed this problem running SMP and/or preemptible kernels?
> 
> Peter, why isn't that proposed patch's cleaning of the pte racy against
> do_wp_page()?


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-17 12:06 ` Andrew Morton
@ 2006-12-17 12:19   ` Marc Haber
  2006-12-17 12:32   ` Andrei Popa
  2006-12-17 13:39   ` Andrei Popa
  2 siblings, 0 replies; 154+ messages in thread
From: Marc Haber @ 2006-12-17 12:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: andrei.popa, Linux Kernel Mailing List, Peter Zijlstra,
	Hugh Dickins, Linus Torvalds, Florian Weimer

On Sun, Dec 17, 2006 at 04:06:20AM -0800, Andrew Morton wrote:
> I'd be really surprised if this was all due to a race though.  Is everyone
> who has observed this problem running SMP and/or preemptible kernels?

Linux torres 2.6.19.1-zgsrv #1 SMP PREEMPT Wed Dec 13 01:31:27 UTC 2006 i686 GNU/Linux

So, it's a "yes" to both counts, and I'll build a kernel without SMP
and without preemption asap.

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Mannheim, Germany  |  lose things."    Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature |  How to make an American Quilt | Fax: *49 621 72739835

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
  2006-12-17  0:13 Andrei Popa
@ 2006-12-17 12:06 ` Andrew Morton
  2006-12-17 12:19   ` Marc Haber
                     ` (2 more replies)
  0 siblings, 3 replies; 154+ messages in thread
From: Andrew Morton @ 2006-12-17 12:06 UTC (permalink / raw)
  To: andrei.popa
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins,
	Linus Torvalds, Florian Weimer, Marc Haber

On Sun, 17 Dec 2006 02:13:18 +0200
Andrei Popa <andrei.popa@i-neo.ro> wrote:

> Hello,
> I had filesystem data corruption with rtorrent with 2.6.19.
> I tried recent git with Peter Zijlstra patch
> http://lkml.org/lkml/2006/12/16/144 and it seems that the problem is
> fixed.
> 

oh crap, I'd forgotten that test_clear_page_dirty() now fiddles with the
ptes.

I'd be really surprised if this was all due to a race though.  Is everyone
who has observed this problem running SMP and/or preemptible kernels?

Peter, why isn't that proposed patch's cleaning of the pte racy against
do_wp_page()?

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: 2.6.19 file content corruption on ext3
@ 2006-12-17  0:13 Andrei Popa
  2006-12-17 12:06 ` Andrew Morton
  0 siblings, 1 reply; 154+ messages in thread
From: Andrei Popa @ 2006-12-17  0:13 UTC (permalink / raw)
  To: Linux Kernel Mailing List

Hello,
I had filesystem data corruption with rtorrent with 2.6.19.
I tried recent git with Peter Zijlstra patch
http://lkml.org/lkml/2006/12/16/144 and it seems that the problem is
fixed.

Please CC as I am not subscribed to lkml.

Andrei


^ permalink raw reply	[flat|nested] 154+ messages in thread

end of thread, other threads:[~2006-12-29 19:14 UTC | newest]

Thread overview: 154+ messages
2006-12-07 15:57 2.6.19 file content corruption on ext3 Marc Haber
2006-12-07 16:50 ` Phillip Susi
2006-12-08  1:38   ` Fernando Luis Vázquez Cao
2006-12-08 16:42     ` Marc Haber
2006-12-09 10:47       ` Jan Kara
2006-12-11 19:07         ` Marc Haber
2006-12-14 12:03           ` Jan Kara
2006-12-15  9:30             ` Marc Haber
2006-12-16  8:29               ` Marc Haber
2006-12-09 23:46       ` Mike Galbraith
2006-12-11  9:31         ` Marc Haber
2006-12-09  9:26   ` Marc Haber
2006-12-16 18:43     ` Martin Michlmayr
2006-12-16 19:18       ` Hugh Dickins
2006-12-16 21:29         ` Peter Zijlstra
2006-12-16 23:08           ` Hugh Dickins
2006-12-17 13:52         ` Jan Kara
2006-12-22 17:05       ` Marc Haber
2006-12-16 18:31 ` Florian Weimer
2006-12-17 11:52   ` Andrew Morton
2006-12-22 13:30 ` Daniel Drake
2006-12-22 17:03   ` Marc Haber
2006-12-17  0:13 Andrei Popa
2006-12-17 12:06 ` Andrew Morton
2006-12-17 12:19   ` Marc Haber
2006-12-17 12:32   ` Andrei Popa
2006-12-17 13:39   ` Andrei Popa
2006-12-17 23:40     ` Andrew Morton
2006-12-18  1:02       ` Linus Torvalds
2006-12-18  1:22       ` Linus Torvalds
2006-12-18  1:29         ` Linus Torvalds
2006-12-18  1:57           ` Linus Torvalds
2006-12-18  4:51             ` Nick Piggin
2006-12-18  5:43               ` Andrew Morton
2006-12-18  7:22                 ` Nick Piggin
2006-12-18  9:18                   ` Andrew Morton
2006-12-18  9:26                     ` Andrei Popa
2006-12-18  9:42                     ` Nick Piggin
2006-12-19  8:51                 ` Marc Haber
2006-12-19  9:28                   ` Martin Michlmayr
2006-12-28 18:05                   ` Marc Haber
2006-12-28 19:00                     ` Linus Torvalds
2006-12-28 19:05                       ` Petri Kaukasoina
2006-12-28 19:21                         ` Linus Torvalds
2006-12-28 19:39                           ` Dave Jones
2006-12-28 20:10                             ` Arjan van de Ven
2006-12-29  9:23                             ` maximilian attems
2006-12-29 15:02                               ` Dave Jones
2006-12-29 18:52                                 ` maximilian attems
2006-12-29 19:14                                   ` Dave Jones
2006-12-28 21:24                       ` Linus Torvalds
2006-12-28 21:36                         ` Russell King
2006-12-28 22:37                         ` Linus Torvalds
2006-12-28 22:50                           ` David Miller
2006-12-28 23:01                             ` Linus Torvalds
2006-12-29  1:38                             ` Linus Torvalds
2006-12-29  1:59                               ` Andrew Morton
2006-12-28 23:36                           ` Anton Altaparmakov
2006-12-28 23:54                             ` Linus Torvalds
2006-12-29 17:49                       ` Guillaume Chazarain
2006-12-18  5:50               ` Linus Torvalds
2006-12-18  7:16                 ` Andrew Morton
2006-12-18  7:17                   ` Andrew Morton
2006-12-18  9:30                   ` Nick Piggin
2006-12-18  7:30                 ` Nick Piggin
2006-12-18  9:19                 ` Andrei Popa
2006-12-18  9:38                   ` Andrew Morton
2006-12-18 10:00                     ` Andrei Popa
2006-12-18 10:11                       ` Peter Zijlstra
2006-12-18 10:49                         ` Andrei Popa
2006-12-18 15:24                           ` Gene Heskett
2006-12-18 15:32                             ` Peter Zijlstra
2006-12-18 15:47                               ` Gene Heskett
2006-12-18 16:55       ` Peter Zijlstra
2006-12-18 18:03         ` Linus Torvalds
2006-12-18 18:24           ` Peter Zijlstra
2006-12-18 18:35             ` Linus Torvalds
2006-12-18 19:04               ` Andrei Popa
2006-12-18 19:10                 ` Peter Zijlstra
2006-12-18 19:18                 ` Linus Torvalds
2006-12-18 19:44                   ` Andrei Popa
2006-12-18 20:14                     ` Linus Torvalds
2006-12-18 20:41                       ` Linus Torvalds
2006-12-18 21:11                         ` Andrei Popa
2006-12-18 22:00                           ` Alessandro Suardi
2006-12-18 22:45                             ` Linus Torvalds
2006-12-19  0:13                               ` Andrei Popa
2006-12-19  0:29                                 ` Linus Torvalds
2006-12-18 22:32                           ` Linus Torvalds
2006-12-18 23:48                             ` Andrei Popa
2006-12-19  0:04                               ` Linus Torvalds
2006-12-19  0:29                                 ` Andrei Popa
2006-12-19  0:57                                   ` Linus Torvalds
2006-12-19  1:21                                     ` Andrew Morton
2006-12-19  1:44                                       ` Andrei Popa
2006-12-19  1:54                                         ` Andrew Morton
2006-12-19  2:04                                           ` Andrei Popa
2006-12-19  8:05                                           ` Andrei Popa
2006-12-19  8:24                                             ` Andrew Morton
2006-12-19  8:34                                               ` Pekka Enberg
2006-12-19  9:13                                               ` Marc Haber
2006-12-19  1:50                                     ` Andrei Popa
2006-12-19  1:03                               ` Gene Heskett
2006-12-18 22:34                         ` Gene Heskett
2006-12-22 17:27                           ` Linus Torvalds
2006-12-18 21:43                       ` Andrew Morton
2006-12-18 21:49                       ` Peter Zijlstra
2006-12-19 23:42                       ` Peter Zijlstra
2006-12-20  0:23                         ` Linus Torvalds
2006-12-20  9:01                           ` Peter Zijlstra
2006-12-20  9:12                             ` Peter Zijlstra
2006-12-20  9:39                             ` Arjan van de Ven
2006-12-20 14:27                             ` Martin Schwidefsky
2006-12-20  9:32                           ` Peter Zijlstra
2006-12-20 14:15                         ` Andrei Popa
2006-12-20 14:23                           ` Peter Zijlstra
2006-12-20 16:30                             ` Andrei Popa
2006-12-20 16:36                               ` Peter Zijlstra
2006-12-19  7:38                   ` Peter Zijlstra
2006-12-19  4:36           ` Nick Piggin
2006-12-19  6:34             ` Linus Torvalds
2006-12-19  6:51               ` Nick Piggin
2006-12-19  7:26                 ` Linus Torvalds
2006-12-19  8:04                   ` Linus Torvalds
2006-12-19  9:00                     ` Peter Zijlstra
2006-12-19  9:05                       ` Peter Zijlstra
     [not found]                     ` <4587B762.2030603@yahoo.com.au>
2006-12-19 10:32                       ` Andrew Morton
2006-12-19 10:42                         ` Nick Piggin
2006-12-19 10:47                         ` Andrew Morton
2006-12-19 10:52                         ` Peter Zijlstra
2006-12-19 10:58                           ` Nick Piggin
2006-12-19 11:51                             ` Peter Zijlstra
2006-12-19 10:55                         ` Nick Piggin
2006-12-19 16:51                       ` Linus Torvalds
2006-12-19 17:43                         ` Linus Torvalds
2006-12-19 18:59                           ` Linus Torvalds
2006-12-19 21:30                             ` Peter Zijlstra
2006-12-19 22:51                               ` Linus Torvalds
2006-12-19 22:58                                 ` Andrew Morton
2006-12-19 23:06                                   ` Peter Zijlstra
2006-12-19 23:07                                     ` Peter Zijlstra
2006-12-20  0:03                                     ` Linus Torvalds
2006-12-20  0:18                                       ` Andrew Morton
2006-12-20 18:02                               ` Stephen Clark
2006-12-20  5:56                             ` Jari Sundell
2006-12-19 21:56                           ` Florian Weimer
2006-12-21 13:03                           ` Peter Zijlstra
2006-12-21 20:40                             ` Andrew Morton
2006-12-19 20:03               ` dean gaudet
2006-12-19  7:22             ` Peter Zijlstra
2006-12-19  7:59               ` Nick Piggin
2006-12-19  8:14                 ` Linus Torvalds
2006-12-19  9:40                   ` Nick Piggin
2006-12-19 16:46                     ` Linus Torvalds