linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [NFS] [PATCH] mmap corruption
@ 2003-04-04 19:20 Steve Dickson
  2003-04-04 22:01 ` Trond Myklebust
  0 siblings, 1 reply; 7+ messages in thread
From: Steve Dickson @ 2003-04-04 19:20 UTC (permalink / raw)
  To: nfs, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1720 bytes --]

The following patch stops memory mapped corruption due
to pages not being flushed in a timely matter. This
patch is for the 2.4.20 kernel.

The Problem: The corruption occurred when 300 of the File System
Exerciser (fsx) processes are started simultaneously. These processes do
random mmap-ed read/writes, file truncations, normal reads/writes,
and opens/closes one on data file. Just to make things more interesting,
each process opens a logging file and output file in the same directory
(a total of three files per process). The corruption occurred on one
of the data files after 3 to 5 mins.

The Cause: Memory mapped pages were not being flushed out in a timely
manner. When a file is about to truncated (up or down), nfs_writepage()
is called (by filemap_fdatasync()) to flush out dirty pages. When this done
asynchronously, nfs_writepage() will (indirectly) call nfs_strategy().
nfs_strategy() wants to send groups of pages (in this case 4 pages). Now in
the error case, only one page was dirty so it was *not* flushed out.
Eventually that page would be flushed (by kupdate) but it was too late
because the file size had already change due to a second truncation.

My Solution: When a file is going to be truncated down (i.e. the file
is going to shrink), _synchronously_ flush out the mmapped pages. I used
the (unused) NFS_INO_FLUSH flag be to tell nfs_writepage to synchronously
write out the page. This sync write *only* occurs when there are dirty
mmapped pages and the file size is going to shrink. And then only the
mmapped pages are written out synchronously so there should very little
affect on I/O performance due to this synchronization. Definitely worth
the price of  correctness, IMHO....


SteveD.

[-- Attachment #2: linux-2.4.20-nfs-mmap-corruption.patch --]
[-- Type: text/plain, Size: 4371 bytes --]

--- linux-2.4.20/fs/nfs/write.c.orig	2002-11-28 18:53:15.000000000 -0500
+++ linux-2.4.20/fs/nfs/write.c	2003-02-20 08:17:48.000000000 -0500
@@ -242,7 +242,7 @@
 	struct inode *inode = page->mapping->host;
 	unsigned long end_index;
 	unsigned offset = PAGE_CACHE_SIZE;
-	int err;
+	int err, is_sync;
 
 	end_index = inode->i_size >> PAGE_CACHE_SHIFT;
 
@@ -261,7 +261,8 @@
 		goto out;
 do_it:
 	lock_kernel();
-	if (NFS_SERVER(inode)->wsize >= PAGE_CACHE_SIZE && !IS_SYNC(inode)) {
+	is_sync = (IS_SYNC(inode) || NFS_FLUSH(inode));
+	if (NFS_SERVER(inode)->wsize >= PAGE_CACHE_SIZE && !is_sync) {
 		err = nfs_writepage_async(NULL, inode, page, 0, offset);
 		if (err >= 0)
 			err = 0;
@@ -749,15 +750,18 @@
 static void
 nfs_strategy(struct inode *inode)
 {
-	unsigned int	dirty, wpages;
+	unsigned int	dirty, wpages, flush;
 
 	dirty  = inode->u.nfs_i.ndirty;
 	wpages = NFS_SERVER(inode)->wpages;
+	flush = NFS_FLUSH(inode);
 #ifdef CONFIG_NFS_V3
 	if (NFS_PROTO(inode)->version == 2) {
 		if (dirty >= NFS_STRATEGY_PAGES * wpages)
 			nfs_flush_file(inode, NULL, 0, 0, 0);
-	} else if (dirty >= wpages)
+	} else if (dirty >= wpages) {
+		nfs_flush_file(inode, NULL, 0, 0, 0);
+	} else if (dirty && flush)
 		nfs_flush_file(inode, NULL, 0, 0, 0);
 #else
 	if (dirty >= NFS_STRATEGY_PAGES * wpages)
--- linux-2.4.20/fs/nfs/inode.c.orig	2003-02-18 13:31:44.000000000 -0500
+++ linux-2.4.20/fs/nfs/inode.c	2003-02-20 07:15:50.000000000 -0500
@@ -746,12 +746,37 @@
 	goto out;
 }
 
+#define IS_TRUNC_DOWN(_inode, _attr) \
+	(_attr->ia_valid & ATTR_SIZE  && _attr->ia_size < _inode->i_size)
+static inline void
+nfs_inode_flush_on(struct inode *inode)
+{
+	atomic_inc(&(NFS_I(inode)->flushers));
+	lock_kernel(); 
+	NFS_FLAGS(inode) |= NFS_INO_FLUSH;
+	unlock_kernel();
+	return;
+}
+
+static inline void
+nfs_inode_flush_off(struct inode *inode)
+{
+	atomic_dec(&(NFS_I(inode)->flushers));
+	if (atomic_read(&(NFS_I(inode)->flushers)) == 0) {
+		lock_kernel();
+		NFS_FLAGS(inode) &= ~NFS_INO_FLUSH;
+		wake_up(&inode->i_wait);
+		unlock_kernel();
+	}
+	return;
+}
+
 int
 nfs_notify_change(struct dentry *dentry, struct iattr *attr)
 {
 	struct inode *inode = dentry->d_inode;
 	struct nfs_fattr fattr;
-	int error;
+	int error, flusher=0;
 
 	/*
 	 * Make sure the inode is up-to-date.
@@ -767,11 +792,30 @@
 	if (!S_ISREG(inode->i_mode))
 		attr->ia_valid &= ~ATTR_SIZE;
 
-	filemap_fdatasync(inode->i_mapping);
-	error = nfs_wb_all(inode);
-	filemap_fdatawait(inode->i_mapping);
-	if (error)
-		goto out;
+	/*
+	 * If the file is going to be truncated down
+	 * make sure all of the mmapped pages get flushed
+	 * by telling nfs_writepage to flush them synchronously.
+	 * If they are flushed asynchronously and the file size
+	 * changes (again) before they are flushed, data corruption
+	 * will occur.
+	 * XXX: It would be nice if there was an filemap_ api
+	 * that would tell how many (if any) dirty mmapped pages there
+	 * are. That way I would have to take the lock_kernel() when
+	 * its not necessary.
+	 */
+	if (IS_TRUNC_DOWN(inode, attr)) { 
+		flusher = 1;
+		nfs_inode_flush_on(inode);
+	}
+
+	do {
+		filemap_fdatasync(inode->i_mapping);
+		error = nfs_wb_all(inode);
+		filemap_fdatawait(inode->i_mapping);
+		if (error)
+			goto out;
+	} while (flusher && NFS_I(inode)->npages);
 
 	error = NFS_PROTO(inode)->setattr(inode, &fattr, attr);
 	if (error)
@@ -801,6 +845,9 @@
 	NFS_CACHEINV(inode);
 	error = nfs_refresh_inode(inode, &fattr);
 out:
+	if (flusher) {
+		nfs_inode_flush_off(inode);
+	}
 	return error;
 }
 
--- linux-2.4.20/include/linux/nfs_fs_i.h.orig	2003-02-19 08:22:20.000000000 -0500
+++ linux-2.4.20/include/linux/nfs_fs_i.h	2003-02-20 07:36:01.000000000 -0500
@@ -75,6 +75,9 @@
 
 	/* Credentials for shared mmap */
 	struct rpc_cred		*mm_cred;
+
+	/* The number of threads flushing out dirty mmaps pages */
+	atomic_t    flushers;
 };
 
 /*
--- linux-2.4.20/include/linux/nfs_fs.h.orig	2003-02-19 08:22:21.000000000 -0500
+++ linux-2.4.20/include/linux/nfs_fs.h	2003-02-20 07:36:02.000000000 -0500
@@ -99,6 +99,7 @@
 #define NFS_FLAGS(inode)		((inode)->u.nfs_i.flags)
 #define NFS_REVALIDATING(inode)		(NFS_FLAGS(inode) & NFS_INO_REVALIDATING)
 #define NFS_STALE(inode)		(NFS_FLAGS(inode) & NFS_INO_STALE)
+#define NFS_FLUSH(inode)		(NFS_FLAGS(inode) & NFS_INO_FLUSH)
 
 #define NFS_FILEID(inode)		((inode)->u.nfs_i.fileid)
 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [NFS] [PATCH] mmap corruption
  2003-04-04 19:20 [NFS] [PATCH] mmap corruption Steve Dickson
@ 2003-04-04 22:01 ` Trond Myklebust
  2003-04-05 16:47   ` Steve Dickson
  0 siblings, 1 reply; 7+ messages in thread
From: Trond Myklebust @ 2003-04-04 22:01 UTC (permalink / raw)
  To: Steve Dickson; +Cc: nfs, linux-kernel

>>>>> " " == Steve Dickson <SteveD@redhat.com> writes:

     > The Cause: Memory mapped pages were not being flushed out in a
     > timely manner. When a file is about to truncated (up or down),
     > nfs_writepage() is called (by filemap_fdatasync()) to flush out
     > dirty pages. When this done asynchronously, nfs_writepage()
     > will (indirectly) call nfs_strategy().  nfs_strategy() wants to
     > send groups of pages (in this case 4 pages). Now in the error
     > case, only one page was dirty so it was *not* flushed out.
     > Eventually that page would be flushed (by kupdate) but it was
     > too late because the file size had already change due to a
     > second truncation.

That simply doesn't ring true. The nfs_wb_all() immediately after the
call to filemap_fdatasync() should ensure that *all* scheduled writes
will flushed out.

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [NFS] [PATCH] mmap corruption
  2003-04-04 22:01 ` Trond Myklebust
@ 2003-04-05 16:47   ` Steve Dickson
  2003-04-06 12:30     ` Trond Myklebust
  0 siblings, 1 reply; 7+ messages in thread
From: Steve Dickson @ 2003-04-05 16:47 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: nfs, linux-kernel

Hi Trond,

On Sat, Apr 05, 2003 at 12:01:02AM +0200, Trond Myklebust wrote:
> >>>>> " " == Steve Dickson <SteveD@redhat.com> writes:
> 
> That simply doesn't ring true. The nfs_wb_all() immediately after the
> call to filemap_fdatasync() should ensure that *all* scheduled writes
> will flushed out.
> 
Here is my evidence your honor... :-)

1)  Through my debugging I was able to show when (and only when)
the following if statement is true, there would be corruption:

nfs_notify_change(struct dentry *dentry, struct iattr *attr)
{
	/*
	 * beginning code skipped
	 */

	filemap_fdatasync(inode->i_mapping);
	error = nfs_wb_all(inode);
	filemap_fdatawait(inode->i_mapping);
	if (error)
		goto out;

	/*
	 * Every time either npages or ncommit had a value and the file size is
	 * immediately changed (with in a microsecond or two) by another 
	 * truncation, followed by a mmap read, the file would be corrupted.
	 */
	if (NFS_I(inode)->npages || NFS_I(inode)->ncommit || NFS_I(inode)->ndirty) {
		printk("nfs_notify_change: fid %Ld npages %d ncommit %d ndirty %d\n",
		NFS_FILEID(inode), NFS_I(inode)->npages, 
		NFS_I(inode)->ncommit, NFS_I(inode)->ndirty);
	}
}

I was also able to log the fact that the page was being written out (and committed)
by kupdated was after second truncation finished. At first, I was thinking 
there was a problem with the nfs_fattr_obsolete() code (and still might be) 
since this late write/commit is *truly* obsolete. But I just could not figure 
out how to detect this event so I went with avoidance approach. Now, if the 
server supplied the client with valid pre and post attrs, I believe this condition 
could be detected. But I didn't have a server that did that so I could 
not test out my theroy...

2) Without this patch my script that startups 300 process fails within
minutes. With this patch the script runs to completion constistanly...

I rest my case... and the verdict is?

SteveD.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [NFS] [PATCH] mmap corruption
  2003-04-05 16:47   ` Steve Dickson
@ 2003-04-06 12:30     ` Trond Myklebust
  2003-04-07 14:00       ` Steve Dickson
  0 siblings, 1 reply; 7+ messages in thread
From: Trond Myklebust @ 2003-04-06 12:30 UTC (permalink / raw)
  To: SteveD; +Cc: nfs, linux-kernel

>>>>> " " == Steve Dickson <SteveD@RedHat.com> writes:

     > 	filemap_fdatasync(inode->i_mapping);
     > 	error = nfs_wb_all(inode);
     > 	filemap_fdatawait(inode->i_mapping);
     > 	if (error)
     > 		goto out;

     > 	/*
     > * Every time either npages or ncommit had a value and the file
     > 	   size is
     > * immediately changed (with in a microsecond or two) by another
     > * truncation, followed by a mmap read, the file would be
     > 	   corrupted.
     > 	 */
     > 	if (NFS_I(inode)->npages || NFS_I(inode)->ncommit ||
     > 	NFS_I(inode)->ndirty) {
     > 		printk("nfs_notify_change: fid %Ld npages %d ncommit
     > 		%d ndirty %d\n", NFS_FILEID(inode),
     > 		NFS_I(inode)->npages, ncommit, NFS_I(inode)->ndirty);
     > 	}
     > }

My point is that nfs_wb_all() is supposed to ensure that
NFS_I(inode)->ncommit, and/or NFS_I(inode)->ndirty are both
zero. i.e. you can have pending reads (in which case
NFS_I(inode)->npages != 0), but *no* pending writes.

Was this the case?

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [NFS] [PATCH] mmap corruption
  2003-04-06 12:30     ` Trond Myklebust
@ 2003-04-07 14:00       ` Steve Dickson
  2003-04-07 14:56         ` Trond Myklebust
  0 siblings, 1 reply; 7+ messages in thread
From: Steve Dickson @ 2003-04-07 14:00 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: nfs, linux-kernel

On Sun, Apr 06, 2003 at 02:30:09PM +0200, Trond Myklebust wrote:
> >>>>> " " == Steve Dickson <SteveD@RedHat.com> writes:
>      > /*
>      > * Every time either npages or ncommit had a value and the file
>      > size is
>      > * immediately changed (with in a microsecond or two) by another
>      > * truncation, followed by a mmap read, the file would be
>      > 	   corrupted.
>      > 	 */
>      > 	if (NFS_I(inode)->npages || NFS_I(inode)->ncommit ||
>      > 	NFS_I(inode)->ndirty) {
>      > 		printk("nfs_notify_change: fid %Ld npages %d ncommit
>      > 		%d ndirty %d\n", NFS_FILEID(inode),
>      > 		NFS_I(inode)->npages, ncommit, NFS_I(inode)->ndirty);
>      > 	}
>      > }
> 
> My point is that nfs_wb_all() is supposed to ensure that
> NFS_I(inode)->ncommit, and/or NFS_I(inode)->ndirty are both
> zero. i.e. you can have pending reads (in which case
> NFS_I(inode)->npages != 0), but *no* pending writes.
> 
> Was this the case?

OK, I understand your point.  And Yes, ndirty and ncommit 
always seem to be zero when nfs_wb_all() returns. Only
when npages != 0 is when I get the corruption.

I didn't realize that npages != 0 meant there are only pending 
reads *not* pending writes... Thanks for that clarification....

SteveD.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [NFS] [PATCH] mmap corruption
  2003-04-07 14:00       ` Steve Dickson
@ 2003-04-07 14:56         ` Trond Myklebust
  2003-04-07 17:39           ` Steve Dickson
  0 siblings, 1 reply; 7+ messages in thread
From: Trond Myklebust @ 2003-04-07 14:56 UTC (permalink / raw)
  To: SteveD; +Cc: nfs, linux-kernel

>>>>> " " == Steve Dickson <SteveD@RedHat.com> writes:

     > OK, I understand your point.  And Yes, ndirty and ncommit
     > always seem to be zero when nfs_wb_all() returns. Only when
     > npages != 0 is when I get the corruption.

     > I didn't realize that npages != 0 meant there are only pending
     > reads *not* pending writes... Thanks for that clarification....

My mistake. npages counts only writes...

However, I still stand by my statement that nfs_wb_all() is supposed
to ensure that *all* pending writes have been cleared.
The only explanation for npages != 0 is if

  a) an error occurred with nfs_wb_all() (we should perhaps test the
     return value of nfs_wb_all() there). Under normal circumstances,
     an error should only occur if you're using soft mounts, though.

  b) somebody redirtied the page *after* nfs_wb_all() was done.

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [NFS] [PATCH] mmap corruption
  2003-04-07 14:56         ` Trond Myklebust
@ 2003-04-07 17:39           ` Steve Dickson
  0 siblings, 0 replies; 7+ messages in thread
From: Steve Dickson @ 2003-04-07 17:39 UTC (permalink / raw)
  To: trond.myklebust; +Cc: nfs, linux-kernel



Trond Myklebust wrote:

>However, I still stand by my statement that nfs_wb_all() is supposed
>to ensure that *all* pending writes have been cleared.
>The only explanation for npages != 0 is if
>
>  a) an error occurred with nfs_wb_all() (we should perhaps test the
>     return value of nfs_wb_all() there). Under normal circumstances,
>     an error should only occur if you're using soft mounts, though.
>
I have checked the return value and its not failing... The would be too 
easy... :-)

SteveD.



^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2003-04-07 17:28 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-04-04 19:20 [NFS] [PATCH] mmap corruption Steve Dickson
2003-04-04 22:01 ` Trond Myklebust
2003-04-05 16:47   ` Steve Dickson
2003-04-06 12:30     ` Trond Myklebust
2003-04-07 14:00       ` Steve Dickson
2003-04-07 14:56         ` Trond Myklebust
2003-04-07 17:39           ` Steve Dickson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).