linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 000 of 2] Invalidate_inode_pages2 changes.
@ 2006-08-29  1:30 NeilBrown
  2006-08-29  1:30 ` [PATCH 001 of 2] Invalidate_inode_pages2 shouldn't abort on first error NeilBrown
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: NeilBrown @ 2006-08-29  1:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: v9fs-developer, trond.myklebust, Andrea Arcangeli, linux-kernel

Hi all,
 I'm picking up on a conversation that was started in late March
 this year, and which didn't get anywhere much.
 See
   http://lkml.org/lkml/2006/3/31/93
 and following.

 The issue is that it is possible for an NFS client to lose writes
 if a particular race is lost - and this has been seen in real life.

 The patch discussed in the above mentioned converstation made two
 changes, one of which was completely right.  The over was slightly wrong. 

 Having explored the issue in some depth I now see how it was wrong
 and present below for your enjoyment two patches.  One which makes 
 and justifies the first change.  One which makes and justifies a
 revised version of the second change.

Thanks,
NeilBrown


 [PATCH 001 of 2] Invalidate_inode_pages2 shouldn't abort on first error.
 [PATCH 002 of 2] Make data destruction in invalidate_inode_pages2 optional.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 001 of 2] Invalidate_inode_pages2 shouldn't abort on first error.
  2006-08-29  1:30 [PATCH 000 of 2] Invalidate_inode_pages2 changes NeilBrown
@ 2006-08-29  1:30 ` NeilBrown
  2006-08-29  1:30 ` [PATCH 002 of 2] Make data destruction in invalidate_inode_pages2 optional NeilBrown
  2006-08-29  2:14 ` [PATCH 000 of 2] Invalidate_inode_pages2 changes Andrew Morton
  2 siblings, 0 replies; 6+ messages in thread
From: NeilBrown @ 2006-08-29  1:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: v9fs-developer, trond.myklebust, Andrea Arcangeli, linux-kernel


invalidate_inode_pages2 is called when the caller has reason to
believe that the pagecache is out-of-sync with the backing storage.
This can happen with networked filesystems (nfs, 9p) when they believe
something might have changed on the server, and it can happen
with O_DIRECT writing when a write has bypassed the page cache.

It is possible for invalidate_inode_pages2 to fail to invalidate a
page.  This can happen if the page is in Writeback or if the
filesystem is unable to release 'private' data for some reason.

This can happen even if the caller has protected against concurrent
writes by claiming i_mutex, and has unmapped and flushed the mapping.
This is because there is no way to lock against an uptodate page being
faulted into an address space and being dirtied.

If invalidate_inode_pages2 finds a page that it cannot invalidate, it
gives up and doesn't bother trying to invalidate any more pages.

This is wrong.  There is no justification for aborting early, and
doing so unnecessarily leaves part of the pagecache inconsistant with
the backing store.

This patch removes the early-abort logic from invalidate_inode_pages2_range.

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./mm/truncate.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff .prev/mm/truncate.c ./mm/truncate.c
--- .prev/mm/truncate.c	2006-08-28 16:38:09.000000000 +1000
+++ ./mm/truncate.c	2006-08-29 10:24:34.000000000 +1000
@@ -293,10 +293,10 @@ int invalidate_inode_pages2_range(struct
 
 	pagevec_init(&pvec, 0);
 	next = start;
-	while (next <= end && !ret && !wrapped &&
+	while (next <= end && !wrapped &&
 		pagevec_lookup(&pvec, mapping, next,
 			min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
-		for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
+		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 			pgoff_t page_index;
 			int was_dirty;

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 002 of 2] Make data destruction in invalidate_inode_pages2 optional.
  2006-08-29  1:30 [PATCH 000 of 2] Invalidate_inode_pages2 changes NeilBrown
  2006-08-29  1:30 ` [PATCH 001 of 2] Invalidate_inode_pages2 shouldn't abort on first error NeilBrown
@ 2006-08-29  1:30 ` NeilBrown
  2006-08-29  2:14 ` [PATCH 000 of 2] Invalidate_inode_pages2 changes Andrew Morton
  2 siblings, 0 replies; 6+ messages in thread
From: NeilBrown @ 2006-08-29  1:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: v9fs-developer, trond.myklebust, Andrea Arcangeli, linux-kernel


invalidate_inode_pages2 is called when the caller has determined
that the page cache might be inconsistant with the backing store.

It is possible for invalidate_complete_page (which it calls) to fail
if a page is dirty.

It is not possible to lock against this as a writable memory mapping
can dirty a page at any time.

Different callers have different preferences about how this situation
is handled.

For nfs (and presumably 9p) the cache invalidation is advisory only.
When byte-range locking is not used, nfs cache coherancy is on a
best-effort basis.  So while nfs would like to invalidate the cache,
it doesn't want to do it at the risk of discarding newly written data.
If a process writes (via mmap) to a page that some other client may
have updated on the server, then we want to local write to go through
(after all, it only 'may' conflict with another client and as they didn't
use locking, that is what they should expect).

For O_DIRECT writes, the situation is different.
generic_file_direct_IO *knows* that there has been an independant,
write to the page in question.  If this races with some other process
performing an update via a mmap then the only way that we can
serialise these writes is to act as though the O_DIRECT comes second
and overwrites the mmap write.

So for O_DIRECT writes, invalidate_inode_pages2 should remove the
Dirty flag from any pages where that is possible.

If that page has already entered Writeback, it may not be possible to
full clean the page.  In this case, -EIO is returned to inform the
writer that their write was not completely successful.

Currently, invalidate_inode_pages2 always tries to clear the Dirty
bit.  This is good for O_DIRECT but bad for NFS (and probably 9p).

This patch makes the clearing of the dirty bit optional via a flag
and sets the flag as appropriate.

Note that I am not at all sure what should happen in the NFS +
O_DIRECT case (and the comment in nfs/direct.c suggests that I am not
alone in this).  So I have left the behaviour at that call site
unchanged.

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./fs/9p/vfs_file.c   |    2 +-
 ./fs/nfs/dir.c       |    2 +-
 ./fs/nfs/direct.c    |    2 +-
 ./fs/nfs/fscache.h   |    2 +-
 ./fs/nfs/inode.c     |    2 +-
 ./include/linux/fs.h |    6 ++++--
 ./mm/filemap.c       |    2 +-
 ./mm/truncate.c      |   11 +++++++----
 8 files changed, 17 insertions(+), 12 deletions(-)

diff .prev/fs/9p/vfs_file.c ./fs/9p/vfs_file.c
--- .prev/fs/9p/vfs_file.c	2006-08-29 10:40:51.000000000 +1000
+++ ./fs/9p/vfs_file.c	2006-08-29 10:40:59.000000000 +1000
@@ -266,7 +266,7 @@ v9fs_file_write(struct file *filp, const
 		total += result;
 	} while (count);
 
-		invalidate_inode_pages2(inode->i_mapping);
+	invalidate_inode_pages2(inode->i_mapping, 0);
 	return total;
 }
 

diff .prev/fs/nfs/dir.c ./fs/nfs/dir.c
--- .prev/fs/nfs/dir.c	2006-08-29 10:41:42.000000000 +1000
+++ ./fs/nfs/dir.c	2006-08-29 10:41:49.000000000 +1000
@@ -204,7 +204,7 @@ int nfs_readdir_filler(nfs_readdir_descr
 	 *	 through inode->i_mutex or some other mechanism.
 	 */
 	if (page->index == 0)
-		invalidate_inode_pages2_range(inode->i_mapping, PAGE_CACHE_SIZE, -1);
+		invalidate_inode_pages2_range(inode->i_mapping, PAGE_CACHE_SIZE, -1, 0);
 	unlock_page(page);
 	return 0;
  error:

diff .prev/fs/nfs/direct.c ./fs/nfs/direct.c
--- .prev/fs/nfs/direct.c	2006-08-29 10:45:41.000000000 +1000
+++ ./fs/nfs/direct.c	2006-08-29 10:45:47.000000000 +1000
@@ -859,7 +859,7 @@ ssize_t nfs_file_direct_write(struct kio
 	 *      occur before the writes complete.  Kind of racey.
 	 */
 	if (mapping->nrpages)
-		invalidate_inode_pages2(mapping);
+		invalidate_inode_pages2(mapping, 1);
 
 	if (retval > 0)
 		iocb->ki_pos = pos + retval;

diff .prev/fs/nfs/fscache.h ./fs/nfs/fscache.h
--- .prev/fs/nfs/fscache.h	2006-08-29 10:46:54.000000000 +1000
+++ ./fs/nfs/fscache.h	2006-08-29 10:49:09.000000000 +1000
@@ -184,7 +184,7 @@ static inline void nfs_fscache_disable_f
 		 * turning off the cache.
 		 */
 		if (inode->i_mapping && inode->i_mapping->nrpages)
-			invalidate_inode_pages2(inode->i_mapping);
+			invalidate_inode_pages2(inode->i_mapping, 0);
 
 		nfs_fscache_zap_fh_cookie(NFS_SERVER(inode), NFS_I(inode));
 	}

diff .prev/fs/nfs/inode.c ./fs/nfs/inode.c
--- .prev/fs/nfs/inode.c	2006-08-29 10:45:54.000000000 +1000
+++ ./fs/nfs/inode.c	2006-08-29 10:46:29.000000000 +1000
@@ -685,7 +685,7 @@ int nfs_revalidate_mapping(struct inode 
 		nfs_inc_stats(inode, NFSIOS_DATAINVALIDATE);
 		if (S_ISREG(inode->i_mode))
 			nfs_sync_mapping(mapping);
-		invalidate_inode_pages2(mapping);
+		invalidate_inode_pages2(mapping, 0);
 
 		spin_lock(&inode->i_lock);
 		nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;

diff .prev/include/linux/fs.h ./include/linux/fs.h
--- .prev/include/linux/fs.h	2006-08-29 10:22:02.000000000 +1000
+++ ./include/linux/fs.h	2006-08-29 10:40:35.000000000 +1000
@@ -1636,9 +1636,11 @@ static inline void invalidate_remote_ino
 	    S_ISLNK(inode->i_mode))
 		invalidate_inode_pages(inode->i_mapping);
 }
-extern int invalidate_inode_pages2(struct address_space *mapping);
+extern int invalidate_inode_pages2(struct address_space *mapping,
+				   int may_destroy);
 extern int invalidate_inode_pages2_range(struct address_space *mapping,
-					 pgoff_t start, pgoff_t end);
+					 pgoff_t start, pgoff_t end,
+					 int may_destroy);
 extern int write_inode_now(struct inode *, int);
 extern void generic_sync_sb_inodes(struct super_block *, struct writeback_control *);
 extern int filemap_fdatawrite(struct address_space *);

diff .prev/mm/filemap.c ./mm/filemap.c
--- .prev/mm/filemap.c	2006-08-29 10:23:11.000000000 +1000
+++ ./mm/filemap.c	2006-08-29 10:49:25.000000000 +1000
@@ -2633,7 +2633,7 @@ generic_file_direct_IO(int rw, struct ki
 			pgoff_t end = (offset + write_len - 1)
 						>> PAGE_CACHE_SHIFT;
 			int err = invalidate_inode_pages2_range(mapping,
-					offset >> PAGE_CACHE_SHIFT, end);
+					offset >> PAGE_CACHE_SHIFT, end, 1);
 			if (err)
 				retval = err;
 		}

diff .prev/mm/truncate.c ./mm/truncate.c
--- .prev/mm/truncate.c	2006-08-29 10:24:34.000000000 +1000
+++ ./mm/truncate.c	2006-08-29 10:50:24.000000000 +1000
@@ -275,6 +275,8 @@ EXPORT_SYMBOL(invalidate_inode_pages);
  * @mapping: the address_space
  * @start: the page offset 'from' which to invalidate
  * @end: the page offset 'to' which to invalidate (inclusive)
+ * @may_destroy: flag to say if Dirty pages may be marked clean, thus possibly
+ *          destroying recent updates to the pages.
  *
  * Any pages which are found to be mapped into pagetables are unmapped prior to
  * invalidation.
@@ -282,7 +284,8 @@ EXPORT_SYMBOL(invalidate_inode_pages);
  * Returns -EIO if any pages could not be invalidated.
  */
 int invalidate_inode_pages2_range(struct address_space *mapping,
-				  pgoff_t start, pgoff_t end)
+				  pgoff_t start, pgoff_t end,
+				  int may_destroy)
 {
 	struct pagevec pvec;
 	pgoff_t next;
@@ -335,7 +338,7 @@ int invalidate_inode_pages2_range(struct
 					  PAGE_CACHE_SIZE, 0);
 				}
 			}
-			was_dirty = test_clear_page_dirty(page);
+			was_dirty = may_destroy && test_clear_page_dirty(page);
 			if (!invalidate_complete_page(mapping, page)) {
 				if (was_dirty)
 					set_page_dirty(page);
@@ -359,8 +362,8 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages
  *
  * Returns -EIO if any pages could not be invalidated.
  */
-int invalidate_inode_pages2(struct address_space *mapping)
+int invalidate_inode_pages2(struct address_space *mapping, int may_destroy)
 {
-	return invalidate_inode_pages2_range(mapping, 0, -1);
+	return invalidate_inode_pages2_range(mapping, 0, -1, may_destroy);
 }
 EXPORT_SYMBOL_GPL(invalidate_inode_pages2);

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 000 of 2] Invalidate_inode_pages2 changes.
  2006-08-29  1:30 [PATCH 000 of 2] Invalidate_inode_pages2 changes NeilBrown
  2006-08-29  1:30 ` [PATCH 001 of 2] Invalidate_inode_pages2 shouldn't abort on first error NeilBrown
  2006-08-29  1:30 ` [PATCH 002 of 2] Make data destruction in invalidate_inode_pages2 optional NeilBrown
@ 2006-08-29  2:14 ` Andrew Morton
  2006-08-29  2:46   ` Neil Brown
  2 siblings, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2006-08-29  2:14 UTC (permalink / raw)
  To: NeilBrown; +Cc: v9fs-developer, trond.myklebust, Andrea Arcangeli, linux-kernel

On Tue, 29 Aug 2006 11:30:15 +1000
NeilBrown <neilb@suse.de> wrote:

>  I'm picking up on a conversation that was started in late March
>  this year, and which didn't get anywhere much.
>  See
>    http://lkml.org/lkml/2006/3/31/93
>  and following.

Nick's "possible lock_page fix for Andrea's nopage vs invalidate race?"
patch (linux-mm) should fix this?

If filemap_nopage() does lock_page(), invalidate_inode_pages2_range() is solid?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 000 of 2] Invalidate_inode_pages2 changes.
  2006-08-29  2:14 ` [PATCH 000 of 2] Invalidate_inode_pages2 changes Andrew Morton
@ 2006-08-29  2:46   ` Neil Brown
  2006-08-29  3:43     ` Andrew Morton
  0 siblings, 1 reply; 6+ messages in thread
From: Neil Brown @ 2006-08-29  2:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: v9fs-developer, trond.myklebust, Andrea Arcangeli, linux-kernel

On Monday August 28, akpm@osdl.org wrote:
> On Tue, 29 Aug 2006 11:30:15 +1000
> NeilBrown <neilb@suse.de> wrote:
> 
> >  I'm picking up on a conversation that was started in late March
> >  this year, and which didn't get anywhere much.
> >  See
> >    http://lkml.org/lkml/2006/3/31/93
> >  and following.
> 
> Nick's "possible lock_page fix for Andrea's nopage vs invalidate race?"
> patch (linux-mm) should fix this?
> 
> If filemap_nopage() does lock_page(), invalidate_inode_pages2_range() is solid?

UHmm.... yes.  In that case we can remove lots of stuff from
invalidate_inode_pages2_range as we can be sure the page won't be
dirty or in writeback so invalidate_complete_page will be certain to
succeed. 

So if that goes ahead, these become moot.  But until it does, these
would be nice to have :-)

Also, the patch at
  http://marc.theaimsgroup.com/?l=linux-mm&m=115443228617576&w=2
appears not to set 
  +	.vm_flags	= VM_CAN_INVLD,
for nfs_fs_vm_operations, but maybe they are a later addition to
nfs...

Thinks: should I subscribe to linux-mm... only about 100 messages per
week.... maybe :-)

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 000 of 2] Invalidate_inode_pages2 changes.
  2006-08-29  2:46   ` Neil Brown
@ 2006-08-29  3:43     ` Andrew Morton
  0 siblings, 0 replies; 6+ messages in thread
From: Andrew Morton @ 2006-08-29  3:43 UTC (permalink / raw)
  To: Neil Brown
  Cc: v9fs-developer, trond.myklebust, Andrea Arcangeli, linux-kernel

On Tue, 29 Aug 2006 12:46:44 +1000
Neil Brown <neilb@suse.de> wrote:

> On Monday August 28, akpm@osdl.org wrote:
> > On Tue, 29 Aug 2006 11:30:15 +1000
> > NeilBrown <neilb@suse.de> wrote:
> > 
> > >  I'm picking up on a conversation that was started in late March
> > >  this year, and which didn't get anywhere much.
> > >  See
> > >    http://lkml.org/lkml/2006/3/31/93
> > >  and following.
> > 
> > Nick's "possible lock_page fix for Andrea's nopage vs invalidate race?"
> > patch (linux-mm) should fix this?
> > 
> > If filemap_nopage() does lock_page(), invalidate_inode_pages2_range() is solid?
> 
> UHmm.... yes.  In that case we can remove lots of stuff from
> invalidate_inode_pages2_range as we can be sure the page won't be
> dirty or in writeback so invalidate_complete_page will be certain to
> succeed. 

I'm not sure we can remove much from invalidate_inode_pages2_range(). 
After lock_page() returns the page can be under writeback, so the
wait_on_page_writeback() is appropriate.  After the page has been unmapped
from pagetables it could have been be redirtied.

Perhaps the while() loop is no longer necessary - nobody else will be
mapping this locked page into pagetables.

> So if that goes ahead, these become moot.  But until it does, these
> would be nice to have :-)

I guess we need to repair Nick's broken wing.

> Also, the patch at
>   http://marc.theaimsgroup.com/?l=linux-mm&m=115443228617576&w=2
> appears not to set 
>   +	.vm_flags	= VM_CAN_INVLD,
> for nfs_fs_vm_operations, but maybe they are a later addition to
> nfs...

I'm hoping all that stuff can go away.  Instead, change do_page_fault to
declare a new `struct page_fault_args' thing and pass that all the way up
and down the pagefault path.  Then, ->nopage implementations can simply set
page_fault_args.i_locked_the_page, to be examined at higher levels.

page_fault_args can also be used to tell do_no_page() to rerun the fault,
which would be needed if we want to stop holding down_read(mmap_sem) while
doing synchronous disk reads.

It'd be a fairly big-but-simple patch though.

> Thinks: should I subscribe to linux-mm... only about 100 messages per
> week.... maybe :-)

And no spam!

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2006-08-29  3:44 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-08-29  1:30 [PATCH 000 of 2] Invalidate_inode_pages2 changes NeilBrown
2006-08-29  1:30 ` [PATCH 001 of 2] Invalidate_inode_pages2 shouldn't abort on first error NeilBrown
2006-08-29  1:30 ` [PATCH 002 of 2] Make data destruction in invalidate_inode_pages2 optional NeilBrown
2006-08-29  2:14 ` [PATCH 000 of 2] Invalidate_inode_pages2 changes Andrew Morton
2006-08-29  2:46   ` Neil Brown
2006-08-29  3:43     ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).