All of lore.kernel.org
 help / color / mirror / Atom feed
* [ANNOUNCE] new new aops patchset
@ 2007-04-02 12:09 Nick Piggin
  2007-04-02 20:18 ` Badari Pulavarty
                   ` (10 more replies)
  0 siblings, 11 replies; 43+ messages in thread
From: Nick Piggin @ 2007-04-02 12:09 UTC (permalink / raw)
  To: Linux Filesystems; +Cc: Mark Fasheh, Steven Whitehouse

Updated aops patchset against 2.6.21-rc5.

http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/

Files/dirs are 2.6.21-rc5-new-aops*

Contains numerous fixes from Mark and myself -- I'd say the core code is
getting reasonably stable at this point.

New stuff: patches from Mark and Steve for the cluster filesystems.

(compile only) patches for NFS, XFS, FUSE, eCryptfs. OK, they're untested,
but I have tried to put some effort in, so if maintainers would kindly
take a look... ext3 still needs review and porting to ext4, which I hope
someone will do eventually (I presume ext3 is still maintained)... does NFS
have a lock_page vs lock_kernel deadlock in its commit_write? (if yes, this
is fixable with the new write aops).

I chose a various smattering of weird and wonderful filesystems to try
converting, and so far I can't see major problems.

A first cut at FAT and cont_prepare_write support. The FAT conversion is
actually tested and seems to work quite well for various truncates and
expands. Breaks reiserfs though, which might be a bit tricky to fix within
the generic cont infrastructure.

Nick


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-02 12:09 [ANNOUNCE] new new aops patchset Nick Piggin
@ 2007-04-02 20:18 ` Badari Pulavarty
  2007-04-03  0:45   ` Nick Piggin
  2007-04-02 20:44 ` Badari Pulavarty
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Badari Pulavarty @ 2007-04-02 20:18 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
> Updated aops patchset against 2.6.21-rc5.
> 
> http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> 
> Files/dirs are 2.6.21-rc5-new-aops*
> 
> Contains numerous fixes from Mark and myself -- I'd say the core code is
> getting reasonably stable at this point.
> 
> New stuff: patches from Mark and Steve for the cluster filesystems.
> 
> (compile only) patches for NFS, XFS, FUSE, eCryptfs. OK, they're untested,
> but I have tried to put some effort in, so if maintainers would kindly
> take a look... ext3 still needs review and porting to ext4, which I hope
> someone will do eventually (I presume ext3 is still maintained)... does NFS
> have a lock_page vs lock_kernel deadlock in its commit_write? (if yes, this
> is fixable with the new write aops).

In case of error, write_begin() is *supposed* to unlock and release the
page it locked. Right ?

If so, ext3 error handling is not quite right :(
Here is the fix.

Thanks,
Badari

---
 fs/ext3/inode.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

Index: linux-2.6.21-rc5.aop/fs/ext3/inode.c
===================================================================
--- linux-2.6.21-rc5.aop.orig/fs/ext3/inode.c	2007-04-02 10:21:44.000000000 -0700
+++ linux-2.6.21-rc5.aop/fs/ext3/inode.c	2007-04-02 13:08:41.000000000 -0700
@@ -1236,8 +1236,11 @@ retry:
 	*pagep = page;
 
 	handle = ext3_journal_start(inode, needed_blocks);
-	if (IS_ERR(handle))
+	if (IS_ERR(handle)) {
+		unlock_page(page);
+		page_cache_release(page);
 		return PTR_ERR(handle);
+	}
 	ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
 							ext3_get_block);
 	if (ret)
@@ -1246,9 +1249,12 @@ retry:
 	if (ext3_should_journal_data(inode)) {
 		ret = walk_page_buffers(handle, page_buffers(page),
 				start, end, NULL, do_journal_get_write_access);
-		if (ret)
+		if (ret) {
 			/* fatal error, just put the handle and return */
+			unlock_page(page);
+			page_cache_release(page);
 			journal_stop(handle);
+		}
 	}
 	return ret;
 



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-02 12:09 [ANNOUNCE] new new aops patchset Nick Piggin
  2007-04-02 20:18 ` Badari Pulavarty
@ 2007-04-02 20:44 ` Badari Pulavarty
  2007-04-02 23:58   ` Nick Piggin
  2007-04-02 21:44 ` Badari Pulavarty
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Badari Pulavarty @ 2007-04-02 20:44 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
> Updated aops patchset against 2.6.21-rc5.
> 
> http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> 
> Files/dirs are 2.6.21-rc5-new-aops*
> 
> Contains numerous fixes from Mark and myself -- I'd say the core code is
> getting reasonably stable at this point.


ext3_write_failure() conversion is NOT quite correct.
Old code was returing failure of do_journal_get_write_access(),
where as your changes will return journal_stop(). 

Thanks,
Badari

---
 fs/ext3/inode.c |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Index: linux-2.6.21-rc5.aop/fs/ext3/inode.c
===================================================================
--- linux-2.6.21-rc5.aop.orig/fs/ext3/inode.c	2007-04-02 13:37:16.000000000 -0700
+++ linux-2.6.21-rc5.aop/fs/ext3/inode.c	2007-04-02 13:37:35.000000000 -0700
@@ -1196,8 +1196,11 @@ skip:
 			break;
 		if (ext3_should_journal_data(mapping->host)) {
 			ret = do_journal_get_write_access(handle, bh);
-			if (ret)
-				goto skip;
+			if (ret) {
+				unlock_page(page);
+				page_cache_release(page);
+				return ret;
+			}
 		}
 	/*
 	 * block_start here becomes the first block where the current iteration



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-02 12:09 [ANNOUNCE] new new aops patchset Nick Piggin
  2007-04-02 20:18 ` Badari Pulavarty
  2007-04-02 20:44 ` Badari Pulavarty
@ 2007-04-02 21:44 ` Badari Pulavarty
  2007-04-02 23:08 ` Badari Pulavarty
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 43+ messages in thread
From: Badari Pulavarty @ 2007-04-02 21:44 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
> Updated aops patchset against 2.6.21-rc5.
> 
> http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> 
> Files/dirs are 2.6.21-rc5-new-aops*


[root@elm3a241 linux-2.6.21-rc5]# make -j8 modules
  CHK     include/linux/version.h
  CHK     include/linux/utsrelease.h
  Building modules, stage 2.
  MODPOST 1281 modules
WARNING: "cont_write_begin" [fs/fat/fat.ko] undefined!
make[1]: *** [__modpost] Error 1
make: *** [modules] Error 2


Here is the fix.

Thanks,
Badari

---
 fs/buffer.c |    1 +
 1 file changed, 1 insertion(+)

Index: linux-2.6.21-rc5/fs/buffer.c
===================================================================
--- linux-2.6.21-rc5.orig/fs/buffer.c	2007-04-02 13:45:56.000000000 -0700
+++ linux-2.6.21-rc5/fs/buffer.c	2007-04-02 14:33:20.000000000 -0700
@@ -3128,6 +3128,7 @@ EXPORT_SYMBOL(block_sync_page);
 EXPORT_SYMBOL(block_truncate_page);
 EXPORT_SYMBOL(block_write_full_page);
 EXPORT_SYMBOL(cont_prepare_write);
+EXPORT_SYMBOL(cont_write_begin);
 EXPORT_SYMBOL(end_buffer_read_sync);
 EXPORT_SYMBOL(end_buffer_write_sync);
 EXPORT_SYMBOL(file_fsync);



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-02 12:09 [ANNOUNCE] new new aops patchset Nick Piggin
                   ` (2 preceding siblings ...)
  2007-04-02 21:44 ` Badari Pulavarty
@ 2007-04-02 23:08 ` Badari Pulavarty
  2007-04-02 23:14 ` Badari Pulavarty
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 43+ messages in thread
From: Badari Pulavarty @ 2007-04-02 23:08 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
> Updated aops patchset against 2.6.21-rc5.
> 
> http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> 
> Files/dirs are 2.6.21-rc5-new-aops*
> 

Simple cleanup for block_write_begin().

Thanks,
Badari

---
 fs/buffer.c |    1 -
 1 file changed, 1 deletion(-)

Index: linux-2.6.21-rc5/fs/buffer.c
===================================================================
--- linux-2.6.21-rc5.orig/fs/buffer.c	2007-04-02 14:33:20.000000000 -0700
+++ linux-2.6.21-rc5/fs/buffer.c	2007-04-02 16:02:32.000000000 -0700
@@ -1977,7 +1977,6 @@ int block_write_begin(struct file *file,
 			if (pos + len > inode->i_size)
 				vmtruncate(inode, inode->i_size);
 		}
-		goto out;
 	}
 
 out:



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-02 12:09 [ANNOUNCE] new new aops patchset Nick Piggin
                   ` (3 preceding siblings ...)
  2007-04-02 23:08 ` Badari Pulavarty
@ 2007-04-02 23:14 ` Badari Pulavarty
  2007-04-02 23:49   ` Nick Piggin
  2007-04-02 23:31 ` Badari Pulavarty
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Badari Pulavarty @ 2007-04-02 23:14 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
> Updated aops patchset against 2.6.21-rc5.
> 
> http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> 
> Files/dirs are 2.6.21-rc5-new-aops*

Baaah !! You took away ext3 -nobh option :(

Do you have plans to support nobh versions of block_write_begin/end ?

BTW, I don't see how block_write_end() can ever return < 0.
If so, here is the cleanup fix for ext3 (no unnecessay checks).

Thanks,
Badari

---
 fs/ext3/inode.c |   11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

Index: linux-2.6.21-rc5/fs/ext3/inode.c
===================================================================
--- linux-2.6.21-rc5.orig/fs/ext3/inode.c	2007-04-02 16:01:27.000000000 -0700
+++ linux-2.6.21-rc5/fs/ext3/inode.c	2007-04-02 16:06:39.000000000 -0700
@@ -1328,8 +1328,6 @@ static int ext3_ordered_write_end(struct
 		if (new_i_size > EXT3_I(inode)->i_disksize)
 			EXT3_I(inode)->i_disksize = new_i_size;
 		copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
-		if (copied < 0)
-			ret = copied;
 	}
 	ret2 = ext3_journal_stop(handle);
 	if (!ret)
@@ -1344,7 +1342,7 @@ static int ext3_writeback_write_end(stru
 {
 	handle_t *handle = ext3_journal_current_handle();
 	struct inode *inode = file->f_mapping->host;
-	int ret = 0, ret2;
+	int ret;
 	loff_t new_i_size;
 
 	new_i_size = pos + copied;
@@ -1352,12 +1350,9 @@ static int ext3_writeback_write_end(stru
 		EXT3_I(inode)->i_disksize = new_i_size;
 
 	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
-	if (copied < 0)
-		ret = copied;
 
-	ret2 = ext3_journal_stop(handle);
-	if (!ret)
-		ret = ret2;
+	ret = ext3_journal_stop(handle);
+
 	return ret ? ret : copied;
 }
 



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-02 12:09 [ANNOUNCE] new new aops patchset Nick Piggin
                   ` (4 preceding siblings ...)
  2007-04-02 23:14 ` Badari Pulavarty
@ 2007-04-02 23:31 ` Badari Pulavarty
  2007-04-02 23:35   ` Nick Piggin
  2007-04-02 23:56 ` Badari Pulavarty
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Badari Pulavarty @ 2007-04-02 23:31 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
> Updated aops patchset against 2.6.21-rc5.
> 
> http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> 

Sorry to send you so many silly fixes, but I though it would
be easy to review individual ones.

Thanks,
Badari

---
 mm/filemap.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.21-rc5/mm/filemap.c
===================================================================
--- linux-2.6.21-rc5.orig/mm/filemap.c	2007-04-02 13:45:56.000000000 -0700
+++ linux-2.6.21-rc5/mm/filemap.c	2007-04-02 16:23:48.000000000 -0700
@@ -2107,7 +2107,7 @@ int pagecache_write_end(struct file *fil
 	const struct address_space_operations *aops = mapping->a_ops;
 	int ret;
 
-	if (aops->write_begin) {
+	if (aops->write_end) {
 		ret = aops->write_end(file, mapping, pos, len, copied,
 							page, fsdata);
 	} else {





^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-02 23:31 ` Badari Pulavarty
@ 2007-04-02 23:35   ` Nick Piggin
  0 siblings, 0 replies; 43+ messages in thread
From: Nick Piggin @ 2007-04-02 23:35 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Mon, Apr 02, 2007 at 04:31:22PM -0700, Badari Pulavarty wrote:
> On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
> > Updated aops patchset against 2.6.21-rc5.
> > 
> > http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> > 
> 
> Sorry to send you so many silly fixes, but I though it would
> be easy to review individual ones.

On the contrary, thanks for sending them all :) In regards to this
hunk, my aim is simply to mirror generic_perform_write test which
gives us a free oops if write_end is NULL. I'll add a comment there
(eventually this will all go away anyway).

The rest of your patches look good I think, so consider them applied
unless I reply individually.

Thanks,
Nick

> 
> Thanks,
> Badari
> 
> ---
>  mm/filemap.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux-2.6.21-rc5/mm/filemap.c
> ===================================================================
> --- linux-2.6.21-rc5.orig/mm/filemap.c	2007-04-02 13:45:56.000000000 -0700
> +++ linux-2.6.21-rc5/mm/filemap.c	2007-04-02 16:23:48.000000000 -0700
> @@ -2107,7 +2107,7 @@ int pagecache_write_end(struct file *fil
>  	const struct address_space_operations *aops = mapping->a_ops;
>  	int ret;
>  
> -	if (aops->write_begin) {
> +	if (aops->write_end) {
>  		ret = aops->write_end(file, mapping, pos, len, copied,
>  							page, fsdata);
>  	} else {
> 
> 
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-02 23:14 ` Badari Pulavarty
@ 2007-04-02 23:49   ` Nick Piggin
  2007-04-03 15:57     ` Badari Pulavarty
  0 siblings, 1 reply; 43+ messages in thread
From: Nick Piggin @ 2007-04-02 23:49 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Mon, Apr 02, 2007 at 04:14:59PM -0700, Badari Pulavarty wrote:
> On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
> > Updated aops patchset against 2.6.21-rc5.
> > 
> > http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> > 
> > Files/dirs are 2.6.21-rc5-new-aops*
> 
> Baaah !! You took away ext3 -nobh option :(

Ahh, just the person I wanted to ask! ;) How useful is it, out of curiosity?
What sort of users use it, and what sort of improvements do they get?


> Do you have plans to support nobh versions of block_write_begin/end ?

At the moment I'm looking at doing it another way. Having the seperate
nobh path is quite annoying -- there are still bugs in it and it is
simply less tested (not that the bh path is bug-free either, but it
is better to be able to concentrate on one). So I hope to merge them
at some point to restore that functionality. 


> BTW, I don't see how block_write_end() can ever return < 0.
> If so, here is the cleanup fix for ext3 (no unnecessay checks).

Shouldn't we allow for the possibility?


> ---
>  fs/ext3/inode.c |   11 +++--------
>  1 file changed, 3 insertions(+), 8 deletions(-)
> 
> Index: linux-2.6.21-rc5/fs/ext3/inode.c
> ===================================================================
> --- linux-2.6.21-rc5.orig/fs/ext3/inode.c	2007-04-02 16:01:27.000000000 -0700
> +++ linux-2.6.21-rc5/fs/ext3/inode.c	2007-04-02 16:06:39.000000000 -0700
> @@ -1328,8 +1328,6 @@ static int ext3_ordered_write_end(struct
>  		if (new_i_size > EXT3_I(inode)->i_disksize)
>  			EXT3_I(inode)->i_disksize = new_i_size;
>  		copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> -		if (copied < 0)
> -			ret = copied;
>  	}
>  	ret2 = ext3_journal_stop(handle);
>  	if (!ret)
> @@ -1344,7 +1342,7 @@ static int ext3_writeback_write_end(stru
>  {
>  	handle_t *handle = ext3_journal_current_handle();
>  	struct inode *inode = file->f_mapping->host;
> -	int ret = 0, ret2;
> +	int ret;
>  	loff_t new_i_size;
>  
>  	new_i_size = pos + copied;
> @@ -1352,12 +1350,9 @@ static int ext3_writeback_write_end(stru
>  		EXT3_I(inode)->i_disksize = new_i_size;
>  
>  	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> -	if (copied < 0)
> -		ret = copied;
>  
> -	ret2 = ext3_journal_stop(handle);
> -	if (!ret)
> -		ret = ret2;
> +	ret = ext3_journal_stop(handle);
> +
>  	return ret ? ret : copied;
>  }
>  
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-02 12:09 [ANNOUNCE] new new aops patchset Nick Piggin
                   ` (5 preceding siblings ...)
  2007-04-02 23:31 ` Badari Pulavarty
@ 2007-04-02 23:56 ` Badari Pulavarty
  2007-04-03  0:02   ` Nick Piggin
  2007-04-03  0:00 ` Badari Pulavarty
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Badari Pulavarty @ 2007-04-02 23:56 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
> Updated aops patchset against 2.6.21-rc5.
> 
> http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> 
> Files/dirs are 2.6.21-rc5-new-aops*

ext3_write_begin() is computing "start" incorrectly.

Thanks,
Badari

---
 fs/ext3/inode.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.21-rc5/fs/ext3/inode.c
===================================================================
--- linux-2.6.21-rc5.orig/fs/ext3/inode.c	2007-04-02 16:06:39.000000000 -0700
+++ linux-2.6.21-rc5/fs/ext3/inode.c	2007-04-02 16:49:57.000000000 -0700
@@ -1229,7 +1229,7 @@ static int ext3_write_begin(struct file 
 	unsigned start, end;
 
 	index = pos >> PAGE_CACHE_SHIFT;
-	start = pos * (PAGE_CACHE_SIZE - 1);
+	start = pos & (PAGE_CACHE_SIZE - 1);
 	end = start + len;
 
 retry:



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-02 20:44 ` Badari Pulavarty
@ 2007-04-02 23:58   ` Nick Piggin
  2007-04-03 15:59     ` Badari Pulavarty
  0 siblings, 1 reply; 43+ messages in thread
From: Nick Piggin @ 2007-04-02 23:58 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Mon, Apr 02, 2007 at 01:44:59PM -0700, Badari Pulavarty wrote:
> On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
> > Updated aops patchset against 2.6.21-rc5.
> > 
> > http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> > 
> > Files/dirs are 2.6.21-rc5-new-aops*
> > 
> > Contains numerous fixes from Mark and myself -- I'd say the core code is
> > getting reasonably stable at this point.
> 
> 
> ext3_write_failure() conversion is NOT quite correct.
> Old code was returing failure of do_journal_get_write_access(),
> where as your changes will return journal_stop(). 

Good catch. This still requires a journal_stop, however?

How's this look?

--
Index: linux-2.6/fs/ext3/inode.c
===================================================================
--- linux-2.6.orig/fs/ext3/inode.c
+++ linux-2.6/fs/ext3/inode.c
@@ -1167,11 +1167,13 @@ static int ext3_write_failure(struct fil
 	handle_t *handle = ext3_journal_current_handle();
 
 	if (ext3_should_writeback_data(mapping->host)) {
+skip_and_stop:
 		/* optimization: no constraints about data */
+		ret = ext3_journal_stop(handle);
 skip:
 		unlock_page(page);
 		page_cache_release(page);
-		return ext3_journal_stop(handle);
+		return ret;
 	}
 
 	from = pos & (PAGE_CACHE_SIZE - 1);
@@ -1196,8 +1198,10 @@ skip:
 			break;
 		if (ext3_should_journal_data(mapping->host)) {
 			ret = do_journal_get_write_access(handle, bh);
-			if (ret)
+			if (ret) {
+				ext3_journal_stop(handle);
 				goto skip;
+			}
 		}
 	/*
 	 * block_start here becomes the first block where the current iteration
@@ -1205,7 +1209,7 @@ skip:
 	 */
 	}
 	if (block_start <= from)
-		goto skip;
+		goto skip_and_stop;
 
 	/* commit allocated and zeroed buffers */
 	return mapping->a_ops->write_end(file, mapping, pos, len,

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-02 12:09 [ANNOUNCE] new new aops patchset Nick Piggin
                   ` (6 preceding siblings ...)
  2007-04-02 23:56 ` Badari Pulavarty
@ 2007-04-03  0:00 ` Badari Pulavarty
  2007-04-03  0:08   ` Nick Piggin
  2007-04-03 17:35 ` Badari Pulavarty
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Badari Pulavarty @ 2007-04-03  0:00 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
> Updated aops patchset against 2.6.21-rc5.
> 
> http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> 
> Files/dirs are 2.6.21-rc5-new-aops*

One more cleanup. "index" is unused.

BTW, I will take a shot at ext4 tomorrow.

(If you want me to roll all my fixes and send them to you,u,

5.)


1\fc link

5.orig

\x11
h2
(

 and send them to you,
let me know).

Thanks,
Badari

---
 mm/filemap.c |    2 --
 1 file changed, 2 deletions(-)

Index: linux-2.6.21-rc5/mm/filemap.c
===================================================================
--- linux-2.6.21-rc5.orig/mm/filemap.c	2007-04-02 16:23:48.000000000 -0700
+++ linux-2.6.21-rc5/mm/filemap.c	2007-04-02 16:53:13.000000000 -0700
@@ -2383,14 +2383,12 @@ static ssize_t generic_perform_write(str
 
 	do {
 		struct page *page;
-		pgoff_t index;		/* Pagecache index for current page */
 		unsigned long offset;	/* Offset into pagecache page */
 		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
 		void *fsdata;
 
 		offset = (pos & (PAGE_CACHE_SIZE - 1));
-		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
 						iov_iter_count(i));
 




-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-02 23:56 ` Badari Pulavarty
@ 2007-04-03  0:02   ` Nick Piggin
  0 siblings, 0 replies; 43+ messages in thread
From: Nick Piggin @ 2007-04-03  0:02 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Mon, Apr 02, 2007 at 04:56:11PM -0700, Badari Pulavarty wrote:
> On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
> > Updated aops patchset against 2.6.21-rc5.
> > 
> > http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> > 
> > Files/dirs are 2.6.21-rc5-new-aops*
> 
> ext3_write_begin() is computing "start" incorrectly.

Good spotting again, thanks!

> 
> Thanks,
> Badari
> 
> ---
>  fs/ext3/inode.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux-2.6.21-rc5/fs/ext3/inode.c
> ===================================================================
> --- linux-2.6.21-rc5.orig/fs/ext3/inode.c	2007-04-02 16:06:39.000000000 -0700
> +++ linux-2.6.21-rc5/fs/ext3/inode.c	2007-04-02 16:49:57.000000000 -0700
> @@ -1229,7 +1229,7 @@ static int ext3_write_begin(struct file 
>  	unsigned start, end;
>  
>  	index = pos >> PAGE_CACHE_SHIFT;
> -	start = pos * (PAGE_CACHE_SIZE - 1);
> +	start = pos & (PAGE_CACHE_SIZE - 1);
>  	end = start + len;
>  
>  retry:
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-03  0:00 ` Badari Pulavarty
@ 2007-04-03  0:08   ` Nick Piggin
  2007-04-03  9:31     ` Nick Piggin
  0 siblings, 1 reply; 43+ messages in thread
From: Nick Piggin @ 2007-04-03  0:08 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Mon, Apr 02, 2007 at 05:00:50PM -0700, Badari Pulavarty wrote:
> On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
> > Updated aops patchset against 2.6.21-rc5.
> > 
> > http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> > 
> > Files/dirs are 2.6.21-rc5-new-aops*
> 
> One more cleanup. "index" is unused.

Indeed. I wonder why gcc (at least the 4.0 I'm using) doesn't warn if a
local, non volatile is only ever assigned to?

> BTW, I will take a shot at ext4 tomorrow.

Thanks, so long as you think ext3 is looking OK?

BTW. is it a known issue that ext3 fails fsx-linux? (I tried 2.6.21-rc3
IIRC, and ordered and writeback both eventually failed I think). ext2
does not.

> (If you want me to roll all my fixes and send them to you,u,
> 
> 5.)
> 
> 
> 1\f

c link
> 
> 5.orig?
> 
> \x11
> h2
> (

^^^ I hope that wasn't due to your running the new aops patches! ;)

> 
>  and send them to you,
> let me know).

I have merged them all up, so it should be fine. Thanks very much.

Nick

> 
> Thanks,
> Badari
> 
> ---
>  mm/filemap.c |    2 --
>  1 file changed, 2 deletions(-)
> 
> Index: linux-2.6.21-rc5/mm/filemap.c
> ===================================================================
> --- linux-2.6.21-rc5.orig/mm/filemap.c	2007-04-02 16:23:48.000000000 -0700
> +++ linux-2.6.21-rc5/mm/filemap.c	2007-04-02 16:53:13.000000000 -0700
> @@ -2383,14 +2383,12 @@ static ssize_t generic_perform_write(str
>  
>  	do {
>  		struct page *page;
> -		pgoff_t index;		/* Pagecache index for current page */
>  		unsigned long offset;	/* Offset into pagecache page */
>  		unsigned long bytes;	/* Bytes to write to page */
>  		size_t copied;		/* Bytes copied from user */
>  		void *fsdata;
>  
>  		offset = (pos & (PAGE_CACHE_SIZE - 1));
> -		index = pos >> PAGE_CACHE_SHIFT;
>  		bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
>  						iov_iter_count(i));
>  
> 
> 
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-02 20:18 ` Badari Pulavarty
@ 2007-04-03  0:45   ` Nick Piggin
  0 siblings, 0 replies; 43+ messages in thread
From: Nick Piggin @ 2007-04-03  0:45 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Mon, Apr 02, 2007 at 01:18:13PM -0700, Badari Pulavarty wrote:
> On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
> > Updated aops patchset against 2.6.21-rc5.
> > 
> > http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> > 
> > Files/dirs are 2.6.21-rc5-new-aops*
> > 
> > Contains numerous fixes from Mark and myself -- I'd say the core code is
> > getting reasonably stable at this point.
> > 
> > New stuff: patches from Mark and Steve for the cluster filesystems.
> > 
> > (compile only) patches for NFS, XFS, FUSE, eCryptfs. OK, they're untested,
> > but I have tried to put some effort in, so if maintainers would kindly
> > take a look... ext3 still needs review and porting to ext4, which I hope
> > someone will do eventually (I presume ext3 is still maintained)... does NFS
> > have a lock_page vs lock_kernel deadlock in its commit_write? (if yes, this
> > is fixable with the new write aops).
> 
> In case of error, write_begin() is *supposed* to unlock and release the
> page it locked. Right ?

Thanks.

BTW. I also had another instance of that problem in the NFS conversion.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-03  0:08   ` Nick Piggin
@ 2007-04-03  9:31     ` Nick Piggin
  2007-04-03 16:03       ` Badari Pulavarty
  0 siblings, 1 reply; 43+ messages in thread
From: Nick Piggin @ 2007-04-03  9:31 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Tue, Apr 03, 2007 at 02:08:53AM +0200, Nick Piggin wrote:
> > BTW, I will take a shot at ext4 tomorrow.
> 
> Thanks, so long as you think ext3 is looking OK?
> 
> BTW. is it a known issue that ext3 fails fsx-linux? (I tried 2.6.21-rc3
> IIRC, and ordered and writeback both eventually failed I think). ext2
> does not.

Well I just tested, and it is not fixed by the recent patch to revert
ext3_prepare_failure....

Is this a known issue? Is it an fsx-linux shortcoming? It is fairly
surprising because it basically makes it impossible to test ext3
changes with that nice tool :( I can submit the traces if anyone is
interested, however I can reproduce in UML on an ext3 writeback
filesystem with no arguments (except the filename).


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-02 23:49   ` Nick Piggin
@ 2007-04-03 15:57     ` Badari Pulavarty
  2007-04-04  2:31       ` Nick Piggin
  0 siblings, 1 reply; 43+ messages in thread
From: Badari Pulavarty @ 2007-04-03 15:57 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Tue, 2007-04-03 at 01:49 +0200, Nick Piggin wrote:
> On Mon, Apr 02, 2007 at 04:14:59PM -0700, Badari Pulavarty wrote:
> > On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
> > > Updated aops patchset against 2.6.21-rc5.
> > > 
> > > http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> > > 
> > > Files/dirs are 2.6.21-rc5-new-aops*
> > 
> > Baaah !! You took away ext3 -nobh option :(
> 
> Ahh, just the person I wanted to ask! ;) How useful is it, out of curiosity?
> What sort of users use it, and what sort of improvements do they get?

Well, at the time it helped lowmem issue on x86. Bufferheads end up
using lots of lowmem and caused bunch of issues. We used it heavily
on our local machines. Not sure if any customers use it (my guess
is not).

I wanted to completely kill bufferhead usage from ext3. writeback
mode is simple and easy. Ordered mode, need attach the buffers
to transaction - which needed complete rework. So never gotten
around to doing it.

> 
> > Do you have plans to support nobh versions of block_write_begin/end ?
> 
> At the moment I'm looking at doing it another way. Having the seperate
> nobh path is quite annoying -- there are still bugs in it and it is
> simply less tested (not that the bh path is bug-free either, but it
> is better to be able to concentrate on one). So I hope to merge them
> at some point to restore that functionality. 
> 
> 
> > BTW, I don't see how block_write_end() can ever return < 0.
> > If so, here is the cleanup fix for ext3 (no unnecessay checks).
> 
> Shouldn't we allow for the possibility?

Well, if there is a (remote) possibility of failure - yes
we should handle it. Otherwise its dead code, unnecessary
checks in hotpath and makes code ugly by confusing non-possible
error cases from possible ones. 

Even today - ext3 code has checks to handle  generic_commit_write()
failures. But it never returns failure. I wanted to clean up ext3 code,
but I left it for allowing for possibility.

Thanks,
Badari


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-02 23:58   ` Nick Piggin
@ 2007-04-03 15:59     ` Badari Pulavarty
  2007-04-04  2:33       ` Nick Piggin
  0 siblings, 1 reply; 43+ messages in thread
From: Badari Pulavarty @ 2007-04-03 15:59 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Tue, 2007-04-03 at 01:58 +0200, Nick Piggin wrote:
> On Mon, Apr 02, 2007 at 01:44:59PM -0700, Badari Pulavarty wrote:
> > On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
> > > Updated aops patchset against 2.6.21-rc5.
> > > 
> > > http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> > > 
> > > Files/dirs are 2.6.21-rc5-new-aops*
> > > 
> > > Contains numerous fixes from Mark and myself -- I'd say the core code is
> > > getting reasonably stable at this point.
> > 
> > 
> > ext3_write_failure() conversion is NOT quite correct.
> > Old code was returing failure of do_journal_get_write_access(),
> > where as your changes will return journal_stop(). 
> 
> Good catch. This still requires a journal_stop, however?

Yes.
> 
> How's this look?

Looks good. I am little worried about the page-lock vs journal
start/stop order. I remember running into deadlock while trying
to do writepages() for ordered mode. I think we are okay here.
I will take a closer look.

Thanks,
Badari

> 
> --
> Index: linux-2.6/fs/ext3/inode.c
> ===================================================================
> --- linux-2.6.orig/fs/ext3/inode.c
> +++ linux-2.6/fs/ext3/inode.c
> @@ -1167,11 +1167,13 @@ static int ext3_write_failure(struct fil
>  	handle_t *handle = ext3_journal_current_handle();
>  
>  	if (ext3_should_writeback_data(mapping->host)) {
> +skip_and_stop:
>  		/* optimization: no constraints about data */
> +		ret = ext3_journal_stop(handle);
>  skip:
>  		unlock_page(page);
>  		page_cache_release(page);
> -		return ext3_journal_stop(handle);
> +		return ret;
>  	}
>  
>  	from = pos & (PAGE_CACHE_SIZE - 1);
> @@ -1196,8 +1198,10 @@ skip:
>  			break;
>  		if (ext3_should_journal_data(mapping->host)) {
>  			ret = do_journal_get_write_access(handle, bh);
> -			if (ret)
> +			if (ret) {
> +				ext3_journal_stop(handle);
>  				goto skip;
> +			}
>  		}
>  	/*
>  	 * block_start here becomes the first block where the current iteration
> @@ -1205,7 +1209,7 @@ skip:
>  	 */
>  	}
>  	if (block_start <= from)
> -		goto skip;
> +		goto skip_and_stop;
>  
>  	/* commit allocated and zeroed buffers */
>  	return mapping->a_ops->write_end(file, mapping, pos, len,


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-03  9:31     ` Nick Piggin
@ 2007-04-03 16:03       ` Badari Pulavarty
  2007-04-04  2:37         ` Nick Piggin
  0 siblings, 1 reply; 43+ messages in thread
From: Badari Pulavarty @ 2007-04-03 16:03 UTC (permalink / raw)
  To: Nick Piggin, mcao; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Tue, 2007-04-03 at 11:31 +0200, Nick Piggin wrote:
> On Tue, Apr 03, 2007 at 02:08:53AM +0200, Nick Piggin wrote:
> > > BTW, I will take a shot at ext4 tomorrow.
> > 
> > Thanks, so long as you think ext3 is looking OK?
> > 
> > BTW. is it a known issue that ext3 fails fsx-linux? (I tried 2.6.21-rc3
> > IIRC, and ordered and writeback both eventually failed I think). ext2
> > does not.
> 
> Well I just tested, and it is not fixed by the recent patch to revert
> ext3_prepare_failure....
> 
> Is this a known issue? Is it an fsx-linux shortcoming? It is fairly
> surprising because it basically makes it impossible to test ext3
> changes with that nice tool :( I can submit the traces if anyone is
> interested, however I can reproduce in UML on an ext3 writeback
> filesystem with no arguments (except the filename).
> 

I haven't seen an fsx failure recently. I ran 4 copies of fsx
on 2.6.21-rc5 and 2.6.21-rc5+aops, without any problems for
12+ hours. What am I missing ?

Mingming, do you know of any fsx failures recently ?

Thanks,
Badari


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-02 12:09 [ANNOUNCE] new new aops patchset Nick Piggin
                   ` (7 preceding siblings ...)
  2007-04-03  0:00 ` Badari Pulavarty
@ 2007-04-03 17:35 ` Badari Pulavarty
  2007-04-04  3:05   ` Nick Piggin
  2007-04-04 22:10 ` Mark Fasheh
  2007-04-05  0:10 ` David Chinner
  10 siblings, 1 reply; 43+ messages in thread
From: Badari Pulavarty @ 2007-04-03 17:35 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
> Updated aops patchset against 2.6.21-rc5.
> 
> http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> 
> Files/dirs are 2.6.21-rc5-new-aops*

Here is the ext4 support for it. This is a simple port from 
ext3 code. Ran fsx without any problems :)

BTW, I never clearly understood what exactly the problem these
new interfaces are solving and how :( I can dig through the
archives and try to figure out. Would you care to put a 
small description of the *actual* problem and how these
new aops are needed (vs hacking the existing methods).

Thanks,
Badari

---
 fs/ext4/inode.c |  171 +++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 113 insertions(+), 58 deletions(-)

Index: linux-2.6.21-rc5/fs/ext4/inode.c
===================================================================
--- linux-2.6.21-rc5.orig/fs/ext4/inode.c	2007-03-25 15:56:23.000000000 -0700
+++ linux-2.6.21-rc5/fs/ext4/inode.c	2007-04-03 10:01:34.000000000 -0700
@@ -1154,23 +1154,30 @@ static int do_journal_get_write_access(h
  * This content is expected to be set to zeroes by block_prepare_write().
  * 2006/10/14  SAW
  */
-static int ext4_prepare_failure(struct file *file, struct page *page,
-				unsigned from, unsigned to)
+static int ext4_write_failure(struct file *file, struct address_space *mapping,
+				loff_t pos, unsigned len,
+				struct page *page, void *fsdata)
 {
-	struct address_space *mapping;
 	struct buffer_head *bh, *head, *next;
 	unsigned block_start, block_end;
 	unsigned blocksize;
+	unsigned from, to;
 	int ret;
 	handle_t *handle = ext4_journal_current_handle();
 
-	mapping = page->mapping;
 	if (ext4_should_writeback_data(mapping->host)) {
 		/* optimization: no constraints about data */
+skip_and_stop:
+		ret = ext4_journal_stop(handle);
 skip:
-		return ext4_journal_stop(handle);
+		unlock_page(page);
+		page_cache_release(page);
+		return ret;
 	}
 
+	from = pos & (PAGE_CACHE_SIZE - 1);
+	to = from + len;
+
 	head = page_buffers(page);
 	blocksize = head->b_size;
 	for (	bh = head, block_start = 0;
@@ -1192,7 +1199,7 @@ skip:
 			ret = do_journal_get_write_access(handle, bh);
 			if (ret) {
 				ext4_journal_stop(handle);
-				return ret;
+				goto skip;
 			}
 		}
 	/*
@@ -1201,43 +1208,64 @@ skip:
 	 */
 	}
 	if (block_start <= from)
-		goto skip;
+		goto skip_and_stop;
 
 	/* commit allocated and zeroed buffers */
-	return mapping->a_ops->commit_write(file, page, from, block_start);
+	return mapping->a_ops->write_end(file, mapping, pos, len,
+					block_start - from, page, fsdata);
 }
 
-static int ext4_prepare_write(struct file *file, struct page *page,
-			      unsigned from, unsigned to)
+static int ext4_write_begin(struct file *file, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata)
 {
-	struct inode *inode = page->mapping->host;
-	int ret, ret2;
+	struct inode *inode = mapping->host;
 	int needed_blocks = ext4_writepage_trans_blocks(inode);
+	int ret, ret2;
 	handle_t *handle;
 	int retries = 0;
+	struct page *page;
+	pgoff_t index;
+	unsigned start, end;
+
+	index = pos >> PAGE_CACHE_SHIFT;
+	start = pos * (PAGE_CACHE_SIZE - 1);
+	end = start + len;
 
 retry:
+	page = __grab_cache_page(mapping, index);
+	if (!page)
+		return -ENOMEM;
+	*pagep = page;
+
 	handle = ext4_journal_start(inode, needed_blocks);
-	if (IS_ERR(handle))
+	if (IS_ERR(handle)) {
+		unlock_page(page);
+		page_cache_release(page);
 		return PTR_ERR(handle);
-	if (test_opt(inode->i_sb, NOBH) && ext4_should_writeback_data(inode))
-		ret = nobh_prepare_write(page, from, to, ext4_get_block);
-	else
-		ret = block_prepare_write(page, from, to, ext4_get_block);
+	}
+	ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+							ext4_get_block);
 	if (ret)
 		goto failure;
 
 	if (ext4_should_journal_data(inode)) {
 		ret = walk_page_buffers(handle, page_buffers(page),
-				from, to, NULL, do_journal_get_write_access);
-		if (ret)
+				start, end, NULL, do_journal_get_write_access);
+		if (ret) {
 			/* fatal error, just put the handle and return */
 			ext4_journal_stop(handle);
+			unlock_page(page);
+			page_cache_release(page);
+		}
 	}
 	return ret;
 
 failure:
-	ret2 = ext4_prepare_failure(file, page, from, to);
+	ret2 = ext4_write_failure(file, mapping, pos, len, page, *fsdata);
+	/* trim off blocks (XXX: need better helpers than vmtruncate) */
+	if (pos + len > inode->i_size)
+		vmtruncate(inode, inode->i_size);
 	if (ret2 < 0)
 		return ret2;
 	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
@@ -1251,12 +1279,12 @@ int ext4_journal_dirty_data(handle_t *ha
 	int err = jbd2_journal_dirty_data(handle, bh);
 	if (err)
 		ext4_journal_abort_handle(__FUNCTION__, __FUNCTION__,
-						bh, handle,err);
+						bh, handle, err);
 	return err;
 }
 
-/* For commit_write() in data=journal mode */
-static int commit_write_fn(handle_t *handle, struct buffer_head *bh)
+/* For write_end() in data=journal mode */
+static int write_end_fn(handle_t *handle, struct buffer_head *bh)
 {
 	if (!buffer_mapped(bh) || buffer_freed(bh))
 		return 0;
@@ -1271,78 +1299,103 @@ static int commit_write_fn(handle_t *han
  * ext4 never places buffers on inode->i_mapping->private_list.  metadata
  * buffers are managed internally.
  */
-static int ext4_ordered_commit_write(struct file *file, struct page *page,
-			     unsigned from, unsigned to)
+static int ext4_ordered_write_end(struct file *file,
+				struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata)
 {
 	handle_t *handle = ext4_journal_current_handle();
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = file->f_mapping->host;
+	unsigned from, to;
 	int ret = 0, ret2;
 
+	from = pos & (PAGE_CACHE_SIZE - 1);
+	to = from + len;
+
 	ret = walk_page_buffers(handle, page_buffers(page),
 		from, to, NULL, ext4_journal_dirty_data);
 
 	if (ret == 0) {
 		/*
-		 * generic_commit_write() will run mark_inode_dirty() if i_size
+		 * block_write_end() will run mark_inode_dirty() if i_size
 		 * changes.  So let's piggyback the i_disksize mark_inode_dirty
 		 * into that.
 		 */
 		loff_t new_i_size;
 
-		new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+		new_i_size = pos + copied;
 		if (new_i_size > EXT4_I(inode)->i_disksize)
 			EXT4_I(inode)->i_disksize = new_i_size;
-		ret = generic_commit_write(file, page, from, to);
+		copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+		if (copied < 0)
+			ret = copied;
 	}
 	ret2 = ext4_journal_stop(handle);
 	if (!ret)
 		ret = ret2;
-	return ret;
+	return ret ? ret : copied;
 }
 
-static int ext4_writeback_commit_write(struct file *file, struct page *page,
-			     unsigned from, unsigned to)
+static int ext4_writeback_write_end(struct file *file,
+				struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata)
 {
 	handle_t *handle = ext4_journal_current_handle();
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = file->f_mapping->host;
 	int ret = 0, ret2;
 	loff_t new_i_size;
 
-	new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	new_i_size = pos + copied;
 	if (new_i_size > EXT4_I(inode)->i_disksize)
 		EXT4_I(inode)->i_disksize = new_i_size;
 
-	if (test_opt(inode->i_sb, NOBH) && ext4_should_writeback_data(inode))
-		ret = nobh_commit_write(file, page, from, to);
-	else
-		ret = generic_commit_write(file, page, from, to);
+	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+	if (copied < 0)
+		ret = copied;
 
 	ret2 = ext4_journal_stop(handle);
 	if (!ret)
 		ret = ret2;
-	return ret;
+	return ret ? ret : copied;
 }
 
-static int ext4_journalled_commit_write(struct file *file,
-			struct page *page, unsigned from, unsigned to)
+static int ext4_journalled_write_end(struct file *file,
+				struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata)
 {
 	handle_t *handle = ext4_journal_current_handle();
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = mapping->host;
 	int ret = 0, ret2;
 	int partial = 0;
-	loff_t pos;
+	unsigned from, to;
 
-	/*
-	 * Here we duplicate the generic_commit_write() functionality
-	 */
-	pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	from = pos & (PAGE_CACHE_SIZE - 1);
+	to = from + len;
+
+	if (copied < len) {
+		if (PageUptodate(page))
+			copied = len;
+		else {
+			/* XXX: don't need to zero new buffers because we abort? */
+			copied = 0;
+			if (!is_handle_aborted(handle))
+				jbd2_journal_abort_handle(handle);
+			unlock_page(page);
+			page_cache_release(page);
+			goto out;
+		}
+	}
 
 	ret = walk_page_buffers(handle, page_buffers(page), from,
-				to, &partial, commit_write_fn);
+				to, &partial, write_end_fn);
 	if (!partial)
 		SetPageUptodate(page);
-	if (pos > inode->i_size)
-		i_size_write(inode, pos);
+	unlock_page(page);
+	page_cache_release(page);
+	if (pos+copied > inode->i_size)
+		i_size_write(inode, pos+copied);
 	EXT4_I(inode)->i_state |= EXT4_STATE_JDATA;
 	if (inode->i_size > EXT4_I(inode)->i_disksize) {
 		EXT4_I(inode)->i_disksize = inode->i_size;
@@ -1350,10 +1403,12 @@ static int ext4_journalled_commit_write(
 		if (!ret)
 			ret = ret2;
 	}
+
+out:
 	ret2 = ext4_journal_stop(handle);
 	if (!ret)
 		ret = ret2;
-	return ret;
+	return ret ? ret : copied;
 }
 
 /*
@@ -1611,7 +1666,7 @@ static int ext4_journalled_writepage(str
 			PAGE_CACHE_SIZE, NULL, do_journal_get_write_access);
 
 		err = walk_page_buffers(handle, page_buffers(page), 0,
-				PAGE_CACHE_SIZE, NULL, commit_write_fn);
+				PAGE_CACHE_SIZE, NULL, write_end_fn);
 		if (ret == 0)
 			ret = err;
 		EXT4_I(inode)->i_state |= EXT4_STATE_JDATA;
@@ -1771,8 +1826,8 @@ static const struct address_space_operat
 	.readpages	= ext4_readpages,
 	.writepage	= ext4_ordered_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= ext4_prepare_write,
-	.commit_write	= ext4_ordered_commit_write,
+	.write_begin	= ext4_write_begin,
+	.write_end	= ext4_ordered_write_end,
 	.bmap		= ext4_bmap,
 	.invalidatepage	= ext4_invalidatepage,
 	.releasepage	= ext4_releasepage,
@@ -1785,8 +1840,8 @@ static const struct address_space_operat
 	.readpages	= ext4_readpages,
 	.writepage	= ext4_writeback_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= ext4_prepare_write,
-	.commit_write	= ext4_writeback_commit_write,
+	.write_begin	= ext4_write_begin,
+	.write_end	= ext4_writeback_write_end,
 	.bmap		= ext4_bmap,
 	.invalidatepage	= ext4_invalidatepage,
 	.releasepage	= ext4_releasepage,
@@ -1799,8 +1854,8 @@ static const struct address_space_operat
 	.readpages	= ext4_readpages,
 	.writepage	= ext4_journalled_writepage,
 	.sync_page	= block_sync_page,
-	.prepare_write	= ext4_prepare_write,
-	.commit_write	= ext4_journalled_commit_write,
+	.write_begin	= ext4_write_begin,
+	.write_end	= ext4_journalled_write_end,
 	.set_page_dirty	= ext4_journalled_set_page_dirty,
 	.bmap		= ext4_bmap,
 	.invalidatepage	= ext4_invalidatepage,



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-03 15:57     ` Badari Pulavarty
@ 2007-04-04  2:31       ` Nick Piggin
  0 siblings, 0 replies; 43+ messages in thread
From: Nick Piggin @ 2007-04-04  2:31 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Tue, Apr 03, 2007 at 08:57:03AM -0700, Badari Pulavarty wrote:
> On Tue, 2007-04-03 at 01:49 +0200, Nick Piggin wrote:
> > 
> > Ahh, just the person I wanted to ask! ;) How useful is it, out of curiosity?
> > What sort of users use it, and what sort of improvements do they get?
> 
> Well, at the time it helped lowmem issue on x86. Bufferheads end up
> using lots of lowmem and caused bunch of issues. We used it heavily
> on our local machines. Not sure if any customers use it (my guess
> is not).

Hmm, I think we need better reclaim for buffers as well. That might
help the problem. The thing that doesn't make sense with nobh is
that you deliberately throw out buffers for dirty pages, which you
know will be needed again for writeout. You also want to do as little
work as possible in the writeout path (ie. do it all in the dirtying
path) so as to avoid deadlocks and stuff.

Anyway, I will keep nobh in mind. A port to the new aops shouldn't
be impossible.


> > At the moment I'm looking at doing it another way. Having the seperate
> > nobh path is quite annoying -- there are still bugs in it and it is
> > simply less tested (not that the bh path is bug-free either, but it
> > is better to be able to concentrate on one). So I hope to merge them
> > at some point to restore that functionality. 
> > 
> > 
> > > BTW, I don't see how block_write_end() can ever return < 0.
> > > If so, here is the cleanup fix for ext3 (no unnecessay checks).
> > 
> > Shouldn't we allow for the possibility?
> 
> Well, if there is a (remote) possibility of failure - yes
> we should handle it. Otherwise its dead code, unnecessary
> checks in hotpath and makes code ugly by confusing non-possible
> error cases from possible ones. 
> 
> Even today - ext3 code has checks to handle  generic_commit_write()
> failures. But it never returns failure. I wanted to clean up ext3 code,
> but I left it for allowing for possibility.

Well I'm just thinking we should all for it in the interface from
the start, so that if we ever have a need to error out there, we
can without having to update all filesystems. I can't imagine a
reason today though...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-03 15:59     ` Badari Pulavarty
@ 2007-04-04  2:33       ` Nick Piggin
  0 siblings, 0 replies; 43+ messages in thread
From: Nick Piggin @ 2007-04-04  2:33 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Tue, Apr 03, 2007 at 08:59:57AM -0700, Badari Pulavarty wrote:
> On Tue, 2007-04-03 at 01:58 +0200, Nick Piggin wrote:
> > On Mon, Apr 02, 2007 at 01:44:59PM -0700, Badari Pulavarty wrote:
> > > On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
> > > > Updated aops patchset against 2.6.21-rc5.
> > > > 
> > > > http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> > > > 
> > > > Files/dirs are 2.6.21-rc5-new-aops*
> > > > 
> > > > Contains numerous fixes from Mark and myself -- I'd say the core code is
> > > > getting reasonably stable at this point.
> > > 
> > > 
> > > ext3_write_failure() conversion is NOT quite correct.
> > > Old code was returing failure of do_journal_get_write_access(),
> > > where as your changes will return journal_stop(). 
> > 
> > Good catch. This still requires a journal_stop, however?
> 
> Yes.
> > 
> > How's this look?
> 
> Looks good. I am little worried about the page-lock vs journal
> start/stop order.

Yeah that did cross my mind, but the old code does it. You can just
unlock before the journal_stop, it just takes a bit more code.

BTW. this is one thing the new interface is good for (ie. the filesystem
completely controls the page lock).

Of course, this whole thing needs updating for the ext3_write_failure
backout...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-03 16:03       ` Badari Pulavarty
@ 2007-04-04  2:37         ` Nick Piggin
  0 siblings, 0 replies; 43+ messages in thread
From: Nick Piggin @ 2007-04-04  2:37 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: mcao, Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Tue, Apr 03, 2007 at 09:03:50AM -0700, Badari Pulavarty wrote:
> On Tue, 2007-04-03 at 11:31 +0200, Nick Piggin wrote:
> > On Tue, Apr 03, 2007 at 02:08:53AM +0200, Nick Piggin wrote:
> > > > BTW, I will take a shot at ext4 tomorrow.
> > > 
> > > Thanks, so long as you think ext3 is looking OK?
> > > 
> > > BTW. is it a known issue that ext3 fails fsx-linux? (I tried 2.6.21-rc3
> > > IIRC, and ordered and writeback both eventually failed I think). ext2
> > > does not.
> > 
> > Well I just tested, and it is not fixed by the recent patch to revert
> > ext3_prepare_failure....
> > 
> > Is this a known issue? Is it an fsx-linux shortcoming? It is fairly
> > surprising because it basically makes it impossible to test ext3
> > changes with that nice tool :( I can submit the traces if anyone is
> > interested, however I can reproduce in UML on an ext3 writeback
> > filesystem with no arguments (except the filename).
> > 
> 
> I haven't seen an fsx failure recently. I ran 4 copies of fsx
> on 2.6.21-rc5 and 2.6.21-rc5+aops, without any problems for
> 12+ hours. What am I missing ?

Well this is in UML and mounted on loopback, so it could be a failure
in the driver stack somewhere. However ext2 is completely solid, so
I did suspect ext3.

In my configuration, some IO operations that would take a reasonable
amount of time on a real system will complete much more quickly.
There might be an issue there? Using a ramdisk or loop over tmpfs
might produce something.

I'll ping you again if I can reproduce outside of UML.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-03 17:35 ` Badari Pulavarty
@ 2007-04-04  3:05   ` Nick Piggin
  0 siblings, 0 replies; 43+ messages in thread
From: Nick Piggin @ 2007-04-04  3:05 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Tue, Apr 03, 2007 at 10:35:19AM -0700, Badari Pulavarty wrote:
> On Mon, 2007-04-02 at 14:09 +0200, Nick Piggin wrote:
> > Updated aops patchset against 2.6.21-rc5.
> > 
> > http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> > 
> > Files/dirs are 2.6.21-rc5-new-aops*
> 
> Here is the ext4 support for it. This is a simple port from 
> ext3 code. Ran fsx without any problems :)

Thanks. Again, I guess this needs updating for write_failure backout :(


> BTW, I never clearly understood what exactly the problem these
> new interfaces are solving and how :( I can dig through the
> archives and try to figure out. Would you care to put a 
> small description of the *actual* problem and how these
> new aops are needed (vs hacking the existing methods).

OK, so the problem from my point of view is the pagecache write
deadlock:

http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/2.6.21-rc5-new-aops/broken-out/mm-pagecache-write-deadlocks.patch

The problem with that fix is that it requires an extra copy (I
thought long and hard about different ways to fix this without
changing the aops API and without taking a performance hit).

It was pretty clear that the problem can be fixed easily and efficiently
in the mm/ code if the the API can be changed a bit (eg.
allow commit_write to commit a range which is smaller than the
prepare_write range).

The problem with changing the API is that you break all filesystems,
and out of tree filesystems. Even if we ignore the out of tree ones,
the existing ones can have difficult and subtle issues that I cannot
fix alone.

So I decided that the best way to go is to introduce the "slow" fix
for everyone, then add these new aops. Once everyone is moved to
the new aops, we remove the old ones and a lot of old cruft with
them.

So at this point we can make a clean break and come up with an API
that is not only capable of fixing the deadlock, but is also better
suited to modern filesystems (prepare_write was intorduced in 2.2 or
2.0 IIRC, and probably suffered from deadlocks even then!).

I don't know if write_begin/ write_end is final yet (which is why we
need your input), but it already does help a lot with the clustered
filesystems right now.

The reason we also have the (still unused) perform_write, is that was
my first attempt at a new interface. The problem with that is that
it is difficult to provide generic block helpers that are usable in
more complex ways. It also cannot be used for source data other than
memory in iovecs. However the reason why it is still there is because
I already introduced the iov_iter and infrastructure for it, and I
think it is a very nice, clean interface, and can be good for
high performance writes (I got positive results even with a naive
ext2 implementation).

Does that answer your question?

Thanks,
Nick

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-02 12:09 [ANNOUNCE] new new aops patchset Nick Piggin
                   ` (8 preceding siblings ...)
  2007-04-03 17:35 ` Badari Pulavarty
@ 2007-04-04 22:10 ` Mark Fasheh
  2007-04-04 22:39   ` Badari Pulavarty
                     ` (2 more replies)
  2007-04-05  0:10 ` David Chinner
  10 siblings, 3 replies; 43+ messages in thread
From: Mark Fasheh @ 2007-04-04 22:10 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Filesystems, Steven Whitehouse

On Mon, Apr 02, 2007 at 02:09:34PM +0200, Nick Piggin wrote:
> Updated aops patchset against 2.6.21-rc5.
> 
> http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/

Thanks again for pushing this stuff out Nick.

Fwiw, I had merge conflicts working against the latest git code from Linus -
it looks like Jens pushed some splice changes (some from you). A refreshed
fs-new-write-aops.patch is attached.

The ext3 patch also caused me some problems, but I just temporarily dropped
it for my immediate work.
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com


Introduce write_begin, write_end, and perform_write aops.

These are intended to replace prepare_write and commit_write with more
flexible alternatives that are also able to avoid the buffered write
deadlock problems efficiently (which prepare_write is unable to do).

Signed-off-by: Nick Piggin <npiggin@suse.de>


 Documentation/filesystems/Locking |    9 -
 Documentation/filesystems/vfs.txt |   36 +++++
 drivers/block/loop.c              |   69 +++--------
 fs/buffer.c                       |  161 +++++++++++++++++++++++---
 fs/libfs.c                        |   40 ++++++
 fs/namei.c                        |   47 +------
 fs/splice.c                       |  107 +----------------
 include/linux/buffer_head.h       |    7 +
 include/linux/fs.h                |   31 +++++
 include/linux/pagemap.h           |    2 
 mm/filemap.c                      |  229 ++++++++++++++++++++++++++++++++++----
 11 files changed, 513 insertions(+), 225 deletions(-)

Index: linux-2.6.git/include/linux/fs.h
===================================================================
--- linux-2.6.git.orig/include/linux/fs.h
+++ linux-2.6.git/include/linux/fs.h
@@ -389,6 +389,8 @@ enum positive_aop_returns {
 	AOP_TRUNCATED_PAGE	= 0x80001,
 };
 
+#define AOP_FLAG_UNINTERRUPTIBLE	0x0001 /* will not do a short write */
+
 /*
  * oh the beauties of C type declarations.
  */
@@ -449,6 +451,17 @@ struct address_space_operations {
 	 */
 	int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
 	int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
+
+	int (*write_begin)(struct file *, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata);
+	int (*write_end)(struct file *, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata);
+
+	ssize_t (*perform_write)(struct file *, struct address_space *,
+				struct iov_iter *, loff_t);
+
 	/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
 	sector_t (*bmap)(struct address_space *, sector_t);
 	void (*invalidatepage) (struct page *, unsigned long);
@@ -463,6 +476,18 @@ struct address_space_operations {
 	int (*launder_page) (struct page *);
 };
 
+/*
+ * pagecache_write_begin/pagecache_write_end must be used by general code
+ * to write into the pagecache.
+ */
+int pagecache_write_begin(struct file *, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata);
+
+int pagecache_write_end(struct file *, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata);
+
 struct backing_dev_info;
 struct address_space {
 	struct inode		*host;		/* owner: inode, block_device */
@@ -1907,6 +1932,12 @@ extern int simple_prepare_write(struct f
 			unsigned offset, unsigned to);
 extern int simple_commit_write(struct file *file, struct page *page,
 				unsigned offset, unsigned to);
+extern int simple_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata);
+extern int simple_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata);
 
 extern struct dentry *simple_lookup(struct inode *, struct dentry *, struct nameidata *);
 extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t *);
Index: linux-2.6.git/mm/filemap.c
===================================================================
--- linux-2.6.git.orig/mm/filemap.c
+++ linux-2.6.git/mm/filemap.c
@@ -2049,6 +2049,93 @@ inline int generic_write_checks(struct f
 }
 EXPORT_SYMBOL(generic_write_checks);
 
+int pagecache_write_begin(struct file *file, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata)
+{
+	const struct address_space_operations *aops = mapping->a_ops;
+
+	if (aops->write_begin) {
+		return aops->write_begin(file, mapping, pos, len, flags,
+							pagep, fsdata);
+	} else {
+		int ret;
+		pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+		unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
+		struct inode *inode = mapping->host;
+		struct page *page;
+again:
+		page = __grab_cache_page(mapping, index);
+		*pagep = page;
+		if (!page)
+			return -ENOMEM;
+
+		if (flags & AOP_FLAG_UNINTERRUPTIBLE && !PageUptodate(page)) {
+			/*
+			 * There is no way to resolve a short write situation
+			 * for a !Uptodate page (except by double copying in
+			 * the caller done by generic_perform_write_2copy).
+			 *
+			 * Instead, we have to bring it uptodate here.
+			 */
+			ret = aops->readpage(file, page);
+			page_cache_release(page);
+			if (ret) {
+				if (ret == AOP_TRUNCATED_PAGE)
+					goto again;
+				return ret;
+			}
+			goto again;
+		}
+
+		ret = aops->prepare_write(file, page, offset, offset+len);
+		if (ret) {
+			if (ret != AOP_TRUNCATED_PAGE)
+				unlock_page(page);
+			page_cache_release(page);
+			if (pos + len > inode->i_size)
+				vmtruncate(inode, inode->i_size);
+			if (ret == AOP_TRUNCATED_PAGE)
+				goto again;
+		}
+		return ret;
+	}
+}
+EXPORT_SYMBOL(pagecache_write_begin);
+
+int pagecache_write_end(struct file *file, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata)
+{
+	const struct address_space_operations *aops = mapping->a_ops;
+	int ret;
+
+	if (aops->write_begin) {
+		ret = aops->write_end(file, mapping, pos, len, copied,
+							page, fsdata);
+	} else {
+		unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
+		struct inode *inode = mapping->host;
+
+		flush_dcache_page(page);
+		ret = aops->commit_write(file, page, offset, offset+len);
+		unlock_page(page);
+		page_cache_release(page);
+		BUG_ON(ret == AOP_TRUNCATED_PAGE); /* can't deal with */
+
+		if (ret < 0) {
+			if (pos + len > inode->i_size)
+				vmtruncate(inode, inode->i_size);
+		} else if (ret > 0)
+			ret = min_t(size_t, copied, ret);
+		else
+			ret = copied;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(pagecache_write_end);
+
 ssize_t
 generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
 		unsigned long *nr_segs, loff_t pos, loff_t *ppos,
@@ -2092,8 +2179,7 @@ EXPORT_SYMBOL(generic_file_direct_write)
  * Find or create a page at the given pagecache position. Return the locked
  * page. This function is specifically for buffered writes.
  */
-static struct page *__grab_cache_page(struct address_space *mapping,
-							pgoff_t index)
+struct page *__grab_cache_page(struct address_space *mapping, pgoff_t index)
 {
 	int status;
 	struct page *page;
@@ -2114,20 +2200,16 @@ repeat:
 	}
 	return page;
 }
+EXPORT_SYMBOL(__grab_cache_page);
 
-ssize_t
-generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
-		unsigned long nr_segs, loff_t pos, loff_t *ppos,
-		size_t count, ssize_t written)
+static ssize_t generic_perform_write_2copy(struct file *file,
+				struct iov_iter *i, loff_t pos)
 {
-	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
 	const struct address_space_operations *a_ops = mapping->a_ops;
-	struct inode 	*inode = mapping->host;
-	long		status = 0;
-	struct iov_iter i;
-
-	iov_iter_init(&i, iov, nr_segs, count, written);
+	struct inode *inode = mapping->host;
+	long status = 0;
+	ssize_t written = 0;
 
 	do {
 		struct page *src_page;
@@ -2140,7 +2222,7 @@ generic_file_buffered_write(struct kiocb
 		offset = (pos & (PAGE_CACHE_SIZE - 1));
 		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
-						iov_iter_count(&i));
+						iov_iter_count(i));
 
 		/*
 		 * a non-NULL src_page indicates that we're doing the
@@ -2158,7 +2240,7 @@ generic_file_buffered_write(struct kiocb
 		 * to check that the address is actually valid, when atomic
 		 * usercopies are used, below.
 		 */
-		if (unlikely(iov_iter_fault_in_readable(&i))) {
+		if (unlikely(iov_iter_fault_in_readable(i))) {
 			status = -EFAULT;
 			break;
 		}
@@ -2189,7 +2271,7 @@ generic_file_buffered_write(struct kiocb
 			 * same reason as we can't take a page fault with a
 			 * page locked (as explained below).
 			 */
-			copied = iov_iter_copy_from_user(src_page, &i,
+			copied = iov_iter_copy_from_user(src_page, i,
 								offset, bytes);
 			if (unlikely(copied == 0)) {
 				status = -EFAULT;
@@ -2214,7 +2296,6 @@ generic_file_buffered_write(struct kiocb
 				page_cache_release(src_page);
 				continue;
 			}
-
 		}
 
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
@@ -2236,7 +2317,7 @@ generic_file_buffered_write(struct kiocb
 			 * really matter.
 			 */
 			pagefault_disable();
-			copied = iov_iter_copy_from_user_atomic(page, &i,
+			copied = iov_iter_copy_from_user_atomic(page, i,
 								offset, bytes);
 			pagefault_enable();
 		} else {
@@ -2262,9 +2343,9 @@ generic_file_buffered_write(struct kiocb
 		if (src_page)
 			page_cache_release(src_page);
 
-		iov_iter_advance(&i, copied);
-		written += copied;
+		iov_iter_advance(i, copied);
 		pos += copied;
+		written += copied;
 
 		balance_dirty_pages_ratelimited(mapping);
 		cond_resched();
@@ -2288,13 +2369,119 @@ fs_write_aop_error:
 			continue;
 		else
 			break;
-	} while (iov_iter_count(&i));
-	*ppos = pos;
+	} while (iov_iter_count(i));
+
+	return written ? written : status;
+}
+
+static ssize_t generic_perform_write(struct file *file,
+				struct iov_iter *i, loff_t pos)
+{
+	struct address_space *mapping = file->f_mapping;
+	const struct address_space_operations *a_ops = mapping->a_ops;
+	long status = 0;
+	ssize_t written = 0;
+
+	do {
+		struct page *page;
+		pgoff_t index;		/* Pagecache index for current page */
+		unsigned long offset;	/* Offset into pagecache page */
+		unsigned long bytes;	/* Bytes to write to page */
+		size_t copied;		/* Bytes copied from user */
+		void *fsdata;
+
+		offset = (pos & (PAGE_CACHE_SIZE - 1));
+		index = pos >> PAGE_CACHE_SHIFT;
+		bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
+						iov_iter_count(i));
+
+again:
+
+		/*
+		 * Bring in the user page that we will copy from _first_.
+		 * Otherwise there's a nasty deadlock on copying from the
+		 * same page as we're writing to, without it being marked
+		 * up-to-date.
+		 *
+		 * Not only is this an optimisation, but it is also required
+		 * to check that the address is actually valid, when atomic
+		 * usercopies are used, below.
+		 */
+		if (unlikely(iov_iter_fault_in_readable(i))) {
+			status = -EFAULT;
+			break;
+		}
+
+		status = a_ops->write_begin(file, mapping, pos, bytes, 0,
+						&page, &fsdata);
+		if (unlikely(status))
+			break;
+
+		pagefault_disable();
+		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
+		pagefault_enable();
+		flush_dcache_page(page);
+
+		status = a_ops->write_end(file, mapping, pos, bytes, copied,
+						page, fsdata);
+		if (unlikely(status < 0))
+			break;
+		copied = status;
+
+		cond_resched();
+
+		if (unlikely(copied == 0)) {
+			/*
+			 * If we were unable to copy any data at all, we must
+			 * fall back to a single segment length write.
+			 *
+			 * If we didn't fallback here, we could livelock
+			 * because not all segments in the iov can be copied at
+			 * once without a pagefault.
+			 */
+			bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
+						iov_iter_single_seg_count(i));
+			goto again;
+		}
+		iov_iter_advance(i, copied);
+		pos += copied;
+		written += copied;
+
+		balance_dirty_pages_ratelimited(mapping);
+
+	} while (iov_iter_count(i));
+
+	return written ? written : status;
+}
+
+ssize_t
+generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
+		unsigned long nr_segs, loff_t pos, loff_t *ppos,
+		size_t count, ssize_t written)
+{
+	struct file *file = iocb->ki_filp;
+	struct address_space *mapping = file->f_mapping;
+	const struct address_space_operations *a_ops = mapping->a_ops;
+	struct inode *inode = mapping->host;
+	ssize_t status;
+	struct iov_iter i;
+
+	iov_iter_init(&i, iov, nr_segs, count, written);
+	if (a_ops->perform_write)
+		status = a_ops->perform_write(file, mapping, &i, pos);
+	else if (a_ops->write_begin)
+		status = generic_perform_write(file, &i, pos);
+	else
+		status = generic_perform_write_2copy(file, &i, pos);
 
-	/*
-	 * For now, when the user asks for O_SYNC, we'll actually give O_DSYNC
-	 */
 	if (likely(status >= 0)) {
+		written += status;
+		*ppos = pos + status;
+
+		/*
+		 * For now, when the user asks for O_SYNC, we'll actually give
+		 * O_DSYNC
+		 */
 		if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
 			if (!a_ops->writepage || !is_sync_kiocb(iocb))
 				status = generic_osync_inode(inode, mapping,
Index: linux-2.6.git/fs/buffer.c
===================================================================
--- linux-2.6.git.orig/fs/buffer.c
+++ linux-2.6.git/fs/buffer.c
@@ -1840,44 +1840,54 @@ static int __block_prepare_write(struct 
 		if (!buffer_uptodate(*wait_bh))
 			err = -EIO;
 	}
-	if (!err) {
-		bh = head;
-		do {
-			if (buffer_new(bh))
-				clear_buffer_new(bh);
-		} while ((bh = bh->b_this_page) != head);
-		return 0;
-	}
-	/* Error case: */
-	/*
-	 * Zero out any newly allocated blocks to avoid exposing stale
-	 * data.  If BH_New is set, we know that the block was newly
-	 * allocated in the above loop.
-	 */
-	bh = head;
+	if (unlikely(err))
+		page_zero_new_buffers(page, from, to);
+	return err;
+}
+
+/*
+ * If a page has any new buffers, zero them out here, and mark them uptodate
+ * and dirty so they'll be written out (in order to prevent uninitialised
+ * block data from leaking). And clear the new bit.
+ */
+void page_zero_new_buffers(struct page *page, unsigned from, unsigned to)
+{
+	unsigned int block_start, block_end;
+	struct buffer_head *head, *bh;
+
+	BUG_ON(!PageLocked(page));
+	if (!page_has_buffers(page))
+		return;
+
+	bh = head = page_buffers(page);
 	block_start = 0;
 	do {
-		block_end = block_start+blocksize;
-		if (block_end <= from)
-			goto next_bh;
-		if (block_start >= to)
-			break;
+		block_end = block_start + bh->b_size;
+
 		if (buffer_new(bh)) {
-			void *kaddr;
+			if (block_end > from && block_start < to) {
+				if (!PageUptodate(page)) {
+					unsigned start, end;
+					void *kaddr;
+
+					start = max(from, block_start);
+					end = min(to, block_end);
+
+					kaddr = kmap_atomic(page, KM_USER0);
+					memset(kaddr+start, 0, end - start);
+					flush_dcache_page(page);
+					kunmap_atomic(kaddr, KM_USER0);
+					set_buffer_uptodate(bh);
+				}
 
-			clear_buffer_new(bh);
-			kaddr = kmap_atomic(page, KM_USER0);
-			memset(kaddr+block_start, 0, bh->b_size);
-			flush_dcache_page(page);
-			kunmap_atomic(kaddr, KM_USER0);
-			set_buffer_uptodate(bh);
-			mark_buffer_dirty(bh);
+				clear_buffer_new(bh);
+				mark_buffer_dirty(bh);
+			}
 		}
-next_bh:
+
 		block_start = block_end;
 		bh = bh->b_this_page;
 	} while (bh != head);
-	return err;
 }
 
 static int __block_commit_write(struct inode *inode, struct page *page,
@@ -1901,6 +1911,7 @@ static int __block_commit_write(struct i
 			set_buffer_uptodate(bh);
 			mark_buffer_dirty(bh);
 		}
+		clear_buffer_new(bh);
 	}
 
 	/*
@@ -1915,6 +1926,115 @@ static int __block_commit_write(struct i
 }
 
 /*
+ * block_write_begin takes care of the basic task of block allocation and
+ * bringing partial write blocks uptodate first.
+ *
+ * If *pagep is not NULL, then block_write_begin uses the locked page
+ * at *pagep rather than allocating its own. In this case, the page will
+ * not be unlocked or deallocated on failure.
+ */
+int block_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata,
+			get_block_t *get_block)
+{
+	struct inode *inode = mapping->host;
+	int status = 0;
+	struct page *page;
+	pgoff_t index;
+	unsigned start, end;
+	int ownpage = 0;
+
+	index = pos >> PAGE_CACHE_SHIFT;
+	start = pos & (PAGE_CACHE_SIZE - 1);
+	end = start + len;
+
+	page = *pagep;
+	if (page == NULL) {
+		ownpage = 1;
+		page = __grab_cache_page(mapping, index);
+		if (!page) {
+			status = -ENOMEM;
+			goto out;
+		}
+		*pagep = page;
+	} else
+		BUG_ON(!PageLocked(page));
+
+	status = __block_prepare_write(inode, page, start, end, get_block);
+	if (unlikely(status)) {
+		ClearPageUptodate(page);
+
+		if (ownpage) {
+			unlock_page(page);
+			page_cache_release(page);
+
+			/*
+			 * prepare_write() may have instantiated a few blocks
+			 * outside i_size.  Trim these off again. Don't need
+			 * i_size_read because we hold i_mutex.
+			 */
+			if (pos + len > inode->i_size)
+				vmtruncate(inode, inode->i_size);
+		}
+		goto out;
+	}
+
+out:
+	return status;
+}
+EXPORT_SYMBOL(block_write_begin);
+
+int block_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
+{
+	struct inode *inode = mapping->host;
+	unsigned start;
+
+	start = pos & (PAGE_CACHE_SIZE - 1);
+
+	if (unlikely(copied < len)) {
+		/*
+		 * The buffers that were written will now be uptodate, so we
+		 * don't have to worry about a readpage reading them and
+		 * overwriting a partial write. However if we have encountered
+		 * a short write and only partially written into a buffer, it
+		 * will not be marked uptodate, so a readpage might come in and
+		 * destroy our partial write.
+		 *
+		 * Do the simplest thing, and just treat any short write to a
+		 * non uptodate page as a zero-length write, and force the
+		 * caller to redo the whole thing.
+		 */
+		if (!PageUptodate(page))
+			copied = 0;
+
+		page_zero_new_buffers(page, start+copied, start+len);
+	}
+	flush_dcache_page(page);
+
+	/* This could be a short (even 0-length) commit */
+	__block_commit_write(inode, page, start, start+copied);
+
+	unlock_page(page);
+	mark_page_accessed(page); /* XXX: put this in caller? */
+	page_cache_release(page);
+
+	/*
+	 * No need to use i_size_read() here, the i_size
+	 * cannot change under us because we hold i_mutex.
+	 */
+	if (pos+copied > inode->i_size) {
+		i_size_write(inode, pos+copied);
+		mark_inode_dirty(inode);
+	}
+
+	return copied;
+}
+EXPORT_SYMBOL(block_write_end);
+
+/*
  * Generic "read page" function for block devices that have the normal
  * get_block functionality. This is most of the block device filesystems.
  * Reads the page asynchronously --- the unlock_buffer() and
Index: linux-2.6.git/include/linux/buffer_head.h
===================================================================
--- linux-2.6.git.orig/include/linux/buffer_head.h
+++ linux-2.6.git/include/linux/buffer_head.h
@@ -202,6 +202,13 @@ void block_invalidatepage(struct page *p
 int block_write_full_page(struct page *page, get_block_t *get_block,
 				struct writeback_control *wbc);
 int block_read_full_page(struct page*, get_block_t*);
+int block_write_begin(struct file *, struct address_space *,
+				loff_t, unsigned, unsigned,
+				struct page **, void **, get_block_t*);
+int block_write_end(struct file *, struct address_space *,
+				loff_t, unsigned, unsigned,
+				struct page *, void *);
+void page_zero_new_buffers(struct page *page, unsigned from, unsigned to);
 int block_prepare_write(struct page*, unsigned, unsigned, get_block_t*);
 int cont_prepare_write(struct page*, unsigned, unsigned, get_block_t*,
 				loff_t *);
Index: linux-2.6.git/include/linux/pagemap.h
===================================================================
--- linux-2.6.git.orig/include/linux/pagemap.h
+++ linux-2.6.git/include/linux/pagemap.h
@@ -85,6 +85,8 @@ unsigned find_get_pages_contig(struct ad
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages);
 
+struct page *__grab_cache_page(struct address_space *mapping, pgoff_t index);
+
 /*
  * Returns locked page at given index in given cache, creating it if needed.
  */
Index: linux-2.6.git/fs/libfs.c
===================================================================
--- linux-2.6.git.orig/fs/libfs.c
+++ linux-2.6.git/fs/libfs.c
@@ -342,6 +342,26 @@ int simple_prepare_write(struct file *fi
 	return 0;
 }
 
+int simple_write_begin(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned flags,
+			struct page **pagep, void **fsdata)
+{
+	struct page *page;
+	pgoff_t index;
+	unsigned from;
+
+	index = pos >> PAGE_CACHE_SHIFT;
+	from = pos & (PAGE_CACHE_SIZE - 1);
+
+	page = __grab_cache_page(mapping, index);
+	if (!page)
+		return -ENOMEM;
+
+	*pagep = page;
+
+	return simple_prepare_write(file, page, from, from+len);
+}
+
 int simple_commit_write(struct file *file, struct page *page,
 			unsigned from, unsigned to)
 {
@@ -360,6 +380,28 @@ int simple_commit_write(struct file *fil
 	return 0;
 }
 
+int simple_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
+{
+	unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+
+	/* zero the stale part of the page if we did a short copy */
+	if (copied < len) {
+		void *kaddr = kmap_atomic(page, KM_USER0);
+		memset(kaddr + from + copied, 0, len - copied);
+		flush_dcache_page(page);
+		kunmap_atomic(kaddr, KM_USER0);
+	}
+
+	simple_commit_write(file, page, from, from+copied);
+
+	unlock_page(page);
+	page_cache_release(page);
+
+	return copied;
+}
+
 int simple_fill_super(struct super_block *s, int magic, struct tree_descr *files)
 {
 	struct inode *inode;
@@ -616,6 +658,8 @@ EXPORT_SYMBOL(dcache_dir_open);
 EXPORT_SYMBOL(dcache_readdir);
 EXPORT_SYMBOL(generic_read_dir);
 EXPORT_SYMBOL(get_sb_pseudo);
+EXPORT_SYMBOL(simple_write_begin);
+EXPORT_SYMBOL(simple_write_end);
 EXPORT_SYMBOL(simple_commit_write);
 EXPORT_SYMBOL(simple_dir_inode_operations);
 EXPORT_SYMBOL(simple_dir_operations);
Index: linux-2.6.git/drivers/block/loop.c
===================================================================
--- linux-2.6.git.orig/drivers/block/loop.c
+++ linux-2.6.git/drivers/block/loop.c
@@ -206,11 +206,10 @@ lo_do_transfer(struct loop_device *lo, i
  * space operations prepare_write and commit_write.
  */
 static int do_lo_send_aops(struct loop_device *lo, struct bio_vec *bvec,
-		int bsize, loff_t pos, struct page *page)
+		int bsize, loff_t pos, struct page *unused)
 {
 	struct file *file = lo->lo_backing_file; /* kudos to NFsckingS */
 	struct address_space *mapping = file->f_mapping;
-	const struct address_space_operations *aops = mapping->a_ops;
 	pgoff_t index;
 	unsigned offset, bv_offs;
 	int len, ret;
@@ -222,67 +221,47 @@ static int do_lo_send_aops(struct loop_d
 	len = bvec->bv_len;
 	while (len > 0) {
 		sector_t IV;
-		unsigned size;
+		unsigned size, copied;
 		int transfer_result;
+		struct page *page;
+		void *fsdata;
 
 		IV = ((sector_t)index << (PAGE_CACHE_SHIFT - 9))+(offset >> 9);
 		size = PAGE_CACHE_SIZE - offset;
 		if (size > len)
 			size = len;
-		page = grab_cache_page(mapping, index);
-		if (unlikely(!page))
+
+		ret = pagecache_write_begin(file, mapping, pos, size, 0,
+							&page, &fsdata);
+		if (ret)
 			goto fail;
-		ret = aops->prepare_write(file, page, offset,
-					  offset + size);
-		if (unlikely(ret)) {
-			if (ret == AOP_TRUNCATED_PAGE) {
-				page_cache_release(page);
-				continue;
-			}
-			goto unlock;
-		}
+
 		transfer_result = lo_do_transfer(lo, WRITE, page, offset,
 				bvec->bv_page, bv_offs, size, IV);
-		if (unlikely(transfer_result)) {
-			char *kaddr;
+		copied = size;
+		if (unlikely(transfer_result))
+			copied = 0;
+
+		ret = pagecache_write_end(file, mapping, pos, size, copied,
+							page, fsdata);
+		if (ret < 0)
+			goto fail;
+		if (ret < copied)
+			copied = ret;
 
-			/*
-			 * The transfer failed, but we still write the data to
-			 * keep prepare/commit calls balanced.
-			 */
-			printk(KERN_ERR "loop: transfer error block %llu\n",
-			       (unsigned long long)index);
-			kaddr = kmap_atomic(page, KM_USER0);
-			memset(kaddr + offset, 0, size);
-			kunmap_atomic(kaddr, KM_USER0);
-		}
-		flush_dcache_page(page);
-		ret = aops->commit_write(file, page, offset,
-					 offset + size);
-		if (unlikely(ret)) {
-			if (ret == AOP_TRUNCATED_PAGE) {
-				page_cache_release(page);
-				continue;
-			}
-			goto unlock;
-		}
 		if (unlikely(transfer_result))
-			goto unlock;
-		bv_offs += size;
-		len -= size;
+			goto fail;
+
+		bv_offs += copied;
+		len -= copied;
 		offset = 0;
 		index++;
-		pos += size;
-		unlock_page(page);
-		page_cache_release(page);
+		pos += copied;
 	}
 	ret = 0;
 out:
 	mutex_unlock(&mapping->host->i_mutex);
 	return ret;
-unlock:
-	unlock_page(page);
-	page_cache_release(page);
 fail:
 	ret = -1;
 	goto out;
Index: linux-2.6.git/fs/namei.c
===================================================================
--- linux-2.6.git.orig/fs/namei.c
+++ linux-2.6.git/fs/namei.c
@@ -2688,53 +2688,30 @@ int __page_symlink(struct inode *inode, 
 {
 	struct address_space *mapping = inode->i_mapping;
 	struct page *page;
+	void *fsdata;
 	int err;
 	char *kaddr;
 
 retry:
-	err = -ENOMEM;
-	page = find_or_create_page(mapping, 0, gfp_mask);
-	if (!page)
-		goto fail;
-	err = mapping->a_ops->prepare_write(NULL, page, 0, len-1);
-	if (err == AOP_TRUNCATED_PAGE) {
-		page_cache_release(page);
-		goto retry;
-	}
+	err = pagecache_write_begin(NULL, mapping, 0, PAGE_CACHE_SIZE,
+				AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
 	if (err)
-		goto fail_map;
+		goto fail;
+
 	kaddr = kmap_atomic(page, KM_USER0);
 	memcpy(kaddr, symname, len-1);
+	memset(kaddr+len-1, 0, PAGE_CACHE_SIZE-(len-1));
 	kunmap_atomic(kaddr, KM_USER0);
-	err = mapping->a_ops->commit_write(NULL, page, 0, len-1);
-	if (err == AOP_TRUNCATED_PAGE) {
-		page_cache_release(page);
-		goto retry;
-	}
-	if (err)
-		goto fail_map;
-	/*
-	 * Notice that we are _not_ going to block here - end of page is
-	 * unmapped, so this will only try to map the rest of page, see
-	 * that it is unmapped (typically even will not look into inode -
-	 * ->i_size will be enough for everything) and zero it out.
-	 * OTOH it's obviously correct and should make the page up-to-date.
-	 */
-	if (!PageUptodate(page)) {
-		err = mapping->a_ops->readpage(NULL, page);
-		if (err != AOP_TRUNCATED_PAGE)
-			wait_on_page_locked(page);
-	} else {
-		unlock_page(page);
-	}
-	page_cache_release(page);
+
+	err = pagecache_write_end(NULL, mapping, 0, PAGE_CACHE_SIZE, PAGE_CACHE_SIZE,
+							page, fsdata);
 	if (err < 0)
 		goto fail;
+	if (err < PAGE_CACHE_SIZE)
+		goto retry;
+
 	mark_inode_dirty(inode);
 	return 0;
-fail_map:
-	unlock_page(page);
-	page_cache_release(page);
 fail:
 	return err;
 }
Index: linux-2.6.git/fs/splice.c
===================================================================
--- linux-2.6.git.orig/fs/splice.c
+++ linux-2.6.git/fs/splice.c
@@ -559,7 +559,7 @@ static int pipe_to_file(struct pipe_inod
 	struct address_space *mapping = file->f_mapping;
 	unsigned int offset, this_len;
 	struct page *page;
-	pgoff_t index;
+	void *fsdata;
 	int ret;
 
 	/*
@@ -569,49 +569,16 @@ static int pipe_to_file(struct pipe_inod
 	if (unlikely(ret))
 		return ret;
 
-	index = sd->pos >> PAGE_CACHE_SHIFT;
 	offset = sd->pos & ~PAGE_CACHE_MASK;
 
 	this_len = sd->len;
 	if (this_len + offset > PAGE_CACHE_SIZE)
 		this_len = PAGE_CACHE_SIZE - offset;
 
-find_page:
-	page = find_lock_page(mapping, index);
-	if (!page) {
-		ret = -ENOMEM;
-		page = page_cache_alloc_cold(mapping);
-		if (unlikely(!page))
-			goto out_ret;
-
-		/*
-		 * This will also lock the page
-		 */
-		ret = add_to_page_cache_lru(page, mapping, index,
-					    GFP_KERNEL);
-		if (unlikely(ret))
-			goto out;
-	}
-
-	ret = mapping->a_ops->prepare_write(file, page, offset, offset+this_len);
-	if (unlikely(ret)) {
-		loff_t isize = i_size_read(mapping->host);
-
-		if (ret != AOP_TRUNCATED_PAGE)
-			unlock_page(page);
-		page_cache_release(page);
-		if (ret == AOP_TRUNCATED_PAGE)
-			goto find_page;
-
-		/*
-		 * prepare_write() may have instantiated a few blocks
-		 * outside i_size.  Trim these off again.
-		 */
-		if (sd->pos + this_len > isize)
-			vmtruncate(mapping->host, isize);
-
-		goto out_ret;
-	}
+	ret = pagecache_write_begin(file, mapping, sd->pos, sd->len,
+				AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
+	if (unlikely(ret))
+		goto out;
 
 	if (buf->page != page) {
 		/*
@@ -621,35 +588,14 @@ find_page:
 		char *dst = kmap_atomic(page, KM_USER1);
 
 		memcpy(dst + offset, src + buf->offset, this_len);
-		flush_dcache_page(page);
 		kunmap_atomic(dst, KM_USER1);
 		buf->ops->unmap(pipe, buf, src);
 	}
 
-	ret = mapping->a_ops->commit_write(file, page, offset, offset+this_len);
-	if (ret) {
-		if (ret == AOP_TRUNCATED_PAGE) {
-			page_cache_release(page);
-			goto find_page;
-		}
-		if (ret < 0)
-			goto out;
-		/*
-		 * Partial write has happened, so 'ret' already initialized by
-		 * number of bytes written, Where is nothing we have to do here.
-		 */
-	} else
-		ret = this_len;
-	/*
-	 * Return the number of bytes written and mark page as
-	 * accessed, we are now done!
-	 */
-	mark_page_accessed(page);
-	balance_dirty_pages_ratelimited(mapping);
+	ret = pagecache_write_end(file, mapping, sd->pos, sd->len, sd->len, page, fsdata);
+
 out:
-	page_cache_release(page);
-	unlock_page(page);
-out_ret:
+
 	return ret;
 }
 
Index: linux-2.6.git/Documentation/filesystems/Locking
===================================================================
--- linux-2.6.git.orig/Documentation/filesystems/Locking
+++ linux-2.6.git/Documentation/filesystems/Locking
@@ -176,15 +176,18 @@ prototypes:
 locking rules:
 	All except set_page_dirty may block
 
-			BKL	PageLocked(page)
+			BKL	PageLocked(page)	i_sem
 writepage:		no	yes, unlocks (see below)
 readpage:		no	yes, unlocks
 sync_page:		no	maybe
 writepages:		no
 set_page_dirty		no	no
 readpages:		no
-prepare_write:		no	yes
-commit_write:		no	yes
+prepare_write:		no	yes			yes
+commit_write:		no	yes			yes
+write_begin:		no	locks the page		yes
+write_end:		no	yes, unlocks		yes
+perform_write:		no	n/a			yes
 bmap:			yes
 invalidatepage:		no	yes
 releasepage:		no	yes
Index: linux-2.6.git/Documentation/filesystems/vfs.txt
===================================================================
--- linux-2.6.git.orig/Documentation/filesystems/vfs.txt
+++ linux-2.6.git/Documentation/filesystems/vfs.txt
@@ -534,6 +534,14 @@ struct address_space_operations {
 			struct list_head *pages, unsigned nr_pages);
 	int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
 	int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
+	int (*write_begin)(struct file *, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata);
+	int (*write_end)(struct file *, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata);
+	int (*perform_write)(struct file *, struct address_space *,
+				struct iov_iter *, loff_t);
 	sector_t (*bmap)(struct address_space *, sector_t);
 	int (*invalidatepage) (struct page *, unsigned long);
 	int (*releasepage) (struct page *, int);
@@ -629,6 +637,34 @@ struct address_space_operations {
         operations.  It should avoid returning an error if possible -
         errors should have been handled by prepare_write.
 
+  write_begin: This is intended as a replacement for prepare_write. Called
+        by the generic buffered write code to ask the filesystem to prepare
+        to write len bytes at the given offset in the file. flags is a field
+        for AOP_FLAG_xxx flags, described in include/linux/mm.h.
+
+        The filesystem must return the locked pagecache page for the caller
+        to write into.
+
+        A void * may be returned in fsdata, which then gets passed into
+        write_end.
+
+        Returns < 0 on failure, in which case all cleanup must be done and
+        write_end not called. 0 on success, in which case write_end must
+        be called.
+
+  write_end: After a successful write_begin, and data copy, write_end must
+        be called. len is the original len passed to write_begin, and copied
+        is the amount that was able to be copied (they must be equal if
+        write_begin was called with intr == 0).
+
+        The filesystem must take care of unlocking the page and dropping its
+        refcount, and updating i_size.
+
+        Returns < 0 on failure, otherwise the number of bytes (<= 'copied')
+        that were able to be copied into the file.
+
+  perform_write:
+
   bmap: called by the VFS to map a logical block offset within object to
   	physical block number. This method is used by the FIBMAP
   	ioctl and for working with swap-files.  To be able to swap to

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-04 22:10 ` Mark Fasheh
@ 2007-04-04 22:39   ` Badari Pulavarty
  2007-04-04 22:51     ` Mark Fasheh
  2007-04-04 23:05   ` Badari Pulavarty
  2007-04-05  2:09   ` Nick Piggin
  2 siblings, 1 reply; 43+ messages in thread
From: Badari Pulavarty @ 2007-04-04 22:39 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: Nick Piggin, Linux Filesystems, Steven Whitehouse

On Wed, 2007-04-04 at 15:10 -0700, Mark Fasheh wrote:
> On Mon, Apr 02, 2007 at 02:09:34PM +0200, Nick Piggin wrote:
> > Updated aops patchset against 2.6.21-rc5.
> > 
> > http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> 
> Thanks again for pushing this stuff out Nick.
> 
> Fwiw, I had merge conflicts working against the latest git code from Linus -
> it looks like Jens pushed some splice changes (some from you). A refreshed
> fs-new-write-aops.patch is attached.
> 
> The ext3 patch also caused me some problems, but I just temporarily dropped
> it for my immediate work.


Could you elaborate on issues ext3 caused ? Its pretty stable for
me (after bunch of simple fixes, I posted earlier). Anything I need
to be aware ?

Thanks,
Badari



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-04 22:39   ` Badari Pulavarty
@ 2007-04-04 22:51     ` Mark Fasheh
  2007-04-04 23:02       ` Badari Pulavarty
  0 siblings, 1 reply; 43+ messages in thread
From: Mark Fasheh @ 2007-04-04 22:51 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Nick Piggin, Linux Filesystems, Steven Whitehouse

On Wed, Apr 04, 2007 at 03:39:48PM -0700, Badari Pulavarty wrote:
> Could you elaborate on issues ext3 caused ? Its pretty stable for
> me (after bunch of simple fixes, I posted earlier). Anything I need
> to be aware ?

Oh, I meant merge issues - that's all. And that's just because I was trying
to be "fancy" by merging the stuff on top of a git branch I'm working on ;)
	--Mark
 
--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-04 22:51     ` Mark Fasheh
@ 2007-04-04 23:02       ` Badari Pulavarty
  0 siblings, 0 replies; 43+ messages in thread
From: Badari Pulavarty @ 2007-04-04 23:02 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: Nick Piggin, Linux Filesystems, Steven Whitehouse

On Wed, 2007-04-04 at 15:51 -0700, Mark Fasheh wrote:
> On Wed, Apr 04, 2007 at 03:39:48PM -0700, Badari Pulavarty wrote:
> > Could you elaborate on issues ext3 caused ? Its pretty stable for
> > me (after bunch of simple fixes, I posted earlier). Anything I need
> > to be aware ?
> 
> Oh, I meant merge issues - that's all. And that's just because I was trying
> to be "fancy" by merging the stuff on top of a git branch I'm working on ;)
> 	--Mark

Okay, Thanks.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-04 22:10 ` Mark Fasheh
  2007-04-04 22:39   ` Badari Pulavarty
@ 2007-04-04 23:05   ` Badari Pulavarty
  2007-04-04 23:17     ` Mark Fasheh
  2007-04-05  2:09   ` Nick Piggin
  2 siblings, 1 reply; 43+ messages in thread
From: Badari Pulavarty @ 2007-04-04 23:05 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: Nick Piggin, Linux Filesystems, Steven Whitehouse

On Wed, 2007-04-04 at 15:10 -0700, Mark Fasheh wrote:
...
> +int pagecache_write_begin(struct file *file, struct address_space *mapping,
> +				loff_t pos, unsigned len, unsigned flags,
> +				struct page **pagep, void **fsdata)
> +{
> +	const struct address_space_operations *aops = mapping->a_ops;
> +
> +	if (aops->write_begin) {
> +		return aops->write_begin(file, mapping, pos, len, flags,
> +							pagep, fsdata);
> +	} else {
> +		int ret;
> +		pgoff_t index = pos >> PAGE_CACHE_SHIFT;
> +		unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
> +		struct inode *inode = mapping->host;
> +		struct page *page;
> +again:
> +		page = __grab_cache_page(mapping, index);
> +		*pagep = page;
> +		if (!page)
> +			return -ENOMEM;
> +
> +		if (flags & AOP_FLAG_UNINTERRUPTIBLE && !PageUptodate(page)) {
> +			/*
> +			 * There is no way to resolve a short write situation
> +			 * for a !Uptodate page (except by double copying in
> +			 * the caller done by generic_perform_write_2copy).
> +			 *
> +			 * Instead, we have to bring it uptodate here.
> +			 */
> +			ret = aops->readpage(file, page);
> +			page_cache_release(page);
> +			if (ret) {
> +				if (ret == AOP_TRUNCATED_PAGE)
> +					goto again;
> +				return ret;
> +			}
> +			goto again;
> +		}
> +
> +		ret = aops->prepare_write(file, page, offset, offset+len);
> +		if (ret) {
> +			if (ret != AOP_TRUNCATED_PAGE)
> +				unlock_page(page);
> +			page_cache_release(page);
> +			if (pos + len > inode->i_size)
> +				vmtruncate(inode, inode->i_size);
> +			if (ret == AOP_TRUNCATED_PAGE)
> +				goto again;
> +		}
> +		return ret;
> +	}


Hmm.. Okay, only filesystems that could return AOP_TRUNCATED_PAGE
are ocf2 and gfs2. Now that both of them are switched to have
write_begin()/write_end(), why do we need this code to handle
AOP_TRUNCATED_PAGE (in the else part) ? Can't we just cleanup/nuke
all the AOP_TRUNCATED_PAGE handling ?

Dumb question ?

Thanks,
Badari



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-04 23:05   ` Badari Pulavarty
@ 2007-04-04 23:17     ` Mark Fasheh
  2007-04-04 23:32       ` Badari Pulavarty
  0 siblings, 1 reply; 43+ messages in thread
From: Mark Fasheh @ 2007-04-04 23:17 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Nick Piggin, Linux Filesystems, Steven Whitehouse

On Wed, Apr 04, 2007 at 04:05:19PM -0700, Badari Pulavarty wrote:
> Hmm.. Okay, only filesystems that could return AOP_TRUNCATED_PAGE
> are ocf2 and gfs2. Now that both of them are switched to have
> write_begin()/write_end(), why do we need this code to handle
> AOP_TRUNCATED_PAGE (in the else part) ? Can't we just cleanup/nuke
> all the AOP_TRUNCATED_PAGE handling ?

We don't - I'm pretty sure that fs-no-AOP_TRUNCATED_PAGE.patch gets rid of
them.
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-04 23:17     ` Mark Fasheh
@ 2007-04-04 23:32       ` Badari Pulavarty
  2007-04-05  2:08         ` Nick Piggin
  0 siblings, 1 reply; 43+ messages in thread
From: Badari Pulavarty @ 2007-04-04 23:32 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: Nick Piggin, Linux Filesystems, Steven Whitehouse

On Wed, 2007-04-04 at 16:17 -0700, Mark Fasheh wrote:
> On Wed, Apr 04, 2007 at 04:05:19PM -0700, Badari Pulavarty wrote:
> > Hmm.. Okay, only filesystems that could return AOP_TRUNCATED_PAGE
> > are ocf2 and gfs2. Now that both of them are switched to have
> > write_begin()/write_end(), why do we need this code to handle
> > AOP_TRUNCATED_PAGE (in the else part) ? Can't we just cleanup/nuke
> > all the AOP_TRUNCATED_PAGE handling ?
> 
> We don't - I'm pretty sure that fs-no-AOP_TRUNCATED_PAGE.patch gets rid of
> them.
> 	--Mark

It didn't, completely get rid of them :(

Thanks,
Badari


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-02 12:09 [ANNOUNCE] new new aops patchset Nick Piggin
                   ` (9 preceding siblings ...)
  2007-04-04 22:10 ` Mark Fasheh
@ 2007-04-05  0:10 ` David Chinner
  2007-04-05  2:13   ` Nick Piggin
  2007-04-05  2:43   ` David Chinner
  10 siblings, 2 replies; 43+ messages in thread
From: David Chinner @ 2007-04-05  0:10 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Mon, Apr 02, 2007 at 02:09:34PM +0200, Nick Piggin wrote:
> Updated aops patchset against 2.6.21-rc5.
> 
> http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> 
> Files/dirs are 2.6.21-rc5-new-aops*
> 
> Contains numerous fixes from Mark and myself -- I'd say the core code is
> getting reasonably stable at this point.
......

> (compile only) patches for NFS, XFS, FUSE, eCryptfs. OK, they're untested,

Failed to compile in UDF and reiser for me, but no doubt you know
that already. Don't have time to look at why - I just disabled them
so I get some QA done on the XFS and core changes.

On a related note - what's the rules for a perform_write() implementation?
I noticed that wasn't documented with write_begin and write_end and
I don't see any other filesystem implementing it yet....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-04 23:32       ` Badari Pulavarty
@ 2007-04-05  2:08         ` Nick Piggin
  2007-04-05 15:21           ` Badari Pulavarty
  0 siblings, 1 reply; 43+ messages in thread
From: Nick Piggin @ 2007-04-05  2:08 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Mark Fasheh, Linux Filesystems, Steven Whitehouse

On Wed, Apr 04, 2007 at 04:32:24PM -0700, Badari Pulavarty wrote:
> On Wed, 2007-04-04 at 16:17 -0700, Mark Fasheh wrote:
> > On Wed, Apr 04, 2007 at 04:05:19PM -0700, Badari Pulavarty wrote:
> > > Hmm.. Okay, only filesystems that could return AOP_TRUNCATED_PAGE
> > > are ocf2 and gfs2. Now that both of them are switched to have
> > > write_begin()/write_end(), why do we need this code to handle
> > > AOP_TRUNCATED_PAGE (in the else part) ? Can't we just cleanup/nuke
> > > all the AOP_TRUNCATED_PAGE handling ?
> > 
> > We don't - I'm pretty sure that fs-no-AOP_TRUNCATED_PAGE.patch gets rid of
> > them.
> > 	--Mark
> 
> It didn't, completely get rid of them :(

->readpage can still return AOP_TRUNCATED_PAGE. Were there any from
prepare_write or commit_write still around?



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-04 22:10 ` Mark Fasheh
  2007-04-04 22:39   ` Badari Pulavarty
  2007-04-04 23:05   ` Badari Pulavarty
@ 2007-04-05  2:09   ` Nick Piggin
  2 siblings, 0 replies; 43+ messages in thread
From: Nick Piggin @ 2007-04-05  2:09 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: Linux Filesystems, Steven Whitehouse

On Wed, Apr 04, 2007 at 03:10:34PM -0700, Mark Fasheh wrote:
> On Mon, Apr 02, 2007 at 02:09:34PM +0200, Nick Piggin wrote:
> > Updated aops patchset against 2.6.21-rc5.
> > 
> > http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> 
> Thanks again for pushing this stuff out Nick.
> 
> Fwiw, I had merge conflicts working against the latest git code from Linus -
> it looks like Jens pushed some splice changes (some from you). A refreshed
> fs-new-write-aops.patch is attached.

Thanks, added.

> The ext3 patch also caused me some problems, but I just temporarily dropped
> it for my immediate work.

Yeah, good idea ;)


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-05  0:10 ` David Chinner
@ 2007-04-05  2:13   ` Nick Piggin
  2007-04-05  7:45     ` Christoph Hellwig
  2007-04-05  2:43   ` David Chinner
  1 sibling, 1 reply; 43+ messages in thread
From: Nick Piggin @ 2007-04-05  2:13 UTC (permalink / raw)
  To: David Chinner; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Thu, Apr 05, 2007 at 10:10:18AM +1000, David Chinner wrote:
> On Mon, Apr 02, 2007 at 02:09:34PM +0200, Nick Piggin wrote:
> > Updated aops patchset against 2.6.21-rc5.
> > 
> > http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> > 
> > Files/dirs are 2.6.21-rc5-new-aops*
> > 
> > Contains numerous fixes from Mark and myself -- I'd say the core code is
> > getting reasonably stable at this point.
> ......
> 
> > (compile only) patches for NFS, XFS, FUSE, eCryptfs. OK, they're untested,
> 
> Failed to compile in UDF and reiser for me, but no doubt you know
> that already. Don't have time to look at why - I just disabled them
> so I get some QA done on the XFS and core changes.

Thanks, yeah the "cont_prepare_write" rework is a little intrusive and
affects reiserfs in ways I haven't had time to try fixing yet.

> On a related note - what's the rules for a perform_write() implementation?
> I noticed that wasn't documented with write_begin and write_end and
> I don't see any other filesystem implementing it yet....

Ah, so it isn't, thanks I'll document it. (today I'm looking at doing a
simple_perform_write and perhaps another easy one, which should also help
filesystem maintainers to have a reference).

Thanks,
Nick

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-05  0:10 ` David Chinner
  2007-04-05  2:13   ` Nick Piggin
@ 2007-04-05  2:43   ` David Chinner
  2007-04-05  3:00     ` Nick Piggin
  1 sibling, 1 reply; 43+ messages in thread
From: David Chinner @ 2007-04-05  2:43 UTC (permalink / raw)
  To: David Chinner
  Cc: Nick Piggin, Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Thu, Apr 05, 2007 at 10:10:18AM +1000, David Chinner wrote:
> On Mon, Apr 02, 2007 at 02:09:34PM +0200, Nick Piggin wrote:
> > Updated aops patchset against 2.6.21-rc5.
> > 
> > http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> > 
> > Files/dirs are 2.6.21-rc5-new-aops*
> > 
> > Contains numerous fixes from Mark and myself -- I'd say the core code is
> > getting reasonably stable at this point.
> ......
> 
> > (compile only) patches for NFS, XFS, FUSE, eCryptfs. OK, they're untested,
> 
> Failed to compile in UDF and reiser for me, but no doubt you know
> that already. Don't have time to look at why - I just disabled them
> so I get some QA done on the XFS and core changes.

Nick - the XFSQA randholes test locks up. As you can probably guess
it creates holey files and does stuff to them ;)

Both pdflush and a write doing an a blocked waiting for a page to become unlocked. 
They are both waiting on teh same page (0xa0007fffffe8bd68).



[0]kdb> btt 0xe0000039ec2b8000
Stack traceback for pid 113
0xe0000039ec2b8000      113       19  0    3   D  0xe0000039ec2b82f0  pdflush
0xa0000001008bcdb0 schedule+0x9b0
        args (0xe0000039ec207020, 0xe0000039ec2070a8, 0xa0000001001f3fa0, 0x206, 0xa000000100eee5a0)
0xa0000001008bf4b0 io_schedule+0x50
        args (0xe00000b0030258d4, 0xa0000001001338c0, 0x205, 0xa000000100eee5a0)
0xa0000001001338c0 sync_page+0x100
        args (0xa0007fffffe8bd68, 0xa0000001008bfd20, 0x58e, 0xa000000100eee5a0)
0xa0000001008bfd20 __wait_on_bit_lock+0x160
        args (0xe00000b005905e20, 0xe0000039ec2bfb40, 0xa000000100accf30, 0x2, 0xe0000039ec2bfb90)
0xa000000100133760 __lock_page+0x180
        args (0xa0007fffffe8bd68, 0xa000000100143b30, 0x917, 0x917)
0xa000000100143b30 generic_writepages+0x450
        args (0xe00000b47b2a65f0, 0xe0000039ec2bfcd8, 0xa0007fffffe8bd68, 0xe0000039ec2bfbe8, 0x2)
0xa00000010044be70 xfs_vm_writepages+0x50
        args (0xe00000b47b2a65f0, 0xe0000039ec2bfcd8, 0xa000000100144030, 0x308, 0xa0000001000d1870)
0xa000000100144030 do_writepages+0xb0
        args (0xe00000b47b2a65f0, 0xe0000039ec2bfcd8, 0xe0000039ec2bfd10, 0xa0000001001eb8e0, 0x691)
0xa0000001001eb8e0 __writeback_single_inode+0x140
        args (0xe00000b47b2a64c8, 0xe0000039ec2bfcd8, 0x7, 0xe00000b47b2a6700, 0xe00000b47b2a65d8)
0xa0000001001ecbd0 sync_sb_inodes+0x490
        args (0xe0000039eb96fb48, 0xe0000039ec2bfcd8, 0xe0000039ec207140, 0xe00000b47b2a64c8, 0xe0000039eb96fc30)
0xa0000001001ed940 writeback_inodes+0x160
        args (0xe0000039ec2bfcd8, 0xe0000039eb96fb48, 0xe0000039eb96fbb8, 0xa000000100c51e48, 0xe0000039ec2bfcf0)
0xa0000001001452a0 wb_kupdate+0x280
        args (0x34eb, 0xe0000039ec2bfcf0, 0xe0000039ec2bfd10, 0x1000036af, 0xe0000039ec2bfcd8)
0xa000000100145f70 pdflush+0x2d0
        args (0xa000000100cbfa10, 0xe0000039ec2bfd88, 0xe0000039ec2bfd90, 0xe0000039ec2bfd78, 0xa000000100bb1a80)
0xa0000001000fb640 kthread+0x260
        args (0xe0000039ee1c7d10, 0xffffffffffffffff, 0x0, 0xa000000100acd590, 0xfffffffffffffffc)
0xa0000001000124b0 kernel_thread_helper+0xd0
        args (0xa000000100acead0, 0xe0000039ee1c7d10, 0xa0000001000094c0, 0x2, 0xa000000100eee5a0)
0xa0000001000094c0 start_kernel_thread+0x20
        args (0xa000000100acead0, 0xe0000039ee1c7d10)
[0]kdb> btt 0xe0000030101d8000
Stack traceback for pid 8416
0xe0000030101d8000     8416     8242  0    3   D  0xe0000030101d82f0  randholes
0xa0000001008bcdb0 schedule+0x9b0
        args (0xe0000039ec207020, 0xe0000039ec2070a8, 0xa0000001001f3fa0, 0x206, 0xa000000100eee5a0)
0xa0000001008bf4b0 io_schedule+0x50
        args (0xe00000b0030258d4, 0xa0000001001338c0, 0x205, 0xa000000100eee5a0)
0xa0000001001338c0 sync_page+0x100
        args (0xa0007fffffe8bd68, 0xa0000001008bfd20, 0x58e, 0xa000000100eee5a0)
0xa0000001008bfd20 __wait_on_bit_lock+0x160
        args (0xe00000b005905e20, 0xe0000030101dfbb0, 0xa000000100accf30, 0x2, 0xa0000001008c2520)
0xa000000100133760 __lock_page+0x180
        args (0xa0007fffffe8bd68, 0xa000000100133d10, 0x48b, 0x0)
0xa000000100133d10 find_lock_page+0xf0
        args (0xe00000b47b2a65f0, 0x275a, 0xa0007fffffe8bd68, 0xe00000b47b2a6608, 0xe00000b47b2a65f8)
0xa000000100134550 __grab_cache_page+0x30
        args (0xe00000b47b2a65f0, 0x275a, 0x3fff, 0xe00000b47b2a6660, 0xfffff)
0xa0000001001f9fa0 block_write_begin+0x1a0
        args (0x0, 0xe00000b47b2a65f0, 0x9d69000, 0x3000, 0x1)
0xa000000100447630 xfs_vm_write_begin+0x50
        args (0x0, 0xe00000b47b2a65f0, 0x9d69000, 0x3000, 0x1)
0xa000000100138d50 pagecache_write_begin+0x90
        args (0x0, 0xe00000b47b2a65f0, 0x9d69000, 0x3000, 0x1)
0xa00000010045df90 xfs_iozero+0x90
        args (0xe00000b47b2a65f0, 0x9d69000, 0x7000, 0xe00000300e23e0b8, 0x3000)
0xa00000010045eaa0 xfs_zero_eof+0x500
        args (0xe00000b47b2a64c8, 0xe00000300e23e150, 0xad36000, 0x9d70000, 0xad35)
0xa0000001004611f0 xfs_write+0xc10
        args (0xe00000300e23e018, 0xe0000030101dfd40, 0xe0000030101dfd30, 0x200, 0xe0000030101dfdc0)
0xa0000001004543e0 __xfs_file_write+0xe0
        args (0xe0000030101dfd40, 0xe0000030101dfd30, 0x1, 0x1, 0xad36000)
0xa0000001004544e0 xfs_file_aio_write+0x40
        args (0xe0000030101dfd40, 0xe0000030101dfd30, 0x1, 0xad36000, 0xa0000001001a11b0)
0xa0000001001a11b0 do_sync_write+0x170
        args (0xa000000100cd7c78, 0xe00000b006679650, 0xa000000100cd7c50, 0xe0000030101dfe38, 0xe0000030101dfd40)
0xa0000001001a2860 vfs_write+0x1a0
        args (0xe00000b006679630, 0x6000000000018020, 0x200, 0xe0000030101dfe38, 0xe00000b006679650)
0xa0000001001a3530 sys_write+0x70
        args (0x3, 0x6000000000018020, 0x200, 0xc000000000000793, 0x250)
0xa00000010000bc40 ia64_ret_from_syscall
        args (0x3, 0x6000000000018020, 0x200, 0xc000000000000793)
0xa000000000010620 __kernel_syscall_via_break
        args (0x3, 0x6000000000018020, 0x200, 0xc000000000000793)

The page:

[0]kdb> md8c8 0xa0007fffffe8bd68
0xa0007fffffe8bd68 0100000000000835 ffffffff00000005   5...............
0xa0007fffffe8bd78 e00000b47aa88228 e00000b47b2a65f0   (..z.....e*{....
                   ^^^^^^^^^^^^^^^^ buffers
0xa0007fffffe8bd88 000000000000275a a0007fffffe8a660   Z'......`.......
0xa0007fffffe8bd98 a0007ffff7aab928 0100000000080000   (...............

kdb> inode_pages 0xe00000b47b2a64c8
page_struct       index   cnt zone nid flags
e000003010324000   5274     3   0   0 0x824 Referenced LRU Private
        bh 0xe000003479f8d0a8 bno 18446744073709551615 []
        bh 0xe00000b47ad7d0a8 bno    53132 [Uptodate Mapped]
        bh 0xe000003479f8e228 bno 18446744073709551615 []
        bh 0xe00000b47ad7fca8 bno 18446744073709551615 []
 mapping= e00000b47b2a65f0
e00000b079f80000   7637     3   0   1 0x100000000000824 Referenced LRU Private
        bh 0xe00000b47ad7cc28 bno 18446744073709551615 []
        bh 0xe000003479f8cfa8 bno 18446744073709551615 []
        bh 0xe00000b47ad7f028 bno 18446744073709551615 []
        bh 0xe000003479f8f7a8 bno    53082 [Uptodate Mapped]
 mapping= e00000b47b2a65f0
e00000b9e56ac000  10074     5   0   1 0x100000000000835 Locked Referenced Dirty LRU Private
        bh 0xe00000b47aa88228 bno 18446744073709551615 [Uptodate Dirty Mapped Delay]
           ^^^^^^^^^^^^^^^^^^
        bh 0xe000003479f8e3a8 bno 18446744073709551615 []
        bh 0xe00000b47aa882a8 bno 18446744073709551615 []
        bh 0xe000003479f8fd28 bno 18446744073709551615 []
 mapping= e00000b47b2a65f0

So the page is locked and it has a delalloc buffer on it that used to
be a hole. The problem page is at EOF....

No idea what is going write yet.....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-05  2:43   ` David Chinner
@ 2007-04-05  3:00     ` Nick Piggin
  2007-04-05  6:18       ` David Chinner
  0 siblings, 1 reply; 43+ messages in thread
From: Nick Piggin @ 2007-04-05  3:00 UTC (permalink / raw)
  To: David Chinner; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Thu, Apr 05, 2007 at 12:43:50PM +1000, David Chinner wrote:
> On Thu, Apr 05, 2007 at 10:10:18AM +1000, David Chinner wrote:
> > On Mon, Apr 02, 2007 at 02:09:34PM +0200, Nick Piggin wrote:
> > > Updated aops patchset against 2.6.21-rc5.
> > > 
> > > http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> > > 
> > > Files/dirs are 2.6.21-rc5-new-aops*
> > > 
> > > Contains numerous fixes from Mark and myself -- I'd say the core code is
> > > getting reasonably stable at this point.
> > ......
> > 
> > > (compile only) patches for NFS, XFS, FUSE, eCryptfs. OK, they're untested,
> > 
> > Failed to compile in UDF and reiser for me, but no doubt you know
> > that already. Don't have time to look at why - I just disabled them
> > so I get some QA done on the XFS and core changes.
> 
> Nick - the XFSQA randholes test locks up. As you can probably guess
> it creates holey files and does stuff to them ;)
> 
> Both pdflush and a write doing an a blocked waiting for a page to become unlocked. 
> They are both waiting on teh same page (0xa0007fffffe8bd68).

[...]

> So the page is locked and it has a delalloc buffer on it that used to
> be a hole. The problem page is at EOF....
> 
> No idea what is going write yet.....

OK, thanks for testing. This is with the full patchset applied, I guess.
No real ideas here (brute force we put a last-locked stack trace into
struct page :P), but I wonder if it may be any of my earlier patches?
(ie. not the new aops path or XFS conversion itself).

Anyway I think randholes is available online, right? I'll try reproduce
here and help.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-05  3:00     ` Nick Piggin
@ 2007-04-05  6:18       ` David Chinner
  2007-04-05  6:40         ` Nick Piggin
  0 siblings, 1 reply; 43+ messages in thread
From: David Chinner @ 2007-04-05  6:18 UTC (permalink / raw)
  To: Nick Piggin
  Cc: David Chinner, Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Thu, Apr 05, 2007 at 05:00:44AM +0200, Nick Piggin wrote:
> On Thu, Apr 05, 2007 at 12:43:50PM +1000, David Chinner wrote:
> > On Thu, Apr 05, 2007 at 10:10:18AM +1000, David Chinner wrote:
> > > On Mon, Apr 02, 2007 at 02:09:34PM +0200, Nick Piggin wrote:
> > > > Updated aops patchset against 2.6.21-rc5.
> > > > 
> > > > http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> > > > 
> > > > Files/dirs are 2.6.21-rc5-new-aops*
> > > > 
> > > > Contains numerous fixes from Mark and myself -- I'd say the core code is
> > > > getting reasonably stable at this point.
> > > ......
> > > 
> > > > (compile only) patches for NFS, XFS, FUSE, eCryptfs. OK, they're untested,
> > > 
> > > Failed to compile in UDF and reiser for me, but no doubt you know
> > > that already. Don't have time to look at why - I just disabled them
> > > so I get some QA done on the XFS and core changes.
> > 
> > Nick - the XFSQA randholes test locks up. As you can probably guess
> > it creates holey files and does stuff to them ;)
> > 
> > Both pdflush and a write doing an a blocked waiting for a page to become unlocked. 
> > They are both waiting on teh same page (0xa0007fffffe8bd68).
> 
> [...]
> 
> > So the page is locked and it has a delalloc buffer on it that used to
> > be a hole. The problem page is at EOF....
> > 
> > No idea what is going write yet.....
> 
> OK, thanks for testing. This is with the full patchset applied, I guess.

Yes.

> No real ideas here (brute force we put a last-locked stack trace into
> struct page :P), but I wonder if it may be any of my earlier patches?
> (ie. not the new aops path or XFS conversion itself).

It's not.

I've had a look at the files created by the test - it's the third
file in the test that is causing the problem:

_do_test 1 50 "-l 5000000 -c 50 -b $pgsize"
_do_test 2 100 "-l 10000000 -c 100 -b $pgsize"
_do_test 3 100 "-l 10000000 -c 100 -b 512"   # test partial pages

i.e. the partial page test. This is the effective command that
locks it up:

# randholes -v -l 10000000 -c 100 -b 512 holes.out

Ah, now i see what the I/o pattern is that is triggering this:

budgie:/mnt/test # ~/dgc/xfstests/src/randholes -v -v -v  -l 10000000 -c 100 -b 512 dgc.holes.out
randholes: Seed = 1175749165 (use "-s 1175749165" to re-execute this test)
randholes: blocksize=512, filesize=268435456, seed=1175749165
randholes: count=100, offset=0, extsize=0
randholes: verbose=3, wsync=0, direct=0, rt=0, alloconly=0, preserve=0, test=0
write
.writing data at offset=acbd800, value 0xacbd800 and 0xacbd800
writing data at offset=5c09600, value 0x5c09600 and 0x5c09600
writing data at offset=97bde00, value 0x97bde00 and 0x97bde00


And we writing to offset:

0xa0000001004611f0 xfs_write+0xc10
        args (0xe00000347895e750, 0xe00000347aa1fd40, 0xe00000347aa1fd30, 0x200, 0xe00000347aa1fdc0)
.....
[0]kdb> md8c1 0xe00000347aa1fdc0
0xe00000347aa1fdc0 000000000c885400                    .T......

Which is well past teh last EOF.
So when we extend the file past the EOF, the old EOF had a partial
block (and page) that needed to have the remaining chunk zeroed. This
will have found the page already in the cache, but it is locked.

That tends to imply that it wasn't unlocked correctly on teh first call to
xfs_iozero it was originally written...

Ah, found it. page_cache_write_begin() returns zero on success. So this:

               status = pagecache_write_begin(NULL, mapping, pos, bytes,
                                       AOP_FLAG_UNINTERRUPTIBLE,
                                       &page, &fsdata);
               if (!status)
                        break;

Will abort on success, never zero the range of the page it's supposed
to and leave without calling write_end. Then when we extend the file again....

Hmmm - an interesting question just came up. The return from ->write_end()
will either be equal to 'copied' or zero. What do you do if it returns
zero? it indicates a failure of some kind, but what?

In xfs_iozero, because we loop based on the number of bytes we get
returned from pagecache_write_end(); a return of zero bytes will
result in trying the same operation again. If it fails repeatedly in
the same manner (likely), then we've got an endless loop. I can
break out of the loop, but I've got no idea what the error was or
how to handle it. Any thoughts on this?

Also, when pagecache_write_begin() (and ->write_begin()) returns
zero, it indicates success. When pagecache_write_end() (and ->write_end())
returns zero, it indicates some kind of failure occurred. Can
you make the return values more consistent, Nick?

Patch below

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

---
 fs/xfs/linux-2.6/xfs_lrw.c |   11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_lrw.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_lrw.c	2007-04-05 15:07:54.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_lrw.c	2007-04-05 15:41:44.912532457 +1000
@@ -151,18 +151,17 @@ xfs_iozero(
 		status = pagecache_write_begin(NULL, mapping, pos, bytes,
 					AOP_FLAG_UNINTERRUPTIBLE,
 					&page, &fsdata);
-		if (!status)
+		if (status)
 			break;
 
 		memclear_highpage_flush(page, offset, bytes);
 
 		status = pagecache_write_end(NULL, mapping, pos, bytes, bytes,
 					page, fsdata);
-		if (status < 0)
-			break;
-		bytes = status;
-		pos += bytes;
-		count -= bytes;
+		WARN_ON(status <= 0); /* can't return less than zero! */
+		pos += status;
+		count -= status;
+		status = 0;
 	} while (count);
 
 	return (-status);

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-05  6:18       ` David Chinner
@ 2007-04-05  6:40         ` Nick Piggin
  0 siblings, 0 replies; 43+ messages in thread
From: Nick Piggin @ 2007-04-05  6:40 UTC (permalink / raw)
  To: David Chinner; +Cc: Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Thu, Apr 05, 2007 at 04:18:09PM +1000, David Chinner wrote:

[lots of interesting stuff]

> Ah, found it. page_cache_write_begin() returns zero on success. So this:
> 
>                status = pagecache_write_begin(NULL, mapping, pos, bytes,
>                                        AOP_FLAG_UNINTERRUPTIBLE,
>                                        &page, &fsdata);
>                if (!status)
>                         break;
> 
> Will abort on success, never zero the range of the page it's supposed
> to and leave without calling write_end. Then when we extend the file again....

Ah, and I just downloaded xfs-cmds :P

Thanks for finding that. Sorry to waste your time with a simple thinko :(


> Hmmm - an interesting question just came up. The return from ->write_end()
> will either be equal to 'copied' or zero. What do you do if it returns
> zero? it indicates a failure of some kind, but what?

zero means that it couldn't copy anything, but there was no permanent
failure (ie. we could retry as we might with any other short write).

The reason why write_end returns "copied" is that some filesystems (jffs
was one) that want to return a short copy. I would prefer if they could
all to the required setup in write_begin so a short write couldn't occur
at write_end-time, however I'm not an expert on filesystems, so I left
this feature in. It will probably be easier to rip it out later than to
introduce it later. I think?

> In xfs_iozero, because we loop based on the number of bytes we get
> returned from pagecache_write_end(); a return of zero bytes will
> result in trying the same operation again. If it fails repeatedly in
> the same manner (likely), then we've got an endless loop. I can
> break out of the loop, but I've got no idea what the error was or
> how to handle it. Any thoughts on this?

Basically you can probably take some advantage of knowing that your
filesystem never returns 0 here, but we should note that the filesystem
must either return an error or copy some bytes eventually, rather than
always return 0. I'll at that to the docs.

> Also, when pagecache_write_begin() (and ->write_begin()) returns
> zero, it indicates success. When pagecache_write_end() (and ->write_end())
> returns zero, it indicates some kind of failure occurred. Can
> you make the return values more consistent, Nick?

As I said, zero is not technically an error (and it isn't a great deal
more code because most callers have to deal with short copies anyway).

If we get agreement that the feature of allowing the filesystem to
shorten the copy at write_end-time is unnecessary, then I would like
that because it does make interfaces easier.

> 
> Patch below

Thanks a lot, applied.

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group
> 
> ---
>  fs/xfs/linux-2.6/xfs_lrw.c |   11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
> 
> Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_lrw.c
> ===================================================================
> --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_lrw.c	2007-04-05 15:07:54.000000000 +1000
> +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_lrw.c	2007-04-05 15:41:44.912532457 +1000
> @@ -151,18 +151,17 @@ xfs_iozero(
>  		status = pagecache_write_begin(NULL, mapping, pos, bytes,
>  					AOP_FLAG_UNINTERRUPTIBLE,
>  					&page, &fsdata);
> -		if (!status)
> +		if (status)
>  			break;
>  
>  		memclear_highpage_flush(page, offset, bytes);
>  
>  		status = pagecache_write_end(NULL, mapping, pos, bytes, bytes,
>  					page, fsdata);
> -		if (status < 0)
> -			break;
> -		bytes = status;
> -		pos += bytes;
> -		count -= bytes;
> +		WARN_ON(status <= 0); /* can't return less than zero! */
> +		pos += status;
> +		count -= status;
> +		status = 0;
>  	} while (count);
>  
>  	return (-status);

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-05  2:13   ` Nick Piggin
@ 2007-04-05  7:45     ` Christoph Hellwig
  2007-04-05  7:58       ` Nick Piggin
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Hellwig @ 2007-04-05  7:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: David Chinner, Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Thu, Apr 05, 2007 at 04:13:05AM +0200, Nick Piggin wrote:
> > On a related note - what's the rules for a perform_write() implementation?
> > I noticed that wasn't documented with write_begin and write_end and
> > I don't see any other filesystem implementing it yet....
> 
> Ah, so it isn't, thanks I'll document it. (today I'm looking at doing a
> simple_perform_write and perhaps another easy one, which should also help
> filesystem maintainers to have a reference).

I think ->perform_write should go away.  It's really just duplicating
->write for the !O_DIRECT case.  Instead we should factor the remaining
bits of generic_file_aio_write into nice helpers so that people that don't
want to use the generic part can build their own ->write from these pieces.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-05  7:45     ` Christoph Hellwig
@ 2007-04-05  7:58       ` Nick Piggin
  0 siblings, 0 replies; 43+ messages in thread
From: Nick Piggin @ 2007-04-05  7:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Chinner, Linux Filesystems, Mark Fasheh, Steven Whitehouse

On Thu, Apr 05, 2007 at 08:45:29AM +0100, Christoph Hellwig wrote:
> On Thu, Apr 05, 2007 at 04:13:05AM +0200, Nick Piggin wrote:
> > > On a related note - what's the rules for a perform_write() implementation?
> > > I noticed that wasn't documented with write_begin and write_end and
> > > I don't see any other filesystem implementing it yet....
> > 
> > Ah, so it isn't, thanks I'll document it. (today I'm looking at doing a
> > simple_perform_write and perhaps another easy one, which should also help
> > filesystem maintainers to have a reference).
> 
> I think ->perform_write should go away.  It's really just duplicating
> ->write for the !O_DIRECT case.  Instead we should factor the remaining
> bits of generic_file_aio_write into nice helpers so that people that don't
> want to use the generic part can build their own ->write from these pieces.

I probably don't disagree with you. If we use the iov_iter further
down the stack (which I was going to do as a cleanup following this
patchset), then you're right that it isn't too painful...

Anyway, I'll leave it in the next release: I've converted a few users,
so that will give people a better idea of whether they agree/disagree
or don't care ;)


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-05  2:08         ` Nick Piggin
@ 2007-04-05 15:21           ` Badari Pulavarty
  2007-04-06  1:38             ` Nick Piggin
  0 siblings, 1 reply; 43+ messages in thread
From: Badari Pulavarty @ 2007-04-05 15:21 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Mark Fasheh, Linux Filesystems, Steven Whitehouse

On Thu, 2007-04-05 at 04:08 +0200, Nick Piggin wrote:
> On Wed, Apr 04, 2007 at 04:32:24PM -0700, Badari Pulavarty wrote:
> > On Wed, 2007-04-04 at 16:17 -0700, Mark Fasheh wrote:
> > > On Wed, Apr 04, 2007 at 04:05:19PM -0700, Badari Pulavarty wrote:
> > > > Hmm.. Okay, only filesystems that could return AOP_TRUNCATED_PAGE
> > > > are ocf2 and gfs2. Now that both of them are switched to have
> > > > write_begin()/write_end(), why do we need this code to handle
> > > > AOP_TRUNCATED_PAGE (in the else part) ? Can't we just cleanup/nuke
> > > > all the AOP_TRUNCATED_PAGE handling ?
> > > 
> > > We don't - I'm pretty sure that fs-no-AOP_TRUNCATED_PAGE.patch gets rid of
> > > them.
> > > 	--Mark
> > 
> > It didn't, completely get rid of them :(
> 
> ->readpage can still return AOP_TRUNCATED_PAGE. Were there any from
> prepare_write or commit_write still around?
> 
> 

Not a big deal. But trying to understand it better.


int pagecache_write_begin()
{
 
        if (aops->write_begin) {
                return aops->write_begin(file, mapping, pos, len, flags,
                                                        pagep, fsdata);
        } else {
                .....
                ret = aops->readpage(file, page);
                page_cache_release(page);
                if (ret) {
			if (ret == AOP_TRUNCATED_PAGE)
                                        goto again;
                        return ret;
                }
                goto again;
		
	      ....
	}
}

filesystems (ocfs2, gfs2) which can return AOP_TRUNCATED_PAGE for
prepare_write or readpage would never come to this case. They
have write_begin() method set. Isn't it ? Why this check ?

Thanks,
Badari



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [ANNOUNCE] new new aops patchset
  2007-04-05 15:21           ` Badari Pulavarty
@ 2007-04-06  1:38             ` Nick Piggin
  0 siblings, 0 replies; 43+ messages in thread
From: Nick Piggin @ 2007-04-06  1:38 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Mark Fasheh, Linux Filesystems, Steven Whitehouse

On Thu, Apr 05, 2007 at 08:21:54AM -0700, Badari Pulavarty wrote:
> On Thu, 2007-04-05 at 04:08 +0200, Nick Piggin wrote:
> > 
> > ->readpage can still return AOP_TRUNCATED_PAGE. Were there any from
> > prepare_write or commit_write still around?
> > 
> > 
> 
> Not a big deal. But trying to understand it better.
> 
> 
> int pagecache_write_begin()
> {
>  
>         if (aops->write_begin) {
>                 return aops->write_begin(file, mapping, pos, len, flags,
>                                                         pagep, fsdata);
>         } else {
>                 .....
>                 ret = aops->readpage(file, page);
>                 page_cache_release(page);
>                 if (ret) {
> 			if (ret == AOP_TRUNCATED_PAGE)
>                                         goto again;
>                         return ret;
>                 }
>                 goto again;
> 		
> 	      ....
> 	}
> }
> 
> filesystems (ocfs2, gfs2) which can return AOP_TRUNCATED_PAGE for
> prepare_write or readpage would never come to this case. They
> have write_begin() method set. Isn't it ? Why this check ?

Ah you're right. In that case, I'll replace that with a comment ;)

Thanks!

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2007-04-06  1:38 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-04-02 12:09 [ANNOUNCE] new new aops patchset Nick Piggin
2007-04-02 20:18 ` Badari Pulavarty
2007-04-03  0:45   ` Nick Piggin
2007-04-02 20:44 ` Badari Pulavarty
2007-04-02 23:58   ` Nick Piggin
2007-04-03 15:59     ` Badari Pulavarty
2007-04-04  2:33       ` Nick Piggin
2007-04-02 21:44 ` Badari Pulavarty
2007-04-02 23:08 ` Badari Pulavarty
2007-04-02 23:14 ` Badari Pulavarty
2007-04-02 23:49   ` Nick Piggin
2007-04-03 15:57     ` Badari Pulavarty
2007-04-04  2:31       ` Nick Piggin
2007-04-02 23:31 ` Badari Pulavarty
2007-04-02 23:35   ` Nick Piggin
2007-04-02 23:56 ` Badari Pulavarty
2007-04-03  0:02   ` Nick Piggin
2007-04-03  0:00 ` Badari Pulavarty
2007-04-03  0:08   ` Nick Piggin
2007-04-03  9:31     ` Nick Piggin
2007-04-03 16:03       ` Badari Pulavarty
2007-04-04  2:37         ` Nick Piggin
2007-04-03 17:35 ` Badari Pulavarty
2007-04-04  3:05   ` Nick Piggin
2007-04-04 22:10 ` Mark Fasheh
2007-04-04 22:39   ` Badari Pulavarty
2007-04-04 22:51     ` Mark Fasheh
2007-04-04 23:02       ` Badari Pulavarty
2007-04-04 23:05   ` Badari Pulavarty
2007-04-04 23:17     ` Mark Fasheh
2007-04-04 23:32       ` Badari Pulavarty
2007-04-05  2:08         ` Nick Piggin
2007-04-05 15:21           ` Badari Pulavarty
2007-04-06  1:38             ` Nick Piggin
2007-04-05  2:09   ` Nick Piggin
2007-04-05  0:10 ` David Chinner
2007-04-05  2:13   ` Nick Piggin
2007-04-05  7:45     ` Christoph Hellwig
2007-04-05  7:58       ` Nick Piggin
2007-04-05  2:43   ` David Chinner
2007-04-05  3:00     ` Nick Piggin
2007-04-05  6:18       ` David Chinner
2007-04-05  6:40         ` Nick Piggin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.