linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
  2007-06-13 13:40 ` + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree Dmitriy Monakhov
@ 2007-06-13 11:43   ` Nick Piggin
  2007-06-13 18:51     ` Dmitriy Monakhov
  2007-06-13 23:07     ` Badari Pulavarty
  2007-06-13 13:50   ` [patch] new aop block_write_begin fix Dmitriy Monakhov
  2007-06-13 13:57   ` iov_iter_fault_in_readable fix Dmitriy Monakhov
  2 siblings, 2 replies; 13+ messages in thread
From: Nick Piggin @ 2007-06-13 11:43 UTC (permalink / raw)
  To: linux-kernel, mark.fasheh, linux-ext4; +Cc: Andrew Morton

On Wed, Jun 13, 2007 at 05:40:05PM +0400, Dmitriy Monakhov wrote:
> On 14:19 ?????? 29 ??????     , akpm@linux-foundation.org wrote:
> > 
> > The patch titled
> >      fs: introduce write_begin, write_end, and perform_write aops
> > has been added to the -mm tree.  Its filename is
> >      fs-introduce-write_begin-write_end-and-perform_write-aops.patch
> > 
> > *** Remember to use Documentation/SubmitChecklist when testing your code ***
> > 
> > See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
> > out what to do about this
> > 
> > ------------------------------------------------------
> > Subject: fs: introduce write_begin, write_end, and perform_write aops
> > From: Nick Piggin <npiggin@suse.de>
> > 
> > These are intended to replace prepare_write and commit_write with more
> > flexible alternatives that are also able to avoid the buffered write
> > deadlock problems efficiently (which prepare_write is unable to do).
> > 
> > [mark.fasheh@oracle.com: API design contributions, code review and fixes]
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> I've finaly find time to review Nick's "write_begin/write_end aop" patch set.
> And i have some fixes and questions. My be i've missed somthing and it was 
> already disscussed, but i cant find in LKML.

Thanks, that's really helpful.

 
> 1) loop dev:
> 	loop.c code itself is not perfect. In fact before Nick's patch
> 	partial write was't possible. Assumption what write chunks are
> 	always page aligned is realy weird ( see "index++" line).
> 	Fixed by "new aop loop fix" patch

I think you're right, fix looks good.

 
> 2)block_write_begin:
> 	After we enter to block_write_begin with *pagep == NULL and
> 	some page was grabed we remember this page in *pagep
> 	And if __block_prepare_write() we have to clear *pagep , as 
> 	it was before. Because this may confuse caller.
> 	for example caller may have folowing code:
> 		ret = block_write_begin(..., pagep,...)
> 		if (ret && *pagep != NULL) {
> 			unlock_page(*pagep);
> 			page_cache_release(*pagep);
> 		}
> 	Fixed my "new aop block_write_begin fix" patch

I don't think the caller can rely on that if it returns failure.
But that is more defensive I guess. Maybe setting it to 1 or
so would catch abusers.

 
> 3) __page_symlink:
> 	Nick's patch add folowing code:
> 	+ err = pagecache_write_begin(NULL, mapping, 0,PAGE_CACHE_SIZE,
> 	+                 AOP_FLAG_UNINTERRUPTIBLE, &page,&fsdata);
> 	symlink now consume whole page. I have only one question "WHY???".
> 	I don't see any advantages, but where are huge list of
> 	dissdvantages:
> 	a) fs with blksize == 1k and pagesize == 16k after this patch
> 	   waste more than 10x times disk space for nothing.
> 	b) What happends if we want use fs with blksize == 4k on i386
> 	   after it was used by ia64 ??? (before this patch it was
> 	   possible).
> 	
> 	I dont prepare patch for this because i dont understand issue
> 	witch Nick aimed to fix.

I don't know why myself :P I think it would be just fine to use
len-1 as it did previously, so it must have been a typo?


> 4) iov_iter_fault_in_readable:
> 	Function prerform check for signgle region, with out respect to
> 	segment nature of iovec, For example writev no longer works :) :
> 	writev(3, [{"\1", 1}, {"\2"..., 4096}], 2) = -1 EFAULT (Bad address)
> 	this hidden bug, and it was invisiable because XXXX_fault_in_readable
> 	return value was ignored before. Lets iov_iter_fault_in_readable
> 	perform checks for all segments.
> 	Fixed by :"iov_iter_fault_in_readable fix"

OK thanks. I would rather just change this to use the length of
the first iovec always (prefaulting multiple iovecs would have to
be benchmarked on real apps that make heavy use of writev, I
suspect).

 
> 5) ext3_write_end:
> 	Before  write_begin/write_end patch set we have folowing locking
> 	order:
> 		stop_journal(handle);
> 		unlock_page(page);
> 	But now order is oposite:
> 		unlock_page(page);
> 		stop_journal(handle);
> 	Can we got any race condition now? I'm not sure is it actual problem,
> 	may be somebody cant describe this.

Can we just change it to the original order? That would seem to be
safest unless one of the ext3 devs explicitly acks it.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
       [not found] <200705292119.l4TLJtAD011726@shell0.pdx.osdl.net>
@ 2007-06-13 13:40 ` Dmitriy Monakhov
  2007-06-13 11:43   ` Nick Piggin
                     ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Dmitriy Monakhov @ 2007-06-13 13:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: npiggin, mark.fasheh, dmonakhov, linux-ext4

On 14:19 Втр 29 Май     , akpm@linux-foundation.org wrote:
> 
> The patch titled
>      fs: introduce write_begin, write_end, and perform_write aops
> has been added to the -mm tree.  Its filename is
>      fs-introduce-write_begin-write_end-and-perform_write-aops.patch
> 
> *** Remember to use Documentation/SubmitChecklist when testing your code ***
> 
> See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
> out what to do about this
> 
> ------------------------------------------------------
> Subject: fs: introduce write_begin, write_end, and perform_write aops
> From: Nick Piggin <npiggin@suse.de>
> 
> These are intended to replace prepare_write and commit_write with more
> flexible alternatives that are also able to avoid the buffered write
> deadlock problems efficiently (which prepare_write is unable to do).
> 
> [mark.fasheh@oracle.com: API design contributions, code review and fixes]
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
I've finaly find time to review Nick's "write_begin/write_end aop" patch set.
And i have some fixes and questions. My be i've missed somthing and it was 
already disscussed, but i cant find in LKML.

1) loop dev:
	loop.c code itself is not perfect. In fact before Nick's patch
	partial write was't possible. Assumption what write chunks are
	always page aligned is realy weird ( see "index++" line).
	Fixed by "new aop loop fix" patch

2)block_write_begin:
	After we enter to block_write_begin with *pagep == NULL and
	some page was grabed we remember this page in *pagep
	And if __block_prepare_write() we have to clear *pagep , as 
	it was before. Because this may confuse caller.
	for example caller may have folowing code:
		ret = block_write_begin(..., pagep,...)
		if (ret && *pagep != NULL) {
			unlock_page(*pagep);
			page_cache_release(*pagep);
		}
	Fixed my "new aop block_write_begin fix" patch

3) __page_symlink:
	Nick's patch add folowing code:
	+ err = pagecache_write_begin(NULL, mapping, 0,PAGE_CACHE_SIZE,
	+                 AOP_FLAG_UNINTERRUPTIBLE, &page,&fsdata);
	symlink now consume whole page. I have only one question "WHY???".
	I don't see any advantages, but where are huge list of
	dissdvantages:
	a) fs with blksize == 1k and pagesize == 16k after this patch
	   waste more than 10x times disk space for nothing.
	b) What happends if we want use fs with blksize == 4k on i386
	   after it was used by ia64 ??? (before this patch it was
	   possible).
	
	I dont prepare patch for this because i dont understand issue
	witch Nick aimed to fix.

4) iov_iter_fault_in_readable:
	Function prerform check for signgle region, with out respect to
	segment nature of iovec, For example writev no longer works :) :
	writev(3, [{"\1", 1}, {"\2"..., 4096}], 2) = -1 EFAULT (Bad address)
	this hidden bug, and it was invisiable because XXXX_fault_in_readable
	return value was ignored before. Lets iov_iter_fault_in_readable
	perform checks for all segments.
	Fixed by :"iov_iter_fault_in_readable fix"

5) ext3_write_end:
	Before  write_begin/write_end patch set we have folowing locking
	order:
		stop_journal(handle);
		unlock_page(page);
	But now order is oposite:
		unlock_page(page);
		stop_journal(handle);
	Can we got any race condition now? I'm not sure is it actual problem,
	may be somebody cant describe this.





^ permalink raw reply	[flat|nested] 13+ messages in thread

* [patch] new aop block_write_begin fix
  2007-06-13 13:40 ` + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree Dmitriy Monakhov
  2007-06-13 11:43   ` Nick Piggin
@ 2007-06-13 13:50   ` Dmitriy Monakhov
  2007-06-13 13:57   ` iov_iter_fault_in_readable fix Dmitriy Monakhov
  2 siblings, 0 replies; 13+ messages in thread
From: Dmitriy Monakhov @ 2007-06-13 13:50 UTC (permalink / raw)
  To: linux-kernel; +Cc: npiggin

	After we enter to block_write_begin with *pagep == NULL and
 	some page was grabed we remember this page in *pagep
 	And if __block_prepare_write() we have to clear *pagep , as 
 	it was before. Because this may confuse caller.
 	for example caller may have folowing code:
 		ret = block_write_begin(..., pagep,...)
 		if (ret && *pagep != NULL) {
 			unlock_page(*pagep);
 			page_cache_release(*pagep);
 		}
Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>

diff --git a/fs/buffer.c b/fs/buffer.c
index 07cd457..df933ba 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1982,6 +1982,7 @@ int block_write_begin(struct file *file, struct address_space *mapping,
 		if (ownpage) {
 			unlock_page(page);
 			page_cache_release(page);
+			*pagep = NULL;
 
 			/*
 			 * prepare_write() may have instantiated a few blocks 


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* iov_iter_fault_in_readable fix
  2007-06-13 13:40 ` + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree Dmitriy Monakhov
  2007-06-13 11:43   ` Nick Piggin
  2007-06-13 13:50   ` [patch] new aop block_write_begin fix Dmitriy Monakhov
@ 2007-06-13 13:57   ` Dmitriy Monakhov
  2007-06-14 17:31     ` Christoph Hellwig
  2 siblings, 1 reply; 13+ messages in thread
From: Dmitriy Monakhov @ 2007-06-13 13:57 UTC (permalink / raw)
  To: linux-kernel, npiggin, mark.fasheh, linux-ext4

 	Function prerform check for signgle region, with out respect to
 	segment nature of iovec, For example writev no longer works :)

	/* TESTCASE BEGIN */
	#include <stdio.h>
	#include <sys/types.h>
	#include <sys/stat.h>
	#include <fcntl.h>
	#include <sys/uio.h>
	#include <sys/mman.h>
	#define SIZE  (4096 * 2)
	int main(int argc, char* argv[])
	{	
		char* ptr[4];
		struct iovec iov[2];
		int fd, ret;
		ptr[0] = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
			MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
		ptr[1] = mmap(NULL, SIZE, PROT_NONE,
			MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
		ptr[2] = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
			MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
		ptr[3] = mmap(NULL, SIZE, PROT_NONE, 
			MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);

		iov[0].iov_base = ptr[0] + (SIZE -1);
		iov[0].iov_len = 1;
		memset(ptr[0], 1, SIZE);

		iov[1].iov_base = ptr[2];
		iov[1].iov_len = SIZE;
		memset(ptr[2], 2, SIZE);

		fd = open(argv[1], O_CREAT|O_RDWR|O_TRUNC, 0666);
		ret = writev(fd, iov, sizeof(iov) / sizeof(struct iovec));
		return 0;
	}	
	/* TESTCASE END*/
	We will get folowing result:
		writev(3, [{"\1", 1}, {"\2"..., 8192}], 2) = -1 EFAULT (Bad address)
 	
	this is hidden bug, and it was invisiable because XXXX_fault_in_readable
 	return value was ignored before. Lets iov_iter_fault_in_readable
 	perform checks for all segments.

Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>

diff --git a/include/linux/fs.h b/include/linux/fs.h
index fef19fc..7e025ea 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -433,7 +433,7 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 size_t iov_iter_copy_from_user(struct page *page,
 		struct iov_iter *i, unsigned long offset, size_t bytes);
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
-int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes);
+int iov_iter_fault_in_readable(struct iov_iter *i, size_t *bytes);
 size_t iov_iter_single_seg_count(struct iov_iter *i);
 
 static inline void iov_iter_init(struct iov_iter *i,
diff --git a/mm/filemap.c b/mm/filemap.c
index 8d59ed9..8600c3e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1817,10 +1817,32 @@ void iov_iter_advance(struct iov_iter *i, size_t bytes)
 }
 EXPORT_SYMBOL(iov_iter_advance);
 
-int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
+int iov_iter_fault_in_readable(struct iov_iter *i, size_t* bytes)
 {
-	char __user *buf = i->iov->iov_base + i->iov_offset;
-	return fault_in_pages_readable(buf, bytes);
+	size_t len = *bytes;
+	int ret;
+	if (likely(i->nr_segs == 1)) {
+		ret = fault_in_pages_readable(i->iov->iov_base, len);
+		if (ret)
+			*bytes = 0;
+	} else {
+		const struct iovec *iov = i->iov;
+		size_t base = i->iov_offset;
+		*bytes = 0;
+		while (len) {
+			int copy = min(len, iov->iov_len - base);
+			if ((ret = fault_in_pages_readable(iov->iov_base + base, copy)))
+				break;
+			*bytes += copy;
+			len -= copy;
+			base += copy;
+			if (iov->iov_len == base) {
+				iov++;
+				base = 0;
+			}
+		}
+	}
+	return ret;	
 }
 EXPORT_SYMBOL(iov_iter_fault_in_readable);
 
@@ -2110,7 +2132,7 @@ static ssize_t generic_perform_write_2copy(struct file *file,
 		 * to check that the address is actually valid, when atomic
 		 * usercopies are used, below.
 		 */
-		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+		if (unlikely(iov_iter_fault_in_readable(i, &bytes) && !bytes)) {
 			status = -EFAULT;
 			break;
 		}
@@ -2284,7 +2306,7 @@ again:
 		 * to check that the address is actually valid, when atomic
 		 * usercopies are used, below.
 		 */
-		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+		if (unlikely(iov_iter_fault_in_readable(i, &bytes)&& !bytes)) {
 			status = -EFAULT;
 			break;
 		}



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
  2007-06-13 11:43   ` Nick Piggin
@ 2007-06-13 18:51     ` Dmitriy Monakhov
  2007-06-13 23:07     ` Badari Pulavarty
  1 sibling, 0 replies; 13+ messages in thread
From: Dmitriy Monakhov @ 2007-06-13 18:51 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, mark.fasheh, linux-ext4, Andrew Morton

On 13:43 Срд 13 Июн     , Nick Piggin wrote:
> On Wed, Jun 13, 2007 at 05:40:05PM +0400, Dmitriy Monakhov wrote:
> > On 14:19 ?????? 29 ??????     , akpm@linux-foundation.org wrote:
> > > 
> > > The patch titled
> > >      fs: introduce write_begin, write_end, and perform_write aops
> > > has been added to the -mm tree.  Its filename is
> > >      fs-introduce-write_begin-write_end-and-perform_write-aops.patch
> > > 
> > > *** Remember to use Documentation/SubmitChecklist when testing your code ***
> > > 
> > > See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
> > > out what to do about this
> > > 
> > > ------------------------------------------------------
> > > Subject: fs: introduce write_begin, write_end, and perform_write aops
> > > From: Nick Piggin <npiggin@suse.de>
> > > 
> > > These are intended to replace prepare_writ eand commit_write with more
> > > flexible alternatives that are also able to avoid the buffered write
> > > deadlock problems efficiently (which prepare_write is unable to do).
> > > 
> > > [mark.fasheh@oracle.com: API design contributions, code review and fixes]
> > > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > > Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
> > > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > I've finaly find time to review Nick's "write_begin/write_end aop" patch set.
> > And i have some fixes and questions. My be i've missed somthing and it was 
> > already disscussed, but i cant find in LKML.
> 
> Thanks, that's really helpful.
> 
>  
> > 1) loop dev:
> > 	loop.c code itself is not perfect. In fact before Nick's patch
> > 	partial write was't possible. Assumption what write chunks are
> > 	always page aligned is realy weird ( see "index++" line).
> > 	Fixed by "new aop loop fix" patch
> 
> I think you're right, fix looks good.
> 
>  
> > 2)block_write_begin:
> > 	After we enter to block_write_begin with *pagep == NULL and
> > 	some page was grabed we remember this page in *pagep
> > 	And if __block_prepare_write() we have to clear *pagep , as 
> > 	it was before. Because this may confuse caller.
> > 	for example caller may have folowing code:
> > 		ret = block_write_begin(..., pagep,...)
> > 		if (ret && *pagep != NULL) {
> > 			unlock_page(*pagep);
> > 			page_cache_release(*pagep);
> > 		}
> > 	Fixed my "new aop block_write_begin fix" patch
> 
> I don't think the caller can rely on that if it returns failure.
> But that is more defensive I guess. Maybe setting it to 1 or
> so would catch abusers.
> 
>  
> > 3) __page_symlink:
> > 	Nick's patch add folowing code:
> > 	+ err = pagecache_write_begin(NULL, mapping, 0,PAGE_CACHE_SIZE,
> > 	+                 AOP_FLAG_UNINTERRUPTIBLE, &page,&fsdata);
> > 	symlink now consume whole page. I have only one question "WHY???".
> > 	I don't see any advantages, but where are huge list of
> > 	dissdvantages:
> > 	a) fs with blksize == 1k and pagesize == 16k after this patch
> > 	   waste more than 10x times disk space for nothing.
> > 	b) What happends if we want use fs with blksize == 4k on i386
> > 	   after it was used by ia64 ??? (before this patch it was
> > 	   possible).
One more visiable effect caused by wrong symlink size:
fsstress cause folowing error:
<LOG BEGIN>
EXT3-fs unexpected failure: !buffer_revoked(bh); 
inconsistent data on disk
ext3_forget: aborting transaction: IO failure in __ext3_journal_revoke
ext3_abort called.

EXT3-fs error (device dm-4): ext3_forget: error -5 when attempting
revoke
Remounting filesystem read-only 
Aborting journal on device dm-4.
journal commit I/O error
journal commit I/O error
journal commit I/O error
journal commit I/O error
journal commit I/O error
journal commit I/O error
EXT3-fs error (device dm-4) in ext3_free_blocks_sb: Journal has aborted

journal commit I/O error
journal commit I/O error
EXT3-fs error (device dm-4) in ext3_reserve_inode_write: Journal has
aborted
EXT3-fs error (device dm-4) in ext3_truncate: IO failure
journal commit I/O error
journal commit I/O error
journal commit I/O error
journal commit I/O error
EXT3-fs error (device dm-4) in ext3_reserve_inode_write: Journal has
aborted
EXT3-fs error (device dm-4) in ext3_orphan_del: Journal has aborted
EXT3-fs error (device dm-4) in ext3_reserve_inode_write: Journal has
aborted
EXT3-fs error (device dm-4) in ext3_delete_inode: IO failure
<LOG END>

After symlink size was fixed to len-1 problem dissappeared. 
> > 	
> > 	I dont prepare patch for this because i dont understand issue
> > 	witch Nick aimed to fix.
> 
> I don't know why myself :P I think it would be just fine to use
> len-1 as it did previously, so it must have been a typo?
> 
> 
> > 4) iov_iter_fault_in_readable:
> > 	Function prerform check for signgle region, with out respect to
> > 	segment nature of iovec, For example writev no longer works :) :
> > 	writev(3, [{"\1", 1}, {"\2"..., 4096}], 2) = -1 EFAULT (Bad address)
> > 	this hidden bug, and it was invisiable because XXXX_fault_in_readable
> > 	return value was ignored before. Lets iov_iter_fault_in_readable
> > 	perform checks for all segments.
> > 	Fixed by :"iov_iter_fault_in_readable fix"
> 
> OK thanks. I would rather just change this to use the length of
> the first iovec always (prefaulting multiple iovecs would have to
> be benchmarked on real apps that make heavy use of writev, I
> suspect).
> 
>  
> > 5) ext3_write_end:
> > 	Before  write_begin/write_end patch set we have folowing locking
> > 	order:
> > 		stop_journal(handle);
> > 		unlock_page(page);
> > 	But now order is oposite:
> > 		unlock_page(page);
> > 		stop_journal(handle);
> > 	Can we got any race condition now? I'm not sure is it actual problem,
> > 	may be somebody cant describe this.
> 
> Can we just change it to the original order? That would seem to be
> safest unless one of the ext3 devs explicitly acks it.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
  2007-06-13 11:43   ` Nick Piggin
  2007-06-13 18:51     ` Dmitriy Monakhov
@ 2007-06-13 23:07     ` Badari Pulavarty
  2007-06-13 23:28       ` Nick Piggin
  2007-06-14  9:52       ` Jan Kara
  1 sibling, 2 replies; 13+ messages in thread
From: Badari Pulavarty @ 2007-06-13 23:07 UTC (permalink / raw)
  To: Nick Piggin; +Cc: lkml, mark.fasheh, ext4, Andrew Morton, cmm

On Wed, 2007-06-13 at 13:43 +0200, Nick Piggin wrote:
..
>  
> > 5) ext3_write_end:
> > 	Before  write_begin/write_end patch set we have folowing locking
> > 	order:
> > 		stop_journal(handle);
> > 		unlock_page(page);
> > 	But now order is oposite:
> > 		unlock_page(page);
> > 		stop_journal(handle);
> > 	Can we got any race condition now? I'm not sure is it actual problem,
> > 	may be somebody cant describe this.
> 
> Can we just change it to the original order? That would seem to be
> safest unless one of the ext3 devs explicitly acks it.

It would be nice to go back to original order, but its not that
simple with current structure of the code. With Nick's patches
unlock_page() happens in generic_write_end(). journal_stop() 
needs to happen after generic_write_end(). :(

Mingming, can you take a look at the current & proposed order ?
I ran into bunch of races when I tried to change the order for
->writepages() support earlier :(

Thanks,
Badari


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
  2007-06-13 23:07     ` Badari Pulavarty
@ 2007-06-13 23:28       ` Nick Piggin
  2007-06-14  9:52       ` Jan Kara
  1 sibling, 0 replies; 13+ messages in thread
From: Nick Piggin @ 2007-06-13 23:28 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: lkml, mark.fasheh, ext4, Andrew Morton, cmm

On Wed, Jun 13, 2007 at 04:07:01PM -0700, Badari Pulavarty wrote:
> On Wed, 2007-06-13 at 13:43 +0200, Nick Piggin wrote:
> ..
> >  
> > > 5) ext3_write_end:
> > > 	Before  write_begin/write_end patch set we have folowing locking
> > > 	order:
> > > 		stop_journal(handle);
> > > 		unlock_page(page);
> > > 	But now order is oposite:
> > > 		unlock_page(page);
> > > 		stop_journal(handle);
> > > 	Can we got any race condition now? I'm not sure is it actual problem,
> > > 	may be somebody cant describe this.
> > 
> > Can we just change it to the original order? That would seem to be
> > safest unless one of the ext3 devs explicitly acks it.
> 
> It would be nice to go back to original order, but its not that
> simple with current structure of the code. With Nick's patches
> unlock_page() happens in generic_write_end(). journal_stop() 
> needs to happen after generic_write_end(). :(

Well we could use block_write_end?

 
> Mingming, can you take a look at the current & proposed order ?
> I ran into bunch of races when I tried to change the order for
> ->writepages() support earlier :(

OK, it sounds like we probably want to revert to the original
order at least for this patchset. If the new order is proven
safe then that could be introduced later to simplify things...

Thanks,
Nick


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
  2007-06-13 23:07     ` Badari Pulavarty
  2007-06-13 23:28       ` Nick Piggin
@ 2007-06-14  9:52       ` Jan Kara
  2007-06-14 10:39         ` Nick Piggin
  1 sibling, 1 reply; 13+ messages in thread
From: Jan Kara @ 2007-06-14  9:52 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Nick Piggin, lkml, mark.fasheh, ext4, Andrew Morton, cmm

> On Wed, 2007-06-13 at 13:43 +0200, Nick Piggin wrote:
> ..
> >  
> > > 5) ext3_write_end:
> > > 	Before  write_begin/write_end patch set we have folowing locking
> > > 	order:
> > > 		stop_journal(handle);
> > > 		unlock_page(page);
> > > 	But now order is oposite:
> > > 		unlock_page(page);
> > > 		stop_journal(handle);
> > > 	Can we got any race condition now? I'm not sure is it actual problem,
> > > 	may be somebody cant describe this.
> > 
> > Can we just change it to the original order? That would seem to be
> > safest unless one of the ext3 devs explicitly acks it.
  Sorry, I've missed beginning of this thread. But what problems can
exactly cause this ordering change? ext3_journal_stop has no need to be
protected by the page lock - it can be even better that it's not
protected as it can trigger commit and all that would happen
unnecessarily under page lock...

								Honza

-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
  2007-06-14  9:52       ` Jan Kara
@ 2007-06-14 10:39         ` Nick Piggin
  0 siblings, 0 replies; 13+ messages in thread
From: Nick Piggin @ 2007-06-14 10:39 UTC (permalink / raw)
  To: Jan Kara; +Cc: Badari Pulavarty, lkml, mark.fasheh, ext4, Andrew Morton, cmm

On Thu, Jun 14, 2007 at 11:52:49AM +0200, Jan Kara wrote:
> > On Wed, 2007-06-13 at 13:43 +0200, Nick Piggin wrote:
> > ..
> > >  
> > > > 5) ext3_write_end:
> > > > 	Before  write_begin/write_end patch set we have folowing locking
> > > > 	order:
> > > > 		stop_journal(handle);
> > > > 		unlock_page(page);
> > > > 	But now order is oposite:
> > > > 		unlock_page(page);
> > > > 		stop_journal(handle);
> > > > 	Can we got any race condition now? I'm not sure is it actual problem,
> > > > 	may be somebody cant describe this.
> > > 
> > > Can we just change it to the original order? That would seem to be
> > > safest unless one of the ext3 devs explicitly acks it.
>   Sorry, I've missed beginning of this thread. But what problems can
> exactly cause this ordering change? ext3_journal_stop has no need to be
> protected by the page lock - it can be even better that it's not
> protected as it can trigger commit and all that would happen
> unnecessarily under page lock...

Sure, if you think it is safe. I would rather it be done in a
different patch though.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: iov_iter_fault_in_readable fix
  2007-06-13 13:57   ` iov_iter_fault_in_readable fix Dmitriy Monakhov
@ 2007-06-14 17:31     ` Christoph Hellwig
  2007-06-14 22:21       ` David Chinner
  2007-06-16 18:17       ` Dmitriy Monakhov
  0 siblings, 2 replies; 13+ messages in thread
From: Christoph Hellwig @ 2007-06-14 17:31 UTC (permalink / raw)
  To: linux-kernel, npiggin, mark.fasheh, linux-ext4, xfs

On Wed, Jun 13, 2007 at 05:57:59PM +0400, Dmitriy Monakhov wrote:
>  	Function prerform check for signgle region, with out respect to
>  	segment nature of iovec, For example writev no longer works :)

Btw, could someone please start to collect all sniplets like this in
a nice simple regression test suite?  If no one wants to start a new
one we should probably just put it into xfsqa (which should be useable
for other filesystems aswell despite the name)

> 
> 	/* TESTCASE BEGIN */
> 	#include <stdio.h>
> 	#include <sys/types.h>
> 	#include <sys/stat.h>
> 	#include <fcntl.h>
> 	#include <sys/uio.h>
> 	#include <sys/mman.h>
> 	#define SIZE  (4096 * 2)
> 	int main(int argc, char* argv[])
> 	{	
> 		char* ptr[4];
> 		struct iovec iov[2];
> 		int fd, ret;
> 		ptr[0] = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
> 			MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
> 		ptr[1] = mmap(NULL, SIZE, PROT_NONE,
> 			MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
> 		ptr[2] = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
> 			MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
> 		ptr[3] = mmap(NULL, SIZE, PROT_NONE, 
> 			MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
> 
> 		iov[0].iov_base = ptr[0] + (SIZE -1);
> 		iov[0].iov_len = 1;
> 		memset(ptr[0], 1, SIZE);
> 
> 		iov[1].iov_base = ptr[2];
> 		iov[1].iov_len = SIZE;
> 		memset(ptr[2], 2, SIZE);
> 
> 		fd = open(argv[1], O_CREAT|O_RDWR|O_TRUNC, 0666);
> 		ret = writev(fd, iov, sizeof(iov) / sizeof(struct iovec));
> 		return 0;
> 	}	
> 	/* TESTCASE END*/
> 	We will get folowing result:
> 		writev(3, [{"\1", 1}, {"\2"..., 8192}], 2) = -1 EFAULT (Bad address)
>  	
> 	this is hidden bug, and it was invisiable because XXXX_fault_in_readable
>  	return value was ignored before. Lets iov_iter_fault_in_readable
>  	perform checks for all segments.
> 
> Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index fef19fc..7e025ea 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -433,7 +433,7 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
>  size_t iov_iter_copy_from_user(struct page *page,
>  		struct iov_iter *i, unsigned long offset, size_t bytes);
>  void iov_iter_advance(struct iov_iter *i, size_t bytes);
> -int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes);
> +int iov_iter_fault_in_readable(struct iov_iter *i, size_t *bytes);
>  size_t iov_iter_single_seg_count(struct iov_iter *i);
>  
>  static inline void iov_iter_init(struct iov_iter *i,
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8d59ed9..8600c3e 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1817,10 +1817,32 @@ void iov_iter_advance(struct iov_iter *i, size_t bytes)
>  }
>  EXPORT_SYMBOL(iov_iter_advance);
>  
> -int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
> +int iov_iter_fault_in_readable(struct iov_iter *i, size_t* bytes)
>  {
> -	char __user *buf = i->iov->iov_base + i->iov_offset;
> -	return fault_in_pages_readable(buf, bytes);
> +	size_t len = *bytes;
> +	int ret;
> +	if (likely(i->nr_segs == 1)) {
> +		ret = fault_in_pages_readable(i->iov->iov_base, len);
> +		if (ret)
> +			*bytes = 0;
> +	} else {
> +		const struct iovec *iov = i->iov;
> +		size_t base = i->iov_offset;
> +		*bytes = 0;
> +		while (len) {
> +			int copy = min(len, iov->iov_len - base);
> +			if ((ret = fault_in_pages_readable(iov->iov_base + base, copy)))
> +				break;
> +			*bytes += copy;
> +			len -= copy;
> +			base += copy;
> +			if (iov->iov_len == base) {
> +				iov++;
> +				base = 0;
> +			}
> +		}
> +	}
> +	return ret;	
>  }
>  EXPORT_SYMBOL(iov_iter_fault_in_readable);
>  
> @@ -2110,7 +2132,7 @@ static ssize_t generic_perform_write_2copy(struct file *file,
>  		 * to check that the address is actually valid, when atomic
>  		 * usercopies are used, below.
>  		 */
> -		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
> +		if (unlikely(iov_iter_fault_in_readable(i, &bytes) && !bytes)) {
>  			status = -EFAULT;
>  			break;
>  		}
> @@ -2284,7 +2306,7 @@ again:
>  		 * to check that the address is actually valid, when atomic
>  		 * usercopies are used, below.
>  		 */
> -		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
> +		if (unlikely(iov_iter_fault_in_readable(i, &bytes)&& !bytes)) {
>  			status = -EFAULT;
>  			break;
>  		}
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
---end quoted text---

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: iov_iter_fault_in_readable fix
  2007-06-14 17:31     ` Christoph Hellwig
@ 2007-06-14 22:21       ` David Chinner
  2007-06-14 22:34         ` Christoph Hellwig
  2007-06-16 18:17       ` Dmitriy Monakhov
  1 sibling, 1 reply; 13+ messages in thread
From: David Chinner @ 2007-06-14 22:21 UTC (permalink / raw)
  To: Christoph Hellwig, linux-kernel, npiggin, mark.fasheh, linux-ext4, xfs

On Thu, Jun 14, 2007 at 06:31:53PM +0100, Christoph Hellwig wrote:
> On Wed, Jun 13, 2007 at 05:57:59PM +0400, Dmitriy Monakhov wrote:
> >  	Function prerform check for signgle region, with out respect to
> >  	segment nature of iovec, For example writev no longer works :)
> 
> Btw, could someone please start to collect all sniplets like this in
> a nice simple regression test suite?  If no one wants to start a new
> one we should probably just put it into xfsqa (which should be useable
> for other filesystems aswell despite the name)

Yeah, it can run a subset of the tests on NFS and UDF filesystems as well and
there are some specific UDF-only tests in it too.  I think the NFS test group
is mostly generic tests that don't use or test specific XFS features.

I'd be happy to accumulate tests like these in xfsqa...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: iov_iter_fault_in_readable fix
  2007-06-14 22:21       ` David Chinner
@ 2007-06-14 22:34         ` Christoph Hellwig
  0 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2007-06-14 22:34 UTC (permalink / raw)
  To: David Chinner
  Cc: Christoph Hellwig, linux-kernel, npiggin, mark.fasheh, linux-ext4, xfs

On Fri, Jun 15, 2007 at 08:21:09AM +1000, David Chinner wrote:
> Yeah, it can run a subset of the tests on NFS and UDF filesystems as well and
> there are some specific UDF-only tests in it too.  I think the NFS test group
> is mostly generic tests that don't use or test specific XFS features.

Actually most testcases can run on any reasonable posixish filesystem, we
just need some glue to tell the testsuite it's actually okay.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: iov_iter_fault_in_readable fix
  2007-06-14 17:31     ` Christoph Hellwig
  2007-06-14 22:21       ` David Chinner
@ 2007-06-16 18:17       ` Dmitriy Monakhov
  1 sibling, 0 replies; 13+ messages in thread
From: Dmitriy Monakhov @ 2007-06-16 18:17 UTC (permalink / raw)
  To: Christoph Hellwig, linux-kernel, npiggin, mark.fasheh, linux-ext4, xfs

On 18:31 Чтв 14 Июн     , Christoph Hellwig wrote:
> On Wed, Jun 13, 2007 at 05:57:59PM +0400, Dmitriy Monakhov wrote:
> >  	Function prerform check for signgle region, with out respect to
> >  	segment nature of iovec, For example writev no longer works :)
> 
> Btw, could someone please start to collect all sniplets like this in
> a nice simple regression test suite?  If no one wants to start a new
> one we should probably just put it into xfsqa (which should be useable
> for other filesystems aswell despite the name)
I've prepared testcase (testcases/kernel/syscalls/writev/writev06.c) 
and sent it to ltp mailing list.


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2007-06-16 18:18 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <200705292119.l4TLJtAD011726@shell0.pdx.osdl.net>
2007-06-13 13:40 ` + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree Dmitriy Monakhov
2007-06-13 11:43   ` Nick Piggin
2007-06-13 18:51     ` Dmitriy Monakhov
2007-06-13 23:07     ` Badari Pulavarty
2007-06-13 23:28       ` Nick Piggin
2007-06-14  9:52       ` Jan Kara
2007-06-14 10:39         ` Nick Piggin
2007-06-13 13:50   ` [patch] new aop block_write_begin fix Dmitriy Monakhov
2007-06-13 13:57   ` iov_iter_fault_in_readable fix Dmitriy Monakhov
2007-06-14 17:31     ` Christoph Hellwig
2007-06-14 22:21       ` David Chinner
2007-06-14 22:34         ` Christoph Hellwig
2007-06-16 18:17       ` Dmitriy Monakhov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).