linux-kernel.vger.kernel.org archive mirror
* [patch 0/9] buffered write deadlock fix
@ 2007-02-04  8:49 Nick Piggin
  2007-02-04  8:49 ` [patch 1/9] fs: libfs buffered write leak fix Nick Piggin
                   ` (8 more replies)
  0 siblings, 9 replies; 36+ messages in thread
From: Nick Piggin @ 2007-02-04  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

Have fixed a few issues since last time:
- better comments for the SetPageUptodate race
- actually fix the nobh problem rather than adding a comment
- use kmap_atomic instead of kmap

Patches against 2.6.20-rc7.

Thanks,
Nick

--
SuSE Labs



* [patch 1/9] fs: libfs buffered write leak fix
  2007-02-04  8:49 [patch 0/9] buffered write deadlock fix Nick Piggin
@ 2007-02-04  8:49 ` Nick Piggin
  2007-02-04  8:50 ` [patch 2/9] mm: revert "generic_file_buffered_write(): handle zero length iovec segments" Nick Piggin
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 36+ messages in thread
From: Nick Piggin @ 2007-02-04  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

simple_prepare_write and nobh_prepare_write leak uninitialised kernel data.
This happens because the prepare_write functions leave an uninitialised
"hole" over the part of the page that the write is expected to go to. This
is fine, but they then mark the page uptodate, which means a concurrent read
can come in and copy the uninitialised memory into userspace before it is
written to.

Fix simple_prepare_write by initialising the whole page in the case of a
partial-page write. In the case of a full-page write, we don't SetPageUptodate
until commit_write time.
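
To make the race concrete, here is a sketch of the bad interleaving (an
illustration of ours, not code from the patch):

	/*
	 * writer (old simple_prepare_write)     concurrent read(2)
	 * ---------------------------------     --------------------------
	 * zero [0, from) and
	 *      [to, PAGE_CACHE_SIZE)
	 * SetPageUptodate(page)
	 *                                        sees PageUptodate, copies
	 *                                        the whole page to userspace,
	 *                                        including the uninitialised
	 *                                        [from, to) hole
	 * copy the write data into [from, to)
	 */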

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -327,25 +327,32 @@ int simple_readpage(struct file *file, s
 int simple_prepare_write(struct file *file, struct page *page,
 			unsigned from, unsigned to)
 {
-	if (!PageUptodate(page)) {
-		if (to - from != PAGE_CACHE_SIZE) {
-			void *kaddr = kmap_atomic(page, KM_USER0);
-			memset(kaddr, 0, from);
-			memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
-			flush_dcache_page(page);
-			kunmap_atomic(kaddr, KM_USER0);
-		}
+	if (PageUptodate(page))
+		return 0;
+
+	if (to - from != PAGE_CACHE_SIZE) {
+		/*
+		 * Partial-page write? Initialise the complete page and
+		 * set it uptodate. We could avoid initialising the
+		 * (from, to) hole, and opt to mark it uptodate in
+		 * simple_commit_write, but that's probably only a win
+		 * for filesystems that would need to read blocks off disk.
+		 */
+		memclear_highpage_flush(page, 0, PAGE_CACHE_SIZE);
 		SetPageUptodate(page);
 	}
+
 	return 0;
 }
 
 int simple_commit_write(struct file *file, struct page *page,
-			unsigned offset, unsigned to)
+			unsigned from, unsigned to)
 {
 	struct inode *inode = page->mapping->host;
 	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
 
+	if (to - from == PAGE_CACHE_SIZE)
+		SetPageUptodate(page);
 	/*
 	 * No need to use i_size_read() here, the i_size
 	 * cannot change under us because we hold the i_mutex.
@@ -353,6 +360,7 @@ int simple_commit_write(struct file *fil
 	if (pos > inode->i_size)
 		i_size_write(inode, pos);
 	set_page_dirty(page);
+
 	return 0;
 }
 
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2344,17 +2344,6 @@ int nobh_prepare_write(struct page *page
 
 	if (is_mapped_to_disk)
 		SetPageMappedToDisk(page);
-	SetPageUptodate(page);
-
-	/*
-	 * Setting the page dirty here isn't necessary for the prepare_write
-	 * function - commit_write will do that.  But if/when this function is
-	 * used within the pagefault handler to ensure that all mmapped pages
-	 * have backing space in the filesystem, we will need to dirty the page
-	 * if its contents were altered.
-	 */
-	if (dirtied_it)
-		set_page_dirty(page);
 
 	return 0;
 
@@ -2384,6 +2373,7 @@ int nobh_commit_write(struct file *file,
 	struct inode *inode = page->mapping->host;
 	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
 
+	SetPageUptodate(page);
 	set_page_dirty(page);
 	if (pos > inode->i_size) {
 		i_size_write(inode, pos);
Index: linux-2.6/Documentation/filesystems/vfs.txt
===================================================================
--- linux-2.6.orig/Documentation/filesystems/vfs.txt
+++ linux-2.6/Documentation/filesystems/vfs.txt
@@ -617,6 +617,11 @@ struct address_space_operations {
 	In this case the prepare_write will be retried once the lock is
   	regained.
 
+	Note: the page _must not_ be marked uptodate in this function
+	(or anywhere else) unless it actually is uptodate right now. As
+	soon as a page is marked uptodate, it is possible for a concurrent
+	read(2) to copy it to userspace.
+
   commit_write: If prepare_write succeeds, new data will be copied
         into the page and then commit_write will be called.  It will
         typically update the size of the file (if appropriate) and


* [patch 2/9] mm: revert "generic_file_buffered_write(): handle zero length iovec segments"
  2007-02-04  8:49 [patch 0/9] buffered write deadlock fix Nick Piggin
  2007-02-04  8:49 ` [patch 1/9] fs: libfs buffered write leak fix Nick Piggin
@ 2007-02-04  8:50 ` Nick Piggin
  2007-02-04  8:50 ` [patch 3/9] mm: revert "generic_file_buffered_write(): deadlock on vectored write" Nick Piggin
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 36+ messages in thread
From: Nick Piggin @ 2007-02-04  8:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

From: Andrew Morton <akpm@osdl.org>

Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6.

This was a bugfix against 6527c2bdf1f833cc18e8f42bd97973d583e4aa83, which we
also revert.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2120,12 +2120,6 @@ generic_file_buffered_write(struct kiocb
 			break;
 		}
 
-		if (unlikely(bytes == 0)) {
-			status = 0;
-			copied = 0;
-			goto zero_length_segment;
-		}
-
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
 		if (unlikely(status)) {
 			loff_t isize = i_size_read(inode);
@@ -2155,8 +2149,7 @@ generic_file_buffered_write(struct kiocb
 			page_cache_release(page);
 			continue;
 		}
-zero_length_segment:
-		if (likely(copied >= 0)) {
+		if (likely(copied > 0)) {
 			if (!status)
 				status = copied;
 
Index: linux-2.6/mm/filemap.h
===================================================================
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -87,7 +87,7 @@ filemap_set_next_iovec(const struct iove
 	const struct iovec *iov = *iovp;
 	size_t base = *basep;
 
-	do {
+	while (bytes) {
 		int copy = min(bytes, iov->iov_len - base);
 
 		bytes -= copy;
@@ -96,7 +96,7 @@ filemap_set_next_iovec(const struct iove
 			iov++;
 			base = 0;
 		}
-	} while (bytes);
+	}
 	*iovp = iov;
 	*basep = base;
 }


* [patch 3/9] mm: revert "generic_file_buffered_write(): deadlock on vectored write"
  2007-02-04  8:49 [patch 0/9] buffered write deadlock fix Nick Piggin
  2007-02-04  8:49 ` [patch 1/9] fs: libfs buffered write leak fix Nick Piggin
  2007-02-04  8:50 ` [patch 2/9] mm: revert "generic_file_buffered_write(): handle zero length iovec segments" Nick Piggin
@ 2007-02-04  8:50 ` Nick Piggin
  2007-02-04  8:50 ` [patch 4/9] mm: generic_file_buffered_write cleanup Nick Piggin
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 36+ messages in thread
From: Nick Piggin @ 2007-02-04  8:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

From: Andrew Morton <akpm@osdl.org>

Revert 6527c2bdf1f833cc18e8f42bd97973d583e4aa83

This patch fixed the following bug:

  When prefaulting in the pages in generic_file_buffered_write(), we only
  faulted in the pages for the first segment of the iovec.  If the second or
  a subsequent segment described an mmapping of the page into which we're
  write()ing, and that page is not up-to-date, the fault handler tries to lock
  the already-locked page (to bring it up to date) and deadlocks.

  An exploit for this bug is in writev-deadlock-demo.c, in
  http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.

  (These demos assume blocksize < PAGE_CACHE_SIZE).
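
For illustration, a minimal userspace sketch of that kind of trigger (our
reconstruction, not the actual writev-deadlock-demo.c referenced above;
error checking elided):

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <sys/uio.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
		char head[16] = "deadbeefcafef00";
		struct iovec iov[2];
		char *map;

		ftruncate(fd, 4096);
		map = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);

		/*
		 * Segment 1 sources its data from a mapping of the very
		 * page being written.  Only segment 0 was prefaulted, so
		 * copying segment 1 faults while the target page is held
		 * locked, and the fault handler blocks on that same lock.
		 */
		iov[0].iov_base = head;
		iov[0].iov_len = sizeof(head);
		iov[1].iov_base = map;
		iov[1].iov_len = 64;
		writev(fd, iov, 2);	/* may deadlock on affected kernels */

		return 0;
	}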

The problem with this fix is that it takes the kernel back to doing a single
prepare_write()/commit_write() per iovec segment.  So in the worst case we'll
run prepare_write+commit_write 1024 times where we previously would have run
it once.  The other problem with the fix is that it didn't actually fix all
the locking problems.


<insert numbers obtained via ext3-tools's writev-speed.c here>

And apparently this change killed NFS overwrite performance, because, I
suppose, it talks to the server for each prepare_write+commit_write.

So just back that patch out - we'll be fixing the deadlock by other means.

Signed-off-by: Andrew Morton <akpm@osdl.org>

Nick says: also it only ever actually papered over the bug, because after
faulting in the pages, they might be unmapped or reclaimed.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2090,21 +2090,14 @@ generic_file_buffered_write(struct kiocb
 	do {
 		unsigned long index;
 		unsigned long offset;
+		unsigned long maxlen;
 		size_t copied;
 
 		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
 		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = PAGE_CACHE_SIZE - offset;
-
-		/* Limit the size of the copy to the caller's write size */
-		bytes = min(bytes, count);
-
-		/*
-		 * Limit the size of the copy to that of the current segment,
-		 * because fault_in_pages_readable() doesn't know how to walk
-		 * segments.
-		 */
-		bytes = min(bytes, cur_iov->iov_len - iov_base);
+		if (bytes > count)
+			bytes = count;
 
 		/*
 		 * Bring in the user page that we will copy from _first_.
@@ -2112,7 +2105,10 @@ generic_file_buffered_write(struct kiocb
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
 		 */
-		fault_in_pages_readable(buf, bytes);
+		maxlen = cur_iov->iov_len - iov_base;
+		if (maxlen > bytes)
+			maxlen = bytes;
+		fault_in_pages_readable(buf, maxlen);
 
 		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
 		if (!page) {


* [patch 4/9] mm: generic_file_buffered_write cleanup
  2007-02-04  8:49 [patch 0/9] buffered write deadlock fix Nick Piggin
                   ` (2 preceding siblings ...)
  2007-02-04  8:50 ` [patch 3/9] mm: revert "generic_file_buffered_write(): deadlock on vectored write" Nick Piggin
@ 2007-02-04  8:50 ` Nick Piggin
  2007-02-04  8:50 ` [patch 5/9] mm: debug write deadlocks Nick Piggin
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 36+ messages in thread
From: Nick Piggin @ 2007-02-04  8:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

From: Andrew Morton <akpm@osdl.org>

Clean up buffered write code. Rename some variables and fix some types.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2063,16 +2063,15 @@ generic_file_buffered_write(struct kiocb
 		size_t count, ssize_t written)
 {
 	struct file *file = iocb->ki_filp;
-	struct address_space * mapping = file->f_mapping;
+	struct address_space *mapping = file->f_mapping;
 	const struct address_space_operations *a_ops = mapping->a_ops;
 	struct inode 	*inode = mapping->host;
 	long		status = 0;
 	struct page	*page;
 	struct page	*cached_page = NULL;
-	size_t		bytes;
 	struct pagevec	lru_pvec;
 	const struct iovec *cur_iov = iov; /* current iovec */
-	size_t		iov_base = 0;	   /* offset in the current iovec */
+	size_t		iov_offset = 0;	   /* offset in the current iovec */
 	char __user	*buf;
 
 	pagevec_init(&lru_pvec, 0);
@@ -2083,31 +2082,33 @@ generic_file_buffered_write(struct kiocb
 	if (likely(nr_segs == 1))
 		buf = iov->iov_base + written;
 	else {
-		filemap_set_next_iovec(&cur_iov, &iov_base, written);
-		buf = cur_iov->iov_base + iov_base;
+		filemap_set_next_iovec(&cur_iov, &iov_offset, written);
+		buf = cur_iov->iov_base + iov_offset;
 	}
 
 	do {
-		unsigned long index;
-		unsigned long offset;
-		unsigned long maxlen;
-		size_t copied;
+		pgoff_t index;		/* Pagecache index for current page */
+		unsigned long offset;	/* Offset into pagecache page */
+		unsigned long maxlen;	/* Bytes remaining in current iovec */
+		size_t bytes;		/* Bytes to write to page */
+		size_t copied;		/* Bytes copied from user */
 
-		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
+		offset = (pos & (PAGE_CACHE_SIZE - 1));
 		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = PAGE_CACHE_SIZE - offset;
 		if (bytes > count)
 			bytes = count;
 
+		maxlen = cur_iov->iov_len - iov_offset;
+		if (maxlen > bytes)
+			maxlen = bytes;
+
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
 		 */
-		maxlen = cur_iov->iov_len - iov_base;
-		if (maxlen > bytes)
-			maxlen = bytes;
 		fault_in_pages_readable(buf, maxlen);
 
 		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
@@ -2138,7 +2139,7 @@ generic_file_buffered_write(struct kiocb
 							buf, bytes);
 		else
 			copied = filemap_copy_from_user_iovec(page, offset,
-						cur_iov, iov_base, bytes);
+						cur_iov, iov_offset, bytes);
 		flush_dcache_page(page);
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (status == AOP_TRUNCATED_PAGE) {
@@ -2156,12 +2157,12 @@ generic_file_buffered_write(struct kiocb
 				buf += status;
 				if (unlikely(nr_segs > 1)) {
 					filemap_set_next_iovec(&cur_iov,
-							&iov_base, status);
+							&iov_offset, status);
 					if (count)
 						buf = cur_iov->iov_base +
-							iov_base;
+							iov_offset;
 				} else {
-					iov_base += status;
+					iov_offset += status;
 				}
 			}
 		}


* [patch 5/9] mm: debug write deadlocks
  2007-02-04  8:49 [patch 0/9] buffered write deadlock fix Nick Piggin
                   ` (3 preceding siblings ...)
  2007-02-04  8:50 ` [patch 4/9] mm: generic_file_buffered_write cleanup Nick Piggin
@ 2007-02-04  8:50 ` Nick Piggin
  2007-02-04  8:50 ` [patch 6/9] mm: be sure to trim blocks Nick Piggin
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 36+ messages in thread
From: Nick Piggin @ 2007-02-04  8:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

Allow CONFIG_DEBUG_VM to switch off the prefaulting logic, to simulate the
difficult race where the page may be unmapped before calling copy_from_user.
Makes the race much easier to hit.

This is useful for demonstration and testing purposes, but is removed in a
subsequent patch.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2103,6 +2103,7 @@ generic_file_buffered_write(struct kiocb
 		if (maxlen > bytes)
 			maxlen = bytes;
 
+#ifndef CONFIG_DEBUG_VM
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
@@ -2110,6 +2111,7 @@ generic_file_buffered_write(struct kiocb
 		 * up-to-date.
 		 */
 		fault_in_pages_readable(buf, maxlen);
+#endif
 
 		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
 		if (!page) {


* [patch 6/9] mm: be sure to trim blocks
  2007-02-04  8:49 [patch 0/9] buffered write deadlock fix Nick Piggin
                   ` (4 preceding siblings ...)
  2007-02-04  8:50 ` [patch 5/9] mm: debug write deadlocks Nick Piggin
@ 2007-02-04  8:50 ` Nick Piggin
  2007-02-04  8:50 ` [patch 7/9] mm: cleanup pagecache insertion operations Nick Piggin
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 36+ messages in thread
From: Nick Piggin @ 2007-02-04  8:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then
the write has failed even though prepare_write may already have instantiated
blocks past i_size, and those blocks were never trimmed off. Fix this, and
consolidate the trimming into one place.
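
Concretely (an illustration of ours, not part of the patch): with i_size at
4k, a failed 1k write at offset 10k can leave blocks instantiated out to 11k.
The consolidated error path below trims them back before retrying or bailing
out:

	if (pos + bytes > inode->i_size)
		vmtruncate(inode, inode->i_size);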

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2120,22 +2120,9 @@ generic_file_buffered_write(struct kiocb
 		}
 
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
-		if (unlikely(status)) {
-			loff_t isize = i_size_read(inode);
+		if (unlikely(status))
+			goto fs_write_aop_error;
 
-			if (status != AOP_TRUNCATED_PAGE)
-				unlock_page(page);
-			page_cache_release(page);
-			if (status == AOP_TRUNCATED_PAGE)
-				continue;
-			/*
-			 * prepare_write() may have instantiated a few blocks
-			 * outside i_size.  Trim these off again.
-			 */
-			if (pos + bytes > isize)
-				vmtruncate(inode, isize);
-			break;
-		}
 		if (likely(nr_segs == 1))
 			copied = filemap_copy_from_user(page, offset,
 							buf, bytes);
@@ -2144,40 +2131,53 @@ generic_file_buffered_write(struct kiocb
 						cur_iov, iov_offset, bytes);
 		flush_dcache_page(page);
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
-		if (status == AOP_TRUNCATED_PAGE) {
-			page_cache_release(page);
-			continue;
+		if (unlikely(status < 0))
+			goto fs_write_aop_error;
+		if (unlikely(copied != bytes)) {
+			status = -EFAULT;
+			goto fs_write_aop_error;
 		}
-		if (likely(copied > 0)) {
-			if (!status)
-				status = copied;
+		if (unlikely(status > 0)) /* filesystem did partial write */
+			copied = status;
 
-			if (status >= 0) {
-				written += status;
-				count -= status;
-				pos += status;
-				buf += status;
-				if (unlikely(nr_segs > 1)) {
-					filemap_set_next_iovec(&cur_iov,
-							&iov_offset, status);
-					if (count)
-						buf = cur_iov->iov_base +
-							iov_offset;
-				} else {
-					iov_offset += status;
-				}
+		if (likely(copied > 0)) {
+			written += copied;
+			count -= copied;
+			pos += copied;
+			buf += copied;
+			if (unlikely(nr_segs > 1)) {
+				filemap_set_next_iovec(&cur_iov,
+						&iov_offset, copied);
+				if (count)
+					buf = cur_iov->iov_base + iov_offset;
+			} else {
+				iov_offset += copied;
 			}
 		}
-		if (unlikely(copied != bytes))
-			if (status >= 0)
-				status = -EFAULT;
 		unlock_page(page);
 		mark_page_accessed(page);
 		page_cache_release(page);
-		if (status < 0)
-			break;
 		balance_dirty_pages_ratelimited(mapping);
 		cond_resched();
+		continue;
+
+fs_write_aop_error:
+		if (status != AOP_TRUNCATED_PAGE)
+			unlock_page(page);
+		page_cache_release(page);
+
+		/*
+		 * prepare_write() may have instantiated a few blocks
+		 * outside i_size.  Trim these off again. Don't need
+		 * i_size_read because we hold i_mutex.
+		 */
+		if (pos + bytes > inode->i_size)
+			vmtruncate(inode, inode->i_size);
+		if (status == AOP_TRUNCATED_PAGE)
+			continue;
+		else
+			break;
+
 	} while (count);
 	*ppos = pos;
 


* [patch 7/9] mm: cleanup pagecache insertion operations
  2007-02-04  8:49 [patch 0/9] buffered write deadlock fix Nick Piggin
                   ` (5 preceding siblings ...)
  2007-02-04  8:50 ` [patch 6/9] mm: be sure to trim blocks Nick Piggin
@ 2007-02-04  8:50 ` Nick Piggin
  2007-02-04  8:50 ` [patch 8/9] mm: generic_file_buffered_write iovec cleanup Nick Piggin
  2007-02-04  8:51 ` [patch 9/9] mm: fix pagecache write deadlocks Nick Piggin
  8 siblings, 0 replies; 36+ messages in thread
From: Nick Piggin @ 2007-02-04  8:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

Quite a bit of code is used to maintain these "cached pages", which are
unlikely ever to get used: creating the spare page requires a narrow race in
which another task inserts a page at the same index while this process is
allocating one, and then a multi-page write into an uncached part of the file
is needed to actually make use of it.

Next, the buffered write path (and others) uses its own LRU pagevec when it
should just be using the per-CPU LRU pagevec (which cuts down on both data and
code cacheline footprint). Also, these private LRU pagevecs are emptied after
just a very short time, in contrast with the per-CPU pagevecs, which are
persistent. Net result: 7.3 times fewer lru_lock acquisitions are required to
add the pages to pagecache for a bulk write (in 4K chunks).
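
For reference, add_to_page_cache_lru is roughly the following wrapper (a
from-memory sketch of mm/filemap.c of this era, so treat it as approximate);
lru_cache_add is what feeds the persistent per-CPU pagevec:

	int add_to_page_cache_lru(struct page *page,
			struct address_space *mapping,
			pgoff_t offset, gfp_t gfp_mask)
	{
		int ret = add_to_page_cache(page, mapping, offset, gfp_mask);
		if (ret == 0)
			lru_cache_add(page);	/* per-CPU LRU pagevec */
		return ret;
	}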

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -684,26 +684,22 @@ EXPORT_SYMBOL(find_lock_page);
 struct page *find_or_create_page(struct address_space *mapping,
 		unsigned long index, gfp_t gfp_mask)
 {
-	struct page *page, *cached_page = NULL;
+	struct page *page;
 	int err;
 repeat:
 	page = find_lock_page(mapping, index);
 	if (!page) {
-		if (!cached_page) {
-			cached_page = alloc_page(gfp_mask);
-			if (!cached_page)
-				return NULL;
-		}
-		err = add_to_page_cache_lru(cached_page, mapping,
-					index, gfp_mask);
-		if (!err) {
-			page = cached_page;
-			cached_page = NULL;
-		} else if (err == -EEXIST)
-			goto repeat;
+		page = alloc_page(gfp_mask);
+		if (!page)
+			return NULL;
+		err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
+		if (unlikely(err)) {
+			page_cache_release(page);
+			page = NULL;
+			if (err == -EEXIST)
+				goto repeat;
+		}
 	}
-	if (cached_page)
-		page_cache_release(cached_page);
 	return page;
 }
 EXPORT_SYMBOL(find_or_create_page);
@@ -889,11 +885,9 @@ void do_generic_mapping_read(struct addr
 	unsigned long next_index;
 	unsigned long prev_index;
 	loff_t isize;
-	struct page *cached_page;
 	int error;
 	struct file_ra_state ra = *_ra;
 
-	cached_page = NULL;
 	index = *ppos >> PAGE_CACHE_SHIFT;
 	next_index = index;
 	prev_index = ra.prev_page;
@@ -1057,23 +1051,20 @@ no_cached_page:
 		 * Ok, it wasn't cached, so we need to create a new
 		 * page..
 		 */
-		if (!cached_page) {
-			cached_page = page_cache_alloc_cold(mapping);
-			if (!cached_page) {
-				desc->error = -ENOMEM;
-				goto out;
-			}
+		page = page_cache_alloc_cold(mapping);
+		if (!page) {
+			desc->error = -ENOMEM;
+			goto out;
 		}
-		error = add_to_page_cache_lru(cached_page, mapping,
+		error = add_to_page_cache_lru(page, mapping,
 						index, GFP_KERNEL);
 		if (error) {
+			page_cache_release(page);
 			if (error == -EEXIST)
 				goto find_page;
 			desc->error = error;
 			goto out;
 		}
-		page = cached_page;
-		cached_page = NULL;
 		goto readpage;
 	}
 
@@ -1081,8 +1072,6 @@ out:
 	*_ra = ra;
 
 	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
-	if (cached_page)
-		page_cache_release(cached_page);
 	if (filp)
 		file_accessed(filp);
 }
@@ -1751,35 +1740,28 @@ static inline struct page *__read_cache_
 				int (*filler)(void *,struct page*),
 				void *data)
 {
-	struct page *page, *cached_page = NULL;
+	struct page *page;
 	int err;
 repeat:
 	page = find_get_page(mapping, index);
 	if (!page) {
-		if (!cached_page) {
-			cached_page = page_cache_alloc_cold(mapping);
-			if (!cached_page)
-				return ERR_PTR(-ENOMEM);
-		}
-		err = add_to_page_cache_lru(cached_page, mapping,
-					index, GFP_KERNEL);
-		if (err == -EEXIST)
-			goto repeat;
-		if (err < 0) {
+		page = page_cache_alloc_cold(mapping);
+		if (!page)
+			return ERR_PTR(-ENOMEM);
+		err = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
+		if (unlikely(err)) {
+			page_cache_release(page);
+			if (err == -EEXIST)
+				goto repeat;
 			/* Presumably ENOMEM for radix tree node */
-			page_cache_release(cached_page);
 			return ERR_PTR(err);
 		}
-		page = cached_page;
-		cached_page = NULL;
 		err = filler(data, page);
 		if (err < 0) {
 			page_cache_release(page);
 			page = ERR_PTR(err);
 		}
 	}
-	if (cached_page)
-		page_cache_release(cached_page);
 	return page;
 }
 
@@ -1830,40 +1812,6 @@ retry:
 EXPORT_SYMBOL(read_cache_page);
 
 /*
- * If the page was newly created, increment its refcount and add it to the
- * caller's lru-buffering pagevec.  This function is specifically for
- * generic_file_write().
- */
-static inline struct page *
-__grab_cache_page(struct address_space *mapping, unsigned long index,
-			struct page **cached_page, struct pagevec *lru_pvec)
-{
-	int err;
-	struct page *page;
-repeat:
-	page = find_lock_page(mapping, index);
-	if (!page) {
-		if (!*cached_page) {
-			*cached_page = page_cache_alloc(mapping);
-			if (!*cached_page)
-				return NULL;
-		}
-		err = add_to_page_cache(*cached_page, mapping,
-					index, GFP_KERNEL);
-		if (err == -EEXIST)
-			goto repeat;
-		if (err == 0) {
-			page = *cached_page;
-			page_cache_get(page);
-			if (!pagevec_add(lru_pvec, page))
-				__pagevec_lru_add(lru_pvec);
-			*cached_page = NULL;
-		}
-	}
-	return page;
-}
-
-/*
  * The logic we want is
  *
  *	if suid or (sgid and xgrp)
@@ -2057,6 +2005,33 @@ generic_file_direct_write(struct kiocb *
 }
 EXPORT_SYMBOL(generic_file_direct_write);
 
+/*
+ * Find or create a page at the given pagecache position. Return the locked
+ * page. This function is specifically for buffered writes.
+ */
+static struct page *__grab_cache_page(struct address_space *mapping,
+							pgoff_t index)
+{
+	int status;
+	struct page *page;
+repeat:
+	page = find_lock_page(mapping, index);
+	if (likely(page))
+		return page;
+
+	page = page_cache_alloc(mapping);
+	if (!page)
+		return NULL;
+	status = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
+	if (unlikely(status)) {
+		page_cache_release(page);
+		if (status == -EEXIST)
+			goto repeat;
+		return NULL;
+	}
+	return page;
+}
+
 ssize_t
 generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
 		unsigned long nr_segs, loff_t pos, loff_t *ppos,
@@ -2067,15 +2042,10 @@ generic_file_buffered_write(struct kiocb
 	const struct address_space_operations *a_ops = mapping->a_ops;
 	struct inode 	*inode = mapping->host;
 	long		status = 0;
-	struct page	*page;
-	struct page	*cached_page = NULL;
-	struct pagevec	lru_pvec;
 	const struct iovec *cur_iov = iov; /* current iovec */
 	size_t		iov_offset = 0;	   /* offset in the current iovec */
 	char __user	*buf;
 
-	pagevec_init(&lru_pvec, 0);
-
 	/*
 	 * handle partial DIO write.  Adjust cur_iov if needed.
 	 */
@@ -2087,6 +2057,7 @@ generic_file_buffered_write(struct kiocb
 	}
 
 	do {
+		struct page *page;
 		pgoff_t index;		/* Pagecache index for current page */
 		unsigned long offset;	/* Offset into pagecache page */
 		unsigned long maxlen;	/* Bytes remaining in current iovec */
@@ -2113,7 +2084,8 @@ generic_file_buffered_write(struct kiocb
 		fault_in_pages_readable(buf, maxlen);
 #endif
 
-		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
+
+		page = __grab_cache_page(mapping, index);
 		if (!page) {
 			status = -ENOMEM;
 			break;
@@ -2181,9 +2153,6 @@ fs_write_aop_error:
 	} while (count);
 	*ppos = pos;
 
-	if (cached_page)
-		page_cache_release(cached_page);
-
 	/*
 	 * For now, when the user asks for O_SYNC, we'll actually give O_DSYNC
 	 */
@@ -2203,7 +2172,6 @@ fs_write_aop_error:
 	if (unlikely(file->f_flags & O_DIRECT) && written)
 		status = filemap_write_and_wait(mapping);
 
-	pagevec_lru_add(&lru_pvec);
 	return written ? written : status;
 }
 EXPORT_SYMBOL(generic_file_buffered_write);
Index: linux-2.6/fs/mpage.c
===================================================================
--- linux-2.6.orig/fs/mpage.c
+++ linux-2.6/fs/mpage.c
@@ -389,31 +389,25 @@ mpage_readpages(struct address_space *ma
 	struct bio *bio = NULL;
 	unsigned page_idx;
 	sector_t last_block_in_bio = 0;
-	struct pagevec lru_pvec;
 	struct buffer_head map_bh;
 	unsigned long first_logical_block = 0;
 
 	clear_buffer_mapped(&map_bh);
-	pagevec_init(&lru_pvec, 0);
 	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
 		struct page *page = list_entry(pages->prev, struct page, lru);
 
 		prefetchw(&page->flags);
 		list_del(&page->lru);
-		if (!add_to_page_cache(page, mapping,
+		if (!add_to_page_cache_lru(page, mapping,
 					page->index, GFP_KERNEL)) {
 			bio = do_mpage_readpage(bio, page,
 					nr_pages - page_idx,
 					&last_block_in_bio, &map_bh,
 					&first_logical_block,
 					get_block);
-			if (!pagevec_add(&lru_pvec, page))
-				__pagevec_lru_add(&lru_pvec);
-		} else {
-			page_cache_release(page);
 		}
+		page_cache_release(page);
 	}
-	pagevec_lru_add(&lru_pvec);
 	BUG_ON(!list_empty(pages));
 	if (bio)
 		mpage_bio_submit(READ, bio);
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c
+++ linux-2.6/mm/readahead.c
@@ -133,28 +133,25 @@ int read_cache_pages(struct address_spac
 			int (*filler)(void *, struct page *), void *data)
 {
 	struct page *page;
-	struct pagevec lru_pvec;
 	int ret = 0;
 
-	pagevec_init(&lru_pvec, 0);
-
 	while (!list_empty(pages)) {
 		page = list_to_page(pages);
 		list_del(&page->lru);
-		if (add_to_page_cache(page, mapping, page->index, GFP_KERNEL)) {
+		if (add_to_page_cache_lru(page, mapping,
+					page->index, GFP_KERNEL)) {
 			page_cache_release(page);
 			continue;
 		}
+		page_cache_release(page);
+
 		ret = filler(data, page);
-		if (!pagevec_add(&lru_pvec, page))
-			__pagevec_lru_add(&lru_pvec);
-		if (ret) {
+		if (unlikely(ret)) {
 			put_pages_list(pages);
 			break;
 		}
 		task_io_account_read(PAGE_CACHE_SIZE);
 	}
-	pagevec_lru_add(&lru_pvec);
 	return ret;
 }
 
@@ -164,7 +161,6 @@ static int read_pages(struct address_spa
 		struct list_head *pages, unsigned nr_pages)
 {
 	unsigned page_idx;
-	struct pagevec lru_pvec;
 	int ret;
 
 	if (mapping->a_ops->readpages) {
@@ -174,19 +170,15 @@ static int read_pages(struct address_spa
 		goto out;
 	}
 
-	pagevec_init(&lru_pvec, 0);
 	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
 		struct page *page = list_to_page(pages);
 		list_del(&page->lru);
-		if (!add_to_page_cache(page, mapping,
+		if (!add_to_page_cache_lru(page, mapping,
 					page->index, GFP_KERNEL)) {
 			mapping->a_ops->readpage(filp, page);
-			if (!pagevec_add(&lru_pvec, page))
-				__pagevec_lru_add(&lru_pvec);
-		} else
-			page_cache_release(page);
+		}
+		page_cache_release(page);
 	}
-	pagevec_lru_add(&lru_pvec);
 	ret = 0;
 out:
 	return ret;


* [patch 8/9] mm: generic_file_buffered_write iovec cleanup
  2007-02-04  8:49 [patch 0/9] buffered write deadlock fix Nick Piggin
                   ` (6 preceding siblings ...)
  2007-02-04  8:50 ` [patch 7/9] mm: cleanup pagecache insertion operations Nick Piggin
@ 2007-02-04  8:50 ` Nick Piggin
  2007-02-04  8:51 ` [patch 9/9] mm: fix pagecache write deadlocks Nick Piggin
  8 siblings, 0 replies; 36+ messages in thread
From: Nick Piggin @ 2007-02-04  8:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

Hide some of the open-coded nr_segs tests in the iovec helpers. This is all
to simplify generic_file_buffered_write, which gets more complex in the next
patch.
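
The caller-visible effect (a sketch paraphrasing the mm/filemap.c hunk below):

	/*
	 * One call regardless of segment count; the helpers branch on
	 * nr_segs internally.
	 */
	copied = filemap_copy_from_user(page, offset, cur_iov, nr_segs,
					iov_offset, bytes);
	filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, copied);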

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.h
===================================================================
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -22,82 +22,82 @@ __filemap_copy_from_user_iovec_inatomic(
 
 /*
  * Copy as much as we can into the page and return the number of bytes which
- * were sucessfully copied.  If a fault is encountered then clear the page
- * out to (offset+bytes) and return the number of bytes which were copied.
- *
- * NOTE: For this to work reliably we really want copy_from_user_inatomic_nocache
- * to *NOT* zero any tail of the buffer that it failed to copy.  If it does,
- * and if the following non-atomic copy succeeds, then there is a small window
- * where the target page contains neither the data before the write, nor the
- * data after the write (it contains zero).  A read at this time will see
- * data that is inconsistent with any ordering of the read and the write.
- * (This has been detected in practice).
+ * were successfully copied.  If a fault is encountered then return the number of
+ * bytes which were copied.
  */
 static inline size_t
-filemap_copy_from_user(struct page *page, unsigned long offset,
-			const char __user *buf, unsigned bytes)
+filemap_copy_from_user_atomic(struct page *page, unsigned long offset,
+			const struct iovec *iov, unsigned long nr_segs,
+			size_t base, size_t bytes)
 {
 	char *kaddr;
-	int left;
+	size_t copied;
 
 	kaddr = kmap_atomic(page, KM_USER0);
-	left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
+	if (likely(nr_segs == 1)) {
+		int left;
+		char __user *buf = iov->iov_base + base;
+		left = __copy_from_user_inatomic_nocache(kaddr + offset,
+							buf, bytes);
+		copied = bytes - left;
+	} else {
+		copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset,
+							iov, base, bytes);
+	}
 	kunmap_atomic(kaddr, KM_USER0);
 
-	if (left != 0) {
-		/* Do it the slow way */
-		kaddr = kmap(page);
-		left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
-		kunmap(page);
-	}
-	return bytes - left;
+	return copied;
 }
 
 /*
- * This has the same sideeffects and return value as filemap_copy_from_user().
- * The difference is that on a fault we need to memset the remainder of the
- * page (out to offset+bytes), to emulate filemap_copy_from_user()'s
- * single-segment behaviour.
+ * This has the same sideeffects and return value as
+ * filemap_copy_from_user_atomic().
+ * The difference is that it attempts to resolve faults.
  */
 static inline size_t
-filemap_copy_from_user_iovec(struct page *page, unsigned long offset,
-			const struct iovec *iov, size_t base, size_t bytes)
+filemap_copy_from_user(struct page *page, unsigned long offset,
+			const struct iovec *iov, unsigned long nr_segs,
+			 size_t base, size_t bytes)
 {
 	char *kaddr;
 	size_t copied;
 
-	kaddr = kmap_atomic(page, KM_USER0);
-	copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
-							 base, bytes);
-	kunmap_atomic(kaddr, KM_USER0);
-	if (copied != bytes) {
-		kaddr = kmap(page);
-		copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
-								 base, bytes);
-		if (bytes - copied)
-			memset(kaddr + offset + copied, 0, bytes - copied);
-		kunmap(page);
+	kaddr = kmap(page);
+	if (likely(nr_segs == 1)) {
+		int left;
+		char __user *buf = iov->iov_base + base;
+		left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
+		copied = bytes - left;
+	} else {
+		copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset,
+							iov, base, bytes);
 	}
+	kunmap(page);
 	return copied;
 }
 
 static inline void
-filemap_set_next_iovec(const struct iovec **iovp, size_t *basep, size_t bytes)
+filemap_set_next_iovec(const struct iovec **iovp, unsigned long nr_segs,
+						 size_t *basep, size_t bytes)
 {
-	const struct iovec *iov = *iovp;
-	size_t base = *basep;
-
-	while (bytes) {
-		int copy = min(bytes, iov->iov_len - base);
-
-		bytes -= copy;
-		base += copy;
-		if (iov->iov_len == base) {
-			iov++;
-			base = 0;
+	if (likely(nr_segs == 1)) {
+		*basep += bytes;
+	} else {
+		const struct iovec *iov = *iovp;
+		size_t base = *basep;
+
+		while (bytes) {
+			int copy = min(bytes, iov->iov_len - base);
+
+			bytes -= copy;
+			base += copy;
+			if (iov->iov_len == base) {
+				iov++;
+				base = 0;
+			}
 		}
+		*iovp = iov;
+		*basep = base;
 	}
-	*iovp = iov;
-	*basep = base;
 }
 #endif
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2049,12 +2049,7 @@ generic_file_buffered_write(struct kiocb
 	/*
 	 * handle partial DIO write.  Adjust cur_iov if needed.
 	 */
-	if (likely(nr_segs == 1))
-		buf = iov->iov_base + written;
-	else {
-		filemap_set_next_iovec(&cur_iov, &iov_offset, written);
-		buf = cur_iov->iov_base + iov_offset;
-	}
+	filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, written);
 
 	do {
 		struct page *page;
@@ -2064,6 +2059,7 @@ generic_file_buffered_write(struct kiocb
 		size_t bytes;		/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
 
+		buf = cur_iov->iov_base + iov_offset;
 		offset = (pos & (PAGE_CACHE_SIZE - 1));
 		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = PAGE_CACHE_SIZE - offset;
@@ -2095,13 +2091,10 @@ generic_file_buffered_write(struct kiocb
 		if (unlikely(status))
 			goto fs_write_aop_error;
 
-		if (likely(nr_segs == 1))
-			copied = filemap_copy_from_user(page, offset,
-							buf, bytes);
-		else
-			copied = filemap_copy_from_user_iovec(page, offset,
-						cur_iov, iov_offset, bytes);
+		copied = filemap_copy_from_user(page, offset,
+					cur_iov, nr_segs, iov_offset, bytes);
 		flush_dcache_page(page);
+
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (unlikely(status < 0))
 			goto fs_write_aop_error;
@@ -2112,20 +2105,11 @@ generic_file_buffered_write(struct kiocb
 		if (unlikely(status > 0)) /* filesystem did partial write */
 			copied = status;
 
-		if (likely(copied > 0)) {
-			written += copied;
-			count -= copied;
-			pos += copied;
-			buf += copied;
-			if (unlikely(nr_segs > 1)) {
-				filemap_set_next_iovec(&cur_iov,
-						&iov_offset, copied);
-				if (count)
-					buf = cur_iov->iov_base + iov_offset;
-			} else {
-				iov_offset += copied;
-			}
-		}
+		written += copied;
+		count -= copied;
+		pos += copied;
+		filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, copied);
+
 		unlock_page(page);
 		mark_page_accessed(page);
 		page_cache_release(page);


* [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04  8:49 [patch 0/9] buffered write deadlock fix Nick Piggin
                   ` (7 preceding siblings ...)
  2007-02-04  8:50 ` [patch 8/9] mm: generic_file_buffered_write iovec cleanup Nick Piggin
@ 2007-02-04  8:51 ` Nick Piggin
  2007-02-04  9:44   ` Andrew Morton
  8 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2007-02-04  8:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

Modify the core write() code so that it won't take a pagefault while holding a
lock on the pagecache page. There are a number of different deadlocks possible
if we try to do such a thing:

1.  generic_buffered_write
2.   lock_page
3.    prepare_write
4.     unlock_page+vmtruncate
5.     copy_from_user
6.      mmap_sem(r)
7.       handle_mm_fault
8.        lock_page (filemap_nopage)
9.    commit_write
10.  unlock_page

a. sys_munmap / sys_mlock / others
b.  mmap_sem(w)
c.   make_pages_present
d.    get_user_pages
e.     handle_mm_fault
f.      lock_page (filemap_nopage)

2,8	- recursive deadlock if page is same
2,8;2,8	- ABBA deadlock if page is different
2,6;b,f	- ABBA deadlock if page is same

The solution is as follows:
1.  If we find the destination page is uptodate, continue as normal, but use
    atomic usercopies which do not take pagefaults and do not zero the uncopied
    tail of the destination. The destination is already uptodate, so we can
    commit_write the full length even if there was a partial copy: it does not
    matter that the tail was not modified, because if it is dirtied and written
    back to disk it will not cause any problems (uptodate *means* that the
    destination page is as new or newer than the copy on disk).

1a. The above requires that fault_in_pages_readable correctly returns access
    information, because atomic usercopies cannot distinguish non-present
    pages in a readable mapping from the lack of a readable mapping.

2.  If we find the destination page is non uptodate, unlock it (this could be
    made slightly more optimal), then find and pin the source page with
    get_user_pages. Relock the destination page and continue with the copy.
    However, instead of a usercopy (which might take a fault), copy the data
    via the kernel address space.

(also, rename maxlen to seglen, because it was confusing)
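
In outline, the copy step becomes (a condensed paraphrase of the diff below;
src_offset stands for the offset of buf within its page):

	if (!src_page) {
		/* Destination is uptodate: atomic usercopy, no faults
		 * taken while the page lock is held.  A short copy is
		 * harmless because the page is already uptodate. */
		pagefault_disable();
		copied = filemap_copy_from_user_atomic(page, offset,
				cur_iov, nr_segs, iov_offset, bytes);
		pagefault_enable();
	} else {
		/* Destination was not uptodate: the source page was
		 * pinned earlier with get_user_pages, so copy it via
		 * the kernel mapping -- this cannot fault. */
		char *src = kmap_atomic(src_page, KM_USER0);
		char *dst = kmap_atomic(page, KM_USER1);
		memcpy(dst + offset, src + src_offset, bytes);
		kunmap_atomic(dst, KM_USER1);
		kunmap_atomic(src, KM_USER0);
		copied = bytes;
	}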

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2052,11 +2052,12 @@ generic_file_buffered_write(struct kiocb
 	filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, written);
 
 	do {
+		struct page *src_page;
 		struct page *page;
 		pgoff_t index;		/* Pagecache index for current page */
 		unsigned long offset;	/* Offset into pagecache page */
-		unsigned long maxlen;	/* Bytes remaining in current iovec */
-		size_t bytes;		/* Bytes to write to page */
+		unsigned long seglen;	/* Bytes remaining in current iovec */
+		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
 
 		buf = cur_iov->iov_base + iov_offset;
@@ -2066,20 +2067,30 @@ generic_file_buffered_write(struct kiocb
 		if (bytes > count)
 			bytes = count;
 
-		maxlen = cur_iov->iov_len - iov_offset;
-		if (maxlen > bytes)
-			maxlen = bytes;
+		/*
+		 * a non-NULL src_page indicates that we're doing the
+		 * copy via get_user_pages and kmap.
+		 */
+		src_page = NULL;
+
+		seglen = cur_iov->iov_len - iov_offset;
+		if (seglen > bytes)
+			seglen = bytes;
 
-#ifndef CONFIG_DEBUG_VM
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
+		 *
+		 * Not only is this an optimisation, but it is also required
+		 * to check that the address is actually valid, when atomic
+		 * usercopies are used, below.
 		 */
-		fault_in_pages_readable(buf, maxlen);
-#endif
-
+		if (unlikely(fault_in_pages_readable(buf, seglen))) {
+			status = -EFAULT;
+			break;
+		}
 
 		page = __grab_cache_page(mapping, index);
 		if (!page) {
@@ -2087,32 +2098,94 @@ generic_file_buffered_write(struct kiocb
 			break;
 		}
 
+		/*
+		 * non-uptodate pages cannot cope with short copies, and we
+		 * cannot take a pagefault with the destination page locked.
+		 * So pin the source page to copy it.
+		 */
+		if (!PageUptodate(page)) {
+			unlock_page(page);
+
+			bytes = min(bytes, PAGE_CACHE_SIZE -
+				     ((unsigned long)buf & ~PAGE_CACHE_MASK));
+
+			/*
+			 * Cannot get_user_pages with a page locked for the
+			 * same reason as we can't take a page fault with a
+			 * page locked (as explained below).
+			 */
+			down_read(&current->mm->mmap_sem);
+			status = get_user_pages(current, current->mm,
+					(unsigned long)buf & PAGE_CACHE_MASK, 1,
+					0, 0, &src_page, NULL);
+			up_read(&current->mm->mmap_sem);
+			if (status != 1) {
+				page_cache_release(page);
+				break;
+			}
+
+			lock_page(page);
+			if (!page->mapping) {
+				unlock_page(page);
+				page_cache_release(page);
+				page_cache_release(src_page);
+				continue;
+			}
+
+		}
+
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
 		if (unlikely(status))
 			goto fs_write_aop_error;
 
-		copied = filemap_copy_from_user(page, offset,
+		if (!src_page) {
+			/*
+			 * Must not enter the pagefault handler here, because
+			 * we hold the page lock, so we might recursively
+			 * deadlock on the same lock, or get an ABBA deadlock
+			 * against a different lock, or against the mmap_sem
+			 * (which nests outside the page lock).  So increment
+			 * preempt count, and use _atomic usercopies.
+			 *
+			 * The page is uptodate so we are OK to encounter a
+			 * short copy: if unmodified parts of the page are
+			 * marked dirty and written out to disk, it doesn't
+			 * really matter.
+			 */
+			pagefault_disable();
+			copied = filemap_copy_from_user_atomic(page, offset,
 					cur_iov, nr_segs, iov_offset, bytes);
+			pagefault_enable();
+		} else {
+			char *src, *dst;
+			src = kmap_atomic(src_page, KM_USER0);
+			dst = kmap_atomic(page, KM_USER1);
+			memcpy(dst + offset,
+				src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
+				bytes);
+			kunmap_atomic(dst, KM_USER1);
+			kunmap_atomic(src, KM_USER0);
+			copied = bytes;
+		}
 		flush_dcache_page(page);
 
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (unlikely(status < 0))
 			goto fs_write_aop_error;
-		if (unlikely(copied != bytes)) {
-			status = -EFAULT;
-			goto fs_write_aop_error;
-		}
 		if (unlikely(status > 0)) /* filesystem did partial write */
-			copied = status;
+			copied = min_t(size_t, copied, status);
+
+		unlock_page(page);
+		mark_page_accessed(page);
+		page_cache_release(page);
+		if (src_page)
+			page_cache_release(src_page);
 
 		written += copied;
 		count -= copied;
 		pos += copied;
 		filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, copied);
 
-		unlock_page(page);
-		mark_page_accessed(page);
-		page_cache_release(page);
 		balance_dirty_pages_ratelimited(mapping);
 		cond_resched();
 		continue;
@@ -2121,6 +2194,8 @@ fs_write_aop_error:
 		if (status != AOP_TRUNCATED_PAGE)
 			unlock_page(page);
 		page_cache_release(page);
+		if (src_page)
+			page_cache_release(src_page);
 
 		/*
 		 * prepare_write() may have instantiated a few blocks
@@ -2133,7 +2208,6 @@ fs_write_aop_error:
 			continue;
 		else
 			break;
-
 	} while (count);
 	*ppos = pos;
 
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -198,6 +198,9 @@ static inline int fault_in_pages_writeab
 {
 	int ret;
 
+	if (unlikely(size == 0))
+		return 0;
+
 	/*
 	 * Writing zeroes into userspace here is OK, because we know that if
 	 * the zero gets there, we'll be overwriting it.
@@ -217,19 +220,23 @@ static inline int fault_in_pages_writeab
 	return ret;
 }
 
-static inline void fault_in_pages_readable(const char __user *uaddr, int size)
+static inline int fault_in_pages_readable(const char __user *uaddr, int size)
 {
 	volatile char c;
 	int ret;
 
+	if (unlikely(size == 0))
+		return 0;
+
 	ret = __get_user(c, uaddr);
 	if (ret == 0) {
 		const char __user *end = uaddr + size - 1;
 
 		if (((unsigned long)uaddr & PAGE_MASK) !=
 				((unsigned long)end & PAGE_MASK))
-		 	__get_user(c, end);
+		 	ret = __get_user(c, end);
 	}
+	return ret;
 }
 
 #endif /* _LINUX_PAGEMAP_H */


* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04  8:51 ` [patch 9/9] mm: fix pagecache write deadlocks Nick Piggin
@ 2007-02-04  9:44   ` Andrew Morton
  2007-02-04 10:15     ` Nick Piggin
  2007-02-04 10:59     ` Anton Altaparmakov
  0 siblings, 2 replies; 36+ messages in thread
From: Andrew Morton @ 2007-02-04  9:44 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin <npiggin@suse.de> wrote:

> 2.  If we find the destination page is non uptodate, unlock it (this could be
>     made slightly more optimal), then find and pin the source page with
>     get_user_pages. Relock the destination page and continue with the copy.
>     However, instead of a usercopy (which might take a fault), copy the data
>     via the kernel address space.

argh.  We just can't go adding all this gunk into the write() path. 

mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every page. 
Even single-process write() will suffer, let alone multithreaded stuff,
where mmap_sem contention may be the bigger problem.

I was going to do some quick measurements of this, but the code oopses
on power4 (http://userweb.kernel.org/~akpm/s5000402.jpg)

There's a build error in filemap_xip.c btw.



We need to think different.

What happened to the idea of doing an atomic copy into the non-uptodate
page and handling it somehow?



Another option might be to effectively pin the whole mm during the copy:

	down_read(&current->mm->unpaging_lock);
	get_user(addr);		/* Fault the page in */
	...
	copy_from_user()
	up_read(&current->mm->unpaging_lock);

then, anyone who wants to unmap pages from this mm requires
write_lock(unpaging_lock).  So we know the results of that get_user()
cannot be undone.

Or perhaps something like this can be done on a per-vma basis.  Just
something to tell the VM "hey, you're not allowed to unmap this page right
now"?


* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04  9:44   ` Andrew Morton
@ 2007-02-04 10:15     ` Nick Piggin
  2007-02-04 10:26       ` Christoph Hellwig
  2007-02-04 10:30       ` Andrew Morton
  2007-02-04 10:59     ` Anton Altaparmakov
  1 sibling, 2 replies; 36+ messages in thread
From: Nick Piggin @ 2007-02-04 10:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, Feb 04, 2007 at 01:44:45AM -0800, Andrew Morton wrote:
> On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin <npiggin@suse.de> wrote:
> 
> > 2.  If we find the destination page is non uptodate, unlock it (this could be
> >     made slightly more optimal), then find and pin the source page with
> >     get_user_pages. Relock the destination page and continue with the copy.
> >     However, instead of a usercopy (which might take a fault), copy the data
> >     via the kernel address space.
> 
> argh.  We just can't go adding all this gunk into the write() path. 
> 
> mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every page. 
> Even single-process write() will suffer, let alone multithreaded stuff,
> where mmap_sem contention may be the bigger problem.

The write path is broken. I prefer my kernels slow rather than buggy.

As I said, I'm working on a replacement API so that the filesystems
that care can be correct *and* fast.

> I was going to do some quick measurements of this, but the code oopses
> on power4 (http://userweb.kernel.org/~akpm/s5000402.jpg)

Cool, a kernel thread is calling sys_write. Fun.

I guess I should be able to reproduce this if using initramfs. Thanks.

> There's a build error in filemap_xip.c btw.
> 
> 
> 
> We need to think different.
> 
> What happened to the idea of doing an atomic copy into the non-uptodate
> page and handling it somehow?

That was my second idea. I didn't get any feedback on that patchset
except to try this method, so I assume everyone hated it.

I actually liked it, because it didn't have to do the writev
segment-at-a-time for !uptodate pages like this one does. Considering
this code gets called from mm-less contexts, maybe I'll have to go back
to this approach.

> Another option might be to effectively pin the whole mm during the copy:
> 
> 	down_read(&current->mm->unpaging_lock);
> 	get_user(addr);		/* Fault the page in */
> 	...
> 	copy_from_user()
> 	up_read(&current->mm->unpaging_lock);
> 
> then, anyone who wants to unmap pages from this mm requires
> write_lock(unpaging_lock).  So we know the results of that get_user()
> cannot be undone.

Fugly. Don't know whether there are any lock order problems making it
hard to implement, but you introduce the theoretical memory deadlock
where a task cannot reclaim its own memory.

> Or perhaps something like this can be done on a per-vma basis.  Just
> something to tell the VM "hey, you're not allowed to unmap this page right
> now"?

Same memory deadlock problem.


* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 10:15     ` Nick Piggin
@ 2007-02-04 10:26       ` Christoph Hellwig
  2007-02-04 10:30       ` Andrew Morton
  1 sibling, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2007-02-04 10:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, Feb 04, 2007 at 11:15:29AM +0100, Nick Piggin wrote:
> Cool, a kernel thread is calling sys_write. Fun.

There are tons of places where we possibly call into ->write from
either kernel threads or at least with a kernel pointer and set_fs/set_ds
magic.  Anything in the buffered write path that tries to touch page tables
can't work.



* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 10:15     ` Nick Piggin
  2007-02-04 10:26       ` Christoph Hellwig
@ 2007-02-04 10:30       ` Andrew Morton
  2007-02-04 10:46         ` Nick Piggin
  1 sibling, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2007-02-04 10:30 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, 4 Feb 2007 11:15:29 +0100 Nick Piggin <npiggin@suse.de> wrote:

> On Sun, Feb 04, 2007 at 01:44:45AM -0800, Andrew Morton wrote:
> > On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > 2.  If we find the destination page is non uptodate, unlock it (this could be
> > >     made slightly more optimal), then find and pin the source page with
> > >     get_user_pages. Relock the destination page and continue with the copy.
> > >     However, instead of a usercopy (which might take a fault), copy the data
> > >     via the kernel address space.
> > 
> > argh.  We just can't go adding all this gunk into the write() path. 
> > 
> > mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every page. 
> > Even single-process write() will suffer, let alone multithreaded stuff,
> > where mmap_sem contention may be the bigger problem.
> 
> The write path is broken. I prefer my kernels slow rather than buggy.

That won't fly.

> > There's a build error in filemap_xip.c btw.

?

> > 
> > We need to think different.
> > 
> > What happened to the idea of doing an atomic copy into the non-uptodate
> > page and handling it somehow?
> 
> That was my second idea.

Coulda sworn it was mine ;) I thought you ended up deciding it wasn't
practical because of the games we needed to play with ->commit_write.

> I didn't get any feedback on that patchset
> except to try this method, so I assume everyone hated it.
> 
> I actually liked it, because it didn't have to do the writev
> segment-at-a-time for !uptodate pages like this one does. Considering
> this code gets called from mm-less contexts, maybe I'll have to go back
> to this approach.

OK.

> > Another option might be to effectively pin the whole mm during the copy:
> > 
> > 	down_read(&current->mm->unpaging_lock);
> > 	get_user(addr);		/* Fault the page in */
> > 	...
> > 	copy_from_user()
> > 	up_read(&current->mm->unpaging_lock);
> > 
> > then, anyone who wants to unmap pages from this mm requires
> > write_lock(unpaging_lock).  So we know the results of that get_user()
> > cannot be undone.
> 
> Fugly.

I invited you to think different - don't just fixate on one random
tossed-out-there suggestion.

> but you introduce the theoretical memory deadlock
> where a task cannot reclaim its own memory.

Nah, that'll never happen - both pages are already allocated.

It's better than taking mmap_sem and walking pagetables...


* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 10:30       ` Andrew Morton
@ 2007-02-04 10:46         ` Nick Piggin
  2007-02-04 10:50           ` Nick Piggin
  2007-02-04 10:56           ` Andrew Morton
  0 siblings, 2 replies; 36+ messages in thread
From: Nick Piggin @ 2007-02-04 10:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, Feb 04, 2007 at 02:30:55AM -0800, Andrew Morton wrote:
> On Sun, 4 Feb 2007 11:15:29 +0100 Nick Piggin <npiggin@suse.de> wrote:
> 
> > The write path is broken. I prefer my kernels slow rather than buggy.
> 
> That won't fly.

What won't fly?

> 
> > > There's a build error in filemap_xip.c btw.
> 
> ?

Thanks?

> > > What happened to the idea of doing an atomic copy into the non-uptodate
> > > page and handling it somehow?
> > 
> > That was my second idea.
> 
> Coulda sworn it was mine ;) I thought you ended up deciding it wasn't
> practical because of the games we needed to play with ->commit_write.

Maybe I misunderstood what you meant, above. I have an alternative fix
where a temporary page is allocated if the write encounters a non
uptodate page. The usercopy then goes into that page, and from there
into the target page after we have done the prepare_write().

My *first* idea to fix this was to do the atomic copy into a non-uptodate
page and then call a zero-length commit_write if it failed. I pretty
carefully constructed all these good arguments as to why each case works
properly, but in the end it just didn't fly because it broke lots of
filesystems.
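
Schematically, it looked something like this (a from-memory sketch, not the
actual patch; "retry" stands for a hypothetical jump back to the top of the
per-page loop):

	pagefault_disable();
	copied = filemap_copy_from_user_atomic(page, offset,
				cur_iov, nr_segs, iov_offset, bytes);
	pagefault_enable();
	if (unlikely(copied != bytes)) {
		/* zero-length commit: tell the fs we wrote nothing */
		a_ops->commit_write(file, page, offset, offset);
		unlock_page(page);
		page_cache_release(page);
		fault_in_pages_readable(buf, bytes);	/* no locks held */
		goto retry;
	}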

> > > Another option might be to effectively pin the whole mm during the copy:
> > > 
> > > 	down_read(&current->mm->unpaging_lock);
> > > 	get_user(addr);		/* Fault the page in */
> > > 	...
> > > 	copy_from_user()
> > > 	up_read(&current->mm->unpaging_lock);
> > > 
> > > then, anyone who wants to unmap pages from this mm requires
> > > write_lock(unpaging_lock).  So we know the results of that get_user()
> > > cannot be undone.
> > 
> > Fugly.
> 
> I invited you to think different - don't just fixate on one random
> tossed-out-there suggestion.

I've thought. Quite a lot. I have 2 other approaches that don't require
mmap_sem, and 1 which is actually possible to implement without breaking
filesystems.

> > but you introduce the theoretical memory deadlock
> > where a task cannot reclaim its own memory.
> 
> Nah, that'll never happen - both pages are already allocated.

Both pages? I don't get it.

You set the don't-reclaim vma flag, then run get_user, which takes a
page fault and potentially has to allocate N pages for pagetables,
pagecache readahead, buffers and fs private data and pagecache radix
tree nodes for all of the pages read in.

> It's better than taking mmap_sem and walking pagetables...

I'm not convinced.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 10:46         ` Nick Piggin
@ 2007-02-04 10:50           ` Nick Piggin
  2007-02-04 10:56           ` Andrew Morton
  1 sibling, 0 replies; 36+ messages in thread
From: Nick Piggin @ 2007-02-04 10:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, Feb 04, 2007 at 11:46:09AM +0100, Nick Piggin wrote:
> 
> > It's better than taking mmap_sem and walking pagetables...
> 
> I'm not convinced.

Though I am more convinced that looking at mm *at all* (either to
take the mmap_sem and find the vma, or to take the mmap_sem and
run get_user_pages) is going to hurt.

We'd have to special case kernel threads, which don't even have an
mm, let alone the vmas... too ugly.

I'll revert to my temporary-page approach: at least that will
fix the problem.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 10:46         ` Nick Piggin
  2007-02-04 10:50           ` Nick Piggin
@ 2007-02-04 10:56           ` Andrew Morton
  2007-02-04 11:03             ` Nick Piggin
  1 sibling, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2007-02-04 10:56 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, 4 Feb 2007 11:46:09 +0100 Nick Piggin <npiggin@suse.de> wrote:

> On Sun, Feb 04, 2007 at 02:30:55AM -0800, Andrew Morton wrote:
> > On Sun, 4 Feb 2007 11:15:29 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > The write path is broken. I prefer my kernels slow rather than buggy.
> > 
> > That won't fly.
> 
> What won't fly?

I suspect the performance cost of this approach would force us to redo it
all.

> > > > What happened to the idea of doing an atomic copy into the non-uptodate
> > > > page and handling it somehow?
> > > 
> > > That was my second idea.
> > 
> > Coulda sworn it was mine ;) I thought you ended up deciding it wasn't
> > practical because of the games we needed to play with ->commit_write.
> 
> Maybe I misunderstood what you meant, above.

The original set of half-written patches I sent you.  Do an atomic copy_from_user()
inside the page lock and if that fails, zero out the remainder of the page, run
commit_write() and then redo the whole thing.
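
Roughly, as a sketch (error handling elided; "retry" is a hypothetical label
back to where we grab and lock the page):

	pagefault_disable();
	copied = filemap_copy_from_user_atomic(page, offset,
				cur_iov, nr_segs, iov_offset, bytes);
	pagefault_enable();
	if (unlikely(copied != bytes)) {
		/* zero the tail so no uninitialised data is left behind */
		memclear_highpage_flush(page, offset + copied, bytes - copied);
		a_ops->commit_write(file, page, offset, offset + bytes);
		unlock_page(page);
		page_cache_release(page);
		fault_in_pages_readable(buf, bytes);	/* no locks held */
		goto retry;
	}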

> I have an alternative fix
> where a temporary page is allocated if the write encounters a non
> uptodate page. The usercopy then goes into that page, and from there
> into the target page after we have done the prepare_write().

Remember that a non-uptodate page is the common case.

> My *first* idea to fix this was to do the atomic copy into a non-uptodate
> page and then call a zero-length commit_write if it failed. I pretty
> carefully constructed all these good arguments as to why each case works
> properly, but in the end it just didn't fly because it broke lots of
> filesystems.

I forget the details now.  I think we did have a workable-looking solution
based on the atomic copy_from_user() but it would have re-exposed the old
problem wherein a page would fleetingly have a bunch of zeroes in the
middle of it, if someone looked at it during the write.

If that recollection is right, I think we could afford to reintroduce that
problem, frankly.  Especially as it only happens in the incredibly rare
case of that get_user()ed page getting unmapped under our feet.

> > > but you introduce the theoretical memory deadlock
> > > where a task cannot reclaim its own memory.
> > 
> > Nah, that'll never happen - both pages are already allocated.
> 
> Both pages? I don't get it.
> 
> You set the don't-reclaim vma flag, then run get_user, which takes a
> page fault and potentially has to allocate N pages for pagetables,
> pagecache readahead, buffers and fs private data and pagecache radix
> tree nodes for all of the pages read in.

Oh, OK.  Need to do the get_user() twice then.  Once before taking that new
rwsem.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04  9:44   ` Andrew Morton
  2007-02-04 10:15     ` Nick Piggin
@ 2007-02-04 10:59     ` Anton Altaparmakov
  2007-02-04 11:10       ` Andrew Morton
  1 sibling, 1 reply; 36+ messages in thread
From: Anton Altaparmakov @ 2007-02-04 10:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, 4 Feb 2007, Andrew Morton wrote:
> On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin <npiggin@suse.de> wrote:
> > 2.  If we find the destination page is non uptodate, unlock it (this could be
> >     made slightly more optimal), then find and pin the source page with
> >     get_user_pages. Relock the destination page and continue with the copy.
> >     However, instead of a usercopy (which might take a fault), copy the data
> >     via the kernel address space.
> 
> argh.  We just can't go adding all this gunk into the write() path. 
> 
> mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every page. 
> Even single-process write() will suffer, let alone multithreaded stuff,
> where mmap_sem contention may be the bigger problem.
> 
> I was going to do some quick measurements of this, but the code oopses
> on power4 (http://userweb.kernel.org/~akpm/s5000402.jpg)
> 
> We need to think different.

How about leaving the existing code with the following minor 
modifications:

Instead of calling filemap_copy_from_user{,_iovec}() do only the atomic 
bit with pagefaults disabled, i.e. instead of filemap_copy_from_user() we 
would do (this could of course move into a helper function):

pagefault_disable();
kaddr = kmap_atomic(page, KM_USER0);
left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
kunmap_atomic(kaddr, KM_USER0);
pagefault_enable();

if (unlikely(left)) {
	/* The user space page got unmapped before we got to copy it. */
	...
}

Thus 99.999% (or more!) of the time the code would just work as it always
has: there is no bug and no speed impact.  Only in the very rare and
hard-to-trigger race condition where the user space page, after being
faulted in, got thrown out again before we did the atomic memory copy do we
run into the above "..." code path.

I would propose to call out to a different function altogether which could
do a multitude of things, including dropping the lock on the destination
page (maintaining a reference on the page!), allocating a temporary page,
copying from the user space page into the temporary page, then locking the
destination page again, and copying from the temporary page into the
destination page.

This would be slow but who cares, given it would only happen incredibly
rarely, and on the majority of machines it would never happen at all.
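
As a rough sketch of that slow path (the helper name is made up, and the
truncation problem is ignored here; see below):

	/* We hold a reference on @page, but not its lock. */
	static int copy_via_temp_page(struct page *page, unsigned offset,
				      const char __user *buf, unsigned bytes)
	{
		struct page *tmp;
		void *src, *dst;

		tmp = alloc_page(GFP_KERNEL);	/* lowmem: page_address() OK */
		if (!tmp)
			return -ENOMEM;
		/* may sleep and fault the user page back in: no locks held */
		if (copy_from_user(page_address(tmp), buf, bytes)) {
			__free_page(tmp);
			return -EFAULT;
		}
		lock_page(page);
		src = kmap_atomic(tmp, KM_USER0);
		dst = kmap_atomic(page, KM_USER1);
		memcpy(dst + offset, src, bytes);
		kunmap_atomic(dst, KM_USER1);
		kunmap_atomic(src, KM_USER0);
		__free_page(tmp);
		return 0;	/* caller commits, unlocks and releases @page */
	}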

The only potential problem I can see is that the destination page could be 
truncated whilst it is unlocked.  I can see two possible solutions to 
this:

1) Invent a new page flag and run "SetDoNotTruncatePage()" before dropping 
the page lock, then modify truncate code paths to check for page locked 
and the new flag and not truncate a page if the new flag is set, just like 
they would if it was locked.  Basically treat the two bits as equivalent 
for truncate purposes.

I think 1) would be best but a possible alternative would be:

2) Make commit_write check whether a page has been truncated and if so
make it abort the operation/transaction (or non-journalling file systems
can probably just ignore the call completely).  Then simply restart the
generic_file_buffered_write() call by faulting the user space page in
again, etc.  And this time round things will just work again.  I guess it
would be possible that we hit the same race condition again and again and
again, so we would loop for ever, but the chances of that are even lower
than those of the original race condition.  I suppose to guard against it
we could have a counter that, say, limited us to X retries, and if that
failed the write would be aborted with -EDEADLOCK or something.  As I said,
I think 1) would be nicer...
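
For completeness, the retry cap in 2) could look like this (a sketch;
MAX_WRITE_RETRIES and retries are made up, and I am assuming commit_write
would signal the truncation with AOP_TRUNCATED_PAGE, which also unlocks
the page):

	status = a_ops->commit_write(file, page, offset, offset + bytes);
	if (status == AOP_TRUNCATED_PAGE) {
		page_cache_release(page);
		if (++retries > MAX_WRITE_RETRIES)
			return written ? written : -EDEADLOCK;
		fault_in_pages_readable(buf, bytes);
		goto retry;	/* redo this iteration from the top */
	}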

What do you think?

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 10:56           ` Andrew Morton
@ 2007-02-04 11:03             ` Nick Piggin
  2007-02-04 11:15               ` Andrew Morton
  0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2007-02-04 11:03 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, Feb 04, 2007 at 02:56:02AM -0800, Andrew Morton wrote:
> On Sun, 4 Feb 2007 11:46:09 +0100 Nick Piggin <npiggin@suse.de> wrote:
> 
> > On Sun, Feb 04, 2007 at 02:30:55AM -0800, Andrew Morton wrote:
> > > On Sun, 4 Feb 2007 11:15:29 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > > 
> > > > > The write path is broken. I prefer my kernels slow rather than buggy.
> > > 
> > > That won't fly.
> > 
> > What won't fly?
> 
> I suspect the performance cost of this approach would force us to redo it
> all.

That's the idea. But at least in the meantime we're correct.

> > > > That was my second idea.
> > > 
> > > Coulda sworn it was mine ;) I thought you ended up deciding it wasn't
> > > practical because of the games we needed to play with ->commit_write.
> > 
> > Maybe I misunderstood what you meant, above.
> 
> The original set of half-written patches I sent you.  Do an atomic copy_from_user()
> inside the page lock and if that fails, zero out the remainder of the page, run
> commit_write() and then redo the whole thing.

Oh that. Data corruption, transient zeroes.

> > I have an alternative fix
> > where a temporary page is allocated if the write encounters a non
> > uptodate page. The usercopy then goes into that page, and from there
> > into the target page after we have done the prepare_write().
> 
> Remember that a non-uptodate page is the common case.

Yes.

> 
> > My *first* idea to fix this was to do the atomic copy into a non-uptodate
> > page and then call a zero-length commit_write if it failed. I pretty
> > carefully constructed all these good arguments as to why each case works
> > properly, but in the end it just didn't fly because it broke lots of
> > filesystems.
> 
> I forget the details now.  I think we did have a workable-looking solution
> based on the atomic copy_from_user() but it would have re-exposed the old
> problem wherein a page would fleetingly have a bunch of zeroes in the
> middle of it, if someone looked at it during the write.
> 
> If that recollection is right, I think we could afford to reintroduce that
> problem, frankly.  Especially as it only happens in the incredibly rare
> case of that get_user()ed page getting unmapped under our feet.

Dang. I was hoping to fix it without introducing data corruption.

> > > > but you introduce the theoretical memory deadlock
> > > > where a task cannot reclaim its own memory.
> > > 
> > > Nah, that'll never happen - both pages are already allocated.
> > 
> > Both pages? I don't get it.
> > 
> > You set the don't-reclaim vma flag, then run get_user, which takes a
> > page fault and potentially has to allocate N pages for pagetables,
> > pagecache readahead, buffers and fs private data and pagecache radix
> > tree nodes for all of the pages read in.
> 
> Oh, OK.  Need to do the get_user() twice then.  Once before taking that new
> rwsem.

Race condition remains.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 10:59     ` Anton Altaparmakov
@ 2007-02-04 11:10       ` Andrew Morton
  2007-02-04 11:22         ` Nick Piggin
  2007-02-04 17:40         ` Anton Altaparmakov
  0 siblings, 2 replies; 36+ messages in thread
From: Andrew Morton @ 2007-02-04 11:10 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: Nick Piggin, Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, 4 Feb 2007 10:59:58 +0000 (GMT) Anton Altaparmakov <aia21@cam.ac.uk> wrote:

> On Sun, 4 Feb 2007, Andrew Morton wrote:
> > On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin <npiggin@suse.de> wrote:
> > > 2.  If we find the destination page is non uptodate, unlock it (this could be
> > >     made slightly more optimal), then find and pin the source page with
> > >     get_user_pages. Relock the destination page and continue with the copy.
> > >     However, instead of a usercopy (which might take a fault), copy the data
> > >     via the kernel address space.
> > 
> > argh.  We just can't go adding all this gunk into the write() path. 
> > 
> > mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every page. 
> > Even single-process write() will suffer, let alone multithreaded stuff,
> > where mmap_sem contention may be the bigger problem.
> > 
> > I was going to do some quick measurements of this, but the code oopses
> > on power4 (http://userweb.kernel.org/~akpm/s5000402.jpg)
> > 
> > We need to think different.
> 
> How about leaving the existing code with the following minor 
> modifications:
> 
> Instead of calling filemap_copy_from_user{,_iovec}() do only the atomic 
> bit with pagefaults disabled, i.e. instead of filemap_copy_from_user() we 
> would do (this could of course move into a helper function):
> 
> pagefault_disable();
> kaddr = kmap_atomic(page, KM_USER0);
> left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
> kunmap_atomic(kaddr, KM_USER0);
> pagefault_enable();
> 
> if (unlikely(left)) {
> 	/* The user space page got unmapped before we got to copy it. */
> 	...
> }
> 
> Thus 99.999% (or more!) of the time the code would just work as it always
> has: there is no bug and no speed impact.  Only in the very rare and
> hard-to-trigger race condition where the user space page, after being
> faulted in, got thrown out again before we did the atomic memory copy do we
> run into the above "..." code path.

Right.  And what I wanted to do here is to zero out the uncopied part of
the page (if it wasn't uptodate), then run commit_write(), then retry the
whole thing.

iirc, we ruled that out because those temporary zeroes are exposed to
userspace.  But the kernel used to do that anyway for a long time (years)
until someone noticed, and we'll only do it in your 0.0001% case anyway.

(Actually, perhaps we can prevent it by not marking the page uptodate in
this case.  But that'll cause a read()er to try to bring it uptodate...)

> I would propose to call out to a different function altogether which could
> do a multitude of things, including dropping the lock on the destination
> page (maintaining a reference on the page!), allocating a temporary page,
> copying from the user space page into the temporary page, then locking the
> destination page again, and copying from the temporary page into the
> destination page.

The problem with all these things is that as soon as we unlock the page,
it's visible to read().  And in fact, as soon as we mark it uptodate it's
visible to mmap, even if it's still locked.

> This would be slow but who cares, given it would only happen incredibly
> rarely, and on the majority of machines it would never happen at all.
> 
> The only potential problem I can see is that the destination page could be 
> truncated whilst it is unlocked.  I can see two possible solutions to 
> this:

truncate's OK: we're holding i_mutex.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 11:03             ` Nick Piggin
@ 2007-02-04 11:15               ` Andrew Morton
  2007-02-04 15:10                 ` Nick Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2007-02-04 11:15 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, 4 Feb 2007 12:03:17 +0100 Nick Piggin <npiggin@suse.de> wrote:

> On Sun, Feb 04, 2007 at 02:56:02AM -0800, Andrew Morton wrote:
> > On Sun, 4 Feb 2007 11:46:09 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > On Sun, Feb 04, 2007 at 02:30:55AM -0800, Andrew Morton wrote:
> > > > On Sun, 4 Feb 2007 11:15:29 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > > > 
> > > > > The write path is broken. I prefer my kernels slow rather than buggy.
> > > > 
> > > > That won't fly.
> > > 
> > > What won't fly?
> > 
> > I suspect the performance cost of this approach would force us to redo it
> > all.
> 
> That's the idea. But at least in the meantime we're correct.

There's no way I'd support merging a change which we know we'll have to
redo, only we have no clue how.

> > If that recollection is right, I think we could afford to reintroduce that
> > problem, frankly.  Especially as it only happens in the incredibly rare
> > case of that get_user()ed page getting unmapped under our feet.
> 
> Dang. I was hoping to fix it without introducing data corruption.

Well.  It's a compromise.  Being practical about it, I reeeealy doubt that
anyone will hit this combination of circumstances.

> > > > > but you introduce the theoretical memory deadlock
> > > > > where a task cannot reclaim its own memory.
> > > > 
> > > > Nah, that'll never happen - both pages are already allocated.
> > > 
> > > Both pages? I don't get it.
> > > 
> > > You set the don't-reclaim vma flag, then run get_user, which takes a
> > > page fault and potentially has to allocate N pages for pagetables,
> > > pagecache readahead, buffers and fs private data and pagecache radix
> > > tree nodes for all of the pages read in.
> > 
> > Oh, OK.  Need to do the get_user() twice then.  Once before taking that new
> > rwsem.
> 
> Race condition remains.

No, not in a million years.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 11:10       ` Andrew Morton
@ 2007-02-04 11:22         ` Nick Piggin
  2007-02-04 17:40         ` Anton Altaparmakov
  1 sibling, 0 replies; 36+ messages in thread
From: Nick Piggin @ 2007-02-04 11:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Anton Altaparmakov, Linux Kernel, Linux Filesystems,
	Linux Memory Management

On Sun, Feb 04, 2007 at 03:10:39AM -0800, Andrew Morton wrote:
> On Sun, 4 Feb 2007 10:59:58 +0000 (GMT) Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> > 
> > How about leaving the existing code with the following minor 
> > modifications:
> > 
> > Instead of calling filemap_copy_from_user{,_iovec}() do only the atomic 
> > bit with pagefaults disabled, i.e. instead of filemap_copy_from_user() we 
> > would do (this could of course move into a helper function):
> > 
> > pagefault_disable();
> > kaddr = kmap_atomic(page, KM_USER0);
> > left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
> > kunmap_atomic(kaddr, KM_USER0);
> > pagefault_enable();
> > 
> > if (unlikely(left)) {
> > 	/* The user space page got unmapped before we got to copy it. */
> > 	...
> > }
> > 
> > Thus 99.999% (or more!) of the time the code would just work as it always
> > has: there is no bug and no speed impact.  Only in the very rare and
> > hard-to-trigger race condition where the user space page, after being
> > faulted in, got thrown out again before we did the atomic memory copy do
> > we run into the above "..." code path.
> 
> Right.  And what I wanted to do here is to zero out the uncopied part of
> the page (if it wasn't uptodate), then run commit_write(), then retry the
> whole thing.
> 
> iirc, we ruled that out because those temporary zeroes are exposed to
> userspace.  But the kernel used to do that anyway for a long time (years)
> until someone noticed, and we'll only do it in your 0.0001% case anyway.

Seriously? I'd rather leave the deadlock in there than introduce a
very hard-to-reproduce data corruption bug to fix it. At least the
deadlock is fail-stop and you can tell exactly what happened when
you hit it (assuming you can get a trace).

Then again, we've got lots more similar little correctness corner
cases like this that most people don't notice most of the time. Am
I aiming too high?

> (Actually, perhaps we can prevent it by not marking the page uptodate in
> this case.  But that'll cause a read()er to try to bring it uptodate...)

We have to write something back to the filesystem because it may have
allocated blocks at this point.

> > I would propose to call out to a different function altogether which
> > could do a multitude of things, including dropping the lock on the
> > destination page (maintaining a reference on the page!), allocating a
> > temporary page, copying from the user space page into the temporary page,
> > then locking the destination page again, and copying from the temporary
> > page into the destination page.
> 
> The problem with all these things is that as soon as we unlock the page,
> it's visible to read().  And in fact, as soon as we mark it uptodate it's
> visible to mmap, even if it's still locked.
> 
> > This would be slow but who cares, given it would only happen incredibly
> > rarely, and on the majority of machines it would never happen at all.
> > 
> > The only potential problem I can see is that the destination page could be 
> > truncated whilst it is unlocked.  I can see two possible solutions to 
> > this:
> 
> truncate's OK: we're holding i_mutex.

Not all truncates hold i_mutex. Neither do all invalidates, for that
matter.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 11:15               ` Andrew Morton
@ 2007-02-04 15:10                 ` Nick Piggin
  2007-02-04 18:36                   ` Andrew Morton
  0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2007-02-04 15:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, Feb 04, 2007 at 03:15:49AM -0800, Andrew Morton wrote:
> On Sun, 4 Feb 2007 12:03:17 +0100 Nick Piggin <npiggin@suse.de> wrote:
> 
> > On Sun, Feb 04, 2007 at 02:56:02AM -0800, Andrew Morton wrote:
> > > On Sun, 4 Feb 2007 11:46:09 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > > 
> > > If that recollection is right, I think we could afford to reintroduce that
> > > problem, frankly.  Especially as it only happens in the incredibly rare
> > > case of that get_user()ed page getting unmapped under our feet.
> > 
> > Dang. I was hoping to fix it without introducing data corruption.
> 
> Well.  It's a compromise.  Being practical about it, I reeeealy doubt that
> anyone will hit this combination of circumstances.

They're not likely to hit the deadlocks, either. The probability gets
higher after my patch to lock the page in the fault path. But practically,
we could live without that too, because the data corruption it fixes is
very rare as well. Which is exactly what we've been doing quite happily
for most of 2.6, including all distro kernels (I think).

But (sadly) for me, there is no compromise. I may be known as the clown
who tries outlandish things to shave a few atomic ops and interrupt flag
changes in the page allocator, or to make the pagecache lockless. However,
I can't be happy even making something faster if correctness is < 100.0%,
even if the failure is less likely than hardware failure.

> > > > > > but you introduce the theoretical memory deadlock
> > > > > > where a task cannot reclaim its own memory.
> > > > > 
> > > > > Nah, that'll never happen - both pages are already allocated.
> > > > 
> > > > Both pages? I don't get it.
> > > > 
> > > > You set the don't-reclaim vma flag, then run get_user, which takes a
> > > > page fault and potentially has to allocate N pages for pagetables,
> > > > pagecache readahead, buffers and fs private data and pagecache radix
> > > > tree nodes for all of the pages read in.
> > > 
> > > Oh, OK.  Need to do the get_user() twice then.  Once before taking that new
> > > rwsem.
> > 
> > Race condition remains.
> 
> No, not in a million years.

There is a huge window. Think about what other activity will be holding
that very rwsem for write, which you'll have to wait for in the race window.
But you could say that's also a question of practicality, because it is
pretty unlikely to do anything bad even if you do hit the race.

Well, if fixing this is just going to be flat-out vetoed for performance
reasons, then I can't argue, because it does impact performance.

Thinking about the numbers: if your kernel's reliability is already the
same order of magnitude as the reliability of commodity hardware, then
trading a bit of performance for a bit of reliability is a BAD tradeoff
if you are at all interested in performance on commodity hardware. That
is especially true if you have a massive redundant cluster of commodity
systems, which I understand is a fairly big market for Linux. The
X-nines guys who would disagree are probably a tiny niche for Linux.

So I do understand your argument for practicality, even if I don't agree.

For the record, here is the "temporary page" fix that should fix it
properly. And some numbers.

Nick

--

Modify the core write() code so that it won't take a pagefault while holding a
lock on the pagecache page. There are a number of different deadlocks possible
if we try to do such a thing:

1.  generic_buffered_write
2.   lock_page
3.    prepare_write
4.     unlock_page+vmtruncate
5.     copy_from_user
6.      mmap_sem(r)
7.       handle_mm_fault
8.        lock_page (filemap_nopage)
9.    commit_write
10.  unlock_page

a. sys_munmap / sys_mlock / others
b.  mmap_sem(w)
c.   make_pages_present
d.    get_user_pages
e.     handle_mm_fault
f.      lock_page (filemap_nopage)

2,8	- recursive deadlock if page is same
2,8;2,8	- ABBA deadlock if page is different
2,6;b,f	- ABBA deadlock if page is same

The solution is as follows:
1.  If we find the destination page is uptodate, continue as normal, but use
    atomic usercopies which do not take pagefaults and do not zero the uncopied
    tail of the destination. The destination is already uptodate, so we can
    commit_write the full length even if there was a partial copy: it does not
    matter that the tail was not modified, because if it is dirtied and written
    back to disk it will not cause any problems (uptodate *means* that the
    destination page is as new or newer than the copy on disk).

1a. The above requires that fault_in_pages_readable correctly returns access
    information, because atomic usercopies cannot distinguish non-present
    pages in a readable mapping from the lack of a readable mapping.

2.  If we find the destination page is non uptodate, unlock it (this could be
    made slightly more optimal), then allocate a temporary page to copy the
    source data into. Relock the destination page and continue with the copy.
    However, instead of a usercopy (which might take a fault), copy the data
    from the pinned temporary page via the kernel address space.

(also, rename maxlen to seglen, because it was confusing)

On a P4 Xeon, SMP kernel, on a tmpfs filesystem, a 1GB dd if=/dev/zero write
had the following performance (higher is worse):

Orig kernel			New kernel
new file (no pagecache)
4K  blocks 1.280s		1.287s (+0.5%)
64K blocks 1.090s		1.105s (+1.4%)
notrunc (uptodate pagecache)
4K  blocks 0.976s		1.001s (+0.5%)
64K blocks 0.780s		0.792s (+1.5%)

[numbers are better than +/- 0.005]

So we lose somewhere between half a percent and one and a half percent of
performance in a pagecache-write-intensive workload.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2052,11 +2052,12 @@ generic_file_buffered_write(struct kiocb
 	filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, written);
 
 	do {
+		struct page *src_page;
 		struct page *page;
 		pgoff_t index;		/* Pagecache index for current page */
 		unsigned long offset;	/* Offset into pagecache page */
-		unsigned long maxlen;	/* Bytes remaining in current iovec */
-		size_t bytes;		/* Bytes to write to page */
+		unsigned long seglen;	/* Bytes remaining in current iovec */
+		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
 
 		buf = cur_iov->iov_base + iov_offset;
@@ -2066,20 +2067,30 @@ generic_file_buffered_write(struct kiocb
 		if (bytes > count)
 			bytes = count;
 
-		maxlen = cur_iov->iov_len - iov_offset;
-		if (maxlen > bytes)
-			maxlen = bytes;
+		/*
+		 * a non-NULL src_page indicates that we're doing the
+		 * copy via get_user_pages and kmap.
+		 */
+		src_page = NULL;
+
+		seglen = cur_iov->iov_len - iov_offset;
+		if (seglen > bytes)
+			seglen = bytes;
 
-#ifndef CONFIG_DEBUG_VM
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
+		 *
+		 * Not only is this an optimisation, but it is also required
+		 * to check that the address is actually valid, when atomic
+		 * usercopies are used, below.
 		 */
-		fault_in_pages_readable(buf, maxlen);
-#endif
-
+		if (unlikely(fault_in_pages_readable(buf, seglen))) {
+			status = -EFAULT;
+			break;
+		}
 
 		page = __grab_cache_page(mapping, index);
 		if (!page) {
@@ -2087,32 +2098,96 @@ generic_file_buffered_write(struct kiocb
 			break;
 		}
 
+		/*
+		 * non-uptodate pages cannot cope with short copies, and we
+		 * cannot take a pagefault with the destination page locked.
+		 * So pin the source page to copy it.
+		 */
+		if (!PageUptodate(page)) {
+			unlock_page(page);
+
+			src_page = alloc_page(GFP_KERNEL);
+			if (!src_page) {
+				page_cache_release(page);
+				status = -ENOMEM;
+				break;
+			}
+
+			/*
+			 * Cannot get_user_pages with a page locked for the
+			 * same reason as we can't take a page fault with a
+			 * page locked (as explained below).
+			 */
+			copied = filemap_copy_from_user(src_page, offset,
+					cur_iov, nr_segs, iov_offset, bytes);
+			if (unlikely(copied == 0)) {
+				status = -EFAULT;
+				page_cache_release(page);
+				page_cache_release(src_page);
+				break;
+			}
+			bytes = copied;
+
+			lock_page(page);
+			if (unlikely(!page->mapping)) {
+				unlock_page(page);
+				page_cache_release(page);
+				page_cache_release(src_page);
+				continue;
+			}
+
+		}
+
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
 		if (unlikely(status))
 			goto fs_write_aop_error;
 
-		copied = filemap_copy_from_user(page, offset,
+		if (!src_page) {
+			/*
+			 * Must not enter the pagefault handler here, because
+			 * we hold the page lock, so we might recursively
+			 * deadlock on the same lock, or get an ABBA deadlock
+			 * against a different lock, or against the mmap_sem
+			 * (which nests outside the page lock).  So increment
+			 * preempt count, and use _atomic usercopies.
+			 *
+			 * The page is uptodate so we are OK to encounter a
+			 * short copy: if unmodified parts of the page are
+			 * marked dirty and written out to disk, it doesn't
+			 * really matter.
+			 */
+			pagefault_disable();
+			copied = filemap_copy_from_user_atomic(page, offset,
 					cur_iov, nr_segs, iov_offset, bytes);
+			pagefault_enable();
+		} else {
+			void *src, *dst;
+			src = kmap_atomic(src_page, KM_USER0);
+			dst = kmap_atomic(page, KM_USER1);
+			memcpy(dst + offset, src + offset, bytes);
+			kunmap_atomic(dst, KM_USER1);
+			kunmap_atomic(src, KM_USER0);
+			copied = bytes;
+		}
 		flush_dcache_page(page);
 
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (unlikely(status < 0))
 			goto fs_write_aop_error;
-		if (unlikely(copied != bytes)) {
-			status = -EFAULT;
-			goto fs_write_aop_error;
-		}
 		if (unlikely(status > 0)) /* filesystem did partial write */
-			copied = status;
+			copied = min_t(size_t, copied, status);
+
+		unlock_page(page);
+		mark_page_accessed(page);
+		page_cache_release(page);
+		if (src_page)
+			page_cache_release(src_page);
 
 		written += copied;
 		count -= copied;
 		pos += copied;
 		filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, copied);
 
-		unlock_page(page);
-		mark_page_accessed(page);
-		page_cache_release(page);
 		balance_dirty_pages_ratelimited(mapping);
 		cond_resched();
 		continue;
@@ -2121,6 +2196,8 @@ fs_write_aop_error:
 		if (status != AOP_TRUNCATED_PAGE)
 			unlock_page(page);
 		page_cache_release(page);
+		if (src_page)
+			page_cache_release(src_page);
 
 		/*
 		 * prepare_write() may have instantiated a few blocks
@@ -2133,7 +2210,6 @@ fs_write_aop_error:
 			continue;
 		else
 			break;
-
 	} while (count);
 	*ppos = pos;
 
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -198,6 +198,9 @@ static inline int fault_in_pages_writeab
 {
 	int ret;
 
+	if (unlikely(size == 0))
+		return 0;
+
 	/*
 	 * Writing zeroes into userspace here is OK, because we know that if
 	 * the zero gets there, we'll be overwriting it.
@@ -217,19 +220,23 @@ static inline int fault_in_pages_writeab
 	return ret;
 }
 
-static inline void fault_in_pages_readable(const char __user *uaddr, int size)
+static inline int fault_in_pages_readable(const char __user *uaddr, int size)
 {
 	volatile char c;
 	int ret;
 
+	if (unlikely(size == 0))
+		return 0;
+
 	ret = __get_user(c, uaddr);
 	if (ret == 0) {
 		const char __user *end = uaddr + size - 1;
 
 		if (((unsigned long)uaddr & PAGE_MASK) !=
 				((unsigned long)end & PAGE_MASK))
-		 	__get_user(c, end);
+		 	ret = __get_user(c, end);
 	}
+	return ret;
 }
 
 #endif /* _LINUX_PAGEMAP_H */

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 11:10       ` Andrew Morton
  2007-02-04 11:22         ` Nick Piggin
@ 2007-02-04 17:40         ` Anton Altaparmakov
  2007-02-06  2:09           ` Nick Piggin
  1 sibling, 1 reply; 36+ messages in thread
From: Anton Altaparmakov @ 2007-02-04 17:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, 4 Feb 2007, Andrew Morton wrote:
> On Sun, 4 Feb 2007 10:59:58 +0000 (GMT) Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> > On Sun, 4 Feb 2007, Andrew Morton wrote:
> > > On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin <npiggin@suse.de> wrote:
> > > > 2.  If we find the destination page is non uptodate, unlock it (this could be
> > > >     made slightly more optimal), then find and pin the source page with
> > > >     get_user_pages. Relock the destination page and continue with the copy.
> > > >     However, instead of a usercopy (which might take a fault), copy the data
> > > >     via the kernel address space.
> > > 
> > > argh.  We just can't go adding all this gunk into the write() path. 
> > > 
> > > mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every page. 
> > > Even single-process write() will suffer, let alone multithreaded stuff,
> > > where mmap_sem contention may be the bigger problem.
> > > 
> > > I was going to do some quick measurements of this, but the code oopses
> > > on power4 (http://userweb.kernel.org/~akpm/s5000402.jpg)
> > > 
> > > We need to think different.
> > 
> > How about leaving the existing code with the following minor 
> > modifications:
> > 
> > Instead of calling filemap_copy_from_user{,_iovec}() do only the atomic 
> > bit with pagefaults disabled, i.e. instead of filemap_copy_from_user() we 
> > would do (this could of course move into a helper function):
> > 
> > pagefault_disable();
> > kaddr = kmap_atomic(page, KM_USER0);
> > left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
> > kunmap_atomic(kaddr, KM_USER0);
> > pagefault_enable();
> > 
> > if (unlikely(left)) {
> > 	/* The user space page got unmapped before we got to copy it. */
> > 	...
> > }
> > 
> > Thus 99.999% (or more!) of the time the code would just work as it always
> > has: there is no bug and no speed impact.  Only in the very rare and
> > hard-to-trigger race condition where the user space page, after being
> > faulted in, got thrown out again before we did the atomic memory copy do
> > we run into the above "..." code path.
> 
> Right.  And what I wanted to do here is to zero out the uncopied part of
> the page (if it wasn't uptodate), then run commit_write(), then retry the
> whole thing.
> 
> iirc, we ruled that out because those temporary zeroes are exposed to
> userspace.  But the kernel used to do that anyway for a long time (years)
> until someone noticed, and we'll only do it in your 0.0001% case anyway.
> 
> (Actually, perhaps we can prevent it by not marking the page uptodate in
> this case.  But that'll cause a read()er to try to bring it uptodate...)

My thinking was to not mark the page uptodate.  But yes, that causes the
problem of a concurrent readpage reading uninitialized disk blocks that
prepare_write allocated.

> > I would propose to call out to a different function altogether which
> > could do a multitude of things, including dropping the lock on the
> > destination page (maintaining a reference on the page!), allocating a
> > temporary page, copying from the user space page into the temporary page,
> > then locking the destination page again, and copying from the temporary
> > page into the destination page.
> 
> The problem with all these things is that as soon as we unlock the page,
> it's visible to read().  And in fact, as soon as we mark it uptodate it's
> visible to mmap, even if it's still locked.

Yes we definitely cannot mark it uptodate before it really is uptodate.
Either we have to not mark it uptodate or we have to zero it or we have to
think of some other magic...

> > This would be slow but who cares, given it would only happen incredibly
> > rarely, and on the majority of machines it would never happen at all.
> > 
> > The only potential problem I can see is that the destination page could be 
> > truncated whilst it is unlocked.  I can see two possible solutions to 
> > this:
> 
> truncate's OK: we're holding i_mutex.

How about excluding readpage() (in addition to truncate, if Nick is right
and some cases of truncate do not hold i_mutex) with an extra page flag, as
I proposed for truncate exclusion?  Then it would not matter that
prepare_write might have allocated blocks and might expose stale data.
A concurrent readpage() would go to sleep and wait on the bit to be cleared
instead of trying to bring the page uptodate.  It can then lock the page
and either find it uptodate (because commit_write did it) or not, and then
bring it uptodate.

Then we could safely fault in the page, copy from it into a temporary
page, then lock the destination page again and copy into it.

This is getting more involved as a patch again...  )-:  But at least it
does not affect the common case, except for having to check the new page
flag in every readpage() and truncate() call, and the checks could be with
an "if (unlikely(newpageflag()))" so should not be too bad.
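
Something like this, as a sketch (the flag and its accessors are all made
up):

	/* writer side, before dropping the page lock: */
	SetPageDoNotTruncate(page);
	unlock_page(page);
	/* ... fault in the user page, copy via a temporary page ... */
	lock_page(page);
	ClearPageDoNotTruncate(page);
	/* wake sleepers here, as unlock_page() does for PG_locked */

	/* readpage()/truncate paths: */
	if (unlikely(PageDoNotTruncate(page)))
		wait_on_page_bit(page, PG_do_not_truncate);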

Have I missed anything this time?

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 15:10                 ` Nick Piggin
@ 2007-02-04 18:36                   ` Andrew Morton
  2007-02-06  2:25                     ` Nick Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2007-02-04 18:36 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, 4 Feb 2007 16:10:51 +0100 Nick Piggin <npiggin@suse.de> wrote:

> On Sun, Feb 04, 2007 at 03:15:49AM -0800, Andrew Morton wrote:
> > On Sun, 4 Feb 2007 12:03:17 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > On Sun, Feb 04, 2007 at 02:56:02AM -0800, Andrew Morton wrote:
> > > > On Sun, 4 Feb 2007 11:46:09 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > > > 
> > > > If that recollection is right, I think we could afford to reintroduce that
> > > > problem, frankly.  Especially as it only happens in the incredibly rare
> > > > case of that get_user()ed page getting unmapped under our feet.
> > > 
> > > Dang. I was hoping to fix it without introducing data corruption.
> > 
> > Well.  It's a compromise.  Being practical about it, I reeeealy doubt that
> > anyone will hit this combination of circumstances.
> 
> They're not likely to hit the deadlocks, either. The probability gets
> higher after my patch to lock the page in the fault path. But practically,
> we could live without that too, because the data corruption it fixes is
> very rare as well. Which is exactly what we've been doing quite happily
> for most of 2.6, including all distro kernels (I think).

Thing is, an application which is relying on the contents of that page is
already unreliable (or really peculiar), because it can get indeterminate
results anyway.

> ...
> 
> On a P4 Xeon, SMP kernel, on a tmpfs filesystem, a 1GB dd if=/dev/zero write
> had the following performance (higher is worse):
> 
> Orig kernel			New kernel
> new file (no pagecache)
> 4K  blocks 1.280s		1.287s (+0.5%)
> 64K blocks 1.090s		1.105s (+1.4%)
> notrunc (uptodate pagecache)
> 4K  blocks 0.976s		1.001s (+0.5%)
> 64K blocks 0.780s		0.792s (+1.5%)
> 
> [numbers are better than +/- 0.005]
> 
> So we lose somewhere between half a percent and one and a half percent of
> performance in a pagecache-write-intensive workload.

That's not too bad - caches are fast.  Did you look at optimising the
handling of that temp page, ensuring that we always use the same page?  I
guess the page allocator per-cpu-pages thing is being good here.

I'm not sure how, though.  Park a copy in the task_struct, just as an
experiment.  But that'd de-optimise multiple-tasks-writing-on-the-same-cpu.
Maybe a per-cpu thing?  Largely duplicates the page allocator's per-cpu-pages.

Of course, we're also increasing cache footprint, which this test won't
show.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 17:40         ` Anton Altaparmakov
@ 2007-02-06  2:09           ` Nick Piggin
  2007-02-06 13:13             ` Anton Altaparmakov
  0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2007-02-06  2:09 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: Andrew Morton, Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, Feb 04, 2007 at 05:40:35PM +0000, Anton Altaparmakov wrote:
> On Sun, 4 Feb 2007, Andrew Morton wrote:
> > truncate's OK: we're holding i_mutex.
> 
> How about excluding readpage() (in addition to truncate, if Nick is right
> and some cases of truncate do not hold i_mutex) with an extra page flag, as
> I proposed for truncate exclusion?  Then it would not matter that
> prepare_write might have allocated blocks and might expose stale data.
> A concurrent readpage() would go to sleep and wait on the bit to be cleared
> instead of trying to bring the page uptodate.  It can then lock the page
> and either find it uptodate (because commit_write did it) or not, and then
> bring it uptodate.
> 
> Then we could safely fault in the page, copy from it into a temporary 
> page, then lock the destination page again and copy into it.
> 
> This is getting more involved as a patch again...  )-:  But at least it
> does not affect the common case, except for having to check the new page
> flag in every readpage() and truncate() call, and the checks could be with
> an "if (unlikely(newpageflag()))" so should not be too bad.
> 
> Have I missed anything this time?

Yes. If you have a flag to exclude readpage(), then you must also
exclude filemap_nopage, in which case it is still deadlocky.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 18:36                   ` Andrew Morton
@ 2007-02-06  2:25                     ` Nick Piggin
  2007-02-06  4:41                       ` Nick Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2007-02-06  2:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, Feb 04, 2007 at 10:36:20AM -0800, Andrew Morton wrote:
> On Sun, 4 Feb 2007 16:10:51 +0100 Nick Piggin <npiggin@suse.de> wrote:
> 
> > They're not likely to hit the deadlocks, either. The probability gets
> > higher after my patch to lock the page in the fault path. But practically,
> > we could live without that too, because the data corruption it fixes is
> > very rare as well. Which is exactly what we've been doing quite happily
> > for most of 2.6, including all distro kernels (I think).
> 
> Thing is, an application which is relying on the contents of that page is
> already unreliable (or really peculiar), because it can get indeterminate
> results anyway.

Not necessarily -- they could read from one part of a page and write to
another. I see this as the biggest data corruption problem.

But even in the case where they can get indeterminate results, they can
still determine what the results *won't* be. Eg. they might use a single
byte for a flag or something.

> > ...
> > 
> > On a P4 Xeon, SMP kernel, on a tmpfs filesystem, a 1GB dd if=/dev/zero write
> > had the following performance (higher is worse):
> > 
> > Orig kernel			New kernel
> > new file (no pagecache)
> > 4K  blocks 1.280s		1.287s (+0.5%)
> > 64K blocks 1.090s		1.105s (+1.4%)
> > notrunc (uptodate pagecache)
> > 4K  blocks 0.976s		1.001s (+0.5%)
> > 64K blocks 0.780s		0.792s (+1.5%)
> > 
> > [numbers are better than +/- 0.005]
> > 
> > So we lose somewhere between half a percent and one and a half percent of
> > performance in a pagecache-write-intensive workload.
> 
> That's not too bad - caches are fast.  Did you look at optimising the
> > handling of that temp page, ensuring that we always use the same page?  I
> guess the page allocator per-cpu-pages thing is being good here.

Yeah it should be doing a reasonable job.

> I'm not sure how, though.  Park a copy in the task_struct, just as an
> experiment.  But that'd de-optimise multiple-tasks-writing-on-the-same-cpu.
> Maybe a per-cpu thing?  Largely duplicates the page allocator's per-cpu-pages.

Putting a copy in the task_struct won't do much I figure, except saving
a couple of interrupt enable/disables, and being more wasteful of memory
and cache-hotness.

Per-cpu doesn't work because we can't hold preempt off over the usercopy
(well, we *could* do it in a loop together with fault_in_pages, but that
just adds to the icache bloat).

> 
> Of course, we're also increasing cache footprint, which this test won't
> show.

We are indeed. At least we release the hot page back to the allocator
very quickly so that it can be reused.

The upshot is that your writev performance will be improved :)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-06  2:25                     ` Nick Piggin
@ 2007-02-06  4:41                       ` Nick Piggin
  2007-02-06  5:30                         ` Andrew Morton
  0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2007-02-06  4:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Tue, Feb 06, 2007 at 03:25:49AM +0100, Nick Piggin wrote:
> On Sun, Feb 04, 2007 at 10:36:20AM -0800, Andrew Morton wrote:
> > On Sun, 4 Feb 2007 16:10:51 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > They're not likely to hit the deadlocks, either. The probability gets
> > > higher after my patch to lock the page in the fault path. But practically,
> > > we could live without that too, because the data corruption it fixes is
> > > very rare as well. Which is exactly what we've been doing quite happily
> > > for most of 2.6, including all distro kernels (I think).
> > 
> > Thing is, an application which is relying on the contents of that page is
> > already unreliable (or really peculiar), because it can get indeterminate
> > results anyway.
> 
> Not necessarily -- they could read from one part of a page and write to
> another. I see this as the biggest data corruption problem.

And in fact, it is not just transient errors either. This problem can
add permanent corruption into the pagecache and onto disk, and it doesn't
even require two processes to race.

After zeroing out the uncopied part of the page, and attempting to loop
again, we might bail out of the loop for any reason before completing the
rest of the copy, leaving the pagecache corrupted, which will soon go out
to disk.

Nick

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-06  4:41                       ` Nick Piggin
@ 2007-02-06  5:30                         ` Andrew Morton
  2007-02-06  5:49                           ` Nick Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2007-02-06  5:30 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Tue, 6 Feb 2007 05:41:46 +0100 Nick Piggin <npiggin@suse.de> wrote:

> On Tue, Feb 06, 2007 at 03:25:49AM +0100, Nick Piggin wrote:
> > On Sun, Feb 04, 2007 at 10:36:20AM -0800, Andrew Morton wrote:
> > > On Sun, 4 Feb 2007 16:10:51 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > > 
> > > > They're not likely to hit the deadlocks, either. The probability gets
> > > > higher after my patch to lock the page in the fault path. But practically,
> > > > we could live without that too, because the data corruption it fixes is
> > > > very rare as well. Which is exactly what we've been doing quite happily
> > > > for most of 2.6, including all distro kernels (I think).
> > > 
> > > Thing is, an application which is relying on the contents of that page is
> > > already unreliable (or really peculiar), because it can get indeterminate
> > > results anyway.
> > 
> > Not necessarily -- they could read from one part of a page and write to
> > another. I see this as the biggest data corruption problem.

The kernel gets that sort of thing wrong anyway, and always has, because
it uses memcpy()-style copying and not memmove()-style.

I can't imagine what sort of application you're envisaging here.  The
problem was only ever observed from userspace by an artificial stress-test
thing.

> And in fact, it is not just transient errors either. This problem can
> add permanent corruption into the pagecache and onto disk, and it doesn't
> even require two processes to race.
> 
> After zeroing out the uncopied part of the page, and attempting to loop
> again, we might bail out of the loop for any reason before completing the
> rest of the copy, leaving the pagecache corrupted, which will soon go out
> to disk.
> 

Only because ->commit_write() went and incorrectly marked parts of the page
as up-to-date.

Zeroing out the tail end of the copy_from_user() on fault is actually incorrect.
What we _should_ do is to bring those uncopyable, non-uptodate parts of the
page uptodate rather than zeroing them.  ->readpage() does that.

So...  what happens if we do

	lock_page()
	prepare_write()
	if (copy_from_user_atomic()) {
		readpage()
		wait_on_page()
		lock_page()
	}
	commit_write()
	unlock_page()

- If the page has no buffers then it is either fully uptodate or fully
  not uptodate.  In the former case, don't call readpage at all.  In the
  latter case, readpage() is the correct thing to do.

- If the page had buffers, then readpage() won't touch the uptodate ones
  and will bring the non-uptodate ones up to date from disk.

  Some of the data which we copied from userspace may get overwritten
  from backing store, but that's OK.

Seems crazy, but it's close.  We do have the minor problem that readpage
went and unlocked the page so we need to relock it.  I bet there are holes
in there.
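
Fleshed out a little (a sketch; it assumes ->readpage() leaves the page
unlocked once I/O completes, and it ignores the relocking hole above):

	status = a_ops->prepare_write(file, page, offset, offset + bytes);
	pagefault_disable();
	copied = filemap_copy_from_user_atomic(page, offset,
				cur_iov, nr_segs, iov_offset, bytes);
	pagefault_enable();
	if (unlikely(copied != bytes)) {
		/* we faulted: fill the non-uptodate parts from disk instead */
		status = a_ops->readpage(file, page);	/* unlocks the page */
		if (!status) {
			wait_on_page_locked(page);
			lock_page(page);
		}
	}
	status = a_ops->commit_write(file, page, offset, offset + bytes);
	unlock_page(page);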




Idea #42: after we've locked the pagecache page, do an atomic get_user()
against the source page(s) before attempting the copy_from_user().  If that
faults, don't run prepare_write or anything else: drop the page lock and
try again.

Because

- If the get_user() faults, it might be because the page we're copying
  from and to is the same page, and someone went and unmapped it: deadlock.

- If the get_user() doesn't fault, and if we're copying from and to the
  same page, we know that we've locked it, so nobody will be able to unmap
  it while we're copying from it.
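
In code, the probe might look like this (a sketch; it assumes the variant of
fault_in_pages_readable() that returns an error, as in Nick's patch):

	volatile char c;
	int err;

retry:
	lock_page(page);
	pagefault_disable();
	err = __get_user(c, buf);	/* probe the source under the page lock */
	pagefault_enable();
	if (unlikely(err)) {
		unlock_page(page);
		if (fault_in_pages_readable(buf, bytes))
			return -EFAULT;	/* genuinely bad address */
		goto retry;
	}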

Close, but no cigar!  This is still vulnerable to Hugh's ab/ba deadlock
scenario.


btw, to fix the writev() performance problem we may need to go off and run
get_user() against up to 1024 separate user pages before locking the
pagecache page, which sounds fairly idiotic.  Are you doing that in the
implementations which you've been working on?  I forget...

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-06  5:30                         ` Andrew Morton
@ 2007-02-06  5:49                           ` Nick Piggin
  2007-02-06  5:53                             ` Nick Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2007-02-06  5:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Mon, Feb 05, 2007 at 09:30:06PM -0800, Andrew Morton wrote:
> On Tue, 6 Feb 2007 05:41:46 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > > 
> > > Not necessarily -- they could read from one part of a page and write to
> > > another. I see this as the biggest data corruption problem.
> 
> The kernel gets that sort of thing wrong anyway, and always has, because
> it uses memcpy()-style copying and not memmove()-style.
> 
> I can't imagine what sort of application you're envisaging here.  The
> problem was only ever observed from userspace by an artificial stress-test
> thing.

No, I'm not talking about a write whose source memory lies in the page
being written to. I'm talking about one process writing to part of a file
while another reads from the same file (at a different offset).

If they happen to be covered by the same page, then the reader can see
zeroes.
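
As a concrete (hypothetical) illustration, with 4096-byte pages and both
processes working on the same open file fd:

	/* process A: repeatedly rewrite bytes 0..99 of the file */
	char wbuf[100];
	memset(wbuf, 'x', sizeof(wbuf));
	for (;;)
		pwrite(fd, wbuf, sizeof(wbuf), 0);

	/* process B: read bytes 2048..2147 -- same pagecache page */
	char rbuf[100];
	for (;;) {
		pread(fd, rbuf, sizeof(rbuf), 2048);
		/*
		 * With the buggy tail-zeroing, rbuf can transiently come
		 * back full of zeroes that were never in the file.
		 */
	}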

I'm not envisaging any sort of application; all I know is that there are
several (related) data corruption bugs, and I'm trying to fix them (and
fix these deadlock problems without introducing more).

> > And in fact, it is not just transient errors either. This problem can
> > add permanent corruption into the pagecache and onto disk, and it doesn't
> > even require two processes to race.
> > 
> > After zeroing out the uncopied part of the page, and attempting to loop
> > again, we might bail out of the loop for any reason before completing the
> > rest of the copy, leaving the pagecache corrupted, which will soon go out
> > to disk.
> > 
> 
> Only because ->commit_write() went and incorrectly marked parts of the page
> as up-to-date.
> 
> Zeroing out the tail end of the copy_from_user() on fault is actually incorrect.

Yes, I know.

> What we _should_ do is to bring those uncopyable, non-uptodate parts of the
> page uptodate rather than zeroing them.  ->readpage() does that.
> 
> So...  what happens if we do
> 
> 	lock_page()
> 	prepare_write()
> 	if (copy_from_user_atomic()) {
> 		readpage()
> 		wait_on_page()
> 		lock_page()
> 	}
> 	commit_write()
> 	unlock_page()
> 
> - If the page has no buffers then it is either fully uptodate or fully
>   not uptodate.  In the former case, don't call readpage at all.  In the
>   latter case, readpage() is the correct thing to do.
> 
> - If the page had buffers, then readpage() won't touch the uptodate ones
>   and will bring the non-uptodate ones up to date from disk.
> 
>   Some of the data which we copied from userspace may get overwritten
>   from backing store, but that's OK.
> 
> seems crazy, but it's close.  We do have the minor problem that readpage
> went and unlocked the page so we need to relock it.  I bet there are holes
> in there.

Yes, I tried doing this as well, and there are holes in it. Even supposing
that we add a readpage_dontunlock, there is still the issue of breaking
the filesystem API by nesting readpage inside prepare_write. You also
need to zero newly allocated blocks, for example.

> Idea #42: after we've locked the pagecache page, do an atomic get_user()
> against the source page(s) before attempting the copy_from_user().  If that
> faults, don't run prepare_write or anything else: drop the page lock and
> try again.
> 
> Because
> 
> - If the get_user() faults, it might be because the page we're copying
>   from and to is the same page, and someone went and unmapped it: deadlock.
> 
> - If the get_user() doesn't fault, and if we're copying from and to the
>   same page, we know that we've locked it, so nobody will be able to unmap
>   it while we're copying from it.
> 
> Close, but no cigar!  This is still vulnerable to Hugh's ab/ba deadlock
> scenario.

Yes, I considered this too. Hard, isn't it?

> btw, to fix the writev() performance problem we may need to go off and run
> get_user() against up to 1024 separate user pages before locking the
> pagecache page, which sounds fairly idiotic.  Are you doing that in the
> implementations which you've been working on?  I forget...

No, because in my fix it can do non-atomic usercopies for !uptodate pages.

For uptodate pages, yes, there is a possibility that it may do a short copy
and have to retry, but it is probably safe to bet that the source data was
accessed fairly recently in most cases, so a short copy will be unlikely.
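
Condensed, the copy step in my patch dispatches like this (untested
condensation; the full diff is in patch 9/9 elsewhere in the thread):

	if (!src_page) {
		/*
		 * Destination page was uptodate: atomic usercopy.  A
		 * short copy is tolerated; the outer loop just continues
		 * with the uncopied remainder.
		 */
		pagefault_disable();
		copied = filemap_copy_from_user_atomic(page, offset,
				cur_iov, nr_segs, iov_offset, bytes);
		pagefault_enable();
	} else {
		/*
		 * Destination was not uptodate: the source was already
		 * pinned with get_user_pages(), so kmap both pages and
		 * memcpy `bytes' across -- no pagefault, no short copy.
		 */
		copied = bytes;
	}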

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-06  5:49                           ` Nick Piggin
@ 2007-02-06  5:53                             ` Nick Piggin
  0 siblings, 0 replies; 36+ messages in thread
From: Nick Piggin @ 2007-02-06  5:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Tue, Feb 06, 2007 at 06:49:05AM +0100, Nick Piggin wrote:
> > - If the get_user() doesn't fault, and if we're copying from and to the
> >   same page, we know that we've locked it, so nobody will be able to unmap
> >   it while we're copying from it.
> > 
> > Close, but no cigar!  This is still vulnerable to Hugh's ab/ba deadlock
> > scenario.
> 
> Yes I considered this too. Hard isn't it?

BTW, there are two different ABBA deadlocks: 2,8;2,8 (two different pages)
and 2,6;b,f (page lock vs. mmap_sem). It's all documented in the patch 9/9
description.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-06  2:09           ` Nick Piggin
@ 2007-02-06 13:13             ` Anton Altaparmakov
  0 siblings, 0 replies; 36+ messages in thread
From: Anton Altaparmakov @ 2007-02-06 13:13 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linux Kernel, Linux Filesystems, Linux Memory Management

On Tue, 2007-02-06 at 03:09 +0100, Nick Piggin wrote:
> On Sun, Feb 04, 2007 at 05:40:35PM +0000, Anton Altaparmakov wrote:
> > On Sun, 4 Feb 2007, Andrew Morton wrote:
> > > truncate's OK: we're holding i_mutex.
> > 
> > How about excluding readpage() (in addition to truncate if Nick is right  
> > and some cases of truncate do not hold i_mutex) with an extra page flag as
> > I proposed for truncate exclusion?  Then it would not matter that
> > prepare_write might have allocated blocks and might expose stale data.    
> > It would go to sleep and wait on the bit to be cleared instead of trying  
> > to bring the page uptodate.  It can then lock the page and either find it 
> > uptodate (because commit_write did it) or not and then bring it uptodate.
> > 
> > Then we could safely fault in the page, copy from it into a temporary 
> > page, then lock the destination page again and copy into it.
> > 
> > This is getting more involved as a patch again...  )-:  But at least it
> > does not affect the common case, except for having to check the new page
> > flag in every readpage() and truncate() call, and those checks could be
> > wrapped in an "if (unlikely(newpageflag()))" so they should not be too bad.
> > 
> > Have I missed anything this time?
> 
> Yes. If you have a flag to exclude readpage(), then you must also
> exclude filemap_nopage, in which case it is still deadlocky.

Ouch, you are of course right.  )-:

Best regards,

        Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-02 23:53   ` Andrew Morton
@ 2007-02-03  1:38     ` Nick Piggin
  0 siblings, 0 replies; 36+ messages in thread
From: Nick Piggin @ 2007-02-03  1:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Fri, Feb 02, 2007 at 03:53:11PM -0800, Andrew Morton wrote:
> On Mon, 29 Jan 2007 11:33:03 +0100 (CET)
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > Modify the core write() code so that it won't take a pagefault while holding a
> > lock on the pagecache page. There are a number of different deadlocks possible
> > if we try to do such a thing:
> > 
> > 1.  generic_buffered_write
> > 2.   lock_page
> > 3.    prepare_write
> > 4.     unlock_page+vmtruncate
> > 5.     copy_from_user
> > 6.      mmap_sem(r)
> > 7.       handle_mm_fault
> > 8.        lock_page (filemap_nopage)
> > 9.    commit_write
> > 10.  unlock_page
> > 
> > a. sys_munmap / sys_mlock / others
> > b.  mmap_sem(w)
> > c.   make_pages_present
> > d.    get_user_pages
> > e.     handle_mm_fault
> > f.      lock_page (filemap_nopage)
> > 
> > 2,8	- recursive deadlock if page is same
> > 2,8;2,8	- ABBA deadlock if page is different
> > 2,6;b,f	- ABBA deadlock if page is same
> > 
> > The solution is as follows:
> > 1.  If we find the destination page is uptodate, continue as normal, but use
> >     atomic usercopies which do not take pagefaults and do not zero the uncopied
> >     tail of the destination. The destination is already uptodate, so we can
> >     commit_write the full length even if there was a partial copy: it does not
> >     matter that the tail was not modified, because if it is dirtied and written
> >     back to disk it will not cause any problems (uptodate *means* that the
> >     destination page is as new or newer than the copy on disk).
> > 
> > 1a. The above requires that fault_in_pages_readable correctly returns access
> >     information, because atomic usercopies cannot distinguish
> >     non-present pages in a readable mapping from the lack of a readable
> >     mapping.
> > 
> > 2.  If we find the destination page is non uptodate, unlock it (this could be
> >     made slightly more optimal), then find and pin the source page with
> >     get_user_pages. Relock the destination page and continue with the copy.
> >     However, instead of a usercopy (which might take a fault), copy the data
> >     via the kernel address space.
> > 
> 
> Oh what a mess we're making :(
> 
> Unfortunately, write() into a non-uptodate page is very much the common
> case.  We've always tried to avoid doing a pte-walk in the write() path to
> fix this bug.  Careful performance testing is needed here so we can assess
> the impact.  For threaded applications, simply the taking of mmap_sem might
> be the biggest problem.
> 
> And I can't think of any tricks we can play to avoid doing the pte-walk in
> most cases.  For example, we don't yet have a page to run page_mapped()
> against.

After this patch series, I am working on another that will allow filesystems
to specifically code around the problem (eg. by handling short usercopies
properly).

I tried to take this approach generically the first time, but it turns out
lots of filesystems had subtle problems. If we do it this way instead,
then the filesystem developers who care enough can improve their code, and
those who don't won't hold the others back (or prevent this bug from being
fixed).

> >  			break;
> >  		}
> >  
> > +		/*
> > +		 * non-uptodate pages cannot cope with short copies, and we
> > +		 * cannot take a pagefault with the destination page locked.
> > +		 * So pin the source page to copy it.
> > +		 */
> > +		if (!PageUptodate(page)) {
> > +			unlock_page(page);
> > +
> > +			bytes = min(bytes, PAGE_CACHE_SIZE -
> > +				     ((unsigned long)buf & ~PAGE_CACHE_MASK));
> > +
> > +			/*
> > +			 * Cannot get_user_pages with a page locked for the
> > +			 * same reason as we can't take a page fault with a
> > +			 * page locked (as explained below).
> > +			 */
> > +			down_read(&current->mm->mmap_sem);
> > +			status = get_user_pages(current, current->mm,
> > +					(unsigned long)buf & PAGE_CACHE_MASK, 1,
> > +					0, 0, &src_page, NULL);
> > +			up_read(&current->mm->mmap_sem);
> > +			if (status != 1) {
> > +				page_cache_release(page);
> > +				break;
> > +			}
> > +
> > +			lock_page(page);
> > +			if (!page->mapping) {
> 
> Hopefully this can't happen?  If it can, who went and took our page off the
> mapping?  Reclaim?  The elevated page_count will prevent that?

Truncate/invalidate?

> > +				unlock_page(page);
> > +				page_cache_release(page);
> > +				page_cache_release(src_page);
> > +				continue;
> > +			}
> > +
> > +		}
> > +
> >  		status = a_ops->prepare_write(file, page, offset, offset+bytes);
> >  		if (unlikely(status))
> >  			goto fs_write_aop_error;
> >  
> > -		copied = filemap_copy_from_user(page, offset,
> > +		if (!src_page) {
> > +			/*
> > +			 * Must not enter the pagefault handler here, because
> > +			 * we hold the page lock, so we might recursively
> > +			 * deadlock on the same lock, or get an ABBA deadlock
> > +			 * against a different lock, or against the mmap_sem
> > +			 * (which nests outside the page lock).  So increment
> > +			 * preempt count, and use _atomic usercopies.
> > +			 *
> > +			 * The page is uptodate so we are OK to encounter a
> > +			 * short copy: if unmodified parts of the page are
> > +			 * marked dirty and written out to disk, it doesn't
> > +			 * really matter.
> > +			 */
> > +			pagefault_disable();
> > +			copied = filemap_copy_from_user_atomic(page, offset,
> >  					cur_iov, nr_segs, iov_offset, bytes);
> > +			pagefault_enable();
> > +		} else {
> > +			char *src, *dst;
> > +			src = kmap(src_page);
> > +			dst = kmap(page);
> > +			memcpy(dst + offset,
> > +				src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
> > +				bytes);
> > +			kunmap(page);
> > +			kunmap(src_page);
> > +			copied = bytes;
> > +		}
> 
> As you say, we don't want to do this: taking two kmap()s at the same time
> leaves us vulnerable to kmap() deadlocks: we deadlock when there are 512
> tasks (LAST_PKMAP) each holding one kmap, and all waiting for someone else
> to give one back.

Yep, thinko. My updated version uses KM_USER[01].

Thanks for reviewing so far.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-01-29 10:33 ` [patch 9/9] mm: fix pagecache write deadlocks Nick Piggin
  2007-01-29 11:11   ` Nick Piggin
@ 2007-02-02 23:53   ` Andrew Morton
  2007-02-03  1:38     ` Nick Piggin
  1 sibling, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2007-02-02 23:53 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Mon, 29 Jan 2007 11:33:03 +0100 (CET)
Nick Piggin <npiggin@suse.de> wrote:

> Modify the core write() code so that it won't take a pagefault while holding a
> lock on the pagecache page. There are a number of different deadlocks possible
> if we try to do such a thing:
> 
> 1.  generic_buffered_write
> 2.   lock_page
> 3.    prepare_write
> 4.     unlock_page+vmtruncate
> 5.     copy_from_user
> 6.      mmap_sem(r)
> 7.       handle_mm_fault
> 8.        lock_page (filemap_nopage)
> 9.    commit_write
> 10.  unlock_page
> 
> a. sys_munmap / sys_mlock / others
> b.  mmap_sem(w)
> c.   make_pages_present
> d.    get_user_pages
> e.     handle_mm_fault
> f.      lock_page (filemap_nopage)
> 
> 2,8	- recursive deadlock if page is same
> 2,8;2,8	- ABBA deadlock if page is different
> 2,6;b,f	- ABBA deadlock if page is same
> 
> The solution is as follows:
> 1.  If we find the destination page is uptodate, continue as normal, but use
>     atomic usercopies which do not take pagefaults and do not zero the uncopied
>     tail of the destination. The destination is already uptodate, so we can
>     commit_write the full length even if there was a partial copy: it does not
>     matter that the tail was not modified, because if it is dirtied and written
>     back to disk it will not cause any problems (uptodate *means* that the
>     destination page is as new or newer than the copy on disk).
> 
> 1a. The above requires that fault_in_pages_readable correctly returns access
>     information, because atomic usercopies cannot distinguish
>     non-present pages in a readable mapping from the lack of a readable
>     mapping.
> 
> 2.  If we find the destination page is non uptodate, unlock it (this could be
>     made slightly more optimal), then find and pin the source page with
>     get_user_pages. Relock the destination page and continue with the copy.
>     However, instead of a usercopy (which might take a fault), copy the data
>     via the kernel address space.
> 

Oh what a mess we're making :(

Unfortunately, write() into a non-uptodate page is very much the common
case.  We've always tried to avoid doing a pte-walk in the write() path to
fix this bug.  Careful performance testing is needed here so we can assess
the impact.  For threaded applications, simply the taking of mmap_sem might
be the biggest problem.

And I can't think of any tricks we can play to avoid doing the pte-walk in
most cases.  For example, we don't yet have a page to run page_mapped()
against.

>  			break;
>  		}
>  
> +		/*
> +		 * non-uptodate pages cannot cope with short copies, and we
> +		 * cannot take a pagefault with the destination page locked.
> +		 * So pin the source page to copy it.
> +		 */
> +		if (!PageUptodate(page)) {
> +			unlock_page(page);
> +
> +			bytes = min(bytes, PAGE_CACHE_SIZE -
> +				     ((unsigned long)buf & ~PAGE_CACHE_MASK));
> +
> +			/*
> +			 * Cannot get_user_pages with a page locked for the
> +			 * same reason as we can't take a page fault with a
> +			 * page locked (as explained below).
> +			 */
> +			down_read(&current->mm->mmap_sem);
> +			status = get_user_pages(current, current->mm,
> +					(unsigned long)buf & PAGE_CACHE_MASK, 1,
> +					0, 0, &src_page, NULL);
> +			up_read(&current->mm->mmap_sem);
> +			if (status != 1) {
> +				page_cache_release(page);
> +				break;
> +			}
> +
> +			lock_page(page);
> +			if (!page->mapping) {

Hopefully this can't happen?  If it can, who went and took our page off the
mapping?  Reclaim?  The elevated page_count will prevent that?

> +				unlock_page(page);
> +				page_cache_release(page);
> +				page_cache_release(src_page);
> +				continue;
> +			}
> +
> +		}
> +
>  		status = a_ops->prepare_write(file, page, offset, offset+bytes);
>  		if (unlikely(status))
>  			goto fs_write_aop_error;
>  
> -		copied = filemap_copy_from_user(page, offset,
> +		if (!src_page) {
> +			/*
> +			 * Must not enter the pagefault handler here, because
> +			 * we hold the page lock, so we might recursively
> +			 * deadlock on the same lock, or get an ABBA deadlock
> +			 * against a different lock, or against the mmap_sem
> +			 * (which nests outside the page lock).  So increment
> +			 * preempt count, and use _atomic usercopies.
> +			 *
> +			 * The page is uptodate so we are OK to encounter a
> +			 * short copy: if unmodified parts of the page are
> +			 * marked dirty and written out to disk, it doesn't
> +			 * really matter.
> +			 */
> +			pagefault_disable();
> +			copied = filemap_copy_from_user_atomic(page, offset,
>  					cur_iov, nr_segs, iov_offset, bytes);
> +			pagefault_enable();
> +		} else {
> +			char *src, *dst;
> +			src = kmap(src_page);
> +			dst = kmap(page);
> +			memcpy(dst + offset,
> +				src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
> +				bytes);
> +			kunmap(page);
> +			kunmap(src_page);
> +			copied = bytes;
> +		}

As you say, we don't want to do this: taking two kmap()s at the same time
leaves us vulnerable to kmap() deadlocks: we deadlock when there are 512
tasks (LAST_PKMAP) each holding one kmap, and all waiting for someone else
to give one back.




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-01-29 10:33 ` [patch 9/9] mm: fix pagecache write deadlocks Nick Piggin
@ 2007-01-29 11:11   ` Nick Piggin
  2007-02-02 23:53   ` Andrew Morton
  1 sibling, 0 replies; 36+ messages in thread
From: Nick Piggin @ 2007-01-29 11:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Mon, Jan 29, 2007 at 11:33:03AM +0100, Nick Piggin wrote:
> +		} else {
> +			char *src, *dst;
> +			src = kmap(src_page);
> +			dst = kmap(page);
> +			memcpy(dst + offset,
> +				src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
> +				bytes);
> +			kunmap(page);
> +			kunmap(src_page);
> +			copied = bytes;
> +		}
>  		flush_dcache_page(page);

Hmm, I guess these should use kmap_atomic with KM_USER[01]?

The kmap is from an earlier iteration that wanted to sleep
with the page mapped into kernel.
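
Presumably something like this (untested):

	} else {
		char *src, *dst;

		/*
		 * The copy cannot fault: src_page is pinned and page is
		 * locked, so a plain memcpy between the two atomic kmaps
		 * is safe, and we never sleep while they are mapped.
		 */
		src = kmap_atomic(src_page, KM_USER0);
		dst = kmap_atomic(page, KM_USER1);
		memcpy(dst + offset,
			src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
			bytes);
		kunmap_atomic(dst, KM_USER1);
		kunmap_atomic(src, KM_USER0);
		copied = bytes;
	}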

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [patch 9/9] mm: fix pagecache write deadlocks
  2007-01-29 10:31 [patch 0/9] buffered write deadlock fix Nick Piggin
@ 2007-01-29 10:33 ` Nick Piggin
  2007-01-29 11:11   ` Nick Piggin
  2007-02-02 23:53   ` Andrew Morton
  0 siblings, 2 replies; 36+ messages in thread
From: Nick Piggin @ 2007-01-29 10:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

Modify the core write() code so that it won't take a pagefault while holding a
lock on the pagecache page. There are a number of different deadlocks possible
if we try to do such a thing:

1.  generic_buffered_write
2.   lock_page
3.    prepare_write
4.     unlock_page+vmtruncate
5.     copy_from_user
6.      mmap_sem(r)
7.       handle_mm_fault
8.        lock_page (filemap_nopage)
9.    commit_write
10.  unlock_page

a. sys_munmap / sys_mlock / others
b.  mmap_sem(w)
c.   make_pages_present
d.    get_user_pages
e.     handle_mm_fault
f.      lock_page (filemap_nopage)

2,8	- recursive deadlock if page is same
2,8;2,8	- ABBA deadlock if page is different
2,6;b,f	- ABBA deadlock if page is same

The solution is as follows:
1.  If we find the destination page is uptodate, continue as normal, but use
    atomic usercopies which do not take pagefaults and do not zero the uncopied
    tail of the destination. The destination is already uptodate, so we can
    commit_write the full length even if there was a partial copy: it does not
    matter that the tail was not modified, because if it is dirtied and written
    back to disk it will not cause any problems (uptodate *means* that the
    destination page is as new or newer than the copy on disk).

1a. The above requires that fault_in_pages_readable correctly returns access
    information, because atomic usercopies cannot distinguish
    non-present pages in a readable mapping from the lack of a readable
    mapping.

2.  If we find the destination page is non uptodate, unlock it (this could be
    made slightly more optimal), then find and pin the source page with
    get_user_pages. Relock the destination page and continue with the copy.
    However, instead of a usercopy (which might take a fault), copy the data
    via the kernel address space.

(also, rename maxlen to seglen, because it was confusing)

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1843,11 +1843,12 @@ generic_file_buffered_write(struct kiocb
 	filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, written);
 
 	do {
+		struct page *src_page;
 		struct page *page;
 		pgoff_t index;		/* Pagecache index for current page */
 		unsigned long offset;	/* Offset into pagecache page */
-		unsigned long maxlen;	/* Bytes remaining in current iovec */
-		size_t bytes;		/* Bytes to write to page */
+		unsigned long seglen;	/* Bytes remaining in current iovec */
+		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
 
 		buf = cur_iov->iov_base + iov_offset;
@@ -1857,20 +1858,30 @@ generic_file_buffered_write(struct kiocb
 		if (bytes > count)
 			bytes = count;
 
-		maxlen = cur_iov->iov_len - iov_offset;
-		if (maxlen > bytes)
-			maxlen = bytes;
+		/*
+		 * a non-NULL src_page indicates that we're doing the
+		 * copy via get_user_pages and kmap.
+		 */
+		src_page = NULL;
+
+		seglen = cur_iov->iov_len - iov_offset;
+		if (seglen > bytes)
+			seglen = bytes;
 
-#ifndef CONFIG_DEBUG_VM
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
+		 *
+		 * Not only is this an optimisation, but it is also required
+		 * to check that the address is actually valid, when atomic
+		 * usercopies are used, below.
 		 */
-		fault_in_pages_readable(buf, maxlen);
-#endif
-
+		if (unlikely(fault_in_pages_readable(buf, seglen))) {
+			status = -EFAULT;
+			break;
+		}
 
 		page = __grab_cache_page(mapping, index);
 		if (!page) {
@@ -1878,32 +1889,94 @@ generic_file_buffered_write(struct kiocb
 			break;
 		}
 
+		/*
+		 * non-uptodate pages cannot cope with short copies, and we
+		 * cannot take a pagefault with the destination page locked.
+		 * So pin the source page to copy it.
+		 */
+		if (!PageUptodate(page)) {
+			unlock_page(page);
+
+			bytes = min(bytes, PAGE_CACHE_SIZE -
+				     ((unsigned long)buf & ~PAGE_CACHE_MASK));
+
+			/*
+			 * Cannot get_user_pages with a page locked for the
+			 * same reason as we can't take a page fault with a
+			 * page locked (as explained below).
+			 */
+			down_read(&current->mm->mmap_sem);
+			status = get_user_pages(current, current->mm,
+					(unsigned long)buf & PAGE_CACHE_MASK, 1,
+					0, 0, &src_page, NULL);
+			up_read(&current->mm->mmap_sem);
+			if (status != 1) {
+				page_cache_release(page);
+				break;
+			}
+
+			lock_page(page);
+			if (!page->mapping) {
+				unlock_page(page);
+				page_cache_release(page);
+				page_cache_release(src_page);
+				continue;
+			}
+
+		}
+
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
 		if (unlikely(status))
 			goto fs_write_aop_error;
 
-		copied = filemap_copy_from_user(page, offset,
+		if (!src_page) {
+			/*
+			 * Must not enter the pagefault handler here, because
+			 * we hold the page lock, so we might recursively
+			 * deadlock on the same lock, or get an ABBA deadlock
+			 * against a different lock, or against the mmap_sem
+			 * (which nests outside the page lock).  So increment
+			 * preempt count, and use _atomic usercopies.
+			 *
+			 * The page is uptodate so we are OK to encounter a
+			 * short copy: if unmodified parts of the page are
+			 * marked dirty and written out to disk, it doesn't
+			 * really matter.
+			 */
+			pagefault_disable();
+			copied = filemap_copy_from_user_atomic(page, offset,
 					cur_iov, nr_segs, iov_offset, bytes);
+			pagefault_enable();
+		} else {
+			char *src, *dst;
+			src = kmap(src_page);
+			dst = kmap(page);
+			memcpy(dst + offset,
+				src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
+				bytes);
+			kunmap(page);
+			kunmap(src_page);
+			copied = bytes;
+		}
 		flush_dcache_page(page);
 
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (unlikely(status < 0))
 			goto fs_write_aop_error;
-		if (unlikely(copied != bytes)) {
-			status = -EFAULT;
-			goto fs_write_aop_error;
-		}
 		if (unlikely(status > 0)) /* filesystem did partial write */
-			copied = status;
+			copied = min_t(size_t, copied, status);
+
+		unlock_page(page);
+		mark_page_accessed(page);
+		page_cache_release(page);
+		if (src_page)
+			page_cache_release(src_page);
 
 		written += copied;
 		count -= copied;
 		pos += copied;
 		filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, copied);
 
-		unlock_page(page);
-		mark_page_accessed(page);
-		page_cache_release(page);
 		balance_dirty_pages_ratelimited(mapping);
 		cond_resched();
 		continue;
@@ -1912,6 +1985,8 @@ fs_write_aop_error:
 		if (status != AOP_TRUNCATED_PAGE)
 			unlock_page(page);
 		page_cache_release(page);
+		if (src_page)
+			page_cache_release(src_page);
 
 		/*
 		 * prepare_write() may have instantiated a few blocks
@@ -1924,7 +1999,6 @@ fs_write_aop_error:
 			continue;
 		else
 			break;
-
 	} while (count);
 	*ppos = pos;
 
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -198,6 +198,9 @@ static inline int fault_in_pages_writeab
 {
 	int ret;
 
+	if (unlikely(size == 0))
+		return 0;
+
 	/*
 	 * Writing zeroes into userspace here is OK, because we know that if
 	 * the zero gets there, we'll be overwriting it.
@@ -217,19 +220,23 @@ static inline int fault_in_pages_writeab
 	return ret;
 }
 
-static inline void fault_in_pages_readable(const char __user *uaddr, int size)
+static inline int fault_in_pages_readable(const char __user *uaddr, int size)
 {
 	volatile char c;
 	int ret;
 
+	if (unlikely(size == 0))
+		return 0;
+
 	ret = __get_user(c, uaddr);
 	if (ret == 0) {
 		const char __user *end = uaddr + size - 1;
 
 		if (((unsigned long)uaddr & PAGE_MASK) !=
 				((unsigned long)end & PAGE_MASK))
-		 	__get_user(c, end);
+		 	ret = __get_user(c, end);
 	}
+	return ret;
 }
 
 #endif /* _LINUX_PAGEMAP_H */

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2007-02-06 13:13 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-02-04  8:49 [patch 0/9] buffered write deadlock fix Nick Piggin
2007-02-04  8:49 ` [patch 1/9] fs: libfs buffered write leak fix Nick Piggin
2007-02-04  8:50 ` [patch 2/9] mm: revert "generic_file_buffered_write(): handle zero length iovec segments" Nick Piggin
2007-02-04  8:50 ` [patch 3/9] mm: revert "generic_file_buffered_write(): deadlock on vectored write" Nick Piggin
2007-02-04  8:50 ` [patch 4/9] mm: generic_file_buffered_write cleanup Nick Piggin
2007-02-04  8:50 ` [patch 5/9] mm: debug write deadlocks Nick Piggin
2007-02-04  8:50 ` [patch 6/9] mm: be sure to trim blocks Nick Piggin
2007-02-04  8:50 ` [patch 7/9] mm: cleanup pagecache insertion operations Nick Piggin
2007-02-04  8:50 ` [patch 8/9] mm: generic_file_buffered_write iovec cleanup Nick Piggin
2007-02-04  8:51 ` [patch 9/9] mm: fix pagecache write deadlocks Nick Piggin
2007-02-04  9:44   ` Andrew Morton
2007-02-04 10:15     ` Nick Piggin
2007-02-04 10:26       ` Christoph Hellwig
2007-02-04 10:30       ` Andrew Morton
2007-02-04 10:46         ` Nick Piggin
2007-02-04 10:50           ` Nick Piggin
2007-02-04 10:56           ` Andrew Morton
2007-02-04 11:03             ` Nick Piggin
2007-02-04 11:15               ` Andrew Morton
2007-02-04 15:10                 ` Nick Piggin
2007-02-04 18:36                   ` Andrew Morton
2007-02-06  2:25                     ` Nick Piggin
2007-02-06  4:41                       ` Nick Piggin
2007-02-06  5:30                         ` Andrew Morton
2007-02-06  5:49                           ` Nick Piggin
2007-02-06  5:53                             ` Nick Piggin
2007-02-04 10:59     ` Anton Altaparmakov
2007-02-04 11:10       ` Andrew Morton
2007-02-04 11:22         ` Nick Piggin
2007-02-04 17:40         ` Anton Altaparmakov
2007-02-06  2:09           ` Nick Piggin
2007-02-06 13:13             ` Anton Altaparmakov
  -- strict thread matches above, loose matches on Subject: below --
2007-01-29 10:31 [patch 0/9] buffered write deadlock fix Nick Piggin
2007-01-29 10:33 ` [patch 9/9] mm: fix pagecache write deadlocks Nick Piggin
2007-01-29 11:11   ` Nick Piggin
2007-02-02 23:53   ` Andrew Morton
2007-02-03  1:38     ` Nick Piggin

This is a public inbox; see mirroring instructions for how to clone and
mirror all data and code used for this inbox, as well as URLs for NNTP
newsgroup(s).