All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch 0/9] buffered write deadlock fix
@ 2007-01-29 10:31 ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-01-29 10:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

The following set of patches attempt to fix the buffered write
locking problems (and there are a couple of peripheral patches
and cleanups there too).

Patches against 2.6.20-rc6. I was hoping that 2.6.20-rc6-mm2 would
be an easier diff with the fsaio patches gone, but the readahead
rewrite clashes badly :(

Please apply?

Thanks,
Nick

--
SuSE Labs


^ permalink raw reply	[flat|nested] 105+ messages in thread

* [patch 0/9] buffered write deadlock fix
@ 2007-01-29 10:31 ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-01-29 10:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

The following set of patches attempt to fix the buffered write
locking problems (and there are a couple of peripheral patches
and cleanups there too).

Patches against 2.6.20-rc6. I was hoping that 2.6.20-rc6-mm2 would
be an easier diff with the fsaio patches gone, but the readahead
rewrite clashes badly :(

Please apply?

Thanks,
Nick

--
SuSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [patch 1/9] fs: libfs buffered write leak fix
  2007-01-29 10:31 ` Nick Piggin
@ 2007-01-29 10:31   ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-01-29 10:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

simple_prepare_write and nobh_prepare_write leak uninitialised kernel data.
Fix the former, make a note of the latter. Several other filesystems seem
to be iffy here, too.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -327,32 +327,35 @@ int simple_readpage(struct file *file, s
 int simple_prepare_write(struct file *file, struct page *page,
 			unsigned from, unsigned to)
 {
-	if (!PageUptodate(page)) {
-		if (to - from != PAGE_CACHE_SIZE) {
-			void *kaddr = kmap_atomic(page, KM_USER0);
-			memset(kaddr, 0, from);
-			memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
-			flush_dcache_page(page);
-			kunmap_atomic(kaddr, KM_USER0);
-		}
+	if (PageUptodate(page))
+		return 0;
+
+	if (to - from != PAGE_CACHE_SIZE) {
+		clear_highpage(page);
+		flush_dcache_page(page);
 		SetPageUptodate(page);
 	}
+
 	return 0;
 }
 
 int simple_commit_write(struct file *file, struct page *page,
-			unsigned offset, unsigned to)
+			unsigned from, unsigned to)
 {
-	struct inode *inode = page->mapping->host;
-	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
-
-	/*
-	 * No need to use i_size_read() here, the i_size
-	 * cannot change under us because we hold the i_mutex.
-	 */
-	if (pos > inode->i_size)
-		i_size_write(inode, pos);
-	set_page_dirty(page);
+	if (to > from) {
+		struct inode *inode = page->mapping->host;
+		loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+
+		if (to - from == PAGE_CACHE_SIZE)
+			SetPageUptodate(page);
+		/*
+		 * No need to use i_size_read() here, the i_size
+		 * cannot change under us because we hold the i_mutex.
+		 */
+		if (pos > inode->i_size)
+			i_size_write(inode, pos);
+		set_page_dirty(page);
+	}
 	return 0;
 }
 
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2344,6 +2344,8 @@ int nobh_prepare_write(struct page *page
 
 	if (is_mapped_to_disk)
 		SetPageMappedToDisk(page);
+
+	/* XXX: information leak vs read(2) */
 	SetPageUptodate(page);
 
 	/*

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [patch 1/9] fs: libfs buffered write leak fix
@ 2007-01-29 10:31   ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-01-29 10:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

simple_prepare_write and nobh_prepare_write leak uninitialised kernel data.
Fix the former, make a note of the latter. Several other filesystems seem
to be iffy here, too.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -327,32 +327,35 @@ int simple_readpage(struct file *file, s
 int simple_prepare_write(struct file *file, struct page *page,
 			unsigned from, unsigned to)
 {
-	if (!PageUptodate(page)) {
-		if (to - from != PAGE_CACHE_SIZE) {
-			void *kaddr = kmap_atomic(page, KM_USER0);
-			memset(kaddr, 0, from);
-			memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
-			flush_dcache_page(page);
-			kunmap_atomic(kaddr, KM_USER0);
-		}
+	if (PageUptodate(page))
+		return 0;
+
+	if (to - from != PAGE_CACHE_SIZE) {
+		clear_highpage(page);
+		flush_dcache_page(page);
 		SetPageUptodate(page);
 	}
+
 	return 0;
 }
 
 int simple_commit_write(struct file *file, struct page *page,
-			unsigned offset, unsigned to)
+			unsigned from, unsigned to)
 {
-	struct inode *inode = page->mapping->host;
-	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
-
-	/*
-	 * No need to use i_size_read() here, the i_size
-	 * cannot change under us because we hold the i_mutex.
-	 */
-	if (pos > inode->i_size)
-		i_size_write(inode, pos);
-	set_page_dirty(page);
+	if (to > from) {
+		struct inode *inode = page->mapping->host;
+		loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+
+		if (to - from == PAGE_CACHE_SIZE)
+			SetPageUptodate(page);
+		/*
+		 * No need to use i_size_read() here, the i_size
+		 * cannot change under us because we hold the i_mutex.
+		 */
+		if (pos > inode->i_size)
+			i_size_write(inode, pos);
+		set_page_dirty(page);
+	}
 	return 0;
 }
 
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2344,6 +2344,8 @@ int nobh_prepare_write(struct page *page
 
 	if (is_mapped_to_disk)
 		SetPageMappedToDisk(page);
+
+	/* XXX: information leak vs read(2) */
 	SetPageUptodate(page);
 
 	/*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [patch 2/9] mm: revert "generic_file_buffered_write(): handle zero length iovec segments"
  2007-01-29 10:31 ` Nick Piggin
@ 2007-01-29 10:31   ` Nick Piggin, Andrew Morton
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-01-29 10:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

From: Andrew Morton <akpm@osdl.org>

Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6.

This was a bugfix against 6527c2bdf1f833cc18e8f42bd97973d583e4aa83, which we
also revert.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1911,12 +1911,6 @@ generic_file_buffered_write(struct kiocb
 			break;
 		}
 
-		if (unlikely(bytes == 0)) {
-			status = 0;
-			copied = 0;
-			goto zero_length_segment;
-		}
-
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
 		if (unlikely(status)) {
 			loff_t isize = i_size_read(inode);
@@ -1946,8 +1940,7 @@ generic_file_buffered_write(struct kiocb
 			page_cache_release(page);
 			continue;
 		}
-zero_length_segment:
-		if (likely(copied >= 0)) {
+		if (likely(copied > 0)) {
 			if (!status)
 				status = copied;
 
Index: linux-2.6/mm/filemap.h
===================================================================
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -87,7 +87,7 @@ filemap_set_next_iovec(const struct iove
 	const struct iovec *iov = *iovp;
 	size_t base = *basep;
 
-	do {
+	while (bytes) {
 		int copy = min(bytes, iov->iov_len - base);
 
 		bytes -= copy;
@@ -96,7 +96,7 @@ filemap_set_next_iovec(const struct iove
 			iov++;
 			base = 0;
 		}
-	} while (bytes);
+	}
 	*iovp = iov;
 	*basep = base;
 }

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [patch 2/9] mm: revert "generic_file_buffered_write(): handle zero length iovec segments"
@ 2007-01-29 10:31   ` Nick Piggin, Andrew Morton
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin, Andrew Morton @ 2007-01-29 10:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6.

This was a bugfix against 6527c2bdf1f833cc18e8f42bd97973d583e4aa83, which we
also revert.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1911,12 +1911,6 @@ generic_file_buffered_write(struct kiocb
 			break;
 		}
 
-		if (unlikely(bytes == 0)) {
-			status = 0;
-			copied = 0;
-			goto zero_length_segment;
-		}
-
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
 		if (unlikely(status)) {
 			loff_t isize = i_size_read(inode);
@@ -1946,8 +1940,7 @@ generic_file_buffered_write(struct kiocb
 			page_cache_release(page);
 			continue;
 		}
-zero_length_segment:
-		if (likely(copied >= 0)) {
+		if (likely(copied > 0)) {
 			if (!status)
 				status = copied;
 
Index: linux-2.6/mm/filemap.h
===================================================================
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -87,7 +87,7 @@ filemap_set_next_iovec(const struct iove
 	const struct iovec *iov = *iovp;
 	size_t base = *basep;
 
-	do {
+	while (bytes) {
 		int copy = min(bytes, iov->iov_len - base);
 
 		bytes -= copy;
@@ -96,7 +96,7 @@ filemap_set_next_iovec(const struct iove
 			iov++;
 			base = 0;
 		}
-	} while (bytes);
+	}
 	*iovp = iov;
 	*basep = base;
 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [patch 3/9] mm: revert "generic_file_buffered_write(): deadlock on vectored write"
  2007-01-29 10:31 ` Nick Piggin
@ 2007-01-29 10:32   ` Nick Piggin, Andrew Morton
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-01-29 10:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

From: Andrew Morton <akpm@osdl.org>

Revert 6527c2bdf1f833cc18e8f42bd97973d583e4aa83

This patch fixed the following bug:

  When prefaulting in the pages in generic_file_buffered_write(), we only
  faulted in the pages for the firts segment of the iovec.  If the second of
  successive segment described a mmapping of the page into which we're
  write()ing, and that page is not up-to-date, the fault handler tries to lock
  the already-locked page (to bring it up to date) and deadlocks.

  An exploit for this bug is in writev-deadlock-demo.c, in
  http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.

  (These demos assume blocksize < PAGE_CACHE_SIZE).

The problem with this fix is that it takes the kernel back to doing a single
prepare_write()/commit_write() per iovec segment.  So in the worst case we'll
run prepare_write+commit_write 1024 times where we previously would have run
it once. The other problem with the fix is that it fix all the locking problems.


<insert numbers obtained via ext3-tools's writev-speed.c here>

And apparently this change killed NFS overwrite performance, because, I
suppose, it talks to the server for each prepare_write+commit_write.

So just back that patch out - we'll be fixing the deadlock by other means.

Signed-off-by: Andrew Morton <akpm@osdl.org>

Nick says: also it only ever actually papered over the bug, because after
faulting in the pages, they might be unmapped or reclaimed.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1881,21 +1881,14 @@ generic_file_buffered_write(struct kiocb
 	do {
 		unsigned long index;
 		unsigned long offset;
+		unsigned long maxlen;
 		size_t copied;
 
 		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
 		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = PAGE_CACHE_SIZE - offset;
-
-		/* Limit the size of the copy to the caller's write size */
-		bytes = min(bytes, count);
-
-		/*
-		 * Limit the size of the copy to that of the current segment,
-		 * because fault_in_pages_readable() doesn't know how to walk
-		 * segments.
-		 */
-		bytes = min(bytes, cur_iov->iov_len - iov_base);
+		if (bytes > count)
+			bytes = count;
 
 		/*
 		 * Bring in the user page that we will copy from _first_.
@@ -1903,7 +1896,10 @@ generic_file_buffered_write(struct kiocb
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
 		 */
-		fault_in_pages_readable(buf, bytes);
+		maxlen = cur_iov->iov_len - iov_base;
+		if (maxlen > bytes)
+			maxlen = bytes;
+		fault_in_pages_readable(buf, maxlen);
 
 		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
 		if (!page) {

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [patch 3/9] mm: revert "generic_file_buffered_write(): deadlock on vectored write"
@ 2007-01-29 10:32   ` Nick Piggin, Andrew Morton
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin, Andrew Morton @ 2007-01-29 10:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

Revert 6527c2bdf1f833cc18e8f42bd97973d583e4aa83

This patch fixed the following bug:

  When prefaulting in the pages in generic_file_buffered_write(), we only
  faulted in the pages for the firts segment of the iovec.  If the second of
  successive segment described a mmapping of the page into which we're
  write()ing, and that page is not up-to-date, the fault handler tries to lock
  the already-locked page (to bring it up to date) and deadlocks.

  An exploit for this bug is in writev-deadlock-demo.c, in
  http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.

  (These demos assume blocksize < PAGE_CACHE_SIZE).

The problem with this fix is that it takes the kernel back to doing a single
prepare_write()/commit_write() per iovec segment.  So in the worst case we'll
run prepare_write+commit_write 1024 times where we previously would have run
it once. The other problem with the fix is that it fix all the locking problems.


<insert numbers obtained via ext3-tools's writev-speed.c here>

And apparently this change killed NFS overwrite performance, because, I
suppose, it talks to the server for each prepare_write+commit_write.

So just back that patch out - we'll be fixing the deadlock by other means.

Signed-off-by: Andrew Morton <akpm@osdl.org>

Nick says: also it only ever actually papered over the bug, because after
faulting in the pages, they might be unmapped or reclaimed.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1881,21 +1881,14 @@ generic_file_buffered_write(struct kiocb
 	do {
 		unsigned long index;
 		unsigned long offset;
+		unsigned long maxlen;
 		size_t copied;
 
 		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
 		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = PAGE_CACHE_SIZE - offset;
-
-		/* Limit the size of the copy to the caller's write size */
-		bytes = min(bytes, count);
-
-		/*
-		 * Limit the size of the copy to that of the current segment,
-		 * because fault_in_pages_readable() doesn't know how to walk
-		 * segments.
-		 */
-		bytes = min(bytes, cur_iov->iov_len - iov_base);
+		if (bytes > count)
+			bytes = count;
 
 		/*
 		 * Bring in the user page that we will copy from _first_.
@@ -1903,7 +1896,10 @@ generic_file_buffered_write(struct kiocb
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
 		 */
-		fault_in_pages_readable(buf, bytes);
+		maxlen = cur_iov->iov_len - iov_base;
+		if (maxlen > bytes)
+			maxlen = bytes;
+		fault_in_pages_readable(buf, maxlen);
 
 		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
 		if (!page) {

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [patch 4/9] mm: generic_file_buffered_write cleanup
  2007-01-29 10:31 ` Nick Piggin
@ 2007-01-29 10:32   ` Nick Piggin, Andrew Morton
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-01-29 10:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

From: Andrew Morton <akpm@osdl.org>

Clean up buffered write code. Rename some variables and fix some types.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1854,16 +1854,15 @@ generic_file_buffered_write(struct kiocb
 		size_t count, ssize_t written)
 {
 	struct file *file = iocb->ki_filp;
-	struct address_space * mapping = file->f_mapping;
+	struct address_space *mapping = file->f_mapping;
 	const struct address_space_operations *a_ops = mapping->a_ops;
 	struct inode 	*inode = mapping->host;
 	long		status = 0;
 	struct page	*page;
 	struct page	*cached_page = NULL;
-	size_t		bytes;
 	struct pagevec	lru_pvec;
 	const struct iovec *cur_iov = iov; /* current iovec */
-	size_t		iov_base = 0;	   /* offset in the current iovec */
+	size_t		iov_offset = 0;	   /* offset in the current iovec */
 	char __user	*buf;
 
 	pagevec_init(&lru_pvec, 0);
@@ -1874,31 +1873,33 @@ generic_file_buffered_write(struct kiocb
 	if (likely(nr_segs == 1))
 		buf = iov->iov_base + written;
 	else {
-		filemap_set_next_iovec(&cur_iov, &iov_base, written);
-		buf = cur_iov->iov_base + iov_base;
+		filemap_set_next_iovec(&cur_iov, &iov_offset, written);
+		buf = cur_iov->iov_base + iov_offset;
 	}
 
 	do {
-		unsigned long index;
-		unsigned long offset;
-		unsigned long maxlen;
-		size_t copied;
+		pgoff_t index;		/* Pagecache index for current page */
+		unsigned long offset;	/* Offset into pagecache page */
+		unsigned long maxlen;	/* Bytes remaining in current iovec */
+		size_t bytes;		/* Bytes to write to page */
+		size_t copied;		/* Bytes copied from user */
 
-		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
+		offset = (pos & (PAGE_CACHE_SIZE - 1));
 		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = PAGE_CACHE_SIZE - offset;
 		if (bytes > count)
 			bytes = count;
 
+		maxlen = cur_iov->iov_len - iov_offset;
+		if (maxlen > bytes)
+			maxlen = bytes;
+
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
 		 */
-		maxlen = cur_iov->iov_len - iov_base;
-		if (maxlen > bytes)
-			maxlen = bytes;
 		fault_in_pages_readable(buf, maxlen);
 
 		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
@@ -1929,7 +1930,7 @@ generic_file_buffered_write(struct kiocb
 							buf, bytes);
 		else
 			copied = filemap_copy_from_user_iovec(page, offset,
-						cur_iov, iov_base, bytes);
+						cur_iov, iov_offset, bytes);
 		flush_dcache_page(page);
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (status == AOP_TRUNCATED_PAGE) {
@@ -1947,12 +1948,12 @@ generic_file_buffered_write(struct kiocb
 				buf += status;
 				if (unlikely(nr_segs > 1)) {
 					filemap_set_next_iovec(&cur_iov,
-							&iov_base, status);
+							&iov_offset, status);
 					if (count)
 						buf = cur_iov->iov_base +
-							iov_base;
+							iov_offset;
 				} else {
-					iov_base += status;
+					iov_offset += status;
 				}
 			}
 		}

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [patch 4/9] mm: generic_file_buffered_write cleanup
@ 2007-01-29 10:32   ` Nick Piggin, Andrew Morton
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin, Andrew Morton @ 2007-01-29 10:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

Clean up buffered write code. Rename some variables and fix some types.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1854,16 +1854,15 @@ generic_file_buffered_write(struct kiocb
 		size_t count, ssize_t written)
 {
 	struct file *file = iocb->ki_filp;
-	struct address_space * mapping = file->f_mapping;
+	struct address_space *mapping = file->f_mapping;
 	const struct address_space_operations *a_ops = mapping->a_ops;
 	struct inode 	*inode = mapping->host;
 	long		status = 0;
 	struct page	*page;
 	struct page	*cached_page = NULL;
-	size_t		bytes;
 	struct pagevec	lru_pvec;
 	const struct iovec *cur_iov = iov; /* current iovec */
-	size_t		iov_base = 0;	   /* offset in the current iovec */
+	size_t		iov_offset = 0;	   /* offset in the current iovec */
 	char __user	*buf;
 
 	pagevec_init(&lru_pvec, 0);
@@ -1874,31 +1873,33 @@ generic_file_buffered_write(struct kiocb
 	if (likely(nr_segs == 1))
 		buf = iov->iov_base + written;
 	else {
-		filemap_set_next_iovec(&cur_iov, &iov_base, written);
-		buf = cur_iov->iov_base + iov_base;
+		filemap_set_next_iovec(&cur_iov, &iov_offset, written);
+		buf = cur_iov->iov_base + iov_offset;
 	}
 
 	do {
-		unsigned long index;
-		unsigned long offset;
-		unsigned long maxlen;
-		size_t copied;
+		pgoff_t index;		/* Pagecache index for current page */
+		unsigned long offset;	/* Offset into pagecache page */
+		unsigned long maxlen;	/* Bytes remaining in current iovec */
+		size_t bytes;		/* Bytes to write to page */
+		size_t copied;		/* Bytes copied from user */
 
-		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
+		offset = (pos & (PAGE_CACHE_SIZE - 1));
 		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = PAGE_CACHE_SIZE - offset;
 		if (bytes > count)
 			bytes = count;
 
+		maxlen = cur_iov->iov_len - iov_offset;
+		if (maxlen > bytes)
+			maxlen = bytes;
+
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
 		 */
-		maxlen = cur_iov->iov_len - iov_base;
-		if (maxlen > bytes)
-			maxlen = bytes;
 		fault_in_pages_readable(buf, maxlen);
 
 		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
@@ -1929,7 +1930,7 @@ generic_file_buffered_write(struct kiocb
 							buf, bytes);
 		else
 			copied = filemap_copy_from_user_iovec(page, offset,
-						cur_iov, iov_base, bytes);
+						cur_iov, iov_offset, bytes);
 		flush_dcache_page(page);
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (status == AOP_TRUNCATED_PAGE) {
@@ -1947,12 +1948,12 @@ generic_file_buffered_write(struct kiocb
 				buf += status;
 				if (unlikely(nr_segs > 1)) {
 					filemap_set_next_iovec(&cur_iov,
-							&iov_base, status);
+							&iov_offset, status);
 					if (count)
 						buf = cur_iov->iov_base +
-							iov_base;
+							iov_offset;
 				} else {
-					iov_base += status;
+					iov_offset += status;
 				}
 			}
 		}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [patch 5/9] mm: debug write deadlocks
  2007-01-29 10:31 ` Nick Piggin
@ 2007-01-29 10:32   ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-01-29 10:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

Allow CONFIG_DEBUG_VM to switch off the prefaulting logic, to simulate the
difficult race where the page may be unmapped before calling copy_from_user.
Makes the race much easier to hit.

This is useful for demonstration and testing purposes, but is removed in a
subsequent patch.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1894,6 +1894,7 @@ generic_file_buffered_write(struct kiocb
 		if (maxlen > bytes)
 			maxlen = bytes;
 
+#ifndef CONFIG_DEBUG_VM
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
@@ -1901,6 +1902,7 @@ generic_file_buffered_write(struct kiocb
 		 * up-to-date.
 		 */
 		fault_in_pages_readable(buf, maxlen);
+#endif
 
 		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
 		if (!page) {

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [patch 5/9] mm: debug write deadlocks
@ 2007-01-29 10:32   ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-01-29 10:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

Allow CONFIG_DEBUG_VM to switch off the prefaulting logic, to simulate the
difficult race where the page may be unmapped before calling copy_from_user.
Makes the race much easier to hit.

This is useful for demonstration and testing purposes, but is removed in a
subsequent patch.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1894,6 +1894,7 @@ generic_file_buffered_write(struct kiocb
 		if (maxlen > bytes)
 			maxlen = bytes;
 
+#ifndef CONFIG_DEBUG_VM
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
@@ -1901,6 +1902,7 @@ generic_file_buffered_write(struct kiocb
 		 * up-to-date.
 		 */
 		fault_in_pages_readable(buf, maxlen);
+#endif
 
 		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
 		if (!page) {

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [patch 6/9] mm: be sure to trim blocks
  2007-01-29 10:31 ` Nick Piggin
@ 2007-01-29 10:32   ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-01-29 10:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then
we may have failed the write operation despite prepare_write having
instantiated blocks past i_size. Fix this, and consolidate the trimming into
one place.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1911,22 +1911,9 @@ generic_file_buffered_write(struct kiocb
 		}
 
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
-		if (unlikely(status)) {
-			loff_t isize = i_size_read(inode);
+		if (unlikely(status))
+			goto fs_write_aop_error;
 
-			if (status != AOP_TRUNCATED_PAGE)
-				unlock_page(page);
-			page_cache_release(page);
-			if (status == AOP_TRUNCATED_PAGE)
-				continue;
-			/*
-			 * prepare_write() may have instantiated a few blocks
-			 * outside i_size.  Trim these off again.
-			 */
-			if (pos + bytes > isize)
-				vmtruncate(inode, isize);
-			break;
-		}
 		if (likely(nr_segs == 1))
 			copied = filemap_copy_from_user(page, offset,
 							buf, bytes);
@@ -1935,40 +1922,53 @@ generic_file_buffered_write(struct kiocb
 						cur_iov, iov_offset, bytes);
 		flush_dcache_page(page);
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
-		if (status == AOP_TRUNCATED_PAGE) {
-			page_cache_release(page);
-			continue;
+		if (unlikely(status < 0))
+			goto fs_write_aop_error;
+		if (unlikely(copied != bytes)) {
+			status = -EFAULT;
+			goto fs_write_aop_error;
 		}
-		if (likely(copied > 0)) {
-			if (!status)
-				status = copied;
+		if (unlikely(status > 0)) /* filesystem did partial write */
+			copied = status;
 
-			if (status >= 0) {
-				written += status;
-				count -= status;
-				pos += status;
-				buf += status;
-				if (unlikely(nr_segs > 1)) {
-					filemap_set_next_iovec(&cur_iov,
-							&iov_offset, status);
-					if (count)
-						buf = cur_iov->iov_base +
-							iov_offset;
-				} else {
-					iov_offset += status;
-				}
+		if (likely(copied > 0)) {
+			written += copied;
+			count -= copied;
+			pos += copied;
+			buf += copied;
+			if (unlikely(nr_segs > 1)) {
+				filemap_set_next_iovec(&cur_iov,
+						&iov_offset, copied);
+				if (count)
+					buf = cur_iov->iov_base + iov_offset;
+			} else {
+				iov_offset += copied;
 			}
 		}
-		if (unlikely(copied != bytes))
-			if (status >= 0)
-				status = -EFAULT;
 		unlock_page(page);
 		mark_page_accessed(page);
 		page_cache_release(page);
-		if (status < 0)
-			break;
 		balance_dirty_pages_ratelimited(mapping);
 		cond_resched();
+		continue;
+
+fs_write_aop_error:
+		if (status != AOP_TRUNCATED_PAGE)
+			unlock_page(page);
+		page_cache_release(page);
+
+		/*
+		 * prepare_write() may have instantiated a few blocks
+		 * outside i_size.  Trim these off again. Don't need
+		 * i_size_read because we hold i_mutex.
+		 */
+		if (pos + bytes > inode->i_size)
+			vmtruncate(inode, inode->i_size);
+		if (status == AOP_TRUNCATED_PAGE)
+			continue;
+		else
+			break;
+
 	} while (count);
 	*ppos = pos;
 

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [patch 6/9] mm: be sure to trim blocks
@ 2007-01-29 10:32   ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-01-29 10:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then
we may have failed the write operation despite prepare_write having
instantiated blocks past i_size. Fix this, and consolidate the trimming into
one place.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1911,22 +1911,9 @@ generic_file_buffered_write(struct kiocb
 		}
 
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
-		if (unlikely(status)) {
-			loff_t isize = i_size_read(inode);
+		if (unlikely(status))
+			goto fs_write_aop_error;
 
-			if (status != AOP_TRUNCATED_PAGE)
-				unlock_page(page);
-			page_cache_release(page);
-			if (status == AOP_TRUNCATED_PAGE)
-				continue;
-			/*
-			 * prepare_write() may have instantiated a few blocks
-			 * outside i_size.  Trim these off again.
-			 */
-			if (pos + bytes > isize)
-				vmtruncate(inode, isize);
-			break;
-		}
 		if (likely(nr_segs == 1))
 			copied = filemap_copy_from_user(page, offset,
 							buf, bytes);
@@ -1935,40 +1922,53 @@ generic_file_buffered_write(struct kiocb
 						cur_iov, iov_offset, bytes);
 		flush_dcache_page(page);
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
-		if (status == AOP_TRUNCATED_PAGE) {
-			page_cache_release(page);
-			continue;
+		if (unlikely(status < 0))
+			goto fs_write_aop_error;
+		if (unlikely(copied != bytes)) {
+			status = -EFAULT;
+			goto fs_write_aop_error;
 		}
-		if (likely(copied > 0)) {
-			if (!status)
-				status = copied;
+		if (unlikely(status > 0)) /* filesystem did partial write */
+			copied = status;
 
-			if (status >= 0) {
-				written += status;
-				count -= status;
-				pos += status;
-				buf += status;
-				if (unlikely(nr_segs > 1)) {
-					filemap_set_next_iovec(&cur_iov,
-							&iov_offset, status);
-					if (count)
-						buf = cur_iov->iov_base +
-							iov_offset;
-				} else {
-					iov_offset += status;
-				}
+		if (likely(copied > 0)) {
+			written += copied;
+			count -= copied;
+			pos += copied;
+			buf += copied;
+			if (unlikely(nr_segs > 1)) {
+				filemap_set_next_iovec(&cur_iov,
+						&iov_offset, copied);
+				if (count)
+					buf = cur_iov->iov_base + iov_offset;
+			} else {
+				iov_offset += copied;
 			}
 		}
-		if (unlikely(copied != bytes))
-			if (status >= 0)
-				status = -EFAULT;
 		unlock_page(page);
 		mark_page_accessed(page);
 		page_cache_release(page);
-		if (status < 0)
-			break;
 		balance_dirty_pages_ratelimited(mapping);
 		cond_resched();
+		continue;
+
+fs_write_aop_error:
+		if (status != AOP_TRUNCATED_PAGE)
+			unlock_page(page);
+		page_cache_release(page);
+
+		/*
+		 * prepare_write() may have instantiated a few blocks
+		 * outside i_size.  Trim these off again. Don't need
+		 * i_size_read because we hold i_mutex.
+		 */
+		if (pos + bytes > inode->i_size)
+			vmtruncate(inode, inode->i_size);
+		if (status == AOP_TRUNCATED_PAGE)
+			continue;
+		else
+			break;
+
 	} while (count);
 	*ppos = pos;
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [patch 7/9] mm: cleanup pagecache insertion operations
  2007-01-29 10:31 ` Nick Piggin
@ 2007-01-29 10:32   ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-01-29 10:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

Quite a bit of code is used in maintaining these "cached pages" that are
probably pretty unlikely to get used. It would require a narrow race where
the page is inserted concurrently while this process is allocating a page
in order to create the spare page. Then a multi-page write into an uncached
part of the file, to make use of it.

Next, the buffered write path (and others) uses its own LRU pagevec when it
should be just using the per-CPU LRU pagevec (which will cut down on both data
and code size cacheline footprint). Also, these private LRU pagevecs are
emptied after just a very short time, in contrast with the per-CPU pagevecs
that are persistent. Net result: 7.3 times fewer lru_lock acquisitions required
to add the pages to pagecache for a bulk write (in 4K chunks).

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -686,26 +686,22 @@ EXPORT_SYMBOL(find_lock_page);
 struct page *find_or_create_page(struct address_space *mapping,
 		unsigned long index, gfp_t gfp_mask)
 {
-	struct page *page, *cached_page = NULL;
+	struct page *page;
 	int err;
 repeat:
 	page = find_lock_page(mapping, index);
 	if (!page) {
-		if (!cached_page) {
-			cached_page = alloc_page(gfp_mask);
-			if (!cached_page)
-				return NULL;
-		}
-		err = add_to_page_cache_lru(cached_page, mapping,
-					index, gfp_mask);
-		if (!err) {
-			page = cached_page;
-			cached_page = NULL;
-		} else if (err == -EEXIST)
-			goto repeat;
+		page = alloc_page(gfp_mask);
+		if (!page)
+			return NULL;
+		err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
+		if (unlikely(err)) {
+			page_cache_release(page);
+			page = NULL;
+			if (err == -EEXIST)
+				goto repeat;
+		}
 	}
-	if (cached_page)
-		page_cache_release(cached_page);
 	return page;
 }
 EXPORT_SYMBOL(find_or_create_page);
@@ -891,11 +887,9 @@ void do_generic_mapping_read(struct addr
 	unsigned long next_index;
 	unsigned long prev_index;
 	loff_t isize;
-	struct page *cached_page;
 	int error;
 	struct file_ra_state ra = *_ra;
 
-	cached_page = NULL;
 	index = *ppos >> PAGE_CACHE_SHIFT;
 	next_index = index;
 	prev_index = ra.prev_page;
@@ -1059,23 +1053,20 @@ no_cached_page:
 		 * Ok, it wasn't cached, so we need to create a new
 		 * page..
 		 */
-		if (!cached_page) {
-			cached_page = page_cache_alloc_cold(mapping);
-			if (!cached_page) {
-				desc->error = -ENOMEM;
-				goto out;
-			}
+		page = page_cache_alloc_cold(mapping);
+		if (!page) {
+			desc->error = -ENOMEM;
+			goto out;
 		}
-		error = add_to_page_cache_lru(cached_page, mapping,
+		error = add_to_page_cache_lru(page, mapping,
 						index, GFP_KERNEL);
 		if (error) {
+			page_cache_release(page);
 			if (error == -EEXIST)
 				goto find_page;
 			desc->error = error;
 			goto out;
 		}
-		page = cached_page;
-		cached_page = NULL;
 		goto readpage;
 	}
 
@@ -1083,8 +1074,6 @@ out:
 	*_ra = ra;
 
 	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
-	if (cached_page)
-		page_cache_release(cached_page);
 	if (filp)
 		file_accessed(filp);
 }
@@ -1542,35 +1531,28 @@ static inline struct page *__read_cache_
 				int (*filler)(void *,struct page*),
 				void *data)
 {
-	struct page *page, *cached_page = NULL;
+	struct page *page;
 	int err;
 repeat:
 	page = find_get_page(mapping, index);
 	if (!page) {
-		if (!cached_page) {
-			cached_page = page_cache_alloc_cold(mapping);
-			if (!cached_page)
-				return ERR_PTR(-ENOMEM);
-		}
-		err = add_to_page_cache_lru(cached_page, mapping,
-					index, GFP_KERNEL);
-		if (err == -EEXIST)
-			goto repeat;
-		if (err < 0) {
+		page = page_cache_alloc_cold(mapping);
+		if (!page)
+			return ERR_PTR(-ENOMEM);
+		err = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
+		if (unlikely(err)) {
+			page_cache_release(page);
+			if (err == -EEXIST)
+				goto repeat;
 			/* Presumably ENOMEM for radix tree node */
-			page_cache_release(cached_page);
 			return ERR_PTR(err);
 		}
-		page = cached_page;
-		cached_page = NULL;
 		err = filler(data, page);
 		if (err < 0) {
 			page_cache_release(page);
 			page = ERR_PTR(err);
 		}
 	}
-	if (cached_page)
-		page_cache_release(cached_page);
 	return page;
 }
 
@@ -1621,40 +1603,6 @@ retry:
 EXPORT_SYMBOL(read_cache_page);
 
 /*
- * If the page was newly created, increment its refcount and add it to the
- * caller's lru-buffering pagevec.  This function is specifically for
- * generic_file_write().
- */
-static inline struct page *
-__grab_cache_page(struct address_space *mapping, unsigned long index,
-			struct page **cached_page, struct pagevec *lru_pvec)
-{
-	int err;
-	struct page *page;
-repeat:
-	page = find_lock_page(mapping, index);
-	if (!page) {
-		if (!*cached_page) {
-			*cached_page = page_cache_alloc(mapping);
-			if (!*cached_page)
-				return NULL;
-		}
-		err = add_to_page_cache(*cached_page, mapping,
-					index, GFP_KERNEL);
-		if (err == -EEXIST)
-			goto repeat;
-		if (err == 0) {
-			page = *cached_page;
-			page_cache_get(page);
-			if (!pagevec_add(lru_pvec, page))
-				__pagevec_lru_add(lru_pvec);
-			*cached_page = NULL;
-		}
-	}
-	return page;
-}
-
-/*
  * The logic we want is
  *
  *	if suid or (sgid and xgrp)
@@ -1848,6 +1796,33 @@ generic_file_direct_write(struct kiocb *
 }
 EXPORT_SYMBOL(generic_file_direct_write);
 
+/*
+ * Find or create a page at the given pagecache position. Return the locked
+ * page. This function is specifically for buffered writes.
+ */
+static struct page *__grab_cache_page(struct address_space *mapping,
+							pgoff_t index)
+{
+	int status;
+	struct page *page;
+repeat:
+	page = find_lock_page(mapping, index);
+	if (likely(page))
+		return page;
+
+	page = page_cache_alloc(mapping);
+	if (!page)
+		return NULL;
+	status = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
+	if (unlikely(status)) {
+		page_cache_release(page);
+		if (status == -EEXIST)
+			goto repeat;
+		return NULL;
+	}
+	return page;
+}
+
 ssize_t
 generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
 		unsigned long nr_segs, loff_t pos, loff_t *ppos,
@@ -1858,15 +1833,10 @@ generic_file_buffered_write(struct kiocb
 	const struct address_space_operations *a_ops = mapping->a_ops;
 	struct inode 	*inode = mapping->host;
 	long		status = 0;
-	struct page	*page;
-	struct page	*cached_page = NULL;
-	struct pagevec	lru_pvec;
 	const struct iovec *cur_iov = iov; /* current iovec */
 	size_t		iov_offset = 0;	   /* offset in the current iovec */
 	char __user	*buf;
 
-	pagevec_init(&lru_pvec, 0);
-
 	/*
 	 * handle partial DIO write.  Adjust cur_iov if needed.
 	 */
@@ -1878,6 +1848,7 @@ generic_file_buffered_write(struct kiocb
 	}
 
 	do {
+		struct page *page;
 		pgoff_t index;		/* Pagecache index for current page */
 		unsigned long offset;	/* Offset into pagecache page */
 		unsigned long maxlen;	/* Bytes remaining in current iovec */
@@ -1904,7 +1875,8 @@ generic_file_buffered_write(struct kiocb
 		fault_in_pages_readable(buf, maxlen);
 #endif
 
-		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
+
+		page = __grab_cache_page(mapping, index);
 		if (!page) {
 			status = -ENOMEM;
 			break;
@@ -1972,9 +1944,6 @@ fs_write_aop_error:
 	} while (count);
 	*ppos = pos;
 
-	if (cached_page)
-		page_cache_release(cached_page);
-
 	/*
 	 * For now, when the user asks for O_SYNC, we'll actually give O_DSYNC
 	 */
@@ -1994,7 +1963,6 @@ fs_write_aop_error:
 	if (unlikely(file->f_flags & O_DIRECT) && written)
 		status = filemap_write_and_wait(mapping);
 
-	pagevec_lru_add(&lru_pvec);
 	return written ? written : status;
 }
 EXPORT_SYMBOL(generic_file_buffered_write);
Index: linux-2.6/fs/mpage.c
===================================================================
--- linux-2.6.orig/fs/mpage.c
+++ linux-2.6/fs/mpage.c
@@ -389,31 +389,25 @@ mpage_readpages(struct address_space *ma
 	struct bio *bio = NULL;
 	unsigned page_idx;
 	sector_t last_block_in_bio = 0;
-	struct pagevec lru_pvec;
 	struct buffer_head map_bh;
 	unsigned long first_logical_block = 0;
 
 	clear_buffer_mapped(&map_bh);
-	pagevec_init(&lru_pvec, 0);
 	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
 		struct page *page = list_entry(pages->prev, struct page, lru);
 
 		prefetchw(&page->flags);
 		list_del(&page->lru);
-		if (!add_to_page_cache(page, mapping,
+		if (!add_to_page_cache_lru(page, mapping,
 					page->index, GFP_KERNEL)) {
 			bio = do_mpage_readpage(bio, page,
 					nr_pages - page_idx,
 					&last_block_in_bio, &map_bh,
 					&first_logical_block,
 					get_block);
-			if (!pagevec_add(&lru_pvec, page))
-				__pagevec_lru_add(&lru_pvec);
-		} else {
-			page_cache_release(page);
 		}
+		page_cache_release(page);
 	}
-	pagevec_lru_add(&lru_pvec);
 	BUG_ON(!list_empty(pages));
 	if (bio)
 		mpage_bio_submit(READ, bio);
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c
+++ linux-2.6/mm/readahead.c
@@ -133,28 +133,25 @@ int read_cache_pages(struct address_spac
 			int (*filler)(void *, struct page *), void *data)
 {
 	struct page *page;
-	struct pagevec lru_pvec;
 	int ret = 0;
 
-	pagevec_init(&lru_pvec, 0);
-
 	while (!list_empty(pages)) {
 		page = list_to_page(pages);
 		list_del(&page->lru);
-		if (add_to_page_cache(page, mapping, page->index, GFP_KERNEL)) {
+		if (add_to_page_cache_lru(page, mapping,
+					page->index, GFP_KERNEL)) {
 			page_cache_release(page);
 			continue;
 		}
+		page_cache_release(page);
+
 		ret = filler(data, page);
-		if (!pagevec_add(&lru_pvec, page))
-			__pagevec_lru_add(&lru_pvec);
-		if (ret) {
+		if (unlikely(ret)) {
 			put_pages_list(pages);
 			break;
 		}
 		task_io_account_read(PAGE_CACHE_SIZE);
 	}
-	pagevec_lru_add(&lru_pvec);
 	return ret;
 }
 
@@ -164,7 +161,6 @@ static int read_pages(struct address_spa
 		struct list_head *pages, unsigned nr_pages)
 {
 	unsigned page_idx;
-	struct pagevec lru_pvec;
 	int ret;
 
 	if (mapping->a_ops->readpages) {
@@ -174,19 +170,15 @@ static int read_pages(struct address_spa
 		goto out;
 	}
 
-	pagevec_init(&lru_pvec, 0);
 	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
 		struct page *page = list_to_page(pages);
 		list_del(&page->lru);
-		if (!add_to_page_cache(page, mapping,
+		if (!add_to_page_cache_lru(page, mapping,
 					page->index, GFP_KERNEL)) {
 			mapping->a_ops->readpage(filp, page);
-			if (!pagevec_add(&lru_pvec, page))
-				__pagevec_lru_add(&lru_pvec);
-		} else
-			page_cache_release(page);
+		}
+		page_cache_release(page);
 	}
-	pagevec_lru_add(&lru_pvec);
 	ret = 0;
 out:
 	return ret;

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [patch 7/9] mm: cleanup pagecache insertion operations
@ 2007-01-29 10:32   ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-01-29 10:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

Quite a bit of code is used in maintaining these "cached pages" that are
probably pretty unlikely to get used. It would require a narrow race where
the page is inserted concurrently while this process is allocating a page
in order to create the spare page. Then a multi-page write into an uncached
part of the file, to make use of it.

Next, the buffered write path (and others) uses its own LRU pagevec when it
should be just using the per-CPU LRU pagevec (which will cut down on both data
and code size cacheline footprint). Also, these private LRU pagevecs are
emptied after just a very short time, in contrast with the per-CPU pagevecs
that are persistent. Net result: 7.3 times fewer lru_lock acquisitions required
to add the pages to pagecache for a bulk write (in 4K chunks).

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -686,26 +686,22 @@ EXPORT_SYMBOL(find_lock_page);
 struct page *find_or_create_page(struct address_space *mapping,
 		unsigned long index, gfp_t gfp_mask)
 {
-	struct page *page, *cached_page = NULL;
+	struct page *page;
 	int err;
 repeat:
 	page = find_lock_page(mapping, index);
 	if (!page) {
-		if (!cached_page) {
-			cached_page = alloc_page(gfp_mask);
-			if (!cached_page)
-				return NULL;
-		}
-		err = add_to_page_cache_lru(cached_page, mapping,
-					index, gfp_mask);
-		if (!err) {
-			page = cached_page;
-			cached_page = NULL;
-		} else if (err == -EEXIST)
-			goto repeat;
+		page = alloc_page(gfp_mask);
+		if (!page)
+			return NULL;
+		err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
+		if (unlikely(err)) {
+			page_cache_release(page);
+			page = NULL;
+			if (err == -EEXIST)
+				goto repeat;
+		}
 	}
-	if (cached_page)
-		page_cache_release(cached_page);
 	return page;
 }
 EXPORT_SYMBOL(find_or_create_page);
@@ -891,11 +887,9 @@ void do_generic_mapping_read(struct addr
 	unsigned long next_index;
 	unsigned long prev_index;
 	loff_t isize;
-	struct page *cached_page;
 	int error;
 	struct file_ra_state ra = *_ra;
 
-	cached_page = NULL;
 	index = *ppos >> PAGE_CACHE_SHIFT;
 	next_index = index;
 	prev_index = ra.prev_page;
@@ -1059,23 +1053,20 @@ no_cached_page:
 		 * Ok, it wasn't cached, so we need to create a new
 		 * page..
 		 */
-		if (!cached_page) {
-			cached_page = page_cache_alloc_cold(mapping);
-			if (!cached_page) {
-				desc->error = -ENOMEM;
-				goto out;
-			}
+		page = page_cache_alloc_cold(mapping);
+		if (!page) {
+			desc->error = -ENOMEM;
+			goto out;
 		}
-		error = add_to_page_cache_lru(cached_page, mapping,
+		error = add_to_page_cache_lru(page, mapping,
 						index, GFP_KERNEL);
 		if (error) {
+			page_cache_release(page);
 			if (error == -EEXIST)
 				goto find_page;
 			desc->error = error;
 			goto out;
 		}
-		page = cached_page;
-		cached_page = NULL;
 		goto readpage;
 	}
 
@@ -1083,8 +1074,6 @@ out:
 	*_ra = ra;
 
 	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
-	if (cached_page)
-		page_cache_release(cached_page);
 	if (filp)
 		file_accessed(filp);
 }
@@ -1542,35 +1531,28 @@ static inline struct page *__read_cache_
 				int (*filler)(void *,struct page*),
 				void *data)
 {
-	struct page *page, *cached_page = NULL;
+	struct page *page;
 	int err;
 repeat:
 	page = find_get_page(mapping, index);
 	if (!page) {
-		if (!cached_page) {
-			cached_page = page_cache_alloc_cold(mapping);
-			if (!cached_page)
-				return ERR_PTR(-ENOMEM);
-		}
-		err = add_to_page_cache_lru(cached_page, mapping,
-					index, GFP_KERNEL);
-		if (err == -EEXIST)
-			goto repeat;
-		if (err < 0) {
+		page = page_cache_alloc_cold(mapping);
+		if (!page)
+			return ERR_PTR(-ENOMEM);
+		err = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
+		if (unlikely(err)) {
+			page_cache_release(page);
+			if (err == -EEXIST)
+				goto repeat;
 			/* Presumably ENOMEM for radix tree node */
-			page_cache_release(cached_page);
 			return ERR_PTR(err);
 		}
-		page = cached_page;
-		cached_page = NULL;
 		err = filler(data, page);
 		if (err < 0) {
 			page_cache_release(page);
 			page = ERR_PTR(err);
 		}
 	}
-	if (cached_page)
-		page_cache_release(cached_page);
 	return page;
 }
 
@@ -1621,40 +1603,6 @@ retry:
 EXPORT_SYMBOL(read_cache_page);
 
 /*
- * If the page was newly created, increment its refcount and add it to the
- * caller's lru-buffering pagevec.  This function is specifically for
- * generic_file_write().
- */
-static inline struct page *
-__grab_cache_page(struct address_space *mapping, unsigned long index,
-			struct page **cached_page, struct pagevec *lru_pvec)
-{
-	int err;
-	struct page *page;
-repeat:
-	page = find_lock_page(mapping, index);
-	if (!page) {
-		if (!*cached_page) {
-			*cached_page = page_cache_alloc(mapping);
-			if (!*cached_page)
-				return NULL;
-		}
-		err = add_to_page_cache(*cached_page, mapping,
-					index, GFP_KERNEL);
-		if (err == -EEXIST)
-			goto repeat;
-		if (err == 0) {
-			page = *cached_page;
-			page_cache_get(page);
-			if (!pagevec_add(lru_pvec, page))
-				__pagevec_lru_add(lru_pvec);
-			*cached_page = NULL;
-		}
-	}
-	return page;
-}
-
-/*
  * The logic we want is
  *
  *	if suid or (sgid and xgrp)
@@ -1848,6 +1796,33 @@ generic_file_direct_write(struct kiocb *
 }
 EXPORT_SYMBOL(generic_file_direct_write);
 
+/*
+ * Find or create a page at the given pagecache position. Return the locked
+ * page. This function is specifically for buffered writes.
+ */
+static struct page *__grab_cache_page(struct address_space *mapping,
+							pgoff_t index)
+{
+	int status;
+	struct page *page;
+repeat:
+	page = find_lock_page(mapping, index);
+	if (likely(page))
+		return page;
+
+	page = page_cache_alloc(mapping);
+	if (!page)
+		return NULL;
+	status = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
+	if (unlikely(status)) {
+		page_cache_release(page);
+		if (status == -EEXIST)
+			goto repeat;
+		return NULL;
+	}
+	return page;
+}
+
 ssize_t
 generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
 		unsigned long nr_segs, loff_t pos, loff_t *ppos,
@@ -1858,15 +1833,10 @@ generic_file_buffered_write(struct kiocb
 	const struct address_space_operations *a_ops = mapping->a_ops;
 	struct inode 	*inode = mapping->host;
 	long		status = 0;
-	struct page	*page;
-	struct page	*cached_page = NULL;
-	struct pagevec	lru_pvec;
 	const struct iovec *cur_iov = iov; /* current iovec */
 	size_t		iov_offset = 0;	   /* offset in the current iovec */
 	char __user	*buf;
 
-	pagevec_init(&lru_pvec, 0);
-
 	/*
 	 * handle partial DIO write.  Adjust cur_iov if needed.
 	 */
@@ -1878,6 +1848,7 @@ generic_file_buffered_write(struct kiocb
 	}
 
 	do {
+		struct page *page;
 		pgoff_t index;		/* Pagecache index for current page */
 		unsigned long offset;	/* Offset into pagecache page */
 		unsigned long maxlen;	/* Bytes remaining in current iovec */
@@ -1904,7 +1875,8 @@ generic_file_buffered_write(struct kiocb
 		fault_in_pages_readable(buf, maxlen);
 #endif
 
-		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
+
+		page = __grab_cache_page(mapping, index);
 		if (!page) {
 			status = -ENOMEM;
 			break;
@@ -1972,9 +1944,6 @@ fs_write_aop_error:
 	} while (count);
 	*ppos = pos;
 
-	if (cached_page)
-		page_cache_release(cached_page);
-
 	/*
 	 * For now, when the user asks for O_SYNC, we'll actually give O_DSYNC
 	 */
@@ -1994,7 +1963,6 @@ fs_write_aop_error:
 	if (unlikely(file->f_flags & O_DIRECT) && written)
 		status = filemap_write_and_wait(mapping);
 
-	pagevec_lru_add(&lru_pvec);
 	return written ? written : status;
 }
 EXPORT_SYMBOL(generic_file_buffered_write);
Index: linux-2.6/fs/mpage.c
===================================================================
--- linux-2.6.orig/fs/mpage.c
+++ linux-2.6/fs/mpage.c
@@ -389,31 +389,25 @@ mpage_readpages(struct address_space *ma
 	struct bio *bio = NULL;
 	unsigned page_idx;
 	sector_t last_block_in_bio = 0;
-	struct pagevec lru_pvec;
 	struct buffer_head map_bh;
 	unsigned long first_logical_block = 0;
 
 	clear_buffer_mapped(&map_bh);
-	pagevec_init(&lru_pvec, 0);
 	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
 		struct page *page = list_entry(pages->prev, struct page, lru);
 
 		prefetchw(&page->flags);
 		list_del(&page->lru);
-		if (!add_to_page_cache(page, mapping,
+		if (!add_to_page_cache_lru(page, mapping,
 					page->index, GFP_KERNEL)) {
 			bio = do_mpage_readpage(bio, page,
 					nr_pages - page_idx,
 					&last_block_in_bio, &map_bh,
 					&first_logical_block,
 					get_block);
-			if (!pagevec_add(&lru_pvec, page))
-				__pagevec_lru_add(&lru_pvec);
-		} else {
-			page_cache_release(page);
 		}
+		page_cache_release(page);
 	}
-	pagevec_lru_add(&lru_pvec);
 	BUG_ON(!list_empty(pages));
 	if (bio)
 		mpage_bio_submit(READ, bio);
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c
+++ linux-2.6/mm/readahead.c
@@ -133,28 +133,25 @@ int read_cache_pages(struct address_spac
 			int (*filler)(void *, struct page *), void *data)
 {
 	struct page *page;
-	struct pagevec lru_pvec;
 	int ret = 0;
 
-	pagevec_init(&lru_pvec, 0);
-
 	while (!list_empty(pages)) {
 		page = list_to_page(pages);
 		list_del(&page->lru);
-		if (add_to_page_cache(page, mapping, page->index, GFP_KERNEL)) {
+		if (add_to_page_cache_lru(page, mapping,
+					page->index, GFP_KERNEL)) {
 			page_cache_release(page);
 			continue;
 		}
+		page_cache_release(page);
+
 		ret = filler(data, page);
-		if (!pagevec_add(&lru_pvec, page))
-			__pagevec_lru_add(&lru_pvec);
-		if (ret) {
+		if (unlikely(ret)) {
 			put_pages_list(pages);
 			break;
 		}
 		task_io_account_read(PAGE_CACHE_SIZE);
 	}
-	pagevec_lru_add(&lru_pvec);
 	return ret;
 }
 
@@ -164,7 +161,6 @@ static int read_pages(struct address_spa
 		struct list_head *pages, unsigned nr_pages)
 {
 	unsigned page_idx;
-	struct pagevec lru_pvec;
 	int ret;
 
 	if (mapping->a_ops->readpages) {
@@ -174,19 +170,15 @@ static int read_pages(struct address_spa
 		goto out;
 	}
 
-	pagevec_init(&lru_pvec, 0);
 	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
 		struct page *page = list_to_page(pages);
 		list_del(&page->lru);
-		if (!add_to_page_cache(page, mapping,
+		if (!add_to_page_cache_lru(page, mapping,
 					page->index, GFP_KERNEL)) {
 			mapping->a_ops->readpage(filp, page);
-			if (!pagevec_add(&lru_pvec, page))
-				__pagevec_lru_add(&lru_pvec);
-		} else
-			page_cache_release(page);
+		}
+		page_cache_release(page);
 	}
-	pagevec_lru_add(&lru_pvec);
 	ret = 0;
 out:
 	return ret;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [patch 8/9] mm: generic_file_buffered_write iovec cleanup
  2007-01-29 10:31 ` Nick Piggin
@ 2007-01-29 10:32   ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-01-29 10:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

Hide some of the open-coded nr_segs tests into the iovec helpers. This is
all to simplify generic_file_buffered_write, because that gets more complex
in the next patch.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.h
===================================================================
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -22,82 +22,82 @@ __filemap_copy_from_user_iovec_inatomic(
 
 /*
  * Copy as much as we can into the page and return the number of bytes which
- * were sucessfully copied.  If a fault is encountered then clear the page
- * out to (offset+bytes) and return the number of bytes which were copied.
- *
- * NOTE: For this to work reliably we really want copy_from_user_inatomic_nocache
- * to *NOT* zero any tail of the buffer that it failed to copy.  If it does,
- * and if the following non-atomic copy succeeds, then there is a small window
- * where the target page contains neither the data before the write, nor the
- * data after the write (it contains zero).  A read at this time will see
- * data that is inconsistent with any ordering of the read and the write.
- * (This has been detected in practice).
+ * were sucessfully copied.  If a fault is encountered then return the number of
+ * bytes which were copied.
  */
 static inline size_t
-filemap_copy_from_user(struct page *page, unsigned long offset,
-			const char __user *buf, unsigned bytes)
+filemap_copy_from_user_atomic(struct page *page, unsigned long offset,
+			const struct iovec *iov, unsigned long nr_segs,
+			size_t base, size_t bytes)
 {
 	char *kaddr;
-	int left;
+	size_t copied;
 
 	kaddr = kmap_atomic(page, KM_USER0);
-	left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
+	if (likely(nr_segs == 1)) {
+		int left;
+		char __user *buf = iov->iov_base + base;
+		left = __copy_from_user_inatomic_nocache(kaddr + offset,
+							buf, bytes);
+		copied = bytes - left;
+	} else {
+		copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset,
+							iov, base, bytes);
+	}
 	kunmap_atomic(kaddr, KM_USER0);
 
-	if (left != 0) {
-		/* Do it the slow way */
-		kaddr = kmap(page);
-		left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
-		kunmap(page);
-	}
-	return bytes - left;
+	return copied;
 }
 
 /*
- * This has the same sideeffects and return value as filemap_copy_from_user().
- * The difference is that on a fault we need to memset the remainder of the
- * page (out to offset+bytes), to emulate filemap_copy_from_user()'s
- * single-segment behaviour.
+ * This has the same sideeffects and return value as
+ * filemap_copy_from_user_atomic().
+ * The difference is that it attempts to resolve faults.
  */
 static inline size_t
-filemap_copy_from_user_iovec(struct page *page, unsigned long offset,
-			const struct iovec *iov, size_t base, size_t bytes)
+filemap_copy_from_user(struct page *page, unsigned long offset,
+			const struct iovec *iov, unsigned long nr_segs,
+			 size_t base, size_t bytes)
 {
 	char *kaddr;
 	size_t copied;
 
-	kaddr = kmap_atomic(page, KM_USER0);
-	copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
-							 base, bytes);
-	kunmap_atomic(kaddr, KM_USER0);
-	if (copied != bytes) {
-		kaddr = kmap(page);
-		copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
-								 base, bytes);
-		if (bytes - copied)
-			memset(kaddr + offset + copied, 0, bytes - copied);
-		kunmap(page);
+	kaddr = kmap(page);
+	if (likely(nr_segs == 1)) {
+		int left;
+		char __user *buf = iov->iov_base + base;
+		left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
+		copied = bytes - left;
+	} else {
+		copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset,
+							iov, base, bytes);
 	}
+	kunmap(page);
 	return copied;
 }
 
 static inline void
-filemap_set_next_iovec(const struct iovec **iovp, size_t *basep, size_t bytes)
+filemap_set_next_iovec(const struct iovec **iovp, unsigned long nr_segs,
+						 size_t *basep, size_t bytes)
 {
-	const struct iovec *iov = *iovp;
-	size_t base = *basep;
-
-	while (bytes) {
-		int copy = min(bytes, iov->iov_len - base);
-
-		bytes -= copy;
-		base += copy;
-		if (iov->iov_len == base) {
-			iov++;
-			base = 0;
+	if (likely(nr_segs == 1)) {
+		*basep += bytes;
+	} else {
+		const struct iovec *iov = *iovp;
+		size_t base = *basep;
+
+		while (bytes) {
+			int copy = min(bytes, iov->iov_len - base);
+
+			bytes -= copy;
+			base += copy;
+			if (iov->iov_len == base) {
+				iov++;
+				base = 0;
+			}
 		}
+		*iovp = iov;
+		*basep = base;
 	}
-	*iovp = iov;
-	*basep = base;
 }
 #endif
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1840,12 +1840,7 @@ generic_file_buffered_write(struct kiocb
 	/*
 	 * handle partial DIO write.  Adjust cur_iov if needed.
 	 */
-	if (likely(nr_segs == 1))
-		buf = iov->iov_base + written;
-	else {
-		filemap_set_next_iovec(&cur_iov, &iov_offset, written);
-		buf = cur_iov->iov_base + iov_offset;
-	}
+	filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, written);
 
 	do {
 		struct page *page;
@@ -1855,6 +1850,7 @@ generic_file_buffered_write(struct kiocb
 		size_t bytes;		/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
 
+		buf = cur_iov->iov_base + iov_offset;
 		offset = (pos & (PAGE_CACHE_SIZE - 1));
 		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = PAGE_CACHE_SIZE - offset;
@@ -1886,13 +1882,10 @@ generic_file_buffered_write(struct kiocb
 		if (unlikely(status))
 			goto fs_write_aop_error;
 
-		if (likely(nr_segs == 1))
-			copied = filemap_copy_from_user(page, offset,
-							buf, bytes);
-		else
-			copied = filemap_copy_from_user_iovec(page, offset,
-						cur_iov, iov_offset, bytes);
+		copied = filemap_copy_from_user(page, offset,
+					cur_iov, nr_segs, iov_offset, bytes);
 		flush_dcache_page(page);
+
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (unlikely(status < 0))
 			goto fs_write_aop_error;
@@ -1903,20 +1896,11 @@ generic_file_buffered_write(struct kiocb
 		if (unlikely(status > 0)) /* filesystem did partial write */
 			copied = status;
 
-		if (likely(copied > 0)) {
-			written += copied;
-			count -= copied;
-			pos += copied;
-			buf += copied;
-			if (unlikely(nr_segs > 1)) {
-				filemap_set_next_iovec(&cur_iov,
-						&iov_offset, copied);
-				if (count)
-					buf = cur_iov->iov_base + iov_offset;
-			} else {
-				iov_offset += copied;
-			}
-		}
+		written += copied;
+		count -= copied;
+		pos += copied;
+		filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, copied);
+
 		unlock_page(page);
 		mark_page_accessed(page);
 		page_cache_release(page);

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [patch 8/9] mm: generic_file_buffered_write iovec cleanup
@ 2007-01-29 10:32   ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-01-29 10:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

Hide some of the open-coded nr_segs tests into the iovec helpers. This is
all to simplify generic_file_buffered_write, because that gets more complex
in the next patch.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.h
===================================================================
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -22,82 +22,82 @@ __filemap_copy_from_user_iovec_inatomic(
 
 /*
  * Copy as much as we can into the page and return the number of bytes which
- * were sucessfully copied.  If a fault is encountered then clear the page
- * out to (offset+bytes) and return the number of bytes which were copied.
- *
- * NOTE: For this to work reliably we really want copy_from_user_inatomic_nocache
- * to *NOT* zero any tail of the buffer that it failed to copy.  If it does,
- * and if the following non-atomic copy succeeds, then there is a small window
- * where the target page contains neither the data before the write, nor the
- * data after the write (it contains zero).  A read at this time will see
- * data that is inconsistent with any ordering of the read and the write.
- * (This has been detected in practice).
+ * were sucessfully copied.  If a fault is encountered then return the number of
+ * bytes which were copied.
  */
 static inline size_t
-filemap_copy_from_user(struct page *page, unsigned long offset,
-			const char __user *buf, unsigned bytes)
+filemap_copy_from_user_atomic(struct page *page, unsigned long offset,
+			const struct iovec *iov, unsigned long nr_segs,
+			size_t base, size_t bytes)
 {
 	char *kaddr;
-	int left;
+	size_t copied;
 
 	kaddr = kmap_atomic(page, KM_USER0);
-	left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
+	if (likely(nr_segs == 1)) {
+		int left;
+		char __user *buf = iov->iov_base + base;
+		left = __copy_from_user_inatomic_nocache(kaddr + offset,
+							buf, bytes);
+		copied = bytes - left;
+	} else {
+		copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset,
+							iov, base, bytes);
+	}
 	kunmap_atomic(kaddr, KM_USER0);
 
-	if (left != 0) {
-		/* Do it the slow way */
-		kaddr = kmap(page);
-		left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
-		kunmap(page);
-	}
-	return bytes - left;
+	return copied;
 }
 
 /*
- * This has the same sideeffects and return value as filemap_copy_from_user().
- * The difference is that on a fault we need to memset the remainder of the
- * page (out to offset+bytes), to emulate filemap_copy_from_user()'s
- * single-segment behaviour.
+ * This has the same sideeffects and return value as
+ * filemap_copy_from_user_atomic().
+ * The difference is that it attempts to resolve faults.
  */
 static inline size_t
-filemap_copy_from_user_iovec(struct page *page, unsigned long offset,
-			const struct iovec *iov, size_t base, size_t bytes)
+filemap_copy_from_user(struct page *page, unsigned long offset,
+			const struct iovec *iov, unsigned long nr_segs,
+			 size_t base, size_t bytes)
 {
 	char *kaddr;
 	size_t copied;
 
-	kaddr = kmap_atomic(page, KM_USER0);
-	copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
-							 base, bytes);
-	kunmap_atomic(kaddr, KM_USER0);
-	if (copied != bytes) {
-		kaddr = kmap(page);
-		copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
-								 base, bytes);
-		if (bytes - copied)
-			memset(kaddr + offset + copied, 0, bytes - copied);
-		kunmap(page);
+	kaddr = kmap(page);
+	if (likely(nr_segs == 1)) {
+		int left;
+		char __user *buf = iov->iov_base + base;
+		left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
+		copied = bytes - left;
+	} else {
+		copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset,
+							iov, base, bytes);
 	}
+	kunmap(page);
 	return copied;
 }
 
 static inline void
-filemap_set_next_iovec(const struct iovec **iovp, size_t *basep, size_t bytes)
+filemap_set_next_iovec(const struct iovec **iovp, unsigned long nr_segs,
+						 size_t *basep, size_t bytes)
 {
-	const struct iovec *iov = *iovp;
-	size_t base = *basep;
-
-	while (bytes) {
-		int copy = min(bytes, iov->iov_len - base);
-
-		bytes -= copy;
-		base += copy;
-		if (iov->iov_len == base) {
-			iov++;
-			base = 0;
+	if (likely(nr_segs == 1)) {
+		*basep += bytes;
+	} else {
+		const struct iovec *iov = *iovp;
+		size_t base = *basep;
+
+		while (bytes) {
+			int copy = min(bytes, iov->iov_len - base);
+
+			bytes -= copy;
+			base += copy;
+			if (iov->iov_len == base) {
+				iov++;
+				base = 0;
+			}
 		}
+		*iovp = iov;
+		*basep = base;
 	}
-	*iovp = iov;
-	*basep = base;
 }
 #endif
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1840,12 +1840,7 @@ generic_file_buffered_write(struct kiocb
 	/*
 	 * handle partial DIO write.  Adjust cur_iov if needed.
 	 */
-	if (likely(nr_segs == 1))
-		buf = iov->iov_base + written;
-	else {
-		filemap_set_next_iovec(&cur_iov, &iov_offset, written);
-		buf = cur_iov->iov_base + iov_offset;
-	}
+	filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, written);
 
 	do {
 		struct page *page;
@@ -1855,6 +1850,7 @@ generic_file_buffered_write(struct kiocb
 		size_t bytes;		/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
 
+		buf = cur_iov->iov_base + iov_offset;
 		offset = (pos & (PAGE_CACHE_SIZE - 1));
 		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = PAGE_CACHE_SIZE - offset;
@@ -1886,13 +1882,10 @@ generic_file_buffered_write(struct kiocb
 		if (unlikely(status))
 			goto fs_write_aop_error;
 
-		if (likely(nr_segs == 1))
-			copied = filemap_copy_from_user(page, offset,
-							buf, bytes);
-		else
-			copied = filemap_copy_from_user_iovec(page, offset,
-						cur_iov, iov_offset, bytes);
+		copied = filemap_copy_from_user(page, offset,
+					cur_iov, nr_segs, iov_offset, bytes);
 		flush_dcache_page(page);
+
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (unlikely(status < 0))
 			goto fs_write_aop_error;
@@ -1903,20 +1896,11 @@ generic_file_buffered_write(struct kiocb
 		if (unlikely(status > 0)) /* filesystem did partial write */
 			copied = status;
 
-		if (likely(copied > 0)) {
-			written += copied;
-			count -= copied;
-			pos += copied;
-			buf += copied;
-			if (unlikely(nr_segs > 1)) {
-				filemap_set_next_iovec(&cur_iov,
-						&iov_offset, copied);
-				if (count)
-					buf = cur_iov->iov_base + iov_offset;
-			} else {
-				iov_offset += copied;
-			}
-		}
+		written += copied;
+		count -= copied;
+		pos += copied;
+		filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, copied);
+
 		unlock_page(page);
 		mark_page_accessed(page);
 		page_cache_release(page);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [patch 9/9] mm: fix pagecache write deadlocks
  2007-01-29 10:31 ` Nick Piggin
@ 2007-01-29 10:33   ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-01-29 10:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

Modify the core write() code so that it won't take a pagefault while holding a
lock on the pagecache page. There are a number of different deadlocks possible
if we try to do such a thing:

1.  generic_buffered_write
2.   lock_page
3.    prepare_write
4.     unlock_page+vmtruncate
5.     copy_from_user
6.      mmap_sem(r)
7.       handle_mm_fault
8.        lock_page (filemap_nopage)
9.    commit_write
10.  unlock_page

a. sys_munmap / sys_mlock / others
b.  mmap_sem(w)
c.   make_pages_present
d.    get_user_pages
e.     handle_mm_fault
f.      lock_page (filemap_nopage)

2,8	- recursive deadlock if page is same
2,8;2,8	- ABBA deadlock is page is different
2,6;b,f	- ABBA deadlock if page is same

The solution is as follows:
1.  If we find the destination page is uptodate, continue as normal, but use
    atomic usercopies which do not take pagefaults and do not zero the uncopied
    tail of the destination. The destination is already uptodate, so we can
    commit_write the full length even if there was a partial copy: it does not
    matter that the tail was not modified, because if it is dirtied and written
    back to disk it will not cause any problems (uptodate *means* that the
    destination page is as new or newer than the copy on disk).

1a. The above requires that fault_in_pages_readable correctly returns access
    information, because atomic usercopies cannot distinguish between
    non-present pages in a readable mapping, from lack of a readable mapping.

2.  If we find the destination page is non uptodate, unlock it (this could be
    made slightly more optimal), then find and pin the source page with
    get_user_pages. Relock the destination page and continue with the copy.
    However, instead of a usercopy (which might take a fault), copy the data
    via the kernel address space.

(also, rename maxlen to seglen, because it was confusing)

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1843,11 +1843,12 @@ generic_file_buffered_write(struct kiocb
 	filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, written);
 
 	do {
+		struct page *src_page;
 		struct page *page;
 		pgoff_t index;		/* Pagecache index for current page */
 		unsigned long offset;	/* Offset into pagecache page */
-		unsigned long maxlen;	/* Bytes remaining in current iovec */
-		size_t bytes;		/* Bytes to write to page */
+		unsigned long seglen;	/* Bytes remaining in current iovec */
+		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
 
 		buf = cur_iov->iov_base + iov_offset;
@@ -1857,20 +1858,30 @@ generic_file_buffered_write(struct kiocb
 		if (bytes > count)
 			bytes = count;
 
-		maxlen = cur_iov->iov_len - iov_offset;
-		if (maxlen > bytes)
-			maxlen = bytes;
+		/*
+		 * a non-NULL src_page indicates that we're doing the
+		 * copy via get_user_pages and kmap.
+		 */
+		src_page = NULL;
+
+		seglen = cur_iov->iov_len - iov_offset;
+		if (seglen > bytes)
+			seglen = bytes;
 
-#ifndef CONFIG_DEBUG_VM
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
+		 *
+		 * Not only is this an optimisation, but it is also required
+		 * to check that the address is actually valid, when atomic
+		 * usercopies are used, below.
 		 */
-		fault_in_pages_readable(buf, maxlen);
-#endif
-
+		if (unlikely(fault_in_pages_readable(buf, seglen))) {
+			status = -EFAULT;
+			break;
+		}
 
 		page = __grab_cache_page(mapping, index);
 		if (!page) {
@@ -1878,32 +1889,94 @@ generic_file_buffered_write(struct kiocb
 			break;
 		}
 
+		/*
+		 * non-uptodate pages cannot cope with short copies, and we
+		 * cannot take a pagefault with the destination page locked.
+		 * So pin the source page to copy it.
+		 */
+		if (!PageUptodate(page)) {
+			unlock_page(page);
+
+			bytes = min(bytes, PAGE_CACHE_SIZE -
+				     ((unsigned long)buf & ~PAGE_CACHE_MASK));
+
+			/*
+			 * Cannot get_user_pages with a page locked for the
+			 * same reason as we can't take a page fault with a
+			 * page locked (as explained below).
+			 */
+			down_read(&current->mm->mmap_sem);
+			status = get_user_pages(current, current->mm,
+					(unsigned long)buf & PAGE_CACHE_MASK, 1,
+					0, 0, &src_page, NULL);
+			up_read(&current->mm->mmap_sem);
+			if (status != 1) {
+				page_cache_release(page);
+				break;
+			}
+
+			lock_page(page);
+			if (!page->mapping) {
+				unlock_page(page);
+				page_cache_release(page);
+				page_cache_release(src_page);
+				continue;
+			}
+
+		}
+
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
 		if (unlikely(status))
 			goto fs_write_aop_error;
 
-		copied = filemap_copy_from_user(page, offset,
+		if (!src_page) {
+			/*
+			 * Must not enter the pagefault handler here, because
+			 * we hold the page lock, so we might recursively
+			 * deadlock on the same lock, or get an ABBA deadlock
+			 * against a different lock, or against the mmap_sem
+			 * (which nests outside the page lock).  So increment
+			 * preempt count, and use _atomic usercopies.
+			 *
+			 * The page is uptodate so we are OK to encounter a
+			 * short copy: if unmodified parts of the page are
+			 * marked dirty and written out to disk, it doesn't
+			 * really matter.
+			 */
+			pagefault_disable();
+			copied = filemap_copy_from_user_atomic(page, offset,
 					cur_iov, nr_segs, iov_offset, bytes);
+			pagefault_enable();
+		} else {
+			char *src, *dst;
+			src = kmap(src_page);
+			dst = kmap(page);
+			memcpy(dst + offset,
+				src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
+				bytes);
+			kunmap(page);
+			kunmap(src_page);
+			copied = bytes;
+		}
 		flush_dcache_page(page);
 
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (unlikely(status < 0))
 			goto fs_write_aop_error;
-		if (unlikely(copied != bytes)) {
-			status = -EFAULT;
-			goto fs_write_aop_error;
-		}
 		if (unlikely(status > 0)) /* filesystem did partial write */
-			copied = status;
+			copied = min_t(size_t, copied, status);
+
+		unlock_page(page);
+		mark_page_accessed(page);
+		page_cache_release(page);
+		if (src_page)
+			page_cache_release(src_page);
 
 		written += copied;
 		count -= copied;
 		pos += copied;
 		filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, copied);
 
-		unlock_page(page);
-		mark_page_accessed(page);
-		page_cache_release(page);
 		balance_dirty_pages_ratelimited(mapping);
 		cond_resched();
 		continue;
@@ -1912,6 +1985,8 @@ fs_write_aop_error:
 		if (status != AOP_TRUNCATED_PAGE)
 			unlock_page(page);
 		page_cache_release(page);
+		if (src_page)
+			page_cache_release(src_page);
 
 		/*
 		 * prepare_write() may have instantiated a few blocks
@@ -1924,7 +1999,6 @@ fs_write_aop_error:
 			continue;
 		else
 			break;
-
 	} while (count);
 	*ppos = pos;
 
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -198,6 +198,9 @@ static inline int fault_in_pages_writeab
 {
 	int ret;
 
+	if (unlikely(size == 0))
+		return 0;
+
 	/*
 	 * Writing zeroes into userspace here is OK, because we know that if
 	 * the zero gets there, we'll be overwriting it.
@@ -217,19 +220,23 @@ static inline int fault_in_pages_writeab
 	return ret;
 }
 
-static inline void fault_in_pages_readable(const char __user *uaddr, int size)
+static inline int fault_in_pages_readable(const char __user *uaddr, int size)
 {
 	volatile char c;
 	int ret;
 
+	if (unlikely(size == 0))
+		return 0;
+
 	ret = __get_user(c, uaddr);
 	if (ret == 0) {
 		const char __user *end = uaddr + size - 1;
 
 		if (((unsigned long)uaddr & PAGE_MASK) !=
 				((unsigned long)end & PAGE_MASK))
-		 	__get_user(c, end);
+		 	ret = __get_user(c, end);
 	}
+	return ret;
 }
 
 #endif /* _LINUX_PAGEMAP_H */

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-01-29 10:33   ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-01-29 10:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

Modify the core write() code so that it won't take a pagefault while holding a
lock on the pagecache page. There are a number of different deadlocks possible
if we try to do such a thing:

1.  generic_buffered_write
2.   lock_page
3.    prepare_write
4.     unlock_page+vmtruncate
5.     copy_from_user
6.      mmap_sem(r)
7.       handle_mm_fault
8.        lock_page (filemap_nopage)
9.    commit_write
10.  unlock_page

a. sys_munmap / sys_mlock / others
b.  mmap_sem(w)
c.   make_pages_present
d.    get_user_pages
e.     handle_mm_fault
f.      lock_page (filemap_nopage)

2,8	- recursive deadlock if page is same
2,8;2,8	- ABBA deadlock is page is different
2,6;b,f	- ABBA deadlock if page is same

The solution is as follows:
1.  If we find the destination page is uptodate, continue as normal, but use
    atomic usercopies which do not take pagefaults and do not zero the uncopied
    tail of the destination. The destination is already uptodate, so we can
    commit_write the full length even if there was a partial copy: it does not
    matter that the tail was not modified, because if it is dirtied and written
    back to disk it will not cause any problems (uptodate *means* that the
    destination page is as new or newer than the copy on disk).

1a. The above requires that fault_in_pages_readable correctly returns access
    information, because atomic usercopies cannot distinguish between
    non-present pages in a readable mapping, from lack of a readable mapping.

2.  If we find the destination page is non uptodate, unlock it (this could be
    made slightly more optimal), then find and pin the source page with
    get_user_pages. Relock the destination page and continue with the copy.
    However, instead of a usercopy (which might take a fault), copy the data
    via the kernel address space.

(also, rename maxlen to seglen, because it was confusing)

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1843,11 +1843,12 @@ generic_file_buffered_write(struct kiocb
 	filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, written);
 
 	do {
+		struct page *src_page;
 		struct page *page;
 		pgoff_t index;		/* Pagecache index for current page */
 		unsigned long offset;	/* Offset into pagecache page */
-		unsigned long maxlen;	/* Bytes remaining in current iovec */
-		size_t bytes;		/* Bytes to write to page */
+		unsigned long seglen;	/* Bytes remaining in current iovec */
+		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
 
 		buf = cur_iov->iov_base + iov_offset;
@@ -1857,20 +1858,30 @@ generic_file_buffered_write(struct kiocb
 		if (bytes > count)
 			bytes = count;
 
-		maxlen = cur_iov->iov_len - iov_offset;
-		if (maxlen > bytes)
-			maxlen = bytes;
+		/*
+		 * a non-NULL src_page indicates that we're doing the
+		 * copy via get_user_pages and kmap.
+		 */
+		src_page = NULL;
+
+		seglen = cur_iov->iov_len - iov_offset;
+		if (seglen > bytes)
+			seglen = bytes;
 
-#ifndef CONFIG_DEBUG_VM
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
+		 *
+		 * Not only is this an optimisation, but it is also required
+		 * to check that the address is actually valid, when atomic
+		 * usercopies are used, below.
 		 */
-		fault_in_pages_readable(buf, maxlen);
-#endif
-
+		if (unlikely(fault_in_pages_readable(buf, seglen))) {
+			status = -EFAULT;
+			break;
+		}
 
 		page = __grab_cache_page(mapping, index);
 		if (!page) {
@@ -1878,32 +1889,94 @@ generic_file_buffered_write(struct kiocb
 			break;
 		}
 
+		/*
+		 * non-uptodate pages cannot cope with short copies, and we
+		 * cannot take a pagefault with the destination page locked.
+		 * So pin the source page to copy it.
+		 */
+		if (!PageUptodate(page)) {
+			unlock_page(page);
+
+			bytes = min(bytes, PAGE_CACHE_SIZE -
+				     ((unsigned long)buf & ~PAGE_CACHE_MASK));
+
+			/*
+			 * Cannot get_user_pages with a page locked for the
+			 * same reason as we can't take a page fault with a
+			 * page locked (as explained below).
+			 */
+			down_read(&current->mm->mmap_sem);
+			status = get_user_pages(current, current->mm,
+					(unsigned long)buf & PAGE_CACHE_MASK, 1,
+					0, 0, &src_page, NULL);
+			up_read(&current->mm->mmap_sem);
+			if (status != 1) {
+				page_cache_release(page);
+				break;
+			}
+
+			lock_page(page);
+			if (!page->mapping) {
+				unlock_page(page);
+				page_cache_release(page);
+				page_cache_release(src_page);
+				continue;
+			}
+
+		}
+
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
 		if (unlikely(status))
 			goto fs_write_aop_error;
 
-		copied = filemap_copy_from_user(page, offset,
+		if (!src_page) {
+			/*
+			 * Must not enter the pagefault handler here, because
+			 * we hold the page lock, so we might recursively
+			 * deadlock on the same lock, or get an ABBA deadlock
+			 * against a different lock, or against the mmap_sem
+			 * (which nests outside the page lock).  So increment
+			 * preempt count, and use _atomic usercopies.
+			 *
+			 * The page is uptodate so we are OK to encounter a
+			 * short copy: if unmodified parts of the page are
+			 * marked dirty and written out to disk, it doesn't
+			 * really matter.
+			 */
+			pagefault_disable();
+			copied = filemap_copy_from_user_atomic(page, offset,
 					cur_iov, nr_segs, iov_offset, bytes);
+			pagefault_enable();
+		} else {
+			char *src, *dst;
+			src = kmap(src_page);
+			dst = kmap(page);
+			memcpy(dst + offset,
+				src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
+				bytes);
+			kunmap(page);
+			kunmap(src_page);
+			copied = bytes;
+		}
 		flush_dcache_page(page);
 
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (unlikely(status < 0))
 			goto fs_write_aop_error;
-		if (unlikely(copied != bytes)) {
-			status = -EFAULT;
-			goto fs_write_aop_error;
-		}
 		if (unlikely(status > 0)) /* filesystem did partial write */
-			copied = status;
+			copied = min_t(size_t, copied, status);
+
+		unlock_page(page);
+		mark_page_accessed(page);
+		page_cache_release(page);
+		if (src_page)
+			page_cache_release(src_page);
 
 		written += copied;
 		count -= copied;
 		pos += copied;
 		filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, copied);
 
-		unlock_page(page);
-		mark_page_accessed(page);
-		page_cache_release(page);
 		balance_dirty_pages_ratelimited(mapping);
 		cond_resched();
 		continue;
@@ -1912,6 +1985,8 @@ fs_write_aop_error:
 		if (status != AOP_TRUNCATED_PAGE)
 			unlock_page(page);
 		page_cache_release(page);
+		if (src_page)
+			page_cache_release(src_page);
 
 		/*
 		 * prepare_write() may have instantiated a few blocks
@@ -1924,7 +1999,6 @@ fs_write_aop_error:
 			continue;
 		else
 			break;
-
 	} while (count);
 	*ppos = pos;
 
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -198,6 +198,9 @@ static inline int fault_in_pages_writeab
 {
 	int ret;
 
+	if (unlikely(size == 0))
+		return 0;
+
 	/*
 	 * Writing zeroes into userspace here is OK, because we know that if
 	 * the zero gets there, we'll be overwriting it.
@@ -217,19 +220,23 @@ static inline int fault_in_pages_writeab
 	return ret;
 }
 
-static inline void fault_in_pages_readable(const char __user *uaddr, int size)
+static inline int fault_in_pages_readable(const char __user *uaddr, int size)
 {
 	volatile char c;
 	int ret;
 
+	if (unlikely(size == 0))
+		return 0;
+
 	ret = __get_user(c, uaddr);
 	if (ret == 0) {
 		const char __user *end = uaddr + size - 1;
 
 		if (((unsigned long)uaddr & PAGE_MASK) !=
 				((unsigned long)end & PAGE_MASK))
-		 	__get_user(c, end);
+		 	ret = __get_user(c, end);
 	}
+	return ret;
 }
 
 #endif /* _LINUX_PAGEMAP_H */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-01-29 10:33   ` Nick Piggin
@ 2007-01-29 11:11     ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-01-29 11:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Mon, Jan 29, 2007 at 11:33:03AM +0100, Nick Piggin wrote:
> +		} else {
> +			char *src, *dst;
> +			src = kmap(src_page);
> +			dst = kmap(page);
> +			memcpy(dst + offset,
> +				src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
> +				bytes);
> +			kunmap(page);
> +			kunmap(src_page);
> +			copied = bytes;
> +		}
>  		flush_dcache_page(page);

Hmm, I guess these should use kmap_atomic with KM_USER[01]?

The kmap is from an earlier iteration that wanted to sleep
with the page mapped into kernel.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-01-29 11:11     ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-01-29 11:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Mon, Jan 29, 2007 at 11:33:03AM +0100, Nick Piggin wrote:
> +		} else {
> +			char *src, *dst;
> +			src = kmap(src_page);
> +			dst = kmap(page);
> +			memcpy(dst + offset,
> +				src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
> +				bytes);
> +			kunmap(page);
> +			kunmap(src_page);
> +			copied = bytes;
> +		}
>  		flush_dcache_page(page);

Hmm, I guess these should use kmap_atomic with KM_USER[01]?

The kmap is from an earlier iteration that wanted to sleep
with the page mapped into kernel.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 0/9] buffered write deadlock fix
  2007-01-29 10:31 ` Nick Piggin
@ 2007-01-30 20:55   ` Andrew Morton
  -1 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-01-30 20:55 UTC (permalink / raw)
  To: Nick Piggin, Hugh Dickins
  Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Mon, 29 Jan 2007 11:31:37 +0100 (CET)
Nick Piggin <npiggin@suse.de> wrote:

> The following set of patches attempt to fix the buffered write
> locking problems 

y'know, four or five years back I fixed this bug by doing

	current->locked_page = page;

in the write() code, and then teaching the pagefault code to avoid locking
the same page.  Patch below.

But then evil mean Hugh pointed out that the patch is still vulnerable to
ab/ba deadlocking so I dropped it.

But Hugh lied!  There is no ab/ba deadlock because both a and b need
i_mutex to get into the write() path.

This approach doesn't fix the writev() performance regresson which
nobody has measured yet but which the NFS guys reported.

But I think with this fix in place we can go back to a modified version of
the 2.6.17 filemap.c code and get that performance back, but I haven't
thought about that.

It's a heck of a lot simpler than your patches though ;)





There is a longstanding problem in which the kernel can deadlock if an
application performs a write(bf, buf, n), and `buf' points at a mmapped
page of `fd', and that page is the pagecache page into which this write
will write, and the page is not yet present.

The kernel will put a new page into pagecache, lock it, run
copy_from_user().  The copy will fault and filemap_nopage() will try to
lock the page in preparation for reading it in.  But it was already
locked up in generic_file_write().

We have had a workaround in there - touch the userspace address by hand
(to fault it in) before locking the pagecache page, and hope that it
doesn't get evicted before we try to lock it.

But there's a place in the kmap_atomic-for-generic_file_write code
where this race is detectable.  I put a printk in there and saw a
stream of them during heavy testing.  So the workaround doesn't work.


What this patch does is

- within generic_file_write(): record the locked page in current->locked_page

- within filemap_nopage, look to see if the page which we're faulting
  in is equal to current->locked_page.

  If it is, taken special action: instead of locking the page,
  reading it and unlocking it, we assume that the page is already
  locked.  Bring it uptodate and relock it before returning.


I tested this by disabling the fault-it-in-by-hand workaround, writing
a special test app and adding several printks.  Not pretty, but it does
the job.




 include/linux/sched.h |    2 ++
 mm/filemap.c          |   26 ++++++++++++++++++++++----
 2 files changed, 24 insertions(+), 4 deletions(-)

--- linux-mnm/mm/filemap.c~write-deadlock	Thu Oct 31 22:42:38 2002
+++ linux-mnm-akpm/mm/filemap.c	Thu Oct 31 22:42:38 2002
@@ -997,6 +997,7 @@ struct page * filemap_nopage(struct vm_a
 	struct page *page;
 	unsigned long size, pgoff, endoff;
 	int did_readahead;
+	struct page *locked_page = current->locked_page;
 
 	pgoff = ((address - area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
 	endoff = ((area->vm_end - area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
@@ -1092,10 +1093,12 @@ no_cached_page:
 
 page_not_uptodate:
 	inc_page_state(pgmajfault);
-	lock_page(page);
+	if (page != locked_page)
+		lock_page(page);
 
 	/* Did it get unhashed while we waited for it? */
 	if (!page->mapping) {
+		BUG_ON(page == locked_page);	/* can't happen: i_sem */
 		unlock_page(page);
 		page_cache_release(page);
 		goto retry_all;
@@ -1103,12 +1106,15 @@ page_not_uptodate:
 
 	/* Did somebody else get it up-to-date? */
 	if (PageUptodate(page)) {
-		unlock_page(page);
+		if (page != locked_page)
+			unlock_page(page);
 		goto success;
 	}
 
 	if (!mapping->a_ops->readpage(file, page)) {
 		wait_on_page_locked(page);
+		if (page == locked_page)
+			lock_page(page);
 		if (PageUptodate(page))
 			goto success;
 	}
@@ -1119,10 +1125,12 @@ page_not_uptodate:
 	 * because there really aren't any performance issues here
 	 * and we need to check for errors.
 	 */
-	lock_page(page);
+	if (page != locked_page)
+		lock_page(page);
 
 	/* Somebody truncated the page on us? */
 	if (!page->mapping) {
+		BUG_ON(page == locked_page);	/* can't happen: i_sem */
 		unlock_page(page);
 		page_cache_release(page);
 		goto retry_all;
@@ -1130,12 +1138,15 @@ page_not_uptodate:
 
 	/* Somebody else successfully read it in? */
 	if (PageUptodate(page)) {
-		unlock_page(page);
+		if (page != locked_page)
+			unlock_page(page);
 		goto success;
 	}
 	ClearPageError(page);
 	if (!mapping->a_ops->readpage(file, page)) {
 		wait_on_page_locked(page);
+		if (page == locked_page)
+			lock_page(page);
 		if (PageUptodate(page))
 			goto success;
 	}
@@ -1445,6 +1456,11 @@ void remove_suid(struct dentry *dentry)
 	}
 }
 
+/*
+ * There is rare deadlock scenario in which the source and dest pages are
+ * the same, and the VM evicts the page at the wrong time.   Here we set
+ * current->locked_page to signal that to special-case code in filemap_nopage
+ */
 static inline int
 filemap_copy_from_user(struct page *page, unsigned long offset,
 			const char *buf, unsigned bytes)
@@ -1459,7 +1475,9 @@ filemap_copy_from_user(struct page *page
 	if (left != 0) {
 		/* Do it the slow way */
 		kaddr = kmap(page);
+		current->locked_page = page;
 		left = __copy_from_user(kaddr + offset, buf, bytes);
+		current->locked_page = NULL;
 		kunmap(page);
 	}
 	return left;
--- linux-mnm/include/linux/sched.h~write-deadlock	Thu Oct 31 22:42:38 2002
+++ linux-mnm-akpm/include/linux/sched.h	Thu Oct 31 22:42:38 2002
@@ -266,6 +266,7 @@ extern struct user_struct root_user;
 
 typedef struct prio_array prio_array_t;
 struct backing_dev_info;
+struct page;
 
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
@@ -387,6 +388,7 @@ struct task_struct {
 	void *journal_info;
 	struct dentry *proc_dentry;
 	struct backing_dev_info *backing_dev_info;
+	struct page *locked_page;
 };
 
 extern void __put_task_struct(struct task_struct *tsk);

.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 0/9] buffered write deadlock fix
@ 2007-01-30 20:55   ` Andrew Morton
  0 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-01-30 20:55 UTC (permalink / raw)
  To: Nick Piggin, Hugh Dickins
  Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Mon, 29 Jan 2007 11:31:37 +0100 (CET)
Nick Piggin <npiggin@suse.de> wrote:

> The following set of patches attempt to fix the buffered write
> locking problems 

y'know, four or five years back I fixed this bug by doing

	current->locked_page = page;

in the write() code, and then teaching the pagefault code to avoid locking
the same page.  Patch below.

But then evil mean Hugh pointed out that the patch is still vulnerable to
ab/ba deadlocking so I dropped it.

But Hugh lied!  There is no ab/ba deadlock because both a and b need
i_mutex to get into the write() path.

This approach doesn't fix the writev() performance regresson which
nobody has measured yet but which the NFS guys reported.

But I think with this fix in place we can go back to a modified version of
the 2.6.17 filemap.c code and get that performance back, but I haven't
thought about that.

It's a heck of a lot simpler than your patches though ;)





There is a longstanding problem in which the kernel can deadlock if an
application performs a write(bf, buf, n), and `buf' points at a mmapped
page of `fd', and that page is the pagecache page into which this write
will write, and the page is not yet present.

The kernel will put a new page into pagecache, lock it, run
copy_from_user().  The copy will fault and filemap_nopage() will try to
lock the page in preparation for reading it in.  But it was already
locked up in generic_file_write().

We have had a workaround in there - touch the userspace address by hand
(to fault it in) before locking the pagecache page, and hope that it
doesn't get evicted before we try to lock it.

But there's a place in the kmap_atomic-for-generic_file_write code
where this race is detectable.  I put a printk in there and saw a
stream of them during heavy testing.  So the workaround doesn't work.


What this patch does is

- within generic_file_write(): record the locked page in current->locked_page

- within filemap_nopage, look to see if the page which we're faulting
  in is equal to current->locked_page.

  If it is, taken special action: instead of locking the page,
  reading it and unlocking it, we assume that the page is already
  locked.  Bring it uptodate and relock it before returning.


I tested this by disabling the fault-it-in-by-hand workaround, writing
a special test app and adding several printks.  Not pretty, but it does
the job.




 include/linux/sched.h |    2 ++
 mm/filemap.c          |   26 ++++++++++++++++++++++----
 2 files changed, 24 insertions(+), 4 deletions(-)

--- linux-mnm/mm/filemap.c~write-deadlock	Thu Oct 31 22:42:38 2002
+++ linux-mnm-akpm/mm/filemap.c	Thu Oct 31 22:42:38 2002
@@ -997,6 +997,7 @@ struct page * filemap_nopage(struct vm_a
 	struct page *page;
 	unsigned long size, pgoff, endoff;
 	int did_readahead;
+	struct page *locked_page = current->locked_page;
 
 	pgoff = ((address - area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
 	endoff = ((area->vm_end - area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
@@ -1092,10 +1093,12 @@ no_cached_page:
 
 page_not_uptodate:
 	inc_page_state(pgmajfault);
-	lock_page(page);
+	if (page != locked_page)
+		lock_page(page);
 
 	/* Did it get unhashed while we waited for it? */
 	if (!page->mapping) {
+		BUG_ON(page == locked_page);	/* can't happen: i_sem */
 		unlock_page(page);
 		page_cache_release(page);
 		goto retry_all;
@@ -1103,12 +1106,15 @@ page_not_uptodate:
 
 	/* Did somebody else get it up-to-date? */
 	if (PageUptodate(page)) {
-		unlock_page(page);
+		if (page != locked_page)
+			unlock_page(page);
 		goto success;
 	}
 
 	if (!mapping->a_ops->readpage(file, page)) {
 		wait_on_page_locked(page);
+		if (page == locked_page)
+			lock_page(page);
 		if (PageUptodate(page))
 			goto success;
 	}
@@ -1119,10 +1125,12 @@ page_not_uptodate:
 	 * because there really aren't any performance issues here
 	 * and we need to check for errors.
 	 */
-	lock_page(page);
+	if (page != locked_page)
+		lock_page(page);
 
 	/* Somebody truncated the page on us? */
 	if (!page->mapping) {
+		BUG_ON(page == locked_page);	/* can't happen: i_sem */
 		unlock_page(page);
 		page_cache_release(page);
 		goto retry_all;
@@ -1130,12 +1138,15 @@ page_not_uptodate:
 
 	/* Somebody else successfully read it in? */
 	if (PageUptodate(page)) {
-		unlock_page(page);
+		if (page != locked_page)
+			unlock_page(page);
 		goto success;
 	}
 	ClearPageError(page);
 	if (!mapping->a_ops->readpage(file, page)) {
 		wait_on_page_locked(page);
+		if (page == locked_page)
+			lock_page(page);
 		if (PageUptodate(page))
 			goto success;
 	}
@@ -1445,6 +1456,11 @@ void remove_suid(struct dentry *dentry)
 	}
 }
 
+/*
+ * There is rare deadlock scenario in which the source and dest pages are
+ * the same, and the VM evicts the page at the wrong time.   Here we set
+ * current->locked_page to signal that to special-case code in filemap_nopage
+ */
 static inline int
 filemap_copy_from_user(struct page *page, unsigned long offset,
 			const char *buf, unsigned bytes)
@@ -1459,7 +1475,9 @@ filemap_copy_from_user(struct page *page
 	if (left != 0) {
 		/* Do it the slow way */
 		kaddr = kmap(page);
+		current->locked_page = page;
 		left = __copy_from_user(kaddr + offset, buf, bytes);
+		current->locked_page = NULL;
 		kunmap(page);
 	}
 	return left;
--- linux-mnm/include/linux/sched.h~write-deadlock	Thu Oct 31 22:42:38 2002
+++ linux-mnm-akpm/include/linux/sched.h	Thu Oct 31 22:42:38 2002
@@ -266,6 +266,7 @@ extern struct user_struct root_user;
 
 typedef struct prio_array prio_array_t;
 struct backing_dev_info;
+struct page;
 
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
@@ -387,6 +388,7 @@ struct task_struct {
 	void *journal_info;
 	struct dentry *proc_dentry;
 	struct backing_dev_info *backing_dev_info;
+	struct page *locked_page;
 };
 
 extern void __put_task_struct(struct task_struct *tsk);

.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 0/9] buffered write deadlock fix
  2007-01-30 20:55   ` Andrew Morton
@ 2007-01-30 23:21     ` Andrew Morton
  -1 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-01-30 23:21 UTC (permalink / raw)
  To: Nick Piggin, Hugh Dickins, Linux Kernel, Linux Filesystems,
	Linux Memory Management

On Tue, 30 Jan 2007 12:55:58 -0800
Andrew Morton <akpm@osdl.org> wrote:

> y'know, four or five years back I fixed this bug by doing
> 
> 	current->locked_page = page;
> 
> in the write() code, and then teaching the pagefault code to avoid locking
> the same page.  Patch below.
> 
> But then evil mean Hugh pointed out that the patch is still vulnerable to
> ab/ba deadlocking so I dropped it.

And he was right, of course.  Task A holds file a's i_mutex and takes a
fault against file b's page.  Task B holds file b's i_mutex and takes a
fault against file a's page.  Drat.

I wonder if there's a sane way of preventing that.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 0/9] buffered write deadlock fix
@ 2007-01-30 23:21     ` Andrew Morton
  0 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-01-30 23:21 UTC (permalink / raw)
  To: Nick Piggin, Hugh Dickins, Linux Kernel, Linux Filesystems,
	Linux Memory Management

On Tue, 30 Jan 2007 12:55:58 -0800
Andrew Morton <akpm@osdl.org> wrote:

> y'know, four or five years back I fixed this bug by doing
> 
> 	current->locked_page = page;
> 
> in the write() code, and then teaching the pagefault code to avoid locking
> the same page.  Patch below.
> 
> But then evil mean Hugh pointed out that the patch is still vulnerable to
> ab/ba deadlocking so I dropped it.

And he was right, of course.  Task A holds file a's i_mutex and takes a
fault against file b's page.  Task B holds file b's i_mutex and takes a
fault against file a's page.  Drat.

I wonder if there's a sane way of preventing that.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 0/9] buffered write deadlock fix
  2007-01-30 20:55   ` Andrew Morton
@ 2007-01-31  0:32     ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-01-31  0:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Linux Kernel, Linux Filesystems, Linux Memory Management

On Tue, Jan 30, 2007 at 12:55:58PM -0800, Andrew Morton wrote:
> On Mon, 29 Jan 2007 11:31:37 +0100 (CET)
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > The following set of patches attempt to fix the buffered write
> > locking problems 
> 
> y'know, four or five years back I fixed this bug by doing
> 
> 	current->locked_page = page;
> 
> in the write() code, and then teaching the pagefault code to avoid locking
> the same page.  Patch below.
> 
> But then evil mean Hugh pointed out that the patch is still vulnerable to
> ab/ba deadlocking so I dropped it.
> 
> But Hugh lied!  There is no ab/ba deadlock because both a and b need
> i_mutex to get into the write() path.

Not only is there still the abba deadlock on page locks, as you point
out in your next mail, but there is also an abba on page lock vs mmap_sem.

> This approach doesn't fix the writev() performance regresson which
> nobody has measured yet but which the NFS guys reported.
> 
> But I think with this fix in place we can go back to a modified version of
> the 2.6.17 filemap.c code and get that performance back, but I haven't
> thought about that.
> 
> It's a heck of a lot simpler than your patches though ;)

Ignoring the cleanup patches, the only thing mine really do is to use
get_user_pages in  generic_file_buffered_write if the destination page
is not uptodate, and use atomic copies if it is. Conceptually pretty simple.

Mine fixes the writev performance regression iff the destination page
it uptodate. More importantly, they fix 3 deadlocks in the core mm.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 0/9] buffered write deadlock fix
@ 2007-01-31  0:32     ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-01-31  0:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Linux Kernel, Linux Filesystems, Linux Memory Management

On Tue, Jan 30, 2007 at 12:55:58PM -0800, Andrew Morton wrote:
> On Mon, 29 Jan 2007 11:31:37 +0100 (CET)
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > The following set of patches attempt to fix the buffered write
> > locking problems 
> 
> y'know, four or five years back I fixed this bug by doing
> 
> 	current->locked_page = page;
> 
> in the write() code, and then teaching the pagefault code to avoid locking
> the same page.  Patch below.
> 
> But then evil mean Hugh pointed out that the patch is still vulnerable to
> ab/ba deadlocking so I dropped it.
> 
> But Hugh lied!  There is no ab/ba deadlock because both a and b need
> i_mutex to get into the write() path.

Not only is there still the abba deadlock on page locks, as you point
out in your next mail, but there is also an abba on page lock vs mmap_sem.

> This approach doesn't fix the writev() performance regresson which
> nobody has measured yet but which the NFS guys reported.
> 
> But I think with this fix in place we can go back to a modified version of
> the 2.6.17 filemap.c code and get that performance back, but I haven't
> thought about that.
> 
> It's a heck of a lot simpler than your patches though ;)

Ignoring the cleanup patches, the only thing mine really do is to use
get_user_pages in  generic_file_buffered_write if the destination page
is not uptodate, and use atomic copies if it is. Conceptually pretty simple.

Mine fixes the writev performance regression iff the destination page
it uptodate. More importantly, they fix 3 deadlocks in the core mm.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 0/9] buffered write deadlock fix
  2007-01-30 23:21     ` Andrew Morton
@ 2007-01-31  1:31       ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-01-31  1:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Linux Kernel, Linux Filesystems, Linux Memory Management

On Tue, Jan 30, 2007 at 03:21:19PM -0800, Andrew Morton wrote:
> On Tue, 30 Jan 2007 12:55:58 -0800
> Andrew Morton <akpm@osdl.org> wrote:
> 
> > y'know, four or five years back I fixed this bug by doing
> > 
> > 	current->locked_page = page;
> > 
> > in the write() code, and then teaching the pagefault code to avoid locking
> > the same page.  Patch below.
> > 
> > But then evil mean Hugh pointed out that the patch is still vulnerable to
> > ab/ba deadlocking so I dropped it.
> 
> And he was right, of course.  Task A holds file a's i_mutex and takes a
> fault against file b's page.  Task B holds file b's i_mutex and takes a
> fault against file a's page.  Drat.
> 
> I wonder if there's a sane way of preventing that.

If you want to go down the path of carrying state around in task_struct,
you can take the mmap_sem and set a flag, then get_user_pages the source
page and lock both source and destination in ascending order, then your
page fault handler checks the flag and skips mmap_sem, and the rest of
your fault path checks both the page locks you're holding.

At which point you arrive at a horrible mess :)


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 0/9] buffered write deadlock fix
@ 2007-01-31  1:31       ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-01-31  1:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Linux Kernel, Linux Filesystems, Linux Memory Management

On Tue, Jan 30, 2007 at 03:21:19PM -0800, Andrew Morton wrote:
> On Tue, 30 Jan 2007 12:55:58 -0800
> Andrew Morton <akpm@osdl.org> wrote:
> 
> > y'know, four or five years back I fixed this bug by doing
> > 
> > 	current->locked_page = page;
> > 
> > in the write() code, and then teaching the pagefault code to avoid locking
> > the same page.  Patch below.
> > 
> > But then evil mean Hugh pointed out that the patch is still vulnerable to
> > ab/ba deadlocking so I dropped it.
> 
> And he was right, of course.  Task A holds file a's i_mutex and takes a
> fault against file b's page.  Task B holds file b's i_mutex and takes a
> fault against file a's page.  Drat.
> 
> I wonder if there's a sane way of preventing that.

If you want to go down the path of carrying state around in task_struct,
you can take the mmap_sem and set a flag, then get_user_pages the source
page and lock both source and destination in ascending order, then your
page fault handler checks the flag and skips mmap_sem, and the rest of
your fault path checks both the page locks you're holding.

At which point you arrive at a horrible mess :)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 0/9] buffered write deadlock fix
  2007-01-29 10:31 ` Nick Piggin
@ 2007-02-02 23:52   ` Andrew Morton
  -1 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-02-02 23:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Kernel, Linux Filesystems, Linux Memory Management,
	Fengguang Wu, Suparna Bhattacharya

On Mon, 29 Jan 2007 11:31:37 +0100 (CET)
Nick Piggin <npiggin@suse.de> wrote:

> The following set of patches attempt to fix the buffered write
> locking problems (and there are a couple of peripheral patches
> and cleanups there too).
> 
> Patches against 2.6.20-rc6. I was hoping that 2.6.20-rc6-mm2 would
> be an easier diff with the fsaio patches gone, but the readahead
> rewrite clashes badly :(

Well fsaio is restored, but there's now considerable doubt over it due to
the recent febril febrility.

How bad is the clash with the readahead patches?

Clashes with git-block are likely, too.

Bugfixes come first, so I will drop readahead and fsaio and git-block to get
this work completed if needed - please work agaisnt mainline.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 0/9] buffered write deadlock fix
@ 2007-02-02 23:52   ` Andrew Morton
  0 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-02-02 23:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Kernel, Linux Filesystems, Linux Memory Management,
	Fengguang Wu, Suparna Bhattacharya

On Mon, 29 Jan 2007 11:31:37 +0100 (CET)
Nick Piggin <npiggin@suse.de> wrote:

> The following set of patches attempt to fix the buffered write
> locking problems (and there are a couple of peripheral patches
> and cleanups there too).
> 
> Patches against 2.6.20-rc6. I was hoping that 2.6.20-rc6-mm2 would
> be an easier diff with the fsaio patches gone, but the readahead
> rewrite clashes badly :(

Well fsaio is restored, but there's now considerable doubt over it due to
the recent febril febrility.

How bad is the clash with the readahead patches?

Clashes with git-block are likely, too.

Bugfixes come first, so I will drop readahead and fsaio and git-block to get
this work completed if needed - please work agaisnt mainline.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 1/9] fs: libfs buffered write leak fix
  2007-01-29 10:31   ` Nick Piggin
@ 2007-02-02 23:52     ` Andrew Morton
  -1 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-02-02 23:52 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Mon, 29 Jan 2007 11:31:46 +0100 (CET)
Nick Piggin <npiggin@suse.de> wrote:

> simple_prepare_write and nobh_prepare_write leak uninitialised kernel data.

They do?  Under what situation?

> Fix the former,

How?

> make a note of the latter. Several other filesystems seem
> to be iffy here, too.

Please, tell us what the bug is so that others have a chance of reviewing
and, if needed, fixing those other filesystems.

> --- linux-2.6.orig/fs/libfs.c
> +++ linux-2.6/fs/libfs.c
> @@ -327,32 +327,35 @@ int simple_readpage(struct file *file, s
>  int simple_prepare_write(struct file *file, struct page *page,
>  			unsigned from, unsigned to)
>  {
> -	if (!PageUptodate(page)) {
> -		if (to - from != PAGE_CACHE_SIZE) {
> -			void *kaddr = kmap_atomic(page, KM_USER0);
> -			memset(kaddr, 0, from);
> -			memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
> -			flush_dcache_page(page);
> -			kunmap_atomic(kaddr, KM_USER0);
> -		}
> +	if (PageUptodate(page))
> +		return 0;
> +
> +	if (to - from != PAGE_CACHE_SIZE) {
> +		clear_highpage(page);
> +		flush_dcache_page(page);
>  		SetPageUptodate(page);
>  	}

memclear_highpage_flush() is fashionable.

> ===================================================================
> --- linux-2.6.orig/fs/buffer.c
> +++ linux-2.6/fs/buffer.c
> @@ -2344,6 +2344,8 @@ int nobh_prepare_write(struct page *page
>  
>  	if (is_mapped_to_disk)
>  		SetPageMappedToDisk(page);
> +
> +	/* XXX: information leak vs read(2) */
>  	SetPageUptodate(page);
>  
>  	/*

That comment is too terse to be useful.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 1/9] fs: libfs buffered write leak fix
@ 2007-02-02 23:52     ` Andrew Morton
  0 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-02-02 23:52 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Mon, 29 Jan 2007 11:31:46 +0100 (CET)
Nick Piggin <npiggin@suse.de> wrote:

> simple_prepare_write and nobh_prepare_write leak uninitialised kernel data.

They do?  Under what situation?

> Fix the former,

How?

> make a note of the latter. Several other filesystems seem
> to be iffy here, too.

Please, tell us what the bug is so that others have a chance of reviewing
and, if needed, fixing those other filesystems.

> --- linux-2.6.orig/fs/libfs.c
> +++ linux-2.6/fs/libfs.c
> @@ -327,32 +327,35 @@ int simple_readpage(struct file *file, s
>  int simple_prepare_write(struct file *file, struct page *page,
>  			unsigned from, unsigned to)
>  {
> -	if (!PageUptodate(page)) {
> -		if (to - from != PAGE_CACHE_SIZE) {
> -			void *kaddr = kmap_atomic(page, KM_USER0);
> -			memset(kaddr, 0, from);
> -			memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
> -			flush_dcache_page(page);
> -			kunmap_atomic(kaddr, KM_USER0);
> -		}
> +	if (PageUptodate(page))
> +		return 0;
> +
> +	if (to - from != PAGE_CACHE_SIZE) {
> +		clear_highpage(page);
> +		flush_dcache_page(page);
>  		SetPageUptodate(page);
>  	}

memclear_highpage_flush() is fashionable.

> ===================================================================
> --- linux-2.6.orig/fs/buffer.c
> +++ linux-2.6/fs/buffer.c
> @@ -2344,6 +2344,8 @@ int nobh_prepare_write(struct page *page
>  
>  	if (is_mapped_to_disk)
>  		SetPageMappedToDisk(page);
> +
> +	/* XXX: information leak vs read(2) */
>  	SetPageUptodate(page);
>  
>  	/*

That comment is too terse to be useful.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-01-29 10:33   ` Nick Piggin
@ 2007-02-02 23:53     ` Andrew Morton
  -1 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-02-02 23:53 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Mon, 29 Jan 2007 11:33:03 +0100 (CET)
Nick Piggin <npiggin@suse.de> wrote:

> Modify the core write() code so that it won't take a pagefault while holding a
> lock on the pagecache page. There are a number of different deadlocks possible
> if we try to do such a thing:
> 
> 1.  generic_buffered_write
> 2.   lock_page
> 3.    prepare_write
> 4.     unlock_page+vmtruncate
> 5.     copy_from_user
> 6.      mmap_sem(r)
> 7.       handle_mm_fault
> 8.        lock_page (filemap_nopage)
> 9.    commit_write
> 10.  unlock_page
> 
> a. sys_munmap / sys_mlock / others
> b.  mmap_sem(w)
> c.   make_pages_present
> d.    get_user_pages
> e.     handle_mm_fault
> f.      lock_page (filemap_nopage)
> 
> 2,8	- recursive deadlock if page is same
> 2,8;2,8	- ABBA deadlock is page is different
> 2,6;b,f	- ABBA deadlock if page is same
> 
> The solution is as follows:
> 1.  If we find the destination page is uptodate, continue as normal, but use
>     atomic usercopies which do not take pagefaults and do not zero the uncopied
>     tail of the destination. The destination is already uptodate, so we can
>     commit_write the full length even if there was a partial copy: it does not
>     matter that the tail was not modified, because if it is dirtied and written
>     back to disk it will not cause any problems (uptodate *means* that the
>     destination page is as new or newer than the copy on disk).
> 
> 1a. The above requires that fault_in_pages_readable correctly returns access
>     information, because atomic usercopies cannot distinguish between
>     non-present pages in a readable mapping, from lack of a readable mapping.
> 
> 2.  If we find the destination page is non uptodate, unlock it (this could be
>     made slightly more optimal), then find and pin the source page with
>     get_user_pages. Relock the destination page and continue with the copy.
>     However, instead of a usercopy (which might take a fault), copy the data
>     via the kernel address space.
> 

Oh what a mess we're making :(

Unfortunately, write() into a non-uptodate page is very much the common
case.  We've always tried to avoid doing a pte-walk in the write() path to
fix this bug.  Careful performance testing is needed here so we can assess
the impact.  For threaded applications, simply the taking of mmap_sem might
be the biggest problem.

And I can't think of any tricks we can play to avoid doing the pte-walk in
most cases.  For example, we don't yet have a page to run page_mapped()
against.

>  			break;
>  		}
>  
> +		/*
> +		 * non-uptodate pages cannot cope with short copies, and we
> +		 * cannot take a pagefault with the destination page locked.
> +		 * So pin the source page to copy it.
> +		 */
> +		if (!PageUptodate(page)) {
> +			unlock_page(page);
> +
> +			bytes = min(bytes, PAGE_CACHE_SIZE -
> +				     ((unsigned long)buf & ~PAGE_CACHE_MASK));
> +
> +			/*
> +			 * Cannot get_user_pages with a page locked for the
> +			 * same reason as we can't take a page fault with a
> +			 * page locked (as explained below).
> +			 */
> +			down_read(&current->mm->mmap_sem);
> +			status = get_user_pages(current, current->mm,
> +					(unsigned long)buf & PAGE_CACHE_MASK, 1,
> +					0, 0, &src_page, NULL);
> +			up_read(&current->mm->mmap_sem);
> +			if (status != 1) {
> +				page_cache_release(page);
> +				break;
> +			}
> +
> +			lock_page(page);
> +			if (!page->mapping) {

Hopefully this can't happen?  If it can, who went and took our page off the
mapping?  Reclaim?  The elevated page_count will prevent that?

> +				unlock_page(page);
> +				page_cache_release(page);
> +				page_cache_release(src_page);
> +				continue;
> +			}
> +
> +		}
> +
>  		status = a_ops->prepare_write(file, page, offset, offset+bytes);
>  		if (unlikely(status))
>  			goto fs_write_aop_error;
>  
> -		copied = filemap_copy_from_user(page, offset,
> +		if (!src_page) {
> +			/*
> +			 * Must not enter the pagefault handler here, because
> +			 * we hold the page lock, so we might recursively
> +			 * deadlock on the same lock, or get an ABBA deadlock
> +			 * against a different lock, or against the mmap_sem
> +			 * (which nests outside the page lock).  So increment
> +			 * preempt count, and use _atomic usercopies.
> +			 *
> +			 * The page is uptodate so we are OK to encounter a
> +			 * short copy: if unmodified parts of the page are
> +			 * marked dirty and written out to disk, it doesn't
> +			 * really matter.
> +			 */
> +			pagefault_disable();
> +			copied = filemap_copy_from_user_atomic(page, offset,
>  					cur_iov, nr_segs, iov_offset, bytes);
> +			pagefault_enable();
> +		} else {
> +			char *src, *dst;
> +			src = kmap(src_page);
> +			dst = kmap(page);
> +			memcpy(dst + offset,
> +				src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
> +				bytes);
> +			kunmap(page);
> +			kunmap(src_page);
> +			copied = bytes;
> +		}

As you say, we don't want to do this: taking two kmap()s at the same time
leaves us vulnerable to kmap() deadlocks: we deadlock when there are 512
tasks (LAST_PKMAP) each holding one kmap, and all waiting for someone else
to give one back.




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-02-02 23:53     ` Andrew Morton
  0 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-02-02 23:53 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Mon, 29 Jan 2007 11:33:03 +0100 (CET)
Nick Piggin <npiggin@suse.de> wrote:

> Modify the core write() code so that it won't take a pagefault while holding a
> lock on the pagecache page. There are a number of different deadlocks possible
> if we try to do such a thing:
> 
> 1.  generic_buffered_write
> 2.   lock_page
> 3.    prepare_write
> 4.     unlock_page+vmtruncate
> 5.     copy_from_user
> 6.      mmap_sem(r)
> 7.       handle_mm_fault
> 8.        lock_page (filemap_nopage)
> 9.    commit_write
> 10.  unlock_page
> 
> a. sys_munmap / sys_mlock / others
> b.  mmap_sem(w)
> c.   make_pages_present
> d.    get_user_pages
> e.     handle_mm_fault
> f.      lock_page (filemap_nopage)
> 
> 2,8	- recursive deadlock if page is same
> 2,8;2,8	- ABBA deadlock is page is different
> 2,6;b,f	- ABBA deadlock if page is same
> 
> The solution is as follows:
> 1.  If we find the destination page is uptodate, continue as normal, but use
>     atomic usercopies which do not take pagefaults and do not zero the uncopied
>     tail of the destination. The destination is already uptodate, so we can
>     commit_write the full length even if there was a partial copy: it does not
>     matter that the tail was not modified, because if it is dirtied and written
>     back to disk it will not cause any problems (uptodate *means* that the
>     destination page is as new or newer than the copy on disk).
> 
> 1a. The above requires that fault_in_pages_readable correctly returns access
>     information, because atomic usercopies cannot distinguish between
>     non-present pages in a readable mapping, from lack of a readable mapping.
> 
> 2.  If we find the destination page is non uptodate, unlock it (this could be
>     made slightly more optimal), then find and pin the source page with
>     get_user_pages. Relock the destination page and continue with the copy.
>     However, instead of a usercopy (which might take a fault), copy the data
>     via the kernel address space.
> 

Oh what a mess we're making :(

Unfortunately, write() into a non-uptodate page is very much the common
case.  We've always tried to avoid doing a pte-walk in the write() path to
fix this bug.  Careful performance testing is needed here so we can assess
the impact.  For threaded applications, simply the taking of mmap_sem might
be the biggest problem.

And I can't think of any tricks we can play to avoid doing the pte-walk in
most cases.  For example, we don't yet have a page to run page_mapped()
against.

>  			break;
>  		}
>  
> +		/*
> +		 * non-uptodate pages cannot cope with short copies, and we
> +		 * cannot take a pagefault with the destination page locked.
> +		 * So pin the source page to copy it.
> +		 */
> +		if (!PageUptodate(page)) {
> +			unlock_page(page);
> +
> +			bytes = min(bytes, PAGE_CACHE_SIZE -
> +				     ((unsigned long)buf & ~PAGE_CACHE_MASK));
> +
> +			/*
> +			 * Cannot get_user_pages with a page locked for the
> +			 * same reason as we can't take a page fault with a
> +			 * page locked (as explained below).
> +			 */
> +			down_read(&current->mm->mmap_sem);
> +			status = get_user_pages(current, current->mm,
> +					(unsigned long)buf & PAGE_CACHE_MASK, 1,
> +					0, 0, &src_page, NULL);
> +			up_read(&current->mm->mmap_sem);
> +			if (status != 1) {
> +				page_cache_release(page);
> +				break;
> +			}
> +
> +			lock_page(page);
> +			if (!page->mapping) {

Hopefully this can't happen?  If it can, who went and took our page off the
mapping?  Reclaim?  The elevated page_count will prevent that?

> +				unlock_page(page);
> +				page_cache_release(page);
> +				page_cache_release(src_page);
> +				continue;
> +			}
> +
> +		}
> +
>  		status = a_ops->prepare_write(file, page, offset, offset+bytes);
>  		if (unlikely(status))
>  			goto fs_write_aop_error;
>  
> -		copied = filemap_copy_from_user(page, offset,
> +		if (!src_page) {
> +			/*
> +			 * Must not enter the pagefault handler here, because
> +			 * we hold the page lock, so we might recursively
> +			 * deadlock on the same lock, or get an ABBA deadlock
> +			 * against a different lock, or against the mmap_sem
> +			 * (which nests outside the page lock).  So increment
> +			 * preempt count, and use _atomic usercopies.
> +			 *
> +			 * The page is uptodate so we are OK to encounter a
> +			 * short copy: if unmodified parts of the page are
> +			 * marked dirty and written out to disk, it doesn't
> +			 * really matter.
> +			 */
> +			pagefault_disable();
> +			copied = filemap_copy_from_user_atomic(page, offset,
>  					cur_iov, nr_segs, iov_offset, bytes);
> +			pagefault_enable();
> +		} else {
> +			char *src, *dst;
> +			src = kmap(src_page);
> +			dst = kmap(page);
> +			memcpy(dst + offset,
> +				src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
> +				bytes);
> +			kunmap(page);
> +			kunmap(src_page);
> +			copied = bytes;
> +		}

As you say, we don't want to do this: taking two kmap()s at the same time
leaves us vulnerable to kmap() deadlocks: we deadlock when there are 512
tasks (LAST_PKMAP) each holding one kmap, and all waiting for someone else
to give one back.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 0/9] buffered write deadlock fix
  2007-02-02 23:52   ` Andrew Morton
@ 2007-02-03  1:22     ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-03  1:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Linux Memory Management,
	Fengguang Wu, Suparna Bhattacharya

On Fri, Feb 02, 2007 at 03:52:32PM -0800, Andrew Morton wrote:
> On Mon, 29 Jan 2007 11:31:37 +0100 (CET)
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > The following set of patches attempt to fix the buffered write
> > locking problems (and there are a couple of peripheral patches
> > and cleanups there too).
> > 
> > Patches against 2.6.20-rc6. I was hoping that 2.6.20-rc6-mm2 would
> > be an easier diff with the fsaio patches gone, but the readahead
> > rewrite clashes badly :(
> 
> Well fsaio is restored, but there's now considerable doubt over it due to
> the recent febril febrility.
> 
> How bad is the clash with the readahead patches?

I don't think it would be so bad that one couldn't merge readahead
back on top quite easily... The fsaio ones are a little harder because
they change generic_file_buffered_write.

> Clashes with git-block are likely, too.
> 
> Bugfixes come first, so I will drop readahead and fsaio and git-block to get
> this work completed if needed - please work agaisnt mainline.

OK.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 0/9] buffered write deadlock fix
@ 2007-02-03  1:22     ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-03  1:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Linux Memory Management,
	Fengguang Wu, Suparna Bhattacharya

On Fri, Feb 02, 2007 at 03:52:32PM -0800, Andrew Morton wrote:
> On Mon, 29 Jan 2007 11:31:37 +0100 (CET)
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > The following set of patches attempt to fix the buffered write
> > locking problems (and there are a couple of peripheral patches
> > and cleanups there too).
> > 
> > Patches against 2.6.20-rc6. I was hoping that 2.6.20-rc6-mm2 would
> > be an easier diff with the fsaio patches gone, but the readahead
> > rewrite clashes badly :(
> 
> Well fsaio is restored, but there's now considerable doubt over it due to
> the recent febril febrility.
> 
> How bad is the clash with the readahead patches?

I don't think it would be so bad that one couldn't merge readahead
back on top quite easily... The fsaio ones are a little harder because
they change generic_file_buffered_write.

> Clashes with git-block are likely, too.
> 
> Bugfixes come first, so I will drop readahead and fsaio and git-block to get
> this work completed if needed - please work agaisnt mainline.

OK.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 1/9] fs: libfs buffered write leak fix
  2007-02-02 23:52     ` Andrew Morton
@ 2007-02-03  1:33       ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-03  1:33 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Fri, Feb 02, 2007 at 03:52:36PM -0800, Andrew Morton wrote:
> On Mon, 29 Jan 2007 11:31:46 +0100 (CET)
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > simple_prepare_write and nobh_prepare_write leak uninitialised kernel data.
> 
> They do?  Under what situation?

Yes, I have at least reproduced the libfs leak.

The situation is when you write into a !uptodate page, the prepare_write
function runs SetPageUptodate *before* we have copied data in. Thus you
can read uninitialised data out of there.

SetPageUptodate must not be used (or at least used carefully) in
prepare_write. commit_write is the correct place to do this.

> > Fix the former,
> 
> How?

If doing a partial-write, simply clear the whole page and set it uptodate
(don't need to get too tricky). If doing a full-write, only set it uptodate
in the commit_write.

> > make a note of the latter. Several other filesystems seem
> > to be iffy here, too.
> 
> Please, tell us what the bug is so that others have a chance of reviewing
> and, if needed, fixing those other filesystems.
> 
> > --- linux-2.6.orig/fs/libfs.c
> > +++ linux-2.6/fs/libfs.c
> > @@ -327,32 +327,35 @@ int simple_readpage(struct file *file, s
> >  int simple_prepare_write(struct file *file, struct page *page,
> >  			unsigned from, unsigned to)
> >  {
> > -	if (!PageUptodate(page)) {
> > -		if (to - from != PAGE_CACHE_SIZE) {
> > -			void *kaddr = kmap_atomic(page, KM_USER0);
> > -			memset(kaddr, 0, from);
> > -			memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
> > -			flush_dcache_page(page);
> > -			kunmap_atomic(kaddr, KM_USER0);
> > -		}
> > +	if (PageUptodate(page))
> > +		return 0;
> > +
> > +	if (to - from != PAGE_CACHE_SIZE) {
> > +		clear_highpage(page);
> > +		flush_dcache_page(page);
> >  		SetPageUptodate(page);
> >  	}
> 
> memclear_highpage_flush() is fashionable.

Good one.

> > ===================================================================
> > --- linux-2.6.orig/fs/buffer.c
> > +++ linux-2.6/fs/buffer.c
> > @@ -2344,6 +2344,8 @@ int nobh_prepare_write(struct page *page
> >  
> >  	if (is_mapped_to_disk)
> >  		SetPageMappedToDisk(page);
> > +
> > +	/* XXX: information leak vs read(2) */
> >  	SetPageUptodate(page);
> >  
> >  	/*
> 
> That comment is too terse to be useful.

OK, similar problem here - we have brought all the buffers uptodate
that we are *not* going to write over, or partially write over, but
we can have an uninitialised hole over the region we want to write.

I think just setting page uptodate in commit_write might do the
trick? (and getting rid of the set_page_dirty there).

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 1/9] fs: libfs buffered write leak fix
@ 2007-02-03  1:33       ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-03  1:33 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Fri, Feb 02, 2007 at 03:52:36PM -0800, Andrew Morton wrote:
> On Mon, 29 Jan 2007 11:31:46 +0100 (CET)
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > simple_prepare_write and nobh_prepare_write leak uninitialised kernel data.
> 
> They do?  Under what situation?

Yes, I have at least reproduced the libfs leak.

The situation is when you write into a !uptodate page, the prepare_write
function runs SetPageUptodate *before* we have copied data in. Thus you
can read uninitialised data out of there.

SetPageUptodate must not be used (or at least used carefully) in
prepare_write. commit_write is the correct place to do this.

> > Fix the former,
> 
> How?

If doing a partial-write, simply clear the whole page and set it uptodate
(don't need to get too tricky). If doing a full-write, only set it uptodate
in the commit_write.

> > make a note of the latter. Several other filesystems seem
> > to be iffy here, too.
> 
> Please, tell us what the bug is so that others have a chance of reviewing
> and, if needed, fixing those other filesystems.
> 
> > --- linux-2.6.orig/fs/libfs.c
> > +++ linux-2.6/fs/libfs.c
> > @@ -327,32 +327,35 @@ int simple_readpage(struct file *file, s
> >  int simple_prepare_write(struct file *file, struct page *page,
> >  			unsigned from, unsigned to)
> >  {
> > -	if (!PageUptodate(page)) {
> > -		if (to - from != PAGE_CACHE_SIZE) {
> > -			void *kaddr = kmap_atomic(page, KM_USER0);
> > -			memset(kaddr, 0, from);
> > -			memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
> > -			flush_dcache_page(page);
> > -			kunmap_atomic(kaddr, KM_USER0);
> > -		}
> > +	if (PageUptodate(page))
> > +		return 0;
> > +
> > +	if (to - from != PAGE_CACHE_SIZE) {
> > +		clear_highpage(page);
> > +		flush_dcache_page(page);
> >  		SetPageUptodate(page);
> >  	}
> 
> memclear_highpage_flush() is fashionable.

Good one.

> > ===================================================================
> > --- linux-2.6.orig/fs/buffer.c
> > +++ linux-2.6/fs/buffer.c
> > @@ -2344,6 +2344,8 @@ int nobh_prepare_write(struct page *page
> >  
> >  	if (is_mapped_to_disk)
> >  		SetPageMappedToDisk(page);
> > +
> > +	/* XXX: information leak vs read(2) */
> >  	SetPageUptodate(page);
> >  
> >  	/*
> 
> That comment is too terse to be useful.

OK, similar problem here - we have brought all the buffers uptodate
that we are *not* going to write over, or partially write over, but
we can have an uninitialised hole over the region we want to write.

I think just setting page uptodate in commit_write might do the
trick? (and getting rid of the set_page_dirty there).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-02 23:53     ` Andrew Morton
@ 2007-02-03  1:38       ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-03  1:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Fri, Feb 02, 2007 at 03:53:11PM -0800, Andrew Morton wrote:
> On Mon, 29 Jan 2007 11:33:03 +0100 (CET)
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > Modify the core write() code so that it won't take a pagefault while holding a
> > lock on the pagecache page. There are a number of different deadlocks possible
> > if we try to do such a thing:
> > 
> > 1.  generic_buffered_write
> > 2.   lock_page
> > 3.    prepare_write
> > 4.     unlock_page+vmtruncate
> > 5.     copy_from_user
> > 6.      mmap_sem(r)
> > 7.       handle_mm_fault
> > 8.        lock_page (filemap_nopage)
> > 9.    commit_write
> > 10.  unlock_page
> > 
> > a. sys_munmap / sys_mlock / others
> > b.  mmap_sem(w)
> > c.   make_pages_present
> > d.    get_user_pages
> > e.     handle_mm_fault
> > f.      lock_page (filemap_nopage)
> > 
> > 2,8	- recursive deadlock if page is same
> > 2,8;2,8	- ABBA deadlock is page is different
> > 2,6;b,f	- ABBA deadlock if page is same
> > 
> > The solution is as follows:
> > 1.  If we find the destination page is uptodate, continue as normal, but use
> >     atomic usercopies which do not take pagefaults and do not zero the uncopied
> >     tail of the destination. The destination is already uptodate, so we can
> >     commit_write the full length even if there was a partial copy: it does not
> >     matter that the tail was not modified, because if it is dirtied and written
> >     back to disk it will not cause any problems (uptodate *means* that the
> >     destination page is as new or newer than the copy on disk).
> > 
> > 1a. The above requires that fault_in_pages_readable correctly returns access
> >     information, because atomic usercopies cannot distinguish between
> >     non-present pages in a readable mapping, from lack of a readable mapping.
> > 
> > 2.  If we find the destination page is non uptodate, unlock it (this could be
> >     made slightly more optimal), then find and pin the source page with
> >     get_user_pages. Relock the destination page and continue with the copy.
> >     However, instead of a usercopy (which might take a fault), copy the data
> >     via the kernel address space.
> > 
> 
> Oh what a mess we're making :(
> 
> Unfortunately, write() into a non-uptodate page is very much the common
> case.  We've always tried to avoid doing a pte-walk in the write() path to
> fix this bug.  Careful performance testing is needed here so we can assess
> the impact.  For threaded applications, simply the taking of mmap_sem might
> be the biggest problem.
> 
> And I can't think of any tricks we can play to avoid doing the pte-walk in
> most cases.  For example, we don't yet have a page to run page_mapped()
> against.

After this patch series, I am working on another that will allow filesystems
to specifically code around the problem (eg. by handling short usercopies
properly).

I tried to take this approach generically the first time, but it turns out
lots of filesystems had subtle problems, so if we do it this way instead,
then filesystem developers who actually care enough can improve their
code, and those that don't won't hold them back (or prevent this bug from
being fixed).

> >  			break;
> >  		}
> >  
> > +		/*
> > +		 * non-uptodate pages cannot cope with short copies, and we
> > +		 * cannot take a pagefault with the destination page locked.
> > +		 * So pin the source page to copy it.
> > +		 */
> > +		if (!PageUptodate(page)) {
> > +			unlock_page(page);
> > +
> > +			bytes = min(bytes, PAGE_CACHE_SIZE -
> > +				     ((unsigned long)buf & ~PAGE_CACHE_MASK));
> > +
> > +			/*
> > +			 * Cannot get_user_pages with a page locked for the
> > +			 * same reason as we can't take a page fault with a
> > +			 * page locked (as explained below).
> > +			 */
> > +			down_read(&current->mm->mmap_sem);
> > +			status = get_user_pages(current, current->mm,
> > +					(unsigned long)buf & PAGE_CACHE_MASK, 1,
> > +					0, 0, &src_page, NULL);
> > +			up_read(&current->mm->mmap_sem);
> > +			if (status != 1) {
> > +				page_cache_release(page);
> > +				break;
> > +			}
> > +
> > +			lock_page(page);
> > +			if (!page->mapping) {
> 
> Hopefully this can't happen?  If it can, who went and took our page off the
> mapping?  Reclaim?  The elevated page_count will prevent that?

Truncate/invalidate?

> > +				unlock_page(page);
> > +				page_cache_release(page);
> > +				page_cache_release(src_page);
> > +				continue;
> > +			}
> > +
> > +		}
> > +
> >  		status = a_ops->prepare_write(file, page, offset, offset+bytes);
> >  		if (unlikely(status))
> >  			goto fs_write_aop_error;
> >  
> > -		copied = filemap_copy_from_user(page, offset,
> > +		if (!src_page) {
> > +			/*
> > +			 * Must not enter the pagefault handler here, because
> > +			 * we hold the page lock, so we might recursively
> > +			 * deadlock on the same lock, or get an ABBA deadlock
> > +			 * against a different lock, or against the mmap_sem
> > +			 * (which nests outside the page lock).  So increment
> > +			 * preempt count, and use _atomic usercopies.
> > +			 *
> > +			 * The page is uptodate so we are OK to encounter a
> > +			 * short copy: if unmodified parts of the page are
> > +			 * marked dirty and written out to disk, it doesn't
> > +			 * really matter.
> > +			 */
> > +			pagefault_disable();
> > +			copied = filemap_copy_from_user_atomic(page, offset,
> >  					cur_iov, nr_segs, iov_offset, bytes);
> > +			pagefault_enable();
> > +		} else {
> > +			char *src, *dst;
> > +			src = kmap(src_page);
> > +			dst = kmap(page);
> > +			memcpy(dst + offset,
> > +				src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
> > +				bytes);
> > +			kunmap(page);
> > +			kunmap(src_page);
> > +			copied = bytes;
> > +		}
> 
> As you say, we don't want to do this: taking two kmap()s at the same time
> leaves us vulnerable to kmap() deadlocks: we deadlock when there are 512
> tasks (LAST_PKMAP) each holding one kmap, and all waiting for someone else
> to give one back.

Yep, thinko. My updated version uses KM_USER[01].

Thanks for reviewing so far.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-02-03  1:38       ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-03  1:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Fri, Feb 02, 2007 at 03:53:11PM -0800, Andrew Morton wrote:
> On Mon, 29 Jan 2007 11:33:03 +0100 (CET)
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > Modify the core write() code so that it won't take a pagefault while holding a
> > lock on the pagecache page. There are a number of different deadlocks possible
> > if we try to do such a thing:
> > 
> > 1.  generic_buffered_write
> > 2.   lock_page
> > 3.    prepare_write
> > 4.     unlock_page+vmtruncate
> > 5.     copy_from_user
> > 6.      mmap_sem(r)
> > 7.       handle_mm_fault
> > 8.        lock_page (filemap_nopage)
> > 9.    commit_write
> > 10.  unlock_page
> > 
> > a. sys_munmap / sys_mlock / others
> > b.  mmap_sem(w)
> > c.   make_pages_present
> > d.    get_user_pages
> > e.     handle_mm_fault
> > f.      lock_page (filemap_nopage)
> > 
> > 2,8	- recursive deadlock if page is same
> > 2,8;2,8	- ABBA deadlock is page is different
> > 2,6;b,f	- ABBA deadlock if page is same
> > 
> > The solution is as follows:
> > 1.  If we find the destination page is uptodate, continue as normal, but use
> >     atomic usercopies which do not take pagefaults and do not zero the uncopied
> >     tail of the destination. The destination is already uptodate, so we can
> >     commit_write the full length even if there was a partial copy: it does not
> >     matter that the tail was not modified, because if it is dirtied and written
> >     back to disk it will not cause any problems (uptodate *means* that the
> >     destination page is as new or newer than the copy on disk).
> > 
> > 1a. The above requires that fault_in_pages_readable correctly returns access
> >     information, because atomic usercopies cannot distinguish between
> >     non-present pages in a readable mapping, from lack of a readable mapping.
> > 
> > 2.  If we find the destination page is non uptodate, unlock it (this could be
> >     made slightly more optimal), then find and pin the source page with
> >     get_user_pages. Relock the destination page and continue with the copy.
> >     However, instead of a usercopy (which might take a fault), copy the data
> >     via the kernel address space.
> > 
> 
> Oh what a mess we're making :(
> 
> Unfortunately, write() into a non-uptodate page is very much the common
> case.  We've always tried to avoid doing a pte-walk in the write() path to
> fix this bug.  Careful performance testing is needed here so we can assess
> the impact.  For threaded applications, simply the taking of mmap_sem might
> be the biggest problem.
> 
> And I can't think of any tricks we can play to avoid doing the pte-walk in
> most cases.  For example, we don't yet have a page to run page_mapped()
> against.

After this patch series, I am working on another that will allow filesystems
to specifically code around the problem (eg. by handling short usercopies
properly).

I tried to take this approach generically the first time, but it turns out
lots of filesystems had subtle problems, so if we do it this way instead,
then filesystem developers who actually care enough can improve their
code, and those that don't won't hold them back (or prevent this bug from
being fixed).

> >  			break;
> >  		}
> >  
> > +		/*
> > +		 * non-uptodate pages cannot cope with short copies, and we
> > +		 * cannot take a pagefault with the destination page locked.
> > +		 * So pin the source page to copy it.
> > +		 */
> > +		if (!PageUptodate(page)) {
> > +			unlock_page(page);
> > +
> > +			bytes = min(bytes, PAGE_CACHE_SIZE -
> > +				     ((unsigned long)buf & ~PAGE_CACHE_MASK));
> > +
> > +			/*
> > +			 * Cannot get_user_pages with a page locked for the
> > +			 * same reason as we can't take a page fault with a
> > +			 * page locked (as explained below).
> > +			 */
> > +			down_read(&current->mm->mmap_sem);
> > +			status = get_user_pages(current, current->mm,
> > +					(unsigned long)buf & PAGE_CACHE_MASK, 1,
> > +					0, 0, &src_page, NULL);
> > +			up_read(&current->mm->mmap_sem);
> > +			if (status != 1) {
> > +				page_cache_release(page);
> > +				break;
> > +			}
> > +
> > +			lock_page(page);
> > +			if (!page->mapping) {
> 
> Hopefully this can't happen?  If it can, who went and took our page off the
> mapping?  Reclaim?  The elevated page_count will prevent that?

Truncate/invalidate?

> > +				unlock_page(page);
> > +				page_cache_release(page);
> > +				page_cache_release(src_page);
> > +				continue;
> > +			}
> > +
> > +		}
> > +
> >  		status = a_ops->prepare_write(file, page, offset, offset+bytes);
> >  		if (unlikely(status))
> >  			goto fs_write_aop_error;
> >  
> > -		copied = filemap_copy_from_user(page, offset,
> > +		if (!src_page) {
> > +			/*
> > +			 * Must not enter the pagefault handler here, because
> > +			 * we hold the page lock, so we might recursively
> > +			 * deadlock on the same lock, or get an ABBA deadlock
> > +			 * against a different lock, or against the mmap_sem
> > +			 * (which nests outside the page lock).  So increment
> > +			 * preempt count, and use _atomic usercopies.
> > +			 *
> > +			 * The page is uptodate so we are OK to encounter a
> > +			 * short copy: if unmodified parts of the page are
> > +			 * marked dirty and written out to disk, it doesn't
> > +			 * really matter.
> > +			 */
> > +			pagefault_disable();
> > +			copied = filemap_copy_from_user_atomic(page, offset,
> >  					cur_iov, nr_segs, iov_offset, bytes);
> > +			pagefault_enable();
> > +		} else {
> > +			char *src, *dst;
> > +			src = kmap(src_page);
> > +			dst = kmap(page);
> > +			memcpy(dst + offset,
> > +				src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
> > +				bytes);
> > +			kunmap(page);
> > +			kunmap(src_page);
> > +			copied = bytes;
> > +		}
> 
> As you say, we don't want to do this: taking two kmap()s at the same time
> leaves us vulnerable to kmap() deadlocks: we deadlock when there are 512
> tasks (LAST_PKMAP) each holding one kmap, and all waiting for someone else
> to give one back.

Yep, thinko. My updated version uses KM_USER[01].

Thanks for reviewing so far.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 1/9] fs: libfs buffered write leak fix
  2007-02-03  1:33       ` Nick Piggin
@ 2007-02-03  1:58         ` Andrew Morton
  -1 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-02-03  1:58 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sat, 3 Feb 2007 02:33:16 +0100
Nick Piggin <npiggin@suse.de> wrote:

> > > ===================================================================
> > > --- linux-2.6.orig/fs/buffer.c
> > > +++ linux-2.6/fs/buffer.c
> > > @@ -2344,6 +2344,8 @@ int nobh_prepare_write(struct page *page
> > >  
> > >  	if (is_mapped_to_disk)
> > >  		SetPageMappedToDisk(page);
> > > +
> > > +	/* XXX: information leak vs read(2) */
> > >  	SetPageUptodate(page);
> > >  
> > >  	/*
> > 
> > That comment is too terse to be useful.
> 
> OK, similar problem here - we have brought all the buffers uptodate
> that we are *not* going to write over, or partially write over, but
> we can have an uninitialised hole over the region we want to write.
> 
> I think just setting page uptodate in commit_write might do the
> trick? (and getting rid of the set_page_dirty there).

Yes, the page just isn't uptodate yet in prepare_write() - moving things
to commti_write() sounds sane.

But please, can we have sufficient changelogs and comments in the next version?

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 1/9] fs: libfs buffered write leak fix
@ 2007-02-03  1:58         ` Andrew Morton
  0 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-02-03  1:58 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sat, 3 Feb 2007 02:33:16 +0100
Nick Piggin <npiggin@suse.de> wrote:

> > > ===================================================================
> > > --- linux-2.6.orig/fs/buffer.c
> > > +++ linux-2.6/fs/buffer.c
> > > @@ -2344,6 +2344,8 @@ int nobh_prepare_write(struct page *page
> > >  
> > >  	if (is_mapped_to_disk)
> > >  		SetPageMappedToDisk(page);
> > > +
> > > +	/* XXX: information leak vs read(2) */
> > >  	SetPageUptodate(page);
> > >  
> > >  	/*
> > 
> > That comment is too terse to be useful.
> 
> OK, similar problem here - we have brought all the buffers uptodate
> that we are *not* going to write over, or partially write over, but
> we can have an uninitialised hole over the region we want to write.
> 
> I think just setting page uptodate in commit_write might do the
> trick? (and getting rid of the set_page_dirty there).

Yes, the page just isn't uptodate yet in prepare_write() - moving things
to commti_write() sounds sane.

But please, can we have sufficient changelogs and comments in the next version?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 1/9] fs: libfs buffered write leak fix
  2007-02-03  1:58         ` Andrew Morton
@ 2007-02-03  2:09           ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-03  2:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Fri, Feb 02, 2007 at 05:58:01PM -0800, Andrew Morton wrote:
> On Sat, 3 Feb 2007 02:33:16 +0100
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > I think just setting page uptodate in commit_write might do the
> > trick? (and getting rid of the set_page_dirty there).
> 
> Yes, the page just isn't uptodate yet in prepare_write() - moving things
> to commti_write() sounds sane.
> 
> But please, can we have sufficient changelogs and comments in the next version?

You're right, sorry. Is this any better? (warning: nobh code is untested)

--
simple_prepare_write and nobh_prepare_write leak uninitialised kernel data.
This happens because the prepare_write functions leave an uninitialised
"hole" over the part of the page that the write is expected to go to. This
is fine, but they then mark the page uptodate, which means a concurrent read
can come in and copy the uninitialised memory into userspace before it written
to.

Fix simple_readpage by simply initialising the whole page in the case of a
partial-page write. In the case of a full-page write, we don't SetPageDirty
until commit_write time.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -327,25 +327,32 @@ int simple_readpage(struct file *file, s
 int simple_prepare_write(struct file *file, struct page *page,
 			unsigned from, unsigned to)
 {
-	if (!PageUptodate(page)) {
-		if (to - from != PAGE_CACHE_SIZE) {
-			void *kaddr = kmap_atomic(page, KM_USER0);
-			memset(kaddr, 0, from);
-			memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
-			flush_dcache_page(page);
-			kunmap_atomic(kaddr, KM_USER0);
-		}
+	if (PageUptodate(page))
+		return 0;
+
+	if (to - from != PAGE_CACHE_SIZE) {
+		/*
+		 * Partial-page write? Initialise the complete page and
+		 * set it uptodate. We could avoid initialising the
+		 * (from, to) hole, and opt to mark it uptodate in
+		 * simple_commit_write, but that's probably only a win
+		 * for filesystems that would need to read blocks off disk.
+		 */
+		memclear_highpage_flush(page, 0, PAGE_CACHE_SIZE);
 		SetPageUptodate(page);
 	}
+
 	return 0;
 }
 
 int simple_commit_write(struct file *file, struct page *page,
-			unsigned offset, unsigned to)
+			unsigned from, unsigned to)
 {
 	struct inode *inode = page->mapping->host;
 	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
 
+	if (to - from == PAGE_CACHE_SIZE)
+		SetPageUptodate(page);
 	/*
 	 * No need to use i_size_read() here, the i_size
 	 * cannot change under us because we hold the i_mutex.
@@ -353,6 +360,7 @@ int simple_commit_write(struct file *fil
 	if (pos > inode->i_size)
 		i_size_write(inode, pos);
 	set_page_dirty(page);
+
 	return 0;
 }
 
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2344,17 +2344,6 @@ int nobh_prepare_write(struct page *page
 
 	if (is_mapped_to_disk)
 		SetPageMappedToDisk(page);
-	SetPageUptodate(page);
-
-	/*
-	 * Setting the page dirty here isn't necessary for the prepare_write
-	 * function - commit_write will do that.  But if/when this function is
-	 * used within the pagefault handler to ensure that all mmapped pages
-	 * have backing space in the filesystem, we will need to dirty the page
-	 * if its contents were altered.
-	 */
-	if (dirtied_it)
-		set_page_dirty(page);
 
 	return 0;
 
@@ -2384,6 +2373,7 @@ int nobh_commit_write(struct file *file,
 	struct inode *inode = page->mapping->host;
 	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
 
+	SetPageUptodate(page);
 	set_page_dirty(page);
 	if (pos > inode->i_size) {
 		i_size_write(inode, pos);
Index: linux-2.6/Documentation/filesystems/vfs.txt
===================================================================
--- linux-2.6.orig/Documentation/filesystems/vfs.txt
+++ linux-2.6/Documentation/filesystems/vfs.txt
@@ -617,6 +617,11 @@ struct address_space_operations {
 	In this case the prepare_write will be retried one the lock is
   	regained.
 
+	Note: the page _must not_ be marked uptodate in this function
+	(or anywhere else) unless it actually is uptodate right now. As
+	soon as a page is marked uptodate, it is possible for a concurrent
+	read(2) to copy it to userspace.
+
   commit_write: If prepare_write succeeds, new data will be copied
         into the page and then commit_write will be called.  It will
         typically update the size of the file (if appropriate) and

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 1/9] fs: libfs buffered write leak fix
@ 2007-02-03  2:09           ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-03  2:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Fri, Feb 02, 2007 at 05:58:01PM -0800, Andrew Morton wrote:
> On Sat, 3 Feb 2007 02:33:16 +0100
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > I think just setting page uptodate in commit_write might do the
> > trick? (and getting rid of the set_page_dirty there).
> 
> Yes, the page just isn't uptodate yet in prepare_write() - moving things
> to commti_write() sounds sane.
> 
> But please, can we have sufficient changelogs and comments in the next version?

You're right, sorry. Is this any better? (warning: nobh code is untested)

--
simple_prepare_write and nobh_prepare_write leak uninitialised kernel data.
This happens because the prepare_write functions leave an uninitialised
"hole" over the part of the page that the write is expected to go to. This
is fine, but they then mark the page uptodate, which means a concurrent read
can come in and copy the uninitialised memory into userspace before it written
to.

Fix simple_readpage by simply initialising the whole page in the case of a
partial-page write. In the case of a full-page write, we don't SetPageDirty
until commit_write time.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -327,25 +327,32 @@ int simple_readpage(struct file *file, s
 int simple_prepare_write(struct file *file, struct page *page,
 			unsigned from, unsigned to)
 {
-	if (!PageUptodate(page)) {
-		if (to - from != PAGE_CACHE_SIZE) {
-			void *kaddr = kmap_atomic(page, KM_USER0);
-			memset(kaddr, 0, from);
-			memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
-			flush_dcache_page(page);
-			kunmap_atomic(kaddr, KM_USER0);
-		}
+	if (PageUptodate(page))
+		return 0;
+
+	if (to - from != PAGE_CACHE_SIZE) {
+		/*
+		 * Partial-page write? Initialise the complete page and
+		 * set it uptodate. We could avoid initialising the
+		 * (from, to) hole, and opt to mark it uptodate in
+		 * simple_commit_write, but that's probably only a win
+		 * for filesystems that would need to read blocks off disk.
+		 */
+		memclear_highpage_flush(page, 0, PAGE_CACHE_SIZE);
 		SetPageUptodate(page);
 	}
+
 	return 0;
 }
 
 int simple_commit_write(struct file *file, struct page *page,
-			unsigned offset, unsigned to)
+			unsigned from, unsigned to)
 {
 	struct inode *inode = page->mapping->host;
 	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
 
+	if (to - from == PAGE_CACHE_SIZE)
+		SetPageUptodate(page);
 	/*
 	 * No need to use i_size_read() here, the i_size
 	 * cannot change under us because we hold the i_mutex.
@@ -353,6 +360,7 @@ int simple_commit_write(struct file *fil
 	if (pos > inode->i_size)
 		i_size_write(inode, pos);
 	set_page_dirty(page);
+
 	return 0;
 }
 
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2344,17 +2344,6 @@ int nobh_prepare_write(struct page *page
 
 	if (is_mapped_to_disk)
 		SetPageMappedToDisk(page);
-	SetPageUptodate(page);
-
-	/*
-	 * Setting the page dirty here isn't necessary for the prepare_write
-	 * function - commit_write will do that.  But if/when this function is
-	 * used within the pagefault handler to ensure that all mmapped pages
-	 * have backing space in the filesystem, we will need to dirty the page
-	 * if its contents were altered.
-	 */
-	if (dirtied_it)
-		set_page_dirty(page);
 
 	return 0;
 
@@ -2384,6 +2373,7 @@ int nobh_commit_write(struct file *file,
 	struct inode *inode = page->mapping->host;
 	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
 
+	SetPageUptodate(page);
 	set_page_dirty(page);
 	if (pos > inode->i_size) {
 		i_size_write(inode, pos);
Index: linux-2.6/Documentation/filesystems/vfs.txt
===================================================================
--- linux-2.6.orig/Documentation/filesystems/vfs.txt
+++ linux-2.6/Documentation/filesystems/vfs.txt
@@ -617,6 +617,11 @@ struct address_space_operations {
 	In this case the prepare_write will be retried one the lock is
   	regained.
 
+	Note: the page _must not_ be marked uptodate in this function
+	(or anywhere else) unless it actually is uptodate right now. As
+	soon as a page is marked uptodate, it is possible for a concurrent
+	read(2) to copy it to userspace.
+
   commit_write: If prepare_write succeeds, new data will be copied
         into the page and then commit_write will be called.  It will
         typically update the size of the file (if appropriate) and

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 1/9] fs: libfs buffered write leak fix
  2007-02-03  2:09           ` Nick Piggin
@ 2007-02-03  2:19             ` Andrew Morton
  -1 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-02-03  2:19 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sat, 3 Feb 2007 03:09:26 +0100
Nick Piggin <npiggin@suse.de> wrote:

> From: Nick Piggin <npiggin@suse.de>
> To: Andrew Morton <akpm@osdl.org>

argh.  Yesterday all my emails were getting a mysterious
s/osdl/linux-foundation/ done to them at the server, so I switched everything
over.  Now it would appear that they are getting an equally mysterious
s/linux-foundation/osdl/ done to them.  I assume you sent this to
akpm@linux-foundation.org?


> Cc: Linux Kernel <linux-kernel@vger.kernel.org>, Linux Filesystems <linux-fsdevel@vger.kernel.org>, Linux Memory Management <linux-mm@kvack.org>
> Subject: Re: [patch 1/9] fs: libfs buffered write leak fix
> Date: Sat, 3 Feb 2007 03:09:26 +0100
> User-Agent: Mutt/1.5.9i
> 
> On Fri, Feb 02, 2007 at 05:58:01PM -0800, Andrew Morton wrote:
> > On Sat, 3 Feb 2007 02:33:16 +0100
> > Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > I think just setting page uptodate in commit_write might do the
> > > trick? (and getting rid of the set_page_dirty there).
> > 
> > Yes, the page just isn't uptodate yet in prepare_write() - moving things
> > to commti_write() sounds sane.
> > 
> > But please, can we have sufficient changelogs and comments in the next version?
> 
> You're right, sorry. Is this any better?

yup, thanks.

> (warning: nobh code is untested)

ow.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 1/9] fs: libfs buffered write leak fix
@ 2007-02-03  2:19             ` Andrew Morton
  0 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-02-03  2:19 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sat, 3 Feb 2007 03:09:26 +0100
Nick Piggin <npiggin@suse.de> wrote:

> From: Nick Piggin <npiggin@suse.de>
> To: Andrew Morton <akpm@osdl.org>

argh.  Yesterday all my emails were getting a mysterious
s/osdl/linux-foundation/ done to them at the server, so I switched everything
over.  Now it would appear that they are getting an equally mysterious
s/linux-foundation/osdl/ done to them.  I assume you sent this to
akpm@linux-foundation.org?


> Cc: Linux Kernel <linux-kernel@vger.kernel.org>, Linux Filesystems <linux-fsdevel@vger.kernel.org>, Linux Memory Management <linux-mm@kvack.org>
> Subject: Re: [patch 1/9] fs: libfs buffered write leak fix
> Date: Sat, 3 Feb 2007 03:09:26 +0100
> User-Agent: Mutt/1.5.9i
> 
> On Fri, Feb 02, 2007 at 05:58:01PM -0800, Andrew Morton wrote:
> > On Sat, 3 Feb 2007 02:33:16 +0100
> > Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > I think just setting page uptodate in commit_write might do the
> > > trick? (and getting rid of the set_page_dirty there).
> > 
> > Yes, the page just isn't uptodate yet in prepare_write() - moving things
> > to commti_write() sounds sane.
> > 
> > But please, can we have sufficient changelogs and comments in the next version?
> 
> You're right, sorry. Is this any better?

yup, thanks.

> (warning: nobh code is untested)

ow.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 1/9] fs: libfs buffered write leak fix
  2007-02-03  2:19             ` Andrew Morton
@ 2007-02-03  2:28               ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-03  2:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Fri, Feb 02, 2007 at 06:19:55PM -0800, Andrew Morton wrote:
> On Sat, 3 Feb 2007 03:09:26 +0100
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > From: Nick Piggin <npiggin@suse.de>
> > To: Andrew Morton <akpm@osdl.org>
> 
> argh.  Yesterday all my emails were getting a mysterious
> s/osdl/linux-foundation/ done to them at the server, so I switched everything
> over.  Now it would appear that they are getting an equally mysterious
> s/linux-foundation/osdl/ done to them.  I assume you sent this to
> akpm@linux-foundation.org?

No. Your first reply I got to this patch came as linux-foundantion, and
that's what I replied to. Your subsequent reply back to me ("Yes, the page
just isn't uptodate yet..."), came from osdl.org, which is what I replied
to.

> > Cc: Linux Kernel <linux-kernel@vger.kernel.org>, Linux Filesystems <linux-fsdevel@vger.kernel.org>, Linux Memory Management <linux-mm@kvack.org>
> > Subject: Re: [patch 1/9] fs: libfs buffered write leak fix
> > Date: Sat, 3 Feb 2007 03:09:26 +0100
> > User-Agent: Mutt/1.5.9i
> > 
> > On Fri, Feb 02, 2007 at 05:58:01PM -0800, Andrew Morton wrote:
> > > On Sat, 3 Feb 2007 02:33:16 +0100
> > > Nick Piggin <npiggin@suse.de> wrote:
> > > 
> > > > I think just setting page uptodate in commit_write might do the
> > > > trick? (and getting rid of the set_page_dirty there).
> > > 
> > > Yes, the page just isn't uptodate yet in prepare_write() - moving things
> > > to commti_write() sounds sane.
> > > 
> > > But please, can we have sufficient changelogs and comments in the next version?
> > 
> > You're right, sorry. Is this any better?
> 
> yup, thanks.
> 
> > (warning: nobh code is untested)
> 
> ow.

I'll get a chance to do that later today. I have to fire up the old test
case and see if I can reproduce the problem with nobh on a real fs...
Will get back to you when I do.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 1/9] fs: libfs buffered write leak fix
@ 2007-02-03  2:28               ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-03  2:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Fri, Feb 02, 2007 at 06:19:55PM -0800, Andrew Morton wrote:
> On Sat, 3 Feb 2007 03:09:26 +0100
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > From: Nick Piggin <npiggin@suse.de>
> > To: Andrew Morton <akpm@osdl.org>
> 
> argh.  Yesterday all my emails were getting a mysterious
> s/osdl/linux-foundation/ done to them at the server, so I switched everything
> over.  Now it would appear that they are getting an equally mysterious
> s/linux-foundation/osdl/ done to them.  I assume you sent this to
> akpm@linux-foundation.org?

No. Your first reply I got to this patch came as linux-foundantion, and
that's what I replied to. Your subsequent reply back to me ("Yes, the page
just isn't uptodate yet..."), came from osdl.org, which is what I replied
to.

> > Cc: Linux Kernel <linux-kernel@vger.kernel.org>, Linux Filesystems <linux-fsdevel@vger.kernel.org>, Linux Memory Management <linux-mm@kvack.org>
> > Subject: Re: [patch 1/9] fs: libfs buffered write leak fix
> > Date: Sat, 3 Feb 2007 03:09:26 +0100
> > User-Agent: Mutt/1.5.9i
> > 
> > On Fri, Feb 02, 2007 at 05:58:01PM -0800, Andrew Morton wrote:
> > > On Sat, 3 Feb 2007 02:33:16 +0100
> > > Nick Piggin <npiggin@suse.de> wrote:
> > > 
> > > > I think just setting page uptodate in commit_write might do the
> > > > trick? (and getting rid of the set_page_dirty there).
> > > 
> > > Yes, the page just isn't uptodate yet in prepare_write() - moving things
> > > to commti_write() sounds sane.
> > > 
> > > But please, can we have sufficient changelogs and comments in the next version?
> > 
> > You're right, sorry. Is this any better?
> 
> yup, thanks.
> 
> > (warning: nobh code is untested)
> 
> ow.

I'll get a chance to do that later today. I have to fire up the old test
case and see if I can reproduce the problem with nobh on a real fs...
Will get back to you when I do.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 0/9] buffered write deadlock fix
  2007-02-02 23:52   ` Andrew Morton
@ 2007-02-03  6:43     ` Suparna Bhattacharya
  -1 siblings, 0 replies; 105+ messages in thread
From: Suparna Bhattacharya @ 2007-02-03  6:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Linux Kernel, Linux Filesystems,
	Linux Memory Management, Fengguang Wu

On Fri, Feb 02, 2007 at 03:52:32PM -0800, Andrew Morton wrote:
> On Mon, 29 Jan 2007 11:31:37 +0100 (CET)
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > The following set of patches attempt to fix the buffered write
> > locking problems (and there are a couple of peripheral patches
> > and cleanups there too).
> > 
> > Patches against 2.6.20-rc6. I was hoping that 2.6.20-rc6-mm2 would
> > be an easier diff with the fsaio patches gone, but the readahead
> > rewrite clashes badly :(
> 
> Well fsaio is restored, but there's now considerable doubt over it due to
> the recent febril febrility.

I think Ingo made a point earlier about letting the old co-exist with the
new. Fibrils + kevents have great potential for a next generation
solution but we need to give the whole story some time to play out and prove
it in practice, debate and benchmark the alternative combinations, optimize it
for various workloads etc.  It will also take more work on top before we
can get the whole POSIX AIO implementation supported on top of this. I'll be
very happy when that happens ... it is just that it is still too early to
be sure.

Since this is going to be a new interface, not the existing linux AIO
interface, I do not see any conflict between the two. Samba4 already uses
fsaio, and we now have the ability to do POSIX AIO over kernel AIO (which
depends on fsaio). The more we delay real world usage the longer we take
to learn about the application patterns that matter. And it is those
patterns that are key.

> 
> How bad is the clash with the readahead patches?
> 
> Clashes with git-block are likely, too.
> 
> Bugfixes come first, so I will drop readahead and fsaio and git-block to get
> this work completed if needed - please work agaisnt mainline.

If you need help with fixing the clashes, please let me know.

Regards
Suparna

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 0/9] buffered write deadlock fix
@ 2007-02-03  6:43     ` Suparna Bhattacharya
  0 siblings, 0 replies; 105+ messages in thread
From: Suparna Bhattacharya @ 2007-02-03  6:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Linux Kernel, Linux Filesystems,
	Linux Memory Management, Fengguang Wu

On Fri, Feb 02, 2007 at 03:52:32PM -0800, Andrew Morton wrote:
> On Mon, 29 Jan 2007 11:31:37 +0100 (CET)
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > The following set of patches attempt to fix the buffered write
> > locking problems (and there are a couple of peripheral patches
> > and cleanups there too).
> > 
> > Patches against 2.6.20-rc6. I was hoping that 2.6.20-rc6-mm2 would
> > be an easier diff with the fsaio patches gone, but the readahead
> > rewrite clashes badly :(
> 
> Well fsaio is restored, but there's now considerable doubt over it due to
> the recent febril febrility.

I think Ingo made a point earlier about letting the old co-exist with the
new. Fibrils + kevents have great potential for a next generation
solution but we need to give the whole story some time to play out and prove
it in practice, debate and benchmark the alternative combinations, optimize it
for various workloads etc.  It will also take more work on top before we
can get the whole POSIX AIO implementation supported on top of this. I'll be
very happy when that happens ... it is just that it is still too early to
be sure.

Since this is going to be a new interface, not the existing linux AIO
interface, I do not see any conflict between the two. Samba4 already uses
fsaio, and we now have the ability to do POSIX AIO over kernel AIO (which
depends on fsaio). The more we delay real world usage the longer we take
to learn about the application patterns that matter. And it is those
patterns that are key.

> 
> How bad is the clash with the readahead patches?
> 
> Clashes with git-block are likely, too.
> 
> Bugfixes come first, so I will drop readahead and fsaio and git-block to get
> this work completed if needed - please work agaisnt mainline.

If you need help with fixing the clashes, please let me know.

Regards
Suparna

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 0/9] buffered write deadlock fix
@ 2007-02-03 15:31       ` Fengguang Wu
  0 siblings, 0 replies; 105+ messages in thread
From: Fengguang Wu @ 2007-02-03 15:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Linux Kernel, Linux Filesystems,
	Linux Memory Management, Suparna Bhattacharya

On Fri, Feb 02, 2007 at 03:52:32PM -0800, Andrew Morton wrote:
> Bugfixes come first, so I will drop readahead and fsaio and git-block to get
> this work completed if needed - please work agaisnt mainline.

OK with readahead.

There are too much fixes in the series.  I'd like to fold them up and
update some change logs. And then there would be one more update.

Regards,
Wu

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 0/9] buffered write deadlock fix
@ 2007-02-03 15:31       ` Fengguang Wu
  0 siblings, 0 replies; 105+ messages in thread
From: Fengguang Wu @ 2007-02-03 15:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Linux Kernel, Linux Filesystems,
	Linux Memory Management, Suparna Bhattacharya

On Fri, Feb 02, 2007 at 03:52:32PM -0800, Andrew Morton wrote:
> Bugfixes come first, so I will drop readahead and fsaio and git-block to get
> this work completed if needed - please work agaisnt mainline.

OK with readahead.

There are too much fixes in the series.  I'd like to fold them up and
update some change logs. And then there would be one more update.

Regards,
Wu

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 1/9] fs: libfs buffered write leak fix
  2007-02-03  1:33       ` Nick Piggin
@ 2007-02-03 17:49         ` Jörn Engel
  -1 siblings, 0 replies; 105+ messages in thread
From: Jörn Engel @ 2007-02-03 17:49 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linux Kernel, Linux Filesystems, Linux Memory Management

On Sat, 3 February 2007 02:33:16 +0100, Nick Piggin wrote:
> 
> If doing a partial-write, simply clear the whole page and set it uptodate
> (don't need to get too tricky).

That sounds just like a bug I recently fixed in logfs.  prepare_write()
would clear the page, commit_write() would write the whole page.  Bug
can be reproduced with a simple testcate:

echo -n foo > foo
echo -n bar >> foo
cat foo

With the bug, the second write will replace "foo" with "\0\0\0" and
cat will return "bar".  Doing a read instead of clearing the page will
return "foobar", as would be expected.

Can you hit the same bug with your patch or did I miss something?

Jörn

-- 
When people work hard for you for a pat on the back, you've got
to give them that pat.
-- Robert Heinlein

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 1/9] fs: libfs buffered write leak fix
@ 2007-02-03 17:49         ` Jörn Engel
  0 siblings, 0 replies; 105+ messages in thread
From: Jörn Engel @ 2007-02-03 17:49 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linux Kernel, Linux Filesystems, Linux Memory Management

On Sat, 3 February 2007 02:33:16 +0100, Nick Piggin wrote:
> 
> If doing a partial-write, simply clear the whole page and set it uptodate
> (don't need to get too tricky).

That sounds just like a bug I recently fixed in logfs.  prepare_write()
would clear the page, commit_write() would write the whole page.  Bug
can be reproduced with a simple testcate:

echo -n foo > foo
echo -n bar >> foo
cat foo

With the bug, the second write will replace "foo" with "\0\0\0" and
cat will return "bar".  Doing a read instead of clearing the page will
return "foobar", as would be expected.

Can you hit the same bug with your patch or did I miss something?

JA?rn

-- 
When people work hard for you for a pat on the back, you've got
to give them that pat.
-- Robert Heinlein

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 1/9] fs: libfs buffered write leak fix
  2007-02-03 17:49         ` Jörn Engel
  (?)
@ 2007-02-04  3:55           ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-04  3:55 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Andrew Morton, Linux Kernel, Linux Filesystems, Linux Memory Management

On Sat, Feb 03, 2007 at 05:49:47PM +0000, Jörn Engel wrote:
> On Sat, 3 February 2007 02:33:16 +0100, Nick Piggin wrote:
> > 
> > If doing a partial-write, simply clear the whole page and set it uptodate
> > (don't need to get too tricky).
> 
> That sounds just like a bug I recently fixed in logfs.  prepare_write()
> would clear the page, commit_write() would write the whole page.  Bug
> can be reproduced with a simple testcate:
> 
> echo -n foo > foo
> echo -n bar >> foo
> cat foo
> 
> With the bug, the second write will replace "foo" with "\0\0\0" and
> cat will return "bar".  Doing a read instead of clearing the page will
> return "foobar", as would be expected.
> 
> Can you hit the same bug with your patch or did I miss something?

Yes, the page is only cleared if it is not uptodate. This is fine
for the simple filesystems.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 1/9] fs: libfs buffered write leak fix
@ 2007-02-04  3:55           ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-04  3:55 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Andrew Morton, Linux Kernel, Linux Filesystems, Linux Memory Management

On Sat, Feb 03, 2007 at 05:49:47PM +0000, Jörn Engel wrote:
> On Sat, 3 February 2007 02:33:16 +0100, Nick Piggin wrote:
> > 
> > If doing a partial-write, simply clear the whole page and set it uptodate
> > (don't need to get too tricky).
> 
> That sounds just like a bug I recently fixed in logfs.  prepare_write()
> would clear the page, commit_write() would write the whole page.  Bug
> can be reproduced with a simple testcate:
> 
> echo -n foo > foo
> echo -n bar >> foo
> cat foo
> 
> With the bug, the second write will replace "foo" with "\0\0\0" and
> cat will return "bar".  Doing a read instead of clearing the page will
> return "foobar", as would be expected.
> 
> Can you hit the same bug with your patch or did I miss something?

Yes, the page is only cleared if it is not uptodate. This is fine
for the simple filesystems.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 1/9] fs: libfs buffered write leak fix
@ 2007-02-04  3:55           ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-04  3:55 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Andrew Morton, Linux Kernel, Linux Filesystems, Linux Memory Management

On Sat, Feb 03, 2007 at 05:49:47PM +0000, Jorn Engel wrote:
> On Sat, 3 February 2007 02:33:16 +0100, Nick Piggin wrote:
> > 
> > If doing a partial-write, simply clear the whole page and set it uptodate
> > (don't need to get too tricky).
> 
> That sounds just like a bug I recently fixed in logfs.  prepare_write()
> would clear the page, commit_write() would write the whole page.  Bug
> can be reproduced with a simple testcate:
> 
> echo -n foo > foo
> echo -n bar >> foo
> cat foo
> 
> With the bug, the second write will replace "foo" with "\0\0\0" and
> cat will return "bar".  Doing a read instead of clearing the page will
> return "foobar", as would be expected.
> 
> Can you hit the same bug with your patch or did I miss something?

Yes, the page is only cleared if it is not uptodate. This is fine
for the simple filesystems.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-06  2:09             ` Nick Piggin
@ 2007-02-06 13:13               ` Anton Altaparmakov
  -1 siblings, 0 replies; 105+ messages in thread
From: Anton Altaparmakov @ 2007-02-06 13:13 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linux Kernel, Linux Filesystems, Linux Memory Management

On Tue, 2007-02-06 at 03:09 +0100, Nick Piggin wrote:
> On Sun, Feb 04, 2007 at 05:40:35PM +0000, Anton Altaparmakov wrote:
> > On Sun, 4 Feb 2007, Andrew Morton wrote:
> > > truncate's OK: we're holding i_mutex.
> > 
> > How about excluding readpage() (in addition to truncate if Nick is right  
> > and some cases of truncate do not hold i_mutex) with an extra page flag as
> > I proposed for truncate exclusion?  Then it would not matter that
> > prepare_write might have allocated blocks and might expose stale data.    
> > It would go to sleep and wait on the bit to be cleared instead of trying  
> > to bring the page uptodate.  It can then lock the page and either find it 
> > uptodate (because commit_write did it) or not and then bring it uptodate.
> > 
> > Then we could safely fault in the page, copy from it into a temporary 
> > page, then lock the destination page again and copy into it.
> > 
> > This is getting more involved as a patch again...  )-:  But at least it   
> > does not affect the common case except for having to check the new page 
> > flag in every readpage() and truncate() call.  But at least the checks 
> > could be with an "if (unlikely(newpageflag()))" so should not be too bad.
> > 
> > Have I missed anything this time?
> 
> Yes. If you have a flag to exclude readpage(), then you must also
> exclude filemap_nopage, in which case it is still deadlocky.

Ouch, you are of course right.  )-:

Best regards,

        Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-02-06 13:13               ` Anton Altaparmakov
  0 siblings, 0 replies; 105+ messages in thread
From: Anton Altaparmakov @ 2007-02-06 13:13 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linux Kernel, Linux Filesystems, Linux Memory Management

On Tue, 2007-02-06 at 03:09 +0100, Nick Piggin wrote:
> On Sun, Feb 04, 2007 at 05:40:35PM +0000, Anton Altaparmakov wrote:
> > On Sun, 4 Feb 2007, Andrew Morton wrote:
> > > truncate's OK: we're holding i_mutex.
> > 
> > How about excluding readpage() (in addition to truncate if Nick is right  
> > and some cases of truncate do not hold i_mutex) with an extra page flag as
> > I proposed for truncate exclusion?  Then it would not matter that
> > prepare_write might have allocated blocks and might expose stale data.    
> > It would go to sleep and wait on the bit to be cleared instead of trying  
> > to bring the page uptodate.  It can then lock the page and either find it 
> > uptodate (because commit_write did it) or not and then bring it uptodate.
> > 
> > Then we could safely fault in the page, copy from it into a temporary 
> > page, then lock the destination page again and copy into it.
> > 
> > This is getting more involved as a patch again...  )-:  But at least it   
> > does not affect the common case except for having to check the new page 
> > flag in every readpage() and truncate() call.  But at least the checks 
> > could be with an "if (unlikely(newpageflag()))" so should not be too bad.
> > 
> > Have I missed anything this time?
> 
> Yes. If you have a flag to exclude readpage(), then you must also
> exclude filemap_nopage, in which case it is still deadlocky.

Ouch, you are of course right.  )-:

Best regards,

        Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-06  5:49                             ` Nick Piggin
@ 2007-02-06  5:53                               ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-06  5:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Tue, Feb 06, 2007 at 06:49:05AM +0100, Nick Piggin wrote:
> > - If the get_user() doesn't fault, and if we're copying from and to the
> >   same page, we know that we've locked it, so nobody will be able to unmap
> >   it while we're copying from it.
> > 
> > Close, but no cigar!  This is still vulnerable to Hugh's ab/ba deadlock
> > scenario.
> 
> Yes I considered this too. Hard isn't it?

BTW, there are two different abba deadlocks. It's all documented in the
patch 9/9 description.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-02-06  5:53                               ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-06  5:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Tue, Feb 06, 2007 at 06:49:05AM +0100, Nick Piggin wrote:
> > - If the get_user() doesn't fault, and if we're copying from and to the
> >   same page, we know that we've locked it, so nobody will be able to unmap
> >   it while we're copying from it.
> > 
> > Close, but no cigar!  This is still vulnerable to Hugh's ab/ba deadlock
> > scenario.
> 
> Yes I considered this too. Hard isn't it?

BTW, there are two different abba deadlocks. It's all documented in the
patch 9/9 description.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-06  5:30                           ` Andrew Morton
@ 2007-02-06  5:49                             ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-06  5:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Mon, Feb 05, 2007 at 09:30:06PM -0800, Andrew Morton wrote:
> On Tue, 6 Feb 2007 05:41:46 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > > 
> > > Not necessarily -- they could read from one part of a page and write to
> > > another. I see this as the biggest data corruption problem.
> 
> The kernel gets that sort of thing wrong anyway, and always has, because
> it uses memcpy()-style copying and not memmove()-style.
> 
> I can't imagine what sort of application you're envisaging here.  The
> problem was only ever observed from userspace by an artificial stress-test
> thing.

No, I'm not talking about writing into a page with memory from the same
page. I'm talking about one process writing to part of a file, and another
reading from that same file (different offset).

If they happen to be covered by the same page, then the reader can see
zeroes.

I'm not envisaging any sort of application, all I know is that there are
several (related) data corruption bugs and I'm trying to fix them (and
fix these deadlock problems without introducing more).

> > And in fact, it is not just transient errors either. This problem can
> > add permanent corruption into the pagecache and onto disk, and it doesn't
> > even require two processes to race.
> > 
> > After zeroing out the uncopied part of the page, and attempting to loop
> > again, we might bail out of the loop for any reason before completing the
> > rest of the copy, leaving the pagecache corrupted, which will soon go out
> > to disk.
> > 
> 
> Only because ->commit_write() went and incorrectly marked parts of the page
> as up-to-date.
> 
> Zeroing out the fag end of the copy_from_user() on fault is actually incorrect. 

Yes, I know.

> What we _should_ do is to bring those uncopyable, non-uptodate parts of the
> page uptodate rather than zeroing them.  ->readpage() does that.
> 
> So...  what happens if we do
> 
> 	lock_page()
> 	prepare_write()
> 	if (copy_from_user_atomic()) {
> 		readpage()
> 		wait_on_page()
> 		lock_page()
> 	}
> 	commit_write()
> 	unlock_page()
> 
> - If the page has no buffers then it is either fully uptodate or fully
>   not uptodate.  In the former case, don't call readpage at all.  In the
>   latter case, readpage() is the correct thing to do.
> 
> - If the page had buffers, then readpage() won't touch the uptodate ones
>   and will bring the non-uptodate ones up to date from disk.
> 
>   Some of the data which we copied from userspace may get overwritten
>   from backing store, but that's OK.
> 
> seems crazy, but it's close.  We do have the minor problem that readpage
> went and unlocked the page so we need to relock it.  I bet there are holes
> in there.

Yes, I tried doing this as well and there are holes in it. Even supposing
that we add a readpage_dontunlock, there is still the issue of breaking
the filesystem API from nesting readpage inside prepare_write. You also
do need to zero newly allocated blocks, for example.

> Idea #42: after we've locked the pagecache page, do an atomic get_user()
> against the source page(s) before attempting the copy_from_user().  If that
> faults, don't run prepare_write or anything else: drop the page lock and
> try again.
> 
> Because
> 
> - If the get_user() faults, it might be because the page we're copying
>   from and to is the same page, and someone went and unmapped it: deadlock.
> 
> - If the get_user() doesn't fault, and if we're copying from and to the
>   same page, we know that we've locked it, so nobody will be able to unmap
>   it while we're copying from it.
> 
> Close, but no cigar!  This is still vulnerable to Hugh's ab/ba deadlock
> scenario.

Yes I considered this too. Hard isn't it?

> btw, to fix the writev() performance problem we may need to go off and run
> get_user() against up to 1024 separate user pages before locking the
> pagecache page, which sounds fairly idiotic.  Are you doing that in the
> implemetnations which you've been working on?  I forget...

No, because in my fix it can do non-atomic usercopies for !uptodate pages.

For uptodate pages, yes there is a possibility that it may do a short copy
and have to retry, but it is probably safe to bet that the source data is
fairly recently accessed in most cases, so a short copy will be unlikely.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-02-06  5:49                             ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-06  5:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Mon, Feb 05, 2007 at 09:30:06PM -0800, Andrew Morton wrote:
> On Tue, 6 Feb 2007 05:41:46 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > > 
> > > Not necessarily -- they could read from one part of a page and write to
> > > another. I see this as the biggest data corruption problem.
> 
> The kernel gets that sort of thing wrong anyway, and always has, because
> it uses memcpy()-style copying and not memmove()-style.
> 
> I can't imagine what sort of application you're envisaging here.  The
> problem was only ever observed from userspace by an artificial stress-test
> thing.

No, I'm not talking about writing into a page with memory from the same
page. I'm talking about one process writing to part of a file, and another
reading from that same file (different offset).

If they happen to be covered by the same page, then the reader can see
zeroes.

I'm not envisaging any sort of application, all I know is that there are
several (related) data corruption bugs and I'm trying to fix them (and
fix these deadlock problems without introducing more).

> > And in fact, it is not just transient errors either. This problem can
> > add permanent corruption into the pagecache and onto disk, and it doesn't
> > even require two processes to race.
> > 
> > After zeroing out the uncopied part of the page, and attempting to loop
> > again, we might bail out of the loop for any reason before completing the
> > rest of the copy, leaving the pagecache corrupted, which will soon go out
> > to disk.
> > 
> 
> Only because ->commit_write() went and incorrectly marked parts of the page
> as up-to-date.
> 
> Zeroing out the fag end of the copy_from_user() on fault is actually incorrect. 

Yes, I know.

> What we _should_ do is to bring those uncopyable, non-uptodate parts of the
> page uptodate rather than zeroing them.  ->readpage() does that.
> 
> So...  what happens if we do
> 
> 	lock_page()
> 	prepare_write()
> 	if (copy_from_user_atomic()) {
> 		readpage()
> 		wait_on_page()
> 		lock_page()
> 	}
> 	commit_write()
> 	unlock_page()
> 
> - If the page has no buffers then it is either fully uptodate or fully
>   not uptodate.  In the former case, don't call readpage at all.  In the
>   latter case, readpage() is the correct thing to do.
> 
> - If the page had buffers, then readpage() won't touch the uptodate ones
>   and will bring the non-uptodate ones up to date from disk.
> 
>   Some of the data which we copied from userspace may get overwritten
>   from backing store, but that's OK.
> 
> seems crazy, but it's close.  We do have the minor problem that readpage
> went and unlocked the page so we need to relock it.  I bet there are holes
> in there.

Yes, I tried doing this as well and there are holes in it. Even supposing
that we add a readpage_dontunlock, there is still the issue of breaking
the filesystem API from nesting readpage inside prepare_write. You also
do need to zero newly allocated blocks, for example.

> Idea #42: after we've locked the pagecache page, do an atomic get_user()
> against the source page(s) before attempting the copy_from_user().  If that
> faults, don't run prepare_write or anything else: drop the page lock and
> try again.
> 
> Because
> 
> - If the get_user() faults, it might be because the page we're copying
>   from and to is the same page, and someone went and unmapped it: deadlock.
> 
> - If the get_user() doesn't fault, and if we're copying from and to the
>   same page, we know that we've locked it, so nobody will be able to unmap
>   it while we're copying from it.
> 
> Close, but no cigar!  This is still vulnerable to Hugh's ab/ba deadlock
> scenario.

Yes I considered this too. Hard isn't it?

> btw, to fix the writev() performance problem we may need to go off and run
> get_user() against up to 1024 separate user pages before locking the
> pagecache page, which sounds fairly idiotic.  Are you doing that in the
> implemetnations which you've been working on?  I forget...

No, because in my fix it can do non-atomic usercopies for !uptodate pages.

For uptodate pages, yes there is a possibility that it may do a short copy
and have to retry, but it is probably safe to bet that the source data is
fairly recently accessed in most cases, so a short copy will be unlikely.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-06  4:41                         ` Nick Piggin
@ 2007-02-06  5:30                           ` Andrew Morton
  -1 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-02-06  5:30 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Tue, 6 Feb 2007 05:41:46 +0100 Nick Piggin <npiggin@suse.de> wrote:

> On Tue, Feb 06, 2007 at 03:25:49AM +0100, Nick Piggin wrote:
> > On Sun, Feb 04, 2007 at 10:36:20AM -0800, Andrew Morton wrote:
> > > On Sun, 4 Feb 2007 16:10:51 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > > 
> > > > They're not likely to hit the deadlocks, either. Probability gets more
> > > > likely after my patch to lock the page in the fault path. But practially,
> > > > we could live without that too, because the data corruption it fixes is
> > > > very rare as well. Which is exactly what we've been doing quite happily
> > > > for most of 2.6, including all distro kernels (I think).
> > > 
> > > Thing is, an application which is relying on the contents of that page is
> > > already unreliable (or really peculiar), because it can get indeterminate
> > > results anyway.
> > 
> > Not necessarily -- they could read from one part of a page and write to
> > another. I see this as the biggest data corruption problem.

The kernel gets that sort of thing wrong anyway, and always has, because
it uses memcpy()-style copying and not memmove()-style.

I can't imagine what sort of application you're envisaging here.  The
problem was only ever observed from userspace by an artificial stress-test
thing.

> And in fact, it is not just transient errors either. This problem can
> add permanent corruption into the pagecache and onto disk, and it doesn't
> even require two processes to race.
> 
> After zeroing out the uncopied part of the page, and attempting to loop
> again, we might bail out of the loop for any reason before completing the
> rest of the copy, leaving the pagecache corrupted, which will soon go out
> to disk.
> 

Only because ->commit_write() went and incorrectly marked parts of the page
as up-to-date.

Zeroing out the fag end of the copy_from_user() on fault is actually incorrect. 
What we _should_ do is to bring those uncopyable, non-uptodate parts of the
page uptodate rather than zeroing them.  ->readpage() does that.

So...  what happens if we do

	lock_page()
	prepare_write()
	if (copy_from_user_atomic()) {
		readpage()
		wait_on_page()
		lock_page()
	}
	commit_write()
	unlock_page()

- If the page has no buffers then it is either fully uptodate or fully
  not uptodate.  In the former case, don't call readpage at all.  In the
  latter case, readpage() is the correct thing to do.

- If the page had buffers, then readpage() won't touch the uptodate ones
  and will bring the non-uptodate ones up to date from disk.

  Some of the data which we copied from userspace may get overwritten
  from backing store, but that's OK.

seems crazy, but it's close.  We do have the minor problem that readpage
went and unlocked the page so we need to relock it.  I bet there are holes
in there.




Idea #42: after we've locked the pagecache page, do an atomic get_user()
against the source page(s) before attempting the copy_from_user().  If that
faults, don't run prepare_write or anything else: drop the page lock and
try again.

Because

- If the get_user() faults, it might be because the page we're copying
  from and to is the same page, and someone went and unmapped it: deadlock.

- If the get_user() doesn't fault, and if we're copying from and to the
  same page, we know that we've locked it, so nobody will be able to unmap
  it while we're copying from it.

Close, but no cigar!  This is still vulnerable to Hugh's ab/ba deadlock
scenario.


btw, to fix the writev() performance problem we may need to go off and run
get_user() against up to 1024 separate user pages before locking the
pagecache page, which sounds fairly idiotic.  Are you doing that in the
implemetnations which you've been working on?  I forget...

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-02-06  5:30                           ` Andrew Morton
  0 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-02-06  5:30 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Tue, 6 Feb 2007 05:41:46 +0100 Nick Piggin <npiggin@suse.de> wrote:

> On Tue, Feb 06, 2007 at 03:25:49AM +0100, Nick Piggin wrote:
> > On Sun, Feb 04, 2007 at 10:36:20AM -0800, Andrew Morton wrote:
> > > On Sun, 4 Feb 2007 16:10:51 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > > 
> > > > They're not likely to hit the deadlocks, either. Probability gets more
> > > > likely after my patch to lock the page in the fault path. But practially,
> > > > we could live without that too, because the data corruption it fixes is
> > > > very rare as well. Which is exactly what we've been doing quite happily
> > > > for most of 2.6, including all distro kernels (I think).
> > > 
> > > Thing is, an application which is relying on the contents of that page is
> > > already unreliable (or really peculiar), because it can get indeterminate
> > > results anyway.
> > 
> > Not necessarily -- they could read from one part of a page and write to
> > another. I see this as the biggest data corruption problem.

The kernel gets that sort of thing wrong anyway, and always has, because
it uses memcpy()-style copying and not memmove()-style.

I can't imagine what sort of application you're envisaging here.  The
problem was only ever observed from userspace by an artificial stress-test
thing.

> And in fact, it is not just transient errors either. This problem can
> add permanent corruption into the pagecache and onto disk, and it doesn't
> even require two processes to race.
> 
> After zeroing out the uncopied part of the page, and attempting to loop
> again, we might bail out of the loop for any reason before completing the
> rest of the copy, leaving the pagecache corrupted, which will soon go out
> to disk.
> 

Only because ->commit_write() went and incorrectly marked parts of the page
as up-to-date.

Zeroing out the fag end of the copy_from_user() on fault is actually incorrect. 
What we _should_ do is to bring those uncopyable, non-uptodate parts of the
page uptodate rather than zeroing them.  ->readpage() does that.

So...  what happens if we do

	lock_page()
	prepare_write()
	if (copy_from_user_atomic()) {
		readpage()
		wait_on_page()
		lock_page()
	}
	commit_write()
	unlock_page()

- If the page has no buffers then it is either fully uptodate or fully
  not uptodate.  In the former case, don't call readpage at all.  In the
  latter case, readpage() is the correct thing to do.

- If the page had buffers, then readpage() won't touch the uptodate ones
  and will bring the non-uptodate ones up to date from disk.

  Some of the data which we copied from userspace may get overwritten
  from backing store, but that's OK.

seems crazy, but it's close.  We do have the minor problem that readpage
went and unlocked the page so we need to relock it.  I bet there are holes
in there.




Idea #42: after we've locked the pagecache page, do an atomic get_user()
against the source page(s) before attempting the copy_from_user().  If that
faults, don't run prepare_write or anything else: drop the page lock and
try again.

Because

- If the get_user() faults, it might be because the page we're copying
  from and to is the same page, and someone went and unmapped it: deadlock.

- If the get_user() doesn't fault, and if we're copying from and to the
  same page, we know that we've locked it, so nobody will be able to unmap
  it while we're copying from it.

Close, but no cigar!  This is still vulnerable to Hugh's ab/ba deadlock
scenario.


btw, to fix the writev() performance problem we may need to go off and run
get_user() against up to 1024 separate user pages before locking the
pagecache page, which sounds fairly idiotic.  Are you doing that in the
implemetnations which you've been working on?  I forget...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-06  2:25                       ` Nick Piggin
@ 2007-02-06  4:41                         ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-06  4:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Tue, Feb 06, 2007 at 03:25:49AM +0100, Nick Piggin wrote:
> On Sun, Feb 04, 2007 at 10:36:20AM -0800, Andrew Morton wrote:
> > On Sun, 4 Feb 2007 16:10:51 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > They're not likely to hit the deadlocks, either. Probability gets more
> > > likely after my patch to lock the page in the fault path. But practially,
> > > we could live without that too, because the data corruption it fixes is
> > > very rare as well. Which is exactly what we've been doing quite happily
> > > for most of 2.6, including all distro kernels (I think).
> > 
> > Thing is, an application which is relying on the contents of that page is
> > already unreliable (or really peculiar), because it can get indeterminate
> > results anyway.
> 
> Not necessarily -- they could read from one part of a page and write to
> another. I see this as the biggest data corruption problem.

And in fact, it is not just transient errors either. This problem can
add permanent corruption into the pagecache and onto disk, and it doesn't
even require two processes to race.

After zeroing out the uncopied part of the page, and attempting to loop
again, we might bail out of the loop for any reason before completing the
rest of the copy, leaving the pagecache corrupted, which will soon go out
to disk.

Nick

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-02-06  4:41                         ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-06  4:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Tue, Feb 06, 2007 at 03:25:49AM +0100, Nick Piggin wrote:
> On Sun, Feb 04, 2007 at 10:36:20AM -0800, Andrew Morton wrote:
> > On Sun, 4 Feb 2007 16:10:51 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > They're not likely to hit the deadlocks, either. Probability gets more
> > > likely after my patch to lock the page in the fault path. But practially,
> > > we could live without that too, because the data corruption it fixes is
> > > very rare as well. Which is exactly what we've been doing quite happily
> > > for most of 2.6, including all distro kernels (I think).
> > 
> > Thing is, an application which is relying on the contents of that page is
> > already unreliable (or really peculiar), because it can get indeterminate
> > results anyway.
> 
> Not necessarily -- they could read from one part of a page and write to
> another. I see this as the biggest data corruption problem.

And in fact, it is not just transient errors either. This problem can
add permanent corruption into the pagecache and onto disk, and it doesn't
even require two processes to race.

After zeroing out the uncopied part of the page, and attempting to loop
again, we might bail out of the loop for any reason before completing the
rest of the copy, leaving the pagecache corrupted, which will soon go out
to disk.

Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 18:36                     ` Andrew Morton
@ 2007-02-06  2:25                       ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-06  2:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, Feb 04, 2007 at 10:36:20AM -0800, Andrew Morton wrote:
> On Sun, 4 Feb 2007 16:10:51 +0100 Nick Piggin <npiggin@suse.de> wrote:
> 
> > They're not likely to hit the deadlocks, either. Probability gets more
> > likely after my patch to lock the page in the fault path. But practially,
> > we could live without that too, because the data corruption it fixes is
> > very rare as well. Which is exactly what we've been doing quite happily
> > for most of 2.6, including all distro kernels (I think).
> 
> Thing is, an application which is relying on the contents of that page is
> already unreliable (or really peculiar), because it can get indeterminate
> results anyway.

Not necessarily -- they could read from one part of a page and write to
another. I see this as the biggest data corruption problem.

But even in the case where they can get indeterminate results, they can
still determine what the results *won't* be. Eg. they might use a single
byte for a flag or something.

> > ...
> > 
> > On a P4 Xeon, SMP kernel, on a tmpfs filesystem, a 1GB dd if=/dev/zero write
> > had the following performance (higher is worse):
> > 
> > Orig kernel			New kernel
> > new file (no pagecache)
> > 4K  blocks 1.280s		1.287s (+0.5%)
> > 64K blocks 1.090s		1.105s (+1.4%)
> > notrunc (uptodate pagecache)
> > 4K  blocks 0.976s		1.001s (+0.5%)
> > 64K blocks 0.780s		0.792s (+1.5%)
> > 
> > [numbers are better than +/- 0.005]
> > 
> > So we lose somewhere between half and one and a half of one percent
> > performance in a pagecache write intensive workload.
> 
> That's not too bad - caches are fast.  Did you look at optimising the
> handling of that temp page, ensure that we always use the same page?  I
> guess the page allocator per-cpu-pages thing is being good here.

Yeah it should be doing a reasonable job.

> I'm not sure how, though.  Park a copy in the task_struct, just as an
> experiment.  But that'd de-optimise multiple-tasks-writing-on-the-same-cpu.
> Maybe a per-cpu thing?  Largely duplicates the page allocator's per-cpu-pages.

Putting a copy in the task_struct won't do much I figure, except saving
a copule of interrupt enable/disable, and being more wasteful of memory
and cache-hotness.

Per-cpu doesn't work because we can't hold preempt off over the usercopy
(well, we *could* do it in a loop together with fault_in_pages, but that
just adds to the icache bloat).

> 
> Of course, we're also increasing caceh footprint, which this test won't
> show.

We are indeed. At least we release the hot page back to the allocator
very quickly that it can be reused.

The upshot is that your writev performance will be improved :)

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-02-06  2:25                       ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-06  2:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, Feb 04, 2007 at 10:36:20AM -0800, Andrew Morton wrote:
> On Sun, 4 Feb 2007 16:10:51 +0100 Nick Piggin <npiggin@suse.de> wrote:
> 
> > They're not likely to hit the deadlocks, either. Probability gets more
> > likely after my patch to lock the page in the fault path. But practially,
> > we could live without that too, because the data corruption it fixes is
> > very rare as well. Which is exactly what we've been doing quite happily
> > for most of 2.6, including all distro kernels (I think).
> 
> Thing is, an application which is relying on the contents of that page is
> already unreliable (or really peculiar), because it can get indeterminate
> results anyway.

Not necessarily -- they could read from one part of a page and write to
another. I see this as the biggest data corruption problem.

But even in the case where they can get indeterminate results, they can
still determine what the results *won't* be. Eg. they might use a single
byte for a flag or something.

> > ...
> > 
> > On a P4 Xeon, SMP kernel, on a tmpfs filesystem, a 1GB dd if=/dev/zero write
> > had the following performance (higher is worse):
> > 
> > Orig kernel			New kernel
> > new file (no pagecache)
> > 4K  blocks 1.280s		1.287s (+0.5%)
> > 64K blocks 1.090s		1.105s (+1.4%)
> > notrunc (uptodate pagecache)
> > 4K  blocks 0.976s		1.001s (+0.5%)
> > 64K blocks 0.780s		0.792s (+1.5%)
> > 
> > [numbers are better than +/- 0.005]
> > 
> > So we lose somewhere between half and one and a half of one percent
> > performance in a pagecache write intensive workload.
> 
> That's not too bad - caches are fast.  Did you look at optimising the
> handling of that temp page, ensure that we always use the same page?  I
> guess the page allocator per-cpu-pages thing is being good here.

Yeah it should be doing a reasonable job.

> I'm not sure how, though.  Park a copy in the task_struct, just as an
> experiment.  But that'd de-optimise multiple-tasks-writing-on-the-same-cpu.
> Maybe a per-cpu thing?  Largely duplicates the page allocator's per-cpu-pages.

Putting a copy in the task_struct won't do much I figure, except saving
a copule of interrupt enable/disable, and being more wasteful of memory
and cache-hotness.

Per-cpu doesn't work because we can't hold preempt off over the usercopy
(well, we *could* do it in a loop together with fault_in_pages, but that
just adds to the icache bloat).

> 
> Of course, we're also increasing caceh footprint, which this test won't
> show.

We are indeed. At least we release the hot page back to the allocator
very quickly that it can be reused.

The upshot is that your writev performance will be improved :)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 17:40           ` Anton Altaparmakov
@ 2007-02-06  2:09             ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-06  2:09 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: Andrew Morton, Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, Feb 04, 2007 at 05:40:35PM +0000, Anton Altaparmakov wrote:
> On Sun, 4 Feb 2007, Andrew Morton wrote:
> > truncate's OK: we're holding i_mutex.
> 
> How about excluding readpage() (in addition to truncate if Nick is right  
> and some cases of truncate do not hold i_mutex) with an extra page flag as
> I proposed for truncate exclusion?  Then it would not matter that
> prepare_write might have allocated blocks and might expose stale data.    
> It would go to sleep and wait on the bit to be cleared instead of trying  
> to bring the page uptodate.  It can then lock the page and either find it 
> uptodate (because commit_write did it) or not and then bring it uptodate.
> 
> Then we could safely fault in the page, copy from it into a temporary 
> page, then lock the destination page again and copy into it.
> 
> This is getting more involved as a patch again...  )-:  But at least it   
> does not affect the common case except for having to check the new page 
> flag in every readpage() and truncate() call.  But at least the checks 
> could be with an "if (unlikely(newpageflag()))" so should not be too bad.
> 
> Have I missed anything this time?

Yes. If you have a flag to exclude readpage(), then you must also
exclude filemap_nopage, in which case it is still deadlocky.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-02-06  2:09             ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-06  2:09 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: Andrew Morton, Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, Feb 04, 2007 at 05:40:35PM +0000, Anton Altaparmakov wrote:
> On Sun, 4 Feb 2007, Andrew Morton wrote:
> > truncate's OK: we're holding i_mutex.
> 
> How about excluding readpage() (in addition to truncate if Nick is right  
> and some cases of truncate do not hold i_mutex) with an extra page flag as
> I proposed for truncate exclusion?  Then it would not matter that
> prepare_write might have allocated blocks and might expose stale data.    
> It would go to sleep and wait on the bit to be cleared instead of trying  
> to bring the page uptodate.  It can then lock the page and either find it 
> uptodate (because commit_write did it) or not and then bring it uptodate.
> 
> Then we could safely fault in the page, copy from it into a temporary 
> page, then lock the destination page again and copy into it.
> 
> This is getting more involved as a patch again...  )-:  But at least it   
> does not affect the common case except for having to check the new page 
> flag in every readpage() and truncate() call.  But at least the checks 
> could be with an "if (unlikely(newpageflag()))" so should not be too bad.
> 
> Have I missed anything this time?

Yes. If you have a flag to exclude readpage(), then you must also
exclude filemap_nopage, in which case it is still deadlocky.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 15:10                   ` Nick Piggin
@ 2007-02-04 18:36                     ` Andrew Morton
  -1 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-02-04 18:36 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, 4 Feb 2007 16:10:51 +0100 Nick Piggin <npiggin@suse.de> wrote:

> On Sun, Feb 04, 2007 at 03:15:49AM -0800, Andrew Morton wrote:
> > On Sun, 4 Feb 2007 12:03:17 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > On Sun, Feb 04, 2007 at 02:56:02AM -0800, Andrew Morton wrote:
> > > > On Sun, 4 Feb 2007 11:46:09 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > > > 
> > > > If that recollection is right, I think we could afford to reintroduce that
> > > > problem, frankly.  Especially as it only happens in the incredibly rare
> > > > case of that get_user()ed page getting unmapped under our feet.
> > > 
> > > Dang. I was hoping to fix it without introducing data corruption.
> > 
> > Well.  It's a compromise.  Being practical about it, I reeeealy doubt that
> > anyone will hit this combination of circumstances.
> 
> They're not likely to hit the deadlocks, either. Probability gets more
> likely after my patch to lock the page in the fault path. But practially,
> we could live without that too, because the data corruption it fixes is
> very rare as well. Which is exactly what we've been doing quite happily
> for most of 2.6, including all distro kernels (I think).

Thing is, an application which is relying on the contents of that page is
already unreliable (or really peculiar), because it can get indeterminate
results anyway.

> ...
> 
> On a P4 Xeon, SMP kernel, on a tmpfs filesystem, a 1GB dd if=/dev/zero write
> had the following performance (higher is worse):
> 
> Orig kernel			New kernel
> new file (no pagecache)
> 4K  blocks 1.280s		1.287s (+0.5%)
> 64K blocks 1.090s		1.105s (+1.4%)
> notrunc (uptodate pagecache)
> 4K  blocks 0.976s		1.001s (+0.5%)
> 64K blocks 0.780s		0.792s (+1.5%)
> 
> [numbers are better than +/- 0.005]
> 
> So we lose somewhere between half and one and a half of one percent
> performance in a pagecache write intensive workload.

That's not too bad - caches are fast.  Did you look at optimising the
handling of that temp page, ensure that we always use the same page?  I
guess the page allocator per-cpu-pages thing is being good here.

I'm not sure how, though.  Park a copy in the task_struct, just as an
experiment.  But that'd de-optimise multiple-tasks-writing-on-the-same-cpu.
Maybe a per-cpu thing?  Largely duplicates the page allocator's per-cpu-pages.

Of course, we're also increasing caceh footprint, which this test won't
show.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-02-04 18:36                     ` Andrew Morton
  0 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-02-04 18:36 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, 4 Feb 2007 16:10:51 +0100 Nick Piggin <npiggin@suse.de> wrote:

> On Sun, Feb 04, 2007 at 03:15:49AM -0800, Andrew Morton wrote:
> > On Sun, 4 Feb 2007 12:03:17 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > On Sun, Feb 04, 2007 at 02:56:02AM -0800, Andrew Morton wrote:
> > > > On Sun, 4 Feb 2007 11:46:09 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > > > 
> > > > If that recollection is right, I think we could afford to reintroduce that
> > > > problem, frankly.  Especially as it only happens in the incredibly rare
> > > > case of that get_user()ed page getting unmapped under our feet.
> > > 
> > > Dang. I was hoping to fix it without introducing data corruption.
> > 
> > Well.  It's a compromise.  Being practical about it, I reeeealy doubt that
> > anyone will hit this combination of circumstances.
> 
> They're not likely to hit the deadlocks, either. Probability gets more
> likely after my patch to lock the page in the fault path. But practially,
> we could live without that too, because the data corruption it fixes is
> very rare as well. Which is exactly what we've been doing quite happily
> for most of 2.6, including all distro kernels (I think).

Thing is, an application which is relying on the contents of that page is
already unreliable (or really peculiar), because it can get indeterminate
results anyway.

> ...
> 
> On a P4 Xeon, SMP kernel, on a tmpfs filesystem, a 1GB dd if=/dev/zero write
> had the following performance (higher is worse):
> 
> Orig kernel			New kernel
> new file (no pagecache)
> 4K  blocks 1.280s		1.287s (+0.5%)
> 64K blocks 1.090s		1.105s (+1.4%)
> notrunc (uptodate pagecache)
> 4K  blocks 0.976s		1.001s (+0.5%)
> 64K blocks 0.780s		0.792s (+1.5%)
> 
> [numbers are better than +/- 0.005]
> 
> So we lose somewhere between half and one and a half of one percent
> performance in a pagecache write intensive workload.

That's not too bad - caches are fast.  Did you look at optimising the
handling of that temp page, ensure that we always use the same page?  I
guess the page allocator per-cpu-pages thing is being good here.

I'm not sure how, though.  Park a copy in the task_struct, just as an
experiment.  But that'd de-optimise multiple-tasks-writing-on-the-same-cpu.
Maybe a per-cpu thing?  Largely duplicates the page allocator's per-cpu-pages.

Of course, we're also increasing caceh footprint, which this test won't
show.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 11:10         ` Andrew Morton
@ 2007-02-04 17:40           ` Anton Altaparmakov
  -1 siblings, 0 replies; 105+ messages in thread
From: Anton Altaparmakov @ 2007-02-04 17:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, 4 Feb 2007, Andrew Morton wrote:
> On Sun, 4 Feb 2007 10:59:58 +0000 (GMT) Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> > On Sun, 4 Feb 2007, Andrew Morton wrote:
> > > On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin <npiggin@suse.de> wrote:
> > > > 2.  If we find the destination page is non uptodate, unlock it (this could be
> > > >     made slightly more optimal), then find and pin the source page with
> > > >     get_user_pages. Relock the destination page and continue with the copy.
> > > >     However, instead of a usercopy (which might take a fault), copy the data
> > > >     via the kernel address space.
> > > 
> > > argh.  We just can't go adding all this gunk into the write() path. 
> > > 
> > > mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every page. 
> > > Even single-process write() will suffer, let along multithreaded stuff,
> > > where mmap_sem contention may be the bigger problem.
> > > 
> > > I was going to do some quick measurements of this, but the code oopses
> > > on power4 (http://userweb.kernel.org/~akpm/s5000402.jpg)
> > > 
> > > We need to think different.
> > 
> > How about leaving the existing code with the following minor 
> > modifications:
> > 
> > Instead of calling filemap_copy_from_user{,_iovec}() do only the atomic 
> > bit with pagefaults disabled, i.e. instead of filemap_copy_from_user() we 
> > would do (could of course move into a helper function of course):
> > 
> > pagefault_disable()
> > kaddr = kmap_atomic(page, KM_USER0);
> > left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
> > kunmap_atomic(kaddr, KM_USER0);
> > pagefault_enable()
> > 
> > if (unlikely(left)) {
> > 	/* The user space page got unmapped before we got to copy it. */
> > 	...
> > }
> > 
> > Thus the 99.999% (or more!) of the time the code would just work as it 
> > always has and there is no bug and no speed impact.  Only in the very rare 
> > and hard to trigger race condition that the user space page after being 
> > faulted in got thrown out again before we did the atomic memory copy do we 
> > run into the above "..." code path.
> 
> Right.  And what I wanted to do here is to zero out the uncopied part of
> the page (if it wasn't uptodate), then run commit_write(), then retry the
> whole thing.
> 
> iirc, we ruled that out because those temporary zeroes are exposed to
> userspace.  But the kernel used to do that anyway for a long time (years)
> until someone noticed, and we'll only do it in your 0.0001% case anyway.
> 
> (Actually, perhaps we can prevent it by not marking the page uptodate in
> this case.  But that'll cause a read()er to try to bring it uptodate...)

My thinking was not marking the page uptodate.  But yes that causes the 
problem of a concurrent readpage reading uninitialized disk blocks that 
prepare_write allocated.

> > I would propose to call out a different function altogether which could do 
> > a multitude of things including drop the lock on the destination page 
> > (maintaining a reference on the page!), allocate a temporary page, copy 
> > from the user space page into the temporary page, then lock the 
> > destination page again, and copy from the temporary page into the 
> > destination page.
> 
> The problem with all these things is that as soon as we unlock the page,
> it's visible to read().  And in fact, as soon as we mark it uptodate it's
> visible to mmap, even if it's still locked.

Yes we definitely cannot mark it uptodate before it really is uptodate.
Either we have to not mark it uptodate or we have to zero it or we have to
think of some other magic...

> > This would be slow but who cares given it would only happen incredibly 
> > rarely and on majority of machines it would never happen at all.
> > 
> > The only potential problem I can see is that the destination page could be 
> > truncated whilst it is unlocked.  I can see two possible solutions to 
> > this:
> 
> truncate's OK: we're holding i_mutex.

How about excluding readpage() (in addition to truncate if Nick is right  
and some cases of truncate do not hold i_mutex) with an extra page flag as
I proposed for truncate exclusion?  Then it would not matter that
prepare_write might have allocated blocks and might expose stale data.    
It would go to sleep and wait on the bit to be cleared instead of trying  
to bring the page uptodate.  It can then lock the page and either find it 
uptodate (because commit_write did it) or not and then bring it uptodate.

Then we could safely fault in the page, copy from it into a temporary 
page, then lock the destination page again and copy into it.

This is getting more involved as a patch again...  )-:  But at least it   
does not affect the common case except for having to check the new page 
flag in every readpage() and truncate() call.  But at least the checks 
could be with an "if (unlikely(newpageflag()))" so should not be too bad.

Have I missed anything this time?

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-02-04 17:40           ` Anton Altaparmakov
  0 siblings, 0 replies; 105+ messages in thread
From: Anton Altaparmakov @ 2007-02-04 17:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, 4 Feb 2007, Andrew Morton wrote:
> On Sun, 4 Feb 2007 10:59:58 +0000 (GMT) Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> > On Sun, 4 Feb 2007, Andrew Morton wrote:
> > > On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin <npiggin@suse.de> wrote:
> > > > 2.  If we find the destination page is non uptodate, unlock it (this could be
> > > >     made slightly more optimal), then find and pin the source page with
> > > >     get_user_pages. Relock the destination page and continue with the copy.
> > > >     However, instead of a usercopy (which might take a fault), copy the data
> > > >     via the kernel address space.
> > > 
> > > argh.  We just can't go adding all this gunk into the write() path. 
> > > 
> > > mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every page. 
> > > Even single-process write() will suffer, let along multithreaded stuff,
> > > where mmap_sem contention may be the bigger problem.
> > > 
> > > I was going to do some quick measurements of this, but the code oopses
> > > on power4 (http://userweb.kernel.org/~akpm/s5000402.jpg)
> > > 
> > > We need to think different.
> > 
> > How about leaving the existing code with the following minor 
> > modifications:
> > 
> > Instead of calling filemap_copy_from_user{,_iovec}() do only the atomic 
> > bit with pagefaults disabled, i.e. instead of filemap_copy_from_user() we 
> > would do (could of course move into a helper function of course):
> > 
> > pagefault_disable()
> > kaddr = kmap_atomic(page, KM_USER0);
> > left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
> > kunmap_atomic(kaddr, KM_USER0);
> > pagefault_enable()
> > 
> > if (unlikely(left)) {
> > 	/* The user space page got unmapped before we got to copy it. */
> > 	...
> > }
> > 
> > Thus the 99.999% (or more!) of the time the code would just work as it 
> > always has and there is no bug and no speed impact.  Only in the very rare 
> > and hard to trigger race condition that the user space page after being 
> > faulted in got thrown out again before we did the atomic memory copy do we 
> > run into the above "..." code path.
> 
> Right.  And what I wanted to do here is to zero out the uncopied part of
> the page (if it wasn't uptodate), then run commit_write(), then retry the
> whole thing.
> 
> iirc, we ruled that out because those temporary zeroes are exposed to
> userspace.  But the kernel used to do that anyway for a long time (years)
> until someone noticed, and we'll only do it in your 0.0001% case anyway.
> 
> (Actually, perhaps we can prevent it by not marking the page uptodate in
> this case.  But that'll cause a read()er to try to bring it uptodate...)

My thinking was not marking the page uptodate.  But yes that causes the 
problem of a concurrent readpage reading uninitialized disk blocks that 
prepare_write allocated.

> > I would propose to call out a different function altogether which could do 
> > a multitude of things including drop the lock on the destination page 
> > (maintaining a reference on the page!), allocate a temporary page, copy 
> > from the user space page into the temporary page, then lock the 
> > destination page again, and copy from the temporary page into the 
> > destination page.
> 
> The problem with all these things is that as soon as we unlock the page,
> it's visible to read().  And in fact, as soon as we mark it uptodate it's
> visible to mmap, even if it's still locked.

Yes we definitely cannot mark it uptodate before it really is uptodate.
Either we have to not mark it uptodate or we have to zero it or we have to
think of some other magic...

> > This would be slow but who cares given it would only happen incredibly 
> > rarely and on majority of machines it would never happen at all.
> > 
> > The only potential problem I can see is that the destination page could be 
> > truncated whilst it is unlocked.  I can see two possible solutions to 
> > this:
> 
> truncate's OK: we're holding i_mutex.

How about excluding readpage() (in addition to truncate if Nick is right  
and some cases of truncate do not hold i_mutex) with an extra page flag as
I proposed for truncate exclusion?  Then it would not matter that
prepare_write might have allocated blocks and might expose stale data.    
It would go to sleep and wait on the bit to be cleared instead of trying  
to bring the page uptodate.  It can then lock the page and either find it 
uptodate (because commit_write did it) or not and then bring it uptodate.

Then we could safely fault in the page, copy from it into a temporary 
page, then lock the destination page again and copy into it.

This is getting more involved as a patch again...  )-:  But at least it   
does not affect the common case except for having to check the new page 
flag in every readpage() and truncate() call.  But at least the checks 
could be with an "if (unlikely(newpageflag()))" so should not be too bad.

Have I missed anything this time?

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 11:15                 ` Andrew Morton
@ 2007-02-04 15:10                   ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-04 15:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, Feb 04, 2007 at 03:15:49AM -0800, Andrew Morton wrote:
> On Sun, 4 Feb 2007 12:03:17 +0100 Nick Piggin <npiggin@suse.de> wrote:
> 
> > On Sun, Feb 04, 2007 at 02:56:02AM -0800, Andrew Morton wrote:
> > > On Sun, 4 Feb 2007 11:46:09 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > > 
> > > If that recollection is right, I think we could afford to reintroduce that
> > > problem, frankly.  Especially as it only happens in the incredibly rare
> > > case of that get_user()ed page getting unmapped under our feet.
> > 
> > Dang. I was hoping to fix it without introducing data corruption.
> 
> Well.  It's a compromise.  Being practical about it, I reeeealy doubt that
> anyone will hit this combination of circumstances.

They're not likely to hit the deadlocks, either. Probability gets more
likely after my patch to lock the page in the fault path. But practially,
we could live without that too, because the data corruption it fixes is
very rare as well. Which is exactly what we've been doing quite happily
for most of 2.6, including all distro kernels (I think).

But (sadly) for me, there is no compromise. I may be known as the clown
who tries outlandish things to shave a few atomic ops and interrupt flag
changes in the page allocator, or make the pagecache lockless. However
I can't be happy even making something faster if correctness is < 100.0%,
even if less likely than hardware failure.

> > > > > > but you introduce the theoretical memory deadlock
> > > > > > where a task cannot reclaim its own memory.
> > > > > 
> > > > > Nah, that'll never happen - both pages are already allocated.
> > > > 
> > > > Both pages? I don't get it.
> > > > 
> > > > You set the don't-reclaim vma flag, then run get_user, which takes a
> > > > page fault and potentially has to allocate N pages for pagetables,
> > > > pagecache readahead, buffers and fs private data and pagecache radix
> > > > tree nodes for all of the pages read in.
> > > 
> > > Oh, OK.  Need to do the get_user() twice then.  Once before taking that new
> > > rwsem.
> > 
> > Race condition remains.
> 
> No, not in a million years.

There is a huge window. Think about what other activity will be holding
that very rwsem for write, that you'll have to wait for in the race window.
But you could say that's also a question of practicality, because it is
pretty unlikely to do anything bad even if you do hit the race.

Well if fixing this is just going to be flat-out vetoed on performance
reasons then I can't argue, because it does impact performance.

Thinking about the numbers, if your kernel's reliability is already the
same order of magnitude as reliability of commodity hardware, then
trading a bit of performance for a bit of reliability is a BAD tradeoff
if you are at all interested in performance on commodity hardware. That
is especially true if you have a massive redundant cluster of commodity
systems, which I understand is a fairly big market for Linux. The 
X-nines guys who would disagree are probably a tiny niche for Linux.

So I do understand your argument for praticality, even if I don't agree.

For the record, here is the "temporary page" fix that should fix it
properly. And some numbers.

Nick

--

Modify the core write() code so that it won't take a pagefault while holding a
lock on the pagecache page. There are a number of different deadlocks possible
if we try to do such a thing:

1.  generic_buffered_write
2.   lock_page
3.    prepare_write
4.     unlock_page+vmtruncate
5.     copy_from_user
6.      mmap_sem(r)
7.       handle_mm_fault
8.        lock_page (filemap_nopage)
9.    commit_write
10.  unlock_page

a. sys_munmap / sys_mlock / others
b.  mmap_sem(w)
c.   make_pages_present
d.    get_user_pages
e.     handle_mm_fault
f.      lock_page (filemap_nopage)

2,8	- recursive deadlock if page is same
2,8;2,8	- ABBA deadlock is page is different
2,6;b,f	- ABBA deadlock if page is same

The solution is as follows:
1.  If we find the destination page is uptodate, continue as normal, but use
    atomic usercopies which do not take pagefaults and do not zero the uncopied
    tail of the destination. The destination is already uptodate, so we can
    commit_write the full length even if there was a partial copy: it does not
    matter that the tail was not modified, because if it is dirtied and written
    back to disk it will not cause any problems (uptodate *means* that the
    destination page is as new or newer than the copy on disk).

1a. The above requires that fault_in_pages_readable correctly returns access
    information, because atomic usercopies cannot distinguish between
    non-present pages in a readable mapping, from lack of a readable mapping.

2.  If we find the destination page is non uptodate, unlock it (this could be
    made slightly more optimal), then allocate a temporary page to copy the
    source data into. Relock the destination page and continue with the copy.
    However, instead of a usercopy (which might take a fault), copy the data
    from the pinned temporary page via the kernel address space.

(also, rename maxlen to seglen, because it was confusing)

On a P4 Xeon, SMP kernel, on a tmpfs filesystem, a 1GB dd if=/dev/zero write
had the following performance (higher is worse):

Orig kernel			New kernel
new file (no pagecache)
4K  blocks 1.280s		1.287s (+0.5%)
64K blocks 1.090s		1.105s (+1.4%)
notrunc (uptodate pagecache)
4K  blocks 0.976s		1.001s (+0.5%)
64K blocks 0.780s		0.792s (+1.5%)

[numbers are better than +/- 0.005]

So we lose somewhere between half and one and a half of one percent
performance in a pagecache write intensive workload.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2052,11 +2052,12 @@ generic_file_buffered_write(struct kiocb
 	filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, written);
 
 	do {
+		struct page *src_page;
 		struct page *page;
 		pgoff_t index;		/* Pagecache index for current page */
 		unsigned long offset;	/* Offset into pagecache page */
-		unsigned long maxlen;	/* Bytes remaining in current iovec */
-		size_t bytes;		/* Bytes to write to page */
+		unsigned long seglen;	/* Bytes remaining in current iovec */
+		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
 
 		buf = cur_iov->iov_base + iov_offset;
@@ -2066,20 +2067,30 @@ generic_file_buffered_write(struct kiocb
 		if (bytes > count)
 			bytes = count;
 
-		maxlen = cur_iov->iov_len - iov_offset;
-		if (maxlen > bytes)
-			maxlen = bytes;
+		/*
+		 * a non-NULL src_page indicates that we're doing the
+		 * copy via get_user_pages and kmap.
+		 */
+		src_page = NULL;
+
+		seglen = cur_iov->iov_len - iov_offset;
+		if (seglen > bytes)
+			seglen = bytes;
 
-#ifndef CONFIG_DEBUG_VM
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
+		 *
+		 * Not only is this an optimisation, but it is also required
+		 * to check that the address is actually valid, when atomic
+		 * usercopies are used, below.
 		 */
-		fault_in_pages_readable(buf, maxlen);
-#endif
-
+		if (unlikely(fault_in_pages_readable(buf, seglen))) {
+			status = -EFAULT;
+			break;
+		}
 
 		page = __grab_cache_page(mapping, index);
 		if (!page) {
@@ -2087,32 +2098,96 @@ generic_file_buffered_write(struct kiocb
 			break;
 		}
 
+		/*
+		 * non-uptodate pages cannot cope with short copies, and we
+		 * cannot take a pagefault with the destination page locked.
+		 * So pin the source page to copy it.
+		 */
+		if (!PageUptodate(page)) {
+			unlock_page(page);
+
+			src_page = alloc_page(GFP_KERNEL);
+			if (!src_page) {
+				page_cache_release(page);
+				status = -ENOMEM;
+				break;
+			}
+
+			/*
+			 * Cannot get_user_pages with a page locked for the
+			 * same reason as we can't take a page fault with a
+			 * page locked (as explained below).
+			 */
+			copied = filemap_copy_from_user(src_page, offset,
+					cur_iov, nr_segs, iov_offset, bytes);
+			if (unlikely(copied == 0)) {
+				status = -EFAULT;
+				page_cache_release(page);
+				page_cache_release(src_page);
+				break;
+			}
+			bytes = copied;
+
+			lock_page(page);
+			if (unlikely(!page->mapping)) {
+				unlock_page(page);
+				page_cache_release(page);
+				page_cache_release(src_page);
+				continue;
+			}
+
+		}
+
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
 		if (unlikely(status))
 			goto fs_write_aop_error;
 
-		copied = filemap_copy_from_user(page, offset,
+		if (!src_page) {
+			/*
+			 * Must not enter the pagefault handler here, because
+			 * we hold the page lock, so we might recursively
+			 * deadlock on the same lock, or get an ABBA deadlock
+			 * against a different lock, or against the mmap_sem
+			 * (which nests outside the page lock).  So increment
+			 * preempt count, and use _atomic usercopies.
+			 *
+			 * The page is uptodate so we are OK to encounter a
+			 * short copy: if unmodified parts of the page are
+			 * marked dirty and written out to disk, it doesn't
+			 * really matter.
+			 */
+			pagefault_disable();
+			copied = filemap_copy_from_user_atomic(page, offset,
 					cur_iov, nr_segs, iov_offset, bytes);
+			pagefault_enable();
+		} else {
+			void *src, *dst;
+			src = kmap_atomic(src_page, KM_USER0);
+			dst = kmap_atomic(page, KM_USER1);
+			memcpy(dst + offset, src + offset, bytes);
+			kunmap_atomic(dst, KM_USER1);
+			kunmap_atomic(src, KM_USER0);
+			copied = bytes;
+		}
 		flush_dcache_page(page);
 
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (unlikely(status < 0))
 			goto fs_write_aop_error;
-		if (unlikely(copied != bytes)) {
-			status = -EFAULT;
-			goto fs_write_aop_error;
-		}
 		if (unlikely(status > 0)) /* filesystem did partial write */
-			copied = status;
+			copied = min_t(size_t, copied, status);
+
+		unlock_page(page);
+		mark_page_accessed(page);
+		page_cache_release(page);
+		if (src_page)
+			page_cache_release(src_page);
 
 		written += copied;
 		count -= copied;
 		pos += copied;
 		filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, copied);
 
-		unlock_page(page);
-		mark_page_accessed(page);
-		page_cache_release(page);
 		balance_dirty_pages_ratelimited(mapping);
 		cond_resched();
 		continue;
@@ -2121,6 +2196,8 @@ fs_write_aop_error:
 		if (status != AOP_TRUNCATED_PAGE)
 			unlock_page(page);
 		page_cache_release(page);
+		if (src_page)
+			page_cache_release(src_page);
 
 		/*
 		 * prepare_write() may have instantiated a few blocks
@@ -2133,7 +2210,6 @@ fs_write_aop_error:
 			continue;
 		else
 			break;
-
 	} while (count);
 	*ppos = pos;
 
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -198,6 +198,9 @@ static inline int fault_in_pages_writeab
 {
 	int ret;
 
+	if (unlikely(size == 0))
+		return 0;
+
 	/*
 	 * Writing zeroes into userspace here is OK, because we know that if
 	 * the zero gets there, we'll be overwriting it.
@@ -217,19 +220,23 @@ static inline int fault_in_pages_writeab
 	return ret;
 }
 
-static inline void fault_in_pages_readable(const char __user *uaddr, int size)
+static inline int fault_in_pages_readable(const char __user *uaddr, int size)
 {
 	volatile char c;
 	int ret;
 
+	if (unlikely(size == 0))
+		return 0;
+
 	ret = __get_user(c, uaddr);
 	if (ret == 0) {
 		const char __user *end = uaddr + size - 1;
 
 		if (((unsigned long)uaddr & PAGE_MASK) !=
 				((unsigned long)end & PAGE_MASK))
-		 	__get_user(c, end);
+		 	ret = __get_user(c, end);
 	}
+	return ret;
 }
 
 #endif /* _LINUX_PAGEMAP_H */

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-02-04 15:10                   ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-04 15:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, Feb 04, 2007 at 03:15:49AM -0800, Andrew Morton wrote:
> On Sun, 4 Feb 2007 12:03:17 +0100 Nick Piggin <npiggin@suse.de> wrote:
> 
> > On Sun, Feb 04, 2007 at 02:56:02AM -0800, Andrew Morton wrote:
> > > On Sun, 4 Feb 2007 11:46:09 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > > 
> > > If that recollection is right, I think we could afford to reintroduce that
> > > problem, frankly.  Especially as it only happens in the incredibly rare
> > > case of that get_user()ed page getting unmapped under our feet.
> > 
> > Dang. I was hoping to fix it without introducing data corruption.
> 
> Well.  It's a compromise.  Being practical about it, I reeeealy doubt that
> anyone will hit this combination of circumstances.

They're not likely to hit the deadlocks, either. Probability gets more
likely after my patch to lock the page in the fault path. But practially,
we could live without that too, because the data corruption it fixes is
very rare as well. Which is exactly what we've been doing quite happily
for most of 2.6, including all distro kernels (I think).

But (sadly) for me, there is no compromise. I may be known as the clown
who tries outlandish things to shave a few atomic ops and interrupt flag
changes in the page allocator, or make the pagecache lockless. However
I can't be happy even making something faster if correctness is < 100.0%,
even if less likely than hardware failure.

> > > > > > but you introduce the theoretical memory deadlock
> > > > > > where a task cannot reclaim its own memory.
> > > > > 
> > > > > Nah, that'll never happen - both pages are already allocated.
> > > > 
> > > > Both pages? I don't get it.
> > > > 
> > > > You set the don't-reclaim vma flag, then run get_user, which takes a
> > > > page fault and potentially has to allocate N pages for pagetables,
> > > > pagecache readahead, buffers and fs private data and pagecache radix
> > > > tree nodes for all of the pages read in.
> > > 
> > > Oh, OK.  Need to do the get_user() twice then.  Once before taking that new
> > > rwsem.
> > 
> > Race condition remains.
> 
> No, not in a million years.

There is a huge window. Think about what other activity will be holding
that very rwsem for write, that you'll have to wait for in the race window.
But you could say that's also a question of practicality, because it is
pretty unlikely to do anything bad even if you do hit the race.

Well if fixing this is just going to be flat-out vetoed on performance
reasons then I can't argue, because it does impact performance.

Thinking about the numbers, if your kernel's reliability is already the
same order of magnitude as reliability of commodity hardware, then
trading a bit of performance for a bit of reliability is a BAD tradeoff
if you are at all interested in performance on commodity hardware. That
is especially true if you have a massive redundant cluster of commodity
systems, which I understand is a fairly big market for Linux. The 
X-nines guys who would disagree are probably a tiny niche for Linux.

So I do understand your argument for praticality, even if I don't agree.

For the record, here is the "temporary page" fix that should fix it
properly. And some numbers.

Nick

--

Modify the core write() code so that it won't take a pagefault while holding a
lock on the pagecache page. There are a number of different deadlocks possible
if we try to do such a thing:

1.  generic_buffered_write
2.   lock_page
3.    prepare_write
4.     unlock_page+vmtruncate
5.     copy_from_user
6.      mmap_sem(r)
7.       handle_mm_fault
8.        lock_page (filemap_nopage)
9.    commit_write
10.  unlock_page

a. sys_munmap / sys_mlock / others
b.  mmap_sem(w)
c.   make_pages_present
d.    get_user_pages
e.     handle_mm_fault
f.      lock_page (filemap_nopage)

2,8	- recursive deadlock if page is same
2,8;2,8	- ABBA deadlock is page is different
2,6;b,f	- ABBA deadlock if page is same

The solution is as follows:
1.  If we find the destination page is uptodate, continue as normal, but use
    atomic usercopies which do not take pagefaults and do not zero the uncopied
    tail of the destination. The destination is already uptodate, so we can
    commit_write the full length even if there was a partial copy: it does not
    matter that the tail was not modified, because if it is dirtied and written
    back to disk it will not cause any problems (uptodate *means* that the
    destination page is as new or newer than the copy on disk).

1a. The above requires that fault_in_pages_readable correctly returns access
    information, because atomic usercopies cannot distinguish between
    non-present pages in a readable mapping, from lack of a readable mapping.

2.  If we find the destination page is non uptodate, unlock it (this could be
    made slightly more optimal), then allocate a temporary page to copy the
    source data into. Relock the destination page and continue with the copy.
    However, instead of a usercopy (which might take a fault), copy the data
    from the pinned temporary page via the kernel address space.

(also, rename maxlen to seglen, because it was confusing)

On a P4 Xeon, SMP kernel, on a tmpfs filesystem, a 1GB dd if=/dev/zero write
had the following performance (higher is worse):

Orig kernel			New kernel
new file (no pagecache)
4K  blocks 1.280s		1.287s (+0.5%)
64K blocks 1.090s		1.105s (+1.4%)
notrunc (uptodate pagecache)
4K  blocks 0.976s		1.001s (+0.5%)
64K blocks 0.780s		0.792s (+1.5%)

[numbers are better than +/- 0.005]

So we lose somewhere between half and one and a half of one percent
performance in a pagecache write intensive workload.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2052,11 +2052,12 @@ generic_file_buffered_write(struct kiocb
 	filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, written);
 
 	do {
+		struct page *src_page;
 		struct page *page;
 		pgoff_t index;		/* Pagecache index for current page */
 		unsigned long offset;	/* Offset into pagecache page */
-		unsigned long maxlen;	/* Bytes remaining in current iovec */
-		size_t bytes;		/* Bytes to write to page */
+		unsigned long seglen;	/* Bytes remaining in current iovec */
+		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
 
 		buf = cur_iov->iov_base + iov_offset;
@@ -2066,20 +2067,30 @@ generic_file_buffered_write(struct kiocb
 		if (bytes > count)
 			bytes = count;
 
-		maxlen = cur_iov->iov_len - iov_offset;
-		if (maxlen > bytes)
-			maxlen = bytes;
+		/*
+		 * a non-NULL src_page indicates that we're doing the
+		 * copy via get_user_pages and kmap.
+		 */
+		src_page = NULL;
+
+		seglen = cur_iov->iov_len - iov_offset;
+		if (seglen > bytes)
+			seglen = bytes;
 
-#ifndef CONFIG_DEBUG_VM
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
+		 *
+		 * Not only is this an optimisation, but it is also required
+		 * to check that the address is actually valid, when atomic
+		 * usercopies are used, below.
 		 */
-		fault_in_pages_readable(buf, maxlen);
-#endif
-
+		if (unlikely(fault_in_pages_readable(buf, seglen))) {
+			status = -EFAULT;
+			break;
+		}
 
 		page = __grab_cache_page(mapping, index);
 		if (!page) {
@@ -2087,32 +2098,96 @@ generic_file_buffered_write(struct kiocb
 			break;
 		}
 
+		/*
+		 * non-uptodate pages cannot cope with short copies, and we
+		 * cannot take a pagefault with the destination page locked.
+		 * So pin the source page to copy it.
+		 */
+		if (!PageUptodate(page)) {
+			unlock_page(page);
+
+			src_page = alloc_page(GFP_KERNEL);
+			if (!src_page) {
+				page_cache_release(page);
+				status = -ENOMEM;
+				break;
+			}
+
+			/*
+			 * Cannot get_user_pages with a page locked for the
+			 * same reason as we can't take a page fault with a
+			 * page locked (as explained below).
+			 */
+			copied = filemap_copy_from_user(src_page, offset,
+					cur_iov, nr_segs, iov_offset, bytes);
+			if (unlikely(copied == 0)) {
+				status = -EFAULT;
+				page_cache_release(page);
+				page_cache_release(src_page);
+				break;
+			}
+			bytes = copied;
+
+			lock_page(page);
+			if (unlikely(!page->mapping)) {
+				unlock_page(page);
+				page_cache_release(page);
+				page_cache_release(src_page);
+				continue;
+			}
+
+		}
+
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
 		if (unlikely(status))
 			goto fs_write_aop_error;
 
-		copied = filemap_copy_from_user(page, offset,
+		if (!src_page) {
+			/*
+			 * Must not enter the pagefault handler here, because
+			 * we hold the page lock, so we might recursively
+			 * deadlock on the same lock, or get an ABBA deadlock
+			 * against a different lock, or against the mmap_sem
+			 * (which nests outside the page lock).  So increment
+			 * preempt count, and use _atomic usercopies.
+			 *
+			 * The page is uptodate so we are OK to encounter a
+			 * short copy: if unmodified parts of the page are
+			 * marked dirty and written out to disk, it doesn't
+			 * really matter.
+			 */
+			pagefault_disable();
+			copied = filemap_copy_from_user_atomic(page, offset,
 					cur_iov, nr_segs, iov_offset, bytes);
+			pagefault_enable();
+		} else {
+			void *src, *dst;
+			src = kmap_atomic(src_page, KM_USER0);
+			dst = kmap_atomic(page, KM_USER1);
+			memcpy(dst + offset, src + offset, bytes);
+			kunmap_atomic(dst, KM_USER1);
+			kunmap_atomic(src, KM_USER0);
+			copied = bytes;
+		}
 		flush_dcache_page(page);
 
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (unlikely(status < 0))
 			goto fs_write_aop_error;
-		if (unlikely(copied != bytes)) {
-			status = -EFAULT;
-			goto fs_write_aop_error;
-		}
 		if (unlikely(status > 0)) /* filesystem did partial write */
-			copied = status;
+			copied = min_t(size_t, copied, status);
+
+		unlock_page(page);
+		mark_page_accessed(page);
+		page_cache_release(page);
+		if (src_page)
+			page_cache_release(src_page);
 
 		written += copied;
 		count -= copied;
 		pos += copied;
 		filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, copied);
 
-		unlock_page(page);
-		mark_page_accessed(page);
-		page_cache_release(page);
 		balance_dirty_pages_ratelimited(mapping);
 		cond_resched();
 		continue;
@@ -2121,6 +2196,8 @@ fs_write_aop_error:
 		if (status != AOP_TRUNCATED_PAGE)
 			unlock_page(page);
 		page_cache_release(page);
+		if (src_page)
+			page_cache_release(src_page);
 
 		/*
 		 * prepare_write() may have instantiated a few blocks
@@ -2133,7 +2210,6 @@ fs_write_aop_error:
 			continue;
 		else
 			break;
-
 	} while (count);
 	*ppos = pos;
 
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -198,6 +198,9 @@ static inline int fault_in_pages_writeab
 {
 	int ret;
 
+	if (unlikely(size == 0))
+		return 0;
+
 	/*
 	 * Writing zeroes into userspace here is OK, because we know that if
 	 * the zero gets there, we'll be overwriting it.
@@ -217,19 +220,23 @@ static inline int fault_in_pages_writeab
 	return ret;
 }
 
-static inline void fault_in_pages_readable(const char __user *uaddr, int size)
+static inline int fault_in_pages_readable(const char __user *uaddr, int size)
 {
 	volatile char c;
 	int ret;
 
+	if (unlikely(size == 0))
+		return 0;
+
 	ret = __get_user(c, uaddr);
 	if (ret == 0) {
 		const char __user *end = uaddr + size - 1;
 
 		if (((unsigned long)uaddr & PAGE_MASK) !=
 				((unsigned long)end & PAGE_MASK))
-		 	__get_user(c, end);
+		 	ret = __get_user(c, end);
 	}
+	return ret;
 }
 
 #endif /* _LINUX_PAGEMAP_H */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 11:10         ` Andrew Morton
@ 2007-02-04 11:22           ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-04 11:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Anton Altaparmakov, Linux Kernel, Linux Filesystems,
	Linux Memory Management

On Sun, Feb 04, 2007 at 03:10:39AM -0800, Andrew Morton wrote:
> On Sun, 4 Feb 2007 10:59:58 +0000 (GMT) Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> > 
> > How about leaving the existing code with the following minor 
> > modifications:
> > 
> > Instead of calling filemap_copy_from_user{,_iovec}() do only the atomic 
> > bit with pagefaults disabled, i.e. instead of filemap_copy_from_user() we 
> > would do (could of course move into a helper function of course):
> > 
> > pagefault_disable()
> > kaddr = kmap_atomic(page, KM_USER0);
> > left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
> > kunmap_atomic(kaddr, KM_USER0);
> > pagefault_enable()
> > 
> > if (unlikely(left)) {
> > 	/* The user space page got unmapped before we got to copy it. */
> > 	...
> > }
> > 
> > Thus the 99.999% (or more!) of the time the code would just work as it 
> > always has and there is no bug and no speed impact.  Only in the very rare 
> > and hard to trigger race condition that the user space page after being 
> > faulted in got thrown out again before we did the atomic memory copy do we 
> > run into the above "..." code path.
> 
> Right.  And what I wanted to do here is to zero out the uncopied part of
> the page (if it wasn't uptodate), then run commit_write(), then retry the
> whole thing.
> 
> iirc, we ruled that out because those temporary zeroes are exposed to
> userspace.  But the kernel used to do that anyway for a long time (years)
> until someone noticed, and we'll only do it in your 0.0001% case anyway.

Serious? I'd rather leave the deadlock in there than introduce a
very hard to reproduce data corruption bug to fix it. At least the
deadlock is fail-stop and you can tell exactly what happened when
you hit it (assuming you can get a trace).

Then again, we've got lots more similar little correctness corner
cases like this that most people don't notice most of the time. Am
I aiming too high?

> (Actually, perhaps we can prevent it by not marking the page uptodate in
> this case.  But that'll cause a read()er to try to bring it uptodate...)

We have to write something back to the filesystem because it may have
allocated blocks at this point.

> > I would propose to call out a different function altogether which could do 
> > a multitude of things including drop the lock on the destination page 
> > (maintaining a reference on the page!), allocate a temporary page, copy 
> > from the user space page into the temporary page, then lock the 
> > destination page again, and copy from the temporary page into the 
> > destination page.
> 
> The problem with all these things is that as soon as we unlock the page,
> it's visible to read().  And in fact, as soon as we mark it uptodate it's
> visible to mmap, even if it's still locked.
> 
> > This would be slow but who cares given it would only happen incredibly 
> > rarely and on majority of machines it would never happen at all.
> > 
> > The only potential problem I can see is that the destination page could be 
> > truncated whilst it is unlocked.  I can see two possible solutions to 
> > this:
> 
> truncate's OK: we're holding i_mutex.

Not all truncates hold i_mutex. Neither do all invalidates, for that
matter.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-02-04 11:22           ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-04 11:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Anton Altaparmakov, Linux Kernel, Linux Filesystems,
	Linux Memory Management

On Sun, Feb 04, 2007 at 03:10:39AM -0800, Andrew Morton wrote:
> On Sun, 4 Feb 2007 10:59:58 +0000 (GMT) Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> > 
> > How about leaving the existing code with the following minor 
> > modifications:
> > 
> > Instead of calling filemap_copy_from_user{,_iovec}() do only the atomic 
> > bit with pagefaults disabled, i.e. instead of filemap_copy_from_user() we 
> > would do (could of course move into a helper function of course):
> > 
> > pagefault_disable()
> > kaddr = kmap_atomic(page, KM_USER0);
> > left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
> > kunmap_atomic(kaddr, KM_USER0);
> > pagefault_enable()
> > 
> > if (unlikely(left)) {
> > 	/* The user space page got unmapped before we got to copy it. */
> > 	...
> > }
> > 
> > Thus the 99.999% (or more!) of the time the code would just work as it 
> > always has and there is no bug and no speed impact.  Only in the very rare 
> > and hard to trigger race condition that the user space page after being 
> > faulted in got thrown out again before we did the atomic memory copy do we 
> > run into the above "..." code path.
> 
> Right.  And what I wanted to do here is to zero out the uncopied part of
> the page (if it wasn't uptodate), then run commit_write(), then retry the
> whole thing.
> 
> iirc, we ruled that out because those temporary zeroes are exposed to
> userspace.  But the kernel used to do that anyway for a long time (years)
> until someone noticed, and we'll only do it in your 0.0001% case anyway.

Serious? I'd rather leave the deadlock in there than introduce a
very hard to reproduce data corruption bug to fix it. At least the
deadlock is fail-stop and you can tell exactly what happened when
you hit it (assuming you can get a trace).

Then again, we've got lots more similar little correctness corner
cases like this that most people don't notice most of the time. Am
I aiming too high?

> (Actually, perhaps we can prevent it by not marking the page uptodate in
> this case.  But that'll cause a read()er to try to bring it uptodate...)

We have to write something back to the filesystem because it may have
allocated blocks at this point.

> > I would propose to call out a different function altogether which could do 
> > a multitude of things including drop the lock on the destination page 
> > (maintaining a reference on the page!), allocate a temporary page, copy 
> > from the user space page into the temporary page, then lock the 
> > destination page again, and copy from the temporary page into the 
> > destination page.
> 
> The problem with all these things is that as soon as we unlock the page,
> it's visible to read().  And in fact, as soon as we mark it uptodate it's
> visible to mmap, even if it's still locked.
> 
> > This would be slow but who cares given it would only happen incredibly 
> > rarely and on majority of machines it would never happen at all.
> > 
> > The only potential problem I can see is that the destination page could be 
> > truncated whilst it is unlocked.  I can see two possible solutions to 
> > this:
> 
> truncate's OK: we're holding i_mutex.

Not all truncates hold i_mutex. Neither do all invalidates, for that
matter.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 11:03               ` Nick Piggin
@ 2007-02-04 11:15                 ` Andrew Morton
  -1 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-02-04 11:15 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, 4 Feb 2007 12:03:17 +0100 Nick Piggin <npiggin@suse.de> wrote:

> On Sun, Feb 04, 2007 at 02:56:02AM -0800, Andrew Morton wrote:
> > On Sun, 4 Feb 2007 11:46:09 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > On Sun, Feb 04, 2007 at 02:30:55AM -0800, Andrew Morton wrote:
> > > > On Sun, 4 Feb 2007 11:15:29 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > > > 
> > > > > The write path is broken. I prefer my kernels slow, than buggy.
> > > > 
> > > > That won't fly.
> > > 
> > > What won't fly?
> > 
> > I suspect the performance cost of this approach would force us to redo it
> > all.
> 
> That's the idea. But at least in the meantime we're correct.

There's no way I'd support merging a change which we know we'll have to
redo only we have no clue how.

> > If that recollection is right, I think we could afford to reintroduce that
> > problem, frankly.  Especially as it only happens in the incredibly rare
> > case of that get_user()ed page getting unmapped under our feet.
> 
> Dang. I was hoping to fix it without introducing data corruption.

Well.  It's a compromise.  Being practical about it, I reeeealy doubt that
anyone will hit this combination of circumstances.

> > > > > but you introduce the theoretical memory deadlock
> > > > > where a task cannot reclaim its own memory.
> > > > 
> > > > Nah, that'll never happen - both pages are already allocated.
> > > 
> > > Both pages? I don't get it.
> > > 
> > > You set the don't-reclaim vma flag, then run get_user, which takes a
> > > page fault and potentially has to allocate N pages for pagetables,
> > > pagecache readahead, buffers and fs private data and pagecache radix
> > > tree nodes for all of the pages read in.
> > 
> > Oh, OK.  Need to do the get_user() twice then.  Once before taking that new
> > rwsem.
> 
> Race condition remains.

No, not in a million years.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-02-04 11:15                 ` Andrew Morton
  0 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-02-04 11:15 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, 4 Feb 2007 12:03:17 +0100 Nick Piggin <npiggin@suse.de> wrote:

> On Sun, Feb 04, 2007 at 02:56:02AM -0800, Andrew Morton wrote:
> > On Sun, 4 Feb 2007 11:46:09 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > On Sun, Feb 04, 2007 at 02:30:55AM -0800, Andrew Morton wrote:
> > > > On Sun, 4 Feb 2007 11:15:29 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > > > 
> > > > > The write path is broken. I prefer my kernels slow, than buggy.
> > > > 
> > > > That won't fly.
> > > 
> > > What won't fly?
> > 
> > I suspect the performance cost of this approach would force us to redo it
> > all.
> 
> That's the idea. But at least in the meantime we're correct.

There's no way I'd support merging a change which we know we'll have to
redo only we have no clue how.

> > If that recollection is right, I think we could afford to reintroduce that
> > problem, frankly.  Especially as it only happens in the incredibly rare
> > case of that get_user()ed page getting unmapped under our feet.
> 
> Dang. I was hoping to fix it without introducing data corruption.

Well.  It's a compromise.  Being practical about it, I reeeealy doubt that
anyone will hit this combination of circumstances.

> > > > > but you introduce the theoretical memory deadlock
> > > > > where a task cannot reclaim its own memory.
> > > > 
> > > > Nah, that'll never happen - both pages are already allocated.
> > > 
> > > Both pages? I don't get it.
> > > 
> > > You set the don't-reclaim vma flag, then run get_user, which takes a
> > > page fault and potentially has to allocate N pages for pagetables,
> > > pagecache readahead, buffers and fs private data and pagecache radix
> > > tree nodes for all of the pages read in.
> > 
> > Oh, OK.  Need to do the get_user() twice then.  Once before taking that new
> > rwsem.
> 
> Race condition remains.

No, not in a million years.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 10:59       ` Anton Altaparmakov
@ 2007-02-04 11:10         ` Andrew Morton
  -1 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-02-04 11:10 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: Nick Piggin, Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, 4 Feb 2007 10:59:58 +0000 (GMT) Anton Altaparmakov <aia21@cam.ac.uk> wrote:

> On Sun, 4 Feb 2007, Andrew Morton wrote:
> > On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin <npiggin@suse.de> wrote:
> > > 2.  If we find the destination page is non uptodate, unlock it (this could be
> > >     made slightly more optimal), then find and pin the source page with
> > >     get_user_pages. Relock the destination page and continue with the copy.
> > >     However, instead of a usercopy (which might take a fault), copy the data
> > >     via the kernel address space.
> > 
> > argh.  We just can't go adding all this gunk into the write() path. 
> > 
> > mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every page. 
> > Even single-process write() will suffer, let along multithreaded stuff,
> > where mmap_sem contention may be the bigger problem.
> > 
> > I was going to do some quick measurements of this, but the code oopses
> > on power4 (http://userweb.kernel.org/~akpm/s5000402.jpg)
> > 
> > We need to think different.
> 
> How about leaving the existing code with the following minor 
> modifications:
> 
> Instead of calling filemap_copy_from_user{,_iovec}() do only the atomic 
> bit with pagefaults disabled, i.e. instead of filemap_copy_from_user() we 
> would do (could of course move into a helper function of course):
> 
> pagefault_disable()
> kaddr = kmap_atomic(page, KM_USER0);
> left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
> kunmap_atomic(kaddr, KM_USER0);
> pagefault_enable()
> 
> if (unlikely(left)) {
> 	/* The user space page got unmapped before we got to copy it. */
> 	...
> }
> 
> Thus the 99.999% (or more!) of the time the code would just work as it 
> always has and there is no bug and no speed impact.  Only in the very rare 
> and hard to trigger race condition that the user space page after being 
> faulted in got thrown out again before we did the atomic memory copy do we 
> run into the above "..." code path.

Right.  And what I wanted to do here is to zero out the uncopied part of
the page (if it wasn't uptodate), then run commit_write(), then retry the
whole thing.

iirc, we ruled that out because those temporary zeroes are exposed to
userspace.  But the kernel used to do that anyway for a long time (years)
until someone noticed, and we'll only do it in your 0.0001% case anyway.

(Actually, perhaps we can prevent it by not marking the page uptodate in
this case.  But that'll cause a read()er to try to bring it uptodate...)

> I would propose to call out a different function altogether which could do 
> a multitude of things including drop the lock on the destination page 
> (maintaining a reference on the page!), allocate a temporary page, copy 
> from the user space page into the temporary page, then lock the 
> destination page again, and copy from the temporary page into the 
> destination page.

The problem with all these things is that as soon as we unlock the page,
it's visible to read().  And in fact, as soon as we mark it uptodate it's
visible to mmap, even if it's still locked.

> This would be slow but who cares given it would only happen incredibly 
> rarely and on majority of machines it would never happen at all.
> 
> The only potential problem I can see is that the destination page could be 
> truncated whilst it is unlocked.  I can see two possible solutions to 
> this:

truncate's OK: we're holding i_mutex.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-02-04 11:10         ` Andrew Morton
  0 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-02-04 11:10 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: Nick Piggin, Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, 4 Feb 2007 10:59:58 +0000 (GMT) Anton Altaparmakov <aia21@cam.ac.uk> wrote:

> On Sun, 4 Feb 2007, Andrew Morton wrote:
> > On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin <npiggin@suse.de> wrote:
> > > 2.  If we find the destination page is non uptodate, unlock it (this could be
> > >     made slightly more optimal), then find and pin the source page with
> > >     get_user_pages. Relock the destination page and continue with the copy.
> > >     However, instead of a usercopy (which might take a fault), copy the data
> > >     via the kernel address space.
> > 
> > argh.  We just can't go adding all this gunk into the write() path. 
> > 
> > mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every page. 
> > Even single-process write() will suffer, let along multithreaded stuff,
> > where mmap_sem contention may be the bigger problem.
> > 
> > I was going to do some quick measurements of this, but the code oopses
> > on power4 (http://userweb.kernel.org/~akpm/s5000402.jpg)
> > 
> > We need to think different.
> 
> How about leaving the existing code with the following minor 
> modifications:
> 
> Instead of calling filemap_copy_from_user{,_iovec}() do only the atomic 
> bit with pagefaults disabled, i.e. instead of filemap_copy_from_user() we 
> would do (could of course move into a helper function of course):
> 
> pagefault_disable()
> kaddr = kmap_atomic(page, KM_USER0);
> left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
> kunmap_atomic(kaddr, KM_USER0);
> pagefault_enable()
> 
> if (unlikely(left)) {
> 	/* The user space page got unmapped before we got to copy it. */
> 	...
> }
> 
> Thus the 99.999% (or more!) of the time the code would just work as it 
> always has and there is no bug and no speed impact.  Only in the very rare 
> and hard to trigger race condition that the user space page after being 
> faulted in got thrown out again before we did the atomic memory copy do we 
> run into the above "..." code path.

Right.  And what I wanted to do here is to zero out the uncopied part of
the page (if it wasn't uptodate), then run commit_write(), then retry the
whole thing.

iirc, we ruled that out because those temporary zeroes are exposed to
userspace.  But the kernel used to do that anyway for a long time (years)
until someone noticed, and we'll only do it in your 0.0001% case anyway.

(Actually, perhaps we can prevent it by not marking the page uptodate in
this case.  But that'll cause a read()er to try to bring it uptodate...)

> I would propose to call out a different function altogether which could do 
> a multitude of things including drop the lock on the destination page 
> (maintaining a reference on the page!), allocate a temporary page, copy 
> from the user space page into the temporary page, then lock the 
> destination page again, and copy from the temporary page into the 
> destination page.

The problem with all these things is that as soon as we unlock the page,
it's visible to read().  And in fact, as soon as we mark it uptodate it's
visible to mmap, even if it's still locked.

> This would be slow but who cares given it would only happen incredibly 
> rarely and on majority of machines it would never happen at all.
> 
> The only potential problem I can see is that the destination page could be 
> truncated whilst it is unlocked.  I can see two possible solutions to 
> this:

truncate's OK: we're holding i_mutex.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 10:56             ` Andrew Morton
@ 2007-02-04 11:03               ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-04 11:03 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, Feb 04, 2007 at 02:56:02AM -0800, Andrew Morton wrote:
> On Sun, 4 Feb 2007 11:46:09 +0100 Nick Piggin <npiggin@suse.de> wrote:
> 
> > On Sun, Feb 04, 2007 at 02:30:55AM -0800, Andrew Morton wrote:
> > > On Sun, 4 Feb 2007 11:15:29 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > > 
> > > > The write path is broken. I prefer my kernels slow, than buggy.
> > > 
> > > That won't fly.
> > 
> > What won't fly?
> 
> I suspect the performance cost of this approach would force us to redo it
> all.

That's the idea. But at least in the meantime we're correct.

> > > > That was my second idea.
> > > 
> > > Coulda sworn it was mine ;) I thought you ended up deciding it wasn't
> > > practical because of the games we needed to play with ->commit_write.
> > 
> > Maybe I misunderstood what you meant, above.
> 
> The original set of half-written patches I sent you.  Do an atomic copy_from_user()
> inside the page lock and if that fails, zero out the remainder of the page, run
> commit_write() and then redo the whole thing.

Oh that. Data corruption, transient zeroes.

> > I have an alterative fix
> > where a temporary page is allocated if the write enncounters a non
> > uptodate page. The usercopy then goes into that page, and from there
> > into the target page after we have opened the prepare_write().
> 
> Remember that a non-uptodate page is the common case.

Yes.

> 
> > My *first* idea to fix this was to do the atomic copy into a non-uptodate
> > page and then calling a zero-length commit_write if it failed. I pretty
> > carefully constructed all these good arguments as to why each case works
> > properly, but in the end it just didn't fly because it broke lots of
> > filesystems.
> 
> I forget the details now.  I think we did have a workable-looking solution
> based on the atomic copy_from_user() but it would have re-exposed the old
> problem wherein a page would fleetingly have a bunch of zeroes in the
> middle of it, if someone looked at it during the write.
> 
> If that recollection is right, I think we could afford to reintroduce that
> problem, frankly.  Especially as it only happens in the incredibly rare
> case of that get_user()ed page getting unmapped under our feet.

Dang. I was hoping to fix it without introducing data corruption.

> > > > but you introduce the theoretical memory deadlock
> > > > where a task cannot reclaim its own memory.
> > > 
> > > Nah, that'll never happen - both pages are already allocated.
> > 
> > Both pages? I don't get it.
> > 
> > You set the don't-reclaim vma flag, then run get_user, which takes a
> > page fault and potentially has to allocate N pages for pagetables,
> > pagecache readahead, buffers and fs private data and pagecache radix
> > tree nodes for all of the pages read in.
> 
> Oh, OK.  Need to do the get_user() twice then.  Once before taking that new
> rwsem.

Race condition remains.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-02-04 11:03               ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-04 11:03 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, Feb 04, 2007 at 02:56:02AM -0800, Andrew Morton wrote:
> On Sun, 4 Feb 2007 11:46:09 +0100 Nick Piggin <npiggin@suse.de> wrote:
> 
> > On Sun, Feb 04, 2007 at 02:30:55AM -0800, Andrew Morton wrote:
> > > On Sun, 4 Feb 2007 11:15:29 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > > 
> > > > The write path is broken. I prefer my kernels slow, than buggy.
> > > 
> > > That won't fly.
> > 
> > What won't fly?
> 
> I suspect the performance cost of this approach would force us to redo it
> all.

That's the idea. But at least in the meantime we're correct.

> > > > That was my second idea.
> > > 
> > > Coulda sworn it was mine ;) I thought you ended up deciding it wasn't
> > > practical because of the games we needed to play with ->commit_write.
> > 
> > Maybe I misunderstood what you meant, above.
> 
> The original set of half-written patches I sent you.  Do an atomic copy_from_user()
> inside the page lock and if that fails, zero out the remainder of the page, run
> commit_write() and then redo the whole thing.

Oh that. Data corruption, transient zeroes.

> > I have an alterative fix
> > where a temporary page is allocated if the write enncounters a non
> > uptodate page. The usercopy then goes into that page, and from there
> > into the target page after we have opened the prepare_write().
> 
> Remember that a non-uptodate page is the common case.

Yes.

> 
> > My *first* idea to fix this was to do the atomic copy into a non-uptodate
> > page and then calling a zero-length commit_write if it failed. I pretty
> > carefully constructed all these good arguments as to why each case works
> > properly, but in the end it just didn't fly because it broke lots of
> > filesystems.
> 
> I forget the details now.  I think we did have a workable-looking solution
> based on the atomic copy_from_user() but it would have re-exposed the old
> problem wherein a page would fleetingly have a bunch of zeroes in the
> middle of it, if someone looked at it during the write.
> 
> If that recollection is right, I think we could afford to reintroduce that
> problem, frankly.  Especially as it only happens in the incredibly rare
> case of that get_user()ed page getting unmapped under our feet.

Dang. I was hoping to fix it without introducing data corruption.

> > > > but you introduce the theoretical memory deadlock
> > > > where a task cannot reclaim its own memory.
> > > 
> > > Nah, that'll never happen - both pages are already allocated.
> > 
> > Both pages? I don't get it.
> > 
> > You set the don't-reclaim vma flag, then run get_user, which takes a
> > page fault and potentially has to allocate N pages for pagetables,
> > pagecache readahead, buffers and fs private data and pagecache radix
> > tree nodes for all of the pages read in.
> 
> Oh, OK.  Need to do the get_user() twice then.  Once before taking that new
> rwsem.

Race condition remains.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04  9:44     ` Andrew Morton
@ 2007-02-04 10:59       ` Anton Altaparmakov
  -1 siblings, 0 replies; 105+ messages in thread
From: Anton Altaparmakov @ 2007-02-04 10:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, 4 Feb 2007, Andrew Morton wrote:
> On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin <npiggin@suse.de> wrote:
> > 2.  If we find the destination page is non uptodate, unlock it (this could be
> >     made slightly more optimal), then find and pin the source page with
> >     get_user_pages. Relock the destination page and continue with the copy.
> >     However, instead of a usercopy (which might take a fault), copy the data
> >     via the kernel address space.
> 
> argh.  We just can't go adding all this gunk into the write() path. 
> 
> mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every page. 
> Even single-process write() will suffer, let along multithreaded stuff,
> where mmap_sem contention may be the bigger problem.
> 
> I was going to do some quick measurements of this, but the code oopses
> on power4 (http://userweb.kernel.org/~akpm/s5000402.jpg)
> 
> We need to think different.

How about leaving the existing code with the following minor 
modifications:

Instead of calling filemap_copy_from_user{,_iovec}() do only the atomic 
bit with pagefaults disabled, i.e. instead of filemap_copy_from_user() we 
would do (could of course move into a helper function of course):

pagefault_disable()
kaddr = kmap_atomic(page, KM_USER0);
left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
kunmap_atomic(kaddr, KM_USER0);
pagefault_enable()

if (unlikely(left)) {
	/* The user space page got unmapped before we got to copy it. */
	...
}

Thus the 99.999% (or more!) of the time the code would just work as it 
always has and there is no bug and no speed impact.  Only in the very rare 
and hard to trigger race condition that the user space page after being 
faulted in got thrown out again before we did the atomic memory copy do we 
run into the above "..." code path.

I would propose to call out a different function altogether which could do 
a multitude of things including drop the lock on the destination page 
(maintaining a reference on the page!), allocate a temporary page, copy 
from the user space page into the temporary page, then lock the 
destination page again, and copy from the temporary page into the 
destination page.

This would be slow but who cares given it would only happen incredibly 
rarely and on majority of machines it would never happen at all.

The only potential problem I can see is that the destination page could be 
truncated whilst it is unlocked.  I can see two possible solutions to 
this:

1) Invent a new page flag and run "SetDoNotTruncatePage()" before dropping 
the page lock, then modify truncate code paths to check for page locked 
and the new flag and not truncate a page if the new flag is set, just like 
they would if it was locked.  Basically treat the two bits as equivalent 
for truncate purposes.

I think 1) would be best but a possible alternative would be:

2) Make commit write check whether a page has been truncated and if so 
make it abort the operation/transaction (or non-journalling file systems 
can probably just ignore the call completely).  Then simply restart the 
generic_file_buffered_write() call by faulting the user space page in 
again, etc.  And this time round things will just work again.  I guess it 
would be possible that we hit the same race condition again and again and 
again so we would loop for ever but the chances for that are even less 
than the original race condition.  I suppose to guard against it we could 
have a counter that say limited to X retries and if that failed the write 
would be aborted with -EDEADLOCK or something.  As I said I think 1) would 
be nicer...

What do you think?

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-02-04 10:59       ` Anton Altaparmakov
  0 siblings, 0 replies; 105+ messages in thread
From: Anton Altaparmakov @ 2007-02-04 10:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, 4 Feb 2007, Andrew Morton wrote:
> On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin <npiggin@suse.de> wrote:
> > 2.  If we find the destination page is non uptodate, unlock it (this could be
> >     made slightly more optimal), then find and pin the source page with
> >     get_user_pages. Relock the destination page and continue with the copy.
> >     However, instead of a usercopy (which might take a fault), copy the data
> >     via the kernel address space.
> 
> argh.  We just can't go adding all this gunk into the write() path. 
> 
> mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every page. 
> Even single-process write() will suffer, let along multithreaded stuff,
> where mmap_sem contention may be the bigger problem.
> 
> I was going to do some quick measurements of this, but the code oopses
> on power4 (http://userweb.kernel.org/~akpm/s5000402.jpg)
> 
> We need to think different.

How about leaving the existing code with the following minor 
modifications:

Instead of calling filemap_copy_from_user{,_iovec}() do only the atomic 
bit with pagefaults disabled, i.e. instead of filemap_copy_from_user() we 
would do (could of course move into a helper function of course):

pagefault_disable()
kaddr = kmap_atomic(page, KM_USER0);
left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
kunmap_atomic(kaddr, KM_USER0);
pagefault_enable()

if (unlikely(left)) {
	/* The user space page got unmapped before we got to copy it. */
	...
}

Thus the 99.999% (or more!) of the time the code would just work as it 
always has and there is no bug and no speed impact.  Only in the very rare 
and hard to trigger race condition that the user space page after being 
faulted in got thrown out again before we did the atomic memory copy do we 
run into the above "..." code path.

I would propose to call out a different function altogether which could do 
a multitude of things including drop the lock on the destination page 
(maintaining a reference on the page!), allocate a temporary page, copy 
from the user space page into the temporary page, then lock the 
destination page again, and copy from the temporary page into the 
destination page.

This would be slow but who cares given it would only happen incredibly 
rarely and on majority of machines it would never happen at all.

The only potential problem I can see is that the destination page could be 
truncated whilst it is unlocked.  I can see two possible solutions to 
this:

1) Invent a new page flag and run "SetDoNotTruncatePage()" before dropping 
the page lock, then modify truncate code paths to check for page locked 
and the new flag and not truncate a page if the new flag is set, just like 
they would if it was locked.  Basically treat the two bits as equivalent 
for truncate purposes.

I think 1) would be best but a possible alternative would be:

2) Make commit write check whether a page has been truncated and if so 
make it abort the operation/transaction (or non-journalling file systems 
can probably just ignore the call completely).  Then simply restart the 
generic_file_buffered_write() call by faulting the user space page in 
again, etc.  And this time round things will just work again.  I guess it 
would be possible that we hit the same race condition again and again and 
again so we would loop for ever but the chances for that are even less 
than the original race condition.  I suppose to guard against it we could 
have a counter that say limited to X retries and if that failed the write 
would be aborted with -EDEADLOCK or something.  As I said I think 1) would 
be nicer...

What do you think?

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 10:46           ` Nick Piggin
@ 2007-02-04 10:56             ` Andrew Morton
  -1 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-02-04 10:56 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, 4 Feb 2007 11:46:09 +0100 Nick Piggin <npiggin@suse.de> wrote:

> On Sun, Feb 04, 2007 at 02:30:55AM -0800, Andrew Morton wrote:
> > On Sun, 4 Feb 2007 11:15:29 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > The write path is broken. I prefer my kernels slow, than buggy.
> > 
> > That won't fly.
> 
> What won't fly?

I suspect the performance cost of this approach would force us to redo it
all.

> > > > What happened to the idea of doing an atomic copy into the non-uptodate
> > > > page and handling it somehow?
> > > 
> > > That was my second idea.
> > 
> > Coulda sworn it was mine ;) I thought you ended up deciding it wasn't
> > practical because of the games we needed to play with ->commit_write.
> 
> Maybe I misunderstood what you meant, above.

The original set of half-written patches I sent you.  Do an atomic copy_from_user()
inside the page lock and if that fails, zero out the remainder of the page, run
commit_write() and then redo the whole thing.

> I have an alterative fix
> where a temporary page is allocated if the write enncounters a non
> uptodate page. The usercopy then goes into that page, and from there
> into the target page after we have opened the prepare_write().

Remember that a non-uptodate page is the common case.

> My *first* idea to fix this was to do the atomic copy into a non-uptodate
> page and then calling a zero-length commit_write if it failed. I pretty
> carefully constructed all these good arguments as to why each case works
> properly, but in the end it just didn't fly because it broke lots of
> filesystems.

I forget the details now.  I think we did have a workable-looking solution
based on the atomic copy_from_user() but it would have re-exposed the old
problem wherein a page would fleetingly have a bunch of zeroes in the
middle of it, if someone looked at it during the write.

If that recollection is right, I think we could afford to reintroduce that
problem, frankly.  Especially as it only happens in the incredibly rare
case of that get_user()ed page getting unmapped under our feet.

> > > but you introduce the theoretical memory deadlock
> > > where a task cannot reclaim its own memory.
> > 
> > Nah, that'll never happen - both pages are already allocated.
> 
> Both pages? I don't get it.
> 
> You set the don't-reclaim vma flag, then run get_user, which takes a
> page fault and potentially has to allocate N pages for pagetables,
> pagecache readahead, buffers and fs private data and pagecache radix
> tree nodes for all of the pages read in.

Oh, OK.  Need to do the get_user() twice then.  Once before taking that new
rwsem.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-02-04 10:56             ` Andrew Morton
  0 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-02-04 10:56 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, 4 Feb 2007 11:46:09 +0100 Nick Piggin <npiggin@suse.de> wrote:

> On Sun, Feb 04, 2007 at 02:30:55AM -0800, Andrew Morton wrote:
> > On Sun, 4 Feb 2007 11:15:29 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > The write path is broken. I prefer my kernels slow, than buggy.
> > 
> > That won't fly.
> 
> What won't fly?

I suspect the performance cost of this approach would force us to redo it
all.

> > > > What happened to the idea of doing an atomic copy into the non-uptodate
> > > > page and handling it somehow?
> > > 
> > > That was my second idea.
> > 
> > Coulda sworn it was mine ;) I thought you ended up deciding it wasn't
> > practical because of the games we needed to play with ->commit_write.
> 
> Maybe I misunderstood what you meant, above.

The original set of half-written patches I sent you.  Do an atomic copy_from_user()
inside the page lock and if that fails, zero out the remainder of the page, run
commit_write() and then redo the whole thing.

> I have an alterative fix
> where a temporary page is allocated if the write enncounters a non
> uptodate page. The usercopy then goes into that page, and from there
> into the target page after we have opened the prepare_write().

Remember that a non-uptodate page is the common case.

> My *first* idea to fix this was to do the atomic copy into a non-uptodate
> page and then calling a zero-length commit_write if it failed. I pretty
> carefully constructed all these good arguments as to why each case works
> properly, but in the end it just didn't fly because it broke lots of
> filesystems.

I forget the details now.  I think we did have a workable-looking solution
based on the atomic copy_from_user() but it would have re-exposed the old
problem wherein a page would fleetingly have a bunch of zeroes in the
middle of it, if someone looked at it during the write.

If that recollection is right, I think we could afford to reintroduce that
problem, frankly.  Especially as it only happens in the incredibly rare
case of that get_user()ed page getting unmapped under our feet.

> > > but you introduce the theoretical memory deadlock
> > > where a task cannot reclaim its own memory.
> > 
> > Nah, that'll never happen - both pages are already allocated.
> 
> Both pages? I don't get it.
> 
> You set the don't-reclaim vma flag, then run get_user, which takes a
> page fault and potentially has to allocate N pages for pagetables,
> pagecache readahead, buffers and fs private data and pagecache radix
> tree nodes for all of the pages read in.

Oh, OK.  Need to do the get_user() twice then.  Once before taking that new
rwsem.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 10:46           ` Nick Piggin
@ 2007-02-04 10:50             ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-04 10:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, Feb 04, 2007 at 11:46:09AM +0100, Nick Piggin wrote:
> 
> > It's better than taking mmap_sem and walking pagetables...
> 
> I'm not convinced.

Though I am more convinced that looking at mm *at all* (either to
take the mmap_sem and find the vma, or to take the mmap_sem and
run get_user_pages) is going to hurt.

We'd have to special case kernel threads, which don't even have an
mm, let alone the vmas... too ugly.

I'll revert to my temporary-page approach: at least that will
fix the problem.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-02-04 10:50             ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-04 10:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, Feb 04, 2007 at 11:46:09AM +0100, Nick Piggin wrote:
> 
> > It's better than taking mmap_sem and walking pagetables...
> 
> I'm not convinced.

Though I am more convinced that looking at mm *at all* (either to
take the mmap_sem and find the vma, or to take the mmap_sem and
run get_user_pages) is going to hurt.

We'd have to special case kernel threads, which don't even have an
mm, let alone the vmas... too ugly.

I'll revert to my temporary-page approach: at least that will
fix the problem.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 10:30         ` Andrew Morton
@ 2007-02-04 10:46           ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-04 10:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, Feb 04, 2007 at 02:30:55AM -0800, Andrew Morton wrote:
> On Sun, 4 Feb 2007 11:15:29 +0100 Nick Piggin <npiggin@suse.de> wrote:
> 
> > The write path is broken. I prefer my kernels slow, than buggy.
> 
> That won't fly.

What won't fly?

> 
> > > There's a build error in filemap_xip.c btw.
> 
> ?

Thanks?

> > > What happened to the idea of doing an atomic copy into the non-uptodate
> > > page and handling it somehow?
> > 
> > That was my second idea.
> 
> Coulda sworn it was mine ;) I thought you ended up deciding it wasn't
> practical because of the games we needed to play with ->commit_write.

Maybe I misunderstood what you meant, above. I have an alterative fix
where a temporary page is allocated if the write enncounters a non
uptodate page. The usercopy then goes into that page, and from there
into the target page after we have opened the prepare_write().

My *first* idea to fix this was to do the atomic copy into a non-uptodate
page and then calling a zero-length commit_write if it failed. I pretty
carefully constructed all these good arguments as to why each case works
properly, but in the end it just didn't fly because it broke lots of
filesystems.

> > > Another option might be to effectively pin the whole mm during the copy:
> > > 
> > > 	down_read(&current->mm->unpaging_lock);
> > > 	get_user(addr);		/* Fault the page in */
> > > 	...
> > > 	copy_from_user()
> > > 	up_read(&current->mm->unpaging_lock);
> > > 
> > > then, anyone who wants to unmap pages from this mm requires
> > > write_lock(unpaging_lock).  So we know the results of that get_user()
> > > cannot be undone.
> > 
> > Fugly.
> 
> I invited you to think different - don't just fixate on one random
> tossed-out-there suggestion.

I've thought. Quite a lot. I have 2 other approaches that don't require
mmap_sem, and 1 which is actually possible to implement without breaking
filesystems.

> > but you introduce the theoretical memory deadlock
> > where a task cannot reclaim its own memory.
> 
> Nah, that'll never happen - both pages are already allocated.

Both pages? I don't get it.

You set the don't-reclaim vma flag, then run get_user, which takes a
page fault and potentially has to allocate N pages for pagetables,
pagecache readahead, buffers and fs private data and pagecache radix
tree nodes for all of the pages read in.

> It's better than taking mmap_sem and walking pagetables...

I'm not convinced.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-02-04 10:46           ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-04 10:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, Feb 04, 2007 at 02:30:55AM -0800, Andrew Morton wrote:
> On Sun, 4 Feb 2007 11:15:29 +0100 Nick Piggin <npiggin@suse.de> wrote:
> 
> > The write path is broken. I prefer my kernels slow, than buggy.
> 
> That won't fly.

What won't fly?

> 
> > > There's a build error in filemap_xip.c btw.
> 
> ?

Thanks?

> > > What happened to the idea of doing an atomic copy into the non-uptodate
> > > page and handling it somehow?
> > 
> > That was my second idea.
> 
> Coulda sworn it was mine ;) I thought you ended up deciding it wasn't
> practical because of the games we needed to play with ->commit_write.

Maybe I misunderstood what you meant, above. I have an alterative fix
where a temporary page is allocated if the write enncounters a non
uptodate page. The usercopy then goes into that page, and from there
into the target page after we have opened the prepare_write().

My *first* idea to fix this was to do the atomic copy into a non-uptodate
page and then calling a zero-length commit_write if it failed. I pretty
carefully constructed all these good arguments as to why each case works
properly, but in the end it just didn't fly because it broke lots of
filesystems.

> > > Another option might be to effectively pin the whole mm during the copy:
> > > 
> > > 	down_read(&current->mm->unpaging_lock);
> > > 	get_user(addr);		/* Fault the page in */
> > > 	...
> > > 	copy_from_user()
> > > 	up_read(&current->mm->unpaging_lock);
> > > 
> > > then, anyone who wants to unmap pages from this mm requires
> > > write_lock(unpaging_lock).  So we know the results of that get_user()
> > > cannot be undone.
> > 
> > Fugly.
> 
> I invited you to think different - don't just fixate on one random
> tossed-out-there suggestion.

I've thought. Quite a lot. I have 2 other approaches that don't require
mmap_sem, and 1 which is actually possible to implement without breaking
filesystems.

> > but you introduce the theoretical memory deadlock
> > where a task cannot reclaim its own memory.
> 
> Nah, that'll never happen - both pages are already allocated.

Both pages? I don't get it.

You set the don't-reclaim vma flag, then run get_user, which takes a
page fault and potentially has to allocate N pages for pagetables,
pagecache readahead, buffers and fs private data and pagecache radix
tree nodes for all of the pages read in.

> It's better than taking mmap_sem and walking pagetables...

I'm not convinced.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 10:15       ` Nick Piggin
@ 2007-02-04 10:30         ` Andrew Morton
  -1 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-02-04 10:30 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, 4 Feb 2007 11:15:29 +0100 Nick Piggin <npiggin@suse.de> wrote:

> On Sun, Feb 04, 2007 at 01:44:45AM -0800, Andrew Morton wrote:
> > On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > 2.  If we find the destination page is non uptodate, unlock it (this could be
> > >     made slightly more optimal), then find and pin the source page with
> > >     get_user_pages. Relock the destination page and continue with the copy.
> > >     However, instead of a usercopy (which might take a fault), copy the data
> > >     via the kernel address space.
> > 
> > argh.  We just can't go adding all this gunk into the write() path. 
> > 
> > mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every page. 
> > Even single-process write() will suffer, let along multithreaded stuff,
> > where mmap_sem contention may be the bigger problem.
> 
> The write path is broken. I prefer my kernels slow, than buggy.

That won't fly.

> > There's a build error in filemap_xip.c btw.

?

> > 
> > We need to think different.
> > 
> > What happened to the idea of doing an atomic copy into the non-uptodate
> > page and handling it somehow?
> 
> That was my second idea.

Coulda sworn it was mine ;) I thought you ended up deciding it wasn't
practical because of the games we needed to play with ->commit_write.

> I didn't get any feedback on that patchset
> except to try this method, so I assume everyone hated it.
> 
> I actually liked it, because it didn't have to do the writev
> segment-at-a-time for !uptodate pages like this one does. Considering
> this code gets called from mm-less contexts, maybe I'll have to go back
> to this approach.

OK.

> > Another option might be to effectively pin the whole mm during the copy:
> > 
> > 	down_read(&current->mm->unpaging_lock);
> > 	get_user(addr);		/* Fault the page in */
> > 	...
> > 	copy_from_user()
> > 	up_read(&current->mm->unpaging_lock);
> > 
> > then, anyone who wants to unmap pages from this mm requires
> > write_lock(unpaging_lock).  So we know the results of that get_user()
> > cannot be undone.
> 
> Fugly.

I invited you to think different - don't just fixate on one random
tossed-out-there suggestion.

> but you introduce the theoretical memory deadlock
> where a task cannot reclaim its own memory.

Nah, that'll never happen - both pages are already allocated.

It's better than taking mmap_sem and walking pagetables...

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-02-04 10:30         ` Andrew Morton
  0 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-02-04 10:30 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, 4 Feb 2007 11:15:29 +0100 Nick Piggin <npiggin@suse.de> wrote:

> On Sun, Feb 04, 2007 at 01:44:45AM -0800, Andrew Morton wrote:
> > On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > 2.  If we find the destination page is non uptodate, unlock it (this could be
> > >     made slightly more optimal), then find and pin the source page with
> > >     get_user_pages. Relock the destination page and continue with the copy.
> > >     However, instead of a usercopy (which might take a fault), copy the data
> > >     via the kernel address space.
> > 
> > argh.  We just can't go adding all this gunk into the write() path. 
> > 
> > mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every page. 
> > Even single-process write() will suffer, let along multithreaded stuff,
> > where mmap_sem contention may be the bigger problem.
> 
> The write path is broken. I prefer my kernels slow, than buggy.

That won't fly.

> > There's a build error in filemap_xip.c btw.

?

> > 
> > We need to think different.
> > 
> > What happened to the idea of doing an atomic copy into the non-uptodate
> > page and handling it somehow?
> 
> That was my second idea.

Coulda sworn it was mine ;) I thought you ended up deciding it wasn't
practical because of the games we needed to play with ->commit_write.

> I didn't get any feedback on that patchset
> except to try this method, so I assume everyone hated it.
> 
> I actually liked it, because it didn't have to do the writev
> segment-at-a-time for !uptodate pages like this one does. Considering
> this code gets called from mm-less contexts, maybe I'll have to go back
> to this approach.

OK.

> > Another option might be to effectively pin the whole mm during the copy:
> > 
> > 	down_read(&current->mm->unpaging_lock);
> > 	get_user(addr);		/* Fault the page in */
> > 	...
> > 	copy_from_user()
> > 	up_read(&current->mm->unpaging_lock);
> > 
> > then, anyone who wants to unmap pages from this mm requires
> > write_lock(unpaging_lock).  So we know the results of that get_user()
> > cannot be undone.
> 
> Fugly.

I invited you to think different - don't just fixate on one random
tossed-out-there suggestion.

> but you introduce the theoretical memory deadlock
> where a task cannot reclaim its own memory.

Nah, that'll never happen - both pages are already allocated.

It's better than taking mmap_sem and walking pagetables...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04 10:15       ` Nick Piggin
@ 2007-02-04 10:26         ` Christoph Hellwig
  -1 siblings, 0 replies; 105+ messages in thread
From: Christoph Hellwig @ 2007-02-04 10:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, Feb 04, 2007 at 11:15:29AM +0100, Nick Piggin wrote:
> Cool, a kernel thread is calling sys_write. Fun.

There are tons of places where we possible call into ->write from
either kernel threads or at least with a kernel pointer  and set_fs/set_ds
magic.  Anything in the buffer write path that tries to touch page tables
can't work.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-02-04 10:26         ` Christoph Hellwig
  0 siblings, 0 replies; 105+ messages in thread
From: Christoph Hellwig @ 2007-02-04 10:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, Feb 04, 2007 at 11:15:29AM +0100, Nick Piggin wrote:
> Cool, a kernel thread is calling sys_write. Fun.

There are tons of places where we possible call into ->write from
either kernel threads or at least with a kernel pointer  and set_fs/set_ds
magic.  Anything in the buffer write path that tries to touch page tables
can't work.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04  9:44     ` Andrew Morton
@ 2007-02-04 10:15       ` Nick Piggin
  -1 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-04 10:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, Feb 04, 2007 at 01:44:45AM -0800, Andrew Morton wrote:
> On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin <npiggin@suse.de> wrote:
> 
> > 2.  If we find the destination page is non uptodate, unlock it (this could be
> >     made slightly more optimal), then find and pin the source page with
> >     get_user_pages. Relock the destination page and continue with the copy.
> >     However, instead of a usercopy (which might take a fault), copy the data
> >     via the kernel address space.
> 
> argh.  We just can't go adding all this gunk into the write() path. 
> 
> mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every page. 
> Even single-process write() will suffer, let along multithreaded stuff,
> where mmap_sem contention may be the bigger problem.

The write path is broken. I prefer my kernels slow, than buggy.

As I said, I'm working on a replacement API so that the filesystems
that care, can be correct *and* fast.

> I was going to do some quick measurements of this, but the code oopses
> on power4 (http://userweb.kernel.org/~akpm/s5000402.jpg)

Cool, a kernel thread is calling sys_write. Fun.

I guess I should be able to reproduce this if using initramfs. Thanks.

> There's a build error in filemap_xip.c btw.
> 
> 
> 
> We need to think different.
> 
> What happened to the idea of doing an atomic copy into the non-uptodate
> page and handling it somehow?

That was my second idea. I didn't get any feedback on that patchset
except to try this method, so I assume everyone hated it.

I actually liked it, because it didn't have to do the writev
segment-at-a-time for !uptodate pages like this one does. Considering
this code gets called from mm-less contexts, maybe I'll have to go back
to this approach.

> Another option might be to effectively pin the whole mm during the copy:
> 
> 	down_read(&current->mm->unpaging_lock);
> 	get_user(addr);		/* Fault the page in */
> 	...
> 	copy_from_user()
> 	up_read(&current->mm->unpaging_lock);
> 
> then, anyone who wants to unmap pages from this mm requires
> write_lock(unpaging_lock).  So we know the results of that get_user()
> cannot be undone.

Fugly. Don't know whether there are any lock order problems making it
hard to implement, but you introduce the theoretical memory deadlock
where a task cannot reclaim its own memory.

> Or perhaps something like this can be done on a per-vma basis.  Just
> something to tell the VM "hey, you're not allowed to unmap this page right
> now"?

Same memory deadlock problem.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-02-04 10:15       ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-04 10:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun, Feb 04, 2007 at 01:44:45AM -0800, Andrew Morton wrote:
> On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin <npiggin@suse.de> wrote:
> 
> > 2.  If we find the destination page is non uptodate, unlock it (this could be
> >     made slightly more optimal), then find and pin the source page with
> >     get_user_pages. Relock the destination page and continue with the copy.
> >     However, instead of a usercopy (which might take a fault), copy the data
> >     via the kernel address space.
> 
> argh.  We just can't go adding all this gunk into the write() path. 
> 
> mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every page. 
> Even single-process write() will suffer, let along multithreaded stuff,
> where mmap_sem contention may be the bigger problem.

The write path is broken. I prefer my kernels slow, than buggy.

As I said, I'm working on a replacement API so that the filesystems
that care, can be correct *and* fast.

> I was going to do some quick measurements of this, but the code oopses
> on power4 (http://userweb.kernel.org/~akpm/s5000402.jpg)

Cool, a kernel thread is calling sys_write. Fun.

I guess I should be able to reproduce this if using initramfs. Thanks.

> There's a build error in filemap_xip.c btw.
> 
> 
> 
> We need to think different.
> 
> What happened to the idea of doing an atomic copy into the non-uptodate
> page and handling it somehow?

That was my second idea. I didn't get any feedback on that patchset
except to try this method, so I assume everyone hated it.

I actually liked it, because it didn't have to do the writev
segment-at-a-time for !uptodate pages like this one does. Considering
this code gets called from mm-less contexts, maybe I'll have to go back
to this approach.

> Another option might be to effectively pin the whole mm during the copy:
> 
> 	down_read(&current->mm->unpaging_lock);
> 	get_user(addr);		/* Fault the page in */
> 	...
> 	copy_from_user()
> 	up_read(&current->mm->unpaging_lock);
> 
> then, anyone who wants to unmap pages from this mm requires
> write_lock(unpaging_lock).  So we know the results of that get_user()
> cannot be undone.

Fugly. Don't know whether there are any lock order problems making it
hard to implement, but you introduce the theoretical memory deadlock
where a task cannot reclaim its own memory.

> Or perhaps something like this can be done on a per-vma basis.  Just
> something to tell the VM "hey, you're not allowed to unmap this page right
> now"?

Same memory deadlock problem.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04  8:51   ` Nick Piggin
@ 2007-02-04  9:44     ` Andrew Morton
  -1 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-02-04  9:44 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin <npiggin@suse.de> wrote:

> 2.  If we find the destination page is non uptodate, unlock it (this could be
>     made slightly more optimal), then find and pin the source page with
>     get_user_pages. Relock the destination page and continue with the copy.
>     However, instead of a usercopy (which might take a fault), copy the data
>     via the kernel address space.

argh.  We just can't go adding all this gunk into the write() path. 

mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every page. 
Even single-process write() will suffer, let along multithreaded stuff,
where mmap_sem contention may be the bigger problem.

I was going to do some quick measurements of this, but the code oopses
on power4 (http://userweb.kernel.org/~akpm/s5000402.jpg)

There's a build error in filemap_xip.c btw.



We need to think different.

What happened to the idea of doing an atomic copy into the non-uptodate
page and handling it somehow?



Another option might be to effectively pin the whole mm during the copy:

	down_read(&current->mm->unpaging_lock);
	get_user(addr);		/* Fault the page in */
	...
	copy_from_user()
	up_read(&current->mm->unpaging_lock);

then, anyone who wants to unmap pages from this mm requires
write_lock(unpaging_lock).  So we know the results of that get_user()
cannot be undone.

Or perhaps something like this can be done on a per-vma basis.  Just
something to tell the VM "hey, you're not allowed to unmap this page right
now"?

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-02-04  9:44     ` Andrew Morton
  0 siblings, 0 replies; 105+ messages in thread
From: Andrew Morton @ 2007-02-04  9:44 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel, Linux Filesystems, Linux Memory Management

On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin <npiggin@suse.de> wrote:

> 2.  If we find the destination page is non uptodate, unlock it (this could be
>     made slightly more optimal), then find and pin the source page with
>     get_user_pages. Relock the destination page and continue with the copy.
>     However, instead of a usercopy (which might take a fault), copy the data
>     via the kernel address space.

argh.  We just can't go adding all this gunk into the write() path. 

mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every page. 
Even single-process write() will suffer, let along multithreaded stuff,
where mmap_sem contention may be the bigger problem.

I was going to do some quick measurements of this, but the code oopses
on power4 (http://userweb.kernel.org/~akpm/s5000402.jpg)

There's a build error in filemap_xip.c btw.



We need to think different.

What happened to the idea of doing an atomic copy into the non-uptodate
page and handling it somehow?



Another option might be to effectively pin the whole mm during the copy:

	down_read(&current->mm->unpaging_lock);
	get_user(addr);		/* Fault the page in */
	...
	copy_from_user()
	up_read(&current->mm->unpaging_lock);

then, anyone who wants to unmap pages from this mm requires
write_lock(unpaging_lock).  So we know the results of that get_user()
cannot be undone.

Or perhaps something like this can be done on a per-vma basis.  Just
something to tell the VM "hey, you're not allowed to unmap this page right
now"?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [patch 9/9] mm: fix pagecache write deadlocks
  2007-02-04  8:49 Nick Piggin
@ 2007-02-04  8:51   ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-04  8:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

Modify the core write() code so that it won't take a pagefault while holding a
lock on the pagecache page. There are a number of different deadlocks possible
if we try to do such a thing:

1.  generic_buffered_write
2.   lock_page
3.    prepare_write
4.     unlock_page+vmtruncate
5.     copy_from_user
6.      mmap_sem(r)
7.       handle_mm_fault
8.        lock_page (filemap_nopage)
9.    commit_write
10.  unlock_page

a. sys_munmap / sys_mlock / others
b.  mmap_sem(w)
c.   make_pages_present
d.    get_user_pages
e.     handle_mm_fault
f.      lock_page (filemap_nopage)

2,8	- recursive deadlock if page is same
2,8;2,8	- ABBA deadlock is page is different
2,6;b,f	- ABBA deadlock if page is same

The solution is as follows:
1.  If we find the destination page is uptodate, continue as normal, but use
    atomic usercopies which do not take pagefaults and do not zero the uncopied
    tail of the destination. The destination is already uptodate, so we can
    commit_write the full length even if there was a partial copy: it does not
    matter that the tail was not modified, because if it is dirtied and written
    back to disk it will not cause any problems (uptodate *means* that the
    destination page is as new or newer than the copy on disk).

1a. The above requires that fault_in_pages_readable correctly returns access
    information, because atomic usercopies cannot distinguish between
    non-present pages in a readable mapping, from lack of a readable mapping.

2.  If we find the destination page is non uptodate, unlock it (this could be
    made slightly more optimal), then find and pin the source page with
    get_user_pages. Relock the destination page and continue with the copy.
    However, instead of a usercopy (which might take a fault), copy the data
    via the kernel address space.

(also, rename maxlen to seglen, because it was confusing)

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2052,11 +2052,12 @@ generic_file_buffered_write(struct kiocb
 	filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, written);
 
 	do {
+		struct page *src_page;
 		struct page *page;
 		pgoff_t index;		/* Pagecache index for current page */
 		unsigned long offset;	/* Offset into pagecache page */
-		unsigned long maxlen;	/* Bytes remaining in current iovec */
-		size_t bytes;		/* Bytes to write to page */
+		unsigned long seglen;	/* Bytes remaining in current iovec */
+		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
 
 		buf = cur_iov->iov_base + iov_offset;
@@ -2066,20 +2067,30 @@ generic_file_buffered_write(struct kiocb
 		if (bytes > count)
 			bytes = count;
 
-		maxlen = cur_iov->iov_len - iov_offset;
-		if (maxlen > bytes)
-			maxlen = bytes;
+		/*
+		 * a non-NULL src_page indicates that we're doing the
+		 * copy via get_user_pages and kmap.
+		 */
+		src_page = NULL;
+
+		seglen = cur_iov->iov_len - iov_offset;
+		if (seglen > bytes)
+			seglen = bytes;
 
-#ifndef CONFIG_DEBUG_VM
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
+		 *
+		 * Not only is this an optimisation, but it is also required
+		 * to check that the address is actually valid, when atomic
+		 * usercopies are used, below.
 		 */
-		fault_in_pages_readable(buf, maxlen);
-#endif
-
+		if (unlikely(fault_in_pages_readable(buf, seglen))) {
+			status = -EFAULT;
+			break;
+		}
 
 		page = __grab_cache_page(mapping, index);
 		if (!page) {
@@ -2087,32 +2098,94 @@ generic_file_buffered_write(struct kiocb
 			break;
 		}
 
+		/*
+		 * non-uptodate pages cannot cope with short copies, and we
+		 * cannot take a pagefault with the destination page locked.
+		 * So pin the source page to copy it.
+		 */
+		if (!PageUptodate(page)) {
+			unlock_page(page);
+
+			bytes = min(bytes, PAGE_CACHE_SIZE -
+				     ((unsigned long)buf & ~PAGE_CACHE_MASK));
+
+			/*
+			 * Cannot get_user_pages with a page locked for the
+			 * same reason as we can't take a page fault with a
+			 * page locked (as explained below).
+			 */
+			down_read(&current->mm->mmap_sem);
+			status = get_user_pages(current, current->mm,
+					(unsigned long)buf & PAGE_CACHE_MASK, 1,
+					0, 0, &src_page, NULL);
+			up_read(&current->mm->mmap_sem);
+			if (status != 1) {
+				page_cache_release(page);
+				break;
+			}
+
+			lock_page(page);
+			if (!page->mapping) {
+				unlock_page(page);
+				page_cache_release(page);
+				page_cache_release(src_page);
+				continue;
+			}
+
+		}
+
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
 		if (unlikely(status))
 			goto fs_write_aop_error;
 
-		copied = filemap_copy_from_user(page, offset,
+		if (!src_page) {
+			/*
+			 * Must not enter the pagefault handler here, because
+			 * we hold the page lock, so we might recursively
+			 * deadlock on the same lock, or get an ABBA deadlock
+			 * against a different lock, or against the mmap_sem
+			 * (which nests outside the page lock).  So increment
+			 * preempt count, and use _atomic usercopies.
+			 *
+			 * The page is uptodate so we are OK to encounter a
+			 * short copy: if unmodified parts of the page are
+			 * marked dirty and written out to disk, it doesn't
+			 * really matter.
+			 */
+			pagefault_disable();
+			copied = filemap_copy_from_user_atomic(page, offset,
 					cur_iov, nr_segs, iov_offset, bytes);
+			pagefault_enable();
+		} else {
+			char *src, *dst;
+			src = kmap_atomic(src_page, KM_USER0);
+			dst = kmap_atomic(page, KM_USER1);
+			memcpy(dst + offset,
+				src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
+				bytes);
+			kunmap_atomic(dst, KM_USER1);
+			kunmap_atomic(src, KM_USER0);
+			copied = bytes;
+		}
 		flush_dcache_page(page);
 
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (unlikely(status < 0))
 			goto fs_write_aop_error;
-		if (unlikely(copied != bytes)) {
-			status = -EFAULT;
-			goto fs_write_aop_error;
-		}
 		if (unlikely(status > 0)) /* filesystem did partial write */
-			copied = status;
+			copied = min_t(size_t, copied, status);
+
+		unlock_page(page);
+		mark_page_accessed(page);
+		page_cache_release(page);
+		if (src_page)
+			page_cache_release(src_page);
 
 		written += copied;
 		count -= copied;
 		pos += copied;
 		filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, copied);
 
-		unlock_page(page);
-		mark_page_accessed(page);
-		page_cache_release(page);
 		balance_dirty_pages_ratelimited(mapping);
 		cond_resched();
 		continue;
@@ -2121,6 +2194,8 @@ fs_write_aop_error:
 		if (status != AOP_TRUNCATED_PAGE)
 			unlock_page(page);
 		page_cache_release(page);
+		if (src_page)
+			page_cache_release(src_page);
 
 		/*
 		 * prepare_write() may have instantiated a few blocks
@@ -2133,7 +2208,6 @@ fs_write_aop_error:
 			continue;
 		else
 			break;
-
 	} while (count);
 	*ppos = pos;
 
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -198,6 +198,9 @@ static inline int fault_in_pages_writeab
 {
 	int ret;
 
+	if (unlikely(size == 0))
+		return 0;
+
 	/*
 	 * Writing zeroes into userspace here is OK, because we know that if
 	 * the zero gets there, we'll be overwriting it.
@@ -217,19 +220,23 @@ static inline int fault_in_pages_writeab
 	return ret;
 }
 
-static inline void fault_in_pages_readable(const char __user *uaddr, int size)
+static inline int fault_in_pages_readable(const char __user *uaddr, int size)
 {
 	volatile char c;
 	int ret;
 
+	if (unlikely(size == 0))
+		return 0;
+
 	ret = __get_user(c, uaddr);
 	if (ret == 0) {
 		const char __user *end = uaddr + size - 1;
 
 		if (((unsigned long)uaddr & PAGE_MASK) !=
 				((unsigned long)end & PAGE_MASK))
-		 	__get_user(c, end);
+		 	ret = __get_user(c, end);
 	}
+	return ret;
 }
 
 #endif /* _LINUX_PAGEMAP_H */

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [patch 9/9] mm: fix pagecache write deadlocks
@ 2007-02-04  8:51   ` Nick Piggin
  0 siblings, 0 replies; 105+ messages in thread
From: Nick Piggin @ 2007-02-04  8:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel, Linux Filesystems, Nick Piggin, Linux Memory Management

Modify the core write() code so that it won't take a pagefault while holding a
lock on the pagecache page. There are a number of different deadlocks possible
if we try to do such a thing:

1.  generic_buffered_write
2.   lock_page
3.    prepare_write
4.     unlock_page+vmtruncate
5.     copy_from_user
6.      mmap_sem(r)
7.       handle_mm_fault
8.        lock_page (filemap_nopage)
9.    commit_write
10.  unlock_page

a. sys_munmap / sys_mlock / others
b.  mmap_sem(w)
c.   make_pages_present
d.    get_user_pages
e.     handle_mm_fault
f.      lock_page (filemap_nopage)

2,8	- recursive deadlock if page is same
2,8;2,8	- ABBA deadlock is page is different
2,6;b,f	- ABBA deadlock if page is same

The solution is as follows:
1.  If we find the destination page is uptodate, continue as normal, but use
    atomic usercopies which do not take pagefaults and do not zero the uncopied
    tail of the destination. The destination is already uptodate, so we can
    commit_write the full length even if there was a partial copy: it does not
    matter that the tail was not modified, because if it is dirtied and written
    back to disk it will not cause any problems (uptodate *means* that the
    destination page is as new or newer than the copy on disk).

1a. The above requires that fault_in_pages_readable correctly returns access
    information, because atomic usercopies cannot distinguish between
    non-present pages in a readable mapping, from lack of a readable mapping.

2.  If we find the destination page is non uptodate, unlock it (this could be
    made slightly more optimal), then find and pin the source page with
    get_user_pages. Relock the destination page and continue with the copy.
    However, instead of a usercopy (which might take a fault), copy the data
    via the kernel address space.

(also, rename maxlen to seglen, because it was confusing)

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2052,11 +2052,12 @@ generic_file_buffered_write(struct kiocb
 	filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, written);
 
 	do {
+		struct page *src_page;
 		struct page *page;
 		pgoff_t index;		/* Pagecache index for current page */
 		unsigned long offset;	/* Offset into pagecache page */
-		unsigned long maxlen;	/* Bytes remaining in current iovec */
-		size_t bytes;		/* Bytes to write to page */
+		unsigned long seglen;	/* Bytes remaining in current iovec */
+		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
 
 		buf = cur_iov->iov_base + iov_offset;
@@ -2066,20 +2067,30 @@ generic_file_buffered_write(struct kiocb
 		if (bytes > count)
 			bytes = count;
 
-		maxlen = cur_iov->iov_len - iov_offset;
-		if (maxlen > bytes)
-			maxlen = bytes;
+		/*
+		 * a non-NULL src_page indicates that we're doing the
+		 * copy via get_user_pages and kmap.
+		 */
+		src_page = NULL;
+
+		seglen = cur_iov->iov_len - iov_offset;
+		if (seglen > bytes)
+			seglen = bytes;
 
-#ifndef CONFIG_DEBUG_VM
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
+		 *
+		 * Not only is this an optimisation, but it is also required
+		 * to check that the address is actually valid, when atomic
+		 * usercopies are used, below.
 		 */
-		fault_in_pages_readable(buf, maxlen);
-#endif
-
+		if (unlikely(fault_in_pages_readable(buf, seglen))) {
+			status = -EFAULT;
+			break;
+		}
 
 		page = __grab_cache_page(mapping, index);
 		if (!page) {
@@ -2087,32 +2098,94 @@ generic_file_buffered_write(struct kiocb
 			break;
 		}
 
+		/*
+		 * non-uptodate pages cannot cope with short copies, and we
+		 * cannot take a pagefault with the destination page locked.
+		 * So pin the source page to copy it.
+		 */
+		if (!PageUptodate(page)) {
+			unlock_page(page);
+
+			bytes = min(bytes, PAGE_CACHE_SIZE -
+				     ((unsigned long)buf & ~PAGE_CACHE_MASK));
+
+			/*
+			 * Cannot get_user_pages with a page locked for the
+			 * same reason as we can't take a page fault with a
+			 * page locked (as explained below).
+			 */
+			down_read(&current->mm->mmap_sem);
+			status = get_user_pages(current, current->mm,
+					(unsigned long)buf & PAGE_CACHE_MASK, 1,
+					0, 0, &src_page, NULL);
+			up_read(&current->mm->mmap_sem);
+			if (status != 1) {
+				page_cache_release(page);
+				break;
+			}
+
+			lock_page(page);
+			if (!page->mapping) {
+				unlock_page(page);
+				page_cache_release(page);
+				page_cache_release(src_page);
+				continue;
+			}
+
+		}
+
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
 		if (unlikely(status))
 			goto fs_write_aop_error;
 
-		copied = filemap_copy_from_user(page, offset,
+		if (!src_page) {
+			/*
+			 * Must not enter the pagefault handler here, because
+			 * we hold the page lock, so we might recursively
+			 * deadlock on the same lock, or get an ABBA deadlock
+			 * against a different lock, or against the mmap_sem
+			 * (which nests outside the page lock).  So increment
+			 * preempt count, and use _atomic usercopies.
+			 *
+			 * The page is uptodate so we are OK to encounter a
+			 * short copy: if unmodified parts of the page are
+			 * marked dirty and written out to disk, it doesn't
+			 * really matter.
+			 */
+			pagefault_disable();
+			copied = filemap_copy_from_user_atomic(page, offset,
 					cur_iov, nr_segs, iov_offset, bytes);
+			pagefault_enable();
+		} else {
+			char *src, *dst;
+			src = kmap_atomic(src_page, KM_USER0);
+			dst = kmap_atomic(page, KM_USER1);
+			memcpy(dst + offset,
+				src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
+				bytes);
+			kunmap_atomic(dst, KM_USER1);
+			kunmap_atomic(src, KM_USER0);
+			copied = bytes;
+		}
 		flush_dcache_page(page);
 
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (unlikely(status < 0))
 			goto fs_write_aop_error;
-		if (unlikely(copied != bytes)) {
-			status = -EFAULT;
-			goto fs_write_aop_error;
-		}
 		if (unlikely(status > 0)) /* filesystem did partial write */
-			copied = status;
+			copied = min_t(size_t, copied, status);
+
+		unlock_page(page);
+		mark_page_accessed(page);
+		page_cache_release(page);
+		if (src_page)
+			page_cache_release(src_page);
 
 		written += copied;
 		count -= copied;
 		pos += copied;
 		filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, copied);
 
-		unlock_page(page);
-		mark_page_accessed(page);
-		page_cache_release(page);
 		balance_dirty_pages_ratelimited(mapping);
 		cond_resched();
 		continue;
@@ -2121,6 +2194,8 @@ fs_write_aop_error:
 		if (status != AOP_TRUNCATED_PAGE)
 			unlock_page(page);
 		page_cache_release(page);
+		if (src_page)
+			page_cache_release(src_page);
 
 		/*
 		 * prepare_write() may have instantiated a few blocks
@@ -2133,7 +2208,6 @@ fs_write_aop_error:
 			continue;
 		else
 			break;
-
 	} while (count);
 	*ppos = pos;
 
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -198,6 +198,9 @@ static inline int fault_in_pages_writeab
 {
 	int ret;
 
+	if (unlikely(size == 0))
+		return 0;
+
 	/*
 	 * Writing zeroes into userspace here is OK, because we know that if
 	 * the zero gets there, we'll be overwriting it.
@@ -217,19 +220,23 @@ static inline int fault_in_pages_writeab
 	return ret;
 }
 
-static inline void fault_in_pages_readable(const char __user *uaddr, int size)
+static inline int fault_in_pages_readable(const char __user *uaddr, int size)
 {
 	volatile char c;
 	int ret;
 
+	if (unlikely(size == 0))
+		return 0;
+
 	ret = __get_user(c, uaddr);
 	if (ret == 0) {
 		const char __user *end = uaddr + size - 1;
 
 		if (((unsigned long)uaddr & PAGE_MASK) !=
 				((unsigned long)end & PAGE_MASK))
-		 	__get_user(c, end);
+		 	ret = __get_user(c, end);
 	}
+	return ret;
 }
 
 #endif /* _LINUX_PAGEMAP_H */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 105+ messages in thread

end of thread, other threads:[~2007-02-06 13:13 UTC | newest]

Thread overview: 105+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-01-29 10:31 [patch 0/9] buffered write deadlock fix Nick Piggin
2007-01-29 10:31 ` Nick Piggin
2007-01-29 10:31 ` [patch 1/9] fs: libfs buffered write leak fix Nick Piggin
2007-01-29 10:31   ` Nick Piggin
2007-02-02 23:52   ` Andrew Morton
2007-02-02 23:52     ` Andrew Morton
2007-02-03  1:33     ` Nick Piggin
2007-02-03  1:33       ` Nick Piggin
2007-02-03  1:58       ` Andrew Morton
2007-02-03  1:58         ` Andrew Morton
2007-02-03  2:09         ` Nick Piggin
2007-02-03  2:09           ` Nick Piggin
2007-02-03  2:19           ` Andrew Morton
2007-02-03  2:19             ` Andrew Morton
2007-02-03  2:28             ` Nick Piggin
2007-02-03  2:28               ` Nick Piggin
2007-02-03 17:49       ` Jörn Engel
2007-02-03 17:49         ` Jörn Engel
2007-02-04  3:55         ` Nick Piggin
2007-02-04  3:55           ` Nick Piggin
2007-02-04  3:55           ` Nick Piggin
2007-01-29 10:31 ` [patch 2/9] mm: revert "generic_file_buffered_write(): handle zero length iovec segments" Nick Piggin
2007-01-29 10:31   ` Nick Piggin, Andrew Morton
2007-01-29 10:32 ` [patch 3/9] mm: revert "generic_file_buffered_write(): deadlock on vectored write" Nick Piggin
2007-01-29 10:32   ` Nick Piggin, Andrew Morton
2007-01-29 10:32 ` [patch 4/9] mm: generic_file_buffered_write cleanup Nick Piggin
2007-01-29 10:32   ` Nick Piggin, Andrew Morton
2007-01-29 10:32 ` [patch 5/9] mm: debug write deadlocks Nick Piggin
2007-01-29 10:32   ` Nick Piggin
2007-01-29 10:32 ` [patch 6/9] mm: be sure to trim blocks Nick Piggin
2007-01-29 10:32   ` Nick Piggin
2007-01-29 10:32 ` [patch 7/9] mm: cleanup pagecache insertion operations Nick Piggin
2007-01-29 10:32   ` Nick Piggin
2007-01-29 10:32 ` [patch 8/9] mm: generic_file_buffered_write iovec cleanup Nick Piggin
2007-01-29 10:32   ` Nick Piggin
2007-01-29 10:33 ` [patch 9/9] mm: fix pagecache write deadlocks Nick Piggin
2007-01-29 10:33   ` Nick Piggin
2007-01-29 11:11   ` Nick Piggin
2007-01-29 11:11     ` Nick Piggin
2007-02-02 23:53   ` Andrew Morton
2007-02-02 23:53     ` Andrew Morton
2007-02-03  1:38     ` Nick Piggin
2007-02-03  1:38       ` Nick Piggin
2007-01-30 20:55 ` [patch 0/9] buffered write deadlock fix Andrew Morton
2007-01-30 20:55   ` Andrew Morton
2007-01-30 23:21   ` Andrew Morton
2007-01-30 23:21     ` Andrew Morton
2007-01-31  1:31     ` Nick Piggin
2007-01-31  1:31       ` Nick Piggin
2007-01-31  0:32   ` Nick Piggin
2007-01-31  0:32     ` Nick Piggin
2007-02-02 23:52 ` Andrew Morton
2007-02-02 23:52   ` Andrew Morton
2007-02-03  1:22   ` Nick Piggin
2007-02-03  1:22     ` Nick Piggin
2007-02-03  6:43   ` Suparna Bhattacharya
2007-02-03  6:43     ` Suparna Bhattacharya
2007-02-03 15:31   ` Fengguang Wu
2007-02-03 15:31     ` Fengguang Wu
2007-02-03 15:31       ` Fengguang Wu
2007-02-04  8:49 Nick Piggin
2007-02-04  8:51 ` [patch 9/9] mm: fix pagecache write deadlocks Nick Piggin
2007-02-04  8:51   ` Nick Piggin
2007-02-04  9:44   ` Andrew Morton
2007-02-04  9:44     ` Andrew Morton
2007-02-04 10:15     ` Nick Piggin
2007-02-04 10:15       ` Nick Piggin
2007-02-04 10:26       ` Christoph Hellwig
2007-02-04 10:26         ` Christoph Hellwig
2007-02-04 10:30       ` Andrew Morton
2007-02-04 10:30         ` Andrew Morton
2007-02-04 10:46         ` Nick Piggin
2007-02-04 10:46           ` Nick Piggin
2007-02-04 10:50           ` Nick Piggin
2007-02-04 10:50             ` Nick Piggin
2007-02-04 10:56           ` Andrew Morton
2007-02-04 10:56             ` Andrew Morton
2007-02-04 11:03             ` Nick Piggin
2007-02-04 11:03               ` Nick Piggin
2007-02-04 11:15               ` Andrew Morton
2007-02-04 11:15                 ` Andrew Morton
2007-02-04 15:10                 ` Nick Piggin
2007-02-04 15:10                   ` Nick Piggin
2007-02-04 18:36                   ` Andrew Morton
2007-02-04 18:36                     ` Andrew Morton
2007-02-06  2:25                     ` Nick Piggin
2007-02-06  2:25                       ` Nick Piggin
2007-02-06  4:41                       ` Nick Piggin
2007-02-06  4:41                         ` Nick Piggin
2007-02-06  5:30                         ` Andrew Morton
2007-02-06  5:30                           ` Andrew Morton
2007-02-06  5:49                           ` Nick Piggin
2007-02-06  5:49                             ` Nick Piggin
2007-02-06  5:53                             ` Nick Piggin
2007-02-06  5:53                               ` Nick Piggin
2007-02-04 10:59     ` Anton Altaparmakov
2007-02-04 10:59       ` Anton Altaparmakov
2007-02-04 11:10       ` Andrew Morton
2007-02-04 11:10         ` Andrew Morton
2007-02-04 11:22         ` Nick Piggin
2007-02-04 11:22           ` Nick Piggin
2007-02-04 17:40         ` Anton Altaparmakov
2007-02-04 17:40           ` Anton Altaparmakov
2007-02-06  2:09           ` Nick Piggin
2007-02-06  2:09             ` Nick Piggin
2007-02-06 13:13             ` Anton Altaparmakov
2007-02-06 13:13               ` Anton Altaparmakov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.