All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch] mm: close page_mkwrite races (try 2)
@ 2009-04-14  7:11 ` Nick Piggin
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2009-04-14  7:11 UTC (permalink / raw)
  To: Andrew Morton, Sage Weil, Trond Myklebust, linux-fsdevel,
	Linux Memory Management List

Let's try this again with a better changelog.
--

Change page_mkwrite to allow callers to return with the page locked,
and change it's page fault callers to hold the lock until the page
is marked dirty. This allows the filesystem to have full control of
metadata associated with a dirty page. We'd like to call page_mkwrite
with the page unlocked and return with it locked, so that filesystems
can avoid LOR conditions with page lock.

A filesystem that wants to set some metadata to a page while it
is dirty, will manipulate the metadata in its ->page_mkwrite. At
this point, even if it does a set_page_dirty in its page_mkwrite
handler, it must return with the page unlocked (according to the
page_mkwrite convention).

In this window, the VM could write out the page, clearing page-dirty.
The filesystem has no way to detect that a dirty pte is about to be
attached, so it will happily write out the page, at which point, the
dirty-page metadata might be freed.

It is not always possible to perform the required metadata manipulation
in ->set_page_dirty, because that function cannot block or fail.

The VM cannot mark the pte dirty before page_mkwrite, because
page_mkwrite is allowed to fail (and anyway, it would probably be
hard to avoid races where the pte gets cleaned along the way).

Holding the page locked over the 3 critical operations (page_mkwrite,
setting the pte dirty, and finally setting the page dirty) closes out
races nicely, because the page must be locked to clean it for writeout.
It provides the filesystem with a strong synchronisation against VM.

- Sage needs this race closed for ceph filesystem.
- Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
- I need it for fsblock.
- I suspect other filesystems may need it too (eg. btrfs).
- I have converted buffer.c to the new locking. Even simple block allocation
  under dirty pages might be susceptible to i_size changing under partial
  page at the end of file (we also have a buffer.c-side problem here, but it
  cannot be fixed properly without this patch).
- Other filesystems (eg. NFS, maybe btrfs) will need to change their
  page_mkwrite functions themselves.

[ This also moves page_mkwrite another step closer to fault, which
should eventually allow page_mkwrite to be moved into ->fault, and
thus avoiding a filesystem calldown and page lock/unlock cycle in
__do_fault. ]

Cc: Sage Weil <sage@newdream.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Nick Piggin <npiggin@suse.de>

---
 Documentation/filesystems/Locking |   24 +++++++---
 fs/buffer.c                       |   10 ++--
 mm/memory.c                       |   83 ++++++++++++++++++++++++++------------
 3 files changed, 79 insertions(+), 38 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2383,7 +2383,8 @@ block_page_mkwrite(struct vm_area_struct
 	if ((page->mapping != inode->i_mapping) ||
 	    (page_offset(page) > size)) {
 		/* page got truncated out from underneath us */
-		goto out_unlock;
+		unlock_page(page);
+		goto out;
 	}
 
 	/* page is wholly or partially inside EOF */
@@ -2397,14 +2398,15 @@ block_page_mkwrite(struct vm_area_struct
 		ret = block_commit_write(page, 0, end);
 
 	if (unlikely(ret)) {
+		unlock_page(page);
 		if (ret == -ENOMEM)
 			ret = VM_FAULT_OOM;
 		else /* -ENOSPC, -EIO, etc */
 			ret = VM_FAULT_SIGBUS;
-	}
+	} else
+		ret = VM_FAULT_LOCKED;
 
-out_unlock:
-	unlock_page(page);
+out:
 	return ret;
 }
 
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1971,6 +1971,15 @@ static int do_wp_page(struct mm_struct *
 				ret = tmp;
 				goto unwritable_page;
 			}
+			if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
+				lock_page(old_page);
+				if (!old_page->mapping) {
+					ret = 0; /* retry the fault */
+					unlock_page(old_page);
+					goto unwritable_page;
+				}
+			} else
+				VM_BUG_ON(!PageLocked(old_page));
 
 			/*
 			 * Since we dropped the lock we need to revalidate
@@ -1980,9 +1989,11 @@ static int do_wp_page(struct mm_struct *
 			 */
 			page_table = pte_offset_map_lock(mm, pmd, address,
 							 &ptl);
-			page_cache_release(old_page);
-			if (!pte_same(*page_table, orig_pte))
+			if (!pte_same(*page_table, orig_pte)) {
+				page_cache_release(old_page);
+				unlock_page(old_page);
 				goto unlock;
+			}
 
 			page_mkwrite = 1;
 		}
@@ -2105,16 +2116,30 @@ unlock:
 		 *
 		 * do_no_page is protected similarly.
 		 */
-		wait_on_page_locked(dirty_page);
-		set_page_dirty_balance(dirty_page, page_mkwrite);
+		if (!page_mkwrite) {
+			wait_on_page_locked(dirty_page);
+			set_page_dirty_balance(dirty_page, page_mkwrite);
+		}
 		put_page(dirty_page);
+		if (page_mkwrite) {
+			struct address_space *mapping = old_page->mapping;
+
+			unlock_page(old_page);
+			page_cache_release(old_page);
+			balance_dirty_pages_ratelimited(mapping);
+		}
 	}
 	return ret;
 oom_free_new:
 	page_cache_release(new_page);
 oom:
-	if (old_page)
+	if (old_page) {
+		if (page_mkwrite) {
+			unlock_page(old_page);
+			page_cache_release(old_page);
+		}
 		page_cache_release(old_page);
+	}
 	return VM_FAULT_OOM;
 
 unwritable_page:
@@ -2664,27 +2689,22 @@ static int __do_fault(struct mm_struct *
 				int tmp;
 
 				unlock_page(page);
-				vmf.flags |= FAULT_FLAG_MKWRITE;
+				vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
 				tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
 				if (unlikely(tmp &
 					  (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
 					ret = tmp;
-					anon = 1; /* no anon but release vmf.page */
-					goto out_unlocked;
-				}
-				lock_page(page);
-				/*
-				 * XXX: this is not quite right (racy vs
-				 * invalidate) to unlock and relock the page
-				 * like this, however a better fix requires
-				 * reworking page_mkwrite locking API, which
-				 * is better done later.
-				 */
-				if (!page->mapping) {
-					ret = 0;
-					anon = 1; /* no anon but release vmf.page */
-					goto out;
+					goto unwritable_page;
 				}
+				if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
+					lock_page(page);
+					if (!page->mapping) {
+						ret = 0; /* retry the fault */
+						unlock_page(page);
+						goto unwritable_page;
+					}
+				} else
+					VM_BUG_ON(!PageLocked(page));
 				page_mkwrite = 1;
 			}
 		}
@@ -2736,19 +2756,30 @@ static int __do_fault(struct mm_struct *
 	pte_unmap_unlock(page_table, ptl);
 
 out:
-	unlock_page(vmf.page);
-out_unlocked:
-	if (anon)
-		page_cache_release(vmf.page);
-	else if (dirty_page) {
+	if (dirty_page) {
+		struct address_space *mapping = page->mapping;
+
 		if (vma->vm_file)
 			file_update_time(vma->vm_file);
 
+		if (set_page_dirty(dirty_page))
+			page_mkwrite = 1;
 		set_page_dirty_balance(dirty_page, page_mkwrite);
+		unlock_page(dirty_page);
 		put_page(dirty_page);
+		if (page_mkwrite)
+			balance_dirty_pages_ratelimited(mapping);
+	} else {
+		unlock_page(vmf.page);
+		if (anon)
+			page_cache_release(vmf.page);
 	}
 
 	return ret;
+
+unwritable_page:
+	page_cache_release(page);
+	return ret;
 }
 
 static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
Index: linux-2.6/Documentation/filesystems/Locking
===================================================================
--- linux-2.6.orig/Documentation/filesystems/Locking
+++ linux-2.6/Documentation/filesystems/Locking
@@ -512,16 +512,24 @@ locking rules:
 		BKL	mmap_sem	PageLocked(page)
 open:		no	yes
 close:		no	yes
-fault:		no	yes
-page_mkwrite:	no	yes		no
+fault:		no	yes		can return with page locked
+page_mkwrite:	no	yes		can return with page locked
 access:		no	yes
 
-	->page_mkwrite() is called when a previously read-only page is
-about to become writeable. The file system is responsible for
-protecting against truncate races. Once appropriate action has been
-taking to lock out truncate, the page range should be verified to be
-within i_size. The page mapping should also be checked that it is not
-NULL.
+	->fault() is called when a previously not present pte is about
+to be faulted in. The filesystem must find and return the page associated
+with the passed in "pgoff" in the vm_fault structure. If it is possible that
+the page may be truncated and/or invalidated, then the filesystem must lock
+the page, then ensure it is not already truncated (the page lock will block
+subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
+locked. The VM will unlock the page.
+
+	->page_mkwrite() is called when a previously read-only pte is
+about to become writeable. The filesystem again must ensure that there are
+no truncate/invalidate races, and then return with the page locked. If
+the page has been truncated, the filesystem should not look up a new page
+like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
+will cause the VM to retry the fault.
 
 	->access() is called when get_user_pages() fails in
 acces_process_vm(), typically used to debug a process through

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [patch] mm: close page_mkwrite races (try 2)
@ 2009-04-14  7:11 ` Nick Piggin
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2009-04-14  7:11 UTC (permalink / raw)
  To: Andrew Morton, Sage Weil, Trond Myklebust, linux-fsdevel,
	Linux Memory Management List

Let's try this again with a better changelog.
--

Change page_mkwrite to allow callers to return with the page locked,
and change it's page fault callers to hold the lock until the page
is marked dirty. This allows the filesystem to have full control of
metadata associated with a dirty page. We'd like to call page_mkwrite
with the page unlocked and return with it locked, so that filesystems
can avoid LOR conditions with page lock.

A filesystem that wants to set some metadata to a page while it
is dirty, will manipulate the metadata in its ->page_mkwrite. At
this point, even if it does a set_page_dirty in its page_mkwrite
handler, it must return with the page unlocked (according to the
page_mkwrite convention).

In this window, the VM could write out the page, clearing page-dirty.
The filesystem has no way to detect that a dirty pte is about to be
attached, so it will happily write out the page, at which point, the
dirty-page metadata might be freed.

It is not always possible to perform the required metadata manipulation
in ->set_page_dirty, because that function cannot block or fail.

The VM cannot mark the pte dirty before page_mkwrite, because
page_mkwrite is allowed to fail (and anyway, it would probably be
hard to avoid races where the pte gets cleaned along the way).

Holding the page locked over the 3 critical operations (page_mkwrite,
setting the pte dirty, and finally setting the page dirty) closes out
races nicely, because the page must be locked to clean it for writeout.
It provides the filesystem with a strong synchronisation against VM.

- Sage needs this race closed for ceph filesystem.
- Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
- I need it for fsblock.
- I suspect other filesystems may need it too (eg. btrfs).
- I have converted buffer.c to the new locking. Even simple block allocation
  under dirty pages might be susceptible to i_size changing under partial
  page at the end of file (we also have a buffer.c-side problem here, but it
  cannot be fixed properly without this patch).
- Other filesystems (eg. NFS, maybe btrfs) will need to change their
  page_mkwrite functions themselves.

[ This also moves page_mkwrite another step closer to fault, which
should eventually allow page_mkwrite to be moved into ->fault, and
thus avoiding a filesystem calldown and page lock/unlock cycle in
__do_fault. ]

Cc: Sage Weil <sage@newdream.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Nick Piggin <npiggin@suse.de>

---
 Documentation/filesystems/Locking |   24 +++++++---
 fs/buffer.c                       |   10 ++--
 mm/memory.c                       |   83 ++++++++++++++++++++++++++------------
 3 files changed, 79 insertions(+), 38 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2383,7 +2383,8 @@ block_page_mkwrite(struct vm_area_struct
 	if ((page->mapping != inode->i_mapping) ||
 	    (page_offset(page) > size)) {
 		/* page got truncated out from underneath us */
-		goto out_unlock;
+		unlock_page(page);
+		goto out;
 	}
 
 	/* page is wholly or partially inside EOF */
@@ -2397,14 +2398,15 @@ block_page_mkwrite(struct vm_area_struct
 		ret = block_commit_write(page, 0, end);
 
 	if (unlikely(ret)) {
+		unlock_page(page);
 		if (ret == -ENOMEM)
 			ret = VM_FAULT_OOM;
 		else /* -ENOSPC, -EIO, etc */
 			ret = VM_FAULT_SIGBUS;
-	}
+	} else
+		ret = VM_FAULT_LOCKED;
 
-out_unlock:
-	unlock_page(page);
+out:
 	return ret;
 }
 
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1971,6 +1971,15 @@ static int do_wp_page(struct mm_struct *
 				ret = tmp;
 				goto unwritable_page;
 			}
+			if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
+				lock_page(old_page);
+				if (!old_page->mapping) {
+					ret = 0; /* retry the fault */
+					unlock_page(old_page);
+					goto unwritable_page;
+				}
+			} else
+				VM_BUG_ON(!PageLocked(old_page));
 
 			/*
 			 * Since we dropped the lock we need to revalidate
@@ -1980,9 +1989,11 @@ static int do_wp_page(struct mm_struct *
 			 */
 			page_table = pte_offset_map_lock(mm, pmd, address,
 							 &ptl);
-			page_cache_release(old_page);
-			if (!pte_same(*page_table, orig_pte))
+			if (!pte_same(*page_table, orig_pte)) {
+				page_cache_release(old_page);
+				unlock_page(old_page);
 				goto unlock;
+			}
 
 			page_mkwrite = 1;
 		}
@@ -2105,16 +2116,30 @@ unlock:
 		 *
 		 * do_no_page is protected similarly.
 		 */
-		wait_on_page_locked(dirty_page);
-		set_page_dirty_balance(dirty_page, page_mkwrite);
+		if (!page_mkwrite) {
+			wait_on_page_locked(dirty_page);
+			set_page_dirty_balance(dirty_page, page_mkwrite);
+		}
 		put_page(dirty_page);
+		if (page_mkwrite) {
+			struct address_space *mapping = old_page->mapping;
+
+			unlock_page(old_page);
+			page_cache_release(old_page);
+			balance_dirty_pages_ratelimited(mapping);
+		}
 	}
 	return ret;
 oom_free_new:
 	page_cache_release(new_page);
 oom:
-	if (old_page)
+	if (old_page) {
+		if (page_mkwrite) {
+			unlock_page(old_page);
+			page_cache_release(old_page);
+		}
 		page_cache_release(old_page);
+	}
 	return VM_FAULT_OOM;
 
 unwritable_page:
@@ -2664,27 +2689,22 @@ static int __do_fault(struct mm_struct *
 				int tmp;
 
 				unlock_page(page);
-				vmf.flags |= FAULT_FLAG_MKWRITE;
+				vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
 				tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
 				if (unlikely(tmp &
 					  (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
 					ret = tmp;
-					anon = 1; /* no anon but release vmf.page */
-					goto out_unlocked;
-				}
-				lock_page(page);
-				/*
-				 * XXX: this is not quite right (racy vs
-				 * invalidate) to unlock and relock the page
-				 * like this, however a better fix requires
-				 * reworking page_mkwrite locking API, which
-				 * is better done later.
-				 */
-				if (!page->mapping) {
-					ret = 0;
-					anon = 1; /* no anon but release vmf.page */
-					goto out;
+					goto unwritable_page;
 				}
+				if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
+					lock_page(page);
+					if (!page->mapping) {
+						ret = 0; /* retry the fault */
+						unlock_page(page);
+						goto unwritable_page;
+					}
+				} else
+					VM_BUG_ON(!PageLocked(page));
 				page_mkwrite = 1;
 			}
 		}
@@ -2736,19 +2756,30 @@ static int __do_fault(struct mm_struct *
 	pte_unmap_unlock(page_table, ptl);
 
 out:
-	unlock_page(vmf.page);
-out_unlocked:
-	if (anon)
-		page_cache_release(vmf.page);
-	else if (dirty_page) {
+	if (dirty_page) {
+		struct address_space *mapping = page->mapping;
+
 		if (vma->vm_file)
 			file_update_time(vma->vm_file);
 
+		if (set_page_dirty(dirty_page))
+			page_mkwrite = 1;
 		set_page_dirty_balance(dirty_page, page_mkwrite);
+		unlock_page(dirty_page);
 		put_page(dirty_page);
+		if (page_mkwrite)
+			balance_dirty_pages_ratelimited(mapping);
+	} else {
+		unlock_page(vmf.page);
+		if (anon)
+			page_cache_release(vmf.page);
 	}
 
 	return ret;
+
+unwritable_page:
+	page_cache_release(page);
+	return ret;
 }
 
 static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
Index: linux-2.6/Documentation/filesystems/Locking
===================================================================
--- linux-2.6.orig/Documentation/filesystems/Locking
+++ linux-2.6/Documentation/filesystems/Locking
@@ -512,16 +512,24 @@ locking rules:
 		BKL	mmap_sem	PageLocked(page)
 open:		no	yes
 close:		no	yes
-fault:		no	yes
-page_mkwrite:	no	yes		no
+fault:		no	yes		can return with page locked
+page_mkwrite:	no	yes		can return with page locked
 access:		no	yes
 
-	->page_mkwrite() is called when a previously read-only page is
-about to become writeable. The file system is responsible for
-protecting against truncate races. Once appropriate action has been
-taking to lock out truncate, the page range should be verified to be
-within i_size. The page mapping should also be checked that it is not
-NULL.
+	->fault() is called when a previously not present pte is about
+to be faulted in. The filesystem must find and return the page associated
+with the passed in "pgoff" in the vm_fault structure. If it is possible that
+the page may be truncated and/or invalidated, then the filesystem must lock
+the page, then ensure it is not already truncated (the page lock will block
+subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
+locked. The VM will unlock the page.
+
+	->page_mkwrite() is called when a previously read-only pte is
+about to become writeable. The filesystem again must ensure that there are
+no truncate/invalidate races, and then return with the page locked. If
+the page has been truncated, the filesystem should not look up a new page
+like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
+will cause the VM to retry the fault.
 
 	->access() is called when get_user_pages() fails in
 acces_process_vm(), typically used to debug a process through

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [patch] mm: close page_mkwrite races (try 3)
  2009-04-14  7:11 ` Nick Piggin
@ 2009-04-15  8:25   ` Nick Piggin
  -1 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2009-04-15  8:25 UTC (permalink / raw)
  To: Andrew Morton, Sage Weil, Trond Myklebust, linux-fsdevel,
	Linux Memory Management List

OK, that one had some rough patches (in the changelog and the patch itself).
One more try... if it's still misunderstandable, I give up :)

--

Change page_mkwrite to allow implementations to return with the page locked,
and also change it's callers (in page fault paths) to hold the lock until the
page is marked dirty. This allows the filesystem to have full control of page
dirtying events coming from the VM.

Rather than simply hold the page locked over the page_mkwrite call, we call
page_mkwrite with the page unlocked and allow callers to return with it locked,
so filesystems can avoid LOR conditions with page lock.

The problem with the current scheme is this: a filesystem that wants to
associate some metadata with a page as long as the page is dirty, will perform
this manipulation in its ->page_mkwrite. It currently then must return with the
page unlocked and may not hold any other locks (according to existing
page_mkwrite convention).

In this window, the VM could write out the page, clearing page-dirty. The
filesystem has no good way to detect that a dirty pte is about to be attached,
so it will happily write out the page, at which point, the filesystem may
manipulate the metadata to reflect that the page is no longer dirty.

It is not always possible to perform the required metadata manipulation in
->set_page_dirty, because that function cannot block or fail. The filesystem
may need to allocate some data structure, for example.

And the VM cannot mark the pte dirty before page_mkwrite, because page_mkwrite
is allowed to fail, so we must not allow any window where the page could be
written to if page_mkwrite does fail.

This solution of holding the page locked over the 3 critical operations
(page_mkwrite, setting the pte dirty, and finally setting the page dirty)
closes out races nicely, preventing page cleaning for writeout being initiated
in that window. This provides the filesystem with a strong synchronisation
against the VM here.

- Sage needs this race closed for ceph filesystem.
- Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
- I need it for fsblock.
- I suspect other filesystems may need it too (eg. btrfs).
- I have converted buffer.c to the new locking. Even simple block allocation
  under dirty pages might be susceptible to i_size changing under partial page
  at the end of file (we also have a buffer.c-side problem here, but it cannot
  be fixed properly without this patch).
- Other filesystems (eg. NFS, maybe btrfs) will need to change their
  page_mkwrite functions themselves.

[ This also moves page_mkwrite another step closer to fault, which should
  eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a
  filesystem calldown and page lock/unlock cycle in __do_fault. ]

Cc: Sage Weil <sage@newdream.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Nick Piggin <npiggin@suse.de>

---
 Documentation/filesystems/Locking |   24 +++++++---
 fs/buffer.c                       |   10 ++--
 mm/memory.c                       |   85 +++++++++++++++++++++++++-------------
 3 files changed, 80 insertions(+), 39 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2383,7 +2383,8 @@ block_page_mkwrite(struct vm_area_struct
 	if ((page->mapping != inode->i_mapping) ||
 	    (page_offset(page) > size)) {
 		/* page got truncated out from underneath us */
-		goto out_unlock;
+		unlock_page(page);
+		goto out;
 	}
 
 	/* page is wholly or partially inside EOF */
@@ -2397,14 +2398,15 @@ block_page_mkwrite(struct vm_area_struct
 		ret = block_commit_write(page, 0, end);
 
 	if (unlikely(ret)) {
+		unlock_page(page);
 		if (ret == -ENOMEM)
 			ret = VM_FAULT_OOM;
 		else /* -ENOSPC, -EIO, etc */
 			ret = VM_FAULT_SIGBUS;
-	}
+	} else
+		ret = VM_FAULT_LOCKED;
 
-out_unlock:
-	unlock_page(page);
+out:
 	return ret;
 }
 
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1971,6 +1971,15 @@ static int do_wp_page(struct mm_struct *
 				ret = tmp;
 				goto unwritable_page;
 			}
+			if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
+				lock_page(old_page);
+				if (!old_page->mapping) {
+					ret = 0; /* retry the fault */
+					unlock_page(old_page);
+					goto unwritable_page;
+				}
+			} else
+				VM_BUG_ON(!PageLocked(old_page));
 
 			/*
 			 * Since we dropped the lock we need to revalidate
@@ -1980,9 +1989,11 @@ static int do_wp_page(struct mm_struct *
 			 */
 			page_table = pte_offset_map_lock(mm, pmd, address,
 							 &ptl);
-			page_cache_release(old_page);
-			if (!pte_same(*page_table, orig_pte))
+			if (!pte_same(*page_table, orig_pte)) {
+				unlock_page(old_page);
+				page_cache_release(old_page);
 				goto unlock;
+			}
 
 			page_mkwrite = 1;
 		}
@@ -2105,16 +2116,31 @@ unlock:
 		 *
 		 * do_no_page is protected similarly.
 		 */
-		wait_on_page_locked(dirty_page);
-		set_page_dirty_balance(dirty_page, page_mkwrite);
+		if (!page_mkwrite) {
+			wait_on_page_locked(dirty_page);
+			set_page_dirty_balance(dirty_page, page_mkwrite);
+		}
 		put_page(dirty_page);
+		if (page_mkwrite) {
+			struct address_space *mapping = dirty_page->mapping;
+
+			set_page_dirty(dirty_page);
+			unlock_page(dirty_page);
+			page_cache_release(dirty_page);
+			balance_dirty_pages_ratelimited(mapping);
+		}
 	}
 	return ret;
 oom_free_new:
 	page_cache_release(new_page);
 oom:
-	if (old_page)
+	if (old_page) {
+		if (page_mkwrite) {
+			unlock_page(old_page);
+			page_cache_release(old_page);
+		}
 		page_cache_release(old_page);
+	}
 	return VM_FAULT_OOM;
 
 unwritable_page:
@@ -2664,27 +2690,22 @@ static int __do_fault(struct mm_struct *
 				int tmp;
 
 				unlock_page(page);
-				vmf.flags |= FAULT_FLAG_MKWRITE;
+				vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
 				tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
 				if (unlikely(tmp &
 					  (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
 					ret = tmp;
-					anon = 1; /* no anon but release vmf.page */
-					goto out_unlocked;
-				}
-				lock_page(page);
-				/*
-				 * XXX: this is not quite right (racy vs
-				 * invalidate) to unlock and relock the page
-				 * like this, however a better fix requires
-				 * reworking page_mkwrite locking API, which
-				 * is better done later.
-				 */
-				if (!page->mapping) {
-					ret = 0;
-					anon = 1; /* no anon but release vmf.page */
-					goto out;
+					goto unwritable_page;
 				}
+				if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
+					lock_page(page);
+					if (!page->mapping) {
+						ret = 0; /* retry the fault */
+						unlock_page(page);
+						goto unwritable_page;
+					}
+				} else
+					VM_BUG_ON(!PageLocked(page));
 				page_mkwrite = 1;
 			}
 		}
@@ -2736,19 +2757,29 @@ static int __do_fault(struct mm_struct *
 	pte_unmap_unlock(page_table, ptl);
 
 out:
-	unlock_page(vmf.page);
-out_unlocked:
-	if (anon)
-		page_cache_release(vmf.page);
-	else if (dirty_page) {
+	if (dirty_page) {
+		struct address_space *mapping = page->mapping;
+
 		if (vma->vm_file)
 			file_update_time(vma->vm_file);
 
-		set_page_dirty_balance(dirty_page, page_mkwrite);
+		if (set_page_dirty(dirty_page))
+			page_mkwrite = 1;
+		unlock_page(dirty_page);
 		put_page(dirty_page);
+		if (page_mkwrite)
+			balance_dirty_pages_ratelimited(mapping);
+	} else {
+		unlock_page(vmf.page);
+		if (anon)
+			page_cache_release(vmf.page);
 	}
 
 	return ret;
+
+unwritable_page:
+	page_cache_release(page);
+	return ret;
 }
 
 static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
Index: linux-2.6/Documentation/filesystems/Locking
===================================================================
--- linux-2.6.orig/Documentation/filesystems/Locking
+++ linux-2.6/Documentation/filesystems/Locking
@@ -512,16 +512,24 @@ locking rules:
 		BKL	mmap_sem	PageLocked(page)
 open:		no	yes
 close:		no	yes
-fault:		no	yes
-page_mkwrite:	no	yes		no
+fault:		no	yes		can return with page locked
+page_mkwrite:	no	yes		can return with page locked
 access:		no	yes
 
-	->page_mkwrite() is called when a previously read-only page is
-about to become writeable. The file system is responsible for
-protecting against truncate races. Once appropriate action has been
-taking to lock out truncate, the page range should be verified to be
-within i_size. The page mapping should also be checked that it is not
-NULL.
+	->fault() is called when a previously not present pte is about
+to be faulted in. The filesystem must find and return the page associated
+with the passed in "pgoff" in the vm_fault structure. If it is possible that
+the page may be truncated and/or invalidated, then the filesystem must lock
+the page, then ensure it is not already truncated (the page lock will block
+subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
+locked. The VM will unlock the page.
+
+	->page_mkwrite() is called when a previously read-only pte is
+about to become writeable. The filesystem again must ensure that there are
+no truncate/invalidate races, and then return with the page locked. If
+the page has been truncated, the filesystem should not look up a new page
+like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
+will cause the VM to retry the fault.
 
 	->access() is called when get_user_pages() fails in
 acces_process_vm(), typically used to debug a process through

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [patch] mm: close page_mkwrite races (try 3)
@ 2009-04-15  8:25   ` Nick Piggin
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2009-04-15  8:25 UTC (permalink / raw)
  To: Andrew Morton, Sage Weil, Trond Myklebust, linux-fsdevel,
	Linux Memory Management List

OK, that one had some rough patches (in the changelog and the patch itself).
One more try... if it's still misunderstandable, I give up :)

--

Change page_mkwrite to allow implementations to return with the page locked,
and also change it's callers (in page fault paths) to hold the lock until the
page is marked dirty. This allows the filesystem to have full control of page
dirtying events coming from the VM.

Rather than simply hold the page locked over the page_mkwrite call, we call
page_mkwrite with the page unlocked and allow callers to return with it locked,
so filesystems can avoid LOR conditions with page lock.

The problem with the current scheme is this: a filesystem that wants to
associate some metadata with a page as long as the page is dirty, will perform
this manipulation in its ->page_mkwrite. It currently then must return with the
page unlocked and may not hold any other locks (according to existing
page_mkwrite convention).

In this window, the VM could write out the page, clearing page-dirty. The
filesystem has no good way to detect that a dirty pte is about to be attached,
so it will happily write out the page, at which point, the filesystem may
manipulate the metadata to reflect that the page is no longer dirty.

It is not always possible to perform the required metadata manipulation in
->set_page_dirty, because that function cannot block or fail. The filesystem
may need to allocate some data structure, for example.

And the VM cannot mark the pte dirty before page_mkwrite, because page_mkwrite
is allowed to fail, so we must not allow any window where the page could be
written to if page_mkwrite does fail.

This solution of holding the page locked over the 3 critical operations
(page_mkwrite, setting the pte dirty, and finally setting the page dirty)
closes out races nicely, preventing page cleaning for writeout being initiated
in that window. This provides the filesystem with a strong synchronisation
against the VM here.

- Sage needs this race closed for ceph filesystem.
- Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
- I need it for fsblock.
- I suspect other filesystems may need it too (eg. btrfs).
- I have converted buffer.c to the new locking. Even simple block allocation
  under dirty pages might be susceptible to i_size changing under partial page
  at the end of file (we also have a buffer.c-side problem here, but it cannot
  be fixed properly without this patch).
- Other filesystems (eg. NFS, maybe btrfs) will need to change their
  page_mkwrite functions themselves.

[ This also moves page_mkwrite another step closer to fault, which should
  eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a
  filesystem calldown and page lock/unlock cycle in __do_fault. ]

Cc: Sage Weil <sage@newdream.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Nick Piggin <npiggin@suse.de>

---
 Documentation/filesystems/Locking |   24 +++++++---
 fs/buffer.c                       |   10 ++--
 mm/memory.c                       |   85 +++++++++++++++++++++++++-------------
 3 files changed, 80 insertions(+), 39 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2383,7 +2383,8 @@ block_page_mkwrite(struct vm_area_struct
 	if ((page->mapping != inode->i_mapping) ||
 	    (page_offset(page) > size)) {
 		/* page got truncated out from underneath us */
-		goto out_unlock;
+		unlock_page(page);
+		goto out;
 	}
 
 	/* page is wholly or partially inside EOF */
@@ -2397,14 +2398,15 @@ block_page_mkwrite(struct vm_area_struct
 		ret = block_commit_write(page, 0, end);
 
 	if (unlikely(ret)) {
+		unlock_page(page);
 		if (ret == -ENOMEM)
 			ret = VM_FAULT_OOM;
 		else /* -ENOSPC, -EIO, etc */
 			ret = VM_FAULT_SIGBUS;
-	}
+	} else
+		ret = VM_FAULT_LOCKED;
 
-out_unlock:
-	unlock_page(page);
+out:
 	return ret;
 }
 
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1971,6 +1971,15 @@ static int do_wp_page(struct mm_struct *
 				ret = tmp;
 				goto unwritable_page;
 			}
+			if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
+				lock_page(old_page);
+				if (!old_page->mapping) {
+					ret = 0; /* retry the fault */
+					unlock_page(old_page);
+					goto unwritable_page;
+				}
+			} else
+				VM_BUG_ON(!PageLocked(old_page));
 
 			/*
 			 * Since we dropped the lock we need to revalidate
@@ -1980,9 +1989,11 @@ static int do_wp_page(struct mm_struct *
 			 */
 			page_table = pte_offset_map_lock(mm, pmd, address,
 							 &ptl);
-			page_cache_release(old_page);
-			if (!pte_same(*page_table, orig_pte))
+			if (!pte_same(*page_table, orig_pte)) {
+				unlock_page(old_page);
+				page_cache_release(old_page);
 				goto unlock;
+			}
 
 			page_mkwrite = 1;
 		}
@@ -2105,16 +2116,31 @@ unlock:
 		 *
 		 * do_no_page is protected similarly.
 		 */
-		wait_on_page_locked(dirty_page);
-		set_page_dirty_balance(dirty_page, page_mkwrite);
+		if (!page_mkwrite) {
+			wait_on_page_locked(dirty_page);
+			set_page_dirty_balance(dirty_page, page_mkwrite);
+		}
 		put_page(dirty_page);
+		if (page_mkwrite) {
+			struct address_space *mapping = dirty_page->mapping;
+
+			set_page_dirty(dirty_page);
+			unlock_page(dirty_page);
+			page_cache_release(dirty_page);
+			balance_dirty_pages_ratelimited(mapping);
+		}
 	}
 	return ret;
 oom_free_new:
 	page_cache_release(new_page);
 oom:
-	if (old_page)
+	if (old_page) {
+		if (page_mkwrite) {
+			unlock_page(old_page);
+			page_cache_release(old_page);
+		}
 		page_cache_release(old_page);
+	}
 	return VM_FAULT_OOM;
 
 unwritable_page:
@@ -2664,27 +2690,22 @@ static int __do_fault(struct mm_struct *
 				int tmp;
 
 				unlock_page(page);
-				vmf.flags |= FAULT_FLAG_MKWRITE;
+				vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
 				tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
 				if (unlikely(tmp &
 					  (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
 					ret = tmp;
-					anon = 1; /* no anon but release vmf.page */
-					goto out_unlocked;
-				}
-				lock_page(page);
-				/*
-				 * XXX: this is not quite right (racy vs
-				 * invalidate) to unlock and relock the page
-				 * like this, however a better fix requires
-				 * reworking page_mkwrite locking API, which
-				 * is better done later.
-				 */
-				if (!page->mapping) {
-					ret = 0;
-					anon = 1; /* no anon but release vmf.page */
-					goto out;
+					goto unwritable_page;
 				}
+				if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
+					lock_page(page);
+					if (!page->mapping) {
+						ret = 0; /* retry the fault */
+						unlock_page(page);
+						goto unwritable_page;
+					}
+				} else
+					VM_BUG_ON(!PageLocked(page));
 				page_mkwrite = 1;
 			}
 		}
@@ -2736,19 +2757,29 @@ static int __do_fault(struct mm_struct *
 	pte_unmap_unlock(page_table, ptl);
 
 out:
-	unlock_page(vmf.page);
-out_unlocked:
-	if (anon)
-		page_cache_release(vmf.page);
-	else if (dirty_page) {
+	if (dirty_page) {
+		struct address_space *mapping = page->mapping;
+
 		if (vma->vm_file)
 			file_update_time(vma->vm_file);
 
-		set_page_dirty_balance(dirty_page, page_mkwrite);
+		if (set_page_dirty(dirty_page))
+			page_mkwrite = 1;
+		unlock_page(dirty_page);
 		put_page(dirty_page);
+		if (page_mkwrite)
+			balance_dirty_pages_ratelimited(mapping);
+	} else {
+		unlock_page(vmf.page);
+		if (anon)
+			page_cache_release(vmf.page);
 	}
 
 	return ret;
+
+unwritable_page:
+	page_cache_release(page);
+	return ret;
 }
 
 static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
Index: linux-2.6/Documentation/filesystems/Locking
===================================================================
--- linux-2.6.orig/Documentation/filesystems/Locking
+++ linux-2.6/Documentation/filesystems/Locking
@@ -512,16 +512,24 @@ locking rules:
 		BKL	mmap_sem	PageLocked(page)
 open:		no	yes
 close:		no	yes
-fault:		no	yes
-page_mkwrite:	no	yes		no
+fault:		no	yes		can return with page locked
+page_mkwrite:	no	yes		can return with page locked
 access:		no	yes
 
-	->page_mkwrite() is called when a previously read-only page is
-about to become writeable. The file system is responsible for
-protecting against truncate races. Once appropriate action has been
-taking to lock out truncate, the page range should be verified to be
-within i_size. The page mapping should also be checked that it is not
-NULL.
+	->fault() is called when a previously not present pte is about
+to be faulted in. The filesystem must find and return the page associated
+with the passed in "pgoff" in the vm_fault structure. If it is possible that
+the page may be truncated and/or invalidated, then the filesystem must lock
+the page, then ensure it is not already truncated (the page lock will block
+subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
+locked. The VM will unlock the page.
+
+	->page_mkwrite() is called when a previously read-only pte is
+about to become writeable. The filesystem again must ensure that there are
+no truncate/invalidate races, and then return with the page locked. If
+the page has been truncated, the filesystem should not look up a new page
+like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
+will cause the VM to retry the fault.
 
 	->access() is called when get_user_pages() fails in
 acces_process_vm(), typically used to debug a process through

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch] mm: close page_mkwrite races (try 3)
  2009-04-15  8:25   ` Nick Piggin
  (?)
@ 2009-04-16  1:38   ` Andrew Morton
  2009-04-16  2:03       ` Nick Piggin
                       ` (2 more replies)
  -1 siblings, 3 replies; 23+ messages in thread
From: Andrew Morton @ 2009-04-16  1:38 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Sage Weil, Trond Myklebust, linux-fsdevel, Linux Memory Management List

On Wed, 15 Apr 2009 10:25:07 +0200 Nick Piggin <npiggin@suse.de> wrote:

> OK, that one had some rough patches (in the changelog and the patch itself).
> One more try... if it's still misunderstandable, I give up :)
> 
> --
> 
> Change page_mkwrite to allow implementations to return with the page locked,
> and also change it's callers (in page fault paths) to hold the lock until the
> page is marked dirty. This allows the filesystem to have full control of page
> dirtying events coming from the VM.
> 
> Rather than simply hold the page locked over the page_mkwrite call, we call
> page_mkwrite with the page unlocked and allow callers to return with it locked,
> so filesystems can avoid LOR conditions with page lock.

All right, I give up.  What's LOR?

> The problem with the current scheme is this: a filesystem that wants to
> associate some metadata with a page as long as the page is dirty, will perform
> this manipulation in its ->page_mkwrite. It currently then must return with the
> page unlocked and may not hold any other locks (according to existing
> page_mkwrite convention).
> 
> In this window, the VM could write out the page, clearing page-dirty. The
> filesystem has no good way to detect that a dirty pte is about to be attached,
> so it will happily write out the page, at which point, the filesystem may
> manipulate the metadata to reflect that the page is no longer dirty.
> 
> It is not always possible to perform the required metadata manipulation in
> ->set_page_dirty, because that function cannot block or fail. The filesystem
> may need to allocate some data structure, for example.
> 
> And the VM cannot mark the pte dirty before page_mkwrite, because page_mkwrite
> is allowed to fail, so we must not allow any window where the page could be
> written to if page_mkwrite does fail.
> 
> This solution of holding the page locked over the 3 critical operations
> (page_mkwrite, setting the pte dirty, and finally setting the page dirty)
> closes out races nicely, preventing page cleaning for writeout being initiated
> in that window. This provides the filesystem with a strong synchronisation
> against the VM here.
> 
> - Sage needs this race closed for ceph filesystem.
> - Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).

I wonder which kernel version(s) we should put this in.

Going BUG isn't nice, but that report is against 2.6.27.  Is the BUG
super-rare, or did we avoid it via other means, or what?

> - I need it for fsblock.
> - I suspect other filesystems may need it too (eg. btrfs).
> - I have converted buffer.c to the new locking. Even simple block allocation
>   under dirty pages might be susceptible to i_size changing under partial page
>   at the end of file (we also have a buffer.c-side problem here, but it cannot
>   be fixed properly without this patch).
> - Other filesystems (eg. NFS, maybe btrfs) will need to change their
>   page_mkwrite functions themselves.
> 
> [ This also moves page_mkwrite another step closer to fault, which should
>   eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a
>   filesystem calldown and page lock/unlock cycle in __do_fault. ]
> 
>
> ...
>
> @@ -1980,9 +1989,11 @@ static int do_wp_page(struct mm_struct *
>  			 */
>  			page_table = pte_offset_map_lock(mm, pmd, address,
>  							 &ptl);
> -			page_cache_release(old_page);
> -			if (!pte_same(*page_table, orig_pte))
> +			if (!pte_same(*page_table, orig_pte)) {
> +				unlock_page(old_page);
> +				page_cache_release(old_page);
>  				goto unlock;
> +			}
>  
>  			page_mkwrite = 1;
>  		}
> @@ -2105,16 +2116,31 @@ unlock:
>  		 *
>  		 * do_no_page is protected similarly.
>  		 */
> -		wait_on_page_locked(dirty_page);
> -		set_page_dirty_balance(dirty_page, page_mkwrite);
> +		if (!page_mkwrite) {
> +			wait_on_page_locked(dirty_page);
> +			set_page_dirty_balance(dirty_page, page_mkwrite);
> +		}
>  		put_page(dirty_page);
> +		if (page_mkwrite) {
> +			struct address_space *mapping = dirty_page->mapping;
> +
> +			set_page_dirty(dirty_page);
> +			unlock_page(dirty_page);
> +			page_cache_release(dirty_page);
> +			balance_dirty_pages_ratelimited(mapping);

hm.  I wonder what prevents (prevented) *mapping from vanishing under
our feet here.

> +		}
>  	}
>  	return ret;
>  oom_free_new:
>  	page_cache_release(new_page);
>  oom:
> -	if (old_page)
> +	if (old_page) {
> +		if (page_mkwrite) {
> +			unlock_page(old_page);
> +			page_cache_release(old_page);
> +		}
>  		page_cache_release(old_page);
> +	}
>  	return VM_FAULT_OOM;
>  
>
> ...
>
> @@ -2736,19 +2757,29 @@ static int __do_fault(struct mm_struct *
>  	pte_unmap_unlock(page_table, ptl);
>  
>  out:
> -	unlock_page(vmf.page);
> -out_unlocked:
> -	if (anon)
> -		page_cache_release(vmf.page);
> -	else if (dirty_page) {
> +	if (dirty_page) {
> +		struct address_space *mapping = page->mapping;
> +
>  		if (vma->vm_file)
>  			file_update_time(vma->vm_file);
>  
> -		set_page_dirty_balance(dirty_page, page_mkwrite);
> +		if (set_page_dirty(dirty_page))
> +			page_mkwrite = 1;
> +		unlock_page(dirty_page);
>  		put_page(dirty_page);
> +		if (page_mkwrite)
> +			balance_dirty_pages_ratelimited(mapping);
> +	} else {
> +		unlock_page(vmf.page);
> +		if (anon)
> +			page_cache_release(vmf.page);
>  	}
>  
>  	return ret;
> +
> +unwritable_page:
> +	page_cache_release(page);
> +	return ret;
>  }

Whoa.  Running file_update_time() under lock_page() opens a whole can
of worms, doesn't it?  That thing can do journal commits and all sorts
of stuff.  And I don't think this ordering is necessary here?


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch] mm: close page_mkwrite races (try 3)
  2009-04-16  1:38   ` Andrew Morton
@ 2009-04-16  2:03       ` Nick Piggin
  2009-04-16  2:23     ` Nick Piggin
  2009-04-28 18:57     ` Ravikiran G Thirumalai
  2 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2009-04-16  2:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Sage Weil, Trond Myklebust, linux-fsdevel, Linux Memory Management List

On Wed, Apr 15, 2009 at 06:38:47PM -0700, Andrew Morton wrote:
> On Wed, 15 Apr 2009 10:25:07 +0200 Nick Piggin <npiggin@suse.de> wrote:
> 
> > so filesystems can avoid LOR conditions with page lock.
> 
> All right, I give up.  What's LOR?

Lock order reversal.

 
> > - Sage needs this race closed for ceph filesystem.
> > - Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
> 
> I wonder which kernel version(s) we should put this in.
> 
> Going BUG isn't nice, but that report is against 2.6.27.  Is the BUG
> super-rare, or did we avoid it via other means, or what?

Trond?


> > @@ -2105,16 +2116,31 @@ unlock:
> >  		 *
> >  		 * do_no_page is protected similarly.
> >  		 */
> > -		wait_on_page_locked(dirty_page);
> > -		set_page_dirty_balance(dirty_page, page_mkwrite);
> > +		if (!page_mkwrite) {
> > +			wait_on_page_locked(dirty_page);
> > +			set_page_dirty_balance(dirty_page, page_mkwrite);
> > +		}
> >  		put_page(dirty_page);
> > +		if (page_mkwrite) {
> > +			struct address_space *mapping = dirty_page->mapping;
> > +
> > +			set_page_dirty(dirty_page);
> > +			unlock_page(dirty_page);
> > +			page_cache_release(dirty_page);
> > +			balance_dirty_pages_ratelimited(mapping);
> 
> hm.  I wonder what prevents (prevented) *mapping from vanishing under
> our feet here.

down_read on mmap_sem should be keeping it pinned.

 
> > +	if (dirty_page) {
> > +		struct address_space *mapping = page->mapping;
> > +
> >  		if (vma->vm_file)
> >  			file_update_time(vma->vm_file);
> >  
> > -		set_page_dirty_balance(dirty_page, page_mkwrite);
> > +		if (set_page_dirty(dirty_page))
> > +			page_mkwrite = 1;
> > +		unlock_page(dirty_page);
> >  		put_page(dirty_page);
> > +		if (page_mkwrite)
> > +			balance_dirty_pages_ratelimited(mapping);
> > +	} else {
> > +		unlock_page(vmf.page);
> > +		if (anon)
> > +			page_cache_release(vmf.page);
> >  	}
> >  
> >  	return ret;
> > +
> > +unwritable_page:
> > +	page_cache_release(page);
> > +	return ret;
> >  }
> 
> Whoa.  Running file_update_time() under lock_page() opens a whole can
> of worms, doesn't it?  That thing can do journal commits and all sorts
> of stuff.  And I don't think this ordering is necessary here?
 
Oh good catch. Yes I think we can just move that out to after the
put_page no problem.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch] mm: close page_mkwrite races (try 3)
@ 2009-04-16  2:03       ` Nick Piggin
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2009-04-16  2:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Sage Weil, Trond Myklebust, linux-fsdevel, Linux Memory Management List

On Wed, Apr 15, 2009 at 06:38:47PM -0700, Andrew Morton wrote:
> On Wed, 15 Apr 2009 10:25:07 +0200 Nick Piggin <npiggin@suse.de> wrote:
> 
> > so filesystems can avoid LOR conditions with page lock.
> 
> All right, I give up.  What's LOR?

Lock order reversal.

 
> > - Sage needs this race closed for ceph filesystem.
> > - Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
> 
> I wonder which kernel version(s) we should put this in.
> 
> Going BUG isn't nice, but that report is against 2.6.27.  Is the BUG
> super-rare, or did we avoid it via other means, or what?

Trond?


> > @@ -2105,16 +2116,31 @@ unlock:
> >  		 *
> >  		 * do_no_page is protected similarly.
> >  		 */
> > -		wait_on_page_locked(dirty_page);
> > -		set_page_dirty_balance(dirty_page, page_mkwrite);
> > +		if (!page_mkwrite) {
> > +			wait_on_page_locked(dirty_page);
> > +			set_page_dirty_balance(dirty_page, page_mkwrite);
> > +		}
> >  		put_page(dirty_page);
> > +		if (page_mkwrite) {
> > +			struct address_space *mapping = dirty_page->mapping;
> > +
> > +			set_page_dirty(dirty_page);
> > +			unlock_page(dirty_page);
> > +			page_cache_release(dirty_page);
> > +			balance_dirty_pages_ratelimited(mapping);
> 
> hm.  I wonder what prevents (prevented) *mapping from vanishing under
> our feet here.

down_read on mmap_sem should be keeping it pinned.

 
> > +	if (dirty_page) {
> > +		struct address_space *mapping = page->mapping;
> > +
> >  		if (vma->vm_file)
> >  			file_update_time(vma->vm_file);
> >  
> > -		set_page_dirty_balance(dirty_page, page_mkwrite);
> > +		if (set_page_dirty(dirty_page))
> > +			page_mkwrite = 1;
> > +		unlock_page(dirty_page);
> >  		put_page(dirty_page);
> > +		if (page_mkwrite)
> > +			balance_dirty_pages_ratelimited(mapping);
> > +	} else {
> > +		unlock_page(vmf.page);
> > +		if (anon)
> > +			page_cache_release(vmf.page);
> >  	}
> >  
> >  	return ret;
> > +
> > +unwritable_page:
> > +	page_cache_release(page);
> > +	return ret;
> >  }
> 
> Whoa.  Running file_update_time() under lock_page() opens a whole can
> of worms, doesn't it?  That thing can do journal commits and all sorts
> of stuff.  And I don't think this ordering is necessary here?
 
Oh good catch. Yes I think we can just move that out to after the
put_page no problem.

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch] mm: close page_mkwrite races (try 3)
  2009-04-16  1:38   ` Andrew Morton
  2009-04-16  2:03       ` Nick Piggin
@ 2009-04-16  2:23     ` Nick Piggin
  2009-04-28 18:57     ` Ravikiran G Thirumalai
  2 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2009-04-16  2:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Sage Weil, Trond Myklebust, linux-fsdevel, Linux Memory Management List

Here's an incremental patch.

--
On Wed, Apr 15, 2009 at 06:38:47PM -0700, Andrew Morton wrote:
> Whoa.  Running file_update_time() under lock_page() opens a whole can
> of worms, doesn't it?  That thing can do journal commits and all sorts
> of stuff.  And I don't think this ordering is necessary here?

Signed-off-by: Nick Piggin <npiggin@suse.de>

---
 mm/memory.c |   14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -2105,9 +2105,6 @@ gotten:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (dirty_page) {
-		if (vma->vm_file)
-			file_update_time(vma->vm_file);
-
 		/*
 		 * Yes, Virginia, this is actually required to prevent a race
 		 * with clear_page_dirty_for_io() from clearing the page dirty
@@ -2129,6 +2126,10 @@ unlock:
 			page_cache_release(dirty_page);
 			balance_dirty_pages_ratelimited(mapping);
 		}
+
+		/* file_update_time outside page_lock */
+		if (vma->vm_file)
+			file_update_time(vma->vm_file);
 	}
 	return ret;
 oom_free_new:
@@ -2760,15 +2761,16 @@ out:
 	if (dirty_page) {
 		struct address_space *mapping = page->mapping;
 
-		if (vma->vm_file)
-			file_update_time(vma->vm_file);
-
 		if (set_page_dirty(dirty_page))
 			page_mkwrite = 1;
 		unlock_page(dirty_page);
 		put_page(dirty_page);
 		if (page_mkwrite)
 			balance_dirty_pages_ratelimited(mapping);
+
+		/* file_update_time outside page_lock */
+		if (vma->vm_file)
+			file_update_time(vma->vm_file);
 	} else {
 		unlock_page(vmf.page);
 		if (anon)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch] mm: close page_mkwrite races (try 3)
  2009-04-16  1:38   ` Andrew Morton
  2009-04-16  2:03       ` Nick Piggin
  2009-04-16  2:23     ` Nick Piggin
@ 2009-04-28 18:57     ` Ravikiran G Thirumalai
  2009-04-29  7:12         ` Nick Piggin
  2 siblings, 1 reply; 23+ messages in thread
From: Ravikiran G Thirumalai @ 2009-04-28 18:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Sage Weil, Trond Myklebust, linux-fsdevel,
	Linux Memory Management List

On Wed, Apr 15, 2009 at 06:38:47PM -0700, Andrew Morton wrote:
>On Wed, 15 Apr 2009 10:25:07 +0200 Nick Piggin <npiggin@suse.de> wrote:
>
>> - Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
>
>I wonder which kernel version(s) we should put this in.
>
>Going BUG isn't nice, but that report is against 2.6.27.  Is the BUG
>super-rare, or did we avoid it via other means, or what?
>

Jumping in late in  after being bit by this bug many times
over with  2.6.27.  The bug (http://bugzilla.kernel.org/show_bug.cgi?id=12913)
is not rare with the right workload at all.
I can easily make it happen on smp machines, when multiple
processes are writing to the same NFS mounted file system.  AFAICT this
needs to be back ported to 27 stable and 29 stable as well.

Nick, are there 27 based patches already available someplace?
Obviously, I have verified these patches + Trond's patch --
http://lkml.org/lkml/2009/4/25/64 fixes the issue with 2.6.30-rc3

Thanks,
Kiran

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch] mm: close page_mkwrite races (try 3)
  2009-04-28 18:57     ` Ravikiran G Thirumalai
@ 2009-04-29  7:12         ` Nick Piggin
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2009-04-29  7:12 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: Andrew Morton, Sage Weil, Trond Myklebust, linux-fsdevel,
	Linux Memory Management List

On Tue, Apr 28, 2009 at 11:57:39AM -0700, Ravikiran G Thirumalai wrote:
> On Wed, Apr 15, 2009 at 06:38:47PM -0700, Andrew Morton wrote:
> >On Wed, 15 Apr 2009 10:25:07 +0200 Nick Piggin <npiggin@suse.de> wrote:
> >
> >> - Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
> >
> >I wonder which kernel version(s) we should put this in.
> >
> >Going BUG isn't nice, but that report is against 2.6.27.  Is the BUG
> >super-rare, or did we avoid it via other means, or what?
> >
> 
> Jumping in late in  after being bit by this bug many times
> over with  2.6.27.  The bug (http://bugzilla.kernel.org/show_bug.cgi?id=12913)
> is not rare with the right workload at all.

Good data point, thanks.


> I can easily make it happen on smp machines, when multiple
> processes are writing to the same NFS mounted file system.  AFAICT this
> needs to be back ported to 27 stable and 29 stable as well.
> 
> Nick, are there 27 based patches already available someplace?
> Obviously, I have verified these patches + Trond's patch --
> http://lkml.org/lkml/2009/4/25/64 fixes the issue with 2.6.30-rc3

I haven't got any prepared, but they should be a pretty trivial
backport, provided we also backport c2ec175c39f62949438354f603f4aa170846aabb
(which is probably a good idea anyway).

However I will probably wait for a bit, given that the patch isn't upstream
yet.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch] mm: close page_mkwrite races (try 3)
@ 2009-04-29  7:12         ` Nick Piggin
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2009-04-29  7:12 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: Andrew Morton, Sage Weil, Trond Myklebust, linux-fsdevel,
	Linux Memory Management List

On Tue, Apr 28, 2009 at 11:57:39AM -0700, Ravikiran G Thirumalai wrote:
> On Wed, Apr 15, 2009 at 06:38:47PM -0700, Andrew Morton wrote:
> >On Wed, 15 Apr 2009 10:25:07 +0200 Nick Piggin <npiggin@suse.de> wrote:
> >
> >> - Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
> >
> >I wonder which kernel version(s) we should put this in.
> >
> >Going BUG isn't nice, but that report is against 2.6.27.  Is the BUG
> >super-rare, or did we avoid it via other means, or what?
> >
> 
> Jumping in late in  after being bit by this bug many times
> over with  2.6.27.  The bug (http://bugzilla.kernel.org/show_bug.cgi?id=12913)
> is not rare with the right workload at all.

Good data point, thanks.


> I can easily make it happen on smp machines, when multiple
> processes are writing to the same NFS mounted file system.  AFAICT this
> needs to be back ported to 27 stable and 29 stable as well.
> 
> Nick, are there 27 based patches already available someplace?
> Obviously, I have verified these patches + Trond's patch --
> http://lkml.org/lkml/2009/4/25/64 fixes the issue with 2.6.30-rc3

I haven't got any prepared, but they should be a pretty trivial
backport, provided we also backport c2ec175c39f62949438354f603f4aa170846aabb
(which is probably a good idea anyway).

However I will probably wait for a bit, given that the patch isn't upstream
yet.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch] mm: close page_mkwrite races (try 3)
  2009-04-29  7:12         ` Nick Piggin
  (?)
@ 2009-04-29  7:24         ` Andrew Morton
  2009-04-29  7:45           ` Nick Piggin
  -1 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2009-04-29  7:24 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ravikiran G Thirumalai, Sage Weil, Trond Myklebust,
	linux-fsdevel, Linux Memory Management List

On Wed, 29 Apr 2009 09:12:33 +0200 Nick Piggin <npiggin@suse.de> wrote:

> On Tue, Apr 28, 2009 at 11:57:39AM -0700, Ravikiran G Thirumalai wrote:
> > On Wed, Apr 15, 2009 at 06:38:47PM -0700, Andrew Morton wrote:
> > >On Wed, 15 Apr 2009 10:25:07 +0200 Nick Piggin <npiggin@suse.de> wrote:
> > >
> > >> - Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
> > >
> > >I wonder which kernel version(s) we should put this in.
> > >
> > >Going BUG isn't nice, but that report is against 2.6.27.  Is the BUG
> > >super-rare, or did we avoid it via other means, or what?
> > >
> > 
> > Jumping in late in  after being bit by this bug many times
> > over with  2.6.27.  The bug (http://bugzilla.kernel.org/show_bug.cgi?id=12913)
> > is not rare with the right workload at all.
> 
> Good data point, thanks.
> 
> 
> > I can easily make it happen on smp machines, when multiple
> > processes are writing to the same NFS mounted file system.  AFAICT this
> > needs to be back ported to 27 stable and 29 stable as well.
> > 
> > Nick, are there 27 based patches already available someplace?
> > Obviously, I have verified these patches + Trond's patch --
> > http://lkml.org/lkml/2009/4/25/64 fixes the issue with 2.6.30-rc3
> 
> I haven't got any prepared, but they should be a pretty trivial
> backport, provided we also backport c2ec175c39f62949438354f603f4aa170846aabb
> (which is probably a good idea anyway).
> 
> However I will probably wait for a bit, given that the patch isn't upstream
> yet.

err, I'd marked it as for-2.6.31.  It looks like that was wrong?

all this:

#mm-close-page_mkwrite-races-try-3.patch: akpm issues!
#mm-close-page_mkwrite-races-try-3.patch: check akpm hack
mm-close-page_mkwrite-races-try-3.patch
mm-close-page_mkwrite-races-try-3-update.patch
mm-close-page_mkwrite-races-try-3-fix.patch
mm-close-page_mkwrite-races-try-3-fix-fix.patch

is a bit of a worry.  But I guess we won't know until we merge it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch] mm: close page_mkwrite races (try 3)
  2009-04-29  7:24         ` Andrew Morton
@ 2009-04-29  7:45           ` Nick Piggin
  2009-04-29 12:39               ` Trond Myklebust
  0 siblings, 1 reply; 23+ messages in thread
From: Nick Piggin @ 2009-04-29  7:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ravikiran G Thirumalai, Sage Weil, Trond Myklebust,
	linux-fsdevel, Linux Memory Management List

On Wed, Apr 29, 2009 at 12:24:18AM -0700, Andrew Morton wrote:
> On Wed, 29 Apr 2009 09:12:33 +0200 Nick Piggin <npiggin@suse.de> wrote:
> 
> > I haven't got any prepared, but they should be a pretty trivial
> > backport, provided we also backport c2ec175c39f62949438354f603f4aa170846aabb
> > (which is probably a good idea anyway).
> > 
> > However I will probably wait for a bit, given that the patch isn't upstream
> > yet.
> 
> err, I'd marked it as for-2.6.31.  It looks like that was wrong?

At the time I agreed because I didn't know the severity of the NFS
bugs. So it is up to you and Trond / nfs guys I guess.


> all this:
> 
> #mm-close-page_mkwrite-races-try-3.patch: akpm issues!
> #mm-close-page_mkwrite-races-try-3.patch: check akpm hack
> mm-close-page_mkwrite-races-try-3.patch
> mm-close-page_mkwrite-races-try-3-update.patch
> mm-close-page_mkwrite-races-try-3-fix.patch
> mm-close-page_mkwrite-races-try-3-fix-fix.patch
> 
> is a bit of a worry.  But I guess we won't know until we merge it.

I have nothing against merging it now if you think it is needed.
It's only adding synchronisation, so I doubt it will cause a problem
that pushes the release out.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch] mm: close page_mkwrite races (try 3)
  2009-04-29  7:45           ` Nick Piggin
@ 2009-04-29 12:39               ` Trond Myklebust
  0 siblings, 0 replies; 23+ messages in thread
From: Trond Myklebust @ 2009-04-29 12:39 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Ravikiran G Thirumalai, Sage Weil, linux-fsdevel,
	Linux Memory Management List

On Wed, 2009-04-29 at 09:45 +0200, Nick Piggin wrote:
> On Wed, Apr 29, 2009 at 12:24:18AM -0700, Andrew Morton wrote:
> > On Wed, 29 Apr 2009 09:12:33 +0200 Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > I haven't got any prepared, but they should be a pretty trivial
> > > backport, provided we also backport c2ec175c39f62949438354f603f4aa170846aabb
> > > (which is probably a good idea anyway).
> > > 
> > > However I will probably wait for a bit, given that the patch isn't upstream
> > > yet.
> > 
> > err, I'd marked it as for-2.6.31.  It looks like that was wrong?
> 
> At the time I agreed because I didn't know the severity of the NFS
> bugs. So it is up to you and Trond / nfs guys I guess.
> 

The bug affects any serious use of shared mmap writes, so it's pretty
urgent to get it merged together with the NFS 3 liner. As I said in an
earlier email, I'm able to reproduce it at will within 1 minute or so of
using iozone with the right options.

I also concur with the opinion that we should backport it to the stable
kernels as soon as possible.

Cheers
  Trond

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch] mm: close page_mkwrite races (try 3)
@ 2009-04-29 12:39               ` Trond Myklebust
  0 siblings, 0 replies; 23+ messages in thread
From: Trond Myklebust @ 2009-04-29 12:39 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Ravikiran G Thirumalai, Sage Weil, linux-fsdevel,
	Linux Memory Management List

On Wed, 2009-04-29 at 09:45 +0200, Nick Piggin wrote:
> On Wed, Apr 29, 2009 at 12:24:18AM -0700, Andrew Morton wrote:
> > On Wed, 29 Apr 2009 09:12:33 +0200 Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > I haven't got any prepared, but they should be a pretty trivial
> > > backport, provided we also backport c2ec175c39f62949438354f603f4aa170846aabb
> > > (which is probably a good idea anyway).
> > > 
> > > However I will probably wait for a bit, given that the patch isn't upstream
> > > yet.
> > 
> > err, I'd marked it as for-2.6.31.  It looks like that was wrong?
> 
> At the time I agreed because I didn't know the severity of the NFS
> bugs. So it is up to you and Trond / nfs guys I guess.
> 

The bug affects any serious use of shared mmap writes, so it's pretty
urgent to get it merged together with the NFS 3 liner. As I said in an
earlier email, I'm able to reproduce it at will within 1 minute or so of
using iozone with the right options.

I also concur with the opinion that we should backport it to the stable
kernels as soon as possible.

Cheers
  Trond

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch] mm: close page_mkwrite races (try 3)
  2009-04-29 12:39               ` Trond Myklebust
  (?)
@ 2009-04-29 15:27               ` Andrew Morton
  2009-04-29 16:45                   ` Trond Myklebust
  -1 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2009-04-29 15:27 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Nick Piggin, Ravikiran G Thirumalai, Sage Weil, linux-fsdevel,
	Linux Memory Management List, stable

On Wed, 29 Apr 2009 08:39:22 -0400 Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

> On Wed, 2009-04-29 at 09:45 +0200, Nick Piggin wrote:
> > On Wed, Apr 29, 2009 at 12:24:18AM -0700, Andrew Morton wrote:
> > > On Wed, 29 Apr 2009 09:12:33 +0200 Nick Piggin <npiggin@suse.de> wrote:
> > > 
> > > > I haven't got any prepared, but they should be a pretty trivial
> > > > backport, provided we also backport c2ec175c39f62949438354f603f4aa170846aabb
> > > > (which is probably a good idea anyway).
> > > > 
> > > > However I will probably wait for a bit, given that the patch isn't upstream
> > > > yet.
> > > 
> > > err, I'd marked it as for-2.6.31.  It looks like that was wrong?
> > 
> > At the time I agreed because I didn't know the severity of the NFS
> > bugs. So it is up to you and Trond / nfs guys I guess.
> > 
> 
> The bug affects any serious use of shared mmap writes, so it's pretty
> urgent to get it merged together with the NFS 3 liner. As I said in an
> earlier email, I'm able to reproduce it at will within 1 minute or so of
> using iozone with the right options.
> 
> I also concur with the opinion that we should backport it to the stable
> kernels as soon as possible.
> 

ok...

I don't know what the "NFS 3 liner" is, but I trust you'll take care of
it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch] mm: close page_mkwrite races (try 3)
  2009-04-29 15:27               ` Andrew Morton
@ 2009-04-29 16:45                   ` Trond Myklebust
  0 siblings, 0 replies; 23+ messages in thread
From: Trond Myklebust @ 2009-04-29 16:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Ravikiran G Thirumalai, Sage Weil, linux-fsdevel,
	Linux Memory Management List, stable

On Wed, 2009-04-29 at 08:27 -0700, Andrew Morton wrote:
> I don't know what the "NFS 3 liner" is, but I trust you'll take care of
> it.

It's this one. I can send it on to Linus as soon as Nick's stuff is
upstream unless you'd prefer to fold it in with his patch.

Cheers
  Trond
---------------------------------------------------------------------
>From f0258852dcb43c748854d2ee550c9c270bb25f21 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust@netapp.com>
Date: Fri, 24 Apr 2009 17:32:22 -0400
Subject: [PATCH] NFS: Close page_mkwrite() races

Follow up to Nick Piggin's patches to ensure that nfs_vm_page_mkwrite
returns with the page lock held, and sets the VM_FAULT_LOCKED flag.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---
 fs/nfs/file.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 5a97bcf..ec7e27d 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -517,10 +517,10 @@ static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 
 	ret = nfs_updatepage(filp, page, 0, pagelen);
 out_unlock:
+	if (!ret)
+		return VM_FAULT_LOCKED;
 	unlock_page(page);
-	if (ret)
-		ret = VM_FAULT_SIGBUS;
-	return ret;
+	return VM_FAULT_SIGBUS;
 }
 
 static struct vm_operations_struct nfs_file_vm_ops = {
-- 
1.6.0.6




^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [patch] mm: close page_mkwrite races (try 3)
@ 2009-04-29 16:45                   ` Trond Myklebust
  0 siblings, 0 replies; 23+ messages in thread
From: Trond Myklebust @ 2009-04-29 16:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Ravikiran G Thirumalai, Sage Weil, linux-fsdevel,
	Linux Memory Management List, stable

On Wed, 2009-04-29 at 08:27 -0700, Andrew Morton wrote:
> I don't know what the "NFS 3 liner" is, but I trust you'll take care of
> it.

It's this one. I can send it on to Linus as soon as Nick's stuff is
upstream unless you'd prefer to fold it in with his patch.

Cheers
  Trond
---------------------------------------------------------------------
>From f0258852dcb43c748854d2ee550c9c270bb25f21 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust@netapp.com>
Date: Fri, 24 Apr 2009 17:32:22 -0400
Subject: [PATCH] NFS: Close page_mkwrite() races

Follow up to Nick Piggin's patches to ensure that nfs_vm_page_mkwrite
returns with the page lock held, and sets the VM_FAULT_LOCKED flag.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---
 fs/nfs/file.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 5a97bcf..ec7e27d 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -517,10 +517,10 @@ static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 
 	ret = nfs_updatepage(filp, page, 0, pagelen);
 out_unlock:
+	if (!ret)
+		return VM_FAULT_LOCKED;
 	unlock_page(page);
-	if (ret)
-		ret = VM_FAULT_SIGBUS;
-	return ret;
+	return VM_FAULT_SIGBUS;
 }
 
 static struct vm_operations_struct nfs_file_vm_ops = {
-- 
1.6.0.6



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [patch] mm: close page_mkwrite races (try 3)
  2009-04-29 16:45                   ` Trond Myklebust
@ 2009-04-30 14:22                     ` Rince
  -1 siblings, 0 replies; 23+ messages in thread
From: Rince @ 2009-04-30 14:22 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Andrew Morton, Nick Piggin, Ravikiran G Thirumalai, Sage Weil,
	linux-fsdevel, Linux Memory Management List, stable

Bad news for everyone...

My kernel, with the four patches mentioned, left a steaming present on
my desk this morning.

------------[ cut here ]------------
kernel BUG at fs/nfs/write.c:252!
invalid opcode: 0000 [#1] SMP
[...]

Fascinating, no?

- Rich

-- 

((lambda (foo) (bar foo)) (baz))

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch] mm: close page_mkwrite races (try 3)
@ 2009-04-30 14:22                     ` Rince
  0 siblings, 0 replies; 23+ messages in thread
From: Rince @ 2009-04-30 14:22 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Andrew Morton, Nick Piggin, Ravikiran G Thirumalai, Sage Weil,
	linux-fsdevel, Linux Memory Management List, stable

Bad news for everyone...

My kernel, with the four patches mentioned, left a steaming present on
my desk this morning.

------------[ cut here ]------------
kernel BUG at fs/nfs/write.c:252!
invalid opcode: 0000 [#1] SMP
[...]

Fascinating, no?

- Rich

-- 

((lambda (foo) (bar foo)) (baz))

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch] mm: close page_mkwrite races (try 3)
  2009-04-30 14:22                     ` Rince
  (?)
@ 2009-04-30 14:33                     ` Rince
  2009-05-02 22:12                       ` Rince
  -1 siblings, 1 reply; 23+ messages in thread
From: Rince @ 2009-04-30 14:33 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Andrew Morton, Nick Piggin, Ravikiran G Thirumalai, Sage Weil,
	linux-fsdevel, Linux Memory Management List, stable

On Thu, Apr 30, 2009 at 10:22 AM, Rince <rincebrain@gmail.com> wrote:
> four patches mentioned

Hm, somehow I missed that the NFS 3-liner above wasn't included in the
four patches.

Let's see how this goes. Apologies for the misunderstanding. :)

- Rich

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch] mm: close page_mkwrite races (try 3)
  2009-04-30 14:33                     ` Rince
@ 2009-05-02 22:12                       ` Rince
  2009-05-03  1:38                         ` Trond Myklebust
  0 siblings, 1 reply; 23+ messages in thread
From: Rince @ 2009-05-02 22:12 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-fsdevel

Well...that's different.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: [<ffffffffa02c8ff2>] _nfs4_do_setlk+0xe3/0x289 [nfs]
PGD 10e4f7067 PUD 109221067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/virtual/block/md0/md/metadata_version
CPU 0
Modules linked in: autofs4 coretemp hwmon nfs lockd nfs_acl
auth_rpcgss sunrpc cachefiles fscache ipv6 cpufreq_ondemand
acpi_cpufreq freq_table kvm_intel kvm snd_hda_codec_idt snd_hda_intel
snd_hda_codec snd_hwdep snd_seq_dummy snd_seq_oss snd_seq_midi_event
snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm cpia_usb
e1000e snd_timer ppdev cpia snd ums_cypress parport_pc videodev
firewire_ohci i82975x_edac usb_storage parport firewire_core i2c_i801
edac_core v4l1_compat soundcore snd_page_alloc iTCO_wdt
v4l2_compat_ioctl32 pcspkr i2c_core crc_itu_t iTCO_vendor_support
raid1 [last unloaded: scsi_wait_scan]
Pid: 29418, comm: 10.1.1.2-manage Not tainted 2.6.30-rc4 #1
RIP: 0010:[<ffffffffa02c8ff2>]  [<ffffffffa02c8ff2>]
_nfs4_do_setlk+0xe3/0x289 [nfs]
RSP: 0018:ffff880102361d30  EFLAGS: 00010246
RAX: ffff8800ce865f00 RBX: ffff88010b4b74d8 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 00000000000000d0 RDI: 0000000000000138
RBP: ffff880102361de0 R08: ffff880126557000 R09: ffff8800b38c3900
R10: ffffffffa02cbd1c R11: ffff880126553c00 R12: 00000000fffffff4
R13: 0000000000000000 R14: ffff88012d42f5c0 R15: ffff88012d42f5c0
FS:  0000000000000000(0000) GS:ffff880028023000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000010 CR3: 0000000016eda000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process 10.1.1.2-manage (pid: 29418, threadinfo ffff880102360000, task
ffff8800015b5c00)
Stack:
 ffff880102361d40 0000000600000001 ffff8800ce865f00 ffffffffa02cbc5c
 0000000000000000 ffff880126555c00 ffff880102361d90 ffffffffa02db960
 0000000000000000 ffff880128093000 0000000000000001 ffffffffa02b88aa
Call Trace:
 [<ffffffffa02cbc5c>] ? nfs4_open_recover_helper+0x82/0x97 [nfs]
 [<ffffffffa02b88aa>] ? __put_nfs_open_context+0x31/0x98 [nfs]
 [<ffffffffa02c9646>] nfs4_lock_reclaim+0x60/0x8d [nfs]
 [<ffffffffa02d57a3>] nfs4_do_reclaim+0x13d/0x322 [nfs]
 [<ffffffffa02d5b21>] nfs4_run_state_manager+0x199/0x27f [nfs]
 [<ffffffffa02d5988>] ? nfs4_run_state_manager+0x0/0x27f [nfs]
 [<ffffffffa02d5988>] ? nfs4_run_state_manager+0x0/0x27f [nfs]
 [<ffffffff8105e7bf>] kthread+0x5b/0x88
 [<ffffffff81011dba>] child_rip+0xa/0x20
 [<ffffffff8101177d>] ? restore_args+0x0/0x30
 [<ffffffff8105e764>] ? kthread+0x0/0x88
 [<ffffffff81011db0>] ? child_rip+0x0/0x20
Code: 10 e1 49 8b 47 58 4d 8b af 90 00 00 00 be d0 00 00 00 bf 38 01
00 00 41 bc f4 ff ff ff 48 8b 80 a0 00 00 00 48 89 85 60 ff ff ff <49>
8b 45 10 4c 8b 70 38 49 8b 86 00 01 00 00 48 8b 80 a8 02 00
RIP  [<ffffffffa02c8ff2>] _nfs4_do_setlk+0xe3/0x289 [nfs]
 RSP <ffff880102361d30>
CR2: 0000000000000010
---[ end trace 205a6f9494aa30de ]---

It's unclear to me whether I should blame this on the patches applied,
or that this is just something never triggered unless the
aforementioned bug is fixed...

- Rich

-- 

Aquele que ri do destino conquistará a fortuna. -- Benjamin Disraeli
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch] mm: close page_mkwrite races (try 3)
  2009-05-02 22:12                       ` Rince
@ 2009-05-03  1:38                         ` Trond Myklebust
  0 siblings, 0 replies; 23+ messages in thread
From: Trond Myklebust @ 2009-05-03  1:38 UTC (permalink / raw)
  To: Rince; +Cc: linux-fsdevel, linux-nfs

On Sat, 2009-05-02 at 18:12 -0400, Rince wrote:
> Well...that's different.
> 
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
> IP: [<ffffffffa02c8ff2>] _nfs4_do_setlk+0xe3/0x289 [nfs]
> PGD 10e4f7067 PUD 109221067 PMD 0
> Oops: 0000 [#1] SMP
> last sysfs file: /sys/devices/virtual/block/md0/md/metadata_version
> CPU 0
> Modules linked in: autofs4 coretemp hwmon nfs lockd nfs_acl
> auth_rpcgss sunrpc cachefiles fscache ipv6 cpufreq_ondemand
> acpi_cpufreq freq_table kvm_intel kvm snd_hda_codec_idt snd_hda_intel
> snd_hda_codec snd_hwdep snd_seq_dummy snd_seq_oss snd_seq_midi_event
> snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm cpia_usb
> e1000e snd_timer ppdev cpia snd ums_cypress parport_pc videodev
> firewire_ohci i82975x_edac usb_storage parport firewire_core i2c_i801
> edac_core v4l1_compat soundcore snd_page_alloc iTCO_wdt
> v4l2_compat_ioctl32 pcspkr i2c_core crc_itu_t iTCO_vendor_support
> raid1 [last unloaded: scsi_wait_scan]
> Pid: 29418, comm: 10.1.1.2-manage Not tainted 2.6.30-rc4 #1
> RIP: 0010:[<ffffffffa02c8ff2>]  [<ffffffffa02c8ff2>]
> _nfs4_do_setlk+0xe3/0x289 [nfs]
> RSP: 0018:ffff880102361d30  EFLAGS: 00010246
> RAX: ffff8800ce865f00 RBX: ffff88010b4b74d8 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 00000000000000d0 RDI: 0000000000000138
> RBP: ffff880102361de0 R08: ffff880126557000 R09: ffff8800b38c3900
> R10: ffffffffa02cbd1c R11: ffff880126553c00 R12: 00000000fffffff4
> R13: 0000000000000000 R14: ffff88012d42f5c0 R15: ffff88012d42f5c0
> FS:  0000000000000000(0000) GS:ffff880028023000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000000010 CR3: 0000000016eda000 CR4: 00000000000026e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process 10.1.1.2-manage (pid: 29418, threadinfo ffff880102360000, task
> ffff8800015b5c00)
> Stack:
>  ffff880102361d40 0000000600000001 ffff8800ce865f00 ffffffffa02cbc5c
>  0000000000000000 ffff880126555c00 ffff880102361d90 ffffffffa02db960
>  0000000000000000 ffff880128093000 0000000000000001 ffffffffa02b88aa
> Call Trace:
>  [<ffffffffa02cbc5c>] ? nfs4_open_recover_helper+0x82/0x97 [nfs]
>  [<ffffffffa02b88aa>] ? __put_nfs_open_context+0x31/0x98 [nfs]
>  [<ffffffffa02c9646>] nfs4_lock_reclaim+0x60/0x8d [nfs]
>  [<ffffffffa02d57a3>] nfs4_do_reclaim+0x13d/0x322 [nfs]
>  [<ffffffffa02d5b21>] nfs4_run_state_manager+0x199/0x27f [nfs]
>  [<ffffffffa02d5988>] ? nfs4_run_state_manager+0x0/0x27f [nfs]
>  [<ffffffffa02d5988>] ? nfs4_run_state_manager+0x0/0x27f [nfs]
>  [<ffffffff8105e7bf>] kthread+0x5b/0x88
>  [<ffffffff81011dba>] child_rip+0xa/0x20
>  [<ffffffff8101177d>] ? restore_args+0x0/0x30
>  [<ffffffff8105e764>] ? kthread+0x0/0x88
>  [<ffffffff81011db0>] ? child_rip+0x0/0x20
> Code: 10 e1 49 8b 47 58 4d 8b af 90 00 00 00 be d0 00 00 00 bf 38 01
> 00 00 41 bc f4 ff ff ff 48 8b 80 a0 00 00 00 48 89 85 60 ff ff ff <49>
> 8b 45 10 4c 8b 70 38 49 8b 86 00 01 00 00 48 8b 80 a8 02 00
> RIP  [<ffffffffa02c8ff2>] _nfs4_do_setlk+0xe3/0x289 [nfs]
>  RSP <ffff880102361d30>
> CR2: 0000000000000010
> ---[ end trace 205a6f9494aa30de ]---
> 
> It's unclear to me whether I should blame this on the patches applied,
> or that this is just something never triggered unless the
> aforementioned bug is fixed...
> 
> - Rich

It is completely unrelated. The above appears to be due to a lock
recovery error after a server reboot. It has nothing to do with the mmap
writeback code.

Trond


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2009-05-03  1:38 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-14  7:11 [patch] mm: close page_mkwrite races (try 2) Nick Piggin
2009-04-14  7:11 ` Nick Piggin
2009-04-15  8:25 ` [patch] mm: close page_mkwrite races (try 3) Nick Piggin
2009-04-15  8:25   ` Nick Piggin
2009-04-16  1:38   ` Andrew Morton
2009-04-16  2:03     ` Nick Piggin
2009-04-16  2:03       ` Nick Piggin
2009-04-16  2:23     ` Nick Piggin
2009-04-28 18:57     ` Ravikiran G Thirumalai
2009-04-29  7:12       ` Nick Piggin
2009-04-29  7:12         ` Nick Piggin
2009-04-29  7:24         ` Andrew Morton
2009-04-29  7:45           ` Nick Piggin
2009-04-29 12:39             ` Trond Myklebust
2009-04-29 12:39               ` Trond Myklebust
2009-04-29 15:27               ` Andrew Morton
2009-04-29 16:45                 ` Trond Myklebust
2009-04-29 16:45                   ` Trond Myklebust
2009-04-30 14:22                   ` Rince
2009-04-30 14:22                     ` Rince
2009-04-30 14:33                     ` Rince
2009-05-02 22:12                       ` Rince
2009-05-03  1:38                         ` Trond Myklebust

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.