linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC][PATCH] vm_operation to avoid pagefault/inval race
@ 2003-05-13 20:53 Paul E. McKenney
  2003-05-17 16:06 ` Christoph Hellwig
  0 siblings, 1 reply; 6+ messages in thread
From: Paul E. McKenney @ 2003-05-13 20:53 UTC (permalink / raw)
  To: linux-kernel, linux-mm, akpm; +Cc: mjbligh

This patch adds a vm_operations_struct function pointer that allows
networked and distributed filesystems to avoid a race between a
pagefault on an mmap and an invalidation request from some other
node.  The race goes as follows:

1.	A user process on node A accesses a portion of a mapped
	file, resulting in a page fault.  The pagefault handler
	invokes the corresponding nopage function, which reads
	the page into memory.

2.	A user process on node B writes to the same portion of
	the file (either via mmap or write()), therefore sending
	node A an invalidation request to node A.

3.	Node A receives this invalidate request, and dutifully
	invalidates all mmaps.  Except for the one that has
	not yet been fully mapped by step 1.

4.	Node A then executes the rest of do_no_page(), entering
	the now-invalid page into the PTEs.

5.	One way or another, life is now hard.

One solution would be for the distributed filesystem to hold
onto a lock or semaphore upon return from the nopage function.
The problem is that there is no way to determine (in a timely
fashion) when it safe to release this lock or semaphore.

The attached patch addresses this by adding a nopagedone
function for when do_no_page() exits.  The filesystem may then
drop the lock or semaphore in this nopagedone function.

Thoughts?  Is there some other existing way to get this done?

					Thanx, Paul


diff -urN -X dontdiff linux-2.5.69/include/linux/mm.h linux-2.5.69.stmmap/include/linux/mm.h
--- linux-2.5.69/include/linux/mm.h	Sun May  4 16:53:00 2003
+++ linux-2.5.69.stmmap/include/linux/mm.h	Fri May  9 09:30:37 2003
@@ -134,6 +134,7 @@
 	void (*open)(struct vm_area_struct * area);
 	void (*close)(struct vm_area_struct * area);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int unused);
+	void (*nopagedone)(struct vm_area_struct * area, unsigned long address, int status);
 	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
 };
 
diff -urN -X dontdiff linux-2.5.69/mm/memory.c linux-2.5.69.stmmap/mm/memory.c
--- linux-2.5.69/mm/memory.c	Sun May  4 16:53:14 2003
+++ linux-2.5.69.stmmap/mm/memory.c	Fri May  9 17:04:09 2003
@@ -1426,6 +1487,9 @@
 	ret = VM_FAULT_OOM;
 out:
 	pte_chain_free(pte_chain);
+	if (vma->vm_ops && vma->vm_ops->nopagedone) {
+		vma->vm_ops->nopagedone(vma, address & PAGE_MASK, ret);
+	}
 	return ret;
 }
 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC][PATCH] vm_operation to avoid pagefault/inval race
  2003-05-13 20:53 [RFC][PATCH] vm_operation to avoid pagefault/inval race Paul E. McKenney
@ 2003-05-17 16:06 ` Christoph Hellwig
  0 siblings, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2003-05-17 16:06 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linux-kernel, linux-mm, akpm, mjbligh

On Tue, May 13, 2003 at 01:53:26PM -0700, Paul E. McKenney wrote:
> This patch adds a vm_operations_struct function pointer that allows
> networked and distributed filesystems to avoid a race between a
> pagefault on an mmap and an invalidation request from some other
> node.  The race goes as follows:

The race is real although currenly no in-tree filesystem is affected.
The patch is uglyh as hell, though.  The right fix is to change the
->nopage method to cover what do_no_page is currently, change anonymous
vmas to have vm_ops as well and set ->nopage to do_anonymous_page.

The gets of the current do_no_page become a new helper (__finish_nopage?)
and EXPORT_SYMBOL_GPL()ed.  It would also be nice if you could point to
a filesystem that actually needs this, but if you can get rid of the
do_anonymous_page special casing a patch might even be acceptable without it.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC][PATCH] vm_operation to avoid pagefault/inval race
  2003-05-20  1:23   ` Paul E. McKenney
@ 2003-05-20  8:11     ` Andrew Morton
  0 siblings, 0 replies; 6+ messages in thread
From: Andrew Morton @ 2003-05-20  8:11 UTC (permalink / raw)
  To: paulmck; +Cc: phillips, hch, linux-mm, linux-kernel

"Paul E. McKenney" <paulmck@us.ibm.com> wrote:
>
> So the general idea is to do something as follows, right?

It sounds reasonable.  A matter of putting together the appropriate
library functions and refactoring a few things.

> 
> o	Make a function, perhaps named something like
> 	install_new_page(), that does the PTE-installation
> 	and RSS-adjustment tasks currently performed by
> 	both do_no_page() and by do_anonymous_page().

That's similar to mm/fremap.c:install_page().  (Which forgets to call
update_mmu_cache().  Debatably a buglet.)

However there is not a lot of commonality between the various nopage()s and
there may not be a lot to be gained from all this.  There is subtle code in
there and it is performance-critical.  I'd be inclined to try to minimise
overall code churn in this work.



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC][PATCH] vm_operation to avoid pagefault/inval race
  2003-05-17 19:49 ` Andrew Morton
@ 2003-05-20  1:23   ` Paul E. McKenney
  2003-05-20  8:11     ` Andrew Morton
  0 siblings, 1 reply; 6+ messages in thread
From: Paul E. McKenney @ 2003-05-20  1:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Daniel Phillips, hch, linux-mm, linux-kernel

On Sat, May 17, 2003 at 12:49:48PM -0700, Andrew Morton wrote:
> Daniel Phillips <phillips@arcor.de> wrote:
> >
> > and the only problem is, we have to change pretty well every 
> >  filesystem in and out of tree.
> 
> But it's only a one-liner per fs.

So the general idea is to do something as follows, right?
(Sorry for not just putting together a patch -- I want
to make sure I understand all of your advice first!)

o	Make all callers to do_no_page() instead call
	vma->vm_ops->nopage().

o	Make a function, perhaps named something like
	install_new_page(), that does the PTE-installation
	and RSS-adjustment tasks currently performed by
	both do_no_page() and by do_anonymous_page().
	(Not clear to me yet whether a full merge of
	these two functions is the right approach, more
	thought needed.  Note that the nopage function
	is implicitly aware of whether it is handling
	an anonymous page or not, so a pair of functions
	that both call another function containing the
	common code is reasonable, if warranted.)

	The install_new_page() function needs an additional
	argument to accept the new_page value that used
	to be returned by the nopage() function.

o	Add arguments to nopage() to allow it to invoke
	install_new_page().

o	Change all nopage() functions to invoke install_new_page(),
	but only in cases where they would -not- return
	VM_FAULT_OOM or VM_FAULT_SIGBUS.  In these cases,
	these two return codes must be handed back to the
	caller without invoking install_new_page().

o	Otherwise, the value that these nopage() functions
	would normally return must be passed to
	install_new_page(), and the value returned by
	install_new_page() must be returned to the nopage()
	function's caller.

o	Replace all occurrences of "->vm_ops = NULL" with
	"->vm_ops = anonymous_vm_ops" or some such.

o	The anonymous_vm_ops would have the following members:

	nopage: pointer to a function containing the page-allocation
		code extracted from do_anonymous_page(), followed
		by a call to install_new_page().

	populate: NULL.

	open: NULL.

	close: NULL.

Thoughts?

					Thanx, Paul

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC][PATCH] vm_operation to avoid pagefault/inval race
  2003-05-17 18:21 Daniel Phillips
@ 2003-05-17 19:49 ` Andrew Morton
  2003-05-20  1:23   ` Paul E. McKenney
  0 siblings, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2003-05-17 19:49 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: paulmck, hch, linux-mm, linux-kernel

Daniel Phillips <phillips@arcor.de> wrote:
>
> and the only problem is, we have to change pretty well every 
>  filesystem in and out of tree.

But it's only a one-liner per fs.



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC][PATCH] vm_operation to avoid pagefault/inval race
@ 2003-05-17 18:21 Daniel Phillips
  2003-05-17 19:49 ` Andrew Morton
  0 siblings, 1 reply; 6+ messages in thread
From: Daniel Phillips @ 2003-05-17 18:21 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Andrew Morton, Christoph Hellwig, linux-mm, linux-kernel

Please don't take lack of response for lack of interest.  The generic issue 
here is "what are the vfs changes needed to support cross-host mmap?".  You 
defined the problem nicely.

>
> [...]
>
> 5.	One way or another, life is now hard.

Indeed.  In brief, ->nopage just doesn't provide adequate coverage to support 
the cross-host lock.

> One solution would be for the distributed filesystem to hold
> onto a lock or semaphore upon return from the nopage function.
> The problem is that there is no way to determine (in a timely
> fashion) when it safe to release this lock or semaphore.
>
> The attached patch addresses this by adding a nopagedone
> function for when do_no_page() exits.  The filesystem may then
> drop the lock or semaphore in this nopagedone function.
>
> Thoughts?  Is there some other existing way to get this done?

There is.  One way is to make all of do_no_page a hook, and clearly this is 
more generic than what you proposed, since it covers your hook and the rest 
can be done with library calls.  Once you've gone there, the next question to 
ask is "what use is the existing ->nopage" hook, and the answer is: none, 
really.  The existing usage of ->nopage can be replaced by ->do_no_page plus 
library code, and the only problem is, we have to change pretty well every 
filesystem in and out of tree.  So that gets a little, em, interesting from 
the 2.6.0 point of view, which is why I cc'd Andrew on this.  Christoph has 
also expressed interest in this, which explains the other cc.

Any clustered filesystem that wants to support posix mmap is going to need 
this hook, so the sooner we hash this out, the better.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2003-05-20  7:56 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-05-13 20:53 [RFC][PATCH] vm_operation to avoid pagefault/inval race Paul E. McKenney
2003-05-17 16:06 ` Christoph Hellwig
2003-05-17 18:21 Daniel Phillips
2003-05-17 19:49 ` Andrew Morton
2003-05-20  1:23   ` Paul E. McKenney
2003-05-20  8:11     ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).