* [RFC][PATCH] vm_operation to avoid pagefault/inval race
@ 2003-05-13 20:53 Paul E. McKenney
2003-05-17 16:06 ` Christoph Hellwig
0 siblings, 1 reply; 6+ messages in thread
From: Paul E. McKenney @ 2003-05-13 20:53 UTC (permalink / raw)
To: linux-kernel, linux-mm, akpm; +Cc: mjbligh
This patch adds a vm_operations_struct function pointer that allows
networked and distributed filesystems to avoid a race between a
pagefault on an mmap and an invalidation request from some other
node. The race goes as follows:
1. A user process on node A accesses a portion of a mapped
file, resulting in a page fault. The pagefault handler
invokes the corresponding nopage function, which reads
the page into memory.
2. A user process on node B writes to the same portion of
the file (either via mmap or write()), therefore sending
node A an invalidation request to node A.
3. Node A receives this invalidate request, and dutifully
invalidates all mmaps. Except for the one that has
not yet been fully mapped by step 1.
4. Node A then executes the rest of do_no_page(), entering
the now-invalid page into the PTEs.
5. One way or another, life is now hard.
One solution would be for the distributed filesystem to hold
onto a lock or semaphore upon return from the nopage function.
The problem is that there is no way to determine (in a timely
fashion) when it safe to release this lock or semaphore.
The attached patch addresses this by adding a nopagedone
function for when do_no_page() exits. The filesystem may then
drop the lock or semaphore in this nopagedone function.
Thoughts? Is there some other existing way to get this done?
Thanx, Paul
diff -urN -X dontdiff linux-2.5.69/include/linux/mm.h linux-2.5.69.stmmap/include/linux/mm.h
--- linux-2.5.69/include/linux/mm.h Sun May 4 16:53:00 2003
+++ linux-2.5.69.stmmap/include/linux/mm.h Fri May 9 09:30:37 2003
@@ -134,6 +134,7 @@
void (*open)(struct vm_area_struct * area);
void (*close)(struct vm_area_struct * area);
struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int unused);
+ void (*nopagedone)(struct vm_area_struct * area, unsigned long address, int status);
int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
};
diff -urN -X dontdiff linux-2.5.69/mm/memory.c linux-2.5.69.stmmap/mm/memory.c
--- linux-2.5.69/mm/memory.c Sun May 4 16:53:14 2003
+++ linux-2.5.69.stmmap/mm/memory.c Fri May 9 17:04:09 2003
@@ -1426,6 +1487,9 @@
ret = VM_FAULT_OOM;
out:
pte_chain_free(pte_chain);
+ if (vma->vm_ops && vma->vm_ops->nopagedone) {
+ vma->vm_ops->nopagedone(vma, address & PAGE_MASK, ret);
+ }
return ret;
}
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [RFC][PATCH] vm_operation to avoid pagefault/inval race
2003-05-13 20:53 [RFC][PATCH] vm_operation to avoid pagefault/inval race Paul E. McKenney
@ 2003-05-17 16:06 ` Christoph Hellwig
0 siblings, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2003-05-17 16:06 UTC (permalink / raw)
To: Paul E. McKenney; +Cc: linux-kernel, linux-mm, akpm, mjbligh
On Tue, May 13, 2003 at 01:53:26PM -0700, Paul E. McKenney wrote:
> This patch adds a vm_operations_struct function pointer that allows
> networked and distributed filesystems to avoid a race between a
> pagefault on an mmap and an invalidation request from some other
> node. The race goes as follows:
The race is real although currenly no in-tree filesystem is affected.
The patch is uglyh as hell, though. The right fix is to change the
->nopage method to cover what do_no_page is currently, change anonymous
vmas to have vm_ops as well and set ->nopage to do_anonymous_page.
The gets of the current do_no_page become a new helper (__finish_nopage?)
and EXPORT_SYMBOL_GPL()ed. It would also be nice if you could point to
a filesystem that actually needs this, but if you can get rid of the
do_anonymous_page special casing a patch might even be acceptable without it.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [RFC][PATCH] vm_operation to avoid pagefault/inval race
2003-05-20 1:23 ` Paul E. McKenney
@ 2003-05-20 8:11 ` Andrew Morton
0 siblings, 0 replies; 6+ messages in thread
From: Andrew Morton @ 2003-05-20 8:11 UTC (permalink / raw)
To: paulmck; +Cc: phillips, hch, linux-mm, linux-kernel
"Paul E. McKenney" <paulmck@us.ibm.com> wrote:
>
> So the general idea is to do something as follows, right?
It sounds reasonable. A matter of putting together the appropriate
library functions and refactoring a few things.
>
> o Make a function, perhaps named something like
> install_new_page(), that does the PTE-installation
> and RSS-adjustment tasks currently performed by
> both do_no_page() and by do_anonymous_page().
That's similar to mm/fremap.c:install_page(). (Which forgets to call
update_mmu_cache(). Debatably a buglet.)
However there is not a lot of commonality between the various nopage()s and
there may not be a lot to be gained from all this. There is subtle code in
there and it is performance-critical. I'd be inclined to try to minimise
overall code churn in this work.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [RFC][PATCH] vm_operation to avoid pagefault/inval race
2003-05-17 19:49 ` Andrew Morton
@ 2003-05-20 1:23 ` Paul E. McKenney
2003-05-20 8:11 ` Andrew Morton
0 siblings, 1 reply; 6+ messages in thread
From: Paul E. McKenney @ 2003-05-20 1:23 UTC (permalink / raw)
To: Andrew Morton; +Cc: Daniel Phillips, hch, linux-mm, linux-kernel
On Sat, May 17, 2003 at 12:49:48PM -0700, Andrew Morton wrote:
> Daniel Phillips <phillips@arcor.de> wrote:
> >
> > and the only problem is, we have to change pretty well every
> > filesystem in and out of tree.
>
> But it's only a one-liner per fs.
So the general idea is to do something as follows, right?
(Sorry for not just putting together a patch -- I want
to make sure I understand all of your advice first!)
o Make all callers to do_no_page() instead call
vma->vm_ops->nopage().
o Make a function, perhaps named something like
install_new_page(), that does the PTE-installation
and RSS-adjustment tasks currently performed by
both do_no_page() and by do_anonymous_page().
(Not clear to me yet whether a full merge of
these two functions is the right approach, more
thought needed. Note that the nopage function
is implicitly aware of whether it is handling
an anonymous page or not, so a pair of functions
that both call another function containing the
common code is reasonable, if warranted.)
The install_new_page() function needs an additional
argument to accept the new_page value that used
to be returned by the nopage() function.
o Add arguments to nopage() to allow it to invoke
install_new_page().
o Change all nopage() functions to invoke install_new_page(),
but only in cases where they would -not- return
VM_FAULT_OOM or VM_FAULT_SIGBUS. In these cases,
these two return codes must be handed back to the
caller without invoking install_new_page().
o Otherwise, the value that these nopage() functions
would normally return must be passed to
install_new_page(), and the value returned by
install_new_page() must be returned to the nopage()
function's caller.
o Replace all occurrences of "->vm_ops = NULL" with
"->vm_ops = anonymous_vm_ops" or some such.
o The anonymous_vm_ops would have the following members:
nopage: pointer to a function containing the page-allocation
code extracted from do_anonymous_page(), followed
by a call to install_new_page().
populate: NULL.
open: NULL.
close: NULL.
Thoughts?
Thanx, Paul
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [RFC][PATCH] vm_operation to avoid pagefault/inval race
2003-05-17 18:21 Daniel Phillips
@ 2003-05-17 19:49 ` Andrew Morton
2003-05-20 1:23 ` Paul E. McKenney
0 siblings, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2003-05-17 19:49 UTC (permalink / raw)
To: Daniel Phillips; +Cc: paulmck, hch, linux-mm, linux-kernel
Daniel Phillips <phillips@arcor.de> wrote:
>
> and the only problem is, we have to change pretty well every
> filesystem in and out of tree.
But it's only a one-liner per fs.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [RFC][PATCH] vm_operation to avoid pagefault/inval race
@ 2003-05-17 18:21 Daniel Phillips
2003-05-17 19:49 ` Andrew Morton
0 siblings, 1 reply; 6+ messages in thread
From: Daniel Phillips @ 2003-05-17 18:21 UTC (permalink / raw)
To: Paul E. McKenney; +Cc: Andrew Morton, Christoph Hellwig, linux-mm, linux-kernel
Please don't take lack of response for lack of interest. The generic issue
here is "what are the vfs changes needed to support cross-host mmap?". You
defined the problem nicely.
>
> [...]
>
> 5. One way or another, life is now hard.
Indeed. In brief, ->nopage just doesn't provide adequate coverage to support
the cross-host lock.
> One solution would be for the distributed filesystem to hold
> onto a lock or semaphore upon return from the nopage function.
> The problem is that there is no way to determine (in a timely
> fashion) when it safe to release this lock or semaphore.
>
> The attached patch addresses this by adding a nopagedone
> function for when do_no_page() exits. The filesystem may then
> drop the lock or semaphore in this nopagedone function.
>
> Thoughts? Is there some other existing way to get this done?
There is. One way is to make all of do_no_page a hook, and clearly this is
more generic than what you proposed, since it covers your hook and the rest
can be done with library calls. Once you've gone there, the next question to
ask is "what use is the existing ->nopage" hook, and the answer is: none,
really. The existing usage of ->nopage can be replaced by ->do_no_page plus
library code, and the only problem is, we have to change pretty well every
filesystem in and out of tree. So that gets a little, em, interesting from
the 2.6.0 point of view, which is why I cc'd Andrew on this. Christoph has
also expressed interest in this, which explains the other cc.
Any clustered filesystem that wants to support posix mmap is going to need
this hook, so the sooner we hash this out, the better.
Regards,
Daniel
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2003-05-20 7:56 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-05-13 20:53 [RFC][PATCH] vm_operation to avoid pagefault/inval race Paul E. McKenney
2003-05-17 16:06 ` Christoph Hellwig
2003-05-17 18:21 Daniel Phillips
2003-05-17 19:49 ` Andrew Morton
2003-05-20 1:23 ` Paul E. McKenney
2003-05-20 8:11 ` Andrew Morton
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).