linux-mm.kvack.org archive mirror
* Splitting the mmap_sem
@ 2019-12-03 22:21 Matthew Wilcox
  2019-12-05 17:21 ` Jerome Glisse
  2019-12-12 14:24 ` Kirill A. Shutemov
  0 siblings, 2 replies; 31+ messages in thread
From: Matthew Wilcox @ 2019-12-03 22:21 UTC (permalink / raw)
  To: linux-mm


[My thanks to Vlastimil, Michel, Liam, David, Davidlohr and Hugh for
 their feedback on an earlier version of this.  I think the solution
 we discussed doesn't quite work, so here's one which I think does.
 See the last two paragraphs in particular.]

My preferred solution to the mmap_sem scalability problem is to allow
VMAs to be looked up under the RCU read lock then take a per-VMA lock.
I've been focusing on the first half of this problem (looking up VMAs
in an RCU-safe data structure) and ignoring the second half (taking a
lock while holding the RCU lock).

We can't take a semaphore while holding the RCU lock in case we have to
sleep -- the VMA might not exist any more when we wake up.  Making the
per-VMA lock a spinlock would be a massive change -- fault handlers are
currently called with the mmap_sem held and may sleep.  So I think we
need a per-VMA refcount.  That lets us sleep while handling a fault.
There are over 100 fault handlers in the kernel, and I don't want to
change the locking in all of them.

That makes modifications to the tree a little tricky.  At the moment,
we take the rwsem for write which waits for all readers to finish, then
we modify the VMAs, then we allow readers back in.  With RCU, there is
no way to block readers, so different threads may (at the same time)
see both an old and a new VMA for the same virtual address.

So calling mmap() looks like this:

        allocate a new VMA
        update pointer(s) in maple tree
        sleep until old VMAs have a zero refcount
        synchronize_rcu()
        free old VMAs
        flush caches for affected range
        return to userspace

While one thread is calling mmap(MAP_FIXED), two other threads which are
accessing the same address may see different data from each other and
have different page translations in their respective CPU caches until
the thread calling mmap() returns.  I believe this is OK, but would
greatly appreciate hearing from people who know better.

Some people are concerned that a reference count on the VMA will lead to
contention moving from the mmap_sem to the refcount on a very large VMA
for workloads which have one giant VMA covering the entire working set.
For those workloads, I propose we use the existing ->map_pages() callback
(changed to return a vm_fault_t from the current void).

It will be called with the RCU lock held and no reference count on
the vma.  If it needs to sleep, it should bump the refcount, drop the
RCU lock, prepare enough so that the next call will not need to sleep,
then drop the refcount and return VM_FAULT_RETRY so the VM knows the
VMA is no longer good, and it needs to walk the VMA tree from the start.

We currently only have one ->map_pages() callback, and it's
filemap_map_pages().  It only needs to sleep in one place -- to allocate
a PTE table.  I think that can be allocated ahead of time if needed.



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Splitting the mmap_sem
  2019-12-03 22:21 Splitting the mmap_sem Matthew Wilcox
@ 2019-12-05 17:21 ` Jerome Glisse
  2019-12-06  5:13   ` Matthew Wilcox
  2019-12-10 15:26   ` Vlastimil Babka
  2019-12-12 14:24 ` Kirill A. Shutemov
  1 sibling, 2 replies; 31+ messages in thread
From: Jerome Glisse @ 2019-12-05 17:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Laurent Dufour, David Rientjes, Vlastimil Babka,
	Hugh Dickins, Michel Lespinasse, Davidlohr Bueso

Adding a few interested people in cc

On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> 
> [My thanks to Vlastimil, Michel, Liam, David, Davidlohr and Hugh for
>  their feedback on an earlier version of this.  I think the solution
>  we discussed doesn't quite work, so here's one which I think does.
>  See the last two paragraphs in particular.]
> 
> My preferred solution to the mmap_sem scalability problem is to allow
> VMAs to be looked up under the RCU read lock then take a per-VMA lock.
> I've been focusing on the first half of this problem (looking up VMAs
> in an RCU-safe data structure) and ignoring the second half (taking a
> lock while holding the RCU lock).
> 
> We can't take a semaphore while holding the RCU lock in case we have to
> sleep -- the VMA might not exist any more when we woke up.  Making the
> per-VMA lock a spinlock would be a massive change -- fault handlers are
> currently called with the mmap_sem held and may sleep.  So I think we
> need a per-VMA refcount.  That lets us sleep while handling a fault.
> There are over 100 fault handlers in the kernel, and I don't want to
> change the locking in all of them.
> 
> That makes modifications to the tree a little tricky.  At the moment,
> we take the rwsem for write which waits for all readers to finish, then
> we modify the VMAs, then we allow readers back in.  With RCU, there is
> no way to block readers, so different threads may (at the same time)
> see both an old and a new VMA for the same virtual address.
> 
> So calling mmap() looks like this:
> 
>       1 allocate a new VMA
>       2 update pointer(s) in maple tree
>       3 sleep until old VMAs have a zero refcount
>       4 synchronize_rcu()
>       5 free old VMAs
>       6 flush caches for affected range
>       7 return to userspace
> 
> While one thread is calling mmap(MAP_FIXED), two other threads which are
> accessing the same address may see different data from each other and
> have different page translations in their respective CPU caches until
> the thread calling mmap() returns.  I believe this is OK, but would
> greatly appreciate hearing from people who know better.

I do not believe this is OK; I believe this is wrong (not even considering
possible hardware issues that can arise from such aliasing).

That being said, I believe this can be solved "easily": when the new vma
is added, you mark it as a newborn (VMA_BABY :)), and page faults will
have to wait on it, i.e. until the previous vma is fully gone and flushed.
So after step 6 (flush caches) you remove the VMA_BABY flag before
returning to userspace, and page faults can resume.

I would also mark the old VMA with a ZOMBIE flag so that any reader has a
chance to back off and retry. To check for that, we should add a new
check to vmf_insert_page() (and similar) to avoid inserting PFNs into a
ZOMBIE vma. Note that I am not sure what we want to do here: can an
application rely on rwsem serialization unknowingly, i.e. could it have
one thread doing a page fault on a range that is about to be unmapped by
another thread? I am not sure this can happen today without a SEGFAULT,
thanks to serialization through the rwsem.

Anyway with BABY and ZOMBIE, it should behave mostly as it does today
(modulo concurrency).

> 
> Some people are concerned that a reference count on the VMA will lead to
> contention moving from the mmap_sem to the refcount on a very large VMA
> for workloads which have one giant VMA covering the entire working set.
> For those workloads, I propose we use the existing ->map_pages() callback
> (changed to return a vm_fault_t from the current void).
> 
> It will be called with the RCU lock held and no reference count on
> the vma.  If it needs to sleep, it should bump the refcount, drop the
> RCU lock, prepare enough so that the next call will not need to sleep,
> then drop the refcount and return VM_FAULT_RETRY so the VM knows the
> VMA is no longer good, and it needs to walk the VMA tree from the start.

Just to make sure I understand: you propose that ->map_pages() becomes a
new ->fault() handler that gets called before ->fault(), without a
refcount, so that we can update fs/drivers slowly to perform better in
the new scheme (i.e. avoid the overhead of refcounting entirely where
possible)?

The ->fault() callback would then be the "slow" path, which will require
a refcount on the VMA (taken by core mm code before dropping the RCU lock).

Cheers,
Jérôme




* Re: Splitting the mmap_sem
  2019-12-05 17:21 ` Jerome Glisse
@ 2019-12-06  5:13   ` Matthew Wilcox
  2019-12-06 17:30     ` Jerome Glisse
  2019-12-10 15:26   ` Vlastimil Babka
  1 sibling, 1 reply; 31+ messages in thread
From: Matthew Wilcox @ 2019-12-06  5:13 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, Laurent Dufour, David Rientjes, Vlastimil Babka,
	Hugh Dickins, Michel Lespinasse, Davidlohr Bueso

On Thu, Dec 05, 2019 at 12:21:50PM -0500, Jerome Glisse wrote:
> Adding few interested people in cc

I figured they all read linux-mm already ;-)

> On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > While one thread is calling mmap(MAP_FIXED), two other threads which are
> > accessing the same address may see different data from each other and
> > have different page translations in their respective CPU caches until
> > the thread calling mmap() returns.  I believe this is OK, but would
> > greatly appreciate hearing from people who know better.
> 
> I do not believe this is OK, i believe this is wrong (not even considering
> possible hardware issues that can arise from such aliasing).

Well, OK, but why do you believe it is wrong?  If thread A is executing
a load instruction at the same time that thread B is calling mmap(),
it really is indeterminate what value A loads.  It might be from before
the call to mmap() and it might be from after.  And if thread C is also
executing a load instruction at the same time, then it might already get
a different result from thread A.  And can threads A and C really tell
which of them executed the load instruction 'first'?  I think this is
all so indeterminate already that the (lack of) guarantees I outlined
above are acceptable.

But we should all agree on this, so _please_ continue to argue your case
for why you believe it to be wrong.

[snip proposed solution -- if the problem needs solving, we can argue
about how to solve it later]

> > Some people are concerned that a reference count on the VMA will lead to
> > contention moving from the mmap_sem to the refcount on a very large VMA
> > for workloads which have one giant VMA covering the entire working set.
> > For those workloads, I propose we use the existing ->map_pages() callback
> > (changed to return a vm_fault_t from the current void).
> > 
> > It will be called with the RCU lock held and no reference count on
> > the vma.  If it needs to sleep, it should bump the refcount, drop the
> > RCU lock, prepare enough so that the next call will not need to sleep,
> > then drop the refcount and return VM_FAULT_RETRY so the VM knows the
> > VMA is no longer good, and it needs to walk the VMA tree from the start.
> 
> Just to make sure i understand, you propose that ->map_pages() becomes
> a new ->fault() handler that get calls before ->fault() without refcount
> so that we can update fs/drivers slowly to perform better in the new scheme
> (ie avoid the overead of refcounting if possible at all) ?
> 
> The ->fault() callback would then be the "slow" path which will require
> a refcount on the vma (taken by core mm code before dropping rcu lock).

I would actually propose never updating most drivers.  There's just no
need for them to handle such an unstable and tricky situation as this.
Let's not make driver writers lives harder.

For the ones which need this kind of scalability (and let's be clear, they
would already have *better* scalability than today due to the rwsem being
split into a per-VMA refcount), then yes, implementing ->map_pages would
be the way to go.  Indeed, they would probably benefit from implementing
it today, since it will reduce the number of page faults.




* Re: Splitting the mmap_sem
  2019-12-06  5:13   ` Matthew Wilcox
@ 2019-12-06 17:30     ` Jerome Glisse
  2019-12-09  3:33       ` Matthew Wilcox
  0 siblings, 1 reply; 31+ messages in thread
From: Jerome Glisse @ 2019-12-06 17:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Laurent Dufour, David Rientjes, Vlastimil Babka,
	Hugh Dickins, Michel Lespinasse, Davidlohr Bueso

On Thu, Dec 05, 2019 at 09:13:22PM -0800, Matthew Wilcox wrote:
> On Thu, Dec 05, 2019 at 12:21:50PM -0500, Jerome Glisse wrote:
> > Adding few interested people in cc
> 
> I figured they all read linux-mm already ;-)
> 
> > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > > While one thread is calling mmap(MAP_FIXED), two other threads which are
> > > accessing the same address may see different data from each other and
> > > have different page translations in their respective CPU caches until
> > > the thread calling mmap() returns.  I believe this is OK, but would
> > > greatly appreciate hearing from people who know better.
> > 
> > I do not believe this is OK, i believe this is wrong (not even considering
> > possible hardware issues that can arise from such aliasing).
> 
> Well, OK, but why do you believe it is wrong?  If thread A is executing
> a load instruction at the same time that thread B is calling mmap(),
> it really is indeterminate what value A loads.  It might be from before
> the call to mmap() and it might be from after.  And if thread C is also
> executing a load instruction at the same time, then it might already get
> a different result from thread A.  And can threads A and C really tell
> which of them executed the load instruction 'first'?  I think this is
> all so indeterminate already that the (lack of) guarantees I outlined
> above are acceptable.
> 
> But we should all agree on this, so _please_ continue to argue your case
> for why you believe it to be wrong.
> 

I agree that such an application might look like it is doing something
indeterminate, but there might be applications that catch SEGFAULT and
use it as synchronization. I did something similar for reverse
engineering a long time ago with a library called libsegfault ...

In any case, I agree that an application which is not catching SEGFAULT,
and which is doing the above access patterns, is doing something
indeterminate.

Nonetheless I believe it is important that at any point in time, for all
the threads in a given process, on all the CPUs, a given virtual address
should always point to the same physical memory (or to nothing), i.e. we
should never have one CPU see different physical memory from another CPU
for the same virtual address.


> [snip proposed solution -- if the problem needs solving, we can argue
> about how to solve it later]

Well, I feel like you are also not discussing munmap(); the above seemed
to be about MAP_FIXED (replacing an existing mapping). For munmap too, I
believe we should agree on what the expected behavior should be, and from
my POV again we should not allow a new mapping to appear until a "running"
munmap is fully done (i.e. all CPU caches and TLBs flushed). For the same
reason as above, i.e. all CPUs always see the same physical memory (or
nothing) for a given virtual address.

This is what we have today with the big rwsem, and I think we need to
keep that behavior even with concurrency. I do not believe this will
impact performance, and it is easy enough to solve, so I feel safer doing
so given it does not cost anything.

So I would rather argue about why we should change the current behavior
if we can fix the concurrency without changing it (hence why discussing
solutions might also be relevant here).


> > > Some people are concerned that a reference count on the VMA will lead to
> > > contention moving from the mmap_sem to the refcount on a very large VMA
> > > for workloads which have one giant VMA covering the entire working set.
> > > For those workloads, I propose we use the existing ->map_pages() callback
> > > (changed to return a vm_fault_t from the current void).
> > > 
> > > It will be called with the RCU lock held and no reference count on
> > > the vma.  If it needs to sleep, it should bump the refcount, drop the
> > > RCU lock, prepare enough so that the next call will not need to sleep,
> > > then drop the refcount and return VM_FAULT_RETRY so the VM knows the
> > > VMA is no longer good, and it needs to walk the VMA tree from the start.
> > 
> > Just to make sure i understand, you propose that ->map_pages() becomes
> > a new ->fault() handler that get calls before ->fault() without refcount
> > so that we can update fs/drivers slowly to perform better in the new scheme
> > (ie avoid the overead of refcounting if possible at all) ?
> > 
> > The ->fault() callback would then be the "slow" path which will require
> > a refcount on the vma (taken by core mm code before dropping rcu lock).
> 
> I would actually propose never updating most drivers.  There's just no
> need for them to handle such an unstable and tricky situation as this.
> Let's not make driver writers lives harder.
> 
> For the ones which need this kind of scalability (and let's be clear, they
> would already have *better* scalability than today due to the rwsem being
> split into a per-VMA refcount), then yes, implementing ->map_pages would
> be the way to go.  Indeed, they would probably benefit from implementing
> it today, since it will reduce the number of page faults.

Yes, they will get better scalability, but I suspect some of those
drivers might want the extra few mini-percent :) In any case, I believe
it might be better to give it a new name, i.e. fix the current
map_pages() user and rename the callback to something more explicit
(atomic_map_pages() or something similar; I am not good at naming). But
otherwise this looks like a good plan to avoid excessive refcount
overhead.

Cheers,
Jérôme




* Re: Splitting the mmap_sem
  2019-12-06 17:30     ` Jerome Glisse
@ 2019-12-09  3:33       ` Matthew Wilcox
  2019-12-09 14:17         ` Jerome Glisse
  0 siblings, 1 reply; 31+ messages in thread
From: Matthew Wilcox @ 2019-12-09  3:33 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, Laurent Dufour, David Rientjes, Vlastimil Babka,
	Hugh Dickins, Michel Lespinasse, Davidlohr Bueso

On Fri, Dec 06, 2019 at 12:30:30PM -0500, Jerome Glisse wrote:
> On Thu, Dec 05, 2019 at 09:13:22PM -0800, Matthew Wilcox wrote:
> > On Thu, Dec 05, 2019 at 12:21:50PM -0500, Jerome Glisse wrote:
> > > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > > > While one thread is calling mmap(MAP_FIXED), two other threads which are
> > > > accessing the same address may see different data from each other and
> > > > have different page translations in their respective CPU caches until
> > > > the thread calling mmap() returns.  I believe this is OK, but would
> > > > greatly appreciate hearing from people who know better.
> > > 
> > > I do not believe this is OK, i believe this is wrong (not even considering
> > > possible hardware issues that can arise from such aliasing).
> > 
> > Well, OK, but why do you believe it is wrong?  If thread A is executing
> > a load instruction at the same time that thread B is calling mmap(),
> > it really is indeterminate what value A loads.  It might be from before
> > the call to mmap() and it might be from after.  And if thread C is also
> > executing a load instruction at the same time, then it might already get
> > a different result from thread A.  And can threads A and C really tell
> > which of them executed the load instruction 'first'?  I think this is
> > all so indeterminate already that the (lack of) guarantees I outlined
> > above are acceptable.
> > 
> > But we should all agree on this, so _please_ continue to argue your case
> > for why you believe it to be wrong.
> 
> I agree that such application might looks like it is doing something that
> is undeterminate but their might be application that catch SEGFAULT and use
> it as synchronization. I did something similar for reverse engineering a
> long time ago with a library call libsegfault ...
> 
> In any case, i agree that an application that is not catching SEGFAULT, and
> which is doing the above (access patterns) is doing something undeterminate.
> 
> Nonetheless i believe it is important that at any point in time for all the
> threads in a given process, on all the CPUs, a given virtual address should
> always point to the same physical memory (or to nothing) ie we should never
> have one CPU that sees a different physical memory from another CPU for the
> same virtual address.
> 
> Well i feel like you are also not discussing about the munmap() the above
> seemed to be about MAP_FIXED (replacing an existing mapping). For munmap
> too i believe we should agree on what should be the expected behavior and
> from my POV again we should not allow new mapping to appear until a "running"
> munmap is not fully done (ie all CPUs cache and TLB flushed). For the same
> reason as above ie all CPUs always see same physical memory (or nothing) for
> a given virtual address.

I see MAP_FIXED as being the harder case, but sure, let's talk about
munmap!  I agree that a munmap() + mmap() call should not permit thread
B to see the old value after thread A has seen the new value.  But,
as long as no new mmap can occupy that range, then it's OK if thread A
takes a segfault while thread B can still load the old value.  At least
for a short window.

We can replicate that behaviour by ensuring that new lookups see a NULL
entry, but new attempts to allocate will not reuse that range until the
munmap has finished and all TLB entries are flushed.  The maple tree
actually supports a "ZERO" entry (just like the XArray does) which has
this behaviour -- lookups see NULL, but attempts to allocate do not see
it as free.  We already use that property to prevent allocating above
the end of the process address space.

> This is what we have today with the big rwsem and i think we need to keep
> that behavior even with concurency. I do not believe this will impact the
> performance and it is easy enough to solve so i feel safer doing so given
> it does not cost anything.
> 
> So i would rather argue on why we should change the current behavior if we
> can fix the concurrency without changing it (hence why discussing solution
> might also be relevant here).

It seems like you want to force a thread which sees an ongoing munmap
to spin or sleep until the munmap is done, rather than immediately take
a segfault, and I don't know that's a useful behaviour.

> > > Just to make sure i understand, you propose that ->map_pages() becomes
> > > a new ->fault() handler that get calls before ->fault() without refcount
> > > so that we can update fs/drivers slowly to perform better in the new scheme
> > > (ie avoid the overead of refcounting if possible at all) ?
> > > 
> > > The ->fault() callback would then be the "slow" path which will require
> > > a refcount on the vma (taken by core mm code before dropping rcu lock).
> > 
> > I would actually propose never updating most drivers.  There's just no
> > need for them to handle such an unstable and tricky situation as this.
> > Let's not make driver writers lives harder.
> > 
> > For the ones which need this kind of scalability (and let's be clear, they
> > would already have *better* scalability than today due to the rwsem being
> > split into a per-VMA refcount), then yes, implementing ->map_pages would
> > be the way to go.  Indeed, they would probably benefit from implementing
> > it today, since it will reduce the number of page faults.
> 
> Yes they will get better scalability but i see some of those drivers might
> want the extra few mini-percent :) In any case, i believe that it might be
> better to give a new name ie fix current map_pages() user and rename that
> callback to something more explicit (atomic_map_pages() or something similar
> i am not good at naming). But otherwise this looks like a good plan to avoid
> excessive refcount overhead.

OK, great.  I don't think the current name is bad, but if someone comes
up with a better one, I don't have a problem with renaming it.



* Re: Splitting the mmap_sem
  2019-12-09  3:33       ` Matthew Wilcox
@ 2019-12-09 14:17         ` Jerome Glisse
  0 siblings, 0 replies; 31+ messages in thread
From: Jerome Glisse @ 2019-12-09 14:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Laurent Dufour, David Rientjes, Vlastimil Babka,
	Hugh Dickins, Michel Lespinasse, Davidlohr Bueso

On Sun, Dec 08, 2019 at 07:33:35PM -0800, Matthew Wilcox wrote:
> On Fri, Dec 06, 2019 at 12:30:30PM -0500, Jerome Glisse wrote:
> > On Thu, Dec 05, 2019 at 09:13:22PM -0800, Matthew Wilcox wrote:
> > > On Thu, Dec 05, 2019 at 12:21:50PM -0500, Jerome Glisse wrote:
> > > > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > > > > While one thread is calling mmap(MAP_FIXED), two other threads which are
> > > > > accessing the same address may see different data from each other and
> > > > > have different page translations in their respective CPU caches until
> > > > > the thread calling mmap() returns.  I believe this is OK, but would
> > > > > greatly appreciate hearing from people who know better.
> > > > 
> > > > I do not believe this is OK, i believe this is wrong (not even considering
> > > > possible hardware issues that can arise from such aliasing).
> > > 
> > > Well, OK, but why do you believe it is wrong?  If thread A is executing
> > > a load instruction at the same time that thread B is calling mmap(),
> > > it really is indeterminate what value A loads.  It might be from before
> > > the call to mmap() and it might be from after.  And if thread C is also
> > > executing a load instruction at the same time, then it might already get
> > > a different result from thread A.  And can threads A and C really tell
> > > which of them executed the load instruction 'first'?  I think this is
> > > all so indeterminate already that the (lack of) guarantees I outlined
> > > above are acceptable.
> > > 
> > > But we should all agree on this, so _please_ continue to argue your case
> > > for why you believe it to be wrong.
> > 
> > I agree that such application might looks like it is doing something that
> > is undeterminate but their might be application that catch SEGFAULT and use
> > it as synchronization. I did something similar for reverse engineering a
> > long time ago with a library call libsegfault ...
> > 
> > In any case, i agree that an application that is not catching SEGFAULT, and
> > which is doing the above (access patterns) is doing something undeterminate.
> > 
> > Nonetheless i believe it is important that at any point in time for all the
> > threads in a given process, on all the CPUs, a given virtual address should
> > always point to the same physical memory (or to nothing) ie we should never
> > have one CPU that sees a different physical memory from another CPU for the
> > same virtual address.
> > 
> > Well i feel like you are also not discussing about the munmap() the above
> > seemed to be about MAP_FIXED (replacing an existing mapping). For munmap
> > too i believe we should agree on what should be the expected behavior and
> > from my POV again we should not allow new mapping to appear until a "running"
> > munmap is not fully done (ie all CPUs cache and TLB flushed). For the same
> > reason as above ie all CPUs always see same physical memory (or nothing) for
> > a given virtual address.
> 
> I see MAP_FIXED as being the harder case, but sure, let's talk about
> munmap!  I agree that a munmap() + mmap() call should not permit thread
> B to see the old value after thread A has seen the new value.  But,
> as long as no new mmap can occupy that range, then it's OK if thread A
> takes a segfault while thread B can still load the old value.  At least
> for a short window.
> 
> We can replicate that behaviour by ensuring that new lookups see a NULL
> entry, but new attempts to allocate will not reuse that range until the
> munmap has finished and all TLB entries are flushed.  The maple tree
> actually supports a "ZERO" entry (just like the XArray does) which has
> this behaviour -- lookups see NULL, but attempts to allocate do not see
> it as free.  We already use that property to prevent allocating above
> the end of the process address space.
> 
> > This is what we have today with the big rwsem and i think we need to keep
> > that behavior even with concurency. I do not believe this will impact the
> > performance and it is easy enough to solve so i feel safer doing so given
> > it does not cost anything.
> > 
> > So i would rather argue on why we should change the current behavior if we
> > can fix the concurrency without changing it (hence why discussing solution
> > might also be relevant here).
> 
> It seems like you want to force a thread which sees an ongoing munmap
> to spin or sleep until the munmap is done, rather than immediately take
> a segfault, and I don't know that's a useful behaviour.

No, for munmap() I want today's behavior modulo concurrency, i.e.:
    - any concurrent fault while munmap is ongoing should SEGFAULT
    - any new mmap might install itself in the unmapped area but cannot
      service faults until the munmap is fully done (i.e. all caches and
      TLBs are flushed)

So it seems we agree on that, given your above presentation of the maple
NULL entry.

I want the spin only for a fault in a MAP_FIXED range that replaces an
existing range, i.e. what we have today. Today if you do an
mmap(MAP_FIXED) you will block concurrent faults until the rwsem is
released. I argue we should keep that behavior, as any program that
keeps accessing such an area while mmap is in progress is doing something
peculiar (it can be a bug, or it can be something totally intentional and
used in conjunction with signal handlers).

This will only affect such applications, and it will keep TLB coherency
across all CPUs. The implementation cost can be kept very low, and it
does not impede concurrency on other virtual ranges.

Keeping the existing behavior just feels a lot safer to me, and it avoids
us pondering whether or not a CPU can freak out if it sees different TLB
entries for the same virtual address.


> > > > Just to make sure i understand, you propose that ->map_pages() becomes
> > > > a new ->fault() handler that get calls before ->fault() without refcount
> > > > so that we can update fs/drivers slowly to perform better in the new scheme
> > > > (ie avoid the overead of refcounting if possible at all) ?
> > > > 
> > > > The ->fault() callback would then be the "slow" path which will require
> > > > a refcount on the vma (taken by core mm code before dropping rcu lock).
> > > 
> > > I would actually propose never updating most drivers.  There's just no
> > > need for them to handle such an unstable and tricky situation as this.
> > > Let's not make driver writers lives harder.
> > > 
> > > For the ones which need this kind of scalability (and let's be clear, they
> > > would already have *better* scalability than today due to the rwsem being
> > > split into a per-VMA refcount), then yes, implementing ->map_pages would
> > > be the way to go.  Indeed, they would probably benefit from implementing
> > > it today, since it will reduce the number of page faults.
> > 
> > Yes they will get better scalability but i see some of those drivers might
> > want the extra few mini-percent :) In any case, i believe that it might be
> > better to give a new name ie fix current map_pages() user and rename that
> > callback to something more explicit (atomic_map_pages() or something similar
> > i am not good at naming). But otherwise this looks like a good plan to avoid
> > excessive refcount overhead.
> 
> OK, great.  I don't think the current name is bad, but if someone comes
> up with a better one, I don't have a problem with renaming it.

I just like it when new behavior is reflected in the name; what about
rcu_map_pages?  That makes it clear it runs under RCU.

Cheers,
Jérôme



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Splitting the mmap_sem
  2019-12-05 17:21 ` Jerome Glisse
  2019-12-06  5:13   ` Matthew Wilcox
@ 2019-12-10 15:26   ` Vlastimil Babka
  2019-12-10 16:07     ` Jerome Glisse
  1 sibling, 1 reply; 31+ messages in thread
From: Vlastimil Babka @ 2019-12-10 15:26 UTC (permalink / raw)
  To: Jerome Glisse, Matthew Wilcox
  Cc: linux-mm, Laurent Dufour, David Rientjes, Hugh Dickins,
	Michel Lespinasse, Davidlohr Bueso

On 12/5/19 6:21 PM, Jerome Glisse wrote:
>> 
>> So calling mmap() looks like this:
>> 
>>       1 allocate a new VMA
>>       2 update pointer(s) in maple tree
>>       3 sleep until old VMAs have a zero refcount
>>       4 synchronize_rcu()
>>       5 free old VMAs
>>       6 flush caches for affected range
>>       7 return to userspace
>> 
>> While one thread is calling mmap(MAP_FIXED), two other threads which are
>> accessing the same address may see different data from each other and
>> have different page translations in their respective CPU caches until
>> the thread calling mmap() returns.  I believe this is OK, but would
>> greatly appreciate hearing from people who know better.
> 
> I do not believe this is OK, i believe this is wrong (not even considering
> possible hardware issues that can arise from such aliasing).

But is it true that races can happen in the above such that multiple CPUs
have different translations? I think it's impossible to tell from the above -
there are no details about when and which pte modifications happen, or where
the ptl lock is taken... perhaps after filling in those details, we would be
able to see that there's no race.



* Re: Splitting the mmap_sem
  2019-12-10 15:26   ` Vlastimil Babka
@ 2019-12-10 16:07     ` Jerome Glisse
  2019-12-10 18:09       ` Vlastimil Babka
  0 siblings, 1 reply; 31+ messages in thread
From: Jerome Glisse @ 2019-12-10 16:07 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Matthew Wilcox, linux-mm, Laurent Dufour, David Rientjes,
	Hugh Dickins, Michel Lespinasse, Davidlohr Bueso

On Tue, Dec 10, 2019 at 04:26:40PM +0100, Vlastimil Babka wrote:
> On 12/5/19 6:21 PM, Jerome Glisse wrote:
> >> 
> >> So calling mmap() looks like this:
> >> 
> >>       1 allocate a new VMA
> >>       2 update pointer(s) in maple tree
> >>       3 sleep until old VMAs have a zero refcount
> >>       4 synchronize_rcu()
> >>       5 free old VMAs
> >>       6 flush caches for affected range
> >>       7 return to userspace
> >> 
> >> While one thread is calling mmap(MAP_FIXED), two other threads which are
> >> accessing the same address may see different data from each other and
> >> have different page translations in their respective CPU caches until
> >> the thread calling mmap() returns.  I believe this is OK, but would
> >> greatly appreciate hearing from people who know better.
> > 
> > I do not believe this is OK, i believe this is wrong (not even considering
> > possible hardware issues that can arise from such aliasing).
> 
> But is it true that the races can happen in the above such that multiple CPU's
> have different translations? I think it's impossible to tell from above - there
> are no details about when and which pte modifications happen, where ptl lock is
> taken... perhaps after filling those details, we could be able to see that
> there's no race.
> 

My reading of Matthew was that while step 6 is making progress
(flushing caches, and I assume TLBs too) you can have a CPU which
is already flushed and which takes a fault against the new VMA, and
thus gets a new TLB entry that does not match a CPU which is not yet
flushed.

Today this cannot happen because page faults will serialize on the
mmap_sem (ie until the write mode is released when returning to
userspace).

I advocate that for MAP_FIXED we should keep that behavior, ie have
page faults wait until all the CPUs' caches and TLBs are flushed before
servicing a new fault. The rationale is:
    - Applications with racing threads accessing an area which is
      undergoing a MAP_FIXED mmap are doing something weird (which might
      be totally warranted) and thus it should not matter if we penalize
      them by not offering the benefit of fully concurrent page faults
      for that area. We are not making things any worse than they are
      today.

    - Waiting on a new MAP_FIXED vma is easy to implement and does not
      affect other kinds of page faults (ie you still have concurrent
      faults to other areas).

Note, I stress again, this is only for MAP_FIXED replacing an existing
vma. For any other mmap (well, I would need to go over all the flags
to make sure there is nothing else similar to MAP_FIXED) we can have
full concurrency, ie:
    - Access to an area that is undergoing munmap will SEGFAULT
      as soon as the vma is no longer valid (ie no longer in the maple
      tree, I guess)

    - Access to an area that is just being mmaped will fault using
      the new vma (assuming we publish the vma only once it is
      ready to take faults)

    - An mmap can make progress while another munmap is making progress
      and reuse the munmaped area as soon as it is fully flushed
      (caches and TLB).
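The gating behavior argued for above can be sketched as a small userspace
model (this is not kernel code; range_gate, under_replacement and the
pthread-based waiting are my illustrative assumptions): faults on a range
being replaced by mmap(MAP_FIXED) block until the flush has completed on
all CPUs.

```c
/* Userspace model (not kernel code) of "page faults wait until the
 * MAP_FIXED flush completes".  All names here are invented for the sketch. */
#include <pthread.h>
#include <stdbool.h>

struct range_gate {
	pthread_mutex_t lock;
	pthread_cond_t flushed;
	bool under_replacement;	/* set while MAP_FIXED teardown + flush runs */
};

/* Fault path: called before servicing a fault in this range. */
static void fault_wait_for_flush(struct range_gate *g)
{
	pthread_mutex_lock(&g->lock);
	while (g->under_replacement)
		pthread_cond_wait(&g->flushed, &g->lock);
	pthread_mutex_unlock(&g->lock);
}

/* MAP_FIXED path: bracket the cache/TLB flush of the old mapping. */
static void begin_replacement(struct range_gate *g)
{
	pthread_mutex_lock(&g->lock);
	g->under_replacement = true;
	pthread_mutex_unlock(&g->lock);
}

static void end_replacement(struct range_gate *g)	/* all CPUs flushed */
{
	pthread_mutex_lock(&g->lock);
	g->under_replacement = false;
	pthread_cond_broadcast(&g->flushed);
	pthread_mutex_unlock(&g->lock);
}
```

Faults to unrelated ranges never touch this gate, which is the concurrency
point made above.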

Cheers,
Jérôme




* Re: Splitting the mmap_sem
  2019-12-10 16:07     ` Jerome Glisse
@ 2019-12-10 18:09       ` Vlastimil Babka
  0 siblings, 0 replies; 31+ messages in thread
From: Vlastimil Babka @ 2019-12-10 18:09 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Matthew Wilcox, linux-mm, Laurent Dufour, David Rientjes,
	Hugh Dickins, Michel Lespinasse, Davidlohr Bueso

On 12/10/19 5:07 PM, Jerome Glisse wrote:
> On Tue, Dec 10, 2019 at 04:26:40PM +0100, Vlastimil Babka wrote:
>> On 12/5/19 6:21 PM, Jerome Glisse wrote:
>>>>
>>>> So calling mmap() looks like this:
>>>>
>>>>       1 allocate a new VMA
>>>>       2 update pointer(s) in maple tree
>>>>       3 sleep until old VMAs have a zero refcount
>>>>       4 synchronize_rcu()
>>>>       5 free old VMAs
>>>>       6 flush caches for affected range
>>>>       7 return to userspace
>>>>
>>>> While one thread is calling mmap(MAP_FIXED), two other threads which are
>>>> accessing the same address may see different data from each other and
>>>> have different page translations in their respective CPU caches until
>>>> the thread calling mmap() returns.  I believe this is OK, but would
>>>> greatly appreciate hearing from people who know better.
>>>
>>> I do not believe this is OK, i believe this is wrong (not even considering
>>> possible hardware issues that can arise from such aliasing).
>>
>> But is it true that the races can happen in the above such that multiple CPU's
>> have different translations? I think it's impossible to tell from above - there
>> are no details about when and which pte modifications happen, where ptl lock is
>> taken... perhaps after filling those details, we could be able to see that
>> there's no race.
>>
> 
> My reading of Matthew was that while step 6 is making progress
> (flushing caches, and I assume TLBs too) you can have a CPU which
> is already flushed and which takes a fault against the new VMA, and
> thus gets a new TLB entry that does not match a CPU which is not yet
> flushed.

We already have to protect against CPUs that access page tables (and
thus fill their TLBs) in hardware, taking no mmap_sem at all. For the
mmap(MAP_FIXED) overwriting-an-existing-mapping case, this means that the
old stuff will first be munmapped - zap_pte_range() will take the pte lock,
clear the pte and flush TLBs before installing any new mappings.
A parallel fault trying to install a pte for the new VMA should thus
serialize on the pte lock and only install the new pte after everyone
was flushed, so this should be fine?

What might IMHO perhaps need care is a parallel fault that started with
the old VMA because some PTE was unpopulated. We need to make sure it
doesn't end up being the last to install a PTE for the VMA that's
going away.
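That "recheck under the pte lock" idea can be modeled in userspace roughly
like this (a sketch, not kernel code; mm_model and vma_seq are invented
names standing in for the real pte lock and VMA-liveness check): a fault
that began against the old VMA must notice, under the lock, that its VMA
is stale and refuse to install its PTE.

```c
/* Userspace model (not kernel code) of pte-lock serialization with a
 * staleness check.  All names are illustrative. */
#include <pthread.h>
#include <stdbool.h>

struct mm_model {
	pthread_mutex_t ptl;	/* models the pte lock */
	unsigned long vma_seq;	/* bumped when the covering VMA changes */
	unsigned long pte;	/* 0 = none */
};

/* Fault path: seq_seen was snapshotted when the VMA was looked up;
 * only install the PTE if the VMA has not been replaced meanwhile. */
static bool install_pte(struct mm_model *mm, unsigned long seq_seen,
			unsigned long pte)
{
	bool ok;

	pthread_mutex_lock(&mm->ptl);
	ok = (mm->vma_seq == seq_seen);
	if (ok)
		mm->pte = pte;
	pthread_mutex_unlock(&mm->ptl);
	return ok;		/* false => retry the fault with the new VMA */
}

/* munmap()/MAP_FIXED path: clear the PTE and invalidate in-flight faults. */
static void zap_range(struct mm_model *mm)
{
	pthread_mutex_lock(&mm->ptl);
	mm->pte = 0;		/* models zap_pte_range() + TLB flush */
	mm->vma_seq++;		/* faults started on the old VMA become stale */
	pthread_mutex_unlock(&mm->ptl);
}
```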

> Today this cannot happen because page faults will serialize on the
> mmap_sem (ie until the write mode is released when returning to
> userspace).




* Re: Splitting the mmap_sem
  2019-12-03 22:21 Splitting the mmap_sem Matthew Wilcox
  2019-12-05 17:21 ` Jerome Glisse
@ 2019-12-12 14:24 ` Kirill A. Shutemov
  2019-12-12 15:40   ` Matthew Wilcox
  1 sibling, 1 reply; 31+ messages in thread
From: Kirill A. Shutemov @ 2019-12-12 14:24 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm

On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> My preferred solution to the mmap_sem scalability problem is to allow
> VMAs to be looked up under the RCU read lock then take a per-VMA lock.
> I've been focusing on the first half of this problem (looking up VMAs
> in an RCU-safe data structure) and ignoring the second half (taking a
> lock while holding the RCU lock).

Do you see this approach as regression-free for the uncontended case?
I doubt it will not cause regressions for single-threaded applications...

> We currently only have one ->map_pages() callback, and it's
> filemap_map_pages().  It only needs to sleep in one place -- to allocate
> a PTE table.  I think that can be allocated ahead of time if needed.

No, filemap_map_pages() doesn't sleep. It cannot. The whole body of the
function is under rcu_read_lock(). It uses a pre-allocated page table.
See do_fault_around().
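The pattern described here - any allocation that might sleep happens before
entering the non-sleeping RCU section, which then only consumes the
preallocation - might be modeled in userspace like this (prealloc_pte
mirrors the field in struct vm_fault; everything else is an illustrative
assumption, not kernel code):

```c
/* Userspace model (not kernel code) of the do_fault_around()
 * preallocation pattern. */
#include <stdlib.h>

struct fault_env {
	void *prealloc_pte;	/* models vmf->prealloc_pte */
};

/* May sleep: done before rcu_read_lock() in the real code. */
static int prealloc_for_fault(struct fault_env *env, size_t table_size)
{
	env->prealloc_pte = malloc(table_size);
	return env->prealloc_pte ? 0 : -1;
}

/* Must not sleep: only consumes the table allocated earlier. */
static void *take_prealloc(struct fault_env *env)
{
	void *p = env->prealloc_pte;

	env->prealloc_pte = NULL;
	return p;
}

/* After the non-sleeping section: discard an unused preallocation. */
static void release_prealloc(struct fault_env *env)
{
	free(env->prealloc_pte);
	env->prealloc_pte = NULL;
}
```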

-- 
 Kirill A. Shutemov



* Re: Splitting the mmap_sem
  2019-12-12 14:24 ` Kirill A. Shutemov
@ 2019-12-12 15:40   ` Matthew Wilcox
  2019-12-12 15:46     ` Kirill A. Shutemov
  2020-01-06 22:09     ` Matthew Wilcox
  0 siblings, 2 replies; 31+ messages in thread
From: Matthew Wilcox @ 2019-12-12 15:40 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: linux-mm

On Thu, Dec 12, 2019 at 05:24:57PM +0300, Kirill A. Shutemov wrote:
> On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > My preferred solution to the mmap_sem scalability problem is to allow
> > VMAs to be looked up under the RCU read lock then take a per-VMA lock.
> > I've been focusing on the first half of this problem (looking up VMAs
> > in an RCU-safe data structure) and ignoring the second half (taking a
> > lock while holding the RCU lock).
> 
> Do you see this approach as regression-free for the uncontended case?
> I doubt it will not cause regressions for single-threaded applications...

Which part of the approach do you think will cause a regression?  The
maple tree is quicker to traverse than the rbtree (in our simulations).
Incrementing a refcount on a VMA is surely no slower than acquiring an
uncontended rwsem for read.  mmap() and munmap() will get slower, but is
that a problem?

> > We currently only have one ->map_pages() callback, and it's
> > filemap_map_pages().  It only needs to sleep in one place -- to allocate
> > a PTE table.  I think that can be allocated ahead of time if needed.
> 
> No, filemap_map_pages() doesn't sleep. It cannot. Whole body of the
> function is under rcu_read_lock(). It uses pre-allocated page table.
> See do_fault_around().

Oh, thank you!  That makes the ->map_pages() optimisation already workable
with no changes.



* Re: Splitting the mmap_sem
  2019-12-12 15:40   ` Matthew Wilcox
@ 2019-12-12 15:46     ` Kirill A. Shutemov
  2019-12-13 14:33       ` Matthew Wilcox
  2020-01-06 22:09     ` Matthew Wilcox
  1 sibling, 1 reply; 31+ messages in thread
From: Kirill A. Shutemov @ 2019-12-12 15:46 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm

On Thu, Dec 12, 2019 at 07:40:02AM -0800, Matthew Wilcox wrote:
> On Thu, Dec 12, 2019 at 05:24:57PM +0300, Kirill A. Shutemov wrote:
> > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > > My preferred solution to the mmap_sem scalability problem is to allow
> > > VMAs to be looked up under the RCU read lock then take a per-VMA lock.
> > > I've been focusing on the first half of this problem (looking up VMAs
> > > in an RCU-safe data structure) and ignoring the second half (taking a
> > > lock while holding the RCU lock).
> > 
> > Do you see this approach as regression-free for the uncontended case?
> > I doubt it will not cause regressions for single-threaded applications...
> 
> Which part of the approach do you think will cause a regression?  The
> maple tree is quicker to traverse than the rbtree (in our simulations).
> Incrementing a refcount on a VMA is surely no slower than acquiring an
> uncontended rwsem for read.  mmap() and munmap() will get slower, but is
> that a problem?

Yes, it is, especially for short-lived processes. See a kernel build as a
workload.

-- 
 Kirill A. Shutemov



* Re: Splitting the mmap_sem
  2019-12-12 15:46     ` Kirill A. Shutemov
@ 2019-12-13 14:33       ` Matthew Wilcox
  2019-12-13 18:06         ` Kirill A. Shutemov
  0 siblings, 1 reply; 31+ messages in thread
From: Matthew Wilcox @ 2019-12-13 14:33 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: linux-mm

On Thu, Dec 12, 2019 at 06:46:13PM +0300, Kirill A. Shutemov wrote:
> On Thu, Dec 12, 2019 at 07:40:02AM -0800, Matthew Wilcox wrote:
> > On Thu, Dec 12, 2019 at 05:24:57PM +0300, Kirill A. Shutemov wrote:
> > > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > > > My preferred solution to the mmap_sem scalability problem is to allow
> > > > VMAs to be looked up under the RCU read lock then take a per-VMA lock.
> > > > I've been focusing on the first half of this problem (looking up VMAs
> > > > in an RCU-safe data structure) and ignoring the second half (taking a
> > > > lock while holding the RCU lock).
> > > 
> > > Do you see this approach as regression-free for the uncontended case?
> > > I doubt it will not cause regressions for single-threaded applications...
> > 
> > Which part of the approach do you think will cause a regression?  The
> > maple tree is quicker to traverse than the rbtree (in our simulations).
> > Incrementing a refcount on a VMA is surely no slower than acquiring an
> > uncontended rwsem for read.  mmap() and munmap() will get slower, but is
> > that a problem?
> 
> > Yes, it is, especially for short-lived processes. See a kernel build as a
> > workload.

Ah.  Well, we can skip the synchronize_rcu() step if the mm_struct has zero
or one mm_users.  That should avoid a slowdown for mmap/munmap.
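A minimal userspace model of that idea (not kernel code;
synchronize_rcu_model() is a stand-in counter, and the mm_users threshold
is exactly the one suggested above):

```c
/* Userspace model (not kernel code): only pay for a grace period on VMA
 * free when other mm users could be walking the tree locklessly. */
static int grace_periods;	/* counts modeled synchronize_rcu() calls */

static void synchronize_rcu_model(void)
{
	grace_periods++;
}

/* Free path for old VMAs replaced by mmap(MAP_FIXED) or munmap(). */
static void free_old_vmas(int mm_users)
{
	if (mm_users > 1)		/* concurrent lockless walkers possible */
		synchronize_rcu_model();	/* wait before freeing */
	/* ... actually free the old VMAs here ... */
}
```

With zero or one mm_users, no other thread can be inside an RCU walk of
this mm's tree, so the grace-period wait is skipped and short-lived
single-threaded processes see no mmap/munmap slowdown.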



* Re: Splitting the mmap_sem
  2019-12-13 14:33       ` Matthew Wilcox
@ 2019-12-13 18:06         ` Kirill A. Shutemov
  2019-12-13 18:21           ` Matthew Wilcox
  0 siblings, 1 reply; 31+ messages in thread
From: Kirill A. Shutemov @ 2019-12-13 18:06 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm

On Fri, Dec 13, 2019 at 06:33:33AM -0800, Matthew Wilcox wrote:
> On Thu, Dec 12, 2019 at 06:46:13PM +0300, Kirill A. Shutemov wrote:
> > On Thu, Dec 12, 2019 at 07:40:02AM -0800, Matthew Wilcox wrote:
> > > On Thu, Dec 12, 2019 at 05:24:57PM +0300, Kirill A. Shutemov wrote:
> > > > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > > > > My preferred solution to the mmap_sem scalability problem is to allow
> > > > > VMAs to be looked up under the RCU read lock then take a per-VMA lock.
> > > > > I've been focusing on the first half of this problem (looking up VMAs
> > > > > in an RCU-safe data structure) and ignoring the second half (taking a
> > > > > lock while holding the RCU lock).
> > > > 
> > > > Do you see this approach as regression-free for the uncontended case?
> > > > I doubt it will not cause regressions for single-threaded applications...
> > > 
> > > Which part of the approach do you think will cause a regression?  The
> > > maple tree is quicker to traverse than the rbtree (in our simulations).
> > > Incrementing a refcount on a VMA is surely no slower than acquiring an
> > > uncontended rwsem for read.  mmap() and munmap() will get slower, but is
> > > that a problem?
> > 
> > Yes, it is, especially for short-lived processes. See a kernel build as a
> > workload.
> 
> Ah.  Well, we can skip the synchronize_rcu() step if the mm_struct has zero
> or one mm_users.  That should avoid a slowdown for mmap/munmap.

It may work. But I'm not sure how it will work with remote mm accesses,
like the /proc/ interfaces or ptrace.

-- 
 Kirill A. Shutemov



* Re: Splitting the mmap_sem
  2019-12-13 18:06         ` Kirill A. Shutemov
@ 2019-12-13 18:21           ` Matthew Wilcox
  0 siblings, 0 replies; 31+ messages in thread
From: Matthew Wilcox @ 2019-12-13 18:21 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: linux-mm

On Fri, Dec 13, 2019 at 09:06:26PM +0300, Kirill A. Shutemov wrote:
> On Fri, Dec 13, 2019 at 06:33:33AM -0800, Matthew Wilcox wrote:
> > Ah.  Well, we can skip the synchronize_rcu() step if the mm_struct has zero
> > or one mm_users.  That should avoid a slowdown for mmap/munmap.
> 
> I may work. But I'm not sure how it will work with remote mm accesses.
> Like with /proc/ interfaces or ptrace.

Those users call mmget() which increases mm_users.  See m_start() in
fs/proc/task_mmu.c.



* Re: Splitting the mmap_sem
  2019-12-12 15:40   ` Matthew Wilcox
  2019-12-12 15:46     ` Kirill A. Shutemov
@ 2020-01-06 22:09     ` Matthew Wilcox
  2020-01-07 12:34       ` Kirill A. Shutemov
  1 sibling, 1 reply; 31+ messages in thread
From: Matthew Wilcox @ 2020-01-06 22:09 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: linux-mm

On Thu, Dec 12, 2019 at 07:40:02AM -0800, Matthew Wilcox wrote:
> > > We currently only have one ->map_pages() callback, and it's
> > > filemap_map_pages().  It only needs to sleep in one place -- to allocate
> > > a PTE table.  I think that can be allocated ahead of time if needed.
> > 
> > No, filemap_map_pages() doesn't sleep. It cannot. Whole body of the
> > function is under rcu_read_lock(). It uses pre-allocated page table.
> > See do_fault_around().
> 
> Oh, thank you!  That makes the ->map_pages() optimisation already workable
> with no changes.

I've been thinking about this some more, and we have a bit of a tough time
allocating page table entries while holding the RCU read lock.  There are
no GFP flags to the p??_alloc() functions, so we can't specify GFP_NOWAIT.

Option 1: Add 'prealloc_pmd' and 'prealloc_pud' to the vm_fault (to go
with prealloc_pte).  Allocate them before taking the RCU lock to walk
the VMA tree.  This will be a bit of reordering as we currently take
the mmap_sem, walk the VMA tree, then walk the page tables once we know
we have a good VMA.  I don't see a problem with doing that, but others
may differ.

Option 2: Add a memalloc_nowait_save/restore API to go along
with nofs and noio.  That way, we can take the RCU read lock, call
memalloc_nowait_save(), and walk the VMA tree and the page tables in
the current order.  There's an increased chance of memory allocation of
page tables failing, so we'll have to risk that and do a retry with the
reference count held on the VMA if we need to sleep to allocate memory.

Option 3: Variant of 2 where we add GFP flags to the p??_alloc()
functions.

Option 4: Variant of 2 where we make taking the RCU read lock magically
set the nowait bit, or we have the page allocator check the RCU preempt
depth.  I don't particularly like this one, particularly since the
preempt depth is not knowable in most kernel configurations.

Other thoughts on this?
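Option 2 could look roughly like the following userspace model
(memalloc_nowait_save()/restore() does not exist at this point; the names
mirror the proposed API, and the thread-local flag stands in for a PF_*
bit in current->flags):

```c
/* Userspace model (not kernel code) of a memalloc_nowait scope in the
 * style of memalloc_nofs_save()/memalloc_noio_save(). */
#include <stdbool.h>

static __thread bool nowait_scope;	/* models a PF_* bit on the task */

/* Enter a nowait scope; returns the previous state to allow nesting. */
static bool memalloc_nowait_save(void)
{
	bool old = nowait_scope;

	nowait_scope = true;
	return old;
}

static void memalloc_nowait_restore(bool old)
{
	nowait_scope = old;
}

/* What a p??_alloc() helper would consult instead of a new GFP argument:
 * inside the scope it must not sleep, so it may fail and force a retry
 * with a refcount held on the VMA. */
static bool alloc_may_sleep(void)
{
	return !nowait_scope;
}
```

The save/restore pairing keeps nested scopes correct, matching the
semantics of the existing nofs/noio scoping helpers.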



* Re: Splitting the mmap_sem
  2020-01-06 22:09     ` Matthew Wilcox
@ 2020-01-07 12:34       ` Kirill A. Shutemov
  2020-01-07 13:54         ` Matthew Wilcox
  0 siblings, 1 reply; 31+ messages in thread
From: Kirill A. Shutemov @ 2020-01-07 12:34 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm

On Mon, Jan 06, 2020 at 02:09:10PM -0800, Matthew Wilcox wrote:
> On Thu, Dec 12, 2019 at 07:40:02AM -0800, Matthew Wilcox wrote:
> > > > We currently only have one ->map_pages() callback, and it's
> > > > filemap_map_pages().  It only needs to sleep in one place -- to allocate
> > > > a PTE table.  I think that can be allocated ahead of time if needed.
> > > 
> > > No, filemap_map_pages() doesn't sleep. It cannot. Whole body of the
> > > function is under rcu_read_lock(). It uses pre-allocated page table.
> > > See do_fault_around().
> > 
> > Oh, thank you!  That makes the ->map_pages() optimisation already workable
> > with no changes.
> 
> I've been thinking about this some more, and we have a bit of a tough time
> allocating page table entries while holding the RCU read lock.  There's
> no GFP flags to the p??_alloc() functions, so we can't specify GFP_NOWAIT.
> 
> Option 1: Add 'prealloc_pmd' and 'prealloc_pud' to the vm_fault (to go
> with prealloc_pte).  Allocate them before taking the RCU lock to walk
> the VMA tree.  This will be a bit of reordering as we currently take
> the mmap_sem, walk the VMA tree, then walk the page tables once we know
> we have a good VMA.  I don't see a problem with doing that, but others
> may differ.

I expect preallocating all these page tables just-in-case would have a
measurable performance impact. The current code only preallocates a PTE
page table if it sees pmd_none().

We could first check whether this branch of the tree is present. But I'm
not sure how efficient that can be. And we still need to protect against
these page tables being freed from under us.

> Option 2: Add a memalloc_nowait_save/restore API to go along
> with nofs and noio.  That way, we can take the RCU read lock, call
> memalloc_nowait_save(), and walk the VMA tree and the page tables in
> the current order.  There's an increased chance of memory allocation of
> page tables failing, so we'll have to risk that and do a retry with the
> reference count held on the VMA if we need to sleep to allocate memory.
> 
> Option 3: Variant of 2 where we add GFP flags to the p??_alloc()
> functions.

I think this is the most reasonable way. If we are low on memory, latency
is not at the top of our priorities.

> Option 4: Variant of 2 where we make taking the RCU read lock magically
> set the nowait bit, or we have the page allocator check the RCU preempt
> depth.  I don't particularly like this one, particularly since the
> preempt depth is not knowable in most kernel configurations.
> 
> Other thoughts on this?

-- 
 Kirill A. Shutemov



* Re: Splitting the mmap_sem
  2020-01-07 12:34       ` Kirill A. Shutemov
@ 2020-01-07 13:54         ` Matthew Wilcox
  2020-01-07 14:27           ` Kirill A. Shutemov
  0 siblings, 1 reply; 31+ messages in thread
From: Matthew Wilcox @ 2020-01-07 13:54 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: linux-mm

On Tue, Jan 07, 2020 at 03:34:15PM +0300, Kirill A. Shutemov wrote:
> On Mon, Jan 06, 2020 at 02:09:10PM -0800, Matthew Wilcox wrote:
> > On Thu, Dec 12, 2019 at 07:40:02AM -0800, Matthew Wilcox wrote:
> > > > > We currently only have one ->map_pages() callback, and it's
> > > > > filemap_map_pages().  It only needs to sleep in one place -- to allocate
> > > > > a PTE table.  I think that can be allocated ahead of time if needed.
> > > > 
> > > > No, filemap_map_pages() doesn't sleep. It cannot. Whole body of the
> > > > function is under rcu_read_lock(). It uses pre-allocated page table.
> > > > See do_fault_around().
> > > 
> > > Oh, thank you!  That makes the ->map_pages() optimisation already workable
> > > with no changes.
> > 
> > I've been thinking about this some more, and we have a bit of a tough time
> > allocating page table entries while holding the RCU read lock.  There's
> > no GFP flags to the p??_alloc() functions, so we can't specify GFP_NOWAIT.
> > 
> > Option 1: Add 'prealloc_pmd' and 'prealloc_pud' to the vm_fault (to go
> > with prealloc_pte).  Allocate them before taking the RCU lock to walk
> > the VMA tree.  This will be a bit of reordering as we currently take
> > the mmap_sem, walk the VMA tree, then walk the page tables once we know
> > we have a good VMA.  I don't see a problem with doing that, but others
> > may differ.
> 
> I expect preallocating all these page tables just-in-case would have
> measuable performance impact. Current code only preallocates PTE page
> table if sees pmd_none().

Right, and we'd only preallocate those tables if we see pud_none() and
p4d_none().  Which is why we'd need to walk the page tables before
walking the VMA tree.
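That refinement of Option 1 - walk the page-table levels first and
preallocate only the ones found missing - might be sketched like this
(a userspace model with invented names, not kernel code; note the caveat
elsewhere in the thread that real page tables can also be freed from
under us, which this model ignores):

```c
/* Userspace model (not kernel code): preallocate tables only for the
 * levels the walk found empty, before entering the non-sleeping section. */
#include <stdbool.h>
#include <stdlib.h>

enum { P4D, PUD, PMD, PTE, NR_LEVELS };

struct walk {
	void *table[NR_LEVELS];		/* NULL models p?d_none() */
	void *prealloc[NR_LEVELS];	/* filled only for missing levels */
};

/* May sleep: allocate a table for each level the walk found missing. */
static int prealloc_missing(struct walk *w)
{
	for (int lvl = 0; lvl < NR_LEVELS; lvl++) {
		if (!w->table[lvl]) {
			w->prealloc[lvl] = malloc(64);
			if (!w->prealloc[lvl])
				return -1;
		}
	}
	return 0;
}

static int count_preallocated(const struct walk *w)
{
	int n = 0;

	for (int lvl = 0; lvl < NR_LEVELS; lvl++)
		if (w->prealloc[lvl])
			n++;
	return n;
}
```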

> We may first check if this branch of the tree is present. But I'm not sure
> how efficient it can be. And we still need to protect from freeing
> these page tables from under us.

Do we free page tables once they've been inserted, other than at
process exit?

> > Option 2: Add a memalloc_nowait_save/restore API to go along
> > with nofs and noio.  That way, we can take the RCU read lock, call
> > memalloc_nowait_save(), and walk the VMA tree and the page tables in
> > the current order.  There's an increased chance of memory allocation of
> > page tables failing, so we'll have to risk that and do a retry with the
> > reference count held on the VMA if we need to sleep to allocate memory.
> > 
> > Option 3: Variant of 2 where we add GFP flags to the p??_alloc()
> > functions.
> 
> I think this is the most reasonable way. If we are low of memory, latency
> is not on the top of priorities.

You prefer adding GFP flags over adding a memalloc_nowait_save()?




* Re: Splitting the mmap_sem
  2020-01-07 13:54         ` Matthew Wilcox
@ 2020-01-07 14:27           ` Kirill A. Shutemov
  2020-01-09 13:56             ` Vlastimil Babka
  0 siblings, 1 reply; 31+ messages in thread
From: Kirill A. Shutemov @ 2020-01-07 14:27 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm

On Tue, Jan 07, 2020 at 05:54:59AM -0800, Matthew Wilcox wrote:
> On Tue, Jan 07, 2020 at 03:34:15PM +0300, Kirill A. Shutemov wrote:
> > On Mon, Jan 06, 2020 at 02:09:10PM -0800, Matthew Wilcox wrote:
> > > On Thu, Dec 12, 2019 at 07:40:02AM -0800, Matthew Wilcox wrote:
> > > > > > We currently only have one ->map_pages() callback, and it's
> > > > > > filemap_map_pages().  It only needs to sleep in one place -- to allocate
> > > > > > a PTE table.  I think that can be allocated ahead of time if needed.
> > > > > 
> > > > > No, filemap_map_pages() doesn't sleep. It cannot. Whole body of the
> > > > > function is under rcu_read_lock(). It uses pre-allocated page table.
> > > > > See do_fault_around().
> > > > 
> > > > Oh, thank you!  That makes the ->map_pages() optimisation already workable
> > > > with no changes.
> > > 
> > > I've been thinking about this some more, and we have a bit of a tough time
> > > allocating page table entries while holding the RCU read lock.  There's
> > > no GFP flags to the p??_alloc() functions, so we can't specify GFP_NOWAIT.
> > > 
> > > Option 1: Add 'prealloc_pmd' and 'prealloc_pud' to the vm_fault (to go
> > > with prealloc_pte).  Allocate them before taking the RCU lock to walk
> > > the VMA tree.  This will be a bit of reordering as we currently take
> > > the mmap_sem, walk the VMA tree, then walk the page tables once we know
> > > we have a good VMA.  I don't see a problem with doing that, but others
> > > may differ.
> > 
> > I expect preallocating all these page tables just-in-case would have
> > measuable performance impact. Current code only preallocates PTE page
> > table if sees pmd_none().
> 
> Right, and we'd only preallocate those tables if we see pud_none() and
> p4d_none().  Which is why we'd need to walk the page tables before
> walking the VMA tree.
> 
> > We may first check if this branch of the tree is present. But I'm not sure
> > how efficient it can be. And we still need to protect from freeing
> > these page tables from under us.
> 
> Do we free page tables once they've been inserted, other than at
> process exit?

Yes. munmap() would free them. See free_pgtables() in unmap_region().

> > > Option 2: Add a memalloc_nowait_save/restore API to go along
> > > with nofs and noio.  That way, we can take the RCU read lock, call
> > > memalloc_nowait_save(), and walk the VMA tree and the page tables in
> > > the current order.  There's an increased chance of memory allocation of
> > > page tables failing, so we'll have to risk that and do a retry with the
> > > reference count held on the VMA if we need to sleep to allocate memory.
> > > 
> > > Option 3: Variant of 2 where we add GFP flags to the p??_alloc()
> > > functions.
> > 
> > I think this is the most reasonable way. If we are low of memory, latency
> > is not on the top of priorities.
> 
> You prefer adding GFP flags over adding a memalloc_nowait_save()?

I don't have a strong preference here.

-- 
 Kirill A. Shutemov



* Re: Splitting the mmap_sem
  2020-01-07 14:27           ` Kirill A. Shutemov
@ 2020-01-09 13:56             ` Vlastimil Babka
  2020-01-09 17:03               ` Michal Hocko
  0 siblings, 1 reply; 31+ messages in thread
From: Vlastimil Babka @ 2020-01-09 13:56 UTC (permalink / raw)
  To: Kirill A. Shutemov, Matthew Wilcox; +Cc: linux-mm, Michal Hocko, Peter Zijlstra

+CC

On 1/7/20 3:27 PM, Kirill A. Shutemov wrote:
>>>> Option 2: Add a memalloc_nowait_save/restore API to go along
>>>> with nofs and noio.  That way, we can take the RCU read lock, call
>>>> memalloc_nowait_save(), and walk the VMA tree and the page tables in
>>>> the current order.  There's an increased chance of memory allocation of
>>>> page tables failing, so we'll have to risk that and do a retry with the
>>>> reference count held on the VMA if we need to sleep to allocate memory.
>>>>
>>>> Option 3: Variant of 2 where we add GFP flags to the p??_alloc()
>>>> functions.
>>>
>>> I think this is the most reasonable way. If we are low of memory, latency
>>> is not on the top of priorities.
>>
>> You prefer adding GFP flags over adding a memalloc_nowait_save()?
> 
> I don't have strong preference here.

I think it's generally a good idea to have the memalloc_*() scoping.
However, these use the PF_* flags, where it now seems 30 out of 32 bits
are used, and MEMALLOC* already occupies 3 of those.



* Re: Splitting the mmap_sem
  2020-01-09 13:56             ` Vlastimil Babka
@ 2020-01-09 17:03               ` Michal Hocko
  2020-01-09 17:07                 ` Michal Hocko
  0 siblings, 1 reply; 31+ messages in thread
From: Michal Hocko @ 2020-01-09 17:03 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Matthew Wilcox, linux-mm, Peter Zijlstra

On Thu 09-01-20 14:56:48, Vlastimil Babka wrote:
> +CC
> 
> On 1/7/20 3:27 PM, Kirill A. Shutemov wrote:
> >>>> Option 2: Add a memalloc_nowait_save/restore API to go along
> >>>> with nofs and noio.  That way, we can take the RCU read lock, call
> >>>> memalloc_nowait_save(), and walk the VMA tree and the page tables in
> >>>> the current order.  There's an increased chance of memory allocation of
> >>>> page tables failing, so we'll have to risk that and do a retry with the
> >>>> reference count held on the VMA if we need to sleep to allocate memory.
> >>>>
> >>>> Option 3: Variant of 2 where we add GFP flags to the p??_alloc()
> >>>> functions.
> >>>
> >>> I think this is the most reasonable way. If we are low of memory, latency
> >>> is not on the top of priorities.
> >>
> >> You prefer adding GFP flags over adding a memalloc_nowait_save()?
> > 
> > I don't have strong preference here.
> 
> I think it's generally a good idea to have the memalloc_() scoping.

I also think there are going to be other usecases where NOWAIT scoping
would be interesting. I am not sure this is the case for the page tables
allocation though. I might misremember but RCU based VMA handling has
been considered in the past. I do not remember details but there were
some problems and page tables allocation is not the biggest one.

> However, we use the PF_* flags where it now seems 30 out of 32 are used,
> and MEMALLOC* already occupies 3 of those.

This shouldn't be a big deal. Adding a single int to the task_struct
shouldn't be terribly hard.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Splitting the mmap_sem
  2020-01-09 17:03               ` Michal Hocko
@ 2020-01-09 17:07                 ` Michal Hocko
  2020-01-09 17:32                   ` SeongJae Park
  0 siblings, 1 reply; 31+ messages in thread
From: Michal Hocko @ 2020-01-09 17:07 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Matthew Wilcox, linux-mm, Peter Zijlstra

On Thu 09-01-20 18:03:25, Michal Hocko wrote:
> I might misremember but RCU based VMA handling has
> been considered in the past. I do not remember details but there were
> some problems and page tables allocation is not the biggest one.

I have found https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf in my
notes. I managed to forget everything but maybe it will be useful for a
reference.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: Splitting the mmap_sem
  2020-01-09 17:07                 ` Michal Hocko
@ 2020-01-09 17:32                   ` SeongJae Park
  2020-01-09 20:13                     ` Matthew Wilcox
  0 siblings, 1 reply; 31+ messages in thread
From: SeongJae Park @ 2020-01-09 17:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Kirill A. Shutemov, Matthew Wilcox, linux-mm,
	Peter Zijlstra

On Thu, 9 Jan 2020 18:07:15 +0100 Michal Hocko <mhocko@suse.com> wrote:

> On Thu 09-01-20 18:03:25, Michal Hocko wrote:
> > I might misremember but RCU based VMA handling has
> > been considered in the past. I do not remember details but there were
> > some problems and page tables allocation is not the biggest one.
> 
> I have found https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf in my
> notes. I managed to forget everything but maybe it will be useful for a
> reference.

The subsequent work from the authors
(https://people.csail.mit.edu/nickolai/papers/clements-radixvm-2014-08-05.pdf)
might also be useful for understanding the limitations found in that work.

I have also forgotten many details, but as far as I remember, the biggest
problem with the rcuvm was the update-side scalability limitation that
results from the single updater lock and the TLB invalidations.  I had also
internally implemented another RCU-based VM that used fine-grained
update-side synchronization.  The write-side performance of my version was
therefore much improved, but with heavily write-intensive workloads it
still lost performance in the end due to the TLB flush overhead.

Page table allocations didn't bother me at that time.


Thanks,
SeongJae Park

> -- 
> Michal Hocko
> SUSE Labs
> 
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: Splitting the mmap_sem
  2020-01-09 17:32                   ` SeongJae Park
@ 2020-01-09 20:13                     ` Matthew Wilcox
  2020-02-06 13:59                       ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Matthew Wilcox @ 2020-01-09 20:13 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Michal Hocko, Vlastimil Babka, Kirill A. Shutemov, linux-mm,
	Peter Zijlstra

On Thu, Jan 09, 2020 at 06:32:06PM +0100, SeongJae Park wrote:
> On Thu, 9 Jan 2020 18:07:15 +0100 Michal Hocko <mhocko@suse.com> wrote:
> 
> > On Thu 09-01-20 18:03:25, Michal Hocko wrote:
> > > I might misremember but RCU based VMA handling has
> > > been considered in the past. I do not remember details but there were
> > > some problems and page tables allocation is not the biggest one.
> > 
> > I have found https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf in my
> > notes. I managed to forget everything but maybe it will be useful for a
> > reference.
> 
> The subsequent work from the authors
> (https://people.csail.mit.edu/nickolai/papers/clements-radixvm-2014-08-05.pdf)
> might also be useful for understanding the limitations found in that work.

Thanks for both those references.

> I have also forgotten many details, but as far as I remember, the biggest
> problem with the rcuvm was the update-side scalability limitation that
> results from the single updater lock and the TLB invalidations.  I had also
> internally implemented another RCU-based VM that used fine-grained
> update-side synchronization.  The write-side performance of my version was
> therefore much improved, but with heavily write-intensive workloads it
> still lost performance in the end due to the TLB flush overhead.
> 
> Page table allocations didn't bother me at that time.

As far as I can tell, both these implementations work by using RCU to
look up a VMA, taking a reference count on the VMA and dropping the RCU
read lock before walking the page tables.  Sleeping to allocate page
tables will be fine as the reference count prevents the VMA from going
away.

One of the use cases that we're concerned about involves a high
percentage of page faults on a single large (terabytes) VMA (and a
highly multithreaded process).  Moving the contention from a rwsem
in the mm_struct to a refcount in the VMA will not help performance
substantially for this user.

The proposal consists of three phases.  In phase 1, we convert the
rbtree to the maple tree, and leave the locking alone.  In phase 2,
we change the locking to a per-VMA refcount, looked up under RCU.

This problem arises during phase 3 where we attempt to handle page
faults entirely under the RCU read lock.  If we encounter problems,
we can fall back to acquiring the VMA refcount, but we need the
page allocation to fail rather than sleep (or magically drop the
RCU lock and return an indication that it has done so, but that
doesn't seem to be an approach that would find any favour).


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: Splitting the mmap_sem
  2020-01-09 20:13                     ` Matthew Wilcox
@ 2020-02-06 13:59                       ` Peter Zijlstra
  2020-02-06 20:15                         ` Matthew Wilcox
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2020-02-06 13:59 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: SeongJae Park, Michal Hocko, Vlastimil Babka, Kirill A. Shutemov,
	linux-mm

On Thu, Jan 09, 2020 at 12:13:20PM -0800, Matthew Wilcox wrote:
> One of the use cases that we're concerned about involves a high
> percentage of page faults on a single large (terabytes) VMA (and a
> highly multithreaded process).  Moving the contention from a rwsem
> in the mm_struct to a refcount in the VMA will not help performance
> substantially for this user.

This is why I never believed in the VMA-refcount approach, and why my
patches used SRCU.

> The proposal consists of three phases.  In phase 1, we convert the
> rbtree to the maple tree, and leave the locking alone.  In phase 2,
> we change the locking to a per-VMA refcount, looked up under RCU.
> 
> This problem arises during phase 3 where we attempt to handle page
> faults entirely under the RCU read lock.  If we encounter problems,
> we can fall back to acquiring the VMA refcount, but we need the
> page allocation to fail rather than sleep (or magically drop the
> RCU lock and return an indication that it has done so, but that
> doesn't seem to be an approach that would find any favour).

So why not use SRCU? You can do full blocking faults under SRCU and
don't need no 'stinkin' refcounts ;-)


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: Splitting the mmap_sem
  2020-02-06 13:59                       ` Peter Zijlstra
@ 2020-02-06 20:15                         ` Matthew Wilcox
  2020-02-06 20:55                           ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Matthew Wilcox @ 2020-02-06 20:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: SeongJae Park, Michal Hocko, Vlastimil Babka, Kirill A. Shutemov,
	linux-mm

On Thu, Feb 06, 2020 at 02:59:20PM +0100, Peter Zijlstra wrote:
> > The proposal consists of three phases.  In phase 1, we convert the
> > rbtree to the maple tree, and leave the locking alone.  In phase 2,
> > we change the locking to a per-VMA refcount, looked up under RCU.
> > 
> > This problem arises during phase 3 where we attempt to handle page
> > faults entirely under the RCU read lock.  If we encounter problems,
> > we can fall back to acquiring the VMA refcount, but we need the
> > page allocation to fail rather than sleep (or magically drop the
> > RCU lock and return an indication that it has done so, but that
> > doesn't seem to be an approach that would find any favour).
> 
> So why not use SRCU? You can do full blocking faults under SRCU and
> don't need no 'stinkin' refcounts ;-)

I have to say, SRCU is not in my mental toolbox of "how to solve a
problem", so it simply hadn't occurred to me.  Thanks.

So, we'd DEFINE_SRCU(vma_srcu); in mm/memory.c

then, at the beginning of a page fault call srcu_read_lock(&vma_srcu);
walk the tree as we do now, allocate memory for PTEs, sleep waiting for
pages to arrive back from disc, etc, etc, then at the end of the fault,
call srcu_read_unlock(&vma_srcu).  munmap() would consist of removing the
VMA from the tree, then calling synchronize_srcu() to wait for all faults
to finish, then putting the backing file, etc, etc and freeing the VMA.

This seems pretty reasonable, and investigation could actually proceed
before the Maple tree work lands.  Today, that would be:

srcu_read_lock(&vmas_srcu);
down_read(&mm->mmap_sem);
find_vma(mm, address);
up_read(&mm->mmap_sem);
... rest of fault handler path ...
srcu_read_unlock(&vmas_srcu);

Kind of a pain because we still call find_vma() in the per-arch page
fault handler, but for prototyping, we'd only have to do one or two
architectures.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: Splitting the mmap_sem
  2020-02-06 20:15                         ` Matthew Wilcox
@ 2020-02-06 20:55                           ` Peter Zijlstra
  2020-02-06 21:20                             ` Matthew Wilcox
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2020-02-06 20:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: SeongJae Park, Michal Hocko, Vlastimil Babka, Kirill A. Shutemov,
	linux-mm

On Thu, Feb 06, 2020 at 12:15:36PM -0800, Matthew Wilcox wrote:
> On Thu, Feb 06, 2020 at 02:59:20PM +0100, Peter Zijlstra wrote:
> > > The proposal consists of three phases.  In phase 1, we convert the
> > > rbtree to the maple tree, and leave the locking alone.  In phase 2,
> > > we change the locking to a per-VMA refcount, looked up under RCU.
> > > 
> > > This problem arises during phase 3 where we attempt to handle page
> > > faults entirely under the RCU read lock.  If we encounter problems,
> > > we can fall back to acquiring the VMA refcount, but we need the
> > > page allocation to fail rather than sleep (or magically drop the
> > > RCU lock and return an indication that it has done so, but that
> > > doesn't seem to be an approach that would find any favour).
> > 
> > So why not use SRCU? You can do full blocking faults under SRCU and
> > don't need no 'stinkin' refcounts ;-)
> 
> I have to say, SRCU is not in my mental toolbox of "how to solve a
> problem", so it simply hadn't occurred to me.  Thanks.
> 
> So, we'd DEFINE_SRCU(vma_srcu); in mm/memory.c
> 
> then, at the beginning of a page fault call srcu_read_lock(&vma_srcu);
> walk the tree as we do now, allocate memory for PTEs, sleep waiting for
> pages to arrive back from disc, etc, etc, then at the end of the fault,
> call srcu_read_unlock(&vma_srcu). 

So far so good,...

> munmap() would consist of removing the
> VMA from the tree, then calling synchronize_srcu() to wait for all faults
> to finish, then putting the backing file, etc, etc and freeing the VMA.

call_srcu(), and the (s)rcu callback will then do the fput() and the
rest of the teardown.

synchronize_srcu() (like synchronize_rcu()) is stupid slow and would
make munmap()/exit()/etc.. unusable.

> This seems pretty reasonable, and investigation could actually proceed
> before the Maple tree work lands.  Today, that would be:
> 
> srcu_read_lock(&vmas_srcu);
> down_read(&mm->mmap_sem);
> find_vma(mm, address);
> up_read(&mm->mmap_sem);
> ... rest of fault handler path ...
> srcu_read_unlock(&vmas_srcu);
> 
> Kind of a pain because we still call find_vma() in the per-arch page
> fault handler, but for prototyping, we'd only have to do one or two
> architectures.

If you look at the earlier speculative page-fault patches by Laurent,
which were based on my still earlier patches, you'll find most of this
there.

The tricky bit was validating everything on the second page-table walk,
to see if nothing had fundamentally changed, specifically the VMA,
before installing the PTE. If you do this without mmap_sem, you need to
hold ptlock to pin stuff while validating everything you did earlier.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: Splitting the mmap_sem
  2020-02-06 20:55                           ` Peter Zijlstra
@ 2020-02-06 21:20                             ` Matthew Wilcox
  2020-02-07  8:52                               ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Matthew Wilcox @ 2020-02-06 21:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: SeongJae Park, Michal Hocko, Vlastimil Babka, Kirill A. Shutemov,
	linux-mm

On Thu, Feb 06, 2020 at 09:55:29PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 06, 2020 at 12:15:36PM -0800, Matthew Wilcox wrote:
> > then, at the beginning of a page fault call srcu_read_lock(&vma_srcu);
> > walk the tree as we do now, allocate memory for PTEs, sleep waiting for
> > pages to arrive back from disc, etc, etc, then at the end of the fault,
> > call srcu_read_unlock(&vma_srcu). 
> 
> So far so good,...
> 
> > munmap() would consist of removing the
> > VMA from the tree, then calling synchronize_srcu() to wait for all faults
> > to finish, then putting the backing file, etc, etc and freeing the VMA.
> 
> call_srcu(), and the (s)rcu callback will then do the fput() and the
> rest of the teardown.
> 
> synchronize_srcu() (like synchronize_rcu()) is stupid slow and would
> make munmap()/exit()/etc.. unusable.

I'll need to think about that a bit.  I was convinced we needed to wait
for the current pagefaults to finish before we could return from munmap().
I need to convince myself that it's OK to return to userspace while the
page faults for that range are still proceeding on other CPUs.

> > This seems pretty reasonable, and investigation could actually proceed
> > before the Maple tree work lands.  Today, that would be:
> > 
> > srcu_read_lock(&vmas_srcu);
> > down_read(&mm->mmap_sem);
> > find_vma(mm, address);
> > up_read(&mm->mmap_sem);
> > ... rest of fault handler path ...
> > srcu_read_unlock(&vmas_srcu);
> > 
> > Kind of a pain because we still call find_vma() in the per-arch page
> > fault handler, but for prototyping, we'd only have to do one or two
> > architectures.
> 
> If you look at the earlier speculative page-fault patches by Laurent,
> which were based on my still earlier patches, you'll find most of this
> there.
> 
> The tricky bit was validating everything on the second page-table walk,
> to see if nothing had fundamentally changed, specifically the VMA,
> before installing the PTE. If you do this without mmap_sem, you need to
> hold ptlock to pin stuff while validating everything you did earlier.

The patches Laurent posted used regular RCU and a per-VMA refcount, not
SRCU.  If you use SRCU, why would you need a second page table walk?


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: Splitting the mmap_sem
  2020-02-06 21:20                             ` Matthew Wilcox
@ 2020-02-07  8:52                               ` Peter Zijlstra
  2020-02-10 22:00                                 ` Matthew Wilcox
  2020-02-19 17:14                                 ` Laurent Dufour
  0 siblings, 2 replies; 31+ messages in thread
From: Peter Zijlstra @ 2020-02-07  8:52 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: SeongJae Park, Michal Hocko, Vlastimil Babka, Kirill A. Shutemov,
	linux-mm

On Thu, Feb 06, 2020 at 01:20:24PM -0800, Matthew Wilcox wrote:
> On Thu, Feb 06, 2020 at 09:55:29PM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 06, 2020 at 12:15:36PM -0800, Matthew Wilcox wrote:
> > > then, at the beginning of a page fault call srcu_read_lock(&vma_srcu);
> > > walk the tree as we do now, allocate memory for PTEs, sleep waiting for
> > > pages to arrive back from disc, etc, etc, then at the end of the fault,
> > > call srcu_read_unlock(&vma_srcu). 
> > 
> > So far so good,...
> > 
> > > munmap() would consist of removing the
> > > VMA from the tree, then calling synchronize_srcu() to wait for all faults
> > > to finish, then putting the backing file, etc, etc and freeing the VMA.
> > 
> > call_srcu(), and the (s)rcu callback will then do the fput() and the
> > rest of the teardown.
> > 
> > synchronize_srcu() (like synchronize_rcu()) is stupid slow and would
> > make munmap()/exit()/etc.. unusable.
> 
> I'll need to think about that a bit.  I was convinced we needed to wait
> for the current pagefaults to finish before we could return from munmap().
> I need to convince myself that it's OK to return to userspace while the
> page faults for that range are still proceeding on other CPUs.

File I/O might be in progress; any actual faults will result in SIGSEGV
instead of installing a PTE.

It is not fundamentally different from any threaded uaf race.

> > > This seems pretty reasonable, and investigation could actually proceed
> > > before the Maple tree work lands.  Today, that would be:
> > > 
> > > srcu_read_lock(&vmas_srcu);
> > > down_read(&mm->mmap_sem);
> > > find_vma(mm, address);
> > > up_read(&mm->mmap_sem);
> > > ... rest of fault handler path ...
> > > srcu_read_unlock(&vmas_srcu);
> > > 
> > > Kind of a pain because we still call find_vma() in the per-arch page
> > > fault handler, but for prototyping, we'd only have to do one or two
> > > architectures.
> > 
> > If you look at the earlier speculative page-fault patches by Laurent,
> > which were based on my still earlier patches, you'll find most of this
> > there.
> > 
> > The tricky bit was validating everything on the second page-table walk,
> > to see if nothing had fundamentally changed, specifically the VMA,
> > before installing the PTE. If you do this without mmap_sem, you need to
> > hold ptlock to pin stuff while validating everything you did earlier.
> 
> The patches Laurent posted used regular RCU and a per-VMA refcount, not
> SRCU.

Those are his later patches, and I distinctly disagree with that
approach.

If you look at the patches here:

  https://lkml.kernel.org/r/cover.1479465699.git.ldufour@linux.vnet.ibm.com

you'll find it uses SRCU.

> If you use SRCU, why would you need a second page table walk?

Because SRCU only ensures the VMA object remains extant, it does not
prevent modification of it, normally that guarantee is provided by
mmap_sem, but we're not going to use that.

Instead, what we serialize on is the (split) ptlock. So we do the first
page-walk and ptlock to verify the vma-lookup, then we drop ptlock and
do the file-io, then we page-walk and take ptlock again, verify the vma
(again) and install the PTE. If anything goes wrong, we bail.

See this patch:

  https://lkml.kernel.org/r/301fb863785f37c319b493bd0d43167353871804.1479465699.git.ldufour@linux.vnet.ibm.com


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: Splitting the mmap_sem
  2020-02-07  8:52                               ` Peter Zijlstra
@ 2020-02-10 22:00                                 ` Matthew Wilcox
  2020-02-19 17:14                                 ` Laurent Dufour
  1 sibling, 0 replies; 31+ messages in thread
From: Matthew Wilcox @ 2020-02-10 22:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: SeongJae Park, Michal Hocko, Vlastimil Babka, Kirill A. Shutemov,
	linux-mm

On Fri, Feb 07, 2020 at 09:52:34AM +0100, Peter Zijlstra wrote:
> On Thu, Feb 06, 2020 at 01:20:24PM -0800, Matthew Wilcox wrote:
> > If you use SRCU, why would you need a second page table walk?
> 
> Because SRCU only ensures the VMA object remains extant, it does not
> prevent modification of it, normally that guarantee is provided by
> mmap_sem, but we're not going to use that.
> 
> Instead, what we serialize on is the (split) ptlock. So we do the first
> page-walk and ptlock to verify the vma-lookup, then we drop ptlock and
> do the file-io, then we page-walk and take ptlock again, verify the vma
> (again) and install the PTE. If anything goes wrong, we bail.
> 
> See this patch:
> 
>   https://lkml.kernel.org/r/301fb863785f37c319b493bd0d43167353871804.1479465699.git.ldufour@linux.vnet.ibm.com

OK, this gets to a difference in how we see modification to a VMA working.
You take a seqcount in patch 5:
https://lore.kernel.org/linux-mm/95790e53bfcfb536eb8f1dcdf4750e7e8050d8f4.1479465699.git.ldufour@linux.vnet.ibm.com/

I was intending to postpone the page table flushing parts of a VMA
modification until a grace period had passed, but what you're doing
is faster.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Splitting the mmap_sem
  2020-02-07  8:52                               ` Peter Zijlstra
  2020-02-10 22:00                                 ` Matthew Wilcox
@ 2020-02-19 17:14                                 ` Laurent Dufour
  1 sibling, 0 replies; 31+ messages in thread
From: Laurent Dufour @ 2020-02-19 17:14 UTC (permalink / raw)
  To: Peter Zijlstra, Matthew Wilcox
  Cc: SeongJae Park, Michal Hocko, Vlastimil Babka, Kirill A. Shutemov,
	linux-mm

On 07/02/2020 at 09:52, Peter Zijlstra wrote:
> On Thu, Feb 06, 2020 at 01:20:24PM -0800, Matthew Wilcox wrote:
>> On Thu, Feb 06, 2020 at 09:55:29PM +0100, Peter Zijlstra wrote:
>>> On Thu, Feb 06, 2020 at 12:15:36PM -0800, Matthew Wilcox wrote:
>>>> then, at the beginning of a page fault call srcu_read_lock(&vma_srcu);
>>>> walk the tree as we do now, allocate memory for PTEs, sleep waiting for
>>>> pages to arrive back from disc, etc, etc, then at the end of the fault,
>>>> call srcu_read_unlock(&vma_srcu).
>>>
>>> So far so good,...
>>>
>>>> munmap() would consist of removing the
>>>> VMA from the tree, then calling synchronize_srcu() to wait for all faults
>>>> to finish, then putting the backing file, etc, etc and freeing the VMA.
>>>
>>> call_srcu(), and the (s)rcu callback will then do the fput() and the
>>> rest of the teardown.
>>>
>>> synchronize_srcu() (like synchronize_rcu()) is stupid slow and would
>>> make munmap()/exit()/etc.. unusable.
>>
>> I'll need to think about that a bit.  I was convinced we needed to wait
>> for the current pagefaults to finish before we could return from munmap().
>> I need to convince myself that it's OK to return to userspace while the
>> page faults for that range are still proceeding on other CPUs.
> 
> File I/O might be in progress; any actual faults will result in SIGSEGV
> instead of installing a PTE.
> 
> It is not fundamentally different from any threaded uaf race.
> 
>>>> This seems pretty reasonable, and investigation could actually proceed
>>>> before the Maple tree work lands.  Today, that would be:
>>>>
>>>> srcu_read_lock(&vmas_srcu);
>>>> down_read(&mm->mmap_sem);
>>>> find_vma(mm, address);
>>>> up_read(&mm->mmap_sem);
>>>> ... rest of fault handler path ...
>>>> srcu_read_unlock(&vmas_srcu);
>>>>
>>>> Kind of a pain because we still call find_vma() in the per-arch page
>>>> fault handler, but for prototyping, we'd only have to do one or two
>>>> architectures.
>>>
>>> If you look at the earlier speculative page-fault patches by Laurent,
>>> which were based on my still earlier patches, you'll find most of this
>>> there.
>>>
>>> The tricky bit was validating everything on the second page-table walk,
>>> to see if nothing had fundamentally changed, specifically the VMA,
>>> before installing the PTE. If you do this without mmap_sem, you need to
>>> hold ptlock to pin stuff while validating everything you did earlier.
>>
>> The patches Laurent posted used regular RCU and a per-VMA refcount, not
>> SRCU.
> 
> Those are his later patches, and I distinctly disagree with that
> approach.
> 
> If you look at the patches here:
> 
>    https://lkml.kernel.org/r/cover.1479465699.git.ldufour@linux.vnet.ibm.com
> 
> you'll find it uses SRCU.

For the record, I switched from SRCU to RCU and a ref counter because, with 
SRCU, performance was hurt by the scheduling generated to handle the SRCU 
asynchronous events.

I may have missed something, but using RCU and a ref counter worked around 
this 20% overhead.

>> If you use SRCU, why would you need a second page table walk?
> 
> Because SRCU only ensures the VMA object remains extant, it does not
> prevent modification of it, normally that guarantee is provided by
> mmap_sem, but we're not going to use that.
> 
> Instead, what we serialize on is the (split) ptlock. So we do the first
> page-walk and ptlock to verify the vma-lookup, then we drop ptlock and
> do the file-io, then we page-walk and take ptlock again, verify the vma
> (again) and install the PTE. If anything goes wrong, we bail.
> 
> See this patch:
> 
>    https://lkml.kernel.org/r/301fb863785f37c319b493bd0d43167353871804.1479465699.git.ldufour@linux.vnet.ibm.com
> 



^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2020-02-19 17:14 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-03 22:21 Splitting the mmap_sem Matthew Wilcox
2019-12-05 17:21 ` Jerome Glisse
2019-12-06  5:13   ` Matthew Wilcox
2019-12-06 17:30     ` Jerome Glisse
2019-12-09  3:33       ` Matthew Wilcox
2019-12-09 14:17         ` Jerome Glisse
2019-12-10 15:26   ` Vlastimil Babka
2019-12-10 16:07     ` Jerome Glisse
2019-12-10 18:09       ` Vlastimil Babka
2019-12-12 14:24 ` Kirill A. Shutemov
2019-12-12 15:40   ` Matthew Wilcox
2019-12-12 15:46     ` Kirill A. Shutemov
2019-12-13 14:33       ` Matthew Wilcox
2019-12-13 18:06         ` Kirill A. Shutemov
2019-12-13 18:21           ` Matthew Wilcox
2020-01-06 22:09     ` Matthew Wilcox
2020-01-07 12:34       ` Kirill A. Shutemov
2020-01-07 13:54         ` Matthew Wilcox
2020-01-07 14:27           ` Kirill A. Shutemov
2020-01-09 13:56             ` Vlastimil Babka
2020-01-09 17:03               ` Michal Hocko
2020-01-09 17:07                 ` Michal Hocko
2020-01-09 17:32                   ` SeongJae Park
2020-01-09 20:13                     ` Matthew Wilcox
2020-02-06 13:59                       ` Peter Zijlstra
2020-02-06 20:15                         ` Matthew Wilcox
2020-02-06 20:55                           ` Peter Zijlstra
2020-02-06 21:20                             ` Matthew Wilcox
2020-02-07  8:52                               ` Peter Zijlstra
2020-02-10 22:00                                 ` Matthew Wilcox
2020-02-19 17:14                                 ` Laurent Dufour
