Linux-mm Archive on lore.kernel.org
* Splitting the mmap_sem
@ 2019-12-03 22:21 Matthew Wilcox
  2019-12-05 17:21 ` Jerome Glisse
  2019-12-12 14:24 ` Kirill A. Shutemov
  0 siblings, 2 replies; 15+ messages in thread
From: Matthew Wilcox @ 2019-12-03 22:21 UTC (permalink / raw)
  To: linux-mm


[My thanks to Vlastimil, Michel, Liam, David, Davidlohr and Hugh for
 their feedback on an earlier version of this.  I think the solution
 we discussed doesn't quite work, so here's one which I think does.
 See the last two paragraphs in particular.]

My preferred solution to the mmap_sem scalability problem is to allow
VMAs to be looked up under the RCU read lock then take a per-VMA lock.
I've been focusing on the first half of this problem (looking up VMAs
in an RCU-safe data structure) and ignoring the second half (taking a
lock while holding the RCU lock).

We can't take a semaphore while holding the RCU lock in case we have to
sleep -- the VMA might not exist any more when we wake up.  Making the
per-VMA lock a spinlock would be a massive change -- fault handlers are
currently called with the mmap_sem held and may sleep.  So I think we
need a per-VMA refcount.  That lets us sleep while handling a fault.
There are over 100 fault handlers in the kernel, and I don't want to
change the locking in all of them.
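
To make that concrete, here is a minimal sketch of the fault-side lookup
under the stated assumptions: an mtree_load()-style maple tree lookup and
a refcount_t (here called vm_refcnt) embedded in the VMA, neither of which
exists today:

        struct vm_area_struct *find_and_pin_vma(struct mm_struct *mm,
                                                 unsigned long addr)
        {
                struct vm_area_struct *vma;

                rcu_read_lock();
                vma = mtree_load(&mm->mm_mt, addr);     /* assumed API */
                /* A refcount already at zero means the VMA is being
                 * torn down; fail the lookup and let the caller retry. */
                if (vma && !refcount_inc_not_zero(&vma->vm_refcnt))
                        vma = NULL;
                rcu_read_unlock();

                /* On success the VMA is pinned, so the fault handler
                 * may sleep; the pin is dropped after the fault. */
                return vma;
        }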

That makes modifications to the tree a little tricky.  At the moment,
we take the rwsem for write which waits for all readers to finish, then
we modify the VMAs, then we allow readers back in.  With RCU, there is
no way to block readers, so different threads may (at the same time)
see both an old and a new VMA for the same virtual address.

So calling mmap() looks like this:

        1 allocate a new VMA
        2 update pointer(s) in maple tree
        3 sleep until old VMAs have a zero refcount
        4 synchronize_rcu()
        5 free old VMAs
        6 flush caches for affected range
        7 return to userspace
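
In kernel-flavoured pseudocode, following the step ordering above and
assuming an mtree_store_range()-style maple tree API plus the hypothetical
vm_refcnt and a wait queue for it, the write side might look like:

        new = vm_area_alloc(mm);                          /* 1: allocate */
        mtree_store_range(&mm->mm_mt, start, end - 1,
                          new, GFP_KERNEL);               /* 2: publish  */
        wait_event(vma_refcnt_wq,                         /* 3: drain    */
                   refcount_read(&old->vm_refcnt) == 0);
        synchronize_rcu();                                /* 4: grace    */
        vm_area_free(old);                                /* 5: free     */
        flush_tlb_mm(mm);    /* 6: flush (whole-mm here for brevity)     */
        /* 7: back to userspace */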

While one thread is calling mmap(MAP_FIXED), two other threads which are
accessing the same address may see different data from each other and
have different page translations in their respective CPU caches until
the thread calling mmap() returns.  I believe this is OK, but would
greatly appreciate hearing from people who know better.

Some people are concerned that a reference count on the VMA will lead to
contention moving from the mmap_sem to the refcount on a very large VMA
for workloads which have one giant VMA covering the entire working set.
For those workloads, I propose we use the existing ->map_pages() callback
(changed to return a vm_fault_t from the current void).

It will be called with the RCU lock held and no reference count on
the vma.  If it needs to sleep, it should bump the refcount, drop the
RCU lock, prepare enough so that the next call will not need to sleep,
then drop the refcount and return VM_FAULT_RETRY so the VM knows the
VMA is no longer good, and it needs to walk the VMA tree from the start.
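
In outline, a callback following that protocol might look like the sketch
below; vm_refcnt, vma_get()/vma_put() and the would-sleep check are
assumptions of this proposal, not existing API:

        static vm_fault_t foo_map_pages(struct vm_fault *vmf,
                        pgoff_t start_pgoff, pgoff_t end_pgoff)
        {
                struct vm_area_struct *vma = vmf->vma;

                /* Called under rcu_read_lock(), no reference held. */
                if (foo_needs_to_sleep(vma)) {
                        vma_get(vma);           /* bump vm_refcnt */
                        rcu_read_unlock();
                        foo_prepare(vma);       /* e.g. allocate a PTE table */
                        vma_put(vma);           /* drop vm_refcnt */
                        /* The VMA may be stale now; tell the caller to
                         * rewalk the tree from the start. */
                        return VM_FAULT_RETRY;
                }
                /* ... map the pages without sleeping ... */
                return 0;
        }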

We currently only have one ->map_pages() callback, and it's
filemap_map_pages().  It only needs to sleep in one place -- to allocate
a PTE table.  I think that can be allocated ahead of time if needed.




* Re: Splitting the mmap_sem
  2019-12-03 22:21 Splitting the mmap_sem Matthew Wilcox
@ 2019-12-05 17:21 ` Jerome Glisse
  2019-12-06  5:13   ` Matthew Wilcox
  2019-12-10 15:26   ` Vlastimil Babka
  2019-12-12 14:24 ` Kirill A. Shutemov
  1 sibling, 2 replies; 15+ messages in thread
From: Jerome Glisse @ 2019-12-05 17:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Laurent Dufour, David Rientjes, Vlastimil Babka,
	Hugh Dickins, Michel Lespinasse, Davidlohr Bueso

Adding a few interested people in Cc

On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> 
> [My thanks to Vlastimil, Michel, Liam, David, Davidlohr and Hugh for
>  their feedback on an earlier version of this.  I think the solution
>  we discussed doesn't quite work, so here's one which I think does.
>  See the last two paragraphs in particular.]
> 
> My preferred solution to the mmap_sem scalability problem is to allow
> VMAs to be looked up under the RCU read lock then take a per-VMA lock.
> I've been focusing on the first half of this problem (looking up VMAs
> in an RCU-safe data structure) and ignoring the second half (taking a
> lock while holding the RCU lock).
> 
> We can't take a semaphore while holding the RCU lock in case we have to
> sleep -- the VMA might not exist any more when we wake up.  Making the
> per-VMA lock a spinlock would be a massive change -- fault handlers are
> currently called with the mmap_sem held and may sleep.  So I think we
> need a per-VMA refcount.  That lets us sleep while handling a fault.
> There are over 100 fault handlers in the kernel, and I don't want to
> change the locking in all of them.
> 
> That makes modifications to the tree a little tricky.  At the moment,
> we take the rwsem for write which waits for all readers to finish, then
> we modify the VMAs, then we allow readers back in.  With RCU, there is
> no way to block readers, so different threads may (at the same time)
> see both an old and a new VMA for the same virtual address.
> 
> So calling mmap() looks like this:
> 
>       1 allocate a new VMA
>       2 update pointer(s) in maple tree
>       3 sleep until old VMAs have a zero refcount
>       4 synchronize_rcu()
>       5 free old VMAs
>       6 flush caches for affected range
>       7 return to userspace
> 
> While one thread is calling mmap(MAP_FIXED), two other threads which are
> accessing the same address may see different data from each other and
> have different page translations in their respective CPU caches until
> the thread calling mmap() returns.  I believe this is OK, but would
> greatly appreciate hearing from people who know better.

I do not believe this is OK; I believe it is wrong (not even considering
possible hardware issues that can arise from such aliasing).

That being said, I believe this can be solved "easily": when the new VMA
is added, you mark it as a newborn (VMA_BABY :)) and page faults will have
to wait on it, i.e. until the previous VMA is fully gone and flushed.  So
after step 6 (flush caches) you remove the VMA_BABY flag before returning
to userspace, and page faults can resume.

I would also mark the old VMA with a ZOMBIE flag so that any reader has a
chance to back off and retry.  To check for that, we should add a new
check to vmf_insert_page() (and similar) to avoid inserting a pfn into a
ZOMBIE vma.  Note that I am not sure what we want to do here: can an
application rely on rwsem serialization unknowingly, i.e. could it have
one thread taking a page fault on a range that is about to be unmapped by
another thread?  I am not sure this can happen today without a SEGFAULT,
thanks to serialization through the rwsem.

Anyway with BABY and ZOMBIE, it should behave mostly as it does today
(modulo concurrency).
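
A rough sketch of how the fault path might honour those flags; VM_BABY,
VM_ZOMBIE and the helpers here are hypothetical, not existing vm_flags
bits:

        /* In the fault path, under rcu_read_lock(): */
        if (vma->vm_flags & VM_ZOMBIE)
                return VM_FAULT_RETRY;  /* old mapping on its way out;
                                           back off and rewalk the tree */
        if (vma->vm_flags & VM_BABY) {
                vma_get(vma);           /* hypothetical refcount helpers */
                rcu_read_unlock();
                /* Wait until the mmap(MAP_FIXED) has finished step 6
                 * (cache/TLB flush) and cleared the flag. */
                wait_event(vma_baby_wq, !(vma->vm_flags & VM_BABY));
                vma_put(vma);
                return VM_FAULT_RETRY;
        }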

> 
> Some people are concerned that a reference count on the VMA will lead to
> contention moving from the mmap_sem to the refcount on a very large VMA
> for workloads which have one giant VMA covering the entire working set.
> For those workloads, I propose we use the existing ->map_pages() callback
> (changed to return a vm_fault_t from the current void).
> 
> It will be called with the RCU lock held and no reference count on
> the vma.  If it needs to sleep, it should bump the refcount, drop the
> RCU lock, prepare enough so that the next call will not need to sleep,
> then drop the refcount and return VM_FAULT_RETRY so the VM knows the
> VMA is no longer good, and it needs to walk the VMA tree from the start.

Just to make sure I understand: you propose that ->map_pages() becomes
a new ->fault() handler that gets called before ->fault() without a
refcount, so that we can update fs/drivers slowly to perform better in the
new scheme (i.e. avoid the overhead of refcounting if possible at all)?

The ->fault() callback would then be the "slow" path, which will require
a refcount on the vma (taken by core mm code before dropping the RCU lock).

Cheers,
Jérôme




* Re: Splitting the mmap_sem
  2019-12-05 17:21 ` Jerome Glisse
@ 2019-12-06  5:13   ` Matthew Wilcox
  2019-12-06 17:30     ` Jerome Glisse
  2019-12-10 15:26   ` Vlastimil Babka
  1 sibling, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2019-12-06  5:13 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, Laurent Dufour, David Rientjes, Vlastimil Babka,
	Hugh Dickins, Michel Lespinasse, Davidlohr Bueso

On Thu, Dec 05, 2019 at 12:21:50PM -0500, Jerome Glisse wrote:
> Adding few interested people in cc

I figured they all read linux-mm already ;-)

> On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > While one thread is calling mmap(MAP_FIXED), two other threads which are
> > accessing the same address may see different data from each other and
> > have different page translations in their respective CPU caches until
> > the thread calling mmap() returns.  I believe this is OK, but would
> > greatly appreciate hearing from people who know better.
> 
> I do not believe this is OK; I believe it is wrong (not even considering
> possible hardware issues that can arise from such aliasing).

Well, OK, but why do you believe it is wrong?  If thread A is executing
a load instruction at the same time that thread B is calling mmap(),
it really is indeterminate what value A loads.  It might be from before
the call to mmap() and it might be from after.  And if thread C is also
executing a load instruction at the same time, then it might already get
a different result from thread A.  And can threads A and C really tell
which of them executed the load instruction 'first'?  I think this is
all so indeterminate already that the (lack of) guarantees I outlined
above are acceptable.

But we should all agree on this, so _please_ continue to argue your case
for why you believe it to be wrong.

[snip proposed solution -- if the problem needs solving, we can argue
about how to solve it later]

> > Some people are concerned that a reference count on the VMA will lead to
> > contention moving from the mmap_sem to the refcount on a very large VMA
> > for workloads which have one giant VMA covering the entire working set.
> > For those workloads, I propose we use the existing ->map_pages() callback
> > (changed to return a vm_fault_t from the current void).
> > 
> > It will be called with the RCU lock held and no reference count on
> > the vma.  If it needs to sleep, it should bump the refcount, drop the
> > RCU lock, prepare enough so that the next call will not need to sleep,
> > then drop the refcount and return VM_FAULT_RETRY so the VM knows the
> > VMA is no longer good, and it needs to walk the VMA tree from the start.
> 
> Just to make sure I understand: you propose that ->map_pages() becomes
> a new ->fault() handler that gets called before ->fault() without a
> refcount, so that we can update fs/drivers slowly to perform better in the
> new scheme (i.e. avoid the overhead of refcounting if possible at all)?
> 
> The ->fault() callback would then be the "slow" path, which will require
> a refcount on the vma (taken by core mm code before dropping the RCU lock).

I would actually propose never updating most drivers.  There's just no
need for them to handle such an unstable and tricky situation as this.
Let's not make driver writers lives harder.

For the ones which need this kind of scalability (and let's be clear, they
would already have *better* scalability than today due to the rwsem being
split into a per-VMA refcount), then yes, implementing ->map_pages would
be the way to go.  Indeed, they would probably benefit from implementing
it today, since it will reduce the number of page faults.




* Re: Splitting the mmap_sem
  2019-12-06  5:13   ` Matthew Wilcox
@ 2019-12-06 17:30     ` Jerome Glisse
  2019-12-09  3:33       ` Matthew Wilcox
  0 siblings, 1 reply; 15+ messages in thread
From: Jerome Glisse @ 2019-12-06 17:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Laurent Dufour, David Rientjes, Vlastimil Babka,
	Hugh Dickins, Michel Lespinasse, Davidlohr Bueso

On Thu, Dec 05, 2019 at 09:13:22PM -0800, Matthew Wilcox wrote:
> On Thu, Dec 05, 2019 at 12:21:50PM -0500, Jerome Glisse wrote:
> > Adding few interested people in cc
> 
> I figured they all read linux-mm already ;-)
> 
> > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > > While one thread is calling mmap(MAP_FIXED), two other threads which are
> > > accessing the same address may see different data from each other and
> > > have different page translations in their respective CPU caches until
> > > the thread calling mmap() returns.  I believe this is OK, but would
> > > greatly appreciate hearing from people who know better.
> > 
> > I do not believe this is OK; I believe it is wrong (not even considering
> > possible hardware issues that can arise from such aliasing).
> 
> Well, OK, but why do you believe it is wrong?  If thread A is executing
> a load instruction at the same time that thread B is calling mmap(),
> it really is indeterminate what value A loads.  It might be from before
> the call to mmap() and it might be from after.  And if thread C is also
> executing a load instruction at the same time, then it might already get
> a different result from thread A.  And can threads A and C really tell
> which of them executed the load instruction 'first'?  I think this is
> all so indeterminate already that the (lack of) guarantees I outlined
> above are acceptable.
> 
> But we should all agree on this, so _please_ continue to argue your case
> for why you believe it to be wrong.
> 

I agree that such an application might look like it is doing something
indeterminate, but there might be applications that catch SEGFAULT and use
it as synchronization.  I did something similar for reverse engineering a
long time ago with a library called libsegfault ...

In any case, I agree that an application that is not catching SEGFAULT,
and which is doing the above access patterns, is doing something
indeterminate.

Nonetheless, I believe it is important that at any point in time, for all
the threads in a given process, on all the CPUs, a given virtual address
should always point to the same physical memory (or to nothing), i.e. we
should never have one CPU that sees different physical memory from another
CPU for the same virtual address.


> [snip proposed solution -- if the problem needs solving, we can argue
> about how to solve it later]

Well, I feel like you are also not discussing munmap(); the above seemed
to be about MAP_FIXED (replacing an existing mapping).  For munmap too, I
believe we should agree on what the expected behavior should be, and from
my POV, again, we should not allow a new mapping to appear until a
"running" munmap is fully done (i.e. all CPU caches and TLBs flushed), for
the same reason as above, i.e. all CPUs always see the same physical
memory (or nothing) for a given virtual address.

This is what we have today with the big rwsem, and I think we need to keep
that behavior even with concurrency.  I do not believe this will impact
performance, and it is easy enough to solve, so I feel safer doing so
given that it does not cost anything.

So I would rather argue about why we should change the current behavior if
we can fix the concurrency without changing it (hence why discussing
solutions might also be relevant here).


> > > Some people are concerned that a reference count on the VMA will lead to
> > > contention moving from the mmap_sem to the refcount on a very large VMA
> > > for workloads which have one giant VMA covering the entire working set.
> > > For those workloads, I propose we use the existing ->map_pages() callback
> > > (changed to return a vm_fault_t from the current void).
> > > 
> > > It will be called with the RCU lock held and no reference count on
> > > the vma.  If it needs to sleep, it should bump the refcount, drop the
> > > RCU lock, prepare enough so that the next call will not need to sleep,
> > > then drop the refcount and return VM_FAULT_RETRY so the VM knows the
> > > VMA is no longer good, and it needs to walk the VMA tree from the start.
> > 
> > Just to make sure I understand: you propose that ->map_pages() becomes
> > a new ->fault() handler that gets called before ->fault() without a
> > refcount, so that we can update fs/drivers slowly to perform better in the
> > new scheme (i.e. avoid the overhead of refcounting if possible at all)?
> > 
> > The ->fault() callback would then be the "slow" path, which will require
> > a refcount on the vma (taken by core mm code before dropping the RCU lock).
> 
> I would actually propose never updating most drivers.  There's just no
> need for them to handle such an unstable and tricky situation as this.
> Let's not make driver writers lives harder.
> 
> For the ones which need this kind of scalability (and let's be clear, they
> would already have *better* scalability than today due to the rwsem being
> split into a per-VMA refcount), then yes, implementing ->map_pages would
> be the way to go.  Indeed, they would probably benefit from implementing
> it today, since it will reduce the number of page faults.

Yes, they will get better scalability, but I expect some of those drivers
might want the extra few mini-percent :) In any case, I believe it might
be better to give it a new name, i.e. fix the current map_pages() users
and rename that callback to something more explicit (atomic_map_pages() or
something similar; I am not good at naming).  But otherwise this looks
like a good plan to avoid excessive refcount overhead.

Cheers,
Jérôme




* Re: Splitting the mmap_sem
  2019-12-06 17:30     ` Jerome Glisse
@ 2019-12-09  3:33       ` Matthew Wilcox
  2019-12-09 14:17         ` Jerome Glisse
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2019-12-09  3:33 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, Laurent Dufour, David Rientjes, Vlastimil Babka,
	Hugh Dickins, Michel Lespinasse, Davidlohr Bueso

On Fri, Dec 06, 2019 at 12:30:30PM -0500, Jerome Glisse wrote:
> On Thu, Dec 05, 2019 at 09:13:22PM -0800, Matthew Wilcox wrote:
> > On Thu, Dec 05, 2019 at 12:21:50PM -0500, Jerome Glisse wrote:
> > > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > > > While one thread is calling mmap(MAP_FIXED), two other threads which are
> > > > accessing the same address may see different data from each other and
> > > > have different page translations in their respective CPU caches until
> > > > the thread calling mmap() returns.  I believe this is OK, but would
> > > > greatly appreciate hearing from people who know better.
> > > 
> > > I do not believe this is OK; I believe it is wrong (not even considering
> > > possible hardware issues that can arise from such aliasing).
> > 
> > Well, OK, but why do you believe it is wrong?  If thread A is executing
> > a load instruction at the same time that thread B is calling mmap(),
> > it really is indeterminate what value A loads.  It might be from before
> > the call to mmap() and it might be from after.  And if thread C is also
> > executing a load instruction at the same time, then it might already get
> > a different result from thread A.  And can threads A and C really tell
> > which of them executed the load instruction 'first'?  I think this is
> > all so indeterminate already that the (lack of) guarantees I outlined
> > above are acceptable.
> > 
> > But we should all agree on this, so _please_ continue to argue your case
> > for why you believe it to be wrong.
> 
> I agree that such an application might look like it is doing something
> indeterminate, but there might be applications that catch SEGFAULT and use
> it as synchronization.  I did something similar for reverse engineering a
> long time ago with a library called libsegfault ...
> 
> In any case, I agree that an application that is not catching SEGFAULT,
> and which is doing the above access patterns, is doing something
> indeterminate.
> 
> Nonetheless, I believe it is important that at any point in time, for all
> the threads in a given process, on all the CPUs, a given virtual address
> should always point to the same physical memory (or to nothing), i.e. we
> should never have one CPU that sees different physical memory from another
> CPU for the same virtual address.
> 
> Well, I feel like you are also not discussing munmap(); the above seemed
> to be about MAP_FIXED (replacing an existing mapping).  For munmap too, I
> believe we should agree on what the expected behavior should be, and from
> my POV, again, we should not allow a new mapping to appear until a
> "running" munmap is fully done (i.e. all CPU caches and TLBs flushed), for
> the same reason as above, i.e. all CPUs always see the same physical
> memory (or nothing) for a given virtual address.

I see MAP_FIXED as being the harder case, but sure, let's talk about
munmap!  I agree that a munmap() + mmap() call should not permit thread
B to see the old value after thread A has seen the new value.  But,
as long as no new mmap can occupy that range, then it's OK if thread A
takes a segfault while thread B can still load the old value.  At least
for a short window.

We can replicate that behaviour by ensuring that new lookups see a NULL
entry, but new attempts to allocate will not reuse that range until the
munmap has finished and all TLB entries are flushed.  The maple tree
actually supports a "ZERO" entry (just like the XArray does) which has
this behaviour -- lookups see NULL, but attempts to allocate do not see
it as free.  We already use that property to prevent allocating above
the end of the process address space.
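
For comparison, this behaviour is visible in the XArray API today; the
maple tree equivalent is assumed to mirror it:

        /* Reserving an index installs XA_ZERO_ENTRY internally. */
        xa_reserve(&xa, index, GFP_KERNEL);

        /* Lookups see NULL, as if the slot were empty... */
        WARN_ON(xa_load(&xa, index) != NULL);

        /* ...but the slot is not free, so an allocating user such as
         * xa_alloc() will never be handed this index until the owner
         * fills or releases the reservation. */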

> This is what we have today with the big rwsem, and I think we need to keep
> that behavior even with concurrency.  I do not believe this will impact
> performance, and it is easy enough to solve, so I feel safer doing so
> given that it does not cost anything.
> 
> So I would rather argue about why we should change the current behavior if
> we can fix the concurrency without changing it (hence why discussing
> solutions might also be relevant here).

It seems like you want to force a thread which sees an ongoing munmap
to spin or sleep until the munmap is done, rather than immediately take
a segfault, and I don't know that's a useful behaviour.

> > > Just to make sure I understand: you propose that ->map_pages() becomes
> > > a new ->fault() handler that gets called before ->fault() without a
> > > refcount, so that we can update fs/drivers slowly to perform better in the
> > > new scheme (i.e. avoid the overhead of refcounting if possible at all)?
> > > 
> > > The ->fault() callback would then be the "slow" path, which will require
> > > a refcount on the vma (taken by core mm code before dropping the RCU lock).
> > 
> > I would actually propose never updating most drivers.  There's just no
> > need for them to handle such an unstable and tricky situation as this.
> > Let's not make driver writers lives harder.
> > 
> > For the ones which need this kind of scalability (and let's be clear, they
> > would already have *better* scalability than today due to the rwsem being
> > split into a per-VMA refcount), then yes, implementing ->map_pages would
> > be the way to go.  Indeed, they would probably benefit from implementing
> > it today, since it will reduce the number of page faults.
> 
> Yes, they will get better scalability, but I expect some of those drivers
> might want the extra few mini-percent :) In any case, I believe it might
> be better to give it a new name, i.e. fix the current map_pages() users
> and rename that callback to something more explicit (atomic_map_pages() or
> something similar; I am not good at naming).  But otherwise this looks
> like a good plan to avoid excessive refcount overhead.

OK, great.  I don't think the current name is bad, but if someone comes
up with a better one, I don't have a problem with renaming it.



* Re: Splitting the mmap_sem
  2019-12-09  3:33       ` Matthew Wilcox
@ 2019-12-09 14:17         ` Jerome Glisse
  0 siblings, 0 replies; 15+ messages in thread
From: Jerome Glisse @ 2019-12-09 14:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Laurent Dufour, David Rientjes, Vlastimil Babka,
	Hugh Dickins, Michel Lespinasse, Davidlohr Bueso

On Sun, Dec 08, 2019 at 07:33:35PM -0800, Matthew Wilcox wrote:
> On Fri, Dec 06, 2019 at 12:30:30PM -0500, Jerome Glisse wrote:
> > On Thu, Dec 05, 2019 at 09:13:22PM -0800, Matthew Wilcox wrote:
> > > On Thu, Dec 05, 2019 at 12:21:50PM -0500, Jerome Glisse wrote:
> > > > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > > > > While one thread is calling mmap(MAP_FIXED), two other threads which are
> > > > > accessing the same address may see different data from each other and
> > > > > have different page translations in their respective CPU caches until
> > > > > the thread calling mmap() returns.  I believe this is OK, but would
> > > > > greatly appreciate hearing from people who know better.
> > > > 
> > > > I do not believe this is OK; I believe it is wrong (not even considering
> > > > possible hardware issues that can arise from such aliasing).
> > > 
> > > Well, OK, but why do you believe it is wrong?  If thread A is executing
> > > a load instruction at the same time that thread B is calling mmap(),
> > > it really is indeterminate what value A loads.  It might be from before
> > > the call to mmap() and it might be from after.  And if thread C is also
> > > executing a load instruction at the same time, then it might already get
> > > a different result from thread A.  And can threads A and C really tell
> > > which of them executed the load instruction 'first'?  I think this is
> > > all so indeterminate already that the (lack of) guarantees I outlined
> > > above are acceptable.
> > > 
> > > But we should all agree on this, so _please_ continue to argue your case
> > > for why you believe it to be wrong.
> > 
> > I agree that such an application might look like it is doing something
> > indeterminate, but there might be applications that catch SEGFAULT and use
> > it as synchronization.  I did something similar for reverse engineering a
> > long time ago with a library called libsegfault ...
> > 
> > In any case, I agree that an application that is not catching SEGFAULT,
> > and which is doing the above access patterns, is doing something
> > indeterminate.
> > 
> > Nonetheless, I believe it is important that at any point in time, for all
> > the threads in a given process, on all the CPUs, a given virtual address
> > should always point to the same physical memory (or to nothing), i.e. we
> > should never have one CPU that sees different physical memory from another
> > CPU for the same virtual address.
> > 
> > Well, I feel like you are also not discussing munmap(); the above seemed
> > to be about MAP_FIXED (replacing an existing mapping).  For munmap too, I
> > believe we should agree on what the expected behavior should be, and from
> > my POV, again, we should not allow a new mapping to appear until a
> > "running" munmap is fully done (i.e. all CPU caches and TLBs flushed), for
> > the same reason as above, i.e. all CPUs always see the same physical
> > memory (or nothing) for a given virtual address.
> 
> I see MAP_FIXED as being the harder case, but sure, let's talk about
> munmap!  I agree that a munmap() + mmap() call should not permit thread
> B to see the old value after thread A has seen the new value.  But,
> as long as no new mmap can occupy that range, then it's OK if thread A
> takes a segfault while thread B can still load the old value.  At least
> for a short window.
> 
> We can replicate that behaviour by ensuring that new lookups see a NULL
> entry, but new attempts to allocate will not reuse that range until the
> munmap has finished and all TLB entries are flushed.  The maple tree
> actually supports a "ZERO" entry (just like the XArray does) which has
> this behaviour -- lookups see NULL, but attempts to allocate do not see
> it as free.  We already use that property to prevent allocating above
> the end of the process address space.
> 
> > This is what we have today with the big rwsem, and I think we need to keep
> > that behavior even with concurrency.  I do not believe this will impact
> > performance, and it is easy enough to solve, so I feel safer doing so
> > given that it does not cost anything.
> > 
> > So I would rather argue about why we should change the current behavior if
> > we can fix the concurrency without changing it (hence why discussing
> > solutions might also be relevant here).
> 
> It seems like you want to force a thread which sees an ongoing munmap
> to spin or sleep until the munmap is done, rather than immediately take
> a segfault, and I don't know that's a useful behaviour.

No, for munmap() I want today's behavior modulo concurrency, i.e.:
    - any concurrent fault while munmap is ongoing should SEGFAULT
    - any new mmap might install itself in the unmapped area but cannot
      service faults until the munmap is fully done (i.e. all caches and
      TLBs are flushed)

So it seems we agree on that, given your presentation above of the maple
tree's ZERO entry.

I want the spin only for faults in a MAP_FIXED range that replaces an
existing range, which is what we have today.  Today, if you do an
mmap(MAP_FIXED), you will block concurrent faults until the rwsem is
released.  I argue we should keep that behavior, as any program that keeps
accessing such an area while the mmap is in progress is doing something
peculiar (it can be a bug, or it can be something totally intentional used
in conjunction with signal handlers).

This will only affect such applications, and it will keep TLB coherency
across all CPUs.  The implementation cost can be kept very low, and it
does not impede concurrency on other virtual ranges.

Keeping the existing behavior just feels a lot safer to me, and it avoids
us pondering whether or not a CPU can freak out if it sees different TLB
entries for the same virtual address.


> > > > Just to make sure I understand: you propose that ->map_pages() becomes
> > > > a new ->fault() handler that gets called before ->fault() without a
> > > > refcount, so that we can update fs/drivers slowly to perform better in the
> > > > new scheme (i.e. avoid the overhead of refcounting if possible at all)?
> > > > 
> > > > The ->fault() callback would then be the "slow" path, which will require
> > > > a refcount on the vma (taken by core mm code before dropping the RCU lock).
> > > 
> > > I would actually propose never updating most drivers.  There's just no
> > > need for them to handle such an unstable and tricky situation as this.
> > > Let's not make driver writers lives harder.
> > > 
> > > For the ones which need this kind of scalability (and let's be clear, they
> > > would already have *better* scalability than today due to the rwsem being
> > > split into a per-VMA refcount), then yes, implementing ->map_pages would
> > > be the way to go.  Indeed, they would probably benefit from implementing
> > > it today, since it will reduce the number of page faults.
> > 
> > Yes, they will get better scalability, but I expect some of those drivers
> > might want the extra few mini-percent :) In any case, I believe it might
> > be better to give it a new name, i.e. fix the current map_pages() users
> > and rename that callback to something more explicit (atomic_map_pages() or
> > something similar; I am not good at naming).  But otherwise this looks
> > like a good plan to avoid excessive refcount overhead.
> 
> OK, great.  I don't think the current name is bad, but if someone comes
> up with a better one, I don't have a problem with renaming it.

I just like it when new behavior is reflected in the name; what about
rcu_map_pages()?  Making it clear it runs under RCU.

Cheers,
Jérôme




* Re: Splitting the mmap_sem
  2019-12-05 17:21 ` Jerome Glisse
  2019-12-06  5:13   ` Matthew Wilcox
@ 2019-12-10 15:26   ` Vlastimil Babka
  2019-12-10 16:07     ` Jerome Glisse
  1 sibling, 1 reply; 15+ messages in thread
From: Vlastimil Babka @ 2019-12-10 15:26 UTC (permalink / raw)
  To: Jerome Glisse, Matthew Wilcox
  Cc: linux-mm, Laurent Dufour, David Rientjes, Hugh Dickins,
	Michel Lespinasse, Davidlohr Bueso

On 12/5/19 6:21 PM, Jerome Glisse wrote:
>> 
>> So calling mmap() looks like this:
>> 
>>       1 allocate a new VMA
>>       2 update pointer(s) in maple tree
>>       3 sleep until old VMAs have a zero refcount
>>       4 synchronize_rcu()
>>       5 free old VMAs
>>       6 flush caches for affected range
>>       7 return to userspace
>> 
>> While one thread is calling mmap(MAP_FIXED), two other threads which are
>> accessing the same address may see different data from each other and
>> have different page translations in their respective CPU caches until
>> the thread calling mmap() returns.  I believe this is OK, but would
>> greatly appreciate hearing from people who know better.
> 
> I do not believe this is OK; I believe it is wrong (not even considering
> possible hardware issues that can arise from such aliasing).

But is it true that races can happen in the above such that multiple CPUs
have different translations?  I think it's impossible to tell from the
above - there are no details about when and which pte modifications
happen, or where the ptl lock is taken... perhaps after filling in those
details, we would be able to see that there's no race.



* Re: Splitting the mmap_sem
  2019-12-10 15:26   ` Vlastimil Babka
@ 2019-12-10 16:07     ` Jerome Glisse
  2019-12-10 18:09       ` Vlastimil Babka
  0 siblings, 1 reply; 15+ messages in thread
From: Jerome Glisse @ 2019-12-10 16:07 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Matthew Wilcox, linux-mm, Laurent Dufour, David Rientjes,
	Hugh Dickins, Michel Lespinasse, Davidlohr Bueso

On Tue, Dec 10, 2019 at 04:26:40PM +0100, Vlastimil Babka wrote:
> On 12/5/19 6:21 PM, Jerome Glisse wrote:
> >> 
> >> So calling mmap() looks like this:
> >> 
> >>       1 allocate a new VMA
> >>       2 update pointer(s) in maple tree
> >>       3 sleep until old VMAs have a zero refcount
> >>       4 synchronize_rcu()
> >>       5 free old VMAs
> >>       6 flush caches for affected range
> >>       7 return to userspace
> >> 
> >> While one thread is calling mmap(MAP_FIXED), two other threads which are
> >> accessing the same address may see different data from each other and
> >> have different page translations in their respective CPU caches until
> >> the thread calling mmap() returns.  I believe this is OK, but would
> >> greatly appreciate hearing from people who know better.
> > 
> > I do not believe this is OK; I believe it is wrong (not even considering
> > possible hardware issues that can arise from such aliasing).
> 
> But is it true that races can happen in the above such that multiple CPUs
> have different translations?  I think it's impossible to tell from the
> above - there are no details about when and which pte modifications
> happen, or where the ptl lock is taken... perhaps after filling in those
> details, we would be able to see that there's no race.
> 

My assumption reading Matthew was that as step 6 makes progress (flushing
caches, and I assume TLBs too), you can have a CPU which is already
flushed and which takes a fault against the new VMA, and thus gets a new
TLB entry that does not match a CPU which is not yet flushed.

Today this cannot happen because page faults will serialize on the
mmap_sem (i.e. until the write mode is released when returning to
userspace).

I advocate that for MAP_FIXED we should keep that behavior, i.e. have
page faults wait until all the CPUs' caches and TLBs are flushed before
servicing new faults.  The rationale is:
    - Applications with racing threads accessing an area which is
      undergoing an mmap(MAP_FIXED) are doing something weird (which might
      be totally warranted), and thus it should not matter if we penalize
      those by not offering the benefit of fully concurrent page faults
      for that area.  We are not making things any worse than they are
      today.

    - Waiting on a new MAP_FIXED vma is easy to implement and does not
      affect other kinds of page fault (i.e. you still have concurrent
      faults to other areas).

Note, I stress again that this is only for MAP_FIXED replacing an
existing vma.  For any other mmap (well, I would need to go over all the
flags to make sure there is nothing else similar to MAP_FIXED) we can
have full concurrency, i.e.:
    - Access to an area that is undergoing munmap will SEGFAULT
      as soon as the vma is no longer valid (i.e. no longer in the
      maple tree, I guess)

    - Access to an area that is just being mmapped will fault using
      the new vma (assuming we publish the vma only once it is
      ready to take faults)

    - An mmap can make progress while another munmap is making progress,
      and reuse the munmapped area as soon as it is fully flushed
      (caches and TLB).

Cheers,
Jérôme




* Re: Splitting the mmap_sem
  2019-12-10 16:07     ` Jerome Glisse
@ 2019-12-10 18:09       ` Vlastimil Babka
  0 siblings, 0 replies; 15+ messages in thread
From: Vlastimil Babka @ 2019-12-10 18:09 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Matthew Wilcox, linux-mm, Laurent Dufour, David Rientjes,
	Hugh Dickins, Michel Lespinasse, Davidlohr Bueso

On 12/10/19 5:07 PM, Jerome Glisse wrote:
> On Tue, Dec 10, 2019 at 04:26:40PM +0100, Vlastimil Babka wrote:
>> On 12/5/19 6:21 PM, Jerome Glisse wrote:
>>>>
>>>> So calling mmap() looks like this:
>>>>
>>>>       1 allocate a new VMA
>>>>       2 update pointer(s) in maple tree
>>>>       3 sleep until old VMAs have a zero refcount
>>>>       4 synchronize_rcu()
>>>>       5 free old VMAs
>>>>       6 flush caches for affected range
>>>>       7 return to userspace
>>>>
>>>> While one thread is calling mmap(MAP_FIXED), two other threads which are
>>>> accessing the same address may see different data from each other and
>>>> have different page translations in their respective CPU caches until
>>>> the thread calling mmap() returns.  I believe this is OK, but would
>>>> greatly appreciate hearing from people who know better.
>>>
>>> I do not believe this is OK; I believe it is wrong (not even considering
>>> possible hardware issues that can arise from such aliasing).
>>
>> But is it true that races can happen in the above such that multiple CPUs
>> have different translations?  I think it's impossible to tell from the
>> above - there are no details about when and which pte modifications
>> happen, or where the ptl lock is taken... perhaps after filling in those
>> details, we would be able to see that there's no race.
>>
> 
> My assumption reading Matthew was that as step 6 makes progress (flushing
> caches, and I assume TLBs too), you can have a CPU which is already
> flushed and which takes a fault against the new VMA, and thus gets a new
> TLB entry that does not match a CPU which is not yet flushed.

We already have to protect against CPUs that access page tables (and
thus fill their TLBs) in hardware, taking no mmap_sem at all.  For the
mmap(MAP_FIXED) overwriting-an-existing-mapping case this means that the
old stuff will first be munmapped - zap_pte_range() will take the pte
lock, clear the ptes and flush TLBs before any new mappings are installed.
A parallel fault trying to install a pte for the new VMA should thus
serialize on the pte lock and only install the new pte after everyone has
been flushed, and this should be fine?
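
A sketch of the serialization point being relied on here: the teardown
and fault paths both take the same per-page-table lock before touching
ptes (a simplified view, not the exact zap_pte_range() code):

        spinlock_t *ptl;
        pte_t *pte;

        pte = pte_offset_map_lock(mm, pmd, address, &ptl);
        /* zap side: clear ptes (and flush TLBs before new mappings go
         * in); fault side: install the new pte.  Whichever path takes
         * the lock second observes the other's work as complete. */
        pte_unmap_unlock(pte, ptl);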

What might IMHO perhaps need care is a parallel fault that started with
the old VMA, because some PTE was unpopulated. We need to make sure it
doesn't end up being the last to install the PTE for the VMA that's
going away.

> Today this cannot happen because page faults will serialize on the
> mmap_sem (i.e. until the write mode is released when returning to
> userspace).




* Re: Splitting the mmap_sem
  2019-12-03 22:21 Splitting the mmap_sem Matthew Wilcox
  2019-12-05 17:21 ` Jerome Glisse
@ 2019-12-12 14:24 ` Kirill A. Shutemov
  2019-12-12 15:40   ` Matthew Wilcox
  1 sibling, 1 reply; 15+ messages in thread
From: Kirill A. Shutemov @ 2019-12-12 14:24 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm

On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> My preferred solution to the mmap_sem scalability problem is to allow
> VMAs to be looked up under the RCU read lock then take a per-VMA lock.
> I've been focusing on the first half of this problem (looking up VMAs
> in an RCU-safe data structure) and ignoring the second half (taking a
> lock while holding the RCU lock).

Do you see this approach as regression-free for the uncontended case?
I suspect it will cause regressions for single-threaded applications...

> We currently only have one ->map_pages() callback, and it's
> filemap_map_pages().  It only needs to sleep in one place -- to allocate
> a PTE table.  I think that can be allocated ahead of time if needed.

No, filemap_map_pages() doesn't sleep.  It cannot: the whole body of the
function runs under rcu_read_lock().  It uses a pre-allocated page table.
See do_fault_around().
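
Paraphrasing the pattern (the real code is in do_fault_around(); details
such as the barrier after the allocation are omitted here):

        /* Allocate the page table up front if the PMD is empty, so the
         * RCU section in ->map_pages() never needs to sleep. */
        if (pmd_none(*vmf->pmd)) {
                vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm);
                if (!vmf->prealloc_pte)
                        return VM_FAULT_OOM;
        }
        /* filemap_map_pages() then runs entirely under rcu_read_lock(),
         * consuming the preallocated table when it installs ptes. */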

-- 
 Kirill A. Shutemov



* Re: Splitting the mmap_sem
  2019-12-12 14:24 ` Kirill A. Shutemov
@ 2019-12-12 15:40   ` Matthew Wilcox
  2019-12-12 15:46     ` Kirill A. Shutemov
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2019-12-12 15:40 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: linux-mm

On Thu, Dec 12, 2019 at 05:24:57PM +0300, Kirill A. Shutemov wrote:
> On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > My preferred solution to the mmap_sem scalability problem is to allow
> > VMAs to be looked up under the RCU read lock then take a per-VMA lock.
> > I've been focusing on the first half of this problem (looking up VMAs
> > in an RCU-safe data structure) and ignoring the second half (taking a
> > lock while holding the RCU lock).
> 
> Do you see this approach as regression-free for the uncontended case?
> I suspect it will cause regressions for single-threaded applications...

Which part of the approach do you think will cause a regression?  The
maple tree is quicker to traverse than the rbtree (in our simulations).
Incrementing a refcount on a VMA is surely no slower than acquiring an
uncontended rwsem for read.  mmap() and munmap() will get slower, but is
that a problem?

> > We currently only have one ->map_pages() callback, and it's
> > filemap_map_pages().  It only needs to sleep in one place -- to allocate
> > a PTE table.  I think that can be allocated ahead of time if needed.
> 
> No, filemap_map_pages() doesn't sleep.  It cannot: the whole body of the
> function runs under rcu_read_lock().  It uses a pre-allocated page table.
> See do_fault_around().

Oh, thank you!  That makes the ->map_pages() optimisation already workable
with no changes.



* Re: Splitting the mmap_sem
  2019-12-12 15:40   ` Matthew Wilcox
@ 2019-12-12 15:46     ` Kirill A. Shutemov
  2019-12-13 14:33       ` Matthew Wilcox
  0 siblings, 1 reply; 15+ messages in thread
From: Kirill A. Shutemov @ 2019-12-12 15:46 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm

On Thu, Dec 12, 2019 at 07:40:02AM -0800, Matthew Wilcox wrote:
> On Thu, Dec 12, 2019 at 05:24:57PM +0300, Kirill A. Shutemov wrote:
> > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > > My preferred solution to the mmap_sem scalability problem is to allow
> > > VMAs to be looked up under the RCU read lock then take a per-VMA lock.
> > > I've been focusing on the first half of this problem (looking up VMAs
> > > in an RCU-safe data structure) and ignoring the second half (taking a
> > > lock while holding the RCU lock).
> > 
> > Do you see this approach as regression-free for the uncontended case?
> > I suspect it will cause regressions for single-threaded applications...
> 
> Which part of the approach do you think will cause a regression?  The
> maple tree is quicker to traverse than the rbtree (in our simulations).
> Incrementing a refcount on a VMA is surely no slower than acquiring an
> uncontended rwsem for read.  mmap() and munmap() will get slower, but is
> that a problem?

Yes, it is a problem, especially for short-lived processes.  See a kernel
build as a workload.

-- 
 Kirill A. Shutemov



* Re: Splitting the mmap_sem
  2019-12-12 15:46     ` Kirill A. Shutemov
@ 2019-12-13 14:33       ` Matthew Wilcox
  2019-12-13 18:06         ` Kirill A. Shutemov
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2019-12-13 14:33 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: linux-mm

On Thu, Dec 12, 2019 at 06:46:13PM +0300, Kirill A. Shutemov wrote:
> On Thu, Dec 12, 2019 at 07:40:02AM -0800, Matthew Wilcox wrote:
> > On Thu, Dec 12, 2019 at 05:24:57PM +0300, Kirill A. Shutemov wrote:
> > > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > > > My preferred solution to the mmap_sem scalability problem is to allow
> > > > VMAs to be looked up under the RCU read lock then take a per-VMA lock.
> > > > I've been focusing on the first half of this problem (looking up VMAs
> > > > in an RCU-safe data structure) and ignoring the second half (taking a
> > > > lock while holding the RCU lock).
> > > 
> > > Do you see this approach as regression-free for the uncontended case?
> > > I suspect it will cause regressions for single-threaded applications...
> > 
> > Which part of the approach do you think will cause a regression?  The
> > maple tree is quicker to traverse than the rbtree (in our simulations).
> > Incrementing a refcount on a VMA is surely no slower than acquiring an
> > uncontended rwsem for read.  mmap() and munmap() will get slower, but is
> > that a problem?
> 
> Yes, it is a problem, especially for short-lived processes.  See a kernel
> build as a workload.

Ah.  Well, we can skip the synchronize_rcu() step if the mm_struct has zero
or one mm_users.  That should avoid a slowdown for mmap/munmap.
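
A sketch of that fast path; free_old_vmas() stands in for whatever frees
the detached VMAs:

        /* If we are the only user of this mm, nobody else can be
         * traversing it under RCU, so skip the grace period. */
        if (atomic_read(&mm->mm_users) > 1)
                synchronize_rcu();
        free_old_vmas(old);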



* Re: Splitting the mmap_sem
  2019-12-13 14:33       ` Matthew Wilcox
@ 2019-12-13 18:06         ` Kirill A. Shutemov
  2019-12-13 18:21           ` Matthew Wilcox
  0 siblings, 1 reply; 15+ messages in thread
From: Kirill A. Shutemov @ 2019-12-13 18:06 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm

On Fri, Dec 13, 2019 at 06:33:33AM -0800, Matthew Wilcox wrote:
> On Thu, Dec 12, 2019 at 06:46:13PM +0300, Kirill A. Shutemov wrote:
> > On Thu, Dec 12, 2019 at 07:40:02AM -0800, Matthew Wilcox wrote:
> > > On Thu, Dec 12, 2019 at 05:24:57PM +0300, Kirill A. Shutemov wrote:
> > > > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > > > > My preferred solution to the mmap_sem scalability problem is to allow
> > > > > VMAs to be looked up under the RCU read lock then take a per-VMA lock.
> > > > > I've been focusing on the first half of this problem (looking up VMAs
> > > > > in an RCU-safe data structure) and ignoring the second half (taking a
> > > > > lock while holding the RCU lock).
> > > > 
> > > > Do you see this approach as regression-free for the uncontended case?
> > > > I suspect it will cause regressions for single-threaded applications...
> > > 
> > > Which part of the approach do you think will cause a regression?  The
> > > maple tree is quicker to traverse than the rbtree (in our simulations).
> > > Incrementing a refcount on a VMA is surely no slower than acquiring an
> > > uncontended rwsem for read.  mmap() and munmap() will get slower, but is
> > > that a problem?
> > 
> > Yes, it is a problem, especially for short-lived processes.  See a kernel
> > build as a workload.
> 
> Ah.  Well, we can skip the synchronize_rcu() step if the mm_struct has zero
> or one mm_users.  That should avoid a slowdown for mmap/munmap.

It may work.  But I'm not sure how it will interact with remote mm
accesses, like the /proc/ interfaces or ptrace.

-- 
 Kirill A. Shutemov



* Re: Splitting the mmap_sem
  2019-12-13 18:06         ` Kirill A. Shutemov
@ 2019-12-13 18:21           ` Matthew Wilcox
  0 siblings, 0 replies; 15+ messages in thread
From: Matthew Wilcox @ 2019-12-13 18:21 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: linux-mm

On Fri, Dec 13, 2019 at 09:06:26PM +0300, Kirill A. Shutemov wrote:
> On Fri, Dec 13, 2019 at 06:33:33AM -0800, Matthew Wilcox wrote:
> > Ah.  Well, we can skip the synchronize_rcu() step if the mm_struct has zero
> > or one mm_users.  That should avoid a slowdown for mmap/munmap.
> 
> It may work.  But I'm not sure how it will interact with remote mm
> accesses, like the /proc/ interfaces or ptrace.

Those users call mmget() which increases mm_users.  See m_start() in
fs/proc/task_mmu.c.
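
For reference, the remote-access pattern looks like this (paraphrasing
m_start() in fs/proc/task_mmu.c):

        /* Pin the mm so mm_users stays elevated for the whole walk. */
        if (!mm || !mmget_not_zero(mm))
                return NULL;
        /* ... walk the mm ... */
        mmput(mm);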


