* Splitting the mmap_sem @ 2019-12-03 22:21 Matthew Wilcox 2019-12-05 17:21 ` Jerome Glisse 2019-12-12 14:24 ` Kirill A. Shutemov 0 siblings, 2 replies; 31+ messages in thread From: Matthew Wilcox @ 2019-12-03 22:21 UTC (permalink / raw) To: linux-mm

[My thanks to Vlastimil, Michel, Liam, David, Davidlohr and Hugh for their feedback on an earlier version of this. I think the solution we discussed doesn't quite work, so here's one which I think does. See the last two paragraphs in particular.]

My preferred solution to the mmap_sem scalability problem is to allow VMAs to be looked up under the RCU read lock, then take a per-VMA lock. I've been focusing on the first half of this problem (looking up VMAs in an RCU-safe data structure) and ignoring the second half (taking a lock while holding the RCU lock).

We can't take a semaphore while holding the RCU lock in case we have to sleep -- the VMA might not exist any more when we wake up. Making the per-VMA lock a spinlock would be a massive change -- fault handlers are currently called with the mmap_sem held and may sleep. So I think we need a per-VMA refcount. That lets us sleep while handling a fault. There are over 100 fault handlers in the kernel, and I don't want to change the locking in all of them.

That makes modifications to the tree a little tricky. At the moment, we take the rwsem for write, which waits for all readers to finish, then we modify the VMAs, then we allow readers back in. With RCU, there is no way to block readers, so different threads may (at the same time) see both an old and a new VMA for the same virtual address.
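[Editorial sketch: the lookup-then-pin pattern described above might look roughly like the following. This is a userspace simulation, not kernel code -- the maple tree lookup, rcu_read_lock(), and the kernel's atomic helpers are replaced by simplified stand-ins, and all names here are illustrative.]

```c
#include <stdatomic.h>

struct vma {
	unsigned long start, end;
	atomic_int refcount;	/* 0 means the VMA is being torn down */
};

/* Analogous to atomic_inc_not_zero(): pin the VMA only if it is still live.
 * Called while the (simulated) RCU read lock keeps *vma from being freed. */
static int vma_tryget(struct vma *vma)
{
	int old = atomic_load(&vma->refcount);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&vma->refcount, &old, old + 1))
			return 1;	/* pinned; safe to sleep in a fault handler */
	}
	return 0;	/* lost the race with teardown; caller must retry the walk */
}

static void vma_put(struct vma *vma)
{
	/* the kernel side would wake the thread in step 3 of the mmap()
	 * sequence when this drops to zero */
	atomic_fetch_sub(&vma->refcount, 1);
}
```

The point of the pattern is that the fault path never sleeps while only the RCU lock protects the VMA: it either pins the VMA with the refcount (and may then sleep), or observes a zero refcount and restarts the lookup.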
So calling mmap() looks like this:

	1 allocate a new VMA
	2 update pointer(s) in maple tree
	3 sleep until old VMAs have a zero refcount
	4 synchronize_rcu()
	5 free old VMAs
	6 flush caches for affected range
	7 return to userspace

While one thread is calling mmap(MAP_FIXED), two other threads which are accessing the same address may see different data from each other and have different page translations in their respective CPU caches until the thread calling mmap() returns. I believe this is OK, but would greatly appreciate hearing from people who know better.

Some people are concerned that a reference count on the VMA will lead to contention moving from the mmap_sem to the refcount on a very large VMA for workloads which have one giant VMA covering the entire working set. For those workloads, I propose we use the existing ->map_pages() callback (changed to return a vm_fault_t from the current void).

It will be called with the RCU lock held and no reference count on the vma. If it needs to sleep, it should bump the refcount, drop the RCU lock, prepare enough so that the next call will not need to sleep, then drop the refcount and return VM_FAULT_RETRY so the VM knows the VMA is no longer good, and it needs to walk the VMA tree from the start.

We currently only have one ->map_pages() callback, and it's filemap_map_pages(). It only needs to sleep in one place -- to allocate a PTE table. I think that can be allocated ahead of time if needed.

^ permalink raw reply [flat|nested] 31+ messages in thread
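[Editorial sketch: the essential constraint in the seven-step sequence above is the ordering -- the old VMAs may only be freed after both the refcounts have drained (step 3) and an RCU grace period has elapsed (step 4), and the flush happens before returning. A toy trace, with every helper stubbed out to just record which step ran, makes the ordering explicit; nothing here does real VM work.]

```c
enum { ALLOC_VMA, UPDATE_TREE, WAIT_REFS, SYNC_RCU, FREE_OLD, FLUSH, DONE, NSTEPS };

static int trace[NSTEPS];
static int trace_len;

static void step(int s) { trace[trace_len++] = s; }

static void do_mmap_fixed(void)
{
	step(ALLOC_VMA);	/* 1: allocate a new VMA */
	step(UPDATE_TREE);	/* 2: publish it in the maple tree; from here,
				 *    readers may see either old or new VMA */
	step(WAIT_REFS);	/* 3: sleep until old VMAs' refcounts hit zero */
	step(SYNC_RCU);		/* 4: synchronize_rcu(): no reader still holds
				 *    a pointer to an old VMA */
	step(FREE_OLD);		/* 5: only now is freeing the old VMAs safe */
	step(FLUSH);		/* 6: flush caches/TLBs for the affected range */
	step(DONE);		/* 7: return to userspace */
}
```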
* Re: Splitting the mmap_sem 2019-12-03 22:21 Splitting the mmap_sem Matthew Wilcox @ 2019-12-05 17:21 ` Jerome Glisse 2019-12-06 5:13 ` Matthew Wilcox 2019-12-10 15:26 ` Vlastimil Babka 2019-12-12 14:24 ` Kirill A. Shutemov 1 sibling, 2 replies; 31+ messages in thread From: Jerome Glisse @ 2019-12-05 17:21 UTC (permalink / raw) To: Matthew Wilcox Cc: linux-mm, Laurent Dufour, David Rientjes, Vlastimil Babka, Hugh Dickins, Michel Lespinasse, Davidlohr Bueso Adding few interested people in cc On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote: > > [My thanks to Vlastimil, Michel, Liam, David, Davidlohr and Hugh for > their feedback on an earlier version of this. I think the solution > we discussed doesn't quite work, so here's one which I think does. > See the last two paragraphs in particular.] > > My preferred solution to the mmap_sem scalability problem is to allow > VMAs to be looked up under the RCU read lock then take a per-VMA lock. > I've been focusing on the first half of this problem (looking up VMAs > in an RCU-safe data structure) and ignoring the second half (taking a > lock while holding the RCU lock). > > We can't take a semaphore while holding the RCU lock in case we have to > sleep -- the VMA might not exist any more when we woke up. Making the > per-VMA lock a spinlock would be a massive change -- fault handlers are > currently called with the mmap_sem held and may sleep. So I think we > need a per-VMA refcount. That lets us sleep while handling a fault. > There are over 100 fault handlers in the kernel, and I don't want to > change the locking in all of them. > > That makes modifications to the tree a little tricky. At the moment, > we take the rwsem for write which waits for all readers to finish, then > we modify the VMAs, then we allow readers back in. With RCU, there is > no way to block readers, so different threads may (at the same time) > see both an old and a new VMA for the same virtual address. 
> So calling mmap() looks like this:
>
> 1 allocate a new VMA
> 2 update pointer(s) in maple tree
> 3 sleep until old VMAs have a zero refcount
> 4 synchronize_rcu()
> 5 free old VMAs
> 6 flush caches for affected range
> 7 return to userspace
>
> While one thread is calling mmap(MAP_FIXED), two other threads which are
> accessing the same address may see different data from each other and
> have different page translations in their respective CPU caches until
> the thread calling mmap() returns. I believe this is OK, but would
> greatly appreciate hearing from people who know better.

I do not believe this is OK, I believe this is wrong (not even considering possible hardware issues that can arise from such aliasing).

That being said, I believe this can be solved "easily": when the new vma is added, you mark it as a newborn (VMA_BABY :)) and page faults will have to wait on it, ie until the previous vma is fully gone and flushed. So after step (6 flush caches) you remove the VMA_BABY flag before returning to userspace, and page faults can resume. I would also mark the old VMA with a ZOMBIE flag so that any reader has a chance to back off and retry. To check for that, we should add a new check to vmf_insert_page() (and similar) to avoid inserting a pfn in a ZOMBIE vma.

Note that I am not sure what we want to do here; can an application rely on rwsem serialization unknowingly, ie could it have one thread doing a page fault on a range that is about to be unmapped by another thread? I am not sure this can happen today without SEGFAULT, thanks to serialization through the rwsem. Anyway, with BABY and ZOMBIE it should behave mostly as it does today (modulo concurrency).

> Some people are concerned that a reference count on the VMA will lead to
> contention moving from the mmap_sem to the refcount on a very large VMA
> for workloads which have one giant VMA covering the entire working set.
> For those workloads, I propose we use the existing ->map_pages() callback
> (changed to return a vm_fault_t from the current void).
>
> It will be called with the RCU lock held and no reference count on
> the vma. If it needs to sleep, it should bump the refcount, drop the
> RCU lock, prepare enough so that the next call will not need to sleep,
> then drop the refcount and return VM_FAULT_RETRY so the VM knows the
> VMA is no longer good, and it needs to walk the VMA tree from the start.

Just to make sure I understand: you propose that ->map_pages() becomes a new ->fault() handler that gets called before ->fault() without a refcount, so that we can update fs/drivers slowly to perform better in the new scheme (ie avoid the overhead of refcounting if possible at all)?

The ->fault() callback would then be the "slow" path, which will require a refcount on the vma (taken by core mm code before dropping the rcu lock).

Cheers,
Jérôme
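[Editorial sketch: the two-tier dispatch being discussed -- try a no-refcount ->map_pages() fast path under RCU, fall back to a refcounted ->fault() slow path on VM_FAULT_RETRY -- might look roughly like this. All types, return codes, and helpers are simplified stand-ins, not the kernel's actual definitions.]

```c
#define VM_FAULT_OK	0
#define VM_FAULT_RETRY	1

struct vma;

struct vm_ops {
	/* fast path: called under RCU, no refcount held, must not sleep */
	int (*map_pages)(struct vma *);
	/* slow path: called with a refcount held, may sleep */
	int (*fault)(struct vma *);
};

struct vma { const struct vm_ops *ops; };

static int handle_fault(struct vma *vma)
{
	/* rcu_read_lock() would be held throughout the fast-path attempt */
	if (vma->ops->map_pages) {
		int ret = vma->ops->map_pages(vma);

		if (ret != VM_FAULT_RETRY)
			return ret;	/* serviced entirely under RCU */
		/* the fast path needed to sleep: it bumped the refcount,
		 * prepared state, dropped the refcount, and told us to
		 * walk the VMA tree again */
	}
	/* slow path: pin the VMA (refcount), drop RCU, then ->fault() */
	return vma->ops->fault(vma);
}

/* toy handlers for demonstration only */
static int fast_ok(struct vma *v)    { (void)v; return VM_FAULT_OK; }
static int fast_retry(struct vma *v) { (void)v; return VM_FAULT_RETRY; }
static int slow_fault(struct vma *v) { (void)v; return 42; }

static const struct vm_ops ops_fast = { fast_ok, slow_fault };
static const struct vm_ops ops_slow = { fast_retry, slow_fault };
```

Drivers that never implement ->map_pages() simply always take the refcounted slow path, which is the "don't make driver writers' lives harder" option discussed below.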
* Re: Splitting the mmap_sem 2019-12-05 17:21 ` Jerome Glisse @ 2019-12-06 5:13 ` Matthew Wilcox 2019-12-06 17:30 ` Jerome Glisse 2019-12-10 15:26 ` Vlastimil Babka 1 sibling, 1 reply; 31+ messages in thread From: Matthew Wilcox @ 2019-12-06 5:13 UTC (permalink / raw) To: Jerome Glisse Cc: linux-mm, Laurent Dufour, David Rientjes, Vlastimil Babka, Hugh Dickins, Michel Lespinasse, Davidlohr Bueso On Thu, Dec 05, 2019 at 12:21:50PM -0500, Jerome Glisse wrote: > Adding few interested people in cc I figured they all read linux-mm already ;-) > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote: > > While one thread is calling mmap(MAP_FIXED), two other threads which are > > accessing the same address may see different data from each other and > > have different page translations in their respective CPU caches until > > the thread calling mmap() returns. I believe this is OK, but would > > greatly appreciate hearing from people who know better. > > I do not believe this is OK, i believe this is wrong (not even considering > possible hardware issues that can arise from such aliasing). Well, OK, but why do you believe it is wrong? If thread A is executing a load instruction at the same time that thread B is calling mmap(), it really is indeterminate what value A loads. It might be from before the call to mmap() and it might be from after. And if thread C is also executing a load instruction at the same time, then it might already get a different result from thread A. And can threads A and C really tell which of them executed the load instruction 'first'? I think this is all so indeterminate already that the (lack of) guarantees I outlined above are acceptable. But we should all agree on this, so _please_ continue to argue your case for why you believe it to be wrong. 
[snip proposed solution -- if the problem needs solving, we can argue about how to solve it later] > > Some people are concerned that a reference count on the VMA will lead to > > contention moving from the mmap_sem to the refcount on a very large VMA > > for workloads which have one giant VMA covering the entire working set. > > For those workloads, I propose we use the existing ->map_pages() callback > > (changed to return a vm_fault_t from the current void). > > > > It will be called with the RCU lock held and no reference count on > > the vma. If it needs to sleep, it should bump the refcount, drop the > > RCU lock, prepare enough so that the next call will not need to sleep, > > then drop the refcount and return VM_FAULT_RETRY so the VM knows the > > VMA is no longer good, and it needs to walk the VMA tree from the start. > > Just to make sure i understand, you propose that ->map_pages() becomes > a new ->fault() handler that get calls before ->fault() without refcount > so that we can update fs/drivers slowly to perform better in the new scheme > (ie avoid the overead of refcounting if possible at all) ? > > The ->fault() callback would then be the "slow" path which will require > a refcount on the vma (taken by core mm code before dropping rcu lock). I would actually propose never updating most drivers. There's just no need for them to handle such an unstable and tricky situation as this. Let's not make driver writers lives harder. For the ones which need this kind of scalability (and let's be clear, they would already have *better* scalability than today due to the rwsem being split into a per-VMA refcount), then yes, implementing ->map_pages would be the way to go. Indeed, they would probably benefit from implementing it today, since it will reduce the number of page faults. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Splitting the mmap_sem 2019-12-06 5:13 ` Matthew Wilcox @ 2019-12-06 17:30 ` Jerome Glisse 2019-12-09 3:33 ` Matthew Wilcox 0 siblings, 1 reply; 31+ messages in thread From: Jerome Glisse @ 2019-12-06 17:30 UTC (permalink / raw) To: Matthew Wilcox Cc: linux-mm, Laurent Dufour, David Rientjes, Vlastimil Babka, Hugh Dickins, Michel Lespinasse, Davidlohr Bueso On Thu, Dec 05, 2019 at 09:13:22PM -0800, Matthew Wilcox wrote: > On Thu, Dec 05, 2019 at 12:21:50PM -0500, Jerome Glisse wrote: > > Adding few interested people in cc > > I figured they all read linux-mm already ;-) > > > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote: > > > While one thread is calling mmap(MAP_FIXED), two other threads which are > > > accessing the same address may see different data from each other and > > > have different page translations in their respective CPU caches until > > > the thread calling mmap() returns. I believe this is OK, but would > > > greatly appreciate hearing from people who know better. > > > > I do not believe this is OK, i believe this is wrong (not even considering > > possible hardware issues that can arise from such aliasing). > > Well, OK, but why do you believe it is wrong? If thread A is executing > a load instruction at the same time that thread B is calling mmap(), > it really is indeterminate what value A loads. It might be from before > the call to mmap() and it might be from after. And if thread C is also > executing a load instruction at the same time, then it might already get > a different result from thread A. And can threads A and C really tell > which of them executed the load instruction 'first'? I think this is > all so indeterminate already that the (lack of) guarantees I outlined > above are acceptable. > > But we should all agree on this, so _please_ continue to argue your case > for why you believe it to be wrong. 
I agree that such an application might look like it is doing something indeterminate, but there might be applications that catch SEGFAULT and use it as synchronization. I did something similar for reverse engineering a long time ago, with a library called libsegfault ...

In any case, I agree that an application that is not catching SEGFAULT, and which is doing the above (access patterns), is doing something indeterminate.

Nonetheless, I believe it is important that at any point in time, for all the threads in a given process, on all the CPUs, a given virtual address should always point to the same physical memory (or to nothing), ie we should never have one CPU that sees a different physical memory from another CPU for the same virtual address.

> [snip proposed solution -- if the problem needs solving, we can argue
> about how to solve it later]

Well, I feel like you are also not discussing munmap(); the above seemed to be about MAP_FIXED (replacing an existing mapping). For munmap too, I believe we should agree on what the expected behavior should be, and from my POV again we should not allow a new mapping to appear until a "running" munmap is fully done (ie all CPU caches and TLBs flushed). For the same reason as above, ie all CPUs always see the same physical memory (or nothing) for a given virtual address.

This is what we have today with the big rwsem, and I think we need to keep that behavior even with concurrency. I do not believe this will impact performance, and it is easy enough to solve, so I feel safer doing so given it does not cost anything.

So I would rather argue on why we should change the current behavior if we can fix the concurrency without changing it (hence why discussing solutions might also be relevant here).

> > > Some people are concerned that a reference count on the VMA will lead to
> > > contention moving from the mmap_sem to the refcount on a very large VMA
> > > for workloads which have one giant VMA covering the entire working set.
> > > For those workloads, I propose we use the existing ->map_pages() callback
> > > (changed to return a vm_fault_t from the current void).
> > >
> > > It will be called with the RCU lock held and no reference count on
> > > the vma. If it needs to sleep, it should bump the refcount, drop the
> > > RCU lock, prepare enough so that the next call will not need to sleep,
> > > then drop the refcount and return VM_FAULT_RETRY so the VM knows the
> > > VMA is no longer good, and it needs to walk the VMA tree from the start.
> >
> > Just to make sure i understand, you propose that ->map_pages() becomes
> > a new ->fault() handler that get calls before ->fault() without refcount
> > so that we can update fs/drivers slowly to perform better in the new scheme
> > (ie avoid the overead of refcounting if possible at all) ?
> >
> > The ->fault() callback would then be the "slow" path which will require
> > a refcount on the vma (taken by core mm code before dropping rcu lock).
>
> I would actually propose never updating most drivers. There's just no
> need for them to handle such an unstable and tricky situation as this.
> Let's not make driver writers lives harder.
>
> For the ones which need this kind of scalability (and let's be clear, they
> would already have *better* scalability than today due to the rwsem being
> split into a per-VMA refcount), then yes, implementing ->map_pages would
> be the way to go. Indeed, they would probably benefit from implementing
> it today, since it will reduce the number of page faults.

Yes, they will get better scalability, but I see some of those drivers might want the extra few mini-percent :) In any case, I believe it might be better to give it a new name, ie fix the current map_pages() user and rename that callback to something more explicit (atomic_map_pages() or something similar; I am not good at naming). But otherwise this looks like a good plan to avoid excessive refcount overhead.
Cheers,
Jérôme
* Re: Splitting the mmap_sem 2019-12-06 17:30 ` Jerome Glisse @ 2019-12-09 3:33 ` Matthew Wilcox 2019-12-09 14:17 ` Jerome Glisse 0 siblings, 1 reply; 31+ messages in thread From: Matthew Wilcox @ 2019-12-09 3:33 UTC (permalink / raw) To: Jerome Glisse Cc: linux-mm, Laurent Dufour, David Rientjes, Vlastimil Babka, Hugh Dickins, Michel Lespinasse, Davidlohr Bueso On Fri, Dec 06, 2019 at 12:30:30PM -0500, Jerome Glisse wrote: > On Thu, Dec 05, 2019 at 09:13:22PM -0800, Matthew Wilcox wrote: > > On Thu, Dec 05, 2019 at 12:21:50PM -0500, Jerome Glisse wrote: > > > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote: > > > > While one thread is calling mmap(MAP_FIXED), two other threads which are > > > > accessing the same address may see different data from each other and > > > > have different page translations in their respective CPU caches until > > > > the thread calling mmap() returns. I believe this is OK, but would > > > > greatly appreciate hearing from people who know better. > > > > > > I do not believe this is OK, i believe this is wrong (not even considering > > > possible hardware issues that can arise from such aliasing). > > > > Well, OK, but why do you believe it is wrong? If thread A is executing > > a load instruction at the same time that thread B is calling mmap(), > > it really is indeterminate what value A loads. It might be from before > > the call to mmap() and it might be from after. And if thread C is also > > executing a load instruction at the same time, then it might already get > > a different result from thread A. And can threads A and C really tell > > which of them executed the load instruction 'first'? I think this is > > all so indeterminate already that the (lack of) guarantees I outlined > > above are acceptable. > > > > But we should all agree on this, so _please_ continue to argue your case > > for why you believe it to be wrong. 
> > I agree that such application might looks like it is doing something that > is undeterminate but their might be application that catch SEGFAULT and use > it as synchronization. I did something similar for reverse engineering a > long time ago with a library call libsegfault ... > > In any case, i agree that an application that is not catching SEGFAULT, and > which is doing the above (access patterns) is doing something undeterminate. > > Nonetheless i believe it is important that at any point in time for all the > threads in a given process, on all the CPUs, a given virtual address should > always point to the same physical memory (or to nothing) ie we should never > have one CPU that sees a different physical memory from another CPU for the > same virtual address. > > Well i feel like you are also not discussing about the munmap() the above > seemed to be about MAP_FIXED (replacing an existing mapping). For munmap > too i believe we should agree on what should be the expected behavior and > from my POV again we should not allow new mapping to appear until a "running" > munmap is not fully done (ie all CPUs cache and TLB flushed). For the same > reason as above ie all CPUs always see same physical memory (or nothing) for > a given virtual address. I see MAP_FIXED as being the harder case, but sure, let's talk about munmap! I agree that a munmap() + mmap() call should not permit thread B to see the old value after thread A has seen the new value. But, as long as no new mmap can occupy that range, then it's OK if thread A takes a segfault while thread B can still load the old value. At least for a short window. We can replicate that behaviour by ensuring that new lookups see a NULL entry, but new attempts to allocate will not reuse that range until the munmap has finished and all TLB entries are flushed. 
The maple tree actually supports a "ZERO" entry (just like the XArray does) which has this behaviour -- lookups see NULL, but attempts to allocate do not see it as free. We already use that property to prevent allocating above the end of the process address space. > This is what we have today with the big rwsem and i think we need to keep > that behavior even with concurency. I do not believe this will impact the > performance and it is easy enough to solve so i feel safer doing so given > it does not cost anything. > > So i would rather argue on why we should change the current behavior if we > can fix the concurrency without changing it (hence why discussing solution > might also be relevant here). It seems like you want to force a thread which sees an ongoing munmap to spin or sleep until the munmap is done, rather than immediately take a segfault, and I don't know that's a useful behaviour. > > > Just to make sure i understand, you propose that ->map_pages() becomes > > > a new ->fault() handler that get calls before ->fault() without refcount > > > so that we can update fs/drivers slowly to perform better in the new scheme > > > (ie avoid the overead of refcounting if possible at all) ? > > > > > > The ->fault() callback would then be the "slow" path which will require > > > a refcount on the vma (taken by core mm code before dropping rcu lock). > > > > I would actually propose never updating most drivers. There's just no > > need for them to handle such an unstable and tricky situation as this. > > Let's not make driver writers lives harder. > > > > For the ones which need this kind of scalability (and let's be clear, they > > would already have *better* scalability than today due to the rwsem being > > split into a per-VMA refcount), then yes, implementing ->map_pages would > > be the way to go. Indeed, they would probably benefit from implementing > > it today, since it will reduce the number of page faults. 
> Yes they will get better scalability but i see some of those drivers might
> want the extra few mini-percent :) In any case, i believe that it might be
> better to give a new name ie fix current map_pages() user and rename that
> callback to something more explicit (atomic_map_pages() or something similar
> i am not good at naming). But otherwise this looks like a good plan to avoid
> excessive refcount overhead.

OK, great. I don't think the current name is bad, but if someone comes up with a better one, I don't have a problem with renaming it.
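[Editorial sketch: the "ZERO entry" behaviour Matthew describes above -- lookups see NULL, but the allocator still treats the range as occupied until the munmap fully completes -- can be illustrated with a toy slot table standing in for the maple tree. The sentinel and helper names are invented for this sketch.]

```c
#include <stddef.h>

#define NSLOTS 8

/* sentinel with a unique address; never returned to lookers-up */
static const void *ZERO_ENTRY = (const void *)&ZERO_ENTRY;
static const void *slots[NSLOTS];

static const void *slot_lookup(int i)
{
	const void *e = slots[i];

	/* readers cannot distinguish a ZERO entry from an empty slot:
	 * a fault here sees no mapping and takes a SEGV */
	return e == ZERO_ENTRY ? NULL : e;
}

static int slot_alloc(int i, const void *vma)
{
	if (slots[i] != NULL)	/* a ZERO entry still blocks reuse */
		return -1;
	slots[i] = vma;
	return 0;
}

/* munmap leaves a ZERO entry behind; only the final cache/TLB flush
 * converts it to a genuinely free slot */
static void slot_munmap_start(int i)  { slots[i] = ZERO_ENTRY; }
static void slot_munmap_finish(int i) { slots[i] = NULL; }
```

This captures the compromise in the thread: a concurrent fault during munmap segfaults immediately, but no new mapping can occupy the range until the teardown (flushes included) has finished.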
* Re: Splitting the mmap_sem 2019-12-09 3:33 ` Matthew Wilcox @ 2019-12-09 14:17 ` Jerome Glisse 0 siblings, 0 replies; 31+ messages in thread From: Jerome Glisse @ 2019-12-09 14:17 UTC (permalink / raw) To: Matthew Wilcox Cc: linux-mm, Laurent Dufour, David Rientjes, Vlastimil Babka, Hugh Dickins, Michel Lespinasse, Davidlohr Bueso On Sun, Dec 08, 2019 at 07:33:35PM -0800, Matthew Wilcox wrote: > On Fri, Dec 06, 2019 at 12:30:30PM -0500, Jerome Glisse wrote: > > On Thu, Dec 05, 2019 at 09:13:22PM -0800, Matthew Wilcox wrote: > > > On Thu, Dec 05, 2019 at 12:21:50PM -0500, Jerome Glisse wrote: > > > > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote: > > > > > While one thread is calling mmap(MAP_FIXED), two other threads which are > > > > > accessing the same address may see different data from each other and > > > > > have different page translations in their respective CPU caches until > > > > > the thread calling mmap() returns. I believe this is OK, but would > > > > > greatly appreciate hearing from people who know better. > > > > > > > > I do not believe this is OK, i believe this is wrong (not even considering > > > > possible hardware issues that can arise from such aliasing). > > > > > > Well, OK, but why do you believe it is wrong? If thread A is executing > > > a load instruction at the same time that thread B is calling mmap(), > > > it really is indeterminate what value A loads. It might be from before > > > the call to mmap() and it might be from after. And if thread C is also > > > executing a load instruction at the same time, then it might already get > > > a different result from thread A. And can threads A and C really tell > > > which of them executed the load instruction 'first'? I think this is > > > all so indeterminate already that the (lack of) guarantees I outlined > > > above are acceptable. > > > > > > But we should all agree on this, so _please_ continue to argue your case > > > for why you believe it to be wrong. 
> > > > I agree that such application might looks like it is doing something that > > is undeterminate but their might be application that catch SEGFAULT and use > > it as synchronization. I did something similar for reverse engineering a > > long time ago with a library call libsegfault ... > > > > In any case, i agree that an application that is not catching SEGFAULT, and > > which is doing the above (access patterns) is doing something undeterminate. > > > > Nonetheless i believe it is important that at any point in time for all the > > threads in a given process, on all the CPUs, a given virtual address should > > always point to the same physical memory (or to nothing) ie we should never > > have one CPU that sees a different physical memory from another CPU for the > > same virtual address. > > > > Well i feel like you are also not discussing about the munmap() the above > > seemed to be about MAP_FIXED (replacing an existing mapping). For munmap > > too i believe we should agree on what should be the expected behavior and > > from my POV again we should not allow new mapping to appear until a "running" > > munmap is not fully done (ie all CPUs cache and TLB flushed). For the same > > reason as above ie all CPUs always see same physical memory (or nothing) for > > a given virtual address. > > I see MAP_FIXED as being the harder case, but sure, let's talk about > munmap! I agree that a munmap() + mmap() call should not permit thread > B to see the old value after thread A has seen the new value. But, > as long as no new mmap can occupy that range, then it's OK if thread A > takes a segfault while thread B can still load the old value. At least > for a short window. > > We can replicate that behaviour by ensuring that new lookups see a NULL > entry, but new attempts to allocate will not reuse that range until the > munmap has finished and all TLB entries are flushed. 
> The maple tree actually supports a "ZERO" entry (just like the XArray does)
> which has this behaviour -- lookups see NULL, but attempts to allocate do not
> see it as free. We already use that property to prevent allocating above
> the end of the process address space.
>
> > This is what we have today with the big rwsem and i think we need to keep
> > that behavior even with concurency. I do not believe this will impact the
> > performance and it is easy enough to solve so i feel safer doing so given
> > it does not cost anything.
> >
> > So i would rather argue on why we should change the current behavior if we
> > can fix the concurrency without changing it (hence why discussing solution
> > might also be relevant here).
>
> It seems like you want to force a thread which sees an ongoing munmap
> to spin or sleep until the munmap is done, rather than immediately take
> a segfault, and I don't know that's a useful behaviour.

No, for munmap() I want today's behavior modulo concurrency, ie:

- any concurrent fault while a munmap is ongoing should SEGFAULT
- any new mmap might install itself in the unmapped area, but cannot service faults until the munmap is fully done (ie all caches and TLBs are flushed)

So it seems we agree on that, given your above presentation of the maple NULL entry.

I want the spin only for faults in a MAP_FIXED range that replaces an existing range, ie which is what we have today. Today, if you do a mmap MAP_FIXED, you will block concurrent faults until the rwsem is released. I argue we should keep that behavior, as any program that keeps accessing such an area while mmap is in progress is doing something peculiar (it can be a bug, or it can be something totally intentional and used in conjunction with signal handlers). This will only affect such applications, and it will keep TLB coherency across all CPUs. The implementation cost can be kept very low, and it does not impede concurrency on other virtual ranges.
Keeping the existing behavior just feels a lot safer to me, and it avoids us pondering whether or not a CPU can freak out if CPUs see different TLB entries for the same virtual address.

> > > > Just to make sure i understand, you propose that ->map_pages() becomes
> > > > a new ->fault() handler that get calls before ->fault() without refcount
> > > > so that we can update fs/drivers slowly to perform better in the new scheme
> > > > (ie avoid the overead of refcounting if possible at all) ?
> > > >
> > > > The ->fault() callback would then be the "slow" path which will require
> > > > a refcount on the vma (taken by core mm code before dropping rcu lock).
> > >
> > > I would actually propose never updating most drivers. There's just no
> > > need for them to handle such an unstable and tricky situation as this.
> > > Let's not make driver writers lives harder.
> > >
> > > For the ones which need this kind of scalability (and let's be clear, they
> > > would already have *better* scalability than today due to the rwsem being
> > > split into a per-VMA refcount), then yes, implementing ->map_pages would
> > > be the way to go. Indeed, they would probably benefit from implementing
> > > it today, since it will reduce the number of page faults.
> >
> > Yes they will get better scalability but i see some of those drivers might
> > want the extra few mini-percent :) In any case, i believe that it might be
> > better to give a new name ie fix current map_pages() user and rename that
> > callback to something more explicit (atomic_map_pages() or something similar
> > i am not good at naming). But otherwise this looks like a good plan to avoid
> > excessive refcount overhead.
>
> OK, great. I don't think the current name is bad, but if someone comes
> up with a better one, I don't have a problem with renaming it.

I just like it when new behavior is reflected in the name; what about rcu_map_pages? Making it clear it is called under rcu.
Cheers,
Jérôme
* Re: Splitting the mmap_sem 2019-12-05 17:21 ` Jerome Glisse 2019-12-06 5:13 ` Matthew Wilcox @ 2019-12-10 15:26 ` Vlastimil Babka 2019-12-10 16:07 ` Jerome Glisse 1 sibling, 1 reply; 31+ messages in thread From: Vlastimil Babka @ 2019-12-10 15:26 UTC (permalink / raw) To: Jerome Glisse, Matthew Wilcox Cc: linux-mm, Laurent Dufour, David Rientjes, Hugh Dickins, Michel Lespinasse, Davidlohr Bueso On 12/5/19 6:21 PM, Jerome Glisse wrote: >> >> So calling mmap() looks like this: >> >> 1 allocate a new VMA >> 2 update pointer(s) in maple tree >> 3 sleep until old VMAs have a zero refcount >> 4 synchronize_rcu() >> 5 free old VMAs >> 6 flush caches for affected range >> 7 return to userspace >> >> While one thread is calling mmap(MAP_FIXED), two other threads which are >> accessing the same address may see different data from each other and >> have different page translations in their respective CPU caches until >> the thread calling mmap() returns. I believe this is OK, but would >> greatly appreciate hearing from people who know better. > > I do not believe this is OK, i believe this is wrong (not even considering > possible hardware issues that can arise from such aliasing). But is it true that the races can happen in the above such that multiple CPU's have different translations? I think it's impossible to tell from above - there are no details about when and which pte modifications happen, where ptl lock is taken... perhaps after filling those details, we could be able to see that there's no race. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Splitting the mmap_sem
From: Jerome Glisse @ 2019-12-10 16:07 UTC
To: Vlastimil Babka
Cc: Matthew Wilcox, linux-mm, Laurent Dufour, David Rientjes, Hugh Dickins, Michel Lespinasse, Davidlohr Bueso

On Tue, Dec 10, 2019 at 04:26:40PM +0100, Vlastimil Babka wrote:
> On 12/5/19 6:21 PM, Jerome Glisse wrote:
> >>
> >> So calling mmap() looks like this:
> >>
> >>	1 allocate a new VMA
> >>	2 update pointer(s) in maple tree
> >>	3 sleep until old VMAs have a zero refcount
> >>	4 synchronize_rcu()
> >>	5 free old VMAs
> >>	6 flush caches for affected range
> >>	7 return to userspace
> >>
> >> While one thread is calling mmap(MAP_FIXED), two other threads which are
> >> accessing the same address may see different data from each other and
> >> have different page translations in their respective CPU caches until
> >> the thread calling mmap() returns.  I believe this is OK, but would
> >> greatly appreciate hearing from people who know better.
> >
> > I do not believe this is OK, i believe this is wrong (not even considering
> > possible hardware issues that can arise from such aliasing).
>
> But is it true that the races can happen in the above such that multiple CPUs
> have different translations? I think it's impossible to tell from above - there
> are no details about when and which pte modifications happen, where ptl lock is
> taken... perhaps after filling those details, we could be able to see that
> there's no race.
>

My reading of Matthew was that while step 6 is making progress (flushing
caches, and I assume TLBs too), you can have a CPU which has already been
flushed and which takes a fault against the new VMA, and thus gets a new TLB
entry that does not match a CPU which has not yet been flushed.
Today this cannot happen because page faults serialize on the mmap_sem (ie until the write mode is released when returning to userspace). I advocate that for MAP_FIXED we should keep that behavior, ie have page faults wait until all the CPU caches and TLBs are flushed before servicing a new fault. The rationale is:

 - An application with racing threads accessing an area which is undergoing
   a MAP_FIXED mmap is doing something weird (which might be totally
   warranted), and thus it should not matter if we penalize it by not
   offering the benefit of fully concurrent page faults for that area. We
   are not making things any worse than they are today.

 - Waiting on a new MAP_FIXED vma is easy to implement and does not affect
   other kinds of page fault (ie you still have concurrent faults to other
   areas).

Note I stress again this is only for MAP_FIXED replacing an existing vma. For any other mmap (well, I would need to go over all the flags to make sure there is nothing else similar to MAP_FIXED) we can have full concurrency, ie:

 - Access to an area that is undergoing munmap will SEGFAULT as soon as
   the vma is no longer valid (ie no longer in the maple tree, I guess).

 - Access to an area that is just being mmaped will fault using the new
   vma (assuming we publish the vma only once it is ready to take faults).

 - An mmap can make progress while another munmap is making progress, and
   can reuse the munmapped area as soon as it is fully flushed (caches and
   TLBs).

Cheers,
Jérôme
* Re: Splitting the mmap_sem 2019-12-10 16:07 ` Jerome Glisse @ 2019-12-10 18:09 ` Vlastimil Babka 0 siblings, 0 replies; 31+ messages in thread From: Vlastimil Babka @ 2019-12-10 18:09 UTC (permalink / raw) To: Jerome Glisse Cc: Matthew Wilcox, linux-mm, Laurent Dufour, David Rientjes, Hugh Dickins, Michel Lespinasse, Davidlohr Bueso On 12/10/19 5:07 PM, Jerome Glisse wrote: > On Tue, Dec 10, 2019 at 04:26:40PM +0100, Vlastimil Babka wrote: >> On 12/5/19 6:21 PM, Jerome Glisse wrote: >>>> >>>> So calling mmap() looks like this: >>>> >>>> 1 allocate a new VMA >>>> 2 update pointer(s) in maple tree >>>> 3 sleep until old VMAs have a zero refcount >>>> 4 synchronize_rcu() >>>> 5 free old VMAs >>>> 6 flush caches for affected range >>>> 7 return to userspace >>>> >>>> While one thread is calling mmap(MAP_FIXED), two other threads which are >>>> accessing the same address may see different data from each other and >>>> have different page translations in their respective CPU caches until >>>> the thread calling mmap() returns. I believe this is OK, but would >>>> greatly appreciate hearing from people who know better. >>> >>> I do not believe this is OK, i believe this is wrong (not even considering >>> possible hardware issues that can arise from such aliasing). >> >> But is it true that the races can happen in the above such that multiple CPU's >> have different translations? I think it's impossible to tell from above - there >> are no details about when and which pte modifications happen, where ptl lock is >> taken... perhaps after filling those details, we could be able to see that >> there's no race. >> > > My assumption reading Matthew was that as step 6 is making progress > (flushing caches and i assume TLB too) then you can have a CPU which > is already flushed and that do take a fault against the new VMA and > thus get a new TLB entry that do not match a CPU which is not yet > flushed. 
We already have to protect against CPUs that access page tables (and thus fill their TLBs) in hardware, taking no mmap_sem at all. For the case of mmap(MAP_FIXED) overwriting an existing mapping, this means that the old stuff will first be munmapped: zap_pte_range() will take the pte lock, clear the pte and flush TLBs before installing any new mappings. A parallel fault trying to install a pte for the new VMA should thus serialize on the pte lock and only install the new pte after everyone was flushed, and this should be fine?

What might IMHO need care is a parallel fault that started with the old VMA, because some PTE was unpopulated. We need to make sure it doesn't end up being the last to install a PTE for the VMA that's going away.

> Today this can not happens because page fault will serialize on the
> mmap_sem (ie until the write mode is release when returning to
> userspace).
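The concern in the last paragraph — a fault that looked up the old VMA must not install a PTE after the zap — amounts to a revalidation step performed under the pte lock. Here is a single-threaded model of that check; all names are invented for illustration, the pte lock itself is elided, and the "generation" is just one possible way to detect that the VMA was replaced (real code might instead recheck the tree or a sequence count under the ptl).

```c
#include <assert.h>
#include <stdbool.h>

/* Single-threaded model of the revalidation discussed above.  In the
 * kernel, both functions below would run under the page table lock. */

struct pte_slot {
    int pte;        /* 0 stands in for pte_none() */
    int vma_gen;    /* bumped whenever the VMA covering this slot is replaced */
};

/* zap_pte_range() analogue for the MAP_FIXED overwrite: clears the pte
 * and retires the old VMA.  In the kernel this also flushes TLBs before
 * any new mapping can appear. */
static void zap_and_replace(struct pte_slot *s)
{
    s->pte = 0;
    s->vma_gen++;
}

/* Fault completion: refuses to install a pte on behalf of a VMA that
 * has been replaced since the fault began looking it up. */
static bool install_pte(struct pte_slot *s, int gen_at_lookup, int pte)
{
    if (s->vma_gen != gen_at_lookup)
        return false;   /* stale VMA: the caller must retry the lookup */
    s->pte = pte;
    return true;
}
```

With this shape, the problematic interleaving resolves itself: the straggler fault that started on the old VMA loses the generation check and retries, so it can never be "the last to install the PTE" for the dying VMA.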
* Re: Splitting the mmap_sem
From: Kirill A. Shutemov @ 2019-12-12 14:24 UTC
To: Matthew Wilcox; +Cc: linux-mm

On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> My preferred solution to the mmap_sem scalability problem is to allow
> VMAs to be looked up under the RCU read lock then take a per-VMA lock.
> I've been focusing on the first half of this problem (looking up VMAs
> in an RCU-safe data structure) and ignoring the second half (taking a
> lock while holding the RCU lock).

Do you see this approach being regression-free for the uncontended case?
I suspect it will cause regressions for single-threaded applications...

> We currently only have one ->map_pages() callback, and it's
> filemap_map_pages().  It only needs to sleep in one place -- to allocate
> a PTE table.  I think that can be allocated ahead of time if needed.

No, filemap_map_pages() doesn't sleep. It cannot. The whole body of the
function is under rcu_read_lock(). It uses a pre-allocated page table.
See do_fault_around().

--
Kirill A. Shutemov
* Re: Splitting the mmap_sem 2019-12-12 14:24 ` Kirill A. Shutemov @ 2019-12-12 15:40 ` Matthew Wilcox 2019-12-12 15:46 ` Kirill A. Shutemov 2020-01-06 22:09 ` Matthew Wilcox 0 siblings, 2 replies; 31+ messages in thread From: Matthew Wilcox @ 2019-12-12 15:40 UTC (permalink / raw) To: Kirill A. Shutemov; +Cc: linux-mm On Thu, Dec 12, 2019 at 05:24:57PM +0300, Kirill A. Shutemov wrote: > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote: > > My preferred solution to the mmap_sem scalability problem is to allow > > VMAs to be looked up under the RCU read lock then take a per-VMA lock. > > I've been focusing on the first half of this problem (looking up VMAs > > in an RCU-safe data structure) and ignoring the second half (taking a > > lock while holding the RCU lock). > > Do you see this approach to be regression-free for uncontended case? > I doubt it will not cause regressions for signle-threaded applications... Which part of the approach do you think will cause a regression? The maple tree is quicker to traverse than the rbtree (in our simulations). Incrementing a refcount on a VMA is surely no slower than acquiring an uncontended rwsem for read. mmap() and munmap() will get slower, but is that a problem? > > We currently only have one ->map_pages() callback, and it's > > filemap_map_pages(). It only needs to sleep in one place -- to allocate > > a PTE table. I think that can be allocated ahead of time if needed. > > No, filemap_map_pages() doesn't sleep. It cannot. Whole body of the > function is under rcu_read_lock(). It uses pre-allocated page table. > See do_fault_around(). Oh, thank you! That makes the ->map_pages() optimisation already workable with no changes. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Splitting the mmap_sem 2019-12-12 15:40 ` Matthew Wilcox @ 2019-12-12 15:46 ` Kirill A. Shutemov 2019-12-13 14:33 ` Matthew Wilcox 2020-01-06 22:09 ` Matthew Wilcox 1 sibling, 1 reply; 31+ messages in thread From: Kirill A. Shutemov @ 2019-12-12 15:46 UTC (permalink / raw) To: Matthew Wilcox; +Cc: linux-mm On Thu, Dec 12, 2019 at 07:40:02AM -0800, Matthew Wilcox wrote: > On Thu, Dec 12, 2019 at 05:24:57PM +0300, Kirill A. Shutemov wrote: > > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote: > > > My preferred solution to the mmap_sem scalability problem is to allow > > > VMAs to be looked up under the RCU read lock then take a per-VMA lock. > > > I've been focusing on the first half of this problem (looking up VMAs > > > in an RCU-safe data structure) and ignoring the second half (taking a > > > lock while holding the RCU lock). > > > > Do you see this approach to be regression-free for uncontended case? > > I doubt it will not cause regressions for signle-threaded applications... > > Which part of the approach do you think will cause a regression? The > maple tree is quicker to traverse than the rbtree (in our simulations). > Incrementing a refcount on a VMA is surely no slower than acquiring an > uncontended rwsem for read. mmap() and munmap() will get slower, but is > that a problem? Yes, it does. Especially for short-living processes. See kernel build as a workload. -- Kirill A. Shutemov ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Splitting the mmap_sem 2019-12-12 15:46 ` Kirill A. Shutemov @ 2019-12-13 14:33 ` Matthew Wilcox 2019-12-13 18:06 ` Kirill A. Shutemov 0 siblings, 1 reply; 31+ messages in thread From: Matthew Wilcox @ 2019-12-13 14:33 UTC (permalink / raw) To: Kirill A. Shutemov; +Cc: linux-mm On Thu, Dec 12, 2019 at 06:46:13PM +0300, Kirill A. Shutemov wrote: > On Thu, Dec 12, 2019 at 07:40:02AM -0800, Matthew Wilcox wrote: > > On Thu, Dec 12, 2019 at 05:24:57PM +0300, Kirill A. Shutemov wrote: > > > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote: > > > > My preferred solution to the mmap_sem scalability problem is to allow > > > > VMAs to be looked up under the RCU read lock then take a per-VMA lock. > > > > I've been focusing on the first half of this problem (looking up VMAs > > > > in an RCU-safe data structure) and ignoring the second half (taking a > > > > lock while holding the RCU lock). > > > > > > Do you see this approach to be regression-free for uncontended case? > > > I doubt it will not cause regressions for signle-threaded applications... > > > > Which part of the approach do you think will cause a regression? The > > maple tree is quicker to traverse than the rbtree (in our simulations). > > Incrementing a refcount on a VMA is surely no slower than acquiring an > > uncontended rwsem for read. mmap() and munmap() will get slower, but is > > that a problem? > > Yes, it does. Especially for short-living processes. See kernel build as a > workload. Ah. Well, we can skip the synchronize_rcu() step if the mm_struct has zero or one mm_users. That should avoid a slowdown for mmap/munmap. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Splitting the mmap_sem
From: Kirill A. Shutemov @ 2019-12-13 18:06 UTC
To: Matthew Wilcox; +Cc: linux-mm

On Fri, Dec 13, 2019 at 06:33:33AM -0800, Matthew Wilcox wrote:
> On Thu, Dec 12, 2019 at 06:46:13PM +0300, Kirill A. Shutemov wrote:
> > On Thu, Dec 12, 2019 at 07:40:02AM -0800, Matthew Wilcox wrote:
> > > On Thu, Dec 12, 2019 at 05:24:57PM +0300, Kirill A. Shutemov wrote:
> > > > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > > > > My preferred solution to the mmap_sem scalability problem is to allow
> > > > > VMAs to be looked up under the RCU read lock then take a per-VMA lock.
> > > > > I've been focusing on the first half of this problem (looking up VMAs
> > > > > in an RCU-safe data structure) and ignoring the second half (taking a
> > > > > lock while holding the RCU lock).
> > > >
> > > > Do you see this approach to be regression-free for uncontended case?
> > > > I doubt it will not cause regressions for signle-threaded applications...
> > >
> > > Which part of the approach do you think will cause a regression?  The
> > > maple tree is quicker to traverse than the rbtree (in our simulations).
> > > Incrementing a refcount on a VMA is surely no slower than acquiring an
> > > uncontended rwsem for read.  mmap() and munmap() will get slower, but is
> > > that a problem?
> >
> > Yes, it does. Especially for short-living processes. See kernel build as a
> > workload.
>
> Ah.  Well, we can skip the synchronize_rcu() step if the mm_struct has zero
> or one mm_users.  That should avoid a slowdown for mmap/munmap.

It may work. But I'm not sure how it will work with remote mm accesses,
like the /proc/ interfaces or ptrace.

--
Kirill A. Shutemov
* Re: Splitting the mmap_sem 2019-12-13 18:06 ` Kirill A. Shutemov @ 2019-12-13 18:21 ` Matthew Wilcox 0 siblings, 0 replies; 31+ messages in thread From: Matthew Wilcox @ 2019-12-13 18:21 UTC (permalink / raw) To: Kirill A. Shutemov; +Cc: linux-mm On Fri, Dec 13, 2019 at 09:06:26PM +0300, Kirill A. Shutemov wrote: > On Fri, Dec 13, 2019 at 06:33:33AM -0800, Matthew Wilcox wrote: > > Ah. Well, we can skip the synchronize_rcu() step if the mm_struct has zero > > or one mm_users. That should avoid a slowdown for mmap/munmap. > > I may work. But I'm not sure how it will work with remote mm accesses. > Like with /proc/ interfaces or ptrace. Those users call mmget() which increases mm_users. See m_start() in fs/proc/task_mmu.c. ^ permalink raw reply [flat|nested] 31+ messages in thread
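The exchange above condenses to a few lines: the grace period is only needed when someone other than the unmapping thread can be traversing the tree under RCU, and remote accessors (/proc, ptrace) make themselves visible by taking an mm_users reference first. A toy model with invented names, with synchronize_rcu() replaced by a counter so the decision is observable:

```c
#include <assert.h>

/* Toy model of the "skip synchronize_rcu() for single-user mms" idea.
 * None of these _stub names are real kernel functions. */

struct mm_stub {
    int mm_users;
};

static int grace_periods;   /* counts (stubbed) synchronize_rcu() calls */

static void synchronize_rcu_stub(void)
{
    grace_periods++;
}

/* mmget()/mmput() analogues: remote accessors such as /proc/<pid>/maps
 * (see m_start() in fs/proc/task_mmu.c) or ptrace pin the mm this way
 * before touching it, which is what keeps the test below safe. */
static void mmget_stub(struct mm_stub *mm) { mm->mm_users++; }
static void mmput_stub(struct mm_stub *mm) { mm->mm_users--; }

/* Freeing detached VMAs after an update: skip the grace period when no
 * one else can be walking the VMA tree under RCU. */
static void free_old_vmas(struct mm_stub *mm)
{
    if (mm->mm_users > 1)
        synchronize_rcu_stub();
    /* ... actually free the detached VMAs here ... */
}
```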
* Re: Splitting the mmap_sem
From: Matthew Wilcox @ 2020-01-06 22:09 UTC
To: Kirill A. Shutemov; +Cc: linux-mm

On Thu, Dec 12, 2019 at 07:40:02AM -0800, Matthew Wilcox wrote:
> > > We currently only have one ->map_pages() callback, and it's
> > > filemap_map_pages().  It only needs to sleep in one place -- to allocate
> > > a PTE table.  I think that can be allocated ahead of time if needed.
> >
> > No, filemap_map_pages() doesn't sleep. It cannot. Whole body of the
> > function is under rcu_read_lock(). It uses pre-allocated page table.
> > See do_fault_around().
>
> Oh, thank you!  That makes the ->map_pages() optimisation already workable
> with no changes.

I've been thinking about this some more, and we have a bit of a tough time
allocating page table entries while holding the RCU read lock. There are no
GFP flags to the p??_alloc() functions, so we can't specify GFP_NOWAIT.

Option 1: Add 'prealloc_pmd' and 'prealloc_pud' to the vm_fault (to go
with prealloc_pte). Allocate them before taking the RCU lock to walk the
VMA tree. This will be a bit of reordering, as we currently take the
mmap_sem, walk the VMA tree, then walk the page tables once we know we
have a good VMA. I don't see a problem with doing that, but others may
differ.

Option 2: Add a memalloc_nowait_save/restore API to go along with nofs
and noio. That way, we can take the RCU read lock, call
memalloc_nowait_save(), and walk the VMA tree and the page tables in the
current order. There's an increased chance of memory allocation of page
tables failing, so we'll have to risk that and do a retry with the
reference count held on the VMA if we need to sleep to allocate memory.

Option 3: Variant of 2 where we add GFP flags to the p??_alloc()
functions.

Option 4: Variant of 2 where we make taking the RCU read lock magically
set the nowait bit, or we have the page allocator check the RCU preempt
depth. I don't like this one much, especially since the preempt depth is
not knowable in most kernel configurations.

Other thoughts on this?
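Option 2 would follow the existing memalloc_nofs_save()/memalloc_noio_save() pattern: a process flag that the gfp-translation helper uses to strip the ability to sleep from every allocation inside the scope. The API below does not exist in the kernel — it is a userspace sketch of what such a hypothetical memalloc_nowait scope could look like, modeled on the real nofs/noio helpers (and the flag values are made up):

```c
#include <assert.h>

/* Sketch of a hypothetical memalloc_nowait scope, modeled on the
 * existing memalloc_nofs_save()/restore() pattern.  The names and the
 * bit values here are invented for illustration. */

#define ___GFP_DIRECT_RECLAIM  0x400u   /* "this allocation may sleep" */
#define PF_MEMALLOC_NOWAIT     0x4u     /* hypothetical task flag */

static unsigned int task_flags;         /* stands in for current->flags */

/* Enter the scope; returns the previous state so scopes nest. */
static unsigned int memalloc_nowait_save(void)
{
    unsigned int old = task_flags & PF_MEMALLOC_NOWAIT;

    task_flags |= PF_MEMALLOC_NOWAIT;
    return old;
}

static void memalloc_nowait_restore(unsigned int old)
{
    task_flags = (task_flags & ~PF_MEMALLOC_NOWAIT) | old;
}

/* Analogue of current_gfp_context(): inside the scope, allocations
 * lose the right to enter direct reclaim, i.e. to sleep. */
static unsigned int current_gfp_context_stub(unsigned int gfp)
{
    if (task_flags & PF_MEMALLOC_NOWAIT)
        gfp &= ~___GFP_DIRECT_RECLAIM;
    return gfp;
}
```

The save/restore pairing is what makes the scope composable: a callee that opens its own nowait scope and closes it again cannot accidentally re-enable sleeping for its caller.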
* Re: Splitting the mmap_sem
From: Kirill A. Shutemov @ 2020-01-07 12:34 UTC
To: Matthew Wilcox; +Cc: linux-mm

On Mon, Jan 06, 2020 at 02:09:10PM -0800, Matthew Wilcox wrote:
> On Thu, Dec 12, 2019 at 07:40:02AM -0800, Matthew Wilcox wrote:
> > > > We currently only have one ->map_pages() callback, and it's
> > > > filemap_map_pages().  It only needs to sleep in one place -- to allocate
> > > > a PTE table.  I think that can be allocated ahead of time if needed.
> > >
> > > No, filemap_map_pages() doesn't sleep. It cannot. Whole body of the
> > > function is under rcu_read_lock(). It uses pre-allocated page table.
> > > See do_fault_around().
> >
> > Oh, thank you!  That makes the ->map_pages() optimisation already workable
> > with no changes.
>
> I've been thinking about this some more, and we have a bit of a tough time
> allocating page table entries while holding the RCU read lock.  There's
> no GFP flags to the p??_alloc() functions, so we can't specify GFP_NOWAIT.
>
> Option 1: Add 'prealloc_pmd' and 'prealloc_pud' to the vm_fault (to go
> with prealloc_pte).  Allocate them before taking the RCU lock to walk
> the VMA tree.  This will be a bit of reordering as we currently take
> the mmap_sem, walk the VMA tree, then walk the page tables once we know
> we have a good VMA.  I don't see a problem with doing that, but others
> may differ.

I expect preallocating all these page tables just-in-case would have a
measurable performance impact. The current code only preallocates a PTE
page table if it sees pmd_none().

We could first check whether this branch of the tree is present, but I'm
not sure how efficient that can be. And we still need to protect against
these page tables being freed from under us.

> Option 2: Add a memalloc_nowait_save/restore API to go along
> with nofs and noio.  That way, we can take the RCU read lock, call
> memalloc_nowait_save(), and walk the VMA tree and the page tables in
> the current order.  There's an increased chance of memory allocation of
> page tables failing, so we'll have to risk that and do a retry with the
> reference count held on the VMA if we need to sleep to allocate memory.
>
> Option 3: Variant of 2 where we add GFP flags to the p??_alloc()
> functions.

I think this is the most reasonable way. If we are low on memory, latency
is not at the top of our priorities.

> Option 4: Variant of 2 where we make taking the RCU read lock magically
> set the nowait bit, or we have the page allocator check the RCU preempt
> depth.  I don't particularly like this one, particularly since the
> preempt depth is not knowable in most kernel configurations.
>
> Other thoughts on this?

--
Kirill A. Shutemov
* Re: Splitting the mmap_sem 2020-01-07 12:34 ` Kirill A. Shutemov @ 2020-01-07 13:54 ` Matthew Wilcox 2020-01-07 14:27 ` Kirill A. Shutemov 0 siblings, 1 reply; 31+ messages in thread From: Matthew Wilcox @ 2020-01-07 13:54 UTC (permalink / raw) To: Kirill A. Shutemov; +Cc: linux-mm On Tue, Jan 07, 2020 at 03:34:15PM +0300, Kirill A. Shutemov wrote: > On Mon, Jan 06, 2020 at 02:09:10PM -0800, Matthew Wilcox wrote: > > On Thu, Dec 12, 2019 at 07:40:02AM -0800, Matthew Wilcox wrote: > > > > > We currently only have one ->map_pages() callback, and it's > > > > > filemap_map_pages(). It only needs to sleep in one place -- to allocate > > > > > a PTE table. I think that can be allocated ahead of time if needed. > > > > > > > > No, filemap_map_pages() doesn't sleep. It cannot. Whole body of the > > > > function is under rcu_read_lock(). It uses pre-allocated page table. > > > > See do_fault_around(). > > > > > > Oh, thank you! That makes the ->map_pages() optimisation already workable > > > with no changes. > > > > I've been thinking about this some more, and we have a bit of a tough time > > allocating page table entries while holding the RCU read lock. There's > > no GFP flags to the p??_alloc() functions, so we can't specify GFP_NOWAIT. > > > > Option 1: Add 'prealloc_pmd' and 'prealloc_pud' to the vm_fault (to go > > with prealloc_pte). Allocate them before taking the RCU lock to walk > > the VMA tree. This will be a bit of reordering as we currently take > > the mmap_sem, walk the VMA tree, then walk the page tables once we know > > we have a good VMA. I don't see a problem with doing that, but others > > may differ. > > I expect preallocating all these page tables just-in-case would have > measuable performance impact. Current code only preallocates PTE page > table if sees pmd_none(). Right, and we'd only preallocate those tables if we see pud_none() and p4d_none(). Which is why we'd need to walk the page tables before walking the VMA tree. 
> We may first check if this branch of the tree is present. But I'm not sure > how efficient it can be. And we still need to protect from freeing > these page tables from under us. Do we free page tables once they've been inserted, other than at process exit? > > Option 2: Add a memalloc_nowait_save/restore API to go along > > with nofs and noio. That way, we can take the RCU read lock, call > > memalloc_nowait_save(), and walk the VMA tree and the page tables in > > the current order. There's an increased chance of memory allocation of > > page tables failing, so we'll have to risk that and do a retry with the > > reference count held on the VMA if we need to sleep to allocate memory. > > > > Option 3: Variant of 2 where we add GFP flags to the p??_alloc() > > functions. > > I think this is the most reasonable way. If we are low of memory, latency > is not on the top of priorities. You prefer adding GFP flags over adding a memalloc_nowait_save()? ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Splitting the mmap_sem 2020-01-07 13:54 ` Matthew Wilcox @ 2020-01-07 14:27 ` Kirill A. Shutemov 2020-01-09 13:56 ` Vlastimil Babka 0 siblings, 1 reply; 31+ messages in thread From: Kirill A. Shutemov @ 2020-01-07 14:27 UTC (permalink / raw) To: Matthew Wilcox; +Cc: linux-mm On Tue, Jan 07, 2020 at 05:54:59AM -0800, Matthew Wilcox wrote: > On Tue, Jan 07, 2020 at 03:34:15PM +0300, Kirill A. Shutemov wrote: > > On Mon, Jan 06, 2020 at 02:09:10PM -0800, Matthew Wilcox wrote: > > > On Thu, Dec 12, 2019 at 07:40:02AM -0800, Matthew Wilcox wrote: > > > > > > We currently only have one ->map_pages() callback, and it's > > > > > > filemap_map_pages(). It only needs to sleep in one place -- to allocate > > > > > > a PTE table. I think that can be allocated ahead of time if needed. > > > > > > > > > > No, filemap_map_pages() doesn't sleep. It cannot. Whole body of the > > > > > function is under rcu_read_lock(). It uses pre-allocated page table. > > > > > See do_fault_around(). > > > > > > > > Oh, thank you! That makes the ->map_pages() optimisation already workable > > > > with no changes. > > > > > > I've been thinking about this some more, and we have a bit of a tough time > > > allocating page table entries while holding the RCU read lock. There's > > > no GFP flags to the p??_alloc() functions, so we can't specify GFP_NOWAIT. > > > > > > Option 1: Add 'prealloc_pmd' and 'prealloc_pud' to the vm_fault (to go > > > with prealloc_pte). Allocate them before taking the RCU lock to walk > > > the VMA tree. This will be a bit of reordering as we currently take > > > the mmap_sem, walk the VMA tree, then walk the page tables once we know > > > we have a good VMA. I don't see a problem with doing that, but others > > > may differ. > > > > I expect preallocating all these page tables just-in-case would have > > measuable performance impact. Current code only preallocates PTE page > > table if sees pmd_none(). 
> > Right, and we'd only preallocate those tables if we see pud_none() and > p4d_none(). Which is why we'd need to walk the page tables before > walking the VMA tree. > > > We may first check if this branch of the tree is present. But I'm not sure > > how efficient it can be. And we still need to protect from freeing > > these page tables from under us. > > Do we free page tables once they've been inserted, other than at > process exit? Yes. munmap() would free them. See free_pgtables() in unmap_region() > > > Option 2: Add a memalloc_nowait_save/restore API to go along > > > with nofs and noio. That way, we can take the RCU read lock, call > > > memalloc_nowait_save(), and walk the VMA tree and the page tables in > > > the current order. There's an increased chance of memory allocation of > > > page tables failing, so we'll have to risk that and do a retry with the > > > reference count held on the VMA if we need to sleep to allocate memory. > > > > > > Option 3: Variant of 2 where we add GFP flags to the p??_alloc() > > > functions. > > > > I think this is the most reasonable way. If we are low of memory, latency > > is not on the top of priorities. > > You prefer adding GFP flags over adding a memalloc_nowait_save()? I don't have strong preference here. -- Kirill A. Shutemov ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Splitting the mmap_sem 2020-01-07 14:27 ` Kirill A. Shutemov @ 2020-01-09 13:56 ` Vlastimil Babka 2020-01-09 17:03 ` Michal Hocko 0 siblings, 1 reply; 31+ messages in thread From: Vlastimil Babka @ 2020-01-09 13:56 UTC (permalink / raw) To: Kirill A. Shutemov, Matthew Wilcox; +Cc: linux-mm, Michal Hocko, Peter Zijlstra +CC On 1/7/20 3:27 PM, Kirill A. Shutemov wrote: >>>> Option 2: Add a memalloc_nowait_save/restore API to go along >>>> with nofs and noio. That way, we can take the RCU read lock, call >>>> memalloc_nowait_save(), and walk the VMA tree and the page tables in >>>> the current order. There's an increased chance of memory allocation of >>>> page tables failing, so we'll have to risk that and do a retry with the >>>> reference count held on the VMA if we need to sleep to allocate memory. >>>> >>>> Option 3: Variant of 2 where we add GFP flags to the p??_alloc() >>>> functions. >>> >>> I think this is the most reasonable way. If we are low of memory, latency >>> is not on the top of priorities. >> >> You prefer adding GFP flags over adding a memalloc_nowait_save()? > > I don't have strong preference here. I think it's generally a good idea to have the memalloc_() scoping. However, we use the PF_* flags where it now seems 30 out of 32 are used, and MEMALLOC* already occupies 3 of those. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Splitting the mmap_sem
From: Michal Hocko @ 2020-01-09 17:03 UTC
To: Vlastimil Babka
Cc: Kirill A. Shutemov, Matthew Wilcox, linux-mm, Peter Zijlstra

On Thu 09-01-20 14:56:48, Vlastimil Babka wrote:
> +CC
>
> On 1/7/20 3:27 PM, Kirill A. Shutemov wrote:
> >>>> Option 2: Add a memalloc_nowait_save/restore API to go along
> >>>> with nofs and noio.  That way, we can take the RCU read lock, call
> >>>> memalloc_nowait_save(), and walk the VMA tree and the page tables in
> >>>> the current order.  There's an increased chance of memory allocation of
> >>>> page tables failing, so we'll have to risk that and do a retry with the
> >>>> reference count held on the VMA if we need to sleep to allocate memory.
> >>>>
> >>>> Option 3: Variant of 2 where we add GFP flags to the p??_alloc()
> >>>> functions.
> >>>
> >>> I think this is the most reasonable way. If we are low of memory, latency
> >>> is not on the top of priorities.
> >>
> >> You prefer adding GFP flags over adding a memalloc_nowait_save()?
> >
> > I don't have strong preference here.
>
> I think it's generally a good idea to have the memalloc_() scoping.

I also think there are going to be other use cases where NOWAIT scoping
would be interesting. I am not sure this is the case for the page table
allocations, though. I might misremember, but RCU-based VMA handling has
been considered in the past. I do not remember the details, but there were
some problems, and page table allocation is not the biggest one.

> However, we use the PF_* flags where it now seems 30 out of 32 are used,
> and MEMALLOC* already occupies 3 of those.

This shouldn't be a big deal. Adding a single int to the task struct
shouldn't be terribly hard.

--
Michal Hocko
SUSE Labs
* Re: Splitting the mmap_sem 2020-01-09 17:03 ` Michal Hocko @ 2020-01-09 17:07 ` Michal Hocko 2020-01-09 17:32 ` SeongJae Park 0 siblings, 1 reply; 31+ messages in thread From: Michal Hocko @ 2020-01-09 17:07 UTC (permalink / raw) To: Vlastimil Babka Cc: Kirill A. Shutemov, Matthew Wilcox, linux-mm, Peter Zijlstra On Thu 09-01-20 18:03:25, Michal Hocko wrote: > I might misremember but RCU based VMA handling has > been considered in the past. I do not remember details but there were > some problems and page tables allocation is not the biggest one. I have found https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf in my notes. I managed to forget everything but maybe it will be useful for a reference. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Re: Splitting the mmap_sem
From: SeongJae Park @ 2020-01-09 17:32 UTC
To: Michal Hocko
Cc: Vlastimil Babka, Kirill A. Shutemov, Matthew Wilcox, linux-mm, Peter Zijlstra

On Thu, 9 Jan 2020 18:07:15 +0100 Michal Hocko <mhocko@suse.com> wrote:

> On Thu 09-01-20 18:03:25, Michal Hocko wrote:
> > I might misremember but RCU based VMA handling has
> > been considered in the past. I do not remember details but there were
> > some problems and page tables allocation is not the biggest one.
>
> I have found https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf in my
> notes. I managed to forget everything but maybe it will be useful for a
> reference.

The subsequent work from the authors
(https://people.csail.mit.edu/nickolai/papers/clements-radixvm-2014-08-05.pdf)
might also be useful for understanding the limitations found in that work.

I have also forgotten many of the details, but as far as I remember, the
biggest problem with the rcuvm was the update-side scalability limitation
that results from the single updater lock and the TLB invalidations. I had
also internally implemented another RCU-based VM utilizing fine-grained
update-side synchronization. The write-side performance of my version was
therefore much improved, but it still dropped in the end under heavily
write-intensive workloads, due to the TLB flush overhead.

Page table allocations didn't bother me at that time.

Thanks,
SeongJae Park

> --
> Michal Hocko
> SUSE Labs
* Re: Re: Splitting the mmap_sem 2020-01-09 17:32 ` SeongJae Park @ 2020-01-09 20:13 ` Matthew Wilcox 2020-02-06 13:59 ` Peter Zijlstra 0 siblings, 1 reply; 31+ messages in thread From: Matthew Wilcox @ 2020-01-09 20:13 UTC (permalink / raw) To: SeongJae Park Cc: Michal Hocko, Vlastimil Babka, Kirill A. Shutemov, linux-mm, Peter Zijlstra On Thu, Jan 09, 2020 at 06:32:06PM +0100, SeongJae Park wrote: > On Thu, 9 Jan 2020 18:07:15 +0100 Michal Hocko <mhocko@suse.com> wrote: > > > On Thu 09-01-20 18:03:25, Michal Hocko wrote: > > > I might misremember but RCU based VMA handling has > > > been considered in the past. I do not remember details but there were > > > some problems and page tables allocation is not the biggest one. > > > > I have found https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf in my > > notes. I managed to forget everything but maybe it will be useful for a > > reference. > > The subsequent work from the authors > (https://people.csail.mit.edu/nickolai/papers/clements-radixvm-2014-08-05.pdf) > might be also useful for the understanding of the limitations found from the > work. Thanks for both those references. > I also forgot many details but as far as I remember, the biggest problem with > the rcuvm was the update side scalability limitation that results from the > single updater lock and the TLB invalidations. I has also internally > implemented another RCU based vm that utilizing fine-grained update side > synchronization. The write side performance of my version was therefore much > improved, but it also dropped the performance at the end with heavily > write-intensive workloads due to the TLB flush overhead. > > Page table allocations weren't bothered me at that time. As far as I can tell, both these implementations work by using RCU to look up a VMA, taking a reference count on the VMA and dropping the RCU read lock before walking the page tables. 
Sleeping to allocate page tables will be fine as the reference count prevents the VMA from going away. One of the use cases that we're concerned about involves a high percentage of page faults on a single large (terabytes) VMA (and a highly multithreaded process). Moving the contention from a rwsem in the mm_struct to a refcount in the VMA will not help performance substantially for this user. The proposal consists of three phases. In phase 1, we convert the rbtree to the maple tree, and leave the locking alone. In phase 2, we change the locking to a per-VMA refcount, looked up under RCU. This problem arises during phase 3 where we attempt to handle page faults entirely under the RCU read lock. If we encounter problems, we can fall back to acquiring the VMA refcount, but we need the page allocation to fail rather than sleep (or magically drop the RCU lock and return an indication that it has done so, but that doesn't seem to be an approach that would find any favour). ^ permalink raw reply [flat|nested] 31+ messages in thread
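The fall-back path in phase 3 relies on being able to take a reference only while the VMA is still live, in the style of the kernel's atomic_inc_not_zero(). A standalone C sketch of that rule — all names here are illustrative, not a proposed kernel API — might look like:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-in for a VMA carrying a lifetime refcount.
 * A count of zero means the unmap path has begun tearing it down. */
struct vma {
    atomic_int refcount;
};

/* Take a reference only if the VMA is still live, in the style of the
 * kernel's atomic_inc_not_zero().  Under the RCU read lock the memory
 * is guaranteed to still exist, but the VMA may already be dying; in
 * that case this fails and the fault must retry the tree walk. */
bool vma_tryget(struct vma *vma)
{
    int old = atomic_load(&vma->refcount);

    do {
        if (old == 0)
            return false;               /* munmap() got here first */
    } while (!atomic_compare_exchange_weak(&vma->refcount, &old, old + 1));
    return true;
}

/* Drop a reference; returns true when the caller dropped the last one
 * and is now responsible for freeing the VMA after a grace period. */
bool vma_put(struct vma *vma)
{
    return atomic_fetch_sub(&vma->refcount, 1) == 1;
}
```

A fault handler would then take the RCU read lock, look up the VMA, vma_tryget(), drop the RCU lock, sleep as needed, and vma_put() when done — while munmap() drops the initial reference and waits for the count to reach zero before freeing. Note this is exactly the contended cacheline the single-giant-VMA workload above worries about.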
* Re: Re: Splitting the mmap_sem 2020-01-09 20:13 ` Matthew Wilcox @ 2020-02-06 13:59 ` Peter Zijlstra 2020-02-06 20:15 ` Matthew Wilcox 0 siblings, 1 reply; 31+ messages in thread From: Peter Zijlstra @ 2020-02-06 13:59 UTC (permalink / raw) To: Matthew Wilcox Cc: SeongJae Park, Michal Hocko, Vlastimil Babka, Kirill A. Shutemov, linux-mm On Thu, Jan 09, 2020 at 12:13:20PM -0800, Matthew Wilcox wrote: > One of the use cases that we're concerned about involves a high > percentage of page faults on a single large (terabytes) VMA (and a > highly multithreaded process). Moving the contention from a rwsem > in the mm_struct to a refcount in the VMA will not help performance > substantially for this user. This is why I never believed in the VMA-refcount approach, and why my patches used SRCU. > The proposal consists of three phases. In phase 1, we convert the > rbtree to the maple tree, and leave the locking alone. In phase 2, > we change the locking to a per-VMA refcount, looked up under RCU. > > This problem arises during phase 3 where we attempt to handle page > faults entirely under the RCU read lock. If we encounter problems, > we can fall back to acquiring the VMA refcount, but we need the > page allocation to fail rather than sleep (or magically drop the > RCU lock and return an indication that it has done so, but that > doesn't seem to be an approach that would find any favour). So why not use SRCU? You can do full blocking faults under SRCU and don't need no 'stinkin' refcounts ;-) ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Re: Splitting the mmap_sem 2020-02-06 13:59 ` Peter Zijlstra @ 2020-02-06 20:15 ` Matthew Wilcox 2020-02-06 20:55 ` Peter Zijlstra 0 siblings, 1 reply; 31+ messages in thread From: Matthew Wilcox @ 2020-02-06 20:15 UTC (permalink / raw) To: Peter Zijlstra Cc: SeongJae Park, Michal Hocko, Vlastimil Babka, Kirill A. Shutemov, linux-mm On Thu, Feb 06, 2020 at 02:59:20PM +0100, Peter Zijlstra wrote: > > The proposal consists of three phases. In phase 1, we convert the > > rbtree to the maple tree, and leave the locking alone. In phase 2, > > we change the locking to a per-VMA refcount, looked up under RCU. > > > > This problem arises during phase 3 where we attempt to handle page > > faults entirely under the RCU read lock. If we encounter problems, > > we can fall back to acquiring the VMA refcount, but we need the > > page allocation to fail rather than sleep (or magically drop the > > RCU lock and return an indication that it has done so, but that > > doesn't seem to be an approach that would find any favour). > > So why not use SRCU? You can do full blocking faults under SRCU and > don't need no 'stinkin' refcounts ;-) I have to say, SRCU is not in my mental toolbox of "how to solve a problem", so it simply hadn't occurred to me. Thanks. So, we'd DEFINE_SRCU(vmas_srcu); in mm/memory.c then, at the beginning of a page fault call srcu_read_lock(&vmas_srcu); walk the tree as we do now, allocate memory for PTEs, sleep waiting for pages to arrive back from disc, etc, etc, then at the end of the fault, call srcu_read_unlock(&vmas_srcu). munmap() would consist of removing the VMA from the tree, then calling synchronize_srcu() to wait for all faults to finish, then putting the backing file, etc, etc and freeing the VMA. This seems pretty reasonable, and investigation could actually proceed before the Maple tree work lands. Today, that would be: srcu_read_lock(&vmas_srcu); down_read(&mm->mmap_sem); find_vma(mm, address); up_read(&mm->mmap_sem); ... 
rest of fault handler path ... srcu_read_unlock(&vmas_srcu); Kind of a pain because we still call find_vma() in the per-arch page fault handler, but for prototyping, we'd only have to do one or two architectures. ^ permalink raw reply [flat|nested] 31+ messages in thread
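The semantics being relied on here — read-side sections that may sleep, and a synchronize step that waits for every reader already inside — can be modelled in a few lines of standalone C. This toy collapses SRCU's per-CPU counters and two-epoch scheme into one global counter, so it captures only the behaviour, none of the scalability:

```c
#include <stdatomic.h>

/* Toy SRCU-like domain with a single global reader counter.  Real
 * SRCU keeps per-CPU counters across two epochs so that readers never
 * bounce a shared cacheline and the updater can tell pre-existing
 * readers apart from new ones. */
struct toy_srcu {
    atomic_int readers;
};

void toy_read_lock(struct toy_srcu *sp)
{
    atomic_fetch_add(&sp->readers, 1);
    /* ...the caller may now sleep: allocate PTEs, wait for disc... */
}

void toy_read_unlock(struct toy_srcu *sp)
{
    atomic_fetch_sub(&sp->readers, 1);
}

/* Analogue of synchronize_srcu(): return only once every reader that
 * entered before this call has left.  (With one counter this also
 * waits for later readers; the two-epoch scheme avoids that.)  The
 * real thing sleeps rather than spinning, which is part of why it is
 * slow. */
void toy_synchronize(struct toy_srcu *sp)
{
    while (atomic_load(&sp->readers) != 0)
        ;
}
```

In the prototyping sketch above, the fault path sits between toy_read_lock() and toy_read_unlock(), and munmap() calls toy_synchronize() after unhooking the VMA from the tree.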
* Re: Re: Splitting the mmap_sem 2020-02-06 20:15 ` Matthew Wilcox @ 2020-02-06 20:55 ` Peter Zijlstra 2020-02-06 21:20 ` Matthew Wilcox 0 siblings, 1 reply; 31+ messages in thread From: Peter Zijlstra @ 2020-02-06 20:55 UTC (permalink / raw) To: Matthew Wilcox Cc: SeongJae Park, Michal Hocko, Vlastimil Babka, Kirill A. Shutemov, linux-mm On Thu, Feb 06, 2020 at 12:15:36PM -0800, Matthew Wilcox wrote: > On Thu, Feb 06, 2020 at 02:59:20PM +0100, Peter Zijlstra wrote: > > > The proposal consists of three phases. In phase 1, we convert the > > > rbtree to the maple tree, and leave the locking alone. In phase 2, > > > we change the locking to a per-VMA refcount, looked up under RCU. > > > > > > This problem arises during phase 3 where we attempt to handle page > > > faults entirely under the RCU read lock. If we encounter problems, > > > we can fall back to acquiring the VMA refcount, but we need the > > > page allocation to fail rather than sleep (or magically drop the > > > RCU lock and return an indication that it has done so, but that > > > doesn't seem to be an approach that would find any favour). > > > > So why not use SRCU? You can do full blocking faults under SRCU and > > don't need no 'stinkin' refcounts ;-) > > I have to say, SRCU is not in my mental toolbox of "how to solve a > problem", so it simply hadn't occurred to me. Thanks. > > So, we'd DEFINE_SRCU(vma_srcu); in mm/memory.c > > then, at the beginning of a page fault call srcu_read_lock(&vma_srcu); > walk the tree as we do now, allocate memory for PTEs, sleep waiting for > pages to arrive back from disc, etc, etc, then at the end of the fault, > call srcu_read_unlock(&vma_srcu). So far so good,... > munmap() would consist of removing the > VMA from the tree, then calling synchronize_srcu() to wait for all faults > to finish, then putting the backing file, etc, etc and freeing the VMA. call_srcu(), and the (s)rcu callback will then do the fput() and the rest of the teardown. 
synchronize_srcu() (like synchronize_rcu()) is stupid slow and would make munmap()/exit()/etc. unusable. > This seems pretty reasonable, and investigation could actually proceed > before the Maple tree work lands. Today, that would be: > > srcu_read_lock(&vmas_srcu); > down_read(&mm->mmap_sem); > find_vma(mm, address); > up_read(&mm->mmap_sem); > ... rest of fault handler path ... > srcu_read_unlock(&vmas_srcu); > > Kind of a pain because we still call find_vma() in the per-arch page > fault handler, but for prototyping, we'd only have to do one or two > architectures. If you look at the earlier speculative page-fault patches by Laurent, which were based on my still earlier patches, you'll find most of this there. The tricky bit was validating everything on the second page-table walk, to see that nothing had fundamentally changed, specifically the VMA, before installing the PTE. If you do this without mmap_sem, you need to hold ptlock to pin stuff while validating everything you did earlier. ^ permalink raw reply [flat|nested] 31+ messages in thread
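The call_srcu() alternative amounts to deferred reclamation: munmap() queues a callback and returns at once, and the teardown runs only after the grace period has elapsed. A single-threaded toy of that shape — the queue and drain function are inventions for illustration; the real machinery is per-domain and runs callbacks from kernel context:

```c
#include <stdbool.h>
#include <stdlib.h>

/* Toy deferred-reclamation queue in the shape of call_srcu(): the
 * caller queues a callback and returns immediately; the callback runs
 * only once the grace-period machinery decides all pre-existing
 * readers are gone.  Here "the machinery" is an explicit drain call. */
struct toy_cb {
    void (*func)(void *arg);
    void *arg;
    struct toy_cb *next;
};

static struct toy_cb *pending;

void toy_call_srcu(void (*func)(void *), void *arg)
{
    struct toy_cb *cb = malloc(sizeof(*cb));

    cb->func = func;
    cb->arg = arg;
    cb->next = pending;
    pending = cb;               /* munmap() returns without waiting */
}

/* Invoked once the grace period has elapsed and no readers remain. */
void toy_grace_period_elapsed(void)
{
    while (pending) {
        struct toy_cb *cb = pending;

        pending = cb->next;
        cb->func(cb->arg);      /* e.g. fput() the file, free the VMA */
        free(cb);
    }
}

/* Example callback: the flag stands in for the real teardown work. */
void toy_free_vma(void *arg)
{
    *(bool *)arg = true;
}
```

The ordering is the whole point: nothing runs at queue time, so munmap() never blocks for the grace period, at the cost of returning to userspace while faults on the old range may still be in flight — the concern Matthew raises next.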
* Re: Re: Splitting the mmap_sem 2020-02-06 20:55 ` Peter Zijlstra @ 2020-02-06 21:20 ` Matthew Wilcox 2020-02-07 8:52 ` Peter Zijlstra 0 siblings, 1 reply; 31+ messages in thread From: Matthew Wilcox @ 2020-02-06 21:20 UTC (permalink / raw) To: Peter Zijlstra Cc: SeongJae Park, Michal Hocko, Vlastimil Babka, Kirill A. Shutemov, linux-mm On Thu, Feb 06, 2020 at 09:55:29PM +0100, Peter Zijlstra wrote: > On Thu, Feb 06, 2020 at 12:15:36PM -0800, Matthew Wilcox wrote: > > then, at the beginning of a page fault call srcu_read_lock(&vma_srcu); > > walk the tree as we do now, allocate memory for PTEs, sleep waiting for > > pages to arrive back from disc, etc, etc, then at the end of the fault, > > call srcu_read_unlock(&vma_srcu). > > So far so good,... > > > munmap() would consist of removing the > > VMA from the tree, then calling synchronize_srcu() to wait for all faults > > to finish, then putting the backing file, etc, etc and freeing the VMA. > > call_srcu(), and the (s)rcu callback will then fput() and such things > more. > > synchronize_srcu() (like synchronize_rcu()) is stupid slow and would > make munmap()/exit()/etc.. unusable. I'll need to think about that a bit. I was convinced we needed to wait for the current pagefaults to finish before we could return from munmap(). I need to convince myself that it's OK to return to userspace while the page faults for that range are still proceeding on other CPUs. > > This seems pretty reasonable, and investigation could actually proceed > > before the Maple tree work lands. Today, that would be: > > > > srcu_read_lock(&vmas_srcu); > > down_read(&mm->mmap_sem); > > find_vma(mm, address); > > up_read(&mm->mmap_sem); > > ... rest of fault handler path ... > > srcu_read_unlock(&vmas_srcu); > > > > Kind of a pain because we still call find_vma() in the per-arch page > > fault handler, but for prototyping, we'd only have to do one or two > > architectures. 
> > If you look at the earlier speculative page-fault patches by Laurent, > which were based on my still earlier patches, you'll find most of this > there. > > The tricky bit was validating everything on the second page-table walk, > so see if nothing had fundamentally changed, specifically the VMA, > before installing the PTE. If you do this without mmap_sem, you need to > hold ptlock to pin stuff while validating everything you did earlier. The patches Laurent posted used regular RCU and a per-VMA refcount, not SRCU. If you use SRCU, why would you need a second page table walk? ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Re: Splitting the mmap_sem 2020-02-06 21:20 ` Matthew Wilcox @ 2020-02-07 8:52 ` Peter Zijlstra 2020-02-10 22:00 ` Matthew Wilcox 2020-02-19 17:14 ` Laurent Dufour 0 siblings, 2 replies; 31+ messages in thread From: Peter Zijlstra @ 2020-02-07 8:52 UTC (permalink / raw) To: Matthew Wilcox Cc: SeongJae Park, Michal Hocko, Vlastimil Babka, Kirill A. Shutemov, linux-mm On Thu, Feb 06, 2020 at 01:20:24PM -0800, Matthew Wilcox wrote: > On Thu, Feb 06, 2020 at 09:55:29PM +0100, Peter Zijlstra wrote: > > On Thu, Feb 06, 2020 at 12:15:36PM -0800, Matthew Wilcox wrote: > > > then, at the beginning of a page fault call srcu_read_lock(&vma_srcu); > > > walk the tree as we do now, allocate memory for PTEs, sleep waiting for > > > pages to arrive back from disc, etc, etc, then at the end of the fault, > > > call srcu_read_unlock(&vma_srcu). > > > > So far so good,... > > > > > munmap() would consist of removing the > > > VMA from the tree, then calling synchronize_srcu() to wait for all faults > > > to finish, then putting the backing file, etc, etc and freeing the VMA. > > > > call_srcu(), and the (s)rcu callback will then fput() and such things > > more. > > > > synchronize_srcu() (like synchronize_rcu()) is stupid slow and would > > make munmap()/exit()/etc.. unusable. > > I'll need to think about that a bit. I was convinced we needed to wait > for the current pagefaults to finish before we could return from munmap(). > I need to convince myself that it's OK to return to userspace while the > page faults for that range are still proceeding on other CPUs. File-io might be in progress; any actual faults will result in SIGSEGV instead of installing a PTE. It is not fundamentally different from any threaded use-after-free race. > > > This seems pretty reasonable, and investigation could actually proceed > > > before the Maple tree work lands. 
Today, that would be: > > > > > > srcu_read_lock(&vmas_srcu); > > > down_read(&mm->mmap_sem); > > > find_vma(mm, address); > > > up_read(&mm->mmap_sem); > > > ... rest of fault handler path ... > > > srcu_read_unlock(&vmas_srcu); > > > > > > Kind of a pain because we still call find_vma() in the per-arch page > > > fault handler, but for prototyping, we'd only have to do one or two > > > architectures. > > > > If you look at the earlier speculative page-fault patches by Laurent, > > which were based on my still earlier patches, you'll find most of this > > there. > > > > The tricky bit was validating everything on the second page-table walk, > > so see if nothing had fundamentally changed, specifically the VMA, > > before installing the PTE. If you do this without mmap_sem, you need to > > hold ptlock to pin stuff while validating everything you did earlier. > > The patches Laurent posted used regular RCU and a per-VMA refcount, not > SRCU. Those are his later patches, and I distinctly disagree with that approach. If you look at the patches here: https://lkml.kernel.org/r/cover.1479465699.git.ldufour@linux.vnet.ibm.com you'll find it uses SRCU. > If you use SRCU, why would you need a second page table walk? Because SRCU only ensures the VMA object remains extant; it does not prevent modification of it. Normally that guarantee is provided by mmap_sem, but we're not going to use that. Instead, what we serialize on is the (split) ptlock. So we do the first page-walk and ptlock to verify the vma-lookup, then we drop ptlock and do the file-io, then we page-walk and take ptlock again, verify the vma (again) and install the PTE. If anything goes wrong, we bail. See this patch: https://lkml.kernel.org/r/301fb863785f37c319b493bd0d43167353871804.1479465699.git.ldufour@linux.vnet.ibm.com ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Re: Splitting the mmap_sem 2020-02-07 8:52 ` Peter Zijlstra @ 2020-02-10 22:00 ` Matthew Wilcox 2020-02-19 17:14 ` Laurent Dufour 1 sibling, 0 replies; 31+ messages in thread From: Matthew Wilcox @ 2020-02-10 22:00 UTC (permalink / raw) To: Peter Zijlstra Cc: SeongJae Park, Michal Hocko, Vlastimil Babka, Kirill A. Shutemov, linux-mm On Fri, Feb 07, 2020 at 09:52:34AM +0100, Peter Zijlstra wrote: > On Thu, Feb 06, 2020 at 01:20:24PM -0800, Matthew Wilcox wrote: > > If you use SRCU, why would you need a second page table walk? > > Because SRCU only ensures the VMA object remains extant, it does not > prevent modification of it, normally that guarantee is provided by > mmap_sem, but we're not going to use that. > > Instead, what we serialize on is the (split) ptlock. So we do the first > page-walk and ptlock to verify the vma-lookup, then we drop ptlock and > do the file-io, then we page-walk and take ptlock again, verify the vma > (again) and install the PTE. If anything goes wrong, we bail. > > See this patch: > > https://lkml.kernel.org/r/301fb863785f37c319b493bd0d43167353871804.1479465699.git.ldufour@linux.vnet.ibm.com OK, this gets to a difference in how we see modification to a VMA working. You take a seqcount in patch 5: https://lore.kernel.org/linux-mm/95790e53bfcfb536eb8f1dcdf4750e7e8050d8f4.1479465699.git.ldufour@linux.vnet.ibm.com/ I was intending to postpone the page table flushing parts of a VMA modification until a grace period had passed, but what you're doing is faster. ^ permalink raw reply [flat|nested] 31+ messages in thread
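The seqcount in Laurent's patch 5 is the classic read/retry protocol: the writer bumps the sequence to odd before touching the VMA and back to even afterwards, and a speculative reader that observes a changed (or odd) sequence discards its work and retries. A minimal standalone model, with kernel-flavoured but invented names:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy seqcount in the style of the kernel's seqcount_t.  An odd
 * sequence means a writer is mid-update; readers sample the sequence
 * before and after their critical section and retry on any change. */
struct toy_seqcount {
    atomic_uint seq;
};

unsigned toy_read_begin(struct toy_seqcount *s)
{
    unsigned seq;

    while ((seq = atomic_load(&s->seq)) & 1)
        ;                       /* writer in progress, wait it out */
    return seq;
}

/* True if a writer ran (or is running) since toy_read_begin(). */
bool toy_read_retry(struct toy_seqcount *s, unsigned start)
{
    return atomic_load(&s->seq) != start;
}

void toy_write_begin(struct toy_seqcount *s)
{
    atomic_fetch_add(&s->seq, 1);   /* sequence becomes odd */
}

void toy_write_end(struct toy_seqcount *s)
{
    atomic_fetch_add(&s->seq, 1);   /* even again: update published */
}
```

A speculative fault would toy_read_begin(), snapshot the VMA fields, do the slow work, then check toy_read_retry() under ptlock just before installing the PTE, falling back to the locked path on retry — which is what lets the modification side avoid waiting out a grace period.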
* Re: Splitting the mmap_sem 2020-02-07 8:52 ` Peter Zijlstra 2020-02-10 22:00 ` Matthew Wilcox @ 2020-02-19 17:14 ` Laurent Dufour 1 sibling, 0 replies; 31+ messages in thread From: Laurent Dufour @ 2020-02-19 17:14 UTC (permalink / raw) To: Peter Zijlstra, Matthew Wilcox Cc: SeongJae Park, Michal Hocko, Vlastimil Babka, Kirill A. Shutemov, linux-mm On 07/02/2020 at 09:52, Peter Zijlstra wrote: > On Thu, Feb 06, 2020 at 01:20:24PM -0800, Matthew Wilcox wrote: >> On Thu, Feb 06, 2020 at 09:55:29PM +0100, Peter Zijlstra wrote: >>> On Thu, Feb 06, 2020 at 12:15:36PM -0800, Matthew Wilcox wrote: >>>> then, at the beginning of a page fault call srcu_read_lock(&vma_srcu); >>>> walk the tree as we do now, allocate memory for PTEs, sleep waiting for >>>> pages to arrive back from disc, etc, etc, then at the end of the fault, >>>> call srcu_read_unlock(&vma_srcu). >>> >>> So far so good,... >>> >>>> munmap() would consist of removing the >>>> VMA from the tree, then calling synchronize_srcu() to wait for all faults >>>> to finish, then putting the backing file, etc, etc and freeing the VMA. >>> >>> call_srcu(), and the (s)rcu callback will then fput() and such things >>> more. >>> >>> synchronize_srcu() (like synchronize_rcu()) is stupid slow and would >>> make munmap()/exit()/etc.. unusable. >> >> I'll need to think about that a bit. I was convinced we needed to wait >> for the current pagefaults to finish before we could return from munmap(). >> I need to convince myself that it's OK to return to userspace while the >> page faults for that range are still proceeding on other CPUs. > > File-io might be in progress, any actual faults will result in SIGFAULT > instead of installing a PTE. > > It is not fundamentally different from any threaded uaf race. > >>>> This seems pretty reasonable, and investigation could actually proceed >>>> before the Maple tree work lands. 
Today, that would be: >>>> >>>> srcu_read_lock(&vmas_srcu); >>>> down_read(&mm->mmap_sem); >>>> find_vma(mm, address); >>>> up_read(&mm->mmap_sem); >>>> ... rest of fault handler path ... >>>> srcu_read_unlock(&vmas_srcu); >>>> >>>> Kind of a pain because we still call find_vma() in the per-arch page >>>> fault handler, but for prototyping, we'd only have to do one or two >>>> architectures. >>> >>> If you look at the earlier speculative page-fault patches by Laurent, >>> which were based on my still earlier patches, you'll find most of this >>> there. >>> >>> The tricky bit was validating everything on the second page-table walk, >>> so see if nothing had fundamentally changed, specifically the VMA, >>> before installing the PTE. If you do this without mmap_sem, you need to >>> hold ptlock to pin stuff while validating everything you did earlier. >> >> The patches Laurent posted used regular RCU and a per-VMA refcount, not >> SRCU. > > That are his later patches, and I distinctly disagree with that > approach. > > If you look at the patches here: > > https://lkml.kernel.org/r/cover.1479465699.git.ldufour@linux.vnet.ibm.com > > you'll find it uses SRCU. For the record, I switched from SRCU to RCU and a ref counter because, with SRCU, performance was impacted by the scheduling generated to handle SRCU's asynchronous events. I may have missed something, but using RCU and a ref counter worked around this 20% overhead. >> If you use SRCU, why would you need a second page table walk? > > Because SRCU only ensures the VMA object remains extant, it does not > prevent modification of it, normally that guarantee is provided by > mmap_sem, but we're not going to use that. > > Instead, what we serialize on is the (split) ptlock. So we do the first > page-walk and ptlock to verify the vma-lookup, then we drop ptlock and > do the file-io, then we page-walk and take ptlock again, verify the vma > (again) and install the PTE. If anything goes wrong, we bail. 
> > See this patch: > > https://lkml.kernel.org/r/301fb863785f37c319b493bd0d43167353871804.1479465699.git.ldufour@linux.vnet.ibm.com > ^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2020-02-19 17:14 UTC | newest] Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-12-03 22:21 Splitting the mmap_sem Matthew Wilcox 2019-12-05 17:21 ` Jerome Glisse 2019-12-06 5:13 ` Matthew Wilcox 2019-12-06 17:30 ` Jerome Glisse 2019-12-09 3:33 ` Matthew Wilcox 2019-12-09 14:17 ` Jerome Glisse 2019-12-10 15:26 ` Vlastimil Babka 2019-12-10 16:07 ` Jerome Glisse 2019-12-10 18:09 ` Vlastimil Babka 2019-12-12 14:24 ` Kirill A. Shutemov 2019-12-12 15:40 ` Matthew Wilcox 2019-12-12 15:46 ` Kirill A. Shutemov 2019-12-13 14:33 ` Matthew Wilcox 2019-12-13 18:06 ` Kirill A. Shutemov 2019-12-13 18:21 ` Matthew Wilcox 2020-01-06 22:09 ` Matthew Wilcox 2020-01-07 12:34 ` Kirill A. Shutemov 2020-01-07 13:54 ` Matthew Wilcox 2020-01-07 14:27 ` Kirill A. Shutemov 2020-01-09 13:56 ` Vlastimil Babka 2020-01-09 17:03 ` Michal Hocko 2020-01-09 17:07 ` Michal Hocko 2020-01-09 17:32 ` SeongJae Park 2020-01-09 20:13 ` Matthew Wilcox 2020-02-06 13:59 ` Peter Zijlstra 2020-02-06 20:15 ` Matthew Wilcox 2020-02-06 20:55 ` Peter Zijlstra 2020-02-06 21:20 ` Matthew Wilcox 2020-02-07 8:52 ` Peter Zijlstra 2020-02-10 22:00 ` Matthew Wilcox 2020-02-19 17:14 ` Laurent Dufour