* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
       [not found] ` <alpine.DEB.2.20.1704111152170.25069-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
@ 2017-04-11 19:00 ` Vlastimil Babka
  2017-04-12 21:25   ` Christoph Lameter
  0 siblings, 1 reply; 21+ messages in thread
From: Vlastimil Babka @ 2017-04-11 19:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
      cgroups-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Michal Hocko, Mel Gorman,
      David Rientjes, Hugh Dickins, Andrea Arcangeli, Anshuman Khandual,
      Kirill A. Shutemov, linux-api-u79uwXL29TY76Z2rM5mHXA

+CC linux-api

On 11.4.2017 19:24, Christoph Lameter wrote:
> On Tue, 11 Apr 2017, Vlastimil Babka wrote:
>
>> The root of the problem is that the cpuset's mems_allowed and mempolicy's
>> nodemask can temporarily have no intersection, thus get_page_from_freelist()
>> cannot find any usable zone. The current semantic for empty intersection is to
>> ignore mempolicy's nodemask and honour cpuset restrictions. This is checked in
>> node_zonelist(), but the racy update can happen after we already passed the
>
> The fallback was only intended for a cpuset on which boundaries are not enforced
> in critical conditions (softwall). A hardwall cpuset (CS_MEM_HARDWALL)
> should fail the allocation.

Hmm just to clarify - I'm talking about ignoring the *mempolicy's* nodemask on
the basis of cpuset having higher priority, while you seem to be talking about
ignoring a (softwall) cpuset nodemask, right? man set_mempolicy says "... if
required nodemask contains no nodes that are allowed by the process's current
cpuset context, the memory policy reverts to local allocation" which does come
down to ignoring mempolicy's nodemask.

>> This patch fixes the issue by having __alloc_pages_slowpath() check for empty
>> intersection of cpuset and ac->nodemask before OOM or allocation failure. If
>> it's indeed empty, the nodemask is ignored and allocation retried, which mimics
>> node_zonelist(). This works fine, because almost all callers of
>
> Well that would need to be subject to the hardwall flag. Allocation needs
> to fail for a hardwall cpuset.

They still do, if no hardwall cpuset node can satisfy the allocation with
mempolicy ignored.

^ permalink raw reply [flat|nested] 21+ messages in thread
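A minimal sketch of the fallback described in the message above, assuming node sets are modelled as plain Python sets of node IDs. The function name `effective_nodes` and its parameters are hypothetical and only illustrate the documented semantic (empty intersection means the mempolicy nodemask is ignored and the cpuset is honoured); the kernel itself works on nodemask bitmaps and zonelists.

```python
def effective_nodes(policy_nodes, cpuset_nodes):
    """Model of the documented fallback (illustrative, not kernel code).

    policy_nodes: node IDs requested by the mempolicy (hypothetical input)
    cpuset_nodes: node IDs in the cpuset's mems_allowed (hypothetical input)
    """
    overlap = set(policy_nodes) & set(cpuset_nodes)
    # Empty intersection: ignore the mempolicy nodemask and honour
    # the cpuset restriction only, as the quoted man page text describes.
    return overlap if overlap else set(cpuset_nodes)
```

With an overlapping policy the intersection is used; with a disjoint policy the result is the cpuset's own node set, so an allocation can still fail if none of those nodes has free memory.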
* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
  2017-04-11 19:00 ` [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update Vlastimil Babka
@ 2017-04-12 21:25   ` Christoph Lameter
       [not found]     ` <alpine.DEB.2.20.1704121617040.28335-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Lameter @ 2017-04-12 21:25 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, cgroups, Li Zefan, Michal Hocko, Mel Gorman,
      David Rientjes, Hugh Dickins, Andrea Arcangeli, Anshuman Khandual,
      Kirill A. Shutemov, linux-api

On Tue, 11 Apr 2017, Vlastimil Babka wrote:

> > The fallback was only intended for a cpuset on which boundaries are not enforced
> > in critical conditions (softwall). A hardwall cpuset (CS_MEM_HARDWALL)
> > should fail the allocation.
>
> Hmm just to clarify - I'm talking about ignoring the *mempolicy's* nodemask on
> the basis of cpuset having higher priority, while you seem to be talking about
> ignoring a (softwall) cpuset nodemask, right? man set_mempolicy says "... if
> required nodemask contains no nodes that are allowed by the process's current
> cpuset context, the memory policy reverts to local allocation" which does come
> down to ignoring mempolicy's nodemask.

I am talking of allocating outside of the current allowed nodes
(determined by mempolicy -- MPOL_BIND is the only concern as far as I can
tell -- as well as the current cpuset). One can violate the cpuset if it's not
a hardwall but the MPOL_BIND node restriction cannot be violated.

Those allocations are also not allowed if the allocation was for a user
space page even if this is a softwall cpuset.

> >> This patch fixes the issue by having __alloc_pages_slowpath() check for empty
> >> intersection of cpuset and ac->nodemask before OOM or allocation failure. If
> >> it's indeed empty, the nodemask is ignored and allocation retried, which mimics
> >> node_zonelist(). This works fine, because almost all callers of
> >
> > Well that would need to be subject to the hardwall flag. Allocation needs
> > to fail for a hardwall cpuset.
>
> They still do, if no hardwall cpuset node can satisfy the allocation with
> mempolicy ignored.

If the memory policy is MPOL_BIND then allocations outside of the given
nodes should fail. They can violate the cpuset boundaries only if they are
kernel allocations and we are not in a hardwall cpuset.

That was at least my understanding when working on this code years ago.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply [flat|nested] 21+ messages in thread
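The rule stated in this message can be written down as a small predicate. This is only a sketch of the rule as Christoph describes it, not of what the kernel actually implements; the function `may_allocate_on` and all of its parameters are hypothetical names chosen for illustration.

```python
def may_allocate_on(node, bind_nodes, cpuset_nodes, hardwall, user_page):
    """Model of the rule as stated in the message (illustrative only).

    node:         candidate node ID for the allocation
    bind_nodes:   MPOL_BIND node set, or None when no BIND policy applies
    cpuset_nodes: the cpuset's allowed memory nodes
    hardwall:     True for a hardwall cpuset
    user_page:    True when the allocation is for a user space page
    """
    # A BIND restriction can never be violated.
    if bind_nodes is not None and node not in bind_nodes:
        return False
    # Inside the cpuset there is nothing further to check.
    if node in cpuset_nodes:
        return True
    # Outside the cpuset: only kernel allocations in a softwall cpuset.
    return not hardwall and not user_page
```

Under this reading, a disjoint BIND policy and hardwall cpuset leave no node on which the allocation may succeed, which is the failure Christoph argues for.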
[parent not found: <alpine.DEB.2.20.1704121617040.28335-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>]
* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
       [not found]     ` <alpine.DEB.2.20.1704121617040.28335-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
@ 2017-04-13  6:24       ` Vlastimil Babka
  2017-04-14 20:37         ` Christoph Lameter
  0 siblings, 1 reply; 21+ messages in thread
From: Vlastimil Babka @ 2017-04-13  6:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
      cgroups-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Michal Hocko, Mel Gorman,
      David Rientjes, Hugh Dickins, Andrea Arcangeli, Anshuman Khandual,
      Kirill A. Shutemov, linux-api-u79uwXL29TY76Z2rM5mHXA

On 04/12/2017 11:25 PM, Christoph Lameter wrote:
> On Tue, 11 Apr 2017, Vlastimil Babka wrote:
>
>>> The fallback was only intended for a cpuset on which boundaries are not enforced
>>> in critical conditions (softwall). A hardwall cpuset (CS_MEM_HARDWALL)
>>> should fail the allocation.
>>
>> Hmm just to clarify - I'm talking about ignoring the *mempolicy's* nodemask on
>> the basis of cpuset having higher priority, while you seem to be talking about
>> ignoring a (softwall) cpuset nodemask, right? man set_mempolicy says "... if
>> required nodemask contains no nodes that are allowed by the process's current
>> cpuset context, the memory policy reverts to local allocation" which does come
>> down to ignoring mempolicy's nodemask.
>
> I am talking of allocating outside of the current allowed nodes
> (determined by mempolicy -- MPOL_BIND is the only concern as far as I can
> tell -- as well as the current cpuset). One can violate the cpuset if it's not
> a hardwall but the MPOL_BIND node restriction cannot be violated.
>
> Those allocations are also not allowed if the allocation was for a user
> space page even if this is a softwall cpuset.
>
>>>> This patch fixes the issue by having __alloc_pages_slowpath() check for empty
>>>> intersection of cpuset and ac->nodemask before OOM or allocation failure. If
>>>> it's indeed empty, the nodemask is ignored and allocation retried, which mimics
>>>> node_zonelist(). This works fine, because almost all callers of
>>>
>>> Well that would need to be subject to the hardwall flag. Allocation needs
>>> to fail for a hardwall cpuset.
>>
>> They still do, if no hardwall cpuset node can satisfy the allocation with
>> mempolicy ignored.
>
> If the memory policy is MPOL_BIND then allocations outside of the given
> nodes should fail. They can violate the cpuset boundaries only if they are
> kernel allocations and we are not in a hardwall cpuset.
>
> That was at least my understanding when working on this code years ago.

Hmm, I see policy_nodemask() (I wrongly mentioned node_zonelist() before)
ignores BIND mempolicy nodemask when it doesn't overlap with cpuset
allowed nodes since initial git commit 1da177e4c3f4 (back then it was
zonelist_policy()). But AFAIU this couldn't actually happen (outside of
races), because 1) one is not allowed to create such effectively empty
BIND mempolicy in the first place and 2) an existing mempolicy is rebound
on cpuset changes to maintain the overlap.

The point 2) does not apply to MPOL_F_STATIC_NODES mempolicies introduced
in 2008 by DavidR, but it's documented in
Documentation/vm/numa_memory_policy.txt and manpages that when they don't
overlap with cpuset allowed nodes, the default mempolicy is used instead.

I doubt we can change that now, because that can break existing
programs. It also makes some sense at least to me, because a task can
control its own mempolicy (for performance reasons), but cpuset changes
are admin decisions that the task cannot even anticipate. I think it's
better to continue working with suboptimal performance than start
failing allocations?

^ permalink raw reply [flat|nested] 21+ messages in thread
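The difference between rebound and static mempolicies described in this message can be sketched as follows. This is a deliberately simplified model under stated assumptions: the ordinal remapping used for the non-static case is an illustrative stand-in rather than the kernel's rebinding code, and the function `rebind` with its parameters is hypothetical. `None` stands for "fall back to the default (local) policy".

```python
def rebind(policy_nodes, old_allowed, new_allowed, static):
    """Simplified model of a mempolicy after a cpuset change.

    policy_nodes: nodes in the task's mempolicy (subset of old_allowed
                  for the non-static case)
    old_allowed:  cpuset nodes before the change
    new_allowed:  cpuset nodes after the change (assumed non-empty)
    static:       True models MPOL_F_STATIC_NODES
    """
    if static:
        # Static policies keep their node IDs; only nodes that are still
        # allowed remain usable. No overlap means the default policy.
        usable = set(policy_nodes) & set(new_allowed)
        return usable or None
    # Non-static policies are rebound so the overlap is maintained:
    # here the n-th old allowed node maps to the n-th new allowed node.
    old = sorted(old_allowed)
    new = sorted(new_allowed)
    return {new[old.index(n) % len(new)] for n in policy_nodes if n in old}
```

In this model a rebound policy always ends up inside the new cpuset, while a static policy can lose its whole node set, which is the case the documented fallback covers.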
* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
  2017-04-13  6:24       ` Vlastimil Babka
@ 2017-04-14 20:37         ` Christoph Lameter
  2017-04-26  8:07           ` Vlastimil Babka
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Lameter @ 2017-04-14 20:37 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, cgroups, Li Zefan, Michal Hocko, Mel Gorman,
      David Rientjes, Hugh Dickins, Andrea Arcangeli, Anshuman Khandual,
      Kirill A. Shutemov, linux-api

On Thu, 13 Apr 2017, Vlastimil Babka wrote:

>
> I doubt we can change that now, because that can break existing
> programs. It also makes some sense at least to me, because a task can
> control its own mempolicy (for performance reasons), but cpuset changes
> are admin decisions that the task cannot even anticipate. I think it's
> better to continue working with suboptimal performance than start
> failing allocations?

If the expected semantics (hardwall) are that allocations should fail then
let's be consistent and do so.

Adding more and more exceptions gets this convoluted mess into an even
worse shape. Adding the static binding of nodes was already a screwball
if used within a cpuset because now one has to anticipate how a user would
move the nodes of a cpuset and how the static bindings would work in such
a context.

The admin basically needs to know how the application has used memory
policies if one still wants to move the applications within a cpuset with
the fixed bindings.

Maybe the best way to handle this is to give up on cpuset migration of
live applications? After all this can be done with a script in the same
way as the kernel is doing:

1. Extend the cpuset to include the new nodes.

2. Loop over the processes and use the migrate_pages() to move the apps
one by one.

3. Remove the nodes no longer to be used.

Then forget about translating memory policies. If an application that is
supposed to run in a cpuset and supposed to be moveable has fixed bindings
then the application should be aware of that and be equipped with
some logic to rebind its memory on its own.

Such an application typically already has such logic and executes a
binding after discovering its numa node configuration on startup. It would
have to be modified to redo that action when it gets some sort of a signal
from the script telling it that the node config would be changed.

Having this logic in the application instead of the kernel avoids all the
kernel messes that we keep on trying to deal with and IMHO is much
cleaner.

^ permalink raw reply [flat|nested] 21+ messages in thread
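The three steps proposed in this message can be sketched as a function that only builds the list of actions, so the ordering is easy to inspect without touching a live system. Several things here are assumptions rather than anything the thread specifies: the cpuset directory path and the `cpuset.mems` file name (a cgroup v1 layout), the use of the `migratepages` command from the numactl package as a wrapper around migrate_pages(), and the helper name `cpuset_move_plan` itself.

```python
def cpuset_move_plan(cpuset_dir, pids, old_nodes, new_nodes):
    """Build the ordered actions for moving a cpuset to new nodes.

    Returns a list of ("write", path, value) and ("run", argv) tuples.
    Nothing is executed here; the caller decides how to apply the plan.
    """
    def fmt(nodes):
        # Node lists are written as comma-separated IDs, e.g. "0,1".
        return ",".join(str(n) for n in sorted(nodes))

    mems = f"{cpuset_dir}/cpuset.mems"  # assumed cgroup v1 file name
    # Step 1: extend the cpuset so old and new nodes are both allowed.
    plan = [("write", mems, fmt(set(old_nodes) | set(new_nodes)))]
    # Step 2: migrate each process's pages from the old to the new nodes.
    for pid in pids:
        plan.append(("run", ["migratepages", str(pid),
                             fmt(old_nodes), fmt(new_nodes)]))
    # Step 3: shrink the cpuset to the new nodes only.
    plan.append(("write", mems, fmt(new_nodes)))
    return plan
```

Expanding first and shrinking last means the cpuset itself never passes through an empty node set; the sketch does nothing about static mempolicies, which is the case the rest of the thread argues about.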
* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
  2017-04-14 20:37         ` Christoph Lameter
@ 2017-04-26  8:07           ` Vlastimil Babka
  2017-04-30 21:33             ` Christoph Lameter
  0 siblings, 1 reply; 21+ messages in thread
From: Vlastimil Babka @ 2017-04-26  8:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, linux-kernel, cgroups, Li Zefan, Michal Hocko, Mel Gorman,
      David Rientjes, Hugh Dickins, Andrea Arcangeli, Anshuman Khandual,
      Kirill A. Shutemov, linux-api

On 04/14/2017 10:37 PM, Christoph Lameter wrote:
> On Thu, 13 Apr 2017, Vlastimil Babka wrote:
>
>>
>> I doubt we can change that now, because that can break existing
>> programs. It also makes some sense at least to me, because a task can
>> control its own mempolicy (for performance reasons), but cpuset changes
>> are admin decisions that the task cannot even anticipate. I think it's
>> better to continue working with suboptimal performance than start
>> failing allocations?
>
> If the expected semantics (hardwall) are that allocations should fail then
> let's be consistent and do so.

It's not "expected" right now. The documented semantics is that (static,
as the others are rebound) mempolicy is ignored when it's not compatible
with cpuset. I'm just reusing the same existing semantic for race
situations. We can discuss whether we can change the semantics now, but
I don't think it should block this fix.

> Adding more and more exceptions gets this convoluted mess into an even
> worse shape.

Again, it's not a new exception semantics-wise, but I agree that the
code of __alloc_pages_slowpath() is even more subtle. But I don't see
any other easy fix.

> Adding the static binding of nodes was already a screwball
> if used within a cpuset because now one has to anticipate how a user would
> move the nodes of a cpuset and how the static bindings would work in such
> a context.

On the other hand, static mempolicy is the only one that does not need
rebinding, and removing the other modes would allow much simpler
implementation. I thought the outcome of LSF/MM session was that we
should try to go that way.

> The admin basically needs to know how the application has used memory
> policies if one still wants to move the applications within a cpuset with
> the fixed bindings.
>
> Maybe the best way to handle this is to give up on cpuset migration of
> live applications? After all this can be done with a script in the same
> way as the kernel is doing:
>
> 1. Extend the cpuset to include the new nodes.
>
> 2. Loop over the processes and use the migrate_pages() to move the apps
> one by one.
>
> 3. Remove the nodes no longer to be used.
>
> Then forget about translating memory policies. If an application that is
> supposed to run in a cpuset and supposed to be moveable has fixed bindings
> then the application should be aware of that and be equipped with
> some logic to rebind its memory on its own.
>
> Such an application typically already has such logic and executes a
> binding after discovering its numa node configuration on startup. It would
> have to be modified to redo that action when it gets some sort of a signal
> from the script telling it that the node config would be changed.
>
> Having this logic in the application instead of the kernel avoids all the
> kernel messes that we keep on trying to deal with and IMHO is much
> cleaner.

That would be much simpler for us indeed. But we still IMHO can't
abruptly start denying page fault allocations for existing applications
that don't have the necessary awareness.

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
  2017-04-26  8:07           ` Vlastimil Babka
@ 2017-04-30 21:33             ` Christoph Lameter
       [not found]               ` <alpine.DEB.2.20.1704301628460.21533-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Lameter @ 2017-04-30 21:33 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, cgroups, Li Zefan, Michal Hocko, Mel Gorman,
      David Rientjes, Hugh Dickins, Andrea Arcangeli, Anshuman Khandual,
      Kirill A. Shutemov, linux-api

On Wed, 26 Apr 2017, Vlastimil Babka wrote:

> > Such an application typically already has such logic and executes a
> > binding after discovering its numa node configuration on startup. It would
> > have to be modified to redo that action when it gets some sort of a signal
> > from the script telling it that the node config would be changed.
> >
> > Having this logic in the application instead of the kernel avoids all the
> > kernel messes that we keep on trying to deal with and IMHO is much
> > cleaner.
>
> That would be much simpler for us indeed. But we still IMHO can't
> abruptly start denying page fault allocations for existing applications
> that don't have the necessary awareness.

We certainly can do that. The failure of the page faults is due to the
admin trying to move an application that is not aware of this and is using
mempols. That could be an error. Trying to move an application that
contains both absolute and relative node numbers is definitely something
that is potentially so screwed up that the kernel should not muck around
with such an app.

Also user space can determine if the application is using memory policies
and can then take appropriate measures (message to the sysadmin to eval
the situation, f.e.) or mess around with the process's memory policies on
its own.

So this is certainly a way out of this mess.

^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <alpine.DEB.2.20.1704301628460.21533-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>]
* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
       [not found]               ` <alpine.DEB.2.20.1704301628460.21533-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
@ 2017-05-17  9:20                 ` Michal Hocko
       [not found]                   ` <20170517092042.GH18247-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2017-05-17  9:20 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Vlastimil Babka, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
      linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA,
      Li Zefan, Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
      Anshuman Khandual, Kirill A. Shutemov, linux-api-u79uwXL29TY76Z2rM5mHXA

On Sun 30-04-17 16:33:10, Christoph Lameter wrote:
> On Wed, 26 Apr 2017, Vlastimil Babka wrote:
>
> > > Such an application typically already has such logic and executes a
> > > binding after discovering its numa node configuration on startup. It would
> > > have to be modified to redo that action when it gets some sort of a signal
> > > from the script telling it that the node config would be changed.
> > >
> > > Having this logic in the application instead of the kernel avoids all the
> > > kernel messes that we keep on trying to deal with and IMHO is much
> > > cleaner.
> >
> > That would be much simpler for us indeed. But we still IMHO can't
> > abruptly start denying page fault allocations for existing applications
> > that don't have the necessary awareness.
>
> We certainly can do that. The failure of the page faults is due to the
> admin trying to move an application that is not aware of this and is using
> mempols. That could be an error. Trying to move an application that
> contains both absolute and relative node numbers is definitely something
> that is potentially so screwed up that the kernel should not muck around
> with such an app.
>
> Also user space can determine if the application is using memory policies
> and can then take appropriate measures (message to the sysadmin to eval
> the situation, f.e.) or mess around with the process's memory policies on
> its own.
>
> So this is certainly a way out of this mess.

So how are you going to distinguish VM_FAULT_OOM from an empty mempolicy
case in a raceless way?
--
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <20170517092042.GH18247-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
       [not found]                   ` <20170517092042.GH18247-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2017-05-17 13:56                     ` Christoph Lameter
       [not found]                       ` <alpine.DEB.2.20.1705170855430.7925-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Lameter @ 2017-05-17 13:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
      linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA,
      Li Zefan, Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
      Anshuman Khandual, Kirill A. Shutemov, linux-api-u79uwXL29TY76Z2rM5mHXA

On Wed, 17 May 2017, Michal Hocko wrote:

> > We certainly can do that. The failure of the page faults is due to the
> > admin trying to move an application that is not aware of this and is using
> > mempols. That could be an error. Trying to move an application that
> > contains both absolute and relative node numbers is definitely something
> > that is potentially so screwed up that the kernel should not muck around
> > with such an app.
> >
> > Also user space can determine if the application is using memory policies
> > and can then take appropriate measures (message to the sysadmin to eval
> > the situation, f.e.) or mess around with the process's memory policies on
> > its own.
> >
> > So this is certainly a way out of this mess.
>
> So how are you going to distinguish VM_FAULT_OOM from an empty mempolicy
> case in a raceless way?

You don't have to do that if you do not create an empty mempolicy in the
first place. The current kernel code avoids that by first allowing access
to the new set of nodes and removing the old ones from the set when done.

^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <alpine.DEB.2.20.1705170855430.7925-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>]
* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
       [not found]                       ` <alpine.DEB.2.20.1705170855430.7925-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
@ 2017-05-17 14:05                         ` Michal Hocko
  2017-05-17 14:48                           ` Christoph Lameter
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2017-05-17 14:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Vlastimil Babka, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
      linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA,
      Li Zefan, Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
      Anshuman Khandual, Kirill A. Shutemov, linux-api-u79uwXL29TY76Z2rM5mHXA

On Wed 17-05-17 08:56:34, Christoph Lameter wrote:
> On Wed, 17 May 2017, Michal Hocko wrote:
>
> > > We certainly can do that. The failure of the page faults is due to the
> > > admin trying to move an application that is not aware of this and is using
> > > mempols. That could be an error. Trying to move an application that
> > > contains both absolute and relative node numbers is definitely something
> > > that is potentially so screwed up that the kernel should not muck around
> > > with such an app.
> > >
> > > Also user space can determine if the application is using memory policies
> > > and can then take appropriate measures (message to the sysadmin to eval
> > > the situation, f.e.) or mess around with the process's memory policies on
> > > its own.
> > >
> > > So this is certainly a way out of this mess.
> >
> > So how are you going to distinguish VM_FAULT_OOM from an empty mempolicy
> > case in a raceless way?
>
> You don't have to do that if you do not create an empty mempolicy in the
> first place. The current kernel code avoids that by first allowing access
> to the new set of nodes and removing the old ones from the set when done.

which is racy, as Vlastimil pointed out. If we simply fail such an
allocation the failure will go up the call chain until we hit the OOM
killer due to VM_FAULT_OOM. How would you want to handle that?
--
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
  2017-05-17 14:05                         ` Michal Hocko
@ 2017-05-17 14:48                           ` Christoph Lameter
       [not found]                             ` <alpine.DEB.2.20.1705170943090.8714-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
  2017-05-18 10:03                             ` Vlastimil Babka
  0 siblings, 2 replies; 21+ messages in thread
From: Christoph Lameter @ 2017-05-17 14:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, linux-mm, linux-kernel, cgroups, Li Zefan,
      Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
      Anshuman Khandual, Kirill A. Shutemov, linux-api

On Wed, 17 May 2017, Michal Hocko wrote:

> > > So how are you going to distinguish VM_FAULT_OOM from an empty mempolicy
> > > case in a raceless way?
> >
> > You don't have to do that if you do not create an empty mempolicy in the
> > first place. The current kernel code avoids that by first allowing access
> > to the new set of nodes and removing the old ones from the set when done.
>
> which is racy, as Vlastimil pointed out. If we simply fail such an
> allocation the failure will go up the call chain until we hit the OOM
> killer due to VM_FAULT_OOM. How would you want to handle that?

The race is where? If you expand the node set during the move of the
application then you are safe in terms of the legacy apps that did not
include static bindings.

If you have screwy things like static mbinds in there then you are
hopelessly lost anyways. You may have moved the process to another set
of nodes but the static bindings may refer to a node no longer
available. Thus the OOM is legitimate.

At least a user space app could inspect
the situation and come up with custom ways of dealing with the mess.

^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <alpine.DEB.2.20.1705170943090.8714-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>]
* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
       [not found]                             ` <alpine.DEB.2.20.1705170943090.8714-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
@ 2017-05-17 14:56                               ` Michal Hocko
  2017-05-17 15:25                                 ` Christoph Lameter
  2017-05-17 15:27                                 ` Christoph Lameter
  0 siblings, 2 replies; 21+ messages in thread
From: Michal Hocko @ 2017-05-17 14:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Vlastimil Babka, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
      linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA,
      Li Zefan, Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
      Anshuman Khandual, Kirill A. Shutemov, linux-api-u79uwXL29TY76Z2rM5mHXA

On Wed 17-05-17 09:48:25, Christoph Lameter wrote:
> On Wed, 17 May 2017, Michal Hocko wrote:
>
> > > > So how are you going to distinguish VM_FAULT_OOM from an empty mempolicy
> > > > case in a raceless way?
> > >
> > > You don't have to do that if you do not create an empty mempolicy in the
> > > first place. The current kernel code avoids that by first allowing access
> > > to the new set of nodes and removing the old ones from the set when done.
> >
> > which is racy, as Vlastimil pointed out. If we simply fail such an
> > allocation the failure will go up the call chain until we hit the OOM
> > killer due to VM_FAULT_OOM. How would you want to handle that?
>
> The race is where? If you expand the node set during the move of the
> application then you are safe in terms of the legacy apps that did not
> include static bindings.

I am pretty sure it is described in those changelogs and I won't repeat
it here.

> If you have screwy things like static mbinds in there then you are
> hopelessly lost anyways. You may have moved the process to another set
> of nodes but the static bindings may refer to a node no longer
> available. Thus the OOM is legitimate.

The point is that you do _not_ want such a process to trigger the OOM
because it can cause other processes being killed.

> At least a user space app could inspect
> the situation and come up with custom ways of dealing with the mess.

I do not really see how this would help to prevent a malicious user from
playing tricks.
--
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
  2017-05-17 14:56                               ` Michal Hocko
@ 2017-05-17 15:25                                 ` Christoph Lameter
       [not found]                                   ` <alpine.DEB.2.20.1705171021570.9487-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
  2017-05-17 15:27                                 ` Christoph Lameter
  1 sibling, 1 reply; 21+ messages in thread
From: Christoph Lameter @ 2017-05-17 15:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, linux-mm, linux-kernel, cgroups, Li Zefan,
      Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
      Anshuman Khandual, Kirill A. Shutemov, linux-api

On Wed, 17 May 2017, Michal Hocko wrote:

> > If you have screwy things like static mbinds in there then you are
> > hopelessly lost anyways. You may have moved the process to another set
> > of nodes but the static bindings may refer to a node no longer
> > available. Thus the OOM is legitimate.
>
> The point is that you do _not_ want such a process to trigger the OOM
> because it can cause other processes being killed.

Nope. The OOM in a cpuset gets the process doing the alloc killed. Or has
that changed?

At this point you have messed up royally and nothing is going to rescue
you anyways. OOM or not does not matter anymore. The app will fail.

> > At least a user space app could inspect
> > the situation and come up with custom ways of dealing with the mess.
>
> I do not really see how this would help to prevent a malicious user from
> playing tricks.

How did a malicious user come into this? Of course you can mess up in
significant ways if you can overflow nodes and cause an app that has
restrictions to fail but nothing is going to change that.

^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <alpine.DEB.2.20.1705171021570.9487-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>]
* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
       [not found]                                   ` <alpine.DEB.2.20.1705171021570.9487-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
@ 2017-05-18  9:08                                     ` Michal Hocko
       [not found]                                       ` <20170518090846.GD25462-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2017-05-18  9:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Vlastimil Babka, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
      linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA,
      Li Zefan, Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
      Anshuman Khandual, Kirill A. Shutemov, linux-api-u79uwXL29TY76Z2rM5mHXA

On Wed 17-05-17 10:25:09, Christoph Lameter wrote:
> On Wed, 17 May 2017, Michal Hocko wrote:
>
> > > If you have screwy things like static mbinds in there then you are
> > > hopelessly lost anyways. You may have moved the process to another set
> > > of nodes but the static bindings may refer to a node no longer
> > > available. Thus the OOM is legitimate.
> >
> > The point is that you do _not_ want such a process to trigger the OOM
> > because it can cause other processes being killed.
>
> Nope. The OOM in a cpuset gets the process doing the alloc killed. Or has
> that changed?
>
> At this point you have messed up royally and nothing is going to rescue
> you anyways. OOM or not does not matter anymore. The app will fail.

Not really. If you can trick the system to _think_ that the intersection
between mempolicy and the cpuset is empty then the OOM killer might
trigger an innocent task rather than the one which tricked it into that
situation.
--
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
From: Christoph Lameter @ 2017-05-18 16:57 UTC
To: Michal Hocko
Cc: Vlastimil Babka, linux-mm, linux-kernel, cgroups, Li Zefan, Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli, Anshuman Khandual, Kirill A. Shutemov, linux-api

On Thu, 18 May 2017, Michal Hocko wrote:

> > Nope. The OOM in a cpuset gets the process doing the alloc killed. Or was
> > that changed?

!!!!!

> >
> > At this point you have messed up royally and nothing is going to rescue
> > you anyways. OOM or not does not matter anymore. The app will fail.
>
> Not really. If you can trick the system to _think_ that the intersection
> between mempolicy and the cpuset is empty then the OOM killer might
> trigger an innocent task rather than the one which tricked it into that
> situation.

See above. OOM Kill in a cpuset does not kill an innocent task but a task
that does an allocation in that specific context meaning a task in that
cpuset that also has a memory policy.

Regardless of that the point earlier was that the moving logic can avoid
creating temporary situations of empty sets of nodes by analysing the
memory policies etc and only performing moves when doing so is safe.
* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
From: Michal Hocko @ 2017-05-18 17:24 UTC
To: Christoph Lameter
Cc: Vlastimil Babka, linux-mm, linux-kernel, cgroups, Li Zefan, Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli, Anshuman Khandual, Kirill A. Shutemov, linux-api

On Thu 18-05-17 11:57:55, Christoph Lameter wrote:
> On Thu, 18 May 2017, Michal Hocko wrote:
>
> > > Nope. The OOM in a cpuset gets the process doing the alloc killed. Or was
> > > that changed?
>
> !!!!!
>
> > >
> > > At this point you have messed up royally and nothing is going to rescue
> > > you anyways. OOM or not does not matter anymore. The app will fail.
> >
> > Not really. If you can trick the system to _think_ that the intersection
> > between mempolicy and the cpuset is empty then the OOM killer might
> > trigger an innocent task rather than the one which tricked it into that
> > situation.
>
> See above. OOM Kill in a cpuset does not kill an innocent task but a task
> that does an allocation in that specific context meaning a task in that
> cpuset that also has a memory policy.

No, the oom killer will choose the largest task in the specific NUMA
domain. If you just fail such an allocation then a page fault would get
VM_FAULT_OOM and pagefault_out_of_memory would kill a task regardless of
the cpusets.

> Regardless of that the point earlier was that the moving logic can avoid
> creating temporary situations of empty sets of nodes by analysing the
> memory policies etc and only performing moves when doing so is safe.

How are you going to do that in a raceless way? Moreover the whole
discussion is about _failing_ allocations on an empty cpuset and
mempolicy intersection.

--
Michal Hocko
SUSE Labs
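Editorial sketch of the point made in the message above. This is a toy model
under stated assumptions, not kernel code: `struct task`, `pick_victim`,
`demo_tasks` and `node1_only` are hypothetical names invented for the
illustration, and a nodemask is reduced to a plain bitmask. It only shows why
the presence or absence of a nodemask matters when choosing a victim: with a
nodemask the choice is confined to tasks that overlap it, while a NULL
nodemask (how the message describes the page fault path) leaves every task
eligible, including one in an unrelated NUMA domain.

```c
#include <stddef.h>

/* Toy model: bit n set means node n is allowed. */
typedef unsigned int nodemask_t;

struct task {
    const char *name;
    unsigned long pages;      /* memory footprint; the largest task is picked */
    nodemask_t mems_allowed;  /* nodes this task may allocate from */
};

/* Hypothetical sample data for the illustration. */
static const struct task demo_tasks[] = {
    { "big_db_on_node0",  900000, 1u << 0 },
    { "hpc_job_on_node1", 200000, 1u << 1 },
    { "shell_on_node1",     1000, 1u << 1 },
};

static const nodemask_t node1_only = 1u << 1;

/*
 * Pick the largest eligible task.
 * nodemask != NULL: only tasks whose allowed nodes intersect it are eligible.
 * nodemask == NULL: no allocation context is known, so every task is eligible.
 */
static const struct task *pick_victim(const struct task *tasks, size_t n,
                                      const nodemask_t *nodemask)
{
    const struct task *victim = NULL;

    for (size_t i = 0; i < n; i++) {
        if (nodemask && !(tasks[i].mems_allowed & *nodemask))
            continue;   /* task cannot free memory on the constrained nodes */
        if (!victim || tasks[i].pages > victim->pages)
            victim = &tasks[i];
    }
    return victim;
}
```

In this model, `pick_victim(demo_tasks, 3, &node1_only)` selects the node 1
job, while `pick_victim(demo_tasks, 3, NULL)` selects the large task on node
0, which had nothing to do with the failing allocation.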
* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
From: Christoph Lameter @ 2017-05-18 19:07 UTC
To: Michal Hocko
Cc: Vlastimil Babka, linux-mm, linux-kernel, cgroups, Li Zefan, Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli, Anshuman Khandual, Kirill A. Shutemov, linux-api

On Thu, 18 May 2017, Michal Hocko wrote:

> > See above. OOM Kill in a cpuset does not kill an innocent task but a task
> > that does an allocation in that specific context meaning a task in that
> > cpuset that also has a memory policy.
>
> No, the oom killer will choose the largest task in the specific NUMA
> domain. If you just fail such an allocation then a page fault would get
> VM_FAULT_OOM and pagefault_out_of_memory would kill a task regardless of
> the cpusets.

Ok someone screwed up that code. There still is the determination that we
have a constrained alloc:

oom_kill:
        /*
         * Check if there were limitations on the allocation (only relevant for
         * NUMA and memcg) that may require different handling.
         */
        constraint = constrained_alloc(oc);
        if (constraint != CONSTRAINT_MEMORY_POLICY)
                oc->nodemask = NULL;
        check_panic_on_oom(oc, constraint);

-- Ok. A constrained failing alloc used to terminate the allocating
process here. But it falls through to selecting a "bad process"

        if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
            current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
            current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
                get_task_struct(current);
                oc->chosen = current;
                oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
                return true;
        }

-- A constrained allocation should not get here but fail the process that
attempts the alloc.

        select_bad_process(oc);

Can we restore the old behavior? If I just specify the right memory policy
I can cause other processes to just be terminated?

> > Regardless of that the point earlier was that the moving logic can avoid
> > creating temporary situations of empty sets of nodes by analysing the
> > memory policies etc and only performing moves when doing so is safe.
>
> How are you going to do that in a raceless way? Moreover the whole
> discussion is about _failing_ allocations on an empty cpuset and
> mempolicy intersection.

Again this is only working for processes that are well behaved and it
never worked in a different way before. There was always the assumption
that a process does not allocate in the areas that have allocation
constraints and that the process does not change memory policies nor
store them somewhere for later etc etc. HPC apps typically allocate memory
on startup and then go through long times of processing and I/O.

The idea that cpuset node to node migration will work with a running
process that does arbitrary activity is a pipe dream that we should give
up. There must be constraints on a process in order to allow this to work
and as far as I can tell this is best done in userspace with a library and
by putting requirements on the applications that desire to be movable that
way.

F.e. an application that does not use memory policies or other allocation
constraints should be fine. That has been working.
* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
From: Michal Hocko @ 2017-05-19 7:37 UTC
To: Christoph Lameter
Cc: Vlastimil Babka, linux-mm, linux-kernel, cgroups, Li Zefan, Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli, Anshuman Khandual, Kirill A. Shutemov, linux-api

On Thu 18-05-17 14:07:45, Christoph Lameter wrote:
> On Thu, 18 May 2017, Michal Hocko wrote:
>
> > > See above. OOM Kill in a cpuset does not kill an innocent task but a task
> > > that does an allocation in that specific context meaning a task in that
> > > cpuset that also has a memory policy.
> >
> > No, the oom killer will choose the largest task in the specific NUMA
> > domain. If you just fail such an allocation then a page fault would get
> > VM_FAULT_OOM and pagefault_out_of_memory would kill a task regardless of
> > the cpusets.
>
> Ok someone screwed up that code. There still is the determination that we
> have a constrained alloc:

It would be much easier if you read emails more carefully. In order to
have a constrained OOM you have to have either a non-null nodemask or
zonelist which. And as I've said above you do not have them from the
pagefault_out_of_memory context. The whole point of this discussion is
_that_ failing allocations will not work currently!

> oom_kill:
>         /*
>          * Check if there were limitations on the allocation (only relevant for
>          * NUMA and memcg) that may require different handling.
>          */
>         constraint = constrained_alloc(oc);
>         if (constraint != CONSTRAINT_MEMORY_POLICY)
>                 oc->nodemask = NULL;
>         check_panic_on_oom(oc, constraint);
>
> -- Ok. A constrained failing alloc used to terminate the allocating
> process here. But it falls through to selecting a "bad process"

This behavior has been there for ~10 years.

[...]

> Can we restore the old behavior? If I just specify the right memory policy
> I can cause other processes to just be terminated?

Not normally. Because out_of_memory called from the page allocator
context makes sure to kill tasks from the same NUMA domain (see
oom_unkillable_task).

> > > Regardless of that the point earlier was that the moving logic can avoid
> > > creating temporary situations of empty sets of nodes by analysing the
> > > memory policies etc and only performing moves when doing so is safe.
> >
> > How are you going to do that in a raceless way? Moreover the whole
> > discussion is about _failing_ allocations on an empty cpuset and
> > mempolicy intersection.
>
> Again this is only working for processes that are well behaved and it
> never worked in a different way before. There was always the assumption
> that a process does not allocate in the areas that have allocation
> constraints and that the process does not change memory policies nor
> store them somewhere for later etc etc. HPC apps typically allocate memory
> on startup and then go through long times of processing and I/O.

I would call it a bad design which then triggered a lot of work to make
it semi-working over years. This is what Vlastimil tries to address now.
And yes that might mean we would have to do some restrictions on the
semantics. But as you know this is a user visible API and changing
something that has been fundamentally underdefined initially is quite
hard to fix.

--
Michal Hocko
SUSE Labs
* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
From: Christoph Lameter @ 2017-05-17 15:27 UTC
To: Michal Hocko
Cc: Vlastimil Babka, linux-mm, linux-kernel, cgroups, Li Zefan, Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli, Anshuman Khandual, Kirill A. Shutemov, linux-api

On Wed, 17 May 2017, Michal Hocko wrote:

> > The race is where? If you expand the node set during the move of the
> > application then you are safe in terms of the legacy apps that did not
> > include static bindings.
>
> I am pretty sure it is described in those changelogs and I won't repeat
> it here.

I cannot figure out what you are referring to. There are numerous
patches and discussions about OOM scenarios in this context.
* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
From: Vlastimil Babka @ 2017-05-18 10:03 UTC
To: Christoph Lameter, Michal Hocko
Cc: linux-mm, linux-kernel, cgroups, Li Zefan, Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli, Anshuman Khandual, Kirill A. Shutemov, linux-api

On 05/17/2017 04:48 PM, Christoph Lameter wrote:
> On Wed, 17 May 2017, Michal Hocko wrote:
>
>>>> So how are you going to distinguish VM_FAULT_OOM from an empty mempolicy
>>>> case in a raceless way?
>>>
>>> You dont have to do that if you do not create an empty mempolicy in the
>>> first place. The current kernel code avoids that by first allowing access
>>> to the new set of nodes and removing the old ones from the set when done.
>>
>> which is racy and as Vlastimil pointed out. If we simply fail such an
>> allocation the failure will go up the call chain until we hit the OOM
>> killer due to VM_FAULT_OOM. How would you want to handle that?
>
> The race is where? If you expand the node set during the move of the
> application then you are safe in terms of the legacy apps that did not
> include static bindings.

No, that expand/shrink by itself doesn't work against parallel
get_page_from_freelist going through a zonelist. Moving from node 0 to
1, with zonelist containing nodes 1 and 0 in that order:

- mempolicy mask is 0
- zonelist iteration checks node 1, it's not allowed, skip
- mempolicy mask is 0,1 (expand)
- mempolicy mask is 1 (shrink)
- zonelist iteration checks node 0, it's not allowed, skip
- OOM
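Editorial sketch of the interleaving listed in the message above. It is a
single-threaded toy model, not the kernel's allocator: `scan_zonelist`,
`racy_scan` and `stable_scan` are hypothetical names, a nodemask is a plain
bitmask, and the concurrent updater is modelled by handing the scan the mask
value it would observe at each step. The assumption, taken from the message,
is that the zonelist is ordered node 1 then node 0 and the mask moves from
{0} through {0,1} to {1} between the two checks.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model: bit n set means node n is allowed. */
typedef unsigned int nodemask_t;

static bool node_allowed(nodemask_t mask, int node)
{
    return (mask >> node) & 1u;
}

/*
 * Walk the zonelist once. Entry i is checked against mask_at_step[i],
 * i.e. the mask value visible at that moment. Returns the first allowed
 * node, or -1 when every entry was skipped (the OOM outcome in the list).
 */
static int scan_zonelist(const int *zonelist, const nodemask_t *mask_at_step,
                         size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (node_allowed(mask_at_step[i], zonelist[i]))
            return zonelist[i];
    }
    return -1;
}

/* The interleaving from the message: the mask changes between the checks. */
static int racy_scan(void)
{
    const int zonelist[] = { 1, 0 };
    const nodemask_t seen[] = {
        1u << 0,    /* node 1 is checked while the mask is still {0} */
        1u << 1,    /* node 0 is checked after expand {0,1} and shrink {1} */
    };

    return scan_zonelist(zonelist, seen, 2);
}

/* Control case: the mask stays the same for the whole scan. */
static int stable_scan(nodemask_t mask)
{
    const int zonelist[] = { 1, 0 };
    const nodemask_t seen[] = { mask, mask };

    return scan_zonelist(zonelist, seen, 2);
}
```

In this model `racy_scan()` finds no node even though at every instant the
mask contained at least one node with memory, whereas `stable_scan()` succeeds
for any of the three mask values the update passes through.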
* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
From: Christoph Lameter @ 2017-05-18 17:07 UTC
To: Vlastimil Babka
Cc: Michal Hocko, linux-mm, linux-kernel, cgroups, Li Zefan, Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli, Anshuman Khandual, Kirill A. Shutemov, linux-api

On Thu, 18 May 2017, Vlastimil Babka wrote:

> > The race is where? If you expand the node set during the move of the
> > application then you are safe in terms of the legacy apps that did not
> > include static bindings.
>
> No, that expand/shrink by itself doesn't work against parallel

Parallel? I think we are clear that this is inherently racy against the
app changing policies etc etc? There is a huge issue there already. The
app needs to be well behaved in some heretofore undefined way in order to
make moves clean.

> get_page_from_freelist going through a zonelist. Moving from node 0 to
> 1, with zonelist containing nodes 1 and 0 in that order:
>
> - mempolicy mask is 0
> - zonelist iteration checks node 1, it's not allowed, skip

There is an allocation from node 1? This is not allowed before the move.
So it should fail. Not skipping to another node.

> - mempolicy mask is 0,1 (expand)
> - mempolicy mask is 1 (shrink)
> - zonelist iteration checks node 0, it's not allowed, skip
> - OOM

Are you talking about a race here between zonelist scanning and the
moving? That has been there forever.

And frankly there are gazillions of these races. The best thing to do is
to get the cpuset moving logic out of the kernel and into user space.

Understand that this is a heuristic and maybe come up with a list of
restrictions that make an app safe. A safe app that can be moved must f.e.

1. Not allocate new memory while it's being moved
2. Not change memory policies after its initialization and while it's being
   moved.
3. Not save memory policy state in some variable (because the logic to
   translate the memory policies for the new context cannot find it).

...

Again cpuset process migration is a huge mess that you do not want to
have in the kernel and AFAICT this is a corner case with difficult
semantics.

Better have that in user space...
* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
From: Vlastimil Babka @ 2017-05-19 11:27 UTC
To: Christoph Lameter
Cc: Michal Hocko, linux-mm, linux-kernel, cgroups, Li Zefan, Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli, Anshuman Khandual, Kirill A. Shutemov, linux-api

On 05/18/2017 07:07 PM, Christoph Lameter wrote:
> On Thu, 18 May 2017, Vlastimil Babka wrote:
>
>>> The race is where? If you expand the node set during the move of the
>>> application then you are safe in terms of the legacy apps that did not
>>> include static bindings.
>>
>> No, that expand/shrink by itself doesn't work against parallel
>
> Parallel? I think we are clear that this is inherently racy against the
> app changing policies etc etc? There is a huge issue there already. The
> app needs to be well behaved in some heretofore undefined way in order to
> make moves clean.

The code is safe against mbind() changing a vma's mempolicy parallel to
another thread page faulting within that vma, because mbind() takes
mmap_sem for write, and page faults take it for read. The per-task
mempolicy can be changed by a set_mempolicy() call, which means the task
itself doesn't allocate stuff in parallel. So, the application never
needed to be "well behaved" wrt changing its own mempolicies.

Now with mempolicy rebinding due to cpuset migrations, the application
cannot be "well behaved" as it has no way to learn about being under a
cpuset, or cpuset change. Any application can be put in a cpuset and we
can't really expect that all would be adapted, even if the necessary
interfaces existed. Thus, the rebinding implementation in the kernel
itself has to be robust against parallel allocations.

>> get_page_from_freelist going through a zonelist. Moving from node 0 to
>> 1, with zonelist containing nodes 1 and 0 in that order:
>>
>> - mempolicy mask is 0
>> - zonelist iteration checks node 1, it's not allowed, skip
>
> There is an allocation from node 1?

Sorry, I forgot to mention the full scenario. Let's say the allocation is
on a cpu local to node 1, so it gets the zonelist from node 1, which
contains nodes 1 and 0 in that order.

> This is not allowed before the move.
> So it should fail. Not skipping to another node.
>
>> - mempolicy mask is 0,1 (expand)
>> - mempolicy mask is 1 (shrink)
>> - zonelist iteration checks node 0, it's not allowed, skip
>> - OOM
>
> Are you talking about a race here between zonelist scanning and the
> moving? That has been there forever.

As far as I can tell from my git archeology in [1] there was always some
kind of protection against the race (generation counters, two-step
protocol, seqlock...), which however had some corner cases. This patch is
merely plugging the last known one.

> And frankly there are gazillions of these races.

I don't know about any other existing race that we don't handle after
this patch.

> The best thing to do is
> to get the cpuset moving logic out of the kernel and into user space.
>
> Understand that this is a heuristic and maybe come up with a list of
> restrictions that make an app safe. A safe app that can be moved must f.e.
>
> 1. Not allocate new memory while it's being moved
> 2. Not change memory policies after its initialization and while it's being
>    moved.

As I explained earlier in this mail, changing mempolicy by the app itself
is safe; the problem was always due to cpuset-triggered rebinding.

> 3. Not save memory policy state in some variable (because the logic to
>    translate the memory policies for the new context cannot find it).
>
> ...
>
> Again cpuset process migration is a huge mess that you do not want to
> have in the kernel and AFAICT this is a corner case with difficult
> semantics.
>
> Better have that in user space...

Moving this out of the kernel etc is changing the current semantics and
breaking existing userspace; this patch is a fix within the existing
semantics.

[1] https://marc.info/?l=linux-mm&m=148611344511408&w=2
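Editorial sketch of the kind of protection the message above mentions
(generation counters, seqlock). This is a deterministic, single-threaded toy
model and an assumption about the general technique, not the code of the
patch under discussion: `struct policy`, `policy_update`, `scan_once`,
`alloc_with_retry` and the `demo_*` helpers are hypothetical names, and the
concurrent rebind is simulated by injecting the expand and shrink steps after
the first zonelist check. The idea shown is that a failed scan is trusted only
if the sequence counter did not move while the scan was running.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model: bit n set means node n is allowed. */
typedef unsigned int nodemask_t;

struct policy {
    unsigned int seq;   /* bumped on every mask update */
    nodemask_t mask;
};

static void policy_update(struct policy *p, nodemask_t new_mask)
{
    p->mask = new_mask;
    p->seq++;
}

/* One pass over the zonelist; optionally simulate a rebind after entry 0. */
static int scan_once(struct policy *p, const int *zonelist, size_t len,
                     bool inject_update)
{
    for (size_t i = 0; i < len; i++) {
        if ((p->mask >> zonelist[i]) & 1u)
            return zonelist[i];
        if (inject_update && i == 0) {
            policy_update(p, (1u << 0) | (1u << 1));   /* expand to {0,1} */
            policy_update(p, 1u << 1);                 /* shrink to {1} */
        }
    }
    return -1;
}

/* Retry the scan whenever the mask changed underneath it. */
static int alloc_with_retry(struct policy *p, const int *zonelist, size_t len,
                            bool inject_update)
{
    for (;;) {
        unsigned int start = p->seq;
        int node = scan_once(p, zonelist, len, inject_update);

        if (node >= 0)
            return node;
        if (p->seq == start)
            return -1;          /* mask was stable: the failure is genuine */
        inject_update = false;  /* the simulated rebind happens only once */
    }
}

/* Single pass with the simulated rebind: every entry is skipped. */
static int demo_single_pass(void)
{
    struct policy p = { 0, 1u << 0 };
    const int zonelist[] = { 1, 0 };

    return scan_once(&p, zonelist, 2, true);
}

/* Same scenario with the retry loop: the second pass sees mask {1}. */
static int demo_with_retry(void)
{
    struct policy p = { 0, 1u << 0 };
    const int zonelist[] = { 1, 0 };

    return alloc_with_retry(&p, zonelist, 2, true);
}

/* No rebind and no usable node: the retry loop still reports failure. */
static int demo_genuine_failure(void)
{
    struct policy p = { 0, 1u << 2 };
    const int zonelist[] = { 1, 0 };

    return alloc_with_retry(&p, zonelist, 2, false);
}
```

The retry loop only distinguishes "the mask moved while I was looking" from
"there is really nothing to allocate from"; it makes no attempt to model how
the kernel orders its reads and writes across CPUs.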