* [RFC] memory reserve for userspace oom-killer
@ 2021-04-20  1:44 Shakeel Butt
  2021-04-20  6:45 ` Michal Hocko
  ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread

From: Shakeel Butt @ 2021-04-20  1:44 UTC (permalink / raw)
To: Johannes Weiner, Roman Gushchin, Michal Hocko, Linux MM,
    Andrew Morton, Cgroups, David Rientjes, LKML, Suren Baghdasaryan
Cc: Greg Thelen, Dragos Sbirlea, Priya Duraisamy

Proposal: Provide memory guarantees to the userspace oom-killer.

Background:

Issues with the kernel oom-killer:
1. Very conservative and prefers to reclaim. Applications can suffer
for a long time.
2. Borrows the context of the allocator, which can be resource limited
(low sched priority or limited CPU quota).
3. Serialized by a global lock.
4. Very simplistic oom victim selection policy.

These issues are resolved through a userspace oom-killer by:
1. Ability to monitor arbitrary metrics (PSI, vmstat, memcg stats) to
detect suffering early.
2. Independent process context which can be given dedicated CPU quota
and high scheduling priority.
3. Can be more aggressive as required.
4. Can implement sophisticated business logic/policies.

Android's LMKD and Facebook's oomd are the prime examples of userspace
oom-killers. One of the biggest challenges for userspace oom-killers is
that they may have to function under intense memory pressure and are
prone to getting stuck in memory reclaim themselves. Current userspace
oom-killers aim to avoid this situation by preallocating user memory
and protecting themselves from global reclaim by either mlocking or
memory.min. However, a new allocation from the userspace oom-killer can
still get stuck in reclaim, and a policy-rich oom-killer does trigger
new allocations through syscalls or even the heap.

Our attempt at a userspace oom-killer faces similar challenges.
Particularly at the tail, on very highly utilized machines, we have
observed the userspace oom-killer spectacularly failing in many
possible ways in direct reclaim.
We have seen the oom-killer stuck in direct reclaim throttling, and
stuck in reclaim while allocations from interrupts keep stealing
reclaimed memory. We have even observed systems where all the processes
were stuck in throttle_direct_reclaim(), only kswapd was running, and
the interrupts kept stealing the memory reclaimed by kswapd.

To reliably solve this problem, we need to give guaranteed memory to
the userspace oom-killer. At the moment we are weighing the following
options and I would like to get some feedback.

1. prctl(PF_MEMALLOC)

The idea is to give the userspace oom-killer (just one thread, which is
finding the appropriate victims and will be sending SIGKILLs) access to
MEMALLOC reserves. Most of the time the preallocation, mlock and
memory.min will be good enough, but for the rare occasions when the
userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
protect it from reclaim and let the allocation dip into the memory
reserves.

Misuse of this feature would be risky, but it can be limited to
privileged applications. The userspace oom-killer is the only
appropriate user of this feature. This option is simple to implement.

2. Mempool

The idea is to preallocate a mempool with a given amount of memory for
the userspace oom-killer. Preferably this would be per-thread, and the
oom-killer could preallocate a mempool for its specific threads. The
core page allocator can check, before going down the reclaim path,
whether the task has private access to the mempool, and return a page
from it if so.

This option would be more complicated than the previous one, as the
lifecycle of a page from the mempool would be more sophisticated.
Additionally, the current mempool does not handle higher-order pages
and we might need to extend it to allow such allocations. On the other
hand, this feature might have more use-cases and would be less risky
than the previous option.

Another idea I had was to use a kthread-based oom-killer and provide
the policies through an eBPF program.
Though I am not sure how to make it monitor arbitrary metrics, or
whether that can be done without any allocations.

Please do provide feedback on these approaches.

thanks,
Shakeel

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [RFC] memory reserve for userspace oom-killer 2021-04-20 1:44 [RFC] memory reserve for userspace oom-killer Shakeel Butt @ 2021-04-20 6:45 ` Michal Hocko 2021-04-20 16:04 ` Shakeel Butt 2021-04-20 19:17 ` Roman Gushchin 2021-04-21 17:05 ` peter enderborg 2 siblings, 1 reply; 28+ messages in thread From: Michal Hocko @ 2021-04-20 6:45 UTC (permalink / raw) To: Shakeel Butt Cc: Johannes Weiner, Roman Gushchin, Linux MM, Andrew Morton, Cgroups, David Rientjes, LKML, Suren Baghdasaryan, Greg Thelen, Dragos Sbirlea, Priya Duraisamy On Mon 19-04-21 18:44:02, Shakeel Butt wrote: > Proposal: Provide memory guarantees to userspace oom-killer. > > Background: > > Issues with kernel oom-killer: > 1. Very conservative and prefer to reclaim. Applications can suffer > for a long time. > 2. Borrows the context of the allocator which can be resource limited > (low sched priority or limited CPU quota). > 3. Serialized by global lock. > 4. Very simplistic oom victim selection policy. > > These issues are resolved through userspace oom-killer by: > 1. Ability to monitor arbitrary metrics (PSI, vmstat, memcg stats) to > early detect suffering. > 2. Independent process context which can be given dedicated CPU quota > and high scheduling priority. > 3. Can be more aggressive as required. > 4. Can implement sophisticated business logic/policies. > > Android's LMKD and Facebook's oomd are the prime examples of userspace > oom-killers. One of the biggest challenges for userspace oom-killers > is to potentially function under intense memory pressure and are prone > to getting stuck in memory reclaim themselves. Current userspace > oom-killers aim to avoid this situation by preallocating user memory > and protecting themselves from global reclaim by either mlocking or > memory.min. However a new allocation from userspace oom-killer can > still get stuck in the reclaim and policy rich oom-killer do trigger > new allocations through syscalls or even heap. Can you be more specific please? 
> Our attempt of userspace oom-killer faces similar challenges. > Particularly at the tail on the very highly utilized machines we have > observed userspace oom-killer spectacularly failing in many possible > ways in the direct reclaim. We have seen oom-killer stuck in direct > reclaim throttling, stuck in reclaim and allocations from interrupts > keep stealing reclaimed memory. We have even observed systems where > all the processes were stuck in throttle_direct_reclaim() and only > kswapd was running and the interrupts kept stealing the memory > reclaimed by kswapd. > > To reliably solve this problem, we need to give guaranteed memory to > the userspace oom-killer. There is nothing like that. Even memory reserves are a finite resource which can be consumed as it is sharing those reserves with other users who are not necessarily coordinated. So before we start discussing making this even more muddy by handing over memory reserves to the userspace we should really examine whether pre-allocation is something that will not work. > At the moment we are contemplating between > the following options and I would like to get some feedback. > > 1. prctl(PF_MEMALLOC) > > The idea is to give userspace oom-killer (just one thread which is > finding the appropriate victims and will be sending SIGKILLs) access > to MEMALLOC reserves. Most of the time the preallocation, mlock and > memory.min will be good enough but for rare occasions, when the > userspace oom-killer needs to allocate, the PF_MEMALLOC flag will > protect it from reclaim and let the allocation dip into the memory > reserves. I do not think that handing over an unlimited ticket to the memory reserves to userspace is a good idea. Even the in kernel oom killer is bound to a partial access to reserves. So if we really want this then it should be in sync with and bound by the ALLOC_OOM. > The misuse of this feature would be risky but it can be limited to > privileged applications. 
Userspace oom-killer is the only appropriate > user of this feature. This option is simple to implement. > > 2. Mempool > > The idea is to preallocate mempool with a given amount of memory for > userspace oom-killer. Preferably this will be per-thread and > oom-killer can preallocate mempool for its specific threads. The core > page allocator can check before going to the reclaim path if the task > has private access to the mempool and return page from it if yes. Could you elaborate some more on how this would be controlled from the userspace? A dedicated syscall? A driver? > This option would be more complicated than the previous option as the > lifecycle of the page from the mempool would be more sophisticated. > Additionally the current mempool does not handle higher order pages > and we might need to extend it to allow such allocations. Though this > feature might have more use-cases and it would be less risky than the > previous option. I would tend to agree. > Another idea I had was to use kthread based oom-killer and provide the > policies through eBPF program. Though I am not sure how to make it > monitor arbitrary metrics and if that can be done without any > allocations. A kernel module or eBPF to implement oom decisions has already been discussed few years back. But I am afraid this would be hard to wire in for anything except for the victim selection. I am not sure it is maintainable to also control when the OOM handling should trigger. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] memory reserve for userspace oom-killer 2021-04-20 6:45 ` Michal Hocko @ 2021-04-20 16:04 ` Shakeel Butt 2021-04-21 7:16 ` Michal Hocko 0 siblings, 1 reply; 28+ messages in thread From: Shakeel Butt @ 2021-04-20 16:04 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Roman Gushchin, Linux MM, Andrew Morton, Cgroups, David Rientjes, LKML, Suren Baghdasaryan, Greg Thelen, Dragos Sbirlea, Priya Duraisamy On Mon, Apr 19, 2021 at 11:46 PM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 19-04-21 18:44:02, Shakeel Butt wrote: [...] > > memory.min. However a new allocation from userspace oom-killer can > > still get stuck in the reclaim and policy rich oom-killer do trigger > > new allocations through syscalls or even heap. > > Can you be more specific please? > To decide when to kill, the oom-killer has to read a lot of metrics. It has to open a lot of files to read them and there will definitely be new allocations involved in those operations. For example reading memory.stat does a page size allocation. Similarly, to perform action the oom-killer may have to read cgroup.procs file which again has allocation inside it. Regarding sophisticated oom policy, I can give one example of our cluster level policy. For robustness, many user facing jobs run a lot of instances in a cluster to handle failures. Such jobs are tolerant to some amount of failures but they still have requirements to not let the number of running instances below some threshold. Normally killing such jobs is fine but we do want to make sure that we do not violate their cluster level agreement. So, the userspace oom-killer may dynamically need to confirm if such a job can be killed. [...] > > To reliably solve this problem, we need to give guaranteed memory to > > the userspace oom-killer. > > There is nothing like that. Even memory reserves are a finite resource > which can be consumed as it is sharing those reserves with other users > who are not necessarily coordinated. 
So before we start discussing > making this even more muddy by handing over memory reserves to the > userspace we should really examine whether pre-allocation is something > that will not work. > We actually explored if we can restrict the syscalls for the oom-killer which does not do memory allocations. We concluded that is not practical and not maintainable. Whatever the list we can come up with will be outdated soon. In addition, converting all the must-have syscalls to not do allocations is not possible/practical. > > At the moment we are contemplating between > > the following options and I would like to get some feedback. > > > > 1. prctl(PF_MEMALLOC) > > > > The idea is to give userspace oom-killer (just one thread which is > > finding the appropriate victims and will be sending SIGKILLs) access > > to MEMALLOC reserves. Most of the time the preallocation, mlock and > > memory.min will be good enough but for rare occasions, when the > > userspace oom-killer needs to allocate, the PF_MEMALLOC flag will > > protect it from reclaim and let the allocation dip into the memory > > reserves. > > I do not think that handing over an unlimited ticket to the memory > reserves to userspace is a good idea. Even the in kernel oom killer is > bound to a partial access to reserves. So if we really want this then > it should be in sync with and bound by the ALLOC_OOM. > Makes sense. > > The misuse of this feature would be risky but it can be limited to > > privileged applications. Userspace oom-killer is the only appropriate > > user of this feature. This option is simple to implement. > > > > 2. Mempool > > > > The idea is to preallocate mempool with a given amount of memory for > > userspace oom-killer. Preferably this will be per-thread and > > oom-killer can preallocate mempool for its specific threads. The core > > page allocator can check before going to the reclaim path if the task > > has private access to the mempool and return page from it if yes. 
> > Could you elaborate some more on how this would be controlled from the > userspace? A dedicated syscall? A driver? > I was thinking of simply prctl(SET_MEMPOOL, bytes) to assign mempool to a thread (not shared between threads) and prctl(RESET_MEMPOOL) to free the mempool. > > This option would be more complicated than the previous option as the > > lifecycle of the page from the mempool would be more sophisticated. > > Additionally the current mempool does not handle higher order pages > > and we might need to extend it to allow such allocations. Though this > > feature might have more use-cases and it would be less risky than the > > previous option. > > I would tend to agree. > > > Another idea I had was to use kthread based oom-killer and provide the > > policies through eBPF program. Though I am not sure how to make it > > monitor arbitrary metrics and if that can be done without any > > allocations. > > A kernel module or eBPF to implement oom decisions has already been > discussed few years back. But I am afraid this would be hard to wire in > for anything except for the victim selection. I am not sure it is > maintainable to also control when the OOM handling should trigger. > I think you are referring to [1]. That patch was only looking at PSI and I think we are on the same page that we need more information to decide when to kill. Also I agree with you that it is hard to implement "when to kill" with eBPF but I wanted the idea out to see if eBPF experts have some suggestions. [1] https://lore.kernel.org/lkml/20190807205138.GA24222@cmpxchg.org/ thanks, Shakeel ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] memory reserve for userspace oom-killer 2021-04-20 16:04 ` Shakeel Butt @ 2021-04-21 7:16 ` Michal Hocko 2021-04-21 13:57 ` Shakeel Butt 0 siblings, 1 reply; 28+ messages in thread From: Michal Hocko @ 2021-04-21 7:16 UTC (permalink / raw) To: Shakeel Butt Cc: Johannes Weiner, Roman Gushchin, Linux MM, Andrew Morton, Cgroups, David Rientjes, LKML, Suren Baghdasaryan, Greg Thelen, Dragos Sbirlea, Priya Duraisamy On Tue 20-04-21 09:04:21, Shakeel Butt wrote: > On Mon, Apr 19, 2021 at 11:46 PM Michal Hocko <mhocko@suse.com> wrote: > > > > On Mon 19-04-21 18:44:02, Shakeel Butt wrote: > [...] > > > memory.min. However a new allocation from userspace oom-killer can > > > still get stuck in the reclaim and policy rich oom-killer do trigger > > > new allocations through syscalls or even heap. > > > > Can you be more specific please? > > > > To decide when to kill, the oom-killer has to read a lot of metrics. > It has to open a lot of files to read them and there will definitely > be new allocations involved in those operations. For example reading > memory.stat does a page size allocation. Similarly, to perform action > the oom-killer may have to read cgroup.procs file which again has > allocation inside it. True but many of those can be avoided by opening the file early. At least seq_file based ones will not allocate later if the output size doesn't increase. Which should be the case for many. I think it is a general improvement to push those who allocate during read to an open time allocation. > Regarding sophisticated oom policy, I can give one example of our > cluster level policy. For robustness, many user facing jobs run a lot > of instances in a cluster to handle failures. Such jobs are tolerant > to some amount of failures but they still have requirements to not let > the number of running instances below some threshold. Normally killing > such jobs is fine but we do want to make sure that we do not violate > their cluster level agreement. 
So, the userspace oom-killer may > dynamically need to confirm if such a job can be killed. What kind of data do you need to examine to make those decisions? > [...] > > > To reliably solve this problem, we need to give guaranteed memory to > > > the userspace oom-killer. > > > > There is nothing like that. Even memory reserves are a finite resource > > which can be consumed as it is sharing those reserves with other users > > who are not necessarily coordinated. So before we start discussing > > making this even more muddy by handing over memory reserves to the > > userspace we should really examine whether pre-allocation is something > > that will not work. > > > > We actually explored if we can restrict the syscalls for the > oom-killer which does not do memory allocations. We concluded that is > not practical and not maintainable. Whatever the list we can come up > with will be outdated soon. In addition, converting all the must-have > syscalls to not do allocations is not possible/practical. I am definitely curious to learn more. [...] > > > 2. Mempool > > > > > > The idea is to preallocate mempool with a given amount of memory for > > > userspace oom-killer. Preferably this will be per-thread and > > > oom-killer can preallocate mempool for its specific threads. The core > > > page allocator can check before going to the reclaim path if the task > > > has private access to the mempool and return page from it if yes. > > > > Could you elaborate some more on how this would be controlled from the > > userspace? A dedicated syscall? A driver? > > > > I was thinking of simply prctl(SET_MEMPOOL, bytes) to assign mempool > to a thread (not shared between threads) and prctl(RESET_MEMPOOL) to > free the mempool. I am not a great fan of prctl. It has become a dumping ground for all mix of unrelated functionality. But let's say this is a minor detail at this stage. 
So you are proposing to have a per mm mem pool that would be used as a fallback for an allocation which cannot make a forward progress, right? Would that pool be preallocated and sitting idle? What kind of allocations would be allowed to use the pool? What if the pool is depleted? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] memory reserve for userspace oom-killer 2021-04-21 7:16 ` Michal Hocko @ 2021-04-21 13:57 ` Shakeel Butt 2021-04-21 14:29 ` Michal Hocko 0 siblings, 1 reply; 28+ messages in thread From: Shakeel Butt @ 2021-04-21 13:57 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Roman Gushchin, Linux MM, Andrew Morton, Cgroups, David Rientjes, LKML, Suren Baghdasaryan, Greg Thelen, Dragos Sbirlea, Priya Duraisamy On Wed, Apr 21, 2021 at 12:16 AM Michal Hocko <mhocko@suse.com> wrote: > [...] > > To decide when to kill, the oom-killer has to read a lot of metrics. > > It has to open a lot of files to read them and there will definitely > > be new allocations involved in those operations. For example reading > > memory.stat does a page size allocation. Similarly, to perform action > > the oom-killer may have to read cgroup.procs file which again has > > allocation inside it. > > True but many of those can be avoided by opening the file early. At > least seq_file based ones will not allocate later if the output size > doesn't increase. Which should be the case for many. I think it is a > general improvement to push those who allocate during read to an open > time allocation. > I agree that this would be a general improvement but it is not always possible (see below). > > Regarding sophisticated oom policy, I can give one example of our > > cluster level policy. For robustness, many user facing jobs run a lot > > of instances in a cluster to handle failures. Such jobs are tolerant > > to some amount of failures but they still have requirements to not let > > the number of running instances below some threshold. Normally killing > > such jobs is fine but we do want to make sure that we do not violate > > their cluster level agreement. So, the userspace oom-killer may > > dynamically need to confirm if such a job can be killed. > > What kind of data do you need to examine to make those decisions? 
> Most of the time the cluster level scheduler pushes the information to the node controller which transfers that information to the oom-killer. However based on the freshness of the information the oom-killer might request to pull the latest information (IPC and RPC). [...] > > > > I was thinking of simply prctl(SET_MEMPOOL, bytes) to assign mempool > > to a thread (not shared between threads) and prctl(RESET_MEMPOOL) to > > free the mempool. > > I am not a great fan of prctl. It has become a dumping ground for all > mix of unrelated functionality. But let's say this is a minor detail at > this stage. I agree this does not have to be prctl(). > So you are proposing to have a per mm mem pool that would be I was thinking of per-task_struct instead of per-mm_struct just for simplicity. > used as a fallback for an allocation which cannot make a forward > progress, right? Correct > Would that pool be preallocated and sitting idle? Correct > What kind of allocations would be allowed to use the pool? I was thinking of any type of allocation from the oom-killer (or specific threads). Please note that the mempool is the backup and only used in the slowpath. > What if the pool is depleted? This would mean that either the estimate of mempool size is bad or oom-killer is buggy and leaking memory. I am open to any design directions for mempool or some other way where we can provide a notion of memory guarantee to oom-killer. thanks, Shakeel ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] memory reserve for userspace oom-killer 2021-04-21 13:57 ` Shakeel Butt @ 2021-04-21 14:29 ` Michal Hocko 2021-04-22 12:33 ` [RFC PATCH] Android OOM helper proof of concept peter enderborg 2021-05-05 0:37 ` [RFC] memory reserve for userspace oom-killer Shakeel Butt 0 siblings, 2 replies; 28+ messages in thread From: Michal Hocko @ 2021-04-21 14:29 UTC (permalink / raw) To: Shakeel Butt Cc: Johannes Weiner, Roman Gushchin, Linux MM, Andrew Morton, Cgroups, David Rientjes, LKML, Suren Baghdasaryan, Greg Thelen, Dragos Sbirlea, Priya Duraisamy On Wed 21-04-21 06:57:43, Shakeel Butt wrote: > On Wed, Apr 21, 2021 at 12:16 AM Michal Hocko <mhocko@suse.com> wrote: > > > [...] > > > To decide when to kill, the oom-killer has to read a lot of metrics. > > > It has to open a lot of files to read them and there will definitely > > > be new allocations involved in those operations. For example reading > > > memory.stat does a page size allocation. Similarly, to perform action > > > the oom-killer may have to read cgroup.procs file which again has > > > allocation inside it. > > > > True but many of those can be avoided by opening the file early. At > > least seq_file based ones will not allocate later if the output size > > doesn't increase. Which should be the case for many. I think it is a > > general improvement to push those who allocate during read to an open > > time allocation. > > > > I agree that this would be a general improvement but it is not always > possible (see below). It would be still great to invest into those improvements. And I would be really grateful to learn about bottlenecks from the existing kernel interfaces you have found on the way. > > > Regarding sophisticated oom policy, I can give one example of our > > > cluster level policy. For robustness, many user facing jobs run a lot > > > of instances in a cluster to handle failures. 
Such jobs are tolerant > > > to some amount of failures but they still have requirements to not let > > > the number of running instances below some threshold. Normally killing > > > such jobs is fine but we do want to make sure that we do not violate > > > their cluster level agreement. So, the userspace oom-killer may > > > dynamically need to confirm if such a job can be killed. > > > > What kind of data do you need to examine to make those decisions? > > > > Most of the time the cluster level scheduler pushes the information to > the node controller which transfers that information to the > oom-killer. However based on the freshness of the information the > oom-killer might request to pull the latest information (IPC and RPC). I cannot imagine any OOM handler to be reliable if it has to depend on other userspace component with a lower resource priority. OOM handlers are fundamentally complex components which has to reduce their dependencies to the bare minimum. > [...] > > > > > > I was thinking of simply prctl(SET_MEMPOOL, bytes) to assign mempool > > > to a thread (not shared between threads) and prctl(RESET_MEMPOOL) to > > > free the mempool. > > > > I am not a great fan of prctl. It has become a dumping ground for all > > mix of unrelated functionality. But let's say this is a minor detail at > > this stage. > > I agree this does not have to be prctl(). > > > So you are proposing to have a per mm mem pool that would be > > I was thinking of per-task_struct instead of per-mm_struct just for simplicity. > > > used as a fallback for an allocation which cannot make a forward > > progress, right? > > Correct > > > Would that pool be preallocated and sitting idle? > > Correct > > > What kind of allocations would be allowed to use the pool? > > I was thinking of any type of allocation from the oom-killer (or > specific threads). Please note that the mempool is the backup and only > used in the slowpath. > > > What if the pool is depleted? 
> > This would mean that either the estimate of mempool size is bad or > oom-killer is buggy and leaking memory. > > I am open to any design directions for mempool or some other way where > we can provide a notion of memory guarantee to oom-killer. OK, thanks for clarification. There will certainly be hard problems to sort out[1] but the overall idea makes sense to me and it sounds like a much better approach than a OOM specific solution. [1] - how the pool is going to be replenished without hitting all potential reclaim problems (thus dependencies on other all tasks directly/indirectly) yet to not rely on any background workers to do that on the task behalf without a proper accounting etc... -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC PATCH] Android OOM helper proof of concept 2021-04-21 14:29 ` Michal Hocko @ 2021-04-22 12:33 ` peter enderborg 2021-04-22 13:03 ` Michal Hocko 2021-05-05 0:37 ` [RFC] memory reserve for userspace oom-killer Shakeel Butt 1 sibling, 1 reply; 28+ messages in thread From: peter enderborg @ 2021-04-22 12:33 UTC (permalink / raw) To: Michal Hocko, Shakeel Butt Cc: Johannes Weiner, Roman Gushchin, Linux MM, Andrew Morton, Cgroups, David Rientjes, LKML, Suren Baghdasaryan, Greg Thelen, Dragos Sbirlea, Priya Duraisamy On 4/21/21 4:29 PM, Michal Hocko wrote: > On Wed 21-04-21 06:57:43, Shakeel Butt wrote: >> On Wed, Apr 21, 2021 at 12:16 AM Michal Hocko <mhocko@suse.com> wrote: >> [...] >>>> To decide when to kill, the oom-killer has to read a lot of metrics. >>>> It has to open a lot of files to read them and there will definitely >>>> be new allocations involved in those operations. For example reading >>>> memory.stat does a page size allocation. Similarly, to perform action >>>> the oom-killer may have to read cgroup.procs file which again has >>>> allocation inside it. >>> True but many of those can be avoided by opening the file early. At >>> least seq_file based ones will not allocate later if the output size >>> doesn't increase. Which should be the case for many. I think it is a >>> general improvement to push those who allocate during read to an open >>> time allocation. >>> >> I agree that this would be a general improvement but it is not always >> possible (see below). > It would be still great to invest into those improvements. And I would > be really grateful to learn about bottlenecks from the existing kernel > interfaces you have found on the way. > >>>> Regarding sophisticated oom policy, I can give one example of our >>>> cluster level policy. For robustness, many user facing jobs run a lot >>>> of instances in a cluster to handle failures. 
Such jobs are tolerant
>>>> to some amount of failures but they still have requirements to not let
>>>> the number of running instances below some threshold. Normally killing
>>>> such jobs is fine but we do want to make sure that we do not violate
>>>> their cluster level agreement. So, the userspace oom-killer may
>>>> dynamically need to confirm if such a job can be killed.
>>> What kind of data do you need to examine to make those decisions?
>>>
>> Most of the time the cluster level scheduler pushes the information to
>> the node controller which transfers that information to the
>> oom-killer. However based on the freshness of the information the
>> oom-killer might request to pull the latest information (IPC and RPC).
> I cannot imagine any OOM handler to be reliable if it has to depend on
> other userspace component with a lower resource priority. OOM handlers
> are fundamentally complex components which has to reduce their
> dependencies to the bare minimum.

I think we very much need an OOM killer that can help out, but it is
essential that it also plays by Android's rules. This is an RFC patch
that interacts with OOM.

From 09f3a2e401d4ed77e95b7cea7edb7c5c3e6a0c62 Mon Sep 17 00:00:00 2001
From: Peter Enderborg <peter.enderborg@sony.com>
Date: Thu, 22 Apr 2021 14:15:46 +0200
Subject: [PATCH] mm/oom: Android oomhelper

This is a proof of concept of a pre-oom-killer that kills tasks
strictly in oom-score-adj order if the score is positive. It acts as a
lifeline when userspace does not have optimal performance.
--- drivers/staging/Makefile | 1 + drivers/staging/oomhelper/Makefile | 2 + drivers/staging/oomhelper/oomhelper.c | 65 +++++++++++++++++++++++++++ mm/oom_kill.c | 4 +- 4 files changed, 70 insertions(+), 2 deletions(-) create mode 100644 drivers/staging/oomhelper/Makefile create mode 100644 drivers/staging/oomhelper/oomhelper.c diff --git a/drivers/staging/Makefile b/drivers/staging/Makefile index 2245059e69c7..4a5449b42568 100644 --- a/drivers/staging/Makefile +++ b/drivers/staging/Makefile @@ -47,3 +47,4 @@ obj-$(CONFIG_QLGE) += qlge/ obj-$(CONFIG_WIMAX) += wimax/ obj-$(CONFIG_WFX) += wfx/ obj-y += hikey9xx/ +obj-y += oomhelper/ diff --git a/drivers/staging/oomhelper/Makefile b/drivers/staging/oomhelper/Makefile new file mode 100644 index 000000000000..ee9b361957f8 --- /dev/null +++ b/drivers/staging/oomhelper/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0 +obj-y += oomhelper.o diff --git a/drivers/staging/oomhelper/oomhelper.c b/drivers/staging/oomhelper/oomhelper.c new file mode 100644 index 000000000000..5a3fe0270cb8 --- /dev/null +++ b/drivers/staging/oomhelper/oomhelper.c @@ -0,0 +1,65 @@ +// SPDX-License-Identifier: GPL-2.0 +/* prof of concept of android aware oom killer */ +/* Author: peter.enderborg@sony.com */ + +#include <linux/kernel.h> +#include <linux/mm.h> +#include <linux/slab.h> +#include <linux/oom.h> +void wake_oom_reaper(struct task_struct *tsk); /* need to public ... 
 */
+void __oom_kill_process(struct task_struct *victim, const char *message);
+
+static int oomhelper_oom_notify(struct notifier_block *self,
+				unsigned long notused, void *param)
+{
+	struct task_struct *tsk;
+	struct task_struct *selected = NULL;
+	int highest = 0;
+
+	pr_info("invoked");
+	rcu_read_lock();
+	for_each_process(tsk) {
+		struct task_struct *candidate;
+		if (tsk->flags & PF_KTHREAD)
+			continue;
+
+		/* Ignore task if coredump in progress */
+		if (tsk->mm && tsk->mm->core_state)
+			continue;
+		candidate = find_lock_task_mm(tsk);
+		if (!candidate)
+			continue;
+
+		if (highest < candidate->signal->oom_score_adj) {
+			/* for test, don't kill level 0 */
+			highest = candidate->signal->oom_score_adj;
+			selected = candidate;
+			pr_info("new selected %d %d", selected->pid,
+				selected->signal->oom_score_adj);
+		}
+		task_unlock(candidate);
+	}
+	if (selected) {
+		get_task_struct(selected);
+	}
+	rcu_read_unlock();
+	if (selected) {
+		pr_info("oomhelper killing: %d", selected->pid);
+		__oom_kill_process(selected, "oomhelper");
+	}
+
+	return NOTIFY_OK;
+}
+
+static struct notifier_block oomhelper_oom_nb = {
+	.notifier_call = oomhelper_oom_notify
+};
+
+int __init oomhelper_register_oom_notifier(void)
+{
+	register_oom_notifier(&oomhelper_oom_nb);
+	pr_info("oomhelper installed");
+	return 0;
+}
+
+subsys_initcall(oomhelper_register_oom_notifier);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index fa1cf18bac97..a5f7299af9a3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -658,7 +658,7 @@ static int oom_reaper(void *unused)
 	return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+void wake_oom_reaper(struct task_struct *tsk)
 {
 	/* mm is already queued?
 */
 	if (test_and_set_bit(MMF_OOM_REAP_QUEUED,
 			     &tsk->signal->oom_mm->flags))
@@ -856,7 +856,7 @@ static bool task_will_free_mem(struct task_struct *task)
 	return ret;
 }
 
-static void __oom_kill_process(struct task_struct *victim, const char *message)
+void __oom_kill_process(struct task_struct *victim, const char *message)
 {
 	struct task_struct *p;
 	struct mm_struct *mm;
-- 
2.17.1

Is that something that might be accepted? It uses the notifier chain, which I guess is no problem. But it also calls some oom-kill functions that are not exported. > >> [...] >>>> I was thinking of simply prctl(SET_MEMPOOL, bytes) to assign mempool >>>> to a thread (not shared between threads) and prctl(RESET_MEMPOOL) to >>>> free the mempool. >>> I am not a great fan of prctl. It has become a dumping ground for all >>> mix of unrelated functionality. But let's say this is a minor detail at >>> this stage. >> I agree this does not have to be prctl(). >> >>> So you are proposing to have a per mm mem pool that would be >> I was thinking of per-task_struct instead of per-mm_struct just for simplicity. >> >>> used as a fallback for an allocation which cannot make a forward >>> progress, right? >> Correct >> >>> Would that pool be preallocated and sitting idle? >> Correct >> >>> What kind of allocations would be allowed to use the pool? >> I was thinking of any type of allocation from the oom-killer (or >> specific threads). Please note that the mempool is the backup and only >> used in the slowpath. >> >>> What if the pool is depleted? >> This would mean that either the estimate of mempool size is bad or >> oom-killer is buggy and leaking memory. >> >> I am open to any design directions for mempool or some other way where >> we can provide a notion of memory guarantee to oom-killer. > OK, thanks for clarification. There will certainly be hard problems to > sort out[1] but the overall idea makes sense to me and it sounds like a > much better approach than a OOM specific solution.
> > > [1] - how the pool is going to be replenished without hitting all > potential reclaim problems (thus dependencies on all other tasks > directly/indirectly), yet without relying on any background workers to do > that on the task's behalf without a proper accounting etc... ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [RFC PATCH] Android OOM helper proof of concept 2021-04-22 12:33 ` [RFC PATCH] Android OOM helper proof of concept peter enderborg @ 2021-04-22 13:03 ` Michal Hocko 0 siblings, 0 replies; 28+ messages in thread From: Michal Hocko @ 2021-04-22 13:03 UTC (permalink / raw) To: peter enderborg Cc: Shakeel Butt, Johannes Weiner, Roman Gushchin, Linux MM, Andrew Morton, Cgroups, David Rientjes, LKML, Suren Baghdasaryan, Greg Thelen, Dragos Sbirlea, Priya Duraisamy On Thu 22-04-21 14:33:45, peter enderborg wrote: [...] > I think we very much need a OOM killer that can help out, > but it is essential that it also play with android rules. This is completely off topic to the discussion here. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] memory reserve for userspace oom-killer 2021-04-21 14:29 ` Michal Hocko 2021-04-22 12:33 ` [RFC PATCH] Android OOM helper proof of concept peter enderborg @ 2021-05-05 0:37 ` Shakeel Butt 2021-05-05 1:26 ` Suren Baghdasaryan 1 sibling, 1 reply; 28+ messages in thread From: Shakeel Butt @ 2021-05-05 0:37 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Roman Gushchin, Linux MM, Andrew Morton, Cgroups, David Rientjes, LKML, Suren Baghdasaryan, Greg Thelen, Dragos Sbirlea, Priya Duraisamy On Wed, Apr 21, 2021 at 7:29 AM Michal Hocko <mhocko@suse.com> wrote: > [...] > > > What if the pool is depleted? > > > > This would mean that either the estimate of mempool size is bad or > > oom-killer is buggy and leaking memory. > > > > I am open to any design directions for mempool or some other way where > > we can provide a notion of memory guarantee to oom-killer. > > OK, thanks for clarification. There will certainly be hard problems to > sort out[1] but the overall idea makes sense to me and it sounds like a > much better approach than a OOM specific solution. > > > [1] - how the pool is going to be replenished without hitting all > potential reclaim problems (thus dependencies on other all tasks > directly/indirectly) yet to not rely on any background workers to do > that on the task behalf without a proper accounting etc... > -- I am currently contemplating between two paths here: First, the mempool, exposed through either prctl or a new syscall. Users would need to trace their userspace oom-killer (or whatever their use case is) to find an appropriate mempool size they would need and periodically refill the mempools if allowed by the state of the machine. The challenge here is to find a good value for the mempool size and coordinating the refilling of mempools. Second is a mix of Roman and Peter's suggestions but much more simplified. 
A very simple watchdog with a kill-list of processes: if userspace does not pet the watchdog within a specified time, the kernel kills all the processes on the kill-list. The challenge here is maintaining/updating the kill-list. I would prefer whichever direction oomd and lmkd are open to adopting. Any suggestions? ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] memory reserve for userspace oom-killer 2021-05-05 0:37 ` [RFC] memory reserve for userspace oom-killer Shakeel Butt @ 2021-05-05 1:26 ` Suren Baghdasaryan 2021-05-05 2:45 ` Shakeel Butt 0 siblings, 1 reply; 28+ messages in thread From: Suren Baghdasaryan @ 2021-05-05 1:26 UTC (permalink / raw) To: Shakeel Butt Cc: Michal Hocko, Johannes Weiner, Roman Gushchin, Linux MM, Andrew Morton, Cgroups, David Rientjes, LKML, Greg Thelen, Dragos Sbirlea, Priya Duraisamy On Tue, May 4, 2021 at 5:37 PM Shakeel Butt <shakeelb@google.com> wrote: > > On Wed, Apr 21, 2021 at 7:29 AM Michal Hocko <mhocko@suse.com> wrote: > > > [...] > > > > What if the pool is depleted? > > > > > > This would mean that either the estimate of mempool size is bad or > > > oom-killer is buggy and leaking memory. > > > > > > I am open to any design directions for mempool or some other way where > > > we can provide a notion of memory guarantee to oom-killer. > > > > OK, thanks for clarification. There will certainly be hard problems to > > sort out[1] but the overall idea makes sense to me and it sounds like a > > much better approach than a OOM specific solution. > > > > > > [1] - how the pool is going to be replenished without hitting all > > potential reclaim problems (thus dependencies on other all tasks > > directly/indirectly) yet to not rely on any background workers to do > > that on the task behalf without a proper accounting etc... > > -- > > I am currently contemplating between two paths here: > > First, the mempool, exposed through either prctl or a new syscall. > Users would need to trace their userspace oom-killer (or whatever > their use case is) to find an appropriate mempool size they would need > and periodically refill the mempools if allowed by the state of the > machine. The challenge here is to find a good value for the mempool > size and coordinating the refilling of mempools. > > Second is a mix of Roman and Peter's suggestions but much more > simplified. 
A very simple watchdog with a kill-list of processes and > if userspace didn't pet the watchdog within a specified time, it will > kill all the processes in the kill-list. The challenge here is to > maintain/update the kill-list. IIUC this solution is designed to identify cases when oomd/lmkd got stuck while allocating memory due to memory shortages and therefore can't feed the watchdog. In such a case the kernel goes ahead and kills some processes to free up memory and unblock the blocked process. Effectively this would limit the time such a process gets stuck by the duration of the watchdog timeout. If my understanding of this proposal is correct, then I see the following downsides: 1. oomd/lmkd are still not prevented from being stuck, it just limits the duration of this blocked state. Delaying kills when memory pressure is high even for short duration is very undesirable. I think having mempool reserves could address this issue better if it can always guarantee memory availability (not sure if it's possible in practice). 2. What would be performance overhead of this watchdog? To limit the duration of a process being blocked to a small enough value we would have to have quite a small timeout, which means oomd/lmkd would have to wake up quite often to feed the watchdog. Frequent wakeups on a battery-powered system is not a good idea. 3. What if oomd/lmkd gets stuck for some memory-unrelated reason and can't feed the watchdog? In such a scenario the kernel would assume that it is stuck due to memory shortages and would go on a killing spree. If there is a sure way to identify when a process gets stuck due to memory shortages then this could work better. 4. Additional complexity of keeping the list of potential victims in the kernel. Maybe we can simply reuse oom_score to choose the best victims? Thanks, Suren. > > I would prefer the direction which oomd and lmkd are open to adopt. > > Any suggestions? ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] memory reserve for userspace oom-killer 2021-05-05 1:26 ` Suren Baghdasaryan @ 2021-05-05 2:45 ` Shakeel Butt 2021-05-05 2:59 ` Suren Baghdasaryan 0 siblings, 1 reply; 28+ messages in thread From: Shakeel Butt @ 2021-05-05 2:45 UTC (permalink / raw) To: Suren Baghdasaryan Cc: Michal Hocko, Johannes Weiner, Roman Gushchin, Linux MM, Andrew Morton, Cgroups, David Rientjes, LKML, Greg Thelen, Dragos Sbirlea, Priya Duraisamy On Tue, May 4, 2021 at 6:26 PM Suren Baghdasaryan <surenb@google.com> wrote: > > On Tue, May 4, 2021 at 5:37 PM Shakeel Butt <shakeelb@google.com> wrote: > > > > On Wed, Apr 21, 2021 at 7:29 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > [...] > > > > > What if the pool is depleted? > > > > > > > > This would mean that either the estimate of mempool size is bad or > > > > oom-killer is buggy and leaking memory. > > > > > > > > I am open to any design directions for mempool or some other way where > > > > we can provide a notion of memory guarantee to oom-killer. > > > > > > OK, thanks for clarification. There will certainly be hard problems to > > > sort out[1] but the overall idea makes sense to me and it sounds like a > > > much better approach than a OOM specific solution. > > > > > > > > > [1] - how the pool is going to be replenished without hitting all > > > potential reclaim problems (thus dependencies on other all tasks > > > directly/indirectly) yet to not rely on any background workers to do > > > that on the task behalf without a proper accounting etc... > > > -- > > > > I am currently contemplating between two paths here: > > > > First, the mempool, exposed through either prctl or a new syscall. > > Users would need to trace their userspace oom-killer (or whatever > > their use case is) to find an appropriate mempool size they would need > > and periodically refill the mempools if allowed by the state of the > > machine. 
The challenge here is to find a good value for the mempool > > size and coordinating the refilling of mempools. > > > > Second is a mix of Roman and Peter's suggestions but much more > > simplified. A very simple watchdog with a kill-list of processes and > > if userspace didn't pet the watchdog within a specified time, it will > > kill all the processes in the kill-list. The challenge here is to > > maintain/update the kill-list. > > IIUC this solution is designed to identify cases when oomd/lmkd got > stuck while allocating memory due to memory shortages and therefore > can't feed the watchdog. In such a case the kernel goes ahead and > kills some processes to free up memory and unblock the blocked > process. Effectively this would limit the time such a process gets > stuck by the duration of the watchdog timeout. If my understanding of > this proposal is correct, Your understanding is indeed correct. > then I see the following downsides: > 1. oomd/lmkd are still not prevented from being stuck, it just limits > the duration of this blocked state. Delaying kills when memory > pressure is high even for short duration is very undesirable. Yes I agree. > I think > having mempool reserves could address this issue better if it can > always guarantee memory availability (not sure if it's possible in > practice). I think "mempool ... always guarantee memory availability" is something I should quantify with some experiments. > 2. What would be performance overhead of this watchdog? To limit the > duration of a process being blocked to a small enough value we would > have to have quite a small timeout, which means oomd/lmkd would have > to wake up quite often to feed the watchdog. Frequent wakeups on a > battery-powered system is not a good idea. This is indeed the downside i.e. the tradeoff between acceptable stall vs frequent wakeups. > 3. What if oomd/lmkd gets stuck for some memory-unrelated reason and > can't feed the watchdog? 
In such a scenario the kernel would assume > that it is stuck due to memory shortages and would go on a killing > spree. This is correct but IMHO killing spree is not worse than oomd/lmkd getting stuck for some other reason. > If there is a sure way to identify when a process gets stuck > due to memory shortages then this could work better. Hmm are you saying looking at the stack traces of the userspace oom-killer or some metrics related to oom-killer? It will complicate the code. > 4. Additional complexity of keeping the list of potential victims in > the kernel. Maybe we can simply reuse oom_score to choose the best > victims? Your point of additional complexity is correct. Regarding oom_score I think you meant oom_score_adj, I would avoid putting more policies/complexity in the kernel but I got your point that the simplest watchdog might not be helpful at all. > Thanks, > Suren. > > > > > I would prefer the direction which oomd and lmkd are open to adopt. > > > > Any suggestions? ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] memory reserve for userspace oom-killer 2021-05-05 2:45 ` Shakeel Butt @ 2021-05-05 2:59 ` Suren Baghdasaryan 0 siblings, 0 replies; 28+ messages in thread From: Suren Baghdasaryan @ 2021-05-05 2:59 UTC (permalink / raw) To: Shakeel Butt Cc: Michal Hocko, Johannes Weiner, Roman Gushchin, Linux MM, Andrew Morton, Cgroups, David Rientjes, LKML, Greg Thelen, Dragos Sbirlea, Priya Duraisamy On Tue, May 4, 2021 at 7:45 PM Shakeel Butt <shakeelb@google.com> wrote: > > On Tue, May 4, 2021 at 6:26 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Tue, May 4, 2021 at 5:37 PM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > On Wed, Apr 21, 2021 at 7:29 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > [...] > > > > > > What if the pool is depleted? > > > > > > > > > > This would mean that either the estimate of mempool size is bad or > > > > > oom-killer is buggy and leaking memory. > > > > > > > > > > I am open to any design directions for mempool or some other way where > > > > > we can provide a notion of memory guarantee to oom-killer. > > > > > > > > OK, thanks for clarification. There will certainly be hard problems to > > > > sort out[1] but the overall idea makes sense to me and it sounds like a > > > > much better approach than a OOM specific solution. > > > > > > > > > > > > [1] - how the pool is going to be replenished without hitting all > > > > potential reclaim problems (thus dependencies on other all tasks > > > > directly/indirectly) yet to not rely on any background workers to do > > > > that on the task behalf without a proper accounting etc... > > > > -- > > > > > > I am currently contemplating between two paths here: > > > > > > First, the mempool, exposed through either prctl or a new syscall. 
> > > Users would need to trace their userspace oom-killer (or whatever > > > their use case is) to find an appropriate mempool size they would need > > > and periodically refill the mempools if allowed by the state of the > > > machine. The challenge here is to find a good value for the mempool > > > size and coordinating the refilling of mempools. > > > > > > Second is a mix of Roman and Peter's suggestions but much more > > > simplified. A very simple watchdog with a kill-list of processes and > > > if userspace didn't pet the watchdog within a specified time, it will > > > kill all the processes in the kill-list. The challenge here is to > > > maintain/update the kill-list. > > > > IIUC this solution is designed to identify cases when oomd/lmkd got > > stuck while allocating memory due to memory shortages and therefore > > can't feed the watchdog. In such a case the kernel goes ahead and > > kills some processes to free up memory and unblock the blocked > > process. Effectively this would limit the time such a process gets > > stuck by the duration of the watchdog timeout. If my understanding of > > this proposal is correct, > > Your understanding is indeed correct. > > > then I see the following downsides: > > 1. oomd/lmkd are still not prevented from being stuck, it just limits > > the duration of this blocked state. Delaying kills when memory > > pressure is high even for short duration is very undesirable. > > Yes I agree. > > > I think > > having mempool reserves could address this issue better if it can > > always guarantee memory availability (not sure if it's possible in > > practice). > > I think "mempool ... always guarantee memory availability" is > something I should quantify with some experiments. > > > 2. What would be performance overhead of this watchdog? 
To limit the > > duration of a process being blocked to a small enough value we would > > have to have quite a small timeout, which means oomd/lmkd would have > > to wake up quite often to feed the watchdog. Frequent wakeups on a > > battery-powered system is not a good idea. > > This is indeed the downside i.e. the tradeoff between acceptable stall > vs frequent wakeups. > > > 3. What if oomd/lmkd gets stuck for some memory-unrelated reason and > > can't feed the watchdog? In such a scenario the kernel would assume > > that it is stuck due to memory shortages and would go on a killing > > spree. > > This is correct but IMHO killing spree is not worse than oomd/lmkd > getting stuck for some other reason. > > > If there is a sure way to identify when a process gets stuck > > due to memory shortages then this could work better. > > Hmm are you saying looking at the stack traces of the userspace > oom-killer or some metrics related to oom-killer? It will complicate > the code. Well, I don't know of a sure and easy way to identify the reasons for process blockage but maybe there is one I don't know of? My point is that we would need some additional indications of memory being the culprit for the process blockage before resorting to kill. > > > 4. Additional complexity of keeping the list of potential victims in > > the kernel. Maybe we can simply reuse oom_score to choose the best > > victims? > > Your point of additional complexity is correct. Regarding oom_score I > think you meant oom_score_adj, I would avoid putting more > policies/complexity in the kernel but I got your point that the > simplest watchdog might not be helpful at all. > > > Thanks, > > Suren. > > > > > > > > I would prefer the direction which oomd and lmkd are open to adopt. > > > > > > Any suggestions? ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] memory reserve for userspace oom-killer 2021-04-20 1:44 [RFC] memory reserve for userspace oom-killer Shakeel Butt 2021-04-20 6:45 ` Michal Hocko @ 2021-04-20 19:17 ` Roman Gushchin 2021-04-20 19:36 ` Suren Baghdasaryan 2021-04-21 1:18 ` Shakeel Butt 2021-04-21 17:05 ` peter enderborg 2 siblings, 2 replies; 28+ messages in thread From: Roman Gushchin @ 2021-04-20 19:17 UTC (permalink / raw) To: Shakeel Butt Cc: Johannes Weiner, Michal Hocko, Linux MM, Andrew Morton, Cgroups, David Rientjes, LKML, Suren Baghdasaryan, Greg Thelen, Dragos Sbirlea, Priya Duraisamy On Mon, Apr 19, 2021 at 06:44:02PM -0700, Shakeel Butt wrote: > Proposal: Provide memory guarantees to userspace oom-killer. > > Background: > > Issues with kernel oom-killer: > 1. Very conservative and prefer to reclaim. Applications can suffer > for a long time. > 2. Borrows the context of the allocator which can be resource limited > (low sched priority or limited CPU quota). > 3. Serialized by global lock. > 4. Very simplistic oom victim selection policy. > > These issues are resolved through userspace oom-killer by: > 1. Ability to monitor arbitrary metrics (PSI, vmstat, memcg stats) to > early detect suffering. > 2. Independent process context which can be given dedicated CPU quota > and high scheduling priority. > 3. Can be more aggressive as required. > 4. Can implement sophisticated business logic/policies. > > Android's LMKD and Facebook's oomd are the prime examples of userspace > oom-killers. One of the biggest challenges for userspace oom-killers > is to potentially function under intense memory pressure and are prone > to getting stuck in memory reclaim themselves. Current userspace > oom-killers aim to avoid this situation by preallocating user memory > and protecting themselves from global reclaim by either mlocking or > memory.min. 
However a new allocation from userspace oom-killer can > still get stuck in the reclaim and policy rich oom-killer do trigger > new allocations through syscalls or even heap. > > Our attempt of userspace oom-killer faces similar challenges. > Particularly at the tail on the very highly utilized machines we have > observed userspace oom-killer spectacularly failing in many possible > ways in the direct reclaim. We have seen oom-killer stuck in direct > reclaim throttling, stuck in reclaim and allocations from interrupts > keep stealing reclaimed memory. We have even observed systems where > all the processes were stuck in throttle_direct_reclaim() and only > kswapd was running and the interrupts kept stealing the memory > reclaimed by kswapd. > > To reliably solve this problem, we need to give guaranteed memory to > the userspace oom-killer. At the moment we are contemplating between > the following options and I would like to get some feedback. > > 1. prctl(PF_MEMALLOC) > > The idea is to give userspace oom-killer (just one thread which is > finding the appropriate victims and will be sending SIGKILLs) access > to MEMALLOC reserves. Most of the time the preallocation, mlock and > memory.min will be good enough but for rare occasions, when the > userspace oom-killer needs to allocate, the PF_MEMALLOC flag will > protect it from reclaim and let the allocation dip into the memory > reserves. > > The misuse of this feature would be risky but it can be limited to > privileged applications. Userspace oom-killer is the only appropriate > user of this feature. This option is simple to implement. Hello Shakeel! If ordinary PAGE_SIZE and smaller kernel allocations start to fail, the system is already in a relatively bad shape. Arguably the userspace OOM killer should kick in earlier, it's already a bit too late. Allowing to use reserves just pushes this even further, so we're risking the kernel stability for no good reason. 
But I agree that throttling the oom daemon in direct reclaim makes no sense. I wonder if we can introduce a per-task flag which will exclude the task from throttling, but instead all (large) allocations will just fail under a significant memory pressure more easily. In this case if there is a significant memory shortage the oom daemon will not be fully functional (will get -ENOMEM for an attempt to read some stats, for example), but still will be able to kill some processes and make the forward progress. But maybe it can be done in userspace too: by splitting the daemon into a core- and extended part and avoid doing anything behind bare minimum in the core part. > > 2. Mempool > > The idea is to preallocate mempool with a given amount of memory for > userspace oom-killer. Preferably this will be per-thread and > oom-killer can preallocate mempool for its specific threads. The core > page allocator can check before going to the reclaim path if the task > has private access to the mempool and return page from it if yes. > > This option would be more complicated than the previous option as the > lifecycle of the page from the mempool would be more sophisticated. > Additionally the current mempool does not handle higher order pages > and we might need to extend it to allow such allocations. Though this > feature might have more use-cases and it would be less risky than the > previous option. It looks like an over-kill for the oom daemon protection, but if there are other good use cases, maybe it's a good feature to have. > > Another idea I had was to use kthread based oom-killer and provide the > policies through eBPF program. Though I am not sure how to make it > monitor arbitrary metrics and if that can be done without any > allocations. To start this effort it would be nice to understand what metrics various oom daemons use and how easy is to gather them from the bpf side. I like this idea long-term, but not sure if it has been settled down enough. 
I imagine it will require a fair amount of work on the bpf side, so we need a good understanding of features we need. Thanks! ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] memory reserve for userspace oom-killer 2021-04-20 19:17 ` Roman Gushchin @ 2021-04-20 19:36 ` Suren Baghdasaryan 2021-04-21 1:18 ` Shakeel Butt 1 sibling, 0 replies; 28+ messages in thread From: Suren Baghdasaryan @ 2021-04-20 19:36 UTC (permalink / raw) To: Roman Gushchin Cc: Shakeel Butt, Johannes Weiner, Michal Hocko, Linux MM, Andrew Morton, Cgroups, David Rientjes, LKML, Greg Thelen, Dragos Sbirlea, Priya Duraisamy Hi Folks, On Tue, Apr 20, 2021 at 12:18 PM Roman Gushchin <guro@fb.com> wrote: > > On Mon, Apr 19, 2021 at 06:44:02PM -0700, Shakeel Butt wrote: > > Proposal: Provide memory guarantees to userspace oom-killer. > > > > Background: > > > > Issues with kernel oom-killer: > > 1. Very conservative and prefer to reclaim. Applications can suffer > > for a long time. > > 2. Borrows the context of the allocator which can be resource limited > > (low sched priority or limited CPU quota). > > 3. Serialized by global lock. > > 4. Very simplistic oom victim selection policy. > > > > These issues are resolved through userspace oom-killer by: > > 1. Ability to monitor arbitrary metrics (PSI, vmstat, memcg stats) to > > early detect suffering. > > 2. Independent process context which can be given dedicated CPU quota > > and high scheduling priority. > > 3. Can be more aggressive as required. > > 4. Can implement sophisticated business logic/policies. > > > > Android's LMKD and Facebook's oomd are the prime examples of userspace > > oom-killers. One of the biggest challenges for userspace oom-killers > > is to potentially function under intense memory pressure and are prone > > to getting stuck in memory reclaim themselves. Current userspace > > oom-killers aim to avoid this situation by preallocating user memory > > and protecting themselves from global reclaim by either mlocking or > > memory.min. 
However a new allocation from userspace oom-killer can > > still get stuck in the reclaim and policy rich oom-killer do trigger > > new allocations through syscalls or even heap. > > > > Our attempt of userspace oom-killer faces similar challenges. > > Particularly at the tail on the very highly utilized machines we have > > observed userspace oom-killer spectacularly failing in many possible > > ways in the direct reclaim. We have seen oom-killer stuck in direct > > reclaim throttling, stuck in reclaim and allocations from interrupts > > keep stealing reclaimed memory. We have even observed systems where > > all the processes were stuck in throttle_direct_reclaim() and only > > kswapd was running and the interrupts kept stealing the memory > > reclaimed by kswapd. > > > > To reliably solve this problem, we need to give guaranteed memory to > > the userspace oom-killer. At the moment we are contemplating between > > the following options and I would like to get some feedback. > > > > 1. prctl(PF_MEMALLOC) > > > > The idea is to give userspace oom-killer (just one thread which is > > finding the appropriate victims and will be sending SIGKILLs) access > > to MEMALLOC reserves. Most of the time the preallocation, mlock and > > memory.min will be good enough but for rare occasions, when the > > userspace oom-killer needs to allocate, the PF_MEMALLOC flag will > > protect it from reclaim and let the allocation dip into the memory > > reserves. > > > > The misuse of this feature would be risky but it can be limited to > > privileged applications. Userspace oom-killer is the only appropriate > > user of this feature. This option is simple to implement. > > Hello Shakeel! > > If ordinary PAGE_SIZE and smaller kernel allocations start to fail, > the system is already in a relatively bad shape. Arguably the userspace > OOM killer should kick in earlier, it's already a bit too late. I tend to agree here. 
This is how we are trying to avoid issues with such severe memory shortages - by tuning the killer a bit more aggressively. But a more reliable mechanism would definitely be an improvement. > Allowing to use reserves just pushes this even further, so we're risking > the kernel stability for no good reason. > > But I agree that throttling the oom daemon in direct reclaim makes no sense. > I wonder if we can introduce a per-task flag which will exclude the task from > throttling, but instead all (large) allocations will just fail under a > significant memory pressure more easily. In this case if there is a significant > memory shortage the oom daemon will not be fully functional (will get -ENOMEM > for an attempt to read some stats, for example), but still will be able to kill > some processes and make the forward progress. This sounds like a good idea to me. > But maybe it can be done in userspace too: by splitting the daemon into > a core- and extended part and avoid doing anything behind bare minimum > in the core part. > > > > > 2. Mempool > > > > The idea is to preallocate mempool with a given amount of memory for > > userspace oom-killer. Preferably this will be per-thread and > > oom-killer can preallocate mempool for its specific threads. The core > > page allocator can check before going to the reclaim path if the task > > has private access to the mempool and return page from it if yes. > > > > This option would be more complicated than the previous option as the > > lifecycle of the page from the mempool would be more sophisticated. > > Additionally the current mempool does not handle higher order pages > > and we might need to extend it to allow such allocations. Though this > > feature might have more use-cases and it would be less risky than the > > previous option. > > It looks like an over-kill for the oom daemon protection, but if there > are other good use cases, maybe it's a good feature to have. 
> > > > > Another idea I had was to use kthread based oom-killer and provide the > > policies through eBPF program. Though I am not sure how to make it > > monitor arbitrary metrics and if that can be done without any > > allocations. > > To start this effort it would be nice to understand what metrics various > oom daemons use and how easy is to gather them from the bpf side. I like > this idea long-term, but not sure if it has been settled down enough. > I imagine it will require a fair amount of work on the bpf side, so we > need a good understanding of features we need. For reference, on Android, where we do not really use memcgs, low-memory-killer reads global data from the meminfo, vmstat and zoneinfo procfs nodes. Thanks, Suren. > > Thanks! ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] memory reserve for userspace oom-killer 2021-04-20 19:17 ` Roman Gushchin 2021-04-20 19:36 ` Suren Baghdasaryan @ 2021-04-21 1:18 ` Shakeel Butt 2021-04-21 2:58 ` Roman Gushchin 2021-04-21 7:23 ` Michal Hocko 1 sibling, 2 replies; 28+ messages in thread From: Shakeel Butt @ 2021-04-21 1:18 UTC (permalink / raw) To: Roman Gushchin Cc: Johannes Weiner, Michal Hocko, Linux MM, Andrew Morton, Cgroups, David Rientjes, LKML, Suren Baghdasaryan, Greg Thelen, Dragos Sbirlea, Priya Duraisamy On Tue, Apr 20, 2021 at 12:18 PM Roman Gushchin <guro@fb.com> wrote: > > On Mon, Apr 19, 2021 at 06:44:02PM -0700, Shakeel Butt wrote: [...] > > 1. prctl(PF_MEMALLOC) > > > > The idea is to give userspace oom-killer (just one thread which is > > finding the appropriate victims and will be sending SIGKILLs) access > > to MEMALLOC reserves. Most of the time the preallocation, mlock and > > memory.min will be good enough but for rare occasions, when the > > userspace oom-killer needs to allocate, the PF_MEMALLOC flag will > > protect it from reclaim and let the allocation dip into the memory > > reserves. > > > > The misuse of this feature would be risky but it can be limited to > > privileged applications. Userspace oom-killer is the only appropriate > > user of this feature. This option is simple to implement. > > Hello Shakeel! > > If ordinary PAGE_SIZE and smaller kernel allocations start to fail, > the system is already in a relatively bad shape. Arguably the userspace > OOM killer should kick in earlier, it's already a bit too late. Please note that these are not allocation failures but rather reclaim on allocations (which is very normal). Our observation is that this reclaim is very unpredictable and depends on the type of memory present on the system which depends on the workload. If there is a good amount of easily reclaimable memory (e.g. clean file pages), the reclaim would be really fast. However for other types of reclaimable memory the reclaim time varies a lot. 
The unreclaimable memory, pinned memory, too many direct reclaimers, too much isolated memory and many other things/heuristics/assumptions make the reclaim further non-deterministic. In our observation the global reclaim is very non-deterministic at the tail and dramatically impacts the reliability of the system. We are looking for a solution which is independent of the global reclaim. > Allowing to use reserves just pushes this even further, so we're risking > the kernel stability for no good reason. Michal has suggested ALLOC_OOM which is less risky. > > But I agree that throttling the oom daemon in direct reclaim makes no sense. > I wonder if we can introduce a per-task flag which will exclude the task from > throttling, but instead all (large) allocations will just fail under a > significant memory pressure more easily. In this case if there is a significant > memory shortage the oom daemon will not be fully functional (will get -ENOMEM > for an attempt to read some stats, for example), but still will be able to kill > some processes and make the forward progress. So, the suggestion is to have a per-task flag to (1) indicate to not throttle and (2) fail allocations easily on significant memory pressure. For (1), the challenge I see is that there are a lot of places in the reclaim code paths where a task can get throttled. There are filesystems that block/throttle in slab shrinking. Any process can get blocked on an unrelated page or inode writeback within reclaim. For (2), I am not sure how to deterministically define "significant memory pressure". One idea is to follow the __GFP_NORETRY semantics and along with (1) the userspace oom-killer will see ENOMEM more reliably than getting stuck in reclaim. So, the oom-killer maintains a list of processes to kill in extreme conditions, keeps their pidfds open and keeps that list fresh. Whenever any syscall returns ENOMEM, it starts doing pidfd_send_signal(SIGKILL) to that list of processes, right?
The idea has merit but I don't see how this is any simpler. (1) is challenging on its own and my main concern is that it will be very hard to maintain as reclaim code (particularly shrinkers) calls back into many diverse subsystems. > But maybe it can be done in userspace too: by splitting the daemon into > a core- and extended part and avoid doing anything behind bare minimum > in the core part. > > > > > 2. Mempool > > > > The idea is to preallocate mempool with a given amount of memory for > > userspace oom-killer. Preferably this will be per-thread and > > oom-killer can preallocate mempool for its specific threads. The core > > page allocator can check before going to the reclaim path if the task > > has private access to the mempool and return page from it if yes. > > > > This option would be more complicated than the previous option as the > > lifecycle of the page from the mempool would be more sophisticated. > > Additionally the current mempool does not handle higher order pages > > and we might need to extend it to allow such allocations. Though this > > feature might have more use-cases and it would be less risky than the > > previous option. > > It looks like an over-kill for the oom daemon protection, but if there > are other good use cases, maybe it's a good feature to have. > IMHO it is not an over-kill and easier to do than to remove all instances of potential blocking/throttling sites in memory reclaim. > > > > Another idea I had was to use kthread based oom-killer and provide the > > policies through eBPF program. Though I am not sure how to make it > > monitor arbitrary metrics and if that can be done without any > > allocations. > > To start this effort it would be nice to understand what metrics various > oom daemons use and how easy is to gather them from the bpf side. I like > this idea long-term, but not sure if it has been settled down enough.
> I imagine it will require a fair amount of work on the bpf side, so we > need a good understanding of features we need. > Are there any examples of gathering existing metrics from bpf? Suren has given a list of metrics useful for Android. Is it possible to gather those metrics? BTW thanks a lot for taking a look and I really appreciate your time. thanks, Shakeel ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] memory reserve for userspace oom-killer 2021-04-21 1:18 ` Shakeel Butt @ 2021-04-21 2:58 ` Roman Gushchin 2021-04-21 13:26 ` Shakeel Butt 2021-04-21 7:23 ` Michal Hocko 1 sibling, 1 reply; 28+ messages in thread From: Roman Gushchin @ 2021-04-21 2:58 UTC (permalink / raw) To: Shakeel Butt Cc: Johannes Weiner, Michal Hocko, Linux MM, Andrew Morton, Cgroups, David Rientjes, LKML, Suren Baghdasaryan, Greg Thelen, Dragos Sbirlea, Priya Duraisamy On Tue, Apr 20, 2021 at 06:18:29PM -0700, Shakeel Butt wrote: > On Tue, Apr 20, 2021 at 12:18 PM Roman Gushchin <guro@fb.com> wrote: > > > > On Mon, Apr 19, 2021 at 06:44:02PM -0700, Shakeel Butt wrote: > [...] > > > 1. prctl(PF_MEMALLOC) > > > > > > The idea is to give userspace oom-killer (just one thread which is > > > finding the appropriate victims and will be sending SIGKILLs) access > > > to MEMALLOC reserves. Most of the time the preallocation, mlock and > > > memory.min will be good enough but for rare occasions, when the > > > userspace oom-killer needs to allocate, the PF_MEMALLOC flag will > > > protect it from reclaim and let the allocation dip into the memory > > > reserves. > > > > > > The misuse of this feature would be risky but it can be limited to > > > privileged applications. Userspace oom-killer is the only appropriate > > > user of this feature. This option is simple to implement. > > > > Hello Shakeel! > > > > If ordinary PAGE_SIZE and smaller kernel allocations start to fail, > > the system is already in a relatively bad shape. Arguably the userspace > > OOM killer should kick in earlier, it's already a bit too late. > > Please note that these are not allocation failures but rather reclaim > on allocations (which is very normal). Our observation is that this > reclaim is very unpredictable and depends on the type of memory > present on the system which depends on the workload. If there is a > good amount of easily reclaimable memory (e.g. 
clean file pages), the > reclaim would be really fast. However for other types of reclaimable > memory the reclaim time varies a lot. The unreclaimable memory, pinned > memory, too many direct reclaimers, too many isolated memory and many > other things/heuristics/assumptions make the reclaim further > non-deterministic. > > In our observation the global reclaim is very non-deterministic at the > tail and dramatically impacts the reliability of the system. We are > looking for a solution which is independent of the global reclaim. > > > Allowing to use reserves just pushes this even further, so we're risking > > the kernel stability for no good reason. > > Michal has suggested ALLOC_OOM which is less risky. The problem is that even if you'll serve the oom daemon task with pages from a reserve/custom pool, it doesn't guarantee anything, because the task still can wait for a long time on some mutex, taken by another process, throttled somewhere in the reclaim. You're basically trying to introduce a "higher memory priority" and as always in such cases there will be priority inversion problems. So I doubt that you can simply create a common mechanism which will work flawlessly for all kinds of allocations; I anticipate many special cases requiring an individual approach. > > > > > But I agree that throttling the oom daemon in direct reclaim makes no sense. > > I wonder if we can introduce a per-task flag which will exclude the task from > > throttling, but instead all (large) allocations will just fail under a > > significant memory pressure more easily. In this case if there is a significant > > memory shortage the oom daemon will not be fully functional (will get -ENOMEM > > for an attempt to read some stats, for example), but still will be able to kill > > some processes and make the forward progress. > > So, the suggestion is to have a per-task flag to (1) indicate to not > throttle and (2) fail allocations easily on significant memory > pressure.
> > For (1), the challenge I see is that there are a lot of places in the > reclaim code paths where a task can get throttled. There are > filesystems that block/throttle in slab shrinking. Any process can get > blocked on an unrelated page or inode writeback within reclaim. > > For (2), I am not sure how to deterministically define "significant > memory pressure". One idea is to follow the __GFP_NORETRY semantics > and along with (1) the userspace oom-killer will see ENOMEM more > reliably than stucking in the reclaim. > > So, the oom-killer maintains a list of processes to kill in extreme > conditions, have their pidfds open and keep that list fresh. Whenever > any syscalls returns ENOMEM, it starts doing > pidfd_send_signal(SIGKILL) to that list of processes, right? > > The idea has merit but I don't see how this is any simpler. The (1) is > challenging on its own and my main concern is that it will be very > hard to maintain as reclaim code (particularly shrinkers) callbacks > into many diverse subsystems. Yeah, I thought about something like this, but I didn't go too deep. Basically we can emulate __GFP_NOFS | __GFP_NORETRY, but I'm not sure we can apply it for any random allocation without bad consequences. Btw, this approach can be easily prototyped using bpf: a bpf program can be called on each allocation and modify the behavior based on the pid of the process and other circumstances. > > > But maybe it can be done in userspace too: by splitting the daemon into > > a core- and extended part and avoid doing anything behind bare minimum > > in the core part. > > > > > > > > 2. Mempool > > > > > > The idea is to preallocate mempool with a given amount of memory for > > > userspace oom-killer. Preferably this will be per-thread and > > > oom-killer can preallocate mempool for its specific threads. The core > > > page allocator can check before going to the reclaim path if the task > > > has private access to the mempool and return page from it if yes. 
> > > > > > This option would be more complicated than the previous option as the > > > lifecycle of the page from the mempool would be more sophisticated. > > > Additionally the current mempool does not handle higher order pages > > > and we might need to extend it to allow such allocations. Though this > > > feature might have more use-cases and it would be less risky than the > > > previous option. > > > > It looks like an over-kill for the oom daemon protection, but if there > > are other good use cases, maybe it's a good feature to have. > > > > IMHO it is not an over-kill and easier to do then to remove all > instances of potential blocking/throttling sites in memory reclaim. > > > > > > > Another idea I had was to use kthread based oom-killer and provide the > > > policies through eBPF program. Though I am not sure how to make it > > > monitor arbitrary metrics and if that can be done without any > > > allocations. > > > > To start this effort it would be nice to understand what metrics various > > oom daemons use and how easy is to gather them from the bpf side. I like > > this idea long-term, but not sure if it has been settled down enough. > > I imagine it will require a fair amount of work on the bpf side, so we > > need a good understanding of features we need. > > > > Are there any examples of gathering existing metrics from bpf? Suren > has given a list of metrics useful for Android. Is it possible to > gather those metrics? First, I need to admit that I didn't follow the bpf development too closely for the last couple of years, so my knowledge can be a bit outdated. But in general bpf is great when there is a fixed amount of data as input (e.g. skb) and a fixed output (e.g. drop/pass the packet). There are different maps which are handy to store some persistent data between calls. However traversing complex data structures is way more complicated.
It's especially tricky if the data structure is not of a fixed size: bpf programs have to be deterministic, so there are significant constraints on loops. Just for example: it's easy to call a bpf program for each task in the system, provide some stats/access to some fields of struct task and expect it to return an oom score, which then the kernel will look at to select the victim. Something like this can be done with cgroups too. Writing a kthread, which can sleep, poll some data all over the system and decide what to do (what oomd/... does), will be really challenging. And going back, it will not provide any guarantees unless we avoid taking any locks, which is already quite challenging. Thanks! ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] memory reserve for userspace oom-killer 2021-04-21 2:58 ` Roman Gushchin @ 2021-04-21 13:26 ` Shakeel Butt 2021-04-21 19:04 ` Roman Gushchin 0 siblings, 1 reply; 28+ messages in thread From: Shakeel Butt @ 2021-04-21 13:26 UTC (permalink / raw) To: Roman Gushchin Cc: Johannes Weiner, Michal Hocko, Linux MM, Andrew Morton, Cgroups, David Rientjes, LKML, Suren Baghdasaryan, Greg Thelen, Dragos Sbirlea, Priya Duraisamy On Tue, Apr 20, 2021 at 7:58 PM Roman Gushchin <guro@fb.com> wrote: > [...] > > > > Michal has suggested ALLOC_OOM which is less risky. > > The problem is that even if you'll serve the oom daemon task with pages > from a reserve/custom pool, it doesn't guarantee anything, because the task > still can wait for a long time on some mutex, taken by another process, > throttled somewhere in the reclaim. I am assuming that by mutex you are referring to locks which the oom-killer might have to take to read metrics, or any lock it might need that some other process can take too. Have you observed this situation happening with oomd in production? > You're basically trying to introduce a > "higher memory priority" and as always in such cases there will be priority > inversion problems. > > So I doubt that you can simple create a common mechanism which will work > flawlessly for all kinds of allocations, I anticipate many special cases > requiring an individual approach. > [...] > > First, I need to admit that I didn't follow the bpf development too close > for last couple of years, so my knowledge can be a bit outdated. > > But in general bpf is great when there is a fixed amount of data as input > (e.g. skb) and a fixed output (e.g. drop/pass the packet). There are different > maps which are handy to store some persistent data between calls. > > However traversing complex data structures is way more complicated.
It's > especially tricky if the data structure is not of a fixed size: bpf programs > have to be deterministic, so there are significant constraints on loops. > > Just for example: it's easy to call a bpf program for each task in the system, > provide some stats/access to some fields of struct task and expect it to return > an oom score, which then the kernel will look at to select the victim. > Something like this can be done with cgroups too. > > Writing a kthread, which can sleep, poll some data all over the system and > decide what to do (what oomd/... does), will be really challenging. > And going back, it will not provide any guarantees unless we're not taking > any locks, which is already quite challenging. > Thanks for the info and I agree this direction needs much more thought and time to materialize. thanks, Shakeel ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] memory reserve for userspace oom-killer 2021-04-21 13:26 ` Shakeel Butt @ 2021-04-21 19:04 ` Roman Gushchin 0 siblings, 0 replies; 28+ messages in thread From: Roman Gushchin @ 2021-04-21 19:04 UTC (permalink / raw) To: Shakeel Butt Cc: Johannes Weiner, Michal Hocko, Linux MM, Andrew Morton, Cgroups, David Rientjes, LKML, Suren Baghdasaryan, Greg Thelen, Dragos Sbirlea, Priya Duraisamy On Wed, Apr 21, 2021 at 06:26:37AM -0700, Shakeel Butt wrote: > On Tue, Apr 20, 2021 at 7:58 PM Roman Gushchin <guro@fb.com> wrote: > > > [...] > > > > > > Michal has suggested ALLOC_OOM which is less risky. > > > > The problem is that even if you'll serve the oom daemon task with pages > > from a reserve/custom pool, it doesn't guarantee anything, because the task > > still can wait for a long time on some mutex, taken by another process, > > throttled somewhere in the reclaim. > > I am assuming here by mutex you are referring to locks which > oom-killer might have to take to read metrics or any possible lock > which oom-killer might have to take which some other process can take > too. > > Have you observed this situation happening with oomd on production? I'm not aware of any oomd-specific issues. I'm not sure if they don't exist at all, but so far it wasn't a problem for us. Maybe it's because you tend to have less pagecache (as I understand), maybe it comes down to specific oomd policies/settings. I know we had different pains with mmap_sem and atop and similar programs, where reading process data stalled on mmap_sem for a long time. Thanks! ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] memory reserve for userspace oom-killer 2021-04-21 1:18 ` Shakeel Butt 2021-04-21 2:58 ` Roman Gushchin @ 2021-04-21 7:23 ` Michal Hocko 2021-04-21 14:13 ` Shakeel Butt 1 sibling, 1 reply; 28+ messages in thread From: Michal Hocko @ 2021-04-21 7:23 UTC (permalink / raw) To: Shakeel Butt Cc: Roman Gushchin, Johannes Weiner, Linux MM, Andrew Morton, Cgroups, David Rientjes, LKML, Suren Baghdasaryan, Greg Thelen, Dragos Sbirlea, Priya Duraisamy On Tue 20-04-21 18:18:29, Shakeel Butt wrote: > On Tue, Apr 20, 2021 at 12:18 PM Roman Gushchin <guro@fb.com> wrote: > > > > On Mon, Apr 19, 2021 at 06:44:02PM -0700, Shakeel Butt wrote: > [...] > > > 1. prctl(PF_MEMALLOC) > > > > > > The idea is to give userspace oom-killer (just one thread which is > > > finding the appropriate victims and will be sending SIGKILLs) access > > > to MEMALLOC reserves. Most of the time the preallocation, mlock and > > > memory.min will be good enough but for rare occasions, when the > > > userspace oom-killer needs to allocate, the PF_MEMALLOC flag will > > > protect it from reclaim and let the allocation dip into the memory > > > reserves. > > > > > > The misuse of this feature would be risky but it can be limited to > > > privileged applications. Userspace oom-killer is the only appropriate > > > user of this feature. This option is simple to implement. > > > > Hello Shakeel! > > > > If ordinary PAGE_SIZE and smaller kernel allocations start to fail, > > the system is already in a relatively bad shape. Arguably the userspace > > OOM killer should kick in earlier, it's already a bit too late. > > Please note that these are not allocation failures but rather reclaim > on allocations (which is very normal). Our observation is that this > reclaim is very unpredictable and depends on the type of memory > present on the system which depends on the workload. If there is a > good amount of easily reclaimable memory (e.g. clean file pages), the > reclaim would be really fast. 
However for other types of reclaimable > memory the reclaim time varies a lot. The unreclaimable memory, pinned > memory, too many direct reclaimers, too many isolated memory and many > other things/heuristics/assumptions make the reclaim further > non-deterministic. > > In our observation the global reclaim is very non-deterministic at the > tail and dramatically impacts the reliability of the system. We are > looking for a solution which is independent of the global reclaim. I believe it is worth pursuing a solution that would make the memory reclaim more predictable. I have seen direct reclaim memory throttling in the past. For some reason which I haven't tried to examine this has become less of a problem with newer kernels. Maybe the memory access patterns have changed or those problems got replaced by other issues but an excessive throttling is definitely something that we want to address rather than work around by some user visible APIs. > > Allowing to use reserves just pushes this even further, so we're risking > > the kernel stability for no good reason. > > Michal has suggested ALLOC_OOM which is less risky. > > > > > But I agree that throttling the oom daemon in direct reclaim makes no sense. > > I wonder if we can introduce a per-task flag which will exclude the task from > > throttling, but instead all (large) allocations will just fail under a > > significant memory pressure more easily. In this case if there is a significant > > memory shortage the oom daemon will not be fully functional (will get -ENOMEM > > for an attempt to read some stats, for example), but still will be able to kill > > some processes and make the forward progress. > > So, the suggestion is to have a per-task flag to (1) indicate to not > throttle and (2) fail allocations easily on significant memory > pressure. > > For (1), the challenge I see is that there are a lot of places in the > reclaim code paths where a task can get throttled.
There are > filesystems that block/throttle in slab shrinking. Any process can get > blocked on an unrelated page or inode writeback within reclaim. > > For (2), I am not sure how to deterministically define "significant > memory pressure". One idea is to follow the __GFP_NORETRY semantics > and along with (1) the userspace oom-killer will see ENOMEM more > reliably than stucking in the reclaim. Some of the interfaces (e.g. seq_file uses GFP_KERNEL reclaim strength) could be more relaxed and rather fail than OOM kill but wouldn't your OOM handler be effectively dysfunctional when not able to collect data to make a decision? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] memory reserve for userspace oom-killer 2021-04-21 7:23 ` Michal Hocko @ 2021-04-21 14:13 ` Shakeel Butt 0 siblings, 0 replies; 28+ messages in thread From: Shakeel Butt @ 2021-04-21 14:13 UTC (permalink / raw) To: Michal Hocko Cc: Roman Gushchin, Johannes Weiner, Linux MM, Andrew Morton, Cgroups, David Rientjes, LKML, Suren Baghdasaryan, Greg Thelen, Dragos Sbirlea, Priya Duraisamy On Wed, Apr 21, 2021 at 12:23 AM Michal Hocko <mhocko@suse.com> wrote: > [...] > > In our observation the global reclaim is very non-deterministic at the > > tail and dramatically impacts the reliability of the system. We are > > looking for a solution which is independent of the global reclaim. > > I believe it is worth purusing a solution that would make the memory > reclaim more predictable. I have seen direct reclaim memory throttling > in the past. For some reason which I haven't tried to examine this has > become less of a problem with newer kernels. Maybe the memory access > patterns have changed or those problems got replaced by other issues but > an excessive throttling is definitely something that we want to address > rather than work around by some user visible APIs. > I agree we want to address the excessive throttling, but that is for everyone on the machine and, most importantly, it is a moving target. The reclaim code continues to evolve and in addition it has callbacks to diverse sets of subsystems. The user visible API is for one specific use-case, i.e. the oom-killer, which will indirectly help in reducing the excessive throttling. [...] > > So, the suggestion is to have a per-task flag to (1) indicate to not > > throttle and (2) fail allocations easily on significant memory > > pressure. > > > > For (1), the challenge I see is that there are a lot of places in the > > reclaim code paths where a task can get throttled. There are > > filesystems that block/throttle in slab shrinking. Any process can get > > blocked on an unrelated page or inode writeback within reclaim.
> > > > For (2), I am not sure how to deterministically define "significant > > memory pressure". One idea is to follow the __GFP_NORETRY semantics > > and along with (1) the userspace oom-killer will see ENOMEM more > > reliably than stucking in the reclaim. > > Some of the interfaces (e.g. seq_file uses GFP_KERNEL reclaim strength) > could be more relaxed and rather fail than OOM kill but wouldn't your > OOM handler be effectivelly dysfunctional when not able to collect data > to make a decision? > Yes it would be. Roman is suggesting to have a precomputed kill-list (pidfds ready to send SIGKILL) and whenever the oom-killer gets ENOMEM, it would go with the kill-list. Though we are still contemplating the ways and side-effects of preferentially returning ENOMEM in the slowpath for the oom-killer, and in addition the complexity of maintaining the kill-list and keeping it up to date. thanks, Shakeel ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] memory reserve for userspace oom-killer 2021-04-20 1:44 [RFC] memory reserve for userspace oom-killer Shakeel Butt 2021-04-20 6:45 ` Michal Hocko 2021-04-20 19:17 ` Roman Gushchin @ 2021-04-21 17:05 ` peter enderborg 2021-04-21 18:28 ` Shakeel Butt 2021-04-22 13:08 ` Michal Hocko 2 siblings, 2 replies; 28+ messages in thread From: peter enderborg @ 2021-04-21 17:05 UTC (permalink / raw) To: Shakeel Butt, Johannes Weiner, Roman Gushchin, Michal Hocko, Linux MM, Andrew Morton, Cgroups, David Rientjes, LKML, Suren Baghdasaryan Cc: Greg Thelen, Dragos Sbirlea, Priya Duraisamy On 4/20/21 3:44 AM, Shakeel Butt wrote: > Proposal: Provide memory guarantees to userspace oom-killer. > > Background: > > Issues with kernel oom-killer: > 1. Very conservative and prefer to reclaim. Applications can suffer > for a long time. > 2. Borrows the context of the allocator which can be resource limited > (low sched priority or limited CPU quota). > 3. Serialized by global lock. > 4. Very simplistic oom victim selection policy. > > These issues are resolved through userspace oom-killer by: > 1. Ability to monitor arbitrary metrics (PSI, vmstat, memcg stats) to > early detect suffering. > 2. Independent process context which can be given dedicated CPU quota > and high scheduling priority. > 3. Can be more aggressive as required. > 4. Can implement sophisticated business logic/policies. > > Android's LMKD and Facebook's oomd are the prime examples of userspace > oom-killers. One of the biggest challenges for userspace oom-killers > is to potentially function under intense memory pressure and are prone > to getting stuck in memory reclaim themselves. Current userspace > oom-killers aim to avoid this situation by preallocating user memory > and protecting themselves from global reclaim by either mlocking or > memory.min. 
However a new allocation from userspace oom-killer can > still get stuck in the reclaim and policy rich oom-killer do trigger > new allocations through syscalls or even heap. > > Our attempt of userspace oom-killer faces similar challenges. > Particularly at the tail on the very highly utilized machines we have > observed userspace oom-killer spectacularly failing in many possible > ways in the direct reclaim. We have seen oom-killer stuck in direct > reclaim throttling, stuck in reclaim and allocations from interrupts > keep stealing reclaimed memory. We have even observed systems where > all the processes were stuck in throttle_direct_reclaim() and only > kswapd was running and the interrupts kept stealing the memory > reclaimed by kswapd. > > To reliably solve this problem, we need to give guaranteed memory to > the userspace oom-killer. At the moment we are contemplating between > the following options and I would like to get some feedback. > > 1. prctl(PF_MEMALLOC) > > The idea is to give userspace oom-killer (just one thread which is > finding the appropriate victims and will be sending SIGKILLs) access > to MEMALLOC reserves. Most of the time the preallocation, mlock and > memory.min will be good enough but for rare occasions, when the > userspace oom-killer needs to allocate, the PF_MEMALLOC flag will > protect it from reclaim and let the allocation dip into the memory > reserves. > > The misuse of this feature would be risky but it can be limited to > privileged applications. Userspace oom-killer is the only appropriate > user of this feature. This option is simple to implement. > > 2. Mempool > > The idea is to preallocate mempool with a given amount of memory for > userspace oom-killer. Preferably this will be per-thread and > oom-killer can preallocate mempool for its specific threads. The core > page allocator can check before going to the reclaim path if the task > has private access to the mempool and return page from it if yes. 
> > This option would be more complicated than the previous option as the > lifecycle of the page from the mempool would be more sophisticated. > Additionally the current mempool does not handle higher order pages > and we might need to extend it to allow such allocations. Though this > feature might have more use-cases and it would be less risky than the > previous option. > > Another idea I had was to use kthread based oom-killer and provide the > policies through eBPF program. Though I am not sure how to make it > monitor arbitrary metrics and if that can be done without any > allocations. > > Please do provide feedback on these approaches. > > thanks, > Shakeel I think this is the wrong way to go. I sent a patch for android lowmemorykiller some years ago. http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2017-February/100319.html It has been improved since then, so it can handle oom callbacks, and it can act on vmpressure and psi and as a shrinker. The patches have not been ported to recent kernels though. I don't think vmpressure and psi are that relevant now. (They are what userspace acts on.) But the basic idea is to have a priority queue within the kernel. It needs to pick up new processes and dying processes. And then it has an order, and that is set with oom adj values by the activity manager in android. I see this model can be reused for something that is between a standard oom and userspace. Instead of vmpressure and psi a watchdog might be a better way. If userspace (in android the activity manager or lmkd) does not kick the watchdog, the watchdog bites the task according to the priority and kills it. This priority list does not have to be a list generated within the kernel. But it has the advantage that you inherit the parent's properties. We use an rb-tree for that. All that is missing is the watchdog. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] memory reserve for userspace oom-killer 2021-04-21 17:05 ` peter enderborg @ 2021-04-21 18:28 ` Shakeel Butt 2021-04-21 18:46 ` Peter.Enderborg 2021-04-22 13:08 ` Michal Hocko 1 sibling, 1 reply; 28+ messages in thread From: Shakeel Butt @ 2021-04-21 18:28 UTC (permalink / raw) To: peter enderborg Cc: Johannes Weiner, Roman Gushchin, Michal Hocko, Linux MM, Andrew Morton, Cgroups, David Rientjes, LKML, Suren Baghdasaryan, Greg Thelen, Dragos Sbirlea, Priya Duraisamy On Wed, Apr 21, 2021 at 10:06 AM peter enderborg <peter.enderborg@sony.com> wrote: > > On 4/20/21 3:44 AM, Shakeel Butt wrote: [...] > > I think this is the wrong way to go. Which one? Are you talking about the kernel one? We already talked out of that. To decide to OOM, we need to look at a very diverse set of metrics and it seems like that would be very hard to do flexibly inside the kernel. > > I sent a patch for android lowmemorykiller some years ago. > > http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2017-February/100319.html > > It has been improved since then, so it can handle oom callbacks, and it can act on vmpressure and psi > and as a shrinker. The patches have not been ported to recent kernels though. > > I don't think vmpressure and psi are that relevant now. (They are what userspace acts on.) But the basic idea is to have a priority queue > within the kernel. It needs to pick up new processes and dying processes. And then it has an order, and that > is set with oom adj values by the activity manager in Android. I see this model can be reused for > something that is between a standard oom and userspace. Instead of vmpressure and psi > a watchdog might be a better way. If userspace (in Android the activity manager or lmkd) does not kick the watchdog, > the watchdog bites a task according to the priority and kills it. This priority list does not have to be a list generated > within the kernel. But it has the advantage that you inherit the parent's properties.
We use an rb-tree for that. > > All that is missing is the watchdog. > Actually no. It is missing the flexibility to monitor the metrics a user cares about and based on which they decide to trigger an oom-kill. I am not sure how a watchdog will replace psi/vmpressure. Userspace petting the watchdog does not mean that the system is not suffering. In addition, oom priorities change dynamically and changing them in your system seems very hard. Cgroup awareness is missing too. Anyways, there are already widely deployed userspace oom-killer solutions (lmkd, oomd). I am aiming to further improve their reliability. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] memory reserve for userspace oom-killer 2021-04-21 18:28 ` Shakeel Butt @ 2021-04-21 18:46 ` Peter.Enderborg 2021-04-21 19:18 ` Shakeel Butt 0 siblings, 1 reply; 28+ messages in thread From: Peter.Enderborg @ 2021-04-21 18:46 UTC (permalink / raw) To: shakeelb Cc: hannes, guro, mhocko, linux-mm, akpm, cgroups, rientjes, linux-kernel, surenb, gthelen, dragoss, padmapriyad On 4/21/21 8:28 PM, Shakeel Butt wrote: > On Wed, Apr 21, 2021 at 10:06 AM peter enderborg > <peter.enderborg@sony.com> wrote: >> On 4/20/21 3:44 AM, Shakeel Butt wrote: > [...] >> I think this is the wrong way to go. > Which one? Are you talking about the kernel one? We already talked out > of that. To decide to OOM, we need to look at a very diverse set of > metrics and it seems like that would be very hard to do flexibly > inside the kernel. You don't need to decide to oom, but when an oom occurs you can take a proper action. > >> I sent a patch for android lowmemorykiller some years ago. >> >> http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2017-February/100319.html >> >> It has been improved since then, so it can handle oom callbacks, and it can act on vmpressure and psi >> and as a shrinker. The patches have not been ported to recent kernels though. >> >> I don't think vmpressure and psi are that relevant now. (They are what userspace acts on.) But the basic idea is to have a priority queue >> within the kernel. It needs to pick up new processes and dying processes. And then it has an order, and that >> is set with oom adj values by the activity manager in Android. I see this model can be reused for >> something that is between a standard oom and userspace. Instead of vmpressure and psi >> a watchdog might be a better way.
If userspace (in Android the activity manager or lmkd) does not kick the watchdog, >> the watchdog bites a task according to the priority and kills it. This priority list does not have to be a list generated >> within the kernel. But it has the advantage that you inherit the parent's properties. We use an rb-tree for that. >> >> All that is missing is the watchdog. >> > Actually no. It is missing the flexibility to monitor the metrics a > user cares about and based on which they decide to trigger an oom-kill. I am > not sure how a watchdog will replace psi/vmpressure. Userspace petting the > watchdog does not mean that the system is not suffering. Userspace should very much do what it does. But when it does not do what it should, including kicking the watchdog, then the kernel kicks in and kills a predefined process, or as many as needed, until the monitoring can kick in again and take control. > > In addition, oom priorities change dynamically and changing them in your > system seems very hard. Cgroup awareness is missing too. Why is that hard? Moving an object in an rb-tree is as good as it gets. > > Anyways, there are already widely deployed userspace oom-killer > solutions (lmkd, oomd). I am aiming to further improve their > reliability. Yes, and I totally agree that it is needed. But I don't think it will be possible until Linux is realtime ready, including a memory system that can guarantee allocation times. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] memory reserve for userspace oom-killer 2021-04-21 18:46 ` Peter.Enderborg @ 2021-04-21 19:18 ` Shakeel Butt 2021-04-22 5:38 ` Peter.Enderborg 0 siblings, 1 reply; 28+ messages in thread From: Shakeel Butt @ 2021-04-21 19:18 UTC (permalink / raw) To: peter enderborg Cc: Johannes Weiner, Roman Gushchin, Michal Hocko, Linux MM, Andrew Morton, Cgroups, David Rientjes, LKML, Suren Baghdasaryan, Greg Thelen, Dragos Sbirlea, Priya Duraisamy On Wed, Apr 21, 2021 at 11:46 AM <Peter.Enderborg@sony.com> wrote: > > On 4/21/21 8:28 PM, Shakeel Butt wrote: > > On Wed, Apr 21, 2021 at 10:06 AM peter enderborg > > <peter.enderborg@sony.com> wrote: > >> On 4/20/21 3:44 AM, Shakeel Butt wrote: > > [...] > >> I think this is the wrong way to go. > > Which one? Are you talking about the kernel one? We already talked out > > of that. To decide to OOM, we need to look at a very diverse set of > > metrics and it seems like that would be very hard to do flexibly > > inside the kernel. > You don't need to decide to oom, but when an oom occurs you > can take a proper action. No, we want the flexibility to decide when to oom-kill. The kernel is very conservative in triggering the oom-kill. > > [...] > > Actually no. It is missing the flexibility to monitor the metrics a > > user cares about and based on which they decide to trigger an oom-kill. I am > > not sure how a watchdog will replace psi/vmpressure. Userspace petting the > > watchdog does not mean that the system is not suffering. > > Userspace should very much do what it does. But when it > does not do what it should, including kicking the watchdog, > then the kernel kicks in and kills a predefined process, or as many > as needed, until the monitoring can kick in again and take > control. > Roman already suggested something similar (i.e. an oom-killer split into core and extended parts, with the core watching the extended part) but completely in userspace. I don't see why we would want to do that in the kernel instead.
> > > > In addition, oom priorities change dynamically and changing them in your > > system seems very hard. Cgroup awareness is missing too. > > Why is that hard? Moving an object in an rb-tree is as good as it gets. > It is a group of objects. Anyway, that is an implementation detail. The message I got from this exchange is that we can have a watchdog (userspace or kernel) to further improve the reliability of userspace oom-killers. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] memory reserve for userspace oom-killer 2021-04-21 19:18 ` Shakeel Butt @ 2021-04-22 5:38 ` Peter.Enderborg 2021-04-22 14:27 ` Shakeel Butt 0 siblings, 1 reply; 28+ messages in thread From: Peter.Enderborg @ 2021-04-22 5:38 UTC (permalink / raw) To: shakeelb Cc: hannes, guro, mhocko, linux-mm, akpm, cgroups, rientjes, linux-kernel, surenb, gthelen, dragoss, padmapriyad On 4/21/21 9:18 PM, Shakeel Butt wrote: > On Wed, Apr 21, 2021 at 11:46 AM <Peter.Enderborg@sony.com> wrote: >> On 4/21/21 8:28 PM, Shakeel Butt wrote: >>> On Wed, Apr 21, 2021 at 10:06 AM peter enderborg >>> <peter.enderborg@sony.com> wrote: >>>> On 4/20/21 3:44 AM, Shakeel Butt wrote: >>> [...] >>>> I think this is the wrong way to go. >>> Which one? Are you talking about the kernel one? We already talked out >>> of that. To decide to OOM, we need to look at a very diverse set of >>> metrics and it seems like that would be very hard to do flexibly >>> inside the kernel. >> You don't need to decide to oom, but when an oom occurs you >> can take a proper action. > No, we want the flexibility to decide when to oom-kill. The kernel is very > conservative in triggering the oom-kill. It won't do it for you. We use this code to solve that:

/*
 * lowmemorykiller_oom
 *
 * Author: Peter Enderborg <peter.enderborg@sonymobile.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 as
 * published by the Free Software Foundation.
 */
/* add fake print format with original module name */
#define pr_fmt(fmt) "lowmemorykiller: " fmt

#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/oom.h>
#include <trace/events/lmk.h>

#include "lowmemorykiller.h"
#include "lowmemorykiller_tng.h"
#include "lowmemorykiller_stats.h"
#include "lowmemorykiller_tasks.h"

/**
 * lowmemorykiller_oom_notify - OOM notifier
 * @self: notifier block struct
 * @notused: not used
 * @param: returned - number of pages freed
 *
 * Return value:
 *	NOTIFY_OK
 **/
static int lowmemorykiller_oom_notify(struct notifier_block *self,
				      unsigned long notused, void *param)
{
	struct lmk_rb_watch *lrw;
	unsigned long *nfreed = param;

	lowmem_print(2, "oom notify event\n");
	*nfreed = 0;
	lmk_inc_stats(LMK_OOM_COUNT);

	spin_lock_bh(&lmk_task_lock);
	lrw = __lmk_task_first();
	if (lrw) {
		struct task_struct *selected = lrw->tsk;
		struct lmk_death_pending_entry *ldpt;

		if (!task_trylock_lmk(selected)) {
			lmk_inc_stats(LMK_ERROR);
			lowmem_print(1, "Failed to lock task.\n");
			lmk_inc_stats(LMK_BUSY);
			goto unlock_out;
		}
		get_task_struct(selected);
		/* move to kill pending set */
		ldpt = kmem_cache_alloc(lmk_dp_cache, GFP_ATOMIC);
		/* if we fail to alloc we ignore the death pending list */
		if (ldpt) {
			ldpt->tsk = selected;
			__lmk_death_pending_add(ldpt);
		} else {
			WARN_ON(1);
			lmk_inc_stats(LMK_MEM_ERROR);
			trace_lmk_sigkill(selected->pid, selected->comm,
					  LMK_TRACE_MEMERROR, *nfreed, 0);
		}
		if (!__lmk_task_remove(selected, lrw->key))
			WARN_ON(1);
		spin_unlock_bh(&lmk_task_lock);
		*nfreed = get_task_rss(selected);
		send_sig(SIGKILL, selected, 0);
		LMK_TAG_TASK_DIE(selected);
		trace_lmk_sigkill(selected->pid, selected->comm,
				  LMK_TRACE_OOMKILL, *nfreed, 0);
		task_unlock(selected);
		put_task_struct(selected);
		lmk_inc_stats(LMK_OOM_KILL_COUNT);
		goto out;
	}
unlock_out:
	spin_unlock_bh(&lmk_task_lock);
out:
	return NOTIFY_OK;
}

static struct notifier_block lowmemorykiller_oom_nb = {
	.notifier_call = lowmemorykiller_oom_notify
};

int __init
lowmemorykiller_register_oom_notifier(void)
{
	register_oom_notifier(&lowmemorykiller_oom_nb);
	return 0;
}

So what is needed is a function that knows the priority. Here it is __lmk_task_first(), which comes from an rb-tree. You can pick whatever priority you like. In our case it is Android, so it is strict oom_adj order in the tree. I think you can do the same with the old lowmemorykiller style with a full task scan too. > [...] >>> Actually no. It is missing the flexibility to monitor the metrics a >>> user cares about and based on which they decide to trigger an oom-kill. I am >>> not sure how a watchdog will replace psi/vmpressure. Userspace petting the >>> watchdog does not mean that the system is not suffering. >> Userspace should very much do what it does. But when it >> does not do what it should, including kicking the watchdog, >> then the kernel kicks in and kills a predefined process, or as many >> as needed, until the monitoring can kick in again and take >> control. >> > Roman already suggested something similar (i.e. an oom-killer split into core and > extended parts, with the core watching the extended part) but completely in userspace. I > don't see why we would want to do that in the kernel instead. A watchdog in the kernel will work even if userspace is completely broken or starved when low on memory. > >>> In addition, oom priorities change dynamically and changing them in your >>> system seems very hard. Cgroup awareness is missing too. >> Why is that hard? Moving an object in an rb-tree is as good as it gets. >> > It is a group of objects. Anyway, that is an implementation detail. > > The message I got from this exchange is that we can have a watchdog > (userspace or kernel) to further improve the reliability of userspace > oom-killers. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] memory reserve for userspace oom-killer 2021-04-22 5:38 ` Peter.Enderborg @ 2021-04-22 14:27 ` Shakeel Butt 2021-04-22 15:41 ` Peter.Enderborg 0 siblings, 1 reply; 28+ messages in thread From: Shakeel Butt @ 2021-04-22 14:27 UTC (permalink / raw) To: peter enderborg Cc: Johannes Weiner, Roman Gushchin, Michal Hocko, Linux MM, Andrew Morton, Cgroups, David Rientjes, LKML, Suren Baghdasaryan, Greg Thelen, Dragos Sbirlea, Priya Duraisamy On Wed, Apr 21, 2021 at 10:39 PM <Peter.Enderborg@sony.com> wrote: > > On 4/21/21 9:18 PM, Shakeel Butt wrote: > > On Wed, Apr 21, 2021 at 11:46 AM <Peter.Enderborg@sony.com> wrote: > >> On 4/21/21 8:28 PM, Shakeel Butt wrote: > >>> On Wed, Apr 21, 2021 at 10:06 AM peter enderborg > >>> <peter.enderborg@sony.com> wrote: > >>>> On 4/20/21 3:44 AM, Shakeel Butt wrote: > >>> [...] > >>>> I think this is the wrong way to go. > >>> Which one? Are you talking about the kernel one? We already talked out > >>> of that. To decide to OOM, we need to look at a very diverse set of > >>> metrics and it seems like that would be very hard to do flexibly > >>> inside the kernel. > >> You don't need to decide to oom, but when an oom occurs you > >> can take a proper action. > > No, we want the flexibility to decide when to oom-kill. The kernel is very > > conservative in triggering the oom-kill. > > It won't do it for you. We use this code to solve that: Sorry, what do you mean by "It won't do it for you"? [...] > int __init lowmemorykiller_register_oom_notifier(void) > { > register_oom_notifier(&lowmemorykiller_oom_nb); This code is using oom_notify_list. That is only called when the kernel has already decided to go for the oom-kill. My point was that the kernel is very conservative in deciding to trigger the oom-kill and the applications can suffer for a long time.
We already have solutions for this issue in the form of userspace oom-killers (Android's lmkd and Facebook's oomd) which monitor a diverse set of metrics to detect application suffering early and send SIGKILLs to release the memory pressure on the system. BTW with userspace oom-killers, we would like to avoid the kernel oom-killer, and memory.swap.high has been introduced in the kernel for that purpose. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] memory reserve for userspace oom-killer 2021-04-22 14:27 ` Shakeel Butt @ 2021-04-22 15:41 ` Peter.Enderborg 0 siblings, 0 replies; 28+ messages in thread From: Peter.Enderborg @ 2021-04-22 15:41 UTC (permalink / raw) To: shakeelb Cc: hannes, guro, mhocko, linux-mm, akpm, cgroups, rientjes, linux-kernel, surenb, gthelen, dragoss, padmapriyad On 4/22/21 4:27 PM, Shakeel Butt wrote: > On Wed, Apr 21, 2021 at 10:39 PM <Peter.Enderborg@sony.com> wrote: >> On 4/21/21 9:18 PM, Shakeel Butt wrote: >>> On Wed, Apr 21, 2021 at 11:46 AM <Peter.Enderborg@sony.com> wrote: >>>> On 4/21/21 8:28 PM, Shakeel Butt wrote: >>>>> On Wed, Apr 21, 2021 at 10:06 AM peter enderborg >>>>> <peter.enderborg@sony.com> wrote: >>>>>> On 4/20/21 3:44 AM, Shakeel Butt wrote: >>>>> [...] >>>>>> I think this is the wrong way to go. >>>>> Which one? Are you talking about the kernel one? We already talked out >>>>> of that. To decide to OOM, we need to look at a very diverse set of >>>>> metrics and it seems like that would be very hard to do flexibly >>>>> inside the kernel. >>>> You don't need to decide to oom, but when an oom occurs you >>>> can take a proper action. >>> No, we want the flexibility to decide when to oom-kill. The kernel is very >>> conservative in triggering the oom-kill. >> It won't do it for you. We use this code to solve that: > Sorry, what do you mean by "It won't do it for you"? The oom-killer does not do what you want and need. You need to add something that kills the "right" task. The example does that: it picks the task with the highest oom_score_adj and kills it. It is probably easier to see in the "proof of concept" patch. > > [...] >> int __init lowmemorykiller_register_oom_notifier(void) >> { >> register_oom_notifier(&lowmemorykiller_oom_nb); > This code is using oom_notify_list. That is only called when the > kernel has already decided to go for the oom-kill.
My point was that the > kernel is very conservative in deciding to trigger the oom-kill and > the applications can suffer for a long time. We already have solutions for > this issue in the form of userspace oom-killers (Android's lmkd and > Facebook's oomd) which monitor a diverse set of metrics to detect > application suffering early and send SIGKILLs to release the > memory pressure on the system. > > BTW with userspace oom-killers, we would like to avoid the kernel > oom-killer, and memory.swap.high has been introduced in the kernel for > that purpose. This is a lifeline. It will keep lmkd/the activity manager going. It is not a replacement, it is a helper. It gives the freedom to tune other parts without worrying too much about oom. (Assuming that userspace can still handle kills like the kernel lmk did.) ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] memory reserve for userspace oom-killer 2021-04-21 17:05 ` peter enderborg 2021-04-21 18:28 ` Shakeel Butt @ 2021-04-22 13:08 ` Michal Hocko 1 sibling, 0 replies; 28+ messages in thread From: Michal Hocko @ 2021-04-22 13:08 UTC (permalink / raw) To: peter enderborg Cc: Shakeel Butt, Johannes Weiner, Roman Gushchin, Linux MM, Andrew Morton, Cgroups, David Rientjes, LKML, Suren Baghdasaryan, Greg Thelen, Dragos Sbirlea, Priya Duraisamy On Wed 21-04-21 19:05:49, peter enderborg wrote: [...] > I think this is the wrong way to go. > > I sent a patch for android lowmemorykiller some years ago. > > http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2017-February/100319.html > > It has been improved since then, so it can handle oom callbacks, and it can act on vmpressure and psi > and as a shrinker. The patches have not been ported to recent kernels though. > > I don't think vmpressure and psi are that relevant now. (They are what userspace acts on.) But the basic idea is to have a priority queue > within the kernel. It needs to pick up new processes and dying processes. And then it has an order, and that > is set with oom adj values by the activity manager in Android. I see this model can be reused for > something that is between a standard oom and userspace. Instead of vmpressure and psi > a watchdog might be a better way. If userspace (in Android the activity manager or lmkd) does not kick the watchdog, > the watchdog bites a task according to the priority and kills it. This priority list does not have to be a list generated > within the kernel. But it has the advantage that you inherit the parent's properties. We use an rb-tree for that. > > All that is missing is the watchdog. And this is off topic to the discussion as well. We are not discussing how to handle the OOM situation best. Shakeel has brought up challenges that some userspace-based OOM killer implementations are facing.
Like it or not, different workloads have different requirements, and what you are using in Android might not be the best fit for everybody. I will not comment on the Android approach, but it doesn't address any of the concerns that have been brought up. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2021-05-05 2:59 UTC | newest] Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-04-20 1:44 [RFC] memory reserve for userspace oom-killer Shakeel Butt 2021-04-20 6:45 ` Michal Hocko 2021-04-20 16:04 ` Shakeel Butt 2021-04-21 7:16 ` Michal Hocko 2021-04-21 13:57 ` Shakeel Butt 2021-04-21 14:29 ` Michal Hocko 2021-04-22 12:33 ` [RFC PATCH] Android OOM helper proof of concept peter enderborg 2021-04-22 13:03 ` Michal Hocko 2021-05-05 0:37 ` [RFC] memory reserve for userspace oom-killer Shakeel Butt 2021-05-05 1:26 ` Suren Baghdasaryan 2021-05-05 2:45 ` Shakeel Butt 2021-05-05 2:59 ` Suren Baghdasaryan 2021-04-20 19:17 ` Roman Gushchin 2021-04-20 19:36 ` Suren Baghdasaryan 2021-04-21 1:18 ` Shakeel Butt 2021-04-21 2:58 ` Roman Gushchin 2021-04-21 13:26 ` Shakeel Butt 2021-04-21 19:04 ` Roman Gushchin 2021-04-21 7:23 ` Michal Hocko 2021-04-21 14:13 ` Shakeel Butt 2021-04-21 17:05 ` peter enderborg 2021-04-21 18:28 ` Shakeel Butt 2021-04-21 18:46 ` Peter.Enderborg 2021-04-21 19:18 ` Shakeel Butt 2021-04-22 5:38 ` Peter.Enderborg 2021-04-22 14:27 ` Shakeel Butt 2021-04-22 15:41 ` Peter.Enderborg 2021-04-22 13:08 ` Michal Hocko