From: Shakeel Butt <shakeelb@google.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>, Roman Gushchin <guro@fb.com>,
	Linux MM <linux-mm@kvack.org>, Andrew Morton <akpm@linux-foundation.org>,
	Cgroups <cgroups@vger.kernel.org>, David Rientjes <rientjes@google.com>,
	LKML <linux-kernel@vger.kernel.org>, Suren Baghdasaryan <surenb@google.com>,
	Greg Thelen <gthelen@google.com>, Dragos Sbirlea <dragoss@google.com>,
	Priya Duraisamy <padmapriyad@google.com>
Subject: Re: [RFC] memory reserve for userspace oom-killer
Date: Wed, 21 Apr 2021 06:57:43 -0700	[thread overview]
Message-ID: <CALvZod4kRWDQuZZQ5F+z6WMcUWLwgYd-Kb0mY8UAEK4MbSOZaA@mail.gmail.com> (raw)
In-Reply-To: <YH/RPydqhwXdyG80@dhcp22.suse.cz>

On Wed, Apr 21, 2021 at 12:16 AM Michal Hocko <mhocko@suse.com> wrote:
>
[...]
> > To decide when to kill, the oom-killer has to read a lot of metrics.
> > It has to open a lot of files to read them and there will definitely
> > be new allocations involved in those operations. For example, reading
> > memory.stat does a page size allocation. Similarly, to perform an
> > action the oom-killer may have to read the cgroup.procs file, which
> > again has an allocation inside it.
>
> True but many of those can be avoided by opening the file early. At
> least seq_file based ones will not allocate later if the output size
> doesn't increase, which should be the case for many. I think it is a
> general improvement to push those that allocate during read to an
> open-time allocation.
>

I agree that this would be a general improvement but it is not always
possible (see below).

> > Regarding sophisticated oom policy, I can give one example of our
> > cluster level policy. For robustness, many user facing jobs run a lot
> > of instances in a cluster to handle failures. Such jobs are tolerant
> > to some amount of failures but they still have requirements not to
> > let the number of running instances fall below some threshold.
> > Normally killing such jobs is fine but we do want to make sure that
> > we do not violate their cluster level agreement. So, the userspace
> > oom-killer may dynamically need to confirm if such a job can be
> > killed.
>
> What kind of data do you need to examine to make those decisions?
>

Most of the time the cluster level scheduler pushes the information to
the node controller, which transfers that information to the
oom-killer. However, depending on the freshness of the information,
the oom-killer might request to pull the latest information (IPC and
RPC).

[...]
> >
> > I was thinking of simply prctl(SET_MEMPOOL, bytes) to assign a
> > mempool to a thread (not shared between threads) and
> > prctl(RESET_MEMPOOL) to free the mempool.
>
> I am not a great fan of prctl. It has become a dumping ground for a
> mix of unrelated functionality. But let's say this is a minor detail
> at this stage.

I agree this does not have to be prctl().

> So you are proposing to have a per mm mem pool that would be

I was thinking of per-task_struct instead of per-mm_struct just for
simplicity.

> used as a fallback for an allocation which cannot make forward
> progress, right?

Correct.

> Would that pool be preallocated and sitting idle?

Correct.

> What kind of allocations would be allowed to use the pool?

I was thinking of any type of allocation from the oom-killer (or
specific threads). Please note that the mempool is the backup and only
used in the slowpath.

> What if the pool is depleted?

This would mean that either the estimate of the mempool size is bad or
the oom-killer is buggy and leaking memory.

I am open to any design directions for the mempool or some other way
where we can provide a notion of memory guarantee to the oom-killer.

thanks,
Shakeel