From: Michal Hocko <mhocko@suse.com>
To: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Roman Gushchin <guro@fb.com>, Linux MM <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Cgroups <cgroups@vger.kernel.org>,
	David Rientjes <rientjes@google.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Greg Thelen <gthelen@google.com>,
	Dragos Sbirlea <dragoss@google.com>,
	Priya Duraisamy <padmapriyad@google.com>
Subject: Re: [RFC] memory reserve for userspace oom-killer
Date: Wed, 21 Apr 2021 09:16:15 +0200
Message-ID: <YH/RPydqhwXdyG80@dhcp22.suse.cz>
In-Reply-To: <CALvZod4kjdgMU=8T_bx6zFufA1cGtt2p1Jg8jOgi=+g=bs-Evw@mail.gmail.com>

On Tue 20-04-21 09:04:21, Shakeel Butt wrote:
> On Mon, Apr 19, 2021 at 11:46 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 19-04-21 18:44:02, Shakeel Butt wrote:
> [...]
> > > memory.min. However, a new allocation from the userspace oom-killer
> > > can still get stuck in reclaim, and a policy-rich oom-killer does
> > > trigger new allocations through syscalls or even the heap.
> >
> > Can you be more specific please?
> >
> 
> To decide when to kill, the oom-killer has to read a lot of metrics.
> It has to open a lot of files to read them, and there will definitely
> be new allocations involved in those operations. For example, reading
> memory.stat does a page-size allocation. Similarly, to take action
> the oom-killer may have to read the cgroup.procs file, which again
> allocates internally.

True, but many of those can be avoided by opening the files early. At
least the seq_file based ones will not allocate later if the output
size doesn't increase, which should be the case for many. I think it
is a general improvement to push those that allocate during read to an
open-time allocation.
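
A minimal userspace sketch of that pattern (the cgroup path and names
are illustrative, and it assumes a cgroup v2 hierarchy mounted under
/sys/fs/cgroup): open the stat file once at startup, do one throwaway
read so any seq_file buffer sizing happens up front, and later re-read
it with pread() so the monitoring path does no new open():

/* Sketch only: pre-open memory.stat and re-read it via pread() so the
 * emergency monitoring path avoids open-time and, for seq_file based
 * files whose output does not grow, read-time allocations. */
#include <fcntl.h>
#include <unistd.h>

static int stat_fd = -1;
static char stat_buf[8192];          /* preallocated, never grown */

static ssize_t oom_monitor_read_stat(void)
{
        /* Re-reading from offset 0 regenerates the seq_file output,
         * reusing the buffer allocated on the first read as long as
         * the output size stays the same. */
        ssize_t n = pread(stat_fd, stat_buf, sizeof(stat_buf) - 1, 0);

        if (n > 0)
                stat_buf[n] = '\0';
        return n;
}

static int oom_monitor_init(void)
{
        /* Illustrative path; depends on the cgroup being watched. */
        stat_fd = open("/sys/fs/cgroup/workload/memory.stat",
                       O_RDONLY | O_CLOEXEC);
        if (stat_fd < 0)
                return -1;
        /* One throwaway read while memory is still plentiful. */
        return oom_monitor_read_stat() < 0 ? -1 : 0;
}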

> Regarding sophisticated oom policy, I can give one example of our
> cluster-level policy. For robustness, many user-facing jobs run a lot
> of instances in a cluster to handle failures. Such jobs are tolerant
> of some amount of failure, but they still require that the number of
> running instances not fall below some threshold. Normally, killing
> such jobs is fine, but we do want to make sure that we do not violate
> their cluster-level agreement. So the userspace oom-killer may
> dynamically need to confirm whether such a job can be killed.

What kind of data do you need to examine to make those decisions?

> [...]
> > > To reliably solve this problem, we need to give guaranteed memory to
> > > the userspace oom-killer.
> >
> > There is nothing like that. Even memory reserves are a finite
> > resource which can be depleted, because they are shared with other
> > users who are not necessarily coordinated. So before we start
> > discussing making this even muddier by handing over memory reserves
> > to userspace, we should really examine whether pre-allocation is
> > something that will not work.
> >
> 
> We actually explored whether we could restrict the oom-killer to
> syscalls which do not do memory allocations. We concluded that is
> neither practical nor maintainable. Whatever list we can come up
> with will soon be outdated. In addition, converting all the must-have
> syscalls to not allocate is not possible/practical.

I am definitely curious to learn more.
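
For reference, the simplest form of that pre-allocation is done
entirely in userspace at startup; a rough sketch (the scratch size is
an illustrative assumption, not a recommendation):

/* Sketch only: fault in a private scratch arena and pin the
 * oom-killer's memory so its emergency path neither page-faults nor
 * grows the heap under pressure. */
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define SCRATCH_SIZE (2UL << 20)     /* 2 MiB, assumed */

static char *scratch;

static int oom_reserve_init(void)
{
        scratch = malloc(SCRATCH_SIZE);
        if (!scratch)
                return -1;
        /* Touch every page now, while memory is still available. */
        memset(scratch, 0, SCRATCH_SIZE);

        /* Pin current and future mappings (stacks, later heap growth). */
        if (mlockall(MCL_CURRENT | MCL_FUTURE))
                return -1;
        return 0;
}

Note that this only covers the process's own pages; kernel allocations
done on its behalf inside syscalls, which is what the paragraph above
is about, are not covered.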

[...]
> > > 2. Mempool
> > >
> > > The idea is to preallocate a mempool with a given amount of memory
> > > for the userspace oom-killer. Preferably this will be per-thread,
> > > and the oom-killer can preallocate a mempool for its specific
> > > threads. Before going to the reclaim path, the core page allocator
> > > can check whether the task has private access to a mempool and, if
> > > so, return a page from it.
> >
> > Could you elaborate some more on how this would be controlled from the
> > userspace? A dedicated syscall? A driver?
> >
> 
> I was thinking of simply prctl(SET_MEMPOOL, bytes) to assign a mempool
> to a thread (not shared between threads) and prctl(RESET_MEMPOOL) to
> free the mempool.

I am not a great fan of prctl. It has become a dumping ground for all
kinds of unrelated functionality. But let's say this is a minor detail
at this stage. So you are proposing to have a per-mm memory pool that
would be used as a fallback for an allocation which cannot make forward
progress, right? Would that pool be preallocated and sit idle? What
kind of allocations would be allowed to use the pool? What if the pool
is depleted?
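
Just to make the interface under discussion concrete, a hypothetical
usage sketch follows. PR_SET_MEMPOOL and PR_RESET_MEMPOOL do not exist
in the kernel; they only mirror the proposal quoted above:

/* Hypothetical interface only: these prctl commands are the proposal
 * being discussed, not an existing kernel API. */
#include <sys/prctl.h>

#define PR_SET_MEMPOOL   0x59410001   /* made-up command values */
#define PR_RESET_MEMPOOL 0x59410002

static int mempool_attach(unsigned long bytes)
{
        /* Ask the kernel to preallocate a per-thread pool that the page
         * allocator could fall back to instead of entering reclaim. */
        return prctl(PR_SET_MEMPOOL, bytes, 0, 0, 0);
}

static int mempool_detach(void)
{
        /* Return the unused pool to the system. */
        return prctl(PR_RESET_MEMPOOL, 0, 0, 0, 0);
}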
-- 
Michal Hocko
SUSE Labs
