From: Shakeel Butt <shakeelb@google.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Roman Gushchin <guro@fb.com>, Linux MM <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Cgroups <cgroups@vger.kernel.org>,
	David Rientjes <rientjes@google.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Greg Thelen <gthelen@google.com>,
	Dragos Sbirlea <dragoss@google.com>,
	Priya Duraisamy <padmapriyad@google.com>
Subject: Re: [RFC] memory reserve for userspace oom-killer
Date: Wed, 21 Apr 2021 06:57:43 -0700	[thread overview]
Message-ID: <CALvZod4kRWDQuZZQ5F+z6WMcUWLwgYd-Kb0mY8UAEK4MbSOZaA@mail.gmail.com> (raw)
In-Reply-To: <YH/RPydqhwXdyG80@dhcp22.suse.cz>

On Wed, Apr 21, 2021 at 12:16 AM Michal Hocko <mhocko@suse.com> wrote:
>
[...]
> > To decide when to kill, the oom-killer has to read a lot of metrics.
> > It has to open a lot of files to read them, and there will definitely
> > be new allocations involved in those operations. For example, reading
> > memory.stat does a page-size allocation. Similarly, to take action
> > the oom-killer may have to read the cgroup.procs file, which again
> > allocates internally.
>
> True, but many of those can be avoided by opening the files early. At
> least the seq_file-based ones will not allocate later if the output
> size doesn't increase, which should be the case for many. I think it is
> a general improvement to push those that allocate during read to an
> open-time allocation.
>

I agree that this would be a general improvement, but it is not always
possible (see below).
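
For reference, the "open early, read later" pattern being discussed is
roughly the following; the cgroup path is only an example and error
handling is trimmed:

/*
 * Open the cgroup file and do one priming read at startup, so the
 * seq_file buffer is allocated while memory is still plentiful.  On
 * the kill path, re-read from offset 0 with pread(2) on the same fd:
 * no open() and, per the above, normally no new allocation either.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	ssize_t len;
	int fd;

	/* Startup: open and prime the file once. */
	fd = open("/sys/fs/cgroup/workload/memory.stat", O_RDONLY);
	if (fd < 0)
		return 1;
	(void)pread(fd, buf, sizeof(buf) - 1, 0);

	/* Kill path: take a fresh snapshot through the same fd. */
	len = pread(fd, buf, sizeof(buf) - 1, 0);
	if (len > 0) {
		buf[len] = '\0';
		fputs(buf, stdout);
	}

	close(fd);
	return 0;
}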

> > Regarding sophisticated oom policy, I can give one example of our
> > cluster-level policy. For robustness, many user-facing jobs run a lot
> > of instances in a cluster to handle failures. Such jobs are tolerant
> > of some amount of failure, but they still must not let the number of
> > running instances fall below some threshold. Normally killing such
> > jobs is fine, but we do want to make sure that we do not violate
> > their cluster-level agreement. So, the userspace oom-killer may need
> > to dynamically confirm whether such a job can be killed.
>
> What kind of data do you need to examine to make those decisions?
>

Most of the time the cluster-level scheduler pushes the information to
the node controller, which passes it on to the oom-killer. However,
depending on the freshness of that information, the oom-killer might
have to pull the latest state itself (over IPC and RPC).

[...]
> >
> > I was thinking of simply prctl(SET_MEMPOOL, bytes) to assign a
> > mempool to a thread (not shared between threads) and
> > prctl(RESET_MEMPOOL) to free the mempool.
>
> I am not a great fan of prctl. It has become a dumping ground for all
> sorts of unrelated functionality. But let's say this is a minor detail
> at this stage.

I agree this does not have to be prctl().
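
Purely to illustrate the intended usage, something like the below is
what I had in mind. PR_SET_MEMPOOL and PR_RESET_MEMPOOL do not exist in
any kernel; the option numbers and calling convention are made up:

#include <sys/prctl.h>

#define PR_SET_MEMPOOL		0x6d656d01	/* hypothetical */
#define PR_RESET_MEMPOOL	0x6d656d02	/* hypothetical */

/* Preallocate a per-thread reserve before entering the monitoring loop. */
int oom_reserve_setup(unsigned long bytes)
{
	return prctl(PR_SET_MEMPOOL, bytes, 0, 0, 0);
}

/* Return the reserve to the system, e.g. on clean shutdown. */
int oom_reserve_teardown(void)
{
	return prctl(PR_RESET_MEMPOOL, 0, 0, 0, 0);
}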

> So you are proposing to have a per mm mem pool that would be

I was thinking of per-task_struct instead of per-mm_struct just for simplicity.

> used as a fallback for an allocation which cannot make a forward
> progress, right?

Correct

> Would that pool be preallocated and sitting idle?

Correct

> What kind of allocations would be allowed to use the pool?

I was thinking of any type of allocation from the oom-killer (or its
specific threads). Please note that the mempool is a backup and is only
used in the slowpath.
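
As a rough userspace model of the intended semantics (none of the names
below are an existing interface, and the real reserve would live on the
kernel side), think of something like:

#include <stdlib.h>
#include <string.h>

struct mempool {
	char	*base;	/* preallocated up front; would be faulted in and mlocked */
	size_t	size;
	size_t	used;
};

int mempool_init(struct mempool *mp, size_t bytes)
{
	mp->base = malloc(bytes);
	if (!mp->base)
		return -1;
	memset(mp->base, 0, bytes);	/* fault the pages in while memory is fine */
	mp->size = bytes;
	mp->used = 0;
	return 0;
}

/* Fallback allocator: simple bump allocation from the reserve. */
void *mempool_alloc(struct mempool *mp, size_t bytes)
{
	void *p;

	if (mp->used + bytes > mp->size)
		return NULL;	/* depleted: the size estimate was wrong */
	p = mp->base + mp->used;
	mp->used += bytes;
	return p;
}

/*
 * Slowpath-only semantics: the reserve is touched only if the normal
 * allocation fails.
 */
void *alloc_with_reserve(struct mempool *mp, size_t bytes)
{
	void *p = malloc(bytes);

	return p ? p : mempool_alloc(mp, bytes);
}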

> What if the pool is depleted?

This would mean that either the estimate of the mempool size is bad or
the oom-killer is buggy and leaking memory.

I am open to any design direction for the mempool, or some other way
to provide a notion of a memory guarantee to the oom-killer.

thanks,
Shakeel
