From: Suren Baghdasaryan <surenb@google.com>
To: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@suse.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Roman Gushchin <guro@fb.com>, Linux MM <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Cgroups <cgroups@vger.kernel.org>,
	David Rientjes <rientjes@google.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Greg Thelen <gthelen@google.com>,
	Dragos Sbirlea <dragoss@google.com>,
	Priya Duraisamy <padmapriyad@google.com>
Subject: Re: [RFC] memory reserve for userspace oom-killer
Date: Tue, 4 May 2021 19:59:13 -0700	[thread overview]
Message-ID: <CAJuCfpGD8xBh2nepB0zmxRjzjQQbxKj_o9OzQPQMkw5rUcovMQ@mail.gmail.com> (raw)
In-Reply-To: <CALvZod4pqkY84Od67=aEnpWL7V3bXnH4pduBQAh89Byp=snD+Q@mail.gmail.com>

On Tue, May 4, 2021 at 7:45 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Tue, May 4, 2021 at 6:26 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Tue, May 4, 2021 at 5:37 PM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Wed, Apr 21, 2021 at 7:29 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > [...]
> > > > > > What if the pool is depleted?
> > > > >
> > > > > This would mean that either the estimate of mempool size is bad or
> > > > > oom-killer is buggy and leaking memory.
> > > > >
> > > > > I am open to any design directions for mempool or some other way where
> > > > > we can provide a notion of memory guarantee to oom-killer.
> > > >
> > > > OK, thanks for the clarification. There will certainly be hard problems
> > > > to sort out[1], but the overall idea makes sense to me and it sounds
> > > > like a much better approach than an OOM-specific solution.
> > > >
> > > >
> > > > [1] - how the pool is going to be replenished without hitting all the
> > > > potential reclaim problems (and thus dependencies on all other tasks,
> > > > directly or indirectly), yet without relying on any background workers
> > > > to do that on the task's behalf without proper accounting, etc...
> > > > --
> > >
> > > I am currently weighing two paths here:
> > >
> > > First, the mempool, exposed through either prctl or a new syscall.
> > > Users would need to trace their userspace oom-killer (or whatever
> > > their use case is) to find an appropriate mempool size, and then
> > > periodically refill the mempool if the state of the machine allows
> > > it. The challenges here are finding a good value for the mempool
> > > size and coordinating the refilling of mempools.
> > >
> > > Second is a mix of Roman's and Peter's suggestions, but much more
> > > simplified: a very simple watchdog with a kill-list of processes;
> > > if userspace doesn't pet the watchdog within a specified time, it
> > > kills all the processes in the kill-list. The challenge here is
> > > maintaining/updating the kill-list.
> >
> > IIUC this solution is designed to identify cases when oomd/lmkd gets
> > stuck while allocating memory due to memory shortages and therefore
> > can't feed the watchdog. In such a case the kernel goes ahead and
> > kills some processes to free up memory and unblock the blocked
> > process. Effectively this would limit the time such a process stays
> > stuck to the duration of the watchdog timeout. If my understanding of
> > this proposal is correct,
>
> Your understanding is indeed correct.
>
> > then I see the following downsides:
> > 1. oomd/lmkd are still not prevented from getting stuck; this just
> > limits the duration of the blocked state. Delaying kills when memory
> > pressure is high, even for a short duration, is very undesirable.
>
> Yes I agree.
>
> > I think
> > having mempool reserves could address this issue better, if they can
> > always guarantee memory availability (not sure whether that is
> > possible in practice).
>
> I think "mempool ... always guarantee memory availability" is
> something I should quantify with some experiments.
>
> > 2. What would be the performance overhead of this watchdog? To limit
> > the duration of a process being blocked to a small enough value, we
> > would need quite a small timeout, which means oomd/lmkd would have
> > to wake up quite often to feed the watchdog. Frequent wakeups on a
> > battery-powered system are not a good idea.
>
> This is indeed the downside, i.e. the tradeoff between an acceptable
> stall and frequent wakeups.
>
> > 3. What if oomd/lmkd gets stuck for some memory-unrelated reason and
> > can't feed the watchdog? In such a scenario the kernel would assume
> > that it is stuck due to memory shortages and would go on a killing
> > spree.
>
> This is correct, but IMHO a killing spree is not worse than oomd/lmkd
> getting stuck for some other reason.
>
> > If there is a sure way to identify when a process gets stuck
> > due to memory shortages then this could work better.
>
> Hmm, are you suggesting looking at the stack traces of the userspace
> oom-killer, or at some metrics related to the oom-killer? That would
> complicate the code.

Well, I don't know of a sure and easy way to identify the reason a
process is blocked, but maybe one exists? My point is that we would
need some additional indication that memory is the culprit for the
blockage before resorting to a kill.
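
As an example of the kind of indication I mean (just a sketch, not a
concrete proposal): before anything gets killed, the watchdog path
could require that system-wide memory pressure is actually elevated,
e.g. by consulting PSI. From userspace that check could look roughly
like this (the threshold is made up for illustration):

#include <stdio.h>
#include <stdbool.h>

/*
 * Treat memory as the likely culprit only if the "full" PSI stall
 * averaged over the last 10s exceeds the given threshold.
 */
static bool memory_pressure_high(double threshold_pct)
{
	char line[256];
	double avg10 = 0.0;
	FILE *f = fopen("/proc/pressure/memory", "r");

	if (!f)
		return false;
	while (fgets(line, sizeof(line), f)) {
		/* lines look like: full avg10=1.23 avg60=... avg300=... total=... */
		if (sscanf(line, "full avg10=%lf", &avg10) == 1)
			break;
	}
	fclose(f);
	return avg10 >= threshold_pct;
}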

>
> > 4. Additional complexity of keeping the list of potential victims in
> > the kernel. Maybe we can simply reuse oom_score to choose the best
> > victims?
>
> Your point about additional complexity is correct. Regarding
> oom_score, I think you meant oom_score_adj. I would avoid putting more
> policies/complexity in the kernel, but I take your point that the
> simplest watchdog might not be helpful at all.
>
> > Thanks,
> > Suren.
> >
> > >
> > > I would prefer the direction which oomd and lmkd are open to adopt.
> > >
> > > Any suggestions?
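
To make the comparison more concrete, here is roughly how I picture the
userspace side of the first (mempool) option. To be clear,
PR_SET_MEM_RESERVE below is a made-up prctl command used purely for
illustration; nothing like it exists today:

#include <sys/prctl.h>
#include <unistd.h>

#define PR_SET_MEM_RESERVE	77	/* hypothetical, does not exist */

/*
 * Ask the kernel to set aside 'bytes' of reserve for this process.
 * The reserve would only be dipped into when normal allocations
 * would otherwise have to block in reclaim.
 */
static int set_mem_reserve(unsigned long bytes)
{
	return prctl(PR_SET_MEM_RESERVE, bytes, 0, 0, 0);
}

int main(void)
{
	/* size derived from tracing the oom-killer, as discussed above */
	set_mem_reserve(2UL * 1024 * 1024);

	for (;;) {
		/* ... monitor pressure and kill victims as usual ... */
		sleep(1);
		/* top the reserve back up while the machine is healthy */
		set_mem_reserve(2UL * 1024 * 1024);
	}
}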

Thread overview: 65+ messages

2021-04-20  1:44 [RFC] memory reserve for userspace oom-killer Shakeel Butt
2021-04-20  6:45 ` Michal Hocko
2021-04-20 16:04   ` Shakeel Butt
2021-04-21  7:16     ` Michal Hocko
2021-04-21 13:57       ` Shakeel Butt
2021-04-21 14:29         ` Michal Hocko
2021-04-22 12:33           ` [RFC PATCH] Android OOM helper proof of concept peter enderborg
2021-04-22 13:03             ` Michal Hocko
2021-05-05  0:37           ` [RFC] memory reserve for userspace oom-killer Shakeel Butt
2021-05-05  1:26             ` Suren Baghdasaryan
2021-05-05  2:45               ` Shakeel Butt
2021-05-05  2:59                 ` Suren Baghdasaryan [this message]
2021-05-05  2:43             ` Hillf Danton
2021-04-20 19:17 ` Roman Gushchin
2021-04-20 19:36   ` Suren Baghdasaryan
2021-04-21  1:18   ` Shakeel Butt
2021-04-21  2:58     ` Roman Gushchin
2021-04-21 13:26       ` Shakeel Butt
2021-04-21 19:04         ` Roman Gushchin
2021-04-21  7:23     ` Michal Hocko
2021-04-21 14:13       ` Shakeel Butt
2021-04-21 17:05 ` peter enderborg
2021-04-21 18:28   ` Shakeel Butt
2021-04-21 18:46     ` Peter.Enderborg
2021-04-21 19:18       ` Shakeel Butt
2021-04-22  5:38         ` Peter.Enderborg
2021-04-22 14:27           ` Shakeel Butt
2021-04-22 15:41             ` Peter.Enderborg
2021-04-22 13:08   ` Michal Hocko
