From mboxrd@z Thu Jan 1 00:00:00 1970
From: Suren Baghdasaryan
Date: Tue, 4 May 2021 19:59:13 -0700
Subject: Re: [RFC] memory reserve for userspace oom-killer
To: Shakeel Butt
Cc: Michal Hocko, Johannes Weiner, Roman Gushchin, Linux MM,
 Andrew Morton, Cgroups, David Rientjes, LKML, Greg Thelen,
 Dragos Sbirlea, Priya Duraisamy
Content-Type: text/plain; charset="UTF-8"
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, May 4, 2021 at 7:45 PM Shakeel Butt wrote:
>
> On Tue, May 4, 2021 at 6:26 PM Suren Baghdasaryan wrote:
> >
> > On Tue, May 4, 2021 at 5:37 PM Shakeel Butt wrote:
> > >
> > > On Wed, Apr 21, 2021 at 7:29 AM Michal Hocko wrote:
> > > >
> > > [...]
> > > > > > What if the pool is depleted?
> > > > >
> > > > > This would mean that either the estimate of the mempool size is
> > > > > bad or the oom-killer is buggy and leaking memory.
> > > > >
> > > > > I am open to any design directions for the mempool or some other
> > > > > way where we can provide a notion of a memory guarantee to the
> > > > > oom-killer.
> > > >
> > > > OK, thanks for the clarification. There will certainly be hard
> > > > problems to sort out [1], but the overall idea makes sense to me
> > > > and it sounds like a much better approach than an OOM-specific
> > > > solution.
> > > >
> > > > [1] - how the pool is going to be replenished without hitting all
> > > > the potential reclaim problems (thus dependencies on all other
> > > > tasks, directly or indirectly), yet without relying on any
> > > > background workers to do that on the task's behalf without proper
> > > > accounting, etc.
> > > > --
> > >
> > > I am currently contemplating between two paths here:
> > >
> > > First, the mempool, exposed through either prctl or a new syscall.
> > > Users would need to trace their userspace oom-killer (or whatever
> > > their use case is) to find an appropriate mempool size they would
> > > need, and periodically refill the mempool if allowed by the state
> > > of the machine. The challenge here is to find a good value for the
> > > mempool size and to coordinate the refilling of mempools.
> > >
> > > Second is a mix of Roman's and Peter's suggestions, but much more
> > > simplified: a very simple watchdog with a kill-list of processes.
> > > If userspace does not pet the watchdog within a specified time, it
> > > kills all the processes in the kill-list. The challenge here is to
> > > maintain/update the kill-list.
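For concreteness, here is a minimal sketch of what the first path could
look like from userspace, assuming a hypothetical PR_SET_MEMPOOL prctl
command that reserves a per-process emergency pool. No such command
exists in any kernel; the name, value and semantics below are invented
purely for illustration.

#include <sys/prctl.h>
#include <stdio.h>

/* Hypothetical command: reserve `bytes` of emergency memory for the
 * calling process. Placeholder value; not a real prctl. */
#define PR_SET_MEMPOOL 0x4d504f4f

int main(void)
{
	/* Pool size found by tracing the oom-killer's worst-case
	 * allocation behavior, as described above. */
	const unsigned long pool_bytes = 2UL << 20;	/* 2 MiB */

	if (prctl(PR_SET_MEMPOOL, pool_bytes, 0, 0, 0))
		perror("PR_SET_MEMPOOL (hypothetical)");

	/*
	 * Main monitoring loop would go here. When the machine's memory
	 * state allows, the daemon could re-issue the call to refill
	 * the pool.
	 */
	return 0;
}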
> > IIUC this solution is designed to identify cases when oomd/lmkd get
> > stuck while allocating memory due to memory shortages and therefore
> > can't feed the watchdog. In such a case the kernel goes ahead and
> > kills some processes to free up memory and unblock the blocked
> > process. Effectively this would limit the time such a process stays
> > stuck to the duration of the watchdog timeout. If my understanding
> > of this proposal is correct,
>
> Your understanding is indeed correct.
>
> > then I see the following downsides:
> > 1. oomd/lmkd are still not prevented from getting stuck; this just
> > limits the duration of the blocked state. Delaying kills when memory
> > pressure is high, even for a short duration, is very undesirable.
>
> Yes, I agree.
>
> > I think having mempool reserves could address this issue better, if
> > they can always guarantee memory availability (I am not sure whether
> > that is possible in practice).
>
> I think "mempool ... always guarantee memory availability" is
> something I should quantify with some experiments.
>
> > 2. What would be the performance overhead of this watchdog? To limit
> > the duration of a process being blocked to a small enough value, we
> > would have to use quite a short timeout, which means oomd/lmkd would
> > have to wake up quite often to feed the watchdog. Frequent wakeups
> > on a battery-powered system are not a good idea.
>
> This is indeed the downside, i.e. the tradeoff between an acceptable
> stall and frequent wakeups.
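To make the stall-vs-wakeups tradeoff concrete, here is a sketch of the
userspace petting loop, assuming a hypothetical /proc/oom_watchdog file
that is armed with a timeout and fed by subsequent writes (the proposal
above does not define an interface yet, so this one is invented).
Petting at half the timeout bounds a stall at roughly T milliseconds
but forces a wakeup every T/2:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical interface; the kernel side of the proposal would
	 * define the real one. */
	int fd = open("/proc/oom_watchdog", O_WRONLY);
	if (fd < 0) {
		perror("open /proc/oom_watchdog (hypothetical)");
		return 1;
	}

	/* Arm with a 2-second timeout: a stall is now bounded by ~2s,
	 * but the daemon must wake at least every second. */
	dprintf(fd, "timeout 2000\n");

	for (;;) {
		dprintf(fd, "pet\n");	/* feed the watchdog */
		/* ... normal oomd/lmkd monitoring work ... */
		sleep(1);		/* wake at half the timeout */
	}
}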
> > 3. What if oomd/lmkd gets stuck for some memory-unrelated reason
> > and can't feed the watchdog? In such a scenario the kernel would
> > assume that it is stuck due to memory shortages and would go on a
> > killing spree.
>
> This is correct, but IMHO a killing spree is not worse than oomd/lmkd
> getting stuck for some other reason.
>
> > If there is a sure way to identify when a process gets stuck due to
> > memory shortages, then this could work better.
>
> Hmm, are you saying we should look at the stack traces of the
> userspace oom-killer, or at some metrics related to the oom-killer?
> That will complicate the code.

Well, I don't know of a sure and easy way to identify the reasons for a
process's blockage, but maybe there is one I don't know of? My point is
that we would need some additional indication that memory is the
culprit for the process blockage before resorting to a kill.

> > 4. Additional complexity of keeping the list of potential victims in
> > the kernel. Maybe we can simply reuse oom_score to choose the best
> > victims?
>
> Your point about additional complexity is correct. Regarding
> oom_score, I think you meant oom_score_adj. I would avoid putting more
> policies/complexity in the kernel, but I take your point that the
> simplest watchdog might not be helpful at all.
>
> > Thanks,
> > Suren.
>
> > > I would prefer the direction which oomd and lmkd are open to adopt.
> > >
> > > Any suggestions?
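As a footnote to point 4 above: oom_score_adj is an existing /proc
interface that userspace daemons already use both to shield themselves
from the kernel OOM killer and to bias victim selection. A minimal
example of that real interface (the victim pid is illustrative):

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* /proc/<pid>/oom_score_adj ranges from -1000 (never OOM-kill this
 * process) to 1000 (prefer killing it). Lowering the value requires
 * CAP_SYS_RESOURCE. */
static int set_oom_score_adj(pid_t pid, int adj)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/oom_score_adj", (int)pid);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d", adj);
	return fclose(f);
}

int main(void)
{
	/* Protect the userspace oom-killer itself, as lmkd/oomd do. */
	set_oom_score_adj(getpid(), -1000);

	/* Mark an expendable process as a preferred victim
	 * (pid 12345 is purely illustrative). */
	set_oom_score_adj(12345, 500);
	return 0;
}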