Re: [RFC] memory reserve for userspace oom-killer

From: <Peter.Enderborg@sony.com>
To: <shakeelb@google.com>
Cc: <hannes@cmpxchg.org>, <guro@fb.com>, <mhocko@kernel.org>,
	<linux-mm@kvack.org>, <akpm@linux-foundation.org>,
	<cgroups@vger.kernel.org>, <rientjes@google.com>,
	<linux-kernel@vger.kernel.org>, <surenb@google.com>,
	<gthelen@google.com>, <dragoss@google.com>,
	<padmapriyad@google.com>
Subject: Re: [RFC] memory reserve for userspace oom-killer
Date: Thu, 22 Apr 2021 05:38:13 +0000	[thread overview]
Message-ID: <6eaa4c24-c565-bc5d-dbca-b73c72569a16@sony.com> (raw)
In-Reply-To: <CALvZod5+5ycobmSt=NC3VJF4FRMFmBQEN7SQgipyTDbzHEbPUQ@mail.gmail.com>

On 4/21/21 9:18 PM, Shakeel Butt wrote:
> On Wed, Apr 21, 2021 at 11:46 AM <Peter.Enderborg@sony.com> wrote:
>> On 4/21/21 8:28 PM, Shakeel Butt wrote:
>>> On Wed, Apr 21, 2021 at 10:06 AM peter enderborg
>>> <peter.enderborg@sony.com> wrote:
>>>> On 4/20/21 3:44 AM, Shakeel Butt wrote:
>>> [...]
>>>> I think this is the wrong way to go.
>>> Which one? Are you talking about the kernel one? We already talked out
>>> of that. To decide to OOM, we need to look at a very diverse set of
>>> metrics and it seems like that would be very hard to do flexibly
>>> inside the kernel.
>> You dont need to decide to oom, but when oom occurs you
>> can take a proper action.
> No, we want the flexibility to decide when to oom-kill. Kernel is very
> conservative in triggering the oom-kill.

It wont do it for you. We use this code to solve that:

/*
 *  lowmemorykiller_oom
 *
 *  Author: Peter Enderborg <peter.enderborg@sonymobile.com>
 *
 *  This program is free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License version 2 as
 *  published by the Free Software Foundation.
 */

/* add fake print format with original module name */
#define pr_fmt(fmt) "lowmemorykiller: " fmt

#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/oom.h>

#include <trace/events/lmk.h>

#include "lowmemorykiller.h"
#include "lowmemorykiller_tng.h"
#include "lowmemorykiller_stats.h"
#include "lowmemorykiller_tasks.h"

/**
 * lowmemorykiller_oom_notify - OOM notifier
 * @self:    notifier block struct
 * @notused:    not used
 * @parm:    returned - number of pages freed
 *
 * Return value:
 *    NOTIFY_OK
 **/

static int lowmemorykiller_oom_notify(struct notifier_block *self,
                      unsigned long notused, void *param)
{
    struct lmk_rb_watch *lrw;
    unsigned long *nfreed = param;

    lowmem_print(2, "oom notify event\n");
    *nfreed = 0;
    lmk_inc_stats(LMK_OOM_COUNT);
    spin_lock_bh(&lmk_task_lock);
    lrw = __lmk_task_first();
    if (lrw) {
        struct task_struct *selected = lrw->tsk;
        struct lmk_death_pending_entry *ldpt;

        if (!task_trylock_lmk(selected)) {
            lmk_inc_stats(LMK_ERROR);
            lowmem_print(1, "Failed to lock task.\n");
            lmk_inc_stats(LMK_BUSY);
            goto unlock_out;
        }

        get_task_struct(selected);
        /* move to kill pending set */
        ldpt = kmem_cache_alloc(lmk_dp_cache, GFP_ATOMIC);
        /* if we fail to alloc we ignore the death pending list */
        if (ldpt) {
            ldpt->tsk = selected;
            __lmk_death_pending_add(ldpt);
        } else {
            WARN_ON(1);
            lmk_inc_stats(LMK_MEM_ERROR);
            trace_lmk_sigkill(selected->pid, selected->comm,
                      LMK_TRACE_MEMERROR, *nfreed, 0);
        }
        if (!__lmk_task_remove(selected, lrw->key))
            WARN_ON(1);

        spin_unlock_bh(&lmk_task_lock);
        *nfreed = get_task_rss(selected);
        send_sig(SIGKILL, selected, 0);

        LMK_TAG_TASK_DIE(selected);
        trace_lmk_sigkill(selected->pid, selected->comm,
                  LMK_TRACE_OOMKILL, *nfreed,
                  0);

        task_unlock(selected);
        put_task_struct(selected);
        lmk_inc_stats(LMK_OOM_KILL_COUNT);
        goto out;
    }
unlock_out:
    spin_unlock_bh(&lmk_task_lock);
out:
    return NOTIFY_OK;
}

static struct notifier_block lowmemorykiller_oom_nb = {
    .notifier_call = lowmemorykiller_oom_notify
};

int __init lowmemorykiller_register_oom_notifier(void)
{
    register_oom_notifier(&lowmemorykiller_oom_nb);
    return 0;
}

So what needed is a function that knows the
priority. Here __lmk_task_first() that is from a rb-tree.

You can pick what ever priority you like. In our case it is a
android so it is a strictly oom_adj order in the tree.

I think you can do the same with a old lowmemmorykiller style  with
a full task scan too.

> [...]
>>> Actually no. It is missing the flexibility to monitor metrics which a
>>> user care and based on which they decide to trigger oom-kill. Not sure
>>> how will watchdog replace psi/vmpressure? Userspace keeps petting the
>>> watchdog does not mean that system is not suffering.
>> The userspace should very much do what it do. But when it
>> does not do what it should do, including kick the WD. Then
>> the kernel kicks in and kill a pre defined process or as many
>> as needed until the monitoring can start to kick and have the
>> control.
>>
> Roman already suggested something similar (i.e. oom-killer core and
> extended and core watching extended) but completely in userspace. I
> don't see why we would want to do that in the kernel instead.

A watchdog in kernel will work if userspace is completely broken
or staved with low on memory.

>
>>> In addition oom priorities change dynamically and changing it in your
>>> system seems very hard. Cgroup awareness is missing too.
>> Why is that hard? Moving a object in a rb-tree is as good it get.
>>
> It is a group of objects. Anyways that is implementation detail.
>
> The message I got from this exchange is that we can have a watchdog
> (userspace or kernel) to further improve the reliability of userspace
> oom-killers.