From mboxrd@z Thu Jan 1 00:00:00 1970
From: Suren Baghdasaryan
Date: Wed, 30 Jun 2021 11:43:54 -0700
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call
To: Shakeel Butt
Cc: Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
    Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
    Minchan Kim, Christian Brauner, Christoph Hellwig, Oleg Nesterov,
    David Hildenbrand, Jann Horn, Tim Murray, Linux API, Linux MM,
    LKML, kernel-team
References: <20210623192822.3072029-1-surenb@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Jun 30, 2021 at 11:01 AM Shakeel Butt wrote:
>
> Hi Suren,
>
> On Wed, Jun 23, 2021 at 12:28 PM Suren Baghdasaryan wrote:
> >
> > In modern systems it's not unusual to have a system component monitoring
> > memory conditions of the system and tasked with keeping system memory
> > pressure under control. One way to accomplish that is to kill
> > non-essential processes to free up memory for more important ones.
> > Examples of this are Facebook's OOM killer daemon called oomd and
> > Android's low memory killer daemon called lmkd.
> > For such system component it's important to be able to free memory
> > quickly and efficiently. Unfortunately the time process takes to free
> > up its memory after receiving a SIGKILL might vary based on the state
> > of the process (uninterruptible sleep), size and OPP level of the core
> > the process is running. A mechanism to free resources of the target
> > process in a more predictable way would improve system's ability to
> > control its memory pressure.
> > Introduce process_reap system call that reclaims memory of a dying process
> > from the context of the caller. This way the memory in freed in a more
> > controllable way with CPU affinity and priority of the caller. The workload
> > of freeing the memory will also be charged to the caller.
> > The operation is allowed only on a dying process.
> >
> > Previously I proposed a number of alternatives to accomplish this:
> > - https://lore.kernel.org/patchwork/patch/1060407 extending
> > pidfd_send_signal to allow memory reaping using oom_reaper thread;
> > - https://lore.kernel.org/patchwork/patch/1338196 extending
> > pidfd_send_signal to reap memory of the target process synchronously from
> > the context of the caller;
> > - https://lore.kernel.org/patchwork/patch/1344419/ to add MADV_DONTNEED
> > support for process_madvise implementing synchronous memory reaping.
> >
> > The end of the last discussion culminated with suggestion to introduce a
> > dedicated system call (https://lore.kernel.org/patchwork/patch/1344418/#1553875)
> > The reasoning was that the new variant of process_madvise
> > a) does not work on an address range
> > b) is destructive
> > c) doesn't share much code at all with the rest of process_madvise
> > From the userspace point of view it was awkward and inconvenient to provide
> > memory range for this operation that operates on the entire address space.
> > Using special flags or address values to specify the entire address space
> > was too hacky.
> >
> > The API is as follows,
> >
> >           int process_reap(int pidfd, unsigned int flags);
> >
> >         DESCRIPTION
> >           The process_reap() system call is used to free the memory of a
> >           dying process.
> >
> >           The pidfd selects the process referred to by the PID file
> >           descriptor.
> >           (See pidofd_open(2) for further information)
>
> *pidfd_open

Ack

>
> >           The flags argument is reserved for future use; currently, this
> >           argument must be specified as 0.
> >
> >         RETURN VALUE
> >           On success, process_reap() returns 0. On error, -1 is returned
> >           and errno is set to indicate the error.
> >
> > Signed-off-by: Suren Baghdasaryan
>
> Thanks for continuously pushing this. One question I have is how do
> you envision this syscall to be used for the cgroup based workloads.
> Traverse the target tree, read pids from cgroup.procs files,
> pidfd_open them, send SIGKILL and then process_reap them. Is that
> right?

Yes, at least that's how Android does that.
It's a bit more involved but it's a technical detail. Userspace low
memory killer kills a process (sends SIGKILL and calls process_reap)
and another system component detects that a process died and will kill
all processes belonging to the same cgroup (that's how we identify
related processes).

>
> Orthogonal to this patch I wonder if we should have an optimized way
> to reap processes from a cgroup. Something similar to cgroup.kill (or
> maybe overload cgroup.kill with reaping as well).

Seems reasonable to me. We could use that in the above scenario.

>
> [...]
>
> > +
> > +SYSCALL_DEFINE2(process_reap, int, pidfd, unsigned int, flags)
> > +{
> > +	struct pid *pid;
> > +	struct task_struct *task;
> > +	struct mm_struct *mm = NULL;
> > +	unsigned int f_flags;
> > +	long ret = 0;
> > +
> > +	if (flags != 0)
> > +		return -EINVAL;
> > +
> > +	pid = pidfd_get_pid(pidfd, &f_flags);
> > +	if (IS_ERR(pid))
> > +		return PTR_ERR(pid);
> > +
> > +	task = get_pid_task(pid, PIDTYPE_PID);
> > +	if (!task) {
> > +		ret = -ESRCH;
> > +		goto put_pid;
> > +	}
> > +
> > +	/*
> > +	 * If the task is dying and in the process of releasing its memory
> > +	 * then get its mm.
> > +	 */
> > +	task_lock(task);
> > +	if (task_will_free_mem(task) && (task->flags & PF_KTHREAD) == 0) {
>
> task_will_free_mem() is fine here but I think in parallel we should
> optimize this function. At the moment it is traversing all the
> processes on the machine. It is very normal to have tens of thousands
> of processes on big machines, so it would be really costly when
> reaping a bunch of processes.

Hmm. But I think we still need to make sure that the mm is not shared
with another non-dying process. IIUC that's the point of that traversal.
Am I mistaken?
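
[Editor's note: the sketch below is not part of the patch or of this thread; it
illustrates the kill-then-reap flow discussed above (read pids from
cgroup.procs, pidfd_open them, send SIGKILL, then process_reap them) as a
userspace daemon might implement it. Since process_reap has no glibc wrapper
and no allocated syscall number at this point, the number used here is a
placeholder assumption; the fallback pidfd numbers are the x86_64 values, and
the helper names and cgroup path are illustrative only.]

#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434		/* x86_64 fallback */
#endif
#ifndef __NR_pidfd_send_signal
#define __NR_pidfd_send_signal 424	/* x86_64 fallback */
#endif
/* Placeholder: replace with whatever number the final patch allocates. */
#define __NR_process_reap_placeholder 448

/*
 * Kill one process through its pidfd, then reap its memory from the
 * caller's context, so the freeing work runs with the caller's CPU
 * affinity and priority and is charged to the caller.
 */
static int kill_and_reap(int pid)
{
	int pidfd, ret;

	pidfd = syscall(__NR_pidfd_open, pid, 0);
	if (pidfd < 0)
		return -1;

	/* SIGKILL via the pidfd avoids racing against PID reuse. */
	if (syscall(__NR_pidfd_send_signal, pidfd, SIGKILL, NULL, 0) < 0) {
		close(pidfd);
		return -1;
	}

	/* process_reap is only allowed on a dying process; fails otherwise. */
	ret = syscall(__NR_process_reap_placeholder, pidfd, 0);
	close(pidfd);
	return ret;
}

/*
 * Walk one cgroup, as discussed above: read its members from cgroup.procs
 * (e.g. /sys/fs/cgroup/<group>/cgroup.procs) and kill-and-reap each one.
 */
static void reap_cgroup(const char *procs_path)
{
	FILE *f = fopen(procs_path, "r");
	int pid;

	if (!f)
		return;
	while (fscanf(f, "%d", &pid) == 1)
		kill_and_reap(pid);
	fclose(f);
}

The ordering matters: SIGKILL has to be delivered before process_reap is
called, because the syscall refuses to operate on a process that is not
already dying, and the reap itself then happens synchronously in the
caller's context, which is the whole point of the new call.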