From mboxrd@z Thu Jan 1 00:00:00 1970
From: Suren Baghdasaryan
Date: Wed, 30 Jun 2021 11:43:54 -0700
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call
To: Shakeel Butt
Cc: Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
    Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
    Minchan Kim, Christian Brauner, Christoph Hellwig, Oleg Nesterov,
    David Hildenbrand, Jann Horn, Tim Murray, Linux API, Linux MM,
    LKML, kernel-team
References: <20210623192822.3072029-1-surenb@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Jun 30, 2021 at 11:01 AM Shakeel Butt wrote:
>
> Hi Suren,
>
> On Wed, Jun 23, 2021 at 12:28 PM Suren Baghdasaryan wrote:
> >
> > In modern systems it's not unusual to have a system component monitoring
> > memory conditions of the system and tasked with keeping system memory
> > pressure under control. One way to accomplish that is to kill
> > non-essential processes to free up memory for more important ones.
> > Examples of this are Facebook's OOM killer daemon called oomd and
> > Android's low memory killer daemon called lmkd.
> > For such system component it's important to be able to free memory
> > quickly and efficiently. Unfortunately the time process takes to free
> > up its memory after receiving a SIGKILL might vary based on the state
> > of the process (uninterruptible sleep), size and OPP level of the core
> > the process is running. A mechanism to free resources of the target
> > process in a more predictable way would improve system's ability to
> > control its memory pressure.
> > Introduce process_reap system call that reclaims memory of a dying process
> > from the context of the caller. This way the memory in freed in a more
> > controllable way with CPU affinity and priority of the caller. The workload
> > of freeing the memory will also be charged to the caller.
> > The operation is allowed only on a dying process.
> >
> > Previously I proposed a number of alternatives to accomplish this:
> > - https://lore.kernel.org/patchwork/patch/1060407 extending
> > pidfd_send_signal to allow memory reaping using oom_reaper thread;
> > - https://lore.kernel.org/patchwork/patch/1338196 extending
> > pidfd_send_signal to reap memory of the target process synchronously from
> > the context of the caller;
> > - https://lore.kernel.org/patchwork/patch/1344419/ to add MADV_DONTNEED
> > support for process_madvise implementing synchronous memory reaping.
> >
> > The end of the last discussion culminated with suggestion to introduce a
> > dedicated system call (https://lore.kernel.org/patchwork/patch/1344418/#1553875)
> > The reasoning was that the new variant of process_madvise
> > a) does not work on an address range
> > b) is destructive
> > c) doesn't share much code at all with the rest of process_madvise
> > From the userspace point of view it was awkward and inconvenient to provide
> > memory range for this operation that operates on the entire address space.
> > Using special flags or address values to specify the entire address space
> > was too hacky.
> >
> > The API is as follows,
> >
> >           int process_reap(int pidfd, unsigned int flags);
> >
> >         DESCRIPTION
> >           The process_reap() system call is used to free the memory of a
> >           dying process.
> >
> >           The pidfd selects the process referred to by the PID file
> >           descriptor.
> >           (See pidofd_open(2) for further information)
>
> *pidfd_open

Ack

>
> >           The flags argument is reserved for future use; currently, this
> >           argument must be specified as 0.
> >
> >         RETURN VALUE
> >           On success, process_reap() returns 0. On error, -1 is returned
> >           and errno is set to indicate the error.
> >
> > Signed-off-by: Suren Baghdasaryan
>
> Thanks for continuously pushing this. One question I have is how do
> you envision this syscall to be used for the cgroup based workloads.
> Traverse the target tree, read pids from cgroup.procs files,
> pidfd_open them, send SIGKILL and then process_reap them. Is that
> right?

Yes, at least that's how Android does that.
It's a bit more involved but it's a technical detail. Userspace low
memory killer kills a process (sends SIGKILL and calls process_reap)
and another system component detects that a process died and will kill
all processes belonging to the same cgroup (that's how we identify
related processes).

>
> Orthogonal to this patch I wonder if we should have an optimized way
> to reap processes from a cgroup. Something similar to cgroup.kill (or
> maybe overload cgroup.kill with reaping as well).

Seems reasonable to me. We could use that in the above scenario.

>
> [...]
>
> > +
> > +SYSCALL_DEFINE2(process_reap, int, pidfd, unsigned int, flags)
> > +{
> > +	struct pid *pid;
> > +	struct task_struct *task;
> > +	struct mm_struct *mm = NULL;
> > +	unsigned int f_flags;
> > +	long ret = 0;
> > +
> > +	if (flags != 0)
> > +		return -EINVAL;
> > +
> > +	pid = pidfd_get_pid(pidfd, &f_flags);
> > +	if (IS_ERR(pid))
> > +		return PTR_ERR(pid);
> > +
> > +	task = get_pid_task(pid, PIDTYPE_PID);
> > +	if (!task) {
> > +		ret = -ESRCH;
> > +		goto put_pid;
> > +	}
> > +
> > +	/*
> > +	 * If the task is dying and in the process of releasing its memory
> > +	 * then get its mm.
> > +	 */
> > +	task_lock(task);
> > +	if (task_will_free_mem(task) && (task->flags & PF_KTHREAD) == 0) {
>
> task_will_free_mem() is fine here but I think in parallel we should
> optimize this function. At the moment it is traversing all the
> processes on the machine. It is very normal to have tens of thousands
> of processes on big machines, so it would be really costly when
> reaping a bunch of processes.

Hmm. But I think we still need to make sure that the mm is not shared
with another non-dying process. IIUC that's the point of that traversal.
Am I mistaken?
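
[Editor's note: the sketch below is not part of the patch or of this thread; it
illustrates the kill-then-reap flow discussed above (read pids from
cgroup.procs, pidfd_open them, send SIGKILL, then process_reap them) as a
userspace daemon might implement it. Since process_reap has no glibc wrapper
and no allocated syscall number at this point, the number used here is a
placeholder assumption; the fallback pidfd numbers are the x86_64 values, and
the helper names and cgroup path are illustrative only.]

#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434		/* x86_64 fallback */
#endif
#ifndef __NR_pidfd_send_signal
#define __NR_pidfd_send_signal 424	/* x86_64 fallback */
#endif
/* Placeholder: replace with whatever number the final patch allocates. */
#define __NR_process_reap_placeholder 448

/*
 * Kill one process through its pidfd, then reap its memory from the
 * caller's context, so the freeing work runs with the caller's CPU
 * affinity and priority and is charged to the caller.
 */
static int kill_and_reap(int pid)
{
	int pidfd, ret;

	pidfd = syscall(__NR_pidfd_open, pid, 0);
	if (pidfd < 0)
		return -1;

	/* SIGKILL via the pidfd avoids racing against PID reuse. */
	if (syscall(__NR_pidfd_send_signal, pidfd, SIGKILL, NULL, 0) < 0) {
		close(pidfd);
		return -1;
	}

	/* process_reap is only allowed on a dying process; fails otherwise. */
	ret = syscall(__NR_process_reap_placeholder, pidfd, 0);
	close(pidfd);
	return ret;
}

/*
 * Walk one cgroup, as discussed above: read its members from cgroup.procs
 * (e.g. /sys/fs/cgroup/<group>/cgroup.procs) and kill-and-reap each one.
 */
static void reap_cgroup(const char *procs_path)
{
	FILE *f = fopen(procs_path, "r");
	int pid;

	if (!f)
		return;
	while (fscanf(f, "%d", &pid) == 1)
		kill_and_reap(pid);
	fclose(f);
}

The ordering matters: SIGKILL has to be delivered before process_reap is
called, because the syscall refuses to operate on a process that is not
already dying, and the reap itself then happens synchronously in the
caller's context, which is the whole point of the new call.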