From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-23.3 required=3.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 785AFC6377D for ; Wed, 21 Jul 2021 22:59:18 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 5A10961264 for ; Wed, 21 Jul 2021 22:59:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230404AbhGUWSk (ORCPT ); Wed, 21 Jul 2021 18:18:40 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60846 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230256AbhGUWSj (ORCPT ); Wed, 21 Jul 2021 18:18:39 -0400 Received: from mail-yb1-xb2e.google.com (mail-yb1-xb2e.google.com [IPv6:2607:f8b0:4864:20::b2e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E0A40C061757 for ; Wed, 21 Jul 2021 15:59:15 -0700 (PDT) Received: by mail-yb1-xb2e.google.com with SMTP id r132so5564518yba.5 for ; Wed, 21 Jul 2021 15:59:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=1dwxVRcNkN7Kn6Ao7jrDfweXBuqFRProuph1jkdsgR0=; b=j/+JjdRwnShpGM6r3Al+598YIV7XWN2RZmLcvd42dfXfMcpmDLYDR2nc3GQT7Vr9p5 efFAB9i40J5LblI0W2ugC4KR9wbVs3RBpPt/Fr9eI9raJTXjBNmWzcW4+FXT1qVlh0z6 hTrrV12yT64eO1L2VhoXFAehv/liLzHkMqjhX6DYucjQlJrdToNi2aRqg7a0bUuRvD0z cc8SPIvpp/5HbDPdAuPgPNqtjP690EsinwaSB+WtAhGjZe9Sfr1Mgf8oGpUaRcpGklnV ZIomwa4dg5E8/1PcbkfI1CfYRJjpr3laNhjcFAvwCdHw8zPunqdz9PChQ5miFov/aZvB uBvQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=1dwxVRcNkN7Kn6Ao7jrDfweXBuqFRProuph1jkdsgR0=; b=XXWfhxi84jvz9wBbFHd7R1HCsSUTD9EbZRjVJ82NQefWZ8rlSpy5yzGjQDq64xO13q 6FrMIZyMSs55hP/YKOF1TYZnp5kdpzTFfuBx48LjDd234bPcoYtlHxa+NLe5p01jR7Wt z/josPs22sleLgw5Ma020MWRsHM9IYk1kHpecgs6A2PPgrEWE6T1YqpL8+IHE0w0JOJd eOxs/hGR6lVF8ZcEGRWFao2O1Akb6N85JGb9govpbQLjo423NmBUmWqyQOrrseCaQOAj uQcVYpNcMiX4COQwAkaIO+m4tl3YX6nUcWsDf4dlux63hsoAzHcPdPjMBh4y+/Hjh0DA hulQ== X-Gm-Message-State: AOAM533qnhxURPEck4KPTpsgFwmxmbw+IiAYIgqZ/FDXreAnmvMZHNOr 6vW5pWe3qF46vlgiXuJMFZkN35HQLT27tL0WWnhaYQ== X-Google-Smtp-Source: ABdhPJwPew/QlG/QdMpfIAldcZ6LbgkHdbIJPlxvO+OzU2VGnY9bbAj9isXfNtN6ohi/Xr9O0YCcCgTVoaxj5LfcPGc= X-Received: by 2002:a25:2e49:: with SMTP id b9mr50641996ybn.250.1626908354832; Wed, 21 Jul 2021 15:59:14 -0700 (PDT) MIME-Version: 1.0 References: <20210718214134.2619099-1-surenb@google.com> <20210718214134.2619099-2-surenb@google.com> <6ab82426-ddbd-7937-3334-468f16ceedab@redhat.com> In-Reply-To: From: Suren Baghdasaryan Date: Wed, 21 Jul 2021 15:59:03 -0700 Message-ID: Subject: Re: [PATCH v2 2/3] mm: introduce process_mrelease system call To: David Hildenbrand Cc: Andrew Morton , Michal Hocko , Michal Hocko , David Rientjes , Matthew Wilcox , Johannes Weiner , Roman Gushchin , Rik van Riel , Minchan Kim , Christian Brauner , Christoph Hellwig , Oleg Nesterov , Jann Horn , Shakeel Butt , Andy Lutomirski , Christian Brauner , Florian Weimer , Jan Engelhardt , Tim Murray , Linux API , linux-mm , LKML , kernel-team Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jul 21, 2021 at 8:43 AM Suren Baghdasaryan wrote: > > On Wed, Jul 21, 2021 at 1:02 AM David Hildenbrand wrote: > > > > On 18.07.21 23:41, Suren Baghdasaryan wrote: > > > In modern systems it's not unusual to have a system component monitoring > > > memory conditions of the system and tasked with keeping system memory > > > pressure under control. One way to accomplish that is to kill > > > non-essential processes to free up memory for more important ones. > > > Examples of this are Facebook's OOM killer daemon called oomd and > > > Android's low memory killer daemon called lmkd. > > > For such system component it's important to be able to free memory > > > quickly and efficiently. Unfortunately the time process takes to free > > > up its memory after receiving a SIGKILL might vary based on the state > > > of the process (uninterruptible sleep), size and OPP level of the core > > > the process is running. A mechanism to free resources of the target > > > process in a more predictable way would improve system's ability to > > > control its memory pressure. > > > Introduce process_mrelease system call that releases memory of a dying > > > process from the context of the caller. This way the memory is freed in > > > a more controllable way with CPU affinity and priority of the caller. > > > The workload of freeing the memory will also be charged to the caller. > > > The operation is allowed only on a dying process. > > > > > > Previously I proposed a number of alternatives to accomplish this: > > > - https://lore.kernel.org/patchwork/patch/1060407 extending > > > pidfd_send_signal to allow memory reaping using oom_reaper thread; > > > - https://lore.kernel.org/patchwork/patch/1338196 extending > > > pidfd_send_signal to reap memory of the target process synchronously from > > > the context of the caller; > > > - https://lore.kernel.org/patchwork/patch/1344419/ to add MADV_DONTNEED > > > support for process_madvise implementing synchronous memory reaping. > > > > To me, this looks a lot cleaner. Although I do wonder why we need two > > separate mechanisms to achieve the end goal > > > > 1. send sigkill > > 2. process_mrelease > > > > As 2. doesn't make sense without 1. it somehow feels like it would be > > optimal to achieve both steps in a single syscall. But I remember there > > were discussions around that. > > Yep, we recently discussed the approach in this thread: > https://lore.kernel.org/patchwork/patch/1450952/#1652452 > > > > > > > > > The end of the last discussion culminated with suggestion to introduce a > > > dedicated system call (https://lore.kernel.org/patchwork/patch/1344418/#1553875) > > > The reasoning was that the new variant of process_madvise > > > a) does not work on an address range > > > b) is destructive > > > c) doesn't share much code at all with the rest of process_madvise > > > From the userspace point of view it was awkward and inconvenient to provide > > > memory range for this operation that operates on the entire address space. > > > Using special flags or address values to specify the entire address space > > > was too hacky. > > > > > > The API is as follows, > > > > > > int process_mrelease(int pidfd, unsigned int flags); > > > > > > DESCRIPTION > > > The process_mrelease() system call is used to free the memory of > > > a process which was sent a SIGKILL signal. > > > > > > The pidfd selects the process referred to by the PID file > > > descriptor. > > > (See pidofd_open(2) for further information) > > > > > > The flags argument is reserved for future use; currently, this > > > argument must be specified as 0. > > > > > > RETURN VALUE > > > On success, process_mrelease() returns 0. On error, -1 is > > > returned and errno is set to indicate the error. > > > > > > ERRORS > > > EBADF pidfd is not a valid PID file descriptor. > > > > > > EAGAIN Failed to release part of the address space. > > > > > > EINVAL flags is not 0. > > > > > > EINVAL The task does not have a pending SIGKILL or its memory is > > > shared with another process with no pending SIGKILL. > > > > > > ENOSYS This system call is not supported by kernels built with no > > > MMU support (CONFIG_MMU=n). > > > > > > ESRCH The target process does not exist (i.e., it has terminated > > > and been waited on). > > > > > > Signed-off-by: Suren Baghdasaryan > > > --- > > > mm/oom_kill.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++ > > > 1 file changed, 55 insertions(+) > > > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > > index d04a13dc9fde..7fbfa70d4e97 100644 > > > --- a/mm/oom_kill.c > > > +++ b/mm/oom_kill.c > > > @@ -28,6 +28,7 @@ > > > #include > > > #include > > > #include > > > +#include > > > #include > > > #include > > > #include > > > @@ -755,10 +756,64 @@ static int __init oom_init(void) > > > return 0; > > > } > > > subsys_initcall(oom_init) > > > + > > > +SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags) > > > +{ > > > + struct pid *pid; > > > + struct task_struct *task; > > > + struct mm_struct *mm = NULL; > > > + unsigned int f_flags; > > > + long ret = 0; > > > > Nit: reverse Christmas tree. > > Ack. Will reorder like this: > > struct mm_struct *mm = NULL; > struct task_struct *task; > unsigned int f_flags; > struct pid *pid; > long ret = 0; > > > > > > + > > > + if (flags != 0) > > > + return -EINVAL; > > > + > > > + pid = pidfd_get_pid(pidfd, &f_flags); > > > + if (IS_ERR(pid)) > > > + return PTR_ERR(pid); > > > + > > > + task = get_pid_task(pid, PIDTYPE_PID); > > > + if (!task) { > > > + ret = -ESRCH; > > > + goto put_pid; > > > + } > > > + > > > + /* > > > + * If the task is dying and in the process of releasing its memory > > > + * then get its mm. > > > + */ > > > + task_lock(task); > > > + if (task_will_free_mem(task) && (task->flags & PF_KTHREAD) == 0) { > > > + mm = task->mm; > > > + mmget(mm); > > > + } > > > > AFAIU, while holding the task_lock, task->mm won't change and we cannot > > see a concurrent exit_mm()->mmput(). So the mm structure and the VMAs > > won't go away while holding the task_lock(). I do wonder if we need the > > mmget() at all here. We do mmget() here to ensure mm is stable when it is passed later to __oom_reap_task_mm(mm)/mmap_read_lock(mm)/mmap_read_unlock(mm). Note that during those calls we do not hold task_lock anymore. > > > > Also, I wonder if it would be worth dropping the task_lock() while > > reaping - to unblock anybody else wanting to lock the task. As I mentioned above, we do not hold task_lock during reaping. We release it right after we call task_will_free_mem(), which checks that the task is exiting. task_lock is held during the call to task_will_free_mem() to satisfy the requirement listed in that function's comment: "Caller has to make sure that task->mm is stable (hold task_lock or it operates on the current)". > > Getting a hold of the mm and locking the mmap_lock would be sufficient I guess. That's exactly what I do here. The simplified sequence is: task_lock if (task_will_free_mem()) mm=mmget() task_unlock if (!mm) return; mmap_read_lock(mm) __oom_reap_task_mm(mm) mmap_read_unlock(mm) mmput(mm) Or did I misunderstand your comments? > > Let me take a closer look at the locking sequence here and will follow > up afterwards. > Thanks for the review! > > > > > > > In general, looks quite good to me. > > > > -- > > Thanks, > > > > David / dhildenb > > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-23.3 required=3.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6A64FC6377B for ; Wed, 21 Jul 2021 22:59:17 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 0A34361249 for ; Wed, 21 Jul 2021 22:59:17 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 0A34361249 Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 9F03F6B006C; Wed, 21 Jul 2021 18:59:16 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 97B176B0071; Wed, 21 Jul 2021 18:59:16 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 8417C6B0072; Wed, 21 Jul 2021 18:59:16 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0171.hostedemail.com [216.40.44.171]) by kanga.kvack.org (Postfix) with ESMTP id 62D646B006C for ; Wed, 21 Jul 2021 18:59:16 -0400 (EDT) Received: from smtpin33.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id 152408248047 for ; Wed, 21 Jul 2021 22:59:16 +0000 (UTC) X-FDA: 78388112712.33.D85B8FE Received: from mail-yb1-f172.google.com (mail-yb1-f172.google.com [209.85.219.172]) by imf22.hostedemail.com (Postfix) with ESMTP id CBCD51B3BB for ; Wed, 21 Jul 2021 22:59:15 +0000 (UTC) Received: by mail-yb1-f172.google.com with SMTP id r135so2528304ybc.0 for ; Wed, 21 Jul 2021 15:59:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=1dwxVRcNkN7Kn6Ao7jrDfweXBuqFRProuph1jkdsgR0=; b=j/+JjdRwnShpGM6r3Al+598YIV7XWN2RZmLcvd42dfXfMcpmDLYDR2nc3GQT7Vr9p5 efFAB9i40J5LblI0W2ugC4KR9wbVs3RBpPt/Fr9eI9raJTXjBNmWzcW4+FXT1qVlh0z6 hTrrV12yT64eO1L2VhoXFAehv/liLzHkMqjhX6DYucjQlJrdToNi2aRqg7a0bUuRvD0z cc8SPIvpp/5HbDPdAuPgPNqtjP690EsinwaSB+WtAhGjZe9Sfr1Mgf8oGpUaRcpGklnV ZIomwa4dg5E8/1PcbkfI1CfYRJjpr3laNhjcFAvwCdHw8zPunqdz9PChQ5miFov/aZvB uBvQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=1dwxVRcNkN7Kn6Ao7jrDfweXBuqFRProuph1jkdsgR0=; b=oh7vVJgOtP/RlGy2fzNUIt6J7TqnoCdcJxzLTv/wIlKv+zugr6/JEYTCAcfxjCyN08 eceIrO5X64mAmHi5Xpxqf27rNNMZibBcSU87xIFcKXLZf807KU23dVQt+7KkKhf8fyHQ L3AFmGljDn21G5WNMDmYDgrJFMSKwPb0HsPtW5lT2KbmjAGpyUjIZhjMv7L+gAW7lWve TBVKqc1alRApJ14P3/VXKmBjLgE32VkhiSFP9/ujFoRj79iJ0TBNLRYo0xau4gQWlg6V 6tbuKCvHJvQWoTT5Qju0AMF7pSfpHevsEvyJB+Cq3QCpZyCXCZhObnirIZhkL+wtSf4Y 1X8w== X-Gm-Message-State: AOAM530WKfjXuu/updZt0x2o5Pne/HH8zrTWqeV9GTXDXcmhvVTfixw3 zxNCeLYpZXNw1KR/dGR5XPMfeB89iOGsuqBy7FPApQ== X-Google-Smtp-Source: ABdhPJwPew/QlG/QdMpfIAldcZ6LbgkHdbIJPlxvO+OzU2VGnY9bbAj9isXfNtN6ohi/Xr9O0YCcCgTVoaxj5LfcPGc= X-Received: by 2002:a25:2e49:: with SMTP id b9mr50641996ybn.250.1626908354832; Wed, 21 Jul 2021 15:59:14 -0700 (PDT) MIME-Version: 1.0 References: <20210718214134.2619099-1-surenb@google.com> <20210718214134.2619099-2-surenb@google.com> <6ab82426-ddbd-7937-3334-468f16ceedab@redhat.com> In-Reply-To: From: Suren Baghdasaryan Date: Wed, 21 Jul 2021 15:59:03 -0700 Message-ID: Subject: Re: [PATCH v2 2/3] mm: introduce process_mrelease system call To: David Hildenbrand Cc: Andrew Morton , Michal Hocko , Michal Hocko , David Rientjes , Matthew Wilcox , Johannes Weiner , Roman Gushchin , Rik van Riel , Minchan Kim , Christian Brauner , Christoph Hellwig , Oleg Nesterov , Jann Horn , Shakeel Butt , Andy Lutomirski , Christian Brauner , Florian Weimer , Jan Engelhardt , Tim Murray , Linux API , linux-mm , LKML , kernel-team Content-Type: text/plain; charset="UTF-8" X-Rspamd-Server: rspam04 X-Rspamd-Queue-Id: CBCD51B3BB X-Stat-Signature: shoh6yfr5pncfcs3m38mndne4wp8assj Authentication-Results: imf22.hostedemail.com; dkim=pass header.d=google.com header.s=20161025 header.b="j/+JjdRw"; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf22.hostedemail.com: domain of surenb@google.com designates 209.85.219.172 as permitted sender) smtp.mailfrom=surenb@google.com X-HE-Tag: 1626908355-958875 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Wed, Jul 21, 2021 at 8:43 AM Suren Baghdasaryan wrote: > > On Wed, Jul 21, 2021 at 1:02 AM David Hildenbrand wrote: > > > > On 18.07.21 23:41, Suren Baghdasaryan wrote: > > > In modern systems it's not unusual to have a system component monitoring > > > memory conditions of the system and tasked with keeping system memory > > > pressure under control. One way to accomplish that is to kill > > > non-essential processes to free up memory for more important ones. > > > Examples of this are Facebook's OOM killer daemon called oomd and > > > Android's low memory killer daemon called lmkd. > > > For such system component it's important to be able to free memory > > > quickly and efficiently. Unfortunately the time process takes to free > > > up its memory after receiving a SIGKILL might vary based on the state > > > of the process (uninterruptible sleep), size and OPP level of the core > > > the process is running. A mechanism to free resources of the target > > > process in a more predictable way would improve system's ability to > > > control its memory pressure. > > > Introduce process_mrelease system call that releases memory of a dying > > > process from the context of the caller. This way the memory is freed in > > > a more controllable way with CPU affinity and priority of the caller. > > > The workload of freeing the memory will also be charged to the caller. > > > The operation is allowed only on a dying process. > > > > > > Previously I proposed a number of alternatives to accomplish this: > > > - https://lore.kernel.org/patchwork/patch/1060407 extending > > > pidfd_send_signal to allow memory reaping using oom_reaper thread; > > > - https://lore.kernel.org/patchwork/patch/1338196 extending > > > pidfd_send_signal to reap memory of the target process synchronously from > > > the context of the caller; > > > - https://lore.kernel.org/patchwork/patch/1344419/ to add MADV_DONTNEED > > > support for process_madvise implementing synchronous memory reaping. > > > > To me, this looks a lot cleaner. Although I do wonder why we need two > > separate mechanisms to achieve the end goal > > > > 1. send sigkill > > 2. process_mrelease > > > > As 2. doesn't make sense without 1. it somehow feels like it would be > > optimal to achieve both steps in a single syscall. But I remember there > > were discussions around that. > > Yep, we recently discussed the approach in this thread: > https://lore.kernel.org/patchwork/patch/1450952/#1652452 > > > > > > > > > The end of the last discussion culminated with suggestion to introduce a > > > dedicated system call (https://lore.kernel.org/patchwork/patch/1344418/#1553875) > > > The reasoning was that the new variant of process_madvise > > > a) does not work on an address range > > > b) is destructive > > > c) doesn't share much code at all with the rest of process_madvise > > > From the userspace point of view it was awkward and inconvenient to provide > > > memory range for this operation that operates on the entire address space. > > > Using special flags or address values to specify the entire address space > > > was too hacky. > > > > > > The API is as follows, > > > > > > int process_mrelease(int pidfd, unsigned int flags); > > > > > > DESCRIPTION > > > The process_mrelease() system call is used to free the memory of > > > a process which was sent a SIGKILL signal. > > > > > > The pidfd selects the process referred to by the PID file > > > descriptor. > > > (See pidofd_open(2) for further information) > > > > > > The flags argument is reserved for future use; currently, this > > > argument must be specified as 0. > > > > > > RETURN VALUE > > > On success, process_mrelease() returns 0. On error, -1 is > > > returned and errno is set to indicate the error. > > > > > > ERRORS > > > EBADF pidfd is not a valid PID file descriptor. > > > > > > EAGAIN Failed to release part of the address space. > > > > > > EINVAL flags is not 0. > > > > > > EINVAL The task does not have a pending SIGKILL or its memory is > > > shared with another process with no pending SIGKILL. > > > > > > ENOSYS This system call is not supported by kernels built with no > > > MMU support (CONFIG_MMU=n). > > > > > > ESRCH The target process does not exist (i.e., it has terminated > > > and been waited on). > > > > > > Signed-off-by: Suren Baghdasaryan > > > --- > > > mm/oom_kill.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++ > > > 1 file changed, 55 insertions(+) > > > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > > index d04a13dc9fde..7fbfa70d4e97 100644 > > > --- a/mm/oom_kill.c > > > +++ b/mm/oom_kill.c > > > @@ -28,6 +28,7 @@ > > > #include > > > #include > > > #include > > > +#include > > > #include > > > #include > > > #include > > > @@ -755,10 +756,64 @@ static int __init oom_init(void) > > > return 0; > > > } > > > subsys_initcall(oom_init) > > > + > > > +SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags) > > > +{ > > > + struct pid *pid; > > > + struct task_struct *task; > > > + struct mm_struct *mm = NULL; > > > + unsigned int f_flags; > > > + long ret = 0; > > > > Nit: reverse Christmas tree. > > Ack. Will reorder like this: > > struct mm_struct *mm = NULL; > struct task_struct *task; > unsigned int f_flags; > struct pid *pid; > long ret = 0; > > > > > > + > > > + if (flags != 0) > > > + return -EINVAL; > > > + > > > + pid = pidfd_get_pid(pidfd, &f_flags); > > > + if (IS_ERR(pid)) > > > + return PTR_ERR(pid); > > > + > > > + task = get_pid_task(pid, PIDTYPE_PID); > > > + if (!task) { > > > + ret = -ESRCH; > > > + goto put_pid; > > > + } > > > + > > > + /* > > > + * If the task is dying and in the process of releasing its memory > > > + * then get its mm. > > > + */ > > > + task_lock(task); > > > + if (task_will_free_mem(task) && (task->flags & PF_KTHREAD) == 0) { > > > + mm = task->mm; > > > + mmget(mm); > > > + } > > > > AFAIU, while holding the task_lock, task->mm won't change and we cannot > > see a concurrent exit_mm()->mmput(). So the mm structure and the VMAs > > won't go away while holding the task_lock(). I do wonder if we need the > > mmget() at all here. We do mmget() here to ensure mm is stable when it is passed later to __oom_reap_task_mm(mm)/mmap_read_lock(mm)/mmap_read_unlock(mm). Note that during those calls we do not hold task_lock anymore. > > > > Also, I wonder if it would be worth dropping the task_lock() while > > reaping - to unblock anybody else wanting to lock the task. As I mentioned above, we do not hold task_lock during reaping. We release it right after we call task_will_free_mem(), which checks that the task is exiting. task_lock is held during the call to task_will_free_mem() to satisfy the requirement listed in that function's comment: "Caller has to make sure that task->mm is stable (hold task_lock or it operates on the current)". > > Getting a hold of the mm and locking the mmap_lock would be sufficient I guess. That's exactly what I do here. The simplified sequence is: task_lock if (task_will_free_mem()) mm=mmget() task_unlock if (!mm) return; mmap_read_lock(mm) __oom_reap_task_mm(mm) mmap_read_unlock(mm) mmput(mm) Or did I misunderstand your comments? > > Let me take a closer look at the locking sequence here and will follow > up afterwards. > Thanks for the review! > > > > > > > In general, looks quite good to me. > > > > -- > > Thanks, > > > > David / dhildenb > >