To: Suren Baghdasaryan, akpm@linux-foundation.org
Cc: mhocko@kernel.org, mhocko@suse.com, rientjes@google.com,
 willy@infradead.org, hannes@cmpxchg.org, guro@fb.com, riel@surriel.com,
 minchan@kernel.org, christian@brauner.io, hch@infradead.org,
 oleg@redhat.com, jannh@google.com, shakeelb@google.com, luto@kernel.org,
 christian.brauner@ubuntu.com, fweimer@redhat.com, jengelh@inai.de,
 timmurray@google.com, linux-api@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, kernel-team@android.com
References: <20210718214134.2619099-1-surenb@google.com>
 <20210718214134.2619099-2-surenb@google.com>
From: David Hildenbrand
Organization: Red Hat
Subject: Re: [PATCH v2 2/3] mm: introduce process_mrelease system call
Message-ID: <6ab82426-ddbd-7937-3334-468f16ceedab@redhat.com>
Date: Wed, 21 Jul 2021 10:02:19 +0200
In-Reply-To: <20210718214134.2619099-2-surenb@google.com>

On 18.07.21 23:41, Suren Baghdasaryan wrote:
> In modern systems it's not unusual to have a system component monitoring
> memory conditions of the system and tasked with keeping system memory
> pressure under control. One way to accomplish that is to kill
> non-essential processes to free up memory for more important ones.
> Examples of this are Facebook's OOM killer daemon called oomd and
> Android's low memory killer daemon called lmkd.
> For such system component it's important to be able to free memory
> quickly and efficiently. Unfortunately the time process takes to free
> up its memory after receiving a SIGKILL might vary based on the state
> of the process (uninterruptible sleep), size and OPP level of the core
> the process is running. A mechanism to free resources of the target
> process in a more predictable way would improve system's ability to
> control its memory pressure.
> Introduce process_mrelease system call that releases memory of a dying
> process from the context of the caller. This way the memory is freed in
> a more controllable way with CPU affinity and priority of the caller.
> The workload of freeing the memory will also be charged to the caller.
> The operation is allowed only on a dying process.
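
For illustration, here is a minimal userspace sketch of the flow described
above: send SIGKILL through a pidfd, then call process_mrelease() so that
the memory is reaped in the caller's context. It assumes the
"int process_mrelease(int pidfd, unsigned int flags)" signature proposed
further down in this patch. The fallback syscall numbers are an assumption
taken from the generic syscall table of later upstream kernels (this patch
does not assign one), and kill_and_reap() as well as the sys_*() wrappers
are hypothetical helper names, not part of the patch.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Assumption: fallback numbers from the generic syscall table of later
 * upstream kernels. Some architectures (e.g. alpha) use different values.
 */
#ifndef __NR_pidfd_send_signal
#define __NR_pidfd_send_signal 424
#endif
#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif
#ifndef __NR_process_mrelease
#define __NR_process_mrelease 448
#endif

/* Thin wrappers: return -1 and set errno on failure, like any syscall. */
static int sys_pidfd_open(pid_t pid, unsigned int flags)
{
	return (int)syscall(__NR_pidfd_open, pid, flags);
}

static int sys_pidfd_send_signal(int pidfd, int sig)
{
	/* info == NULL and flags == 0: behaves like a plain kill(). */
	return (int)syscall(__NR_pidfd_send_signal, pidfd, sig, NULL, 0);
}

static int sys_process_mrelease(int pidfd, unsigned int flags)
{
	return (int)syscall(__NR_process_mrelease, pidfd, flags);
}

/*
 * Hypothetical helper: step 1 sends SIGKILL, step 2 reaps the memory of
 * the dying process from the caller's context.
 * Returns 0 on success, -1 with errno set otherwise.
 */
int kill_and_reap(pid_t pid)
{
	int pidfd, ret, saved_errno;

	pidfd = sys_pidfd_open(pid, 0);
	if (pidfd < 0)
		return -1;

	ret = sys_pidfd_send_signal(pidfd, SIGKILL);
	if (ret == 0) {
		int tries = 3;

		/* EAGAIN: part of the address space was not released. */
		do {
			ret = sys_process_mrelease(pidfd, 0);
		} while (ret < 0 && errno == EAGAIN && --tries > 0);
	}

	/* close() may clobber errno, so preserve the interesting value. */
	saved_errno = errno;
	close(pidfd);
	errno = saved_errno;
	return ret;
}
```

A daemon such as lmkd would call kill_and_reap(pid) from the thread whose
CPU affinity and priority should drive the reaping. The bounded retry on
EAGAIN is a choice made for this sketch only; the patch does not mandate
any retry policy.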
>
> Previously I proposed a number of alternatives to accomplish this:
> - https://lore.kernel.org/patchwork/patch/1060407 extending
>   pidfd_send_signal to allow memory reaping using oom_reaper thread;
> - https://lore.kernel.org/patchwork/patch/1338196 extending
>   pidfd_send_signal to reap memory of the target process synchronously from
>   the context of the caller;
> - https://lore.kernel.org/patchwork/patch/1344419/ to add MADV_DONTNEED
>   support for process_madvise implementing synchronous memory reaping.

To me, this looks a lot cleaner. Although I do wonder why we need two
separate mechanisms to achieve the end goal:

1. send SIGKILL
2. process_mrelease

As 2. doesn't make sense without 1., it somehow feels like it would be
optimal to achieve both steps in a single syscall. But I remember there
were discussions around that.

>
> The end of the last discussion culminated with suggestion to introduce a
> dedicated system call (https://lore.kernel.org/patchwork/patch/1344418/#1553875)
> The reasoning was that the new variant of process_madvise
>   a) does not work on an address range
>   b) is destructive
>   c) doesn't share much code at all with the rest of process_madvise
> From the userspace point of view it was awkward and inconvenient to provide
> memory range for this operation that operates on the entire address space.
> Using special flags or address values to specify the entire address space
> was too hacky.
>
> The API is as follows,
>
>     int process_mrelease(int pidfd, unsigned int flags);
>
> DESCRIPTION
>     The process_mrelease() system call is used to free the memory of
>     a process which was sent a SIGKILL signal.
>
>     The pidfd selects the process referred to by the PID file
>     descriptor.
>     (See pidofd_open(2) for further information)
>
>     The flags argument is reserved for future use; currently, this
>     argument must be specified as 0.
>
> RETURN VALUE
>     On success, process_mrelease() returns 0. On error, -1 is
>     returned and errno is set to indicate the error.
>
> ERRORS
>     EBADF  pidfd is not a valid PID file descriptor.
>
>     EAGAIN Failed to release part of the address space.
>
>     EINVAL flags is not 0.
>
>     EINVAL The task does not have a pending SIGKILL or its memory is
>            shared with another process with no pending SIGKILL.
>
>     ENOSYS This system call is not supported by kernels built with no
>            MMU support (CONFIG_MMU=n).
>
>     ESRCH  The target process does not exist (i.e., it has terminated
>            and been waited on).
>
> Signed-off-by: Suren Baghdasaryan
> ---
>  mm/oom_kill.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 55 insertions(+)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index d04a13dc9fde..7fbfa70d4e97 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -28,6 +28,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include
>  #include
> @@ -755,10 +756,64 @@ static int __init oom_init(void)
>  	return 0;
>  }
>  subsys_initcall(oom_init)
> +
> +SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> +{
> +	struct pid *pid;
> +	struct task_struct *task;
> +	struct mm_struct *mm = NULL;
> +	unsigned int f_flags;
> +	long ret = 0;

Nit: reverse Christmas tree.

> +
> +	if (flags != 0)
> +		return -EINVAL;
> +
> +	pid = pidfd_get_pid(pidfd, &f_flags);
> +	if (IS_ERR(pid))
> +		return PTR_ERR(pid);
> +
> +	task = get_pid_task(pid, PIDTYPE_PID);
> +	if (!task) {
> +		ret = -ESRCH;
> +		goto put_pid;
> +	}
> +
> +	/*
> +	 * If the task is dying and in the process of releasing its memory
> +	 * then get its mm.
> +	 */
> +	task_lock(task);
> +	if (task_will_free_mem(task) && (task->flags & PF_KTHREAD) == 0) {
> +		mm = task->mm;
> +		mmget(mm);
> +	}

AFAIU, while holding the task_lock, task->mm won't change and we cannot
see a concurrent exit_mm()->mmput(). So the mm structure and the VMAs
won't go away while holding the task_lock().
I do wonder if we need the mmget() at all here.

Also, I wonder if it would be worth dropping the task_lock() while
reaping - to unblock anybody else wanting to lock the task. Getting a
hold of the mm and locking the mmap_lock would be sufficient I guess.

In general, looks quite good to me.

-- 
Thanks,

David / dhildenb