From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A44F7C4332F for ; Fri, 22 Oct 2021 17:39:00 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 825516103E for ; Fri, 22 Oct 2021 17:39:00 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233854AbhJVRlR (ORCPT ); Fri, 22 Oct 2021 13:41:17 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41160 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233803AbhJVRlP (ORCPT ); Fri, 22 Oct 2021 13:41:15 -0400 Received: from mail-yb1-xb33.google.com (mail-yb1-xb33.google.com [IPv6:2607:f8b0:4864:20::b33]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 076F0C061766 for ; Fri, 22 Oct 2021 10:38:58 -0700 (PDT) Received: by mail-yb1-xb33.google.com with SMTP id r184so8618656ybc.10 for ; Fri, 22 Oct 2021 10:38:57 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=r/8J8TR4uHJ1X4W6TznT1rAj7h/OTN7aNlkEVgGiT3k=; b=LP7en/nydEvtAFRctjt7frpaybeS3xAVyS4NXj3p3ir55MZBGyzoBHMEyWvi+VW6PT BwXDhkwDH07ax89kj2XT2dNMx4ckyVZan4DxLHH2lIjIHbWMqvq/EbTqniXG7EwtSvS4 +CfXvJsgsY20ATw2fvtp/OyXy4SPk0ONvFKzV5hf2GBwbGKks6vtz1+DTPq9IzbICuYI kqS/klK5DiAvNo6refJCuZJZ9kMLNEf2EmQnAMmyrHFplSCk/4Q1IdGi7GrKSN/kyvPW rR3yW1l0MXUB3MAV4mE8gq4QHsCO+8bqIAkwVWFRD5Jgpo4wmosHJ7JGffPBxY3ScHba nlPg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=r/8J8TR4uHJ1X4W6TznT1rAj7h/OTN7aNlkEVgGiT3k=; b=YX8jCFF/Om/5X7Xe3yOnQZsOTmzLZFWJghefZtSB8CwKbGwX6zP2nEobTV2toboCyF MEENXMFTl9OemdqY43lwAKKdTvttFIfkJVY5rgkQsVmA61iEsEDkQJlJ/9J1sETlWWMJ G0XUpjMM84MmkCkeALYfasvc6nzdKTov2kRQ9twYthCwjYKu8s4gEBFsFOEsrsq4t9LR 3/GZEyDQv54zSiB25rj3Zq8WaklinpHbYpHvdzrd22BFdFhN+W/Yz0R9Qg0xNwxSywQ5 9toNXUoVMRS7+cs78BdcNHpXtImZ696MS88GEH307xinJS/z6FfX8oxRvRxq8I2wMHF5 Dukw== X-Gm-Message-State: AOAM532AtLo5DBmGdVUJssRmVaHoEYNh9kDL2tk3eMEoOZuAPN7lUNIF kxtOtPxq1rwCyoTcIYx0xurNxE9X+gDxJfH1cksodw== X-Google-Smtp-Source: ABdhPJzqy77u6YQ7w9HgETig0ZT17flo890EWUfxm8QGQC6SHPCyKeC6hHbuYdXWUpWZZMOnfP5nxkD02ITzgvHJkBo= X-Received: by 2002:a25:81c6:: with SMTP id n6mr1180149ybm.412.1634924336836; Fri, 22 Oct 2021 10:38:56 -0700 (PDT) MIME-Version: 1.0 References: <20211022014658.263508-1-surenb@google.com> In-Reply-To: From: Suren Baghdasaryan Date: Fri, 22 Oct 2021 10:38:45 -0700 Message-ID: Subject: Re: [PATCH 1/1] mm: prevent a race between process_mrelease and exit_mmap To: Michal Hocko Cc: Andrew Morton , David Rientjes , Matthew Wilcox , Johannes Weiner , Roman Gushchin , Rik van Riel , Minchan Kim , Christian Brauner , Christoph Hellwig , Oleg Nesterov , David Hildenbrand , Jann Horn , Shakeel Butt , Andy Lutomirski , Christian Brauner , Florian Weimer , Jan Engelhardt , Linux API , linux-mm , LKML , kernel-team Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Oct 22, 2021 at 1:03 AM Michal Hocko wrote: > > On Thu 21-10-21 18:46:58, Suren Baghdasaryan wrote: > > Race between process_mrelease and exit_mmap, where free_pgtables is > > called while __oom_reap_task_mm is in progress, leads to kernel crash > > during pte_offset_map_lock call. oom-reaper avoids this race by setting > > MMF_OOM_VICTIM flag and causing exit_mmap to take and release > > mmap_write_lock, blocking it until oom-reaper releases mmap_read_lock. > > Reusing MMF_OOM_VICTIM for process_mrelease would be the simplest way to > > fix this race, however that would be considered a hack. Fix this race > > by elevating mm->mm_users and preventing exit_mmap from executing until > > process_mrelease is finished. Patch slightly refactors the code to adapt > > for a possible mmget_not_zero failure. > > This fix has considerable negative impact on process_mrelease performance > > and will likely need later optimization. > > I am not sure there is any promise that process_mrelease will run in > parallel with the exiting process. In fact the primary purpose of this > syscall is to provide a reliable way to oom kill from user space. If you > want to optimize process exit resp. its exit_mmap part then you should > be using other means. So I would be careful calling this a regression. > > I do agree that taking the reference count is the right approach here. I > was wrong previously [1] when saying that pinning the mm struct is > sufficient. I have completely forgot about the subtle sync in exit_mmap. > One way we can approach that would be to take exclusive mmap_sem > throughout the exit_mmap unconditionally. I agree, that would probably be the cleanest way. > There was a push back against > that though so arguments would have to be re-evaluated. I'll review that discussion to better understand the reasons for the push back. Thanks for the link. > > [1] http://lkml.kernel.org/r/YQzZqFwDP7eUxwcn@dhcp22.suse.cz > > That being said > Acked-by: Michal Hocko Thanks! > > Thanks! > > > Fixes: 884a7e5964e0 ("mm: introduce process_mrelease system call") > > Signed-off-by: Suren Baghdasaryan > > --- > > mm/oom_kill.c | 23 ++++++++++++----------- > > 1 file changed, 12 insertions(+), 11 deletions(-) > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > index 831340e7ad8b..989f35a2bbb1 100644 > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -1150,7 +1150,7 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags) > > struct task_struct *task; > > struct task_struct *p; > > unsigned int f_flags; > > - bool reap = true; > > + bool reap = false; > > struct pid *pid; > > long ret = 0; > > > > @@ -1177,15 +1177,15 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags) > > goto put_task; > > } > > > > - mm = p->mm; > > - mmgrab(mm); > > - > > - /* If the work has been done already, just exit with success */ > > - if (test_bit(MMF_OOM_SKIP, &mm->flags)) > > - reap = false; > > - else if (!task_will_free_mem(p)) { > > - reap = false; > > - ret = -EINVAL; > > + if (mmget_not_zero(p->mm)) { > > + mm = p->mm; > > + if (task_will_free_mem(p)) > > + reap = true; > > + else { > > + /* Error only if the work has not been done already */ > > + if (!test_bit(MMF_OOM_SKIP, &mm->flags)) > > + ret = -EINVAL; > > + } > > } > > task_unlock(p); > > > > @@ -1201,7 +1201,8 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags) > > mmap_read_unlock(mm); > > > > drop_mm: > > - mmdrop(mm); > > + if (mm) > > + mmput(mm); > > put_task: > > put_task_struct(task); > > put_pid: > > -- > > 2.33.0.1079.g6e70778dc9-goog > > -- > Michal Hocko > SUSE Labs