From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <20201113173448.1863419-1-surenb@google.com>
 <20201113155539.64e0af5b60ad3145b018ab0d@linux-foundation.org>
 <20201113170032.7aa56ea273c900f97e6ccbdc@linux-foundation.org>
 <20201113171810.bebf66608b145cced85bf54c@linux-foundation.org>
 <20201113181632.6d98489465430a987c96568d@linux-foundation.org>
In-Reply-To: <20201113181632.6d98489465430a987c96568d@linux-foundation.org>
From: Suren Baghdasaryan
Date: Fri, 13 Nov 2020 18:51:36 -0800
Subject: Re: [PATCH 1/1] RFC: add pidfd_send_signal flag to reclaim mm while killing a process
To: Andrew Morton
Cc: Michal Hocko, David Rientjes, Matthew Wilcox, Johannes Weiner,
 Roman Gushchin, Rik van Riel, Christian Brauner, Oleg Nesterov,
 Tim Murray, linux-api@vger.kernel.org, linux-mm, LKML, kernel-team,
 Minchan Kim
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Nov 13, 2020 at 6:16 PM Andrew Morton wrote:
>
> On Fri, 13 Nov 2020 17:57:02 -0800 Suren Baghdasaryan wrote:
>
> > On Fri, Nov 13, 2020 at 5:18 PM Andrew Morton wrote:
> > >
> > > On Fri, 13 Nov 2020 17:09:37 -0800 Suren Baghdasaryan wrote:
> > >
> > > > > > > Seems to me that the ability to reap another process's memory is a
> > > > > > > generally useful one, and that it should not be tied to delivering a
> > > > > > > signal in this fashion.
> > > > > > >
> > > > > > > And we do have the new process_madvise(MADV_PAGEOUT). It may need a
> > > > > > > few changes and tweaks, but can't that be used to solve this problem?
> > > > > >
> > > > > > Thank you for the feedback, Andrew. process_madvise(MADV_DONTNEED) was
> > > > > > one of the options recently discussed in
> > > > > > https://lore.kernel.org/linux-api/CAJuCfpGz1kPM3G1gZH+09Z7aoWKg05QSAMMisJ7H5MdmRrRhNQ@mail.gmail.com
> > > > > > . The thread describes some of the issues with that approach but if we
> > > > > > limit it to processes with pending SIGKILL only then I think that
> > > > > > would be doable.
> > > > >
> > > > > Why would it be necessary to read /proc/pid/maps? I'd have thought
> > > > > that a starting effort would be
> > > > >
> > > > > madvise((void *)0, (void *)-1, MADV_PAGEOUT)
> > > > >
> > > > > (after translation into process_madvise() speak). Which is equivalent
> > > > > to the proposed process_madvise(MADV_DONTNEED_MM)?
> > > >
> > > > Yep, this is very similar to option #3 in
> > > > https://lore.kernel.org/linux-api/CAJuCfpGz1kPM3G1gZH+09Z7aoWKg05QSAMMisJ7H5MdmRrRhNQ@mail.gmail.com
> > > > and I actually have a tested prototype for that.
> > >
> > > Why is the `vector=NULL' needed? Can't `vector' point at a single iovec
> > > which spans the whole address range?
> >
> > That would be the option #4 from the same discussion and the issues
> > noted there are "process_madvise return value can't handle such a
> > large number of bytes and there is MAX_RW_COUNT limit on max number of
> > bytes one process_madvise call can handle". In my prototype I have a
> > special handling for such "bulk operation" to work around the
> > MAX_RW_COUNT limitation.
>
> Ah, OK, return value. Maybe process_madvise() shouldn't have done that
> and should have simply returned 0 on success, like madvise().
>
> I guess a special "nuke whole address space" command is OK. But, again
> in the search for generality, the ability to nuke very large amounts of
> address space (but not the entire address space) would be better.
>
> The process_madvise() return value issue could be addressed by adding a
> process_madvise() mode which return 0 on success.
>
> And I guess the MAX_RW_COUNT issue is solvable by adding an
> import_iovec() arg to say "don't check that". Along those lines.
>
> It's all sounding a bit painful (but not *too* painful). But to
> reiterate, I do think that adding the ability for a process to shoot
> down a large amount of another process's memory is a lot more generally
> useful than tying it to SIGKILL, agree?

I see. So you are suggesting a mode where process_madvise() can
operate on large areas spanning multiple VMAs. This slightly differs
from option 4 in the previous RFC which suggested a special mode that
operates on the *entire* mm of the process. I agree, your suggestion
is more generic.

>
> > >
> > > > If that's the
> > > > preferred method then I can post it quite quickly.
> > >
> > > I assume you've tested that prototype. How did its usefulness compare
> > > with this SIGKILL-based approach?
> >
> > Just to make sure I understand correctly your question, you are asking
> > about performance comparison of:
> >
> > // approach in this RFC
> > pidfd_send_signal(SIGKILL, SYNC_REAP_MM)
> >
> > vs
> >
> > // option #4 in the previous RFC
> > kill(SIGKILL); process_madvise(vector=NULL, MADV_DONTNEED);
> >
> > If so, I have results for the current RFC approach but the previous
> > approach was testing on an older device, so don't have
> > apples-to-apples comparison results at the moment. I can collect the
> > data for fair comparison if desired, however I don't expect a
> > noticeable performance difference since they both do pretty much the
> > same thing (even on different devices my results are quite close). I
> > think it's more a question of which API would be more appropriate.
>
> OK. I wouldn't expect performance to be very different (and things can
> be sped up if so), but the API usefulness might be an issue. Using
> process_madvise() (or similar) makes it a two-step operation, whereas
> tying it to SIGKILL&&TASK_UNINTERRUPTIBLE provides a more precise tool.
> Any thoughts on this?
>
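
For illustration only, the MAX_RW_COUNT limit discussed above could also be worked around from user space by splitting a large range into chunks and issuing one call per chunk. The sketch below is not the prototype mentioned in the thread. It assumes 4 KiB pages and mirrors the kernel's definition of MAX_RW_COUNT (INT_MAX & PAGE_MASK). The names SKETCH_PAGE_SIZE, SKETCH_MAX_RW_COUNT, advise_fn, advise_in_chunks, chunk_stats and count_chunk are all hypothetical. The callback stands in for a real per-chunk process_madvise() call, so the chunking logic can be exercised without touching another process.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. The kernel defines MAX_RW_COUNT as
 * (INT_MAX & PAGE_MASK); these macros mirror that in user space. */
#define SKETCH_PAGE_SIZE    4096UL
#define SKETCH_MAX_RW_COUNT ((size_t)INT_MAX & ~(SKETCH_PAGE_SIZE - 1))

/* Hypothetical callback: advise one chunk, return 0 on success. */
typedef int (*advise_fn)(uintptr_t start, size_t len, void *ctx);

/* Walk [start, start + len) in chunks of at most SKETCH_MAX_RW_COUNT
 * bytes. Returns 0 if every chunk succeeded, otherwise the first
 * non-zero value returned by the callback. */
int advise_in_chunks(uintptr_t start, size_t len, advise_fn fn, void *ctx)
{
    assert(fn != NULL);

    while (len > 0) {
        size_t chunk = len < SKETCH_MAX_RW_COUNT ? len : SKETCH_MAX_RW_COUNT;
        int ret = fn(start, chunk, ctx);

        if (ret != 0)
            return ret;
        start += chunk;
        len -= chunk;
    }
    return 0;
}

/* Stand-in callback used for demonstration: it only records how many
 * chunks were issued and how many bytes they covered in total. */
struct chunk_stats {
    size_t calls;
    size_t bytes;
};

int count_chunk(uintptr_t start, size_t len, void *ctx)
{
    struct chunk_stats *st = ctx;

    (void)start; /* a real callback would place this in iov_base */
    st->calls++;
    st->bytes += len;
    return 0;
}
```

In a real caller the callback would fill a single struct iovec with the chunk and pass it to process_madvise() with MADV_PAGEOUT, treating a short byte count as an error. This keeps each call under the limit but remains a multi-call operation, so it does not address the return-value concern raised above.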