From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id F14BAC433F5 for ; Fri, 13 May 2022 21:07:01 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 67E988D0002; Fri, 13 May 2022 17:07:01 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 62D6B6B0081; Fri, 13 May 2022 17:07:01 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 4CD7B8D0002; Fri, 13 May 2022 17:07:01 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 3D9E26B007E for ; Fri, 13 May 2022 17:07:01 -0400 (EDT) Received: from smtpin22.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay13.hostedemail.com (Postfix) with ESMTP id 1470860782 for ; Fri, 13 May 2022 21:07:01 +0000 (UTC) X-FDA: 79461954642.22.EE9E55C Received: from mail-lj1-f176.google.com (mail-lj1-f176.google.com [209.85.208.176]) by imf08.hostedemail.com (Postfix) with ESMTP id 8D3F31600A5 for ; Fri, 13 May 2022 21:06:45 +0000 (UTC) Received: by mail-lj1-f176.google.com with SMTP id l19so11645855ljb.7 for ; Fri, 13 May 2022 14:07:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=OlbQ4+VlHahOaVFP/ry3pfXsV1QjdtBcEa2sxBnkRtA=; b=XtNK46ya72tNymCA2UF1qBU6HZmxZt+Li6JhmxQdt1DhX2WMGPhutrP+hgpRtUtHaj TL4H14pVA1pfWWqs37w/dU7T+7yw/FJ1CBjN5AwvJSKZPCw8n5roF//SjM7P44MVME9W vOMlPo+v+GTqYZIrwJn+yOWtlzVyhcDZeXRH/EaJVqLWE4foT9zmi5PXQF1bJdvHQP9T F3tuFVwrhoq4ZSxpIuxAH9zEq8uOm0Z5w23eZ9Uo2QYX7tl1qA8Fvg36YzClv9kw1wB/ 6MD+Q6BliQYU5VQDMyu3veda6IX9MH1aXF3BwTYwy1GGFQxBwwOsFV5Sc+HT4VgcpoW0 MzOQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=OlbQ4+VlHahOaVFP/ry3pfXsV1QjdtBcEa2sxBnkRtA=; b=z114SIXMaCGGStQi22nDOpGn11tggSpfw6onR1s9gNK3xL85+UtlFmLcpPXhW0vtaz Qozq7RzqVfpEF7/0uz4asUwzYpb5c2PwPBrmuZzt/LkFTe+BcE4gOhMUwG8zFWOMSCQx 43EJ85R6j4Y4IlUKHuoX9IXaq18XI5+A+fwdxzr7wnbsiJRkSskcLQpCA1AARgSy+KDF WHXB3v+Y2UPS6Hvmsxc/aYqfoxOE8QuD4Ml6qE4CMq74ShblaKgqRcbML06SfHaLrQ3Y w3WyMTLfmTqb/6CRSIQL133BLdNekx2VaL4K5v5RLdqDYQHB+5YjpyT91ODKsfPXyTXi vZjQ== X-Gm-Message-State: AOAM532VhLSGpnyr+nqS6vp2TdW6ZmjxB79ZCm8cowHP8HyMCr7NnRYU GK+2UPMio3ILgOgyNALFZ6dIltwU7oxt/sb/6XIyNA== X-Google-Smtp-Source: ABdhPJzcfrFmSuwkHTEF55k4EN+UOmqsABmR811nq4ZGb0OLZz35t7G9/ZKfjUGhYb94ZkuCREDuw1anTLaOqwb3vBo= X-Received: by 2002:a05:651c:1502:b0:250:2328:d127 with SMTP id e2-20020a05651c150200b002502328d127mr4202340ljf.183.1652476018605; Fri, 13 May 2022 14:06:58 -0700 (PDT) MIME-Version: 1.0 References: <20220504214437.2850685-1-zokeefe@google.com> <20220504214437.2850685-11-zokeefe@google.com> <502a3ced-f3c6-7117-3b24-d80d204d66ee@linux.alibaba.com> <3a53435-5d5a-d6cb-5739-5c444523fc7@google.com> In-Reply-To: <3a53435-5d5a-d6cb-5739-5c444523fc7@google.com> From: "Zach O'Keefe" Date: Fri, 13 May 2022 14:06:22 -0700 Message-ID: Subject: Re: [PATCH v5 10/13] mm/madvise: add MADV_COLLAPSE to process_madvise() To: David Rientjes Cc: Rongwei Wang , Alex Shi , David Hildenbrand , Matthew Wilcox , Michal Hocko , Pasha Tatashin , Peter Xu , SeongJae Park , Song Liu , Vlastimil Babka , Yang Shi , Zi Yan , linux-mm@kvack.org, Andrea Arcangeli , Andrew Morton , Arnd Bergmann , Axel Rasmussen , Chris Kennelly , Chris Zankel , Helge Deller , Hugh Dickins , Ivan Kokshaysky , "James E.J. Bottomley" , Jens Axboe , "Kirill A. Shutemov" , Matt Turner , Max Filippov , Miaohe Lin , Minchan Kim , Patrick Xia , Pavel Begunkov , Thomas Bogendoerfer , calling@linux.alibaba.com Content-Type: text/plain; charset="UTF-8" X-Rspamd-Server: rspam04 X-Rspamd-Queue-Id: 8D3F31600A5 X-Stat-Signature: j7gxgegry1biniw5i76ps1qg9mxhj1n3 X-Rspam-User: Authentication-Results: imf08.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=XtNK46ya; spf=pass (imf08.hostedemail.com: domain of zokeefe@google.com designates 209.85.208.176 as permitted sender) smtp.mailfrom=zokeefe@google.com; dmarc=pass (policy=reject) header.from=google.com X-HE-Tag: 1652476005-250074 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Thu, May 12, 2022 at 1:03 PM David Rientjes wrote: > > On Wed, 11 May 2022, Zach O'Keefe wrote: > > > Hey Rongwei, > > > > Thanks for taking the time to review! > > > > On Tue, May 10, 2022 at 5:49 PM Rongwei Wang > > wrote: > > > > > > Hi, Zach > > > > > > Thanks for your great patchset! > > > Recently, We also try to collapse THP in this way, likes performance > > > degradation due to using too much hugepages in our scenes. > > > > > Rongwei, could you elaborate on this? I can understand undesired overhead > for allocation of a hugepage at the time of fault if there is not enough > benefit derived by the hugepages over the long term (like for a database > workload), is this the performance degradation you're referring to? > > Otherwise I'm unfamiliar with performance degradation for using too much > hugepages after they have been allocated :) Maybe RSS grows too much and > we run into memory pressure? > > It would be good to know more if you can share details here. > > > > And there is a doubt about process_madvise(MADV_COLLAPSE) when we test > > > this patchset:. It seems that process_madvise(MADV_COLLAPSE) rely on > > > madvise(MADV_HUGEPAGE)? If the vma wasn't marked with 'hg', > > > process_madvise(MADV_COLLAPSE) will fail to collapse. And if I miss > > > something, please let me know. > > > > > > > I tried to have MADV_COLLAPSE follow the same THP eligibility > > semantics as khugepaged and at-fault: either THP=always, or > > THP=madvise and the vma is marked with MADV_HUGEPAGE, as you point > > out. > > > > If I understand you correctly, the usefulness of > > process_madvise(MADV_COLLAPSE) is limited in the case where > > THP=madvise and a CAP_SYS_ADMIN user is requesting a collapse of > > behalf of another process since they don't have a way to mark the > > target memory as eligible (which requires VM_HUGEPAGE). > > > > If so, I think that's a valid point, and your suggestion below of a > > supporting MADV_[NO]HUGEPAGE for process_madvise(2) makes sense. For > > the sake of exploring all options, I'll mention that there was also a > > previous idea suggested by Yang Shi where MADV_COLLAPSE could also set > > VM_HUGEPAGE[1]. > > > > If a user is doing MADV_COLLAPSE on behalf of itself, it seems unnecessary > to need to do MADV_HUGEPAGE before that regardless of system-wide settings > that it may not control? > If THP=always, this isn't necessary - but yes, if THP=madvise then, as proposed, we'd need MADV_HUGEPAGE to opt-in to THPs, just like at-fault and khugepaged. If MADV_COLLAPSE also sets VM_HUGEPAGE, then in THP=madvise mode, we also opt-in that range to khugepaged scanning. Most likely this is what we want anyways (if the collapsed range is split after MADV_COLLAPSE, khugepaged could come along and fix things) but it does couple the two more. We can always MADV_NOHUGEPAGE the range after MADV_COLLAPSE if this was ever a concern, however. > Same point for a root user doing this on behalf of another user. It could > either do MADV_HUGEPAGE and change the behavior that the user perhaps > requested behind its back (not desired) or it could temporarily set the > system-wide setting to allow THP always before doing the MADV_COLLAPSE > (it's root). > > We can simply allow MADV_COLLAPSE to always collapse in either context > regardless of VM_HUGEPAGE, VM_NOHUGEPAGE, or system-wide settings? > I think we need to respect VM_NOHUGEPAGE as it can be set from non-madvise code, as David H mentioned for kvm on s390 in the PATCH RFC[1]. It's not clear to me how this would break, however. If MADV_HUGEPAGE'ing the areas marked by arch/s390/mm/gmap.c:thp_split_mm() would also break kvm (assuming subsequent collapse by khugepaged), then MADV_COLLAPSE ignoring VM_NOHUGEPAGE doesn't really increase the failure space much. Else, I think we should really be respecting VM_NOHUGEPAGE. [1] https://lore.kernel.org/linux-mm/30571216-5a6a-7a11-3b2c-77d914025f6d@redhat.com/ > > Since it's possible supporting MADV_[NO]HUGEPAGE for > > process_madivse(2) has applications outside a subsequent > > MADV_COLLAPSE, and since I don't see process_madvise(MADV_COLLAPSE) to > > be in a hot path, I'd vote in favor of your suggestion and include > > process_madvise(MADV_[NO]HUGEPAGE) support in v6 unless others object. > > > > Thanks again for your review and your suggestion! > > Zach > > > > [1] https://lore.kernel.org/linux-mm/CAHbLzkqLRBd6u3qn=KqpOhRcPZtpGXbTXLUjK1z=4d_dQ06Pvw@mail.gmail.com/ > > > > > If so, how about introducing process_madvise(MADV_HUGEPAGE) or > > > process_madvise(MADV_NOHUGEPAGE)? The former helps to mark the target > > > vma with 'hg', and the collapse process can be finished completely with > > > the help of other processes. the latter could let some special vma avoid > > > collapsing when setting 'THP=always'. > > > > > > Best regards, > > > -wrw > > > > > > On 5/5/22 5:44 AM, Zach O'Keefe wrote: > > > > Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has > > > > CAP_SYS_ADMIN or is requesting collapse of it's own memory. > > > > > > > > Signed-off-by: Zach O'Keefe > > > > --- > > > > mm/madvise.c | 6 ++++-- > > > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/mm/madvise.c b/mm/madvise.c > > > > index 638517952bd2..08c11217025a 100644 > > > > --- a/mm/madvise.c > > > > +++ b/mm/madvise.c > > > > @@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior) > > > > } > > > > > > > > static bool > > > > -process_madvise_behavior_valid(int behavior) > > > > +process_madvise_behavior_valid(int behavior, struct task_struct *task) > > > > { > > > > switch (behavior) { > > > > case MADV_COLD: > > > > case MADV_PAGEOUT: > > > > case MADV_WILLNEED: > > > > return true; > > > > + case MADV_COLLAPSE: > > > > + return task == current || capable(CAP_SYS_ADMIN); > > > > default: > > > > return false; > > > > } > > > > @@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec, > > > > goto free_iov; > > > > } > > > > > > > > - if (!process_madvise_behavior_valid(behavior)) { > > > > + if (!process_madvise_behavior_valid(behavior, task)) { > > > > ret = -EINVAL; > > > > goto release_task; > > > > } > >