Date: Fri, 15 Apr 2022 09:39:14 -0400
From: Peter Xu
To: Zach O'Keefe
Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
 Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka,
 Yang Shi, Zi Yan, linux-mm@kvack.org, Andrea Arcangeli, Andrew Morton,
 Arnd Bergmann, Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
 Hugh Dickins, Ivan Kokshaysky, "James E.J. Bottomley", Jens Axboe,
 "Kirill A. Shutemov", Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim,
 Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer
Subject: Re: [PATCH v2 00/12] mm: userspace hugepage collapse
References: <20220414180612.3844426-1-zokeefe@google.com>

On Thu, Apr 14, 2022 at 05:52:43PM -0700, Zach O'Keefe wrote:
> Hey Peter,
>
> Thanks for taking the time to review!
>
> On Thu, Apr 14, 2022 at 5:04 PM Peter Xu wrote:
> >
> > Hi, Zach,
> >
> > On Thu, Apr 14, 2022 at 11:06:00AM -0700, Zach O'Keefe wrote:
> > > process_madvise(2)
> > >
> > > Performs a synchronous collapse of the native pages
> > > mapped by the list of iovecs into transparent hugepages.
> > >
> > > Allocation semantics are the same as khugepaged, and depend on
> > > (1) the active sysfs settings
> > > /sys/kernel/mm/transparent_hugepage/enabled and
> > > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2)
> > > the VMA flags of the memory range being collapsed.
> > >
> > > Collapse eligibility criteria differs from khugepaged in that
> > > the sysfs files
> > > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap|shared]
> > > are ignored.
> >
> > The userspace khugepaged idea definitely makes sense to me, though I'm
> > curious how the line is drawn on the different behaviors here by explicitly
> > ignoring the max_ptes_* entries.
> >
> > Let's assume the initiative is to duplicate a more data-aware khugepaged in
> > the userspace, then IMHO it makes more sense to start with all the policies
> > that apply to khugepaged already, including max_ptes_*.
> >
> > I can understand the willingness to provide even stronger semantics here
> > than khugepaged since the userspace could have very clear knowledge of how
> > to provision the memories (better than a kernel scanner). It's just that
> > IMHO it could be slightly confusing if the new interface only partially
> > applies the khugepaged rules.
> >
> > No strong opinion here. It could already have been a trade-off after the
> > discussion from the RFC with Michal which I read. Just curious about how
> > you made that design decision so feel free to read it as a pure question.
> >
>
> Understand your point here. The allocation and max_ptes_* semantics are
> split between khugepaged-like and fault-like, respectively - which
> could be confusing.
> Originally, I proposed a MADV_F_COLLAPSE_LIMITS
> flag to control the former's behavior, but agreed to keep things
> simple to start, and expand the interface if/when necessary. I opted
> to ignore max_ptes_* as the default since I envisioned that early
> adopters would "just want it to work". One such example would be
> backing executable text by hugepages on program load when many pages
> haven't been demand-paged in yet.
>
> What do you think?

I'm just slightly worried that it will blur the default MADV_COLLAPSE
semantics.

To me, a clean default definition for MADV_COLLAPSE would be nice, such as
"do what khugepaged does on this range, in the current thread's context".
IMHO any feature bits can then supplement special needs, and I'd take the
THP-backed executable example to be one of the (good?) reasons we'd need an
extra flag for ignoring the max_ptes_* knobs.

So personally, if I were you, maybe I'd start with that simple scheme (even
if it won't immediately serve a use case) and then add either the defrag or
ignore_max_ptes_* behavior as feature bits later on, with clear use case
descriptions of why we need each of the feature flags. IMHO numbers would
be even more helpful once there are specific use cases on show.

Or, perhaps you think all potential MADV_COLLAPSE users should literally
always skip the max_ptes_* limitations?

Anyway, I won't pretend I am an expert in this area. :) So please take that
with a grain of salt.

Thanks,

-- 
Peter Xu
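
For illustration only, the two candidate defaults discussed in this thread can be modelled as a small eligibility check. This is a sketch under stated assumptions, not kernel code: `Limits`, `RangeStats`, `may_collapse` and the `ignore_limits` parameter are hypothetical names, the default values assume 4 KiB base pages with a 2 MiB PMD-sized hugepage, and the tuned `max_ptes_none=64` in the usage lines is an invented setting.

```python
from dataclasses import dataclass

HPAGE_PMD_NR = 512  # assumption: 4 KiB base pages, 2 MiB PMD-sized hugepage


@dataclass(frozen=True)
class Limits:
    """Hypothetical stand-in for the khugepaged max_ptes_* sysfs knobs."""
    max_ptes_none: int = HPAGE_PMD_NR - 1
    max_ptes_swap: int = HPAGE_PMD_NR // 8
    max_ptes_shared: int = HPAGE_PMD_NR // 2


@dataclass(frozen=True)
class RangeStats:
    """Illustrative per-PTE counts for one PMD-sized range."""
    none: int = 0    # PTEs not populated (never faulted in)
    swap: int = 0    # PTEs pointing at swapped-out pages
    shared: int = 0  # PTEs mapping pages shared with other processes


def may_collapse(stats, limits, ignore_limits=False):
    """Return True if the range passes the max_ptes_* style checks.

    ignore_limits=False models "do what khugepaged does on this range".
    ignore_limits=True models skipping the max_ptes_* knobs entirely.
    """
    if ignore_limits:
        return True
    return (stats.none <= limits.max_ptes_none
            and stats.swap <= limits.max_ptes_swap
            and stats.shared <= limits.max_ptes_shared)


# Executable text right after program load: most pages not faulted in yet.
tuned = Limits(max_ptes_none=64)  # invented administrator setting
text = RangeStats(none=500)
print(may_collapse(text, tuned))                      # khugepaged-like default
print(may_collapse(text, tuned, ignore_limits=True))  # limits skipped
```

In this model the executable-text case is rejected under the khugepaged-like default and accepted only when the limits are skipped, which is the behavioral difference the proposed feature flag would have to express.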