From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id A3230C433F5
	for <linux-mm@archiver.kernel.org>; Tue, 19 Apr 2022 14:33:41 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 0F70B6B00DB; Tue, 19 Apr 2022 10:33:41 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 0A8516B00DC; Tue, 19 Apr 2022 10:33:41 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id E3C608D0047; Tue, 19 Apr 2022 10:33:40 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (relay.a.hostedemail.com [64.99.140.24])
	by kanga.kvack.org (Postfix) with ESMTP id CF7E56B00DB
	for <linux-mm@kvack.org>; Tue, 19 Apr 2022 10:33:40 -0400 (EDT)
Received: from smtpin20.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay02.hostedemail.com (Postfix) with ESMTP id 92ED823619
	for <linux-mm@kvack.org>; Tue, 19 Apr 2022 14:33:40 +0000 (UTC)
X-FDA: 79373872200.20.04659A9
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	by imf03.hostedemail.com (Postfix) with ESMTP id DB19B20011
	for <linux-mm@kvack.org>; Tue, 19 Apr 2022 14:33:38 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1650378819;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=1siTtdg35Zv76LsaRQ1Aqn16dXv1Y0v7We88GlGNHCk=;
	b=SbKvG9yIH4lppwzrglW9xXUTckgwTaJYbc7u53+LyQbpWCsRbCKGB/tqmas4PbZf8nWceH
	tUR2yy3mzB7ww5ggw3wfeWiPF+bEeC8MRUdrwRTk91/gheWbQM8rQQIFiQ8V/R24LGXS+b
	ot/E3NklxCQszCT5CQ0PWPbVYVz859I=
Received: from mail-wm1-f72.google.com (mail-wm1-f72.google.com
 [209.85.128.72]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
 us-mta-576-83e32y5gMD22OufOW8lakQ-1; Tue, 19 Apr 2022 10:33:36 -0400
X-MC-Unique: 83e32y5gMD22OufOW8lakQ-1
Received: by mail-wm1-f72.google.com with SMTP id y11-20020a7bc18b000000b0038eac019fc0so1303853wmi.9
        for <linux-mm@kvack.org>; Tue, 19 Apr 2022 07:33:36 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:message-id:date:mime-version:user-agent
         :content-language:to:cc:references:from:organization:subject
         :in-reply-to:content-transfer-encoding;
        bh=1siTtdg35Zv76LsaRQ1Aqn16dXv1Y0v7We88GlGNHCk=;
        b=MXCUlnpwOAnxuz6tu5QYOSrrWX8lr4tQyblbx33D7CJtq/Wf7vRVde8ZZq1N39yp3X
         Jz2YdFWlEw+OLS8P7+VqYd9vIBPWl2TLCnuiEQkoGQNyoiPRq5gsGe3iXzreQZHaQkiU
         Y9kEm4EmB/99M9BhFjDsQzMTbF9CeQmNMhaniBoA1fmUJ+AJ13jh6J23eko3VIpKKWlX
         kK9bf0KngBsI8urhUx6kj80VBjoMM9qwfcTMybxjWXynXitb30WZoVGJtpNQObqPt78/
         8w+cpFJehnjPXdGhLOhGJl5EkdYAeh+KRrc5prJKe6ALw3T+mTTDfgLHIUOMdft9EzXu
         3iaw==
X-Gm-Message-State: AOAM530c08/kBWcZHIzCYWEBLZnE/lppjUk1a9cMkfZz2YPlSbQni7bP
	QdtmdWNYWoIpGxat+SVuBAVOMiGdjh8EUb+wrSBplFCFEbOQAJgQgHwVV6sHyho/AGGS/h/loML
	DDOFnpy6R/4w=
X-Received: by 2002:a05:600c:3b0e:b0:38e:c5f6:d255 with SMTP id m14-20020a05600c3b0e00b0038ec5f6d255mr16364063wms.129.1650378814921;
        Tue, 19 Apr 2022 07:33:34 -0700 (PDT)
X-Google-Smtp-Source: ABdhPJwmV9W+806wkaM+JUNurMRFouYlJ1NNEPCiP/PcQGV2w+owvKEjZjU4cn3wMeRcrQ8eBVzfbA==
X-Received: by 2002:a05:600c:3b0e:b0:38e:c5f6:d255 with SMTP id m14-20020a05600c3b0e00b0038ec5f6d255mr16364007wms.129.1650378814531;
        Tue, 19 Apr 2022 07:33:34 -0700 (PDT)
Received: from ?IPV6:2003:cb:c704:5d00:d8c2:fbf6:a608:957a? (p200300cbc7045d00d8c2fbf6a608957a.dip0.t-ipconnect.de. [2003:cb:c704:5d00:d8c2:fbf6:a608:957a])
        by smtp.gmail.com with ESMTPSA id l9-20020a1c7909000000b0038eb8171fa5sm15876513wme.1.2022.04.19.07.33.33
        (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128);
        Tue, 19 Apr 2022 07:33:34 -0700 (PDT)
Message-ID: <8d8da2fb-aed9-96d0-47ed-94806e190250@redhat.com>
Date: Tue, 19 Apr 2022 16:33:32 +0200
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
 Thunderbird/91.6.2
To: Peter Xu <peterx@redhat.com>, Zach O'Keefe <zokeefe@google.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>,
 David Rientjes <rientjes@google.com>, Matthew Wilcox <willy@infradead.org>,
 Michal Hocko <mhocko@suse.com>, Pasha Tatashin <pasha.tatashin@soleen.com>,
 SeongJae Park <sj@kernel.org>, Song Liu <songliubraving@fb.com>,
 Vlastimil Babka <vbabka@suse.cz>, Yang Shi <shy828301@gmail.com>,
 Zi Yan <ziy@nvidia.com>, linux-mm@kvack.org,
 Andrea Arcangeli <aarcange@redhat.com>,
 Andrew Morton <akpm@linux-foundation.org>, Arnd Bergmann <arnd@arndb.de>,
 Axel Rasmussen <axelrasmussen@google.com>,
 Chris Kennelly <ckennelly@google.com>, Chris Zankel <chris@zankel.net>,
 Helge Deller <deller@gmx.de>, Hugh Dickins <hughd@google.com>,
 Ivan Kokshaysky <ink@jurassic.park.msu.ru>,
 "James E.J. Bottomley" <James.Bottomley@hansenpartnership.com>,
 Jens Axboe <axboe@kernel.dk>,
 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
 Matt Turner <mattst88@gmail.com>, Max Filippov <jcmvbkbc@gmail.com>,
 Miaohe Lin <linmiaohe@huawei.com>, Minchan Kim <minchan@kernel.org>,
 Patrick Xia <patrickx@google.com>, Pavel Begunkov <asml.silence@gmail.com>,
 Thomas Bogendoerfer <tsbogend@alpha.franken.de>
References: <20220414180612.3844426-1-zokeefe@google.com>
 <Yli2nLz/agg4+ZIK@xz-m1.local>
 <CAAa6QmQbA4LFoUa19my+0Mc4Rp+ded5yiELB6qpVwhirDNrhtg@mail.gmail.com>
 <Yll1gg68NVnFRSby@xz-m1.local>
 <CAAa6QmSMwp9SR4v4AbJzbcribbdwJB_jTeSP595iMKQ2khtBHQ@mail.gmail.com>
 <YlsYcBWQyUMzrWRB@xz-m1.local>
From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
Subject: Re: [PATCH v2 00/12] mm: userspace hugepage collapse
In-Reply-To: <YlsYcBWQyUMzrWRB@xz-m1.local>
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Language: en-US
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Authentication-Results: imf03.hostedemail.com;
	dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=SbKvG9yI;
	spf=none (imf03.hostedemail.com: domain of david@redhat.com has no SPF policy when checking 170.10.133.124) smtp.mailfrom=david@redhat.com;
	dmarc=pass (policy=none) header.from=redhat.com
X-Stat-Signature: yem46nefjw8pic4xsu534qnxe99q1y1u
X-Rspamd-Queue-Id: DB19B20011
X-Rspamd-Server: rspam04
X-Rspam-User: 
X-HE-Tag: 1650378818-874226
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On 16.04.22 21:26, Peter Xu wrote:
> Hi, Zach,
> 
> On Fri, Apr 15, 2022 at 01:04:04PM -0700, Zach O'Keefe wrote:
>> On Fri, Apr 15, 2022 at 6:39 AM Peter Xu <peterx@redhat.com> wrote:
>>>
>>> On Thu, Apr 14, 2022 at 05:52:43PM -0700, Zach O'Keefe wrote:
>>>> Hey Peter,
>>>>
>>>> Thanks for taking the time to review!
>>>>
>>>> On Thu, Apr 14, 2022 at 5:04 PM Peter Xu <peterx@redhat.com> wrote:
>>>>>
>>>>> Hi, Zach,
>>>>>
>>>>> On Thu, Apr 14, 2022 at 11:06:00AM -0700, Zach O'Keefe wrote:
>>>>>> process_madvise(2)
>>>>>>
>>>>>>       Performs a synchronous collapse of the native pages
>>>>>>       mapped by the list of iovecs into transparent hugepages.
>>>>>>
>>>>>>       Allocation semantics are the same as khugepaged, and depend on
>>>>>>       (1) the active sysfs settings
>>>>>>       /sys/kernel/mm/transparent_hugepage/enabled and
>>>>>>       /sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2)
>>>>>>       the VMA flags of the memory range being collapsed.
>>>>>>
>>>>>>       Collapse eligibility criteria differs from khugepaged in that
>>>>>>       the sysfs files
>>>>>>       /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap|shared]
>>>>>>       are ignored.
>>>>>
>>>>> The userspace khugepaged idea definitely makes sense to me, though I'm
>>>>> curious how the line is drown on the different behaviors here by explicitly
>>>>> ignoring the max_ptes_* entries.
>>>>>
>>>>> Let's assume the initiative is to duplicate a more data-aware khugepaged in
>>>>> the userspace, then IMHO it makes more sense to start with all the policies
>>>>> that applies to khugepaged already, including max_pte_*.
>>>>>
>>>>> I can understand the willingness to provide even stronger semantics here
>>>>> than khugepaged since the userspace could have very clear knowledge of how
>>>>> to provision the memories (better than a kernel scanner).  It's just that
>>>>> IMHO it could be slightly confusing if the new interface only partially
>>>>> apply the khugepaged rules.
>>>>>
>>>>> No strong opinion here.  It could already been a trade-off after the
>>>>> discussion from the RFC with Michal which I read..  Just curious about how
>>>>> you made that design decision so feel free to read it as a pure question.
>>>>>
>>>>
>>>> Understand your point here. The allocation and max_pte_* semantics are
>>>> split between khugepaged-like and fault-like, respectively - which
>>>> could be confusing. Originally, I proposed a MADV_F_COLLAPSE_LIMITS
>>>> flag to control the former's behavior, but agreed to keep things
>>>> simple to start, and expand the interface if/when necessary. I opted
>>>> to ignore max_ptes_* as the default since I envisioned that early
>>>> adopters would "just want it to work". One such example would be
>>>> backing executable text by hugepages on program load when many pages
>>>> haven't been demand-paged in yet.
>>>>
>>>> What do you think?
>>>
>>> I'm just slightly worried that'll make the default MADV_COLLAPSE semantics
>>> blurred.
>>>
>>> To me, a clean default definition for MADV_COLLAPSE would be nice, as "do
>>> khugepaged on this range, and with current thread context".  IMHO any
>>> feature bits then can be supplementing special needs, and I'll take the thp
>>> backing executable example to be one of the (good?) reason we'd need an
>>> extra flag for ignoring the max_ptes_* knobs.
>>>
>>> So personally if I were you maybe I'll start with the simple scheme of that
>>> (even if it won't immediately service a thing) but then add either the
>>> defrag or ignore_max_ptes_* as feature bits later on, with clear use case
>>> descriptions about why we need each of the feature flags.  IMHO numbers
>>> would be even more helpful when there's specific use cases on the show.
>>>
>>> Or, perhaps you think all potential MADV_COLLAPSE users should literally
>>> skip max_ptes_* limitations always?
>>>
>>
>> Thanks for your time and valuable feedback here, Peter. I had a response typed
>> up, but after a few iterations became increasingly unsatisfied with my
>> own response.
>>
>> I think this feature should be able to stand on its own without
>> consideration of a userspace khugepaged, as we have existing concrete
>> examples where it would be useful. In these cases, and I assume almost
>> all other use-cases outside userspace khugepaged, max_ptes_* should be
>> ignored as the fundamental assumption of MADV_COLLAPSE is that the
>> user knows better, and IMHO, khugepaged heuristics shouldn't tell
>> users they are wrong.
> 
> Valid point.  And actually right after I replied I thought similarly on
> whether we need to connect the two interfaces at all..
> 
> It's just that it's very easy to go think like that after reading the cover
> letter since that's exactly what it is comparing to. :)
> 
> There's definitely a difference view on user/kernel level of things, then
> it sounds reasonable to me if we add a new interface it by default has a
> stronger semantics otherwise we may not bother if with MADV_HUGEPAGE's
> existance.
> 
> So maybe max_ptes_* won't even make sense for MADV_COLLAPSE in most cases
> as you said.  And that's a real pure question I asked above, and I feel
> like your answer is actually "yes" we should always ignore the max_ptes_*
> fields until there's a proof that it'll be helpful.
> 
>>
>> But this, as you mention, unsatisfactorily blurs the semantics of
>> MADV_COLLAPSE: "act like khugepaged here, but not here".
>>
>> As such, WDYT about the reverse-side of the coin of what you proposed:
>> to not couple the default behavior of MADV_COLLAPSE with khugepaged at
>> all? I.e. Not tie the allocation semantics to
>> /sys/kernel/mm/transparent_hugepage/khugepaged/defrag. We can add
>> flags as necessary when/if a reimplementation of khugepaged in
>> userspace proves fruitful.
> 
> Let's see whether others have thoughts, but what you proposed here makes
> sense to me.


Makes sense to me. IIUC, the max_ptes_* handling exists as tunables
because we don't know what user space is up to.

E.g., with a very sparse memory layout, we don't want to waste memory
by allocating memory where we actually have no page populated yet --
it could be that user space won't reuse that memory in the foreseeable
future. With too many swap entries, we don't want to trigger possibly
unnecessary overhead from swapping in entries that user space won't
access in the foreseeable future. Something similar applies to
max_ptes_shared, where collapsing might just end up wasting a lot of
memory in some applications.

So IMHO, with MADV_COLLAPSE we should ignore/disable any heuristics that
try to figure out what user space might be doing. We know exactly what
user space asks for -- and that can be documented properly.
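
To make the distinction concrete, here is a toy model of the two
eligibility policies. It is only a sketch and not kernel code: the
names (Limits, RangeStats, khugepaged_would_collapse,
madv_collapse_would_collapse) are made up for illustration, the limit
values in the usage note are arbitrary, and the real khugepaged checks
involve more than these three counters.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    """Hypothetical stand-in for the khugepaged max_ptes_* sysfs tunables."""
    max_ptes_none: int
    max_ptes_swap: int
    max_ptes_shared: int


@dataclass
class RangeStats:
    """Hypothetical per-range PTE counts for one hugepage-sized region."""
    none: int    # PTEs with no page populated yet
    swap: int    # PTEs that are swap entries
    shared: int  # PTEs mapping shared pages


def khugepaged_would_collapse(stats: RangeStats, limits: Limits) -> bool:
    # The kernel scanner cannot know what user space intends, so it
    # only collapses when every counter is within its tunable limit.
    return (stats.none <= limits.max_ptes_none
            and stats.swap <= limits.max_ptes_swap
            and stats.shared <= limits.max_ptes_shared)


def madv_collapse_would_collapse(stats: RangeStats) -> bool:
    # Semantics argued for in this thread: user space asked explicitly,
    # so the max_ptes_* heuristics are not consulted at all.
    return True
```

With limits of, say, max_ptes_none=64, a range with 500 unpopulated
PTEs would be skipped by the khugepaged-style check but still collapsed
under the MADV_COLLAPSE-style one, which is the executable-text case
Zach mentioned.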

-- 
Thanks,

David / dhildenb