From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 4C4F7C433EF
	for <linux-mm@archiver.kernel.org>; Thu,  9 Jun 2022 17:35:52 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 8D8E78D002E; Thu,  9 Jun 2022 13:35:51 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 887E46B00D0; Thu,  9 Jun 2022 13:35:51 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 728F18D002E; Thu,  9 Jun 2022 13:35:51 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17])
	by kanga.kvack.org (Postfix) with ESMTP id 5EC216B00CF
	for <linux-mm@kvack.org>; Thu,  9 Jun 2022 13:35:51 -0400 (EDT)
Received: from smtpin10.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay01.hostedemail.com (Postfix) with ESMTP id 3824460AB5
	for <linux-mm@kvack.org>; Thu,  9 Jun 2022 17:35:51 +0000 (UTC)
X-FDA: 79559400102.10.4047F83
Received: from mail-lf1-f41.google.com (mail-lf1-f41.google.com [209.85.167.41])
	by imf07.hostedemail.com (Postfix) with ESMTP id D8EA040045
	for <linux-mm@kvack.org>; Thu,  9 Jun 2022 17:35:50 +0000 (UTC)
Received: by mail-lf1-f41.google.com with SMTP id h23so39090251lfe.4
        for <linux-mm@kvack.org>; Thu, 09 Jun 2022 10:35:50 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20210112;
        h=mime-version:references:in-reply-to:from:date:message-id:subject:to
         :cc;
        bh=PMEfqvdq9feTTA2GMug67nXutEth2Uqg9/c1l2uD2Og=;
        b=HNSud6r0xdg3q9TBgbd+qu28YcHxpAFckMQBoVbQb3BDk0Qmstj7INu1mf2zoydnvd
         S5GK5YnsrWKkPlfi60lz1bheJ45g5Eg8Hb+jR5rksGFswqdpXkbNSkpb1JJNN+S9SLtf
         4fUSSMYOGORJuR+AhcGVb9CjGScTZB/PjFvBhhTFhpANgXZiKcooysG+9WZyPy+4rXLo
         +KAcw9/4T47fdB2gLOff6Ccu229cKa8FdSslhohEWo6ZTqRdlItOn9eMGCp+skGNpzBt
         c+sh835W9bHrpsF5vQICRFvYAIP0mXYkpzcllDz6HZgZ3AmQl468WyR+sOWcKQV11l71
         Jq0g==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:mime-version:references:in-reply-to:from:date
         :message-id:subject:to:cc;
        bh=PMEfqvdq9feTTA2GMug67nXutEth2Uqg9/c1l2uD2Og=;
        b=skMn2HIGztGGrb8eNtCxJPWUKGsKlw7CfZfJHxbzOICYpeizcaI+j/ALCyCmLZ2LXq
         pAqMejvfzwinrWFZgsxLyG8VCcQPgTffCsG15FYlWLyR2zLPIpOCUkJ9+oAzjXXiYFvI
         JQOfjZYgPv80ZcxfY94gyLERQ/IFD/DV3BXAAUWAJ668cz7uOML/OaPGNjqLDsMoc+7j
         25trAW9Bv+7HZZ0/qVWMavkA+IsHHwxPGngV1hsY/lvp2VQXbgzeDrCftZOJLry+5WYm
         wl8DuUd4QVJzWtg+wqVT+ToyxFVVqVQJPxfR1oiaI2TZ3rEX+H6tCzBj2MJxt5RCVGPL
         ojOg==
X-Gm-Message-State: AOAM531ChlRsYZ0ylG6hX//17OOsQoe1I+fI7iejqb/p5u3zDkDsZFZe
	ZmhGNi24TmSiSiRT67z9Qcms2Hl4EXK21Ari3Leg1w==
X-Google-Smtp-Source: ABdhPJx7Xnrt7PpTthT9wxKdBP6OPJa/zMQVUuZMQUPrhIrZlA6dsuh00sy20OHjXoNcOqrJszBwsjhGepsMxsc2Tqc=
X-Received: by 2002:a05:6512:e94:b0:478:f6ad:c9d7 with SMTP id
 bi20-20020a0565120e9400b00478f6adc9d7mr25314836lfb.198.1654796148204; Thu, 09
 Jun 2022 10:35:48 -0700 (PDT)
MIME-Version: 1.0
References: <20220604004004.954674-1-zokeefe@google.com> <20220604004004.954674-10-zokeefe@google.com>
 <CAHbLzkowsvrP1F86MUjFH4EdmW4OeMSLtZciAoCsvCg=O332Xw@mail.gmail.com>
 <CAAa6QmTu+Mmsmbf4rYGpgoj3Fm3ZnFQoPJT+4rBP_TM7MF-++g@mail.gmail.com> <CAHbLzkrjMLuA01Mn8=+zNyBpKOvqf_0XyMs_QBP18Q6Ja7xKww@mail.gmail.com>
In-Reply-To: <CAHbLzkrjMLuA01Mn8=+zNyBpKOvqf_0XyMs_QBP18Q6Ja7xKww@mail.gmail.com>
From: "Zach O'Keefe" <zokeefe@google.com>
Date: Thu, 9 Jun 2022 10:35:10 -0700
Message-ID: <CAAa6QmQgJX9Kw2HeSf=PsGYvzTi=zW9udeR+Bis7_abzuO+BPA@mail.gmail.com>
Subject: Re: [PATCH v6 09/15] mm/madvise: introduce MADV_COLLAPSE sync
 hugepage collapse
To: Yang Shi <shy828301@gmail.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>, David Hildenbrand <david@redhat.com>, 
	David Rientjes <rientjes@google.com>, Matthew Wilcox <willy@infradead.org>, 
	Michal Hocko <mhocko@suse.com>, Pasha Tatashin <pasha.tatashin@soleen.com>, 
	Peter Xu <peterx@redhat.com>, Rongwei Wang <rongwei.wang@linux.alibaba.com>, 
	SeongJae Park <sj@kernel.org>, Song Liu <songliubraving@fb.com>, Vlastimil Babka <vbabka@suse.cz>, 
	Zi Yan <ziy@nvidia.com>, Linux MM <linux-mm@kvack.org>, 
	Andrea Arcangeli <aarcange@redhat.com>, Andrew Morton <akpm@linux-foundation.org>, 
	Arnd Bergmann <arnd@arndb.de>, Axel Rasmussen <axelrasmussen@google.com>, 
	Chris Kennelly <ckennelly@google.com>, Chris Zankel <chris@zankel.net>, Helge Deller <deller@gmx.de>, 
	Hugh Dickins <hughd@google.com>, Ivan Kokshaysky <ink@jurassic.park.msu.ru>, 
	"James E.J. Bottomley" <James.Bottomley@hansenpartnership.com>, Jens Axboe <axboe@kernel.dk>, 
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>, Matt Turner <mattst88@gmail.com>, 
	Max Filippov <jcmvbkbc@gmail.com>, Miaohe Lin <linmiaohe@huawei.com>, 
	Minchan Kim <minchan@kernel.org>, Patrick Xia <patrickx@google.com>, 
	Pavel Begunkov <asml.silence@gmail.com>, Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Content-Type: text/plain; charset="UTF-8"
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1654796150; a=rsa-sha256;
	cv=none;
	b=jGB8ZuW1WmmQQqX5N99tqC1zFXMuOk1mixjvXpY76lpO2H10gcmJGnq9xb3VUJ9xdIFBsb
	6AOKTlrBxaMahYbh7ssK7YkUcbEnIbmD1Y+2NIRcPzZu35QE84XqUlq/0ulAOxeWjryjfy
	t95XRmzf45tx0Cqleew8hdEuRi0qmkk=
ARC-Authentication-Results: i=1;
	imf07.hostedemail.com;
	dkim=pass header.d=google.com header.s=20210112 header.b=HNSud6r0;
	dmarc=pass (policy=reject) header.from=google.com;
	spf=pass (imf07.hostedemail.com: domain of zokeefe@google.com designates 209.85.167.41 as permitted sender) smtp.mailfrom=zokeefe@google.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1654796150;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=PMEfqvdq9feTTA2GMug67nXutEth2Uqg9/c1l2uD2Og=;
	b=mP98FXJ1CDZMMC5mNpXN+N/IewYcpvjtzf3CGJuJT4kHW8L1UJ9+FR9pD+zV4LH/6LDMe2
	7N+ycPAP8jQzFS5Dl23jLzOI0lIv2kmTUU9kqFtzAz7rDieOhlXASuY//sC5qnEeJSOaZt
	PqZKwh92FsA//qxpqlNyMl+asw0dMCk=
X-Stat-Signature: 6qdcgkzwg5immgh3brbhz896a9iyeqh8
X-Rspam-User: 
X-Rspamd-Queue-Id: D8EA040045
X-Rspamd-Server: rspam07
Authentication-Results: imf07.hostedemail.com;
	dkim=pass header.d=google.com header.s=20210112 header.b=HNSud6r0;
	dmarc=pass (policy=reject) header.from=google.com;
	spf=pass (imf07.hostedemail.com: domain of zokeefe@google.com designates 209.85.167.41 as permitted sender) smtp.mailfrom=zokeefe@google.com
X-HE-Tag: 1654796150-925700
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Tue, Jun 7, 2022 at 5:39 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Tue, Jun 7, 2022 at 3:48 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > On Mon, Jun 6, 2022 at 4:53 PM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > >
> > > > This idea was introduced by David Rientjes[1].
> > > >
> > > > Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request a
> > > > synchronous collapse of memory at their own expense.
> > > >
> > > > The benefits of this approach are:
> > > >
> > > > * CPU is charged to the process that wants to spend the cycles for the
> > > >   THP
> > > > * Avoid unpredictable timing of khugepaged collapse
> > > >
> > > > An immediate user of this new functionality are malloc() implementations
> > > > that manage memory in hugepage-sized chunks, but sometimes subrelease
> > > > memory back to the system in native-sized chunks via MADV_DONTNEED;
> > > > zapping the pmd.  Later, when the memory is hot, the implementation
> > > > could madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain
> > > > hugepage coverage and dTLB performance.  TCMalloc is such an
> > > > implementation that could benefit from this[2].
> > > >
> > > > Only privately-mapped anon memory is supported for now, but it is
> > > > expected that file and shmem support will be added later to support the
> > > > use-case of backing executable text by THPs.  Current support provided
> > > > by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system
> > > > which might impair services from serving at their full rated load after
> > > > (re)starting.  Tricks like mremap(2)'ing text onto anonymous memory to
> > > > immediately realize iTLB performance prevents page sharing and demand
> > > > paging, both of which increase steady state memory footprint.  With
> > > > MADV_COLLAPSE, we get the best of both worlds: Peak upfront performance
> > > > and lower RAM footprints.
> > > >
> > > > This call is independent of the system-wide THP sysfs settings, but will
> > > > fail for memory marked VM_NOHUGEPAGE.
> > > >
> > > > THP allocation may enter direct reclaim and/or compaction.
> > > >
> > > > [1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
> > > > [2] https://github.com/google/tcmalloc/tree/master/tcmalloc
> > > >
> > > > Suggested-by: David Rientjes <rientjes@google.com>
> > > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > > ---
> > > >  arch/alpha/include/uapi/asm/mman.h     |   2 +
> > > >  arch/mips/include/uapi/asm/mman.h      |   2 +
> > > >  arch/parisc/include/uapi/asm/mman.h    |   2 +
> > > >  arch/xtensa/include/uapi/asm/mman.h    |   2 +
> > > >  include/linux/huge_mm.h                |  12 +++
> > > >  include/uapi/asm-generic/mman-common.h |   2 +
> > > >  mm/khugepaged.c                        | 124 +++++++++++++++++++++++++
> > > >  mm/madvise.c                           |   5 +
> > > >  8 files changed, 151 insertions(+)
> > > >
> > > > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> > > > index 4aa996423b0d..763929e814e9 100644
> > > > --- a/arch/alpha/include/uapi/asm/mman.h
> > > > +++ b/arch/alpha/include/uapi/asm/mman.h
> > > > @@ -76,6 +76,8 @@
> > > >
> > > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > > >
> > > > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > > +
> > > >  /* compatibility flags */
> > > >  #define MAP_FILE       0
> > > >
> > > > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> > > > index 1be428663c10..c6e1fc77c996 100644
> > > > --- a/arch/mips/include/uapi/asm/mman.h
> > > > +++ b/arch/mips/include/uapi/asm/mman.h
> > > > @@ -103,6 +103,8 @@
> > > >
> > > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > > >
> > > > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > > +
> > > >  /* compatibility flags */
> > > >  #define MAP_FILE       0
> > > >
> > > > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> > > > index a7ea3204a5fa..22133a6a506e 100644
> > > > --- a/arch/parisc/include/uapi/asm/mman.h
> > > > +++ b/arch/parisc/include/uapi/asm/mman.h
> > > > @@ -70,6 +70,8 @@
> > > >  #define MADV_WIPEONFORK 71             /* Zero memory on fork, child only */
> > > >  #define MADV_KEEPONFORK 72             /* Undo MADV_WIPEONFORK */
> > > >
> > > > +#define MADV_COLLAPSE  73              /* Synchronous hugepage collapse */
> > > > +
> > > >  #define MADV_HWPOISON     100          /* poison a page for testing */
> > > >  #define MADV_SOFT_OFFLINE 101          /* soft offline page for testing */
> > > >
> > > > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> > > > index 7966a58af472..1ff0c858544f 100644
> > > > --- a/arch/xtensa/include/uapi/asm/mman.h
> > > > +++ b/arch/xtensa/include/uapi/asm/mman.h
> > > > @@ -111,6 +111,8 @@
> > > >
> > > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > > >
> > > > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > > +
> > > >  /* compatibility flags */
> > > >  #define MAP_FILE       0
> > > >
> > > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > > index 648cb3ce7099..2ca2f3b41fc8 100644
> > > > --- a/include/linux/huge_mm.h
> > > > +++ b/include/linux/huge_mm.h
> > > > @@ -240,6 +240,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> > > >
> > > >  int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
> > > >                      int advice);
> > > > +int madvise_collapse(struct vm_area_struct *vma,
> > > > +                    struct vm_area_struct **prev,
> > > > +                    unsigned long start, unsigned long end);
> > > >  void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
> > > >                            unsigned long end, long adjust_next);
> > > >  spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> > > > @@ -395,6 +398,15 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
> > > >         BUG();
> > > >         return 0;
> > > >  }
> > > > +
> > > > +static inline int madvise_collapse(struct vm_area_struct *vma,
> > > > +                                  struct vm_area_struct **prev,
> > > > +                                  unsigned long start, unsigned long end)
> > > > +{
> > > > +       BUG();
> > > > +       return 0;
> > >
> > > I wish -ENOSYS could have been returned, but it seems madvise()
> > > doesn't support this return value.
> >
> > This is somewhat tangential, but I agree that ENOSYS (or some other
> > errno, but ENOSYS makes most sense to me, after EINVAL, (ENOTSUP?))
> > should be anointed the dedicated return value for "madvise mode not
> > supported". Ran into this recently when wanting some form of feature
> > detection for MADV_COLLAPSE where EINVAL is overloaded  (including
> > madvise mode not supported). Happy to move this forward if others
> > agree.
>
> I did a quick test by calling MADV_HUGEPAGE on !THP kernel, madvise()
> actually returns -EINVAL by madvise_behavior_valid(). So
> madvise_collapse() won't be called at all. So madvise_collapse() is
> basically used to make !THP compile happy.

Ya, exactly. I was thinking -ENOTSUP could be used in
place of -ENVAL in the madvise_behavior_valid() path to tell callers
of madvise(2) if a given madvise mode was supported or not. At the
moment -EINVAL return could mean a number of different things. Anyways
- that's a side conversation.

> I think we could just return -EINVAL.

That sounds fine - as you mention it's code that shouldn't be called
anyways and is just there to satisfy !THP. Was just basing off
hugepage_madvise().

> >
> > > > +}
> > > > +
> > > >  static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
> > > >                                          unsigned long start,
> > > >                                          unsigned long end,
> > > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > > > index 6c1aa92a92e4..6ce1f1ceb432 100644
> > > > --- a/include/uapi/asm-generic/mman-common.h
> > > > +++ b/include/uapi/asm-generic/mman-common.h
> > > > @@ -77,6 +77,8 @@
> > > >
> > > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > > >
> > > > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > > +
> > > >  /* compatibility flags */
> > > >  #define MAP_FILE       0
> > > >
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index 4ad04f552347..073d6bb03b37 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -2404,3 +2404,127 @@ void khugepaged_min_free_kbytes_update(void)
> > > >                 set_recommended_min_free_kbytes();
> > > >         mutex_unlock(&khugepaged_mutex);
> > > >  }
> > > > +
> > > > +static int madvise_collapse_errno(enum scan_result r)
> > > > +{
> > > > +       switch (r) {
> > > > +       case SCAN_PMD_NULL:
> > > > +       case SCAN_ADDRESS_RANGE:
> > > > +       case SCAN_VMA_NULL:
> > > > +       case SCAN_PTE_NON_PRESENT:
> > > > +       case SCAN_PAGE_NULL:
> > > > +               /*
> > > > +                * Addresses in the specified range are not currently mapped,
> > > > +                * or are outside the AS of the process.
> > > > +                */
> > > > +               return -ENOMEM;
> > > > +       case SCAN_ALLOC_HUGE_PAGE_FAIL:
> > > > +       case SCAN_CGROUP_CHARGE_FAIL:
> > > > +               /* A kernel resource was temporarily unavailable. */
> > > > +               return -EAGAIN;
> > >
> > > I thought this should return -ENOMEM too.
> >
> > Do you mean specifically SCAN_CGROUP_CHARGE_FAIL?
>
> No, I mean both.
>
> >
> > At least going by the comment above do_madvise(), and in the man
> > pages, for ENOMEM: "Addresses in the specified range are not currently
> > mapped, or are outside the address space of the process." doesn't
> > really apply here (though I don't know if "A kernel resource was
> > temporarily unavailable" applies any better).
>
> Yes, the man page does say so. But IIRC some MADV_ operations do
> return -ENOMEM for memory allocation failure, for example,
> MADV_POPULATE_READ/WRITE. Typically the man pages don't cover all
> cases.

Good point, I missed MADV_POPULATE_READ/WRITE didn't go through the
-ENOMEM -> -EAGAIN remapping at the bottom of madvise_vma_behavior().

> >
> > That said, should we differentiate between allocation and charging
> > failure? At least in the case of a userspace agent using
> > process_madvise(2) to collapse memory on behalf of others, knowing
> > "this memcg is at its limit" vs "no THPs available" would be valuable.
> > Maybe the former should be EBUSY?
>
> IMHO we don't have to differentiate allocation and charging.

After some consideration (thanks for starting this discussion and
prompting me to do so), I do think it's very valuable for callers to
know when THP allocation fails, and that an errno should be reserved
for that. The caller needs to know when a generic error, specific to
the memory being collapsed, occurs vs THP allocation failure to help
guide next actions: fallback to other strategy, sleep, MADV_DONTNEED /
free memory elsewhere, etc.

As a concrete example, process init code that tries to back certain
segments of text by hugepages. Some existing strategies folks use for
this are CONFIG_READ_ONLY_THP_FOR_FS + khugepaged, and anon mremap(2)
tricks. CONFIG_READ_ONLY_THP_FOR_FS + MADV_COLLPASE might add a third,
and if it fails, it'd be nice to know which other option to fall back
to, depending on how badly the user wants THP backing. If THP
allocation fails, then likely the anon mremap(2) trick will fail too
(unless some reclaim/compaction is done).

Less immediately concrete, but a userspace agent seeking to optimize
system-wide THP utilization surely wants to know when it's exhausted
its precious THP supply.

So I'd like to see -EAGAIN reserved for THP allocation failure
(-ENOMEM is taken by AS errors, and it'd be nice to be consistent with
other modes here). I think -EBUSY for memcg charging makes sense, and
tells the caller something actionable and useful, so I'd like to see
it differentiated from -ENOMEM.

Would appreciate feedback from folks here before setting these in
stone and preventing the errno from ever being useful to callers.

> >
> > > > +       default:
> > > > +               return -EINVAL;
> > > > +       }
> > > > +}
> > > > +
> > > > +int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > > > +                    unsigned long start, unsigned long end)
> > > > +{
> > > > +       struct collapse_control cc = {
> > > > +               .enforce_page_heuristics = false,
> > > > +               .enforce_thp_enabled = false,
> > > > +               .last_target_node = NUMA_NO_NODE,
> > > > +               .gfp = GFP_TRANSHUGE | __GFP_THISNODE,
> > > > +       };
> > > > +       struct mm_struct *mm = vma->vm_mm;
> > > > +       unsigned long hstart, hend, addr;
> > > > +       int thps = 0, last_fail = SCAN_FAIL;
> > > > +       bool mmap_locked = true;
> > > > +
> > > > +       BUG_ON(vma->vm_start > start);
> > > > +       BUG_ON(vma->vm_end < end);
> > > > +
> > > > +       *prev = vma;
> > > > +
> > > > +       /* TODO: Support file/shmem */
> > > > +       if (!vma->anon_vma || !vma_is_anonymous(vma))
> > > > +               return -EINVAL;
> > > > +
> > > > +       hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
> > > > +       hend = end & HPAGE_PMD_MASK;
> > > > +
> > > > +       /*
> > > > +        * Set VM_HUGEPAGE so that hugepage_vma_check() can pass even if
> > > > +        * TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG is set (i.e. "madvise" mode).
> > > > +        * Note that hugepage_vma_check() doesn't enforce that
> > > > +        * TRANSPARENT_HUGEPAGE_FLAG or TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG
> > > > +        * must be set (i.e. "never" mode)
> > > > +        */
> > > > +       if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE))
> > >
> > > hugepage_vma_check() doesn't check vma size, so MADV_COLLAPSE may be
> > > running for a unsuitable vma, hugepage_vma_revalidate() called by
> > > khugepaged_scan_pmd() may find it out finally, but it is a huge waste
> > > of effort. So, it is better to check vma size upfront.
> >
> > This actually does check the vma size, but it's subtle. hstart and
> > hend are clamped to the first/last
> > hugepaged-aligned address covered by [start,end], which are themselves
> > contained in vma->vm_start/vma->vm_end, respectively. We then check
> > that addr = hstart < hend ; so if the main loop passes the first
> > check, we know that vma->vm_start <= addr and addr + HPAGE_PMD_SIZE <=
>
> Aha, yes, I overlooked that.
>
> > vma->vma_end. Agreed that we might be needlessly doing mmgrab() and
> > lru_add_drain() needlessly though.
>
> Yeah
>
> >
> > > BTW, my series moved the vma size check in hugepage_vma_check(), so if
> > > your series could be based on top of that, you get that for free.
> >
> > I'll try rebasing on top of your series, thank you!
>
> You don't have to do it right now. I don't know what series will be
> merged to mm tree first. Just a heads up.

Thanks! Seems beneficial here though, so I'll do that and add a note
in the cover letter.

> >
> > > > +               return -EINVAL;
> > > > +
> > > > +       mmgrab(mm);
> > > > +       lru_add_drain();
> > > > +
> > > > +       for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
> > > > +               int result = SCAN_FAIL;
> > > > +               bool retry = true;  /* Allow one retry per hugepage */
> > > > +retry:
> > > > +               if (!mmap_locked) {
> > > > +                       cond_resched();
> > > > +                       mmap_read_lock(mm);
> > "> > +                       mmap_locked = true;
> > > > +                       result = hugepage_vma_revalidate(mm, addr, &vma, &cc);
> > >
> > > How's about making hugepage_vma_revalidate() return SCAN_SUCCEED too?
> > > It seems more consistent.
> >
> > Ya, I didn't like this either.  I'll add this to "mm/khugepaged: pipe
> > enum scan_result codes back to callers"
> >
> > > > +                       if (result) {
> > > > +                               last_fail = result;
> > > > +                               goto out_nolock;
> > > > +                       }
> > > > +               }
> > > > +               mmap_assert_locked(mm);
> > > > +               memset(cc.node_load, 0, sizeof(cc.node_load));
> > > > +               result = khugepaged_scan_pmd(mm, vma, addr, &mmap_locked, &cc);
> > > > +               if (!mmap_locked)
> > > > +                       *prev = NULL;  /* Tell caller we dropped mmap_lock */
> > > > +
> > > > +               switch (result) {
> > > > +               case SCAN_SUCCEED:
> > > > +               case SCAN_PMD_MAPPED:
> > > > +                       ++thps;
> > > > +                       break;
> > > > +               /* Whitelisted set of results where continuing OK */
> > > > +               case SCAN_PMD_NULL:
> > > > +               case SCAN_PTE_NON_PRESENT:
> > > > +               case SCAN_PTE_UFFD_WP:
> > > > +               case SCAN_PAGE_RO:
> > > > +               case SCAN_LACK_REFERENCED_PAGE:
> > > > +               case SCAN_PAGE_NULL:
> > > > +               case SCAN_PAGE_COUNT:
> > > > +               case SCAN_PAGE_LOCK:
> > > > +               case SCAN_PAGE_COMPOUND:
> > > > +                       last_fail = result;
> > > > +                       break;
> > > > +               case SCAN_PAGE_LRU:
> > > > +                       if (retry) {
> > > > +                               lru_add_drain_all();
> > > > +                               retry = false;
> > > > +                               goto retry;
> > >
> > > I'm not sure whether the retry logic is necessary or not, do you have
> > > any data about how retry improves the success rate? You could just
> > > replace lru_add_drain() to lru_add_drain_all() and remove the retry
> > > logic IMHO. I'd prefer to keep it simple at the moment personally.
> >
> > Transparently, I've only had success hitting this logic on small vms
> > under selftests.  That said, it does happen, and I can't imagine this
> > hurting, especially on larger systems + tasks using lots of mem.
> > Originally, I didn't plan to do this, but as things shook out and we
> > had SCAN_PAGE_LRU so readily available, it seemed like we got this for
> > free.
>
> "small vms" mean small virtual machines?
>
> When the logic is hit, does lru_add_drain_all() help to improve the
> success rate?

Ya, I've been doing most dev/testing on small, 2 cpu virtual machines.
I've been mmap()ing a multi-hugepage sized region, then faulting it in
- presumably being preempted and rescheduled on another cpu while
iterating over the region and faulting. I had ran into the
lru_add_drain_all() vs lru_add_drain() during testing/dev since
*occasionally* (admittedly, not very often the IIRC the former helped
tests pass.

That said, I set out to try and repro this semi-reliably, with little
success - I'm almost always finding the pages on the LRU. Still
playing around with this..

> I don't mean this hurts anything. I'm just thinking about whether the
> extra complexity is worth it or not. And calling lru_add_drain_all()
> with holding mmap_lock might have some scalability issues since
> draining lru for all is not cheap.

Good point. At the very least, it seems like we should unlock
mmap_lock before retry.


>
>
> >
> > > > +                       }
> > > > +                       fallthrough;
> > > > +               default:
> > > > +                       last_fail = result;
> > > > +                       /* Other error, exit */
> > > > +                       goto out_maybelock;
> > > > +               }
> > > > +       }
> > > > +
> > > > +out_maybelock:
> > > > +       /* Caller expects us to hold mmap_lock on return */
> > > > +       if (!mmap_locked)
> > > > +               mmap_read_lock(mm);
> > > > +out_nolock:
> > > > +       mmap_assert_locked(mm);
> > > > +       mmdrop(mm);
> > > > +
> > > > +       return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
> > > > +                       : madvise_collapse_errno(last_fail);
> > > > +}
> > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > index 46feb62ce163..eccac2620226 100644
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -59,6 +59,7 @@ static int madvise_need_mmap_write(int behavior)
> > > >         case MADV_FREE:
> > > >         case MADV_POPULATE_READ:
> > > >         case MADV_POPULATE_WRITE:
> > > > +       case MADV_COLLAPSE:
> > > >                 return 0;
> > > >         default:
> > > >                 /* be safe, default to 1. list exceptions explicitly */
> > > > @@ -1057,6 +1058,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> > > >                 if (error)
> > > >                         goto out;
> > > >                 break;
> > > > +       case MADV_COLLAPSE:
> > > > +               return madvise_collapse(vma, prev, start, end);
> > > >         }
> > > >
> > > >         anon_name = anon_vma_name(vma);
> > > > @@ -1150,6 +1153,7 @@ madvise_behavior_valid(int behavior)
> > > >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > >         case MADV_HUGEPAGE:
> > > >         case MADV_NOHUGEPAGE:
> > > > +       case MADV_COLLAPSE:
> > > >  #endif
> > > >         case MADV_DONTDUMP:
> > > >         case MADV_DODUMP:
> > > > @@ -1339,6 +1343,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> > > >   *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> > > >   *             transparent huge pages so the existing pages will not be
> > > >   *             coalesced into THP and new pages will not be allocated as THP.
> > > > + *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
> > > >   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
> > > >   *             from being included in its core dump.
> > > >   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> > > > --
> > > > 2.36.1.255.ge46751e96f-goog
> > > >