From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <20220624173656.2033256-1-jthoughton@google.com>
 <CAHS8izPnJd5EQjUi9cOk=03u3X1rk0PexTQZi+bEE4VMtFfksQ@mail.gmail.com>
 <CADrL8HWse7-=1Z=1_d8szwdkhFH1t8L4pOBO7E7yxgCYF-gc8w@mail.gmail.com>
 <CAHS8izNSsEW88Q=ozcC2rbnmvcX3zOL-qkFTPgn=M6S1R5t=Yw@mail.gmail.com> <YrtAyUSbtCLwCFxC@work-vm>
In-Reply-To: <YrtAyUSbtCLwCFxC@work-vm>
From: James Houghton <jthoughton@google.com>
Date: Wed, 29 Jun 2022 11:31:10 -0700
Message-ID: <CADrL8HXeLTiTP0cvq7DY8R0JkQT6gdz=gq06jarhBqyPHDfmzw@mail.gmail.com>
Subject: Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mina Almasry <almasrymina@google.com>, Mike Kravetz <mike.kravetz@oracle.com>, 
	Muchun Song <songmuchun@bytedance.com>, Peter Xu <peterx@redhat.com>, 
	David Hildenbrand <david@redhat.com>, David Rientjes <rientjes@google.com>, 
	Axel Rasmussen <axelrasmussen@google.com>, Jue Wang <juew@google.com>, 
	Manish Mishra <manish.mishra@nutanix.com>, linux-mm@kvack.org, 
	linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Tue, Jun 28, 2022 at 10:56 AM Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
>
> * Mina Almasry (almasrymina@google.com) wrote:
> > On Mon, Jun 27, 2022 at 9:27 AM James Houghton <jthoughton@google.com> wrote:
> > >
> > > On Fri, Jun 24, 2022 at 11:41 AM Mina Almasry <almasrymina@google.com> wrote:
> > > >
> > > > On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
> > > > >
> > > > > [trimmed...]
> > > > > ---- Userspace API ----
> > > > >
> > > > > This patch series introduces a single way to take advantage of
> > > > > high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
> > > > > userspace to resolve MINOR page faults on shared VMAs.
> > > > >
> > > > > To collapse a HugeTLB address range that has been mapped with several
> > > > > UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
> > > > > userspace to know when all pages (that they care about) have been fetched.
> > > > >
> > > >
> > > > Thanks James! Cover letter looks good. A few questions:
> > > >
> > > > Why not have the kernel collapse the hugepage once all the 4K pages
> > > > have been fetched automatically? It would remove the need for a new
> > > > userspace API, and AFAICT there aren't really any cases where it is
> > > > beneficial to have a hugepage sharded into 4K mappings when those
> > > > mappings can be collapsed.
> > >
> > > The reason that we don't automatically collapse mappings is because it
> > > would take additional complexity, and it is less flexible. Consider
> > > the case of 1G pages on x86: currently, userspace can collapse the
> > > whole page when it's all ready, but they can also choose to collapse a
> > > 2M piece of it. On architectures with more supported hugepage sizes
> > > (e.g., arm64), userspace has even more possibilities for when to
> > > collapse. This likely further complicates a potential
> > > automatic-collapse solution. Userspace may also want to collapse the
> > > mapping for an entire hugepage without completely mapping the hugepage
> > > first (this would also be possible by issuing UFFDIO_CONTINUE on all
> > > the holes, though).
> > >
> >
> > To be honest, I don't think I'm a fan of this. I don't think this
> > saves complexity, but rather pushes it to the userspace. I.e. the
> > userspace now must track which regions are faulted in and which are
> > not to call MADV_COLLAPSE at the right time. Also, if the userspace
> > gets it wrong it may accidentally not call MADV_COLLAPSE (and not get
> > any hugepages) or call MADV_COLLAPSE too early and have to deal with a
> > storm of maybe hundreds of minor faults at once which may take too
> > long to resolve and may impact guest stability, yes?
>
> I think it depends on whether the userspace is already holding bitmaps
> and data structures to let it know when the right time to call collapse
> is; if it already has to do all that bookkeeping for its own postcopy
> or whatever process, then getting userspace to call it is easy.
> (I don't know the answer to whether it does have!)

Userspace generally has a lot of information about which pages have
been UFFDIO_CONTINUE'd, but it may not have the information (say,
an atomic count per hugepage) to tell it exactly when to collapse.
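
As a rough illustration of the kind of bookkeeping I mean (this is not
code from the series; the struct, the function names, and the 2M/4K
sizes are all made up for the example), a per-hugepage atomic count in
userspace could look like the sketch below. It assumes userspace never
counts the same 4K piece twice, which is itself extra state that
userspace would have to maintain.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative sizes only: a 2M hugepage filled in 4K pieces. */
#define HPAGE_SIZE         (2UL << 20)
#define SUBPAGE_SIZE       (4UL << 10)
#define SUBPAGES_PER_HPAGE (HPAGE_SIZE / SUBPAGE_SIZE)

/* Hypothetical userspace tracker: one per hugepage being populated. */
struct hpage_tracker {
	atomic_size_t continued; /* 4K pieces already UFFDIO_CONTINUE'd */
};

static void hpage_tracker_init(struct hpage_tracker *t)
{
	atomic_init(&t->continued, 0);
}

/*
 * Record that nr_subpages more pieces have been continued. Returns true
 * for exactly one caller: the one whose pieces complete the hugepage.
 * That caller is the one that would go on to request the collapse.
 * Assumption: no piece is ever counted twice.
 */
static bool hpage_tracker_add(struct hpage_tracker *t, size_t nr_subpages)
{
	size_t prev = atomic_fetch_add(&t->continued, nr_subpages);

	assert(prev + nr_subpages <= SUBPAGES_PER_HPAGE);
	return prev + nr_subpages == SUBPAGES_PER_HPAGE;
}
```

Because the fetch-and-add returns the previous value, concurrent
fault-handling threads can all call hpage_tracker_add() and only one of
them will see the count reach the full hugepage.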

I think it's worth discussing the tmpfs/THP case here, too. Currently,
after userfaultfd post-copy, all of the THPs we have will be
PTE-mapped. To deal with this, we need to use Zach's MADV_COLLAPSE to
collapse the mappings to PMD mappings (we don't want to wait for
khugepaged to happen upon them -- we want good performance ASAP :)).
In fact, IIUC, khugepaged actually won't collapse these *ever* right
now. I suppose we could enlighten tmpfs's UFFDIO_CONTINUE to
automatically collapse too (thus avoiding the need for MADV_COLLAPSE),
but that could be complicated/unwanted (if that is something we might
want, maybe we should have a separate discussion).

So, as it stands today, we intend to use MADV_COLLAPSE explicitly in
the tmpfs case as soon as it is supported, and so it follows that it's
ok to require userspace to do the same thing for HugeTLBFS-backed
memory.

>
> Dave
>
> > For these reasons I think automatic collapsing is something that will
> > eventually be implemented by us or someone else, and at that point
> > MADV_COLLAPSE for hugetlb memory will become obsolete; i.e. this patch
> > is adding a userspace API that will probably need to be maintained in
> > perpetuity but is actually likely to become obsolete "soon".
> > For this reason I had hoped that automatic collapsing would come with
> > V1.

Small, unimportant clarification: the API, as described here, won't be
*completely* meaningless if we end up implementing automatic
collapsing :) It still has the effect of not requiring other
UFFDIO_CONTINUE operations to be done for the collapsed region.

> >
> > I wonder if we can have a very simple first try at automatic
> > collapsing for V1? I.e., can we support collapsing to the hstate size
> > and only that? So 4K pages can only be either collapsed to 2MB or 1G
> > on x86 depending on the hstate size. I think this may be not too
> > difficult to implement: we can have a counter similar to mapcount that
> > tracks how many of the subpages are mapped (subpage_mapcount). Once
> > all the subpages are mapped (the counter reaches a certain value),
> > trigger collapsing similar to hstate size MADV_COLLAPSE.
> >

In my estimation, to implement automatic collapsing for one VMA, we
will need a per-hstate count; when a count reaches its maximum, we
collapse automatically to the next larger size. So if
we finish filling in enough PTEs for a CONT_PTE, we will collapse to a
CONT_PTE. If we finish filling up CONT_PTEs to a PMD, then collapse to
a PMD.

If you are suggesting that we only collapse to the hstate size at the
end, then we lose flexibility.
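
To make the multi-level counting concrete, here is a toy model (not
kernel code and not part of the series; every name is hypothetical).
It tracks a single PMD-sized region and assumes 16 PTEs per CONT_PTE
and 32 CONT_PTEs per PMD, which matches arm64 with 4K base pages but
is only an illustrative choice here. It also assumes each PTE is
filled in at most once.

```c
#include <assert.h>
#include <stddef.h>

#define PTES_PER_CONT_PTE 16 /* assumed: arm64, 4K base pages */
#define CONT_PTES_PER_PMD 32 /* assumed: 2M PMD / 64K CONT_PTE */

enum level { LEVEL_PTE, LEVEL_CONT_PTE, LEVEL_PMD };

/* Hypothetical per-region state: one counter per size below the PMD. */
struct pmd_region {
	unsigned char ptes_mapped[CONT_PTES_PER_PMD]; /* per CONT_PTE group */
	unsigned int cont_ptes_collapsed;             /* groups now CONT_PTE */
};

/*
 * Record that one more PTE has been filled in. Returns the size that
 * this fill allows us to collapse to: LEVEL_PTE means "no collapse".
 */
static enum level region_map_pte(struct pmd_region *r, unsigned int pte_index)
{
	unsigned int group = pte_index / PTES_PER_CONT_PTE;

	assert(pte_index < PTES_PER_CONT_PTE * CONT_PTES_PER_PMD);
	assert(r->ptes_mapped[group] < PTES_PER_CONT_PTE);

	if (++r->ptes_mapped[group] < PTES_PER_CONT_PTE)
		return LEVEL_PTE; /* this group still has holes */

	/* Group is full: it becomes a CONT_PTE and counts at the next level. */
	if (++r->cont_ptes_collapsed < CONT_PTES_PER_PMD)
		return LEVEL_CONT_PTE;

	/* Every CONT_PTE in the region is full: collapse to a PMD. */
	return LEVEL_PMD;
}
```

Even in this toy form there is one counter per intermediate size, and a
1G hstate would add another level on top. That is the additional
complexity I was referring to.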

> > I gather that no one else reviewing this has raised this issue thus
> > far so it might not be a big deal and I will continue to review the
> > RFC, but I had hoped for automatic collapsing myself for the reasons
> > above.

Thanks for the thorough review, Mina. :)

> >
> > > >
> > > > > ---- HugeTLB Changes ----
> > > > >
> > > > > - Mapcount
> > > > > The way mapcount is handled is different from the way that it was handled
> > > > > before. If the PUD for a hugepage is not none, a hugepage's mapcount will
> > > > > be increased. This scheme means that, for hugepages that aren't mapped at
> > > > > high granularity, their mapcounts will remain the same as what they would
> > > > > have been pre-HGM.
> > > > >
> > > >
> > > > Sorry, I didn't quite follow this. It says mapcount is handled
> > > > differently, but the same if the page is not mapped at high
> > > > granularity. Can you elaborate on how the mapcount handling will be
> > > > different when the page is mapped at high granularity?
> > >
> > > I guess I didn't phrase this very well. For the sake of simplicity,
> > > consider 1G pages on x86, typically mapped with leaf-level PUDs.
> > > Previously, there were two possibilities for how a hugepage was
> > > mapped, either it was (1) completely mapped (PUD is present and a
> > > leaf), or (2) it wasn't mapped (PUD is none). Now we have a third
> > > case, where the PUD is not none but also not a leaf (this usually
> > > means that the page is partially mapped). We handle this case as if
> > > the whole page was mapped. That is, if we partially map a hugepage
> > > that was previously unmapped (making the PUD point to PMDs), we
> > > increment its mapcount, and if we completely unmap a partially mapped
> > > hugepage (making the PUD none), we decrement its mapcount. If we
> > > collapse a non-leaf PUD to a leaf PUD, we don't change mapcount.
> > >
> > > It is possible for a PUD to be present and not a leaf (mapcount has
> > > been incremented) but for the page to still be unmapped: if the PMDs
> > > (or PTEs) underneath are all none. This case is atypical, and as of
> > > this RFC (without bestowing MADV_DONTNEED with HGM flexibility), I
> > > think it would be very difficult to get this to happen.
> > >
> >
> > Thank you for the detailed explanation. Please add it to the cover letter.
> >
> > I wonder about the case "PUD present but all the PMDs are none": is that a
> > bug? I don't understand the usefulness of that. Not a comment on this
> > patch but rather a curiosity.
> >
> > > >
> > > > > - Page table walking and manipulation
> > > > > A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> > > > > high-granularity mappings. Eventually, it's possible to merge
> > > > > hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
> > > > >
> > > > > We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> > > > > This is because we generally need to know the "size" of a PTE (previously
> > > > > always just huge_page_size(hstate)).
> > > > >
> > > > > For every page table manipulation function that has a huge version (e.g.
> > > > > huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> > > > > hugetlb_ptep_get).  The correct version is used depending on if a HugeTLB
> > > > > PTE really is "huge".
> > > > >
> > > > > - Synchronization
> > > > > For existing bits of HugeTLB, synchronization is unchanged. For splitting
> > > > > and collapsing HugeTLB PTEs, we require that the i_mmap_rw_sem is held for
> > > > > writing, and for doing high-granularity page table walks, we require it to
> > > > > be held for reading.
> > > > >
> > > > > ---- Limitations & Future Changes ----
> > > > >
> > > > > This patch series only implements high-granularity mapping for VM_SHARED
> > > > > VMAs.  I intend to implement enough HGM to support 4K unmapping for memory
> > > > > failure recovery for both shared and private mappings.
> > > > >
> > > > > The memory failure use case poses its own challenges that can be
> > > > > addressed, but I will do so in a separate RFC.
> > > > >
> > > > > Performance has not been heavily scrutinized with this patch series. There
> > > > > are places where lock contention can significantly reduce performance. This
> > > > > will be addressed later.
> > > > >
> > > > > The patch series, as it stands right now, is compatible with the VMEMMAP
> > > > > page struct optimization[3], as we do not need to modify data contained
> > > > > in the subpage page structs.
> > > > >
> > > > > Other omissions:
> > > > >  - Compatibility with userfaultfd write-protect (will be included in v1).
> > > > >  - Support for mremap() (will be included in v1). This looks a lot like
> > > > >    the support we have for fork().
> > > > >  - Documentation changes (will be included in v1).
> > > > >  - Completely ignores PMD sharing and hugepage migration (will be included
> > > > >    in v1).
> > > > >  - Implementations for architectures that don't use GENERAL_HUGETLB other
> > > > >    than arm64.
> > > > >
> > > > > ---- Patch Breakdown ----
> > > > >
> > > > > Patch 1     - Preliminary changes
> > > > > Patch 2-10  - HugeTLB HGM core changes
> > > > > Patch 11-13 - HugeTLB HGM page table walking functionality
> > > > > Patch 14-19 - HugeTLB HGM compatibility with other bits
> > > > > Patch 20-23 - Userfaultfd and collapse changes
> > > > > Patch 24-26 - arm64 support and selftests
> > > > >
> > > > > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > > > >     name. "High-granularity mapping" is not a great name either. I am open
> > > > >     to better names.
> > > >
> > > > I would drop 1 extra word and do "granular mapping", as in the mapping
> > > > is more granular than what it normally is (2MB/1G, etc).
> > >
> > > Noted. :)
> >
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>