From: James Houghton <jthoughton@google.com>
To: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>,
	David Hildenbrand <david@redhat.com>,
	Muchun Song <songmuchun@bytedance.com>,
	David Rientjes <rientjes@google.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Mina Almasry <almasrymina@google.com>,
	"Zach O'Keefe" <zokeefe@google.com>,
	Manish Mishra <manish.mishra@nutanix.com>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Miaohe Lin <linmiaohe@huawei.com>, Yang Shi <shy828301@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range
Date: Thu, 19 Jan 2023 11:42:26 -0800
Message-ID: <CADrL8HVdL_NMdNq2mEemNCfwkYBAWnbqwyjsAYdQ2fF0iz34Hw@mail.gmail.com>
In-Reply-To: <Y8l+f2wNp2gAjvYg@monkey>

On Thu, Jan 19, 2023 at 9:32 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 01/19/23 08:57, James Houghton wrote:
> > > > > > I wonder if the following crazy idea has already been discussed: treat the
> > > > > > whole mapping as a single large logical mapping. One reference and one
> > > > > > mapping, no matter how the individual parts are mapped into the assigned
> > > > > > page table sub-tree.
> > > > > >
> > > > > > Because for hugetlb with MAP_SHARED, we know that the complete assigned
> > > > > > sub-tree of page tables can only map the given hugetlb page, no fragments of
> > > > > > something else. That's very different to THP in private mappings ...
> > > > > >
> > > > > > So as soon as the first piece gets mapped, we increment refcount+mapcount.
> > > > > > Other pieces in the same subtree don't do that.
> > > > > >
> > > > > > Once the last piece is unmapped (or simpler: once the complete subtree of
> > > > > > page tables is gone), we decrement refcount+mapcount. Might require some
> > > > > > brain power to do this tracking, but I wouldn't call it impossible right
> > > > > > from the start.
> > > > > >
> > > > > > Would such a design violate other design aspects that are important?
> > > >
> > > > This is actually how mapcount was treated in HGM RFC v1 (though not
> > > > refcount); it is doable for both [2].
> > >
> > > My apologies for being late to the party :)
> > >
> > > When Peter first brought up the issue with ref/map_count overflows I was
> > > thinking that we should use a scheme like David describes above.  As
> > > James points out, this was the approach taken in the first RFC.
> > >
> > > > One caveat here: if a page is unmapped in small pieces, it is
> > > > difficult to know if the page is legitimately completely unmapped (we
> > > > would have to check all the PTEs in the page table).
> > >
> > > Are we allowing unmapping of small (non-huge page sized) areas with HGM?
> > > We must be if you are concerned with it.  What API would cause this?
> > > I just do not remember this discussion.
> >
> > There was some discussion about allowing MADV_DONTNEED on
> > less-than-hugepage pieces [3] (it actually motivated the switch from
> > UFFD_FEATURE_MINOR_HUGETLBFS_HGM to MADV_SPLIT). It isn't implemented
> > in this series, but it could be implemented in the future.
>
> OK, so we do not actually create HGM mappings until a uffd operation is
> done at a less than huge page size granularity.  MADV_SPLIT just says
> that HGM mappings are 'possible' for this vma.  Hopefully, my understanding
> is correct.

Right, that's the current meaning of MADV_SPLIT for hugetlb.

> I was concerned about things like the page fault path, but in that case
> we have already 'entered HGM mode' via a uffd operation.
>
> Both David and Peter have asked whether eliminating intermediate mapping
> levels would be a simplification.  I trust your response that it would
> not help much in the current design/implementation.  But, it did get me
> thinking about something else.
>
> Perhaps we have discussed this before, and perhaps it does not meet all
> user needs, but one way to possibly simplify this is:
>
> - 'Enable HGM' via MADV_SPLIT.  Must be done at huge page (hstate)
>   granularity.
> - MADV_SPLIT implicitly unmaps everything within the range.
> - MADV_SPLIT says all mappings for this vma will now be done at a base
>   (4K) page size granularity.  vma would be marked some way.
> - I think this eliminates the need for hugetlb_pte's as we KNOW the
>   mapping size.
> - We still use huge pages to back 4K mappings, and we still have to deal
>   with the ref/map_count issues.
> - Code touching hugetlb page tables would KNOW the mapping size up front.
>
> Again, apologies if we talked about and previously dismissed this type
> of approach.

I think Peter was the one who originally suggested an approach like
this, and it meets my needs. However, I still think the way that
things are currently implemented is the right way to go.

Assuming we want decent performance, we can't get away with the same
strategy of just passing pte_t*s everywhere. The PTL for a 4K PTE
should be based on the PMD above the PTE, so we need to either pass
around the PMD above our PTE, or we need to pass around the PTL. This
is something that hugetlb_pte does for us, so, in some sense, even
going with this simpler approach, we still need a hugetlb_pte-like
construct.
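
Concretely, I mean something along these lines (purely illustrative --
the type and field names here are made up for the example, not the
actual definitions in this series):

struct hugetlb_pte_sketch {
	pte_t *ptep;		/* the entry the walker is operating on */
	unsigned int shift;	/* mapping size covered by *ptep */
	spinlock_t *ptl;	/* split PTL protecting *ptep */
};

static inline spinlock_t *
hugetlb_pte_sketch_lockptr(struct hugetlb_pte_sketch *hpte)
{
	/*
	 * For a 4K PTE the lock lives with the PMD above it, so a
	 * walker holding only a bare pte_t * has no way to find it;
	 * the descriptor has to carry the lock (or that PMD) along.
	 */
	return hpte->ptl;
}

Even the simpler always-4K design would need to plumb something like
this (or the pmd_t * itself) through the walkers instead of a bare
pte_t *.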

Although most of the other complexity that HGM introduces would have
to be introduced either way (like having to deal with putting
compound_head()/page_folio() in more places and doing some
per-architecture updates), there are some complexities that the
simpler approach avoids:

- We avoid problems related to compound PTEs (the problem being: two
threads racing to populate a contiguous and a non-contiguous PTE that
occupy the same space could lead to user-detectable incorrect
behavior. This isn't hard to fix; it will be fixed when I send up the
arm64 patches.)

- We don't need to check if PTEs get split from under us in PT walks.
(In a lot of cases, the appropriate action is just to treat the PTE as
if it were pte_none(); see the sketch after this list.) In the same
vein, we don't need hugetlb_pte_present_leaf() at all, because the
PTEs we find will always be leaves.

- We don't have to deal with sorting hstates or implementing
for_each_hgm_shift()/hugetlb_alloc_largest_pte().
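
To illustrate the second point: with the full HGM walk, a walker would
need roughly the following before touching an entry (illustrative
only; the helper signatures here are sketched for the example, not
copied from the series):

	spinlock_t *ptl = hugetlb_pte_sketch_lockptr(hpte);

	spin_lock(ptl);
	if (!hugetlb_pte_present_leaf(hpte, ptep_get(hpte->ptep))) {
		/*
		 * The entry got split (or was never a leaf) under us;
		 * most walkers can just treat this like pte_none() and
		 * skip the entry.
		 */
		spin_unlock(ptl);
		return 0;
	}

With the always-4K approach, every populated entry a walker finds is
already a leaf, so this check (and hugetlb_pte_present_leaf() itself)
goes away.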

None of these complexities are particularly major in my opinion.

This might seem kind of contrived, but let's say you have a VM with 1T
of memory, and you find 100 memory errors all in different 1G pages
over the life of this VM (years, potentially). That's ~100G, about 10%
of the VM's memory, that has to be remapped at a smaller granularity.
Having 10% of your memory be 4K-mapped is definitely worse than having
10% be 2M-mapped (lost performance and increased memory overhead).
There might be other
cases in the future where being able to have intermediate mapping
sizes could be helpful.

> > > When I was thinking about this I was a bit concerned about having enough
> > > information to know exactly when to inc or dec counts.  I was actually
> > > worried about knowing when to do the increment.  I don't recall how it was
> > > done in the first RFC, but from a high level it would need to be done
> > > when the first hstate level PTE is allocated/added to the page table.
> > > Right?  My concern was with all the places where we could 'error out'
> > > after allocating the PTE, but before initializing it.  I was just thinking
> > > that we might need to scan the page table or keep metadata for better
> > > or easier accounting.
> >
> > The only two places where we can *create* a high-granularity page
> > table are: __mcopy_atomic_hugetlb (UFFDIO_CONTINUE) and
> > copy_hugetlb_page_range. RFC v1 did not properly deal with the cases
> > where we error out. To correctly handle these cases, we basically have
> > to do the pagecache lookup before touching the page table.
> >
> > 1. For __mcopy_atomic_hugetlb, we can lookup the page before doing the
> > PT walk/alloc. If PT walk tells us to inc the page ref/mapcount, we do
> > so immediately. We can easily pass the page into
> > hugetlb_mcopy_atomic_pte() (via 'pagep').
> >
> > 2. For copy_hugetlb_page_range() for VM_MAYSHARE, we can also do the
> > lookup before we do the page table walk. I'm not sure how to support
> > non-shared HGM mappings with this scheme (in this series, we also
> > don't support non-shared; we return -EINVAL).
> > NB: The only case where high-granularity mappings for !VM_MAYSHARE
> > VMAs would come up is as a result of hwpoison.
> >
> > So we can avoid keeping additional metadata for what this series is
> > trying to accomplish, but if the above isn't acceptable, then I/we can
> > try to come up with a scheme that would be acceptable.
>
> Ok, I was thinking we had to deal with other code paths such as page
> fault.  But, now I understand that is not the case with this design.
>
> > There is also the possibility that the scheme implemented in this
> > version of the series is acceptable (i.e., the page_mapcount() API
> > difference, which results in slightly modified page migration behavior
> > and smaps output, is ok... assuming we have the refcount overflow
> > check).
> >
> > >
> > > I think Peter mentioned it elsewhere, we should come up with a workable
> > > scheme for HGM ref/map counting.  This can be done somewhat independently.
> >
> > FWIW, what makes the most sense to me right now is to implement the
> > THP-like scheme and mark HGM as mutually exclusive with the vmemmap
> > optimization. We can later come up with a scheme that lets us retain
> > compatibility. (Is that what you mean by "this can be done somewhat
> > independently", Mike?)
>
> Sort of, I was only saying that getting the ref/map counting right seems
> like a task that can be independently worked.  Using the THP-like scheme
> is good.

Ok! So if you're ok with the intermediate mapping sizes, it sounds
like I should go ahead and implement the THP-like scheme.

- James

