linux-kernel.vger.kernel.org archive mirror
From: Peter Xu <peterx@redhat.com>
To: David Hildenbrand <david@redhat.com>
Cc: James Houghton <jthoughton@google.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Muchun Song <songmuchun@bytedance.com>,
	David Rientjes <rientjes@google.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Mina Almasry <almasrymina@google.com>,
	Zach O'Keefe <zokeefe@google.com>,
	Manish Mishra <manish.mishra@nutanix.com>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Miaohe Lin <linmiaohe@huawei.com>, Yang Shi <shy828301@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range
Date: Wed, 18 Jan 2023 10:35:00 -0500	[thread overview]
Message-ID: <Y8gRpEonhXgqfb41@x1n> (raw)
In-Reply-To: <6548b3b3-30c9-8f64-7d28-8a434e0a0b80@redhat.com>

On Wed, Jan 18, 2023 at 10:43:47AM +0100, David Hildenbrand wrote:
> On 18.01.23 00:11, James Houghton wrote:
> > On Mon, Jan 16, 2023 at 2:17 AM David Hildenbrand <david@redhat.com> wrote:
> > > 
> > > On 12.01.23 22:33, Peter Xu wrote:
> > > > On Thu, Jan 12, 2023 at 04:17:52PM -0500, James Houghton wrote:
> > > > > I'll look into it, but doing it this way will use _mapcount, so we
> > > > > won't be able to use the vmemmap optimization. I think even if we do
> > > > > use Hugh's approach, refcount is still being kept on the head page, so
> > > > > there's still an overflow risk there (but maybe I am
> > > > > misunderstanding).
> > > > 
> > > > Could you remind me what's the issue if using refcount on the small pages
> > > > rather than the head (assuming vmemmap still can be disabled)?
> > > 
> > > The THP-way of doing things is refcounting on the head page. All folios
> > > use a single refcount on the head.
> > > 
> > > There has to be a pretty good reason to do it differently.
> > 
> > Peter and I have discussed this a lot offline. There are two main problems here:
> > 
> > 1. Refcount overflow
> > 
> > Refcount is always kept on the head page (before and after this
> > series). IIUC, this means that if THPs could be 1G in size, they too
> > would be susceptible to the same potential overflow. How easy is the
> > overflow? [1]
> 
> Right. You'd need 8k VMAs. With 2 MiB THP you'd need 4096k VMAs. So ~64
> processes with 64k VMAs. Not impossible to achieve if one really wants to
> break the system ...
> 
> Side note: a long long time ago, we used to have sub-page refcounts for THP.
> IIRC, that was even before we had sub-page mapcounts and was used to make
> COW decisions.
> 
> > 
> > To deal with this, the best solution we've been able to come up with
> > is to check if refcount is > INT_MAX/2 (similar to try_get_page()),
> > and if it is, stop the operation (UFFDIO_CONTINUE or a page fault)
> > from proceeding. In the UFFDIO_CONTINUE case, return ENOMEM. In the
> > page fault case, return VM_FAULT_SIGBUS (not VM_FAULT_OOM; we don't
> > want to kill a random process).
> 
> You'd have to also make sure that fork() won't do the same. At least with
> uffd-wp, Peter also added page table copying during fork() for MAP_SHARED
> mappings, which would have to be handled.

If we want such a check to make a real difference, IIUC we may want to
consider having a similar check in:

        page_ref_add
        page_ref_inc
        page_ref_inc_return
        page_ref_add_unless

But unfortunately almost none of the callers of these functions
(especially the first two) have a return value to check at all.
Considering how unlikely the overflow is so far, maybe that can also be
done later (and I think checking for a negative value, as try_get_page()
does, would suffice too).
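
To make that concrete, here is a rough sketch of the kind of checked
helper I have in mind -- purely illustrative, the name is made up, and it
only mirrors the saturation-style check try_get_page() already does:

        /*
         * Illustrative only: refuse to take more references once the
         * refcount has wrapped negative, so the caller can fail the
         * operation (ENOMEM / SIGBUS) instead of overflowing further.
         */
        static inline bool page_ref_add_checked(struct page *page, int nr)
        {
                if (unlikely(page_ref_count(page) < 0))
                        return false;
                page_ref_add(page, nr);
                return true;
        }

The real work would then be teaching every call site to handle the
"false" case, which is exactly what is missing today.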

> 
> Of course, one can just disallow fork() with any HGM right from the start
> and keep it all simpler to not open up a can of worms there.
> 
> Is it reasonable, to have more than one (or a handful) of VMAs mapping a
> huge page via a HGM? Restricting it to a single one, would make handling
> much easier.
> 
> If there is ever demand for more HGM mappings, that whole problem (and
> complexity) could be dealt with later. ... but I assume it will already be a
> requirement for VMs (e.g., under QEMU) that share memory with other
> processes (virtiofsd and friends?)

Yes, any form of multi-process QEMU will need that to support HGM
postcopy.

> 
> 
> > 
> > (So David, I think this answers your question. Refcount should be
> > handled just like THPs.)
> > 
> > 2. page_mapcount() API differences
> > 
> > In this series, page_mapcount() returns the total number of page table
> > references for the compound page. For example, if you have a
> > PTE-mapped 2M page (with no other mappings), page_mapcount() for each
> > 4K page will be 512. This is not the same as a THP: page_mapcount()
> > would return 1 for each page. Because of the difference in
> > page_mapcount(), we have 4 problems:
> 
> IMHO, it would actually be great to just be able to remove the sub-page
> mapcounts for THP and make it all simpler.
> 
> Right now, the sub-page mapcount is mostly required for making COW
> decisions, but only for accounting purposes IIRC (NR_ANON_THPS,
> NR_SHMEM_PMDMAPPED, NR_FILE_PMDMAPPED) and mlock handling IIRC. See
> page_remove_rmap().
> 
> If we can avoid that complexity right from the start for hugetlb, great, ..
> 
> > 
> > i. Smaps uses page_mapcount() >= 2 to determine if hugetlb memory is
> > "private_hugetlb" or "shared_hugetlb".
> > ii. Migration with MPOL_MF_MOVE will check page_mapcount() to see if
> > the hugepage is shared or not. Pages that would otherwise be migrated
> > now require MPOL_MF_MOVE_ALL to be migrated.
> > [Really both of the above are checking how many VMAs are mapping our hugepage.]
> > iii. CoW. This isn't a problem right now because CoW is only possible
> > with MAP_PRIVATE VMAs and HGM can only be enabled for MAP_SHARED VMAs.
> > iv. The hwpoison handling code will check if it successfully unmapped
> > the poisoned page. This isn't a problem right now, as hwpoison will
> > unmap all the mappings for the hugepage, not just the 4K where the
> > poison was found.
> > 
> > Doing it this way allows HGM to remain compatible with the hugetlb
> > vmemmap optimization. None of the above problems strike me as
> > particularly major, but it's unclear to me how important it is to have
> > page_mapcount() have a consistent meaning for hugetlb vs non-hugetlb.
> 
> See below, maybe we should tackle HGM from a different direction.
> 
> > 
> > The other way page_mapcount() (let's say the "THP-like way") could be
> > done is like this: increment compound mapcount if we're mapping a
> > hugetlb page normally (e.g., 1G page with a PUD). If we're mapping at
> > high-granularity, increment the mapcount for each 4K page that is
> > getting mapped (e.g., PMD within a 1G page: increment the mapcount for
> > the 512 pages that are now mapped). This yields the same
> > page_mapcount() API we had before, but we lose the hugetlb vmemmap
> > optimization.
> > 
> > We could introduce an API like hugetlb_vma_mapcount() that would, for
> > hugetlb, give us the number of VMAs that map a hugepage, but I don't
> > think people would like this.
> > 
> > I'm curious what others think (Mike, Matthew?). I'm guessing the
> > THP-like way is probably what most people would want, though it would
> > be a real shame to lose the vmemmap optimization.
> 
> Heh, not me :) Having a single mapcount is certainly much cleaner. ... and
> if we're dealing with refcount overflows already, mapcount overflows are not
> an issue.
> 
> 
> I wonder if the following crazy idea has already been discussed: treat the
> whole mapping as a single large logical mapping. One reference and one
> mapping, no matter how the individual parts are mapped into the assigned
> page table sub-tree.
> 
> Because for hugetlb with MAP_SHARED, we know that the complete assigned
> sub-tree of page tables can only map the given hugetlb page, no fragments of
> something else. That's very different to THP in private mappings ...
> 
> So as soon as the first piece gets mapped, we increment refcount+mapcount.
> Other pieces in the same subtree don't do that.
> 
> Once the last piece is unmapped (or simpler: once the complete subtree of
> page tables is gone), we decrement refcount+mapcount. Might require some
> brain power to do this tracking, but I wouldn't call it impossible right
> from the start.
> 
> Would such a design violate other design aspects that are important?

The question is how to maintain the above information.

It needs to be per-mapping (so one page mapped multiple times can be
accounted separately) and per-page (so one mapping/VMA can contain
multiple pages).  So far I think that's exactly what the pgtable is.  If
we can squeeze the information into the pgtable it'll work out, but
that's definitely not trivial.  Or we can maintain separate allocations
for such information, but that can be extra overhead too.
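
Just to show where those separate allocations and the tricky tracking
would live, a very rough sketch (all names made up, not a proposal) of
per-(hugepage, VMA-subtree) bookkeeping for the "one logical mapping"
idea:

        struct hgm_logical_map {
                unsigned long nr_pieces;        /* pieces of the hugepage
                                                 * currently mapped; protected
                                                 * by the pgtable lock */
        };

        /* called when mapping one piece of @hpage into the VMA's subtree */
        static void hgm_map_piece(struct hgm_logical_map *lm, struct page *hpage)
        {
                if (lm->nr_pieces++ == 0) {
                        get_page(hpage);                 /* single reference  */
                        page_dup_file_rmap(hpage, true); /* single compound
                                                          * mapcount          */
                }
        }

        /* called when unmapping one piece */
        static void hgm_unmap_piece(struct hgm_logical_map *lm, struct page *hpage,
                                    struct vm_area_struct *vma)
        {
                if (--lm->nr_pieces == 0) {
                        page_remove_rmap(hpage, vma, true);
                        put_page(hpage);
                }
        }

Even in this simplest form it's an allocation with its own lifetime per
(hugepage, VMA) pair, plus keeping the counter correct across every
partial map/unmap/split path, which is the extra overhead I mean.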

So far I'd still consider reusing the THP mapcount scheme, which would
mostly be what James described above.  The only difference is that I'm
not sure whether we should allow mapping e.g. 2M ranges within 1G pages.
The THP mapcount has no intermediate layer to maintain mapcount
information at sizes like 2M, so to me it's easier if we start by only
allowing mappings at either the hpage size or PAGE_SIZE, with no
intermediate sizes allowed.
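
Roughly, I mean something like this in whatever path installs an HGM
mapping (just a sketch, not pointing at a real function):

        /* Illustrative only: reject intermediate mapping sizes. */
        if (shift != huge_page_shift(h) && shift != PAGE_SHIFT)
                return -EINVAL;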

Allowing intermediate-size mappings is at least error prone to me.  One
example: if some pgtable walker finds a 2M entry, it may happily fetch
the PFN out of it, assuming it points at a compound page that satisfies
PageHead(pfn)==true, but that starts to break here, because in this case
the 2M PFN is only a small-page PFN inside the 1G huge page.
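
Roughly the kind of (hypothetical) assumption I mean a walker could be
making today:

        if (pmd_leaf(pmd)) {
                struct page *page = pmd_page(pmd);

                /* today this is typically the head of a 2M compound page */
                VM_BUG_ON_PAGE(!PageHead(page), page);
                /*
                 * ... but with a 2M mapping inside a 1G hugetlb page,
                 * pmd_page() gives back a tail page of the 1G folio and
                 * the assumption silently breaks.
                 */
        }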

To me, intermediate-sized mappings are good to have but not required to
resolve the HGM problems, at least so far.  That said, I'm fine with
looking at what it would look like if James would like to keep pursuing
that direction.

Thanks,

-- 
Peter Xu



Thread overview: 126+ messages
2023-01-05 10:17 [PATCH 00/46] Based on latest mm-unstable (85b44c25cd1e) James Houghton
2023-01-05 10:17 ` [PATCH 01/46] hugetlb: don't set PageUptodate for UFFDIO_CONTINUE James Houghton
2023-01-05 10:18 ` [PATCH 02/46] hugetlb: remove mk_huge_pte; it is unused James Houghton
2023-01-05 10:18 ` [PATCH 03/46] hugetlb: remove redundant pte_mkhuge in migration path James Houghton
2023-01-05 10:18 ` [PATCH 04/46] hugetlb: only adjust address ranges when VMAs want PMD sharing James Houghton
2023-01-05 10:18 ` [PATCH 05/46] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING James Houghton
2023-01-05 10:18 ` [PATCH 06/46] mm: add VM_HUGETLB_HGM VMA flag James Houghton
2023-01-05 10:18 ` [PATCH 07/46] hugetlb: rename __vma_shareable_flags_pmd to __vma_has_hugetlb_vma_lock James Houghton
2023-01-05 10:18 ` [PATCH 08/46] hugetlb: add HugeTLB HGM enablement helpers James Houghton
2023-01-05 10:18 ` [PATCH 09/46] mm: add MADV_SPLIT to enable HugeTLB HGM James Houghton
2023-01-05 15:05   ` kernel test robot
2023-01-05 15:29   ` David Hildenbrand
2023-01-10  0:01     ` Zach O'Keefe
2023-01-05 10:18 ` [PATCH 10/46] hugetlb: make huge_pte_lockptr take an explicit shift argument James Houghton
2023-01-05 10:18 ` [PATCH 11/46] hugetlb: add hugetlb_pte to track HugeTLB page table entries James Houghton
2023-01-05 16:06   ` kernel test robot
2023-01-05 10:18 ` [PATCH 12/46] hugetlb: add hugetlb_alloc_pmd and hugetlb_alloc_pte James Houghton
2023-01-05 10:18 ` [PATCH 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step James Houghton
2023-01-05 16:57   ` kernel test robot
2023-01-05 18:58   ` kernel test robot
2023-01-11 21:51   ` Peter Xu
2023-01-12 13:38     ` James Houghton
2023-01-05 10:18 ` [PATCH 14/46] hugetlb: add make_huge_pte_with_shift James Houghton
2023-01-05 10:18 ` [PATCH 15/46] hugetlb: make default arch_make_huge_pte understand small mappings James Houghton
2023-01-05 10:18 ` [PATCH 16/46] hugetlbfs: do a full walk to check if vma maps a page James Houghton
2023-01-05 10:18 ` [PATCH 17/46] hugetlb: make unmapping compatible with high-granularity mappings James Houghton
2023-01-05 10:18 ` [PATCH 18/46] hugetlb: add HGM support for hugetlb_change_protection James Houghton
2023-01-05 10:18 ` [PATCH 19/46] hugetlb: add HGM support for follow_hugetlb_page James Houghton
2023-01-05 22:26   ` Peter Xu
2023-01-12 18:02   ` Peter Xu
2023-01-12 18:06     ` James Houghton
2023-01-05 10:18 ` [PATCH 20/46] hugetlb: add HGM support for hugetlb_follow_page_mask James Houghton
2023-01-05 10:18 ` [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range James Houghton
2023-01-05 22:42   ` Peter Xu
2023-01-11 22:58   ` Peter Xu
2023-01-12 14:06     ` James Houghton
2023-01-12 15:29       ` Peter Xu
2023-01-12 16:45         ` James Houghton
2023-01-12 16:55           ` James Houghton
2023-01-12 20:27           ` Peter Xu
2023-01-12 21:17             ` James Houghton
2023-01-12 21:33               ` Peter Xu
2023-01-16 10:17                 ` David Hildenbrand
2023-01-17 23:11                   ` James Houghton
2023-01-18  9:43                     ` David Hildenbrand
2023-01-18 15:35                       ` Peter Xu [this message]
2023-01-18 16:39                         ` James Houghton
2023-01-18 18:21                           ` David Hildenbrand
2023-01-18 19:28                           ` Mike Kravetz
2023-01-19 16:57                             ` James Houghton
2023-01-19 17:31                               ` Mike Kravetz
2023-01-19 19:42                                 ` James Houghton
2023-01-19 20:53                                   ` Peter Xu
2023-01-19 22:45                                     ` James Houghton
2023-01-19 22:00                                   ` Mike Kravetz
2023-01-19 22:23                                     ` Peter Xu
2023-01-19 22:35                                       ` James Houghton
2023-01-19 23:07                                         ` Peter Xu
2023-01-19 23:26                                           ` James Houghton
2023-01-20 17:23                                             ` Peter Xu
2023-01-19 23:44                                           ` Mike Kravetz
2023-01-23 15:19                                             ` Peter Xu
2023-01-23 17:49                                               ` Mike Kravetz
2023-01-26 16:58                                   ` James Houghton
2023-01-26 20:30                                     ` Peter Xu
2023-01-27 21:02                                       ` James Houghton
2023-01-30 17:29                                         ` Peter Xu
2023-01-30 18:38                                           ` James Houghton
2023-01-30 21:14                                             ` Peter Xu
2023-02-01  0:24                                               ` James Houghton
2023-02-01  1:24                                                 ` Peter Xu
2023-02-01 15:45                                                   ` James Houghton
2023-02-01 15:56                                                     ` David Hildenbrand
2023-02-01 17:58                                                       ` James Houghton
2023-02-01 18:01                                                         ` David Hildenbrand
2023-02-01 16:22                                                     ` Peter Xu
2023-02-01 21:32                                                       ` James Houghton
2023-02-01 21:51                                                         ` Peter Xu
2023-02-02  0:24                                                           ` James Houghton
2023-02-07 16:30                                                             ` James Houghton
2023-02-07 22:46                                                               ` James Houghton
2023-02-07 23:13                                                                 ` Peter Xu
2023-02-08  0:26                                                                   ` James Houghton
2023-02-08 16:16                                                                     ` Peter Xu
2023-02-09 16:43                                                                       ` James Houghton
2023-02-09 19:10                                                                         ` Peter Xu
2023-02-09 19:49                                                                           ` James Houghton
2023-02-09 20:22                                                                             ` Peter Xu
2023-01-18 17:08                         ` David Hildenbrand
2023-01-05 10:18 ` [PATCH 22/46] mm: rmap: provide pte_order in page_vma_mapped_walk James Houghton
2023-01-05 10:18 ` [PATCH 23/46] mm: rmap: make page_vma_mapped_walk callers use pte_order James Houghton
2023-01-05 10:18 ` [PATCH 24/46] rmap: update hugetlb lock comment for HGM James Houghton
2023-01-05 10:18 ` [PATCH 25/46] hugetlb: update page_vma_mapped to do high-granularity walks James Houghton
2023-01-05 10:18 ` [PATCH 26/46] hugetlb: add HGM support for copy_hugetlb_page_range James Houghton
2023-01-05 10:18 ` [PATCH 27/46] hugetlb: add HGM support for move_hugetlb_page_tables James Houghton
2023-01-05 10:18 ` [PATCH 28/46] hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page James Houghton
2023-01-05 10:18 ` [PATCH 29/46] rmap: in try_to_{migrate,unmap}_one, check head page for page flags James Houghton
2023-01-05 10:18 ` [PATCH 30/46] hugetlb: add high-granularity migration support James Houghton
2023-01-05 10:18 ` [PATCH 31/46] hugetlb: sort hstates in hugetlb_init_hstates James Houghton
2023-01-05 10:18 ` [PATCH 32/46] hugetlb: add for_each_hgm_shift James Houghton
2023-01-05 10:18 ` [PATCH 33/46] hugetlb: userfaultfd: add support for high-granularity UFFDIO_CONTINUE James Houghton
2023-01-05 10:18 ` [PATCH 34/46] hugetlb: userfaultfd: when using MADV_SPLIT, round addresses to PAGE_SIZE James Houghton
2023-01-06 15:13   ` Peter Xu
2023-01-10 14:50     ` James Houghton
2023-01-05 10:18 ` [PATCH 35/46] hugetlb: add MADV_COLLAPSE for hugetlb James Houghton
2023-01-10 20:04   ` James Houghton
2023-01-17 21:06   ` Peter Xu
2023-01-17 21:38     ` James Houghton
2023-01-17 21:54       ` Peter Xu
2023-01-19 22:37   ` Peter Xu
2023-01-19 23:06     ` James Houghton
2023-01-05 10:18 ` [PATCH 36/46] hugetlb: remove huge_pte_lock and huge_pte_lockptr James Houghton
2023-01-05 10:18 ` [PATCH 37/46] hugetlb: replace make_huge_pte with make_huge_pte_with_shift James Houghton
2023-01-05 10:18 ` [PATCH 38/46] mm: smaps: add stats for HugeTLB mapping size James Houghton
2023-01-05 10:18 ` [PATCH 39/46] hugetlb: x86: enable high-granularity mapping James Houghton
2023-01-12 20:07   ` James Houghton
2023-01-05 10:18 ` [PATCH 40/46] docs: hugetlb: update hugetlb and userfaultfd admin-guides with HGM info James Houghton
2023-01-05 10:18 ` [PATCH 41/46] docs: proc: include information about HugeTLB HGM James Houghton
2023-01-05 10:18 ` [PATCH 42/46] selftests/vm: add HugeTLB HGM to userfaultfd selftest James Houghton
2023-01-05 10:18 ` [PATCH 43/46] selftests/kvm: add HugeTLB HGM to KVM demand paging selftest James Houghton
2023-01-05 10:18 ` [PATCH 44/46] selftests/vm: add anon and shared hugetlb to migration test James Houghton
2023-01-05 10:18 ` [PATCH 45/46] selftests/vm: add hugetlb HGM test to migration selftest James Houghton
2023-01-05 10:18 ` [PATCH 46/46] selftests/vm: add HGM UFFDIO_CONTINUE and hwpoison tests James Houghton
2023-01-05 10:47 ` [PATCH 00/46] Based on latest mm-unstable (85b44c25cd1e) David Hildenbrand
2023-01-09 19:53   ` Mike Kravetz
2023-01-10 15:47     ` David Hildenbrand
