From: Peter Xu <peterx@redhat.com>
To: James Houghton <jthoughton@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>,
	Muchun Song <songmuchun@bytedance.com>,
	David Hildenbrand <david@redhat.com>,
	David Rientjes <rientjes@google.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Mina Almasry <almasrymina@google.com>,
	Zach O'Keefe <zokeefe@google.com>,
	Manish Mishra <manish.mishra@nutanix.com>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Miaohe Lin <linmiaohe@huawei.com>, Yang Shi <shy828301@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range
Date: Thu, 12 Jan 2023 15:27:19 -0500
Message-ID: <Y8BtJzBLTpw5IR+H@x1n>
In-Reply-To: <CADrL8HWFfqCqbpmvv8BSpvvzJ9aEeBEN30bMLuWGancsfMXv2w@mail.gmail.com>

On Thu, Jan 12, 2023 at 11:45:40AM -0500, James Houghton wrote:
> On Thu, Jan 12, 2023 at 10:29 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Jan 12, 2023 at 09:06:48AM -0500, James Houghton wrote:
> > > On Wed, Jan 11, 2023 at 5:58 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > James,
> > > >
> > > > On Thu, Jan 05, 2023 at 10:18:19AM +0000, James Houghton wrote:
> > > > > @@ -751,9 +761,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> > > > >               int mapcount = page_mapcount(page);
> > > > >
> > > > >               if (mapcount >= 2)
> > > > > -                     mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> > > > > +                     mss->shared_hugetlb += hugetlb_pte_size(hpte);
> > > > >               else
> > > > > -                     mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> > > > > +                     mss->private_hugetlb += hugetlb_pte_size(hpte);
> > > > >       }
> > > > >       return 0;
> > > >
> > > > One interesting thing I found with HGM right now is that almost everything
> > > > gets counted as "shared" here.  I think that's because mapcount is always
> > > > accounted to the huge page even when it's mapped in smaller sizes, so
> > > > page_mapcount() on a small page will be huge too, because the head page's
> > > > mapcount is huge.  I'm curious about the reasons behind that mapcount
> > > > decision.
> > > >
> > > > For example, could that risk overflowing head_compound_mapcount?  One 1G
> > > > page with all of its 4K subpages mapped takes 0.25M counts, while the
> > > > limit for an atomic_t is 2G.  It looks possible.
> > >
> > > The original mapcount approach was "if the hstate-level PTE is
> > > present, increment the compound_mapcount by 1" (basically "if any part
> > > of the hugepage is mapped, increment the compound_mapcount by 1"), but
> > > this was painful to implement,
> >
> > Any more info here on why it was painful?  What is the major blocker?
> 
> The original approach was implemented in RFC v1, but the
> implementation was broken: the way refcount was handled was wrong; it
> was incremented once for each new page table mapping. (How?
> find_lock_page(), called once per hugetlb_no_page/UFFDIO_CONTINUE,
> would increment the refcount and we wouldn't drop it, and in
> __unmap_hugepage_range(), the mmu_gather bits would decrement the
> refcount once per mapping.)

I'm not sure I fully get the point here, but perhaps it's mostly about how
the hugetlb page cache is managed (indexed in hstate size, not PAGE_SIZE)?

static inline struct page *folio_file_page(struct folio *folio, pgoff_t index)
{
	/* HugeTLBfs indexes the page cache in units of hpage_size */
	if (folio_test_hugetlb(folio))
		return &folio->page;
	return folio_page(folio, index & (folio_nr_pages(folio) - 1));
}

I haven't thought that through either.  Would it be possible to switch the
page cache layout to PAGE_SIZE granularity too when HGM is enabled?  E.g.,
a simple scheme would be to fail MADV_SPLIT if the hugetlb page cache
already contains any page -- something like the sketch below.

If we keep using the same page cache scheme (hpage_size-stepped indexes),
find_lock_page() will also easily contend on the head page lock, so we may
not be able to handle concurrent page faults on small mappings of the same
page as efficiently as THP.

> 
> At the time, I figured that handling mapcount AND refcount correctly in
> the original approach would be quite complex, so I switched to the new
> one. Going back to the original approach would mean:
> 
> 1. In places that already change the mapcount, check that we're
> installing the hstate-level PTE, not a high-granularity PTE. Adjust
> mapcount AND refcount appropriately.
> 2. In the HGM walking bits, report to the caller whether we made the
> hstate-level PTE present. (hugetlb_[pmd,pte]_alloc is the source of
> truth.) We need to keep track of this until we figure out which page
> we're allocating PTEs for, then change mapcount/refcount appropriately.
> 3. In the unmapping bits, change the mmu_gather/tlb bits to drop the
> refcount only once per hugepage. (This is probably the hardest of these
> three things to get right.)
> 
> >
> > > so I changed it to what it is now (each new PT mapping increments the
> > > compound_mapcount by 1). I think you're right, there is absolutely an
> > > overflow risk. :( I'm not sure what the best solution is. I could just go
> > > back to the old approach.
> >
> > No rush on that; let's discuss it thoroughly before doing anything.  We
> > have more context than when it was discussed initially in the calls, so
> > maybe it's a good time to revisit.
> >
> > >
> > > Regarding when things are accounted in private_hugetlb vs.
> > > shared_hugetlb, HGM shouldn't change that, because it only applies to
> > > shared mappings (right now anyway). It seems like "private_hugetlb"
> > > can include cases where the page is shared but has only one mapping,
> > > in which case HGM will change it from "private" to "shared".
> >
> > The two fields are not defined against VM_SHARED, it seems.  At least
> > not in the current code base.
> >
> > Let me quote the code again just to be clear:
> >
> >                 int mapcount = page_mapcount(page);    <------------- [1]
> >
> >                 if (mapcount >= 2)
> >                         mss->shared_hugetlb += hugetlb_pte_size(hpte);
> >                 else
> >                         mss->private_hugetlb += hugetlb_pte_size(hpte);
> >
> >                 smaps_hugetlb_hgm_account(mss, hpte);
> >
> > So that information (for some reason) only depends on the mapcount.  If
> > we have one 1G page with only two 4K pieces mapped, with the existing
> > logic we should see 8K of private_hugetlb, while in fact I think it will
> > be 8K of shared_hugetlb, because page_mapcount() takes both 4K mappings
> > into account (they all go back to the head).
> >
> > I have no idea whether violating that will be a problem or not; I
> > suppose it at least needs justification if it will be violated.
> > Hopefully it can just be kept as-is.
> 
> Agreed that this is a problem. I'm not sure what should be done here.
> It seems like the current upstream implementation is incorrect (surely
> MAP_SHARED with only one mapping should still be shared_hugetlb, not
> private_hugetlb); the check should really be `if (vma->vm_flags &
> VM_MAYSHARE)` instead of `mapcount >= 2`. If that change can be taken,
> we don't have a problem here.

Naoya added the relevant code.  Not sure whether he'll chime in.

commit 25ee01a2fca02dfb5a3ce316e77910c468108199
Author: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Date:   Thu Nov 5 18:47:11 2015 -0800

    mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps

In any case, it'll still be a slight ABI change, so it's worth considering
the effects on existing users.
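
For reference, James's suggested check would look something like this
(sketch only -- keying off the mapping type rather than the mapcount):

	if (vma->vm_flags & VM_MAYSHARE)
		mss->shared_hugetlb += hugetlb_pte_size(hpte);
	else
		mss->private_hugetlb += hugetlb_pte_size(hpte);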

> 
> >
> > >
> > > >
> > > > Btw, are the small page* pointers still needed in the latest HGM design?
> > > > Is there code taking care of disabling the hugetlb vmemmap optimization
> > > > for HGM?  Or maybe it's not needed anymore for the current design?
> > >
> > > The hugetlb vmemmap optimization can still be used with HGM, so there
> > > is no code to disable it. We don't need small page* pointers either,
> > > except when we're dealing with mapping subpages, like in
> > > hugetlb_no_page. Everything else can handle the hugetlb page as a
> > > folio.
> > >
> > > I hope we can keep compatibility with the vmemmap optimization while
> > > solving the mapcount overflow risk.
> >
> > Yeah, that'll be perfect if it works.  But AFAIU, even with your current
> > design (ignoring all the issues with smaps accounting and overflow
> > risks), we already reference the small pages, don't we?  See:
> >
> > static inline int page_mapcount(struct page *page)
> > {
> >         int mapcount = atomic_read(&page->_mapcount) + 1;  <-------- here
> >
> >         if (likely(!PageCompound(page)))
> >                 return mapcount;
> >         page = compound_head(page);
> >         return head_compound_mapcount(page) + mapcount;
> > }
> >
> > Even if we assume a small page's ->_mapcount should always be zero in
> > this case, we may need to take special care of hugetlb pages in
> > page_mapcount() so as not to reference it at all.  Otherwise I think
> > it's reading random values, and some day it can be non-zero.
> 
> IIUC, it's ok to read from all the hugetlb subpage structs, you just
> can't *write* to them (except the first few). The first page of page
> structs is mapped RW; all the others are mapped RO to a single
> physical page.

I'm not familiar with vmemmap work, but I saw this:

	/*
	 * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
	 * to the page which @vmemmap_reuse is mapped to, then free the pages
	 * which the range [@vmemmap_start, @vmemmap_end] is mapped to.
	 */
	if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse))

It seems it'll reuse the 1st page of the huge page's struct pages rather
than backing the rest of the vmemmap with zero pages.  Would that be a
problem (since I think some small page*s can actually point to e.g. the
head page* if referenced)?
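
To spell out the worry -- assuming I'm reading the remap right: after the
optimization only the first vmemmap page stays RW, and the rest become RO
aliases of it, so a read like

	/* a tail struct page beyond the first vmemmap page aliases one
	 * of the struct pages in that first page (head included) */
	int mc = atomic_read(&subpage->_mapcount);

could return the head's (or a low-index tail's) field rather than a real
per-subpage value, and a write there would fault.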

> 
> >
> > The other approach is probably to use the THP approach.  After Hugh's
> > rework of the THP accounting, I assume it would be even cleaner (at
> > least no more DoubleMap complexity... even though I can't say I've fully
> > digested the whole history of that).  The question is what the major
> > challenges are in using the same approach as THP.  You may have more
> > knowledge on that aspect, so I'd be glad to know.
> 
> I need to take a closer look at Hugh's approach to see if we can do it
> the same way. (I wonder if the 1G THP series has some ideas too.)

https://lore.kernel.org/all/47ad693-717-79c8-e1ba-46c3a6602e48@google.com/

It's already in mainline.  I think it's mostly implemented internally
under the rmap APIs.  For the HGM effort, I'd start by simply passing
compound=false to the rmap APIs and seeing what starts to fail.
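
E.g. something like this (just a sketch of the direction; the real call
sites will differ):

	/* today's hstate-level mapping: */
	page_add_file_rmap(page, vma, true);

	/* an HGM small mapping, accounted per-PTE the way THP does it: */
	page_add_file_rmap(subpage, vma, false);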

> 
> A really simple solution could be just to prevent userspace from doing
> MADV_SPLIT (or, if we are enabling HGM due to hwpoison, ignore the
> poison) if it could result in a mapcount overflow. For 1G pages,
> userspace would need 8192 mappings to overflow mapcount/refcount.

I'm not sure you can calculate that; consider fork()s along the way when
PMD sharing is disabled, or the 1G pages being mapped in multiple places
via more than one mmap().
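
(For reference, the arithmetic behind that 8192: a fully split 1G page is
1G / 4K = 262144 PTE mappings, and an atomic_t tops out around 2^31, so
2^31 / 2^18 = 8192 fully mapped instances overflow it -- and fork()ed mms
and extra mmap()s each add instances outside any single process's
control.)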

Thanks,

-- 
Peter Xu

