From: James Houghton <jthoughton@google.com>
To: Peter Xu <peterx@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>,
	David Hildenbrand <david@redhat.com>,
	Muchun Song <songmuchun@bytedance.com>,
	David Rientjes <rientjes@google.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Mina Almasry <almasrymina@google.com>,
	"Zach O'Keefe" <zokeefe@google.com>,
	Manish Mishra <manish.mishra@nutanix.com>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Miaohe Lin <linmiaohe@huawei.com>, Yang Shi <shy828301@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range
Date: Wed, 1 Feb 2023 16:24:33 -0800
Message-ID: <CADrL8HVja_xJ9qczsd-fQfEPvEEXswhXQwoan=a_LSMyORvqww@mail.gmail.com>
In-Reply-To: <Y9re82gctIZf08cX@x1n>

On Wed, Feb 1, 2023 at 1:51 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Feb 01, 2023 at 01:32:21PM -0800, James Houghton wrote:
> > On Wed, Feb 1, 2023 at 8:22 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Wed, Feb 01, 2023 at 07:45:17AM -0800, James Houghton wrote:
> > > > On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote:
> > > > > > On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <peterx@redhat.com> wrote:
> > > > > > >
> > > > > > > On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
> > > > > > > > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> > > > [snip]
> > > > > > > > > Another way to not use thp mapcount, nor break smaps and similar calls to
> > > > > > > > > > page_mapcount() on small page, is to increase the hpage mapcount only
> > > > > > > > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> > > > > > > > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> > > > > > > > > entry is removed (for leaf, it's the same as for now; for HGM, it's when
> > > > > > > > > freeing pgtable of the PUD entry).
> > > > > > > >
> > > > > > > > Right, and this is doable. Also it seems like this is pretty close to
> > > > > > > > the direction Matthew Wilcox wants to go with THPs.
> > > > > > >
> > > > > > > I may not be familiar with it, do you mean this one?
> > > > > > >
> > > > > > > https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@casper.infradead.org/
> > > > > >
> > > > > > Yep that's it.
> > > > > >
> > > > > > >
> > > > > > > For hugetlb I think it should be easier to maintain rather than any-sized
> > > > > > > folios, because there's the pgtable non-leaf entry to track rmap
> > > > > > > information and the folio size being static to hpage size.
> > > > > > >
> > > > > > > It'll be different to folios where it can be random sized pages chunk, so
> > > > > > > it needs to be managed by batching the ptes when install/zap.
> > > > > >
> > > > > > Agreed. It's probably easier for HugeTLB because they're always
> > > > > > "naturally aligned" and yeah they can't change sizes.
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Something I noticed though, from the implementation of
> > > > > > > > folio_referenced()/folio_referenced_one(), is that folio_mapcount()
> > > > > > > > ought to report the total number of PTEs that are pointing on the page
> > > > > > > > (or the number of times page_vma_mapped_walk returns true). FWIW,
> > > > > > > > folio_referenced() is never called for hugetlb folios.
> > > > > > >
> > > > > > > FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
> > > > > > > it'll walk every leaf page being mapped, big or small, so IIUC that number
> > > > > > > should match with what it expects to see later, more or less.
> > > > > >
> > > > > > I don't fully understand what you mean here.
> > > > >
> > > > > I meant the rmap_walk pairing with folio_referenced_one() will walk all the
> > > > > leaves for the folio, big or small.  I think that will match the number
> > > > > with what got returned from folio_mapcount().
> > > >
> > > > See below.
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > But I agree the mapcount/referenced value itself is debatable to me, just
> > > > > > > like what you raised in the other thread on page migration.  Meanwhile, I
> > > > > > > am not certain whether the mapcount is accurate either because AFAICT the
> > > > > > > mapcount can be modified if e.g. new page mapping established as long as
> > > > > > > before taking the page lock later in folio_referenced().
> > > > > > >
> > > > > > > It's just that I don't see any severe issue either due to any of above, as
> > > > > > > long as that information is only used as a hint for next steps, e.g., to
> > > > > > > swap which page out.
> > > > > >
> > > > > > I also don't see a big problem with folio_referenced() (and you're
> > > > > > right that folio_mapcount() can be stale by the time it takes the
> > > > > > folio lock). It still seems like folio_mapcount() should return the
> > > > > > total number of PTEs that map the page though. Are you saying that
> > > > > > breaking this would be ok?
> > > > >
> > > > > I didn't quite follow - isn't that already doing so?
> > > > >
> > > > > folio_mapcount() is total_compound_mapcount() here, IIUC it is an
> > > > > accumulated value of all possible PTEs or PMDs being mapped as long as it's
> > > > > all or part of the folio being mapped.
> > > >
> > > > We've talked about 3 ways of handling mapcount:
> > > >
> > > > 1. The RFC v2 way, which is head-only, and we increment the compound
> > > > mapcount for each PT mapping we have. So a PTE-mapped 2M page,
> > > > compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias).
> > > > 2. The THP-like way. If we are fully mapping the hugetlb page with the
> > > > hstate-level PTE, we increment the compound mapcount, otherwise we
> > > > increment subpage->_mapcount.
> > > > 3. The RFC v1 way (the way you have suggested above), which is
> > > > head-only, and we increment the compound mapcount if the hstate-level
> > > > PTE is made present.
> > >
> > > Oh that's where it comes from!  It took quite some months going through all
> > > these, I can hardly remember the details.
> > >
> > > >
> > > > With #1 and #2, there is no concern with folio_mapcount(). But with
> > > > #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA
> > > > would yield 1 instead of 512 (right?). That's what I mean.
> > > >
> > > > #1 has problems wrt smaps and migration (though there were other
> > > > problems with those anyway that Mike has fixed), and #2 makes
> > > > MADV_COLLAPSE slow to the point of being unusable for some
> > > > applications.
> > >
> > > Ah so you're talking about after HGM being applied..  while I was only
> > > talking about THPs.
> > >
> > > If to apply the logic here with idea 3), the worst case is we'll need to
> > > have special care of HGM hugetlb in folio_referenced_one(), so the default
> > > page_vma_mapped_walk() may not apply anymore - the resource is always in
> > > hstate sized, so counting small ptes do not help too - we can just walk
> > > until the hstate entry and do referenced++ if it's not none, at the
> > > entrance of folio_referenced_one().
> > >
> > > But I'm not sure whether that'll be necessary at all, as I'm not sure
> > > whether that path can be triggered at all in any form (where from the top
> > > it should always be shrink_page_list()).  In that sense maybe we can also
> > > consider adding a WARN_ON_ONCE() in folio_referenced() where it is a
> > > hugetlb page that got passed in?  Meanwhile, adding a TODO comment
> > > explaining that current walk won't work easily for HGM only, so when it
> > > will be applicable to hugetlb we need to rework?
> > >
> > > I confess that's not pretty, though.  But that'll make 3) with no major
> > > defect from function-wise.
> >
> > Another potential idea would be to add something like page_vmacount().
> > For non-HugeTLB pages, page_vmacount() == page_mapcount(). Then for
> > HugeTLB pages, we could keep a separate count (in one of the tail
> > pages, I guess). And then in the places that matter (so smaps,
> > migration, and maybe CoW and hwpoison), potentially change their calls
> > to page_vmacount() instead of page_mapcount().
> >
> > Then to implement page_vmacount(), we do the RFC v1 mapcount approach
> > (but like.... correctly this time). And then for page_mapcount(), we
> > do the RFC v2 mapcount approach (head-only, once per PTE).
> >
> > Then we fix folio_referenced() without needing to special-case it for
> > HugeTLB. :) Or we could just special-case it. *shrug*
> >
> > Does that sound reasonable? We still have the problem where a series
> > of partial unmaps could leave page_vmacount() incremented, but I
> > don't think that's a big problem.
>
> I'm afraid someone will stop you from introducing yet another definition of
> mapcount, where others are trying to remove it. :)
>
> Or, can we just drop folio_referenced_arg.mapcount?  We need to keep:
>
>         if (!pra.mapcount)
>                 return 0;
>
> By replacing it with folio_mapcount which is definitely something
> worthwhile, but what about the rest?
>
> If it can be dropped, afaict it'll naturally work with HGM again.
>
> IIUC that's an optimization where we want to stop the rmap walk as long as
> we found all the pages, however (1) IIUC it's not required to function, and
> (2) it's not guaranteed to work as solid anyway.. As we've discussed
> before: right after it reads mapcount (but before taking the page lock),
> the mapcount can get decreased by 1, then it'll still need to loop over all
> the vmas just to find that there's one "mysterious" mapcount lost.
>
> Personally I really have no idea on how much that optimization can help.

Ok, yeah, I think pra.mapcount can be removed too. (And we replace
!pra.mapcount with !folio_mapcount().)
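
To illustrate why dropping the counter matters for HGM, here is a
userspace toy model (not kernel code; struct toy_pte and
toy_folio_referenced() are hypothetical names made up for this sketch).
It assumes the walk decrements a budget initialised from the mapcount and
stops when the budget reaches zero, which is how I read the existing
pra.mapcount logic:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one PTE that maps part of the folio (hypothetical). */
struct toy_pte {
	bool young;	/* accessed bit */
};

/*
 * Walk n mappings and count the referenced ones.  If use_budget is true,
 * stop once 'mapcount' mappings have been visited, mimicking the
 * pra.mapcount early exit; otherwise visit every mapping.
 */
static int toy_folio_referenced(const struct toy_pte *ptes, size_t n,
				int mapcount, bool use_budget)
{
	int referenced = 0;
	int budget = mapcount;

	assert(mapcount >= 0);

	/* Kept in both variants: an unmapped folio needs no walk. */
	if (!mapcount)
		return 0;

	for (size_t i = 0; i < n; i++) {
		if (ptes[i].young)
			referenced++;
		if (use_budget && --budget == 0)
			break;	/* walk believes it has seen every mapping */
	}
	return referenced;
}
```

With the hstate-level accounting, a page mapped by several small PTEs
reports a mapcount of 1, so the budgeted walk stops after the first PTE
and can miss accessed bits further along. Without the budget, the walk
visits every mapping and the count no longer depends on which mapcount
definition is in use.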

I don't see any other existing users of folio_mapcount() and
total_mapcount() that are problematic. We do need to make sure to keep
refcount and mapcount in sync though; it can be done.

So I'll compare this "RFC v1" way with the THP-like way and get you a
performance comparison.
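
For reference while comparing, a small toy model of what the three
accounting schemes quoted above would report as the total mapcount for
one hugetlb page in a single VMA. This is only a sketch of the
arithmetic; enum toy_scheme and toy_folio_mapcount() are hypothetical
names and nothing here is taken from the series:

```c
#include <assert.h>

enum toy_scheme {
	TOY_PER_PTE_HEAD,	/* #1, RFC v2: head-only, one count per PT mapping */
	TOY_THP_LIKE,		/* #2: compound count if fully mapped, else per subpage */
	TOY_HSTATE_ENTRY,	/* #3, RFC v1: head-only, one count per hstate-level entry */
};

/*
 * Total mapcount reported for one hugetlb page in one VMA, given how many
 * page table entries currently map it (1 means a single hstate-level leaf).
 */
static int toy_folio_mapcount(enum toy_scheme scheme, int nr_entries)
{
	assert(nr_entries >= 0);
	if (nr_entries == 0)
		return 0;

	switch (scheme) {
	case TOY_PER_PTE_HEAD:
	case TOY_THP_LIKE:
		/* One count per mapping entry, kept on the head or on subpages. */
		return nr_entries;
	case TOY_HSTATE_ENTRY:
		/* One count no matter how finely the page is mapped below it. */
		return 1;
	}
	return 0;
}
```

For a 2M page mapped with 512 PTEs this gives 512, 512 and 1
respectively, and all three agree on 1 when the page is mapped by a
single hstate-level entry.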


- James
