All of lore.kernel.org
 help / color / mirror / Atom feed
From: James Houghton <jthoughton@google.com>
To: Peter Xu <peterx@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>,
	David Hildenbrand <david@redhat.com>,
	Muchun Song <songmuchun@bytedance.com>,
	David Rientjes <rientjes@google.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Mina Almasry <almasrymina@google.com>,
	"Zach O'Keefe" <zokeefe@google.com>,
	Manish Mishra <manish.mishra@nutanix.com>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Miaohe Lin <linmiaohe@huawei.com>, Yang Shi <shy828301@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range
Date: Thu, 9 Feb 2023 08:43:45 -0800	[thread overview]
Message-ID: <CADrL8HUZON-Fc9YxPyTA21AY02OgEXDDsmq9AUSMSxjr83ndYA@mail.gmail.com> (raw)
In-Reply-To: <Y+PKwvLT8+wQ6LPA@x1n>

On Wed, Feb 8, 2023 at 8:16 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Feb 07, 2023 at 04:26:02PM -0800, James Houghton wrote:
> > On Tue, Feb 7, 2023 at 3:13 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > James,
> > >
> > > On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote:
> > > > > Here is the result: [1] (sorry it took a little while heh). The
> > >
> > > Thanks.  From what I can tell, that number shows that it'll be great we
> > > start with your rfcv1 mapcount approach, which mimics what's proposed by
> > > Matthew for generic folio.
> >
> > Do you think the RFC v1 way is better than doing the THP-like way
> > *with the additional MMU notifier*?
>
> What's the additional MMU notifier you're referring?

An MMU notifier that informs KVM that a collapse has happened without
having to invalidate_range_start() and invalidate_range_end(), the one
you're replying to lower down in the email. :) [ see below... ]

>
> >
> > >
> > > > > implementation of the "RFC v1" way is pretty horrible[2] (and this
> > >
> > > Any more information on why it's horrible? :)
> >
> > I figured the code would speak for itself, heh. It's quite complicated.
> >
> > I really didn't like:
> > 1. The 'inc' business in copy_hugetlb_page_range.
> > 2. How/where I call put_page()/folio_put() to keep the refcount and
> > mapcount synced up.
> > 3. Having to check the page cache in UFFDIO_CONTINUE.
>
> I think the complexity is one thing which I'm fine with so far.  However
> when I think again about the things behind that complexity, I noticed there
> may be at least one flaw that may not be trivial to work around.
>
> It's about truncation.  The problem is now we use the pgtable entry to
> represent the mapcount, but the pgtable entry cannot be zapped easily,
> unless vma unmapped or collapsed.
>
> It means e.g. truncate_inode_folio() may stop working for hugetlb (of
> course, with page lock held).  The mappings will be removed for real, but
> not the mapcount for HGM anymore, because unmap_mapping_folio() only zaps
> the pgtable leaves, not the ones that we used to account for mapcounts.
>
> So the kernel may see weird things, like mapcount>0 after
> truncate_inode_folio() being finished completely.
>
> For HGM to do the right thing, we may want to also remove the non-leaf
> entries when truncating or doing similar things like a rmap walk to drop
> any mappings for a page/folio.  Though that's not doable for now because
> the locks that truncate_inode_folio() is weaker than what we need to free
> the pgtable non-leaf entries - we'll need mmap write lock for that, the
> same as when we unmap or collapse.
>
> Matthew's design doesn't have such issue if the ptes need to be populated,
> because mapcount is still with the leaves; not the case for us here.
>
> If that's the case, _maybe_ we still need to start with the stupid but
> working approach of subpage mapcounts.

Good point. I can't immediately think of a solution. I would prefer to
go with the subpage mapcount approach to simplify HGM for now;
optimizing mapcount for HugeTLB can then be handled separately. If
you're ok with this, I'll go ahead and send v2.

One way that might be possible: using the PAGE_SPECIAL bit on the
hstate-level PTE to indicate if mapcount has been incremented or not
(if the PTE is pointing to page tables). As far as I can tell,
PAGE_SPECIAL doesn't carry any meaning for HugeTLB PTEs, but we would
need to be careful with existing PTE examination code as to not
misinterpret these PTEs.

>
> [...]
>
> > > > > Matthew is trying to solve the same problem with THPs right now: [3].
> > > > > I haven't figured out how we can apply Matthews's approach to HGM
> > > > > right now, but there probably is a way. (If we left the mapcount
> > > > > increment bits in the same place, we couldn't just check the
> > > > > hstate-level PTE; it would have already been made present.)
> > >
> > > I'm just worried that (1) this may add yet another dependency to your work
> > > which is still during discussion phase, and (2) whether the folio approach
> > > is easily applicable here, e.g., we may not want to populate all the ptes
> > > for hugetlb HGMs by default.
> >
> > That's true. I definitely don't want to wait for this either. It seems
> > like Matthew's approach won't work very well for us -- when doing a
> > lot of high-granularity UFFDIO_CONTINUEs on a 1G page, checking all
> > the PTEs to see if any of them are mapped would get really slow.
>
> I think it'll be a common problem to userfaultfd when it comes, e.g.,
> userfaultfd by design is PAGE_SIZE based so far.  It needs page size
> granule on pgtable manipulations, unless we extend the userfaultfd protocol
> to support folios, iiuc.
>
> >
> > >
> > > > >
> > > > > We could:
> > > > > - use the THP-like way and tolerate ~1 second collapses
> > > >
> > > > Another thought here. We don't necessarily *need* to collapse the page
> > > > table mappings in between mmu_notifier_invalidate_range_start() and
> > > > mmu_notifier_invalidate_range_end(), as the pfns aren't changing,
> > > > we aren't punching any holes, and we aren't changing permission bits.
> > > > If we had an MMU notifier that simply informed KVM that we collapsed
> > > > the page tables *after* we finished collapsing, then it would be ok
> > > > for hugetlb_collapse() to be slow.

[ from above... ] This MMU notifier. :)

> > >
> > > That's a great point!  It'll definitely apply to either approach.
> > >
> > > >
> > > > If this MMU notifier is something that makes sense, it probably
> > > > applies to MADV_COLLAPSE for THPs as well.
> > >
> > > THPs are definitely different, mmu notifiers should be required there,
> > > afaict.  Isn't that what the current code does?
> > >
> > > See collapse_and_free_pmd() for shmem and collapse_huge_page() for anon.
> >
> > Oh, yes, of course, MADV_COLLAPSE can actually move things around and
> > properly make THPs. Thanks. But it would apply if we were only
> > collapsing PTE-mapped THPs, I think?
>
> Yes it applies I think.  And if I'm not wrong it's also doing so. :) See
> collapse_pte_mapped_thp().
>
> While for anon we always allocate a new page, hence not applicable.
>
> --
> Peter Xu

Thanks Peter!
- James

  reply	other threads:[~2023-02-09 16:44 UTC|newest]

Thread overview: 126+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-05 10:17 [PATCH 00/46] Based on latest mm-unstable (85b44c25cd1e) James Houghton
2023-01-05 10:17 ` [PATCH 01/46] hugetlb: don't set PageUptodate for UFFDIO_CONTINUE James Houghton
2023-01-05 10:18 ` [PATCH 02/46] hugetlb: remove mk_huge_pte; it is unused James Houghton
2023-01-05 10:18 ` [PATCH 03/46] hugetlb: remove redundant pte_mkhuge in migration path James Houghton
2023-01-05 10:18 ` [PATCH 04/46] hugetlb: only adjust address ranges when VMAs want PMD sharing James Houghton
2023-01-05 10:18 ` [PATCH 05/46] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING James Houghton
2023-01-05 10:18 ` [PATCH 06/46] mm: add VM_HUGETLB_HGM VMA flag James Houghton
2023-01-05 10:18 ` [PATCH 07/46] hugetlb: rename __vma_shareable_flags_pmd to __vma_has_hugetlb_vma_lock James Houghton
2023-01-05 10:18 ` [PATCH 08/46] hugetlb: add HugeTLB HGM enablement helpers James Houghton
2023-01-05 10:18 ` [PATCH 09/46] mm: add MADV_SPLIT to enable HugeTLB HGM James Houghton
2023-01-05 15:05   ` kernel test robot
2023-01-05 15:29   ` David Hildenbrand
2023-01-10  0:01     ` Zach O'Keefe
2023-01-05 10:18 ` [PATCH 10/46] hugetlb: make huge_pte_lockptr take an explicit shift argument James Houghton
2023-01-05 10:18 ` [PATCH 11/46] hugetlb: add hugetlb_pte to track HugeTLB page table entries James Houghton
2023-01-05 16:06   ` kernel test robot
2023-01-05 10:18 ` [PATCH 12/46] hugetlb: add hugetlb_alloc_pmd and hugetlb_alloc_pte James Houghton
2023-01-05 10:18 ` [PATCH 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step James Houghton
2023-01-05 16:57   ` kernel test robot
2023-01-05 18:58   ` kernel test robot
2023-01-11 21:51   ` Peter Xu
2023-01-12 13:38     ` James Houghton
2023-01-05 10:18 ` [PATCH 14/46] hugetlb: add make_huge_pte_with_shift James Houghton
2023-01-05 10:18 ` [PATCH 15/46] hugetlb: make default arch_make_huge_pte understand small mappings James Houghton
2023-01-05 10:18 ` [PATCH 16/46] hugetlbfs: do a full walk to check if vma maps a page James Houghton
2023-01-05 10:18 ` [PATCH 17/46] hugetlb: make unmapping compatible with high-granularity mappings James Houghton
2023-01-05 10:18 ` [PATCH 18/46] hugetlb: add HGM support for hugetlb_change_protection James Houghton
2023-01-05 10:18 ` [PATCH 19/46] hugetlb: add HGM support for follow_hugetlb_page James Houghton
2023-01-05 22:26   ` Peter Xu
2023-01-12 18:02   ` Peter Xu
2023-01-12 18:06     ` James Houghton
2023-01-05 10:18 ` [PATCH 20/46] hugetlb: add HGM support for hugetlb_follow_page_mask James Houghton
2023-01-05 10:18 ` [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range James Houghton
2023-01-05 22:42   ` Peter Xu
2023-01-11 22:58   ` Peter Xu
2023-01-12 14:06     ` James Houghton
2023-01-12 15:29       ` Peter Xu
2023-01-12 16:45         ` James Houghton
2023-01-12 16:55           ` James Houghton
2023-01-12 20:27           ` Peter Xu
2023-01-12 21:17             ` James Houghton
2023-01-12 21:33               ` Peter Xu
2023-01-16 10:17                 ` David Hildenbrand
2023-01-17 23:11                   ` James Houghton
2023-01-18  9:43                     ` David Hildenbrand
2023-01-18 15:35                       ` Peter Xu
2023-01-18 16:39                         ` James Houghton
2023-01-18 18:21                           ` David Hildenbrand
2023-01-18 19:28                           ` Mike Kravetz
2023-01-19 16:57                             ` James Houghton
2023-01-19 17:31                               ` Mike Kravetz
2023-01-19 19:42                                 ` James Houghton
2023-01-19 20:53                                   ` Peter Xu
2023-01-19 22:45                                     ` James Houghton
2023-01-19 22:00                                   ` Mike Kravetz
2023-01-19 22:23                                     ` Peter Xu
2023-01-19 22:35                                       ` James Houghton
2023-01-19 23:07                                         ` Peter Xu
2023-01-19 23:26                                           ` James Houghton
2023-01-20 17:23                                             ` Peter Xu
2023-01-19 23:44                                           ` Mike Kravetz
2023-01-23 15:19                                             ` Peter Xu
2023-01-23 17:49                                               ` Mike Kravetz
2023-01-26 16:58                                   ` James Houghton
2023-01-26 20:30                                     ` Peter Xu
2023-01-27 21:02                                       ` James Houghton
2023-01-30 17:29                                         ` Peter Xu
2023-01-30 18:38                                           ` James Houghton
2023-01-30 21:14                                             ` Peter Xu
2023-02-01  0:24                                               ` James Houghton
2023-02-01  1:24                                                 ` Peter Xu
2023-02-01 15:45                                                   ` James Houghton
2023-02-01 15:56                                                     ` David Hildenbrand
2023-02-01 17:58                                                       ` James Houghton
2023-02-01 18:01                                                         ` David Hildenbrand
2023-02-01 16:22                                                     ` Peter Xu
2023-02-01 21:32                                                       ` James Houghton
2023-02-01 21:51                                                         ` Peter Xu
2023-02-02  0:24                                                           ` James Houghton
2023-02-07 16:30                                                             ` James Houghton
2023-02-07 22:46                                                               ` James Houghton
2023-02-07 23:13                                                                 ` Peter Xu
2023-02-08  0:26                                                                   ` James Houghton
2023-02-08 16:16                                                                     ` Peter Xu
2023-02-09 16:43                                                                       ` James Houghton [this message]
2023-02-09 19:10                                                                         ` Peter Xu
2023-02-09 19:49                                                                           ` James Houghton
2023-02-09 20:22                                                                             ` Peter Xu
2023-01-18 17:08                         ` David Hildenbrand
2023-01-05 10:18 ` [PATCH 22/46] mm: rmap: provide pte_order in page_vma_mapped_walk James Houghton
2023-01-05 10:18 ` [PATCH 23/46] mm: rmap: make page_vma_mapped_walk callers use pte_order James Houghton
2023-01-05 10:18 ` [PATCH 24/46] rmap: update hugetlb lock comment for HGM James Houghton
2023-01-05 10:18 ` [PATCH 25/46] hugetlb: update page_vma_mapped to do high-granularity walks James Houghton
2023-01-05 10:18 ` [PATCH 26/46] hugetlb: add HGM support for copy_hugetlb_page_range James Houghton
2023-01-05 10:18 ` [PATCH 27/46] hugetlb: add HGM support for move_hugetlb_page_tables James Houghton
2023-01-05 10:18 ` [PATCH 28/46] hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page James Houghton
2023-01-05 10:18 ` [PATCH 29/46] rmap: in try_to_{migrate,unmap}_one, check head page for page flags James Houghton
2023-01-05 10:18 ` [PATCH 30/46] hugetlb: add high-granularity migration support James Houghton
2023-01-05 10:18 ` [PATCH 31/46] hugetlb: sort hstates in hugetlb_init_hstates James Houghton
2023-01-05 10:18 ` [PATCH 32/46] hugetlb: add for_each_hgm_shift James Houghton
2023-01-05 10:18 ` [PATCH 33/46] hugetlb: userfaultfd: add support for high-granularity UFFDIO_CONTINUE James Houghton
2023-01-05 10:18 ` [PATCH 34/46] hugetlb: userfaultfd: when using MADV_SPLIT, round addresses to PAGE_SIZE James Houghton
2023-01-06 15:13   ` Peter Xu
2023-01-10 14:50     ` James Houghton
2023-01-05 10:18 ` [PATCH 35/46] hugetlb: add MADV_COLLAPSE for hugetlb James Houghton
2023-01-10 20:04   ` James Houghton
2023-01-17 21:06   ` Peter Xu
2023-01-17 21:38     ` James Houghton
2023-01-17 21:54       ` Peter Xu
2023-01-19 22:37   ` Peter Xu
2023-01-19 23:06     ` James Houghton
2023-01-05 10:18 ` [PATCH 36/46] hugetlb: remove huge_pte_lock and huge_pte_lockptr James Houghton
2023-01-05 10:18 ` [PATCH 37/46] hugetlb: replace make_huge_pte with make_huge_pte_with_shift James Houghton
2023-01-05 10:18 ` [PATCH 38/46] mm: smaps: add stats for HugeTLB mapping size James Houghton
2023-01-05 10:18 ` [PATCH 39/46] hugetlb: x86: enable high-granularity mapping James Houghton
2023-01-12 20:07   ` James Houghton
2023-01-05 10:18 ` [PATCH 40/46] docs: hugetlb: update hugetlb and userfaultfd admin-guides with HGM info James Houghton
2023-01-05 10:18 ` [PATCH 41/46] docs: proc: include information about HugeTLB HGM James Houghton
2023-01-05 10:18 ` [PATCH 42/46] selftests/vm: add HugeTLB HGM to userfaultfd selftest James Houghton
2023-01-05 10:18 ` [PATCH 43/46] selftests/kvm: add HugeTLB HGM to KVM demand paging selftest James Houghton
2023-01-05 10:18 ` [PATCH 44/46] selftests/vm: add anon and shared hugetlb to migration test James Houghton
2023-01-05 10:18 ` [PATCH 45/46] selftests/vm: add hugetlb HGM test to migration selftest James Houghton
2023-01-05 10:18 ` [PATCH 46/46] selftests/vm: add HGM UFFDIO_CONTINUE and hwpoison tests James Houghton
2023-01-05 10:47 ` [PATCH 00/46] Based on latest mm-unstable (85b44c25cd1e) David Hildenbrand
2023-01-09 19:53   ` Mike Kravetz
2023-01-10 15:47     ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CADrL8HUZON-Fc9YxPyTA21AY02OgEXDDsmq9AUSMSxjr83ndYA@mail.gmail.com \
    --to=jthoughton@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=almasrymina@google.com \
    --cc=axelrasmussen@google.com \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=david@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=manish.mishra@nutanix.com \
    --cc=mike.kravetz@oracle.com \
    --cc=naoya.horiguchi@nec.com \
    --cc=peterx@redhat.com \
    --cc=rientjes@google.com \
    --cc=shy828301@gmail.com \
    --cc=songmuchun@bytedance.com \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    --cc=zokeefe@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.