Re: mm/khugepaged: collapse file/shmem compound pages

From: "Zach O'Keefe" <zokeefe@google.com>
To: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Cc: Matthew Wilcox <willy@infradead.org>,
	David Rientjes <rientjes@google.com>,
	 "linux-mm@kvack.org" <linux-mm@kvack.org>,
	Hugh Dickins <hughd@google.com>
Subject: Re: mm/khugepaged: collapse file/shmem compound pages
Date: Tue, 31 May 2022 22:19:31 -0700	[thread overview]
Message-ID: <CAAa6QmRo=SjCGSdhHyxmkzxyiR8S9EUdRr0CXSenWaa+-7e5bg@mail.gmail.com> (raw)
In-Reply-To: <cd781b7f-f5f2-4a44-a338-1ccc0f03d86d@linux.alibaba.com>

On Sun, May 29, 2022 at 6:25 PM Rongwei Wang
<rongwei.wang@linux.alibaba.com> wrote:
>
>
>
> On 5/30/22 5:36 AM, Zach O'Keefe wrote:
> > On Fri, May 27, 2022 at 8:48 PM Matthew Wilcox <willy@infradead.org> wrote:
> >>
> >> On Fri, May 27, 2022 at 09:27:33AM -0700, Zach O'Keefe wrote:
> >>> On Thu, May 26, 2022 at 8:47 PM Matthew Wilcox <willy@infradead.org> wrote:
> >>>> Because PageTransCompound() does not do what it says on the tin.
> >>>>
> >>>> static inline int PageTransCompound(struct page *page)
> >>>> {
> >>>>          return PageCompound(page);
> >>>> }
> >>>>
> >>>> So any compound page is treated as if it's a PMD-sized page.
> >>>
> >>> Right - therein lies the problem :) I think I misattributed your
> >>> comment "we'll simply skip over it because the code believes that
> >>> means it's already a PMD" as a solution, not as the current state of
> >>> things. What we need to be able to do is:
> >>>
> >>> 1) If folio order == 0: do what we've been doing
> >>> 2) If folio order == HPAGE_PMD_ORDER: check if it's _actually_
> >>> pmd-mapped. If it is, we're done. If not, continue to step (3)
> >>
> >> I would not do that part.  Just leave it alone and assume everything's
> >> good.
> >
> > Sorry if I keep pressing the issue here - but why not check? If the
> > goal of khugepaged (and certainly MADV_COLLAPSE) is to map eligible
> > memory at the pmd level, then these pte-mapped hugepages that we might
> > discover in step (2) are actually the cheapest memory to collapse
> > since we can do the collapse in-place.
> >
> >>> 3) Else (folio order > 0 and not pmd-mapped): new magic; hopefully
> >>> it's ~ same as step (1)
> >>
> >> Yes, exactly this.
> >>
> >>>>> I thought the point / benefit of khugepaged was precisely to try and
> >>>>> find places where we can collapse many pte entries into a single pmd
> >>>>> mapping?
> >>>>
> >>>> Ideally, yes.  But if a file is mapped at an address which isn't
> >>>> PMD-aligned, it can't.  Maybe it should just decline to operate in that
> >>>> case.
> >>>
> >>> To make sure I'm not missing anything here: It's not actually
> >>> important that the file is mapped at a pmd-aligned address. All that
> >>> is important is that the region of memory being collapsed is
> >>> pmd-aligned. If we wanted to collapse memory mapped to the start of
> >>> the file, then sure, the file has to be mapped suitably.
> >>
> >> Ah, what you're probably missing is that for file pages/folios, they
> >> have to be naturally aligned.  The data structure underlying the
> >> page cache simply can't cope with askew pages.  (It kind of can under
> >> some circumstances that are so complicated that they shouldn't be
> >> explained, and it's far easier just to say "folios must be naturally
> >> aligned within the file")
> >
> > I'm trying to understand what you mean by "naturally aligned" here.
> > I'm operating under the assumption that all file pages map to
> > page-sized offsets within a file (e.g. linear_page_address()) and that
> > files are mapped at a page-aligned address. In the event we want to
> > collapse file-backed memory, if the virtual address of said memory is
> > hugepage-aligned, I don't think it's necessary that the address maps
> > to a hugepage-sized offset in the file? I.e. on x86, the file could
> Hi, Zach
>
> I'm not sure get your question rightly. We submitted patch set to
> support file THP can been used transparently, likes THP[1].
>
> [1]https://lore.kernel.org/linux-mm/20211009092658.59665-2-rongwei.wang@linux.alibaba.com/
>
> In this patch, I remember that we need to check if '(vma->vm_start >>
> PAGE_SHIFT) - vma->vm_pgoff' is align with HPAGE_PMD_NR. likes:
>
> +static inline bool vma_is_hugetext(struct vm_area_struct *vma,
> +                                  unsigned long vm_flags)
> +{
> +       if (!(vm_flags & VM_EXEC))
> +               return false;
> +
> +       if (vma->vm_file && !inode_is_open_for_write(vma->vm_file->f_inode))
> +               return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
> +                               HPAGE_PMD_NR);
> +
> +       return false;
> +}
>
> There is a little different with anon THP here.
> > itself be mapped to the start of the last page in a 2MiB region ,X,
>
> Maybe it's related with ELF's 'Align' parameter in your current system?
> If 'Align' set to 2MB (by 'readelf -l /path/exec'), it's probably meets
> the above alignment check.
>
> And the default Align parameter is related to binutils version, also can
> be set in compile time by '-z max-page-size=<align size>' option.
>
> Hope it is helpful:)
>
> -wrw

Hey Rongwei!

Thanks for the code / help. Took a little bit, but hughd has
enlightened me on the problem (thanks Hugh!). Likewise, apologies for
not understanding your previous comment regarding folio alignment,
Matthew.

Also, thanks for linking your patchset, and sorry for missing it
previously. It seems we're interested in the same problem! Hopefully
this work can be beneficial to your use case as well.

Thanks again for your time,
Zach

> > and that wouldn't prevent us from collapsing the 2MiB region starting
> > at X+4KiB.
> >
> >>>>>> shmem still expects folios to be of order either 0 or PMD_ORDER.
> >>>>>> That assumption extends into the swap code and I haven't had the heart
> >>>>>> to go and fix all those places yet.  Plus Neil was doing major surgery
> >>>>>> to the swap code in the most recent deveopment cycle and I didn't want
> >>>>>> to get in his way.
> >>>>>>
> >>>>>> So I am absolutely fine with khugepaged allocating a PMD-size folio for
> >>>>>> any inode that claims mapping_large_folio_support().  If any filesystems
> >>>>>> break, we'll fix them.
> >>>>>
> >>>>> Just for clarification, what is the equivalent code today that
> >>>>> enforces mapping_large_folio_support()? I.e. today, khugepaged can
> >>>>> successfully collapse file without checking if the inode supports it
> >>>>> (we only check that it's a regular file not opened for writing).
> >>>>
> >>>> Yeah, that's a dodgy hack which needs to go away.  But we need a lot
> >>>> more filesystems converted to supporting large folios before we can
> >>>> delete it.  Not your responsibility; I'm doing my best to encourage
> >>>> fs maintainers to do this part.
> >>>
> >>> Got it. In the meantime, do we want to check the old conditions +
> >>> mapping_large_folio_support()?
> >>
> >> Yes, that should work.  khugepaged should be free to create large
> >> folios if the underlying filesystem supports them OR (executable,
> >> read-only, CONFIG_THP_RO, etc, etc).
> >
> > Thanks for confirming!
> >
> >>>>> Also, just to check, there isn't anything wrong with following
> >>>>> collapse_file()'s approach, even for folios of 0 < order <
> >>>>> HPAGE_PMD_ORDER? I.e this part:
> >>>>>
> >>>>>   * Basic scheme is simple, details are more complex:
> >>>>>   *  - allocate and lock a new huge page;
> >>>>>   *  - scan page cache replacing old pages with the new one
> >>>>>   *    + swap/gup in pages if necessary;
> >>>>>   *    + fill in gaps;
> >>>>>   *    + keep old pages around in case rollback is required;
> >>>>>   *  - if replacing succeeds:
> >>>>>   *    + copy data over;
> >>>>>   *    + free old pages;
> >>>>>   *    + unlock huge page;
> >>>>>   *  - if replacing failed;
> >>>>>   *    + put all pages back and unfreeze them;
> >>>>>   *    + restore gaps in the page cache;
> >>>>>   *    + unlock and free huge page;
> >>>>>   */
> >>>>
> >>>> Correct.  At least, as far as I know!  Working on folios has been quite
> >>>> the education for me ...
> >>>
> >>> Great! Well, perhaps I'll run into a snafu here or there (and
> >>> hopefully learn something myself) but this gives me enough confidence
> >>> to naively give it a try and see what happens!
> >>>
> >>> Again, thank you very much for your time, help and advice with this,
> >>
> >> You're welcome!  Thanks for putting in some work on this project!
> >
> > No problem! Hopefully this can benefit a bunch of existing users.
> >
> > Thanks again,
> > Zach