From: Rongwei Wang <rongwei.wang@linux.alibaba.com>
To: Zach O'Keefe <zokeefe@google.com>
Cc: Matthew Wilcox <willy@infradead.org>,
David Rientjes <rientjes@google.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
Hugh Dickins <hughd@google.com>
Subject: Re: mm/khugepaged: collapse file/shmem compound pages
Date: Wed, 1 Jun 2022 19:26:51 +0800
Message-ID: <08fb5da7-a390-1880-da3f-e1d480047caa@linux.alibaba.com>
In-Reply-To: <CAAa6QmRo=SjCGSdhHyxmkzxyiR8S9EUdRr0CXSenWaa+-7e5bg@mail.gmail.com>

On 6/1/22 1:19 PM, Zach O'Keefe wrote:
> On Sun, May 29, 2022 at 6:25 PM Rongwei Wang
> <rongwei.wang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 5/30/22 5:36 AM, Zach O'Keefe wrote:
>>> On Fri, May 27, 2022 at 8:48 PM Matthew Wilcox <willy@infradead.org> wrote:
>>>>
>>>> On Fri, May 27, 2022 at 09:27:33AM -0700, Zach O'Keefe wrote:
>>>>> On Thu, May 26, 2022 at 8:47 PM Matthew Wilcox <willy@infradead.org> wrote:
>>>>>> Because PageTransCompound() does not do what it says on the tin.
>>>>>>
>>>>>> static inline int PageTransCompound(struct page *page)
>>>>>> {
>>>>>> return PageCompound(page);
>>>>>> }
>>>>>>
>>>>>> So any compound page is treated as if it's a PMD-sized page.
>>>>>
>>>>> Right - therein lies the problem :) I think I misattributed your
>>>>> comment "we'll simply skip over it because the code believes that
>>>>> means it's already a PMD" as a solution, not as the current state of
>>>>> things. What we need to be able to do is:
>>>>>
>>>>> 1) If folio order == 0: do what we've been doing
>>>>> 2) If folio order == HPAGE_PMD_ORDER: check if it's _actually_
>>>>> pmd-mapped. If it is, we're done. If not, continue to step (3)
>>>>
>>>> I would not do that part. Just leave it alone and assume everything's
>>>> good.
>>>
>>> Sorry if I keep pressing the issue here - but why not check? If the
>>> goal of khugepaged (and certainly MADV_COLLAPSE) is to map eligible
>>> memory at the pmd level, then these pte-mapped hugepages that we might
>>> discover in step (2) are actually the cheapest memory to collapse
>>> since we can do the collapse in-place.
>>>
>>>>> 3) Else (folio order > 0 and not pmd-mapped): new magic; hopefully
>>>>> it's ~ same as step (1)
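>>>>>
>>>>> As a rough sketch of that dispatch (every helper name below is
>>>>> hypothetical and only illustrates the shape; folio_order() and
>>>>> HPAGE_PMD_ORDER are the only real symbols):
>>>>>
>>>>> static int collapse_dispatch(struct folio *folio)
>>>>> {
>>>>>         if (folio_order(folio) == 0)
>>>>>                 return collapse_as_today(folio);        /* step (1) */
>>>>>         if (folio_order(folio) == HPAGE_PMD_ORDER &&
>>>>>             folio_actually_pmd_mapped(folio))
>>>>>                 return 0;                               /* step (2): done */
>>>>>         return collapse_higher_order(folio);            /* step (3) */
>>>>> }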
>>>>
>>>> Yes, exactly this.
>>>>
>>>>>>> I thought the point / benefit of khugepaged was precisely to try and
>>>>>>> find places where we can collapse many pte entries into a single pmd
>>>>>>> mapping?
>>>>>>
>>>>>> Ideally, yes. But if a file is mapped at an address which isn't
>>>>>> PMD-aligned, it can't. Maybe it should just decline to operate in that
>>>>>> case.
>>>>>
>>>>> To make sure I'm not missing anything here: It's not actually
>>>>> important that the file is mapped at a pmd-aligned address. All that
>>>>> is important is that the region of memory being collapsed is
>>>>> pmd-aligned. If we wanted to collapse memory mapped to the start of
>>>>> the file, then sure, the file has to be mapped suitably.
>>>>
>>>> Ah, what you're probably missing is that for file pages/folios, they
>>>> have to be naturally aligned. The data structure underlying the
>>>> page cache simply can't cope with askew pages. (It kind of can under
>>>> some circumstances that are so complicated that they shouldn't be
>>>> explained, and it's far easier just to say "folios must be naturally
>>>> aligned within the file")
>>>
>>> I'm trying to understand what you mean by "naturally aligned" here.
>>> I'm operating under the assumption that all file pages map to
>>> page-sized offsets within a file (e.g. linear_page_address()) and that
>>> files are mapped at a page-aligned address. In the event we want to
>>> collapse file-backed memory, if the virtual address of said memory is
>>> hugepage-aligned, I don't think it's necessary that the address maps
>>> to a hugepage-sized offset in the file? I.e. on x86, the file could
>> Hi, Zach
>>
>> I'm not sure I get your question right. We submitted a patch set to
>> support file-backed THP so that it can be used transparently, like
>> anonymous THP [1].
>>
>> [1] https://lore.kernel.org/linux-mm/20211009092658.59665-2-rongwei.wang@linux.alibaba.com/
>>
>> In this patch, I remember that we need to check whether '(vma->vm_start >>
>> PAGE_SHIFT) - vma->vm_pgoff' is aligned with HPAGE_PMD_NR, like:
>>
>> +static inline bool vma_is_hugetext(struct vm_area_struct *vma,
>> +                                   unsigned long vm_flags)
>> +{
>> +        if (!(vm_flags & VM_EXEC))
>> +                return false;
>> +
>> +        if (vma->vm_file && !inode_is_open_for_write(vma->vm_file->f_inode))
>> +                return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
>> +                                  HPAGE_PMD_NR);
>> +
>> +        return false;
>> +}
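>>
>> In other words, the check computes (in pages) the virtual address at
>> which file offset 0 would land, and requires that to be aligned to
>> HPAGE_PMD_NR, so that file offsets and virtual addresses stay congruent
>> modulo the huge page size. A worked example with made-up values:
>>
>> /*
>>  * vm_start = 0x600200000, vm_pgoff = 0x200 (file offset 2MiB):
>>  *   (0x600200000 >> PAGE_SHIFT) - 0x200 = 0x600200 - 0x200 = 0x600000
>>  *   0x600000 % HPAGE_PMD_NR (512) == 0  =>  check passes
>>  */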
>>
>> This is a little different from anon THP.
>>> itself be mapped to the start of the last page in a 2MiB region, X,
>>
>> Maybe it's related to the ELF 'Align' parameter on your current system?
>> If 'Align' is set to 2MB (check with 'readelf -l /path/exec'), it probably
>> meets the above alignment check.
>>
>> The default Align value depends on the binutils version, and it can also
>> be set at link time with the '-z max-page-size=<align size>' option.
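>>
>> For example (illustrative command lines, not from this thread): building
>> with 'gcc -Wl,-z,max-page-size=0x200000 ...' forces 2MiB segment
>> alignment, and 'readelf -l /path/exec | grep -A1 LOAD' then shows
>> 0x200000 in the Align column.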
>>
>> Hope it is helpful :)
>>
>> -wrw
>
> Hey Rongwei!
>
> Thanks for the code / help. Took a little bit, but hughd has
> enlightened me on the problem (thanks Hugh!). Likewise, apologies for
> not understanding your previous comment regarding folio alignment,
> Matthew.
>
> Also, thanks for linking your patchset, and sorry for missing it
> previously. It seems we're interested in the same problem! Hopefully
> this work can be beneficial to your use case as well.
Hi Zach!
Thank you, too. Recently, we have been trying to use process_madvise() +
DAMON to find hot .text, especially on x86, and then collapse it into
huge pages. It seems that process_madvise(MADV_COLLAPSE) is feasible for
that.
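
A minimal sketch of what we have in mind (untested; MADV_COLLAPSE is not
in released uapi headers yet, so the value below is assumed from the
proposed patches and may change):

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25        /* assumed value from the MADV_COLLAPSE series */
#endif

/* Ask the kernel to collapse [addr, addr + len) in process 'pid'. */
static int collapse_remote(pid_t pid, void *addr, size_t len)
{
        struct iovec iov = { .iov_base = addr, .iov_len = len };
        int pidfd = syscall(SYS_pidfd_open, pid, 0);
        long ret;

        if (pidfd < 0)
                return -1;
        ret = syscall(SYS_process_madvise, pidfd, &iov, 1, MADV_COLLAPSE, 0);
        close(pidfd);
        return ret < 0 ? -1 : 0;
}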
Anyway, thanks for your nice work.
-wrw
>
> Thanks again for your time,
> Zach
>
>
>
>>> and that wouldn't prevent us from collapsing the 2MiB region starting
>>> at X+4KiB.
>>>
>>>>>>>> shmem still expects folios to be of order either 0 or PMD_ORDER.
>>>>>>>> That assumption extends into the swap code and I haven't had the heart
>>>>>>>> to go and fix all those places yet. Plus Neil was doing major surgery
>>>>>>>> to the swap code in the most recent development cycle and I didn't want
>>>>>>>> to get in his way.
>>>>>>>>
>>>>>>>> So I am absolutely fine with khugepaged allocating a PMD-size folio for
>>>>>>>> any inode that claims mapping_large_folio_support(). If any filesystems
>>>>>>>> break, we'll fix them.
>>>>>>>
>>>>>>> Just for clarification, what is the equivalent code today that
>>>>>>> enforces mapping_large_folio_support()? I.e. today, khugepaged can
>>>>>>> successfully collapse file without checking if the inode supports it
>>>>>>> (we only check that it's a regular file not opened for writing).
>>>>>>
>>>>>> Yeah, that's a dodgy hack which needs to go away. But we need a lot
>>>>>> more filesystems converted to supporting large folios before we can
>>>>>> delete it. Not your responsibility; I'm doing my best to encourage
>>>>>> fs maintainers to do this part.
>>>>>
>>>>> Got it. In the meantime, do we want to check the old conditions +
>>>>> mapping_large_folio_support()?
>>>>
>>>> Yes, that should work. khugepaged should be free to create large
>>>> folios if the underlying filesystem supports them OR (executable,
>>>> read-only, CONFIG_THP_RO, etc, etc).
>>>
>>> Thanks for confirming!
>>>
>>>>>>> Also, just to check, there isn't anything wrong with following
>>>>>>> collapse_file()'s approach, even for folios of 0 < order <
>>>>>>> HPAGE_PMD_ORDER? I.e this part:
>>>>>>>
>>>>>>> * Basic scheme is simple, details are more complex:
>>>>>>> * - allocate and lock a new huge page;
>>>>>>> * - scan page cache replacing old pages with the new one
>>>>>>> * + swap/gup in pages if necessary;
>>>>>>> * + fill in gaps;
>>>>>>> * + keep old pages around in case rollback is required;
>>>>>>> * - if replacing succeeds:
>>>>>>> * + copy data over;
>>>>>>> * + free old pages;
>>>>>>> * + unlock huge page;
>>>>>>> * - if replacing failed;
>>>>>>> * + put all pages back and unfreeze them;
>>>>>>> * + restore gaps in the page cache;
>>>>>>> * + unlock and free huge page;
>>>>>>> */
>>>>>>
>>>>>> Correct. At least, as far as I know! Working on folios has been quite
>>>>>> the education for me ...
>>>>>
>>>>> Great! Well, perhaps I'll run into a snafu here or there (and
>>>>> hopefully learn something myself) but this gives me enough confidence
>>>>> to naively give it a try and see what happens!
>>>>>
>>>>> Again, thank you very much for your time, help and advice with this,
>>>>
>>>> You're welcome! Thanks for putting in some work on this project!
>>>
>>> No problem! Hopefully this can benefit a bunch of existing users.
>>>
>>> Thanks again,
>>> Zach