From: Rongwei Wang <rongwei.wang@linux.alibaba.com>
To: Zach O'Keefe <zokeefe@google.com>
Cc: Matthew Wilcox <willy@infradead.org>,
David Rientjes <rientjes@google.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
Hugh Dickins <hughd@google.com>
Subject: Re: mm/khugepaged: collapse file/shmem compound pages
Date: Wed, 1 Jun 2022 19:26:51 +0800
Message-ID: <08fb5da7-a390-1880-da3f-e1d480047caa@linux.alibaba.com>
In-Reply-To: <CAAa6QmRo=SjCGSdhHyxmkzxyiR8S9EUdRr0CXSenWaa+-7e5bg@mail.gmail.com>

On 6/1/22 1:19 PM, Zach O'Keefe wrote:
> On Sun, May 29, 2022 at 6:25 PM Rongwei Wang
> <rongwei.wang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 5/30/22 5:36 AM, Zach O'Keefe wrote:
>>> On Fri, May 27, 2022 at 8:48 PM Matthew Wilcox <willy@infradead.org> wrote:
>>>>
>>>> On Fri, May 27, 2022 at 09:27:33AM -0700, Zach O'Keefe wrote:
>>>>> On Thu, May 26, 2022 at 8:47 PM Matthew Wilcox <willy@infradead.org> wrote:
>>>>>> Because PageTransCompound() does not do what it says on the tin.
>>>>>>
>>>>>> static inline int PageTransCompound(struct page *page)
>>>>>> {
>>>>>> return PageCompound(page);
>>>>>> }
>>>>>>
>>>>>> So any compound page is treated as if it's a PMD-sized page.
>>>>>
>>>>> Right - therein lies the problem :) I think I misattributed your
>>>>> comment "we'll simply skip over it because the code believes that
>>>>> means it's already a PMD" as a solution, not as the current state of
>>>>> things. What we need to be able to do is:
>>>>>
>>>>> 1) If folio order == 0: do what we've been doing
>>>>> 2) If folio order == HPAGE_PMD_ORDER: check if it's _actually_
>>>>> pmd-mapped. If it is, we're done. If not, continue to step (3)
>>>>
>>>> I would not do that part. Just leave it alone and assume everything's
>>>> good.
>>>
>>> Sorry if I keep pressing the issue here - but why not check? If the
>>> goal of khugepaged (and certainly MADV_COLLAPSE) is to map eligible
>>> memory at the pmd level, then these pte-mapped hugepages that we might
>>> discover in step (2) are actually the cheapest memory to collapse
>>> since we can do the collapse in-place.
>>>
>>>>> 3) Else (folio order > 0 and not pmd-mapped): new magic; hopefully
>>>>> it's ~ same as step (1)
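>>>>>
>>>>> As a rough sketch of that dispatch (every helper name below is
>>>>> hypothetical and only illustrates the shape; folio_order() and
>>>>> HPAGE_PMD_ORDER are the only real symbols):
>>>>>
>>>>> static int collapse_dispatch(struct folio *folio)
>>>>> {
>>>>>         if (folio_order(folio) == 0)
>>>>>                 return collapse_as_today(folio);        /* step (1) */
>>>>>         if (folio_order(folio) == HPAGE_PMD_ORDER &&
>>>>>             folio_actually_pmd_mapped(folio))
>>>>>                 return 0;                               /* step (2): done */
>>>>>         return collapse_higher_order(folio);            /* step (3) */
>>>>> }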
>>>>
>>>> Yes, exactly this.
>>>>
>>>>>>> I thought the point / benefit of khugepaged was precisely to try and
>>>>>>> find places where we can collapse many pte entries into a single pmd
>>>>>>> mapping?
>>>>>>
>>>>>> Ideally, yes. But if a file is mapped at an address which isn't
>>>>>> PMD-aligned, it can't. Maybe it should just decline to operate in that
>>>>>> case.
>>>>>
>>>>> To make sure I'm not missing anything here: It's not actually
>>>>> important that the file is mapped at a pmd-aligned address. All that
>>>>> is important is that the region of memory being collapsed is
>>>>> pmd-aligned. If we wanted to collapse memory mapped to the start of
>>>>> the file, then sure, the file has to be mapped suitably.
>>>>
>>>> Ah, what you're probably missing is that for file pages/folios, they
>>>> have to be naturally aligned. The data structure underlying the
>>>> page cache simply can't cope with askew pages. (It kind of can under
>>>> some circumstances that are so complicated that they shouldn't be
>>>> explained, and it's far easier just to say "folios must be naturally
>>>> aligned within the file")
>>>
>>> I'm trying to understand what you mean by "naturally aligned" here.
>>> I'm operating under the assumption that all file pages map to
>>> page-sized offsets within a file (e.g. linear_page_address()) and that
>>> files are mapped at a page-aligned address. In the event we want to
>>> collapse file-backed memory, if the virtual address of said memory is
>>> hugepage-aligned, I don't think it's necessary that the address maps
>>> to a hugepage-sized offset in the file? I.e. on x86, the file could
>> Hi, Zach
>>
>> I'm not sure I get your question right. We submitted a patch set to
>> support file-backed THP so that it can be used transparently, like
>> anonymous THP [1].
>>
>> [1] https://lore.kernel.org/linux-mm/20211009092658.59665-2-rongwei.wang@linux.alibaba.com/
>>
>> In this patch, I remember that we need to check whether '(vma->vm_start >>
>> PAGE_SHIFT) - vma->vm_pgoff' is aligned with HPAGE_PMD_NR, like:
>>
>> +static inline bool vma_is_hugetext(struct vm_area_struct *vma,
>> +                                   unsigned long vm_flags)
>> +{
>> +        if (!(vm_flags & VM_EXEC))
>> +                return false;
>> +
>> +        if (vma->vm_file && !inode_is_open_for_write(vma->vm_file->f_inode))
>> +                return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
>> +                                  HPAGE_PMD_NR);
>> +
>> +        return false;
>> +}
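>>
>> In other words, the check computes (in pages) the virtual address at
>> which file offset 0 would land, and requires that to be aligned to
>> HPAGE_PMD_NR, so that file offsets and virtual addresses stay congruent
>> modulo the huge page size. A worked example with made-up values:
>>
>> /*
>>  * vm_start = 0x600200000, vm_pgoff = 0x200 (file offset 2MiB):
>>  *   (0x600200000 >> PAGE_SHIFT) - 0x200 = 0x600200 - 0x200 = 0x600000
>>  *   0x600000 % HPAGE_PMD_NR (512) == 0  =>  check passes
>>  */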
>>
>> This is a little different from anon THP.
>>> itself be mapped to the start of the last page in a 2MiB region, X,
>>
>> Maybe it's related to the ELF 'Align' parameter on your current system?
>> If 'Align' is set to 2MB (check with 'readelf -l /path/exec'), it probably
>> meets the above alignment check.
>>
>> The default Align value depends on the binutils version, and it can also
>> be set at link time with the '-z max-page-size=<align size>' option.
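>>
>> For example (illustrative command lines, not from this thread): building
>> with 'gcc -Wl,-z,max-page-size=0x200000 ...' forces 2MiB segment
>> alignment, and 'readelf -l /path/exec | grep -A1 LOAD' then shows
>> 0x200000 in the Align column.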
>>
>> Hope it is helpful :)
>>
>> -wrw
>
> Hey Rongwei!
>
> Thanks for the code / help. Took a little bit, but hughd has
> enlightened me on the problem (thanks Hugh!). Likewise, apologies for
> not understanding your previous comment regarding folio alignment,
> Matthew.
>
> Also, thanks for linking your patchset, and sorry for missing it
> previously. It seems we're interested in the same problem! Hopefully
> this work can be beneficial to your use case as well.
Hi Zach!
Thank you, too. Recently, we have been trying to use process_madvise() +
DAMON to find hot .text, especially on x86, and then collapse it into
huge pages. It seems that process_madvise(MADV_COLLAPSE) is feasible for
that.
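
A minimal sketch of what we have in mind (untested; MADV_COLLAPSE is not
in released uapi headers yet, so the value below is assumed from the
proposed patches and may change):

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25        /* assumed value from the MADV_COLLAPSE series */
#endif

/* Ask the kernel to collapse [addr, addr + len) in process 'pid'. */
static int collapse_remote(pid_t pid, void *addr, size_t len)
{
        struct iovec iov = { .iov_base = addr, .iov_len = len };
        int pidfd = syscall(SYS_pidfd_open, pid, 0);
        long ret;

        if (pidfd < 0)
                return -1;
        ret = syscall(SYS_process_madvise, pidfd, &iov, 1, MADV_COLLAPSE, 0);
        close(pidfd);
        return ret < 0 ? -1 : 0;
}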
Anyway, thanks for your nice work.
-wrw
>
> Thanks again for your time,
> Zach
>
>
>
>>> and that wouldn't prevent us from collapsing the 2MiB region starting
>>> at X+4KiB.
>>>
>>>>>>>> shmem still expects folios to be of order either 0 or PMD_ORDER.
>>>>>>>> That assumption extends into the swap code and I haven't had the heart
>>>>>>>> to go and fix all those places yet. Plus Neil was doing major surgery
>>>>>>>> to the swap code in the most recent development cycle and I didn't want
>>>>>>>> to get in his way.
>>>>>>>>
>>>>>>>> So I am absolutely fine with khugepaged allocating a PMD-size folio for
>>>>>>>> any inode that claims mapping_large_folio_support(). If any filesystems
>>>>>>>> break, we'll fix them.
>>>>>>>
>>>>>>> Just for clarification, what is the equivalent code today that
>>>>>>> enforces mapping_large_folio_support()? I.e. today, khugepaged can
>>>>>>> successfully collapse file without checking if the inode supports it
>>>>>>> (we only check that it's a regular file not opened for writing).
>>>>>>
>>>>>> Yeah, that's a dodgy hack which needs to go away. But we need a lot
>>>>>> more filesystems converted to supporting large folios before we can
>>>>>> delete it. Not your responsibility; I'm doing my best to encourage
>>>>>> fs maintainers to do this part.
>>>>>
>>>>> Got it. In the meantime, do we want to check the old conditions +
>>>>> mapping_large_folio_support()?
>>>>
>>>> Yes, that should work. khugepaged should be free to create large
>>>> folios if the underlying filesystem supports them OR (executable,
>>>> read-only, CONFIG_THP_RO, etc, etc).
>>>
>>> Thanks for confirming!
>>>
>>>>>>> Also, just to check, there isn't anything wrong with following
>>>>>>> collapse_file()'s approach, even for folios of 0 < order <
>>>>>>> HPAGE_PMD_ORDER? I.e this part:
>>>>>>>
>>>>>>> * Basic scheme is simple, details are more complex:
>>>>>>> * - allocate and lock a new huge page;
>>>>>>> * - scan page cache replacing old pages with the new one
>>>>>>> * + swap/gup in pages if necessary;
>>>>>>> * + fill in gaps;
>>>>>>> * + keep old pages around in case rollback is required;
>>>>>>> * - if replacing succeeds:
>>>>>>> * + copy data over;
>>>>>>> * + free old pages;
>>>>>>> * + unlock huge page;
>>>>>>> * - if replacing failed;
>>>>>>> * + put all pages back and unfreeze them;
>>>>>>> * + restore gaps in the page cache;
>>>>>>> * + unlock and free huge page;
>>>>>>> */
>>>>>>
>>>>>> Correct. At least, as far as I know! Working on folios has been quite
>>>>>> the education for me ...
>>>>>
>>>>> Great! Well, perhaps I'll run into a snafu here or there (and
>>>>> hopefully learn something myself) but this gives me enough confidence
>>>>> to naively give it a try and see what happens!
>>>>>
>>>>> Again, thank you very much for your time, help and advice with this,
>>>>
>>>> You're welcome! Thanks for putting in some work on this project!
>>>
>>> No problem! Hopefully this can benefit a bunch of existing users.
>>>
>>> Thanks again,
>>> Zach