* [PATCH] mm/thp: fix "mm: thp: kill __transhuge_page_enabled()"
@ 2023-08-12 21:00 Zach O'Keefe
  2023-08-12 21:24 ` Zach O'Keefe
  2023-08-13  6:19 ` [EXTERNAL] " Saurabh Singh Sengar
  0 siblings, 2 replies; 20+ messages in thread
From: Zach O'Keefe @ 2023-08-12 21:00 UTC (permalink / raw)
  To: linux-mm, Yang Shi; +Cc: linux-kernel, Zach O'Keefe, Saurabh Singh Sengar

The 6.0 commits:

commit 9fec51689ff6 ("mm: thp: kill transparent_hugepage_active()")
commit 7da4e2cb8b1f ("mm: thp: kill __transhuge_page_enabled()")

merged "can we have THPs in this VMA?" logic that was previously done
separately by fault-path, khugepaged, and smaps "THPeligible".

During the process, the check on VM_NO_KHUGEPAGED from the khugepaged
path was accidentally added to fault and smaps paths.  Certainly the
previous behavior for fault should be restored, and since smaps should
report the union of THP eligibility for fault and khugepaged, also opt
smaps out of this constraint.

Fixes: 7da4e2cb8b1f ("mm: thp: kill __transhuge_page_enabled()")
Reported-by: Saurabh Singh Sengar <ssengar@microsoft.com>
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Cc: Yang Shi <shy828301@gmail.com>
---
 mm/huge_memory.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index eb3678360b97..e098c26d5e2e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -96,11 +96,11 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
 		return in_pf;
 
 	/*
-	 * Special VMA and hugetlb VMA.
+	 * khugepaged check for special VMA and hugetlb VMA.
 	 * Must be checked after dax since some dax mappings may have
 	 * VM_MIXEDMAP set.
 	 */
-	if (vm_flags & VM_NO_KHUGEPAGED)
+	if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
 		return false;
 
 	/*
-- 
2.41.0.694.ge786442a9b-goog


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm/thp: fix "mm: thp: kill __transhuge_page_enabled()"
  2023-08-12 21:00 [PATCH] mm/thp: fix "mm: thp: kill __transhuge_page_enabled()" Zach O'Keefe
@ 2023-08-12 21:24 ` Zach O'Keefe
  2023-08-13  6:19 ` [EXTERNAL] " Saurabh Singh Sengar
  1 sibling, 0 replies; 20+ messages in thread
From: Zach O'Keefe @ 2023-08-12 21:24 UTC (permalink / raw)
  To: linux-mm, Yang Shi; +Cc: linux-kernel, Saurabh Singh Sengar

On Sat, Aug 12, 2023 at 2:01 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> The 6.0 commits:
>
> commit 9fec51689ff6 ("mm: thp: kill transparent_hugepage_active()")
> commit 7da4e2cb8b1f ("mm: thp: kill __transhuge_page_enabled()")
>
> merged "can we have THPs in this VMA?" logic that was previously done
> separately by fault-path, khugepaged, and smaps "THPeligible".
>
> During the process, the check on VM_NO_KHUGEPAGED from the khugepaged
> path was accidentally added to fault and smaps paths.  Certainly the
> previous behavior for fault should be restored, and since smaps should
> report the union of THP eligibility for fault and khugepaged, also opt
> smaps out of this constraint.
>
> Fixes: 7da4e2cb8b1f ("mm: thp: kill __transhuge_page_enabled()")
> Reported-by: Saurabh Singh Sengar <ssengar@microsoft.com>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> Cc: Yang Shi <shy828301@gmail.com>
> ---
>  mm/huge_memory.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index eb3678360b97..e098c26d5e2e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -96,11 +96,11 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
>                 return in_pf;
>
>         /*
> -        * Special VMA and hugetlb VMA.
> +        * khugepaged check for special VMA and hugetlb VMA.
>          * Must be checked after dax since some dax mappings may have
>          * VM_MIXEDMAP set.
>          */
> -       if (vm_flags & VM_NO_KHUGEPAGED)
> +       if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
>                 return false;
>
>         /*
> --
> 2.41.0.694.ge786442a9b-goog
>

I should note that this was discussed before[1], and VM_MIXEDMAP was
called out then, but we didn't have any use cases.

What was reported broken by Saurabh was an out-of-tree driver that
relies on being able to fault in THPs over VM_HUGEPAGE|VM_MIXEDMAP
VMAs. We mentioned back then we could always opt fault-path out of
this check in the future, and it seems like we should.
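The behavioral change can be modeled in userspace for concreteness (a
standalone sketch; the VM_NO_KHUGEPAGED value and these helper names are
illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>

#define VM_NO_KHUGEPAGED 0x1UL  /* stand-in for the kernel's flag set */

/* Before the fix: every caller of hugepage_vma_check() -- fault, smaps,
 * and khugepaged -- was rejected by the khugepaged-only restriction. */
static bool eligible_before(unsigned long vm_flags, bool smaps, bool in_pf)
{
	(void)smaps;
	(void)in_pf;
	return !(vm_flags & VM_NO_KHUGEPAGED);
}

/* After the fix: only the khugepaged path honours VM_NO_KHUGEPAGED. */
static bool eligible_after(unsigned long vm_flags, bool smaps, bool in_pf)
{
	if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
		return false;
	return true;
}
```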

To that end, should this be added to stable?

Apologies, I should have added this context to the commit log.

Best,
Zach

[1] https://lore.kernel.org/linux-mm/YqdPmitColnzlXJ0@google.com/


* RE: [EXTERNAL] [PATCH] mm/thp: fix "mm: thp: kill __transhuge_page_enabled()"
  2023-08-12 21:00 [PATCH] mm/thp: fix "mm: thp: kill __transhuge_page_enabled()" Zach O'Keefe
  2023-08-12 21:24 ` Zach O'Keefe
@ 2023-08-13  6:19 ` Saurabh Singh Sengar
  2023-08-14 18:47   ` Zach O'Keefe
  1 sibling, 1 reply; 20+ messages in thread
From: Saurabh Singh Sengar @ 2023-08-13  6:19 UTC (permalink / raw)
  To: Zach O'Keefe, linux-mm, Yang Shi; +Cc: linux-kernel



> -----Original Message-----
> From: Zach O'Keefe <zokeefe@google.com>
> Sent: Sunday, August 13, 2023 2:31 AM
> To: linux-mm@kvack.org; Yang Shi <shy828301@gmail.com>
> Cc: linux-kernel@vger.kernel.org; Zach O'Keefe <zokeefe@google.com>;
> Saurabh Singh Sengar <ssengar@microsoft.com>
> Subject: [EXTERNAL] [PATCH] mm/thp: fix "mm: thp: kill
> __transhuge_page_enabled()"
> 
> The 6.0 commits:
> 
> commit 9fec51689ff6 ("mm: thp: kill transparent_hugepage_active()") commit
> 7da4e2cb8b1f ("mm: thp: kill __transhuge_page_enabled()")
> 
> merged "can we have THPs in this VMA?" logic that was previously done
> separately by fault-path, khugepaged, and smaps "THPeligible".
> 
> During the process, the check on VM_NO_KHUGEPAGED from the
> khugepaged path was accidentally added to fault and smaps paths.  Certainly
> the previous behavior for fault should be restored, and since smaps should
> report the union of THP eligibility for fault and khugepaged, also opt smaps
> out of this constraint.
> 
> Fixes: 7da4e2cb8b1f ("mm: thp: kill __transhuge_page_enabled()")
> Reported-by: Saurabh Singh Sengar <ssengar@microsoft.com>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> Cc: Yang Shi <shy828301@gmail.com>
> ---
>  mm/huge_memory.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c index
> eb3678360b97..e098c26d5e2e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -96,11 +96,11 @@ bool hugepage_vma_check(struct vm_area_struct
> *vma, unsigned long vm_flags,
>                 return in_pf;
> 
>         /*
> -        * Special VMA and hugetlb VMA.
> +        * khugepaged check for special VMA and hugetlb VMA.
>          * Must be checked after dax since some dax mappings may have
>          * VM_MIXEDMAP set.
>          */
> -       if (vm_flags & VM_NO_KHUGEPAGED)
> +       if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
>                 return false;
> 
>         /*
> --
> 2.41.0.694.ge786442a9b-goog

Thanks for the patch. I realized that with commit 9fec51689ff60, a
!vma_is_anonymous restriction was also introduced. To make the fault
path work the same as before, we need to relax this check as well. Can
we add the below to this patch as well:

-       if (!vma_is_anonymous(vma))
+       if (!in_pf && !vma_is_anonymous(vma))
                return false;

- Saurabh



* Re: [EXTERNAL] [PATCH] mm/thp: fix "mm: thp: kill __transhuge_page_enabled()"
  2023-08-13  6:19 ` [EXTERNAL] " Saurabh Singh Sengar
@ 2023-08-14 18:47   ` Zach O'Keefe
  2023-08-14 19:06     ` Matthew Wilcox
  0 siblings, 1 reply; 20+ messages in thread
From: Zach O'Keefe @ 2023-08-14 18:47 UTC (permalink / raw)
  To: Saurabh Singh Sengar, Matthew Wilcox, Dan Williams
  Cc: linux-mm, Yang Shi, linux-kernel

On Sat, Aug 12, 2023 at 11:19 PM Saurabh Singh Sengar
<ssengar@microsoft.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Zach O'Keefe <zokeefe@google.com>
> > Sent: Sunday, August 13, 2023 2:31 AM
> > To: linux-mm@kvack.org; Yang Shi <shy828301@gmail.com>
> > Cc: linux-kernel@vger.kernel.org; Zach O'Keefe <zokeefe@google.com>;
> > Saurabh Singh Sengar <ssengar@microsoft.com>
> > Subject: [EXTERNAL] [PATCH] mm/thp: fix "mm: thp: kill
> > __transhuge_page_enabled()"
> >
> > The 6.0 commits:
> >
> > commit 9fec51689ff6 ("mm: thp: kill transparent_hugepage_active()") commit
> > 7da4e2cb8b1f ("mm: thp: kill __transhuge_page_enabled()")
> >
> > merged "can we have THPs in this VMA?" logic that was previously done
> > separately by fault-path, khugepaged, and smaps "THPeligible".
> >
> > During the process, the check on VM_NO_KHUGEPAGED from the
> > khugepaged path was accidentally added to fault and smaps paths.  Certainly
> > the previous behavior for fault should be restored, and since smaps should
> > report the union of THP eligibility for fault and khugepaged, also opt smaps
> > out of this constraint.
> >
> > Fixes: 7da4e2cb8b1f ("mm: thp: kill __transhuge_page_enabled()")
> > Reported-by: Saurabh Singh Sengar <ssengar@microsoft.com>
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > Cc: Yang Shi <shy828301@gmail.com>
> > ---
> >  mm/huge_memory.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c index
> > eb3678360b97..e098c26d5e2e 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -96,11 +96,11 @@ bool hugepage_vma_check(struct vm_area_struct
> > *vma, unsigned long vm_flags,
> >                 return in_pf;
> >
> >         /*
> > -        * Special VMA and hugetlb VMA.
> > +        * khugepaged check for special VMA and hugetlb VMA.
> >          * Must be checked after dax since some dax mappings may have
> >          * VM_MIXEDMAP set.
> >          */
> > -       if (vm_flags & VM_NO_KHUGEPAGED)
> > +       if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
> >                 return false;
> >
> >         /*
> > --
> > 2.41.0.694.ge786442a9b-goog
>
> Thanks for the patch. I realized that with commit 9fec51689ff60, a
> !vma_is_anonymous restriction was also introduced. To make the fault
> path work the same as before, we need to relax this check as well. Can
> we add the below to this patch as well:
>
> -       if (!vma_is_anonymous(vma))
> +       if (!in_pf && !vma_is_anonymous(vma))
>                 return false;

Hey Saurabh,

Thanks for pointing this out, and sorry for the mixup.

I'll try looping in some folks from DAX and fs worlds to be sure,
since my knowledge doesn't extend far into those realms.

I was under the impression that CONFIG_READ_ONLY_THP_FOR_FS was
supposed to keep the filesystem blissfully unaware of hugepages; IOW,
that assembling file-backed hugepages was supposed to be a
pagecache-only thing ... or be DAX.

The early check:

if (vma_is_dax(vma))
        return in_pf;

Should handle the DAX case.

IIUC, the check, lower down:

if (!in_pf && file_thp_enabled(vma))
        return true;

Was supposed to be the last check for eligible file-backed memory, and
here it's clear that we don't support faulting-in hugepages over
file-backed memory.
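For illustration, the ordering of those two checks can be modeled as a
standalone userspace function (a sketch only; the helper name and the
boolean parameters are stand-ins for the real vma/flag state):

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the check ordering described above: DAX VMAs return the
 * fault-path verdict early, and the file_thp_enabled() case (the
 * CONFIG_READ_ONLY_THP_FOR_FS path) only applies to khugepaged/smaps. */
static bool check_model(bool is_dax, bool in_pf, bool file_thp)
{
	if (is_dax)
		return in_pf;   /* DAX: fault path may proceed */
	if (!in_pf && file_thp)
		return true;    /* khugepaged/smaps file-THP case */
	return false;           /* no fault-in of file-backed THPs */
}
```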

Looking at current users of struct vm_operations_struct->huge_fault, I see:

drivers/dax/device.c : dev_dax_huge_fault
fs/ext4/file.c : ext4_dax_huge_fault
fs/xfs/xfs_file.c : xfs_filemap_huge_fault
fs/erofs/data.c : erofs_dax_huge_fault
fs/fuse/dax.c: fuse_dax_huge_fault

All of which *look* like they operate on DAX-backed memory (I checked
the xfs handler, it does so as well) -- so they should have been
whitelisted by the vma_is_dax() check.

All this to say, the kernel doesn't _currently_ support faulting-in
hugepages over non-DAX file-backed memory. However, it seems we don't
give that ->huge_fault handler a fair shake.

Saurabh, does your use case fall outside this?

Willy -- I'm not up-to-date on what is happening on the THP-fs front.
Should we be checking for a ->huge_fault handler here?

Thanks,
Zach

> - Saurabh
>


* Re: [EXTERNAL] [PATCH] mm/thp: fix "mm: thp: kill __transhuge_page_enabled()"
  2023-08-14 18:47   ` Zach O'Keefe
@ 2023-08-14 19:06     ` Matthew Wilcox
  2023-08-15  0:04       ` Zach O'Keefe
  0 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2023-08-14 19:06 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Saurabh Singh Sengar, Dan Williams, linux-mm, Yang Shi, linux-kernel

On Mon, Aug 14, 2023 at 11:47:50AM -0700, Zach O'Keefe wrote:
> Willy -- I'm not up-to-date on what is happening on the THP-fs front.
> Should we be checking for a ->huge_fault handler here?

Oh, thank goodness, I thought you were cc'ing me to ask a DAX question ...

From a large folios perspective, filesystems do not implement a special
handler.  They call filemap_fault() (directly or indirectly) from their
->fault handler.  If there is already a folio in the page cache which
satisfies this fault, we insert it into the page tables (no matter what
size it is).  If there is no folio, we call readahead to populate that
index in the page cache, and probably some other indices around it.
That's do_sync_mmap_readahead().

If you look at that, you'll see that we check the VM_HUGEPAGE flag, and
if set we align to a PMD boundary and read two PMD-size pages (so that we
can do async readahead for the second page, if we're doing a linear scan).
If the VM_HUGEPAGE flag isn't set, we'll use the readahead algorithm to
decide how large the folio should be that we're reading into; if it's a
random read workload, we'll stick to order-0 pages, but if we're getting
good hit rate from the linear scan, we'll increase the size (although
we won't go past PMD size).
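A rough standalone model of that order ramp-up (the exact increment and
clamp live in the kernel's readahead code, e.g. page_cache_ra_order();
the numbers below assume x86-64 with 4K base pages and are illustrative
only):

```c
#include <assert.h>

#define PMD_ORDER 9  /* 2MB / 4K, assuming x86-64 with 4K base pages */

/* Sketch: on a sequential scan with a good hit rate, readahead grows
 * the folio order it allocates, clamped so it never exceeds PMD size. */
static unsigned int next_folio_order(unsigned int prev_order)
{
	unsigned int order = prev_order + 2;  /* grow on each hit */
	return order > PMD_ORDER ? PMD_ORDER : order;
}
```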

There's also the ->map_pages() optimisation which handles page faults
locklessly, and will fail back to ->fault() if there's even a light
breeze.  I don't think that's of any particular use in answering your
question, so I'm not going into details about it.

I'm not sure I understand the code that's being modified well enough to
be able to give you a straight answer to your question, but hopefully
this is helpful to you.


* Re: [EXTERNAL] [PATCH] mm/thp: fix "mm: thp: kill __transhuge_page_enabled()"
  2023-08-14 19:06     ` Matthew Wilcox
@ 2023-08-15  0:04       ` Zach O'Keefe
  2023-08-15  2:24         ` Matthew Wilcox
  2023-08-16 16:49         ` Saurabh Singh Sengar
  0 siblings, 2 replies; 20+ messages in thread
From: Zach O'Keefe @ 2023-08-15  0:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Saurabh Singh Sengar, Dan Williams, linux-mm, Yang Shi, linux-kernel

On Mon, Aug 14, 2023 at 12:06 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Aug 14, 2023 at 11:47:50AM -0700, Zach O'Keefe wrote:
> > Willy -- I'm not up-to-date on what is happening on the THP-fs front.
> > Should we be checking for a ->huge_fault handler here?
>
> Oh, thank goodness, I thought you were cc'ing me to ask a DAX question ...

:)

> From a large folios perspective, filesystems do not implement a special
> handler.  They call filemap_fault() (directly or indirectly) from their
> ->fault handler.  If there is already a folio in the page cache which
> satisfies this fault, we insert it into the page tables (no matter what
> size it is).  If there is no folio, we call readahead to populate that
> index in the page cache, and probably some other indices around it.
> That's do_sync_mmap_readahead().
>
> If you look at that, you'll see that we check the VM_HUGEPAGE flag, and
> if set we align to a PMD boundary and read two PMD-size pages (so that we
> can do async readahead for the second page, if we're doing a linear scan).
> If the VM_HUGEPAGE flag isn't set, we'll use the readahead algorithm to
> decide how large the folio should be that we're reading into; if it's a
> random read workload, we'll stick to order-0 pages, but if we're getting
> good hit rate from the linear scan, we'll increase the size (although
> we won't go past PMD size)
>
> There's also the ->map_pages() optimisation which handles page faults
> locklessly, and will fail back to ->fault() if there's even a light
> breeze.  I don't think that's of any particular use in answering your
> question, so I'm not going into details about it.
>
> I'm not sure I understand the code that's being modified well enough to
> be able to give you a straight answer to your question, but hopefully
> this is helpful to you.

Thank you, this was great info. I had thought, incorrectly, that large
folio work would eventually tie into that ->huge_fault() handler
(should that be dax_huge_fault()?).

If that's the case, then faulting file-backed, non-DAX memory as
(pmd-mapped-)THPs isn't supported at all, and no fault lies with the
aforementioned patches.

Saurabh, perhaps you can elaborate on your use case a bit more, and
how that anonymous check broke you?

Best,
Zach


* Re: [EXTERNAL] [PATCH] mm/thp: fix "mm: thp: kill __transhuge_page_enabled()"
  2023-08-15  0:04       ` Zach O'Keefe
@ 2023-08-15  2:24         ` Matthew Wilcox
  2023-08-16 16:52           ` Saurabh Singh Sengar
  2023-08-16 21:31           ` Zach O'Keefe
  2023-08-16 16:49         ` Saurabh Singh Sengar
  1 sibling, 2 replies; 20+ messages in thread
From: Matthew Wilcox @ 2023-08-15  2:24 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Saurabh Singh Sengar, Dan Williams, linux-mm, Yang Shi, linux-kernel

On Mon, Aug 14, 2023 at 05:04:47PM -0700, Zach O'Keefe wrote:
> > From a large folios perspective, filesystems do not implement a special
> > handler.  They call filemap_fault() (directly or indirectly) from their
> > ->fault handler.  If there is already a folio in the page cache which
> > satisfies this fault, we insert it into the page tables (no matter what
> > size it is).  If there is no folio, we call readahead to populate that
> > index in the page cache, and probably some other indices around it.
> > That's do_sync_mmap_readahead().
> >
> > If you look at that, you'll see that we check the VM_HUGEPAGE flag, and
> > if set we align to a PMD boundary and read two PMD-size pages (so that we
> > can do async readahead for the second page, if we're doing a linear scan).
> > If the VM_HUGEPAGE flag isn't set, we'll use the readahead algorithm to
> > decide how large the folio should be that we're reading into; if it's a
> > random read workload, we'll stick to order-0 pages, but if we're getting
> > good hit rate from the linear scan, we'll increase the size (although
> > we won't go past PMD size)
> >
> > There's also the ->map_pages() optimisation which handles page faults
> > locklessly, and will fail back to ->fault() if there's even a light
> > breeze.  I don't think that's of any particular use in answering your
> > question, so I'm not going into details about it.
> >
> > I'm not sure I understand the code that's being modified well enough to
> > be able to give you a straight answer to your question, but hopefully
> > this is helpful to you.
> 
> Thank you, this was great info. I had thought, incorrectly, that large
> folio work would eventually tie into that ->huge_fault() handler
> (should be dax_huge_fault() ?)
> 
> If that's the case, then faulting file-backed, non-DAX memory as
> (pmd-mapped-)THPs isn't supported at all, and no fault lies with the
> aforementioned patches.

Ah, wait, hang on.  You absolutely can get a PMD mapping by calling into
->fault.  Look at how finish_fault() works:

        if (pmd_none(*vmf->pmd)) {
                if (PageTransCompound(page)) {
                        ret = do_set_pmd(vmf, page);
                        if (ret != VM_FAULT_FALLBACK)
                                return ret;
                }

                if (vmf->prealloc_pte)
                        pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte);

So if we find a large folio that is PMD mappable, and there's nothing
at vmf->pmd, we install a PMD-sized mapping at that spot.  If that
fails, we install the preallocated PTE table at vmf->pmd and continue
by trying to set one or more PTEs to satisfy this page fault.

So why, you may be asking, do we have ->huge_fault.  Well, you should
ask the clown who did commit b96375f74a6d ... in fairness to me,
finish_fault() did not exist at the time, and the ability to return
a PMD-sized page was added later.
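The fallback described here can be modeled as a standalone decision
function (hypothetical types and names; the real code operates on
vmf/page state and fault return codes):

```c
#include <assert.h>
#include <stdbool.h>

enum map_result { MAPPED_PMD, MAPPED_PTE };

/* Model of the quoted finish_fault() flow: if the PMD slot is empty and
 * the folio is PMD-mappable, try do_set_pmd(); if that returns
 * VM_FAULT_FALLBACK, install the preallocated PTE table and map with
 * PTEs instead. */
static enum map_result finish_fault_model(bool pmd_slot_empty,
					  bool page_trans_compound,
					  bool set_pmd_succeeds)
{
	if (pmd_slot_empty && page_trans_compound && set_pmd_succeeds)
		return MAPPED_PMD;
	return MAPPED_PTE;  /* pmd_install() + set PTE(s) */
}
```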



* RE: [EXTERNAL] [PATCH] mm/thp: fix "mm: thp: kill __transhuge_page_enabled()"
  2023-08-15  0:04       ` Zach O'Keefe
  2023-08-15  2:24         ` Matthew Wilcox
@ 2023-08-16 16:49         ` Saurabh Singh Sengar
  1 sibling, 0 replies; 20+ messages in thread
From: Saurabh Singh Sengar @ 2023-08-16 16:49 UTC (permalink / raw)
  To: Zach O'Keefe, Matthew Wilcox
  Cc: Dan Williams, linux-mm, Yang Shi, linux-kernel



> -----Original Message-----
> From: Zach O'Keefe <zokeefe@google.com>
> Sent: Tuesday, August 15, 2023 5:35 AM
> To: Matthew Wilcox <willy@infradead.org>
> Cc: Saurabh Singh Sengar <ssengar@microsoft.com>; Dan Williams
> <dan.j.williams@intel.com>; linux-mm@kvack.org; Yang Shi
> <shy828301@gmail.com>; linux-kernel@vger.kernel.org
> Subject: Re: [EXTERNAL] [PATCH] mm/thp: fix "mm: thp: kill
> __transhuge_page_enabled()"
> 
> On Mon, Aug 14, 2023 at 12:06 PM Matthew Wilcox <willy@infradead.org>
> wrote:
> >
> > On Mon, Aug 14, 2023 at 11:47:50AM -0700, Zach O'Keefe wrote:
> > > Willy -- I'm not up-to-date on what is happening on the THP-fs front.
> > > Should we be checking for a ->huge_fault handler here?
> >
> > Oh, thank goodness, I thought you were cc'ing me to ask a DAX question ...
> 
> :)
> 
> > From a large folios perspective, filesystems do not implement a
> > special handler.  They call filemap_fault() (directly or indirectly)
> > from their
> > ->fault handler.  If there is already a folio in the page cache which
> > satisfies this fault, we insert it into the page tables (no matter
> > what size it is).  If there is no folio, we call readahead to populate
> > that index in the page cache, and probably some other indices around it.
> > That's do_sync_mmap_readahead().
> >
> > If you look at that, you'll see that we check the VM_HUGEPAGE flag,
> > and if set we align to a PMD boundary and read two PMD-size pages (so
> > that we can do async readahead for the second page, if we're doing a linear
> scan).
> > If the VM_HUGEPAGE flag isn't set, we'll use the readahead algorithm
> > to decide how large the folio should be that we're reading into; if
> > it's a random read workload, we'll stick to order-0 pages, but if
> > we're getting good hit rate from the linear scan, we'll increase the
> > size (although we won't go past PMD size)
> >
> > There's also the ->map_pages() optimisation which handles page faults
> > locklessly, and will fail back to ->fault() if there's even a light
> > breeze.  I don't think that's of any particular use in answering your
> > question, so I'm not going into details about it.
> >
> > I'm not sure I understand the code that's being modified well enough
> > to be able to give you a straight answer to your question, but
> > hopefully this is helpful to you.
> 
> Thank you, this was great info. I had thought, incorrectly, that large folio work
> would eventually tie into that ->huge_fault() handler (should be
> dax_huge_fault() ?)
> 
> If that's the case, then faulting file-backed, non-DAX memory as (pmd-
> mapped-)THPs isn't supported at all, and no fault lies with the
> aforementioned patches.
> 
> Saurabh, perhaps you can elaborate on your use case a bit more, and how
> that anonymous check broke you?

Zach,

We have an out-of-tree driver that maps huge pages through a file handle and
relies on ->huge_fault. It used to work on 5.19 kernels, but 6.1 changed this
behaviour.

I don't think restoring the earlier fault-path behaviour for huge pages should
impact the kernel negatively.

- Saurabh

> 
> Best,
> Zach


* RE: [EXTERNAL] [PATCH] mm/thp: fix "mm: thp: kill __transhuge_page_enabled()"
  2023-08-15  2:24         ` Matthew Wilcox
@ 2023-08-16 16:52           ` Saurabh Singh Sengar
  2023-08-16 21:47             ` Zach O'Keefe
  2023-08-16 21:31           ` Zach O'Keefe
  1 sibling, 1 reply; 20+ messages in thread
From: Saurabh Singh Sengar @ 2023-08-16 16:52 UTC (permalink / raw)
  To: Matthew Wilcox, Zach O'Keefe
  Cc: Dan Williams, linux-mm, Yang Shi, linux-kernel



> -----Original Message-----
> From: Matthew Wilcox <willy@infradead.org>
> Sent: Tuesday, August 15, 2023 7:55 AM
> To: Zach O'Keefe <zokeefe@google.com>
> Cc: Saurabh Singh Sengar <ssengar@microsoft.com>; Dan Williams
> <dan.j.williams@intel.com>; linux-mm@kvack.org; Yang Shi
> <shy828301@gmail.com>; linux-kernel@vger.kernel.org
> Subject: Re: [EXTERNAL] [PATCH] mm/thp: fix "mm: thp: kill
> __transhuge_page_enabled()"
> 
> On Mon, Aug 14, 2023 at 05:04:47PM -0700, Zach O'Keefe wrote:
> > > From a large folios perspective, filesystems do not implement a
> > > special handler.  They call filemap_fault() (directly or indirectly)
> > > from their
> > > ->fault handler.  If there is already a folio in the page cache
> > > ->which
> > > satisfies this fault, we insert it into the page tables (no matter
> > > what size it is).  If there is no folio, we call readahead to
> > > populate that index in the page cache, and probably some other indices
> around it.
> > > That's do_sync_mmap_readahead().
> > >
> > > If you look at that, you'll see that we check the VM_HUGEPAGE flag,
> > > and if set we align to a PMD boundary and read two PMD-size pages
> > > (so that we can do async readahead for the second page, if we're doing a
> linear scan).
> > > If the VM_HUGEPAGE flag isn't set, we'll use the readahead algorithm
> > > to decide how large the folio should be that we're reading into; if
> > > it's a random read workload, we'll stick to order-0 pages, but if
> > > we're getting good hit rate from the linear scan, we'll increase the
> > > size (although we won't go past PMD size)
> > >
> > > There's also the ->map_pages() optimisation which handles page
> > > faults locklessly, and will fail back to ->fault() if there's even a
> > > light breeze.  I don't think that's of any particular use in
> > > answering your question, so I'm not going into details about it.
> > >
> > > I'm not sure I understand the code that's being modified well enough
> > > to be able to give you a straight answer to your question, but
> > > hopefully this is helpful to you.
> >
> > Thank you, this was great info. I had thought, incorrectly, that large
> > folio work would eventually tie into that ->huge_fault() handler
> > (should be dax_huge_fault() ?)
> >
> > If that's the case, then faulting file-backed, non-DAX memory as
> > (pmd-mapped-)THPs isn't supported at all, and no fault lies with the
> > aforementioned patches.
> 
> Ah, wait, hang on.  You absolutely can get a PMD mapping by calling into
> ->fault.  Look at how finish_fault() works:
> 
>         if (pmd_none(*vmf->pmd)) {
>                 if (PageTransCompound(page)) {
>                         ret = do_set_pmd(vmf, page);
>                         if (ret != VM_FAULT_FALLBACK)
>                                 return ret;
>                 }
> 
>                 if (vmf->prealloc_pte)
>                         pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte);
> 
> So if we find a large folio that is PMD mappable, and there's nothing at vmf-
> >pmd, we install a PMD-sized mapping at that spot.  If that fails, we install the
> preallocated PTE table at vmf->pmd and continue to trying set one or more
> PTEs to satisfy this page fault.
> 
> So why, you may be asking, do we have ->huge_fault.  Well, you should ask
> the clown who did commit b96375f74a6d ... in fairness to me,
> finish_fault() did not exist at the time, and the ability to return a PMD-sized
> page was added later.

Do you think we can restore this earlier behaviour of the kernel to allow
page faults for huge pages via ->huge_fault?


* Re: [EXTERNAL] [PATCH] mm/thp: fix "mm: thp: kill __transhuge_page_enabled()"
  2023-08-15  2:24         ` Matthew Wilcox
  2023-08-16 16:52           ` Saurabh Singh Sengar
@ 2023-08-16 21:31           ` Zach O'Keefe
  2023-08-17 12:18             ` Matthew Wilcox
  1 sibling, 1 reply; 20+ messages in thread
From: Zach O'Keefe @ 2023-08-16 21:31 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Saurabh Singh Sengar, Dan Williams, linux-mm, Yang Shi, linux-kernel

On Mon, Aug 14, 2023 at 7:24 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Aug 14, 2023 at 05:04:47PM -0700, Zach O'Keefe wrote:
> > > From a large folios perspective, filesystems do not implement a special
> > > handler.  They call filemap_fault() (directly or indirectly) from their
> > > ->fault handler.  If there is already a folio in the page cache which
> > > satisfies this fault, we insert it into the page tables (no matter what
> > > size it is).  If there is no folio, we call readahead to populate that
> > > index in the page cache, and probably some other indices around it.
> > > That's do_sync_mmap_readahead().
> > >
> > > If you look at that, you'll see that we check the VM_HUGEPAGE flag, and
> > > if set we align to a PMD boundary and read two PMD-size pages (so that we
> > > can do async readahead for the second page, if we're doing a linear scan).
> > > If the VM_HUGEPAGE flag isn't set, we'll use the readahead algorithm to
> > > decide how large the folio should be that we're reading into; if it's a
> > > random read workload, we'll stick to order-0 pages, but if we're getting
> > > good hit rate from the linear scan, we'll increase the size (although
> > > we won't go past PMD size)
> > >
> > > There's also the ->map_pages() optimisation which handles page faults
> > > locklessly, and will fail back to ->fault() if there's even a light
> > > breeze.  I don't think that's of any particular use in answering your
> > > question, so I'm not going into details about it.
> > >
> > > I'm not sure I understand the code that's being modified well enough to
> > > be able to give you a straight answer to your question, but hopefully
> > > this is helpful to you.
> >
> > Thank you, this was great info. I had thought, incorrectly, that large
> > folio work would eventually tie into that ->huge_fault() handler
> > (should be dax_huge_fault() ?)
> >
> > If that's the case, then faulting file-backed, non-DAX memory as
> > (pmd-mapped-)THPs isn't supported at all, and no fault lies with the
> > aforementioned patches.
>
> Ah, wait, hang on.  You absolutely can get a PMD mapping by calling into
> ->fault.  Look at how finish_fault() works:
>
>         if (pmd_none(*vmf->pmd)) {
>                 if (PageTransCompound(page)) {
>                         ret = do_set_pmd(vmf, page);
>                         if (ret != VM_FAULT_FALLBACK)
>                                 return ret;
>                 }
>
>                 if (vmf->prealloc_pte)
>                         pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte);
>
> So if we find a large folio that is PMD mappable, and there's nothing
> at vmf->pmd, we install a PMD-sized mapping at that spot.  If that
> fails, we install the preallocated PTE table at vmf->pmd and continue
> trying to set one or more PTEs to satisfy this page fault.

Aha! I see. I did not expect ->fault() to have this logic, as I had
incorrectly thought (aka assumed) the pmd vs pte-mapping logic split
at create_huge_pmd(); i.e. do_huge_pmd_anonymous_page(), or
->huge_fault(), or fallback to pte-mapping. It seems very weird to me
that hugepage_vma_check() "artificially" says "no" to file and shmem
along the fault path, so they can go and do their own thing in
->fault().

But this has been the way non-anon has supported THP fault since the
start ... with Kirill's 4.8 commit 101024596441 ("mm: introduce
do_set_pmd()") as part of the original "THP-enabled tmpfs/shmem using
compound pages" series. I just did not know about it :/

But thanks for prompting this -- I learnt a lot reading further down
do_fault(). shmem's ability to fault THP will depend on how the file
was constructed to begin with (i.e. the huge= mount option). For file,
our ability to pmd-map the folio will depend on whether the folio was
assembled in the pagecache as a large folio or not -- which depends on
the fs' AS_LARGE_FOLIO_SUPPORT (which only xfs, erofs, and afs
support today).
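As a rough illustration of that fs-side opt-in, here is a toy userspace model. It is not the kernel's actual implementation -- in the real kernel (include/linux/pagemap.h), a filesystem calls mapping_set_large_folios() at inode setup, which sets the AS_LARGE_FOLIO_SUPPORT bit in address_space->flags, and the page cache consults mapping_large_folio_support() before allocating large folios. The bit position below is made up for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the pagecache large-folio opt-in discussed above.
 * Mirrors the shape of mapping_set_large_folios() /
 * mapping_large_folio_support(); the flag value is illustrative only.
 */
#define AS_LARGE_FOLIO_SUPPORT (1UL << 6)

struct address_space {
	unsigned long flags;
};

/* Filesystem calls this once at inode setup to opt in. */
static void mapping_set_large_folios(struct address_space *mapping)
{
	mapping->flags |= AS_LARGE_FOLIO_SUPPORT;
}

/* Page cache checks this before assembling a large folio. */
static bool mapping_large_folio_support(const struct address_space *mapping)
{
	return mapping->flags & AS_LARGE_FOLIO_SUPPORT;
}
```

Without the opt-in, reads fill the mapping with order-0 pages only, so there is never a PMD-sized folio for finish_fault() to install.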

I tested things on xfs, and it does actually work. Cool :)

IIUC then, there is a bug in smaps THPeligible code when
CONFIG_READ_ONLY_THP_FOR_FS is not set. Not obvious, but apparently
this config is (according to its Kconfig desc) khugepaged-only, so it
should be fine for it to be disabled, yet allow
do_sync_mmap_readahead() to install a pmd for file-backed memory.
hugepage_vma_check() will need to be patched to fix this.

But I have a larger question for you: should we care about
/sys/kernel/mm/transparent_hugepage/enabled for file-fault? We
currently don't. Seems weird that we can transparently get a hugepage
when THP="never". Also, if THP="always", we might as well skip the
VM_HUGEPAGE check, and try the final pmd install (and save khugepaged
the trouble of attempting it later).
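To make that concrete, the policy being floated here could look something like the following. This is a standalone sketch of one possible policy, not what the kernel currently does (today the file-fault path ignores the knob entirely):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical file-fault policy sketched above: honour "never",
 * let "always" skip the VM_HUGEPAGE check, and otherwise ("madvise")
 * require the VMA to be opted in. Illustrative only -- the current
 * kernel does none of this for file faults.
 */
enum thp_mode { THP_NEVER, THP_MADVISE, THP_ALWAYS };

static bool file_fault_should_try_pmd(enum thp_mode mode, bool vm_hugepage)
{
	switch (mode) {
	case THP_NEVER:
		return false;		/* no transparent hugepages at all */
	case THP_ALWAYS:
		return true;		/* skip the VM_HUGEPAGE check */
	case THP_MADVISE:
	default:
		return vm_hugepage;	/* only VMAs marked VM_HUGEPAGE */
	}
}
```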

WDYT?

Aside: I should have brought this to the Cabal meeting today, but I
hadn't finished going through things.

>
> So why, you may be asking, do we have ->huge_fault.  Well, you should
> ask the clown who did commit b96375f74a6d ... in fairness to me,
> finish_fault() did not exist at the time, and the ability to return
> a PMD-sized page was added later.
>

:P

But that patch seems super reasonable? At least my naive initial
reading assumed exactly what that commit description says.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [EXTERNAL] [PATCH] mm/thp: fix "mm: thp: kill __transhuge_page_enabled()"
  2023-08-16 16:52           ` Saurabh Singh Sengar
@ 2023-08-16 21:47             ` Zach O'Keefe
  2023-08-17 17:46               ` Yang Shi
  0 siblings, 1 reply; 20+ messages in thread
From: Zach O'Keefe @ 2023-08-16 21:47 UTC (permalink / raw)
  To: Saurabh Singh Sengar
  Cc: Matthew Wilcox, Dan Williams, linux-mm, Yang Shi, linux-kernel

> We have an out-of-tree driver that maps huge pages through a file handle and
> relies on ->huge_fault. It used to work in 5.19 kernels, but 6.1 changed this
> behaviour.
>
> I don’t think restoring the earlier fault-path behaviour for huge pages should
> impact the kernel negatively.
>
> Do you think we can restore this earlier kernel behaviour to allow page faults
> for huge pages via ->huge_fault?

That seems reasonable to me. I think using the existence of a
->huge_fault() handler as a predicate to return "true" makes sense to
me. The "normal" flow for file-backed memory along fault path still
needs to return "false", so that we correctly fall back to ->fault()
handler. Unless there are objections, I can do that in a v2.
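For reference, the shape I have in mind is something like this. It is a cut-down userspace model, not the actual patch; the struct names mirror the kernel's, but the bodies are stand-ins:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Cut-down stand-ins for the kernel structures involved. */
struct vm_fault;
struct vm_operations_struct {
	int (*huge_fault)(struct vm_fault *vmf);
};
struct vm_area_struct {
	const struct vm_operations_struct *vm_ops;
};

/*
 * Hypothetical fault-path predicate: a VMA that supplies its own
 * ->huge_fault() handler is eligible; an ordinary file-backed VMA
 * returns false so the fault falls back to ->fault(), where
 * finish_fault()/do_set_pmd() may still install a PMD mapping.
 */
static bool fault_path_hugepage_eligible(const struct vm_area_struct *vma)
{
	return vma->vm_ops && vma->vm_ops->huge_fault;
}

/* Example instances for demonstration. */
static int demo_huge_fault(struct vm_fault *vmf) { (void)vmf; return 0; }
static const struct vm_operations_struct demo_ops = { .huge_fault = demo_huge_fault };
static const struct vm_operations_struct plain_ops = { .huge_fault = NULL };
```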

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [EXTERNAL] [PATCH] mm/thp: fix "mm: thp: kill __transhuge_page_enabled()"
  2023-08-16 21:31           ` Zach O'Keefe
@ 2023-08-17 12:18             ` Matthew Wilcox
  2023-08-17 18:13               ` Zach O'Keefe
  0 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2023-08-17 12:18 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Saurabh Singh Sengar, Dan Williams, linux-mm, Yang Shi, linux-kernel

On Wed, Aug 16, 2023 at 02:31:06PM -0700, Zach O'Keefe wrote:
> On Mon, Aug 14, 2023 at 7:24 PM Matthew Wilcox <willy@infradead.org> wrote:
> > So if we find a large folio that is PMD mappable, and there's nothing
> > at vmf->pmd, we install a PMD-sized mapping at that spot.  If that
> > fails, we install the preallocated PTE table at vmf->pmd and continue
> > trying to set one or more PTEs to satisfy this page fault.
> 
> Aha! I see. I did not expect ->fault() to have this logic, as I had
> incorrectly thought (aka assumed) the pmd vs pte-mapping logic split
> at create_huge_pmd(); i.e. do_huge_pmd_anonymous_page(), or
> ->huge_fault(), or fallback to pte-mapping. It seems very weird to me
> that hugepage_vma_check() "artificially" says "no" to file and shmem
> along the fault path, so they can go and do their own thing in
> ->fault().

Wow, hugepage_vma_check() is a very complicated function.  I'm glad I
ignored it!

> IIUC then, there is a bug in smaps THPeligible code when
> CONFIG_READ_ONLY_THP_FOR_FS is not set. Not obvious, but apparently
> this config is (according to its Kconfig desc) khugepaged-only, so it
> should be fine for it to be disabled, yet allow
> do_sync_mmap_readahead() to install a pmd for file-backed memory.
> hugepage_vma_check() will need to be patched to fix this.

I guess so ...

> But I have a larger question for you: should we care about
> /sys/kernel/mm/transparent_hugepage/enabled for file-fault? We
> currently don't. Seems weird that we can transparently get a hugepage
> when THP="never". Also, if THP="always", we might as well skip the
> VM_HUGEPAGE check, and try the final pmd install (and save khugepaged
> the trouble of attempting it later).

I deliberately ignored the humungous complexity of the THP options.
They're overgrown and make my brain hurt.  Instead, large folios are
adaptive; they observe the behaviour of the user program and choose based
on history what to do.  This is far superior to having a sysadmin tell
us what to do!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [EXTERNAL] [PATCH] mm/thp: fix "mm: thp: kill __transhuge_page_enabled()"
  2023-08-16 21:47             ` Zach O'Keefe
@ 2023-08-17 17:46               ` Yang Shi
  2023-08-17 18:29                 ` Zach O'Keefe
  0 siblings, 1 reply; 20+ messages in thread
From: Yang Shi @ 2023-08-17 17:46 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Saurabh Singh Sengar, Matthew Wilcox, Dan Williams, linux-mm,
	linux-kernel

On Wed, Aug 16, 2023 at 2:48 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> > We have an out-of-tree driver that maps huge pages through a file handle and
> > relies on ->huge_fault. It used to work in 5.19 kernels, but 6.1 changed this
> > behaviour.
> >
> > I don’t think restoring the earlier fault-path behaviour for huge pages should
> > impact the kernel negatively.
> >
> > Do you think we can restore this earlier kernel behaviour to allow page faults
> > for huge pages via ->huge_fault?
>
> That seems reasonable to me. I think using the existence of a
> ->huge_fault() handler as a predicate to return "true" makes sense to
> me. The "normal" flow for file-backed memory along fault path still
> needs to return "false", so that we correctly fall back to ->fault()
> handler. Unless there are objections, I can do that in a v2.

Sorry for chiming in late. I'm just back from vacation and trying to catch up...

IIUC the out-of-tree driver tries to allocate huge page and install
PMD mapping via huge_fault() handler, but the cleanup of
hugepage_vma_check() prevents this due to the check to
VM_NO_KHUGEPAGED?

So you would like to check whether a huge_fault() handler existed
instead of vma_is_dax()?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [EXTERNAL] [PATCH] mm/thp: fix "mm: thp: kill __transhuge_page_enabled()"
  2023-08-17 12:18             ` Matthew Wilcox
@ 2023-08-17 18:13               ` Zach O'Keefe
  2023-08-17 19:01                 ` Matthew Wilcox
  0 siblings, 1 reply; 20+ messages in thread
From: Zach O'Keefe @ 2023-08-17 18:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Saurabh Singh Sengar, Dan Williams, linux-mm, Yang Shi, linux-kernel

On Thu, Aug 17, 2023 at 5:18 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Aug 16, 2023 at 02:31:06PM -0700, Zach O'Keefe wrote:
> > On Mon, Aug 14, 2023 at 7:24 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > So if we find a large folio that is PMD mappable, and there's nothing
> > > at vmf->pmd, we install a PMD-sized mapping at that spot.  If that
> > > fails, we install the preallocated PTE table at vmf->pmd and continue
> > > trying to set one or more PTEs to satisfy this page fault.
> >
> > Aha! I see. I did not expect ->fault() to have this logic, as I had
> > incorrectly thought (aka assumed) the pmd vs pte-mapping logic split
> > at create_huge_pmd(); i.e. do_huge_pmd_anonymous_page(), or
> > ->huge_fault(), or fallback to pte-mapping. It seems very weird to me
> > that hugepage_vma_check() "artificially" says "no" to file and shmem
> > along the fault path, so they can go and do their own thing in
> > ->fault().
>
> Wow, hugepage_vma_check() is a very complicated function.  I'm glad I
> ignored it!

Ya, it's a tangly area. Far better now, though, than before Yang
centralized everything. But yes, now I need to figure out what to do
with it...

> > IIUC then, there is a bug in smaps THPeligible code when
> > CONFIG_READ_ONLY_THP_FOR_FS is not set. Not obvious, but apparently
> > this config is (according to its Kconfig desc) khugepaged-only, so it
> > should be fine for it to be disabled, yet allow
> > do_sync_mmap_readahead() to install a pmd for file-backed memory.
> > hugepage_vma_check() will need to be patched to fix this.
>
> I guess so ...

The easiest and most satisfying way to handle this -- and I think we
talked about this before -- is relaxing that complicated
file_thp_enabled() check when the file's mapping supports large
folios. I think that makes sense to me, though I don't know all the
details fs-side. Will we need any hook to give fs the chance to update
any internal state on collapse?

> > But I have a larger question for you: should we care about
> > /sys/kernel/mm/transparent_hugepage/enabled for file-fault? We
> > currently don't. Seems weird that we can transparently get a hugepage
> > when THP="never". Also, if THP="always", we might as well skip the
> > VM_HUGEPAGE check, and try the final pmd install (and save khugepaged
> > the trouble of attempting it later).
>
> I deliberately ignored the humungous complexity of the THP options.
> They're overgrown and make my brain hurt. [..]

Same

> [..] Instead, large folios are
> adaptive; they observe the behaviour of the user program and choose based
> on history what to do.  This is far superior to having a sysadmin tell
> us what to do!

I had written a bunch on this, but I arrived at the conclusion that
(a) pmd-mapping here is ~ a free win, and (b) I'm not the best person
to argue for these knobs, given MADV_COLLAPSE ignores them entirely :P

..But (sorry) what about MMF_DISABLE_THP?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [EXTERNAL] [PATCH] mm/thp: fix "mm: thp: kill __transhuge_page_enabled()"
  2023-08-17 17:46               ` Yang Shi
@ 2023-08-17 18:29                 ` Zach O'Keefe
  2023-08-18 21:21                   ` Yang Shi
  0 siblings, 1 reply; 20+ messages in thread
From: Zach O'Keefe @ 2023-08-17 18:29 UTC (permalink / raw)
  To: Yang Shi
  Cc: Saurabh Singh Sengar, Matthew Wilcox, Dan Williams, linux-mm,
	linux-kernel

On Thu, Aug 17, 2023 at 10:47 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, Aug 16, 2023 at 2:48 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > > We have an out-of-tree driver that maps huge pages through a file handle and
> > > relies on ->huge_fault. It used to work in 5.19 kernels, but 6.1 changed this
> > > behaviour.
> > >
> > > I don’t think restoring the earlier fault-path behaviour for huge pages should
> > > impact the kernel negatively.
> > >
> > > Do you think we can restore this earlier kernel behaviour to allow page faults
> > > for huge pages via ->huge_fault?
> >
> > That seems reasonable to me. I think using the existence of a
> > ->huge_fault() handler as a predicate to return "true" makes sense to
> > me. The "normal" flow for file-backed memory along fault path still
> > needs to return "false", so that we correctly fall back to ->fault()
> > handler. Unless there are objections, I can do that in a v2.
>
> Sorry for chiming in late. I'm just back from vacation and trying to catch up...
>
> IIUC the out-of-tree driver tries to allocate huge page and install
> PMD mapping via huge_fault() handler, but the cleanup of
> hugepage_vma_check() prevents this due to the check to
> VM_NO_KHUGEPAGED?
>
> So you would like to check whether a huge_fault() handler existed
> instead of vma_is_dax()?

Sorry for the multiple threads here. There are two problems: (a) the
VM_NO_KHUGEPAGED check along fault path, and (b) we don't give
->huge_fault() a fair shake, if it exists, along fault path. The
current code assumes vma_is_dax() iff ->huge_fault() exists.

(a) is easy enough to fix. For (b), I'm currently looking at the
possibility of not worrying about ->huge_fault() in
hugepage_vma_check(), and just letting create_huge_pud() /
create_huge_pmd() check and fallback as necessary. I think we'll need
the explicit DAX check still, since we want to keep khugepaged and
MADV_COLLAPSE away, and the presence / absence of ->huge_fault() isn't
enough to know that (well.. today it kind of is, but we shouldn't
depend on it).
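The split I'm describing would look roughly like the following. This is an illustrative userspace model only (the real hugepage_vma_check() takes the smaps/in_pf flags alongside many more checks); the DAX "return in_pf" and the khugepaged-only VM_NO_KHUGEPAGED check mirror the patch at the top of this thread:

```c
#include <assert.h>
#include <stdbool.h>

#define VM_NO_KHUGEPAGED 0x1UL	/* stand-in for the real flag mask */

/*
 * Toy model of the caller-dependent checks described above:
 * - DAX VMAs are eligible only on the page-fault path (in_pf), so
 *   khugepaged and MADV_COLLAPSE (in_pf == false) stay away;
 * - VM_NO_KHUGEPAGED restricts khugepaged only, not fault or smaps;
 * - the ->huge_fault() probe itself is deferred to
 *   create_huge_pmd()/create_huge_pud(). Illustrative only.
 */
static bool hugepage_vma_check_model(unsigned long vm_flags,
				     bool smaps, bool in_pf, bool is_dax)
{
	if (is_dax)
		return in_pf;	/* fault path: maybe; collapse paths: no */

	if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
		return false;	/* khugepaged-only restriction */

	return true;		/* fall through to the remaining checks */
}
```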

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [EXTERNAL] [PATCH] mm/thp: fix "mm: thp: kill __transhuge_page_enabled()"
  2023-08-17 18:13               ` Zach O'Keefe
@ 2023-08-17 19:01                 ` Matthew Wilcox
  2023-08-17 21:12                   ` Zach O'Keefe
  0 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2023-08-17 19:01 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Saurabh Singh Sengar, Dan Williams, linux-mm, Yang Shi, linux-kernel

On Thu, Aug 17, 2023 at 11:13:36AM -0700, Zach O'Keefe wrote:
> > > IIUC then, there is a bug in smaps THPeligible code when
> > > CONFIG_READ_ONLY_THP_FOR_FS is not set. Not obvious, but apparently
> > > this config is (according to its Kconfig desc) khugepaged-only, so it
> > > should be fine for it to be disabled, yet allow
> > > do_sync_mmap_readahead() to install a pmd for file-backed memory.
> > > hugepage_vma_check() will need to be patched to fix this.
> >
> > I guess so ...
> 
> The easiest and most satisfying way to handle this -- and I think we
> talked about this before -- is relaxing that complicated
> file_thp_enabled() check when the file's mapping supports large
> folios. I think that makes sense to me, though I don't know all the
> details fs-side. Will we need any hook to give fs the chance to update
> any internal state on collapse?

If the filesystem has per-folio metadata, we need to give the filesystem
the chance to set that up.  I've vaguaely been wondering about using the
->migrate_folio callback for it.  At the moment, I think it just refuses
to work if the folio isn't order-0.

> > > But I have a larger question for you: should we care about
> > > /sys/kernel/mm/transparent_hugepage/enabled for file-fault? We
> > > currently don't. Seems weird that we can transparently get a hugepage
> > > when THP="never". Also, if THP="always", we might as well skip the
> > > VM_HUGEPAGE check, and try the final pmd install (and save khugepaged
> > > the trouble of attempting it later).
> >
> > I deliberately ignored the humungous complexity of the THP options.
> > They're overgrown and make my brain hurt. [..]
> 
> Same
> 
> > [..] Instead, large folios are
> > adaptive; they observe the behaviour of the user program and choose based
> > on history what to do.  This is far superior to having a sysadmin tell
> > us what to do!
> 
> I had written a bunch on this, but I arrived at the conclusion that
> (a) pmd-mapping here is ~ a free win, and (b) I'm not the best person
> to argue for these knobs, given MADV_COLLAPSE ignores them entirely :P
> 
> ..But (sorry) what about MMF_DISABLE_THP?

Yeah, we ignore that too.  My rationale is -- as you said -- using the
PMDs is actually free, and it's really none of the app's business how
the page cache chooses to cache things.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [EXTERNAL] [PATCH] mm/thp: fix "mm: thp: kill __transhuge_page_enabled()"
  2023-08-17 19:01                 ` Matthew Wilcox
@ 2023-08-17 21:12                   ` Zach O'Keefe
  0 siblings, 0 replies; 20+ messages in thread
From: Zach O'Keefe @ 2023-08-17 21:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Saurabh Singh Sengar, Dan Williams, linux-mm, Yang Shi, linux-kernel

On Thu, Aug 17, 2023 at 12:01 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Aug 17, 2023 at 11:13:36AM -0700, Zach O'Keefe wrote:
> > > > IIUC then, there is a bug in smaps THPeligible code when
> > > > CONFIG_READ_ONLY_THP_FOR_FS is not set. Not obvious, but apparently
> > > > this config is (according to its Kconfig desc) khugepaged-only, so it
> > > > should be fine for it to be disabled, yet allow
> > > > do_sync_mmap_readahead() to install a pmd for file-backed memory.
> > > > hugepage_vma_check() will need to be patched to fix this.
> > >
> > > I guess so ...
> >
> > The easiest and most satisfying way to handle this -- and I think we
> > talked about this before -- is relaxing that complicated
> > file_thp_enabled() check when the file's mapping supports large
> > folios. I think that makes sense to me, though I don't know all the
> > details fs-side. Will we need any hook to give fs the chance to update
> > any internal state on collapse?
>
> If the filesystem has per-folio metadata, we need to give the filesystem
> the chance to set that up.  I've vaguely been wondering about using the
> ->migrate_folio callback for it.  At the moment, I think it just refuses
> to work if the folio isn't order-0.

Ok, no free lunch here then. I'll give myself a reminder to come back
and dig a little deeper. Thanks, Matthew

> > > > But I have a larger question for you: should we care about
> > > > /sys/kernel/mm/transparent_hugepage/enabled for file-fault? We
> > > > currently don't. Seems weird that we can transparently get a hugepage
> > > > when THP="never". Also, if THP="always", we might as well skip the
> > > > VM_HUGEPAGE check, and try the final pmd install (and save khugepaged
> > > > the trouble of attempting it later).
> > >
> > > I deliberately ignored the humungous complexity of the THP options.
> > > They're overgrown and make my brain hurt. [..]
> >
> > Same
> >
> > > [..] Instead, large folios are
> > > adaptive; they observe the behaviour of the user program and choose based
> > > on history what to do.  This is far superior to having a sysadmin tell
> > > us what to do!
> >
> > I had written a bunch on this, but I arrived at the conclusion that
> > (a) pmd-mapping here is ~ a free win, and (b) I'm not the best person
> > to argue for these knobs, given MADV_COLLAPSE ignores them entirely :P
> >
> > ..But (sorry) what about MMF_DISABLE_THP?
>
> Yeah, we ignore that too.  My rationale is -- as you said -- using the
> PMDs is actually free, and it's really none of the app's business how
> the page cache chooses to cache things.

What should be done to be consistent with the collapse side here, for
file/shmem, if anything? Answering the question "can this memory be
backed by THPs?" is becoming really complex, and that THPeligible smaps
field is becoming increasingly difficult to use.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [EXTERNAL] [PATCH] mm/thp: fix "mm: thp: kill __transhuge_page_enabled()"
  2023-08-17 18:29                 ` Zach O'Keefe
@ 2023-08-18 21:21                   ` Yang Shi
  2023-08-21 15:08                     ` Zach O'Keefe
  0 siblings, 1 reply; 20+ messages in thread
From: Yang Shi @ 2023-08-18 21:21 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Saurabh Singh Sengar, Matthew Wilcox, Dan Williams, linux-mm,
	linux-kernel

On Thu, Aug 17, 2023 at 11:29 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Thu, Aug 17, 2023 at 10:47 AM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Wed, Aug 16, 2023 at 2:48 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > >
> > > > We have an out-of-tree driver that maps huge pages through a file handle and
> > > > relies on ->huge_fault. It used to work in 5.19 kernels, but 6.1 changed this
> > > > behaviour.
> > > >
> > > > I don’t think restoring the earlier fault-path behaviour for huge pages should
> > > > impact the kernel negatively.
> > > >
> > > > Do you think we can restore this earlier kernel behaviour to allow page faults
> > > > for huge pages via ->huge_fault?
> > >
> > > That seems reasonable to me. I think using the existence of a
> > > ->huge_fault() handler as a predicate to return "true" makes sense to
> > > me. The "normal" flow for file-backed memory along fault path still
> > > needs to return "false", so that we correctly fall back to ->fault()
> > > handler. Unless there are objections, I can do that in a v2.
> >
> > Sorry for chiming in late. I'm just back from vacation and trying to catch up...
> >
> > IIUC the out-of-tree driver tries to allocate huge page and install
> > PMD mapping via huge_fault() handler, but the cleanup of
> > hugepage_vma_check() prevents this due to the check to
> > VM_NO_KHUGEPAGED?
> >
> > So you would like to check whether a huge_fault() handler existed
> > instead of vma_is_dax()?
>
> Sorry for the multiple threads here. There are two problems: (a) the
> VM_NO_KHUGEPAGED check along fault path, and (b) we don't give
> ->huge_fault() a fair shake, if it exists, along fault path. The
> current code assumes vma_is_dax() iff ->huge_fault() exists.
>
> (a) is easy enough to fix. For (b), I'm currently looking at the
> possibility of not worrying about ->huge_fault() in
> hugepage_vma_check(), and just letting create_huge_pud() /
> create_huge_pmd() check and fallback as necessary. I think we'll need
> the explicit DAX check still, since we want to keep khugepaged and
> MADV_COLLAPSE away, and the presence / absence of ->huge_fault() isn't
> enough to know that (well.. today it kind of is, but we shouldn't
> depend on it).

You meant something like:

if (vma->vm_ops->huge_fault) {
    if (vma_is_dax(vma))
        return in_pf;

    /* Fall through */
}

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [EXTERNAL] [PATCH] mm/thp: fix "mm: thp: kill __transhuge_page_enabled()"
  2023-08-18 21:21                   ` Yang Shi
@ 2023-08-21 15:08                     ` Zach O'Keefe
  2023-08-21 22:59                       ` Yang Shi
  0 siblings, 1 reply; 20+ messages in thread
From: Zach O'Keefe @ 2023-08-21 15:08 UTC (permalink / raw)
  To: Yang Shi
  Cc: Saurabh Singh Sengar, Matthew Wilcox, Dan Williams, linux-mm,
	linux-kernel

On Fri, Aug 18, 2023 at 2:21 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Aug 17, 2023 at 11:29 AM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > On Thu, Aug 17, 2023 at 10:47 AM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Wed, Aug 16, 2023 at 2:48 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > >
> > > > > We have an out-of-tree driver that maps huge pages through a file handle and
> > > > > relies on ->huge_fault. It used to work in 5.19 kernels, but 6.1 changed this
> > > > > behaviour.
> > > > >
> > > > > I don’t think restoring the earlier fault-path behaviour for huge pages should
> > > > > impact the kernel negatively.
> > > > >
> > > > > Do you think we can restore this earlier kernel behaviour to allow page faults
> > > > > for huge pages via ->huge_fault?
> > > >
> > > > That seems reasonable to me. I think using the existence of a
> > > > ->huge_fault() handler as a predicate to return "true" makes sense to
> > > > me. The "normal" flow for file-backed memory along fault path still
> > > > needs to return "false", so that we correctly fall back to ->fault()
> > > > handler. Unless there are objections, I can do that in a v2.
> > >
> > > Sorry for chiming in late. I'm just back from vacation and trying to catch up...
> > >
> > > IIUC the out-of-tree driver tries to allocate huge page and install
> > > PMD mapping via huge_fault() handler, but the cleanup of
> > > hugepage_vma_check() prevents this due to the check to
> > > VM_NO_KHUGEPAGED?
> > >
> > > So you would like to check whether a huge_fault() handler existed
> > > instead of vma_is_dax()?
> >
> > Sorry for the multiple threads here. There are two problems: (a) the
> > VM_NO_KHUGEPAGED check along fault path, and (b) we don't give
> > ->huge_fault() a fair shake, if it exists, along fault path. The
> > current code assumes vma_is_dax() iff ->huge_fault() exists.
> >
> > (a) is easy enough to fix. For (b), I'm currently looking at the
> > possibility of not worrying about ->huge_fault() in
> > hugepage_vma_check(), and just letting create_huge_pud() /
> > create_huge_pmd() check and fallback as necessary. I think we'll need
> > the explicit DAX check still, since we want to keep khugepaged and
> > MADV_COLLAPSE away, and the presence / absence of ->huge_fault() isn't
> > enough to know that (well.. today it kind of is, but we shouldn't
> > depend on it).
>
> You meant something like:
>
> if (vma->vm_ops->huge_fault) {
>     if (vma_is_dax(vma))
>         return in_pf;
>
>     /* Fall through */
> }

I don't think this will work for Saurabh's case, since IIUC, they
aren't using dax, but are using VM_HUGEPAGE|VM_MIXEDMAP, faulted in
using ->huge_fault()

The old (v5.19) fault path looked like:

static inline bool transhuge_vma_enabled(struct vm_area_struct *vma,
                                          unsigned long vm_flags)
{
        /* Explicitly disabled through madvise. */
        if ((vm_flags & VM_NOHUGEPAGE) ||
            test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
                return false;
        return true;
}

/*
 * to be used on vmas which are known to support THP.
 * Use transparent_hugepage_active otherwise
 */
static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
{

        /*
         * If the hardware/firmware marked hugepage support disabled.
         */
        if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_NEVER_DAX))
                return false;

        if (!transhuge_vma_enabled(vma, vma->vm_flags))
                return false;

        if (vma_is_temporary_stack(vma))
                return false;

        if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG))
                return true;

        if (vma_is_dax(vma))
                return true;

        if (transparent_hugepage_flags &
                                (1 << TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG))
                return !!(vma->vm_flags & VM_HUGEPAGE);

        return false;
}

For non-anonymous, the next check (in create_huge_*) would be for that
->huge_fault handler, falling back as necessary if it didn't exist.

The patch I sent out last week[1] somewhat restores this logic -- the
only difference being we do the check for ->huge_fault in
hugepage_vma_check() as well. This is so smaps can surface this
possibility with some accuracy. I just realized it will erroneously
return "true" for the collapse path, however..

Maybe Matthew was right about unifying everything here :P That's 2
mistakes I've made in trying to fix this issue (but maybe that's just
me).

[1] https://lore.kernel.org/linux-mm/20230818211533.2523697-1-zokeefe@google.com/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [EXTERNAL] [PATCH] mm/thp: fix "mm: thp: kill __transhuge_page_enabled()"
  2023-08-21 15:08                     ` Zach O'Keefe
@ 2023-08-21 22:59                       ` Yang Shi
  0 siblings, 0 replies; 20+ messages in thread
From: Yang Shi @ 2023-08-21 22:59 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Saurabh Singh Sengar, Matthew Wilcox, Dan Williams, linux-mm,
	linux-kernel

On Mon, Aug 21, 2023 at 8:09 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Fri, Aug 18, 2023 at 2:21 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Thu, Aug 17, 2023 at 11:29 AM Zach O'Keefe <zokeefe@google.com> wrote:
> > >
> > > On Thu, Aug 17, 2023 at 10:47 AM Yang Shi <shy828301@gmail.com> wrote:
> > > >
> > > > On Wed, Aug 16, 2023 at 2:48 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > > >
> > > > > > We have an out-of-tree driver that maps huge pages through a file handle and
> > > > > > relies on ->huge_fault. It used to work in 5.19 kernels, but 6.1 changed this
> > > > > > behaviour.
> > > > > >
> > > > > > I don’t think restoring the earlier fault-path behaviour for huge pages should
> > > > > > impact the kernel negatively.
> > > > > >
> > > > > > Do you think we can restore this earlier kernel behaviour to allow page faults
> > > > > > for huge pages via ->huge_fault?
> > > > >
> > > > > That seems reasonable to me. I think using the existence of a
> > > > > ->huge_fault() handler as a predicate to return "true" makes sense to
> > > > > me. The "normal" flow for file-backed memory along fault path still
> > > > > needs to return "false", so that we correctly fall back to ->fault()
> > > > > handler. Unless there are objections, I can do that in a v2.
> > > >
> > > > Sorry for chiming in late. I'm just back from vacation and trying to catch up...
> > > >
> > > > IIUC the out-of-tree driver tries to allocate huge page and install
> > > > PMD mapping via huge_fault() handler, but the cleanup of
> > > > hugepage_vma_check() prevents this due to the check to
> > > > VM_NO_KHUGEPAGED?
> > > >
> > > > So you would like to check whether a huge_fault() handler existed
> > > > instead of vma_is_dax()?
> > >
> > > Sorry for the multiple threads here. There are two problems: (a) the
> > > VM_NO_KHUGEPAGED check along fault path, and (b) we don't give
> > > ->huge_fault() a fair shake, if it exists, along fault path. The
> > > current code assumes vma_is_dax() iff ->huge_fault() exists.
> > >
> > > (a) is easy enough to fix. For (b), I'm currently looking at the
> > > possibility of not worrying about ->huge_fault() in
> > > hugepage_vma_check(), and just letting create_huge_pud() /
> > > create_huge_pmd() check and fallback as necessary. I think we'll need
> > > the explicit DAX check still, since we want to keep khugepaged and
> > > MADV_COLLAPSE away, and the presence / absence of ->huge_fault() isn't
> > > enough to know that (well.. today it kind of is, but we shouldn't
> > > depend on it).
> >
> > You meant something like:
> >
> > if (vma->vm_ops->huge_fault) {
> >     if (vma_is_dax(vma))
> >         return in_pf;
> >
> >     /* Fall through */
> > }
>
> I don't think this will work for Saurabh's case, since IIUC, they
> aren't using dax, but are using VM_HUGEPAGE|VM_MIXEDMAP, faulted in
> using ->huge_fault()
>
> The old (v5.19) fault path looked like:
>
> static inline bool transhuge_vma_enabled(struct vm_area_struct *vma,
>                                           unsigned long vm_flags)
> {
>         /* Explicitly disabled through madvise. */
>         if ((vm_flags & VM_NOHUGEPAGE) ||
>             test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>                 return false;
>         return true;
> }
>
> /*
>  * to be used on vmas which are known to support THP.
>  * Use transparent_hugepage_active otherwise
>  */
> static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
> {
>
>         /*
>          * If the hardware/firmware marked hugepage support disabled.
>          */
>         if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_NEVER_DAX))
>                 return false;
>
>         if (!transhuge_vma_enabled(vma, vma->vm_flags))
>                 return false;
>
>         if (vma_is_temporary_stack(vma))
>                 return false;
>
>         if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG))
>                 return true;
>
>         if (vma_is_dax(vma))
>                 return true;
>
>         if (transparent_hugepage_flags &
>                                 (1 << TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG))
>                 return !!(vma->vm_flags & VM_HUGEPAGE);
>
>         return false;
> }
>
> For non-anonymous, the next check (in create_huge_*) would be for that
> ->huge_fault handler, falling back as necessary if it didn't exist.
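
To see why such a VMA was eligible under v5.19, the quoted logic can be condensed into a runnable sketch. This is a simplified stand-in: flags are stub values, and the MMF_DISABLE_THP and TRANSPARENT_HUGEPAGE_NEVER_DAX checks are omitted for brevity. The key point is that no VM_NO_KHUGEPAGED test appears anywhere on this path.

```c
#include <assert.h>
#include <stdbool.h>

/* Stub flag values, not the kernel's */
#define VM_NOHUGEPAGE   0x1UL
#define VM_HUGEPAGE     0x2UL
#define THP_FLAG_ALWAYS  0x1UL
#define THP_FLAG_MADVISE 0x2UL

/* System-wide THP mode; "madvise" is a common default */
unsigned long transparent_hugepage_flags = THP_FLAG_MADVISE;

struct vma {
	unsigned long vm_flags;
	bool is_dax;
	bool is_temp_stack;
};

/* Condensed v5.19 __transparent_hugepage_enabled() fault-path logic */
bool old_thp_enabled(const struct vma *vma)
{
	/* Explicitly disabled through madvise */
	if (vma->vm_flags & VM_NOHUGEPAGE)
		return false;
	if (vma->is_temp_stack)
		return false;
	if (transparent_hugepage_flags & THP_FLAG_ALWAYS)
		return true;
	if (vma->is_dax)
		return true;
	if (transparent_hugepage_flags & THP_FLAG_MADVISE)
		return !!(vma->vm_flags & VM_HUGEPAGE);
	return false;
}
```

In "madvise" mode a VM_HUGEPAGE VMA passes regardless of VM_MIXEDMAP, after which create_huge_pmd()/create_huge_pud() would try ->huge_fault() and fall back if absent.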

Yeah, you are right. I just replied to your v2 patch.

>
> The patch I sent out last week[1] somewhat restores this logic -- the
> only difference being we do the check for ->huge_fault in
> hugepage_vma_check() as well. This is so smaps can surface this
> possibility with some accuracy. I just realized it will erroneously
> return "true" for the collapse path, however..
>
> Maybe Matthew was right about unifying everything here :P That's 2
> mistakes I've made in trying to fix this issue (but maybe that's just
> me).

IMHO, no rush on fixing it.

>
> [1] https://lore.kernel.org/linux-mm/20230818211533.2523697-1-zokeefe@google.com/



Thread overview: 20+ messages
2023-08-12 21:00 [PATCH] mm/thp: fix "mm: thp: kill __transhuge_page_enabled()" Zach O'Keefe
2023-08-12 21:24 ` Zach O'Keefe
2023-08-13  6:19 ` [EXTERNAL] " Saurabh Singh Sengar
2023-08-14 18:47   ` Zach O'Keefe
2023-08-14 19:06     ` Matthew Wilcox
2023-08-15  0:04       ` Zach O'Keefe
2023-08-15  2:24         ` Matthew Wilcox
2023-08-16 16:52           ` Saurabh Singh Sengar
2023-08-16 21:47             ` Zach O'Keefe
2023-08-17 17:46               ` Yang Shi
2023-08-17 18:29                 ` Zach O'Keefe
2023-08-18 21:21                   ` Yang Shi
2023-08-21 15:08                     ` Zach O'Keefe
2023-08-21 22:59                       ` Yang Shi
2023-08-16 21:31           ` Zach O'Keefe
2023-08-17 12:18             ` Matthew Wilcox
2023-08-17 18:13               ` Zach O'Keefe
2023-08-17 19:01                 ` Matthew Wilcox
2023-08-17 21:12                   ` Zach O'Keefe
2023-08-16 16:49         ` Saurabh Singh Sengar
