All of lore.kernel.org
 help / color / mirror / Atom feed
* mm/khugepaged: collapse file/shmem compound pages
@ 2022-05-24 22:42 Zach O'Keefe
  2022-05-25 19:07 ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Zach O'Keefe @ 2022-05-24 22:42 UTC (permalink / raw)
  To: willy; +Cc: David Rientjes, linux-mm

Hey Matthew,

I'm leading an attempt to add a new madvise mode, MADV_COLLAPSE, to
allow userspace-directed collapse of memory into THPs[1]. The initial
proposal only supports anonymous memory, but I'm
working on adding support for file-backed and shmem memory.

The intended behavior of MADV_COLLAPSE is that it should return
"success" if all hugepage-aligned / sized regions requested are backed
by pmd-mapped THPs on return (races aside). IOW: we were able to
successfully collapse the memory, or it was already backed by
pmd-mapped THPs.

Currently there is a nice "XXX: khugepaged should compact smaller
compound pages into a PMD sized page" in khugepaged_scan_file() when
we encounter a compound page during scanning. Do you know what kind of
gotchas or technical difficulties would be involved in doing this? I
presume this work would also benefit those relying on khugepaged to
collapse read-only file and shmem memory, and I'd be happy to help
move it forward.

Thanks for your time,
Zach


[1] https://lore.kernel.org/linux-mm/20220504214437.2850685-1-zokeefe@google.com/


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: mm/khugepaged: collapse file/shmem compound pages
  2022-05-24 22:42 mm/khugepaged: collapse file/shmem compound pages Zach O'Keefe
@ 2022-05-25 19:07 ` Matthew Wilcox
  2022-05-26  1:23   ` Zach O'Keefe
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2022-05-25 19:07 UTC (permalink / raw)
  To: Zach O'Keefe; +Cc: David Rientjes, linux-mm

On Tue, May 24, 2022 at 03:42:55PM -0700, Zach O'Keefe wrote:
> Hey Matthew,
> 
> I'm leading an attempt to add a new madvise mode, MADV_COLLAPSE, to
> allow userspace-directed collapse of memory into THPs[1]. The initial
> proposal only supports anonymous memory, but I'm
> working on adding support for file-backed and shmem memory.
> 
> The intended behavior of MADV_COLLAPSE is that it should return
> "success" if all hugepage-aligned / sized regions requested are backed
> by pmd-mapped THPs on return (races aside). IOW: we were able to
> successfully collapse the memory, or it was already backed by
> pmd-mapped THPs.
> 
> Currently there is a nice "XXX: khugepaged should compact smaller
> compound pages into a PMD sized page" in khugepaged_scan_file() when
> we encounter a compound page during scanning. Do you know what kind of
> gotchas or technical difficulties would be involved in doing this? I
> presume this work would also benefit those relying on khugepaged to
> collapse read-only file and shmem memory, and I'd be happy to help
> move it forward.

Hi Zach,

Thanks for your interest, and I'd love some help on this.

The khugepaged code (like much of the mm used to) assumes that memory
comes in two sizes, PTE and PMD.  That's still true for anon and shmem
for now, but hopefully we'll start managing both anon & shmem memory in
larger chunks, without necessarily going as far as PMD.

I think the purpose of khugepaged should continue to be to construct
PMD-size pages; I don't see the point of it wandering through process VMs
replacing order-2 pages with order-5 pages.  I may be wrong about that,
of course, so feel free to argue with me.

Anyway, that meaning behind that comment is that the PageTransCompound()
test is going to be true on any compound page (TransCompound doesn't
check that the page is necessarily a THP).  So that particular test should
be folio_test_pmd_mappable(), but there are probably other things which
ought to be changed, including converting the entire file from dealing
in pages to dealing in folios.

I actually have one patch which starts in that direction, but I haven't
followed it up yet with all the other patches to that file which will
be needed:

From a64ac45ad951557103a1040c8bcc3f229022cd26 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Date: Fri, 7 May 2021 23:40:19 -0400
Subject: [PATCH] mm/khugepaged: Allocate folios

khugepaged only wants to deal in terms of folios, so switch to
using the folio allocation functions.  This eliminates the calls to
prep_transhuge_page() and saves dozens of bytes of text.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/khugepaged.c | 32 ++++++++++++--------------------
 1 file changed, 12 insertions(+), 20 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 637bfecd6bf5..ec60ee4e14c9 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -854,18 +854,20 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
 static struct page *
 khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
 {
+	struct folio *folio;
+
 	VM_BUG_ON_PAGE(*hpage, *hpage);
 
-	*hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
-	if (unlikely(!*hpage)) {
+	folio = __folio_alloc_node(gfp, HPAGE_PMD_ORDER, node);
+	if (unlikely(!folio)) {
 		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
 		*hpage = ERR_PTR(-ENOMEM);
 		return NULL;
 	}
 
-	prep_transhuge_page(*hpage);
 	count_vm_event(THP_COLLAPSE_ALLOC);
-	return *hpage;
+	*hpage = &folio->page;
+	return &folio->page;
 }
 #else
 static int khugepaged_find_target_node(void)
@@ -873,24 +875,14 @@ static int khugepaged_find_target_node(void)
 	return 0;
 }
 
-static inline struct page *alloc_khugepaged_hugepage(void)
-{
-	struct page *page;
-
-	page = alloc_pages(alloc_hugepage_khugepaged_gfpmask(),
-			   HPAGE_PMD_ORDER);
-	if (page)
-		prep_transhuge_page(page);
-	return page;
-}
-
 static struct page *khugepaged_alloc_hugepage(bool *wait)
 {
-	struct page *hpage;
+	struct folio *folio;
 
 	do {
-		hpage = alloc_khugepaged_hugepage();
-		if (!hpage) {
+		folio = folio_alloc(alloc_hugepage_khugepaged_gfpmask(),
+					HPAGE_PMD_ORDER);
+		if (!folio) {
 			count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
 			if (!*wait)
 				return NULL;
@@ -899,9 +891,9 @@ static struct page *khugepaged_alloc_hugepage(bool *wait)
 			khugepaged_alloc_sleep();
 		} else
 			count_vm_event(THP_COLLAPSE_ALLOC);
-	} while (unlikely(!hpage) && likely(khugepaged_enabled()));
+	} while (unlikely(!folio) && likely(khugepaged_enabled()));
 
-	return hpage;
+	return &folio->page;
 }
 
 static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: mm/khugepaged: collapse file/shmem compound pages
  2022-05-25 19:07 ` Matthew Wilcox
@ 2022-05-26  1:23   ` Zach O'Keefe
  2022-05-26  3:36     ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Zach O'Keefe @ 2022-05-26  1:23 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: David Rientjes, linux-mm

On Wed, May 25, 2022 at 12:07 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, May 24, 2022 at 03:42:55PM -0700, Zach O'Keefe wrote:
> > Hey Matthew,
> >
> > I'm leading an attempt to add a new madvise mode, MADV_COLLAPSE, to
> > allow userspace-directed collapse of memory into THPs[1]. The initial
> > proposal only supports anonymous memory, but I'm
> > working on adding support for file-backed and shmem memory.
> >
> > The intended behavior of MADV_COLLAPSE is that it should return
> > "success" if all hugepage-aligned / sized regions requested are backed
> > by pmd-mapped THPs on return (races aside). IOW: we were able to
> > successfully collapse the memory, or it was already backed by
> > pmd-mapped THPs.
> >
> > Currently there is a nice "XXX: khugepaged should compact smaller
> > compound pages into a PMD sized page" in khugepaged_scan_file() when
> > we encounter a compound page during scanning. Do you know what kind of
> > gotchas or technical difficulties would be involved in doing this? I
> > presume this work would also benefit those relying on khugepaged to
> > collapse read-only file and shmem memory, and I'd be happy to help
> > move it forward.

Hey Matthew,

Thanks for taking the time!

>
> Hi Zach,
>
> Thanks for your interest, and I'd love some help on this.
>
> The khugepaged code (like much of the mm used to) assumes that memory
> comes in two sizes, PTE and PMD.  That's still true for anon and shmem
> for now, but hopefully we'll start managing both anon & shmem memory in
> larger chunks, without necessarily going as far as PMD.
>
> I think the purpose of khugepaged should continue to be to construct
> PMD-size pages; I don't see the point of it wandering through process VMs
> replacing order-2 pages with order-5 pages.  I may be wrong about that,
> of course, so feel free to argue with me.

I'd agree here.

> Anyway, that meaning behind that comment is that the PageTransCompound()
> test is going to be true on any compound page (TransCompound doesn't
> check that the page is necessarily a THP).  So that particular test should
> be folio_test_pmd_mappable(), but there are probably other things which
> ought to be changed, including converting the entire file from dealing
> in pages to dealing in folios.

Right, at this point, the page might be a pmd-mapped THP, or it could
be a pte-mapped compound page (I'm unsure if we can encounter compound
pages outside hugepages).

If we could tell it's already pmd-mapped, we're done :) IIUC,
folio_test_pmd_mappable() is a necessary but not sufficient condition
to determine this.

Else, if it's not, is it safe to try and continue? Suppose we find a
folio of 0 < order < HPAGE_PMD_ORDER. Are we safely able to try and
extend it, or will we break some filesystems that expect a certain
order folio?

> I actually have one patch which starts in that direction, but I haven't
> followed it up yet with all the other patches to that file which will
> be needed:

Thanks for the head start! Not an expert here, but would you say
converting this file to use folios is a necessary first step?

Again, thanks for your time,
Zach

> From a64ac45ad951557103a1040c8bcc3f229022cd26 Mon Sep 17 00:00:00 2001
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> Date: Fri, 7 May 2021 23:40:19 -0400
> Subject: [PATCH] mm/khugepaged: Allocate folios
>
> khugepaged only wants to deal in terms of folios, so switch to
> using the folio allocation functions.  This eliminates the calls to
> prep_transhuge_page() and saves dozens of bytes of text.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/khugepaged.c | 32 ++++++++++++--------------------
>  1 file changed, 12 insertions(+), 20 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 637bfecd6bf5..ec60ee4e14c9 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -854,18 +854,20 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
>  static struct page *
>  khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
>  {
> +       struct folio *folio;
> +
>         VM_BUG_ON_PAGE(*hpage, *hpage);
>
> -       *hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
> -       if (unlikely(!*hpage)) {
> +       folio = __folio_alloc_node(gfp, HPAGE_PMD_ORDER, node);
> +       if (unlikely(!folio)) {
>                 count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
>                 *hpage = ERR_PTR(-ENOMEM);
>                 return NULL;
>         }
>
> -       prep_transhuge_page(*hpage);
>         count_vm_event(THP_COLLAPSE_ALLOC);
> -       return *hpage;
> +       *hpage = &folio->page;
> +       return &folio->page;
>  }
>  #else
>  static int khugepaged_find_target_node(void)
> @@ -873,24 +875,14 @@ static int khugepaged_find_target_node(void)
>         return 0;
>  }
>
> -static inline struct page *alloc_khugepaged_hugepage(void)
> -{
> -       struct page *page;
> -
> -       page = alloc_pages(alloc_hugepage_khugepaged_gfpmask(),
> -                          HPAGE_PMD_ORDER);
> -       if (page)
> -               prep_transhuge_page(page);
> -       return page;
> -}
> -
>  static struct page *khugepaged_alloc_hugepage(bool *wait)
>  {
> -       struct page *hpage;
> +       struct folio *folio;
>
>         do {
> -               hpage = alloc_khugepaged_hugepage();
> -               if (!hpage) {
> +               folio = folio_alloc(alloc_hugepage_khugepaged_gfpmask(),
> +                                       HPAGE_PMD_ORDER);
> +               if (!folio) {
>                         count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
>                         if (!*wait)
>                                 return NULL;
> @@ -899,9 +891,9 @@ static struct page *khugepaged_alloc_hugepage(bool *wait)
>                         khugepaged_alloc_sleep();
>                 } else
>                         count_vm_event(THP_COLLAPSE_ALLOC);
> -       } while (unlikely(!hpage) && likely(khugepaged_enabled()));
> +       } while (unlikely(!folio) && likely(khugepaged_enabled()));
>
> -       return hpage;
> +       return &folio->page;
>  }
>
>  static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
> --
> 2.34.1
>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: mm/khugepaged: collapse file/shmem compound pages
  2022-05-26  1:23   ` Zach O'Keefe
@ 2022-05-26  3:36     ` Matthew Wilcox
  2022-05-27  0:54       ` Zach O'Keefe
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2022-05-26  3:36 UTC (permalink / raw)
  To: Zach O'Keefe; +Cc: David Rientjes, linux-mm

On Wed, May 25, 2022 at 06:23:52PM -0700, Zach O'Keefe wrote:
> On Wed, May 25, 2022 at 12:07 PM Matthew Wilcox <willy@infradead.org> wrote:
> > The khugepaged code (like much of the mm used to) assumes that memory
> > comes in two sizes, PTE and PMD.  That's still true for anon and shmem
> > for now, but hopefully we'll start managing both anon & shmem memory in
> > larger chunks, without necessarily going as far as PMD.
> >
> > I think the purpose of khugepaged should continue to be to construct
> > PMD-size pages; I don't see the point of it wandering through process VMs
> > replacing order-2 pages with order-5 pages.  I may be wrong about that,
> > of course, so feel free to argue with me.
> 
> I'd agree here.
> 
> > Anyway, that meaning behind that comment is that the PageTransCompound()
> > test is going to be true on any compound page (TransCompound doesn't
> > check that the page is necessarily a THP).  So that particular test should
> > be folio_test_pmd_mappable(), but there are probably other things which
> > ought to be changed, including converting the entire file from dealing
> > in pages to dealing in folios.
> 
> Right, at this point, the page might be a pmd-mapped THP, or it could
> be a pte-mapped compound page (I'm unsure if we can encounter compound
> pages outside hugepages).

Today, there is a way.  We can find a folio with an order between 0 and
PMD_ORDER if the underlying filesystem supports large folios and the
file is executable and we've enabled CONFIG_READ_ONLY_THP_FOR_FS.
In this case, we'll simply skip over it because the code believes that
means it's already a PMD.

> If we could tell it's already pmd-mapped, we're done :) IIUC,
> folio_test_pmd_mappable() is a necessary but not sufficient condition
> to determine this.

It is necessary, but from khugepaged's point of view, it's sufficient
because khugepaged's job is to create PMD-sized folios -- it's not up to
khugepaged to ensure that PMD-sized folios are actually mapped using
a PMD.  There may be some other component of the system (eg DAMON?)
which has chosen to temporarily map the PMD-sized folio using PTEs
in order to track whether the memory is all being used.  It may also
be the case that (for file-based memory), the VMA is mis-aligned and
despite creating a PMD-sized folio, it can't be mapped with a PMD.

> Else, if it's not, is it safe to try and continue? Suppose we find a
> folio of 0 < order < HPAGE_PMD_ORDER. Are we safely able to try and
> extend it, or will we break some filesystems that expect a certain
> order folio?

We're not giving filesystems the opportunity to request that ;-)
Filesystems are expected to handle folios of arbitrary order (if they
claim the ability to support large folios at all).  In practice, I've
capped the folio creation size at PMD_ORDER (because I don't want to track
down all the places that assume pmd_page() is necessarily a head page),
but filesystems shouldn't rely on it.

shmem still expects folios to be of order either 0 or PMD_ORDER.
That assumption extends into the swap code and I haven't had the heart
to go and fix all those places yet.  Plus Neil was doing major surgery
to the swap code in the most recent deveopment cycle and I didn't want
to get in his way.

So I am absolutely fine with khugepaged allocating a PMD-size folio for
any inode that claims mapping_large_folio_support().  If any filesystems
break, we'll fix them.

> > I actually have one patch which starts in that direction, but I haven't
> > followed it up yet with all the other patches to that file which will
> > be needed:
> 
> Thanks for the head start! Not an expert here, but would you say
> converting this file to use folios is a necessary first step?

Not _necessary_, but I find it helps keep things clearer.  Plus it's
something that needs to happen anyway.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: mm/khugepaged: collapse file/shmem compound pages
  2022-05-26  3:36     ` Matthew Wilcox
@ 2022-05-27  0:54       ` Zach O'Keefe
  2022-05-27  3:47         ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Zach O'Keefe @ 2022-05-27  0:54 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: David Rientjes, linux-mm

Hey Matthew,

Thanks for the response!

On Wed, May 25, 2022 at 8:36 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, May 25, 2022 at 06:23:52PM -0700, Zach O'Keefe wrote:
> > On Wed, May 25, 2022 at 12:07 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > The khugepaged code (like much of the mm used to) assumes that memory
> > > comes in two sizes, PTE and PMD.  That's still true for anon and shmem
> > > for now, but hopefully we'll start managing both anon & shmem memory in
> > > larger chunks, without necessarily going as far as PMD.
> > >
> > > I think the purpose of khugepaged should continue to be to construct
> > > PMD-size pages; I don't see the point of it wandering through process VMs
> > > replacing order-2 pages with order-5 pages.  I may be wrong about that,
> > > of course, so feel free to argue with me.
> >
> > I'd agree here.
> >
> > > Anyway, that meaning behind that comment is that the PageTransCompound()
> > > test is going to be true on any compound page (TransCompound doesn't
> > > check that the page is necessarily a THP).  So that particular test should
> > > be folio_test_pmd_mappable(), but there are probably other things which
> > > ought to be changed, including converting the entire file from dealing
> > > in pages to dealing in folios.
> >
> > Right, at this point, the page might be a pmd-mapped THP, or it could
> > be a pte-mapped compound page (I'm unsure if we can encounter compound
> > pages outside hugepages).
>
> Today, there is a way.  We can find a folio with an order between 0 and
> PMD_ORDER if the underlying filesystem supports large folios and the
> file is executable and we've enabled CONFIG_READ_ONLY_THP_FOR_FS.
> In this case, we'll simply skip over it because the code believes that
> means it's already a PMD.

I think I'm missing something here - sorry. If the folio order is  <
HPAGE_PMD_ORDER, why does the code think it's a pmd?

> > If we could tell it's already pmd-mapped, we're done :) IIUC,
> > folio_test_pmd_mappable() is a necessary but not sufficient condition
> > to determine this.
>
> It is necessary, but from khugepaged's point of view, it's sufficient
> because khugepaged's job is to create PMD-sized folios -- it's not up to
> khugepaged to ensure that PMD-sized folios are actually mapped using
> a PMD.

I thought the point / benefit of khugepaged was precisely to try and
find places where we can collapse many pte entries into a single pmd
mapping?

> There may be some other component of the system (eg DAMON?)
> which has chosen to temporarily map the PMD-sized folio using PTEs
> in order to track whether the memory is all being used.  It may also
> be the case that (for file-based memory), the VMA is mis-aligned and
> despite creating a PMD-sized folio, it can't be mapped with a PMD.

AFAIK DAMON doesn't do this pmd splitting to do subpage tracking for
THPs. Also, I believe retract_page_tables() does make the check to see
if the address is suitably hugepage aligned/sized.

> > Else, if it's not, is it safe to try and continue? Suppose we find a
> > folio of 0 < order < HPAGE_PMD_ORDER. Are we safely able to try and
> > extend it, or will we break some filesystems that expect a certain
> > order folio?
>
> We're not giving filesystems the opportunity to request that ;-)
> Filesystems are expected to handle folios of arbitrary order (if they
> claim the ability to support large folios at all).  In practice, I've
> capped the folio creation size at PMD_ORDER (because I don't want to track
> down all the places that assume pmd_page() is necessarily a head page),
> but filesystems shouldn't rely on it.

Ok, that's good to hear :)

> shmem still expects folios to be of order either 0 or PMD_ORDER.
> That assumption extends into the swap code and I haven't had the heart
> to go and fix all those places yet.  Plus Neil was doing major surgery
> to the swap code in the most recent deveopment cycle and I didn't want
> to get in his way.
>
> So I am absolutely fine with khugepaged allocating a PMD-size folio for
> any inode that claims mapping_large_folio_support().  If any filesystems
> break, we'll fix them.

Just for clarification, what is the equivalent code today that
enforces mapping_large_folio_support()? I.e. today, khugepaged can
successfully collapse file without checking if the inode supports it
(we only check that it's a regular file not opened for writing).

Also, just to check, there isn't anything wrong with following
collapse_file()'s approach, even for folios of 0 < order <
HPAGE_PMD_ORDER? I.e this part:

 * Basic scheme is simple, details are more complex:
 *  - allocate and lock a new huge page;
 *  - scan page cache replacing old pages with the new one
 *    + swap/gup in pages if necessary;
 *    + fill in gaps;
 *    + keep old pages around in case rollback is required;
 *  - if replacing succeeds:
 *    + copy data over;
 *    + free old pages;
 *    + unlock huge page;
 *  - if replacing failed;
 *    + put all pages back and unfreeze them;
 *    + restore gaps in the page cache;
 *    + unlock and free huge page;
 */

> > > I actually have one patch which starts in that direction, but I haven't
> > > followed it up yet with all the other patches to that file which will
> > > be needed:
> >
> > Thanks for the head start! Not an expert here, but would you say
> > converting this file to use folios is a necessary first step?
>
> Not _necessary_, but I find it helps keep things clearer.  Plus it's
> something that needs to happen anyway.

Got it. Perhaps something I can help with then.

Thanks again for your time advising on this!

Best,
Zach


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: mm/khugepaged: collapse file/shmem compound pages
  2022-05-27  0:54       ` Zach O'Keefe
@ 2022-05-27  3:47         ` Matthew Wilcox
  2022-05-27 16:27           ` Zach O'Keefe
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2022-05-27  3:47 UTC (permalink / raw)
  To: Zach O'Keefe; +Cc: David Rientjes, linux-mm

On Thu, May 26, 2022 at 05:54:27PM -0700, Zach O'Keefe wrote:
> On Wed, May 25, 2022 at 8:36 PM Matthew Wilcox <willy@infradead.org> wrote:
> > On Wed, May 25, 2022 at 06:23:52PM -0700, Zach O'Keefe wrote:
> > > On Wed, May 25, 2022 at 12:07 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > Anyway, that meaning behind that comment is that the PageTransCompound()
> > > > test is going to be true on any compound page (TransCompound doesn't
> > > > check that the page is necessarily a THP).  So that particular test should
> > > > be folio_test_pmd_mappable(), but there are probably other things which
> > > > ought to be changed, including converting the entire file from dealing
> > > > in pages to dealing in folios.
> > >
> > > Right, at this point, the page might be a pmd-mapped THP, or it could
> > > be a pte-mapped compound page (I'm unsure if we can encounter compound
> > > pages outside hugepages).
> >
> > Today, there is a way.  We can find a folio with an order between 0 and
> > PMD_ORDER if the underlying filesystem supports large folios and the
> > file is executable and we've enabled CONFIG_READ_ONLY_THP_FOR_FS.
> > In this case, we'll simply skip over it because the code believes that
> > means it's already a PMD.
> 
> I think I'm missing something here - sorry. If the folio order is  <
> HPAGE_PMD_ORDER, why does the code think it's a pmd?

Because PageTransCompound() does not do what it says on the tin.

static inline int PageTransCompound(struct page *page)
{
        return PageCompound(page);
}

So any compound page is treated as if it's a PMD-sized page.

> > > If we could tell it's already pmd-mapped, we're done :) IIUC,
> > > folio_test_pmd_mappable() is a necessary but not sufficient condition
> > > to determine this.
> >
> > It is necessary, but from khugepaged's point of view, it's sufficient
> > because khugepaged's job is to create PMD-sized folios -- it's not up to
> > khugepaged to ensure that PMD-sized folios are actually mapped using
> > a PMD.
> 
> I thought the point / benefit of khugepaged was precisely to try and
> find places where we can collapse many pte entries into a single pmd
> mapping?

Ideally, yes.  But if a file is mapped at an address which isn't
PMD-aligned, it can't.  Maybe it should just decline to operate in that
case.

> > There may be some other component of the system (eg DAMON?)
> > which has chosen to temporarily map the PMD-sized folio using PTEs
> > in order to track whether the memory is all being used.  It may also
> > be the case that (for file-based memory), the VMA is mis-aligned and
> > despite creating a PMD-sized folio, it can't be mapped with a PMD.
> 
> AFAIK DAMON doesn't do this pmd splitting to do subpage tracking for
> THPs. Also, I believe retract_page_tables() does make the check to see
> if the address is suitably hugepage aligned/sized.

Maybe not DAMON itself, but it's something that various people are
talkig about doing; trying to determine whether THPs are worth using or
whether userspace has made the magic go-faster call without knowing
whether the valuable 2MB page is being entirely used.

> > shmem still expects folios to be of order either 0 or PMD_ORDER.
> > That assumption extends into the swap code and I haven't had the heart
> > to go and fix all those places yet.  Plus Neil was doing major surgery
> > to the swap code in the most recent deveopment cycle and I didn't want
> > to get in his way.
> >
> > So I am absolutely fine with khugepaged allocating a PMD-size folio for
> > any inode that claims mapping_large_folio_support().  If any filesystems
> > break, we'll fix them.
> 
> Just for clarification, what is the equivalent code today that
> enforces mapping_large_folio_support()? I.e. today, khugepaged can
> successfully collapse file without checking if the inode supports it
> (we only check that it's a regular file not opened for writing).

Yeah, that's a dodgy hack which needs to go away.  But we need a lot
more filesystems converted to supporting large folios before we can
delete it.  Not your responsibility; I'm doing my best to encourage
fs maintainers to do this part.

> Also, just to check, there isn't anything wrong with following
> collapse_file()'s approach, even for folios of 0 < order <
> HPAGE_PMD_ORDER? I.e this part:
> 
>  * Basic scheme is simple, details are more complex:
>  *  - allocate and lock a new huge page;
>  *  - scan page cache replacing old pages with the new one
>  *    + swap/gup in pages if necessary;
>  *    + fill in gaps;
>  *    + keep old pages around in case rollback is required;
>  *  - if replacing succeeds:
>  *    + copy data over;
>  *    + free old pages;
>  *    + unlock huge page;
>  *  - if replacing failed;
>  *    + put all pages back and unfreeze them;
>  *    + restore gaps in the page cache;
>  *    + unlock and free huge page;
>  */

Correct.  At least, as far as I know!  Working on folios has been quite
the education for me ...


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: mm/khugepaged: collapse file/shmem compound pages
  2022-05-27  3:47         ` Matthew Wilcox
@ 2022-05-27 16:27           ` Zach O'Keefe
  2022-05-28  3:48             ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Zach O'Keefe @ 2022-05-27 16:27 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: David Rientjes, linux-mm

On Thu, May 26, 2022 at 8:47 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, May 26, 2022 at 05:54:27PM -0700, Zach O'Keefe wrote:
> > On Wed, May 25, 2022 at 8:36 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > On Wed, May 25, 2022 at 06:23:52PM -0700, Zach O'Keefe wrote:
> > > > On Wed, May 25, 2022 at 12:07 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > Anyway, that meaning behind that comment is that the PageTransCompound()
> > > > > test is going to be true on any compound page (TransCompound doesn't
> > > > > check that the page is necessarily a THP).  So that particular test should
> > > > > be folio_test_pmd_mappable(), but there are probably other things which
> > > > > ought to be changed, including converting the entire file from dealing
> > > > > in pages to dealing in folios.
> > > >
> > > > Right, at this point, the page might be a pmd-mapped THP, or it could
> > > > be a pte-mapped compound page (I'm unsure if we can encounter compound
> > > > pages outside hugepages).
> > >
> > > Today, there is a way.  We can find a folio with an order between 0 and
> > > PMD_ORDER if the underlying filesystem supports large folios and the
> > > file is executable and we've enabled CONFIG_READ_ONLY_THP_FOR_FS.
> > > In this case, we'll simply skip over it because the code believes that
> > > means it's already a PMD.
> >
> > I think I'm missing something here - sorry. If the folio order is  <
> > HPAGE_PMD_ORDER, why does the code think it's a pmd?
>
> Because PageTransCompound() does not do what it says on the tin.
>
> static inline int PageTransCompound(struct page *page)
> {
>         return PageCompound(page);
> }
>
> So any compound page is treated as if it's a PMD-sized page.

Right - therein lies the problem :) I think I misattributed your
comment "we'll simply skip over it because the code believes that
means it's already a PMD" as a solution, not as the current state of
things. What we need to be able to do is:

1) If folio order == 0: do what we've been doing
2) If folio order == HPAGE_PMD_ORDER: check if it's _actually_
pmd-mapped. If it is, we're done. If not, continue to step (3)
3) Else (folio order > 0 and not pmd-mapped): new magic; hopefully
it's ~ same as step (1)

> > > > If we could tell it's already pmd-mapped, we're done :) IIUC,
> > > > folio_test_pmd_mappable() is a necessary but not sufficient condition
> > > > to determine this.
> > >
> > > It is necessary, but from khugepaged's point of view, it's sufficient
> > > because khugepaged's job is to create PMD-sized folios -- it's not up to
> > > khugepaged to ensure that PMD-sized folios are actually mapped using
> > > a PMD.
> >
> > I thought the point / benefit of khugepaged was precisely to try and
> > find places where we can collapse many pte entries into a single pmd
> > mapping?
>
> Ideally, yes.  But if a file is mapped at an address which isn't
> PMD-aligned, it can't.  Maybe it should just decline to operate in that
> case.

To make sure I'm not missing anything here: It's not actually
important that the file is mapped at a pmd-aligned address. All that
is important is that the region of memory being collapsed is
pmd-aligned. If we wanted to collapse memory mapped to the start of
the file, then sure, the file has to be mapped suitably.

> > > There may be some other component of the system (eg DAMON?)
> > > which has chosen to temporarily map the PMD-sized folio using PTEs
> > > in order to track whether the memory is all being used.  It may also
> > > be the case that (for file-based memory), the VMA is mis-aligned and
> > > despite creating a PMD-sized folio, it can't be mapped with a PMD.
> >
> > AFAIK DAMON doesn't do this pmd splitting to do subpage tracking for
> > THPs. Also, I believe retract_page_tables() does make the check to see
> > if the address is suitably hugepage aligned/sized.
>
> Maybe not DAMON itself, but it's something that various people are
> talkig about doing; trying to determine whether THPs are worth using or
> whether userspace has made the magic go-faster call without knowing
> whether the valuable 2MB page is being entirely used.

Got it - thanks for clarifying.

> > > shmem still expects folios to be of order either 0 or PMD_ORDER.
> > > That assumption extends into the swap code and I haven't had the heart
> > > to go and fix all those places yet.  Plus Neil was doing major surgery
> > > to the swap code in the most recent deveopment cycle and I didn't want
> > > to get in his way.
> > >
> > > So I am absolutely fine with khugepaged allocating a PMD-size folio for
> > > any inode that claims mapping_large_folio_support().  If any filesystems
> > > break, we'll fix them.
> >
> > Just for clarification, what is the equivalent code today that
> > enforces mapping_large_folio_support()? I.e. today, khugepaged can
> > successfully collapse file without checking if the inode supports it
> > (we only check that it's a regular file not opened for writing).
>
> Yeah, that's a dodgy hack which needs to go away.  But we need a lot
> more filesystems converted to supporting large folios before we can
> delete it.  Not your responsibility; I'm doing my best to encourage
> fs maintainers to do this part.

Got it. In the meantime, do we want to check the old conditions +
mapping_large_folio_support()?

> > Also, just to check, there isn't anything wrong with following
> > collapse_file()'s approach, even for folios of 0 < order <
> > HPAGE_PMD_ORDER? I.e this part:
> >
> >  * Basic scheme is simple, details are more complex:
> >  *  - allocate and lock a new huge page;
> >  *  - scan page cache replacing old pages with the new one
> >  *    + swap/gup in pages if necessary;
> >  *    + fill in gaps;
> >  *    + keep old pages around in case rollback is required;
> >  *  - if replacing succeeds:
> >  *    + copy data over;
> >  *    + free old pages;
> >  *    + unlock huge page;
> >  *  - if replacing failed;
> >  *    + put all pages back and unfreeze them;
> >  *    + restore gaps in the page cache;
> >  *    + unlock and free huge page;
> >  */
>
> Correct.  At least, as far as I know!  Working on folios has been quite
> the education for me ...

Great! Well, perhaps I'll run into a snafu here or there (and
hopefully learn something myself) but this gives me enough confidence
to naively give it a try and see what happens!

Again, thank you very much for your time, help and advice with this,

Best,
Zach


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: mm/khugepaged: collapse file/shmem compound pages
  2022-05-27 16:27           ` Zach O'Keefe
@ 2022-05-28  3:48             ` Matthew Wilcox
  2022-05-29 21:36               ` Zach O'Keefe
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2022-05-28  3:48 UTC (permalink / raw)
  To: Zach O'Keefe; +Cc: David Rientjes, linux-mm

On Fri, May 27, 2022 at 09:27:33AM -0700, Zach O'Keefe wrote:
> On Thu, May 26, 2022 at 8:47 PM Matthew Wilcox <willy@infradead.org> wrote:
> > Because PageTransCompound() does not do what it says on the tin.
> >
> > static inline int PageTransCompound(struct page *page)
> > {
> >         return PageCompound(page);
> > }
> >
> > So any compound page is treated as if it's a PMD-sized page.
> 
> Right - therein lies the problem :) I think I misattributed your
> comment "we'll simply skip over it because the code believes that
> means it's already a PMD" as a solution, not as the current state of
> things. What we need to be able to do is:
> 
> 1) If folio order == 0: do what we've been doing
> 2) If folio order == HPAGE_PMD_ORDER: check if it's _actually_
> pmd-mapped. If it is, we're done. If not, continue to step (3)

I would not do that part.  Just leave it alone and assume everything's
good.

> 3) Else (folio order > 0 and not pmd-mapped): new magic; hopefully
> it's ~ same as step (1)

Yes, exactly this.

> > > I thought the point / benefit of khugepaged was precisely to try and
> > > find places where we can collapse many pte entries into a single pmd
> > > mapping?
> >
> > Ideally, yes.  But if a file is mapped at an address which isn't
> > PMD-aligned, it can't.  Maybe it should just decline to operate in that
> > case.
> 
> To make sure I'm not missing anything here: It's not actually
> important that the file is mapped at a pmd-aligned address. All that
> is important is that the region of memory being collapsed is
> pmd-aligned. If we wanted to collapse memory mapped to the start of
> the file, then sure, the file has to be mapped suitably.

Ah, what you're probably missing is that for file pages/folios, they
have to be naturally aligned.  The data structure underlying the
page cache simply can't cope with askew pages.  (It kind of can under
some circumstances that are so complicated that they shouldn't be
explained, and it's far easier just to say "folios must be naturally
aligned within the file")

> > > > shmem still expects folios to be of order either 0 or PMD_ORDER.
> > > > That assumption extends into the swap code and I haven't had the heart
> > > > to go and fix all those places yet.  Plus Neil was doing major surgery
> > > > to the swap code in the most recent deveopment cycle and I didn't want
> > > > to get in his way.
> > > >
> > > > So I am absolutely fine with khugepaged allocating a PMD-size folio for
> > > > any inode that claims mapping_large_folio_support().  If any filesystems
> > > > break, we'll fix them.
> > >
> > > Just for clarification, what is the equivalent code today that
> > > enforces mapping_large_folio_support()? I.e. today, khugepaged can
> > > successfully collapse file without checking if the inode supports it
> > > (we only check that it's a regular file not opened for writing).
> >
> > Yeah, that's a dodgy hack which needs to go away.  But we need a lot
> > more filesystems converted to supporting large folios before we can
> > delete it.  Not your responsibility; I'm doing my best to encourage
> > fs maintainers to do this part.
> 
> Got it. In the meantime, do we want to check the old conditions +
> mapping_large_folio_support()?

Yes, that should work.  khugepaged should be free to create large
folios if the underlying filesystem supports them OR (executable,
read-only, CONFIG_THP_RO, etc, etc).

> > > Also, just to check, there isn't anything wrong with following
> > > collapse_file()'s approach, even for folios of 0 < order <
> > > HPAGE_PMD_ORDER? I.e this part:
> > >
> > >  * Basic scheme is simple, details are more complex:
> > >  *  - allocate and lock a new huge page;
> > >  *  - scan page cache replacing old pages with the new one
> > >  *    + swap/gup in pages if necessary;
> > >  *    + fill in gaps;
> > >  *    + keep old pages around in case rollback is required;
> > >  *  - if replacing succeeds:
> > >  *    + copy data over;
> > >  *    + free old pages;
> > >  *    + unlock huge page;
> > >  *  - if replacing failed;
> > >  *    + put all pages back and unfreeze them;
> > >  *    + restore gaps in the page cache;
> > >  *    + unlock and free huge page;
> > >  */
> >
> > Correct.  At least, as far as I know!  Working on folios has been quite
> > the education for me ...
> 
> Great! Well, perhaps I'll run into a snafu here or there (and
> hopefully learn something myself) but this gives me enough confidence
> to naively give it a try and see what happens!
> 
> Again, thank you very much for your time, help and advice with this,

You're welcome!  Thanks for putting in some work on this project!


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: mm/khugepaged: collapse file/shmem compound pages
  2022-05-28  3:48             ` Matthew Wilcox
@ 2022-05-29 21:36               ` Zach O'Keefe
  2022-05-30  1:25                 ` Rongwei Wang
  0 siblings, 1 reply; 12+ messages in thread
From: Zach O'Keefe @ 2022-05-29 21:36 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: David Rientjes, linux-mm

On Fri, May 27, 2022 at 8:48 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, May 27, 2022 at 09:27:33AM -0700, Zach O'Keefe wrote:
> > On Thu, May 26, 2022 at 8:47 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > Because PageTransCompound() does not do what it says on the tin.
> > >
> > > static inline int PageTransCompound(struct page *page)
> > > {
> > >         return PageCompound(page);
> > > }
> > >
> > > So any compound page is treated as if it's a PMD-sized page.
> >
> > Right - therein lies the problem :) I think I misattributed your
> > comment "we'll simply skip over it because the code believes that
> > means it's already a PMD" as a solution, not as the current state of
> > things. What we need to be able to do is:
> >
> > 1) If folio order == 0: do what we've been doing
> > 2) If folio order == HPAGE_PMD_ORDER: check if it's _actually_
> > pmd-mapped. If it is, we're done. If not, continue to step (3)
>
> I would not do that part.  Just leave it alone and assume everything's
> good.

Sorry if I keep pressing the issue here - but why not check? If the
goal of khugepaged (and certainly MADV_COLLAPSE) is to map eligible
memory at the pmd level, then these pte-mapped hugepages that we might
discover in step (2) are actually the cheapest memory to collapse
since we can do the collapse in-place.

> > 3) Else (folio order > 0 and not pmd-mapped): new magic; hopefully
> > it's ~ same as step (1)
>
> Yes, exactly this.
>
> > > > I thought the point / benefit of khugepaged was precisely to try and
> > > > find places where we can collapse many pte entries into a single pmd
> > > > mapping?
> > >
> > > Ideally, yes.  But if a file is mapped at an address which isn't
> > > PMD-aligned, it can't.  Maybe it should just decline to operate in that
> > > case.
> >
> > To make sure I'm not missing anything here: It's not actually
> > important that the file is mapped at a pmd-aligned address. All that
> > is important is that the region of memory being collapsed is
> > pmd-aligned. If we wanted to collapse memory mapped to the start of
> > the file, then sure, the file has to be mapped suitably.
>
> Ah, what you're probably missing is that for file pages/folios, they
> have to be naturally aligned.  The data structure underlying the
> page cache simply can't cope with askew pages.  (It kind of can under
> some circumstances that are so complicated that they shouldn't be
> explained, and it's far easier just to say "folios must be naturally
> aligned within the file")

I'm trying to understand what you mean by "naturally aligned" here.
I'm operating under the assumption that all file pages map to
page-sized offsets within a file (e.g. linear_page_address()) and that
files are mapped at a page-aligned address. In the event we want to
collapse file-backed memory, if the virtual address of said memory is
hugepage-aligned, I don't think it's necessary that the address maps
to a hugepage-sized offset in the file? I.e. on x86, the file could
itself be mapped to the start of the last page in a 2MiB region ,X,
and that wouldn't prevent us from collapsing the 2MiB region starting
at X+4KiB.

> > > > > shmem still expects folios to be of order either 0 or PMD_ORDER.
> > > > > That assumption extends into the swap code and I haven't had the heart
> > > > > to go and fix all those places yet.  Plus Neil was doing major surgery
> > > > > to the swap code in the most recent deveopment cycle and I didn't want
> > > > > to get in his way.
> > > > >
> > > > > So I am absolutely fine with khugepaged allocating a PMD-size folio for
> > > > > any inode that claims mapping_large_folio_support().  If any filesystems
> > > > > break, we'll fix them.
> > > >
> > > > Just for clarification, what is the equivalent code today that
> > > > enforces mapping_large_folio_support()? I.e. today, khugepaged can
> > > > successfully collapse file without checking if the inode supports it
> > > > (we only check that it's a regular file not opened for writing).
> > >
> > > Yeah, that's a dodgy hack which needs to go away.  But we need a lot
> > > more filesystems converted to supporting large folios before we can
> > > delete it.  Not your responsibility; I'm doing my best to encourage
> > > fs maintainers to do this part.
> >
> > Got it. In the meantime, do we want to check the old conditions +
> > mapping_large_folio_support()?
>
> Yes, that should work.  khugepaged should be free to create large
> folios if the underlying filesystem supports them OR (executable,
> read-only, CONFIG_THP_RO, etc, etc).

Thanks for confirming!

> > > > Also, just to check, there isn't anything wrong with following
> > > > collapse_file()'s approach, even for folios of 0 < order <
> > > > HPAGE_PMD_ORDER? I.e this part:
> > > >
> > > >  * Basic scheme is simple, details are more complex:
> > > >  *  - allocate and lock a new huge page;
> > > >  *  - scan page cache replacing old pages with the new one
> > > >  *    + swap/gup in pages if necessary;
> > > >  *    + fill in gaps;
> > > >  *    + keep old pages around in case rollback is required;
> > > >  *  - if replacing succeeds:
> > > >  *    + copy data over;
> > > >  *    + free old pages;
> > > >  *    + unlock huge page;
> > > >  *  - if replacing failed;
> > > >  *    + put all pages back and unfreeze them;
> > > >  *    + restore gaps in the page cache;
> > > >  *    + unlock and free huge page;
> > > >  */
> > >
> > > Correct.  At least, as far as I know!  Working on folios has been quite
> > > the education for me ...
> >
> > Great! Well, perhaps I'll run into a snafu here or there (and
> > hopefully learn something myself) but this gives me enough confidence
> > to naively give it a try and see what happens!
> >
> > Again, thank you very much for your time, help and advice with this,
>
> You're welcome!  Thanks for putting in some work on this project!

No problem! Hopefully this can benefit a bunch of existing users.

Thanks again,
Zach


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: mm/khugepaged: collapse file/shmem compound pages
  2022-05-29 21:36               ` Zach O'Keefe
@ 2022-05-30  1:25                 ` Rongwei Wang
  2022-06-01  5:19                   ` Zach O'Keefe
  0 siblings, 1 reply; 12+ messages in thread
From: Rongwei Wang @ 2022-05-30  1:25 UTC (permalink / raw)
  To: Zach O'Keefe, Matthew Wilcox; +Cc: David Rientjes, linux-mm



On 5/30/22 5:36 AM, Zach O'Keefe wrote:
> On Fri, May 27, 2022 at 8:48 PM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Fri, May 27, 2022 at 09:27:33AM -0700, Zach O'Keefe wrote:
>>> On Thu, May 26, 2022 at 8:47 PM Matthew Wilcox <willy@infradead.org> wrote:
>>>> Because PageTransCompound() does not do what it says on the tin.
>>>>
>>>> static inline int PageTransCompound(struct page *page)
>>>> {
>>>>          return PageCompound(page);
>>>> }
>>>>
>>>> So any compound page is treated as if it's a PMD-sized page.
>>>
>>> Right - therein lies the problem :) I think I misattributed your
>>> comment "we'll simply skip over it because the code believes that
>>> means it's already a PMD" as a solution, not as the current state of
>>> things. What we need to be able to do is:
>>>
>>> 1) If folio order == 0: do what we've been doing
>>> 2) If folio order == HPAGE_PMD_ORDER: check if it's _actually_
>>> pmd-mapped. If it is, we're done. If not, continue to step (3)
>>
>> I would not do that part.  Just leave it alone and assume everything's
>> good.
> 
> Sorry if I keep pressing the issue here - but why not check? If the
> goal of khugepaged (and certainly MADV_COLLAPSE) is to map eligible
> memory at the pmd level, then these pte-mapped hugepages that we might
> discover in step (2) are actually the cheapest memory to collapse
> since we can do the collapse in-place.
> 
>>> 3) Else (folio order > 0 and not pmd-mapped): new magic; hopefully
>>> it's ~ same as step (1)
>>
>> Yes, exactly this.
>>
>>>>> I thought the point / benefit of khugepaged was precisely to try and
>>>>> find places where we can collapse many pte entries into a single pmd
>>>>> mapping?
>>>>
>>>> Ideally, yes.  But if a file is mapped at an address which isn't
>>>> PMD-aligned, it can't.  Maybe it should just decline to operate in that
>>>> case.
>>>
>>> To make sure I'm not missing anything here: It's not actually
>>> important that the file is mapped at a pmd-aligned address. All that
>>> is important is that the region of memory being collapsed is
>>> pmd-aligned. If we wanted to collapse memory mapped to the start of
>>> the file, then sure, the file has to be mapped suitably.
>>
>> Ah, what you're probably missing is that for file pages/folios, they
>> have to be naturally aligned.  The data structure underlying the
>> page cache simply can't cope with askew pages.  (It kind of can under
>> some circumstances that are so complicated that they shouldn't be
>> explained, and it's far easier just to say "folios must be naturally
>> aligned within the file")
> 
> I'm trying to understand what you mean by "naturally aligned" here.
> I'm operating under the assumption that all file pages map to
> page-sized offsets within a file (e.g. linear_page_address()) and that
> files are mapped at a page-aligned address. In the event we want to
> collapse file-backed memory, if the virtual address of said memory is
> hugepage-aligned, I don't think it's necessary that the address maps
> to a hugepage-sized offset in the file? I.e. on x86, the file could
Hi, Zach

I'm not sure get your question rightly. We submitted patch set to 
support file THP can been used transparently, likes THP[1].

[1]https://lore.kernel.org/linux-mm/20211009092658.59665-2-rongwei.wang@linux.alibaba.com/

In this patch, I remember that we need to check if '(vma->vm_start >> 
PAGE_SHIFT) - vma->vm_pgoff' is align with HPAGE_PMD_NR. likes:

+static inline bool vma_is_hugetext(struct vm_area_struct *vma,
+				   unsigned long vm_flags)
+{
+	if (!(vm_flags & VM_EXEC))
+		return false;
+
+	if (vma->vm_file && !inode_is_open_for_write(vma->vm_file->f_inode))
+		return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
+				HPAGE_PMD_NR);
+
+	return false;
+}

There is a little different with anon THP here.
> itself be mapped to the start of the last page in a 2MiB region ,X,

Maybe it's related with ELF's 'Align' parameter in your current system? 
If 'Align' set to 2MB (by 'readelf -l /path/exec'), it's probably meets 
the above alignment check.

And the default Align parameter is related to binutils version, also can 
be set in compile time by '-z max-page-size=<align size>' option.

Hope it is helpful:)

-wrw

> and that wouldn't prevent us from collapsing the 2MiB region starting
> at X+4KiB.
> 
>>>>>> shmem still expects folios to be of order either 0 or PMD_ORDER.
>>>>>> That assumption extends into the swap code and I haven't had the heart
>>>>>> to go and fix all those places yet.  Plus Neil was doing major surgery
>>>>>> to the swap code in the most recent deveopment cycle and I didn't want
>>>>>> to get in his way.
>>>>>>
>>>>>> So I am absolutely fine with khugepaged allocating a PMD-size folio for
>>>>>> any inode that claims mapping_large_folio_support().  If any filesystems
>>>>>> break, we'll fix them.
>>>>>
>>>>> Just for clarification, what is the equivalent code today that
>>>>> enforces mapping_large_folio_support()? I.e. today, khugepaged can
>>>>> successfully collapse file without checking if the inode supports it
>>>>> (we only check that it's a regular file not opened for writing).
>>>>
>>>> Yeah, that's a dodgy hack which needs to go away.  But we need a lot
>>>> more filesystems converted to supporting large folios before we can
>>>> delete it.  Not your responsibility; I'm doing my best to encourage
>>>> fs maintainers to do this part.
>>>
>>> Got it. In the meantime, do we want to check the old conditions +
>>> mapping_large_folio_support()?
>>
>> Yes, that should work.  khugepaged should be free to create large
>> folios if the underlying filesystem supports them OR (executable,
>> read-only, CONFIG_THP_RO, etc, etc).
> 
> Thanks for confirming!
> 
>>>>> Also, just to check, there isn't anything wrong with following
>>>>> collapse_file()'s approach, even for folios of 0 < order <
>>>>> HPAGE_PMD_ORDER? I.e this part:
>>>>>
>>>>>   * Basic scheme is simple, details are more complex:
>>>>>   *  - allocate and lock a new huge page;
>>>>>   *  - scan page cache replacing old pages with the new one
>>>>>   *    + swap/gup in pages if necessary;
>>>>>   *    + fill in gaps;
>>>>>   *    + keep old pages around in case rollback is required;
>>>>>   *  - if replacing succeeds:
>>>>>   *    + copy data over;
>>>>>   *    + free old pages;
>>>>>   *    + unlock huge page;
>>>>>   *  - if replacing failed;
>>>>>   *    + put all pages back and unfreeze them;
>>>>>   *    + restore gaps in the page cache;
>>>>>   *    + unlock and free huge page;
>>>>>   */
>>>>
>>>> Correct.  At least, as far as I know!  Working on folios has been quite
>>>> the education for me ...
>>>
>>> Great! Well, perhaps I'll run into a snafu here or there (and
>>> hopefully learn something myself) but this gives me enough confidence
>>> to naively give it a try and see what happens!
>>>
>>> Again, thank you very much for your time, help and advice with this,
>>
>> You're welcome!  Thanks for putting in some work on this project!
> 
> No problem! Hopefully this can benefit a bunch of existing users.
> 
> Thanks again,
> Zach


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: mm/khugepaged: collapse file/shmem compound pages
  2022-05-30  1:25                 ` Rongwei Wang
@ 2022-06-01  5:19                   ` Zach O'Keefe
  2022-06-01 11:26                     ` Rongwei Wang
  0 siblings, 1 reply; 12+ messages in thread
From: Zach O'Keefe @ 2022-06-01  5:19 UTC (permalink / raw)
  To: Rongwei Wang; +Cc: Matthew Wilcox, David Rientjes, linux-mm, Hugh Dickins

On Sun, May 29, 2022 at 6:25 PM Rongwei Wang
<rongwei.wang@linux.alibaba.com> wrote:
>
>
>
> On 5/30/22 5:36 AM, Zach O'Keefe wrote:
> > On Fri, May 27, 2022 at 8:48 PM Matthew Wilcox <willy@infradead.org> wrote:
> >>
> >> On Fri, May 27, 2022 at 09:27:33AM -0700, Zach O'Keefe wrote:
> >>> On Thu, May 26, 2022 at 8:47 PM Matthew Wilcox <willy@infradead.org> wrote:
> >>>> Because PageTransCompound() does not do what it says on the tin.
> >>>>
> >>>> static inline int PageTransCompound(struct page *page)
> >>>> {
> >>>>          return PageCompound(page);
> >>>> }
> >>>>
> >>>> So any compound page is treated as if it's a PMD-sized page.
> >>>
> >>> Right - therein lies the problem :) I think I misattributed your
> >>> comment "we'll simply skip over it because the code believes that
> >>> means it's already a PMD" as a solution, not as the current state of
> >>> things. What we need to be able to do is:
> >>>
> >>> 1) If folio order == 0: do what we've been doing
> >>> 2) If folio order == HPAGE_PMD_ORDER: check if it's _actually_
> >>> pmd-mapped. If it is, we're done. If not, continue to step (3)
> >>
> >> I would not do that part.  Just leave it alone and assume everything's
> >> good.
> >
> > Sorry if I keep pressing the issue here - but why not check? If the
> > goal of khugepaged (and certainly MADV_COLLAPSE) is to map eligible
> > memory at the pmd level, then these pte-mapped hugepages that we might
> > discover in step (2) are actually the cheapest memory to collapse
> > since we can do the collapse in-place.
> >
> >>> 3) Else (folio order > 0 and not pmd-mapped): new magic; hopefully
> >>> it's ~ same as step (1)
> >>
> >> Yes, exactly this.
> >>
> >>>>> I thought the point / benefit of khugepaged was precisely to try and
> >>>>> find places where we can collapse many pte entries into a single pmd
> >>>>> mapping?
> >>>>
> >>>> Ideally, yes.  But if a file is mapped at an address which isn't
> >>>> PMD-aligned, it can't.  Maybe it should just decline to operate in that
> >>>> case.
> >>>
> >>> To make sure I'm not missing anything here: It's not actually
> >>> important that the file is mapped at a pmd-aligned address. All that
> >>> is important is that the region of memory being collapsed is
> >>> pmd-aligned. If we wanted to collapse memory mapped to the start of
> >>> the file, then sure, the file has to be mapped suitably.
> >>
> >> Ah, what you're probably missing is that for file pages/folios, they
> >> have to be naturally aligned.  The data structure underlying the
> >> page cache simply can't cope with askew pages.  (It kind of can under
> >> some circumstances that are so complicated that they shouldn't be
> >> explained, and it's far easier just to say "folios must be naturally
> >> aligned within the file")
> >
> > I'm trying to understand what you mean by "naturally aligned" here.
> > I'm operating under the assumption that all file pages map to
> > page-sized offsets within a file (e.g. linear_page_address()) and that
> > files are mapped at a page-aligned address. In the event we want to
> > collapse file-backed memory, if the virtual address of said memory is
> > hugepage-aligned, I don't think it's necessary that the address maps
> > to a hugepage-sized offset in the file? I.e. on x86, the file could
> Hi, Zach
>
> I'm not sure get your question rightly. We submitted patch set to
> support file THP can been used transparently, likes THP[1].
>
> [1]https://lore.kernel.org/linux-mm/20211009092658.59665-2-rongwei.wang@linux.alibaba.com/
>
> In this patch, I remember that we need to check if '(vma->vm_start >>
> PAGE_SHIFT) - vma->vm_pgoff' is align with HPAGE_PMD_NR. likes:
>
> +static inline bool vma_is_hugetext(struct vm_area_struct *vma,
> +                                  unsigned long vm_flags)
> +{
> +       if (!(vm_flags & VM_EXEC))
> +               return false;
> +
> +       if (vma->vm_file && !inode_is_open_for_write(vma->vm_file->f_inode))
> +               return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
> +                               HPAGE_PMD_NR);
> +
> +       return false;
> +}
>
> There is a little different with anon THP here.
> > itself be mapped to the start of the last page in a 2MiB region ,X,
>
> Maybe it's related with ELF's 'Align' parameter in your current system?
> If 'Align' set to 2MB (by 'readelf -l /path/exec'), it's probably meets
> the above alignment check.
>
> And the default Align parameter is related to binutils version, also can
> be set in compile time by '-z max-page-size=<align size>' option.
>
> Hope it is helpful:)
>
> -wrw

Hey Rongwei!

Thanks for the code / help. Took a little bit, but hughd has
enlightened me on the problem (thanks Hugh!). Likewise, apologies for
not understanding your previous comment regarding folio alignment,
Matthew.

Also, thanks for linking your patchset, and sorry for missing it
previously. It seems we're interested in the same problem! Hopefully
this work can be beneficial to your use case as well.

Thanks again for your time,
Zach



> > and that wouldn't prevent us from collapsing the 2MiB region starting
> > at X+4KiB.
> >
> >>>>>> shmem still expects folios to be of order either 0 or PMD_ORDER.
> >>>>>> That assumption extends into the swap code and I haven't had the heart
> >>>>>> to go and fix all those places yet.  Plus Neil was doing major surgery
> >>>>>> to the swap code in the most recent deveopment cycle and I didn't want
> >>>>>> to get in his way.
> >>>>>>
> >>>>>> So I am absolutely fine with khugepaged allocating a PMD-size folio for
> >>>>>> any inode that claims mapping_large_folio_support().  If any filesystems
> >>>>>> break, we'll fix them.
> >>>>>
> >>>>> Just for clarification, what is the equivalent code today that
> >>>>> enforces mapping_large_folio_support()? I.e. today, khugepaged can
> >>>>> successfully collapse file without checking if the inode supports it
> >>>>> (we only check that it's a regular file not opened for writing).
> >>>>
> >>>> Yeah, that's a dodgy hack which needs to go away.  But we need a lot
> >>>> more filesystems converted to supporting large folios before we can
> >>>> delete it.  Not your responsibility; I'm doing my best to encourage
> >>>> fs maintainers to do this part.
> >>>
> >>> Got it. In the meantime, do we want to check the old conditions +
> >>> mapping_large_folio_support()?
> >>
> >> Yes, that should work.  khugepaged should be free to create large
> >> folios if the underlying filesystem supports them OR (executable,
> >> read-only, CONFIG_THP_RO, etc, etc).
> >
> > Thanks for confirming!
> >
> >>>>> Also, just to check, there isn't anything wrong with following
> >>>>> collapse_file()'s approach, even for folios of 0 < order <
> >>>>> HPAGE_PMD_ORDER? I.e this part:
> >>>>>
> >>>>>   * Basic scheme is simple, details are more complex:
> >>>>>   *  - allocate and lock a new huge page;
> >>>>>   *  - scan page cache replacing old pages with the new one
> >>>>>   *    + swap/gup in pages if necessary;
> >>>>>   *    + fill in gaps;
> >>>>>   *    + keep old pages around in case rollback is required;
> >>>>>   *  - if replacing succeeds:
> >>>>>   *    + copy data over;
> >>>>>   *    + free old pages;
> >>>>>   *    + unlock huge page;
> >>>>>   *  - if replacing failed;
> >>>>>   *    + put all pages back and unfreeze them;
> >>>>>   *    + restore gaps in the page cache;
> >>>>>   *    + unlock and free huge page;
> >>>>>   */
> >>>>
> >>>> Correct.  At least, as far as I know!  Working on folios has been quite
> >>>> the education for me ...
> >>>
> >>> Great! Well, perhaps I'll run into a snafu here or there (and
> >>> hopefully learn something myself) but this gives me enough confidence
> >>> to naively give it a try and see what happens!
> >>>
> >>> Again, thank you very much for your time, help and advice with this,
> >>
> >> You're welcome!  Thanks for putting in some work on this project!
> >
> > No problem! Hopefully this can benefit a bunch of existing users.
> >
> > Thanks again,
> > Zach


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: mm/khugepaged: collapse file/shmem compound pages
  2022-06-01  5:19                   ` Zach O'Keefe
@ 2022-06-01 11:26                     ` Rongwei Wang
  0 siblings, 0 replies; 12+ messages in thread
From: Rongwei Wang @ 2022-06-01 11:26 UTC (permalink / raw)
  To: Zach O'Keefe; +Cc: Matthew Wilcox, David Rientjes, linux-mm, Hugh Dickins



On 6/1/22 1:19 PM, Zach O'Keefe wrote:
> On Sun, May 29, 2022 at 6:25 PM Rongwei Wang
> <rongwei.wang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 5/30/22 5:36 AM, Zach O'Keefe wrote:
>>> On Fri, May 27, 2022 at 8:48 PM Matthew Wilcox <willy@infradead.org> wrote:
>>>>
>>>> On Fri, May 27, 2022 at 09:27:33AM -0700, Zach O'Keefe wrote:
>>>>> On Thu, May 26, 2022 at 8:47 PM Matthew Wilcox <willy@infradead.org> wrote:
>>>>>> Because PageTransCompound() does not do what it says on the tin.
>>>>>>
>>>>>> static inline int PageTransCompound(struct page *page)
>>>>>> {
>>>>>>           return PageCompound(page);
>>>>>> }
>>>>>>
>>>>>> So any compound page is treated as if it's a PMD-sized page.
>>>>>
>>>>> Right - therein lies the problem :) I think I misattributed your
>>>>> comment "we'll simply skip over it because the code believes that
>>>>> means it's already a PMD" as a solution, not as the current state of
>>>>> things. What we need to be able to do is:
>>>>>
>>>>> 1) If folio order == 0: do what we've been doing
>>>>> 2) If folio order == HPAGE_PMD_ORDER: check if it's _actually_
>>>>> pmd-mapped. If it is, we're done. If not, continue to step (3)
>>>>
>>>> I would not do that part.  Just leave it alone and assume everything's
>>>> good.
>>>
>>> Sorry if I keep pressing the issue here - but why not check? If the
>>> goal of khugepaged (and certainly MADV_COLLAPSE) is to map eligible
>>> memory at the pmd level, then these pte-mapped hugepages that we might
>>> discover in step (2) are actually the cheapest memory to collapse
>>> since we can do the collapse in-place.
>>>
>>>>> 3) Else (folio order > 0 and not pmd-mapped): new magic; hopefully
>>>>> it's ~ same as step (1)
>>>>
>>>> Yes, exactly this.
>>>>
>>>>>>> I thought the point / benefit of khugepaged was precisely to try and
>>>>>>> find places where we can collapse many pte entries into a single pmd
>>>>>>> mapping?
>>>>>>
>>>>>> Ideally, yes.  But if a file is mapped at an address which isn't
>>>>>> PMD-aligned, it can't.  Maybe it should just decline to operate in that
>>>>>> case.
>>>>>
>>>>> To make sure I'm not missing anything here: It's not actually
>>>>> important that the file is mapped at a pmd-aligned address. All that
>>>>> is important is that the region of memory being collapsed is
>>>>> pmd-aligned. If we wanted to collapse memory mapped to the start of
>>>>> the file, then sure, the file has to be mapped suitably.
>>>>
>>>> Ah, what you're probably missing is that for file pages/folios, they
>>>> have to be naturally aligned.  The data structure underlying the
>>>> page cache simply can't cope with askew pages.  (It kind of can under
>>>> some circumstances that are so complicated that they shouldn't be
>>>> explained, and it's far easier just to say "folios must be naturally
>>>> aligned within the file")
>>>
>>> I'm trying to understand what you mean by "naturally aligned" here.
>>> I'm operating under the assumption that all file pages map to
>>> page-sized offsets within a file (e.g. linear_page_address()) and that
>>> files are mapped at a page-aligned address. In the event we want to
>>> collapse file-backed memory, if the virtual address of said memory is
>>> hugepage-aligned, I don't think it's necessary that the address maps
>>> to a hugepage-sized offset in the file? I.e. on x86, the file could
>> Hi, Zach
>>
>> I'm not sure get your question rightly. We submitted patch set to
>> support file THP can been used transparently, likes THP[1].
>>
>> [1]https://lore.kernel.org/linux-mm/20211009092658.59665-2-rongwei.wang@linux.alibaba.com/
>>
>> In this patch, I remember that we need to check if '(vma->vm_start >>
>> PAGE_SHIFT) - vma->vm_pgoff' is align with HPAGE_PMD_NR. likes:
>>
>> +static inline bool vma_is_hugetext(struct vm_area_struct *vma,
>> +                                  unsigned long vm_flags)
>> +{
>> +       if (!(vm_flags & VM_EXEC))
>> +               return false;
>> +
>> +       if (vma->vm_file && !inode_is_open_for_write(vma->vm_file->f_inode))
>> +               return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
>> +                               HPAGE_PMD_NR);
>> +
>> +       return false;
>> +}
>>
>> There is a little different with anon THP here.
>>> itself be mapped to the start of the last page in a 2MiB region ,X,
>>
>> Maybe it's related with ELF's 'Align' parameter in your current system?
>> If 'Align' set to 2MB (by 'readelf -l /path/exec'), it's probably meets
>> the above alignment check.
>>
>> And the default Align parameter is related to binutils version, also can
>> be set in compile time by '-z max-page-size=<align size>' option.
>>
>> Hope it is helpful:)
>>
>> -wrw
> 
> Hey Rongwei!
> 
> Thanks for the code / help. Took a little bit, but hughd has
> enlightened me on the problem (thanks Hugh!). Likewise, apologies for
> not understanding your previous comment regarding folio alignment,
> Matthew.
> 
> Also, thanks for linking your patchset, and sorry for missing it
> previously. It seems we're interested in the same problem! Hopefully
> this work can be beneficial to your use case as well.
Hi Zach!

Thanks you, too. Recently, we are trying to use process_madvise()+DAMON 
to find hot .text, especially x86, and then collapsing into huge pages. 
It seems that process_madvise(MADV_COLLAPSE) is feasible.

Anyway, thanks your nice work.
-wrw
> 
> Thanks again for your time,
> Zach
> 
> 
> 
>>> and that wouldn't prevent us from collapsing the 2MiB region starting
>>> at X+4KiB.
>>>
>>>>>>>> shmem still expects folios to be of order either 0 or PMD_ORDER.
>>>>>>>> That assumption extends into the swap code and I haven't had the heart
>>>>>>>> to go and fix all those places yet.  Plus Neil was doing major surgery
>>>>>>>> to the swap code in the most recent deveopment cycle and I didn't want
>>>>>>>> to get in his way.
>>>>>>>>
>>>>>>>> So I am absolutely fine with khugepaged allocating a PMD-size folio for
>>>>>>>> any inode that claims mapping_large_folio_support().  If any filesystems
>>>>>>>> break, we'll fix them.
>>>>>>>
>>>>>>> Just for clarification, what is the equivalent code today that
>>>>>>> enforces mapping_large_folio_support()? I.e. today, khugepaged can
>>>>>>> successfully collapse file without checking if the inode supports it
>>>>>>> (we only check that it's a regular file not opened for writing).
>>>>>>
>>>>>> Yeah, that's a dodgy hack which needs to go away.  But we need a lot
>>>>>> more filesystems converted to supporting large folios before we can
>>>>>> delete it.  Not your responsibility; I'm doing my best to encourage
>>>>>> fs maintainers to do this part.
>>>>>
>>>>> Got it. In the meantime, do we want to check the old conditions +
>>>>> mapping_large_folio_support()?
>>>>
>>>> Yes, that should work.  khugepaged should be free to create large
>>>> folios if the underlying filesystem supports them OR (executable,
>>>> read-only, CONFIG_THP_RO, etc, etc).
>>>
>>> Thanks for confirming!
>>>
>>>>>>> Also, just to check, there isn't anything wrong with following
>>>>>>> collapse_file()'s approach, even for folios of 0 < order <
>>>>>>> HPAGE_PMD_ORDER? I.e this part:
>>>>>>>
>>>>>>>    * Basic scheme is simple, details are more complex:
>>>>>>>    *  - allocate and lock a new huge page;
>>>>>>>    *  - scan page cache replacing old pages with the new one
>>>>>>>    *    + swap/gup in pages if necessary;
>>>>>>>    *    + fill in gaps;
>>>>>>>    *    + keep old pages around in case rollback is required;
>>>>>>>    *  - if replacing succeeds:
>>>>>>>    *    + copy data over;
>>>>>>>    *    + free old pages;
>>>>>>>    *    + unlock huge page;
>>>>>>>    *  - if replacing failed;
>>>>>>>    *    + put all pages back and unfreeze them;
>>>>>>>    *    + restore gaps in the page cache;
>>>>>>>    *    + unlock and free huge page;
>>>>>>>    */
>>>>>>
>>>>>> Correct.  At least, as far as I know!  Working on folios has been quite
>>>>>> the education for me ...
>>>>>
>>>>> Great! Well, perhaps I'll run into a snafu here or there (and
>>>>> hopefully learn something myself) but this gives me enough confidence
>>>>> to naively give it a try and see what happens!
>>>>>
>>>>> Again, thank you very much for your time, help and advice with this,
>>>>
>>>> You're welcome!  Thanks for putting in some work on this project!
>>>
>>> No problem! Hopefully this can benefit a bunch of existing users.
>>>
>>> Thanks again,
>>> Zach


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-06-01 11:27 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-24 22:42 mm/khugepaged: collapse file/shmem compound pages Zach O'Keefe
2022-05-25 19:07 ` Matthew Wilcox
2022-05-26  1:23   ` Zach O'Keefe
2022-05-26  3:36     ` Matthew Wilcox
2022-05-27  0:54       ` Zach O'Keefe
2022-05-27  3:47         ` Matthew Wilcox
2022-05-27 16:27           ` Zach O'Keefe
2022-05-28  3:48             ` Matthew Wilcox
2022-05-29 21:36               ` Zach O'Keefe
2022-05-30  1:25                 ` Rongwei Wang
2022-06-01  5:19                   ` Zach O'Keefe
2022-06-01 11:26                     ` Rongwei Wang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.