* [RFC PATCH 0/6] variable-order, large folios for anonymous memory
@ 2023-03-17 10:57 ` Ryan Roberts
  0 siblings, 0 replies; 30+ messages in thread
From: Ryan Roberts @ 2023-03-17 10:57 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yin, Fengwei, Yu Zhao
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

Hi All,

This is an RFC for an initial, _very_ limited stab at implementing support for
using variable-order, large folios for anonymous memory. It intends to be the
minimal change, upon which additions can be made incrementally. That said, with
just this change, I achieve a 4% performance improvement when compiling the
kernel (more on that later).

My motivation for posting the RFC now is twofold:

- Get feedback on the approach I'm taking before I go too far down the path;
  does this fit with the direction of the community? Are there any bear traps
  that I've not considered (due to my being fairly new to mm and not having a
  complete understanding of its entirety)?

- Seek support for a bug I'm encountering when MADV_FREE is attempting to
  split_folio() one of these new variable-order anon folios. I've been poring
  over the source and can't find the root cause. For now I have a workaround,
  but hopefully someone can give me some pointers as to where the problem is
  likely to be (see details below).

The patches apply on top of v6.3-rc1 + patches 1-31 of [4] (which needed one
minor conflict resolution). I also have a tree at [5].

See [1], [2], [3] for more background.

Approach
========

For now, I'm only modifying the allocation path (do_anonymous_page()); I'm not
touching the CoW path. First, the order of the folio to allocate for the given
fault is determined by the following constraints:

  - Folio must be naturally aligned within VA space
  - Folio must not breach boundaries of vma
  - Folio must be fully contained inside one pmd entry
  - Folio must not overlap any non-none ptes
  - Order must not be higher than a provided starting order

Where the "provided starting order" is currently hardcoded to 4, but the idea is
that this would eventually be a per-vma value that gets dynamically tuned.
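
A condensed sketch of the selection logic is below (the full
calc_anonymous_folio_order() is in patch 5; this version omits the "first_set"
short-cut that avoids re-scanning ptes):

	static void calc_anonymous_folio_order(struct vm_fault *vmf,
					       int *order_out,
					       unsigned long *addr_out)
	{
		struct vm_area_struct *vma = vmf->vma;
		/* Clamp to PMD size so the folio fits in one pmd entry. */
		int order = min(*order_out, PMD_SHIFT - PAGE_SHIFT);
		unsigned long addr;
		pte_t *pte;
		int nr;

		for (; order > 0; order--) {
			nr = 1 << order;

			/* Naturally align the folio within VA space. */
			addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
			pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);

			/* Folio must not breach the vma boundaries. */
			if (addr < vma->vm_start ||
			    addr + nr * PAGE_SIZE > vma->vm_end)
				continue;

			/* Folio must not overlap any non-none ptes. */
			if (check_all_ptes_none(pte, nr) == nr)
				break;
		}

		*order_out = order;
		*addr_out = order > 0 ? addr : vmf->address;
	}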

We then try to allocate a large folio of the determined order, and keep trying
to allocate with successively smaller orders until we succeed. Once the folio is
allocated we can take the PTL and re-check that all the covered PTE entries are
still none. If not, we decrement the order and start again.
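
The allocation fallback is implemented by try_vma_alloc_zeroed_movable_folio()
(patch 3); the PTL re-check and retry live in the do_anonymous_page() changes
in patch 5. In outline, the fallback loop is:

	static struct folio *try_vma_alloc_zeroed_movable_folio(
						struct vm_area_struct *vma,
						unsigned long vaddr, int order)
	{
		/* Don't let high-order attempts stall in reclaim. */
		gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN;
		struct folio *folio;

		for (; order > 0; order--) {
			folio = vma_alloc_zeroed_movable_folio(vma, vaddr,
							       gfp, order);
			if (folio)
				return folio;
		}

		/* Final order-0 attempt with the default flags. */
		return vma_alloc_zeroed_movable_folio(vma, vaddr, 0, 0);
	}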

Next, the folio is added to the rmap using a new API,
folio_add_new_anon_rmap_range(), which is similar to Yin, Fengwei's
folio_add_file_rmap_range() at [4]. Finally, the ptes are set using Matthew
Wilcox's new set_ptes() API, also at [4].

Folio/page refcounts and mapcounts are managed in the same way as Yin, Fengwei
does in folio_add_file_rmap_range(): a reference is taken on the folio for
each pte, _mapcount is incremented on each page by 1, and
folio->_nr_pages_mapped is set to the number of pages in the folio (since every
page is initially mapped).
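
Condensed into a single helper for illustration (the helper name below is
hypothetical; everything else is taken from patches 4 and 5), the accounting
for a new, fully-mapped anon folio amounts to:

	static void account_new_anon_folio(struct folio *folio,
					   struct vm_area_struct *vma)
	{
		int i, nr = folio_nr_pages(folio);

		/* One reference per pte; the allocation already holds one. */
		folio_ref_add(folio, nr - 1);

		/* Every page in the folio starts out mapped. */
		if (folio_test_large(folio))
			atomic_set(&folio->_nr_pages_mapped, nr);

		/* Per-page _mapcount goes from -1 (unmapped) to 0. */
		for (i = 0; i < nr; i++)
			atomic_set(&folio_page(folio, i)->_mapcount, 0);

		add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr);
		__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
	}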

My assumption is that the mm should be able to deal with these folios correctly
for CoW, reclaim, etc., although perhaps not as optimally as we would
(eventually) like.

Bug(s)
======

When I run this code without the last (workaround) patch, with DEBUG_VM et al.,
PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly these relate
to invalid kernel addresses (which usually look like either NULL + a small
offset, or mostly zeros with a few mid-order bits set + a small offset) or to
lockdep complaining about a bad unlock balance. Call stacks are often in
madvise_free_pte_range(), but I've seen them in filesystem code too. (I can
email example oopses out separately if anyone wants to review them.) My hunch
is that struct pages adjacent to the folio are being corrupted, but I don't
have hard evidence.

With the workaround patch applied, which prevents madvise_free_pte_range() from
attempting to split a large folio, I never see any issues. However, I'm not
putting the system under memory pressure, so I guess I might see the same types
of problem crop up under swap, etc.

I've reviewed most of the code within split_folio() and can't find any smoking
gun, but I wonder if there are implicit assumptions about the large folio being
PMD-sized that I'm now breaking?

The code in madvise_free_pte_range():

	if (folio_test_large(folio)) {
		if (folio_mapcount(folio) != 1)
			goto out;
		folio_get(folio);
		if (!folio_trylock(folio)) {
			folio_put(folio);
			goto out;
		}
		pte_unmap_unlock(orig_pte, ptl);
		if (split_folio(folio)) {
			folio_unlock(folio);
			folio_put(folio);
			orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
			goto out;
		}
		...
	}

This will normally skip my large folios because they have a mapcount > 1 (the
mapcount is incremented for each pte, unlike PMD-mapped pages), but on occasion
it sees a mapcount of 1 and proceeds. So I guess it is racing against reclaim
or CoW in this case?
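
To make that concrete, this is roughly what the check sees for one of these
folios (illustrative only; it assumes an order-4 folio with all 16 ptes still
mapped):

	/*
	 * For an order-4 anon folio mapped by 16 ptes, each page's _mapcount
	 * is 0 (mapped once), so folio_mapcount() reports 16 and MADV_FREE
	 * skips the folio. Only once 15 of the 16 ptes have been unmapped
	 * (e.g. by reclaim or a racing zap) does the check see 1 and fall
	 * through to split_folio().
	 */
	if (folio_mapcount(folio) != 1)		/* 16 in the common case */
		goto out;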

I also see it's doing a dance to take the folio lock and drop the ptl. Perhaps
my large anon folio is not using the folio lock in the same way as a THP would,
and we are therefore not getting the expected serialization?

I'd really appreciate any suggestions for how to progress here!

Performance
===========

With the above bug worked around, I'm benchmarking kernel compilation, which is
known to be heavy on anonymous page faults. Overall, I see a 4% reduction in
wall-time. This is in line with my predictions based on earlier experiments
summarised at [1]. I believe there is scope for further improvement on the CoW
and reclaim paths. I'd also expect to see performance improvements due to
reduced TLB pressure on CPUs that support HPA (I'm running on Ampere Altra,
where HPA is not enabled).

All of the 4% is (obviously) in the kernel; overall kernel execution time has
reduced by 34%, more than halving the time spent servicing data faults and
significantly speeding up sys_exit_group().

Thanks,
Ryan

[1] https://lore.kernel.org/linux-mm/4c991dcb-c5bb-86bb-5a29-05df24429607@arm.com/
[2] https://lore.kernel.org/linux-mm/a7cd938e-a86f-e3af-f56c-433c92ac69c2@arm.com/
[3] https://lore.kernel.org/linux-mm/Y%2FblF0GIunm+pRIC@casper.infradead.org/
[4] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
[5] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc


Ryan Roberts (6):
  mm: Expose clear_huge_page() unconditionally
  mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
  mm: Introduce try_vma_alloc_zeroed_movable_folio()
  mm: Implement folio_add_new_anon_rmap_range()
  mm: Allocate large folios for anonymous memory
  WORKAROUND: Don't split large folios on madvise

 arch/alpha/include/asm/page.h   |   5 +-
 arch/arm64/include/asm/page.h   |   3 +-
 arch/arm64/mm/fault.c           |   7 +-
 arch/ia64/include/asm/page.h    |   5 +-
 arch/m68k/include/asm/page_no.h |   7 +-
 arch/s390/include/asm/page.h    |   5 +-
 arch/x86/include/asm/page.h     |   5 +-
 include/linux/highmem.h         |  23 +++--
 include/linux/mm.h              |   3 +-
 include/linux/rmap.h            |   2 +
 mm/madvise.c                    |   8 ++
 mm/memory.c                     | 167 ++++++++++++++++++++++++++++----
 mm/rmap.c                       |  43 ++++++++
 13 files changed, 239 insertions(+), 44 deletions(-)

--
2.25.1



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 1/6] mm: Expose clear_huge_page() unconditionally
  2023-03-17 10:57 ` Ryan Roberts
@ 2023-03-17 10:57   ` Ryan Roberts
  -1 siblings, 0 replies; 30+ messages in thread
From: Ryan Roberts @ 2023-03-17 10:57 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yin, Fengwei, Yu Zhao
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

In preparation for extending vma_alloc_zeroed_movable_folio() to
allocate an arbitrary-order folio, expose clear_huge_page()
unconditionally, so that it can be used to zero the allocated folio.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/mm.h | 3 ++-
 mm/memory.c        | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1f79667824eb..cdb8c6031d0f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3538,10 +3538,11 @@ enum mf_action_page_type {
  */
 extern const struct attribute_group memory_failure_attr_group;

-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 extern void clear_huge_page(struct page *page,
 			    unsigned long addr_hint,
 			    unsigned int pages_per_huge_page);
+
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 extern void copy_user_huge_page(struct page *dst, struct page *src,
 				unsigned long addr_hint,
 				struct vm_area_struct *vma,
diff --git a/mm/memory.c b/mm/memory.c
index f456f3b5049c..c08645908ee2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5628,7 +5628,6 @@ void __might_fault(const char *file, int line)
 EXPORT_SYMBOL(__might_fault);
 #endif

-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 /*
  * Process all subpages of the specified huge page with the specified
  * operation.  The target subpage will be processed last to keep its
@@ -5716,6 +5715,8 @@ void clear_huge_page(struct page *page,
 	process_huge_page(addr_hint, pages_per_huge_page, clear_subpage, page);
 }

+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+
 static void copy_user_gigantic_page(struct page *dst, struct page *src,
 				    unsigned long addr,
 				    struct vm_area_struct *vma,
--
2.25.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 2/6] mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
  2023-03-17 10:57 ` Ryan Roberts
@ 2023-03-17 10:57   ` Ryan Roberts
  -1 siblings, 0 replies; 30+ messages in thread
From: Ryan Roberts @ 2023-03-17 10:57 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yin, Fengwei, Yu Zhao
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

Allow allocation of large folios with vma_alloc_zeroed_movable_folio().
This prepares the ground for large anonymous folios. The generic
implementation of vma_alloc_zeroed_movable_folio() now uses
clear_huge_page() to zero the allocated folio since it may now be a
non-0 order.

Currently the function is always called with order 0 and no extra gfp
flags, so no functional change intended.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/alpha/include/asm/page.h   |  5 +++--
 arch/arm64/include/asm/page.h   |  3 ++-
 arch/arm64/mm/fault.c           |  7 ++++---
 arch/ia64/include/asm/page.h    |  5 +++--
 arch/m68k/include/asm/page_no.h |  7 ++++---
 arch/s390/include/asm/page.h    |  5 +++--
 arch/x86/include/asm/page.h     |  5 +++--
 include/linux/highmem.h         | 23 +++++++++++++----------
 mm/memory.c                     |  5 +++--
 9 files changed, 38 insertions(+), 27 deletions(-)

diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h
index 4db1ebc0ed99..6fc7fe91b6cb 100644
--- a/arch/alpha/include/asm/page.h
+++ b/arch/alpha/include/asm/page.h
@@ -17,8 +17,9 @@
 extern void clear_page(void *page);
 #define clear_user_page(page, vaddr, pg)	clear_page(page)

-#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
-	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
+#define vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order) \
+	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO | (gfp), \
+			order, vma, vaddr, false)

 extern void copy_page(void * _to, void * _from);
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 2312e6ee595f..47710852f872 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -30,7 +30,8 @@ void copy_highpage(struct page *to, struct page *from);
 #define __HAVE_ARCH_COPY_HIGHPAGE

 struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
-						unsigned long vaddr);
+						unsigned long vaddr,
+						gfp_t gfp, int order);
 #define vma_alloc_zeroed_movable_folio vma_alloc_zeroed_movable_folio

 void tag_clear_highpage(struct page *to);
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index f4cb0f85ccf4..3b4cc04f7a23 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -926,9 +926,10 @@ NOKPROBE_SYMBOL(do_debug_exception);
  * Used during anonymous page fault handling.
  */
 struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
-						unsigned long vaddr)
+						unsigned long vaddr,
+						gfp_t gfp, int order)
 {
-	gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO;
+	gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO | gfp;

 	/*
 	 * If the page is mapped with PROT_MTE, initialise the tags at the
@@ -938,7 +939,7 @@ struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
 	if (vma->vm_flags & VM_MTE)
 		flags |= __GFP_ZEROTAGS;

-	return vma_alloc_folio(flags, 0, vma, vaddr, false);
+	return vma_alloc_folio(flags, order, vma, vaddr, false);
 }

 void tag_clear_highpage(struct page *page)
diff --git a/arch/ia64/include/asm/page.h b/arch/ia64/include/asm/page.h
index 310b09c3342d..ebdf04274023 100644
--- a/arch/ia64/include/asm/page.h
+++ b/arch/ia64/include/asm/page.h
@@ -82,10 +82,11 @@ do {						\
 } while (0)


-#define vma_alloc_zeroed_movable_folio(vma, vaddr)			\
+#define vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order)		\
 ({									\
 	struct folio *folio = vma_alloc_folio(				\
-		GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false); \
+		GFP_HIGHUSER_MOVABLE | __GFP_ZERO | (gfp),		\
+		order, vma, vaddr, false);				\
 	if (folio)							\
 		flush_dcache_folio(folio);				\
 	folio;								\
diff --git a/arch/m68k/include/asm/page_no.h b/arch/m68k/include/asm/page_no.h
index 060e4c0e7605..4a2fe57fef5e 100644
--- a/arch/m68k/include/asm/page_no.h
+++ b/arch/m68k/include/asm/page_no.h
@@ -3,7 +3,7 @@
 #define _M68K_PAGE_NO_H

 #ifndef __ASSEMBLY__
-
+
 extern unsigned long memory_start;
 extern unsigned long memory_end;

@@ -13,8 +13,9 @@ extern unsigned long memory_end;
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

-#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
-	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
+#define vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order) \
+	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO | (gfp), \
+			order, vma, vaddr, false)

 #define __pa(vaddr)		((unsigned long)(vaddr))
 #define __va(paddr)		((void *)((unsigned long)(paddr)))
diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
index 8a2a3b5d1e29..b749564140f1 100644
--- a/arch/s390/include/asm/page.h
+++ b/arch/s390/include/asm/page.h
@@ -73,8 +73,9 @@ static inline void copy_page(void *to, void *from)
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

-#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
-	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
+#define vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order) \
+	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO | (gfp), \
+			order, vma, vaddr, false)

 /*
  * These are used to make use of C type-checking..
diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index d18e5c332cb9..34deab1a8dae 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -34,8 +34,9 @@ static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
 	copy_page(to, from);
 }

-#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
-	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
+#define vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order) \
+	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO | (gfp), \
+			order, vma, vaddr, false)

 #ifndef __pa
 #define __pa(x)		__phys_addr((unsigned long)(x))
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index b06254e76d99..e2127af4997b 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -209,26 +209,29 @@ static inline void clear_user_highpage(struct page *page, unsigned long vaddr)

 #ifndef vma_alloc_zeroed_movable_folio
 /**
- * vma_alloc_zeroed_movable_folio - Allocate a zeroed page for a VMA.
- * @vma: The VMA the page is to be allocated for.
- * @vaddr: The virtual address the page will be inserted into.
- *
- * This function will allocate a page suitable for inserting into this
- * VMA at this virtual address.  It may be allocated from highmem or
+ * vma_alloc_zeroed_movable_folio - Allocate a zeroed folio for a VMA.
+ * @vma: The start VMA the folio is to be allocated for.
+ * @vaddr: The virtual address the folio will be inserted into.
+ * @gfp: Additional gfp flags to mix in or 0.
+ * @order: The order of the folio (2^order pages).
+ *
+ * This function will allocate a folio suitable for inserting into this
+ * VMA starting at this virtual address.  It may be allocated from highmem or
  * the movable zone.  An architecture may provide its own implementation.
  *
- * Return: A folio containing one allocated and zeroed page or NULL if
+ * Return: A folio containing 2^order allocated and zeroed pages or NULL if
  * we are out of memory.
  */
 static inline
 struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
-				   unsigned long vaddr)
+				   unsigned long vaddr, gfp_t gfp, int order)
 {
 	struct folio *folio;

-	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vaddr, false);
+	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE | gfp,
+					order, vma, vaddr, false);
 	if (folio)
-		clear_user_highpage(&folio->page, vaddr);
+		clear_huge_page(&folio->page, vaddr, 1U << order);

 	return folio;
 }
diff --git a/mm/memory.c b/mm/memory.c
index c08645908ee2..8798da968686 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3061,7 +3061,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		goto oom;

 	if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
-		new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+		new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address,
+									0, 0);
 		if (!new_folio)
 			goto oom;
 	} else {
@@ -4049,7 +4050,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	/* Allocate our own private page. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address, 0, 0);
 	if (!folio)
 		goto oom;

--
2.25.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 3/6] mm: Introduce try_vma_alloc_zeroed_movable_folio()
  2023-03-17 10:57 ` Ryan Roberts
@ 2023-03-17 10:57   ` Ryan Roberts
  -1 siblings, 0 replies; 30+ messages in thread
From: Ryan Roberts @ 2023-03-17 10:57 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yin, Fengwei, Yu Zhao
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

Like vma_alloc_zeroed_movable_folio(), except it will opportunistically
attempt to allocate high-order folios, retrying with lower orders all
the way to order-0, until success. The user must check what they got
with folio_order().

This will be used to opportunistically allocate large folios for
anonymous memory with a sensible fallback under pressure.

For attempts to allocate non-0 orders, we set __GFP_NORETRY to prevent
high latency due to reclaim, instead preferring to just try for a lower
order. The same approach is used by the readahead code when allocating
large folios.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/memory.c | 27 ++++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8798da968686..c9e09415ee18 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3024,6 +3024,27 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 	count_vm_event(PGREUSE);
 }

+/*
+ * Opportunistically attempt to allocate high-order folios, retrying with lower
+ * orders all the way to order-0, until success. The user must check what they
+ * got with folio_order().
+ */
+static struct folio *try_vma_alloc_zeroed_movable_folio(
+						struct vm_area_struct *vma,
+						unsigned long vaddr, int order)
+{
+	struct folio *folio;
+	gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN;
+
+	for (; order > 0; order--) {
+		folio = vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order);
+		if (folio)
+			return folio;
+	}
+
+	return vma_alloc_zeroed_movable_folio(vma, vaddr, 0, 0);
+}
+
 /*
  * Handle the case of a page which we actually need to copy to a new page,
  * either due to COW or unsharing.
@@ -3061,8 +3082,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		goto oom;

 	if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
-		new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address,
-									0, 0);
+		new_folio = try_vma_alloc_zeroed_movable_folio(vma,
+							vmf->address, 0);
 		if (!new_folio)
 			goto oom;
 	} else {
@@ -4050,7 +4071,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	/* Allocate our own private page. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address, 0, 0);
+	folio = try_vma_alloc_zeroed_movable_folio(vma, vmf->address, 0);
 	if (!folio)
 		goto oom;

--
2.25.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 4/6] mm: Implement folio_add_new_anon_rmap_range()
  2023-03-17 10:57 ` Ryan Roberts
@ 2023-03-17 10:58   ` Ryan Roberts
  -1 siblings, 0 replies; 30+ messages in thread
From: Ryan Roberts @ 2023-03-17 10:58 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yin, Fengwei, Yu Zhao
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

Like folio_add_new_anon_rmap() but batch-rmaps all the pages belonging
to a folio, for efficiency savings.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/rmap.h |  2 ++
 mm/rmap.c            | 43 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b87d01660412..d1d731650ce8 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -196,6 +196,8 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address);
 void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
 		unsigned long address);
+void folio_add_new_anon_rmap_range(struct folio *folio,
+		struct vm_area_struct *vma, unsigned long address);
 void page_add_file_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
 void page_remove_rmap(struct page *, struct vm_area_struct *,
diff --git a/mm/rmap.c b/mm/rmap.c
index 8632e02661ac..05a0c0a700e7 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1302,6 +1302,49 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
 }

+/**
+ * folio_add_new_anon_rmap_range - Add mapping to a new anonymous potentially
+ * large but definitely non-THP folio.
+ * @folio:      The folio to add the mapping to.
+ * @vma:        the vm area in which the mapping is added
+ * @address:    the user virtual address of the first page in the folio
+ *
+ * Like folio_add_new_anon_rmap() but must only be called for new *non-THP*
+ * folios. Like folio_add_new_anon_rmap(), the inc-and-test is bypassed and the
+ * folio does not have to be locked. All pages in the folio are individually
+ * accounted.
+ *
+ * As the folio is new, it's assumed to be mapped exclusively by a single
+ * process.
+ */
+void folio_add_new_anon_rmap_range(struct folio *folio,
+			struct vm_area_struct *vma, unsigned long address)
+{
+	int i;
+	int nr = folio_nr_pages(folio);
+	struct page *page = &folio->page;
+
+	VM_BUG_ON_VMA(address < vma->vm_start ||
+		      address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
+	__folio_set_swapbacked(folio);
+
+	if (folio_test_large(folio)) {
+		/* increment count (starts at 0) */
+		atomic_set(&folio->_nr_pages_mapped, nr);
+	}
+
+	for (i = 0; i < nr; i++) {
+		/* increment count (starts at -1) */
+		atomic_set(&page->_mapcount, 0);
+		__page_set_anon_rmap(folio, page, vma, address, 1);
+		page++;
+		address += PAGE_SIZE;
+	}
+
+	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
+
+}
+
 /**
  * page_add_file_rmap - add pte mapping to a file page
  * @page:	the page to add the mapping to
--
2.25.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 5/6] mm: Allocate large folios for anonymous memory
  2023-03-17 10:57 ` Ryan Roberts
@ 2023-03-17 10:58   ` Ryan Roberts
  -1 siblings, 0 replies; 30+ messages in thread
From: Ryan Roberts @ 2023-03-17 10:58 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yin, Fengwei, Yu Zhao
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

Add the machinery to determine what order of folio to allocate within
do_anonymous_page() and deal with racing faults to the same region.

TODO: For now, the maximum order is set to 4. This should probably be
set per-vma based on factors, and adjusted dynamically.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/memory.c | 140 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 124 insertions(+), 16 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index c9e09415ee18..3d01eab46d9c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4013,6 +4013,77 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	return ret;
 }

+/*
+ * Returns index of first pte that is not none, or nr if all are none.
+ */
+static int check_all_ptes_none(pte_t *pte, int nr)
+{
+	int i;
+
+	for (i = 0; i < nr; i++) {
+		if (!pte_none(*pte++))
+			return i;
+	}
+
+	return nr;
+}
+
+static void calc_anonymous_folio_order(struct vm_fault *vmf,
+				       int *order_out,
+				       unsigned long *addr_out)
+{
+	/*
+	 * The aim here is to determine what size of folio we should allocate
+	 * for this fault. Factors include:
+	 * - Folio must be naturally aligned within VA space
+	 * - Folio must not breach boundaries of vma
+	 * - Folio must be fully contained inside one pmd entry
+	 * - Folio must not overlap any non-none ptes
+	 * - Order must not be higher than *order_out upon entry
+	 *
+	 * Note that the caller may or may not choose to lock the pte. If
+	 * unlocked, the calculation should be considered an estimate that will
+	 * need to be validated under the lock.
+	 */
+
+	struct vm_area_struct *vma = vmf->vma;
+	int nr;
+	int order = min(*order_out, PMD_SHIFT - PAGE_SHIFT);
+	unsigned long addr;
+	pte_t *pte;
+	pte_t *first_set = NULL;
+	int ret;
+
+	for (; order > 0; order--) {
+		nr = 1 << order;
+		addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
+		pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
+
+		/* Check vma bounds. */
+		if (addr < vma->vm_start ||
+		    addr + nr * PAGE_SIZE > vma->vm_end)
+			continue;
+
+		/* All ptes covered by order already known to be none. */
+		if (pte + nr <= first_set)
+			break;
+
+		/* Already found set pte in range covered by order. */
+		if (pte <= first_set)
+			continue;
+
+		/* Need to check if all the ptes are none. */
+		ret = check_all_ptes_none(pte, nr);
+		if (ret == nr)
+			break;
+
+		first_set = pte + ret;
+	}
+
+	*order_out = order;
+	*addr_out = order > 0 ? addr : vmf->address;
+}
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -4024,6 +4095,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	struct folio *folio;
 	vm_fault_t ret = 0;
 	pte_t entry;
+	unsigned long addr;
+	int order = 4; // TODO: Policy for maximum folio order.
+	int pgcount;

 	/* File mapping without ->vm_ops ? */
 	if (vma->vm_flags & VM_SHARED)
@@ -4065,24 +4139,41 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
 			return handle_userfault(vmf, VM_UFFD_MISSING);
 		}
-		goto setpte;
+		set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+
+		/* No need to invalidate - it was non-present before */
+		update_mmu_cache(vma, vmf->address, vmf->pte);
+		goto unlock;
 	}

-	/* Allocate our own private page. */
+retry:
+	/*
+	 * Estimate the folio order to allocate. We are not under the ptl here
+	 * so this estimate needs to be re-checked later once we have the lock.
+	 */
+	vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+	calc_anonymous_folio_order(vmf, &order, &addr);
+	pte_unmap(vmf->pte);
+
+	/* Allocate our own private folio. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	folio = try_vma_alloc_zeroed_movable_folio(vma, vmf->address, 0);
+	folio = try_vma_alloc_zeroed_movable_folio(vma, addr, order);
 	if (!folio)
 		goto oom;

+	/* We may have been granted less than we asked for. */
+	order = folio_order(folio);
+	pgcount = folio_nr_pages(folio);
+
 	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
 		goto oom_free_page;
-	cgroup_throttle_swaprate(&folio->page, GFP_KERNEL);
+	folio_throttle_swaprate(folio, GFP_KERNEL);

 	/*
 	 * The memory barrier inside __folio_mark_uptodate makes sure that
-	 * preceding stores to the page contents become visible before
-	 * the set_pte_at() write.
+	 * preceding stores to the folio contents become visible before
+	 * the set_ptes() write.
 	 */
 	__folio_mark_uptodate(folio);

@@ -4091,11 +4182,26 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry));

-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
-	if (!pte_none(*vmf->pte)) {
-		update_mmu_tlb(vma, vmf->address, vmf->pte);
-		goto release;
+	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
+
+	/*
+	 * Ensure our estimate above is still correct; we could have raced with
+	 * another thread to service a fault in the region.
+	 */
+	if (check_all_ptes_none(vmf->pte, pgcount) != pgcount) {
+		pte_t *pte = vmf->pte + ((vmf->address - addr) >> PAGE_SHIFT);
+
+		/* If faulting pte was allocated by another, exit early. */
+		if (!pte_none(*pte)) {
+			update_mmu_tlb(vma, vmf->address, pte);
+			goto release;
+		}
+
+		/* Else try again, with a lower order. */
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		folio_put(folio);
+		order--;
+		goto retry;
 	}

 	ret = check_stable_address_space(vma->vm_mm);
@@ -4109,14 +4215,16 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		return handle_userfault(vmf, VM_UFFD_MISSING);
 	}

-	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
-	folio_add_new_anon_rmap(folio, vma, vmf->address);
+	folio_ref_add(folio, pgcount - 1);
+
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
+	folio_add_new_anon_rmap_range(folio, vma, addr);
 	folio_add_lru_vma(folio, vma);
-setpte:
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+
+	set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);

 	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(vma, vmf->address, vmf->pte);
+	update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	return ret;
--
2.25.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 6/6] WORKAROUND: Don't split large folios on madvise
  2023-03-17 10:57 ` Ryan Roberts
@ 2023-03-17 10:58   ` Ryan Roberts
  -1 siblings, 0 replies; 30+ messages in thread
From: Ryan Roberts @ 2023-03-17 10:58 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yin, Fengwei, Yu Zhao
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/madvise.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index 340125d08c03..8fb84da744e1 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -447,6 +447,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		 * are sure it's worth. Split it if we are only owner.
 		 */
 		if (folio_test_large(folio)) {
+#if 0
 			if (folio_mapcount(folio) != 1)
 				break;
 			if (pageout_anon_only_filter && !folio_test_anon(folio))
@@ -469,6 +470,9 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			pte--;
 			addr -= PAGE_SIZE;
 			continue;
+#else
+			break;
+#endif
 		}

 		/*
@@ -664,6 +668,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		 * deactivate all pages.
 		 */
 		if (folio_test_large(folio)) {
+#if 0
 			if (folio_mapcount(folio) != 1)
 				goto out;
 			folio_get(folio);
@@ -684,6 +689,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			pte--;
 			addr -= PAGE_SIZE;
 			continue;
+#else
+			goto out;
+#endif
 		}

 		if (folio_test_swapcache(folio) || folio_test_dirty(folio)) {
--
2.25.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 4/6] mm: Implement folio_add_new_anon_rmap_range()
  2023-03-17 10:58   ` Ryan Roberts
@ 2023-03-22  6:59     ` Yin Fengwei
  -1 siblings, 0 replies; 30+ messages in thread
From: Yin Fengwei @ 2023-03-22  6:59 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao
  Cc: linux-mm, linux-arm-kernel

On 3/17/23 18:58, Ryan Roberts wrote:
> Like folio_add_new_anon_rmap() but batch-rmaps all the pages belonging
> to a folio, for efficiency savings.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   include/linux/rmap.h |  2 ++
>   mm/rmap.c            | 43 +++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 45 insertions(+)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index b87d01660412..d1d731650ce8 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -196,6 +196,8 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
>   		unsigned long address);
>   void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>   		unsigned long address);
> +void folio_add_new_anon_rmap_range(struct folio *folio,
> +		struct vm_area_struct *vma, unsigned long address);
>   void page_add_file_rmap(struct page *, struct vm_area_struct *,
>   		bool compound);
>   void page_remove_rmap(struct page *, struct vm_area_struct *,
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 8632e02661ac..05a0c0a700e7 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1302,6 +1302,49 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>   	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>   }
> 
> +/**
> + * folio_add_new_anon_rmap_range - Add mapping to a new anonymous potentially
> + * large but definitely non-THP folio.
> + * @folio:      The folio to add the mapping to.
> + * @vma:        the vm area in which the mapping is added
> + * @address:    the user virtual address of the first page in the folio
> + *
> + * Like folio_add_new_anon_rmap() but must only be called for new *non-THP*
> + * folios. Like folio_add_new_anon_rmap(), the inc-and-test is bypassed and the
> + * folio does not have to be locked. All pages in the folio are individually
> + * accounted.
> + *
> + * As the folio is new, it's assumed to be mapped exclusively by a single
> + * process.
> + */
> +void folio_add_new_anon_rmap_range(struct folio *folio,
> +			struct vm_area_struct *vma, unsigned long address)
> +{
> +	int i;
> +	int nr = folio_nr_pages(folio);
> +	struct page *page = &folio->page;
> +
> +	VM_BUG_ON_VMA(address < vma->vm_start ||
> +		      address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
> +	__folio_set_swapbacked(folio);
> +
> +	if (folio_test_large(folio)) {
> +		/* increment count (starts at 0) */
> +		atomic_set(&folio->_nr_pages_mapped, nr);
> +	}
> +
> +	for (i = 0; i < nr; i++) {
> +		/* increment count (starts at -1) */
> +		atomic_set(&page->_mapcount, 0);
> +		__page_set_anon_rmap(folio, page, vma, address, 1);
> +		page++;
> +		address += PAGE_SIZE;
> +	}
> +
> +	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
It looks like you missed the __page_set_anon_rmap() call here.


Regards
Yin, Fengwei

> +
> +}
> +
>   /**
>    * page_add_file_rmap - add pte mapping to a file page
>    * @page:	the page to add the mapping to
> --
> 2.25.1
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 4/6] mm: Implement folio_add_new_anon_rmap_range()
  2023-03-17 10:58   ` Ryan Roberts
@ 2023-03-22  7:10     ` Yin Fengwei
  -1 siblings, 0 replies; 30+ messages in thread
From: Yin Fengwei @ 2023-03-22  7:10 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao
  Cc: linux-mm, linux-arm-kernel

On 3/17/23 18:58, Ryan Roberts wrote:
> Like folio_add_new_anon_rmap() but batch-rmaps all the pages belonging
> to a folio, for efficiency savings.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   include/linux/rmap.h |  2 ++
>   mm/rmap.c            | 43 +++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 45 insertions(+)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index b87d01660412..d1d731650ce8 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -196,6 +196,8 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
>   		unsigned long address);
>   void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>   		unsigned long address);
> +void folio_add_new_anon_rmap_range(struct folio *folio,
> +		struct vm_area_struct *vma, unsigned long address);
>   void page_add_file_rmap(struct page *, struct vm_area_struct *,
>   		bool compound);
>   void page_remove_rmap(struct page *, struct vm_area_struct *,
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 8632e02661ac..05a0c0a700e7 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1302,6 +1302,49 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>   	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>   }
> 
> +/**
> + * folio_add_new_anon_rmap_range - Add mapping to a new anonymous potentially
> + * large but definitely non-THP folio.
> + * @folio:      The folio to add the mapping to.
> + * @vma:        the vm area in which the mapping is added
> + * @address:    the user virtual address of the first page in the folio
> + *
> + * Like folio_add_new_anon_rmap() but must only be called for new *non-THP*
> + * folios. Like folio_add_new_anon_rmap(), the inc-and-test is bypassed and the
> + * folio does not have to be locked. All pages in the folio are individually
> + * accounted.
> + *
> + * As the folio is new, it's assumed to be mapped exclusively by a single
> + * process.
> + */
> +void folio_add_new_anon_rmap_range(struct folio *folio,
> +			struct vm_area_struct *vma, unsigned long address)
> +{
> +	int i;
> +	int nr = folio_nr_pages(folio);
> +	struct page *page = &folio->page;
> +
> +	VM_BUG_ON_VMA(address < vma->vm_start ||
> +		      address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
> +	__folio_set_swapbacked(folio);
> +
> +	if (folio_test_large(folio)) {
> +		/* increment count (starts at 0) */
> +		atomic_set(&folio->_nr_pages_mapped, nr);
> +	}
> +
> +	for (i = 0; i < nr; i++) {
> +		/* increment count (starts at -1) */
> +		atomic_set(&page->_mapcount, 0);
> +		__page_set_anon_rmap(folio, page, vma, address, 1);
My bad. You did call it here.

Regards
Yin, Fengwei

> +		page++;
> +		address += PAGE_SIZE;
> +	}
> +
> +	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
> +
> +}
> +
>   /**
>    * page_add_file_rmap - add pte mapping to a file page
>    * @page:	the page to add the mapping to
> --
> 2.25.1
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 4/6] mm: Implement folio_add_new_anon_rmap_range()
  2023-03-22  7:10     ` Yin Fengwei
@ 2023-03-22  7:42       ` Ryan Roberts
  -1 siblings, 0 replies; 30+ messages in thread
From: Ryan Roberts @ 2023-03-22  7:42 UTC (permalink / raw)
  To: Yin Fengwei, Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao
  Cc: linux-mm, linux-arm-kernel

On 22/03/2023 07:10, Yin Fengwei wrote:
> On 3/17/23 18:58, Ryan Roberts wrote:
>> Like folio_add_new_anon_rmap() but batch-rmaps all the pages belonging
>> to a folio, for efficiency savings.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>   include/linux/rmap.h |  2 ++
>>   mm/rmap.c            | 43 +++++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 45 insertions(+)
>>
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index b87d01660412..d1d731650ce8 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -196,6 +196,8 @@ void page_add_new_anon_rmap(struct page *, struct
>> vm_area_struct *,
>>           unsigned long address);
>>   void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>>           unsigned long address);
>> +void folio_add_new_anon_rmap_range(struct folio *folio,
>> +        struct vm_area_struct *vma, unsigned long address);
>>   void page_add_file_rmap(struct page *, struct vm_area_struct *,
>>           bool compound);
>>   void page_remove_rmap(struct page *, struct vm_area_struct *,
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 8632e02661ac..05a0c0a700e7 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1302,6 +1302,49 @@ void folio_add_new_anon_rmap(struct folio *folio,
>> struct vm_area_struct *vma,
>>       __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>>   }
>>
>> +/**
>> + * folio_add_new_anon_rmap_range - Add mapping to a new anonymous potentially
>> + * large but definitely non-THP folio.
>> + * @folio:      The folio to add the mapping to.
>> + * @vma:        the vm area in which the mapping is added
>> + * @address:    the user virtual address of the first page in the folio
>> + *
>> + * Like folio_add_new_anon_rmap() but must only be called for new *non-THP*
>> + * folios. Like folio_add_new_anon_rmap(), the inc-and-test is bypassed and the
>> + * folio does not have to be locked. All pages in the folio are individually
>> + * accounted.
>> + *
>> + * As the folio is new, it's assumed to be mapped exclusively by a single
>> + * process.
>> + */
>> +void folio_add_new_anon_rmap_range(struct folio *folio,
>> +            struct vm_area_struct *vma, unsigned long address)
>> +{
>> +    int i;
>> +    int nr = folio_nr_pages(folio);
>> +    struct page *page = &folio->page;
>> +
>> +    VM_BUG_ON_VMA(address < vma->vm_start ||
>> +              address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
>> +    __folio_set_swapbacked(folio);
>> +
>> +    if (folio_test_large(folio)) {
>> +        /* increment count (starts at 0) */
>> +        atomic_set(&folio->_nr_pages_mapped, nr);
>> +    }
>> +
>> +    for (i = 0; i < nr; i++) {
>> +        /* increment count (starts at -1) */
>> +        atomic_set(&page->_mapcount, 0);
>> +        __page_set_anon_rmap(folio, page, vma, address, 1);
> My bad. You did call it here.

Yes, calling it per subpage to ensure every subpage is marked AnonExclusive.
Although this does rely on calling it _first_ for the head page so that the
index is set correctly. I think that all works out though.

I did wonder if the order of the calls (__page_set_anon_rmap() vs
__lruvec_stat_mod_folio()) might matter - I've swapped them. But I haven't found
any evidence that it does from reviewing the code.
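
To make that ordering concrete, here is a tiny self-contained model of the
dependency (illustrative only; toy_folio/toy_set_anon_rmap are made-up names,
not the kernel helper): whichever call comes first is the one that turns the
folio anonymous and records the index from the address it is handed, so it has
to be the head page at the folio's start address; the remaining calls only mark
their own subpage.

/* Toy model of the head-first ordering above - not kernel code. */
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SHIFT	12
#define TOY_NR_PAGES	4

struct toy_folio {
	bool anon;			/* set by the first call only */
	unsigned long index;		/* derived from that first address */
	bool exclusive[TOY_NR_PAGES];	/* per-subpage "AnonExclusive" marker */
};

static void toy_set_anon_rmap(struct toy_folio *folio, int subpage,
			      unsigned long address)
{
	if (!folio->anon) {
		/* Only the first caller establishes folio-wide state. */
		folio->anon = true;
		folio->index = address >> TOY_PAGE_SHIFT;
	}
	folio->exclusive[subpage] = true;
}

int main(void)
{
	struct toy_folio folio = { 0 };
	unsigned long start = 0x40000;	/* naturally aligned folio start */
	int i;

	/* Head page first, then each tail, as in the loop above. */
	for (i = 0; i < TOY_NR_PAGES; i++)
		toy_set_anon_rmap(&folio, i,
				  start + ((unsigned long)i << TOY_PAGE_SHIFT));

	assert(folio.index == start >> TOY_PAGE_SHIFT);	/* index came from the head */
	for (i = 0; i < TOY_NR_PAGES; i++)
		assert(folio.exclusive[i]);		/* every subpage is marked */
	return 0;
}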

> 
> Regards
> Yin, Fengwei
> 
>> +        page++;
>> +        address += PAGE_SIZE;
>> +    }
>> +
>> +    __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
>> +
>> +}
>> +
>>   /**
>>    * page_add_file_rmap - add pte mapping to a file page
>>    * @page:    the page to add the mapping to
>> -- 
>> 2.25.1
>>
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 6/6] WORKAROUND: Don't split large folios on madvise
  2023-03-17 10:58   ` Ryan Roberts
@ 2023-03-22  8:19     ` Yin Fengwei
  -1 siblings, 0 replies; 30+ messages in thread
From: Yin Fengwei @ 2023-03-22  8:19 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao
  Cc: linux-mm, linux-arm-kernel

On 3/17/23 18:58, Ryan Roberts wrote:
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   mm/madvise.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 340125d08c03..8fb84da744e1 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -447,6 +447,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>   		 * are sure it's worth. Split it if we are only owner.
>   		 */
>   		if (folio_test_large(folio)) {
> +#if 0
>   			if (folio_mapcount(folio) != 1)
>   				break;
>   			if (pageout_anon_only_filter && !folio_test_anon(folio))
> @@ -469,6 +470,9 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>   			pte--;
>   			addr -= PAGE_SIZE;
>   			continue;
> +#else
> +			break;
> +#endif
>   		}
> 
>   		/*
> @@ -664,6 +668,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>   		 * deactivate all pages.
>   		 */
>   		if (folio_test_large(folio)) {
> +#if 0
>   			if (folio_mapcount(folio) != 1)
>   				goto out;
>   			folio_get(folio);
> @@ -684,6 +689,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>   			pte--;
>   			addr -= PAGE_SIZE;
>   			continue;
> +#else
> +			goto out;
> +#endif
>   		}
 From this workaround change, you hit a case where a large folio has a
folio_mapcount() of 1? Can you share the kernel crash log with me? Thanks.


Regards
Yin, Fengwei

> 
>   		if (folio_test_swapcache(folio) || folio_test_dirty(folio)) {
> --
> 2.25.1
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 6/6] WORKAROUND: Don't split large folios on madvise
  2023-03-22  8:19     ` Yin Fengwei
@ 2023-03-22  8:59       ` Ryan Roberts
  -1 siblings, 0 replies; 30+ messages in thread
From: Ryan Roberts @ 2023-03-22  8:59 UTC (permalink / raw)
  To: Yin Fengwei, Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao
  Cc: linux-mm, linux-arm-kernel

On 22/03/2023 08:19, Yin Fengwei wrote:
> On 3/17/23 18:58, Ryan Roberts wrote:
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>   mm/madvise.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 340125d08c03..8fb84da744e1 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -447,6 +447,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>            * are sure it's worth. Split it if we are only owner.
>>            */
>>           if (folio_test_large(folio)) {
>> +#if 0
>>               if (folio_mapcount(folio) != 1)
>>                   break;
>>               if (pageout_anon_only_filter && !folio_test_anon(folio))
>> @@ -469,6 +470,9 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>               pte--;
>>               addr -= PAGE_SIZE;
>>               continue;
>> +#else
>> +            break;
>> +#endif
>>           }
>>
>>           /*
>> @@ -664,6 +668,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>> long addr,
>>            * deactivate all pages.
>>            */
>>           if (folio_test_large(folio)) {
>> +#if 0
>>               if (folio_mapcount(folio) != 1)
>>                   goto out;
>>               folio_get(folio);
>> @@ -684,6 +689,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>> long addr,
>>               pte--;
>>               addr -= PAGE_SIZE;
>>               continue;
>> +#else
>> +            goto out;
>> +#endif
>>           }
> From this workaround change, you hit a case where a large folio has a
> folio_mapcount() of 1? Can you share the kernel crash log with me? Thanks.

Yes I do. I'm not sure why the mapcount is decreasing. I thought perhaps it
could be due to CoW or explicit munmap, or something like that. I've been trying
to find the reason that the mapcount is being reduced by using the page_ref
tracepoints, but it's proving difficult.
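
(For reference, the tracing setup itself is simple; a rough userspace helper to
switch those events on might look like the below. This assumes
CONFIG_DEBUG_PAGE_REF=y and tracefs mounted at /sys/kernel/tracing - adjust the
paths if your setup uses /sys/kernel/debug/tracing instead.)

#include <stdio.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fputs(val, f);
	fclose(f);
	return 0;
}

int main(void)
{
	/* Enable the whole page_ref event group, then turn tracing on. */
	if (write_str("/sys/kernel/tracing/events/page_ref/enable", "1"))
		return 1;
	if (write_str("/sys/kernel/tracing/tracing_on", "1"))
		return 1;
	puts("page_ref events enabled; read /sys/kernel/tracing/trace_pipe");
	return 0;
}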

The crash logs are somewhat random. But I'll send some to you separately.

Thanks for taking a look!

> 
> 
> Regards
> Yin, Fengwei
> 
>>
>>           if (folio_test_swapcache(folio) || folio_test_dirty(folio)) {
>> -- 
>> 2.25.1
>>
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/6] variable-order, large folios for anonymous memory
  2023-03-17 10:57 ` Ryan Roberts
@ 2023-03-22 12:03   ` Ryan Roberts
  -1 siblings, 0 replies; 30+ messages in thread
From: Ryan Roberts @ 2023-03-22 12:03 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yin, Fengwei, Yu Zhao
  Cc: linux-mm, linux-arm-kernel

Hi Matthew,

On 17/03/2023 10:57, Ryan Roberts wrote:
> Hi All,
> 
> [...]
> 
> Bug(s)
> ======
> 
> When I run this code without the last (workaround) patch, with DEBUG_VM et al,
> PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly these are
> relating to invalid kernel addresses (which usually look like either NULL +
> small offset or mostly zeros with a few mid-order bits set + a small offset) or
> lockdep complaining about a bad unlock balance. Call stacks are often in
> madvise_free_pte_range(), but I've seen them in filesystem code too. (I can
> email example oopses out separately if anyone wants to review them). My hunch is
> that struct pages adjacent to the folio are being corrupted, but don't have hard
> evidence.
> 
> When adding the workaround patch, which prevents madvise_free_pte_range() from
> attempting to split a large folio, I never see any issues. Although I'm not
> putting the system under memory pressure so guess I might see the same types of
> problem crop up under swap, etc.
> 
> I've reviewed most of the code within split_folio() and can't find any smoking
> gun, but I wonder if there are implicit assumptions about the large folio being
> PMD sized that I'm obviously breaking now?
> 
> The code in madvise_free_pte_range():
> 
> 	if (folio_test_large(folio)) {
> 		if (folio_mapcount(folio) != 1)
> 			goto out;
> 		folio_get(folio);
> 		if (!folio_trylock(folio)) {
> 			folio_put(folio);
> 			goto out;
> 		}
> 		pte_unmap_unlock(orig_pte, ptl);
> 		if (split_folio(folio)) {
> 			folio_unlock(folio);
> 			folio_put(folio);
> 			orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> 			goto out;
> 		}
> 		...
> 	}

I've noticed that it's folio_split() with a folio order of 1 that causes my
problems. And I also see that the page cache code explicitly never
allocates order-1 folios:

void page_cache_ra_order(struct readahead_control *ractl,
		struct file_ra_state *ra, unsigned int new_order)
{
	...

	while (index <= limit) {
		unsigned int order = new_order;

		/* Align with smaller pages if needed */
		if (index & ((1UL << order) - 1)) {
			order = __ffs(index);
			if (order == 1)
				order = 0;
		}
		/* Don't allocate pages past EOF */
		while (index + (1UL << order) - 1 > limit) {
			if (--order == 1)
				order = 0;
		}
		err = ra_alloc_folio(ractl, index, mark, order, gfp);
		if (err)
			break;
		index += 1UL << order;
	}

	...
}

Matthew, what is the reason for this? I suspect it's guarding against the same
problem I'm seeing.

If I explicitly prevent order-1 allocations for anon pages, I'm unable to cause
any oops/panic/etc. I'd just like to understand the root cause.
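
(For reference, one possible shape for that clamp, sketched against the
calc_anonymous_folio_order() helper from patch 5/6 - illustrative and untested,
not necessarily the exact change I ran with:)

	for (; order > 0; order--) {
		/* Never attempt order-1; fall straight back to a single page. */
		if (order == 1) {
			order = 0;
			break;
		}

		nr = 1 << order;
		addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
		...
	}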

Thanks,
Ryan



> 
> Will normally skip my large folios because they have a mapcount > 1, due to
> incrementing mapcount for each pte, unlike PMD mapped pages. But on occasion it
> will see a mapcount of 1 and proceed. So I guess this is racing against reclaim
> or CoW in this case?
> 
> I also see its doing a dance to take the folio lock and drop the ptl. Perhaps my
> large anon folio is not using the folio lock in the same way as a THP would and
> we are therefore not getting the expected serialization?
> 
> I'd really appreciate any suggestions for how to progress here!
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/6] variable-order, large folios for anonymous memory
  2023-03-22 12:03   ` Ryan Roberts
@ 2023-03-22 13:36     ` Yin, Fengwei
  -1 siblings, 0 replies; 30+ messages in thread
From: Yin, Fengwei @ 2023-03-22 13:36 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao
  Cc: linux-mm, linux-arm-kernel



On 3/22/2023 8:03 PM, Ryan Roberts wrote:
> Hi Matthew,
> 
> On 17/03/2023 10:57, Ryan Roberts wrote:
>> Hi All,
>>
>> [...]
>>
>> Bug(s)
>> ======
>>
>> When I run this code without the last (workaround) patch, with DEBUG_VM et al,
>> PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly these are
>> relating to invalid kernel addresses (which usually look like either NULL +
>> small offset or mostly zeros with a few mid-order bits set + a small offset) or
>> lockdep complaining about a bad unlock balance. Call stacks are often in
>> madvise_free_pte_range(), but I've seen them in filesystem code too. (I can
>> email example oopses out separately if anyone wants to review them). My hunch is
>> that struct pages adjacent to the folio are being corrupted, but don't have hard
>> evidence.
>>
>> When adding the workaround patch, which prevents madvise_free_pte_range() from
>> attempting to split a large folio, I never see any issues. Although I'm not
>> putting the system under memory pressure so guess I might see the same types of
>> problem crop up under swap, etc.
>>
>> I've reviewed most of the code within split_folio() and can't find any smoking
>> gun, but I wonder if there are implicit assumptions about the large folio being
>> PMD sized that I'm obviously breaking now?
>>
>> The code in madvise_free_pte_range():
>>
>> 	if (folio_test_large(folio)) {
>> 		if (folio_mapcount(folio) != 1)
>> 			goto out;
>> 		folio_get(folio);
>> 		if (!folio_trylock(folio)) {
>> 			folio_put(folio);
>> 			goto out;
>> 		}
>> 		pte_unmap_unlock(orig_pte, ptl);
>> 		if (split_folio(folio)) {
>> 			folio_unlock(folio);
>> 			folio_put(folio);
>> 			orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
>> 			goto out;
>> 		}
>> 		...
>> 	}
> 
> I've noticed that it's folio_split() with a folio order of 1 that causes my
> problems. And I also see that the page cache code explicitly never
> allocates order-1 folios:
> 
> void page_cache_ra_order(struct readahead_control *ractl,
> 		struct file_ra_state *ra, unsigned int new_order)
> {
> 	...
> 
> 	while (index <= limit) {
> 		unsigned int order = new_order;
> 
> 		/* Align with smaller pages if needed */
> 		if (index & ((1UL << order) - 1)) {
> 			order = __ffs(index);
> 			if (order == 1)
> 				order = 0;
> 		}
> 		/* Don't allocate pages past EOF */
> 		while (index + (1UL << order) - 1 > limit) {
> 			if (--order == 1)
> 				order = 0;
> 		}
> 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
> 		if (err)
> 			break;
> 		index += 1UL << order;
> 	}
> 
> 	...
> }
> 
> Matthew, what is the reason for this? I suspect it's guarding against the same
> problem I'm seeing.
> 
> If I explicitly prevent order-1 allocations for anon pages, I'm unable to cause
> any oops/panic/etc. I'd just like to understand the root cause.
Checked the struct folio definition. The _deferred_list is in the third page
struct. My understanding is that to support folio split, the folio order must
be >= 2. Thanks.
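
A standalone illustration of the arithmetic (plain userspace C, not kernel
code; it only restates the point above that the _deferred_list slot lives in
the third struct page, i.e. struct-page index 2):

#include <stdio.h>

int main(void)
{
	unsigned int order;

	for (order = 0; order <= 4; order++) {
		unsigned int nr_pages = 1u << order;

		printf("order %u -> %2u struct pages -> _deferred_list slot %s\n",
		       order, nr_pages, nr_pages > 2 ? "present" : "absent");
	}
	return 0;
}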


Regards
Yin, Fengwei

> 
> Thanks,
> Ryan
> 
> 
> 
>>
>> Will normally skip my large folios because they have a mapcount > 1, due to
>> incrementing mapcount for each pte, unlike PMD mapped pages. But on occasion it
>> will see a mapcount of 1 and proceed. So I guess this is racing against reclaim
>> or CoW in this case?
>>
>> I also see its doing a dance to take the folio lock and drop the ptl. Perhaps my
>> large anon folio is not using the folio lock in the same way as a THP would and
>> we are therefore not getting the expected serialization?
>>
>> I'd really appreciate any suggestions for how to progress here!
>>
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/6] variable-order, large folios for anonymous memory
@ 2023-03-22 13:36     ` Yin, Fengwei
  0 siblings, 0 replies; 30+ messages in thread
From: Yin, Fengwei @ 2023-03-22 13:36 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao
  Cc: linux-mm, linux-arm-kernel



On 3/22/2023 8:03 PM, Ryan Roberts wrote:
> Hi Matthew,
> 
> On 17/03/2023 10:57, Ryan Roberts wrote:
>> Hi All,
>>
>> [...]
>>
>> Bug(s)
>> ======
>>
>> When I run this code without the last (workaround) patch, with DEBUG_VM et al,
>> PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly these are
>> relating to invalid kernel addresses (which usually look like either NULL +
>> small offset or mostly zeros with a few mid-order bits set + a small offset) or
>> lockdep complaining about a bad unlock balance. Call stacks are often in
>> madvise_free_pte_range(), but I've seen them in filesystem code too. (I can
>> email example oopses out separately if anyone wants to review them). My hunch is
>> that struct pages adjacent to the folio are being corrupted, but don't have hard
>> evidence.
>>
>> When adding the workaround patch, which prevents madvise_free_pte_range() from
>> attempting to split a large folio, I never see any issues. Although I'm not
>> putting the system under memory pressure so guess I might see the same types of
>> problem crop up under swap, etc.
>>
>> I've reviewed most of the code within split_folio() and can't find any smoking
>> gun, but I wonder if there are implicit assumptions about the large folio being
>> PMD sized that I'm obviously breaking now?
>>
>> The code in madvise_free_pte_range():
>>
>> 	if (folio_test_large(folio)) {
>> 		if (folio_mapcount(folio) != 1)
>> 			goto out;
>> 		folio_get(folio);
>> 		if (!folio_trylock(folio)) {
>> 			folio_put(folio);
>> 			goto out;
>> 		}
>> 		pte_unmap_unlock(orig_pte, ptl);
>> 		if (split_folio(folio)) {
>> 			folio_unlock(folio);
>> 			folio_put(folio);
>> 			orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
>> 			goto out;
>> 		}
>> 		...
>> 	}
> 
> I've noticed that its folio_split() with a folio order of 1 that causes my
> problems. And I also see that the page cache code always explicitly never
> allocates order-1 folios:
> 
> void page_cache_ra_order(struct readahead_control *ractl,
> 		struct file_ra_state *ra, unsigned int new_order)
> {
> 	...
> 
> 	while (index <= limit) {
> 		unsigned int order = new_order;
> 
> 		/* Align with smaller pages if needed */
> 		if (index & ((1UL << order) - 1)) {
> 			order = __ffs(index);
> 			if (order == 1)
> 				order = 0;
> 		}
> 		/* Don't allocate pages past EOF */
> 		while (index + (1UL << order) - 1 > limit) {
> 			if (--order == 1)
> 				order = 0;
> 		}
> 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
> 		if (err)
> 			break;
> 		index += 1UL << order;
> 	}
> 
> 	...
> }
> 
> Matthew, what is the reason for this? I suspect its guarding against the same
> problem I'm seeing.
> 
> If I explicitly prevent order-1 allocations for anon pages, I'm unable to cause
> any oops/panic/etc. I'd just like to understand the root cause.
Checked the struct folio definition. The _deferred_list is in third page struct.
My understanding is to support folio split, the folio order must >= 2. Thanks.


Regards
Yin, Fengwei

> 
> Thanks,
> Ryan
> 
> 
> 
>>
>> Will normally skip my large folios because they have a mapcount > 1, due to
>> incrementing mapcount for each pte, unlike PMD mapped pages. But on occasion it
>> will see a mapcount of 1 and proceed. So I guess this is racing against reclaim
>> or CoW in this case?
>>
>> I also see its doing a dance to take the folio lock and drop the ptl. Perhaps my
>> large anon folio is not using the folio lock in the same way as a THP would and
>> we are therefore not getting the expected serialization?
>>
>> I'd really appreciate any suggestions for how to progress here!
>>
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/6] variable-order, large folios for anonymous memory
  2023-03-22 13:36     ` Yin, Fengwei
@ 2023-03-22 14:25       ` Ryan Roberts
  -1 siblings, 0 replies; 30+ messages in thread
From: Ryan Roberts @ 2023-03-22 14:25 UTC (permalink / raw)
  To: Yin, Fengwei, Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao
  Cc: linux-mm, linux-arm-kernel

On 22/03/2023 13:36, Yin, Fengwei wrote:
> 
> 
> On 3/22/2023 8:03 PM, Ryan Roberts wrote:
>> Hi Matthew,
>>
>> On 17/03/2023 10:57, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> [...]
>>>
>>> Bug(s)
>>> ======
>>>
>>> When I run this code without the last (workaround) patch, with DEBUG_VM et al,
>>> PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly these are
>>> relating to invalid kernel addresses (which usually look like either NULL +
>>> small offset or mostly zeros with a few mid-order bits set + a small offset) or
>>> lockdep complaining about a bad unlock balance. Call stacks are often in
>>> madvise_free_pte_range(), but I've seen them in filesystem code too. (I can
>>> email example oopses out separately if anyone wants to review them). My hunch is
>>> that struct pages adjacent to the folio are being corrupted, but don't have hard
>>> evidence.
>>>
>>> When adding the workaround patch, which prevents madvise_free_pte_range() from
>>> attempting to split a large folio, I never see any issues. Although I'm not
>>> putting the system under memory pressure so guess I might see the same types of
>>> problem crop up under swap, etc.
>>>
>>> I've reviewed most of the code within split_folio() and can't find any smoking
>>> gun, but I wonder if there are implicit assumptions about the large folio being
>>> PMD sized that I'm obviously breaking now?
>>>
>>> The code in madvise_free_pte_range():
>>>
>>> 	if (folio_test_large(folio)) {
>>> 		if (folio_mapcount(folio) != 1)
>>> 			goto out;
>>> 		folio_get(folio);
>>> 		if (!folio_trylock(folio)) {
>>> 			folio_put(folio);
>>> 			goto out;
>>> 		}
>>> 		pte_unmap_unlock(orig_pte, ptl);
>>> 		if (split_folio(folio)) {
>>> 			folio_unlock(folio);
>>> 			folio_put(folio);
>>> 			orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
>>> 			goto out;
>>> 		}
>>> 		...
>>> 	}
>>
>> I've noticed that it's folio_split() with a folio order of 1 that causes my
>> problems. And I also see that the page cache code explicitly never allocates
>> order-1 folios:
>>
>> void page_cache_ra_order(struct readahead_control *ractl,
>> 		struct file_ra_state *ra, unsigned int new_order)
>> {
>> 	...
>>
>> 	while (index <= limit) {
>> 		unsigned int order = new_order;
>>
>> 		/* Align with smaller pages if needed */
>> 		if (index & ((1UL << order) - 1)) {
>> 			order = __ffs(index);
>> 			if (order == 1)
>> 				order = 0;
>> 		}
>> 		/* Don't allocate pages past EOF */
>> 		while (index + (1UL << order) - 1 > limit) {
>> 			if (--order == 1)
>> 				order = 0;
>> 		}
>> 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
>> 		if (err)
>> 			break;
>> 		index += 1UL << order;
>> 	}
>>
>> 	...
>> }
>>
>> Matthew, what is the reason for this? I suspect it's guarding against the same
>> problem I'm seeing.
>>
>> If I explicitly prevent order-1 allocations for anon pages, I'm unable to cause
>> any oops/panic/etc. I'd just like to understand the root cause.
> I checked the struct folio definition. The _deferred_list is in the third page
> struct. My understanding is that, to support folio split, the folio order must
> be >= 2. Thanks.

Yep, looks like we have found the root cause - thanks for your help!

I've updated calc_anonymous_folio_order() to only use a non-zero order if THP is
available, and in that case to never allocate order-1. I think that both fixes
the problem and handles the dependency we have on THP:

static void calc_anonymous_folio_order(struct vm_fault *vmf,
				       int *order_out,
				       unsigned long *addr_out)
{
	/*
	 * The aim here is to determine what size of folio we should allocate
	 * for this fault. Factors include:
	 * - Folio must be naturally aligned within VA space
	 * - Folio must not breach boundaries of vma
	 * - Folio must be fully contained inside one pmd entry
	 * - Folio must not overlap any non-none ptes
	 * - Order must not be higher than *order_out upon entry
	 *
	 * Additionally, we do not allow order-1 since this breaks assumptions
	 * elsewhere in the mm; THP pages must be at least order-2 (since they
	 * store state up to the 3rd struct page subpage), and these pages must
	 * be THP in order to correctly use pre-existing THP infrastructure such
	 * as folio_split().
	 *
	 * As a consequence of relying on the THP infrastructure, if the system
	 * does not support THP, we always fall back to order-0.
	 *
	 * Note that the caller may or may not choose to lock the pte. If
	 * unlocked, the calculation should be considered an estimate that will
	 * need to be validated under the lock.
	 */

	struct vm_area_struct *vma = vmf->vma;
	int nr;
	int order;
	unsigned long addr;
	pte_t *pte;
	pte_t *first_set = NULL;
	int ret;

	if (has_transparent_hugepage()) {
		order = min(*order_out, PMD_SHIFT - PAGE_SHIFT);

		for (; order > 1; order--) {
			nr = 1 << order;
			addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
			pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);

			/* Check vma bounds. */
			if (addr < vma->vm_start ||
			    addr + nr * PAGE_SIZE > vma->vm_end)
				continue;

			/* Ptes covered by order already known to be none. */
			if (pte + nr <= first_set)
				break;

			/* Already found set pte in range covered by order. */
			if (pte <= first_set)
				continue;

			/* Need to check if all the ptes are none. */
			ret = check_all_ptes_none(pte, nr);
			if (ret == nr)
				break;

			first_set = pte + ret;
		}

		if (order == 1)
			order = 0;
	} else {
		order = 0;
	}

	*order_out = order;
	*addr_out = order > 0 ? addr : vmf->address;
}
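
For reference, the intended use of the two out-parameters at the call site is
roughly as below. This is an illustrative sketch only, not the patch code: the
starting order of 4 matches the hard-coded value from the cover letter, and
vma_alloc_folio() stands in for the series' try-with-smaller-orders allocation
helper:

	int order = 4;		/* hard-coded starting order, for now */
	unsigned long addr;
	struct folio *folio;

	calc_anonymous_folio_order(vmf, &order, &addr);

	/* addr is the naturally aligned base the folio will map; order may
	 * have been clamped, possibly down to 0. */
	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, order, vmf->vma, addr, false);
	if (!folio) {
		/* retry with successively smaller orders, down to order-0 */
	}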



> 
> 
> Regards
> Yin, Fengwei
> 
>>
>> Thanks,
>> Ryan
>>
>>
>>
>>>
>>> This code will normally skip my large folios because they have a mapcount > 1,
>>> due to incrementing mapcount for each pte, unlike PMD-mapped pages. But on
>>> occasion it will see a mapcount of 1 and proceed. So I guess this is racing
>>> against reclaim or CoW in this case?
>>>
>>> I also see it's doing a dance to take the folio lock and drop the ptl. Perhaps my
>>> large anon folio is not using the folio lock in the same way as a THP would and
>>> we are therefore not getting the expected serialization?
>>>
>>> I'd really appreciate any suggestions for how to progress here!
>>>
>>



^ permalink raw reply	[flat|nested] 30+ messages in thread


end of thread, other threads:[~2023-03-22 14:26 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
2023-03-17 10:57 [RFC PATCH 0/6] variable-order, large folios for anonymous memory Ryan Roberts
2023-03-17 10:57 ` Ryan Roberts
2023-03-17 10:57 ` [RFC PATCH 1/6] mm: Expose clear_huge_page() unconditionally Ryan Roberts
2023-03-17 10:57   ` Ryan Roberts
2023-03-17 10:57 ` [RFC PATCH 2/6] mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio() Ryan Roberts
2023-03-17 10:57   ` Ryan Roberts
2023-03-17 10:57 ` [RFC PATCH 3/6] mm: Introduce try_vma_alloc_zeroed_movable_folio() Ryan Roberts
2023-03-17 10:57   ` Ryan Roberts
2023-03-17 10:58 ` [RFC PATCH 4/6] mm: Implement folio_add_new_anon_rmap_range() Ryan Roberts
2023-03-17 10:58   ` Ryan Roberts
2023-03-22  6:59   ` Yin Fengwei
2023-03-22  6:59     ` Yin Fengwei
2023-03-22  7:10   ` Yin Fengwei
2023-03-22  7:10     ` Yin Fengwei
2023-03-22  7:42     ` Ryan Roberts
2023-03-22  7:42       ` Ryan Roberts
2023-03-17 10:58 ` [RFC PATCH 5/6] mm: Allocate large folios for anonymous memory Ryan Roberts
2023-03-17 10:58   ` Ryan Roberts
2023-03-17 10:58 ` [RFC PATCH 6/6] WORKAROUND: Don't split large folios on madvise Ryan Roberts
2023-03-17 10:58   ` Ryan Roberts
2023-03-22  8:19   ` Yin Fengwei
2023-03-22  8:19     ` Yin Fengwei
2023-03-22  8:59     ` Ryan Roberts
2023-03-22  8:59       ` Ryan Roberts
2023-03-22 12:03 ` [RFC PATCH 0/6] variable-order, large folios for anonymous memory Ryan Roberts
2023-03-22 12:03   ` Ryan Roberts
2023-03-22 13:36   ` Yin, Fengwei
2023-03-22 13:36     ` Yin, Fengwei
2023-03-22 14:25     ` Ryan Roberts
2023-03-22 14:25       ` Ryan Roberts
