* [PATCH 00/12] Enabling large folios for 5.17
@ 2022-01-16 12:18 Matthew Wilcox (Oracle)
  2022-01-16 12:18 ` [PATCH 01/12] mm: Add folio_put_refs() Matthew Wilcox (Oracle)
                   ` (11 more replies)
  0 siblings, 12 replies; 20+ messages in thread
From: Matthew Wilcox (Oracle) @ 2022-01-16 12:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel; +Cc: Matthew Wilcox (Oracle)

Is Linux just too stable for you?  Tired of not having your data eaten
by a grue?  Then it's time to experiment with enabling large folios!

You will need:
 - A recent Linus tree (I used a33f5c380c4b)
 - To enable CONFIG_TRANSPARENT_HUGEPAGE
 - An XFS filesystem
 - Your favourite workload

These patches create large folios in the readahead and fault paths.
They do not create large folios in the write path; that is future
work.  For most workloads, this is quite sufficient.  You can
monitor the sizes of folios being added to the page cache with the
mm_filemap_add_to_page_cache tracepoint.

As mentioned in the 'Add large folio readahead' commit message, the
heuristic for deciding when to enlarge the size of the folio being
created is stupid.  I'm sure somebody out there can do better.

This patchset is not (as far as I'm concerned) a candidate for merging
into 5.17.  It hasn't been in linux-next, and while it does not introduce
any regressions in my testing, I'd be uncomfortable seeing it merged
before 5.18.

Matthew Wilcox (Oracle) (11):
  mm: Add folio_put_refs()
  filemap: Use folio_put_refs() in filemap_free_folio()
  filemap: Allow large folios to be added to the page cache
  mm/vmscan: Free non-shmem folios without splitting them
  mm: Fix READ_ONLY_THP warning
  mm/vmscan: Optimise shrink_page_list for non-PMD-sized folios
  mm: Make large folios depend on THP
  mm/readahead: Add large folio readahead
  mm/readahead: Switch to page_cache_ra_order
  mm/filemap: Support VM_HUGEPAGE for file mappings
  selftests/vm/transhuge-stress: Support file-backed PMD folios

William Kucharski (1):
  mm/readahead: Align file mappings for non-DAX

 include/linux/mm.h                            |  20 ++++
 include/linux/pagemap.h                       |  11 +-
 mm/filemap.c                                  |  69 +++++++----
 mm/huge_memory.c                              |   5 +-
 mm/internal.h                                 |   4 +-
 mm/readahead.c                                | 108 ++++++++++++++++--
 mm/vmscan.c                                   |   7 +-
 tools/testing/selftests/vm/transhuge-stress.c |  35 ++++--
 8 files changed, 204 insertions(+), 55 deletions(-)

-- 
2.34.1



* [PATCH 01/12] mm: Add folio_put_refs()
  2022-01-16 12:18 [PATCH 00/12] Enabling large folios for 5.17 Matthew Wilcox (Oracle)
@ 2022-01-16 12:18 ` Matthew Wilcox (Oracle)
  2022-01-16 12:18 ` [PATCH 02/12] filemap: Use folio_put_refs() in filemap_free_folio() Matthew Wilcox (Oracle)
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 20+ messages in thread
From: Matthew Wilcox (Oracle) @ 2022-01-16 12:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	Christoph Hellwig, John Hubbard, Jason Gunthorpe,
	William Kucharski

This is like folio_put(), but puts N references at once instead of
just one.  It's like put_page_refs(), but does one atomic operation
instead of two, and is available to more than just gup.c.
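
To illustrate (a sketch, not part of the patch; the caller and its
reference bookkeeping are hypothetical): a user that earlier took one
reference per page of a large folio can now drop them all in one go
instead of looping over folio_put() or paying put_page_refs()'s two
atomic operations.

	/* Hypothetical caller; folio_put_refs() comes from <linux/mm.h>. */
	static void drop_folio_pins(struct folio *folio)
	{
		/* One reference per page was taken earlier with folio_ref_add() */
		folio_put_refs(folio, folio_nr_pages(folio));
		/* The folio may now be freed; do not touch it unless more refs are held */
	}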

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
---
 include/linux/mm.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c768a7c81b0b..cb98f75b245e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1244,6 +1244,26 @@ static inline void folio_put(struct folio *folio)
 		__put_page(&folio->page);
 }
 
+/**
+ * folio_put_refs - Reduce the reference count on a folio.
+ * @folio: The folio.
+ * @refs: The amount to subtract from the folio's reference count.
+ *
+ * If the folio's reference count reaches zero, the memory will be
+ * released back to the page allocator and may be used by another
+ * allocation immediately.  Do not access the memory or the struct folio
+ * after calling folio_put_refs() unless you can be sure that these weren't
+ * the last references.
+ *
+ * Context: May be called in process or interrupt context, but not in NMI
+ * context.  May be called while holding a spinlock.
+ */
+static inline void folio_put_refs(struct folio *folio, int refs)
+{
+	if (folio_ref_sub_and_test(folio, refs))
+		__put_page(&folio->page);
+}
+
 static inline void put_page(struct page *page)
 {
 	struct folio *folio = page_folio(page);
-- 
2.34.1



* [PATCH 02/12] filemap: Use folio_put_refs() in filemap_free_folio()
  2022-01-16 12:18 [PATCH 00/12] Enabling large folios for 5.17 Matthew Wilcox (Oracle)
  2022-01-16 12:18 ` [PATCH 01/12] mm: Add folio_put_refs() Matthew Wilcox (Oracle)
@ 2022-01-16 12:18 ` Matthew Wilcox (Oracle)
  2022-01-17 15:56   ` Kirill A. Shutemov
  2022-01-16 12:18 ` [PATCH 03/12] filemap: Allow large folios to be added to the page cache Matthew Wilcox (Oracle)
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox (Oracle) @ 2022-01-16 12:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel; +Cc: Matthew Wilcox (Oracle)

This shrinks filemap_free_folio() by 55 bytes in my .config; 24 bytes
from removing the VM_BUG_ON_FOLIO() and 31 bytes from unifying the
small/large folio paths.

We could just use folio_ref_sub() here since the caller should hold a
reference (as the VM_BUG_ON_FOLIO() was asserting), but that's fragile.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/filemap.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 2fd9b2f24025..afc8f5ca85ac 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -231,17 +231,15 @@ void __filemap_remove_folio(struct folio *folio, void *shadow)
 void filemap_free_folio(struct address_space *mapping, struct folio *folio)
 {
 	void (*freepage)(struct page *);
+	int refs = 1;
 
 	freepage = mapping->a_ops->freepage;
 	if (freepage)
 		freepage(&folio->page);
 
-	if (folio_test_large(folio) && !folio_test_hugetlb(folio)) {
-		folio_ref_sub(folio, folio_nr_pages(folio));
-		VM_BUG_ON_FOLIO(folio_ref_count(folio) <= 0, folio);
-	} else {
-		folio_put(folio);
-	}
+	if (folio_test_large(folio) && !folio_test_hugetlb(folio))
+		refs = folio_nr_pages(folio);
+	folio_put_refs(folio, refs);
 }
 
 /**
-- 
2.34.1



* [PATCH 03/12] filemap: Allow large folios to be added to the page cache
  2022-01-16 12:18 [PATCH 00/12] Enabling large folios for 5.17 Matthew Wilcox (Oracle)
  2022-01-16 12:18 ` [PATCH 01/12] mm: Add folio_put_refs() Matthew Wilcox (Oracle)
  2022-01-16 12:18 ` [PATCH 02/12] filemap: Use folio_put_refs() in filemap_free_folio() Matthew Wilcox (Oracle)
@ 2022-01-16 12:18 ` Matthew Wilcox (Oracle)
  2022-01-16 12:18 ` [PATCH 04/12] mm/vmscan: Free non-shmem folios without splitting them Matthew Wilcox (Oracle)
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 20+ messages in thread
From: Matthew Wilcox (Oracle) @ 2022-01-16 12:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel; +Cc: Matthew Wilcox (Oracle)

We return -EEXIST if there are any non-shadow entries in the page
cache in the range covered by the folio.  If there are multiple
shadow entries in the range, we set *shadowp to one of them (currently
the one at the highest index).  If that turns out to be the wrong
answer, we can implement something more complex.  This is mostly
modelled after the equivalent function in the shmem code.
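
As a point of reference, a minimal caller sketch (not part of this
patch; the GFP flags and order are illustrative) of inserting a
multi-page folio.  The folio's index must be naturally aligned to its
size, as the VM_BUG_ON_FOLIO() below asserts:

	struct folio *folio = filemap_alloc_folio(GFP_KERNEL, 2);	/* 4 pages */

	if (folio) {
		/* index must be a multiple of 4 for an order-2 folio */
		if (filemap_add_folio(mapping, folio, index, GFP_KERNEL))
			folio_put(folio);	/* e.g. -EEXIST or -ENOMEM */
	}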

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/filemap.c | 39 ++++++++++++++++++++++-----------------
 1 file changed, 22 insertions(+), 17 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index afc8f5ca85ac..fe079b676ab7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -851,26 +851,27 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 {
 	XA_STATE(xas, &mapping->i_pages, index);
 	int huge = folio_test_hugetlb(folio);
-	int error;
 	bool charged = false;
+	long nr = 1;
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
 	mapping_set_update(&xas, mapping);
 
-	folio_get(folio);
-	folio->mapping = mapping;
-	folio->index = index;
-
 	if (!huge) {
-		error = mem_cgroup_charge(folio, NULL, gfp);
+		int error = mem_cgroup_charge(folio, NULL, gfp);
 		VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
 		if (error)
-			goto error;
+			return error;
 		charged = true;
+		xas_set_order(&xas, index, folio_order(folio));
+		nr = folio_nr_pages(folio);
 	}
 
 	gfp &= GFP_RECLAIM_MASK;
+	folio_ref_add(folio, nr);
+	folio->mapping = mapping;
+	folio->index = xas.xa_index;
 
 	do {
 		unsigned int order = xa_get_order(xas.xa, xas.xa_index);
@@ -894,6 +895,8 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 			/* entry may have been split before we acquired lock */
 			order = xa_get_order(xas.xa, xas.xa_index);
 			if (order > folio_order(folio)) {
+				/* How to handle large swap entries? */
+				BUG_ON(shmem_mapping(mapping));
 				xas_split(&xas, old, order);
 				xas_reset(&xas);
 			}
@@ -903,29 +906,31 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 		if (xas_error(&xas))
 			goto unlock;
 
-		mapping->nrpages++;
+		mapping->nrpages += nr;
 
 		/* hugetlb pages do not participate in page cache accounting */
-		if (!huge)
-			__lruvec_stat_add_folio(folio, NR_FILE_PAGES);
+		if (!huge) {
+			__lruvec_stat_mod_folio(folio, NR_FILE_PAGES, nr);
+			if (folio_test_pmd_mappable(folio))
+				__lruvec_stat_mod_folio(folio,
+						NR_FILE_THPS, nr);
+		}
 unlock:
 		xas_unlock_irq(&xas);
 	} while (xas_nomem(&xas, gfp));
 
-	if (xas_error(&xas)) {
-		error = xas_error(&xas);
-		if (charged)
-			mem_cgroup_uncharge(folio);
+	if (xas_error(&xas))
 		goto error;
-	}
 
 	trace_mm_filemap_add_to_page_cache(folio);
 	return 0;
 error:
+	if (charged)
+		mem_cgroup_uncharge(folio);
 	folio->mapping = NULL;
 	/* Leave page->index set: truncation relies upon it */
-	folio_put(folio);
-	return error;
+	folio_put_refs(folio, nr);
+	return xas_error(&xas);
 }
 ALLOW_ERROR_INJECTION(__filemap_add_folio, ERRNO);
 
-- 
2.34.1



* [PATCH 04/12] mm/vmscan: Free non-shmem folios without splitting them
  2022-01-16 12:18 [PATCH 00/12] Enabling large folios for 5.17 Matthew Wilcox (Oracle)
                   ` (2 preceding siblings ...)
  2022-01-16 12:18 ` [PATCH 03/12] filemap: Allow large folios to be added to the page cache Matthew Wilcox (Oracle)
@ 2022-01-16 12:18 ` Matthew Wilcox (Oracle)
  2022-01-17 16:06   ` Kirill A. Shutemov
  2022-01-16 12:18 ` [PATCH 05/12] mm: Fix READ_ONLY_THP warning Matthew Wilcox (Oracle)
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox (Oracle) @ 2022-01-16 12:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel; +Cc: Matthew Wilcox (Oracle)

We have to allocate memory in order to split a file-backed folio, so
it's not a good idea to split them in the memory freeing path.  It also
doesn't work for XFS because pages have an extra reference count from
page_has_private() and split_huge_page() expects that reference to have
already been removed.  Unfortunately, we still have to split shmem THPs
because we can't handle swapping out an entire THP yet.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/vmscan.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 700434db5735..45665874082d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1728,8 +1728,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 				/* Adding to swap updated mapping */
 				mapping = page_mapping(page);
 			}
-		} else if (unlikely(PageTransHuge(page))) {
-			/* Split file THP */
+		} else if (PageSwapBacked(page) && PageTransHuge(page)) {
+			/* Split shmem THP */
 			if (split_huge_page_to_list(page, page_list))
 				goto keep_locked;
 		}
-- 
2.34.1



* [PATCH 05/12] mm: Fix READ_ONLY_THP warning
  2022-01-16 12:18 [PATCH 00/12] Enabling large folios for 5.17 Matthew Wilcox (Oracle)
                   ` (3 preceding siblings ...)
  2022-01-16 12:18 ` [PATCH 04/12] mm/vmscan: Free non-shmem folios without splitting them Matthew Wilcox (Oracle)
@ 2022-01-16 12:18 ` Matthew Wilcox (Oracle)
  2022-01-16 12:18 ` [PATCH 06/12] mm/vmscan: Optimise shrink_page_list for non-PMD-sized folios Matthew Wilcox (Oracle)
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 20+ messages in thread
From: Matthew Wilcox (Oracle) @ 2022-01-16 12:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel; +Cc: Matthew Wilcox (Oracle)

These counters only exist if CONFIG_READ_ONLY_THP_FOR_FS is defined,
but we do not need to warn if the filesystem natively supports large
folios.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagemap.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 270bf5136c34..877dabed0316 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -212,7 +212,7 @@ static inline void filemap_nr_thps_inc(struct address_space *mapping)
 	if (!mapping_large_folio_support(mapping))
 		atomic_inc(&mapping->nr_thps);
 #else
-	WARN_ON_ONCE(1);
+	WARN_ON_ONCE(mapping_large_folio_support(mapping) == 0);
 #endif
 }
 
@@ -222,7 +222,7 @@ static inline void filemap_nr_thps_dec(struct address_space *mapping)
 	if (!mapping_large_folio_support(mapping))
 		atomic_dec(&mapping->nr_thps);
 #else
-	WARN_ON_ONCE(1);
+	WARN_ON_ONCE(mapping_large_folio_support(mapping) == 0);
 #endif
 }
 
-- 
2.34.1



* [PATCH 06/12] mm/vmscan: Optimise shrink_page_list for non-PMD-sized folios
  2022-01-16 12:18 [PATCH 00/12] Enabling large folios for 5.17 Matthew Wilcox (Oracle)
                   ` (4 preceding siblings ...)
  2022-01-16 12:18 ` [PATCH 05/12] mm: Fix READ_ONLY_THP warning Matthew Wilcox (Oracle)
@ 2022-01-16 12:18 ` Matthew Wilcox (Oracle)
  2022-01-16 12:18 ` [PATCH 07/12] mm: Make large folios depend on THP Matthew Wilcox (Oracle)
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 20+ messages in thread
From: Matthew Wilcox (Oracle) @ 2022-01-16 12:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel; +Cc: Matthew Wilcox (Oracle)

A large folio which is smaller than a PMD does not need to do the extra
work in try_to_unmap() of trying to split a PMD entry.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/vmscan.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 45665874082d..3181bf2f8a37 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1754,7 +1754,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 			enum ttu_flags flags = TTU_BATCH_FLUSH;
 			bool was_swapbacked = PageSwapBacked(page);
 
-			if (unlikely(PageTransHuge(page)))
+			if (PageTransHuge(page) &&
+					thp_order(page) >= HPAGE_PMD_ORDER)
 				flags |= TTU_SPLIT_HUGE_PMD;
 
 			try_to_unmap(page, flags);
-- 
2.34.1



* [PATCH 07/12] mm: Make large folios depend on THP
  2022-01-16 12:18 [PATCH 00/12] Enabling large folios for 5.17 Matthew Wilcox (Oracle)
                   ` (5 preceding siblings ...)
  2022-01-16 12:18 ` [PATCH 06/12] mm/vmscan: Optimise shrink_page_list for non-PMD-sized folios Matthew Wilcox (Oracle)
@ 2022-01-16 12:18 ` Matthew Wilcox (Oracle)
  2022-01-16 12:18 ` [PATCH 08/12] mm/readahead: Add large folio readahead Matthew Wilcox (Oracle)
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 20+ messages in thread
From: Matthew Wilcox (Oracle) @ 2022-01-16 12:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel; +Cc: Matthew Wilcox (Oracle)

Some parts of the VM still depend on THP to handle large folios
correctly.  Until those are fixed, prevent creating large folios
if THP are disabled.
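
For context, a filesystem opts in by setting AS_LARGE_FOLIO_SUPPORT on
the mapping; with this patch the opt-in only has an effect when THP is
compiled in.  A rough sketch of the opt-in (the XFS enablement is a
separate patch; the function shown here is illustrative):

	/* In the filesystem's inode/address_space setup path: */
	static void example_setup_inode(struct inode *inode)
	{
		/* ... */
		mapping_set_large_folios(inode->i_mapping);
	}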

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagemap.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 877dabed0316..3e348e0a9e4e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -192,9 +192,14 @@ static inline void mapping_set_large_folios(struct address_space *mapping)
 	__set_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
 }
 
+/*
+ * Large folio support currently depends on THP.  These dependencies are
+ * being worked on but are not yet fixed.
+ */
 static inline bool mapping_large_folio_support(struct address_space *mapping)
 {
-	return test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+	return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+		test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
 }
 
 static inline int filemap_nr_thps(struct address_space *mapping)
-- 
2.34.1



* [PATCH 08/12] mm/readahead: Add large folio readahead
  2022-01-16 12:18 [PATCH 00/12] Enabling large folios for 5.17 Matthew Wilcox (Oracle)
                   ` (6 preceding siblings ...)
  2022-01-16 12:18 ` [PATCH 07/12] mm: Make large folios depend on THP Matthew Wilcox (Oracle)
@ 2022-01-16 12:18 ` Matthew Wilcox (Oracle)
  2022-01-16 12:18 ` [PATCH 09/12] mm/readahead: Align file mappings for non-DAX Matthew Wilcox (Oracle)
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 20+ messages in thread
From: Matthew Wilcox (Oracle) @ 2022-01-16 12:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel; +Cc: Matthew Wilcox (Oracle)

Allocate large folios in the readahead code when the filesystem supports
them and it seems worth doing.  The heuristic for choosing which folio
sizes to use will surely need some tuning, but this aggressive ramp-up has been
good for testing.
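
As a rough worked example (assuming 4KiB pages, HPAGE_PMD_ORDER == 9
and a readahead window that has already grown large enough): each pass
takes the order of the folio that hit the readahead mark and adds two,
so a sequential reader ramps through order 0 -> 2 -> 4 -> 6 -> 8 and is
then clamped to order 9, i.e. 2MiB PMD-sized folios.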

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/readahead.c | 106 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 99 insertions(+), 7 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index cf0dcf89eb69..5100eaf5b0ee 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -148,7 +148,7 @@ static void read_pages(struct readahead_control *rac, struct list_head *pages,
 
 	blk_finish_plug(&plug);
 
-	BUG_ON(!list_empty(pages));
+	BUG_ON(pages && !list_empty(pages));
 	BUG_ON(readahead_count(rac));
 
 out:
@@ -431,11 +431,103 @@ static int try_context_readahead(struct address_space *mapping,
 	return 1;
 }
 
+/*
+ * There are some parts of the kernel which assume that PMD entries
+ * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
+ * limit the maximum allocation order to PMD size.  I'm not aware of any
+ * assumptions about maximum order if THP are disabled, but 8 seems like
+ * a good order (that's 1MB if you're using 4kB pages)
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
+#else
+#define MAX_PAGECACHE_ORDER	8
+#endif
+
+static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
+		pgoff_t mark, unsigned int order, gfp_t gfp)
+{
+	int err;
+	struct folio *folio = filemap_alloc_folio(gfp, order);
+
+	if (!folio)
+		return -ENOMEM;
+	if (mark - index < (1UL << order))
+		folio_set_readahead(folio);
+	err = filemap_add_folio(ractl->mapping, folio, index, gfp);
+	if (err)
+		folio_put(folio);
+	else
+		ractl->_nr_pages += 1UL << order;
+	return err;
+}
+
+static void page_cache_ra_order(struct readahead_control *ractl,
+		struct file_ra_state *ra, unsigned int new_order)
+{
+	struct address_space *mapping = ractl->mapping;
+	pgoff_t index = readahead_index(ractl);
+	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
+	pgoff_t mark = index + ra->size - ra->async_size;
+	int err = 0;
+	gfp_t gfp = readahead_gfp_mask(mapping);
+
+	if (!mapping_large_folio_support(mapping) || ra->size < 4)
+		goto fallback;
+
+	limit = min(limit, index + ra->size - 1);
+
+	if (new_order < MAX_PAGECACHE_ORDER) {
+		new_order += 2;
+		if (new_order > MAX_PAGECACHE_ORDER)
+			new_order = MAX_PAGECACHE_ORDER;
+		while ((1 << new_order) > ra->size)
+			new_order--;
+	}
+
+	while (index <= limit) {
+		unsigned int order = new_order;
+
+		/* Align with smaller pages if needed */
+		if (index & ((1UL << order) - 1)) {
+			order = __ffs(index);
+			if (order == 1)
+				order = 0;
+		}
+		/* Don't allocate pages past EOF */
+		while (index + (1UL << order) - 1 > limit) {
+			if (--order == 1)
+				order = 0;
+		}
+		err = ra_alloc_folio(ractl, index, mark, order, gfp);
+		if (err)
+			break;
+		index += 1UL << order;
+	}
+
+	if (index > limit) {
+		ra->size += index - limit - 1;
+		ra->async_size += index - limit - 1;
+	}
+
+	read_pages(ractl, NULL, false);
+
+	/*
+	 * If there were already pages in the page cache, then we may have
+	 * left some gaps.  Let the regular readahead code take care of this
+	 * situation.
+	 */
+	if (!err)
+		return;
+fallback:
+	do_page_cache_ra(ractl, ra->size, ra->async_size);
+}
+
 /*
  * A minimal readahead algorithm for trivial sequential/random reads.
  */
 static void ondemand_readahead(struct readahead_control *ractl,
-		bool hit_readahead_marker, unsigned long req_size)
+		struct folio *folio, unsigned long req_size)
 {
 	struct backing_dev_info *bdi = inode_to_bdi(ractl->mapping->host);
 	struct file_ra_state *ra = ractl->ra;
@@ -470,12 +562,12 @@ static void ondemand_readahead(struct readahead_control *ractl,
 	}
 
 	/*
-	 * Hit a marked page without valid readahead state.
+	 * Hit a marked folio without valid readahead state.
 	 * E.g. interleaved reads.
 	 * Query the pagecache for async_size, which normally equals to
 	 * readahead size. Ramp it up and use it as the new readahead size.
 	 */
-	if (hit_readahead_marker) {
+	if (folio) {
 		pgoff_t start;
 
 		rcu_read_lock();
@@ -548,7 +640,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
 	}
 
 	ractl->_index = ra->start;
-	do_page_cache_ra(ractl, ra->size, ra->async_size);
+	page_cache_ra_order(ractl, ra, folio ? folio_order(folio) : 0);
 }
 
 void page_cache_sync_ra(struct readahead_control *ractl,
@@ -576,7 +668,7 @@ void page_cache_sync_ra(struct readahead_control *ractl,
 	}
 
 	/* do read-ahead */
-	ondemand_readahead(ractl, false, req_count);
+	ondemand_readahead(ractl, NULL, req_count);
 }
 EXPORT_SYMBOL_GPL(page_cache_sync_ra);
 
@@ -605,7 +697,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
 		return;
 
 	/* do read-ahead */
-	ondemand_readahead(ractl, true, req_count);
+	ondemand_readahead(ractl, folio, req_count);
 }
 EXPORT_SYMBOL_GPL(page_cache_async_ra);
 
-- 
2.34.1



* [PATCH 09/12] mm/readahead: Align file mappings for non-DAX
  2022-01-16 12:18 [PATCH 00/12] Enabling large folios for 5.17 Matthew Wilcox (Oracle)
                   ` (7 preceding siblings ...)
  2022-01-16 12:18 ` [PATCH 08/12] mm/readahead: Add large folio readahead Matthew Wilcox (Oracle)
@ 2022-01-16 12:18 ` Matthew Wilcox (Oracle)
  2022-01-17  3:17   ` Rongwei Wang
  2022-01-16 12:18 ` [PATCH 10/12] mm/readahead: Switch to page_cache_ra_order Matthew Wilcox (Oracle)
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox (Oracle) @ 2022-01-16 12:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel; +Cc: William Kucharski, Matthew Wilcox

From: William Kucharski <william.kucharski@oracle.com>

When we have the opportunity to use PMDs to map a file, we want to follow
the same rules as DAX.

Signed-off-by: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/huge_memory.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f58524394dc1..28c29a0d854b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -582,13 +582,10 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 	unsigned long ret;
 	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
 
-	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
-		goto out;
-
 	ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PMD_SIZE);
 	if (ret)
 		return ret;
-out:
+
 	return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
 }
 EXPORT_SYMBOL_GPL(thp_get_unmapped_area);
-- 
2.34.1



* [PATCH 10/12] mm/readahead: Switch to page_cache_ra_order
  2022-01-16 12:18 [PATCH 00/12] Enabling large folios for 5.17 Matthew Wilcox (Oracle)
                   ` (8 preceding siblings ...)
  2022-01-16 12:18 ` [PATCH 09/12] mm/readahead: Align file mappings for non-DAX Matthew Wilcox (Oracle)
@ 2022-01-16 12:18 ` Matthew Wilcox (Oracle)
  2022-01-16 12:18 ` [PATCH 11/12] mm/filemap: Support VM_HUGEPAGE for file mappings Matthew Wilcox (Oracle)
  2022-01-16 12:18 ` [PATCH 12/12] selftests/vm/transhuge-stress: Support file-backed PMD folios Matthew Wilcox (Oracle)
  11 siblings, 0 replies; 20+ messages in thread
From: Matthew Wilcox (Oracle) @ 2022-01-16 12:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel; +Cc: Matthew Wilcox (Oracle)

do_page_cache_ra() was being exposed for the benefit of
do_sync_mmap_readahead().  Switch it over to page_cache_ra_order()
partly because it's a better interface but mostly for the benefit of
the next patch.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/filemap.c   | 2 +-
 mm/internal.h  | 4 ++--
 mm/readahead.c | 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index fe079b676ab7..8f076f0fd94f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2947,7 +2947,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	ra->size = ra->ra_pages;
 	ra->async_size = ra->ra_pages / 4;
 	ractl._index = ra->start;
-	do_page_cache_ra(&ractl, ra->size, ra->async_size);
+	page_cache_ra_order(&ractl, ra, 0);
 	return fpin;
 }
 
diff --git a/mm/internal.h b/mm/internal.h
index 26af8a5a5be3..dbc15201a9d4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -82,8 +82,8 @@ void unmap_page_range(struct mmu_gather *tlb,
 			     unsigned long addr, unsigned long end,
 			     struct zap_details *details);
 
-void do_page_cache_ra(struct readahead_control *, unsigned long nr_to_read,
-		unsigned long lookahead_size);
+void page_cache_ra_order(struct readahead_control *, struct file_ra_state *,
+		unsigned int order);
 void force_page_cache_ra(struct readahead_control *, unsigned long nr);
 static inline void force_page_cache_readahead(struct address_space *mapping,
 		struct file *file, pgoff_t index, unsigned long nr_to_read)
diff --git a/mm/readahead.c b/mm/readahead.c
index 5100eaf5b0ee..a20391d6a71b 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -247,7 +247,7 @@ EXPORT_SYMBOL_GPL(page_cache_ra_unbounded);
  * behaviour which would occur if page allocations are causing VM writeback.
  * We really don't want to intermingle reads and writes like that.
  */
-void do_page_cache_ra(struct readahead_control *ractl,
+static void do_page_cache_ra(struct readahead_control *ractl,
 		unsigned long nr_to_read, unsigned long lookahead_size)
 {
 	struct inode *inode = ractl->mapping->host;
@@ -462,7 +462,7 @@ static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
 	return err;
 }
 
-static void page_cache_ra_order(struct readahead_control *ractl,
+void page_cache_ra_order(struct readahead_control *ractl,
 		struct file_ra_state *ra, unsigned int new_order)
 {
 	struct address_space *mapping = ractl->mapping;
-- 
2.34.1



* [PATCH 11/12] mm/filemap: Support VM_HUGEPAGE for file mappings
  2022-01-16 12:18 [PATCH 00/12] Enabling large folios for 5.17 Matthew Wilcox (Oracle)
                   ` (9 preceding siblings ...)
  2022-01-16 12:18 ` [PATCH 10/12] mm/readahead: Switch to page_cache_ra_order Matthew Wilcox (Oracle)
@ 2022-01-16 12:18 ` Matthew Wilcox (Oracle)
  2022-01-16 12:18 ` [PATCH 12/12] selftests/vm/transhuge-stress: Support file-backed PMD folios Matthew Wilcox (Oracle)
  11 siblings, 0 replies; 20+ messages in thread
From: Matthew Wilcox (Oracle) @ 2022-01-16 12:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel; +Cc: Matthew Wilcox (Oracle)

If the VM_HUGEPAGE flag is set, attempt to allocate PMD-sized folios
during readahead, even if we have no history of readahead being
successful.
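
A hedged userspace sketch of how a mapping picks up VM_HUGEPAGE (the
file path and size are made up; error handling is trimmed):
MADV_HUGEPAGE on a shared file mapping sets the flag, so the first
fault goes through the PMD-sized readahead path above.

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		size_t len = 16UL << 20;	/* 16MiB; assumes the file is at least this big */
		int fd = open("/mnt/xfs/datafile", O_RDONLY);	/* hypothetical file */
		volatile char sink;
		char *p;

		if (fd < 0)
			return 1;
		p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return 1;
		madvise(p, len, MADV_HUGEPAGE);	/* sets VM_HUGEPAGE on this VMA */
		for (size_t i = 0; i < len; i += 4096)
			sink = p[i];	/* fault in; readahead should allocate PMD-sized folios */
		munmap(p, len);
		close(fd);
		return 0;
	}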

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/filemap.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 8f076f0fd94f..da190fc4e186 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2915,6 +2915,24 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	struct file *fpin = NULL;
 	unsigned int mmap_miss;
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	/* Use the readahead code, even if readahead is disabled */
+	if (vmf->vma->vm_flags & VM_HUGEPAGE) {
+		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
+		ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
+		ra->size = HPAGE_PMD_NR;
+		/*
+		 * Fetch two PMD folios, so we get the chance to actually
+		 * readahead, unless we've been told not to.
+		 */
+		if (!(vmf->vma->vm_flags & VM_RAND_READ))
+			ra->size *= 2;
+		ra->async_size = HPAGE_PMD_NR;
+		page_cache_ra_order(&ractl, ra, HPAGE_PMD_ORDER);
+		return fpin;
+	}
+#endif
+
 	/* If we don't want any read-ahead, don't bother */
 	if (vmf->vma->vm_flags & VM_RAND_READ)
 		return fpin;
-- 
2.34.1



* [PATCH 12/12] selftests/vm/transhuge-stress: Support file-backed PMD folios
  2022-01-16 12:18 [PATCH 00/12] Enabling large folios for 5.17 Matthew Wilcox (Oracle)
                   ` (10 preceding siblings ...)
  2022-01-16 12:18 ` [PATCH 11/12] mm/filemap: Support VM_HUGEPAGE for file mappings Matthew Wilcox (Oracle)
@ 2022-01-16 12:18 ` Matthew Wilcox (Oracle)
  11 siblings, 0 replies; 20+ messages in thread
From: Matthew Wilcox (Oracle) @ 2022-01-16 12:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-fsdevel; +Cc: Matthew Wilcox (Oracle)

Add a -f <filename> option to test PMD folios on files.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 tools/testing/selftests/vm/transhuge-stress.c | 35 +++++++++++++------
 1 file changed, 24 insertions(+), 11 deletions(-)

diff --git a/tools/testing/selftests/vm/transhuge-stress.c b/tools/testing/selftests/vm/transhuge-stress.c
index 5e4c036f6ad3..a03cb3fce1f6 100644
--- a/tools/testing/selftests/vm/transhuge-stress.c
+++ b/tools/testing/selftests/vm/transhuge-stress.c
@@ -26,15 +26,17 @@
 #define PAGEMAP_PFN(ent)	((ent) & ((1ull << 55) - 1))
 
 int pagemap_fd;
+int backing_fd = -1;
+int mmap_flags = MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE;
+#define PROT_RW (PROT_READ | PROT_WRITE)
 
 int64_t allocate_transhuge(void *ptr)
 {
 	uint64_t ent[2];
 
 	/* drop pmd */
-	if (mmap(ptr, HPAGE_SIZE, PROT_READ | PROT_WRITE,
-				MAP_FIXED | MAP_ANONYMOUS |
-				MAP_NORESERVE | MAP_PRIVATE, -1, 0) != ptr)
+	if (mmap(ptr, HPAGE_SIZE, PROT_RW, MAP_FIXED | mmap_flags,
+		 backing_fd, 0) != ptr)
 		errx(2, "mmap transhuge");
 
 	if (madvise(ptr, HPAGE_SIZE, MADV_HUGEPAGE))
@@ -60,6 +62,8 @@ int main(int argc, char **argv)
 	size_t ram, len;
 	void *ptr, *p;
 	struct timespec a, b;
+	int i = 0;
+	char *name = NULL;
 	double s;
 	uint8_t *map;
 	size_t map_len;
@@ -69,13 +73,23 @@ int main(int argc, char **argv)
 		ram = SIZE_MAX / 4;
 	else
 		ram *= sysconf(_SC_PAGESIZE);
+	len = ram;
+
+	while (++i < argc) {
+		if (!strcmp(argv[i], "-h"))
+			errx(1, "usage: %s [size in MiB]", argv[0]);
+		else if (!strcmp(argv[i], "-f"))
+			name = argv[++i];
+		else
+			len = atoll(argv[i]) << 20;
+	}
 
-	if (argc == 1)
-		len = ram;
-	else if (!strcmp(argv[1], "-h"))
-		errx(1, "usage: %s [size in MiB]", argv[0]);
-	else
-		len = atoll(argv[1]) << 20;
+	if (name) {
+		backing_fd = open(name, O_RDWR);
+		if (backing_fd == -1)
+			errx(2, "open %s", name);
+		mmap_flags = MAP_SHARED;
+	}
 
 	warnx("allocate %zd transhuge pages, using %zd MiB virtual memory"
 	      " and %zd MiB of ram", len >> HPAGE_SHIFT, len >> 20,
@@ -86,8 +100,7 @@ int main(int argc, char **argv)
 		err(2, "open pagemap");
 
 	len -= len % HPAGE_SIZE;
-	ptr = mmap(NULL, len + HPAGE_SIZE, PROT_READ | PROT_WRITE,
-			MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
+	ptr = mmap(NULL, len + HPAGE_SIZE, PROT_RW, mmap_flags, backing_fd, 0);
 	if (ptr == MAP_FAILED)
 		err(2, "initial mmap");
 	ptr += HPAGE_SIZE - (uintptr_t)ptr % HPAGE_SIZE;
-- 
2.34.1



* Re: [PATCH 09/12] mm/readahead: Align file mappings for non-DAX
  2022-01-16 12:18 ` [PATCH 09/12] mm/readahead: Align file mappings for non-DAX Matthew Wilcox (Oracle)
@ 2022-01-17  3:17   ` Rongwei Wang
  2022-01-17  4:40     ` Matthew Wilcox
  0 siblings, 1 reply; 20+ messages in thread
From: Rongwei Wang @ 2022-01-17  3:17 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle), linux-kernel, linux-mm, linux-fsdevel
  Cc: William Kucharski



On 1/16/22 8:18 PM, Matthew Wilcox (Oracle) wrote:
> From: William Kucharski <william.kucharski@oracle.com>
> 
> When we have the opportunity to use PMDs to map a file, we want to follow
> the same rules as DAX.
> 
> Signed-off-by: William Kucharski <william.kucharski@oracle.com>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>   mm/huge_memory.c | 5 +----
>   1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index f58524394dc1..28c29a0d854b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -582,13 +582,10 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
>   	unsigned long ret;
>   	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
>   
> -	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
> -		goto out;
> -
>   	ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PMD_SIZE);
>   	if (ret)
>   		return ret;
> -out:
> +
>   	return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
Hi, Matthew

It seems this patch will make all file mappings align with PMD_SIZE? And 
so support file THP for all files, not only executable files?

Actually, what I want to say is that we already merged a similar patch
in glibc that only aligns DSO mappings:

"718fdd8 elf: Properly align PT_LOAD segments [BZ #28676]"


>   }
>   EXPORT_SYMBOL_GPL(thp_get_unmapped_area);


* Re: [PATCH 09/12] mm/readahead: Align file mappings for non-DAX
  2022-01-17  3:17   ` Rongwei Wang
@ 2022-01-17  4:40     ` Matthew Wilcox
  0 siblings, 0 replies; 20+ messages in thread
From: Matthew Wilcox @ 2022-01-17  4:40 UTC (permalink / raw)
  To: Rongwei Wang; +Cc: linux-kernel, linux-mm, linux-fsdevel, William Kucharski

On Mon, Jan 17, 2022 at 11:17:55AM +0800, Rongwei Wang wrote:
> It seems this patch will make all file mappings align with PMD_SIZE?

Only those which are big enough.  See __thp_get_unmapped_area():

        if (off_end <= off_align || (off_end - off_align) < size)
                return 0;

> And
> so support file THP for all files, not only executable files?

Executables are not the only files which benefit from being mapped
to an aligned address.  If you can use a PMD to map a font file,
for example, that's valuable.



* Re: [PATCH 02/12] filemap: Use folio_put_refs() in filemap_free_folio()
  2022-01-16 12:18 ` [PATCH 02/12] filemap: Use folio_put_refs() in filemap_free_folio() Matthew Wilcox (Oracle)
@ 2022-01-17 15:56   ` Kirill A. Shutemov
  2022-01-17 16:11     ` Matthew Wilcox
  0 siblings, 1 reply; 20+ messages in thread
From: Kirill A. Shutemov @ 2022-01-17 15:56 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle); +Cc: linux-kernel, linux-mm, linux-fsdevel

On Sun, Jan 16, 2022 at 12:18:12PM +0000, Matthew Wilcox (Oracle) wrote:
> This shrinks filemap_free_folio() by 55 bytes in my .config; 24 bytes
> from removing the VM_BUG_ON_FOLIO() and 31 bytes from unifying the
> small/large folio paths.
> 
> We could just use folio_ref_sub() here since the caller should hold a
> reference (as the VM_BUG_ON_FOLIO() was asserting), but that's fragile.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/filemap.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 2fd9b2f24025..afc8f5ca85ac 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -231,17 +231,15 @@ void __filemap_remove_folio(struct folio *folio, void *shadow)
>  void filemap_free_folio(struct address_space *mapping, struct folio *folio)
>  {
>  	void (*freepage)(struct page *);
> +	int refs = 1;
>  
>  	freepage = mapping->a_ops->freepage;
>  	if (freepage)
>  		freepage(&folio->page);
>  
> -	if (folio_test_large(folio) && !folio_test_hugetlb(folio)) {
> -		folio_ref_sub(folio, folio_nr_pages(folio));
> -		VM_BUG_ON_FOLIO(folio_ref_count(folio) <= 0, folio);
> -	} else {
> -		folio_put(folio);
> -	}
> +	if (folio_test_large(folio) && !folio_test_hugetlb(folio))
> +		refs = folio_nr_pages(folio);

Isn't the folio_test_large() check redundant? folio_nr_pages() would return 1
for a non-large folio, wouldn't it?

> +	folio_put_refs(folio, refs);
>  }
>  
>  /**
> -- 
> 2.34.1
> 

-- 
 Kirill A. Shutemov


* Re: [PATCH 04/12] mm/vmscan: Free non-shmem folios without splitting them
  2022-01-16 12:18 ` [PATCH 04/12] mm/vmscan: Free non-shmem folios without splitting them Matthew Wilcox (Oracle)
@ 2022-01-17 16:06   ` Kirill A. Shutemov
  2022-01-17 16:10     ` Matthew Wilcox
  0 siblings, 1 reply; 20+ messages in thread
From: Kirill A. Shutemov @ 2022-01-17 16:06 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle); +Cc: linux-kernel, linux-mm, linux-fsdevel

On Sun, Jan 16, 2022 at 12:18:14PM +0000, Matthew Wilcox (Oracle) wrote:
> We have to allocate memory in order to split a file-backed folio, so
> it's not a good idea to split them in the memory freeing path.

Could you elaborate on why splitting a file-backed folio requires memory
allocation?

> It also
> doesn't work for XFS because pages have an extra reference count from
> page_has_private() and split_huge_page() expects that reference to have
> already been removed.

Need to adjust can_split_huge_page()?

> Unfortunately, we still have to split shmem THPs
> because we can't handle swapping out an entire THP yet.

... especially if the system doesn't have swap :P

> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/vmscan.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 700434db5735..45665874082d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1728,8 +1728,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
>  				/* Adding to swap updated mapping */
>  				mapping = page_mapping(page);
>  			}
> -		} else if (unlikely(PageTransHuge(page))) {
> -			/* Split file THP */
> +		} else if (PageSwapBacked(page) && PageTransHuge(page)) {
> +			/* Split shmem THP */
>  			if (split_huge_page_to_list(page, page_list))
>  				goto keep_locked;
>  		}
> -- 
> 2.34.1
> 

-- 
 Kirill A. Shutemov


* Re: [PATCH 04/12] mm/vmscan: Free non-shmem folios without splitting them
  2022-01-17 16:06   ` Kirill A. Shutemov
@ 2022-01-17 16:10     ` Matthew Wilcox
  2022-01-17 21:00       ` Kirill A. Shutemov
  0 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2022-01-17 16:10 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: linux-kernel, linux-mm, linux-fsdevel

On Mon, Jan 17, 2022 at 07:06:25PM +0300, Kirill A. Shutemov wrote:
> On Sun, Jan 16, 2022 at 12:18:14PM +0000, Matthew Wilcox (Oracle) wrote:
> > We have to allocate memory in order to split a file-backed folio, so
> > it's not a good idea to split them in the memory freeing path.
> 
> Could you elaborate on why splitting a file-backed folio requires memory
> allocation?

In the commit message or explain it to you now?

We need to allocate xarray nodes to store all the newly-independent
pages.  With a folio that's more than 64 entries in size (current
implementation), we elide the lowest layer of the radix tree.  But
with any data structure that tracks folios, we'll need to create
space in it to track N folios instead of 1.

> > It also
> > doesn't work for XFS because pages have an extra reference count from
> > page_has_private() and split_huge_page() expects that reference to have
> > already been removed.
> 
> Need to adjust can_split_huge_page()?

no?

> > Unfortunately, we still have to split shmem THPs
> > because we can't handle swapping out an entire THP yet.
> 
> ... especially if the system doesn't have swap :P

Not sure what correction to the commit message you want here.


* Re: [PATCH 02/12] filemap: Use folio_put_refs() in filemap_free_folio()
  2022-01-17 15:56   ` Kirill A. Shutemov
@ 2022-01-17 16:11     ` Matthew Wilcox
  0 siblings, 0 replies; 20+ messages in thread
From: Matthew Wilcox @ 2022-01-17 16:11 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: linux-kernel, linux-mm, linux-fsdevel

On Mon, Jan 17, 2022 at 06:56:41PM +0300, Kirill A. Shutemov wrote:
> On Sun, Jan 16, 2022 at 12:18:12PM +0000, Matthew Wilcox (Oracle) wrote:
> > +	if (folio_test_large(folio) && !folio_test_hugetlb(folio))
> > +		refs = folio_nr_pages(folio);
> 
> Isn't the folio_test_large() check redundant? folio_nr_pages() would return 1
> for a non-large folio, wouldn't it?

I'm trying to avoid the function call for !hugetlb pages.


* Re: [PATCH 04/12] mm/vmscan: Free non-shmem folios without splitting them
  2022-01-17 16:10     ` Matthew Wilcox
@ 2022-01-17 21:00       ` Kirill A. Shutemov
  0 siblings, 0 replies; 20+ messages in thread
From: Kirill A. Shutemov @ 2022-01-17 21:00 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel, linux-mm, linux-fsdevel

On Mon, Jan 17, 2022 at 04:10:46PM +0000, Matthew Wilcox wrote:
> On Mon, Jan 17, 2022 at 07:06:25PM +0300, Kirill A. Shutemov wrote:
> > On Sun, Jan 16, 2022 at 12:18:14PM +0000, Matthew Wilcox (Oracle) wrote:
> > > We have to allocate memory in order to split a file-backed folio, so
> > > it's not a good idea to split them in the memory freeing path.
> > 
> > Could you elaborate on why splitting a file-backed folio requires memory
> > allocation?
> 
> In the commit message or explain it to you now?
> 
> We need to allocate xarray nodes to store all the newly-independent
> pages.  With a folio that's more than 64 entries in size (current
> implementation), we elide the lowest layer of the radix tree.  But
> with any data structure that tracks folios, we'll need to create
> space in it to track N folios instead of 1.

Looks good.

> > > It also
> > > doesn't work for XFS because pages have an extra reference count from
> > > page_has_private() and split_huge_page() expects that reference to have
> > > already been removed.
> > 
> > Need to adjust can_split_huge_page()?
> 
> no?

I meant we can make can_split_huge_page() expect an extra pin if
page_has_private() is true, if that is the only thing that stops
split_huge_page() from handling XFS pages.

-- 
 Kirill A. Shutemov
