Linux-Fsdevel Archive on lore.kernel.org
* [PATCH v4 00/36] Large pages in the page cache
@ 2020-05-15 13:16 Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 01/36] mm: Move PageDoubleMap bit Matthew Wilcox
                   ` (37 more replies)
  0 siblings, 38 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

This patch set does not pass xfstests.  Test at your own risk.  It is
based on the readahead rewrite which is in Andrew's tree.  I've fixed a
lot of issues in the last two weeks, but generic/013 will still crash it.

The primary idea here is that a large part of the overhead in dealing
with individual pages is that there are just so darned many of them.
We would be better off dealing with fewer, larger pages, even if they
don't get to be the size necessary for the CPU to use a larger TLB entry.

v4:
 - Fix thp_size typo
 - Fix the iomap page_mkwrite() path to operate on the head page, even
   though the vm_fault has a pointer to the tail page
 - Fix iomap_finish_ioend() to use bio_for_each_thp_segment_all()
 - Rework PageDoubleMap (see first two patches for details)
 - Fix page_cache_delete() to handle shadow entries being stored to a THP
 - Fix the assertion in pagecache_get_page() to handle tail pages
 - Change PageReadahead from NO_COMPOUND to ONLY_HEAD
 - Handle PageReadahead being set on head pages
 - Handle total_mapcount correctly (Kirill)
 - Pull the FS_LARGE_PAGES check out into mapping_large_pages()
 - Fix page size assumption in truncate_cleanup_page()
 - Avoid splitting large pages unnecessarily on truncate
 - Disable the page cache truncation introduced as part of the read-only
   THP patch set
 - Call compound_head() in iomap buffered write paths -- we retrieve a
   (potentially) tail page from the page cache and need to use that for
   flush_dcache_page(), but we expect to operate on a head page in most
   of the iomap code

Kirill A. Shutemov (1):
  mm: Fix total_mapcount assumption of page size

Matthew Wilcox (Oracle) (34):
  mm: Move PageDoubleMap bit
  mm: Simplify PageDoubleMap with PF_SECOND policy
  mm: Allow hpages to be arbitrary order
  mm: Introduce thp_size
  mm: Introduce thp_order
  mm: Introduce offset_in_thp
  fs: Add a filesystem flag for large pages
  fs: Do not update nr_thps for large page mappings
  fs: Introduce i_blocks_per_page
  fs: Make page_mkwrite_check_truncate thp-aware
  fs: Support THPs in zero_user_segments
  bio: Add bio_for_each_thp_segment_all
  iomap: Support arbitrarily many blocks per page
  iomap: Support large pages in iomap_adjust_read_range
  iomap: Support large pages in read paths
  iomap: Support large pages in write paths
  iomap: Inline data shouldn't see large pages
  iomap: Handle tail pages in iomap_page_mkwrite
  xfs: Support large pages
  mm: Make prep_transhuge_page return its argument
  mm: Add __page_cache_alloc_order
  mm: Allow large pages to be added to the page cache
  mm: Allow large pages to be removed from the page cache
  mm: Remove page fault assumption of compound page size
  mm: Avoid splitting large pages
  mm: Fix truncation for pages of arbitrary size
  mm: Support storing shadow entries for large pages
  mm: Support retrieving tail pages from the page cache
  mm: Support tail pages in wait_for_stable_page
  mm: Add DEFINE_READAHEAD
  mm: Make page_cache_readahead_unbounded take a readahead_control
  mm: Make __do_page_cache_readahead take a readahead_control
  mm: Allow PageReadahead to be set on head pages
  mm: Add large page readahead

William Kucharski (1):
  mm: Align THP mappings for non-DAX

 drivers/nvdimm/btt.c       |   4 +-
 drivers/nvdimm/pmem.c      |   6 +-
 fs/ext4/verity.c           |   4 +-
 fs/f2fs/verity.c           |   4 +-
 fs/iomap/buffered-io.c     | 121 +++++++++++++++++--------------
 fs/jfs/jfs_metapage.c      |   2 +-
 fs/xfs/xfs_aops.c          |   4 +-
 fs/xfs/xfs_super.c         |   2 +-
 include/linux/bio.h        |  13 ++++
 include/linux/bvec.h       |  23 ++++++
 include/linux/fs.h         |  28 +------
 include/linux/highmem.h    |  15 +++-
 include/linux/huge_mm.h    |  25 +++++--
 include/linux/mm.h         |  97 +++++++++++++------------
 include/linux/page-flags.h |  46 ++++--------
 include/linux/pagemap.h    |  96 +++++++++++++++++++++---
 mm/filemap.c               |  87 ++++++++++++++--------
 mm/highmem.c               |  62 +++++++++++++++-
 mm/huge_memory.c           |  58 +++++++--------
 mm/internal.h              |  13 ++--
 mm/memory.c                |   7 +-
 mm/page-writeback.c        |   1 +
 mm/page_io.c               |   2 +-
 mm/page_vma_mapped.c       |   4 +-
 mm/readahead.c             | 145 ++++++++++++++++++++++++++++---------
 mm/truncate.c              |   6 +-
 mm/vmscan.c                |   5 +-
 27 files changed, 565 insertions(+), 315 deletions(-)

-- 
2.26.2


* [PATCH v4 01/36] mm: Move PageDoubleMap bit
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 02/36] mm: Simplify PageDoubleMap with PF_SECOND policy Matthew Wilcox
                   ` (36 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

PG_private_2 is defined as being PF_ANY (applicable to tail pages
as well as regular & head pages).  That means that the first tail
page of a double-map page will appear to have Private2 set.  Use the
Workingset bit instead which is defined as PF_HEAD so any attempt to
access the Workingset bit on a tail page will redirect to the head page's
Workingset bit.
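
The difference between the two policies can be sketched in userspace.
This is a toy model only, not kernel code: struct toy_page, the toy_*
helpers and the TOY_PG_* bit numbers are all hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page; every toy_* name here is hypothetical. */
struct toy_page {
	unsigned long flags;
	struct toy_page *head;	/* NULL unless this is a tail page */
};

enum toy_pageflags { TOY_PG_private_2, TOY_PG_workingset };

/* PF_ANY-style accessor: looks at exactly the page it was given. */
static inline bool toy_test_any(const struct toy_page *page, int bit)
{
	return page->flags & (1UL << bit);
}

/* PF_HEAD-style accessor: a tail page is redirected to its head page. */
static inline bool toy_test_head(const struct toy_page *page, int bit)
{
	const struct toy_page *p = page->head ? page->head : page;

	assert(p->head == NULL);	/* a head page never points elsewhere */
	return p->flags & (1UL << bit);
}

/* DoubleMap is stored in the first tail page's flags word. */
static inline void toy_set_double_map(struct toy_page *head, int alias_bit)
{
	head[1].flags |= 1UL << alias_bit;
}
```

With DoubleMap aliased to the PF_ANY bit, toy_test_any() on the first
tail page reports the aliased flag as set.  With the PF_HEAD bit, the
same query on the tail page is answered from the head page, so the
DoubleMap bit stays invisible to users of the aliased flag.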

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/page-flags.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 222f6f7b2bb3..de6e0696f55c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -164,7 +164,7 @@ enum pageflags {
 	PG_slob_free = PG_private,
 
 	/* Compound pages. Stored in first tail page's flags */
-	PG_double_map = PG_private_2,
+	PG_double_map = PG_workingset,
 
 	/* non-lru isolated movable page */
 	PG_isolated = PG_reclaim,
-- 
2.26.2



* [PATCH v4 02/36] mm: Simplify PageDoubleMap with PF_SECOND policy
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 01/36] mm: Move PageDoubleMap bit Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 03/36] mm: Allow hpages to be arbitrary order Matthew Wilcox
                   ` (35 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Introduce the new page policy of PF_SECOND which lets us use the
normal pageflags generation machinery to create the various DoubleMap
manipulation functions.
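
As a rough userspace illustration of the idea (the TOY_PAGEFLAG macro,
struct toy_page and toy_pf_second() are made up for this sketch and
only imitate the shape of the real PAGEFLAG machinery), a policy is
just a function from the page passed in to the page whose flags word
is actually used:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct page; all names below are hypothetical. */
struct toy_page {
	unsigned long flags;
	bool is_head;
};

#define TOY_PG_double_map	3	/* arbitrary bit number for the sketch */

/* PF_SECOND-like policy: insist on a head page, operate on page[1]. */
static inline struct toy_page *toy_pf_second(struct toy_page *page)
{
	assert(page->is_head);
	return &page[1];
}

/* Generate test/set/clear helpers from a bit and a policy. */
#define TOY_PAGEFLAG(uname, bit, policy)				\
static inline bool ToyPage##uname(struct toy_page *page)		\
	{ return policy(page)->flags & (1UL << (bit)); }		\
static inline void ToySetPage##uname(struct toy_page *page)		\
	{ policy(page)->flags |= 1UL << (bit); }			\
static inline void ToyClearPage##uname(struct toy_page *page)		\
	{ policy(page)->flags &= ~(1UL << (bit)); }

TOY_PAGEFLAG(DoubleMap, TOY_PG_double_map, toy_pf_second)
```

One behavioural difference is visible in the diff below: the removed
PageDoubleMap() returned false for a non-head page, whereas the
generated test function goes through the policy, which asserts
PageHead() via VM_BUG_ON_PGFLAGS.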

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/page-flags.h | 40 ++++++++++----------------------------
 1 file changed, 10 insertions(+), 30 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index de6e0696f55c..979460df4768 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -232,6 +232,9 @@ static inline void page_init_poison(struct page *page, size_t size)
  *
  * PF_NO_COMPOUND:
  *     the page flag is not relevant for compound pages.
+ *
+ * PF_SECOND:
+ *     the page flag is stored in the first tail page.
  */
 #define PF_POISONED_CHECK(page) ({					\
 		VM_BUG_ON_PGFLAGS(PagePoisoned(page), page);		\
@@ -247,6 +250,9 @@ static inline void page_init_poison(struct page *page, size_t size)
 #define PF_NO_COMPOUND(page, enforce) ({				\
 		VM_BUG_ON_PGFLAGS(enforce && PageCompound(page), page);	\
 		PF_POISONED_CHECK(page); })
+#define PF_SECOND(page, enforce) ({					\
+		VM_BUG_ON_PGFLAGS(!PageHead(page), page);		\
+		PF_POISONED_CHECK(&page[1]); })
 
 /*
  * Macros to create function definitions for page flags
@@ -685,42 +691,15 @@ static inline int PageTransTail(struct page *page)
  *
  * See also __split_huge_pmd_locked() and page_remove_anon_compound_rmap().
  */
-static inline int PageDoubleMap(struct page *page)
-{
-	return PageHead(page) && test_bit(PG_double_map, &page[1].flags);
-}
-
-static inline void SetPageDoubleMap(struct page *page)
-{
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	set_bit(PG_double_map, &page[1].flags);
-}
-
-static inline void ClearPageDoubleMap(struct page *page)
-{
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	clear_bit(PG_double_map, &page[1].flags);
-}
-static inline int TestSetPageDoubleMap(struct page *page)
-{
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	return test_and_set_bit(PG_double_map, &page[1].flags);
-}
-
-static inline int TestClearPageDoubleMap(struct page *page)
-{
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	return test_and_clear_bit(PG_double_map, &page[1].flags);
-}
-
+PAGEFLAG(DoubleMap, double_map, PF_SECOND)
+	TESTSCFLAG(DoubleMap, double_map, PF_SECOND)
 #else
 TESTPAGEFLAG_FALSE(TransHuge)
 TESTPAGEFLAG_FALSE(TransCompound)
 TESTPAGEFLAG_FALSE(TransCompoundMap)
 TESTPAGEFLAG_FALSE(TransTail)
 PAGEFLAG_FALSE(DoubleMap)
-	TESTSETFLAG_FALSE(DoubleMap)
-	TESTCLEARFLAG_FALSE(DoubleMap)
+	TESTSCFLAG_FALSE(DoubleMap)
 #endif
 
 /*
@@ -875,6 +854,7 @@ static inline int page_has_private(struct page *page)
 #undef PF_ONLY_HEAD
 #undef PF_NO_TAIL
 #undef PF_NO_COMPOUND
+#undef PF_SECOND
 #endif /* !__GENERATING_BOUNDS_H */
 
 #endif	/* PAGE_FLAGS_H */
-- 
2.26.2



* [PATCH v4 03/36] mm: Allow hpages to be arbitrary order
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 01/36] mm: Move PageDoubleMap bit Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 02/36] mm: Simplify PageDoubleMap with PF_SECOND policy Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-28 14:19   ` Zi Yan
  2020-05-15 13:16 ` [PATCH v4 04/36] mm: Introduce thp_size Matthew Wilcox
                   ` (34 subsequent siblings)
  37 siblings, 1 reply; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Remove the assumption in hpage_nr_pages() that compound pages are
necessarily PMD sized.  Move the relevant parts of mm.h to before the
include of huge_mm.h so we can use an inline function rather than a macro.
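
To make the removed assumption concrete, here is a small userspace
model.  The toy_* names are hypothetical, and TOY_HPAGE_PMD_NR of 512
assumes 4KiB base pages with 2MiB PMD entries.  It compares the old
"any head page is PMD sized" answer with one derived from the order
stored in the first tail page:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SHIFT		12
#define TOY_PAGE_SIZE		(1UL << TOY_PAGE_SHIFT)
#define TOY_HPAGE_PMD_NR	512	/* assumed: 2MiB PMD / 4KiB pages */

/* Toy stand-in for struct page; all names here are hypothetical. */
struct toy_page {
	bool is_head;
	unsigned int compound_order;	/* only meaningful in page[1] */
};

/* Mirrors compound_order(): non-head pages are treated as order 0. */
static inline unsigned int toy_compound_order(const struct toy_page *page)
{
	if (!page->is_head)
		return 0;
	return page[1].compound_order;
}

static inline unsigned long toy_compound_nr(const struct toy_page *page)
{
	return 1UL << toy_compound_order(page);
}

static inline unsigned long toy_page_size(const struct toy_page *page)
{
	return TOY_PAGE_SIZE << toy_compound_order(page);
}

/* Old behaviour: every huge page is assumed to be PMD sized. */
static inline int toy_hpage_nr_pages_old(const struct toy_page *page)
{
	return page->is_head ? TOY_HPAGE_PMD_NR : 1;
}

/* New behaviour: ask the compound page how big it really is. */
static inline int toy_hpage_nr_pages_new(const struct toy_page *page)
{
	return toy_compound_nr(page);
}
```

For an order-2 compound page the old helper would claim 512 base
pages while the new one reports 4, which is the case this series
needs to get right.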

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/huge_mm.h |  5 +--
 include/linux/mm.h      | 96 ++++++++++++++++++++---------------------
 2 files changed, 50 insertions(+), 51 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index cfbb0a87c5f0..6bec4b5b61e1 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -265,11 +265,10 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
 	else
 		return NULL;
 }
+
 static inline int hpage_nr_pages(struct page *page)
 {
-	if (unlikely(PageTransHuge(page)))
-		return HPAGE_PMD_NR;
-	return 1;
+	return compound_nr(page);
 }
 
 struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 581e56275bc4..088acbda722d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -671,6 +671,54 @@ int vma_is_stack_for_current(struct vm_area_struct *vma);
 struct mmu_gather;
 struct inode;
 
+static inline unsigned int compound_order(struct page *page)
+{
+	if (!PageHead(page))
+		return 0;
+	return page[1].compound_order;
+}
+
+static inline bool hpage_pincount_available(struct page *page)
+{
+	/*
+	 * Can the page->hpage_pinned_refcount field be used? That field is in
+	 * the 3rd page of the compound page, so the smallest (2-page) compound
+	 * pages cannot support it.
+	 */
+	page = compound_head(page);
+	return PageCompound(page) && compound_order(page) > 1;
+}
+
+static inline int compound_pincount(struct page *page)
+{
+	VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
+	page = compound_head(page);
+	return atomic_read(compound_pincount_ptr(page));
+}
+
+static inline void set_compound_order(struct page *page, unsigned int order)
+{
+	page[1].compound_order = order;
+}
+
+/* Returns the number of pages in this potentially compound page. */
+static inline unsigned long compound_nr(struct page *page)
+{
+	return 1UL << compound_order(page);
+}
+
+/* Returns the number of bytes in this potentially compound page. */
+static inline unsigned long page_size(struct page *page)
+{
+	return PAGE_SIZE << compound_order(page);
+}
+
+/* Returns the number of bits needed for the number of bytes in a page */
+static inline unsigned int page_shift(struct page *page)
+{
+	return PAGE_SHIFT + compound_order(page);
+}
+
 /*
  * FIXME: take this include out, include page-flags.h in
  * files which need it (119 of them)
@@ -875,54 +923,6 @@ static inline compound_page_dtor *get_compound_page_dtor(struct page *page)
 	return compound_page_dtors[page[1].compound_dtor];
 }
 
-static inline unsigned int compound_order(struct page *page)
-{
-	if (!PageHead(page))
-		return 0;
-	return page[1].compound_order;
-}
-
-static inline bool hpage_pincount_available(struct page *page)
-{
-	/*
-	 * Can the page->hpage_pinned_refcount field be used? That field is in
-	 * the 3rd page of the compound page, so the smallest (2-page) compound
-	 * pages cannot support it.
-	 */
-	page = compound_head(page);
-	return PageCompound(page) && compound_order(page) > 1;
-}
-
-static inline int compound_pincount(struct page *page)
-{
-	VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
-	page = compound_head(page);
-	return atomic_read(compound_pincount_ptr(page));
-}
-
-static inline void set_compound_order(struct page *page, unsigned int order)
-{
-	page[1].compound_order = order;
-}
-
-/* Returns the number of pages in this potentially compound page. */
-static inline unsigned long compound_nr(struct page *page)
-{
-	return 1UL << compound_order(page);
-}
-
-/* Returns the number of bytes in this potentially compound page. */
-static inline unsigned long page_size(struct page *page)
-{
-	return PAGE_SIZE << compound_order(page);
-}
-
-/* Returns the number of bits needed for the number of bytes in a page */
-static inline unsigned int page_shift(struct page *page)
-{
-	return PAGE_SHIFT + compound_order(page);
-}
-
 void free_compound_page(struct page *page);
 
 #ifdef CONFIG_MMU
-- 
2.26.2



* [PATCH v4 04/36] mm: Introduce thp_size
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (2 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 03/36] mm: Allow hpages to be arbitrary order Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:38   ` David Hildenbrand
  2020-05-15 13:16 ` [PATCH v4 05/36] mm: Introduce thp_order Matthew Wilcox
                   ` (33 subsequent siblings)
  37 siblings, 1 reply; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

This is like page_size(), but compiles down to just PAGE_SIZE if THP
is disabled.  Convert the users of hpage_nr_pages() which would prefer
this interface.
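
The conversions in this patch are pure arithmetic rewrites.  A minimal
userspace check, assuming 4KiB base pages and using hypothetical toy_*
helpers that take the page order directly instead of a struct page,
might look like:

```c
#include <assert.h>

#define TOY_PAGE_SIZE	4096UL	/* assumed base page size */

/* Number of base pages in a page of the given order. */
static inline unsigned long toy_nr_pages(unsigned int order)
{
	return 1UL << order;
}

/* Stand-in for thp_size(): bytes covered by a page of the given order. */
static inline unsigned long toy_thp_size(unsigned int order)
{
	return TOY_PAGE_SIZE << order;
}

/* Address of the last base page, written the pre-patch way... */
static inline unsigned long toy_end_old(unsigned long start,
					unsigned int order)
{
	return start + TOY_PAGE_SIZE * (toy_nr_pages(order) - 1);
}

/* ...and the way the converted callers write it. */
static inline unsigned long toy_end_new(unsigned long start,
					unsigned int order)
{
	return start + toy_thp_size(order) - TOY_PAGE_SIZE;
}
```

Both forms give the same address for every order.  For example, an
order-9 page starting at 0x200000 ends at 0x3ff000 either way.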

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 drivers/nvdimm/btt.c    | 4 +---
 drivers/nvdimm/pmem.c   | 6 ++----
 include/linux/huge_mm.h | 7 +++++++
 mm/internal.h           | 2 +-
 mm/page_io.c            | 2 +-
 mm/page_vma_mapped.c    | 4 ++--
 6 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 3b09419218d6..78e8d972d45a 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1488,10 +1488,8 @@ static int btt_rw_page(struct block_device *bdev, sector_t sector,
 {
 	struct btt *btt = bdev->bd_disk->private_data;
 	int rc;
-	unsigned int len;
 
-	len = hpage_nr_pages(page) * PAGE_SIZE;
-	rc = btt_do_bvec(btt, NULL, page, len, 0, op, sector);
+	rc = btt_do_bvec(btt, NULL, page, thp_size(page), 0, op, sector);
 	if (rc == 0)
 		page_endio(page, op_is_write(op), 0);
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 2df6994acf83..d511504d07af 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -235,11 +235,9 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 	blk_status_t rc;
 
 	if (op_is_write(op))
-		rc = pmem_do_write(pmem, page, 0, sector,
-				   hpage_nr_pages(page) * PAGE_SIZE);
+		rc = pmem_do_write(pmem, page, 0, sector, thp_size(page));
 	else
-		rc = pmem_do_read(pmem, page, 0, sector,
-				   hpage_nr_pages(page) * PAGE_SIZE);
+		rc = pmem_do_read(pmem, page, 0, sector, thp_size(page));
 	/*
 	 * The ->rw_page interface is subtle and tricky.  The core
 	 * retries on any error, so we can only invoke page_endio() in
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 6bec4b5b61e1..e944f9757349 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -271,6 +271,11 @@ static inline int hpage_nr_pages(struct page *page)
 	return compound_nr(page);
 }
 
+static inline unsigned long thp_size(struct page *page)
+{
+	return page_size(page);
+}
+
 struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 		pmd_t *pmd, int flags, struct dev_pagemap **pgmap);
 struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
@@ -329,6 +334,8 @@ static inline int hpage_nr_pages(struct page *page)
 	return 1;
 }
 
+#define thp_size(x)		PAGE_SIZE
+
 static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
 {
 	return false;
diff --git a/mm/internal.h b/mm/internal.h
index f762a34b0c57..5efb13d5c226 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -386,7 +386,7 @@ vma_address(struct page *page, struct vm_area_struct *vma)
 	unsigned long start, end;
 
 	start = __vma_address(page, vma);
-	end = start + PAGE_SIZE * (hpage_nr_pages(page) - 1);
+	end = start + thp_size(page) - PAGE_SIZE;
 
 	/* page should be within @vma mapping range */
 	VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma);
diff --git a/mm/page_io.c b/mm/page_io.c
index 76965be1d40e..dd935129e3cb 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -41,7 +41,7 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
 		bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
 		bio->bi_end_io = end_io;
 
-		bio_add_page(bio, page, PAGE_SIZE * hpage_nr_pages(page), 0);
+		bio_add_page(bio, page, thp_size(page), 0);
 	}
 	return bio;
 }
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 719c35246cfa..e65629c056e8 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -227,7 +227,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			if (pvmw->address >= pvmw->vma->vm_end ||
 			    pvmw->address >=
 					__vma_address(pvmw->page, pvmw->vma) +
-					hpage_nr_pages(pvmw->page) * PAGE_SIZE)
+					thp_size(pvmw->page))
 				return not_found(pvmw);
 			/* Did we cross page table boundary? */
 			if (pvmw->address % PMD_SIZE == 0) {
@@ -268,7 +268,7 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
 	unsigned long start, end;
 
 	start = __vma_address(page, vma);
-	end = start + PAGE_SIZE * (hpage_nr_pages(page) - 1);
+	end = start + thp_size(page) - PAGE_SIZE;
 
 	if (unlikely(end < vma->vm_start || start >= vma->vm_end))
 		return 0;
-- 
2.26.2



* [PATCH v4 05/36] mm: Introduce thp_order
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (3 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 04/36] mm: Introduce thp_size Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 06/36] mm: Introduce offset_in_thp Matthew Wilcox
                   ` (32 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Like compound_order() except 0 when THP is disabled.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/huge_mm.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e944f9757349..1f6245091917 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -276,6 +276,11 @@ static inline unsigned long thp_size(struct page *page)
 	return page_size(page);
 }
 
+static inline unsigned int thp_order(struct page *page)
+{
+	return compound_order(page);
+}
+
 struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 		pmd_t *pmd, int flags, struct dev_pagemap **pgmap);
 struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
@@ -335,6 +340,7 @@ static inline int hpage_nr_pages(struct page *page)
 }
 
 #define thp_size(x)		PAGE_SIZE
+#define thp_order(x)		0U
 
 static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
 {
-- 
2.26.2



* [PATCH v4 06/36] mm: Introduce offset_in_thp
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (4 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 05/36] mm: Introduce thp_order Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:39   ` David Hildenbrand
  2020-05-22 17:15   ` Kirill A. Shutemov
  2020-05-15 13:16 ` [PATCH v4 07/36] fs: Add a filesystem flag for large pages Matthew Wilcox
                   ` (31 subsequent siblings)
  37 siblings, 2 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Mirroring offset_in_page(), this gives you the offset within this
particular page, no matter what size page it is.  It optimises down
to offset_in_page() if CONFIG_TRANSPARENT_HUGEPAGE is not set.
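
The masking can be tried out in userspace.  The sketch below assumes
4KiB base pages and power-of-two page sizes; the toy_* helpers are
hypothetical and take the page order directly rather than a struct
page.

```c
#include <assert.h>

#define TOY_PAGE_SIZE	4096UL			/* assumed base page size */
#define TOY_PAGE_MASK	(~(TOY_PAGE_SIZE - 1))

/* Stand-in for thp_size(): bytes covered by a page of the given order. */
static inline unsigned long toy_thp_size(unsigned int order)
{
	return TOY_PAGE_SIZE << order;
}

/* Same shape as offset_in_page(): keep the bits below PAGE_SIZE. */
static inline unsigned long toy_offset_in_page(unsigned long p)
{
	return p & ~TOY_PAGE_MASK;
}

/* Same shape as offset_in_thp(): keep the bits below the page's size. */
static inline unsigned long toy_offset_in_thp(unsigned int order,
					      unsigned long p)
{
	return p & (toy_thp_size(order) - 1);
}
```

For p = 0x12345, the offset within a base page is 0x345, while the
offset within an order-2 (16KiB) page is 0x2345.  With order 0 the two
helpers agree, matching the claim that it optimises down to
offset_in_page() when large pages are not configured.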

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/mm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 088acbda722d..9a55dce6a535 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1577,6 +1577,7 @@ static inline void clear_page_pfmemalloc(struct page *page)
 extern void pagefault_out_of_memory(void);
 
 #define offset_in_page(p)	((unsigned long)(p) & ~PAGE_MASK)
+#define offset_in_thp(page, p)	((unsigned long)(p) & (thp_size(page) - 1))
 
 /*
  * Flags passed to show_mem() and show_free_areas() to suppress output in
-- 
2.26.2



* [PATCH v4 07/36] fs: Add a filesystem flag for large pages
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (5 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 06/36] mm: Introduce offset_in_thp Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-21 21:55   ` Dave Chinner
  2020-05-15 13:16 ` [PATCH v4 08/36] fs: Do not update nr_thps for large page mappings Matthew Wilcox
                   ` (30 subsequent siblings)
  37 siblings, 1 reply; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

The page cache needs to know whether the filesystem supports pages >
PAGE_SIZE.
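
The check added below is a pointer walk from the mapping to the
filesystem type followed by a bit test.  A userspace sketch with
cut-down, hypothetical toy_* structures (only the fields the walk
needs) could be:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_FS_LARGE_PAGES	8192	/* same value as the new flag */

/* Cut-down stand-ins for the kernel structures; all hypothetical. */
struct toy_fs_type { int fs_flags; };
struct toy_super_block { struct toy_fs_type *s_type; };
struct toy_inode { struct toy_super_block *i_sb; };
struct toy_address_space { struct toy_inode *host; };

/* Same walk as mapping_large_pages(): mapping -> inode -> sb -> fs type. */
static inline bool
toy_mapping_large_pages(const struct toy_address_space *mapping)
{
	return mapping->host->i_sb->s_type->fs_flags & TOY_FS_LARGE_PAGES;
}
```

Returning bool matters a little here: the conversion turns the masked
value 8192 into true, whereas a narrower integer return type could
truncate it to zero.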

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/fs.h      | 1 +
 include/linux/pagemap.h | 5 +++++
 2 files changed, 6 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 55c743925c40..777783c8760b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2241,6 +2241,7 @@ struct file_system_type {
 #define FS_HAS_SUBTYPE		4
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
+#define FS_LARGE_PAGES		8192	/* Remove once all fs converted */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	int (*init_fs_context)(struct fs_context *);
 	const struct fs_parameter_spec *parameters;
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 36bfc9d855bb..c6db74b5e62f 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -116,6 +116,11 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 	m->gfp_mask = mask;
 }
 
+static inline bool mapping_large_pages(struct address_space *mapping)
+{
+	return mapping->host->i_sb->s_type->fs_flags & FS_LARGE_PAGES;
+}
+
 void release_pages(struct page **pages, int nr);
 
 /*
-- 
2.26.2



* [PATCH v4 08/36] fs: Do not update nr_thps for large page mappings
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (6 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 07/36] fs: Add a filesystem flag for large pages Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 09/36] fs: Introduce i_blocks_per_page Matthew Wilcox
                   ` (29 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

The nr_thps counter exists to support large pages in the page cache
when the filesystem does not support writing large pages.  Eventually
it will be removed, but we should still support filesystems which
do not understand large pages yet.  Move the nr_thps manipulation
functions to pagemap.h since they're page-cache specific.
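
The gating can be modelled in a few lines of userspace C.  This is
only a sketch: struct toy_mapping is hypothetical, a plain int stands
in for the kernel's atomic counter, and the CONFIG_READ_ONLY_THP_FOR_FS
conditional is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct address_space. */
struct toy_mapping {
	bool large_pages;	/* what mapping_large_pages() would return */
	int nr_thps;		/* plain int here; atomic in the kernel */
};

static inline int toy_nr_thps(const struct toy_mapping *m)
{
	return m->nr_thps;
}

/* Only mappings whose filesystem lacks large page support are counted. */
static inline void toy_nr_thps_inc(struct toy_mapping *m)
{
	if (!m->large_pages)
		m->nr_thps++;
}

static inline void toy_nr_thps_dec(struct toy_mapping *m)
{
	if (!m->large_pages) {
		assert(m->nr_thps > 0);	/* unbalanced dec is a bug */
		m->nr_thps--;
	}
}
```

A mapping on a filesystem that sets the large page flag therefore
always reads back a count of zero, no matter how many large pages are
added to it.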

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/fs.h      | 27 ---------------------------
 include/linux/pagemap.h | 29 +++++++++++++++++++++++++++++
 2 files changed, 29 insertions(+), 27 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 777783c8760b..1ab65898bd96 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2830,33 +2830,6 @@ static inline errseq_t filemap_sample_wb_err(struct address_space *mapping)
 	return errseq_sample(&mapping->wb_err);
 }
 
-static inline int filemap_nr_thps(struct address_space *mapping)
-{
-#ifdef CONFIG_READ_ONLY_THP_FOR_FS
-	return atomic_read(&mapping->nr_thps);
-#else
-	return 0;
-#endif
-}
-
-static inline void filemap_nr_thps_inc(struct address_space *mapping)
-{
-#ifdef CONFIG_READ_ONLY_THP_FOR_FS
-	atomic_inc(&mapping->nr_thps);
-#else
-	WARN_ON_ONCE(1);
-#endif
-}
-
-static inline void filemap_nr_thps_dec(struct address_space *mapping)
-{
-#ifdef CONFIG_READ_ONLY_THP_FOR_FS
-	atomic_dec(&mapping->nr_thps);
-#else
-	WARN_ON_ONCE(1);
-#endif
-}
-
 extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
 			   int datasync);
 extern int vfs_fsync(struct file *file, int datasync);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c6db74b5e62f..cacd5a30cb9d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -121,6 +121,35 @@ static inline bool mapping_large_pages(struct address_space *mapping)
 	return mapping->host->i_sb->s_type->fs_flags & FS_LARGE_PAGES;
 }
 
+static inline int filemap_nr_thps(struct address_space *mapping)
+{
+#ifdef CONFIG_READ_ONLY_THP_FOR_FS
+	return atomic_read(&mapping->nr_thps);
+#else
+	return 0;
+#endif
+}
+
+static inline void filemap_nr_thps_inc(struct address_space *mapping)
+{
+#ifdef CONFIG_READ_ONLY_THP_FOR_FS
+	if (!mapping_large_pages(mapping))
+		atomic_inc(&mapping->nr_thps);
+#else
+	WARN_ON_ONCE(1);
+#endif
+}
+
+static inline void filemap_nr_thps_dec(struct address_space *mapping)
+{
+#ifdef CONFIG_READ_ONLY_THP_FOR_FS
+	if (!mapping_large_pages(mapping))
+		atomic_dec(&mapping->nr_thps);
+#else
+	WARN_ON_ONCE(1);
+#endif
+}
+
 void release_pages(struct page **pages, int nr);
 
 /*
-- 
2.26.2



* [PATCH v4 09/36] fs: Introduce i_blocks_per_page
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (7 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 08/36] fs: Do not update nr_thps for large page mappings Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 10/36] fs: Make page_mkwrite_check_truncate thp-aware Matthew Wilcox
                   ` (28 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel, Christoph Hellwig

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

This helper is useful for both large pages in the page cache and for
supporting block sizes larger than the page size.  Convert some example
users (we have a few different ways of writing this idiom).
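
The arithmetic behind the helper is a single shift.  A userspace
sketch, assuming 4KiB base pages and using hypothetical toy_* helpers
that take the block-size shift and page order as plain integers:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SIZE	4096UL	/* assumed base page size */

/* Stand-in for i_blocks_per_page(): page size in bytes >> i_blkbits. */
static inline unsigned int toy_blocks_per_page(unsigned int blkbits,
					       unsigned int order)
{
	return (unsigned int)((TOY_PAGE_SIZE << order) >> blkbits);
}

/* Per-block state is only needed when a page spans several blocks. */
static inline bool toy_needs_iop(unsigned int blkbits, unsigned int order)
{
	return toy_blocks_per_page(blkbits, order) > 1;
}
```

With 4KiB blocks (blkbits = 12), a base page holds one block and needs
no per-block state, but an order-1 page holds two and does.  That is
why the converted callers compare against the block count instead of
comparing the block size with PAGE_SIZE.  A 64KiB block size
(blkbits = 16) on a base page gives zero.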

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/buffered-io.c  |  8 ++++----
 fs/jfs/jfs_metapage.c   |  2 +-
 fs/xfs/xfs_aops.c       |  2 +-
 include/linux/pagemap.h | 16 ++++++++++++++++
 4 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 890c8fcda4f3..4bc37bf8d057 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -46,7 +46,7 @@ iomap_page_create(struct inode *inode, struct page *page)
 {
 	struct iomap_page *iop = to_iomap_page(page);
 
-	if (iop || i_blocksize(inode) == PAGE_SIZE)
+	if (iop || i_blocks_per_page(inode, page) <= 1)
 		return iop;
 
 	iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
@@ -152,7 +152,7 @@ iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len)
 	unsigned int i;
 
 	spin_lock_irqsave(&iop->uptodate_lock, flags);
-	for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
+	for (i = 0; i < i_blocks_per_page(inode, page); i++) {
 		if (i >= first && i <= last)
 			set_bit(i, iop->uptodate);
 		else if (!test_bit(i, iop->uptodate))
@@ -1090,7 +1090,7 @@ iomap_finish_page_writeback(struct inode *inode, struct page *page,
 		mapping_set_error(inode->i_mapping, -EIO);
 	}
 
-	WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE && !iop);
+	WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
 	WARN_ON_ONCE(iop && atomic_read(&iop->write_count) <= 0);
 
 	if (!iop || atomic_dec_and_test(&iop->write_count))
@@ -1386,7 +1386,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	int error = 0, count = 0, i;
 	LIST_HEAD(submit_list);
 
-	WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE && !iop);
+	WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
 	WARN_ON_ONCE(iop && atomic_read(&iop->write_count) != 0);
 
 	/*
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index a2f5338a5ea1..176580f54af9 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -473,7 +473,7 @@ static int metapage_readpage(struct file *fp, struct page *page)
 	struct inode *inode = page->mapping->host;
 	struct bio *bio = NULL;
 	int block_offset;
-	int blocks_per_page = PAGE_SIZE >> inode->i_blkbits;
+	int blocks_per_page = i_blocks_per_page(inode, page);
 	sector_t page_start;	/* address of page in fs blocks */
 	sector_t pblock;
 	int xlen;
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 1fd4fb7a607c..5b25f5ee84dc 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -544,7 +544,7 @@ xfs_discard_page(
 			page, ip->i_ino, offset);
 
 	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
-			PAGE_SIZE / i_blocksize(inode));
+			i_blocks_per_page(inode, page));
 	if (error && !XFS_FORCED_SHUTDOWN(mp))
 		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
 out_invalidate:
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index cacd5a30cb9d..1a0bb387948c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -850,4 +850,20 @@ static inline int page_mkwrite_check_truncate(struct page *page,
 	return offset;
 }
 
+/**
+ * i_blocks_per_page - How many blocks fit in this page.
+ * @inode: The inode which contains the blocks.
+ * @page: The (potentially large) page.
+ *
+ * If the block size is larger than the size of this page, this function
+ * returns zero.
+ *
+ * Context: Any context.
+ * Return: The number of filesystem blocks covered by this page.
+ */
+static inline
+unsigned int i_blocks_per_page(struct inode *inode, struct page *page)
+{
+	return thp_size(page) >> inode->i_blkbits;
+}
 #endif /* _LINUX_PAGEMAP_H */
-- 
2.26.2
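
For readers following along, the arithmetic behind i_blocks_per_page() can
be tried in user space.  This is only a sketch: blocks_per_page() is a
hypothetical stand-in that takes the page size in bytes directly, where the
kernel helper calls thp_size(page) and reads inode->i_blkbits.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical user-space stand-in for i_blocks_per_page().
 * page_bytes plays the role of thp_size(page); blkbits plays the role
 * of inode->i_blkbits.
 */
static unsigned int blocks_per_page(unsigned long page_bytes,
		unsigned int blkbits)
{
	/* A shift by the full width of the type would be undefined. */
	assert(blkbits < sizeof(unsigned long) * CHAR_BIT);

	/* Yields zero when the block is larger than the page. */
	return (unsigned int)(page_bytes >> blkbits);
}
```

With 4KiB blocks, a 4KiB page holds one block and a 2MiB page holds 512,
which is why the callers above test "<= 1" and "> 1" rather than comparing
the block size against PAGE_SIZE.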



* [PATCH v4 10/36] fs: Make page_mkwrite_check_truncate thp-aware
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (8 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 09/36] fs: Introduce i_blocks_per_page Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-21 22:01   ` Dave Chinner
  2020-05-15 13:16 ` [PATCH v4 11/36] fs: Support THPs in zero_user_segments Matthew Wilcox
                   ` (27 subsequent siblings)
  37 siblings, 1 reply; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

If the page is compound, check the last index in the page and return
the appropriate size.  Change the return type to ssize_t in case we ever
support pages larger than 2GB.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagemap.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 1a0bb387948c..c75d7fb7ccbc 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -827,22 +827,22 @@ static inline unsigned long dir_pages(struct inode *inode)
  * @page: the page to check
  * @inode: the inode to check the page against
  *
- * Returns the number of bytes in the page up to EOF,
+ * Return: The number of bytes in the page up to EOF,
  * or -EFAULT if the page was truncated.
  */
-static inline int page_mkwrite_check_truncate(struct page *page,
+static inline ssize_t page_mkwrite_check_truncate(struct page *page,
 					      struct inode *inode)
 {
 	loff_t size = i_size_read(inode);
 	pgoff_t index = size >> PAGE_SHIFT;
-	int offset = offset_in_page(size);
+	unsigned long offset = offset_in_thp(page, size);
 
 	if (page->mapping != inode->i_mapping)
 		return -EFAULT;
 
 	/* page is wholly inside EOF */
-	if (page->index < index)
-		return PAGE_SIZE;
+	if (page->index + hpage_nr_pages(page) - 1 < index)
+		return thp_size(page);
 	/* page is wholly past EOF */
 	if (page->index > index || !offset)
 		return -EFAULT;
-- 
2.26.2
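
The three outcomes of the check above (wholly inside EOF, wholly past EOF,
straddling EOF) can be modelled in user space.  This is a sketch under
stated assumptions: struct sketch_page and sketch_check_truncate() are
hypothetical names, base pages are assumed to be 4KiB, offset_in_thp() is
assumed to mask with (thp_size - 1), and the page->mapping check is left
out.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4KiB base pages */

/* Hypothetical user-space model of a (possibly compound) page. */
struct sketch_page {
	unsigned long index;	/* index of the head page, in base pages */
	unsigned long nr_pages;	/* power of two; 1 for a base page */
};

/* Returns bytes in the page up to EOF, or -EFAULT if past EOF. */
static long sketch_check_truncate(const struct sketch_page *page,
		long long isize)
{
	unsigned long size_bytes = page->nr_pages << SKETCH_PAGE_SHIFT;
	unsigned long eof_index = (unsigned long)(isize >> SKETCH_PAGE_SHIFT);
	unsigned long offset = (unsigned long)isize & (size_bytes - 1);

	assert(page->nr_pages && (page->nr_pages & (page->nr_pages - 1)) == 0);

	/* page is wholly inside EOF */
	if (page->index + page->nr_pages - 1 < eof_index)
		return (long)size_bytes;
	/* page is wholly past EOF */
	if (page->index > eof_index || !offset)
		return -EFAULT;
	/* page straddles EOF */
	return (long)offset;
}
```

For a four-page THP at index 0, a 20000-byte file returns the full 16384
bytes, while a 10000-byte file returns 10000.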



* [PATCH v4 11/36] fs: Support THPs in zero_user_segments
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (9 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 10/36] fs: Make page_mkwrite_check_truncate thp-aware Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-25  4:55   ` Kirill A. Shutemov
  2020-05-15 13:16 ` [PATCH v4 12/36] bio: Add bio_for_each_thp_segment_all Matthew Wilcox
                   ` (26 subsequent siblings)
  37 siblings, 1 reply; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

We can only kmap() one subpage of a THP at a time, so loop over all
relevant subpages, skipping ones which don't need to be zeroed.  This is
too large to inline when THPs are enabled and we actually need highmem,
so put it in highmem.c.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/highmem.h | 15 +++++++---
 mm/highmem.c            | 62 +++++++++++++++++++++++++++++++++++++++--
 2 files changed, 71 insertions(+), 6 deletions(-)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index ea5cdbd8c2c3..74614903619d 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -215,13 +215,18 @@ static inline void clear_highpage(struct page *page)
 	kunmap_atomic(kaddr);
 }
 
+#if defined(CONFIG_HIGHMEM) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
+void zero_user_segments(struct page *page, unsigned start1, unsigned end1,
+		unsigned start2, unsigned end2);
+#else /* !HIGHMEM || !TRANSPARENT_HUGEPAGE */
 static inline void zero_user_segments(struct page *page,
-	unsigned start1, unsigned end1,
-	unsigned start2, unsigned end2)
+		unsigned start1, unsigned end1,
+		unsigned start2, unsigned end2)
 {
+	unsigned long i;
 	void *kaddr = kmap_atomic(page);
 
-	BUG_ON(end1 > PAGE_SIZE || end2 > PAGE_SIZE);
+	BUG_ON(end1 > thp_size(page) || end2 > thp_size(page));
 
 	if (end1 > start1)
 		memset(kaddr + start1, 0, end1 - start1);
@@ -230,8 +235,10 @@ static inline void zero_user_segments(struct page *page,
 		memset(kaddr + start2, 0, end2 - start2);
 
 	kunmap_atomic(kaddr);
-	flush_dcache_page(page);
+	for (i = 0; i < hpage_nr_pages(page); i++)
+		flush_dcache_page(page + i);
 }
+#endif /* !HIGHMEM || !TRANSPARENT_HUGEPAGE */
 
 static inline void zero_user_segment(struct page *page,
 	unsigned start, unsigned end)
diff --git a/mm/highmem.c b/mm/highmem.c
index 64d8dea47dd1..3a85c66ef532 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -367,9 +367,67 @@ void kunmap_high(struct page *page)
 	if (need_wakeup)
 		wake_up(pkmap_map_wait);
 }
-
 EXPORT_SYMBOL(kunmap_high);
-#endif
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void zero_user_segments(struct page *page, unsigned start1, unsigned end1,
+		unsigned start2, unsigned end2)
+{
+	unsigned int i;
+
+	BUG_ON(end1 > thp_size(page) || end2 > thp_size(page));
+
+	for (i = 0; i < hpage_nr_pages(page); i++) {
+		void *kaddr;
+		unsigned this_end;
+
+		if (end1 == 0 && start2 >= PAGE_SIZE) {
+			start2 -= PAGE_SIZE;
+			end2 -= PAGE_SIZE;
+			continue;
+		}
+
+		if (start1 >= PAGE_SIZE) {
+			start1 -= PAGE_SIZE;
+			end1 -= PAGE_SIZE;
+			if (start2) {
+				start2 -= PAGE_SIZE;
+				end2 -= PAGE_SIZE;
+			}
+			continue;
+		}
+
+		kaddr = kmap_atomic(page + i);
+
+		this_end = min_t(unsigned, end1, PAGE_SIZE);
+		if (end1 > start1)
+			memset(kaddr + start1, 0, this_end - start1);
+		end1 -= this_end;
+		start1 = 0;
+
+		if (start2 >= PAGE_SIZE) {
+			start2 -= PAGE_SIZE;
+			end2 -= PAGE_SIZE;
+		} else {
+			this_end = min_t(unsigned, end2, PAGE_SIZE);
+			if (end2 > start2)
+				memset(kaddr + start2, 0, this_end - start2);
+			end2 -= this_end;
+			start2 = 0;
+		}
+
+		kunmap_atomic(kaddr);
+		flush_dcache_page(page + i);
+
+		if (!end1 && !end2)
+			break;
+	}
+
+	BUG_ON((start1 | start2 | end1 | end2) != 0);
+}
+EXPORT_SYMBOL(zero_user_segments);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif /* CONFIG_HIGHMEM */
 
 #if defined(HASHED_PAGE_VIRTUAL)
 
-- 
2.26.2
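
The idea of zeroing two byte ranges one subpage at a time can be sketched
in user space.  Note that this is not the loop from the patch: it is a
simpler formulation that clamps each range to the current subpage, and it
visits every subpage rather than skipping the ones that need no zeroing.
The names are hypothetical, a flat buffer stands in for the THP, and
pointer arithmetic stands in for kmap_atomic(page + i).

```c
#include <assert.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4KiB base pages */

/* Zero the part of [start, end) that falls inside the subpage at 'base'. */
static void sketch_zero_clamped(unsigned char *kaddr, unsigned base,
		unsigned start, unsigned end)
{
	unsigned lo = start > base ? start : base;
	unsigned hi = end < base + SKETCH_PAGE_SIZE ?
			end : base + SKETCH_PAGE_SIZE;

	if (hi > lo)
		memset(kaddr + (lo - base), 0, hi - lo);
}

static void sketch_zero_user_segments(unsigned char *buf, unsigned nr_pages,
		unsigned start1, unsigned end1,
		unsigned start2, unsigned end2)
{
	unsigned total = nr_pages * SKETCH_PAGE_SIZE;
	unsigned i;

	/* Mirrors the BUG_ON() bounds check in the patch. */
	assert(end1 <= total && end2 <= total);

	for (i = 0; i < nr_pages; i++) {
		unsigned base = i * SKETCH_PAGE_SIZE;
		/* Stands in for kmap_atomic(page + i): one subpage at a time. */
		unsigned char *kaddr = buf + base;

		sketch_zero_clamped(kaddr, base, start1, end1);
		sketch_zero_clamped(kaddr, base, start2, end2);
	}
}
```

The result should match two plain memset() calls over the whole buffer,
which is what the inline !HIGHMEM version does with a single mapping.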



* [PATCH v4 12/36] bio: Add bio_for_each_thp_segment_all
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (10 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 11/36] fs: Support THPs in zero_user_segments Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 13/36] iomap: Support arbitrarily many blocks per page Matthew Wilcox
                   ` (25 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Add bio_for_each_thp_segment_all(), which iterates once for each THP in
the bio instead of once for each base page.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/bio.h  | 13 +++++++++++++
 include/linux/bvec.h | 23 +++++++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index c1c0f9ea4e63..4cc883fd8d63 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -131,12 +131,25 @@ static inline bool bio_next_segment(const struct bio *bio,
 	return true;
 }
 
+static inline bool bio_next_thp_segment(const struct bio *bio,
+				    struct bvec_iter_all *iter)
+{
+	if (iter->idx >= bio->bi_vcnt)
+		return false;
+
+	bvec_thp_advance(&bio->bi_io_vec[iter->idx], iter);
+	return true;
+}
+
 /*
  * drivers should _never_ use the all version - the bio may have been split
  * before it got to the driver and the driver won't own all of it
  */
 #define bio_for_each_segment_all(bvl, bio, iter) \
 	for (bvl = bvec_init_iter_all(&iter); bio_next_segment((bio), &iter); )
+#define bio_for_each_thp_segment_all(bvl, bio, iter) \
+	for (bvl = bvec_init_iter_all(&iter); \
+	     bio_next_thp_segment((bio), &iter); )
 
 static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
 				    unsigned bytes)
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index a81c13ac1972..e08bd192e0ed 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -153,4 +153,27 @@ static inline void bvec_advance(const struct bio_vec *bvec,
 	}
 }
 
+static inline void bvec_thp_advance(const struct bio_vec *bvec,
+				struct bvec_iter_all *iter_all)
+{
+	struct bio_vec *bv = &iter_all->bv;
+	unsigned int page_size = thp_size(bvec->bv_page);
+
+	if (iter_all->done) {
+		bv->bv_page += hpage_nr_pages(bv->bv_page);
+		bv->bv_offset = 0;
+	} else {
+		BUG_ON(bvec->bv_offset >= page_size);
+		bv->bv_page = bvec->bv_page;
+		bv->bv_offset = bvec->bv_offset & (page_size - 1);
+	}
+	bv->bv_len = min(page_size - bv->bv_offset,
+			 bvec->bv_len - iter_all->done);
+	iter_all->done += bv->bv_len;
+
+	if (iter_all->done == bvec->bv_len) {
+		iter_all->idx++;
+		iter_all->done = 0;
+	}
+}
 #endif /* __LINUX_BVEC_ITER_H */
-- 
2.26.2
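
The way bvec_thp_advance() carves one bvec into page-sized segments can be
followed with a user-space model.  This is a sketch only: struct
sketch_iter and sketch_thp_advance() are hypothetical names, the page size
is passed in directly instead of being read with thp_size(), and the
bv_page pointer bookkeeping is omitted.

```c
#include <assert.h>

struct sketch_iter {
	unsigned int idx;		/* which bvec we are on */
	unsigned int done;		/* bytes of this bvec already returned */
	unsigned int seg_offset;	/* offset of the current segment */
	unsigned int seg_len;		/* length of the current segment */
};

static void sketch_thp_advance(unsigned int bv_offset, unsigned int bv_len,
		unsigned int page_size, struct sketch_iter *it)
{
	unsigned int room, left;

	if (it->done) {
		/* Later segments start at the beginning of the next page. */
		it->seg_offset = 0;
	} else {
		/* Mirrors the BUG_ON() in the patch. */
		assert(bv_offset < page_size);
		it->seg_offset = bv_offset;
	}

	room = page_size - it->seg_offset;
	left = bv_len - it->done;
	it->seg_len = room < left ? room : left;
	it->done += it->seg_len;

	/* Finished this bvec: move to the next one. */
	if (it->done == bv_len) {
		it->idx++;
		it->done = 0;
	}
}
```

A 20000-byte bvec starting 1000 bytes into an 8KiB page yields three
segments of 7192, 8192 and 4616 bytes, where a per-base-page walk over
4KiB pages would yield more.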



* [PATCH v4 13/36] iomap: Support arbitrarily many blocks per page
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (11 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 12/36] bio: Add bio_for_each_thp_segment_all Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 14/36] iomap: Support large pages in iomap_adjust_read_range Matthew Wilcox
                   ` (24 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Size the uptodate array dynamically.  Now that this array is protected
by a spinlock, we can use bitmap functions to set the bits in this array
instead of a loop around set_bit().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 27 +++++++++------------------
 1 file changed, 9 insertions(+), 18 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4bc37bf8d057..4a79061073eb 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -22,14 +22,14 @@
 #include "../internal.h"
 
 /*
- * Structure allocated for each page when block size < PAGE_SIZE to track
+ * Structure allocated for each page when block size < page size to track
  * sub-page uptodate status and I/O completions.
  */
 struct iomap_page {
 	atomic_t		read_count;
 	atomic_t		write_count;
 	spinlock_t		uptodate_lock;
-	DECLARE_BITMAP(uptodate, PAGE_SIZE / 512);
+	unsigned long		uptodate[];
 };
 
 static inline struct iomap_page *to_iomap_page(struct page *page)
@@ -45,15 +45,14 @@ static struct iomap_page *
 iomap_page_create(struct inode *inode, struct page *page)
 {
 	struct iomap_page *iop = to_iomap_page(page);
+	unsigned int nr_blocks = i_blocks_per_page(inode, page);
 
-	if (iop || i_blocks_per_page(inode, page) <= 1)
+	if (iop || nr_blocks <= 1)
 		return iop;
 
-	iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
-	atomic_set(&iop->read_count, 0);
-	atomic_set(&iop->write_count, 0);
+	iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
+				GFP_NOFS | __GFP_NOFAIL);
 	spin_lock_init(&iop->uptodate_lock);
-	bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
 
 	/*
 	 * migrate_page_move_mapping() assumes that pages with private data have
@@ -146,20 +145,12 @@ iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len)
 	struct iomap_page *iop = to_iomap_page(page);
 	struct inode *inode = page->mapping->host;
 	unsigned first = off >> inode->i_blkbits;
-	unsigned last = (off + len - 1) >> inode->i_blkbits;
-	bool uptodate = true;
+	unsigned count = len >> inode->i_blkbits;
 	unsigned long flags;
-	unsigned int i;
 
 	spin_lock_irqsave(&iop->uptodate_lock, flags);
-	for (i = 0; i < i_blocks_per_page(inode, page); i++) {
-		if (i >= first && i <= last)
-			set_bit(i, iop->uptodate);
-		else if (!test_bit(i, iop->uptodate))
-			uptodate = false;
-	}
-
-	if (uptodate)
+	bitmap_set(iop->uptodate, first, count);
+	if (bitmap_full(iop->uptodate, i_blocks_per_page(inode, page)))
 		SetPageUptodate(page);
 	spin_unlock_irqrestore(&iop->uptodate_lock, flags);
 }
-- 
2.26.2
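
The dynamically sized uptodate bitmap can be illustrated in user space.
This is a sketch under assumptions, not the kernel code: the names are
hypothetical, calloc() stands in for kzalloc() with struct_size(), plain
loops stand in for bitmap_set() and bitmap_full(), the spinlock is
omitted, and nr_blocks is stored in the structure only so the sketch is
self-contained (the patch recomputes it with i_blocks_per_page()).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

#define SKETCH_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_BITS_TO_LONGS(n) \
	(((n) + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG)

struct sketch_iop {
	unsigned int nr_blocks;
	unsigned long uptodate[];	/* flexible array, sized at allocation */
};

static struct sketch_iop *sketch_iop_create(unsigned int nr_blocks)
{
	size_t longs = SKETCH_BITS_TO_LONGS(nr_blocks);
	struct sketch_iop *iop;

	/* calloc() zeroes the bitmap, as kzalloc() does in the patch. */
	iop = calloc(1, sizeof(*iop) + longs * sizeof(unsigned long));
	assert(iop);
	iop->nr_blocks = nr_blocks;
	return iop;
}

/* Mark blocks [first, first + count) uptodate; true once all blocks are. */
static bool sketch_set_range_uptodate(struct sketch_iop *iop,
		unsigned int first, unsigned int count)
{
	unsigned int i;

	assert(first + count <= iop->nr_blocks);

	for (i = first; i < first + count; i++)
		iop->uptodate[i / SKETCH_BITS_PER_LONG] |=
			1UL << (i % SKETCH_BITS_PER_LONG);

	for (i = 0; i < iop->nr_blocks; i++)
		if (!(iop->uptodate[i / SKETCH_BITS_PER_LONG] &
		      (1UL << (i % SKETCH_BITS_PER_LONG))))
			return false;
	return true;
}
```

With 70 blocks the bitmap needs two longs on a 64-bit build, which the old
fixed DECLARE_BITMAP(uptodate, PAGE_SIZE / 512) could not express for
pages larger than PAGE_SIZE.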



* [PATCH v4 14/36] iomap: Support large pages in iomap_adjust_read_range
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (12 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 13/36] iomap: Support arbitrarily many blocks per page Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-21 22:24   ` Dave Chinner
  2020-05-15 13:16 ` [PATCH v4 15/36] iomap: Support large pages in read paths Matthew Wilcox
                   ` (23 subsequent siblings)
  37 siblings, 1 reply; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Pass the struct page instead of the iomap_page so we can determine the
size of the page.  Use offset_in_thp() instead of offset_in_page() and use
thp_size() instead of PAGE_SIZE.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4a79061073eb..423ffc9d4a97 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -83,15 +83,16 @@ iomap_page_release(struct page *page)
  * Calculate the range inside the page that we actually need to read.
  */
 static void
-iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
+iomap_adjust_read_range(struct inode *inode, struct page *page,
 		loff_t *pos, loff_t length, unsigned *offp, unsigned *lenp)
 {
+	struct iomap_page *iop = to_iomap_page(page);
 	loff_t orig_pos = *pos;
 	loff_t isize = i_size_read(inode);
 	unsigned block_bits = inode->i_blkbits;
 	unsigned block_size = (1 << block_bits);
-	unsigned poff = offset_in_page(*pos);
-	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
+	unsigned poff = offset_in_thp(page, *pos);
+	unsigned plen = min_t(loff_t, thp_size(page) - poff, length);
 	unsigned first = poff >> block_bits;
 	unsigned last = (poff + plen - 1) >> block_bits;
 
@@ -129,7 +130,7 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
 	 * page cache for blocks that are entirely outside of i_size.
 	 */
 	if (orig_pos <= isize && orig_pos + length > isize) {
-		unsigned end = offset_in_page(isize - 1) >> block_bits;
+		unsigned end = offset_in_thp(page, isize - 1) >> block_bits;
 
 		if (first <= end && last > end)
 			plen -= (last - end) * block_size;
@@ -256,7 +257,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	}
 
 	/* zero post-eof blocks as the page may be mapped */
-	iomap_adjust_read_range(inode, iop, &pos, length, &poff, &plen);
+	iomap_adjust_read_range(inode, page, &pos, length, &poff, &plen);
 	if (plen == 0)
 		goto done;
 
@@ -571,7 +572,6 @@ static int
 __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
 		struct page *page, struct iomap *srcmap)
 {
-	struct iomap_page *iop = iomap_page_create(inode, page);
 	loff_t block_size = i_blocksize(inode);
 	loff_t block_start = pos & ~(block_size - 1);
 	loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
@@ -580,9 +580,10 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
 
 	if (PageUptodate(page))
 		return 0;
+	iomap_page_create(inode, page);
 
 	do {
-		iomap_adjust_read_range(inode, iop, &block_start,
+		iomap_adjust_read_range(inode, page, &block_start,
 				block_end - block_start, &poff, &plen);
 		if (plen == 0)
 			break;
-- 
2.26.2
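
The effect of switching from offset_in_page()/PAGE_SIZE to
offset_in_thp()/thp_size() on the initial range calculation can be seen
with a small user-space sketch.  It models only the first four
assignments (poff, plen, first, last), not the later trimming against the
uptodate bitmap or i_size.  The names are hypothetical and offset_in_thp()
is assumed to mask with (page size - 1).

```c
#include <assert.h>

struct sketch_range {
	unsigned int poff;	/* byte offset within the page */
	unsigned int plen;	/* bytes to read within the page */
	unsigned int first;	/* first block touched */
	unsigned int last;	/* last block touched */
};

static struct sketch_range sketch_read_range(unsigned long long pos,
		unsigned long long length, unsigned int page_size,
		unsigned int block_bits)
{
	struct sketch_range r;
	unsigned long long room;

	/* Masking only works for power-of-two page sizes. */
	assert(page_size && (page_size & (page_size - 1)) == 0);
	assert(length > 0);

	r.poff = (unsigned int)(pos & (page_size - 1));
	room = page_size - r.poff;
	r.plen = (unsigned int)(room < length ? room : length);
	r.first = r.poff >> block_bits;
	r.last = (r.poff + r.plen - 1) >> block_bits;
	return r;
}
```

At file position 20000 with 4KiB blocks, a 16KiB page gives a range of
12768 bytes spanning blocks 0 to 3, while a 4KiB page gives 480 bytes in
block 0.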



* [PATCH v4 15/36] iomap: Support large pages in read paths
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (13 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 14/36] iomap: Support large pages in iomap_adjust_read_range Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 16/36] iomap: Support large pages in write paths Matthew Wilcox
                   ` (22 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use thp_size() instead of PAGE_SIZE, offset_in_thp() instead of
offset_in_page() and bio_for_each_thp_segment_all().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 423ffc9d4a97..75f42c0d4cd9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -198,7 +198,7 @@ iomap_read_end_io(struct bio *bio)
 	struct bio_vec *bvec;
 	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bvec, bio, iter_all)
+	bio_for_each_thp_segment_all(bvec, bio, iter_all)
 		iomap_read_page_end_io(bvec, error);
 	bio_put(bio);
 }
@@ -238,6 +238,16 @@ static inline bool iomap_block_needs_zeroing(struct inode *inode,
 		pos >= i_size_read(inode);
 }
 
+/*
+ * Estimate the number of vectors we need based on the current page size;
+ * if we're wrong we'll end up doing an overly large allocation or needing
+ * to do a second allocation, neither of which is a big deal.
+ */
+static unsigned int iomap_nr_vecs(struct page *page, loff_t length)
+{
+	return (length + thp_size(page) - 1) >> page_shift(page);
+}
+
 static loff_t
 iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap, struct iomap *srcmap)
@@ -294,7 +304,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	if (!ctx->bio || !is_contig || bio_full(ctx->bio, plen)) {
 		gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
 		gfp_t orig_gfp = gfp;
-		int nr_vecs = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
+		int nr_vecs = iomap_nr_vecs(page, length);
 
 		if (ctx->bio)
 			submit_bio(ctx->bio);
@@ -338,9 +348,9 @@ iomap_readpage(struct page *page, const struct iomap_ops *ops)
 
 	trace_iomap_readpage(page->mapping->host, 1);
 
-	for (poff = 0; poff < PAGE_SIZE; poff += ret) {
+	for (poff = 0; poff < thp_size(page); poff += ret) {
 		ret = iomap_apply(inode, page_offset(page) + poff,
-				PAGE_SIZE - poff, 0, ops, &ctx,
+				thp_size(page) - poff, 0, ops, &ctx,
 				iomap_readpage_actor);
 		if (ret <= 0) {
 			WARN_ON_ONCE(ret == 0);
@@ -374,7 +384,8 @@ iomap_readahead_actor(struct inode *inode, loff_t pos, loff_t length,
 	loff_t done, ret;
 
 	for (done = 0; done < length; done += ret) {
-		if (ctx->cur_page && offset_in_page(pos + done) == 0) {
+		if (ctx->cur_page &&
+		    offset_in_thp(ctx->cur_page, pos + done) == 0) {
 			if (!ctx->cur_page_in_bio)
 				unlock_page(ctx->cur_page);
 			put_page(ctx->cur_page);
-- 
2.26.2
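
The estimate made by iomap_nr_vecs() is a round-up division by the page
size.  A user-space sketch, with the hypothetical name sketch_nr_vecs()
and the page shift passed in directly rather than taken from
page_shift(page):

```c
#include <assert.h>

/* Number of page-sized vectors needed to cover 'length' bytes. */
static unsigned int sketch_nr_vecs(unsigned int page_shift,
		unsigned long long length)
{
	unsigned long long page_size;

	assert(page_shift < 32);
	page_size = 1ULL << page_shift;

	/* Round up to a whole number of pages, then divide by shifting. */
	return (unsigned int)((length + page_size - 1) >> page_shift);
}
```

A 10000-byte read needs three vectors with 4KiB pages but only one with
16KiB pages, so sizing the bio from the current page's size avoids
over-allocating when large pages are in use.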



* [PATCH v4 16/36] iomap: Support large pages in write paths
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (14 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 15/36] iomap: Support large pages in read paths Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 17/36] iomap: Inline data shouldn't see large pages Matthew Wilcox
                   ` (21 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use thp_size() instead of PAGE_SIZE and offset_in_thp() instead of
offset_in_page().  Also simplify the logic in iomap_do_writepage()
for determining end of file.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 52 +++++++++++++++++++++++-------------------
 1 file changed, 29 insertions(+), 23 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 75f42c0d4cd9..b7504b8aa90c 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -466,7 +466,7 @@ iomap_is_partially_uptodate(struct page *page, unsigned long from,
 	unsigned i;
 
 	/* Limit range to one page */
-	len = min_t(unsigned, PAGE_SIZE - from, count);
+	len = min_t(unsigned, thp_size(page) - from, count);
 
 	/* First and last blocks in range within page */
 	first = from >> inode->i_blkbits;
@@ -510,7 +510,7 @@ iomap_invalidatepage(struct page *page, unsigned int offset, unsigned int len)
 	 * If we are invalidating the entire page, clear the dirty state from it
 	 * and release it to avoid unnecessary buildup of the LRU.
 	 */
-	if (offset == 0 && len == PAGE_SIZE) {
+	if (offset == 0 && len == thp_size(page)) {
 		WARN_ON_ONCE(PageWriteback(page));
 		cancel_dirty_page(page);
 		iomap_page_release(page);
@@ -586,7 +586,9 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
 	loff_t block_size = i_blocksize(inode);
 	loff_t block_start = pos & ~(block_size - 1);
 	loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
-	unsigned from = offset_in_page(pos), to = from + len, poff, plen;
+	unsigned from = offset_in_thp(page, pos);
+	unsigned to = from + len;
+	unsigned poff, plen;
 	int status;
 
 	if (PageUptodate(page))
@@ -654,8 +656,8 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 	else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
 		status = __block_write_begin_int(page, pos, len, NULL, srcmap);
 	else
-		status = __iomap_write_begin(inode, pos, len, flags, page,
-				srcmap);
+		status = __iomap_write_begin(inode, pos, len, flags,
+				compound_head(page), srcmap);
 
 	if (unlikely(status))
 		goto out_unlock;
@@ -718,7 +720,7 @@ __iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
 	 */
 	if (unlikely(copied < len && !PageUptodate(page)))
 		return 0;
-	iomap_set_range_uptodate(page, offset_in_page(pos), len);
+	iomap_set_range_uptodate(page, offset_in_thp(page, pos), len);
 	iomap_set_page_dirty(page);
 	return copied;
 }
@@ -754,7 +756,8 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len, unsigned copied,
 		ret = block_write_end(NULL, inode->i_mapping, pos, len, copied,
 				page, NULL);
 	} else {
-		ret = __iomap_write_end(inode, pos, len, copied, page);
+		ret = __iomap_write_end(inode, pos, len, copied,
+				compound_head(page));
 	}
 
 	/*
@@ -793,6 +796,10 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
 
+		/*
+		 * XXX: We don't know what size page we'll find in the
+		 * page cache, so only copy up to a regular page boundary.
+		 */
 		offset = offset_in_page(pos);
 		bytes = min_t(unsigned long, PAGE_SIZE - offset,
 						iov_iter_count(i));
@@ -1129,7 +1136,7 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
 			next = bio->bi_private;
 
 		/* walk each page on bio, ending page IO on them */
-		bio_for_each_segment_all(bv, bio, iter_all)
+		bio_for_each_thp_segment_all(bv, bio, iter_all)
 			iomap_finish_page_writeback(inode, bv->bv_page, error);
 		bio_put(bio);
 	}
@@ -1335,7 +1342,7 @@ iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page,
 {
 	sector_t sector = iomap_sector(&wpc->iomap, offset);
 	unsigned len = i_blocksize(inode);
-	unsigned poff = offset & (PAGE_SIZE - 1);
+	unsigned poff = offset & (thp_size(page) - 1);
 	bool merged, same_page = false;
 
 	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, sector)) {
@@ -1385,11 +1392,12 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	struct iomap_page *iop = to_iomap_page(page);
 	struct iomap_ioend *ioend, *next;
 	unsigned len = i_blocksize(inode);
-	u64 file_offset; /* file offset of page */
+	loff_t pos;
 	int error = 0, count = 0, i;
+	int nr_blocks = i_blocks_per_page(inode, page);
 	LIST_HEAD(submit_list);
 
-	WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
+	WARN_ON_ONCE(nr_blocks > 1 && !iop);
 	WARN_ON_ONCE(iop && atomic_read(&iop->write_count) != 0);
 
 	/*
@@ -1397,20 +1405,20 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	 * end of the current map or find the current map invalid, grab a new
 	 * one.
 	 */
-	for (i = 0, file_offset = page_offset(page);
-	     i < (PAGE_SIZE >> inode->i_blkbits) && file_offset < end_offset;
-	     i++, file_offset += len) {
+	for (i = 0, pos = page_offset(page);
+	     i < nr_blocks && pos < end_offset;
+	     i++, pos += len) {
 		if (iop && !test_bit(i, iop->uptodate))
 			continue;
 
-		error = wpc->ops->map_blocks(wpc, inode, file_offset);
+		error = wpc->ops->map_blocks(wpc, inode, pos);
 		if (error)
 			break;
 		if (WARN_ON_ONCE(wpc->iomap.type == IOMAP_INLINE))
 			continue;
 		if (wpc->iomap.type == IOMAP_HOLE)
 			continue;
-		iomap_add_to_ioend(inode, file_offset, page, iop, wpc, wbc,
+		iomap_add_to_ioend(inode, pos, page, iop, wpc, wbc,
 				 &submit_list);
 		count++;
 	}
@@ -1492,7 +1500,6 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 {
 	struct iomap_writepage_ctx *wpc = data;
 	struct inode *inode = page->mapping->host;
-	pgoff_t end_index;
 	u64 end_offset;
 	loff_t offset;
 
@@ -1533,10 +1540,8 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 	 * ---------------------------------^------------------|
 	 */
 	offset = i_size_read(inode);
-	end_index = offset >> PAGE_SHIFT;
-	if (page->index < end_index)
-		end_offset = (loff_t)(page->index + 1) << PAGE_SHIFT;
-	else {
+	end_offset = page_offset(page) + thp_size(page);
+	if (end_offset > offset) {
 		/*
 		 * Check whether the page to write out is beyond or straddles
 		 * i_size or not.
@@ -1548,7 +1553,8 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 		 * |				    |      Straddles     |
 		 * ---------------------------------^-----------|--------|
 		 */
-		unsigned offset_into_page = offset & (PAGE_SIZE - 1);
+		unsigned offset_into_page = offset_in_thp(page, offset);
+		pgoff_t end_index = offset >> PAGE_SHIFT;
 
 		/*
 		 * Skip the page if it is fully outside i_size, e.g. due to a
@@ -1579,7 +1585,7 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 		 * memory is zeroed when mapped, and writes to that region are
 		 * not written out to the file."
 		 */
-		zero_user_segment(page, offset_into_page, PAGE_SIZE);
+		zero_user_segment(page, offset_into_page, thp_size(page));
 
 		/* Adjust the end_offset to the end of file */
 		end_offset = offset;
-- 
2.26.2
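
The simplified end-of-file logic in iomap_do_writepage() amounts to
comparing the end of the page with i_size.  A user-space sketch of that
classification follows.  The names are hypothetical, and the "beyond"
test (page start at or after i_size) is this sketch's assumption, since
the exact skip condition lies outside the quoted hunks.

```c
#include <assert.h>

enum sketch_eof { SKETCH_INSIDE, SKETCH_STRADDLES, SKETCH_BEYOND };

/*
 * page_pos stands in for page_offset(page), page_size for thp_size(page)
 * and isize for i_size_read(inode).  *end_offset receives the file offset
 * at which writeback of this page should stop.
 */
static enum sketch_eof sketch_classify(unsigned long long page_pos,
		unsigned int page_size, unsigned long long isize,
		unsigned long long *end_offset)
{
	assert(page_size && page_pos % page_size == 0);

	*end_offset = page_pos + page_size;
	if (*end_offset <= isize)
		return SKETCH_INSIDE;
	if (page_pos >= isize)
		return SKETCH_BEYOND;	/* assumption: nothing to write */

	/* Straddles EOF: stop at i_size; the tail of the page is zeroed. */
	*end_offset = isize;
	return SKETCH_STRADDLES;
}
```

For a 20000-byte file and 16KiB pages, the page at offset 0 is inside EOF,
the page at 16384 straddles it and is written up to offset 20000, and the
page at 32768 is beyond it.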



* [PATCH v4 17/36] iomap: Inline data shouldn't see large pages
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (15 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 16/36] iomap: Support large pages in write paths Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 18/36] iomap: Handle tail pages in iomap_page_mkwrite Matthew Wilcox
                   ` (20 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel, Christoph Hellwig

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Assert that we're not seeing large pages in functions that read/write
inline data, rather than zeroing out the tail.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/buffered-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index b7504b8aa90c..782757258a28 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -221,6 +221,7 @@ iomap_read_inline_data(struct inode *inode, struct page *page,
 		return;
 
 	BUG_ON(page->index);
+	BUG_ON(PageCompound(page));
 	BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
 
 	addr = kmap_atomic(page);
@@ -732,6 +733,7 @@ iomap_write_end_inline(struct inode *inode, struct page *page,
 	void *addr;
 
 	WARN_ON_ONCE(!PageUptodate(page));
+	BUG_ON(PageCompound(page));
 	BUG_ON(pos + copied > PAGE_SIZE - offset_in_page(iomap->inline_data));
 
 	addr = kmap_atomic(page);
-- 
2.26.2



* [PATCH v4 18/36] iomap: Handle tail pages in iomap_page_mkwrite
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (16 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 17/36] iomap: Inline data shouldn't see large pages Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 19/36] xfs: Support large pages Matthew Wilcox
                   ` (19 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

iomap_page_mkwrite() can be called with a tail page.  If we are,
operate on the head page, since we're treating the entire thing as a
single unit and the whole page is dirtied at the same time.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 782757258a28..c9636d55a4be 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1060,7 +1060,7 @@ iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
 
 vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops)
 {
-	struct page *page = vmf->page;
+	struct page *page = compound_head(vmf->page);
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	unsigned long length;
 	loff_t offset;
-- 
2.26.2
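
The head-page redirection in the patch above can be illustrated with a
small userspace model.  This is only a sketch: struct mkwrite_page and
the model_* helpers are hypothetical names, and the kernel's struct page
encodes the head pointer quite differently.

```c
#include <stddef.h>

/* Userspace model only; not the kernel's struct page layout. */
struct mkwrite_page {
	struct mkwrite_page *head;	/* NULL for a head or order-0 page */
	int dirty;
};

/* Analogue of compound_head(): map a tail page to its head page. */
struct mkwrite_page *model_compound_head(struct mkwrite_page *page)
{
	return page->head ? page->head : page;
}

/* Analogue of the fault path: dirty the whole unit through its head. */
struct mkwrite_page *model_mkwrite(struct mkwrite_page *fault_page)
{
	struct mkwrite_page *page = model_compound_head(fault_page);

	page->dirty = 1;
	return page;
}
```

A fault on any tail page ends up marking only the head page dirty,
which is the single place where state for the whole unit is tracked.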



* [PATCH v4 19/36] xfs: Support large pages
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (17 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 18/36] iomap: Handle tail pages in iomap_page_mkwrite Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 20/36] mm: Make prep_transhuge_page return its argument Matthew Wilcox
                   ` (18 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

There is one place in XFS which assumes the size of a page; fix it.
Also set FS_LARGE_PAGES to advertise that XFS supports large pages.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/xfs/xfs_aops.c  | 2 +-
 fs/xfs/xfs_super.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 5b25f5ee84dc..bb677ecbdf32 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -548,7 +548,7 @@ xfs_discard_page(
 	if (error && !XFS_FORCED_SHUTDOWN(mp))
 		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
 out_invalidate:
-	iomap_invalidatepage(page, 0, PAGE_SIZE);
+	iomap_invalidatepage(page, 0, thp_size(page));
 }
 
 static const struct iomap_writeback_ops xfs_writeback_ops = {
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index abf06bf9c3f3..0c7c4afa5afd 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1793,7 +1793,7 @@ static struct file_system_type xfs_fs_type = {
 	.init_fs_context	= xfs_init_fs_context,
 	.parameters		= xfs_fs_parameters,
 	.kill_sb		= kill_block_super,
-	.fs_flags		= FS_REQUIRES_DEV,
+	.fs_flags		= FS_REQUIRES_DEV | FS_LARGE_PAGES,
 };
 MODULE_ALIAS_FS("xfs");
 
-- 
2.26.2



* [PATCH v4 20/36] mm: Make prep_transhuge_page return its argument
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (18 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 19/36] xfs: Support large pages Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 21/36] mm: Add __page_cache_alloc_order Matthew Wilcox
                   ` (17 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel, Kirill A . Shutemov

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

By permitting NULL or order-0 pages as an argument, and returning the
argument, callers can write:

	return prep_transhuge_page(alloc_pages(...));

instead of assigning the result to a temporary variable and conditionally
passing that to prep_transhuge_page().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h | 7 +++++--
 mm/huge_memory.c        | 9 +++++++--
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 1f6245091917..6a8502278f41 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -193,7 +193,7 @@ extern unsigned long thp_get_unmapped_area(struct file *filp,
 		unsigned long addr, unsigned long len, unsigned long pgoff,
 		unsigned long flags);
 
-extern void prep_transhuge_page(struct page *page);
+extern struct page *prep_transhuge_page(struct page *page);
 extern void free_transhuge_page(struct page *page);
 bool is_transparent_hugepage(struct page *page);
 
@@ -358,7 +358,10 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
 	return false;
 }
 
-static inline void prep_transhuge_page(struct page *page) {}
+static inline struct page *prep_transhuge_page(struct page *page)
+{
+	return page;
+}
 
 static inline bool is_transparent_hugepage(struct page *page)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6ecd1045113b..7a5e2b470bc7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -508,15 +508,20 @@ static inline struct deferred_split *get_deferred_split_queue(struct page *page)
 }
 #endif
 
-void prep_transhuge_page(struct page *page)
+struct page *prep_transhuge_page(struct page *page)
 {
+	if (!page || compound_order(page) == 0)
+		return page;
 	/*
-	 * we use page->mapping and page->indexlru in second tail page
+	 * we use page->mapping and page->index in second tail page
 	 * as list_head: assuming THP order >= 2
 	 */
+	BUG_ON(compound_order(page) == 1);
 
 	INIT_LIST_HEAD(page_deferred_list(page));
 	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
+
+	return page;
 }
 
 bool is_transparent_hugepage(struct page *page)
-- 
2.26.2
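
The calling convention described in the commit message above (accept
NULL or an order-0 page, and return the argument) can be sketched in
userspace as follows.  The names prep_page, model_alloc(), model_prep()
and model_alloc_large() are hypothetical stand-ins, not kernel APIs.

```c
#include <stddef.h>
#include <stdlib.h>

struct prep_page {
	unsigned int order;
	int prepared;
};

/* Stand-in for alloc_pages(); returns NULL when asked to fail. */
struct prep_page *model_alloc(unsigned int order, int fail)
{
	struct prep_page *page;

	if (fail)
		return NULL;
	page = calloc(1, sizeof(*page));
	if (page)
		page->order = order;
	return page;
}

/* Tolerates NULL and order-0 pages, and always returns its argument. */
struct prep_page *model_prep(struct prep_page *page)
{
	if (!page || page->order == 0)
		return page;
	page->prepared = 1;
	return page;
}

/* The caller needs no temporary variable and no NULL check. */
struct prep_page *model_alloc_large(unsigned int order, int fail)
{
	return model_prep(model_alloc(order, fail));
}
```

Because the helper passes NULL straight through, an allocation failure
propagates to the caller unchanged.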



* [PATCH v4 21/36] mm: Add __page_cache_alloc_order
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (19 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 20/36] mm: Make prep_transhuge_page return its argument Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 22/36] mm: Allow large pages to be added to the page cache Matthew Wilcox
                   ` (16 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel, Kirill A . Shutemov

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

This new function allows page cache pages to be allocated that are
larger than an order-0 page.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/pagemap.h | 24 +++++++++++++++++++++---
 mm/filemap.c            | 12 ++++++++----
 2 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c75d7fb7ccbc..97f36ea16116 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -239,15 +239,33 @@ static inline int page_cache_add_speculative(struct page *page, int count)
 	return __page_cache_add_speculative(page, count);
 }
 
+static inline gfp_t thp_gfpmask(gfp_t gfp)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	/* We'd rather allocate smaller pages than stall a page fault */
+	gfp |= GFP_TRANSHUGE_LIGHT;
+	gfp &= ~__GFP_DIRECT_RECLAIM;
+#endif
+	return gfp;
+}
+
 #ifdef CONFIG_NUMA
-extern struct page *__page_cache_alloc(gfp_t gfp);
+extern struct page *__page_cache_alloc_order(gfp_t gfp, unsigned int order);
 #else
-static inline struct page *__page_cache_alloc(gfp_t gfp)
+static inline
+struct page *__page_cache_alloc_order(gfp_t gfp, unsigned int order)
 {
-	return alloc_pages(gfp, 0);
+	if (order == 0)
+		return alloc_pages(gfp, 0);
+	return prep_transhuge_page(alloc_pages(thp_gfpmask(gfp), order));
 }
 #endif
 
+static inline struct page *__page_cache_alloc(gfp_t gfp)
+{
+	return __page_cache_alloc_order(gfp, 0);
+}
+
 static inline struct page *page_cache_alloc(struct address_space *x)
 {
 	return __page_cache_alloc(mapping_gfp_mask(x));
diff --git a/mm/filemap.c b/mm/filemap.c
index 23a051a7ef0f..9abba062973a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -941,24 +941,28 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
 
 #ifdef CONFIG_NUMA
-struct page *__page_cache_alloc(gfp_t gfp)
+struct page *__page_cache_alloc_order(gfp_t gfp, unsigned int order)
 {
 	int n;
 	struct page *page;
 
+	if (order > 0)
+		gfp = thp_gfpmask(gfp);
+
 	if (cpuset_do_page_mem_spread()) {
 		unsigned int cpuset_mems_cookie;
 		do {
 			cpuset_mems_cookie = read_mems_allowed_begin();
 			n = cpuset_mem_spread_node();
-			page = __alloc_pages_node(n, gfp, 0);
+			page = __alloc_pages_node(n, gfp, order);
+			prep_transhuge_page(page);
 		} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));
 
 		return page;
 	}
-	return alloc_pages(gfp, 0);
+	return prep_transhuge_page(alloc_pages(gfp, order));
 }
-EXPORT_SYMBOL(__page_cache_alloc);
+EXPORT_SYMBOL(__page_cache_alloc_order);
 #endif
 
 /*
-- 
2.26.2
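
The mask adjustment done by thp_gfpmask() in the patch above can be
modelled with plain bit operations.  This is a sketch only: the bit
values and MODEL_* names are made up for illustration, and
MODEL_GFP_HUGE_LIGHT is merely a stand-in for whatever bits
GFP_TRANSHUGE_LIGHT carries in a given kernel.

```c
typedef unsigned int model_gfp_t;

/* Hypothetical bit values; the kernel's real __GFP_* values differ. */
#define MODEL_GFP_DIRECT_RECLAIM	0x1u
#define MODEL_GFP_KSWAPD_RECLAIM	0x2u
#define MODEL_GFP_COMP			0x4u
#define MODEL_GFP_NOWARN		0x8u
/* Stand-in for the extra bits requested for large allocations. */
#define MODEL_GFP_HUGE_LIGHT	(MODEL_GFP_COMP | MODEL_GFP_NOWARN)

/* Prefer failing quickly over stalling in direct reclaim. */
model_gfp_t model_thp_gfpmask(model_gfp_t gfp)
{
	gfp |= MODEL_GFP_HUGE_LIGHT;
	gfp &= ~MODEL_GFP_DIRECT_RECLAIM;
	return gfp;
}

/* Only allocations above order 0 get the adjusted mask. */
model_gfp_t model_alloc_mask(model_gfp_t gfp, unsigned int order)
{
	return order > 0 ? model_thp_gfpmask(gfp) : gfp;
}
```

Order-0 allocations keep the caller's mask untouched, so their
behaviour does not change.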



* [PATCH v4 22/36] mm: Allow large pages to be added to the page cache
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (20 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 21/36] mm: Add __page_cache_alloc_order Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 23/36] mm: Allow large pages to be removed from " Matthew Wilcox
                   ` (15 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

We return -EEXIST if there are any non-shadow entries in the page
cache in the range covered by the large page.  If there are multiple
shadow entries in the range, we set *shadowp to one of them (currently
the one at the highest index).  If that turns out to be the wrong
answer, we can implement something more complex.  This is mostly
modelled after the equivalent function in the shmem code.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/filemap.c | 44 +++++++++++++++++++++++++++++++-------------
 1 file changed, 31 insertions(+), 13 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 9abba062973a..437484d42b78 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -834,6 +834,7 @@ static int __add_to_page_cache_locked(struct page *page,
 	int huge = PageHuge(page);
 	struct mem_cgroup *memcg;
 	int error;
+	unsigned int nr = 1;
 	void *old;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -845,31 +846,48 @@ static int __add_to_page_cache_locked(struct page *page,
 					      gfp_mask, &memcg, false);
 		if (error)
 			return error;
+		xas_set_order(&xas, offset, thp_order(page));
+		nr = hpage_nr_pages(page);
 	}
 
-	get_page(page);
+	page_ref_add(page, nr);
 	page->mapping = mapping;
 	page->index = offset;
 
 	do {
+		unsigned long exceptional = 0;
+		unsigned int i = 0;
+
 		xas_lock_irq(&xas);
-		old = xas_load(&xas);
-		if (old && !xa_is_value(old))
-			xas_set_err(&xas, -EEXIST);
-		xas_store(&xas, page);
+		xas_for_each_conflict(&xas, old) {
+			if (!xa_is_value(old)) {
+				xas_set_err(&xas, -EEXIST);
+				break;
+			}
+			exceptional++;
+			if (shadowp)
+				*shadowp = old;
+		}
+		xas_create_range(&xas);
 		if (xas_error(&xas))
 			goto unlock;
 
-		if (xa_is_value(old)) {
-			mapping->nrexceptional--;
-			if (shadowp)
-				*shadowp = old;
+next:
+		xas_store(&xas, page);
+		if (++i < nr) {
+			xas_next(&xas);
+			goto next;
 		}
-		mapping->nrpages++;
+		mapping->nrexceptional -= exceptional;
+		mapping->nrpages += nr;
 
 		/* hugetlb pages do not participate in page cache accounting */
-		if (!huge)
-			__inc_node_page_state(page, NR_FILE_PAGES);
+		if (!huge) {
+			__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES,
+						nr);
+			if (nr > 1)
+				__inc_node_page_state(page, NR_FILE_THPS);
+		}
 unlock:
 		xas_unlock_irq(&xas);
 	} while (xas_nomem(&xas, gfp_mask & GFP_RECLAIM_MASK));
@@ -886,7 +904,7 @@ static int __add_to_page_cache_locked(struct page *page,
 	/* Leave page->index set: truncation relies upon it */
 	if (!huge)
 		mem_cgroup_cancel_charge(page, memcg, false);
-	put_page(page);
+	page_ref_sub(page, nr);
 	return xas_error(&xas);
 }
 ALLOW_ERROR_INJECTION(__add_to_page_cache_locked, ERRNO);
-- 
2.26.2
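
The conflict rule from the commit message above (fail with -EEXIST on
any real page in the range, otherwise remember the shadow entry at the
highest index) can be sketched over a flat array.  This is a userspace
model under stated assumptions: struct slot and model_add_large() are
hypothetical, and the array stands in for the XArray.

```c
#include <errno.h>
#include <stddef.h>

enum slot_kind { SLOT_EMPTY, SLOT_SHADOW, SLOT_PAGE };

struct slot {
	enum slot_kind kind;
	unsigned long shadow;	/* only meaningful for SLOT_SHADOW */
};

/*
 * Try to claim slots[index .. index + nr - 1] for one large page.
 * Returns 0 on success, or -EEXIST if a real page is already present,
 * in which case no slot is modified.  On success, *shadowp (if not
 * NULL) holds the shadow found at the highest index and *nr_shadows
 * counts the shadow entries that were replaced.
 */
int model_add_large(struct slot *slots, size_t index, size_t nr,
		    unsigned long *shadowp, size_t *nr_shadows)
{
	size_t i, found = 0;

	for (i = index; i < index + nr; i++) {
		if (slots[i].kind == SLOT_PAGE)
			return -EEXIST;
		if (slots[i].kind == SLOT_SHADOW) {
			found++;
			if (shadowp)
				*shadowp = slots[i].shadow;
		}
	}
	for (i = index; i < index + nr; i++)
		slots[i].kind = SLOT_PAGE;
	*nr_shadows = found;
	return 0;
}
```

The shadow count plays the role of the nrexceptional adjustment: every
replaced shadow entry has to be subtracted from the accounting.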



* [PATCH v4 23/36] mm: Allow large pages to be removed from the page cache
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (21 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 22/36] mm: Allow large pages to be added to the page cache Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 24/36] mm: Remove page fault assumption of compound page size Matthew Wilcox
                   ` (14 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

page_cache_free_page() assumes compound pages are PMD_SIZE; fix
that assumption.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/filemap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 437484d42b78..9c760dd7208e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -248,7 +248,7 @@ static void page_cache_free_page(struct address_space *mapping,
 		freepage(page);
 
 	if (PageTransHuge(page) && !PageHuge(page)) {
-		page_ref_sub(page, HPAGE_PMD_NR);
+		page_ref_sub(page, hpage_nr_pages(page));
 		VM_BUG_ON_PAGE(page_count(page) <= 0, page);
 	} else {
 		put_page(page);
-- 
2.26.2



* [PATCH v4 24/36] mm: Remove page fault assumption of compound page size
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (22 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 23/36] mm: Allow large pages to be removed from " Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-25  4:59   ` Kirill A. Shutemov
  2020-05-15 13:16 ` [PATCH v4 25/36] mm: Fix total_mapcount assumption of " Matthew Wilcox
                   ` (13 subsequent siblings)
  37 siblings, 1 reply; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

A compound page in the page cache will not necessarily be of PMD size,
so check explicitly.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/memory.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f703fe8c8346..d68ce428ddd2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3549,13 +3549,14 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 	pmd_t entry;
 	int i;
-	vm_fault_t ret;
+	vm_fault_t ret = VM_FAULT_FALLBACK;
 
 	if (!transhuge_vma_suitable(vma, haddr))
-		return VM_FAULT_FALLBACK;
+		return ret;
 
-	ret = VM_FAULT_FALLBACK;
 	page = compound_head(page);
+	if (compound_order(page) != HPAGE_PMD_ORDER)
+		return ret;
 
 	/*
 	 * Archs like ppc64 need additonal space to store information
-- 
2.26.2
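
The early-exit logic in the patch above reduces to two checks, both of
which fall back to mapping with base pages.  A minimal userspace
sketch, assuming a PMD order of 9 (as on x86-64 with 4KiB base pages);
the MODEL_* names are hypothetical:

```c
#include <stdbool.h>

/* Assumed value: 9 matches x86-64 with 4KiB pages and 2MiB PMDs. */
#define MODEL_PMD_ORDER 9u

enum model_fault { MODEL_FAULT_MAPPED, MODEL_FAULT_FALLBACK };

/* Map with a PMD only if the VMA allows it and the page is PMD-sized. */
enum model_fault model_set_pmd(bool vma_suitable, unsigned int order)
{
	if (!vma_suitable)
		return MODEL_FAULT_FALLBACK;
	if (order != MODEL_PMD_ORDER)
		return MODEL_FAULT_FALLBACK;
	return MODEL_FAULT_MAPPED;
}
```

A compound page of any other order is still valid in the page cache; it
simply cannot be mapped by a single PMD entry.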



* [PATCH v4 25/36] mm: Fix total_mapcount assumption of page size
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (23 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 24/36] mm: Remove page fault assumption of compound page size Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 26/36] mm: Avoid splitting large pages Matthew Wilcox
                   ` (12 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Kirill A. Shutemov, linux-mm, linux-kernel, Matthew Wilcox

From: "Kirill A. Shutemov" <kirill@shutemov.name>

File THPs may now be of arbitrary order, so use compound_nr() instead
of HPAGE_PMD_NR when walking the subpages.

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/huge_memory.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7a5e2b470bc7..15a86b06befc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2668,7 +2668,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 
 int total_mapcount(struct page *page)
 {
-	int i, compound, ret;
+	int i, compound, nr, ret;
 
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
@@ -2676,16 +2676,17 @@ int total_mapcount(struct page *page)
 		return atomic_read(&page->_mapcount) + 1;
 
 	compound = compound_mapcount(page);
+	nr = compound_nr(page);
 	if (PageHuge(page))
 		return compound;
 	ret = compound;
-	for (i = 0; i < HPAGE_PMD_NR; i++)
+	for (i = 0; i < nr; i++)
 		ret += atomic_read(&page[i]._mapcount) + 1;
 	/* File pages has compound_mapcount included in _mapcount */
 	if (!PageAnon(page))
-		return ret - compound * HPAGE_PMD_NR;
+		return ret - compound * nr;
 	if (PageDoubleMap(page))
-		ret -= HPAGE_PMD_NR;
+		ret -= nr;
 	return ret;
 }
 
-- 
2.26.2
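
The arithmetic in total_mapcount() can be checked with a userspace
model.  This is a sketch, not the kernel function: struct model_thp is
hypothetical, the hugetlb case is left out, and the per-subpage array
holds real counts (the kernel stores them biased by minus one).

```c
#include <stdbool.h>
#include <stddef.h>

struct model_thp {
	size_t nr;			/* number of subpages, 1 << order */
	int compound_mapcount;		/* mappings of the page as a whole */
	const int *subpage_mapcount;	/* per-subpage counts, unbiased */
	bool anon;
	bool double_map;
};

int model_total_mapcount(const struct model_thp *thp)
{
	int ret = thp->compound_mapcount;
	size_t i;

	for (i = 0; i < thp->nr; i++)
		ret += thp->subpage_mapcount[i];
	/* File pages already fold the compound count into each subpage. */
	if (!thp->anon)
		return ret - thp->compound_mapcount * (int)thp->nr;
	/* A double-mapped anon page carries one extra count per subpage. */
	if (thp->double_map)
		ret -= (int)thp->nr;
	return ret;
}
```

For a four-subpage file page mapped once as a whole and once more
through a single PTE, the per-subpage counts are {1, 1, 2, 1} and the
function returns 1 + 5 - 4 = 2.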



* [PATCH v4 26/36] mm: Avoid splitting large pages
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (24 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 25/36] mm: Fix total_mapcount assumption of " Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 27/36] mm: Fix truncation for pages of arbitrary size Matthew Wilcox
                   ` (11 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

If the filesystem supports large pages, then do not split them before
removing them from the page cache; remove them as a unit.
---
 mm/vmscan.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b06868fc4926..51e6c135575d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1271,9 +1271,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				/* Adding to swap updated mapping */
 				mapping = page_mapping(page);
 			}
-		} else if (unlikely(PageTransHuge(page))) {
+		} else if (PageTransHuge(page)) {
 			/* Split file THP */
-			if (split_huge_page_to_list(page, page_list))
+			if (!mapping_large_pages(mapping) &&
+			    split_huge_page_to_list(page, page_list))
 				goto keep_locked;
 		}
 
-- 
2.26.2



* [PATCH v4 27/36] mm: Fix truncation for pages of arbitrary size
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (25 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 26/36] mm: Avoid splitting large pages Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 28/36] mm: Support storing shadow entries for large pages Matthew Wilcox
                   ` (10 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Remove the assumption that a compound page is HPAGE_PMD_SIZE,
and the assumption that any page is PAGE_SIZE.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/truncate.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index dd9ebc1da356..dad384a4dc6d 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -168,7 +168,7 @@ void do_invalidatepage(struct page *page, unsigned int offset,
  * becomes orphaned.  It will be left on the LRU and may even be mapped into
  * user pagetables if we're racing with filemap_fault().
  *
- * We need to bale out if page->mapping is no longer equal to the original
+ * We need to bail out if page->mapping is no longer equal to the original
  * mapping.  This happens a) when the VM reclaimed the page while we waited on
  * its lock, b) when a concurrent invalidate_mapping_pages got there first and
  * c) when tmpfs swizzles a page between a tmpfs inode and swapper_space.
@@ -177,12 +177,12 @@ static void
 truncate_cleanup_page(struct address_space *mapping, struct page *page)
 {
 	if (page_mapped(page)) {
-		pgoff_t nr = PageTransHuge(page) ? HPAGE_PMD_NR : 1;
+		unsigned int nr = hpage_nr_pages(page);
 		unmap_mapping_pages(mapping, page->index, nr, false);
 	}
 
 	if (page_has_private(page))
-		do_invalidatepage(page, 0, PAGE_SIZE);
+		do_invalidatepage(page, 0, thp_size(page));
 
 	/*
 	 * Some filesystems seem to re-dirty the page even after
-- 
2.26.2
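
Both quantities that the patch above stops hard-coding follow from the
page's order.  A small sketch, assuming a 4KiB base page and that the
number of subpages and the byte size are 1 << order and
PAGE_SIZE << order respectively; struct model_range and
model_truncate_range() are hypothetical names:

```c
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096ul	/* assumed base page size */

struct model_range {
	unsigned long first_index;	/* first page cache index to unmap */
	unsigned long nr;		/* number of base pages to unmap */
	size_t bytes;			/* length passed to invalidation */
};

/* Describe what truncating a page of the given order has to cover. */
struct model_range model_truncate_range(unsigned long index,
					unsigned int order)
{
	struct model_range r = {
		.first_index = index,
		.nr = 1ul << order,
		.bytes = MODEL_PAGE_SIZE << order,
	};

	return r;
}
```

An order-0 page yields the old constants (1 page, 4096 bytes), so the
common case is unchanged.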



* [PATCH v4 28/36] mm: Support storing shadow entries for large pages
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (26 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 27/36] mm: Fix truncation for pages of arbitrary size Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 29/36] mm: Support retrieving tail pages from the page cache Matthew Wilcox
                   ` (9 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

If the page is being replaced with a NULL, we can do a single large store,
but for now we have to use a loop to store one shadow entry in each slot
covered by the page.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/filemap.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 9c760dd7208e..0ec7f25a07b2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -120,22 +120,27 @@ static void page_cache_delete(struct address_space *mapping,
 				   struct page *page, void *shadow)
 {
 	XA_STATE(xas, &mapping->i_pages, page->index);
-	unsigned int nr = 1;
+	unsigned int i, nr = 1, entries = 1;
 
 	mapping_set_update(&xas, mapping);
 
 	/* hugetlb pages are represented by a single entry in the xarray */
 	if (!PageHuge(page)) {
-		xas_set_order(&xas, page->index, compound_order(page));
-		nr = compound_nr(page);
+		entries = nr = hpage_nr_pages(page);
+		if (!shadow) {
+			xas_set_order(&xas, page->index, thp_order(page));
+			entries = 1;
+		}
 	}
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageTail(page), page);
-	VM_BUG_ON_PAGE(nr != 1 && shadow, page);
 
-	xas_store(&xas, shadow);
-	xas_init_marks(&xas);
+	for (i = 0; i < entries; i++) {
+		xas_store(&xas, shadow);
+		xas_init_marks(&xas);
+		xas_next(&xas);
+	}
 
 	page->mapping = NULL;
 	/* Leave page->index set: truncation lookup relies upon it */
-- 
2.26.2
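
The two deletion strategies described above can be contrasted in a
userspace model that counts store operations.  This is only a sketch:
struct model_cache and its helpers are hypothetical, and a flat array
stands in for the XArray.

```c
#include <stddef.h>

struct model_cache {
	unsigned long slots[16];	/* 0 means empty */
	unsigned int stores;		/* number of store operations issued */
};

/* One operation covering nr slots, as a multi-index store would. */
void model_store_range(struct model_cache *c, size_t index, size_t nr,
		       unsigned long value)
{
	size_t i;

	for (i = 0; i < nr; i++)
		c->slots[index + i] = value;
	c->stores++;
}

/* Remove a page of nr slots, optionally leaving shadow entries behind. */
void model_delete(struct model_cache *c, size_t index, size_t nr,
		  unsigned long shadow)
{
	size_t i;

	if (!shadow) {
		/* Replacing with NULL: a single large store suffices. */
		model_store_range(c, index, nr, 0);
		return;
	}
	/* Shadow entries: one store per slot. */
	for (i = 0; i < nr; i++)
		model_store_range(c, index + i, 1, shadow);
}
```

Deleting a four-slot page costs one store without a shadow and four
stores with one.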



* [PATCH v4 29/36] mm: Support retrieving tail pages from the page cache
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (27 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 28/36] mm: Support storing shadow entries for large pages Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 30/36] mm: Support tail pages in wait_for_stable_page Matthew Wilcox
                   ` (8 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

page->index is not meaningful for tail pages; we have to use
page_to_index() in this assertion.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/filemap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 0ec7f25a07b2..56eb086acef8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1655,7 +1655,7 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
 			put_page(page);
 			goto repeat;
 		}
-		VM_BUG_ON_PAGE(page->index != index, page);
+		VM_BUG_ON_PAGE(page_to_index(page) != index, page);
 	}
 
 	if (fgp_flags & FGP_ACCESSED)
-- 
2.26.2
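
Why the assertion needs page_to_index() can be seen in a small model in
which only the head page carries a valid index.  This is a sketch under
that assumption; struct idx_page and model_page_to_index() are
hypothetical names, not the kernel's definitions.

```c
#include <stddef.h>

struct idx_page {
	struct idx_page *head;	/* NULL for a head or order-0 page */
	unsigned long index;	/* only valid on the head page */
};

/* Derive a tail page's index from the head index plus its offset. */
unsigned long model_page_to_index(const struct idx_page *page)
{
	const struct idx_page *head = page->head ? page->head : page;

	return head->index + (unsigned long)(page - head);
}
```

Comparing the tail page's own index field against the lookup index
would trip the assertion even though the correct page was found.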



* [PATCH v4 30/36] mm: Support tail pages in wait_for_stable_page
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (28 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 29/36] mm: Support retrieving tail pages from the page cache Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 31/36] mm: Add DEFINE_READAHEAD Matthew Wilcox
                   ` (7 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

page->mapping is undefined for tail pages, so operate exclusively on
the head page.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/page-writeback.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7326b54ab728..e2da7d7e93b8 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2841,6 +2841,7 @@ EXPORT_SYMBOL_GPL(wait_on_page_writeback);
  */
 void wait_for_stable_page(struct page *page)
 {
+	page = compound_head(page);
 	if (bdi_cap_stable_pages_required(inode_to_bdi(page->mapping->host)))
 		wait_on_page_writeback(page);
 }
-- 
2.26.2



* [PATCH v4 31/36] mm: Add DEFINE_READAHEAD
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (29 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 30/36] mm: Support tail pages in wait_for_stable_page Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 32/36] mm: Make page_cache_readahead_unbounded take a readahead_control Matthew Wilcox
                   ` (6 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Allow for a more concise definition of a struct readahead_control.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagemap.h | 7 +++++++
 mm/readahead.c          | 6 +-----
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 97f36ea16116..29ca36acdfd7 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -718,6 +718,13 @@ struct readahead_control {
 	unsigned int _batch_count;
 };
 
+#define DEFINE_READAHEAD(rac, f, m, i)					\
+	struct readahead_control rac = {				\
+		.file = f,						\
+		.mapping = m,						\
+		._index = i,						\
+	}
+
 /**
  * readahead_page - Get the next page to read.
  * @rac: The current readahead request.
diff --git a/mm/readahead.c b/mm/readahead.c
index 3c9a8dd7c56c..2126a2754e22 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -179,11 +179,7 @@ void page_cache_readahead_unbounded(struct address_space *mapping,
 {
 	LIST_HEAD(page_pool);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
-	struct readahead_control rac = {
-		.mapping = mapping,
-		.file = file,
-		._index = index,
-	};
+	DEFINE_READAHEAD(rac, file, mapping, index);
 	unsigned long i;
 
 	/*
-- 
2.26.2
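
The macro pattern introduced above can be tried out in userspace.  This
is a sketch with hypothetical stand-in types (model_file,
model_mapping, model_readahead); only the designated-initializer idiom
is taken from the patch.

```c
#include <stddef.h>

struct model_file { int unused; };
struct model_mapping { int unused; };

struct model_readahead {
	struct model_file *file;
	struct model_mapping *mapping;
	unsigned long index;
	unsigned int nr_pages;	/* not named by the macro, so zeroed */
};

/* Declare and initialise a readahead control in one statement. */
#define MODEL_DEFINE_READAHEAD(name, f, m, i)	\
	struct model_readahead name = {		\
		.file = (f),			\
		.mapping = (m),			\
		.index = (i),			\
	}
```

Members that the initializer does not name are zero-initialised.  Note
also that the index expression is evaluated where the macro is used;
later changes to the caller's variable are not reflected in the
structure.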



* [PATCH v4 32/36] mm: Make page_cache_readahead_unbounded take a readahead_control
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (30 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 31/36] mm: Add DEFINE_READAHEAD Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 33/36] mm: Make __do_page_cache_readahead " Matthew Wilcox
                   ` (5 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Define the readahead_control in the callers instead of in
page_cache_readahead_unbounded().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/ext4/verity.c        |  4 ++--
 fs/f2fs/verity.c        |  4 ++--
 include/linux/pagemap.h |  5 ++---
 mm/readahead.c          | 26 ++++++++++++--------------
 4 files changed, 18 insertions(+), 21 deletions(-)

diff --git a/fs/ext4/verity.c b/fs/ext4/verity.c
index dec1244dd062..fe2e541543da 100644
--- a/fs/ext4/verity.c
+++ b/fs/ext4/verity.c
@@ -346,6 +346,7 @@ static struct page *ext4_read_merkle_tree_page(struct inode *inode,
 					       pgoff_t index,
 					       unsigned long num_ra_pages)
 {
+	DEFINE_READAHEAD(rac, NULL, inode->i_mapping, index);
 	struct page *page;
 
 	index += ext4_verity_metadata_pos(inode) >> PAGE_SHIFT;
@@ -355,8 +356,7 @@ static struct page *ext4_read_merkle_tree_page(struct inode *inode,
 		if (page)
 			put_page(page);
 		else if (num_ra_pages > 1)
-			page_cache_readahead_unbounded(inode->i_mapping, NULL,
-					index, num_ra_pages, 0);
+			page_cache_readahead_unbounded(&rac, num_ra_pages, 0);
 		page = read_mapping_page(inode->i_mapping, index, NULL);
 	}
 	return page;
diff --git a/fs/f2fs/verity.c b/fs/f2fs/verity.c
index 865c9fb774fb..707a94745472 100644
--- a/fs/f2fs/verity.c
+++ b/fs/f2fs/verity.c
@@ -226,6 +226,7 @@ static struct page *f2fs_read_merkle_tree_page(struct inode *inode,
 					       pgoff_t index,
 					       unsigned long num_ra_pages)
 {
+	DEFINE_READAHEAD(rac, NULL, inode->i_mapping, index);
 	struct page *page;
 
 	index += f2fs_verity_metadata_pos(inode) >> PAGE_SHIFT;
@@ -235,8 +236,7 @@ static struct page *f2fs_read_merkle_tree_page(struct inode *inode,
 		if (page)
 			put_page(page);
 		else if (num_ra_pages > 1)
-			page_cache_readahead_unbounded(inode->i_mapping, NULL,
-					index, num_ra_pages, 0);
+			page_cache_readahead_unbounded(&rac, num_ra_pages, 0);
 		page = read_mapping_page(inode->i_mapping, index, NULL);
 	}
 	return page;
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 29ca36acdfd7..c694a052751f 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -674,9 +674,8 @@ void page_cache_sync_readahead(struct address_space *, struct file_ra_state *,
 void page_cache_async_readahead(struct address_space *, struct file_ra_state *,
 		struct file *, struct page *, pgoff_t index,
 		unsigned long req_count);
-void page_cache_readahead_unbounded(struct address_space *, struct file *,
-		pgoff_t index, unsigned long nr_to_read,
-		unsigned long lookahead_count);
+void page_cache_readahead_unbounded(struct readahead_control *,
+		unsigned long nr_to_read, unsigned long lookahead_count);
 
 /*
  * Like add_to_page_cache_locked, but used to add newly allocated pages:
diff --git a/mm/readahead.c b/mm/readahead.c
index 2126a2754e22..62da2d4beed1 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -159,9 +159,7 @@ static void read_pages(struct readahead_control *rac, struct list_head *pages,
 
 /**
  * page_cache_readahead_unbounded - Start unchecked readahead.
- * @mapping: File address space.
- * @file: This instance of the open file; used for authentication.
- * @index: First page index to read.
+ * @rac: Readahead control.
  * @nr_to_read: The number of pages to read.
  * @lookahead_size: Where to start the next readahead.
  *
@@ -173,13 +171,13 @@ static void read_pages(struct readahead_control *rac, struct list_head *pages,
  * Context: File is referenced by caller.  Mutexes may be held by caller.
  * May sleep, but will not reenter filesystem to reclaim memory.
  */
-void page_cache_readahead_unbounded(struct address_space *mapping,
-		struct file *file, pgoff_t index, unsigned long nr_to_read,
-		unsigned long lookahead_size)
+void page_cache_readahead_unbounded(struct readahead_control *rac,
+		unsigned long nr_to_read, unsigned long lookahead_size)
 {
+	struct address_space *mapping = rac->mapping;
+	unsigned long index = readahead_index(rac);
 	LIST_HEAD(page_pool);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
-	DEFINE_READAHEAD(rac, file, mapping, index);
 	unsigned long i;
 
 	/*
@@ -200,7 +198,7 @@ void page_cache_readahead_unbounded(struct address_space *mapping,
 	for (i = 0; i < nr_to_read; i++) {
 		struct page *page = xa_load(&mapping->i_pages, index + i);
 
-		BUG_ON(index + i != rac._index + rac._nr_pages);
+		BUG_ON(index + i != rac->_index + rac->_nr_pages);
 
 		if (page && !xa_is_value(page)) {
 			/*
@@ -211,7 +209,7 @@ void page_cache_readahead_unbounded(struct address_space *mapping,
 			 * have a stable reference to this page, and it's
 			 * not worth getting one just for that.
 			 */
-			read_pages(&rac, &page_pool, true);
+			read_pages(rac, &page_pool, true);
 			continue;
 		}
 
@@ -224,12 +222,12 @@ void page_cache_readahead_unbounded(struct address_space *mapping,
 		} else if (add_to_page_cache_lru(page, mapping, index + i,
 					gfp_mask) < 0) {
 			put_page(page);
-			read_pages(&rac, &page_pool, true);
+			read_pages(rac, &page_pool, true);
 			continue;
 		}
 		if (i == nr_to_read - lookahead_size)
 			SetPageReadahead(page);
-		rac._nr_pages++;
+		rac->_nr_pages++;
 	}
 
 	/*
@@ -237,7 +235,7 @@ void page_cache_readahead_unbounded(struct address_space *mapping,
 	 * uptodate then the caller will launch readpage again, and
 	 * will then handle the error.
 	 */
-	read_pages(&rac, &page_pool, false);
+	read_pages(rac, &page_pool, false);
 	memalloc_nofs_restore(nofs);
 }
 EXPORT_SYMBOL_GPL(page_cache_readahead_unbounded);
@@ -252,6 +250,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
 		struct file *file, pgoff_t index, unsigned long nr_to_read,
 		unsigned long lookahead_size)
 {
+	DEFINE_READAHEAD(rac, file, mapping, index);
 	struct inode *inode = mapping->host;
 	loff_t isize = i_size_read(inode);
 	pgoff_t end_index;	/* The last page we want to read */
@@ -266,8 +265,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
 	if (nr_to_read > end_index - index)
 		nr_to_read = end_index - index + 1;
 
-	page_cache_readahead_unbounded(mapping, file, index, nr_to_read,
-			lookahead_size);
+	page_cache_readahead_unbounded(&rac, nr_to_read, lookahead_size);
 }
 
 /*
-- 
2.26.2



* [PATCH v4 33/36] mm: Make __do_page_cache_readahead take a readahead_control
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (31 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 32/36] mm: Make page_cache_readahead_unbounded take a readahead_control Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 34/36] mm: Allow PageReadahead to be set on head pages Matthew Wilcox
                   ` (4 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Also call __do_page_cache_readahead() directly from ondemand_readahead()
instead of indirecting via ra_submit().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/internal.h  | 11 +++++------
 mm/readahead.c | 26 ++++++++++++++------------
 2 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 5efb13d5c226..fd3eaff7acdc 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -51,18 +51,17 @@ void unmap_page_range(struct mmu_gather *tlb,
 
 void force_page_cache_readahead(struct address_space *, struct file *,
 		pgoff_t index, unsigned long nr_to_read);
-void __do_page_cache_readahead(struct address_space *, struct file *,
-		pgoff_t index, unsigned long nr_to_read,
-		unsigned long lookahead_size);
+void __do_page_cache_readahead(struct readahead_control *,
+		unsigned long nr_to_read, unsigned long lookahead_size);
 
 /*
  * Submit IO for the read-ahead request in file_ra_state.
  */
 static inline void ra_submit(struct file_ra_state *ra,
-		struct address_space *mapping, struct file *filp)
+		struct address_space *mapping, struct file *file)
 {
-	__do_page_cache_readahead(mapping, filp,
-			ra->start, ra->size, ra->async_size);
+	DEFINE_READAHEAD(rac, file, mapping, ra->start);
+	__do_page_cache_readahead(&rac, ra->size, ra->async_size);
 }
 
 /**
diff --git a/mm/readahead.c b/mm/readahead.c
index 62da2d4beed1..74c7e1eff540 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -246,12 +246,11 @@ EXPORT_SYMBOL_GPL(page_cache_readahead_unbounded);
  * behaviour which would occur if page allocations are causing VM writeback.
  * We really don't want to intermingle reads and writes like that.
  */
-void __do_page_cache_readahead(struct address_space *mapping,
-		struct file *file, pgoff_t index, unsigned long nr_to_read,
-		unsigned long lookahead_size)
+void __do_page_cache_readahead(struct readahead_control *rac,
+		unsigned long nr_to_read, unsigned long lookahead_size)
 {
-	DEFINE_READAHEAD(rac, file, mapping, index);
-	struct inode *inode = mapping->host;
+	struct inode *inode = rac->mapping->host;
+	unsigned long index = readahead_index(rac);
 	loff_t isize = i_size_read(inode);
 	pgoff_t end_index;	/* The last page we want to read */
 
@@ -265,7 +264,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
 	if (nr_to_read > end_index - index)
 		nr_to_read = end_index - index + 1;
 
-	page_cache_readahead_unbounded(&rac, nr_to_read, lookahead_size);
+	page_cache_readahead_unbounded(rac, nr_to_read, lookahead_size);
 }
 
 /*
@@ -273,10 +272,11 @@ void __do_page_cache_readahead(struct address_space *mapping,
  * memory at once.
  */
 void force_page_cache_readahead(struct address_space *mapping,
-		struct file *filp, pgoff_t index, unsigned long nr_to_read)
+		struct file *file, pgoff_t index, unsigned long nr_to_read)
 {
+	DEFINE_READAHEAD(rac, file, mapping, index);
 	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
-	struct file_ra_state *ra = &filp->f_ra;
+	struct file_ra_state *ra = &file->f_ra;
 	unsigned long max_pages;
 
 	if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages &&
@@ -294,7 +294,7 @@ void force_page_cache_readahead(struct address_space *mapping,
 
 		if (this_chunk > nr_to_read)
 			this_chunk = nr_to_read;
-		__do_page_cache_readahead(mapping, filp, index, this_chunk, 0);
+		__do_page_cache_readahead(&rac, this_chunk, 0);
 
 		index += this_chunk;
 		nr_to_read -= this_chunk;
@@ -432,10 +432,11 @@ static int try_context_readahead(struct address_space *mapping,
  * A minimal readahead algorithm for trivial sequential/random reads.
  */
 static void ondemand_readahead(struct address_space *mapping,
-		struct file_ra_state *ra, struct file *filp,
+		struct file_ra_state *ra, struct file *file,
 		bool hit_readahead_marker, pgoff_t index,
 		unsigned long req_size)
 {
+	DEFINE_READAHEAD(rac, file, mapping, index);
 	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
 	unsigned long max_pages = ra->ra_pages;
 	unsigned long add_pages;
@@ -516,7 +517,7 @@ static void ondemand_readahead(struct address_space *mapping,
 	 * standalone, small random read
 	 * Read as is, and do not pollute the readahead state.
 	 */
-	__do_page_cache_readahead(mapping, filp, index, req_size, 0);
+	__do_page_cache_readahead(&rac, req_size, 0);
 	return;
 
 initial_readahead:
@@ -542,7 +543,8 @@ static void ondemand_readahead(struct address_space *mapping,
 		}
 	}
 
-	ra_submit(ra, mapping, filp);
+	rac._index = ra->start;
+	__do_page_cache_readahead(&rac, ra->size, ra->async_size);
 }
 
 /**
-- 
2.26.2



* [PATCH v4 34/36] mm: Allow PageReadahead to be set on head pages
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (32 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 33/36] mm: Make __do_page_cache_readahead " Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 35/36] mm: Add large page readahead Matthew Wilcox
                   ` (3 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Adjust the callers to only call PageReadahead on the head page.
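
As a rough user-space model of the caller-side pattern (not kernel code:
the sketch_* names and the single flag bit are invented for illustration),
the flag lives only on the head page and callers resolve a possible tail
page to its head before testing it:

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct page; only what the sketch needs. */
struct sketch_page {
	unsigned long flags;
	struct sketch_page *head;	/* set on tail pages, unset otherwise */
};

#define SKETCH_PG_READAHEAD	(1UL << 0)	/* assumed bit position */

/* Model of compound_head(): a tail page resolves to its head page. */
static struct sketch_page *sketch_compound_head(struct sketch_page *page)
{
	return page->head ? page->head : page;
}

/* Callers test the flag on the head page, never on a tail page. */
static bool sketch_page_readahead(struct sketch_page *page)
{
	return !!(sketch_compound_head(page)->flags & SKETCH_PG_READAHEAD);
}
```

In this model, a lookup that returns a tail page still observes the
readahead marker set on the head.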

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/page-flags.h |  4 ++--
 mm/filemap.c               | 10 +++++-----
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 979460df4768..a3110d675cd0 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -377,8 +377,8 @@ PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)
 /* PG_readahead is only used for reads; PG_reclaim is only for writes */
 PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL)
 	TESTCLEARFLAG(Reclaim, reclaim, PF_NO_TAIL)
-PAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
-	TESTCLEARFLAG(Readahead, reclaim, PF_NO_COMPOUND)
+PAGEFLAG(Readahead, reclaim, PF_ONLY_HEAD)
+	TESTCLEARFLAG(Readahead, reclaim, PF_ONLY_HEAD)
 
 #ifdef CONFIG_HIGHMEM
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index 56eb086acef8..f3f03705c025 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2067,7 +2067,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 			if (unlikely(page == NULL))
 				goto no_cached_page;
 		}
-		if (PageReadahead(page)) {
+		if (PageReadahead(compound_head(page))) {
 			page_cache_async_readahead(mapping,
 					ra, filp, page,
 					index, last_index - index);
@@ -2454,7 +2454,7 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
 		return fpin;
 	if (ra->mmap_miss > 0)
 		ra->mmap_miss--;
-	if (PageReadahead(page)) {
+	if (PageReadahead(compound_head(page))) {
 		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 		page_cache_async_readahead(mapping, ra, file,
 					   page, offset, ra->ra_pages);
@@ -2640,11 +2640,11 @@ void filemap_map_pages(struct vm_fault *vmf,
 		/* Has the page moved or been split? */
 		if (unlikely(page != xas_reload(&xas)))
 			goto skip;
+		if (PageReadahead(page))
+			goto skip;
 		page = find_subpage(page, xas.xa_index);
 
-		if (!PageUptodate(page) ||
-				PageReadahead(page) ||
-				PageHWPoison(page))
+		if (!PageUptodate(page) || PageHWPoison(page))
 			goto skip;
 		if (!trylock_page(page))
 			goto skip;
-- 
2.26.2



* [PATCH v4 35/36] mm: Add large page readahead
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (33 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 34/36] mm: Allow PageReadahead to be set on head pages Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-15 13:16 ` [PATCH v4 36/36] mm: Align THP mappings for non-DAX Matthew Wilcox
                   ` (2 subsequent siblings)
  37 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

If the filesystem supports large pages, allocate larger pages in the
readahead code when it seems worth doing.  The heuristic for choosing
larger page sizes will surely need some tuning, but this aggressive
ramp-up seems good for testing.
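
For anyone who wants to play with the ramp-up outside the kernel, the
order selection can be modelled in user space roughly as below.  This is
only a sketch: SKETCH_PMD_ORDER assumes order 9 (2MiB PMD with 4KiB
pages), next_ra_order() is a made-up name, and the "order > 0" guard is
an addition of the sketch so that a zero-sized window cannot underflow.

```c
/* Assumed for illustration: PMD order with 4KiB base pages and 2MiB PMDs. */
#define SKETCH_PMD_ORDER	9u

/*
 * Model of the order ramp-up: starting from the order of the page that
 * carried the readahead marker, grow by two orders, cap at PMD order,
 * then shrink until a single page fits inside the readahead window
 * (ra_size, counted in base pages).
 */
static unsigned int next_ra_order(unsigned int order, unsigned long ra_size)
{
	if (order < SKETCH_PMD_ORDER) {
		order += 2;
		if (order > SKETCH_PMD_ORDER)
			order = SKETCH_PMD_ORDER;
		/* The order > 0 guard is not in the patch; see above. */
		while (order > 0 && (1UL << order) > ra_size)
			order--;
	}
	return order;
}
```

With a 32-page window, repeated calls go 0, 2, 4, 5 and then stay at 5,
since an order-5 page already fills the whole window.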

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/readahead.c | 93 ++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 87 insertions(+), 6 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 74c7e1eff540..ac16e96a8828 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -149,7 +149,7 @@ static void read_pages(struct readahead_control *rac, struct list_head *pages,
 
 	blk_finish_plug(&plug);
 
-	BUG_ON(!list_empty(pages));
+	BUG_ON(pages && !list_empty(pages));
 	BUG_ON(readahead_count(rac));
 
 out:
@@ -428,13 +428,92 @@ static int try_context_readahead(struct address_space *mapping,
 	return 1;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline int ra_alloc_page(struct readahead_control *rac, pgoff_t index,
+		pgoff_t mark, unsigned int order, gfp_t gfp)
+{
+	int err;
+	struct page *page = __page_cache_alloc_order(gfp, order);
+
+	if (!page)
+		return -ENOMEM;
+	if (mark - index < (1UL << order))
+		SetPageReadahead(page);
+	err = add_to_page_cache_lru(page, rac->mapping, index, gfp);
+	if (err)
+		put_page(page);
+	else
+		rac->_nr_pages += 1UL << order;
+	return err;
+}
+
+static bool page_cache_readahead_order(struct readahead_control *rac,
+		struct file_ra_state *ra, unsigned int order)
+{
+	struct address_space *mapping = rac->mapping;
+	unsigned int old_order = order;
+	pgoff_t index = readahead_index(rac);
+	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
+	pgoff_t mark = index + ra->size - ra->async_size;
+	int err = 0;
+	gfp_t gfp = readahead_gfp_mask(mapping);
+
+	if (!mapping_large_pages(mapping))
+		return false;
+
+	limit = min(limit, index + ra->size - 1);
+
+	/* Grow page size up to PMD size */
+	if (order < HPAGE_PMD_ORDER) {
+		order += 2;
+		if (order > HPAGE_PMD_ORDER)
+			order = HPAGE_PMD_ORDER;
+		while ((1 << order) > ra->size)
+			order--;
+	}
+
+	/* If size is somehow misaligned, fill with order-0 pages */
+	while (!err && index & ((1UL << old_order) - 1))
+		err = ra_alloc_page(rac, index++, mark, 0, gfp);
+
+	while (!err && index & ((1UL << order) - 1)) {
+		err = ra_alloc_page(rac, index, mark, old_order, gfp);
+		index += 1UL << old_order;
+	}
+
+	while (!err && index <= limit) {
+		err = ra_alloc_page(rac, index, mark, order, gfp);
+		index += 1UL << order;
+	}
+
+	if (index > limit) {
+		ra->size += index - limit - 1;
+		ra->async_size += index - limit - 1;
+	}
+
+	read_pages(rac, NULL, false);
+
+	/*
+	 * If there were already pages in the page cache, then we may have
+	 * left some gaps.  Let the regular readahead code take care of this
+	 * situation.
+	 */
+	return !err;
+}
+#else
+static bool page_cache_readahead_order(struct readahead_control *rac,
+		struct file_ra_state *ra, unsigned int order)
+{
+	return false;
+}
+#endif
+
 /*
  * A minimal readahead algorithm for trivial sequential/random reads.
  */
 static void ondemand_readahead(struct address_space *mapping,
 		struct file_ra_state *ra, struct file *file,
-		bool hit_readahead_marker, pgoff_t index,
-		unsigned long req_size)
+		struct page *page, pgoff_t index, unsigned long req_size)
 {
 	DEFINE_READAHEAD(rac, file, mapping, index);
 	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
@@ -473,7 +552,7 @@ static void ondemand_readahead(struct address_space *mapping,
 	 * Query the pagecache for async_size, which normally equals to
 	 * readahead size. Ramp it up and use it as the new readahead size.
 	 */
-	if (hit_readahead_marker) {
+	if (page) {
 		pgoff_t start;
 
 		rcu_read_lock();
@@ -544,6 +623,8 @@ static void ondemand_readahead(struct address_space *mapping,
 	}
 
 	rac._index = ra->start;
+	if (page && page_cache_readahead_order(&rac, ra, compound_order(page)))
+		return;
 	__do_page_cache_readahead(&rac, ra->size, ra->async_size);
 }
 
@@ -578,7 +659,7 @@ void page_cache_sync_readahead(struct address_space *mapping,
 	}
 
 	/* do read-ahead */
-	ondemand_readahead(mapping, ra, filp, false, index, req_count);
+	ondemand_readahead(mapping, ra, filp, NULL, index, req_count);
 }
 EXPORT_SYMBOL_GPL(page_cache_sync_readahead);
 
@@ -624,7 +705,7 @@ page_cache_async_readahead(struct address_space *mapping,
 		return;
 
 	/* do read-ahead */
-	ondemand_readahead(mapping, ra, filp, true, index, req_count);
+	ondemand_readahead(mapping, ra, filp, page, index, req_count);
 }
 EXPORT_SYMBOL_GPL(page_cache_async_readahead);
 
-- 
2.26.2



* [PATCH v4 36/36] mm: Align THP mappings for non-DAX
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (34 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 35/36] mm: Add large page readahead Matthew Wilcox
@ 2020-05-15 13:16 ` Matthew Wilcox
  2020-05-26 22:05   ` William Kucharski
  2020-05-21 22:49 ` [PATCH v4 00/36] Large pages in the page cache Dave Chinner
  2020-05-28 11:00 ` William Kucharski
  37 siblings, 1 reply; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-15 13:16 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: William Kucharski, linux-mm, linux-kernel, Matthew Wilcox

From: William Kucharski <william.kucharski@oracle.com>

When we have the opportunity to use transparent huge pages to map a
file, we want to follow the same rules as DAX.
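
The offset arithmetic involved can be tried out in user space.  What
follows is a sketch only: SKETCH_PMD_SIZE assumes a 2MiB PMD, the
function names are invented, and the overflow checks done on the padded
length in the real code are left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PMD_SIZE	((uint64_t)1 << 21)	/* assumed 2MiB PMD */

/*
 * Does [off, off + len) contain at least one naturally aligned,
 * PMD-sized chunk of the file?  If not, padding the mapping is pointless.
 */
static bool thp_range_eligible(uint64_t off, uint64_t len)
{
	uint64_t off_end = off + len;
	uint64_t off_align = (off + SKETCH_PMD_SIZE - 1) & ~(SKETCH_PMD_SIZE - 1);

	return !(off_end <= off_align || off_end - off_align < SKETCH_PMD_SIZE);
}

/*
 * Given the start of an area found for the padded length, move it
 * forward by less than one PMD so that the address and the file offset
 * are congruent modulo the PMD size.
 */
static uint64_t thp_align_addr(uint64_t ret, uint64_t off)
{
	return ret + ((off - ret) & (SKETCH_PMD_SIZE - 1));
}
```

Because the area was found with one PMD of padding, shifting the start
by up to PMD size minus one byte still leaves room for the requested
length.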

Signed-off-by: William Kucharski <william.kucharski@oracle.com>
[Inline __thp_get_unmapped_area() into thp_get_unmapped_area()]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/huge_memory.c | 40 +++++++++++++---------------------------
 1 file changed, 13 insertions(+), 27 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 15a86b06befc..e78686b628ae 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -535,30 +535,30 @@ bool is_transparent_hugepage(struct page *page)
 }
 EXPORT_SYMBOL_GPL(is_transparent_hugepage);
 
-static unsigned long __thp_get_unmapped_area(struct file *filp,
-		unsigned long addr, unsigned long len,
-		loff_t off, unsigned long flags, unsigned long size)
+unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
+		unsigned long len, unsigned long pgoff, unsigned long flags)
 {
+	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
 	loff_t off_end = off + len;
-	loff_t off_align = round_up(off, size);
+	loff_t off_align = round_up(off, PMD_SIZE);
 	unsigned long len_pad, ret;
 
-	if (off_end <= off_align || (off_end - off_align) < size)
-		return 0;
+	if (off_end <= off_align || (off_end - off_align) < PMD_SIZE)
+		goto regular;
 
-	len_pad = len + size;
+	len_pad = len + PMD_SIZE;
 	if (len_pad < len || (off + len_pad) < off)
-		return 0;
+		goto regular;
 
 	ret = current->mm->get_unmapped_area(filp, addr, len_pad,
 					      off >> PAGE_SHIFT, flags);
 
 	/*
-	 * The failure might be due to length padding. The caller will retry
-	 * without the padding.
+	 * The failure might be due to length padding.  Retry without
+	 * the padding.
 	 */
 	if (IS_ERR_VALUE(ret))
-		return 0;
+		goto regular;
 
 	/*
 	 * Do not try to align to THP boundary if allocation at the address
@@ -567,23 +567,9 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
 	if (ret == addr)
 		return addr;
 
-	ret += (off - ret) & (size - 1);
+	ret += (off - ret) & (PMD_SIZE - 1);
 	return ret;
-}
-
-unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
-		unsigned long len, unsigned long pgoff, unsigned long flags)
-{
-	unsigned long ret;
-	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
-
-	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
-		goto out;
-
-	ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PMD_SIZE);
-	if (ret)
-		return ret;
-out:
+regular:
 	return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
 }
 EXPORT_SYMBOL_GPL(thp_get_unmapped_area);
-- 
2.26.2



* Re: [PATCH v4 04/36] mm: Introduce thp_size
  2020-05-15 13:16 ` [PATCH v4 04/36] mm: Introduce thp_size Matthew Wilcox
@ 2020-05-15 13:38   ` David Hildenbrand
  0 siblings, 0 replies; 59+ messages in thread
From: David Hildenbrand @ 2020-05-15 13:38 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel; +Cc: linux-mm, linux-kernel

On 15.05.20 15:16, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> This is like page_size(), but compiles down to just PAGE_SIZE if THP
> are disabled.  Convert the users of hpage_nr_pages() which would prefer
> this interface.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  drivers/nvdimm/btt.c    | 4 +---
>  drivers/nvdimm/pmem.c   | 6 ++----
>  include/linux/huge_mm.h | 7 +++++++
>  mm/internal.h           | 2 +-
>  mm/page_io.c            | 2 +-
>  mm/page_vma_mapped.c    | 4 ++--
>  6 files changed, 14 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
> index 3b09419218d6..78e8d972d45a 100644
> --- a/drivers/nvdimm/btt.c
> +++ b/drivers/nvdimm/btt.c
> @@ -1488,10 +1488,8 @@ static int btt_rw_page(struct block_device *bdev, sector_t sector,
>  {
>  	struct btt *btt = bdev->bd_disk->private_data;
>  	int rc;
> -	unsigned int len;
>  
> -	len = hpage_nr_pages(page) * PAGE_SIZE;
> -	rc = btt_do_bvec(btt, NULL, page, len, 0, op, sector);
> +	rc = btt_do_bvec(btt, NULL, page, thp_size(page), 0, op, sector);
>  	if (rc == 0)
>  		page_endio(page, op_is_write(op), 0);
>  
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 2df6994acf83..d511504d07af 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -235,11 +235,9 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
>  	blk_status_t rc;
>  
>  	if (op_is_write(op))
> -		rc = pmem_do_write(pmem, page, 0, sector,
> -				   hpage_nr_pages(page) * PAGE_SIZE);
> +		rc = pmem_do_write(pmem, page, 0, sector, thp_size(page));
>  	else
> -		rc = pmem_do_read(pmem, page, 0, sector,
> -				   hpage_nr_pages(page) * PAGE_SIZE);
> +		rc = pmem_do_read(pmem, page, 0, sector, thp_size(page));
>  	/*
>  	 * The ->rw_page interface is subtle and tricky.  The core
>  	 * retries on any error, so we can only invoke page_endio() in
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 6bec4b5b61e1..e944f9757349 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -271,6 +271,11 @@ static inline int hpage_nr_pages(struct page *page)
>  	return compound_nr(page);
>  }
>  
> +static inline unsigned long thp_size(struct page *page)
> +{
> +	return page_size(page);
> +}
> +
>  struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
>  		pmd_t *pmd, int flags, struct dev_pagemap **pgmap);
>  struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
> @@ -329,6 +334,8 @@ static inline int hpage_nr_pages(struct page *page)
>  	return 1;
>  }
>  
> +#define thp_size(x)		PAGE_SIZE
> +
>  static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
>  {
>  	return false;
> diff --git a/mm/internal.h b/mm/internal.h
> index f762a34b0c57..5efb13d5c226 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -386,7 +386,7 @@ vma_address(struct page *page, struct vm_area_struct *vma)
>  	unsigned long start, end;
>  
>  	start = __vma_address(page, vma);
> -	end = start + PAGE_SIZE * (hpage_nr_pages(page) - 1);
> +	end = start + thp_size(page) - PAGE_SIZE;
>  
>  	/* page should be within @vma mapping range */
>  	VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma);
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 76965be1d40e..dd935129e3cb 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -41,7 +41,7 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
>  		bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
>  		bio->bi_end_io = end_io;
>  
> -		bio_add_page(bio, page, PAGE_SIZE * hpage_nr_pages(page), 0);
> +		bio_add_page(bio, page, thp_size(page), 0);
>  	}
>  	return bio;
>  }
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index 719c35246cfa..e65629c056e8 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -227,7 +227,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  			if (pvmw->address >= pvmw->vma->vm_end ||
>  			    pvmw->address >=
>  					__vma_address(pvmw->page, pvmw->vma) +
> -					hpage_nr_pages(pvmw->page) * PAGE_SIZE)
> +					thp_size(pvmw->page))
>  				return not_found(pvmw);
>  			/* Did we cross page table boundary? */
>  			if (pvmw->address % PMD_SIZE == 0) {
> @@ -268,7 +268,7 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
>  	unsigned long start, end;
>  
>  	start = __vma_address(page, vma);
> -	end = start + PAGE_SIZE * (hpage_nr_pages(page) - 1);
> +	end = start + thp_size(page) - PAGE_SIZE;
>  
>  	if (unlikely(end < vma->vm_start || start >= vma->vm_end))
>  		return 0;
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



* Re: [PATCH v4 06/36] mm: Introduce offset_in_thp
  2020-05-15 13:16 ` [PATCH v4 06/36] mm: Introduce offset_in_thp Matthew Wilcox
@ 2020-05-15 13:39   ` David Hildenbrand
  2020-05-22 17:15   ` Kirill A. Shutemov
  1 sibling, 0 replies; 59+ messages in thread
From: David Hildenbrand @ 2020-05-15 13:39 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel; +Cc: linux-mm, linux-kernel

On 15.05.20 15:16, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Mirroring offset_in_page(), this gives you the offset within this
> particular page, no matter what size page it is.  It optimises down
> to offset_in_page() if CONFIG_TRANSPARENT_HUGEPAGE is not set.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/mm.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 088acbda722d..9a55dce6a535 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1577,6 +1577,7 @@ static inline void clear_page_pfmemalloc(struct page *page)
>  extern void pagefault_out_of_memory(void);
>  
>  #define offset_in_page(p)	((unsigned long)(p) & ~PAGE_MASK)
> +#define offset_in_thp(page, p)	((unsigned long)(p) & (thp_size(page) - 1))
>  
>  /*
>   * Flags passed to show_mem() and show_free_areas() to suppress output in
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



* Re: [PATCH v4 07/36] fs: Add a filesystem flag for large pages
  2020-05-15 13:16 ` [PATCH v4 07/36] fs: Add a filesystem flag for large pages Matthew Wilcox
@ 2020-05-21 21:55   ` Dave Chinner
  2020-05-21 23:29     ` Matthew Wilcox
  0 siblings, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2020-05-21 21:55 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Fri, May 15, 2020 at 06:16:27AM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> The page cache needs to know whether the filesystem supports pages >
> PAGE_SIZE.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/fs.h      | 1 +
>  include/linux/pagemap.h | 5 +++++
>  2 files changed, 6 insertions(+)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 55c743925c40..777783c8760b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2241,6 +2241,7 @@ struct file_system_type {
>  #define FS_HAS_SUBTYPE		4
>  #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
>  #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
> +#define FS_LARGE_PAGES		8192	/* Remove once all fs converted */
>  #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
>  	int (*init_fs_context)(struct fs_context *);
>  	const struct fs_parameter_spec *parameters;
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 36bfc9d855bb..c6db74b5e62f 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -116,6 +116,11 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
>  	m->gfp_mask = mask;
>  }
>  
> +static inline bool mapping_large_pages(struct address_space *mapping)
> +{
> +	return mapping->host->i_sb->s_type->fs_flags & FS_LARGE_PAGES;
> +}

If you've got to dereference 4 layers deep to check a behaviour
flag, the object needs its own flag.  Can you just propagate this
to the address space when the inode is instantiated and the address
space initialised?
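
Something along these lines, as a user-space sketch only.  The sketch_*
types and SKETCH_AS_LARGE_PAGES are made up for illustration and are not
existing kernel names; only the FS_LARGE_PAGES value comes from the
quoted patch.

```c
#include <stdbool.h>

#define FS_LARGE_PAGES		8192		/* value from the quoted patch */
#define SKETCH_AS_LARGE_PAGES	(1UL << 0)	/* hypothetical per-mapping flag */

/* Hypothetical, heavily trimmed stand-ins for the kernel structures. */
struct sketch_fs_type { int fs_flags; };
struct sketch_super_block { const struct sketch_fs_type *s_type; };
struct sketch_inode;
struct sketch_address_space {
	struct sketch_inode *host;
	unsigned long flags;
};
struct sketch_inode {
	struct sketch_super_block *i_sb;
	struct sketch_address_space i_data;
};

/* Copy the behaviour flag once, when the inode and mapping are set up. */
static void sketch_inode_init(struct sketch_inode *inode,
		struct sketch_super_block *sb)
{
	inode->i_sb = sb;
	inode->i_data.host = inode;
	inode->i_data.flags = 0;
	if (sb->s_type->fs_flags & FS_LARGE_PAGES)
		inode->i_data.flags |= SKETCH_AS_LARGE_PAGES;
}

/* The hot-path check then touches only the mapping itself. */
static inline bool sketch_mapping_large_pages(
		const struct sketch_address_space *mapping)
{
	return !!(mapping->flags & SKETCH_AS_LARGE_PAGES);
}
```

The check becomes a single load from the address space instead of a
chain of four pointer dereferences.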

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v4 10/36] fs: Make page_mkwrite_check_truncate thp-aware
  2020-05-15 13:16 ` [PATCH v4 10/36] fs: Make page_mkwrite_check_truncate thp-aware Matthew Wilcox
@ 2020-05-21 22:01   ` Dave Chinner
  2020-05-21 23:30     ` Matthew Wilcox
  0 siblings, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2020-05-21 22:01 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Fri, May 15, 2020 at 06:16:30AM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> If the page is compound, check the last index in the page and return
> the appropriate size.  Change the return type to ssize_t in case we ever
> support pages larger than 2GB.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/pagemap.h | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 1a0bb387948c..c75d7fb7ccbc 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -827,22 +827,22 @@ static inline unsigned long dir_pages(struct inode *inode)
>   * @page: the page to check
>   * @inode: the inode to check the page against
>   *
> - * Returns the number of bytes in the page up to EOF,
> + * Return: The number of bytes in the page up to EOF,
>   * or -EFAULT if the page was truncated.
>   */
> -static inline int page_mkwrite_check_truncate(struct page *page,
> +static inline ssize_t page_mkwrite_check_truncate(struct page *page,
>  					      struct inode *inode)
>  {
>  	loff_t size = i_size_read(inode);
>  	pgoff_t index = size >> PAGE_SHIFT;
> -	int offset = offset_in_page(size);
> +	unsigned long offset = offset_in_thp(page, size);
>  
>  	if (page->mapping != inode->i_mapping)
>  		return -EFAULT;
>  
>  	/* page is wholly inside EOF */
> -	if (page->index < index)
> -		return PAGE_SIZE;
> +	if (page->index + hpage_nr_pages(page) - 1 < index)
> +		return thp_size(page);

Can we make these interfaces all use the same namespace prefix?
Here we have a mix of thp and hpage and I have no clue how hpages
are different to thps. If they refer to the same thing (i.e. huge
pages) then can we please make the API consistent before splattering
it all over the filesystem code?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v4 14/36] iomap: Support large pages in iomap_adjust_read_range
  2020-05-15 13:16 ` [PATCH v4 14/36] iomap: Support large pages in iomap_adjust_read_range Matthew Wilcox
@ 2020-05-21 22:24   ` Dave Chinner
  2020-05-21 23:39     ` Matthew Wilcox
  0 siblings, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2020-05-21 22:24 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Fri, May 15, 2020 at 06:16:34AM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Pass the struct page instead of the iomap_page so we can determine the
> size of the page.  Use offset_in_thp() instead of offset_in_page() and use
> thp_size() instead of PAGE_SIZE.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  fs/iomap/buffered-io.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 4a79061073eb..423ffc9d4a97 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -83,15 +83,16 @@ iomap_page_release(struct page *page)
>   * Calculate the range inside the page that we actually need to read.
>   */
>  static void
> -iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
> +iomap_adjust_read_range(struct inode *inode, struct page *page,
>  		loff_t *pos, loff_t length, unsigned *offp, unsigned *lenp)
>  {
> +	struct iomap_page *iop = to_iomap_page(page);
>  	loff_t orig_pos = *pos;
>  	loff_t isize = i_size_read(inode);
>  	unsigned block_bits = inode->i_blkbits;
>  	unsigned block_size = (1 << block_bits);
> -	unsigned poff = offset_in_page(*pos);
> -	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
> +	unsigned poff = offset_in_thp(page, *pos);
> +	unsigned plen = min_t(loff_t, thp_size(page) - poff, length);
>  	unsigned first = poff >> block_bits;
>  	unsigned last = (poff + plen - 1) >> block_bits;
>  
> @@ -129,7 +130,7 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
>  	 * page cache for blocks that are entirely outside of i_size.
>  	 */
>  	if (orig_pos <= isize && orig_pos + length > isize) {
> -		unsigned end = offset_in_page(isize - 1) >> block_bits;
> +		unsigned end = offset_in_thp(page, isize - 1) >> block_bits;
>  
>  		if (first <= end && last > end)
>  			plen -= (last - end) * block_size;
> @@ -256,7 +257,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  	}
>  
>  	/* zero post-eof blocks as the page may be mapped */
> -	iomap_adjust_read_range(inode, iop, &pos, length, &poff, &plen);
> +	iomap_adjust_read_range(inode, page, &pos, length, &poff, &plen);
>  	if (plen == 0)
>  		goto done;
>  
> @@ -571,7 +572,6 @@ static int
>  __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
>  		struct page *page, struct iomap *srcmap)
>  {
> -	struct iomap_page *iop = iomap_page_create(inode, page);
>  	loff_t block_size = i_blocksize(inode);
>  	loff_t block_start = pos & ~(block_size - 1);
>  	loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
> @@ -580,9 +580,10 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
>  
>  	if (PageUptodate(page))
>  		return 0;
> +	iomap_page_create(inode, page);

What problem does this fix? i.e. if we can get here with an
uninitialised page, why isn't this a separate bug fix. I don't see
anything in this patch that actually changes behaviour, and there's
nothing in the commit description to tell me why this is here,
so... ???

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 00/36] Large pages in the page cache
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (35 preceding siblings ...)
  2020-05-15 13:16 ` [PATCH v4 36/36] mm: Align THP mappings for non-DAX Matthew Wilcox
@ 2020-05-21 22:49 ` Dave Chinner
  2020-05-22  0:04   ` Matthew Wilcox
  2020-05-28 11:00 ` William Kucharski
  37 siblings, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2020-05-21 22:49 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Fri, May 15, 2020 at 06:16:20AM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> This patch set does not pass xfstests.  Test at your own risk.  It is
> based on the readahead rewrite which is in Andrew's tree.  I've fixed a
> lot of issues in the last two weeks, but generic/013 will still crash it.
> 
> The primary idea here is that a large part of the overhead in dealing
> with individual pages is that there's just so darned many of them.
> We would be better off dealing with fewer, larger pages, even if they
> don't get to be the size necessary for the CPU to use a larger TLB entry.

Ok, so the main issue I have with the filesystem/iomap side of
things is that it appears to be adding "transparent huge page"
awareness to the filesystem code, not "large page support".

For people that aren't aware of the difference between the
transparent huge page and a normal compound page (e.g. I have no idea
what the difference is), this is likely to cause problems,
especially as you haven't explained at all in this description why
transparent huge pages are being used rather than bog standard
compound pages.

And, really, why should iomap or the filesystems care if the large
page is a THP or just a high order compound page? The interface
for operating on these things at the page cache level should be the
same. We already have page_size() and friends for operating on
high order compound pages, yet the iomap stuff has this new
thp_size() function instead of just using page_size(). This is going
to lead to confusion and future bugs when people who don't know the
difference use the wrong page size function in their filesystem
code.

So, really, the "large page" API presented to the filesystems via
the page cache needs to be unified. Having to use compound_*() in
some places, thp_* in others, then page_* and Page*, not to mention
hpage_* just so that we can correctly support "large pages" is a
total non-starter.

Hence I'd suggest that this patch set needs to start by "hiding" all
the differences between different types of pages behind a unified,
consistent API, then it can introduce large page support into code
outside the mm/ infrastructure via that unified API. I don't care
what that API looks like so long as it is clear, consistent, well
documented and means filesystem developers don't need to know
anything about how the page (large or not) is managed by the mm
subsystem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 07/36] fs: Add a filesystem flag for large pages
  2020-05-21 21:55   ` Dave Chinner
@ 2020-05-21 23:29     ` Matthew Wilcox
  0 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-21 23:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Fri, May 22, 2020 at 07:55:23AM +1000, Dave Chinner wrote:
> If you've got to dereference 4 layers deep to check a behaviour
> flag, the object needs its own flag.  Can you just propagate this
> to the address space when the inode is instantiated and the address
> space initialised?

Sure.  I'll fold in something like this:

+++ b/fs/inode.c
@@ -181,6 +181,8 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
        mapping->a_ops = &empty_aops;
        mapping->host = inode;
        mapping->flags = 0;
+       if (sb->s_type->fs_flags & FS_LARGE_PAGES)
+               __set_bit(AS_LARGE_PAGES, &mapping->flags);
        mapping->wb_err = 0;
        atomic_set(&mapping->i_mmap_writable, 0);
 #ifdef CONFIG_READ_ONLY_THP_FOR_FS
+++ b/include/linux/pagemap.h
@@ -29,6 +29,7 @@ enum mapping_flags {
        AS_EXITING      = 4,    /* final truncate in progress */
        /* writeback related tags are not used */
        AS_NO_WRITEBACK_TAGS = 5,
+       AS_LARGE_PAGES = 6,     /* large pages supported */
 };
 
 /**
@@ -118,7 +119,7 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 
 static inline bool mapping_large_pages(struct address_space *mapping)
 {
-       return mapping->host->i_sb->s_type->fs_flags & FS_LARGE_PAGES;
+       return test_bit(AS_LARGE_PAGES, &mapping->flags);
 }
 
 static inline int filemap_nr_thps(struct address_space *mapping)
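
The idea in the fragment above, copying a per-filesystem-type flag into the mapping once at inode initialisation so that the hot-path check is a single bit test, can be illustrated with a small userspace model. This is only a sketch: every `model_` / `MODEL_` name is hypothetical and stands in for the kernel structures, and plain bit operations replace `__set_bit()` / `test_bit()`.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for FS_LARGE_PAGES and AS_LARGE_PAGES. */
#define MODEL_FS_LARGE_PAGES (1u << 0)
enum { MODEL_AS_LARGE_PAGES = 6 };

struct model_fs_type { unsigned fs_flags; };
struct model_super   { const struct model_fs_type *s_type; };
struct model_mapping { unsigned long flags; };

/* Done once per inode: propagate the fs-type flag into the mapping. */
static void model_mapping_init(struct model_mapping *mapping,
			       const struct model_super *sb)
{
	mapping->flags = 0;
	if (sb->s_type->fs_flags & MODEL_FS_LARGE_PAGES)
		mapping->flags |= 1UL << MODEL_AS_LARGE_PAGES;
}

/* Hot path: one bit test on the mapping, no pointer chasing. */
static bool model_mapping_large_pages(const struct model_mapping *mapping)
{
	return mapping->flags & (1UL << MODEL_AS_LARGE_PAGES);
}
```

The trade-off is one extra branch at inode setup in exchange for not walking mapping->host->i_sb->s_type on every check.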
^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 10/36] fs: Make page_mkwrite_check_truncate thp-aware
  2020-05-21 22:01   ` Dave Chinner
@ 2020-05-21 23:30     ` Matthew Wilcox
  0 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-21 23:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-mm, linux-kernel, Kirill A. Shutemov

On Fri, May 22, 2020 at 08:01:39AM +1000, Dave Chinner wrote:
> On Fri, May 15, 2020 at 06:16:30AM -0700, Matthew Wilcox wrote:
> >  	if (page->mapping != inode->i_mapping)
> >  		return -EFAULT;
> >  
> >  	/* page is wholly inside EOF */
> > -	if (page->index < index)
> > -		return PAGE_SIZE;
> > +	if (page->index + hpage_nr_pages(page) - 1 < index)
> > +		return thp_size(page);
> 
> Can we make these interfaces all use the same namespace prefix?
> Here we have a mix of thp and hpage and I have no clue how hpages
> are different to thps. If they refer to the same thing (i.e. huge
> pages) then can we please make the API consistent before splattering
> it all over the filesystem code?

Yes, they're the same thing.  I'll rename hpage_nr_pages() to thp_count().
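
For readers following the hunk quoted above, here is a userspace model of the large-page-aware truncate check. It is a sketch under stated assumptions, not the kernel function: `struct model_page` and all `model_` names are hypothetical, the mapping check is omitted, the head index is assumed to be aligned to the page's own size, and the two cases after "wholly inside EOF" are assumed to carry over unchanged from the existing single-page version because the quoted diff does not show them.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  ((uint64_t)1 << MODEL_PAGE_SHIFT)

/* Hypothetical stand-in for a head page: its index and its order. */
struct model_page {
	uint64_t index;   /* index of the head page, in base pages */
	unsigned order;   /* 0 for a base page, >0 for a large page */
};

/* Number of base pages covered (what the thread calls thp_count()). */
static uint64_t model_thp_count(const struct model_page *page)
{
	return (uint64_t)1 << page->order;
}

/* Number of bytes covered (what the thread calls thp_size()). */
static uint64_t model_thp_size(const struct model_page *page)
{
	return MODEL_PAGE_SIZE << page->order;
}

/* Bytes of the page that lie before EOF, or -EFAULT if none do. */
static int64_t model_check_truncate(const struct model_page *page,
				    uint64_t isize)
{
	uint64_t index = isize >> MODEL_PAGE_SHIFT;
	uint64_t offset = isize & (model_thp_size(page) - 1);

	/* Assumption: the head index is aligned to the page's size. */
	assert((page->index & (model_thp_count(page) - 1)) == 0);

	/* page is wholly inside EOF */
	if (page->index + model_thp_count(page) - 1 < index)
		return (int64_t)model_thp_size(page);
	/* page is wholly past EOF */
	if (page->index > index || !offset)
		return -EFAULT;
	/* page is partially inside EOF */
	return (int64_t)offset;
}
```

With order 0 the model reduces to the original single-page comparison against `PAGE_SIZE`.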

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 14/36] iomap: Support large pages in iomap_adjust_read_range
  2020-05-21 22:24   ` Dave Chinner
@ 2020-05-21 23:39     ` Matthew Wilcox
  0 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-21 23:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Fri, May 22, 2020 at 08:24:38AM +1000, Dave Chinner wrote:
> > @@ -571,7 +572,6 @@ static int
> >  __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
> >  		struct page *page, struct iomap *srcmap)
> >  {
> > -	struct iomap_page *iop = iomap_page_create(inode, page);
> >  	loff_t block_size = i_blocksize(inode);
> >  	loff_t block_start = pos & ~(block_size - 1);
> >  	loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
> > @@ -580,9 +580,10 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
> >  
> >  	if (PageUptodate(page))
> >  		return 0;
> > +	iomap_page_create(inode, page);
> 
> What problem does this fix? i.e. if we can get here with an
> uninitialised page, why isn't this a separate bug fix. I don't see
> anything in this patch that actually changes behaviour, and there's
> nothing in the commit description to tell me why this is here,
> so... ???

I'm not fixing anything ... just moving the call to iomap_page_create()
from the opening stanza to down here because we no longer need a struct
iomap_page pointer in this function.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 00/36] Large pages in the page cache
  2020-05-21 22:49 ` [PATCH v4 00/36] Large pages in the page cache Dave Chinner
@ 2020-05-22  0:04   ` Matthew Wilcox
  2020-05-22  2:57     ` Dave Chinner
  0 siblings, 1 reply; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-22  0:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Fri, May 22, 2020 at 08:49:06AM +1000, Dave Chinner wrote:
> Ok, so the main issue I have with the filesystem/iomap side of
> things is that it appears to be adding "transparent huge page"
> awareness to the filesystem code, not "large page support".
> 
> For people that aren't aware of the difference between the
> transparent huge page and a normal compound page (e.g. I have no idea
> what the difference is), this is likely to cause problems,
> especially as you haven't explained at all in this description why
> transparent huge pages are being used rather than bog standard
> compound pages.

The primary reason to use a different name from compound_*
is so that it can be compiled out for systems that don't enable
CONFIG_TRANSPARENT_HUGEPAGE.  So THPs are compound pages, as they always
have been, but for a filesystem, using thp_size() will compile to either
page_size() or PAGE_SIZE depending on CONFIG_TRANSPARENT_HUGEPAGE.

Now, maybe thp_size() is the wrong name, but then you need to suggest
a better name ;-)

> And, really, why should iomap or the filesystems care if the large
> page is a THP or just a high order compound page? The interface
> for operating on these things at the page cache level should be the
> same. We already have page_size() and friends for operating on
> high order compound pages, yet the iomap stuff has this new
> thp_size() function instead of just using page_size(). This is going
> to lead to confusion and future bugs when people who don't know the
> difference use the wrong page size function in their filesystem
> code.

There is no wrong function to use -- just one that expands to more code
in the case where CONFIG_TRANSPARENT_HUGEPAGE is disabled.
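
The "compiles to either page_size() or PAGE_SIZE" point can be pictured with a small userspace model. This is a sketch only: `MODEL_CONFIG_THP` is a hypothetical macro standing in for CONFIG_TRANSPARENT_HUGEPAGE, and the `model_` functions are illustrative, not the kernel definitions.

```c
#define MODEL_PAGE_SIZE 4096UL

/* Hypothetical stand-in for struct page; only the order matters here. */
struct model_page {
	unsigned order;
};

/* Always inspects the page: models page_size() on a compound page. */
static inline unsigned long model_page_size(const struct model_page *page)
{
	return MODEL_PAGE_SIZE << page->order;
}

#ifdef MODEL_CONFIG_THP
/* Large pages configured in: same result as model_page_size(). */
static inline unsigned long model_thp_size(const struct model_page *page)
{
	return model_page_size(page);
}
#else
/* Large pages configured out: every page is a base page, so the
 * answer is a compile-time constant and callers' loops fold away. */
static inline unsigned long model_thp_size(const struct model_page *page)
{
	(void)page;
	return MODEL_PAGE_SIZE;
}
#endif
```

Either function gives the right answer for any page the page cache can hold in a given configuration; the difference is only how much code the compiler has to keep.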

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 00/36] Large pages in the page cache
  2020-05-22  0:04   ` Matthew Wilcox
@ 2020-05-22  2:57     ` Dave Chinner
  2020-05-22  3:05       ` Matthew Wilcox
  0 siblings, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2020-05-22  2:57 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Thu, May 21, 2020 at 05:04:11PM -0700, Matthew Wilcox wrote:
> On Fri, May 22, 2020 at 08:49:06AM +1000, Dave Chinner wrote:
> > Ok, so the main issue I have with the filesystem/iomap side of
> > things is that it appears to be adding "transparent huge page"
> > awareness to the filesystem code, not "large page support".
> > 
> > For people that aren't aware of the difference between the
> > transparent huge page and a normal compound page (e.g. I have no idea
> > what the difference is), this is likely to cause problems,
> > especially as you haven't explained at all in this description why
> > transparent huge pages are being used rather than bog standard
> > compound pages.
> 
> The primary reason to use a different name from compound_*
> is so that it can be compiled out for systems that don't enable
> CONFIG_TRANSPARENT_HUGEPAGE.  So THPs are compound pages, as they always
> have been, but for a filesystem, using thp_size() will compile to either
> page_size() or PAGE_SIZE depending on CONFIG_TRANSPARENT_HUGEPAGE.

Again, why is this dependent on THP? We can allocate compound pages
without using THP, so why only allow the page cache to use larger
pages when THP is configured?

i.e. I don't know why this is dependent on THP because you haven't
explained why this only works for THP and not just plain old
compound pages....

> Now, maybe thp_size() is the wrong name, but then you need to suggest
> a better name ;-)

First you need to explain why THP is a requirement for large pages in
the page cache when most of the code changes I see only care if the
page is a compound page or not....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 00/36] Large pages in the page cache
  2020-05-22  2:57     ` Dave Chinner
@ 2020-05-22  3:05       ` Matthew Wilcox
  2020-05-25 23:07         ` Dave Chinner
  0 siblings, 1 reply; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-22  3:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Fri, May 22, 2020 at 12:57:51PM +1000, Dave Chinner wrote:
> On Thu, May 21, 2020 at 05:04:11PM -0700, Matthew Wilcox wrote:
> > On Fri, May 22, 2020 at 08:49:06AM +1000, Dave Chinner wrote:
> > > Ok, so the main issue I have with the filesystem/iomap side of
> > > things is that it appears to be adding "transparent huge page"
> > > awareness to the filesystem code, not "large page support".
> > > 
> > > For people that aren't aware of the difference between the
> > > transparent huge page and a normal compound page (e.g. I have no idea
> > > what the difference is), this is likely to cause problems,
> > > especially as you haven't explained at all in this description why
> > > transparent huge pages are being used rather than bog standard
> > > compound pages.
> > 
> > The primary reason to use a different name from compound_*
> > is so that it can be compiled out for systems that don't enable
> > CONFIG_TRANSPARENT_HUGEPAGE.  So THPs are compound pages, as they always
> > have been, but for a filesystem, using thp_size() will compile to either
> > page_size() or PAGE_SIZE depending on CONFIG_TRANSPARENT_HUGEPAGE.
> 
> Again, why is this dependent on THP? We can allocate compound pages
> without using THP, so why only allow the page cache to use larger
> pages when THP is configured?

We have too many CONFIG options.  My brain can't cope with adding
CONFIG_LARGE_PAGES because then we might have neither THP nor LP, LP and
not THP, THP and not LP or both THP and LP.  And of course HUGETLBFS,
which has its own special set of issues that one has to think about when
dealing with the page cache.

So, either large pages become part of the base kernel and you
always get them, or there's a CONFIG option to enable them and it's
CONFIG_TRANSPARENT_HUGEPAGE.  I chose the latter.

I suppose what I'm saying is that a transparent hugepage can now be any
size [1], not just PMD size.

[1] power of two that isn't 1 because we use the third page for
something-or-other.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 06/36] mm: Introduce offset_in_thp
  2020-05-15 13:16 ` [PATCH v4 06/36] mm: Introduce offset_in_thp Matthew Wilcox
  2020-05-15 13:39   ` David Hildenbrand
@ 2020-05-22 17:15   ` Kirill A. Shutemov
  2020-05-29 12:59     ` Matthew Wilcox
  1 sibling, 1 reply; 59+ messages in thread
From: Kirill A. Shutemov @ 2020-05-22 17:15 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Fri, May 15, 2020 at 06:16:26AM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Mirroring offset_in_page(), this gives you the offset within this
> particular page, no matter what size page it is.  It optimises down
> to offset_in_page() if CONFIG_TRANSPARENT_HUGEPAGE is not set.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/mm.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 088acbda722d..9a55dce6a535 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1577,6 +1577,7 @@ static inline void clear_page_pfmemalloc(struct page *page)
>  extern void pagefault_out_of_memory(void);
>  
>  #define offset_in_page(p)	((unsigned long)(p) & ~PAGE_MASK)
> +#define offset_in_thp(page, p)	((unsigned long)(p) & (thp_size(page) - 1))

Looks like thp_mask() would be handy here.
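
A minimal userspace sketch of what such a helper could look like, mirroring how PAGE_MASK relates to offset_in_page(). The `model_` names are hypothetical; nothing here is taken from the patch set beyond the `& (size - 1)` arithmetic in the quoted macro.

```c
#define MODEL_PAGE_SIZE 4096UL

/* Hypothetical stand-in for struct page; only the order matters here. */
struct model_page {
	unsigned order;
};

static inline unsigned long model_thp_size(const struct model_page *page)
{
	return MODEL_PAGE_SIZE << page->order;
}

/* Hypothetical thp_mask(): keeps the bits above the in-page offset. */
static inline unsigned long model_thp_mask(const struct model_page *page)
{
	return ~(model_thp_size(page) - 1);
}

/* Same result as (pos & (thp_size(page) - 1)), written via the mask. */
static inline unsigned long model_offset_in_thp(const struct model_page *page,
						unsigned long pos)
{
	return pos & ~model_thp_mask(page);
}
```

Both spellings rely on the page size being a power of two.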

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 11/36] fs: Support THPs in zero_user_segments
  2020-05-15 13:16 ` [PATCH v4 11/36] fs: Support THPs in zero_user_segments Matthew Wilcox
@ 2020-05-25  4:55   ` Kirill A. Shutemov
  0 siblings, 0 replies; 59+ messages in thread
From: Kirill A. Shutemov @ 2020-05-25  4:55 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Fri, May 15, 2020 at 06:16:31AM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> We can only kmap() one subpage of a THP at a time, so loop over all
> relevant subpages, skipping ones which don't need to be zeroed.  This is
> too large to inline when THPs are enabled and we actually need highmem,
> so put it in highmem.c.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/highmem.h | 15 +++++++---
>  mm/highmem.c            | 62 +++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 71 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index ea5cdbd8c2c3..74614903619d 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -215,13 +215,18 @@ static inline void clear_highpage(struct page *page)
>  	kunmap_atomic(kaddr);
>  }
>  
> +#if defined(CONFIG_HIGHMEM) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> +void zero_user_segments(struct page *page, unsigned start1, unsigned end1,
> +		unsigned start2, unsigned end2);
> +#else /* !HIGHMEM || !TRANSPARENT_HUGEPAGE */
>  static inline void zero_user_segments(struct page *page,
> -	unsigned start1, unsigned end1,
> -	unsigned start2, unsigned end2)
> +		unsigned start1, unsigned end1,
> +		unsigned start2, unsigned end2)
>  {
> +	unsigned long i;
>  	void *kaddr = kmap_atomic(page);
>  
> -	BUG_ON(end1 > PAGE_SIZE || end2 > PAGE_SIZE);
> +	BUG_ON(end1 > thp_size(page) || end2 > thp_size(page));
>  
>  	if (end1 > start1)
>  		memset(kaddr + start1, 0, end1 - start1);
> @@ -230,8 +235,10 @@ static inline void zero_user_segments(struct page *page,
>  		memset(kaddr + start2, 0, end2 - start2);
>  
>  	kunmap_atomic(kaddr);
> -	flush_dcache_page(page);
> +	for (i = 0; i < hpage_nr_pages(page); i++)
> +		flush_dcache_page(page + i);

Well, we need to settle on whether flush_dcache_page() has to be aware
of compound pages. There are already architectures that know how to
flush a compound page, see ARM.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 24/36] mm: Remove page fault assumption of compound page size
  2020-05-15 13:16 ` [PATCH v4 24/36] mm: Remove page fault assumption of compound page size Matthew Wilcox
@ 2020-05-25  4:59   ` Kirill A. Shutemov
  0 siblings, 0 replies; 59+ messages in thread
From: Kirill A. Shutemov @ 2020-05-25  4:59 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Fri, May 15, 2020 at 06:16:44AM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> A compound page in the page cache will not necessarily be of PMD size,
> so check explicitly.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/memory.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index f703fe8c8346..d68ce428ddd2 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3549,13 +3549,14 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>  	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>  	pmd_t entry;
>  	int i;
> -	vm_fault_t ret;
> +	vm_fault_t ret = VM_FAULT_FALLBACK;
>  
>  	if (!transhuge_vma_suitable(vma, haddr))
> -		return VM_FAULT_FALLBACK;
> +		return ret;
>  
> -	ret = VM_FAULT_FALLBACK;
>  	page = compound_head(page);
> +	if (page_order(page) != HPAGE_PMD_ORDER)
> +		return ret;

Maybe WARN() for page_order(page) > HPAGE_PMD_ORDER. It would be fun to
handle :P
>  
>  	/*
>  	 * Archs like ppc64 need additonal space to store information
> -- 
> 2.26.2
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 00/36] Large pages in the page cache
  2020-05-22  3:05       ` Matthew Wilcox
@ 2020-05-25 23:07         ` Dave Chinner
  2020-05-26  1:21           ` Matthew Wilcox
  0 siblings, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2020-05-25 23:07 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Thu, May 21, 2020 at 08:05:53PM -0700, Matthew Wilcox wrote:
> On Fri, May 22, 2020 at 12:57:51PM +1000, Dave Chinner wrote:
> > On Thu, May 21, 2020 at 05:04:11PM -0700, Matthew Wilcox wrote:
> > > On Fri, May 22, 2020 at 08:49:06AM +1000, Dave Chinner wrote:
> > > > Ok, so the main issue I have with the filesystem/iomap side of
> > > > things is that it appears to be adding "transparent huge page"
> > > > awareness to the filesystem code, not "large page support".
> > > > 
> > > > For people that aren't aware of the difference between the
> > > > transparent huge page and a normal compound page (e.g. I have no idea
> > > > what the difference is), this is likely to cause problems,
> > > > especially as you haven't explained at all in this description why
> > > > transparent huge pages are being used rather than bog standard
> > > > compound pages.
> > > 
> > > The primary reason to use a different name from compound_*
> > > is so that it can be compiled out for systems that don't enable
> > > CONFIG_TRANSPARENT_HUGEPAGE.  So THPs are compound pages, as they always
> > > have been, but for a filesystem, using thp_size() will compile to either
> > > page_size() or PAGE_SIZE depending on CONFIG_TRANSPARENT_HUGEPAGE.
> > 
> > Again, why is this dependent on THP? We can allocate compound pages
> > without using THP, so why only allow the page cache to use larger
> > pages when THP is configured?
> 
> We have too many CONFIG options.  My brain can't cope with adding
> CONFIG_LARGE_PAGES because then we might have neither THP nor LP, LP and
> not THP, THP and not LP or both THP and LP.  And of course HUGETLBFS,
> which has its own special set of issues that one has to think about when
> dealing with the page cache.

That sounds like something that should be fixed. :/

Really, I don't care about the historical mechanisms that people can
configure large pages with. If the mm subsystem does not have a
unified abstraction and API for working with large pages, then that
is the first problem that needs to be addressed before other
subsystems start trying to use large pages. 

i.e. a filesystem developer doesn't care how the mm subsystem is
allocating/managing large pages, we just want to be able to treat
large pages exactly the same way as we treat single pages. There
should be exactly zero difference between them at the API level.

> So, either large pages become part of the base kernel and you
> always get them, or there's a CONFIG option to enable them and it's
> CONFIG_TRANSPARENT_HUGEPAGE.  I chose the latter.

Please make the API part of the base kernel. Then you can hide all
these whacky mm level config options behind it so that code that
interacts with large pages just doesn't have to care about what type
of large page infrastructure the user has configured.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 00/36] Large pages in the page cache
  2020-05-25 23:07         ` Dave Chinner
@ 2020-05-26  1:21           ` Matthew Wilcox
  0 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-26  1:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, May 26, 2020 at 09:07:51AM +1000, Dave Chinner wrote:
> On Thu, May 21, 2020 at 08:05:53PM -0700, Matthew Wilcox wrote:
> > On Fri, May 22, 2020 at 12:57:51PM +1000, Dave Chinner wrote:
> > > Again, why is this dependent on THP? We can allocate compound pages
> > > without using THP, so why only allow the page cache to use larger
> > > pages when THP is configured?
> > 
> > We have too many CONFIG options.  My brain can't cope with adding
> > CONFIG_LARGE_PAGES because then we might have neither THP nor LP, LP and
> > not THP, THP and not LP or both THP and LP.  And of course HUGETLBFS,
> > which has its own special set of issues that one has to think about when
> > dealing with the page cache.
> 
> That sounds like something that should be fixed. :/

If I have to fix hugetlbfs before doing large pages in the page cache,
we'll be five years away and at least two mental breakdowns.  Honestly,
I'd rather work on almost anything else.  Some of the work I'm doing
will help make hugetlbfs more similar to everything else, eventually,
but ... no, not going to put all this on hold to fix hugetlbfs.  Sorry.

> Really, I don't care about the historical mechanisms that people can
> configure large pages with. If the mm subsystem does not have a
> unified abstraction and API for working with large pages, then that
> is the first problem that needs to be addressed before other
> subsystems start trying to use large pages. 

I think you're reading too quickly.  Let me try again.

Historically, Transparent Huge Pages have been PMD sized.  They have also
had a complicated interface to use.  I am changing both those things;
THPs may now be arbitrary order, and I'm adding interfaces to make THPs
easier to work with.

Now, if you want to contend that THPs are inextricably linked with
PMD sizes and I need to choose a different name, I've been thinking
about other options a bit.  One would be 'lpage' for 'large page'.
Another would be 'mop' for 'multi-order page'.

We should not be seeing 'compound_order' in any filesystem code.
Compound pages are an mm concept.  They happen to be how THPs are
implemented, but it'd be a layering violation to use them directly.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 36/36] mm: Align THP mappings for non-DAX
  2020-05-15 13:16 ` [PATCH v4 36/36] mm: Align THP mappings for non-DAX Matthew Wilcox
@ 2020-05-26 22:05   ` William Kucharski
  2020-05-26 22:20     ` Matthew Wilcox
  0 siblings, 1 reply; 59+ messages in thread
From: William Kucharski @ 2020-05-26 22:05 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

Thinking about this, if the intent is to make THP usable for any
page size greater than PAGESIZE, this routine should probably go back
to taking a size or perhaps order parameter so it could be called to
align addresses accordingly rather than hard-coding PMD_SIZE.


> On May 15, 2020, at 7:16 AM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> From: William Kucharski <william.kucharski@oracle.com>
> 
> When we have the opportunity to use transparent huge pages to map a
> file, we want to follow the same rules as DAX.
> 
> Signed-off-by: William Kucharski <william.kucharski@oracle.com>
> [Inline __thp_get_unmapped_area() into thp_get_unmapped_area()]
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
> mm/huge_memory.c | 40 +++++++++++++---------------------------
> 1 file changed, 13 insertions(+), 27 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 15a86b06befc..e78686b628ae 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -535,30 +535,30 @@ bool is_transparent_hugepage(struct page *page)
> }
> EXPORT_SYMBOL_GPL(is_transparent_hugepage);
> 
> -static unsigned long __thp_get_unmapped_area(struct file *filp,
> -		unsigned long addr, unsigned long len,
> -		loff_t off, unsigned long flags, unsigned long size)
> +unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
> +		unsigned long len, unsigned long pgoff, unsigned long flags)
> {
> +	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
> 	loff_t off_end = off + len;
> -	loff_t off_align = round_up(off, size);
> +	loff_t off_align = round_up(off, PMD_SIZE);
> 	unsigned long len_pad, ret;
> 
> -	if (off_end <= off_align || (off_end - off_align) < size)
> -		return 0;
> +	if (off_end <= off_align || (off_end - off_align) < PMD_SIZE)
> +		goto regular;
> 
> -	len_pad = len + size;
> +	len_pad = len + PMD_SIZE;
> 	if (len_pad < len || (off + len_pad) < off)
> -		return 0;
> +		goto regular;
> 
> 	ret = current->mm->get_unmapped_area(filp, addr, len_pad,
> 					      off >> PAGE_SHIFT, flags);
> 
> 	/*
> -	 * The failure might be due to length padding. The caller will retry
> -	 * without the padding.
> +	 * The failure might be due to length padding.  Retry without
> +	 * the padding.
> 	 */
> 	if (IS_ERR_VALUE(ret))
> -		return 0;
> +		goto regular;
> 
> 	/*
> 	 * Do not try to align to THP boundary if allocation at the address
> @@ -567,23 +567,9 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
> 	if (ret == addr)
> 		return addr;
> 
> -	ret += (off - ret) & (size - 1);
> +	ret += (off - ret) & (PMD_SIZE - 1);
> 	return ret;
> -}
> -
> -unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
> -		unsigned long len, unsigned long pgoff, unsigned long flags)
> -{
> -	unsigned long ret;
> -	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
> -
> -	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
> -		goto out;
> -
> -	ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PMD_SIZE);
> -	if (ret)
> -		return ret;
> -out:
> +regular:
> 	return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
> }
> EXPORT_SYMBOL_GPL(thp_get_unmapped_area);
> -- 
> 2.26.2
> 



* Re: [PATCH v4 36/36] mm: Align THP mappings for non-DAX
  2020-05-26 22:05   ` William Kucharski
@ 2020-05-26 22:20     ` Matthew Wilcox
  0 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-26 22:20 UTC (permalink / raw)
  To: William Kucharski; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, May 26, 2020 at 04:05:58PM -0600, William Kucharski wrote:
> Thinking about this, if the intent is to make THP usable for any
> greater than PAGESIZE page size, this routine should probably go back
> to taking a size or perhaps order parameter so it could be called to
> align addresses accordingly rather than hard code PMD_SIZE.

Yes, that's a good point.  For example, on ARM, we'd want to 64kB-align
files for which we could use 64kB pages, but there would be no point doing
that on x86.  I'll revert to the earlier version of this patch that
you sent.  I'm not sure how best to allow the architecture to tell us which
page sizes are useful to align to, but that earlier patch is a better
base to build on than this version.
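
To make the size-parameterised alignment concrete, here is a minimal
user-space sketch in C.  It only models the arithmetic from the quoted
diff (the early-exit test and the final "ret += (off - ret) & (size - 1)"
step) with the size passed in rather than hard-coded to PMD_SIZE.  The
demo_* names are hypothetical and are not kernel functions, and the
overflow checks from the patch are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if size is a power of two. */
static inline int demo_is_pow2(uint64_t size)
{
	return size != 0 && (size & (size - 1)) == 0;
}

/*
 * Model of the early-exit test in the patch: alignment is only worth
 * attempting if [off, off + len) contains at least one naturally
 * aligned block of 'size' bytes.  Overflow checks are omitted here.
 */
static inline int demo_worth_aligning(uint64_t off, uint64_t len, uint64_t size)
{
	uint64_t off_end, off_align;

	assert(demo_is_pow2(size));
	off_end = off + len;
	off_align = (off + size - 1) & ~(size - 1);	/* round_up(off, size) */
	return off_end > off_align && (off_end - off_align) >= size;
}

/*
 * Model of "ret += (off - ret) & (size - 1)".  'ret' is the start of a
 * region that was padded by 'size' bytes; the result lies in
 * [ret, ret + size) and is congruent to 'off' modulo 'size', so file
 * offsets and virtual addresses share the same alignment.
 */
static inline uint64_t demo_align_addr(uint64_t ret, uint64_t off, uint64_t size)
{
	assert(demo_is_pow2(size));
	return ret + ((off - ret) & (size - 1));
}
```

With ret = 0x7f0000123000 and off = 0, a size of 2MB moves the start up
to 0x7f0000200000, while a size of 64kB only moves it to 0x7f0000130000,
which is the kind of per-architecture choice discussed above.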


* Re: [PATCH v4 00/36] Large pages in the page cache
  2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
                   ` (36 preceding siblings ...)
  2020-05-21 22:49 ` [PATCH v4 00/36] Large pages in the page cache Dave Chinner
@ 2020-05-28 11:00 ` William Kucharski
  37 siblings, 0 replies; 59+ messages in thread
From: William Kucharski @ 2020-05-28 11:00 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

Except for [PATCH v4 36/36], which I can't approve for obvious reasons:

Reviewed-by: William Kucharski <william.kucharski@oracle.com>


* Re: [PATCH v4 03/36] mm: Allow hpages to be arbitrary order
  2020-05-15 13:16 ` [PATCH v4 03/36] mm: Allow hpages to be arbitrary order Matthew Wilcox
@ 2020-05-28 14:19   ` Zi Yan
  0 siblings, 0 replies; 59+ messages in thread
From: Zi Yan @ 2020-05-28 14:19 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel



On 15 May 2020, at 9:16, Matthew Wilcox wrote:

> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
>
> Remove the assumption in hpage_nr_pages() that compound pages are
> necessarily PMD sized.  Move the relevant parts of mm.h to before the
> include of huge_mm.h so we can use an inline function rather than a macro.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/huge_mm.h |  5 +--
>  include/linux/mm.h      | 96 ++++++++++++++++++++---------------------
>  2 files changed, 50 insertions(+), 51 deletions(-)
>
Glad to see this change. Thanks.

Reviewed-by: Zi Yan <ziy@nvidia.com>

—
Best Regards,
Yan Zi
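
As a rough illustration of what "arbitrary order" means for the page
count, the sketch below derives the number of base pages and the byte
size from a compound order instead of assuming PMD size.  It is a
user-space model only: struct demo_page, the demo_* helpers and the
4kB base page size are assumptions made for the example, not the
kernel's definitions.

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT	12			/* assumed 4kB base pages */
#define DEMO_PAGE_SIZE	(1UL << DEMO_PAGE_SHIFT)

/* Hypothetical stand-in for a head page that records its order. */
struct demo_page {
	unsigned int order;
};

/* Number of base pages covered: 2^order, not a fixed PMD-sized count. */
static inline unsigned long demo_nr_pages(const struct demo_page *page)
{
	assert(page->order < 8 * sizeof(unsigned long) - DEMO_PAGE_SHIFT);
	return 1UL << page->order;
}

/* Size in bytes of the (possibly large) page. */
static inline unsigned long demo_size(const struct demo_page *page)
{
	assert(page->order < 8 * sizeof(unsigned long) - DEMO_PAGE_SHIFT);
	return DEMO_PAGE_SIZE << page->order;
}
```

Under these assumptions an order-4 page covers 16 base pages (64kB) and
an order-9 page covers 512 base pages (2MB).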



* Re: [PATCH v4 06/36] mm: Introduce offset_in_thp
  2020-05-22 17:15   ` Kirill A. Shutemov
@ 2020-05-29 12:59     ` Matthew Wilcox
  0 siblings, 0 replies; 59+ messages in thread
From: Matthew Wilcox @ 2020-05-29 12:59 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Fri, May 22, 2020 at 08:15:17PM +0300, Kirill A. Shutemov wrote:
> On Fri, May 15, 2020 at 06:16:26AM -0700, Matthew Wilcox wrote:
> > +#define offset_in_thp(page, p)	((unsigned long)(p) & (thp_size(page) - 1))
> 
> Looks like thp_mask() would be handy here.

It's not the only place we could use a thp_mask(), but PAGE_MASK is the
inverse of what I think it should be:

include/asm-generic/page.h:#define PAGE_MASK    (~(PAGE_SIZE-1))

i.e. addr & PAGE_MASK returns the address aligned to page size, not the
offset within the page.  Given this ambiguity, I'm inclined to leave
it as (thp_size(page) - 1), as it's clear which bits we're masking off.
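
The two senses of "mask" can be shown side by side with a small
user-space sketch in C.  DEMO_PAGE_MASK mirrors the quoted PAGE_MASK
definition (it keeps the high bits), while demo_offset_in() mirrors the
"(size - 1)" form used by offset_in_thp() (it keeps the low bits).  The
demo_* names and the 4kB page size are assumptions for the example.

```c
#include <assert.h>

#define DEMO_PAGE_SIZE	4096UL			/* assumed base page size */
#define DEMO_PAGE_MASK	(~(DEMO_PAGE_SIZE - 1))	/* same shape as PAGE_MASK */

/* Offset within a naturally aligned block of 'size' bytes. */
static inline unsigned long demo_offset_in(unsigned long size, unsigned long p)
{
	assert(size != 0 && (size & (size - 1)) == 0);	/* power of two */
	return p & (size - 1);
}

/* Start of the naturally aligned block of 'size' bytes containing p. */
static inline unsigned long demo_align_down(unsigned long size, unsigned long p)
{
	assert(size != 0 && (size & (size - 1)) == 0);	/* power of two */
	return p & ~(size - 1);
}
```

For p = 0x12345, "p & DEMO_PAGE_MASK" gives 0x12000 (the aligned
address), whereas demo_offset_in(0x10000, p) gives 0x2345 (the offset
within a 64kB page).  A thp_mask() following the PAGE_MASK convention
would produce the former, which is not what offset_in_thp() needs.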



Thread overview: 59+ messages
2020-05-15 13:16 [PATCH v4 00/36] Large pages in the page cache Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 01/36] mm: Move PageDoubleMap bit Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 02/36] mm: Simplify PageDoubleMap with PF_SECOND policy Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 03/36] mm: Allow hpages to be arbitrary order Matthew Wilcox
2020-05-28 14:19   ` Zi Yan
2020-05-15 13:16 ` [PATCH v4 04/36] mm: Introduce thp_size Matthew Wilcox
2020-05-15 13:38   ` David Hildenbrand
2020-05-15 13:16 ` [PATCH v4 05/36] mm: Introduce thp_order Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 06/36] mm: Introduce offset_in_thp Matthew Wilcox
2020-05-15 13:39   ` David Hildenbrand
2020-05-22 17:15   ` Kirill A. Shutemov
2020-05-29 12:59     ` Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 07/36] fs: Add a filesystem flag for large pages Matthew Wilcox
2020-05-21 21:55   ` Dave Chinner
2020-05-21 23:29     ` Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 08/36] fs: Do not update nr_thps for large page mappings Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 09/36] fs: Introduce i_blocks_per_page Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 10/36] fs: Make page_mkwrite_check_truncate thp-aware Matthew Wilcox
2020-05-21 22:01   ` Dave Chinner
2020-05-21 23:30     ` Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 11/36] fs: Support THPs in zero_user_segments Matthew Wilcox
2020-05-25  4:55   ` Kirill A. Shutemov
2020-05-15 13:16 ` [PATCH v4 12/36] bio: Add bio_for_each_thp_segment_all Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 13/36] iomap: Support arbitrarily many blocks per page Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 14/36] iomap: Support large pages in iomap_adjust_read_range Matthew Wilcox
2020-05-21 22:24   ` Dave Chinner
2020-05-21 23:39     ` Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 15/36] iomap: Support large pages in read paths Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 16/36] iomap: Support large pages in write paths Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 17/36] iomap: Inline data shouldn't see large pages Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 18/36] iomap: Handle tail pages in iomap_page_mkwrite Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 19/36] xfs: Support large pages Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 20/36] mm: Make prep_transhuge_page return its argument Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 21/36] mm: Add __page_cache_alloc_order Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 22/36] mm: Allow large pages to be added to the page cache Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 23/36] mm: Allow large pages to be removed from " Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 24/36] mm: Remove page fault assumption of compound page size Matthew Wilcox
2020-05-25  4:59   ` Kirill A. Shutemov
2020-05-15 13:16 ` [PATCH v4 25/36] mm: Fix total_mapcount assumption of " Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 26/36] mm: Avoid splitting large pages Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 27/36] mm: Fix truncation for pages of arbitrary size Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 28/36] mm: Support storing shadow entries for large pages Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 29/36] mm: Support retrieving tail pages from the page cache Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 30/36] mm: Support tail pages in wait_for_stable_page Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 31/36] mm: Add DEFINE_READAHEAD Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 32/36] mm: Make page_cache_readahead_unbounded take a readahead_control Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 33/36] mm: Make __do_page_cache_readahead " Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 34/36] mm: Allow PageReadahead to be set on head pages Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 35/36] mm: Add large page readahead Matthew Wilcox
2020-05-15 13:16 ` [PATCH v4 36/36] mm: Align THP mappings for non-DAX Matthew Wilcox
2020-05-26 22:05   ` William Kucharski
2020-05-26 22:20     ` Matthew Wilcox
2020-05-21 22:49 ` [PATCH v4 00/36] Large pages in the page cache Dave Chinner
2020-05-22  0:04   ` Matthew Wilcox
2020-05-22  2:57     ` Dave Chinner
2020-05-22  3:05       ` Matthew Wilcox
2020-05-25 23:07         ` Dave Chinner
2020-05-26  1:21           ` Matthew Wilcox
2020-05-28 11:00 ` William Kucharski
