linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v2 00/25] Large pages in the page cache
@ 2020-02-12  4:18 Matthew Wilcox
  2020-02-12  4:18 ` [PATCH v2 01/25] mm: Use vm_fault error code directly Matthew Wilcox
                   ` (24 more replies)
  0 siblings, 25 replies; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

This patch set does not pass xfstests.  Test at your own risk.  It is
based on the readahead patchset which I posted yesterday.

The principal idea here is that a large part of the overhead in dealing
with individual pages is that there are just so darned many of them.  We
would be better off dealing with fewer, larger pages, even if they don't
get to be the size necessary for the CPU to use a larger TLB entry.

The first five patches are more or less random cleanups which I came
across while working on this patchset ... Andrew, if you want to just
take those into your tree, it'd probably be a good thing.

hpage_nr_pages() is adapted to handle arbitrary order pages.  I also
add thp_order() and thp_size() for legibility.  Then the patches tear
through the page cache fixing the places which assume pages are either
PMD_SIZE or PAGE_SIZE.  After that, I tackle the iomap buffered I/O path,
removing the assumptions of PAGE_SIZE there.

Finally, we get to actually allocating large pages in the readahead code.
We gradually grow the page size that is allocated, so we don't just
jump straight from order-0 to order-9 pages, but gradually get there
through order-2, order-4, order-6, order-8 and order-9 (on x86; other
architectures will have a different PMD_ORDER).
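
As a rough sketch of the ramp-up described above (the readahead patch
itself is at the tail of this series and is not quoted here, so the
helper below is purely illustrative and its name is made up for this
cover letter):

	/*
	 * Grow the allocation order in steps of two, capped at the PMD
	 * order, so a long sequential read ramps up through order-2,
	 * order-4, ... instead of jumping straight to order-9.
	 */
	static unsigned int ra_grow_order(unsigned int old_order,
					  unsigned int max_order)
	{
		return min(old_order + 2, max_order);
	}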

In some testing, I've seen the code go as far as order-6.  Right now it
falls over on an earlier xfstest when it discovers a delayed allocation
extent in an inode which is being removed at unmount.

Matthew Wilcox (Oracle) (24):
  mm: Use vm_fault error code directly
  mm: Optimise find_subpage for !THP
  mm: Use VM_BUG_ON_PAGE in clear_page_dirty_for_io
  mm: Unexport find_get_entry
  mm: Fix documentation of FGP flags
  mm: Allow hpages to be arbitrary order
  mm: Introduce thp_size
  mm: Introduce thp_order
  fs: Add a filesystem flag for large pages
  fs: Introduce i_blocks_per_page
  fs: Make page_mkwrite_check_truncate thp-aware
  mm: Add file_offset_of_ helpers
  fs: Add zero_user_large
  iomap: Support arbitrarily many blocks per page
  iomap: Support large pages in iomap_adjust_read_range
  iomap: Support large pages in read paths
  iomap: Support large pages in write paths
  iomap: Inline data shouldn't see large pages
  xfs: Support large pages
  mm: Make prep_transhuge_page return its argument
  mm: Add __page_cache_alloc_order
  mm: Allow large pages to be added to the page cache
  mm: Allow large pages to be removed from the page cache
  mm: Add large page readahead

William Kucharski (1):
  mm: Align THP mappings for non-DAX

 drivers/net/ethernet/ibm/ibmveth.c |   2 -
 drivers/nvdimm/btt.c               |   4 +-
 drivers/nvdimm/pmem.c              |   3 +-
 fs/iomap/buffered-io.c             | 111 ++++++++++++++++-------------
 fs/jfs/jfs_metapage.c              |   2 +-
 fs/xfs/xfs_aops.c                  |   4 +-
 fs/xfs/xfs_super.c                 |   2 +-
 include/linux/fs.h                 |   1 +
 include/linux/highmem.h            |  22 ++++++
 include/linux/huge_mm.h            |  21 +++---
 include/linux/mm.h                 |   2 +
 include/linux/pagemap.h            |  78 ++++++++++++++++----
 mm/filemap.c                       |  67 +++++++++++------
 mm/huge_memory.c                   |  14 ++--
 mm/internal.h                      |   2 +-
 mm/page-writeback.c                |   2 +-
 mm/page_io.c                       |   2 +-
 mm/page_vma_mapped.c               |   4 +-
 mm/readahead.c                     |  98 +++++++++++++++++++++++--
 19 files changed, 322 insertions(+), 119 deletions(-)

-- 
2.25.0



* [PATCH v2 01/25] mm: Use vm_fault error code directly
  2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
@ 2020-02-12  4:18 ` Matthew Wilcox
  2020-02-12  7:34   ` Christoph Hellwig
  2020-02-12  4:18 ` [PATCH v2 02/25] mm: Optimise find_subpage for !THP Matthew Wilcox
                   ` (23 subsequent siblings)
  24 siblings, 1 reply; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm
  Cc: Matthew Wilcox (Oracle), linux-kernel, Kirill A . Shutemov

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use VM_FAULT_OOM instead of indirecting through vmf_error(-ENOMEM).

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 1784478270e1..1beb7716276b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2491,7 +2491,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 		if (!page) {
 			if (fpin)
 				goto out_retry;
-			return vmf_error(-ENOMEM);
+			return VM_FAULT_OOM;
 		}
 	}
 
-- 
2.25.0



* [PATCH v2 02/25] mm: Optimise find_subpage for !THP
  2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
  2020-02-12  4:18 ` [PATCH v2 01/25] mm: Use vm_fault error code directly Matthew Wilcox
@ 2020-02-12  4:18 ` Matthew Wilcox
  2020-02-12  7:41   ` Christoph Hellwig
  2020-02-12  4:18 ` [PATCH v2 03/25] mm: Use VM_BUG_ON_PAGE in clear_page_dirty_for_io Matthew Wilcox
                   ` (22 subsequent siblings)
  24 siblings, 1 reply; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

If THP is disabled, hpage_nr_pages() is a compile-time constant 1, so the
offset mask is zero and find_subpage() compiles down to a no-op.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagemap.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 75bdfec49710..0842622cca90 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -340,7 +340,7 @@ static inline struct page *find_subpage(struct page *page, pgoff_t offset)
 
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
-	return page + (offset & (compound_nr(page) - 1));
+	return page + (offset & (hpage_nr_pages(page) - 1));
 }
 
 struct page *find_get_entry(struct address_space *mapping, pgoff_t offset);
-- 
2.25.0



* [PATCH v2 03/25] mm: Use VM_BUG_ON_PAGE in clear_page_dirty_for_io
  2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
  2020-02-12  4:18 ` [PATCH v2 01/25] mm: Use vm_fault error code directly Matthew Wilcox
  2020-02-12  4:18 ` [PATCH v2 02/25] mm: Optimise find_subpage for !THP Matthew Wilcox
@ 2020-02-12  4:18 ` Matthew Wilcox
  2020-02-12  7:38   ` Christoph Hellwig
  2020-02-13 13:50   ` Kirill A. Shutemov
  2020-02-12  4:18 ` [PATCH v2 04/25] mm: Unexport find_get_entry Matthew Wilcox
                   ` (21 subsequent siblings)
  24 siblings, 2 replies; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Dumping the page information in this circumstance helps for debugging.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/page-writeback.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2caf780a42e7..9173c25cf8e6 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2655,7 +2655,7 @@ int clear_page_dirty_for_io(struct page *page)
 	struct address_space *mapping = page_mapping(page);
 	int ret = 0;
 
-	BUG_ON(!PageLocked(page));
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
 	if (mapping && mapping_cap_account_dirty(mapping)) {
 		struct inode *inode = mapping->host;
-- 
2.25.0



* [PATCH v2 04/25] mm: Unexport find_get_entry
  2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
                   ` (2 preceding siblings ...)
  2020-02-12  4:18 ` [PATCH v2 03/25] mm: Use VM_BUG_ON_PAGE in clear_page_dirty_for_io Matthew Wilcox
@ 2020-02-12  4:18 ` Matthew Wilcox
  2020-02-12  7:37   ` Christoph Hellwig
  2020-02-13 13:51   ` Kirill A. Shutemov
  2020-02-12  4:18 ` [PATCH v2 05/25] mm: Fix documentation of FGP flags Matthew Wilcox
                   ` (20 subsequent siblings)
  24 siblings, 2 replies; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

None of the in-tree users (proc, madvise, memcg, mincore) can be built
as a module.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/filemap.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 1beb7716276b..83ce9ce0bee1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1536,7 +1536,6 @@ struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
 
 	return page;
 }
-EXPORT_SYMBOL(find_get_entry);
 
 /**
  * find_lock_entry - locate, pin and lock a page cache entry
-- 
2.25.0



* [PATCH v2 05/25] mm: Fix documentation of FGP flags
  2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
                   ` (3 preceding siblings ...)
  2020-02-12  4:18 ` [PATCH v2 04/25] mm: Unexport find_get_entry Matthew Wilcox
@ 2020-02-12  4:18 ` Matthew Wilcox
  2020-02-12  7:42   ` Christoph Hellwig
  2020-02-13 13:59   ` Kirill A. Shutemov
  2020-02-12  4:18 ` [PATCH v2 06/25] mm: Allow hpages to be arbitrary order Matthew Wilcox
                   ` (19 subsequent siblings)
  24 siblings, 2 replies; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

We never had PCG flags; they've been called FGP flags since their
introduction in 2014.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/filemap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 83ce9ce0bee1..3204293f9b58 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1577,12 +1577,12 @@ EXPORT_SYMBOL(find_lock_entry);
  * pagecache_get_page - find and get a page reference
  * @mapping: the address_space to search
  * @offset: the page index
- * @fgp_flags: PCG flags
+ * @fgp_flags: FGP flags
  * @gfp_mask: gfp mask to use for the page cache data page allocation
  *
  * Looks up the page cache slot at @mapping & @offset.
  *
- * PCG flags modify how the page is returned.
+ * FGP flags modify how the page is returned.
  *
  * @fgp_flags can be:
  *
-- 
2.25.0



* [PATCH v2 06/25] mm: Allow hpages to be arbitrary order
  2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
                   ` (4 preceding siblings ...)
  2020-02-12  4:18 ` [PATCH v2 05/25] mm: Fix documentation of FGP flags Matthew Wilcox
@ 2020-02-12  4:18 ` Matthew Wilcox
  2020-02-13 14:11   ` Kirill A. Shutemov
  2020-02-12  4:18 ` [PATCH v2 07/25] mm: Introduce thp_size Matthew Wilcox
                   ` (18 subsequent siblings)
  24 siblings, 1 reply; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Remove the assumption in hpage_nr_pages() that compound pages are
necessarily PMD-sized.  The return type needs to be signed as we need
to use the negative value, e.g. when calling update_lru_size().
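
For reference, the negative use mentioned above looks roughly like this
(as in del_page_from_lru_list() in include/linux/mm_inline.h):

	static __always_inline void del_page_from_lru_list(struct page *page,
					struct lruvec *lruvec, enum lru_list lru)
	{
		list_del(&page->lru);
		update_lru_size(lruvec, lru, page_zonenum(page),
				-hpage_nr_pages(page));
	}

With an unsigned return type, -hpage_nr_pages(page) would evaluate to a
huge unsigned value rather than a negative count.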

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/huge_mm.h | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 5aca3d1bdb32..16367e2f771e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -230,12 +230,8 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
 	else
 		return NULL;
 }
-static inline int hpage_nr_pages(struct page *page)
-{
-	if (unlikely(PageTransHuge(page)))
-		return HPAGE_PMD_NR;
-	return 1;
-}
+
+#define hpage_nr_pages(page)	(long)compound_nr(page)
 
 struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 		pmd_t *pmd, int flags, struct dev_pagemap **pgmap);
@@ -289,7 +285,7 @@ static inline struct list_head *page_deferred_list(struct page *page)
 #define HPAGE_PUD_MASK ({ BUILD_BUG(); 0; })
 #define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
 
-#define hpage_nr_pages(x) 1
+#define hpage_nr_pages(x) 1L
 
 static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
 {
-- 
2.25.0



* [PATCH v2 07/25] mm: Introduce thp_size
  2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
                   ` (5 preceding siblings ...)
  2020-02-12  4:18 ` [PATCH v2 06/25] mm: Allow hpages to be arbitrary order Matthew Wilcox
@ 2020-02-12  4:18 ` Matthew Wilcox
  2020-02-13 14:19   ` Kirill A. Shutemov
  2020-02-12  4:18 ` [PATCH v2 08/25] mm: Introduce thp_order Matthew Wilcox
                   ` (17 subsequent siblings)
  24 siblings, 1 reply; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

This is like page_size(), but compiles down to just PAGE_SIZE if THP
is disabled.  Convert the users of hpage_nr_pages() which would prefer
this interface.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 drivers/nvdimm/btt.c    | 4 +---
 drivers/nvdimm/pmem.c   | 3 +--
 include/linux/huge_mm.h | 2 ++
 mm/internal.h           | 2 +-
 mm/page_io.c            | 2 +-
 mm/page_vma_mapped.c    | 4 ++--
 6 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 0d04ea3d9fd7..5d6a2a22f5a0 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1488,10 +1488,8 @@ static int btt_rw_page(struct block_device *bdev, sector_t sector,
 {
 	struct btt *btt = bdev->bd_disk->private_data;
 	int rc;
-	unsigned int len;
 
-	len = hpage_nr_pages(page) * PAGE_SIZE;
-	rc = btt_do_bvec(btt, NULL, page, len, 0, op, sector);
+	rc = btt_do_bvec(btt, NULL, page, thp_size(page), 0, op, sector);
 	if (rc == 0)
 		page_endio(page, op_is_write(op), 0);
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 4eae441f86c9..9c71c81f310f 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -223,8 +223,7 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 	struct pmem_device *pmem = bdev->bd_queue->queuedata;
 	blk_status_t rc;
 
-	rc = pmem_do_bvec(pmem, page, hpage_nr_pages(page) * PAGE_SIZE,
-			  0, op, sector);
+	rc = pmem_do_bvec(pmem, page, thp_size(page), 0, op, sector);
 
 	/*
 	 * The ->rw_page interface is subtle and tricky.  The core
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 16367e2f771e..3680ae2d9019 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -232,6 +232,7 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
 }
 
 #define hpage_nr_pages(page)	(long)compound_nr(page)
+#define thp_size(page)		page_size(page)
 
 struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 		pmd_t *pmd, int flags, struct dev_pagemap **pgmap);
@@ -286,6 +287,7 @@ static inline struct list_head *page_deferred_list(struct page *page)
 #define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
 
 #define hpage_nr_pages(x) 1L
+#define thp_size(x)		PAGE_SIZE
 
 static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
 {
diff --git a/mm/internal.h b/mm/internal.h
index 41b93c4b3ab7..390d81d8b85f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -358,7 +358,7 @@ vma_address(struct page *page, struct vm_area_struct *vma)
 	unsigned long start, end;
 
 	start = __vma_address(page, vma);
-	end = start + PAGE_SIZE * (hpage_nr_pages(page) - 1);
+	end = start + thp_size(page) - PAGE_SIZE;
 
 	/* page should be within @vma mapping range */
 	VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma);
diff --git a/mm/page_io.c b/mm/page_io.c
index 76965be1d40e..dd935129e3cb 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -41,7 +41,7 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
 		bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
 		bio->bi_end_io = end_io;
 
-		bio_add_page(bio, page, PAGE_SIZE * hpage_nr_pages(page), 0);
+		bio_add_page(bio, page, thp_size(page), 0);
 	}
 	return bio;
 }
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 719c35246cfa..e65629c056e8 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -227,7 +227,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			if (pvmw->address >= pvmw->vma->vm_end ||
 			    pvmw->address >=
 					__vma_address(pvmw->page, pvmw->vma) +
-					hpage_nr_pages(pvmw->page) * PAGE_SIZE)
+					thp_size(pvmw->page))
 				return not_found(pvmw);
 			/* Did we cross page table boundary? */
 			if (pvmw->address % PMD_SIZE == 0) {
@@ -268,7 +268,7 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
 	unsigned long start, end;
 
 	start = __vma_address(page, vma);
-	end = start + PAGE_SIZE * (hpage_nr_pages(page) - 1);
+	end = start + thp_size(page) - PAGE_SIZE;
 
 	if (unlikely(end < vma->vm_start || start >= vma->vm_end))
 		return 0;
-- 
2.25.0



* [PATCH v2 08/25] mm: Introduce thp_order
  2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
                   ` (6 preceding siblings ...)
  2020-02-12  4:18 ` [PATCH v2 07/25] mm: Introduce thp_size Matthew Wilcox
@ 2020-02-12  4:18 ` Matthew Wilcox
  2020-02-13 14:20   ` Kirill A. Shutemov
  2020-02-12  4:18 ` [PATCH v2 09/25] fs: Add a filesystem flag for large pages Matthew Wilcox
                   ` (16 subsequent siblings)
  24 siblings, 1 reply; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Like compound_order(), except it is always 0 when THP is disabled.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/huge_mm.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3680ae2d9019..3de788ee25bd 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -233,6 +233,7 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
 
 #define hpage_nr_pages(page)	(long)compound_nr(page)
 #define thp_size(page)		page_size(page)
+#define thp_order(page)		compound_order(page)
 
 struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 		pmd_t *pmd, int flags, struct dev_pagemap **pgmap);
@@ -288,6 +289,7 @@ static inline struct list_head *page_deferred_list(struct page *page)
 
 #define hpage_nr_pages(x) 1L
 #define thp_size(x)		PAGE_SIZE
+#define thp_order(x)		0U
 
 static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
 {
-- 
2.25.0



* [PATCH v2 09/25] fs: Add a filesystem flag for large pages
  2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
                   ` (7 preceding siblings ...)
  2020-02-12  4:18 ` [PATCH v2 08/25] mm: Introduce thp_order Matthew Wilcox
@ 2020-02-12  4:18 ` Matthew Wilcox
  2020-02-12  7:43   ` Christoph Hellwig
  2020-02-12  4:18 ` [PATCH v2 10/25] fs: Introduce i_blocks_per_page Matthew Wilcox
                   ` (15 subsequent siblings)
  24 siblings, 1 reply; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

The page cache needs to know whether the filesystem supports pages >
PAGE_SIZE.
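
A filesystem opts in by setting the flag in its file_system_type, as the
XFS patch later in this series does:

	static struct file_system_type xfs_fs_type = {
		...
		.fs_flags		= FS_REQUIRES_DEV | FS_LARGE_PAGES,
	};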

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/fs.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index d4e2d2964346..24e720723afb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2235,6 +2235,7 @@ struct file_system_type {
 #define FS_HAS_SUBTYPE		4
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
+#define FS_LARGE_PAGES		8192	/* Remove once all fs converted */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	int (*init_fs_context)(struct fs_context *);
 	const struct fs_parameter_spec *parameters;
-- 
2.25.0



* [PATCH v2 10/25] fs: Introduce i_blocks_per_page
  2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
                   ` (8 preceding siblings ...)
  2020-02-12  4:18 ` [PATCH v2 09/25] fs: Add a filesystem flag for large pages Matthew Wilcox
@ 2020-02-12  4:18 ` Matthew Wilcox
  2020-02-12  7:44   ` Christoph Hellwig
  2020-02-13 15:40   ` Kirill A. Shutemov
  2020-02-12  4:18 ` [PATCH v2 11/25] fs: Make page_mkwrite_check_truncate thp-aware Matthew Wilcox
                   ` (14 subsequent siblings)
  24 siblings, 2 replies; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

This helper is useful both for large pages in the page cache and for
supporting block sizes larger than the page size.  Convert some example
users (we have a few different ways of writing this idiom).
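
The idiom variations mentioned above are visible in the diff below; for
example, both of these spell the same calculation ("blocks" is just an
illustrative variable name):

	blocks = PAGE_SIZE / i_blocksize(inode);
	blocks = PAGE_SIZE >> inode->i_blkbits;

	/* both become: */
	blocks = i_blocks_per_page(inode, page);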

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c  |  8 ++++----
 fs/jfs/jfs_metapage.c   |  2 +-
 fs/xfs/xfs_aops.c       |  2 +-
 include/linux/pagemap.h | 16 ++++++++++++++++
 4 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index e40eb45230fa..c551a48e2a81 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -46,7 +46,7 @@ iomap_page_create(struct inode *inode, struct page *page)
 {
 	struct iomap_page *iop = to_iomap_page(page);
 
-	if (iop || i_blocksize(inode) == PAGE_SIZE)
+	if (iop || i_blocks_per_page(inode, page) <= 1)
 		return iop;
 
 	iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
@@ -152,7 +152,7 @@ iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len)
 	unsigned int i;
 
 	spin_lock_irqsave(&iop->uptodate_lock, flags);
-	for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
+	for (i = 0; i < i_blocks_per_page(inode, page); i++) {
 		if (i >= first && i <= last)
 			set_bit(i, iop->uptodate);
 		else if (!test_bit(i, iop->uptodate))
@@ -1073,7 +1073,7 @@ iomap_finish_page_writeback(struct inode *inode, struct page *page,
 		mapping_set_error(inode->i_mapping, -EIO);
 	}
 
-	WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE && !iop);
+	WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
 	WARN_ON_ONCE(iop && atomic_read(&iop->write_count) <= 0);
 
 	if (!iop || atomic_dec_and_test(&iop->write_count))
@@ -1369,7 +1369,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	int error = 0, count = 0, i;
 	LIST_HEAD(submit_list);
 
-	WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE && !iop);
+	WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
 	WARN_ON_ONCE(iop && atomic_read(&iop->write_count) != 0);
 
 	/*
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index a2f5338a5ea1..176580f54af9 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -473,7 +473,7 @@ static int metapage_readpage(struct file *fp, struct page *page)
 	struct inode *inode = page->mapping->host;
 	struct bio *bio = NULL;
 	int block_offset;
-	int blocks_per_page = PAGE_SIZE >> inode->i_blkbits;
+	int blocks_per_page = i_blocks_per_page(inode, page);
 	sector_t page_start;	/* address of page in fs blocks */
 	sector_t pblock;
 	int xlen;
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 0897dd71c622..5573bf2957dd 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -544,7 +544,7 @@ xfs_discard_page(
 			page, ip->i_ino, offset);
 
 	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
-			PAGE_SIZE / i_blocksize(inode));
+			i_blocks_per_page(inode, page));
 	if (error && !XFS_FORCED_SHUTDOWN(mp))
 		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
 out_invalidate:
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 0842622cca90..aa925295347c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -748,4 +748,20 @@ static inline int page_mkwrite_check_truncate(struct page *page,
 	return offset;
 }
 
+/**
+ * i_blocks_per_page - How many blocks fit in this page.
+ * @inode: The inode which contains the blocks.
+ * @page: The (potentially large) page.
+ *
+ * If the block size is larger than the size of this page, this will
+ * return zero.
+ *
+ * Context: Any context.
+ * Return: The number of filesystem blocks covered by this page.
+ */
+static inline
+unsigned int i_blocks_per_page(struct inode *inode, struct page *page)
+{
+	return thp_size(page) >> inode->i_blkbits;
+}
 #endif /* _LINUX_PAGEMAP_H */
-- 
2.25.0



* [PATCH v2 11/25] fs: Make page_mkwrite_check_truncate thp-aware
  2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
                   ` (9 preceding siblings ...)
  2020-02-12  4:18 ` [PATCH v2 10/25] fs: Introduce i_blocks_per_page Matthew Wilcox
@ 2020-02-12  4:18 ` Matthew Wilcox
  2020-02-13 15:44   ` Kirill A. Shutemov
  2020-02-12  4:18 ` [PATCH v2 12/25] mm: Add file_offset_of_ helpers Matthew Wilcox
                   ` (13 subsequent siblings)
  24 siblings, 1 reply; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

If the page is compound, check the appropriate indices and return the
appropriate sizes.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagemap.h | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index aa925295347c..2ec33aabdbf6 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -732,17 +732,18 @@ static inline int page_mkwrite_check_truncate(struct page *page,
 					      struct inode *inode)
 {
 	loff_t size = i_size_read(inode);
-	pgoff_t index = size >> PAGE_SHIFT;
-	int offset = offset_in_page(size);
+	pgoff_t first_index = size >> PAGE_SHIFT;
+	pgoff_t last_index = first_index + hpage_nr_pages(page) - 1;
+	unsigned long offset = offset_in_this_page(page, size);
 
 	if (page->mapping != inode->i_mapping)
 		return -EFAULT;
 
 	/* page is wholly inside EOF */
-	if (page->index < index)
-		return PAGE_SIZE;
+	if (page->index < first_index)
+		return thp_size(page);
 	/* page is wholly past EOF */
-	if (page->index > index || !offset)
+	if (page->index > last_index || !offset)
 		return -EFAULT;
 	/* page is partially inside EOF */
 	return offset;
-- 
2.25.0



* [PATCH v2 12/25] mm: Add file_offset_of_ helpers
  2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
                   ` (10 preceding siblings ...)
  2020-02-12  4:18 ` [PATCH v2 11/25] fs: Make page_mkwrite_check_truncate thp-aware Matthew Wilcox
@ 2020-02-12  4:18 ` Matthew Wilcox
  2020-02-12  7:46   ` Christoph Hellwig
  2020-02-12  4:18 ` [PATCH v2 13/25] fs: Add zero_user_large Matthew Wilcox
                   ` (12 subsequent siblings)
  24 siblings, 1 reply; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

The page_offset function is badly named for people reading the functions
which call it.  The natural meaning of a function with this name would
be 'offset within a page', not 'page offset in bytes within a file'.
Dave Chinner suggests file_offset_of_page() as a replacement function
name and I'm also adding file_offset_of_next_page() as a helper for the
large page work.  Also add kernel-doc for these functions so they show
up in the kernel API book.

page_offset() is retained as a compatibility define for now.
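
A small before/after illustration of the intended conversion (the iomap
writeback patch later in this series makes exactly this kind of change):

	/* Before: assumes the page is PAGE_SIZE bytes. */
	end_offset = page_offset(page) + PAGE_SIZE;

	/* After: correct for a page of any size. */
	end_offset = file_offset_of_next_page(page);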

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 drivers/net/ethernet/ibm/ibmveth.c |  2 --
 include/linux/pagemap.h            | 25 ++++++++++++++++++++++---
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
index 84121aab7ff1..4cad94ac9bc9 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -978,8 +978,6 @@ static int ibmveth_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 	return -EOPNOTSUPP;
 }
 
-#define page_offset(v) ((unsigned long)(v) & ((1 << 12) - 1))
-
 static int ibmveth_send(struct ibmveth_adapter *adapter,
 			union ibmveth_buf_desc *descs, unsigned long mss)
 {
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2ec33aabdbf6..497197315b73 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -432,14 +432,33 @@ static inline pgoff_t page_to_pgoff(struct page *page)
 	return page_to_index(page);
 }
 
-/*
- * Return byte-offset into filesystem object for page.
+/**
+ * file_offset_of_page - File offset of this page.
+ * @page: Page cache page.
+ *
+ * Context: Any context.
+ * Return: The offset of the first byte of this page.
  */
-static inline loff_t page_offset(struct page *page)
+static inline loff_t file_offset_of_page(struct page *page)
 {
 	return ((loff_t)page->index) << PAGE_SHIFT;
 }
 
+/* Legacy; please convert callers */
+#define page_offset(page)	file_offset_of_page(page)
+
+/**
+ * file_offset_of_next_page - File offset of the next page.
+ * @page: Page cache page.
+ *
+ * Context: Any context.
+ * Return: The offset of the first byte after this page.
+ */
+static inline loff_t file_offset_of_next_page(struct page *page)
+{
+	return file_offset_of_page(page) + thp_size(page);
+}
+
 static inline loff_t page_file_offset(struct page *page)
 {
 	return ((loff_t)page_index(page)) << PAGE_SHIFT;
-- 
2.25.0



* [PATCH v2 13/25] fs: Add zero_user_large
  2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
                   ` (11 preceding siblings ...)
  2020-02-12  4:18 ` [PATCH v2 12/25] mm: Add file_offset_of_ helpers Matthew Wilcox
@ 2020-02-12  4:18 ` Matthew Wilcox
  2020-02-14 13:52   ` Kirill A. Shutemov
  2020-02-12  4:18 ` [PATCH v2 14/25] iomap: Support arbitrarily many blocks per page Matthew Wilcox
                   ` (11 subsequent siblings)
  24 siblings, 1 reply; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

We can't kmap() a THP, so add a wrapper around zero_user() for large
pages.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/highmem.h | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index ea5cdbd8c2c3..4465b8784353 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -245,6 +245,28 @@ static inline void zero_user(struct page *page,
 	zero_user_segments(page, start, start + size, 0, 0);
 }
 
+static inline void zero_user_large(struct page *page,
+		unsigned start, unsigned size)
+{
+	unsigned int i;
+
+	for (i = 0; i < hpage_nr_pages(page); i++) {
+		if (start > PAGE_SIZE) {
+			start -= PAGE_SIZE;
+		} else {
+			unsigned this_size = size;
+
+			if (size > (PAGE_SIZE - start))
+				this_size = PAGE_SIZE - start;
+			zero_user(page + i, start, this_size);
+			start = 0;
+			size -= this_size;
+			if (!size)
+				break;
+		}
+	}
+}
+
 #ifndef __HAVE_ARCH_COPY_USER_HIGHPAGE
 
 static inline void copy_user_highpage(struct page *to, struct page *from,
-- 
2.25.0



* [PATCH v2 14/25] iomap: Support arbitrarily many blocks per page
  2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
                   ` (12 preceding siblings ...)
  2020-02-12  4:18 ` [PATCH v2 13/25] fs: Add zero_user_large Matthew Wilcox
@ 2020-02-12  4:18 ` Matthew Wilcox
  2020-02-12  8:05   ` Christoph Hellwig
  2020-02-12  4:18 ` [PATCH v2 15/25] iomap: Support large pages in iomap_adjust_read_range Matthew Wilcox
                   ` (10 subsequent siblings)
  24 siblings, 1 reply; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Size the uptodate array dynamically.  Now that this array is protected
by a spinlock, we can use bitmap functions to set the bits in this array
instead of a loop around set_bit().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 27 +++++++++------------------
 1 file changed, 9 insertions(+), 18 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index c551a48e2a81..5e5a6b038fc3 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -22,14 +22,14 @@
 #include "../internal.h"
 
 /*
- * Structure allocated for each page when block size < PAGE_SIZE to track
+ * Structure allocated for each page when block size < page size to track
  * sub-page uptodate status and I/O completions.
  */
 struct iomap_page {
 	atomic_t		read_count;
 	atomic_t		write_count;
 	spinlock_t		uptodate_lock;
-	DECLARE_BITMAP(uptodate, PAGE_SIZE / 512);
+	unsigned long		uptodate[];
 };
 
 static inline struct iomap_page *to_iomap_page(struct page *page)
@@ -45,15 +45,14 @@ static struct iomap_page *
 iomap_page_create(struct inode *inode, struct page *page)
 {
 	struct iomap_page *iop = to_iomap_page(page);
+	unsigned int n, nr_blocks = i_blocks_per_page(inode, page);
 
-	if (iop || i_blocks_per_page(inode, page) <= 1)
+	if (iop || nr_blocks <= 1)
 		return iop;
 
-	iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
-	atomic_set(&iop->read_count, 0);
-	atomic_set(&iop->write_count, 0);
+	n = BITS_TO_LONGS(nr_blocks);
+	iop = kzalloc(struct_size(iop, uptodate, n), GFP_NOFS | __GFP_NOFAIL);
 	spin_lock_init(&iop->uptodate_lock);
-	bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
 
 	/*
 	 * migrate_page_move_mapping() assumes that pages with private data have
@@ -146,20 +145,12 @@ iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len)
 	struct iomap_page *iop = to_iomap_page(page);
 	struct inode *inode = page->mapping->host;
 	unsigned first = off >> inode->i_blkbits;
-	unsigned last = (off + len - 1) >> inode->i_blkbits;
-	bool uptodate = true;
+	unsigned count = len >> inode->i_blkbits;
 	unsigned long flags;
-	unsigned int i;
 
 	spin_lock_irqsave(&iop->uptodate_lock, flags);
-	for (i = 0; i < i_blocks_per_page(inode, page); i++) {
-		if (i >= first && i <= last)
-			set_bit(i, iop->uptodate);
-		else if (!test_bit(i, iop->uptodate))
-			uptodate = false;
-	}
-
-	if (uptodate)
+	bitmap_set(iop->uptodate, first, count);
+	if (bitmap_full(iop->uptodate, i_blocks_per_page(inode, page)))
 		SetPageUptodate(page);
 	spin_unlock_irqrestore(&iop->uptodate_lock, flags);
 }
-- 
2.25.0



* [PATCH v2 15/25] iomap: Support large pages in iomap_adjust_read_range
  2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
                   ` (13 preceding siblings ...)
  2020-02-12  4:18 ` [PATCH v2 14/25] iomap: Support arbitrarily many blocks per page Matthew Wilcox
@ 2020-02-12  4:18 ` Matthew Wilcox
  2020-02-12  8:11   ` Christoph Hellwig
  2020-02-12  4:18 ` [PATCH v2 16/25] iomap: Support large pages in read paths Matthew Wilcox
                   ` (9 subsequent siblings)
  24 siblings, 1 reply; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Pass the struct page instead of the iomap_page so we can determine the
size of the page.  Introduce offset_in_this_page() and use thp_size()
instead of PAGE_SIZE.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 16 +++++++++-------
 include/linux/mm.h     |  2 ++
 2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 5e5a6b038fc3..e522039f627f 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -83,15 +83,16 @@ iomap_page_release(struct page *page)
  * Calculate the range inside the page that we actually need to read.
  */
 static void
-iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
+iomap_adjust_read_range(struct inode *inode, struct page *page,
 		loff_t *pos, loff_t length, unsigned *offp, unsigned *lenp)
 {
+	struct iomap_page *iop = to_iomap_page(page);
 	loff_t orig_pos = *pos;
 	loff_t isize = i_size_read(inode);
 	unsigned block_bits = inode->i_blkbits;
 	unsigned block_size = (1 << block_bits);
-	unsigned poff = offset_in_page(*pos);
-	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
+	unsigned poff = offset_in_this_page(page, *pos);
+	unsigned plen = min_t(loff_t, thp_size(page) - poff, length);
 	unsigned first = poff >> block_bits;
 	unsigned last = (poff + plen - 1) >> block_bits;
 
@@ -129,7 +130,8 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
 	 * page cache for blocks that are entirely outside of i_size.
 	 */
 	if (orig_pos <= isize && orig_pos + length > isize) {
-		unsigned end = offset_in_page(isize - 1) >> block_bits;
+		unsigned end = offset_in_this_page(page, isize - 1) >>
+				block_bits;
 
 		if (first <= end && last > end)
 			plen -= (last - end) * block_size;
@@ -256,7 +258,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	}
 
 	/* zero post-eof blocks as the page may be mapped */
-	iomap_adjust_read_range(inode, iop, &pos, length, &poff, &plen);
+	iomap_adjust_read_range(inode, page, &pos, length, &poff, &plen);
 	if (plen == 0)
 		goto done;
 
@@ -547,7 +549,6 @@ static int
 __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
 		struct page *page, struct iomap *srcmap)
 {
-	struct iomap_page *iop = iomap_page_create(inode, page);
 	loff_t block_size = i_blocksize(inode);
 	loff_t block_start = pos & ~(block_size - 1);
 	loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
@@ -556,9 +557,10 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
 
 	if (PageUptodate(page))
 		return 0;
+	iomap_page_create(inode, page);
 
 	do {
-		iomap_adjust_read_range(inode, iop, &block_start,
+		iomap_adjust_read_range(inode, page, &block_start,
 				block_end - block_start, &poff, &plen);
 		if (plen == 0)
 			break;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 52269e56c514..b4bf86590096 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1387,6 +1387,8 @@ static inline void clear_page_pfmemalloc(struct page *page)
 extern void pagefault_out_of_memory(void);
 
 #define offset_in_page(p)	((unsigned long)(p) & ~PAGE_MASK)
+#define offset_in_this_page(page, p)	\
+	((unsigned long)(p) & (thp_size(page) - 1))
 
 /*
  * Flags passed to show_mem() and show_free_areas() to suppress output in
-- 
2.25.0



* [PATCH v2 16/25] iomap: Support large pages in read paths
  2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
                   ` (14 preceding siblings ...)
  2020-02-12  4:18 ` [PATCH v2 15/25] iomap: Support large pages in iomap_adjust_read_range Matthew Wilcox
@ 2020-02-12  4:18 ` Matthew Wilcox
  2020-02-12  8:13   ` Christoph Hellwig
  2020-02-12  4:18 ` [PATCH v2 17/25] iomap: Support large pages in write paths Matthew Wilcox
                   ` (8 subsequent siblings)
  24 siblings, 1 reply; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use thp_size() instead of PAGE_SIZE, use offset_in_this_page() and
abstract away how to access the list of readahead pages.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index e522039f627f..68f8903ecd6d 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -179,14 +179,16 @@ iomap_read_finish(struct iomap_page *iop, struct page *page)
 static void
 iomap_read_page_end_io(struct bio_vec *bvec, int error)
 {
-	struct page *page = bvec->bv_page;
+	struct page *page = compound_head(bvec->bv_page);
 	struct iomap_page *iop = to_iomap_page(page);
+	unsigned offset = bvec->bv_offset +
+				PAGE_SIZE * (bvec->bv_page - page);
 
 	if (unlikely(error)) {
 		ClearPageUptodate(page);
 		SetPageError(page);
 	} else {
-		iomap_set_range_uptodate(page, bvec->bv_offset, bvec->bv_len);
+		iomap_set_range_uptodate(page, offset, bvec->bv_len);
 	}
 
 	iomap_read_finish(iop, page);
@@ -239,6 +241,16 @@ static inline bool iomap_block_needs_zeroing(struct inode *inode,
 		pos >= i_size_read(inode);
 }
 
+/*
+ * Estimate the number of vectors we need based on the current page size;
+ * if we're wrong we'll end up doing an overly large allocation or needing
+ * to do a second allocation, neither of which is a big deal.
+ */
+static unsigned int iomap_nr_vecs(struct page *page, loff_t length)
+{
+	return (length + thp_size(page) - 1) >> page_shift(page);
+}
+
 static loff_t
 iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap, struct iomap *srcmap)
@@ -263,7 +275,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		goto done;
 
 	if (iomap_block_needs_zeroing(inode, iomap, pos)) {
-		zero_user(page, poff, plen);
+		zero_user_large(page, poff, plen);
 		iomap_set_range_uptodate(page, poff, plen);
 		goto done;
 	}
@@ -294,7 +306,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 
 	if (!ctx->bio || !is_contig || bio_full(ctx->bio, plen)) {
 		gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
-		int nr_vecs = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
+		int nr_vecs = iomap_nr_vecs(page, length);
 
 		if (ctx->bio)
 			submit_bio(ctx->bio);
@@ -331,9 +343,9 @@ iomap_readpage(struct page *page, const struct iomap_ops *ops)
 
 	trace_iomap_readpage(page->mapping->host, 1);
 
-	for (poff = 0; poff < PAGE_SIZE; poff += ret) {
-		ret = iomap_apply(inode, page_offset(page) + poff,
-				PAGE_SIZE - poff, 0, ops, &ctx,
+	for (poff = 0; poff < thp_size(page); poff += ret) {
+		ret = iomap_apply(inode, file_offset_of_page(page) + poff,
+				thp_size(page) - poff, 0, ops, &ctx,
 				iomap_readpage_actor);
 		if (ret <= 0) {
 			WARN_ON_ONCE(ret == 0);
@@ -376,7 +388,7 @@ iomap_readahead_actor(struct inode *inode, loff_t pos, loff_t length,
 		if (WARN_ON(ret == 0))
 			break;
 		done += ret;
-		if (offset_in_page(pos + done) == 0) {
+		if (offset_in_this_page(ctx->cur_page, pos + done) == 0) {
 			ctx->rac->nr_pages -= ctx->rac->batch_count;
 			if (!ctx->cur_page_in_bio)
 				unlock_page(ctx->cur_page);
-- 
2.25.0



* [PATCH v2 17/25] iomap: Support large pages in write paths
  2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
                   ` (15 preceding siblings ...)
  2020-02-12  4:18 ` [PATCH v2 16/25] iomap: Support large pages in read paths Matthew Wilcox
@ 2020-02-12  4:18 ` Matthew Wilcox
  2020-02-12  8:17   ` Christoph Hellwig
  2020-02-12  4:18 ` [PATCH v2 18/25] iomap: Inline data shouldn't see large pages Matthew Wilcox
                   ` (7 subsequent siblings)
  24 siblings, 1 reply; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use thp_size() instead of PAGE_SIZE and offset_in_this_page() instead
of offset_in_page().  Also simplify the logic in iomap_do_writepage()
for determining the end of file.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 36 +++++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 68f8903ecd6d..af1f56408fcd 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -445,7 +445,7 @@ iomap_is_partially_uptodate(struct page *page, unsigned long from,
 	unsigned i;
 
 	/* Limit range to one page */
-	len = min_t(unsigned, PAGE_SIZE - from, count);
+	len = min_t(unsigned, thp_size(page) - from, count);
 
 	/* First and last blocks in range within page */
 	first = from >> inode->i_blkbits;
@@ -488,7 +488,7 @@ iomap_invalidatepage(struct page *page, unsigned int offset, unsigned int len)
 	 * If we are invalidating the entire page, clear the dirty state from it
 	 * and release it to avoid unnecessary buildup of the LRU.
 	 */
-	if (offset == 0 && len == PAGE_SIZE) {
+	if (offset == 0 && len == thp_size(page)) {
 		WARN_ON_ONCE(PageWriteback(page));
 		cancel_dirty_page(page);
 		iomap_page_release(page);
@@ -564,7 +564,9 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
 	loff_t block_size = i_blocksize(inode);
 	loff_t block_start = pos & ~(block_size - 1);
 	loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
-	unsigned from = offset_in_page(pos), to = from + len, poff, plen;
+	unsigned from = offset_in_this_page(page, pos);
+	unsigned to = from + len;
+	unsigned poff, plen;
 	int status;
 
 	if (PageUptodate(page))
@@ -696,7 +698,7 @@ __iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
 	 */
 	if (unlikely(copied < len && !PageUptodate(page)))
 		return 0;
-	iomap_set_range_uptodate(page, offset_in_page(pos), len);
+	iomap_set_range_uptodate(page, offset_in_this_page(page, pos), len);
 	iomap_set_page_dirty(page);
 	return copied;
 }
@@ -771,6 +773,10 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
 
+		/*
+		 * XXX: We don't know what size page we'll find in the
+		 * page cache, so only copy up to a regular page boundary.
+		 */
 		offset = offset_in_page(pos);
 		bytes = min_t(unsigned long, PAGE_SIZE - offset,
 						iov_iter_count(i));
@@ -1320,7 +1326,7 @@ iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page,
 {
 	sector_t sector = iomap_sector(&wpc->iomap, offset);
 	unsigned len = i_blocksize(inode);
-	unsigned poff = offset & (PAGE_SIZE - 1);
+	unsigned poff = offset & (thp_size(page) - 1);
 	bool merged, same_page = false;
 
 	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, sector)) {
@@ -1372,9 +1378,10 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	unsigned len = i_blocksize(inode);
 	u64 file_offset; /* file offset of page */
 	int error = 0, count = 0, i;
+	int nr_blocks = i_blocks_per_page(inode, page);
 	LIST_HEAD(submit_list);
 
-	WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
+	WARN_ON_ONCE(nr_blocks > 1 && !iop);
 	WARN_ON_ONCE(iop && atomic_read(&iop->write_count) != 0);
 
 	/*
@@ -1382,8 +1389,8 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	 * end of the current map or find the current map invalid, grab a new
 	 * one.
 	 */
-	for (i = 0, file_offset = page_offset(page);
-	     i < (PAGE_SIZE >> inode->i_blkbits) && file_offset < end_offset;
+	for (i = 0, file_offset = file_offset_of_page(page);
+	     i < nr_blocks && file_offset < end_offset;
 	     i++, file_offset += len) {
 		if (iop && !test_bit(i, iop->uptodate))
 			continue;
@@ -1477,7 +1484,6 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 {
 	struct iomap_writepage_ctx *wpc = data;
 	struct inode *inode = page->mapping->host;
-	pgoff_t end_index;
 	u64 end_offset;
 	loff_t offset;
 
@@ -1518,10 +1524,9 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 	 * ---------------------------------^------------------|
 	 */
 	offset = i_size_read(inode);
-	end_index = offset >> PAGE_SHIFT;
-	if (page->index < end_index)
-		end_offset = (loff_t)(page->index + 1) << PAGE_SHIFT;
-	else {
+	end_offset = file_offset_of_next_page(page);
+
+	if (end_offset > offset) {
 		/*
 		 * Check whether the page to write out is beyond or straddles
 		 * i_size or not.
@@ -1533,7 +1538,8 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 		 * |				    |      Straddles     |
 		 * ---------------------------------^-----------|--------|
 		 */
-		unsigned offset_into_page = offset & (PAGE_SIZE - 1);
+		unsigned offset_into_page = offset_in_this_page(page, offset);
+		pgoff_t end_index = offset >> PAGE_SHIFT;
 
 		/*
 		 * Skip the page if it is fully outside i_size, e.g. due to a
@@ -1564,7 +1570,7 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 		 * memory is zeroed when mapped, and writes to that region are
 		 * not written out to the file."
 		 */
-		zero_user_segment(page, offset_into_page, PAGE_SIZE);
+		zero_user_segment(page, offset_into_page, thp_size(page));
 
 		/* Adjust the end_offset to the end of file */
 		end_offset = offset;
-- 
2.25.0



* [PATCH v2 18/25] iomap: Inline data shouldn't see large pages
  2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
                   ` (16 preceding siblings ...)
  2020-02-12  4:18 ` [PATCH v2 17/25] iomap: Support large pages in write paths Matthew Wilcox
@ 2020-02-12  4:18 ` Matthew Wilcox
  2020-02-12  8:05   ` Christoph Hellwig
  2020-02-12  4:18 ` [PATCH v2 19/25] xfs: Support " Matthew Wilcox
                   ` (6 subsequent siblings)
  24 siblings, 1 reply; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Assert that we're not seeing large pages in functions that read/write
inline data, rather than zeroing out the tail.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index af1f56408fcd..a7a21b99b3a0 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -224,6 +224,7 @@ iomap_read_inline_data(struct inode *inode, struct page *page,
 		return;
 
 	BUG_ON(page->index);
+	BUG_ON(PageCompound(page));
 	BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
 
 	addr = kmap_atomic(page);
@@ -710,6 +711,7 @@ iomap_write_end_inline(struct inode *inode, struct page *page,
 	void *addr;
 
 	WARN_ON_ONCE(!PageUptodate(page));
+	BUG_ON(PageCompound(page));
 	BUG_ON(pos + copied > PAGE_SIZE - offset_in_page(iomap->inline_data));
 
 	addr = kmap_atomic(page);
-- 
2.25.0



* [PATCH v2 19/25] xfs: Support large pages
  2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
                   ` (17 preceding siblings ...)
  2020-02-12  4:18 ` [PATCH v2 18/25] iomap: Inline data shouldn't see large pages Matthew Wilcox
@ 2020-02-12  4:18 ` Matthew Wilcox
  2020-02-12  4:18 ` [PATCH v2 20/25] mm: Make prep_transhuge_page return its argument Matthew Wilcox
                   ` (5 subsequent siblings)
  24 siblings, 0 replies; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

There is one place which assumes the size of a page; fix it, and set the
FS_LARGE_PAGES flag so the page cache knows that XFS can handle large pages.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/xfs/xfs_aops.c  | 2 +-
 fs/xfs/xfs_super.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 5573bf2957dd..0c10fd799f8c 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -548,7 +548,7 @@ xfs_discard_page(
 	if (error && !XFS_FORCED_SHUTDOWN(mp))
 		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
 out_invalidate:
-	iomap_invalidatepage(page, 0, PAGE_SIZE);
+	iomap_invalidatepage(page, 0, thp_size(page));
 }
 
 static const struct iomap_writeback_ops xfs_writeback_ops = {
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 2094386af8ac..a02efa1f72af 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1779,7 +1779,7 @@ static struct file_system_type xfs_fs_type = {
 	.init_fs_context	= xfs_init_fs_context,
 	.parameters		= xfs_fs_parameters,
 	.kill_sb		= kill_block_super,
-	.fs_flags		= FS_REQUIRES_DEV,
+	.fs_flags		= FS_REQUIRES_DEV | FS_LARGE_PAGES,
 };
 MODULE_ALIAS_FS("xfs");
 
-- 
2.25.0



* [PATCH v2 20/25] mm: Make prep_transhuge_page return its argument
  2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
                   ` (18 preceding siblings ...)
  2020-02-12  4:18 ` [PATCH v2 19/25] xfs: Support " Matthew Wilcox
@ 2020-02-12  4:18 ` Matthew Wilcox
  2020-02-12  4:18 ` [PATCH v2 21/25] mm: Add __page_cache_alloc_order Matthew Wilcox
                   ` (4 subsequent siblings)
  24 siblings, 0 replies; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm
  Cc: Matthew Wilcox (Oracle), linux-kernel, Kirill A . Shutemov

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

By permitting NULL or order-0 pages as an argument, and returning the
argument, callers can write:

	return prep_transhuge_page(alloc_pages(...));

instead of assigning the result to a temporary variable and conditionally
passing that to prep_transhuge_page().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h | 7 +++++--
 mm/huge_memory.c        | 9 +++++++--
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3de788ee25bd..865b9c16c99c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -158,7 +158,7 @@ extern unsigned long thp_get_unmapped_area(struct file *filp,
 		unsigned long addr, unsigned long len, unsigned long pgoff,
 		unsigned long flags);
 
-extern void prep_transhuge_page(struct page *page);
+extern struct page *prep_transhuge_page(struct page *page);
 extern void free_transhuge_page(struct page *page);
 bool is_transparent_hugepage(struct page *page);
 
@@ -307,7 +307,10 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
 	return false;
 }
 
-static inline void prep_transhuge_page(struct page *page) {}
+static inline struct page *prep_transhuge_page(struct page *page)
+{
+	return page;
+}
 
 static inline bool is_transparent_hugepage(struct page *page)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b08b199f9a11..b52e007f0856 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -508,15 +508,20 @@ static inline struct deferred_split *get_deferred_split_queue(struct page *page)
 }
 #endif
 
-void prep_transhuge_page(struct page *page)
+struct page *prep_transhuge_page(struct page *page)
 {
+	if (!page || compound_order(page) == 0)
+		return page;
 	/*
-	 * we use page->mapping and page->indexlru in second tail page
+	 * we use page->mapping and page->index in second tail page
 	 * as list_head: assuming THP order >= 2
 	 */
+	BUG_ON(compound_order(page) == 1);
 
 	INIT_LIST_HEAD(page_deferred_list(page));
 	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
+
+	return page;
 }
 
 bool is_transparent_hugepage(struct page *page)
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v2 21/25] mm: Add __page_cache_alloc_order
  2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
                   ` (19 preceding siblings ...)
  2020-02-12  4:18 ` [PATCH v2 20/25] mm: Make prep_transhuge_page return its argument Matthew Wilcox
@ 2020-02-12  4:18 ` Matthew Wilcox
  2020-02-12  4:18 ` [PATCH v2 22/25] mm: Allow large pages to be added to the page cache Matthew Wilcox
                   ` (3 subsequent siblings)
  24 siblings, 0 replies; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm
  Cc: Matthew Wilcox (Oracle), linux-kernel, Kirill A . Shutemov

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

This new function allows page cache pages to be allocated that are
larger than an order-0 page.
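
A sketch of a hypothetical caller (the real users are wired up later in
this series; the order of 2 is only an example):

	/* allocate an order-2 (four page) compound page for the page cache */
	page = __page_cache_alloc_order(mapping_gfp_mask(mapping), 2);
	if (!page)
		return -ENOMEM;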

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/pagemap.h | 24 +++++++++++++++++++++---
 mm/filemap.c            | 12 ++++++++----
 2 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 497197315b73..64a3cf79611f 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -207,15 +207,33 @@ static inline int page_cache_add_speculative(struct page *page, int count)
 	return __page_cache_add_speculative(page, count);
 }
 
+static inline gfp_t thp_gfpmask(gfp_t gfp)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	/* We'd rather allocate smaller pages than stall a page fault */
+	gfp |= GFP_TRANSHUGE_LIGHT;
+	gfp &= ~__GFP_DIRECT_RECLAIM;
+#endif
+	return gfp;
+}
+
 #ifdef CONFIG_NUMA
-extern struct page *__page_cache_alloc(gfp_t gfp);
+extern struct page *__page_cache_alloc_order(gfp_t gfp, unsigned int order);
 #else
-static inline struct page *__page_cache_alloc(gfp_t gfp)
+static inline
+struct page *__page_cache_alloc_order(gfp_t gfp, unsigned int order)
 {
-	return alloc_pages(gfp, 0);
+	if (order == 0)
+		return alloc_pages(gfp, 0);
+	return prep_transhuge_page(alloc_pages(thp_gfpmask(gfp), order));
 }
 #endif
 
+static inline struct page *__page_cache_alloc(gfp_t gfp)
+{
+	return __page_cache_alloc_order(gfp, 0);
+}
+
 static inline struct page *page_cache_alloc(struct address_space *x)
 {
 	return __page_cache_alloc(mapping_gfp_mask(x));
diff --git a/mm/filemap.c b/mm/filemap.c
index 3204293f9b58..1061463a169e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -941,24 +941,28 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
 
 #ifdef CONFIG_NUMA
-struct page *__page_cache_alloc(gfp_t gfp)
+struct page *__page_cache_alloc_order(gfp_t gfp, unsigned int order)
 {
 	int n;
 	struct page *page;
 
+	if (order > 0)
+		gfp = thp_gfpmask(gfp);
+
 	if (cpuset_do_page_mem_spread()) {
 		unsigned int cpuset_mems_cookie;
 		do {
 			cpuset_mems_cookie = read_mems_allowed_begin();
 			n = cpuset_mem_spread_node();
-			page = __alloc_pages_node(n, gfp, 0);
+			page = __alloc_pages_node(n, gfp, order);
+			prep_transhuge_page(page);
 		} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));
 
 		return page;
 	}
-	return alloc_pages(gfp, 0);
+	return prep_transhuge_page(alloc_pages(gfp, order));
 }
-EXPORT_SYMBOL(__page_cache_alloc);
+EXPORT_SYMBOL(__page_cache_alloc_order);
 #endif
 
 /*
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v2 22/25] mm: Allow large pages to be added to the page cache
  2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
                   ` (20 preceding siblings ...)
  2020-02-12  4:18 ` [PATCH v2 21/25] mm: Add __page_cache_alloc_order Matthew Wilcox
@ 2020-02-12  4:18 ` Matthew Wilcox
  2020-02-12  4:18 ` [PATCH v2 23/25] mm: Allow large pages to be removed from " Matthew Wilcox
                   ` (2 subsequent siblings)
  24 siblings, 0 replies; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

We return -EEXIST if there are any non-shadow entries in the page
cache in the range covered by the large page.  If there are multiple
shadow entries in the range, we set *shadowp to one of them (currently
the one at the highest index).  If that turns out to be the wrong
answer, we can implement something more complex.  This is mostly
modelled after the equivalent function in the shmem code.
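
As a worked example of those semantics (hypothetical indices): adding an
order-2 page at index 8 covers indices 8-11.  If index 10 already holds a
real page, the call fails with -EEXIST; if indices 9 and 11 instead hold
shadow entries, the store succeeds and *shadowp ends up pointing at the
entry from index 11, the highest index seen.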

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/filemap.c | 46 +++++++++++++++++++++++++++++++++-------------
 1 file changed, 33 insertions(+), 13 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 1061463a169e..08b5cd4ce47b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -834,6 +834,7 @@ static int __add_to_page_cache_locked(struct page *page,
 	int huge = PageHuge(page);
 	struct mem_cgroup *memcg;
 	int error;
+	unsigned int nr = 1;
 	void *old;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -845,31 +846,50 @@ static int __add_to_page_cache_locked(struct page *page,
 					      gfp_mask, &memcg, false);
 		if (error)
 			return error;
+		xas_set_order(&xas, offset, thp_order(page));
+		nr = hpage_nr_pages(page);
 	}
 
-	get_page(page);
+	page_ref_add(page, nr);
 	page->mapping = mapping;
 	page->index = offset;
 
 	do {
+		unsigned long exceptional = 0;
+		unsigned int i = 0;
+
 		xas_lock_irq(&xas);
-		old = xas_load(&xas);
-		if (old && !xa_is_value(old))
-			xas_set_err(&xas, -EEXIST);
-		xas_store(&xas, page);
+		xas_for_each_conflict(&xas, old) {
+			if (!xa_is_value(old)) {
+				xas_set_err(&xas, -EEXIST);
+				break;
+			}
+			exceptional++;
+			if (shadowp)
+				*shadowp = old;
+		}
+		xas_create_range(&xas);
 		if (xas_error(&xas))
 			goto unlock;
 
-		if (xa_is_value(old)) {
-			mapping->nrexceptional--;
-			if (shadowp)
-				*shadowp = old;
+next:
+		xas_store(&xas, page);
+		if (++i < nr) {
+			xas_next(&xas);
+			goto next;
 		}
-		mapping->nrpages++;
+		mapping->nrexceptional -= exceptional;
+		mapping->nrpages += nr;
 
 		/* hugetlb pages do not participate in page cache accounting */
-		if (!huge)
-			__inc_node_page_state(page, NR_FILE_PAGES);
+		if (!huge) {
+			__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES,
+						nr);
+			if (nr > 1) {
+				__inc_node_page_state(page, NR_FILE_THPS);
+				filemap_nr_thps_inc(mapping);
+			}
+		}
 unlock:
 		xas_unlock_irq(&xas);
 	} while (xas_nomem(&xas, gfp_mask & GFP_RECLAIM_MASK));
@@ -886,7 +906,7 @@ static int __add_to_page_cache_locked(struct page *page,
 	/* Leave page->index set: truncation relies upon it */
 	if (!huge)
 		mem_cgroup_cancel_charge(page, memcg, false);
-	put_page(page);
+	page_ref_sub(page, nr);
 	return xas_error(&xas);
 }
 ALLOW_ERROR_INJECTION(__add_to_page_cache_locked, ERRNO);
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v2 23/25] mm: Allow large pages to be removed from the page cache
  2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
                   ` (21 preceding siblings ...)
  2020-02-12  4:18 ` [PATCH v2 22/25] mm: Allow large pages to be added to the page cache Matthew Wilcox
@ 2020-02-12  4:18 ` Matthew Wilcox
  2020-02-12  4:18 ` [PATCH v2 24/25] mm: Add large page readahead Matthew Wilcox
  2020-02-12  4:18 ` [PATCH v2 25/25] mm: Align THP mappings for non-DAX Matthew Wilcox
  24 siblings, 0 replies; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

page_cache_free_page() assumes compound pages are PMD_SIZE; fix
that assumption.
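
(For instance, an order-4 page cache page now drops 16 references here
rather than HPAGE_PMD_NR, which is 512 on x86-64; hypothetical numbers,
for illustration only.)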

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/filemap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 08b5cd4ce47b..e74a22af7e4e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -248,7 +248,7 @@ static void page_cache_free_page(struct address_space *mapping,
 		freepage(page);
 
 	if (PageTransHuge(page) && !PageHuge(page)) {
-		page_ref_sub(page, HPAGE_PMD_NR);
+		page_ref_sub(page, hpage_nr_pages(page));
 		VM_BUG_ON_PAGE(page_count(page) <= 0, page);
 	} else {
 		put_page(page);
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v2 24/25] mm: Add large page readahead
  2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
                   ` (22 preceding siblings ...)
  2020-02-12  4:18 ` [PATCH v2 23/25] mm: Allow large pages to be removed from " Matthew Wilcox
@ 2020-02-12  4:18 ` Matthew Wilcox
  2020-02-12  4:18 ` [PATCH v2 25/25] mm: Align THP mappings for non-DAX Matthew Wilcox
  24 siblings, 0 replies; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle), linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

If the filesystem supports large pages, allocate larger pages in the
readahead code when it seems worth doing.  The heuristic for choosing
larger page sizes will surely need some tuning, but this aggressive
ramp-up seems good for testing.
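
To make the ramp-up concrete, a hypothetical sequence on x86-64 (where
PMD_ORDER is 9) with a readahead window large enough that ra->size never
limits the order: each pass grows the order by two, so successive rounds
allocate order-2, order-4, order-6 and order-8 pages before being clamped
to order-9.  With a smaller window, the "while ((1 << order) > ra->size)"
check in page_cache_readahead_order() holds the order down.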

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/readahead.c | 98 +++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 93 insertions(+), 5 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 29ca25c8f01e..b582f09aa7e3 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -406,13 +406,96 @@ static int try_context_readahead(struct address_space *mapping,
 	return 1;
 }
 
+static inline int ra_alloc_page(struct address_space *mapping, pgoff_t offset,
+		pgoff_t mark, unsigned int order, gfp_t gfp)
+{
+	int err;
+	struct page *page = __page_cache_alloc_order(gfp, order);
+
+	if (!page)
+		return -ENOMEM;
+	if (mark - offset < (1UL << order))
+		SetPageReadahead(page);
+	err = add_to_page_cache_lru(page, mapping, offset, gfp);
+	if (err)
+		put_page(page);
+	return err;
+}
+
+#define PMD_ORDER	(PMD_SHIFT - PAGE_SHIFT)
+
+static unsigned long page_cache_readahead_order(struct address_space *mapping,
+		struct file_ra_state *ra, struct file *file, unsigned int order)
+{
+	struct readahead_control rac = {
+		.mapping = mapping,
+		.file = file,
+		.start = ra->start,
+		.nr_pages = 0,
+	};
+	unsigned int old_order = order;
+	pgoff_t offset = ra->start;
+	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
+	pgoff_t mark = offset + ra->size - ra->async_size;
+	int err = 0;
+	gfp_t gfp = readahead_gfp_mask(mapping);
+
+	limit = min(limit, offset + ra->size - 1);
+
+	/* Grow page size up to PMD size */
+	if (order < PMD_ORDER) {
+		order += 2;
+		if (order > PMD_ORDER)
+			order = PMD_ORDER;
+		while ((1 << order) > ra->size)
+			order--;
+	}
+
+	/* If size is somehow misaligned, fill with order-0 pages */
+	while (!err && offset & ((1UL << old_order) - 1)) {
+		err = ra_alloc_page(mapping, offset++, mark, 0, gfp);
+		if (!err)
+			rac.nr_pages++;
+	}
+
+	while (!err && offset & ((1UL << order) - 1)) {
+		err = ra_alloc_page(mapping, offset, mark, old_order, gfp);
+		if (!err)
+			rac.nr_pages += 1UL << old_order;
+		offset += 1UL << old_order;
+	}
+
+	while (!err && offset <= limit) {
+		err = ra_alloc_page(mapping, offset, mark, order, gfp);
+		if (!err)
+			rac.nr_pages += 1UL << order;
+		offset += 1UL << order;
+	}
+
+	if (offset > limit) {
+		ra->size += offset - limit - 1;
+		ra->async_size += offset - limit - 1;
+	}
+
+	read_pages(&rac, NULL);
+
+	/*
+	 * If there were already pages in the page cache, then we may have
+	 * left some gaps.  Let the regular readahead code take care of this
+	 * situation.
+	 */
+	if (err)
+		return ra_submit(ra, mapping, file);
+	return 0;
+}
+
 /*
  * A minimal readahead algorithm for trivial sequential/random reads.
  */
 static unsigned long
 ondemand_readahead(struct address_space *mapping,
 		   struct file_ra_state *ra, struct file *filp,
-		   bool hit_readahead_marker, pgoff_t offset,
+		   struct page *page, pgoff_t offset,
 		   unsigned long req_size)
 {
 	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
@@ -451,7 +534,7 @@ ondemand_readahead(struct address_space *mapping,
 	 * Query the pagecache for async_size, which normally equals to
 	 * readahead size. Ramp it up and use it as the new readahead size.
 	 */
-	if (hit_readahead_marker) {
+	if (page) {
 		pgoff_t start;
 
 		rcu_read_lock();
@@ -520,7 +603,12 @@ ondemand_readahead(struct address_space *mapping,
 		}
 	}
 
-	return ra_submit(ra, mapping, filp);
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) || !page ||
+	    !(mapping->host->i_sb->s_type->fs_flags & FS_LARGE_PAGES))
+		return ra_submit(ra, mapping, filp);
+
+	return page_cache_readahead_order(mapping, ra, filp,
+						compound_order(page));
 }
 
 /**
@@ -555,7 +643,7 @@ void page_cache_sync_readahead(struct address_space *mapping,
 	}
 
 	/* do read-ahead */
-	ondemand_readahead(mapping, ra, filp, false, offset, req_size);
+	ondemand_readahead(mapping, ra, filp, NULL, offset, req_size);
 }
 EXPORT_SYMBOL_GPL(page_cache_sync_readahead);
 
@@ -602,7 +690,7 @@ page_cache_async_readahead(struct address_space *mapping,
 		return;
 
 	/* do read-ahead */
-	ondemand_readahead(mapping, ra, filp, true, offset, req_size);
+	ondemand_readahead(mapping, ra, filp, page, offset, req_size);
 }
 EXPORT_SYMBOL_GPL(page_cache_async_readahead);
 
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v2 25/25] mm: Align THP mappings for non-DAX
  2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
                   ` (23 preceding siblings ...)
  2020-02-12  4:18 ` [PATCH v2 24/25] mm: Add large page readahead Matthew Wilcox
@ 2020-02-12  4:18 ` Matthew Wilcox
  2020-02-12  7:50   ` Christoph Hellwig
  24 siblings, 1 reply; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12  4:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm; +Cc: William Kucharski, linux-kernel, Matthew Wilcox

From: William Kucharski <william.kucharski@oracle.com>

When we have the opportunity to use transparent huge pages to map a
file, we want to follow the same rules as DAX.
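
As an illustration of why the alignment matters (hypothetical numbers):
if the virtual address chosen for a mapping is PMD-aligned and the file
offset being mapped is PMD-aligned too, a PMD-sized page cache page can
later be mapped with a single PMD entry instead of 512 PTEs on x86-64;
without the alignment from __thp_get_unmapped_area() that opportunity is
usually lost.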

Signed-off-by: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/huge_memory.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b52e007f0856..b8d9e0d76062 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -577,13 +577,10 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 	unsigned long ret;
 	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
 
-	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
-		goto out;
-
 	ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PMD_SIZE);
 	if (ret)
 		return ret;
-out:
+
 	return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
 }
 EXPORT_SYMBOL_GPL(thp_get_unmapped_area);
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 01/25] mm: Use vm_fault error code directly
  2020-02-12  4:18 ` [PATCH v2 01/25] mm: Use vm_fault error code directly Matthew Wilcox
@ 2020-02-12  7:34   ` Christoph Hellwig
  0 siblings, 0 replies; 68+ messages in thread
From: Christoph Hellwig @ 2020-02-12  7:34 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel, Kirill A . Shutemov

On Tue, Feb 11, 2020 at 08:18:21PM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Use VM_FAULT_OOM instead of indirecting through vmf_error(-ENOMEM).
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 04/25] mm: Unexport find_get_entry
  2020-02-12  4:18 ` [PATCH v2 04/25] mm: Unexport find_get_entry Matthew Wilcox
@ 2020-02-12  7:37   ` Christoph Hellwig
  2020-02-13 13:51   ` Kirill A. Shutemov
  1 sibling, 0 replies; 68+ messages in thread
From: Christoph Hellwig @ 2020-02-12  7:37 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Feb 11, 2020 at 08:18:24PM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> No in-tree users (proc, madvise, memcg, mincore) can be built as a module.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 03/25] mm: Use VM_BUG_ON_PAGE in clear_page_dirty_for_io
  2020-02-12  4:18 ` [PATCH v2 03/25] mm: Use VM_BUG_ON_PAGE in clear_page_dirty_for_io Matthew Wilcox
@ 2020-02-12  7:38   ` Christoph Hellwig
  2020-02-13 13:50   ` Kirill A. Shutemov
  1 sibling, 0 replies; 68+ messages in thread
From: Christoph Hellwig @ 2020-02-12  7:38 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Feb 11, 2020 at 08:18:23PM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Dumping the page information in this circumstance helps for debugging.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 02/25] mm: Optimise find_subpage for !THP
  2020-02-12  4:18 ` [PATCH v2 02/25] mm: Optimise find_subpage for !THP Matthew Wilcox
@ 2020-02-12  7:41   ` Christoph Hellwig
  2020-02-12 13:02     ` Matthew Wilcox
  0 siblings, 1 reply; 68+ messages in thread
From: Christoph Hellwig @ 2020-02-12  7:41 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Feb 11, 2020 at 08:18:22PM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> If THP is disabled, find_subpage can become a no-op.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/pagemap.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 75bdfec49710..0842622cca90 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -340,7 +340,7 @@ static inline struct page *find_subpage(struct page *page, pgoff_t offset)
>  
>  	VM_BUG_ON_PAGE(PageTail(page), page);
>  
> -	return page + (offset & (compound_nr(page) - 1));
> +	return page + (offset & (hpage_nr_pages(page) - 1));
>  }

So just above the quoted code is a

	if (PageHuge(page))
		return page;

So how can we get into a compound page that is not a huge page, but
only if THP is enabled?  (Yes, maybe I'm a little rusty on VM
internals).

Can you add comments describing the use case of this function and why
it does all these checks?  It looks like black magic to me.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 05/25] mm: Fix documentation of FGP flags
  2020-02-12  4:18 ` [PATCH v2 05/25] mm: Fix documentation of FGP flags Matthew Wilcox
@ 2020-02-12  7:42   ` Christoph Hellwig
  2020-02-12 19:11     ` Matthew Wilcox
  2020-02-13 13:59   ` Kirill A. Shutemov
  1 sibling, 1 reply; 68+ messages in thread
From: Christoph Hellwig @ 2020-02-12  7:42 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Feb 11, 2020 at 08:18:25PM -0800, Matthew Wilcox wrote:
> - * @fgp_flags: PCG flags
> + * @fgp_flags: FGP flags
>   * @gfp_mask: gfp mask to use for the page cache data page allocation
>   *
>   * Looks up the page cache slot at @mapping & @offset.
>   *
> - * PCG flags modify how the page is returned.
> + * FGP flags modify how the page is returned.

This still looks weird.  Why not just a single line:

	* @fgp_flags: FGP_* flags that control how the page is returned.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 09/25] fs: Add a filesystem flag for large pages
  2020-02-12  4:18 ` [PATCH v2 09/25] fs: Add a filesystem flag for large pages Matthew Wilcox
@ 2020-02-12  7:43   ` Christoph Hellwig
  2020-02-12 14:59     ` Matthew Wilcox
  0 siblings, 1 reply; 68+ messages in thread
From: Christoph Hellwig @ 2020-02-12  7:43 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Feb 11, 2020 at 08:18:29PM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> The page cache needs to know whether the filesystem supports pages >
> PAGE_SIZE.

Does it make sense to set this flag on the file_system_type, which
is a rather broad scope, or on a specific superblock or even an inode?

For some file systems we might require on-disk flags that aren't set
for all instances.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 10/25] fs: Introduce i_blocks_per_page
  2020-02-12  4:18 ` [PATCH v2 10/25] fs: Introduce i_blocks_per_page Matthew Wilcox
@ 2020-02-12  7:44   ` Christoph Hellwig
  2020-02-12 15:05     ` Matthew Wilcox
  2020-02-13 15:40   ` Kirill A. Shutemov
  1 sibling, 1 reply; 68+ messages in thread
From: Christoph Hellwig @ 2020-02-12  7:44 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

Looks good modulo some nitpicks below:

Reviewed-by: Christoph Hellwig <hch@lst.de>

> + * Context: Any context.

Does this add any value for a trivial helper like this?

> + * Return: The number of filesystem blocks covered by this page.
> + */
> +static inline
> +unsigned int i_blocks_per_page(struct inode *inode, struct page *page)

static inline unsigned int
i_blocks_per_page(struct inode *inode, struct page *page)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 12/25] mm: Add file_offset_of_ helpers
  2020-02-12  4:18 ` [PATCH v2 12/25] mm: Add file_offset_of_ helpers Matthew Wilcox
@ 2020-02-12  7:46   ` Christoph Hellwig
  0 siblings, 0 replies; 68+ messages in thread
From: Christoph Hellwig @ 2020-02-12  7:46 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

> --- a/drivers/net/ethernet/ibm/ibmveth.c
> +++ b/drivers/net/ethernet/ibm/ibmveth.c
> @@ -978,8 +978,6 @@ static int ibmveth_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
>  	return -EOPNOTSUPP;
>  }
>  
> -#define page_offset(v) ((unsigned long)(v) & ((1 << 12) - 1))

This one realy should be killed off in a separate patch, it has nothing
to do with the kernel-wide page_offset.

> +/* Legacy; please convert callers */
> +#define page_offset(page)	file_offset_of_page(page)

I'd say send a script to Linus to convert it as soon as the change is
in.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 25/25] mm: Align THP mappings for non-DAX
  2020-02-12  4:18 ` [PATCH v2 25/25] mm: Align THP mappings for non-DAX Matthew Wilcox
@ 2020-02-12  7:50   ` Christoph Hellwig
  0 siblings, 0 replies; 68+ messages in thread
From: Christoph Hellwig @ 2020-02-12  7:50 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, William Kucharski, linux-kernel

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index b52e007f0856..b8d9e0d76062 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -577,13 +577,10 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
>  	unsigned long ret;
>  	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
>  
> -	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
> -		goto out;
> -
>  	ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PMD_SIZE);
>  	if (ret)
>  		return ret;
> -out:
> +
>  	return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
>  }
>  EXPORT_SYMBOL_GPL(thp_get_unmapped_area);

There is no point in splitting thp_get_unmapped_area and
__thp_get_unmapped_area with this applied (and arguably even before
that).  But we still have ext2 and ext4 that use thp_get_unmapped_area but
only support huge page mappings for DAX; do we need to handle those somehow?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 14/25] iomap: Support arbitrarily many blocks per page
  2020-02-12  4:18 ` [PATCH v2 14/25] iomap: Support arbitrarily many blocks per page Matthew Wilcox
@ 2020-02-12  8:05   ` Christoph Hellwig
  0 siblings, 0 replies; 68+ messages in thread
From: Christoph Hellwig @ 2020-02-12  8:05 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Feb 11, 2020 at 08:18:34PM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Size the uptodate array dynamically.  Now that this array is protected
> by a spinlock, we can use bitmap functions to set the bits in this array
> instead of a loop around set_bit().
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  fs/iomap/buffered-io.c | 27 +++++++++------------------
>  1 file changed, 9 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index c551a48e2a81..5e5a6b038fc3 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -22,14 +22,14 @@
>  #include "../internal.h"
>  
>  /*
> - * Structure allocated for each page when block size < PAGE_SIZE to track
> + * Structure allocated for each page when block size < page size to track
>   * sub-page uptodate status and I/O completions.
>   */
>  struct iomap_page {
>  	atomic_t		read_count;
>  	atomic_t		write_count;
>  	spinlock_t		uptodate_lock;
> -	DECLARE_BITMAP(uptodate, PAGE_SIZE / 512);
> +	unsigned long		uptodate[];
>  };
>  
>  static inline struct iomap_page *to_iomap_page(struct page *page)
> @@ -45,15 +45,14 @@ static struct iomap_page *
>  iomap_page_create(struct inode *inode, struct page *page)
>  {
>  	struct iomap_page *iop = to_iomap_page(page);
> +	unsigned int n, nr_blocks = i_blocks_per_page(inode, page);
>  
> -	if (iop || i_blocks_per_page(inode, page) <= 1)
> +	if (iop || nr_blocks <= 1)
>  		return iop;
>  
> -	iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
> -	atomic_set(&iop->read_count, 0);
> -	atomic_set(&iop->write_count, 0);
> +	n = BITS_TO_LONGS(nr_blocks);
> +	iop = kzalloc(struct_size(iop, uptodate, n), GFP_NOFS | __GFP_NOFAIL);

Nit: I don't really think we need the n variable here; we can just
opencode the BITS_TO_LONGS in the struct_size call.
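
Something like this, presumably (an untested sketch of that suggestion):

	iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
			GFP_NOFS | __GFP_NOFAIL);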

>  	struct inode *inode = page->mapping->host;
>  	unsigned first = off >> inode->i_blkbits;
> -	unsigned last = (off + len - 1) >> inode->i_blkbits;
> -	bool uptodate = true;
> +	unsigned count = len >> inode->i_blkbits;
>  	unsigned long flags;
> -	unsigned int i;
>  
>  	spin_lock_irqsave(&iop->uptodate_lock, flags);
> -	for (i = 0; i < i_blocks_per_page(inode, page); i++) {
> -		if (i >= first && i <= last)
> -			set_bit(i, iop->uptodate);
> -		else if (!test_bit(i, iop->uptodate))
> -			uptodate = false;
> -	}
> -
> -	if (uptodate)
> +	bitmap_set(iop->uptodate, first, count);
> +	if (bitmap_full(iop->uptodate, i_blocks_per_page(inode, page)))
>  		SetPageUptodate(page);

Switching to the bitmap helpers might be a worthwhile prep patch that
can go in directly now that we have the uptodate_lock.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 18/25] iomap: Inline data shouldn't see large pages
  2020-02-12  4:18 ` [PATCH v2 18/25] iomap: Inline data shouldn't see large pages Matthew Wilcox
@ 2020-02-12  8:05   ` Christoph Hellwig
  0 siblings, 0 replies; 68+ messages in thread
From: Christoph Hellwig @ 2020-02-12  8:05 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Feb 11, 2020 at 08:18:38PM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Assert that we're not seeing large pages in functions that read/write
> inline data, rather than zeroing out the tail.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 15/25] iomap: Support large pages in iomap_adjust_read_range
  2020-02-12  4:18 ` [PATCH v2 15/25] iomap: Support large pages in iomap_adjust_read_range Matthew Wilcox
@ 2020-02-12  8:11   ` Christoph Hellwig
  0 siblings, 0 replies; 68+ messages in thread
From: Christoph Hellwig @ 2020-02-12  8:11 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

>  __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
>  		struct page *page, struct iomap *srcmap)
>  {
> -	struct iomap_page *iop = iomap_page_create(inode, page);
>  	loff_t block_size = i_blocksize(inode);
>  	loff_t block_start = pos & ~(block_size - 1);
>  	loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
> @@ -556,9 +557,10 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
>  
>  	if (PageUptodate(page))
>  		return 0;
> +	iomap_page_create(inode, page);

FYI, I have a similar change in a pending series that only creates
the iomap_page if a page isn't actually mapped by a contiguous extent.
Let's see which series goes in first, but the conflicts shouldn't be too
bad.

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 52269e56c514..b4bf86590096 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1387,6 +1387,8 @@ static inline void clear_page_pfmemalloc(struct page *page)
>  extern void pagefault_out_of_memory(void);
>  
>  #define offset_in_page(p)	((unsigned long)(p) & ~PAGE_MASK)
> +#define offset_in_this_page(page, p)	\
> +	((unsigned long)(p) & (thp_size(page) - 1))

I think this should go into a separate patch.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 16/25] iomap: Support large pages in read paths
  2020-02-12  4:18 ` [PATCH v2 16/25] iomap: Support large pages in read paths Matthew Wilcox
@ 2020-02-12  8:13   ` Christoph Hellwig
  2020-02-12 17:45     ` Matthew Wilcox
  0 siblings, 1 reply; 68+ messages in thread
From: Christoph Hellwig @ 2020-02-12  8:13 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

> +/*
> + * Estimate the number of vectors we need based on the current page size;
> + * if we're wrong we'll end up doing an overly large allocation or needing
> + * to do a second allocation, neither of which is a big deal.
> + */
> +static unsigned int iomap_nr_vecs(struct page *page, loff_t length)
> +{
> +	return (length + thp_size(page) - 1) >> page_shift(page);
> +}

With the multipage bvecs a huge page (or any physically contigous piece
of memory) will always use one or less (if merged into the previous)
bio_vec.  So this can be simplified and optimized.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 17/25] iomap: Support large pages in write paths
  2020-02-12  4:18 ` [PATCH v2 17/25] iomap: Support large pages in write paths Matthew Wilcox
@ 2020-02-12  8:17   ` Christoph Hellwig
  0 siblings, 0 replies; 68+ messages in thread
From: Christoph Hellwig @ 2020-02-12  8:17 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Feb 11, 2020 at 08:18:37PM -0800, Matthew Wilcox wrote:
> Also simplify the logic in iomap_do_writepage() for determining end of
> file.

>  	 * ---------------------------------^------------------|
>  	 */
>  	offset = i_size_read(inode);
> -	end_index = offset >> PAGE_SHIFT;
> -	if (page->index < end_index)
> -		end_offset = (loff_t)(page->index + 1) << PAGE_SHIFT;
> -	else {
> +	end_offset = file_offset_of_next_page(page);
> +
> +	if (end_offset > offset) {

Nit: can you drop the empty line here?  Maybe it is just a pet peeve
of mine, but I hate an empty line between a variable assignment and a
use that is directly related to it.

Also it might be worth posting this first as a separate patch, as it
is a very nice but not completely obvious cleanup that deserves to
stand out.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 02/25] mm: Optimise find_subpage for !THP
  2020-02-12  7:41   ` Christoph Hellwig
@ 2020-02-12 13:02     ` Matthew Wilcox
  2020-02-12 17:52       ` Christoph Hellwig
  2020-02-13 13:50       ` Kirill A. Shutemov
  0 siblings, 2 replies; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12 13:02 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Feb 11, 2020 at 11:41:05PM -0800, Christoph Hellwig wrote:
> On Tue, Feb 11, 2020 at 08:18:22PM -0800, Matthew Wilcox wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > 
> > If THP is disabled, find_subpage can become a no-op.
> > 
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > ---
> >  include/linux/pagemap.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index 75bdfec49710..0842622cca90 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -340,7 +340,7 @@ static inline struct page *find_subpage(struct page *page, pgoff_t offset)
> >  
> >  	VM_BUG_ON_PAGE(PageTail(page), page);
> >  
> > -	return page + (offset & (compound_nr(page) - 1));
> > +	return page + (offset & (hpage_nr_pages(page) - 1));
> >  }
> 
> So just above the quoted code is a
> 
> 	if (PageHuge(page))
> 		return page;
> 
> So how can we get into a compound page that is not a huge page, but
> only if THP is enabled?  (Yes, maybe I'm a little rusty on VM
> internals).

That's for hugetlbfs:

        if (!PageCompound(page))
                return 0;

        page = compound_head(page);
        return page[1].compound_dtor == HUGETLB_PAGE_DTOR;

> Can you add comments describing the use case of this function and why
> it does all these checks?  It looks like black magic to me.

Would this help?

-static inline struct page *find_subpage(struct page *page, pgoff_t offset)
+/*
+ * Given the page we found in the page cache, return the page corresponding
+ * to this offset in the file
+ */
+static inline struct page *find_subpage(struct page *head, pgoff_t offset)
 {
-       if (PageHuge(page))
-               return page;
+       /* HugeTLBfs wants the head page regardless */
+       if (PageHuge(head))
+               return head;
 
-       VM_BUG_ON_PAGE(PageTail(page), page);
+       VM_BUG_ON_PAGE(PageTail(head), head);
 
-       return page + (offset & (hpage_nr_pages(page) - 1));
+       return head + (offset & (hpage_nr_pages(head) - 1));
 }


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 09/25] fs: Add a filesystem flag for large pages
  2020-02-12  7:43   ` Christoph Hellwig
@ 2020-02-12 14:59     ` Matthew Wilcox
  0 siblings, 0 replies; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12 14:59 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Feb 11, 2020 at 11:43:18PM -0800, Christoph Hellwig wrote:
> On Tue, Feb 11, 2020 at 08:18:29PM -0800, Matthew Wilcox wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > 
> > The page cache needs to know whether the filesystem supports pages >
> > PAGE_SIZE.
> 
> Does it make sense to set this flag on the file_system_type, which
> is rather broad scope, or a specific superblock or even inode?
> 
> For some file systems we might require on-disk flags that aren't set
> for all instances.

I don't see why we'd need on-disk flags or need to control this on a
per-inode or per-sb basis.  My intent for this flag is to represent
whether the filesystem understands large pages; how the file is cached
should make no difference to the on-disk layout.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 10/25] fs: Introduce i_blocks_per_page
  2020-02-12  7:44   ` Christoph Hellwig
@ 2020-02-12 15:05     ` Matthew Wilcox
  2020-02-12 17:54       ` Christoph Hellwig
  0 siblings, 1 reply; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12 15:05 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Feb 11, 2020 at 11:44:53PM -0800, Christoph Hellwig wrote:
> Looks good modulo some nitpicks below:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> > + * Context: Any context.
> 
> Does this add any value for a trivial helper like this?

I think it's good to put them in to remind people they should be putting
them in for more complex functions.  Just like the Return: section.

> > + * Return: The number of filesystem blocks covered by this page.
> > + */
> > +static inline
> > +unsigned int i_blocks_per_page(struct inode *inode, struct page *page)
> 
> static inline unisnged int
> i_blocks_per_page(struct inode *inode, struct page *page)

That's XFS coding style.  Linus has specifically forbidden that:

https://lore.kernel.org/lkml/1054519757.161606@palladium.transmeta.com/

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 16/25] iomap: Support large pages in read paths
  2020-02-12  8:13   ` Christoph Hellwig
@ 2020-02-12 17:45     ` Matthew Wilcox
  0 siblings, 0 replies; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12 17:45 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Wed, Feb 12, 2020 at 12:13:40AM -0800, Christoph Hellwig wrote:
> > +/*
> > + * Estimate the number of vectors we need based on the current page size;
> > + * if we're wrong we'll end up doing an overly large allocation or needing
> > + * to do a second allocation, neither of which is a big deal.
> > + */
> > +static unsigned int iomap_nr_vecs(struct page *page, loff_t length)
> > +{
> > +	return (length + thp_size(page) - 1) >> page_shift(page);
> > +}
> 
> With the multipage bvecs a huge page (or any physically contigous piece
> of memory) will always use one or less (if merged into the previous)
> bio_vec.  So this can be simplified and optimized.

Oh, hm, right.  So what you really want here is for me to pass in the
number of thp pages allocated by the readahead operation.  rac->nr_pages
is the number of PAGE_SIZE pages, but we could have an rac->nr_segs
element as well.
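
Purely as a sketch of that idea (hypothetical; neither rac->nr_segs nor
this signature exists in the posted series):

	static unsigned int iomap_nr_vecs(struct readahead_control *rac)
	{
		return rac->nr_segs;	/* one bio_vec per (possibly compound) page */
	}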

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 02/25] mm: Optimise find_subpage for !THP
  2020-02-12 13:02     ` Matthew Wilcox
@ 2020-02-12 17:52       ` Christoph Hellwig
  2020-02-13 13:50       ` Kirill A. Shutemov
  1 sibling, 0 replies; 68+ messages in thread
From: Christoph Hellwig @ 2020-02-12 17:52 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Christoph Hellwig, linux-fsdevel, linux-mm, linux-kernel

On Wed, Feb 12, 2020 at 05:02:00AM -0800, Matthew Wilcox wrote:
> > Can you add comments describing the use case of this function and why
> > it does all these checks?  It looks like black magic to me.
> 
> Would this help?
> 
> -static inline struct page *find_subpage(struct page *page, pgoff_t offset)
> +/*
> + * Given the page we found in the page cache, return the page corresponding
> + * to this offset in the file
> + */
> +static inline struct page *find_subpage(struct page *head, pgoff_t offset)
>  {
> -       if (PageHuge(page))
> -               return page;
> +       /* HugeTLBfs wants the head page regardless */
> +       if (PageHuge(head))
> +               return head;
>  
> -       VM_BUG_ON_PAGE(PageTail(page), page);
> +       VM_BUG_ON_PAGE(PageTail(head), head);
>  
> -       return page + (offset & (hpage_nr_pages(page) - 1));
> +       return head + (offset & (hpage_nr_pages(head) - 1));

Much better.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 10/25] fs: Introduce i_blocks_per_page
  2020-02-12 15:05     ` Matthew Wilcox
@ 2020-02-12 17:54       ` Christoph Hellwig
  0 siblings, 0 replies; 68+ messages in thread
From: Christoph Hellwig @ 2020-02-12 17:54 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Christoph Hellwig, linux-fsdevel, linux-mm, linux-kernel

On Wed, Feb 12, 2020 at 07:05:40AM -0800, Matthew Wilcox wrote:
> > > + * Return: The number of filesystem blocks covered by this page.
> > > + */
> > > +static inline
> > > +unsigned int i_blocks_per_page(struct inode *inode, struct page *page)
> > 
> > static inline unisnged int
> > i_blocks_per_page(struct inode *inode, struct page *page)
> 
> That's XFS coding style.  Linus has specifically forbidden that:
> 
> https://lore.kernel.org/lkml/1054519757.161606@palladium.transmeta.com/

Not just xfs but lots of places.  But if you don't like that follow
the real Linus style we use elsewhere:

static inline unsigned int i_blocks_per_page(struct inode *inode,
		struct page *page)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 05/25] mm: Fix documentation of FGP flags
  2020-02-12  7:42   ` Christoph Hellwig
@ 2020-02-12 19:11     ` Matthew Wilcox
  2020-02-13 14:00       ` Kirill A. Shutemov
  0 siblings, 1 reply; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-12 19:11 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Feb 11, 2020 at 11:42:15PM -0800, Christoph Hellwig wrote:
> On Tue, Feb 11, 2020 at 08:18:25PM -0800, Matthew Wilcox wrote:
> > - * @fgp_flags: PCG flags
> > + * @fgp_flags: FGP flags
> >   * @gfp_mask: gfp mask to use for the page cache data page allocation
> >   *
> >   * Looks up the page cache slot at @mapping & @offset.
> >   *
> > - * PCG flags modify how the page is returned.
> > + * FGP flags modify how the page is returned.
> 
> This still looks weird.  Why not just a single line:
> 
> 	* @fgp_flags: FGP_* flags that control how the page is returned.

Well, now you got me reading the entire comment for this function, and
looking at the html output, so I ended up rewriting it entirely.

+++ b/mm/filemap.c
@@ -1574,37 +1574,34 @@ struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset)
 EXPORT_SYMBOL(find_lock_entry);
 
 /**
- * pagecache_get_page - find and get a page reference
- * @mapping: the address_space to search
- * @offset: the page index
- * @fgp_flags: FGP flags
- * @gfp_mask: gfp mask to use for the page cache data page allocation
- *
- * Looks up the page cache slot at @mapping & @offset.
+ * pagecache_get_page - Find and get a reference to a page.
+ * @mapping: The address_space to search.
+ * @offset: The page index.
+ * @fgp_flags: %FGP flags modify how the page is returned.
+ * @gfp_mask: Memory allocation flags to use if %FGP_CREAT is specified.
  *
- * FGP flags modify how the page is returned.
+ * Looks up the page cache entry at @mapping & @offset.
  *
- * @fgp_flags can be:
+ * @fgp_flags can be zero or more of these flags:
  *
- * - FGP_ACCESSED: the page will be marked accessed
- * - FGP_LOCK: Page is return locked
- * - FGP_CREAT: If page is not present then a new page is allocated using
- *   @gfp_mask and added to the page cache and the VM's LRU
- *   list. The page is returned locked and with an increased
- *   refcount.
- * - FGP_FOR_MMAP: Similar to FGP_CREAT, only we want to allow the caller to do
- *   its own locking dance if the page is already in cache, or unlock the page
- *   before returning if we had to add the page to pagecache.
+ * * %FGP_ACCESSED - The page will be marked accessed.
+ * * %FGP_LOCK - The page is returned locked.
+ * * %FGP_CREAT - If no page is present then a new page is allocated using
+ *   @gfp_mask and added to the page cache and the VM's LRU list.
+ *   The page is returned locked and with an increased refcount.
+ * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
+ *   page is already in cache.  If the page was allocated, unlock it before
+ *   returning so the caller can do the same dance.
  *
- * If FGP_LOCK or FGP_CREAT are specified then the function may sleep even
- * if the GFP flags specified for FGP_CREAT are atomic.
+ * If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even
+ * if the %GFP flags specified for %FGP_CREAT are atomic.
  *
  * If there is a page cache page, it is returned with an increased refcount.
  *
- * Return: the found page or %NULL otherwise.
+ * Return: The found page or %NULL otherwise.
  */
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
-       int fgp_flags, gfp_t gfp_mask)
+               int fgp_flags, gfp_t gfp_mask)
 {


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 02/25] mm: Optimise find_subpage for !THP
  2020-02-12 13:02     ` Matthew Wilcox
  2020-02-12 17:52       ` Christoph Hellwig
@ 2020-02-13 13:50       ` Kirill A. Shutemov
  1 sibling, 0 replies; 68+ messages in thread
From: Kirill A. Shutemov @ 2020-02-13 13:50 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Christoph Hellwig, linux-fsdevel, linux-mm, linux-kernel

On Wed, Feb 12, 2020 at 05:02:00AM -0800, Matthew Wilcox wrote:
> Would this help?

LGTM:

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 03/25] mm: Use VM_BUG_ON_PAGE in clear_page_dirty_for_io
  2020-02-12  4:18 ` [PATCH v2 03/25] mm: Use VM_BUG_ON_PAGE in clear_page_dirty_for_io Matthew Wilcox
  2020-02-12  7:38   ` Christoph Hellwig
@ 2020-02-13 13:50   ` Kirill A. Shutemov
  1 sibling, 0 replies; 68+ messages in thread
From: Kirill A. Shutemov @ 2020-02-13 13:50 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Feb 11, 2020 at 08:18:23PM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Dumping the page information in this circumstance helps for debugging.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 04/25] mm: Unexport find_get_entry
  2020-02-12  4:18 ` [PATCH v2 04/25] mm: Unexport find_get_entry Matthew Wilcox
  2020-02-12  7:37   ` Christoph Hellwig
@ 2020-02-13 13:51   ` Kirill A. Shutemov
  1 sibling, 0 replies; 68+ messages in thread
From: Kirill A. Shutemov @ 2020-02-13 13:51 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Feb 11, 2020 at 08:18:24PM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> No in-tree users (proc, madvise, memcg, mincore) can be built as a module.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 05/25] mm: Fix documentation of FGP flags
  2020-02-12  4:18 ` [PATCH v2 05/25] mm: Fix documentation of FGP flags Matthew Wilcox
  2020-02-12  7:42   ` Christoph Hellwig
@ 2020-02-13 13:59   ` Kirill A. Shutemov
  2020-02-13 14:34     ` Matthew Wilcox
  1 sibling, 1 reply; 68+ messages in thread
From: Kirill A. Shutemov @ 2020-02-13 13:59 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Feb 11, 2020 at 08:18:25PM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> We never had PCG flags

We actually had :P But it's a totally different story.

See git log for include/linux/page_cgroup.h.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 05/25] mm: Fix documentation of FGP flags
  2020-02-12 19:11     ` Matthew Wilcox
@ 2020-02-13 14:00       ` Kirill A. Shutemov
  0 siblings, 0 replies; 68+ messages in thread
From: Kirill A. Shutemov @ 2020-02-13 14:00 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Christoph Hellwig, linux-fsdevel, linux-mm, linux-kernel

On Wed, Feb 12, 2020 at 11:11:45AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 11, 2020 at 11:42:15PM -0800, Christoph Hellwig wrote:
> > On Tue, Feb 11, 2020 at 08:18:25PM -0800, Matthew Wilcox wrote:
> > > - * @fgp_flags: PCG flags
> > > + * @fgp_flags: FGP flags
> > >   * @gfp_mask: gfp mask to use for the page cache data page allocation
> > >   *
> > >   * Looks up the page cache slot at @mapping & @offset.
> > >   *
> > > - * PCG flags modify how the page is returned.
> > > + * FGP flags modify how the page is returned.
> > 
> > This still looks weird.  Why not just a single line:
> > 
> > 	* @fgp_flags: FGP_* flags that control how the page is returned.
> 
> Well, now you got me reading the entire comment for this function, and
> looking at the html output, so I ended up rewriting it entirely.
> 
> +++ b/mm/filemap.c
> @@ -1574,37 +1574,34 @@ struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset)
>  EXPORT_SYMBOL(find_lock_entry);
>  
>  /**
> - * pagecache_get_page - find and get a page reference
> - * @mapping: the address_space to search
> - * @offset: the page index
> - * @fgp_flags: FGP flags
> - * @gfp_mask: gfp mask to use for the page cache data page allocation
> - *
> - * Looks up the page cache slot at @mapping & @offset.
> + * pagecache_get_page - Find and get a reference to a page.
> + * @mapping: The address_space to search.
> + * @offset: The page index.
> + * @fgp_flags: %FGP flags modify how the page is returned.
> + * @gfp_mask: Memory allocation flags to use if %FGP_CREAT is specified.
>   *
> - * FGP flags modify how the page is returned.
> + * Looks up the page cache entry at @mapping & @offset.
>   *
> - * @fgp_flags can be:
> + * @fgp_flags can be zero or more of these flags:
>   *
> - * - FGP_ACCESSED: the page will be marked accessed
> - * - FGP_LOCK: Page is return locked
> - * - FGP_CREAT: If page is not present then a new page is allocated using
> - *   @gfp_mask and added to the page cache and the VM's LRU
> - *   list. The page is returned locked and with an increased
> - *   refcount.
> - * - FGP_FOR_MMAP: Similar to FGP_CREAT, only we want to allow the caller to do
> - *   its own locking dance if the page is already in cache, or unlock the page
> - *   before returning if we had to add the page to pagecache.
> + * * %FGP_ACCESSED - The page will be marked accessed.
> + * * %FGP_LOCK - The page is returned locked.
> + * * %FGP_CREAT - If no page is present then a new page is allocated using
> + *   @gfp_mask and added to the page cache and the VM's LRU list.
> + *   The page is returned locked and with an increased refcount.
> + * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
> + *   page is already in cache.  If the page was allocated, unlock it before
> + *   returning so the caller can do the same dance.
>   *
> - * If FGP_LOCK or FGP_CREAT are specified then the function may sleep even
> - * if the GFP flags specified for FGP_CREAT are atomic.
> + * If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even
> + * if the %GFP flags specified for %FGP_CREAT are atomic.
>   *
>   * If there is a page cache page, it is returned with an increased refcount.
>   *
> - * Return: the found page or %NULL otherwise.
> + * Return: The found page or %NULL otherwise.
>   */
>  struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
> -       int fgp_flags, gfp_t gfp_mask)
> +               int fgp_flags, gfp_t gfp_mask)
>  {

LGTM:

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 06/25] mm: Allow hpages to be arbitrary order
  2020-02-12  4:18 ` [PATCH v2 06/25] mm: Allow hpages to be arbitrary order Matthew Wilcox
@ 2020-02-13 14:11   ` Kirill A. Shutemov
  2020-02-13 14:30     ` Matthew Wilcox
  0 siblings, 1 reply; 68+ messages in thread
From: Kirill A. Shutemov @ 2020-02-13 14:11 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Feb 11, 2020 at 08:18:26PM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Remove the assumption in hpage_nr_pages() that compound pages are
> necessarily PMD sized.  The return type needs to be signed as we need
> to use the negative value, eg when calling update_lru_size().

But should it be long?
Any reason to use macros instead of an inline function?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 07/25] mm: Introduce thp_size
  2020-02-12  4:18 ` [PATCH v2 07/25] mm: Introduce thp_size Matthew Wilcox
@ 2020-02-13 14:19   ` Kirill A. Shutemov
  0 siblings, 0 replies; 68+ messages in thread
From: Kirill A. Shutemov @ 2020-02-13 14:19 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Feb 11, 2020 at 08:18:27PM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> This is like page_size(), but compiles down to just PAGE_SIZE if THP
> are disabled.  Convert the users of hpage_nr_pages() which would prefer
> this interface.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

I would prefer the new helper to be an inline function rather than a macro.
Otherwise looks good.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 08/25] mm: Introduce thp_order
  2020-02-12  4:18 ` [PATCH v2 08/25] mm: Introduce thp_order Matthew Wilcox
@ 2020-02-13 14:20   ` Kirill A. Shutemov
  0 siblings, 0 replies; 68+ messages in thread
From: Kirill A. Shutemov @ 2020-02-13 14:20 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Feb 11, 2020 at 08:18:28PM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Like compound_order() except 0 when THP is disabled.

Again, inline functions are preferred if that's an option.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 06/25] mm: Allow hpages to be arbitrary order
  2020-02-13 14:11   ` Kirill A. Shutemov
@ 2020-02-13 14:30     ` Matthew Wilcox
  0 siblings, 0 replies; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-13 14:30 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Thu, Feb 13, 2020 at 05:11:07PM +0300, Kirill A. Shutemov wrote:
> On Tue, Feb 11, 2020 at 08:18:26PM -0800, Matthew Wilcox wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > 
> > Remove the assumption in hpage_nr_pages() that compound pages are
> > necessarily PMD sized.  The return type needs to be signed as we need
> > to use the negative value, eg when calling update_lru_size().
> 
> But should it be long?
> Any reason to use macros instead of inline function?

Huh, that does look like a bit of a weird change now you point it out.
I'll change it back:

 static inline int hpage_nr_pages(struct page *page)
 {
-	if (unlikely(PageTransHuge(page)))
-		return HPAGE_PMD_NR;
-	return 1;
+	return compound_nr(page);
 }

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 05/25] mm: Fix documentation of FGP flags
  2020-02-13 13:59   ` Kirill A. Shutemov
@ 2020-02-13 14:34     ` Matthew Wilcox
  0 siblings, 0 replies; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-13 14:34 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Thu, Feb 13, 2020 at 04:59:05PM +0300, Kirill A. Shutemov wrote:
> On Tue, Feb 11, 2020 at 08:18:25PM -0800, Matthew Wilcox wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > 
> > We never had PCG flags
> 
> We actually had :P But that's a totally different story.
> 
> See git log for include/linux/page_cgroup.h.

Thanks, wording updated.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 10/25] fs: Introduce i_blocks_per_page
  2020-02-12  4:18 ` [PATCH v2 10/25] fs: Introduce i_blocks_per_page Matthew Wilcox
  2020-02-12  7:44   ` Christoph Hellwig
@ 2020-02-13 15:40   ` Kirill A. Shutemov
  2020-02-13 16:07     ` Matthew Wilcox
  1 sibling, 1 reply; 68+ messages in thread
From: Kirill A. Shutemov @ 2020-02-13 15:40 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Feb 11, 2020 at 08:18:30PM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> This helper is useful for both large pages in the page cache and for
> supporting block size larger than page size.  Convert some example
> users (we have a few different ways of writing this idiom).

Maybe we should list what was converted and what wasn't. Like it's
important to know that fs/buffer.c is not covered.
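
For reference, I'd expect the helper to boil down to something like this
(a sketch assuming thp_size() from earlier in the series; the actual patch
may differ):

/* Sketch: number of filesystem blocks covered by this (possibly compound)
 * page. */
static inline unsigned int i_blocks_per_page(struct inode *inode,
		struct page *page)
{
	return thp_size(page) >> inode->i_blkbits;
}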

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 11/25] fs: Make page_mkwrite_check_truncate thp-aware
  2020-02-12  4:18 ` [PATCH v2 11/25] fs: Make page_mkwrite_check_truncate thp-aware Matthew Wilcox
@ 2020-02-13 15:44   ` Kirill A. Shutemov
  2020-02-13 16:26     ` Matthew Wilcox
  0 siblings, 1 reply; 68+ messages in thread
From: Kirill A. Shutemov @ 2020-02-13 15:44 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Feb 11, 2020 at 08:18:31PM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> If the page is compound, check the appropriate indices and return the
> appropriate sizes.

Is it guaranteed that this is never called on a tail page?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 10/25] fs: Introduce i_blocks_per_page
  2020-02-13 15:40   ` Kirill A. Shutemov
@ 2020-02-13 16:07     ` Matthew Wilcox
  0 siblings, 0 replies; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-13 16:07 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Thu, Feb 13, 2020 at 06:40:10PM +0300, Kirill A. Shutemov wrote:
> On Tue, Feb 11, 2020 at 08:18:30PM -0800, Matthew Wilcox wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > 
> > This helper is useful for both large pages in the page cache and for
> > supporting block size larger than page size.  Convert some example
> > users (we have a few different ways of writing this idiom).
> 
> Maybe we should list what was converted and what wasn't. Like it's
> important to know that fs/buffer.c is not covered.

I don't know what could have been converted and wasn't ... I just went
looking for a few places that use idioms like this.  Happy to add some
more examples.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 11/25] fs: Make page_mkwrite_check_truncate thp-aware
  2020-02-13 15:44   ` Kirill A. Shutemov
@ 2020-02-13 16:26     ` Matthew Wilcox
  0 siblings, 0 replies; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-13 16:26 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Thu, Feb 13, 2020 at 06:44:19PM +0300, Kirill A. Shutemov wrote:
> On Tue, Feb 11, 2020 at 08:18:31PM -0800, Matthew Wilcox wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > 
> > If the page is compound, check the appropriate indices and return the
> > appropriate sizes.
> 
> Is it guaranteed that this is never called on a tail page?

I think so.  page_mkwrite_check_truncate() is only called on pages
which belong to a particular filesystem.  Only filesystems which have
the FS_LARGE_PAGES flag set will have compound pages allocated in the
page cache for their files.  As filesystems are converted, they will
only see large head pages.

I'll happily put in a VM_BUG_ON_PAGE(PageTail(page), page); to ensure we
don't screw that up.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 13/25] fs: Add zero_user_large
  2020-02-12  4:18 ` [PATCH v2 13/25] fs: Add zero_user_large Matthew Wilcox
@ 2020-02-14 13:52   ` Kirill A. Shutemov
  2020-02-14 16:03     ` Matthew Wilcox
  0 siblings, 1 reply; 68+ messages in thread
From: Kirill A. Shutemov @ 2020-02-14 13:52 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Feb 11, 2020 at 08:18:33PM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> We can't kmap() a THP, so add a wrapper around zero_user() for large
> pages.

I would rather address it closer to the root: make zero_user_segments()
handle compound pages.

> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/highmem.h | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index ea5cdbd8c2c3..4465b8784353 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -245,6 +245,28 @@ static inline void zero_user(struct page *page,
>  	zero_user_segments(page, start, start + size, 0, 0);
>  }
>  
> +static inline void zero_user_large(struct page *page,
> +		unsigned start, unsigned size)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < thp_order(page); i++) {
> +		if (start > PAGE_SIZE) {

Off-by-one? >= ?

> +			start -= PAGE_SIZE;
> +		} else {
> +			unsigned this_size = size;
> +
> +			if (size > (PAGE_SIZE - start))
> +				this_size = PAGE_SIZE - start;
> +			zero_user(page + i, start, this_size);
> +			start = 0;
> +			size -= this_size;
> +			if (!size)
> +				break;
> +		}
> +	}
> +}
> +
>  #ifndef __HAVE_ARCH_COPY_USER_HIGHPAGE
>  
>  static inline void copy_user_highpage(struct page *to, struct page *from,
> -- 
> 2.25.0
> 
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 13/25] fs: Add zero_user_large
  2020-02-14 13:52   ` Kirill A. Shutemov
@ 2020-02-14 16:03     ` Matthew Wilcox
  2020-02-18 14:16       ` Kirill A. Shutemov
  0 siblings, 1 reply; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-14 16:03 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Fri, Feb 14, 2020 at 04:52:48PM +0300, Kirill A. Shutemov wrote:
> On Tue, Feb 11, 2020 at 08:18:33PM -0800, Matthew Wilcox wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > 
> > We can't kmap() a THP, so add a wrapper around zero_user() for large
> > pages.
> 
> I would rather address it closer to the root: make zero_user_segments()
> handle compound pages.

Hah.  I ended up doing that, but hadn't sent it out.  I don't like
how ugly it is:

@@ -219,18 +219,57 @@ static inline void zero_user_segments(struct page *page,
        unsigned start1, unsigned end1,
        unsigned start2, unsigned end2)
 {
-       void *kaddr = kmap_atomic(page);
-
-       BUG_ON(end1 > PAGE_SIZE || end2 > PAGE_SIZE);
-
-       if (end1 > start1)
-               memset(kaddr + start1, 0, end1 - start1);
-
-       if (end2 > start2)
-               memset(kaddr + start2, 0, end2 - start2);
-
-       kunmap_atomic(kaddr);
-       flush_dcache_page(page);
+       unsigned int i;
+
+       BUG_ON(end1 > thp_size(page) || end2 > thp_size(page));
+
+       for (i = 0; i < hpage_nr_pages(page); i++) {
+               void *kaddr;
+               unsigned this_end;
+
+               if (end1 == 0 && start2 >= PAGE_SIZE) {
+                       start2 -= PAGE_SIZE;
+                       end2 -= PAGE_SIZE;
+                       continue;
+               }
+
+               if (start1 >= PAGE_SIZE) {
+                       start1 -= PAGE_SIZE;
+                       end1 -= PAGE_SIZE;
+                       if (start2) {
+                               start2 -= PAGE_SIZE;
+                               end2 -= PAGE_SIZE;
+                       }
+                       continue;
+               }
+
+               kaddr = kmap_atomic(page + i);
+
+               this_end = min_t(unsigned, end1, PAGE_SIZE);
+               if (end1 > start1)
+                       memset(kaddr + start1, 0, this_end - start1);
+               end1 -= this_end;
+               start1 = 0;
+
+               if (start2 >= PAGE_SIZE) {
+                       start2 -= PAGE_SIZE;
+                       end2 -= PAGE_SIZE;
+               } else {
+                       this_end = min_t(unsigned, end2, PAGE_SIZE);
+                       if (end2 > start2)
+                               memset(kaddr + start2, 0, this_end - start2);
+                       end2 -= this_end;
+                       start2 = 0;
+               }
+
+               kunmap_atomic(kaddr);
+               flush_dcache_page(page + i);
+
+               if (!end1 && !end2)
+                       break;
+       }
+
+       BUG_ON((start1 | start2 | end1 | end2) != 0);
 }

I think at this point it has to move out-of-line too.

> > +static inline void zero_user_large(struct page *page,
> > +		unsigned start, unsigned size)
> > +{
> > +	unsigned int i;
> > +
> > +	for (i = 0; i < thp_order(page); i++) {
> > +		if (start > PAGE_SIZE) {
> 
> Off-by-one? >= ?

Good catch; I'd also noticed that when I came to redo the zero_user_segments().


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 13/25] fs: Add zero_user_large
  2020-02-14 16:03     ` Matthew Wilcox
@ 2020-02-18 14:16       ` Kirill A. Shutemov
  2020-02-18 16:13         ` Matthew Wilcox
  0 siblings, 1 reply; 68+ messages in thread
From: Kirill A. Shutemov @ 2020-02-18 14:16 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Fri, Feb 14, 2020 at 08:03:42AM -0800, Matthew Wilcox wrote:
> On Fri, Feb 14, 2020 at 04:52:48PM +0300, Kirill A. Shutemov wrote:
> > On Tue, Feb 11, 2020 at 08:18:33PM -0800, Matthew Wilcox wrote:
> > > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > > 
> > > We can't kmap() a THP, so add a wrapper around zero_user() for large
> > > pages.
> > 
> > I would rather address it closer to the root: make zero_user_segments()
> > handle compound pages.
> 
> Hah.  I ended up doing that, but hadn't sent it out.  I don't like
> how ugly it is:
> 
> @@ -219,18 +219,57 @@ static inline void zero_user_segments(struct page *page,
>         unsigned start1, unsigned end1,
>         unsigned start2, unsigned end2)
>  {
> -       void *kaddr = kmap_atomic(page);
> -
> -       BUG_ON(end1 > PAGE_SIZE || end2 > PAGE_SIZE);
> -
> -       if (end1 > start1)
> -               memset(kaddr + start1, 0, end1 - start1);
> -
> -       if (end2 > start2)
> -               memset(kaddr + start2, 0, end2 - start2);
> -
> -       kunmap_atomic(kaddr);
> -       flush_dcache_page(page);
> +       unsigned int i;
> +
> +       BUG_ON(end1 > thp_size(page) || end2 > thp_size(page));
> +
> +       for (i = 0; i < hpage_nr_pages(page); i++) {
> +               void *kaddr;
> +               unsigned this_end;
> +
> +               if (end1 == 0 && start2 >= PAGE_SIZE) {
> +                       start2 -= PAGE_SIZE;
> +                       end2 -= PAGE_SIZE;
> +                       continue;
> +               }
> +
> +               if (start1 >= PAGE_SIZE) {
> +                       start1 -= PAGE_SIZE;
> +                       end1 -= PAGE_SIZE;
> +                       if (start2) {
> +                               start2 -= PAGE_SIZE;
> +                               end2 -= PAGE_SIZE;
> +                       }

You assume start2/end2 is always after start1/end1 in the page.
Is it always true? If so, I would add BUG_ON() for it.

Otherwise, looks good.

> +                       continue;
> +               }
> +
> +               kaddr = kmap_atomic(page + i);
> +
> +               this_end = min_t(unsigned, end1, PAGE_SIZE);
> +               if (end1 > start1)
> +                       memset(kaddr + start1, 0, this_end - start1);
> +               end1 -= this_end;
> +               start1 = 0;
> +
> +               if (start2 >= PAGE_SIZE) {
> +                       start2 -= PAGE_SIZE;
> +                       end2 -= PAGE_SIZE;
> +               } else {
> +                       this_end = min_t(unsigned, end2, PAGE_SIZE);
> +                       if (end2 > start2)
> +                               memset(kaddr + start2, 0, this_end - start2);
> +                       end2 -= this_end;
> +                       start2 = 0;
> +               }
> +
> +               kunmap_atomic(kaddr);
> +               flush_dcache_page(page + i);
> +
> +               if (!end1 && !end2)
> +                       break;
> +       }
> +
> +       BUG_ON((start1 | start2 | end1 | end2) != 0);
>  }
> 
> I think at this point it has to move out-of-line too.
> 
> > > +static inline void zero_user_large(struct page *page,
> > > +		unsigned start, unsigned size)
> > > +{
> > > +	unsigned int i;
> > > +
> > > +	for (i = 0; i < thp_order(page); i++) {
> > > +		if (start > PAGE_SIZE) {
> > 
> > Off-by-one? >= ?
> 
> Good catch; I'd also noticed that when I came to redo the zero_user_segments().
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 13/25] fs: Add zero_user_large
  2020-02-18 14:16       ` Kirill A. Shutemov
@ 2020-02-18 16:13         ` Matthew Wilcox
  2020-02-18 17:10           ` Kirill A. Shutemov
  0 siblings, 1 reply; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-18 16:13 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Feb 18, 2020 at 05:16:34PM +0300, Kirill A. Shutemov wrote:
> > +               if (start1 >= PAGE_SIZE) {
> > +                       start1 -= PAGE_SIZE;
> > +                       end1 -= PAGE_SIZE;
> > +                       if (start2) {
> > +                               start2 -= PAGE_SIZE;
> > +                               end2 -= PAGE_SIZE;
> > +                       }
> 
> You assume start2/end2 is always after start1/end1 in the page.
> Is it always true? If so, I would add BUG_ON() for it.

after or zero.  Yes, I should add a BUG_ON to check for that.
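
Something like this, perhaps (a guess at the check, not the final code):

	/* Hypothetical: the second range is either empty or starts at or
	 * after the end of the first range. */
	BUG_ON(end2 && start2 < end1);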

> Otherwise, looks good.

Here's what I currently have (I'll add the BUG_ON later):

commit 7fabe16755365cdc6e80343ef994843ecebde60a
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Sat Feb 1 03:38:49 2020 -0500

    fs: Support THPs in zero_user_segments
    
    We can only kmap() one subpage of a THP at a time, so loop over all
    relevant subpages, skipping ones which don't need to be zeroed.  This is
    too large to inline when THPs are enabled and we actually need highmem,
    so put it in highmem.c.
    
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index ea5cdbd8c2c3..74614903619d 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -215,13 +215,18 @@ static inline void clear_highpage(struct page *page)
        kunmap_atomic(kaddr);
 }
 
+#if defined(CONFIG_HIGHMEM) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
+void zero_user_segments(struct page *page, unsigned start1, unsigned end1,
+               unsigned start2, unsigned end2);
+#else /* !HIGHMEM || !TRANSPARENT_HUGEPAGE */
 static inline void zero_user_segments(struct page *page,
-       unsigned start1, unsigned end1,
-       unsigned start2, unsigned end2)
+               unsigned start1, unsigned end1,
+               unsigned start2, unsigned end2)
 {
+       unsigned long i;
        void *kaddr = kmap_atomic(page);
 
-       BUG_ON(end1 > PAGE_SIZE || end2 > PAGE_SIZE);
+       BUG_ON(end1 > thp_size(page) || end2 > thp_size(page));
 
        if (end1 > start1)
                memset(kaddr + start1, 0, end1 - start1);
@@ -230,8 +235,10 @@ static inline void zero_user_segments(struct page *page,
                memset(kaddr + start2, 0, end2 - start2);
 
        kunmap_atomic(kaddr);
-       flush_dcache_page(page);
+       for (i = 0; i < hpage_nr_pages(page); i++)
+               flush_dcache_page(page + i);
 }
+#endif /* !HIGHMEM || !TRANSPARENT_HUGEPAGE */
 
 static inline void zero_user_segment(struct page *page,
        unsigned start, unsigned end)
diff --git a/mm/highmem.c b/mm/highmem.c
index 64d8dea47dd1..3a85c66ef532 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -367,9 +367,67 @@ void kunmap_high(struct page *page)
        if (need_wakeup)
                wake_up(pkmap_map_wait);
 }
-
 EXPORT_SYMBOL(kunmap_high);
-#endif
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void zero_user_segments(struct page *page, unsigned start1, unsigned end1,
+               unsigned start2, unsigned end2)
+{
+       unsigned int i;
+
+       BUG_ON(end1 > thp_size(page) || end2 > thp_size(page));
+
+       for (i = 0; i < hpage_nr_pages(page); i++) {
+               void *kaddr;
+               unsigned this_end;
+
+               if (end1 == 0 && start2 >= PAGE_SIZE) {
+                       start2 -= PAGE_SIZE;
+                       end2 -= PAGE_SIZE;
+                       continue;
+               }
+
+               if (start1 >= PAGE_SIZE) {
+                       start1 -= PAGE_SIZE;
+                       end1 -= PAGE_SIZE;
+                       if (start2) {
+                               start2 -= PAGE_SIZE;
+                               end2 -= PAGE_SIZE;
+                       }
+                       continue;
+               }
+
+               kaddr = kmap_atomic(page + i);
+
+               this_end = min_t(unsigned, end1, PAGE_SIZE);
+               if (end1 > start1)
+                       memset(kaddr + start1, 0, this_end - start1);
+               end1 -= this_end;
+               start1 = 0;
+
+               if (start2 >= PAGE_SIZE) {
+                       start2 -= PAGE_SIZE;
+                       end2 -= PAGE_SIZE;
+               } else {
+                       this_end = min_t(unsigned, end2, PAGE_SIZE);
+                       if (end2 > start2)
+                               memset(kaddr + start2, 0, this_end - start2);
+                       end2 -= this_end;
+                       start2 = 0;
+               }
+
+               kunmap_atomic(kaddr);
+               flush_dcache_page(page + i);
+
+               if (!end1 && !end2)
+                       break;
+       }
+
+       BUG_ON((start1 | start2 | end1 | end2) != 0);
+}
+EXPORT_SYMBOL(zero_user_segments);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif /* CONFIG_HIGHMEM */
 
 #if defined(HASHED_PAGE_VIRTUAL)
 



> > +                       continue;
> > +               }
> > +
> > +               kaddr = kmap_atomic(page + i);
> > +
> > +               this_end = min_t(unsigned, end1, PAGE_SIZE);
> > +               if (end1 > start1)
> > +                       memset(kaddr + start1, 0, this_end - start1);
> > +               end1 -= this_end;
> > +               start1 = 0;
> > +
> > +               if (start2 >= PAGE_SIZE) {
> > +                       start2 -= PAGE_SIZE;
> > +                       end2 -= PAGE_SIZE;
> > +               } else {
> > +                       this_end = min_t(unsigned, end2, PAGE_SIZE);
> > +                       if (end2 > start2)
> > +                               memset(kaddr + start2, 0, this_end - start2);
> > +                       end2 -= this_end;
> > +                       start2 = 0;
> > +               }
> > +
> > +               kunmap_atomic(kaddr);
> > +               flush_dcache_page(page + i);
> > +
> > +               if (!end1 && !end2)
> > +                       break;
> > +       }
> > +
> > +       BUG_ON((start1 | start2 | end1 | end2) != 0);
> >  }
> > 
> > I think at this point it has to move out-of-line too.
> > 
> > > > +static inline void zero_user_large(struct page *page,
> > > > +		unsigned start, unsigned size)
> > > > +{
> > > > +	unsigned int i;
> > > > +
> > > > +	for (i = 0; i < thp_order(page); i++) {
> > > > +		if (start > PAGE_SIZE) {
> > > 
> > > Off-by-one? >= ?
> > 
> > Good catch; I'd also noticed that when I came to redo the zero_user_segments().
> > 
> 
> -- 
>  Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 13/25] fs: Add zero_user_large
  2020-02-18 16:13         ` Matthew Wilcox
@ 2020-02-18 17:10           ` Kirill A. Shutemov
  2020-02-18 18:07             ` Matthew Wilcox
  0 siblings, 1 reply; 68+ messages in thread
From: Kirill A. Shutemov @ 2020-02-18 17:10 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Feb 18, 2020 at 08:13:49AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 05:16:34PM +0300, Kirill A. Shutemov wrote:
> > > +               if (start1 >= PAGE_SIZE) {
> > > +                       start1 -= PAGE_SIZE;
> > > +                       end1 -= PAGE_SIZE;
> > > +                       if (start2) {
> > > +                               start2 -= PAGE_SIZE;
> > > +                               end2 -= PAGE_SIZE;
> > > +                       }
> > 
> > You assume start2/end2 is always after start1/end1 in the page.
> > Is it always true? If so, I would add BUG_ON() for it.
> 
> after or zero.  Yes, I should add a BUG_ON to check for that.
> 
> > Otherwise, looks good.
> 
> Here's what I currently have (I'll add the BUG_ON later):
> 
> commit 7fabe16755365cdc6e80343ef994843ecebde60a
> Author: Matthew Wilcox (Oracle) <willy@infradead.org>
> Date:   Sat Feb 1 03:38:49 2020 -0500
> 
>     fs: Support THPs in zero_user_segments
>     
>     We can only kmap() one subpage of a THP at a time, so loop over all
>     relevant subpages, skipping ones which don't need to be zeroed.  This is
>     too large to inline when THPs are enabled and we actually need highmem,
>     so put it in highmem.c.
>     
>     Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

> 
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index ea5cdbd8c2c3..74614903619d 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -215,13 +215,18 @@ static inline void clear_highpage(struct page *page)
>         kunmap_atomic(kaddr);
>  }
>  
> +#if defined(CONFIG_HIGHMEM) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> +void zero_user_segments(struct page *page, unsigned start1, unsigned end1,
> +               unsigned start2, unsigned end2);
> +#else /* !HIGHMEM || !TRANSPARENT_HUGEPAGE */

This is a neat trick. I like it.

Although, it means non-inlined version will never get tested :/

>  static inline void zero_user_segments(struct page *page,
> -       unsigned start1, unsigned end1,
> -       unsigned start2, unsigned end2)
> +               unsigned start1, unsigned end1,
> +               unsigned start2, unsigned end2)
>  {
> +       unsigned long i;
>         void *kaddr = kmap_atomic(page);
>  
> -       BUG_ON(end1 > PAGE_SIZE || end2 > PAGE_SIZE);
> +       BUG_ON(end1 > thp_size(page) || end2 > thp_size(page));
>  
>         if (end1 > start1)
>                 memset(kaddr + start1, 0, end1 - start1);
> @@ -230,8 +235,10 @@ static inline void zero_user_segments(struct page *page,
>                 memset(kaddr + start2, 0, end2 - start2);
>  
>         kunmap_atomic(kaddr);
> -       flush_dcache_page(page);
> +       for (i = 0; i < hpage_nr_pages(page); i++)
> +               flush_dcache_page(page + i);
>  }
> +#endif /* !HIGHMEM || !TRANSPARENT_HUGEPAGE */
>  
>  static inline void zero_user_segment(struct page *page,
>         unsigned start, unsigned end)
> diff --git a/mm/highmem.c b/mm/highmem.c
> index 64d8dea47dd1..3a85c66ef532 100644
> --- a/mm/highmem.c
> +++ b/mm/highmem.c
> @@ -367,9 +367,67 @@ void kunmap_high(struct page *page)
>         if (need_wakeup)
>                 wake_up(pkmap_map_wait);
>  }
> -
>  EXPORT_SYMBOL(kunmap_high);
> -#endif
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +void zero_user_segments(struct page *page, unsigned start1, unsigned end1,
> +               unsigned start2, unsigned end2)
> +{
> +       unsigned int i;
> +
> +       BUG_ON(end1 > thp_size(page) || end2 > thp_size(page));
> +
> +       for (i = 0; i < hpage_nr_pages(page); i++) {
> +               void *kaddr;
> +               unsigned this_end;
> +
> +               if (end1 == 0 && start2 >= PAGE_SIZE) {
> +                       start2 -= PAGE_SIZE;
> +                       end2 -= PAGE_SIZE;
> +                       continue;
> +               }
> +
> +               if (start1 >= PAGE_SIZE) {
> +                       start1 -= PAGE_SIZE;
> +                       end1 -= PAGE_SIZE;
> +                       if (start2) {
> +                               start2 -= PAGE_SIZE;
> +                               end2 -= PAGE_SIZE;
> +                       }
> +                       continue;
> +               }
> +
> +               kaddr = kmap_atomic(page + i);
> +
> +               this_end = min_t(unsigned, end1, PAGE_SIZE);
> +               if (end1 > start1)
> +                       memset(kaddr + start1, 0, this_end - start1);
> +               end1 -= this_end;
> +               start1 = 0;
> +
> +               if (start2 >= PAGE_SIZE) {
> +                       start2 -= PAGE_SIZE;
> +                       end2 -= PAGE_SIZE;
> +               } else {
> +                       this_end = min_t(unsigned, end2, PAGE_SIZE);
> +                       if (end2 > start2)
> +                               memset(kaddr + start2, 0, this_end - start2);
> +                       end2 -= this_end;
> +                       start2 = 0;
> +               }
> +
> +               kunmap_atomic(kaddr);
> +               flush_dcache_page(page + i);
> +
> +               if (!end1 && !end2)
> +                       break;
> +       }
> +
> +       BUG_ON((start1 | start2 | end1 | end2) != 0);
> +}
> +EXPORT_SYMBOL(zero_user_segments);
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +#endif /* CONFIG_HIGHMEM */
>  
>  #if defined(HASHED_PAGE_VIRTUAL)
>  
> 
> 
> 
> > > +                       continue;
> > > +               }
> > > +
> > > +               kaddr = kmap_atomic(page + i);
> > > +
> > > +               this_end = min_t(unsigned, end1, PAGE_SIZE);
> > > +               if (end1 > start1)
> > > +                       memset(kaddr + start1, 0, this_end - start1);
> > > +               end1 -= this_end;
> > > +               start1 = 0;
> > > +
> > > +               if (start2 >= PAGE_SIZE) {
> > > +                       start2 -= PAGE_SIZE;
> > > +                       end2 -= PAGE_SIZE;
> > > +               } else {
> > > +                       this_end = min_t(unsigned, end2, PAGE_SIZE);
> > > +                       if (end2 > start2)
> > > +                               memset(kaddr + start2, 0, this_end - start2);
> > > +                       end2 -= this_end;
> > > +                       start2 = 0;
> > > +               }
> > > +
> > > +               kunmap_atomic(kaddr);
> > > +               flush_dcache_page(page + i);
> > > +
> > > +               if (!end1 && !end2)
> > > +                       break;
> > > +       }
> > > +
> > > +       BUG_ON((start1 | start2 | end1 | end2) != 0);
> > >  }
> > > 
> > > I think at this point it has to move out-of-line too.
> > > 
> > > > > +static inline void zero_user_large(struct page *page,
> > > > > +		unsigned start, unsigned size)
> > > > > +{
> > > > > +	unsigned int i;
> > > > > +
> > > > > +	for (i = 0; i < thp_order(page); i++) {
> > > > > +		if (start > PAGE_SIZE) {
> > > > 
> > > > Off-by-one? >= ?
> > > 
> > > Good catch; I'd also noticed that when I came to redo the zero_user_segments().
> > > 
> > 
> > -- 
> >  Kirill A. Shutemov

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 13/25] fs: Add zero_user_large
  2020-02-18 17:10           ` Kirill A. Shutemov
@ 2020-02-18 18:07             ` Matthew Wilcox
  2020-02-21 12:42               ` Kirill A. Shutemov
  0 siblings, 1 reply; 68+ messages in thread
From: Matthew Wilcox @ 2020-02-18 18:07 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Feb 18, 2020 at 08:10:52PM +0300, Kirill A. Shutemov wrote:
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Thanks

> > +#if defined(CONFIG_HIGHMEM) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> > +void zero_user_segments(struct page *page, unsigned start1, unsigned end1,
> > +               unsigned start2, unsigned end2);
> > +#else /* !HIGHMEM || !TRANSPARENT_HUGEPAGE */
> 
> This is a neat trick. I like it.
> 
> Although, it means non-inlined version will never get tested :/

I worry about that too, but I don't really want to incur the overhead on
platforms people actually use.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 13/25] fs: Add zero_user_large
  2020-02-18 18:07             ` Matthew Wilcox
@ 2020-02-21 12:42               ` Kirill A. Shutemov
  0 siblings, 0 replies; 68+ messages in thread
From: Kirill A. Shutemov @ 2020-02-21 12:42 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Feb 18, 2020 at 10:07:05AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 08:10:52PM +0300, Kirill A. Shutemov wrote:
> > Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> 
> Thanks
> 
> > > +#if defined(CONFIG_HIGHMEM) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> > > +void zero_user_segments(struct page *page, unsigned start1, unsigned end1,
> > > +               unsigned start2, unsigned end2);
> > > +#else /* !HIGHMEM || !TRANSPARENT_HUGEPAGE */
> > 
> > This is a neat trick. I like it.
> > 
> > Although, it means non-inlined version will never get tested :/
> 
> I worry about that too, but I don't really want to incur the overhead on
> platforms people actually use.

I'm also worried about latency: kmap_atomic() disables preemption even if
the system has no highmem. Some architectures have THPs that are far too
large to clear with preemption disabled.

I *think* there's no real need to disable preemption in this situation,
and we can make kmap_atomic()/kunmap_atomic() conditional on CONFIG_HIGHMEM.
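
Something like this, perhaps (a rough sketch; the thp_zero_kmap()/
thp_zero_kunmap() names are made up):

/* Sketch: skip the preemption-disabling kmap_atomic() when the kernel has
 * no highmem, so large THPs can be cleared with preemption enabled. */
#ifdef CONFIG_HIGHMEM
#define thp_zero_kmap(page)	kmap_atomic(page)
#define thp_zero_kunmap(addr)	kunmap_atomic(addr)
#else
#define thp_zero_kmap(page)	page_address(page)
#define thp_zero_kunmap(addr)	do { } while (0)
#endif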

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 68+ messages in thread

end of thread, other threads:[~2020-02-21 12:42 UTC | newest]

Thread overview: 68+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-12  4:18 [PATCH v2 00/25] Large pages in the page cache Matthew Wilcox
2020-02-12  4:18 ` [PATCH v2 01/25] mm: Use vm_fault error code directly Matthew Wilcox
2020-02-12  7:34   ` Christoph Hellwig
2020-02-12  4:18 ` [PATCH v2 02/25] mm: Optimise find_subpage for !THP Matthew Wilcox
2020-02-12  7:41   ` Christoph Hellwig
2020-02-12 13:02     ` Matthew Wilcox
2020-02-12 17:52       ` Christoph Hellwig
2020-02-13 13:50       ` Kirill A. Shutemov
2020-02-12  4:18 ` [PATCH v2 03/25] mm: Use VM_BUG_ON_PAGE in clear_page_dirty_for_io Matthew Wilcox
2020-02-12  7:38   ` Christoph Hellwig
2020-02-13 13:50   ` Kirill A. Shutemov
2020-02-12  4:18 ` [PATCH v2 04/25] mm: Unexport find_get_entry Matthew Wilcox
2020-02-12  7:37   ` Christoph Hellwig
2020-02-13 13:51   ` Kirill A. Shutemov
2020-02-12  4:18 ` [PATCH v2 05/25] mm: Fix documentation of FGP flags Matthew Wilcox
2020-02-12  7:42   ` Christoph Hellwig
2020-02-12 19:11     ` Matthew Wilcox
2020-02-13 14:00       ` Kirill A. Shutemov
2020-02-13 13:59   ` Kirill A. Shutemov
2020-02-13 14:34     ` Matthew Wilcox
2020-02-12  4:18 ` [PATCH v2 06/25] mm: Allow hpages to be arbitrary order Matthew Wilcox
2020-02-13 14:11   ` Kirill A. Shutemov
2020-02-13 14:30     ` Matthew Wilcox
2020-02-12  4:18 ` [PATCH v2 07/25] mm: Introduce thp_size Matthew Wilcox
2020-02-13 14:19   ` Kirill A. Shutemov
2020-02-12  4:18 ` [PATCH v2 08/25] mm: Introduce thp_order Matthew Wilcox
2020-02-13 14:20   ` Kirill A. Shutemov
2020-02-12  4:18 ` [PATCH v2 09/25] fs: Add a filesystem flag for large pages Matthew Wilcox
2020-02-12  7:43   ` Christoph Hellwig
2020-02-12 14:59     ` Matthew Wilcox
2020-02-12  4:18 ` [PATCH v2 10/25] fs: Introduce i_blocks_per_page Matthew Wilcox
2020-02-12  7:44   ` Christoph Hellwig
2020-02-12 15:05     ` Matthew Wilcox
2020-02-12 17:54       ` Christoph Hellwig
2020-02-13 15:40   ` Kirill A. Shutemov
2020-02-13 16:07     ` Matthew Wilcox
2020-02-12  4:18 ` [PATCH v2 11/25] fs: Make page_mkwrite_check_truncate thp-aware Matthew Wilcox
2020-02-13 15:44   ` Kirill A. Shutemov
2020-02-13 16:26     ` Matthew Wilcox
2020-02-12  4:18 ` [PATCH v2 12/25] mm: Add file_offset_of_ helpers Matthew Wilcox
2020-02-12  7:46   ` Christoph Hellwig
2020-02-12  4:18 ` [PATCH v2 13/25] fs: Add zero_user_large Matthew Wilcox
2020-02-14 13:52   ` Kirill A. Shutemov
2020-02-14 16:03     ` Matthew Wilcox
2020-02-18 14:16       ` Kirill A. Shutemov
2020-02-18 16:13         ` Matthew Wilcox
2020-02-18 17:10           ` Kirill A. Shutemov
2020-02-18 18:07             ` Matthew Wilcox
2020-02-21 12:42               ` Kirill A. Shutemov
2020-02-12  4:18 ` [PATCH v2 14/25] iomap: Support arbitrarily many blocks per page Matthew Wilcox
2020-02-12  8:05   ` Christoph Hellwig
2020-02-12  4:18 ` [PATCH v2 15/25] iomap: Support large pages in iomap_adjust_read_range Matthew Wilcox
2020-02-12  8:11   ` Christoph Hellwig
2020-02-12  4:18 ` [PATCH v2 16/25] iomap: Support large pages in read paths Matthew Wilcox
2020-02-12  8:13   ` Christoph Hellwig
2020-02-12 17:45     ` Matthew Wilcox
2020-02-12  4:18 ` [PATCH v2 17/25] iomap: Support large pages in write paths Matthew Wilcox
2020-02-12  8:17   ` Christoph Hellwig
2020-02-12  4:18 ` [PATCH v2 18/25] iomap: Inline data shouldn't see large pages Matthew Wilcox
2020-02-12  8:05   ` Christoph Hellwig
2020-02-12  4:18 ` [PATCH v2 19/25] xfs: Support " Matthew Wilcox
2020-02-12  4:18 ` [PATCH v2 20/25] mm: Make prep_transhuge_page return its argument Matthew Wilcox
2020-02-12  4:18 ` [PATCH v2 21/25] mm: Add __page_cache_alloc_order Matthew Wilcox
2020-02-12  4:18 ` [PATCH v2 22/25] mm: Allow large pages to be added to the page cache Matthew Wilcox
2020-02-12  4:18 ` [PATCH v2 23/25] mm: Allow large pages to be removed from " Matthew Wilcox
2020-02-12  4:18 ` [PATCH v2 24/25] mm: Add large page readahead Matthew Wilcox
2020-02-12  4:18 ` [PATCH v2 25/25] mm: Align THP mappings for non-DAX Matthew Wilcox
2020-02-12  7:50   ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).