linux-kernel.vger.kernel.org archive mirror
* [RFC 00/15] Large pages in the page-cache
@ 2019-09-25  0:51 Matthew Wilcox
  2019-09-25  0:52 ` [PATCH 01/15] mm: Use vm_fault error code directly Matthew Wilcox
                   ` (18 more replies)
  0 siblings, 19 replies; 40+ messages in thread
From: Matthew Wilcox @ 2019-09-25  0:51 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox (Oracle)

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Here's what I'm currently playing with.  I'm having trouble _testing_
it, but since akpm's patches were just merged into Linus' tree, I
thought this would be a good point to send out my current work tree.
Thanks to kbuild bot for finding a bunch of build problems ;-)

Matthew Wilcox (Oracle) (12):
  mm: Use vm_fault error code directly
  fs: Introduce i_blocks_per_page
  mm: Add file_offset_of_ helpers
  iomap: Support large pages
  xfs: Support large pages
  xfs: Pass a page to xfs_finish_page_writeback
  mm: Make prep_transhuge_page tail-callable
  mm: Add __page_cache_alloc_order
  mm: Allow large pages to be added to the page cache
  mm: Allow find_get_page to be used for large pages
  mm: Remove hpage_nr_pages
  xfs: Use filemap_huge_fault

William Kucharski (3):
  mm: Support removing arbitrary sized pages from mapping
  mm: Add a huge page fault handler for files
  mm: Align THP mappings for non-DAX

 drivers/net/ethernet/ibm/ibmveth.c |   2 -
 drivers/nvdimm/btt.c               |   4 +-
 drivers/nvdimm/pmem.c              |   3 +-
 fs/iomap/buffered-io.c             | 121 +++++++----
 fs/jfs/jfs_metapage.c              |   2 +-
 fs/xfs/xfs_aops.c                  |  37 ++--
 fs/xfs/xfs_file.c                  |   5 +-
 include/linux/huge_mm.h            |  15 +-
 include/linux/iomap.h              |   2 +-
 include/linux/mm.h                 |  12 ++
 include/linux/mm_inline.h          |   6 +-
 include/linux/pagemap.h            |  73 ++++++-
 mm/filemap.c                       | 311 ++++++++++++++++++++++++++---
 mm/gup.c                           |   2 +-
 mm/huge_memory.c                   |  11 +-
 mm/internal.h                      |   4 +-
 mm/memcontrol.c                    |  14 +-
 mm/memory_hotplug.c                |   4 +-
 mm/mempolicy.c                     |   2 +-
 mm/migrate.c                       |  19 +-
 mm/mlock.c                         |   9 +-
 mm/page_io.c                       |   4 +-
 mm/page_vma_mapped.c               |   6 +-
 mm/rmap.c                          |   8 +-
 mm/swap.c                          |   4 +-
 mm/swap_state.c                    |   4 +-
 mm/swapfile.c                      |   2 +-
 mm/vmscan.c                        |   9 +-
 28 files changed, 519 insertions(+), 176 deletions(-)

-- 
2.23.0



* [PATCH 01/15] mm: Use vm_fault error code directly
  2019-09-25  0:51 [RFC 00/15] Large pages in the page-cache Matthew Wilcox
@ 2019-09-25  0:52 ` Matthew Wilcox
  2019-09-26 13:55   ` Kirill A. Shutemov
  2019-09-25  0:52 ` [PATCH 02/15] fs: Introduce i_blocks_per_page Matthew Wilcox
                   ` (17 subsequent siblings)
  18 siblings, 1 reply; 40+ messages in thread
From: Matthew Wilcox @ 2019-09-25  0:52 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox (Oracle)

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use VM_FAULT_OOM instead of indirecting through vmf_error(-ENOMEM).

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/filemap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 1146fcfa3215..625ef3ef19f3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2533,7 +2533,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 		if (!page) {
 			if (fpin)
 				goto out_retry;
-			return vmf_error(-ENOMEM);
+			return VM_FAULT_OOM;
 		}
 	}
 
-- 
2.23.0



* [PATCH 02/15] fs: Introduce i_blocks_per_page
  2019-09-25  0:51 [RFC 00/15] Large pages in the page-cache Matthew Wilcox
  2019-09-25  0:52 ` [PATCH 01/15] mm: Use vm_fault error code directly Matthew Wilcox
@ 2019-09-25  0:52 ` Matthew Wilcox
  2019-09-25  8:36   ` Dave Chinner
  2019-09-25  0:52 ` [PATCH 03/15] mm: Add file_offset_of_ helpers Matthew Wilcox
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 40+ messages in thread
From: Matthew Wilcox @ 2019-09-25  0:52 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox (Oracle)

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

This helper is useful both for large pages in the page cache and for
supporting block sizes larger than the page size.  Convert some example
users (we currently have a few different ways of writing this idiom).
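
For example, the two common spellings of the idiom (the local variable
name is purely for illustration):

	blocks = PAGE_SIZE / i_blocksize(inode);
	blocks = PAGE_SIZE >> inode->i_blkbits;

both become:

	blocks = i_blocks_per_page(inode, page);

which also gives the right answer once @page may be a compound page.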

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c  |  4 ++--
 fs/jfs/jfs_metapage.c   |  2 +-
 fs/xfs/xfs_aops.c       |  8 ++++----
 include/linux/pagemap.h | 13 +++++++++++++
 4 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index e25901ae3ff4..0e76a4b6d98a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -24,7 +24,7 @@ iomap_page_create(struct inode *inode, struct page *page)
 {
 	struct iomap_page *iop = to_iomap_page(page);
 
-	if (iop || i_blocksize(inode) == PAGE_SIZE)
+	if (iop || i_blocks_per_page(inode, page) <= 1)
 		return iop;
 
 	iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
@@ -128,7 +128,7 @@ iomap_set_range_uptodate(struct page *page, unsigned off, unsigned len)
 	bool uptodate = true;
 
 	if (iop) {
-		for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
+		for (i = 0; i < i_blocks_per_page(inode, page); i++) {
 			if (i >= first && i <= last)
 				set_bit(i, iop->uptodate);
 			else if (!test_bit(i, iop->uptodate))
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index a2f5338a5ea1..176580f54af9 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -473,7 +473,7 @@ static int metapage_readpage(struct file *fp, struct page *page)
 	struct inode *inode = page->mapping->host;
 	struct bio *bio = NULL;
 	int block_offset;
-	int blocks_per_page = PAGE_SIZE >> inode->i_blkbits;
+	int blocks_per_page = i_blocks_per_page(inode, page);
 	sector_t page_start;	/* address of page in fs blocks */
 	sector_t pblock;
 	int xlen;
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index f16d5f196c6b..102cfd8a97d6 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -68,7 +68,7 @@ xfs_finish_page_writeback(
 		mapping_set_error(inode->i_mapping, -EIO);
 	}
 
-	ASSERT(iop || i_blocksize(inode) == PAGE_SIZE);
+	ASSERT(iop || i_blocks_per_page(inode, bvec->bv_page) <= 1);
 	ASSERT(!iop || atomic_read(&iop->write_count) > 0);
 
 	if (!iop || atomic_dec_and_test(&iop->write_count))
@@ -839,7 +839,7 @@ xfs_aops_discard_page(
 			page, ip->i_ino, offset);
 
 	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
-			PAGE_SIZE / i_blocksize(inode));
+			i_blocks_per_page(inode, page));
 	if (error && !XFS_FORCED_SHUTDOWN(mp))
 		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
 out_invalidate:
@@ -877,7 +877,7 @@ xfs_writepage_map(
 	uint64_t		file_offset;	/* file offset of page */
 	int			error = 0, count = 0, i;
 
-	ASSERT(iop || i_blocksize(inode) == PAGE_SIZE);
+	ASSERT(iop || i_blocks_per_page(inode, page) <= 1);
 	ASSERT(!iop || atomic_read(&iop->write_count) == 0);
 
 	/*
@@ -886,7 +886,7 @@ xfs_writepage_map(
 	 * one.
 	 */
 	for (i = 0, file_offset = page_offset(page);
-	     i < (PAGE_SIZE >> inode->i_blkbits) && file_offset < end_offset;
+	     i < i_blocks_per_page(inode, page) && file_offset < end_offset;
 	     i++, file_offset += len) {
 		if (iop && !test_bit(i, iop->uptodate))
 			continue;
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 37a4d9e32cd3..750770a2c685 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -636,4 +636,17 @@ static inline unsigned long dir_pages(struct inode *inode)
 			       PAGE_SHIFT;
 }
 
+/**
+ * i_blocks_per_page - How many blocks fit in this page.
+ * @inode: The inode which contains the blocks.
+ * @page: The (potentially large) page.
+ *
+ * Context: Any context.
+ * Return: The number of filesystem blocks covered by this page.
+ */
+static inline
+unsigned int i_blocks_per_page(struct inode *inode, struct page *page)
+{
+	return page_size(page) >> inode->i_blkbits;
+}
 #endif /* _LINUX_PAGEMAP_H */
-- 
2.23.0



* [PATCH 03/15] mm: Add file_offset_of_ helpers
  2019-09-25  0:51 [RFC 00/15] Large pages in the page-cache Matthew Wilcox
  2019-09-25  0:52 ` [PATCH 01/15] mm: Use vm_fault error code directly Matthew Wilcox
  2019-09-25  0:52 ` [PATCH 02/15] fs: Introduce i_blocks_per_page Matthew Wilcox
@ 2019-09-25  0:52 ` Matthew Wilcox
  2019-09-26 14:02   ` Kirill A. Shutemov
  2019-09-25  0:52 ` [PATCH 04/15] iomap: Support large pages Matthew Wilcox
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 40+ messages in thread
From: Matthew Wilcox @ 2019-09-25  0:52 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox (Oracle)

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

The page_offset() function is badly named from the point of view of
people reading its callers.  The natural meaning of a function with
this name would be 'offset within a page', not 'byte offset of the page
within a file'.  Dave Chinner suggested file_offset_of_page() as a
replacement name; I'm also adding file_offset_of_next_page() as a
helper for the large page work.  Also add kernel-doc for these
functions so they show up in the kernel API book.

page_offset() is retained as a compatibility define for now.
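
As a purely illustrative example (the variable names are not from this
patch), the byte range covered by a possibly-compound page cache page
can be computed as:

	loff_t pos = file_offset_of_page(page);		/* first byte of the page */
	loff_t end = file_offset_of_next_page(page);	/* first byte after it */
	loff_t len = end - pos;				/* equal to page_size(page) */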
---
 drivers/net/ethernet/ibm/ibmveth.c |  2 --
 include/linux/pagemap.h            | 25 ++++++++++++++++++++++---
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
index c5be4ebd8437..bf98aeaf9a45 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -978,8 +978,6 @@ static int ibmveth_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 	return -EOPNOTSUPP;
 }
 
-#define page_offset(v) ((unsigned long)(v) & ((1 << 12) - 1))
-
 static int ibmveth_send(struct ibmveth_adapter *adapter,
 			union ibmveth_buf_desc *descs, unsigned long mss)
 {
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 750770a2c685..103205494ea0 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -428,14 +428,33 @@ static inline pgoff_t page_to_pgoff(struct page *page)
 	return page_to_index(page);
 }
 
-/*
- * Return byte-offset into filesystem object for page.
+/**
+ * file_offset_of_page - File offset of this page.
+ * @page: Page cache page.
+ *
+ * Context: Any context.
+ * Return: The offset of the first byte of this page.
  */
-static inline loff_t page_offset(struct page *page)
+static inline loff_t file_offset_of_page(struct page *page)
 {
 	return ((loff_t)page->index) << PAGE_SHIFT;
 }
 
+/* Legacy; please convert callers */
+#define page_offset(page)	file_offset_of_page(page)
+
+/**
+ * file_offset_of_next_page - File offset of the next page.
+ * @page: Page cache page.
+ *
+ * Context: Any context.
+ * Return: The offset of the first byte after this page.
+ */
+static inline loff_t file_offset_of_next_page(struct page *page)
+{
+	return ((loff_t)page->index + compound_nr(page)) << PAGE_SHIFT;
+}
+
 static inline loff_t page_file_offset(struct page *page)
 {
 	return ((loff_t)page_index(page)) << PAGE_SHIFT;
-- 
2.23.0



* [PATCH 04/15] iomap: Support large pages
  2019-09-25  0:51 [RFC 00/15] Large pages in the page-cache Matthew Wilcox
                   ` (2 preceding siblings ...)
  2019-09-25  0:52 ` [PATCH 03/15] mm: Add file_offset_of_ helpers Matthew Wilcox
@ 2019-09-25  0:52 ` Matthew Wilcox
  2019-09-25  0:52 ` [PATCH 05/15] xfs: " Matthew Wilcox
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 40+ messages in thread
From: Matthew Wilcox @ 2019-09-25  0:52 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox (Oracle)

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Change the statically sized uptodate bitmap in struct iomap_page to a
dynamically allocated one, so it can track an arbitrarily large page.

The only remaining places where iomap assumes an order-0 page are for
files with inline data, where there's no sense in allocating a larger
page.
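
For example, the allocation in iomap_page_create() (see the diff below)
now sizes the bitmap to the number of blocks in the page:

	n = BITS_TO_LONGS(i_blocks_per_page(inode, page));
	iop = kmalloc(struct_size(iop, uptodate, n),
			GFP_NOFS | __GFP_NOFAIL | __GFP_ZERO);

so a 2MB page with 4kB blocks gets a 512-bit bitmap instead of the old
fixed PAGE_SIZE / SECTOR_SIZE bits.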

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 119 ++++++++++++++++++++++++++---------------
 include/linux/iomap.h  |   2 +-
 include/linux/mm.h     |   2 +
 3 files changed, 80 insertions(+), 43 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 0e76a4b6d98a..15d844a88439 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -23,14 +23,14 @@ static struct iomap_page *
 iomap_page_create(struct inode *inode, struct page *page)
 {
 	struct iomap_page *iop = to_iomap_page(page);
+	unsigned int n;
 
 	if (iop || i_blocks_per_page(inode, page) <= 1)
 		return iop;
 
-	iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
-	atomic_set(&iop->read_count, 0);
-	atomic_set(&iop->write_count, 0);
-	bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
+	n = BITS_TO_LONGS(i_blocks_per_page(inode, page));
+	iop = kmalloc(struct_size(iop, uptodate, n),
+			GFP_NOFS | __GFP_NOFAIL | __GFP_ZERO);
 
 	/*
 	 * migrate_page_move_mapping() assumes that pages with private data have
@@ -61,15 +61,16 @@ iomap_page_release(struct page *page)
  * Calculate the range inside the page that we actually need to read.
  */
 static void
-iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
+iomap_adjust_read_range(struct inode *inode, struct page *page,
 		loff_t *pos, loff_t length, unsigned *offp, unsigned *lenp)
 {
+	struct iomap_page *iop = to_iomap_page(page);
 	loff_t orig_pos = *pos;
 	loff_t isize = i_size_read(inode);
 	unsigned block_bits = inode->i_blkbits;
 	unsigned block_size = (1 << block_bits);
-	unsigned poff = offset_in_page(*pos);
-	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
+	unsigned poff = offset_in_this_page(page, *pos);
+	unsigned plen = min_t(loff_t, page_size(page) - poff, length);
 	unsigned first = poff >> block_bits;
 	unsigned last = (poff + plen - 1) >> block_bits;
 
@@ -107,7 +108,8 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
 	 * page cache for blocks that are entirely outside of i_size.
 	 */
 	if (orig_pos <= isize && orig_pos + length > isize) {
-		unsigned end = offset_in_page(isize - 1) >> block_bits;
+		unsigned end = offset_in_this_page(page, isize - 1) >>
+				block_bits;
 
 		if (first <= end && last > end)
 			plen -= (last - end) * block_size;
@@ -121,19 +123,16 @@ static void
 iomap_set_range_uptodate(struct page *page, unsigned off, unsigned len)
 {
 	struct iomap_page *iop = to_iomap_page(page);
-	struct inode *inode = page->mapping->host;
-	unsigned first = off >> inode->i_blkbits;
-	unsigned last = (off + len - 1) >> inode->i_blkbits;
-	unsigned int i;
 	bool uptodate = true;
 
 	if (iop) {
-		for (i = 0; i < i_blocks_per_page(inode, page); i++) {
-			if (i >= first && i <= last)
-				set_bit(i, iop->uptodate);
-			else if (!test_bit(i, iop->uptodate))
-				uptodate = false;
-		}
+		struct inode *inode = page->mapping->host;
+		unsigned first = off >> inode->i_blkbits;
+		unsigned count = len >> inode->i_blkbits;
+
+		bitmap_set(iop->uptodate, first, count);
+		if (!bitmap_full(iop->uptodate, i_blocks_per_page(inode, page)))
+			uptodate = false;
 	}
 
 	if (uptodate && !PageError(page))
@@ -194,6 +193,7 @@ iomap_read_inline_data(struct inode *inode, struct page *page,
 		return;
 
 	BUG_ON(page->index);
+	BUG_ON(PageCompound(page));
 	BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
 
 	addr = kmap_atomic(page);
@@ -203,6 +203,16 @@ iomap_read_inline_data(struct inode *inode, struct page *page,
 	SetPageUptodate(page);
 }
 
+/*
+ * Estimate the number of vectors we need based on the current page size;
+ * if we're wrong we'll end up doing an overly large allocation or needing
+ * to do a second allocation, neither of which is a big deal.
+ */
+static unsigned int iomap_nr_vecs(struct page *page, loff_t length)
+{
+	return (length + page_size(page) - 1) >> page_shift(page);
+}
+
 static loff_t
 iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap)
@@ -222,7 +232,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	}
 
 	/* zero post-eof blocks as the page may be mapped */
-	iomap_adjust_read_range(inode, iop, &pos, length, &poff, &plen);
+	iomap_adjust_read_range(inode, page, &pos, length, &poff, &plen);
 	if (plen == 0)
 		goto done;
 
@@ -258,7 +268,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 
 	if (!ctx->bio || !is_contig || bio_full(ctx->bio, plen)) {
 		gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
-		int nr_vecs = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
+		int nr_vecs = iomap_nr_vecs(page, length);
 
 		if (ctx->bio)
 			submit_bio(ctx->bio);
@@ -293,9 +303,9 @@ iomap_readpage(struct page *page, const struct iomap_ops *ops)
 	unsigned poff;
 	loff_t ret;
 
-	for (poff = 0; poff < PAGE_SIZE; poff += ret) {
-		ret = iomap_apply(inode, page_offset(page) + poff,
-				PAGE_SIZE - poff, 0, ops, &ctx,
+	for (poff = 0; poff < page_size(page); poff += ret) {
+		ret = iomap_apply(inode, file_offset_of_page(page) + poff,
+				page_size(page) - poff, 0, ops, &ctx,
 				iomap_readpage_actor);
 		if (ret <= 0) {
 			WARN_ON_ONCE(ret == 0);
@@ -328,7 +338,7 @@ iomap_next_page(struct inode *inode, struct list_head *pages, loff_t pos,
 	while (!list_empty(pages)) {
 		struct page *page = lru_to_page(pages);
 
-		if (page_offset(page) >= (u64)pos + length)
+		if (file_offset_of_page(page) >= (u64)pos + length)
 			break;
 
 		list_del(&page->lru);
@@ -342,7 +352,7 @@ iomap_next_page(struct inode *inode, struct list_head *pages, loff_t pos,
 		 * readpages call itself as every page gets checked again once
 		 * actually needed.
 		 */
-		*done += PAGE_SIZE;
+		*done += page_size(page);
 		put_page(page);
 	}
 
@@ -355,9 +365,14 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
 {
 	struct iomap_readpage_ctx *ctx = data;
 	loff_t done, ret;
+	size_t left = 0;
+
+	if (ctx->cur_page)
+		left = page_size(ctx->cur_page) -
+					offset_in_this_page(ctx->cur_page, pos);
 
 	for (done = 0; done < length; done += ret) {
-		if (ctx->cur_page && offset_in_page(pos + done) == 0) {
+		if (ctx->cur_page && left == 0) {
 			if (!ctx->cur_page_in_bio)
 				unlock_page(ctx->cur_page);
 			put_page(ctx->cur_page);
@@ -369,14 +384,27 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
 			if (!ctx->cur_page)
 				break;
 			ctx->cur_page_in_bio = false;
+			left = page_size(ctx->cur_page);
 		}
 		ret = iomap_readpage_actor(inode, pos + done, length - done,
 				ctx, iomap);
+		left -= ret;
 	}
 
 	return done;
 }
 
+/* move to fs.h? */
+static inline struct page *readahead_first_page(struct list_head *head)
+{
+	return list_entry(head->prev, struct page, lru);
+}
+
+static inline struct page *readahead_last_page(struct list_head *head)
+{
+	return list_entry(head->next, struct page, lru);
+}
+
 int
 iomap_readpages(struct address_space *mapping, struct list_head *pages,
 		unsigned nr_pages, const struct iomap_ops *ops)
@@ -385,9 +413,10 @@ iomap_readpages(struct address_space *mapping, struct list_head *pages,
 		.pages		= pages,
 		.is_readahead	= true,
 	};
-	loff_t pos = page_offset(list_entry(pages->prev, struct page, lru));
-	loff_t last = page_offset(list_entry(pages->next, struct page, lru));
-	loff_t length = last - pos + PAGE_SIZE, ret = 0;
+	loff_t pos = file_offset_of_page(readahead_first_page(pages));
+	loff_t end = file_offset_of_next_page(readahead_last_page(pages));
+	loff_t length = end - pos;
+	loff_t ret = 0;
 
 	while (length > 0) {
 		ret = iomap_apply(mapping->host, pos, length, 0, ops,
@@ -410,7 +439,7 @@ iomap_readpages(struct address_space *mapping, struct list_head *pages,
 	}
 
 	/*
-	 * Check that we didn't lose a page due to the arcance calling
+	 * Check that we didn't lose a page due to the arcane calling
 	 * conventions..
 	 */
 	WARN_ON_ONCE(!ret && !list_empty(ctx.pages));
@@ -435,7 +464,7 @@ iomap_is_partially_uptodate(struct page *page, unsigned long from,
 	unsigned i;
 
 	/* Limit range to one page */
-	len = min_t(unsigned, PAGE_SIZE - from, count);
+	len = min_t(unsigned, page_size(page) - from, count);
 
 	/* First and last blocks in range within page */
 	first = from >> inode->i_blkbits;
@@ -474,7 +503,7 @@ iomap_invalidatepage(struct page *page, unsigned int offset, unsigned int len)
 	 * If we are invalidating the entire page, clear the dirty state from it
 	 * and release it to avoid unnecessary buildup of the LRU.
 	 */
-	if (offset == 0 && len == PAGE_SIZE) {
+	if (offset == 0 && len == page_size(page)) {
 		WARN_ON_ONCE(PageWriteback(page));
 		cancel_dirty_page(page);
 		iomap_page_release(page);
@@ -550,18 +579,20 @@ static int
 __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
 		struct page *page, struct iomap *iomap)
 {
-	struct iomap_page *iop = iomap_page_create(inode, page);
 	loff_t block_size = i_blocksize(inode);
 	loff_t block_start = pos & ~(block_size - 1);
 	loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
-	unsigned from = offset_in_page(pos), to = from + len, poff, plen;
+	unsigned from = offset_in_this_page(page, pos);
+	unsigned to = from + len;
+	unsigned poff, plen;
 	int status = 0;
 
 	if (PageUptodate(page))
 		return 0;
+	iomap_page_create(inode, page);
 
 	do {
-		iomap_adjust_read_range(inode, iop, &block_start,
+		iomap_adjust_read_range(inode, page, &block_start,
 				block_end - block_start, &poff, &plen);
 		if (plen == 0)
 			break;
@@ -673,7 +704,7 @@ __iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
 	 */
 	if (unlikely(copied < len && !PageUptodate(page)))
 		return 0;
-	iomap_set_range_uptodate(page, offset_in_page(pos), len);
+	iomap_set_range_uptodate(page, offset_in_this_page(page, pos), len);
 	iomap_set_page_dirty(page);
 	return copied;
 }
@@ -685,6 +716,7 @@ iomap_write_end_inline(struct inode *inode, struct page *page,
 	void *addr;
 
 	WARN_ON_ONCE(!PageUptodate(page));
+	BUG_ON(PageCompound(page));
 	BUG_ON(pos + copied > PAGE_SIZE - offset_in_page(iomap->inline_data));
 
 	addr = kmap_atomic(page);
@@ -749,6 +781,10 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
 
+		/*
+		 * XXX: We don't know what size page we'll find in the
+		 * page cache, so only copy up to a regular page boundary.
+		 */
 		offset = offset_in_page(pos);
 		bytes = min_t(unsigned long, PAGE_SIZE - offset,
 						iov_iter_count(i));
@@ -1041,19 +1077,18 @@ vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops)
 	lock_page(page);
 	size = i_size_read(inode);
 	if ((page->mapping != inode->i_mapping) ||
-	    (page_offset(page) > size)) {
+	    (file_offset_of_page(page) > size)) {
 		/* We overload EFAULT to mean page got truncated */
 		ret = -EFAULT;
 		goto out_unlock;
 	}
 
-	/* page is wholly or partially inside EOF */
-	if (((page->index + 1) << PAGE_SHIFT) > size)
-		length = offset_in_page(size);
+	offset = file_offset_of_page(page);
+	if (size - offset < page_size(page))
+		length = offset_in_this_page(page, size);
 	else
-		length = PAGE_SIZE;
+		length = page_size(page);
 
-	offset = page_offset(page);
 	while (length > 0) {
 		ret = iomap_apply(inode, offset, length,
 				IOMAP_WRITE | IOMAP_FAULT, ops, page,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index bc499ceae392..86be24a8259b 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -139,7 +139,7 @@ loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
 struct iomap_page {
 	atomic_t		read_count;
 	atomic_t		write_count;
-	DECLARE_BITMAP(uptodate, PAGE_SIZE / 512);
+	unsigned long		uptodate[];
 };
 
 static inline struct iomap_page *to_iomap_page(struct page *page)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 294a67b94147..04bea9f9282c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1415,6 +1415,8 @@ static inline void clear_page_pfmemalloc(struct page *page)
 extern void pagefault_out_of_memory(void);
 
 #define offset_in_page(p)	((unsigned long)(p) & ~PAGE_MASK)
+#define offset_in_this_page(page, p)	\
+	((unsigned long)(p) & (page_size(page) - 1))
 
 /*
  * Flags passed to show_mem() and show_free_areas() to suppress output in
-- 
2.23.0



* [PATCH 05/15] xfs: Support large pages
  2019-09-25  0:51 [RFC 00/15] Large pages in the page-cache Matthew Wilcox
                   ` (3 preceding siblings ...)
  2019-09-25  0:52 ` [PATCH 04/15] iomap: Support large pages Matthew Wilcox
@ 2019-09-25  0:52 ` Matthew Wilcox
  2019-09-25  0:52 ` [PATCH 06/15] xfs: Pass a page to xfs_finish_page_writeback Matthew Wilcox
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 40+ messages in thread
From: Matthew Wilcox @ 2019-09-25  0:52 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox (Oracle)

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Mostly this is just checking the size of each page instead of assuming
PAGE_SIZE.  Also clean up the i_size handling in xfs_do_writepage() a
little.
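
For example, the offset within the page computed in xfs_add_to_ioend()
changes from

	poff = offset & (PAGE_SIZE - 1);

to

	poff = offset & (page_size(page) - 1);

which is identical for order-0 pages and correct for larger ones.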

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/xfs/xfs_aops.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 102cfd8a97d6..1a26e9ca626b 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -765,7 +765,7 @@ xfs_add_to_ioend(
 	struct xfs_mount	*mp = ip->i_mount;
 	struct block_device	*bdev = xfs_find_bdev_for_inode(inode);
 	unsigned		len = i_blocksize(inode);
-	unsigned		poff = offset & (PAGE_SIZE - 1);
+	unsigned		poff = offset & (page_size(page) - 1);
 	bool			merged, same_page = false;
 	sector_t		sector;
 
@@ -843,7 +843,7 @@ xfs_aops_discard_page(
 	if (error && !XFS_FORCED_SHUTDOWN(mp))
 		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
 out_invalidate:
-	xfs_vm_invalidatepage(page, 0, PAGE_SIZE);
+	xfs_vm_invalidatepage(page, 0, page_size(page));
 }
 
 /*
@@ -984,8 +984,7 @@ xfs_do_writepage(
 	struct xfs_writepage_ctx *wpc = data;
 	struct inode		*inode = page->mapping->host;
 	loff_t			offset;
-	uint64_t              end_offset;
-	pgoff_t                 end_index;
+	uint64_t		end_offset;
 
 	trace_xfs_writepage(inode, page, 0, 0);
 
@@ -1024,10 +1023,9 @@ xfs_do_writepage(
 	 * ---------------------------------^------------------|
 	 */
 	offset = i_size_read(inode);
-	end_index = offset >> PAGE_SHIFT;
-	if (page->index < end_index)
-		end_offset = (xfs_off_t)(page->index + 1) << PAGE_SHIFT;
-	else {
+	end_offset = file_offset_of_next_page(page);
+
+	if (end_offset > offset) {
 		/*
 		 * Check whether the page to write out is beyond or straddles
 		 * i_size or not.
@@ -1039,7 +1037,8 @@ xfs_do_writepage(
 		 * |				    |      Straddles     |
 		 * ---------------------------------^-----------|--------|
 		 */
-		unsigned offset_into_page = offset & (PAGE_SIZE - 1);
+		unsigned offset_into_page = offset_in_this_page(page, offset);
+		pgoff_t end_index = offset >> PAGE_SHIFT;
 
 		/*
 		 * Skip the page if it is fully outside i_size, e.g. due to a
@@ -1070,7 +1069,7 @@ xfs_do_writepage(
 		 * memory is zeroed when mapped, and writes to that region are
 		 * not written out to the file."
 		 */
-		zero_user_segment(page, offset_into_page, PAGE_SIZE);
+		zero_user_segment(page, offset_into_page, page_size(page));
 
 		/* Adjust the end_offset to the end of file */
 		end_offset = offset;
-- 
2.23.0



* [PATCH 06/15] xfs: Pass a page to xfs_finish_page_writeback
  2019-09-25  0:51 [RFC 00/15] Large pages in the page-cache Matthew Wilcox
                   ` (4 preceding siblings ...)
  2019-09-25  0:52 ` [PATCH 05/15] xfs: " Matthew Wilcox
@ 2019-09-25  0:52 ` Matthew Wilcox
  2019-09-25  0:52 ` [PATCH 07/15] mm: Make prep_transhuge_page tail-callable Matthew Wilcox
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 40+ messages in thread
From: Matthew Wilcox @ 2019-09-25  0:52 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox (Oracle)

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

The only part of the bvec we were accessing was the bv_page, so just
pass that instead of the whole bvec.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/xfs/xfs_aops.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 1a26e9ca626b..edcb4797fcc2 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -58,21 +58,21 @@ xfs_find_daxdev_for_inode(
 static void
 xfs_finish_page_writeback(
 	struct inode		*inode,
-	struct bio_vec	*bvec,
+	struct page		*page,
 	int			error)
 {
-	struct iomap_page	*iop = to_iomap_page(bvec->bv_page);
+	struct iomap_page	*iop = to_iomap_page(page);
 
 	if (error) {
-		SetPageError(bvec->bv_page);
+		SetPageError(page);
 		mapping_set_error(inode->i_mapping, -EIO);
 	}
 
-	ASSERT(iop || i_blocks_per_page(inode, bvec->bv_page) <= 1);
+	ASSERT(iop || i_blocks_per_page(inode, page) <= 1);
 	ASSERT(!iop || atomic_read(&iop->write_count) > 0);
 
 	if (!iop || atomic_dec_and_test(&iop->write_count))
-		end_page_writeback(bvec->bv_page);
+		end_page_writeback(page);
 }
 
 /*
@@ -106,7 +106,7 @@ xfs_destroy_ioend(
 
 		/* walk each page on bio, ending page IO on them */
 		bio_for_each_segment_all(bvec, bio, iter_all)
-			xfs_finish_page_writeback(inode, bvec, error);
+			xfs_finish_page_writeback(inode, bvec->bv_page, error);
 		bio_put(bio);
 	}
 
-- 
2.23.0



* [PATCH 07/15] mm: Make prep_transhuge_page tail-callable
  2019-09-25  0:51 [RFC 00/15] Large pages in the page-cache Matthew Wilcox
                   ` (5 preceding siblings ...)
  2019-09-25  0:52 ` [PATCH 06/15] xfs: Pass a page to xfs_finish_page_writeback Matthew Wilcox
@ 2019-09-25  0:52 ` Matthew Wilcox
  2019-09-26 14:13   ` Kirill A. Shutemov
  2019-09-25  0:52 ` [PATCH 08/15] mm: Add __page_cache_alloc_order Matthew Wilcox
                   ` (11 subsequent siblings)
  18 siblings, 1 reply; 40+ messages in thread
From: Matthew Wilcox @ 2019-09-25  0:52 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox (Oracle)

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

By permitting a NULL or order-0 page as the argument, and returning
the argument, callers can write:

	return prep_transhuge_page(alloc_pages(...));

instead of assigning the result to a temporary variable and conditionally
passing that to prep_transhuge_page().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/huge_mm.h | 7 +++++--
 mm/huge_memory.c        | 9 +++++++--
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 61c9ffd89b05..779e83800a77 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -153,7 +153,7 @@ extern unsigned long thp_get_unmapped_area(struct file *filp,
 		unsigned long addr, unsigned long len, unsigned long pgoff,
 		unsigned long flags);
 
-extern void prep_transhuge_page(struct page *page);
+extern struct page *prep_transhuge_page(struct page *page);
 extern void free_transhuge_page(struct page *page);
 
 bool can_split_huge_page(struct page *page, int *pextra_pins);
@@ -303,7 +303,10 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
 	return false;
 }
 
-static inline void prep_transhuge_page(struct page *page) {}
+static inline struct page *prep_transhuge_page(struct page *page)
+{
+	return page;
+}
 
 #define transparent_hugepage_flags 0UL
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 73fc517c08d2..cbe7d0619439 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -516,15 +516,20 @@ static inline struct deferred_split *get_deferred_split_queue(struct page *page)
 }
 #endif
 
-void prep_transhuge_page(struct page *page)
+struct page *prep_transhuge_page(struct page *page)
 {
+	if (!page || compound_order(page) == 0)
+		return page;
 	/*
-	 * we use page->mapping and page->indexlru in second tail page
+	 * we use page->mapping and page->index in second tail page
 	 * as list_head: assuming THP order >= 2
 	 */
+	BUG_ON(compound_order(page) == 1);
 
 	INIT_LIST_HEAD(page_deferred_list(page));
 	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
+
+	return page;
 }
 
 static unsigned long __thp_get_unmapped_area(struct file *filp, unsigned long len,
-- 
2.23.0



* [PATCH 08/15] mm: Add __page_cache_alloc_order
  2019-09-25  0:51 [RFC 00/15] Large pages in the page-cache Matthew Wilcox
                   ` (6 preceding siblings ...)
  2019-09-25  0:52 ` [PATCH 07/15] mm: Make prep_transhuge_page tail-callable Matthew Wilcox
@ 2019-09-25  0:52 ` Matthew Wilcox
  2019-09-26 14:15   ` Kirill A. Shutemov
  2019-09-25  0:52 ` [PATCH 09/15] mm: Allow large pages to be added to the page cache Matthew Wilcox
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 40+ messages in thread
From: Matthew Wilcox @ 2019-09-25  0:52 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox (Oracle)

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

This new function allows page cache pages larger than order 0 to be
allocated.
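
A hypothetical caller wanting a PMD-sized page might look like this
(the caller and its fallback policy are illustrative, not part of this
patch):

	struct page *page;

	page = __page_cache_alloc_order(mapping_gfp_mask(mapping),
					HPAGE_PMD_ORDER);
	if (!page)
		page = __page_cache_alloc(mapping_gfp_mask(mapping));

Because the result is passed through prep_transhuge_page(), a compound
page comes back with its deferred-split list and destructor already
set up.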

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagemap.h | 14 +++++++++++---
 mm/filemap.c            | 12 ++++++++----
 2 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 103205494ea0..d610a49be571 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -208,14 +208,22 @@ static inline int page_cache_add_speculative(struct page *page, int count)
 }
 
 #ifdef CONFIG_NUMA
-extern struct page *__page_cache_alloc(gfp_t gfp);
+extern struct page *__page_cache_alloc_order(gfp_t gfp, unsigned int order);
 #else
-static inline struct page *__page_cache_alloc(gfp_t gfp)
+static inline
+struct page *__page_cache_alloc_order(gfp_t gfp, unsigned int order)
 {
-	return alloc_pages(gfp, 0);
+	if (order == 0)
+		return alloc_pages(gfp, 0);
+	return prep_transhuge_page(alloc_pages(gfp | __GFP_COMP, order));
 }
 #endif
 
+static inline struct page *__page_cache_alloc(gfp_t gfp)
+{
+	return __page_cache_alloc_order(gfp, 0);
+}
+
 static inline struct page *page_cache_alloc(struct address_space *x)
 {
 	return __page_cache_alloc(mapping_gfp_mask(x));
diff --git a/mm/filemap.c b/mm/filemap.c
index 625ef3ef19f3..bab97addbb1d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -962,24 +962,28 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
 
 #ifdef CONFIG_NUMA
-struct page *__page_cache_alloc(gfp_t gfp)
+struct page *__page_cache_alloc_order(gfp_t gfp, unsigned int order)
 {
 	int n;
 	struct page *page;
 
+	if (order > 0)
+		gfp |= __GFP_COMP;
+
 	if (cpuset_do_page_mem_spread()) {
 		unsigned int cpuset_mems_cookie;
 		do {
 			cpuset_mems_cookie = read_mems_allowed_begin();
 			n = cpuset_mem_spread_node();
-			page = __alloc_pages_node(n, gfp, 0);
+			page = __alloc_pages_node(n, gfp, order);
+			prep_transhuge_page(page);
 		} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));
 
 		return page;
 	}
-	return alloc_pages(gfp, 0);
+	return prep_transhuge_page(alloc_pages(gfp, order));
 }
-EXPORT_SYMBOL(__page_cache_alloc);
+EXPORT_SYMBOL(__page_cache_alloc_order);
 #endif
 
 /*
-- 
2.23.0



* [PATCH 09/15] mm: Allow large pages to be added to the page cache
  2019-09-25  0:51 [RFC 00/15] Large pages in the page-cache Matthew Wilcox
                   ` (7 preceding siblings ...)
  2019-09-25  0:52 ` [PATCH 08/15] mm: Add __page_cache_alloc_order Matthew Wilcox
@ 2019-09-25  0:52 ` Matthew Wilcox
  2019-09-26 14:22   ` Kirill A. Shutemov
  2019-09-25  0:52 ` [PATCH 10/15] mm: Allow find_get_page to be used for large pages Matthew Wilcox
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 40+ messages in thread
From: Matthew Wilcox @ 2019-09-25  0:52 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox (Oracle)

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

We return -EEXIST if there are any non-shadow entries in the page
cache in the range covered by the large page.  If there are multiple
shadow entries in the range, we set *shadowp to one of them (currently
the one at the highest index).  If that turns out to be the wrong
answer, we can implement something more complex.  This is mostly
modelled after the equivalent function in the shmem code.
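
As a worked example (the numbers are purely illustrative), adding an
order-9 page at index 512 sets up the XArray state to cover 512 slots:

	xas_set_order(&xas, 512, 9);	/* claims indices 512-1023 */

Any real page already present in that range fails the insert with
-EEXIST; shadow entries in the range are counted and overwritten.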

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/filemap.c | 37 ++++++++++++++++++++++++++-----------
 1 file changed, 26 insertions(+), 11 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index bab97addbb1d..afe8f5d95810 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -855,6 +855,7 @@ static int __add_to_page_cache_locked(struct page *page,
 	int huge = PageHuge(page);
 	struct mem_cgroup *memcg;
 	int error;
+	unsigned int nr = 1;
 	void *old;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -866,31 +867,45 @@ static int __add_to_page_cache_locked(struct page *page,
 					      gfp_mask, &memcg, false);
 		if (error)
 			return error;
+		xas_set_order(&xas, offset, compound_order(page));
+		nr = compound_nr(page);
 	}
 
-	get_page(page);
+	page_ref_add(page, nr);
 	page->mapping = mapping;
 	page->index = offset;
 
 	do {
+		unsigned long exceptional = 0;
+		unsigned int i = 0;
+
 		xas_lock_irq(&xas);
-		old = xas_load(&xas);
-		if (old && !xa_is_value(old))
+		xas_for_each_conflict(&xas, old) {
+			if (!xa_is_value(old))
+				break;
+			exceptional++;
+			if (shadowp)
+				*shadowp = old;
+		}
+		if (old)
 			xas_set_err(&xas, -EEXIST);
-		xas_store(&xas, page);
+		xas_create_range(&xas);
 		if (xas_error(&xas))
 			goto unlock;
 
-		if (xa_is_value(old)) {
-			mapping->nrexceptional--;
-			if (shadowp)
-				*shadowp = old;
+next:
+		xas_store(&xas, page);
+		if (++i < nr) {
+			xas_next(&xas);
+			goto next;
 		}
-		mapping->nrpages++;
+		mapping->nrexceptional -= exceptional;
+		mapping->nrpages += nr;
 
 		/* hugetlb pages do not participate in page cache accounting */
 		if (!huge)
-			__inc_node_page_state(page, NR_FILE_PAGES);
+			__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES,
+						nr);
 unlock:
 		xas_unlock_irq(&xas);
 	} while (xas_nomem(&xas, gfp_mask & GFP_RECLAIM_MASK));
@@ -907,7 +922,7 @@ static int __add_to_page_cache_locked(struct page *page,
 	/* Leave page->index set: truncation relies upon it */
 	if (!huge)
 		mem_cgroup_cancel_charge(page, memcg, false);
-	put_page(page);
+	page_ref_sub(page, nr);
 	return xas_error(&xas);
 }
 ALLOW_ERROR_INJECTION(__add_to_page_cache_locked, ERRNO);
-- 
2.23.0



* [PATCH 10/15] mm: Allow find_get_page to be used for large pages
  2019-09-25  0:51 [RFC 00/15] Large pages in the page-cache Matthew Wilcox
                   ` (8 preceding siblings ...)
  2019-09-25  0:52 ` [PATCH 09/15] mm: Allow large pages to be added to the page cache Matthew Wilcox
@ 2019-09-25  0:52 ` Matthew Wilcox
  2019-10-01 10:32   ` Kirill A. Shutemov
  2019-09-25  0:52 ` [PATCH 11/15] mm: Remove hpage_nr_pages Matthew Wilcox
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 40+ messages in thread
From: Matthew Wilcox @ 2019-09-25  0:52 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox (Oracle)

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Add FGP_PMD to indicate that we're trying to find-or-create a page that
is at least PMD_ORDER in size.  The internal 'conflict' entry usage
is modelled after that in DAX, but the implementations are different
due to DAX using multi-order entries and the page cache using multiple
order-0 entries.
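
A hypothetical lookup (the flag combination is illustrative; the real
user arrives with filemap_huge_fault later in the series) would be:

	page = pagecache_get_page(mapping, round_down(index, HPAGE_PMD_NR),
			FGP_CREAT | FGP_LOCK | FGP_PMD,
			mapping_gfp_mask(mapping));

If a smaller page already overlaps the PMD-sized range, %NULL is
returned rather than a conflicting page.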

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagemap.h | 13 ++++++
 mm/filemap.c            | 99 +++++++++++++++++++++++++++++++++++------
 2 files changed, 99 insertions(+), 13 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index d610a49be571..d6d97f9fb762 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -248,6 +248,19 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
 #define FGP_NOFS		0x00000010
 #define FGP_NOWAIT		0x00000020
 #define FGP_FOR_MMAP		0x00000040
+/*
+ * If you add more flags, increment FGP_ORDER_SHIFT (no further than 25).
+ * Do not insert flags above the FGP order bits.
+ */
+#define FGP_ORDER_SHIFT		7
+#define FGP_PMD			((PMD_SHIFT - PAGE_SHIFT) << FGP_ORDER_SHIFT)
+#define FGP_PUD			((PUD_SHIFT - PAGE_SHIFT) << FGP_ORDER_SHIFT)
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define fgp_order(fgp)		((fgp) >> FGP_ORDER_SHIFT)
+#else
+#define fgp_order(fgp)		0
+#endif
 
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
 		int fgp_flags, gfp_t cache_gfp_mask);
diff --git a/mm/filemap.c b/mm/filemap.c
index afe8f5d95810..8eca91547e40 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1576,7 +1576,71 @@ struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
 
 	return page;
 }
-EXPORT_SYMBOL(find_get_entry);
+
+static bool pagecache_is_conflict(struct page *page)
+{
+	return page == XA_RETRY_ENTRY;
+}
+
+/**
+ * __find_get_page - Find and get a page cache entry.
+ * @mapping: The address_space to search.
+ * @offset: The page cache index.
+ * @order: The minimum order of the entry to return.
+ *
+ * Looks up the page cache entries at @mapping between @offset and
+ * @offset + 2^@order.  If there is a page cache page, it is returned with
+ * an increased refcount unless it is smaller than @order.
+ *
+ * If the slot holds a shadow entry of a previously evicted page, or a
+ * swap entry from shmem/tmpfs, it is returned.
+ *
+ * Return: the found page, a value indicating a conflicting page or %NULL if
+ * there are no pages in this range.
+ */
+static struct page *__find_get_page(struct address_space *mapping,
+		unsigned long offset, unsigned int order)
+{
+	XA_STATE(xas, &mapping->i_pages, offset);
+	struct page *page;
+
+	rcu_read_lock();
+repeat:
+	xas_reset(&xas);
+	page = xas_find(&xas, offset | ((1UL << order) - 1));
+	if (xas_retry(&xas, page))
+		goto repeat;
+	/*
+	 * A shadow entry of a recently evicted page, or a swap entry from
+	 * shmem/tmpfs.  Skip it; keep looking for pages.
+	 */
+	if (xa_is_value(page))
+		goto repeat;
+	if (!page)
+		goto out;
+	if (compound_order(page) < order) {
+		page = XA_RETRY_ENTRY;
+		goto out;
+	}
+
+	if (!page_cache_get_speculative(page))
+		goto repeat;
+
+	/*
+	 * Has the page moved or been split?
+	 * This is part of the lockless pagecache protocol. See
+	 * include/linux/pagemap.h for details.
+	 */
+	if (unlikely(page != xas_reload(&xas))) {
+		put_page(page);
+		goto repeat;
+	}
+	page = find_subpage(page, offset);
+out:
+	rcu_read_unlock();
+
+	return page;
+}
 
 /**
  * find_lock_entry - locate, pin and lock a page cache entry
@@ -1618,12 +1682,12 @@ EXPORT_SYMBOL(find_lock_entry);
  * pagecache_get_page - find and get a page reference
  * @mapping: the address_space to search
  * @offset: the page index
- * @fgp_flags: PCG flags
+ * @fgp_flags: FGP flags
  * @gfp_mask: gfp mask to use for the page cache data page allocation
  *
  * Looks up the page cache slot at @mapping & @offset.
  *
- * PCG flags modify how the page is returned.
+ * FGP flags modify how the page is returned.
  *
  * @fgp_flags can be:
  *
@@ -1636,6 +1700,10 @@ EXPORT_SYMBOL(find_lock_entry);
  * - FGP_FOR_MMAP: Similar to FGP_CREAT, only we want to allow the caller to do
  *   its own locking dance if the page is already in cache, or unlock the page
  *   before returning if we had to add the page to pagecache.
+ * - FGP_PMD: We're only interested in pages at PMD granularity.  If there
+ *   is no page here (and FGP_CREATE is set), we'll create one large enough.
+ *   If there is a smaller page in the cache that overlaps the PMD page, we
+ *   return %NULL and do not attempt to create a page.
  *
  * If FGP_LOCK or FGP_CREAT are specified then the function may sleep even
  * if the GFP flags specified for FGP_CREAT are atomic.
@@ -1649,10 +1717,11 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
 {
 	struct page *page;
 
+	BUILD_BUG_ON(((63 << FGP_ORDER_SHIFT) >> FGP_ORDER_SHIFT) != 63);
 repeat:
-	page = find_get_entry(mapping, offset);
-	if (xa_is_value(page))
-		page = NULL;
+	page = __find_get_page(mapping, offset, fgp_order(fgp_flags));
+	if (pagecache_is_conflict(page))
+		return NULL;
 	if (!page)
 		goto no_page;
 
@@ -1686,7 +1755,7 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
 		if (fgp_flags & FGP_NOFS)
 			gfp_mask &= ~__GFP_FS;
 
-		page = __page_cache_alloc(gfp_mask);
+		page = __page_cache_alloc_order(gfp_mask, fgp_order(fgp_flags));
 		if (!page)
 			return NULL;
 
@@ -1704,13 +1773,17 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
 			if (err == -EEXIST)
 				goto repeat;
 		}
+		if (page) {
+			if (fgp_order(fgp_flags))
+				count_vm_event(THP_FILE_ALLOC);
 
-		/*
-		 * add_to_page_cache_lru locks the page, and for mmap we expect
-		 * an unlocked page.
-		 */
-		if (page && (fgp_flags & FGP_FOR_MMAP))
-			unlock_page(page);
+			/*
+			 * add_to_page_cache_lru locks the page, and
+			 * for mmap we expect an unlocked page.
+			 */
+			if (fgp_flags & FGP_FOR_MMAP)
+				unlock_page(page);
+		}
 	}
 
 	return page;
-- 
2.23.0



* [PATCH 11/15] mm: Remove hpage_nr_pages
  2019-09-25  0:51 [RFC 00/15] Large pages in the page-cache Matthew Wilcox
                   ` (9 preceding siblings ...)
  2019-09-25  0:52 ` [PATCH 10/15] mm: Allow find_get_page to be used for large pages Matthew Wilcox
@ 2019-09-25  0:52 ` Matthew Wilcox
  2019-10-01 10:35   ` Kirill A. Shutemov
  2019-09-25  0:52 ` [PATCH 12/15] mm: Support removing arbitrary sized pages from mapping Matthew Wilcox
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 40+ messages in thread
From: Matthew Wilcox @ 2019-09-25  0:52 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox (Oracle)

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

hpage_nr_pages() assumed that compound pages are necessarily PMD
sized.  While that may be true for some users today, it's not going to
be true for all users forever, so it's better to remove it and avoid
the confusion by just using compound_nr() or page_size().
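
The two replacements count the same thing, since for any page

	page_size(page) == compound_nr(page) << PAGE_SHIFT

so callers that want a byte count use page_size() and callers that
want a page count use compound_nr().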

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 drivers/nvdimm/btt.c      |  4 +---
 drivers/nvdimm/pmem.c     |  3 +--
 include/linux/huge_mm.h   |  8 --------
 include/linux/mm_inline.h |  6 +++---
 mm/filemap.c              |  2 +-
 mm/gup.c                  |  2 +-
 mm/internal.h             |  4 ++--
 mm/memcontrol.c           | 14 +++++++-------
 mm/memory_hotplug.c       |  4 ++--
 mm/mempolicy.c            |  2 +-
 mm/migrate.c              | 19 ++++++++++---------
 mm/mlock.c                |  9 ++++-----
 mm/page_io.c              |  4 ++--
 mm/page_vma_mapped.c      |  6 +++---
 mm/rmap.c                 |  8 ++++----
 mm/swap.c                 |  4 ++--
 mm/swap_state.c           |  4 ++--
 mm/swapfile.c             |  2 +-
 mm/vmscan.c               |  4 ++--
 19 files changed, 49 insertions(+), 60 deletions(-)

diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index a8d56887ec88..2aac2bf10a37 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1488,10 +1488,8 @@ static int btt_rw_page(struct block_device *bdev, sector_t sector,
 {
 	struct btt *btt = bdev->bd_disk->private_data;
 	int rc;
-	unsigned int len;
 
-	len = hpage_nr_pages(page) * PAGE_SIZE;
-	rc = btt_do_bvec(btt, NULL, page, len, 0, op, sector);
+	rc = btt_do_bvec(btt, NULL, page, page_size(page), 0, op, sector);
 	if (rc == 0)
 		page_endio(page, op_is_write(op), 0);
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index f9f76f6ba07b..778c73fd10d6 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -224,8 +224,7 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 	struct pmem_device *pmem = bdev->bd_queue->queuedata;
 	blk_status_t rc;
 
-	rc = pmem_do_bvec(pmem, page, hpage_nr_pages(page) * PAGE_SIZE,
-			  0, op, sector);
+	rc = pmem_do_bvec(pmem, page, page_size(page), 0, op, sector);
 
 	/*
 	 * The ->rw_page interface is subtle and tricky.  The core
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 779e83800a77..6018d31549c3 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -226,12 +226,6 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
 	else
 		return NULL;
 }
-static inline int hpage_nr_pages(struct page *page)
-{
-	if (unlikely(PageTransHuge(page)))
-		return HPAGE_PMD_NR;
-	return 1;
-}
 
 struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 		pmd_t *pmd, int flags, struct dev_pagemap **pgmap);
@@ -285,8 +279,6 @@ static inline struct list_head *page_deferred_list(struct page *page)
 #define HPAGE_PUD_MASK ({ BUILD_BUG(); 0; })
 #define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
 
-#define hpage_nr_pages(x) 1
-
 static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
 {
 	return false;
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 6f2fef7b0784..3bd675ce6ba8 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -47,14 +47,14 @@ static __always_inline void update_lru_size(struct lruvec *lruvec,
 static __always_inline void add_page_to_lru_list(struct page *page,
 				struct lruvec *lruvec, enum lru_list lru)
 {
-	update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
+	update_lru_size(lruvec, lru, page_zonenum(page), compound_nr(page));
 	list_add(&page->lru, &lruvec->lists[lru]);
 }
 
 static __always_inline void add_page_to_lru_list_tail(struct page *page,
 				struct lruvec *lruvec, enum lru_list lru)
 {
-	update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
+	update_lru_size(lruvec, lru, page_zonenum(page), compound_nr(page));
 	list_add_tail(&page->lru, &lruvec->lists[lru]);
 }
 
@@ -62,7 +62,7 @@ static __always_inline void del_page_from_lru_list(struct page *page,
 				struct lruvec *lruvec, enum lru_list lru)
 {
 	list_del(&page->lru);
-	update_lru_size(lruvec, lru, page_zonenum(page), -hpage_nr_pages(page));
+	update_lru_size(lruvec, lru, page_zonenum(page), -compound_nr(page));
 }
 
 /**
diff --git a/mm/filemap.c b/mm/filemap.c
index 8eca91547e40..b07ef9469861 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -196,7 +196,7 @@ static void unaccount_page_cache_page(struct address_space *mapping,
 	if (PageHuge(page))
 		return;
 
-	nr = hpage_nr_pages(page);
+	nr = compound_nr(page);
 
 	__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
 	if (PageSwapBacked(page)) {
diff --git a/mm/gup.c b/mm/gup.c
index 60c3915c8ee6..579dc9426b87 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1469,7 +1469,7 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk,
 					mod_node_page_state(page_pgdat(head),
 							    NR_ISOLATED_ANON +
 							    page_is_file_cache(head),
-							    hpage_nr_pages(head));
+							    compound_nr(head));
 				}
 			}
 		}
diff --git a/mm/internal.h b/mm/internal.h
index e32390802fd3..abe3a15b456c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -327,7 +327,7 @@ extern void clear_page_mlock(struct page *page);
 static inline void mlock_migrate_page(struct page *newpage, struct page *page)
 {
 	if (TestClearPageMlocked(page)) {
-		int nr_pages = hpage_nr_pages(page);
+		int nr_pages = compound_nr(page);
 
 		/* Holding pmd lock, no change in irq context: __mod is safe */
 		__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
@@ -354,7 +354,7 @@ vma_address(struct page *page, struct vm_area_struct *vma)
 	unsigned long start, end;
 
 	start = __vma_address(page, vma);
-	end = start + PAGE_SIZE * (hpage_nr_pages(page) - 1);
+	end = start + page_size(page) - 1;
 
 	/* page should be within @vma mapping range */
 	VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2156ef775d04..9d457684a731 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5406,7 +5406,7 @@ static int mem_cgroup_move_account(struct page *page,
 				   struct mem_cgroup *to)
 {
 	unsigned long flags;
-	unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
+	unsigned int nr_pages = compound ? compound_nr(page) : 1;
 	int ret;
 	bool anon;
 
@@ -6447,7 +6447,7 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 			  bool compound)
 {
 	struct mem_cgroup *memcg = NULL;
-	unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
+	unsigned int nr_pages = compound ? compound_nr(page) : 1;
 	int ret = 0;
 
 	if (mem_cgroup_disabled())
@@ -6521,7 +6521,7 @@ int mem_cgroup_try_charge_delay(struct page *page, struct mm_struct *mm,
 void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
 			      bool lrucare, bool compound)
 {
-	unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
+	unsigned int nr_pages = compound ? compound_nr(page) : 1;
 
 	VM_BUG_ON_PAGE(!page->mapping, page);
 	VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);
@@ -6565,7 +6565,7 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
 void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
 		bool compound)
 {
-	unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
+	unsigned int nr_pages = compound ? compound_nr(page) : 1;
 
 	if (mem_cgroup_disabled())
 		return;
@@ -6772,7 +6772,7 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
 
 	/* Force-charge the new page. The old one will be freed soon */
 	compound = PageTransHuge(newpage);
-	nr_pages = compound ? hpage_nr_pages(newpage) : 1;
+	nr_pages = compound ? compound_nr(newpage) : 1;
 
 	page_counter_charge(&memcg->memory, nr_pages);
 	if (do_memsw_account())
@@ -6995,7 +6995,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	 * ancestor for the swap instead and transfer the memory+swap charge.
 	 */
 	swap_memcg = mem_cgroup_id_get_online(memcg);
-	nr_entries = hpage_nr_pages(page);
+	nr_entries = compound_nr(page);
 	/* Get references for the tail pages, too */
 	if (nr_entries > 1)
 		mem_cgroup_id_get_many(swap_memcg, nr_entries - 1);
@@ -7041,7 +7041,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
  */
 int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
 {
-	unsigned int nr_pages = hpage_nr_pages(page);
+	unsigned int nr_pages = compound_nr(page);
 	struct page_counter *counter;
 	struct mem_cgroup *memcg;
 	unsigned short oldid;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b1be791f772d..317478203d20 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1344,8 +1344,8 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 			isolate_huge_page(head, &source);
 			continue;
 		} else if (PageTransHuge(page))
-			pfn = page_to_pfn(compound_head(page))
-				+ hpage_nr_pages(page) - 1;
+			pfn = page_to_pfn(compound_head(page)) +
+				compound_nr(page) - 1;
 
 		/*
 		 * HWPoison pages have elevated reference counts so the migration would
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 464406e8da91..586ba2adbfd2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -978,7 +978,7 @@ static int migrate_page_add(struct page *page, struct list_head *pagelist,
 			list_add_tail(&head->lru, pagelist);
 			mod_node_page_state(page_pgdat(head),
 				NR_ISOLATED_ANON + page_is_file_cache(head),
-				hpage_nr_pages(head));
+				compound_nr(head));
 		} else if (flags & MPOL_MF_STRICT) {
 			/*
 			 * Non-movable page may reach here.  And, there may be
diff --git a/mm/migrate.c b/mm/migrate.c
index 73d476d690b1..c3c9a3e70f07 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -191,8 +191,9 @@ void putback_movable_pages(struct list_head *l)
 			unlock_page(page);
 			put_page(page);
 		} else {
-			mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
-					page_is_file_cache(page), -hpage_nr_pages(page));
+			mod_node_page_state(page_pgdat(page),
+				NR_ISOLATED_ANON + page_is_file_cache(page),
+				-compound_nr(page));
 			putback_lru_page(page);
 		}
 	}
@@ -381,7 +382,7 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
 	 */
 	expected_count += is_device_private_page(page);
 	if (mapping)
-		expected_count += hpage_nr_pages(page) + page_has_private(page);
+		expected_count += compound_nr(page) + page_has_private(page);
 
 	return expected_count;
 }
@@ -436,7 +437,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	 */
 	newpage->index = page->index;
 	newpage->mapping = page->mapping;
-	page_ref_add(newpage, hpage_nr_pages(page)); /* add cache reference */
+	page_ref_add(newpage, compound_nr(page)); /* add cache reference */
 	if (PageSwapBacked(page)) {
 		__SetPageSwapBacked(newpage);
 		if (PageSwapCache(page)) {
@@ -469,7 +470,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	 * to one less reference.
 	 * We know this isn't the last reference.
 	 */
-	page_ref_unfreeze(page, expected_count - hpage_nr_pages(page));
+	page_ref_unfreeze(page, expected_count - compound_nr(page));
 
 	xas_unlock(&xas);
 	/* Leave irq disabled to prevent preemption while updating stats */
@@ -579,7 +580,7 @@ static void copy_huge_page(struct page *dst, struct page *src)
 	} else {
 		/* thp page */
 		BUG_ON(!PageTransHuge(src));
-		nr_pages = hpage_nr_pages(src);
+		nr_pages = compound_nr(src);
 	}
 
 	for (i = 0; i < nr_pages; i++) {
@@ -1215,7 +1216,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		 */
 		if (likely(!__PageMovable(page)))
 			mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
-					page_is_file_cache(page), -hpage_nr_pages(page));
+					page_is_file_cache(page), -compound_nr(page));
 	}
 
 	/*
@@ -1571,7 +1572,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 		list_add_tail(&head->lru, pagelist);
 		mod_node_page_state(page_pgdat(head),
 			NR_ISOLATED_ANON + page_is_file_cache(head),
-			hpage_nr_pages(head));
+			compound_nr(head));
 	}
 out_putpage:
 	/*
@@ -1912,7 +1913,7 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 
 	page_lru = page_is_file_cache(page);
 	mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru,
-				hpage_nr_pages(page));
+				compound_nr(page));
 
 	/*
 	 * Isolating the page has taken another reference, so the
diff --git a/mm/mlock.c b/mm/mlock.c
index a90099da4fb4..5567d55bf5e1 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -61,8 +61,7 @@ void clear_page_mlock(struct page *page)
 	if (!TestClearPageMlocked(page))
 		return;
 
-	mod_zone_page_state(page_zone(page), NR_MLOCK,
-			    -hpage_nr_pages(page));
+	mod_zone_page_state(page_zone(page), NR_MLOCK, -compound_nr(page));
 	count_vm_event(UNEVICTABLE_PGCLEARED);
 	/*
 	 * The previous TestClearPageMlocked() corresponds to the smp_mb()
@@ -95,7 +94,7 @@ void mlock_vma_page(struct page *page)
 
 	if (!TestSetPageMlocked(page)) {
 		mod_zone_page_state(page_zone(page), NR_MLOCK,
-				    hpage_nr_pages(page));
+				    compound_nr(page));
 		count_vm_event(UNEVICTABLE_PGMLOCKED);
 		if (!isolate_lru_page(page))
 			putback_lru_page(page);
@@ -192,7 +191,7 @@ unsigned int munlock_vma_page(struct page *page)
 	/*
 	 * Serialize with any parallel __split_huge_page_refcount() which
 	 * might otherwise copy PageMlocked to part of the tail pages before
-	 * we clear it in the head page. It also stabilizes hpage_nr_pages().
+	 * we clear it in the head page. It also stabilizes compound_nr().
 	 */
 	spin_lock_irq(&pgdat->lru_lock);
 
@@ -202,7 +201,7 @@ unsigned int munlock_vma_page(struct page *page)
 		goto unlock_out;
 	}
 
-	nr_pages = hpage_nr_pages(page);
+	nr_pages = compound_nr(page);
 	__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
 
 	if (__munlock_isolate_lru_page(page, true)) {
diff --git a/mm/page_io.c b/mm/page_io.c
index 24ee600f9131..965fcc5701f8 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -40,7 +40,7 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
 		bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
 		bio->bi_end_io = end_io;
 
-		bio_add_page(bio, page, PAGE_SIZE * hpage_nr_pages(page), 0);
+		bio_add_page(bio, page, page_size(page), 0);
 	}
 	return bio;
 }
@@ -271,7 +271,7 @@ static inline void count_swpout_vm_event(struct page *page)
 	if (unlikely(PageTransHuge(page)))
 		count_vm_event(THP_SWPOUT);
 #endif
-	count_vm_events(PSWPOUT, hpage_nr_pages(page));
+	count_vm_events(PSWPOUT, compound_nr(page));
 }
 
 int __swap_writepage(struct page *page, struct writeback_control *wbc,
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index eff4b4520c8d..dfca512c7b50 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -57,7 +57,7 @@ static inline bool pfn_in_hpage(struct page *hpage, unsigned long pfn)
 	unsigned long hpage_pfn = page_to_pfn(hpage);
 
 	/* THP can be referenced by any subpage */
-	return pfn >= hpage_pfn && pfn - hpage_pfn < hpage_nr_pages(hpage);
+	return (pfn - hpage_pfn) < compound_nr(hpage);
 }
 
 /**
@@ -223,7 +223,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			if (pvmw->address >= pvmw->vma->vm_end ||
 			    pvmw->address >=
 					__vma_address(pvmw->page, pvmw->vma) +
-					hpage_nr_pages(pvmw->page) * PAGE_SIZE)
+					page_size(pvmw->page))
 				return not_found(pvmw);
 			/* Did we cross page table boundary? */
 			if (pvmw->address % PMD_SIZE == 0) {
@@ -264,7 +264,7 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
 	unsigned long start, end;
 
 	start = __vma_address(page, vma);
-	end = start + PAGE_SIZE * (hpage_nr_pages(page) - 1);
+	end = start + page_size(page) - 1;
 
 	if (unlikely(end < vma->vm_start || start >= vma->vm_end))
 		return 0;
diff --git a/mm/rmap.c b/mm/rmap.c
index d9a23bb773bf..2d857283fb41 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1112,7 +1112,7 @@ void do_page_add_anon_rmap(struct page *page,
 	}
 
 	if (first) {
-		int nr = compound ? hpage_nr_pages(page) : 1;
+		int nr = compound ? compound_nr(page) : 1;
 		/*
 		 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
 		 * these counters are not modified in interrupt context, and
@@ -1150,7 +1150,7 @@ void do_page_add_anon_rmap(struct page *page,
 void page_add_new_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address, bool compound)
 {
-	int nr = compound ? hpage_nr_pages(page) : 1;
+	int nr = compound ? compound_nr(page) : 1;
 
 	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
 	__SetPageSwapBacked(page);
@@ -1826,7 +1826,7 @@ static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc,
 		return;
 
 	pgoff_start = page_to_pgoff(page);
-	pgoff_end = pgoff_start + hpage_nr_pages(page) - 1;
+	pgoff_end = pgoff_start + compound_nr(page) - 1;
 	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
 			pgoff_start, pgoff_end) {
 		struct vm_area_struct *vma = avc->vma;
@@ -1879,7 +1879,7 @@ static void rmap_walk_file(struct page *page, struct rmap_walk_control *rwc,
 		return;
 
 	pgoff_start = page_to_pgoff(page);
-	pgoff_end = pgoff_start + hpage_nr_pages(page) - 1;
+	pgoff_end = pgoff_start + compound_nr(page) - 1;
 	if (!locked)
 		i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap,
diff --git a/mm/swap.c b/mm/swap.c
index 784dc1620620..25d8c43035a4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -465,7 +465,7 @@ void lru_cache_add_active_or_unevictable(struct page *page,
 		 * lock is held(spinlock), which implies preemption disabled.
 		 */
 		__mod_zone_page_state(page_zone(page), NR_MLOCK,
-				    hpage_nr_pages(page));
+				    compound_nr(page));
 		count_vm_event(UNEVICTABLE_PGMLOCKED);
 	}
 	lru_cache_add(page);
@@ -558,7 +558,7 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
 		ClearPageSwapBacked(page);
 		add_page_to_lru_list(page, lruvec, LRU_INACTIVE_FILE);
 
-		__count_vm_events(PGLAZYFREE, hpage_nr_pages(page));
+		__count_vm_events(PGLAZYFREE, compound_nr(page));
 		count_memcg_page_event(page, PGLAZYFREE);
 		update_page_reclaim_stat(lruvec, 1, 0);
 	}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 8e7ce9a9bc5e..51d8884a693a 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -158,7 +158,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
 void __delete_from_swap_cache(struct page *page, swp_entry_t entry)
 {
 	struct address_space *address_space = swap_address_space(entry);
-	int i, nr = hpage_nr_pages(page);
+	int i, nr = compound_nr(page);
 	pgoff_t idx = swp_offset(entry);
 	XA_STATE(xas, &address_space->i_pages, idx);
 
@@ -251,7 +251,7 @@ void delete_from_swap_cache(struct page *page)
 	xa_unlock_irq(&address_space->i_pages);
 
 	put_swap_page(page, entry);
-	page_ref_sub(page, hpage_nr_pages(page));
+	page_ref_sub(page, compound_nr(page));
 }
 
 /* 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index dab43523afdd..2dc7fbde7d9b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1331,7 +1331,7 @@ void put_swap_page(struct page *page, swp_entry_t entry)
 	unsigned char *map;
 	unsigned int i, free_entries = 0;
 	unsigned char val;
-	int size = swap_entry_size(hpage_nr_pages(page));
+	int size = swap_entry_size(compound_nr(page));
 
 	si = _swap_info_get(entry);
 	if (!si)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4911754c93b7..a7f9f379e523 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1901,7 +1901,7 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		SetPageLRU(page);
 		lru = page_lru(page);
 
-		nr_pages = hpage_nr_pages(page);
+		nr_pages = compound_nr(page);
 		update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
 		list_move(&page->lru, &lruvec->lists[lru]);
 
@@ -2095,7 +2095,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 		if (page_referenced(page, 0, sc->target_mem_cgroup,
 				    &vm_flags)) {
-			nr_rotated += hpage_nr_pages(page);
+			nr_rotated += compound_nr(page);
 			/*
 			 * Identify referenced, file-backed active pages and
 			 * give them one more trip around the active list. So
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 12/15] mm: Support removing arbitrary sized pages from mapping
  2019-09-25  0:51 [RFC 00/15] Large pages in the page-cache Matthew Wilcox
                   ` (10 preceding siblings ...)
  2019-09-25  0:52 ` [PATCH 11/15] mm: Remove hpage_nr_pages Matthew Wilcox
@ 2019-09-25  0:52 ` Matthew Wilcox
  2019-10-01 10:39   ` Kirill A. Shutemov
  2019-09-25  0:52 ` [PATCH 13/15] mm: Add a huge page fault handler for files Matthew Wilcox
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 40+ messages in thread
From: Matthew Wilcox @ 2019-09-25  0:52 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: William Kucharski, Matthew Wilcox

From: William Kucharski <william.kucharski@oracle.com>

__remove_mapping() assumes that pages can only be either base pages
or HPAGE_PMD_SIZE.  Ask the page what size it is.

Signed-off-by: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/vmscan.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a7f9f379e523..9f44868e640b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -932,10 +932,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 	 * Note that if SetPageDirty is always performed via set_page_dirty,
 	 * and thus under the i_pages lock, then this ordering is not required.
 	 */
-	if (unlikely(PageTransHuge(page)) && PageSwapCache(page))
-		refcount = 1 + HPAGE_PMD_NR;
-	else
-		refcount = 2;
+	refcount = 1 + compound_nr(page);
 	if (!page_ref_freeze(page, refcount))
 		goto cannot_free;
 	/* note: atomic_cmpxchg in page_ref_freeze provides the smp_rmb */
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 13/15] mm: Add a huge page fault handler for files
  2019-09-25  0:51 [RFC 00/15] Large pages in the page-cache Matthew Wilcox
                   ` (11 preceding siblings ...)
  2019-09-25  0:52 ` [PATCH 12/15] mm: Support removing arbitrary sized pages from mapping Matthew Wilcox
@ 2019-09-25  0:52 ` Matthew Wilcox
  2019-10-01 10:42   ` Kirill A. Shutemov
  2019-09-25  0:52 ` [PATCH 14/15] mm: Align THP mappings for non-DAX Matthew Wilcox
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 40+ messages in thread
From: Matthew Wilcox @ 2019-09-25  0:52 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: William Kucharski, Matthew Wilcox

From: William Kucharski <william.kucharski@oracle.com>

Add filemap_huge_fault() to attempt to satisfy page
faults on memory-mapped read-only text pages using THP when possible.

Signed-off-by: William Kucharski <william.kucharski@oracle.com>
[rebased on top of mm prep patches -- Matthew]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/mm.h      |  10 +++
 include/linux/pagemap.h |   8 ++
 mm/filemap.c            | 165 ++++++++++++++++++++++++++++++++++++++--
 3 files changed, 178 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 04bea9f9282c..623878f11eaf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2414,6 +2414,16 @@ extern void truncate_inode_pages_final(struct address_space *);
 
 /* generic vm_area_ops exported for stackable file systems */
 extern vm_fault_t filemap_fault(struct vm_fault *vmf);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern vm_fault_t filemap_huge_fault(struct vm_fault *vmf,
+		enum page_entry_size pe_size);
+#else
+static inline vm_fault_t filemap_huge_fault(struct vm_fault *vmf,
+		enum page_entry_size pe_size)
+{
+	return VM_FAULT_FALLBACK;
+}
+#endif
 extern void filemap_map_pages(struct vm_fault *vmf,
 		pgoff_t start_pgoff, pgoff_t end_pgoff);
 extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index d6d97f9fb762..ae09788f5345 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -354,6 +354,14 @@ static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
 			mapping_gfp_mask(mapping));
 }
 
+/* This (head) page should be found at this offset in the page cache */
+static inline void page_cache_assert(struct page *page, pgoff_t offset)
+{
+	VM_BUG_ON_PAGE(PageTail(page), page);
+	VM_BUG_ON_PAGE(page->index != (offset & ~(compound_nr(page) - 1)),
+			page);
+}
+
 static inline struct page *find_subpage(struct page *page, pgoff_t offset)
 {
 	if (PageHuge(page))
diff --git a/mm/filemap.c b/mm/filemap.c
index b07ef9469861..8017e905df7a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1590,7 +1590,8 @@ static bool pagecache_is_conflict(struct page *page)
  *
  * Looks up the page cache entries at @mapping between @offset and
  * @offset + 2^@order.  If there is a page cache page, it is returned with
- * an increased refcount unless it is smaller than @order.
+ * an increased refcount unless it is smaller than @order.  This function
+ * returns the head page, not a tail page.
  *
  * If the slot holds a shadow entry of a previously evicted page, or a
  * swap entry from shmem/tmpfs, it is returned.
@@ -1601,7 +1602,7 @@ static bool pagecache_is_conflict(struct page *page)
 static struct page *__find_get_page(struct address_space *mapping,
 		unsigned long offset, unsigned int order)
 {
-	XA_STATE(xas, &mapping->i_pages, offset);
+	XA_STATE(xas, &mapping->i_pages, offset & ~((1UL << order) - 1));
 	struct page *page;
 
 	rcu_read_lock();
@@ -1635,7 +1636,6 @@ static struct page *__find_get_page(struct address_space *mapping,
 		put_page(page);
 		goto repeat;
 	}
-	page = find_subpage(page, offset);
 out:
 	rcu_read_unlock();
 
@@ -1741,11 +1741,12 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
 			put_page(page);
 			goto repeat;
 		}
-		VM_BUG_ON_PAGE(page->index != offset, page);
+		page_cache_assert(page, offset);
 	}
 
 	if (fgp_flags & FGP_ACCESSED)
 		mark_page_accessed(page);
+	page = find_subpage(page, offset);
 
 no_page:
 	if (!page && (fgp_flags & FGP_CREAT)) {
@@ -2638,7 +2639,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 		put_page(page);
 		goto retry_find;
 	}
-	VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
+	page_cache_assert(page, offset);
 
 	/*
 	 * We have a locked page in the page cache, now we need to check
@@ -2711,6 +2712,160 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 }
 EXPORT_SYMBOL(filemap_fault);
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/**
+ * filemap_huge_fault - Read in file data for page fault handling.
+ * @vmf: struct vm_fault containing details of the fault.
+ * @pe_size: Page entry size.
+ *
+ * filemap_huge_fault() is invoked via the vma operations vector for a
+ * mapped memory region to read in file data during a page fault.
+ *
+ * The goto's are kind of ugly, but this streamlines the normal case of having
+ * it in the page cache, and handles the special cases reasonably without
+ * having a lot of duplicated code.
+ *
+ * vma->vm_mm->mmap_sem must be held on entry.
+ *
+ * If our return value has VM_FAULT_RETRY set, it's because the mmap_sem
+ * may be dropped before doing I/O or by lock_page_maybe_drop_mmap().
+ *
+ * If our return value does not have VM_FAULT_RETRY set, the mmap_sem
+ * has not been released.
+ *
+ * We never return with VM_FAULT_RETRY and a bit from VM_FAULT_ERROR set.
+ *
+ * Return: bitwise-OR of %VM_FAULT_ codes.
+ */
+vm_fault_t filemap_huge_fault(struct vm_fault *vmf,
+				enum page_entry_size pe_size)
+{
+	int error;
+	struct vm_area_struct *vma = vmf->vma;
+	struct file *file = vma->vm_file;
+	struct file *fpin = NULL;
+	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = mapping->host;
+	pgoff_t offset = vmf->pgoff;
+	pgoff_t max_off;
+	struct page *page;
+	vm_fault_t ret = 0;
+
+	if (pe_size != PE_SIZE_PMD)
+		return VM_FAULT_FALLBACK;
+	/* Read-only mappings for now */
+	if (vmf->flags & FAULT_FLAG_WRITE)
+		return VM_FAULT_FALLBACK;
+	if (vma->vm_start & ~HPAGE_PMD_MASK)
+		return VM_FAULT_FALLBACK;
+	/* Don't allocate a huge page for the tail of the file (?) */
+	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+	if (unlikely((offset | (HPAGE_PMD_NR - 1)) >= max_off))
+		return VM_FAULT_FALLBACK;
+
+	/*
+	 * Do we have something in the page cache already?
+	 */
+	page = __find_get_page(mapping, offset, HPAGE_PMD_ORDER);
+	if (likely(page)) {
+		if (pagecache_is_conflict(page))
+			return VM_FAULT_FALLBACK;
+		/* Readahead the next huge page here? */
+		page = find_subpage(page, offset & ~(HPAGE_PMD_NR - 1));
+	} else {
+		/* No page in the page cache at all */
+		count_vm_event(PGMAJFAULT);
+		count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
+		ret = VM_FAULT_MAJOR;
+retry_find:
+		page = pagecache_get_page(mapping, offset,
+					  FGP_CREAT | FGP_FOR_MMAP | FGP_PMD,
+					  vmf->gfp_mask |
+						__GFP_NOWARN | __GFP_NORETRY);
+		if (!page)
+			return VM_FAULT_FALLBACK;
+	}
+
+	if (!lock_page_maybe_drop_mmap(vmf, page, &fpin))
+		goto out_retry;
+
+	/* Did it get truncated? */
+	if (unlikely(page->mapping != mapping)) {
+		unlock_page(page);
+		put_page(page);
+		goto retry_find;
+	}
+	VM_BUG_ON_PAGE(page_to_index(page) != offset, page);
+
+	/*
+	 * We have a locked page in the page cache, now we need to check
+	 * that it's up-to-date.  Because we don't readahead in huge_fault,
+	 * this may or may not be due to an error.
+	 */
+	if (!PageUptodate(page))
+		goto page_not_uptodate;
+
+	/*
+	 * We've made it this far and we had to drop our mmap_sem, now is the
+	 * time to return to the upper layer and have it re-find the vma and
+	 * redo the fault.
+	 */
+	if (fpin) {
+		unlock_page(page);
+		goto out_retry;
+	}
+
+	/*
+	 * Found the page and have a reference on it.
+	 * We must recheck i_size under page lock.
+	 */
+	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+	if (unlikely(offset >= max_off)) {
+		unlock_page(page);
+		put_page(page);
+		return VM_FAULT_SIGBUS;
+	}
+
+	ret |= alloc_set_pte(vmf, NULL, page);
+	unlock_page(page);
+	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
+		put_page(page);
+	return ret;
+
+page_not_uptodate:
+	ClearPageError(page);
+	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
+	error = mapping->a_ops->readpage(file, page);
+	if (!error) {
+		wait_on_page_locked(page);
+		if (!PageUptodate(page))
+			error = -EIO;
+	}
+	if (fpin)
+		goto out_retry;
+	put_page(page);
+
+	if (!error || error == AOP_TRUNCATED_PAGE)
+		goto retry_find;
+
+	/* Things didn't work out */
+	return VM_FAULT_SIGBUS;
+
+out_retry:
+	/*
+	 * We dropped the mmap_sem, we need to return to the fault handler to
+	 * re-find the vma and come back and find our hopefully still populated
+	 * page.
+	 */
+	if (page)
+		put_page(page);
+	if (fpin)
+		fput(fpin);
+	return ret | VM_FAULT_RETRY;
+}
+EXPORT_SYMBOL(filemap_huge_fault);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 void filemap_map_pages(struct vm_fault *vmf,
 		pgoff_t start_pgoff, pgoff_t end_pgoff)
 {
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 14/15] mm: Align THP mappings for non-DAX
  2019-09-25  0:51 [RFC 00/15] Large pages in the page-cache Matthew Wilcox
                   ` (12 preceding siblings ...)
  2019-09-25  0:52 ` [PATCH 13/15] mm: Add a huge page fault handler for files Matthew Wilcox
@ 2019-09-25  0:52 ` Matthew Wilcox
  2019-10-01 10:45   ` Kirill A. Shutemov
  2019-09-25  0:52 ` [PATCH 15/15] xfs: Use filemap_huge_fault Matthew Wilcox
                   ` (4 subsequent siblings)
  18 siblings, 1 reply; 40+ messages in thread
From: Matthew Wilcox @ 2019-09-25  0:52 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: William Kucharski, Matthew Wilcox

From: William Kucharski <william.kucharski@oracle.com>

When we have the opportunity to use transparent huge pages to map a
file, we want to follow the same rules as DAX.

Signed-off-by: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/huge_memory.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cbe7d0619439..670a1780bd2f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -563,8 +563,6 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 
 	if (addr)
 		goto out;
-	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
-		goto out;
 
 	addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
 	if (addr)
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 15/15] xfs: Use filemap_huge_fault
  2019-09-25  0:51 [RFC 00/15] Large pages in the page-cache Matthew Wilcox
                   ` (13 preceding siblings ...)
  2019-09-25  0:52 ` [PATCH 14/15] mm: Align THP mappings for non-DAX Matthew Wilcox
@ 2019-09-25  0:52 ` Matthew Wilcox
       [not found] ` <20191002130753.7680-1-hdanton@sina.com>
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 40+ messages in thread
From: Matthew Wilcox @ 2019-09-25  0:52 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox (Oracle)

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/xfs/xfs_file.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index d952d5962e93..9445196f8056 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1156,6 +1156,8 @@ __xfs_filemap_fault(
 	} else {
 		if (write_fault)
 			ret = iomap_page_mkwrite(vmf, &xfs_iomap_ops);
+		else if (pe_size)
+			ret = filemap_huge_fault(vmf, pe_size);
 		else
 			ret = filemap_fault(vmf);
 	}
@@ -1181,9 +1183,6 @@ xfs_filemap_huge_fault(
 	struct vm_fault		*vmf,
 	enum page_entry_size	pe_size)
 {
-	if (!IS_DAX(file_inode(vmf->vma->vm_file)))
-		return VM_FAULT_FALLBACK;
-
 	/* DAX can shortcut the normal fault path on write faults! */
 	return __xfs_filemap_fault(vmf, pe_size,
 			(vmf->flags & FAULT_FLAG_WRITE));
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 02/15] fs: Introduce i_blocks_per_page
  2019-09-25  0:52 ` [PATCH 02/15] fs: Introduce i_blocks_per_page Matthew Wilcox
@ 2019-09-25  8:36   ` Dave Chinner
  2019-10-04 19:28     ` Matthew Wilcox
  0 siblings, 1 reply; 40+ messages in thread
From: Dave Chinner @ 2019-09-25  8:36 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Sep 24, 2019 at 05:52:01PM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> This helper is useful for both large pages in the page cache and for
> supporting block size larger than page size.  Convert some example
> users (we have a few different ways of writing this idiom).
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

I'm actually working on abstracting this code from both block size
and page size via the helpers below. We have a need to support block
size > page size, and that requires touching a bunch of the same code
as this patchset. I'm currently trying to combine your
last patch set with my patchset so I can easily test allocating 64k
page cache pages on a 64k block size filesystem on a 4k page size
machine with XFS....

/*
 * Return the chunk size we should use for page cache based operations.
 * This supports both large block sizes and variable page sizes based on the
 * restriction that order-n blocks and page cache pages are order-n file offset
 * aligned.
 *
 * This will return the inode block size for block size < page_size(page),
 * otherwise it will return page_size(page).
 */
static inline unsigned
iomap_chunk_size(struct inode *inode, struct page *page)
{
        return min_t(unsigned, page_size(page), i_blocksize(inode));
}

static inline unsigned
iomap_chunk_bits(struct inode *inode, struct page *page)
{
        return min_t(unsigned, page_shift(page), inode->i_blkbits);
}

static inline unsigned
iomap_chunks_per_page(struct inode *inode, struct page *page)
{
        return page_size(page) >> inode->i_blkbits;
}

Basically, the process is to convert the iomap code over to
iterating "chunks" rather than blocks or pages, and then allocate
a struct iomap_page according to the difference between page and
block size....
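
To illustrate the shape of that conversion, here is a rough, untested sketch
(iomap_walk_chunks() is a made-up name just for illustration; the only real
pieces are the helpers above plus page_offset()/page_size()):

static void iomap_walk_chunks(struct inode *inode, struct page *page)
{
	unsigned int chunk_size = iomap_chunk_size(inode, page);
	unsigned int nr_chunks = page_size(page) / chunk_size;
	unsigned int i;

	for (i = 0; i < nr_chunks; i++) {
		loff_t pos = page_offset(page) + (loff_t)i * chunk_size;

		/* operate on the byte range [pos, pos + chunk_size) */
	}
}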

> ---
>  fs/iomap/buffered-io.c  |  4 ++--
>  fs/jfs/jfs_metapage.c   |  2 +-
>  fs/xfs/xfs_aops.c       |  8 ++++----
>  include/linux/pagemap.h | 13 +++++++++++++
>  4 files changed, 20 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index e25901ae3ff4..0e76a4b6d98a 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -24,7 +24,7 @@ iomap_page_create(struct inode *inode, struct page *page)
>  {
>  	struct iomap_page *iop = to_iomap_page(page);
>  
> -	if (iop || i_blocksize(inode) == PAGE_SIZE)
> +	if (iop || i_blocks_per_page(inode, page) <= 1)
>  		return iop;

That also means checks like these become:

	if (iop || iomap_chunks_per_page(inode, page) <= 1)

as a single file can now have multiple pages per block, a page per
block and multiple blocks per page as the page size changes...

I'd like to only have to make one pass over this code to abstract
out page and block sizes, so I'm guessing we'll need to do some
co-ordination here....

> @@ -636,4 +636,17 @@ static inline unsigned long dir_pages(struct inode *inode)
>  			       PAGE_SHIFT;
>  }
>  
> +/**
> + * i_blocks_per_page - How many blocks fit in this page.
> + * @inode: The inode which contains the blocks.
> + * @page: The (potentially large) page.
> + *
> + * Context: Any context.
> + * Return: The number of filesystem blocks covered by this page.
> + */
> +static inline
> +unsigned int i_blocks_per_page(struct inode *inode, struct page *page)
> +{
> +	return page_size(page) >> inode->i_blkbits;
> +}
>  #endif /* _LINUX_PAGEMAP_H */

It also means that we largely don't need to touch mm headers as
all the helpers end up being iomap specific and private...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 01/15] mm: Use vm_fault error code directly
  2019-09-25  0:52 ` [PATCH 01/15] mm: Use vm_fault error code directly Matthew Wilcox
@ 2019-09-26 13:55   ` Kirill A. Shutemov
  0 siblings, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2019-09-26 13:55 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Sep 24, 2019 at 05:52:00PM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Use VM_FAULT_OOM instead of indirecting through vmf_error(-ENOMEM).
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 03/15] mm: Add file_offset_of_ helpers
  2019-09-25  0:52 ` [PATCH 03/15] mm: Add file_offset_of_ helpers Matthew Wilcox
@ 2019-09-26 14:02   ` Kirill A. Shutemov
  2019-10-04 19:39     ` Matthew Wilcox
  0 siblings, 1 reply; 40+ messages in thread
From: Kirill A. Shutemov @ 2019-09-26 14:02 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Sep 24, 2019 at 05:52:02PM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> The page_offset function is badly named for people reading the functions
> which call it.  The natural meaning of a function with this name would
> be 'offset within a page', not 'page offset in bytes within a file'.
> Dave Chinner suggests file_offset_of_page() as a replacement function
> name and I'm also adding file_offset_of_next_page() as a helper for the
> large page work.  Also add kernel-doc for these functions so they show
> up in the kernel API book.
> 
> page_offset() is retained as a compatibility define for now.

This should be trivial for coccinelle, right?

> ---
>  drivers/net/ethernet/ibm/ibmveth.c |  2 --
>  include/linux/pagemap.h            | 25 ++++++++++++++++++++++---
>  2 files changed, 22 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
> index c5be4ebd8437..bf98aeaf9a45 100644
> --- a/drivers/net/ethernet/ibm/ibmveth.c
> +++ b/drivers/net/ethernet/ibm/ibmveth.c
> @@ -978,8 +978,6 @@ static int ibmveth_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
>  	return -EOPNOTSUPP;
>  }
>  
> -#define page_offset(v) ((unsigned long)(v) & ((1 << 12) - 1))
> -
>  static int ibmveth_send(struct ibmveth_adapter *adapter,
>  			union ibmveth_buf_desc *descs, unsigned long mss)
>  {
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 750770a2c685..103205494ea0 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -428,14 +428,33 @@ static inline pgoff_t page_to_pgoff(struct page *page)
>  	return page_to_index(page);
>  }
>  
> -/*
> - * Return byte-offset into filesystem object for page.
> +/**
> + * file_offset_of_page - File offset of this page.
> + * @page: Page cache page.
> + *
> + * Context: Any context.
> + * Return: The offset of the first byte of this page.
>   */
> -static inline loff_t page_offset(struct page *page)
> +static inline loff_t file_offset_of_page(struct page *page)
>  {
>  	return ((loff_t)page->index) << PAGE_SHIFT;
>  }
>  
> +/* Legacy; please convert callers */
> +#define page_offset(page)	file_offset_of_page(page)
> +
> +/**
> + * file_offset_of_next_page - File offset of the next page.
> + * @page: Page cache page.
> + *
> + * Context: Any context.
> + * Return: The offset of the first byte after this page.
> + */
> +static inline loff_t file_offset_of_next_page(struct page *page)
> +{
> +	return ((loff_t)page->index + compound_nr(page)) << PAGE_SHIFT;

Wouldn't it be more readable as

	return file_offset_of_page(page) + page_size(page);

?

> +}
> +
>  static inline loff_t page_file_offset(struct page *page)
>  {
>  	return ((loff_t)page_index(page)) << PAGE_SHIFT;
> -- 
> 2.23.0
> 
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 07/15] mm: Make prep_transhuge_page tail-callable
  2019-09-25  0:52 ` [PATCH 07/15] mm: Make prep_transhuge_page tail-callable Matthew Wilcox
@ 2019-09-26 14:13   ` Kirill A. Shutemov
  0 siblings, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2019-09-26 14:13 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Sep 24, 2019 at 05:52:06PM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> By permitting NULL or order-0 pages as an argument, and returning the
> argument, callers can write:
> 
> 	return prep_transhuge_page(alloc_pages(...));
> 
> instead of assigning the result to a temporary variable and conditionally
> passing that to prep_transhuge_page().

The patch makes sense, but "tail-callable" made me think you want it to be
able to accept tail pages. Can we rephrase this?
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 08/15] mm: Add __page_cache_alloc_order
  2019-09-25  0:52 ` [PATCH 08/15] mm: Add __page_cache_alloc_order Matthew Wilcox
@ 2019-09-26 14:15   ` Kirill A. Shutemov
  0 siblings, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2019-09-26 14:15 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Sep 24, 2019 at 05:52:07PM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> This new function allows page cache pages to be allocated that are
> larger than an order-0 page.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 09/15] mm: Allow large pages to be added to the page cache
  2019-09-25  0:52 ` [PATCH 09/15] mm: Allow large pages to be added to the page cache Matthew Wilcox
@ 2019-09-26 14:22   ` Kirill A. Shutemov
  0 siblings, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2019-09-26 14:22 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Sep 24, 2019 at 05:52:08PM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> We return -EEXIST if there are any non-shadow entries in the page
> cache in the range covered by the large page.  If there are multiple
> shadow entries in the range, we set *shadowp to one of them (currently
> the one at the highest index).  If that turns out to be the wrong
> answer, we can implement something more complex.  This is mostly
> modelled after the equivalent function in the shmem code.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/filemap.c | 37 ++++++++++++++++++++++++++-----------
>  1 file changed, 26 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index bab97addbb1d..afe8f5d95810 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -855,6 +855,7 @@ static int __add_to_page_cache_locked(struct page *page,
>  	int huge = PageHuge(page);
>  	struct mem_cgroup *memcg;
>  	int error;
> +	unsigned int nr = 1;
>  	void *old;
>  
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
> @@ -866,31 +867,45 @@ static int __add_to_page_cache_locked(struct page *page,
>  					      gfp_mask, &memcg, false);
>  		if (error)
>  			return error;
> +		xas_set_order(&xas, offset, compound_order(page));
> +		nr = compound_nr(page);
>  	}
>  
> -	get_page(page);
> +	page_ref_add(page, nr);
>  	page->mapping = mapping;
>  	page->index = offset;
>  
>  	do {
> +		unsigned long exceptional = 0;
> +		unsigned int i = 0;
> +
>  		xas_lock_irq(&xas);
> -		old = xas_load(&xas);
> -		if (old && !xa_is_value(old))
> +		xas_for_each_conflict(&xas, old) {
> +			if (!xa_is_value(old))
> +				break;
> +			exceptional++;
> +			if (shadowp)
> +				*shadowp = old;
> +		}
> +		if (old)
>  			xas_set_err(&xas, -EEXIST);

This confused me.

Do we rely on 'old' to be NULL if the loop has completed without 'break'?
It's not very obvious.

Can we have a comment or call xas_set_err() within the loop next to the
'break'?
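
Something along these lines, perhaps (untested sketch, only to make the
intent explicit; the later "if (old) xas_set_err(...)" would then go away):

		xas_for_each_conflict(&xas, old) {
			if (!xa_is_value(old)) {
				/* A real page conflicts with the new range */
				xas_set_err(&xas, -EEXIST);
				break;
			}
			exceptional++;
			if (shadowp)
				*shadowp = old;
		}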

> -		xas_store(&xas, page);
> +		xas_create_range(&xas);
>  		if (xas_error(&xas))
>  			goto unlock;
>  
> -		if (xa_is_value(old)) {
> -			mapping->nrexceptional--;
> -			if (shadowp)
> -				*shadowp = old;
> +next:
> +		xas_store(&xas, page);
> +		if (++i < nr) {
> +			xas_next(&xas);
> +			goto next;
>  		}
> -		mapping->nrpages++;
> +		mapping->nrexceptional -= exceptional;
> +		mapping->nrpages += nr;
>  
>  		/* hugetlb pages do not participate in page cache accounting */
>  		if (!huge)
> -			__inc_node_page_state(page, NR_FILE_PAGES);
> +			__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES,
> +						nr);

We also need to bump NR_FILE_THPS here.
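
Roughly like this (sketch; NR_FILE_THPS counts THPs, not base pages, so it
is bumped by one per compound page):

		/* hugetlb pages do not participate in page cache accounting */
		if (!huge) {
			__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES,
						nr);
			if (PageTransHuge(page))
				__inc_node_page_state(page, NR_FILE_THPS);
		}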

>  unlock:
>  		xas_unlock_irq(&xas);
>  	} while (xas_nomem(&xas, gfp_mask & GFP_RECLAIM_MASK));
> @@ -907,7 +922,7 @@ static int __add_to_page_cache_locked(struct page *page,
>  	/* Leave page->index set: truncation relies upon it */
>  	if (!huge)
>  		mem_cgroup_cancel_charge(page, memcg, false);
> -	put_page(page);
> +	page_ref_sub(page, nr);
>  	return xas_error(&xas);
>  }
>  ALLOW_ERROR_INJECTION(__add_to_page_cache_locked, ERRNO);
> -- 
> 2.23.0
> 
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 10/15] mm: Allow find_get_page to be used for large pages
  2019-09-25  0:52 ` [PATCH 10/15] mm: Allow find_get_page to be used for large pages Matthew Wilcox
@ 2019-10-01 10:32   ` Kirill A. Shutemov
  0 siblings, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2019-10-01 10:32 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Sep 24, 2019 at 05:52:09PM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Add FGP_PMD to indicate that we're trying to find-or-create a page that
> is at least PMD_ORDER in size.  The internal 'conflict' entry usage
> is modelled after that in DAX, but the implementations are different
> due to DAX using multi-order entries and the page cache using multiple
> order-0 entries.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/pagemap.h | 13 ++++++
>  mm/filemap.c            | 99 +++++++++++++++++++++++++++++++++++------
>  2 files changed, 99 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index d610a49be571..d6d97f9fb762 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -248,6 +248,19 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
>  #define FGP_NOFS		0x00000010
>  #define FGP_NOWAIT		0x00000020
>  #define FGP_FOR_MMAP		0x00000040
> +/*
> + * If you add more flags, increment FGP_ORDER_SHIFT (no further than 25).
> + * Do not insert flags above the FGP order bits.
> + */
> +#define FGP_ORDER_SHIFT		7
> +#define FGP_PMD			((PMD_SHIFT - PAGE_SHIFT) << FGP_ORDER_SHIFT)
> +#define FGP_PUD			((PUD_SHIFT - PAGE_SHIFT) << FGP_ORDER_SHIFT)
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#define fgp_order(fgp)		((fgp) >> FGP_ORDER_SHIFT)
> +#else
> +#define fgp_order(fgp)		0
> +#endif
>  
>  struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
>  		int fgp_flags, gfp_t cache_gfp_mask);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index afe8f5d95810..8eca91547e40 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1576,7 +1576,71 @@ struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
>  
>  	return page;
>  }
> -EXPORT_SYMBOL(find_get_entry);
> +
> +static bool pagecache_is_conflict(struct page *page)
> +{
> +	return page == XA_RETRY_ENTRY;
> +}
> +
> +/**
> + * __find_get_page - Find and get a page cache entry.
> + * @mapping: The address_space to search.
> + * @offset: The page cache index.
> + * @order: The minimum order of the entry to return.
> + *
> + * Looks up the page cache entries at @mapping between @offset and
> + * @offset + 2^@order.  If there is a page cache page, it is returned with
> + * an increased refcount unless it is smaller than @order.
> + *
> + * If the slot holds a shadow entry of a previously evicted page, or a
> + * swap entry from shmem/tmpfs, it is returned.
> + *
> + * Return: the found page, a value indicating a conflicting page or %NULL if
> + * there are no pages in this range.
> + */
> +static struct page *__find_get_page(struct address_space *mapping,
> +		unsigned long offset, unsigned int order)
> +{
> +	XA_STATE(xas, &mapping->i_pages, offset);
> +	struct page *page;
> +
> +	rcu_read_lock();
> +repeat:
> +	xas_reset(&xas);
> +	page = xas_find(&xas, offset | ((1UL << order) - 1));
> +	if (xas_retry(&xas, page))
> +		goto repeat;
> +	/*
> +	 * A shadow entry of a recently evicted page, or a swap entry from
> +	 * shmem/tmpfs.  Skip it; keep looking for pages.
> +	 */
> +	if (xa_is_value(page))
> +		goto repeat;
> +	if (!page)
> +		goto out;
> +	if (compound_order(page) < order) {
> +		page = XA_RETRY_ENTRY;
> +		goto out;
> +	}
> +
> +	if (!page_cache_get_speculative(page))
> +		goto repeat;
> +
> +	/*
> +	 * Has the page moved or been split?
> +	 * This is part of the lockless pagecache protocol. See
> +	 * include/linux/pagemap.h for details.
> +	 */
> +	if (unlikely(page != xas_reload(&xas))) {
> +		put_page(page);
> +		goto repeat;
> +	}

You need to re-check compound_order() after obtaining a reference to the
page. Otherwise the page could be split from under you.
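
I.e. something like this (sketch only):

	if (!page_cache_get_speculative(page))
		goto repeat;

	/* The page may have been split while we were taking the reference */
	if (compound_order(page) < order) {
		put_page(page);
		page = XA_RETRY_ENTRY;
		goto out;
	}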

> +	page = find_subpage(page, offset);
> +out:
> +	rcu_read_unlock();
> +
> +	return page;
> +}
>  
>  /**
>   * find_lock_entry - locate, pin and lock a page cache entry
> @@ -1618,12 +1682,12 @@ EXPORT_SYMBOL(find_lock_entry);
>   * pagecache_get_page - find and get a page reference
>   * @mapping: the address_space to search
>   * @offset: the page index
> - * @fgp_flags: PCG flags
> + * @fgp_flags: FGP flags
>   * @gfp_mask: gfp mask to use for the page cache data page allocation
>   *
>   * Looks up the page cache slot at @mapping & @offset.
>   *
> - * PCG flags modify how the page is returned.
> + * FGP flags modify how the page is returned.
>   *
>   * @fgp_flags can be:
>   *
> @@ -1636,6 +1700,10 @@ EXPORT_SYMBOL(find_lock_entry);
>   * - FGP_FOR_MMAP: Similar to FGP_CREAT, only we want to allow the caller to do
>   *   its own locking dance if the page is already in cache, or unlock the page
>   *   before returning if we had to add the page to pagecache.
> + * - FGP_PMD: We're only interested in pages at PMD granularity.  If there
> + *   is no page here (and FGP_CREATE is set), we'll create one large enough.
> + *   If there is a smaller page in the cache that overlaps the PMD page, we
> + *   return %NULL and do not attempt to create a page.

I still think it's a suboptimal interface. It's okay to ask for a PMD page,
but if there's already a small page there, the caller should deal with it.
Otherwise the caller will do one additional lookup in the xarray on the
fallback path for no real reason.

>   *
>   * If FGP_LOCK or FGP_CREAT are specified then the function may sleep even
>   * if the GFP flags specified for FGP_CREAT are atomic.
> @@ -1649,10 +1717,11 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
>  {
>  	struct page *page;
>  
> +	BUILD_BUG_ON(((63 << FGP_ORDER_SHIFT) >> FGP_ORDER_SHIFT) != 63);
>  repeat:
> -	page = find_get_entry(mapping, offset);
> -	if (xa_is_value(page))
> -		page = NULL;
> +	page = __find_get_page(mapping, offset, fgp_order(fgp_flags));
> +	if (pagecache_is_conflict(page))
> +		return NULL;
>  	if (!page)
>  		goto no_page;
>  
> @@ -1686,7 +1755,7 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
>  		if (fgp_flags & FGP_NOFS)
>  			gfp_mask &= ~__GFP_FS;
>  
> -		page = __page_cache_alloc(gfp_mask);
> +		page = __page_cache_alloc_order(gfp_mask, fgp_order(fgp_flags));
>  		if (!page)
>  			return NULL;
>  
> @@ -1704,13 +1773,17 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
>  			if (err == -EEXIST)
>  				goto repeat;
>  		}
> +		if (page) {
> +			if (fgp_order(fgp_flags))
> +				count_vm_event(THP_FILE_ALLOC);
>  
> -		/*
> -		 * add_to_page_cache_lru locks the page, and for mmap we expect
> -		 * an unlocked page.
> -		 */
> -		if (page && (fgp_flags & FGP_FOR_MMAP))
> -			unlock_page(page);
> +			/*
> +			 * add_to_page_cache_lru locks the page, and
> +			 * for mmap we expect an unlocked page.
> +			 */
> +			if (fgp_flags & FGP_FOR_MMAP)
> +				unlock_page(page);
> +		}
>  	}
>  
>  	return page;
> -- 
> 2.23.0
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 11/15] mm: Remove hpage_nr_pages
  2019-09-25  0:52 ` [PATCH 11/15] mm: Remove hpage_nr_pages Matthew Wilcox
@ 2019-10-01 10:35   ` Kirill A. Shutemov
  0 siblings, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2019-10-01 10:35 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Tue, Sep 24, 2019 at 05:52:10PM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> This function assumed that compound pages were necessarily PMD sized.
> While that may be true for some users, it's not going to be true for
> all users forever, so it's better to remove it and avoid the confusion
> by just using compound_nr() or page_size().
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 12/15] mm: Support removing arbitrary sized pages from mapping
  2019-09-25  0:52 ` [PATCH 12/15] mm: Support removing arbitrary sized pages from mapping Matthew Wilcox
@ 2019-10-01 10:39   ` Kirill A. Shutemov
  0 siblings, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2019-10-01 10:39 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel, William Kucharski

On Tue, Sep 24, 2019 at 05:52:11PM -0700, Matthew Wilcox wrote:
> From: William Kucharski <william.kucharski@oracle.com>
> 
> __remove_mapping() assumes that pages can only be either base pages
> or HPAGE_PMD_SIZE.  Ask the page what size it is.

This patch also fixes an issue with CONFIG_READ_ONLY_THP_FOR_FS=y:
the new feature makes the refcount calculation relevant not only for
PageSwapCache() pages. It should go into v5.4.

> 
> Signed-off-by: William Kucharski <william.kucharski@oracle.com>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

> ---
>  mm/vmscan.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a7f9f379e523..9f44868e640b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -932,10 +932,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
>  	 * Note that if SetPageDirty is always performed via set_page_dirty,
>  	 * and thus under the i_pages lock, then this ordering is not required.
>  	 */
> -	if (unlikely(PageTransHuge(page)) && PageSwapCache(page))
> -		refcount = 1 + HPAGE_PMD_NR;
> -	else
> -		refcount = 2;
> +	refcount = 1 + compound_nr(page);
>  	if (!page_ref_freeze(page, refcount))
>  		goto cannot_free;
>  	/* note: atomic_cmpxchg in page_ref_freeze provides the smp_rmb */
> -- 
> 2.23.0
> 
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 13/15] mm: Add a huge page fault handler for files
  2019-09-25  0:52 ` [PATCH 13/15] mm: Add a huge page fault handler for files Matthew Wilcox
@ 2019-10-01 10:42   ` Kirill A. Shutemov
  0 siblings, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2019-10-01 10:42 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel, William Kucharski

On Tue, Sep 24, 2019 at 05:52:12PM -0700, Matthew Wilcox wrote:
> From: William Kucharski <william.kucharski@oracle.com>
> 
> Add filemap_huge_fault() to attempt to satisfy page
> faults on memory-mapped read-only text pages using THP when possible.
> 
> Signed-off-by: William Kucharski <william.kucharski@oracle.com>
> [rebased on top of mm prep patches -- Matthew]
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

As we discussed before, I think separating ->fault() from ->huge_fault()
is the wrong direction. If ->fault() were to return a huge page,
alloc_set_pte() would do The Right Thing™.
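
A rough sketch of what I mean (the details depend on the finish_fault()
plumbing, so take this purely as an illustration):

	/* in a ->fault() handler that can find THPs in the page cache: */
	vmf->page = page;		/* may be a subpage of a THP */
	return ret | VM_FAULT_LOCKED;

	/*
	 * finish_fault() then calls alloc_set_pte(), which already checks
	 * PageTransCompound() and tries do_set_pmd() when the page and the
	 * faulting address line up, and otherwise falls back to a PTE
	 * mapping of the subpage.
	 */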

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 14/15] mm: Align THP mappings for non-DAX
  2019-09-25  0:52 ` [PATCH 14/15] mm: Align THP mappings for non-DAX Matthew Wilcox
@ 2019-10-01 10:45   ` Kirill A. Shutemov
  2019-10-01 11:21     ` William Kucharski
  0 siblings, 1 reply; 40+ messages in thread
From: Kirill A. Shutemov @ 2019-10-01 10:45 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel, William Kucharski

On Tue, Sep 24, 2019 at 05:52:13PM -0700, Matthew Wilcox wrote:
> From: William Kucharski <william.kucharski@oracle.com>
> 
> When we have the opportunity to use transparent huge pages to map a
> file, we want to follow the same rules as DAX.
> 
> Signed-off-by: William Kucharski <william.kucharski@oracle.com>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/huge_memory.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index cbe7d0619439..670a1780bd2f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -563,8 +563,6 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
>  
>  	if (addr)
>  		goto out;
> -	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
> -		goto out;
>  
>  	addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
>  	if (addr)

I think you are reducing ASLR entropy without any real indication that THP
is relevant for the VMA. We need to know whether any huge page allocation
will be *attempted* for the VMA or the file.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 14/15] mm: Align THP mappings for non-DAX
  2019-10-01 10:45   ` Kirill A. Shutemov
@ 2019-10-01 11:21     ` William Kucharski
  2019-10-01 11:32       ` Kirill A. Shutemov
  0 siblings, 1 reply; 40+ messages in thread
From: William Kucharski @ 2019-10-01 11:21 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel



> On Oct 1, 2019, at 4:45 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> On Tue, Sep 24, 2019 at 05:52:13PM -0700, Matthew Wilcox wrote:
>> 
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index cbe7d0619439..670a1780bd2f 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -563,8 +563,6 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
>> 
>> 	if (addr)
>> 		goto out;
>> -	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
>> -		goto out;
>> 
>> 	addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
>> 	if (addr)
> 
> I think you reducing ASLR without any real indication that THP is relevant
> for the VMA. We need to know if any huge page allocation will be
> *attempted* for the VMA or the file.

Without a properly aligned address the code will never even attempt allocating
a THP.

I don't think rounding an address to one that is properly aligned to map to
a THP, when possible, is all that detrimental to ASLR, and without the
ability to pick an aligned address it's rather unlikely anyone would ever
map anything to a THP unless they explicitly designate an address with
MAP_FIXED.

If you do object to the slight reduction of the ASLR address space, what
alternative would you prefer to see?

    -- Bill

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 14/15] mm: Align THP mappings for non-DAX
  2019-10-01 11:21     ` William Kucharski
@ 2019-10-01 11:32       ` Kirill A. Shutemov
  2019-10-01 12:18         ` William Kucharski
  0 siblings, 1 reply; 40+ messages in thread
From: Kirill A. Shutemov @ 2019-10-01 11:32 UTC (permalink / raw)
  To: William Kucharski; +Cc: Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel

On Tue, Oct 01, 2019 at 05:21:26AM -0600, William Kucharski wrote:
> 
> 
> > On Oct 1, 2019, at 4:45 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > 
> > On Tue, Sep 24, 2019 at 05:52:13PM -0700, Matthew Wilcox wrote:
> >> 
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index cbe7d0619439..670a1780bd2f 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -563,8 +563,6 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
> >> 
> >> 	if (addr)
> >> 		goto out;
> >> -	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
> >> -		goto out;
> >> 
> >> 	addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
> >> 	if (addr)
> > 
> > I think you reducing ASLR without any real indication that THP is relevant
> > for the VMA. We need to know if any huge page allocation will be
> > *attempted* for the VMA or the file.
> 
> Without a properly aligned address the code will never even attempt allocating
> a THP.
> 
> I don't think rounding an address to one that would be properly aligned to map
> to a THP if possible is all that detrimental to ASLR and without the ability to
> pick an aligned address it's rather unlikely anyone would ever map anything to
> a THP unless they explicitly designate an address with MAP_FIXED.
> 
> If you do object to the slight reduction of the ASLR address space, what
> alternative would you prefer to see?

We need to know by that point whether THP is allowed for this
file/VMA/process/whatever, meaning that we do not give up ASLR entropy for
nothing.

For instance, if THP is disabled globally, there is no reason to align the
VMA to the THP requirements.
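
E.g. something like this early in thp_get_unmapped_area() (sketch, not
tested):

	/* Don't sacrifice ASLR entropy if THP can never be used anyway */
	if (!(transparent_hugepage_flags &
	      ((1UL << TRANSPARENT_HUGEPAGE_FLAG) |
	       (1UL << TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG))))
		goto out;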

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 14/15] mm: Align THP mappings for non-DAX
  2019-10-01 11:32       ` Kirill A. Shutemov
@ 2019-10-01 12:18         ` William Kucharski
  2019-10-01 14:20           ` Kirill A. Shutemov
  0 siblings, 1 reply; 40+ messages in thread
From: William Kucharski @ 2019-10-01 12:18 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel



On 10/1/19 5:32 AM, Kirill A. Shutemov wrote:
> On Tue, Oct 01, 2019 at 05:21:26AM -0600, William Kucharski wrote:
>>
>>
>>> On Oct 1, 2019, at 4:45 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>>>
>>> On Tue, Sep 24, 2019 at 05:52:13PM -0700, Matthew Wilcox wrote:
>>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index cbe7d0619439..670a1780bd2f 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -563,8 +563,6 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
>>>>
>>>> 	if (addr)
>>>> 		goto out;
>>>> -	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
>>>> -		goto out;
>>>>
>>>> 	addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
>>>> 	if (addr)
>>>
>>> I think you reducing ASLR without any real indication that THP is relevant
>>> for the VMA. We need to know if any huge page allocation will be
>>> *attempted* for the VMA or the file.
>>
>> Without a properly aligned address the code will never even attempt allocating
>> a THP.
>>
>> I don't think rounding an address to one that would be properly aligned to map
>> to a THP if possible is all that detrimental to ASLR and without the ability to
>> pick an aligned address it's rather unlikely anyone would ever map anything to
>> a THP unless they explicitly designate an address with MAP_FIXED.
>>
>> If you do object to the slight reduction of the ASLR address space, what
>> alternative would you prefer to see?
> 
> We need to know, by the time the address is chosen, whether THP is
> allowed for this file/VMA/process/whatever, so that we do not give up
> ASLR entropy for nothing.
> 
> For instance, if THP is disabled globally, there is no reason to align the
> VMA to the THP requirements.

I understand, but this code is in thp_get_unmapped_area(), which is only called
if THP is configured and the VMA can support it.

I don't see it in Matthew's patchset, so I'm not sure if it was inadvertently
missed in his merge or if he has other ideas for how it would eventually be 
called, but in my last patch revision the code calling it in do_mmap() looked 
like this:

#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
         /*
          * If THP is enabled, it's a read-only executable that is
          * MAP_PRIVATE mapped, the length is larger than a PMD page
          * and either it's not a MAP_FIXED mapping or the passed address is
          * properly aligned for a PMD page, attempt to get an appropriate
          * address at which to map a PMD-sized THP page, otherwise call the
          * normal routine.
          */
         if ((prot & PROT_READ) && (prot & PROT_EXEC) &&
                 (!(prot & PROT_WRITE)) && (flags & MAP_PRIVATE) &&
                 (!(flags & MAP_FIXED)) && len >= HPAGE_PMD_SIZE) {
                 addr = thp_get_unmapped_area(file, addr, len, pgoff, flags);

                 if (addr && (!(addr & ~HPAGE_PMD_MASK))) {
                         /*
                          * If we got a suitable THP mapping address, shut off
                          * VM_MAYWRITE for the region, since it's never what
                          * we would want.
                          */
                         vm_maywrite = 0;
                 } else
                         addr = get_unmapped_area(file, addr, len, pgoff, flags);
         } else {
#endif

So I think that meets your expectations regarding ASLR.

    -- Bill

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 14/15] mm: Align THP mappings for non-DAX
  2019-10-01 12:18         ` William Kucharski
@ 2019-10-01 14:20           ` Kirill A. Shutemov
  2019-10-01 16:08             ` William Kucharski
  0 siblings, 1 reply; 40+ messages in thread
From: Kirill A. Shutemov @ 2019-10-01 14:20 UTC (permalink / raw)
  To: William Kucharski; +Cc: Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel

On Tue, Oct 01, 2019 at 06:18:28AM -0600, William Kucharski wrote:
> 
> 
> On 10/1/19 5:32 AM, Kirill A. Shutemov wrote:
> > On Tue, Oct 01, 2019 at 05:21:26AM -0600, William Kucharski wrote:
> > > 
> > > 
> > > > On Oct 1, 2019, at 4:45 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > > > 
> > > > On Tue, Sep 24, 2019 at 05:52:13PM -0700, Matthew Wilcox wrote:
> > > > > 
> > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > index cbe7d0619439..670a1780bd2f 100644
> > > > > --- a/mm/huge_memory.c
> > > > > +++ b/mm/huge_memory.c
> > > > > @@ -563,8 +563,6 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
> > > > > 
> > > > > 	if (addr)
> > > > > 		goto out;
> > > > > -	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
> > > > > -		goto out;
> > > > > 
> > > > > 	addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
> > > > > 	if (addr)
> > > > 
> > > > I think you're reducing ASLR without any real indication that THP is relevant
> > > > for the VMA. We need to know if any huge page allocation will be
> > > > *attempted* for the VMA or the file.
> > > 
> > > Without a properly aligned address the code will never even attempt allocating
> > > a THP.
> > > 
> > > I don't think rounding an address to one that would be properly aligned to map
> > > to a THP if possible is all that detrimental to ASLR and without the ability to
> > > pick an aligned address it's rather unlikely anyone would ever map anything to
> > > a THP unless they explicitly designate an address with MAP_FIXED.
> > > 
> > > If you do object to the slight reduction of the ASLR address space, what
> > > alternative would you prefer to see?
> > 
> > We need to know, by the time the address is chosen, whether THP is
> > allowed for this file/VMA/process/whatever, so that we do not give up
> > ASLR entropy for nothing.
> > 
> > For instance, if THP is disabled globally, there is no reason to align the
> > VMA to the THP requirements.
> 
> I understand, but this code is in thp_get_unmapped_area(), which is only called
> if THP is configured and the VMA can support it.
> 
> I don't see it in Matthew's patchset, so I'm not sure if it was inadvertently
> missed in his merge or if he has other ideas for how it would eventually be
> called, but in my last patch revision the code calling it in do_mmap()
> looked like this:
> 
> #ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
>         /*
>          * If THP is enabled, it's a read-only executable that is
>          * MAP_PRIVATE mapped, the length is larger than a PMD page
>          * and either it's not a MAP_FIXED mapping or the passed address is
>          * properly aligned for a PMD page, attempt to get an appropriate
>          * address at which to map a PMD-sized THP page, otherwise call the
>          * normal routine.
>          */
>         if ((prot & PROT_READ) && (prot & PROT_EXEC) &&
>                 (!(prot & PROT_WRITE)) && (flags & MAP_PRIVATE) &&
>                 (!(flags & MAP_FIXED)) && len >= HPAGE_PMD_SIZE) {

len and MAP_FIXED is already handled by thp_get_unmapped_area().

	if ((prot & (PROT_READ|PROT_WRITE|PROT_EXEC)) == (PROT_READ|PROT_EXEC) &&
		(flags & MAP_PRIVATE)) {


>                 addr = thp_get_unmapped_area(file, addr, len, pgoff, flags);
> 
>                 if (addr && (!(addr & ~HPAGE_PMD_MASK))) {

This check is broken.

For instance, if pgoff is one, (addr & ~HPAGE_PMD_MASK) has to be equal to
PAGE_SIZE to have a chance to get a huge page in the mapping.

>                         /*
>                          * If we got a suitable THP mapping address, shut off
>                          * VM_MAYWRITE for the region, since it's never what
>                          * we would want.
>                          */
>                         vm_maywrite = 0;

Wouldn't it break uprobe, for instance?

>                 } else
>                         addr = get_unmapped_area(file, addr, len, pgoff, flags);
>         } else {
> #endif
> 
> So I think that meets your expectations regarding ASLR.
> 
>    -- Bill

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 14/15] mm: Align THP mappings for non-DAX
  2019-10-01 14:20           ` Kirill A. Shutemov
@ 2019-10-01 16:08             ` William Kucharski
  2019-10-02  0:15               ` Kirill A. Shutemov
  0 siblings, 1 reply; 40+ messages in thread
From: William Kucharski @ 2019-10-01 16:08 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel



> On Oct 1, 2019, at 8:20 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> On Tue, Oct 01, 2019 at 06:18:28AM -0600, William Kucharski wrote:
>> 
>> 
>> On 10/1/19 5:32 AM, Kirill A. Shutemov wrote:
>>> On Tue, Oct 01, 2019 at 05:21:26AM -0600, William Kucharski wrote:
>>>> 
>>>> 
>>>>> On Oct 1, 2019, at 4:45 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>>>>> 
>>>>> On Tue, Sep 24, 2019 at 05:52:13PM -0700, Matthew Wilcox wrote:
>>>>>> 
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index cbe7d0619439..670a1780bd2f 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -563,8 +563,6 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
>>>>>> 
>>>>>> 	if (addr)
>>>>>> 		goto out;
>>>>>> -	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
>>>>>> -		goto out;
>>>>>> 
>>>>>> 	addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
>>>>>> 	if (addr)
>>>>> 
>>>>> I think you're reducing ASLR without any real indication that THP is relevant
>>>>> for the VMA. We need to know if any huge page allocation will be
>>>>> *attempted* for the VMA or the file.
>>>> 
>>>> Without a properly aligned address the code will never even attempt allocating
>>>> a THP.
>>>> 
>>>> I don't think rounding an address to one that would be properly aligned to map
>>>> to a THP if possible is all that detrimental to ASLR and without the ability to
>>>> pick an aligned address it's rather unlikely anyone would ever map anything to
>>>> a THP unless they explicitly designate an address with MAP_FIXED.
>>>> 
>>>> If you do object to the slight reduction of the ASLR address space, what
>>>> alternative would you prefer to see?
>>> 
>>> We need to know, by the time the address is chosen, whether THP is
>>> allowed for this file/VMA/process/whatever, so that we do not give up
>>> ASLR entropy for nothing.
>>> 
>>> For instance, if THP is disabled globally, there is no reason to align the
>>> VMA to the THP requirements.
>> 
>> I understand, but this code is in thp_get_unmapped_area(), which is only called
>> if THP is configured and the VMA can support it.
>> 
>> I don't see it in Matthew's patchset, so I'm not sure if it was inadvertently
>> missed in his merge or if he has other ideas for how it would eventually be
>> called, but in my last patch revision the code calling it in do_mmap()
>> looked like this:
>> 
>> #ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
>>        /*
>>         * If THP is enabled, it's a read-only executable that is
>>         * MAP_PRIVATE mapped, the length is larger than a PMD page
>>         * and either it's not a MAP_FIXED mapping or the passed address is
>>         * properly aligned for a PMD page, attempt to get an appropriate
>>         * address at which to map a PMD-sized THP page, otherwise call the
>>         * normal routine.
>>         */
>>        if ((prot & PROT_READ) && (prot & PROT_EXEC) &&
>>                (!(prot & PROT_WRITE)) && (flags & MAP_PRIVATE) &&
>>                (!(flags & MAP_FIXED)) && len >= HPAGE_PMD_SIZE) {
> 
> len and MAP_FIXED is already handled by thp_get_unmapped_area().
> 
> 	if ((prot & (PROT_READ|PROT_WRITE|PROT_EXEC)) == (PROT_READ|PROT_EXEC) &&
> 		(flags & MAP_PRIVATE)) {

It is, but I wanted to avoid even calling it if conditions weren't right.

Checking twice is non-optimal but I didn't want to alter the existing use of
the routine for anon THP.

> 
> 
>>                addr = thp_get_unmapped_area(file, addr, len, pgoff, flags);
>> 
>>                if (addr && (!(addr & ~HPAGE_PMD_MASK))) {
> 
> This check is broken.
> 
> For instance, if pgoff is one, (addr & ~HPAGE_PMD_MASK) has to be equal to
> PAGE_SIZE to have a chance to get a huge page in the mapping.
> 

If the address isn't PMD-aligned, we will never be able to map it with a THP
anyway.

The current code is designed to only map a THP if the VMA allows for it and
it can map the entire THP starting at an aligned address.

You can't map a THP at the PMD level at an address that isn't PMD aligned.

Perhaps I'm missing a use case here.

>>                        /*
>>                         * If we got a suitable THP mapping address, shut off
>>                         * VM_MAYWRITE for the region, since it's never what
>>                         * we would want.
>>                         */
>>                        vm_maywrite = 0;
> 
> Wouldn't it break uprobe, for instance?

I'm not sure; does uprobe allow COW to insert the probe even for mappings
explicitly marked read-only?

Thanks,
     Bill


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 14/15] mm: Align THP mappings for non-DAX
  2019-10-01 16:08             ` William Kucharski
@ 2019-10-02  0:15               ` Kirill A. Shutemov
  0 siblings, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2019-10-02  0:15 UTC (permalink / raw)
  To: William Kucharski; +Cc: Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel

On Tue, Oct 01, 2019 at 10:08:30AM -0600, William Kucharski wrote:
> 
> 
> > On Oct 1, 2019, at 8:20 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > 
> > On Tue, Oct 01, 2019 at 06:18:28AM -0600, William Kucharski wrote:
> >> 
> >> 
> >> On 10/1/19 5:32 AM, Kirill A. Shutemov wrote:
> >>> On Tue, Oct 01, 2019 at 05:21:26AM -0600, William Kucharski wrote:
> >>>> 
> >>>> 
> >>>>> On Oct 1, 2019, at 4:45 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >>>>> 
> >>>>> On Tue, Sep 24, 2019 at 05:52:13PM -0700, Matthew Wilcox wrote:
> >>>>>> 
> >>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >>>>>> index cbe7d0619439..670a1780bd2f 100644
> >>>>>> --- a/mm/huge_memory.c
> >>>>>> +++ b/mm/huge_memory.c
> >>>>>> @@ -563,8 +563,6 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
> >>>>>> 
> >>>>>> 	if (addr)
> >>>>>> 		goto out;
> >>>>>> -	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
> >>>>>> -		goto out;
> >>>>>> 
> >>>>>> 	addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
> >>>>>> 	if (addr)
> >>>>> 
> >>>>> I think you're reducing ASLR without any real indication that THP is relevant
> >>>>> for the VMA. We need to know if any huge page allocation will be
> >>>>> *attempted* for the VMA or the file.
> >>>> 
> >>>> Without a properly aligned address the code will never even attempt allocating
> >>>> a THP.
> >>>> 
> >>>> I don't think rounding an address to one that would be properly aligned to map
> >>>> to a THP if possible is all that detrimental to ASLR and without the ability to
> >>>> pick an aligned address it's rather unlikely anyone would ever map anything to
> >>>> a THP unless they explicitly designate an address with MAP_FIXED.
> >>>> 
> >>>> If you do object to the slight reduction of the ASLR address space, what
> >>>> alternative would you prefer to see?
> >>> 
> >>> We need to know, by the time the address is chosen, whether THP is
> >>> allowed for this file/VMA/process/whatever, so that we do not give up
> >>> ASLR entropy for nothing.
> >>> 
> >>> For instance, if THP is disabled globally, there is no reason to align the
> >>> VMA to the THP requirements.
> >> 
> >> I understand, but this code is in thp_get_unmapped_area(), which is only called
> >> if THP is configured and the VMA can support it.
> >> 
> >> I don't see it in Matthew's patchset, so I'm not sure if it was inadvertently
> >> missed in his merge or if he has other ideas for how it would eventually be
> >> called, but in my last patch revision the code calling it in do_mmap()
> >> looked like this:
> >> 
> >> #ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
> >>        /*
> >>         * If THP is enabled, it's a read-only executable that is
> >>         * MAP_PRIVATE mapped, the length is larger than a PMD page
> >>         * and either it's not a MAP_FIXED mapping or the passed address is
> >>         * properly aligned for a PMD page, attempt to get an appropriate
> >>         * address at which to map a PMD-sized THP page, otherwise call the
> >>         * normal routine.
> >>         */
> >>        if ((prot & PROT_READ) && (prot & PROT_EXEC) &&
> >>                (!(prot & PROT_WRITE)) && (flags & MAP_PRIVATE) &&
> >>                (!(flags & MAP_FIXED)) && len >= HPAGE_PMD_SIZE) {
> > 
> > len and MAP_FIXED is already handled by thp_get_unmapped_area().
> > 
> > 	if ((prot & (PROT_READ|PROT_WRITE|PROT_EXEC)) == (PROT_READ|PROT_EXEC) &&
> > 		(flags & MAP_PRIVATE)) {
> 
> It is, but I wanted to avoid even calling it if conditions weren't right.
> 
> Checking twice is non-optimal but I didn't want to alter the existing use of
> the routine for anon THP.

It's not used by anon THP. It's used for DAX.

> > 
> > 
> >>                addr = thp_get_unmapped_area(file, addr, len, pgoff, flags);
> >> 
> >>                if (addr && (!(addr & ~HPAGE_PMD_MASK))) {
> > 
> > This check is broken.
> > 
> > For instance, if pgoff is one, (addr & ~HPAGE_PMD_MASK) has to be equal to
> > PAGE_SIZE to have a chance to get a huge page in the mapping.
> > 
> 
> If the address isn't PMD-aligned, we will never be able to map it with a THP
> anyway.

The opposite is true. I have tried to explain this a few times, but let's
try again.

If the address here is PMD-aligned, it will be misaligned with respect to
the page cache.

Consider the case of 2 huge pages in the page cache, starting at the
beginning of the file.

The key to understanding this: huge pages are always naturally aligned in
the page cache, and the page cache lives longer than any mapping of the
file.

If the user calls mmap(.pgoff = 1) and it returns a PMD-aligned address,
we will never get a huge page in the mapping. At the PMD-aligned address
you will get the second 4k of the first huge page and you will have to
map it with a PTE. At the second PMD-aligned address of the mapping, the
situation repeats for the second huge page: again misaligned, again a
PTE-mapped subpage.

The solution here is to return an address that is PMD-aligned plus
PAGE_SIZE if the user asked for mmap(.pgoff = 1). The user will not get
the first huge page mapped with a PMD, because the mapping truncates it
from the beginning per the user's request. But the second (and every
following) huge page will land on the right alignment and can be mapped
with a PMD.

Does it make sense?
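
If a standalone illustration of the arithmetic helps, here is one in plain
userspace C, with the constants hardcoded for 4k pages and 2M PMDs -- not
the kernel implementation, just the idea:

	#include <stdio.h>

	#define PAGE_SIZE	4096UL
	#define PMD_SIZE	(512 * PAGE_SIZE)	/* 2M with 4k pages */

	/*
	 * Adjust the address so that (addr - file_off) is a multiple of
	 * PMD_SIZE: then every naturally aligned huge page in the page
	 * cache lands on a PMD boundary in the mapping.
	 */
	static unsigned long align_to_pagecache(unsigned long addr,
						unsigned long file_off)
	{
		addr += (file_off - addr) & (PMD_SIZE - 1);
		return addr;
	}

	int main(void)
	{
		/* mmap(.pgoff = 1): the mapping starts at file offset PAGE_SIZE */
		unsigned long addr = align_to_pagecache(0x7f0000123000UL, PAGE_SIZE);

		/*
		 * addr % PMD_SIZE == PAGE_SIZE, so the second (and every
		 * following) huge page can be mapped with a PMD.
		 */
		printf("addr = %#lx, addr mod PMD_SIZE = %#lx\n",
		       addr, addr % PMD_SIZE);
		return 0;
	}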

> >>                        /*
> >>                         * If we got a suitable THP mapping address, shut off
> >>                         * VM_MAYWRITE for the region, since it's never what
> >>                         * we would want.
> >>                         */
> >>                        vm_maywrite = 0;
> > 
> > Wouldn't it break uprobe, for instance?
> 
> I'm not sure; does uprobe allow COW to insert the probe even for mappings
> explicitly marked read-only?

Yes. See FOLL_FORCE usage.
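
Roughly, paraphrasing the checks in mm/gup.c from memory (so treat this as
a sketch, not the exact source): the forced write that uprobes rely on to
patch a breakpoint into a read-only private mapping is only allowed to
break COW when the VMA still counts as a COW mapping, and that test keys
off VM_MAYWRITE:

	/* mm/internal.h */
	static inline bool is_cow_mapping(vm_flags_t flags)
	{
		return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
	}

	/* check_vma_flags() in mm/gup.c, roughly */
	if (write && !(vm_flags & VM_WRITE)) {
		if (!(gup_flags & FOLL_FORCE))
			return -EFAULT;
		/* FOLL_FORCE may only COW into a VM_MAYWRITE mapping */
		if (!is_cow_mapping(vm_flags))
			return -EFAULT;
	}

Clear VM_MAYWRITE and that path fails with -EFAULT instead of installing
the probe.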

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 02/15] fs: Introduce i_blocks_per_page
  2019-09-25  8:36   ` Dave Chinner
@ 2019-10-04 19:28     ` Matthew Wilcox
  2019-10-08  3:53       ` Dave Chinner
  0 siblings, 1 reply; 40+ messages in thread
From: Matthew Wilcox @ 2019-10-04 19:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Wed, Sep 25, 2019 at 06:36:50PM +1000, Dave Chinner wrote:
> I'm actually working on abstracting this code from both block size
> and page size via the helpers below. We have a need to support block
> size > page size, and so that requires touching a bunch of the
> same code as this patchset. I'm currently trying to combine your
> last patch set with my patchset so I can easily test allocating 64k
> page cache pages on a 64k block size filesystem on a 4k page size
> machine with XFS....

This all makes sense ...

> > -	if (iop || i_blocksize(inode) == PAGE_SIZE)
> > +	if (iop || i_blocks_per_page(inode, page) <= 1)
> >  		return iop;
> 
> That also means checks like these become:
> 
> 	if (iop || iomap_chunks_per_page(inode, page) <= 1)
> 
> as a single file can now have multiple pages per block, a page per
> block and multiple blocks per page as the page size changes...
> 
> I'd like to only have to make one pass over this code to abstract
> out page and block sizes, so I'm guessing we'll need to do some
> co-ordination here....

Yup.  I'm happy if you want to send your patches out; I'll keep going
with the patches I have for the moment, and we'll figure out how to
merge the two series in a way that makes sense.
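
For anyone reading along, the helper in question boils down to something
like this (sketch from memory, not the literal patch):

	static inline unsigned int i_blocks_per_page(struct inode *inode,
			struct page *page)
	{
		return page_size(page) >> inode->i_blkbits;
	}

which is also why the check quoted above reads "<= 1": with block size
equal to page size the helper returns 1, and with block size larger than
page size the shift returns 0 -- only the multiple-blocks-per-page case
needs the extra iomap_page state.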


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 03/15] mm: Add file_offset_of_ helpers
       [not found] ` <20191002130753.7680-1-hdanton@sina.com>
@ 2019-10-04 19:33   ` Matthew Wilcox
  0 siblings, 0 replies; 40+ messages in thread
From: Matthew Wilcox @ 2019-10-04 19:33 UTC (permalink / raw)
  To: Hillf Danton; +Cc: linux-fsdevel, linux-mm, linux-kernel


Your mail program is still broken.  This shows up as a reply to the 0/15
email instead of as a reply to the 3/15 email.

On Wed, Oct 02, 2019 at 09:07:53PM +0800, Hillf Danton wrote:
> On Tue, 24 Sep 2019 17:52:02 -0700 From: Matthew Wilcox (Oracle)
> > +/**
> > + * file_offset_of_page - File offset of this page.
> > + * @page: Page cache page.
> > + *
> > + * Context: Any context.
> > + * Return: The offset of the first byte of this page.
> >   */
> > -static inline loff_t page_offset(struct page *page)
> > +static inline loff_t file_offset_of_page(struct page *page)
> >  {
> >  	return ((loff_t)page->index) << PAGE_SHIFT;
> >  }
> >  
> >  static inline loff_t page_file_offset(struct page *page)
> >  {
> >  	return ((loff_t)page_index(page)) << PAGE_SHIFT;
> 
> Would you like to specify the need to build a moon on the moon,
> with another name though?

I have no idea what you mean.  Is this an idiom in your native language,
perhaps?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 04/15] iomap: Support large pages
       [not found] ` <20191002133211.15696-1-hdanton@sina.com>
@ 2019-10-04 19:34   ` Matthew Wilcox
  0 siblings, 0 replies; 40+ messages in thread
From: Matthew Wilcox @ 2019-10-04 19:34 UTC (permalink / raw)
  To: Hillf Danton; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Wed, Oct 02, 2019 at 09:32:11PM +0800, Hillf Danton wrote:
> 
> On Tue, 24 Sep 2019 17:52:02 -0700 From: Matthew Wilcox (Oracle)
> > 
> > @@ -1415,6 +1415,8 @@ static inline void clear_page_pfmemalloc(struct page *page)
> >  extern void pagefault_out_of_memory(void);
> >  
> >  #define offset_in_page(p)	((unsigned long)(p) & ~PAGE_MASK)
> 
> With the above define, the page_offset function is not named as badly
> as 03/15 claims.

Just because there exists a function that does the job, does not mean that
the other function is correctly named.

> > +#define offset_in_this_page(page, p)	\
> > +	((unsigned long)(p) & (page_size(page) - 1))
> 
> What if Ted will post a rfc with offset_in_that_page defined next week?

Are you trying to be funny?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 06/15] xfs: Pass a page to xfs_finish_page_writeback
       [not found] ` <20191003040846.17604-1-hdanton@sina.com>
@ 2019-10-04 19:35   ` Matthew Wilcox
  0 siblings, 0 replies; 40+ messages in thread
From: Matthew Wilcox @ 2019-10-04 19:35 UTC (permalink / raw)
  To: Hillf Danton; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Thu, Oct 03, 2019 at 12:08:46PM +0800, Hillf Danton wrote:
> 
> On Tue, 24 Sep 2019 17:52:02 -0700 From: Matthew Wilcox (Oracle)
> > 
> > The only part of the bvec we were accessing was the bv_page, so just
> > pass that instead of the whole bvec.
> 
> Change is added in ABI without a bit of win.
> Changes like this are not needed.

ABI?  This is a static function.  The original recommendation to do this
came from Christoph, who I would trust over you as a referee of what
changes to make to XFS.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 11/15] mm: Remove hpage_nr_pages
       [not found] ` <20191003050859.18140-1-hdanton@sina.com>
@ 2019-10-04 19:36   ` Matthew Wilcox
  0 siblings, 0 replies; 40+ messages in thread
From: Matthew Wilcox @ 2019-10-04 19:36 UTC (permalink / raw)
  To: Hillf Danton; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Thu, Oct 03, 2019 at 01:08:59PM +0800, Hillf Danton wrote:
> 
> On Tue, 24 Sep 2019 17:52:02 -0700 From: Matthew Wilcox (Oracle)
> > 
> > @@ -354,7 +354,7 @@ vma_address(struct page *page, struct vm_area_struct *vma)
> >  	unsigned long start, end;
> > 
> >  	start = __vma_address(page, vma);
> > -	end = start + PAGE_SIZE * (hpage_nr_pages(page) - 1);
> > +	end = start + page_size(page) - 1;
> > 
> > @@ -57,7 +57,7 @@ static inline bool pfn_in_hpage(struct page *hpage, unsigned long pfn)
> >  	unsigned long hpage_pfn = page_to_pfn(hpage);
> > 
> >  	/* THP can be referenced by any subpage */
> > -	return pfn >= hpage_pfn && pfn - hpage_pfn < hpage_nr_pages(hpage);
> > +	return (pfn - hpage_pfn) < compound_nr(hpage);
> >  }
> > 
> > @@ -264,7 +264,7 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
> >  	unsigned long start, end;
> > 
> >  	start = __vma_address(page, vma);
> > -	end = start + PAGE_SIZE * (hpage_nr_pages(page) - 1);
> > +	end = start + page_size(page) - 1;
> > 
> >  	if (unlikely(end < vma->vm_start || start >= vma->vm_end))
> 
> Be certain that nothing is added other than mechanical replacings in
> the above hunks.

Are you saying I've made a mistake?  If so, please be clear.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 03/15] mm: Add file_offset_of_ helpers
  2019-09-26 14:02   ` Kirill A. Shutemov
@ 2019-10-04 19:39     ` Matthew Wilcox
  0 siblings, 0 replies; 40+ messages in thread
From: Matthew Wilcox @ 2019-10-04 19:39 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Thu, Sep 26, 2019 at 05:02:11PM +0300, Kirill A. Shutemov wrote:
> On Tue, Sep 24, 2019 at 05:52:02PM -0700, Matthew Wilcox wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > 
> > The page_offset function is badly named for people reading the functions
> > which call it.  The natural meaning of a function with this name would
> > be 'offset within a page', not 'page offset in bytes within a file'.
> > Dave Chinner suggests file_offset_of_page() as a replacement function
> > name and I'm also adding file_offset_of_next_page() as a helper for the
> > large page work.  Also add kernel-doc for these functions so they show
> > up in the kernel API book.
> > 
> > page_offset() is retained as a compatibility define for now.
> 
> This should be trivial for coccinelle, right?

Yes, should be.  I'd prefer not to do conversions for now to minimise
conflicts when rebasing.

> > +static inline loff_t file_offset_of_next_page(struct page *page)
> > +{
> > +	return ((loff_t)page->index + compound_nr(page)) << PAGE_SHIFT;
> 
> Wouldn't it be more readable as
> 
> 	return file_offset_of_page(page) + page_size(page);
> 
> ?

Good idea.  I'll fix that up.
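
So the pair would end up looking something like:

	static inline loff_t file_offset_of_page(struct page *page)
	{
		return (loff_t)page->index << PAGE_SHIFT;
	}

	static inline loff_t file_offset_of_next_page(struct page *page)
	{
		return file_offset_of_page(page) + page_size(page);
	}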

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 02/15] fs: Introduce i_blocks_per_page
  2019-10-04 19:28     ` Matthew Wilcox
@ 2019-10-08  3:53       ` Dave Chinner
  0 siblings, 0 replies; 40+ messages in thread
From: Dave Chinner @ 2019-10-08  3:53 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Fri, Oct 04, 2019 at 12:28:12PM -0700, Matthew Wilcox wrote:
> On Wed, Sep 25, 2019 at 06:36:50PM +1000, Dave Chinner wrote:
> > I'm actually working on abstracting this code from both block size
> > and page size via the helpers below. We have a need to support block
> > size > page size, and so that requires touching a bunch of the
> > same code as this patchset. I'm currently trying to combine your
> > last patch set with my patchset so I can easily test allocating 64k
> > page cache pages on a 64k block size filesystem on a 4k page size
> > machine with XFS....
> 
> This all makes sense ...
> 
> > > -	if (iop || i_blocksize(inode) == PAGE_SIZE)
> > > +	if (iop || i_blocks_per_page(inode, page) <= 1)
> > >  		return iop;
> > 
> > That also means checks like these become:
> > 
> > 	if (iop || iomap_chunks_per_page(inode, page) <= 1)
> > 
> > as a single file can now have multiple pages per block, a page per
> > block and multiple blocks per page as the page size changes...
> > 
> > I'd like to only have to make one pass over this code to abstract
> > out page and block sizes, so I'm guessing we'll need to do some
> > co-ordination here....
> 
> Yup.  I'm happy if you want to send your patches out; I'll keep going
> with the patches I have for the moment, and we'll figure out how to
> merge the two series in a way that makes sense.

I'm waiting for the xfs -> iomap writeback changes to land in a
stable branch so I don't have to do things twice in slightly
different ways in the patchset. Once we get that in an iomap-next
branch I'll rebase my patches on top of it and go from there...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2019-10-08  3:54 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-25  0:51 [RFC 00/15] Large pages in the page-cache Matthew Wilcox
2019-09-25  0:52 ` [PATCH 01/15] mm: Use vm_fault error code directly Matthew Wilcox
2019-09-26 13:55   ` Kirill A. Shutemov
2019-09-25  0:52 ` [PATCH 02/15] fs: Introduce i_blocks_per_page Matthew Wilcox
2019-09-25  8:36   ` Dave Chinner
2019-10-04 19:28     ` Matthew Wilcox
2019-10-08  3:53       ` Dave Chinner
2019-09-25  0:52 ` [PATCH 03/15] mm: Add file_offset_of_ helpers Matthew Wilcox
2019-09-26 14:02   ` Kirill A. Shutemov
2019-10-04 19:39     ` Matthew Wilcox
2019-09-25  0:52 ` [PATCH 04/15] iomap: Support large pages Matthew Wilcox
2019-09-25  0:52 ` [PATCH 05/15] xfs: " Matthew Wilcox
2019-09-25  0:52 ` [PATCH 06/15] xfs: Pass a page to xfs_finish_page_writeback Matthew Wilcox
2019-09-25  0:52 ` [PATCH 07/15] mm: Make prep_transhuge_page tail-callable Matthew Wilcox
2019-09-26 14:13   ` Kirill A. Shutemov
2019-09-25  0:52 ` [PATCH 08/15] mm: Add __page_cache_alloc_order Matthew Wilcox
2019-09-26 14:15   ` Kirill A. Shutemov
2019-09-25  0:52 ` [PATCH 09/15] mm: Allow large pages to be added to the page cache Matthew Wilcox
2019-09-26 14:22   ` Kirill A. Shutemov
2019-09-25  0:52 ` [PATCH 10/15] mm: Allow find_get_page to be used for large pages Matthew Wilcox
2019-10-01 10:32   ` Kirill A. Shutemov
2019-09-25  0:52 ` [PATCH 11/15] mm: Remove hpage_nr_pages Matthew Wilcox
2019-10-01 10:35   ` Kirill A. Shutemov
2019-09-25  0:52 ` [PATCH 12/15] mm: Support removing arbitrary sized pages from mapping Matthew Wilcox
2019-10-01 10:39   ` Kirill A. Shutemov
2019-09-25  0:52 ` [PATCH 13/15] mm: Add a huge page fault handler for files Matthew Wilcox
2019-10-01 10:42   ` Kirill A. Shutemov
2019-09-25  0:52 ` [PATCH 14/15] mm: Align THP mappings for non-DAX Matthew Wilcox
2019-10-01 10:45   ` Kirill A. Shutemov
2019-10-01 11:21     ` William Kucharski
2019-10-01 11:32       ` Kirill A. Shutemov
2019-10-01 12:18         ` William Kucharski
2019-10-01 14:20           ` Kirill A. Shutemov
2019-10-01 16:08             ` William Kucharski
2019-10-02  0:15               ` Kirill A. Shutemov
2019-09-25  0:52 ` [PATCH 15/15] xfs: Use filemap_huge_fault Matthew Wilcox
     [not found] ` <20191002130753.7680-1-hdanton@sina.com>
2019-10-04 19:33   ` [PATCH 03/15] mm: Add file_offset_of_ helpers Matthew Wilcox
     [not found] ` <20191002133211.15696-1-hdanton@sina.com>
2019-10-04 19:34   ` [PATCH 04/15] iomap: Support large pages Matthew Wilcox
     [not found] ` <20191003040846.17604-1-hdanton@sina.com>
2019-10-04 19:35   ` [PATCH 06/15] xfs: Pass a page to xfs_finish_page_writeback Matthew Wilcox
     [not found] ` <20191003050859.18140-1-hdanton@sina.com>
2019-10-04 19:36   ` [PATCH 11/15] mm: Remove hpage_nr_pages Matthew Wilcox
