* [RFC 0/2] iomap & xfs support for large pages
@ 2019-07-31 17:17 Matthew Wilcox
  2019-07-31 17:17 ` [PATCH 1/2] iomap: Support " Matthew Wilcox
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Matthew Wilcox @ 2019-07-31 17:17 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), hch, linux-xfs, linux-mm

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Christoph sent me a patch a few months ago called "XFS THP wip".
I've redone it based on current linus tree, plus the page_size() /
compound_nr() / page_shift() patches currently found in -mm.  I fixed
the logic bugs that I noticed in his patch and may have introduced some
of my own.  I have only compile tested this code.

Matthew Wilcox (Oracle) (2):
  iomap: Support large pages
  xfs: Support large pages

 fs/iomap/buffered-io.c | 82 ++++++++++++++++++++++++++----------------
 fs/xfs/xfs_aops.c      | 37 +++++++++----------
 include/linux/iomap.h  |  2 +-
 3 files changed, 72 insertions(+), 49 deletions(-)

-- 
2.20.1



* [PATCH 1/2] iomap: Support large pages
  2019-07-31 17:17 [RFC 0/2] iomap & xfs support for large pages Matthew Wilcox
@ 2019-07-31 17:17 ` Matthew Wilcox
  2019-07-31 23:03   ` Dave Chinner
  2019-07-31 17:17 ` [PATCH 2/2] xfs: " Matthew Wilcox
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2019-07-31 17:17 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), hch, linux-xfs, linux-mm

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Change iomap_page from a statically sized uptodate bitmap to a
dynamically allocated uptodate bitmap, allowing an arbitrarily large page.

Calculate the size of the page everywhere instead of using a base
PAGE_SIZE.

Based on a patch from Christoph Hellwig.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 82 ++++++++++++++++++++++++++----------------
 include/linux/iomap.h  |  2 +-
 2 files changed, 53 insertions(+), 31 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index e25901ae3ff4..4d56b8060b6c 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -23,14 +23,17 @@ static struct iomap_page *
 iomap_page_create(struct inode *inode, struct page *page)
 {
 	struct iomap_page *iop = to_iomap_page(page);
+	unsigned int nbits;
 
-	if (iop || i_blocksize(inode) == PAGE_SIZE)
+	if (iop || i_blocksize(inode) == page_size(page))
 		return iop;
 
-	iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
+	nbits = BITS_TO_LONGS(page_size(page) / SECTOR_SIZE);
+	iop = kmalloc(struct_size(iop, uptodate, nbits),
+			GFP_NOFS | __GFP_NOFAIL);
 	atomic_set(&iop->read_count, 0);
 	atomic_set(&iop->write_count, 0);
-	bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
+	bitmap_zero(iop->uptodate, nbits);
 
 	/*
 	 * migrate_page_move_mapping() assumes that pages with private data have
@@ -61,15 +64,16 @@ iomap_page_release(struct page *page)
  * Calculate the range inside the page that we actually need to read.
  */
 static void
-iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
+iomap_adjust_read_range(struct inode *inode, struct page *page,
 		loff_t *pos, loff_t length, unsigned *offp, unsigned *lenp)
 {
+	struct iomap_page *iop = to_iomap_page(page);
 	loff_t orig_pos = *pos;
 	loff_t isize = i_size_read(inode);
 	unsigned block_bits = inode->i_blkbits;
 	unsigned block_size = (1 << block_bits);
-	unsigned poff = offset_in_page(*pos);
-	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
+	unsigned poff = *pos & (page_size(page) - 1);
+	unsigned plen = min_t(loff_t, page_size(page) - poff, length);
 	unsigned first = poff >> block_bits;
 	unsigned last = (poff + plen - 1) >> block_bits;
 
@@ -107,7 +111,8 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
 	 * page cache for blocks that are entirely outside of i_size.
 	 */
 	if (orig_pos <= isize && orig_pos + length > isize) {
-		unsigned end = offset_in_page(isize - 1) >> block_bits;
+		unsigned end = (isize - 1) & (page_size(page) - 1) >>
+				block_bits;
 
 		if (first <= end && last > end)
 			plen -= (last - end) * block_size;
@@ -128,7 +133,7 @@ iomap_set_range_uptodate(struct page *page, unsigned off, unsigned len)
 	bool uptodate = true;
 
 	if (iop) {
-		for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
+		for (i = 0; i < page_size(page) / i_blocksize(inode); i++) {
 			if (i >= first && i <= last)
 				set_bit(i, iop->uptodate);
 			else if (!test_bit(i, iop->uptodate))
@@ -194,11 +199,12 @@ iomap_read_inline_data(struct inode *inode, struct page *page,
 		return;
 
 	BUG_ON(page->index);
-	BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
+	BUG_ON(size > page_size(page) - ((unsigned long)iomap->inline_data &
+						(page_size(page) - 1)));
 
 	addr = kmap_atomic(page);
 	memcpy(addr, iomap->inline_data, size);
-	memset(addr + size, 0, PAGE_SIZE - size);
+	memset(addr + size, 0, page_size(page) - size);
 	kunmap_atomic(addr);
 	SetPageUptodate(page);
 }
@@ -218,11 +224,11 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	if (iomap->type == IOMAP_INLINE) {
 		WARN_ON_ONCE(pos);
 		iomap_read_inline_data(inode, page, iomap);
-		return PAGE_SIZE;
+		return page_size(page);
 	}
 
 	/* zero post-eof blocks as the page may be mapped */
-	iomap_adjust_read_range(inode, iop, &pos, length, &poff, &plen);
+	iomap_adjust_read_range(inode, page, &pos, length, &poff, &plen);
 	if (plen == 0)
 		goto done;
 
@@ -258,7 +264,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 
 	if (!ctx->bio || !is_contig || bio_full(ctx->bio, plen)) {
 		gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
-		int nr_vecs = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
+		int nr_vecs = (length + page_size(page) - 1) >> page_shift(page);
 
 		if (ctx->bio)
 			submit_bio(ctx->bio);
@@ -293,9 +299,9 @@ iomap_readpage(struct page *page, const struct iomap_ops *ops)
 	unsigned poff;
 	loff_t ret;
 
-	for (poff = 0; poff < PAGE_SIZE; poff += ret) {
+	for (poff = 0; poff < page_size(page); poff += ret) {
 		ret = iomap_apply(inode, page_offset(page) + poff,
-				PAGE_SIZE - poff, 0, ops, &ctx,
+				page_size(page) - poff, 0, ops, &ctx,
 				iomap_readpage_actor);
 		if (ret <= 0) {
 			WARN_ON_ONCE(ret == 0);
@@ -342,7 +348,7 @@ iomap_next_page(struct inode *inode, struct list_head *pages, loff_t pos,
 		 * readpages call itself as every page gets checked again once
 		 * actually needed.
 		 */
-		*done += PAGE_SIZE;
+		*done += page_size(page);
 		put_page(page);
 	}
 
@@ -355,9 +361,14 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
 {
 	struct iomap_readpage_ctx *ctx = data;
 	loff_t done, ret;
+	size_t pg_left = 0;
+
+	if (ctx->cur_page)
+		pg_left = page_size(ctx->cur_page) -
+					(pos & (page_size(ctx->cur_page) - 1));
 
 	for (done = 0; done < length; done += ret) {
-		if (ctx->cur_page && offset_in_page(pos + done) == 0) {
+		if (ctx->cur_page && pg_left == 0) {
 			if (!ctx->cur_page_in_bio)
 				unlock_page(ctx->cur_page);
 			put_page(ctx->cur_page);
@@ -369,9 +380,11 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
 			if (!ctx->cur_page)
 				break;
 			ctx->cur_page_in_bio = false;
+			pg_left = page_size(ctx->cur_page);
 		}
 		ret = iomap_readpage_actor(inode, pos + done, length - done,
 				ctx, iomap);
+		pg_left -= ret;
 	}
 
 	return done;
@@ -386,8 +399,9 @@ iomap_readpages(struct address_space *mapping, struct list_head *pages,
 		.is_readahead	= true,
 	};
 	loff_t pos = page_offset(list_entry(pages->prev, struct page, lru));
-	loff_t last = page_offset(list_entry(pages->next, struct page, lru));
-	loff_t length = last - pos + PAGE_SIZE, ret = 0;
+	struct page *last_page = list_entry(pages->next, struct page, lru);
+	loff_t length = page_offset(last_page) - pos + page_size(last_page);
+	loff_t ret = 0;
 
 	while (length > 0) {
 		ret = iomap_apply(mapping->host, pos, length, 0, ops,
@@ -435,7 +449,7 @@ iomap_is_partially_uptodate(struct page *page, unsigned long from,
 	unsigned i;
 
 	/* Limit range to one page */
-	len = min_t(unsigned, PAGE_SIZE - from, count);
+	len = min_t(unsigned, page_size(page) - from, count);
 
 	/* First and last blocks in range within page */
 	first = from >> inode->i_blkbits;
@@ -474,7 +488,7 @@ iomap_invalidatepage(struct page *page, unsigned int offset, unsigned int len)
 	 * If we are invalidating the entire page, clear the dirty state from it
 	 * and release it to avoid unnecessary buildup of the LRU.
 	 */
-	if (offset == 0 && len == PAGE_SIZE) {
+	if (offset == 0 && len == page_size(page)) {
 		WARN_ON_ONCE(PageWriteback(page));
 		cancel_dirty_page(page);
 		iomap_page_release(page);
@@ -550,18 +564,20 @@ static int
 __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
 		struct page *page, struct iomap *iomap)
 {
-	struct iomap_page *iop = iomap_page_create(inode, page);
 	loff_t block_size = i_blocksize(inode);
 	loff_t block_start = pos & ~(block_size - 1);
 	loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
-	unsigned from = offset_in_page(pos), to = from + len, poff, plen;
+	unsigned from = pos & (page_size(page) - 1);
+	unsigned to = from + len;
+	unsigned poff, plen;
 	int status = 0;
 
 	if (PageUptodate(page))
 		return 0;
+	iomap_page_create(inode, page);
 
 	do {
-		iomap_adjust_read_range(inode, iop, &block_start,
+		iomap_adjust_read_range(inode, page, &block_start,
 				block_end - block_start, &poff, &plen);
 		if (plen == 0)
 			break;
@@ -673,7 +689,7 @@ __iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
 	 */
 	if (unlikely(copied < len && !PageUptodate(page)))
 		return 0;
-	iomap_set_range_uptodate(page, offset_in_page(pos), len);
+	iomap_set_range_uptodate(page, pos & (page_size(page) - 1), len);
 	iomap_set_page_dirty(page);
 	return copied;
 }
@@ -685,7 +701,9 @@ iomap_write_end_inline(struct inode *inode, struct page *page,
 	void *addr;
 
 	WARN_ON_ONCE(!PageUptodate(page));
-	BUG_ON(pos + copied > PAGE_SIZE - offset_in_page(iomap->inline_data));
+	BUG_ON(pos + copied > page_size(page) -
+			((unsigned long)iomap->inline_data &
+			 (page_size(page) - 1)));
 
 	addr = kmap_atomic(page);
 	memcpy(iomap->inline_data + pos, addr + pos, copied);
@@ -749,6 +767,10 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
 
+		/*
+		 * XXX: We don't know what size page we'll find in the
+		 * page cache, so only copy up to a regular page boundary.
+		 */
 		offset = offset_in_page(pos);
 		bytes = min_t(unsigned long, PAGE_SIZE - offset,
 						iov_iter_count(i));
@@ -1047,11 +1069,11 @@ vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops)
 		goto out_unlock;
 	}
 
-	/* page is wholly or partially inside EOF */
-	if (((page->index + 1) << PAGE_SHIFT) > size)
-		length = offset_in_page(size);
+	/* page is wholly or partially beyond EOF */
+	if (((page->index + compound_nr(page)) << PAGE_SHIFT) > size)
+		length = size & (page_size(page) - 1);
 	else
-		length = PAGE_SIZE;
+		length = page_size(page);
 
 	offset = page_offset(page);
 	while (length > 0) {
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index bc499ceae392..86be24a8259b 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -139,7 +139,7 @@ loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
 struct iomap_page {
 	atomic_t		read_count;
 	atomic_t		write_count;
-	DECLARE_BITMAP(uptodate, PAGE_SIZE / 512);
+	unsigned long		uptodate[];
 };
 
 static inline struct iomap_page *to_iomap_page(struct page *page)
-- 
2.20.1



* [PATCH 2/2] xfs: Support large pages
  2019-07-31 17:17 [RFC 0/2] iomap & xfs support for large pages Matthew Wilcox
  2019-07-31 17:17 ` [PATCH 1/2] iomap: Support " Matthew Wilcox
@ 2019-07-31 17:17 ` Matthew Wilcox
  2019-08-01 16:13   ` Christoph Hellwig
  2019-07-31 17:50 ` [RFC 0/2] iomap & xfs support for " Song Liu
  2019-08-02 14:54 ` Christopher Lameter
  3 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2019-07-31 17:17 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), hch, linux-xfs, linux-mm

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Mostly this is just checking the page size of each page instead of
assuming PAGE_SIZE.  Clean up the logic in writepage a little.

Based on a patch from Christoph Hellwig.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/xfs/xfs_aops.c | 37 +++++++++++++++++++------------------
 1 file changed, 19 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index f16d5f196c6b..4952cd7d8c6c 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -58,21 +58,22 @@ xfs_find_daxdev_for_inode(
 static void
 xfs_finish_page_writeback(
 	struct inode		*inode,
-	struct bio_vec	*bvec,
+	struct bio_vec		*bvec,
 	int			error)
 {
-	struct iomap_page	*iop = to_iomap_page(bvec->bv_page);
+	struct page		*page = bvec->bv_page;
+	struct iomap_page	*iop = to_iomap_page(page);
 
 	if (error) {
-		SetPageError(bvec->bv_page);
+		SetPageError(page);
 		mapping_set_error(inode->i_mapping, -EIO);
 	}
 
-	ASSERT(iop || i_blocksize(inode) == PAGE_SIZE);
+	ASSERT(iop || i_blocksize(inode) == page_size(page));
 	ASSERT(!iop || atomic_read(&iop->write_count) > 0);
 
 	if (!iop || atomic_dec_and_test(&iop->write_count))
-		end_page_writeback(bvec->bv_page);
+		end_page_writeback(page);
 }
 
 /*
@@ -765,7 +766,7 @@ xfs_add_to_ioend(
 	struct xfs_mount	*mp = ip->i_mount;
 	struct block_device	*bdev = xfs_find_bdev_for_inode(inode);
 	unsigned		len = i_blocksize(inode);
-	unsigned		poff = offset & (PAGE_SIZE - 1);
+	unsigned		poff = offset & (page_size(page) - 1);
 	bool			merged, same_page = false;
 	sector_t		sector;
 
@@ -839,11 +840,11 @@ xfs_aops_discard_page(
 			page, ip->i_ino, offset);
 
 	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
-			PAGE_SIZE / i_blocksize(inode));
+			page_size(page) / i_blocksize(inode));
 	if (error && !XFS_FORCED_SHUTDOWN(mp))
 		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
 out_invalidate:
-	xfs_vm_invalidatepage(page, 0, PAGE_SIZE);
+	xfs_vm_invalidatepage(page, 0, page_size(page));
 }
 
 /*
@@ -877,7 +878,7 @@ xfs_writepage_map(
 	uint64_t		file_offset;	/* file offset of page */
 	int			error = 0, count = 0, i;
 
-	ASSERT(iop || i_blocksize(inode) == PAGE_SIZE);
+	ASSERT(iop || i_blocksize(inode) == page_size(page));
 	ASSERT(!iop || atomic_read(&iop->write_count) == 0);
 
 	/*
@@ -886,7 +887,8 @@ xfs_writepage_map(
 	 * one.
 	 */
 	for (i = 0, file_offset = page_offset(page);
-	     i < (PAGE_SIZE >> inode->i_blkbits) && file_offset < end_offset;
+	     i < (page_size(page) >> inode->i_blkbits) &&
+						file_offset < end_offset;
 	     i++, file_offset += len) {
 		if (iop && !test_bit(i, iop->uptodate))
 			continue;
@@ -984,8 +986,7 @@ xfs_do_writepage(
 	struct xfs_writepage_ctx *wpc = data;
 	struct inode		*inode = page->mapping->host;
 	loff_t			offset;
-	uint64_t              end_offset;
-	pgoff_t                 end_index;
+	uint64_t		end_offset;
 
 	trace_xfs_writepage(inode, page, 0, 0);
 
@@ -1024,10 +1025,9 @@ xfs_do_writepage(
 	 * ---------------------------------^------------------|
 	 */
 	offset = i_size_read(inode);
-	end_index = offset >> PAGE_SHIFT;
-	if (page->index < end_index)
-		end_offset = (xfs_off_t)(page->index + 1) << PAGE_SHIFT;
-	else {
+	end_offset = (xfs_off_t)(page->index + compound_nr(page)) << PAGE_SHIFT;
+
+	if (end_offset > offset) {
 		/*
 		 * Check whether the page to write out is beyond or straddles
 		 * i_size or not.
@@ -1039,7 +1039,8 @@ xfs_do_writepage(
 		 * |				    |      Straddles     |
 		 * ---------------------------------^-----------|--------|
 		 */
-		unsigned offset_into_page = offset & (PAGE_SIZE - 1);
+		unsigned offset_into_page = offset & (page_size(page) - 1);
+		pgoff_t end_index = offset >> PAGE_SHIFT;
 
 		/*
 		 * Skip the page if it is fully outside i_size, e.g. due to a
@@ -1070,7 +1071,7 @@ xfs_do_writepage(
 		 * memory is zeroed when mapped, and writes to that region are
 		 * not written out to the file."
 		 */
-		zero_user_segment(page, offset_into_page, PAGE_SIZE);
+		zero_user_segment(page, offset_into_page, page_size(page));
 
 		/* Adjust the end_offset to the end of file */
 		end_offset = offset;
-- 
2.20.1



* Re: [RFC 0/2] iomap & xfs support for large pages
  2019-07-31 17:17 [RFC 0/2] iomap & xfs support for large pages Matthew Wilcox
  2019-07-31 17:17 ` [PATCH 1/2] iomap: Support " Matthew Wilcox
  2019-07-31 17:17 ` [PATCH 2/2] xfs: " Matthew Wilcox
@ 2019-07-31 17:50 ` Song Liu
  2019-07-31 17:59   ` Matthew Wilcox
  2019-08-02 14:54 ` Christopher Lameter
  3 siblings, 1 reply; 12+ messages in thread
From: Song Liu @ 2019-07-31 17:50 UTC (permalink / raw)
  To: Matthew Wilcox, William Kucharski
  Cc: Linux-Fsdevel, Christoph Hellwig, linux-xfs, Linux-MM

On Wed, Jul 31, 2019 at 10:17 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
>
> Christoph sent me a patch a few months ago called "XFS THP wip".
> I've redone it based on current linus tree, plus the page_size() /
> compound_nr() / page_shift() patches currently found in -mm.  I fixed
> the logic bugs that I noticed in his patch and may have introduced some
> of my own.  I have only compile tested this code.

Would Bill's set work on XFS with this set?

Thanks,
Song

>
> Matthew Wilcox (Oracle) (2):
>   iomap: Support large pages
>   xfs: Support large pages
>
>  fs/iomap/buffered-io.c | 82 ++++++++++++++++++++++++++----------------
>  fs/xfs/xfs_aops.c      | 37 +++++++++----------
>  include/linux/iomap.h  |  2 +-
>  3 files changed, 72 insertions(+), 49 deletions(-)
>
> --
> 2.20.1
>


* Re: [RFC 0/2] iomap & xfs support for large pages
  2019-07-31 17:50 ` [RFC 0/2] iomap & xfs support for " Song Liu
@ 2019-07-31 17:59   ` Matthew Wilcox
  0 siblings, 0 replies; 12+ messages in thread
From: Matthew Wilcox @ 2019-07-31 17:59 UTC (permalink / raw)
  To: Song Liu
  Cc: William Kucharski, Linux-Fsdevel, Christoph Hellwig, linux-xfs, Linux-MM

On Wed, Jul 31, 2019 at 10:50:40AM -0700, Song Liu wrote:
> On Wed, Jul 31, 2019 at 10:17 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> >
> > Christoph sent me a patch a few months ago called "XFS THP wip".
> > I've redone it based on current linus tree, plus the page_size() /
> > compound_nr() / page_shift() patches currently found in -mm.  I fixed
> > the logic bugs that I noticed in his patch and may have introduced some
> > of my own.  I have only compile tested this code.
> 
> Would Bill's set work on XFS with this set?

If there are no bugs in his code or mine ;-)

It'd also need to be wired up; something like this:

+++ b/fs/xfs/xfs_file.c
@@ -1131,6 +1131,8 @@ __xfs_filemap_fault(
        } else {
                if (write_fault)
                        ret = iomap_page_mkwrite(vmf, &xfs_iomap_ops);
+               else if (pe_size)
+                       ret = filemap_huge_fault(vmf, pe_size);
                else
                        ret = filemap_fault(vmf);
        }
@@ -1156,9 +1158,6 @@ xfs_filemap_huge_fault(
        struct vm_fault         *vmf,
        enum page_entry_size    pe_size)
 {
-       if (!IS_DAX(file_inode(vmf->vma->vm_file)))
-               return VM_FAULT_FALLBACK;
-
        /* DAX can shortcut the normal fault path on write faults! */
        return __xfs_filemap_fault(vmf, pe_size,
                        (vmf->flags & FAULT_FLAG_WRITE));

(untested)



* Re: [PATCH 1/2] iomap: Support large pages
  2019-07-31 17:17 ` [PATCH 1/2] iomap: Support " Matthew Wilcox
@ 2019-07-31 23:03   ` Dave Chinner
  2019-08-01  3:59     ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2019-07-31 23:03 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, hch, linux-xfs, linux-mm

On Wed, Jul 31, 2019 at 10:17:33AM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Change iomap_page from a statically sized uptodate bitmap to a
> dynamically allocated uptodate bitmap, allowing an arbitrarily large page.
> 
> Calculate the size of the page everywhere instead of using a base
> PAGE_SIZE.
> 
> Based on a patch from Christoph Hellwig.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  fs/iomap/buffered-io.c | 82 ++++++++++++++++++++++++++----------------
>  include/linux/iomap.h  |  2 +-
>  2 files changed, 53 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index e25901ae3ff4..4d56b8060b6c 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -23,14 +23,17 @@ static struct iomap_page *
>  iomap_page_create(struct inode *inode, struct page *page)
>  {
>  	struct iomap_page *iop = to_iomap_page(page);
> +	unsigned int nbits;
>  
> -	if (iop || i_blocksize(inode) == PAGE_SIZE)
> +	if (iop || i_blocksize(inode) == page_size(page))
>  		return iop;
>  
> -	iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
> +	nbits = BITS_TO_LONGS(page_size(page) / SECTOR_SIZE);

nbits = BITS_TO_LONGS(page_size(page) / i_blocksize(inode));

> +	iop = kmalloc(struct_size(iop, uptodate, nbits),
> +			GFP_NOFS | __GFP_NOFAIL);
>  	atomic_set(&iop->read_count, 0);
>  	atomic_set(&iop->write_count, 0);
> -	bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
> +	bitmap_zero(iop->uptodate, nbits);
>  
>  	/*
>  	 * migrate_page_move_mapping() assumes that pages with private data have
> @@ -61,15 +64,16 @@ iomap_page_release(struct page *page)
>   * Calculate the range inside the page that we actually need to read.
>   */
>  static void
> -iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
> +iomap_adjust_read_range(struct inode *inode, struct page *page,
>  		loff_t *pos, loff_t length, unsigned *offp, unsigned *lenp)
>  {
> +	struct iomap_page *iop = to_iomap_page(page);
>  	loff_t orig_pos = *pos;
>  	loff_t isize = i_size_read(inode);
>  	unsigned block_bits = inode->i_blkbits;
>  	unsigned block_size = (1 << block_bits);
> -	unsigned poff = offset_in_page(*pos);
> -	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
> +	unsigned poff = *pos & (page_size(page) - 1);
> +	unsigned plen = min_t(loff_t, page_size(page) - poff, length);
>  	unsigned first = poff >> block_bits;
>  	unsigned last = (poff + plen - 1) >> block_bits;

This all kinda looks familiar. In my block size > page size patch
set, I was trying to wrap these sorts of things in helpers as they
get repeated over and over again. e.g.:

/*
 * Return the block size we should use for page cache based operations.
 * This will return the inode block size for block size < PAGE_SIZE,
 * otherwise it will return PAGE_SIZE.
 */
static inline unsigned
iomap_chunk_size(struct inode *inode)
{
	return min_t(unsigned, PAGE_SIZE, i_blocksize(inode));
}

"chunk" being the name that Christoph suggested as the size of the
region we need to operate over in this function.

IOWs, if we have a normal page, it's as per the above, but if
we have block size > PAGE_SIZE, it's the block size we need to work
from, and if it's a huge page, it's the huge page size we need to
use....

So starting by wrapping a bunch of these common length/size/offset
calculations will make this code much easier to understand, follow,
maintain as we explode the number of combinations of page and block
size it supports in the near future...
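
For example, a couple of helpers along these lines (just a sketch,
untested, using the names suggested against the hunks below) would hide
most of the open-coded masking:

static inline unsigned
iomap_offset_in_page(struct page *page, loff_t pos)
{
	/* like offset_in_page(), but honours the actual page size */
	return pos & (page_size(page) - 1);
}

static inline unsigned
iomap_blocks_per_page(struct inode *inode, struct page *page)
{
	/* number of fs blocks covered by this (possibly huge) page */
	return page_size(page) >> inode->i_blkbits;
}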

FYI, the blocksize > pagesize patchset was first posted here:

https://lore.kernel.org/linux-fsdevel/20181107063127.3902-1-david@fromorbit.com/

[ Bad memories, this patchset is what led us to discover how
horribly broken copy_file_range and friends were. ]

> @@ -107,7 +111,8 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
>  	 * page cache for blocks that are entirely outside of i_size.
>  	 */
>  	if (orig_pos <= isize && orig_pos + length > isize) {
> -		unsigned end = offset_in_page(isize - 1) >> block_bits;
> +		unsigned end = (isize - 1) & (page_size(page) - 1) >>
> +				block_bits;

iomap_offset_in_page()....

>  
>  		if (first <= end && last > end)
>  			plen -= (last - end) * block_size;
> @@ -128,7 +133,7 @@ iomap_set_range_uptodate(struct page *page, unsigned off, unsigned len)
>  	bool uptodate = true;
>  
>  	if (iop) {
> -		for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
> +		for (i = 0; i < page_size(page) / i_blocksize(inode); i++) {

                                iomap_blocks_per_page()

>  			if (i >= first && i <= last)
>  				set_bit(i, iop->uptodate);
>  			else if (!test_bit(i, iop->uptodate))
> @@ -194,11 +199,12 @@ iomap_read_inline_data(struct inode *inode, struct page *page,
>  		return;
>  
>  	BUG_ON(page->index);
> -	BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
> +	BUG_ON(size > page_size(page) - ((unsigned long)iomap->inline_data &
> +						(page_size(page) - 1)));

Inline data should never use a huge page - it's a total waste of
2MB of memory because inline data is intended for very small data
files that fit inside an inode. If anyone ever needs inline data
larger than PAGE_SIZE then we can worry about how to support that
at that time. Right now it should just refuse to use a huge page...

>  
>  	addr = kmap_atomic(page);
>  	memcpy(addr, iomap->inline_data, size);
> -	memset(addr + size, 0, PAGE_SIZE - size);
> +	memset(addr + size, 0, page_size(page) - size);
>  	kunmap_atomic(addr);
>  	SetPageUptodate(page);
>  }
> @@ -218,11 +224,11 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  	if (iomap->type == IOMAP_INLINE) {
>  		WARN_ON_ONCE(pos);
>  		iomap_read_inline_data(inode, page, iomap);
> -		return PAGE_SIZE;
> +		return page_size(page);
>  	}
>  
>  	/* zero post-eof blocks as the page may be mapped */
> -	iomap_adjust_read_range(inode, iop, &pos, length, &poff, &plen);
> +	iomap_adjust_read_range(inode, page, &pos, length, &poff, &plen);
>  	if (plen == 0)
>  		goto done;
>  
> @@ -258,7 +264,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  
>  	if (!ctx->bio || !is_contig || bio_full(ctx->bio, plen)) {
>  		gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
> -		int nr_vecs = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +		int nr_vecs = (length + page_size(page) - 1) >> page_shift(page);

iomap_nr_pages(page)?

>  
>  		if (ctx->bio)
>  			submit_bio(ctx->bio);
> @@ -293,9 +299,9 @@ iomap_readpage(struct page *page, const struct iomap_ops *ops)
>  	unsigned poff;
>  	loff_t ret;
>  
> -	for (poff = 0; poff < PAGE_SIZE; poff += ret) {
> +	for (poff = 0; poff < page_size(page); poff += ret) {
>  		ret = iomap_apply(inode, page_offset(page) + poff,
> -				PAGE_SIZE - poff, 0, ops, &ctx,
> +				page_size(page) - poff, 0, ops, &ctx,
>  				iomap_readpage_actor);
>  		if (ret <= 0) {
>  			WARN_ON_ONCE(ret == 0);
> @@ -342,7 +348,7 @@ iomap_next_page(struct inode *inode, struct list_head *pages, loff_t pos,
>  		 * readpages call itself as every page gets checked again once
>  		 * actually needed.
>  		 */
> -		*done += PAGE_SIZE;
> +		*done += page_size(page);
>  		put_page(page);
>  	}
>  
> @@ -355,9 +361,14 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
>  {
>  	struct iomap_readpage_ctx *ctx = data;
>  	loff_t done, ret;
> +	size_t pg_left = 0;
> +
> +	if (ctx->cur_page)
> +		pg_left = page_size(ctx->cur_page) -
> +					(pos & (page_size(ctx->cur_page) - 1));

What's this unreadable magic do?

>  
>  	for (done = 0; done < length; done += ret) {
> -		if (ctx->cur_page && offset_in_page(pos + done) == 0) {
> +		if (ctx->cur_page && pg_left == 0) {
>  			if (!ctx->cur_page_in_bio)
>  				unlock_page(ctx->cur_page);
>  			put_page(ctx->cur_page);
> @@ -369,9 +380,11 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
>  			if (!ctx->cur_page)
>  				break;
>  			ctx->cur_page_in_bio = false;
> +			pg_left = page_size(ctx->cur_page);
>  		}
>  		ret = iomap_readpage_actor(inode, pos + done, length - done,
>  				ctx, iomap);
> +		pg_left -= ret;
>  	}
>  
>  	return done;
> @@ -386,8 +399,9 @@ iomap_readpages(struct address_space *mapping, struct list_head *pages,
>  		.is_readahead	= true,
>  	};
>  	loff_t pos = page_offset(list_entry(pages->prev, struct page, lru));
> -	loff_t last = page_offset(list_entry(pages->next, struct page, lru));
> -	loff_t length = last - pos + PAGE_SIZE, ret = 0;
> +	struct page *last_page = list_entry(pages->next, struct page, lru);
> +	loff_t length = page_offset(last_page) - pos + page_size(last_page);

More magic that could do with a helper.

> +	loff_t ret = 0;
>  
>  	while (length > 0) {
>  		ret = iomap_apply(mapping->host, pos, length, 0, ops,
> @@ -435,7 +449,7 @@ iomap_is_partially_uptodate(struct page *page, unsigned long from,
>  	unsigned i;
>  
>  	/* Limit range to one page */
> -	len = min_t(unsigned, PAGE_SIZE - from, count);
> +	len = min_t(unsigned, page_size(page) - from, count);
>  
>  	/* First and last blocks in range within page */
>  	first = from >> inode->i_blkbits;
> @@ -474,7 +488,7 @@ iomap_invalidatepage(struct page *page, unsigned int offset, unsigned int len)
>  	 * If we are invalidating the entire page, clear the dirty state from it
>  	 * and release it to avoid unnecessary buildup of the LRU.
>  	 */
> -	if (offset == 0 && len == PAGE_SIZE) {
> +	if (offset == 0 && len == page_size(page)) {
>  		WARN_ON_ONCE(PageWriteback(page));
>  		cancel_dirty_page(page);
>  		iomap_page_release(page);
> @@ -550,18 +564,20 @@ static int
>  __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
>  		struct page *page, struct iomap *iomap)
>  {
> -	struct iomap_page *iop = iomap_page_create(inode, page);
>  	loff_t block_size = i_blocksize(inode);
>  	loff_t block_start = pos & ~(block_size - 1);
>  	loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
> -	unsigned from = offset_in_page(pos), to = from + len, poff, plen;
> +	unsigned from = pos & (page_size(page) - 1);

iomap_offset_in_page() again :)

> +	unsigned to = from + len;
> +	unsigned poff, plen;
>  	int status = 0;
>  
>  	if (PageUptodate(page))
>  		return 0;
> +	iomap_page_create(inode, page);
>  
>  	do {
> -		iomap_adjust_read_range(inode, iop, &block_start,
> +		iomap_adjust_read_range(inode, page, &block_start,
>  				block_end - block_start, &poff, &plen);
>  		if (plen == 0)
>  			break;
> @@ -673,7 +689,7 @@ __iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
>  	 */
>  	if (unlikely(copied < len && !PageUptodate(page)))
>  		return 0;
> -	iomap_set_range_uptodate(page, offset_in_page(pos), len);
> +	iomap_set_range_uptodate(page, pos & (page_size(page) - 1), len);

ditto.

>  	iomap_set_page_dirty(page);
>  	return copied;
>  }
> @@ -685,7 +701,9 @@ iomap_write_end_inline(struct inode *inode, struct page *page,
>  	void *addr;
>  
>  	WARN_ON_ONCE(!PageUptodate(page));
> -	BUG_ON(pos + copied > PAGE_SIZE - offset_in_page(iomap->inline_data));
> +	BUG_ON(pos + copied > page_size(page) -
> +			((unsigned long)iomap->inline_data &
> +			 (page_size(page) - 1)));

Again, I think we should just avoid huge pages for inline data.

>  
>  	addr = kmap_atomic(page);
>  	memcpy(iomap->inline_data + pos, addr + pos, copied);
> @@ -749,6 +767,10 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		unsigned long bytes;	/* Bytes to write to page */
>  		size_t copied;		/* Bytes copied from user */
>  
> +		/*
> +		 * XXX: We don't know what size page we'll find in the
> +		 * page cache, so only copy up to a regular page boundary.
> +		 */
>  		offset = offset_in_page(pos);
>  		bytes = min_t(unsigned long, PAGE_SIZE - offset,
>  						iov_iter_count(i));
> @@ -1047,11 +1069,11 @@ vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops)
>  		goto out_unlock;
>  	}
>  
> -	/* page is wholly or partially inside EOF */
> -	if (((page->index + 1) << PAGE_SHIFT) > size)
> -		length = offset_in_page(size);
> +	/* page is wholly or partially beyond EOF */
> +	if (((page->index + compound_nr(page)) << PAGE_SHIFT) > size)
> +		length = size & (page_size(page) - 1);
>  	else
> -		length = PAGE_SIZE;
> +		length = page_size(page);

Yeah, that needs some help :)

Basically, I'd love to have all the things that end up being
variable because of block size or page size or a combination of both
moved into helpers. That way we end up with the code that does the
work being clean and easy to maintain, and all the nastiness
inherent to variable size objects is isolated to the helper
functions...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/2] iomap: Support large pages
  2019-07-31 23:03   ` Dave Chinner
@ 2019-08-01  3:59     ` Matthew Wilcox
  2019-08-01 16:21       ` Christoph Hellwig
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2019-08-01  3:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, hch, linux-xfs, linux-mm

On Thu, Aug 01, 2019 at 09:03:15AM +1000, Dave Chinner wrote:
> > -	if (iop || i_blocksize(inode) == PAGE_SIZE)
> > +	if (iop || i_blocksize(inode) == page_size(page))
> >  		return iop;
> >  
> > -	iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
> > +	nbits = BITS_TO_LONGS(page_size(page) / SECTOR_SIZE);
> 
> nbits = BITS_TO_LONGS(page_size(page) / i_blocksize(inode));

Ah, yes, that's better.  When it's statically allocated, you have to assume
512-byte blocks, but when it's dynamically allocated, you can use the
actual inode blocksize.

> > +	iop = kmalloc(struct_size(iop, uptodate, nbits),
> > +			GFP_NOFS | __GFP_NOFAIL);
> >  	atomic_set(&iop->read_count, 0);
> >  	atomic_set(&iop->write_count, 0);
> > -	bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
> > +	bitmap_zero(iop->uptodate, nbits);

Also, I confused myself by using nbits.  And, really, why do all this
initialisation by hand?

@@ -23,17 +23,14 @@ static struct iomap_page *
 iomap_page_create(struct inode *inode, struct page *page)
 {
        struct iomap_page *iop = to_iomap_page(page);
-       unsigned int nbits;
+       unsigned int n;
 
        if (iop || i_blocksize(inode) == page_size(page))
                return iop;
 
-       nbits = BITS_TO_LONGS(page_size(page) / SECTOR_SIZE);
-       iop = kmalloc(struct_size(iop, uptodate, nbits),
-                       GFP_NOFS | __GFP_NOFAIL);
-       atomic_set(&iop->read_count, 0);
-       atomic_set(&iop->write_count, 0);
-       bitmap_zero(iop->uptodate, nbits);
+       n = BITS_TO_LONGS(page_size(page) >> inode->i_blkbits);
+       iop = kmalloc(struct_size(iop, uptodate, n),
+                       GFP_NOFS | __GFP_NOFAIL | __GFP_ZERO);
 
        /*
         * migrate_page_move_mapping() assumes that pages with private data have

> > -	unsigned poff = offset_in_page(*pos);
> > -	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
> > +	unsigned poff = *pos & (page_size(page) - 1);
> > +	unsigned plen = min_t(loff_t, page_size(page) - poff, length);
> >  	unsigned first = poff >> block_bits;
> >  	unsigned last = (poff + plen - 1) >> block_bits;
> 
> This all kinda looks familiar. In my block size > page size patch
> set, I was trying to wrap these sorts of things in helpers as they
> get repeated over and over again. e.g.:
> 
> /*
>  * Return the block size we should use for page cache based operations.
>  * This will return the inode block size for block size < PAGE_SIZE,
>  * otherwise it will return PAGE_SIZE.
>  */
> static inline unsigned
> iomap_chunk_size(struct inode *inode)
> {
> 	return min_t(unsigned, PAGE_SIZE, i_blocksize(inode));
> }
> 
> "chunk" being the name that Christoph suggested as the size of the
> region we need to operate over in this function.
> 
> IOWs, if we have a normal page, it's as per the above, but if
> we have block size > PAGE_SIZE, it's the block size we need to work
> from, and if it's a huge page, it's the huge page size we need to
> use....
> 
> So starting by wrapping a bunch of these common length/size/offset
> calculations will make this code much easier to understand, follow,
> maintain as we explode the number of combinations of page and block
> size it supports in the near future...
> 
> FYI, the blocksize > pagesize patchset was first posted here:
> 
> https://lore.kernel.org/linux-fsdevel/20181107063127.3902-1-david@fromorbit.com/
> 
> [ Bad memories, this patchset is what led us to discover how
> horribly broken copy_file_range and friends were. ]

Thanks.  I'll take a look at that and come back with a refreshed patch
tomorrow that wraps a bunch of these things.

> > -		unsigned end = offset_in_page(isize - 1) >> block_bits;
> > +		unsigned end = (isize - 1) & (page_size(page) - 1) >>
> > +				block_bits;
> 
> iomap_offset_in_page()....

It has applications outside iomap, so I've been thinking about
offset_in_this_page(page, thing).  I don't like it, but page_offset()
is taken and offset_in_page() doesn't take a page parameter.
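
i.e. something like this (untested):

static inline unsigned long
offset_in_this_page(struct page *page, loff_t pos)
{
	return pos & (page_size(page) - 1);
}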

> > @@ -194,11 +199,12 @@ iomap_read_inline_data(struct inode *inode, struct page *page,
> >  		return;
> >  
> >  	BUG_ON(page->index);
> > -	BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
> > +	BUG_ON(size > page_size(page) - ((unsigned long)iomap->inline_data &
> > +						(page_size(page) - 1)));
> 
> Inline data should never use a huge page - it's a total waste of
> 2MB of memory because inline data is intended for very small data
> files that fit inside an inode. If anyone ever needs inline data
> larger than PAGE_SIZE then we can worry about how to support that
> at that time. Right now it should just refuse to use a huge page...

I kind of agree, but ...

This isn't just about supporting huge pages.  It's about supporting
large pages too (and by large pages, I mean arbitrary-order pages,
rather than ones which match a particular CPU's PMD/PGD hierarchy).
We might well decide that we want to switch to always at least trying
to allocate 16kB pages in the page cache, and so we might end up here
with a page larger than the base page size.

And yes, we could say it's the responsibility of that person to do this
work, but it's done now.

> > -		int nr_vecs = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > +		int nr_vecs = (length + page_size(page) - 1) >> page_shift(page);
> 
> iomap_nr_pages(page)?

Do you mean iomap_nr_pages(page, length)?

Actually, I'm not sure this is right.  It assumes the pages all have the same
length, so if we do a call to readpages() which has a 2MB page followed by
a raft of 4kB pages, it'll allocate a BIO with 2 vectors, when it should
really allocate many more.  I think I'll change this one back to operating
on PAGE_SIZE and if we fill in fewer vecs than we allocated, that's fine.
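
That is, go back to something like (untested):

		int nr_vecs = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;

and accept that we may allocate a few more vecs than we end up using
when large pages are present.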

> > @@ -355,9 +361,14 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
> >  {
> >  	struct iomap_readpage_ctx *ctx = data;
> >  	loff_t done, ret;
> > +	size_t pg_left = 0;
> > +
> > +	if (ctx->cur_page)
> > +		pg_left = page_size(ctx->cur_page) -
> > +					(pos & (page_size(ctx->cur_page) - 1));
> 
> What's this unreadable magic do?

Calculates the number of bytes left in this page.
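
A small helper would make it read better; a sketch (hypothetical name,
untested):

static inline size_t
iomap_bytes_left_in_page(struct page *page, loff_t pos)
{
	/* bytes from pos to the end of this (possibly huge) page */
	return page_size(page) - (pos & (page_size(page) - 1));
}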

> > @@ -1047,11 +1069,11 @@ vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops)
> >  		goto out_unlock;
> >  	}
> >  
> > -	/* page is wholly or partially inside EOF */
> > -	if (((page->index + 1) << PAGE_SHIFT) > size)
> > -		length = offset_in_page(size);
> > +	/* page is wholly or partially beyond EOF */
> > +	if (((page->index + compound_nr(page)) << PAGE_SHIFT) > size)
> > +		length = size & (page_size(page) - 1);
> >  	else
> > -		length = PAGE_SIZE;
> > +		length = page_size(page);
> 
> Yeah, that needs some help :)
> 
> Basically, I'd love to have all the things that end up being
> variable because of block size or page size or a combination of both
> moved into helpers. That way we end up with the code that does the
> work being clean and easy to maintain, and all the nastiness
> inherent to variable size objects is isolated to the helper
> functions...

I'm on board with the overall plan; just the details to quibble over.


* Re: [PATCH 2/2] xfs: Support large pages
  2019-07-31 17:17 ` [PATCH 2/2] xfs: " Matthew Wilcox
@ 2019-08-01 16:13   ` Christoph Hellwig
  0 siblings, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2019-08-01 16:13 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, hch, linux-xfs, linux-mm

On Wed, Jul 31, 2019 at 10:17:34AM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Mostly this is just checking the page size of each page instead of
> assuming PAGE_SIZE.  Clean up the logic in writepage a little.
> 
> Based on a patch from Christoph Hellwig.

FYI, all this is pending a move to iomap..


* Re: [PATCH 1/2] iomap: Support large pages
  2019-08-01  3:59     ` Matthew Wilcox
@ 2019-08-01 16:21       ` Christoph Hellwig
  2019-08-01 17:45         ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2019-08-01 16:21 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Dave Chinner, linux-fsdevel, hch, linux-xfs, linux-mm

On Wed, Jul 31, 2019 at 08:59:55PM -0700, Matthew Wilcox wrote:
> -       nbits = BITS_TO_LONGS(page_size(page) / SECTOR_SIZE);
> -       iop = kmalloc(struct_size(iop, uptodate, nbits),
> -                       GFP_NOFS | __GFP_NOFAIL);
> -       atomic_set(&iop->read_count, 0);
> -       atomic_set(&iop->write_count, 0);
> -       bitmap_zero(iop->uptodate, nbits);
> +       n = BITS_TO_LONGS(page_size(page) >> inode->i_blkbits);
> +       iop = kmalloc(struct_size(iop, uptodate, n),
> +                       GFP_NOFS | __GFP_NOFAIL | __GFP_ZERO);

I am really worried about potential very large GFP_NOFS | __GFP_NOFAIL
allocations here.  And thinking about this a bit more while walking
at the beach I wonder if a better option is to just allocate one
iomap per tail page if needed rather than blowing the head page one
up.  We'd still always use the read_count and write_count in the
head page, but the bitmaps in the tail pages, which should be pretty
easily doable.

Note that we'll also need to do another optimization first that I
skipped in the initial iomap writeback path work:  We only really need
an iomap if the blocksize is smaller than the page and there actually
is an extent boundary inside that page.  If a (small or huge) page is
backed by a single extent we can skip the whole iomap thing.  That is at
least for now, because I have a series adding optional t10 protection
information tuples (8 bytes per 512 bytes of data) to the end of
the iomap, which would grow it quite a bit for the PI case, and would
also make allocating the uptodate bits dynamically uglier (but not
impossible).

Note that we'll also need to remove the line that limits the iomap
allocation size in iomap_begin to 1024 times the page size, to get a
better chance at contiguous allocations for huge page faults and
generally avoid pointless roundtrips to the allocator.  It might or
might not be time to revisit that limit in general, not just for huge
pages.


* Re: [PATCH 1/2] iomap: Support large pages
  2019-08-01 16:21       ` Christoph Hellwig
@ 2019-08-01 17:45         ` Matthew Wilcox
  2019-08-02  8:27           ` Christoph Hellwig
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2019-08-01 17:45 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, linux-fsdevel, linux-xfs, linux-mm

On Thu, Aug 01, 2019 at 06:21:47PM +0200, Christoph Hellwig wrote:
> On Wed, Jul 31, 2019 at 08:59:55PM -0700, Matthew Wilcox wrote:
> > -       nbits = BITS_TO_LONGS(page_size(page) / SECTOR_SIZE);
> > -       iop = kmalloc(struct_size(iop, uptodate, nbits),
> > -                       GFP_NOFS | __GFP_NOFAIL);
> > -       atomic_set(&iop->read_count, 0);
> > -       atomic_set(&iop->write_count, 0);
> > -       bitmap_zero(iop->uptodate, nbits);
> > +       n = BITS_TO_LONGS(page_size(page) >> inode->i_blkbits);
> > +       iop = kmalloc(struct_size(iop, uptodate, n),
> > +                       GFP_NOFS | __GFP_NOFAIL | __GFP_ZERO);
> 
> I am really worried about potential very large GFP_NOFS | __GFP_NOFAIL
> allocations here.

I don't think it gets _very_ large here.  Assuming a 4kB block size
filesystem, that's 512 bits (64 bytes, plus 16 bytes for the two counters)
for a 2MB page.  For machines with an 8MB PMD page, it's 272 bytes.
Not a very nice fraction of a page size, so probably rounded up to a 512
byte allocation, but well under the one page that the MM is supposed to
guarantee being able to allocate.
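
Spelling out that arithmetic (4kB blocks assumed, as above):

	2MB page:  512 blocks ->  64 byte bitmap + 16 bytes of counters =  80 bytes
	8MB page: 2048 blocks -> 256 byte bitmap + 16 bytes of counters = 272 bytes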

> And thinking about this a bit more while walking
> at the beach I wonder if a better option is to just allocate one
> iomap per tail page if needed rather than blowing the head page one
> up.  We'd still always use the read_count and write_count in the
> head page, but the bitmaps in the tail pages, which should be pretty
> easily doable.

We wouldn't need to allocate an iomap per tail page, even.  We could
just use one bit of tail-page->private per block.  That'd work except
for 512-byte block size on machines with a 64kB page (128 blocks per
page, which doesn't fit in the 64 bits of ->private).  I doubt many
people expect that combination to work well.

One of my longer-term ambitions is to do away with tail pages under
certain situations; eg partition the memory between allocatable-as-4kB
pages and allocatable-as-2MB pages.  We'd need a different solution for
that, but it's a bit of a pipe dream right now anyway.

> Note that we'll also need to do another optimization first that I
> skipped in the initial iomap writeback path work:  We only really need
> an iomap if the blocksize is smaller than the page and there actually
> is an extent boundary inside that page.  If a (small or huge) page is
> backed by a single extent we can skip the whole iomap thing.  That is at
> least for now, because I have a series adding optional t10 protection
> information tuples (8 bytes per 512 bytes of data) to the end of
> the iomap, which would grow it quite a bit for the PI case, and would
> also make allocating the uptodate bits dynamically uglier (but not
> impossible).
> 
> Note that we'll also need to remove the line that limits the iomap
> allocation size in iomap_begin to 1024 times the page size, to get a
> better chance at contiguous allocations for huge page faults and
> generally avoid pointless roundtrips to the allocator.  It might or
> might not be time to revisit that limit in general, not just for huge
> pages.

I think that's beyond my current understanding of the iomap code ;-)


* Re: [PATCH 1/2] iomap: Support large pages
  2019-08-01 17:45         ` Matthew Wilcox
@ 2019-08-02  8:27           ` Christoph Hellwig
  0 siblings, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2019-08-02  8:27 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-xfs, linux-mm

On Thu, Aug 01, 2019 at 10:45:00AM -0700, Matthew Wilcox wrote:
> On Thu, Aug 01, 2019 at 06:21:47PM +0200, Christoph Hellwig wrote:
> > On Wed, Jul 31, 2019 at 08:59:55PM -0700, Matthew Wilcox wrote:
> > > -       nbits = BITS_TO_LONGS(page_size(page) / SECTOR_SIZE);
> > > -       iop = kmalloc(struct_size(iop, uptodate, nbits),
> > > -                       GFP_NOFS | __GFP_NOFAIL);
> > > -       atomic_set(&iop->read_count, 0);
> > > -       atomic_set(&iop->write_count, 0);
> > > -       bitmap_zero(iop->uptodate, nbits);
> > > +       n = BITS_TO_LONGS(page_size(page) >> inode->i_blkbits);
> > > +       iop = kmalloc(struct_size(iop, uptodate, n),
> > > +                       GFP_NOFS | __GFP_NOFAIL | __GFP_ZERO);
> > 
> > I am really worried about potential very large GFP_NOFS | __GFP_NOFAIL
> > allocations here.
> 
> I don't think it gets _very_ large here.  Assuming a 4kB block size
> filesystem, that's 512 bits (64 bytes, plus 16 bytes for the two counters)
> for a 2MB page.  For machines with an 8MB PMD page, it's 272 bytes.
> Not a very nice fraction of a page size, so probably rounded up to a 512
> byte allocation, but well under the one page that the MM is supposed to
> guarantee being able to allocate.

And if we use GB pages?

Or 512-byte blocks or at least 1k blocks, which we need to handle even
if they are not preferred by any means.  The real issue here is not just
the VM's capability to allocate these by some means, but that we do
__GFP_NOFAIL allocations in nofs context.
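
Rough numbers for those cases (uptodate bitmap alone, 1GB page):

	4kB blocks:  2^18 blocks ->  32kB bitmap
	512B blocks: 2^21 blocks -> 256kB bitmap

and that is the size of GFP_NOFS | __GFP_NOFAIL allocation in question.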

> > And thinking about this a bit more while walking
> > at the beach I wonder if a better option is to just allocate one
> > iomap per tail page if needed rather than blowing the head page one
> > up.  We'd still always use the read_count and write_count in the
> > head page, but the bitmaps in the tail pages, which should be pretty
> > easily doable.
> 
> We wouldn't need to allocate an iomap per tail page, even.  We could
> just use one bit of tail-page->private per block.  That'd work except
> for 512-byte block size on machines with a 64kB page.  I doubt many
> people expect that combination to work well.

We'd still need to deal with the T10 PI tuples for a case like that,
though.

> 
> One of my longer-term ambitions is to do away with tail pages under
> certain situations; eg partition the memory between allocatable-as-4kB
> pages and allocatable-as-2MB pages.  We'd need a different solution for
> that, but it's a bit of a pipe dream right now anyway.

Yes, let's focus on that.  Maybe at some point we'll also get extent
based VM instead of pages ;-)


* Re: [RFC 0/2] iomap & xfs support for large pages
  2019-07-31 17:17 [RFC 0/2] iomap & xfs support for large pages Matthew Wilcox
                   ` (2 preceding siblings ...)
  2019-07-31 17:50 ` [RFC 0/2] iomap & xfs support for " Song Liu
@ 2019-08-02 14:54 ` Christopher Lameter
  3 siblings, 0 replies; 12+ messages in thread
From: Christopher Lameter @ 2019-08-02 14:54 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, hch, linux-xfs, linux-mm

On Wed, 31 Jul 2019, Matthew Wilcox wrote:

> Christoph sent me a patch a few months ago called "XFS THP wip".
> I've redone it based on current linus tree, plus the page_size() /
> compound_nr() / page_shift() patches currently found in -mm.  I fixed
> the logic bugs that I noticed in his patch and may have introduced some
> of my own.  I have only compile tested this code.

Some references to patches from a long time ago; maybe there are
useful tidbits in them ...

Variable page cache just for ramfs:
https://lkml.org/lkml/2007/4/19/261

Large blocksize support for XFS, ReiserFS and ext2
https://www.mail-archive.com/linux-fsdevel@vger.kernel.org/msg08730.html


