linux-xfs.vger.kernel.org archive mirror
* [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support
@ 2018-11-07  6:31 Dave Chinner
  2018-11-07  6:31 ` [PATCH 01/16] xfs: drop ->writepage completely Dave Chinner
                   ` (16 more replies)
  0 siblings, 17 replies; 44+ messages in thread
From: Dave Chinner @ 2018-11-07  6:31 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

Hi folks,

We've had a fair number of problems reported on 64k block size
filesystems of late, but none of the XFS developers have Power or
ARM machines handy to reproduce them or even really test the fixes.

The iomap infrastructure we introduced a while back was designed
with block size > page size support in mind, but we hadn't tried
to implement it.

So after another 64k block size bug report late last week I said to
Darrick "How hard could it be"?

About 6 billion (yes, B) fsx ops later, I have most of the XFS
functionality working on 64k block sizes on x86_64.  Buffered
read/write, mmap read/write and direct IO read/write all work. All
the fallocate() operations work correctly, as does truncate. xfsdump
and xfsrestore are happy with it, as is xfs_repair. xfs_scrub
needed some help, but I've tested Darrick's fixes for that quite a
bit over the past few days.

It passes most of xfstests - there are some test failures I still
have to triage to determine whether they are code bugs or test
problems (i.e. some tests don't deal with 64k block sizes correctly
or assume block size <= page size).

What I haven't tested yet is shared extents - the COW path,
clone_file_range and dedupe_file_range. I discovered earlier today
that fsx doesn't support copy/clone/dedupe_file_range operations,
so before I go any further I need to enhance fsx. Then I need to
fix all the bugs it uncovers on block size <= page size filesystems.
And then I'll move onto adding the rest of the functionality this
patch set requires.

The main addition to the iomap code to support this functionality is
the "zero-around" capability. When the filesystem is mapping a new
block, a delalloc range or an unwritten extent, it sets the
IOMAP_F_ZERO_AROUND flag in the iomap it returns. This tells the
iomap code that it needs to expand the range of the operation being
performed to cover entire blocks. i.e. if the data being written
doesn't span the filesystem block, it needs to instantiate and zero
pages in the page cache to cover the portions of the block the data
doesn't cover.
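The range expansion described above boils down to power-of-two
rounding to block boundaries. Here's a hedged sketch (illustrative
only, not the kernel code; the function name is made up):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch of zero-around range expansion: expand a write
 * of [pos, pos + len) outwards to filesystem block boundaries so that
 * whole blocks are instantiated in the page cache. Assumes blocksize
 * is a power of two.
 */
static void zero_around_range(uint64_t pos, uint64_t len,
			      uint64_t blocksize,
			      uint64_t *start, uint64_t *end)
{
	*start = pos & ~(blocksize - 1);	/* round down to block start */
	*end = (pos + len + blocksize - 1) & ~(blocksize - 1); /* round up */
}
```

So a 100 byte write at offset 70000 on a 64k block size filesystem
expands to the range [65536, 131072).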

Because the page cache may already hold data for the regions (e.g.
read over a hole/unwritten extent) the zero-around code does not
zero pages that are already marked up to date. It is assumed that
whatever put those pages into the page cache has already done the
right thing with them.

Yes, this means the unit of page cache IO is still individual pages.
There are no mm/ changes at all, no page cache changes, nothing.
That all still just operates on individual pages and is oblivious to
the fact that the filesystem/iomap code is now processing gangs of
pages at a time instead of just one.

Actually, I stretch the truth there a little - there is one change
to XFS that is important to note here. I removed ->writepage from
XFS (patches 1 and 2). We can't use it for large block sizes because
we want to write whole blocks at a time if they are newly allocated
or unwritten. And really, it's just a nasty hack that gets in the
way of background writeback doing an efficient job of cleaning dirty
pages. So I killed it.

We also need to expose the large block size to stat(2). If we
don't, applications that use stat.st_blksize for operations that
require block size alignment (e.g. various fallocate ops) fail
because 4k is not the block size of a 64k block size filesystem.
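For illustration, this is what an application-side query looks like
(a sketch using the standard stat(2) interface; the helper name is
made up):

```c
#include <assert.h>
#include <sys/stat.h>

/*
 * Sketch: read the preferred I/O block size an application would use
 * for alignment decisions. On a 64k block size filesystem this needs
 * to report 65536, not the 4k page size, or block-aligned fallocate()
 * requests computed from it will be rejected.
 */
static long preferred_blksize(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;
	return (long)st.st_blksize;
}
```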

A number of latent bugs in existing code were uncovered as I worked
through this - patches 3-5 fix bugs in XFS and iomap that can be
triggered on existing systems but it's somewhat hard to expose them.

Patches 6-12 introduce all the iomap infrastructure needed to
support block size > page size.

Patches 13-16 introduce the necessary functionality to trigger the
iomap infrastructure, tell userspace the right thing, make sub-block
fsync ranges do the right thing and finally remove the code that
prevents large block size filesystems from mounting on small page
size machines.

It works, it seems pretty robust and runs enough of fstests that
I've already used it to find, fix and test a 64k block size bug in
XFS:

https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/commit/?h=for-next&id=837514f7a4ca4aca06aec5caa5ff56d33ef06976

I think this is the last of the XFS Irix features we haven't
implemented in Linux XFS - it's only taken us 20 years to
get here, but the end of the tunnel is in sight.

Nah, it's probably a train. Or maybe a flame. :)

Anyway, I'm interested to see what people think of the approach.

Cheers,

Dave.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 01/16] xfs: drop ->writepage completely
  2018-11-07  6:31 [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Dave Chinner
@ 2018-11-07  6:31 ` Dave Chinner
  2018-11-09 15:12   ` Christoph Hellwig
  2018-11-07  6:31 ` [PATCH 02/16] xfs: move writepage context warnings to writepages Dave Chinner
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2018-11-07  6:31 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

->writepage is only used in one place - single page writeback from
memory reclaim. We only allow such writeback from kswapd, not from
direct memory reclaim, and so it is rarely used. When it comes from
kswapd, it is effectively random dirty page shoot-down, which is
horrible for IO patterns. We will already have background writeback
trying to clean all the dirty pages in memory as efficiently as
possible, so having kswapd interrupt our well formed IO stream only
slows things down. So get rid of xfs_vm_writepage() completely.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c | 17 -----------------
 1 file changed, 17 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 338b9d9984e0..0e1342e359ca 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -911,22 +911,6 @@ xfs_do_writepage(
 	return 0;
 }
 
-STATIC int
-xfs_vm_writepage(
-	struct page		*page,
-	struct writeback_control *wbc)
-{
-	struct xfs_writepage_ctx wpc = {
-		.io_type = XFS_IO_HOLE,
-	};
-	int			ret;
-
-	ret = xfs_do_writepage(page, wbc, &wpc);
-	if (wpc.ioend)
-		ret = xfs_submit_ioend(wbc, wpc.ioend, ret);
-	return ret;
-}
-
 STATIC int
 xfs_vm_writepages(
 	struct address_space	*mapping,
@@ -1019,7 +1003,6 @@ xfs_iomap_swapfile_activate(
 const struct address_space_operations xfs_address_space_operations = {
 	.readpage		= xfs_vm_readpage,
 	.readpages		= xfs_vm_readpages,
-	.writepage		= xfs_vm_writepage,
 	.writepages		= xfs_vm_writepages,
 	.set_page_dirty		= iomap_set_page_dirty,
 	.releasepage		= xfs_vm_releasepage,
-- 
2.19.1


* [PATCH 02/16] xfs: move writepage context warnings to writepages
  2018-11-07  6:31 [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Dave Chinner
  2018-11-07  6:31 ` [PATCH 01/16] xfs: drop ->writepage completely Dave Chinner
@ 2018-11-07  6:31 ` Dave Chinner
  2018-11-07  6:31 ` [PATCH 03/16] xfs: finobt AG reserves don't consider last AG can be a runt Dave Chinner
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 44+ messages in thread
From: Dave Chinner @ 2018-11-07  6:31 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

Now that we don't have ->writepage anymore, we don't need reclaim
context checks on every call to xfs_do_writepage() - we only need
them in xfs_vm_writepages() so they are checked once per writeback
context.
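The check being hoisted is a simple flag mask test. A standalone
sketch (the PF_* values below are reproduced from the current kernel
headers for illustration; the function name is made up):

```c
#include <assert.h>

/*
 * Sketch of the reclaim-context check this patch hoists into
 * xfs_vm_writepages(): refuse writeback when the task is in memalloc
 * context (PF_MEMALLOC set) unless it is kswapd (PF_KSWAPD also
 * set), where stack usage is known to be low.
 */
#define PF_MEMALLOC	0x00000800
#define PF_KSWAPD	0x00020000

static int refuse_reclaim_writeback(unsigned int task_flags)
{
	return (task_flags & (PF_MEMALLOC | PF_KSWAPD)) == PF_MEMALLOC;
}
```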

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c | 42 +++++++++++++++++++++---------------------
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 0e1342e359ca..95e67f595f98 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -816,27 +816,6 @@ xfs_do_writepage(
 
 	trace_xfs_writepage(inode, page, 0, 0);
 
-	/*
-	 * Refuse to write the page out if we are called from reclaim context.
-	 *
-	 * This avoids stack overflows when called from deeply used stacks in
-	 * random callers for direct reclaim or memcg reclaim.  We explicitly
-	 * allow reclaim from kswapd as the stack usage there is relatively low.
-	 *
-	 * This should never happen except in the case of a VM regression so
-	 * warn about it.
-	 */
-	if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
-			PF_MEMALLOC))
-		goto redirty;
-
-	/*
-	 * Given that we do not allow direct reclaim to call us, we should
-	 * never be called while in a filesystem transaction.
-	 */
-	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
-		goto redirty;
-
 	/*
 	 * Is this page beyond the end of the file?
 	 *
@@ -921,6 +900,27 @@ xfs_vm_writepages(
 	};
 	int			ret;
 
+	/*
+	 * Refuse to write pages out if we are called from reclaim context.
+	 *
+	 * This avoids stack overflows when called from deeply used stacks in
+	 * random callers for direct reclaim or memcg reclaim.  We explicitly
+	 * allow reclaim from kswapd as the stack usage there is relatively low.
+	 *
+	 * This should never happen except in the case of a VM regression so
+	 * warn about it.
+	 */
+	if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
+			PF_MEMALLOC))
+		return 0;
+
+	/*
+	 * Given that we do not allow direct reclaim to call us, we should
+	 * never be called while in a filesystem transaction.
+	 */
+	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
+		return 0;
+
 	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
 	ret = write_cache_pages(mapping, wbc, xfs_do_writepage, &wpc);
 	if (wpc.ioend)
-- 
2.19.1


* [PATCH 03/16] xfs: finobt AG reserves don't consider last AG can be a runt
  2018-11-07  6:31 [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Dave Chinner
  2018-11-07  6:31 ` [PATCH 01/16] xfs: drop ->writepage completely Dave Chinner
  2018-11-07  6:31 ` [PATCH 02/16] xfs: move writepage context warnings to writepages Dave Chinner
@ 2018-11-07  6:31 ` Dave Chinner
  2018-11-07 16:55   ` Darrick J. Wong
  2018-11-07  6:31 ` [PATCH 04/16] xfs: extent shifting doesn't fully invalidate page cache Dave Chinner
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2018-11-07  6:31 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

The last AG may be very small compared to all the other AGs, and
hence AG reservations based on the superblock AG size may actually
consume more space than the AG actually has. This results in assert
failures like:

XFS: Assertion failed: xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved + xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <= pag->pagf_freeblks + pag->pagf_flcount, file: fs/xfs/libxfs/xfs_ag_resv.c, line: 319
[   48.932891]  xfs_ag_resv_init+0x1bd/0x1d0
[   48.933853]  xfs_fs_reserve_ag_blocks+0x37/0xb0
[   48.934939]  xfs_mountfs+0x5b3/0x920
[   48.935804]  xfs_fs_fill_super+0x462/0x640
[   48.936784]  ? xfs_test_remount_options+0x60/0x60
[   48.937908]  mount_bdev+0x178/0x1b0
[   48.938751]  mount_fs+0x36/0x170
[   48.939533]  vfs_kern_mount.part.43+0x54/0x130
[   48.940596]  do_mount+0x20e/0xcb0
[   48.941396]  ? memdup_user+0x3e/0x70
[   48.942249]  ksys_mount+0xba/0xd0
[   48.943046]  __x64_sys_mount+0x21/0x30
[   48.943953]  do_syscall_64+0x54/0x170
[   48.944835]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Hence we need to ensure the finobt per-ag space reservations take
into account the size of the last AG rather than treat it like all
the other full size AGs.

Note that both refcountbt and rmapbt already take the size of the AG
into account via reading the AGF length directly.
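The sizing rule this patch applies can be sketched as follows
(illustrative only; names are made up and the even-division fallback
is made explicit here rather than relying on mkfs guarantees):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of runt AG sizing: every AG is agblocks long except possibly
 * the last, whose length is the remainder of dblocks / agblocks. When
 * the device divides evenly, the last AG is full sized.
 */
static uint32_t ag_length(uint64_t dblocks, uint32_t agblocks,
			  uint32_t agcount, uint32_t agno)
{
	uint32_t rem = dblocks % agblocks;

	if (agno == agcount - 1 && rem != 0)
		return rem;		/* last AG is a runt */
	return agblocks;
}
```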

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc_btree.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 86c50208a143..62014780d5e4 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -538,15 +538,24 @@ xfs_inobt_rec_check_count(
 
 static xfs_extlen_t
 xfs_inobt_max_size(
-	struct xfs_mount	*mp)
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno)
 {
+	uint32_t		agblocks = mp->m_sb.sb_agblocks;
+
 	/* Bail out if we're uninitialized, which can happen in mkfs. */
 	if (mp->m_inobt_mxr[0] == 0)
 		return 0;
 
+	/* last AG can be a runt */
+	if (agno == mp->m_sb.sb_agcount - 1) {
+		div_u64_rem(mp->m_sb.sb_dblocks, mp->m_sb.sb_agblocks,
+				&agblocks);
+	}
+
 	return xfs_btree_calc_size(mp->m_inobt_mnr,
-		(uint64_t)mp->m_sb.sb_agblocks * mp->m_sb.sb_inopblock /
-				XFS_INODES_PER_CHUNK);
+				(uint64_t)agblocks * mp->m_sb.sb_inopblock /
+					XFS_INODES_PER_CHUNK);
 }
 
 static int
@@ -594,7 +603,7 @@ xfs_finobt_calc_reserves(
 	if (error)
 		return error;
 
-	*ask += xfs_inobt_max_size(mp);
+	*ask += xfs_inobt_max_size(mp, agno);
 	*used += tree_len;
 	return 0;
 }
-- 
2.19.1


* [PATCH 04/16] xfs: extent shifting doesn't fully invalidate page cache
  2018-11-07  6:31 [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Dave Chinner
                   ` (2 preceding siblings ...)
  2018-11-07  6:31 ` [PATCH 03/16] xfs: finobt AG reserves don't consider last AG can be a runt Dave Chinner
@ 2018-11-07  6:31 ` Dave Chinner
  2018-11-07  6:31 ` [PATCH 05/16] iomap: sub-block dio needs to zeroout beyond EOF Dave Chinner
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 44+ messages in thread
From: Dave Chinner @ 2018-11-07  6:31 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

The extent shifting code uses a flush and invalidate mechanism
prior to shifting extents around. This is similar to what
xfs_free_file_space() does, but it doesn't take into account things
like page cache vs block size differences, and it will fail if there
is a page that is currently busy.

xfs_flush_unmap_range() handles all of these cases, so just convert
xfs_prepare_shift() to use that mechanism rather than having its own
special sauce.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap_util.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 5d263dfdb3bc..167ff4297e5c 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1195,13 +1195,7 @@ xfs_prepare_shift(
 	 * Writeback and invalidate cache for the remainder of the file as we're
 	 * about to shift down every extent from offset to EOF.
 	 */
-	error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping, offset, -1);
-	if (error)
-		return error;
-	error = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
-					offset >> PAGE_SHIFT, -1);
-	if (error)
-		return error;
+	error = xfs_flush_unmap_range(ip, offset, XFS_ISIZE(ip));
 
 	/*
 	 * Clean out anything hanging around in the cow fork now that
-- 
2.19.1


* [PATCH 05/16] iomap: sub-block dio needs to zeroout beyond EOF
  2018-11-07  6:31 [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Dave Chinner
                   ` (3 preceding siblings ...)
  2018-11-07  6:31 ` [PATCH 04/16] xfs: extent shifting doesn't fully invalidate page cache Dave Chinner
@ 2018-11-07  6:31 ` Dave Chinner
  2018-11-09 15:15   ` Christoph Hellwig
  2018-11-07  6:31 ` [PATCH 06/16] iomap: support block size > page size for direct IO Dave Chinner
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2018-11-07  6:31 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

If we are doing sub-block dio that extends EOF, we need to zero
the unused tail of the block to initialise the data in it. If we
do not zero the tail of the block, then an immediate mmap read of
the EOF block will expose stale data beyond EOF to userspace. Found
with fsx running sub-block DIO sizes vs MAPREAD/MAPWRITE operations.

Fix this by detecting if the end of the DIO write is beyond EOF
and zeroing the tail if necessary.
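The pad calculation involved is the same one the existing code uses
for sub-block zeroing. A hedged sketch (not the patch itself; the
helper name is made up):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the tail zeroing decision: for a direct write ending at
 * "pos" that extends EOF, how many bytes must be zeroed to pad the
 * final partial block out to a block boundary. Zero means the write
 * already ended block aligned.
 */
static uint32_t dio_tail_pad(uint64_t pos, uint32_t fs_block_size)
{
	uint32_t pad = pos & (fs_block_size - 1);

	return pad ? fs_block_size - pad : 0;
}
```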

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 64ce240217a1..16d16596b00f 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1676,7 +1676,14 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		dio->submit.cookie = submit_bio(bio);
 	} while (nr_pages);
 
-	if (need_zeroout) {
+	/*
+	 * We need to zeroout the tail of a sub-block write if the extent type
+	 * requires zeroing or the write extends beyond EOF. If we don't zero
+	 * the block tail in the latter case, we can expose stale data via mmap
+	 * reads of the EOF block.
+	 */
+	if (need_zeroout ||
+	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
 		/* zero out from the end of the write to the end of the block */
 		pad = pos & (fs_block_size - 1);
 		if (pad)
-- 
2.19.1


* [PATCH 06/16] iomap: support block size > page size for direct IO
  2018-11-07  6:31 [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Dave Chinner
                   ` (4 preceding siblings ...)
  2018-11-07  6:31 ` [PATCH 05/16] iomap: sub-block dio needs to zeroout beyond EOF Dave Chinner
@ 2018-11-07  6:31 ` Dave Chinner
  2018-11-08 11:28   ` Nikolay Borisov
  2018-11-09 15:18   ` Christoph Hellwig
  2018-11-07  6:31 ` [PATCH 07/16] iomap: prepare buffered IO paths for block size > page size Dave Chinner
                   ` (10 subsequent siblings)
  16 siblings, 2 replies; 44+ messages in thread
From: Dave Chinner @ 2018-11-07  6:31 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

iomap_dio_rw() has all the infrastructure in place to support block
size > page size filesystems because it is essentially just
sub-block DIO. It needs help, however, with the sub-block zeroing
code (which needs multi-page IOs) and with page cache invalidation
over the block being written.
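The multi-page zeroing in the patch sizes its bio as
ceil(len / PAGE_SIZE). This mirrors the howmany() macro the patch
adds (reproduced here standalone for illustration):

```c
#include <assert.h>

/*
 * Ceiling division: number of PAGE_SIZE pages needed to cover len
 * bytes of zeroing. Equivalent to the howmany() macro in the patch.
 */
#define HOWMANY(x, y)	(((x) + ((y) - 1)) / (y))
```

So zeroing 64k with 4k pages needs 16 pages, which is why the patch
warns when more than 16 pages would be required.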

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap.c | 65 ++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 49 insertions(+), 16 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 16d16596b00f..8878b1f1f9c7 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1548,21 +1548,34 @@ static void iomap_dio_bio_end_io(struct bio *bio)
 	}
 }
 
+/*
+ * With zeroing for block size larger than page size, the zeroing length can
+ * span multiple pages.
+ */
+#define howmany(x, y) (((x)+((y)-1))/(y))
 static blk_qc_t
 iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 		unsigned len)
 {
 	struct page *page = ZERO_PAGE(0);
 	struct bio *bio;
+	int npages = howmany(len, PAGE_SIZE);
+
+	WARN_ON_ONCE(npages > 16);
 
-	bio = bio_alloc(GFP_KERNEL, 1);
+	bio = bio_alloc(GFP_KERNEL, npages);
 	bio_set_dev(bio, iomap->bdev);
 	bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	get_page(page);
-	__bio_add_page(bio, page, len, 0);
+	while (npages-- > 0) {
+		unsigned plen = min_t(unsigned, PAGE_SIZE, len);
+		get_page(page);
+		__bio_add_page(bio, page, plen, 0);
+		len -= plen;
+	}
+	WARN_ON(len != 0);
 	bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | REQ_IDLE);
 
 	atomic_inc(&dio->ref);
@@ -1752,6 +1765,38 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 	}
 }
 
+/*
+ * This is lifted almost straight from xfs_flush_unmap_range(). Need a generic
+ * version of the block size rounding for these purposes.
+ */
+static int
+iomap_flush_unmap_range(struct file *f, loff_t offset, loff_t len)
+{
+	struct inode *inode = file_inode(f);
+	loff_t rounding, start, end;
+	int ret;
+
+	rounding = max_t(loff_t, i_blocksize(inode), PAGE_SIZE);
+	start = round_down(offset, rounding);
+	end = round_up(offset + len, rounding) - 1;
+
+	ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
+	if (ret)
+		return ret;
+
+	/*
+	 * Try to invalidate cache pages for the range we're direct
+	 * writing.  If this invalidation fails, tough, the write will
+	 * still work, but racing two incompatible write paths is a
+	 * pretty crazy thing to do, so we don't support it 100%.
+	 */
+	ret = invalidate_inode_pages2_range(inode->i_mapping,
+			start >> PAGE_SHIFT, end >> PAGE_SHIFT);
+	if (ret)
+		dio_warn_stale_pagecache(f);
+	return 0;
+}
+
 /*
  * iomap_dio_rw() always completes O_[D]SYNC writes regardless of whether the IO
  * is being issued as AIO or not.  This allows us to optimise pure data writes
@@ -1829,22 +1874,10 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		flags |= IOMAP_NOWAIT;
 	}
 
-	ret = filemap_write_and_wait_range(mapping, start, end);
+	ret = iomap_flush_unmap_range(iocb->ki_filp, start, end);
 	if (ret)
 		goto out_free_dio;
 
-	/*
-	 * Try to invalidate cache pages for the range we're direct
-	 * writing.  If this invalidation fails, tough, the write will
-	 * still work, but racing two incompatible write paths is a
-	 * pretty crazy thing to do, so we don't support it 100%.
-	 */
-	ret = invalidate_inode_pages2_range(mapping,
-			start >> PAGE_SHIFT, end >> PAGE_SHIFT);
-	if (ret)
-		dio_warn_stale_pagecache(iocb->ki_filp);
-	ret = 0;
-
 	if (iov_iter_rw(iter) == WRITE && !dio->wait_for_completion &&
 	    !inode->i_sb->s_dio_done_wq) {
 		ret = sb_init_dio_done_wq(inode->i_sb);
-- 
2.19.1


* [PATCH 07/16] iomap: prepare buffered IO paths for block size > page size
  2018-11-07  6:31 [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Dave Chinner
                   ` (5 preceding siblings ...)
  2018-11-07  6:31 ` [PATCH 06/16] iomap: support block size > page size for direct IO Dave Chinner
@ 2018-11-07  6:31 ` Dave Chinner
  2018-11-09 15:19   ` Christoph Hellwig
  2018-11-07  6:31 ` [PATCH 08/16] iomap: move iomap_zero_range and friends Dave Chinner
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2018-11-07  6:31 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

With bs > ps, we have the situation where the page is no longer the
master of IO granularity. When doing IO to a page, we can't assume
there is a 1:N relationship between the page and the number of
blocks it covers. We actually have an N:1 relationship instead, and
hence many of the assumptions about how the page cache interacts
with blocks need rethinking.

We can, however, still do page granularity IO - it's just that
certain operations need to be done at block size granularity and
hence iterate multiple pages outside the actual page IO itself.
Before we get there, however, we need to make sure that the
functions that iterate blocks in pages do the right thing with
bs > ps.

Hence we need to change all the iop detection cases to detect block
size > page size rather than bs == ps, otherwise we will incorrectly
try to use the bs < ps code paths.

We also need to change all the code that uses the block size for
iterating pages or finding offsets into pages to use PAGE_SIZE when
bs > ps, otherwise we calculate byte counts and offsets into the
page we are operating on incorrectly. Because we are operating on
single pages, the size we need to use is no longer bs, but ps.

This also spills over into the XFS writepage code that interacts
directly with iomap page structures for block size < page size
writeback.
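The core of this change is the min(block size, PAGE_SIZE) selection
the iomap_pageblock_size() helper performs. A standalone sketch
(parameterised on page size purely for illustration):

```c
#include <assert.h>

/*
 * Sketch of the iomap_pageblock_size() idea: page cache based
 * operations work on min(block size, PAGE_SIZE), so with bs > ps the
 * page becomes the unit of work and the bs < ps code paths are not
 * incorrectly taken.
 */
static unsigned pageblock_size(unsigned blocksize, unsigned page_size)
{
	return blocksize < page_size ? blocksize : page_size;
}
```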

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap.c            | 34 +++++++++++++++++++++++++---------
 fs/xfs/xfs_aops.c     | 13 +++++++------
 include/linux/iomap.h | 12 ++++++++++++
 3 files changed, 44 insertions(+), 15 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 8878b1f1f9c7..dd6aa6b403a6 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -109,7 +109,7 @@ iomap_page_create(struct inode *inode, struct page *page)
 {
 	struct iomap_page *iop = to_iomap_page(page);
 
-	if (iop || i_blocksize(inode) == PAGE_SIZE)
+	if (iop || i_blocksize(inode) >= PAGE_SIZE)
 		return iop;
 
 	iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
@@ -137,12 +137,15 @@ iomap_page_release(struct page *page)
 
 /*
  * Calculate the range inside the page that we actually need to read.
+ *
+ * For block size > page size, we need to use PAGE_SHIFT rather than
+ * inode->i_blkbits to calculate the offset into the page we are reading.
  */
 static void
 iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
 		loff_t *pos, loff_t length, unsigned *offp, unsigned *lenp)
 {
-	unsigned block_bits = inode->i_blkbits;
+	unsigned block_bits = min_t(unsigned, inode->i_blkbits, PAGE_SHIFT);
 	unsigned block_size = (1 << block_bits);
 	unsigned poff = offset_in_page(*pos);
 	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
@@ -194,13 +197,14 @@ static void
 iomap_set_range_uptodate(struct page *page, unsigned off, unsigned len)
 {
 	struct iomap_page *iop = to_iomap_page(page);
-	struct inode *inode = page->mapping->host;
-	unsigned first = off >> inode->i_blkbits;
-	unsigned last = (off + len - 1) >> inode->i_blkbits;
-	unsigned int i;
 	bool uptodate = true;
 
 	if (iop) {
+		struct inode *inode = page->mapping->host;
+		unsigned first = off >> inode->i_blkbits;
+		unsigned last = (off + len - 1) >> inode->i_blkbits;
+		unsigned int i;
+
 		for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
 			if (i >= first && i <= last)
 				set_bit(i, iop->uptodate);
@@ -600,12 +604,17 @@ iomap_read_page_sync(struct inode *inode, loff_t block_start, struct page *page,
 	return submit_bio_wait(&bio);
 }
 
+/*
+ * If block size > PAGE_SIZE, we're actually working on offsets into
+ * the page, not offsets into the block. IOWs, it looks exactly like
+ * the block size == PAGE_SIZE.
+ */
 static int
 __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
 		struct page *page, struct iomap *iomap)
 {
 	struct iomap_page *iop = iomap_page_create(inode, page);
-	loff_t block_size = i_blocksize(inode);
+	loff_t block_size = iomap_pageblock_size(inode);
 	loff_t block_start = pos & ~(block_size - 1);
 	loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
 	unsigned from = offset_in_page(pos), to = from + len, poff, plen;
@@ -1021,11 +1030,18 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 }
 EXPORT_SYMBOL_GPL(iomap_zero_range);
 
+/*
+ * Zero the truncated tail of the page provided.
+ *
+ * For block size > page size, we zero just the tail of the page as that is all
+ * we need for mmap() to see correctly zeroed space. File extension
+ * will need to zero the remainder of the block that we haven't zeroed here.
+ */
 int
 iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
 		const struct iomap_ops *ops)
 {
-	unsigned int blocksize = i_blocksize(inode);
+	unsigned int blocksize = iomap_pageblock_size(inode);
 	unsigned int off = pos & (blocksize - 1);
 
 	/* Block boundary? Nothing to do */
@@ -1209,7 +1225,7 @@ page_seek_hole_data(struct inode *inode, struct page *page, loff_t *lastoff,
 		int whence)
 {
 	const struct address_space_operations *ops = inode->i_mapping->a_ops;
-	unsigned int bsize = i_blocksize(inode), off;
+	unsigned int bsize = iomap_pageblock_size(inode), off;
 	bool seek_data = whence == SEEK_DATA;
 	loff_t poff = page_offset(page);
 
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 95e67f595f98..f6ef9e0a7312 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -72,7 +72,7 @@ xfs_finish_page_writeback(
 		mapping_set_error(inode->i_mapping, -EIO);
 	}
 
-	ASSERT(iop || i_blocksize(inode) == PAGE_SIZE);
+	ASSERT(iop || i_blocksize(inode) >= PAGE_SIZE);
 	ASSERT(!iop || atomic_read(&iop->write_count) > 0);
 
 	if (!iop || atomic_dec_and_test(&iop->write_count))
@@ -599,7 +599,7 @@ xfs_add_to_ioend(
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
 	struct block_device	*bdev = xfs_find_bdev_for_inode(inode);
-	unsigned		len = i_blocksize(inode);
+	unsigned		len = iomap_pageblock_size(inode);
 	unsigned		poff = offset & (PAGE_SIZE - 1);
 	sector_t		sector;
 
@@ -666,7 +666,7 @@ xfs_aops_discard_page(
 			page, ip->i_ino, offset);
 
 	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
-			PAGE_SIZE / i_blocksize(inode));
+			PAGE_SIZE / iomap_pageblock_size(inode));
 	if (error && !XFS_FORCED_SHUTDOWN(mp))
 		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
 out_invalidate:
@@ -699,12 +699,13 @@ xfs_writepage_map(
 {
 	LIST_HEAD(submit_list);
 	struct iomap_page	*iop = to_iomap_page(page);
-	unsigned		len = i_blocksize(inode);
+	unsigned		len = iomap_pageblock_size(inode);
+	unsigned		blks_per_page = PAGE_SIZE / len;
 	struct xfs_ioend	*ioend, *next;
 	uint64_t		file_offset;	/* file offset of page */
 	int			error = 0, count = 0, i;
 
-	ASSERT(iop || i_blocksize(inode) == PAGE_SIZE);
+	ASSERT(iop || len == PAGE_SIZE);
 	ASSERT(!iop || atomic_read(&iop->write_count) == 0);
 
 	/*
@@ -713,7 +714,7 @@ xfs_writepage_map(
 	 * one.
 	 */
 	for (i = 0, file_offset = page_offset(page);
-	     i < (PAGE_SIZE >> inode->i_blkbits) && file_offset < end_offset;
+	     i < blks_per_page && file_offset < end_offset;
 	     i++, file_offset += len) {
 		if (iop && !test_bit(i, iop->uptodate))
 			continue;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 9a4258154b25..671c0c387450 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -119,6 +119,18 @@ static inline struct iomap_page *to_iomap_page(struct page *page)
 	return NULL;
 }
 
+/*
+ * Return the block size we should use for page cache based operations.
+ * This will return the inode block size for block size < PAGE_SIZE,
+ * otherwise it will return PAGE_SIZE.
+ */
+static inline unsigned
+iomap_pageblock_size(struct inode *inode)
+{
+	return min_t(unsigned, PAGE_SIZE, i_blocksize(inode));
+}
+
+
 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
 		const struct iomap_ops *ops);
 int iomap_readpage(struct page *page, const struct iomap_ops *ops);
-- 
2.19.1


* [PATCH 08/16] iomap: move iomap_zero_range and friends
  2018-11-07  6:31 [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Dave Chinner
                   ` (6 preceding siblings ...)
  2018-11-07  6:31 ` [PATCH 07/16] iomap: prepare buffered IO paths for block size > page size Dave Chinner
@ 2018-11-07  6:31 ` Dave Chinner
  2018-11-09 15:19   ` Christoph Hellwig
  2018-11-07  6:31 ` [PATCH 09/16] iomap: introduce zero-around functionality Dave Chinner
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2018-11-07  6:31 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

We're going to need access to zeroing from within the iomap_write
and iomap_page_mkwrite paths for block size > page size support,
so move it higher up in the file.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap.c | 158 ++++++++++++++++++++++++++---------------------------
 1 file changed, 79 insertions(+), 79 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index dd6aa6b403a6..e417a5911239 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -769,6 +769,85 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
 	return ret;
 }
 
+static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
+		unsigned bytes, struct iomap *iomap)
+{
+	struct page *page;
+	int status;
+
+	status = iomap_write_begin(inode, pos, bytes, AOP_FLAG_NOFS, &page,
+				   iomap);
+	if (status)
+		return status;
+
+	zero_user(page, offset, bytes);
+	mark_page_accessed(page);
+
+	return iomap_write_end(inode, pos, bytes, bytes, page, iomap);
+}
+
+static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
+		struct iomap *iomap)
+{
+	return __dax_zero_page_range(iomap->bdev, iomap->dax_dev,
+			iomap_sector(iomap, pos & PAGE_MASK), offset, bytes);
+}
+
+static loff_t
+iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
+		void *data, struct iomap *iomap)
+{
+	bool *did_zero = data;
+	loff_t written = 0;
+	int status;
+
+	/* already zeroed?  we're done. */
+	if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_UNWRITTEN)
+		return count;
+
+	do {
+		unsigned offset, bytes;
+
+		offset = offset_in_page(pos);
+		bytes = min_t(loff_t, PAGE_SIZE - offset, count);
+
+		if (IS_DAX(inode))
+			status = iomap_dax_zero(pos, offset, bytes, iomap);
+		else
+			status = iomap_zero(inode, pos, offset, bytes, iomap);
+		if (status < 0)
+			return status;
+
+		pos += bytes;
+		count -= bytes;
+		written += bytes;
+		if (did_zero)
+			*did_zero = true;
+	} while (count > 0);
+
+	return written;
+}
+
+int
+iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
+		const struct iomap_ops *ops)
+{
+	loff_t ret;
+
+	while (len > 0) {
+		ret = iomap_apply(inode, pos, len, IOMAP_ZERO,
+				ops, did_zero, iomap_zero_range_actor);
+		if (ret <= 0)
+			return ret;
+
+		pos += ret;
+		len -= ret;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(iomap_zero_range);
+
 static loff_t
 iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap)
@@ -951,85 +1030,6 @@ iomap_file_dirty(struct inode *inode, loff_t pos, loff_t len,
 }
 EXPORT_SYMBOL_GPL(iomap_file_dirty);
 
-static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
-		unsigned bytes, struct iomap *iomap)
-{
-	struct page *page;
-	int status;
-
-	status = iomap_write_begin(inode, pos, bytes, AOP_FLAG_NOFS, &page,
-				   iomap);
-	if (status)
-		return status;
-
-	zero_user(page, offset, bytes);
-	mark_page_accessed(page);
-
-	return iomap_write_end(inode, pos, bytes, bytes, page, iomap);
-}
-
-static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
-		struct iomap *iomap)
-{
-	return __dax_zero_page_range(iomap->bdev, iomap->dax_dev,
-			iomap_sector(iomap, pos & PAGE_MASK), offset, bytes);
-}
-
-static loff_t
-iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
-		void *data, struct iomap *iomap)
-{
-	bool *did_zero = data;
-	loff_t written = 0;
-	int status;
-
-	/* already zeroed?  we're done. */
-	if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_UNWRITTEN)
-	    	return count;
-
-	do {
-		unsigned offset, bytes;
-
-		offset = offset_in_page(pos);
-		bytes = min_t(loff_t, PAGE_SIZE - offset, count);
-
-		if (IS_DAX(inode))
-			status = iomap_dax_zero(pos, offset, bytes, iomap);
-		else
-			status = iomap_zero(inode, pos, offset, bytes, iomap);
-		if (status < 0)
-			return status;
-
-		pos += bytes;
-		count -= bytes;
-		written += bytes;
-		if (did_zero)
-			*did_zero = true;
-	} while (count > 0);
-
-	return written;
-}
-
-int
-iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
-		const struct iomap_ops *ops)
-{
-	loff_t ret;
-
-	while (len > 0) {
-		ret = iomap_apply(inode, pos, len, IOMAP_ZERO,
-				ops, did_zero, iomap_zero_range_actor);
-		if (ret <= 0)
-			return ret;
-
-		pos += ret;
-		len -= ret;
-	}
-
-	return 0;
-}
-EXPORT_SYMBOL_GPL(iomap_zero_range);
-
 /*
  * Zero the truncated tail of the page provided.
  *
-- 
2.19.1

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 09/16] iomap: introduce zero-around functionality
  2018-11-07  6:31 [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Dave Chinner
                   ` (7 preceding siblings ...)
  2018-11-07  6:31 ` [PATCH 08/16] iomap: move iomap_zero_range and friends Dave Chinner
@ 2018-11-07  6:31 ` Dave Chinner
  2018-11-07  6:31 ` [PATCH 10/16] iomap: enable zero-around for iomap_zero_range() Dave Chinner
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 44+ messages in thread
From: Dave Chinner @ 2018-11-07  6:31 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

For block size > page size, a single page write is a sub-block
write. Hence such writes have to be treated differently when they
land in a hole or unwritten extent. The underlying block is going to
be allocated, but if we only write a single page to it, the rest of
the block is going to be uninitialised. This creates a stale data
exposure problem.

To avoid this, when we write into the middle of a new block, we need
to instantiate and zero the pages in the block around the current
page. When writeback occurs, all the pages will get written back and
the block will be fully initialised.

When we are doing zero-around, we may find pages already in the
cache over that range (e.g. from reading). We don't want to zero
those pages - they will already be up-to-date if they contain data,
and so we skip the zeroing if we find an up-to-date page.

Zeroing is done from the iomap_apply() actor function, so we use
iomap_zero() directly to instantiate page cache pages and zero them.
The iomap we are supplied with will always span the range the actor
needs to zero, so there's no need to recurse through
iomap_zero_range() here.

The zero-around functionality will be triggered by the
IOMAP_F_ZERO_AROUND flag returned by the filesystem's ->iomap_begin
mapping function. It will do so when it knows that zero-around will
be required for the mapped region being returned.

This commit introduces the zero-around functionality and patches it
into the buffered write path. Future commits will add the
functionality to other iomap write paths.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap.c            | 88 ++++++++++++++++++++++++++++++++++++++++++-
 include/linux/iomap.h |  2 +
 2 files changed, 88 insertions(+), 2 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index e417a5911239..56f40177ed17 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -793,6 +793,84 @@ static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
 			iomap_sector(iomap, pos & PAGE_MASK), offset, bytes);
 }
 
+/*
+ * We need to zero around the write if the write lands in a hole or an unwritten
+ * extent and the IOMAP_F_ZERO_AROUND flag is set. If we are in newly allocated
+ * space (i.e. write landed in a hole), IOMAP_F_NEW will be set. If we landed
+ * in an unwritten extent, the type will be IOMAP_UNWRITTEN.
+ */
+static bool
+iomap_need_zero_around(struct iomap *iomap)
+{
+	if (!(iomap->flags & IOMAP_F_ZERO_AROUND))
+		return false;
+	if (iomap->flags & IOMAP_F_NEW)
+		return true;
+	if (iomap->type == IOMAP_UNWRITTEN)
+		return true;
+	return false;
+}
+
+/*
+ * If we need to do zero-around, we zero the partial leading block that the
+ * data_start lands in, and if the iomap extends past the end of the write, we
+ * zero that partial block, too. Don't zero tail blocks beyond EOF.
+ */
+static loff_t
+iomap_zero_around(struct inode *inode, loff_t data_start, loff_t length,
+		struct iomap *iomap)
+{
+	loff_t data_end = data_start + length;
+	loff_t pos;
+	loff_t end = data_end;
+	loff_t status;
+	unsigned long offset;	/* Offset into pagecache page */
+	unsigned long bytes;	/* Bytes to write to page */
+
+	pos = round_down(data_start, i_blocksize(inode));
+	if (end < i_size_read(inode))
+		end = round_up(end, i_blocksize(inode));
+
+	/*
+	 * If the end is now past EOF, it means this write is at or
+	 * completely inside EOF and so we only zero from the end of the
+	 * write to EOF. If we are extending the file this avoids tail
+	 * zeroing altogether.
+	 */
+	if (end >= i_size_read(inode))
+		end = max(data_end, i_size_read(inode));
+
+	WARN_ON_ONCE(pos < iomap->offset);
+	WARN_ON_ONCE(offset_in_page(pos));
+	WARN_ON_ONCE(end > iomap->offset + iomap->length);
+	WARN_ON_ONCE(end < data_end);
+
+	/* zero start */
+	while (pos < data_start) {
+		offset = offset_in_page(pos);
+		bytes = min_t(unsigned long, data_start - pos,
+							PAGE_SIZE - offset);
+
+		status = iomap_zero(inode, pos, offset, bytes, iomap);
+		if (status < 0)
+			return status;
+		pos += bytes;
+	}
+
+	/* zero end */
+	pos = data_end;
+	while (pos < end) {
+		offset = offset_in_page(pos);
+		bytes = min_t(unsigned long, end - pos, PAGE_SIZE - offset);
+
+		status = iomap_zero(inode, pos, offset, bytes, iomap);
+		if (status < 0)
+			return status;
+		pos += bytes;
+	}
+	return 0;
+}
+
 static loff_t
 iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 		void *data, struct iomap *iomap)
@@ -849,14 +927,20 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 EXPORT_SYMBOL_GPL(iomap_zero_range);
 
 static loff_t
-iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap)
+iomap_write_actor(struct inode *inode, loff_t pos, loff_t length,
+		void *data, struct iomap *iomap)
 {
 	struct iov_iter *i = data;
 	long status = 0;
 	ssize_t written = 0;
 	unsigned int flags = AOP_FLAG_NOFS;
 
+	if (iomap_need_zero_around(iomap)) {
+		status = iomap_zero_around(inode, pos, length, iomap);
+		if (status)
+			return status;
+	}
+
 	do {
 		struct page *page;
 		unsigned long offset;	/* Offset into pagecache page */
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 671c0c387450..afdbeb12ed6e 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -35,6 +35,8 @@ struct vm_fault;
 #define IOMAP_F_NEW		0x01	/* blocks have been newly allocated */
 #define IOMAP_F_DIRTY		0x02	/* uncommitted metadata */
 #define IOMAP_F_BUFFER_HEAD	0x04	/* file system requires buffer heads */
+#define IOMAP_F_ZERO_AROUND	0x08	/* file system requires zeroed data
+					   around written data in map */
 
 /*
  * Flags that only need to be reported for IOMAP_REPORT requests:
-- 
2.19.1

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 10/16] iomap: enable zero-around for iomap_zero_range()
  2018-11-07  6:31 [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Dave Chinner
                   ` (8 preceding siblings ...)
  2018-11-07  6:31 ` [PATCH 09/16] iomap: introduce zero-around functionality Dave Chinner
@ 2018-11-07  6:31 ` Dave Chinner
  2018-11-07  6:31 ` [PATCH 11/16] iomap: Don't mark partial pages zeroing uptodate for zero-around Dave Chinner
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 44+ messages in thread
From: Dave Chinner @ 2018-11-07  6:31 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

iomap_zero_range() is used to zero the range between the old EOF
and the new EOF when the file is truncated up or written beyond the
existing EOF. With block size larger than page size, we can't assume
that, just because we've mapped a hole or an unwritten extent, there
is no data needing to be written in the portion of the block inside
the old EOF. Hence we need to zero to the closer of the end of the
block or the new EOF so that subsequent reads of the range between
the old and new EOF do not expose stale data.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap.c | 45 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 41 insertions(+), 4 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 56f40177ed17..d572e57c5caa 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -780,7 +780,14 @@ static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
 	if (status)
 		return status;
 
-	zero_user(page, offset, bytes);
+	/*
+	 * zero-around is conditional on whether the page we found already
+	 * contains data or not. If it's up to date, it contains data and we
+	 * should not zero it. We still need to mark it dirty to get that data
+	 * written, however.
+	 */
+	if (!(iomap->flags & IOMAP_F_ZERO_AROUND) || !PageUptodate(page))
+		zero_user(page, offset, bytes);
 	mark_page_accessed(page);
 
 	return iomap_write_end(inode, pos, bytes, bytes, page, iomap);
@@ -877,11 +884,41 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 {
 	bool *did_zero = data;
 	loff_t written = 0;
+	loff_t old_count = 0;
 	int status;
 
 	/* already zeroed?  we're done. */
-	if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_UNWRITTEN)
-		return count;
+	if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_UNWRITTEN) {
+
+		if (!iomap_need_zero_around(iomap))
+			return count;
+
+		/*
+		 * Because we landed in a hole, we only need to zero to the end
+		 * of this block. We'll do that by the loop below, but we need
+		 * to trim count here so the zero-around only acts on this
+		 * block, too.
+		 *
+		 * The magic "pos + 1" is needed because we want the offset of
+		 * the next block after pos. If pos is already aligned to the
+		 * block size, the round_up() returns the same value, not that
+		 * of the next highest multiple. Hence we have to add 1 to pos
+		 * to get round_up() to behave as we want.
+		 */
+		old_count = count;
+		if (pos + count > round_up(pos + 1, i_blocksize(inode)))
+			count = round_up(pos + 1, i_blocksize(inode)) - pos;
+
+		status = iomap_zero_around(inode, pos, count, iomap);
+		if (status)
+			return status;
+
+		/*
+		 * now clear the zero-around flag so that the range requested
+		 * in this block will be unconditionally zeroed.
+		 */
+		iomap->flags &= ~IOMAP_F_ZERO_AROUND;
+	}
 
 	do {
 		unsigned offset, bytes;
@@ -903,7 +940,7 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 			*did_zero = true;
 	} while (count > 0);
 
-	return written;
+	return old_count ? old_count : written;
 }
 
 int
-- 
2.19.1

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 11/16] iomap: Don't mark partial pages zeroing uptodate for zero-around
  2018-11-07  6:31 [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Dave Chinner
                   ` (9 preceding siblings ...)
  2018-11-07  6:31 ` [PATCH 10/16] iomap: enable zero-around for iomap_zero_range() Dave Chinner
@ 2018-11-07  6:31 ` Dave Chinner
  2018-11-07  6:31 ` [PATCH 12/16] iomap: zero-around in iomap_page_mkwrite Dave Chinner
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 44+ messages in thread
From: Dave Chinner @ 2018-11-07  6:31 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

When zero-around stops short of a full block, such as when zeroing
to a new EOF, iomap_zero() triggers the partial page read code in
__iomap_write_begin.

Because we are over an unmapped range, the code will zero the
portion that we would have read and then marks the page up to date.
Unfortunately, this defeats the zero-around code that looks for
pages that are not up to date for zeroing, and hence the zero-around
fails to zero the "unread" portion of the page and exposes stale
data.

Hence if we are doing zero-around and we zero a partial page, do
not mark it up to date so that the zero-around will zero the
remaining section of the page.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index d572e57c5caa..41922fc775c4 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -592,7 +592,13 @@ iomap_read_page_sync(struct inode *inode, loff_t block_start, struct page *page,
 
 	if (iomap->type != IOMAP_MAPPED || block_start >= i_size_read(inode)) {
 		zero_user_segments(page, poff, from, to, poff + plen);
-		iomap_set_range_uptodate(page, poff, plen);
+		/*
+		 * if this is zero-around, we don't want to mark the page
+		 * uptodate here because this is only a partial page zeroing
+		 * and there's still more data to be written into the page.
+		 */
+		if (!(iomap->flags & IOMAP_F_ZERO_AROUND))
+			iomap_set_range_uptodate(page, poff, plen);
 		return 0;
 	}
 
-- 
2.19.1

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 12/16] iomap: zero-around in iomap_page_mkwrite
  2018-11-07  6:31 [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Dave Chinner
                   ` (10 preceding siblings ...)
  2018-11-07  6:31 ` [PATCH 11/16] iomap: Don't mark partial pages zeroing uptodate for zero-around Dave Chinner
@ 2018-11-07  6:31 ` Dave Chinner
  2018-11-07  6:31 ` [PATCH 13/16] xfs: add zero-around controls to iomap Dave Chinner
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 44+ messages in thread
From: Dave Chinner @ 2018-11-07  6:31 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

When we take a write fault over a page in a block size > page size
filesystem, we may have to issue zero-around to initialise all the
pages in the block that mmap is writing to. This is essentially the
same as the zero-around in the buffered write path, with the added
complexity that we have to drop the page lock on the page that
was passed to iomap_page_mkwrite_actor().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap.c | 41 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 41922fc775c4..7aacd48c593e 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1183,7 +1183,46 @@ iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
 		void *data, struct iomap *iomap)
 {
 	struct page *page = data;
-	int ret;
+	loff_t ret;
+
+	/*
+	 * if we need to zero-around, we have to unlock the page we were given.
+	 * No big deal, we just have to repeat the "is this page ours" checks
+	 * after relocking it.
+	 */
+	if (iomap_need_zero_around(iomap)) {
+		loff_t size;
+
+		/*
+		 * This only happens for block size > page size, so the file
+		 * offset of a page fault should always be page aligned.
+		 */
+		WARN_ON(offset_in_page(pos));
+
+		unlock_page(page);
+		ret = iomap_zero_around(inode, pos, length, iomap);
+		lock_page(page);
+		size = i_size_read(inode);
+		if ((page->mapping != inode->i_mapping) ||
+		    (page_offset(page) > size)) {
+			/* We overload EFAULT to mean page got truncated */
+			return -EFAULT;
+		}
+
+		if (page_offset(page) != pos) {
+			/* it moved in the file! */
+			return -EFAULT;
+		}
+
+		/* return failure now if zeroing had an error */
+		if (ret)
+			return ret;
+
+		/* trim down the length if we straddle EOF. */
+		if (((page->index + 1) << PAGE_SHIFT) > size)
+			length = offset_in_page(size);
+
+	}
 
 	if (iomap->flags & IOMAP_F_BUFFER_HEAD) {
 		ret = __block_write_begin_int(page, pos, length, NULL, iomap);
-- 
2.19.1

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 13/16] xfs: add zero-around controls to iomap
  2018-11-07  6:31 [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Dave Chinner
                   ` (11 preceding siblings ...)
  2018-11-07  6:31 ` [PATCH 12/16] iomap: zero-around in iomap_page_mkwrite Dave Chinner
@ 2018-11-07  6:31 ` Dave Chinner
  2018-11-07  6:31 ` [PATCH 14/16] xfs: align writepages to large block sizes Dave Chinner
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 44+ messages in thread
From: Dave Chinner @ 2018-11-07  6:31 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

For buffered writes into block size > page size filesystems, XFS
will need to trigger zero-around for delayed allocation mappings
and writes over unwritten extents.

Unwritten extents will only remain unwritten until the first data
gets written back, so we have to ensure that any write mapping
triggers zero-around for them.

Delayed allocation occurs over holes, which then require the data in
the page cache to completely cover them at write-out time. Hence we
have to trigger zero-around for these mappings, too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iomap.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 27c93b5f029d..35626855e207 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -575,6 +575,15 @@ xfs_file_iomap_begin_delay(
 				goto out_unlock;
 		}
 
+		/*
+		 * If we are about to write to an unwritten extent and the block
+		 * size is larger than page size, then we need to make sure that
+		 * the caller zeroes blocks partially covered by data.
+		 */
+		if (got.br_state == XFS_EXT_UNWRITTEN &&
+		    mp->m_sb.sb_blocksize > PAGE_SIZE)
+			iomap->flags |= IOMAP_F_ZERO_AROUND;
+
 		trace_xfs_iomap_found(ip, offset, count, 0, &got);
 		goto done;
 	}
@@ -647,6 +656,8 @@ xfs_file_iomap_begin_delay(
 	 * them out if the write happens to fail.
 	 */
 	iomap->flags |= IOMAP_F_NEW;
+	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
+		iomap->flags |= IOMAP_F_ZERO_AROUND;
 	trace_xfs_iomap_alloc(ip, offset, count, 0, &got);
 done:
 	if (isnullstartblock(got.br_startblock))
@@ -1107,6 +1118,15 @@ xfs_file_iomap_begin(
 	if (flags & IOMAP_ZERO)
 		goto out_found;
 
+	/*
+	 * If we are about to write to an unwritten extent and the block
+	 * size is larger than page size, then we need to make sure that
+	 * the caller zeroes blocks partially covered by data.
+	 */
+	if (imap.br_state == XFS_EXT_UNWRITTEN &&
+	    mp->m_sb.sb_blocksize > PAGE_SIZE)
+		iomap->flags |= IOMAP_F_ZERO_AROUND;
+
 	if (!imap_needs_alloc(inode, &imap, nimaps))
 		goto out_found;
 
-- 
2.19.1

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 14/16] xfs: align writepages to large block sizes
  2018-11-07  6:31 [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Dave Chinner
                   ` (12 preceding siblings ...)
  2018-11-07  6:31 ` [PATCH 13/16] xfs: add zero-around controls to iomap Dave Chinner
@ 2018-11-07  6:31 ` Dave Chinner
  2018-11-09 15:22   ` Christoph Hellwig
  2018-11-14 14:19   ` Brian Foster
  2018-11-07  6:31 ` [PATCH 15/16] xfs: expose block size in stat Dave Chinner
                   ` (2 subsequent siblings)
  16 siblings, 2 replies; 44+ messages in thread
From: Dave Chinner @ 2018-11-07  6:31 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

For data integrity purposes, we need to write back the entire
filesystem block when asked to sync a sub-block range of the file.
When the filesystem block size is larger than the page size, this
means we need to convert single page integrity writes into whole
block integrity writes. We do this by extending the writepage range
to filesystem block granularity and alignment.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index f6ef9e0a7312..5334f16be166 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -900,6 +900,7 @@ xfs_vm_writepages(
 		.io_type = XFS_IO_HOLE,
 	};
 	int			ret;
+	unsigned		bsize =	i_blocksize(mapping->host);
 
 	/*
 	 * Refuse to write pages out if we are called from reclaim context.
@@ -922,6 +923,19 @@ xfs_vm_writepages(
 	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
 		return 0;
 
+	/*
+	 * If the block size is larger than page size, extent the incoming write
+	 * request to fsb granularity and alignment. This is a requirement for
+	 * data integrity operations and it doesn't hurt for other write
+	 * operations, so do it unconditionally.
+	 */
+	if (wbc->range_start)
+		wbc->range_start = round_down(wbc->range_start, bsize);
+	if (wbc->range_end != LLONG_MAX)
+		wbc->range_end = round_up(wbc->range_end, bsize);
+	if (wbc->nr_to_write < wbc->range_end - wbc->range_start)
+		wbc->nr_to_write = round_up(wbc->nr_to_write, bsize);
+
 	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
 	ret = write_cache_pages(mapping, wbc, xfs_do_writepage, &wpc);
 	if (wpc.ioend)
-- 
2.19.1

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 15/16] xfs: expose block size in stat
  2018-11-07  6:31 [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Dave Chinner
                   ` (13 preceding siblings ...)
  2018-11-07  6:31 ` [PATCH 14/16] xfs: align writepages to large block sizes Dave Chinner
@ 2018-11-07  6:31 ` Dave Chinner
  2018-11-07  6:31 ` [PATCH 16/16] xfs: enable block size larger than page size support Dave Chinner
  2018-11-07 17:14 ` [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Darrick J. Wong
  16 siblings, 0 replies; 44+ messages in thread
From: Dave Chinner @ 2018-11-07  6:31 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

For block size larger than page size, the unit of efficient IO is
the block size, not the page size. Leaving stat() to report
PAGE_SIZE as the block size causes test programs like fsx to issue
illegal ranges for operations that require block size alignment
(e.g. fallocate() insert range). Hence update the preferred IO size
to reflect the block size in this case.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_mount.h | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 7964513c3128..a323e362aeb6 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -266,13 +266,18 @@ typedef struct xfs_mount {
 static inline unsigned long
 xfs_preferred_iosize(xfs_mount_t *mp)
 {
+	unsigned long	default_size = max_t(unsigned long, PAGE_SIZE,
+						mp->m_sb.sb_blocksize);
+
 	if (mp->m_flags & XFS_MOUNT_COMPAT_IOSIZE)
-		return PAGE_SIZE;
-	return (mp->m_swidth ?
-		(mp->m_swidth << mp->m_sb.sb_blocklog) :
-		((mp->m_flags & XFS_MOUNT_DFLT_IOSIZE) ?
-			(1 << (int)max(mp->m_readio_log, mp->m_writeio_log)) :
-			PAGE_SIZE));
+		return default_size;
+
+	if (mp->m_swidth)
+		return mp->m_swidth << mp->m_sb.sb_blocklog;
+	if (mp->m_flags & XFS_MOUNT_DFLT_IOSIZE)
+		return 1UL << max_t(int, mp->m_readio_log, mp->m_writeio_log);
+	return default_size;
+
 }
 
 #define XFS_LAST_UNMOUNT_WAS_CLEAN(mp)	\
-- 
2.19.1

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 16/16] xfs: enable block size larger than page size support
  2018-11-07  6:31 [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Dave Chinner
                   ` (14 preceding siblings ...)
  2018-11-07  6:31 ` [PATCH 15/16] xfs: expose block size in stat Dave Chinner
@ 2018-11-07  6:31 ` Dave Chinner
  2018-11-07 17:14 ` [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Darrick J. Wong
  16 siblings, 0 replies; 44+ messages in thread
From: Dave Chinner @ 2018-11-07  6:31 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_sb.c |  2 ++
 fs/xfs/xfs_mount.c     | 11 ++++++++++-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index b5a82acd7dfe..e0881545a929 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -354,6 +354,7 @@ xfs_validate_sb_common(
 		return -EFSCORRUPTED;
 	}
 
+#if 0
 	/*
 	 * Until this is fixed only page-sized or smaller data blocks work.
 	 */
@@ -364,6 +365,7 @@ xfs_validate_sb_common(
 				sbp->sb_blocksize, PAGE_SIZE);
 		return -ENOSYS;
 	}
+#endif
 
 	/*
 	 * Currently only very few inode sizes are supported.
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 02d15098dbee..e0d95d56f776 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -164,7 +164,16 @@ xfs_sb_validate_fsb_count(
 	xfs_sb_t	*sbp,
 	uint64_t	nblocks)
 {
-	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
+	int		shift = PAGE_SHIFT - sbp->sb_blocklog;
+
+	/*
+	 * block size larger than page size still limited to page cache
+	 * page size limits.
+	 */
+	if (shift < 0)
+		shift = 0;
+
+	//ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
 	ASSERT(sbp->sb_blocklog >= BBSHIFT);
 
 	/* Limited by ULONG_MAX of page cache index */
-- 
2.19.1

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 03/16] xfs: finobt AG reserves don't consider last AG can be a runt
  2018-11-07  6:31 ` [PATCH 03/16] xfs: finobt AG reserves don't consider last AG can be a runt Dave Chinner
@ 2018-11-07 16:55   ` Darrick J. Wong
  2018-11-09  0:21     ` Dave Chinner
  0 siblings, 1 reply; 44+ messages in thread
From: Darrick J. Wong @ 2018-11-07 16:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Wed, Nov 07, 2018 at 05:31:14PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The last AG may be very small compared to all other AGs, and hence
> AG reservations based on the superblock AG size may actually consume
> more space than the AG actually has. This results in assert failures
> like:
> 
> XFS: Assertion failed: xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved + xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <= pag->pagf_freeblks + pag->pagf_flcount, file: fs/xfs/libxfs/xfs_ag_resv.c, line: 319
> [   48.932891]  xfs_ag_resv_init+0x1bd/0x1d0
> [   48.933853]  xfs_fs_reserve_ag_blocks+0x37/0xb0
> [   48.934939]  xfs_mountfs+0x5b3/0x920
> [   48.935804]  xfs_fs_fill_super+0x462/0x640
> [   48.936784]  ? xfs_test_remount_options+0x60/0x60
> [   48.937908]  mount_bdev+0x178/0x1b0
> [   48.938751]  mount_fs+0x36/0x170
> [   48.939533]  vfs_kern_mount.part.43+0x54/0x130
> [   48.940596]  do_mount+0x20e/0xcb0
> [   48.941396]  ? memdup_user+0x3e/0x70
> [   48.942249]  ksys_mount+0xba/0xd0
> [   48.943046]  __x64_sys_mount+0x21/0x30
> [   48.943953]  do_syscall_64+0x54/0x170
> [   48.944835]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Hence we need to ensure the finobt per-ag space reservations take
> into account the size of the last AG rather than treat it like all
> the other full size AGs.
> 
> Note that both refcountbt and rmapbt already take the size of the AG
> into account via reading the AGF length directly.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_ialloc_btree.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> index 86c50208a143..62014780d5e4 100644
> --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> @@ -538,15 +538,24 @@ xfs_inobt_rec_check_count(
>  
>  static xfs_extlen_t
>  xfs_inobt_max_size(
> -	struct xfs_mount	*mp)
> +	struct xfs_mount	*mp,
> +	xfs_agnumber_t		agno)
>  {
> +	uint32_t		agblocks = mp->m_sb.sb_agblocks;
> +
>  	/* Bail out if we're uninitialized, which can happen in mkfs. */
>  	if (mp->m_inobt_mxr[0] == 0)
>  		return 0;
>  
> +	/* last AG can be a runt */
> +	if (agno == mp->m_sb.sb_agcount - 1) {
> +		div_u64_rem(mp->m_sb.sb_dblocks, mp->m_sb.sb_agblocks,
> +				&agblocks);
> +	}

Why not: agblocks = xfs_ag_block_count(mp, agno); ?

--D

> +
>  	return xfs_btree_calc_size(mp->m_inobt_mnr,
> -		(uint64_t)mp->m_sb.sb_agblocks * mp->m_sb.sb_inopblock /
> -				XFS_INODES_PER_CHUNK);
> +				(uint64_t)agblocks * mp->m_sb.sb_inopblock /
> +					XFS_INODES_PER_CHUNK);
>  }
>  
>  static int
> @@ -594,7 +603,7 @@ xfs_finobt_calc_reserves(
>  	if (error)
>  		return error;
>  
> -	*ask += xfs_inobt_max_size(mp);
> +	*ask += xfs_inobt_max_size(mp, agno);
>  	*used += tree_len;
>  	return 0;
>  }
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support
  2018-11-07  6:31 [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Dave Chinner
                   ` (15 preceding siblings ...)
  2018-11-07  6:31 ` [PATCH 16/16] xfs: enable block size larger than page size support Dave Chinner
@ 2018-11-07 17:14 ` Darrick J. Wong
  2018-11-07 22:04   ` Dave Chinner
  16 siblings, 1 reply; 44+ messages in thread
From: Darrick J. Wong @ 2018-11-07 17:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Wed, Nov 07, 2018 at 05:31:11PM +1100, Dave Chinner wrote:
> Hi folks,
> 
> We've had a fair number of problems reported on 64k block size
> filesystems of late, but none of the XFS developers have Power or
> ARM machines handy to reproduce them or even really test the fixes.
> 
> The iomap infrastructure we introduced a while back was designed
> with the capability of block size > page size support in mind, but we
> hadn't tried to implement it.
> 
> So after another 64k block size bug report late last week I said to
> Darrick "How hard could it be"?

"Nothing is ever simple" :)

> About 6 billion (yes, B) fsx ops later, I have most of the XFS
> functionality working on 64k block sizes on x86_64.  Buffered
> read/write, mmap read/write and direct IO read/write all work. All
> the fallocate() operations work correctly, as does truncate. xfsdump
> and xfs_restore are happy with it, as is xfs_repair. xfs-scrub
> needed some help, but I've tested Darrick's fixes for that quite a
> bit over the past few days.
> 
> It passes most of xfstests - there's some test failures that I have
> to determine whether they are code bugs or test problems (i.e. some
> tests don't deal with 64k block sizes correctly or assume block size
> <= page size).
> 
> What I haven't tested yet is shared extents - the COW path,
> clone_file_range and dedupe_file_range. I discovered earlier today
> that fsx doesn't support copy/clone/dedupe_file_operations
> operations, so before I go any further I need to enhance fsx. Then

I assume that means you only tested this on reflink=0 filesystems?

Looking at fsstress, it looks like we don't test copy_file_range either.
I can try adding the missing clone/dedupe/copy to both programs, but
maybe you've already done that while I was asleep?

--D

> fix all the bugs it uncovers on block size <= page size filesystems.
> And then I'll move onto adding the rest of the functionality this
> patch set requires.
> 
> The main addition to the iomap code to support this functionality is
> the "zero-around" capability. When the filesystem is mapping a new
> block, a delalloc range or an unwritten extent, it sets the
> IOMAP_F_ZERO_AROUND flag in the iomap it returns. This tells the
> iomap code that it needs to expand the range of the operation being
> performed to cover entire blocks. i.e. if the data being written
> doesn't span the filesystem block, it needs to instantiate and zero
> pages in the page cache to cover the portions of the block the data
> doesn't cover.
> 
> Because the page cache may already hold data for the regions (e.g.
> read over a hole/unwritten extent) the zero-around code does not
> zero pages that are already marked up to date. It is assumed that
> whatever put those pages into the page cache has already done the
> right thing with them.
> 
> Yes, this means the unit of page cache IO is still individual pages.
> There are no mm/ changes at all, no page cache changes, nothing.
> That all still just operates on individual pages and is oblivious to
> the fact the filesystem/iomap code is now processing gangs of pages
> at a time instead of just one.
> 
> Actually, I stretch the truth there a little - there is one change
> to XFS that is important to note here. I removed ->writepage from
> XFS (patches 1 and 2). We can't use it for large block sizes because
> we want to write whole blocks at a time if they are newly allocated
> or unwritten. And really, it's just a nasty hack that gets in the
> way of background writeback doing an efficient job of cleaning dirty
> pages. So I killed it.
> 
> We also need to expose the large block size to stat(2). If we don't,
> applications that use stat.st_blksize for operations that require
> block size alignment (e.g. various fallocate ops) fail because
> 4k is not the block size of a 64k block size filesystem.
> 
> A number of latent bugs in existing code were uncovered as I worked
> through this - patches 3-5 fix bugs in XFS and iomap that can be
> triggered on existing systems but it's somewhat hard to expose them.
> 
> Patches 6-12 introduce all the iomap infrastructure needed to
> support block size > page size.
> 
> Patches 13-16 introduce the necessary functionality to trigger the
> iomap infrastructure, tell userspace the right thing, make sub-block
> fsync ranges do the right thing and finally remove the code that
> prevents large block size filesystems from mounting on small page
> size machines.
> 
> It works, it seems pretty robust and runs enough of fstests that
> I've already used it to find, fix and test a 64k block size bug in
> XFS:
> 
> https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/commit/?h=for-next&id=837514f7a4ca4aca06aec5caa5ff56d33ef06976
> 
> I think this is the last of the XFS Irix features we haven't
> implemented in Linux XFS - it's only taken us 20 years to
> get here, but the end of the tunnel is in sight.
> 
> Nah, it's probably a train. Or maybe a flame. :)
> 
> Anyway, I'm interested to see what people think of the approach.
> 
> Cheers,
> 
> Dave.
> 
> 


* Re: [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support
  2018-11-07 17:14 ` [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Darrick J. Wong
@ 2018-11-07 22:04   ` Dave Chinner
  2018-11-08  1:38     ` Darrick J. Wong
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2018-11-07 22:04 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel

On Wed, Nov 07, 2018 at 09:14:05AM -0800, Darrick J. Wong wrote:
> On Wed, Nov 07, 2018 at 05:31:11PM +1100, Dave Chinner wrote:
> > Hi folks,
> > 
> > We've had a fair number of problems reported on 64k block size
> > filesystems of late, but none of the XFS developers have Power or
> > ARM machines handy to reproduce them or even really test the fixes.
> > 
> > The iomap infrastructure we introduced a while back was designed
> > with the capability of block size > page size support in mind, but we
> > hadn't tried to implement it.
> > 
> > So after another 64k block size bug report late last week I said to
> > Darrick "How hard could it be"?
> 
> "Nothing is ever simple" :)

"It'll only take a couple of minutes!"

> > About 6 billion (yes, B) fsx ops later, I have most of the XFS
> > functionality working on 64k block sizes on x86_64.  Buffered
> > read/write, mmap read/write and direct IO read/write all work. All
> > the fallocate() operations work correctly, as does truncate. xfsdump
> > and xfs_restore are happy with it, as is xfs_repair. xfs-scrub
> > needed some help, but I've tested Darrick's fixes for that quite a
> > bit over the past few days.
> > 
> > It passes most of xfstests - there's some test failures that I have
> > to determine whether they are code bugs or test problems (i.e. some
> > tests don't deal with 64k block sizes correctly or assume block size
> > <= page size).
> > 
> > What I haven't tested yet is shared extents - the COW path,
> > clone_file_range and dedupe_file_range. I discovered earlier today
> > that fsx doesn't support copy/clone/dedupe_file_operations
> > operations, so before I go any further I need to enhance fsx. Then
> 
> I assume that means you only tested this on reflink=0 filesystems?

Correct.

> Looking at fsstress, it looks like we don't test copy_file_range either.
> I can try adding the missing clone/dedupe/copy to both programs, but
> maybe you've already done that while I was asleep?

No, I haven't started on this yet. I've been sleeping. :P

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support
  2018-11-07 22:04   ` Dave Chinner
@ 2018-11-08  1:38     ` Darrick J. Wong
  2018-11-08  9:04       ` Dave Chinner
  0 siblings, 1 reply; 44+ messages in thread
From: Darrick J. Wong @ 2018-11-08  1:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Thu, Nov 08, 2018 at 09:04:41AM +1100, Dave Chinner wrote:
> On Wed, Nov 07, 2018 at 09:14:05AM -0800, Darrick J. Wong wrote:
> > On Wed, Nov 07, 2018 at 05:31:11PM +1100, Dave Chinner wrote:
> > > Hi folks,
> > > 
> > > We've had a fair number of problems reported on 64k block size
> > > filesystems of late, but none of the XFS developers have Power or
> > > ARM machines handy to reproduce them or even really test the fixes.
> > > 
> > > The iomap infrastructure we introduced a while back was designed
> > > with the capability of block size > page size support in mind, but we
> > > hadn't tried to implement it.
> > > 
> > > So after another 64k block size bug report late last week I said to
> > > Darrick "How hard could it be"?
> > 
> > "Nothing is ever simple" :)
> 
> "It'll only take a couple of minutes!"
> 
> > > About 6 billion (yes, B) fsx ops later, I have most of the XFS
> > > functionality working on 64k block sizes on x86_64.  Buffered
> > > read/write, mmap read/write and direct IO read/write all work. All
> > > the fallocate() operations work correctly, as does truncate. xfsdump
> > > and xfs_restore are happy with it, as is xfs_repair. xfs-scrub
> > > needed some help, but I've tested Darrick's fixes for that quite a
> > > bit over the past few days.
> > > 
> > > It passes most of xfstests - there's some test failures that I have
> > > to determine whether they are code bugs or test problems (i.e. some
> > > tests don't deal with 64k block sizes correctly or assume block size
> > > <= page size).
> > > 
> > > What I haven't tested yet is shared extents - the COW path,
> > > clone_file_range and dedupe_file_range. I discovered earlier today
> > > that fsx doesn't support copy/clone/dedupe_file_operations
> > > operations, so before I go any further I need to enhance fsx. Then
> > 
> > I assume that means you only tested this on reflink=0 filesystems?
> 
> Correct.
> 
> > Looking at fsstress, it looks like we don't test copy_file_range either.
> > I can try adding the missing clone/dedupe/copy to both programs, but
> > maybe you've already done that while I was asleep?
> 
> No, I haven't started on this yet. I've been sleeping. :P

I started wondering if we were missing anything from not having fsx
support clone/dedupe and ended up with:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fsstress-clone

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support
  2018-11-08  1:38     ` Darrick J. Wong
@ 2018-11-08  9:04       ` Dave Chinner
  2018-11-08 22:17         ` Darrick J. Wong
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2018-11-08  9:04 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel

On Wed, Nov 07, 2018 at 05:38:43PM -0800, Darrick J. Wong wrote:
> On Thu, Nov 08, 2018 at 09:04:41AM +1100, Dave Chinner wrote:
> > On Wed, Nov 07, 2018 at 09:14:05AM -0800, Darrick J. Wong wrote:
> > > On Wed, Nov 07, 2018 at 05:31:11PM +1100, Dave Chinner wrote:
> > > > Hi folks,
> > > > 
> > > > We've had a fair number of problems reported on 64k block size
> > > > filesystems of late, but none of the XFS developers have Power or
> > > > ARM machines handy to reproduce them or even really test the fixes.
> > > > 
> > > > The iomap infrastructure we introduced a while back was designed
> > > > with the capability of block size > page size support in mind, but we
> > > > hadn't tried to implement it.
> > > > 
> > > > So after another 64k block size bug report late last week I said to
> > > > Darrick "How hard could it be"?
> > > 
> > > "Nothing is ever simple" :)
> > 
> > "It'll only take a couple of minutes!"
> > 
> > > > About 6 billion (yes, B) fsx ops later, I have most of the XFS
> > > > functionality working on 64k block sizes on x86_64.  Buffered
> > > > read/write, mmap read/write and direct IO read/write all work. All
> > > > the fallocate() operations work correctly, as does truncate. xfsdump
> > > > and xfs_restore are happy with it, as is xfs_repair. xfs-scrub
> > > > needed some help, but I've tested Darrick's fixes for that quite a
> > > > bit over the past few days.
> > > > 
> > > > It passes most of xfstests - there's some test failures that I have
> > > > to determine whether they are code bugs or test problems (i.e. some
> > > > tests don't deal with 64k block sizes correctly or assume block size
> > > > <= page size).
> > > > 
> > > > What I haven't tested yet is shared extents - the COW path,
> > > > clone_file_range and dedupe_file_range. I discovered earlier today
> > > > that fsx doesn't support copy/clone/dedupe_file_operations
> > > > operations, so before I go any further I need to enhance fsx. Then
> > > 
> > > I assume that means you only tested this on reflink=0 filesystems?
> > 
> > Correct.
> > 
> > > Looking at fsstress, it looks like we don't test copy_file_range either.
> > > I can try adding the missing clone/dedupe/copy to both programs, but
> > > maybe you've already done that while I was asleep?
> > 
> > No, I haven't started on this yet. I've been sleeping. :P
> 
> I started wondering if we were missing anything from not having fsx
> support clone/dedupe and ended up with:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fsstress-clone

Some fixes to that below.

I haven't got to testing dedupe or clone - copy_file_range explodes
in under 40 operations on generic/263. do_splice_direct() looks
to be broken in several different ways at this point.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

fsx: clean up copy/dedupe file range support.

From: Dave Chinner <dchinner@redhat.com>

copy_file_range() needs to obey read/write constraints otherwise it
blows up when direct IO is used.

FIDEDUPERANGE has a completely screwed up API for error reporting.
The ioctl succeeds even if dedupe fails, so you have to check
every individual dedupe operation for failure. Without this, dedupe
"succeeds" on kernels/filesystems that don't even support dedupe...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 ltp/fsx.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/ltp/fsx.c b/ltp/fsx.c
index fad50e0022af..b51910b8b2e1 100644
--- a/ltp/fsx.c
+++ b/ltp/fsx.c
@@ -1382,7 +1382,11 @@ do_dedupe_range(unsigned offset, unsigned length, unsigned dest)
 	fdr->info[0].dest_fd = fd;
 	fdr->info[0].dest_offset = dest;
 
-	if (ioctl(fd, FIDEDUPERANGE, fdr) == -1) {
+	if (ioctl(fd, FIDEDUPERANGE, fdr) == -1 ||
+	    fdr->info[0].status < 0) {
+		if (fdr->info[0].status < 0)
+			errno = -fdr->info[0].status;
+
 		if (errno == EOPNOTSUPP || errno == ENOTTY) {
 			if (!quiet && testcalls > simulatedopcount)
 				prt("skipping unsupported dedupe range\n");
@@ -1416,6 +1420,11 @@ do_copy_range(unsigned offset, unsigned length, unsigned dest)
 	loff_t o1, o2;
 	ssize_t nr;
 
+	offset -= offset % readbdy;
+	dest -= dest % writebdy;
+	if (o_direct)
+		length -= length % readbdy;
+
 	if (length == 0) {
 		if (!quiet && testcalls > simulatedopcount)
 			prt("skipping zero length copy range\n");


* Re: [PATCH 06/16] iomap: support block size > page size for direct IO
  2018-11-07  6:31 ` [PATCH 06/16] iomap: support block size > page size for direct IO Dave Chinner
@ 2018-11-08 11:28   ` Nikolay Borisov
  2018-11-09 15:18   ` Christoph Hellwig
  1 sibling, 0 replies; 44+ messages in thread
From: Nikolay Borisov @ 2018-11-08 11:28 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs; +Cc: linux-fsdevel



On 7.11.18 г. 8:31 ч., Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> iomap_dio_rw() has all the infrastructure in place to support block
> size > page size filesystems because it is essentially just
> sub-block DIO. It needs help, however, with the sub-block zeroing
> code (needs multi-page IOs) page cache invalidation over the block
> being written.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/iomap.c | 65 ++++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 49 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 16d16596b00f..8878b1f1f9c7 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -1548,21 +1548,34 @@ static void iomap_dio_bio_end_io(struct bio *bio)
>  	}
>  }
>  
> +/*
> + * With zeroing for block size larger than page size, the zeroing length can
> + * span multiple pages.
> + */
> +#define howmany(x, y) (((x)+((y)-1))/(y))

nit: This could be replaced by DIV_ROUND_UP

>  static blk_qc_t
>  iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
>  		unsigned len)
>  {
>  	struct page *page = ZERO_PAGE(0);
>  	struct bio *bio;
> +	int npages = howmany(len, PAGE_SIZE);
> +
> +	WARN_ON_ONCE(npages > 16);
>  
> -	bio = bio_alloc(GFP_KERNEL, 1);
> +	bio = bio_alloc(GFP_KERNEL, npages);
>  	bio_set_dev(bio, iomap->bdev);
>  	bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
>  	bio->bi_private = dio;
>  	bio->bi_end_io = iomap_dio_bio_end_io;
>  
> -	get_page(page);
> -	__bio_add_page(bio, page, len, 0);
> +	while (npages-- > 0) {
> +		unsigned plen = min_t(unsigned, PAGE_SIZE, len);
> +		get_page(page);
> +		__bio_add_page(bio, page, plen, 0);
> +		len -= plen;
> +	}
> +	WARN_ON(len != 0);
>  	bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | REQ_IDLE);
>  
>  	atomic_inc(&dio->ref);
> @@ -1752,6 +1765,38 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
>  	}
>  }
>  
> +/*
> + * This is lifted almost straight from xfs_flush_unmap_range(). Need a generic
> + * version of the block size rounding for these purposes.
> + */
> +static int
> +iomap_flush_unmap_range(struct file *f, loff_t offset, loff_t len)
> +{
> +	struct inode *inode = file_inode(f);
> +	loff_t rounding, start, end;
> +	int ret;
> +
> +	rounding = max_t(loff_t, i_blocksize(inode), PAGE_SIZE);
> +	start = round_down(offset, rounding);
> +	end = round_up(offset + len, rounding) - 1;
> +
> +	ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * Try to invalidate cache pages for the range we're direct
> +	 * writing.  If this invalidation fails, tough, the write will
> +	 * still work, but racing two incompatible write paths is a
> +	 * pretty crazy thing to do, so we don't support it 100%.
> +	 */
> +	ret = invalidate_inode_pages2_range(inode->i_mapping,
> +			start >> PAGE_SHIFT, end >> PAGE_SHIFT);
> +	if (ret)
> +		dio_warn_stale_pagecache(f);
> +	return 0;
> +}
> +
>  /*
>   * iomap_dio_rw() always completes O_[D]SYNC writes regardless of whether the IO
>   * is being issued as AIO or not.  This allows us to optimise pure data writes
> @@ -1829,22 +1874,10 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		flags |= IOMAP_NOWAIT;
>  	}
>  
> -	ret = filemap_write_and_wait_range(mapping, start, end);
> +	ret = iomap_flush_unmap_range(iocb->ki_filp, start, end);
>  	if (ret)
>  		goto out_free_dio;
>  
> -	/*
> -	 * Try to invalidate cache pages for the range we're direct
> -	 * writing.  If this invalidation fails, tough, the write will
> -	 * still work, but racing two incompatible write paths is a
> -	 * pretty crazy thing to do, so we don't support it 100%.
> -	 */
> -	ret = invalidate_inode_pages2_range(mapping,
> -			start >> PAGE_SHIFT, end >> PAGE_SHIFT);
> -	if (ret)
> -		dio_warn_stale_pagecache(iocb->ki_filp);
> -	ret = 0;
> -
>  	if (iov_iter_rw(iter) == WRITE && !dio->wait_for_completion &&
>  	    !inode->i_sb->s_dio_done_wq) {
>  		ret = sb_init_dio_done_wq(inode->i_sb);
> 


* Re: [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support
  2018-11-08  9:04       ` Dave Chinner
@ 2018-11-08 22:17         ` Darrick J. Wong
  2018-11-08 22:22           ` Dave Chinner
  0 siblings, 1 reply; 44+ messages in thread
From: Darrick J. Wong @ 2018-11-08 22:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Thu, Nov 08, 2018 at 08:04:32PM +1100, Dave Chinner wrote:
> On Wed, Nov 07, 2018 at 05:38:43PM -0800, Darrick J. Wong wrote:
> > On Thu, Nov 08, 2018 at 09:04:41AM +1100, Dave Chinner wrote:
> > > On Wed, Nov 07, 2018 at 09:14:05AM -0800, Darrick J. Wong wrote:
> > > > On Wed, Nov 07, 2018 at 05:31:11PM +1100, Dave Chinner wrote:
> > > > > Hi folks,
> > > > > 
> > > > > We've had a fair number of problems reported on 64k block size
> > > > > filesystems of late, but none of the XFS developers have Power or
> > > > > ARM machines handy to reproduce them or even really test the fixes.
> > > > > 
> > > > > The iomap infrastructure we introduced a while back was designed
> > > > > with the capability of block size > page size support in mind, but we
> > > > > hadn't tried to implement it.
> > > > > 
> > > > > So after another 64k block size bug report late last week I said to
> > > > > Darrick "How hard could it be"?
> > > > 
> > > > "Nothing is ever simple" :)
> > > 
> > > "It'll only take a couple of minutes!"
> > > 
> > > > > About 6 billion (yes, B) fsx ops later, I have most of the XFS
> > > > > functionality working on 64k block sizes on x86_64.  Buffered
> > > > > read/write, mmap read/write and direct IO read/write all work. All
> > > > > the fallocate() operations work correctly, as does truncate. xfsdump
> > > > > and xfs_restore are happy with it, as is xfs_repair. xfs-scrub
> > > > > needed some help, but I've tested Darrick's fixes for that quite a
> > > > > bit over the past few days.
> > > > > 
> > > > > It passes most of xfstests - there's some test failures that I have
> > > > > to determine whether they are code bugs or test problems (i.e. some
> > > > > tests don't deal with 64k block sizes correctly or assume block size
> > > > > <= page size).
> > > > > 
> > > > > What I haven't tested yet is shared extents - the COW path,
> > > > > clone_file_range and dedupe_file_range. I discovered earlier today
> > > > > that fsx doesn't support copy/clone/dedupe_file_operations
> > > > > operations, so before I go any further I need to enhance fsx. Then
> > > > 
> > > > I assume that means you only tested this on reflink=0 filesystems?
> > > 
> > > Correct.
> > > 
> > > > Looking at fsstress, it looks like we don't test copy_file_range either.
> > > > I can try adding the missing clone/dedupe/copy to both programs, but
> > > > maybe you've already done that while I was asleep?
> > > 
> > > No, I haven't started on this yet. I've been sleeping. :P
> > 
> > I started wondering if we were missing anything from not having fsx
> > support clone/dedupe and ended up with:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fsstress-clone
> 
> Some fixes to that below.
> 
> I haven't got to testing dedupe or clone - copy_file_range explodes
> in under 40 operations on generic/263. do_splice_direct() looks
> to be broken in several different ways at this point.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> fsx: clean up copy/dedupe file range support.
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> copy_file_range() needs to obey read/write constraints otherwise it
> blows up when direct IO is used.
> 
> FIDEDUPERANGE has a completely screwed up API for error reporting.
> The ioctl succeeds even if dedupe fails, so you have to check
> every individual dedupe operation for failure. Without this, dedupe
> "succeeds" on kernels/filesystems that don't even support dedupe...
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  ltp/fsx.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/ltp/fsx.c b/ltp/fsx.c
> index fad50e0022af..b51910b8b2e1 100644
> --- a/ltp/fsx.c
> +++ b/ltp/fsx.c
> @@ -1382,7 +1382,11 @@ do_dedupe_range(unsigned offset, unsigned length, unsigned dest)
>  	fdr->info[0].dest_fd = fd;
>  	fdr->info[0].dest_offset = dest;
>  
> -	if (ioctl(fd, FIDEDUPERANGE, fdr) == -1) {
> +	if (ioctl(fd, FIDEDUPERANGE, fdr) == -1 ||
> +	    fdr->info[0].status < 0) {
> +		if (fdr->info[0].status < 0)
> +			errno = -fdr->info[0].status;
> +
>  		if (errno == EOPNOTSUPP || errno == ENOTTY) {
>  			if (!quiet && testcalls > simulatedopcount)
>  				prt("skipping unsupported dedupe range\n");
> @@ -1416,6 +1420,11 @@ do_copy_range(unsigned offset, unsigned length, unsigned dest)
>  	loff_t o1, o2;
>  	ssize_t nr;
>  
> +	offset -= offset % readbdy;
> +	dest -= dest % writebdy;
> +	if (o_direct)
> +		length -= length % readbdy;

Don't we want byte-granularity copies if we're doing buffered copies?

('Want' is such a strong word, maybe I don't want to find out what other
skeletons are lurking in do_splice_direct...)

--D

> +
>  	if (length == 0) {
>  		if (!quiet && testcalls > simulatedopcount)
>  			prt("skipping zero length copy range\n");


* Re: [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support
  2018-11-08 22:17         ` Darrick J. Wong
@ 2018-11-08 22:22           ` Dave Chinner
  0 siblings, 0 replies; 44+ messages in thread
From: Dave Chinner @ 2018-11-08 22:22 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel

On Thu, Nov 08, 2018 at 02:17:56PM -0800, Darrick J. Wong wrote:
> On Thu, Nov 08, 2018 at 08:04:32PM +1100, Dave Chinner wrote:
> > On Wed, Nov 07, 2018 at 05:38:43PM -0800, Darrick J. Wong wrote:
> > @@ -1416,6 +1420,11 @@ do_copy_range(unsigned offset, unsigned length, unsigned dest)
> >  	loff_t o1, o2;
> >  	ssize_t nr;
> >  
> > +	offset -= offset % readbdy;
> > +	dest -= dest % writebdy;
> > +	if (o_direct)
> > +		length -= length % readbdy;
> 
> Don't we want byte-granularity copies if we're doing buffered copies?

Yes, just don't set -r/-w values when doing buffered IO. O_DIRECT
requires the length to be aligned, too, but buffered IO doesn't,
which is why the "if (o_direct)" case is there.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 03/16] xfs: finobt AG reserves don't consider last AG can be a runt
  2018-11-07 16:55   ` Darrick J. Wong
@ 2018-11-09  0:21     ` Dave Chinner
  0 siblings, 0 replies; 44+ messages in thread
From: Dave Chinner @ 2018-11-09  0:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel

On Wed, Nov 07, 2018 at 08:55:08AM -0800, Darrick J. Wong wrote:
> On Wed, Nov 07, 2018 at 05:31:14PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The last AG may be very small compared to all other AGs, and hence
> > AG reservations based on the superblock AG size may actually consume
> > more space than the AG actually has. This results in assert failures
> > like:
> > 
> > XFS: Assertion failed: xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved + xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <= pag->pagf_freeblks + pag->pagf_flcount, file: fs/xfs/libxfs/xfs_ag_resv.c, line: 319
> > [   48.932891]  xfs_ag_resv_init+0x1bd/0x1d0
> > [   48.933853]  xfs_fs_reserve_ag_blocks+0x37/0xb0
> > [   48.934939]  xfs_mountfs+0x5b3/0x920
> > [   48.935804]  xfs_fs_fill_super+0x462/0x640
> > [   48.936784]  ? xfs_test_remount_options+0x60/0x60
> > [   48.937908]  mount_bdev+0x178/0x1b0
> > [   48.938751]  mount_fs+0x36/0x170
> > [   48.939533]  vfs_kern_mount.part.43+0x54/0x130
> > [   48.940596]  do_mount+0x20e/0xcb0
> > [   48.941396]  ? memdup_user+0x3e/0x70
> > [   48.942249]  ksys_mount+0xba/0xd0
> > [   48.943046]  __x64_sys_mount+0x21/0x30
> > [   48.943953]  do_syscall_64+0x54/0x170
> > [   48.944835]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > 
> > Hence we need to ensure the finobt per-ag space reservations take
> > into account the size of the last AG rather than treat it like all
> > the other full size AGs.
> > 
> > Note that both refcountbt and rmapbt already take the size of the AG
> > into account via reading the AGF length directly.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_ialloc_btree.c | 17 +++++++++++++----
> >  1 file changed, 13 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > index 86c50208a143..62014780d5e4 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > @@ -538,15 +538,24 @@ xfs_inobt_rec_check_count(
> >  
> >  static xfs_extlen_t
> >  xfs_inobt_max_size(
> > -	struct xfs_mount	*mp)
> > +	struct xfs_mount	*mp,
> > +	xfs_agnumber_t		agno)
> >  {
> > +	uint32_t		agblocks = mp->m_sb.sb_agblocks;
> > +
> >  	/* Bail out if we're uninitialized, which can happen in mkfs. */
> >  	if (mp->m_inobt_mxr[0] == 0)
> >  		return 0;
> >  
> > +	/* last AG can be a runt */
> > +	if (agno == mp->m_sb.sb_agcount - 1) {
> > +		div_u64_rem(mp->m_sb.sb_dblocks, mp->m_sb.sb_agblocks,
> > +				&agblocks);
> > +	}
> 
> Why not: agblocks = xfs_ag_block_count(mp, agno); ?
> 

forgot about that function :)

Will fix.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 01/16] xfs: drop ->writepage completely
  2018-11-07  6:31 ` [PATCH 01/16] xfs: drop ->writepage completely Dave Chinner
@ 2018-11-09 15:12   ` Christoph Hellwig
  2018-11-12 21:08     ` Dave Chinner
  0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2018-11-09 15:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-mm

[adding linux-mm to the CC list]

On Wed, Nov 07, 2018 at 05:31:12PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> ->writepage is only used in one place - single page writeback from
> memory reclaim. We only allow such writeback from kswapd, not from
> direct memory reclaim, and so it is rarely used. When it comes from
> kswapd, it is effectively random dirty page shoot-down, which is
> horrible for IO patterns. We will already have background writeback
> trying to clean all the dirty pages in memory as efficiently as
> possible, so having kswapd interrupt our well formed IO stream only
> slows things down. So get rid of xfs_vm_writepage() completely.

Interesting.  IFF we can pull this off it would simplify a lot of
things, so I'm generally in favor of it.

->writepage callers in generic code are:

 (1) mm/vmscan.c:pageout() - this is the kswapd (or direct reclaim) you
     mention above.  It basically does nothing in this case which isn't
     great, but the whole point of this patch..
 (2) mm/migrate.c:writeout() - this is only called if no ->migratepage
     method is present, but we have one in XFS, so we should be ok.

Plus a few pieces of code that are just library functions like
generic_writepages and mpage_writepages.


* Re: [PATCH 05/16] iomap: sub-block dio needs to zeroout beyond EOF
  2018-11-07  6:31 ` [PATCH 05/16] iomap: sub-block dio needs to zeroout beyond EOF Dave Chinner
@ 2018-11-09 15:15   ` Christoph Hellwig
  0 siblings, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2018-11-09 15:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Wed, Nov 07, 2018 at 05:31:16PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> If we are doing sub-block dio that extends EOF, we need to zero
> the unused tail of the block to initialise the data in it. If we
> do not zero the tail of the block, then an immediate mmap read of
> the EOF block will expose stale data beyond EOF to userspace. Found
> with fsx running sub-block DIO sizes vs MAPREAD/MAPWRITE operations.
> 
> Fix this by detecting if the end of the DIO write is beyond EOF
> and zeroing the tail if necessary.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/iomap.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 64ce240217a1..16d16596b00f 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -1676,7 +1676,14 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		dio->submit.cookie = submit_bio(bio);
>  	} while (nr_pages);
>  
> -	if (need_zeroout) {
> +	/*
> +	 * We need to zeroout the tail of a sub-block write if th extent type

s/th/the/

Otherwise this looks fine to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 06/16] iomap: support block size > page size for direct IO
  2018-11-07  6:31 ` [PATCH 06/16] iomap: support block size > page size for direct IO Dave Chinner
  2018-11-08 11:28   ` Nikolay Borisov
@ 2018-11-09 15:18   ` Christoph Hellwig
  2018-11-11  1:12     ` Dave Chinner
  1 sibling, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2018-11-09 15:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

>  static blk_qc_t
>  iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
>  		unsigned len)
>  {
>  	struct page *page = ZERO_PAGE(0);
>  	struct bio *bio;
> +	int npages = howmany(len, PAGE_SIZE);
> +
> +	WARN_ON_ONCE(npages > 16);

Where does this magic 16 come from?

> +	WARN_ON(len != 0);

WARN_ON_ONCE please to avoid making the log unreadable if it ever
triggers.

> +/*
> + * This is lifted almost straight from xfs_flush_unmap_range(). Need a generic
> + * version of the block size rounding for these purposes.
> + */

Can you just add a generic version of this in a separate patch and
also switch XFS over to it?

> +static int
> +iomap_flush_unmap_range(struct file *f, loff_t offset, loff_t len)

Can we please spell out file in the parameter name?


* Re: [PATCH 07/16] iomap: prepare buffered IO paths for block size > page size
  2018-11-07  6:31 ` [PATCH 07/16] iomap: prepare buffered IO paths for block size > page size Dave Chinner
@ 2018-11-09 15:19   ` Christoph Hellwig
  2018-11-11  1:15     ` Dave Chinner
  0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2018-11-09 15:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

> +/*
> + * Return the block size we should use for page cache based operations.
> + * This will return the inode block size for block size < PAGE_SIZE,
> + * otherwise it will return PAGE_SIZE.
> + */
> +static inline unsigned
> +iomap_pageblock_size(struct inode *inode)

Maybe call this a chunk size instead?


* Re: [PATCH 08/16] iomap: mode iomap_zero_range and friends
  2018-11-07  6:31 ` [PATCH 08/16] iomap: mode iomap_zero_range and friends Dave Chinner
@ 2018-11-09 15:19   ` Christoph Hellwig
  0 siblings, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2018-11-09 15:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

s/mode/move/ in the subject.


* Re: [PATCH 14/16] xfs: align writepages to large block sizes
  2018-11-07  6:31 ` [PATCH 14/16] xfs: align writepages to large block sizes Dave Chinner
@ 2018-11-09 15:22   ` Christoph Hellwig
  2018-11-11  1:20     ` Dave Chinner
  2018-11-14 14:19   ` Brian Foster
  1 sibling, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2018-11-09 15:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

> +	unsigned		bsize =	i_blocksize(mapping->host);
>  
>  	/*
>  	 * Refuse to write pages out if we are called from reclaim context.
> @@ -922,6 +923,19 @@ xfs_vm_writepages(
>  	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
>  		return 0;
>  
> +	/*
> +	 * If the block size is larger than page size, extent the incoming write
> +	 * request to fsb granularity and alignment. This is a requirement for
> +	 * data integrity operations and it doesn't hurt for other write
> +	 * operations, so do it unconditionally.
> +	 */
> +	if (wbc->range_start)
> +		wbc->range_start = round_down(wbc->range_start, bsize);
> +	if (wbc->range_end != LLONG_MAX)
> +		wbc->range_end = round_up(wbc->range_end, bsize);
> +	if (wbc->nr_to_write < wbc->range_end - wbc->range_start)
> +		wbc->nr_to_write = round_up(wbc->nr_to_write, bsize);

This looks fine to me, but I'd be much more comfortable if we had
it in the common writeback code instead of inside XFS.
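The rounding in the hunk above relies on the kernel's power-of-two `round_down()`/`round_up()` macros. As a minimal userspace sketch of the same alignment (the `ex_` names are illustrative stand-ins, and a 64k block size is assumed):

```c
#include <assert.h>

/* Power-of-two alignment helpers mirroring the kernel's
 * round_down()/round_up() macros (illustrative, not the kernel code). */
static long long ex_round_down(long long x, long long align)
{
	return x & ~(align - 1);
}

static long long ex_round_up(long long x, long long align)
{
	return (x + align - 1) & ~(align - 1);
}
```

With a 64k block, a writeback request for a byte range starting at 70000 would be widened to cover 65536..131072, so the whole filesystem block gets written.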


* Re: [PATCH 06/16] iomap: support block size > page size for direct IO
  2018-11-09 15:18   ` Christoph Hellwig
@ 2018-11-11  1:12     ` Dave Chinner
  0 siblings, 0 replies; 44+ messages in thread
From: Dave Chinner @ 2018-11-11  1:12 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel

On Fri, Nov 09, 2018 at 07:18:19AM -0800, Christoph Hellwig wrote:
> >  static blk_qc_t
> >  iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
> >  		unsigned len)
> >  {
> >  	struct page *page = ZERO_PAGE(0);
> >  	struct bio *bio;
> > +	int npages = howmany(len, PAGE_SIZE);
> > +
> > +	WARN_ON_ONCE(npages > 16);
> 
> Where does this magic 16 come from?

4k page size, 64k block size. Debug code, essentially.
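For illustration, the bound falls out of ceiling-dividing the zeroout length by the page size: with the 4k/64k combination above, one block spans at most 16 pages. A userspace sketch of the `howmany()` arithmetic (the constants and `ex_` names here are illustrative, not kernel macros):

```c
#include <assert.h>

/* Illustrative constants: 4k pages, 64k filesystem blocks. */
#define EX_PAGE_SIZE   4096u
#define EX_BLOCK_SIZE  65536u

/* Ceiling division, the same arithmetic howmany(len, PAGE_SIZE)
 * performs when sizing the zeroing bio. */
static unsigned ex_howmany(unsigned len, unsigned unit)
{
	return (len + unit - 1) / unit;
}
```

Since the zeroed tail can never exceed one block, npages can never exceed 64k/4k = 16 pages, hence the debug check.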

> > +	WARN_ON(len != 0);
> 
> WARN_ON_ONCE please to avoid making the log unreadable if it ever
> triggers.

Debug code, too, so it'll get removed eventually.

> > +/*
> > + * This is lifted almost straight from xfs_flush_unmap_range(). Need a generic
> > + * version of the block size rounding for these purposes.
> > + */
> 
> Can you just add a generic version of this in a separate patch and
> also switch XFS over to it?

Well, they do different things. The xfs code must truncate the page
cache over the range (because we are removing the underlying
storage) while this just attempts to invalidate the pages and simply
says "have a nice day" if it fails. 

So they really are two different functions. The comment was written
when I expected that I was going to need to do lots more block size
rounding for invalidation in the generic code, but it seems that
may not actually be necessary....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 07/16] iomap: prepare buffered IO paths for block size > page size
  2018-11-09 15:19   ` Christoph Hellwig
@ 2018-11-11  1:15     ` Dave Chinner
  0 siblings, 0 replies; 44+ messages in thread
From: Dave Chinner @ 2018-11-11  1:15 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel

On Fri, Nov 09, 2018 at 07:19:38AM -0800, Christoph Hellwig wrote:
> > +/*
> > + * Return the block size we should use for page cache based operations.
> > + * This will return the inode block size for block size < PAGE_SIZE,
> > + * otherwise it will return PAGE_SIZE.
> > + */
> > +static inline unsigned
> > +iomap_pageblock_size(struct inode *inode)
> 
> Maybe cal this a chunk size instead?

Sure.

/me has flashbacks to Irix and its "chunk cache" for efficient
filesystem block size state tracking and caching over the top of the
page cache.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 14/16] xfs: align writepages to large block sizes
  2018-11-09 15:22   ` Christoph Hellwig
@ 2018-11-11  1:20     ` Dave Chinner
  2018-11-11 16:32       ` Christoph Hellwig
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2018-11-11  1:20 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel

On Fri, Nov 09, 2018 at 07:22:18AM -0800, Christoph Hellwig wrote:
> > +	unsigned		bsize =	i_blocksize(mapping->host);
> >  
> >  	/*
> >  	 * Refuse to write pages out if we are called from reclaim context.
> > @@ -922,6 +923,19 @@ xfs_vm_writepages(
> >  	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> >  		return 0;
> >  
> > +	/*
> > +	 * If the block size is larger than page size, extent the incoming write
> > +	 * request to fsb granularity and alignment. This is a requirement for
> > +	 * data integrity operations and it doesn't hurt for other write
> > +	 * operations, so do it unconditionally.
> > +	 */
> > +	if (wbc->range_start)
> > +		wbc->range_start = round_down(wbc->range_start, bsize);
> > +	if (wbc->range_end != LLONG_MAX)
> > +		wbc->range_end = round_up(wbc->range_end, bsize);
> > +	if (wbc->nr_to_write < wbc->range_end - wbc->range_start)
> > +		wbc->nr_to_write = round_up(wbc->nr_to_write, bsize);
> 
> This looks fine to me, but I'd be much more comfortable if we had
> it in the common writeback code instead of inside XFS.

Where in the common code? there's quite a few places that can call
->writepages...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 14/16] xfs: align writepages to large block sizes
  2018-11-11  1:20     ` Dave Chinner
@ 2018-11-11 16:32       ` Christoph Hellwig
  0 siblings, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2018-11-11 16:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, linux-xfs, linux-fsdevel

On Sun, Nov 11, 2018 at 12:20:57PM +1100, Dave Chinner wrote:
> > > +	if (wbc->range_start)
> > > +		wbc->range_start = round_down(wbc->range_start, bsize);
> > > +	if (wbc->range_end != LLONG_MAX)
> > > +		wbc->range_end = round_up(wbc->range_end, bsize);
> > > +	if (wbc->nr_to_write < wbc->range_end - wbc->range_start)
> > > +		wbc->nr_to_write = round_up(wbc->nr_to_write, bsize);
> > 
> > This looks fine to me, but I'd be much more comfortable if we had
> > it in the common writeback code instead of inside XFS.
> 
> Where in the common code? there's quite a few places that can call
> ->writepages...

As far as I can tell the only place calling ->writepages is
do_writepages, which sounds like the right place to me.  Maybe
conditional on a block size > page size to reduce the scare factor.


* Re: [PATCH 01/16] xfs: drop ->writepage completely
  2018-11-09 15:12   ` Christoph Hellwig
@ 2018-11-12 21:08     ` Dave Chinner
  2021-02-02 20:51       ` Darrick J. Wong
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2018-11-12 21:08 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, linux-mm

On Fri, Nov 09, 2018 at 07:12:39AM -0800, Christoph Hellwig wrote:
> [adding linux-mm to the CC list]
> 
> On Wed, Nov 07, 2018 at 05:31:12PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > ->writepage is only used in one place - single page writeback from
> > memory reclaim. We only allow such writeback from kswapd, not from
> > direct memory reclaim, and so it is rarely used. When it comes from
> > kswapd, it is effectively random dirty page shoot-down, which is
> > horrible for IO patterns. We will already have background writeback
> > trying to clean all the dirty pages in memory as efficiently as
> > possible, so having kswapd interrupt our well formed IO stream only
> > slows things down. So get rid of xfs_vm_writepage() completely.
> 
> Interesting.  IFF we can pull this off it would simplify a lot of
> things, so I'm generally in favor of it.

Over the past few days of hammering on this, the only thing I've
noticed is that page reclaim hangs up less, but it's also putting a
bit more pressure on the shrinkers. Filesystem intensive workloads
that drive the machine into reclaim via the page cache seem to hit
breakdown conditions slightly earlier and the impact is that the
shrinkers are run harder. Mostly I see this as the XFS buffer cache
having a much harder time keeping a working set active.

However, while the workloads hit the working set cache, writeback
performance does seem to be slightly higher. That gain is being
offset by the deeper lows that come from the cache being turned
over.

So there's a bit of rebalancing to be done here as a followup, but
I've been unable to drive the system into unexpected OOM kills or
other bad behaviour as a result of removing ->writepage.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 14/16] xfs: align writepages to large block sizes
  2018-11-07  6:31 ` [PATCH 14/16] xfs: align writepages to large block sizes Dave Chinner
  2018-11-09 15:22   ` Christoph Hellwig
@ 2018-11-14 14:19   ` Brian Foster
  2018-11-14 21:18     ` Dave Chinner
  1 sibling, 1 reply; 44+ messages in thread
From: Brian Foster @ 2018-11-14 14:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Wed, Nov 07, 2018 at 05:31:25PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> For data integrity purposes, we need to write back the entire
> filesystem block when asked to sync a sub-block range of the file.
> When the filesystem block size is larger than the page size, this
> means we need to convert single page integrity writes into whole
> block integrity writes. We do this by extending the writepage range
> to filesystem block granularity and alignment.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index f6ef9e0a7312..5334f16be166 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -900,6 +900,7 @@ xfs_vm_writepages(
>  		.io_type = XFS_IO_HOLE,
>  	};
>  	int			ret;
> +	unsigned		bsize =	i_blocksize(mapping->host);
>  
>  	/*
>  	 * Refuse to write pages out if we are called from reclaim context.
> @@ -922,6 +923,19 @@ xfs_vm_writepages(
>  	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
>  		return 0;
>  
> +	/*
> +	 * If the block size is larger than page size, extent the incoming write
> +	 * request to fsb granularity and alignment. This is a requirement for
> +	 * data integrity operations and it doesn't hurt for other write
> +	 * operations, so do it unconditionally.
> +	 */
> +	if (wbc->range_start)
> +		wbc->range_start = round_down(wbc->range_start, bsize);
> +	if (wbc->range_end != LLONG_MAX)
> +		wbc->range_end = round_up(wbc->range_end, bsize);
> +	if (wbc->nr_to_write < wbc->range_end - wbc->range_start)
> +		wbc->nr_to_write = round_up(wbc->nr_to_write, bsize);
> +

This latter bit causes endless writeback loops in tests such as
generic/475 (I think I reproduced it with xfs/141 as well). The
writeback infrastructure samples ->nr_to_write before and after
->writepages() calls to identify progress. Unconditionally bumping it to
something larger than the original value can lead to an underflow in the
writeback code that seems to throw things off. E.g., see the following
wb tracepoints (w/ 4k block and page size):

   kworker/u8:13-189   [003] ...1   317.968147: writeback_single_inode_start: bdi 253:9: ino=8389005 state=I_DIRTY_PAGES|I_SYNC dirtied_when=4294773087 age=211 index=0 to_write=1024 wrote=0 cgroup_ino=4294967295
   kworker/u8:13-189   [003] ...1   317.968150: writeback_single_inode: bdi 253:9: ino=8389005 state=I_DIRTY_PAGES|I_SYNC dirtied_when=4294773087 age=211 index=0 to_write=1024 wrote=18446744073709548544 cgroup_ino=4294967295

The wrote value goes from 0 to garbage and writeback_sb_inodes() uses
the same basic calculation for 'wrote.'
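The garbage value is just unsigned wraparound: `wrote` is effectively the difference between the `nr_to_write` samples, so when ->writepages() raises `nr_to_write`, the subtraction goes negative and wraps. A hedged sketch (the 3072-page delta is inferred from the trace above, not taken from the writeback code):

```c
#include <assert.h>

/* The writeback code effectively computes:
 *     wrote = nr_to_write_before - nr_to_write_after
 * If ->writepages() bumped nr_to_write up by 3072 pages, the
 * unsigned difference wraps to 2^64 - 3072. */
static unsigned long ex_wrote(unsigned long before, unsigned long after)
{
	return before - after;
}
```

With before = 1024 and after = 4096, this yields 18446744073709548544, i.e. exactly the "wrote" value in the tracepoint output.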

BTW, I haven't gone through the broader set, but just looking at this
bit what's the purpose of rounding ->nr_to_write (which is a page count)
to a block size in the first place?

Brian

>  	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
>  	ret = write_cache_pages(mapping, wbc, xfs_do_writepage, &wpc);
>  	if (wpc.ioend)
> -- 
> 2.19.1
> 


* Re: [PATCH 14/16] xfs: align writepages to large block sizes
  2018-11-14 14:19   ` Brian Foster
@ 2018-11-14 21:18     ` Dave Chinner
  2018-11-15 12:55       ` Brian Foster
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2018-11-14 21:18 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel

On Wed, Nov 14, 2018 at 09:19:26AM -0500, Brian Foster wrote:
> On Wed, Nov 07, 2018 at 05:31:25PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > For data integrity purposes, we need to write back the entire
> > filesystem block when asked to sync a sub-block range of the file.
> > When the filesystem block size is larger than the page size, this
> > means we need to convert single page integrity writes into whole
> > block integrity writes. We do this by extending the writepage range
> > to filesystem block granularity and alignment.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_aops.c | 14 ++++++++++++++
> >  1 file changed, 14 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index f6ef9e0a7312..5334f16be166 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -900,6 +900,7 @@ xfs_vm_writepages(
> >  		.io_type = XFS_IO_HOLE,
> >  	};
> >  	int			ret;
> > +	unsigned		bsize =	i_blocksize(mapping->host);
> >  
> >  	/*
> >  	 * Refuse to write pages out if we are called from reclaim context.
> > @@ -922,6 +923,19 @@ xfs_vm_writepages(
> >  	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> >  		return 0;
> >  
> > +	/*
> > +	 * If the block size is larger than page size, extent the incoming write
> > +	 * request to fsb granularity and alignment. This is a requirement for
> > +	 * data integrity operations and it doesn't hurt for other write
> > +	 * operations, so do it unconditionally.
> > +	 */
> > +	if (wbc->range_start)
> > +		wbc->range_start = round_down(wbc->range_start, bsize);
> > +	if (wbc->range_end != LLONG_MAX)
> > +		wbc->range_end = round_up(wbc->range_end, bsize);
> > +	if (wbc->nr_to_write < wbc->range_end - wbc->range_start)
> > +		wbc->nr_to_write = round_up(wbc->nr_to_write, bsize);
> > +
> 
> This latter bit causes endless writeback loops in tests such as
> generic/475 (I think I reproduced it with xfs/141 as well). The

Yup, I've seen that, but haven't fixed it yet because I still
haven't climbed out of the dedupe/clone/copy file range data
corruption hole that fsx pulled the lid off.

Basically, I can't get back to working on bs > ps until I get the
stuff we actually support working correctly first...

> writeback infrastructure samples ->nr_to_write before and after
> ->writepages() calls to identify progress. Unconditionally bumping it to
> something larger than the original value can lead to an underflow in the
> writeback code that seems to throw things off. E.g., see the following
> wb tracepoints (w/ 4k block and page size):
> 
>    kworker/u8:13-189   [003] ...1   317.968147: writeback_single_inode_start: bdi 253:9: ino=8389005 state=I_DIRTY_PAGES|I_SYNC dirtied_when=4294773087 age=211 index=0 to_write=1024 wrote=0 cgroup_ino=4294967295
>    kworker/u8:13-189   [003] ...1   317.968150: writeback_single_inode: bdi 253:9: ino=8389005 state=I_DIRTY_PAGES|I_SYNC dirtied_when=4294773087 age=211 index=0 to_write=1024 wrote=18446744073709548544 cgroup_ino=4294967295
> 
> The wrote value goes from 0 to garbage and writeback_sb_inodes() uses
> the same basic calculation for 'wrote.'

Easy enough to fix, just stash the originals and restore them once
done.

> 
> BTW, I haven't gone through the broader set, but just looking at this
> bit what's the purpose of rounding ->nr_to_write (which is a page count)
> to a block size in the first place?

fsync on a single page range.

We write that page, allocate the block (which spans 16 pages), and
then return from writeback leaving 15/16 pages on that block still
dirty in memory.  Then we force the log, pushing the allocation and
metadata to disk.  Crash.

On recovery, we expose 15/16 pages of stale data because we only
wrote one of the pages over the block during fsync.
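As a sketch of that failure mode: with 4k pages and a 64k block, an fsync of a single-page range must be widened to the whole block, i.e. 16 pages, or recovery can expose the 15 unwritten ones. An illustrative helper (constants and names are assumptions, not from the patch):

```c
#include <assert.h>

#define EX_PAGE_SIZE  4096ull	/* assumed page size */
#define EX_BLOCK_SIZE 65536ull	/* assumed filesystem block size */

/* Number of pages that must be written back to cover the filesystem
 * block(s) spanning the byte range [pos, pos + len). */
static unsigned long long ex_pages_to_sync(unsigned long long pos,
					   unsigned long long len)
{
	unsigned long long start = pos & ~(EX_BLOCK_SIZE - 1);
	unsigned long long end = (pos + len + EX_BLOCK_SIZE - 1) &
				 ~(EX_BLOCK_SIZE - 1);

	return (end - start) / EX_PAGE_SIZE;
}
```

An fsync of one 4k page inside a block expands to 16 pages; a range straddling a block boundary expands to 32.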

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 14/16] xfs: align writepages to large block sizes
  2018-11-14 21:18     ` Dave Chinner
@ 2018-11-15 12:55       ` Brian Foster
  2018-11-16  6:19         ` Dave Chinner
  0 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2018-11-15 12:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Thu, Nov 15, 2018 at 08:18:18AM +1100, Dave Chinner wrote:
> On Wed, Nov 14, 2018 at 09:19:26AM -0500, Brian Foster wrote:
> > On Wed, Nov 07, 2018 at 05:31:25PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > For data integrity purposes, we need to write back the entire
> > > filesystem block when asked to sync a sub-block range of the file.
> > > When the filesystem block size is larger than the page size, this
> > > means we need to convert single page integrity writes into whole
> > > block integrity writes. We do this by extending the writepage range
> > > to filesystem block granularity and alignment.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_aops.c | 14 ++++++++++++++
> > >  1 file changed, 14 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > index f6ef9e0a7312..5334f16be166 100644
> > > --- a/fs/xfs/xfs_aops.c
> > > +++ b/fs/xfs/xfs_aops.c
> > > @@ -900,6 +900,7 @@ xfs_vm_writepages(
> > >  		.io_type = XFS_IO_HOLE,
> > >  	};
> > >  	int			ret;
> > > +	unsigned		bsize =	i_blocksize(mapping->host);
> > >  
> > >  	/*
> > >  	 * Refuse to write pages out if we are called from reclaim context.
> > > @@ -922,6 +923,19 @@ xfs_vm_writepages(
> > >  	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> > >  		return 0;
> > >  
> > > +	/*
> > > +	 * If the block size is larger than page size, extent the incoming write
> > > +	 * request to fsb granularity and alignment. This is a requirement for
> > > +	 * data integrity operations and it doesn't hurt for other write
> > > +	 * operations, so do it unconditionally.
> > > +	 */
> > > +	if (wbc->range_start)
> > > +		wbc->range_start = round_down(wbc->range_start, bsize);
> > > +	if (wbc->range_end != LLONG_MAX)
> > > +		wbc->range_end = round_up(wbc->range_end, bsize);
> > > +	if (wbc->nr_to_write < wbc->range_end - wbc->range_start)
> > > +		wbc->nr_to_write = round_up(wbc->nr_to_write, bsize);
> > > +
> > 
> > This latter bit causes endless writeback loops in tests such as
> > generic/475 (I think I reproduced it with xfs/141 as well). The
> 
> Yup, I've seen that, but haven't fixed it yet because I still
> haven't climbed out of the dedupe/clone/copy file range data
> corruption hole that fsx pulled the lid off.
> 
> Basically, I can't get back to working on bs > ps until I get the
> stuff we actually support working correctly first...
> 
> > writeback infrastructure samples ->nr_to_write before and after
> > ->writepages() calls to identify progress. Unconditionally bumping it to
> > something larger than the original value can lead to an underflow in the
> > writeback code that seems to throw things off. E.g., see the following
> > wb tracepoints (w/ 4k block and page size):
> > 
> >    kworker/u8:13-189   [003] ...1   317.968147: writeback_single_inode_start: bdi 253:9: ino=8389005 state=I_DIRTY_PAGES|I_SYNC dirtied_when=4294773087 age=211 index=0 to_write=1024 wrote=0 cgroup_ino=4294967295
> >    kworker/u8:13-189   [003] ...1   317.968150: writeback_single_inode: bdi 253:9: ino=8389005 state=I_DIRTY_PAGES|I_SYNC dirtied_when=4294773087 age=211 index=0 to_write=1024 wrote=18446744073709548544 cgroup_ino=4294967295
> > 
> > The wrote value goes from 0 to garbage and writeback_sb_inodes() uses
> > the same basic calculation for 'wrote.'
> 
> Easy enough to fix, just stash the originals and restore them once
> done.
> 
> > 
> > BTW, I haven't gone through the broader set, but just looking at this
> > bit what's the purpose of rounding ->nr_to_write (which is a page count)
> > to a block size in the first place?
> 
> fsync on a single page range.
> 
> We write that page, allocate the block (which spans 16 pages), and
> then return from writeback leaving 15/16 pages on that block still
> dirty in memory.  Then we force the log, pushing the allocation and
> metadata to disk.  Crash.
> 
> On recovery, we expose 15/16 pages of stale data because we only
> wrote one of the pages over the block during fsync.
> 

But why the block size and not something that represents pages per
block? Note again that ->nr_to_write is a page count and you're
comparing and aligning it with fields that are byte offsets.

Also, it looks like nr_to_write is set to LONG_MAX regardless of the
fsync range (whereas range_start/end are set appropriately). IOW, an
integrity write isn't limited by page count since it's critical to flush
the specified range. Unless I'm misunderstanding how this field is used
in the writeback code, ISTM that there shouldn't be any need to muck
with nr_to_write here for the purposes of fsync.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 14/16] xfs: align writepages to large block sizes
  2018-11-15 12:55       ` Brian Foster
@ 2018-11-16  6:19         ` Dave Chinner
  2018-11-16 13:29           ` Brian Foster
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2018-11-16  6:19 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel

On Thu, Nov 15, 2018 at 07:55:44AM -0500, Brian Foster wrote:
> On Thu, Nov 15, 2018 at 08:18:18AM +1100, Dave Chinner wrote:
> > On Wed, Nov 14, 2018 at 09:19:26AM -0500, Brian Foster wrote:
> > > On Wed, Nov 07, 2018 at 05:31:25PM +1100, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > For data integrity purposes, we need to write back the entire
> > > > filesystem block when asked to sync a sub-block range of the file.
> > > > When the filesystem block size is larger than the page size, this
> > > > means we need to convert single page integrity writes into whole
> > > > block integrity writes. We do this by extending the writepage range
> > > > to filesystem block granularity and alignment.
> > > > 
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > ---
> > > >  fs/xfs/xfs_aops.c | 14 ++++++++++++++
> > > >  1 file changed, 14 insertions(+)
> > > > 
> > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > > index f6ef9e0a7312..5334f16be166 100644
> > > > --- a/fs/xfs/xfs_aops.c
> > > > +++ b/fs/xfs/xfs_aops.c
> > > > @@ -900,6 +900,7 @@ xfs_vm_writepages(
> > > >  		.io_type = XFS_IO_HOLE,
> > > >  	};
> > > >  	int			ret;
> > > > +	unsigned		bsize =	i_blocksize(mapping->host);
> > > >  
> > > >  	/*
> > > >  	 * Refuse to write pages out if we are called from reclaim context.
> > > > @@ -922,6 +923,19 @@ xfs_vm_writepages(
> > > >  	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> > > >  		return 0;
> > > >  
> > > > +	/*
> > > > +	 * If the block size is larger than page size, extent the incoming write
> > > > +	 * request to fsb granularity and alignment. This is a requirement for
> > > > +	 * data integrity operations and it doesn't hurt for other write
> > > > +	 * operations, so do it unconditionally.
> > > > +	 */
> > > > +	if (wbc->range_start)
> > > > +		wbc->range_start = round_down(wbc->range_start, bsize);
> > > > +	if (wbc->range_end != LLONG_MAX)
> > > > +		wbc->range_end = round_up(wbc->range_end, bsize);
> > > > +	if (wbc->nr_to_write < wbc->range_end - wbc->range_start)
> > > > +		wbc->nr_to_write = round_up(wbc->nr_to_write, bsize);
> > > > +
> > > 
> > > This latter bit causes endless writeback loops in tests such as
> > > generic/475 (I think I reproduced it with xfs/141 as well). The
> > 
> > Yup, I've seen that, but haven't fixed it yet because I still
> > haven't climbed out of the dedupe/clone/copy file range data
> > corruption hole that fsx pulled the lid off.
> > 
> > Basically, I can't get back to working on bs > ps until I get the
> > stuff we actually support working correctly first...
> > 
> > > writeback infrastructure samples ->nr_to_write before and after
> > > ->writepages() calls to identify progress. Unconditionally bumping it to
> > > something larger than the original value can lead to an underflow in the
> > > writeback code that seems to throw things off. E.g., see the following
> > > wb tracepoints (w/ 4k block and page size):
> > > 
> > >    kworker/u8:13-189   [003] ...1   317.968147: writeback_single_inode_start: bdi 253:9: ino=8389005 state=I_DIRTY_PAGES|I_SYNC dirtied_when=4294773087 age=211 index=0 to_write=1024 wrote=0 cgroup_ino=4294967295
> > >    kworker/u8:13-189   [003] ...1   317.968150: writeback_single_inode: bdi 253:9: ino=8389005 state=I_DIRTY_PAGES|I_SYNC dirtied_when=4294773087 age=211 index=0 to_write=1024 wrote=18446744073709548544 cgroup_ino=4294967295
> > > 
> > > The wrote value goes from 0 to garbage and writeback_sb_inodes() uses
> > > the same basic calculation for 'wrote.'
> > 
> > Easy enough to fix, just stash the originals and restore them once
> > done.
> > 
> > > 
> > > BTW, I haven't gone through the broader set, but just looking at this
> > > bit what's the purpose of rounding ->nr_to_write (which is a page count)
> > > to a block size in the first place?
> > 
> > fsync on a single page range.
> > 
> > We write that page, allocate the block (which spans 16 pages), and
> > then return from writeback leaving 15/16 pages on that block still
> > dirty in memory.  Then we force the log, pushing the allocation and
> > metadata to disk.  Crash.
> > 
> > On recovery, we expose 15/16 pages of stale data because we only
> > wrote one of the pages over the block during fsync.
> > 
> 
> But why the block size and not something that represents pages per
> block? Note again that ->nr_to_write is a page count and you're
> comparing and aligning it with fields that are byte offsets.

Not sure I follow you - block size is directly related to the
number of pages per block. You can't calculate one without the
other...

/me goes back and looks at the code; finally it dawns on him that
the bug is he's rounding to block size, not (block size >>
PAGE_SHIFT). I'll fix that.
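The fix being described can be sketched like this: `nr_to_write` counts pages, so it must be rounded to the pages-per-block ratio, not to the byte block size (names here are illustrative; a PAGE_SHIFT of 12 is assumed):

```c
#include <assert.h>

#define EX_PAGE_SHIFT 12u	/* assumed: 4k pages */

/* Round a page count up to a whole number of filesystem blocks. */
static long ex_round_nr_to_write(long nr_to_write, unsigned bsize)
{
	long ppb = bsize >> EX_PAGE_SHIFT;	/* pages per block */

	return ((nr_to_write + ppb - 1) / ppb) * ppb;
}
```

For a 64k block size, ppb is 16, so a request to write 1 page becomes 16 and a request for 17 becomes 32; rounding to bsize (65536) instead would inflate the page count by a factor of PAGE_SIZE, which is the bug above.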

> IOW, an
> integrity write isn't limited by page count since it's critical to flush
> the specified range. Unless I'm misunderstanding how this field is used
> in the writeback code, ISTM that there shouldn't be any need to muck
> with nr_to_write here for the purposes of fsync.

Sorry, I thought you were asking about why we were rounding the
range in general, not nr_to_write specifically.

We don't need to modify nr_to_write for WB_SYNC_ALL,
write_cache_pages() will not terminate on a nr_to_write expiry. So,
unlike the range rounding, the nr_to_write rounding is really only
for WB_SYNC_NONE.

i.e. we round WB_SYNC_NONE to try to get whole blocks written so we
don't end up with partially written blocks on disk for extended
periods of time (e.g. between background writeback periods). It
doesn't matter for WB_SYNC_ALL, but it will reduce the potential for
stale data being exposed when crashes occur and random pages in
blocks haven't been written back. i.e. it's to help prevent
re-exposing the problematic cases that we added the "NULL files
after crash" workarounds for.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 14/16] xfs: align writepages to large block sizes
  2018-11-16  6:19         ` Dave Chinner
@ 2018-11-16 13:29           ` Brian Foster
  2018-11-19  1:14             ` Dave Chinner
  0 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2018-11-16 13:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Fri, Nov 16, 2018 at 05:19:36PM +1100, Dave Chinner wrote:
> On Thu, Nov 15, 2018 at 07:55:44AM -0500, Brian Foster wrote:
> > On Thu, Nov 15, 2018 at 08:18:18AM +1100, Dave Chinner wrote:
> > > On Wed, Nov 14, 2018 at 09:19:26AM -0500, Brian Foster wrote:
> > > > On Wed, Nov 07, 2018 at 05:31:25PM +1100, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > 
> > > > > For data integrity purposes, we need to write back the entire
> > > > > filesystem block when asked to sync a sub-block range of the file.
> > > > > When the filesystem block size is larger than the page size, this
> > > > > means we need to convert single page integrity writes into whole
> > > > > block integrity writes. We do this by extending the writepage range
> > > > > to filesystem block granularity and alignment.
> > > > > 
> > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > > ---
> > > > >  fs/xfs/xfs_aops.c | 14 ++++++++++++++
> > > > >  1 file changed, 14 insertions(+)
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > > > index f6ef9e0a7312..5334f16be166 100644
> > > > > --- a/fs/xfs/xfs_aops.c
> > > > > +++ b/fs/xfs/xfs_aops.c
> > > > > @@ -900,6 +900,7 @@ xfs_vm_writepages(
> > > > >  		.io_type = XFS_IO_HOLE,
> > > > >  	};
> > > > >  	int			ret;
> > > > > +	unsigned		bsize =	i_blocksize(mapping->host);
> > > > >  
> > > > >  	/*
> > > > >  	 * Refuse to write pages out if we are called from reclaim context.
> > > > > @@ -922,6 +923,19 @@ xfs_vm_writepages(
> > > > >  	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> > > > >  		return 0;
> > > > >  
> > > > > +	/*
> > > > > +	 * If the block size is larger than page size, extend the incoming write
> > > > > +	 * request to fsb granularity and alignment. This is a requirement for
> > > > > +	 * data integrity operations and it doesn't hurt for other write
> > > > > +	 * operations, so do it unconditionally.
> > > > > +	 */
> > > > > +	if (wbc->range_start)
> > > > > +		wbc->range_start = round_down(wbc->range_start, bsize);
> > > > > +	if (wbc->range_end != LLONG_MAX)
> > > > > +		wbc->range_end = round_up(wbc->range_end, bsize);
> > > > > +	if (wbc->nr_to_write < wbc->range_end - wbc->range_start)
> > > > > +		wbc->nr_to_write = round_up(wbc->nr_to_write, bsize);
> > > > > +
> > > > 
> > > > This latter bit causes endless writeback loops in tests such as
> > > > generic/475 (I think I reproduced it with xfs/141 as well). The
> > > 
> > > Yup, I've seen that, but haven't fixed it yet because I still
> > > haven't climbed out of the dedupe/clone/copy file range data
> > > corruption hole that fsx pulled the lid of.
> > > 
> > > Basically, I can't get back to working on bs > ps until I get the
> > > stuff we actually support working correctly first...
> > > 
> > > > writeback infrastructure samples ->nr_to_write before and after
> > > > ->writepages() calls to identify progress. Unconditionally bumping it to
> > > > something larger than the original value can lead to an underflow in the
> > > > writeback code that seems to throw things off. E.g., see the following
> > > > wb tracepoints (w/ 4k block and page size):
> > > > 
> > > >    kworker/u8:13-189   [003] ...1   317.968147: writeback_single_inode_start: bdi 253:9: ino=8389005 state=I_DIRTY_PAGES|I_SYNC dirtied_when=4294773087 age=211 index=0 to_write=1024 wrote=0 cgroup_ino=4294967295
> > > >    kworker/u8:13-189   [003] ...1   317.968150: writeback_single_inode: bdi 253:9: ino=8389005 state=I_DIRTY_PAGES|I_SYNC dirtied_when=4294773087 age=211 index=0 to_write=1024 wrote=18446744073709548544 cgroup_ino=4294967295
> > > > 
> > > > The wrote value goes from 0 to garbage and writeback_sb_inodes() uses
> > > > the same basic calculation for 'wrote.'
> > > 
> > > Easy enough to fix, just stash the originals and restore them once
> > > done.
> > > 
> > > > 
> > > > BTW, I haven't gone through the broader set, but just looking at this
> > > > bit what's the purpose of rounding ->nr_to_write (which is a page count)
> > > > to a block size in the first place?
> > > 
> > > fsync on a single page range.
> > > 
> > > We write that page, allocate the block (which spans 16 pages), and
> > > then return from writeback leaving 15/16 pages on that block still
> > > dirty in memory.  Then we force the log, pushing the allocation and
> > > metadata to disk.  Crash.
> > > 
> > > On recovery, we expose 15/16 pages of stale data because we only
> > > wrote one of the pages over the block during fsync.
> > > 
> > 
> > But why the block size and not something that represents pages per
> > block? Note again that ->nr_to_write is a page count and you're
> > comparing and aligning it with fields that are byte offsets.
> 
> Not sure I follow you - block size is directly related to the
> number of pages per block. You can't calculate one without the
> other...
> 
> /me goes back and looks at the code, finally it dawns on him that
> the bug is he's rounding to block size, not (block size >>
> PAGE_SHIFT). I'll fix that.
> 

Right.. and comparing it to range_end - range_start without conversion
as well.

> > IOW, an
> > integrity write isn't limited by page count since it's critical to flush
> > the specified range. Unless I'm misunderstanding how this field is used
> > in the writeback code, ISTM that there shouldn't be any need to muck
> > with nr_to_write here for the purposes of fsync.
> 
> Sorry, I thought you were asking about why we were rounding the
> range in general, not nr_to_write specifically.
> 

No worries.

> We don't need to modify nr_to_write for WB_SYNC_ALL,
> write_cache_pages() will not terminate on a nr_to_write expiry. So,
> unlike the range rounding, the nr_to_write rounding is really only
> for WB_SYNC_NONE.
> 

Ok, I was responding to the "fsync() on a single page range" use case,
but I see it wasn't clear I was asking specifically about ->nr_to_write
and not necessarily the broader change.

Then again, the commit log seems to focus on fsync exclusively, so some
more content could be included there at minimum.

> i.e. we round WB_SYNC_NONE to try to get whole blocks written so we
> don't end up with partially written blocks on disk for extended
> periods of time (e.g. between background writeback periods). It
> doesn't matter for WB_SYNC_ALL, but it will reduce the potential for
> stale data being exposed when crashes occur and random pages in
> blocks haven't been written back. i.e. it's to help prevent
> re-exposing the problematic cases that we added the "NULL files
> after crash" workarounds for.
> 

Ok. I see that there are earlier patches to do zero-around on sub-block
writes, so the idea makes a bit more sense with that in mind. That said,
I still don't grok how messing with nr_to_write is effective.

For background writeback (WB_SYNC_NONE), the range fields are clamped
out (0-LONG_MAX) since the location of pages to write is not really a
concern. In that case, ->nr_to_write is set based on some bandwidth
heuristic and the only change we make here is to round it. If we
consider the fact that any mapping itself may consist of some
combination of zeroed-around delalloc blocks (covered by an aligned
number of dirty pages) and already allocated/overwrite blocks (covered
by any number of dirty pages), how does a rounded ->nr_to_write actually
help us at all? Can't the magic ->nr_to_write value that prevents
stopping at a partially written sub-block page be unaligned to the block
size?

For integrity writeback (WB_SYNC_ALL), the range rounding makes sense to
me. In that case, the appropriate range_start/end values are set and
->nr_to_write is LONG_MAX because we don't want to limit on a page
count. So we round out the range to help ensure we don't do any
sub-block writes and as already discussed, ->nr_to_write is irrelevant
here.

Given the above, I don't see how tweaking ->nr_to_write really helps at
all even as an optimization. Unless I'm missing something else in the
earlier patches that facilitate this, ISTM that something more explicit
is required if you want to increase the odds that zeroed-around blocks
are written together. For example, perhaps try to detect this further
down in the writepage callout if we're about to stop in a bad spot, use
something like the old xfs_convert_page() mechanism to handle N pages
per-callout, or (as Christoph suggested), try to incorporate this kind
of block alignment consideration into write_cache_pages() itself. For
the latter, perhaps we could add a mod/align field or some such to the
wbc that the fs can use to instruct writeback to consider bs > ps
alignment explicitly.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 14/16] xfs: align writepages to large block sizes
  2018-11-16 13:29           ` Brian Foster
@ 2018-11-19  1:14             ` Dave Chinner
  0 siblings, 0 replies; 44+ messages in thread
From: Dave Chinner @ 2018-11-19  1:14 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel

On Fri, Nov 16, 2018 at 08:29:23AM -0500, Brian Foster wrote:
> On Fri, Nov 16, 2018 at 05:19:36PM +1100, Dave Chinner wrote:
> > i.e. we round WB_SYNC_NONE to try to get whole blocks written so we
> > don't end up with partially written blocks on disk for extended
> > periods of time (e.g. between background writeback periods). It
> > doesn't matter for WB_SYNC_ALL, but it will reduce the potential for
> > stale data being exposed when crashes occur and random pages in
> > blocks haven't been written back. i.e. it's to help prevent
> > re-exposing the problematic cases that we added the "NULL files
> > after crash" workarounds for.
> > 
> 
> Ok. I see that there are earlier patches to do zero-around on sub-block
> writes, so the idea makes a bit more sense with that in mind. That said,
> I still don't grok how messing with nr_to_write is effective.

Because background writeback is range_cyclic = 1, and that means we
always start at offset zero, and then if nr_to_write expires we
stash the page index we are up to in:

	mapping->writeback_index = done_index

And the next background writeback will start again from there.

Hence if nr_to_write is always rounded to the number of pages per
block, background writeback will /tend/ towards writing full blocks
because the writeback_index will always end up a multiple of pages
per block. Hence cyclic writeback will tend towards writing aligned,
full blocks when nr_to_write is rounded.

That's the fundamental concept here - write-in does "zero-around" to
initialise full blocks, writeback does "write-around" to push full
blocks to disk. WB_SYNC_ALL needs ranges to be rounded to do full
block writeback, WB_SYNC_NONE background writeback needs its range
cyclic behaviour to round to writing full blocks (and that's what
rounding nr_to_write is doing in this patch).

> For background writeback (WB_SYNC_NONE), the range fields are clamped
> out (0-LONG_MAX) since the location of pages to write is not really a
> concern. In that case, ->nr_to_write is set based on some bandwidth
> heuristic and the only change we make here is to round it. If we
> consider the fact that any mapping itself may consist of some
> combination of zeroed-around delalloc blocks (covered by an aligned
> number of dirty pages) and already allocated/overwrite blocks (covered
> by any number of dirty pages), how does a rounded ->nr_to_write actually
> help us at all? Can't the magic ->nr_to_write value that prevents
> stopping at a partially written sub-block page be unaligned to the block
> size?

Yup, I never intended for this RFC prototype to deal with all these
problems. That doesn't mean I'm not aware of them, nor that I don't
have a plan to deal with them.

> Given the above, I don't see how tweaking ->nr_to_write really helps at
> all even as an optimization. Unless I'm missing something else in the
> earlier patches that facilitate this, ISTM that something more explicit
> is required if you want to increase the odds that zeroed-around blocks
> are written together.

Which I've always intended as future work. I've spent about 14 hours
on this patch set so far - it's a functional prototype, not a
finished, completed piece of work.

I'm fully aware that this going to need a lot more work before it is
ready for merging. This is an early prototype I'm putting out there
for architectural/design review. i.e. don't bother nitpicking
unimplemented details or bugs, look for big picture things I've got
wrong. Look for showstoppers and fundamental problems, not things
that just require a little bit of time and coding to implement....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 01/16] xfs: drop ->writepage completely
  2018-11-12 21:08     ` Dave Chinner
@ 2021-02-02 20:51       ` Darrick J. Wong
  0 siblings, 0 replies; 44+ messages in thread
From: Darrick J. Wong @ 2021-02-02 20:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, linux-mm

On Tue, Nov 13, 2018 at 08:08:39AM +1100, Dave Chinner wrote:
> On Fri, Nov 09, 2018 at 07:12:39AM -0800, Christoph Hellwig wrote:
> > [adding linux-mm to the CC list]
> > 
> > On Wed, Nov 07, 2018 at 05:31:12PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > ->writepage is only used in one place - single page writeback from
> > > memory reclaim. We only allow such writeback from kswapd, not from
> > > direct memory reclaim, and so it is rarely used. When it comes from
> > > kswapd, it is effectively random dirty page shoot-down, which is
> > > horrible for IO patterns. We will already have background writeback
> > > trying to clean all the dirty pages in memory as efficiently as
> > > possible, so having kswapd interrupt our well formed IO stream only
> > > slows things down. So get rid of xfs_vm_writepage() completely.
> > 
> > Interesting.  IFF we can pull this off it would simplify a lot of
> > things, so I'm generally in favor of it.
> 
> Over the past few days of hammering on this, the only thing I've
> noticed is that page reclaim hangs up less, but it's also putting a
> bit more pressure on the shrinkers. Filesystem intensive workloads
> that drive the machine into reclaim via the page cache seem to hit
> breakdown conditions slightly earlier and the impact is that the
> shrinkers are run harder. Mostly I see this as the XFS buffer cache
> having a much harder time keeping a working set active.
> 
> However, while the workloads hit the working set cache, writeback
> performance does seem to be slightly higher. It is, however, being
> offset by the deeper lows that come from the cache being turned
> over.
> 
> So there's a bit of rebalancing to be done here as a followup, but
> I've been unable to drive the system into unexpected OOM kills or
> other bad behaviour as a result of removing ->writepage.

FWIW I've been running this patch in my development kernels as part of
exercising realtime reflink with rextsize > 1.  So far I haven't seen
any particularly adverse effects.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


end of thread, other threads:[~2021-02-02 20:52 UTC | newest]

Thread overview: 44+ messages
2018-11-07  6:31 [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Dave Chinner
2018-11-07  6:31 ` [PATCH 01/16] xfs: drop ->writepage completely Dave Chinner
2018-11-09 15:12   ` Christoph Hellwig
2018-11-12 21:08     ` Dave Chinner
2021-02-02 20:51       ` Darrick J. Wong
2018-11-07  6:31 ` [PATCH 02/16] xfs: move writepage context warnings to writepages Dave Chinner
2018-11-07  6:31 ` [PATCH 03/16] xfs: finobt AG reserves don't consider last AG can be a runt Dave Chinner
2018-11-07 16:55   ` Darrick J. Wong
2018-11-09  0:21     ` Dave Chinner
2018-11-07  6:31 ` [PATCH 04/16] xfs: extent shifting doesn't fully invalidate page cache Dave Chinner
2018-11-07  6:31 ` [PATCH 05/16] iomap: sub-block dio needs to zeroout beyond EOF Dave Chinner
2018-11-09 15:15   ` Christoph Hellwig
2018-11-07  6:31 ` [PATCH 06/16] iomap: support block size > page size for direct IO Dave Chinner
2018-11-08 11:28   ` Nikolay Borisov
2018-11-09 15:18   ` Christoph Hellwig
2018-11-11  1:12     ` Dave Chinner
2018-11-07  6:31 ` [PATCH 07/16] iomap: prepare buffered IO paths for block size > page size Dave Chinner
2018-11-09 15:19   ` Christoph Hellwig
2018-11-11  1:15     ` Dave Chinner
2018-11-07  6:31 ` [PATCH 08/16] iomap: mode iomap_zero_range and friends Dave Chinner
2018-11-09 15:19   ` Christoph Hellwig
2018-11-07  6:31 ` [PATCH 09/16] iomap: introduce zero-around functionality Dave Chinner
2018-11-07  6:31 ` [PATCH 10/16] iomap: enable zero-around for iomap_zero_range() Dave Chinner
2018-11-07  6:31 ` [PATCH 11/16] iomap: Don't mark partial pages zeroing uptodate for zero-around Dave Chinner
2018-11-07  6:31 ` [PATCH 12/16] iomap: zero-around in iomap_page_mkwrite Dave Chinner
2018-11-07  6:31 ` [PATCH 13/16] xfs: add zero-around controls to iomap Dave Chinner
2018-11-07  6:31 ` [PATCH 14/16] xfs: align writepages to large block sizes Dave Chinner
2018-11-09 15:22   ` Christoph Hellwig
2018-11-11  1:20     ` Dave Chinner
2018-11-11 16:32       ` Christoph Hellwig
2018-11-14 14:19   ` Brian Foster
2018-11-14 21:18     ` Dave Chinner
2018-11-15 12:55       ` Brian Foster
2018-11-16  6:19         ` Dave Chinner
2018-11-16 13:29           ` Brian Foster
2018-11-19  1:14             ` Dave Chinner
2018-11-07  6:31 ` [PATCH 15/16] xfs: expose block size in stat Dave Chinner
2018-11-07  6:31 ` [PATCH 16/16] xfs: enable block size larger than page size support Dave Chinner
2018-11-07 17:14 ` [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Darrick J. Wong
2018-11-07 22:04   ` Dave Chinner
2018-11-08  1:38     ` Darrick J. Wong
2018-11-08  9:04       ` Dave Chinner
2018-11-08 22:17         ` Darrick J. Wong
2018-11-08 22:22           ` Dave Chinner
