* [PATCH] [RFC] xfs: make xfs_writepage_map extent map centric
@ 2017-10-11  8:11 Dave Chinner
  2017-10-12 16:14 ` Brian Foster
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2017-10-11  8:11 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

xfs_writepage_map() iterates over the bufferheads on a page to decide
what sort of IO to do and what actions to take. However, when it comes
to reflink and deciding when it needs to execute a COW operation, we no
longer look at the bufferhead state but instead ignore it and look up
internal state held in the COW fork extent list.

This means xfs_writepage_map() is somewhat confused. It does stuff, then
ignores it, then tries to handle the impedance mismatch by shovelling the
results inside the existing mapping code. It works, but it's a bit of a mess and
it makes it hard to fix the cached map bug that the writepage code currently
has.

To unify the two different mechanisms, we first have to choose a direction.
That's already been set - we're de-emphasising bufferheads so they are no longer
a control structure, which we need to do to allow for their eventual removal. Hence
we need to move away from looking at bufferhead state to determine what
operations we need to perform.

We can't completely get rid of bufferheads yet - they do contain some state that
is absolutely necessary, such as whether that part of the page contains valid
data or not (buffer_uptodate()). Other state in the bufferhead is redundant:

	BH_dirty - the page is dirty, so we can ignore this and just write it
	BH_delay - we have delalloc extent info in the DATA fork extent tree
	BH_unwritten - same as BH_delay
	BH_mapped - indicates we've already used it once for IO and it's mapped
		    to a disk address. Needs to be ignored for COW blocks.

The BH_mapped flag is an interesting case - it's supposed to indicate that
it's already mapped to disk and so we can just use it "as is". In theory, we
don't even have to do an extent lookup to find where to write it to, but we
have to do that anyway to determine that we are actually writing over a valid extent.
Hence it's not even serving the purpose of avoiding an extent lookup during
writeback, and so we can pretty much ignore it. Especially as we have to ignore
it for COW operations...

Therefore, use the extent map as the source of information to tell us what
actions we need to take and what sort of IO we should perform. The first step
is to integrate xfs_map_blocks() and xfs_map_cow() and have xfs_map_blocks() set
the io type according to what it looks up. This means it can easily handle
both normal overwrite and COW cases. The only other thing we need to add is
the ability to return hole mappings.
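To make that behaviour concrete, here is a rough user-space sketch of the
classification — the struct, constants, and function names below are simplified
stand-ins invented for illustration, not the kernel's actual xfs_bmbt_irec or
XFS_IO_* definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel's sentinel start-block values. */
#define HOLESTARTBLOCK	((int64_t)-2)
#define DELAYSTARTBLOCK	((int64_t)-1)

enum io_type { IO_OVERWRITE, IO_UNWRITTEN, IO_DELALLOC, IO_COW, IO_HOLE };

struct irec {
	int64_t	start_block;	/* sentinel value or a real disk block */
	bool	unwritten;	/* extent is allocated but unwritten */
};

/*
 * Classify the IO type from the extent lookup result: a COW fork hit
 * wins outright, otherwise the data fork extent (or the lack of one)
 * decides, including the new "no mapping means hole" case.
 */
static enum io_type classify(const struct irec *imap, bool is_cow,
			     int nimaps)
{
	if (is_cow)
		return IO_COW;
	if (!nimaps || imap->start_block == HOLESTARTBLOCK)
		return IO_HOLE;
	if (imap->start_block == DELAYSTARTBLOCK)
		return IO_DELALLOC;
	return imap->unwritten ? IO_UNWRITTEN : IO_OVERWRITE;
}
```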

We need to return and cache hole mappings now for the case of multiple blocks
per page. We no longer use BH_mapped to indicate a block over a hole, so we
have to get that info from xfs_map_blocks(). We cache it so that holes that span
two pages don't need separate lookups. This allows us to avoid ever doing write
IO over a hole, too.
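The caching works because map validity is just a containment check on the
cached extent (the kernel's version is xfs_imap_valid(); the names in this
user-space sketch are illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct cached_map {
	uint64_t start_fsb;	/* first filesystem block of the mapping */
	uint64_t blockcount;	/* length in filesystem blocks */
};

/* Is this file offset (already converted to blocks) inside the map? */
static bool map_covers(const struct cached_map *map, uint64_t offset_fsb)
{
	return offset_fsb >= map->start_fsb &&
	       offset_fsb < map->start_fsb + map->blockcount;
}
```

With, say, a 16-block hole mapping cached while writing back the first page,
blocks of the following page fall inside the same mapping, so no second
lookup is needed.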

Further, we need to drop the XFS_BMAPI_IGSTATE flag so that we don't combine
contiguous written and unwritten extents into a single map. The io type needs to
match the extent type we are writing to so that we run the correct IO completion
routine for the IO. There is scope for optimisation that would allow us to
re-instate the XFS_BMAPI_IGSTATE flag, but this requires tweaks to code outside
the scope of this change.

Now that we have xfs_map_blocks() returning both a cached map and the type of IO
we need to perform, we can rewrite xfs_writepage_map() to drop all the
bufferhead control. It's also much simplified because it doesn't need to
explicitly handle COW operations. Instead of iterating bufferheads, it iterates
blocks within the page and then looks up what per-block state is required from
the appropriate bufferhead. It then validates the cached map, and if it's not
valid, we get a new map. If we don't get a valid map or it's over a hole, we
skip the block.
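Sketched in isolation (illustrative names, with the map lookup stubbed out as a
per-block table — the real code calls xfs_map_blocks() and rechecks
xfs_imap_valid()), the rewritten loop behaves roughly like:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE	4096
#define BLOCK_SIZE	1024	/* four blocks per page for this sketch */

enum io_type { IO_OVERWRITE, IO_HOLE };

struct wpc {			/* stand-in for xfs_writepage_ctx */
	bool		imap_valid;
	enum io_type	io_type;
};

/*
 * Walk the blocks in a page: skip !uptodate blocks (invalidating the
 * cached map so the next writable block forces a fresh lookup), look
 * up a new map when the cache is invalid, skip holes, and count the
 * blocks that would be queued for IO via xfs_add_to_ioend().
 */
static int writepage_map(struct wpc *wpc, const bool *uptodate,
			 const enum io_type *map_for_block)
{
	int count = 0;
	size_t poffset, off;

	for (poffset = 0, off = 0; poffset < PAGE_SIZE;
	     poffset += BLOCK_SIZE, off += BLOCK_SIZE) {
		if (!uptodate[poffset / BLOCK_SIZE]) {
			wpc->imap_valid = false;
			continue;
		}
		if (!wpc->imap_valid) {
			/* fresh lookup; assume it covers the rest here */
			wpc->io_type = map_for_block[poffset / BLOCK_SIZE];
			wpc->imap_valid = true;
		}
		if (wpc->io_type == IO_HOLE)
			continue;
		count++;		/* xfs_add_to_ioend() would go here */
	}
	return count;
}
```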

At this point, we have to remap the bufferhead via xfs_map_at_offset(). As
previously noted, we had to do this even if the buffer was already mapped as
the mapping would be stale for XFS_IO_DELALLOC, XFS_IO_UNWRITTEN and XFS_IO_COW
IO types. With xfs_map_blocks() now controlling the type, even XFS_IO_OVERWRITE
types need remapping, as converted-but-not-yet-written delalloc extents beyond
EOF can be reported as XFS_IO_OVERWRITE. Bufferheads that span such regions
still need their BH_Delay flags cleared and their block numbers calculated,
so we now unconditionally map each bufferhead before submission.

But wait! There's more - remember the old "treat unwritten extents as holes on
read" hack? Yeah, that means we can have a dirty page with unmapped, unwritten
bufferheads that contain data! What makes these so special is that the unwritten
"hole" bufferheads do not have a valid block device pointer, so if we attempt to
write them xfs_add_to_ioend() blows up. So we make xfs_map_at_offset() do the
"realtime or data device" lookup from the inode and ignore what was or wasn't
put into the bufferhead when the buffer was instantiated.

The astute reader will have realised by now that this code treats unwritten
extents in multiple-blocks-per-page situations differently. If we get any
combination of unwritten blocks on a dirty page that contain valid data in the
page, we're going to convert them to real extents. This can actually be a win,
because it means that pages with interleaving unwritten and written blocks will
get converted to a single written extent with zeros replacing the interspersed
unwritten blocks. This is actually good for reducing extent list and conversion
overhead, and it means we issue a contiguous IO instead of lots of little ones.
The downside is that we use up a little extra IO bandwidth. Neither of these
seems like a bad thing given that spinning disks are seek sensitive, and
SSDs/pmem have bandwidth to burn and the lower IO latency/CPU overhead of fewer,
larger IOs will result in better performance on them...

As a result of all this, the only state we actually care about from the
bufferhead is a single flag - BH_Uptodate. We still use the bufferhead to pass
some information to the bio via xfs_add_to_ioend(), but that is trivial to
separate and pass explicitly. This means we really only need 1 bit of state per
block per page from the buffered write path in the writeback path. Everything
else we do with the bufferhead is purely to make the buffered IO front end
continue to work correctly. i.e. we've pretty much marginalised bufferheads in
the writeback path completely.

This patch has smoke tested ok on 1k and 4k block filesystems with and without
reflink enabled, so it seems solid enough for initial comments to be made. Next
step is to get some kind of generation count or seqlock based cached map
invalidation into this code now that it's all unified.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c | 355 +++++++++++++++++++++++++++++-------------------------
 fs/xfs/xfs_aops.h |   4 +-
 2 files changed, 191 insertions(+), 168 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 067284d84d9e..44b80ee59cb4 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -372,84 +372,6 @@ xfs_end_bio(
 		xfs_destroy_ioend(ioend, blk_status_to_errno(bio->bi_status));
 }
 
-STATIC int
-xfs_map_blocks(
-	struct inode		*inode,
-	loff_t			offset,
-	struct xfs_bmbt_irec	*imap,
-	int			type)
-{
-	struct xfs_inode	*ip = XFS_I(inode);
-	struct xfs_mount	*mp = ip->i_mount;
-	ssize_t			count = i_blocksize(inode);
-	xfs_fileoff_t		offset_fsb, end_fsb;
-	int			error = 0;
-	int			bmapi_flags = XFS_BMAPI_ENTIRE;
-	int			nimaps = 1;
-
-	if (XFS_FORCED_SHUTDOWN(mp))
-		return -EIO;
-
-	ASSERT(type != XFS_IO_COW);
-	if (type == XFS_IO_UNWRITTEN)
-		bmapi_flags |= XFS_BMAPI_IGSTATE;
-
-	xfs_ilock(ip, XFS_ILOCK_SHARED);
-	ASSERT(ip->i_d.di_format != XFS_DINODE_FMT_BTREE ||
-	       (ip->i_df.if_flags & XFS_IFEXTENTS));
-	ASSERT(offset <= mp->m_super->s_maxbytes);
-
-	if (offset + count > mp->m_super->s_maxbytes)
-		count = mp->m_super->s_maxbytes - offset;
-	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
-	offset_fsb = XFS_B_TO_FSBT(mp, offset);
-	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
-				imap, &nimaps, bmapi_flags);
-	/*
-	 * Truncate an overwrite extent if there's a pending CoW
-	 * reservation before the end of this extent.  This forces us
-	 * to come back to writepage to take care of the CoW.
-	 */
-	if (nimaps && type == XFS_IO_OVERWRITE)
-		xfs_reflink_trim_irec_to_next_cow(ip, offset_fsb, imap);
-	xfs_iunlock(ip, XFS_ILOCK_SHARED);
-
-	if (error)
-		return error;
-
-	if (type == XFS_IO_DELALLOC &&
-	    (!nimaps || isnullstartblock(imap->br_startblock))) {
-		error = xfs_iomap_write_allocate(ip, XFS_DATA_FORK, offset,
-				imap);
-		if (!error)
-			trace_xfs_map_blocks_alloc(ip, offset, count, type, imap);
-		return error;
-	}
-
-#ifdef DEBUG
-	if (type == XFS_IO_UNWRITTEN) {
-		ASSERT(nimaps);
-		ASSERT(imap->br_startblock != HOLESTARTBLOCK);
-		ASSERT(imap->br_startblock != DELAYSTARTBLOCK);
-	}
-#endif
-	if (nimaps)
-		trace_xfs_map_blocks_found(ip, offset, count, type, imap);
-	return 0;
-}
-
-STATIC bool
-xfs_imap_valid(
-	struct inode		*inode,
-	struct xfs_bmbt_irec	*imap,
-	xfs_off_t		offset)
-{
-	offset >>= inode->i_blkbits;
-
-	return offset >= imap->br_startoff &&
-		offset < imap->br_startoff + imap->br_blockcount;
-}
-
 STATIC void
 xfs_start_buffer_writeback(
 	struct buffer_head	*bh)
@@ -666,6 +588,7 @@ xfs_map_buffer(
 
 	bh->b_blocknr = bn;
 	set_buffer_mapped(bh);
+
 }
 
 STATIC void
@@ -682,6 +605,14 @@ xfs_map_at_offset(
 	set_buffer_mapped(bh);
 	clear_buffer_delay(bh);
 	clear_buffer_unwritten(bh);
+
+	/*
+	 * If this is a realtime file, data may be on a different device
+	 * to the one the buffer_head b_bdev currently points at. We can't
+	 * trust that the bufferhead has already been mapped correctly, so
+	 * set the bdev now.
+	 */
+	bh->b_bdev = xfs_find_bdev_for_inode(inode);
 }
 
 /*
@@ -811,56 +742,142 @@ xfs_aops_discard_page(
 	return;
 }
 
-static int
-xfs_map_cow(
-	struct xfs_writepage_ctx *wpc,
+STATIC int
+xfs_map_blocks(
 	struct inode		*inode,
 	loff_t			offset,
-	unsigned int		*new_type)
+	struct xfs_bmbt_irec	*imap,
+	int			*type)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
-	struct xfs_bmbt_irec	imap;
+	struct xfs_mount	*mp = ip->i_mount;
+	ssize_t			count = i_blocksize(inode);
+	xfs_fileoff_t		offset_fsb, end_fsb;
+	int			error = 0;
+	int			bmapi_flags = XFS_BMAPI_ENTIRE;
+	int			nimaps = 1;
 	bool			is_cow = false;
-	int			error;
 
-	/*
-	 * If we already have a valid COW mapping keep using it.
-	 */
-	if (wpc->io_type == XFS_IO_COW) {
-		wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap, offset);
-		if (wpc->imap_valid) {
-			*new_type = XFS_IO_COW;
-			return 0;
+	if (XFS_FORCED_SHUTDOWN(mp))
+		return -EIO;
+
+	xfs_ilock(ip, XFS_ILOCK_SHARED);
+	ASSERT(ip->i_d.di_format != XFS_DINODE_FMT_BTREE ||
+	       (ip->i_df.if_flags & XFS_IFEXTENTS));
+	ASSERT(offset <= mp->m_super->s_maxbytes);
+
+	if (xfs_is_reflink_inode(XFS_I(inode)))
+		is_cow = xfs_reflink_find_cow_mapping(ip, offset, imap);
+
+	trace_printk("ino 0x%llx: off 0x%llx iscow %d\n",
+			(long long)inode->i_ino, offset, is_cow);
+	if (!is_cow) {
+		if (offset + count > mp->m_super->s_maxbytes)
+			count = mp->m_super->s_maxbytes - offset;
+		end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
+		offset_fsb = XFS_B_TO_FSBT(mp, offset);
+
+		/*
+		 * Note: we've dropped the XFS_BMAPI_IGSTATE flag here because
+		 * we don't know what kind of extent we are expecting. Hence
+		 * we will not merge contiguous written/unwritten extents
+		 * and instead report them separately. This is also required
+		 * so that we can set the io type correctly below.
+		 */
+		error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
+					imap, &nimaps, bmapi_flags);
+
+		if (!nimaps) {
+			/*
+			 * Lookup returns no match? Beyond eof? regardless,
+			 * return it as a hole so we don't write it
+			 */
+			imap->br_startoff = offset_fsb;
+			imap->br_blockcount = end_fsb - offset_fsb;
+			imap->br_startblock = HOLESTARTBLOCK;
+			*type = XFS_IO_HOLE;
+		} else if (imap->br_startblock == HOLESTARTBLOCK) {
+			/* landed in a hole */
+			*type = XFS_IO_HOLE;
+		} else if (isnullstartblock(imap->br_startblock)) {
+			/* got a delalloc extent */
+			*type = XFS_IO_DELALLOC;
+		} else {
+			/*
+			 * Got an existing extent for overwrite. Truncate it if
+			 * there's a pending CoW reservation before the end of
+			 * this extent.  This forces us to come back to
+			 * writepage to take care of the CoW.
+			 */
+			xfs_reflink_trim_irec_to_next_cow(ip, offset_fsb, imap);
+			if (imap->br_state == XFS_EXT_UNWRITTEN)
+				*type = XFS_IO_UNWRITTEN;
+			else
+				*type = XFS_IO_OVERWRITE;
 		}
+	} else {
+		/* got a cow mapping */
+		*type = XFS_IO_COW;
 	}
 
-	/*
-	 * Else we need to check if there is a COW mapping at this offset.
-	 */
-	xfs_ilock(ip, XFS_ILOCK_SHARED);
-	is_cow = xfs_reflink_find_cow_mapping(ip, offset, &imap);
+	trace_printk("ino 0x%llx: type %d, null %d, startoff 0x%llx\n",
+			(long long)inode->i_ino, *type,
+			isnullstartblock(imap->br_startblock),
+			imap->br_startoff);
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
+	if (error)
+		return error;
 
-	if (!is_cow)
-		return 0;
+	if (isnullstartblock(imap->br_startblock)) {
+		int	whichfork = 0;
 
-	/*
-	 * And if the COW mapping has a delayed extent here we need to
-	 * allocate real space for it now.
-	 */
-	if (isnullstartblock(imap.br_startblock)) {
-		error = xfs_iomap_write_allocate(ip, XFS_COW_FORK, offset,
-				&imap);
-		if (error)
-			return error;
+		switch (*type) {
+		case XFS_IO_DELALLOC:
+			whichfork = XFS_DATA_FORK;
+			break;
+		case XFS_IO_COW:
+			whichfork = XFS_COW_FORK;
+			break;
+		case XFS_IO_HOLE:
+			/* nothing to do! */
+			trace_xfs_map_blocks_found(ip, offset, count, *type, imap);
+			return 0;
+		default:
+			ASSERT(0);
+			return -EFSCORRUPTED;
+		}
+		error = xfs_iomap_write_allocate(ip, whichfork, offset,
+				imap);
+		if (!error)
+			trace_xfs_map_blocks_alloc(ip, offset, count, *type, imap);
+		return error;
 	}
 
-	wpc->io_type = *new_type = XFS_IO_COW;
-	wpc->imap_valid = true;
-	wpc->imap = imap;
+
+#ifdef DEBUG
+	if (*type == XFS_IO_UNWRITTEN) {
+		ASSERT(nimaps);
+		ASSERT(imap->br_startblock != HOLESTARTBLOCK);
+		ASSERT(imap->br_startblock != DELAYSTARTBLOCK);
+	}
+#endif
+	if (nimaps)
+		trace_xfs_map_blocks_found(ip, offset, count, *type, imap);
 	return 0;
 }
 
+STATIC bool
+xfs_imap_valid(
+	struct inode		*inode,
+	struct xfs_bmbt_irec	*imap,
+	xfs_off_t		offset)
+{
+	offset >>= inode->i_blkbits;
+
+	return offset >= imap->br_startoff &&
+		offset < imap->br_startoff + imap->br_blockcount;
+}
+
 /*
  * We implement an immediate ioend submission policy here to avoid needing to
  * chain multiple ioends and hence nest mempool allocations which can violate
@@ -883,89 +900,93 @@ xfs_writepage_map(
 	struct writeback_control *wbc,
 	struct inode		*inode,
 	struct page		*page,
-	loff_t			offset,
 	uint64_t              end_offset)
 {
 	LIST_HEAD(submit_list);
 	struct xfs_ioend	*ioend, *next;
-	struct buffer_head	*bh, *head;
+	struct buffer_head	*bh;
 	ssize_t			len = i_blocksize(inode);
 	int			error = 0;
 	int			count = 0;
-	int			uptodate = 1;
-	unsigned int		new_type;
+	bool			uptodate = true;
+	loff_t			file_offset;	/* file offset of page */
+	int			poffset;	/* offset into page */
 
-	bh = head = page_buffers(page);
-	offset = page_offset(page);
-	do {
-		if (offset >= end_offset)
+	/*
+	 * walk the blocks on the page, and if we run off the end of the
+	 * current map or find the current map invalid, grab a new one.
+	 * We only use bufferheads here to check per-block state - they no
+	 * longer control the iteration through the page. This allows us to
+	 * replace the bufferhead with some other state tracking mechanism in
+	 * future.
+	 */
+	file_offset = page_offset(page);
+	bh = page_buffers(page);
+	for (poffset = 0;
+	     poffset < PAGE_SIZE;
+	     poffset += len, file_offset += len, bh = bh->b_this_page) {
+
+		/* past the range we are writing, so nothing more to write. */
+		if (file_offset >= end_offset)
 			break;
-		if (!buffer_uptodate(bh))
-			uptodate = 0;
 
 		/*
-		 * set_page_dirty dirties all buffers in a page, independent
-		 * of their state.  The dirty state however is entirely
-		 * meaningless for holes (!mapped && uptodate), so skip
-		 * buffers covering holes here.
+		 * Block does not contain valid data, skip it, mark the current
+		 * map as invalid because we have a discontiguity. This ensures
+		 * we put subsequent writeable buffers into a new ioend.
 		 */
-		if (!buffer_mapped(bh) && buffer_uptodate(bh)) {
-			wpc->imap_valid = false;
-			continue;
-		}
-
-		if (buffer_unwritten(bh))
-			new_type = XFS_IO_UNWRITTEN;
-		else if (buffer_delay(bh))
-			new_type = XFS_IO_DELALLOC;
-		else if (buffer_uptodate(bh))
-			new_type = XFS_IO_OVERWRITE;
-		else {
+		if (!buffer_uptodate(bh)) {
 			if (PageUptodate(page))
 				ASSERT(buffer_mapped(bh));
-			/*
-			 * This buffer is not uptodate and will not be
-			 * written to disk.  Ensure that we will put any
-			 * subsequent writeable buffers into a new
-			 * ioend.
-			 */
-			wpc->imap_valid = false;
-			continue;
-		}
 
-		if (xfs_is_reflink_inode(XFS_I(inode))) {
-			error = xfs_map_cow(wpc, inode, offset, &new_type);
-			if (error)
-				goto out;
-		}
-
-		if (wpc->io_type != new_type) {
-			wpc->io_type = new_type;
+			uptodate = false;
 			wpc->imap_valid = false;
+			continue;
 		}
 
+		/* Check to see if current map spans this file offset */
 		if (wpc->imap_valid)
 			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
-							 offset);
+							 file_offset);
+
+		/*
+		 * If we don't have a valid map, now it's time to get
+		 * a new one for this offset. This will allocate delalloc
+		 * regions or COW shared extents. If we return without a valid
+		 * map, it means we landed in a hole and we skip the block.
+		 */
 		if (!wpc->imap_valid) {
-			error = xfs_map_blocks(inode, offset, &wpc->imap,
-					     wpc->io_type);
+			error = xfs_map_blocks(inode, file_offset, &wpc->imap,
+					     &wpc->io_type);
 			if (error)
 				goto out;
 			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
-							 offset);
+							 file_offset);
 		}
-		if (wpc->imap_valid) {
-			lock_buffer(bh);
-			if (wpc->io_type != XFS_IO_OVERWRITE)
-				xfs_map_at_offset(inode, bh, &wpc->imap, offset);
-			xfs_add_to_ioend(inode, bh, offset, wpc, wbc, &submit_list);
-			count++;
+
+trace_printk("ino 0x%llx: valid %d, type %d, fileoff 0x%llx bdev %p/0x%lx\n",
+			(long long)inode->i_ino, wpc->imap_valid, wpc->io_type,
+			(long long)file_offset, bh->b_bdev, bh->b_state);
+
+		if (!wpc->imap_valid || wpc->io_type == XFS_IO_HOLE) {
+			/*
+			 * set_page_dirty dirties all buffers in a page, independent
+			 * of their state.  The dirty state however is entirely
+			 * meaningless for holes (!mapped && uptodate), so assert
+			 * that the buffer covers a hole here and continue.
+			 */
+			ASSERT(!buffer_mapped(bh));
+			continue;
 		}
 
-	} while (offset += len, ((bh = bh->b_this_page) != head));
+		lock_buffer(bh);
+
+		xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
+		xfs_add_to_ioend(inode, bh, file_offset, wpc, wbc, &submit_list);
+		count++;
+	}
 
-	if (uptodate && bh == head)
+	if (uptodate && poffset == PAGE_SIZE)
 		SetPageUptodate(page);
 
 	ASSERT(wpc->ioend || list_empty(&submit_list));
@@ -1133,7 +1154,7 @@ xfs_do_writepage(
 		end_offset = offset;
 	}
 
-	return xfs_writepage_map(wpc, wbc, inode, page, offset, end_offset);
+	return xfs_writepage_map(wpc, wbc, inode, page, end_offset);
 
 redirty:
 	redirty_page_for_writepage(wbc, page);
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index 88c85ea63da0..db289ed164f6 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -29,6 +29,7 @@ enum {
 	XFS_IO_UNWRITTEN,	/* covers allocated but uninitialized data */
 	XFS_IO_OVERWRITE,	/* covers already allocated extent */
 	XFS_IO_COW,		/* covers copy-on-write extent */
+	XFS_IO_HOLE,		/* cached hole */
 };
 
 #define XFS_IO_TYPES \
@@ -36,7 +37,8 @@ enum {
 	{ XFS_IO_DELALLOC,		"delalloc" }, \
 	{ XFS_IO_UNWRITTEN,		"unwritten" }, \
 	{ XFS_IO_OVERWRITE,		"overwrite" }, \
-	{ XFS_IO_COW,			"CoW" }
+	{ XFS_IO_COW,			"CoW" }, \
+	{ XFS_IO_HOLE,			"hole" }
 
 /*
  * Structure for buffered I/O completions.
-- 
2.15.0.rc0



* Re: [PATCH] [RFC] xfs: make xfs_writepage_map extent map centric
  2017-10-11  8:11 [PATCH] [RFC] xfs: make xfs_writepage_map extent map centric Dave Chinner
@ 2017-10-12 16:14 ` Brian Foster
  2017-10-12 23:22   ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Brian Foster @ 2017-10-12 16:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Oct 11, 2017 at 07:11:31PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> [snip commit message, quoted in full above]
> ---

This seems sane to me on a first look. You may be planning this anyways
since this is an RFC, but could we split up the patches into separate
logical changes? For example, folding xfs_map_cow() into
xfs_map_blocks() seems like it could be a separate patch.

A few more comments below...

>  fs/xfs/xfs_aops.c | 355 +++++++++++++++++++++++++++++-------------------------
>  fs/xfs/xfs_aops.h |   4 +-
>  2 files changed, 191 insertions(+), 168 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 067284d84d9e..44b80ee59cb4 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
...
> @@ -666,6 +588,7 @@ xfs_map_buffer(
>  
>  	bh->b_blocknr = bn;
>  	set_buffer_mapped(bh);
> +

Extra whitespace.

>  }
>  
>  STATIC void
...
> @@ -811,56 +742,142 @@ xfs_aops_discard_page(
>  	return;
>  }
>  
> -static int
> -xfs_map_cow(
> -	struct xfs_writepage_ctx *wpc,
> +STATIC int
> +xfs_map_blocks(
>  	struct inode		*inode,
>  	loff_t			offset,
> -	unsigned int		*new_type)
> +	struct xfs_bmbt_irec	*imap,
> +	int			*type)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
> -	struct xfs_bmbt_irec	imap;
> +	struct xfs_mount	*mp = ip->i_mount;
> +	ssize_t			count = i_blocksize(inode);
> +	xfs_fileoff_t		offset_fsb, end_fsb;
> +	int			error = 0;
> +	int			bmapi_flags = XFS_BMAPI_ENTIRE;
> +	int			nimaps = 1;
>  	bool			is_cow = false;
> -	int			error;
>  
> -	/*
> -	 * If we already have a valid COW mapping keep using it.
> -	 */
> -	if (wpc->io_type == XFS_IO_COW) {
> -		wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap, offset);
> -		if (wpc->imap_valid) {
> -			*new_type = XFS_IO_COW;
> -			return 0;
> +	if (XFS_FORCED_SHUTDOWN(mp))
> +		return -EIO;
> +
> +	xfs_ilock(ip, XFS_ILOCK_SHARED);
> +	ASSERT(ip->i_d.di_format != XFS_DINODE_FMT_BTREE ||
> +	       (ip->i_df.if_flags & XFS_IFEXTENTS));
> +	ASSERT(offset <= mp->m_super->s_maxbytes);
> +
> +	if (xfs_is_reflink_inode(XFS_I(inode)))
> +		is_cow = xfs_reflink_find_cow_mapping(ip, offset, imap);
> +

Maybe do something like the following here:

		if (is_cow) {
			*type = XFS_IO_COW;
			goto do_alloc;
		}

... so we can reduce the indentation of the large !cow hunk below?

> +	trace_printk("ino 0x%llx: off 0x%llx iscow %d\n",
> +			(long long)inode->i_ino, offset, is_cow);
> +	if (!is_cow) {
> +		if (offset + count > mp->m_super->s_maxbytes)
> +			count = mp->m_super->s_maxbytes - offset;
> +		end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
> +		offset_fsb = XFS_B_TO_FSBT(mp, offset);
> +
> +		/*
> +		 * Note: we've dropped the XFS_BMAPI_IGSTATE flag here because
> +		 * we don't know what kind of extent we are expecting. Hence
> +		 * we will not merge contiguous written/unwritten extents
> +		 * and instead report them separately. This is also required
> +		 * so that we can set the io type correctly below.
> +		 */
> +		error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
> +					imap, &nimaps, bmapi_flags);
> +
> +		if (!nimaps) {
> +			/*
> +			 * Lookup returns no match? Beyond eof? regardless,
> +			 * return it as a hole so we don't write it
> +			 */
> +			imap->br_startoff = offset_fsb;
> +			imap->br_blockcount = end_fsb - offset_fsb;
> +			imap->br_startblock = HOLESTARTBLOCK;
> +			*type = XFS_IO_HOLE;
> +		} else if (imap->br_startblock == HOLESTARTBLOCK) {
> +			/* landed in a hole */
> +			*type = XFS_IO_HOLE;
> +		} else if (isnullstartblock(imap->br_startblock)) {
> +			/* got a delalloc extent */
> +			*type = XFS_IO_DELALLOC;
> +		} else {
> +			/*
> +			 * Got an existing extent for overwrite. Truncate it if
> +			 * there's a pending CoW reservation before the end of
> +			 * this extent.  This forces us to come back to
> +			 * writepage to take care of the CoW.
> +			 */
> +			xfs_reflink_trim_irec_to_next_cow(ip, offset_fsb, imap);
> +			if (imap->br_state == XFS_EXT_UNWRITTEN)
> +				*type = XFS_IO_UNWRITTEN;
> +			else
> +				*type = XFS_IO_OVERWRITE;
>  		}
> +	} else {
> +		/* got a cow mapping */
> +		*type = XFS_IO_COW;
>  	}
>  
> -	/*
> -	 * Else we need to check if there is a COW mapping at this offset.
> -	 */
> -	xfs_ilock(ip, XFS_ILOCK_SHARED);
> -	is_cow = xfs_reflink_find_cow_mapping(ip, offset, &imap);
> +	trace_printk("ino 0x%llx: type %d, null %d, startoff 0x%llx\n",
> +			(long long)inode->i_ino, *type,
> +			isnullstartblock(imap->br_startblock),
> +			imap->br_startoff);
>  	xfs_iunlock(ip, XFS_ILOCK_SHARED);
> +	if (error)
> +		return error;
>  
> -	if (!is_cow)
> -		return 0;
> +	if (isnullstartblock(imap->br_startblock)) {
> +		int	whichfork = 0;
>  
> -	/*
> -	 * And if the COW mapping has a delayed extent here we need to
> -	 * allocate real space for it now.
> -	 */
> -	if (isnullstartblock(imap.br_startblock)) {
> -		error = xfs_iomap_write_allocate(ip, XFS_COW_FORK, offset,
> -				&imap);
> -		if (error)
> -			return error;
> +		switch (*type) {
> +		case XFS_IO_DELALLOC:
> +			whichfork = XFS_DATA_FORK;
> +			break;
> +		case XFS_IO_COW:
> +			whichfork = XFS_COW_FORK;
> +			break;

The logic seems a little verbose here. Could we set whichfork =
XFS_COW_FORK in the same hunk I suggested adding above? Initialize
whichfork to XFS_DATA_FORK at the top of the function and that covers
the above two cases.

> +		case XFS_IO_HOLE:
> +			/* nothing to do! */
> +			trace_xfs_map_blocks_found(ip, offset, count, *type, imap);
> +			return 0;

I'm a little confused about this case. How can this happen, and why do
we use the _blocks_found() tracepoint?

> +		default:
> +			ASSERT(0);
> +			return -EFSCORRUPTED;
> +		}

Previous question aside, it looks like doing something like:

	if (isnullstartblock(..) &&
	    (*type == XFS_IO_DELALLOC || *type == XFS_IO_COW)) {
		...
	}

... and doing some of the corruption checks separately could then
eliminate this switch statement entirely.
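
For what it's worth, the simplification being suggested could be modelled roughly like this. This is a stand-alone sketch with stubbed-in enum values, not the kernel code — the real io types and fork constants live in fs/xfs/xfs_aops.h and libxfs, and `alloc_fork()` is a hypothetical name for illustration only:

```c
#include <stdbool.h>

/* Stand-ins for the kernel enums; values are illustrative only. */
enum xfs_io_type { XFS_IO_DELALLOC, XFS_IO_UNWRITTEN, XFS_IO_OVERWRITE,
		   XFS_IO_COW, XFS_IO_HOLE };
enum xfs_fork { XFS_NO_FORK = -1, XFS_DATA_FORK = 0, XFS_COW_FORK = 2 };

/*
 * With whichfork initialised to XFS_DATA_FORK at the top of the function
 * and flipped to XFS_COW_FORK when the COW branch is taken, the switch
 * collapses to a simple guard: only delalloc and COW extents need an
 * allocation; everything else has nothing to allocate.
 */
static enum xfs_fork alloc_fork(enum xfs_io_type type)
{
	enum xfs_fork whichfork = XFS_DATA_FORK;	/* covers DELALLOC */

	if (type == XFS_IO_COW)
		whichfork = XFS_COW_FORK;
	else if (type != XFS_IO_DELALLOC)
		return XFS_NO_FORK;	/* hole/overwrite/unwritten */
	return whichfork;
}
```

The corruption checks (the old `default: ASSERT(0)` arm) would then sit separately, as top-level sanity checks on *type vs. the startblock state.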

> +		error = xfs_iomap_write_allocate(ip, whichfork, offset,
> +				imap);
> +		if (!error)
> +			trace_xfs_map_blocks_alloc(ip, offset, count, *type, imap);
> +		return error;
>  	}
>  
> -	wpc->io_type = *new_type = XFS_IO_COW;
> -	wpc->imap_valid = true;
> -	wpc->imap = imap;
> +
> +#ifdef DEBUG
> +	if (*type == XFS_IO_UNWRITTEN) {
> +		ASSERT(nimaps);
> +		ASSERT(imap->br_startblock != HOLESTARTBLOCK);
> +		ASSERT(imap->br_startblock != DELAYSTARTBLOCK);
> +	}
> +#endif
> +	if (nimaps)
> +		trace_xfs_map_blocks_found(ip, offset, count, *type, imap);
>  	return 0;
>  }
>  
...
> @@ -883,89 +900,93 @@ xfs_writepage_map(
>  	struct writeback_control *wbc,
>  	struct inode		*inode,
>  	struct page		*page,
> -	loff_t			offset,
>  	uint64_t              end_offset)
>  {
>  	LIST_HEAD(submit_list);
>  	struct xfs_ioend	*ioend, *next;
> -	struct buffer_head	*bh, *head;
> +	struct buffer_head	*bh;
>  	ssize_t			len = i_blocksize(inode);
>  	int			error = 0;
>  	int			count = 0;
> -	int			uptodate = 1;
> -	unsigned int		new_type;
> +	bool			uptodate = true;
> +	loff_t			file_offset;	/* file offset of page */
> +	int			poffset;	/* offset into page */
>  
> -	bh = head = page_buffers(page);
> -	offset = page_offset(page);
> -	do {
> -		if (offset >= end_offset)
> +	/*
> +	 * walk the blocks on the page, and if we run off the end of the
> +	 * current map or find the current map invalid, grab a new one.
> +	 * We only use bufferheads here to check per-block state - they no
> +	 * longer control the iteration through the page. This allows us to
> +	 * replace the bufferhead with some other state tracking mechanism in
> +	 * future.
> +	 */
> +	file_offset = page_offset(page);
> +	bh = page_buffers(page);
> +	for (poffset = 0;
> +	     poffset < PAGE_SIZE;
> +	     poffset += len, file_offset += len, bh = bh->b_this_page) {
> +
> +		/* past the range we are writing, so nothing more to write. */
> +		if (file_offset >= end_offset)
>  			break;
...
> +
> +		if (!wpc->imap_valid || wpc->io_type == XFS_IO_HOLE) {
> +			/*
> +			 * set_page_dirty dirties all buffers in a page, independent
> +			 * of their state.  The dirty state however is entirely
> +			 * meaningless for holes (!mapped && uptodate), so check we did
> +			 * have a buffer covering a hole here and continue.
> +			 */
> +			ASSERT(!buffer_mapped(bh));
> +			continue;
>  		}

IIRC, the old (broken) write over hole situation would allocate a block,
potentially yell about a block reservation overrun and ultimately write
the page. Now it looks like we silently (aside from the assert on dev
kernels) skip the writeback and clean the page. Unless I've missed that
being handled somewhere, that seems like a slightly more severe error
from the user perspective. Perhaps we should have a noiser warning here
or even consider that a writeback error?

Brian

>  
> -	} while (offset += len, ((bh = bh->b_this_page) != head));
> +		lock_buffer(bh);
> +
> +		xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
> +		xfs_add_to_ioend(inode, bh, file_offset, wpc, wbc, &submit_list);
> +		count++;
> +	}
>  
> -	if (uptodate && bh == head)
> +	if (uptodate && poffset == PAGE_SIZE)
>  		SetPageUptodate(page);
>  
>  	ASSERT(wpc->ioend || list_empty(&submit_list));
> @@ -1133,7 +1154,7 @@ xfs_do_writepage(
>  		end_offset = offset;
>  	}
>  
> -	return xfs_writepage_map(wpc, wbc, inode, page, offset, end_offset);
> +	return xfs_writepage_map(wpc, wbc, inode, page, end_offset);
>  
>  redirty:
>  	redirty_page_for_writepage(wbc, page);
> diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> index 88c85ea63da0..db289ed164f6 100644
> --- a/fs/xfs/xfs_aops.h
> +++ b/fs/xfs/xfs_aops.h
> @@ -29,6 +29,7 @@ enum {
>  	XFS_IO_UNWRITTEN,	/* covers allocated but uninitialized data */
>  	XFS_IO_OVERWRITE,	/* covers already allocated extent */
>  	XFS_IO_COW,		/* covers copy-on-write extent */
> +	XFS_IO_HOLE,		/* cached hole */
>  };
>  
>  #define XFS_IO_TYPES \
> @@ -36,7 +37,8 @@ enum {
>  	{ XFS_IO_DELALLOC,		"delalloc" }, \
>  	{ XFS_IO_UNWRITTEN,		"unwritten" }, \
>  	{ XFS_IO_OVERWRITE,		"overwrite" }, \
> -	{ XFS_IO_COW,			"CoW" }
> +	{ XFS_IO_COW,			"CoW" }, \
> +	{ XFS_IO_HOLE,			"hole" }
>  
>  /*
>   * Structure for buffered I/O completions.
> -- 
> 2.15.0.rc0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] [RFC] xfs: make xfs_writepage_map extent map centric
  2017-10-12 16:14 ` Brian Foster
@ 2017-10-12 23:22   ` Dave Chinner
  2017-10-13 13:13     ` Brian Foster
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2017-10-12 23:22 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Oct 12, 2017 at 12:14:37PM -0400, Brian Foster wrote:
> On Wed, Oct 11, 2017 at 07:11:31PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > xfs_writepage_map() iterates over the bufferheads on a page to decide
> > what sort of IO to do and what actions to take. However, when it comes
> > to reflink and deciding when it needs to execute a COW operation, we no
> > longer look at the bufferhead state but instead we ignore that and look up
> > internal state held in the COW fork extent list.

[....]
> > 
> > This patch has smoke tested ok on 1k and 4k block filesystems with and without
> > reflink enabled, so it seems solid enough for initial comments to be made. Next
> > step is to get some kind of generation count or seqlock based cached map
> > invalidation into this code now that it's all unified.
> > 
> > Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> This seems sane to me on a first look. You may be planning this anyways
> since this is an RFC, but could we split up the patches into separate
> logical changes? For example, folding xfs_map_cow() into
> xfs_map_blocks() seems like it could be a separate patch.

Yeah, I intend to split it up for review and submission. I generally
just write a big single patch first to get the code into the right
shape and working first, then worry about how to split it sanely for
review. That way I don't waste time generating patches that have
to be completely reworked after an initial RFC...

> > +	xfs_ilock(ip, XFS_ILOCK_SHARED);
> > +	ASSERT(ip->i_d.di_format != XFS_DINODE_FMT_BTREE ||
> > +	       (ip->i_df.if_flags & XFS_IFEXTENTS));
> > +	ASSERT(offset <= mp->m_super->s_maxbytes);
> > +
> > +	if (xfs_is_reflink_inode(XFS_I(inode)))
> > +		is_cow = xfs_reflink_find_cow_mapping(ip, offset, imap);
> > +
> 
> Maybe do something like the following here:
> 
> 		if (is_cow) {
> 			*type = XFS_IO_COW;
> 			goto do_alloc;
> 		}
> 
> ... so we can reduce the indentation of the large !cow hunk below?

Yup, this whole function could do with a bit of factoring :P

> 
> > -	/*
> > -	 * Else we need to check if there is a COW mapping at this offset.
> > -	 */
> > -	xfs_ilock(ip, XFS_ILOCK_SHARED);
> > -	is_cow = xfs_reflink_find_cow_mapping(ip, offset, &imap);
> > +	trace_printk("ino 0x%llx: type %d, null %d, startoff 0x%llx\n",
> > +			(long long)inode->i_ino, *type,
> > +			isnullstartblock(imap->br_startblock),
> > +			imap->br_startoff);
> >  	xfs_iunlock(ip, XFS_ILOCK_SHARED);
> > +	if (error)
> > +		return error;
> >  
> > -	if (!is_cow)
> > -		return 0;
> > +	if (isnullstartblock(imap->br_startblock)) {
> > +		int	whichfork = 0;
> >  
> > -	/*
> > -	 * And if the COW mapping has a delayed extent here we need to
> > -	 * allocate real space for it now.
> > -	 */
> > -	if (isnullstartblock(imap.br_startblock)) {
> > -		error = xfs_iomap_write_allocate(ip, XFS_COW_FORK, offset,
> > -				&imap);
> > -		if (error)
> > -			return error;
> > +		switch (*type) {
> > +		case XFS_IO_DELALLOC:
> > +			whichfork = XFS_DATA_FORK;
> > +			break;
> > +		case XFS_IO_COW:
> > +			whichfork = XFS_COW_FORK;
> > +			break;
> 
> The logic seems a little verbose here. Could we set whichfork =
> XFS_COW_FORK in the same hunk I suggested adding above? Initialize
> whichfork to XFS_DATA_FORK at the top of the function and that covers
> the above two cases.

*nod*

> 
> > +		case XFS_IO_HOLE:
> > +			/* nothing to do! */
> > +			trace_xfs_map_blocks_found(ip, offset, count, *type, imap);
> > +			return 0;
> 
> I'm a little confused about this case. How can this happen, and why do
> we use the _blocks_found() tracepoint?

isnullstartblock() evaluates as true for HOLESTARTBLOCK, which we
mapped above in the !COW case. I just reused the tracepoint as a
quick and dirty method of tracing that holes were being mapped
properly via the type field.

> > +		default:
> > +			ASSERT(0);
> > +			return -EFSCORRUPTED;
> > +		}
> 
> Previous question aside, it looks like doing something like:

This caught quite a few bugs as I developed the patch :P

> 
> 	if (isnullstartblock(..) &&
> 	    (*type == XFS_IO_DELALLOC || *type == XFS_IO_COW)) {
> 		...
> 	}
> 
> ... and doing some of the corruption checks separately could then
> eliminate this switch statement entirely.

Yeah, it can definitely be cleaned up - handling the non-allocation
case first would also help a lot....


> > @@ -883,89 +900,93 @@ xfs_writepage_map(
> >  	struct writeback_control *wbc,
> >  	struct inode		*inode,
> >  	struct page		*page,
> > -	loff_t			offset,
> >  	uint64_t              end_offset)
> >  {
> >  	LIST_HEAD(submit_list);
> >  	struct xfs_ioend	*ioend, *next;
> > -	struct buffer_head	*bh, *head;
> > +	struct buffer_head	*bh;
> >  	ssize_t			len = i_blocksize(inode);
> >  	int			error = 0;
> >  	int			count = 0;
> > -	int			uptodate = 1;
> > -	unsigned int		new_type;
> > +	bool			uptodate = true;
> > +	loff_t			file_offset;	/* file offset of page */
> > +	int			poffset;	/* offset into page */
> >  
> > -	bh = head = page_buffers(page);
> > -	offset = page_offset(page);
> > -	do {
> > -		if (offset >= end_offset)
> > +	/*
> > +	 * walk the blocks on the page, and if we run off the end of the
> > +	 * current map or find the current map invalid, grab a new one.
> > +	 * We only use bufferheads here to check per-block state - they no
> > +	 * longer control the iteration through the page. This allows us to
> > +	 * replace the bufferhead with some other state tracking mechanism in
> > +	 * future.
> > +	 */
> > +	file_offset = page_offset(page);
> > +	bh = page_buffers(page);
> > +	for (poffset = 0;
> > +	     poffset < PAGE_SIZE;
> > +	     poffset += len, file_offset += len, bh = bh->b_this_page) {
> > +
> > +		/* past the range we are writing, so nothing more to write. */
> > +		if (file_offset >= end_offset)
> >  			break;
> ...
> > +
> > +		if (!wpc->imap_valid || wpc->io_type == XFS_IO_HOLE) {
> > +			/*
> > +			 * set_page_dirty dirties all buffers in a page, independent
> > +			 * of their state.  The dirty state however is entirely
> > +			 * meaningless for holes (!mapped && uptodate), so check we did
> > +			 * have a buffer covering a hole here and continue.
> > +			 */
> > +			ASSERT(!buffer_mapped(bh));
> > +			continue;
> >  		}
> 
> IIRC, the old (broken) write over hole situation would allocate a block,
> potentially yell about a block reservation overrun and ultimately write
> the page.

Which was [always] the wrong thing to do.

However, we were forced to do this historically because it covered
bugs and deficiencies in the front end mapping code, such as before
we had ->page_mkwrite notification. In those ancient times, mmap()
could not do delayed allocation or discover unwritten extents
correctly.  So back in those days we had to do unreserved
allocations in writeback and risk transaction overruns and ENOSPC in
writeback because mmap writes were unable to do the right thing.

Nowadays, we've closed off all those issues and the only situation we
can get unreserved allocation into a hole is when we've got a
mismatch between bufferhead and extent state. This should never
happen anymore, and silently allocating blocks for ranges that may
or may not have valid data in them and writing it to disk is
exactly the wrong thing to be doing now.

> Now it looks like we silently (aside from the assert on dev
> kernels) skip the writeback and clean the page.

Which, IMO, is correct behaviour - the old code danced around this
because we couldn't easily answer the question of "what to do when
we have a bufferhead state vs extent map mismatch?".

All we care about now is "does the block have valid data" and "does
the extent map say we can write it". By the time we get here, we've
already skipped blocks that don't have valid data, and the checks
above say "we can't write this data because it's a hole". The reason
it has valid data is that it has been previously read - which is pretty
common and most definitely exercised by fstests - so we silently
skip it.
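
Reduced to code, the model described here - writeback asks only whether the block has valid data and whether the map says it can be written - looks something like the following. This is an illustrative stand-alone predicate, not the kernel function; the enum values and the `block_writable()` name are stand-ins:

```c
#include <stdbool.h>

/* Stand-in for the kernel io type enum. */
enum xfs_io_type { XFS_IO_DELALLOC, XFS_IO_UNWRITTEN, XFS_IO_OVERWRITE,
		   XFS_IO_COW, XFS_IO_HOLE };

/*
 * "Does the block have valid data" and "does the extent map say we can
 * write it" - everything else, the dirty bit included, is noise.
 * Clean, previously-read data over a hole fails the second test and is
 * silently skipped.
 */
static bool block_writable(bool uptodate, bool imap_valid,
			   enum xfs_io_type io_type)
{
	if (!uptodate)
		return false;		/* no valid data in this block */
	if (!imap_valid || io_type == XFS_IO_HOLE)
		return false;		/* nowhere valid to write it */
	return true;
}
```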

One of the benefits of this change is that we remove these
ambiguities from the writeback code. If we get fed crap in the
writeback code, then we've got a bug in the front end iomap code...

> Unless I've missed that
> being handled somewhere, that seems like a slightly more severe error
> from the user perspective. Perhaps we should have a noiser warning here
> or even consider that a writeback error?

No, because it's a valid state to have cached data over a hole if
it's been read. Hence the assert to ensure what we got was from a
previous read into a hole - if a mod we make to the higher level
code screws up, we'll catch it pretty quickly during testing...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH] [RFC] xfs: make xfs_writepage_map extent map centric
  2017-10-12 23:22   ` Dave Chinner
@ 2017-10-13 13:13     ` Brian Foster
  2017-10-14 22:25       ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Brian Foster @ 2017-10-13 13:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Oct 13, 2017 at 10:22:21AM +1100, Dave Chinner wrote:
> On Thu, Oct 12, 2017 at 12:14:37PM -0400, Brian Foster wrote:
> > On Wed, Oct 11, 2017 at 07:11:31PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > xfs_writepage_map() iterates over the bufferheads on a page to decide
> > > what sort of IO to do and what actions to take. However, when it comes
> > > to reflink and deciding when it needs to execute a COW operation, we no
> > > longer look at the bufferhead state but instead we ignore that and look up
> > > internal state held in the COW fork extent list.
> 
> [....]
> > > 
> > > This patch has smoke tested ok on 1k and 4k block filesystems with and without
> > > reflink enabled, so it seems solid enough for initial comments to be made. Next
> > > step is to get some kind of generation count or seqlock based cached map
> > > invalidation into this code now that it's all unified.
> > > 
> > > Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> > > ---
> > 
> > This seems sane to me on a first look. You may be planning this anyways
> > since this is an RFC, but could we split up the patches into separate
> > logical changes? For example, folding xfs_map_cow() into
> > xfs_map_blocks() seems like it could be a separate patch.
> 
> Yeah, I intend to split it up for review and submission. I generally
> just write a big single patch first to get the code into the right
> shape and working first, then worry about how to split it sanely for
> review. That way I don't waste time generating patches that have
> to be completely reworked after an initial RFC...
> 

Yup, I figured.

...
> > 
> > > +		case XFS_IO_HOLE:
> > > +			/* nothing to do! */
> > > +			trace_xfs_map_blocks_found(ip, offset, count, *type, imap);
> > > +			return 0;
> > 
> > I'm a little confused about this case. How can this happen, and why do
> > we use the _blocks_found() tracepoint?
> 
> isnullstartblock() evaluates as true for HOLESTARTBLOCK, which we
> mapped above in the !COW case. I just reused the tracepoint as a
> quick and dirty method of tracing that holes were being mapped
> properly via the type field.
> 

Ah, Ok. This will be less confusing when the hole case can simply skip
over this branch rather than drop into a branch for allocation only to
return.

> > > +		default:
> > > +			ASSERT(0);
> > > +			return -EFSCORRUPTED;
> > > +		}
> > 
> > Previous question aside, it looks like doing something like:
> 
> This caught quite a few bugs as I developed the patch :P
> 

I'm not advocating to kill the corruption checks, they just seem more
clear to me when checked directly at a higher level. I.e., logic like
'if nullstartblock && type != some value we should have set earlier'
kind of makes my eyes cross once I have to go hunting around for what
type should/could be. Seeing them embedded in a if (nullstartblock())
branch also makes me start to wonder what cases we've missed. Better
would be just some top level checks that ensure the *type and startblock
state are sane.

> > 
> > 	if (isnullstartblock(..) &&
> > 	    (*type == XFS_IO_DELALLOC || *type == XFS_IO_COW)) {
> > 		...
> > 	}
> > 
> > ... and doing some of the corruption checks separately could then
> > eliminate this switch statement entirely.
> 
> Yeah, it can definitely be cleaned up - handling the non-allocation
> case first would also help a lot....
> 
> 
> > > @@ -883,89 +900,93 @@ xfs_writepage_map(
> > >  	struct writeback_control *wbc,
> > >  	struct inode		*inode,
> > >  	struct page		*page,
> > > -	loff_t			offset,
> > >  	uint64_t              end_offset)
> > >  {
> > >  	LIST_HEAD(submit_list);
> > >  	struct xfs_ioend	*ioend, *next;
> > > -	struct buffer_head	*bh, *head;
> > > +	struct buffer_head	*bh;
> > >  	ssize_t			len = i_blocksize(inode);
> > >  	int			error = 0;
> > >  	int			count = 0;
> > > -	int			uptodate = 1;
> > > -	unsigned int		new_type;
> > > +	bool			uptodate = true;
> > > +	loff_t			file_offset;	/* file offset of page */
> > > +	int			poffset;	/* offset into page */
> > >  
> > > -	bh = head = page_buffers(page);
> > > -	offset = page_offset(page);
> > > -	do {
> > > -		if (offset >= end_offset)
> > > +	/*
> > > +	 * walk the blocks on the page, and if we run off the end of the
> > > +	 * current map or find the current map invalid, grab a new one.
> > > +	 * We only use bufferheads here to check per-block state - they no
> > > +	 * longer control the iteration through the page. This allows us to
> > > +	 * replace the bufferhead with some other state tracking mechanism in
> > > +	 * future.
> > > +	 */
> > > +	file_offset = page_offset(page);
> > > +	bh = page_buffers(page);
> > > +	for (poffset = 0;
> > > +	     poffset < PAGE_SIZE;
> > > +	     poffset += len, file_offset += len, bh = bh->b_this_page) {
> > > +
> > > +		/* past the range we are writing, so nothing more to write. */
> > > +		if (file_offset >= end_offset)
> > >  			break;
> > ...
> > > +
> > > +		if (!wpc->imap_valid || wpc->io_type == XFS_IO_HOLE) {
> > > +			/*
> > > +			 * set_page_dirty dirties all buffers in a page, independent
> > > +			 * of their state.  The dirty state however is entirely
> > > +			 * meaningless for holes (!mapped && uptodate), so check we did
> > > +			 * have a buffer covering a hole here and continue.
> > > +			 */
> > > +			ASSERT(!buffer_mapped(bh));
> > > +			continue;
> > >  		}
> > 
> > IIRC, the old (broken) write over hole situation would allocate a block,
> > potentially yell about a block reservation overrun and ultimately write
> > the page.
> 
> Which was [always] the wrong thing to do.
> 
> However, we were forced to do this historically because it covered
> bugs and deficiencies in the front end mapping code, such as before
> we had ->page_mkwrite notification. In those ancient times, mmap()
> could not do delayed allocation or discover unwritten extents
> correctly.  So back in those days we had to do unreserved
> allocations in writeback and risk transaction overruns and ENOSPC in
> writeback because mmap writes were unable to do the right thing.
> 
> Nowadays, we've closed off all those issues and the only situation we
> can get unreserved allocation into a hole is when we've got a
> mismatch between bufferhead and extent state. This should never
> happen anymore, and silently allocating blocks for ranges that may
> or may not have valid data in them and writing it to disk is
> exactly the wrong thing to be doing now.
> 

"This should never happen ..."

Famous last words? ;) 

Yeah, I understand it's a broken case, but I'm slightly concerned over
the fact that it has happened in the past due to filesystem bugs. I
think due to more than one problem as well, but my memory is hazy.

Note that I'm not suggesting we go ahead and write the page like we used
to do, but rather that we at least have a defensive check for a
discrepancy between pagecache state that suggests a buffer is dirty and
extent state says there is no place to write. If so, throw a warning
(either xfs_alert() or WARN_ON()) to indicate something went wrong
because we tossed a dirty page without writing it anywhere. In other
words, all I'm really saying here is that I don't think an assert is
good enough.
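
As a rough illustration of the distinction being proposed, a stub predicate for the defensive check might look like the following. This is hypothetical only - the real check would use buffer_mapped()/buffer_dirty() on the bufferhead, and whether the dirty bit carries any meaning here is exactly the point under debate:

```c
#include <stdbool.h>

/*
 * Hypothetical: when writeback lands on a hole, a clean unmapped buffer
 * that was merely read in is fine to skip; a mapped buffer (or a
 * genuinely dirty one, if that state were trustworthy) over a hole
 * means the front-end mapping code went wrong, and arguably deserves a
 * loud runtime warning rather than a debug-only assert.
 */
static bool hole_buffer_is_bogus(bool mapped, bool dirty)
{
	return mapped || dirty;
}
```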

> > Now it looks like we silently (aside from the assert on dev
> > kernels) skip the writeback and clean the page.
> 
> Which, IMO, is correct behaviour - the old code danced around this
> because we couldn't easily answer the question of "what to do when
> we have a bufferhead state vs extent map mismatch?".
> 
> All we care about now is "does the block have valid data" and "does
> the extent map say we can write it". By the time we get here, we've
> already skipped blocks that don't have valid data, and the checks
> above say "we can't write this data because it's a hole". The reason
> it has valid data is that it has been previously read - which is pretty
> common and most definitely exercised by fstests - so we silently
> skip it.
> 
> One of the benefits of this change is that we remove these
> ambiguities from the writeback code. If we get fed crap in the
> writeback code, then we've got a bug in the front end iomap code...
> 

Indeed, I agree with regard to the overall behavior.

> > Unless I've missed that
> > being handled somewhere, that seems like a slightly more severe error
> > from the user perspective. Perhaps we should have a noiser warning here
> > or even consider that a writeback error?
> 
> No, because it's a valid state to have cached data over a hole if
> it's been read. Hence the assert to ensure what we got was from a
> previous read into a hole - if a mod we make to the higher level
> code screws up, we'll catch it pretty quickly during testing...
> 

We should be able to distinguish between page/buffer state that
indicates we have valid data over a hole vs. valid+dirty data over a
hole in order to flag the latter as a problem.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

* Re: [PATCH] [RFC] xfs: make xfs_writepage_map extent map centric
  2017-10-13 13:13     ` Brian Foster
@ 2017-10-14 22:25       ` Dave Chinner
  2017-10-16 12:17         ` Brian Foster
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2017-10-14 22:25 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Oct 13, 2017 at 09:13:48AM -0400, Brian Foster wrote:
> On Fri, Oct 13, 2017 at 10:22:21AM +1100, Dave Chinner wrote:
> > On Thu, Oct 12, 2017 at 12:14:37PM -0400, Brian Foster wrote:
> > > On Wed, Oct 11, 2017 at 07:11:31PM +1100, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > xfs_writepage_map() iterates over the bufferheads on a page to decide
> > > > what sort of IO to do and what actions to take. However, when it comes
> > > > to reflink and deciding when it needs to execute a COW operation, we no
> > > > longer look at the bufferhead state but instead we ignore that and look up
> > > > internal state held in the COW fork extent list.
> > 
> > [....]
> > 
> > > > +
> > > > +		if (!wpc->imap_valid || wpc->io_type == XFS_IO_HOLE) {
> > > > +			/*
> > > > +			 * set_page_dirty dirties all buffers in a page, independent
> > > > +			 * of their state.  The dirty state however is entirely
> > > > +			 * meaningless for holes (!mapped && uptodate), so check we did
> > > > +			 * have a buffer covering a hole here and continue.
> > > > +			 */
> > > > +			ASSERT(!buffer_mapped(bh));
> > > > +			continue;
> > > >  		}
> > > 
> > > IIRC, the old (broken) write over hole situation would allocate a block,
> > > potentially yell about a block reservation overrun and ultimately write
> > > the page.
> > 
> > Which was [always] the wrong thing to do.
> > 
> > However, we were forced to do this historically because it covered
> > bugs and deficiencies in the front end mapping code, such as before
> > we had ->page_mkwrite notification. In those ancient times, mmap()
> > could not do delayed allocation or discover unwritten extents
> > correctly.  So back in those days we had to do unreserved
> > allocations in writeback and risk transaction overruns and ENOSPC in
> > writeback because mmap writes were unable to do the right thing.
> > 
> > Nowadays, we've closed off all those issues and the only situation we
> > can get unreserved allocation into a hole is when we've got a
> > mismatch between bufferhead and extent state. This should never
> > happen anymore, and silently allocating blocks for ranges that may
> > or may not have valid data in them and writing it to disk is
> > exactly the wrong thing to be doing now.
> > 
> 
> "This should never happen ..."
> 
> Famous last words? ;) 
> 
> Yeah, I understand it's a broken case, but I'm slightly concerned over
> the fact that it has happened in the past due to filesystem bugs. I
> think due to more than one problem as well, but my memory is hazy.
> 
> Note that I'm not suggesting we go ahead and write the page like we used
> to do, but rather that we at least have a defensive check for a
> discrepancy between pagecache state that suggests a buffer is dirty and
> extent state says there is no place to write.

This code does no checking against "buffer_dirty" because we've
never done that in the writeback path - we've always written clean,
uptodate, mapped sub-page buffers if the page is dirty.

So the only case left to defend against here is mapped buffers over
a hole or an invalid extent. The writeback path currently checks
that doesn't happen with an ASSERT(), and the code above retains
that ASSERT(). It will go away completely when we move away from
bufferheads....

> If so, throw a warning
> (either xfs_alert() or WARN_ON()) to indicate something went wrong
> because we tossed a dirty page without writing it anywhere. In other
> words, all I'm really saying here is that I don't think an assert is
> good enough.

It's the same defence we've always had against this "we
fucked up the higher layer mapping" type of bug. I'm not creating
any new situation here, so I don't think it warrants a new warning
that will simply go away with bufferheads...

> > > Unless I've missed that
> > > being handled somewhere, that seems like a slightly more severe error
> > > from the user perspective. Perhaps we should have a noiser warning here
> > > or even consider that a writeback error?
> > 
> > No, because it's a valid state to have cached data over a hole if
> > it's been read. Hence the assert to ensure what we got was from a
> > previous read into a hole - if a mod we make to the higher level
> > code screws up, we'll catch it pretty quickly during testing...
> 
> We should be able to distinguish between page/buffer state that
> indicates we have valid data over a hole vs. valid+dirty data over a
> hole in order to flag the latter as a problem.

Except that, as above, we treat clean/dirty data exactly the same in
writepage - we write it if it's mapped over a valid extent. The code
catches the only case left that is invalid and it doesn't matter if
the buffer is clean or dirty - in both cases it is invalid....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH] [RFC] xfs: make xfs_writepage_map extent map centric
  2017-10-14 22:25       ` Dave Chinner
@ 2017-10-16 12:17         ` Brian Foster
  2017-10-17  1:58           ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Brian Foster @ 2017-10-16 12:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Sun, Oct 15, 2017 at 09:25:32AM +1100, Dave Chinner wrote:
> On Fri, Oct 13, 2017 at 09:13:48AM -0400, Brian Foster wrote:
> > On Fri, Oct 13, 2017 at 10:22:21AM +1100, Dave Chinner wrote:
> > > On Thu, Oct 12, 2017 at 12:14:37PM -0400, Brian Foster wrote:
> > > > On Wed, Oct 11, 2017 at 07:11:31PM +1100, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > 
> > > > > xfs_writepage_map() iterates over the bufferheads on a page to decide
> > > > > what sort of IO to do and what actions to take. However, when it comes
> > > > > to reflink and deciding when it needs to execute a COW operation, we no
> > > > > longer look at the bufferhead state but instead we ignore that and look up
> > > > > internal state held in the COW fork extent list.
> > > 
> > > [....]
> > > 
> > > > > +
> > > > > +		if (!wpc->imap_valid || wpc->io_type == XFS_IO_HOLE) {
> > > > > +			/*
> > > > > +			 * set_page_dirty dirties all buffers in a page, independent
> > > > > +			 * of their state.  The dirty state however is entirely
> > > > > +			 * meaningless for holes (!mapped && uptodate), so check we did
> > > > > +			 * have a buffer covering a hole here and continue.
> > > > > +			 */
> > > > > +			ASSERT(!buffer_mapped(bh));
> > > > > +			continue;
> > > > >  		}
> > > > 
> > > > IIRC, the old (broken) write over hole situation would allocate a block,
> > > > potentially yell about a block reservation overrun and ultimately write
> > > > the page.
> > > 
> > > Which was [always] the wrong thing to do.
> > > 
> > > However, we were forced to do this historically because it covered
> > > bugs and deficiencies in the front end mapping code, such as before
> > > we had ->page_mkwrite notification. In those ancient times, mmap()
> > > could not do delayed allocation or discover unwritten extents
> > > correctly.  So back in those days we had to do unreserved
> > > allocations in writeback and risk transaction overruns and ENOSPC in
> > > writeback because mmap writes were unable to do the right thing.
> > > 
> > > Nowadays, we've closed off all those issues and the only situation we
> > > can get unreserved allocation into a hole is when we've got a
> > > mismatch between bufferhead and extent state. This should never
> > > happen anymore, and silently allocating blocks for ranges that may
> > > or may not have valid data in them and writing it to disk is
> > > exactly the wrong thing to be doing now.
> > > 
> > 
> > "This should never happen ..."
> > 
> > Famous last words? ;) 
> > 
> > Yeah, I understand it's a broken case, but I'm slightly concerned over
> > the fact that it has happened in the past due to filesystem bugs. I
> > think due to more than one problem as well, but my memory is hazy.
> > 
> > Note that I'm not suggesting we go ahead and write the page like we used
> > to do, but rather that we at least have a defensive check for a
> > discrepancy between pagecache state that suggests a buffer is dirty and
> > extent state says there is no place to write.
> 
> This code does not check against "buffer_dirty" because we've
> never done that in the writeback path - we've always written clean,
> uptodate, mapped sub-page buffers if the page is dirty.
> 
> So the only case left to defend against here is mapped buffers over
> a hole or an invalid extent. The writeback path currently checks
> that doesn't happen with an ASSERT(), and the code above retains
> that ASSERT(). It will go away completely when we move away from
> bufferheads....
> 
> > If so, throw a warning
> > (either xfs_alert() or WARN_ON()) to indicate something went wrong
> > because we tossed a dirty page without writing it anywhere. In other
> > words, all I'm really saying here is that I don't think an assert is
> > good enough.
> 
> It's the same as what we've always had to defend against this "we
> fucked up the higher layer mapping" type of bug. I'm not creating
> any new situation here, so I don't think it warrants a new warning
> that will simply go away with bufferheads...
> 

Ok, then it may not be ideal to rely on the buffer_head to detect the
error.

> > > > Unless I've missed that
> > > > being handled somewhere, that seems like a slightly more severe error
> > > > from the user perspective. Perhaps we should have a noiser warning here
> > > > or even consider that a writeback error?
> > > 
> > > No, because it's a valid state to have cached data over a hole if
> > > it's been read. Hence the assert to ensure what we got was from a
> > > previous read into a hole - if a mod we make to the higher level
> > > code screws up, we'll catch it pretty quickly during testing...
> > 
> > We should be able to distinguish between page/buffer state that
> > indicates we have valid data over a hole vs. valid+dirty data over a
> > hole in order to flag the latter as a problem.
> 
> Except that, as above, we treat clean/dirty data exactly the same in
> writepage - we write it if it's mapped over a valid extent. The code
> catches the only case left that is invalid and it doesn't matter if
> the buffer is clean or dirty - in both cases it is invalid....
> 

This is simply not the case as it pertains to how the current code deals
with holes. Initially I was thinking we still quietly wrote the page,
but that is incorrect too. Christoph added the XFS_BMAPI_DELALLOC flag
in commit d2b3964a ("xfs: fix COW writeback race") that alters this
behavior.

The current behavior of the broken dirty page over a hole case is that
we no longer silently attempt a hole -> real allocation. Instead,
writeback fails with -EIO. With this patch, that exact same broken case
fails the assert, but otherwise skips the dirty page without any error
or notification to userspace. IOW, by skipping the
iomap_write_allocate() code unconditionally we've effectively removed an
error check and made the pre-existing assert logic responsible for it.

So regardless of what specific buffer state checks pre-existed or not,
this patch subtly changes an existing error condition to an assertion
without any explicit intent/justification. I can understand not wanting
to add new logic related to buffer heads, but I don't think "buffer
heads suck" is a good enough reason alone to just drop the error. The
safer approach is to retain the error so that it's clear we need to find
a better way to detect it before we eliminate buffer_heads.

It would be nice if we could just rely on the page state to detect the
error. For example, is there any real non-erroneous case for having a
dirty page completely over a hole within EOF of a file? Unfortunately
that doesn't help us in the sub-page blocksize case where some of the
page might be valid, but perhaps that is a reasonable compromise. If so,
I think that should be part of a patch that actually eliminates
buffer_heads rather than reworking writepage because it at least gives
us more time to think about the problem.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

* Re: [PATCH] [RFC] xfs: make xfs_writepage_map extent map centric
  2017-10-16 12:17         ` Brian Foster
@ 2017-10-17  1:58           ` Dave Chinner
  2017-10-17 14:44             ` Brian Foster
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2017-10-17  1:58 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Oct 16, 2017 at 08:17:54AM -0400, Brian Foster wrote:
> On Sun, Oct 15, 2017 at 09:25:32AM +1100, Dave Chinner wrote:
> > On Fri, Oct 13, 2017 at 09:13:48AM -0400, Brian Foster wrote:
> > > On Fri, Oct 13, 2017 at 10:22:21AM +1100, Dave Chinner wrote:
> > > > On Thu, Oct 12, 2017 at 12:14:37PM -0400, Brian Foster wrote:
> > > > > On Wed, Oct 11, 2017 at 07:11:31PM +1100, Dave Chinner wrote:
> > > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > > 
> > > > > > xfs_writepage_map() iterates over the bufferheads on a page to decide
> > > > > > what sort of IO to do and what actions to take. However, when it comes
> > > > > > to reflink and deciding when it needs to execute a COW operation, we no
> > > > > > longer look at the bufferhead state but instead we ignore that and look up
> > > > > > internal state held in the COW fork extent list.
> > > > 
> > > > [....]
> > > > 
> > > > > > +
> > > > > > +		if (!wpc->imap_valid || wpc->io_type == XFS_IO_HOLE) {
> > > > > > +			/*
> > > > > > +			 * set_page_dirty dirties all buffers in a page, independent
> > > > > > +			 * of their state.  The dirty state however is entirely
> > > > > > +			 * meaningless for holes (!mapped && uptodate), so check we did
> > > > > > +			 * have a buffer covering a hole here and continue.
> > > > > > +			 */
> > > > > > +			ASSERT(!buffer_mapped(bh));
> > > > > > +			continue;
> > > > > >  		}
> > > > > 
> > > > > IIRC, the old (broken) write over hole situation would allocate a block,
> > > > > potentially yell about a block reservation overrun and ultimately write
> > > > > the page.
> > > > 
> > > > Which was [always] the wrong thing to do.
> > > > 
> > > > However, we were forced to do this historically because it covered
> > > > bugs and deficiencies in the front end mapping code, such as before
> > > > we had ->page_mkwrite notification. In those ancient times, mmap()
> > > > could not do delayed allocation or discover unwritten extents
> > > > correctly.  So back in those days we had to do unreserved
> > > > allocations in writeback and risk transaction overruns and ENOSPC in
> > > > writeback because mmap writes were unable to do the right thing.
> > > > 
> > > > Nowadays, we've closed off all those issues and the only situation we
> > > > can get unreserved allocation into a hole is when we've got a
> > > > mismatch between bufferhead and extent state. This should never
> > > > happen anymore, and silently allocating blocks for ranges that may
> > > > or may not have valid data in them and writing it to disk is
> > > > exactly the wrong thing to be doing now.
> > > > 
> > > 
> > > "This should never happen ..."
> > > 
> > > Famous last words? ;) 
> > > 
> > > Yeah, I understand it's a broken case, but I'm slightly concerned over
> > > the fact that it has happened in the past due to filesystem bugs. I
> > > think due to more than one problem as well, but my memory is hazy.
> > > 
> > > Note that I'm not suggesting we go ahead and write the page like we used
> > > to do, but rather that we at least have a defensive check for a
> > > discrepancy between pagecache state that suggests a buffer is dirty and
> > > extent state says there is no place to write.
> > 
> > This code does not check against "buffer_dirty" because we've
> > never done that in the writeback path - we've always written clean,
> > uptodate, mapped sub-page buffers if the page is dirty.
> > 
> > So the only case left to defend against here is mapped buffers over
> > a hole or an invalid extent. The writeback path currently checks
> > that doesn't happen with an ASSERT(), and the code above retains
> > that ASSERT(). It will go away completely when we move away from
> > bufferheads....
> > 
> > > If so, throw a warning
> > > (either xfs_alert() or WARN_ON()) to indicate something went wrong
> > > because we tossed a dirty page without writing it anywhere. In other
> > > words, all I'm really saying here is that I don't think an assert is
> > > good enough.
> > 
> > It's the same as what we've always had to defend against this "we
> > fucked up the higher layer mapping" type of bug. I'm not creating
> > any new situation here, so I don't think it warrants a new warning
> > that will simply go away with bufferheads...
> > 
> 
> Ok, then it may not be ideal to rely on the buffer_head to detect the
> error.

Again, it's not different to the existing code, and the problem will
go away when bufferheads go away. So it's really just maintaining
the status quo...

> > > > > Unless I've missed that
> > > > > being handled somewhere, that seems like a slightly more severe error
> > > > > from the user perspective. Perhaps we should have a noiser warning here
> > > > > or even consider that a writeback error?
> > > > 
> > > > No, because it's a valid state to have cached data over a hole if
> > > > it's been read. Hence the assert to ensure what we got was from a
> > > > previous read into a hole - if a mod we make to the higher level
> > > > code screws up, we'll catch it pretty quickly during testing...
> > > 
> > > We should be able to distinguish between page/buffer state that
> > > indicates we have valid data over a hole vs. valid+dirty data over a
> > > hole in order to flag the latter as a problem.
> > 
> > Except that, as above, we treat clean/dirty data exactly the same in
> > writepage - we write it if it's mapped over a valid extent. The code
> > catches the only case left that is invalid and it doesn't matter if
> > the buffer is clean or dirty - in both cases it is invalid....
> > 
> 
> This is simply not the case as it pertains to how the current code deals
> with holes. Initially I was thinking we still quietly wrote the page,
> but that is incorrect too. Christoph added the XFS_BMAPI_DELALLOC flag
> in commit d2b3964a ("xfs: fix COW writeback race") that alters this
> behavior.

The XFS_BMAPI_DELALLOC behaviour is completely undocumented. There's
nothing in the code that explains what it's for, and the commit
message is just as baffling. I had to ask Darrick WhatTF it was and
WhyTF it had wacky asserts for the COW fork and so on. I still
had to infer what it was supposed to do from the original thread.

> The current behavior of the broken dirty page over a hole case is that
> we no longer silently attempt a hole -> real allocation.  Instead,
> writeback fails with -EIO. With this patch, that exact same broken case
> fails the assert, but otherwise skips the dirty page without any error
> or notification to userspace.

Yes, XFS_BMAPI_DELALLOC means we say no to unreserved allocations
over holes in xfs_iomap_write_allocate(). But why did we even get
to xfs_iomap_write_allocate() with a type == XFS_IO_DELALLOC
bufferhead over a hole in the first place?

We got there because we trusted the bufferhead state to match the
underlying extent state rather than looking at the extent state to
determine the correct action. And, as a result, we do the wrong
thing and then throw a bandaid over it to hide the fact we've done
the wrong thing.

If xfs_map_blocks()/xfs_map_cow() land in a hole, we should *never*
call xfs_iomap_write_allocate() because we do not have a reservation
that allows us to allocate blocks. No ifs, no buts, it's just wrong
and we shouldn't do it. But landing in a hole in these functions
> isn't actually an indication of an error.

What is an error is calling xfs_iomap_write_allocate() once we've
found a hole because we trusted cached bufferhead state that says
"you must allocate". We know that bufferhead state can go wrong - I
> trust the extent list records and state far more than I trust
bufferheads to carry the correct state, and the fact we need
workarounds like XFS_BMAPI_DELALLOC to defend against
inconsistencies between the incore extent state and the bufferhead
state demonstrate that.

> IOW, by skipping the
> iomap_write_allocate() code unconditionally we've effectively removed an
> error check and made the pre-existing assert logic responsible for it.

The pre-existing bufferhead assert logic is not responsible for
skipping the iomap_write_allocate() code. The code as it stands is
simply broken, and XFS_BMAPI_DELALLOC is just a band-aid.

The changes to combine xfs_map_cow() and xfs_map_blocks() allow us
to detect holes correctly and so we do not even attempt to allocate
over them. We don't need to rely on xfs_bmapi_write() to detect that
the caller 2-3 layers up had a fuckup and asked us to allocate into
a hole incorrectly; those higher layers now get that right by
detecting and skipping over holes.

> So regardless of what specific buffer state checks pre-existed or
> not, this patch subtly changes an existing error condition to an
> assertion without any explicit intent/justification.  I can
> understand not wanting to add new logic related to buffer heads,
> but I don't think "buffer heads suck" is a good enough reason
> alone to just drop the error. The safer approach is to retain the
> error so that it's clear we need to find a better way to detect it
> before we eliminate buffer_heads.

"the code was a band-aid, it can no longer be hit" is a perfectly
good reason to drop it. :/

> It would be nice if we could just rely on the page state to detect the
> error.  For example, is there any real non-erroneous case for having a
> dirty page completely over a hole within EOF of a file?

The existing code silently skips over such pages right now.  The new
code does the same thing, only it doesn't trust the bufferhead state
to determine if the page lies over a hole. XFS_BMAPI_DELALLOC
should be redundant now - I'll go remove it and see what happens...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH] [RFC] xfs: make xfs_writepage_map extent map centric
  2017-10-17  1:58           ` Dave Chinner
@ 2017-10-17 14:44             ` Brian Foster
  0 siblings, 0 replies; 8+ messages in thread
From: Brian Foster @ 2017-10-17 14:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Oct 17, 2017 at 12:58:35PM +1100, Dave Chinner wrote:
> On Mon, Oct 16, 2017 at 08:17:54AM -0400, Brian Foster wrote:
> > On Sun, Oct 15, 2017 at 09:25:32AM +1100, Dave Chinner wrote:
> > > On Fri, Oct 13, 2017 at 09:13:48AM -0400, Brian Foster wrote:
> > > > On Fri, Oct 13, 2017 at 10:22:21AM +1100, Dave Chinner wrote:
> > > > > On Thu, Oct 12, 2017 at 12:14:37PM -0400, Brian Foster wrote:
> > > > > > On Wed, Oct 11, 2017 at 07:11:31PM +1100, Dave Chinner wrote:
...
> > This is simply not the case as it pertains to how the current code deals
> > with holes. Initially I was thinking we still quietly wrote the page,
> > but that is incorrect too. Christoph added the XFS_BMAPI_DELALLOC flag
> > in commit d2b3964a ("xfs: fix COW writeback race") that alters this
> > behavior.
> 
> The XFS_BMAPI_DELALLOC behaviour is completely undocumented. There's
> nothing in the code that explains what it's for, and the commit
> message is just as baffling. I had to ask Darrick WhatTF it was and
> WhyTF it had wacky asserts for the COW fork and so on. I still
> had to infer what it was supposed to do from the original thread.
> 

Yeah, I agree wrt the commit log and lack of documentation. I recalled
the change in behavior shortly after the fact, though probably more
from the original mail thread than from the commit log, which is pretty
sparse. That said, I don't think it was an accidental change or
anything of that nature.

> > The current behavior of the broken dirty page over a hole case is that
> > we no longer silently attempt a hole -> real allocation.  Instead,
> > writeback fails with -EIO. With this patch, that exact same broken case
> > fails the assert, but otherwise skips the dirty page without any error
> > or notification to userspace.
> 
> Yes, XFS_BMAPI_DELALLOC means we say no to unreserved allocations
> over holes in xfs_iomap_write_allocate(). But why did we even get
> to xfs_iomap_write_allocate() with a type == XFS_IO_DELALLOC
> bufferhead over a hole in the first place?
> 

It's clearly a broken/error scenario...

> We got there because we trusted the bufferhead state to match the
> underlying extent state rather than looking at the extent state to
> determine the correct action. And, as a result, we do the wrong
> thing and then throw a bandaid over it to hide the fact we've done
> the wrong thing.
> 

Eh? We trusted the bufferhead state to match the underlying extent
state. If it doesn't, we return an error.

> If xfs_map_blocks()/xfs_map_cow() land in a hole, we should *never*
> call xfs_iomap_write_allocate() because we do not have a reservation
> that allows us to allocate blocks. No ifs, no buts, it's just wrong
> and we shouldn't do it. But landing in a hole in these functions
> > isn't actually an indication of an error.
> 

The implementation is a bit hacky for sure, but to justify dropping the
error because we shouldn't be calling the function it's in is a bit
pedantic IMO. The question here is more "what is the right behavior?"

> What is an error is calling xfs_iomap_write_allocate() once we've
> found a hole because we trusted cached bufferhead state that says
> "you must allocate". We know that bufferhead state can go wrong - I
> > trust the extent list records and state far more than I trust
> bufferheads to carry the correct state, and the fact we need
> workarounds like XFS_BMAPI_DELALLOC to defend against
> inconsistencies between the incore extent state and the bufferhead
> state demonstrate that.
> 

Sure, but the recent problems in this area that I remember were XFS
bugs. Unless I'm misremembering, the most recent driver for this was
that we had at least one scenario where the page/buffer state was valid
and we (XFS) reclaimed or somehow or other failed to allocate a block
at the associated file offset. The most likely effect on the user was
the reservation warning and otherwise no impact to the data.

> > IOW, by skipping the
> > iomap_write_allocate() code unconditionally we've effectively removed an
> > error check and made the pre-existing assert logic responsible for it.
> 
> The pre-existing bufferhead assert logic is not responsible for
> skipping the iomap_write_allocate() code. The code as it stands is
> simply broken, and XFS_BMAPI_DELALLOC is just a band-aid.
> 
> The changes to combine xfs_map_cow() and xfs_map_blocks() allow us
> to detect holes correctly and so we do not even attempt to allocate
> over them. We don't need to rely on xfs_bmapi_write() to detect that
> the caller 2-3 layers up had a fuckup and asked us to allocate into
> a hole incorrectly; those higher layers now get that right by
> detecting and skipping over holes.
> 

To try and summarize... the inconsistency can either be due to incorrect
page state or incorrect extent state. We don't know which at the time
this might occur, but both have occurred in the past. The historical
behavior was to forcefully allocate a block and write the data based on
the page state. The current behavior is to return an error. The behavior
after this patch is to skip the page based on extent state, regardless
of page state.

I think the behavior in this patch is fine in the case where the page
state might be incorrect, as you've noted above. So what about in the
case where the page state is correct and the extent state is not? IOW,
if userspace wrote data that is valid and otherwise correctly sitting in
the pagecache and XFS screws up the extent state to create a hole, what
behavior is preferred? As a user, would you prefer an error or silent
data corruption?

In the end, I don't care too much about what's a band-aid or not, what
functions we should or shouldn't be calling based on layering, etc. The
important question to me is what is the right behavior in this
situation? The answer depends on which side is correct. If the extent
state is correctly a hole, then clearly we should skip the page. If the
pagecache is correct and the hole is due to a bug, then we probably want
to try and preserve the data. Since we don't really know which is which
and it's not totally safe to allocate a block, the only sane thing to do
here IMO is return an error and let userspace know the file needs
attention (and let us know there's a bug that needs fixing..).

Brian

> > So regardless of what specific buffer state checks pre-existed or
> > not, this patch subtly changes an existing error condition to an
> > assertion without any explicit intent/justification.  I can
> > understand not wanting to add new logic related to buffer heads,
> > but I don't think "buffer heads suck" is a good enough reason
> > alone to just drop the error. The safer approach is to retain the
> > error so that it's clear we need to find a better way to detect it
> > before we eliminate buffer_heads.
> 
> "the code was a band-aid, it can no longer be hit" is a perfectly
> good reason to drop it. :/
> 
> > It would be nice if we could just rely on the page state to detect the
> > error.  For example, is there any real non-erroneous case for having a
> > dirty page completely over a hole within EOF of a file?
> 
> The existing code silently skips over such pages right now.  The new
> code does the same thing, only it doesn't trust the bufferhead state
> to determine if the page lies over a hole. XFS_BMAPI_DELALLOC
> should be redundant now - I'll go remove it and see what happens...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

end of thread, other threads:[~2017-10-17 14:44 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-11  8:11 [PATCH] [RFC] xfs: make xfs_writepage_map extent map centric Dave Chinner
2017-10-12 16:14 ` Brian Foster
2017-10-12 23:22   ` Dave Chinner
2017-10-13 13:13     ` Brian Foster
2017-10-14 22:25       ` Dave Chinner
2017-10-16 12:17         ` Brian Foster
2017-10-17  1:58           ` Dave Chinner
2017-10-17 14:44             ` Brian Foster
