Re: [PATCH 13/24] xfs: make xfs_writepage_map extent map centric

From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Christoph Hellwig <hch@lst.de>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Dave Chinner <dchinner@redhat.com>
Subject: Re: [PATCH 13/24] xfs: make xfs_writepage_map extent map centric
Date: Mon, 18 Jun 2018 22:43:25 -0700	[thread overview]
Message-ID: <20180619054325.GV8128@magnolia> (raw)
In-Reply-To: <20180615130209.1970-14-hch@lst.de>

On Fri, Jun 15, 2018 at 03:01:58PM +0200, Christoph Hellwig wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_writepage_map() iterates over the bufferheads on a page to decide
> what sort of IO to do and what actions to take.  However, when it comes
> to reflink and deciding when it needs to execute a COW operation, we no
> longer look at the bufferhead state but instead we ignore than and look
> up internal state held in the COW fork extent list.
> 
> This means xfs_writepage_map() is somewhat confused. It does stuff, then
> ignores it, then tries to handle the impedence mismatch by shovelling the
> results inside the existing mapping code.  It works, but it's a bit of a
> mess and it makes it hard to fix the cached map bug that the writepage
> code currently has.
> 
> To unify the two different mechanisms, we first have to choose a direction.
> That's already been set - we're de-emphasising bufferheads so they are no
> longer a control structure as we need to do taht to allow for eventual
> removal.  Hence we need to move away from looking at bufferhead state to
> determine what operations we need to perform.
> 
> We can't completely get rid of bufferheads yet - they do contain some
> state that is absolutely necessary, such as whether that part of the page
> contains valid data or not (buffer_uptodate()).  Other state in the
> bufferhead is redundant:
> 
> 	BH_dirty - the page is dirty, so we can ignore this and just
> 		write it
> 	BH_delay - we have delalloc extent info in the DATA fork extent
> 		tree
> 	BH_unwritten - same as BH_delay
> 	BH_mapped - indicates we've already used it once for IO and it is
> 		mapped to a disk address. Needs to be ignored for COW
> 		blocks.
> 
> The BH_mapped flag is an interesting case - it's supposed to indicate that
> it's already mapped to disk and so we can just use it "as is".  In theory,
> we don't even have to do an extent lookup to find where to write it too,
> but we have to do that anyway to determine we are actually writing over a
> valid extent.  Hence it's not even serving the purpose of avoiding a an
> extent lookup during writeback, and so we can pretty much ignore it.
> Especially as we have to ignore it for COW operations...
> 
> Therefore, use the extent map as the source of information to tell us
> what actions we need to take and what sort of IO we should perform.  The
> first step is to have xfs_map_blocks() set the io type according to what
> it looks up.  This means it can easily handle both normal overwrite and
> COW cases.  The only thing we also need to add is the ability to return
> hole mappings.
> 
> We need to return and cache hole mappings now for the case of multiple
> blocks per page.  We no longer use the BH_mapped to indicate a block over
> a hole, so we have to get that info from xfs_map_blocks().  We cache it so
> that holes that span two pages don't need separate lookups.  This allows us
> to avoid ever doing write IO over a hole, too.
> 
> Now that we have xfs_map_blocks() returning both a cached map and the type
> of IO we need to perform, we can rewrite xfs_writepage_map() to drop all
> the bufferhead control. It's also much simplified because it doesn't need
> to explicitly handle COW operations.  Instead of iterating bufferheads, it
> iterates blocks within the page and then looks up what per-block state is
> required from the appropriate bufferhead.  It then validates the cached
> map, and if it's not valid, we get a new map.  If we don't get a valid map
> or it's over a hole, we skip the block.
> 
> At this point, we have to remap the bufferhead via xfs_map_at_offset().
> As previously noted, we had to do this even if the buffer was already
> mapped as the mapping would be stale for XFS_IO_DELALLOC, XFS_IO_UNWRITTEN
> and XFS_IO_COW IO types.  With xfs_map_blocks() now controlling the type,
> even XFS_IO_OVERWRITE types need remapping, as converted-but-not-yet-
> written delalloc extents beyond EOF can be reported at XFS_IO_OVERWRITE.
> Bufferheads that span such regions still need their BH_Delay flags cleared
> and their block numbers calculated, so we now unconditionally map each
> bufferhead before submission.
> 
> But wait! There's more - remember the old "treat unwritten extents as
> holes on read" hack?  Yeah, that means we can have a dirty page with
> unmapped, unwritten bufferheads that contain data!  What makes these so
> special is that the unwritten "hole" bufferheads do not have a valid block
> device pointer, so if we attempt to write them xfs_add_to_ioend() blows
> up. So we make xfs_map_at_offset() do the "realtime or data device"
> lookup from the inode and ignore what was or wasn't put into the
> bufferhead when the buffer was instantiated.
> 
> The astute reader will have realised by now that this code treats
> unwritten extents in multiple-blocks-per-page situations differently.
> If we get any combination of unwritten blocks on a dirty page that contain
> valid data in the page, we're going to convert them to real extents.  This
> can actually be a win, because it means that pages with interleaving
> unwritten and written blocks will get converted to a single written extent
> with zeros replacing the interspersed unwritten blocks.  This is actually
> good for reducing extent list and conversion overhead, and it means we
> issue a contiguous IO instead of lots of little ones.  The downside is
> that we use up a little extra IO bandwidth.  Neither of these seem like a
> bad thing given that spinning disks are seek sensitive, and SSDs/pmem have
> bandwidth to burn and the lower Io latency/CPU overhead of fewer, larger
> IOs will result in better performance on them...
> 
> As a result of all this, the only state we actually care about from the
> bufferhead is a single flag - BH_Uptodate. We still use the bufferhead to
> pass some information to the bio via xfs_add_to_ioend(), but that is
> trivial to separate and pass explicitly.  This means we really only need
> 1 bit of state per block per page from the buffered write path in the
> writeback path.  Everything else we do with the bufferhead is purely to
> make the buffered IO front end continue to work correctly. i.e we've
> pretty much marginalised bufferheads in the writeback path completely.
> 
> Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> [hch: forward port + slight refactoring]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_aops.c | 89 ++++++++++++++++++++---------------------------
>  1 file changed, 37 insertions(+), 52 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 8c1a28f39197..165891ecb4ba 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -454,19 +454,19 @@ xfs_map_blocks(
>  	} else if (imap.br_startblock == HOLESTARTBLOCK) {
>  		/* landed in a hole */
>  		wpc->io_type = XFS_IO_HOLE;
> -	}
> -
> -	if (wpc->io_type == XFS_IO_DELALLOC &&
> -	    (!nimaps || isnullstartblock(imap.br_startblock)))
> -		goto allocate_blocks;
> +	} else {
> +		if (isnullstartblock(imap.br_startblock)) {
> +			/* got a delalloc extent */
> +			wpc->io_type = XFS_IO_DELALLOC;
> +			goto allocate_blocks;
> +		}
>  
> -#ifdef DEBUG
> -	if (wpc->io_type == XFS_IO_UNWRITTEN) {
> -		ASSERT(nimaps);
> -		ASSERT(imap.br_startblock != HOLESTARTBLOCK);
> -		ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
> +		if (imap.br_state == XFS_EXT_UNWRITTEN)
> +			wpc->io_type = XFS_IO_UNWRITTEN;
> +		else
> +			wpc->io_type = XFS_IO_OVERWRITE;
>  	}
> -#endif
> +
>  	wpc->imap = imap;
>  	trace_xfs_map_blocks_found(ip, offset, count, wpc->io_type, &imap);
>  	return 0;
> @@ -736,6 +736,14 @@ xfs_map_at_offset(
>  	set_buffer_mapped(bh);
>  	clear_buffer_delay(bh);
>  	clear_buffer_unwritten(bh);
> +
> +	/*
> +	 * If this is a realtime file, data may be on a different device.
> +	 * to that pointed to from the buffer_head b_bdev currently. We can't
> +	 * trust that the bufferhead has a already been mapped correctly, so
> +	 * set the bdev now.
> +	 */
> +	bh->b_bdev = xfs_find_bdev_for_inode(inode);
>  }
>  
>  STATIC void
> @@ -822,58 +830,36 @@ xfs_writepage_map(
>  {
>  	LIST_HEAD(submit_list);
>  	struct xfs_ioend	*ioend, *next;
> -	struct buffer_head	*bh, *head;
> +	struct buffer_head	*bh;
>  	ssize_t			len = i_blocksize(inode);
>  	uint64_t		file_offset;	/* file offset of page */
> +	unsigned		poffset;	/* offset into page */
>  	int			error = 0;
>  	int			count = 0;
> -	unsigned int		new_type;
>  
> -	bh = head = page_buffers(page);
> +	/*
> +	 * Walk the blocks on the page, and we we run off then end of the
> +	 * current map or find the current map invalid, grab a new one.

Will rework this comment to read:

"...and if we run off the end of the current map or..."

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> +	 * We only use bufferheads here to check per-block state - they no
> +	 * longer control the iteration through the page. This allows us to
> +	 * replace the bufferhead with some other state tracking mechanism in
> +	 * future.
> +	 */
>  	file_offset = page_offset(page);
> -	do {
> +	bh = page_buffers(page);
> +	for (poffset = 0;
> +	     poffset < PAGE_SIZE;
> +	     poffset += len, file_offset += len, bh = bh->b_this_page) {
> +		/* past the range we are writing, so nothing more to write. */
>  		if (file_offset >= end_offset)
>  			break;
>  
> -		/*
> -		 * set_page_dirty dirties all buffers in a page, independent
> -		 * of their state.  The dirty state however is entirely
> -		 * meaningless for holes (!mapped && uptodate), so skip
> -		 * buffers covering holes here.
> -		 */
> -		if (!buffer_mapped(bh) && buffer_uptodate(bh))
> -			continue;
> -
> -		if (buffer_unwritten(bh))
> -			new_type = XFS_IO_UNWRITTEN;
> -		else if (buffer_delay(bh))
> -			new_type = XFS_IO_DELALLOC;
> -		else if (buffer_uptodate(bh))
> -			new_type = XFS_IO_OVERWRITE;
> -		else {
> +		if (!buffer_uptodate(bh)) {
>  			if (PageUptodate(page))
>  				ASSERT(buffer_mapped(bh));
> -			/*
> -			 * This buffer is not uptodate and will not be
> -			 * written to disk.
> -			 */
>  			continue;
>  		}
>  
> -		/*
> -		 * If we already have a valid COW mapping keep using it.
> -		 */
> -		if (wpc->io_type == XFS_IO_COW &&
> -		    xfs_imap_valid(inode, &wpc->imap, file_offset)) {
> -			wpc->imap_valid = true;
> -			new_type = XFS_IO_COW;
> -		}
> -
> -		if (wpc->io_type != new_type) {
> -			wpc->io_type = new_type;
> -			wpc->imap_valid = false;
> -		}
> -
>  		if (wpc->imap_valid)
>  			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
>  							 file_offset);
> @@ -891,11 +877,10 @@ xfs_writepage_map(
>  			continue;
>  
>  		lock_buffer(bh);
> -		if (wpc->io_type != XFS_IO_OVERWRITE)
> -			xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
> +		xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
>  		xfs_add_to_ioend(inode, bh, file_offset, wpc, wbc, &submit_list);
>  		count++;
> -	} while (file_offset += len, ((bh = bh->b_this_page) != head));
> +	}
>  
>  	ASSERT(wpc->ioend || list_empty(&submit_list));
>  
> -- 
> 2.17.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html