All of lore.kernel.org
 help / color / mirror / Atom feed
From: Carlos Maiolino <cmaiolino@redhat.com>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [RFC PATCH] xfs: skip discard of unwritten extents
Date: Mon, 30 Apr 2018 11:06:35 +0200	[thread overview]
Message-ID: <20180430090635.nsbqgygaftboou3k@odin.usersys.redhat.com> (raw)
In-Reply-To: <20180426180624.6134-1-bfoster@redhat.com>

Hi,

On Thu, Apr 26, 2018 at 02:06:24PM -0400, Brian Foster wrote:
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> 
> Hi all,
> 
> What do folks think of something like this? The motivation here is that
> the VDO (dedup) devs had reported seeing online discards during
> write-only workloads. These turn out to be related to trimming post-eof
> preallocation blocks after large file copies. To my knowledge, this
> isn't really a prevalent or serious issue, but I think that technically
> these discards are unnecessary and so I was looking into how we could
> avoid them.
> 
> This behavior is of course not directly related to unwritten extents,
> but the immediate/obvious solution to bubble up a bmapi flag of some
> kind to xfs_free_eofblocks() seemed rather crude. From there, I figured
> that we technically don't need to discard any unwritten extents (within
> or beyond EOF) because they haven't been written to since being
> allocated. In fact, I'm not sure we have to even busy them, but it's
> roughly equivalent logic either way and I'm trying to avoid getting too
> clever.
> 

There is one thing I'm wondering here, pardon me if my assumptions are wrong.

Regarding the discard of post-eof blocks, one thing that comes to my mind, is
what would happen after let's say, a failed fallocate (I've run into this issued
a few months ago, and it just came to my mind while looking into this patch).
Currently, after a failed fallocate (issued beyond EOF), we do not rollback any
successfully allocated extents, keeping such extents allocated to the file,
consuming filesystem's space.

Today, such extents are freed when xfs_free_eofblocks() is called, freeing up
the space partially reserved by the failed fallocate.

I honestly do not remember if in such situation, we might already have issued
any write to the underlying device or not, if we did, issuing a discard here is
still useful.

I do *think* no writes have been issued to the block device so skipping discards
is not an issue, but I thought it might be worth to bring up such case.

The right approach from user is still to rollback its failed fallocate, but I
still think post eof block trim is useful in this case. Though I'm not sure if
the discard itself is useful or not.

Thoughts?

Cheers
> I also recall that we've discussed using unwritten extents for delalloc
> -> real conversion to avoid the small stale data exposure window that
> exists in writeback. Without getting too deep into the reason we don't
> currently do an initial unwritten allocation [1], I don't think there's
> anything blocking us from converting any post-eof blocks that happen to
> be part of the resulting normal allocation. As it is, the imap is
> already trimmed to EOF by the writeback code for coherency reasons. If
> we were to convert post-eof blocks (not part of this patch) along with
> something like this patch, then we'd indirectly prevent discards for
> eofblocks trims.
> 
> Beyond the whole discard thing, conversion of post-eof blocks may have a
> couple other advantages. First, we eliminate the aforementioned
> writeback stale data exposure problem for writes over preallocated
> blocks (which doesn't solve the fundamental problem, but closes the
> gap). Second, the zeroing required for post-eof writes that jump over
> eofblocks (see xfs_file_aio_write_checks()) becomes a much lighter
> weight operation. Normal blocks are zeroed using buffered writes whereas
> this is essentially a no-op for unwritten extents.
> 
> Thoughts? Flames? Other ideas?
> 
> Brian
> 
> [1] I think I've actually attempted this change in the past, but I
> haven't dug through my old git branches as of yet to completely jog my
> memory. IIRC, this may have been held up by the remnants of buffer_heads
> being used to track state for the writeback code.
> 
>  fs/xfs/libxfs/xfs_alloc.c          | 8 ++++++--
>  fs/xfs/libxfs/xfs_alloc.h          | 3 ++-
>  fs/xfs/libxfs/xfs_bmap.c           | 9 ++++++---
>  fs/xfs/libxfs/xfs_bmap.h           | 3 ++-
>  fs/xfs/libxfs/xfs_bmap_btree.c     | 2 +-
>  fs/xfs/libxfs/xfs_ialloc.c         | 4 ++--
>  fs/xfs/libxfs/xfs_ialloc_btree.c   | 2 +-
>  fs/xfs/libxfs/xfs_refcount.c       | 7 ++++---
>  fs/xfs/libxfs/xfs_refcount_btree.c | 2 +-
>  fs/xfs/xfs_extfree_item.c          | 2 +-
>  fs/xfs/xfs_fsops.c                 | 2 +-
>  fs/xfs/xfs_reflink.c               | 2 +-
>  fs/xfs/xfs_trans.h                 | 3 ++-
>  fs/xfs/xfs_trans_extfree.c         | 7 ++++---
>  14 files changed, 34 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 4bcc095fe44a..942c90ec6747 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2954,13 +2954,15 @@ xfs_free_extent(
>  	xfs_fsblock_t		bno,	/* starting block number of extent */
>  	xfs_extlen_t		len,	/* length of extent */
>  	struct xfs_owner_info	*oinfo,	/* extent owner */
> -	enum xfs_ag_resv_type	type)	/* block reservation type */
> +	enum xfs_ag_resv_type	type,	/* block reservation type */
> +	bool			skip_discard)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
>  	struct xfs_buf		*agbp;
>  	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, bno);
>  	xfs_agblock_t		agbno = XFS_FSB_TO_AGBNO(mp, bno);
>  	int			error;
> +	unsigned int		busy_flags = 0;
>  
>  	ASSERT(len != 0);
>  	ASSERT(type != XFS_AG_RESV_AGFL);
> @@ -2984,7 +2986,9 @@ xfs_free_extent(
>  	if (error)
>  		goto err;
>  
> -	xfs_extent_busy_insert(tp, agno, agbno, len, 0);
> +	if (skip_discard)
> +		busy_flags |= XFS_EXTENT_BUSY_SKIP_DISCARD;
> +	xfs_extent_busy_insert(tp, agno, agbno, len, busy_flags);
>  	return 0;
>  
>  err:
> diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> index cbf789ea5a4e..5c7d8391edc4 100644
> --- a/fs/xfs/libxfs/xfs_alloc.h
> +++ b/fs/xfs/libxfs/xfs_alloc.h
> @@ -196,7 +196,8 @@ xfs_free_extent(
>  	xfs_fsblock_t		bno,	/* starting block number of extent */
>  	xfs_extlen_t		len,	/* length of extent */
>  	struct xfs_owner_info	*oinfo,	/* extent owner */
> -	enum xfs_ag_resv_type	type);	/* block reservation type */
> +	enum xfs_ag_resv_type	type,	/* block reservation type */
> +	bool			skip_discard);
>  
>  int				/* error */
>  xfs_alloc_lookup_le(
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 040eeda8426f..a5a37803f589 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -547,7 +547,8 @@ xfs_bmap_add_free(
>  	struct xfs_defer_ops		*dfops,
>  	xfs_fsblock_t			bno,
>  	xfs_filblks_t			len,
> -	struct xfs_owner_info		*oinfo)
> +	struct xfs_owner_info		*oinfo,
> +	bool				skip_discard)
>  {
>  	struct xfs_extent_free_item	*new;		/* new element */
>  #ifdef DEBUG
> @@ -574,6 +575,7 @@ xfs_bmap_add_free(
>  		new->xefi_oinfo = *oinfo;
>  	else
>  		xfs_rmap_skip_owner_update(&new->xefi_oinfo);
> +	new->xefi_skip_discard = skip_discard;
>  	trace_xfs_bmap_free_defer(mp, XFS_FSB_TO_AGNO(mp, bno), 0,
>  			XFS_FSB_TO_AGBNO(mp, bno), len);
>  	xfs_defer_add(dfops, XFS_DEFER_OPS_TYPE_FREE, &new->xefi_list);
> @@ -632,7 +634,7 @@ xfs_bmap_btree_to_extents(
>  	if ((error = xfs_btree_check_block(cur, cblock, 0, cbp)))
>  		return error;
>  	xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, whichfork);
> -	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, cbno, 1, &oinfo);
> +	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, cbno, 1, &oinfo, false);
>  	ip->i_d.di_nblocks--;
>  	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, -1L);
>  	xfs_trans_binval(tp, cbp);
> @@ -5106,7 +5108,8 @@ xfs_bmap_del_extent_real(
>  				goto done;
>  		} else
>  			xfs_bmap_add_free(mp, dfops, del->br_startblock,
> -					del->br_blockcount, NULL);
> +					del->br_blockcount, NULL,
> +					del->br_state == XFS_EXT_UNWRITTEN);
>  	}
>  
>  	/*
> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> index 2b766b37096d..0d2de22d143e 100644
> --- a/fs/xfs/libxfs/xfs_bmap.h
> +++ b/fs/xfs/libxfs/xfs_bmap.h
> @@ -68,6 +68,7 @@ struct xfs_extent_free_item
>  	xfs_extlen_t		xefi_blockcount;/* number of blocks in extent */
>  	struct list_head	xefi_list;
>  	struct xfs_owner_info	xefi_oinfo;	/* extent owner */
> +	bool			xefi_skip_discard;
>  };
>  
>  #define	XFS_BMAP_MAX_NMAP	4
> @@ -194,7 +195,7 @@ int	xfs_bmap_add_attrfork(struct xfs_inode *ip, int size, int rsvd);
>  void	xfs_bmap_local_to_extents_empty(struct xfs_inode *ip, int whichfork);
>  void	xfs_bmap_add_free(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
>  			  xfs_fsblock_t bno, xfs_filblks_t len,
> -			  struct xfs_owner_info *oinfo);
> +			  struct xfs_owner_info *oinfo, bool skip_discard);
>  void	xfs_bmap_compute_maxlevels(struct xfs_mount *mp, int whichfork);
>  int	xfs_bmap_first_unused(struct xfs_trans *tp, struct xfs_inode *ip,
>  		xfs_extlen_t len, xfs_fileoff_t *unused, int whichfork);
> diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
> index d89d06bea6e3..cecddfcbe11e 100644
> --- a/fs/xfs/libxfs/xfs_bmap_btree.c
> +++ b/fs/xfs/libxfs/xfs_bmap_btree.c
> @@ -305,7 +305,7 @@ xfs_bmbt_free_block(
>  	struct xfs_owner_info	oinfo;
>  
>  	xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, cur->bc_private.b.whichfork);
> -	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, fsbno, 1, &oinfo);
> +	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, fsbno, 1, &oinfo, false);
>  	ip->i_d.di_nblocks--;
>  
>  	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index de627fa19168..854ebe04c86f 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -1837,7 +1837,7 @@ xfs_difree_inode_chunk(
>  	if (!xfs_inobt_issparse(rec->ir_holemask)) {
>  		/* not sparse, calculate extent info directly */
>  		xfs_bmap_add_free(mp, dfops, XFS_AGB_TO_FSB(mp, agno, sagbno),
> -				  mp->m_ialloc_blks, &oinfo);
> +				  mp->m_ialloc_blks, &oinfo, false);
>  		return;
>  	}
>  
> @@ -1881,7 +1881,7 @@ xfs_difree_inode_chunk(
>  		ASSERT(agbno % mp->m_sb.sb_spino_align == 0);
>  		ASSERT(contigblk % mp->m_sb.sb_spino_align == 0);
>  		xfs_bmap_add_free(mp, dfops, XFS_AGB_TO_FSB(mp, agno, agbno),
> -				  contigblk, &oinfo);
> +				  contigblk, &oinfo, false);
>  
>  		/* reset range to current bit and carry on... */
>  		startidx = endidx = nextbit;
> diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> index 367e9a0726e6..977a33cc60d3 100644
> --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> @@ -153,7 +153,7 @@ __xfs_inobt_free_block(
>  	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_INOBT);
>  	return xfs_free_extent(cur->bc_tp,
>  			XFS_DADDR_TO_FSB(cur->bc_mp, XFS_BUF_ADDR(bp)), 1,
> -			&oinfo, resv);
> +			&oinfo, resv, false);
>  }
>  
>  STATIC int
> diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
> index 560e28473024..e5cfbe2534b1 100644
> --- a/fs/xfs/libxfs/xfs_refcount.c
> +++ b/fs/xfs/libxfs/xfs_refcount.c
> @@ -883,7 +883,8 @@ xfs_refcount_adjust_extents(
>  						cur->bc_private.a.agno,
>  						tmp.rc_startblock);
>  				xfs_bmap_add_free(cur->bc_mp, dfops, fsbno,
> -						tmp.rc_blockcount, oinfo);
> +						tmp.rc_blockcount, oinfo,
> +						false);
>  			}
>  
>  			(*agbno) += tmp.rc_blockcount;
> @@ -926,7 +927,7 @@ xfs_refcount_adjust_extents(
>  					cur->bc_private.a.agno,
>  					ext.rc_startblock);
>  			xfs_bmap_add_free(cur->bc_mp, dfops, fsbno,
> -					ext.rc_blockcount, oinfo);
> +					ext.rc_blockcount, oinfo, false);
>  		}
>  
>  skip:
> @@ -1658,7 +1659,7 @@ xfs_refcount_recover_cow_leftovers(
>  
>  		/* Free the block. */
>  		xfs_bmap_add_free(mp, &dfops, fsb,
> -				rr->rr_rrec.rc_blockcount, NULL);
> +				rr->rr_rrec.rc_blockcount, NULL, false);
>  
>  		error = xfs_defer_finish(&tp, &dfops);
>  		if (error)
> diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
> index 375abfeb6267..bb0bdc6d97b5 100644
> --- a/fs/xfs/libxfs/xfs_refcount_btree.c
> +++ b/fs/xfs/libxfs/xfs_refcount_btree.c
> @@ -131,7 +131,7 @@ xfs_refcountbt_free_block(
>  	be32_add_cpu(&agf->agf_refcount_blocks, -1);
>  	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
>  	error = xfs_free_extent(cur->bc_tp, fsbno, 1, &oinfo,
> -			XFS_AG_RESV_METADATA);
> +			XFS_AG_RESV_METADATA, false);
>  	if (error)
>  		return error;
>  
> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> index b5b1e567b9f4..4735a31793b0 100644
> --- a/fs/xfs/xfs_extfree_item.c
> +++ b/fs/xfs/xfs_extfree_item.c
> @@ -542,7 +542,7 @@ xfs_efi_recover(
>  	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
>  		extp = &efip->efi_format.efi_extents[i];
>  		error = xfs_trans_free_extent(tp, efdp, extp->ext_start,
> -					      extp->ext_len, &oinfo);
> +					      extp->ext_len, &oinfo, false);
>  		if (error)
>  			goto abort_error;
>  
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 523792768080..9c555f81431e 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -502,7 +502,7 @@ xfs_growfs_data_private(
>  		error = xfs_free_extent(tp,
>  				XFS_AGB_TO_FSB(mp, agno,
>  					be32_to_cpu(agf->agf_length) - new),
> -				new, &oinfo, XFS_AG_RESV_NONE);
> +				new, &oinfo, XFS_AG_RESV_NONE, false);
>  		if (error)
>  			goto error0;
>  	}
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index cdbd342a5249..08381266ad85 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -604,7 +604,7 @@ xfs_reflink_cancel_cow_blocks(
>  
>  			xfs_bmap_add_free(ip->i_mount, &dfops,
>  					del.br_startblock, del.br_blockcount,
> -					NULL);
> +					NULL, false);
>  
>  			/* Roll the transaction */
>  			xfs_defer_ijoin(&dfops, ip);
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 9d542dfe0052..d5be3f6a3e8f 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -228,7 +228,8 @@ struct xfs_efd_log_item	*xfs_trans_get_efd(struct xfs_trans *,
>  				  uint);
>  int		xfs_trans_free_extent(struct xfs_trans *,
>  				      struct xfs_efd_log_item *, xfs_fsblock_t,
> -				      xfs_extlen_t, struct xfs_owner_info *);
> +				      xfs_extlen_t, struct xfs_owner_info *,
> +				      bool);
>  int		xfs_trans_commit(struct xfs_trans *);
>  int		xfs_trans_roll(struct xfs_trans **);
>  int		xfs_trans_roll_inode(struct xfs_trans **, struct xfs_inode *);
> diff --git a/fs/xfs/xfs_trans_extfree.c b/fs/xfs/xfs_trans_extfree.c
> index ab438647592a..cd2acfa4e562 100644
> --- a/fs/xfs/xfs_trans_extfree.c
> +++ b/fs/xfs/xfs_trans_extfree.c
> @@ -68,7 +68,8 @@ xfs_trans_free_extent(
>  	struct xfs_efd_log_item	*efdp,
>  	xfs_fsblock_t		start_block,
>  	xfs_extlen_t		ext_len,
> -	struct xfs_owner_info	*oinfo)
> +	struct xfs_owner_info	*oinfo,
> +	bool			skip_discard)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
>  	uint			next_extent;
> @@ -80,7 +81,7 @@ xfs_trans_free_extent(
>  	trace_xfs_bmap_free_deferred(tp->t_mountp, agno, 0, agbno, ext_len);
>  
>  	error = xfs_free_extent(tp, start_block, ext_len, oinfo,
> -			XFS_AG_RESV_NONE);
> +			XFS_AG_RESV_NONE, skip_discard);
>  
>  	/*
>  	 * Mark the transaction dirty, even on error. This ensures the
> @@ -195,7 +196,7 @@ xfs_extent_free_finish_item(
>  	error = xfs_trans_free_extent(tp, done_item,
>  			free->xefi_startblock,
>  			free->xefi_blockcount,
> -			&free->xefi_oinfo);
> +			&free->xefi_oinfo, free->xefi_skip_discard);
>  	kmem_free(free);
>  	return error;
>  }
> -- 
> 2.13.6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Carlos

  reply	other threads:[~2018-04-30  9:06 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-04-26 18:06 [RFC PATCH] xfs: skip discard of unwritten extents Brian Foster
2018-04-30  9:06 ` Carlos Maiolino [this message]
2018-04-30 12:17   ` Brian Foster
2018-04-30 13:26     ` Carlos Maiolino
2018-04-30 22:00 ` Dave Chinner
2018-05-01 17:38   ` Brian Foster
2018-05-01 22:39     ` Dave Chinner
2018-05-02 11:18       ` Brian Foster
2018-05-03  0:48         ` Dave Chinner
2018-05-03 12:07           ` Brian Foster

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180430090635.nsbqgygaftboou3k@odin.usersys.redhat.com \
    --to=cmaiolino@redhat.com \
    --cc=bfoster@redhat.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.