All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 06/11] xfs: fix up cowextsz allocation shortfalls
Date: Fri, 26 Jan 2018 11:12:59 -0800	[thread overview]
Message-ID: <20180126191259.GB9068@magnolia> (raw)
In-Reply-To: <20180126130625.GC47923@bfoster.bfoster>

On Fri, Jan 26, 2018 at 08:06:25AM -0500, Brian Foster wrote:
> On Thu, Jan 25, 2018 at 12:20:33PM -0800, Darrick J. Wong wrote:
> > On Thu, Jan 25, 2018 at 12:31:12PM -0500, Brian Foster wrote:
> > > On Tue, Jan 23, 2018 at 06:18:35PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > In xfs_bmap_btalloc, we try using the CoW extent size hint to force
> > > > allocations to align (offset-wise) to cowextsz granularity to reduce CoW
> > > > fragmentation.  This works fine until we cannot satisfy the allocation
> > > > with enough blocks to cover the requested range and the alignment hints.
> > > > If this happens, return an unaligned region because if we don't the
> > > > extent trim functions cause us to return a zero-length extent to iomap,
> > > > which iomap doesn't catch and thus blows up.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > 
> > > Hmm.. is this a direct I/O thing? The description of the problem had me
> > 
> > Yes.
> > 
> > > wondering how we handle this with regard to dio and traditional extent
> > > size hints. It looks like we just return -ENOSPC if xfs_bmapi_write()
> > > doesn't return a mapping that covers the target range of the write (even
> > > if it apparently attempts to allocate part of the associated extent size
> > > hint range). E.g., see the nimaps == 0 check in
> > > xfs_iomap_write_direct() after we commit the transaction.
> > 
> > I did take a look at that, and didn't like it.
> > 
> > There's enough free space to fill the dio write, but the free space
> > itself is very fragmented so we can't honor the hint.  We did however
> > manage to allocate /some/ blocks, so we might as well return what we got
> > and let the next iteration of the iomap_apply loop try to fill the rest
> > of the write request.  We already reserved enough space, so the write
> > should succeed totally, not return to userspace with either a short
> > write or ENOSPC just because free space is fragmented.
> > 
> 
> I'm not following how a short write can necessarily be prevented, since
> space reservation doesn't guarantee contiguity and afaict we only make a
> single mapping call. I suppose the iomap level can loop, but that's
> outside of the context where blocks are reserved. Hm?
> 
> But regardless, this behavior seems reasonable to me if we apply it
> consistently between cow fork hint behavior and traditional extent size
> hint behavior. They are both hints, after all. I do think an nimaps
> check might still be appropriate in that reflink code path simply to
> cover the case of unexpected behavior or a bug, rather than brace for
> whatever is going to happen if we continue to shuffle a bogus imap
> around.

Yes, the new version makes the behavior consistent for both hints, and
adds the "got no blocks, so sad" check to _reflink_allocate_cow.

--D

> Brian
> 
> > (The other problem is that if we return ENOSPC out of iomap_begin, that
> > error code will bubble all the way back to userspace even if we /did/
> > write something, which means that even the programs that handle short
> > dio writes correctly will see that ENOSPC and bail out.  Goldwyn has
> > been trying to fix that braindamage for some time now.)
> > 
> > > In fact, it looks like just repeating the failed write could eventually
> > > succeed if the issue is that there is actually enough free space
> > > available to allocate the hint range up to where the write is targeted,
> > > just no long enough extent available to fill the extent size hint range
> > > in a single bmapi_write call. That behavior is a bit strange, I admit,
> > > but I'm wondering if we could do the same thing for the cow hint. Would
> > > a similar nimaps check in the xfs_bmapi_write() caller resolve the bug
> > > described here?
> > > 
> > > If so and if we still care to actually change/fix the allocation
> > > behavior with regard to the hints, perhaps we could do that in a
> > > separate patch more generically for both hints..?
> > 
> > I get the feeling we could apply this change to all the data fork
> > bmap_btalloc calls too.  I'll go study that in more depth.
> > 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > >  fs/iomap.c               |    2 +-
> > > >  fs/xfs/libxfs/xfs_bmap.c |   21 +++++++++++++++++++--
> > > >  2 files changed, 20 insertions(+), 3 deletions(-)
> > > > 
> > > > 
> > > > diff --git a/fs/iomap.c b/fs/iomap.c
> > > > index e5de772..aec35a0 100644
> > > > --- a/fs/iomap.c
> > > > +++ b/fs/iomap.c
> > > > @@ -63,7 +63,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
> > > >  	ret = ops->iomap_begin(inode, pos, length, flags, &iomap);
> > > >  	if (ret)
> > > >  		return ret;
> > > > -	if (WARN_ON(iomap.offset > pos))
> > > > +	if (WARN_ON(iomap.offset > pos) || WARN_ON(iomap.length == 0))
> > > >  		return -EIO;
> > > >  
> > > >  	/*
> > > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > > index 93ce2c6..4ec1fdc5 100644
> > > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > > @@ -3480,8 +3480,20 @@ xfs_bmap_btalloc_filestreams(
> > > >  static void
> > > >  xfs_bmap_btalloc_cow(
> > > >  	struct xfs_bmalloca	*ap,
> > > > -	struct xfs_alloc_arg	*args)
> > > > +	struct xfs_alloc_arg	*args,
> > > > +	xfs_fileoff_t		orig_offset,
> > > > +	xfs_extlen_t		orig_length)
> > > >  {
> > > > +	/*
> > > > +	 * If we didn't get enough blocks to satisfy the cowextsize
> > > > +	 * aligned request, break the alignment and return whatever we
> > > > +	 * got; it's the best we can do.
> > > > +	 */
> > > > +	if (ap->length <= orig_length)
> > > > +		ap->offset = orig_offset;
> > > > +	else if (ap->offset + ap->length < orig_offset + orig_length)
> > > > +		ap->offset = orig_offset + orig_length - ap->length;
> > > > +
> > > >  	/* Filling a previously reserved extent; nothing to do here. */
> > > >  	if (ap->wasdel)
> > > >  		return;
> > > > @@ -3520,6 +3532,8 @@ xfs_bmap_btalloc(
> > > >  	xfs_agnumber_t	fb_agno;	/* ag number of ap->firstblock */
> > > >  	xfs_agnumber_t	ag;
> > > >  	xfs_alloc_arg_t	args;
> > > > +	xfs_fileoff_t	orig_offset;
> > > > +	xfs_extlen_t	orig_length;
> > > >  	xfs_extlen_t	blen;
> > > >  	xfs_extlen_t	nextminlen = 0;
> > > >  	int		nullfb;		/* true if ap->firstblock isn't set */
> > > > @@ -3529,6 +3543,8 @@ xfs_bmap_btalloc(
> > > >  	int		stripe_align;
> > > >  
> > > >  	ASSERT(ap->length);
> > > > +	orig_offset = ap->offset;
> > > > +	orig_length = ap->length;
> > > >  
> > > >  	mp = ap->ip->i_mount;
> > > >  
> > > > @@ -3745,7 +3761,8 @@ xfs_bmap_btalloc(
> > > >  		ASSERT(nullfb || fb_agno <= args.agno);
> > > >  		ap->length = args.len;
> > > >  		if (ap->flags & XFS_BMAPI_COWFORK) {
> > > > -			xfs_bmap_btalloc_cow(ap, &args);
> > > > +			xfs_bmap_btalloc_cow(ap, &args, orig_offset,
> > > > +					orig_length);
> > > >  		} else {
> > > >  			ap->ip->i_d.di_nblocks += args.len;
> > > >  			xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
> > > > 
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

  reply	other threads:[~2018-01-26 19:13 UTC|newest]

Thread overview: 60+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-01-24  2:17 [PATCH 00/11] xfs: reflink/scrub/quota fixes Darrick J. Wong
2018-01-24  2:18 ` [PATCH 01/11] xfs: reflink should break pnfs leases before sharing blocks Darrick J. Wong
2018-01-24 14:16   ` Brian Foster
2018-01-26  9:06   ` Christoph Hellwig
2018-01-26 18:26     ` Darrick J. Wong
2018-01-24  2:18 ` [PATCH 02/11] xfs: only grab shared inode locks for source file during reflink Darrick J. Wong
2018-01-24 14:18   ` Brian Foster
2018-01-24 18:40     ` Darrick J. Wong
2018-01-26 12:07   ` Christoph Hellwig
2018-01-26 18:48     ` Darrick J. Wong
2018-01-27  3:32     ` Dave Chinner
2018-01-24  2:18 ` [PATCH 03/11] xfs: call xfs_qm_dqattach before performing reflink operations Darrick J. Wong
2018-01-24 14:18   ` Brian Foster
2018-01-26  9:07   ` Christoph Hellwig
2018-01-24  2:18 ` [PATCH 04/11] xfs: CoW fork operations should only update quota reservations Darrick J. Wong
2018-01-24 14:22   ` Brian Foster
2018-01-24 19:14     ` Darrick J. Wong
2018-01-25 13:01       ` Brian Foster
2018-01-25 17:52         ` Darrick J. Wong
2018-01-25  1:20   ` [PATCH v2 " Darrick J. Wong
2018-01-25 13:03     ` Brian Foster
2018-01-25 18:20       ` Darrick J. Wong
2018-01-26 13:02         ` Brian Foster
2018-01-26 18:40           ` Darrick J. Wong
2018-01-26 12:12     ` Christoph Hellwig
2018-01-24  2:18 ` [PATCH 05/11] xfs: track CoW blocks separately in the inode Darrick J. Wong
2018-01-25 13:06   ` Brian Foster
2018-01-25 19:21     ` Darrick J. Wong
2018-01-26 13:04       ` Brian Foster
2018-01-26 19:08         ` Darrick J. Wong
2018-01-26 12:15   ` Christoph Hellwig
2018-01-26 19:00     ` Darrick J. Wong
2018-01-26 23:51       ` Darrick J. Wong
2018-01-24  2:18 ` [PATCH 06/11] xfs: fix up cowextsz allocation shortfalls Darrick J. Wong
2018-01-25 17:31   ` Brian Foster
2018-01-25 20:20     ` Darrick J. Wong
2018-01-26 13:06       ` Brian Foster
2018-01-26 19:12         ` Darrick J. Wong [this message]
2018-01-26  9:11   ` Christoph Hellwig
2018-01-24  2:18 ` [PATCH 07/11] xfs: always zero di_flags2 when we free the inode Darrick J. Wong
2018-01-25 17:31   ` Brian Foster
2018-01-25 18:36     ` Darrick J. Wong
2018-01-26  9:08   ` Christoph Hellwig
2018-01-24  2:18 ` [PATCH 08/11] xfs: fix tracepoint %p formats Darrick J. Wong
2018-01-25 17:31   ` Brian Foster
2018-01-25 18:47     ` Darrick J. Wong
2018-01-26  0:19       ` Darrick J. Wong
2018-01-26  9:09         ` Christoph Hellwig
2018-01-24  2:18 ` [PATCH 09/11] xfs: make tracepoint inode number format consistent Darrick J. Wong
2018-01-25 17:31   ` Brian Foster
2018-01-26  9:09   ` Christoph Hellwig
2018-01-24  2:19 ` [PATCH 10/11] xfs: refactor inode verifier corruption error printing Darrick J. Wong
2018-01-25 17:31   ` Brian Foster
2018-01-25 18:23     ` Darrick J. Wong
2018-01-26  9:10   ` Christoph Hellwig
2018-01-24  2:19 ` [PATCH 11/11] xfs: don't clobber inobt/finobt cursors when xref with rmap Darrick J. Wong
2018-01-26  9:10   ` Christoph Hellwig
2018-01-25  5:26 ` [PATCH 12/11] xfs: refactor quota code in xfs_bmap_btalloc Darrick J. Wong
2018-01-26 12:17   ` Christoph Hellwig
2018-01-26 21:46     ` Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180126191259.GB9068@magnolia \
    --to=darrick.wong@oracle.com \
    --cc=bfoster@redhat.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.