From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Brian Foster <bfoster@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>,
	linux-xfs@vger.kernel.org, hch@infradead.org
Subject: Re: [PATCH 1/2] xfs: force writes to delalloc regions to unwritten
Date: Thu, 14 May 2020 09:33:17 -0700	[thread overview]
Message-ID: <20200514163317.GA6714@magnolia> (raw)
In-Reply-To: <20200507103232.GB9003@bfoster>

On Thu, May 07, 2020 at 06:32:32AM -0400, Brian Foster wrote:
> On Mon, Feb 03, 2020 at 12:14:45PM -0800, Darrick J. Wong wrote:
> > On Mon, Jan 20, 2020 at 07:49:25AM +1100, Dave Chinner wrote:
> > > On Wed, Jan 15, 2020 at 10:15:50PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > When writing to a delalloc region in the data fork, commit the new
> > > > allocations (of the da reservation) as unwritten so that the mappings
> > > > are only marked written once writeback completes successfully.  This
> > > > fixes the problem of stale data exposure if the system goes down during
> > > > targeted writeback of a specific region of a file, as tested by
> > > > generic/042.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_bmap.c |   28 +++++++++++++++++-----------
> > > >  1 file changed, 17 insertions(+), 11 deletions(-)
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > > index 4544732d09a5..220ea1dc67ab 100644
> > > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > > @@ -4190,17 +4190,7 @@ xfs_bmapi_allocate(
> > > >  	bma->got.br_blockcount = bma->length;
> > > >  	bma->got.br_state = XFS_EXT_NORM;
> > > >  
> > > > -	/*
> > > > -	 * In the data fork, a wasdelay extent has been initialized, so
> > > > -	 * shouldn't be flagged as unwritten.
> > > > -	 *
> > > > -	 * For the cow fork, however, we convert delalloc reservations
> > > > -	 * (extents allocated for speculative preallocation) to
> > > > -	 * allocated unwritten extents, and only convert the unwritten
> > > > -	 * extents to real extents when we're about to write the data.
> > > > -	 */
> > > > -	if ((!bma->wasdel || (bma->flags & XFS_BMAPI_COWFORK)) &&
> > > > -	    (bma->flags & XFS_BMAPI_PREALLOC))
> > > > +	if (bma->flags & XFS_BMAPI_PREALLOC)
> > > >  		bma->got.br_state = XFS_EXT_UNWRITTEN;
> > > >  
> > > >  	if (bma->wasdel)
> > > > @@ -4608,8 +4598,24 @@ xfs_bmapi_convert_delalloc(
> > > >  	bma.offset = bma.got.br_startoff;
> > > >  	bma.length = max_t(xfs_filblks_t, bma.got.br_blockcount, MAXEXTLEN);
> > > >  	bma.minleft = xfs_bmapi_minleft(tp, ip, whichfork);
> > > > +
> > > > +	/*
> > > > +	 * When we're converting the delalloc reservations backing dirty pages
> > > > +	 * in the page cache, we must be careful about how we create the new
> > > > +	 * extents:
> > > > +	 *
> > > > +	 * New CoW fork extents are created unwritten, turned into real extents
> > > > +	 * when we're about to write the data to disk, and mapped into the data
> > > > +	 * fork after the write finishes.  End of story.
> > > > +	 *
> > > > +	 * New data fork extents must be mapped in as unwritten and converted
> > > > +	 * to real extents after the write succeeds to avoid exposing stale
> > > > +	 * disk contents if we crash.
> > > > +	 */
> > > >  	if (whichfork == XFS_COW_FORK)
> > > >  		bma.flags = XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
> > > > +	else
> > > > +		bma.flags = XFS_BMAPI_PREALLOC;
> > > 
> > > 	bma.flags = XFS_BMAPI_PREALLOC;
> > > 	if (whichfork == XFS_COW_FORK)
> > > 		bma.flags |= XFS_BMAPI_COWFORK;
> > > 
> > > However, I'm still not convinced that this is the right/best
> > > solution to the problem. It is the easiest, yes, but the downside
> > > on fast/high-IOPS storage and/or under low memory conditions has
> > > the potential to be extremely significant.
> > > 
> > > I suspect that heavy users of buffered O_DSYNC writes into sparse
> > > files are going to notice this the most - there are databases out
> > > there that work this way. And I suspect that most of the workloads
> > > that use buffered O_DSYNC IO heavily won't see this change for years
> > > as enterprise upgrade cycles are notoriously slow.
> > > 
> > > IOWs, all I see this change doing is kicking the can down the road
> > > and guaranteeing that we'll still have to solve this stale data
> > > exposure problem more efficiently in the future. And instead of
> > > doing it now when we have the time and freedom to do the work, it
> > > will have to be done urgently under high priority escalation
> > > pressures...
> > 
> > FWIW I'm *already* under urgent high priority GA blocker escalation
> > pressure, which is why this came up again.
> > 
> > Granted, it did take 12 days of losing the battle with the distro
> > folks over my insistence that this really isn't a release blocker
> > (but teh sekuritehs!!) but... oh right, I forgot that xfs actually
> > /does/ crash more than once per day in our environment.
> > 
> > I guess *we* will find out how much performance really disappears if you
> > do it this way. :P
> > 
> 
> Sorry for resurrecting an old thread here, but I was thinking about this
> problem a bit and realized I didn't have a great handle on the concerns
> with using unwritten extents for delalloc writeback. Dave calls out the
> O_DSYNC buffered writes into sparse files case above. I don't see any
> numbers posted here so I ran some quick tests using a large ramdisk to
> get low latency I/O.
> 
> I only seem to require a couple threads to max out single file, random
> 4k dsync buffered write iops in this particular setup. I see ~30.6k iops
> from a baseline 5.7.0-rc1 kernel and that drops to ~25.7k iops when
> using unwritten extents for delalloc conversion. However, note that the
> same workload through single threaded aio+dio (qd 32) runs at ~63.7k
> iops. That's already using unwritten extents for dio so it's unaffected
> by this patch. Also note that using a 10MB extent size hint puts the
> dsync buffered write case at ~27k iops (again for both kernels because
> we're already using unwritten extents in that case as well).
> 
> For reference, full file preallocation (i.e. no allocs, unwritten
> extents) runs at ~27k iops for the buffered write case and ~87k iops for
> aio+dio. The overwrite (no unwritten, no alloc) case gets to ~250k iops
> with the same couple dsync buffered write threads and close to 300k iops
> with single threaded aio+dio (which I think is maxing out my memory
> bandwidth).
> 
> Altogether, this has me wondering whether it's really worth the
> complexity of trying to avoid the overhead of unwritten extents for
> delalloc conversion. There is a noticeable hit, but it's an already slow
> path compared to async I/O mechanisms. Further, it's a workload that
> typically comes with a recommendation to use extent size hints to avoid
> fragmentation issues and minimize allocation overhead, and that feature
> already bypasses delalloc extents in favor of unwritten extents.
> Thoughts? Suggestions for other tests?

4-5 months ago I ran more or less the same benchmark (albeit with
$someproduct) and came to the same conclusion -- if you're really doing
scattershot buffered O_DSYNC writes to a file, you'll lose about 15-20%
with this patch added.  Then apparently I ... got buried in xmas and
other bugs and forgot to send the results. :/
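
(By "scattershot buffered O_DSYNC writes" I mean roughly the sketch
below -- a hand-rolled random 4k O_DSYNC writer aimed at a sparse
file.  The mount point, file size, and iteration count are invented
for illustration; it's not the exact job either of us ran.)

/* Scattershot O_DSYNC buffered writes into a sparse file. */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const off_t file_size = 1ULL << 30;	/* 1GB, kept sparse */
	char buf[4096];
	off_t off;
	int fd, i;

	memset(buf, 0xab, sizeof(buf));

	fd = open("/mnt/scratch/testfile",
		  O_WRONLY | O_CREAT | O_DSYNC, 0644);
	if (fd < 0)
		return 1;

	/* Size the file without allocating; every write hits a hole. */
	if (ftruncate(fd, file_size) < 0)
		return 1;

	for (i = 0; i < 100000; i++) {
		/*
		 * Random 4k-aligned offset: the write creates a
		 * delalloc extent, and O_DSYNC forces writeback (and,
		 * with this patch, the unwritten-to-written
		 * conversion) before pwrite returns.
		 */
		off = (random() % (file_size / 4096)) * 4096;
		if (pwrite(fd, buf, sizeof(buf), off) != sizeof(buf))
			break;
	}

	close(fd);
	return 0;
}

Each of those pwrite()s pays for delalloc conversion at writeback and,
with this patch, an unwritten-to-written conversion at I/O completion
on top; that extra conversion is what shows up as the 15-20%.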

Granted, you had to /force/ $someproduct to do this because it would
typically do synchronous aio+dio, or async writes with an fsync at the
important parts, or set an extent size hint, or (the default) write
zeroes ahead of time so that XFS stays out of the way when checkpoints
need to get done ASAP.
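
(For reference, the extent size hint route is the usual answer here --
it's what xfs_io's "extsize" command does.  A rough sketch of doing it
from C, with a made-up path and Brian's 10MB hint; the hint generally
has to be set while the file is still empty:)

/* Set a 10MB extent size hint; later writes then allocate unwritten
 * extents up front instead of going through delalloc.
 */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

int set_extsize_hint(const char *path)
{
	struct fsxattr fsx;
	int fd, ret;

	fd = open(path, O_RDWR | O_CREAT, 0644);
	if (fd < 0)
		return -1;

	ret = ioctl(fd, FS_IOC_FSGETXATTR, &fsx);
	if (!ret) {
		fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
		fsx.fsx_extsize = 10 * 1024 * 1024;	/* bytes */
		ret = ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
	}

	close(fd);
	return ret;
}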

I could say (glibly) that I'm so buried in bug triage that what's a few
more?  But maybe the rest of you have other opinions? :)

--D

> 
> Brian
> 
> > --D
> > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
> > 
> 
