* [PATCH 0/2] xfs: fix stale disk exposure after crash
@ 2020-01-16  6:15 Darrick J. Wong
  2020-01-16  6:15 ` [PATCH 1/2] xfs: force writes to delalloc regions to unwritten Darrick J. Wong
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Darrick J. Wong @ 2020-01-16  6:15 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, hch

Hi all,

These two patches try to shrink the window during which a crash during
writeback can expose stale disk contents.  The first patch causes
delalloc reservations to be converted to unwritten extents for any
writeback that's going on within EOF.

The second patch selectively relaxes the unwritten writeout requirement
when the entire file is being flushed (a la fsync).  It also expands
writeback of a range beyond the ondisk EOF downwards to the old EOF, so
that increasing a file's size doesn't leave us vulnerable to exposure of
stale disk contents from a previous speculative allocation.

This solves the regressions in generic/536 and generic/042.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This has been lightly tested with fstests.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=stale-exposure

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=stale-exposure

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/2] xfs: force writes to delalloc regions to unwritten
  2020-01-16  6:15 [PATCH 0/2] xfs: fix stale disk exposure after crash Darrick J. Wong
@ 2020-01-16  6:15 ` Darrick J. Wong
  2020-01-16 16:47   ` Christoph Hellwig
  2020-01-19 20:49   ` Dave Chinner
  2020-01-16  6:15 ` [PATCH 2/2] xfs: relax unwritten writeback overhead under some circumstances Darrick J. Wong
  2020-01-16 16:49 ` [PATCH 0/2] xfs: fix stale disk exposure after crash Christoph Hellwig
  2 siblings, 2 replies; 17+ messages in thread
From: Darrick J. Wong @ 2020-01-16  6:15 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, hch

From: Darrick J. Wong <darrick.wong@oracle.com>

When writing to a delalloc region in the data fork, commit the new
allocations (of the da reservation) as unwritten so that the mappings
are only marked written once writeback completes successfully.  This
fixes the problem of stale data exposure if the system goes down during
targeted writeback of a specific region of a file, as tested by
generic/042.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c |   28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)
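
For readers following along, the window described in the commit message
can be modelled in a few lines of userspace C.  This is only a sketch:
the enum, struct, and function names are made up for illustration and
are not kernel interfaces.  It assumes the mapping is committed before
the data I/O completes, and shows that a crash in between exposes stale
blocks only when the mapping was committed as written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified extent states; illustrative names, not the kernel's. */
enum model_state { MODEL_DELALLOC, MODEL_UNWRITTEN, MODEL_WRITTEN };

struct model_extent {
	enum model_state	ondisk_state;	/* state in the committed mapping */
	bool			data_on_disk;	/* has the data I/O completed? */
};

/* Step 1: writeback allocates blocks and commits the new mapping. */
void model_allocate(struct model_extent *ext, bool as_unwritten)
{
	assert(ext != NULL);
	ext->ondisk_state = as_unwritten ? MODEL_UNWRITTEN : MODEL_WRITTEN;
	ext->data_on_disk = false;
}

/* Step 2: the data I/O completes and the mapping is marked written. */
void model_io_complete(struct model_extent *ext)
{
	assert(ext != NULL);
	ext->data_on_disk = true;
	ext->ondisk_state = MODEL_WRITTEN;
}

/*
 * After a crash, a read exposes stale disk contents only if the mapping
 * says "written" while the data never reached the disk.  An unwritten
 * mapping reads back as zeroes instead.
 */
bool model_exposes_stale_data(const struct model_extent *ext)
{
	assert(ext != NULL);
	return ext->ondisk_state == MODEL_WRITTEN && !ext->data_on_disk;
}

/* Simulate a crash between step 1 and step 2. */
bool crash_after_allocation(bool as_unwritten)
{
	struct model_extent ext;

	model_allocate(&ext, as_unwritten);
	return model_exposes_stale_data(&ext);
}
```

In this model crash_after_allocation(false) reports an exposure and
crash_after_allocation(true) does not, which is the behaviour change
the patch is after for data fork delalloc conversion.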


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 4544732d09a5..220ea1dc67ab 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4190,17 +4190,7 @@ xfs_bmapi_allocate(
 	bma->got.br_blockcount = bma->length;
 	bma->got.br_state = XFS_EXT_NORM;
 
-	/*
-	 * In the data fork, a wasdelay extent has been initialized, so
-	 * shouldn't be flagged as unwritten.
-	 *
-	 * For the cow fork, however, we convert delalloc reservations
-	 * (extents allocated for speculative preallocation) to
-	 * allocated unwritten extents, and only convert the unwritten
-	 * extents to real extents when we're about to write the data.
-	 */
-	if ((!bma->wasdel || (bma->flags & XFS_BMAPI_COWFORK)) &&
-	    (bma->flags & XFS_BMAPI_PREALLOC))
+	if (bma->flags & XFS_BMAPI_PREALLOC)
 		bma->got.br_state = XFS_EXT_UNWRITTEN;
 
 	if (bma->wasdel)
@@ -4608,8 +4598,24 @@ xfs_bmapi_convert_delalloc(
 	bma.offset = bma.got.br_startoff;
 	bma.length = max_t(xfs_filblks_t, bma.got.br_blockcount, MAXEXTLEN);
 	bma.minleft = xfs_bmapi_minleft(tp, ip, whichfork);
+
+	/*
+	 * When we're converting the delalloc reservations backing dirty pages
+	 * in the page cache, we must be careful about how we create the new
+	 * extents:
+	 *
+	 * New CoW fork extents are created unwritten, turned into real extents
+	 * when we're about to write the data to disk, and mapped into the data
+	 * fork after the write finishes.  End of story.
+	 *
+	 * New data fork extents must be mapped in as unwritten and converted
+	 * to real extents after the write succeeds to avoid exposing stale
+	 * disk contents if we crash.
+	 */
 	if (whichfork == XFS_COW_FORK)
 		bma.flags = XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
+	else
+		bma.flags = XFS_BMAPI_PREALLOC;
 
 	if (!xfs_iext_peek_prev_extent(ifp, &bma.icur, &bma.prev))
 		bma.prev.br_startoff = NULLFILEOFF;


* [PATCH 2/2] xfs: relax unwritten writeback overhead under some circumstances
  2020-01-16  6:15 [PATCH 0/2] xfs: fix stale disk exposure after crash Darrick J. Wong
  2020-01-16  6:15 ` [PATCH 1/2] xfs: force writes to delalloc regions to unwritten Darrick J. Wong
@ 2020-01-16  6:15 ` Darrick J. Wong
  2020-01-16 16:49   ` Christoph Hellwig
  2020-01-16 16:49 ` [PATCH 0/2] xfs: fix stale disk exposure after crash Christoph Hellwig
  2 siblings, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2020-01-16  6:15 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, hch

From: Darrick J. Wong <darrick.wong@oracle.com>

In the previous patch, we solved a stale disk contents exposure problem
by forcing the delalloc write path to create unwritten extents, write
the data, and convert the extents to written after writeback completes.

This is a pretty huge hammer to use, so we'll relax the delalloc write
strategy to go straight to written extents (as we once did) if someone
tells us to write the entire file to disk.  This reopens the exposure
window slightly, but we'll only be affected if writeback completes out
of order and the system crashes during writeback.

Because once again we can map written extents past EOF, we also
enlarge the writepages window downward if the window is beyond the
on-disk size and there are written extents after the EOF block.  This
ensures that speculative post-EOF preallocations are not left uncovered.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c |    8 ++++---
 fs/xfs/libxfs/xfs_bmap.h |    3 ++-
 fs/xfs/xfs_aops.c        |   52 +++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 58 insertions(+), 5 deletions(-)
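
The range handling described in the commit message can be sketched in
userspace as follows.  This is a hedged illustration only: struct
model_wbc, MODEL_PAGE_SHIFT, and both helpers are hypothetical
stand-ins, and the "real extent past EOF" lookup is reduced to a
boolean argument.  Note that the sketch computes the page delta before
lowering range_start, since the delta is zero once the two are equal.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT	12	/* assumes 4k pages */

/* Hypothetical stand-in for the relevant writeback_control fields. */
struct model_wbc {
	long long	range_start;	/* byte offset where writeback starts */
	long long	range_end;	/* byte offset where writeback ends */
	long		nr_to_write;	/* page budget, LONG_MAX == unlimited */
};

/*
 * If the requested range starts beyond the on-disk size and a real
 * extent exists past EOF, pull range_start down to the on-disk size.
 * Returns true if the range was extended.
 */
bool model_adjust_posteof(struct model_wbc *wbc, long long ondisk_size,
			  bool real_extent_past_eof)
{
	assert(wbc != NULL);

	if (ondisk_size >= wbc->range_start)
		return false;
	if (!real_extent_past_eof)
		return false;

	/* Grow the page budget by the pages added in front of the range. */
	if (wbc->nr_to_write != LONG_MAX)
		wbc->nr_to_write += (long)((wbc->range_start >> MODEL_PAGE_SHIFT) -
					   (ondisk_size >> MODEL_PAGE_SHIFT));

	wbc->range_start = ondisk_size;
	return true;
}

/* A whole-file flush is the only case that skips unwritten extents. */
bool model_is_full_range(const struct model_wbc *wbc)
{
	assert(wbc != NULL);
	return wbc->range_start == 0 && wbc->range_end == LLONG_MAX;
}
```

For example, a request starting at 1MiB against a 4096-byte on-disk
size is extended to start at 4096 and gains 255 pages of budget, but it
is still not treated as a full-range flush.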


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 220ea1dc67ab..65b2bd12720e 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4545,7 +4545,8 @@ xfs_bmapi_convert_delalloc(
 	int			whichfork,
 	xfs_off_t		offset,
 	struct iomap		*iomap,
-	unsigned int		*seq)
+	unsigned int		*seq,
+	bool			full_writeback)
 {
 	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
 	struct xfs_mount	*mp = ip->i_mount;
@@ -4610,11 +4611,12 @@ xfs_bmapi_convert_delalloc(
 	 *
 	 * New data fork extents must be mapped in as unwritten and converted
 	 * to real extents after the write succeeds to avoid exposing stale
-	 * disk contents if we crash.
+	 * disk contents if we crash.  We relax this requirement if we've been
+	 * told to flush all data to disk.
 	 */
 	if (whichfork == XFS_COW_FORK)
 		bma.flags = XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
-	else
+	else if (!full_writeback)
 		bma.flags = XFS_BMAPI_PREALLOC;
 
 	if (!xfs_iext_peek_prev_extent(ifp, &bma.icur, &bma.prev))
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 14d25e0b7d9c..9d0b0ed83c9f 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -228,7 +228,8 @@ int	xfs_bmapi_reserve_delalloc(struct xfs_inode *ip, int whichfork,
 		struct xfs_bmbt_irec *got, struct xfs_iext_cursor *cur,
 		int eof);
 int	xfs_bmapi_convert_delalloc(struct xfs_inode *ip, int whichfork,
-		xfs_off_t offset, struct iomap *iomap, unsigned int *seq);
+		xfs_off_t offset, struct iomap *iomap, unsigned int *seq,
+		bool full_writeback);
 int	xfs_bmap_add_extent_unwritten_real(struct xfs_trans *tp,
 		struct xfs_inode *ip, int whichfork,
 		struct xfs_iext_cursor *icur, struct xfs_btree_cur **curp,
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3a688eb5c5ae..45174dfa0b7d 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -18,10 +18,13 @@
 #include "xfs_bmap_util.h"
 #include "xfs_reflink.h"
 
+#define XFS_WRITEPAGE_FULL_RANGE	(1 << 0)
+
 struct xfs_writepage_ctx {
 	struct iomap_writepage_ctx ctx;
 	unsigned int		data_seq;
 	unsigned int		cow_seq;
+	unsigned int		flags;
 };
 
 static inline struct xfs_writepage_ctx *
@@ -327,7 +330,8 @@ xfs_convert_blocks(
 	 */
 	do {
 		error = xfs_bmapi_convert_delalloc(ip, whichfork, offset,
-				&wpc->iomap, seq);
+				&wpc->iomap, seq,
+				XFS_WPC(wpc)->flags & XFS_WRITEPAGE_FULL_RANGE);
 		if (error)
 			return error;
 	} while (wpc->iomap.offset + wpc->iomap.length <= offset);
@@ -567,6 +571,48 @@ xfs_vm_writepage(
 	return iomap_writepage(page, wbc, &wpc.ctx, &xfs_writeback_ops);
 }
 
+/*
+ * If we've been told to write a range of the file that is beyond the on-disk
+ * file size and there's a written extent beyond the EOF block, we conclude
+ * that we previously wrote a speculative post-EOF preallocation to disk (as
+ * written extents) and later extended the incore file size.
+ *
+ * To prevent exposure of the contents of those speculative preallocations
+ * after a crash, extend the writeback range all the way down to the old file
+ * size to make sure that those pages get flushed.
+ */
+static void
+xfs_vm_adjust_posteof_writepages(
+	struct xfs_inode		*ip,
+	struct writeback_control	*wbc)
+{
+	struct xfs_iext_cursor		icur;
+	struct xfs_bmbt_irec		irec;
+
+	xfs_ilock(ip, XFS_ILOCK_SHARED);
+	if (ip->i_d.di_size >= wbc->range_start)
+		goto out;
+
+	/* We're done if we can't find a real extent past EOF. */
+	if (!xfs_iext_lookup_extent(ip, XFS_IFORK_PTR(ip, XFS_DATA_FORK),
+			XFS_B_TO_FSB(ip->i_mount, ip->i_d.di_size), &icur,
+			&irec))
+		goto out;
+	if (irec.br_startblock == HOLESTARTBLOCK)
+		goto out;
+
+	/*
+	 * Adjust the number of pages to write, if needed.  Compute the delta
+	 * before range_start is lowered; afterwards it would always be zero.
+	 */
+	if (wbc->nr_to_write != LONG_MAX)
+		wbc->nr_to_write += (wbc->range_start >> PAGE_SHIFT) -
+				    (ip->i_d.di_size >> PAGE_SHIFT);
+	wbc->range_start = ip->i_d.di_size;
+out:
+	xfs_iunlock(ip, XFS_ILOCK_SHARED);
+}
+
 STATIC int
 xfs_vm_writepages(
 	struct address_space	*mapping,
@@ -574,6 +620,10 @@ xfs_vm_writepages(
 {
 	struct xfs_writepage_ctx wpc = { };
 
+	xfs_vm_adjust_posteof_writepages(XFS_I(mapping->host), wbc);
+	if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
+		wpc.flags |= XFS_WRITEPAGE_FULL_RANGE;
+
 	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
 	return iomap_writepages(mapping, wbc, &wpc.ctx, &xfs_writeback_ops);
 }


* Re: [PATCH 1/2] xfs: force writes to delalloc regions to unwritten
  2020-01-16  6:15 ` [PATCH 1/2] xfs: force writes to delalloc regions to unwritten Darrick J. Wong
@ 2020-01-16 16:47   ` Christoph Hellwig
  2020-01-16 23:16     ` Darrick J. Wong
  2020-01-19 20:49   ` Dave Chinner
  1 sibling, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2020-01-16 16:47 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Wed, Jan 15, 2020 at 10:15:50PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When writing to a delalloc region in the data fork, commit the new
> allocations (of the da reservation) as unwritten so that the mappings
> are only marked written once writeback completes successfully.  This
> fixes the problem of stale data exposure if the system goes down during
> targeted writeback of a specific region of a file, as tested by
> generic/042.

I think this is the only safe way to deal with buffered I/O into
holes, so:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 2/2] xfs: relax unwritten writeback overhead under some circumstances
  2020-01-16  6:15 ` [PATCH 2/2] xfs: relax unwritten writeback overhead under some circumstances Darrick J. Wong
@ 2020-01-16 16:49   ` Christoph Hellwig
  2020-01-16 23:15     ` Darrick J. Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2020-01-16 16:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Wed, Jan 15, 2020 at 10:15:58PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> In the previous patch, we solved a stale disk contents exposure problem
> by forcing the delalloc write path to create unwritten extents, write
> the data, and convert the extents to written after writeback completes.
> 
> This is a pretty huge hammer to use, so we'll relax the delalloc write
> strategy to go straight to written extents (as we once did) if someone
> tells us to write the entire file to disk.  This reopens the exposure
> window slightly, but we'll only be affected if writeback completes out
> of order and the system crashes during writeback.
> 
> Because once again we can map written extents past EOF, we also
> enlarge the writepages window downward if the window is beyond the
> on-disk size and there are written extents after the EOF block.  This
> ensures that speculative post-EOF preallocations are not left uncovered.

This does sound really sketchy.  Do you have any performance numbers
justifying something this nasty?

* Re: [PATCH 0/2] xfs: fix stale disk exposure after crash
  2020-01-16  6:15 [PATCH 0/2] xfs: fix stale disk exposure after crash Darrick J. Wong
  2020-01-16  6:15 ` [PATCH 1/2] xfs: force writes to delalloc regions to unwritten Darrick J. Wong
  2020-01-16  6:15 ` [PATCH 2/2] xfs: relax unwritten writeback overhead under some circumstances Darrick J. Wong
@ 2020-01-16 16:49 ` Christoph Hellwig
  2020-01-16 23:00   ` Darrick J. Wong
  2 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2020-01-16 16:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

Btw, what happened to:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=c931d4b2a6634b94cc11958706592944f55870d4

?

* Re: [PATCH 0/2] xfs: fix stale disk exposure after crash
  2020-01-16 16:49 ` [PATCH 0/2] xfs: fix stale disk exposure after crash Christoph Hellwig
@ 2020-01-16 23:00   ` Darrick J. Wong
  0 siblings, 0 replies; 17+ messages in thread
From: Darrick J. Wong @ 2020-01-16 23:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Thu, Jan 16, 2020 at 08:49:55AM -0800, Christoph Hellwig wrote:
> Btw, what happened to:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=c931d4b2a6634b94cc11958706592944f55870d4

Originally I thought that it was enough to decrease *wbc->range_start to
the ondisk EOF since the _flush_unmap function triggers a writepages
call.  Maybe it's better to make that part explicit.

However, it's only necessary to extend writeback like that if you're
going to retain the delalloc -> written state change.  I don't consider
applying patch 2 to be a good idea though.

--D

> ?

* Re: [PATCH 2/2] xfs: relax unwritten writeback overhead under some circumstances
  2020-01-16 16:49   ` Christoph Hellwig
@ 2020-01-16 23:15     ` Darrick J. Wong
  0 siblings, 0 replies; 17+ messages in thread
From: Darrick J. Wong @ 2020-01-16 23:15 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Thu, Jan 16, 2020 at 08:49:00AM -0800, Christoph Hellwig wrote:
> On Wed, Jan 15, 2020 at 10:15:58PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > In the previous patch, we solved a stale disk contents exposure problem
> > by forcing the delalloc write path to create unwritten extents, write
> > the data, and convert the extents to written after writeback completes.
> > 
> > This is a pretty huge hammer to use, so we'll relax the delalloc write
> > strategy to go straight to written extents (as we once did) if someone
> > tells us to write the entire file to disk.  This reopens the exposure
> > window slightly, but we'll only be affected if writeback completes out
> > of order and the system crashes during writeback.
> > 
> > Because once again we can map written extents past EOF, we also
> > enlarge the writepages window downward if the window is beyond the
> > on-disk size and there are written extents after the EOF block.  This
> > ensures that speculative post-EOF preallocations are not left uncovered.
> 
> This does sound really sketchy.  Do you have any performance numbers
> justifying something this nasty?

Nope! :D

IIRC Dave also expressed interest in performance impacts the last time
I sent this series, albeit more from the perspective of quantifying how
much pain we'd incur from forcing all writes to perform an unwritten
extent conversion at the end.

FWIW after months of running this on my internal systems, I haven't been
able to quantify any significant difference before and after, even with
rmap enabled.  There's slightly more log traffic from the extra
bmbt/rmapbt/inode core updates, but even then the log is fairly good at
deduping repeated updates.  Both transactions usually commit before the
log checkpoints.

Frankly I wouldn't apply this patch (or 'xfs: extend the range of
flush_unmap ranges') on the grounds that re-opening potential disclosure
flaws is never worth the risk.  I'm also pretty sure that being careful
to convert delalloc data fork extents to unwritten extents fixes the
stale disclosure flaw that Ritesh wrote about in ('iomap: direct-io:
Move inode_dio_begin before filemap_write_and_wait_range').

(As far as ext4 goes, I talked to Jan and Ted this morning and they
seemed to think that they could solve the race on their end by retaining
the unwritten state in the incore extent cache because ext4 apparently
doesn't commit the extent map update transaction until after writeback
completes.)

--D

* Re: [PATCH 1/2] xfs: force writes to delalloc regions to unwritten
  2020-01-16 16:47   ` Christoph Hellwig
@ 2020-01-16 23:16     ` Darrick J. Wong
  0 siblings, 0 replies; 17+ messages in thread
From: Darrick J. Wong @ 2020-01-16 23:16 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Thu, Jan 16, 2020 at 08:47:41AM -0800, Christoph Hellwig wrote:
> On Wed, Jan 15, 2020 at 10:15:50PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > When writing to a delalloc region in the data fork, commit the new
> > allocations (of the da reservation) as unwritten so that the mappings
> > are only marked written once writeback completes successfully.  This
> > fixes the problem of stale data exposure if the system goes down during
> > targeted writeback of a specific region of a file, as tested by
> > generic/042.
> 
> I think this is the only safe way to deal with buffered I/O into
> holes, so:

Ditto.  Thanks for reviewing things!

--D

> Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 1/2] xfs: force writes to delalloc regions to unwritten
  2020-01-16  6:15 ` [PATCH 1/2] xfs: force writes to delalloc regions to unwritten Darrick J. Wong
  2020-01-16 16:47   ` Christoph Hellwig
@ 2020-01-19 20:49   ` Dave Chinner
  2020-02-03 20:14     ` Darrick J. Wong
  1 sibling, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2020-01-19 20:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Wed, Jan 15, 2020 at 10:15:50PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When writing to a delalloc region in the data fork, commit the new
> allocations (of the da reservation) as unwritten so that the mappings
> are only marked written once writeback completes successfully.  This
> fixes the problem of stale data exposure if the system goes down during
> targeted writeback of a specific region of a file, as tested by
> generic/042.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c |   28 +++++++++++++++++-----------
>  1 file changed, 17 insertions(+), 11 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 4544732d09a5..220ea1dc67ab 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -4190,17 +4190,7 @@ xfs_bmapi_allocate(
>  	bma->got.br_blockcount = bma->length;
>  	bma->got.br_state = XFS_EXT_NORM;
>  
> -	/*
> -	 * In the data fork, a wasdelay extent has been initialized, so
> -	 * shouldn't be flagged as unwritten.
> -	 *
> -	 * For the cow fork, however, we convert delalloc reservations
> -	 * (extents allocated for speculative preallocation) to
> -	 * allocated unwritten extents, and only convert the unwritten
> -	 * extents to real extents when we're about to write the data.
> -	 */
> -	if ((!bma->wasdel || (bma->flags & XFS_BMAPI_COWFORK)) &&
> -	    (bma->flags & XFS_BMAPI_PREALLOC))
> +	if (bma->flags & XFS_BMAPI_PREALLOC)
>  		bma->got.br_state = XFS_EXT_UNWRITTEN;
>  
>  	if (bma->wasdel)
> @@ -4608,8 +4598,24 @@ xfs_bmapi_convert_delalloc(
>  	bma.offset = bma.got.br_startoff;
>  	bma.length = max_t(xfs_filblks_t, bma.got.br_blockcount, MAXEXTLEN);
>  	bma.minleft = xfs_bmapi_minleft(tp, ip, whichfork);
> +
> +	/*
> +	 * When we're converting the delalloc reservations backing dirty pages
> +	 * in the page cache, we must be careful about how we create the new
> +	 * extents:
> +	 *
> +	 * New CoW fork extents are created unwritten, turned into real extents
> +	 * when we're about to write the data to disk, and mapped into the data
> +	 * fork after the write finishes.  End of story.
> +	 *
> +	 * New data fork extents must be mapped in as unwritten and converted
> +	 * to real extents after the write succeeds to avoid exposing stale
> +	 * disk contents if we crash.
> +	 */
>  	if (whichfork == XFS_COW_FORK)
>  		bma.flags = XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
> +	else
> +		bma.flags = XFS_BMAPI_PREALLOC;

	bma.flags = XFS_BMAPI_PREALLOC;
	if (whichfork == XFS_COW_FORK)
		bma.flags |= XFS_BMAPI_COWFORK;

However, I'm still not convinced that this is the right/best
solution to the problem. It is the easiest, yes, but the down side
on fast/high iops storage and/or under low memory conditions has
potential to be extremely significant.

I suspect that heavy users of buffered O_DSYNC writes into sparse
files are going to notice this the most - there are databases out
there that work this way. And I suspect that most of the workloads
that use buffered O_DSYNC IO heavily won't see this change for years
as enterprise upgrade cycles are notoriously slow.

IOWs, all I see this change doing is kicking the can down the road
and guaranteeing that we'll still have to solve this stale data
exposure problem more efficiently in the future. And instead of
doing it now when we have the time and freedom to do the work, it
will have to be done urgently under high priority escalation
pressures...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 1/2] xfs: force writes to delalloc regions to unwritten
  2020-01-19 20:49   ` Dave Chinner
@ 2020-02-03 20:14     ` Darrick J. Wong
  2020-05-07 10:32       ` Brian Foster
  0 siblings, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2020-02-03 20:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, hch

On Mon, Jan 20, 2020 at 07:49:25AM +1100, Dave Chinner wrote:
> On Wed, Jan 15, 2020 at 10:15:50PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > When writing to a delalloc region in the data fork, commit the new
> > allocations (of the da reservation) as unwritten so that the mappings
> > are only marked written once writeback completes successfully.  This
> > fixes the problem of stale data exposure if the system goes down during
> > targeted writeback of a specific region of a file, as tested by
> > generic/042.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_bmap.c |   28 +++++++++++++++++-----------
> >  1 file changed, 17 insertions(+), 11 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 4544732d09a5..220ea1dc67ab 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -4190,17 +4190,7 @@ xfs_bmapi_allocate(
> >  	bma->got.br_blockcount = bma->length;
> >  	bma->got.br_state = XFS_EXT_NORM;
> >  
> > -	/*
> > -	 * In the data fork, a wasdelay extent has been initialized, so
> > -	 * shouldn't be flagged as unwritten.
> > -	 *
> > -	 * For the cow fork, however, we convert delalloc reservations
> > -	 * (extents allocated for speculative preallocation) to
> > -	 * allocated unwritten extents, and only convert the unwritten
> > -	 * extents to real extents when we're about to write the data.
> > -	 */
> > -	if ((!bma->wasdel || (bma->flags & XFS_BMAPI_COWFORK)) &&
> > -	    (bma->flags & XFS_BMAPI_PREALLOC))
> > +	if (bma->flags & XFS_BMAPI_PREALLOC)
> >  		bma->got.br_state = XFS_EXT_UNWRITTEN;
> >  
> >  	if (bma->wasdel)
> > @@ -4608,8 +4598,24 @@ xfs_bmapi_convert_delalloc(
> >  	bma.offset = bma.got.br_startoff;
> >  	bma.length = max_t(xfs_filblks_t, bma.got.br_blockcount, MAXEXTLEN);
> >  	bma.minleft = xfs_bmapi_minleft(tp, ip, whichfork);
> > +
> > +	/*
> > +	 * When we're converting the delalloc reservations backing dirty pages
> > +	 * in the page cache, we must be careful about how we create the new
> > +	 * extents:
> > +	 *
> > +	 * New CoW fork extents are created unwritten, turned into real extents
> > +	 * when we're about to write the data to disk, and mapped into the data
> > +	 * fork after the write finishes.  End of story.
> > +	 *
> > +	 * New data fork extents must be mapped in as unwritten and converted
> > +	 * to real extents after the write succeeds to avoid exposing stale
> > +	 * disk contents if we crash.
> > +	 */
> >  	if (whichfork == XFS_COW_FORK)
> >  		bma.flags = XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
> > +	else
> > +		bma.flags = XFS_BMAPI_PREALLOC;
> 
> 	bma.flags = XFS_BMAPI_PREALLOC;
> 	if (whichfork == XFS_COW_FORK)
> 		bma.flags |= XFS_BMAPI_COWFORK;
> 
> However, I'm still not convinced that this is the right/best
> solution to the problem. It is the easiest, yes, but the down side
> on fast/high iops storage and/or under low memory conditions has
> potential to be extremely significant.
> 
> I suspect that heavy users of buffered O_DSYNC writes into sparse
> files are going to notice this the most - there are databases out
> there that work this way. And I suspect that most of the workloads
> that use buffered O_DSYNC IO heavily won't see this change for years
> as enterprise upgrade cycles are notoriously slow.
> 
> IOWs, all I see this change doing is kicking the can down the road
> and guaranteeing that we'll still have to solve this stale data
> exposure problem more efficiently in the future. And instead of
> doing it now when we have the time and freedom to do the work, it
> will have to be done urgently under high priority escalation
> pressures...

FWIW I'm *already* under urgent high priority GA blocker escalation
pressure, which is why this came up again.

Granted it did take 12 days of losing the battle with the distro folks
that this really isn't a release blocker (but teh sekuritehs!!) but...oh
right, I forgot that xfs actually /does/ crash more than once per day in
our environment.

I guess *we* will find out how much performance really disappears if you
do it this way. :P

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

* Re: [PATCH 1/2] xfs: force writes to delalloc regions to unwritten
  2020-02-03 20:14     ` Darrick J. Wong
@ 2020-05-07 10:32       ` Brian Foster
  2020-05-14 16:33         ` Darrick J. Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Brian Foster @ 2020-05-07 10:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs, hch

On Mon, Feb 03, 2020 at 12:14:45PM -0800, Darrick J. Wong wrote:
> On Mon, Jan 20, 2020 at 07:49:25AM +1100, Dave Chinner wrote:
> > On Wed, Jan 15, 2020 at 10:15:50PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > When writing to a delalloc region in the data fork, commit the new
> > > allocations (of the da reservation) as unwritten so that the mappings
> > > are only marked written once writeback completes successfully.  This
> > > fixes the problem of stale data exposure if the system goes down during
> > > targeted writeback of a specific region of a file, as tested by
> > > generic/042.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_bmap.c |   28 +++++++++++++++++-----------
> > >  1 file changed, 17 insertions(+), 11 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > index 4544732d09a5..220ea1dc67ab 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > @@ -4190,17 +4190,7 @@ xfs_bmapi_allocate(
> > >  	bma->got.br_blockcount = bma->length;
> > >  	bma->got.br_state = XFS_EXT_NORM;
> > >  
> > > -	/*
> > > -	 * In the data fork, a wasdelay extent has been initialized, so
> > > -	 * shouldn't be flagged as unwritten.
> > > -	 *
> > > -	 * For the cow fork, however, we convert delalloc reservations
> > > -	 * (extents allocated for speculative preallocation) to
> > > -	 * allocated unwritten extents, and only convert the unwritten
> > > -	 * extents to real extents when we're about to write the data.
> > > -	 */
> > > -	if ((!bma->wasdel || (bma->flags & XFS_BMAPI_COWFORK)) &&
> > > -	    (bma->flags & XFS_BMAPI_PREALLOC))
> > > +	if (bma->flags & XFS_BMAPI_PREALLOC)
> > >  		bma->got.br_state = XFS_EXT_UNWRITTEN;
> > >  
> > >  	if (bma->wasdel)
> > > @@ -4608,8 +4598,24 @@ xfs_bmapi_convert_delalloc(
> > >  	bma.offset = bma.got.br_startoff;
> > >  	bma.length = max_t(xfs_filblks_t, bma.got.br_blockcount, MAXEXTLEN);
> > >  	bma.minleft = xfs_bmapi_minleft(tp, ip, whichfork);
> > > +
> > > +	/*
> > > +	 * When we're converting the delalloc reservations backing dirty pages
> > > +	 * in the page cache, we must be careful about how we create the new
> > > +	 * extents:
> > > +	 *
> > > +	 * New CoW fork extents are created unwritten, turned into real extents
> > > +	 * when we're about to write the data to disk, and mapped into the data
> > > +	 * fork after the write finishes.  End of story.
> > > +	 *
> > > +	 * New data fork extents must be mapped in as unwritten and converted
> > > +	 * to real extents after the write succeeds to avoid exposing stale
> > > +	 * disk contents if we crash.
> > > +	 */
> > >  	if (whichfork == XFS_COW_FORK)
> > >  		bma.flags = XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
> > > +	else
> > > +		bma.flags = XFS_BMAPI_PREALLOC;
> > 
> > 	bma.flags = XFS_BMAPI_PREALLOC;
> > 	if (whichfork == XFS_COW_FORK)
> > 		bma.flags |= XFS_BMAPI_COWFORK;
> > 
> > However, I'm still not convinced that this is the right/best
> > solution to the problem. It is the easiest, yes, but the down side
> > on fast/high iops storage and/or under low memory conditions has
> > potential to be extremely significant.
> > 
> > I suspect that heavy users of buffered O_DSYNC writes into sparse
> > files are going to notice this the most - there are databases out
> > there that work this way. And I suspect that most of the workloads
> > that use buffered O_DSYNC IO heavily won't see this change for years
> > as enterprise upgrade cycles are notoriously slow.
> > 
> > IOWs, all I see this change doing is kicking the can down the road
> > and guaranteeing that we'll still have to solve this stale data
> > exposure problem more efficiently in the future. And instead of
> > doing it now when we have the time and freedom to do the work, it
> > will have to be done urgently under high priority escalation
> > pressures...
> 
> FWIW I'm *already* under urgent high priority GA blocker escalation
> pressure, which is why this came up again.
> 
> Granted it did take 12 days of losing the battle with the distro folks
> that this really isn't a release blocker (but teh sekuritehs!!) but...oh
> right, I forgot that xfs actually /does/ crash more than once per day in
> our environment.
> 
> I guess *we* will find out how much performance really disappears if you
> do it this way. :P
> 

Sorry for resurrecting an old thread here, but I was thinking about this
problem a bit and realized I didn't have a great handle on the concerns
with using unwritten extents for delalloc writeback. Dave calls out the
O_DSYNC buffered writes into sparse files case above. I don't see any
numbers posted here so I ran some quick tests using a large ramdisk to
get low latency I/O.

I only seem to require a couple threads to max out single file, random
4k dsync buffered write iops in this particular setup. I see ~30.6k iops
from a baseline 5.7.0-rc1 kernel and that drops to ~25.7k iops when
using unwritten extents for delalloc conversion. However, note that the
same workload through single threaded aio+dio (qd 32) runs at ~63.7k
iops. That's already using unwritten extents for dio so it's unaffected
by this patch. Also note that using a 10MB extent size hint puts the
dsync buffered write case at ~27k iops (again for both kernels because
we're already using unwritten extents in that case as well).
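
To put the two headline figures in relative terms, the arithmetic is just the difference over the baseline. The helper below is only an illustration of that calculation (the name is made up):

```c
/* Percentage of throughput lost going from 'before' to 'after' iops. */
static double iops_drop_pct(double before, double after)
{
	return (before - after) / before * 100.0;
}
```

With the numbers above, iops_drop_pct(30.6, 25.7) works out to roughly 16%, i.e. the delalloc-to-unwritten change costs about a sixth of the throughput in this particular test.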

For reference, full file preallocation (i.e. no allocs, unwritten
extents) runs at ~27k iops for the buffered write case and ~87k iops for
aio+dio. The overwrite (no unwritten, no alloc) case gets to ~250k iops
with the same couple dsync buffered write threads and close to 300k iops
with single threaded aio+dio (which I think is maxing out my memory
bandwidth).

Altogether, this has me wondering whether it's really worth the
complexity of trying to avoid the overhead of unwritten extents for
delalloc conversion. There is a noticeable hit, but it's an already slow
path compared to async I/O mechanisms. Further, it's a workload that
typically comes with a recommendation to use extent size hints to avoid
fragmentation issues and minimize allocation overhead, and that feature
already bypasses delalloc extents in favor of unwritten extents.
Thoughts? Suggestions for other tests?
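
For reference, a sketch of how an application might set the kind of extent size hint mentioned above, using the generic FS_IOC_FSGETXATTR/FS_IOC_FSSETXATTR ioctls from <linux/fs.h>. The function name is hypothetical and error handling is minimal; as far as I know the hint has to be a multiple of the filesystem block size, and XFS expects it to be set before the file has any extents allocated.

```c
#include <linux/fs.h>
#include <stdint.h>
#include <sys/ioctl.h>

/*
 * Hypothetical helper: set an extent size hint of 'bytes' on the open
 * file 'fd'.  Returns 0 on success, -1 on failure (errno set by ioctl).
 */
static int set_extsize_hint(int fd, uint32_t bytes)
{
	struct fsxattr fsx;

	/* Read the current attributes so the other flags are preserved. */
	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return -1;

	fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
	fsx.fsx_extsize = bytes;	/* hint is expressed in bytes */

	return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}
```

Something like set_extsize_hint(fd, 10u << 20) right after creating the file would correspond to the 10MB hint case.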

Brian

> --D
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> 



* Re: [PATCH 1/2] xfs: force writes to delalloc regions to unwritten
  2020-05-07 10:32       ` Brian Foster
@ 2020-05-14 16:33         ` Darrick J. Wong
  2020-05-14 17:44           ` Brian Foster
  0 siblings, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2020-05-14 16:33 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs, hch

On Thu, May 07, 2020 at 06:32:32AM -0400, Brian Foster wrote:
> On Mon, Feb 03, 2020 at 12:14:45PM -0800, Darrick J. Wong wrote:
> > On Mon, Jan 20, 2020 at 07:49:25AM +1100, Dave Chinner wrote:
> > > On Wed, Jan 15, 2020 at 10:15:50PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > When writing to a delalloc region in the data fork, commit the new
> > > > allocations (of the da reservation) as unwritten so that the mappings
> > > > are only marked written once writeback completes successfully.  This
> > > > fixes the problem of stale data exposure if the system goes down during
> > > > targeted writeback of a specific region of a file, as tested by
> > > > generic/042.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_bmap.c |   28 +++++++++++++++++-----------
> > > >  1 file changed, 17 insertions(+), 11 deletions(-)
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > > index 4544732d09a5..220ea1dc67ab 100644
> > > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > > @@ -4190,17 +4190,7 @@ xfs_bmapi_allocate(
> > > >  	bma->got.br_blockcount = bma->length;
> > > >  	bma->got.br_state = XFS_EXT_NORM;
> > > >  
> > > > -	/*
> > > > -	 * In the data fork, a wasdelay extent has been initialized, so
> > > > -	 * shouldn't be flagged as unwritten.
> > > > -	 *
> > > > -	 * For the cow fork, however, we convert delalloc reservations
> > > > -	 * (extents allocated for speculative preallocation) to
> > > > -	 * allocated unwritten extents, and only convert the unwritten
> > > > -	 * extents to real extents when we're about to write the data.
> > > > -	 */
> > > > -	if ((!bma->wasdel || (bma->flags & XFS_BMAPI_COWFORK)) &&
> > > > -	    (bma->flags & XFS_BMAPI_PREALLOC))
> > > > +	if (bma->flags & XFS_BMAPI_PREALLOC)
> > > >  		bma->got.br_state = XFS_EXT_UNWRITTEN;
> > > >  
> > > >  	if (bma->wasdel)
> > > > @@ -4608,8 +4598,24 @@ xfs_bmapi_convert_delalloc(
> > > >  	bma.offset = bma.got.br_startoff;
> > > >  	bma.length = max_t(xfs_filblks_t, bma.got.br_blockcount, MAXEXTLEN);
> > > >  	bma.minleft = xfs_bmapi_minleft(tp, ip, whichfork);
> > > > +
> > > > +	/*
> > > > +	 * When we're converting the delalloc reservations backing dirty pages
> > > > +	 * in the page cache, we must be careful about how we create the new
> > > > +	 * extents:
> > > > +	 *
> > > > +	 * New CoW fork extents are created unwritten, turned into real extents
> > > > +	 * when we're about to write the data to disk, and mapped into the data
> > > > +	 * fork after the write finishes.  End of story.
> > > > +	 *
> > > > +	 * New data fork extents must be mapped in as unwritten and converted
> > > > +	 * to real extents after the write succeeds to avoid exposing stale
> > > > +	 * disk contents if we crash.
> > > > +	 */
> > > >  	if (whichfork == XFS_COW_FORK)
> > > >  		bma.flags = XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
> > > > +	else
> > > > +		bma.flags = XFS_BMAPI_PREALLOC;
> > > 
> > > 	bma.flags = XFS_BMAPI_PREALLOC;
> > > 	if (whichfork == XFS_COW_FORK)
> > > 		bma.flags |= XFS_BMAPI_COWFORK;
> > > 
> > > However, I'm still not convinced that this is the right/best
> > > solution to the problem. It is the easiest, yes, but the down side
> > > on fast/high iops storage and/or under low memory conditions has
> > > potential to be extremely significant.
> > > 
> > > I suspect that heavy users of buffered O_DSYNC writes into sparse
> > > files are going to notice this the most - there are databases out
> > > there that work this way. And I suspect that most of the workloads
> > > that use buffered O_DSYNC IO heavily won't see this change for years
> > > as enterprise upgrade cycles are notoriously slow.
> > > 
> > > IOWs, all I see this change doing is kicking the can down the road
> > > and guaranteeing that we'll still have to solve this stale data
> > > exposure problem more efficiently in the future. And instead of
> > > doing it now when we have the time and freedom to do the work, it
> > > will have to be done urgently under high priority escalation
> > > pressures...
> > 
> > FWIW I'm *already* under urgent high priority GA blocker escalation
> > pressure, which is why this came up again.
> > 
> > Granted it did take 12 days of losing the battle with the distro folks
> > that this really isn't a release blocker (but teh sekuritehs!!) but...oh
> > right, I forgot that xfs actually /does/ crash more than once per day in
> > our environment.
> > 
> > I guess *we* will find out how much performance really disappears if you
> > do it this way. :P
> > 
> 
> Sorry for resurrecting an old thread here, but I was thinking about this
> problem a bit and realized I didn't have a great handle on the concerns
> with using unwritten extents for delalloc writeback. Dave calls out the
> O_DSYNC buffered writes into sparse files case above. I don't see any
> numbers posted here so I ran some quick tests using a large ramdisk to
> get low latency I/O.
> 
> I only seem to require a couple threads to max out single file, random
> 4k dsync buffered write iops in this particular setup. I see ~30.6k iops
> from a baseline 5.7.0-rc1 kernel and that drops to ~25.7k iops when
> using unwritten extents for delalloc conversion. However, note that the
> same workload through single threaded aio+dio (qd 32) runs at ~63.7k
> iops. That's already using unwritten extents for dio so it's unaffected
> by this patch. Also note that using a 10MB extent size hint puts the
> dsync buffered write case at ~27k iops (again for both kernels because
> we're already using unwritten extents in that case as well).
> 
> For reference, full file preallocation (i.e. no allocs, unwritten
> extents) runs at ~27k iops for the buffered write case and ~87k iops for
> aio+dio. The overwrite (no unwritten, no alloc) case gets to ~250k iops
> with the same couple dsync buffered write threads and close to 300k iops
> with single threaded aio+dio (which I think is maxing out my memory
> bandwidth).
> 
> Altogether, this has me wondering whether it's really worth the
> complexity of trying to avoid the overhead of unwritten extents for
> delalloc conversion. There is a noticeable hit, but it's an already slow
> path compared to async I/O mechanisms. Further, it's a workload that
> typically comes with a recommendation to use extent size hints to avoid
> fragmentation issues and minimize allocation overhead, and that feature
> already bypasses delalloc extents in favor of unwritten extents.
> Thoughts? Suggestions for other tests?

4-5 months ago I ran more or less the same benchmark (albeit with
$someproduct) and came to the same conclusion -- if you're really doing
scattershot buffered O_DSYNC writes to a file, you'll lose about 15-20%
with this patch added.  Then apparently I ... got buried in xmas and
other bugs and forgot to send the results. :/

Granted, you had to /force/ $someproduct to do this because it would
typically do either synchronous aio+dio, or it could do async writes
with an fsync at the important parts, or it could set an extent hint,
or (the default) it writes zeroes ahead of time so that XFS will stay
out of the way when checkpoints need to get done asap.

I could say (glibly) that I'm so buried in bug triage that what's a few
more? but maybe the rest of you have other opinions? :)

--D

> 
> Brian
> 
> > --D
> > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
> > 
> 


* Re: [PATCH 1/2] xfs: force writes to delalloc regions to unwritten
  2020-05-14 16:33         ` Darrick J. Wong
@ 2020-05-14 17:44           ` Brian Foster
  2020-05-17  7:48             ` Christoph Hellwig
  2020-05-20  1:03             ` Dave Chinner
  0 siblings, 2 replies; 17+ messages in thread
From: Brian Foster @ 2020-05-14 17:44 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs, hch

On Thu, May 14, 2020 at 09:33:17AM -0700, Darrick J. Wong wrote:
> On Thu, May 07, 2020 at 06:32:32AM -0400, Brian Foster wrote:
> > On Mon, Feb 03, 2020 at 12:14:45PM -0800, Darrick J. Wong wrote:
> > > On Mon, Jan 20, 2020 at 07:49:25AM +1100, Dave Chinner wrote:
> > > > On Wed, Jan 15, 2020 at 10:15:50PM -0800, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > 
> > > > > When writing to a delalloc region in the data fork, commit the new
> > > > > allocations (of the da reservation) as unwritten so that the mappings
> > > > > are only marked written once writeback completes successfully.  This
> > > > > fixes the problem of stale data exposure if the system goes down during
> > > > > targeted writeback of a specific region of a file, as tested by
> > > > > generic/042.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > ---
> > > > >  fs/xfs/libxfs/xfs_bmap.c |   28 +++++++++++++++++-----------
> > > > >  1 file changed, 17 insertions(+), 11 deletions(-)
> > > > > 
> > > > > 
> > > > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > > > index 4544732d09a5..220ea1dc67ab 100644
> > > > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > > > @@ -4190,17 +4190,7 @@ xfs_bmapi_allocate(
> > > > >  	bma->got.br_blockcount = bma->length;
> > > > >  	bma->got.br_state = XFS_EXT_NORM;
> > > > >  
> > > > > -	/*
> > > > > -	 * In the data fork, a wasdelay extent has been initialized, so
> > > > > -	 * shouldn't be flagged as unwritten.
> > > > > -	 *
> > > > > -	 * For the cow fork, however, we convert delalloc reservations
> > > > > -	 * (extents allocated for speculative preallocation) to
> > > > > -	 * allocated unwritten extents, and only convert the unwritten
> > > > > -	 * extents to real extents when we're about to write the data.
> > > > > -	 */
> > > > > -	if ((!bma->wasdel || (bma->flags & XFS_BMAPI_COWFORK)) &&
> > > > > -	    (bma->flags & XFS_BMAPI_PREALLOC))
> > > > > +	if (bma->flags & XFS_BMAPI_PREALLOC)
> > > > >  		bma->got.br_state = XFS_EXT_UNWRITTEN;
> > > > >  
> > > > >  	if (bma->wasdel)
> > > > > @@ -4608,8 +4598,24 @@ xfs_bmapi_convert_delalloc(
> > > > >  	bma.offset = bma.got.br_startoff;
> > > > >  	bma.length = max_t(xfs_filblks_t, bma.got.br_blockcount, MAXEXTLEN);
> > > > >  	bma.minleft = xfs_bmapi_minleft(tp, ip, whichfork);
> > > > > +
> > > > > +	/*
> > > > > +	 * When we're converting the delalloc reservations backing dirty pages
> > > > > +	 * in the page cache, we must be careful about how we create the new
> > > > > +	 * extents:
> > > > > +	 *
> > > > > +	 * New CoW fork extents are created unwritten, turned into real extents
> > > > > +	 * when we're about to write the data to disk, and mapped into the data
> > > > > +	 * fork after the write finishes.  End of story.
> > > > > +	 *
> > > > > +	 * New data fork extents must be mapped in as unwritten and converted
> > > > > +	 * to real extents after the write succeeds to avoid exposing stale
> > > > > +	 * disk contents if we crash.
> > > > > +	 */
> > > > >  	if (whichfork == XFS_COW_FORK)
> > > > >  		bma.flags = XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
> > > > > +	else
> > > > > +		bma.flags = XFS_BMAPI_PREALLOC;
> > > > 
> > > > 	bma.flags = XFS_BMAPI_PREALLOC;
> > > > 	if (whichfork == XFS_COW_FORK)
> > > > 		bma.flags |= XFS_BMAPI_COWFORK;
> > > > 
> > > > However, I'm still not convinced that this is the right/best
> > > > solution to the problem. It is the easiest, yes, but the down side
> > > > on fast/high iops storage and/or under low memory conditions has
> > > > potential to be extremely significant.
> > > > 
> > > > I suspect that heavy users of buffered O_DSYNC writes into sparse
> > > > files are going to notice this the most - there are databases out
> > > > there that work this way. And I suspect that most of the workloads
> > > > that use buffered O_DSYNC IO heavily won't see this change for years
> > > > as enterprise upgrade cycles are notoriously slow.
> > > > 
> > > > IOWs, all I see this change doing is kicking the can down the road
> > > > and guaranteeing that we'll still have to solve this stale data
> > > > exposure problem more efficiently in the future. And instead of
> > > > doing it now when we have the time and freedom to do the work, it
> > > > will have to be done urgently under high priority escalation
> > > > pressures...
> > > 
> > > FWIW I'm *already* under urgent high priority GA blocker escalation
> > > pressure, which is why this came up again.
> > > 
> > > Granted it did take 12 days of losing the battle with the distro folks
> > > that this really isn't a release blocker (but teh sekuritehs!!) but...oh
> > > right, I forgot that xfs actually /does/ crash more than once per day in
> > > our environment.
> > > 
> > > I guess *we* will find out how much performance really disappears if you
> > > do it this way. :P
> > > 
> > 
> > Sorry for resurrecting an old thread here, but I was thinking about this
> > problem a bit and realized I didn't have a great handle on the concerns
> > with using unwritten extents for delalloc writeback. Dave calls out the
> > O_DSYNC buffered writes into sparse files case above. I don't see any
> > numbers posted here so I ran some quick tests using a large ramdisk to
> > get low latency I/O.
> > 
> > I only seem to require a couple threads to max out single file, random
> > 4k dsync buffered write iops in this particular setup. I see ~30.6k iops
> > from a baseline 5.7.0-rc1 kernel and that drops to ~25.7k iops when
> > using unwritten extents for delalloc conversion. However, note that the
> > same workload through single threaded aio+dio (qd 32) runs at ~63.7k
> > iops. That's already using unwritten extents for dio so it's unaffected
> > by this patch. Also note that using a 10MB extent size hint puts the
> > dsync buffered write case at ~27k iops (again for both kernels because
> > we're already using unwritten extents in that case as well).
> > 
> > For reference, full file preallocation (i.e. no allocs, unwritten
> > extents) runs at ~27k iops for the buffered write case and ~87k iops for
> > aio+dio. The overwrite (no unwritten, no alloc) case gets to ~250k iops
> > with the same couple dsync buffered write threads and close to 300k iops
> > with single threaded aio+dio (which I think is maxing out my memory
> > bandwidth).
> > 
> > Altogether, this has me wondering whether it's really worth the
> > complexity of trying to avoid the overhead of unwritten extents for
> > delalloc conversion. There is a noticeable hit, but it's an already slow
> > path compared to async I/O mechanisms. Further, it's a workload that
> > typically comes with a recommendation to use extent size hints to avoid
> > fragmentation issues and minimize allocation overhead, and that feature
> > already bypasses delalloc extents in favor of unwritten extents.
> > Thoughts? Suggestions for other tests?
> 
> 4-5 months ago I ran more or less the same benchmark (albeit with
> $someproduct) and came to the same conclusion -- if you're really doing
> scattershot buffered O_DSYNC writes to a file, you'll lose about 15-20%
> with this patch added.  Then apparently I ... got buried in xmas and
> other bugs and forgot to send the results. :/
> 

Heh. :P Thanks for following up.

> Granted, you had to /force/ $someproduct to do this because it would
> typically do either synchronous aio+dio, or it could do async writes
> with an fsync at the important parts, or it could set an extent hint,
> or (the default) it writes zeroes ahead of time so that XFS will stay
> out of the way when checkpoints need to get done asap.
> 

Right, all of which already utilize unwritten extents except for the
explicit zeroing case.

> I could say (glibly) that I'm so buried in bug triage that what's a few
> more? but maybe the rest of you have other opinions? :)
> 

In dwelling on this a bit more since my previous reply, I also realized
that holding off this particular patch has kind of distorted the
problem. For example, I'd been trying to think of clever ways to prevent
stale data exposure on buffered writes, but that leads to ideas that
tend to be specific to delayed allocation and thus of limited benefit
for other write paths.

IOW, it's not really the delayed allocation case we should be so focused
on improving as much as the performance hit of unwritten extents in
general. We've already accepted the corresponding performance hit in
more common I/O paths in the name of correctness. The (preexisting)
impact of preallocated unwritten extents in more efficient write paths
vs. pure overwrites is far more prominent than the impact of unwritten
extents on buffered writes.

ISTM that the right thing to do here is merge this patch, finally fix
the last known stale data exposure vector, and then perhaps step back
and think about how we might improve performance of unwritten extents
(or whatever alternate scheme to avoid stale data exposure we might
think up) regardless of allocation policy or write path. That might even
make a decent side topic associated with the SSD allocation policy topic
proposal Dave recently posted.

It looks like Christoph already reviewed the patch. I'm not sure if his
opinion changed at all after the subsequent discussion, but otherwise
that just leaves Dave's objection. Dave, any thoughts on this given the
test results and broader context? What do you think about getting this
patch merged and revisiting the whole unwritten extent thing
independently?

Brian

> --D
> 
> > 
> > Brian
> > 
> > > --D
> > > 
> > > > Cheers,
> > > > 
> > > > Dave.
> > > > -- 
> > > > Dave Chinner
> > > > david@fromorbit.com
> > > 
> > 
> 



* Re: [PATCH 1/2] xfs: force writes to delalloc regions to unwritten
  2020-05-14 17:44           ` Brian Foster
@ 2020-05-17  7:48             ` Christoph Hellwig
  2020-05-19  0:40               ` Darrick J. Wong
  2020-05-20  1:03             ` Dave Chinner
  1 sibling, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2020-05-17  7:48 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs, hch

On Thu, May 14, 2020 at 01:44:48PM -0400, Brian Foster wrote:
> It looks like Christoph already reviewed the patch. I'm not sure if his
> opinion changed at all after the subsequent discussion, but otherwise
> that just leaves Dave's objection. Dave, any thoughts on this given the
> test results and broader context? What do you think about getting this
> patch merged and revisiting the whole unwritten extent thing
> independently?

Absolutely no change of mind.  I think we need to fix the issue ASAP
and then look into performance improvements as soon as we get to it.


* Re: [PATCH 1/2] xfs: force writes to delalloc regions to unwritten
  2020-05-17  7:48             ` Christoph Hellwig
@ 2020-05-19  0:40               ` Darrick J. Wong
  0 siblings, 0 replies; 17+ messages in thread
From: Darrick J. Wong @ 2020-05-19  0:40 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Brian Foster, Dave Chinner, linux-xfs

On Sun, May 17, 2020 at 12:48:43AM -0700, Christoph Hellwig wrote:
> On Thu, May 14, 2020 at 01:44:48PM -0400, Brian Foster wrote:
> > It looks like Christoph already reviewed the patch. I'm not sure if his
> > opinion changed at all after the subsequent discussion, but otherwise
> > that just leaves Dave's objection. Dave, any thoughts on this given the
> > test results and broader context? What do you think about getting this
> > patch merged and revisiting the whole unwritten extent thing
> > independently?
> 
> Absolutely no change of mind.  I think we need to fix the issue ASAP
> and then look into performance improvements as soon as we get to it.

Hm, well, I do have a couple more patches to fix a couple of minor
regressions that fstests found...

--D


* Re: [PATCH 1/2] xfs: force writes to delalloc regions to unwritten
  2020-05-14 17:44           ` Brian Foster
  2020-05-17  7:48             ` Christoph Hellwig
@ 2020-05-20  1:03             ` Dave Chinner
  1 sibling, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2020-05-20  1:03 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs, hch

On Thu, May 14, 2020 at 01:44:48PM -0400, Brian Foster wrote:
> On Thu, May 14, 2020 at 09:33:17AM -0700, Darrick J. Wong wrote:
> ISTM that the right thing to do here is merge this patch, finally fix
> the last known stale data exposure vector, and then perhaps step back
> and think about how we might improve performance of unwritten extents
> (or whatever alternate scheme to avoid stale data exposure we might
> think up) regardless of allocation policy or write path. That might even
> make a decent side topic associated with the SSD allocation policy topic
> proposal Dave recently posted.
> 
> It looks like Christoph already reviewed the patch. I'm not sure if his
> opinion changed at all after the subsequent discussion, but otherwise
> that just leaves Dave's objection. Dave, any thoughts on this given the
> test results and broader context? What do you think about getting this
> patch merged and revisiting the whole unwritten extent thing
> independently?

I guess when we look at this in the broader context of "buffered IO
already sucks real bad for high performance IO" then a few percent
here or there doesn't really matter.

Note, however, that the difference between dio+aio and buffered
writes has nothing to do with unwritten extents - what you are
seeing is the cost of the CPU copying the data into the page cache
in the user process context vs just submitting IO. Essentially, IO
submission time is way higher for buffered IO because of the data
copy, hence a CPU can do less of them per second. IOWs, unwritten
extents are not significant compared to the overhead the page cache
adds to the IO path....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


Thread overview: 17+ messages
2020-01-16  6:15 [PATCH 0/2] xfs: fix stale disk exposure after crash Darrick J. Wong
2020-01-16  6:15 ` [PATCH 1/2] xfs: force writes to delalloc regions to unwritten Darrick J. Wong
2020-01-16 16:47   ` Christoph Hellwig
2020-01-16 23:16     ` Darrick J. Wong
2020-01-19 20:49   ` Dave Chinner
2020-02-03 20:14     ` Darrick J. Wong
2020-05-07 10:32       ` Brian Foster
2020-05-14 16:33         ` Darrick J. Wong
2020-05-14 17:44           ` Brian Foster
2020-05-17  7:48             ` Christoph Hellwig
2020-05-19  0:40               ` Darrick J. Wong
2020-05-20  1:03             ` Dave Chinner
2020-01-16  6:15 ` [PATCH 2/2] xfs: relax unwritten writeback overhead under some circumstances Darrick J. Wong
2020-01-16 16:49   ` Christoph Hellwig
2020-01-16 23:15     ` Darrick J. Wong
2020-01-16 16:49 ` [PATCH 0/2] xfs: fix stale disk exposure after crash Christoph Hellwig
2020-01-16 23:00   ` Darrick J. Wong
