linux-xfs.vger.kernel.org archive mirror
* [PATCH v4 0/4] xfs: fix stale disk exposure after crash
@ 2020-05-23 16:49 Darrick J. Wong
  2020-05-23 16:49 ` [PATCH 1/4] xfs: don't fail unwritten extent conversion on writeback due to edquot Darrick J. Wong
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Darrick J. Wong @ 2020-05-23 16:49 UTC (permalink / raw)
  To: darrick.wong
  Cc: Christoph Hellwig, Brian Foster, Christoph Hellwig, linux-xfs,
	hch, bfoster

Hi all,

This set of patches tries to shrink the window during which a crash
during writeback can expose stale disk contents.  The first patch fixes
a minor quota reservation error encountered during writeback; the
second patch fixes speculative preallocation so that it keeps working
when the extents at EOF could be unwritten; the third patch refactors
the preallocation sizing code; and the fourth patch causes delalloc
reservations to be converted to unwritten extents for any writeback
that's going on within EOF.

v3 puts the extent tree walk into the dynamic prealloc sizing function.
v4 fixes a bug with walking the extent map backwards.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=stale-exposure

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=stale-exposure
---
 fs/xfs/libxfs/xfs_bmap.c |   29 +++++++-----
 fs/xfs/xfs_iomap.c       |  107 +++++++++++++++++++++++-----------------------
 2 files changed, 71 insertions(+), 65 deletions(-)


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/4] xfs: don't fail unwritten extent conversion on writeback due to edquot
  2020-05-23 16:49 [PATCH v4 0/4] xfs: fix stale disk exposure after crash Darrick J. Wong
@ 2020-05-23 16:49 ` Darrick J. Wong
  2020-05-23 16:49 ` [PATCH 2/4] xfs: measure all contiguous previous extents for prealloc size Darrick J. Wong
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Darrick J. Wong @ 2020-05-23 16:49 UTC (permalink / raw)
  To: darrick.wong; +Cc: Brian Foster, Christoph Hellwig, linux-xfs, hch, bfoster

From: Darrick J. Wong <darrick.wong@oracle.com>

During writeback, it's possible for the quota block reservation in
xfs_iomap_write_unwritten to fail with EDQUOT because we hit the quota
limit.  This causes writeback errors for data that was already written
to disk, when it's not even guaranteed that the bmbt will expand to
exceed the quota limit.  Irritatingly, this condition is reported to
userspace as EIO by fsync, which is confusing.

We wrote the data, so allow the reservation.  That might put us slightly
above the hard limit, but it's better than losing data after a write.
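
For illustration, a minimal userspace sketch of the symptom, assuming
the writing user already sits at the block quota hard limit and the
extent being written back is unwritten at I/O completion; the mount
point and file name are hypothetical:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical file owned by a user sitting at the quota hard limit */
	int fd = open("/mnt/scratch/quotafile", O_WRONLY | O_CREAT, 0644);
	char buf[65536];

	if (fd < 0)
		return 1;
	memset(buf, 0xab, sizeof(buf));

	/* the buffered write succeeds; the data blocks fit within the quota */
	if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
		perror("write");

	/*
	 * Without this patch, the unwritten extent conversion after writeback
	 * can fail its quota block reservation with EDQUOT, and the lost
	 * writeback surfaces here as EIO even though the data hit the disk.
	 */
	if (fsync(fd) < 0)
		perror("fsync");

	close(fd);
	return 0;
}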

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_iomap.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index bb590a267a7f..ac970b13b1f8 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -563,7 +563,7 @@ xfs_iomap_write_unwritten(
 		xfs_trans_ijoin(tp, ip, 0);
 
 		error = xfs_trans_reserve_quota_nblks(tp, ip, resblks, 0,
-				XFS_QMOPT_RES_REGBLKS);
+				XFS_QMOPT_RES_REGBLKS | XFS_QMOPT_FORCE_RES);
 		if (error)
 			goto error_on_bmapi_transaction;
 


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/4] xfs: measure all contiguous previous extents for prealloc size
  2020-05-23 16:49 [PATCH v4 0/4] xfs: fix stale disk exposure after crash Darrick J. Wong
  2020-05-23 16:49 ` [PATCH 1/4] xfs: don't fail unwritten extent conversion on writeback due to edquot Darrick J. Wong
@ 2020-05-23 16:49 ` Darrick J. Wong
  2020-05-24  9:14   ` Christoph Hellwig
  2020-05-24 17:16   ` [PATCH v2 " Darrick J. Wong
  2020-05-23 16:49 ` [PATCH 3/4] xfs: refactor xfs_iomap_prealloc_size Darrick J. Wong
  2020-05-23 16:49 ` [PATCH 4/4] xfs: force writes to delalloc regions to unwritten Darrick J. Wong
  3 siblings, 2 replies; 15+ messages in thread
From: Darrick J. Wong @ 2020-05-23 16:49 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, hch, bfoster

From: Darrick J. Wong <darrick.wong@oracle.com>

When we're estimating a new speculative preallocation length for an
extending write, we should walk backwards through the extent list to
determine the number of blocks that are physically and
logically contiguous with the write offset, and use that as an input to
the preallocation size computation.

This way, preallocation length is truly measured by the effectiveness of
the allocator in giving us contiguous allocations without being
influenced by the state of a given extent.  This fixes the problem
where ZERO_RANGE within EOF can reduce preallocation, and prevents
the unnecessary shrinkage of preallocation when delalloc extents are
turned into unwritten extents.

This was found as a regression in xfs/014 after changing delalloc writes
to create unwritten extents during writeback.
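
For illustration, a standalone sketch of the contiguity measurement,
using a toy extent array in place of the in-core extent tree and cursor
API; the extent layout and the MAXEXTLEN value here are purely
illustrative:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_MAXEXTLEN	(2097151ULL)	/* illustrative, 2^21 - 1 blocks */

struct toy_extent {
	uint64_t startoff;	/* logical start, in blocks */
	uint64_t startblock;	/* physical start, in blocks */
	uint64_t blockcount;
	bool	 delalloc;	/* models isnullstartblock() */
};

/*
 * Walk backwards from the extent ending at the write offset, summing the
 * lengths of extents that are both logically and physically contiguous,
 * regardless of written/unwritten state.  A delalloc extent, any
 * discontiguity, or exceeding half the maximum extent length stops the walk.
 */
static uint64_t measure_prev_extents(const struct toy_extent *ext, int idx)
{
	uint64_t plen = ext[idx].blockcount;
	struct toy_extent prev = ext[idx];

	while (--idx >= 0) {
		const struct toy_extent *got = &ext[idx];

		if (plen > TOY_MAXEXTLEN / 2 ||
		    got->delalloc ||
		    got->startoff + got->blockcount != prev.startoff ||
		    got->startblock + got->blockcount != prev.startblock)
			break;
		plen += got->blockcount;
		prev = *got;
	}
	return plen;
}

int main(void)
{
	/* two contiguous extents behind the one ending at the write offset */
	struct toy_extent ext[] = {
		{ .startoff = 0,  .startblock = 100, .blockcount = 64 },
		{ .startoff = 64, .startblock = 164, .blockcount = 32 },
		{ .startoff = 96, .startblock = 196, .blockcount = 16 },
	};

	printf("contiguous blocks behind the write offset: %llu\n",
	       (unsigned long long)measure_prev_extents(ext, 2));
	return 0;
}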

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_iomap.c |   40 +++++++++++++++++++++++++++-------------
 1 file changed, 27 insertions(+), 13 deletions(-)


diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index ac970b13b1f8..de990365397e 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -377,15 +377,17 @@ xfs_iomap_prealloc_size(
 	loff_t			count,
 	struct xfs_iext_cursor	*icur)
 {
+	struct xfs_iext_cursor	ncur = *icur;
+	struct xfs_bmbt_irec	prev, got;
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
 	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
-	struct xfs_bmbt_irec	prev;
-	int			shift = 0;
 	int64_t			freesp;
 	xfs_fsblock_t		qblocks;
-	int			qshift = 0;
 	xfs_fsblock_t		alloc_blocks = 0;
+	xfs_extlen_t		plen;
+	int			shift = 0;
+	int			qshift = 0;
 
 	if (offset + count <= XFS_ISIZE(ip))
 		return 0;
@@ -400,7 +402,7 @@ xfs_iomap_prealloc_size(
 	 */
 	if ((mp->m_flags & XFS_MOUNT_ALLOCSIZE) ||
 	    XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_dalign) ||
-	    !xfs_iext_peek_prev_extent(ifp, icur, &prev) ||
+	    !xfs_iext_prev_extent(ifp, &ncur, &prev) ||
 	    prev.br_startoff + prev.br_blockcount < offset_fsb)
 		return mp->m_allocsize_blocks;
 
@@ -413,16 +415,28 @@ xfs_iomap_prealloc_size(
 	 * preallocation size.
 	 *
 	 * If the extent is a hole, then preallocation is essentially disabled.
-	 * Otherwise we take the size of the preceding data extent as the basis
-	 * for the preallocation size. If the size of the extent is greater than
-	 * half the maximum extent length, then use the current offset as the
-	 * basis. This ensures that for large files the preallocation size
-	 * always extends to MAXEXTLEN rather than falling short due to things
-	 * like stripe unit/width alignment of real extents.
+	 * Otherwise we take the size of the preceding data extents as the basis
+	 * for the preallocation size. Note that we don't care if the previous
+	 * extents are written or not.
+	 *
+	 * If the size of the extents is greater than half the maximum extent
+	 * length, then use the current offset as the basis. This ensures that
+	 * for large files the preallocation size always extends to MAXEXTLEN
+	 * rather than falling short due to things like stripe unit/width
+	 * alignment of real extents.
 	 */
-	if (prev.br_blockcount <= (MAXEXTLEN >> 1))
-		alloc_blocks = prev.br_blockcount << 1;
-	else
+	plen = prev.br_blockcount;
+	while (xfs_iext_prev_extent(ifp, &ncur, &got)) {
+		if (plen > MAXEXTLEN / 2 ||
+		    isnullstartblock(got.br_startblock) ||
+		    got.br_startoff + got.br_blockcount != prev.br_startoff ||
+		    got.br_startblock + got.br_blockcount != prev.br_startblock)
+			break;
+		plen += got.br_blockcount;
+		prev = got;
+	}
+	alloc_blocks = plen << 1;
+	if (alloc_blocks > MAXEXTLEN)
 		alloc_blocks = XFS_B_TO_FSB(mp, offset);
 	if (!alloc_blocks)
 		goto check_writeio;


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/4] xfs: refactor xfs_iomap_prealloc_size
  2020-05-23 16:49 [PATCH v4 0/4] xfs: fix stale disk exposure after crash Darrick J. Wong
  2020-05-23 16:49 ` [PATCH 1/4] xfs: don't fail unwritten extent conversion on writeback due to edquot Darrick J. Wong
  2020-05-23 16:49 ` [PATCH 2/4] xfs: measure all contiguous previous extents for prealloc size Darrick J. Wong
@ 2020-05-23 16:49 ` Darrick J. Wong
  2020-05-24 17:17   ` [PATCH v2 " Darrick J. Wong
  2020-05-23 16:49 ` [PATCH 4/4] xfs: force writes to delalloc regions to unwritten Darrick J. Wong
  3 siblings, 1 reply; 15+ messages in thread
From: Darrick J. Wong @ 2020-05-23 16:49 UTC (permalink / raw)
  To: darrick.wong
  Cc: Christoph Hellwig, Christoph Hellwig, linux-xfs, hch, bfoster

From: Darrick J. Wong <darrick.wong@oracle.com>

Refactor xfs_iomap_prealloc_size to be the function that dynamically
computes the per-file preallocation size by moving the allocsize= case
to the caller.  Break up the huge comment preceding the function to
annotate the relevant parts of the code, and remove the impossible
check_writeio case.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_iomap.c |   83 ++++++++++++++++++++++------------------------------
 1 file changed, 35 insertions(+), 48 deletions(-)


diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index de990365397e..ad695c77a472 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -352,22 +352,10 @@ xfs_quota_calc_throttle(
 }
 
 /*
- * If we are doing a write at the end of the file and there are no allocations
- * past this one, then extend the allocation out to the file system's write
- * iosize.
- *
  * If we don't have a user specified preallocation size, dynamically increase
  * the preallocation size as the size of the file grows.  Cap the maximum size
  * at a single extent or less if the filesystem is near full. The closer the
- * filesystem is to full, the smaller the maximum prealocation.
- *
- * As an exception we don't do any preallocation at all if the file is smaller
- * than the minimum preallocation and we are using the default dynamic
- * preallocation scheme, as it is likely this is the only write to the file that
- * is going to be done.
- *
- * We clean up any extra space left over when the file is closed in
- * xfs_inactive().
+ * filesystem is to being full, the smaller the maximum preallocation.
  */
 STATIC xfs_fsblock_t
 xfs_iomap_prealloc_size(
@@ -389,41 +377,28 @@ xfs_iomap_prealloc_size(
 	int			shift = 0;
 	int			qshift = 0;
 
-	if (offset + count <= XFS_ISIZE(ip))
-		return 0;
-
-	if (!(mp->m_flags & XFS_MOUNT_ALLOCSIZE) &&
-	    (XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_allocsize_blocks)))
+	/*
+	 * As an exception we don't do any preallocation at all if the file is
+	 * smaller than the minimum preallocation and we are using the default
+	 * dynamic preallocation scheme, as it is likely this is the only write
+	 * to the file that is going to be done.
+	 */
+	if (XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_allocsize_blocks))
 		return 0;
 
 	/*
-	 * If an explicit allocsize is set, the file is small, or we
-	 * are writing behind a hole, then use the minimum prealloc:
+	 * Use the minimum preallocation size for small files or if we are
+	 * writing right after a hole.
 	 */
-	if ((mp->m_flags & XFS_MOUNT_ALLOCSIZE) ||
-	    XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_dalign) ||
+	if (XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_dalign) ||
 	    !xfs_iext_prev_extent(ifp, &ncur, &prev) ||
 	    prev.br_startoff + prev.br_blockcount < offset_fsb)
 		return mp->m_allocsize_blocks;
 
 	/*
-	 * Determine the initial size of the preallocation. We are beyond the
-	 * current EOF here, but we need to take into account whether this is
-	 * a sparse write or an extending write when determining the
-	 * preallocation size.  Hence we need to look up the extent that ends
-	 * at the current write offset and use the result to determine the
-	 * preallocation size.
-	 *
-	 * If the extent is a hole, then preallocation is essentially disabled.
-	 * Otherwise we take the size of the preceding data extents as the basis
-	 * for the preallocation size. Note that we don't care if the previous
-	 * extents are written or not.
-	 *
-	 * If the size of the extents is greater than half the maximum extent
-	 * length, then use the current offset as the basis. This ensures that
-	 * for large files the preallocation size always extends to MAXEXTLEN
-	 * rather than falling short due to things like stripe unit/width
-	 * alignment of real extents.
+	 * Take the size of the preceding data extents as the basis for the
+	 * preallocation size. Note that we don't care if the previous extents
+	 * are written or not.
 	 */
 	plen = prev.br_blockcount;
 	while (xfs_iext_prev_extent(ifp, &ncur, &got)) {
@@ -435,19 +410,25 @@ xfs_iomap_prealloc_size(
 		plen += got.br_blockcount;
 		prev = got;
 	}
+
+	/*
+	 * If the size of the extents is greater than half the maximum extent
+	 * length, then use the current offset as the basis.  This ensures that
+	 * for large files the preallocation size always extends to MAXEXTLEN
+	 * rather than falling short due to things like stripe unit/width
+	 * alignment of real extents.
+	 */
 	alloc_blocks = plen << 1;
 	if (alloc_blocks > MAXEXTLEN)
 		alloc_blocks = XFS_B_TO_FSB(mp, offset);
-	if (!alloc_blocks)
-		goto check_writeio;
 	qblocks = alloc_blocks;
 
 	/*
 	 * MAXEXTLEN is not a power of two value but we round the prealloc down
 	 * to the nearest power of two value after throttling. To prevent the
-	 * round down from unconditionally reducing the maximum supported prealloc
-	 * size, we round up first, apply appropriate throttling, round down and
-	 * cap the value to MAXEXTLEN.
+	 * round down from unconditionally reducing the maximum supported
+	 * prealloc size, we round up first, apply appropriate throttling,
+	 * round down and cap the value to MAXEXTLEN.
 	 */
 	alloc_blocks = XFS_FILEOFF_MIN(roundup_pow_of_two(MAXEXTLEN),
 				       alloc_blocks);
@@ -508,7 +489,6 @@ xfs_iomap_prealloc_size(
 	 */
 	while (alloc_blocks && alloc_blocks >= freesp)
 		alloc_blocks >>= 4;
-check_writeio:
 	if (alloc_blocks < mp->m_allocsize_blocks)
 		alloc_blocks = mp->m_allocsize_blocks;
 	trace_xfs_iomap_prealloc_size(ip, alloc_blocks, shift,
@@ -975,9 +955,16 @@ xfs_buffered_write_iomap_begin(
 	if (error)
 		goto out_unlock;
 
-	if (eof) {
-		prealloc_blocks = xfs_iomap_prealloc_size(ip, allocfork, offset,
-				count, &icur);
+	if (eof && offset + count > XFS_ISIZE(ip)) {
+		/*
+		 * Determine the initial size of the preallocation.
+		 * We clean up any extra preallocation when the file is closed.
+		 */
+		if (mp->m_flags & XFS_MOUNT_ALLOCSIZE)
+			prealloc_blocks = mp->m_allocsize_blocks;
+		else
+			prealloc_blocks = xfs_iomap_prealloc_size(ip, allocfork,
+						offset, count, &icur);
 		if (prealloc_blocks) {
 			xfs_extlen_t	align;
 			xfs_off_t	end_offset;


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 4/4] xfs: force writes to delalloc regions to unwritten
  2020-05-23 16:49 [PATCH v4 0/4] xfs: fix stale disk exposure after crash Darrick J. Wong
                   ` (2 preceding siblings ...)
  2020-05-23 16:49 ` [PATCH 3/4] xfs: refactor xfs_iomap_prealloc_size Darrick J. Wong
@ 2020-05-23 16:49 ` Darrick J. Wong
  3 siblings, 0 replies; 15+ messages in thread
From: Darrick J. Wong @ 2020-05-23 16:49 UTC (permalink / raw)
  To: darrick.wong; +Cc: Christoph Hellwig, Brian Foster, linux-xfs, hch, bfoster

From: Darrick J. Wong <darrick.wong@oracle.com>

When writing to a delalloc region in the data fork, commit the new
allocations (of the da reservation) as unwritten so that the mappings
are only marked written once writeback completes successfully.  This
fixes the problem of stale data exposure if the system goes down during
targeted writeback of a specific region of a file, as tested by
generic/042.
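
For illustration (this is not the generic/042 test itself), a sketch of
triggering targeted writeback of one region of a dirty file from
userspace with sync_file_range(); the file path and sizes are arbitrary:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical test file on an XFS filesystem */
	int fd = open("/mnt/scratch/testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
	static char buf[1024 * 1024];

	if (fd < 0)
		return 1;
	memset(buf, 0x58, sizeof(buf));

	/* dirty 1MiB of delalloc in the page cache */
	if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
		perror("pwrite");

	/*
	 * Kick off writeback for just the middle 256KiB.  Writeback allocates
	 * blocks for that sub-range; with this patch the new mapping stays
	 * unwritten until the data I/O completes, so a crash in the middle
	 * reads back as zeroes instead of stale disk contents.
	 */
	if (sync_file_range(fd, 384 * 1024, 256 * 1024,
			    SYNC_FILE_RANGE_WRITE) < 0)
		perror("sync_file_range");

	close(fd);
	return 0;
}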

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c |   29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index fda13cd7add0..f8fe83c9348d 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4193,17 +4193,7 @@ xfs_bmapi_allocate(
 	bma->got.br_blockcount = bma->length;
 	bma->got.br_state = XFS_EXT_NORM;
 
-	/*
-	 * In the data fork, a wasdelay extent has been initialized, so
-	 * shouldn't be flagged as unwritten.
-	 *
-	 * For the cow fork, however, we convert delalloc reservations
-	 * (extents allocated for speculative preallocation) to
-	 * allocated unwritten extents, and only convert the unwritten
-	 * extents to real extents when we're about to write the data.
-	 */
-	if ((!bma->wasdel || (bma->flags & XFS_BMAPI_COWFORK)) &&
-	    (bma->flags & XFS_BMAPI_PREALLOC))
+	if (bma->flags & XFS_BMAPI_PREALLOC)
 		bma->got.br_state = XFS_EXT_UNWRITTEN;
 
 	if (bma->wasdel)
@@ -4611,8 +4601,23 @@ xfs_bmapi_convert_delalloc(
 	bma.offset = bma.got.br_startoff;
 	bma.length = max_t(xfs_filblks_t, bma.got.br_blockcount, MAXEXTLEN);
 	bma.minleft = xfs_bmapi_minleft(tp, ip, whichfork);
+
+	/*
+	 * When we're converting the delalloc reservations backing dirty pages
+	 * in the page cache, we must be careful about how we create the new
+	 * extents:
+	 *
+	 * New CoW fork extents are created unwritten, turned into real extents
+	 * when we're about to write the data to disk, and mapped into the data
+	 * fork after the write finishes.  End of story.
+	 *
+	 * New data fork extents must be mapped in as unwritten and converted
+	 * to real extents after the write succeeds to avoid exposing stale
+	 * disk contents if we crash.
+	 */
+	bma.flags = XFS_BMAPI_PREALLOC;
 	if (whichfork == XFS_COW_FORK)
-		bma.flags = XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
+		bma.flags |= XFS_BMAPI_COWFORK;
 
 	if (!xfs_iext_peek_prev_extent(ifp, &bma.icur, &bma.prev))
 		bma.prev.br_startoff = NULLFILEOFF;


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/4] xfs: measure all contiguous previous extents for prealloc size
  2020-05-23 16:49 ` [PATCH 2/4] xfs: measure all contiguous previous extents for prealloc size Darrick J. Wong
@ 2020-05-24  9:14   ` Christoph Hellwig
  2020-05-24 17:16     ` Darrick J. Wong
  2020-05-24 17:16   ` [PATCH v2 " Darrick J. Wong
  1 sibling, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2020-05-24  9:14 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch, bfoster

I really, really hate the pointless use of shifts where the
multiplication makes it obvious what is going on, and still allows
the compiler to optimize it in whatever way it wants.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 2/4] xfs: measure all contiguous previous extents for prealloc size
  2020-05-23 16:49 ` [PATCH 2/4] xfs: measure all contiguous previous extents for prealloc size Darrick J. Wong
  2020-05-24  9:14   ` Christoph Hellwig
@ 2020-05-24 17:16   ` Darrick J. Wong
  2020-05-25 13:28     ` Christoph Hellwig
  2020-05-26 13:46     ` Brian Foster
  1 sibling, 2 replies; 15+ messages in thread
From: Darrick J. Wong @ 2020-05-24 17:16 UTC (permalink / raw)
  To: linux-xfs, hch, bfoster

From: Darrick J. Wong <darrick.wong@oracle.com>

When we're estimating a new speculative preallocation length for an
extending write, we should walk backwards through the extent list to
determine the number of blocks that are physically and
logically contiguous with the write offset, and use that as an input to
the preallocation size computation.

This way, preallocation length is truly measured by the effectiveness of
the allocator in giving us contiguous allocations without being
influenced by the state of a given extent.  This fixes the problem
where ZERO_RANGE within EOF can reduce preallocation, and prevents
the unnecessary shrinkage of preallocation when delalloc extents are
turned into unwritten extents.

This was found as a regression in xfs/014 after changing delalloc writes
to create unwritten extents during writeback.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: multiplication instead of lshift
---
 fs/xfs/xfs_iomap.c |   40 +++++++++++++++++++++++++++-------------
 1 file changed, 27 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 7d8966ce630a..e74a8c2c94ce 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -377,15 +377,17 @@ xfs_iomap_prealloc_size(
 	loff_t			count,
 	struct xfs_iext_cursor	*icur)
 {
+	struct xfs_iext_cursor	ncur = *icur;
+	struct xfs_bmbt_irec	prev, got;
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
 	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
-	struct xfs_bmbt_irec	prev;
-	int			shift = 0;
 	int64_t			freesp;
 	xfs_fsblock_t		qblocks;
-	int			qshift = 0;
 	xfs_fsblock_t		alloc_blocks = 0;
+	xfs_extlen_t		plen;
+	int			shift = 0;
+	int			qshift = 0;
 
 	if (offset + count <= XFS_ISIZE(ip))
 		return 0;
@@ -400,7 +402,7 @@ xfs_iomap_prealloc_size(
 	 */
 	if ((mp->m_flags & XFS_MOUNT_ALLOCSIZE) ||
 	    XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_dalign) ||
-	    !xfs_iext_peek_prev_extent(ifp, icur, &prev) ||
+	    !xfs_iext_prev_extent(ifp, &ncur, &prev) ||
 	    prev.br_startoff + prev.br_blockcount < offset_fsb)
 		return mp->m_allocsize_blocks;
 
@@ -413,16 +415,28 @@ xfs_iomap_prealloc_size(
 	 * preallocation size.
 	 *
 	 * If the extent is a hole, then preallocation is essentially disabled.
-	 * Otherwise we take the size of the preceding data extent as the basis
-	 * for the preallocation size. If the size of the extent is greater than
-	 * half the maximum extent length, then use the current offset as the
-	 * basis. This ensures that for large files the preallocation size
-	 * always extends to MAXEXTLEN rather than falling short due to things
-	 * like stripe unit/width alignment of real extents.
+	 * Otherwise we take the size of the preceding data extents as the basis
+	 * for the preallocation size. Note that we don't care if the previous
+	 * extents are written or not.
+	 *
+	 * If the size of the extents is greater than half the maximum extent
+	 * length, then use the current offset as the basis. This ensures that
+	 * for large files the preallocation size always extends to MAXEXTLEN
+	 * rather than falling short due to things like stripe unit/width
+	 * alignment of real extents.
 	 */
-	if (prev.br_blockcount <= (MAXEXTLEN >> 1))
-		alloc_blocks = prev.br_blockcount << 1;
-	else
+	plen = prev.br_blockcount;
+	while (xfs_iext_prev_extent(ifp, &ncur, &got)) {
+		if (plen > MAXEXTLEN / 2 ||
+		    isnullstartblock(got.br_startblock) ||
+		    got.br_startoff + got.br_blockcount != prev.br_startoff ||
+		    got.br_startblock + got.br_blockcount != prev.br_startblock)
+			break;
+		plen += got.br_blockcount;
+		prev = got;
+	}
+	alloc_blocks = plen * 2;
+	if (alloc_blocks > MAXEXTLEN)
 		alloc_blocks = XFS_B_TO_FSB(mp, offset);
 	if (!alloc_blocks)
 		goto check_writeio;

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/4] xfs: measure all contiguous previous extents for prealloc size
  2020-05-24  9:14   ` Christoph Hellwig
@ 2020-05-24 17:16     ` Darrick J. Wong
  0 siblings, 0 replies; 15+ messages in thread
From: Darrick J. Wong @ 2020-05-24 17:16 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, bfoster

On Sun, May 24, 2020 at 02:14:46AM -0700, Christoph Hellwig wrote:
> I really, really hate the pointless use of shifts where the
> multiplication makes it obvious what is going on, and still allows
> the compiler to optimize it in whatever way it wants.

<grumble> Fine, I'll change it back....

--D

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 3/4] xfs: refactor xfs_iomap_prealloc_size
  2020-05-23 16:49 ` [PATCH 3/4] xfs: refactor xfs_iomap_prealloc_size Darrick J. Wong
@ 2020-05-24 17:17   ` Darrick J. Wong
  2020-05-26 13:46     ` Brian Foster
  0 siblings, 1 reply; 15+ messages in thread
From: Darrick J. Wong @ 2020-05-24 17:17 UTC (permalink / raw)
  To: Christoph Hellwig, Christoph Hellwig, linux-xfs, bfoster

From: Darrick J. Wong <darrick.wong@oracle.com>

Refactor xfs_iomap_prealloc_size to be the function that dynamically
computes the per-file preallocation size by moving the allocsize= case
to the caller.  Break up the huge comment preceding the function to
annotate the relevant parts of the code, and remove the impossible
check_writeio case.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
v2: minor rebase due to changes in previous patch
---
 fs/xfs/xfs_iomap.c |   83 ++++++++++++++++++++++------------------------------
 1 file changed, 35 insertions(+), 48 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index e74a8c2c94ce..b9a8c3798e08 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -352,22 +352,10 @@ xfs_quota_calc_throttle(
 }
 
 /*
- * If we are doing a write at the end of the file and there are no allocations
- * past this one, then extend the allocation out to the file system's write
- * iosize.
- *
  * If we don't have a user specified preallocation size, dynamically increase
  * the preallocation size as the size of the file grows.  Cap the maximum size
  * at a single extent or less if the filesystem is near full. The closer the
- * filesystem is to full, the smaller the maximum prealocation.
- *
- * As an exception we don't do any preallocation at all if the file is smaller
- * than the minimum preallocation and we are using the default dynamic
- * preallocation scheme, as it is likely this is the only write to the file that
- * is going to be done.
- *
- * We clean up any extra space left over when the file is closed in
- * xfs_inactive().
+ * filesystem is to being full, the smaller the maximum preallocation.
  */
 STATIC xfs_fsblock_t
 xfs_iomap_prealloc_size(
@@ -389,41 +377,28 @@ xfs_iomap_prealloc_size(
 	int			shift = 0;
 	int			qshift = 0;
 
-	if (offset + count <= XFS_ISIZE(ip))
-		return 0;
-
-	if (!(mp->m_flags & XFS_MOUNT_ALLOCSIZE) &&
-	    (XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_allocsize_blocks)))
+	/*
+	 * As an exception we don't do any preallocation at all if the file is
+	 * smaller than the minimum preallocation and we are using the default
+	 * dynamic preallocation scheme, as it is likely this is the only write
+	 * to the file that is going to be done.
+	 */
+	if (XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_allocsize_blocks))
 		return 0;
 
 	/*
-	 * If an explicit allocsize is set, the file is small, or we
-	 * are writing behind a hole, then use the minimum prealloc:
+	 * Use the minimum preallocation size for small files or if we are
+	 * writing right after a hole.
 	 */
-	if ((mp->m_flags & XFS_MOUNT_ALLOCSIZE) ||
-	    XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_dalign) ||
+	if (XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_dalign) ||
 	    !xfs_iext_prev_extent(ifp, &ncur, &prev) ||
 	    prev.br_startoff + prev.br_blockcount < offset_fsb)
 		return mp->m_allocsize_blocks;
 
 	/*
-	 * Determine the initial size of the preallocation. We are beyond the
-	 * current EOF here, but we need to take into account whether this is
-	 * a sparse write or an extending write when determining the
-	 * preallocation size.  Hence we need to look up the extent that ends
-	 * at the current write offset and use the result to determine the
-	 * preallocation size.
-	 *
-	 * If the extent is a hole, then preallocation is essentially disabled.
-	 * Otherwise we take the size of the preceding data extents as the basis
-	 * for the preallocation size. Note that we don't care if the previous
-	 * extents are written or not.
-	 *
-	 * If the size of the extents is greater than half the maximum extent
-	 * length, then use the current offset as the basis. This ensures that
-	 * for large files the preallocation size always extends to MAXEXTLEN
-	 * rather than falling short due to things like stripe unit/width
-	 * alignment of real extents.
+	 * Take the size of the preceding data extents as the basis for the
+	 * preallocation size. Note that we don't care if the previous extents
+	 * are written or not.
 	 */
 	plen = prev.br_blockcount;
 	while (xfs_iext_prev_extent(ifp, &ncur, &got)) {
@@ -435,19 +410,25 @@ xfs_iomap_prealloc_size(
 		plen += got.br_blockcount;
 		prev = got;
 	}
+
+	/*
+	 * If the size of the extents is greater than half the maximum extent
+	 * length, then use the current offset as the basis.  This ensures that
+	 * for large files the preallocation size always extends to MAXEXTLEN
+	 * rather than falling short due to things like stripe unit/width
+	 * alignment of real extents.
+	 */
 	alloc_blocks = plen * 2;
 	if (alloc_blocks > MAXEXTLEN)
 		alloc_blocks = XFS_B_TO_FSB(mp, offset);
-	if (!alloc_blocks)
-		goto check_writeio;
 	qblocks = alloc_blocks;
 
 	/*
 	 * MAXEXTLEN is not a power of two value but we round the prealloc down
 	 * to the nearest power of two value after throttling. To prevent the
-	 * round down from unconditionally reducing the maximum supported prealloc
-	 * size, we round up first, apply appropriate throttling, round down and
-	 * cap the value to MAXEXTLEN.
+	 * round down from unconditionally reducing the maximum supported
+	 * prealloc size, we round up first, apply appropriate throttling,
+	 * round down and cap the value to MAXEXTLEN.
 	 */
 	alloc_blocks = XFS_FILEOFF_MIN(roundup_pow_of_two(MAXEXTLEN),
 				       alloc_blocks);
@@ -508,7 +489,6 @@ xfs_iomap_prealloc_size(
 	 */
 	while (alloc_blocks && alloc_blocks >= freesp)
 		alloc_blocks >>= 4;
-check_writeio:
 	if (alloc_blocks < mp->m_allocsize_blocks)
 		alloc_blocks = mp->m_allocsize_blocks;
 	trace_xfs_iomap_prealloc_size(ip, alloc_blocks, shift,
@@ -975,9 +955,16 @@ xfs_buffered_write_iomap_begin(
 	if (error)
 		goto out_unlock;
 
-	if (eof) {
-		prealloc_blocks = xfs_iomap_prealloc_size(ip, allocfork, offset,
-				count, &icur);
+	if (eof && offset + count > XFS_ISIZE(ip)) {
+		/*
+		 * Determine the initial size of the preallocation.
+		 * We clean up any extra preallocation when the file is closed.
+		 */
+		if (mp->m_flags & XFS_MOUNT_ALLOCSIZE)
+			prealloc_blocks = mp->m_allocsize_blocks;
+		else
+			prealloc_blocks = xfs_iomap_prealloc_size(ip, allocfork,
+						offset, count, &icur);
 		if (prealloc_blocks) {
 			xfs_extlen_t	align;
 			xfs_off_t	end_offset;

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/4] xfs: measure all contiguous previous extents for prealloc size
  2020-05-24 17:16   ` [PATCH v2 " Darrick J. Wong
@ 2020-05-25 13:28     ` Christoph Hellwig
  2020-05-26 13:46     ` Brian Foster
  1 sibling, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2020-05-25 13:28 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch, bfoster

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/4] xfs: measure all contiguous previous extents for prealloc size
  2020-05-24 17:16   ` [PATCH v2 " Darrick J. Wong
  2020-05-25 13:28     ` Christoph Hellwig
@ 2020-05-26 13:46     ` Brian Foster
  1 sibling, 0 replies; 15+ messages in thread
From: Brian Foster @ 2020-05-26 13:46 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Sun, May 24, 2020 at 10:16:12AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When we're estimating a new speculative preallocation length for an
> extending write, we should walk backwards through the extent list to
> determine the number of blocks that are physically and
> logically contiguous with the write offset, and use that as an input to
> the preallocation size computation.
> 
> This way, preallocation length is truly measured by the effectiveness of
> the allocator in giving us contiguous allocations without being
> influenced by the state of a given extent.  This fixes the problem
> where ZERO_RANGE within EOF can reduce preallocation, and prevents
> the unnecessary shrinkage of preallocation when delalloc extents are
> turned into unwritten extents.
> 
> This was found as a regression in xfs/014 after changing delalloc writes
> to create unwritten extents during writeback.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> v2: multiplication instead of lshift
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_iomap.c |   40 +++++++++++++++++++++++++++-------------
>  1 file changed, 27 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 7d8966ce630a..e74a8c2c94ce 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -377,15 +377,17 @@ xfs_iomap_prealloc_size(
>  	loff_t			count,
>  	struct xfs_iext_cursor	*icur)
>  {
> +	struct xfs_iext_cursor	ncur = *icur;
> +	struct xfs_bmbt_irec	prev, got;
>  	struct xfs_mount	*mp = ip->i_mount;
>  	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
>  	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
> -	struct xfs_bmbt_irec	prev;
> -	int			shift = 0;
>  	int64_t			freesp;
>  	xfs_fsblock_t		qblocks;
> -	int			qshift = 0;
>  	xfs_fsblock_t		alloc_blocks = 0;
> +	xfs_extlen_t		plen;
> +	int			shift = 0;
> +	int			qshift = 0;
>  
>  	if (offset + count <= XFS_ISIZE(ip))
>  		return 0;
> @@ -400,7 +402,7 @@ xfs_iomap_prealloc_size(
>  	 */
>  	if ((mp->m_flags & XFS_MOUNT_ALLOCSIZE) ||
>  	    XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_dalign) ||
> -	    !xfs_iext_peek_prev_extent(ifp, icur, &prev) ||
> +	    !xfs_iext_prev_extent(ifp, &ncur, &prev) ||
>  	    prev.br_startoff + prev.br_blockcount < offset_fsb)
>  		return mp->m_allocsize_blocks;
>  
> @@ -413,16 +415,28 @@ xfs_iomap_prealloc_size(
>  	 * preallocation size.
>  	 *
>  	 * If the extent is a hole, then preallocation is essentially disabled.
> -	 * Otherwise we take the size of the preceding data extent as the basis
> -	 * for the preallocation size. If the size of the extent is greater than
> -	 * half the maximum extent length, then use the current offset as the
> -	 * basis. This ensures that for large files the preallocation size
> -	 * always extends to MAXEXTLEN rather than falling short due to things
> -	 * like stripe unit/width alignment of real extents.
> +	 * Otherwise we take the size of the preceding data extents as the basis
> +	 * for the preallocation size. Note that we don't care if the previous
> +	 * extents are written or not.
> +	 *
> +	 * If the size of the extents is greater than half the maximum extent
> +	 * length, then use the current offset as the basis. This ensures that
> +	 * for large files the preallocation size always extends to MAXEXTLEN
> +	 * rather than falling short due to things like stripe unit/width
> +	 * alignment of real extents.
>  	 */
> -	if (prev.br_blockcount <= (MAXEXTLEN >> 1))
> -		alloc_blocks = prev.br_blockcount << 1;
> -	else
> +	plen = prev.br_blockcount;
> +	while (xfs_iext_prev_extent(ifp, &ncur, &got)) {
> +		if (plen > MAXEXTLEN / 2 ||
> +		    isnullstartblock(got.br_startblock) ||
> +		    got.br_startoff + got.br_blockcount != prev.br_startoff ||
> +		    got.br_startblock + got.br_blockcount != prev.br_startblock)
> +			break;
> +		plen += got.br_blockcount;
> +		prev = got;
> +	}
> +	alloc_blocks = plen * 2;
> +	if (alloc_blocks > MAXEXTLEN)
>  		alloc_blocks = XFS_B_TO_FSB(mp, offset);
>  	if (!alloc_blocks)
>  		goto check_writeio;
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 3/4] xfs: refactor xfs_iomap_prealloc_size
  2020-05-24 17:17   ` [PATCH v2 " Darrick J. Wong
@ 2020-05-26 13:46     ` Brian Foster
  0 siblings, 0 replies; 15+ messages in thread
From: Brian Foster @ 2020-05-26 13:46 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, Christoph Hellwig, linux-xfs

On Sun, May 24, 2020 at 10:17:09AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Refactor xfs_iomap_prealloc_size to be the function that dynamically
> computes the per-file preallocation size by moving the allocsize= case
> to the caller.  Break up the huge comment preceding the function to
> annotate the relevant parts of the code, and remove the impossible
> check_writeio case.
> 
> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
> v2: minor rebase due to changes in previous patch
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_iomap.c |   83 ++++++++++++++++++++++------------------------------
>  1 file changed, 35 insertions(+), 48 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index e74a8c2c94ce..b9a8c3798e08 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -352,22 +352,10 @@ xfs_quota_calc_throttle(
>  }
>  
>  /*
> - * If we are doing a write at the end of the file and there are no allocations
> - * past this one, then extend the allocation out to the file system's write
> - * iosize.
> - *
>   * If we don't have a user specified preallocation size, dynamically increase
>   * the preallocation size as the size of the file grows.  Cap the maximum size
>   * at a single extent or less if the filesystem is near full. The closer the
> - * filesystem is to full, the smaller the maximum prealocation.
> - *
> - * As an exception we don't do any preallocation at all if the file is smaller
> - * than the minimum preallocation and we are using the default dynamic
> - * preallocation scheme, as it is likely this is the only write to the file that
> - * is going to be done.
> - *
> - * We clean up any extra space left over when the file is closed in
> - * xfs_inactive().
> + * filesystem is to being full, the smaller the maximum preallocation.
>   */
>  STATIC xfs_fsblock_t
>  xfs_iomap_prealloc_size(
> @@ -389,41 +377,28 @@ xfs_iomap_prealloc_size(
>  	int			shift = 0;
>  	int			qshift = 0;
>  
> -	if (offset + count <= XFS_ISIZE(ip))
> -		return 0;
> -
> -	if (!(mp->m_flags & XFS_MOUNT_ALLOCSIZE) &&
> -	    (XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_allocsize_blocks)))
> +	/*
> +	 * As an exception we don't do any preallocation at all if the file is
> +	 * smaller than the minimum preallocation and we are using the default
> +	 * dynamic preallocation scheme, as it is likely this is the only write
> +	 * to the file that is going to be done.
> +	 */
> +	if (XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_allocsize_blocks))
>  		return 0;
>  
>  	/*
> -	 * If an explicit allocsize is set, the file is small, or we
> -	 * are writing behind a hole, then use the minimum prealloc:
> +	 * Use the minimum preallocation size for small files or if we are
> +	 * writing right after a hole.
>  	 */
> -	if ((mp->m_flags & XFS_MOUNT_ALLOCSIZE) ||
> -	    XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_dalign) ||
> +	if (XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_dalign) ||
>  	    !xfs_iext_prev_extent(ifp, &ncur, &prev) ||
>  	    prev.br_startoff + prev.br_blockcount < offset_fsb)
>  		return mp->m_allocsize_blocks;
>  
>  	/*
> -	 * Determine the initial size of the preallocation. We are beyond the
> -	 * current EOF here, but we need to take into account whether this is
> -	 * a sparse write or an extending write when determining the
> -	 * preallocation size.  Hence we need to look up the extent that ends
> -	 * at the current write offset and use the result to determine the
> -	 * preallocation size.
> -	 *
> -	 * If the extent is a hole, then preallocation is essentially disabled.
> -	 * Otherwise we take the size of the preceding data extents as the basis
> -	 * for the preallocation size. Note that we don't care if the previous
> -	 * extents are written or not.
> -	 *
> -	 * If the size of the extents is greater than half the maximum extent
> -	 * length, then use the current offset as the basis. This ensures that
> -	 * for large files the preallocation size always extends to MAXEXTLEN
> -	 * rather than falling short due to things like stripe unit/width
> -	 * alignment of real extents.
> +	 * Take the size of the preceding data extents as the basis for the
> +	 * preallocation size. Note that we don't care if the previous extents
> +	 * are written or not.
>  	 */
>  	plen = prev.br_blockcount;
>  	while (xfs_iext_prev_extent(ifp, &ncur, &got)) {
> @@ -435,19 +410,25 @@ xfs_iomap_prealloc_size(
>  		plen += got.br_blockcount;
>  		prev = got;
>  	}
> +
> +	/*
> +	 * If the size of the extents is greater than half the maximum extent
> +	 * length, then use the current offset as the basis.  This ensures that
> +	 * for large files the preallocation size always extends to MAXEXTLEN
> +	 * rather than falling short due to things like stripe unit/width
> +	 * alignment of real extents.
> +	 */
>  	alloc_blocks = plen * 2;
>  	if (alloc_blocks > MAXEXTLEN)
>  		alloc_blocks = XFS_B_TO_FSB(mp, offset);
> -	if (!alloc_blocks)
> -		goto check_writeio;
>  	qblocks = alloc_blocks;
>  
>  	/*
>  	 * MAXEXTLEN is not a power of two value but we round the prealloc down
>  	 * to the nearest power of two value after throttling. To prevent the
> -	 * round down from unconditionally reducing the maximum supported prealloc
> -	 * size, we round up first, apply appropriate throttling, round down and
> -	 * cap the value to MAXEXTLEN.
> +	 * round down from unconditionally reducing the maximum supported
> +	 * prealloc size, we round up first, apply appropriate throttling,
> +	 * round down and cap the value to MAXEXTLEN.
>  	 */
>  	alloc_blocks = XFS_FILEOFF_MIN(roundup_pow_of_two(MAXEXTLEN),
>  				       alloc_blocks);
> @@ -508,7 +489,6 @@ xfs_iomap_prealloc_size(
>  	 */
>  	while (alloc_blocks && alloc_blocks >= freesp)
>  		alloc_blocks >>= 4;
> -check_writeio:
>  	if (alloc_blocks < mp->m_allocsize_blocks)
>  		alloc_blocks = mp->m_allocsize_blocks;
>  	trace_xfs_iomap_prealloc_size(ip, alloc_blocks, shift,
> @@ -975,9 +955,16 @@ xfs_buffered_write_iomap_begin(
>  	if (error)
>  		goto out_unlock;
>  
> -	if (eof) {
> -		prealloc_blocks = xfs_iomap_prealloc_size(ip, allocfork, offset,
> -				count, &icur);
> +	if (eof && offset + count > XFS_ISIZE(ip)) {
> +		/*
> +		 * Determine the initial size of the preallocation.
> +		 * We clean up any extra preallocation when the file is closed.
> +		 */
> +		if (mp->m_flags & XFS_MOUNT_ALLOCSIZE)
> +			prealloc_blocks = mp->m_allocsize_blocks;
> +		else
> +			prealloc_blocks = xfs_iomap_prealloc_size(ip, allocfork,
> +						offset, count, &icur);
>  		if (prealloc_blocks) {
>  			xfs_extlen_t	align;
>  			xfs_off_t	end_offset;
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/4] xfs: force writes to delalloc regions to unwritten
  2020-05-22  3:31   ` Dave Chinner
@ 2020-05-22  3:56     ` Darrick J. Wong
  0 siblings, 0 replies; 15+ messages in thread
From: Darrick J. Wong @ 2020-05-22  3:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, Brian Foster, linux-xfs, hch

On Fri, May 22, 2020 at 01:31:02PM +1000, Dave Chinner wrote:
> On Thu, May 21, 2020 at 07:53:23PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > When writing to a delalloc region in the data fork, commit the new
> > allocations (of the da reservation) as unwritten so that the mappings
> > are only marked written once writeback completes successfully.  This
> > fixes the problem of stale data exposure if the system goes down during
> > targeted writeback of a specific region of a file, as tested by
> > generic/042.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> Observation: yesterday I forced a 4kB file create workload
> to use unwritten extents by setting an extent size hint. That caused
> about 4,500 xfs-conv kworker threads to be spawned by the workload
> which had 16 userspace processes creating files...
> 
> I expect that any sort of "create lots of small files" workload is
> going to cause xfs-conv kworker explosions, so be prepared for users
> to start reporting kworker explosions with this in place...

Yes, I've been running this patch internally for months and /so far/ the
conv explosions haven't generated any additional support calls.

(That said, we probably ought to constrain that a bit; there's really no
point in allowing concurrency beyond some unholy mix of AG count and the
estimated iops capacity of the storage...)

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/4] xfs: force writes to delalloc regions to unwritten
  2020-05-22  2:53 ` [PATCH 4/4] xfs: force writes to delalloc regions to unwritten Darrick J. Wong
@ 2020-05-22  3:31   ` Dave Chinner
  2020-05-22  3:56     ` Darrick J. Wong
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2020-05-22  3:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, Brian Foster, linux-xfs, hch

On Thu, May 21, 2020 at 07:53:23PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When writing to a delalloc region in the data fork, commit the new
> allocations (of the da reservation) as unwritten so that the mappings
> are only marked written once writeback completes successfully.  This
> fixes the problem of stale data exposure if the system goes down during
> targeted writeback of a specific region of a file, as tested by
> generic/042.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Brian Foster <bfoster@redhat.com>

Observation: yesterday I forced a 4kB file create workload
to use unwritten extents by setting an extent size hint. That caused
about 4,500 xfs-conv kworker threads to be spawned by the workload
which had 16 userspace processes creating files...

I expect that any sort of "create lots of small files" workload is
going to cause xfs-conv kworker explosions, so be prepared for users
to start reporting kworker explosions with this in place...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 4/4] xfs: force writes to delalloc regions to unwritten
  2020-05-22  2:52 [PATCH v3 0/4] xfs: fix stale disk exposure after crash Darrick J. Wong
@ 2020-05-22  2:53 ` Darrick J. Wong
  2020-05-22  3:31   ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Darrick J. Wong @ 2020-05-22  2:53 UTC (permalink / raw)
  To: darrick.wong; +Cc: Christoph Hellwig, Brian Foster, linux-xfs, hch, bfoster

From: Darrick J. Wong <darrick.wong@oracle.com>

When writing to a delalloc region in the data fork, commit the new
allocations (of the da reservation) as unwritten so that the mappings
are only marked written once writeback completes successfully.  This
fixes the problem of stale data exposure if the system goes down during
targeted writeback of a specific region of a file, as tested by
generic/042.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c |   28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index fda13cd7add0..825d170e1503 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4193,17 +4193,7 @@ xfs_bmapi_allocate(
 	bma->got.br_blockcount = bma->length;
 	bma->got.br_state = XFS_EXT_NORM;
 
-	/*
-	 * In the data fork, a wasdelay extent has been initialized, so
-	 * shouldn't be flagged as unwritten.
-	 *
-	 * For the cow fork, however, we convert delalloc reservations
-	 * (extents allocated for speculative preallocation) to
-	 * allocated unwritten extents, and only convert the unwritten
-	 * extents to real extents when we're about to write the data.
-	 */
-	if ((!bma->wasdel || (bma->flags & XFS_BMAPI_COWFORK)) &&
-	    (bma->flags & XFS_BMAPI_PREALLOC))
+	if (bma->flags & XFS_BMAPI_PREALLOC)
 		bma->got.br_state = XFS_EXT_UNWRITTEN;
 
 	if (bma->wasdel)
@@ -4611,8 +4601,24 @@ xfs_bmapi_convert_delalloc(
 	bma.offset = bma.got.br_startoff;
 	bma.length = max_t(xfs_filblks_t, bma.got.br_blockcount, MAXEXTLEN);
 	bma.minleft = xfs_bmapi_minleft(tp, ip, whichfork);
+
+	/*
+	 * When we're converting the delalloc reservations backing dirty pages
+	 * in the page cache, we must be careful about how we create the new
+	 * extents:
+	 *
+	 * New CoW fork extents are created unwritten, turned into real extents
+	 * when we're about to write the data to disk, and mapped into the data
+	 * fork after the write finishes.  End of story.
+	 *
+	 * New data fork extents must be mapped in as unwritten and converted
+	 * to real extents after the write succeeds to avoid exposing stale
+	 * disk contents if we crash.
+	 */
 	if (whichfork == XFS_COW_FORK)
 		bma.flags = XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
+	else
+		bma.flags = XFS_BMAPI_PREALLOC;
 
 	if (!xfs_iext_peek_prev_extent(ifp, &bma.icur, &bma.prev))
 		bma.prev.br_startoff = NULLFILEOFF;


^ permalink raw reply related	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2020-05-26 13:47 UTC | newest]

Thread overview: 15+ messages
-- links below jump to the message on this page --
2020-05-23 16:49 [PATCH v4 0/4] xfs: fix stale disk exposure after crash Darrick J. Wong
2020-05-23 16:49 ` [PATCH 1/4] xfs: don't fail unwritten extent conversion on writeback due to edquot Darrick J. Wong
2020-05-23 16:49 ` [PATCH 2/4] xfs: measure all contiguous previous extents for prealloc size Darrick J. Wong
2020-05-24  9:14   ` Christoph Hellwig
2020-05-24 17:16     ` Darrick J. Wong
2020-05-24 17:16   ` [PATCH v2 " Darrick J. Wong
2020-05-25 13:28     ` Christoph Hellwig
2020-05-26 13:46     ` Brian Foster
2020-05-23 16:49 ` [PATCH 3/4] xfs: refactor xfs_iomap_prealloc_size Darrick J. Wong
2020-05-24 17:17   ` [PATCH v2 " Darrick J. Wong
2020-05-26 13:46     ` Brian Foster
2020-05-23 16:49 ` [PATCH 4/4] xfs: force writes to delalloc regions to unwritten Darrick J. Wong
  -- strict thread matches above, loose matches on Subject: below --
2020-05-22  2:52 [PATCH v3 0/4] xfs: fix stale disk exposure after crash Darrick J. Wong
2020-05-22  2:53 ` [PATCH 4/4] xfs: force writes to delalloc regions to unwritten Darrick J. Wong
2020-05-22  3:31   ` Dave Chinner
2020-05-22  3:56     ` Darrick J. Wong
