linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* COW improvements and always_cow support V2
@ 2018-11-19 13:46 Christoph Hellwig
  2018-11-19 13:46 ` [PATCH 1/9] xfs: fix shared extent data corruption due to missing cow reservation Christoph Hellwig
                   ` (9 more replies)
  0 siblings, 10 replies; 13+ messages in thread
From: Christoph Hellwig @ 2018-11-19 13:46 UTC (permalink / raw)
  To: linux-xfs

Hi all,

this series adds the always_cow mode support after improving our COW
write support a little bit first.

The always_cow mode stresses the COW path a lot, but with a few xfstests
fixups it generall looks good, except for:

 - a few tests that complain about fragmentation, which is rather inherent
   in this mode
 - generic/208 crashing a lot (and generic/095 with 1k block similarly)
   because a COW fork extent has changed under writeback.  As far as I can
   tell this is because nothing prevents another thread from moving a COW
   fork extent to the data fork while we are under writeback.  I'm currently
   fully root causing this and looking into a potential fix
 - xfs/017 crashes occasionally in log recovery because we can't find
   a refcount tree record that we try to free.
   I haven't really fully understood this one yet.

Changes since v1:
 - make delalloc and unwritten extent conversions simpler and more robust
 - add a few additional cleanups
 - support all fallocate modes but actual preallocation
 - rebase on top of a fix from Brian (which is included as first patch
   to make the patch set more usable)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/9] xfs: fix shared extent data corruption due to missing cow reservation
  2018-11-19 13:46 COW improvements and always_cow support V2 Christoph Hellwig
@ 2018-11-19 13:46 ` Christoph Hellwig
  2018-11-19 13:46 ` [PATCH 2/9] xfs: handle -EAGAIN from xfs_iomap_write_allocate Christoph Hellwig
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2018-11-19 13:46 UTC (permalink / raw)
  To: linux-xfs; +Cc: Brian Foster

From: Brian Foster <bfoster@redhat.com>

Page writeback indirectly handles shared extents via the existence
of overlapping COW fork blocks. If COW fork blocks exist, writeback
always performs the associated copy-on-write regardless if the
underlying blocks are actually shared. If the blocks are shared,
then overlapping COW fork blocks must always exist.

fstests shared/010 reproduces a case where a buffered write occurs
over a shared block without performing the requisite COW fork
reservation.  This ultimately causes writeback to the shared extent
and data corruption that is detected across md5 checks of the
filesystem across a mount cycle.

The problem occurs when a buffered write lands over a shared extent
that crosses an extent size hint boundary and that also happens to
have a partial COW reservation that doesn't cover the start and end
blocks of the data fork extent.

For example, a buffered write occurs across the file offset (in FSB
units) range of [29, 57]. A shared extent exists at blocks [29, 35]
and COW reservation already exists at blocks [32, 34]. After
accommodating a COW extent size hint of 32 blocks and the existing
reservation at offset 32, xfs_reflink_reserve_cow() allocates 32
blocks of reservation at offset 0 and returns with COW reservation
across the range of [0, 34]. The associated data fork extent is
still [29, 35], however, which isn't fully covered by the COW
reservation.

This leads to a buffered write at file offset 35 over a shared
extent without associated COW reservation. Writeback eventually
kicks in, performs an overwrite of the underlying shared block and
causes the associated data corruption.

Update xfs_reflink_reserve_cow() to accommodate the fact that a
delalloc allocation request may not fully cover the extent in the
data fork. Trim the data fork extent appropriately, just as is done
for shared extent boundaries and/or existing COW reservations that
happen to overlap the start of the data fork extent. This prevents
shared/010 failures due to data corruption on reflink enabled
filesystems.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_reflink.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index bd9135afe3f4..bfa489c8799f 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -296,6 +296,7 @@ xfs_reflink_reserve_cow(
 	if (error)
 		return error;
 
+	xfs_trim_extent(imap, got.br_startoff, got.br_blockcount);
 	trace_xfs_reflink_cow_alloc(ip, &got);
 	return 0;
 }
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/9] xfs: handle -EAGAIN from xfs_iomap_write_allocate
  2018-11-19 13:46 COW improvements and always_cow support V2 Christoph Hellwig
  2018-11-19 13:46 ` [PATCH 1/9] xfs: fix shared extent data corruption due to missing cow reservation Christoph Hellwig
@ 2018-11-19 13:46 ` Christoph Hellwig
  2018-11-19 13:46 ` [PATCH 3/9] xfs: avoid an extent tree lookup in xfs_iomap_write_allocate Christoph Hellwig
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2018-11-19 13:46 UTC (permalink / raw)
  To: linux-xfs

-EAGAIN from xfs_iomap_write_allocate means that due to a racing
truncate there is no actual mapping at this offset anymore, and we
need to skip the block during writeback.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 338b9d9984e0..da520860c85e 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -349,6 +349,7 @@ xfs_map_blocks(
 	imap_valid = offset_fsb >= wpc->imap.br_startoff &&
 		     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount;
 	if (imap_valid &&
+	    !WARN_ON_ONCE(wpc->imap.br_startblock == HOLESTARTBLOCK) &&
 	    (!xfs_inode_has_cow_data(ip) ||
 	     wpc->io_type == XFS_IO_COW ||
 	     wpc->cow_seq == READ_ONCE(ip->i_cowfp->if_seq)))
@@ -454,8 +455,14 @@ xfs_map_blocks(
 allocate_blocks:
 	error = xfs_iomap_write_allocate(ip, whichfork, offset, &imap,
 			&wpc->cow_seq);
-	if (error)
+	if (error) {
+		if (error == -EAGAIN) {
+			/* we might have raced with truncate */
+			wpc->io_type = XFS_IO_HOLE;
+			error = 0;
+		}
 		return error;
+	}
 	ASSERT(whichfork == XFS_COW_FORK || cow_fsb == NULLFILEOFF ||
 	       imap.br_startoff + imap.br_blockcount <= cow_fsb);
 	wpc->imap = imap;
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 3/9] xfs: avoid an extent tree lookup in xfs_iomap_write_allocate
  2018-11-19 13:46 COW improvements and always_cow support V2 Christoph Hellwig
  2018-11-19 13:46 ` [PATCH 1/9] xfs: fix shared extent data corruption due to missing cow reservation Christoph Hellwig
  2018-11-19 13:46 ` [PATCH 2/9] xfs: handle -EAGAIN from xfs_iomap_write_allocate Christoph Hellwig
@ 2018-11-19 13:46 ` Christoph Hellwig
  2018-11-19 13:46 ` [PATCH 4/9] xfs: make xfs_bmbt_to_iomap more useful Christoph Hellwig
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2018-11-19 13:46 UTC (permalink / raw)
  To: linux-xfs

Instead of checking for the offset of the last extent to reduce the
size passed into xfs_bmapi_write reuse the existing code to deal with
that case for the COW fork in xfs_bmapi_write.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_bmap.c |  27 +++------
 fs/xfs/xfs_iomap.c       | 127 ++++++++++++++-------------------------
 2 files changed, 55 insertions(+), 99 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 74d7228e755b..ec48652e6325 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4313,28 +4313,21 @@ xfs_bmapi_write(
 		/* in hole or beyond EOF? */
 		if (eof || bma.got.br_startoff > bno) {
 			/*
-			 * CoW fork conversions should /never/ hit EOF or
-			 * holes.  There should always be something for us
-			 * to work on.
+			 * It is possible that the extents have changed since
+			 * we did the read call as we dropped the ilock for a
+			 * while.  We have to be careful about truncates or hole
+			 * punchs here - we are not allowed to allocate
+			 * non-delalloc blocks here.
+			 *
+			 * The only protection against truncation is the pages
+			 * for the range we are being asked to convert are
+			 * locked and hence a truncate will block on them
+			 * first.
 			 */
 			ASSERT(!((flags & XFS_BMAPI_CONVERT) &&
 			         (flags & XFS_BMAPI_COWFORK)));
 
 			if (flags & XFS_BMAPI_DELALLOC) {
-				/*
-				 * For the COW fork we can reasonably get a
-				 * request for converting an extent that races
-				 * with other threads already having converted
-				 * part of it, as there converting COW to
-				 * regular blocks is not protected using the
-				 * IOLOCK.
-				 */
-				ASSERT(flags & XFS_BMAPI_COWFORK);
-				if (!(flags & XFS_BMAPI_COWFORK)) {
-					error = -EIO;
-					goto error0;
-				}
-
 				if (eof || bno >= end)
 					break;
 			} else {
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 27c93b5f029d..c45fe427be4a 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -685,7 +685,7 @@ xfs_iomap_write_allocate(
 {
 	xfs_mount_t	*mp = ip->i_mount;
 	struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork);
-	xfs_fileoff_t	offset_fsb, last_block;
+	xfs_fileoff_t	offset_fsb;
 	xfs_fileoff_t	end_fsb, map_start_fsb;
 	xfs_filblks_t	count_fsb;
 	xfs_trans_t	*tp;
@@ -711,96 +711,59 @@ xfs_iomap_write_allocate(
 	XFS_STATS_ADD(mp, xs_xstrat_bytes, XFS_FSB_TO_B(mp, count_fsb));
 
 	while (count_fsb != 0) {
+		nres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
 		/*
-		 * Set up a transaction with which to allocate the
-		 * backing store for the file.  Do allocations in a
-		 * loop until we get some space in the range we are
-		 * interested in.  The other space that might be allocated
-		 * is in the delayed allocation extent on which we sit
-		 * but before our buffer starts.
+		 * We have already reserved space for the extent and any
+		 * indirect blocks when creating the delalloc extent,
+		 * there is no need to reserve space in this transaction
+		 * again.
 		 */
-		nimaps = 0;
-		while (nimaps == 0) {
-			nres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
-			/*
-			 * We have already reserved space for the extent and any
-			 * indirect blocks when creating the delalloc extent,
-			 * there is no need to reserve space in this transaction
-			 * again.
-			 */
-			error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0,
-					0, XFS_TRANS_RESERVE, &tp);
-			if (error)
-				return error;
+		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0,
+				0, XFS_TRANS_RESERVE, &tp);
+		if (error)
+			return error;
 
-			xfs_ilock(ip, XFS_ILOCK_EXCL);
-			xfs_trans_ijoin(tp, ip, 0);
+		xfs_ilock(ip, XFS_ILOCK_EXCL);
+		xfs_trans_ijoin(tp, ip, 0);
 
-			/*
-			 * it is possible that the extents have changed since
-			 * we did the read call as we dropped the ilock for a
-			 * while. We have to be careful about truncates or hole
-			 * punchs here - we are not allowed to allocate
-			 * non-delalloc blocks here.
-			 *
-			 * The only protection against truncation is the pages
-			 * for the range we are being asked to convert are
-			 * locked and hence a truncate will block on them
-			 * first.
-			 *
-			 * As a result, if we go beyond the range we really
-			 * need and hit an delalloc extent boundary followed by
-			 * a hole while we have excess blocks in the map, we
-			 * will fill the hole incorrectly and overrun the
-			 * transaction reservation.
-			 *
-			 * Using a single map prevents this as we are forced to
-			 * check each map we look for overlap with the desired
-			 * range and abort as soon as we find it. Also, given
-			 * that we only return a single map, having one beyond
-			 * what we can return is probably a bit silly.
-			 *
-			 * We also need to check that we don't go beyond EOF;
-			 * this is a truncate optimisation as a truncate sets
-			 * the new file size before block on the pages we
-			 * currently have locked under writeback. Because they
-			 * are about to be tossed, we don't need to write them
-			 * back....
-			 */
-			nimaps = 1;
-			end_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
-			error = xfs_bmap_last_offset(ip, &last_block,
-							XFS_DATA_FORK);
-			if (error)
+		/*
+		 * We also need to check that we don't go beyond EOF;
+		 * this is a truncate optimisation as a truncate sets
+		 * the new file size before block on the pages we
+		 * currently have locked under writeback. Because they
+		 * are about to be tossed, we don't need to write them
+		 * back....
+		 */
+		nimaps = 1;
+		end_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
+		if (map_start_fsb + count_fsb > end_fsb) {
+			count_fsb = end_fsb - map_start_fsb;
+			if (count_fsb == 0) {
+				error = -EAGAIN;
 				goto trans_cancel;
-
-			last_block = XFS_FILEOFF_MAX(last_block, end_fsb);
-			if ((map_start_fsb + count_fsb) > last_block) {
-				count_fsb = last_block - map_start_fsb;
-				if (count_fsb == 0) {
-					error = -EAGAIN;
-					goto trans_cancel;
-				}
 			}
+		}
 
-			/*
-			 * From this point onwards we overwrite the imap
-			 * pointer that the caller gave to us.
-			 */
-			error = xfs_bmapi_write(tp, ip, map_start_fsb,
-						count_fsb, flags, nres, imap,
-						&nimaps);
-			if (error)
-				goto trans_cancel;
+		/*
+		 * From this point onwards we overwrite the imap
+		 * pointer that the caller gave to us.
+		 */
+		error = xfs_bmapi_write(tp, ip, map_start_fsb,
+					count_fsb, flags, nres, imap,
+					&nimaps);
+		if (error)
+			goto trans_cancel;
 
-			error = xfs_trans_commit(tp);
-			if (error)
-				goto error0;
+		error = xfs_trans_commit(tp);
+		if (error)
+			goto error0;
 
-			if (whichfork == XFS_COW_FORK)
-				*cow_seq = READ_ONCE(ifp->if_seq);
-			xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		}
+		if (whichfork == XFS_COW_FORK)
+			*cow_seq = READ_ONCE(ifp->if_seq);
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+		if (nimaps == 0 || imap->br_startblock == HOLESTARTBLOCK)
+			return -EAGAIN;
 
 		/*
 		 * See if we were able to allocate an extent that
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 4/9] xfs: make xfs_bmbt_to_iomap more useful
  2018-11-19 13:46 COW improvements and always_cow support V2 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2018-11-19 13:46 ` [PATCH 3/9] xfs: avoid an extent tree lookup in xfs_iomap_write_allocate Christoph Hellwig
@ 2018-11-19 13:46 ` Christoph Hellwig
  2018-11-19 13:46 ` [PATCH 5/9] xfs: don't use delalloc extents for COW on files with extsize hints Christoph Hellwig
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2018-11-19 13:46 UTC (permalink / raw)
  To: linux-xfs

Move checking for invalid zero blocks and setting of various iomap flags
into this helper.  Also make it deal with "raw" delalloc extents to
avoid clutter in the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_iomap.c | 84 +++++++++++++++++++++-------------------------
 fs/xfs/xfs_iomap.h |  4 +--
 fs/xfs/xfs_pnfs.c  |  2 +-
 3 files changed, 41 insertions(+), 49 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index c45fe427be4a..2ece2363a2a0 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -35,18 +35,40 @@
 #define XFS_WRITEIO_ALIGN(mp,off)	(((off) >> mp->m_writeio_log) \
 						<< mp->m_writeio_log)
 
-void
+static int
+xfs_alert_fsblock_zero(
+	xfs_inode_t	*ip,
+	xfs_bmbt_irec_t	*imap)
+{
+	xfs_alert_tag(ip->i_mount, XFS_PTAG_FSBLOCK_ZERO,
+			"Access to block zero in inode %llu "
+			"start_block: %llx start_off: %llx "
+			"blkcnt: %llx extent-state: %x",
+		(unsigned long long)ip->i_ino,
+		(unsigned long long)imap->br_startblock,
+		(unsigned long long)imap->br_startoff,
+		(unsigned long long)imap->br_blockcount,
+		imap->br_state);
+	return -EFSCORRUPTED;
+}
+
+int
 xfs_bmbt_to_iomap(
 	struct xfs_inode	*ip,
 	struct iomap		*iomap,
-	struct xfs_bmbt_irec	*imap)
+	struct xfs_bmbt_irec	*imap,
+	bool			shared)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 
+	if (unlikely(!imap->br_startblock && !XFS_IS_REALTIME_INODE(ip)))
+		return xfs_alert_fsblock_zero(ip, imap);
+
 	if (imap->br_startblock == HOLESTARTBLOCK) {
 		iomap->addr = IOMAP_NULL_ADDR;
 		iomap->type = IOMAP_HOLE;
-	} else if (imap->br_startblock == DELAYSTARTBLOCK) {
+	} else if (imap->br_startblock == DELAYSTARTBLOCK ||
+		   isnullstartblock(imap->br_startblock)) {
 		iomap->addr = IOMAP_NULL_ADDR;
 		iomap->type = IOMAP_DELALLOC;
 	} else {
@@ -60,6 +82,13 @@ xfs_bmbt_to_iomap(
 	iomap->length = XFS_FSB_TO_B(mp, imap->br_blockcount);
 	iomap->bdev = xfs_find_bdev_for_inode(VFS_I(ip));
 	iomap->dax_dev = xfs_find_daxdev_for_inode(VFS_I(ip));
+
+	if (xfs_ipincount(ip) &&
+	    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
+		iomap->flags |= IOMAP_F_DIRTY;
+	if (shared)
+		iomap->flags |= IOMAP_F_SHARED;
+	return 0;
 }
 
 static void
@@ -138,23 +167,6 @@ xfs_iomap_eof_align_last_fsb(
 	return 0;
 }
 
-STATIC int
-xfs_alert_fsblock_zero(
-	xfs_inode_t	*ip,
-	xfs_bmbt_irec_t	*imap)
-{
-	xfs_alert_tag(ip->i_mount, XFS_PTAG_FSBLOCK_ZERO,
-			"Access to block zero in inode %llu "
-			"start_block: %llx start_off: %llx "
-			"blkcnt: %llx extent-state: %x",
-		(unsigned long long)ip->i_ino,
-		(unsigned long long)imap->br_startblock,
-		(unsigned long long)imap->br_startoff,
-		(unsigned long long)imap->br_blockcount,
-		imap->br_state);
-	return -EFSCORRUPTED;
-}
-
 int
 xfs_iomap_write_direct(
 	xfs_inode_t	*ip,
@@ -649,17 +661,7 @@ xfs_file_iomap_begin_delay(
 	iomap->flags |= IOMAP_F_NEW;
 	trace_xfs_iomap_alloc(ip, offset, count, 0, &got);
 done:
-	if (isnullstartblock(got.br_startblock))
-		got.br_startblock = DELAYSTARTBLOCK;
-
-	if (!got.br_startblock) {
-		error = xfs_alert_fsblock_zero(ip, &got);
-		if (error)
-			goto out_unlock;
-	}
-
-	xfs_bmbt_to_iomap(ip, iomap, &got);
-
+	error = xfs_bmbt_to_iomap(ip, iomap, &got, false);
 out_unlock:
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
@@ -1105,15 +1107,7 @@ xfs_file_iomap_begin(
 	trace_xfs_iomap_alloc(ip, offset, length, 0, &imap);
 
 out_finish:
-	if (xfs_ipincount(ip) && (ip->i_itemp->ili_fsync_fields
-				& ~XFS_ILOG_TIMESTAMP))
-		iomap->flags |= IOMAP_F_DIRTY;
-
-	xfs_bmbt_to_iomap(ip, iomap, &imap);
-
-	if (shared)
-		iomap->flags |= IOMAP_F_SHARED;
-	return 0;
+	return xfs_bmbt_to_iomap(ip, iomap, &imap, shared);
 
 out_found:
 	ASSERT(nimaps);
@@ -1236,12 +1230,10 @@ xfs_xattr_iomap_begin(
 out_unlock:
 	xfs_iunlock(ip, lockmode);
 
-	if (!error) {
-		ASSERT(nimaps);
-		xfs_bmbt_to_iomap(ip, iomap, &imap);
-	}
-
-	return error;
+	if (error)
+		return error;
+	ASSERT(nimaps);
+	return xfs_bmbt_to_iomap(ip, iomap, &imap, false);
 }
 
 const struct iomap_ops xfs_xattr_iomap_ops = {
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index c6170548831b..ed27e41b687c 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -17,8 +17,8 @@ int xfs_iomap_write_allocate(struct xfs_inode *, int, xfs_off_t,
 			struct xfs_bmbt_irec *, unsigned int *);
 int xfs_iomap_write_unwritten(struct xfs_inode *, xfs_off_t, xfs_off_t, bool);
 
-void xfs_bmbt_to_iomap(struct xfs_inode *, struct iomap *,
-		struct xfs_bmbt_irec *);
+int xfs_bmbt_to_iomap(struct xfs_inode *, struct iomap *,
+		struct xfs_bmbt_irec *, bool shared);
 xfs_extlen_t xfs_eof_alignment(struct xfs_inode *ip, xfs_extlen_t extsize);
 
 static inline xfs_filblks_t
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index f44c3599527d..bde2c9f56a46 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -185,7 +185,7 @@ xfs_fs_map_blocks(
 	}
 	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
 
-	xfs_bmbt_to_iomap(ip, iomap, &imap);
+	error = xfs_bmbt_to_iomap(ip, iomap, &imap, false);
 	*device_generation = mp->m_generation;
 	return error;
 out_unlock:
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 5/9] xfs: don't use delalloc extents for COW on files with extsize hints
  2018-11-19 13:46 COW improvements and always_cow support V2 Christoph Hellwig
                   ` (3 preceding siblings ...)
  2018-11-19 13:46 ` [PATCH 4/9] xfs: make xfs_bmbt_to_iomap more useful Christoph Hellwig
@ 2018-11-19 13:46 ` Christoph Hellwig
  2018-11-19 13:46 ` [PATCH 6/9] xfs: merge COW handling into xfs_file_iomap_begin_delay Christoph Hellwig
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2018-11-19 13:46 UTC (permalink / raw)
  To: linux-xfs

While using delalloc for extsize hints is generally a good idea, the
current code that does so only for COW doesn't help us much and creates
a lot of special cases.  Switch it to use real allocations like we
do for direct I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_iomap.c   | 28 +++++++++++++++++-----------
 fs/xfs/xfs_reflink.c |  5 ++++-
 fs/xfs/xfs_reflink.h |  5 ++---
 3 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 2ece2363a2a0..f64155c1fac8 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1047,22 +1047,28 @@ xfs_file_iomap_begin(
 	 * been done up front, so we don't need to do them here.
 	 */
 	if (xfs_is_reflink_inode(ip)) {
+		struct xfs_bmbt_irec	orig = imap;
+
 		/* if zeroing doesn't need COW allocation, then we are done. */
 		if ((flags & IOMAP_ZERO) &&
 		    !needs_cow_for_zeroing(&imap, nimaps))
 			goto out_found;
 
-		if (flags & IOMAP_DIRECT) {
-			/* may drop and re-acquire the ilock */
-			error = xfs_reflink_allocate_cow(ip, &imap, &shared,
-					&lockmode);
-			if (error)
-				goto out_unlock;
-		} else {
-			error = xfs_reflink_reserve_cow(ip, &imap);
-			if (error)
-				goto out_unlock;
-		}
+		error = xfs_reflink_allocate_cow(ip, &imap, &shared, &lockmode,
+						 flags);
+		if (error)
+			goto out_unlock;
+
+		/*
+		 * For buffered writes we need to report the address of the
+		 * previous block (if there was any) so that the higher level
+		 * write code can perform read-modify-write operations.  For
+		 * direct I/O code, which must be block aligned we need to
+		 * report the newly allocated address.
+		 */
+		if (!(flags & IOMAP_DIRECT) &&
+		    orig.br_startblock != HOLESTARTBLOCK)
+			imap = orig;
 
 		end_fsb = imap.br_startoff + imap.br_blockcount;
 		length = XFS_FSB_TO_B(mp, end_fsb) - offset;
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index bfa489c8799f..5fe7126b2a3a 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -397,7 +397,8 @@ xfs_reflink_allocate_cow(
 	struct xfs_inode	*ip,
 	struct xfs_bmbt_irec	*imap,
 	bool			*shared,
-	uint			*lockmode)
+	uint			*lockmode,
+	unsigned		flags)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	xfs_fileoff_t		offset_fsb = imap->br_startoff;
@@ -471,6 +472,8 @@ xfs_reflink_allocate_cow(
 	if (nimaps == 0)
 		return -ENOSPC;
 convert:
+	if (!(flags & IOMAP_DIRECT))
+		return 0;
 	return xfs_reflink_convert_cow_extent(ip, imap, offset_fsb, count_fsb);
 
 out_unreserve:
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 6d73daef1f13..d76fc520cac8 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -12,10 +12,9 @@ extern int xfs_reflink_find_shared(struct xfs_mount *mp, struct xfs_trans *tp,
 extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
 		struct xfs_bmbt_irec *irec, bool *shared);
 
-extern int xfs_reflink_reserve_cow(struct xfs_inode *ip,
-		struct xfs_bmbt_irec *imap);
 extern int xfs_reflink_allocate_cow(struct xfs_inode *ip,
-		struct xfs_bmbt_irec *imap, bool *shared, uint *lockmode);
+		struct xfs_bmbt_irec *imap, bool *shared, uint *lockmode,
+		unsigned flags);
 extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
 
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 6/9] xfs: merge COW handling into xfs_file_iomap_begin_delay
  2018-11-19 13:46 COW improvements and always_cow support V2 Christoph Hellwig
                   ` (4 preceding siblings ...)
  2018-11-19 13:46 ` [PATCH 5/9] xfs: don't use delalloc extents for COW on files with extsize hints Christoph Hellwig
@ 2018-11-19 13:46 ` Christoph Hellwig
  2018-11-19 13:46 ` [PATCH 7/9] xfs: report IOMAP_F_SHARED from xfs_file_iomap_begin_delay Christoph Hellwig
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2018-11-19 13:46 UTC (permalink / raw)
  To: linux-xfs

Besides simplifying the code a bit this allows to actually implement
the behavior of using COW preallocation for non-COW data mentioned
in the current comments.

Note that this breaks the current version of xfs/420, but that is
because the test is broken.  A separate fix will be sent for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_iomap.c   | 130 ++++++++++++++++++++++++++++++-------------
 fs/xfs/xfs_reflink.c |  67 ----------------------
 2 files changed, 91 insertions(+), 106 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index f64155c1fac8..1ddcc1194cd6 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -534,15 +534,16 @@ xfs_file_iomap_begin_delay(
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
 	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
 	xfs_fileoff_t		maxbytes_fsb =
 		XFS_B_TO_FSB(mp, mp->m_super->s_maxbytes);
 	xfs_fileoff_t		end_fsb;
-	int			error = 0, eof = 0;
-	struct xfs_bmbt_irec	got;
-	struct xfs_iext_cursor	icur;
+	struct xfs_bmbt_irec	imap, cmap;
+	struct xfs_iext_cursor	icur, ccur;
 	xfs_fsblock_t		prealloc_blocks = 0;
+	bool			eof = false, cow_eof = false, shared;
+	int			whichfork = XFS_DATA_FORK;
+	int			error = 0;
 
 	ASSERT(!XFS_IS_REALTIME_INODE(ip));
 	ASSERT(!xfs_get_extsz_hint(ip));
@@ -560,7 +561,7 @@ xfs_file_iomap_begin_delay(
 
 	XFS_STATS_INC(mp, xs_blk_mapw);
 
-	if (!(ifp->if_flags & XFS_IFEXTENTS)) {
+	if (!(ip->i_df.if_flags & XFS_IFEXTENTS)) {
 		error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
 		if (error)
 			goto out_unlock;
@@ -568,51 +569,90 @@ xfs_file_iomap_begin_delay(
 
 	end_fsb = min(XFS_B_TO_FSB(mp, offset + count), maxbytes_fsb);
 
-	eof = !xfs_iext_lookup_extent(ip, ifp, offset_fsb, &icur, &got);
+	/*
+	 * Search the data fork fork first to look up our source mapping.  We
+	 * always need the data fork map, as we have to return it to the
+	 * iomap code so that the higher level write code can read data in to
+	 * perform read-modify-write cycles for unaligned writes.
+	 */
+	eof = !xfs_iext_lookup_extent(ip, &ip->i_df, offset_fsb, &icur, &imap);
 	if (eof)
-		got.br_startoff = end_fsb; /* fake hole until the end */
+		imap.br_startoff = end_fsb; /* fake hole until the end */
 
-	if (got.br_startoff <= offset_fsb) {
+	/* We never need to allocate blocks for zeroing a hole. */
+	if ((flags & IOMAP_ZERO) && imap.br_startoff > offset_fsb) {
+		xfs_hole_to_iomap(ip, iomap, offset_fsb, imap.br_startoff);
+		goto out_unlock;
+	}
+
+	/*
+	 * Search the COW fork extent list even if we did not find a data fork
+	 * extent.  This serves two purposes: first this implements the
+	 * speculative preallocation using cowextisze, so that we also unshare
+	 * block adjacent to shared blocks instead of just the shared blocks
+	 * themselves.  Second the lookup in the extent list is generally faster
+	 * than going out to the shared extent tree.
+	 */
+	if (xfs_is_reflink_inode(ip)) {
+		cow_eof = !xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb,
+				&ccur, &cmap);
+		if (!cow_eof && cmap.br_startoff <= offset_fsb) {
+			trace_xfs_reflink_cow_found(ip, &cmap);
+			whichfork = XFS_COW_FORK;
+			goto done;
+		}
+	}
+
+	if (imap.br_startoff <= offset_fsb) {
 		/*
 		 * For reflink files we may need a delalloc reservation when
 		 * overwriting shared extents.   This includes zeroing of
 		 * existing extents that contain data.
 		 */
-		if (xfs_is_reflink_inode(ip) &&
-		    ((flags & IOMAP_WRITE) ||
-		     got.br_state != XFS_EXT_UNWRITTEN)) {
-			xfs_trim_extent(&got, offset_fsb, end_fsb - offset_fsb);
-			error = xfs_reflink_reserve_cow(ip, &got);
-			if (error)
-				goto out_unlock;
+		if (!xfs_is_reflink_inode(ip) ||
+		    ((flags & IOMAP_ZERO) && imap.br_state != XFS_EXT_NORM)) {
+			trace_xfs_iomap_found(ip, offset, count, 0, &imap);
+			goto done;
 		}
 
-		trace_xfs_iomap_found(ip, offset, count, 0, &got);
-		goto done;
-	}
+		xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
 
-	if (flags & IOMAP_ZERO) {
-		xfs_hole_to_iomap(ip, iomap, offset_fsb, got.br_startoff);
-		goto out_unlock;
+		/* Trim the mapping to the nearest shared extent boundary. */
+		error = xfs_reflink_trim_around_shared(ip, &imap, &shared);
+		if (error)
+			goto out_unlock;
+
+		/* Not shared?  Just report the (potentially capped) extent. */
+		if (!shared) {
+			trace_xfs_iomap_found(ip, offset, count, 0, &imap);
+			goto done;
+		}
+
+		/*
+		 * Fork all the shared blocks from our write offset until the
+		 * end of the extent.
+		 */
+		whichfork = XFS_COW_FORK;
+		end_fsb = imap.br_startoff + imap.br_blockcount;
+	} else {
+		/*
+		 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
+		 * pages to keep the chunks of work done where somewhat
+		 * symmetric with the work writeback does.  This is a completely
+		 * arbitrary number pulled out of thin air.
+		 *
+		 * Note that the values needs to be less than 32-bits wide until
+		 * the lower level functions are updated.
+		 */
+		count = min_t(loff_t, count, 1024 * PAGE_SIZE);
+		end_fsb = min(XFS_B_TO_FSB(mp, offset + count), maxbytes_fsb);
 	}
 
 	error = xfs_qm_dqattach_locked(ip, false);
 	if (error)
 		goto out_unlock;
 
-	/*
-	 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES pages
-	 * to keep the chunks of work done where somewhat symmetric with the
-	 * work writeback does. This is a completely arbitrary number pulled
-	 * out of thin air as a best guess for initial testing.
-	 *
-	 * Note that the values needs to be less than 32-bits wide until
-	 * the lower level functions are updated.
-	 */
-	count = min_t(loff_t, count, 1024 * PAGE_SIZE);
-	end_fsb = min(XFS_B_TO_FSB(mp, offset + count), maxbytes_fsb);
-
-	if (eof) {
+	if (eof && whichfork == XFS_DATA_FORK) {
 		prealloc_blocks = xfs_iomap_prealloc_size(ip, offset, count,
 				&icur);
 		if (prealloc_blocks) {
@@ -635,9 +675,11 @@ xfs_file_iomap_begin_delay(
 	}
 
 retry:
-	error = xfs_bmapi_reserve_delalloc(ip, XFS_DATA_FORK, offset_fsb,
-			end_fsb - offset_fsb, prealloc_blocks, &got, &icur,
-			eof);
+	error = xfs_bmapi_reserve_delalloc(ip, whichfork, offset_fsb,
+			end_fsb - offset_fsb, prealloc_blocks,
+			whichfork == XFS_DATA_FORK ? &imap : &cmap,
+			whichfork == XFS_DATA_FORK ? &icur : &ccur,
+			whichfork == XFS_DATA_FORK ? eof : cow_eof);
 	switch (error) {
 	case 0:
 		break;
@@ -659,9 +701,19 @@ xfs_file_iomap_begin_delay(
 	 * them out if the write happens to fail.
 	 */
 	iomap->flags |= IOMAP_F_NEW;
-	trace_xfs_iomap_alloc(ip, offset, count, 0, &got);
+	trace_xfs_iomap_alloc(ip, offset, count, 0, &imap);
 done:
-	error = xfs_bmbt_to_iomap(ip, iomap, &got, false);
+	if (whichfork == XFS_COW_FORK) {
+		if (imap.br_startoff > offset_fsb) {
+			xfs_trim_extent(&cmap, offset_fsb,
+					imap.br_startoff - offset_fsb);
+			error = xfs_bmbt_to_iomap(ip, iomap, &cmap, false);
+			goto out_unlock;
+		}
+		/* ensure we only report blocks we have a reservation for */
+		xfs_trim_extent(&imap, cmap.br_startoff, cmap.br_blockcount);
+	}
+	error = xfs_bmbt_to_iomap(ip, iomap, &imap, false);
 out_unlock:
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 5fe7126b2a3a..826f3cd455a1 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -234,73 +234,6 @@ xfs_reflink_trim_around_shared(
 	}
 }
 
-/*
- * Trim the passed in imap to the next shared/unshared extent boundary, and
- * if imap->br_startoff points to a shared extent reserve space for it in the
- * COW fork.
- *
- * Note that imap will always contain the block numbers for the existing blocks
- * in the data fork, as the upper layers need them for read-modify-write
- * operations.
- */
-int
-xfs_reflink_reserve_cow(
-	struct xfs_inode	*ip,
-	struct xfs_bmbt_irec	*imap)
-{
-	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
-	struct xfs_bmbt_irec	got;
-	int			error = 0;
-	bool			eof = false;
-	struct xfs_iext_cursor	icur;
-	bool			shared;
-
-	/*
-	 * Search the COW fork extent list first.  This serves two purposes:
-	 * first this implement the speculative preallocation using cowextisze,
-	 * so that we also unshared block adjacent to shared blocks instead
-	 * of just the shared blocks themselves.  Second the lookup in the
-	 * extent list is generally faster than going out to the shared extent
-	 * tree.
-	 */
-
-	if (!xfs_iext_lookup_extent(ip, ifp, imap->br_startoff, &icur, &got))
-		eof = true;
-	if (!eof && got.br_startoff <= imap->br_startoff) {
-		trace_xfs_reflink_cow_found(ip, imap);
-		xfs_trim_extent(imap, got.br_startoff, got.br_blockcount);
-		return 0;
-	}
-
-	/* Trim the mapping to the nearest shared extent boundary. */
-	error = xfs_reflink_trim_around_shared(ip, imap, &shared);
-	if (error)
-		return error;
-
-	/* Not shared?  Just report the (potentially capped) extent. */
-	if (!shared)
-		return 0;
-
-	/*
-	 * Fork all the shared blocks from our write offset until the end of
-	 * the extent.
-	 */
-	error = xfs_qm_dqattach_locked(ip, false);
-	if (error)
-		return error;
-
-	error = xfs_bmapi_reserve_delalloc(ip, XFS_COW_FORK, imap->br_startoff,
-			imap->br_blockcount, 0, &got, &icur, eof);
-	if (error == -ENOSPC || error == -EDQUOT)
-		trace_xfs_reflink_cow_enospc(ip, imap);
-	if (error)
-		return error;
-
-	xfs_trim_extent(imap, got.br_startoff, got.br_blockcount);
-	trace_xfs_reflink_cow_alloc(ip, &got);
-	return 0;
-}
-
 /* Convert part of an unwritten CoW extent to a real one. */
 STATIC int
 xfs_reflink_convert_cow_extent(
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 7/9] xfs: report IOMAP_F_SHARED from xfs_file_iomap_begin_delay
  2018-11-19 13:46 COW improvements and always_cow support V2 Christoph Hellwig
                   ` (5 preceding siblings ...)
  2018-11-19 13:46 ` [PATCH 6/9] xfs: merge COW handling into xfs_file_iomap_begin_delay Christoph Hellwig
@ 2018-11-19 13:46 ` Christoph Hellwig
  2018-11-19 13:46 ` [PATCH 8/9] xfs: make COW fork unwritten extent conversions more robust Christoph Hellwig
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2018-11-19 13:46 UTC (permalink / raw)
  To: linux-xfs

No user of it in the iomap code at the moment, but we should not
actively report wrong information if we can trivially get it right.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_iomap.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 1ddcc1194cd6..888671e1ab46 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -541,7 +541,7 @@ xfs_file_iomap_begin_delay(
 	struct xfs_bmbt_irec	imap, cmap;
 	struct xfs_iext_cursor	icur, ccur;
 	xfs_fsblock_t		prealloc_blocks = 0;
-	bool			eof = false, cow_eof = false, shared;
+	bool			eof = false, cow_eof = false, shared = false;
 	int			whichfork = XFS_DATA_FORK;
 	int			error = 0;
 
@@ -707,13 +707,14 @@ xfs_file_iomap_begin_delay(
 		if (imap.br_startoff > offset_fsb) {
 			xfs_trim_extent(&cmap, offset_fsb,
 					imap.br_startoff - offset_fsb);
-			error = xfs_bmbt_to_iomap(ip, iomap, &cmap, false);
+			error = xfs_bmbt_to_iomap(ip, iomap, &cmap, true);
 			goto out_unlock;
 		}
 		/* ensure we only report blocks we have a reservation for */
 		xfs_trim_extent(&imap, cmap.br_startoff, cmap.br_blockcount);
+		shared = true;
 	}
-	error = xfs_bmbt_to_iomap(ip, iomap, &imap, false);
+	error = xfs_bmbt_to_iomap(ip, iomap, &imap, shared);
 out_unlock:
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 8/9] xfs: make COW fork unwritten extent conversions more robust
  2018-11-19 13:46 COW improvements and always_cow support V2 Christoph Hellwig
                   ` (6 preceding siblings ...)
  2018-11-19 13:46 ` [PATCH 7/9] xfs: report IOMAP_F_SHARED from xfs_file_iomap_begin_delay Christoph Hellwig
@ 2018-11-19 13:46 ` Christoph Hellwig
  2018-11-19 13:46 ` [PATCH 9/9] xfs: introduce an always_cow mode Christoph Hellwig
  2018-11-25 13:39 ` COW improvements and always_cow support V2 Chandan Rajendra
  9 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2018-11-19 13:46 UTC (permalink / raw)
  To: linux-xfs

If we have racing buffered and direct I/O COW fork extents under
writeback can have been moved to the data fork by the time we call
xfs_reflink_convert_cow from xfs_submit_ioend.  This would be mostly
harmless as the block numbers don't change by this move, except for
the fact that xfs_bmapi_write will crash or trigger asserts when
not finding existing extents, even despite trying to paper over this
with the XFS_BMAPI_CONVERT_ONLY flag.

Instead of special casing non-transaction conversions in the already
way too complicated xfs_bmapi_write just add a new helper for the much
simpler non-transactional COW fork case, which simplify ignores not
found extents.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_bmap.c | 12 ++------
 fs/xfs/libxfs/xfs_bmap.h |  8 +++---
 fs/xfs/xfs_reflink.c     | 61 +++++++++++++++++++++++++---------------
 3 files changed, 45 insertions(+), 36 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index ec48652e6325..d80658617f50 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -2026,7 +2026,7 @@ xfs_bmap_add_extent_delay_real(
 /*
  * Convert an unwritten allocation to a real allocation or vice versa.
  */
-STATIC int				/* error */
+int					/* error */
 xfs_bmap_add_extent_unwritten_real(
 	struct xfs_trans	*tp,
 	xfs_inode_t		*ip,	/* incore inode pointer */
@@ -4244,9 +4244,7 @@ xfs_bmapi_write(
 
 	ASSERT(*nmap >= 1);
 	ASSERT(*nmap <= XFS_BMAP_MAX_NMAP);
-	ASSERT(tp != NULL ||
-	       (flags & (XFS_BMAPI_CONVERT | XFS_BMAPI_COWFORK)) ==
-			(XFS_BMAPI_CONVERT | XFS_BMAPI_COWFORK));
+	ASSERT(tp != NULL);
 	ASSERT(len > 0);
 	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_LOCAL);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
@@ -4324,9 +4322,6 @@ xfs_bmapi_write(
 			 * locked and hence a truncate will block on them
 			 * first.
 			 */
-			ASSERT(!((flags & XFS_BMAPI_CONVERT) &&
-			         (flags & XFS_BMAPI_COWFORK)));
-
 			if (flags & XFS_BMAPI_DELALLOC) {
 				if (eof || bno >= end)
 					break;
@@ -4341,8 +4336,7 @@ xfs_bmapi_write(
 		 * First, deal with the hole before the allocated space
 		 * that we found, if any.
 		 */
-		if ((need_alloc || wasdelay) &&
-		    !(flags & XFS_BMAPI_CONVERT_ONLY)) {
+		if (need_alloc || wasdelay) {
 			bma.eof = eof;
 			bma.conv = !!(flags & XFS_BMAPI_CONVERT);
 			bma.wasdel = wasdelay;
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 488dc8860fd7..d1b71fe97933 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -98,9 +98,6 @@ struct xfs_extent_free_item
 /* Only convert delalloc space, don't allocate entirely new extents */
 #define XFS_BMAPI_DELALLOC	0x400
 
-/* Only convert unwritten extents, don't allocate new blocks */
-#define XFS_BMAPI_CONVERT_ONLY	0x800
-
 /* Skip online discard of freed extents */
 #define XFS_BMAPI_NODISCARD	0x1000
 
@@ -118,7 +115,6 @@ struct xfs_extent_free_item
 	{ XFS_BMAPI_REMAP,	"REMAP" }, \
 	{ XFS_BMAPI_COWFORK,	"COWFORK" }, \
 	{ XFS_BMAPI_DELALLOC,	"DELALLOC" }, \
-	{ XFS_BMAPI_CONVERT_ONLY, "CONVERT_ONLY" }, \
 	{ XFS_BMAPI_NODISCARD,	"NODISCARD" }, \
 	{ XFS_BMAPI_NORMAP,	"NORMAP" }
 
@@ -228,6 +224,10 @@ int	xfs_bmapi_reserve_delalloc(struct xfs_inode *ip, int whichfork,
 		xfs_fileoff_t off, xfs_filblks_t len, xfs_filblks_t prealloc,
 		struct xfs_bmbt_irec *got, struct xfs_iext_cursor *cur,
 		int eof);
+int	xfs_bmap_add_extent_unwritten_real(struct xfs_trans *tp,
+		struct xfs_inode *ip, int whichfork,
+		struct xfs_iext_cursor *icur, struct xfs_btree_cur **curp,
+		struct xfs_bmbt_irec *new, int *logflagsp);
 
 static inline void
 xfs_bmap_add_free(
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 826f3cd455a1..7c2dd4098e7e 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -234,26 +234,42 @@ xfs_reflink_trim_around_shared(
 	}
 }
 
-/* Convert part of an unwritten CoW extent to a real one. */
-STATIC int
-xfs_reflink_convert_cow_extent(
-	struct xfs_inode		*ip,
-	struct xfs_bmbt_irec		*imap,
-	xfs_fileoff_t			offset_fsb,
-	xfs_filblks_t			count_fsb)
+static int
+xfs_reflink_convert_cow_locked(
+	struct xfs_inode	*ip,
+	xfs_fileoff_t		offset_fsb,
+	xfs_filblks_t		count_fsb)
 {
-	int				nimaps = 1;
+	struct xfs_iext_cursor	icur;
+	struct xfs_bmbt_irec	got;
+	struct xfs_btree_cur	*dummy_cur = NULL;
+	int			dummy_logflags;
+	int			error;
 
-	if (imap->br_state == XFS_EXT_NORM)
+	if (!xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb, &icur, &got))
 		return 0;
 
-	xfs_trim_extent(imap, offset_fsb, count_fsb);
-	trace_xfs_reflink_convert_cow(ip, imap);
-	if (imap->br_blockcount == 0)
-		return 0;
-	return xfs_bmapi_write(NULL, ip, imap->br_startoff, imap->br_blockcount,
-			XFS_BMAPI_COWFORK | XFS_BMAPI_CONVERT, 0, imap,
-			&nimaps);
+	do {
+		if (got.br_startoff >= offset_fsb + count_fsb)
+			break;
+		if (got.br_state == XFS_EXT_NORM)
+			continue;
+		if (WARN_ON_ONCE(isnullstartblock(got.br_startblock)))
+			return -EIO;
+
+		xfs_trim_extent(&got, offset_fsb, count_fsb);
+		if (!got.br_blockcount)
+			continue;
+
+		got.br_state = XFS_EXT_NORM;
+		error = xfs_bmap_add_extent_unwritten_real(NULL, ip,
+				XFS_COW_FORK, &icur, &dummy_cur, &got,
+				&dummy_logflags);
+		if (error)
+			return error;
+	} while (xfs_iext_next_extent(ip->i_cowfp, &icur, &got));
+
+	return error;
 }
 
 /* Convert all of the unwritten CoW extents in a file's range to real ones. */
@@ -267,15 +283,12 @@ xfs_reflink_convert_cow(
 	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
 	xfs_fileoff_t		end_fsb = XFS_B_TO_FSB(mp, offset + count);
 	xfs_filblks_t		count_fsb = end_fsb - offset_fsb;
-	struct xfs_bmbt_irec	imap;
-	int			nimaps = 1, error = 0;
+	int			error;
 
 	ASSERT(count != 0);
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	error = xfs_bmapi_write(NULL, ip, offset_fsb, count_fsb,
-			XFS_BMAPI_COWFORK | XFS_BMAPI_CONVERT |
-			XFS_BMAPI_CONVERT_ONLY, 0, &imap, &nimaps);
+	error = xfs_reflink_convert_cow_locked(ip, offset_fsb, count_fsb);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 }
@@ -405,9 +418,11 @@ xfs_reflink_allocate_cow(
 	if (nimaps == 0)
 		return -ENOSPC;
 convert:
-	if (!(flags & IOMAP_DIRECT))
+	xfs_trim_extent(imap, offset_fsb, count_fsb);
+	if (!(flags & IOMAP_DIRECT) || imap->br_state == XFS_EXT_NORM)
 		return 0;
-	return xfs_reflink_convert_cow_extent(ip, imap, offset_fsb, count_fsb);
+	trace_xfs_reflink_convert_cow(ip, imap);
+	return xfs_reflink_convert_cow_locked(ip, offset_fsb, count_fsb);
 
 out_unreserve:
 	xfs_trans_unreserve_quota_nblks(tp, ip, (long)resblks, 0,
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 9/9] xfs: introduce an always_cow mode
  2018-11-19 13:46 COW improvements and always_cow support V2 Christoph Hellwig
                   ` (7 preceding siblings ...)
  2018-11-19 13:46 ` [PATCH 8/9] xfs: make COW fork unwritten extent conversions more robust Christoph Hellwig
@ 2018-11-19 13:46 ` Christoph Hellwig
  2018-11-25 13:39 ` COW improvements and always_cow support V2 Chandan Rajendra
  9 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2018-11-19 13:46 UTC (permalink / raw)
  To: linux-xfs

Add a mode where XFS never overwrites existing blocks in place.  This
is to aid debugging our COW code, and also put infatructure in place
for things like possible future support for zoned block devices, which
can't support overwrites.

This mode is enabled globally by doing a:

    echo 1 > /sys/fs/xfs/debug/always_cow

Note that the parameter is global to allow running all tests in xfstests
easily in this mode, which would not easily be possible with a per-fs
sysfs file.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c    |  2 +-
 fs/xfs/xfs_file.c    | 11 ++++++++++-
 fs/xfs/xfs_iomap.c   | 28 ++++++++++++++++++----------
 fs/xfs/xfs_reflink.c | 28 ++++++++++++++++++++++++----
 fs/xfs/xfs_reflink.h | 13 +++++++++++++
 fs/xfs/xfs_super.c   | 13 +++++++++----
 fs/xfs/xfs_sysctl.h  |  1 +
 fs/xfs/xfs_sysfs.c   | 24 ++++++++++++++++++++++++
 8 files changed, 100 insertions(+), 20 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index da520860c85e..3b4f6b76b24e 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -988,7 +988,7 @@ xfs_vm_bmap(
 	 * Since we don't pass back blockdev info, we can't return bmap
 	 * information for rt files either.
 	 */
-	if (xfs_is_reflink_inode(ip) || XFS_IS_REALTIME_INODE(ip))
+	if (xfs_is_cow_inode(ip) || XFS_IS_REALTIME_INODE(ip))
 		return 0;
 	return iomap_bmap(mapping, block, &xfs_iomap_ops);
 }
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 53c9ab8fb777..4b5d23ee11d9 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -507,7 +507,7 @@ xfs_file_dio_aio_write(
 		 * We can't properly handle unaligned direct I/O to reflink
 		 * files yet, as we can't unshare a partial block.
 		 */
-		if (xfs_is_reflink_inode(ip)) {
+		if (xfs_is_cow_inode(ip)) {
 			trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos, count);
 			return -EREMCHG;
 		}
@@ -806,6 +806,15 @@ xfs_file_fallocate(
 		return -EOPNOTSUPP;
 
 	xfs_ilock(ip, iolock);
+	/*
+	 * If always_cow mode we can't use preallocation and thus should not
+	 * allow creating them.
+	 */
+	if (xfs_is_always_cow_inode(ip) && (mode & ~FALLOC_FL_KEEP_SIZE) == 0) {
+		error = -EOPNOTSUPP;
+		goto out_unlock;
+	}
+
 	error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
 	if (error)
 		goto out_unlock;
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 888671e1ab46..8a5711e2bd9b 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -395,12 +395,13 @@ xfs_quota_calc_throttle(
 STATIC xfs_fsblock_t
 xfs_iomap_prealloc_size(
 	struct xfs_inode	*ip,
+	int			whichfork,
 	loff_t			offset,
 	loff_t			count,
 	struct xfs_iext_cursor	*icur)
 {
 	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
+	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
 	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
 	struct xfs_bmbt_irec	prev;
 	int			shift = 0;
@@ -593,7 +594,11 @@ xfs_file_iomap_begin_delay(
 	 * themselves.  Second the lookup in the extent list is generally faster
 	 * than going out to the shared extent tree.
 	 */
-	if (xfs_is_reflink_inode(ip)) {
+	if (xfs_is_cow_inode(ip)) {
+		if (!ip->i_cowfp) {
+			ASSERT(!xfs_is_reflink_inode(ip));
+			xfs_ifork_init_cow(ip);
+		}
 		cow_eof = !xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb,
 				&ccur, &cmap);
 		if (!cow_eof && cmap.br_startoff <= offset_fsb) {
@@ -609,7 +614,7 @@ xfs_file_iomap_begin_delay(
 		 * overwriting shared extents.   This includes zeroing of
 		 * existing extents that contain data.
 		 */
-		if (!xfs_is_reflink_inode(ip) ||
+		if (!xfs_is_cow_inode(ip) ||
 		    ((flags & IOMAP_ZERO) && imap.br_state != XFS_EXT_NORM)) {
 			trace_xfs_iomap_found(ip, offset, count, 0, &imap);
 			goto done;
@@ -618,7 +623,7 @@ xfs_file_iomap_begin_delay(
 		xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
 
 		/* Trim the mapping to the nearest shared extent boundary. */
-		error = xfs_reflink_trim_around_shared(ip, &imap, &shared);
+		error = xfs_inode_need_cow(ip, &imap, &shared);
 		if (error)
 			goto out_unlock;
 
@@ -646,15 +651,18 @@ xfs_file_iomap_begin_delay(
 		 */
 		count = min_t(loff_t, count, 1024 * PAGE_SIZE);
 		end_fsb = min(XFS_B_TO_FSB(mp, offset + count), maxbytes_fsb);
+
+		if (xfs_is_always_cow_inode(ip))
+			whichfork = XFS_COW_FORK;
 	}
 
 	error = xfs_qm_dqattach_locked(ip, false);
 	if (error)
 		goto out_unlock;
 
-	if (eof && whichfork == XFS_DATA_FORK) {
-		prealloc_blocks = xfs_iomap_prealloc_size(ip, offset, count,
-				&icur);
+	if (eof) {
+		prealloc_blocks = xfs_iomap_prealloc_size(ip, whichfork, offset,
+				count, &icur);
 		if (prealloc_blocks) {
 			xfs_extlen_t	align;
 			xfs_off_t	end_offset;
@@ -993,7 +1001,7 @@ xfs_ilock_for_iomap(
 	 * COW writes may allocate delalloc space or convert unwritten COW
 	 * extents, so we need to make sure to take the lock exclusively here.
 	 */
-	if (xfs_is_reflink_inode(ip) && is_write) {
+	if (xfs_is_cow_inode(ip) && is_write) {
 		/*
 		 * FIXME: It could still overwrite on unshared extents and not
 		 * need allocation.
@@ -1027,7 +1035,7 @@ xfs_ilock_for_iomap(
 	 * check, so if we got ILOCK_SHARED for a write and but we're now a
 	 * reflink inode we have to switch to ILOCK_EXCL and relock.
 	 */
-	if (mode == XFS_ILOCK_SHARED && is_write && xfs_is_reflink_inode(ip)) {
+	if (mode == XFS_ILOCK_SHARED && is_write && xfs_is_cow_inode(ip)) {
 		xfs_iunlock(ip, mode);
 		mode = XFS_ILOCK_EXCL;
 		goto relock;
@@ -1099,7 +1107,7 @@ xfs_file_iomap_begin(
 	 * Break shared extents if necessary. Checks for non-blocking IO have
 	 * been done up front, so we don't need to do them here.
 	 */
-	if (xfs_is_reflink_inode(ip)) {
+	if (xfs_is_cow_inode(ip)) {
 		struct xfs_bmbt_irec	orig = imap;
 
 		/* if zeroing doesn't need COW allocation, then we are done. */
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 7c2dd4098e7e..14d5ec96922b 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -192,7 +192,7 @@ xfs_reflink_trim_around_shared(
 	int			error = 0;
 
 	/* Holes, unwritten, and delalloc extents cannot be shared */
-	if (!xfs_is_reflink_inode(ip) || !xfs_bmap_is_real_extent(irec)) {
+	if (!xfs_is_cow_inode(ip) || !xfs_bmap_is_real_extent(irec)) {
 		*shared = false;
 		return 0;
 	}
@@ -234,6 +234,23 @@ xfs_reflink_trim_around_shared(
 	}
 }
 
+bool
+xfs_inode_need_cow(
+	struct xfs_inode	*ip,
+	struct xfs_bmbt_irec	*imap,
+	bool			*shared)
+{
+	/* We can't update any real extents in always COW mode. */
+	if (xfs_is_always_cow_inode(ip) &&
+	    !isnullstartblock(imap->br_startblock)) {
+		*shared = true;
+		return 0;
+	}
+
+	/* Trim the mapping to the nearest shared extent boundary. */
+	return xfs_reflink_trim_around_shared(ip, imap, shared);
+}
+
 static int
 xfs_reflink_convert_cow_locked(
 	struct xfs_inode	*ip,
@@ -321,7 +338,7 @@ xfs_find_trim_cow_extent(
 	if (got.br_startoff > offset_fsb) {
 		xfs_trim_extent(imap, imap->br_startoff,
 				got.br_startoff - imap->br_startoff);
-		return xfs_reflink_trim_around_shared(ip, imap, shared);
+		return xfs_inode_need_cow(ip, imap, shared);
 	}
 
 	*shared = true;
@@ -356,7 +373,10 @@ xfs_reflink_allocate_cow(
 	xfs_extlen_t		resblks = 0;
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
-	ASSERT(xfs_is_reflink_inode(ip));
+	if (!ip->i_cowfp) {
+		ASSERT(!xfs_is_reflink_inode(ip));
+		xfs_ifork_init_cow(ip);
+	}
 
 	error = xfs_find_trim_cow_extent(ip, imap, shared, &found);
 	if (error || !*shared)
@@ -537,7 +557,7 @@ xfs_reflink_cancel_cow_range(
 	int			error;
 
 	trace_xfs_reflink_cancel_cow_range(ip, offset, count);
-	ASSERT(xfs_is_reflink_inode(ip));
+	ASSERT(ip->i_cowfp);
 
 	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
 	if (count == NULLFILEOFF)
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index d76fc520cac8..f6505ae37626 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -6,11 +6,24 @@
 #ifndef __XFS_REFLINK_H
 #define __XFS_REFLINK_H 1
 
+static inline bool xfs_is_always_cow_inode(struct xfs_inode *ip)
+{
+	return xfs_globals.always_cow &&
+		xfs_sb_version_hasreflink(&ip->i_mount->m_sb);
+}
+
+static inline bool xfs_is_cow_inode(struct xfs_inode *ip)
+{
+	return xfs_is_reflink_inode(ip) || xfs_is_always_cow_inode(ip);
+}
+
 extern int xfs_reflink_find_shared(struct xfs_mount *mp, struct xfs_trans *tp,
 		xfs_agnumber_t agno, xfs_agblock_t agbno, xfs_extlen_t aglen,
 		xfs_agblock_t *fbno, xfs_extlen_t *flen, bool find_maximal);
 extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
 		struct xfs_bmbt_irec *irec, bool *shared);
+bool xfs_inode_need_cow(struct xfs_inode *ip, struct xfs_bmbt_irec *imap,
+		bool *shared);
 
 extern int xfs_reflink_allocate_cow(struct xfs_inode *ip,
 		struct xfs_bmbt_irec *imap, bool *shared, uint *lockmode,
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index d3e6cd063688..f4d34749505e 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1728,11 +1728,16 @@ xfs_fs_fill_super(
 		}
 	}
 
-	if (xfs_sb_version_hasreflink(&mp->m_sb) && mp->m_sb.sb_rblocks) {
-		xfs_alert(mp,
+	if (xfs_sb_version_hasreflink(&mp->m_sb)) {
+		if (mp->m_sb.sb_rblocks) {
+			xfs_alert(mp,
 	"reflink not compatible with realtime device!");
-		error = -EINVAL;
-		goto out_filestream_unmount;
+			error = -EINVAL;
+			goto out_filestream_unmount;
+		}
+
+		if (xfs_globals.always_cow)
+			xfs_info(mp, "using DEBUG-only always_cow mode.");
 	}
 
 	if (xfs_sb_version_hasrmapbt(&mp->m_sb) && mp->m_sb.sb_rblocks) {
diff --git a/fs/xfs/xfs_sysctl.h b/fs/xfs/xfs_sysctl.h
index 168488130a19..ad7f9be13087 100644
--- a/fs/xfs/xfs_sysctl.h
+++ b/fs/xfs/xfs_sysctl.h
@@ -85,6 +85,7 @@ struct xfs_globals {
 	int	log_recovery_delay;	/* log recovery delay (secs) */
 	int	mount_delay;		/* mount setup delay (secs) */
 	bool	bug_on_assert;		/* BUG() the kernel on assert failure */
+	bool	always_cow;		/* use COW fork for all overwrites */
 };
 extern struct xfs_globals	xfs_globals;
 
diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
index cd6a994a7250..cabda13f3c64 100644
--- a/fs/xfs/xfs_sysfs.c
+++ b/fs/xfs/xfs_sysfs.c
@@ -183,10 +183,34 @@ mount_delay_show(
 }
 XFS_SYSFS_ATTR_RW(mount_delay);
 
+static ssize_t
+always_cow_store(
+	struct kobject	*kobject,
+	const char	*buf,
+	size_t		count)
+{
+	ssize_t		ret;
+
+	ret = kstrtobool(buf, &xfs_globals.always_cow);
+	if (ret < 0)
+		return ret;
+	return count;
+}
+
+static ssize_t
+always_cow_show(
+	struct kobject	*kobject,
+	char		*buf)
+{
+	return snprintf(buf, PAGE_SIZE, "%d\n", xfs_globals.always_cow);
+}
+XFS_SYSFS_ATTR_RW(always_cow);
+
 static struct attribute *xfs_dbg_attrs[] = {
 	ATTR_LIST(bug_on_assert),
 	ATTR_LIST(log_recovery_delay),
 	ATTR_LIST(mount_delay),
+	ATTR_LIST(always_cow),
 	NULL,
 };
 
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: COW improvements and always_cow support V2
  2018-11-19 13:46 COW improvements and always_cow support V2 Christoph Hellwig
                   ` (8 preceding siblings ...)
  2018-11-19 13:46 ` [PATCH 9/9] xfs: introduce an always_cow mode Christoph Hellwig
@ 2018-11-25 13:39 ` Chandan Rajendra
  2018-11-26 11:42   ` Chandan Rajendra
  9 siblings, 1 reply; 13+ messages in thread
From: Chandan Rajendra @ 2018-11-25 13:39 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Monday, November 19, 2018 7:16:10 PM IST Christoph Hellwig wrote:
> Hi all,
> 
> this series adds the always_cow mode support after improving our COW
> write support a little bit first.
> 
> The always_cow mode stresses the COW path a lot, but with a few xfstests
> fixups it generall looks good, except for:
> 
>  - a few tests that complain about fragmentation, which is rather inherent
>    in this mode
>  - generic/208 crashing a lot (and generic/095 with 1k block similarly)
>    because a COW fork extent has changed under writeback.  As far as I can
>    tell this is because nothing prevents another thread from moving a COW
>    fork extent to the data fork while we are under writeback.  I'm currently
>    fully root causing this and looking into a potential fix
>  - xfs/017 crashes occasionally in log recovery because we can't find
>    a refcount tree record that we try to free.
>    I haven't really fully understood this one yet.
> 
> Changes since v1:
>  - make delalloc and unwritten extent conversions simpler and more robust
>  - add a few additional cleanups
>  - support all fallocate modes but actual preallocation
>  - rebase on top of a fix from Brian (which is included as first patch
>    to make the patch set more usable)
> 

Hi Christoph,

On ppc64le (with 4k block size), xfs/017 causes the following call trace,

WARNING: CPU: 2 PID: 7865 at /root/repos/linux/fs/xfs/xfs_aops.c:352 xfs_map_blocks+0x154/0xc44
Modules linked in:
CPU: 2 PID: 7865 Comm: fsstress Not tainted 4.20.0-rc3-next-20181123-00008-gbb319d33d6c5-dirty #7
NIP:  c00000000069f924 LR: c00000000069f88c CTR: 0000000000000000
REGS: c000000630dc3520 TRAP: 0700   Not tainted  (4.20.0-rc3-next-20181123-00008-gbb319d33d6c5-dirty)
MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 44004484  XER: 00000000
CFAR: c00000000000e6d4 IRQMASK: 0
GPR00: c00000000069f88c c000000630dc37b0 c0000000017a3100 c000000630dc3b80
GPR04: 0000000000000000 000000000000008f 0000000000000008 fffffffffffffffe
GPR08: 000000000000002b 0000000000000001 0000000000000009 c000000623040080
GPR12: c0000000003113a0 c00000003fffdf00 0000000000000000 0000000000000000
GPR16: 0000000100030780 00007fffffffb2c0 00000001000136a8 0000000000000001
GPR20: 0000000000000000 0000000000010000 c000000630dc3c20 0000000000001000
GPR24: 0000000000000008 f000000000d79a00 c000000630dc3b80 c000000624320300
GPR28: c000000100733f48 0000000000000000 0000000000009000 c000000630dc37b0
NIP [c00000000069f924] xfs_map_blocks+0x154/0xc44
LR [c00000000069f88c] xfs_map_blocks+0xbc/0xc44
Call Trace:
[c000000630dc37b0] [c00000000069f88c] xfs_map_blocks+0xbc/0xc44 (unreliable)
[c000000630dc3960] [c0000000006a0e88] xfs_writepage_map+0x174/0x428
[c000000630dc39f0] [c0000000006a12ac] xfs_do_writepage+0x170/0x2c0
[c000000630dc3a40] [c0000000003257f0] write_cache_pages+0x350/0x540
[c000000630dc3b60] [c0000000006a1508] xfs_vm_writepages+0x98/0xd4
[c000000630dc3bd0] [c000000000326c2c] do_writepages+0x5c/0xcc
[c000000630dc3c00] [c00000000030e338] __filemap_fdatawrite_range+0xc4/0xd4
[c000000630dc3c60] [c00000000030e460] filemap_flush+0x30/0x40
[c000000630dc3c80] [c0000000006d0538] xfs_release+0x12c/0x298
[c000000630dc3cc0] [c0000000006b4ca4] xfs_file_release+0x24/0x38
[c000000630dc3ce0] [c0000000003f68f8] __fput+0xc8/0x280
[c000000630dc3d40] [c0000000003f6b40] ____fput+0x20/0x30
[c000000630dc3d60] [c00000000014bef4] task_work_run+0xb8/0x100
[c000000630dc3da0] [c0000000000231cc] do_notify_resume+0x184/0x18c
[c000000630dc3e20] [c00000000000e544] ret_from_except_lite+0x70/0x74
Instruction dump:
39290002 7d290074 7929d182 5529063e 913f005c 813f005c 7d290034 5529d97e
69290001 5529063e 2fa90000 419e0008 <0fe00000> 813f005c 7d290034 5529d97e


This corresponds to the following code inside xfs_map_blocks(),

imap_valid = offset_fsb >= wpc->imap.br_startoff &&
	     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount;
if (imap_valid &&
    !WARN_ON_ONCE(wpc->imap.br_startblock == HOLESTARTBLOCK) &&
    (!xfs_inode_has_cow_data(ip) ||
     wpc->io_type == XFS_IO_COW ||
     wpc->cow_seq == READ_ONCE(ip->i_cowfp->if_seq)))
	return 0;


I did some debugging and the following provides an explaination as to why 
the condition passed to WARN_ON_ONCE() evaluated to true,

148854:fsstress 19867 [001] 178758.940803:  probe:xfs_file_buffered_aio_write: (c0000000006b71f0) ki_pos=1301686 i_ino=8388710
149023:fsstress 19867 [001] 178758.940968:            probe:__iomap_write_end: (c00000000049720c) i_ino=8388710 len_u32=9034 copied_u32=9034 pos_u64=1301686
149191:fsstress 19867 [001] 178758.941167:            probe:__iomap_write_end: (c00000000049720c) i_ino=8388710 len_u32=65536 copied_u32=65536 pos_u64=1310720
150133:fsstress 19867 [001] 178758.941566:            probe:__iomap_write_end: (c00000000049720c) i_ino=8388710 len_u32=8475 copied_u32=8475 pos_u64=1376256

An "Extended write" is performed at a 4k block which maps file offset
1298432. Hence the file range before 1298431 maps to a hole.



123854:fsstress 19867 [001] 178758.896306:  probe:xfs_file_buffered_aio_write: (c0000000006b71f0) ki_pos=434109 i_ino=8388710
124223:fsstress 19867 [001] 178758.896549:            probe:__iomap_write_end: (c00000000049720c) i_ino=8388710 len_u32=24643 copied_u32=24643 pos_u64=434109
124583:fsstress 19867 [001] 178758.896719:            probe:__iomap_write_end: (c00000000049720c) i_ino=8388710 len_u32=65536 copied_u32=65536 pos_u64=458752
125875:fsstress 19867 [001] 178758.897266:            probe:__iomap_write_end: (c00000000049720c) i_ino=8388710 len_u32=38283 copied_u32=38283 pos_u64=524288

The 8th 64k page now has data written to it in the file offset range [524288,
565247]. So there is a hole in the range [565248, 589823] mapped by the page. 


154474:fsstress 19867 [001] 178758.947596:     probe:iomap_set_range_uptodate: (c000000000495a40) index=8 i_ino=8388710 last=0xf first=0xa

The above event occurs when fsstress invokes a buffered read operation at 10
block of 8th page i.e. file offset 565248. Hence the blocks mapping the file
offset range  [565248, 589823] are marked uptodate in iop->uptodate.


238888:fsstress 19867 [001] 178759.134993: probe:xfs_map_blocks_found_new_map: (c00000000069fefc) blockcount=27 startblock=18446744073709551614 startoff=138 i_ino=8388710 offset_fsb_u64=138 imap_valid_u8=0
238913:fsstress 19867 [001] 178759.135003:               probe:xfs_map_blocks: (c00000000069f87c) blockcount=27 startblock=18446744073709551614 startoff=138 i_ino=8388710 offset_fsb_u64=139 offset_u64=569344
238938:fsstress 19867 [001] 178759.135012:    probe:xfs_map_blocks_imap_valid: (c00000000069f8e0) startoff=138 startblock=18446744073709551614 blockcount=27 i_ino=8388710 offset_fsb_u64=139 offset_u64=569344 imap_valid_u8=1
238963:fsstress 19867 [001] 178759.135054:  probe:xfs_map_blocks_imap_valid_1: (c00000000069f93c) startoff=138 startblock=18446744073709551614 blockcount=27 i_ino=8388710 offset_fsb_u64=139 offset_u64=569344 imap_valid_u8=1
238988:fsstress 19867 [001] 178759.135065:               probe:xfs_map_blocks: (c00000000069f87c) blockcount=27 startblock=18446744073709551614 startoff=138 i_ino=8388710 offset_fsb_u64=140 offset_u64=573440

When writing the dirty page which maps the 4k block starting at file offset
565248 (138th block) we would encounter a hole with HOLESTARTBLOCK (i.e. -2)
as the value of startblock. xfs_map_blocks() assigns this imap to
xfs_writepage_ctx->imap and returns. xfs_writepage_map() loops once again
since iop->uptodate has the corresponding bit set.

-- 
chandan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: COW improvements and always_cow support V2
  2018-11-25 13:39 ` COW improvements and always_cow support V2 Chandan Rajendra
@ 2018-11-26 11:42   ` Chandan Rajendra
  2018-11-28  7:52     ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: Chandan Rajendra @ 2018-11-26 11:42 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Sunday, November 25, 2018 7:09:33 PM IST Chandan Rajendra wrote:
> On Monday, November 19, 2018 7:16:10 PM IST Christoph Hellwig wrote:
> > Hi all,
> > 
> > this series adds the always_cow mode support after improving our COW
> > write support a little bit first.
> > 
> > The always_cow mode stresses the COW path a lot, but with a few xfstests
> > fixups it generall looks good, except for:
> > 
> >  - a few tests that complain about fragmentation, which is rather inherent
> >    in this mode
> >  - generic/208 crashing a lot (and generic/095 with 1k block similarly)
> >    because a COW fork extent has changed under writeback.  As far as I can
> >    tell this is because nothing prevents another thread from moving a COW
> >    fork extent to the data fork while we are under writeback.  I'm currently
> >    fully root causing this and looking into a potential fix
> >  - xfs/017 crashes occasionally in log recovery because we can't find
> >    a refcount tree record that we try to free.
> >    I haven't really fully understood this one yet.
> > 
> > Changes since v1:
> >  - make delalloc and unwritten extent conversions simpler and more robust
> >  - add a few additional cleanups
> >  - support all fallocate modes but actual preallocation
> >  - rebase on top of a fix from Brian (which is included as first patch
> >    to make the patch set more usable)
> > 
> 
> Hi Christoph,
> 
> On ppc64le (with 4k block size), xfs/017 causes the following call trace,
> 
> WARNING: CPU: 2 PID: 7865 at /root/repos/linux/fs/xfs/xfs_aops.c:352 xfs_map_blocks+0x154/0xc44
> Modules linked in:
> CPU: 2 PID: 7865 Comm: fsstress Not tainted 4.20.0-rc3-next-20181123-00008-gbb319d33d6c5-dirty #7
> NIP:  c00000000069f924 LR: c00000000069f88c CTR: 0000000000000000
> REGS: c000000630dc3520 TRAP: 0700   Not tainted  (4.20.0-rc3-next-20181123-00008-gbb319d33d6c5-dirty)
> MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 44004484  XER: 00000000
> CFAR: c00000000000e6d4 IRQMASK: 0
> GPR00: c00000000069f88c c000000630dc37b0 c0000000017a3100 c000000630dc3b80
> GPR04: 0000000000000000 000000000000008f 0000000000000008 fffffffffffffffe
> GPR08: 000000000000002b 0000000000000001 0000000000000009 c000000623040080
> GPR12: c0000000003113a0 c00000003fffdf00 0000000000000000 0000000000000000
> GPR16: 0000000100030780 00007fffffffb2c0 00000001000136a8 0000000000000001
> GPR20: 0000000000000000 0000000000010000 c000000630dc3c20 0000000000001000
> GPR24: 0000000000000008 f000000000d79a00 c000000630dc3b80 c000000624320300
> GPR28: c000000100733f48 0000000000000000 0000000000009000 c000000630dc37b0
> NIP [c00000000069f924] xfs_map_blocks+0x154/0xc44
> LR [c00000000069f88c] xfs_map_blocks+0xbc/0xc44
> Call Trace:
> [c000000630dc37b0] [c00000000069f88c] xfs_map_blocks+0xbc/0xc44 (unreliable)
> [c000000630dc3960] [c0000000006a0e88] xfs_writepage_map+0x174/0x428
> [c000000630dc39f0] [c0000000006a12ac] xfs_do_writepage+0x170/0x2c0
> [c000000630dc3a40] [c0000000003257f0] write_cache_pages+0x350/0x540
> [c000000630dc3b60] [c0000000006a1508] xfs_vm_writepages+0x98/0xd4
> [c000000630dc3bd0] [c000000000326c2c] do_writepages+0x5c/0xcc
> [c000000630dc3c00] [c00000000030e338] __filemap_fdatawrite_range+0xc4/0xd4
> [c000000630dc3c60] [c00000000030e460] filemap_flush+0x30/0x40
> [c000000630dc3c80] [c0000000006d0538] xfs_release+0x12c/0x298
> [c000000630dc3cc0] [c0000000006b4ca4] xfs_file_release+0x24/0x38
> [c000000630dc3ce0] [c0000000003f68f8] __fput+0xc8/0x280
> [c000000630dc3d40] [c0000000003f6b40] ____fput+0x20/0x30
> [c000000630dc3d60] [c00000000014bef4] task_work_run+0xb8/0x100
> [c000000630dc3da0] [c0000000000231cc] do_notify_resume+0x184/0x18c
> [c000000630dc3e20] [c00000000000e544] ret_from_except_lite+0x70/0x74
> Instruction dump:
> 39290002 7d290074 7929d182 5529063e 913f005c 813f005c 7d290034 5529d97e
> 69290001 5529063e 2fa90000 419e0008 <0fe00000> 813f005c 7d290034 5529d97e
> 
> 
> This corresponds to the following code inside xfs_map_blocks(),
> 
> imap_valid = offset_fsb >= wpc->imap.br_startoff &&
> 	     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount;
> if (imap_valid &&
>     !WARN_ON_ONCE(wpc->imap.br_startblock == HOLESTARTBLOCK) &&
>     (!xfs_inode_has_cow_data(ip) ||
>      wpc->io_type == XFS_IO_COW ||
>      wpc->cow_seq == READ_ONCE(ip->i_cowfp->if_seq)))
> 	return 0;
> 
> 
> I did some debugging and the following provides an explaination as to why 
> the condition passed to WARN_ON_ONCE() evaluated to true,
> 
> 148854:fsstress 19867 [001] 178758.940803:  probe:xfs_file_buffered_aio_write: (c0000000006b71f0) ki_pos=1301686 i_ino=8388710
> 149023:fsstress 19867 [001] 178758.940968:            probe:__iomap_write_end: (c00000000049720c) i_ino=8388710 len_u32=9034 copied_u32=9034 pos_u64=1301686
> 149191:fsstress 19867 [001] 178758.941167:            probe:__iomap_write_end: (c00000000049720c) i_ino=8388710 len_u32=65536 copied_u32=65536 pos_u64=1310720
> 150133:fsstress 19867 [001] 178758.941566:            probe:__iomap_write_end: (c00000000049720c) i_ino=8388710 len_u32=8475 copied_u32=8475 pos_u64=1376256
> 
> An "Extended write" is performed at a 4k block which maps file offset
> 1298432. Hence the file range before 1298431 maps to a hole.
> 
> 
> 
> 123854:fsstress 19867 [001] 178758.896306:  probe:xfs_file_buffered_aio_write: (c0000000006b71f0) ki_pos=434109 i_ino=8388710
> 124223:fsstress 19867 [001] 178758.896549:            probe:__iomap_write_end: (c00000000049720c) i_ino=8388710 len_u32=24643 copied_u32=24643 pos_u64=434109
> 124583:fsstress 19867 [001] 178758.896719:            probe:__iomap_write_end: (c00000000049720c) i_ino=8388710 len_u32=65536 copied_u32=65536 pos_u64=458752
> 125875:fsstress 19867 [001] 178758.897266:            probe:__iomap_write_end: (c00000000049720c) i_ino=8388710 len_u32=38283 copied_u32=38283 pos_u64=524288
> 
> The 8th 64k page now has data written to it in the file offset range [524288,
> 565247]. So there is a hole in the range [565248, 589823] mapped by the page. 
> 
> 
> 154474:fsstress 19867 [001] 178758.947596:     probe:iomap_set_range_uptodate: (c000000000495a40) index=8 i_ino=8388710 last=0xf first=0xa
> 
> The above event occurs when fsstress invokes a buffered read operation at 10
> block of 8th page i.e. file offset 565248. Hence the blocks mapping the file
> offset range  [565248, 589823] are marked uptodate in iop->uptodate.
> 
> 
> 238888:fsstress 19867 [001] 178759.134993: probe:xfs_map_blocks_found_new_map: (c00000000069fefc) blockcount=27 startblock=18446744073709551614 startoff=138 i_ino=8388710 offset_fsb_u64=138 imap_valid_u8=0
> 238913:fsstress 19867 [001] 178759.135003:               probe:xfs_map_blocks: (c00000000069f87c) blockcount=27 startblock=18446744073709551614 startoff=138 i_ino=8388710 offset_fsb_u64=139 offset_u64=569344
> 238938:fsstress 19867 [001] 178759.135012:    probe:xfs_map_blocks_imap_valid: (c00000000069f8e0) startoff=138 startblock=18446744073709551614 blockcount=27 i_ino=8388710 offset_fsb_u64=139 offset_u64=569344 imap_valid_u8=1
> 238963:fsstress 19867 [001] 178759.135054:  probe:xfs_map_blocks_imap_valid_1: (c00000000069f93c) startoff=138 startblock=18446744073709551614 blockcount=27 i_ino=8388710 offset_fsb_u64=139 offset_u64=569344 imap_valid_u8=1
> 238988:fsstress 19867 [001] 178759.135065:               probe:xfs_map_blocks: (c00000000069f87c) blockcount=27 startblock=18446744073709551614 startoff=138 i_ino=8388710 offset_fsb_u64=140 offset_u64=573440
> 
> When writing the dirty page which maps the 4k block starting at file offset
> 565248 (138th block) we would encounter a hole with HOLESTARTBLOCK (i.e. -2)
> as the value of startblock. xfs_map_blocks() assigns this imap to
> xfs_writepage_ctx->imap and returns. xfs_writepage_map() loops once again
> since iop->uptodate has the corresponding bit set.
> 
> 

IMHO, we should track the dirty blocks within the iomap_page structure and
invoke xfs_map_blocks() only on those dirty blocks.

-- 
chandan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: COW improvements and always_cow support V2
  2018-11-26 11:42   ` Chandan Rajendra
@ 2018-11-28  7:52     ` Christoph Hellwig
  0 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2018-11-28  7:52 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: Christoph Hellwig, linux-xfs

Hi Chandan,

thanks for the report.  I don't think we need a new dirty bit,
we just need be a bit better at handling holes.  I'll resend a new
version for testing once I get a little more time for testing.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2018-11-28 18:53 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-19 13:46 COW improvements and always_cow support V2 Christoph Hellwig
2018-11-19 13:46 ` [PATCH 1/9] xfs: fix shared extent data corruption due to missing cow reservation Christoph Hellwig
2018-11-19 13:46 ` [PATCH 2/9] xfs: handle -EAGAIN from xfs_iomap_write_allocate Christoph Hellwig
2018-11-19 13:46 ` [PATCH 3/9] xfs: avoid an extent tree lookup in xfs_iomap_write_allocate Christoph Hellwig
2018-11-19 13:46 ` [PATCH 4/9] xfs: make xfs_bmbt_to_iomap more useful Christoph Hellwig
2018-11-19 13:46 ` [PATCH 5/9] xfs: don't use delalloc extents for COW on files with extsize hints Christoph Hellwig
2018-11-19 13:46 ` [PATCH 6/9] xfs: merge COW handling into xfs_file_iomap_begin_delay Christoph Hellwig
2018-11-19 13:46 ` [PATCH 7/9] xfs: report IOMAP_F_SHARED from xfs_file_iomap_begin_delay Christoph Hellwig
2018-11-19 13:46 ` [PATCH 8/9] xfs: make COW fork unwritten extent conversions more robust Christoph Hellwig
2018-11-19 13:46 ` [PATCH 9/9] xfs: introduce an always_cow mode Christoph Hellwig
2018-11-25 13:39 ` COW improvements and always_cow support V2 Chandan Rajendra
2018-11-26 11:42   ` Chandan Rajendra
2018-11-28  7:52     ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).