All of lore.kernel.org
 help / color / mirror / Atom feed
* reflink COW improvements
@ 2016-12-05 21:05 Christoph Hellwig
  2016-12-05 21:05 ` [PATCH 1/3] xfs: reject all unaligned direct writes to reflinked files Christoph Hellwig
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Christoph Hellwig @ 2016-12-05 21:05 UTC (permalink / raw)
  To: linux-xfs; +Cc: darrick.wong

Hi all,

this series adds a few improvements to the direct I/O COW path.

Note that patch 2 at this point causes a regression in xfs/214:

@ -11,7 +11,7 @@
 CoW one of the files
 root      --       0       0       0              4     0     0       
 nobody    --       0       0       0              1     0     0       
-fsgqa     --    3520       0       0              4     0     0       
+fsgqa     --    3072       0       0              4     0     0       
 Remount the FS to see if accounting changes
 root      --       0       0       0              4     0     0       
 nobody    --       0       0       0              1     0     0 

But I think the test golden output actually is wrong - the test
claims it tests that we don't get blocks charged during the COW,
but the 3520 here means 512 bytes were charged, and later in the
golden output we actually see that going away after a remount.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/3] xfs: reject all unaligned direct writes to reflinked files
  2016-12-05 21:05 reflink COW improvements Christoph Hellwig
@ 2016-12-05 21:05 ` Christoph Hellwig
  2016-12-07 18:59   ` Brian Foster
  2016-12-05 21:05 ` [PATCH 2/3] xfs: go straight to real allocations for direct I/O COW writes Christoph Hellwig
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2016-12-05 21:05 UTC (permalink / raw)
  To: linux-xfs; +Cc: darrick.wong

We currently fall back from direct to buffered writes if we detect a
remaining shared extent in the iomap_begin callback.  But by the time
iomap_begin is called for the potentially unaligned end block we might
have already written most of the data to disk, which we'd now write
again using buffered I/O.  To avoid this reject all writes to reflinked
files before starting I/O so that we are guaranteed to only write the
data once.

The alternative would be to unshare the unaligned start and/or end block
before doing the I/O. I think that's doable, and will actually be
required to support reflinks on DAX file system.  But it will take a
little more time and I'd rather get rid of the double write ASAP.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_file.c  |  7 +++++++
 fs/xfs/xfs_iomap.c | 12 +-----------
 2 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index f5effa6..873cd42 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -532,6 +532,13 @@ xfs_file_dio_aio_write(
 	if ((iocb->ki_pos & mp->m_blockmask) ||
 	    ((iocb->ki_pos + count) & mp->m_blockmask)) {
 		unaligned_io = 1;
+
+		/*
+		 * We can't properly handle unaligned direct I/O to reflink
+		 * files yet, as we can't unshare a partial block.
+		 */
+		if (xfs_is_reflink_inode(ip))
+			return -EREMCHG;
 		iolock = XFS_IOLOCK_EXCL;
 	} else {
 		iolock = XFS_IOLOCK_SHARED;
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 0d14742..78105db 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1026,17 +1026,7 @@ xfs_file_iomap_begin(
 		if (error)
 			goto out_unlock;
 
-		/*
-		 * We're here because we're trying to do a directio write to a
-		 * region that isn't aligned to a filesystem block.  If the
-		 * extent is shared, fall back to buffered mode to handle the
-		 * RMW.
-		 */
-		if (!(flags & IOMAP_REPORT) && shared) {
-			trace_xfs_reflink_bounce_dio_write(ip, &imap);
-			error = -EREMCHG;
-			goto out_unlock;
-		}
+		ASSERT((flags & IOMAP_REPORT) || !shared);
 	}
 
 	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/3] xfs: go straight to real allocations for direct I/O COW writes
  2016-12-05 21:05 reflink COW improvements Christoph Hellwig
  2016-12-05 21:05 ` [PATCH 1/3] xfs: reject all unaligned direct writes to reflinked files Christoph Hellwig
@ 2016-12-05 21:05 ` Christoph Hellwig
  2016-12-07 19:00   ` Brian Foster
  2016-12-05 21:05 ` [PATCH 3/3] xfs: allocate direct I/O COW blocks in iomap_begin Christoph Hellwig
  2016-12-06  2:09 ` reflink COW improvements Darrick J. Wong
  3 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2016-12-05 21:05 UTC (permalink / raw)
  To: linux-xfs; +Cc: darrick.wong

When we allocate COW fork blocks for direct I/O writes we currently first
create a delayed allocation, and then convert it to a real allocation
once we've got the delayed one.

As there is no good reason for that this patch instead makes use call
xfs_bmapi_write from the COW allocation path.  The only interesting bits
are a few tweaks the low-level allocator to allow for this, most notably
the need to remove the call to xfs_bmap_extsize_align for the cowextsize
in xfs_bmap_btalloc - for the existing convert case it's a no-op, but
for the direct allocation case it would blow up our block reservation
way beyond what we reserved for the transaction.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_bmap.c |   9 ++---
 fs/xfs/xfs_reflink.c     | 101 ++++++++++++++++++++++++++++++++++-------------
 2 files changed, 76 insertions(+), 34 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 6f28814..a4a4327 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -2893,13 +2893,14 @@ xfs_bmap_add_extent_hole_real(
 	ASSERT(!isnullstartblock(new->br_startblock));
 	ASSERT(!bma->cur ||
 	       !(bma->cur->bc_private.b.flags & XFS_BTCUR_BPRV_WASDEL));
-	ASSERT(whichfork != XFS_COW_FORK);
 
 	XFS_STATS_INC(mp, xs_add_exlist);
 
 	state = 0;
 	if (whichfork == XFS_ATTR_FORK)
 		state |= BMAP_ATTRFORK;
+	if (whichfork == XFS_COW_FORK)
+		state |= BMAP_COWFORK;
 
 	/*
 	 * Check and set flags if this segment has a left neighbor.
@@ -3619,9 +3620,7 @@ xfs_bmap_btalloc(
 	else if (mp->m_dalign)
 		stripe_align = mp->m_dalign;
 
-	if (ap->flags & XFS_BMAPI_COWFORK)
-		align = xfs_get_cowextsz_hint(ap->ip);
-	else if (xfs_alloc_is_userdata(ap->datatype))
+	if (xfs_alloc_is_userdata(ap->datatype))
 		align = xfs_get_extsz_hint(ap->ip);
 	if (unlikely(align)) {
 		error = xfs_bmap_extsize_align(mp, &ap->got, &ap->prev,
@@ -4608,8 +4607,6 @@ xfs_bmapi_write(
 		 */
 		if (flags & XFS_BMAPI_REMAP)
 			ASSERT(inhole);
-		if (flags & XFS_BMAPI_COWFORK)
-			ASSERT(!inhole);
 
 		/*
 		 * First, deal with the hole before the allocated space
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 88fd03c..0a077e80 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -304,59 +304,104 @@ __xfs_reflink_allocate_cow(
 	xfs_fileoff_t		end_fsb)
 {
 	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_bmbt_irec	imap;
+	struct xfs_bmbt_irec	imap, got;
 	struct xfs_defer_ops	dfops;
 	struct xfs_trans	*tp;
 	xfs_fsblock_t		first_block;
-	int			nimaps = 1, error;
-	bool			shared;
+	int			nimaps, error, lockmode;
+	bool			shared, trimmed;
+	xfs_extlen_t		resblks, align;
+	xfs_extnum_t		idx;
 
-	xfs_defer_init(&dfops, &first_block);
+	align = xfs_eof_alignment(ip, xfs_get_cowextsz_hint(ip));
+	if (align)
+		end_fsb = roundup_64(end_fsb, align);
 
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0,
-			XFS_TRANS_RESERVE, &tp);
+	resblks = XFS_DIOSTRAT_SPACE_RES(mp, end_fsb - *offset_fsb);
+
+	error = xfs_qm_dqattach(ip, 0);
 	if (error)
 		return error;
 
-	xfs_ilock(ip, XFS_ILOCK_EXCL);
-
-	/* Read extent from the source file. */
-	nimaps = 1;
-	error = xfs_bmapi_read(ip, *offset_fsb, end_fsb - *offset_fsb,
-			&imap, &nimaps, 0);
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
 	if (error)
-		goto out_unlock;
-	ASSERT(nimaps == 1);
+		return error;
 
-	error = xfs_reflink_reserve_cow(ip, &imap, &shared);
-	if (error)
-		goto out_trans_cancel;
+	lockmode = XFS_ILOCK_EXCL;
+	xfs_ilock(ip, lockmode);
 
-	if (!shared) {
-		*offset_fsb = imap.br_startoff + imap.br_blockcount;
-		goto out_trans_cancel;
+	/*
+	 * Even if the extent is not shared we might have a preallocation for
+	 * it in the COW fork.  If so use it.
+	 */
+	if (xfs_iext_lookup_extent(ip, ip->i_cowfp, *offset_fsb, &idx, &got) &&
+	    got.br_startoff <= *offset_fsb) {
+		/* If we have a real allocation in the COW fork we're done. */
+		if (!isnullstartblock(got.br_startblock)) {
+			xfs_trim_extent(&got, *offset_fsb,
+					end_fsb - *offset_fsb);
+			*offset_fsb = got.br_startoff + got.br_blockcount;
+			goto out_trans_cancel;
+		}
+	} else {
+		nimaps = 1;
+		error = xfs_bmapi_read(ip, *offset_fsb, end_fsb - *offset_fsb,
+				&imap, &nimaps, 0);
+		if (error)
+			goto out_trans_cancel;
+		ASSERT(nimaps == 1);
+
+		/* Trim the mapping to the nearest shared extent boundary. */
+		error = xfs_reflink_trim_around_shared(ip, &imap, &shared,
+				&trimmed);
+		if (error)
+			goto out_trans_cancel;
+
+		if (!shared) {
+			*offset_fsb = imap.br_startoff + imap.br_blockcount;
+			goto out_trans_cancel;
+		}
+
+		*offset_fsb = imap.br_startoff;
+		end_fsb = imap.br_startoff + imap.br_blockcount;
+		if (align)
+			end_fsb = roundup_64(end_fsb, align);
 	}
 
-	xfs_trans_ijoin(tp, ip, 0);
-	error = xfs_bmapi_write(tp, ip, imap.br_startoff, imap.br_blockcount,
-			XFS_BMAPI_COWFORK, &first_block,
-			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK),
-			&imap, &nimaps, &dfops);
+	error = xfs_trans_reserve_quota_nblks(tp, ip, resblks, 0,
+			XFS_QMOPT_RES_REGBLKS);
 	if (error)
 		goto out_trans_cancel;
 
+	xfs_trans_ijoin(tp, ip, 0);
+
+	xfs_defer_init(&dfops, &first_block);
+	nimaps = 1;
+	error = xfs_bmapi_write(tp, ip, *offset_fsb, end_fsb - *offset_fsb,
+			XFS_BMAPI_COWFORK, &first_block, resblks, &imap,
+			&nimaps, &dfops);
+	if (error)
+		goto out_bmap_cancel;
+
 	error = xfs_defer_finish(&tp, &dfops, NULL);
 	if (error)
-		goto out_trans_cancel;
+		goto out_bmap_cancel;
 
 	error = xfs_trans_commit(tp);
+	if (error)
+		goto out_unlock;
 
 	*offset_fsb = imap.br_startoff + imap.br_blockcount;
+
 out_unlock:
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	xfs_iunlock(ip, lockmode);
 	return error;
-out_trans_cancel:
+
+out_bmap_cancel:
 	xfs_defer_cancel(&dfops);
+	xfs_trans_unreserve_quota_nblks(tp, ip, (long)resblks, 0,
+			XFS_QMOPT_RES_REGBLKS);
+out_trans_cancel:
 	xfs_trans_cancel(tp);
 	goto out_unlock;
 }
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 3/3] xfs: allocate direct I/O COW blocks in iomap_begin
  2016-12-05 21:05 reflink COW improvements Christoph Hellwig
  2016-12-05 21:05 ` [PATCH 1/3] xfs: reject all unaligned direct writes to reflinked files Christoph Hellwig
  2016-12-05 21:05 ` [PATCH 2/3] xfs: go straight to real allocations for direct I/O COW writes Christoph Hellwig
@ 2016-12-05 21:05 ` Christoph Hellwig
  2016-12-06  2:09 ` reflink COW improvements Darrick J. Wong
  3 siblings, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2016-12-05 21:05 UTC (permalink / raw)
  To: linux-xfs; +Cc: darrick.wong

Instead of preallocating all the required COW blocks in the high-level
write code do it inside the iomap code, like we do for all other I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_file.c    |   8 ---
 fs/xfs/xfs_iomap.c   |  31 ++++-------
 fs/xfs/xfs_reflink.c | 154 +++++++++++++++++++--------------------------------
 fs/xfs/xfs_reflink.h |   5 +-
 fs/xfs/xfs_trace.h   |   2 -
 5 files changed, 73 insertions(+), 127 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 873cd42..e6138f3 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -564,14 +564,6 @@ xfs_file_dio_aio_write(
 	}
 
 	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
-
-	/* If this is a block-aligned directio CoW, remap immediately. */
-	if (xfs_is_reflink_inode(ip) && !unaligned_io) {
-		ret = xfs_reflink_allocate_cow_range(ip, iocb->ki_pos, count);
-		if (ret)
-			goto out;
-	}
-
 	ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, xfs_dio_write_end_io);
 out:
 	xfs_iunlock(ip, iolock);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 78105db..01c6e31 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1002,37 +1002,31 @@ xfs_file_iomap_begin(
 	offset_fsb = XFS_B_TO_FSBT(mp, offset);
 	end_fsb = XFS_B_TO_FSB(mp, offset + length);
 
-	if (xfs_is_reflink_inode(ip) &&
-	    (flags & IOMAP_WRITE) && (flags & IOMAP_DIRECT)) {
-		shared = xfs_reflink_find_cow_mapping(ip, offset, &imap);
-		if (shared) {
-			xfs_iunlock(ip, lockmode);
-			goto alloc_done;
-		}
-		ASSERT(!isnullstartblock(imap.br_startblock));
-	}
-
 	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb, &imap,
 			       &nimaps, 0);
 	if (error)
 		goto out_unlock;
 
-	if ((flags & IOMAP_REPORT) ||
-	    (xfs_is_reflink_inode(ip) &&
-	     (flags & IOMAP_WRITE) && (flags & IOMAP_DIRECT))) {
+	if (flags & IOMAP_REPORT) {
 		/* Trim the mapping to the nearest shared extent boundary. */
 		error = xfs_reflink_trim_around_shared(ip, &imap, &shared,
 				&trimmed);
 		if (error)
 			goto out_unlock;
-
-		ASSERT((flags & IOMAP_REPORT) || !shared);
 	}
 
 	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
-		error = xfs_reflink_reserve_cow(ip, &imap, &shared);
-		if (error)
-			goto out_unlock;
+		if (flags & IOMAP_DIRECT) {
+			/* may drop and re-acquire the ilock */
+			error = xfs_reflink_allocate_cow(ip, offset_fsb, end_fsb,
+					&imap, &shared, &lockmode);
+			if (error)
+				goto out_unlock;
+		} else {
+			error = xfs_reflink_reserve_cow(ip, &imap, &shared);
+			if (error)
+				goto out_unlock;
+		}
 
 		end_fsb = imap.br_startoff + imap.br_blockcount;
 		length = XFS_FSB_TO_B(mp, end_fsb) - offset;
@@ -1061,7 +1055,6 @@ xfs_file_iomap_begin(
 		if (error)
 			return error;
 
-alloc_done:
 		iomap->flags = IOMAP_F_NEW;
 		trace_xfs_iomap_alloc(ip, offset, length, 0, &imap);
 	} else {
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 0a077e80..2cddc9a 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -297,88 +297,92 @@ xfs_reflink_reserve_cow(
 }
 
 /* Allocate all CoW reservations covering a range of blocks in a file. */
-static int
-__xfs_reflink_allocate_cow(
+int
+xfs_reflink_allocate_cow(
 	struct xfs_inode	*ip,
-	xfs_fileoff_t		*offset_fsb,
-	xfs_fileoff_t		end_fsb)
+	xfs_fileoff_t		offset_fsb,
+	xfs_fileoff_t		end_fsb,
+	struct xfs_bmbt_irec	*imap,
+	bool			*shared,
+	uint			*lockmode)
 {
 	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_bmbt_irec	imap, got;
+	struct xfs_bmbt_irec	got;
 	struct xfs_defer_ops	dfops;
-	struct xfs_trans	*tp;
+	struct xfs_trans	*tp = NULL;
 	xfs_fsblock_t		first_block;
-	int			nimaps, error, lockmode;
-	bool			shared, trimmed;
+	int			nimaps, error = 0;
+	bool			trimmed;
 	xfs_extlen_t		resblks, align;
 	xfs_extnum_t		idx;
 
-	align = xfs_eof_alignment(ip, xfs_get_cowextsz_hint(ip));
-	if (align)
-		end_fsb = roundup_64(end_fsb, align);
-
-	resblks = XFS_DIOSTRAT_SPACE_RES(mp, end_fsb - *offset_fsb);
-
-	error = xfs_qm_dqattach(ip, 0);
-	if (error)
-		return error;
-
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
-	if (error)
-		return error;
-
-	lockmode = XFS_ILOCK_EXCL;
-	xfs_ilock(ip, lockmode);
+retry:
+	ASSERT(xfs_is_reflink_inode(ip));
+	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL | XFS_ILOCK_SHARED));
 
 	/*
 	 * Even if the extent is not shared we might have a preallocation for
 	 * it in the COW fork.  If so use it.
 	 */
-	if (xfs_iext_lookup_extent(ip, ip->i_cowfp, *offset_fsb, &idx, &got) &&
-	    got.br_startoff <= *offset_fsb) {
+	if (xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb, &idx, &got) &&
+	    got.br_startoff <= offset_fsb) {
+		*shared = true;
+
 		/* If we have a real allocation in the COW fork we're done. */
 		if (!isnullstartblock(got.br_startblock)) {
-			xfs_trim_extent(&got, *offset_fsb,
-					end_fsb - *offset_fsb);
-			*offset_fsb = got.br_startoff + got.br_blockcount;
-			goto out_trans_cancel;
+			xfs_trim_extent(&got, offset_fsb, end_fsb - offset_fsb);
+			*imap = got;
+			goto out;
 		}
+
+		xfs_trim_extent(imap, got.br_startoff, got.br_blockcount);
 	} else {
-		nimaps = 1;
-		error = xfs_bmapi_read(ip, *offset_fsb, end_fsb - *offset_fsb,
-				&imap, &nimaps, 0);
+		error = xfs_reflink_trim_around_shared(ip, imap, shared, &trimmed);
+		if (error || !*shared)
+			goto out;
+	}
+
+	if (!tp) {
+		align = xfs_eof_alignment(ip, xfs_get_cowextsz_hint(ip));
+		if (align)
+			end_fsb = roundup_64(end_fsb, align);
+
+		resblks = XFS_DIOSTRAT_SPACE_RES(mp, end_fsb - offset_fsb);
+
+		xfs_iunlock(ip, *lockmode);
+		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
+		*lockmode = XFS_ILOCK_EXCL;
+		xfs_ilock(ip, *lockmode);
+
 		if (error)
-			goto out_trans_cancel;
-		ASSERT(nimaps == 1);
+			return error;
 
-		/* Trim the mapping to the nearest shared extent boundary. */
-		error = xfs_reflink_trim_around_shared(ip, &imap, &shared,
-				&trimmed);
+		error = xfs_qm_dqattach_locked(ip, 0);
 		if (error)
-			goto out_trans_cancel;
+			goto out;
 
-		if (!shared) {
-			*offset_fsb = imap.br_startoff + imap.br_blockcount;
-			goto out_trans_cancel;
-		}
+		/* Read extent from the source file. */
+		nimaps = 1;
+		error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
+				imap, &nimaps, 0);
+		if (error)
+			goto out;
+		ASSERT(nimaps == 1);
 
-		*offset_fsb = imap.br_startoff;
-		end_fsb = imap.br_startoff + imap.br_blockcount;
-		if (align)
-			end_fsb = roundup_64(end_fsb, align);
+		goto retry;
 	}
 
 	error = xfs_trans_reserve_quota_nblks(tp, ip, resblks, 0,
 			XFS_QMOPT_RES_REGBLKS);
 	if (error)
-		goto out_trans_cancel;
+		goto out;
 
 	xfs_trans_ijoin(tp, ip, 0);
 
 	xfs_defer_init(&dfops, &first_block);
 	nimaps = 1;
-	error = xfs_bmapi_write(tp, ip, *offset_fsb, end_fsb - *offset_fsb,
-			XFS_BMAPI_COWFORK, &first_block, resblks, &imap,
+	error = xfs_bmapi_write(tp, ip, imap->br_startoff, imap->br_blockcount,
+			XFS_BMAPI_COWFORK, &first_block, resblks, imap,
 			&nimaps, &dfops);
 	if (error)
 		goto out_bmap_cancel;
@@ -387,57 +391,15 @@ __xfs_reflink_allocate_cow(
 	if (error)
 		goto out_bmap_cancel;
 
-	error = xfs_trans_commit(tp);
-	if (error)
-		goto out_unlock;
-
-	*offset_fsb = imap.br_startoff + imap.br_blockcount;
-
-out_unlock:
-	xfs_iunlock(ip, lockmode);
-	return error;
+	return xfs_trans_commit(tp);
 
 out_bmap_cancel:
 	xfs_defer_cancel(&dfops);
 	xfs_trans_unreserve_quota_nblks(tp, ip, (long)resblks, 0,
 			XFS_QMOPT_RES_REGBLKS);
-out_trans_cancel:
-	xfs_trans_cancel(tp);
-	goto out_unlock;
-}
-
-/* Allocate all CoW reservations covering a part of a file. */
-int
-xfs_reflink_allocate_cow_range(
-	struct xfs_inode	*ip,
-	xfs_off_t		offset,
-	xfs_off_t		count)
-{
-	struct xfs_mount	*mp = ip->i_mount;
-	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
-	xfs_fileoff_t		end_fsb = XFS_B_TO_FSB(mp, offset + count);
-	int			error;
-
-	ASSERT(xfs_is_reflink_inode(ip));
-
-	trace_xfs_reflink_allocate_cow_range(ip, offset, count);
-
-	/*
-	 * Make sure that the dquots are there.
-	 */
-	error = xfs_qm_dqattach(ip, 0);
-	if (error)
-		return error;
-
-	while (offset_fsb < end_fsb) {
-		error = __xfs_reflink_allocate_cow(ip, &offset_fsb, end_fsb);
-		if (error) {
-			trace_xfs_reflink_allocate_cow_range_error(ip, error,
-					_RET_IP_);
-			break;
-		}
-	}
-
+out:
+	if (tp)
+		xfs_trans_cancel(tp);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index aa6a4d6..2813d49 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -28,8 +28,9 @@ extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
 
 extern int xfs_reflink_reserve_cow(struct xfs_inode *ip,
 		struct xfs_bmbt_irec *imap, bool *shared);
-extern int xfs_reflink_allocate_cow_range(struct xfs_inode *ip,
-		xfs_off_t offset, xfs_off_t count);
+extern int xfs_reflink_allocate_cow(struct xfs_inode *ip,
+		xfs_fileoff_t offset_fsb, xfs_fileoff_t end_fsb,
+		struct xfs_bmbt_irec *imap, bool *shared, uint *lockmode);
 extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
 		struct xfs_bmbt_irec *imap);
 extern void xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 0907752..54aa1b2 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3347,7 +3347,6 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_found);
 DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_enospc);
 
 DEFINE_RW_EVENT(xfs_reflink_reserve_cow);
-DEFINE_RW_EVENT(xfs_reflink_allocate_cow_range);
 
 DEFINE_INODE_IREC_EVENT(xfs_reflink_bounce_dio_write);
 DEFINE_IOMAP_EVENT(xfs_reflink_find_cow_mapping);
@@ -3357,7 +3356,6 @@ DEFINE_SIMPLE_IO_EVENT(xfs_reflink_cancel_cow_range);
 DEFINE_SIMPLE_IO_EVENT(xfs_reflink_end_cow);
 DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_remap);
 
-DEFINE_INODE_ERROR_EVENT(xfs_reflink_allocate_cow_range_error);
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_cancel_cow_range_error);
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_end_cow_error);
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: reflink COW improvements
  2016-12-05 21:05 reflink COW improvements Christoph Hellwig
                   ` (2 preceding siblings ...)
  2016-12-05 21:05 ` [PATCH 3/3] xfs: allocate direct I/O COW blocks in iomap_begin Christoph Hellwig
@ 2016-12-06  2:09 ` Darrick J. Wong
  3 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2016-12-06  2:09 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Mon, Dec 05, 2016 at 10:05:21PM +0100, Christoph Hellwig wrote:
> Hi all,
> 
> this series adds a few improvements to the direct I/O COW path.
> 
> Note that patch 2 at this point causes a regression in xfs/214:
> 
> @ -11,7 +11,7 @@
>  CoW one of the files
>  root      --       0       0       0              4     0     0       
>  nobody    --       0       0       0              1     0     0       
> -fsgqa     --    3520       0       0              4     0     0       
> +fsgqa     --    3072       0       0              4     0     0       
>  Remount the FS to see if accounting changes
>  root      --       0       0       0              4     0     0       
>  nobody    --       0       0       0              1     0     0 
> 
> But I think the test golden output actually is wrong - the test
> claims it tests that we don't get blocks charged during the COW,
> but the 3520 here means 512 bytes were charged, and later in the
> golden output we actually see that going away after a remount.

I think it's correct for how the directio path used to work...

So xfs/214 writes a 1MB file and reflinks it twice, with the intent of
playing with file2.  So, file2 looks like this:

dddddddddddddddd (d = existing data, data fork fork)
                 (cow fork)

A 512K cowextsize hint is set and we write to the last 64K of one of
the files.  The old dio write code would make the usual delalloc
reservation, which rounds both up and down to the nearest cowextsize
boundary:

dddddddddddddddd (data fork)
        cccccccc (cow fork)

At this point, we should be charged for 3584K -- 1MB for each of the
three files, and 512K for the CoW delalloc.

Then we call fsync, which CoWs the last 64k out to disk.  Now we have:

dddddddddddddddD (D = replacement data, data fork)
        ccccccc  (cow fork)

The quota charge is now 3520K -- 1MB for each of the three files plus
the 448K that remains in the CoW fork anticipating a future write.
Remounting flushes out the cow fork:

dddddddddddddddD (data fork)
                 (cow fork)

So now the quota charge is back to an even 3072MB -- three 1MB files.

My suspicion is that since the new patchset bypasses the delalloc +
conversion dance and replaces it with allocating exactly as many blocks
as the dio write asked for + rounding up to the nearest cowextsize.  In
this case, we never make the delalloc reservation between 512K and 896K,
so the repquota output says 3072K instead of 3520K.

(That said, I've been tied up in meetings all day so I haven't had a
chance to apply the patches and find out for real.)


--D

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] xfs: reject all unaligned direct writes to reflinked files
  2016-12-05 21:05 ` [PATCH 1/3] xfs: reject all unaligned direct writes to reflinked files Christoph Hellwig
@ 2016-12-07 18:59   ` Brian Foster
  0 siblings, 0 replies; 23+ messages in thread
From: Brian Foster @ 2016-12-07 18:59 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, darrick.wong

On Mon, Dec 05, 2016 at 10:05:22PM +0100, Christoph Hellwig wrote:
> We currently fall back from direct to buffered writes if we detect a
> remaining shared extent in the iomap_begin callback.  But by the time
> iomap_begin is called for the potentially unaligned end block we might
> have already written most of the data to disk, which we'd now write
> again using buffered I/O.  To avoid this reject all writes to reflinked
> files before starting I/O so that we are guaranteed to only write the
> data once.
> 
> The alternative would be to unshare the unaligned start and/or end block
> before doing the I/O. I think that's doable, and will actually be
> required to support reflinks on DAX file system.  But it will take a
> little more time and I'd rather get rid of the double write ASAP.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_file.c  |  7 +++++++
>  fs/xfs/xfs_iomap.c | 12 +-----------
>  2 files changed, 8 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index f5effa6..873cd42 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -532,6 +532,13 @@ xfs_file_dio_aio_write(
>  	if ((iocb->ki_pos & mp->m_blockmask) ||
>  	    ((iocb->ki_pos + count) & mp->m_blockmask)) {
>  		unaligned_io = 1;
> +
> +		/*
> +		 * We can't properly handle unaligned direct I/O to reflink
> +		 * files yet, as we can't unshare a partial block.
> +		 */
> +		if (xfs_is_reflink_inode(ip))
> +			return -EREMCHG;
>  		iolock = XFS_IOLOCK_EXCL;
>  	} else {
>  		iolock = XFS_IOLOCK_SHARED;
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 0d14742..78105db 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1026,17 +1026,7 @@ xfs_file_iomap_begin(
>  		if (error)
>  			goto out_unlock;
>  
> -		/*
> -		 * We're here because we're trying to do a directio write to a
> -		 * region that isn't aligned to a filesystem block.  If the
> -		 * extent is shared, fall back to buffered mode to handle the
> -		 * RMW.
> -		 */
> -		if (!(flags & IOMAP_REPORT) && shared) {
> -			trace_xfs_reflink_bounce_dio_write(ip, &imap);
> -			error = -EREMCHG;
> -			goto out_unlock;
> -		}
> +		ASSERT((flags & IOMAP_REPORT) || !shared);
>  	}
>  
>  	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
> -- 
> 2.1.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] xfs: go straight to real allocations for direct I/O COW writes
  2016-12-05 21:05 ` [PATCH 2/3] xfs: go straight to real allocations for direct I/O COW writes Christoph Hellwig
@ 2016-12-07 19:00   ` Brian Foster
  2016-12-07 19:37     ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2016-12-07 19:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, darrick.wong

On Mon, Dec 05, 2016 at 10:05:23PM +0100, Christoph Hellwig wrote:
> When we allocate COW fork blocks for direct I/O writes we currently first
> create a delayed allocation, and then convert it to a real allocation
> once we've got the delayed one.
> 
> As there is no good reason for that this patch instead makes use call
> xfs_bmapi_write from the COW allocation path.  The only interesting bits
> are a few tweaks the low-level allocator to allow for this, most notably
> the need to remove the call to xfs_bmap_extsize_align for the cowextsize
> in xfs_bmap_btalloc - for the existing convert case it's a no-op, but
> for the direct allocation case it would blow up our block reservation
> way beyond what we reserved for the transaction.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/libxfs/xfs_bmap.c |   9 ++---
>  fs/xfs/xfs_reflink.c     | 101 ++++++++++++++++++++++++++++++++++-------------
>  2 files changed, 76 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 6f28814..a4a4327 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -2893,13 +2893,14 @@ xfs_bmap_add_extent_hole_real(
>  	ASSERT(!isnullstartblock(new->br_startblock));
>  	ASSERT(!bma->cur ||
>  	       !(bma->cur->bc_private.b.flags & XFS_BTCUR_BPRV_WASDEL));
> -	ASSERT(whichfork != XFS_COW_FORK);
>  
>  	XFS_STATS_INC(mp, xs_add_exlist);
>  
>  	state = 0;
>  	if (whichfork == XFS_ATTR_FORK)
>  		state |= BMAP_ATTRFORK;
> +	if (whichfork == XFS_COW_FORK)
> +		state |= BMAP_COWFORK;
>  
>  	/*
>  	 * Check and set flags if this segment has a left neighbor.
> @@ -3619,9 +3620,7 @@ xfs_bmap_btalloc(
>  	else if (mp->m_dalign)
>  		stripe_align = mp->m_dalign;
>  
> -	if (ap->flags & XFS_BMAPI_COWFORK)
> -		align = xfs_get_cowextsz_hint(ap->ip);
> -	else if (xfs_alloc_is_userdata(ap->datatype))
> +	if (xfs_alloc_is_userdata(ap->datatype))

Doesn't this defeat the purpose (i.e., fragmentation avoidance via wider
allocations) of the cowextszhint for direct I/O? I think it would be
better to be consistent with the approach for traditional I/O + extsz
and incorporate the alignment into the reservation. Perhaps the hunk of
code that already does just that in xfs_iomap_write_direct() could be
converted to a small helper and reused..?

>  		align = xfs_get_extsz_hint(ap->ip);
>  	if (unlikely(align)) {
>  		error = xfs_bmap_extsize_align(mp, &ap->got, &ap->prev,
> @@ -4608,8 +4607,6 @@ xfs_bmapi_write(
>  		 */
>  		if (flags & XFS_BMAPI_REMAP)
>  			ASSERT(inhole);
> -		if (flags & XFS_BMAPI_COWFORK)
> -			ASSERT(!inhole);
>  
>  		/*
>  		 * First, deal with the hole before the allocated space
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 88fd03c..0a077e80 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -304,59 +304,104 @@ __xfs_reflink_allocate_cow(
>  	xfs_fileoff_t		end_fsb)
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
> -	struct xfs_bmbt_irec	imap;
> +	struct xfs_bmbt_irec	imap, got;
>  	struct xfs_defer_ops	dfops;
>  	struct xfs_trans	*tp;
>  	xfs_fsblock_t		first_block;
> -	int			nimaps = 1, error;
> -	bool			shared;
> +	int			nimaps, error, lockmode;
> +	bool			shared, trimmed;
> +	xfs_extlen_t		resblks, align;
> +	xfs_extnum_t		idx;
>  
> -	xfs_defer_init(&dfops, &first_block);
> +	align = xfs_eof_alignment(ip, xfs_get_cowextsz_hint(ip));
> +	if (align)
> +		end_fsb = roundup_64(end_fsb, align);
>  
> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0,
> -			XFS_TRANS_RESERVE, &tp);
> +	resblks = XFS_DIOSTRAT_SPACE_RES(mp, end_fsb - *offset_fsb);
> +
> +	error = xfs_qm_dqattach(ip, 0);

This is already in the (only) caller.

Brian

>  	if (error)
>  		return error;
>  
> -	xfs_ilock(ip, XFS_ILOCK_EXCL);
> -
> -	/* Read extent from the source file. */
> -	nimaps = 1;
> -	error = xfs_bmapi_read(ip, *offset_fsb, end_fsb - *offset_fsb,
> -			&imap, &nimaps, 0);
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
>  	if (error)
> -		goto out_unlock;
> -	ASSERT(nimaps == 1);
> +		return error;
>  
> -	error = xfs_reflink_reserve_cow(ip, &imap, &shared);
> -	if (error)
> -		goto out_trans_cancel;
> +	lockmode = XFS_ILOCK_EXCL;
> +	xfs_ilock(ip, lockmode);
>  
> -	if (!shared) {
> -		*offset_fsb = imap.br_startoff + imap.br_blockcount;
> -		goto out_trans_cancel;
> +	/*
> +	 * Even if the extent is not shared we might have a preallocation for
> +	 * it in the COW fork.  If so use it.
> +	 */
> +	if (xfs_iext_lookup_extent(ip, ip->i_cowfp, *offset_fsb, &idx, &got) &&
> +	    got.br_startoff <= *offset_fsb) {
> +		/* If we have a real allocation in the COW fork we're done. */
> +		if (!isnullstartblock(got.br_startblock)) {
> +			xfs_trim_extent(&got, *offset_fsb,
> +					end_fsb - *offset_fsb);
> +			*offset_fsb = got.br_startoff + got.br_blockcount;
> +			goto out_trans_cancel;
> +		}
> +	} else {
> +		nimaps = 1;
> +		error = xfs_bmapi_read(ip, *offset_fsb, end_fsb - *offset_fsb,
> +				&imap, &nimaps, 0);
> +		if (error)
> +			goto out_trans_cancel;
> +		ASSERT(nimaps == 1);
> +
> +		/* Trim the mapping to the nearest shared extent boundary. */
> +		error = xfs_reflink_trim_around_shared(ip, &imap, &shared,
> +				&trimmed);
> +		if (error)
> +			goto out_trans_cancel;
> +
> +		if (!shared) {
> +			*offset_fsb = imap.br_startoff + imap.br_blockcount;
> +			goto out_trans_cancel;
> +		}
> +
> +		*offset_fsb = imap.br_startoff;
> +		end_fsb = imap.br_startoff + imap.br_blockcount;
> +		if (align)
> +			end_fsb = roundup_64(end_fsb, align);
>  	}
>  
> -	xfs_trans_ijoin(tp, ip, 0);
> -	error = xfs_bmapi_write(tp, ip, imap.br_startoff, imap.br_blockcount,
> -			XFS_BMAPI_COWFORK, &first_block,
> -			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK),
> -			&imap, &nimaps, &dfops);
> +	error = xfs_trans_reserve_quota_nblks(tp, ip, resblks, 0,
> +			XFS_QMOPT_RES_REGBLKS);
>  	if (error)
>  		goto out_trans_cancel;
>  
> +	xfs_trans_ijoin(tp, ip, 0);
> +
> +	xfs_defer_init(&dfops, &first_block);
> +	nimaps = 1;
> +	error = xfs_bmapi_write(tp, ip, *offset_fsb, end_fsb - *offset_fsb,
> +			XFS_BMAPI_COWFORK, &first_block, resblks, &imap,
> +			&nimaps, &dfops);
> +	if (error)
> +		goto out_bmap_cancel;
> +
>  	error = xfs_defer_finish(&tp, &dfops, NULL);
>  	if (error)
> -		goto out_trans_cancel;
> +		goto out_bmap_cancel;
>  
>  	error = xfs_trans_commit(tp);
> +	if (error)
> +		goto out_unlock;
>  
>  	*offset_fsb = imap.br_startoff + imap.br_blockcount;
> +
>  out_unlock:
> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	xfs_iunlock(ip, lockmode);
>  	return error;
> -out_trans_cancel:
> +
> +out_bmap_cancel:
>  	xfs_defer_cancel(&dfops);
> +	xfs_trans_unreserve_quota_nblks(tp, ip, (long)resblks, 0,
> +			XFS_QMOPT_RES_REGBLKS);
> +out_trans_cancel:
>  	xfs_trans_cancel(tp);
>  	goto out_unlock;
>  }
> -- 
> 2.1.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] xfs: go straight to real allocations for direct I/O COW writes
  2016-12-07 19:00   ` Brian Foster
@ 2016-12-07 19:37     ` Christoph Hellwig
  2016-12-07 19:46       ` Brian Foster
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2016-12-07 19:37 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, darrick.wong

On Wed, Dec 07, 2016 at 02:00:09PM -0500, Brian Foster wrote:
> > -	if (ap->flags & XFS_BMAPI_COWFORK)
> > -		align = xfs_get_cowextsz_hint(ap->ip);
> > -	else if (xfs_alloc_is_userdata(ap->datatype))
> > +	if (xfs_alloc_is_userdata(ap->datatype))
> 
> Doesn't this defeat the purpose (i.e., fragmentation avoidance via wider
> allocations) of the cowextszhint for direct I/O? I think it would be
> better to be consistent with the approach for traditional I/O + extsz
> and incorporate the alignment into the reservation. Perhaps the hunk of
> code that already does just that in xfs_iomap_write_direct() could be
> converted to a small helper and reused..?

We're already doing the alignment to the cowextsize hint in
__xfs_reflink_allocate_cow so that we can take the cowextsize into
account.

> > +	resblks = XFS_DIOSTRAT_SPACE_RES(mp, end_fsb - *offset_fsb);
> > +
> > +	error = xfs_qm_dqattach(ip, 0);
> 
> This is already in the (only) caller.

Yes, it can be dropped, although a superflous xfs_qm_dqattach is
totally harmless anyway.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] xfs: go straight to real allocations for direct I/O COW writes
  2016-12-07 19:37     ` Christoph Hellwig
@ 2016-12-07 19:46       ` Brian Foster
  2016-12-08  4:23         ` Darrick J. Wong
  2017-01-24  8:37         ` Christoph Hellwig
  0 siblings, 2 replies; 23+ messages in thread
From: Brian Foster @ 2016-12-07 19:46 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, darrick.wong

On Wed, Dec 07, 2016 at 08:37:09PM +0100, Christoph Hellwig wrote:
> On Wed, Dec 07, 2016 at 02:00:09PM -0500, Brian Foster wrote:
> > > -	if (ap->flags & XFS_BMAPI_COWFORK)
> > > -		align = xfs_get_cowextsz_hint(ap->ip);
> > > -	else if (xfs_alloc_is_userdata(ap->datatype))
> > > +	if (xfs_alloc_is_userdata(ap->datatype))
> > 
> > Doesn't this defeat the purpose (i.e., fragmentation avoidance via wider
> > allocations) of the cowextszhint for direct I/O? I think it would be
> > better to be consistent with the approach for traditional I/O + extsz
> > and incorporate the alignment into the reservation. Perhaps the hunk of
> > code that already does just that in xfs_iomap_write_direct() could be
> > converted to a small helper and reused..?
> 
> We're already doing the alignment to the cowextsize hint in
> __xfs_reflink_allocate_cow so that we can take the cowextsize into
> account.
> 

Only for end_fsb... xfs_bmap_btalloc() calls xfs_bmap_extsize_align()
with the alignment, which rounds out the start and end offsets.

This same end_fsb code has already been removed from
xfs_reflink_reserve_cow() for similar reasons. It should probably be
removed from xfs_reflink_allocate_cow() as well.

Brian

> > > +	resblks = XFS_DIOSTRAT_SPACE_RES(mp, end_fsb - *offset_fsb);
> > > +
> > > +	error = xfs_qm_dqattach(ip, 0);
> > 
> > This is already in the (only) caller.
> 
> Yes, it can be dropped, although a superflous xfs_qm_dqattach is
> totally harmless anyway.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] xfs: go straight to real allocations for direct I/O COW writes
  2016-12-07 19:46       ` Brian Foster
@ 2016-12-08  4:23         ` Darrick J. Wong
  2017-01-24  8:37         ` Christoph Hellwig
  1 sibling, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2016-12-08  4:23 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs

On Wed, Dec 07, 2016 at 02:46:34PM -0500, Brian Foster wrote:
> On Wed, Dec 07, 2016 at 08:37:09PM +0100, Christoph Hellwig wrote:
> > On Wed, Dec 07, 2016 at 02:00:09PM -0500, Brian Foster wrote:
> > > > -	if (ap->flags & XFS_BMAPI_COWFORK)
> > > > -		align = xfs_get_cowextsz_hint(ap->ip);
> > > > -	else if (xfs_alloc_is_userdata(ap->datatype))
> > > > +	if (xfs_alloc_is_userdata(ap->datatype))
> > > 
> > > Doesn't this defeat the purpose (i.e., fragmentation avoidance via wider
> > > allocations) of the cowextszhint for direct I/O? I think it would be
> > > better to be consistent with the approach for traditional I/O + extsz
> > > and incorporate the alignment into the reservation. Perhaps the hunk of
> > > code that already does just that in xfs_iomap_write_direct() could be
> > > converted to a small helper and reused..?
> > 
> > We're already doing the alignment to the cowextsize hint in
> > __xfs_reflink_allocate_cow so that we can take the cowextsize into
> > account.
> > 
> 
> Only for end_fsb... xfs_bmap_btalloc() calls xfs_bmap_extsize_align()
> with the alignment, which rounds out the start and end offsets.

Seconded; I finally got a chance to build a kernel with these three
patches and ran xfs/214 with a tracepointed kernel; it doesn't round the
start down to the previous cowextsize alignment like 4.9 does... and
thus the quota accounting is off.

--D

> 
> This same end_fsb code has already been removed from
> xfs_reflink_reserve_cow() for similar reasons. It should probably be
> removed from xfs_reflink_allocate_cow() as well.
> 
> Brian
> 
> > > > +	resblks = XFS_DIOSTRAT_SPACE_RES(mp, end_fsb - *offset_fsb);
> > > > +
> > > > +	error = xfs_qm_dqattach(ip, 0);
> > > 
> > > This is already in the (only) caller.
> > 
> > Yes, it can be dropped, although a superflous xfs_qm_dqattach is
> > totally harmless anyway.
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] xfs: go straight to real allocations for direct I/O COW writes
  2016-12-07 19:46       ` Brian Foster
  2016-12-08  4:23         ` Darrick J. Wong
@ 2017-01-24  8:37         ` Christoph Hellwig
  2017-01-24 13:50           ` Brian Foster
  1 sibling, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2017-01-24  8:37 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs, darrick.wong

On Wed, Dec 07, 2016 at 02:46:34PM -0500, Brian Foster wrote:
> Only for end_fsb... xfs_bmap_btalloc() calls xfs_bmap_extsize_align()
> with the alignment, which rounds out the start and end offsets.

... and corrupts data in the direct I/O case.

The problem is that the down-alignment in xfs_bmap_extsize_align will
now create a real extent that spans before the extent that we have
to COW in this write_begin call.  But the area before might have been
a hole before the dio write that had just before been filled with
an allocation in the data fork.  And due to the direct I/O end_io
interface that only covers the range of the whole write we don't
know at that point where exactly the COW operation started and will
happily splice back our front pad into the data fork, replacing
the just written data with garbage.  xfs/228 and sometimes generic/199
reproduce this nicely.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] xfs: go straight to real allocations for direct I/O COW writes
  2017-01-24  8:37         ` Christoph Hellwig
@ 2017-01-24 13:50           ` Brian Foster
  2017-01-24 13:59             ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2017-01-24 13:50 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, darrick.wong

On Tue, Jan 24, 2017 at 09:37:32AM +0100, Christoph Hellwig wrote:
> On Wed, Dec 07, 2016 at 02:46:34PM -0500, Brian Foster wrote:
> > Only for end_fsb... xfs_bmap_btalloc() calls xfs_bmap_extsize_align()
> > with the alignment, which rounds out the start and end offsets.
> 
> ... and corrupts data in the direct I/O case.
> 
> The problem is that the down-alignment in xfs_bmap_extsize_align will
> now create a real extent that spans before the extent that we have
> to COW in this write_begin call.  But the area before might have been
> a hole before the dio write that had just before been filled with
> an allocation in the data fork.  And due to the direct I/O end_io
> interface that only covers the range of the whole write we don't
> know at that point where exactly the COW operation started and will
> happily splice back our front pad into the data fork, replacing
> the just written data with garbage.  xfs/228 and sometimes generic/199
> reproduce this nicely.

Is this reproducible on the current tree or only with this patch series?

Also, shouldn't the end_io handler only remap the range of the write,
regardless of whether the initial allocation ended up preallocating over
holes or purely a shared range?

Perhaps what you are saying here is that we have a single dio write that
spans wider than a shared data fork extent..? In that case, we iterate
the range in xfs_reflink_reserve_cow(). We'd skip the start of the range
that is a hole in the data fork, but as you say, the
xfs_bmapi_reserve_delalloc() call for the part of the I/O on the shared
extent can widen the COW fork allocation to before the extent in the
data fork, possibly to before the start of the I/O. (Thus we end up
allocating COW blocks over the hole anyways...).

>From there we are going to drop into iomap_dio_rw(), which looks like
it's going to check the COW fork for blocks and if found, write to those
blocks (as opposed to doing a data fork allocation). AFAICT, that means
the extent size hint shouldn't be a problem. What am I missing?

Brian

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] xfs: go straight to real allocations for direct I/O COW writes
  2017-01-24 13:50           ` Brian Foster
@ 2017-01-24 13:59             ` Christoph Hellwig
  2017-01-24 15:02               ` Brian Foster
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2017-01-24 13:59 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs, darrick.wong

On Tue, Jan 24, 2017 at 08:50:45AM -0500, Brian Foster wrote:
> Is this reproducible on the current tree or only with this patch series?

It's only reproducible with the series modified to your review comments.

> Also, shouldn't the end_io handler only remap the range of the write,
> regardless of whether the initial allocation ended up preallocating over
> holes or purely a shared range?

The end_io handler is caller for the whole size of the write.  That's
mostly because we don't have an object corresponding to a write_begin
call.

> Perhaps what you are saying here is that we have a single dio write that
> spans wider than a shared data fork extent..?

Yes.

> In that case, we iterate
> the range in xfs_reflink_reserve_cow(). We'd skip the start of the range
> that is a hole in the data fork, but as you say, the
> xfs_bmapi_reserve_delalloc() call for the part of the I/O on the shared
> extent can widen the COW fork allocation to before the extent in the
> data fork, possibly to before the start of the I/O. (Thus we end up
> allocating COW blocks over the hole anyways...).

The problem is the following.

We have a file with the following layout


HHHHHHHHHHHHDDDDDDDDDDDDDD

where H is hole and D is data.  The H to D boundary is not aligned
to the cowextsize.

The direct I/O code now does a first pass allocating an extent for
H and copies data to it.  Then in the next step it goes on to D
and unshares it.  It then enlarges the extent into the end of the
previously H range. It does however not copy data into H again,
as the iomap iterator is past it.  The ->end_io routine however
is called for the hole range, and will move the just allocated
rounding before H back into the data fork, replacing the valid data
writtent just before.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] xfs: go straight to real allocations for direct I/O COW writes
  2017-01-24 13:59             ` Christoph Hellwig
@ 2017-01-24 15:02               ` Brian Foster
  2017-01-24 15:09                 ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2017-01-24 15:02 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, darrick.wong

On Tue, Jan 24, 2017 at 02:59:37PM +0100, Christoph Hellwig wrote:
> On Tue, Jan 24, 2017 at 08:50:45AM -0500, Brian Foster wrote:
> > Is this reproducible on the current tree or only with this patch series?
> 
> It's only reproducible with the series modified to your review comments.
> 

Ok, well then I'm probably not going to be able to follow the details
well enough to try and provide constructive feedback without seeing the
code. Looking back, my comments were generally about the tradeoff of
bypassing the extent size hint mechanism that has been built into
reflink to avoid cow fragmention.

> > Also, shouldn't the end_io handler only remap the range of the write,
> > regardless of whether the initial allocation ended up preallocating over
> > holes or purely a shared range?
> 
> The end_io handler is caller for the whole size of the write.  That's
> mostly because we don't have an object corresponding to a write_begin
> call.
> 
> > Perhaps what you are saying here is that we have a single dio write that
> > spans wider than a shared data fork extent..?
> 
> Yes.
> 
> > In that case, we iterate
> > the range in xfs_reflink_reserve_cow(). We'd skip the start of the range
> > that is a hole in the data fork, but as you say, the
> > xfs_bmapi_reserve_delalloc() call for the part of the I/O on the shared
> > extent can widen the COW fork allocation to before the extent in the
> > data fork, possibly to before the start of the I/O. (Thus we end up
> > allocating COW blocks over the hole anyways...).
> 
> The problem is the following.
> 
> We have a file with the following layout
> 
> 
> HHHHHHHHHHHHDDDDDDDDDDDDDD
> 
> where H is hole and D is data.  The H to D boundary is not aligned
> to the cowextsize.
> 
> The direct I/O code now does a first pass allocating an extent for
> H and copies data to it.  Then in the next step it goes on to D
> and unshares it.  It then enlarges the extent into the end of the
> previously H range. It does however not copy data into H again,
> as the iomap iterator is past it.  The ->end_io routine however
> is called for the hole range, and will move the just allocated
> rounding before H back into the data fork, replacing the valid data
> writtent just before.
> 

Without seeing the code, perhaps we need to pull up the cow extent size
hint mechanism from the bmapi layer to something similar to how
xfs_iomap_direct_write() handles the traditional extent size hints..?
That may allow us to more intelligently consider the current state
across the data and cow forks in such cases (to not preallocate over
existing blocks, for example, without having to kill off the extent size
hint entirely).

Brian

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] xfs: go straight to real allocations for direct I/O COW writes
  2017-01-24 15:02               ` Brian Foster
@ 2017-01-24 15:09                 ` Christoph Hellwig
  2017-01-24 16:17                   ` Brian Foster
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2017-01-24 15:09 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs, darrick.wong

On Tue, Jan 24, 2017 at 10:02:22AM -0500, Brian Foster wrote:
> Ok, well then I'm probably not going to be able to follow the details
> well enough to try and provide constructive feedback without seeing the
> code. Looking back, my comments were generally about the tradeoff of
> bypassing the extent size hint mechanism that has been built into
> reflink to avoid cow fragmention.

Ok - really very little change here - only the call to the extsize
alignment in xfs_bmap_btalloc reinstated and the according reservation
changes in the caller.

Note that with the previously posted series there is no change in
handling the cowextsize hint for real on-disk allocations.  While
the current code rounds down based on the cowextsize for creating
the delayed extent we will never convert the alignment before the
write start to a real extent - it will just get cleaned up later
at inode eviction time or using the timer.

> Without seeing the code, perhaps we need to pull up the cow extent size
> hint mechanism from the bmapi layer to something similar to how
> xfs_iomap_direct_write() handles the traditional extent size hints..?
> That may allow us to more intelligently consider the current state
> across the data and cow forks in such cases (to not preallocate over
> existing blocks, for example, without having to kill off the extent size
> hint entirely).

We could.  On the other hand I'd love to get the current series in
first as the only thing it change in behavior is not allocating
additional delayed extent space that never gets used, and not writing
data twice if it's sub-blocksize, all of which seem like a clear
improvement.  And it's also the base for my pending DAX reflink support.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] xfs: go straight to real allocations for direct I/O COW writes
  2017-01-24 15:09                 ` Christoph Hellwig
@ 2017-01-24 16:17                   ` Brian Foster
  2017-01-24 16:21                     ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2017-01-24 16:17 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, darrick.wong

On Tue, Jan 24, 2017 at 04:09:59PM +0100, Christoph Hellwig wrote:
> On Tue, Jan 24, 2017 at 10:02:22AM -0500, Brian Foster wrote:
> > Ok, well then I'm probably not going to be able to follow the details
> > well enough to try and provide constructive feedback without seeing the
> > code. Looking back, my comments were generally about the tradeoff of
> > bypassing the extent size hint mechanism that has been built into
> > reflink to avoid cow fragmention.
> 
> Ok - really very little change here - only the call to the extsize
> alignment in xfs_bmap_btalloc reinstated and the according reservation
> changes in the caller.
> 
> Note that with the previously posted series there is no change in
> handling the cowextsize hint for real on-disk allocations.  While
> the current code rounds down based on the cowextsize for creating
> the delayed extent we will never convert the alignment before the
> write start to a real extent - it will just get cleaned up later
> at inode eviction time or using the timer.
> 

I thought that while not necessarily guaranteed, generally the entire
extent gets converted from delalloc to real blocks. IIRC, that's what
I've seen in the past when looking into the cow fork with bmap. After
all, isn't that the point of the extent size hint? Allocate wider than
the write to accommodate potential subsequent writes into a more
contiguous range.

> > Without seeing the code, perhaps we need to pull up the cow extent size
> > hint mechanism from the bmapi layer to something similar to how
> > xfs_iomap_direct_write() handles the traditional extent size hints..?
> > That may allow us to more intelligently consider the current state
> > across the data and cow forks in such cases (to not preallocate over
> > existing blocks, for example, without having to kill off the extent size
> > hint entirely).
> 
> We could.  On the other hand I'd love to get the current series in
> first as the only thing it change in behavior is not allocating
> additional delayed extent space that never gets used, and not writing
> data twice if it's sub-blocksize, all of which seem like a clear
> improvement.  And it's also the base for my pending DAX reflink support.

Fair enough, that's a different discussion. Personally, I'd prefer to
fix up the stuff we know that needs fixing before piling more stuff on
top. Particularly since something like cow I/O to a reflinked vm image
file (taking full advantage of cowextszhint) might be a more common use
case than DAX reflink at the moment.

All of this stuff is "experimental," however, so I don't feel strongly
about the order so long as we agree the regression can/should be fixed
up. So I'll defer to the maintainer on that one. ;)

Brian

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] xfs: go straight to real allocations for direct I/O COW writes
  2017-01-24 16:17                   ` Brian Foster
@ 2017-01-24 16:21                     ` Christoph Hellwig
  2017-01-24 17:43                       ` Brian Foster
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2017-01-24 16:21 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs, darrick.wong

On Tue, Jan 24, 2017 at 11:17:20AM -0500, Brian Foster wrote:
> I thought that while not necessarily guaranteed, generally the entire
> extent gets converted from delalloc to real blocks.

For buffered I/O that's the case.  See the discussion on my recent
xfs_bmapi_write patch.

> IIRC, that's what
> I've seen in the past when looking into the cow fork with bmap. After
> all, isn't that the point of the extent size hint? Allocate wider than
> the write to accommodate potential subsequent writes into a more
> contiguous range.

Well, for direct I/O that's not what the current code does.  Implementing
it might be useful, but I'm not sure how much the front alignment is
going to help in usual worksloads - you'd need a backwards write or
random writes that happen to look almost backwards for it to make
a difference.  I suspect most of them time we'd just allocate blocks to
reclaim them again a little later.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] xfs: go straight to real allocations for direct I/O COW writes
  2017-01-24 16:21                     ` Christoph Hellwig
@ 2017-01-24 17:43                       ` Brian Foster
  2017-01-24 20:08                         ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2017-01-24 17:43 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, darrick.wong

On Tue, Jan 24, 2017 at 05:21:56PM +0100, Christoph Hellwig wrote:
> On Tue, Jan 24, 2017 at 11:17:20AM -0500, Brian Foster wrote:
> > I thought that while not necessarily guaranteed, generally the entire
> > extent gets converted from delalloc to real blocks.
> 
> For buffered I/O that's the case.  See the discussion on my recent
> xfs_bmapi_write patch.
> 

Yeah, that's the behavior I would have expected...

> > IIRC, that's what
> > I've seen in the past when looking into the cow fork with bmap. After
> > all, isn't that the point of the extent size hint? Allocate wider than
> > the write to accommodate potential subsequent writes into a more
> > contiguous range.
> 
> Well, for direct I/O that's not what the current code does.  Implementing
> it might be useful, but I'm not sure how much the front alignment is
> going to help in usual worksloads - you'd need a backwards write or
> random writes that happen to look almost backwards for it to make
> a difference.  I suspect most of them time we'd just allocate blocks to
> reclaim them again a little later.

Hmm, that's not what I'm seeing (not that it really matters, but I'm
curious if I'm missing something):

# xfs_io -d -c "bmap -v" -c "pwrite 8k 4k" -c "bmap -v" -c "bmap -cv" ./file
./file:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
   0: [0..15]:         80..95            0 (80..95)            16 100000
   1: [16..2047]:      160..2191         0 (160..2191)       2032 100000
wrote 4096/4096 bytes at offset 8192
4 KiB, 1 ops; 0.0000 sec (12.440 MiB/sec and 3184.7134 ops/sec)
./file:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
   0: [0..15]:         80..95            0 (80..95)            16 100000
   1: [16..23]:        2208..2215        0 (2208..2215)         8
   2: [24..2047]:      168..2191         0 (168..2191)       2024 100000
./file:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
   0: [0..15]:         2192..2207        0 (2192..2207)        16
   1: [16..23]:        hole                                     8
   2: [24..255]:       2216..2447        0 (2216..2447)       232
   3: [256..2047]:     hole                                  1792

That's on a fairly recent for-next on a fully reflinked file. It looks
to me that the cow fork extent is fully allocated immediately
(otherwise, I'm not sure what else would have converted it).

Brian

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] xfs: go straight to real allocations for direct I/O COW writes
  2017-01-24 17:43                       ` Brian Foster
@ 2017-01-24 20:08                         ` Christoph Hellwig
  2017-01-24 20:10                           ` Christoph Hellwig
  2017-01-25  0:09                           ` Darrick J. Wong
  0 siblings, 2 replies; 23+ messages in thread
From: Christoph Hellwig @ 2017-01-24 20:08 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs, darrick.wong

On Tue, Jan 24, 2017 at 12:43:18PM -0500, Brian Foster wrote:
> Hmm, that's not what I'm seeing (not that it really matters, but I'm
> curious if I'm missing something):

Yeah, I can reproduce this on mainline.  Turns out the it was done
by the align call in xfs_bmap_btalloc that even my before run had
removed.

Took me some time to spin my head around this.

Btw, I think we have a nasty race in the current DIO code that might
expose stale data, but this is just the same kind of head spinning
exercise for now:

Thread 1 writes a range from B to c

                    B --------- C

a little later thread 2 writes from A to B

        A --------- B

but the code preallocates beyond B into the range where thread
1 has just written, but ->end_io hasn't been called yet.
But once ->end_io is called thread 2 has already allocated
up to the extent size hint into the write range of thread 1,
so the end_io handler will splice the unintialized blocks from
that preallocation back into the file right after B.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] xfs: go straight to real allocations for direct I/O COW writes
  2017-01-24 20:08                         ` Christoph Hellwig
@ 2017-01-24 20:10                           ` Christoph Hellwig
  2017-01-25  0:09                           ` Darrick J. Wong
  1 sibling, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2017-01-24 20:10 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs, darrick.wong

On Tue, Jan 24, 2017 at 09:08:55PM +0100, Christoph Hellwig wrote:
> On Tue, Jan 24, 2017 at 12:43:18PM -0500, Brian Foster wrote:
> > Hmm, that's not what I'm seeing (not that it really matters, but I'm
> > curious if I'm missing something):
> 
> Yeah, I can reproduce this on mainline.  Turns out the it was done

s/can/can't/, sorry.  It's getting late for the day.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] xfs: go straight to real allocations for direct I/O COW writes
  2017-01-24 20:08                         ` Christoph Hellwig
  2017-01-24 20:10                           ` Christoph Hellwig
@ 2017-01-25  0:09                           ` Darrick J. Wong
  2017-01-27 17:44                             ` Darrick J. Wong
  1 sibling, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2017-01-25  0:09 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Brian Foster, linux-xfs

On Tue, Jan 24, 2017 at 09:08:55PM +0100, Christoph Hellwig wrote:
> On Tue, Jan 24, 2017 at 12:43:18PM -0500, Brian Foster wrote:
> > Hmm, that's not what I'm seeing (not that it really matters, but I'm
> > curious if I'm missing something):
> 
> Yeah, I can reproduce this on mainline.  Turns out the it was done
> by the align call in xfs_bmap_btalloc that even my before run had
> removed.
> 
> Took me some time to spin my head around this.
> 
> Btw, I think we have a nasty race in the current DIO code that might
> expose stale data, but this is just the same kind of head spinning
> exercise for now:
> 
> Thread 1 writes a range from B to c
> 
>                     B --------- C

          A --------- B --------- C
               ^           ^
               d           e

I'm assuming B-C has no shared blocks, d-B has shared blocks, and that
both d & e are cowextsize boundaries.

> a little later thread 2 writes from A to B
> 
>         A --------- B
> but the code preallocates beyond B into the range where thread
> 1 has just written, but ->end_io hasn't been called yet.
> But once ->end_io is called thread 2 has already allocated
> up to the extent size hint into the write range of thread 1,
> so the end_io handler will splice the unintialized blocks from
> that preallocation back into the file right after B.

I think you're right about there being a dio race here.  I'm not sure
what the solution here is -- certainly we could disregard the cowextsize
hint, though that has a fragmentation cost that we already know about.

We could also change the dio write setup to extend the range that it
checks for shared blocks up and down to the nearest cowextsize boundary
and set up the delalloc reservations in the cow fork as necessary.  If
our thread2 comes along then it'll find the reservations already set
up for a cow so that we avoid the situation where B-C changes between
iomap_begin and dio_write_end_io does the wrong thing.  That's more in
the spirit of cowextsize since we'd promote future writes to CoW.
However it's more wasteful of blocks since we have no idea if those
future writes are ever actually going to happen, and it also adds more
bmap records if we don't use all of the reservation.

Ugh, my head hurts, I'm going to go for a walk to ponder this some more,
and to try to work out whether the buffered path is all right.

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] xfs: go straight to real allocations for direct I/O COW writes
  2017-01-25  0:09                           ` Darrick J. Wong
@ 2017-01-27 17:44                             ` Darrick J. Wong
  2017-01-27 17:48                               ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2017-01-27 17:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Brian Foster, linux-xfs

On Tue, Jan 24, 2017 at 04:09:34PM -0800, Darrick J. Wong wrote:
> On Tue, Jan 24, 2017 at 09:08:55PM +0100, Christoph Hellwig wrote:
> > On Tue, Jan 24, 2017 at 12:43:18PM -0500, Brian Foster wrote:
> > > Hmm, that's not what I'm seeing (not that it really matters, but I'm
> > > curious if I'm missing something):
> > 
> > Yeah, I can reproduce this on mainline.  Turns out the it was done
> > by the align call in xfs_bmap_btalloc that even my before run had
> > removed.
> > 
> > Took me some time to spin my head around this.
> > 
> > Btw, I think we have a nasty race in the current DIO code that might
> > expose stale data, but this is just the same kind of head spinning
> > exercise for now:
> > 
> > Thread 1 writes a range from B to c
> > 
> >                     B --------- C
> 
>           A --------- B --------- C
>                ^           ^
>                d           e
> 
> I'm assuming B-C has no shared blocks, d-B has shared blocks, and that
> both d & e are cowextsize boundaries.
> 
> > a little later thread 2 writes from A to B
> > 
> >         A --------- B
> > but the code preallocates beyond B into the range where thread
> > 1 has just written, but ->end_io hasn't been called yet.
> > But once ->end_io is called thread 2 has already allocated
> > up to the extent size hint into the write range of thread 1,
> > so the end_io handler will splice the unintialized blocks from
> > that preallocation back into the file right after B.
> 
> I think you're right about there being a dio race here.  I'm not sure
> what the solution here is -- certainly we could disregard the cowextsize
> hint, though that has a fragmentation cost that we already know about.
> 
> We could also change the dio write setup to extend the range that it
> checks for shared blocks up and down to the nearest cowextsize boundary
> and set up the delalloc reservations in the cow fork as necessary.  If
> our thread2 comes along then it'll find the reservations already set
> up for a cow so that we avoid the situation where B-C changes between
> iomap_begin and dio_write_end_io does the wrong thing.  That's more in
> the spirit of cowextsize since we'd promote future writes to CoW.
> However it's more wasteful of blocks since we have no idea if those
> future writes are ever actually going to happen, and it also adds more
> bmap records if we don't use all of the reservation.
> 
> Ugh, my head hurts, I'm going to go for a walk to ponder this some more,
> and to try to work out whether the buffered path is all right.

I think I came up with a better idea overnight: take advantage of the
unwritten bit.  Currently, all extents in the cow fork are either
delalloc or normal extents.  When we allocate the blocks to fill a cow
extent, we'll flag the extent as unwritten.  When we go to write the
data to disk, we convert the unwritten extent to a real extent, and the
post-write remap (since it only takes a file offset and length) will
be changed to remap only real, written extents into the data fork.

--D

> 
> --D
> 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] xfs: go straight to real allocations for direct I/O COW writes
  2017-01-27 17:44                             ` Darrick J. Wong
@ 2017-01-27 17:48                               ` Christoph Hellwig
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2017-01-27 17:48 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, Brian Foster, linux-xfs

On Fri, Jan 27, 2017 at 09:44:47AM -0800, Darrick J. Wong wrote:
> I think I came up with a better idea overnight: take advantage of the
> unwritten bit.  Currently, all extents in the cow fork are either
> delalloc or normal extents.  When we allocate the blocks to fill a cow
> extent, we'll flag the extent as unwritten.  When we go to write the
> data to disk, we convert the unwritten extent to a real extent, and the
> post-write remap (since it only takes a file offset and length) will
> be changed to remap only real, written extents into the data fork.

That sounds like a possibility.  My other idea was to force COW all
direct I/O writes, that way we can't have a mismatch between extents
in the data and the COW fork.  But so far this is just an idea..

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2017-01-27 17:49 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-05 21:05 reflink COW improvements Christoph Hellwig
2016-12-05 21:05 ` [PATCH 1/3] xfs: reject all unaligned direct writes to reflinked files Christoph Hellwig
2016-12-07 18:59   ` Brian Foster
2016-12-05 21:05 ` [PATCH 2/3] xfs: go straight to real allocations for direct I/O COW writes Christoph Hellwig
2016-12-07 19:00   ` Brian Foster
2016-12-07 19:37     ` Christoph Hellwig
2016-12-07 19:46       ` Brian Foster
2016-12-08  4:23         ` Darrick J. Wong
2017-01-24  8:37         ` Christoph Hellwig
2017-01-24 13:50           ` Brian Foster
2017-01-24 13:59             ` Christoph Hellwig
2017-01-24 15:02               ` Brian Foster
2017-01-24 15:09                 ` Christoph Hellwig
2017-01-24 16:17                   ` Brian Foster
2017-01-24 16:21                     ` Christoph Hellwig
2017-01-24 17:43                       ` Brian Foster
2017-01-24 20:08                         ` Christoph Hellwig
2017-01-24 20:10                           ` Christoph Hellwig
2017-01-25  0:09                           ` Darrick J. Wong
2017-01-27 17:44                             ` Darrick J. Wong
2017-01-27 17:48                               ` Christoph Hellwig
2016-12-05 21:05 ` [PATCH 3/3] xfs: allocate direct I/O COW blocks in iomap_begin Christoph Hellwig
2016-12-06  2:09 ` reflink COW improvements Darrick J. Wong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.