All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/7] xfs: random fixes
@ 2017-01-31  0:23 Darrick J. Wong
  2017-01-31  0:23 ` [PATCH 1/7] xfs: fix toctou race when locking an inode to access the data map Darrick J. Wong
                   ` (6 more replies)
  0 siblings, 7 replies; 31+ messages in thread
From: Darrick J. Wong @ 2017-01-31  0:23 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

Hi all,

Here are some fixes for various hangs and crashes I've seen while
running the field-fuzzing xfstests.  I'm aiming for 4.11 for these.
So far they don't seem to cause any auto group regressions.

--Darrick

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/7] xfs: fix toctou race when locking an inode to access the data map
  2017-01-31  0:23 [PATCH 0/7] xfs: random fixes Darrick J. Wong
@ 2017-01-31  0:23 ` Darrick J. Wong
  2017-01-31  3:01   ` Eric Sandeen
                     ` (2 more replies)
  2017-01-31  0:23 ` [PATCH 2/7] xfs: fail _dir_open when readahead fails Darrick J. Wong
                   ` (5 subsequent siblings)
  6 siblings, 3 replies; 31+ messages in thread
From: Darrick J. Wong @ 2017-01-31  0:23 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

We use di_format and if_flags to decide whether we're grabbing the ilock
in btree mode (btree extents not loaded) or shared mode (anything else),
but the state of those fields can be changed by other threads that are
also trying to load the btree extents -- IFEXTENTS gets set before the
_bmap_read_extents call and cleared if it fails.  Therefore, once we've
grabbed the shared ilock we have to re-check the fields to see if we
actually need to upgrade to the exclusive ilock in order to try loading
the extents.

Without this patch, we trigger ilock assert failures when a bunch of
threads try to access a btree format directory with a corrupt bmbt root
and corrupt the incore data structures, leading to a crash.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_inode.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)


diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index de32f0f..7d7206c 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -125,6 +125,18 @@ xfs_ilock_data_map_shared(
 	    (ip->i_df.if_flags & XFS_IFEXTENTS) == 0)
 		lock_mode = XFS_ILOCK_EXCL;
 	xfs_ilock(ip, lock_mode);
+	/*
+	 * We can change if_flags under ilock if we try to read the
+	 * extents and fail.  Since we hadn't grabbed the ilock at check
+	 * time, we have to re-check and upgrade the lock now.
+	 */
+	if (lock_mode == XFS_ILOCK_SHARED &&
+	    ip->i_d.di_format == XFS_DINODE_FMT_BTREE &&
+	    (ip->i_df.if_flags & XFS_IFEXTENTS) == 0) {
+		xfs_iunlock(ip, lock_mode);
+		lock_mode = XFS_ILOCK_EXCL;
+		xfs_ilock(ip, lock_mode);
+	}
 	return lock_mode;
 }
 


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 2/7] xfs: fail _dir_open when readahead fails
  2017-01-31  0:23 [PATCH 0/7] xfs: random fixes Darrick J. Wong
  2017-01-31  0:23 ` [PATCH 1/7] xfs: fix toctou race when locking an inode to access the data map Darrick J. Wong
@ 2017-01-31  0:23 ` Darrick J. Wong
  2017-01-31  4:12   ` Eric Sandeen
  2017-01-31 13:29   ` Christoph Hellwig
  2017-01-31  0:23 ` [PATCH 3/7] xfs: filter out obviously bad btree pointers Darrick J. Wong
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 31+ messages in thread
From: Darrick J. Wong @ 2017-01-31  0:23 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

When we open a directory, we try to readahead block 0 of the directory
on the assumption that we're going to need it soon.  If the bmbt is
corrupt, the directory will never be usable and the readahead fails
immediately, so we might as well prevent the directory from being opened
at all.  This prevents a subsequent read or modify operation from
hitting it and taking the fs offline.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_da_btree.c |    6 ++----
 fs/xfs/libxfs/xfs_da_btree.h |    2 +-
 fs/xfs/xfs_file.c            |    4 ++--
 3 files changed, 5 insertions(+), 7 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index f2dc1a9..1bdf288 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -2633,7 +2633,7 @@ xfs_da_read_buf(
 /*
  * Readahead the dir/attr block.
  */
-xfs_daddr_t
+int
 xfs_da_reada_buf(
 	struct xfs_inode	*dp,
 	xfs_dablk_t		bno,
@@ -2664,7 +2664,5 @@ xfs_da_reada_buf(
 	if (mapp != &map)
 		kmem_free(mapp);
 
-	if (error)
-		return -1;
-	return mappedbno;
+	return error;
 }
diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
index 98c75cb..4e29cb6 100644
--- a/fs/xfs/libxfs/xfs_da_btree.h
+++ b/fs/xfs/libxfs/xfs_da_btree.h
@@ -201,7 +201,7 @@ int	xfs_da_read_buf(struct xfs_trans *trans, struct xfs_inode *dp,
 			       xfs_dablk_t bno, xfs_daddr_t mappedbno,
 			       struct xfs_buf **bpp, int whichfork,
 			       const struct xfs_buf_ops *ops);
-xfs_daddr_t	xfs_da_reada_buf(struct xfs_inode *dp, xfs_dablk_t bno,
+int	xfs_da_reada_buf(struct xfs_inode *dp, xfs_dablk_t bno,
 				xfs_daddr_t mapped_bno, int whichfork,
 				const struct xfs_buf_ops *ops);
 int	xfs_da_shrink_inode(xfs_da_args_t *args, xfs_dablk_t dead_blkno,
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index bbb9eb6..4c87e60f 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -908,9 +908,9 @@ xfs_dir_open(
 	 */
 	mode = xfs_ilock_data_map_shared(ip);
 	if (ip->i_d.di_nextents > 0)
-		xfs_dir3_data_readahead(ip, 0, -1);
+		error = xfs_dir3_data_readahead(ip, 0, -1);
 	xfs_iunlock(ip, mode);
-	return 0;
+	return error;
 }
 
 STATIC int


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 3/7] xfs: filter out obviously bad btree pointers
  2017-01-31  0:23 [PATCH 0/7] xfs: random fixes Darrick J. Wong
  2017-01-31  0:23 ` [PATCH 1/7] xfs: fix toctou race when locking an inode to access the data map Darrick J. Wong
  2017-01-31  0:23 ` [PATCH 2/7] xfs: fail _dir_open when readahead fails Darrick J. Wong
@ 2017-01-31  0:23 ` Darrick J. Wong
  2017-01-31  4:39   ` Eric Sandeen
  2017-01-31  0:23 ` [PATCH 4/7] xfs: check for obviously bad level values in the bmbt root Darrick J. Wong
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Darrick J. Wong @ 2017-01-31  0:23 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Don't let anybody load an obviously bad btree pointer.  Since the values
come from disk, we must return an error, not just ASSERT.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c  |    4 +---
 fs/xfs/libxfs/xfs_btree.c |    3 ++-
 fs/xfs/libxfs/xfs_btree.h |    2 +-
 3 files changed, 4 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index bfc00de..443cb10 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -1291,9 +1291,7 @@ xfs_bmap_read_extents(
 	ASSERT(level > 0);
 	pp = XFS_BMAP_BROOT_PTR_ADDR(mp, block, 1, ifp->if_broot_bytes);
 	bno = be64_to_cpu(*pp);
-	ASSERT(bno != NULLFSBLOCK);
-	ASSERT(XFS_FSB_TO_AGNO(mp, bno) < mp->m_sb.sb_agcount);
-	ASSERT(XFS_FSB_TO_AGBNO(mp, bno) < mp->m_sb.sb_agblocks);
+
 	/*
 	 * Go down the tree until leaf level is reached, following the first
 	 * pointer (leftmost) at each level.
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 21e6a6a..2849d3f 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -810,7 +810,8 @@ xfs_btree_read_bufl(
 	xfs_daddr_t		d;		/* real disk block address */
 	int			error;
 
-	ASSERT(fsbno != NULLFSBLOCK);
+	if (!XFS_FSB_SANITY_CHECK(mp, fsbno))
+		return -EFSCORRUPTED;
 	d = XFS_FSB_TO_DADDR(mp, fsbno);
 	error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, d,
 				   mp->m_bsize, lock, &bp, ops);
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index b69b947..33a8f86 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -456,7 +456,7 @@ static inline int xfs_btree_get_level(struct xfs_btree_block *block)
 #define	XFS_FILBLKS_MAX(a,b)	max_t(xfs_filblks_t, (a), (b))
 
 #define	XFS_FSB_SANITY_CHECK(mp,fsb)	\
-	(XFS_FSB_TO_AGNO(mp, fsb) < mp->m_sb.sb_agcount && \
+	(fsb && XFS_FSB_TO_AGNO(mp, fsb) < mp->m_sb.sb_agcount && \
 		XFS_FSB_TO_AGBNO(mp, fsb) < mp->m_sb.sb_agblocks)
 
 /*


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 4/7] xfs: check for obviously bad level values in the bmbt root
  2017-01-31  0:23 [PATCH 0/7] xfs: random fixes Darrick J. Wong
                   ` (2 preceding siblings ...)
  2017-01-31  0:23 ` [PATCH 3/7] xfs: filter out obviously bad btree pointers Darrick J. Wong
@ 2017-01-31  0:23 ` Darrick J. Wong
  2017-01-31 13:31   ` Christoph Hellwig
  2017-01-31  0:23 ` [PATCH 5/7] xfs: verify free block header fields Darrick J. Wong
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Darrick J. Wong @ 2017-01-31  0:23 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

We can't handle a bmbt that's taller than BTREE_MAXLEVELS, and there's
no such thing as a zero-level bmbt (for that we have extents format),
so if we see this, send back an error code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_inode_fork.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index 222e103..84b3e51 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -26,6 +26,7 @@
 #include "xfs_inode.h"
 #include "xfs_trans.h"
 #include "xfs_inode_item.h"
+#include "xfs_btree.h"
 #include "xfs_bmap_btree.h"
 #include "xfs_bmap.h"
 #include "xfs_error.h"
@@ -429,11 +430,13 @@ xfs_iformat_btree(
 	/* REFERENCED */
 	int			nrecs;
 	int			size;
+	int			level;
 
 	ifp = XFS_IFORK_PTR(ip, whichfork);
 	dfp = (xfs_bmdr_block_t *)XFS_DFORK_PTR(dip, whichfork);
 	size = XFS_BMAP_BROOT_SPACE(mp, dfp);
 	nrecs = be16_to_cpu(dfp->bb_numrecs);
+	level = be16_to_cpu(dfp->bb_level);
 
 	/*
 	 * blow out if -- fork has less extents than can fit in
@@ -446,7 +449,8 @@ xfs_iformat_btree(
 					XFS_IFORK_MAXEXT(ip, whichfork) ||
 		     XFS_BMDR_SPACE_CALC(nrecs) >
 					XFS_DFORK_SIZE(dip, mp, whichfork) ||
-		     XFS_IFORK_NEXTENTS(ip, whichfork) > ip->i_d.di_nblocks)) {
+		     XFS_IFORK_NEXTENTS(ip, whichfork) > ip->i_d.di_nblocks) ||
+		     level == 0 || level > XFS_BTREE_MAXLEVELS) {
 		xfs_warn(mp, "corrupt inode %Lu (btree).",
 					(unsigned long long) ip->i_ino);
 		XFS_CORRUPTION_ERROR("xfs_iformat_btree", XFS_ERRLEVEL_LOW,


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 5/7] xfs: verify free block header fields
  2017-01-31  0:23 [PATCH 0/7] xfs: random fixes Darrick J. Wong
                   ` (3 preceding siblings ...)
  2017-01-31  0:23 ` [PATCH 4/7] xfs: check for obviously bad level values in the bmbt root Darrick J. Wong
@ 2017-01-31  0:23 ` Darrick J. Wong
  2017-01-31 13:42   ` Christoph Hellwig
  2017-01-31  0:23 ` [PATCH 6/7] xfs: allow unwritten extents in the CoW fork Darrick J. Wong
  2017-01-31  0:23 ` [PATCH 7/7] xfs: mark speculative prealloc CoW fork extents unwritten Darrick J. Wong
  6 siblings, 1 reply; 31+ messages in thread
From: Darrick J. Wong @ 2017-01-31  0:23 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Perform basic sanity checking of the directory free block header
fields so that we avoid hanging the system on invalid data.

(Granted that just means that now we shutdown on directory write,
but that seems better than hanging...)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_dir2_node.c |   51 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 49 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c
index 75a5574..fd17228 100644
--- a/fs/xfs/libxfs/xfs_dir2_node.c
+++ b/fs/xfs/libxfs/xfs_dir2_node.c
@@ -155,6 +155,42 @@ const struct xfs_buf_ops xfs_dir3_free_buf_ops = {
 	.verify_write = xfs_dir3_free_write_verify,
 };
 
+/* Everything ok in the free block header? */
+static bool
+xfs_dir3_free_header_check(
+	struct xfs_inode	*dp,
+	xfs_dablk_t		fbno,
+	struct xfs_buf		*bp)
+{
+	struct xfs_mount	*mp = dp->i_mount;
+	unsigned int		firstdb;
+	int			maxbests;
+
+	maxbests = dp->d_ops->free_max_bests(mp->m_dir_geo);
+	firstdb = (xfs_dir2_da_to_db(mp->m_dir_geo, fbno) -
+		   xfs_dir2_byte_to_db(mp->m_dir_geo, XFS_DIR2_FREE_OFFSET)) *
+			maxbests;
+	if (xfs_sb_version_hascrc(&mp->m_sb)) {
+		struct xfs_dir3_free_hdr *hdr3 = bp->b_addr;
+
+		if (be32_to_cpu(hdr3->firstdb) != firstdb)
+			return false;
+		if (be32_to_cpu(hdr3->nvalid) > maxbests)
+			return false;
+		if (be32_to_cpu(hdr3->nvalid) < be32_to_cpu(hdr3->nused))
+			return false;
+	} else {
+		struct xfs_dir2_free_hdr *hdr = bp->b_addr;
+
+		if (be32_to_cpu(hdr->firstdb) != firstdb)
+			return false;
+		if (be32_to_cpu(hdr->nvalid) > maxbests)
+			return false;
+		if (be32_to_cpu(hdr->nvalid) < be32_to_cpu(hdr->nused))
+			return false;
+	}
+	return true;
+}
 
 static int
 __xfs_dir3_free_read(
@@ -168,11 +204,22 @@ __xfs_dir3_free_read(
 
 	err = xfs_da_read_buf(tp, dp, fbno, mappedbno, bpp,
 				XFS_DATA_FORK, &xfs_dir3_free_buf_ops);
+	if (err || !(*bpp))
+		return err;
+
+	/* Check things that we can't do in the verifier. */
+	if (!xfs_dir3_free_header_check(dp, fbno, *bpp)) {
+		xfs_buf_ioerror(*bpp, -EFSCORRUPTED);
+		xfs_verifier_error(*bpp);
+		xfs_trans_brelse(tp, *bpp);
+		return -EFSCORRUPTED;
+	}
 
 	/* try read returns without an error or *bpp if it lands in a hole */
-	if (!err && tp && *bpp)
+	if (tp)
 		xfs_trans_buf_set_type(tp, *bpp, XFS_BLFT_DIR_FREE_BUF);
-	return err;
+
+	return 0;
 }
 
 int


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 6/7] xfs: allow unwritten extents in the CoW fork
  2017-01-31  0:23 [PATCH 0/7] xfs: random fixes Darrick J. Wong
                   ` (4 preceding siblings ...)
  2017-01-31  0:23 ` [PATCH 5/7] xfs: verify free block header fields Darrick J. Wong
@ 2017-01-31  0:23 ` Darrick J. Wong
  2017-02-01  2:35   ` [PATCH v2 " Darrick J. Wong
  2017-01-31  0:23 ` [PATCH 7/7] xfs: mark speculative prealloc CoW fork extents unwritten Darrick J. Wong
  6 siblings, 1 reply; 31+ messages in thread
From: Darrick J. Wong @ 2017-01-31  0:23 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

In the data fork, we only allow extents to perform the following state
transitions:

delay -> real <-> unwritten

There's no way to move directly from a delalloc reservation to an
/unwritten/ allocated extent.  However, for the CoW fork we want to be
able to do the following to each extent:

delalloc -> unwritten -> written -> remapped to data fork

This will help us to avoid a race in the speculative CoW preallocation
code between a first thread that is allocating a CoW extent and a second
thread that is remapping part of a file after a write.  In order to do
this, however, we need two things: first, we have to be able to
transition from da to unwritten, and second the function that converts
between real and unwritten has to be made aware of the cow fork.  Do
both of those things.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c |   68 ++++++++++++++++++++++++++++------------------
 1 file changed, 41 insertions(+), 27 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 443cb10..7c14c11 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -1862,6 +1862,7 @@ xfs_bmap_add_extent_delay_real(
 		 */
 		trace_xfs_bmap_pre_update(bma->ip, bma->idx, state, _THIS_IP_);
 		xfs_bmbt_set_startblock(ep, new->br_startblock);
+		xfs_bmbt_set_state(ep, new->br_state);
 		trace_xfs_bmap_post_update(bma->ip, bma->idx, state, _THIS_IP_);
 
 		(*nextents)++;
@@ -2200,6 +2201,7 @@ STATIC int				/* error */
 xfs_bmap_add_extent_unwritten_real(
 	struct xfs_trans	*tp,
 	xfs_inode_t		*ip,	/* incore inode pointer */
+	int			whichfork,
 	xfs_extnum_t		*idx,	/* extent number to update/insert */
 	xfs_btree_cur_t		**curp,	/* if *curp is null, not a btree */
 	xfs_bmbt_irec_t		*new,	/* new data to add to file extents */
@@ -2224,7 +2226,9 @@ xfs_bmap_add_extent_unwritten_real(
 	*logflagsp = 0;
 
 	cur = *curp;
-	ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
+	ifp = XFS_IFORK_PTR(ip, whichfork);
+	if (whichfork == XFS_COW_FORK)
+		state |= BMAP_COWFORK;
 
 	ASSERT(*idx >= 0);
 	ASSERT(*idx <= xfs_iext_count(ifp));
@@ -2283,7 +2287,7 @@ xfs_bmap_add_extent_unwritten_real(
 	 * Don't set contiguous if the combined extent would be too large.
 	 * Also check for all-three-contiguous being too large.
 	 */
-	if (*idx < xfs_iext_count(&ip->i_df) - 1) {
+	if (*idx < xfs_iext_count(ifp) - 1) {
 		state |= BMAP_RIGHT_VALID;
 		xfs_bmbt_get_all(xfs_iext_get_ext(ifp, *idx + 1), &RIGHT);
 		if (isnullstartblock(RIGHT.br_startblock))
@@ -2323,7 +2327,8 @@ xfs_bmap_add_extent_unwritten_real(
 		trace_xfs_bmap_post_update(ip, *idx, state, _THIS_IP_);
 
 		xfs_iext_remove(ip, *idx + 1, 2, state);
-		ip->i_d.di_nextents -= 2;
+		XFS_IFORK_NEXT_SET(ip, whichfork,
+				XFS_IFORK_NEXTENTS(ip, whichfork) - 2);
 		if (cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -2366,7 +2371,8 @@ xfs_bmap_add_extent_unwritten_real(
 		trace_xfs_bmap_post_update(ip, *idx, state, _THIS_IP_);
 
 		xfs_iext_remove(ip, *idx + 1, 1, state);
-		ip->i_d.di_nextents--;
+		XFS_IFORK_NEXT_SET(ip, whichfork,
+				XFS_IFORK_NEXTENTS(ip, whichfork) - 1);
 		if (cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -2401,7 +2407,8 @@ xfs_bmap_add_extent_unwritten_real(
 		xfs_bmbt_set_state(ep, newext);
 		trace_xfs_bmap_post_update(ip, *idx, state, _THIS_IP_);
 		xfs_iext_remove(ip, *idx + 1, 1, state);
-		ip->i_d.di_nextents--;
+		XFS_IFORK_NEXT_SET(ip, whichfork,
+				XFS_IFORK_NEXTENTS(ip, whichfork) - 1);
 		if (cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -2513,7 +2520,8 @@ xfs_bmap_add_extent_unwritten_real(
 		trace_xfs_bmap_post_update(ip, *idx, state, _THIS_IP_);
 
 		xfs_iext_insert(ip, *idx, 1, new, state);
-		ip->i_d.di_nextents++;
+		XFS_IFORK_NEXT_SET(ip, whichfork,
+				XFS_IFORK_NEXTENTS(ip, whichfork) + 1);
 		if (cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -2591,7 +2599,8 @@ xfs_bmap_add_extent_unwritten_real(
 		++*idx;
 		xfs_iext_insert(ip, *idx, 1, new, state);
 
-		ip->i_d.di_nextents++;
+		XFS_IFORK_NEXT_SET(ip, whichfork,
+				XFS_IFORK_NEXTENTS(ip, whichfork) + 1);
 		if (cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -2639,7 +2648,8 @@ xfs_bmap_add_extent_unwritten_real(
 		++*idx;
 		xfs_iext_insert(ip, *idx, 2, &r[0], state);
 
-		ip->i_d.di_nextents += 2;
+		XFS_IFORK_NEXT_SET(ip, whichfork,
+				XFS_IFORK_NEXTENTS(ip, whichfork) + 2);
 		if (cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -2693,17 +2703,17 @@ xfs_bmap_add_extent_unwritten_real(
 	}
 
 	/* update reverse mappings */
-	error = xfs_rmap_convert_extent(mp, dfops, ip, XFS_DATA_FORK, new);
+	error = xfs_rmap_convert_extent(mp, dfops, ip, whichfork, new);
 	if (error)
 		goto done;
 
 	/* convert to a btree if necessary */
-	if (xfs_bmap_needs_btree(ip, XFS_DATA_FORK)) {
+	if (xfs_bmap_needs_btree(ip, whichfork)) {
 		int	tmp_logflags;	/* partial log flag return val */
 
 		ASSERT(cur == NULL);
 		error = xfs_bmap_extents_to_btree(tp, ip, first, dfops, &cur,
-				0, &tmp_logflags, XFS_DATA_FORK);
+				0, &tmp_logflags, whichfork);
 		*logflagsp |= tmp_logflags;
 		if (error)
 			goto done;
@@ -2715,7 +2725,7 @@ xfs_bmap_add_extent_unwritten_real(
 		*curp = cur;
 	}
 
-	xfs_bmap_check_leaf_extents(*curp, ip, XFS_DATA_FORK);
+	xfs_bmap_check_leaf_extents(*curp, ip, whichfork);
 done:
 	*logflagsp |= rval;
 	return error;
@@ -4366,10 +4376,16 @@ xfs_bmapi_allocate(
 	bma->got.br_state = XFS_EXT_NORM;
 
 	/*
-	 * A wasdelay extent has been initialized, so shouldn't be flagged
-	 * as unwritten.
+	 * In the data fork, a wasdelay extent has been initialized, so
+	 * shouldn't be flagged as unwritten.
+	 *
+	 * For the cow fork, however, we convert delalloc reservations
+	 * (extents allocated for speculative preallocation) to
+	 * allocated unwritten extents, and only convert the unwritten
+	 * extents to real extents when we're about to write the data.
 	 */
-	if (!bma->wasdel && (bma->flags & XFS_BMAPI_PREALLOC) &&
+	if ((!bma->wasdel || (bma->flags & XFS_BMAPI_COWFORK)) &&
+	    (bma->flags & XFS_BMAPI_PREALLOC) &&
 	    xfs_sb_version_hasextflgbit(&mp->m_sb))
 		bma->got.br_state = XFS_EXT_UNWRITTEN;
 
@@ -4420,8 +4436,6 @@ xfs_bmapi_convert_unwritten(
 			(XFS_BMAPI_PREALLOC | XFS_BMAPI_CONVERT))
 		return 0;
 
-	ASSERT(whichfork != XFS_COW_FORK);
-
 	/*
 	 * Modify (by adding) the state flag, if writing.
 	 */
@@ -4446,8 +4460,8 @@ xfs_bmapi_convert_unwritten(
 			return error;
 	}
 
-	error = xfs_bmap_add_extent_unwritten_real(bma->tp, bma->ip, &bma->idx,
-			&bma->cur, mval, bma->firstblock, bma->dfops,
+	error = xfs_bmap_add_extent_unwritten_real(bma->tp, bma->ip, whichfork,
+			&bma->idx, &bma->cur, mval, bma->firstblock, bma->dfops,
 			&tmp_logflags);
 	/*
 	 * Log the inode core unconditionally in the unwritten extent conversion
@@ -4538,8 +4552,6 @@ xfs_bmapi_write(
 	ASSERT(!(flags & XFS_BMAPI_REMAP) || whichfork == XFS_DATA_FORK);
 	ASSERT(!(flags & XFS_BMAPI_PREALLOC) || !(flags & XFS_BMAPI_REMAP));
 	ASSERT(!(flags & XFS_BMAPI_CONVERT) || !(flags & XFS_BMAPI_REMAP));
-	ASSERT(!(flags & XFS_BMAPI_PREALLOC) || whichfork != XFS_COW_FORK);
-	ASSERT(!(flags & XFS_BMAPI_CONVERT) || whichfork != XFS_COW_FORK);
 
 	/* zeroing is for currently only for data extents, not metadata */
 	ASSERT((flags & (XFS_BMAPI_METADATA | XFS_BMAPI_ZERO)) !=
@@ -5554,8 +5566,8 @@ __xfs_bunmapi(
 			}
 			del.br_state = XFS_EXT_UNWRITTEN;
 			error = xfs_bmap_add_extent_unwritten_real(tp, ip,
-					&lastx, &cur, &del, firstblock, dfops,
-					&logflags);
+					whichfork, &lastx, &cur, &del,
+					firstblock, dfops, &logflags);
 			if (error)
 				goto error0;
 			goto nodelete;
@@ -5608,8 +5620,9 @@ __xfs_bunmapi(
 				prev.br_state = XFS_EXT_UNWRITTEN;
 				lastx--;
 				error = xfs_bmap_add_extent_unwritten_real(tp,
-						ip, &lastx, &cur, &prev,
-						firstblock, dfops, &logflags);
+						ip, whichfork, &lastx, &cur,
+						&prev, firstblock, dfops,
+						&logflags);
 				if (error)
 					goto error0;
 				goto nodelete;
@@ -5617,8 +5630,9 @@ __xfs_bunmapi(
 				ASSERT(del.br_state == XFS_EXT_NORM);
 				del.br_state = XFS_EXT_UNWRITTEN;
 				error = xfs_bmap_add_extent_unwritten_real(tp,
-						ip, &lastx, &cur, &del,
-						firstblock, dfops, &logflags);
+						ip, whichfork, &lastx, &cur,
+						&del, firstblock, dfops,
+						&logflags);
 				if (error)
 					goto error0;
 				goto nodelete;


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 7/7] xfs: mark speculative prealloc CoW fork extents unwritten
  2017-01-31  0:23 [PATCH 0/7] xfs: random fixes Darrick J. Wong
                   ` (5 preceding siblings ...)
  2017-01-31  0:23 ` [PATCH 6/7] xfs: allow unwritten extents in the CoW fork Darrick J. Wong
@ 2017-01-31  0:23 ` Darrick J. Wong
  2017-01-31 13:41   ` Christoph Hellwig
  2017-02-01  2:36   ` [PATCH v2 " Darrick J. Wong
  6 siblings, 2 replies; 31+ messages in thread
From: Darrick J. Wong @ 2017-01-31  0:23 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Christoph Hellwig pointed out that there's a potentially nasty race when
performing simultaneous nearby directio cow writes:

"Thread 1 writes a range from B to c

"                    B --------- C
                           p

"a little later thread 2 writes from A to B

"        A --------- B
               p

[editor's note: the 'p' denote cowextsize boundaries, which I added to
make this more clear]

"but the code preallocates beyond B into the range where thread
"1 has just written, but ->end_io hasn't been called yet.
"But once ->end_io is called thread 2 has already allocated
"up to the extent size hint into the write range of thread 1,
"so the end_io handler will splice the unintialized blocks from
"that preallocation back into the file right after B."

We can avoid this race by ensuring that thread 1 cannot accidentally
remap the blocks that thread 2 allocated (as part of speculative
preallocation) as part of t2's write preparation in t1's end_io handler.
The way we make this happen is by taking advantage of the unwritten
extent flag as an intermediate step.

Recall that when we begin the process of writing data to shared blocks,
we create a delayed allocation extent in the CoW fork:

D: --RRRRRRSSSRRRRRRRR---
C: ------DDDDDDD---------

When a thread prepares to CoW some dirty data out to disk, it will now
convert the delalloc reservation into an /unwritten/ allocated extent in
the cow fork.  The da conversion code tries to opportunistically
allocate as much of a (speculatively prealloc'd) extent as possible, so
we may end up allocating a larger extent than we're actually writing
out:

D: --RRRRRRSSSRRRRRRRR---
U: ------UUUUUUU---------

Next, we convert only the part of the extent that we're actively
planning to write to normal (i.e. not unwritten) status:

D: --RRRRRRSSSRRRRRRRR---
U: ------UURRUUU---------

If the write succeeds, the end_cow function will now scan the relevant
range of the CoW fork for real extents and remap only the real extents
into the data fork:

D: --RRRRRRRRSRRRRRRRR---
U: ------UU--UUU---------

This ensures that we never obliterate valid data fork extents with
unwritten blocks from the CoW fork.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_aops.c    |    6 ++
 fs/xfs/xfs_iomap.c   |    2 -
 fs/xfs/xfs_reflink.c |  140 ++++++++++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_reflink.h |    2 +
 fs/xfs/xfs_trace.h   |    8 ++-
 5 files changed, 148 insertions(+), 10 deletions(-)


diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 631e7c0..1ff9df7 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -481,6 +481,12 @@ xfs_submit_ioend(
 	struct xfs_ioend	*ioend,
 	int			status)
 {
+	/* Convert CoW extents to regular */
+	if (!status && ioend->io_type == XFS_IO_COW) {
+		status = xfs_reflink_convert_cow(XFS_I(ioend->io_inode),
+				ioend->io_offset, ioend->io_size);
+	}
+
 	/* Reserve log space if we might write beyond the on-disk inode size. */
 	if (!status &&
 	    ioend->io_type != XFS_IO_UNWRITTEN &&
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 1aa3abd..767208f 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -685,7 +685,7 @@ xfs_iomap_write_allocate(
 	int		nres;
 
 	if (whichfork == XFS_COW_FORK)
-		flags |= XFS_BMAPI_COWFORK;
+		flags |= XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
 
 	/*
 	 * Make sure that the dquots are there.
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 07593a3..df042d7 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -82,11 +82,22 @@
  * mappings are a reservation against the free space in the filesystem;
  * adjacent mappings can also be combined into fewer larger mappings.
  *
+ * As an optimization, the CoW extent size hint (cowextsz) creates
+ * outsized aligned delalloc reservations in the hope of landing out of
+ * order nearby CoW writes in a single extent on disk, thereby reducing
+ * fragmentation and improving future performance.
+ *
+ * D: --RRRRRRSSSRRRRRRRR--- (data fork)
+ * C: ------DDDDDDD--------- (CoW fork)
+ *
  * When dirty pages are being written out (typically in writepage), the
- * delalloc reservations are converted into real mappings by allocating
- * blocks and replacing the delalloc mapping with real ones.  A delalloc
- * mapping can be replaced by several real ones if the free space is
- * fragmented.
+ * delalloc reservations are converted into unwritten mappings by
+ * allocating blocks and replacing the delalloc mapping with real ones.
+ * A delalloc mapping can be replaced by several unwritten ones if the
+ * free space is fragmented.
+ *
+ * D: --RRRRRRSSSRRRRRRRR---
+ * C: ------UUUUUUU---------
  *
  * We want to adapt the delalloc mechanism for copy-on-write, since the
  * write paths are similar.  The first two steps (creating the reservation
@@ -101,13 +112,29 @@
  * Block-aligned directio writes will use the same mechanism as buffered
  * writes.
  *
+ * Just prior to submitting the actual disk write requests, we convert
+ * the extents representing the range of the file actually being written
+ * (as opposed to extra pieces created for the cowextsize hint) to real
+ * extents.  This will become important in the next step:
+ *
+ * D: --RRRRRRSSSRRRRRRRR---
+ * C: ------UUrrUUU---------
+ *
  * CoW remapping must be done after the data block write completes,
  * because we don't want to destroy the old data fork map until we're sure
  * the new block has been written.  Since the new mappings are kept in a
  * separate fork, we can simply iterate these mappings to find the ones
  * that cover the file blocks that we just CoW'd.  For each extent, simply
  * unmap the corresponding range in the data fork, map the new range into
- * the data fork, and remove the extent from the CoW fork.
+ * the data fork, and remove the extent from the CoW fork.  Because of
+ * the presence of the cowextsize hint, however, we must be careful
+ * only to remap the blocks that we've actually written out --  we must
+ * never remap delalloc reservations nor CoW staging blocks that have
+ * yet to be written.  This corresponds exactly to the real extents in
+ * the CoW fork:
+ *
+ * D: --RRRRRRrrSRRRRRRRR---
+ * C: ------UU--UUU---------
  *
  * Since the remapping operation can be applied to an arbitrary file
  * range, we record the need for the remap step as a flag in the ioend
@@ -296,6 +323,89 @@ xfs_reflink_reserve_cow(
 	return 0;
 }
 
+/* Convert part of an unwritten CoW extent to a real one. */
+STATIC int
+__xfs_reflink_convert_cow(
+	struct xfs_trans		*tp,
+	struct xfs_inode		*ip,
+	struct xfs_bmbt_irec		*imap,
+	xfs_fileoff_t			offset_fsb,
+	xfs_filblks_t			count_fsb,
+	struct xfs_defer_ops		*dfops)
+{
+	struct xfs_bmbt_irec		irec = *imap;
+	xfs_fsblock_t			first_block;
+	int				nimaps = 1;
+
+	xfs_trim_extent(&irec, offset_fsb, count_fsb);
+	trace_xfs_reflink_convert_cow(ip, &irec);
+	if (irec.br_blockcount == 0)
+		return 0;
+	return xfs_bmapi_write(tp, ip, irec.br_startoff, irec.br_blockcount,
+			XFS_BMAPI_COWFORK | XFS_BMAPI_CONVERT, &first_block,
+			0, &irec, &nimaps, dfops);
+}
+
+/* Convert all of the unwritten CoW extents in a file's range to real ones. */
+int
+xfs_reflink_convert_cow(
+	struct xfs_inode	*ip,
+	xfs_off_t		offset,
+	xfs_off_t		count)
+{
+	struct xfs_bmbt_irec	got;
+	struct xfs_defer_ops	dfops;
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+	struct xfs_trans	*tp;
+	xfs_fsblock_t		first_block;
+	xfs_fileoff_t		offset_fsb;
+	xfs_fileoff_t		end_fsb;
+	xfs_extnum_t		idx;
+	bool			found;
+	int			error;
+
+	offset_fsb = XFS_B_TO_FSBT(mp, offset);
+	end_fsb = XFS_B_TO_FSB(mp, offset + count);
+
+	/* Grab transaction. */
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, &tp);
+	if (error)
+		return error;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
+	xfs_defer_init(&dfops, &first_block);
+
+	/* Convert all the extents to real from unwritten. */
+	for (found = xfs_iext_lookup_extent(ip, ifp, offset_fsb, &idx, &got);
+	     found && got.br_startoff < end_fsb;
+	     found = xfs_iext_get_extent(ifp, ++idx, &got)) {
+		if (got.br_state == XFS_EXT_NORM)
+			continue;
+		error = __xfs_reflink_convert_cow(tp, ip, &got, offset_fsb,
+				end_fsb - offset_fsb, &dfops);
+		if (error)
+			goto err;
+	}
+
+	/* Finish up. */
+	error = xfs_defer_finish(&tp, &dfops, NULL);
+	if (error)
+		goto out_trans_cancel;
+
+	error = xfs_trans_commit(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return error;
+
+err:
+	xfs_defer_cancel(&dfops);
+out_trans_cancel:
+	xfs_trans_cancel(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return error;
+}
+
 /* Allocate all CoW reservations covering a range of blocks in a file. */
 static int
 __xfs_reflink_allocate_cow(
@@ -328,6 +438,7 @@ __xfs_reflink_allocate_cow(
 		goto out_unlock;
 	ASSERT(nimaps == 1);
 
+	/* Make sure there's a CoW reservation for it. */
 	error = xfs_reflink_reserve_cow(ip, &imap, &shared);
 	if (error)
 		goto out_trans_cancel;
@@ -337,14 +448,16 @@ __xfs_reflink_allocate_cow(
 		goto out_trans_cancel;
 	}
 
+	/* Allocate the entire reservation as unwritten blocks. */
 	xfs_trans_ijoin(tp, ip, 0);
 	error = xfs_bmapi_write(tp, ip, imap.br_startoff, imap.br_blockcount,
-			XFS_BMAPI_COWFORK, &first_block,
+			XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC, &first_block,
 			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK),
 			&imap, &nimaps, &dfops);
 	if (error)
 		goto out_trans_cancel;
 
+	/* Finish up. */
 	error = xfs_defer_finish(&tp, &dfops, NULL);
 	if (error)
 		goto out_trans_cancel;
@@ -389,10 +502,13 @@ xfs_reflink_allocate_cow_range(
 		if (error) {
 			trace_xfs_reflink_allocate_cow_range_error(ip, error,
 					_RET_IP_);
-			break;
+			goto out;
 		}
 	}
 
+	/* Convert the CoW extents to regular. */
+	error = xfs_reflink_convert_cow(ip, offset, count);
+out:
 	return error;
 }
 
@@ -641,6 +757,16 @@ xfs_reflink_end_cow(
 
 		ASSERT(!isnullstartblock(got.br_startblock));
 
+		/*
+		 * Don't remap unwritten extents; these are
+		 * speculatively preallocated CoW extents that have been
+		 * allocated but have not yet been involved in a write.
+		 */
+		if (got.br_state == XFS_EXT_UNWRITTEN) {
+			idx--;
+			goto next_extent;
+		}
+
 		/* Unmap the old blocks in the data fork. */
 		xfs_defer_init(&dfops, &firstfsb);
 		rlen = del.br_blockcount;
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index aa6a4d6..1583c47 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -30,6 +30,8 @@ extern int xfs_reflink_reserve_cow(struct xfs_inode *ip,
 		struct xfs_bmbt_irec *imap, bool *shared);
 extern int xfs_reflink_allocate_cow_range(struct xfs_inode *ip,
 		xfs_off_t offset, xfs_off_t count);
+extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset,
+		xfs_off_t count);
 extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
 		struct xfs_bmbt_irec *imap);
 extern void xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 69c5bcd..d3d11905 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3089,6 +3089,7 @@ DECLARE_EVENT_CLASS(xfs_inode_irec_class,
 		__field(xfs_fileoff_t, lblk)
 		__field(xfs_extlen_t, len)
 		__field(xfs_fsblock_t, pblk)
+		__field(int, state)
 	),
 	TP_fast_assign(
 		__entry->dev = VFS_I(ip)->i_sb->s_dev;
@@ -3096,13 +3097,15 @@ DECLARE_EVENT_CLASS(xfs_inode_irec_class,
 		__entry->lblk = irec->br_startoff;
 		__entry->len = irec->br_blockcount;
 		__entry->pblk = irec->br_startblock;
+		__entry->state = irec->br_state;
 	),
-	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x pblk %llu",
+	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x pblk %llu st %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->lblk,
 		  __entry->len,
-		  __entry->pblk)
+		  __entry->pblk,
+		  __entry->state)
 );
 #define DEFINE_INODE_IREC_EVENT(name) \
 DEFINE_EVENT(xfs_inode_irec_class, name, \
@@ -3242,6 +3245,7 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_trim_around_shared);
 DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_alloc);
 DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_found);
 DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_enospc);
+DEFINE_INODE_IREC_EVENT(xfs_reflink_convert_cow);
 
 DEFINE_RW_EVENT(xfs_reflink_reserve_cow);
 DEFINE_RW_EVENT(xfs_reflink_allocate_cow_range);


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/7] xfs: fix toctou race when locking an inode to access the data map
  2017-01-31  0:23 ` [PATCH 1/7] xfs: fix toctou race when locking an inode to access the data map Darrick J. Wong
@ 2017-01-31  3:01   ` Eric Sandeen
  2017-01-31 13:26   ` Christoph Hellwig
  2017-02-01  2:34   ` [PATCH v2 " Darrick J. Wong
  2 siblings, 0 replies; 31+ messages in thread
From: Eric Sandeen @ 2017-01-31  3:01 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On 1/30/17 6:23 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> We use di_format and if_flags to decide whether we're grabbing the ilock
> in btree mode (btree extents not loaded) or shared mode (anything else),
> but the state of those fields can be changed by other threads that are
> also trying to load the btree extents -- IFEXTENTS gets set before the
> _bmap_read_extents call and cleared if it fails.  Therefore, once we've
> grabbed the shared ilock we have to re-check the fields to see if we
> actually need to upgrade to the exclusive ilock in order to try loading
> the extents.
> 
> Without this patch, we trigger ilock assert failures when a bunch of
> threads try to access a btree format directory with a corrupt bmbt root
> and corrupt the incore data structures, leading to a crash.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

I'm a little surprised this hasn't been hit before, TBH.

Seems right -

Reviewed-by: Eric Sandeen <sandeen@redhat.com>

> ---
>  fs/xfs/xfs_inode.c |   12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index de32f0f..7d7206c 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -125,6 +125,18 @@ xfs_ilock_data_map_shared(
>  	    (ip->i_df.if_flags & XFS_IFEXTENTS) == 0)
>  		lock_mode = XFS_ILOCK_EXCL;
>  	xfs_ilock(ip, lock_mode);
> +	/*
> +	 * We can change if_flags under ilock if we try to read the
> +	 * extents and fail.  Since we hadn't grabbed the ilock at check
> +	 * time, we have to re-check and upgrade the lock now.
> +	 */
> +	if (lock_mode == XFS_ILOCK_SHARED &&
> +	    ip->i_d.di_format == XFS_DINODE_FMT_BTREE &&
> +	    (ip->i_df.if_flags & XFS_IFEXTENTS) == 0) {
> +		xfs_iunlock(ip, lock_mode);
> +		lock_mode = XFS_ILOCK_EXCL;
> +		xfs_ilock(ip, lock_mode);
> +	}
>  	return lock_mode;
>  }
>  
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/7] xfs: fail _dir_open when readahead fails
  2017-01-31  0:23 ` [PATCH 2/7] xfs: fail _dir_open when readahead fails Darrick J. Wong
@ 2017-01-31  4:12   ` Eric Sandeen
  2017-01-31 13:29   ` Christoph Hellwig
  1 sibling, 0 replies; 31+ messages in thread
From: Eric Sandeen @ 2017-01-31  4:12 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On 1/30/17 6:23 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When we open a directory, we try to readahead block 0 of the directory
> on the assumption that we're going to need it soon.  If the bmbt is
> corrupt, the directory will never be usable and the readahead fails
> immediately, so we might as well prevent the directory from being opened
> at all.  This prevents a subsequent read or modify operation from
> hitting it and taking the fs offline.

Seems fine.  I did want to make sure I understood the error returns
from xfs_dabuf_map w.r.t. mappedbno & holes etc, but it looks ok.

xfs_da_reada_buf return type & meaning changes but nobody looks
at it until now...

Reviewed-by: Eric Sandeen <sandeen@redhat.com>

> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_da_btree.c |    6 ++----
>  fs/xfs/libxfs/xfs_da_btree.h |    2 +-
>  fs/xfs/xfs_file.c            |    4 ++--
>  3 files changed, 5 insertions(+), 7 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
> index f2dc1a9..1bdf288 100644
> --- a/fs/xfs/libxfs/xfs_da_btree.c
> +++ b/fs/xfs/libxfs/xfs_da_btree.c
> @@ -2633,7 +2633,7 @@ xfs_da_read_buf(
>  /*
>   * Readahead the dir/attr block.
>   */
> -xfs_daddr_t
> +int
>  xfs_da_reada_buf(
>  	struct xfs_inode	*dp,
>  	xfs_dablk_t		bno,
> @@ -2664,7 +2664,5 @@ xfs_da_reada_buf(
>  	if (mapp != &map)
>  		kmem_free(mapp);
>  
> -	if (error)
> -		return -1;
> -	return mappedbno;
> +	return error;
>  }
> diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
> index 98c75cb..4e29cb6 100644
> --- a/fs/xfs/libxfs/xfs_da_btree.h
> +++ b/fs/xfs/libxfs/xfs_da_btree.h
> @@ -201,7 +201,7 @@ int	xfs_da_read_buf(struct xfs_trans *trans, struct xfs_inode *dp,
>  			       xfs_dablk_t bno, xfs_daddr_t mappedbno,
>  			       struct xfs_buf **bpp, int whichfork,
>  			       const struct xfs_buf_ops *ops);
> -xfs_daddr_t	xfs_da_reada_buf(struct xfs_inode *dp, xfs_dablk_t bno,
> +int	xfs_da_reada_buf(struct xfs_inode *dp, xfs_dablk_t bno,
>  				xfs_daddr_t mapped_bno, int whichfork,
>  				const struct xfs_buf_ops *ops);
>  int	xfs_da_shrink_inode(xfs_da_args_t *args, xfs_dablk_t dead_blkno,
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index bbb9eb6..4c87e60f 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -908,9 +908,9 @@ xfs_dir_open(
>  	 */
>  	mode = xfs_ilock_data_map_shared(ip);
>  	if (ip->i_d.di_nextents > 0)
> -		xfs_dir3_data_readahead(ip, 0, -1);
> +		error = xfs_dir3_data_readahead(ip, 0, -1);
>  	xfs_iunlock(ip, mode);
> -	return 0;
> +	return error;
>  }
>  
>  STATIC int
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/7] xfs: filter out obviously bad btree pointers
  2017-01-31  0:23 ` [PATCH 3/7] xfs: filter out obviously bad btree pointers Darrick J. Wong
@ 2017-01-31  4:39   ` Eric Sandeen
  2017-01-31 20:09     ` Darrick J. Wong
  0 siblings, 1 reply; 31+ messages in thread
From: Eric Sandeen @ 2017-01-31  4:39 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On 1/30/17 6:23 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Don't let anybody load an obviously bad btree pointer.  Since the values
> come from disk, we must return an error, not just ASSERT.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c  |    4 +---
>  fs/xfs/libxfs/xfs_btree.c |    3 ++-
>  fs/xfs/libxfs/xfs_btree.h |    2 +-
>  3 files changed, 4 insertions(+), 5 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index bfc00de..443cb10 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -1291,9 +1291,7 @@ xfs_bmap_read_extents(
>  	ASSERT(level > 0);
>  	pp = XFS_BMAP_BROOT_PTR_ADDR(mp, block, 1, ifp->if_broot_bytes);
>  	bno = be64_to_cpu(*pp);
> -	ASSERT(bno != NULLFSBLOCK);
> -	ASSERT(XFS_FSB_TO_AGNO(mp, bno) < mp->m_sb.sb_agcount);
> -	ASSERT(XFS_FSB_TO_AGBNO(mp, bno) < mp->m_sb.sb_agblocks);
> +

fwiw there's an assignment of bno = NULLFSBLOCK just before here,
which is never used and overwritten in this hunk.
Could probably clean that up here too.

>  	/*
>  	 * Go down the tree until leaf level is reached, following the first
>  	 * pointer (leftmost) at each level.
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 21e6a6a..2849d3f 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -810,7 +810,8 @@ xfs_btree_read_bufl(
>  	xfs_daddr_t		d;		/* real disk block address */
>  	int			error;
>  
> -	ASSERT(fsbno != NULLFSBLOCK);
> +	if (!XFS_FSB_SANITY_CHECK(mp, fsbno))
> +		return -EFSCORRUPTED;

This does away with the NULLFSBLOCK checks though, right?

#define NULLFSBLOCK     ((xfs_fsblock_t)-1)

which is also used as an in-memory condition, so I'm not sure
it should be added to XFS_FSB_SANITY_CHECK.

IOWs some asserts about it are really code flow asserts, though
it also shouldn't be read from disk.

>  	d = XFS_FSB_TO_DADDR(mp, fsbno);
>  	error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, d,
>  				   mp->m_bsize, lock, &bp, ops);
> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index b69b947..33a8f86 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -456,7 +456,7 @@ static inline int xfs_btree_get_level(struct xfs_btree_block *block)
>  #define	XFS_FILBLKS_MAX(a,b)	max_t(xfs_filblks_t, (a), (b))
>  
>  #define	XFS_FSB_SANITY_CHECK(mp,fsb)	\
> -	(XFS_FSB_TO_AGNO(mp, fsb) < mp->m_sb.sb_agcount && \
> +	(fsb && XFS_FSB_TO_AGNO(mp, fsb) < mp->m_sb.sb_agcount && \
>  		XFS_FSB_TO_AGBNO(mp, fsb) < mp->m_sb.sb_agblocks)
>  
>  /*
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/7] xfs: fix toctou race when locking an inode to access the data map
  2017-01-31  0:23 ` [PATCH 1/7] xfs: fix toctou race when locking an inode to access the data map Darrick J. Wong
  2017-01-31  3:01   ` Eric Sandeen
@ 2017-01-31 13:26   ` Christoph Hellwig
  2017-01-31 19:45     ` Darrick J. Wong
  2017-02-01  2:34   ` [PATCH v2 " Darrick J. Wong
  2 siblings, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2017-01-31 13:26 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Jan 30, 2017 at 04:23:10PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> We use di_format and if_flags to decide whether we're grabbing the ilock
> in btree mode (btree extents not loaded) or shared mode (anything else),
> but the state of those fields can be changed by other threads that are
> also trying to load the btree extents -- IFEXTENTS gets set before the
> _bmap_read_extents call and cleared if it fails.  Therefore, once we've
> grabbed the shared ilock we have to re-check the fields to see if we
> actually need to upgrade to the exclusive ilock in order to try loading
> the extents.
> 
> Without this patch, we trigger ilock assert failures when a bunch of
> threads try to access a btree format directory with a corrupt bmbt root
> and corrupt the incore data structures, leading to a crash.

I see the problem here, but I really don't like the fix.  Instead
I'd much rather check for a new flag that tells that the extent list
hasn't been read, which can only be cleared under the exclusive
ilock.  That way we shouldn't need any additional relocking or
checks.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/7] xfs: fail _dir_open when readahead fails
  2017-01-31  0:23 ` [PATCH 2/7] xfs: fail _dir_open when readahead fails Darrick J. Wong
  2017-01-31  4:12   ` Eric Sandeen
@ 2017-01-31 13:29   ` Christoph Hellwig
  1 sibling, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2017-01-31 13:29 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

There is an xfs missing in the subject, I think..

On Mon, Jan 30, 2017 at 04:23:16PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When we open a directory, we try to readahead block 0 of the directory
> on the assumption that we're going to need it soon.  If the bmbt is
> corrupt, the directory will never be usable and the readahead fails
> immediately, so we might as well prevent the directory from being opened
> at all.  This prevents a subsequent read or modify operation from
> hitting it and taking the fs offline.

I think this looks fine, but the patch description is a bit odd -
we don't actually check for failures for the actual read-ahead, just
for the block mapping, which I guess is what you need anyway.

So with a slight update to the patch descriptions:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/7] xfs: check for obviously bad level values in the bmbt root
  2017-01-31  0:23 ` [PATCH 4/7] xfs: check for obviously bad level values in the bmbt root Darrick J. Wong
@ 2017-01-31 13:31   ` Christoph Hellwig
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2017-01-31 13:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 7/7] xfs: mark speculative prealloc CoW fork extents unwritten
  2017-01-31  0:23 ` [PATCH 7/7] xfs: mark speculative prealloc CoW fork extents unwritten Darrick J. Wong
@ 2017-01-31 13:41   ` Christoph Hellwig
  2017-01-31 19:11     ` Darrick J. Wong
  2017-02-01  2:36   ` [PATCH v2 " Darrick J. Wong
  1 sibling, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2017-01-31 13:41 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

Hi Darrick,

from a quick look the code looks reasonable, but I'm worried about
yet another set of transactions that modify all extents again.

Do you have any measurements of the overhead?  I'll see if I
can prototype my always COW idea to see how the approaches compare.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5/7] xfs: verify free block header fields
  2017-01-31  0:23 ` [PATCH 5/7] xfs: verify free block header fields Darrick J. Wong
@ 2017-01-31 13:42   ` Christoph Hellwig
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2017-01-31 13:42 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

Minor nitpick below:

> +	if (err || !(*bpp))

no need for the inner braces.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 7/7] xfs: mark speculative prealloc CoW fork extents unwritten
  2017-01-31 13:41   ` Christoph Hellwig
@ 2017-01-31 19:11     ` Darrick J. Wong
  2017-02-01  1:28       ` Darrick J. Wong
  0 siblings, 1 reply; 31+ messages in thread
From: Darrick J. Wong @ 2017-01-31 19:11 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Tue, Jan 31, 2017 at 05:41:35AM -0800, Christoph Hellwig wrote:
> Hi Darrick,
> 
> from a quick look the code looks reasonable, but I'm worried about
> yet another set of transactions that modify all extents again.
> 
> Do you have any measurements of the overhead?  I'll see if I
> can prototype my always COW idea to see how the approaches compare.

The overhead should be pretty low -- since the cow fork never goes to
disk, the only thing we end up logging is the inode core (because the
conversion function logs it unconditionally), and I'm not even sure
that's necessary since we're performing a pure conversion of blocks that
are already allocated.

In any case I didn't see any noticeable impact on performance other
than the extra CPU overhead.

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/7] xfs: fix toctou race when locking an inode to access the data map
  2017-01-31 13:26   ` Christoph Hellwig
@ 2017-01-31 19:45     ` Darrick J. Wong
  2017-01-31 21:40       ` Darrick J. Wong
  0 siblings, 1 reply; 31+ messages in thread
From: Darrick J. Wong @ 2017-01-31 19:45 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Tue, Jan 31, 2017 at 05:26:58AM -0800, Christoph Hellwig wrote:
> On Mon, Jan 30, 2017 at 04:23:10PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > We use di_format and if_flags to decide whether we're grabbing the ilock
> > in btree mode (btree extents not loaded) or shared mode (anything else),
> > but the state of those fields can be changed by other threads that are
> > also trying to load the btree extents -- IFEXTENTS gets set before the
> > _bmap_read_extents call and cleared if it fails.  Therefore, once we've
> > grabbed the shared ilock we have to re-check the fields to see if we
> > actually need to upgrade to the exclusive ilock in order to try loading
> > the extents.
> > 
> > Without this patch, we trigger ilock assert failures when a bunch of
> > threads try to access a btree format directory with a corrupt bmbt root
> > and corrupt the incore data structures, leading to a crash.
> 
> I see the problem here, but I really don't like the fix.  Instead
> I'd much rather check for a new flag that tells that the extent list
> hasn't been read, which can only be cleared under the exclusive
> ilock.  That way we shouldn't need any additional relocking or
> checks.

I'm confused --

I thought XFS_IFEXTENTS means "extents have been read", which is the
inverse of the flag you propose.  AFAICT the bit is only ever set (or
cleared) with ILOCK_EXCL held, so the problem here is that we're
performing an unlocked read of if_flags prior to actually taking the
lock that we need.

On the other hand, I /think/ it's the case that none of the functions
called in _iread_extents actually cares about IFEXTENTS being set, so
perhaps an alternative could be to avoid setting the bit until we've
successfully read in all the bmbt records?

I'll try that out and report back.

--D

> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/7] xfs: filter out obviously bad btree pointers
  2017-01-31  4:39   ` Eric Sandeen
@ 2017-01-31 20:09     ` Darrick J. Wong
  2017-01-31 20:37       ` Eric Sandeen
  0 siblings, 1 reply; 31+ messages in thread
From: Darrick J. Wong @ 2017-01-31 20:09 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-xfs

On Mon, Jan 30, 2017 at 10:39:05PM -0600, Eric Sandeen wrote:
> On 1/30/17 6:23 PM, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Don't let anybody load an obviously bad btree pointer.  Since the values
> > come from disk, we must return an error, not just ASSERT.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_bmap.c  |    4 +---
> >  fs/xfs/libxfs/xfs_btree.c |    3 ++-
> >  fs/xfs/libxfs/xfs_btree.h |    2 +-
> >  3 files changed, 4 insertions(+), 5 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index bfc00de..443cb10 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -1291,9 +1291,7 @@ xfs_bmap_read_extents(
> >  	ASSERT(level > 0);
> >  	pp = XFS_BMAP_BROOT_PTR_ADDR(mp, block, 1, ifp->if_broot_bytes);
> >  	bno = be64_to_cpu(*pp);
> > -	ASSERT(bno != NULLFSBLOCK);
> > -	ASSERT(XFS_FSB_TO_AGNO(mp, bno) < mp->m_sb.sb_agcount);
> > -	ASSERT(XFS_FSB_TO_AGBNO(mp, bno) < mp->m_sb.sb_agblocks);
> > +
> 
> fwiw there's an assignment of bno = NULLFSBLOCK just before here,
> which is never used and overwritten in this hunk.
> Could probably clean that up here too.

Yes, that could go too.

> >  	/*
> >  	 * Go down the tree until leaf level is reached, following the first
> >  	 * pointer (leftmost) at each level.
> > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> > index 21e6a6a..2849d3f 100644
> > --- a/fs/xfs/libxfs/xfs_btree.c
> > +++ b/fs/xfs/libxfs/xfs_btree.c
> > @@ -810,7 +810,8 @@ xfs_btree_read_bufl(
> >  	xfs_daddr_t		d;		/* real disk block address */
> >  	int			error;
> >  
> > -	ASSERT(fsbno != NULLFSBLOCK);
> > +	if (!XFS_FSB_SANITY_CHECK(mp, fsbno))
> > +		return -EFSCORRUPTED;
> 
> This does away with the NULLFSBLOCK checks though, right?
> 
> #define NULLFSBLOCK     ((xfs_fsblock_t)-1)
> 
> which is also used as an in-memory condition, so I'm not sure
> it should be added to XFS_FSB_SANITY_CHECK.

[reiterating our irc conversation]

It shouldn't, since XFS_FSB_TO_AGNO(mp, NULLFSBLOCK) ought to produce
NULLAGNUMBER, which will still fail the check.

> IOWs some asserts about it are really code flow asserts, though
> it also shouldn't be read from disk.

xfs_btree_read_bufl is called by xfs_bmap_check_leaf_extents, which is
the verifier of the on-disk data.

--D

> >  	d = XFS_FSB_TO_DADDR(mp, fsbno);
> >  	error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, d,
> >  				   mp->m_bsize, lock, &bp, ops);
> > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> > index b69b947..33a8f86 100644
> > --- a/fs/xfs/libxfs/xfs_btree.h
> > +++ b/fs/xfs/libxfs/xfs_btree.h
> > @@ -456,7 +456,7 @@ static inline int xfs_btree_get_level(struct xfs_btree_block *block)
> >  #define	XFS_FILBLKS_MAX(a,b)	max_t(xfs_filblks_t, (a), (b))
> >  
> >  #define	XFS_FSB_SANITY_CHECK(mp,fsb)	\
> > -	(XFS_FSB_TO_AGNO(mp, fsb) < mp->m_sb.sb_agcount && \
> > +	(fsb && XFS_FSB_TO_AGNO(mp, fsb) < mp->m_sb.sb_agcount && \
> >  		XFS_FSB_TO_AGBNO(mp, fsb) < mp->m_sb.sb_agblocks)
> >  
> >  /*
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/7] xfs: filter out obviously bad btree pointers
  2017-01-31 20:09     ` Darrick J. Wong
@ 2017-01-31 20:37       ` Eric Sandeen
  0 siblings, 0 replies; 31+ messages in thread
From: Eric Sandeen @ 2017-01-31 20:37 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On 1/31/17 2:09 PM, Darrick J. Wong wrote:
>>> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
>>> index 21e6a6a..2849d3f 100644
>>> --- a/fs/xfs/libxfs/xfs_btree.c
>>> +++ b/fs/xfs/libxfs/xfs_btree.c
>>> @@ -810,7 +810,8 @@ xfs_btree_read_bufl(
>>>  	xfs_daddr_t		d;		/* real disk block address */
>>>  	int			error;
>>>  
>>> -	ASSERT(fsbno != NULLFSBLOCK);
>>> +	if (!XFS_FSB_SANITY_CHECK(mp, fsbno))
>>> +		return -EFSCORRUPTED;
>> This does away with the NULLFSBLOCK checks though, right?
>>
>> #define NULLFSBLOCK     ((xfs_fsblock_t)-1)
>>
>> which is also used as an in-memory condition, so I'm not sure
>> it should be added to XFS_FSB_SANITY_CHECK.
> [reiterating our irc conversation]
> 
> It shouldn't, since XFS_FSB_TO_AGNO(mp, NULLFSBLOCK) ought to produce
> NULLAGNUMBER, which will still fail the check.

Oh, sure.  sorry for missing that.

Looks ok after all.  :)

Reviewed-by: Eric Sandeen <sandeen@redhat.com>

>> IOWs some asserts about it are really code flow asserts, though
>> it also shouldn't be read from disk.
> xfs_btree_read_bufl is called by xfs_bmap_check_leaf_extents, which is
> the verifier of the on-disk data.
> 
> --D
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/7] xfs: fix toctou race when locking an inode to access the data map
  2017-01-31 19:45     ` Darrick J. Wong
@ 2017-01-31 21:40       ` Darrick J. Wong
  0 siblings, 0 replies; 31+ messages in thread
From: Darrick J. Wong @ 2017-01-31 21:40 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Tue, Jan 31, 2017 at 11:45:20AM -0800, Darrick J. Wong wrote:
> On Tue, Jan 31, 2017 at 05:26:58AM -0800, Christoph Hellwig wrote:
> > On Mon, Jan 30, 2017 at 04:23:10PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > We use di_format and if_flags to decide whether we're grabbing the ilock
> > > in btree mode (btree extents not loaded) or shared mode (anything else),
> > > but the state of those fields can be changed by other threads that are
> > > also trying to load the btree extents -- IFEXTENTS gets set before the
> > > _bmap_read_extents call and cleared if it fails.  Therefore, once we've
> > > grabbed the shared ilock we have to re-check the fields to see if we
> > > actually need to upgrade to the exclusive ilock in order to try loading
> > > the extents.
> > > 
> > > Without this patch, we trigger ilock assert failures when a bunch of
> > > threads try to access a btree format directory with a corrupt bmbt root
> > > and corrupt the incore data structures, leading to a crash.
> > 
> > I see the problem here, but I really don't like the fix.  Instead
> > I'd much rather check for a new flag that tells that the extent list
> > hasn't been read, which can only be cleared under the exclusive
> > ilock.  That way we shouldn't need any additional relocking or
> > checks.
> 
> I'm confused --
> 
> I thought XFS_IFEXTENTS means "extents have been read", which is the
> inverse of the flag you propose.  AFAICT the bit is only ever set (or
> cleared) with ILOCK_EXCL held, so the problem here is that we're
> performing an unlocked read of if_flags prior to actually taking the
> lock that we need.
> 
> On the other hand, I /think/ it's the case that none of the functions
> called in _iread_extents actually cares about IFEXTENTS being set, so
> perhaps an alternative could be to avoid setting the bit until we've
> successfully read in all the bmbt records?
> 
> I'll try that out and report back.

Seems to work, will send a revised patch.

--D

> 
> --D
> 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 7/7] xfs: mark speculative prealloc CoW fork extents unwritten
  2017-01-31 19:11     ` Darrick J. Wong
@ 2017-02-01  1:28       ` Darrick J. Wong
  0 siblings, 0 replies; 31+ messages in thread
From: Darrick J. Wong @ 2017-02-01  1:28 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Tue, Jan 31, 2017 at 11:11:20AM -0800, Darrick J. Wong wrote:
> On Tue, Jan 31, 2017 at 05:41:35AM -0800, Christoph Hellwig wrote:
> > Hi Darrick,
> > 
> > from a quick look the code looks reasonable, but I'm worried about
> > yet another set of transactions that modify all extents again.
> > 
> > Do you have any measurements of the overhead?  I'll see if I
> > can prototype my always COW idea to see how the approaches compare.
> 
> The overhead should be pretty low -- since the cow fork never goes to
> disk, the only thing we end up logging is the inode core (because the
> conversion function logs it unconditionally), and I'm not even sure
> that's necessary since we're performing a pure conversion of blocks that
> are already allocated.
> 
> In any case I didn't see any noticeable impact on performance other
> than the extra CPU overhead.

Reading a little closer, we don't actually change anything in the
on-disk inode, so we can get rid of transaction altogether.

--D

> 
> --D
> 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v2 1/7] xfs: fix toctou race when locking an inode to access the data map
  2017-01-31  0:23 ` [PATCH 1/7] xfs: fix toctou race when locking an inode to access the data map Darrick J. Wong
  2017-01-31  3:01   ` Eric Sandeen
  2017-01-31 13:26   ` Christoph Hellwig
@ 2017-02-01  2:34   ` Darrick J. Wong
  2017-02-01 14:48     ` Christoph Hellwig
  2 siblings, 1 reply; 31+ messages in thread
From: Darrick J. Wong @ 2017-02-01  2:34 UTC (permalink / raw)
  To: linux-xfs

We use di_format and if_flags to decide whether we're grabbing the ilock
in btree mode (btree extents not loaded) or shared mode (anything else),
but the state of those fields can be changed by other threads that are
also trying to load the btree extents -- IFEXTENTS gets set before the
_bmap_read_extents call and cleared if it fails.

We don't actually need to have IFEXTENTS set until after the bmbt
records are successfully loaded and validated, which will fix the race
between multiple threads trying to read the same directory.  The next
patch strengthens directory bmbt validation by refusing to open the
directory if reading the bmbt to start directory readahead fails.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_inode_fork.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index 222e103..421341f 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -497,15 +497,14 @@ xfs_iread_extents(
 	 * We know that the size is valid (it's checked in iformat_btree)
 	 */
 	ifp->if_bytes = ifp->if_real_bytes = 0;
-	ifp->if_flags |= XFS_IFEXTENTS;
 	xfs_iext_add(ifp, 0, nextents);
 	error = xfs_bmap_read_extents(tp, ip, whichfork);
 	if (error) {
 		xfs_iext_destroy(ifp);
-		ifp->if_flags &= ~XFS_IFEXTENTS;
 		return error;
 	}
 	xfs_validate_extents(ifp, nextents, XFS_EXTFMT_INODE(ip));
+	ifp->if_flags |= XFS_IFEXTENTS;
 	return 0;
 }
 /*

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 6/7] xfs: allow unwritten extents in the CoW fork
  2017-01-31  0:23 ` [PATCH 6/7] xfs: allow unwritten extents in the CoW fork Darrick J. Wong
@ 2017-02-01  2:35   ` Darrick J. Wong
  2017-02-01 18:06     ` Christoph Hellwig
  0 siblings, 1 reply; 31+ messages in thread
From: Darrick J. Wong @ 2017-02-01  2:35 UTC (permalink / raw)
  To: linux-xfs

In the data fork, we only allow extents to perform the following state
transitions:

delay -> real <-> unwritten

There's no way to move directly from a delalloc reservation to an
/unwritten/ allocated extent.  However, for the CoW fork we want to be
able to do the following to each extent:

delalloc -> unwritten -> written -> remapped to data fork

This will help us to avoid a race in the speculative CoW preallocation
code between a first thread that is allocating a CoW extent and a second
thread that is remapping part of a file after a write.  In order to do
this, however, we need two things: first, we have to be able to
transition from da to unwritten, and second the function that converts
between real and unwritten has to be made aware of the cow fork.  Do
both of those things.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: Converting unwritten CoW fork extents to real extents doesn't
change anything on disk, so we don't actually need a transaction to
perform the conversion.
---
 fs/xfs/libxfs/xfs_bmap.c |   80 +++++++++++++++++++++++++++++-----------------
 1 file changed, 50 insertions(+), 30 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index d79c66d..5b626a4 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -1861,6 +1861,7 @@ xfs_bmap_add_extent_delay_real(
 		 */
 		trace_xfs_bmap_pre_update(bma->ip, bma->idx, state, _THIS_IP_);
 		xfs_bmbt_set_startblock(ep, new->br_startblock);
+		xfs_bmbt_set_state(ep, new->br_state);
 		trace_xfs_bmap_post_update(bma->ip, bma->idx, state, _THIS_IP_);
 
 		(*nextents)++;
@@ -2199,6 +2200,7 @@ STATIC int				/* error */
 xfs_bmap_add_extent_unwritten_real(
 	struct xfs_trans	*tp,
 	xfs_inode_t		*ip,	/* incore inode pointer */
+	int			whichfork,
 	xfs_extnum_t		*idx,	/* extent number to update/insert */
 	xfs_btree_cur_t		**curp,	/* if *curp is null, not a btree */
 	xfs_bmbt_irec_t		*new,	/* new data to add to file extents */
@@ -2218,12 +2220,14 @@ xfs_bmap_add_extent_unwritten_real(
 					/* left is 0, right is 1, prev is 2 */
 	int			rval=0;	/* return value (logging flags) */
 	int			state = 0;/* state bits, accessed thru macros */
-	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_mount	*mp = ip->i_mount;
 
 	*logflagsp = 0;
 
 	cur = *curp;
-	ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
+	ifp = XFS_IFORK_PTR(ip, whichfork);
+	if (whichfork == XFS_COW_FORK)
+		state |= BMAP_COWFORK;
 
 	ASSERT(*idx >= 0);
 	ASSERT(*idx <= xfs_iext_count(ifp));
@@ -2282,7 +2286,7 @@ xfs_bmap_add_extent_unwritten_real(
 	 * Don't set contiguous if the combined extent would be too large.
 	 * Also check for all-three-contiguous being too large.
 	 */
-	if (*idx < xfs_iext_count(&ip->i_df) - 1) {
+	if (*idx < xfs_iext_count(ifp) - 1) {
 		state |= BMAP_RIGHT_VALID;
 		xfs_bmbt_get_all(xfs_iext_get_ext(ifp, *idx + 1), &RIGHT);
 		if (isnullstartblock(RIGHT.br_startblock))
@@ -2322,7 +2326,8 @@ xfs_bmap_add_extent_unwritten_real(
 		trace_xfs_bmap_post_update(ip, *idx, state, _THIS_IP_);
 
 		xfs_iext_remove(ip, *idx + 1, 2, state);
-		ip->i_d.di_nextents -= 2;
+		XFS_IFORK_NEXT_SET(ip, whichfork,
+				XFS_IFORK_NEXTENTS(ip, whichfork) - 2);
 		if (cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -2365,7 +2370,8 @@ xfs_bmap_add_extent_unwritten_real(
 		trace_xfs_bmap_post_update(ip, *idx, state, _THIS_IP_);
 
 		xfs_iext_remove(ip, *idx + 1, 1, state);
-		ip->i_d.di_nextents--;
+		XFS_IFORK_NEXT_SET(ip, whichfork,
+				XFS_IFORK_NEXTENTS(ip, whichfork) - 1);
 		if (cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -2400,7 +2406,8 @@ xfs_bmap_add_extent_unwritten_real(
 		xfs_bmbt_set_state(ep, newext);
 		trace_xfs_bmap_post_update(ip, *idx, state, _THIS_IP_);
 		xfs_iext_remove(ip, *idx + 1, 1, state);
-		ip->i_d.di_nextents--;
+		XFS_IFORK_NEXT_SET(ip, whichfork,
+				XFS_IFORK_NEXTENTS(ip, whichfork) - 1);
 		if (cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -2512,7 +2519,8 @@ xfs_bmap_add_extent_unwritten_real(
 		trace_xfs_bmap_post_update(ip, *idx, state, _THIS_IP_);
 
 		xfs_iext_insert(ip, *idx, 1, new, state);
-		ip->i_d.di_nextents++;
+		XFS_IFORK_NEXT_SET(ip, whichfork,
+				XFS_IFORK_NEXTENTS(ip, whichfork) + 1);
 		if (cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -2590,7 +2598,8 @@ xfs_bmap_add_extent_unwritten_real(
 		++*idx;
 		xfs_iext_insert(ip, *idx, 1, new, state);
 
-		ip->i_d.di_nextents++;
+		XFS_IFORK_NEXT_SET(ip, whichfork,
+				XFS_IFORK_NEXTENTS(ip, whichfork) + 1);
 		if (cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -2638,7 +2647,8 @@ xfs_bmap_add_extent_unwritten_real(
 		++*idx;
 		xfs_iext_insert(ip, *idx, 2, &r[0], state);
 
-		ip->i_d.di_nextents += 2;
+		XFS_IFORK_NEXT_SET(ip, whichfork,
+				XFS_IFORK_NEXTENTS(ip, whichfork) + 2);
 		if (cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -2692,17 +2702,17 @@ xfs_bmap_add_extent_unwritten_real(
 	}
 
 	/* update reverse mappings */
-	error = xfs_rmap_convert_extent(mp, dfops, ip, XFS_DATA_FORK, new);
+	error = xfs_rmap_convert_extent(mp, dfops, ip, whichfork, new);
 	if (error)
 		goto done;
 
 	/* convert to a btree if necessary */
-	if (xfs_bmap_needs_btree(ip, XFS_DATA_FORK)) {
+	if (xfs_bmap_needs_btree(ip, whichfork)) {
 		int	tmp_logflags;	/* partial log flag return val */
 
 		ASSERT(cur == NULL);
 		error = xfs_bmap_extents_to_btree(tp, ip, first, dfops, &cur,
-				0, &tmp_logflags, XFS_DATA_FORK);
+				0, &tmp_logflags, whichfork);
 		*logflagsp |= tmp_logflags;
 		if (error)
 			goto done;
@@ -2714,7 +2724,7 @@ xfs_bmap_add_extent_unwritten_real(
 		*curp = cur;
 	}
 
-	xfs_bmap_check_leaf_extents(*curp, ip, XFS_DATA_FORK);
+	xfs_bmap_check_leaf_extents(*curp, ip, whichfork);
 done:
 	*logflagsp |= rval;
 	return error;
@@ -4365,10 +4375,16 @@ xfs_bmapi_allocate(
 	bma->got.br_state = XFS_EXT_NORM;
 
 	/*
-	 * A wasdelay extent has been initialized, so shouldn't be flagged
-	 * as unwritten.
+	 * In the data fork, a wasdelay extent has been initialized, so
+	 * shouldn't be flagged as unwritten.
+	 *
+	 * For the cow fork, however, we convert delalloc reservations
+	 * (extents allocated for speculative preallocation) to
+	 * allocated unwritten extents, and only convert the unwritten
+	 * extents to real extents when we're about to write the data.
 	 */
-	if (!bma->wasdel && (bma->flags & XFS_BMAPI_PREALLOC) &&
+	if ((!bma->wasdel || (bma->flags & XFS_BMAPI_COWFORK)) &&
+	    (bma->flags & XFS_BMAPI_PREALLOC) &&
 	    xfs_sb_version_hasextflgbit(&mp->m_sb))
 		bma->got.br_state = XFS_EXT_UNWRITTEN;
 
@@ -4419,8 +4435,6 @@ xfs_bmapi_convert_unwritten(
 			(XFS_BMAPI_PREALLOC | XFS_BMAPI_CONVERT))
 		return 0;
 
-	ASSERT(whichfork != XFS_COW_FORK);
-
 	/*
 	 * Modify (by adding) the state flag, if writing.
 	 */
@@ -4445,8 +4459,8 @@ xfs_bmapi_convert_unwritten(
 			return error;
 	}
 
-	error = xfs_bmap_add_extent_unwritten_real(bma->tp, bma->ip, &bma->idx,
-			&bma->cur, mval, bma->firstblock, bma->dfops,
+	error = xfs_bmap_add_extent_unwritten_real(bma->tp, bma->ip, whichfork,
+			&bma->idx, &bma->cur, mval, bma->firstblock, bma->dfops,
 			&tmp_logflags);
 	/*
 	 * Log the inode core unconditionally in the unwritten extent conversion
@@ -4455,8 +4469,12 @@ xfs_bmapi_convert_unwritten(
 	 * in the transaction for the sake of fsync(), even if nothing has
 	 * changed, because fsync() will not force the log for this transaction
 	 * unless it sees the inode pinned.
+	 *
+	 * Note: If we're only converting cow fork extents, there aren't
+	 * any on-disk updates to make, so we don't need to log anything.
 	 */
-	bma->logflags |= tmp_logflags | XFS_ILOG_CORE;
+	if (whichfork != XFS_COW_FORK)
+		bma->logflags |= tmp_logflags | XFS_ILOG_CORE;
 	if (error)
 		return error;
 
@@ -4530,15 +4548,15 @@ xfs_bmapi_write(
 	ASSERT(*nmap >= 1);
 	ASSERT(*nmap <= XFS_BMAP_MAX_NMAP);
 	ASSERT(!(flags & XFS_BMAPI_IGSTATE));
-	ASSERT(tp != NULL);
+	ASSERT(tp != NULL ||
+	       (flags & (XFS_BMAPI_CONVERT | XFS_BMAPI_COWFORK)) ==
+			(XFS_BMAPI_CONVERT | XFS_BMAPI_COWFORK));
 	ASSERT(len > 0);
 	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_LOCAL);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 	ASSERT(!(flags & XFS_BMAPI_REMAP) || whichfork == XFS_DATA_FORK);
 	ASSERT(!(flags & XFS_BMAPI_PREALLOC) || !(flags & XFS_BMAPI_REMAP));
 	ASSERT(!(flags & XFS_BMAPI_CONVERT) || !(flags & XFS_BMAPI_REMAP));
-	ASSERT(!(flags & XFS_BMAPI_PREALLOC) || whichfork != XFS_COW_FORK);
-	ASSERT(!(flags & XFS_BMAPI_CONVERT) || whichfork != XFS_COW_FORK);
 
 	/* zeroing is for currently only for data extents, not metadata */
 	ASSERT((flags & (XFS_BMAPI_METADATA | XFS_BMAPI_ZERO)) !=
@@ -5553,8 +5571,8 @@ __xfs_bunmapi(
 			}
 			del.br_state = XFS_EXT_UNWRITTEN;
 			error = xfs_bmap_add_extent_unwritten_real(tp, ip,
-					&lastx, &cur, &del, firstblock, dfops,
-					&logflags);
+					whichfork, &lastx, &cur, &del,
+					firstblock, dfops, &logflags);
 			if (error)
 				goto error0;
 			goto nodelete;
@@ -5607,8 +5625,9 @@ __xfs_bunmapi(
 				prev.br_state = XFS_EXT_UNWRITTEN;
 				lastx--;
 				error = xfs_bmap_add_extent_unwritten_real(tp,
-						ip, &lastx, &cur, &prev,
-						firstblock, dfops, &logflags);
+						ip, whichfork, &lastx, &cur,
+						&prev, firstblock, dfops,
+						&logflags);
 				if (error)
 					goto error0;
 				goto nodelete;
@@ -5616,8 +5635,9 @@ __xfs_bunmapi(
 				ASSERT(del.br_state == XFS_EXT_NORM);
 				del.br_state = XFS_EXT_UNWRITTEN;
 				error = xfs_bmap_add_extent_unwritten_real(tp,
-						ip, &lastx, &cur, &del,
-						firstblock, dfops, &logflags);
+						ip, whichfork, &lastx, &cur,
+						&del, firstblock, dfops,
+						&logflags);
 				if (error)
 					goto error0;
 				goto nodelete;

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 7/7] xfs: mark speculative prealloc CoW fork extents unwritten
  2017-01-31  0:23 ` [PATCH 7/7] xfs: mark speculative prealloc CoW fork extents unwritten Darrick J. Wong
  2017-01-31 13:41   ` Christoph Hellwig
@ 2017-02-01  2:36   ` Darrick J. Wong
  2017-02-01 18:36     ` Christoph Hellwig
  2017-02-02 15:04     ` Brian Foster
  1 sibling, 2 replies; 31+ messages in thread
From: Darrick J. Wong @ 2017-02-01  2:36 UTC (permalink / raw)
  To: linux-xfs

Christoph Hellwig pointed out that there's a potentially nasty race when
performing simultaneous nearby directio cow writes:

"Thread 1 writes a range from B to c

"                    B --------- C
                           p

"a little later thread 2 writes from A to B

"        A --------- B
               p

[editor's note: the 'p' denote cowextsize boundaries, which I added to
make this more clear]

"but the code preallocates beyond B into the range where thread
"1 has just written, but ->end_io hasn't been called yet.
"But once ->end_io is called thread 2 has already allocated
"up to the extent size hint into the write range of thread 1,
"so the end_io handler will splice the unintialized blocks from
"that preallocation back into the file right after B."

We can avoid this race by ensuring that thread 1 cannot accidentally
remap the blocks that thread 2 allocated (as part of speculative
preallocation) as part of t2's write preparation in t1's end_io handler.
The way we make this happen is by taking advantage of the unwritten
extent flag as an intermediate step.

Recall that when we begin the process of writing data to shared blocks,
we create a delayed allocation extent in the CoW fork:

D: --RRRRRRSSSRRRRRRRR---
C: ------DDDDDDD---------

When a thread prepares to CoW some dirty data out to disk, it will now
convert the delalloc reservation into an /unwritten/ allocated extent in
the cow fork.  The da conversion code tries to opportunistically
allocate as much of a (speculatively prealloc'd) extent as possible, so
we may end up allocating a larger extent than we're actually writing
out:

D: --RRRRRRSSSRRRRRRRR---
U: ------UUUUUUU---------

Next, we convert only the part of the extent that we're actively
planning to write to normal (i.e. not unwritten) status:

D: --RRRRRRSSSRRRRRRRR---
U: ------UURRUUU---------

If the write succeeds, the end_cow function will now scan the relevant
range of the CoW fork for real extents and remap only the real extents
into the data fork:

D: --RRRRRRRRSRRRRRRRR---
U: ------UU--UUU---------

This ensures that we never obliterate valid data fork extents with
unwritten blocks from the CoW fork.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: We don't update anything on disk for the CoW unwritten extent
conversion, so don't allocate a transaction.
---
 fs/xfs/xfs_aops.c    |    6 +++
 fs/xfs/xfs_iomap.c   |    2 -
 fs/xfs/xfs_reflink.c |  117 +++++++++++++++++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_reflink.h |    2 +
 fs/xfs/xfs_trace.h   |    8 +++
 5 files changed, 125 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 631e7c0..1ff9df7 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -481,6 +481,12 @@ xfs_submit_ioend(
 	struct xfs_ioend	*ioend,
 	int			status)
 {
+	/* Convert CoW extents to regular */
+	if (!status && ioend->io_type == XFS_IO_COW) {
+		status = xfs_reflink_convert_cow(XFS_I(ioend->io_inode),
+				ioend->io_offset, ioend->io_size);
+	}
+
 	/* Reserve log space if we might write beyond the on-disk inode size. */
 	if (!status &&
 	    ioend->io_type != XFS_IO_UNWRITTEN &&
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 1aa3abd..767208f 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -685,7 +685,7 @@ xfs_iomap_write_allocate(
 	int		nres;
 
 	if (whichfork == XFS_COW_FORK)
-		flags |= XFS_BMAPI_COWFORK;
+		flags |= XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
 
 	/*
 	 * Make sure that the dquots are there.
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 07593a3..a2e1fff 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -82,11 +82,22 @@
  * mappings are a reservation against the free space in the filesystem;
  * adjacent mappings can also be combined into fewer larger mappings.
  *
+ * As an optimization, the CoW extent size hint (cowextsz) creates
+ * outsized aligned delalloc reservations in the hope of landing out of
+ * order nearby CoW writes in a single extent on disk, thereby reducing
+ * fragmentation and improving future performance.
+ *
+ * D: --RRRRRRSSSRRRRRRRR--- (data fork)
+ * C: ------DDDDDDD--------- (CoW fork)
+ *
  * When dirty pages are being written out (typically in writepage), the
- * delalloc reservations are converted into real mappings by allocating
- * blocks and replacing the delalloc mapping with real ones.  A delalloc
- * mapping can be replaced by several real ones if the free space is
- * fragmented.
+ * delalloc reservations are converted into unwritten mappings by
+ * allocating blocks and replacing the delalloc mapping with real ones.
+ * A delalloc mapping can be replaced by several unwritten ones if the
+ * free space is fragmented.
+ *
+ * D: --RRRRRRSSSRRRRRRRR---
+ * C: ------UUUUUUU---------
  *
  * We want to adapt the delalloc mechanism for copy-on-write, since the
  * write paths are similar.  The first two steps (creating the reservation
@@ -101,13 +112,29 @@
  * Block-aligned directio writes will use the same mechanism as buffered
  * writes.
  *
+ * Just prior to submitting the actual disk write requests, we convert
+ * the extents representing the range of the file actually being written
+ * (as opposed to extra pieces created for the cowextsize hint) to real
+ * extents.  This will become important in the next step:
+ *
+ * D: --RRRRRRSSSRRRRRRRR---
+ * C: ------UUrrUUU---------
+ *
  * CoW remapping must be done after the data block write completes,
  * because we don't want to destroy the old data fork map until we're sure
  * the new block has been written.  Since the new mappings are kept in a
  * separate fork, we can simply iterate these mappings to find the ones
  * that cover the file blocks that we just CoW'd.  For each extent, simply
  * unmap the corresponding range in the data fork, map the new range into
- * the data fork, and remove the extent from the CoW fork.
+ * the data fork, and remove the extent from the CoW fork.  Because of
+ * the presence of the cowextsize hint, however, we must be careful
+ * only to remap the blocks that we've actually written out --  we must
+ * never remap delalloc reservations nor CoW staging blocks that have
+ * yet to be written.  This corresponds exactly to the real extents in
+ * the CoW fork:
+ *
+ * D: --RRRRRRrrSRRRRRRRR---
+ * C: ------UU--UUU---------
  *
  * Since the remapping operation can be applied to an arbitrary file
  * range, we record the need for the remap step as a flag in the ioend
@@ -296,6 +323,66 @@ xfs_reflink_reserve_cow(
 	return 0;
 }
 
+/* Convert part of an unwritten CoW extent to a real one. */
+STATIC int
+__xfs_reflink_convert_cow(
+	struct xfs_inode		*ip,
+	struct xfs_bmbt_irec		*imap,
+	xfs_fileoff_t			offset_fsb,
+	xfs_filblks_t			count_fsb,
+	struct xfs_defer_ops		*dfops)
+{
+	struct xfs_bmbt_irec		irec = *imap;
+	xfs_fsblock_t			first_block;
+	int				nimaps = 1;
+
+	xfs_trim_extent(&irec, offset_fsb, count_fsb);
+	trace_xfs_reflink_convert_cow(ip, &irec);
+	if (irec.br_blockcount == 0)
+		return 0;
+	return xfs_bmapi_write(NULL, ip, irec.br_startoff, irec.br_blockcount,
+			XFS_BMAPI_COWFORK | XFS_BMAPI_CONVERT, &first_block,
+			0, &irec, &nimaps, dfops);
+}
+
+/* Convert all of the unwritten CoW extents in a file's range to real ones. */
+int
+xfs_reflink_convert_cow(
+	struct xfs_inode	*ip,
+	xfs_off_t		offset,
+	xfs_off_t		count)
+{
+	struct xfs_bmbt_irec	got;
+	struct xfs_defer_ops	dfops;
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+	xfs_fileoff_t		offset_fsb;
+	xfs_fileoff_t		end_fsb;
+	xfs_extnum_t		idx;
+	bool			found;
+	int			error;
+
+	offset_fsb = XFS_B_TO_FSBT(mp, offset);
+	end_fsb = XFS_B_TO_FSB(mp, offset + count);
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+
+	/* Convert all the extents to real from unwritten. */
+	for (found = xfs_iext_lookup_extent(ip, ifp, offset_fsb, &idx, &got);
+	     found && got.br_startoff < end_fsb;
+	     found = xfs_iext_get_extent(ifp, ++idx, &got)) {
+		if (got.br_state == XFS_EXT_NORM)
+			continue;
+		error = __xfs_reflink_convert_cow(ip, &got, offset_fsb,
+				end_fsb - offset_fsb, &dfops);
+		if (error)
+			break;
+	}
+
+	/* Finish up. */
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return error;
+}
+
 /* Allocate all CoW reservations covering a range of blocks in a file. */
 static int
 __xfs_reflink_allocate_cow(
@@ -328,6 +415,7 @@ __xfs_reflink_allocate_cow(
 		goto out_unlock;
 	ASSERT(nimaps == 1);
 
+	/* Make sure there's a CoW reservation for it. */
 	error = xfs_reflink_reserve_cow(ip, &imap, &shared);
 	if (error)
 		goto out_trans_cancel;
@@ -337,14 +425,16 @@ __xfs_reflink_allocate_cow(
 		goto out_trans_cancel;
 	}
 
+	/* Allocate the entire reservation as unwritten blocks. */
 	xfs_trans_ijoin(tp, ip, 0);
 	error = xfs_bmapi_write(tp, ip, imap.br_startoff, imap.br_blockcount,
-			XFS_BMAPI_COWFORK, &first_block,
+			XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC, &first_block,
 			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK),
 			&imap, &nimaps, &dfops);
 	if (error)
 		goto out_trans_cancel;
 
+	/* Finish up. */
 	error = xfs_defer_finish(&tp, &dfops, NULL);
 	if (error)
 		goto out_trans_cancel;
@@ -389,10 +479,13 @@ xfs_reflink_allocate_cow_range(
 		if (error) {
 			trace_xfs_reflink_allocate_cow_range_error(ip, error,
 					_RET_IP_);
-			break;
+			goto out;
 		}
 	}
 
+	/* Convert the CoW extents to regular. */
+	error = xfs_reflink_convert_cow(ip, offset, count);
+out:
 	return error;
 }
 
@@ -641,6 +734,16 @@ xfs_reflink_end_cow(
 
 		ASSERT(!isnullstartblock(got.br_startblock));
 
+		/*
+		 * Don't remap unwritten extents; these are
+		 * speculatively preallocated CoW extents that have been
+		 * allocated but have not yet been involved in a write.
+		 */
+		if (got.br_state == XFS_EXT_UNWRITTEN) {
+			idx--;
+			goto next_extent;
+		}
+
 		/* Unmap the old blocks in the data fork. */
 		xfs_defer_init(&dfops, &firstfsb);
 		rlen = del.br_blockcount;
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index aa6a4d6..1583c47 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -30,6 +30,8 @@ extern int xfs_reflink_reserve_cow(struct xfs_inode *ip,
 		struct xfs_bmbt_irec *imap, bool *shared);
 extern int xfs_reflink_allocate_cow_range(struct xfs_inode *ip,
 		xfs_off_t offset, xfs_off_t count);
+extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset,
+		xfs_off_t count);
 extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
 		struct xfs_bmbt_irec *imap);
 extern void xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 69c5bcd..d3d11905 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3089,6 +3089,7 @@ DECLARE_EVENT_CLASS(xfs_inode_irec_class,
 		__field(xfs_fileoff_t, lblk)
 		__field(xfs_extlen_t, len)
 		__field(xfs_fsblock_t, pblk)
+		__field(int, state)
 	),
 	TP_fast_assign(
 		__entry->dev = VFS_I(ip)->i_sb->s_dev;
@@ -3096,13 +3097,15 @@ DECLARE_EVENT_CLASS(xfs_inode_irec_class,
 		__entry->lblk = irec->br_startoff;
 		__entry->len = irec->br_blockcount;
 		__entry->pblk = irec->br_startblock;
+		__entry->state = irec->br_state;
 	),
-	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x pblk %llu",
+	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x pblk %llu st %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->lblk,
 		  __entry->len,
-		  __entry->pblk)
+		  __entry->pblk,
+		  __entry->state)
 );
 #define DEFINE_INODE_IREC_EVENT(name) \
 DEFINE_EVENT(xfs_inode_irec_class, name, \
@@ -3242,6 +3245,7 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_trim_around_shared);
 DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_alloc);
 DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_found);
 DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_enospc);
+DEFINE_INODE_IREC_EVENT(xfs_reflink_convert_cow);
 
 DEFINE_RW_EVENT(xfs_reflink_reserve_cow);
 DEFINE_RW_EVENT(xfs_reflink_allocate_cow_range);

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/7] xfs: fix toctou race when locking an inode to access the data map
  2017-02-01  2:34   ` [PATCH v2 " Darrick J. Wong
@ 2017-02-01 14:48     ` Christoph Hellwig
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2017-02-01 14:48 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 6/7] xfs: allow unwritten extents in the CoW fork
  2017-02-01  2:35   ` [PATCH v2 " Darrick J. Wong
@ 2017-02-01 18:06     ` Christoph Hellwig
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2017-02-01 18:06 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 7/7] xfs: mark speculative prealloc CoW fork extents unwritten
  2017-02-01  2:36   ` [PATCH v2 " Darrick J. Wong
@ 2017-02-01 18:36     ` Christoph Hellwig
  2017-02-02 15:04     ` Brian Foster
  1 sibling, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2017-02-01 18:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

Looks fine (and tests fine):

Reviewed-by: Christoph Hellwig <hch@lst.de>

But a few style nitpicks below:


> +	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
> +	xfs_fileoff_t		offset_fsb;
> +	xfs_fileoff_t		end_fsb;
> +	xfs_extnum_t		idx;
> +	bool			found;
> +	int			error;
> +
> +	offset_fsb = XFS_B_TO_FSBT(mp, offset);
> +	end_fsb = XFS_B_TO_FSB(mp, offset + count);

Move these into the lines with the variable declarations, please.

> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +
> +	/* Convert all the extents to real from unwritten. */
> +	for (found = xfs_iext_lookup_extent(ip, ifp, offset_fsb, &idx, &got);
> +	     found && got.br_startoff < end_fsb;
> +	     found = xfs_iext_get_extent(ifp, ++idx, &got)) {
> +		if (got.br_state == XFS_EXT_NORM)
> +			continue;
> +		error = __xfs_reflink_convert_cow(ip, &got, offset_fsb,
> +				end_fsb - offset_fsb, &dfops);

Move the state check into __xfs_reflink_convert_cow and rename that
to xfs_reflink_convert_cow_extent?

> @@ -389,10 +479,13 @@ xfs_reflink_allocate_cow_range(
>  		if (error) {
>  			trace_xfs_reflink_allocate_cow_range_error(ip, error,
>  					_RET_IP_);
> -			break;
> +			goto out;
>  		}
>  	}
>  
> +	/* Convert the CoW extents to regular. */
> +	error = xfs_reflink_convert_cow(ip, offset, count);
> +out:
>  	return error;

No need for the goto here really, justt return the error directly.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 7/7] xfs: mark speculative prealloc CoW fork extents unwritten
  2017-02-01  2:36   ` [PATCH v2 " Darrick J. Wong
  2017-02-01 18:36     ` Christoph Hellwig
@ 2017-02-02 15:04     ` Brian Foster
  2017-02-02 17:04       ` Darrick J. Wong
  1 sibling, 1 reply; 31+ messages in thread
From: Brian Foster @ 2017-02-02 15:04 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jan 31, 2017 at 06:36:17PM -0800, Darrick J. Wong wrote:
> Christoph Hellwig pointed out that there's a potentially nasty race when
> performing simultaneous nearby directio cow writes:
> 
> "Thread 1 writes a range from B to c
> 
> "                    B --------- C
>                            p
> 
> "a little later thread 2 writes from A to B
> 
> "        A --------- B
>                p
> 
> [editor's note: the 'p' denote cowextsize boundaries, which I added to
> make this more clear]
> 
> "but the code preallocates beyond B into the range where thread
> "1 has just written, but ->end_io hasn't been called yet.

I'm confused by the description. Wouldn't thread 1 have already
allocated from B-C in the COW fork in order to send a write I/O? How
could thread 2 then "preallocate into a range" of the COW fork that has
been allocated by thread 1? (Also, wouldn't the ioend offset+size limit
a COW remap to the range that is covered by the completed write?)

Brian

> "But once ->end_io is called thread 2 has already allocated
> "up to the extent size hint into the write range of thread 1,
> "so the end_io handler will splice the unintialized blocks from
> "that preallocation back into the file right after B."
> 
> We can avoid this race by ensuring that thread 1 cannot accidentally
> remap the blocks that thread 2 allocated (as part of speculative
> preallocation) as part of t2's write preparation in t1's end_io handler.
> The way we make this happen is by taking advantage of the unwritten
> extent flag as an intermediate step.
> 
> Recall that when we begin the process of writing data to shared blocks,
> we create a delayed allocation extent in the CoW fork:
> 
> D: --RRRRRRSSSRRRRRRRR---
> C: ------DDDDDDD---------
> 
> When a thread prepares to CoW some dirty data out to disk, it will now
> convert the delalloc reservation into an /unwritten/ allocated extent in
> the cow fork.  The da conversion code tries to opportunistically
> allocate as much of a (speculatively prealloc'd) extent as possible, so
> we may end up allocating a larger extent than we're actually writing
> out:
> 
> D: --RRRRRRSSSRRRRRRRR---
> U: ------UUUUUUU---------
> 
> Next, we convert only the part of the extent that we're actively
> planning to write to normal (i.e. not unwritten) status:
> 
> D: --RRRRRRSSSRRRRRRRR---
> U: ------UURRUUU---------
> 
> If the write succeeds, the end_cow function will now scan the relevant
> range of the CoW fork for real extents and remap only the real extents
> into the data fork:
> 
> D: --RRRRRRRRSRRRRRRRR---
> U: ------UU--UUU---------
> 
> This ensures that we never obliterate valid data fork extents with
> unwritten blocks from the CoW fork.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> v2: We don't update anything on disk for the CoW unwritten extent
> conversion, so don't allocate a transaction.
> ---
>  fs/xfs/xfs_aops.c    |    6 +++
>  fs/xfs/xfs_iomap.c   |    2 -
>  fs/xfs/xfs_reflink.c |  117 +++++++++++++++++++++++++++++++++++++++++++++++---
>  fs/xfs/xfs_reflink.h |    2 +
>  fs/xfs/xfs_trace.h   |    8 +++
>  5 files changed, 125 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 631e7c0..1ff9df7 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -481,6 +481,12 @@ xfs_submit_ioend(
>  	struct xfs_ioend	*ioend,
>  	int			status)
>  {
> +	/* Convert CoW extents to regular */
> +	if (!status && ioend->io_type == XFS_IO_COW) {
> +		status = xfs_reflink_convert_cow(XFS_I(ioend->io_inode),
> +				ioend->io_offset, ioend->io_size);
> +	}
> +
>  	/* Reserve log space if we might write beyond the on-disk inode size. */
>  	if (!status &&
>  	    ioend->io_type != XFS_IO_UNWRITTEN &&
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 1aa3abd..767208f 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -685,7 +685,7 @@ xfs_iomap_write_allocate(
>  	int		nres;
>  
>  	if (whichfork == XFS_COW_FORK)
> -		flags |= XFS_BMAPI_COWFORK;
> +		flags |= XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
>  
>  	/*
>  	 * Make sure that the dquots are there.
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 07593a3..a2e1fff 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -82,11 +82,22 @@
>   * mappings are a reservation against the free space in the filesystem;
>   * adjacent mappings can also be combined into fewer larger mappings.
>   *
> + * As an optimization, the CoW extent size hint (cowextsz) creates
> + * outsized aligned delalloc reservations in the hope of landing out of
> + * order nearby CoW writes in a single extent on disk, thereby reducing
> + * fragmentation and improving future performance.
> + *
> + * D: --RRRRRRSSSRRRRRRRR--- (data fork)
> + * C: ------DDDDDDD--------- (CoW fork)
> + *
>   * When dirty pages are being written out (typically in writepage), the
> - * delalloc reservations are converted into real mappings by allocating
> - * blocks and replacing the delalloc mapping with real ones.  A delalloc
> - * mapping can be replaced by several real ones if the free space is
> - * fragmented.
> + * delalloc reservations are converted into unwritten mappings by
> + * allocating blocks and replacing the delalloc mapping with real ones.
> + * A delalloc mapping can be replaced by several unwritten ones if the
> + * free space is fragmented.
> + *
> + * D: --RRRRRRSSSRRRRRRRR---
> + * C: ------UUUUUUU---------
>   *
>   * We want to adapt the delalloc mechanism for copy-on-write, since the
>   * write paths are similar.  The first two steps (creating the reservation
> @@ -101,13 +112,29 @@
>   * Block-aligned directio writes will use the same mechanism as buffered
>   * writes.
>   *
> + * Just prior to submitting the actual disk write requests, we convert
> + * the extents representing the range of the file actually being written
> + * (as opposed to extra pieces created for the cowextsize hint) to real
> + * extents.  This will become important in the next step:
> + *
> + * D: --RRRRRRSSSRRRRRRRR---
> + * C: ------UUrrUUU---------
> + *
>   * CoW remapping must be done after the data block write completes,
>   * because we don't want to destroy the old data fork map until we're sure
>   * the new block has been written.  Since the new mappings are kept in a
>   * separate fork, we can simply iterate these mappings to find the ones
>   * that cover the file blocks that we just CoW'd.  For each extent, simply
>   * unmap the corresponding range in the data fork, map the new range into
> - * the data fork, and remove the extent from the CoW fork.
> + * the data fork, and remove the extent from the CoW fork.  Because of
> + * the presence of the cowextsize hint, however, we must be careful
> + * only to remap the blocks that we've actually written out --  we must
> + * never remap delalloc reservations nor CoW staging blocks that have
> + * yet to be written.  This corresponds exactly to the real extents in
> + * the CoW fork:
> + *
> + * D: --RRRRRRrrSRRRRRRRR---
> + * C: ------UU--UUU---------
>   *
>   * Since the remapping operation can be applied to an arbitrary file
>   * range, we record the need for the remap step as a flag in the ioend
> @@ -296,6 +323,66 @@ xfs_reflink_reserve_cow(
>  	return 0;
>  }
>  
> +/* Convert part of an unwritten CoW extent to a real one. */
> +STATIC int
> +__xfs_reflink_convert_cow(
> +	struct xfs_inode		*ip,
> +	struct xfs_bmbt_irec		*imap,
> +	xfs_fileoff_t			offset_fsb,
> +	xfs_filblks_t			count_fsb,
> +	struct xfs_defer_ops		*dfops)
> +{
> +	struct xfs_bmbt_irec		irec = *imap;
> +	xfs_fsblock_t			first_block;
> +	int				nimaps = 1;
> +
> +	xfs_trim_extent(&irec, offset_fsb, count_fsb);
> +	trace_xfs_reflink_convert_cow(ip, &irec);
> +	if (irec.br_blockcount == 0)
> +		return 0;
> +	return xfs_bmapi_write(NULL, ip, irec.br_startoff, irec.br_blockcount,
> +			XFS_BMAPI_COWFORK | XFS_BMAPI_CONVERT, &first_block,
> +			0, &irec, &nimaps, dfops);
> +}
> +
> +/* Convert all of the unwritten CoW extents in a file's range to real ones. */
> +int
> +xfs_reflink_convert_cow(
> +	struct xfs_inode	*ip,
> +	xfs_off_t		offset,
> +	xfs_off_t		count)
> +{
> +	struct xfs_bmbt_irec	got;
> +	struct xfs_defer_ops	dfops;
> +	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
> +	xfs_fileoff_t		offset_fsb;
> +	xfs_fileoff_t		end_fsb;
> +	xfs_extnum_t		idx;
> +	bool			found;
> +	int			error;
> +
> +	offset_fsb = XFS_B_TO_FSBT(mp, offset);
> +	end_fsb = XFS_B_TO_FSB(mp, offset + count);
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +
> +	/* Convert all the extents to real from unwritten. */
> +	for (found = xfs_iext_lookup_extent(ip, ifp, offset_fsb, &idx, &got);
> +	     found && got.br_startoff < end_fsb;
> +	     found = xfs_iext_get_extent(ifp, ++idx, &got)) {
> +		if (got.br_state == XFS_EXT_NORM)
> +			continue;
> +		error = __xfs_reflink_convert_cow(ip, &got, offset_fsb,
> +				end_fsb - offset_fsb, &dfops);
> +		if (error)
> +			break;
> +	}
> +
> +	/* Finish up. */
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	return error;
> +}
> +
>  /* Allocate all CoW reservations covering a range of blocks in a file. */
>  static int
>  __xfs_reflink_allocate_cow(
> @@ -328,6 +415,7 @@ __xfs_reflink_allocate_cow(
>  		goto out_unlock;
>  	ASSERT(nimaps == 1);
>  
> +	/* Make sure there's a CoW reservation for it. */
>  	error = xfs_reflink_reserve_cow(ip, &imap, &shared);
>  	if (error)
>  		goto out_trans_cancel;
> @@ -337,14 +425,16 @@ __xfs_reflink_allocate_cow(
>  		goto out_trans_cancel;
>  	}
>  
> +	/* Allocate the entire reservation as unwritten blocks. */
>  	xfs_trans_ijoin(tp, ip, 0);
>  	error = xfs_bmapi_write(tp, ip, imap.br_startoff, imap.br_blockcount,
> -			XFS_BMAPI_COWFORK, &first_block,
> +			XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC, &first_block,
>  			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK),
>  			&imap, &nimaps, &dfops);
>  	if (error)
>  		goto out_trans_cancel;
>  
> +	/* Finish up. */
>  	error = xfs_defer_finish(&tp, &dfops, NULL);
>  	if (error)
>  		goto out_trans_cancel;
> @@ -389,10 +479,13 @@ xfs_reflink_allocate_cow_range(
>  		if (error) {
>  			trace_xfs_reflink_allocate_cow_range_error(ip, error,
>  					_RET_IP_);
> -			break;
> +			goto out;
>  		}
>  	}
>  
> +	/* Convert the CoW extents to regular. */
> +	error = xfs_reflink_convert_cow(ip, offset, count);
> +out:
>  	return error;
>  }
>  
> @@ -641,6 +734,16 @@ xfs_reflink_end_cow(
>  
>  		ASSERT(!isnullstartblock(got.br_startblock));
>  
> +		/*
> +		 * Don't remap unwritten extents; these are
> +		 * speculatively preallocated CoW extents that have been
> +		 * allocated but have not yet been involved in a write.
> +		 */
> +		if (got.br_state == XFS_EXT_UNWRITTEN) {
> +			idx--;
> +			goto next_extent;
> +		}
> +
>  		/* Unmap the old blocks in the data fork. */
>  		xfs_defer_init(&dfops, &firstfsb);
>  		rlen = del.br_blockcount;
> diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> index aa6a4d6..1583c47 100644
> --- a/fs/xfs/xfs_reflink.h
> +++ b/fs/xfs/xfs_reflink.h
> @@ -30,6 +30,8 @@ extern int xfs_reflink_reserve_cow(struct xfs_inode *ip,
>  		struct xfs_bmbt_irec *imap, bool *shared);
>  extern int xfs_reflink_allocate_cow_range(struct xfs_inode *ip,
>  		xfs_off_t offset, xfs_off_t count);
> +extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset,
> +		xfs_off_t count);
>  extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
>  		struct xfs_bmbt_irec *imap);
>  extern void xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 69c5bcd..d3d11905 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -3089,6 +3089,7 @@ DECLARE_EVENT_CLASS(xfs_inode_irec_class,
>  		__field(xfs_fileoff_t, lblk)
>  		__field(xfs_extlen_t, len)
>  		__field(xfs_fsblock_t, pblk)
> +		__field(int, state)
>  	),
>  	TP_fast_assign(
>  		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> @@ -3096,13 +3097,15 @@ DECLARE_EVENT_CLASS(xfs_inode_irec_class,
>  		__entry->lblk = irec->br_startoff;
>  		__entry->len = irec->br_blockcount;
>  		__entry->pblk = irec->br_startblock;
> +		__entry->state = irec->br_state;
>  	),
> -	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x pblk %llu",
> +	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x pblk %llu st %d",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->ino,
>  		  __entry->lblk,
>  		  __entry->len,
> -		  __entry->pblk)
> +		  __entry->pblk,
> +		  __entry->state)
>  );
>  #define DEFINE_INODE_IREC_EVENT(name) \
>  DEFINE_EVENT(xfs_inode_irec_class, name, \
> @@ -3242,6 +3245,7 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_trim_around_shared);
>  DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_alloc);
>  DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_found);
>  DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_enospc);
> +DEFINE_INODE_IREC_EVENT(xfs_reflink_convert_cow);
>  
>  DEFINE_RW_EVENT(xfs_reflink_reserve_cow);
>  DEFINE_RW_EVENT(xfs_reflink_allocate_cow_range);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 7/7] xfs: mark speculative prealloc CoW fork extents unwritten
  2017-02-02 15:04     ` Brian Foster
@ 2017-02-02 17:04       ` Darrick J. Wong
  2017-02-02 19:42         ` Brian Foster
  0 siblings, 1 reply; 31+ messages in thread
From: Darrick J. Wong @ 2017-02-02 17:04 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Feb 02, 2017 at 10:04:35AM -0500, Brian Foster wrote:
> On Tue, Jan 31, 2017 at 06:36:17PM -0800, Darrick J. Wong wrote:
> > Christoph Hellwig pointed out that there's a potentially nasty race when
> > performing simultaneous nearby directio cow writes:
> > 
> > "Thread 1 writes a range from B to c
> > 
> > "                    B --------- C
> >                            p
> > 
> > "a little later thread 2 writes from A to B
> > 
> > "        A --------- B
> >                p
> > 
> > [editor's note: the 'p' denote cowextsize boundaries, which I added to
> > make this more clear]
> > 
> > "but the code preallocates beyond B into the range where thread
> > "1 has just written, but ->end_io hasn't been called yet.
> 
> I'm confused by the description. Wouldn't thread 1 have already
> allocated from B-C in the COW fork in order to send a write I/O? How
> could thread 2 then "preallocate into a range" of the COW fork that has
> been allocated by thread 1? (Also, wouldn't the ioend offset+size limit
> a COW remap to the range that is covered by the completed write?)

No, it hasn't, because ... oh let's make another diagram. :)

Let's say cowextsize = 8 blocks.

    0       1       2      3
    0123456701234567012345701234567

s = shared real blocks, R = unshared original real blocks,
C = cow reservation, u = unwritten blocks, r = replacement real blocks
d = where our dio write happens

D:  ssssRRRRRRRRsssssssssssssssssss
C:  -------------------------------

Blocks 0-3 and 12-31 are shared, blocks 4-11 are unshared.

Now let's say that thread 1 tries a directio write for blocks 6-14:

t1: ------ddddddddd----------------

The whole directio context gets marked IOMAP_DIO_COW because the last 3
blocks (12-14) have to be CoWed, and a cowextsz reservation gets created
for blocks 8-16.  However, a cowextsz reservation is not created for
blocks 0-7 because blocks 6-7 aren't shared.

D:  ssssRRRRRRRRsssssssssssssssssss
C:  --------CCCCCCCC---------------
t1: ------ddddddddd----------------

The key here is that we don't look far enough back (or forward) in the
data fork to realize that part of our dio write lies within a potential
cowextsize region that has shared blocks, even though none the parts of
that segment that this directio is writing to are themselves shared.

In other words, we see that blocks 6-7 are not shared but we do not look
at blocks 0-5 to realize that some of those are shared.  Therefore we
don't create a cowextsize reservation for blocks 0-7.  Furthermore, if
we never write blocks 0-3, then promoting blocks 6-7 to a CoW is
unnecessary.  It's unclear to me if we should do that anyway, or just
leave it.

Before issuing the write bios, we convert the reservation to a real
extent:

D:  ssssRRRRRRRRRRsssssssssssssssss
C:  --------rrrrrrrr---------------
t1: ------ddddddddd----------------

Then thread 1 issues a write to the original blocks 6-7 and to the
replacement blocks 8-14.

Now let's say thread 2 comes along and wants to write to block 0:

D:  ssssRRRRRRRRRRsssssssssssssssss
C:  --------rrrrrrrr---------------
t1: ------ddddddddd----------------
t2: d------------------------------

Since block 0 is shared, we create a cow reservation for blocks 0-7:

D:  ssssRRRRRRRRRRsssssssssssssssss
C:  rrrrrrrrrrrrrrrr---------------
t1: ------ddddddddd----------------
t2: d------------------------------

Thread 1's write now finishes.  The directio context is marked COW, so
it remaps anything it finds in the cow fork between blocks 6 and 14:

D:  ssssRRrrrrrrrrrssssssssssssssss
C:  rrrrrr---------r---------------
t1: ------ddddddddd----------------
t2: d------------------------------

Oh no!  Thread 1 wrote to original blocks 6-7 and then unmapped them!
Worse yet it replaced them with whatever allocation the CoW fork made,
so we've inserted stale data blocks into the original file!

--D

> 
> Brian
> 
> > "But once ->end_io is called thread 2 has already allocated
> > "up to the extent size hint into the write range of thread 1,
> > "so the end_io handler will splice the unintialized blocks from
> > "that preallocation back into the file right after B."
> > 
> > We can avoid this race by ensuring that thread 1 cannot accidentally
> > remap the blocks that thread 2 allocated (as part of speculative
> > preallocation) as part of t2's write preparation in t1's end_io handler.
> > The way we make this happen is by taking advantage of the unwritten
> > extent flag as an intermediate step.
> > 
> > Recall that when we begin the process of writing data to shared blocks,
> > we create a delayed allocation extent in the CoW fork:
> > 
> > D: --RRRRRRSSSRRRRRRRR---
> > C: ------DDDDDDD---------
> > 
> > When a thread prepares to CoW some dirty data out to disk, it will now
> > convert the delalloc reservation into an /unwritten/ allocated extent in
> > the cow fork.  The da conversion code tries to opportunistically
> > allocate as much of a (speculatively prealloc'd) extent as possible, so
> > we may end up allocating a larger extent than we're actually writing
> > out:
> > 
> > D: --RRRRRRSSSRRRRRRRR---
> > U: ------UUUUUUU---------
> > 
> > Next, we convert only the part of the extent that we're actively
> > planning to write to normal (i.e. not unwritten) status:
> > 
> > D: --RRRRRRSSSRRRRRRRR---
> > U: ------UURRUUU---------
> > 
> > If the write succeeds, the end_cow function will now scan the relevant
> > range of the CoW fork for real extents and remap only the real extents
> > into the data fork:
> > 
> > D: --RRRRRRRRSRRRRRRRR---
> > U: ------UU--UUU---------
> > 
> > This ensures that we never obliterate valid data fork extents with
> > unwritten blocks from the CoW fork.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> > v2: We don't update anything on disk for the CoW unwritten extent
> > conversion, so don't allocate a transaction.
> > ---
> >  fs/xfs/xfs_aops.c    |    6 +++
> >  fs/xfs/xfs_iomap.c   |    2 -
> >  fs/xfs/xfs_reflink.c |  117 +++++++++++++++++++++++++++++++++++++++++++++++---
> >  fs/xfs/xfs_reflink.h |    2 +
> >  fs/xfs/xfs_trace.h   |    8 +++
> >  5 files changed, 125 insertions(+), 10 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index 631e7c0..1ff9df7 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -481,6 +481,12 @@ xfs_submit_ioend(
> >  	struct xfs_ioend	*ioend,
> >  	int			status)
> >  {
> > +	/* Convert CoW extents to regular */
> > +	if (!status && ioend->io_type == XFS_IO_COW) {
> > +		status = xfs_reflink_convert_cow(XFS_I(ioend->io_inode),
> > +				ioend->io_offset, ioend->io_size);
> > +	}
> > +
> >  	/* Reserve log space if we might write beyond the on-disk inode size. */
> >  	if (!status &&
> >  	    ioend->io_type != XFS_IO_UNWRITTEN &&
> > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > index 1aa3abd..767208f 100644
> > --- a/fs/xfs/xfs_iomap.c
> > +++ b/fs/xfs/xfs_iomap.c
> > @@ -685,7 +685,7 @@ xfs_iomap_write_allocate(
> >  	int		nres;
> >  
> >  	if (whichfork == XFS_COW_FORK)
> > -		flags |= XFS_BMAPI_COWFORK;
> > +		flags |= XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
> >  
> >  	/*
> >  	 * Make sure that the dquots are there.
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index 07593a3..a2e1fff 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -82,11 +82,22 @@
> >   * mappings are a reservation against the free space in the filesystem;
> >   * adjacent mappings can also be combined into fewer larger mappings.
> >   *
> > + * As an optimization, the CoW extent size hint (cowextsz) creates
> > + * outsized aligned delalloc reservations in the hope of landing out of
> > + * order nearby CoW writes in a single extent on disk, thereby reducing
> > + * fragmentation and improving future performance.
> > + *
> > + * D: --RRRRRRSSSRRRRRRRR--- (data fork)
> > + * C: ------DDDDDDD--------- (CoW fork)
> > + *
> >   * When dirty pages are being written out (typically in writepage), the
> > - * delalloc reservations are converted into real mappings by allocating
> > - * blocks and replacing the delalloc mapping with real ones.  A delalloc
> > - * mapping can be replaced by several real ones if the free space is
> > - * fragmented.
> > + * delalloc reservations are converted into unwritten mappings by
> > + * allocating blocks and replacing the delalloc mapping with real ones.
> > + * A delalloc mapping can be replaced by several unwritten ones if the
> > + * free space is fragmented.
> > + *
> > + * D: --RRRRRRSSSRRRRRRRR---
> > + * C: ------UUUUUUU---------
> >   *
> >   * We want to adapt the delalloc mechanism for copy-on-write, since the
> >   * write paths are similar.  The first two steps (creating the reservation
> > @@ -101,13 +112,29 @@
> >   * Block-aligned directio writes will use the same mechanism as buffered
> >   * writes.
> >   *
> > + * Just prior to submitting the actual disk write requests, we convert
> > + * the extents representing the range of the file actually being written
> > + * (as opposed to extra pieces created for the cowextsize hint) to real
> > + * extents.  This will become important in the next step:
> > + *
> > + * D: --RRRRRRSSSRRRRRRRR---
> > + * C: ------UUrrUUU---------
> > + *
> >   * CoW remapping must be done after the data block write completes,
> >   * because we don't want to destroy the old data fork map until we're sure
> >   * the new block has been written.  Since the new mappings are kept in a
> >   * separate fork, we can simply iterate these mappings to find the ones
> >   * that cover the file blocks that we just CoW'd.  For each extent, simply
> >   * unmap the corresponding range in the data fork, map the new range into
> > - * the data fork, and remove the extent from the CoW fork.
> > + * the data fork, and remove the extent from the CoW fork.  Because of
> > + * the presence of the cowextsize hint, however, we must be careful
> > + * only to remap the blocks that we've actually written out --  we must
> > + * never remap delalloc reservations nor CoW staging blocks that have
> > + * yet to be written.  This corresponds exactly to the real extents in
> > + * the CoW fork:
> > + *
> > + * D: --RRRRRRrrSRRRRRRRR---
> > + * C: ------UU--UUU---------
> >   *
> >   * Since the remapping operation can be applied to an arbitrary file
> >   * range, we record the need for the remap step as a flag in the ioend
> > @@ -296,6 +323,66 @@ xfs_reflink_reserve_cow(
> >  	return 0;
> >  }
> >  
> > +/* Convert part of an unwritten CoW extent to a real one. */
> > +STATIC int
> > +__xfs_reflink_convert_cow(
> > +	struct xfs_inode		*ip,
> > +	struct xfs_bmbt_irec		*imap,
> > +	xfs_fileoff_t			offset_fsb,
> > +	xfs_filblks_t			count_fsb,
> > +	struct xfs_defer_ops		*dfops)
> > +{
> > +	struct xfs_bmbt_irec		irec = *imap;
> > +	xfs_fsblock_t			first_block;
> > +	int				nimaps = 1;
> > +
> > +	xfs_trim_extent(&irec, offset_fsb, count_fsb);
> > +	trace_xfs_reflink_convert_cow(ip, &irec);
> > +	if (irec.br_blockcount == 0)
> > +		return 0;
> > +	return xfs_bmapi_write(NULL, ip, irec.br_startoff, irec.br_blockcount,
> > +			XFS_BMAPI_COWFORK | XFS_BMAPI_CONVERT, &first_block,
> > +			0, &irec, &nimaps, dfops);
> > +}
> > +
> > +/* Convert all of the unwritten CoW extents in a file's range to real ones. */
> > +int
> > +xfs_reflink_convert_cow(
> > +	struct xfs_inode	*ip,
> > +	xfs_off_t		offset,
> > +	xfs_off_t		count)
> > +{
> > +	struct xfs_bmbt_irec	got;
> > +	struct xfs_defer_ops	dfops;
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
> > +	xfs_fileoff_t		offset_fsb;
> > +	xfs_fileoff_t		end_fsb;
> > +	xfs_extnum_t		idx;
> > +	bool			found;
> > +	int			error;
> > +
> > +	offset_fsb = XFS_B_TO_FSBT(mp, offset);
> > +	end_fsb = XFS_B_TO_FSB(mp, offset + count);
> > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > +
> > +	/* Convert all the extents to real from unwritten. */
> > +	for (found = xfs_iext_lookup_extent(ip, ifp, offset_fsb, &idx, &got);
> > +	     found && got.br_startoff < end_fsb;
> > +	     found = xfs_iext_get_extent(ifp, ++idx, &got)) {
> > +		if (got.br_state == XFS_EXT_NORM)
> > +			continue;
> > +		error = __xfs_reflink_convert_cow(ip, &got, offset_fsb,
> > +				end_fsb - offset_fsb, &dfops);
> > +		if (error)
> > +			break;
> > +	}
> > +
> > +	/* Finish up. */
> > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > +	return error;
> > +}
> > +
> >  /* Allocate all CoW reservations covering a range of blocks in a file. */
> >  static int
> >  __xfs_reflink_allocate_cow(
> > @@ -328,6 +415,7 @@ __xfs_reflink_allocate_cow(
> >  		goto out_unlock;
> >  	ASSERT(nimaps == 1);
> >  
> > +	/* Make sure there's a CoW reservation for it. */
> >  	error = xfs_reflink_reserve_cow(ip, &imap, &shared);
> >  	if (error)
> >  		goto out_trans_cancel;
> > @@ -337,14 +425,16 @@ __xfs_reflink_allocate_cow(
> >  		goto out_trans_cancel;
> >  	}
> >  
> > +	/* Allocate the entire reservation as unwritten blocks. */
> >  	xfs_trans_ijoin(tp, ip, 0);
> >  	error = xfs_bmapi_write(tp, ip, imap.br_startoff, imap.br_blockcount,
> > -			XFS_BMAPI_COWFORK, &first_block,
> > +			XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC, &first_block,
> >  			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK),
> >  			&imap, &nimaps, &dfops);
> >  	if (error)
> >  		goto out_trans_cancel;
> >  
> > +	/* Finish up. */
> >  	error = xfs_defer_finish(&tp, &dfops, NULL);
> >  	if (error)
> >  		goto out_trans_cancel;
> > @@ -389,10 +479,13 @@ xfs_reflink_allocate_cow_range(
> >  		if (error) {
> >  			trace_xfs_reflink_allocate_cow_range_error(ip, error,
> >  					_RET_IP_);
> > -			break;
> > +			goto out;
> >  		}
> >  	}
> >  
> > +	/* Convert the CoW extents to regular. */
> > +	error = xfs_reflink_convert_cow(ip, offset, count);
> > +out:
> >  	return error;
> >  }
> >  
> > @@ -641,6 +734,16 @@ xfs_reflink_end_cow(
> >  
> >  		ASSERT(!isnullstartblock(got.br_startblock));
> >  
> > +		/*
> > +		 * Don't remap unwritten extents; these are
> > +		 * speculatively preallocated CoW extents that have been
> > +		 * allocated but have not yet been involved in a write.
> > +		 */
> > +		if (got.br_state == XFS_EXT_UNWRITTEN) {
> > +			idx--;
> > +			goto next_extent;
> > +		}
> > +
> >  		/* Unmap the old blocks in the data fork. */
> >  		xfs_defer_init(&dfops, &firstfsb);
> >  		rlen = del.br_blockcount;
> > diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> > index aa6a4d6..1583c47 100644
> > --- a/fs/xfs/xfs_reflink.h
> > +++ b/fs/xfs/xfs_reflink.h
> > @@ -30,6 +30,8 @@ extern int xfs_reflink_reserve_cow(struct xfs_inode *ip,
> >  		struct xfs_bmbt_irec *imap, bool *shared);
> >  extern int xfs_reflink_allocate_cow_range(struct xfs_inode *ip,
> >  		xfs_off_t offset, xfs_off_t count);
> > +extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset,
> > +		xfs_off_t count);
> >  extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
> >  		struct xfs_bmbt_irec *imap);
> >  extern void xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index 69c5bcd..d3d11905 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -3089,6 +3089,7 @@ DECLARE_EVENT_CLASS(xfs_inode_irec_class,
> >  		__field(xfs_fileoff_t, lblk)
> >  		__field(xfs_extlen_t, len)
> >  		__field(xfs_fsblock_t, pblk)
> > +		__field(int, state)
> >  	),
> >  	TP_fast_assign(
> >  		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> > @@ -3096,13 +3097,15 @@ DECLARE_EVENT_CLASS(xfs_inode_irec_class,
> >  		__entry->lblk = irec->br_startoff;
> >  		__entry->len = irec->br_blockcount;
> >  		__entry->pblk = irec->br_startblock;
> > +		__entry->state = irec->br_state;
> >  	),
> > -	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x pblk %llu",
> > +	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x pblk %llu st %d",
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> >  		  __entry->ino,
> >  		  __entry->lblk,
> >  		  __entry->len,
> > -		  __entry->pblk)
> > +		  __entry->pblk,
> > +		  __entry->state)
> >  );
> >  #define DEFINE_INODE_IREC_EVENT(name) \
> >  DEFINE_EVENT(xfs_inode_irec_class, name, \
> > @@ -3242,6 +3245,7 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_trim_around_shared);
> >  DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_alloc);
> >  DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_found);
> >  DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_enospc);
> > +DEFINE_INODE_IREC_EVENT(xfs_reflink_convert_cow);
> >  
> >  DEFINE_RW_EVENT(xfs_reflink_reserve_cow);
> >  DEFINE_RW_EVENT(xfs_reflink_allocate_cow_range);
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 7/7] xfs: mark speculative prealloc CoW fork extents unwritten
  2017-02-02 17:04       ` Darrick J. Wong
@ 2017-02-02 19:42         ` Brian Foster
  0 siblings, 0 replies; 31+ messages in thread
From: Brian Foster @ 2017-02-02 19:42 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Feb 02, 2017 at 09:04:31AM -0800, Darrick J. Wong wrote:
> On Thu, Feb 02, 2017 at 10:04:35AM -0500, Brian Foster wrote:
> > On Tue, Jan 31, 2017 at 06:36:17PM -0800, Darrick J. Wong wrote:
> > > Christoph Hellwig pointed out that there's a potentially nasty race when
> > > performing simultaneous nearby directio cow writes:
> > > 
> > > "Thread 1 writes a range from B to c
> > > 
> > > "                    B --------- C
> > >                            p
> > > 
> > > "a little later thread 2 writes from A to B
> > > 
> > > "        A --------- B
> > >                p
> > > 
> > > [editor's note: the 'p' denote cowextsize boundaries, which I added to
> > > make this more clear]
> > > 
> > > "but the code preallocates beyond B into the range where thread
> > > "1 has just written, but ->end_io hasn't been called yet.
> > 
> > I'm confused by the description. Wouldn't thread 1 have already
> > allocated from B-C in the COW fork in order to send a write I/O? How
> > could thread 2 then "preallocate into a range" of the COW fork that has
> > been allocated by thread 1? (Also, wouldn't the ioend offset+size limit
> > a COW remap to the range that is covered by the completed write?)
> 
> No, it hasn't, because ... oh let's make another diagram. :)
> 
> Let's say cowextsize = 8 blocks.
> 
>     0       1       2      3
>     0123456701234567012345701234567
> 
> s = shared real blocks, R = unshared original real blocks,
> C = cow reservation, u = unwritten blocks, r = replacement real blocks
> d = where our dio write happens
> 
> D:  ssssRRRRRRRRsssssssssssssssssss
> C:  -------------------------------
> 
> Blocks 0-3 and 12-31 are shared, blocks 4-11 are unshared.
> 
> Now let's say that thread 1 tries a directio write for blocks 6-14:
> 
> t1: ------ddddddddd----------------
> 
> The whole directio context gets marked IOMAP_DIO_COW because the last 3
> blocks (12-14) have to be CoWed, and a cowextsz reservation gets created
> for blocks 8-16.  However, a cowextsz reservation is not created for
> blocks 0-7 because blocks 6-7 aren't shared.
> 

Ah, I see. This is the bit I was missing. For some reason I was thinking
we'd get a separate end_io between the regular blocks and the shared
blocks. The rest makes more sense knowing that we have one dio context
covering an I/O range split across the data and cow forks, and the only
thing to otherwise tell us what was written to the cow fork (and thus
needs to be remapped into the data fork on completion) is the existence
of those blocks in the cow fork. Thanks for the explanation.

Brian

> D:  ssssRRRRRRRRsssssssssssssssssss
> C:  --------CCCCCCCC---------------
> t1: ------ddddddddd----------------
> 
> The key here is that we don't look far enough back (or forward) in the
> data fork to realize that part of our dio write lies within a potential
> cowextsize region that has shared blocks, even though none the parts of
> that segment that this directio is writing to are themselves shared.
> 
> In other words, we see that blocks 6-7 are not shared but we do not look
> at blocks 0-5 to realize that some of those are shared.  Therefore we
> don't create a cowextsize reservation for blocks 0-7.  Furthermore, if
> we never write blocks 0-3, then promoting blocks 6-7 to a CoW is
> unnecessary.  It's unclear to me if we should do that anyway, or just
> leave it.
> 
> Before issuing the write bios, we convert the reservation to a real
> extent:
> 
> D:  ssssRRRRRRRRRRsssssssssssssssss
> C:  --------rrrrrrrr---------------
> t1: ------ddddddddd----------------
> 
> Then thread 1 issues a write to the original blocks 6-7 and to the
> replacement blocks 8-14.
> 
> Now let's say thread 2 comes along and wants to write to block 0:
> 
> D:  ssssRRRRRRRRRRsssssssssssssssss
> C:  --------rrrrrrrr---------------
> t1: ------ddddddddd----------------
> t2: d------------------------------
> 
> Since block 0 is shared, we create a cow reservation for blocks 0-7:
> 
> D:  ssssRRRRRRRRRRsssssssssssssssss
> C:  rrrrrrrrrrrrrrrr---------------
> t1: ------ddddddddd----------------
> t2: d------------------------------
> 
> Thread 1's write now finishes.  The directio context is marked COW, so
> it remaps anything it finds in the cow fork between blocks 6 and 14:
> 
> D:  ssssRRrrrrrrrrrssssssssssssssss
> C:  rrrrrr---------r---------------
> t1: ------ddddddddd----------------
> t2: d------------------------------
> 
> Oh no!  Thread 1 wrote to original blocks 6-7 and then unmapped them!
> Worse yet it replaced them with whatever allocation the CoW fork made,
> so we've inserted stale data blocks into the original file!
> 
> --D
> 
> > 
> > Brian
> > 
> > > "But once ->end_io is called thread 2 has already allocated
> > > "up to the extent size hint into the write range of thread 1,
> > > "so the end_io handler will splice the unintialized blocks from
> > > "that preallocation back into the file right after B."
> > > 
> > > We can avoid this race by ensuring that thread 1 cannot accidentally
> > > remap the blocks that thread 2 allocated (as part of speculative
> > > preallocation) as part of t2's write preparation in t1's end_io handler.
> > > The way we make this happen is by taking advantage of the unwritten
> > > extent flag as an intermediate step.
> > > 
> > > Recall that when we begin the process of writing data to shared blocks,
> > > we create a delayed allocation extent in the CoW fork:
> > > 
> > > D: --RRRRRRSSSRRRRRRRR---
> > > C: ------DDDDDDD---------
> > > 
> > > When a thread prepares to CoW some dirty data out to disk, it will now
> > > convert the delalloc reservation into an /unwritten/ allocated extent in
> > > the cow fork.  The da conversion code tries to opportunistically
> > > allocate as much of a (speculatively prealloc'd) extent as possible, so
> > > we may end up allocating a larger extent than we're actually writing
> > > out:
> > > 
> > > D: --RRRRRRSSSRRRRRRRR---
> > > U: ------UUUUUUU---------
> > > 
> > > Next, we convert only the part of the extent that we're actively
> > > planning to write to normal (i.e. not unwritten) status:
> > > 
> > > D: --RRRRRRSSSRRRRRRRR---
> > > U: ------UURRUUU---------
> > > 
> > > If the write succeeds, the end_cow function will now scan the relevant
> > > range of the CoW fork for real extents and remap only the real extents
> > > into the data fork:
> > > 
> > > D: --RRRRRRRRSRRRRRRRR---
> > > U: ------UU--UUU---------
> > > 
> > > This ensures that we never obliterate valid data fork extents with
> > > unwritten blocks from the CoW fork.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > > v2: We don't update anything on disk for the CoW unwritten extent
> > > conversion, so don't allocate a transaction.
> > > ---
> > >  fs/xfs/xfs_aops.c    |    6 +++
> > >  fs/xfs/xfs_iomap.c   |    2 -
> > >  fs/xfs/xfs_reflink.c |  117 +++++++++++++++++++++++++++++++++++++++++++++++---
> > >  fs/xfs/xfs_reflink.h |    2 +
> > >  fs/xfs/xfs_trace.h   |    8 +++
> > >  5 files changed, 125 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > index 631e7c0..1ff9df7 100644
> > > --- a/fs/xfs/xfs_aops.c
> > > +++ b/fs/xfs/xfs_aops.c
> > > @@ -481,6 +481,12 @@ xfs_submit_ioend(
> > >  	struct xfs_ioend	*ioend,
> > >  	int			status)
> > >  {
> > > +	/* Convert CoW extents to regular */
> > > +	if (!status && ioend->io_type == XFS_IO_COW) {
> > > +		status = xfs_reflink_convert_cow(XFS_I(ioend->io_inode),
> > > +				ioend->io_offset, ioend->io_size);
> > > +	}
> > > +
> > >  	/* Reserve log space if we might write beyond the on-disk inode size. */
> > >  	if (!status &&
> > >  	    ioend->io_type != XFS_IO_UNWRITTEN &&
> > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > index 1aa3abd..767208f 100644
> > > --- a/fs/xfs/xfs_iomap.c
> > > +++ b/fs/xfs/xfs_iomap.c
> > > @@ -685,7 +685,7 @@ xfs_iomap_write_allocate(
> > >  	int		nres;
> > >  
> > >  	if (whichfork == XFS_COW_FORK)
> > > -		flags |= XFS_BMAPI_COWFORK;
> > > +		flags |= XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
> > >  
> > >  	/*
> > >  	 * Make sure that the dquots are there.
> > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > index 07593a3..a2e1fff 100644
> > > --- a/fs/xfs/xfs_reflink.c
> > > +++ b/fs/xfs/xfs_reflink.c
> > > @@ -82,11 +82,22 @@
> > >   * mappings are a reservation against the free space in the filesystem;
> > >   * adjacent mappings can also be combined into fewer larger mappings.
> > >   *
> > > + * As an optimization, the CoW extent size hint (cowextsz) creates
> > > + * outsized aligned delalloc reservations in the hope of landing out of
> > > + * order nearby CoW writes in a single extent on disk, thereby reducing
> > > + * fragmentation and improving future performance.
> > > + *
> > > + * D: --RRRRRRSSSRRRRRRRR--- (data fork)
> > > + * C: ------DDDDDDD--------- (CoW fork)
> > > + *
> > >   * When dirty pages are being written out (typically in writepage), the
> > > - * delalloc reservations are converted into real mappings by allocating
> > > - * blocks and replacing the delalloc mapping with real ones.  A delalloc
> > > - * mapping can be replaced by several real ones if the free space is
> > > - * fragmented.
> > > + * delalloc reservations are converted into unwritten mappings by
> > > + * allocating blocks and replacing the delalloc mapping with real ones.
> > > + * A delalloc mapping can be replaced by several unwritten ones if the
> > > + * free space is fragmented.
> > > + *
> > > + * D: --RRRRRRSSSRRRRRRRR---
> > > + * C: ------UUUUUUU---------
> > >   *
> > >   * We want to adapt the delalloc mechanism for copy-on-write, since the
> > >   * write paths are similar.  The first two steps (creating the reservation
> > > @@ -101,13 +112,29 @@
> > >   * Block-aligned directio writes will use the same mechanism as buffered
> > >   * writes.
> > >   *
> > > + * Just prior to submitting the actual disk write requests, we convert
> > > + * the extents representing the range of the file actually being written
> > > + * (as opposed to extra pieces created for the cowextsize hint) to real
> > > + * extents.  This will become important in the next step:
> > > + *
> > > + * D: --RRRRRRSSSRRRRRRRR---
> > > + * C: ------UUrrUUU---------
> > > + *
> > >   * CoW remapping must be done after the data block write completes,
> > >   * because we don't want to destroy the old data fork map until we're sure
> > >   * the new block has been written.  Since the new mappings are kept in a
> > >   * separate fork, we can simply iterate these mappings to find the ones
> > >   * that cover the file blocks that we just CoW'd.  For each extent, simply
> > >   * unmap the corresponding range in the data fork, map the new range into
> > > - * the data fork, and remove the extent from the CoW fork.
> > > + * the data fork, and remove the extent from the CoW fork.  Because of
> > > + * the presence of the cowextsize hint, however, we must be careful
> > > + * only to remap the blocks that we've actually written out --  we must
> > > + * never remap delalloc reservations nor CoW staging blocks that have
> > > + * yet to be written.  This corresponds exactly to the real extents in
> > > + * the CoW fork:
> > > + *
> > > + * D: --RRRRRRrrSRRRRRRRR---
> > > + * C: ------UU--UUU---------
> > >   *
> > >   * Since the remapping operation can be applied to an arbitrary file
> > >   * range, we record the need for the remap step as a flag in the ioend
> > > @@ -296,6 +323,66 @@ xfs_reflink_reserve_cow(
> > >  	return 0;
> > >  }
> > >  
> > > +/* Convert part of an unwritten CoW extent to a real one. */
> > > +STATIC int
> > > +__xfs_reflink_convert_cow(
> > > +	struct xfs_inode		*ip,
> > > +	struct xfs_bmbt_irec		*imap,
> > > +	xfs_fileoff_t			offset_fsb,
> > > +	xfs_filblks_t			count_fsb,
> > > +	struct xfs_defer_ops		*dfops)
> > > +{
> > > +	struct xfs_bmbt_irec		irec = *imap;
> > > +	xfs_fsblock_t			first_block;
> > > +	int				nimaps = 1;
> > > +
> > > +	xfs_trim_extent(&irec, offset_fsb, count_fsb);
> > > +	trace_xfs_reflink_convert_cow(ip, &irec);
> > > +	if (irec.br_blockcount == 0)
> > > +		return 0;
> > > +	return xfs_bmapi_write(NULL, ip, irec.br_startoff, irec.br_blockcount,
> > > +			XFS_BMAPI_COWFORK | XFS_BMAPI_CONVERT, &first_block,
> > > +			0, &irec, &nimaps, dfops);
> > > +}
> > > +
> > > +/* Convert all of the unwritten CoW extents in a file's range to real ones. */
> > > +int
> > > +xfs_reflink_convert_cow(
> > > +	struct xfs_inode	*ip,
> > > +	xfs_off_t		offset,
> > > +	xfs_off_t		count)
> > > +{
> > > +	struct xfs_bmbt_irec	got;
> > > +	struct xfs_defer_ops	dfops;
> > > +	struct xfs_mount	*mp = ip->i_mount;
> > > +	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
> > > +	xfs_fileoff_t		offset_fsb;
> > > +	xfs_fileoff_t		end_fsb;
> > > +	xfs_extnum_t		idx;
> > > +	bool			found;
> > > +	int			error;
> > > +
> > > +	offset_fsb = XFS_B_TO_FSBT(mp, offset);
> > > +	end_fsb = XFS_B_TO_FSB(mp, offset + count);
> > > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > +
> > > +	/* Convert all the extents to real from unwritten. */
> > > +	for (found = xfs_iext_lookup_extent(ip, ifp, offset_fsb, &idx, &got);
> > > +	     found && got.br_startoff < end_fsb;
> > > +	     found = xfs_iext_get_extent(ifp, ++idx, &got)) {
> > > +		if (got.br_state == XFS_EXT_NORM)
> > > +			continue;
> > > +		error = __xfs_reflink_convert_cow(ip, &got, offset_fsb,
> > > +				end_fsb - offset_fsb, &dfops);
> > > +		if (error)
> > > +			break;
> > > +	}
> > > +
> > > +	/* Finish up. */
> > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > +	return error;
> > > +}
> > > +
> > >  /* Allocate all CoW reservations covering a range of blocks in a file. */
> > >  static int
> > >  __xfs_reflink_allocate_cow(
> > > @@ -328,6 +415,7 @@ __xfs_reflink_allocate_cow(
> > >  		goto out_unlock;
> > >  	ASSERT(nimaps == 1);
> > >  
> > > +	/* Make sure there's a CoW reservation for it. */
> > >  	error = xfs_reflink_reserve_cow(ip, &imap, &shared);
> > >  	if (error)
> > >  		goto out_trans_cancel;
> > > @@ -337,14 +425,16 @@ __xfs_reflink_allocate_cow(
> > >  		goto out_trans_cancel;
> > >  	}
> > >  
> > > +	/* Allocate the entire reservation as unwritten blocks. */
> > >  	xfs_trans_ijoin(tp, ip, 0);
> > >  	error = xfs_bmapi_write(tp, ip, imap.br_startoff, imap.br_blockcount,
> > > -			XFS_BMAPI_COWFORK, &first_block,
> > > +			XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC, &first_block,
> > >  			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK),
> > >  			&imap, &nimaps, &dfops);
> > >  	if (error)
> > >  		goto out_trans_cancel;
> > >  
> > > +	/* Finish up. */
> > >  	error = xfs_defer_finish(&tp, &dfops, NULL);
> > >  	if (error)
> > >  		goto out_trans_cancel;
> > > @@ -389,10 +479,13 @@ xfs_reflink_allocate_cow_range(
> > >  		if (error) {
> > >  			trace_xfs_reflink_allocate_cow_range_error(ip, error,
> > >  					_RET_IP_);
> > > -			break;
> > > +			goto out;
> > >  		}
> > >  	}
> > >  
> > > +	/* Convert the CoW extents to regular. */
> > > +	error = xfs_reflink_convert_cow(ip, offset, count);
> > > +out:
> > >  	return error;
> > >  }
> > >  
> > > @@ -641,6 +734,16 @@ xfs_reflink_end_cow(
> > >  
> > >  		ASSERT(!isnullstartblock(got.br_startblock));
> > >  
> > > +		/*
> > > +		 * Don't remap unwritten extents; these are
> > > +		 * speculatively preallocated CoW extents that have been
> > > +		 * allocated but have not yet been involved in a write.
> > > +		 */
> > > +		if (got.br_state == XFS_EXT_UNWRITTEN) {
> > > +			idx--;
> > > +			goto next_extent;
> > > +		}
> > > +
> > >  		/* Unmap the old blocks in the data fork. */
> > >  		xfs_defer_init(&dfops, &firstfsb);
> > >  		rlen = del.br_blockcount;
> > > diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> > > index aa6a4d6..1583c47 100644
> > > --- a/fs/xfs/xfs_reflink.h
> > > +++ b/fs/xfs/xfs_reflink.h
> > > @@ -30,6 +30,8 @@ extern int xfs_reflink_reserve_cow(struct xfs_inode *ip,
> > >  		struct xfs_bmbt_irec *imap, bool *shared);
> > >  extern int xfs_reflink_allocate_cow_range(struct xfs_inode *ip,
> > >  		xfs_off_t offset, xfs_off_t count);
> > > +extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset,
> > > +		xfs_off_t count);
> > >  extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
> > >  		struct xfs_bmbt_irec *imap);
> > >  extern void xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
> > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > index 69c5bcd..d3d11905 100644
> > > --- a/fs/xfs/xfs_trace.h
> > > +++ b/fs/xfs/xfs_trace.h
> > > @@ -3089,6 +3089,7 @@ DECLARE_EVENT_CLASS(xfs_inode_irec_class,
> > >  		__field(xfs_fileoff_t, lblk)
> > >  		__field(xfs_extlen_t, len)
> > >  		__field(xfs_fsblock_t, pblk)
> > > +		__field(int, state)
> > >  	),
> > >  	TP_fast_assign(
> > >  		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> > > @@ -3096,13 +3097,15 @@ DECLARE_EVENT_CLASS(xfs_inode_irec_class,
> > >  		__entry->lblk = irec->br_startoff;
> > >  		__entry->len = irec->br_blockcount;
> > >  		__entry->pblk = irec->br_startblock;
> > > +		__entry->state = irec->br_state;
> > >  	),
> > > -	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x pblk %llu",
> > > +	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x pblk %llu st %d",
> > >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > >  		  __entry->ino,
> > >  		  __entry->lblk,
> > >  		  __entry->len,
> > > -		  __entry->pblk)
> > > +		  __entry->pblk,
> > > +		  __entry->state)
> > >  );
> > >  #define DEFINE_INODE_IREC_EVENT(name) \
> > >  DEFINE_EVENT(xfs_inode_irec_class, name, \
> > > @@ -3242,6 +3245,7 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_trim_around_shared);
> > >  DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_alloc);
> > >  DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_found);
> > >  DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_enospc);
> > > +DEFINE_INODE_IREC_EVENT(xfs_reflink_convert_cow);
> > >  
> > >  DEFINE_RW_EVENT(xfs_reflink_reserve_cow);
> > >  DEFINE_RW_EVENT(xfs_reflink_allocate_cow_range);
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2017-02-02 19:42 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-31  0:23 [PATCH 0/7] xfs: random fixes Darrick J. Wong
2017-01-31  0:23 ` [PATCH 1/7] xfs: fix toctou race when locking an inode to access the data map Darrick J. Wong
2017-01-31  3:01   ` Eric Sandeen
2017-01-31 13:26   ` Christoph Hellwig
2017-01-31 19:45     ` Darrick J. Wong
2017-01-31 21:40       ` Darrick J. Wong
2017-02-01  2:34   ` [PATCH v2 " Darrick J. Wong
2017-02-01 14:48     ` Christoph Hellwig
2017-01-31  0:23 ` [PATCH 2/7] xfs: fail _dir_open when readahead fails Darrick J. Wong
2017-01-31  4:12   ` Eric Sandeen
2017-01-31 13:29   ` Christoph Hellwig
2017-01-31  0:23 ` [PATCH 3/7] xfs: filter out obviously bad btree pointers Darrick J. Wong
2017-01-31  4:39   ` Eric Sandeen
2017-01-31 20:09     ` Darrick J. Wong
2017-01-31 20:37       ` Eric Sandeen
2017-01-31  0:23 ` [PATCH 4/7] xfs: check for obviously bad level values in the bmbt root Darrick J. Wong
2017-01-31 13:31   ` Christoph Hellwig
2017-01-31  0:23 ` [PATCH 5/7] xfs: verify free block header fields Darrick J. Wong
2017-01-31 13:42   ` Christoph Hellwig
2017-01-31  0:23 ` [PATCH 6/7] xfs: allow unwritten extents in the CoW fork Darrick J. Wong
2017-02-01  2:35   ` [PATCH v2 " Darrick J. Wong
2017-02-01 18:06     ` Christoph Hellwig
2017-01-31  0:23 ` [PATCH 7/7] xfs: mark speculative prealloc CoW fork extents unwritten Darrick J. Wong
2017-01-31 13:41   ` Christoph Hellwig
2017-01-31 19:11     ` Darrick J. Wong
2017-02-01  1:28       ` Darrick J. Wong
2017-02-01  2:36   ` [PATCH v2 " Darrick J. Wong
2017-02-01 18:36     ` Christoph Hellwig
2017-02-02 15:04     ` Brian Foster
2017-02-02 17:04       ` Darrick J. Wong
2017-02-02 19:42         ` Brian Foster

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.