All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] xfs: defrag support for v5 filesystems
@ 2013-08-30  0:23 Dave Chinner
  2013-08-30  0:23 ` [PATCH 1/2] xfs: swap extents operations for CRC filesystems Dave Chinner
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Dave Chinner @ 2013-08-30  0:23 UTC (permalink / raw)
  To: xfs

Hi folks,

The following 2 patches implement the BMBT owner change transaction
that is necessary to enable the XFS_IOC_SWAPEXT ioctl to operate on
v5 filesystems correctly. The first patch implements the
transactional runtime change, and the second patch implements the
recovery of that change.

Both the run time and recovery code use the same mechanism for
changing the owner field in all the blocks in the BMBT on an inode,
and even though XFS_IOC_SWAPEXT only swaps the data fork, the code
has been written to be fork neutral so if we even need to swap
attribute forks it should just work for that, too.

Further, because the BMBT code uses the generic btree
infrastructure, the btree modification is done as a generic function
as well and so should work for all types of btrees supported by the
generic code. Hence if the need arises we can easily change the
owner of any btree that uses the generic code.

The testing carried out is documented in the description of the
second patch.

AFAIA, this is the only remaining feature that the kernel v5
filesystem implementation didn't support. Hence, with this patchset,
there are no more feature checkboxes that need to be ticked that
would prevent us from removing the experimental tag from it. Testing
is the only remaining gate to removing the tag from the kernel
code...

Cheers,

Dave.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/2] xfs: swap extents operations for CRC filesystems
  2013-08-30  0:23 [PATCH 0/2] xfs: defrag support for v5 filesystems Dave Chinner
@ 2013-08-30  0:23 ` Dave Chinner
  2013-09-09 20:32   ` Mark Tinguely
  2013-08-30  0:23 ` [PATCH 2/2] xfs: recovery of " Dave Chinner
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2013-08-30  0:23 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

For CRC enabled filesystems, we can't just swap inode forks from one
inode to another when defragmenting a file - the blocks in the inode
fork bmap btree contain pointers back to the owner inode. Hence if
we are to swap the inode forks we have to atomically modify every
block in the btree during the transaction.

We are doing an entire fork swap here, so we could create a new
transaction item type that indicates we are changing the owner of a
certain structure from one value to another. If we combine this with
ordered buffer logging to modify all the buffers in the tree, then
we can change the buffers in the tree without needing log space for
the operation. However, this then requires log recovery to perform
the modification of the owner information of the objects/structures
in question.

This does introduce some interesting ordering details into recovery:
we have to make sure that the owner change replay occurs after the
change that moves the objects is made, not before. Hence we can't
use a separate log item for this as we have no guarantee of strict
ordering between multiple items in the log due to the relogging
action of asynchronous transaction commits. Hence there is no
"generic" method we can use for changing the ownership of arbitrary
metadata structures.

For inode forks, however, there is a simple method of communicating
that the fork contents need the owner rewritten - we can pass a
inode log format flag for the fork for the transaction that does a
fork swap. This flag will then follow the inode fork through
relogging actions so when the swap actually gets replayed the
ownership can be changed immediately by log recovery.  So that gives
us a simple method of "whole fork" exchange between two inodes.

This is relatively simple to implement, so it makes sense to do this
as an initial implementation to support xfs_fsr on CRC enabled
filesytems in the same manner as we do on existing filesystems. This
commit introduces the swapext driven functionality, the recovery
functionality will be in a separate patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap_btree.c |  34 ++++++++++
 fs/xfs/xfs_bmap_btree.h |   3 +
 fs/xfs/xfs_bmap_util.c  |  52 +++++++++++-----
 fs/xfs/xfs_btree.c      | 162 ++++++++++++++++++++++++++++++++++++++++++------
 fs/xfs/xfs_btree.h      |  18 +++---
 fs/xfs/xfs_log_format.h |   1 +
 6 files changed, 231 insertions(+), 39 deletions(-)

diff --git a/fs/xfs/xfs_bmap_btree.c b/fs/xfs/xfs_bmap_btree.c
index cf3bc76..aa2eadd 100644
--- a/fs/xfs/xfs_bmap_btree.c
+++ b/fs/xfs/xfs_bmap_btree.c
@@ -925,3 +925,37 @@ xfs_bmdr_maxrecs(
 		return blocklen / sizeof(xfs_bmdr_rec_t);
 	return blocklen / (sizeof(xfs_bmdr_key_t) + sizeof(xfs_bmdr_ptr_t));
 }
+
+/*
+ * Change the owner of a btree format fork fo the inode passed in. Change it to
+ * the owner of that is passed in so that we can change owners before or after
+ * we switch forks between inodes. The operation that the caller is doing will
+ * determine whether is needs to change owner before or after the switch.
+ *
+ * For demand paged modification, the fork switch should be done after reading
+ * in all the blocks, modifying them and pinning them in the transaction. For
+ * modification when the buffers are already pinned in memory, the fork switch
+ * can be done before changing the owner as we won't need to validate the owner
+ * until the btree buffers are unpinned and writes can occur again.
+ */
+int
+xfs_bmbt_change_owner(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	int			whichfork,
+	xfs_ino_t		new_owner)
+{
+	struct xfs_btree_cur	*cur;
+	int			error;
+
+	if (whichfork == XFS_DATA_FORK)
+		ASSERT(ip->i_d.di_format = XFS_DINODE_FMT_BTREE);
+	else
+		ASSERT(ip->i_d.di_aformat = XFS_DINODE_FMT_BTREE);
+
+	cur = xfs_bmbt_init_cursor(ip->i_mount, tp, ip, whichfork);
+	error = xfs_btree_change_owner(cur, new_owner);
+	xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
+	return error;
+}
+
diff --git a/fs/xfs/xfs_bmap_btree.h b/fs/xfs/xfs_bmap_btree.h
index 1b726d6..bceac7a 100644
--- a/fs/xfs/xfs_bmap_btree.h
+++ b/fs/xfs/xfs_bmap_btree.h
@@ -236,6 +236,9 @@ extern int xfs_bmbt_get_maxrecs(struct xfs_btree_cur *, int level);
 extern int xfs_bmdr_maxrecs(struct xfs_mount *, int blocklen, int leaf);
 extern int xfs_bmbt_maxrecs(struct xfs_mount *, int blocklen, int leaf);
 
+extern int xfs_bmbt_change_owner(struct xfs_trans *tp, struct xfs_inode *ip,
+				 int whichfork, xfs_ino_t new_owner);
+
 extern struct xfs_btree_cur *xfs_bmbt_init_cursor(struct xfs_mount *,
 		struct xfs_trans *, struct xfs_inode *, int);
 
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 541d59f..ad8a91d 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1789,14 +1789,6 @@ xfs_swap_extents(
 	int		taforkblks = 0;
 	__uint64_t	tmp;
 
-	/*
-	 * We have no way of updating owner information in the BMBT blocks for
-	 * each inode on CRC enabled filesystems, so to avoid corrupting the
-	 * this metadata we simply don't allow extent swaps to occur.
-	 */
-	if (xfs_sb_version_hascrc(&mp->m_sb))
-		return XFS_ERROR(EINVAL);
-
 	tempifp = kmem_alloc(sizeof(xfs_ifork_t), KM_MAYFAIL);
 	if (!tempifp) {
 		error = XFS_ERROR(ENOMEM);
@@ -1920,6 +1912,40 @@ xfs_swap_extents(
 			goto out_trans_cancel;
 	}
 
+	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL | XFS_IOLOCK_EXCL);
+	xfs_trans_ijoin(tp, tip, XFS_ILOCK_EXCL | XFS_IOLOCK_EXCL);
+
+	/*
+	 * Before we've swapped the forks, lets set the owners of the forks
+	 * appropriately. We have to do this as we are demand paging the btree
+	 * buffers, and so the validation done on read will expect the owner
+	 * field to be correctly set. Once we change the owners, we can swap the
+	 * inode forks.
+	 *
+	 * Note the trickiness in setting the log flags - we set the owner log
+	 * flag on the opposite inode (i.e. the inode we are setting the new
+	 * owner to be) because once we swap the forks and log that, log
+	 * recovery is going to see the fork as owned by the swapped inode,
+	 * not the pre-swapped inodes.
+	 */
+	src_log_flags = XFS_ILOG_CORE;
+	target_log_flags = XFS_ILOG_CORE;
+	if (ip->i_d.di_version == 3 &&
+	    ip->i_d.di_format == XFS_DINODE_FMT_BTREE) {
+		target_log_flags |= XFS_ILOG_OWNER;
+		error = xfs_bmbt_change_owner(tp, ip, XFS_DATA_FORK, tip->i_ino);
+		if (error)
+			goto out_trans_cancel;
+	}
+
+	if (tip->i_d.di_version == 3 &&
+	    tip->i_d.di_format == XFS_DINODE_FMT_BTREE) {
+		src_log_flags |= XFS_ILOG_OWNER;
+		error = xfs_bmbt_change_owner(tp, tip, XFS_DATA_FORK, ip->i_ino);
+		if (error)
+			goto out_trans_cancel;
+	}
+
 	/*
 	 * Swap the data forks of the inodes
 	 */
@@ -1957,7 +1983,6 @@ xfs_swap_extents(
 	tip->i_delayed_blks = ip->i_delayed_blks;
 	ip->i_delayed_blks = 0;
 
-	src_log_flags = XFS_ILOG_CORE;
 	switch (ip->i_d.di_format) {
 	case XFS_DINODE_FMT_EXTENTS:
 		/* If the extents fit in the inode, fix the
@@ -1971,11 +1996,12 @@ xfs_swap_extents(
 		src_log_flags |= XFS_ILOG_DEXT;
 		break;
 	case XFS_DINODE_FMT_BTREE:
+		ASSERT(ip->i_d.di_version < 3 ||
+		       (src_log_flags & XFS_ILOG_OWNER));
 		src_log_flags |= XFS_ILOG_DBROOT;
 		break;
 	}
 
-	target_log_flags = XFS_ILOG_CORE;
 	switch (tip->i_d.di_format) {
 	case XFS_DINODE_FMT_EXTENTS:
 		/* If the extents fit in the inode, fix the
@@ -1990,13 +2016,11 @@ xfs_swap_extents(
 		break;
 	case XFS_DINODE_FMT_BTREE:
 		target_log_flags |= XFS_ILOG_DBROOT;
+		ASSERT(tip->i_d.di_version < 3 ||
+		       (target_log_flags & XFS_ILOG_OWNER));
 		break;
 	}
 
-
-	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL | XFS_IOLOCK_EXCL);
-	xfs_trans_ijoin(tp, tip, XFS_ILOCK_EXCL | XFS_IOLOCK_EXCL);
-
 	xfs_trans_log_inode(tp, ip,  src_log_flags);
 	xfs_trans_log_inode(tp, tip, target_log_flags);
 
diff --git a/fs/xfs/xfs_btree.c b/fs/xfs/xfs_btree.c
index 7a2b4da..047573f 100644
--- a/fs/xfs/xfs_btree.c
+++ b/fs/xfs/xfs_btree.c
@@ -855,6 +855,41 @@ xfs_btree_readahead(
 	return xfs_btree_readahead_sblock(cur, lr, block);
 }
 
+STATIC xfs_daddr_t
+xfs_btree_ptr_to_daddr(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*ptr)
+{
+	if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
+		ASSERT(ptr->l != cpu_to_be64(NULLDFSBNO));
+
+		return XFS_FSB_TO_DADDR(cur->bc_mp, be64_to_cpu(ptr->l));
+	} else {
+		ASSERT(cur->bc_private.a.agno != NULLAGNUMBER);
+		ASSERT(ptr->s != cpu_to_be32(NULLAGBLOCK));
+
+		return XFS_AGB_TO_DADDR(cur->bc_mp, cur->bc_private.a.agno,
+					be32_to_cpu(ptr->s));
+	}
+}
+
+/*
+ * Readahead @count btree blocks at the given @ptr location.
+ *
+ * We don't need to care about long or short form btrees here as we have a
+ * method of converting the ptr directly to a daddr available to us.
+ */
+STATIC void
+xfs_btree_readahead_ptr(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*ptr,
+	xfs_extlen_t		count)
+{
+	xfs_buf_readahead(cur->bc_mp->m_ddev_targp,
+			  xfs_btree_ptr_to_daddr(cur, ptr),
+			  cur->bc_mp->m_bsize * count, cur->bc_ops->buf_ops);
+}
+
 /*
  * Set the buffer for level "lev" in the cursor to bp, releasing
  * any previous buffer.
@@ -1073,24 +1108,6 @@ xfs_btree_buf_to_ptr(
 	}
 }
 
-STATIC xfs_daddr_t
-xfs_btree_ptr_to_daddr(
-	struct xfs_btree_cur	*cur,
-	union xfs_btree_ptr	*ptr)
-{
-	if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
-		ASSERT(ptr->l != cpu_to_be64(NULLDFSBNO));
-
-		return XFS_FSB_TO_DADDR(cur->bc_mp, be64_to_cpu(ptr->l));
-	} else {
-		ASSERT(cur->bc_private.a.agno != NULLAGNUMBER);
-		ASSERT(ptr->s != cpu_to_be32(NULLAGBLOCK));
-
-		return XFS_AGB_TO_DADDR(cur->bc_mp, cur->bc_private.a.agno,
-					be32_to_cpu(ptr->s));
-	}
-}
-
 STATIC void
 xfs_btree_set_refs(
 	struct xfs_btree_cur	*cur,
@@ -3869,3 +3886,112 @@ xfs_btree_get_rec(
 	*stat = 1;
 	return 0;
 }
+
+/*
+ * Change the owner of a btree.
+ *
+ * The mechanism we use here is ordered buffer logging. Because we don't know
+ * how many buffers were are going to need to modify, we don't really want to
+ * have to make transaction reservations for the worst case of every buffer in a
+ * full size btree as that may be more space that we can fit in the log....
+ *
+ * We do the btree walk in the most optimal manner possible - we have sibling
+ * pointers so we can just walk all the blocks on each level from left to right
+ * in a single pass, and then move to the next level and do the same. We can
+ * also do readahead on the sibling pointers to get IO moving more quickly,
+ * though for slow disks this is unlikely to make much difference to performance
+ * as the amount of CPU work we have to do before moving to the next block is
+ * relatively small.
+ *
+ * For each btree block that we load, modify the owner appropriately, set the
+ * buffer as an ordered buffer and log it appropriately. We need to ensure that
+ * we mark the region we change dirty so that if the buffer is relogged in
+ * a subsequent transaction the changes we make here as an ordered buffer are
+ * correctly relogged in that transaction.
+ */
+static int
+xfs_btree_block_change_owner(
+	struct xfs_btree_cur	*cur,
+	int			level,
+	__uint64_t		new_owner)
+{
+	struct xfs_btree_block	*block;
+	struct xfs_buf		*bp;
+	union xfs_btree_ptr     rptr;
+
+	/* do right sibling readahead */
+	xfs_btree_readahead(cur, level, XFS_BTCUR_RIGHTRA);
+
+	/* modify the owner */
+	block = xfs_btree_get_block(cur, level, &bp);
+	if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
+		block->bb_u.l.bb_owner = cpu_to_be64(new_owner);
+	else
+		block->bb_u.s.bb_owner = cpu_to_be32(new_owner);
+
+	/*
+	 * Log owner change as an ordered buffer. If the block is a root block
+	 * hosted in an inode, we might not have a buffer pointer here and we
+	 * shouldn't attempt to log the change as the information is already
+	 * held in the inode and discarded when the root block is formatted into
+	 * the on-disk inode fork. We still change it, though, so everything is
+	 * consistent in memory.
+	 */
+	if (bp) {
+		xfs_trans_ordered_buf(cur->bc_tp, bp);
+		xfs_btree_log_block(cur, bp, XFS_BB_OWNER);
+	} else {
+		ASSERT(cur->bc_flags & XFS_BTREE_ROOT_IN_INODE);
+		ASSERT(level == cur->bc_nlevels - 1);
+	}
+
+	/* now read rh sibling block for next iteration */
+	xfs_btree_get_sibling(cur, block, &rptr, XFS_BB_RIGHTSIB);
+	if (xfs_btree_ptr_is_null(cur, &rptr))
+		return ENOENT;
+
+	return xfs_btree_lookup_get_block(cur, level, &rptr, &block);
+}
+
+int
+xfs_btree_change_owner(
+	struct xfs_btree_cur	*cur,
+	__uint64_t		new_owner)
+{
+	union xfs_btree_ptr     lptr;
+	int			level;
+	struct xfs_btree_block	*block = NULL;
+	int			error = 0;
+
+	cur->bc_ops->init_ptr_from_cur(cur, &lptr);
+
+	/* for each level */
+	for (level = cur->bc_nlevels - 1; level >= 0; level--) {
+		/* grab the left hand block */
+		error = xfs_btree_lookup_get_block(cur, level, &lptr, &block);
+		if (error)
+			return error;
+
+		/* readahead the left most block for the next level down */
+		if (level > 0) {
+			union xfs_btree_ptr     *ptr;
+
+			ptr = xfs_btree_ptr_addr(cur, 1, block);
+			xfs_btree_readahead_ptr(cur, ptr, 1);
+
+			/* save for the next iteration of the loop */
+			lptr = *ptr;
+		}
+
+		/* for each buffer in the level */
+		do {
+			error = xfs_btree_block_change_owner(cur, level,
+							     new_owner);
+		} while (!error);
+
+		if (error != ENOENT)
+			return error;
+	}
+
+	return 0;
+}
diff --git a/fs/xfs/xfs_btree.h b/fs/xfs/xfs_btree.h
index c8473c7..544b209 100644
--- a/fs/xfs/xfs_btree.h
+++ b/fs/xfs/xfs_btree.h
@@ -121,15 +121,18 @@ union xfs_btree_rec {
 /*
  * For logging record fields.
  */
-#define	XFS_BB_MAGIC		0x01
-#define	XFS_BB_LEVEL		0x02
-#define	XFS_BB_NUMRECS		0x04
-#define	XFS_BB_LEFTSIB		0x08
-#define	XFS_BB_RIGHTSIB		0x10
-#define	XFS_BB_BLKNO		0x20
+#define	XFS_BB_MAGIC		(1 << 0)
+#define	XFS_BB_LEVEL		(1 << 1)
+#define	XFS_BB_NUMRECS		(1 << 2)
+#define	XFS_BB_LEFTSIB		(1 << 3)
+#define	XFS_BB_RIGHTSIB		(1 << 4)
+#define	XFS_BB_BLKNO		(1 << 5)
+#define	XFS_BB_LSN		(1 << 6)
+#define	XFS_BB_UUID		(1 << 7)
+#define	XFS_BB_OWNER		(1 << 8)
 #define	XFS_BB_NUM_BITS		5
 #define	XFS_BB_ALL_BITS		((1 << XFS_BB_NUM_BITS) - 1)
-#define	XFS_BB_NUM_BITS_CRC	8
+#define	XFS_BB_NUM_BITS_CRC	9
 #define	XFS_BB_ALL_BITS_CRC	((1 << XFS_BB_NUM_BITS_CRC) - 1)
 
 /*
@@ -442,6 +445,7 @@ int xfs_btree_new_iroot(struct xfs_btree_cur *, int *, int *);
 int xfs_btree_insert(struct xfs_btree_cur *, int *);
 int xfs_btree_delete(struct xfs_btree_cur *, int *);
 int xfs_btree_get_rec(struct xfs_btree_cur *, union xfs_btree_rec **, int *);
+int xfs_btree_change_owner(struct xfs_btree_cur *cur, __uint64_t new_owner);
 
 /*
  * btree block CRC helpers
diff --git a/fs/xfs/xfs_log_format.h b/fs/xfs/xfs_log_format.h
index a49ab2c..1902c1d 100644
--- a/fs/xfs/xfs_log_format.h
+++ b/fs/xfs/xfs_log_format.h
@@ -474,6 +474,7 @@ typedef struct xfs_inode_log_format_64 {
 #define	XFS_ILOG_ADATA	0x040	/* log i_af.if_data */
 #define	XFS_ILOG_AEXT	0x080	/* log i_af.if_extents */
 #define	XFS_ILOG_ABROOT	0x100	/* log i_af.i_broot */
+#define XFS_ILOG_OWNER	0x200	/* change the extent tree owner on replay */
 
 
 /*
-- 
1.8.3.2

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/2] xfs: recovery of swap extents operations for CRC filesystems
  2013-08-30  0:23 [PATCH 0/2] xfs: defrag support for v5 filesystems Dave Chinner
  2013-08-30  0:23 ` [PATCH 1/2] xfs: swap extents operations for CRC filesystems Dave Chinner
@ 2013-08-30  0:23 ` Dave Chinner
  2013-09-09 20:37   ` Mark Tinguely
  2013-09-03 19:12 ` [PATCH 0/2] xfs: defrag support for v5 filesystems Ben Myers
  2013-09-10 17:51 ` Ben Myers
  3 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2013-08-30  0:23 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

This is the recovery side of the btree block owner change operation
performed by swapext on CRC enabled filesystems. We detect that an
owner change is needed by the flag that has been placed on the inode
log format flag field. Because the inode recovery is being replayed
after the buffers that make up the BMBT in the given checkpoint, we
can walk all the buffers and directly modify them when we see the
flag set on an inode.

Because the inode can be relogged and hence present in multiple
chekpoints with the "change owner" flag set, we could do multiple
passes across the inode to do this change. While this isn't optimal,
we can't directly ignore the flag as there may be multiple
independent swap extent operations being replayed on the same inode
in different checkpoints so we can't ignore them.

Further, because the owner change operation uses ordered buffers, we
might have buffers that are newer on disk than the current
checkpoint and so already have the owner changed in them. Hence we
cannot just peek at a buffer in the tree and check that it has the
correct owner and assume that the change was completed.

So, for the moment just brute force the owner change every time we
see an inode with the flag set. Note that we have to be careful here
because the owner of the buffers may point to either the old owner
or the new owner. Currently the verifier can't verify the owner
directly, so there is no failure case here right now. If we verify
the owner exactly in future, then we'll have to take this into
account.

This was tested in terms of normal operation via xfstests - all of
the fsr tests now pass without failure. however, we really need to
modify xfs/227 to stress v3 inodes correctly to ensure we fully
cover this case for v5 filesystems.

In terms of recovery testing, I used a hacked version of xfs_fsr
that held the temp inode open for a few seconds before exiting so
that the filesystem could be shut down with an open owner change
recovery flags set on at least the temp inode. fsr leaves the temp
inode unlinked and in btree format, so this was necessary for the
owner change to be reliably replayed.

logprint confirmed the tmp inode in the log had the correct flag set:

INO: cnt:3 total:3 a:0x69e9e0 len:56 a:0x69ea20 len:176 a:0x69eae0 len:88
        INODE: #regs:3   ino:0x44  flags:0x209   dsize:88
	                                 ^^^^^

0x200 is set, indicating a data fork owner change needed to be
replayed on inode 0x44.  A printk in the revoery code confirmed that
the inode change was recovered:

XFS (vdc): Mounting Filesystem
XFS (vdc): Starting recovery (logdev: internal)
recovering owner change ino 0x44
XFS (vdc): Version 5 superblock detected. This kernel L support enabled!
Use of these features in this kernel is at your own risk!
XFS (vdc): Ending recovery (logdev: internal)

The script used to test this was:

$ cat ./recovery-fsr.sh
#!/bin/bash

dev=/dev/vdc
mntpt=/mnt/scratch
testfile=$mntpt/testfile

umount $mntpt
mkfs.xfs -f -m crc=1 $dev
mount $dev $mntpt
chmod 777 $mntpt

for i in `seq 10000 -1 0`; do
        xfs_io -f -d -c "pwrite $(($i * 4096)) 4096" $testfile > /dev/null 2>&1
done
xfs_bmap -vp $testfile |head -20

xfs_fsr -d -v $testfile &
sleep 10
/home/dave/src/xfstests-dev/src/godown -f $mntpt
wait
umount $mntpt

xfs_logprint -t $dev |tail -20
time mount $dev $mntpt
xfs_bmap -vp $testfile
umount $mntpt
$

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap_btree.c  |  26 +++++++---
 fs/xfs/xfs_bmap_btree.h  |   3 +-
 fs/xfs/xfs_bmap_util.c   |  14 +++---
 fs/xfs/xfs_btree.c       |  32 +++++++-----
 fs/xfs/xfs_btree.h       |   3 +-
 fs/xfs/xfs_icache.c      |   4 +-
 fs/xfs/xfs_icache.h      |   4 ++
 fs/xfs/xfs_inode_buf.c   |   2 +-
 fs/xfs/xfs_inode_buf.h   |  18 +++----
 fs/xfs/xfs_log_format.h  |   9 ++--
 fs/xfs/xfs_log_recover.c | 123 ++++++++++++++++++++++++++++++++++++++---------
 11 files changed, 171 insertions(+), 67 deletions(-)

diff --git a/fs/xfs/xfs_bmap_btree.c b/fs/xfs/xfs_bmap_btree.c
index aa2eadd..531b020 100644
--- a/fs/xfs/xfs_bmap_btree.c
+++ b/fs/xfs/xfs_bmap_btree.c
@@ -932,30 +932,40 @@ xfs_bmdr_maxrecs(
  * we switch forks between inodes. The operation that the caller is doing will
  * determine whether is needs to change owner before or after the switch.
  *
- * For demand paged modification, the fork switch should be done after reading
- * in all the blocks, modifying them and pinning them in the transaction. For
- * modification when the buffers are already pinned in memory, the fork switch
- * can be done before changing the owner as we won't need to validate the owner
- * until the btree buffers are unpinned and writes can occur again.
+ * For demand paged transactional modification, the fork switch should be done
+ * after reading in all the blocks, modifying them and pinning them in the
+ * transaction. For modification when the buffers are already pinned in memory,
+ * the fork switch can be done before changing the owner as we won't need to
+ * validate the owner until the btree buffers are unpinned and writes can occur
+ * again.
+ *
+ * For recovery based ownership change, there is no transactional context and
+ * so a buffer list must be supplied so that we can record the buffers that we
+ * modified for the caller to issue IO on.
  */
 int
 xfs_bmbt_change_owner(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*ip,
 	int			whichfork,
-	xfs_ino_t		new_owner)
+	xfs_ino_t		new_owner,
+	struct list_head	*buffer_list)
 {
 	struct xfs_btree_cur	*cur;
 	int			error;
 
+	ASSERT(tp || buffer_list);
+	ASSERT(!(tp && buffer_list));
 	if (whichfork == XFS_DATA_FORK)
 		ASSERT(ip->i_d.di_format = XFS_DINODE_FMT_BTREE);
 	else
 		ASSERT(ip->i_d.di_aformat = XFS_DINODE_FMT_BTREE);
 
 	cur = xfs_bmbt_init_cursor(ip->i_mount, tp, ip, whichfork);
-	error = xfs_btree_change_owner(cur, new_owner);
+	if (!cur)
+		return ENOMEM;
+
+	error = xfs_btree_change_owner(cur, new_owner, buffer_list);
 	xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
 	return error;
 }
-
diff --git a/fs/xfs/xfs_bmap_btree.h b/fs/xfs/xfs_bmap_btree.h
index bceac7a..e367461 100644
--- a/fs/xfs/xfs_bmap_btree.h
+++ b/fs/xfs/xfs_bmap_btree.h
@@ -237,7 +237,8 @@ extern int xfs_bmdr_maxrecs(struct xfs_mount *, int blocklen, int leaf);
 extern int xfs_bmbt_maxrecs(struct xfs_mount *, int blocklen, int leaf);
 
 extern int xfs_bmbt_change_owner(struct xfs_trans *tp, struct xfs_inode *ip,
-				 int whichfork, xfs_ino_t new_owner);
+				 int whichfork, xfs_ino_t new_owner,
+				 struct list_head *buffer_list);
 
 extern struct xfs_btree_cur *xfs_bmbt_init_cursor(struct xfs_mount *,
 		struct xfs_trans *, struct xfs_inode *, int);
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index ad8a91d..c6dc551 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1932,16 +1932,18 @@ xfs_swap_extents(
 	target_log_flags = XFS_ILOG_CORE;
 	if (ip->i_d.di_version == 3 &&
 	    ip->i_d.di_format == XFS_DINODE_FMT_BTREE) {
-		target_log_flags |= XFS_ILOG_OWNER;
-		error = xfs_bmbt_change_owner(tp, ip, XFS_DATA_FORK, tip->i_ino);
+		target_log_flags |= XFS_ILOG_DOWNER;
+		error = xfs_bmbt_change_owner(tp, ip, XFS_DATA_FORK,
+					      tip->i_ino, NULL);
 		if (error)
 			goto out_trans_cancel;
 	}
 
 	if (tip->i_d.di_version == 3 &&
 	    tip->i_d.di_format == XFS_DINODE_FMT_BTREE) {
-		src_log_flags |= XFS_ILOG_OWNER;
-		error = xfs_bmbt_change_owner(tp, tip, XFS_DATA_FORK, ip->i_ino);
+		src_log_flags |= XFS_ILOG_DOWNER;
+		error = xfs_bmbt_change_owner(tp, tip, XFS_DATA_FORK,
+					      ip->i_ino, NULL);
 		if (error)
 			goto out_trans_cancel;
 	}
@@ -1997,7 +1999,7 @@ xfs_swap_extents(
 		break;
 	case XFS_DINODE_FMT_BTREE:
 		ASSERT(ip->i_d.di_version < 3 ||
-		       (src_log_flags & XFS_ILOG_OWNER));
+		       (src_log_flags & XFS_ILOG_DOWNER));
 		src_log_flags |= XFS_ILOG_DBROOT;
 		break;
 	}
@@ -2017,7 +2019,7 @@ xfs_swap_extents(
 	case XFS_DINODE_FMT_BTREE:
 		target_log_flags |= XFS_ILOG_DBROOT;
 		ASSERT(tip->i_d.di_version < 3 ||
-		       (target_log_flags & XFS_ILOG_OWNER));
+		       (target_log_flags & XFS_ILOG_DOWNER));
 		break;
 	}
 
diff --git a/fs/xfs/xfs_btree.c b/fs/xfs/xfs_btree.c
index 047573f..5690e10 100644
--- a/fs/xfs/xfs_btree.c
+++ b/fs/xfs/xfs_btree.c
@@ -3907,13 +3907,16 @@ xfs_btree_get_rec(
  * buffer as an ordered buffer and log it appropriately. We need to ensure that
  * we mark the region we change dirty so that if the buffer is relogged in
  * a subsequent transaction the changes we make here as an ordered buffer are
- * correctly relogged in that transaction.
+ * correctly relogged in that transaction.  If we are in recovery context, then
+ * just queue the modified buffer as delayed write buffer so the transaction
+ * recovery completion writes the changes to disk.
  */
 static int
 xfs_btree_block_change_owner(
 	struct xfs_btree_cur	*cur,
 	int			level,
-	__uint64_t		new_owner)
+	__uint64_t		new_owner,
+	struct list_head	*buffer_list)
 {
 	struct xfs_btree_block	*block;
 	struct xfs_buf		*bp;
@@ -3930,16 +3933,19 @@ xfs_btree_block_change_owner(
 		block->bb_u.s.bb_owner = cpu_to_be32(new_owner);
 
 	/*
-	 * Log owner change as an ordered buffer. If the block is a root block
-	 * hosted in an inode, we might not have a buffer pointer here and we
-	 * shouldn't attempt to log the change as the information is already
-	 * held in the inode and discarded when the root block is formatted into
-	 * the on-disk inode fork. We still change it, though, so everything is
-	 * consistent in memory.
+	 * If the block is a root block hosted in an inode, we might not have a
+	 * buffer pointer here and we shouldn't attempt to log the change as the
+	 * information is already held in the inode and discarded when the root
+	 * block is formatted into the on-disk inode fork. We still change it,
+	 * though, so everything is consistent in memory.
 	 */
 	if (bp) {
-		xfs_trans_ordered_buf(cur->bc_tp, bp);
-		xfs_btree_log_block(cur, bp, XFS_BB_OWNER);
+		if (cur->bc_tp) {
+			xfs_trans_ordered_buf(cur->bc_tp, bp);
+			xfs_btree_log_block(cur, bp, XFS_BB_OWNER);
+		} else {
+			xfs_buf_delwri_queue(bp, buffer_list);
+		}
 	} else {
 		ASSERT(cur->bc_flags & XFS_BTREE_ROOT_IN_INODE);
 		ASSERT(level == cur->bc_nlevels - 1);
@@ -3956,7 +3962,8 @@ xfs_btree_block_change_owner(
 int
 xfs_btree_change_owner(
 	struct xfs_btree_cur	*cur,
-	__uint64_t		new_owner)
+	__uint64_t		new_owner,
+	struct list_head	*buffer_list)
 {
 	union xfs_btree_ptr     lptr;
 	int			level;
@@ -3986,7 +3993,8 @@ xfs_btree_change_owner(
 		/* for each buffer in the level */
 		do {
 			error = xfs_btree_block_change_owner(cur, level,
-							     new_owner);
+							     new_owner,
+							     buffer_list);
 		} while (!error);
 
 		if (error != ENOENT)
diff --git a/fs/xfs/xfs_btree.h b/fs/xfs/xfs_btree.h
index 544b209..06729b6 100644
--- a/fs/xfs/xfs_btree.h
+++ b/fs/xfs/xfs_btree.h
@@ -445,7 +445,8 @@ int xfs_btree_new_iroot(struct xfs_btree_cur *, int *, int *);
 int xfs_btree_insert(struct xfs_btree_cur *, int *);
 int xfs_btree_delete(struct xfs_btree_cur *, int *);
 int xfs_btree_get_rec(struct xfs_btree_cur *, union xfs_btree_rec **, int *);
-int xfs_btree_change_owner(struct xfs_btree_cur *cur, __uint64_t new_owner);
+int xfs_btree_change_owner(struct xfs_btree_cur *cur, __uint64_t new_owner,
+			   struct list_head *buffer_list);
 
 /*
  * btree block CRC helpers
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 16219b9..7942432 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -48,7 +48,7 @@ STATIC void __xfs_inode_clear_reclaim_tag(struct xfs_mount *mp,
 /*
  * Allocate and initialise an xfs_inode.
  */
-STATIC struct xfs_inode *
+struct xfs_inode *
 xfs_inode_alloc(
 	struct xfs_mount	*mp,
 	xfs_ino_t		ino)
@@ -98,7 +98,7 @@ xfs_inode_free_callback(
 	kmem_zone_free(xfs_inode_zone, ip);
 }
 
-STATIC void
+void
 xfs_inode_free(
 	struct xfs_inode	*ip)
 {
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 8a89f7d..458e6bc 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -42,6 +42,10 @@ struct xfs_eofblocks {
 int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino,
 	     uint flags, uint lock_flags, xfs_inode_t **ipp);
 
+/* recovery needs direct inode allocation capability */
+struct xfs_inode * xfs_inode_alloc(struct xfs_mount *mp, xfs_ino_t ino);
+void xfs_inode_free(struct xfs_inode *ip);
+
 void xfs_reclaim_worker(struct work_struct *work);
 
 int xfs_reclaim_inodes(struct xfs_mount *mp, int mode);
diff --git a/fs/xfs/xfs_inode_buf.c b/fs/xfs/xfs_inode_buf.c
index e011d59..3d25c9a 100644
--- a/fs/xfs/xfs_inode_buf.c
+++ b/fs/xfs/xfs_inode_buf.c
@@ -196,7 +196,7 @@ xfs_imap_to_bp(
 	return 0;
 }
 
-STATIC void
+void
 xfs_dinode_from_disk(
 	xfs_icdinode_t		*to,
 	xfs_dinode_t		*from)
diff --git a/fs/xfs/xfs_inode_buf.h b/fs/xfs/xfs_inode_buf.h
index 599e6c0..abba0ae 100644
--- a/fs/xfs/xfs_inode_buf.h
+++ b/fs/xfs/xfs_inode_buf.h
@@ -32,17 +32,17 @@ struct xfs_imap {
 	ushort		im_boffset;	/* inode offset in block in bytes */
 };
 
-int		xfs_imap_to_bp(struct xfs_mount *, struct xfs_trans *,
-			       struct xfs_imap *, struct xfs_dinode **,
-			       struct xfs_buf **, uint, uint);
-int		xfs_iread(struct xfs_mount *, struct xfs_trans *,
-			  struct xfs_inode *, uint);
-void		xfs_dinode_calc_crc(struct xfs_mount *, struct xfs_dinode *);
-void		xfs_dinode_to_disk(struct xfs_dinode *,
-				   struct xfs_icdinode *);
+int	xfs_imap_to_bp(struct xfs_mount *, struct xfs_trans *,
+		       struct xfs_imap *, struct xfs_dinode **,
+		       struct xfs_buf **, uint, uint);
+int	xfs_iread(struct xfs_mount *, struct xfs_trans *,
+		  struct xfs_inode *, uint);
+void	xfs_dinode_calc_crc(struct xfs_mount *, struct xfs_dinode *);
+void	xfs_dinode_to_disk(struct xfs_dinode *to, struct xfs_icdinode *from);
+void	xfs_dinode_from_disk(struct xfs_icdinode *to, struct xfs_dinode *from);
 
 #if defined(DEBUG)
-void		xfs_inobp_check(struct xfs_mount *, struct xfs_buf *);
+void	xfs_inobp_check(struct xfs_mount *, struct xfs_buf *);
 #else
 #define	xfs_inobp_check(mp, bp)
 #endif /* DEBUG */
diff --git a/fs/xfs/xfs_log_format.h b/fs/xfs/xfs_log_format.h
index 1902c1d..5b7ba98 100644
--- a/fs/xfs/xfs_log_format.h
+++ b/fs/xfs/xfs_log_format.h
@@ -474,7 +474,8 @@ typedef struct xfs_inode_log_format_64 {
 #define	XFS_ILOG_ADATA	0x040	/* log i_af.if_data */
 #define	XFS_ILOG_AEXT	0x080	/* log i_af.if_extents */
 #define	XFS_ILOG_ABROOT	0x100	/* log i_af.i_broot */
-#define XFS_ILOG_OWNER	0x200	/* change the extent tree owner on replay */
+#define XFS_ILOG_DOWNER	0x200	/* change the data fork owner on replay */
+#define XFS_ILOG_AOWNER	0x400	/* change the attr fork owner on replay */
 
 
 /*
@@ -488,7 +489,8 @@ typedef struct xfs_inode_log_format_64 {
 #define	XFS_ILOG_NONCORE	(XFS_ILOG_DDATA | XFS_ILOG_DEXT | \
 				 XFS_ILOG_DBROOT | XFS_ILOG_DEV | \
 				 XFS_ILOG_UUID | XFS_ILOG_ADATA | \
-				 XFS_ILOG_AEXT | XFS_ILOG_ABROOT)
+				 XFS_ILOG_AEXT | XFS_ILOG_ABROOT | \
+				 XFS_ILOG_DOWNER | XFS_ILOG_AOWNER)
 
 #define	XFS_ILOG_DFORK		(XFS_ILOG_DDATA | XFS_ILOG_DEXT | \
 				 XFS_ILOG_DBROOT)
@@ -500,7 +502,8 @@ typedef struct xfs_inode_log_format_64 {
 				 XFS_ILOG_DEXT | XFS_ILOG_DBROOT | \
 				 XFS_ILOG_DEV | XFS_ILOG_UUID | \
 				 XFS_ILOG_ADATA | XFS_ILOG_AEXT | \
-				 XFS_ILOG_ABROOT | XFS_ILOG_TIMESTAMP)
+				 XFS_ILOG_ABROOT | XFS_ILOG_TIMESTAMP | \
+				 XFS_ILOG_DOWNER | XFS_ILOG_AOWNER)
 
 static inline int xfs_ilog_fbroot(int w)
 {
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 7c0c1fd..8984ec4 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2629,6 +2629,82 @@ out_release:
 	return error;
 }
 
+/*
+ * Inode fork owner changes
+ *
+ * If we have been told that we have to reparent the inode fork, it's because an
+ * extent swap operation on a CRC enabled filesystem has been done and we are
+ * replaying it. We need to walk the BMBT of the appropriate fork and change the
+ * owners of it.
+ *
+ * The complexity here is that we don't have an inode context to work with, so
+ * after we've replayed the inode we need to instantiate one.  This is where the
+ * fun begins.
+ *
+ * We are in the middle of log recovery, so we can't run transactions. That
+ * means we cannot use cache coherent inode instantiation via xfs_iget(), as
+ * that will result in the corresponding iput() running the inode through
+ * xfs_inactive(). If we've just replayed an inode core that changes the link
+ * count to zero (i.e. it's been unlinked), then xfs_inactive() will run
+ * transactions (bad!).
+ *
+ * So, to avoid this, we instantiate an inode directly from the inode core we've
+ * just recovered. We have the buffer still locked, and all we really need to
+ * instantiate is the inode core and the forks being modified. We can do this
+ * manually, then run the inode btree owner change, and then tear down the
+ * xfs_inode without having to run any transactions at all.
+ *
+ * Also, because we don't have a transaction context available here but need to
+ * gather all the buffers we modify for writeback so we pass the buffer_list
+ * instead for the operation to use.
+ */
+
+STATIC int
+xfs_recover_inode_owner_change(
+	struct xfs_mount	*mp,
+	struct xfs_dinode	*dip,
+	struct xfs_inode_log_format *in_f,
+	struct list_head	*buffer_list)
+{
+	struct xfs_inode	*ip;
+	int			error;
+
+	ASSERT(in_f->ilf_fields & (XFS_ILOG_DOWNER|XFS_ILOG_AOWNER));
+
+	ip = xfs_inode_alloc(mp, in_f->ilf_ino);
+	if (!ip)
+		return ENOMEM;
+
+	/* instantiate the inode */
+	xfs_dinode_from_disk(&ip->i_d, dip);
+	ASSERT(ip->i_d.di_version >= 3);
+
+	error = xfs_iformat_fork(ip, dip);
+	if (error)
+		goto out_free_ip;
+
+
+	if (in_f->ilf_fields & XFS_ILOG_DOWNER) {
+		ASSERT(in_f->ilf_fields & XFS_ILOG_DBROOT);
+		error = xfs_bmbt_change_owner(NULL, ip, XFS_DATA_FORK,
+					      ip->i_ino, buffer_list);
+		if (error)
+			goto out_free_ip;
+	}
+
+	if (in_f->ilf_fields & XFS_ILOG_AOWNER) {
+		ASSERT(in_f->ilf_fields & XFS_ILOG_ABROOT);
+		error = xfs_bmbt_change_owner(NULL, ip, XFS_ATTR_FORK,
+					      ip->i_ino, buffer_list);
+		if (error)
+			goto out_free_ip;
+	}
+
+out_free_ip:
+	xfs_inode_free(ip);
+	return error;
+}
+
 STATIC int
 xlog_recover_inode_pass2(
 	struct xlog			*log,
@@ -2681,8 +2757,7 @@ xlog_recover_inode_pass2(
 	error = bp->b_error;
 	if (error) {
 		xfs_buf_ioerror_alert(bp, "xlog_recover_do..(read#2)");
-		xfs_buf_relse(bp);
-		goto error;
+		goto out_release;
 	}
 	ASSERT(in_f->ilf_fields & XFS_ILOG_CORE);
 	dip = (xfs_dinode_t *)xfs_buf_offset(bp, in_f->ilf_boffset);
@@ -2692,30 +2767,31 @@ xlog_recover_inode_pass2(
 	 * like an inode!
 	 */
 	if (unlikely(dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))) {
-		xfs_buf_relse(bp);
 		xfs_alert(mp,
 	"%s: Bad inode magic number, dip = 0x%p, dino bp = 0x%p, ino = %Ld",
 			__func__, dip, bp, in_f->ilf_ino);
 		XFS_ERROR_REPORT("xlog_recover_inode_pass2(1)",
 				 XFS_ERRLEVEL_LOW, mp);
 		error = EFSCORRUPTED;
-		goto error;
+		goto out_release;
 	}
 	dicp = item->ri_buf[1].i_addr;
 	if (unlikely(dicp->di_magic != XFS_DINODE_MAGIC)) {
-		xfs_buf_relse(bp);
 		xfs_alert(mp,
 			"%s: Bad inode log record, rec ptr 0x%p, ino %Ld",
 			__func__, item, in_f->ilf_ino);
 		XFS_ERROR_REPORT("xlog_recover_inode_pass2(2)",
 				 XFS_ERRLEVEL_LOW, mp);
 		error = EFSCORRUPTED;
-		goto error;
+		goto out_release;
 	}
 
 	/*
 	 * If the inode has an LSN in it, recover the inode only if it's less
-	 * than the lsn of the transaction we are replaying.
+	 * than the lsn of the transaction we are replaying. Note: we still
+	 * need to replay an owner change even though the inode is more recent
+	 * than the transaction as there is no guarantee that all the btree
+	 * blocks are more recent than this transaction, too.
 	 */
 	if (dip->di_version >= 3) {
 		xfs_lsn_t	lsn = be64_to_cpu(dip->di_lsn);
@@ -2723,7 +2799,7 @@ xlog_recover_inode_pass2(
 		if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) >= 0) {
 			trace_xfs_log_recover_inode_skip(log, in_f);
 			error = 0;
-			goto out_release;
+			goto out_owner_change;
 		}
 	}
 
@@ -2745,10 +2821,9 @@ xlog_recover_inode_pass2(
 		    dicp->di_flushiter < (DI_MAX_FLUSH >> 1)) {
 			/* do nothing */
 		} else {
-			xfs_buf_relse(bp);
 			trace_xfs_log_recover_inode_skip(log, in_f);
 			error = 0;
-			goto error;
+			goto out_release;
 		}
 	}
 
@@ -2760,13 +2835,12 @@ xlog_recover_inode_pass2(
 		    (dicp->di_format != XFS_DINODE_FMT_BTREE)) {
 			XFS_CORRUPTION_ERROR("xlog_recover_inode_pass2(3)",
 					 XFS_ERRLEVEL_LOW, mp, dicp);
-			xfs_buf_relse(bp);
 			xfs_alert(mp,
 		"%s: Bad regular inode log record, rec ptr 0x%p, "
 		"ino ptr = 0x%p, ino bp = 0x%p, ino %Ld",
 				__func__, item, dip, bp, in_f->ilf_ino);
 			error = EFSCORRUPTED;
-			goto error;
+			goto out_release;
 		}
 	} else if (unlikely(S_ISDIR(dicp->di_mode))) {
 		if ((dicp->di_format != XFS_DINODE_FMT_EXTENTS) &&
@@ -2774,19 +2848,17 @@ xlog_recover_inode_pass2(
 		    (dicp->di_format != XFS_DINODE_FMT_LOCAL)) {
 			XFS_CORRUPTION_ERROR("xlog_recover_inode_pass2(4)",
 					     XFS_ERRLEVEL_LOW, mp, dicp);
-			xfs_buf_relse(bp);
 			xfs_alert(mp,
 		"%s: Bad dir inode log record, rec ptr 0x%p, "
 		"ino ptr = 0x%p, ino bp = 0x%p, ino %Ld",
 				__func__, item, dip, bp, in_f->ilf_ino);
 			error = EFSCORRUPTED;
-			goto error;
+			goto out_release;
 		}
 	}
 	if (unlikely(dicp->di_nextents + dicp->di_anextents > dicp->di_nblocks)){
 		XFS_CORRUPTION_ERROR("xlog_recover_inode_pass2(5)",
 				     XFS_ERRLEVEL_LOW, mp, dicp);
-		xfs_buf_relse(bp);
 		xfs_alert(mp,
 	"%s: Bad inode log record, rec ptr 0x%p, dino ptr 0x%p, "
 	"dino bp 0x%p, ino %Ld, total extents = %d, nblocks = %Ld",
@@ -2794,29 +2866,27 @@ xlog_recover_inode_pass2(
 			dicp->di_nextents + dicp->di_anextents,
 			dicp->di_nblocks);
 		error = EFSCORRUPTED;
-		goto error;
+		goto out_release;
 	}
 	if (unlikely(dicp->di_forkoff > mp->m_sb.sb_inodesize)) {
 		XFS_CORRUPTION_ERROR("xlog_recover_inode_pass2(6)",
 				     XFS_ERRLEVEL_LOW, mp, dicp);
-		xfs_buf_relse(bp);
 		xfs_alert(mp,
 	"%s: Bad inode log record, rec ptr 0x%p, dino ptr 0x%p, "
 	"dino bp 0x%p, ino %Ld, forkoff 0x%x", __func__,
 			item, dip, bp, in_f->ilf_ino, dicp->di_forkoff);
 		error = EFSCORRUPTED;
-		goto error;
+		goto out_release;
 	}
 	isize = xfs_icdinode_size(dicp->di_version);
 	if (unlikely(item->ri_buf[1].i_len > isize)) {
 		XFS_CORRUPTION_ERROR("xlog_recover_inode_pass2(7)",
 				     XFS_ERRLEVEL_LOW, mp, dicp);
-		xfs_buf_relse(bp);
 		xfs_alert(mp,
 			"%s: Bad inode log record length %d, rec ptr 0x%p",
 			__func__, item->ri_buf[1].i_len, item);
 		error = EFSCORRUPTED;
-		goto error;
+		goto out_release;
 	}
 
 	/* The core is in in-core format */
@@ -2842,7 +2912,7 @@ xlog_recover_inode_pass2(
 	}
 
 	if (in_f->ilf_size == 2)
-		goto write_inode_buffer;
+		goto out_owner_change;
 	len = item->ri_buf[2].i_len;
 	src = item->ri_buf[2].i_addr;
 	ASSERT(in_f->ilf_size <= 4);
@@ -2903,13 +2973,15 @@ xlog_recover_inode_pass2(
 		default:
 			xfs_warn(log->l_mp, "%s: Invalid flag", __func__);
 			ASSERT(0);
-			xfs_buf_relse(bp);
 			error = EIO;
-			goto error;
+			goto out_release;
 		}
 	}
 
-write_inode_buffer:
+out_owner_change:
+	if (in_f->ilf_fields & (XFS_ILOG_DOWNER|XFS_ILOG_AOWNER))
+		error = xfs_recover_inode_owner_change(mp, dip, in_f,
+						       buffer_list);
 	/* re-generate the checksum. */
 	xfs_dinode_calc_crc(log->l_mp, dip);
 
@@ -2923,6 +2995,9 @@ error:
 	if (need_free)
 		kmem_free(in_f);
 	return XFS_ERROR(error);
+
+	xfs_buf_relse(bp);
+	goto error;
 }
 
 /*
-- 
1.8.3.2

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] xfs: defrag support for v5 filesystems
  2013-08-30  0:23 [PATCH 0/2] xfs: defrag support for v5 filesystems Dave Chinner
  2013-08-30  0:23 ` [PATCH 1/2] xfs: swap extents operations for CRC filesystems Dave Chinner
  2013-08-30  0:23 ` [PATCH 2/2] xfs: recovery of " Dave Chinner
@ 2013-09-03 19:12 ` Ben Myers
  2013-09-03 22:45   ` Dave Chinner
  2013-09-10 17:51 ` Ben Myers
  3 siblings, 1 reply; 12+ messages in thread
From: Ben Myers @ 2013-09-03 19:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hey Dave,

On Fri, Aug 30, 2013 at 10:23:43AM +1000, Dave Chinner wrote:
> Hi folks,
> 
> The following 2 patches implement the BMBT owner change transaction
> that is necessary to enable the XFS_IOC_SWAPEXT ioctl to operate on
> v5 filesystems correctly. The first patch implements the
> transactional runtime change, and the second patch implements the
> recovery of that change.
> 
> Both the run time and recovery code use the same mechanism for
> changing the owner field in all the blocks in the BMBT on an inode,
> and even though XFS_IOC_SWAPEXT only swaps the data fork, the code
> has been written to be fork neutral so if we even need to swap
> attribute forks it should just work for that, too.
> 
> Further, because the BMBT code uses the generic btree
> infrastructure, the btree modification is done as a generic function
> as well and so should work for all types of btrees supported by the
> generic code. Hence if the need arises we can easily change the
> owner of any btree that uses the generic code.
> 
> The testing carried out is documented in the description of the
> second patch.
> 
> AFAIA, this is the only remaining feature that the kernel v5
> filesystem implementation didn't support. Hence, with this patchset,
> there are no more feature checkboxes that need to be ticked that
> would prevent us from removing the experimental tag from it. Testing
> is the only remaining gate to removing the tag from the kernel
> code...

I believe there is still the discussion surrounding being able to use
the self describing metadata without enabling crcs that needs to be
resolved before removing the experimental tag.  Some customers will want
to use features such as t10-dif and won't want to calculate two separate
crcs.  Normally that discussion probably needn't gate the removal of the
experimental tag, but unfortunately there will be some compatability
issues surrounding an additional feature bit and modification of mkfs
that we should probably iron out first.

Other than that discussion, I think we're about on the same page...

Regards,
	Ben

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] xfs: defrag support for v5 filesystems
  2013-09-03 19:12 ` [PATCH 0/2] xfs: defrag support for v5 filesystems Ben Myers
@ 2013-09-03 22:45   ` Dave Chinner
  2013-09-05 19:34     ` Ben Myers
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2013-09-03 22:45 UTC (permalink / raw)
  To: Ben Myers; +Cc: xfs

On Tue, Sep 03, 2013 at 02:12:01PM -0500, Ben Myers wrote:
> Hey Dave,
> 
> On Fri, Aug 30, 2013 at 10:23:43AM +1000, Dave Chinner wrote:
> > Hi folks,
> > 
> > The following 2 patches implement the BMBT owner change transaction
> > that is necessary to enable the XFS_IOC_SWAPEXT ioctl to operate on
> > v5 filesystems correctly. The first patch implements the
> > transactional runtime change, and the second patch implements the
> > recovery of that change.
> > 
> > Both the run time and recovery code use the same mechanism for
> > changing the owner field in all the blocks in the BMBT on an inode,
> > and even though XFS_IOC_SWAPEXT only swaps the data fork, the code
> > has been written to be fork neutral so if we even need to swap
> > attribute forks it should just work for that, too.
> > 
> > Further, because the BMBT code uses the generic btree
> > infrastructure, the btree modification is done as a generic function
> > as well and so should work for all types of btrees supported by the
> > generic code. Hence if the need arises we can easily change the
> > owner of any btree that uses the generic code.
> > 
> > The testing carried out is documented in the description of the
> > second patch.
> > 
> > AFAIA, this is the only remaining feature that the kernel v5
> > filesystem implementation didn't support. Hence, with this patchset,
> > there are no more feature checkboxes that need to be ticked that
> > would prevent us from removing the experimental tag from it. Testing
> > is the only remaining gate to removing the tag from the kernel
> > code...
> 
> I believe there is still the discussion surrounding being able to use
> the self describing metadata without enabling crcs that needs to be
> resolved before removing the experimental tag.

There is no discussion to be had here - CRCs are not optional on v5
filesystems, nor is there any reason to make them optional. Please
stop bring this up over and over again - you're just wasting my time
by making me have to respond with the same answers over and over
again.

If people don't want CRCs, then we've still got a perfectly good v4
filesystem format that they can use.

> Some customers will want
> to use features such as t10-dif and won't want to calculate two separate
> crcs. 

This is a straw man argument.

T10-dif is a completely different layer of protection that is only
useful for filesystem metadata if the CRC we calculate for the
metadata is written into the T10-DIF CRC fields. This is the only
way for us to get full end-to-end protection for metadata from
T10-dif. i.e. we have to supply the CRCs ourselves before we issue
the write IO, and verify it ourselves after a read IO.

Guess what we do right now with CRC support?

That's right: the existing CRC infrastructure is ready to support
integrated, end-to-end T10 CRCs for metadata in the filesystem. All
that is missing is the block layer interfaces and a few changes to
the CRC code to do iterative per-sector CRCs rather than
per-filesystem object CRCs. Surprise you, it may, but I've actually
considered how to implement T10-dif support as part of the overall
metadata CRC infrastructure architecture...

Given that with T10-dif we still need calculate and verify the
CRCs ourselves, why would we not also store it in the filesystem
metadata at the same time? That would mean that tools like
xfs_repair and xfs_db can also verify the metadata as being correct
without needing to explicitly support T10-dif. Of course, if you
want them to be able to repair or modify a filesystem with t10-dif
enabled, we need feature bits and explicit userspace support, too.

Hence, before making strawman arguments about how filesystem
metadata CRCs will need to be turned off for t10-dif support,
perhaps it would be better to first consider a design and prototype
support for end-to-end T10-dif CRCs for filesystem metadata and
share that with us?

As I've said before, we do not make on disk format changes for
strawmen or "potential" features that have not had design documents
or code published - we make the changes when code arrives to
implement the feature. So, until you have code for a feature that
*fundamentally requires* CRCs to be disabled to function correctly,
then the is not point even starting a discussion about making CRCs
optional on v5 superblocks.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] xfs: defrag support for v5 filesystems
  2013-09-03 22:45   ` Dave Chinner
@ 2013-09-05 19:34     ` Ben Myers
  2013-09-05 19:57       ` Eric Sandeen
  0 siblings, 1 reply; 12+ messages in thread
From: Ben Myers @ 2013-09-05 19:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Dave,

On Wed, Sep 04, 2013 at 08:45:42AM +1000, Dave Chinner wrote:
> On Tue, Sep 03, 2013 at 02:12:01PM -0500, Ben Myers wrote:
> > Hey Dave,
> > 
> > On Fri, Aug 30, 2013 at 10:23:43AM +1000, Dave Chinner wrote:
> > > Hi folks,
> > > 
> > > The following 2 patches implement the BMBT owner change transaction
> > > that is necessary to enable the XFS_IOC_SWAPEXT ioctl to operate on
> > > v5 filesystems correctly. The first patch implements the
> > > transactional runtime change, and the second patch implements the
> > > recovery of that change.
> > > 
> > > Both the run time and recovery code use the same mechanism for
> > > changing the owner field in all the blocks in the BMBT on an inode,
> > > and even though XFS_IOC_SWAPEXT only swaps the data fork, the code
> > > has been written to be fork neutral so if we even need to swap
> > > attribute forks it should just work for that, too.
> > > 
> > > Further, because the BMBT code uses the generic btree
> > > infrastructure, the btree modification is done as a generic function
> > > as well and so should work for all types of btrees supported by the
> > > generic code. Hence if the need arises we can easily change the
> > > owner of any btree that uses the generic code.
> > > 
> > > The testing carried out is documented in the description of the
> > > second patch.
> > > 
> > > AFAIA, this is the only remaining feature that the kernel v5
> > > filesystem implementation didn't support. Hence, with this patchset,
> > > there are no more feature checkboxes that need to be ticked that
> > > would prevent us from removing the experimental tag from it. Testing
> > > is the only remaining gate to removing the tag from the kernel
> > > code...
> > 
> > I believe there is still the discussion surrounding being able to use
> > the self describing metadata without enabling crcs that needs to be
> > resolved before removing the experimental tag.
> 
> There is no discussion to be had here - CRCs are not optional on v5
> filesystems, nor is there any reason to make them optional. Please
> stop bring this up over and over again - you're just wasting my time
> by making me have to respond with the same answers over and over
> again.

There is a discussion to be had here, and I'm sorry that you feel I'm wasting
your time.  I do feel that we need to get this sorted before removing the
experimental tag due to the possibility of needing to change interfaces to make
it work.

> If people don't want CRCs, then we've still got a perfectly good v4
> filesystem format that they can use.

People can still use v4 filesystem format, but the self describing metadata
includes checks that have value even without the crc.

> > Some customers will want
> > to use features such as t10-dif and won't want to calculate two separate
> > crcs. 
> 
> This is a straw man argument.

Read:  http://en.wikipedia.org/wiki/Straw_man

AFAICT at no time in my email did I misrepresent or even attempt to represent
your position on this issue.  Seems to me that I represented only my position
on the issue, and what I think customers will want.  Maybe you can find a
different fallacy in there but I don't think it's a straw man.

> T10-dif is a completely different layer of protection that is only
> useful for filesystem metadata if the CRC we calculate for the
> metadata is written into the T10-DIF CRC fields. This is the only
> way for us to get full end-to-end protection for metadata from
> T10-dif. i.e. we have to supply the CRCs ourselves before we issue
> the write IO, and verify it ourselves after a read IO.

I think of T10-dif as a superset of what you've implemented already, as opposed
to a completely different layer of protection.  Keeping a crc within the
filesystem metadata is useful, and I'm pretty sure we all agree on that.
However, today that crc cannot be checked in hardware as it goes across the
wire to and from the disk, because 1) the disk doesn't know about the length of
the structure protected by the crc, or the location of the crc in the
structure, and 2) the disk doesn't support the crc32c, it supports only a 16
bit t10-dif crc per sector.

> Guess what we do right now with CRC support?
> 
> That's right: the existing CRC infrastructure is ready to support
> integrated, end-to-end T10 CRCs for metadata in the filesystem. All
> that is missing is the block layer interfaces and a few changes to
> the CRC code to do iterative per-sector CRCs rather than
> per-filesystem object CRCs.

Yes!  This is exactly what I would like to discuss.

> Surprise you, it may, but I've actually
> considered how to implement T10-dif support as part of the overall
> metadata CRC infrastructure architecture...

Great, sounds we are thinking along the same lines.
 
> Given that with T10-dif we still need calculate and verify the
> CRCs ourselves, why would we not also store it in the filesystem
> metadata at the same time?

The reason we would not also want to store the crc32c in the filesystem
metadata itself is that the cost of calculating crc32c in addition to t10-dif
is redundant if you are already using t10-dif.  The disk cannot use crc32c, so
I'm trying to think like a customer, and it seems to me that a customer will
probably want to be able to turn off the crc32c you have implemented if he/she
is already using an alternate form of protection.  It's belt and suspenders.

> That would mean that tools like
> xfs_repair and xfs_db can also verify the metadata as being correct
> without needing to explicitly support T10-dif.

Right.  Userspace wouldn't necessarily have to explicitly support t10-dif,
given that today the block layer will do it for them.

> Of course, if you
> want them to be able to repair or modify a filesystem with t10-dif
> enabled, we need feature bits and explicit userspace support, too.

I think whether userspace needs explicit support depends upon the behavior of
the block layer when an error is found.  IIRC today it just retries several
times before giving up and you get EIO.  Not sure how well that information can
be used by userspace at this point.

> Hence, before making strawman arguments about how filesystem
> metadata CRCs will need to be turned off for t10-dif support,

AFAICT my assertion that your crc32c should be optional was not a straw man.
It comes as a consequence of it not being usable for t10-dif, and therefore
redundant.

> perhaps it would be better to first consider a design and prototype
> support for end-to-end T10-dif CRCs for filesystem metadata and
> share that with us?

A great idea.  I'll shop it around.
 
> As I've said before, we do not make on disk format changes for
> strawmen or "potential" features that have not had design documents
> or code published - we make the changes when code arrives to
> implement the feature.

Nah, we think further ahead than that.  When I need to move my rook across the
board I don't first put a knight in it's path.

> So, until you have code for a feature that
> *fundamentally requires* CRCs to be disabled to function correctly,
> then the is not point even starting a discussion about making CRCs
> optional on v5 superblocks.

'Fundamentally requires' is not the bar to pass for starting a discussion on
this list.  Customers will likely want to be able to disable crc32c in metadata
when they are using an alternate form of data protection.  That is enough to go
on for starting a discussion.  Insofar as supporting such a feature might
require interface changes to the existing work, I think we should discuss what
they might look like before finalizing the mkfs.xfs interfaces in a way that we
might find to be incompatible later.  If after some discussion we find that
this can be done without interface changes that will gate removal of the
experimental tag... then it won't gate.

Regards,
Ben

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] xfs: defrag support for v5 filesystems
  2013-09-05 19:34     ` Ben Myers
@ 2013-09-05 19:57       ` Eric Sandeen
  2013-09-05 20:03         ` Eric Sandeen
  2013-09-06  3:34         ` Dave Chinner
  0 siblings, 2 replies; 12+ messages in thread
From: Eric Sandeen @ 2013-09-05 19:57 UTC (permalink / raw)
  To: Ben Myers; +Cc: xfs

On 9/5/13 2:34 PM, Ben Myers wrote:
> Dave,
> 
> On Wed, Sep 04, 2013 at 08:45:42AM +1000, Dave Chinner wrote:

...

>> If people don't want CRCs, then we've still got a perfectly good v4
>> filesystem format that they can use.
> 
> People can still use v4 filesystem format, but the self describing metadata
> includes checks that have value even without the crc.

Perhaps, but unless there is *value* in turning them off, there is no reason
to do so.  See previous arguments about test matrix etc.

Right now you suggest a different mechanism, but it doesn't actually
exist at this point - at least not for end-to-end metadata integrity.

crcs between hba & storage is a very different thing, and really not
a substitute for xfs's object crcs.  More below

...

>> Guess what we do right now with CRC support?
>>
>> That's right: the existing CRC infrastructure is ready to support
>> integrated, end-to-end T10 CRCs for metadata in the filesystem. All
>> that is missing is the block layer interfaces and a few changes to
>> the CRC code to do iterative per-sector CRCs rather than
>> per-filesystem object CRCs.
> 
> Yes!  This is exactly what I would like to discuss.

...

So if and when that is available, we could discuss whether or not
there is any reason to disable crcs, right?  Until then we're
handwaving with no good rationale.

> 'Fundamentally requires' is not the bar to pass for starting a discussion on
> this list.  Customers will likely want to be able to disable crc32c in metadata
> when they are using an alternate form of data protection.

Which does not exist, not in any similar form.

>  That is enough to go
> on for starting a discussion.  Insofar as supporting such a feature might
> require interface changes to the existing work, I think we should discuss what
> they might look like before finalizing the mkfs.xfs interfaces in a way that we
> might find to be incompatible later.  If after some discussion we find that
> this can be done without interface changes that will gate removal of the
> experimental tag... then it won't gate.

It is incumbent on SGI to explain why they want to make it optional.  
There is no alternative design; if and when there is one, that's the time
to make another configuration knob.

I see 2 possible reasons you would want it to be configurable.  The first
seems least stated but most likely:

1) You have performance concerns.

  - you need to show us the numbers if that's the case so we can discuss facts

2) You think T10dif will make it unnecessary

  - t10dif in hardware gives you EIO, not corruption detection & recovery
  - end-to-end t10dif with xfs at the "app" layer might be an option, but
    - nobody has written that, and
    - the only reason to turn off the object crcs at that point is perf, and
       - again, we'd need to start with performance numbers.

There's been no compelling argument about the need to configure it off
at this stage, so discussion of "gating" is really getting off on the wrong
foot here, IMHO.

If you spot a design decision which would make configurability impossible
later, by all means point it out, but I cannot see what purpose it would
serve to switch it off today, when there is no replacement.  The only good
reason I could see is if it affects performance; if you find that it does,
publish the numbers...

-Eric

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] xfs: defrag support for v5 filesystems
  2013-09-05 19:57       ` Eric Sandeen
@ 2013-09-05 20:03         ` Eric Sandeen
  2013-09-06  3:34         ` Dave Chinner
  1 sibling, 0 replies; 12+ messages in thread
From: Eric Sandeen @ 2013-09-05 20:03 UTC (permalink / raw)
  To: Ben Myers; +Cc: xfs

On 9/5/13 2:57 PM, Eric Sandeen wrote:
> On 9/5/13 2:34 PM, Ben Myers wrote:
>> Dave,
>>
>> On Wed, Sep 04, 2013 at 08:45:42AM +1000, Dave Chinner wrote:
> 
> ...
> 
>>> If people don't want CRCs, then we've still got a perfectly good v4
>>> filesystem format that they can use.
>>
>> People can still use v4 filesystem format, but the self describing metadata
>> includes checks that have value even without the crc.
> 
> Perhaps, but unless there is *value* in turning them off, there is no reason
> to do so.  See previous arguments about test matrix etc.
> 
> Right now you suggest a different mechanism, but it doesn't actually
> exist at this point - at least not for end-to-end metadata integrity.
> 
> crcs between hba & storage is a very different thing, and really not
> a substitute for xfs's object crcs.  More below
> 
> ...
> 
>>> Guess what we do right now with CRC support?
>>>
>>> That's right: the existing CRC infrastructure is ready to support
>>> integrated, end-to-end T10 CRCs for metadata in the filesystem. All
>>> that is missing is the block layer interfaces and a few changes to
>>> the CRC code to do iterative per-sector CRCs rather than
>>> per-filesystem object CRCs.
>>
>> Yes!  This is exactly what I would like to discuss.
> 
> ...
> 
> So if and when that is available, we could discuss whether or not
> there is any reason to disable crcs, right?  Until then we're
> handwaving with no good rationale.

In fact, I think we can distill this even further.  Even *with*
t10dif at the HBA level, the only reason I can see to turn off
per-object crcs is performance.

To make that argument, you should publish the performance numbers.

-Eric

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] xfs: defrag support for v5 filesystems
  2013-09-05 19:57       ` Eric Sandeen
  2013-09-05 20:03         ` Eric Sandeen
@ 2013-09-06  3:34         ` Dave Chinner
  1 sibling, 0 replies; 12+ messages in thread
From: Dave Chinner @ 2013-09-06  3:34 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Ben Myers, xfs

On Thu, Sep 05, 2013 at 02:57:11PM -0500, Eric Sandeen wrote:
> On 9/5/13 2:34 PM, Ben Myers wrote:
> >  That is enough to go
> > on for starting a discussion.  Insofar as supporting such a feature might
> > require interface changes to the existing work, I think we should discuss what
> > they might look like before finalizing the mkfs.xfs interfaces in a way that we
> > might find to be incompatible later.  If after some discussion we find that
> > this can be done without interface changes that will gate removal of the
> > experimental tag... then it won't gate.
> 
> It is incumbent on SGI to explain why they want to make it optional.  
> There is no alternative design; if and when there is one, that's the time
> to make another configuration knob.
> 
> I see 2 possible reasons you would want it to be configurable.  The first
> seems least stated but most likely:
> 
> 1) You have performance concerns.
> 
>   - you need to show us the numbers if that's the case so we can discuss facts
> 
> 2) You think T10dif will make it unnecessary
> 
>   - t10dif in hardware gives you EIO, not corruption detection & recovery
>   - end-to-end t10dif with xfs at the "app" layer might be an option, but
>     - nobody has written that, and
>     - the only reason to turn off the object crcs at that point is perf, and
>        - again, we'd need to start with performance numbers.

Actually, t10-dif is not equivalent  to v5 filesystem object CRCs at
all. The filesystem object can span multiple sectors, and while
t10-dif can only guarantee individual sector contents are correct,
it cannot guarantee that all the sectors in a given filesystem
object are up to date.

Why is this important? Because XFS metadata can span multiple
filesystem blocks and hence be discontiguous and require multiple
IOs to write to disk. Hence we can have regions of the object that
have different ages (i.e. partially written) and t10-diff based CRCs
*cannot detect this*.

So, even with end-to-end t10-dif CRCs, we still need filesystem
object level CRCs to guarantee that the objects are completely
internally consistent....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] xfs: swap extents operations for CRC filesystems
  2013-08-30  0:23 ` [PATCH 1/2] xfs: swap extents operations for CRC filesystems Dave Chinner
@ 2013-09-09 20:32   ` Mark Tinguely
  0 siblings, 0 replies; 12+ messages in thread
From: Mark Tinguely @ 2013-09-09 20:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 08/29/13 19:23, Dave Chinner wrote:
> From: Dave Chinner<dchinner@redhat.com>
>
> For CRC enabled filesystems, we can't just swap inode forks from one
> inode to another when defragmenting a file - the blocks in the inode
> fork bmap btree contain pointers back to the owner inode. Hence if
> we are to swap the inode forks we have to atomically modify every
> block in the btree during the transaction.
>
> We are doing an entire fork swap here, so we could create a new
> transaction item type that indicates we are changing the owner of a
> certain structure from one value to another. If we combine this with
> ordered buffer logging to modify all the buffers in the tree, then
> we can change the buffers in the tree without needing log space for
> the operation. However, this then requires log recovery to perform
> the modification of the owner information of the objects/structures
> in question.
>
> This does introduce some interesting ordering details into recovery:
> we have to make sure that the owner change replay occurs after the
> change that moves the objects is made, not before. Hence we can't
> use a separate log item for this as we have no guarantee of strict
> ordering between multiple items in the log due to the relogging
> action of asynchronous transaction commits. Hence there is no
> "generic" method we can use for changing the ownership of arbitrary
> metadata structures.
>
> For inode forks, however, there is a simple method of communicating
> that the fork contents need the owner rewritten - we can pass a
> inode log format flag for the fork for the transaction that does a
> fork swap. This flag will then follow the inode fork through
> relogging actions so when the swap actually gets replayed the
> ownership can be changed immediately by log recovery.  So that gives
> us a simple method of "whole fork" exchange between two inodes.
>
> This is relatively simple to implement, so it makes sense to do this
> as an initial implementation to support xfs_fsr on CRC enabled
> filesytems in the same manner as we do on existing filesystems. This
> commit introduces the swapext driven functionality, the recovery
> functionality will be in a separate patch.
>
> Signed-off-by: Dave Chinner<dchinner@redhat.com>
> ---

Makes sense. Change owner in btree entry, Don't log the modified btree 
buffers but recreate in recovery when necessary.

Reviewed-by: Mark Tinguely <tinguely@sgi.com>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] xfs: recovery of swap extents operations for CRC filesystems
  2013-08-30  0:23 ` [PATCH 2/2] xfs: recovery of " Dave Chinner
@ 2013-09-09 20:37   ` Mark Tinguely
  0 siblings, 0 replies; 12+ messages in thread
From: Mark Tinguely @ 2013-09-09 20:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 08/29/13 19:23, Dave Chinner wrote:
> From: Dave Chinner<dchinner@redhat.com>
>
> This is the recovery side of the btree block owner change operation
> performed by swapext on CRC enabled filesystems. We detect that an
> owner change is needed by the flag that has been placed on the inode
> log format flag field. Because the inode recovery is being replayed
> after the buffers that make up the BMBT in the given checkpoint, we
> can walk all the buffers and directly modify them when we see the
> flag set on an inode.
>
> Because the inode can be relogged and hence present in multiple
> chekpoints with the "change owner" flag set, we could do multiple
> passes across the inode to do this change. While this isn't optimal,
> we can't directly ignore the flag as there may be multiple
> independent swap extent operations being replayed on the same inode
> in different checkpoints so we can't ignore them.
>
> Further, because the owner change operation uses ordered buffers, we
> might have buffers that are newer on disk than the current
> checkpoint and so already have the owner changed in them. Hence we
> cannot just peek at a buffer in the tree and check that it has the
> correct owner and assume that the change was completed.
>
> So, for the moment just brute force the owner change every time we
> see an inode with the flag set. Note that we have to be careful here
> because the owner of the buffers may point to either the old owner
> or the new owner. Currently the verifier can't verify the owner
> directly, so there is no failure case here right now. If we verify
> the owner exactly in future, then we'll have to take this into
> account.
>

...


Recovery of swap extent makes sense. Variable name, XFS_ILOG_DOWNER, 
gave me a chuckle but it is consistent with the other names.

Really could use some recovery test for this. Did inode create recovery 
get a test?

Reviewed-by: Mark Tinguely <tinguely@sgi.com>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] xfs: defrag support for v5 filesystems
  2013-08-30  0:23 [PATCH 0/2] xfs: defrag support for v5 filesystems Dave Chinner
                   ` (2 preceding siblings ...)
  2013-09-03 19:12 ` [PATCH 0/2] xfs: defrag support for v5 filesystems Ben Myers
@ 2013-09-10 17:51 ` Ben Myers
  3 siblings, 0 replies; 12+ messages in thread
From: Ben Myers @ 2013-09-10 17:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Aug 30, 2013 at 10:23:43AM +1000, Dave Chinner wrote:
> Hi folks,
> 
> The following 2 patches implement the BMBT owner change transaction
> that is necessary to enable the XFS_IOC_SWAPEXT ioctl to operate on
> v5 filesystems correctly. The first patch implements the
> transactional runtime change, and the second patch implements the
> recovery of that change.
> 
> Both the run time and recovery code use the same mechanism for
> changing the owner field in all the blocks in the BMBT on an inode,
> and even though XFS_IOC_SWAPEXT only swaps the data fork, the code
> has been written to be fork neutral so if we even need to swap
> attribute forks it should just work for that, too.
> 
> Further, because the BMBT code uses the generic btree
> infrastructure, the btree modification is done as a generic function
> as well and so should work for all types of btrees supported by the
> generic code. Hence if the need arises we can easily change the
> owner of any btree that uses the generic code.
> 
> The testing carried out is documented in the description of the
> second patch.
> 
> AFAIA, this is the only remaining feature that the kernel v5
> filesystem implementation didn't support. Hence, with this patchset,
> there are no more feature checkboxes that need to be ticked that
> would prevent us from removing the experimental tag from it. Testing
> is the only remaining gate to removing the tag from the kernel
> code...

Applied these 2.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2013-09-10 17:51 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-30  0:23 [PATCH 0/2] xfs: defrag support for v5 filesystems Dave Chinner
2013-08-30  0:23 ` [PATCH 1/2] xfs: swap extents operations for CRC filesystems Dave Chinner
2013-09-09 20:32   ` Mark Tinguely
2013-08-30  0:23 ` [PATCH 2/2] xfs: recovery of " Dave Chinner
2013-09-09 20:37   ` Mark Tinguely
2013-09-03 19:12 ` [PATCH 0/2] xfs: defrag support for v5 filesystems Ben Myers
2013-09-03 22:45   ` Dave Chinner
2013-09-05 19:34     ` Ben Myers
2013-09-05 19:57       ` Eric Sandeen
2013-09-05 20:03         ` Eric Sandeen
2013-09-06  3:34         ` Dave Chinner
2013-09-10 17:51 ` Ben Myers

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.